Mat Posted September 9, 2011 Posted September 9, 2011 (edited) For those of you (and I know from previous discussions that there are people out there) making your own compilers and interpreters and choosing not to use tools like yacc. Pratt's parser is a predictive parser, but with a number of improvements that make it so easy to implement you wonder why in the 38 since the original paper was published it hasn't been used in more than a small handful of projects. Look it up if you ever consider hand writing a parser. I managed to write a calculator from first principles in 30 mins. 20 minutes of that was writing the lexer. Mat Edit: Just to give you an idea of how unused this is... Google "Pratt's Parser" and you'll get this thread as the first result. Edited September 12, 2011 by Mat AutoIt Project Listing
jvanegmond Posted September 9, 2011 Posted September 9, 2011 Nice thread, Mat. Any docs/links too? Would be great. github.com/jvanegmond
twitchyliquid64 Posted September 9, 2011 Posted September 9, 2011 hmmm. maybe, considering I'm doing a complete re-write of my project in C++. ongoing projects:-firestorm: Largescale P2P Social NetworkCompleted Autoit Programs/Scripts: Variable Pickler | Networked Streaming Audio (in pure autoIT) | firenet p2p web messenger | Proxy Checker | Dynamic Execute() Code Generator | P2P UDF | Graph Theory Proof of Concept - Breadth First search
Mat Posted September 9, 2011 Author Posted September 9, 2011 Nice thread, Mat. Any docs/links too? Would be great.Well... Simple: http://effbot.org/zone/simple-top-down-parsing.htmMore in depth: http://journal.stuffwithstuff.com/2011/03/19/pratt-parsers-expression-parsing-made-easy/Also in depth: http://javascript.crockford.com/tdop/tdop.html I'll also tidy up the source code for the calculator I made and post that. All I need to add is a tertiary operator to prove that this parser can do anything. AutoIt Project Listing
jvanegmond Posted September 9, 2011 Posted September 9, 2011 You know what would be nice to post on these forums. An as short as possible little AutoIt code that can parse things like "5 + 3 * 2". github.com/jvanegmond
Mat Posted September 9, 2011 Author Posted September 9, 2011 I was thinking of that, unfortunately I got started in C as I was going to be using asm to implement the final thing and I didn't want one of those algorithms that looked nice in python when you don't worry about data types. It would be easy to do. Especially given that autoit makes it almost as easy as python. AutoIt Project Listing
Mat Posted September 11, 2011 Author Posted September 11, 2011 (edited) Ok, I tidied it up a bit to show you how nice it is to build it incrementally. Initially, for binary operators and no lexer, it was an 80 line C program, of which a lot were utility functions to do with the functions rather than the parser. To make the code easier to understand it only works with integers. Usually I'd seperate this out into different files, but for now I'm posting them as standalone c programs... They are all tested compiling on gcc. For those involving sqrt you'll need to link the maths library in (gcc -lm file.c). expandcollapse popup#include <stdio.h> #define TOK_END 0 #define TOK_INT 1 #define TOK_OP 2 #define OP_PLUS 0 #define OP_MUL 1 // 7 + 2 * 3 int pos = 0; struct TOKEN { int Type; int Val; } test[] = { {TOK_INT, 7}, {TOK_OP, OP_PLUS}, {TOK_INT, 2}, {TOK_OP, OP_MUL}, {TOK_INT, 3} }, tok; int Op_Add(int left, int right) { return left + right; } int Op_Mul(int left, int right) { return left * right; } struct OPERATOR { int Precedence; void* func; } Ops[] = { {30,&Op_Add}, {40,&Op_Mul} }; int lbp(struct TOKEN token) { if (token.Type == TOK_OP) return Ops[token.Val].Precedence; return 0; } int led(struct TOKEN token, int left) { int right = expression(lbp(token)); int (*fun)(int, int) = Ops[token.Val].func; return fun(left, right); } int nud(struct TOKEN token) { if (token.Type == TOK_INT) return token.Val; return -1; } int expression(int rbp) { struct TOKEN token = tok; tok = test[++pos]; int left = nud(token); while (lbp(tok) > rbp) { token = tok; tok = test[++pos]; left = led(token, left); } return left; } int main(int argc, char **argv) { tok = test[0]; printf("%i\n", expression(0)); return 0; }(Simple.c Simple.exe) It then got more complex gradually:Adding prefix operators (+ and - hard coded due to their dual use, sqrt used as an example): Prefix.c Prefix.exeAdded postfix operators (factorial, though easy to add more): Postfix.c Postfix.exeAdded brackets, both on their own and used for multiplication: Brackets.c Brackets.exeAdded the tertiary operator (stmt ? if_true : if_false): Tertiary-EvalAll.c Tertiary-EvalAll.exeFixed the tertiary operator so it only evaluates one of the statements: Tertiary.c Tertiary.exeModified the program so that it outputs a parse tree instead of evaluating: AST.c AST.exe (the parse tree is printed as simplified asm, with no specific code generation for operators)Next up is error checking. Again it's very easy to add, as you just need to add default statements to the switches and else statements for ifs for unexpeced tokens. After that I'll do the same for AutoIt. The lexer is a mess at the moment, so I'll tidy that up sometime. If anyones interested about what I did to go from one step to the next then just ask MatEdit: Added exes. Edited September 11, 2011 by Mat AutoIt Project Listing
Mat Posted September 11, 2011 Author Posted September 11, 2011 And for those interested: Simple.au3 AutoIt Project Listing
jvanegmond Posted September 12, 2011 Posted September 12, 2011 (edited) And for those interested: Simple.au3 Cool. btw, Why did you make the distinction between TOK_OP with params and TOK_INT with params instead of just TOK_INT and TOK_ADD, TOK_MUL? Only because python example does it like that? expandcollapse popup$s = "7 + 2 * 3" Global $test = lex($s) MsgBox(0, $s, expression(0)) Func lex($s) Local $out[100][2] $arr = StringSplit($s, '', 2) $n = 0 For $i = 0 To UBound($arr)-1 Switch $arr[$i] Case " " ContinueLoop Case "0" To "9" $out[$n][0] = $TOK_INT $out[$n][1] = lexNumber($arr, $i) $n += 1 Case "+" $out[$n][0] = $TOK_OP $out[$n][1] = $OP_ADD $n += 1 Case "*" $out[$n][0] = $TOK_OP $out[$n][1] = $OP_MUL $n += 1 EndSwitch Next $out[$n][0] = 0 $out[$n][1] = 0 Return $out EndFunc Func lexNumber($s, ByRef $i) $out = "" While $s[$i] >= "0" And $s[$i] <= "9" $out &= $s[$i] $i += 1 If $i = UBound($s) Then ExitLoop WEnd $i -= 1 Return $out EndFunc Edited September 12, 2011 by Manadar github.com/jvanegmond
Mat Posted September 12, 2011 Author Posted September 12, 2011 Thats a nice simple way to do it... I was thinking of using the operator array and adding strings to it rather than looking for the tokens in the switch statement... It would make it very easy to add operators.Also, I think that it would be possible to lose the global $test, probably by combining the nud,led and expression functions and putting it in there as a static. AutoIt Project Listing
jvanegmond Posted September 12, 2011 Posted September 12, 2011 Also, I think that it would be possible to lose the global $test, probably by combining the nud,led and expression functions and putting it in there as a static.Just give $test as the first parameter on everything. github.com/jvanegmond
Mat Posted September 12, 2011 Author Posted September 12, 2011 (edited) Well... Actually the correct way to do it is to make the lex function yeild tokens. My version of the parser uses $pos and $pos-1 because I was lazy and AutoIt doesn't do structs very easily. If you look at the c examples where I use a struct for tokens, it copies the token, then fills tok with the next one. (or it should do anyway, I might have been lazy there as well)expandcollapse popupGlobal $tok ; int calc(string) Func calc($expr) $tok = lexGetToken($expr) Return expression(0) EndFunc ; int nud(TOKEN) Func nud($token) ... EndFunc ; int led(TOKEN) Func led($token) ... EndFunc ; int expression(int) Func expression($rbp) Local $token = $tok ; Stores the last token $tok = lexGetToken() ; Fills the current one Local $left = nud($token) ; Rather than using $pos-1 ... return $left EndFunc ; TOKEN lexGetToken( [string] ) Func lexGetToken($in = "") Local static $pos = 0 Local static $str = $in Local $ret[2] ; Token of the form [TYPE, VALUE] ... Return $ret EndFuncThe problem is that it gets very messy passing 2 element arrays around all the time. And accessing them is not nearly as easy to read as tok.Type and tok.Val.Because nud and led never call themselves, only calling expression, it is possible to merge them into a single function, and replace Global $tok with Local Static $tok. expression calls lexGetToken and the user only has to call calc. Nice and clean.It's also possible to take out expression calling itself completely by using a stack, but that would be pointless in AutoIt. It does work well in asm though. Edited September 12, 2011 by Mat AutoIt Project Listing
Mat Posted September 12, 2011 Author Posted September 12, 2011 btw, Why did you make the distinction between TOK_OP with params and TOK_INT with params instead of just TOK_INT and TOK_ADD, TOK_MUL? Only because python example does it like that?Sorry, didn't see this.The reason is that it makes it considerably easier to add and manage operators (including dynamically: More on that later).You'll see in the C examples that there is an array of operators, with details like flags, precedence and then a pointer to the function to call. When I implemented the lexer, I didn't hard-program the operators in, instead added a pointer to the string containing the operator to that array. The lexer then searches through the operators and returns {TOK_OP,OP_????}. When I add an operator, I add a OP_* constant (optional btw), write the function for the operator and add the operator details to the array. Simple It won't get any easier than that. There will be one more operator flag (User) which would mean user defined operators are handled just as easily, with a slightly different pointer to the function.Additionally, most operators behave the same, and those that are different it is managed using their flags (Right associative, unary and postfix). This makes functions like lbp only a few lines long as it just needs to check that the token is of type TOK_OP and then read the precedence from the table. Obvious exceptions are those that have more than one meaning (+, -).The last reason is because of the language I am ultimately going to use this for. It allows definition of operators, and an operator is any string that passes ispunct and isn't a special character like parenthesis/brackets. AutoIt Project Listing
Recommended Posts
Create an account or sign in to comment
You need to be a member in order to leave a comment
Create an account
Sign up for a new account in our community. It's easy!
Register a new accountSign in
Already have an account? Sign in here.
Sign In Now