Pratt's parser

Mat · September 9, 2011

For those of you (and I know from previous discussions that there are people out there) making your own compilers and interpreters and choosing not to use tools like yacc.

Pratt's parser is a predictive parser, but with a number of improvements that make it so easy to implement you wonder why in the 38 since the original paper was published it hasn't been used in more than a small handful of projects.

Look it up if you ever consider hand writing a parser. I managed to write a calculator from first principles in 30 mins. 20 minutes of that was writing the lexer.

Mat

Edit: Just to give you an idea of how unused this is... Google "Pratt's Parser" and you'll get this thread as the first result.

Edited September 12, 2011 by Mat

jvanegmond · September 9, 2011

Nice thread, Mat.

Any docs/links too? Would be great.

twitchyliquid64 · September 9, 2011

hmmm. maybe, considering I'm doing a complete re-write of my project in C++.

Mat · September 9, 2011

Nice thread, Mat.

Any docs/links too? Would be great.

Well...

Simple: http://effbot.org/zone/simple-top-down-parsing.htm

More in depth: http://journal.stuffwithstuff.com/2011/03/19/pratt-parsers-expression-parsing-made-easy/

Also in depth: http://javascript.crockford.com/tdop/tdop.html

I'll also tidy up the source code for the calculator I made and post that. All I need to add is a tertiary operator to prove that this parser can do anything.

jvanegmond · September 9, 2011

You know what would be nice to post on these forums.

An as short as possible little AutoIt code that can parse things like "5 + 3 * 2".

Mat · September 9, 2011

I was thinking of that, unfortunately I got started in C as I was going to be using asm to implement the final thing and I didn't want one of those algorithms that looked nice in python when you don't worry about data types. It would be easy to do. Especially given that autoit makes it almost as easy as python.

Mat · September 11, 2011

Ok, I tidied it up a bit to show you how nice it is to build it incrementally. Initially, for binary operators and no lexer, it was an 80 line C program, of which a lot were utility functions to do with the functions rather than the parser. To make the code easier to understand it only works with integers.

Usually I'd seperate this out into different files, but for now I'm posting them as standalone c programs... They are all tested compiling on gcc. For those involving sqrt you'll need to link the maths library in (gcc -lm file.c).

#include <stdio.h>
 
#define TOK_END 0
#define TOK_INT 1
#define TOK_OP  2
 
#define OP_PLUS 0
#define OP_MUL  1
 
// 7 + 2 * 3
 
int pos = 0;
struct TOKEN
{
    int Type;
    int Val;
} test[] = {
    {TOK_INT, 7},
    {TOK_OP, OP_PLUS},
    {TOK_INT, 2},
    {TOK_OP, OP_MUL},
    {TOK_INT, 3}
}, tok;
 
int Op_Add(int left, int right)
{
    return left + right;
}
 
int Op_Mul(int left, int right)
{
    return left * right;
}
 
struct OPERATOR
{
    int Precedence;
    void* func;
} Ops[] = {
    {30,&Op_Add},
    {40,&Op_Mul}
};
 
int lbp(struct TOKEN token)
{
    if (token.Type == TOK_OP)
        return Ops[token.Val].Precedence;
    return 0;
}
 
int led(struct TOKEN token, int left)
{
    int right = expression(lbp(token));
 
    int (*fun)(int, int) = Ops[token.Val].func;
    return fun(left, right);
}
 
int nud(struct TOKEN token)
{
    if (token.Type == TOK_INT)
        return token.Val;
    return -1;
}
 
int expression(int rbp)
{
    struct TOKEN token = tok;
    tok = test[++pos];
    int left = nud(token);
 
    while (lbp(tok) > rbp)
    {
        token = tok;
        tok = test[++pos];
        left = led(token, left);
    }
 
    return left;
}
 
int main(int argc, char **argv)
{
    tok = test[0];
 
    printf("%i\n", expression(0));
    return 0;
}

(Simple.c Simple.exe)

It then got more complex gradually:

Adding prefix operators (+ and - hard coded due to their dual use, sqrt used as an example): Prefix.c Prefix.exe
Added postfix operators (factorial, though easy to add more): Postfix.c Postfix.exe
Added brackets, both on their own and used for multiplication: Brackets.c Brackets.exe
Added the tertiary operator (stmt ? if_true : if_false): Tertiary-EvalAll.c Tertiary-EvalAll.exe
Fixed the tertiary operator so it only evaluates one of the statements: Tertiary.c Tertiary.exe
Modified the program so that it outputs a parse tree instead of evaluating: AST.c AST.exe (the parse tree is printed as simplified asm, with no specific code generation for operators)

Next up is error checking. Again it's very easy to add, as you just need to add default statements to the switches and else statements for ifs for unexpeced tokens. After that I'll do the same for AutoIt.

The lexer is a mess at the moment, so I'll tidy that up sometime.

If anyones interested about what I did to go from one step to the next then just ask :graduated:

Mat

Edit: Added exes.

Edited September 11, 2011 by Mat

Mat · September 11, 2011

And for those interested: Simple.au3

jvanegmond · September 12, 2011

And for those interested: Simple.au3

Cool.

btw, Why did you make the distinction between TOK_OP with params and TOK_INT with params instead of just TOK_INT and TOK_ADD, TOK_MUL? Only because python example does it like that?

$s = "7 + 2 * 3"
Global $test = lex($s)

MsgBox(0, $s, expression(0))


Func lex($s)
    Local $out[100][2]
    $arr = StringSplit($s, '', 2)
    $n = 0
    For $i = 0 To UBound($arr)-1
        Switch $arr[$i]
            Case " "
                ContinueLoop
            Case "0" To "9"
                $out[$n][0] = $TOK_INT
                $out[$n][1] = lexNumber($arr, $i)
                $n += 1
            Case "+"
                $out[$n][0] = $TOK_OP
                $out[$n][1] = $OP_ADD
                $n += 1
            Case "*"
                $out[$n][0] = $TOK_OP
                $out[$n][1] = $OP_MUL
                $n += 1
        EndSwitch
    Next
    $out[$n][0] = 0
    $out[$n][1] = 0
    Return $out
EndFunc

Func lexNumber($s, ByRef $i)
    $out = ""
    While $s[$i] >= "0" And $s[$i] <= "9"
        $out &= $s[$i]
        $i += 1

        If $i = UBound($s) Then ExitLoop
    WEnd
    $i -= 1
    Return $out
EndFunc

Edited September 12, 2011 by Manadar

Mat · September 12, 2011

Thats a nice simple way to do it... I was thinking of using the operator array and adding strings to it rather than looking for the tokens in the switch statement... It would make it very easy to add operators.

Also, I think that it would be possible to lose the global $test, probably by combining the nud,led and expression functions and putting it in there as a static.

jvanegmond · September 12, 2011

Also, I think that it would be possible to lose the global $test, probably by combining the nud,led and expression functions and putting it in there as a static.

Just give $test as the first parameter on everything.

Mat · September 12, 2011

Well... Actually the correct way to do it is to make the lex function yeild tokens. My version of the parser uses $pos and $pos-1 because I was lazy and AutoIt doesn't do structs very easily. If you look at the c examples where I use a struct for tokens, it copies the token, then fills tok with the next one. (or it should do anyway, I might have been lazy there as well)

Global $tok

; int calc(string)
Func calc($expr)
    $tok = lexGetToken($expr)

    Return expression(0)
EndFunc

; int nud(TOKEN)
Func nud($token)
    ...
EndFunc

; int led(TOKEN)
Func led($token)
    ...
EndFunc

; int expression(int)
Func expression($rbp)
    Local $token = $tok  ; Stores the last token
    $tok = lexGetToken() ; Fills the current one

    Local $left = nud($token) ; Rather than using $pos-1

    ...

    return $left
EndFunc

; TOKEN lexGetToken( [string] )
Func lexGetToken($in = "")
    Local static $pos = 0
    Local static $str = $in

    Local $ret[2] ; Token of the form [TYPE, VALUE]

    ...

    Return $ret
EndFunc

The problem is that it gets very messy passing 2 element arrays around all the time. And accessing them is not nearly as easy to read as tok.Type and tok.Val.

Because nud and led never call themselves, only calling expression, it is possible to merge them into a single function, and replace Global $tok with Local Static $tok. expression calls lexGetToken and the user only has to call calc. Nice and clean.

It's also possible to take out expression calling itself completely by using a stack, but that would be pointless in AutoIt. It does work well in asm though.

Edited September 12, 2011 by Mat

Mat · September 12, 2011

btw, Why did you make the distinction between TOK_OP with params and TOK_INT with params instead of just TOK_INT and TOK_ADD, TOK_MUL? Only because python example does it like that?

Sorry, didn't see this.

The reason is that it makes it considerably easier to add and manage operators (including dynamically: More on that later).

You'll see in the C examples that there is an array of operators, with details like flags, precedence and then a pointer to the function to call. When I implemented the lexer, I didn't hard-program the operators in, instead added a pointer to the string containing the operator to that array. The lexer then searches through the operators and returns {TOK_OP,OP_????}. When I add an operator, I add a OP_* constant (optional btw), write the function for the operator and add the operator details to the array. Simple :graduated: It won't get any easier than that. There will be one more operator flag (User) which would mean user defined operators are handled just as easily, with a slightly different pointer to the function.

Additionally, most operators behave the same, and those that are different it is managed using their flags (Right associative, unary and postfix). This makes functions like lbp only a few lines long as it just needs to check that the token is of type TOK_OP and then read the precedence from the table. Obvious exceptions are those that have more than one meaning (+, -).

The last reason is because of the language I am ultimately going to use this for. It allows definition of operators, and an operator is any string that passes ispunct and isn't a special character like parenthesis/brackets.

Pratt's parser

Recommended Posts

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Create an account or sign in to comment

Create an account

Sign in

Recently Browsing 0 members

Similar Content