Jump to content

Mat
 Share

Recommended Posts

For those of you (and I know from previous discussions that there are people out there) making your own compilers and interpreters and choosing not to use tools like yacc.

Pratt's parser is a predictive parser, but with a number of improvements that make it so easy to implement you wonder why in the 38 since the original paper was published it hasn't been used in more than a small handful of projects.

Look it up if you ever consider hand writing a parser. I managed to write a calculator from first principles in 30 mins. 20 minutes of that was writing the lexer.

Mat

Edit: Just to give you an idea of how unused this is... Google "Pratt's Parser" and you'll get this thread as the first result.

Edited by Mat
Link to comment
Share on other sites

hmmm. maybe, considering I'm doing a complete re-write of my project in C++.

ongoing projects:-firestorm: Largescale P2P Social NetworkCompleted Autoit Programs/Scripts: Variable Pickler | Networked Streaming Audio (in pure autoIT) | firenet p2p web messenger | Proxy Checker | Dynamic Execute() Code Generator | P2P UDF | Graph Theory Proof of Concept - Breadth First search

Link to comment
Share on other sites

Nice thread, Mat.

Any docs/links too? Would be great.

Well...

Simple: http://effbot.org/zone/simple-top-down-parsing.htm

More in depth: http://journal.stuffwithstuff.com/2011/03/19/pratt-parsers-expression-parsing-made-easy/

Also in depth: http://javascript.crockford.com/tdop/tdop.html

I'll also tidy up the source code for the calculator I made and post that. All I need to add is a tertiary operator to prove that this parser can do anything.

Link to comment
Share on other sites

I was thinking of that, unfortunately I got started in C as I was going to be using asm to implement the final thing and I didn't want one of those algorithms that looked nice in python when you don't worry about data types. It would be easy to do. Especially given that autoit makes it almost as easy as python.

Link to comment
Share on other sites

Ok, I tidied it up a bit to show you how nice it is to build it incrementally. Initially, for binary operators and no lexer, it was an 80 line C program, of which a lot were utility functions to do with the functions rather than the parser. To make the code easier to understand it only works with integers.

Usually I'd seperate this out into different files, but for now I'm posting them as standalone c programs... They are all tested compiling on gcc. For those involving sqrt you'll need to link the maths library in (gcc -lm file.c).

#include <stdio.h>
 
#define TOK_END 0
#define TOK_INT 1
#define TOK_OP  2
 
#define OP_PLUS 0
#define OP_MUL  1
 
// 7 + 2 * 3
 
int pos = 0;
struct TOKEN
{
    int Type;
    int Val;
} test[] = {
    {TOK_INT, 7},
    {TOK_OP, OP_PLUS},
    {TOK_INT, 2},
    {TOK_OP, OP_MUL},
    {TOK_INT, 3}
}, tok;
 
int Op_Add(int left, int right)
{
    return left + right;
}
 
int Op_Mul(int left, int right)
{
    return left * right;
}
 
struct OPERATOR
{
    int Precedence;
    void* func;
} Ops[] = {
    {30,&Op_Add},
    {40,&Op_Mul}
};
 
int lbp(struct TOKEN token)
{
    if (token.Type == TOK_OP)
        return Ops[token.Val].Precedence;
    return 0;
}
 
int led(struct TOKEN token, int left)
{
    int right = expression(lbp(token));
 
    int (*fun)(int, int) = Ops[token.Val].func;
    return fun(left, right);
}
 
int nud(struct TOKEN token)
{
    if (token.Type == TOK_INT)
        return token.Val;
    return -1;
}
 
int expression(int rbp)
{
    struct TOKEN token = tok;
    tok = test[++pos];
    int left = nud(token);
 
    while (lbp(tok) > rbp)
    {
        token = tok;
        tok = test[++pos];
        left = led(token, left);
    }
 
    return left;
}
 
int main(int argc, char **argv)
{
    tok = test[0];
 
    printf("%i\n", expression(0));
    return 0;
}

(Simple.c Simple.exe)

It then got more complex gradually:

Next up is error checking. Again it's very easy to add, as you just need to add default statements to the switches and else statements for ifs for unexpeced tokens. After that I'll do the same for AutoIt.

The lexer is a mess at the moment, so I'll tidy that up sometime.

If anyones interested about what I did to go from one step to the next then just ask :graduated:

Mat

Edit: Added exes.

Edited by Mat
Link to comment
Share on other sites

And for those interested: Simple.au3

Cool. :graduated:

btw, Why did you make the distinction between TOK_OP with params and TOK_INT with params instead of just TOK_INT and TOK_ADD, TOK_MUL? Only because python example does it like that?

$s = "7 + 2 * 3"
Global $test = lex($s)

MsgBox(0, $s, expression(0))


Func lex($s)
    Local $out[100][2]
    $arr = StringSplit($s, '', 2)
    $n = 0
    For $i = 0 To UBound($arr)-1
        Switch $arr[$i]
            Case " "
                ContinueLoop
            Case "0" To "9"
                $out[$n][0] = $TOK_INT
                $out[$n][1] = lexNumber($arr, $i)
                $n += 1
            Case "+"
                $out[$n][0] = $TOK_OP
                $out[$n][1] = $OP_ADD
                $n += 1
            Case "*"
                $out[$n][0] = $TOK_OP
                $out[$n][1] = $OP_MUL
                $n += 1
        EndSwitch
    Next
    $out[$n][0] = 0
    $out[$n][1] = 0
    Return $out
EndFunc

Func lexNumber($s, ByRef $i)
    $out = ""
    While $s[$i] >= "0" And $s[$i] <= "9"
        $out &= $s[$i]
        $i += 1

        If $i = UBound($s) Then ExitLoop
    WEnd
    $i -= 1
    Return $out
EndFunc
Edited by Manadar
Link to comment
Share on other sites

Thats a nice simple way to do it... I was thinking of using the operator array and adding strings to it rather than looking for the tokens in the switch statement... It would make it very easy to add operators.

Also, I think that it would be possible to lose the global $test, probably by combining the nud,led and expression functions and putting it in there as a static.

Link to comment
Share on other sites

Well... Actually the correct way to do it is to make the lex function yeild tokens. My version of the parser uses $pos and $pos-1 because I was lazy and AutoIt doesn't do structs very easily. If you look at the c examples where I use a struct for tokens, it copies the token, then fills tok with the next one. (or it should do anyway, I might have been lazy there as well)

Global $tok

; int calc(string)
Func calc($expr)
    $tok = lexGetToken($expr)

    Return expression(0)
EndFunc

; int nud(TOKEN)
Func nud($token)
    ...
EndFunc

; int led(TOKEN)
Func led($token)
    ...
EndFunc

; int expression(int)
Func expression($rbp)
    Local $token = $tok  ; Stores the last token
    $tok = lexGetToken() ; Fills the current one

    Local $left = nud($token) ; Rather than using $pos-1

    ...

    return $left
EndFunc

; TOKEN lexGetToken( [string] )
Func lexGetToken($in = "")
    Local static $pos = 0
    Local static $str = $in

    Local $ret[2] ; Token of the form [TYPE, VALUE]

    ...

    Return $ret
EndFunc

The problem is that it gets very messy passing 2 element arrays around all the time. And accessing them is not nearly as easy to read as tok.Type and tok.Val.

Because nud and led never call themselves, only calling expression, it is possible to merge them into a single function, and replace Global $tok with Local Static $tok. expression calls lexGetToken and the user only has to call calc. Nice and clean.

It's also possible to take out expression calling itself completely by using a stack, but that would be pointless in AutoIt. It does work well in asm though.

Edited by Mat
Link to comment
Share on other sites

btw, Why did you make the distinction between TOK_OP with params and TOK_INT with params instead of just TOK_INT and TOK_ADD, TOK_MUL? Only because python example does it like that?

Sorry, didn't see this.

The reason is that it makes it considerably easier to add and manage operators (including dynamically: More on that later).

You'll see in the C examples that there is an array of operators, with details like flags, precedence and then a pointer to the function to call. When I implemented the lexer, I didn't hard-program the operators in, instead added a pointer to the string containing the operator to that array. The lexer then searches through the operators and returns {TOK_OP,OP_????}. When I add an operator, I add a OP_* constant (optional btw), write the function for the operator and add the operator details to the array. Simple :graduated: It won't get any easier than that. There will be one more operator flag (User) which would mean user defined operators are handled just as easily, with a slightly different pointer to the function.

Additionally, most operators behave the same, and those that are different it is managed using their flags (Right associative, unary and postfix). This makes functions like lbp only a few lines long as it just needs to check that the token is of type TOK_OP and then read the precedence from the table. Obvious exceptions are those that have more than one meaning (+, -).

The last reason is because of the language I am ultimately going to use this for. It allows definition of operators, and an operator is any string that passes ispunct and isn't a special character like parenthesis/brackets.

Link to comment
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now
 Share

×
×
  • Create New...