WaitingForZion

Tokenizer

9 posts in this topic

#1 ·  Posted (edited)

I was working on a UDF for inclusion in the UDF standard library that breaks text up into tokens based on an array of token definitions. I'm aware that, according to good engineering principles, code should be decomposed into several functions to make it more readable. However, since this was a UDF, I did not think it would be approved if I broke it up into smaller functions, so I decided to write it with only a single helper function. You may think I'm unfit to code for saying this, but I'm starting to have trouble understanding my own code. I have nearly succeeded in creating a tokenizer that recognizes specified symbols, runs of characters matching a given regular expression, and quoted strings. My goal was to create a function that could break up structured information into tokens for easy processing. Unfortunately, the code has become so unmanageable and disorganized that I can no longer refine it through minor modifications, so I've chosen not to complete it. Nevertheless, the function does work when given proper parameters.

_Tokenize() takes three arguments.

$sText - the text to be tokenized.

$aTokenTypes - the array of token definitions.

$aTokens - the destination array (passed ByRef) to which the new tokens are added. Each token is a two-element array: [Type, Text]

Each token definition is an array of five elements.

1. The type name of the token.

2. Whether the token is matched against a regular expression describing a class of characters (True), or directly against a single given character (False).

3. The character/regular expression.

4. Whether the token consists of a single character (True), or of a run of consecutive characters that each match the specified character or regular expression (False).

5. Whether the character delimits a literal: if True, the characters that follow are taken verbatim as a string of this token type until another character matching the same token definition is encountered.
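
The five-element layout above could be wrapped in a small constructor so that each definition reads as one line. This is only a sketch: `_MakeTokenDef` is a hypothetical helper, not part of the posted UDF, and the field meanings are taken from how `_CharIdentifyTokenType()` reads indices 0 to 4.

```autoit
; Hypothetical helper (an assumption, not part of the posted UDF):
; builds one token definition in the five-element layout described above.
Func _MakeTokenDef($sTypeName, $bIsRegExp, $sPattern, $bIsSingle, $bIsLiteralDelim)
    Local $aDef[5]
    $aDef[0] = $sTypeName        ; 1. type name of the token
    $aDef[1] = $bIsRegExp        ; 2. True = match by regular expression, False = match one exact character
    $aDef[2] = $sPattern         ; 3. the character or the regular expression
    $aDef[3] = $bIsSingle        ; 4. True = single-character token, False = run of matching characters
    $aDef[4] = $bIsLiteralDelim  ; 5. True = this character opens and closes a literal string
    Return $aDef
EndFunc

; The same values as three of the hand-built definitions in the test script.
Global $aDefs[3]
$aDefs[0] = _MakeTokenDef("comma", False, ",", True, False)
$aDefs[1] = _MakeTokenDef("single_alnum_word", True, "[[:alnum:]]", False, False)
$aDefs[2] = _MakeTokenDef("string", False, '"', False, True)

; List the type name and pattern of each definition.
For $aDef In $aDefs
    ConsoleWrite($aDef[0] & " -> " & $aDef[2] & @CRLF)
Next
```

An array built this way should be usable as the `$aTokenTypes` argument, since it has the same shape as the `$tokenDefs` array assembled by hand.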

I've never thoroughly studied or understood existing algorithms for tokenizing or parsing, so this process came from my own limited and faulty idea of how one would work. Any feedback, ideas, etc. will be appreciated.

#include <Array.au3>

$NO_TOKEN = -1

Func _Tokenize($sText, $aTokenTypes, byref $aTokens)
    $iCharCount = StringLen($sText)
    $vLastType = 0
    $sLastChar = 0
    $sCurrentToken = ""
    $bLastIsSingle = False
    $bIsSingle = False
    $bHoldLastChar = False
    $bInLiteral = False
    $bStartLiteral = False
    $bLastStartLiteral = False
    $sLiteralText = ""

    Dim $aNewToken[2]

    For $iCharIndex = 1 to $iCharCount
        $sChar = StringMid($sText, $iCharIndex, 1)
        $vType = _CharIdentifyTokenType($sChar, $aTokenTypes, $bIsSingle, $bStartLiteral)

        if $iCharIndex > 1 Then
            if NOT $bHoldLastChar Then
                $sLastChar = StringMid($sText, $iCharIndex-1, 1)
                $vLastType = _CharIdentifyTokenType($sLastChar, $aTokenTypes, $bLastIsSingle, $bLastStartLiteral)
            EndIf

            If $bInLiteral AND $bStartLiteral <> $bLastStartLiteral then
                $sLiteralText  &= $sChar
            Else
                $bHoldLastChar = False
            EndIf


            if ($vType <> $vLastType OR ($vType == $vLastType AND $bLastIsSingle)) AND $vType <> $NO_TOKEN Then
                If Not $bInLiteral then
                    If Not $bLastStartLiteral then
                        $aNewToken[0] = $vLastType
                        $aNewToken[1] = $sCurrentToken

                        _ArrayAdd($aTokens, $aNewToken)
                    EndIf

                    If Not $bStartLiteral Then

                    Else
                        $sLiteralText = ""

                        $sLastChar = $sChar
                        $vLastType = $vType
                        $bLastStartLiteral = True

                        $bInLiteral = True
                        $bHoldLastChar = True
                    EndIf
                Else
                    If ($bStartLiteral AND $bLastStartLiteral) AND ($vType = $vLastType) Then
                        $aNewToken[0] = $vType
                        $aNewToken[1] = $sLiteralText
                        _ArrayAdd($aTokens, $aNewToken)

                        $bInLiteral = False
                        $bHoldLastChar = False
                    EndIf
                EndIf

                $sCurrentToken = ""
            ElseIf $vType = $NO_TOKEN Then
                $bHoldLastChar = true
            EndIf

            If $iCharIndex = $iCharCount AND $vType <> $NO_TOKEN Then
                if $bInLiteral Then
                    SetError(1)
                    Return -1
                EndIf

                If ($bStartLiteral AND $bLastStartLiteral) AND ($vType = $vLastType) Then
                    $aNewToken[0] = $vType
                    $aNewToken[1] = $sLiteralText
                    _ArrayAdd($aTokens, $aNewToken)

                    Return
                EndIf

                $sCurrentToken &= $sChar
                $aNewToken[0] = $vType
                $aNewToken[1] = $sCurrentToken
                _ArrayAdd($aTokens, $aNewToken)
                Return
            EndIf

            If $vType <> $NO_TOKEN Then
                $sCurrentToken &= $sChar
            EndIf
        Else
            If $vType <> $NO_TOKEN Then
                If $iCharCount = 1 Then
                    if $bStartLiteral Then
                        SetError(2)
                        Return -1
                    EndIf

                    $aNewToken[0] = $vType
                    $aNewToken[1] = $sChar
                    _ArrayAdd($aTokens, $aNewToken)
                Else
                    $sCurrentToken &= $sChar
                EndIf
            EndIf
        EndIf
    Next
EndFunc

Func _CharIdentifyTokenType($sChar, $aTokenTypes, byref $bIsSingle, byref $bStartLiteral)
    For $aType in $aTokenTypes
        If $aType[1] = true AND StringRegExp($sChar, $aType[2]) Then
            $bIsSingle = $aType[3]
            $bStartLiteral = $aType[4]
            Return $aType[0]
        ElseIf $aType[2] == $sChar then
            $bIsSingle = $aType[3]
            $bStartLiteral = $aType[4]
            Return $aType[0]
        EndIf
    Next

    $bIsSingle = false
    $bStartLiteral = False

    Return $NO_TOKEN
EndFunc

Dim $tokenDefs[5]
Dim $token1[5]
Dim $token2[5]
Dim $token3[5]
Dim $token4[5]
Dim $token5[5]

$token1[0] = "open_paren"
$token1[1] = False
$token1[2] = "("
$token1[3] = True
$token1[4] = False

$token2[0] = "close_paren"
$token2[1] = False
$token2[2] = ")"
$token2[3] = True
$token2[4] = False

$token3[0] = "comma"
$token3[1] = False
$token3[2] = ","
$token3[3] = True
$token3[4] = False

$token4[0] = "single_alnum_word"
$token4[1] = True
$token4[2] = "[[:alnum:]]"
$token4[3] = False
$token4[4] = False

$token5[0] = "string"
$token5[1] = False
$token5[2] = '"'
$token5[3] = False
$token5[4] = True

$tokenDefs[0] = $token1
$tokenDefs[1] = $token2
$tokenDefs[2] = $token3
$tokenDefs[3] = $token4
$tokenDefs[4] = $token5

Dim $tokens[1]

if _Tokenize('sandwich(cheese, "confusing solomy", "Another string?", mayonaze, mustard)', $tokenDefs, $tokens) < 0 Then
    ConsoleWrite(@Error & @CRLF)
EndIf

for $i = 1 to UBound($tokens)-1
    $token = $tokens[$i]
    ConsoleWrite("Type: " & $token[0] & @CRLF & "Text: " & $token[1] & @CRLF & @CRLF)
Next
Edited by WaitingForZion

Spoiler

"This then is the message which we have heard of him, and declare unto you, that God is light, and in him is no darkness at all. If we say that we have fellowship with him, and walk in darkness, we lie, and do not the truth: But if we walk in the light, as he is in the light, we have fellowship one with another, and the blood of Jesus Christ his Son cleanseth us from all sin. If we say that we have no sin, we deceive ourselves, and the truth is not in us. If we confess our sins, he is faithful and just to forgive us our sins, and to cleanse us from all unrighteousness. If we say that we have not sinned, we make him a liar, and his word is not in us." (I John 1:5-10)

 




I guess everyone thinks it's garbage.

Well, that's ok. But can I have some kind of feedback?



WaitingForZion,

OK, I will bite! :mellow:

I can see what it does (and it seems to do it quite adequately) but why does it do it? What can it be used for? What lacuna in my coding life is it looking to fill?

Apologies if that sounds negative, but answering those questions might elicit some response. At the moment I can imagine most forum members looking at your post and thinking "This is a solution in search of a problem". So give us the problem! :(

M23



What can it be used for? What lacuna in my coding life is it looking to fill?

To translate it for our average member: What game does this Bot handle? :mellow:

Seriously: I have the same question: give us a real life example where this script could be useful.

Jos



#5 ·  Posted (edited)

I explain that in my improved version: Alexi 1.0

Edited by WaitingForZion


#6 ·  Posted (edited)

No you didn't actually :mellow: I still don't see any example where the library is put to use.

I do see an example.au3 in your Alexi.zip in that thread btw, though IMO it's not a very extensive and useful example :( Maybe other people are interested in using it though, and kudos for also including a help file in the .zip (and for it using OO :lol:)

Edited by d4ni


I have no idea why you need to tokenize data, but apparently people do. I assume for good reasons I don't yet understand. Apparently this is an important part of lexical analysis. I have seen tokenizing Perl modules on CPAN and tokenization code for Java and C++. Maybe after I read the Wikipedia page for lexical analysis I will get it.



I have no idea why you need to tokenize data, but apparently people do. I assume for good reasons I don't yet understand. Apparently this is an important part of lexical analysis. I have seen tokenizing Perl modules on CPAN and tokenization code for Java and C++. Maybe after I read the Wikipedia page for lexical analysis I will get it.

Tokenizing is mainly good for breaking up text. Some examples I can come up with off the top of my head are:

Configuration files

Preprocessing source code

Command line calculators

Script engines <- heaven forbid, implementing a script engine inside AutoIt would be horribly inefficient

Etc.
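
As a rough illustration of the configuration-file case, the sketch below splits one line into tokens with a single regular expression. This is a different and much simpler technique than the character-by-character loop in the first post, and the sample line and the pattern are assumptions made up for the example.

```autoit
; Assumed sample input: a line of comma-separated key = value pairs.
Global $sLine = 'title = "My App", retries = 3'

; Alternatives, tried left to right at each position:
;   "[^"]*"   a double-quoted string (no escape handling)
;   \w+       a run of word characters
;   [=,]      a single '=' or ','
; Flag 3 makes StringRegExp return an array of all matches.
Global $aParts = StringRegExp($sLine, '"[^"]*"|\w+|[=,]', 3)

If @error Then
    ConsoleWrite("No tokens found" & @CRLF)
Else
    For $i = 0 To UBound($aParts) - 1
        ConsoleWrite("Token: " & $aParts[$i] & @CRLF)
    Next
EndIf
```

Whitespace is skipped simply because no alternative matches it. Unlike `_Tokenize()`, this sketch does not attach a type name to each token.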

Morgen



#9 ·  Posted (edited)

"This is a solution in search of a problem". So give us the problem! :graduated:

@Melba23 & @Jon

Any syntax highlighting needs tokenizing, and any source code editor needs to tokenize the code to sort out what is code, what is a comment, what is commented-out code, etc.

My idea is to shorten the code, as I have written.

Without proper tokenizing it's impossible to do a perfect job.
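
To show why quoting matters when deciding what is code and what is a comment, here is a minimal sketch that finds where the comment starts on one AutoIt line. `_FindCommentStart` is a hypothetical name, and the sketch assumes single-line `;` comments only; `#cs`/`#ce` blocks are not handled.

```autoit
; Hypothetical helper: returns the 1-based position of the first ';'
; that is outside a quoted string, or 0 if the line has no comment.
Func _FindCommentStart($sLine)
    Local $sQuote = ""  ; the quote character that is currently open, "" when outside a string
    Local $sChar
    For $i = 1 To StringLen($sLine)
        $sChar = StringMid($sLine, $i, 1)
        If $sQuote <> "" Then
            ; Inside a string: only the matching quote character ends it.
            ; A doubled quote closes and immediately reopens the string.
            If $sChar == $sQuote Then $sQuote = ""
        ElseIf $sChar == '"' Or $sChar == "'" Then
            $sQuote = $sChar
        ElseIf $sChar == ";" Then
            Return $i
        EndIf
    Next
    Return 0
EndFunc

; The ';' inside "a;b" must not be mistaken for a comment.
Global $sLine = 'MsgBox(0, "a;b", "c") ; show it'
Global $iPos = _FindCommentStart($sLine)
If $iPos Then
    ConsoleWrite("Code:    " & StringLeft($sLine, $iPos - 1) & @CRLF)
    ConsoleWrite("Comment: " & StringMid($sLine, $iPos) & @CRLF)
EndIf
```

A plain search for the first `;` would cut this line in the middle of the string, which is the kind of mistake a tokenizer avoids.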

I have played around with it:

#include <Array.au3>
Global Const $VFName = StringSplit("abcdefghijklmnopqrstuvwxyz_0123456789",'',2) ; elements 0 - 36
Global $VFcount = 0
$File = FileOpenDialog("",@ScriptDir,"Scripts (*.au3)",5)
If @error Then 
    msgbox(0,"Error",@error)
    Exit
EndIf
$File = StringReplace($File, "|", @CRLF)
$Source = FileRead($File)

; insert used functions from includes
; remove comments
; identify and protect strings from changes
; count and replace variables
; count and replace function names
; hex to dec, if shorter
; replace constants with their content, if shorter than the constant's name
; for constants and params eval BitOr,BitAnd, ... 
; reduce whitespace as far as possible
Func ReplaceFuncs()
    Dim $Vars[1][3]
    Local $l=0
    $VFcount = 0
    $functions = StringRegExp($source,"(?i)func\s+(\w*)\(.*\)",3)
    For $element In $functions
        ReDim $Vars[$l+1][3]
        $count = StringRegExp($source,"[^$]"&$element&"\s*\(",3)
        $Vars[$l][0] = $element
        $Vars[$l][1] = UBound($count)
        $l+=1
    Next
    _ArraySort($Vars, 1, 0, 0, 0)
    For $i=0 To $l-1
        $Vars[$i][2] = getName(true)
    Next
    Return $Vars
EndFunc
Func ReplaceVariables()
    Dim $Vars[1][3]
    Local $l=0
    $VFcount = 0
    $variables = StringRegExp($source,"\$(\w*)",3)
    FOR $element IN $variables
        For $i=0 To $l-1
            If StringCompare($Vars[$i][0],$element)=0 Then
                $Vars[$i][1] += 1
                ContinueLoop 2
            EndIf
        NEXT
        ReDim $Vars[$l+1][3]
        $Vars[$l][0] = $element
        $Vars[$l][1] = 1
        $l+=1
    NEXT
    _ArraySort($Vars,1,0,0,1)
    For $i=0 To $l-1
        $Vars[$i][2] = getName()
    Next    
    Return $Vars
EndFunc

Func getName($type_func=false)
    If $type_func And Mod($VFcount,37)=27 Then $VFcount+=10
    $counter=$VFcount
    $mod=Mod($counter,37)
    $var=$VFName[$mod]
    While $counter>36
        $counter=($counter-$mod)/37-1
        $var &= $VFName[Mod($counter,37)]
    Wend
    $VFcount+=1
    return $var
EndFunc
_ArrayDisplay(ReplaceFuncs(), "Funcs")
_ArrayDisplay(ReplaceVariables(), "Variables")

; $Source=StringStripCR($Source)

TODO:

* prevent a digit as the first character of a new function name

Edited by Raik
