Posted (edited)

I was working on a UDF for inclusion in the standard UDF library that breaks text up into tokens based on an array of token definitions. I'm aware that, according to good engineering principles, code should be decomposed into several functions to make it more readable. However, since this was a UDF, I did not think it would be approved if I broke it up into smaller functions, so I decided to write it with only one helper function. You may think I'm inept at coding for saying this, but I'm starting to have trouble understanding my own code. I have nearly succeeded in creating a tokenizer that recognizes specified symbols, runs of characters matching a given regular expression, and quoted strings. My goal was a function that breaks structured information into tokens for easy processing. Unfortunately, the code has become so unmanageable and disorganized that I can no longer refine it through minor modifications, so I've chosen not to complete it. Nevertheless, the function does work when given the proper parameters.

_Tokenize() takes three arguments.

$sText - the text to be tokenized.

$aTokenTypes - the array of token definitions.

$aTokens - the destination array to which the new tokens are added. Each token is a two-element array: [Type, Text].

Each token definition is an array of five elements.

1. The type name of the token.

2. Whether the third element is a regular expression (True) or a single literal character (False).

3. The character/regular expression.

4. Whether the token consists of a single character (True), or of a run of consecutive characters that each match the specified character or regular expression (False).

5. Whether a character of this type opens a literal string: all following characters are accepted literally as a token of this type until another character matching the same token definition is encountered.
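To make the five-element layout above concrete, here is a minimal sketch of a helper that builds one definition. The helper name _MakeTokenDef and its parameter names are hypothetical and not part of the posted UDF; only the element order is taken from the list above.

```autoit
; Hypothetical helper (not part of the posted UDF): builds one five-element
; token definition in the order described in the list above.
Func _MakeTokenDef($sName, $bIsRegExp, $sPattern, $bIsSingle, $bIsLiteralDelim)
    Local $aDef[5] = [$sName, $bIsRegExp, $sPattern, $bIsSingle, $bIsLiteralDelim]
    Return $aDef
EndFunc

; Two example definitions: a single-character comma token and a
; multi-character word token matched by a regular expression.
Local $aDefs[2]
$aDefs[0] = _MakeTokenDef("comma", False, ",", True, False)
$aDefs[1] = _MakeTokenDef("word", True, "[[:alnum:]]", False, False)
```

An array built this way could be passed as $aTokenTypes in the same way as the hand-filled $tokenDefs array in the example further down.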

I've never thoroughly studied or understood already developed algorithms for tokenizing or parsing, so this process came from my own limited and faulty idea of how one would work. Any feedback, ideas, etc will be appreciated.

#include <Array.au3>

$NO_TOKEN = -1

Func _Tokenize($sText, $aTokenTypes, byref $aTokens)
    $iCharCount = StringLen($sText)
    $vLastType = 0
    $sLastChar = 0
    $sCurrentToken = ""
    $bLastIsSingle = False
    $bIsSingle = False
    $bHoldLastChar = False
    $bInLiteral = False
    $bStartLiteral = False
    $bLastStartLiteral = False
    $sLiteralText = ""

    Dim $aNewToken[2]

    For $iCharIndex = 1 to $iCharCount
        $sChar = StringMid($sText, $iCharIndex, 1)
        $vType = _CharIdentifyTokenType($sChar, $aTokenTypes, $bIsSingle, $bStartLiteral)

        if $iCharIndex > 1 Then
            if NOT $bHoldLastChar Then
                $sLastChar = StringMid($sText, $iCharIndex-1, 1)
                $vLastType = _CharIdentifyTokenType($sLastChar, $aTokenTypes, $bLastIsSingle, $bLastStartLiteral)
            EndIf

            If $bInLiteral AND $bStartLiteral <> $bLastStartLiteral then
                $sLiteralText  &= $sChar
            Else
                $bHoldLastChar = False
            EndIf


            if ($vType <> $vLastType OR ($vType == $vLastType AND $bLastIsSingle)) AND $vType <> $NO_TOKEN Then
                If Not $bInLiteral then
                    If Not $bLastStartLiteral then
                        $aNewToken[0] = $vLastType
                        $aNewToken[1] = $sCurrentToken

                        _ArrayAdd($aTokens, $aNewToken)
                    EndIf

                    If Not $bStartLiteral Then

                    Else
                        $sLiteralText = ""

                        $sLastChar = $sChar
                        $vLastType = $vType
                        $bLastStartLiteral = True

                        $bInLiteral = True
                        $bHoldLastChar = True
                    EndIf
                Else
                    If ($bStartLiteral AND $bLastStartLiteral) AND ($vType = $vLastType) Then
                        $aNewToken[0] = $vType
                        $aNewToken[1] = $sLiteralText
                        _ArrayAdd($aTokens, $aNewToken)

                        $bInLiteral = False
                        $bHoldLastChar = False
                    EndIf
                EndIf

                $sCurrentToken = ""
            ElseIf $vType = $NO_TOKEN Then
                $bHoldLastChar = true
            EndIf

            If $iCharIndex = $iCharCount AND $vType <> $NO_TOKEN Then
                if $bInLiteral Then
                    SetError(1)
                    Return -1
                EndIf

                If ($bStartLiteral AND $bLastStartLiteral) AND ($vType = $vLastType) Then
                    $aNewToken[0] = $vType
                    $aNewToken[1] = $sLiteralText
                    _ArrayAdd($aTokens, $aNewToken)

                    Return
                EndIf

                $sCurrentToken &= $sChar
                $aNewToken[0] = $vType
                $aNewToken[1] = $sCurrentToken
                _ArrayAdd($aTokens, $aNewToken)
                Return
            EndIf

            If $vType <> $NO_TOKEN Then
                $sCurrentToken &= $sChar
            EndIf
        Else
            If $vType <> $NO_TOKEN Then
                If $iCharCount = 1 Then
                    if $bStartLiteral Then
                        SetError(2)
                        Return -1
                    EndIf

                    $aNewToken[0] = $vType
                    $aNewToken[1] = $sChar
                    _ArrayAdd($aTokens, $aNewToken)
                Else
                    $sCurrentToken &= $sChar
                EndIf
            EndIf
        EndIf
    Next
EndFunc

Func _CharIdentifyTokenType($sChar, $aTokenTypes, byref $bIsSingle, byref $bStartLiteral)
    For $aType in $aTokenTypes
        If $aType[1] = true AND StringRegExp($sChar, $aType[2]) Then
            $bIsSingle = $aType[3]
            $bStartLiteral = $aType[4]
            Return $aType[0]
        ElseIf $aType[2] == $sChar then
            $bIsSingle = $aType[3]
            $bStartLiteral = $aType[4]
            Return $aType[0]
        EndIf
    Next

    $bIsSingle = false
    $bStartLiteral = False

    Return $NO_TOKEN
EndFunc

Dim $tokenDefs[5]
Dim $token1[5]
Dim $token2[5]
Dim $token3[5]
Dim $token4[5]
Dim $token5[5]

$token1[0] = "open_paren"
$token1[1] = False
$token1[2] = "("
$token1[3] = True
$token1[4] = False

$token2[0] = "close_paren"
$token2[1] = False
$token2[2] = ")"
$token2[3] = True
$token2[4] = False

$token3[0] = "comma"
$token3[1] = False
$token3[2] = ","
$token3[3] = True
$token3[4] = False

$token4[0] = "single_alnum_word"
$token4[1] = True
$token4[2] = "[[:alnum:]]"
$token4[3] = False
$token4[4] = False

$token5[0] = "string"
$token5[1] = False
$token5[2] = '"'
$token5[3] = False
$token5[4] = True

$tokenDefs[0] = $token1
$tokenDefs[1] = $token2
$tokenDefs[2] = $token3
$tokenDefs[3] = $token4
$tokenDefs[4] = $token5

Dim $tokens[1]

if _Tokenize('sandwich(cheese, "confusing solomy", "Another string?", mayonaze, mustard)', $tokenDefs, $tokens) < 0 Then
    ConsoleWrite(@Error & @CRLF)
EndIf

for $i = 1 to UBound($tokens)-1
    $token = $tokens[$i]
    ConsoleWrite("Type: " & $token[0] & @CRLF & "Text: " & $token[1] & @CRLF & @CRLF)
Next
Edited by WaitingForZion

Posted

I guess everyone thinks it's garbage.

Well, that's ok. But can I have some kind of feedback?


  • Moderators
Posted

WaitingForZion,

OK, I will bite! :mellow:

I can see what it does (and it seems to do it quite adequately) but why does it do it? What can it be used for? What lacuna in my coding life is it looking to fill?

Apologies if that sounds negative, but answering those questions might elicit some response. At the moment I can imagine most forum members looking at your post and thinking "This is a solution in search of a problem". So give us the problem! :(

M23

Any of my own code posted anywhere on the forum is available for use by others without any restriction of any kind


  • Developers
Posted

  On 2/21/2010 at 8:10 PM, Melba23 said:

What can it be used for? What lacuna in my coding life is it looking to fill?

To translate it for our average member: What game does this Bot handle? :mellow:

Seriously: I have the same question: give us a real life example where this script could be useful.

Jos


Posted (edited)

I explain that in my improved version: Alexi 1.0

Edited by WaitingForZion

Posted (edited)

No you didn't actually :mellow: I still don't see any example where the library is put to use.

I do see an example.au3 in your Alexi.zip in that thread btw, though IMO it's not a very extensive and useful example :( Maybe other people are interested in using it though, and kudos for also including a help file in the .zip (and for it using OO :lol:)

Edited by d4ni
Posted

I have no idea why you need to tokenize data but apparently people do. I assume for good reasons I don't yet understand. Apparently this is an important part of lexical analysis. I have seen tokenizing perl modules on cpan and tokenization code for java and c++. Maybe after I read the Wikipedia page for lexical analysis I will get it.

AutoIt changed my life.

Posted

  On 2/21/2010 at 11:22 PM, Skizmata said:

I have no idea why you need to tokenize data but apparently people do. I assume for good reasons I don't yet understand. Apparently this is an important part of lexical analysis. I have seen tokenizing perl modules on cpan and tokenization code for java and c++. Maybe after I read the Wikipedia page for lexical analysis I will get it.

Tokenizing is mainly good for breaking up text. Some examples I can come up with off the top of my head are:

Configuration files

Preprocessing source code

Command line calculators

Script Engines <- heaven forbid, implementing a script engine inside AutoIt would be horribly inefficient

Etc.
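As a rough illustration of the command line calculator case, here is a minimal sketch that splits an arithmetic expression into number and operator tokens. It uses a single StringRegExp call instead of the _Tokenize() UDF posted above, and the expression and variable names are made up for the example.

```autoit
; Sketch only: tokenizes an arithmetic expression with one regular expression.
; This is a different technique from the _Tokenize() UDF earlier in the thread.
Local $sExpr = "12+3*(4-1)"

; Flag 3 returns an array of all matches; each match is a run of digits
; or a single operator or parenthesis.
Local $aCalcTokens = StringRegExp($sExpr, "\d+|[-+*/()]", 3)
If @error Then
    ConsoleWrite("No tokens found" & @CRLF)
Else
    For $sToken In $aCalcTokens
        ConsoleWrite($sToken & @CRLF)
    Next
EndIf
```

Characters that match neither alternative (whitespace, for instance) are silently skipped, which is acceptable for a toy calculator but would hide typos in real input.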

Morgen


  • 8 months later...
Posted (edited)

  On 2/21/2010 at 8:10 PM, Melba23 said:

"This is a solution in search of a problem". So give us the problem! :graduated:

@Melba23 & @Jon

any syntax highlighting needs tokenizing, and any source code editor needs to tokenize the code to sort out what is code, what is a comment, what is commented-out code, etc.

my idea is to shorten the code, like i have written

without proper tokenizing it's impossible to do a perfect job.
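To show why code and comments cannot be separated by naive text search, here is a minimal sketch that strips a trailing comment from one line while leaving semicolons inside quoted strings alone. The function name _StripLineComment is hypothetical, the sketch works on a single line only, and it does not handle #cs/#ce block comments.

```autoit
; Hypothetical helper: removes a trailing ";" comment from ONE line of AutoIt
; source. Quoted strings are matched first and put back unchanged via $1, so
; a semicolon inside a string is not mistaken for the start of a comment.
Func _StripLineComment($sLine)
    ; Pattern: ("[^"]*"|'[^']*')|;.*
    Return StringRegExpReplace($sLine, '("[^"]*"|''[^'']*'')|;.*', '$1')
EndFunc

ConsoleWrite(_StripLineComment('$a = "x;y" ; set a') & @CRLF)
```

The regular expression here acts as a very small tokenizer: it recognizes string tokens and comment tokens and treats everything else as plain code.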

have played around:

#include <Array.au3>
Global Const $VFName = StringSplit("abcdefghijklmnopqrstuvwxyz_0123456789",'',2) ; elements 0 - 36
Global $VFcount = 0
$File = FileOpenDialog("",@ScriptDir,"Scripts (*.au3)",5)
If @error Then 
    msgbox(0,"Error",@error)
    Exit
EndIf
$File = StringReplace($File, "|", @CRLF)
$Source = FileRead($File)

; insert used functions from includes
; remove comments
; identify and protect strings from changes
; count and replace variables
; count and replace function names
; hex to dec, if shorter
; replace constants with content, if shorter than constants name 
; for constants and params eval BitOr,BitAnd, ... 
; reduce whitespace as far as possible
Func ReplaceFuncs()
    Dim $Vars[1][3]
    Local $l=0
    $VFcount = 0
    $functions = StringRegExp($source,"(?i)func\s+(\w*)\(.*\)",3)
    For $element In $functions
        ReDim $Vars[$l+1][3]
        $count = StringRegExp($source,"[^$]"&$element&"\s*\(",3)
        $Vars[$l][0] = $element
        $Vars[$l][1] = UBound($count)
        $l+=1
    Next
    _ArraySort($Vars, 1, 0, 0, 0)
    For $i=0 To $l-1
        $Vars[$i][2] = getName(true)
    Next
    Return $Vars
EndFunc
Func ReplaceVariables()
    Dim $Vars[1][3]
    Local $l=0
    $VFcount = 0
    $variables = StringRegExp($source,"\$(\w*)",3)
    FOR $element IN $variables
        For $i=0 To $l-1
            If StringCompare($Vars[$i][0],$element)=0 Then
                $Vars[$i][1] += 1
                ContinueLoop 2
            EndIf
        NEXT
        ReDim $Vars[$l+1][3]
        $Vars[$l][0] = $element
        $Vars[$l][1] = 1
        $l+=1
    NEXT
    _ArraySort($Vars,1,0,0,1)
    For $i=0 To $l-1
        $Vars[$i][2] = getName()
    Next    
    Return $Vars
EndFunc

Func getName($type_func=false)
    If $type_func And Mod($VFcount,37)=27 Then $VFcount+=10
    $counter=$VFcount
    $mod=Mod($counter,37)
    $var=$VFName[$mod]
    While $counter>36
        $counter=($counter-$mod)/37-1
        $var &= $VFName[Mod($counter,37)]
    Wend
    $VFcount+=1
    return $var
EndFunc
_ArrayDisplay(ReplaceFuncs(), "Funcs")
_ArrayDisplay(ReplaceVariables(), "Variables")

; $Source=StringStripCR($Source)

TODO:

* prevent a digit in the first position of new function names
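One possible way to handle this TODO item is sketched below. It is an assumption about how the naming could work, not a fix of getName() itself: the hypothetical function _ShortName takes the first character from the 27 non-digit symbols and every later character from all 37, so no generated name can start with a digit.

```autoit
; Hypothetical alternative to getName(): maps an index (0, 1, 2, ...) to a
; unique short name whose first character is never a digit.
; Not handled here: collisions with keywords or built-in function names.
Func _ShortName($iIndex)
    Local Const $sFirst = "abcdefghijklmnopqrstuvwxyz_"
    Local Const $sRest = $sFirst & "0123456789"

    ; First character: one of the 27 letters/underscore.
    Local $sName = StringMid($sFirst, Mod($iIndex, 27) + 1, 1)
    $iIndex = Int($iIndex / 27)

    ; Remaining characters: any of the 37 symbols.
    While $iIndex > 0
        $iIndex -= 1
        $sName &= StringMid($sRest, Mod($iIndex, 37) + 1, 1)
        $iIndex = Int($iIndex / 37)
    WEnd
    Return $sName
EndFunc
```

Indexes 0 to 26 give the single-character names, and index 27 starts the two-character names. Because the name depends only on the index, the same function could serve both variables and functions.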

Edited by Raik

