Nutster

Regular Expression Testing

139 posts in this topic

[..]

So escape of "." is not working?

<{POST_SNAPBACK}>

I think you're experiencing the repeating characters phenomenon: from the helpfile .. "Repeating characters (*, +, ?) will try to match the largest set possible. e.g. ba*a will always fail because the trailing a will have already matched the repeating a."

Try this :) ...

$line = "C:\Documents and Settings\User\NTUSER.DAT"

    If RegExp($line, '\.DAT$') Then           ;  .dat at end-of-line
       Msgbox(0, "RegExp", "Pattern found")
    Else
       Msgbox(0, "RegExp", "Pattern NOT found")
    Endif


I think you're experiencing the repeating characters phenomenon: from the helpfile .. "Repeating characters (*, +, ?) will try to match the largest set possible. e.g. ba*a will always fail because the trailing a will have already matched the repeating a."

Well, but why does the second code work? :)

Another example, where I just removed the repetition:

$line = "NTUSER.DAT"
If RegExp($line, '\.DAT$') Then ...

Because "." is escaped, this should not match, but it did.

Try this  :)  ...

$line = "C:\Documents and Settings\User\NTUSER.DAT"

    If RegExp($line, '\.DAT$') Then          ;  .dat at end-of-line
       Msgbox(0, "RegExp", "Pattern found")
    Else
       Msgbox(0, "RegExp", "Pattern NOT found")
    Endif

<{POST_SNAPBACK}>

In this case "NTUSERDAT" will also be matched, so I will not know that "DAT" was an extension.


#43 ·  Posted (edited)

Hmm .. I'm not sure why '[.]*\\DAT$' works on $line = "C:\Documents and Settings\User\NTUSER\DAT" :) .. maybe Nutster can explain?

But .. I think you'll find '\.DAT$' does NOT in fact match "C:\Documents and Settings\User\NTUSERDAT"

.. I tried it, and it correctly fails :) .. cos it's looking for ".DAT" at end of line

Hope this helps

Edit: typo + more -->

[..]

Another example, where I just removed the repetition:

$line = "NTUSER.DAT"
If RegExp($line, '\.DAT$') Then ...

Because "." is escaped, this should not match, but it did.

[..]

<{POST_SNAPBACK}>

.. This SHOULD match (and it does) because:

".DAT" means "any-char-followed-byDAT"

but "\.DAT" means "dot-followed-byDAT" <-- the special character "." (normally "any character") is escaped back to its literal meaning.

I may be wrong .. but this is how I understand it ;)

Edited by trids
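
For anyone following along, the difference between the escaped and the unescaped dot can be checked with a minimal sketch like the one below. It uses Python's re module purely as a stand-in regex engine, so it is an assumption that it behaves like AutoIt's RegExp() for these simple patterns; the helper name found() is made up for the illustration.

```python
import re

# Python's re module stands in for a generic regex engine here;
# this is NOT AutoIt's RegExp() implementation.
ESCAPED = re.compile(r"\.DAT$")   # literal dot, then DAT, at end of string
WILDCARD = re.compile(r".DAT$")   # any single character, then DAT, at end of string


def found(pattern, line):
    """Hypothetical helper: True when the pattern occurs anywhere in the line."""
    return pattern.search(line) is not None


with_dot = r"C:\Documents and Settings\User\NTUSER.DAT"
without_dot = r"C:\Documents and Settings\User\NTUSERDAT"

print(found(ESCAPED, with_dot))      # True
print(found(ESCAPED, without_dot))   # False
print(found(WILDCARD, without_dot))  # True: the "R" satisfies the unescaped dot
```

So the escaped form is the one to use when "DAT" must really be an extension.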


But .. I think you'll find '\.DAT$' does NOT in fact match "C:\Documents and Settings\User\NTUSERDAT"

.. I tried it, and it correctly fails  :)  .. cos it's looking for ".DAT" at end of line

Oops :) I really forgot the $ at the end... And again I tried to treat regexps like ordinary file masks. Thanks, I've got it now. I will try to work around this prob...


#45 ·  Posted (edited)

Hmm .. I'm not sure why '[.]*\\DAT$' works on $line = "C:\Documents and Settings\User\NTUSER\DAT" :) .. maybe Nutster can explain?

<{POST_SNAPBACK}>

'[.]*\\DAT$' means zero or more "real" dots. There are 0 dots, so that works. \\ matches the real backslash and DAT$ matches the last 3 characters of the string. [.]*\. will never match the last dot because it was already read by the repeating set. Both [.] and \. will match a real dot.

Edited by Nutster

David Nuttall
Nuttall Computer Consulting

An Aquarius born during the Age of Aquarius

AutoIt allows me to re-invent the wheel so much faster.

I'm off to write a wizard, a wonderful wizard of odd...
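
The "zero or more" point can be tried out with a small sketch. It is written against Python's re module, which is only a stand-in here and not the AutoIt implementation. One assumed difference is worth flagging: Python's engine backtracks, so the last check below succeeds, whereas the post above says [.]*\. never matches in the current AutoIt build.

```python
import re

line = r"C:\Documents and Settings\User\NTUSER\DAT"

# [.]* is allowed to match zero dots, so the match simply starts at the backslash.
m = re.search(r"[.]*\\DAT$", line)
print(m.group())  # \DAT

# [.] and \. are two spellings of the same literal dot.
print(re.findall(r"[.]", "a.b.c") == re.findall(r"\.", "a.b.c"))  # True

# A backtracking engine gives one dot back so the trailing \. can still match.
# This is Python's behaviour, not what the thread reports for AutoIt.
print(re.fullmatch(r"[.]*\.", "...") is not None)  # True
```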


#46 ·  Posted (edited)

Thanks, I did not realize that [.] means a real dot... My brain is overheating and I feel that I (and surely many others) just need a few good examples... But I can't sleep until I know why this pattern does not match... :idiot:

$line = "C:\Documents and Settings\User\NTUSER.DAT"
RegExp($line, '^C:\\Documents and Settings[\A\\]*\.DAT$'); does not match

RegExp($line, '^C:\\Documents and Settings[\A\\]*'); works up to here

1. Matching ^ - start of line

2. Exactly matching C:\Documents and Settings

3. Next comes a run of any number of alphanumeric symbols or backslashes, or none at all

4. Next should be a real dot and DAT at the end of the line - but this does not match.

Please show me where I went wrong...

BTW I read some info about PHP regexps (which are mostly the same as the current implementation), and found that although by default * and + consume the following char(s), it's possible to put "?" after them, which stops the consuming effect (ab*?b will not consume the last "b"). The current AutoIt implementation of "?" does not seem to have the same "magic"...

Edit: accidental smile conversion :D

Edited by Lazycat


'[.]*\\DAT$' means zero or more "real" dots. There are 0 dots, so that works. \\ matches the real backslash and DAT$ matches the last 3 characters of the string. [.]*\. will never match the last dot because it was already read by the repeating set. Both [.] and \. will match a real dot.

<{POST_SNAPBACK}>

:D .. there we go! Of course, "*" includes the possibility of no occurrence!

With regexp, I sometimes feel like a child playing with daddy's power-tools.

:idiot:


#48 ·  Posted (edited)

[..]

My brain is overheating and I feel that I (and surely many others) just need a few good examples

[..]

<{POST_SNAPBACK}>

I agree .. perhaps Jon will consider opening another Forum called Regexp Support for examples & questions? :idiot:

Edit:

Lazycat, I experimented with your regexp "^C:\\Documents and Settings[\A\\]*\.DAT$", and it looks like we can't use the special tokens, like \A, in a set. Or at least, it doesn't recognise \A in a set.

Nutster, I notice that regexps in other apps (like TextPad) have special tokens that are specifically for use inside sets. Maybe this is an idea? ..

[Regexp Token]  Description
[:alpha:]       Any letter.
[:lower:]       Any lower case letter.
[:upper:]       Any upper case letter.
[:alnum:]       Any digit or letter.
[:digit:]       Any digit.
[:xdigit:]      Any hexadecimal digit (0-9, a-f or A-F).
[:blank:]       Space or tab.
[:space:]       Space, tab, vertical tab or form feed.
[:cntrl:]       Control characters (Delete and ASCII codes less than space).
[:print:]       Printable characters, including space.
[:graph:]       Printable characters, excluding space.
[:punct:]       Anything that is not a control or alphanumeric character.
[:word:]        Letters, hyphens and apostrophes.

Edited by trids


#49 ·  Posted (edited)

With regexp, I sometimes feel like a child playing with daddy's power-tools.

:D

<{POST_SNAPBACK}>

Actually I'm starting to feel the same :lol:

I haven't worked with regexps that much before, mainly with the variants built into programs (EditPlus, Total Commander, Proxomitron), but I never had so much trouble... most of the solutions I know don't work with AutoIt. :idiot:

Edited by Lazycat


#50 ·  Posted (edited)

$line = "C:\Documents and Settings\User\NTUSER.DAT"
RegExp($line, '^C:\\Documents and Settings[\A\\]*\.DAT$'); don't match

egrep matches this (when I substitute \A with [:alpha:], which is the common way to address character classes), so this should be a problem with the AutoIt implementation.

With regexp, I sometimes feel like a child playing with daddy's power-tools.

I can suggest O'Reilly's sed & awk for that. Sed and awk both rely heavily on regex and there's a good introduction to regex in this book. And when you've started with sed you'll miss it on every Windows system.

There's also a book called Mastering Regular Expressions from O'Reilly. I don't know this book but it seems to be available online. Here's a quote from its intended audience:

This book will interest anyone who has an opportunity to use regular expressions. If you don't yet understand the power that regular expressions can provide, you should benefit greatly as a whole new world is opened up to you.

Edited by sugi

#51 · Posted (edited)

I agree .. perhaps Jon will consider opening another Forum called Regexp Support for examples & questions?

Edit:

Lazycat, I experimented with your regexp "^C:\\Documents and Settings[\A\\]*\.DAT$", and it looks like we can't use the special tokens, like \A, in a set. Or at least, it doesn't recognise \A in a set.

Nutster, I notice that regexps in other apps (like TextPad) have special tokens that are specifically for use inside sets. Maybe this is an idea? ..

<{POST_SNAPBACK}>

RegExp support forum is just scary! The list of [:token:] sequences is already on the TO DO list. But you can replace most of them with escaped sequences. e.g. [:digit:] is \d.

Edited by Nutster

#52 · Posted

Thanks, I did not realize that [.] means a real dot... My brain is overheating and I feel that I (and surely many others) just need a few good examples... But I can't sleep until I know why this pattern does not match...

$line = "C:\Documents and Settings\User\NTUSER.DAT"
RegExp($line, '^C:\\Documents and Settings[\A\\]*\.DAT$'); does not match

RegExp($line, '^C:\\Documents and Settings[\A\\]*'); works up to here

1. Matching ^ - start of line

2. Exactly matching C:\Documents and Settings

3. Next comes a run of any number of alphanumeric symbols or backslashes, or none at all

4. Next should be a real dot and DAT at the end of the line - but this does not match.

Please show me where I went wrong...

BTW I read some info about PHP regexps (which are mostly the same as the current implementation), and found that although by default * and + consume the following char(s), it's possible to put "?" after them, which stops the consuming effect (ab*?b will not consume the last "b"). The current AutoIt implementation of "?" does not seem to have the same "magic"...

Edit: accidental smile conversion  :D

<{POST_SNAPBACK}>

Point 3: escaped sequences are not supported in sets. Try [A-Z\a-z]* instead. The [\A\\]* tries to match "A" or "\" (ignoring extra definitions of \) and finding no occurrences, succeeds on zero or more matches.
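
A rough way to see the effect of spelling the set out with explicit ranges is the sketch below. It runs on Python's re module as a stand-in, not on AutoIt. The set [A-Za-z\\] is an assumption made for this illustration (letters plus an escaped backslash) and is not necessarily the exact set suggested above, and [A\\] models the description of the original set as matching only "A" or "\".

```python
import re

line = r"C:\Documents and Settings\User\NTUSER.DAT"

# Explicit letter ranges plus an escaped backslash inside the set (assumed spelling).
explicit = r"^C:\\Documents and Settings[A-Za-z\\]*\.DAT$"
# Model of the original set as described: only "A" or "\" are accepted.
reduced = r"^C:\\Documents and Settings[A\\]*\.DAT$"

print(re.search(explicit, line) is not None)  # True
print(re.search(reduced, line) is not None)   # False
```

With the reduced set, the run stops at the "U" of "User", so the literal dot that must follow can never be reached.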

*?, +?, ?? are already on my to do list. When I find the time, and all the bugs are out, I plan on tackling the items on the to do list. ab*?b will then match the smallest group that lets the next character work. This will still lead to problems, because the b*? would match 0 b's and the next b would match. I do not advise b*b patterns, as they will almost always fail in the current implementation.
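
For comparison, this is how greedy and lazy repetition behave in an engine that already supports both. The sketch uses Python's re module, which backtracks, so it shows what other implementations do rather than what the current AutoIt build does.

```python
import re

# Greedy: b* first takes every b, then gives one back so the final b can match.
print(re.match(r"ab*b", "abbb").group())   # abbb

# Lazy: b*? takes as few b's as possible (zero here), so the match stops early.
print(re.match(r"ab*?b", "abbb").group())  # ab

# The helpfile's ba*a example succeeds in a backtracking engine for the same reason.
print(re.match(r"ba*a", "baaa").group())   # baaa
```

The lazy result "ab" is the "b*? would match 0 b's" case described above.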



*?, +?, ?? are already on my to do list.  When I find the time, and all the bugs are out,  I plan on tackling the items on the to do list.  ab*?b will then match the smallest group that lets the next character work.  This will still lead to problems, because the b*? would match 0 b's and the next b would match.  I do not advise b*b patterns, as they will almost always fail in the current implementation.

Basically you have to try every possibility to see if everything matches and that's why all regex implementations are usually slow compared to normal string functions.

Maybe it's just easier to use the regex functions from the GNU libc from linux or something like that. They provide a full implementation with pretty much all known bugs squashed.


#54 ·  Posted (edited)

Maybe it's just easier to use the regex functions from the GNU libc from linux or something like that. They provide a full implementation with pretty much all known bugs squashed.

<{POST_SNAPBACK}>

You should've said that before he started 3 months of hard work.

Edited by SlimShady


You should've said that before he started 3 months of hard work.

<{POST_SNAPBACK}>

:idiot: About a month, actually. A good chunk of the time was spent doing "real" work programming, so I could (and still can) put very little time into the programming. I have already received some requests for bug fixes. I hope to be able to tackle them this weekend, maybe some of the simple enhancements as well.


#56 ·  Posted (edited)

I've been playing around with the new (and exciting :idiot: ) RegExp() .. thanks again, Nutster!

Just wondering: is there any chance of making RegExp() function follow the same principles employed by the likes of PixelSearch(), MouseGetPos(), DriveGetDrive(), etc - so that the return value would be the array of "hits", instead of the current approach where the array is provided as a string parameter to the function?

So it might work as follows:

Return Value

Success: Returns a zero-based array of matching groups found by the regular expression pattern.

@Error:

0 = Pattern matched successfully.

1 = The regular expression given is not valid.

2 = The handle given is not valid.

Not only in the interests of consistency and user-friendliness, but it simplifies the issue of whether or not to declare the array up-front too (which I guess is also in the interests of consistency and user-friendliness :D )

hmm .. whaddaya think?

Edits: minor

Edited by trids


Just wondering: is there any chance of making RegExp() function follow the same principles employed by the likes of PixelSearch(), MouseGetPos(), DriveGetDrive(), etc - so that the return value would be the array of "hits", instead of the current approach where the array is provided as a string parameter to the function?

So it might work as follows:

Not only in the interests of consistency and user-friendliness, but it simplifies the issue of whether or not to declare the array up-front too (which I guess is also in the interests of consistency and user-friendliness  :idiot: )

hmm .. whaddaya think?

Edits: minor

<{POST_SNAPBACK}>

Hmm, so how would this be called?

$Results = RegExp($sLine, $sPattern)
If @Error = 0 Then
  ; Found the pattern
ElseIf @Error = 1 Then
  ; Did not find the pattern
ElseIf @Error = 2 Then
  ; The pattern was not valid
Endif

I have removed the handle approach (RegExpSet, RegExpClose).



#58 ·  Posted (edited)

Thanks, I did not realize that [.] means a real dot... My brain is overheating and I feel that I (and surely many others) just need a few good examples... But I can't sleep until I know why this pattern does not match... :idiot:

<{POST_SNAPBACK}>

I just posted my testing script http://www.autoitscript.com/fileman/users/Nutster/test%20regexp%202.au3 to give you some examples. Try all sorts of patterns yourself. This one works with the version I uploaded to Jon today. I will be posting a better one in a few days to work with the updated version that has some of the TO DO list items implemented.

Edited by Nutster


Hmm, so how would this be called?

$Results = RegExp($sLine, $sPattern)
If @Error = 0 Then
; Found the pattern
ElseIf @Error = 1 Then
; Did not find the pattern
ElseIf @Error = 2 Then
; The pattern was not valid
Endif

<{POST_SNAPBACK}>

.. yes ..

$asResults = RegExp($sLine, $sPattern)
If @Error = 0 Then
 ; Found the pattern
 ; .. and the hits are in a zero-based array called $asResults
ElseIf @Error = 1 Then
 ; Did not find the pattern
ElseIf @Error = 2 Then
 ; The pattern was not valid
Endif

.. or ..

$asResults = RegExp($sLine, $sPattern)
If @Error Then
 ; Something is wrong
Else
 ; Found the pattern
 ; .. and the hits are in a zero-based array called $asResults
Endif


.. yes ..

$asResults = RegExp($sLine, $sPattern)
If @Error = 0 Then
; Found the pattern
; .. and the hits are in a zero-based array called $asResults
ElseIf @Error = 1 Then
; Did not find the pattern
ElseIf @Error = 2 Then
; The pattern was not valid
Endif

<{POST_SNAPBACK}>

Or

$asResults = RegExp($sLine, $sPattern)
If @Error = 0 Then
; Found the pattern
; .. and the hits are in a zero-based array called $asResults
ElseIf @Error = 1 Then
; Did not find the pattern
; $asResults = ""
ElseIf @Error = 2 Then
; The pattern was not valid
; $asResults = ""
Endif

This can solve the problems with storing back-references when I implement them as well as RegExpReplace. Ok. I will go this way. @Error will indicate whether the search worked or not (or buggered up completely because of a screwed pattern). I think the return in that case should indicate where the problem occurred in the pattern.
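
The calling convention agreed on above (results as the return value, a separate error code for found / not found / bad pattern) can be sketched as follows. This is a hypothetical Python stand-in, not AutoIt code: the function name regexp(), the tuple return in place of @Error, and the choice to put the whole match in element 0 are all assumptions made for the illustration.

```python
import re


def regexp(line, pattern):
    """Hypothetical stand-in for the proposed RegExp() calling convention.

    Returns (results, error) where error mirrors the codes discussed above:
    0 = pattern found, 1 = pattern not found, 2 = pattern not valid.
    """
    try:
        compiled = re.compile(pattern)
    except re.error:
        return "", 2          # invalid pattern: empty result, error 2
    match = compiled.search(line)
    if match is None:
        return "", 1          # valid pattern, but no match: empty result, error 1
    # Assumed layout: element 0 is the whole match, the rest are the groups.
    return [match.group(0), *match.groups()], 0


print(regexp("NTUSER.DAT", r"(\w+)\.(\w+)$"))  # (['NTUSER.DAT', 'NTUSER', 'DAT'], 0)
print(regexp("NTUSERDAT", r"\.DAT$"))          # ('', 1)
print(regexp("NTUSER.DAT", r"("))              # ('', 2)
```

Returning the hits directly means the caller never has to declare the array up front, which was the point of the request.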


