RegExp - has anyone seen this library before?


sohfeyr

  • Administrators

As buggy as David's implementation was, at least simple patterns I expect to work... do. Is PCRE really just retarded or am I missing something completely obvious?

I wish I knew enough expressions to comment. I'm pretty much limited to using the API based on the documentation and then relying on you guys and the test exe to see if it's working OK. But unless I compiled it incorrectly then it should be working as intended.
Link to comment
Share on other sites

  • Administrators

This one as well.

http://www.lumadis.be/regex/test_regex.php?lang=en

Hmm, do we not need the old array[0] value?

re> /(Foo)*/g

data> FooFooFoo

0: FooFooFoo

1: Foo

0:

So the 0 value (which we are now throwing away) does indeed match the entire thing, and then the first captured sub pattern is the single Foo.

?
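A small sketch of why slot 0 holds the whole run while slot 1 holds a single Foo: a repeated capturing group keeps only its last iteration. Python's `re` module is used here as a stand-in for PCRE (an assumption; the two agree on this pattern), and the variable names are illustrative only.

```python
import re

# Global match of (Foo)* against the test string from the transcript above.
# group(0) is the whole match; group(1) is the LAST iteration of the group.
matches = [(m.group(0), m.group(1)) for m in re.finditer(r"(Foo)*", "FooFooFoo")]

for whole, group1 in matches:
    print(repr(whole), repr(group1))
# 'FooFooFoo' 'Foo'
# '' None          <- the trailing empty match, shown as "0:" in the transcript
```

So discarding slot 0 loses the only place where the full "FooFooFoo" is visible.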

Link to comment
Share on other sites

I found what I think might be an issue. Here is the code:

..

The output you got (1 match) is what i expect ... cos what matched was the entire pattern ( ie: including the "Uniques*" ).

Link to comment
Share on other sites

A general remark about REs: it is easy to produce meaningless patterns that actually crash an RE engine, a sort of "while 1 ... wend" thing in RE syntax. What happens depends on the actual implementation.

Therefore, such an effect is not necessarily a bug in PCRE or in Jon's implementation of PCRE. It all depends on the pattern.

As to (.*?): this is a pattern that matches 0 or more of whatever (that's .*) but is not greedy (the ?). So against any string (say "test") it first matches at position 0 and returns an empty string. Then it matches the "t", then the empty string between "t" and "e", then the "e", then the empty string between "e" and "s", and so on. It returns 9 matches.

(.+?) returns exactly the four matches "t", "e", "s", "t", as one would expect.
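The two counts can be checked with a short sketch. It uses Python's `re` module rather than PCRE or Perl (an assumption), and needs Python 3.7 or later, where a non-empty match may start right after an empty one, which is the Perl-style behaviour described above.

```python
import re

subject = "test"

# Lazy star: alternates empty matches with single characters.
lazy_star = [m.group(1) for m in re.finditer(r"(.*?)", subject)]
# Lazy plus: must consume at least one character, so no empty matches.
lazy_plus = [m.group(1) for m in re.finditer(r"(.+?)", subject)]

print(len(lazy_star), lazy_star)  # 9 ['', 't', '', 'e', '', 's', '', 't', '']
print(len(lazy_plus), lazy_plus)  # 4 ['t', 'e', 's', 't']
```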

I agree that REs can be hell but then again they are a completely logical hell :lmao:

EDIT:

I wish I knew enough expressions to comment. I'm pretty much limited to using the API based on the documentation and then relying on you guys and the test exe to see if it's working OK. But unless I compiled it incorrectly then it should be working as intended.

PCRE does work as intended. It is used in dozens of high-profile apps.

The fact that patterns don't do what people expect probably reflects more on their understanding of REs (or lack thereof) than actual errors in PCRE. (Note the "probably": this is not to say that PCRE has no bugs; it sure has. But if used correctly it tends to work correctly.)

One of the good things about PCRE (and Perl REs in general) is that they are well-documented, so it shouldn't be too difficult to get the hang of it. Much of what has been written in this thread is a classic case of RTFM.

As to the patterns (and results) Nutster's code accepted (and delivered), I would take these with a pinch of salt. They were definitely not Perl compatible.

Edited by thomasl
Link to comment
Share on other sites

I have now downloaded the newest build and played a bit with it. My batch of patterns still works (though that's mostly ...Replace() stuff with backreferences etc.).

What doesn't work at all is StringRegExp(), flag=3, ie global match.

; Global match (flag 3): print every returned string between "!" markers
$s = "test"
$b = StringRegExp($s, "(.*?)", 3)
For $i = 0 To UBound($b) - 1
    ConsoleWrite("!" & $b[$i] & "!" & @CRLF)
Next

This should return the nine strings as detailed in my other post, above. Simpler patterns like a lone . also don't work.

Edit: code

Edited by thomasl
Link to comment
Share on other sites

  • Administrators

It's working here (9 strings). I'm about to upload a new build in 10 mins so try again with that.
Link to comment
Share on other sites

  • Administrators

Ok, new build: http://www.autoitscript.com/autoit3/files/...utoIt3-pcre.exe

I added options 2 and 4.

Option 2, same as option 1 but it returns the full match as well in array[0] ( like preg_match() )

Option 4, same as option 3 but returns an array of arrays :lmao: Each sub array is like the single return value from option 2. This is like the php / preg_match_all() return value.

Examples:

;Option 2, single return, php/preg_match() style
$array = StringRegExp('<test>a</test> <test>b</test> <test>c</Test>', '<(?i)test>(.*?)</(?i)test>', 2)
for $i = 0 to UBound($array) - 1
    msgbox(0, "Option 2 - " & $i, $array[$i])
Next


;Option 3, global return, old AutoIt style
$array = StringRegExp('test', '(.*?)', 3)

for $i = 0 to UBound($array) - 1
    msgbox(0, "Option 3 - " & $i, $array[$i])
Next


;Option 4, global return, php/preg_match_all() style
$array = StringRegExp('<test>a</test> <test>b</test> <test>c</Test>', '<(?i)test>(.*?)</(?i)test>', 4)

for $i = 0 to UBound($array) - 1

    $match = $array[$i]
    for $j = 0 to UBound($match) - 1
        msgbox(0, "Option 4 - " & $i & ',' & $j, $match[$j])
    Next
Next
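
For readers more at home in another language, here is a rough Python sketch of the three return shapes as they are described in this post. It is not how AutoIt builds them: the names `option2`, `option3` and `option4` are made up for illustration, and `re.IGNORECASE` replaces the inline `(?i)` because recent Python versions reject that flag in mid-pattern.

```python
import re

subject = "<test>a</test> <test>b</test> <test>c</Test>"
pattern = re.compile(r"<test>(.*?)</test>", re.IGNORECASE)

# Option 2 shape: first match only, full match in slot 0, then the captures.
first = pattern.search(subject)
option2 = [first.group(0), *first.groups()]

# Option 3 shape: every match, captures only, in one flat list.
option3 = [g for m in pattern.finditer(subject) for g in m.groups()]

# Option 4 shape: every match, each as its own option-2 style list.
option4 = [[m.group(0), *m.groups()] for m in pattern.finditer(subject)]

print(option2)  # ['<test>a</test>', 'a']
print(option3)  # ['a', 'b', 'c']
print(option4)
```", "a"]
assert option3 == ["a", "b", "c"]
assert option4 == [["<test>a</test>", "a"], ["<test>b</test>", "b"], ["<test>c</Test>", "c"]]
</test>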
Link to comment
Share on other sites

Thx. (The previous build did work after all..., after I got me flaming paths sorted :lmao: )

Now all is well. Well, almost...

StringRegExp("F1oF2oF3o","(F.o)*?",3) should give seven matches. AU3 gives only three, omitting the four empty matches (the other example -- "(.*?)" -- works):

--
AU3 :F1o
AU3 :F2o
AU3 :F3o
Perl:
Perl:F1o
Perl:
Perl:F2o
Perl:
Perl:F3o
Perl:
--
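
The seven-match sequence in the Perl column can be reproduced with Python's `re` module, used here as a stand-in for Perl/PCRE (an assumption; Python 3.7 or later is needed for this empty-match behaviour).

```python
import re

# Lazy repetition of a group: each position first yields an empty match,
# then the engine is forced to advance and takes one "F.o" unit.
whole_matches = [m.group(0) for m in re.finditer(r"(F.o)*?", "F1oF2oF3o")]

print(len(whole_matches), whole_matches)
# 7 ['', 'F1o', '', 'F2o', '', 'F3o', '']
```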

I will continue to throw REs at it.

EDIT:

Mode 4 hangs with StringRegExp("test","(.*?)",4)

Edited by thomasl
Link to comment
Share on other sites

Fixed

:lmao:

Here's another thing to chew over.

$s="test"&@CRLF&"test"
ConsoleWrite($s&@CRLF)
$s=StringRegExpReplace($s,".","_")
ConsoleWrite($s&@CRLF)

This replaces everything with the exception of the LF (ie it also replaces the CR):

test
test
_____
____

Now this whole CR/LF handling is a thorny problem anyway. Perl REs have an option that switches between \n (which Perl assumes to be "\n" under *x and "\r\n" under Win32) being treated like a string terminator (ie not matched by a .) or as just another character.

Your code seems to work under the assumption that \n is a terminator, not a normal character, which is fine for most matches and replaces (though at some point there should be an option to switch this off). But I am not sure about the semantics in terms of coding for AU3: if LF is not replaced, perhaps CR shouldn't either.

EDIT:

Here's more. StringRegExp("test"&@CRLF&"test",".",3) works as expected: nine matches (2*4 for the test's and 1 for the CR).

OTOH, StringRegExp("test"&@CRLF&"test","(.*?)",3) simply stops matching after the LF.
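
The dot behaviour described in this post (CR is matched, LF is not) can be reproduced with Python's `re` module, which also treats only LF as the character a plain dot refuses to match. This is a sketch with Python standing in for PCRE; `re.DOTALL` plays the role of the "switch it off" option mentioned above.

```python
import re

subject = "test\r\ntest"

# Default: "." matches every character except LF, so CR is replaced too.
default_dot = re.sub(r".", "_", subject)
# DOTALL: "." matches LF as well, so nothing survives.
dotall_dot = re.sub(r".", "_", subject, flags=re.DOTALL)
# Global match on a lone dot: 4 + 4 letters plus the CR.
per_char = re.findall(r".", subject)

print(repr(default_dot))  # '_____\n____'
print(repr(dotall_dot))   # '__________'
print(len(per_char))      # 9
```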

Edited by thomasl
Link to comment
Share on other sites

The output you got (1 match) is what i expect ... cos what matched was the entire pattern ( ie: including the "Uniques*" ).

Please explain to me what's going on then because from what I understand about regular expressions, it should start matching on Unique and once that part of the pattern matches, it moves to the first Foo which also matches the pattern. Then because of the repetition operator, it should move to the next and final Foo in the string which still matches because we are repeatedly capturing Foo's.

What I was trying to do was find a unique position in a string which is then followed by one or more lines of data followed by an empty line. I wanted to capture the lines of data individually. An example of the string:

Unique
Data Line 1
Data Line 2

Note that the example is basically like my AutoIt code above.

Also, the "s*" should be "\s*". I don't know why but the forum stripped the escape sequence.

Link to comment
Share on other sites

  • Administrators

Your code seems to work under the assumption that \n is a terminator, not a normal character, which is fine for most matches and replaces (though at some point there should be an option to switch this off). But I am not sure about the semantics in terms of coding for AU3: if LF is not replaced, perhaps CR shouldn't either.

Can't comment on the other stuff - right on the limit of my knowledge now - but I found an option in the pcrelib, set at compile time, that lets you specify the newline as \n or \r (a single char). It doesn't seem to have any option for \r\n. Our library was compiled with \n specified. It may be that when using CRLF sequences you have to strip the CRs with StringStripCR() first to get the expected results. Dunno.
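
A sketch of the strip-first idea, in Python rather than AutoIt: `strip_cr` is a hypothetical helper that mimics what StringStripCR() does (remove every carriage return), and Python's `re` stands in for the PCRE build with LF as its newline.

```python
import re

def strip_cr(text: str) -> str:
    """Remove every carriage return, leaving bare LF line endings."""
    return text.replace("\r", "")

subject = "test\r\ntest"
normalized = strip_cr(subject)

# With the CRs gone, "." touches only the visible characters on each line.
replaced = re.sub(r".", "_", normalized)
print(repr(replaced))  # '____\n____'
```

The cost is that the result no longer has CRLF line endings, so they would have to be restored afterwards if the caller needs them.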
Link to comment
Share on other sites

  • Administrators

Please explain to me what's going on then because from what I understand about regular expressions, it

Is this any closer *makes straw grasping motion* :lmao:

re> /(?U)(Foo)*/g

data> FooFooFoo

Link to comment
Share on other sites

Jon, I think that option sets what character(s) \n means. It can be either LF (probably the default), CR or CRLF. I'm pretty sure I saw a flag in the documentation that sets it to CRLF, too. IMO, leaving \n to mean LF is fine because we can build a CRLF sequence with \r\n. However, it shouldn't affect \s, which is what I used above, because \s matches all whitespace characters and because of the repetition, it'll catch both CR and LF.

Link to comment
Share on other sites

  • Administrators

The code is looking pretty good to me (RE differences problems rather than buggy code problems). So the important thing is who can write the help file page on this - because I certainly can't! :lmao:

Link to comment
Share on other sites

..

What I was trying to do was find a unique position in a string which is then followed by one or more lines of data followed by an empty line. I wanted to capture the lines of data individually. An example of the string:

Unique
Data Line 1
Data Line 2

Note that the example is basically like my AutoIt code above.

..

ok .. then this RE "Unique\s*(?:((Foo)\s*))*"

.. with PCRE calls via Thomasl's wrapper i get two "Foo"s

HTH

:)

Link to comment
Share on other sites

ok .. then this RE "Unique\s*(?:((Foo)\s*))*"

.. with PCRE calls via Thomasl's wrapper i get two "Foo"s

HTH

:)

Alright, that works. Now explain to me why. All you did was add another capture. How does that magically get it working?

Edit: And I can simplify that to this "Unique\s*((Foo)\s*)*" and it still works, further adding to my confusion. If that simplified form works, why does the non-capturing form not work?
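
One possible explanation, sketched in Python's `re` as a stand-in for PCRE: a repeated capturing group keeps only its last iteration, so one group gives one result and two nested groups give two results, both taken from the final repetition. The original pattern is elided in this thread, so the single-group pattern below is a guess. The numbered Foo1/Foo2 data is made up so that the iterations can be told apart.

```python
import re

subject = "Unique\nFoo1\nFoo2\n"

# One capturing group inside the repetition: only the last iteration survives.
single = re.match(r"Unique\s*(?:(Foo\d)\s*)*", subject)
# Two nested groups: two results, but both come from the last iteration.
nested = re.match(r"Unique\s*((Foo\d)\s*)*", subject)

print(single.groups())  # ('Foo2',)
print(nested.groups())  # ('Foo2\n', 'Foo2')

# To get every line, capture the whole block once, then scan inside it.
block = re.match(r"Unique\s*((?:Foo\d\s*)*)", subject)
each = re.findall(r"Foo\d", block.group(1))
print(each)  # ['Foo1', 'Foo2']
```

With identical "Foo" lines the two nested-group results look like two separate matches, which may be why the extra capture appears to fix things.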

Edited by Valik
Link to comment
Share on other sites
