RegExp - has anyone seen this library before?

SmOke_N · October 2, 2006

Ok, I had to download the exe and read up on Perl-style regexps here: http://perldoc.perl.org/perlre.html#Regular-Expressions

But this worked:

$a = StringRegExp('blah ${1} blah', "\$\{{0,1}[0-9]+\}{0,1}", 1, 1)
If IsArray($a) Then MsgBox(0, 'info', $a[0])

Edit:

Oops, thomasl was a bit fast for me... and he used \d (I was just happy to see the above work :lmao: )

Edited October 2, 2006 by SmOke_N

Jon · October 2, 2006

Thanks, the pattern worked great.

But I may have to manually parse the replacement string as it won't let me escape it so that the replace text is the literal text "$1" rather than a reference. Hard to explain.

I think I need to support \1 \2 convention as well?

thomasl · October 2, 2006

But I may have to manually parse the replacement string as it won't let me escape it so that the replace text is the literal text "$1" rather than a reference. Hard to explain.

Well, convention says that $1, $2 ... is replaced by its respective group. If there are no valid groups, replacement is empty. So if you want a literal $1 in the replacement, you'd write something like \$1: the \ escapes the $.

So if you search initially for something like "(\\{0,1}\$\{{0,1}(\d+)\}{0,1})" you'd get either a group starting with \ (->literal ${...}) or with $ (->replacement ${...}).
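For anyone who wants to see that pattern in action, here is a small sketch. It uses Python's re module purely as a stand-in (an assumption: it is not PCRE, although it accepts this particular pattern unchanged), and the sample text and variable names are made up for illustration.

```python
import re

# The pattern from the post: optional backslash, "$", optional "{",
# one or more digits, optional "}".
PATTERN = re.compile(r"(\\{0,1}\$\{{0,1}(\d+)\}{0,1})")

sample = r"keep \$1 literal, expand ${2} and $3"   # made-up test text

classified = []
for token, number in PATTERN.findall(sample):
    # A leading backslash marks an escaped (literal) "$n"; otherwise it is
    # a group reference.
    kind = "literal" if token.startswith("\\") else "reference"
    classified.append((token, kind, int(number)))

for row in classified:
    print(row)
```

Group 1 holds the whole token and group 2 the group number, so a single pass is enough to tell the escaped form from the real references.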

Hm... perhaps better to parse that manually :lmao:

I think I need to support \1 \2 convention as well?

Depends on whom you ask. I am much more used to the $1 syntax (and found the StringRegExpReplace() syntax a bit strange), but there's a sizeable minority :ph34r: out there who uses \1.

Given that the current StringRegExpReplace() uses \1, why not stick with it?

Jon · October 2, 2006

New version: http://www.autoitscript.com/autoit3/files/...utoIt3-pcre.exe

I've done StringRegExpReplace so test it out. I've also removed the full match return value in the array as requested.

The regexp replace text can use $0 or ${1}; \1 and \2 also work. A literal \ must be escaped as \\, and to get a real $ you must use \$.
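As a rough illustration of those replacement rules, here is a sketch of a hand-rolled expander. It is written in Python with the standard re module, not AutoIt, and expand() is a hypothetical helper of my own, so treat it as one possible reading of the rules rather than the actual implementation. It assumes single-digit group numbers for the $n and \n forms and, following the convention mentioned earlier in the thread, expands a reference to a missing group as empty.

```python
import re

# Hypothetical tokenizer for the replacement text: an escaped "\" or "$",
# then \n, ${n} and $n style group references.
_TOKEN = re.compile(r"\\(\\|\$)|\\(\d)|\$\{(\d+)\}|\$(\d)")

def expand(template, match):
    """Expand group references in template using an existing match object."""
    def repl(tok):
        if tok.group(1) is not None:       # "\\" or "\$" gives a literal char
            return tok.group(1)
        number = int(tok.group(2) or tok.group(3) or tok.group(4))
        if number > match.re.groups:       # no such group: expands to nothing
            return ""
        return match.group(number) or ""   # unmatched group: also empty
    return _TOKEN.sub(repl, template)

m = re.search(r"(\w+)@(\w+)", "user@example")
print(expand(r"${2} has \1, costs \$1, path C:\\tmp", m))
```

The escapes are listed first in the alternation so that \$1 is consumed as a literal $ followed by a plain 1 before the $1 rule ever gets a chance to fire.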

Give it a test and let me know how it works out.

thomasl · October 2, 2006

New version: [...] Give it a test and let me know how it works out.

It all works out very well. The bugs I reported against the "old" RE version (and a few I didn't report) are gone.

I have also done some very preliminary timing comparisons with some REALLY long strings (up to 2048 KB), as I did for the old version, and the PCRE library plus your replace code looks pretty good in this respect as well. Sometimes AU3 is a bit faster than Perl, sometimes a bit slower... but it's now very much in the same league, not a factor of 20, 40, even 100 slower, as it used to be.

Very nice. Now I can scrap my Perl RE library. Well, was a pre-release anyway :lmao:

EDIT: PCRE uses slightly different definitions for its character classes and assertions (\b \d \w etc.). Anyone who is translating "old style" REs to new should check whether the classes are the same. I have run into some small differences that can wreck an otherwise working pattern. For instance, \w in PCRE includes 0..9.
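A quick way to see the \w point is the sketch below. It uses Python's re module as a stand-in, with the re.ASCII flag so that \w means [A-Za-z0-9_], which is how I understand PCRE's default; that equivalence is an assumption on my part and the sample text is made up.

```python
import re

text = "abc_123 x-9"   # made-up sample

# With re.ASCII, \w matches letters, digits and underscore.
words = re.findall(r"\w+", text, flags=re.ASCII)

# An explicit letters-only class, for patterns that must not swallow digits.
letters_only = re.findall(r"[A-Za-z]+", text)

print(words)
print(letters_only)
```

If an old pattern relied on \w meaning letters only, spelling out the class explicitly avoids the surprise.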

Edited October 2, 2006 by thomasl

Jon · October 2, 2006

I've done flag 3 (global) in StringRegExp.

http://www.autoitscript.com/autoit3/files/...utoIt3-pcre.exe

Edit: That should be all the existing functionality. If it tests OK I'll switch over to this code for the next beta and then delete all the bug reports :lmao:

jftuga · October 2, 2006

I just want to say that having PCRE support built into AU3 will be fantastic. I bet a lot of users will enjoy having this functionality.

-John

steve8tch · October 2, 2006

Do you want to try this under the new version vs. the old one?

$str = "abcd"
$ptn = "(.*)"
$msg = ""
$aResult = StringRegExp($str, $ptn, 3)
For $i = 0 To UBound($aResult) - 1
    $msg &= $aResult[$i] & @CRLF
Next
MsgBox(0, "Result", $msg)

On my PC, the new version never completes. It just eats up memory until the application fails due to lack of memory. (XP SP2)

I have left out the check for @extended, because at the moment it always goes to "0".

Jon · October 2, 2006

Fixed :lmao:

http://www.autoitscript.com/autoit3/files/...utoIt3-pcre.exe

steve8tch · October 2, 2006

That was quick work, but...

$str = "abcd"
$ptn = "(.*?)"
$msg = ""
$aResult = StringRegExp($str, $ptn, 3)
For $i = 0 To UBound($aResult) - 1
    $msg &= $aResult[$i] & @CRLF
Next
MsgBox(0, "Result", $msg)

This pattern has a similar problem :lmao:

Sorry...

Jon · October 2, 2006

What should this return? The pcre library is giving a really odd result back, it seems to be saying that there was a match of zero length and then gets stuck in a loop because it never advances.
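To make the failure mode concrete, here is a rough sketch of a global-match loop with the simplest possible guard: bump the offset by one whenever a match is zero-length. It is written in Python with the standard re module as a stand-in for PCRE, and global_matches() is a made-up helper, not anything from AutoIt or the PCRE API.

```python
import re

def global_matches(pattern, subject):
    """Collect group 1 of every match, guarding against zero-length matches."""
    regex = re.compile(pattern)
    results, pos = [], 0
    while pos <= len(subject):
        m = regex.search(subject, pos)
        if m is None:
            break
        results.append(m.group(1))
        if m.end() == m.start():
            # Zero-length match: the offset would not move forward, so the
            # same empty match would be found again forever. Step past it.
            pos = m.end() + 1
        else:
            pos = m.end()
    return results

print(global_matches(r"(.*?)", "abcd"))
print(global_matches(r"(.*)", "abcd"))
```

This guard is cruder than what Perl does with /g, so the list it produces need not be the same as Perl's; it only shows why the offset has to move after an empty match.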

Valik · October 2, 2006

I think that it should return the entire string but my expectation could be wrong.

Jon · October 2, 2006

If I run the expression in a test exe that comes with the pcre library in global mode it gives

blank string

a

blank string

b

blank string

c

blank string

d

blank string

I looked at the source to the test exe and there is this note

/* If we have matched an empty string, first check to see if we are at
the end of the subject. If so, the /g loop is over. Otherwise, mimic
what Perl's /g options does. This turns out to be rather cunning. First
we set PCRE_NOTEMPTY and PCRE_ANCHORED and try the match again at the
same point. If this fails (picked up above) we advance to the next
character. */

I think this might be related as the match is indeed coming back as totally empty.

Edit: Updated the test exe results, there is actually a blank in between each match

Jon · October 2, 2006

More:

PCRE_NOTEMPTY
An empty string is not considered to be a valid match if this option is set. If there are alternatives in the pattern, they are tried. If all the alternatives match the empty string, the entire match fails. For example, if the pattern
a?b?
is applied to a string not beginning with "a" or "b", it matches the empty string at the start of the subject. With PCRE_NOTEMPTY set, this match is not valid, so PCRE searches further into the string for occurrences of "a" or "b".
Perl has no direct equivalent of PCRE_NOTEMPTY, but it does make a special case of a pattern match of the empty string within its split() function, and when using the /g modifier. It is possible to emulate Perl's behaviour after matching a null string by first trying the match again at the same offset with PCRE_NOTEMPTY set, and then if that fails by advancing the starting offset (see below) and trying an ordinary match again.

Valik · October 2, 2006

For what it's worth, this is the equivalent in Lua:

from, to, data = string.find("abc", "(.+)")
print(data)

And it produces "abc".

Jon · October 2, 2006

Updated: http://www.autoitscript.com/autoit3/files/...utoIt3-pcre.exe

I've made it work like the pcre test exe: when a global operation is done, blank strings are now matched too, so it gives the odd result in the post above.

Also odd things like doing a _global_ match on (.*) for "abcd" gives a match of:

abcd

blank string

This is also the same result as I'm getting from the pcre test exe which I assume is the same as perl.
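As a cross-check in a different engine, the sketch below runs both patterns through Python's re module. This is an assumption-laden comparison: Python is not PCRE or Perl, and it only treats empty matches this way from Python 3.7 onwards.

```python
import re

# Global (find-all) search, returning group 1 of each match.
lazy = re.findall(r"(.*?)", "abcd")
greedy = re.findall(r"(.*)", "abcd")

print(lazy)
print(greedy)
```

On 3.7 or later the lazy pattern alternates blank strings with single characters, and the greedy one gives the whole string followed by one blank, which lines up with the test exe results quoted in this thread.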

I believe that if I just turn on the option to ignore blank string matches then the results will be more like the predictions, but I'm not sure if that messes up some other elements of compatibility with Perl. (Nutster's implementation of the (.*?) pattern actually returned nothing at all, which I guess meant it came across the blank string at the start and barfed.)

Jon · October 2, 2006

For reference the pcretest file is at http://www.autoitscript.com/autoit3/files/...it/pcretest.exe

re> /(.*?)/g
data> abcd

Valik · October 3, 2006

I found what I think might be an issue. Here is the code:

Main()

Func Main()
    Local $s = "Unique" & @CRLF & "Foo" & @CRLF & "Foo"
    Local $p = "Unique\s*(?:(Foo)\s*)*"
    Local $a = StringRegExp($s, $p, 3)
    ConsoleWrite("Matches: " & UBound($a) & @CRLF)
    For $i = 0 To UBound($a) - 1
        ConsoleWrite($a[$i] & @CRLF)
    Next
EndFunc    ; Main()

Here is the output:

Matches: 1
Foo

That's not what I expected. I expected:

Matches: 2
Foo
Foo

The old StringRegExp() returns what I expected, the new doesn't. The example is supposed to look for a string starting with the text "Unique" optionally followed by whitespace (CRLF). If it finds that, then it's supposed to look for the string "Foo" optionally followed by whitespace (CRLF in the example). If it finds that, then it captures the text "Foo" (The non-capturing group is used to be able to test for the trailing whitespace but not capture it). With the sample string, it should find both instances of "Foo" since it's supposed to keep repeating the "(?:(Foo)\s*)*" part of the pattern.

Edit: I see in the Perl documentation something about //s and //m and how //s means treat things as a single text block and //m means it's multiple lines and that //m is the default. I don't know if there is a way for me to change to the //s mode but that's what I need to be in for that pattern to match correctly.

Edit2: I found the options (?s) and (?m) in the PCRE documentation, which allow me to set those two flags I mentioned in my last edit. I still can't seem to get the pattern working how I want, though.
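One possible workaround, sketched in Python's re module as a stand-in for PCRE (an assumption; the variable names are made up), is to do it in two passes: capture the whole repeated run once, then search globally inside it. The sketch also suggests the (?s) and (?m) flags are not the issue here, since \s matches CR and LF regardless of them.

```python
import re

s = "Unique\r\nFoo\r\nFoo"

# One pass: the group (Foo) sits inside a repeated group, so each iteration
# overwrites the previous capture and only the final "Foo" is reported.
one_pass = re.search(r"Unique\s*(?:(Foo)\s*)*", s)

# Two passes: capture the whole repeated run, then match globally inside it.
run = re.search(r"Unique\s*((?:Foo\s*)*)", s)
foos = re.findall(r"Foo", run.group(1))

print(one_pass.groups())
print(foos)
```

The number of capture groups is fixed by the pattern text, not by how many times a group repeats, which is why the one-pass version can never hand back two entries for a single (Foo).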

Edited October 3, 2006 by Valik

Valik · October 3, 2006

I don't understand these Perl expressions. I don't understand why the following pattern doesn't work like I expect:

Pattern: "(Foo)*"
String: "FooFoo"

That only matches one "Foo" and I expect 2. I even tried the test application and it didn't work like I thought, either:

re> /(Foo)*/
data> FooFoo
 0: FooFoo
 1: Foo
data>
  re> /(Foo)*/g
data> FooFoo
 0: FooFoo
 1: Foo
 0:

As buggy as David's implementation was, at least simple patterns I expect to work... do. Is PCRE really just retarded or am I missing something completely obvious?

sohfeyr · October 3, 2006

I don't understand these Perl expressions. I don't understand why the following pattern doesn't work like I expect:
...
As buggy as David's implementation was, at least simple patterns I expect to work... do. Is PCRE really just retarded or am I missing something completely obvious?

If you want it to return two captures of "Foo", try just /(Foo)/ , or even /Foo/ or Foo if the syntax will support it.
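Here is a small sketch of the difference, using Python's re module as a stand-in (an assumption: it is not PCRE, though for this pattern it reports the same thing the pcretest session quoted in this thread shows).

```python
import re

subject = "FooFoo"

# (Foo)* is ONE match covering "FooFoo"; group 1 keeps only the last repetition.
m = re.match(r"(Foo)*", subject)
whole, last = m.group(0), m.group(1)

# Global search with the starred group: one real match, then an empty match
# at the end of the string where the group did not take part.
starred = re.findall(r"(Foo)*", subject)

# Global search without the star: one separate match per "Foo".
plain = re.findall(r"(Foo)", subject)

print(whole, last)
print(starred)
print(plain)
```

So the star makes the engine consume both copies in a single match, while dropping it and matching globally yields one result per occurrence.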

You know what I'd really, really like to see support for in this implementation? Named groups. Those are SO much easier to remember and manage than $1 or \1 style backreferences. I haven't tried your code yet (you wouldn't believe how busy I am these days), but I've noticed all grouping that's been posted here uses numbered backreferences instead of named ones.

Another nice resource: Online .Net RegEx Tester

I know you aren't really trying to approximate .Net, but it's a handy way to do a quick, free regexp logic test. Someone else may have one that's specific to Perl.
