
RegExp - has anyone seen this library before?


sohfeyr

  • Moderators

OK, I had to download the exe and look up PCRE's regular expression syntax here: http://perldoc.perl.org/perlre.html#Regular-Expressions

But this worked:

$a = StringRegExp('blah ${1} blah', "\$\{{0,1}[0-9]+\}{0,1}", 1, 1)
If IsArray($a) Then MsgBox(0, 'info', $a[0])

Edit:

Oops, thomasl was a bit fast for me... and he used \d (I was just happy to see the above work :lmao: )

Edited by SmOke_N

Common sense plays a role in the basics of understanding AutoIt... If you're lacking in that, do us all a favor, and step away from the computer.


  • Administrators

Thanks, the pattern worked great.

But I may have to manually parse the replacement string as it won't let me escape it so that the replace text is the literal text "$1" rather than a reference. Hard to explain.

I think I need to support \1 \2 convention as well?


But I may have to manually parse the replacement string as it won't let me escape it so that the replace text is the literal text "$1" rather than a reference. Hard to explain.

Well, convention says that $1, $2 ... is replaced by its respective group. If there are no valid groups, replacement is empty. So if you want a literal $1 in the replacement, you'd write something like \$1: the \ escapes the $.

So if you search initially for something like "(\\{0,1}\$\{{0,1}(\d+)\}{0,1})" you'd get either a group starting with \ (->literal ${...}) or with $ (->replacement ${...}).

Hm... perhaps better to parse that manually :lmao:
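A minimal sketch of that idea, written in Python only because its `re` module is easy to run (it is a stand-in for PCRE here, and `classify_tokens` is a hypothetical helper, not anything in AutoIt). It runs the pattern from the post over a replacement string and reports whether each token starts with a backslash (literal) or with `$` (group reference):

```python
import re

# The pattern from the post, unchanged: optional backslash, "$",
# optional "{", digits, optional "}".
TOKEN = re.compile(r"(\\{0,1}\$\{{0,1}(\d+)\}{0,1})")

def classify_tokens(template):
    """Return (kind, group_number) for each $N / ${N} token found."""
    result = []
    for m in TOKEN.finditer(template):
        kind = "literal" if m.group(1).startswith("\\") else "reference"
        result.append((kind, int(m.group(2))))
    return result

print(classify_tokens(r"price \$1, name ${2}, id $3"))
```

Note that this pattern cannot tell an escaped backslash followed by a reference (`\\$1`) from an escaped dollar sign, and it accepts an unbalanced `${1`, which is one reason parsing by hand may be the safer route.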

I think I need to support \1 \2 convention as well?

Depends on whom you ask. I am much more used to the $1 syntax (and found the StringRegExpReplace() syntax a bit strange), but there's a sizeable minority :ph34r: out there who uses \1.

Given that the current StringRegExpReplace() uses \1, why not stick with it?


  • Administrators

New version: http://www.autoitscript.com/autoit3/files/...utoIt3-pcre.exe

I've done StringRegExpReplace so test it out. I've also removed the full match return value in the array as requested.

The regexp replace text can use $0 or ${1}. \1 \2 also work. \ must be escaped like \\. To get a real $ you must use \$

Give it a test and let me know how it works out.
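For readers who want to see those replacement rules spelled out, here is a rough Python sketch. It is not the AutoIt code: `expand_replacement` is a hypothetical helper, Python's `re` stands in for PCRE, and treating an unknown group number as an empty string is an assumption taken from the convention described earlier in the thread.

```python
import re

# One alternative per rule: \\ -> \, \$ -> $, ${N}, $N, \N.
# The escapes come first so they win over the reference forms.
TOKEN = re.compile(r"\\(\\)|\\(\$)|\$\{(\d+)\}|\$(\d+)|\\(\d+)")

def expand_replacement(template, match):
    """Expand group references in template using a completed match."""
    def repl(tok):
        literal = tok.group(1) or tok.group(2)
        if literal:
            return literal                      # escaped backslash or dollar
        index = int(tok.group(3) or tok.group(4) or tok.group(5))
        if index > match.re.groups:
            return ""                           # assumed: unknown group is empty
        return match.group(index) or ""
    return TOKEN.sub(repl, template)

m = re.search(r"(\w+)@(\w+)", "user@example")
print(expand_replacement(r"$2 \1 ${0} \$1 \\", m))
```

With the sample input, `$2` and `\1` expand to the two groups, `${0}` to the whole match, `\$1` stays a literal `$1`, and `\\` becomes a single backslash.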


New version: [...] Give it a test and let me know how it works out.

:ph34r::geek::)

It all works out very well. The bugs I reported against the "old" RE version (and a few I didn't report) are gone.

I have also done some very preliminary timing comparisons with some REALLY long strings (up to 2048 KB), as I did for the old version, and the PCRE library plus your replace code looks pretty good in this respect as well. Sometimes AU3 is a bit faster than Perl, sometimes a bit slower... but it's now very much in the same league, not a factor of 20, 40, even 100 slower, as it used to be.

Very nice. Now I can scrap my Perl RE library. Well, was a pre-release anyway :lmao:

EDIT: PCRE uses slightly different definitions for its character classes and assertions (\b \d \w etc.). Anyone who is translating "old style" REs to new should check whether the classes are the same. I have run into some small differences that can wreck an otherwise working pattern. For instance, \w in PCRE includes 0..9.

Edited by thomasl
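A quick way to see the `\w` point, sketched in Python as a stand-in for PCRE (with `re.ASCII`, Python's `\w` is letters, digits and underscore, which is PCRE's default definition too; `letters_only` is a hypothetical helper for patterns that were written assuming letters only):

```python
import re

# re.ASCII restricts \w to [A-Za-z0-9_]; without it Python's \w is Unicode-aware.
word = re.compile(r"\w+", re.ASCII)
print(word.findall("abc 123 x_9 -"))   # ['abc', '123', 'x_9']

def letters_only(text):
    """Spell the class out when digits and underscores must be excluded."""
    return re.findall(r"[A-Za-z]+", text)

print(letters_only("abc 123 x_9"))     # ['abc', 'x']
```

If a translated pattern depends on a narrower class, writing the class explicitly avoids relying on either engine's definition.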

  • Administrators

I've done flag 3 (global) in StringRegExp.

http://www.autoitscript.com/autoit3/files/...utoIt3-pcre.exe

Edit: That should be all the existing functionality. If it tests OK I'll switch over to this code for the next beta and then delete all the bug reports :lmao:


Do you want to try this under the new version vs. the old one?

$str = "abcd"
$ptn = "(.*)"
$msg = ""
$aResult = StringRegExp($str, $ptn, 3)
For $i = 0 To UBound($aResult) - 1
    $msg &= $aResult[$i] & @CRLF
Next
MsgBox(0, "Result", $msg)

On my PC the new version never completes. It just eats up memory until the application fails due to lack of memory. (XP SP2)

I have left out the check for @extended, because at the moment it is always 0.


  • Administrators

That was quick work, but...

$str = "abcd"
$ptn = "(.*?)"
$msg = ""
$aResult = StringRegExp($str, $ptn, 3)
For $i = 0 To UBound($aResult) - 1
    $msg &= $aResult[$i] & @CRLF
Next
MsgBox(0, "Result", $msg)

This pattern has a similar problem :lmao:

Sorry...

What should this return? The pcre library is giving a really odd result back: it seems to be saying that there was a match of zero length, and then the code gets stuck in a loop because it never advances.
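The hang can be reproduced in a few lines. This is a Python sketch, not the AutoIt source: `naive_global` and `advance_on_empty` are hypothetical names, Python's `re` stands in for PCRE, and the `limit` argument exists only so the demonstration stops instead of looping forever.

```python
import re

def naive_global(pattern, subject, limit=5):
    """Global match loop that restarts at the end of the previous match."""
    pos, out = 0, []
    while len(out) < limit:            # cap stands in for "runs out of memory"
        m = pattern.search(subject, pos)
        if m is None:
            break
        out.append(m.group(0))
        pos = m.end()                  # an empty match leaves pos unchanged
    return out

def advance_on_empty(pattern, subject):
    """Simplest fix: step one character forward after an empty match."""
    pos, out = 0, []
    while pos <= len(subject):
        m = pattern.search(subject, pos)
        if m is None:
            break
        out.append(m.group(0))
        pos = m.end() + 1 if m.end() == m.start() else m.end()
    return out

print(naive_global(re.compile(r"(.*?)"), "abcd"))      # five empty strings
print(advance_on_empty(re.compile(r"(.*)"), "abcd"))   # ['abcd', '']
```

Stepping forward ends the loop, but for `(.*?)` it yields only empty strings, so it is not yet what Perl's /g does.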

  • Administrators

If I run the expression in a test exe that comes with the pcre library in global mode it gives

blank string

a

blank string

b

blank string

c

blank string

d

blank string

I looked at the source to the test exe and there is this note

/* If we have matched an empty string, first check to see if we are at
the end of the subject. If so, the /g loop is over. Otherwise, mimic
what Perl's /g options does. This turns out to be rather cunning. First
we set PCRE_NOTEMPTY and PCRE_ANCHORED and try the match again at the
same point. If this fails (picked up above) we advance to the next
character. */

I think this might be related as the match is indeed coming back as totally empty.

Edit: Updated the test exe results, there is actually a blank in between each match


  • Administrators

More:

PCRE_NOTEMPTY

An empty string is not considered to be a valid match if this option is set. If there are alternatives in the pattern, they are tried. If all the alternatives match the empty string, the entire match fails. For example, if the pattern

a?b?

is applied to a string not beginning with "a" or "b", it matches the empty string at the start of the subject. With PCRE_NOTEMPTY set, this match is not valid, so PCRE searches further into the string for occurrences of "a" or "b".

Perl has no direct equivalent of PCRE_NOTEMPTY, but it does make a special case of a pattern match of the empty string within its split() function, and when using the /g modifier. It is possible to emulate Perl's behaviour after matching a null string by first trying the match again at the same offset with PCRE_NOTEMPTY set, and then if that fails by advancing the starting offset (see below) and trying an ordinary match again.
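Python's `re` module (3.7 and later) follows the same retry rule in its own global search, so it can serve as a cross-check for what the rule produces. This is only a comparison under that assumption, not the PCRE code path, and `global_matches` is a hypothetical wrapper name:

```python
import re

def global_matches(pattern, subject):
    """All matches of pattern in subject, empty ones included."""
    return [m.group(0) for m in re.finditer(pattern, subject)]

# After an empty match the engine retries at the same offset and insists
# on a non-empty match before moving on.
print(global_matches(r"(.*?)", "abcd"))
# ['', 'a', '', 'b', '', 'c', '', 'd', '']
print(global_matches(r"(.*)", "abcd"))
# ['abcd', '']
```

The lazy pattern gives the alternating blank/character sequence reported from the pcre test exe, and the greedy one gives the full string followed by one blank at the end of the subject.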


  • Administrators

Updated: http://www.autoitscript.com/autoit3/files/...utoIt3-pcre.exe

I've made it work like the pcre test exe in that when a global operation is done blank strings are matched so that it gives the odd result in the post above.

Also odd things like doing a _global_ match on (.*) for "abcd" gives a match of:

abcd

blank string

This is also the same result as I'm getting from the pcre test exe which I assume is the same as perl.

I believe that if I just turn on the option to ignore blank string matches then the results will be more like the predictions, but I'm not sure whether that messes up some other elements of compatibility with Perl. (Nutster's implementation of the (.*?) pattern actually returned nothing at all, which I guess meant it came across the blank string at the start and barfed.)
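As a rough illustration of what ignoring blank matches would look like from the caller's side, here is a Python sketch that simply filters them out afterwards. Filtering after the fact is an approximation chosen for the example, not the same thing as passing PCRE_NOTEMPTY to every match attempt, and `non_empty_matches` is a hypothetical name:

```python
import re

def non_empty_matches(pattern, subject):
    """Global match, with zero-length results dropped afterwards."""
    return [m.group(0) for m in re.finditer(pattern, subject) if m.group(0) != ""]

print(non_empty_matches(r"(.*?)", "abcd"))   # ['a', 'b', 'c', 'd']
print(non_empty_matches(r"(.*)", "abcd"))    # ['abcd']
```

The cost is that a pattern which is meant to match empty strings can no longer report them.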


I found what I think might be an issue. Here is the code:

Main()

Func Main()
    Local $s = "Unique" & @CRLF & "Foo" & @CRLF & "Foo"
    Local $p = "Unique\s*(?:(Foo)\s*)*"
    Local $a = StringRegExp($s, $p, 3)
    ConsoleWrite("Matches: " & UBound($a) & @CRLF)
    For $i = 0 To UBound($a) - 1
        ConsoleWrite($a[$i] & @CRLF)
    Next
EndFunc    ; Main()

Here is the output:

Matches: 1
Foo

That's not what I expected. I expected:

Matches: 2
Foo
Foo

The old StringRegExp() returns what I expected, the new doesn't. The example is supposed to look for a string starting with the text "Unique" optionally followed by whitespace (CRLF). If it finds that, then it's supposed to look for the string "Foo" optionally followed by whitespace (CRLF in the example). If it finds that, then it captures the text "Foo" (The non-capturing group is used to be able to test for the trailing whitespace but not capture it). With the sample string, it should find both instances of "Foo" since it's supposed to keep repeating the "(?:(Foo)\s*)*" part of the pattern.

Edit: I see in the Perl documentation something about //s and //m and how //s means treat things as a single text block and //m means it's multiple lines and that //m is the default. I don't know if there is a way for me to change to the //s mode but that's what I need to be in for that pattern to match correctly.

Edit2: I found the options (?s) and (?m) in the PCRE documentation, which allow me to set those two flags I mentioned in my last edit. I still can't seem to get the pattern working how I want, though.

Edited by Valik
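In Perl-style engines a capturing group inside a repetition keeps only the text from its last iteration, which would explain the single "Foo". A Python sketch of the same pattern (Python's `re` as a stand-in for PCRE; `collect_foos` is a hypothetical two-step workaround, not something proposed in the thread):

```python
import re

subject = "Unique\r\nFoo\r\nFoo"

# A repeated capturing group keeps only its final iteration.
single = re.search(r"Unique\s*(?:(Foo)\s*)*", subject)
print(single.groups())          # ('Foo',)

def collect_foos(text):
    """Capture the whole run after 'Unique', then scan it for each Foo."""
    run = re.search(r"Unique\s*((?:Foo\s*)*)", text)
    return re.findall(r"Foo", run.group(1)) if run else []

print(collect_foos(subject))    # ['Foo', 'Foo']
```

The (?s) and (?m) options do not change this, since they only affect how `.`, `^` and `$` treat newlines.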

I don't understand these Perl expressions. I don't understand why the following pattern doesn't work like I expect:

Pattern: "(Foo)*"
String: "FooFoo"

That only matches one "Foo" and I expect 2. I even tried the test application and it didn't work like I thought, either:

re> /(Foo)*/
data> FooFoo
 0: FooFoo
 1: Foo
data>
  re> /(Foo)*/g
data> FooFoo
 0: FooFoo
 1: Foo
 0:

As buggy as David's implementation was, at least simple patterns I expect to work... do. Is PCRE really just retarded or am I missing something completely obvious?


I don't understand these Perl expressions. I don't understand why the following pattern doesn't work like I expect:

...

As buggy as David's implementation was, at least simple patterns I expect to work... do. Is PCRE really just retarded or am I missing something completely obvious?

If you want it to return two captures of "Foo", try just /(Foo)/ , or even /Foo/ or Foo if the syntax will support it.
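The difference between the two patterns can be checked quickly in Python, whose `re` module behaves like the test application output quoted above for this case (a stand-in for illustration, not PCRE itself):

```python
import re

# (Foo)* consumes both copies in ONE match; group 1 holds the last copy.
m = re.search(r"(Foo)*", "FooFoo")
print(m.group(0), m.group(1))              # FooFoo Foo

# Global search with the starred group: one real match, then an empty one.
starred = re.findall(r"(Foo)*", "FooFoo")
print(starred)                             # ['Foo', '']

# Without the star, each copy is its own match.
plain = re.findall(r"(Foo)", "FooFoo")
print(plain)                               # ['Foo', 'Foo']
```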

You know what I'd really, really like to see support for in this implementation? Named groups. Those are SO much easier to remember and manage than $1 or \1 style backreferences. I haven't tried your code yet (you wouldn't believe how busy I am these days), but I've noticed all grouping that's been posted here uses numbered backreferences instead of named ones.
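For anyone who has not used named groups, this is roughly what they look like, shown in Python's `(?P<name>...)` syntax. Whether the AutoIt build discussed here exposes named groups is not stated in the thread, so treat this purely as an illustration of the idea; `swap` is a hypothetical helper.

```python
import re

pair = re.compile(r"(?P<key>\w+)=(?P<value>\w+)")

m = pair.search("colour=blue")
print(m.group("key"), m.group("value"))    # colour blue

def swap(text):
    """Refer to groups by name in the replacement instead of \\1 and \\2."""
    return pair.sub(r"\g<value>=\g<key>", text)

print(swap("colour=blue"))                 # blue=colour
```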

Another nice resource: Online .Net RegEx Tester

I know you aren't really trying to approximate .Net, but it's a handy way to do a quick, free regexp logic test. Someone else may have one that's specific to Perl.
