RegExp - has anyone seen this library before?


sohfeyr

It would be very difficult but it basically requires dropping some supported operating systems or writing a ton of code that Windows already implements for us if we do want to support them... Then there is porting the existing code to use WCHAR instead of CHAR. That is probably about as much effort as writing all the wrappers.

As I said, not something I expect to see any time soon :) Nice to have the description of the process involved, though.

Link to comment
Share on other sites

Just stumbled on this thread .. and wanted to add my vote of support :)

Also, following some links that thomasl included with his PCRE wrapper (in another thread), I came across the following pages, which offer an excellent introduction to regexps. For those who need a quick introduction:

They also include some examples that might prove useful for testing the AU3 implementation, as they spell out the results and subtleties for various expressions and features.

HTH

:P

Link to comment
Share on other sites

  • Administrators

Test AutoIt Exe: http://www.autoitscript.com/autoit3/files/...utoIt3-pcre.exe

///////////////////////////////////////////////////////////////////////////////
//
// $val = StringRegExp("string", "pattern" [, flag [, offset]])
//
// Perform regular expression matching on the given string.
//
// flags:
//      0(default) - returns 1 (matched) or 0 (no match)
//      1 - return array of matches
//
// When flag = 1:
//      Returns an array.
//      @Error = 0.  Array is valid.  Check @Extended for next offset
//      @Error = 1.  Array is invalid.  No matches.
//      @Error = 2.  Bad pattern, array is invalid.  @Extended = offset of error in pattern.
//
///////////////////////////////////////////////////////////////////////////////

Based on the php preg_match function (it seems to return the entire match followed by the matching substrings). I haven't done a global version yet, for two reasons. First, I don't know if this is working correctly yet: half the patterns I try don't work, and I can't tell whether they should work or whether the code is broken. Second, I have no idea how to return a global selection of data in a way that would be meaningful. It's very hard to implement regexp code when you barely understand regexps, so help testing would be great.

Here is code using the offset parameter to perform a manual global match.

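$nOffset = 1 ; position at which the first search starts
While 1
    $array = StringRegExp('<test>a</test> <test>b</test> <test>c</Test>', '<(?i)test>(.*?)</(?i)test>', 1, $nOffset)
    If @error = 0 Then
        $nOffset = @extended ; continue from the offset reported for this match
    Else
        ExitLoop ; no further match (or a bad pattern)
    EndIf
    ; Show every element of the returned array for this match
    For $i = 0 To UBound($array) - 1
        MsgBox(0, $i, $array[$i])
    Next
WEnd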
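
For anyone who wants to compare against another engine, here is a rough sketch of the same offset-driven loop in Python. Python's `re` module is only a stand-in for PCRE here, so treat the details as assumptions: offsets are 0-based, case-insensitivity is passed as a flag instead of inline `(?i)`, and the names `captures` and `offset` are made up for the illustration.

```python
import re

# Python's `re` stands in for PCRE; this is not the AutoIt test build.
text = "<test>a</test> <test>b</test> <test>c</Test>"
pattern = re.compile(r"<test>(.*?)</test>", re.IGNORECASE)

offset = 0      # Python string offsets are 0-based
captures = []
while True:
    m = pattern.search(text, offset)
    if m is None:        # plays the role of a non-zero @error
        break
    captures.append(m.group(1))
    offset = m.end()     # plays the role of @extended: resume after this match

print(captures)
```

Each pass resumes where the previous match ended, which is the whole trick behind a "manual" global match.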

Link to comment
Share on other sites

I just tested one of my patterns and didn't even have to change it (That was unexpected). It worked mostly but the returned array contained data I didn't expect.

Take this simple script:

Main()

Func Main()
    Local $s = "abcdef"
    Local $p = "(ab)(cd)"
    Local $a = StringRegExp($s, $p, 1)
    ConsoleWrite('@@ (48) :(' & @min & ':' & @sec & ') UBound($a) = ' & UBound($a) & @CR);### Debug Console
    For $i = 0 To UBound($a) - 1
        ConsoleWrite($a[$i] & @CRLF)
    Next
EndFunc; Main()

The output is:

@@ (48) :(59:13) UBound($a) = 3
abcd
ab
cd

I expected:

@@ (48) :(59:13) UBound($a) = 2
ab
cd

Edit: Fixed the post up a bit.

Edited by Valik
Link to comment
Share on other sites

Just tried another expression. It looks like you have to escape $ when it's not being used as an anchor. For example, I had the pattern "$\((.*?)\)" which would match things like "Foo" in the string "$(Foo)". In order to make that pattern compatible with PCRE, I had to make it "\$\((.*?)\)".

So far I'm optimistic that our patterns won't be too broken by using PCRE. Just need to get the damn "too-much-data" problem fixed. StringRegExp() would return this using the pattern and string mentioned above:

$(Foo)
Foo

Again, the first line should not be there. The group only specified that "Foo" should be captured.
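
A quick way to see the `$` difference outside AutoIt is the sketch below. It uses Python's `re` as a stand-in for PCRE (an assumption, though both treat an unescaped `$` as an end anchor); the variable names are only for illustration.

```python
import re

text = "$(Foo)"

# Unescaped `$` is an end-of-string anchor, so nothing can follow it.
unescaped = re.search(r"$\((.*?)\)", text)

# Escaped `\$` is a literal dollar sign.
escaped = re.search(r"\$\((.*?)\)", text)

print(unescaped)
print(escaped.group(0), escaped.group(1))
```

Here `group(0)` is the whole match and `group(1)` is the captured group, the same split as the two lines of output above.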

Link to comment
Share on other sites

  • Administrators

The first array entry seems to be something to do with a full match, the one in php does the same (and also, the implementation that tylo did a while ago has this too). So I thought I'd keep it the same.

Edit: The comment from php's preg_match:

$matches[0] will contain the text that matched the full pattern, $matches[1] will have the text that matched the first captured parenthesized subpattern, and so on.
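
To make the two return shapes concrete, here is a small sketch in Python (again `re` as a stand-in for PCRE, and the names `php_style` and `groups_only` are invented for the example). It builds both the preg_match-style array and the groups-only array from the same match.

```python
import re

m = re.search(r"(ab)(cd)", "abcdef")

# preg_match style: element 0 is the full match, then the groups.
php_style = [m.group(0), *m.groups()]

# Groups-only style: just what the parentheses captured.
groups_only = list(m.groups())

print(php_style)
print(groups_only)
```

The only difference is the extra leading element, which is why loops written against the groups-only shape end up off by one.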

Whether it's useful or not I have no idea whatsoever.

I had a go at the replace stuff and decided I'd had enough for one day.

Link to comment
Share on other sites

The problem is, the existing implementation did not use it and that will break scripts. So far I'm surprised at how compatible the expressions are. I guess David used Perl as a guide so a lot of patterns are going to work with PCRE out of the box. However, if the returned data is different than the "native" implementation, things are just as broken. I'll have to go through and adjust all my loops to start indexing at 1 instead of 0 even if the pattern itself works perfectly. That seems a shame to me since the patterns are what I thought would make the implementation incompatible.

Link to comment
Share on other sites

Jon, here is my proposal. It's a combination of maintaining backwards compatibility and supporting what PCRE does by default. Here are the flags I propose:

  • 0 - Current behavior, returns True or False if the pattern matches.
  • 1 - Old behavior. Only return data that matches a group and only return the first matches. Example:

    Main()
    
    Func Main()
        Local $s = "abcdefabcdef"
        Local $p = "(ab)(cd)"
        Local $a = StringRegExp($s, $p, 1)
        ConsoleWrite("Matches: " & UBound($a) & @CRLF)
        For $i = 0 To UBound($a) - 1
            ConsoleWrite($a[$i] & @CRLF)
        Next
    EndFunc
    [The rest of the flag list, through flag 4, is unreadable in the source. The output below comes from one of those later examples, not from the flag 1 script above.]

    Output:

    Matches: 6
    abcd
    ab
    cd
    abcd
    ab
    cd
This will work because the flags in the old StringRegExp() were not bit-flags. This provides maximum compatibility so that any breakages will require very minor tweaks to the pattern. It also adds in the new functionality which I admit could be useful.
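
As a rough illustration of the two global behaviors being weighed (full match included or groups only), here is a sketch in Python. It is not mapped to any particular flag number, `re` is a stand-in for PCRE, and the helper `global_match` with its `include_full_match` parameter is hypothetical.

```python
import re

def global_match(pattern, text, include_full_match):
    """Collect every match into one flat list (hypothetical helper)."""
    out = []
    for m in re.finditer(pattern, text):
        if include_full_match:
            out.append(m.group(0))   # whole match first, as PCRE reports it
        out.extend(m.groups())       # then the captured groups
    return out

with_full = global_match(r"(ab)(cd)", "abcdefabcdef", True)
groups_only = global_match(r"(ab)(cd)", "abcdefabcdef", False)

print(len(with_full), with_full)
print(len(groups_only), groups_only)
```

With the full match included, the same string and pattern give six elements; without it, four.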

Edit: For flag 4, I'm assuming that PCRE behaves the same with a global match that it does with a single match. If PCRE behaves exactly like flag 3, then flag 4 can be skipped. If the behavior of PCRE does not match flag 3 but does seem useful, then it can be put onto flag 4.

Edited by Valik
Link to comment
Share on other sites

  • Administrators

It has no concept of a global match, which is what I'm struggling with atm. You basically have to manually re-call it (like AutoIt example above) but we would implement it internally. If the interface doesn't output something mentioned above then I don't understand enough about it to make it do so :ph34r:

If the new return value is of no use then we can ditch it, it just seemed odd that other implementations seemed to think it was something important to return which is why I left it in there. Adding more flags to support something 99% of users won't even have heard of seems a bit extreme. It's never been a release function after all.

PS. I've got the simple version of StringRegExp replace working (no dollar substitutions etc) so I'll post that in a while.

Edit: At least your post gives me some examples to play with. I was really struggling to find some. :lmao:

Link to comment
Share on other sites

I think it may be useful but I'm trying to keep as much backwards compatibility as possible. Like I said before, the patterns are pretty close and a lot of them are going to work out of the box with PCRE so it's a shame the output is not the same, otherwise this transition would be very smooth requiring only minor changes to patterns.

From what you posted earlier (in private maybe), it sounded like the function with "all" in the name did a global search. I don't know what its output would be, though I suspect it should be similar to flag 3 of David's implementation.

Link to comment
Share on other sites

  • Administrators

Yeah, preg_match_all is the php function. But the underlying pcre api doesn't have a global option so it seems we have to do the global cleverness manually. There's no way to predict how many matches will be done so it seems like we'll have to keep calling the single match function and adding the matches to some sort of linked list and then when there are no more matches decide how to turn that into something useful for AutoIt.
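
The "keep calling the single-match function and accumulate" idea could look roughly like this. It is a sketch in Python with `re` standing in for the PCRE API, a growable list standing in for the linked list, and a hypothetical helper name `match_all`. The one subtlety it handles is an empty match, which would otherwise be found again at the same position forever; stepping forward one character is a simplification, not necessarily what a PCRE-based implementation should do.

```python
import re

def match_all(pattern, text):
    """Repeat a single-match search until it fails (hypothetical helper)."""
    compiled = re.compile(pattern)
    results = []          # growable list; converted to an array at the end
    pos = 0
    while pos <= len(text):
        m = compiled.search(text, pos)
        if m is None:
            break
        results.append((m.group(0),) + m.groups())
        # Step past an empty match so the loop always makes progress.
        pos = m.end() if m.end() > m.start() else m.end() + 1
    return results

rows = match_all(r"(ab)(cd)", "abcdefabcdef")
empties = match_all(r"x*", "ab")

print(rows)
print(len(empties))
```

Because the number of matches is unknown up front, the results are only turned into a fixed-size structure once the loop has finished.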

I'm leaving global until last, I think doing StringRegExpReplace looks easier.

Link to comment
Share on other sites

It'd be nice to use std::vector for that. Wonder how much STL would increase size by? I wonder if we've gotten to the point we can use STL without too much size bloat? We could port a lot of stuff to STL...

Link to comment
Share on other sites

If the new return value is of no use then we can ditch it, it just seemed odd that other implementations seemed to think it was something important to return which is why I left it in there. Adding more flags to support something 99% of users won't even have heard of seems a bit extreme. It's never been a release function after all.

I think the value in position 0 is very useful when parsing long documents. You can examine both your capturing groups and their context and relation to each other. (.Net's implementation is similar: RegEx.Matches(n).Groups(0) returns the text that matched the whole expression.)

If backward compatibility is really an issue though, people like me could always just enclose the whole expression in a group. As long as nested groups are supported, that shouldn't be too big a problem. Personally, I like the flags idea. It would be easier for people to add a flag to their regexp calls than to go through and be sure of every 0-based loop that needs to become 1-based.
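
For what it's worth, the "wrap everything in a group" workaround behaves like this in an engine with nested groups. The sketch uses Python's `re` as a stand-in for PCRE.

```python
import re

# The outer parentheses become group 1; the original groups shift to 2 and 3.
m = re.search(r"((ab)(cd))", "abcdef")
groups = list(m.groups())

print(groups)
```

The whole match comes back as the first captured group, at the cost of renumbering the groups that were already there.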

Edited by sohfeyr
Link to comment
Share on other sites

  • Administrators

I need a regexp that will match the $n or ${n} parts of a string.

I currently have "\\$([0-9]+)" which matches $1 $2 ok but I need also to cope with situations that have {} like ${1}

It's for the replacement parameter code in StringRegExpReplace - I was going to use a regexp to parse itself Oo

This almost works: "\\${*([0-9]+)}*" but it allows for ${{{1}}} which is wrong, is there some way to say a match for 0 or 1 lots of { but no more?

Link to comment
Share on other sites

  • Moderators

I'm going to assume you're speaking of the current project you're working on and not the current release version?

Link to comment
Share on other sites

This looks pretty good, Jon. I have thrown some simple and quite a few of my more convoluted patterns at it and they work out okay. I did compare the output of AU3 to what the same pattern produces in Perl, and with the exception of element[0] (whole match) they agree. Good job.

FWIW, I agree about keeping backwards compatibility if at all possible. If someone really wants the whole match, another pair of parentheses does the trick, as sohfeyr pointed out.

As to ${...}: try this: \$\{{0,1}\d+\}{0,1}

EDIT: sorry, forgot the () around \d+: \$\{{0,1}(\d+)\}{0,1}
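
A quick check of that pattern, sketched in Python with `re` standing in for PCRE. It rejects ${{{1}}} as wanted, though it still accepts an unbalanced ${1. The stricter alternation `strict` and the helper `strict_ref` are my own illustration, not something proposed in the thread.

```python
import re

# The pattern from the post above: at most one `{` and at most one `}`.
loose = re.compile(r"\$\{{0,1}(\d+)\}{0,1}")

# Hypothetical stricter variant: braces must come as a pair or not at all.
strict = re.compile(r"\$(?:\{(\d+)\}|(\d+))")

def strict_ref(s):
    """Return the group number from $n or ${n}, or None (illustrative helper)."""
    m = strict.fullmatch(s)
    if m is None:
        return None
    return m.group(1) or m.group(2)

print(loose.fullmatch("$1").group(1), loose.fullmatch("${1}").group(1))
print(loose.search("${{{1}}}"))               # rejected, as wanted
print(loose.fullmatch("${1") is not None)     # unbalanced brace still accepted
print(strict_ref("${12}"), strict_ref("$3"), strict_ref("${1"))
```

Whether the unbalanced case matters depends on how forgiving the replacement parser is meant to be.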

Edited by thomasl
Link to comment
Share on other sites
