RegExp - has anyone seen this library before?


136 replies to this topic

#21 martijn

martijn

    Seeker

  • Active Members
  • 27 posts

Posted 27 September 2006 - 08:19 PM

I really don't recommend that until we actually decide we want to go this route. Testing is okay but don't write mission-critical applications with any test executables because nothing is final yet.

No problemo :)







#22 sohfeyr

sohfeyr

    Prodigy

  • Active Members
  • 194 posts

Posted 28 September 2006 - 01:54 AM

It would be very difficult but it basically requires dropping some supported operating systems or writing a ton of code that Windows already implements for us if we do want to support them... Then there is porting the existing code to use WCHAR instead of CHAR. That is probably about as much effort as writing all the wrappers.


As I said, not something I expect to see any time soon :) Nice to have the description of the process involved, though.

#23 trids

trids

    Hmmm .. and what have we here?

  • Active Members
  • 1,004 posts

Posted 28 September 2006 - 02:40 PM

Just stumbled on this thread .. and wanted to add my vote of support :)

Also, following some links that thomasl included with his PCRE wrapper (in another thread), I came across the following pages which offer an excellent introduction to regexps, for those who need a quick one:

They also include some examples that might prove useful for testing the AU3 implementation, as they spell out the results and subtleties for various expressions and features.

HTH
:P

#24 Jon

Jon

    Up all night to get lucky

  • Administrators
  • 10,630 posts

Posted 01 October 2006 - 09:59 AM

Test AutoIt Exe: http://www.autoitscript.com/autoit3/files/...utoIt3-pcre.exe

///////////////////////////////////////////////////////////////////////////////
//
// $val = StringRegExp("string", "pattern", [flag, [offset]])
//
// Perform regular expression matching on the given string.
//
// flags:
//      0(default) - returns 1 (matched) or 0 (no match)
//      1 - return array of matches
//
// When flag = 1:
//      Returns an array.
//      @Error = 0.  Array is valid.  Check @Extended for next offset
//      @Error = 1.  Array is invalid.  No matches.
//      @Error = 2.  Bad pattern, array is invalid.  @Extended = offset of error in pattern.
//
///////////////////////////////////////////////////////////////////////////////


Based on the php: preg_match function (seems to return the entire match followed by the matching substrings). Haven't done a global version yet because I don't know if this is working correctly yet (half the patterns I try don't work, but I don't know if they should work or if it is broken...) and I also have no idea how to return a global selection of data that would be meaningful. It's very hard to implement regexp code when you barely understand them, so help testing would be great.

Here is code using the offset parameter to perform a manual global match.

$nOffset = 1
While 1
    $array = StringRegExp('<test>a</test> <test>b</test> <test>c</Test>', '<(?i)test>(.*?)</(?i)test>', 1, $nOffset)
    If @error = 0 Then
        $nOffset = @extended
    Else
        ExitLoop
    EndIf
    for $i = 0 to UBound($array) - 1
        msgbox(0, $i, $array[$i])
    Next
WEnd


#25 steve8tch

steve8tch

    Universalist

  • Active Members
  • 291 posts

Posted 01 October 2006 - 02:47 PM

Checked out a few of my regexs (including some that I used to have issues with) - most of them are quite simple - but it seems to be behaving fine.

#26 Valik

Valik

    Former developer.

  • Active Members
  • 18,879 posts

Posted 01 October 2006 - 03:01 PM

I just tested one of my patterns and didn't even have to change it (That was unexpected). It worked mostly but the returned array contained data I didn't expect.

Take this simple script:
Main()

Func Main()
    Local $s = "abcdef"
    Local $p = "(ab)(cd)"
    Local $a = StringRegExp($s, $p, 1)
    ConsoleWrite('@@ (48) :(' & @min & ':' & @sec & ') UBound($a) = ' & UBound($a) & @CR);### Debug Console
    For $i = 0 To UBound($a) - 1
        ConsoleWrite($a[$i] & @CRLF)
    Next
EndFunc; Main()

The output is:
@@ (48) :(59:13) UBound($a) = 3
abcd
ab
cd

I expected:
@@ (48) :(59:13) UBound($a) = 2
ab
cd


Edit: Fixed the post up a bit.

Edited by Valik, 01 October 2006 - 03:02 PM.


#27 Valik

Valik

    Former developer.

  • Active Members
  • 18,879 posts

Posted 01 October 2006 - 03:17 PM

Just tried another expression. It looks like you have to escape $ when it's not being used as an anchor. For example, I had the pattern "$\((.*?)\)" which would match things like "Foo" in the string "$(Foo)". In order to make that pattern compatible with PCRE, I had to make it "\$\((.*?)\)".

So far I'm optimistic that our patterns won't be too broken by using PCRE. Just need to get the damn "too-much-data" problem fixed. StringRegExp() would return this using the pattern and string mentioned above:
$(Foo)
Foo

Again, the first line should not be there. The group only specified that "Foo" should be captured.
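The escaping difference is easy to reproduce in any Perl-style engine. The snippet below is a small sketch in Python, with its `re` module standing in for PCRE (an assumption, though both treat `$` the same way in this case). Unescaped, `$` is an end-of-string anchor, so nothing can follow it and the pattern never matches; escaped, it is a literal dollar sign.

```python
import re

text = "$(Foo)"

# Unescaped: "$" anchors at the end of the string, so "\(" can never follow it
unescaped = re.search(r"$\((.*?)\)", text)

# Escaped: "\$" matches a literal dollar sign
escaped = re.search(r"\$\((.*?)\)", text)

print(unescaped)          # None
print(escaped.group(0))   # $(Foo)
print(escaped.group(1))   # Foo
```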

#28 Jon

Jon

    Up all night to get lucky

  • Administrators
  • 10,630 posts

Posted 01 October 2006 - 03:20 PM

The first array entry seems to be something to do with a full match, the one in php does the same (and also, the implementation that tylo did a while ago has this too). So I thought I'd keep it the same.

Edit: The comment from php's preg_match:

$matches[0] will contain the text that matched the full pattern, $matches[1] will have the text that matched the first captured parenthesized subpattern, and so on.

Whether it's useful or not I have no idea whatsoever.

I had a go at the replace stuff and decided I'd had enough for one day.
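For anyone comparing the two result shapes being discussed, here is a minimal sketch in Python (its `re` module is a stand-in for PCRE, which is an assumption). It uses the string and pattern from Valik's test script and builds both the array with the full match at index 0 and the groups-only array.

```python
import re

m = re.search(r"(ab)(cd)", "abcdef")

with_full_match = [m.group(0), *m.groups()]  # whole match at index 0, then the groups
groups_only = list(m.groups())               # captured groups only

print(with_full_match)  # ['abcd', 'ab', 'cd']
print(groups_only)      # ['ab', 'cd']
```

With the first shape, a loop over the captured groups has to start at index 1 rather than 0.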

#29 Valik

Valik

    Former developer.

  • Active Members
  • 18,879 posts

Posted 01 October 2006 - 03:26 PM

The problem is, the existing implementation did not use it and that will break scripts. So far I'm surprised at how compatible the expressions are. I guess David used Perl as a guide so a lot of patterns are going to work with PCRE out of the box. However, if the returned data is different than the "native" implementation, things are just as broken. I'll have to go through and adjust all my loops to start indexing at 1 instead of 0 even if the pattern itself works perfectly. That seems a shame to me since the patterns are what I thought would make the implementation incompatible.

#30 Valik

Valik

    Former developer.

  • Active Members
  • 18,879 posts

Posted 01 October 2006 - 05:31 PM

Jon, here is my proposal. It's a combination of maintaining backwards compatibility and supporting what PCRE does by default. Here are the flags I propose:
  • 0 - Current behavior, returns True or False if the pattern matches.
  • 1 - Old behavior. Only return data that matches a group and only return the first matches. Example:
    Main()

    Func Main()
        Local $s = "abcdefabcdef"
        Local $p = "(ab)(cd)"
        Local $a = StringRegExp($s, $p, 1)
        ConsoleWrite("Matches: " & UBound($a) & @CRLF)
        For $i = 0 To UBound($a) - 1
            ConsoleWrite($a[$i] & @CRLF)
        Next
    EndFunc

    [...]

    Output:
    Matches: 6
    abcd
    ab
    cd
    abcd
    ab
    cd
This will work because the flags in the old StringRegExp() were not bit-flags. This provides maximum compatibility so that any breakages will require very minor tweaks to the pattern. It also adds in the new functionality which I admit could be useful.

Edit: For flag 4, I'm assuming that PCRE behaves the same with a global match that it does with a single match. If PCRE behaves exactly like flag 3, then flag 4 can be skipped. If the behavior of PCRE does not match flag 3 but does seem useful, then it can be put onto flag 4.

Edited by Valik, 01 October 2006 - 05:33 PM.


#31 Jon

Jon

    Up all night to get lucky

  • Administrators
  • 10,630 posts

Posted 01 October 2006 - 06:04 PM

It has no concept of a global match, which is what I'm struggling with atm. You basically have to manually re-call it (like AutoIt example above) but we would implement it internally. If the interface doesn't output something mentioned above then I don't understand enough about it to make it do so :ph34r:

If the new return value is of no use then we can ditch it, it just seemed odd that other implementations seemed to think it was something important to return which is why I left it in there. Adding more flags to support something 99% of users won't even have heard of seems a bit extreme. It's never been a release function after all.

PS. I've got the simple version of StringRegExp replace working (no dollar substitutions etc) so I'll post that in a while.

Edit: At least your post gives me some examples to play with. I was really struggling to find some. :lmao:

#32 spyrorocks

spyrorocks

    Universalist

  • Active Members
  • 728 posts

Posted 01 October 2006 - 06:06 PM

If there was some way to make this exactly like the php function, I could really use it.

#33 Valik

Valik

    Former developer.

  • Active Members
  • 18,879 posts

Posted 01 October 2006 - 06:13 PM

It has no concept of a global match, which is what I'm struggling with atm. You basically have to manually re-call it (like AutoIt example above) but we would implement it internally. If the interface doesn't output something mentioned above then I don't understand enough about it to make it do so :ph34r:

If the new return value is of no use then we can ditch it, it just seemed odd that other implementations seemed to think it was something important to return which is why I left it in there. Adding more flags to support something 99% of users won't even have heard of seems a bit extreme. It's never been a release function after all.

PS. I've got the simple version of StringRegExp replace working (no dollar substitutions etc) so I'll post that in a while.

Edit: At least your post gives me some examples to play with. I was really struggling to find some. :lmao:

I think it may be useful but I'm trying to keep as much backwards compatibility as possible. Like I said before, the patterns are pretty close and a lot of them are going to work out of the box with PCRE so it's a shame the output is not the same, otherwise this transition would be very smooth requiring only minor changes to patterns.

From what you posted earlier (in private maybe), it sounded like the function with all in the name did a global search. I don't know what its output would be, though I suspect it should be similar to flag 3 of David's implementation.

#34 Jon

Jon

    Up all night to get lucky

  • Administrators
  • 10,630 posts

Posted 01 October 2006 - 06:22 PM

From what you posted earlier (in private maybe), it sounded like the function with all in the name did a global search. I don't know what its output would be, though I suspect it should be similar to flag 3 of David's implementation.

Yeah, preg_match_all is the php function. But the underlying pcre api doesn't have a global option so it seems we have to do the global cleverness manually. There's no way to predict how many matches will be done so it seems like we'll have to keep calling the single match function and adding the matches to some sort of linked list and then when there are no more matches decide how to turn that into something useful for AutoIt.

I'm leaving global until last, I think doing StringRegExpReplace looks easier.
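The "collect first, decide the shape afterwards" idea can be sketched as follows. This is Python rather than the C++ being discussed, with `re` standing in for PCRE, and `collect_matches` is a hypothetical helper name. Python's `finditer` does the repeated single-match calls internally; the point here is only the two output shapes that a growing list can be turned into once no more matches are found.

```python
import re

def collect_matches(pattern, text):
    """Gather (full match, groups...) tuples in a list that grows as matches are found."""
    found = []
    for m in re.finditer(pattern, text):
        found.append((m.group(0), *m.groups()))
    return found

found = collect_matches(r"(ab)(cd)", "abcdefabcdef")

nested = [list(row) for row in found]            # one row per match
flat = [item for row in found for item in row]   # everything in a single array

print(nested)  # [['abcd', 'ab', 'cd'], ['abcd', 'ab', 'cd']]
print(flat)    # ['abcd', 'ab', 'cd', 'abcd', 'ab', 'cd']
```

The flat form has six entries for this input, the same count as the "Matches: 6" output quoted in Valik's proposal. The nested form keeps each match separate but needs an array of arrays.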

#35 Valik

Valik

    Former developer.

  • Active Members
  • 18,879 posts

Posted 01 October 2006 - 06:30 PM

It'd be nice to use std::vector for that. Wonder how much STL would increase size by? I wonder if we've gotten to the point we can use STL without too much size bloat? We could port a lot of stuff to STL...

#36 sohfeyr

sohfeyr

    Prodigy

  • Active Members
  • 194 posts

Posted 01 October 2006 - 08:51 PM

If the new return value is of no use then we can ditch it, it just seemed odd that other implementations seemed to think it was something important to return which is why I left it in there. Adding more flags to support something 99% of users won't even have heard of seems a bit extreme. It's never been a release function after all.


I think the value in position 0 is very useful when parsing long documents. You can examine both your capturing groups and their context and relation to each other. (.Net's implementation is similar: RegEx.Matches(n).Groups(0) returns the text that matched the whole expression.)

If reverse compatibility is really an issue though, people like me could always just enclose the whole expression as a group. As long as nested groups are supported, that shouldn't be too big a problem. Personally, I like the flags idea. It would be easier for people to add a flag to their regexp calls than to go through and be sure of every 0-based loop that needs to become 1-based.

Edited by sohfeyr, 01 October 2006 - 08:56 PM.
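The "enclose the whole expression as a group" workaround looks like this in practice. It is a small Python sketch with `re` standing in for PCRE (an assumption), and it relies on nested groups being numbered by the position of their opening parenthesis, which holds for Perl-style engines.

```python
import re

text = "abcdef"

plain = re.search(r"(ab)(cd)", text).groups()
wrapped = re.search(r"((ab)(cd))", text).groups()  # outer pair captures the whole match

print(plain)    # ('ab', 'cd')
print(wrapped)  # ('abcd', 'ab', 'cd')
```

The outer pair becomes group 1 and every inner group shifts up by one, so a groups-only return value can still carry the full match when a script wants it.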


#37 Jon

Jon

    Up all night to get lucky

  • Administrators
  • 10,630 posts

Posted 02 October 2006 - 09:22 AM

I need a regexp that will match the $n or ${n} parts of a string.

I currently have "\\$([0-9]+)" which matches $1 $2 ok but I need also to cope with situations that have {} like ${1}

It's for the replacement parameter code in StringRegExpReplace - I was going to use a regexp to parse itself Oo

This almost works: "\\${*([0-9]+)}*" but it allows for ${{{1}}} which is wrong, is there some way to say a match for 0 or 1 lots of { but no more?

#38 SmOke_N

SmOke_N

    It's not what you know ... It's what you can prove!

  • Moderators
  • 16,014 posts

Posted 02 October 2006 - 09:30 AM

I need a regexp that will match the $n or ${n} parts of a string.

I currently have "\\$([0-9]+)" which matches $1 $2 ok but I need also to cope with situations that have {} like ${1}

It's for the replacement parameter code in StringRegExpReplace - I was going to use a regexp to parse itself Oo

This almost works: "\\${*([0-9]+)}*" but it allows for ${{{1}}} which is wrong, is there some way to say a match for 0 or 1 lots of { but no more?

I'm going to assume you're speaking of the current project you're working on and not the current release version?

Common sense plays a role in the basics of understanding AutoIt... If you're lacking in that, do us all a favor, and step away from the computer.


#39 Jon

Jon

    Up all night to get lucky

  • Administrators
  • 10,630 posts

Posted 02 October 2006 - 09:33 AM

I'm going to assume you're speaking of the current project you're working on and not the current release version?

Yes.

#40 thomasl

thomasl

    Wayfarer

  • Active Members
  • 63 posts

Posted 02 October 2006 - 09:47 AM

Test AutoIt Exe: http://www.autoitscript.com/autoit3/files/...utoIt3-pcre.exe

This looks pretty good, Jon. I have thrown some simple and quite a few of my more convoluted patterns at it and they work out okay. I did compare the output of AU3 to what the same pattern produces in Perl and with the exception of element[0] (whole match) they agree. Good job.

FWIW, I agree about keeping backwards compatibility if at all possible. If someone really wants the whole match, another pair of parentheses does the trick, as sohfeyr pointed out.

As to ${...}: try this: \$\{{0,1}\d+\}{0,1}

EDIT:sorry, forgot the () around \d+: \$\{{0,1}(\d+)\}{0,1}

Edited by thomasl, 02 October 2006 - 09:49 AM.
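A quick way to check the suggested pattern against the cases Jon listed is sketched below in Python (`re` stands in for PCRE, which is an assumption). `{0,1}` means "zero or one" and can also be written `?`. The `strict` pattern and the `group_number` helper are hypothetical additions, not something proposed in the thread: they show one way to require that the braces appear as a pair or not at all.

```python
import re

# thomasl's pattern: at most one "{" and at most one "}" around the digits
loose = re.compile(r"\$\{{0,1}(\d+)\}{0,1}")

# Hypothetical stricter variant: braces must come as a pair or not at all
strict = re.compile(r"\$(?:\{(\d+)\}|(\d+))")

def group_number(m):
    """Return the captured digits from whichever alternative of `strict` matched."""
    return m.group(1) or m.group(2)

for candidate in ["$1", "${1}", "${{{1}}}", "${1"]:
    print(candidate, bool(loose.fullmatch(candidate)), bool(strict.fullmatch(candidate)))
```

Both patterns reject `${{{1}}}`. The loose one still accepts an unbalanced `${1`, which may or may not matter for a replacement-string parser.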




