RegExp - has anyone seen this library before?


sohfeyr
I swear anyone who understands this stuff is clinically insane.

Not true, but it certainly helps.

Don't want to be a spoilsport, Jon, but there's still the problem of some matches terminating early when they stumble across a \n (see my EDIT "OTOH, StringRegExp("test"&@CRLF&"test","(.*?)",3) simply stops matching after the LF" above.)

  • Administrators

Not true, but it certainly helps.

Don't want to be a spoilsport, Jon, but there's still the problem of some matches terminating early when they stumble across a \n (see my EDIT "OTOH, StringRegExp("test"&@CRLF&"test","(.*?)",3) simply stops matching after the LF" above.)

I have no idea at all I'm afraid. Can you reproduce it with pcretest.exe so we can work out if it's me or pcre? (because I can't see any "make pcre fail with \n" option that I'm missing)

Edit: Hmm, it seems to work in pcretest.exe. I'll step through it in AutoIt.


Alright, trids, why doesn't this work?

Main()

Func Main()
    Local $s = "Unique" & @CRLF & "Foo" & @CRLF & "Foo"
    Local $p = "Unique\s*(?:((\w*)\s*))*"
    Local $a = StringRegExp($s, $p, 3)
    ConsoleWrite("Matches: " & UBound($a) & @CRLF)
    For $i = 0 To UBound($a) - 1
        ConsoleWrite("|" & $a[$i] & "|" & @CRLF)
    Next
EndFunc; Main()

Output:

Matches: 2
||
||

I'm just trying to use your same working pattern, except now I'm trying to capture Foo without knowing that it's actually Foo, and I'm not getting anything.

<rant>I must say, I think this engine is absolutely retarded. My expectations are quite obviously wrong on how things should behave and maybe that's the norm. But I've never had trouble writing patterns before and here I'm completely in left field with what I'm trying. I always heard RE was hard but I never had much trouble. If this is the format people have been using, then yes, it's hard because it's stupid. As I said before, at least David's works with patterns that make sense and don't require some voodoo grouping I don't understand.</rant>


I have no idea at all I'm afraid. Can you reproduce it with pcretest.exe so we can work out if it's me or pcre? (because I can't see any "make pcre fail with \n" option that I'm missing)

Edit: Hmm, it seems to work in pcretest.exe. I'll step through it in AutoIt.

Before I posted that I checked the pattern with three different RE engines, all agree :lmao:

So I think something is brewing. However, I think I'll call it a day, lads.


  • Administrators

Before I posted that I checked the pattern with three different RE engines, all agree :lmao:

So I think something is brewing. However, I think I'll call it a day, lads.

Hmm pcre returns -1 (PCRE_ERROR_NOMATCH) at position 5 (\n). /stumped

Edit: I estimate around 1 day of interest left in regexp before I get bored and move on to creating some template code for stacks and vectors, so I need to work it out soon...


I've already explained that, Jon, before it ever came up. The dot character doesn't match newlines. The pattern needs to be "(?s)(.*?)" in order to make the dot match newlines. With that pattern, I get 21 matches. Without it, or using the option (?m), I get 11. See PCRE_DOTALL in pcre.txt for the details.
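
The point about the dot can be checked with a minimal sketch, assuming a current AutoIt release in which StringRegExp is backed by PCRE. The variable names are only illustrative, and a greedy .+ is used instead of the lazy (.*?) from the thread so that the match counts are easy to read.

```autoit
; Sketch: how (?s) (PCRE_DOTALL) changes what the dot may consume.
Local $sText = "test" & @CRLF & "test"

; Without (?s) the dot does not cross the line break, so each line matches separately.
Local $aDefault = StringRegExp($sText, ".+", 3)

; With (?s) the dot also matches @CR and @LF, so one match spans both lines.
Local $aDotAll = StringRegExp($sText, "(?s).+", 3)

ConsoleWrite("Default: " & UBound($aDefault) & " match(es)" & @CRLF)
ConsoleWrite("(?s):    " & UBound($aDotAll) & " match(es)" & @CRLF)
```

Under those assumptions the first call should report two matches and the second one. The counts of 21 and 11 quoted above come from the lazy pattern, which also produces empty matches between characters.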


Alright, trids, why doesn't this work?

...

Try Unique\s*(?:((\w+)\s*))* .. cos \w* means the word doesn't have to be there at all, so the group can match nothing (I think thomasl's earlier explanations about finding points between letters might apply here)
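
A small sketch for comparing the two quantifiers side by side, assuming a current AutoIt release; the array name $aPatterns is hypothetical. The presumed reason for the empty captures is that the repeated group runs one last time matching nothing, and that final pass is what the captures keep.

```autoit
; Sketch: \w* may match nothing, \w+ needs at least one word character.
Local $s = "Unique" & @CRLF & "Foo" & @CRLF & "Foo"
Local $aPatterns[2] = ["Unique\s*(?:((\w*)\s*))*", "Unique\s*(?:((\w+)\s*))*"]

For $i = 0 To UBound($aPatterns) - 1
    ; Flag 3 returns the captured groups of every match in one flat array.
    Local $a = StringRegExp($s, $aPatterns[$i], 3)
    ConsoleWrite($aPatterns[$i] & " -> " & UBound($a) & " element(s)" & @CRLF)
    For $j = 0 To UBound($a) - 1
        ConsoleWrite("|" & $a[$j] & "|" & @CRLF)
    Next
Next
```

With \w* the output posted in the thread shows two empty captures. With \w+ both groups should hold Foo, but still only from the last repetition, so this change alone does not return one entry per line.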

Don't take it all personally Valik, regexps can be fun once you've mastered them .. but until then it can be a bit of a challenge.

Having said that, I'm far from being able to claim that i've mastered them myself ;o)

HTH

Edit: gotto go .. i'll catch up again tmrw


Don't take it all personally Valik, regexps can be fun once you've mastered them .. but until then it can be a bit of a challenge.

Having said that, I'm far from being able to claim that i've mastered them myself ;o)

What's aggravating is that while I had not mastered them, I did think that I understood them. But everything I've learned so far is only serving to confuse me because these are behaving differently. And that's annoying me because the behavior I was used to made sense but some of this stuff doesn't make sense.

@Valik:

LOL .. well if it's any consolation, the earlier ones lost me a bit while these ones seem a little clearer. :lmao:

Anyway - don't give up, but i have to go (It's 19h35 here and there are mouths to feed). Catch you later.


  • Administrators

What's aggravating is that while I had not mastered them, I did think that I understood them. But everything I've learned so far is only serving to confuse me because these are behaving differently. And that's annoying me because the behavior I was used to made sense but some of this stuff doesn't make sense.

As they say in WoW - learn2regexp

:lmao:


trids, I'm afraid your patterns really weren't working (and in a way I'm relieved, because it means there was no voodoo going on). The data I was giving you was poor and could lead to misleading patterns. Here's a more realistic simulation of what I'm trying to accomplish.

Main()

Func Main()
    Local $s1 = "DataStart" & @CRLF & "DataA" & @CRLF & "DataB" & @CRLF & "DataEnd"
    Local $s2 = "DataStart" & @CRLF & "DataA" & @CRLF & "DataEnd"
    Local $p = "DataStart\s*(?:(\w+)\s*)*DataEnd"

    Local $a = StringRegExp($s1, $p, 3)
    ConsoleWrite("Matches (1): " & UBound($a) & @CRLF)
    For $i = 0 To UBound($a) - 1
        ConsoleWrite("|" & $a[$i] & "|" & @CRLF)
    Next
    $a = StringRegExp($s2, $p, 3)
    ConsoleWrite("Matches (2): " & UBound($a) & @CRLF)
    For $i = 0 To UBound($a) - 1
        ConsoleWrite("|" & $a[$i] & "|" & @CRLF)
    Next
EndFunc   ; Main()

Currently the output is:

Matches (1): 1
|DataB|
Matches (2): 1
|DataA|

I know the start and end positions of a block of text I need to process (represented by DataStart and DataEnd in the example above). What I do not know is how many lines of data there will be in between the start and end. I need a pattern that can extract each line of data between the start and end positions without knowing how many lines there will actually be. In the first example there are two, and in the second example there is one. With my test pattern, which obviously doesn't work, the first example returns only "DataB". The second example is correct, but only because it contains a single data line.

Does anybody know how to write a pattern to do what I want? I wanted to do it with a single call instead of two calls. I know I can just extract the entire block of data and then make a second call with a more specialized pattern to extract the data that I want in the format I want. Surely there is a way to do that with only a single call instead of two?
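
For reference, the two-call fallback described here could look roughly like the sketch below. It assumes a current AutoIt release, where flag 1 returns the captured groups of the first match and flag 3 returns every match. The Header and Footer lines are made-up filler standing in for the surrounding text.

```autoit
; Sketch of the two-call approach: isolate the block, then split it into lines.
Local $sText = "Header" & @CRLF & "DataStart" & @CRLF & "DataA" & @CRLF & _
        "DataB" & @CRLF & "DataEnd" & @CRLF & "Footer"

; Call 1: capture everything between DataStart and DataEnd.
; (?s) lets the dot run across line breaks; .*? stops at the first DataEnd.
Local $aBlock = StringRegExp($sText, "(?s)DataStart\s*(.*?)\s*DataEnd", 1)
If @error Then
    ConsoleWrite("No DataStart/DataEnd block found" & @CRLF)
    Exit
EndIf

; Call 2: every run of word characters inside the block is one data line.
Local $aLines = StringRegExp($aBlock[0], "\w+", 3)
For $i = 0 To UBound($aLines) - 1
    ConsoleWrite("|" & $aLines[$i] & "|" & @CRLF)
Next
```

On the sample text this should print |DataA| and |DataB|. Real data lines that contain spaces or punctuation would need a different second pattern than \w+.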


Does anybody know how to write a pattern to do what I want? I wanted to do it with a single call instead of two calls. I know I can just extract the entire block of data and then make a second call with a more specialized pattern to extract the data that I want in the format I want. Surely there is a way to do that with only a single call instead of two?

The key to this is called "multi line mode". If you try to match within a string that contains @CRLF (\r\n), you can switch the meaning of ^ and $ in PCRE from "start of string" and "end of string" to "start of line" and "end of line". The option that switches to multi line mode is PCRE_MULTILINE. See also: http://www.pcre.org/pcre.txt (search for PCRE_MULTILINE). Another option would be to match \r\n directly, but I failed to find a matching pattern with the current StringRegExp() implementation. I don't know if my patterns are wrong or if the implementation of StringRegExp() is not correct...
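
A minimal sketch of what multi line mode changes, assuming a current AutoIt release where the inline option (?m) corresponds to PCRE_MULTILINE. The sample uses @LF rather than @CRLF on purpose, because how $ treats \r\n depends on the newline convention the engine was built with.

```autoit
; Sketch: (?m) makes ^ and $ anchor at every line instead of the whole string.
Local $sText = "DataStart" & @LF & "DataA" & @LF & "DataB" & @LF & "DataEnd"

; Without (?m) the anchors refer to the whole string, so ^\w+$ cannot match
; a string that contains line breaks in the middle.
Local $aWhole = StringRegExp($sText, "^\w+$", 3)
Local $iWholeError = @error ; saved immediately, later calls reset @error
ConsoleWrite("Without (?m): @error = " & $iWholeError & @CRLF)

; With (?m) each line satisfies ^\w+$ on its own.
Local $aLines = StringRegExp($sText, "(?m)^\w+$", 3)
ConsoleWrite("With (?m): " & UBound($aLines) & " match(es)" & @CRLF)
```

The first call should fail with @error = 1 (no match), and the second should return four matches, one per line.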

Hint: "(.*)" should match more than just "DataStart", right?

Cheers

Kurt

__________________________________________________________
(l)user: Hey admin slave, how can I recover my deleted files?
admin: No problem, there is a nice tool. It's called rm, like recovery method. Make sure to call it with the "recover fast" option like this: rm -rf *


Kurt, I don't understand how any of that relates. I know how to change between single and multi-line modes and if I put it in single-line mode with the pattern "(?s)(.*)" it will return everything as expected. As for using the start/end anchors in either single or multi mode, how would that help? The block of text is within other text, it's not just a stand-alone string like I show in my example above. So how will the anchors in either mode help me?


Kurt, I don't understand how any of that relates. I know how to change between single and multi-line modes and if I put it in single-line mode with the pattern "(?s)(.*)" it will return everything as expected. As for using the start/end anchors in either single or multi mode, how would that help? The block of text is within other text, it's not just a stand-alone string like I show in my example above. So how will the anchors in either mode help me?

Well, I probably misinterpreted your question (and I forgot that \s also matches @CR and @LF). I was thinking about a pattern like this: "(?m)DataStart.*$(?:^(.*)$)*^DataEnd.*$". But that does not work with the current implementation of StringRegExp (and global search).

However, I believe there could be a problem with the global search anyway.

These two patterns should be equivalent as regular expressions.

Local $p = "DataStart\s*(?:(\w+)\s*)(?:(\w+)\s*)DataEnd"

and

Local $p = "DataStart\s*(?:(\w+)\s*){2}DataEnd"

However, the first one returns "DataA" and "DataB" for $s1, while the latter returns just "DataB" for $s1!?

AND the first pattern returns "Data" and "A" for $s2.

Any idea why?

One reason could be that the global search internally removes a found match from the string and then tries to match the pattern again.

Cheers

Kurt


I really don't understand, either, and I agree with your patterns in that they should be equivalent. However, and for what it's worth, everything I've tried with AutoIt, I've tried with the pcretest.exe program and it doesn't work there, either.


Hmm! Kurt's right, the two patterns *are* equivalent, but there's something funny going on. Take a look at the debug output from pcretest:

re> /DataStart\s*(?:(\w+)\s*){2}DataEnd/D
------------------------------------------------------------------
  0  69 Bra 0
  3  DataStart
 21  \s*
 23  13 Bra 0
 26   5 Bra 1
 29  \w+
 31   5 Ket
 34  \s*
 36  13 Ket
 39  13 Bra 0
 42   5 Bra 1
 45  \w+
 47   5 Ket
 50  \s*
 52  13 Ket
 55  DataEnd
 69  69 Ket
 72  End
------------------------------------------------------------------
Capturing subpattern count = 1
Partial matching not supported
No options
First char = 'D'
Need char = 'd'
data>
  re> /DataStart\s*(?:(\w+)\s*)(?:(\w+)\s*)DataEnd/D
------------------------------------------------------------------
  0  69 Bra 0
  3  DataStart
 21  \s*
 23  13 Bra 0
 26   5 Bra 1
 29  \w+
 31   5 Ket
 34  \s*
 36  13 Ket
 39  13 Bra 0
 42   5 Bra 2
 45  \w+
 47   5 Ket
 50  \s*
 52  13 Ket
 55  DataEnd
 69  69 Ket
 72  End
------------------------------------------------------------------
Capturing subpattern count = 2
Partial matching not supported
No options
First char = 'D'
Need char = 'd'
data>

There are 2 differences in the output between the two patterns. In the first pattern, there is " 42 5 Bra 1" but in the second there is " 42 5 Bra 2". Probably related: the first pattern reports only one capturing subpattern, while the second reports 2. This looks every bit like a bug to me. The expanded pattern is nearly identical and should be completely identical, but it's not.

Edit: Tried with PCRE 6.3 and 6.7 and the behavior is the same. I still think it's a bug in PCRE, though.
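
For comparison, the PCRE pattern documentation states that when a capturing subpattern is repeated, the value captured is the substring that matched the final iteration. Read that way, the "Bra 1" appearing twice is the same group being run twice, not a second group. The sketch below puts the two forms side by side, assuming a current AutoIt release; the variable names are illustrative.

```autoit
; Sketch: a quantifier repeats a group but does not create new group numbers.
Local $s1 = "DataStart" & @CRLF & "DataA" & @CRLF & "DataB" & @CRLF & "DataEnd"

; One pair of capturing parentheses, run twice: group 1 is overwritten on the second pass.
Local $aRepeated = StringRegExp($s1, "DataStart\s*(?:(\w+)\s*){2}DataEnd", 3)

; Two pairs of capturing parentheses: groups 1 and 2 each keep their own text.
Local $aSpelledOut = StringRegExp($s1, "DataStart\s*(?:(\w+)\s*)(?:(\w+)\s*)DataEnd", 3)

ConsoleWrite("{2} form:    " & UBound($aRepeated) & " capture(s)" & @CRLF)
ConsoleWrite("Spelled out: " & UBound($aSpelledOut) & " capture(s)" & @CRLF)
```

This agrees with the results reported earlier in the thread: the {2} form yields only DataB, while the spelled-out form yields DataA and DataB. The two patterns match the same text and differ only in how many capture slots exist.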


I tried with a couple online sites posted in this thread and they return the same thing. Even though I personally think the pattern is wrong, it is apparently right or the online sites are using PCRE. I guess I will just have to use two calls, one to extract the lines of data from the block and a second call to parse the data down to what I want.
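
One single-call alternative that PCRE-style engines offer is the \G anchor, which matches only where the previous match ended. The sketch below assumes a current AutoIt release whose StringRegExp supports \G and whose global search (flag 3) continues on the same subject string. The Header and Footer lines are made-up filler, and the pattern is an illustration, not something proposed in the thread.

```autoit
; Sketch: chain matches with \G so each data line becomes its own match.
Local $sText = "Header" & @CRLF & "DataStart" & @CRLF & "DataA" & @CRLF & _
        "DataB" & @CRLF & "DataEnd" & @CRLF & "Footer"

; (?:DataStart|\G(?!\A))  start at DataStart, or continue where the last match ended
;                         ((?!\A) stops \G from matching at the very start of the text)
; \s+                     skip the line break
; (?!DataEnd\b)           stop the chain once the end marker is next
; (\w+)                   capture one data line
Local $sPattern = "(?:DataStart|\G(?!\A))\s+(?!DataEnd\b)(\w+)"

Local $aLines = StringRegExp($sText, $sPattern, 3)
For $i = 0 To UBound($aLines) - 1
    ConsoleWrite("|" & $aLines[$i] & "|" & @CRLF)
Next
```

Under those assumptions this should print |DataA| and |DataB|. Note the limits of the sketch: it never checks that DataEnd is actually present, and a data line containing anything other than word characters would break the chain, so the two-call approach remains the more robust choice.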


I tried with a couple online sites posted in this thread and they return the same thing. Even though I personally think the pattern is wrong, it is apparently right or the online sites are using PCRE.

I also did some "online" tests and came to the same conclusion. I believe that a lot of them (if not all) use the PHP functions preg_match and preg_match_all, which are built upon PCRE. Well, there is not much choice anyway if you're looking for a regexp implementation.

Cheers

Kurt

