Unexpected StringRegExp results with (x)|(y) type patterns

jerodast · June 22, 2011

First of all this is my first post here; I didn't find a similar topic with a quick search and apologize if it's been mentioned before. Thanks in advance!

For context, I have a long HTML string and am trying to capture an attribute within a tag which may or may not be present. My HTML pattern has other capture groups before and after this particular tag, and I would like the numbering to stay consistent regardless of whether this tag is present. This discussion assumes StringRegExp mode 1 (or 3; I guarantee that at most one match is present, so the two are equivalent). To simplify the discussion I will use a very short example with analogous results.

Pattern "a(?:x(.)y)?b" will match "ab" and return no captures, and also match "axkyb" and capture "k". What I want is a pattern that behaves similarly but instead captures a single empty string when it matches "ab". That way the indices of subsequent captures are consistent regardless of which test string was matched.

The patterns I expected to work would be "a(?:()|x(.)y)b" and "a(?:x(.)y|())b": if the string is "ab", it will match the "()" side and thus capture an empty string, otherwise it will capture "(.)". However, this is what I found:

Pattern          Test String  Capture 0  Capture 1  Expected Capture (only 1)
a(?:()|x(.)y)b   ab           ""                    ""
a(?:()|x(.)y)b   axkyb        ""         "k"        "k"
a(?:x(.)y|())b   ab           ""         ""         ""
a(?:x(.)y|())b   axkyb        "k"                   "k"

This suggests the hypothesis that:

1) the left side of the | is checked first

2a) if the left is matched then the captures work correctly

2b) if not, any captures on the left side are returned as empty strings

3b) the right side is matched, with its captures working correctly but returned after the left-side empty strings

I experimented with several examples, including unequal numbers of captures on either side, further nesting, etc. All fail to disprove this hypothesis except for when there are zero capture groups on one side, in which case it behaves exactly like a ?, which is the behavior I originally expected. However, it counters the hypothesis:

Pattern        Test String  Capture  Hypothesized Capture
a(?:|x(.)y)b   ab                    
a(?:|x(.)y)b   axkyb        "k"      "k"
a(?:x(.)y|)b   ab                    ""
a(?:x(.)y|)b   axkyb        "k"      "k"

I hope I've shown how the | behavior is hard to explain with certain capture group arrangements. I can and probably will work around this, for example by capturing the entire "x(.)y"? area and having subsequent checks to capture the "." if the result was non-empty. But ideally I prefer to understand why it doesn't work as I thought. It seems unfortunate to not be able to maintain consistent capture group numbering when dealing with optional captures. Is this a bug? Intended? Simply not supported for "(m)|(n)"-type patterns? Am I missing a simple flag that would help with this?

Thanks for your insight!

Version: 3.3.6.1 (current release version as of this post date)

PsaltyDS · June 23, 2011

Not enough info to reproduce. Post the input string on which these patterns are used.


jerodast · June 23, 2011

The strings in the tables are exactly the strings I used. The equivalent code to demonstrate what I was talking about in the first table would be:

#include <Array.au3>

$a1 = StringRegExp("ab",   "a(?:()|x(.)y)b",1)  ; Expected: $a1 = [""]  by choosing the left  side and "capturing" an empty group
$a2 = StringRegExp("axkyb","a(?:()|x(.)y)b",1)  ; Expected: $a2 = ["k"] by choosing the right side and capturing the middle char
$a3 = StringRegExp("ab",   "a(?:x(.)y|())b",1)  ; Expected: $a3 = [""]  by choosing the right side and "capturing" an empty group
$a4 = StringRegExp("axkyb","a(?:x(.)y|())b",1)  ; Expected: $a4 = ["k"] by choosing the left  side and capturing the middle char

ConsoleWrite("'"&_ArrayToString($a1,"','")&"'"&@CRLF)  ; ''     - as expected
ConsoleWrite("'"&_ArrayToString($a2,"','")&"'"&@CRLF)  ; '','k' - for some reason it captured an empty group before the expected 'k'
ConsoleWrite("'"&_ArrayToString($a3,"','")&"'"&@CRLF)  ; '',''  - for some reason it captured an extra empty group
ConsoleWrite("'"&_ArrayToString($a4,"','")&"'"&@CRLF)  ; 'k'    - as expected

As I mentioned, I can and have worked around it by separating the task into sub-RegExps and extra conditionals, but I'm very curious why this is the behavior.

Hope that clarifies things. It's surprising how simple the patterns can be and still show these results; you just need any two capture groups on different sides of a |. One last, super trivial example:

$a1 = StringRegExp("m","(m)|(n)",1)  ; Expected: $a1 = ["m"] by choosing left side and capturing m
$a2 = StringRegExp("n","(m)|(n)",1)  ; Expected: $a2 = ["n"] by choosing right side and capturing n

ConsoleWrite("'"&_ArrayToString($a1,"','")&"'"&@CRLF)  ; 'm'    - as expected
ConsoleWrite("'"&_ArrayToString($a2,"','")&"'"&@CRLF)  ; '','n' - where did that empty capture come from?

Thanks for takin' a look!

(Edit: Removed example illustrating the second table, since that whole thing is really a sidetrack "disproving" a possible explanation. I'd rather just hear your reasoning.)

Edited June 23, 2011 by jerodast

GEOSoft · June 23, 2011

One reason is that you are using too many parentheses.

Try this on your second example:

"(m|n)"

Malkey · June 24, 2011

jerodast wrote: "... I'd rather just hear your reasoning"

Here's a thought.

"()" in the pattern captures nothing. As nothing always exists, or does not exist, that is, something does not get in the way. Then "()" will always return a captured group of nothing. Yes, it is that simple.

#include <Array.au3>

$a2 = StringRegExp("axkyb", "a(?:()()|x(.)y)b", 1) ; Expectation leads to disappointment.
$a3 = StringRegExp("ab", "a(?:x(.)y|()())b", 1)      ; Expect nothing, and you will never be disappointed.
$a4 = StringRegExp("", "()()()()", 1)   ; Nothing will come from nothing ya know what they say.

ConsoleWrite("'" & _ArrayToString($a2, "','") & "'" & @CRLF) ; '','','k' - ()() captures two empty groups before the expected 'k'
ConsoleWrite("'" & _ArrayToString($a3, "','") & "'" & @CRLF) ; '','',''  - ()() captures two extra empty groups.
ConsoleWrite("'" & _ArrayToString($a4, "','") & "'" & @CRLF) ; '','','','' - Captures 4 empty groups from nothing.

jchd · June 24, 2011

@jerodast,

To answer your last question:

$a2 = StringRegExp("n","(m)|(n)",1)  ; Expected: $a2 = ["n"] by choosing right side and capturing n
ConsoleWrite("'"&_ArrayToString($a2,"','")&"'"&@CRLF)  ; '','n' - where did that empty capture come from?

Realize that the engine numbers your capturing parenthesis pairs, then scans the pattern and the subject.

When it arrives at (m) it doesn't find a match, hence the first capture is ''. It then finds the alternation | and the (n), which matches, hence the second capture is 'n'.

The correct way to capture one (i.e. the first) match in an alternation is to make the pattern (m|n).
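A minimal sketch of that single-group form, assuming flag 1 as used elsewhere in this thread (the variable names are illustrative only):

```autoit
#include <Array.au3>

; One capturing group wraps the whole alternation: whichever branch matches,
; the text lands in group 1, so the returned array always has one element.
Local $aM = StringRegExp("m", "(m|n)", 1)
Local $aN = StringRegExp("n", "(m|n)", 1)

ConsoleWrite("'" & _ArrayToString($aM, "','") & "'" & @CRLF) ; 'm'
ConsoleWrite("'" & _ArrayToString($aN, "','") & "'" & @CRLF) ; 'n'
```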

Since I suppose a, b, x, k, y, m, n are placeholders in your examples which stand for more complex patterns, you should probably know there is another kind of alternation (albeit more complex) which is available in PCRE and allows IF..ELSEIF..ELSEIF..ENDIF constructs:

( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ )

If COND1 matches then FOO is tried, else if COND2 matches then BAR is tried, ...

jerodast · June 25, 2011

Thanks for your responses!

@Malkey, I do understand () captures outside of situations involving |, but it's some of those | cases where I'm missing something. Take your $a2 example but flip the order of the pattern around the |:

$a2 = StringRegExp("axkyb", "a(?:x(.)y|()())b", 1)
; If () "will always capture an empty group" as you say, then why does this return only one "k" and no empty groups?

We also have your $a3 example. In that one, the question isn't "why are the two empty groups being captured" at all, the question is "why is a (.) group being captured as empty? Isn't that against the very definition of what a (.) could possibly capture?"

Love your comments, by the way

@jchd

You call the | an "alternation". I was under the impression it was a "disjunction" (an "or"), so it would pick one side OR the other. I just find it strange that the (m) capture could NOT be a match and still be a capture. How can you capture a match that isn't a match? Nonetheless, if you look at my OP you'll see that I did think your explanation was correct. But it did not explain the third entry in my second table. Let me give another example:

$a = StringRegExp("n","(m)|n",1)  ; Does not result in any captures

By your explanation: "When it arrives at (m) it doesn't find a match, hence the first capture is '', then it finds the alternation | and the n which matches, but does not contain a second capture." But this is not what we see, instead there are simply no captures at all. All I did was remove the capture from the right side, yet somehow this changed what was captured on the left side too! So that's why I thought we're missing something. Perhaps this is just a special case by design.

Your multi-part conditionals look interesting, but I think it's getting away from my original question: Why do simple |'s behave like they do?

@jchd @GEOSoft

I get that "(m|n)" produces the desired results. But it doesn't explain why "(m)|(n)" doesn't work. I chose that to give an ultra-simple case with which to examine the core behavior of the | operator, regardless of if there's another form. To me, just because the first form works shouldn't mean that the second form shouldn't work. Here's a slightly more complex example: "a(.)b|x(.)y". Afaik this can't be simplified so easily.

But really, my question boils down to how "(m)|(n)" works, and why it was designed that way, regardless of whether some workaround exists.

Edited June 25, 2011 by jerodast

jerodast · June 25, 2011

I just want to re-summarize my best explanation for how capturing within different sides of the | character appears to work, in pseudocode:

If neither side matches Then
    Return "no match" (not an invalid pattern; StringRegExp sets @error = 1)

ElseIf left side matches Then
    Return captures on the left side in the standard way
    ; ex: patterns "(m)|n" or "(m)|(n)" matching "m" returns "m"

Else ; right side matches
    If right side contains no capture groups Then
        Return no captures
        ; ex: pattern "(m)|n" matching "n" returns nothing
    Else ; right side contains capture groups
        For every capture group in the left side
            Return "" (regardless of whether the actual capture group is empty or not)
        Return captures on the right side in the standard way
        ; ex: pattern "(m)|(n)" matching "n" returns "","n"

; "in the standard way" means return captures on that side as if only that side had been in the pattern in the first place

If this is inaccurate, correct me. If there is a simpler algorithm that's also correct, enlighten us. But this is what I've concluded after some experimentation. I'll post again shortly as to why it's unintuitive and buggy/flawed design.

jchd · June 25, 2011

PCRE is _way_ more complex and smart than you believe. Not only that, but it also needs to stick to Perl behavior as closely as possible (remember it's Perl Compatible Regular Expressions).

I refer you to the voluminous documentation found on the PCRE website, the pcretest.exe test program and the source code if you wish to have a look at it.

FYI, here's the pcretest output (including generated bytecode) for some of your patterns/subjects:

C:\msys\1.0\home\Jean-Christophe\pcre\pcre-8.12\.libs>pcretest -d
PCRE version 8.12 2011-01-15
re> :(m)|(n):
------------------------------------------------------------------
0 13 Bra
3 7 CBra 1
8 m
10 7 Ket
13 13 Alt
16 7 CBra 2
21 n
23 7 Ket
26 26 Ket
29 End
------------------------------------------------------------------
Capturing subpattern count = 2
No options
No first char
No need char
data> m
0: m
1: m
data> n
0: n
1: <unset>
2: n
data>
re> :(m|n):
------------------------------------------------------------------
0 18 Bra
3 7 CBra 1
8 m
10 5 Alt
13 n
15 12 Ket
18 18 Ket
21 End
------------------------------------------------------------------
Capturing subpattern count = 1
No options
No first char
No need char
data> m
0: m
1: m
data> n
0: n
1: n
data>
re> :(m)|n:
------------------------------------------------------------------
0 13 Bra
3 7 CBra 1
8 m
10 7 Ket
13 5 Alt
16 n
18 18 Ket
21 End
------------------------------------------------------------------
Capturing subpattern count = 1
No options
No first char
No need char
data> m
0: m
1: m
data> n
0: n
data>
re> :([mn]):
------------------------------------------------------------------
0 44 Bra
3 38 CBra 1
8 [mn]
41 38 Ket
44 44 Ket
47 End
------------------------------------------------------------------
Capturing subpattern count = 1
No options
No first char
No need char
data> m
0: m
1: m
data> n
0: n
1: n
data>
re>

GEOSoft · June 25, 2011

Just to add to what jchd has already stated. AutoIt uses a custom implementation of the PCRE engine, a sub-set if you will, and as such it won't do what you think it will in all situations.

Someday that may change but for now it's what it is.

jerodast · June 25, 2011


Fair enough, I guess I can accept that.

Thanks jchd for the pointer to the PCRE documentation, I think that's what I've really been missing the whole time. As I said, my "algorithm" was only what I could surmise from my test examples; I figured there was *something* bigger motivating the behavior from behind the scenes.

jchd · June 25, 2011

Forget what some senile MVP just said about AutoIt having a non-standard PCRE engine. This fellow is trying to get you more confused.

Of course I'm gently kidding here: GEOSoft is very helpful and knowledgeable, but here he's probably confusing it with RegexBuddy or something else.

jerodast · June 27, 2011

Okay, reading the PCRE docs explained enough that I suddenly realized the problem. For any noob trying to figure out the same thing I am:

Regardless of whether the capture group is optional or in a | branch or whatever, every single one will be numbered and returned in the array in the same order every time (specifically in the order their left parentheses occur). Capture groups that were optional and did not get matched will appear as empty strings. (I initially assumed ONLY matched captures would be returned, but this way is actually better due to the numbering consistency.)
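To illustrate the numbering rule, here is a small sketch with nested groups. The subject "abc" and the pattern are made up for the example, and every group takes part in the match, so no empty entries are involved:

```autoit
#include <Array.au3>

; Groups are numbered by the position of their opening parenthesis:
;   "(a(b))(c)"  ->  group 1 = (a(b)), group 2 = (b), group 3 = (c)
Local $aCaptures = StringRegExp("abc", "(a(b))(c)", 1)

If Not @error Then
    ConsoleWrite("'" & _ArrayToString($aCaptures, "','") & "'" & @CRLF) ; 'ab','b','c'
EndIf
```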

So why were the results I was seeing inconsistent (seemingly changing the number of results depending on what order branches were in or how many capture groups were on one side or another)? I was looking in the wrong place, the RegEx matching was actually working just fine - it's just that the array that gets returned apparently truncates any of those empty strings from the end, making it LOOK like the pattern had weird numbers of returns. So regEx("m","(m)|(n)") = ["m",""] but you only see the truncated ["m"], whereas regEx("n","(m)|(n)") = ["","n"] as it should.

Working around it is simple: Slap a capture on the end of your pattern that it won't truncate, and ignore that final capture. Luckily, it does not truncate empty strings if they came from a "()" capture group, it only truncates if it came from a capture group that was never matched at all due to it being optional. So pattern "(m)|(n)" becomes "(?:(m)|(n))()", and with that you will always see a three element array corresponding to the three capture groups (though you will only be interested in the first two). Similarly, "(\d)?()" will always return a two element array.
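A sketch of that workaround, using the pattern given above. The element counts mentioned in the comments are the ones reported in this post from experimentation, not something taken from the documentation:

```autoit
#include <Array.au3>

; The trailing "()" acts as a sentinel: it always takes part in a successful
; match, so it is always the highest-numbered group that gets returned.
Local $sPattern = "(?:(m)|(n))()"

Local $aM = StringRegExp("m", $sPattern, 1)
Local $aN = StringRegExp("n", $sPattern, 1)

; Reported above: three elements in both cases; only the first two are of interest.
ConsoleWrite(UBound($aM) & ": '" & _ArrayToString($aM, "','") & "'" & @CRLF)
ConsoleWrite(UBound($aN) & ": '" & _ArrayToString($aN, "','") & "'" & @CRLF)
```

With the sentinel in place, group 1 is always at index 0 and group 2 at index 1, and the last element can simply be ignored.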

I must again disclaim that some of my statements are based on experimentation rather than documentation, specifically the truncating. The group numbering is straight from the PCRE docs, which allowed me to assume that it was implemented correctly and try to explain why I would then see inconsistent arrays. The truncating theory explains it, and at this point I'm just happy I finally understand how the matching is working. Thanks a ton for pointing me in the right direction to figure this out!

jchd · June 27, 2011


I'm sorry to differ (again), but what you get is not what you describe.

In NO case would AutoIt remove an entry from an array and turn ["m",""] into ["m"]. Such terrible behavior would make programming more a gamble than a science. Your truncating theory collapses, sorry for that.

It's just that PCRE finds a match on (m) with m and stops there, having reached the end of both subject and pattern. The engine doesn't even reach the (n) part of the pattern since the alternation has already been satisfied. Only one result is returned, for capturing group #1, so AutoIt returns a one-element array (declared as $res[1]): ["m"].

With regEx("n","(m)|(n)") = ["","n"], the engine first meets capturing group #1, which doesn't match anything, then goes on to the (n) part of the alternation, which matches. It then returns _one_ captured group, "n", as group #2. Since AutoIt returns an array of results with the option used, it needs to return a two-element array (declared as $res[2]) with the second element set to "n". AutoIt has only one way to inform you that the result is group #2: the first element _must_ correspond to a non-match for group #1 (or an empty match, which is the same from AutoIt's point of view), hence it returns ["","n"].

In general it is your responsibility as a programmer, or as a user of a regexp engine, to devise your capturing groups so that the result you get is easily dealt with. In your case, the pattern returns a variable-size array, and this is avoidable as we have seen. To facilitate your programming and further maintenance, it is recommended to favor the simplest pattern delivering the most predictable information. The fact that in your case the array has one or two elements, and that the meaningful result is either in $res[0] or $res[1], only requires extra code and care in your application. There are indeed cases where such an unbalanced result is unavoidable, but they are rare in everyday use.
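When the variable array size cannot be avoided, one way to absorb it in application code is a small accessor that treats a missing trailing element as an empty capture. This is only a sketch: _CaptureOrEmpty is a hypothetical helper written for this example, not an AutoIt or UDF function.

```autoit
; Hypothetical helper (not part of AutoIt): returns element $iIndex of a
; StringRegExp result, or "" when the call failed or the array is shorter
; than expected because the higher-numbered groups did not take part.
Func _CaptureOrEmpty($aCaptures, $iIndex)
    If Not IsArray($aCaptures) Then Return ""
    If $iIndex < 0 Or $iIndex >= UBound($aCaptures) Then Return ""
    Return $aCaptures[$iIndex]
EndFunc

Local $aRes = StringRegExp("m", "(m)|(n)", 1)

ConsoleWrite("group 1: '" & _CaptureOrEmpty($aRes, 0) & "'" & @CRLF) ; group 1: 'm'
ConsoleWrite("group 2: '" & _CaptureOrEmpty($aRes, 1) & "'" & @CRLF) ; group 2: ''
```

The calling code can then always ask for group 1 and group 2 by fixed index, whichever branch of the alternation matched.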

You seem to be keen on understanding things under the surface. I warmly recommend you grab a copy of the bible of regexp:

"Mastering Regular Expressions", Jeffrey Friedl, O'Reilly, ISBN 1-56592-257-3

A used copy of an early edition will do and cost you a beer.
