regular expression engine: bug in interpreting ".*?"

Imbuter2000 · March 15, 2012

While troubleshooting my html parsing code I found a strange bug(?) in the regular expression engine in AutoIt:

$html_string = "<tag><subtag>text</subtag><tag>"
$match1 = StringRegExp($html_string,"<.*?>text<.*?>",3)
$match2 = StringRegExp($html_string,"<[^>]*>text<.*?>",3)
msgbox(0,"",$match1[0])  ;  AutoIt displays the result  "<tag><subtag>text</subtag>"  (WHY???)
msgbox(0,"",$match2[0])  ;  AutoIt displays the result  "<subtag>text</subtag>"

...why is $match1 not identical to $match2?

bogQ · March 15, 2012

I don't know the reason, but your result is valid according to

http://www.regular-expressions.info/javascriptexample.html

and note that the match is from position 0

JohnQSmith · March 15, 2012

While troubleshooting my html parsing code I found a strange bug(?) in the regular expression engine in AutoIt:

No bug, it's doing exactly what you told it to do.

$html_string = "<tag><subtag>text</subtag><tag>"
$match1 = StringRegExp($html_string,"<.*?>text<.*?>",3)
$match2 = StringRegExp($html_string,"<[^>]*>text<.*?>",3)
msgbox(0,"",$match1[0])  ;  AutoIt displays the result  "<tag><subtag>text</subtag>"  (WHY???)
msgbox(0,"",$match2[0])  ;  AutoIt displays the result  "<subtag>text</subtag>"

...why is $match1 not identical to $match2?

They differ because the first wildcard in $match2, [^>]*, cannot cross a closing bracket (">"), while the ".*?" in $match1 can.

Let's break down what $match1 is doing.

When reading the $html_string, $match1 starts at the beginning and grabs the first "<".

It then keeps grabbing characters (non-greedy) until it reaches the first instance of ">text<",

then keeps grabbing the minimum (non-greedy) number of characters until it finds the next ">".

So basically your first wildcard in $match1 is saying "give me the minimum number of characters that fall between < and >text<", whereas your first wildcard in $match2 is saying "give me all characters that are not > that fall between < and >text<".
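
To see that the starting position, and not the lazy quantifier, decides the result, here is a minimal sketch. It assumes the optional fourth parameter of StringRegExp (a 1-based offset at which the search begins); the variable names are only for illustration.

```autoit
; Sketch: the lazy quantifier does not change where the match starts.
Local $sHtml = "<tag><subtag>text</subtag><tag>"
Local $sPattern = "<.*?>text<.*?>"

; Default search: the engine tries position 1 first, and a match exists there.
Local $aFromStart = StringRegExp($sHtml, $sPattern, 3)

; Fourth parameter = 1-based offset where the search begins.
; Position 6 is the "<" of "<subtag>", so the leading "<tag>" is skipped.
Local $aFromSubtag = StringRegExp($sHtml, $sPattern, 3, 6)

ConsoleWrite($aFromStart[0] & @CRLF)  ; <tag><subtag>text</subtag>
ConsoleWrite($aFromSubtag[0] & @CRLF) ; <subtag>text</subtag>
```

The pattern is identical in both calls. Only the position where the engine is allowed to begin differs, which is why the first call returns the longer string.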

Edit:

Here are two more lines of code to add to your script.

$match3 = StringRegExp($html_string,"(?<=>)<.*?>text<.*?>",3)
msgbox(0,"",$match3[0])  ;  AutoIt displays the result  "<subtag>text</subtag>"

$match3 is the same as $match1, except that I've added a positive lookbehind in front of your regular expression. This only allows a match that starts immediately after a ">".

Note that this only works for this example. If you change your $html_string to

$html_string = "<pretag><tag><subtag>text</subtag></tag></pretag>"

$match3 will give you "<tag><subtag>text</subtag>" again.
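
A short sketch of that longer string, comparing the lookbehind pattern with the negated character class (variable names are illustrative only):

```autoit
; Sketch: the lookbehind version versus the negated character class.
Local $sHtml = "<pretag><tag><subtag>text</subtag></tag></pretag>"

; The first "<" that follows a ">" is the one in "<tag>", so the match starts there.
Local $aLookbehind = StringRegExp($sHtml, "(?<=>)<.*?>text<.*?>", 3)

; [^>]* cannot cross a ">", so only "<subtag>" can sit directly before "text".
Local $aNegClass = StringRegExp($sHtml, "<[^>]*>text<.*?>", 3)

ConsoleWrite($aLookbehind[0] & @CRLF) ; <tag><subtag>text</subtag>
ConsoleWrite($aNegClass[0] & @CRLF)   ; <subtag>text</subtag>
```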

Basically, I think $match2 is your best bet.

Edited March 15, 2012 by JohnQSmith

Imbuter2000 · March 17, 2012

OK, thanks guys, you're right, and I learned something new about the lazy quantifier.

But I still think that the plain-English explanation of the ".*?" in "<.*?>text<" is not simply "give me the minimum number of characters that fall between < and >text<".

In particular, saying "minimum number of characters" is flawed or unclear, as demonstrated by the fact that in my opening case it takes

"<tag><subtag>text</subtag>" instead of the shorter "<subtag>text</subtag>".

This brings to mind a possibly impossible(?) problem:

Suppose you have an HTML source similar to this:

"random_text_and_tags1<div>random_text_and_tags2<div>random_text_and_tags3</div>random_text_and_tags4</div>random_text_and_tags5"

How would you take the inner <div> part (i.e.: "<div>random_text_and_tags3</div>")?

The only solution that I found is to take group 1 from the regex ".*(<div>.*?</div>)".
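
For reference, a minimal sketch of the group approach on the sample string. It relies on StringRegExp flag 1 returning the captured groups when the pattern contains any; $aInner is an illustrative name.

```autoit
; Sketch: a greedy ".*" pushes the group to the LAST "<div>" in the string.
Local $sHtml = "random_text_and_tags1<div>random_text_and_tags2<div>random_text_and_tags3</div>random_text_and_tags4</div>random_text_and_tags5"

; Flag 1 returns the captured groups when the pattern has any,
; so element 0 is group 1, not the whole match.
Local $aInner = StringRegExp($sHtml, ".*(<div>.*?</div>)", 1)

ConsoleWrite($aInner[0] & @CRLF) ; <div>random_text_and_tags3</div>
```

The greedy ".*" first runs to the end of the string and then backs off one character at a time, so the group ends up at the last position where "<div>.*?</div>" can still match.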

Does a solution without using groups exist?

Edited March 17, 2012 by Imbuter2000

jchd · March 17, 2012

To match embedded constructs like these you need recursion to parse (or match or extract part of) them.

I warmly recommend you have a good read of the complete official PCRE documentation (AutoIt uses PCRE, albeit sometimes a couple of versions behind the latest release). The complete documentation comes with the PCRE source tarball that you can find at

ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre/

Be warned that there may be some discrepancies between the latest documentation and the PCRE version in AutoIt, but they only affect very dark corners and advanced features.
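
As a rough illustration of what recursion buys you, here is a sketch that matches a whole <div> including a nested one. It assumes the PCRE build in your AutoIt version supports the (?R) construct; the short test string and the pattern are made up for this example and do not come from the PCRE documentation.

```autoit
; Sketch: match a balanced <div>...</div> pair, nested pairs included.
Local $sHtml = "a<div>b<div>c</div>d</div>e"

; (?R) re-enters the whole pattern, so a nested <div>...</div> is consumed
; as one unit; (?!</?div>). consumes any other single character.
Local $sPattern = "<div>(?:(?!</?div>).|(?R))*</div>"

Local $aOuter = StringRegExp($sHtml, $sPattern, 3)
If @error Then
    ConsoleWrite("No match, or the pattern was rejected by this PCRE build" & @CRLF)
Else
    ConsoleWrite($aOuter[0] & @CRLF) ; <div>b<div>c</div>d</div>
EndIf
```

Without the (?R) branch the pattern could only match a <div> that contains no other <div>.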

Imbuter2000 · March 17, 2012

Interesting!

Is there a tutorial that talks specifically about HTML parsing using recursive regular expressions?

jchd · March 17, 2012

Not that I know of. Anyway, I'd recommend using the AutoIt IE UDF rather than parsing the HTML by regexp, regardless of how much regexp-fu you have. The IE functions do the hard work of dissecting the numerous HTML constructs for you very efficiently and robustly, while regexp parsing is subject to unexpected failures when, for instance, a server decides to insert random whitespace (tabs, linefeeds, spaces) at almost every point in the HTML source. To cope with that you have to allow for \s* everywhere, which makes your pattern incredibly heavy and almost unmaintainable.

Imbuter2000 · March 17, 2012

I think I've found a simple regex solution!!! Here it is: <div>((?!<div>).)*?</div>

I really really want to do my extractions via the IE UDF but I find it very very difficult to do so.

For example, if I want to extract the titles of the Google search results, I open http://www.google.it/#q=foobar and see with the DebugBar that the code for the title "foobar2000 - Wikipedia" is:

"<a class="l" onmousedown="return rwt(this,'','','','6','AFQjCNG5l1JlHEfLHSE1yqxjOCBlWP5Z4A','','0CFoQFjAF',null,event)" href="http://it.wikipedia.org/wiki/Foobar2000"><em>foobar2000</em> - Wikipedia</a>"

After I attach to the IE window with _IEAttach, obtaining $oIE, how do I create an array with the title "foobar2000 - Wikipedia" and all the other titles on the page?

jchd · March 17, 2012

Does your simple regexp solution work with nested tags?

About IE: make a distinct post with your example to attract attention of experienced IE users.

Imbuter2000 · March 18, 2012

I'm not sure what nested tags are, but I can tell you that <div>((?!<div>).)*?</div> captures the innermost divs even if there are other tags inside!

Edited March 18, 2012 by Imbuter2000

hawkair · March 18, 2012

@Imbuter2000

I don't know about complex texts, but for the example you gave I would use this:

$txt = "random_text_and_tags1<div>random_text_and_tags2<div>random_text_and_tags3</div>random_text_and_tags4</div>random_text_and_tags5"

$Pattern = "<div>[^<]*</div>"

that is:

<div>:search for "<div>"

[^<]*:match all following characters that are not "<"

</div>:it must be followed by </div>

only <div>random_text_and_tags3</div> matches that

Imbuter2000 · March 18, 2012

About IE: make a distinct post with your example to attract attention of experienced IE users.

I just created this new topic for it:

JohnQSmith · March 19, 2012

But I still think that the plain-English explanation of the ".*?" in "<.*?>text<" is not simply "give me the minimum number of characters that fall between < and >text<".
In particular, saying "minimum number of characters" is flawed or unclear, as demonstrated by the fact that in my opening case it takes
"<tag><subtag>text</subtag>" instead of the shorter "<subtag>text</subtag>".

My description of the "minimum number of characters that fall between < and >text<" is correct and absolutely clear. You are just missing the fact that the starting character is the < in front of "tag", not the < in front of "subtag".

Regular expressions are VERY PARTICULAR about what they return. They do exactly what you TELL them to do, not what you WISH them to do. It's not a matter of interpretation.

<tag><subtag>text</subtag></tag>
^
start at <

now find >text<

<tag><subtag>text</subtag></tag>
<>text<                             not found
<tag><subtag>text</subtag></tag>
<->text<                            not found
<tag><subtag>text</subtag></tag>
<-->text<                           not found
<tag><subtag>text</subtag></tag>
<--->text<                          not found
<tag><subtag>text</subtag></tag>
<---->text<                         not found
<tag><subtag>text</subtag></tag>
<----->text<                        not found
<tag><subtag>text</subtag></tag>
<------>text<                       not found
<tag><subtag>text</subtag></tag>
<------->text<                      not found
<tag><subtag>text</subtag></tag>
<-------->text<                     not found
<tag><subtag>text</subtag></tag>
<--------->text<                    not found
<tag><subtag>text</subtag></tag>
<---------->text<                   not found
<tag><subtag>text</subtag></tag>
<--- .*? --->text<                  FOUND      .*?  =  tag><subtag
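
The trace above can be reproduced with a small loop. This is only a sketch of the idea, not how the engine is actually implemented: it anchors the pattern at the first "<" and widens the gap one character at a time until ">text<" fits. $n and $iLen are names made up for the illustration.

```autoit
; Sketch: emulate the lazy expansion of .*? from the fixed starting "<".
Local $sHtml = "<tag><subtag>text</subtag></tag>"
Local $iLen = -1

For $n = 0 To StringLen($sHtml)
    ; "^<" pins the start to the first character; .{n} stands for a gap of exactly n characters.
    If StringRegExp($sHtml, "^<.{" & $n & "}>text<") Then
        $iLen = $n
        ExitLoop
    EndIf
Next

ConsoleWrite("Smallest gap that works: " & $iLen & " characters" & @CRLF) ; 11
ConsoleWrite("Gap content: " & StringMid($sHtml, 2, $iLen) & @CRLF)       ; tag><subtag
```

The gap is minimal for the given start, but the start itself is never moved to the right as long as some match exists from it.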

Imbuter2000 · March 20, 2012

My description of the "minimum number of characters that fall between < and >text<" is correct and absolutely clear. You are just missing the fact that the starting character is the < in front of "tag", not the < in front of "subtag".

I'm not missing that fact; your definition was missing it. You didn't specify that the starting character is the leftmost "<" rather than the rightmost. When you write "minimum number of characters", I think most people assume the rightmost, because that would capture fewer characters.

Anyhow, now I know how "*?" works, i.e. from the leftmost start to the leftmost possible end, so I'm only objecting to the plain-English definition...

Edited March 20, 2012 by Imbuter2000

Imbuter2000 · March 20, 2012

@Imbuter2000
I don't know about complex texts, but for the example you gave I would use this:
$txt = "random_text_and_tags1<div>random_text_and_tags2<div>random_text_and_tags3</div>random_text_and_tags4</div>random_text_and_tags5"
$Pattern = "<div>[^<]*</div>"

that is:
<div>:search for "<div>"
[^<]*:match all following characters that are not "<"
</div>:it must be followed by </div>

only <div>random_text_and_tags3</div> matches that

Hi hawkair, you forgot that random_text_and_tags3 can contain other tags so it would not work.

A working solution uses a negative lookahead:

<div>(?:(?!<div).)*?</div>
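
A small sketch comparing the two patterns on a string whose inner <div> contains another tag. The test string is made up for this example.

```autoit
; Sketch: character class versus negative lookahead on an inner <div> with a nested tag.
Local $sHtml = "x<div>y<div>inner <b>bold</b> text</div>z</div>w"

; [^<]* cannot cross the "<" of "<b>", so nothing matches and @error is set.
StringRegExp($sHtml, "<div>[^<]*</div>", 3)
Local $bClassFailed = (@error <> 0)

; (?!<div). consumes one character at a time but refuses to step onto a new "<div".
Local $aInner = StringRegExp($sHtml, "<div>(?:(?!<div).)*?</div>", 3)

ConsoleWrite("Character class found nothing: " & $bClassFailed & @CRLF) ; True
ConsoleWrite($aInner[0] & @CRLF) ; <div>inner <b>bold</b> text</div>
```

Note that with flag 3 AutoIt returns the captured groups when the pattern contains any, so the non-capturing (?:...) form is the convenient one if you want the whole match back.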

Edited March 20, 2012 by Imbuter2000
