Jump to content
Sign in to follow this  
Imbuter2000

regular expression engine: bug in interpreting ".*?"

Recommended Posts

Imbuter2000

While troubleshooting my html parsing code I found a strange bug(?) in the regular expression engine in AutoIt:

$html_string = "<tag><subtag>text</subtag><tag>"
$match1 = StringRegExp($html_string,"<.*?>text<.*?>",3)
$match2 = StringRegExp($html_string,"<[^>]*>text<.*?>",3)
msgbox(0,"",$match1[0])  ;  AutoIt display the result  "<tag><subtag>text</subtag>"  (WHY???)
msgbox(0,"",$match2[0])  ;  AutoIt display the result  "<subtag>text</subtag>"

...for what strange reason $match1 should not be identical to $match2?

Share this post


Link to post
Share on other sites
bogQ

don't know the reason but your result is valid according to

http://www.regular-expressions.info/javascriptexample.html

and note that the match is from position 0

post-24508-0-12524700-1331824884_thumb.j


TCP server and client - Learning about TCP servers and clients connection
Au3 oIrrlicht - Irrlicht project
Au3impact - Another 3D DLL game engine for autoit. (3impact 3Drad related)



460px-Thief-4-temp-banner.jpg
There are those that believe that the perfect heist lies in the preparation.
Some say that it’s all in the timing, seizing the right opportunity. Others even say it’s the ability to leave no trace behind, be a ghost.

 

Share this post


Link to post
Share on other sites
JohnQSmith

While troubleshooting my html parsing code I found a strange bug(?) in the regular expression engine in AutoIt:

No bug, it's doing exactly what you told it to do.

$html_string = "<tag><subtag>text</subtag><tag>"
$match1 = StringRegExp($html_string,"<.*?>text<.*?>",3)
$match2 = StringRegExp($html_string,"<[^>]*>text<.*?>",3)
msgbox(0,"",$match1[0])  ;  AutoIt display the result  "<tag><subtag>text</subtag>"  (WHY???)
msgbox(0,"",$match2[0])  ;  AutoIt display the result  "<subtag>text</subtag>"

...for what strange reason $match1 should not be identical to $match2?

It doesn't match because $match2 is excluding any enclosed close brackets.

Let's break down what $match1 is doing.

When reading the $html_string, $match1 starts at the beginning and grabs the "<"

It then continues grabbing characters (non-greedy) until it finds the first instance ">text<"

then continues grabbing the minimum (non-greedy) number of characters until it finds the next ">"

So basically your first wildcard in $match1 is saying "give me the minimum number of characters that fall between < and >text<", whereas your first wildcard in $match2 is saying "give me all characters that are not > that fall between < and >text<".

Edit:

Here are two more lines of code to add to your script.

$match3 = StringRegExp($html_string,"(?<=>)<.*?>text<.*?>",3)
msgbox(0,"",$match3[0])  ;  AutoIt displays the result  "<subtag>text</subtag>"

$match3 is the same as $match1, except that I've added a positive lookbehind in front of your regular expression. This forces it to find your match as long as it is preceded by a ">".

Note that this only works for this example. If you change your $html_string to

$html_string = "<pretag><tag><subtag>text</subtag></tag></pretag>"

$match3 will give you "<tag><subtag>text</subtag>" again.

Basically, I think $match2 is your best bet.

Edited by JohnQSmith

Whenever someone says "pls" because it's shorter than "please", I say "no" because it's shorter than "yes".

Share this post


Link to post
Share on other sites
Imbuter2000

Ok, thanks guy, you're right and I learned a new thing about the lazy quantifier.

But I still think that the plain english explanation of the ".*?" in "<.*?>text<" is not simply "give me the minimum number of characters that fall between <and >text<.

In particular saying "minimum number of characters" is defective of not clear, as demonstred by the fact that in my opening case it takes

"<tag><subtag>text</subtag>" instead of shorter "<subtag>text</subtag>".

At this point it comes to my mind an impossible(?) problem:

suppose to have an HTML source similar to this:

"random_text_and_tags1<div>random_text_and_tags2<div>random_text_and_tags3</div>random_text_and_tags4</div>random_text_and_tags5"

How would you take the inner <div> part (i.e.: "<div>random_text_and_tags3</div>")?

The only solution that I found is to take the group1 out from a regex ".*(<div>.*?</div>)".

Does a solution without using groups exist?

Edited by Imbuter2000

Share this post


Link to post
Share on other sites
jchd

To match embedded constructs like these you need recursion to parse (or match or extract part of) them.

I warmly recommend you have a good read of the complete official PCRE documentation (AutoIt uses PCRE, albeit sometimes a couple versions behind latest release). The complete documentation comes with the PCRE source tarball that you can find at

ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre/

Be warned that there may be some discrepancies between latest doc and AutoIt version, but they only affect very dark corners and advanced features.


This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)

Share this post


Link to post
Share on other sites
Imbuter2000

To match embedded constructs like these you need recursion to parse (or match or extract part of) them.

I warmly recommend you have a good read of the complete official PCRE documentation (AutoIt uses PCRE, albeit sometimes a couple versions behind latest release). The complete documentation comes with the PCRE source tarball that you can find at

ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre/

Be warned that there may be some discrepancies between latest doc and AutoIt version, but they only affect very dark corners and advanced features.

Interesting!

Is there a tutorial that talks specificly about HTML parsing using recursive regular expressions?

Share this post


Link to post
Share on other sites
jchd

Not that I know. Anyway, I'd recommend using the AutoIt IE UDF rather than parsing the html by regexp independantly of the regexp-fu you have. IE functions do the hard work of dissecting the numerous html constructs for you very efficiently and robustly, while regexp parsing is subject to unexpected failures when, for instance, a server decides to insert random whitespaces (tabs, linefeeds, spaces) at almost every point in the html source. To cope with that you have to allow for s* everywhere which makes your pattern incredibly heavy and almost unmaintainable.


This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)

Share this post


Link to post
Share on other sites
Imbuter2000

I think I've found a simple regex solution!!! Here it is: <div>((?!<div>).)*?</div>

Not that I know. Anyway, I'd recommend using the AutoIt IE UDF rather than parsing the html by regexp independantly of the regexp-fu you have. IE functions do the hard work of dissecting the numerous html constructs for you very efficiently and robustly, while regexp parsing is subject to unexpected failures when, for instance, a server decides to insert random whitespaces (tabs, linefeeds, spaces) at almost every point in the html source. To cope with that you have to allow for s* everywhere which makes your pattern incredibly heavy and almost unmaintainable.

I really really want to do my extractions via the IE UDF but I find it very very difficult to do so.

For example if I want to extract the titles of the Google search results, I open for example http://www.google.it/#q=foobar, I see with the DebugBar that the code for example of the title "foobar2000 - Wikipedia" is:

"<a class="l" onmousedown="return rwt(this,'','','','6','AFQjCNG5l1JlHEfLHSE1yqxjOCBlWP5Z4A','','0CFoQFjAF',null,event)" href="http://it.wikipedia.org/wiki/Foobar2000"><em>foobar2000</em> - Wikipedia</a>"

After I attach the IE window with _IEattach, obtaining $oIE, how do I create an array with the title "foobar2000 - Wikipedia" and all the other titles in the page?

Share this post


Link to post
Share on other sites
jchd

Does your simple regexp solution work with nested tags?

About IE: make a distinct post with your example to attract attention of experienced IE users.


This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)

Share this post


Link to post
Share on other sites
Imbuter2000

Does your simple regexp solution work with nested tags?

I'm not sure about what nested tags is but I can tell you that <div>((?!<div>).)*?</div> captures the nearest divs even if there are other tags inside!

Edited by Imbuter2000

Share this post


Link to post
Share on other sites
hawkair

@Imbuter2000

I dont know about complex texts but for the example you gave I would use this

$txt = "random_text_and_tags1<div>random_text_and_tags2<div>random_text_and_tags3</div>random_text_and_tags4</div>random_text_and_tags5"

$Pattern = "<div>[^<]*</div>"

that is:

<div>:search for "<div>"

[^<]*:match all following characters that are not "<"

</div>:it must be followed by </div>

only <div>random_text_and_tags3</div> matches that

Share this post


Link to post
Share on other sites
Imbuter2000

About IE: make a distinct post with your example to attract attention of experienced IE users.

I just created this new topic for it:

Share this post


Link to post
Share on other sites
JohnQSmith

But I still think that the plain english explanation of the ".*?" in "<.*?>text<" is not simply "give me the minimum number of characters that fall between <and >text<.

In particular saying "minimum number of characters" is defective of not clear, as demonstred by the fact that in my opening case it takes

"<tag><subtag>text</subtag>" instead of shorter "<subtag>text</subtag>".

My description of the "minimum number of characters that fall between <and >text<" is correct and absolutely clear. You are just missing the fact that the starting character is the < in front of "tag", not the < in front of "subtag".

Regular expressions are VERY PARTICULAR about what they return. They do exactly what you TELL them to do, not what you WISH them to do. It's not a matter of interpretation.

<tag><subtag>text</subtag></tag>
^
start at <

now find >text<

<tag><subtag>text</subtag></tag>
<>text<                             not found
<tag><subtag>text</subtag></tag>
<->text<                            not found
<tag><subtag>text</subtag></tag>
<-->text<                           not found
<tag><subtag>text</subtag></tag>
<--->text<                          not found
<tag><subtag>text</subtag></tag>
<---->text<                         not found
<tag><subtag>text</subtag></tag>
<----->text<                        not found
<tag><subtag>text</subtag></tag>
<------>text<                       not found
<tag><subtag>text</subtag></tag>
<------->text<                      not found
<tag><subtag>text</subtag></tag>
<-------->text<                     not found
<tag><subtag>text</subtag></tag>
<--------->text<                    not found
<tag><subtag>text</subtag></tag>
<---------->text<                   not found
<tag><subtag>text</subtag></tag>
<--- .*? --->text<                  FOUND      .*?  =  tab><subtag

Whenever someone says "pls" because it's shorter than "please", I say "no" because it's shorter than "yes".

Share this post


Link to post
Share on other sites
Imbuter2000

My description of the "minimum number of characters that fall between <and >text<" is correct and absolutely clear. You are just missing the fact that the starting character is the < in front of "tag", not the < in front of "subtag".

I'm not missing that fact, your definition was missing it. You didn't defined that the starting character is the leftmost "<" instead of the rightmost. When you write "minimum number of characters" I think that most people think to the rghtmost because it will end capturing less number of characters.

Anyhow now I know how "*?" works, i.e. leftmost to leftmost, so I'm only objecting about the plain-english definition...

Edited by Imbuter2000

Share this post


Link to post
Share on other sites
Imbuter2000

@Imbuter2000

I dont know about complex texts but for the example you gave I would use this

$txt = "random_text_and_tags1<div>random_text_and_tags2<div>random_text_and_tags3</div>random_text_and_tags4</div>random_text_and_tags5"

$Pattern = "<div>[^<]*</div>"

that is:

<div>:search for "<div>"

[^<]*:match all following characters that are not "<"

</div>:it must be followed by </div>

only <div>random_text_and_tags3</div> matches that

Hi hawkair, you forgot that random_text_and_tags3 can contain other tags so it would not work.

A working solution uses a negative lookbehind:

<div>(?:(?!<div).)*?</div>
Edited by Imbuter2000

Share this post


Link to post
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now
Sign in to follow this  

×