Imbuter2000 Posted March 15, 2012 Share Posted March 15, 2012 While troubleshooting my html parsing code I found a strange bug(?) in the regular expression engine in AutoIt:$html_string = "<tag><subtag>text</subtag><tag>" $match1 = StringRegExp($html_string,"<.*?>text<.*?>",3) $match2 = StringRegExp($html_string,"<[^>]*>text<.*?>",3) msgbox(0,"",$match1[0]) ; AutoIt display the result "<tag><subtag>text</subtag>" (WHY???) msgbox(0,"",$match2[0]) ; AutoIt display the result "<subtag>text</subtag>"...for what strange reason $match1 should not be identical to $match2? Link to comment Share on other sites More sharing options...
bogQ Posted March 15, 2012 Share Posted March 15, 2012 don't know the reason but your result is valid according tohttp://www.regular-expressions.info/javascriptexample.htmland note that the match is from position 0 TCP server and client - Learning about TCP servers and clients connectionAu3 oIrrlicht - Irrlicht projectAu3impact - Another 3D DLL game engine for autoit. (3impact 3Drad related) There are those that believe that the perfect heist lies in the preparation.Some say that it’s all in the timing, seizing the right opportunity. Others even say it’s the ability to leave no trace behind, be a ghost. Link to comment Share on other sites More sharing options...
JohnQSmith Posted March 15, 2012 Share Posted March 15, 2012 (edited) While troubleshooting my html parsing code I found a strange bug(?) in the regular expression engine in AutoIt: No bug, it's doing exactly what you told it to do. $html_string = "<tag><subtag>text</subtag><tag>" $match1 = StringRegExp($html_string,"<.*?>text<.*?>",3) $match2 = StringRegExp($html_string,"<[^>]*>text<.*?>",3) msgbox(0,"",$match1[0]) ; AutoIt display the result "<tag><subtag>text</subtag>" (WHY???) msgbox(0,"",$match2[0]) ; AutoIt display the result "<subtag>text</subtag>" ...for what strange reason $match1 should not be identical to $match2? It doesn't match because $match2 is excluding any enclosed close brackets. Let's break down what $match1 is doing. When reading the $html_string, $match1 starts at the beginning and grabs the "<" It then continues grabbing characters (non-greedy) until it finds the first instance ">text<" then continues grabbing the minimum (non-greedy) number of characters until it finds the next ">" So basically your first wildcard in $match1 is saying "give me the minimum number of characters that fall between < and >text<", whereas your first wildcard in $match2 is saying "give me all characters that are not > that fall between < and >text<". Edit: Here are two more lines of code to add to your script. $match3 = StringRegExp($html_string,"(?<=>)<.*?>text<.*?>",3) msgbox(0,"",$match3[0]) ; AutoIt displays the result "<subtag>text</subtag>" $match3 is the same as $match1, except that I've added a positive lookbehind in front of your regular expression. This forces it to find your match as long as it is preceded by a ">". Note that this only works for this example. If you change your $html_string to $html_string = "<pretag><tag><subtag>text</subtag></tag></pretag>" $match3 will give you "<tag><subtag>text</subtag>" again. Basically, I think $match2 is your best bet. Edited March 15, 2012 by JohnQSmith Whenever someone says "pls" because it's shorter than "please", I say "no" because it's shorter than "yes". Link to comment Share on other sites More sharing options...
Imbuter2000 Posted March 17, 2012 Author Share Posted March 17, 2012 (edited) Ok, thanks guy, you're right and I learned a new thing about the lazy quantifier.But I still think that the plain english explanation of the ".*?" in "<.*?>text<" is not simply "give me the minimum number of characters that fall between <and >text<.In particular saying "minimum number of characters" is defective of not clear, as demonstred by the fact that in my opening case it takes "<tag><subtag>text</subtag>" instead of shorter "<subtag>text</subtag>".At this point it comes to my mind an impossible(?) problem:suppose to have an HTML source similar to this:"random_text_and_tags1<div>random_text_and_tags2<div>random_text_and_tags3</div>random_text_and_tags4</div>random_text_and_tags5"How would you take the inner <div> part (i.e.: "<div>random_text_and_tags3</div>")?The only solution that I found is to take the group1 out from a regex ".*(<div>.*?</div>)".Does a solution without using groups exist? Edited March 17, 2012 by Imbuter2000 Link to comment Share on other sites More sharing options...
jchd Posted March 17, 2012 Share Posted March 17, 2012 To match embedded constructs like these you need recursion to parse (or match or extract part of) them.I warmly recommend you have a good read of the complete official PCRE documentation (AutoIt uses PCRE, albeit sometimes a couple versions behind latest release). The complete documentation comes with the PCRE source tarball that you can find atftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre/Be warned that there may be some discrepancies between latest doc and AutoIt version, but they only affect very dark corners and advanced features. This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe hereRegExp tutorial: enough to get startedPCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta. SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt) Link to comment Share on other sites More sharing options...
Imbuter2000 Posted March 17, 2012 Author Share Posted March 17, 2012 To match embedded constructs like these you need recursion to parse (or match or extract part of) them.I warmly recommend you have a good read of the complete official PCRE documentation (AutoIt uses PCRE, albeit sometimes a couple versions behind latest release). The complete documentation comes with the PCRE source tarball that you can find atftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre/Be warned that there may be some discrepancies between latest doc and AutoIt version, but they only affect very dark corners and advanced features.Interesting!Is there a tutorial that talks specificly about HTML parsing using recursive regular expressions? Link to comment Share on other sites More sharing options...
jchd Posted March 17, 2012 Share Posted March 17, 2012 Not that I know. Anyway, I'd recommend using the AutoIt IE UDF rather than parsing the html by regexp independantly of the regexp-fu you have. IE functions do the hard work of dissecting the numerous html constructs for you very efficiently and robustly, while regexp parsing is subject to unexpected failures when, for instance, a server decides to insert random whitespaces (tabs, linefeeds, spaces) at almost every point in the html source. To cope with that you have to allow for s* everywhere which makes your pattern incredibly heavy and almost unmaintainable. This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe hereRegExp tutorial: enough to get startedPCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta. SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt) Link to comment Share on other sites More sharing options...
Imbuter2000 Posted March 17, 2012 Author Share Posted March 17, 2012 I think I've found a simple regex solution!!! Here it is: <div>((?!<div>).)*?</div>Not that I know. Anyway, I'd recommend using the AutoIt IE UDF rather than parsing the html by regexp independantly of the regexp-fu you have. IE functions do the hard work of dissecting the numerous html constructs for you very efficiently and robustly, while regexp parsing is subject to unexpected failures when, for instance, a server decides to insert random whitespaces (tabs, linefeeds, spaces) at almost every point in the html source. To cope with that you have to allow for s* everywhere which makes your pattern incredibly heavy and almost unmaintainable.I really really want to do my extractions via the IE UDF but I find it very very difficult to do so.For example if I want to extract the titles of the Google search results, I open for example http://www.google.it/#q=foobar, I see with the DebugBar that the code for example of the title "foobar2000 - Wikipedia" is:"<a class="l" onmousedown="return rwt(this,'','','','6','AFQjCNG5l1JlHEfLHSE1yqxjOCBlWP5Z4A','','0CFoQFjAF',null,event)" href="http://it.wikipedia.org/wiki/Foobar2000"><em>foobar2000</em> - Wikipedia</a>"After I attach the IE window with _IEattach, obtaining $oIE, how do I create an array with the title "foobar2000 - Wikipedia" and all the other titles in the page? Link to comment Share on other sites More sharing options...
jchd Posted March 17, 2012 Share Posted March 17, 2012 Does your simple regexp solution work with nested tags? About IE: make a distinct post with your example to attract attention of experienced IE users. This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe hereRegExp tutorial: enough to get startedPCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta. SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt) Link to comment Share on other sites More sharing options...
Imbuter2000 Posted March 18, 2012 Author Share Posted March 18, 2012 (edited) Does your simple regexp solution work with nested tags?I'm not sure about what nested tags is but I can tell you that <div>((?!<div>).)*?</div> captures the nearest divs even if there are other tags inside! Edited March 18, 2012 by Imbuter2000 Link to comment Share on other sites More sharing options...
hawkair Posted March 18, 2012 Share Posted March 18, 2012 @Imbuter2000I dont know about complex texts but for the example you gave I would use this$txt = "random_text_and_tags1<div>random_text_and_tags2<div>random_text_and_tags3</div>random_text_and_tags4</div>random_text_and_tags5"$Pattern = "<div>[^<]*</div>"that is:<div>:search for "<div>"[^<]*:match all following characters that are not "<"</div>:it must be followed by </div>only <div>random_text_and_tags3</div> matches that Link to comment Share on other sites More sharing options...
Imbuter2000 Posted March 18, 2012 Author Share Posted March 18, 2012 About IE: make a distinct post with your example to attract attention of experienced IE users.I just created this new topic for it: Link to comment Share on other sites More sharing options...
JohnQSmith Posted March 19, 2012 Share Posted March 19, 2012 But I still think that the plain english explanation of the ".*?" in "<.*?>text<" is not simply "give me the minimum number of characters that fall between <and >text<. In particular saying "minimum number of characters" is defective of not clear, as demonstred by the fact that in my opening case it takes "<tag><subtag>text</subtag>" instead of shorter "<subtag>text</subtag>". My description of the "minimum number of characters that fall between <and >text<" is correct and absolutely clear. You are just missing the fact that the starting character is the < in front of "tag", not the < in front of "subtag". Regular expressions are VERY PARTICULAR about what they return. They do exactly what you TELL them to do, not what you WISH them to do. It's not a matter of interpretation. <tag><subtag>text</subtag></tag> ^ start at < now find >text< <tag><subtag>text</subtag></tag> <>text< not found <tag><subtag>text</subtag></tag> <->text< not found <tag><subtag>text</subtag></tag> <-->text< not found <tag><subtag>text</subtag></tag> <--->text< not found <tag><subtag>text</subtag></tag> <---->text< not found <tag><subtag>text</subtag></tag> <----->text< not found <tag><subtag>text</subtag></tag> <------>text< not found <tag><subtag>text</subtag></tag> <------->text< not found <tag><subtag>text</subtag></tag> <-------->text< not found <tag><subtag>text</subtag></tag> <--------->text< not found <tag><subtag>text</subtag></tag> <---------->text< not found <tag><subtag>text</subtag></tag> <--- .*? --->text< FOUND .*? = tab><subtag Whenever someone says "pls" because it's shorter than "please", I say "no" because it's shorter than "yes". Link to comment Share on other sites More sharing options...
Imbuter2000 Posted March 20, 2012 Author Share Posted March 20, 2012 (edited) My description of the "minimum number of characters that fall between <and >text<" is correct and absolutely clear. You are just missing the fact that the starting character is the < in front of "tag", not the < in front of "subtag".I'm not missing that fact, your definition was missing it. You didn't defined that the starting character is the leftmost "<" instead of the rightmost. When you write "minimum number of characters" I think that most people think to the rghtmost because it will end capturing less number of characters.Anyhow now I know how "*?" works, i.e. leftmost to leftmost, so I'm only objecting about the plain-english definition... Edited March 20, 2012 by Imbuter2000 Link to comment Share on other sites More sharing options...
Imbuter2000 Posted March 20, 2012 Author Share Posted March 20, 2012 (edited) @Imbuter2000 I dont know about complex texts but for the example you gave I would use this $txt = "random_text_and_tags1<div>random_text_and_tags2<div>random_text_and_tags3</div>random_text_and_tags4</div>random_text_and_tags5" $Pattern = "<div>[^<]*</div>" that is: <div>:search for "<div>" [^<]*:match all following characters that are not "<" </div>:it must be followed by </div> only <div>random_text_and_tags3</div> matches that Hi hawkair, you forgot that random_text_and_tags3 can contain other tags so it would not work. A working solution uses a negative lookbehind: <div>(?:(?!<div).)*?</div> Edited March 20, 2012 by Imbuter2000 Link to comment Share on other sites More sharing options...
Recommended Posts
Create an account or sign in to comment
You need to be a member in order to leave a comment
Create an account
Sign up for a new account in our community. It's easy!
Register a new accountSign in
Already have an account? Sign in here.
Sign In Now