snaileater Posted April 19, 2010

Hi,

I fear my question is more a regexp question than an AutoIt question: I would like to use StringRegExp to grab the content of a <li></li> tag in a multiline context. The following regexp works in all my text editors and does what's expected:

    ^<li>(.|\r\n|\s)*?^</li>

I would like to make it work in AutoIt, but I can't find the right syntax to activate multiline mode. I tried, for example, the following:

    (?m)^<li>(.|\r\n|\s)*?^</li>

but AutoIt crashed. Can anybody show me the right syntax?

Thanks!

PsaltyDS Posted April 19, 2010 (edited)

I think you're making it too hard. Try this:

    #include <Array.au3>

    ; Test string: two <li> blocks, each spanning several lines
    $sString = "Some stuff before... <li>Line 1" & @CRLF & _
            "Line 2" & @CRLF & _
            "Line 3" & @CRLF & "</li> ...Some stuff between... <li>Line 4" & @CRLF & _
            "Line 5" & @CRLF & _
            "Line 6" & @CRLF & "</li> Some stuff after..." & @CRLF

    ; (?U) = ungreedy, (?s) = dot also matches newlines; flag 3 returns all matches
    $aResult = StringRegExp($sString, "(?U)(?s)(?:<li>)(.+)(?:</li>)", 3)
    If @error Then
        ConsoleWrite("Error = " & @error & @LF)
    Else
        _ArrayDisplay($aResult, "$aResult")
        For $n = 0 To UBound($aResult) - 1
            ConsoleWrite($n & ": " & $aResult[$n] & @LF)
        Next
    EndIf

Edit: Updated demo to make it clear the newlines are in the captured strings (even though you don't see them in the _ArrayDisplay() output).

Edit2: Didn't need to escape the angle brackets.

Edited April 19, 2010 by PsaltyDS

Valuater's AutoIt 1-2-3, Class... Is now in Session!
For those who want somebody to write the script for them: RentACoder
"Any technology distinguishable from magic is insufficiently advanced." -- Geek's corollary to Clarke's law

GEOSoft Posted April 19, 2010

    $aRegEx = StringRegExp($sStr, "(?i)(?s)(<li.+?</li>)", 3)

George

Question about decompiling code? Read the decompiling FAQ and don't bother posting the question in the forums.
Be sure to read and follow the forum rules. -AKA the AutoIt Reading and Comprehension Skills test.
*** The PCRE (Regular Expression) ToolKit for AutoIT - (Updated Oct 20, 2011 ver:3.0.1.13) - Please update your current version before filing any bug reports. The installer now includes both 32 and 64 bit versions. No change in version number.
Visit my Blog .. currently not active but it will soon be resplendent with news and views. Also please remove any links you may have to my website. it is soon to be closed and replaced with something else.
"Old age and treachery will always overcome youth and skill!"

snaileater Posted April 20, 2010

Thanks for your help, guys, but I see you both show me syntax using the (?s) pattern modifier, which I would like to avoid. I want to look at code indentation, so "start of line" or even "tab" must keep their significance.

I still don't see why my example doesn't work; a bad use of the (?m) modifier, I suppose...

GEOSoft Posted April 20, 2010

I think you misunderstand the use of (?s). It lets the dot in the regex pick up anything, including newlines and spaces, so indentation won't be affected.

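To illustrate the point about (?s), here is a minimal sketch using a made-up two-line string ($sText is an illustrative name, not from the thread). It only shows that the dot stops at a line break unless (?s) is set; tabs and other indentation inside the match are left untouched either way.

```autoit
; Two lines joined by a Windows line break (hypothetical sample).
Global $sText = "<li>first" & @CRLF & "second</li>"

; With the default flag, StringRegExp returns 1 on a match and 0 otherwise.
; Without (?s) the dot cannot cross the line break, so the match fails.
ConsoleWrite("without (?s): " & StringRegExp($sText, "<li>.+</li>") & @LF)

; With (?s) the dot also consumes the line break, so the match succeeds.
ConsoleWrite("with (?s):    " & StringRegExp($sText, "(?s)<li>.+</li>") & @LF)
```
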
snaileater Posted April 20, 2010

Yes, certainly... but ^ and $ will only be matched once, no? How do I match "start of line"?

GEOSoft Posted April 20, 2010 (edited)

    (?m:^)

Actually, as long as you are going to assume that each <li> is the first non-space text on its line, then:

    $aRegEx = StringRegExp($sStr, "(?i)(?s)(?m:^|\n)(\s*<li.+?</li>)", 3)

Edited April 20, 2010 by GEOSoft

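A minimal sketch of what multiline mode does to ^, assuming a made-up three-line string (the names $sText, $aOnce and $aEach are illustrative only). Without (?m), ^ anchors only at the very start of the subject; with (?m), it also anchors right after each line break.

```autoit
; Three short lines separated by Windows line breaks (hypothetical sample).
Global $sText = "alpha" & @CRLF & "beta" & @CRLF & "gamma"

; Flag 3 returns an array holding every match found in the string.
; Without (?m), ^ can only match at position 0, so only the first word is found.
Global $aOnce = StringRegExp($sText, "^\w+", 3)
ConsoleWrite("without (?m): " & UBound($aOnce) & " match(es)" & @LF)

; With (?m), ^ also matches at the start of every following line.
Global $aEach = StringRegExp($sText, "(?m)^\w+", 3)
ConsoleWrite("with (?m):    " & UBound($aEach) & " match(es)" & @LF)
```

The scoped form (?m:^) used above switches multiline mode on for that one anchor only, while a leading (?m) applies it to the rest of the pattern.
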
snaileater Posted April 20, 2010 (edited)

I made some tries like the following:

    (?U)(?m)^<li>(\s|.)*^</li>

crashes.

    (?U)(?m)^<li>(\s|.)*</li>

works, but of course it does not do what I expect, since I want the closing </li> tag to be at a line start.

    (?U)(?m)^<li>(\s|.)*(?m)^</li>
    (?U)(?m:^)<li>(\s|.)*(?m:^)</li>

crash too.

What is the right syntax? Thanks!

Edited April 20, 2010 by snaileater

PsaltyDS Posted April 20, 2010 (edited)

snaileater wrote: "i want to have a look at code indentation so "start of line" or even "tab" must keep their significance."

Did you even try my example? All the white space, including tabs, is preserved (you don't see it in the _ArrayDisplay(), but you can see it in the console output):

    #include <Array.au3>

    ; Same demo as before, now with tab-indented lines inside each <li> block
    $sString = "Some stuff before... <li>Line 1" & @CRLF & _
            @TAB & "Line 2" & @CRLF & _
            @TAB & @TAB & "Line 3" & @CRLF & "</li> ...Some stuff between... <li>Line 4" & @CRLF & _
            @TAB & "Line 5" & @CRLF & _
            @TAB & @TAB & "Line 6" & @CRLF & "</li> Some stuff after..." & @CRLF

    $aResult = StringRegExp($sString, "(?U)(?s)(?:<li>)(.+)(?:</li>)", 3)
    If @error Then
        ConsoleWrite("Error = " & @error & @LF)
    Else
        _ArrayDisplay($aResult, "$aResult")
        For $n = 0 To UBound($aResult) - 1
            ConsoleWrite($n & ": " & $aResult[$n] & @LF)
        Next
    EndIf

If that's not it, post some sample data to illustrate what you want.

Edited April 20, 2010 by PsaltyDS

GEOSoft Posted April 20, 2010

Let's get this straight.

Do you or do you not want to include indenting? In the latter case the tag may not be at the beginning of a line.
Are the opening and closing tags ever on the same line?
Do you or do you not want the tags included in the return? In this case, where the closing tag is positioned on a line will be irrelevant.
Is there only one list on the page, or are there more than one?

These are all questions which can affect the SRE. When asking for help with a regex it's always better to post a sample of the input and show what you expect to be returned from that example, or post a page link and show what you need to return from that.

snaileater Posted April 20, 2010

Here is the kind of HTML code I would like to process:

    <li> <a id="a"></a>
        <h4>ZIZOU</h4>
        <a href="#haut"><img class="haut" src="../fleche_ht.gif" alt="haut de page" width="14" height="18"/></a>
        <ul>
            <li>Ecole élémentaire : 0180462 E</li>
            <li>place de la Mairie</li>
            <li>78888</li>
            <li>☎   05 48 26 78 76</li>
            <li> </li>
        </ul>
    </li>
    <li>
        <h4>ARZOUILLE</h4>
        <a href="#haut"><img class="haut" src="../fleche_ht.gif" alt="haut de page" width="14" height="18"/></a>
        <ul>
            <li>La Celette et La Perche</li>
            <li>Ecole élémentaire : 0180414 C</li>
            <li>rue Jean Valette</li>
            <li>99999</li>
            <li>☎   07 77 63 50 33</li>
            <li>Circonscription : St Amand</li>
        </ul>
    </li>

I would like to grab the content of the 2 top-level <li></li> tags. In this case only two strings should be returned.

To answer your question: the content of the li tags is spread over several lines, which is why I would like to avoid the (?s) modifier... unless I misunderstand something.

GEOSoft Posted April 20, 2010

This will get it but I can think of a better way.

    $aArray = StringRegExp($sStr, "(?i)(?s)(?m:^|\v)<li>(.+?)</li>", 3)

I don't think you really need all the extra crap in there though, so why not just get an array of the lists and then you can work from those if you need to narrow it down more.

    $aArray = StringRegExp($sStr, "(?i)(?s)<[uo]l.*?>\s*(.+?)\s*</[uo]l>", 3)

snaileater Posted April 21, 2010

Thanks for your help, GEOSoft.

The first one,

    (?i)(?s)(?m:^|\v)<li>(.+?)</li>

does the job perfectly in AutoIt, but I will have to understand its syntax (especially the (?m:^|v) part). It is not understood by some of the text editors I use, for instance Komodo Edit (which says "unknown extension").

The second formula was designed for ul and ol tags, I suppose? It works well most of the time but makes mistakes in some cases of nested tags (not finding the right closing tag). Applied to li tags,

    (?i)(?s)<li.*?>\s*(.+?)\s*</li>

the formula makes the same mistakes.

Thanks.

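For readers puzzling over the same syntax: (?m:^|\v) is a non-capturing group with multiline mode switched on only inside it, so it matches either a line start (^) or one vertical whitespace character (\v, as PCRE defines it). The sketch below is a variant, not GEOSoft's pattern: it anchors both the opening and the closing tag to a line start, which seems to be what the original question was after. The sample string and the names $sHtml and $aItems are made up for illustration. It also avoids the repeated alternation group (.|\r\n|\s)* from the earlier attempts; such groups can exhaust PCRE's recursion on long input, which is a plausible but unconfirmed cause of the crashes reported in this thread.

```autoit
; Hypothetical sample: two top-level items, the first with a nested, tab-indented list.
Global $sHtml = "<li>" & @CRLF & _
        @TAB & "<h4>ZIZOU</h4>" & @CRLF & _
        @TAB & "<ul>" & @CRLF & _
        @TAB & @TAB & "<li>place de la Mairie</li>" & @CRLF & _
        @TAB & "</ul>" & @CRLF & _
        "</li>" & @CRLF & _
        "<li>" & @CRLF & _
        @TAB & "<h4>ARZOUILLE</h4>" & @CRLF & _
        "</li>" & @CRLF

; (?m) lets ^ match at every line start; (?s) lets the dot cross line breaks.
; .*? is lazy, so each match ends at the first </li> that begins a line.
Global $aItems = StringRegExp($sHtml, "(?ms)^<li>(.*?)^</li>", 3)

If @error Then
    ConsoleWrite("No match, @error = " & @error & @LF)
Else
    For $i = 0 To UBound($aItems) - 1
        ConsoleWrite("--- item " & $i & " ---" & @LF & $aItems[$i] & @LF)
    Next
EndIf
```

With this sample the loop should print two items, the first including its nested list, because the indented inner </li> tag never sits at a line start. The approach depends on top-level tags starting in column one, so it is a sketch for consistently indented input rather than a general HTML parser.
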
GEOSoft Posted April 21, 2010

The idea of the second was to get each list (not list item) into an array and then work with the array any way you wanted to.

As for the Komodo editor, did you happen to make the same typo in there as you did above? (?m:^|v) should have been (?m:^|\v). Also, if you are attempting to use a built-in RegEx engine in Komodo, what engine do they use? That is very important because there are several engines and each will have slightly different syntax, although they are basically the same. That is also an issue with RegEx testers. Other than in the AutoIt forums, I have yet to see one available for download that 100% supported the syntax used by AutoIt (PCRE engine). Any of them will work MOST of the time.

PsaltyDS Posted April 21, 2010

Using a combination of lookbehind and lookahead assertions to find the "li" tags, I got this:

    #include <Array.au3>

    ; Lookbehind for an opening <li>, ungreedy match, lookahead for the next <li> or </li>
    Global $sPatt = "(?U)(?s)(?<=<li>).+(?=<li>|</li>)"
    Global $sString = '<li> <a id="a"></a> ' & @CRLF & _
            @TAB & '<h4>ZIZOU</h4>' & @CRLF & _
            @TAB & '<a href="#haut"><img class="haut" src="../fleche_ht.gif" alt="haut de page" width="14" height="18"/></a> ' & @CRLF & _
            @TAB & '<ul>' & @CRLF & _
            @TAB & @TAB & '<li>Ecole élémentaire : 0180462 E</li>' & @CRLF & _
            @TAB & @TAB & '<li>place de la Mairie</li>' & @CRLF & _
            @TAB & @TAB & '<li>78888</li>' & @CRLF & _
            @TAB & @TAB & '<li>?   05 48 26 78 76</li>' & @CRLF & _
            @TAB & @TAB & '<li> </li>' & @CRLF & _
            @TAB & '</ul>' & @CRLF & _
            '</li>' & @CRLF & _
            '<li> ' & @CRLF & _
            @TAB & '<h4>ARZOUILLE</h4>' & @CRLF & _
            @TAB & '<a href="#haut"><img class="haut" src="../fleche_ht.gif" alt="haut de page" width="14" height="18"/></a> ' & @CRLF & _
            @TAB & '<ul>' & @CRLF & _
            @TAB & @TAB & '<li>La Celette et La ' & @CRLF & _
            @TAB & @TAB & @TAB & 'Perche</li>' & @CRLF & _
            @TAB & @TAB & '<li>Ecole élémentaire : 0180414 C</li>' & @CRLF & _
            @TAB & @TAB & '<li>rue Jean Valette</li>' & @CRLF & _
            @TAB & @TAB & '<li>99999</li>' & @CRLF & _
            @TAB & @TAB & '<li>?   07 77 63 50 33</li>' & @CRLF & _
            @TAB & @TAB & '<li>Circonscription : St Amand</li>' & @CRLF & _
            @TAB & '</ul>' & @CRLF & _
            '</li>' & @CRLF

    ; Flag 3 returns an array of all matches
    Global $aRET = StringRegExp($sString, $sPatt, 3)
    If @error Then
        ConsoleWrite("Error = " & @error & "; @extended = " & @extended & @LF)
    Else
        _ArrayDisplay($aRET, "$aRET")
        For $r = 0 To UBound($aRET) - 1
            ConsoleWrite($r & ": " & $aRET[$r] & @LF)
        Next
    EndIf

Results:

    >Running:(3.3.6.1):C:\Program Files\AutoIt3\autoit3.exe "C:\Temp\Test1.au3"
    0:  <a id="a"></a>
        <h4>ZIZOU</h4>
        <a href="#haut"><img class="haut" src="../fleche_ht.gif" alt="haut de page" width="14" height="18"/></a>
        <ul>
    1: Ecole élémentaire : 0180462 E
    2: place de la Mairie
    3: 78888
    4: ?   05 48 26 78 76
    5:  
    6:
        <h4>ARZOUILLE</h4>
        <a href="#haut"><img class="haut" src="../fleche_ht.gif" alt="haut de page" width="14" height="18"/></a>
        <ul>
    7: La Celette et La
                Perche
    8: Ecole élémentaire : 0180414 C
    9: rue Jean Valette
    10: 99999
    11: ?   07 77 63 50 33
    12: Circonscription : St Amand
    +>10:54:26 AutoIT3.exe ended.rc:0
    >Exit code: 0 Time: 4.176