snaileater

StringRegExp : Help needed ...

15 posts in this topic

Hi,

I fear my question is more a regexp question than an AutoIt question: I would like to use StringRegExp to grab the content of a <li></li> tag in a multiline context.

The following regexp works in all my text editors and does what I expect:

^<li>(.|\r\n|\s)*?^</li>

I would like to make it work in AutoIt, but I can't find the right syntax to activate multiline mode. For example, I tried the following:

(?m)^<li>(.|\r\n|\s)*?^</li>

but AutoIt crashed.

Can anybody show me the right syntax?

Thanks!




#2 ·  Posted (edited)

I think you're making it too hard. Try this:

#include <Array.au3>

$sString = "Some stuff before... <li>Line 1" & @CRLF & _
        "Line 2" & @CRLF & _
        "Line 3" & @CRLF & "</li> ...Some stuff between... <li>Line 4" & @CRLF & _
        "Line 5" & @CRLF & _
        "Line 6" & @CRLF & "</li> Some stuff after..." & @CRLF

$aResult = StringRegExp($sString, "(?U)(?s)(?:<li>)(.+)(?:</li>)", 3)

If @error Then
    ConsoleWrite("Error = " & @error & @LF)
Else
    _ArrayDisplay($aResult, "$aResult")
    For $n = 0 To UBound($aResult) - 1
        ConsoleWrite($n & ":  " & $aResult[$n] & @LF)
    Next
EndIf

:(

Edit: Updated demo to make it clear the newlines are in the captured strings (even though you don't see them in the _ArrayDisplay() output).

Edit2: Didn't need to escape the angle brackets.

Edited by PsaltyDS

Valuater's AutoIt 1-2-3, Class... Is now in Session!
For those who want somebody to write the script for them: RentACoder
"Any technology distinguishable from magic is insufficiently advanced." -- Geek's corollary to Clarke's law


$aRegEx = StringRegExp($sStr, "(?i)(?s)(<li.+?</li>)", 3)


George

Question about decompiling code? Read the decompiling FAQ and don't bother posting the question in the forums.

Be sure to read and follow the forum rules. -AKA the AutoIt Reading and Comprehension Skills test.***

The PCRE (Regular Expression) ToolKit for AutoIT - (Updated Oct 20, 2011 ver:3.0.1.13) - Please update your current version before filing any bug reports. The installer now includes both 32 and 64 bit versions. No change in version number.

Visit my Blog .. currently not active but it will soon be resplendent with news and views. Also please remove any links you may have to my website. it is soon to be closed and replaced with something else.

"Old age and treachery will always overcome youth and skill!"


Thanks for your help, guys, but:

I see you both show me syntax using the (?s) pattern modifier, which I would like to avoid.

I want to look at code indentation, so "start of line" and even "tab" must keep their significance.

I still don't see why my example doesn't work, though; I suppose I'm misusing the (?m) modifier.


I think you misunderstand the use of (?s). It lets the dot match any character, including newlines, so the captured text keeps its line breaks, spaces and tabs, and the indentation isn't affected.

George

Yes, certainly, but then ^ and $ will only match once, won't they? How do I match "start of line"?
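A short sketch may help with this question: (?s) and (?m) are independent switches. (?s) only changes what the dot matches, while (?m) decides whether ^ and $ match at every line or only at the ends of the whole subject. The sample string and variable names below are made up for illustration, and the behaviour assumes the newline convention in use treats the line feed of @CRLF as a line break.

```autoit
; Hypothetical sample: three lines, the second one indented.
Local $sText = "<li>one" & @CRLF & "  <li>two" & @CRLF & "<li>three"

; Without (?m), ^ only matches at the very start of the subject.
Local $aOnce = StringRegExp($sText, "(?s)^<li>", 3)
ConsoleWrite("Matches without (?m): " & UBound($aOnce) & @LF)

; With (?m), ^ also matches right after each line break, even though (?s) is on.
; The indented <li> on the second line is still skipped because it is not at a line start.
Local $aEach = StringRegExp($sText, "(?ms)^<li>", 3)
ConsoleWrite("Matches with (?m):    " & UBound($aEach) & @LF)
```

Combining both modifiers, as in (?ms), therefore keeps "start of line" meaningful while still letting a match span several lines.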


#7 ·  Posted (edited)

(?m:^)

Actually, as long as you are going to assume that each <li> is the first non-space text on its line, then

$aRegEx = StringRegExp($sStr, "(?i)(?s)(?m:^|\n)(\s*<li.+?</li>)", 3)

Edited by GEOSoft

George


#8 ·  Posted (edited)

I tried a few variations, such as the following:

(?U)(?m)^<li>(\s|.)*^</li>

crashes

(?U)(?m)^<li>(\s|.)*</li>

works, but of course it doesn't do what I expect, since I want the closing </li> tag to be at the start of a line.

(?U)(?m)^<li>(\s|.)*(?m)^</li>
(?U)(?m:^)<li>(\s|.)*(?m:^)</li>

crash too.

What is the right syntax? :idea:

Thanks!
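One possible reading of the crashes reported above, offered as an assumption rather than a confirmed diagnosis: a repeated group such as (\s|.)* can make the PCRE engine behind StringRegExp recurse once per matched character, which may exhaust the stack on long input. A lazy dot under (?s) expresses the same intent without a repeated group, and (?m) keeps ^ meaningful on every line. A minimal sketch, with invented sample data:

```autoit
; Hypothetical sample: two top-level items, each closed by a </li> at the start of a line.
Local $sHtml = "<li>first" & @CRLF & _
        @TAB & "<li>nested</li>" & @CRLF & _
        "</li>" & @CRLF & _
        "<li>second" & @CRLF & _
        "</li>" & @CRLF

; (?m) lets ^ match at the start of every line; (?s) lets . cross line breaks.
; .*? is lazy, so each match stops at the first </li> that begins a line.
Local $aItems = StringRegExp($sHtml, "(?ms)^<li>(.*?)^</li>", 3)

If @error Then
    ConsoleWrite("No match or bad pattern, @error = " & @error & @LF)
Else
    For $i = 0 To UBound($aItems) - 1
        ConsoleWrite($i & ": " & $aItems[$i] & @LF)
    Next
EndIf
```

The indented </li> that closes the nested item is skipped because it does not sit at the start of a line, so the outer item is captured whole. This only works when the markup is indented consistently.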

Edited by snaileater


#9 ·  Posted (edited)

i want to have a look at code indentation so "start of line" or even "tab" must keep their significance.

Did you even try my example? All the whitespace, including tabs, is preserved (you don't see it in the _ArrayDisplay() window, but you can see it in the console output):
#include <Array.au3>

$sString = "Some stuff before... <li>Line 1" & @CRLF & _
        @TAB & "Line 2" & @CRLF & _
        @TAB & @TAB & "Line 3" & @CRLF & "</li> ...Some stuff between... <li>Line 4" & @CRLF & _
        @TAB & "Line 5" & @CRLF & _
        @TAB & @TAB & "Line 6" & @CRLF & "</li> Some stuff after..." & @CRLF

$aResult = StringRegExp($sString, "(?U)(?s)(?:<li>)(.+)(?:</li>)", 3)

If @error Then
    ConsoleWrite("Error = " & @error & @LF)
Else
    _ArrayDisplay($aResult, "$aResult")
    For $n = 0 To UBound($aResult) - 1
        ConsoleWrite($n & ":  " & $aResult[$n] & @LF)
    Next
EndIf

If that's not it, post some sample data to illustrate what you want.

:idea:

Edited by PsaltyDS


Let's get this straight. Do you or do you not want to include indenting? In the latter case the tag may not be at the beginning of a line. Are the opening and closing tags ever on the same line? Do you or do you not want the tags included in the return? In this case, where the closing tag is positioned on a line will be irrelevant. Is there only one list on the page, or are there more than one? These are all questions which can affect the SRE.

When asking for help with a regex it's always better to post some sample of the input and show what you expect to be returned from that example, or post a page link and show what you need to return from that.


George


Here is the kind of HTML code I would like to process:

<li> <a id="a"></a> 
    <h4>ZIZOU</h4>
    <a href="#haut"><img class="haut" src="../fleche_ht.gif" alt="haut de page" width="14" height="18"/></a> 
    <ul>
        <li>Ecole &eacute;l&eacute;mentaire : 0180462 E</li>
        <li>place de la Mairie</li>
        <li>78888</li>
        <li>☎ &nbsp; 05 48 26 78 76</li>
        <li>&nbsp;</li>
    </ul>
</li>
<li> 
    <h4>ARZOUILLE</h4>
    <a href="#haut"><img class="haut" src="../fleche_ht.gif" alt="haut de page" width="14" height="18"/></a> 
    <ul>
        <li>La Celette et La 
            Perche</li>
        <li>Ecole &eacute;l&eacute;mentaire : 0180414 C</li>
        <li>rue Jean Valette</li>
        <li>99999</li>
        <li>☎ &nbsp; 07 77 63 50 33</li>
        <li>Circonscription : St Amand</li>
    </ul>
</li>

I would like to grab the content of the two top-level <li></li> tags, so in this case only two strings should be returned. As you asked: the content of the li tags is spread over several lines, which is why I would like to avoid the (?s) modifier, unless I misunderstand something.


This will get it but I can think of a better way.

$aArray = StringRegExp($sStr, "(?i)(?s)(?m:^|\v)<li>(.+?)</li>", 3)

I don't think you really need all the extra crap in there though so why not just get an array of the lists and then you can work from those if you need to narrow it down more.

$aArray = StringRegExp($sStr, "(?i)(?s)<[uo]l.*?>\s*(.+?)\s*</[uo]l>", 3)


George

Thanks for your help, GEOSoft.

The first one, "(?i)(?s)(?m:^|\v)<li>(.+?)</li>", does the job perfectly in AutoIt, though I still have to understand its syntax (especially the (?m:^|v) part). However, some of the text editors I use don't understand it; Komodo Edit, for instance, reports "unknown extension".

I suppose the second formula was designed for ul and ol tags? It works well most of the time, but it makes mistakes in some cases of nested tags (it doesn't find the right closing tag).

Applied to li tags, as (?i)(?s)<li.*?>\s*(.+?)\s*</li>, the formula makes the same mistakes.

Thanks.
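The nested-tag mistakes described above follow from how lazy matching works: .+? stops at the first closing tag it meets, whichever opening tag that closing tag belongs to. A small sketch with invented data shows the effect, and also shows that switching to a greedy quantifier is not a general fix.

```autoit
; Hypothetical nested list squeezed onto one line.
Local $sNested = "<li>outer <ul><li>inner</li></ul> tail</li>"

; Lazy match: ends at the first </li>, which actually closes the inner item.
Local $aLazy = StringRegExp($sNested, "(?is)<li.*?>\s*(.+?)\s*</li>", 3)
If Not @error Then ConsoleWrite("Lazy:   " & $aLazy[0] & @LF)

; Greedy match: runs to the last </li>, which only helps when there is a single outer item.
Local $aGreedy = StringRegExp($sNested, "(?is)<li.*?>\s*(.+)\s*</li>", 3)
If Not @error Then ConsoleWrite("Greedy: " & $aGreedy[0] & @LF)
```

With several top-level items, the greedy form would swallow everything between the first <li> and the last </li>, so some extra anchor (such as the line start used elsewhere in this topic) is still needed to tell the levels apart.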


The idea of the second was to get each list (not list item) into an array and then work with the array any way you wanted to.

As for the Komodo editor, did you happen to make the same typo in there as you did above? (?m:^|v) should have been (?m:^|\v). Also, if you are attempting to use a built-in RegEx engine in Komodo, what engine do they use? That is very important, because there are several engines and each will have slightly different syntax, although they are basically the same. That is also an issue with RegEx testers. Other than in the AutoIt forums, I have yet to see one available for download that 100% supported the syntax used by AutoIt (the PCRE engine). Any of them will work MOST of the time.
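The difference the missing backslash makes can be checked directly. In the sketch below (the subject string is made up), \v stands for a vertical whitespace character, while a bare v is only the letter v. StringRegExp sets @error to 1 when nothing matches.

```autoit
; Hypothetical subject: the <li> follows a line break.
Local $sSubject = "intro" & @CRLF & "<li>item</li>"

; \v matches a vertical whitespace character such as the line feed before <li>.
StringRegExp($sSubject, "\v<li>", 3)
ConsoleWrite("\v<li> -> @error = " & @error & @LF)

; Without the backslash, v is just the letter v, and no "v<li>" occurs in the subject.
StringRegExp($sSubject, "v<li>", 3)
ConsoleWrite("v<li>  -> @error = " & @error & @LF)
```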


George


Using a combination of lookbehind and lookahead assertions to find the "li" tags, I got this:

#include <Array.au3>

Global $sPatt = "(?U)(?s)(?<=<li>).+(?=<li>|</li>)"


Global $sString = '<li> <a id="a"></a> ' & @CRLF & _
        @TAB & '<h4>ZIZOU</h4>' & @CRLF & _
        @TAB & '<a href="#haut"><img class="haut" src="../fleche_ht.gif" alt="haut de page" width="14" height="18"/></a> ' & @CRLF & _
        @TAB & '<ul>' & @CRLF & _
        @TAB & @TAB & '<li>Ecole &eacute;l&eacute;mentaire : 0180462 E</li>' & @CRLF & _
        @TAB & @TAB & '<li>place de la Mairie</li>' & @CRLF & _
        @TAB & @TAB & '<li>78888</li>' & @CRLF & _
        @TAB & @TAB & '<li>? &nbsp; 05 48 26 78 76</li>' & @CRLF & _
        @TAB & @TAB & '<li>&nbsp;</li>' & @CRLF & _
        @TAB & '</ul>' & @CRLF & _
        '</li>' & @CRLF & _
        '<li> ' & @CRLF & _
        @TAB & '<h4>ARZOUILLE</h4>' & @CRLF & _
        @TAB & '<a href="#haut"><img class="haut" src="../fleche_ht.gif" alt="haut de page" width="14" height="18"/></a> ' & @CRLF & _
        @TAB & '<ul>' & @CRLF & _
        @TAB & @TAB & '<li>La Celette et La ' & @CRLF & _
        @TAB & @TAB & @TAB & 'Perche</li>' & @CRLF & _
        @TAB & @TAB & '<li>Ecole &eacute;l&eacute;mentaire : 0180414 C</li>' & @CRLF & _
        @TAB & @TAB & '<li>rue Jean Valette</li>' & @CRLF & _
        @TAB & @TAB & '<li>99999</li>' & @CRLF & _
        @TAB & @TAB & '<li>? &nbsp; 07 77 63 50 33</li>' & @CRLF & _
        @TAB & @TAB & '<li>Circonscription : St Amand</li>' & @CRLF & _
        @TAB & '</ul>' & @CRLF & _
        '</li>' & @CRLF

Global $aRET = StringRegExp($sString, $sPatt, 3)

If @error Then
    ConsoleWrite("Error = " & @error & "; @extended = " & @extended & @LF)
Else
    _ArrayDisplay($aRET, "$aRET")
    For $r = 0 To UBound($aRET) - 1
        ConsoleWrite($r & ":  " & $aRET[$r] & @LF)
    Next
EndIf

Results:

>Running:(3.3.6.1):C:\Program Files\AutoIt3\autoit3.exe "C:\Temp\Test1.au3"    
0:   <a id="a"></a> 
    <h4>ZIZOU</h4>
    <a href="#haut"><img class="haut" src="../fleche_ht.gif" alt="haut de page" width="14" height="18"/></a> 
    <ul>
        
1:  Ecole &eacute;l&eacute;mentaire : 0180462 E
2:  place de la Mairie
3:  78888
4:  ? &nbsp; 05 48 26 78 76
5:  &nbsp;
6:   
    <h4>ARZOUILLE</h4>
    <a href="#haut"><img class="haut" src="../fleche_ht.gif" alt="haut de page" width="14" height="18"/></a> 
    <ul>
        
7:  La Celette et La 
            Perche
8:  Ecole &eacute;l&eacute;mentaire : 0180414 C
9:  rue Jean Valette
10:  99999
11:  ? &nbsp; 07 77 63 50 33
12:  Circonscription : St Amand
+>10:54:26 AutoIT3.exe ended.rc:0
>Exit code: 0    Time: 4.176

:idea:

