Mingre

[Again in conflict; post 21] StringRegExp help - on nested HTML tags

24 posts in this topic

#1 ·  Posted (edited)

#include <Array.au3>
; Script Start - Add your code below here
Local $test = "<li>One<li>Inner<li>Innermost</li></li></li>" & _
            "<li>Two</li> "
$loob = StringRegExp($test, '\Q<li>\E(.*?)\Q</li>\E', 3)
_ArrayDisplay($loob, "How to return the One... and Two?")

Hello, can somebody help me:

(1) How can I make the regexp match the two outermost bullets, such that:
 

Quote

$array[0] = "One<li>Inner<li>Innermost</li></li>"

$array[1] = "Two"

(2) How can I match the "Innermost" bullet?

Thanks so much.

Edited by Mingre
Added the second question; added [SOLVED] on the title; changed title to not solved.

#2 ·  Posted (edited)

Do you mean something like this:

Local $s = "<li>One<li>Inner<li>Innermost</li></li></li><li>Two</li>"
Local $regex = _
    '(?imsx)' & _
    '(?(DEFINE) (?<LiStart> <li>  ) )' & _
    '(?(DEFINE) (?<LiEnd>   <\/li> ) )' & _
    '(?(DEFINE) (?<LiBlock> (?&LiStart) (?: (?&LiBlock)* | .*? )* (?&LiEnd) ) )' & _
    '(?&LiBlock)'
$data = StringRegExp($s, $regex, 3)
_ArrayDisplay($data)

Note that this seemingly complex regexp uses an explicitly recursive pattern. Using named sub-patterns makes it more verbose but much clearer. The x (extended) option, which makes unescaped whitespace insignificant, also adds to readability. Refer to https://regex101.com/ for an English translation of the regexp semantics and for debugging. Also read the official PCRE documentation for details on the available constructs.
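The pattern above returns each outermost block together with its own <li> and </li> tags, whereas post #1 asked for the content only. A minimal sketch of one way to get there, assuming the opening tag is always the bare 4-character "<li>" (no attributes) and the block always ends with the 5-character "</li>", is to trim the ends after matching:

```autoit
Local $s = "<li>One<li>Inner<li>Innermost</li></li></li><li>Two</li>"
Local $regex = _
    '(?imsx)' & _
    '(?(DEFINE) (?<LiStart> <li>  ) )' & _
    '(?(DEFINE) (?<LiEnd>   <\/li> ) )' & _
    '(?(DEFINE) (?<LiBlock> (?&LiStart) (?: (?&LiBlock)* | .*? )* (?&LiEnd) ) )' & _
    '(?&LiBlock)'
Local $data = StringRegExp($s, $regex, 3)
If Not @error Then
    For $i = 0 To UBound($data) - 1
        ; Assumption: each match starts with "<li>" (4 chars) and ends with "</li>" (5 chars).
        $data[$i] = StringTrimRight(StringTrimLeft($data[$i], 4), 5)
        ConsoleWrite($i & ": " & $data[$i] & @CRLF)
    Next
EndIf
```

If the opening tags can carry attributes, the fixed-length trim no longer applies and the tags would have to be stripped with a second regexp instead.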

Edited by jchd

This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)

@jchd Thanks so much! That's exactly what I need tho I don't really understand that regexp :D Will try to learn it.

Again, thanks!

Thanks also for the tip re: unescaped whitespaces. I was having a hard time reading through regexps because of the lack of spaces. :lol:

#5 ·  Posted (edited)

Assuming that the regex engine works left to right, couldn't something like this be enough?

$data = StringRegExp($s, '(?s)<li>(.*?)</li>' , 3)

Edit
BTW jchd, thanks for this recursion example. This will make a nice cogitation for the next weekend  :)

Edited by mikell

@mikell If it's done that way, the first encountered "</li>" from the left will be a match, which isn't exactly the pair of the outermost "<li>". :(

OK, sorry. I didn't think that including the "</li>" in the captured match was something important  :)

#8 ·  Posted (edited)

@mikell, the issue isn't whether or not to include the end markup, but the nested <li>...</li> blocks. The naive "<li>(.*?)</li>" will anchor at the first <li> and match up to the first </li> after it, pairing the wrong tags: in the string below, the outermost <li> gets paired with the </li> that closes "Innermost".

<li>One<li>Inner<li>Innermost</li>something else</li>more stuff</li>
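To see the mismatch concretely, here is a small sketch that runs the naive lazy pattern on that same string. The variable names are only illustrative.

```autoit
Local $s = "<li>One<li>Inner<li>Innermost</li>something else</li>more stuff</li>"
; The lazy .*? stops at the first </li> it meets, which closes the innermost item.
Local $aNaive = StringRegExp($s, '(?s)<li>(.*?)</li>', 3)
If Not @error Then
    ConsoleWrite("matches: " & UBound($aNaive) & @CRLF)   ; matches: 1
    ConsoleWrite($aNaive[0] & @CRLF)                      ; One<li>Inner<li>Innermost
EndIf
```

The captured text still contains two unclosed <li> tags, and "something else" and "more stuff" are dropped because no further <li> follows the first </li>.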

Edited by jchd

what about:

#include<array.au3>

Local $s = "<li>One<li>Inner<li>Innermost</li></li></li><li>Two</li>"

_ArrayDisplay(stringregexp($s , "(<li>.*?(?:</li>)+)" , 3))

 


@jchd
Thanks, but I somewhat know that  :)
I thought that Mingre was interested in grabbing the content but not the end tags
This produces the result mentioned in post #1

$data = StringRegExp($s, '<li>(.*?)</li>(?!</li>)' , 3)

Anyway I'll try to understand your recursive thing. For the moment there is a missing connection in my brain which makes me unable to understand it  :sweating:

2 minutes ago, mikell said:

I thought that Mingre was interested in grabbing the content but not the end tags

I actually trimmed the ends after getting the content :lol:

Here's what I'm working with, straight from SciTE, hehe. Sorry for my messy coding style!

#include <Array.au3>
Local $s = "Lal<li>One<li>Inner<li hehe>Innermost</li></li><li>Inner 2<li>Innermost 2</li></li></li><li>Two</li>"
;GLobal $iRecursion = 0
;Local $s = "Innermost"
Local $ha[1][2]
hehe($ha, $s)
_ArrayDisplay($ha)

Func hehe(ByRef $array, $x, $iRecursion = -1)
    $iRecursion += 1
    Local Const $regEx_Start = "<li[^>]*?>"
    Local Const $start = '(?(DEFINE) (?<LiStart> ' & $regEx_Start & ' ) )'
    Local Const $regEx_End = "<\/li>"
    Local Const $end = '(?(DEFINE) (?<LiEnd>  ' & $regEx_End & '  ) )'
    Local $regex = _
            '(?imsx)' & _
            $start & _
            $end & _
            '(?(DEFINE) (?<LiBlock> (?&LiStart) (?: (?&LiBlock)* | .*? )* (?&LiEnd) ) )' & _
            '(?&LiBlock)'
    Local $data = StringRegExp($x, $regex, 3)
    ;Consolewrite(@LF & 'x' & ' - ' & $x)
    Local $left, $wilLRecurse
    For $i = 0 To UBound($data) - 1 Step +1
        $data[$i] = StringRegExpReplace($data[$i], _
                '(?imsx)\A' & $regEx_Start & '(.*)' & $regEx_End & '\Z', '$1')

        $wilLRecurse = False
        If StringRegExp($data[$i], $regex) Then
            $wilLRecurse = True
            $left = $data[$i]
            $hi = StringRegExp($data[$i], '(?imsx)(\A.*?)(?:' & $regEx_Start & ')', 3)
            $data[$i] = $hi[0]
        EndIf


        ;_ArrayDisplay($left)
        _ArrayAdd($array, $iRecursion)
        $array[UBound($array) -1][1] = $data[$i]
        ConsoleWrite(@LF & $iRecursion & ' - ' & $data[$i])
        If $wilLRecurse Then hehe($array, $left, $iRecursion)

        ;EndIf
    Next
    ;
    $iRecursion -= 1
    ;_ArrayDisplay($data, $x)
    Return $data
EndFunc   ;==>hehe

 

@mikell

Its structure is quite similar to this one. But still your last example doesn't do the job in the general case. See the difference:

Local $s = "<li>One<li>Inner<li>Innermost</li>Rha ... lovely!</li>oops</li><li>Two</li>"
Local $regex = _
    '(?imsx)' & _
    '(?(DEFINE) (?<LiStart> <li>  ) )' & _
    '(?(DEFINE) (?<LiEnd>   <\/li> ) )' & _
    '(?(DEFINE) (?<LiBlock> (?&LiStart) (?: (?&LiBlock)* | .*? )* (?&LiEnd) ) )' & _
    '(?&LiBlock)'
$data = StringRegExp($s, $regex, 3)
_ArrayDisplay($data)
$data = StringRegExp($s, '<li>(.*?)</li>(?!</li>)' , 3)
_ArrayDisplay($data)

 


13 minutes ago, iamtheky said:

what about:

#include<array.au3>

Local $s = "<li>One<li>Inner<li>Innermost</li></li></li><li>Two</li>"

_ArrayDisplay(stringregexp($s , "(<li>.*?(?:</li>)+)" , 3))

 

I don't know proper HTML, but sometimes there is intervening text between two "</li>". :(

#include<array.au3>

Local $s = "<li><b>One<li>Inner<li>Innermost</li></li></b></li><li>Two</li>"

_ArrayDisplay(stringregexp($s , "(<li>.*?(?:</li>)+)" , 3))

 

Exactly why I pointed that detail out.


@jchd
Right. I surrender   :D

Is there a way to write your code using numbered subpatterns instead of named ones ?

Of course, it's up to you to rewrite it with numbered references. It's a bit faster to parse (a few µs), but much more confusing. I find named patterns very useful when a complex regex has to break down several similar structures.


@jchd and @mikell: I don't understand how JC's code could be written with numbered subpatterns. As I see it, each subpattern number corresponds to a capturing group, which is not the case for defined subroutines, since they are not captured. (?1) refers to the first capturing group, so how can it refer to a non-capturing group? Is it possible? (I have searched for a long time, so I would be grateful for an answer!)
...or else the StringRegExp result will have more entries than JC's "defined-subroutine" way.

By the way, JC, your beautiful regex is not so hard to dissect, but it really takes an extra-evolved brain to build something like it.

You want variants? Okay.

Local $s = "<li>One<li>Inner<li>Innermost</li>rhagnagna</li>gloups</li><li>Two</li>"
Local $regex = _
    '(?imsx)' & _
    '(?(DEFINE) (?<LiStart> <li>  ) )' & _
    '(?(DEFINE) (?<LiEnd>   <\/li> ) )' & _
    '(?(DEFINE) (?<LiBlock> (?&LiStart) (?: (?&LiBlock)* | .*? )* (?&LiEnd) ) )' & _
    '(?&LiBlock)'
$data = StringRegExp($s, $regex, 3)
_ArrayDisplay($data)

$regex = _
    '(?imsx)' & _
    '(?<LiBlock>' & _
    '   <li>' & _
    '   (?: (?&LiBlock)* | .*? )*' & _
    '   <\/li>' & _
    ')'
$data = StringRegExp($s, $regex , 3)
_ArrayDisplay($data)

$regex = _
    '(?imsx)' & _
    '(' & _
    '   <li>' & _
    '   (?: (?-1)* | .*? )*' & _
    '   <\/li>' & _
    ')'
$data = StringRegExp($s, $regex , 3)
_ArrayDisplay($data)

$regex = '(?ims)(<li>(?:(?-1)*|.*?)*<\/li>)'
$data = StringRegExp($s, $regex , 3)
_ArrayDisplay($data)

 

#19 ·  Posted (edited)

I could have added a couple (English "couple" often means 3) of shorter versions:

$regex = '(?ims)(<li>(?:(?1)*|.*?)*</li>)'
$data = StringRegExp($s, $regex , 3)
_ArrayDisplay($data)

$regex = '(?ims)(<li>(?:(?0)*|.*?)*</li>)'
$data = StringRegExp($s, $regex , 3)
_ArrayDisplay($data)

$regex = '(?ims)(<li>(?:(?R)*|.*?)*</li>)'
$data = StringRegExp($s, $regex , 3)
_ArrayDisplay($data)

Note that all of these, and the versions above, are only cosmetic rewrites of one another: the structure and behaviour are exactly the same.
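Question (2) from post #1, matching the "Innermost" bullet, does not need recursion. One possible sketch, not taken from the posts above, captures every <li> whose body contains no further <li> or </li>, using a negative lookahead in front of each character. It assumes bare <li> tags without attributes, and note that it also returns "Two", since that item has no nested list either.

```autoit
Local $s = "<li>One<li>Inner<li>Innermost</li></li></li><li>Two</li>"
; (?:(?!</?li>).)* consumes characters one at a time, as long as no <li> or </li> starts there.
Local $aLeaves = StringRegExp($s, '(?is)<li>((?:(?!</?li>).)*)</li>', 3)
If @error Then
    ConsoleWrite("No leaf item found" & @CRLF)
Else
    For $i = 0 To UBound($aLeaves) - 1
        ConsoleWrite($aLeaves[$i] & @CRLF)   ; Innermost, then Two
    Next
EndIf
```

To keep only items nested below a given depth, the leaf pattern would have to be combined with the recursive one; the sketch above does not attempt that.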

Edited by jchd
