Mingre

[Again in conflict; post 21] StringRegExp help - on nested HTML tags

Mingre
#include <Array.au3>
; Script Start - Add your code below here
Local $test = "<li>One<li>Inner<li>Innermost</li></li></li>" & _
            "<li>Two</li> "
$loob = StringRegExp($test, '\Q<li>\E(.*?)\Q</li>\E', 3)
_ArrayDisplay($loob, "How to return the One... and Two?")

Hello, can somebody help me:

(1) How can I make the regexp match the two outermost bullets, such that:
 

Quote

$array[0] = "One<li>Inner<li>Innermost</li></li>"

$array[1] = "Two"

(2) How can I match the "Innermost" bullet?

Thanks so much.

Edited by Mingre
Added the second question; added [SOLVED] on the title; changed title to not solved.

jchd

Do you mean something like this:

#include <Array.au3>

Local $s = "<li>One<li>Inner<li>Innermost</li></li></li><li>Two</li>"
Local $regex = _
    '(?imsx)' & _
    '(?(DEFINE) (?<LiStart> <li>  ) )' & _
    '(?(DEFINE) (?<LiEnd>   <\/li> ) )' & _
    '(?(DEFINE) (?<LiBlock> (?&LiStart) (?: (?&LiBlock)* | .*? )* (?&LiEnd) ) )' & _
    '(?&LiBlock)'
$data = StringRegExp($s, $regex, 3)
_ArrayDisplay($data)

Note that this seemingly complex regexp uses an explicitly recursive pattern. Using named sub-patterns makes it more verbose but much clearer. The X (eXtended) option, which makes unescaped whitespace insignificant, also adds to readability. Refer to https://regex101.com/ for an English explanation of the regexp's semantics and for debugging. Also read the official PCRE documentation for details on available constructs.
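The recursive pattern returns each outer block with its enclosing tags. To get the exact array asked for in question (1), the outer tags can be stripped afterwards. The following is only a sketch under that assumption; the variable name $blocks and the stripping pattern are illustrative and were not part of the post above.

```autoit
#include <Array.au3>

Local $s = "<li>One<li>Inner<li>Innermost</li></li></li><li>Two</li>"
Local $regex = _
    '(?imsx)' & _
    '(?(DEFINE) (?<LiStart> <li>  ) )' & _
    '(?(DEFINE) (?<LiEnd>   <\/li> ) )' & _
    '(?(DEFINE) (?<LiBlock> (?&LiStart) (?: (?&LiBlock)* | .*? )* (?&LiEnd) ) )' & _
    '(?&LiBlock)'

; Flag 3 returns every match of the whole pattern (the outermost blocks).
Local $blocks = StringRegExp($s, $regex, 3)
If @error Then
    ConsoleWrite("No <li> block found" & @CRLF)
    Exit
EndIf

; Strip only the first <li> and the last </li> of each block.
; The greedy (.*) keeps any nested tags in between untouched.
For $i = 0 To UBound($blocks) - 1
    $blocks[$i] = StringRegExpReplace($blocks[$i], '(?is)\A<li>(.*)</li>\z', '$1')
Next

ConsoleWrite(_ArrayToString($blocks, @CRLF) & @CRLF)
```

With the sample string this should leave the two elements requested in the first post: the first item with its nested list intact, and "Two".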

Edited by jchd

This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)

Mingre

@jchd Thanks so much! That's exactly what I need tho I don't really understand that regexp :D Will try to learn it.

Again, thanks!

Mingre

Thanks also for the tip re: unescaped whitespaces. I was having a hard time reading through regexps because of the lack of spaces. :lol:

mikell

Assuming that the regex engine works left to right couldn't something like this be enough ?

$data = StringRegExp($s, '(?s)<li>(.*?)</li>' , 3)

Edit
BTW jchd, thanks for this recursion example. This will make a nice cogitation for the next weekend  :)

Edited by mikell

Mingre

@mikell If it's done that way, the first encountered "</li>" from the left will be a match, which isn't exactly the pair of the outermost "<li>". :(

mikell

OK, sorry. I didn't think that including the "</li>" in the captured match was something important  :)

jchd

@mikell, the issue isn't whether the end markup is included, but how nested <li>...</li> blocks are handled. The naive "<li>(.*?)</li>" will anchor at the first <li> and match up to the first </li> after it, pairing the wrong tags: the outermost <li> with the innermost </li>:

<li>One<li>Inner<li>Innermost</li>something else</li>more stuff</li>
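To see the mismatch concretely, the naive pattern can be run on the string above. The same sketch also tries one possible approach to question (2), which the thread does not answer directly: a body that is not allowed to contain the start of another <li> or </li>, so only items without nested items can match. The second pattern and the variable names are assumptions for illustration, not something posted here.

```autoit
#include <Array.au3>

Local $s = "<li>One<li>Inner<li>Innermost</li>something else</li>more stuff</li>"

; Naive lazy pattern: the capture ends at the first </li> after the first <li>,
; so the outermost opening tag is paired with the innermost closing tag.
Local $naive = StringRegExp($s, '(?s)<li>(.*?)</li>', 3)
ConsoleWrite("Naive: " & _ArrayToString($naive, " | ") & @CRLF)

; Leaf items only: every character of the body must not begin an <li ...> or </li> tag.
Local $leaf = StringRegExp($s, '(?is)<li>((?:(?!</?li\b).)*)</li>', 3)
ConsoleWrite("Leaf:  " & _ArrayToString($leaf, " | ") & @CRLF)
```

On this input the naive pattern captures "One<li>Inner<li>Innermost", while the leaf pattern captures only "Innermost". Note that the leaf pattern returns every item without nested items, so on the string from the first post it would also return "Two".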

Edited by jchd

iamtheky

what about:

#include<array.au3>

Local $s = "<li>One<li>Inner<li>Innermost</li></li></li><li>Two</li>"

_ArrayDisplay(stringregexp($s , "(<li>.*?(?:</li>)+)" , 3))

 


mikell

@jchd
Thanks, but I somewhat know that  :)
I thought that Mingre was interested in grabbing the content but not the end tags
This produces the result mentioned in post #1

$data = StringRegExp($s, '<li>(.*?)</li>(?!</li>)' , 3)

Anyway I'll try to understand your recursive thing. For the moment there is a missing connection in my brain which makes me unable to understand it  :sweating:

Mingre
2 minutes ago, mikell said:

I thought that Mingre was interested in grabbing the content but not the end tags

I actually trimmed the ends after getting the content :lol:

Here's what I'm working with straight from SCiTe, hehe. Sorry for my messy coding style!

#include <Array.au3>
Local $s = "Lal<li>One<li>Inner<li hehe>Innermost</li></li><li>Inner 2<li>Innermost 2</li></li></li><li>Two</li>"
;GLobal $iRecursion = 0
;Local $s = "Innermost"
Local $ha[1][2]
hehe($ha, $s)
_ArrayDisplay($ha)

Func hehe(ByRef $array, $x, $iRecursion = -1)
    $iRecursion += 1
    Local Const $regEx_Start = "<li[^>]*?>"
    Local Const $start = '(?(DEFINE) (?<LiStart> ' & $regEx_Start & ' ) )'
    Local Const $regEx_End = "<\/li>"
    Local Const $end = '(?(DEFINE) (?<LiEnd>  ' & $regEx_End & '  ) )'
    Local $regex = _
            '(?imsx)' & _
            $start & _
            $end & _
            '(?(DEFINE) (?<LiBlock> (?&LiStart) (?: (?&LiBlock)* | .*? )* (?&LiEnd) ) )' & _
            '(?&LiBlock)'
    Local $data = StringRegExp($x, $regex, 3)
    ;Consolewrite(@LF & 'x' & ' - ' & $x)
    Local $left, $wilLRecurse
    For $i = 0 To UBound($data) - 1 Step +1
        $data[$i] = StringRegExpReplace($data[$i], _
                '(?imsx)\A' & $regEx_Start & '(.*)' & $regEx_End & '\Z', '$1')

        $wilLRecurse = False
        If StringRegExp($data[$i], $regex) Then
            $wilLRecurse = True
            $left = $data[$i]
            $hi = StringRegExp($data[$i], '(?imsx)(\A.*?)(?:' & $regEx_Start & ')', 3)
            $data[$i] = $hi[0]
        EndIf


        ;_ArrayDisplay($left)
        _ArrayAdd($array, $iRecursion)
        $array[UBound($array) -1][1] = $data[$i]
        ConsoleWrite(@LF & $iRecursion & ' - ' & $data[$i])
        If $wilLRecurse Then hehe($array, $left, $iRecursion)

        ;EndIf
    Next
    ;
    $iRecursion -= 1
    ;_ArrayDisplay($data, $x)
    Return $data
EndFunc   ;==>hehe
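
If the goal is only to list each piece of text with its nesting level, a flat scan with a depth counter is a possible alternative to recursion. This is a different technique from the function above, offered as a rough sketch: it assumes the markup is well formed, and the names $tokens and $depth are made up for the example.

```autoit
Local $s = "Lal<li>One<li>Inner<li hehe>Innermost</li></li><li>Inner 2<li>Innermost 2</li></li></li><li>Two</li>"

; Split the string into three kinds of tokens: opening <li ...> tags,
; closing </li> tags, and runs of text that do not start an li tag.
Local $tokens = StringRegExp($s, '(?is)<li\b[^>]*>|</li>|(?:(?!</?li\b).)+', 3)

Local $depth = -1 ; text outside any <li> is reported at depth -1
For $i = 0 To UBound($tokens) - 1
    If StringRegExp($tokens[$i], '(?i)\A<li\b[^>]*>\z') Then
        $depth += 1 ; entering an item
    ElseIf StringRegExp($tokens[$i], '(?i)\A</li>\z') Then
        $depth -= 1 ; leaving an item
    Else
        ConsoleWrite($depth & ' - ' & $tokens[$i] & @CRLF)
    EndIf
Next
```

Unlike the recursive pattern, this does not verify that tags are balanced; a stray </li> simply lowers the reported depth.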

 

jchd

@mikell

Its structure is quite similar to this one. But still your last example doesn't do the job in the general case. See the difference:

#include <Array.au3>

Local $s = "<li>One<li>Inner<li>Innermost</li>Rha ... lovely!</li>oops</li><li>Two</li>"
Local $regex = _
    '(?imsx)' & _
    '(?(DEFINE) (?<LiStart> <li>  ) )' & _
    '(?(DEFINE) (?<LiEnd>   <\/li> ) )' & _
    '(?(DEFINE) (?<LiBlock> (?&LiStart) (?: (?&LiBlock)* | .*? )* (?&LiEnd) ) )' & _
    '(?&LiBlock)'
$data = StringRegExp($s, $regex, 3)
_ArrayDisplay($data)
$data = StringRegExp($s, '<li>(.*?)</li>(?!</li>)' , 3)
_ArrayDisplay($data)

 


Mingre
13 minutes ago, iamtheky said:

what about:

#include<array.au3>

Local $s = "<li>One<li>Inner<li>Innermost</li></li></li><li>Two</li>"

_ArrayDisplay(stringregexp($s , "(<li>.*?(?:</li>)+)" , 3))

 

I don't know proper HTML but sometimes there are intervening texts between two "</li>". :(

#include<array.au3>

Local $s = "<li><b>One<li>Inner<li>Innermost</li></li></b></li><li>Two</li>"

_ArrayDisplay(stringregexp($s , "(<li>.*?(?:</li>)+)" , 3))

 

jchd

Exactly why I pointed that detail out.


mikell

@jchd
Right. I surrender   :D

Is there a way to write your code using numbered subpatterns instead of named ones ?

jchd

Of course it's up to you to rewrite it with numbered references. It's a bit faster to parse (a few µs), but much more confusing. I find named patterns very useful when a complex regex has to break down several similar structures.


jguinch

@jchd and @mikell : I don't understand how it's possible to write JC's code with numbered subpatterns ... For me each subpattern number corresponds to a captured group : that's not the case with defined subroutines, which are not captured. (?1) refers to the first capturing group, so how can it be made to refer to a non-capturing group ? Is it possible ? (I searched for a long time, so if you have an answer, I would be grateful to you !)
...or the StringRegExp result will have more results than JC's "defined-subroutine" way.

By the way, JC, your beautiful regex is not so hard to decorticate, but it really needs an extra-evolved brain to build something like it.

jchd

You want variants? Okay.

#include <Array.au3>

Local $s = "<li>One<li>Inner<li>Innermost</li>rhagnagna</li>gloups</li><li>Two</li>"
Local $regex = _
    '(?imsx)' & _
    '(?(DEFINE) (?<LiStart> <li>  ) )' & _
    '(?(DEFINE) (?<LiEnd>   <\/li> ) )' & _
    '(?(DEFINE) (?<LiBlock> (?&LiStart) (?: (?&LiBlock)* | .*? )* (?&LiEnd) ) )' & _
    '(?&LiBlock)'
$data = StringRegExp($s, $regex, 3)
_ArrayDisplay($data)

$regex = _
    '(?imsx)' & _
    '(?<LiBlock>' & _
    '   <li>' & _
    '   (?: (?&LiBlock)* | .*? )*' & _
    '   <\/li>' & _
    ')'
$data = StringRegExp($s, $regex , 3)
_ArrayDisplay($data)

$regex = _
    '(?imsx)' & _
    '(' & _
    '   <li>' & _
    '   (?: (?-1)* | .*? )*' & _
    '   <\/li>' & _
    ')'
$data = StringRegExp($s, $regex , 3)
_ArrayDisplay($data)

$regex = '(?ims)(<li>(?:(?-1)*|.*?)*<\/li>)'
$data = StringRegExp($s, $regex , 3)
_ArrayDisplay($data)

 

jchd

I could have added a couple (English "couple" often means 3) of shorter versions:

$regex = '(?ims)(<li>(?:(?1)*|.*?)*</li>)'
$data = StringRegExp($s, $regex , 3)
_ArrayDisplay($data)

$regex = '(?ims)(<li>(?:(?0)*|.*?)*</li>)'
$data = StringRegExp($s, $regex , 3)
_ArrayDisplay($data)

$regex = '(?ims)(<li>(?:(?R)*|.*?)*</li>)'
$data = StringRegExp($s, $regex , 3)
_ArrayDisplay($data)

Note that all of this and the above is only a series of cosmetic, syntactic rewrites; the structure and behavior are exactly the same.
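
One way to check that claim is to run the numbered variants side by side on the same input and compare what they return. This is just a self-contained sketch; the array name $patterns and the console output format are chosen for the example.

```autoit
#include <Array.au3>

Local $s = "<li>One<li>Inner<li>Innermost</li>rhagnagna</li>gloups</li><li>Two</li>"

; The same recursive structure, written with four different recursion syntaxes.
Local $patterns[4] = [ _
        '(?ims)(<li>(?:(?-1)*|.*?)*</li>)', _
        '(?ims)(<li>(?:(?1)*|.*?)*</li>)', _
        '(?ims)(<li>(?:(?0)*|.*?)*</li>)', _
        '(?ims)(<li>(?:(?R)*|.*?)*</li>)']

Local $result
For $i = 0 To UBound($patterns) - 1
    $result = StringRegExp($s, $patterns[$i], 3)
    If @error Then
        ConsoleWrite($patterns[$i] & "  ->  no match" & @CRLF)
    Else
        ConsoleWrite($patterns[$i] & "  ->  " & _ArrayToString($result, " | ") & @CRLF)
    EndIf
Next
```

If the rewrites really are equivalent, every line printed should show the same two blocks.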

Edited by jchd
