[Again in conflict; post 21] StringRegExp help - on nested HTML tags


#include <Array.au3>
; Script Start - Add your code below here
Local $test = "<li>One<li>Inner<li>Innermost</li></li></li>" & _
            "<li>Two</li> "
$loob = StringRegExp($test, '\Q<li>\E(.*?)\Q</li>\E', 3)
_ArrayDisplay($loob, "How to return the One... and Two?")

Hello, can somebody help me?

(1) How can I make the regexp match the two outermost bullets, such that:
 

Quote

$array[0] = "One<li>Inner<li>Innermost</li></li>"

$array[1] = "Two"

(2) How can I match the "Innermost" bullet?

Thanks so much.

Edited by Mingre
Added the second question; added [SOLVED] on the title; changed title to not solved.

Do you mean something like this:

Local $s = "<li>One<li>Inner<li>Innermost</li></li></li><li>Two</li>"
Local $regex = _
    '(?imsx)' & _
    '(?(DEFINE) (?<LiStart> <li>  ) )' & _
    '(?(DEFINE) (?<LiEnd>   <\/li> ) )' & _
    '(?(DEFINE) (?<LiBlock> (?&LiStart) (?: (?&LiBlock)* | .*? )* (?&LiEnd) ) )' & _
    '(?&LiBlock)'
$data = StringRegExp($s, $regex, 3)
_ArrayDisplay($data)

Note that this seemingly complex regexp uses an explicitly recursive pattern. Using named sub-patterns makes it more verbose but much clearer. The x (extended) option, which makes unescaped whitespace in the pattern insignificant, also adds to readability. Refer to https://regex101.com/ for an English translation of the regexp semantics and for debugging. Also read the official PCRE documentation for details on available constructs.
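
For readers new to recursion in PCRE, a smaller case may help before tackling the <li> pattern. The sketch below is not from the original post: it applies the same idea to balanced parentheses, with (?R) standing for "the whole pattern again". The sample string and variable names are made up for illustration.

```autoit
#include <Array.au3>

; Illustrative input, not taken from the thread
Local $sText = "a(b(c)d)e(f)"

; \(            an opening parenthesis
; (?: ... )*    followed by any number of:
;   [^()]++       a run of non-parenthesis characters, or
;   (?R)          a complete nested (...) group (recursion into the whole pattern)
; \)            and then the matching closing parenthesis
Local $aGroups = StringRegExp($sText, '\((?:[^()]++|(?R))*\)', 3)

; With no capturing group in the pattern, flag 3 returns the full matches:
; $aGroups[0] = "(b(c)d)" and $aGroups[1] = "(f)"
_ArrayDisplay($aGroups, "Balanced parentheses")
```

The <li> pattern above has the same shape, with <li> and </li> playing the roles of the two parentheses.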

Edited by jchd

This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)


Assuming that the regex engine works left to right, couldn't something like this be enough?

$data = StringRegExp($s, '(?s)<li>(.*?)</li>' , 3)

Edit
BTW jchd, thanks for this recursion example. This will make a nice cogitation for the next weekend  :)

Edited by mikell

@mikell, the issue isn't whether to include the end markup, but the nesting of <li>...</li> blocks. The naive "<li>(.*?)</li>" will anchor at the first <li> and match up to the first </li> after it, pairing the wrong tags: the outermost <li> with the innermost </li>.

<li>One<li>Inner<li>Innermost</li>something else</li>more stuff</li>
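
To see the mismatch concretely, here is a minimal sketch that runs the naive pattern on that string (the variable names are only illustrative):

```autoit
#include <Array.au3>

Local $sHtml = "<li>One<li>Inner<li>Innermost</li>something else</li>more stuff</li>"

; Lazy match: starts at the first <li> and stops at the first </li> it meets
Local $aNaive = StringRegExp($sHtml, '(?s)<li>(.*?)</li>', 3)

; $aNaive has a single element: "One<li>Inner<li>Innermost"
; The rest of the string contains no further <li>, so nothing else matches
_ArrayDisplay($aNaive, "Naive lazy match")
```

The captured text still contains two unclosed <li> tags, and "something else" and "more stuff" are lost entirely.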

Edited by jchd


what about:

#include<array.au3>

Local $s = "<li>One<li>Inner<li>Innermost</li></li></li><li>Two</li>"

_ArrayDisplay(stringregexp($s , "(<li>.*?(?:</li>)+)" , 3))

 


@jchd
Thanks, but I somewhat know that  :)
I thought that Mingre was interested in grabbing the content but not the end tags
This produces the result mentioned in post #1

$data = StringRegExp($s, '<li>(.*?)</li>(?!</li>)' , 3)

Anyway I'll try to understand your recursive thing. For the moment there is a missing connection in my brain which makes me unable to understand it  :sweating:
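
On question (2) from the first post, matching the innermost bullet may not need recursion at all. One possible sketch, assuming "innermost" means an <li> block that contains no other <li> tag, and assuming bare <li> tags without attributes, uses a negative lookahead so the content can never run over another opening or closing tag:

```autoit
#include <Array.au3>

Local $sHtml = "<li>One<li>Inner<li>Innermost</li></li></li><li>Two</li>"

; (?:(?!</?li>).)*  consumes one character at a time, but only while
;                   no <li> or </li> tag starts at the current position
Local $aLeaves = StringRegExp($sHtml, '(?is)<li>((?:(?!</?li>).)*)</li>', 3)

; $aLeaves[0] = "Innermost" and $aLeaves[1] = "Two"
_ArrayDisplay($aLeaves, "Blocks with no nested <li>")
```

Note that "Two" is returned as well, since it too has nothing nested inside it; telling the two apart by depth would need extra logic.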


2 minutes ago, mikell said:

I thought that Mingre was interested in grabbing the content but not the end tags

I actually trimmed the ends after getting the content :lol:

Here's what I'm working with, straight from SciTE, hehe. Sorry for my messy coding style!

#include <Array.au3>
Local $s = "Lal<li>One<li>Inner<li hehe>Innermost</li></li><li>Inner 2<li>Innermost 2</li></li></li><li>Two</li>"
;GLobal $iRecursion = 0
;Local $s = "Innermost"
Local $ha[1][2]
hehe($ha, $s)
_ArrayDisplay($ha)

Func hehe(ByRef $array, $x, $iRecursion = -1)
    $iRecursion += 1
    Local Const $regEx_Start = "<li[^>]*?>"
    Local Const $start = '(?(DEFINE) (?<LiStart> ' & $regEx_Start & ' ) )'
    Local Const $regEx_End = "<\/li>"
    Local Const $end = '(?(DEFINE) (?<LiEnd>  ' & $regEx_End & '  ) )'
    Local $regex = _
            '(?imsx)' & _
            $start & _
            $end & _
            '(?(DEFINE) (?<LiBlock> (?&LiStart) (?: (?&LiBlock)* | .*? )* (?&LiEnd) ) )' & _
            '(?&LiBlock)'
    Local $data = StringRegExp($x, $regex, 3)
    ;Consolewrite(@LF & 'x' & ' - ' & $x)
    Local $left, $wilLRecurse
    For $i = 0 To UBound($data) - 1 Step +1
        $data[$i] = StringRegExpReplace($data[$i], _
                '(?imsx)\A' & $regEx_Start & '(.*)' & $regEx_End & '\Z', '$1')

        $wilLRecurse = False
        If StringRegExp($data[$i], $regex) Then
            $wilLRecurse = True
            $left = $data[$i]
            $hi = StringRegExp($data[$i], '(?imsx)(\A.*?)(?:' & $regEx_Start & ')', 3)
            $data[$i] = $hi[0]
        EndIf


        ;_ArrayDisplay($left)
        _ArrayAdd($array, $iRecursion)
        $array[UBound($array) -1][1] = $data[$i]
        ConsoleWrite(@LF & $iRecursion & ' - ' & $data[$i])
        If $wilLRecurse Then hehe($array, $left, $iRecursion)

        ;EndIf
    Next
    ;
    $iRecursion -= 1
    ;_ArrayDisplay($data, $x)
    Return $data
EndFunc   ;==>hehe

 


@mikell

Its structure is quite similar to this one. But still your last example doesn't do the job in the general case. See the difference:

Local $s = "<li>One<li>Inner<li>Innermost</li>Rha ... lovely!</li>oops</li><li>Two</li>"
Local $regex = _
    '(?imsx)' & _
    '(?(DEFINE) (?<LiStart> <li>  ) )' & _
    '(?(DEFINE) (?<LiEnd>   <\/li> ) )' & _
    '(?(DEFINE) (?<LiBlock> (?&LiStart) (?: (?&LiBlock)* | .*? )* (?&LiEnd) ) )' & _
    '(?&LiBlock)'
$data = StringRegExp($s, $regex, 3)
_ArrayDisplay($data)
$data = StringRegExp($s, '<li>(.*?)</li>(?!</li>)' , 3)
_ArrayDisplay($data)
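
If the goal is the content without the enclosing tags, as in the first post, one option is to combine the recursion with a single capturing group. This is a sketch rather than a drop-in replacement for the patterns above: it assumes bare <li> tags without attributes and well-formed nesting.

```autoit
#include <Array.au3>

Local $sHtml = "<li>One<li>Inner<li>Innermost</li>Rha ... lovely!</li>oops</li><li>Two</li>"

; Group 1 wraps everything between the outer <li> and its own </li>:
;   (?!</?li>).   any single character that does not start an <li> or </li> tag
;   (?R)          or a complete nested <li>...</li> block (whole-pattern recursion)
Local $aContent = StringRegExp($sHtml, '(?is)<li>((?:(?!</?li>).|(?R))*)</li>', 3)

; $aContent[0] = "One<li>Inner<li>Innermost</li>Rha ... lovely!</li>oops"
; $aContent[1] = "Two"
_ArrayDisplay($aContent, "Outermost content only")
```

Because the pattern has exactly one capturing group, flag 3 returns only that group for each match, so no trimming of the tags is needed afterwards.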

 


13 minutes ago, iamtheky said:

what about:

#include<array.au3>

Local $s = "<li>One<li>Inner<li>Innermost</li></li></li><li>Two</li>"

_ArrayDisplay(stringregexp($s , "(<li>.*?(?:</li>)+)" , 3))

 

I don't know proper HTML, but sometimes there is intervening text between two "</li>" tags. :(

#include<array.au3>

Local $s = "<li><b>One<li>Inner<li>Innermost</li></li></b></li><li>Two</li>"

_ArrayDisplay(stringregexp($s , "(<li>.*?(?:</li>)+)" , 3))

 


Exactly why I pointed that detail out.


Of course, it's up to you to rewrite it with numbered references. That is a bit faster to parse (a few µs), but much more confusing. I find named patterns very useful when a complex regex has to break down several similar structures.


@jchd and @mikell: I don't understand how it's possible to write JC's code with numbered subpatterns... For me, each subpattern number corresponds to a captured group, which is not the case with defined subroutines, as they are not captured. (?1) refers to the first capturing group, so how can it refer to a non-capturing group? Is it possible? (I have searched for a long time, so if you have an answer, I would be grateful!)
...or the StringRegExp result will have more results than JC's "defined-subroutine" way.

By the way, JC, your beautiful regex is not so hard to dissect, but it really takes an extra-evolved brain to build something like it.


You want variants? Okay.

Local $s = "<li>One<li>Inner<li>Innermost</li>rhagnagna</li>gloups</li><li>Two</li>"
Local $regex = _
    '(?imsx)' & _
    '(?(DEFINE) (?<LiStart> <li>  ) )' & _
    '(?(DEFINE) (?<LiEnd>   <\/li> ) )' & _
    '(?(DEFINE) (?<LiBlock> (?&LiStart) (?: (?&LiBlock)* | .*? )* (?&LiEnd) ) )' & _
    '(?&LiBlock)'
$data = StringRegExp($s, $regex, 3)
_ArrayDisplay($data)

$regex = _
    '(?imsx)' & _
    '(?<LiBlock>' & _
    '   <li>' & _
    '   (?: (?&LiBlock)* | .*? )*' & _
    '   <\/li>' & _
    ')'
$data = StringRegExp($s, $regex , 3)
_ArrayDisplay($data)

$regex = _
    '(?imsx)' & _
    '(' & _
    '   <li>' & _
    '   (?: (?-1)* | .*? )*' & _
    '   <\/li>' & _
    ')'
$data = StringRegExp($s, $regex , 3)
_ArrayDisplay($data)

$regex = '(?ims)(<li>(?:(?-1)*|.*?)*<\/li>)'
$data = StringRegExp($s, $regex , 3)
_ArrayDisplay($data)

 


I could have added a couple (English "couple" often means 3) of shorter versions:

$regex = '(?ims)(<li>(?:(?1)*|.*?)*</li>)'
$data = StringRegExp($s, $regex , 3)
_ArrayDisplay($data)

$regex = '(?ims)(<li>(?:(?0)*|.*?)*</li>)'
$data = StringRegExp($s, $regex , 3)
_ArrayDisplay($data)

$regex = '(?ims)(<li>(?:(?R)*|.*?)*</li>)'
$data = StringRegExp($s, $regex , 3)
_ArrayDisplay($data)

Note that all of these, like the variants above, are only cosmetic rewrites: the structure and the way they work are exactly the same.

Edited by jchd
