[Again in conflict; post 21] StringRegExp help - on nested HTML tags


@jchd (and other kind souls) :D

I have two HTML files saved as *.txt: the original (ORIG.txt) and a simplified version (SIMPLE.txt) in which all tag attributes are stripped, e.g. <p style="bla bla"> becomes <p>. Both are attached to this post.

I already have other code working, but out of curiosity, can you guys enlighten me as to why this code doesn't work on the original HTML?

#include <Array.au3>
Local $fileread = FileRead(@ScriptDir & '\ORIG.txt') ; Function does not work on this
;Local $fileread = FileRead(@ScriptDir & '\SIMPLE.txt') ; Function works on the simplified version of HTML.

Local $ha[1]
__retrieveList($ha, $fileread)
_ArrayDisplay($ha)

Func __retrieveList(ByRef $array, Const $string, $iRecursion = -1)
    If $iRecursion = -1 Then
        ReDim $array[1][2]
        $array[0][0] = ''
        $array[0][1] = ''
    EndIf

    $iRecursion += 1
    Local Const $regEx_Start = "<li[^>]*?>", _
            $regEx_End = "<\/li>", _
            $regex = '(?imsx)' & _
            '(?(DEFINE) (?<LiStart> ' & $regEx_Start & ' ) )' & _
            '(?(DEFINE) (?<LiEnd>  ' & $regEx_End & '  ) )' & _
            '(?(DEFINE) (?<LiBlock> (?&LiStart) (?: (?&LiBlock)* | .*? )* (?&LiEnd) ) )' & _
            '(?&LiBlock)'
    Local $data = StringRegExp($string, $regex, 3) ; Not Const because this will be modified later.
    Local $aTemp, $iUbound
    For $i = 0 To UBound($data) - 1 Step +1
        $data[$i] = StringRegExpReplace($data[$i], '(?imsx)\A' & $regEx_Start & '(.*)' & $regEx_End & '\Z', '$1')
        $iUbound = UBound($array)
        If String($array[$iUbound - 1][0]) = "" Then $iUbound -= 1
        ReDim $array[$iUbound + 1][2]
        $array[$iUbound][0] = $iRecursion ;$data[$i]
        If Not StringRegExp($data[$i], $regex) Then
            $array[$iUbound][1] = $data[$i]
            ContinueLoop
        EndIf
        $aTemp = StringRegExp($data[$i], '(?imsx)(\A.*?)(?:' & $regEx_Start & ')', 3)
        $array[$iUbound][1] = $aTemp[0]
        __retrieveList($array, $data[$i], $iRecursion)
    Next
    $iRecursion -= 1
    Return 1 ; $data
EndFunc   ;==>__retrieveList

SciTE output:

>Running:(3.3.14.2):C:\Program Files\AutoIt3\autoit3.exe "C:\Documents and Settings\G99\Desktop\__retrieveList.au3"    
--> Press Ctrl+Alt+Break to Restart or Ctrl+Break to Stop
!>19:39:49 AutoIt3.exe ended.rc:-1073741819
+>19:39:49 AutoIt3Wrapper Finished.
>Exit code: 3221225477    Time: 0.6851

 

Anyway, here's the other code I'm referring to. It takes quite a different approach but basically does what I want, and it works on both HTML files.

#include <Array.au3>

Local $fileread = FileRead(@ScriptDir & '\ORIG.txt') ; This works.
;Local $fileread = FileRead(@ScriptDir & '\SIMPLE.txt') ; This works.
Local $ha[1]
__retrieveList($ha, $fileread)
_ArrayDisplay($ha)

Func __retrieveList(ByRef $array, $s__parsedRight, $i__recurse = -1)
    If $i__recurse = -1 Then
        ReDim $array[1][2]
        $array[0][0] = ''
        $array[0][1] = ''
    EndIf
    $i__recurse += 1
    Local Const $s__keyWord = 'li', _
            $s__start = "<" & $s__keyWord & "[^>]*?>", _
            $s__end = "<\/" & $s__keyWord & ">", _
            $s__regEx = "(?ims)\A.*?(" & $s__start & ".*?" & $s__end & ")"
    Local $s__parsedLeft = '', $a__temp[1], $i__uBound
    Do
        Switch UBound(StringRegExp($s__parsedLeft, $s__start, 3))
            Case 0
            Case UBound(StringRegExp($s__parsedLeft, $s__end, 3))
                $a__temp = StringRegExp($s__parsedLeft, '\A' & $s__start & '(.*)' & $s__end & '\Z', 3)
                $i__uBound = UBound($array)
                If String($array[$i__uBound - 1][0]) = "" Then $i__uBound -= 1
                ReDim $array[$i__uBound + 1][2]
                $array[$i__uBound][0] = $i__recurse
                $array[$i__uBound][1] = $a__temp[0]
                If StringRegExp($a__temp[0], $s__start) Then __retrieveList($array, $a__temp[0], $i__recurse)
            Case Else
                $s__parsedLeft &= __parse($s__parsedRight, "(?ims)(\A.*?" & $s__end & ")")
                ContinueLoop
        EndSwitch
        $s__parsedLeft = __parse($s__parsedRight, $s__regEx)
    Until @error
    $i__recurse -= 1
EndFunc   ;==>__retrieveList

Func __parse(ByRef $parsedString, $s__regEx)
    Local Const $a__temp = StringRegExp($parsedString, $s__regEx, 3)
    If @error Then Return SetError(1, 0, 0)
    $parsedString = StringRegExpReplace($parsedString, $s__regEx, "")
    Return $a__temp[0]
EndFunc   ;==>__parse

 

SIMPLE.txt

ORIG.txt

 

 

Edited by Mingre
Added SCiTE output.

I have no time to dig into the details, but I strongly suspect that the crash is due to PCRE (the regexp engine) exhausting the stack space allocated to it.

There is an outstanding request to compile PCRE into AutoIt with the option that forces use of the heap in lieu of the stack, which would get rid of this kind of issue.
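
If the stack really is the culprit, one way to sidestep it is to keep the nesting logic out of the regex altogether. Below is a minimal sketch under that assumption, not a fix of the original pattern: _LiByDepth is a made-up helper name, and it assumes that @extended after StringRegExp with flag 2 holds the offset just past the match, as the AutoIt help describes. It matches a single <li> or </li> tag per call and tracks depth with an explicit stack in AutoIt, so the regex has nothing to recurse on.

```autoit
; _LiByDepth is a hypothetical helper, not code from this thread.
; It scans <li> / </li> tags one at a time and tracks nesting with an
; explicit stack, so the regex itself never recurses.
Func _LiByDepth(Const $sHtml)
    Local Const $sTag = '(?i)</?li\b[^>]*>' ; one opening or closing li tag
    Local $aOut[1][2], $iRows = 0           ; [n][0] = depth, [n][1] = inner HTML
    Local $aStack[16][2], $iDepth = 0       ; [d][0] = output row, [d][1] = content start
    Local $iOffset = 1, $aTag, $iNext, $iTagStart, $iRow, $iStart

    While 1
        $aTag = StringRegExp($sHtml, $sTag, 2, $iOffset) ; flag 2: [0] = full match
        If @error Then ExitLoop                          ; no more li tags
        $iNext = @extended                               ; offset just past this tag
        $iTagStart = $iNext - StringLen($aTag[0])        ; offset where this tag begins

        If StringLeft($aTag[0], 2) <> '</' Then
            ; Opening tag: reserve an output row now so rows stay in document order.
            If $iDepth >= UBound($aStack) Then ReDim $aStack[$iDepth * 2][2]
            ReDim $aOut[$iRows + 1][2]
            $aOut[$iRows][0] = $iDepth
            $aOut[$iRows][1] = ''   ; stays empty if the tag is never closed
            $aStack[$iDepth][0] = $iRows
            $aStack[$iDepth][1] = $iNext
            $iRows += 1
            $iDepth += 1
        ElseIf $iDepth > 0 Then
            ; Closing tag: pop the matching opener and store everything in between.
            $iDepth -= 1
            $iRow = $aStack[$iDepth][0]
            $iStart = $aStack[$iDepth][1]
            $aOut[$iRow][1] = StringMid($sHtml, $iStart, $iTagStart - $iStart)
        EndIf
        $iOffset = $iNext
    WEnd

    If $iRows = 0 Then Return SetError(1, 0, 0) ; no <li> found
    Return $aOut
EndFunc   ;==>_LiByDepth

; Usage: print each item's depth and inner HTML to the console.
Local $aItems = _LiByDepth('<ul><li>a<ul><li>b</li></ul></li></ul>')
If Not @error Then
    For $i = 0 To UBound($aItems) - 1
        ConsoleWrite($aItems[$i][0] & ' | ' & $aItems[$i][1] & @CRLF)
    Next
EndIf
```

Unlike the code in the first post, this stores the full inner HTML of a parent item rather than only the text before its first nested <li>, and a stray </li> with no opener is simply ignored.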

 

