RegExp for parsing


Hi,

I'm wondering if it's possible to use a RegExp to "parse" HTML.

Here's what I want to do:

For example, I have the following text:

"
<div>
  Blabla
  <div>
    <div>
      Blabla
    </div>
  </div>
  Blabla
</div>
"

I'm wondering if it's possible to get what's inside the outermost div with a regexp.

Something that would return:

"
  Blabla
  <div>
    <div>
      Blabla
    </div>
  </div>
  Blabla
"

I wrote a function that counts the opening <div> and closing </div> tags and keeps searching until the two counts are equal.

But I think it would be much more efficient with a "simple" regexp.

What do you think?

Thanks a lot for your help, and sorry for my bad English :x

Pollop.


Unfortunately not :x Some languages like Lua have regex that can do this, but not AutoIt. If you know how many divs there are, or if it's always the first and last tags you want, then you could do it like this:

StringRegExpReplace($sInput, "(?s).*?<div>(?:\r\n)*(.*)(?:\r\n)*</div>.*", "\1")

Alternatives are to use the XmlDomWrapper UDF that's somewhere on the forum (it uses MSXML) and run XPath queries against it, or to stick with what you already have.
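
To make the "first and last tags" caveat concrete, here is a small sketch of that one-liner on two inputs. The sample strings and variable names are made up for illustration, and the pattern below ends in a greedy `.*` so that anything after the last `</div>` is dropped as well.

```autoit
; Sketch only: sample strings and variable names are hypothetical.
Local $sPattern = "(?s).*?<div>(?:\r\n)*(.*)(?:\r\n)*</div>.*"

; Nested divs: the greedy (.*) runs up to the LAST </div>, which is what we want here.
Local $sNested = "<div>a<div>b</div>c</div>"
Local $sInner1 = StringRegExpReplace($sNested, $sPattern, "\1")
ConsoleWrite($sInner1 & @CRLF) ; a<div>b</div>c

; Sibling divs: the same greediness now swallows the first closing tag too.
Local $sSiblings = "<div>a</div><p>x</p><div>b</div>"
Local $sInner2 = StringRegExpReplace($sSiblings, $sPattern, "\1")
ConsoleWrite($sInner2 & @CRLF) ; a</div><p>x</p><div>b
```

So the one-liner is only safe when a single outermost div wraps everything you care about.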

Edit: Found the link to Xml dom wrapper:


Thanks a lot for the reply...

I think I'm going to keep using my solution :x

Here it is (in case someone needs something like that):

Func HtmlBetween($sText, $sStart, $sEndTag = "</div>")
    Local $sCountUp
    Local $sCountDown
    Switch $sEndTag
        Case "</div>"
            $sCountUp = "<div"
            $sCountDown = "</div>"
        Case "</span>"
            $sCountUp = "<span"
            $sCountDown = "</span>"
        Case "</ul>"
            $sCountUp = "<ul"
            $sCountDown = "</ul>"
        Case "</li>"
            $sCountUp = "<li"
            $sCountDown = "</li>"
        Case "</a>"
            $sCountUp = "<a"
            $sCountDown = "</a>"
        Case Else
            LogError("Func HtmlBetween: Wrong type tag")
            Return False
    EndSwitch

    ; We begin by deleting what's before the start.
    Local $sStartPos = StringInStr($sText, $sStart)
    If $sStartPos = 0 Then
        LogError("Func HtmlBetween: Can't find the start")
        Return False
    EndIf
    $sText = StringTrimLeft($sText, $sStartPos + StringLen($sStart) - 1)

    ; We now search for the content
    Local $iNumberUp = 1
    Local $iNumberDown = 0
    Local $iUp
    Local $iDown
    While $iNumberDown <> $iNumberUp
        $iUp = StringInStr($sText, $sCountUp, 0, $iNumberUp)
        $iDown = StringInStr($sText, $sCountDown, 0, $iNumberDown + 1)
        If $iUp > 0 And $iUp < $iDown Then
            $iNumberUp += 1
        ElseIf $iDown > 0 Then
            $iNumberDown += 1
        Else
            LogError("Func HtmlBetween: Can't parse HTML, number of open tags != number of closing tags")
            Return False
        EndIf
    WEnd

    ; We get everything that's before the last closing tag
    Return StringLeft($sText, $iDown - 1)
EndFunc
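
The Switch above only maps each closing tag to its opening prefix, so the same counting idea can also be sketched with the tag name as a parameter and a running depth instead of two counters. `_HtmlInner` and its parameters are hypothetical names, not part of any UDF, and this is an untested sketch rather than a drop-in replacement. It shares the original's limitation that a prefix like "<a" also matches "<abbr".

```autoit
; Sketch only: _HtmlInner is a hypothetical helper illustrating depth counting.
Local $sDemoInner = _HtmlInner("<div>a<div>b</div>c</div>", "<div>")
ConsoleWrite($sDemoInner & @CRLF) ; a<div>b</div>c

Func _HtmlInner($sText, $sStart, $sTag = "div")
    Local $sOpen = "<" & $sTag, $sClose = "</" & $sTag & ">"

    ; Locate the start marker; the content begins right after it.
    Local $iStart = StringInStr($sText, $sStart)
    If $iStart = 0 Then Return SetError(1, 0, "")
    Local $iFrom = $iStart + StringLen($sStart)

    ; Walk forward from $iPos, tracking how many tags are still open.
    Local $iPos = $iFrom, $iDepth = 1, $iUp, $iDown
    While 1
        $iUp = StringInStr($sText, $sOpen, 0, 1, $iPos)
        $iDown = StringInStr($sText, $sClose, 0, 1, $iPos)
        If $iDown = 0 Then Return SetError(2, 0, "") ; unbalanced: no closing tag left
        If $iUp > 0 And $iUp < $iDown Then
            $iDepth += 1 ; a nested opening tag comes first
            $iPos = $iUp + StringLen($sOpen)
        Else
            $iDepth -= 1 ; a closing tag comes first
            If $iDepth = 0 Then ExitLoop
            $iPos = $iDown + StringLen($sClose)
        EndIf
    WEnd

    ; $iDown now points at the closing tag that balances the start marker.
    Return StringMid($sText, $iFrom, $iDown - $iFrom)
EndFunc   ;==>_HtmlInner
```

Searching from `$iPos` each time avoids rescanning the text from the beginning to find the Nth occurrence.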

You might want to have a look at this, which I wrote. It checks whether tags are opened and closed in the right order, but it could easily be modified to do what you want. It needs a bit more error checking to catch more tags being closed than opened or vice versa, but it works :x

#include<Array.au3>

$s = '<a href="www.google.com"><span>This is a test</a></span>'
MsgBox(0, $s, _HTML_Check($s))

$s = '<a href="www.google.com"><span>This is a test</span></a>'
MsgBox(0, $s, _HTML_Check($s))

Func _HTML_Check($sString)
    Local $aStack[1] = [0]
    Local $sTemp, $sLast

    For $i = 1 To StringLen($sString)
        If StringMid($sString, $i, 1) = "<" Then
            $sTemp = ""
            While 1
                $i += 1
                If $i > StringLen($sString) Or (Not StringIsAlNum(StringMid($sString, $i, 1)) And StringMid($sString, $i, 1) <> "/") Then ExitLoop
                $sTemp &= StringMid($sString, $i, 1)
            WEnd

            ConsoleWrite($sTemp & @LF)

            If StringLeft($sTemp, 1) = "/" Then
                $sTemp = StringTrimLeft($sTemp, 1)

                $sLast = _ArrayPop($aStack)
                $aStack[0] -= 1

                If $sTemp <> $sLast Then Return SetError(1, 0, "Expected closing tag for '" & $sLast & "' tag. Got closing tag for '" & $sTemp & "' instead.")
            Else
                If Not _HTML_IsTag($sTemp) Then Return SetError(1, 0, "Unrecognized tag: '" & $sTemp & "'")

                _ArrayAdd($aStack, $sTemp)
                $aStack[0] += 1
            EndIf
        EndIf
    Next

    Return "Success"
EndFunc   ;==>_HTML_Check

Func _HTML_IsTag($sTag)
    ; Add a switch or lookup and see if sTag is a proper tag.
    ; I just assume it is for now.
    Return True
EndFunc   ;==>_HTML_IsTag

Edit: Just found this:


I beg to differ with Mat's assertion that AutoIt PCRE can't do this.

Using the pattern

(?imsx) <div> ( ( (?>(?<=<div>).*(?=</div>)) | (?R) )+ ) </div>

and the input

<html><div>ab<div>cd<div>abcd<div>cdef</div><div>cdef1</div>

<div>

Blabla

<div>

<div>

Blabla

</div>

</div>

Blabla

</div>

<div>cdef2</div>efgh</div>gh</div>ef</div></html>

you get the wanted result. AutoIt PCRE _does_ support recursion. Note that recursing with multi-character boundaries (like HTML opening/closing tag pairs) is less trivial than with single-character boundaries (e.g. parentheses), but it surely can be done.

I don't necessarily mean that the solution above is the best thing since sliced bread, but it does work.
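
For anyone who wants to try recursion from a script, here is a minimal sketch using a simpler pattern than the one above. The pattern, the sample string and the variable names are illustrative assumptions, not the pattern from this post. It only handles bare `<div>` tags without attributes and is meant for short inputs.

```autoit
; Sketch only: this pattern is an illustrative assumption, not the one posted above.
; (?R) recurses into the whole pattern; (?!</?div>). consumes one character
; that does not start an opening or closing div tag.
Local $sRecursive = "(?s)<div>((?:(?!</?div>).|(?R))*)</div>"

Local $sHtml = "<div>" & @CRLF & "  Blabla" & @CRLF & "  <div><div>Blabla</div></div>" & @CRLF & "  Blabla" & @CRLF & "</div>"

; Flag 1 returns an array holding the capture groups of the first match.
Local $aMatch = StringRegExp($sHtml, $sRecursive, 1)
If @error Then
    ConsoleWrite("No balanced <div> found" & @CRLF)
Else
    ConsoleWrite($aMatch[0] & @CRLF) ; content of the outermost div
EndIf
```

Group 1 holds the content of the outermost div because PCRE restores the capture groups when a recursion returns.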


This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)


Yes, you're right :x I was talking about recursion like Lua has with '%bxy', which matches something beginning with x and ending in y with the same number of each. I should have known that it would be possible to do it some other way.

I'd still say that XmlDomWrapper.au3 is a better solution, but then I've never really liked using regex much.

Edit: I also worked a bit on my example:

#include<Array.au3>
#include<String.au3>

$s = BinaryToString(InetRead("http://www.isup.me/autoitscript.com"))
MsgBox(0, "www.google.com", _HTML_Check($s) & @CRLF & @extended)

Func _HTML_Check($sString)
    Local $aStack[1] = [0]
    Local $sTemp, $sLast
    Local $iLine = 1

    $sString = StringStripCR($sString)

    For $i = 1 To StringLen($sString)
        If StringMid($sString, $i, 1) = @LF Then
            $iLine += 1
        ElseIf StringMid($sString, $i, 1) = "<" Then
            $sTemp = ""
            While 1
                $i += 1
                If StringMid($sString, $i, 1) = @LF Then $iLine += 1

                If $i > StringLen($sString) Or (Not StringIsAlNum(StringMid($sString, $i, 1)) And StringMid($sString, $i, 1) <> "/") Then ExitLoop
                $sTemp &= StringMid($sString, $i, 1)
            WEnd

            If StringMid($sString, StringInStr($sString, ">", 1, 1, $i) - 1, 1) = "/" Then ; Self closing
                If Not _HTML_IsTag($sTemp) Then Return SetError(1, $iLine, "Unrecognized tag: '" & $sTemp & "'")

                ConsoleWrite(_StringRepeat("|", $aStack[0]) & "-" & $sTemp & @LF)

                ContinueLoop
            EndIf

            If StringLeft($sTemp, 1) = "/" Then
                $sTemp = StringTrimLeft($sTemp, 1)

                If $aStack[0] = 0 Then Return SetError(1, $iLine, "Unexpected closing tag: '" & $sTemp & "'")

                $sLast = _ArrayPop($aStack)
                $aStack[0] -= 1

                If $sTemp <> $sLast Then Return SetError(1, $iLine, "Expected closing tag for '" & $sLast & "' tag. Got closing tag for '" & $sTemp & "' instead.")
            ElseIf $sTemp = "" Then
                ; "<" was not followed by a tag name (e.g. "<!--" or "<!DOCTYPE"): nothing to push or pop.
            Else
                If Not _HTML_IsTag($sTemp) Then Return SetError(1, $iLine, "Unrecognized tag: '" & $sTemp & "'")

                ConsoleWrite(_StringRepeat("|", $aStack[0]) & "-" & $sTemp & @LF)

                _ArrayAdd($aStack, $sTemp)
                $aStack[0] += 1
            EndIf
        EndIf
    Next

    Return "Success"
EndFunc   ;==>_HTML_Check

Func _HTML_IsTag($sTag)
    ; Add a switch or lookup and see if sTag is a proper tag.
    ; I just assume it is for now.
    Return True
EndFunc   ;==>_HTML_IsTag


I fully agree that parsing such input (especially HTML, where whitespace can appear almost anywhere) with regexps is not the best solution. For HTML, navigating the IE objects is probably the most robust way; after all, a browser engine is particularly well suited to parsing HTML.

There are nonetheless countless situations where using non-basic to advanced regexp features is a reasonable, efficient and reliable approach. I only mentioned the recursion possibility here to that effect. To be honest, I hadn't used PCRE recursion for some time and had to try a couple of times before coming up with a working pattern, because the tags are multi-character. Real regexp gurus would find that simple one really trivial...

