Parse HTML string with DOM

Azevedo · February 22, 2015

Hello. Is it possible to parse an HTML string using some DOM library in AU3?

Edited February 22, 2015 by Azevedo

water · February 22, 2015

First thing that comes to my mind is the IE UDF that comes with AutoIt.

Azevedo · February 22, 2015

I'm not using IE's engine.

I'm getting the HTTP stream (html code) to a string.

There probably isn't a DOM library for that, so I'll use RegEx.
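
For simple cases, a regular expression over the downloaded string can be enough. Below is a minimal sketch, assuming the goal is to list the href values of anchor tags. The sample markup and variable names are hypothetical. This is pattern matching, not a DOM parser: it only handles double-quoted attribute values and may miss unusual or malformed markup.

```autoit
; Hypothetical sample markup; in practice this would be the downloaded page source.
Local $sHtml = '<p><a href="http://example.com/one">One</a> and <a class="x" href="/two">Two</a></p>'

; Flag 3 makes StringRegExp return all global matches of the capture group in a 0-based array.
Local $aLinks = StringRegExp($sHtml, '(?i)<a\b[^>]*?\bhref\s*=\s*"([^"]*)"', 3)
If @error Then
    ConsoleWrite("No links found, @error = " & @error & @CRLF)
Else
    For $i = 0 To UBound($aLinks) - 1
        ConsoleWrite($aLinks[$i] & @CRLF) ; one href value per line
    Next
EndIf
```

The same approach should work for other attributes by changing the attribute name in the pattern, but anything that depends on nesting is better served by a real parser.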

SmOke_N · February 22, 2015

So essentially you want to build a browser?

If not, maybe it's time you use a browser if you want to use browser objects?

Edit:

I say this because they have already done all that work for you.

Edited February 22, 2015 by SmOke_N

Azevedo · February 25, 2015

The purpose is to automate some online tasks without using IE.

By using IE's engine I would be compromising privacy, since it keeps history and cache.

IE will also load web components (Flash, JavaScript, images), which is not what I want.

Besides, I don't want to depend on IE's interface/engine.

Edited February 25, 2015 by Azevedo

SmOke_N · February 25, 2015

Unfortunately, there's no "DOM" for AU3... although it sounds like a fun and extremely lengthy project.

I know chimp worked on a raw HTML table parser, though.

If you got a group of decent coders together for the project, I might be willing to add to the mix.

But now you know why I suggested the IE engine. There are always ways to clean up afterwards as well, but if you're doing this on client machines, your project may be too delicate. The need for a complete DOM parser eludes me at the moment.

Gianni · February 25, 2015

Hi Azevedo,

Just 3 days ago, as SmOke_N said in the previous post, I posted a UDF to parse tables from raw HTML. It relies on an internal core function that I wrote for table extraction, but it was designed to serve a more general purpose as well: extracting the portions of code that belong to specific HTML tags. Maybe it can be useful for your project too.
In short, that function returns a sort of collection of the portions of the page's code related to a specific HTML tag.

Of course it can be enhanced and refined, but it can be a starting point.
An example is worth more than many words:

#include <array.au3>
Local $sHtml = BinaryToString(InetRead("http://www.autoitscript.com")) ; get the raw source

Local $aMyTags = _ParseTags($sHtml, "<a", "</a>") ; collection of <a> tags
_ArrayDisplay($aMyTags)

$aMyTags = _ParseTags($sHtml, "<script", "</script>")
_ArrayDisplay($aMyTags)

$aMyTags = _ParseTags($sHtml, "<div", "</div>")
_ArrayDisplay($aMyTags)

$aMyTags = _ParseTags($sHtml, "<style", "</style>")
_ArrayDisplay($aMyTags)

; #FUNCTION# ====================================================================================================================
; Name ..........: _ParseTags
; Description ...: Searches for and extracts all portions of HTML code between an opening and a closing tag, inclusive.
;                  Returns an array containing a collection of <tag ...... </tag> blocks, one per element (even when nested)
; Syntax ........: _ParseTags($sHtml, $sOpening, $sClosing)
; Parameters ....: $sHtml               - A string value containing the html listing
;                  $sOpening            - A string value indicating the opening tag
;                  $sClosing            - A string value indicating the closing tag
; Return values .: success:               a 1D, 1-based array containing all the portions of HTML code representing the element
;                                         element [0] of the array (and @extended as well) contains the count of found elements
;                  failure:               an empty string; sets @error as follows:
;                                         @error:   1 - required tags are not present in the passed HTML
;                                                   2 - error while parsing tags, (opening and closing tags are not balanced)
;                                                   3 - error while parsing tags, (open/close mismatch error)
; ===============================================================================================================================
Func _ParseTags($sHtml, $sOpening, $sClosing) ; example: $sOpening = '<table', $sClosing = '</table>'
    ; find how many of these tags are on the HTML page
    StringReplace($sHtml, $sOpening, $sOpening) ; @extended now holds the number of occurrences
    Local $iNrOfThisTag = @extended
    ; The opening <tag and closing </tag> tags are assumed to be balanced (as they should be),
    ; so their counts are NOT compared up front; only the pairing is verified while parsing
    If $iNrOfThisTag Then ; if there is at least one of this tag
        ; $aThisTagsPositions array will contain the positions of the
        ; starting <tag and ending </tag> tags within the HTML
        Local $aThisTagsPositions[$iNrOfThisTag * 2 + 1][3] ; 1 based (make room for all open and close tags)
        ; 2) find in the HTML the positions of the $sOpening <tag and $sClosing </tag> tags
        For $i = 1 To $iNrOfThisTag
            $aThisTagsPositions[$i][0] = StringInStr($sHtml, $sOpening, 0, $i) ; start position of $i occurrence of <tag opening tag
            $aThisTagsPositions[$i][1] = $sOpening ; it marks which kind of tag is this
            $aThisTagsPositions[$i][2] = $i ; nr of this tag
            $aThisTagsPositions[$iNrOfThisTag + $i][0] = StringInStr($sHtml, $sClosing, 0, $i) + StringLen($sClosing) - 1 ; end position of $i^ occurrence of </tag> closing tag
            $aThisTagsPositions[$iNrOfThisTag + $i][1] = $sClosing ; it marks which kind of tag is this
        Next
        _ArraySort($aThisTagsPositions, 0, 1) ; now all opening and closing tags are in the same order as they appear in the HTML
        Local $aStack[UBound($aThisTagsPositions)][2]
        Local $aTags[Ceiling(UBound($aThisTagsPositions) / 2)] ; will contain the collection of <tag ..... </tag> from the HTML
        For $i = 1 To UBound($aThisTagsPositions) - 1
            If $aThisTagsPositions[$i][1] = $sOpening Then ; opening <tag
                $aStack[0][0] += 1 ; nr of tags in html
                $aStack[$aStack[0][0]][0] = $sOpening
                $aStack[$aStack[0][0]][1] = $i
            ElseIf $aThisTagsPositions[$i][1] = $sClosing Then ; a closing </tag> was found
                If Not $aStack[0][0] Or Not ($aStack[$aStack[0][0]][0] = $sOpening And $aThisTagsPositions[$i][1] = $sClosing) Then
                    Return SetError(3, 0, "") ; Open/Close mismatch error
                Else ; pair detected (the reciprocal tag)
                    ; now get coordinates of the 2 tags
                    ; 1) extract this tag <tag ..... </tag> from the html to the array
                    $aTags[$aThisTagsPositions[$aStack[$aStack[0][0]][1]][2]] = StringMid($sHtml, $aThisTagsPositions[$aStack[$aStack[0][0]][1]][0], 1 + $aThisTagsPositions[$i][0] - $aThisTagsPositions[$aStack[$aStack[0][0]][1]][0])
                    ; 2) remove that tag <tag ..... </tag> from the html
                    $sHtml = StringLeft($sHtml, $aThisTagsPositions[$aStack[$aStack[0][0]][1]][0] - 1) & StringMid($sHtml, $aThisTagsPositions[$i][0] + 1)
                    ; 3) adjust the references to the new positions of remaining tags
                    For $ii = $i To UBound($aThisTagsPositions) - 1
                        $aThisTagsPositions[$ii][0] -= StringLen($aTags[$aThisTagsPositions[$aStack[$aStack[0][0]][1]][2]])
                    Next
                    $aStack[0][0] -= 1 ; nr of tags still in html
                EndIf
            EndIf
        Next
        If Not $aStack[0][0] Then ; all tags were parsed correctly
            $aTags[0] = $iNrOfThisTag
            Return SetError(0, $iNrOfThisTag, $aTags) ; OK
        Else
            Return SetError(2, 0, "") ; opening and closing tags are not balanced
        EndIf
    Else
        Return SetError(1, 0, "") ; there are no such tags in this HTML page
    EndIf
EndFunc   ;==>_ParseTags

Azevedo · February 26, 2015

Thanks, chimp and SmOke_N.

Chimp's function will help me in some cases!

Thanks!
