Parse HTML string with DOM

Azevedo · February 22, 2015

Hello. Is it possible to parse an HTML string using some DOM library in AU3?

Edited February 22, 2015 by Azevedo

water · February 22, 2015

First thing that comes to my mind is the IE UDF that comes with AutoIt.

Azevedo · February 22, 2015

I'm not using IE's engine.

I'm reading the HTTP stream (the HTML code) into a string.

Probably there isn't a DOM for that, so I'll use RegEx.
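
For reference, the RegEx route could look roughly like the sketch below. It pulls the href and inner text of every link out of an HTML string with StringRegExp. The helper name _ExtractLinks and the sample HTML are made up for illustration, and the pattern assumes double-quoted href values and non-nested <a> elements, so treat it as a starting point rather than a real parser.

```autoit
; Hypothetical sample input; in practice this would be the downloaded page source.
Local $sHtml = '<p><a href="http://a.example/">First</a> and <A HREF="http://b.example/">Second</A></p>'

Local $aLinks = _ExtractLinks($sHtml)
If @error Then
    ConsoleWrite("No links found" & @CRLF)
Else
    For $i = 0 To UBound($aLinks) - 1
        ConsoleWrite($aLinks[$i][0] & " -> " & $aLinks[$i][1] & @CRLF)
    Next
EndIf

; _ExtractLinks is a hypothetical helper, not part of any standard UDF.
; Returns a 0-based 2D array: [n][0] = href value, [n][1] = inner HTML of the link.
; Sets @error = 1 when nothing matches.
Func _ExtractLinks($sHtml)
    ; (?i) = case-insensitive, (?s) = "." also matches line breaks.
    ; Assumption: href values are wrapped in double quotes.
    Local $sPattern = '(?is)<a\b[^>]*?\bhref\s*=\s*"([^"]*)"[^>]*>(.*?)</a>'
    ; Flag 3 = return all matches; the two capture groups come back as one flat list:
    ; href, text, href, text, ...
    Local $aFlat = StringRegExp($sHtml, $sPattern, 3)
    If @error Then Return SetError(1, 0, "")
    Local $iCount = Int(UBound($aFlat) / 2)
    Local $aLinks[$iCount][2]
    For $i = 0 To $iCount - 1
        $aLinks[$i][0] = $aFlat[2 * $i]
        $aLinks[$i][1] = $aFlat[2 * $i + 1]
    Next
    Return SetError(0, $iCount, $aLinks)
EndFunc   ;==>_ExtractLinks
```

This works for flat, predictable markup; nested or malformed tags are where a regular expression stops being reliable.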

SmOke_N · February 22, 2015

So essentially you want to build a browser?

If not, maybe it's time you use a browser if you want to use browser objects?

Edit:

I say this, because they have already done all that work for you.

Edited February 22, 2015 by SmOke_N

Azevedo · February 25, 2015

The purpose is to automate some online tasks without using IE.

Using IE's engine would compromise privacy, since it keeps history and cache.

IE would also load web components (Flash, JavaScript, images), which is not what I want.

Besides, I don't want to depend on IE's interface/engine.

Edited February 25, 2015 by Azevedo

SmOke_N · February 25, 2015

Unfortunately, there's no "DOM" library in AU3... although it sounds like a fun and extremely lengthy project.

I know chimp worked on a raw HTML table parser, though.

If you got a group of decent coders together for the project, I might be willing to add to the mix.

But now you know why I suggested the IE engine. There are always methods to clean up afterwards as well, but if you're doing this on client machines, your project may be too delicate, and the need for a complete DOM parser eludes me at the moment.

Gianni · February 25, 2015

Hi Azevedo,

Just 3 days ago, as SmOke_N said in the previous post, I posted a UDF to parse tables from raw HTML. It relies on an internal core function that I wrote for the table extraction, but it was also designed for a more general purpose: extracting the portions of code related to specific HTML tags. Maybe it can be useful for your project as well.
In short, that function returns a sort of collection of the portions of code in the page related to a specific HTML tag.

Sure, it can be enhanced and refined, but it can be a starting point.
An example is worth more than many words:

#include <array.au3>
Local $sHtml = BinaryToString(InetRead("http://www.autoitscript.com")) ; get the raw source

Local $aMyTags = _ParseTags($sHtml, "<a", "</a>") ; collection of <a> tags
_ArrayDisplay($aMyTags)

$aMyTags = _ParseTags($sHtml, "<script", "</script>")
_ArrayDisplay($aMyTags)

$aMyTags = _ParseTags($sHtml, "<div", "</div>")
_ArrayDisplay($aMyTags)

$aMyTags = _ParseTags($sHtml, "<style", "</style>")
_ArrayDisplay($aMyTags)

; #FUNCTION# ====================================================================================================================
; Name ..........: _ParseTags
; Description ...: Searches for and extracts all portions of HTML code between the opening and closing tags, inclusive.
;                  Returns an array containing a collection of <tag ...... </tag> blocks, one in each element (even if they are nested)
; Syntax ........: _ParseTags($sHtml, $sOpening, $sClosing)
; Parameters ....: $sHtml               - A string value containing the html listing
;                  $sOpening            - A string value indicating the opening tag
;                  $sClosing            - A string value indicating the closing tag
; Return values .: success:               a 1D, 1-based array containing all the portions of HTML code representing the element
;                                         element [0] of the array (and @extended as well) contains the count of found elements
;                  failure:               An empty string and sets @error as follows:
;                                         @error:   1 - required tags are not present in the passed HTML
;                                                   2 - error while parsing tags, (opening and closing tags are not balanced)
;                                                   3 - error while parsing tags, (open/close mismatch error)
; ===============================================================================================================================
Func _ParseTags($sHtml, $sOpening, $sClosing) ; example: $sOpening = '<table', $sClosing = '</table>'
    ; it finds how many of such tags are on the HTML page
    StringReplace($sHtml, $sOpening, $sOpening) ; @extended now holds the number of occurrences
    Local $iNrOfThisTag = @extended
    ; I assume that opening <tag and closing </tag> tags are balanced (as they should be)
    ; (no check is made up front; an imbalance surfaces later as @error 2 or 3)
    If $iNrOfThisTag Then ; if there is at least one of this tag
        ; $aThisTagsPositions array will contain the positions of the
        ; starting <tag and ending </tag> tags within the HTML
        Local $aThisTagsPositions[$iNrOfThisTag * 2 + 1][3] ; 1 based (make room for all open and close tags)
        ; 2) find in the HTML the positions of the $sOpening <tag and $sClosing </tag> tags
        For $i = 1 To $iNrOfThisTag
            $aThisTagsPositions[$i][0] = StringInStr($sHtml, $sOpening, 0, $i) ; start position of $i occurrence of <tag opening tag
            $aThisTagsPositions[$i][1] = $sOpening ; it marks which kind of tag is this
            $aThisTagsPositions[$i][2] = $i ; nr of this tag
            $aThisTagsPositions[$iNrOfThisTag + $i][0] = StringInStr($sHtml, $sClosing, 0, $i) + StringLen($sClosing) - 1 ; end position of $i^ occurrence of </tag> closing tag
            $aThisTagsPositions[$iNrOfThisTag + $i][1] = $sClosing ; it marks which kind of tag is this
        Next
        _ArraySort($aThisTagsPositions, 0, 1) ; now all opening and closing tags are in the same order as they appear in the HTML
        Local $aStack[UBound($aThisTagsPositions)][2]
        Local $aTags[Ceiling(UBound($aThisTagsPositions) / 2)] ; will contain the collection of <tag ..... </tag> from the html
        For $i = 1 To UBound($aThisTagsPositions) - 1
            If $aThisTagsPositions[$i][1] = $sOpening Then ; opening <tag
                $aStack[0][0] += 1 ; nr of tags in html
                $aStack[$aStack[0][0]][0] = $sOpening
                $aStack[$aStack[0][0]][1] = $i
            ElseIf $aThisTagsPositions[$i][1] = $sClosing Then ; a closing </tag> was found
                If Not $aStack[0][0] Or Not ($aStack[$aStack[0][0]][0] = $sOpening And $aThisTagsPositions[$i][1] = $sClosing) Then
                    Return SetError(3, 0, "") ; Open/Close mismatch error
                Else ; pair detected (the reciprocal tag)
                    ; now get coordinates of the 2 tags
                    ; 1) extract this tag <tag ..... </tag> from the html to the array
                    $aTags[$aThisTagsPositions[$aStack[$aStack[0][0]][1]][2]] = StringMid($sHtml, $aThisTagsPositions[$aStack[$aStack[0][0]][1]][0], 1 + $aThisTagsPositions[$i][0] - $aThisTagsPositions[$aStack[$aStack[0][0]][1]][0])
                    ; 2) remove that tag <tag ..... </tag> from the html
                    $sHtml = StringLeft($sHtml, $aThisTagsPositions[$aStack[$aStack[0][0]][1]][0] - 1) & StringMid($sHtml, $aThisTagsPositions[$i][0] + 1)
                    ; 3) adjust the references to the new positions of remaining tags
                    For $ii = $i To UBound($aThisTagsPositions) - 1
                        $aThisTagsPositions[$ii][0] -= StringLen($aTags[$aThisTagsPositions[$aStack[$aStack[0][0]][1]][2]])
                    Next
                    $aStack[0][0] -= 1 ; nr of tags still in html
                EndIf
            EndIf
        Next
        If Not $aStack[0][0] Then ; all tags were parsed correctly
            $aTags[0] = $iNrOfThisTag
            Return SetError(0, $iNrOfThisTag, $aTags) ; OK
        Else
            Return SetError(2, 0, "") ; opening and closing tags are not balanced
        EndIf
    Else
        Return SetError(1, 0, "") ; there are no of such tags on this HTML page
    EndIf
EndFunc   ;==>_ParseTags
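
One thing to watch when calling it with a short tag such as "<a": the opening tag is located by plain substring search, so "<a" also matches "<abbr", "<article", "<address" and so on, while "</a>" does not. On pages containing such tags, the open and close counts no longer agree and the function returns with @error set. A possible refinement, sketched here under the assumption that the tag name contains only letters and digits, is to count opening tags with a regular expression that requires whitespace, "/" or ">" right after the name. _CountOpeningTags is a hypothetical helper name, not part of the UDF above.

```autoit
; Hypothetical sample: only the two <a> tags should be counted, not <abbr>.
Local $sSample = '<a href="x">1</a><abbr>2</abbr><A>3</A>'
ConsoleWrite("opening <a> tags: " & _CountOpeningTags($sSample, "a") & @CRLF)

; _CountOpeningTags is a hypothetical helper.
; $sTagName is the bare tag name without "<" (assumed to be letters/digits only,
; so it needs no escaping inside the pattern).
Func _CountOpeningTags($sHtml, $sTagName)
    ; The lookahead (?=[\s/>]) accepts the match only when the name is followed by
    ; whitespace, "/" or ">", so "<a" no longer matches "<abbr".
    ; The returned string is discarded; only the replacement count in @extended is used.
    StringRegExpReplace($sHtml, '(?i)<' & $sTagName & '(?=[\s/>])', '')
    Return @extended
EndFunc   ;==>_CountOpeningTags
```

The same lookahead could be applied when locating the positions of the opening tags, which would need StringRegExp instead of StringInStr inside _ParseTags.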

Azevedo · February 26, 2015

Thanks chimp, smoke

chimp's function will help me in some cases!

Thanks!
