Azevedo

Parse HTML string with DOM

8 posts in this topic

#1 ·  Posted (edited)

Hello. Is it possible to parse an HTML string using some DOM library in AU3?

Edited by Azevedo




First thing that comes to my mind is the IE UDF that comes with AutoIt.




I'm not using IE's engine.

I'm getting the HTTP stream (html code) to a string.

Probably there isn't a DOM for that, so I'll use RegEx.
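For readers following along, here is a minimal sketch of what the RegEx route could look like for one narrow task, pulling `href` values out of `<a>` tags. The sample HTML string and the variable names are assumptions made up for illustration, and the pattern only handles double-quoted attributes; it is not a general HTML parser.

```autoit
; Hypothetical sample input standing in for the HTML received over HTTP
Local $sHtml = '<p><a href="https://example.com/one">One</a> and ' & _
        '<a class="x" href="https://example.com/two">Two</a></p>'

; Flag 3 returns all global matches of the capture group in a 0-based array
Local $aLinks = StringRegExp($sHtml, '(?i)<a\s[^>]*?href="([^"]*)"', 3)
If @error Then
    ConsoleWrite("No links found" & @CRLF)
Else
    For $i = 0 To UBound($aLinks) - 1
        ConsoleWrite($aLinks[$i] & @CRLF)
    Next
EndIf
```

This kind of pattern works for flat, predictable markup; nested or malformed tags are where a regular expression starts to break down.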


#4 ·  Posted (edited)

So essentially you want to build a browser?

If not, maybe it's time you use a browser if you want to use browser objects?

Edit:

I say this, because they have already done all that work for you.

Edited by SmOke_N

Common sense plays a role in the basics of understanding AutoIt... If you're lacking in that, do us all a favor, and step away from the computer.


#5 ·  Posted (edited)

The purpose is to automate some online tasks without using IE.

Using IE's engine would compromise privacy, since it keeps history and cache.

IE would also load web components (Flash, JavaScript, images), which is not what I want.

Besides, I don't want to depend on IE's interface/engine.

Edited by Azevedo


Unfortunately, there's no "DOM" library for au3... although it sounds like a fun and extremely lengthy project.

I know chimp worked on a raw HTML table parser though.

If you got a group of decent coders together for the project, I might be willing to add to the mix.

But now you know why I suggested the IE engine. There are always methods to clean up afterwards as well, but if you're doing this on client machines, your project may be too delicate, and the need for a complete DOM parser eludes me at the moment.


Hi Azevedo,

Just 3 days ago, as SmOke_N said in the previous post, I posted a UDF to parse tables from raw HTML. It relies on an internal function (the core function) that I wrote for the table extraction, but it was designed to also serve a more general purpose: extracting the portions of code related to specific HTML tags. Maybe it can be useful for your project too.
In short, that function can return a sort of collection of the portions of code in the page related to specific HTML tags.

Of course it can be enhanced and refined, but it can be a starting point.
An example is better than many words:

#include <array.au3>
Local $sHtml = BinaryToString(InetRead("http://www.autoitscript.com")) ; get the raw source

Local $aMyTags = _ParseTags($sHtml, "<a", "</a>") ; collection of <a> tags
_ArrayDisplay($aMyTags)

$aMyTags = _ParseTags($sHtml, "<script", "</script>")
_ArrayDisplay($aMyTags)

$aMyTags = _ParseTags($sHtml, "<div", "</div>")
_ArrayDisplay($aMyTags)

$aMyTags = _ParseTags($sHtml, "<style", "</style>")
_ArrayDisplay($aMyTags)

; #FUNCTION# ====================================================================================================================
; Name ..........: _ParseTags
; Description ...: Searches for and extracts all portions of HTML code between the opening and closing tags, inclusive.
;                  Returns an array containing a collection of <tag ...... </tag> strings, one per element (even if nested)
; Syntax ........: _ParseTags($sHtml, $sOpening, $sClosing)
; Parameters ....: $sHtml               - A string value containing the HTML source
;                  $sOpening            - A string value indicating the opening tag
;                  $sClosing            - A string value indicating the closing tag
; Return values .: Success:               a 1D, 1-based array containing all the portions of HTML code representing the element;
;                                         element [0] of the array (and @extended as well) contains the count of found elements
;                  Failure:               an empty string, and sets @error as follows:
;                                         @error:   1 - required tags are not present in the passed HTML
;                                                   2 - error while parsing tags (opening and closing tags are not balanced)
;                                                   3 - error while parsing tags (open/close mismatch error)
; ===============================================================================================================================
Func _ParseTags($sHtml, $sOpening, $sClosing) ; example: $sOpening = '<table', $sClosing = '</table>'
    ; count how many of these tags are in the HTML page
    StringReplace($sHtml, $sOpening, $sOpening) ; @extended receives the number of occurrences
    Local $iNrOfThisTag = @extended
    ; I assume that opening <tag and closing </tag> tags are balanced (as should be)
    ; (so NO check is made to see if they are actually balanced)
    If $iNrOfThisTag Then ; if there is at least one of this tag
        ; $aThisTagsPositions array will contain the positions of the
        ; starting <tag and ending </tag> tags within the HTML
        Local $aThisTagsPositions[$iNrOfThisTag * 2 + 1][3] ; 1 based (make room for all open and close tags)
        ; 2) find in the HTML the positions of the $sOpening <tag and $sClosing </tag> tags
        For $i = 1 To $iNrOfThisTag
            $aThisTagsPositions[$i][0] = StringInStr($sHtml, $sOpening, 0, $i) ; start position of $i occurrence of <tag opening tag
            $aThisTagsPositions[$i][1] = $sOpening ; it marks which kind of tag is this
            $aThisTagsPositions[$i][2] = $i ; nr of this tag
            $aThisTagsPositions[$iNrOfThisTag + $i][0] = StringInStr($sHtml, $sClosing, 0, $i) + StringLen($sClosing) - 1 ; end position of $i^ occurrence of </tag> closing tag
            $aThisTagsPositions[$iNrOfThisTag + $i][1] = $sClosing ; it marks which kind of tag is this
        Next
        _ArraySort($aThisTagsPositions, 0, 1) ; now all opening and closing tags are in the same sequence as they appear in the HTML
        Local $aStack[UBound($aThisTagsPositions)][2]
        Local $aTags[Ceiling(UBound($aThisTagsPositions) / 2)] ; will contain the collection of <tag ..... </tag> from the HTML
        For $i = 1 To UBound($aThisTagsPositions) - 1
            If $aThisTagsPositions[$i][1] = $sOpening Then ; opening <tag
                $aStack[0][0] += 1 ; nr of tags in html
                $aStack[$aStack[0][0]][0] = $sOpening
                $aStack[$aStack[0][0]][1] = $i
            ElseIf $aThisTagsPositions[$i][1] = $sClosing Then ; a closing </tag> was found
                If Not $aStack[0][0] Or Not ($aStack[$aStack[0][0]][0] = $sOpening And $aThisTagsPositions[$i][1] = $sClosing) Then
                    Return SetError(3, 0, "") ; Open/Close mismatch error
                Else ; pair detected (the reciprocal tag)
                    ; now get coordinates of the 2 tags
                    ; 1) extract this tag <tag ..... </tag> from the html to the array
                    $aTags[$aThisTagsPositions[$aStack[$aStack[0][0]][1]][2]] = StringMid($sHtml, $aThisTagsPositions[$aStack[$aStack[0][0]][1]][0], 1 + $aThisTagsPositions[$i][0] - $aThisTagsPositions[$aStack[$aStack[0][0]][1]][0])
                    ; 2) remove that tag <tag ..... </tag> from the html
                    $sHtml = StringLeft($sHtml, $aThisTagsPositions[$aStack[$aStack[0][0]][1]][0] - 1) & StringMid($sHtml, $aThisTagsPositions[$i][0] + 1)
                    ; 3) adjust the references to the new positions of remaining tags
                    For $ii = $i To UBound($aThisTagsPositions) - 1
                        $aThisTagsPositions[$ii][0] -= StringLen($aTags[$aThisTagsPositions[$aStack[$aStack[0][0]][1]][2]])
                    Next
                    $aStack[0][0] -= 1 ; nr of tags still in html
                EndIf
            EndIf
        Next
        If Not $aStack[0][0] Then ; all tags were parsed correctly
            $aTags[0] = $iNrOfThisTag
            Return SetError(0, $iNrOfThisTag, $aTags) ; OK
        Else
            Return SetError(2, 0, "") ; opening and closing tags are not balanced
        EndIf
    Else
        Return SetError(1, 0, "") ; there are no such tags in this HTML page
    EndIf
EndFunc   ;==>_ParseTags
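
The demo calls above don't check `@error`, so here is a small sketch of calling `_ParseTags` on a fixed string and handling the documented failure codes. The sample HTML and the variable names (`$sSample`, `$aDivs`, `$iErr`) are assumptions for illustration only, and the snippet assumes `_ParseTags` is defined in the same script.

```autoit
#include <Array.au3> ; _ParseTags relies on _ArraySort

; Hypothetical sample input containing one nested <div>
Local $sSample = '<div id="outer">A<div id="inner">B</div>C</div>'

Local $aDivs = _ParseTags($sSample, "<div", "</div>")
Local $iErr = @error ; save it before any other function call resets it
If $iErr Then
    ; 1 = tag not present, 2 = unbalanced tags, 3 = open/close mismatch
    ConsoleWrite("_ParseTags failed, @error = " & $iErr & @CRLF)
Else
    For $i = 1 To $aDivs[0] ; element [0] holds the count
        ConsoleWrite($i & ": " & $aDivs[$i] & @CRLF)
    Next
EndIf
```

Note that each matched element is cut out of the working copy of the HTML as it is extracted, so an outer element is returned without the nested elements of the same tag that it originally contained.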

small minds discuss people average minds discuss events great minds discuss ideas.... and use AutoIt....


Thanks chimp, smoke

This chimp's function will help me in some cases!

Thanks!
