Chimp

parsing tables from raw HTML

21 posts in this topic

Referring to this post from this topic, I've used that listing on a raw HTML file in order to extract the content of a table.
It works nearly well enough for my purpose, but it extracts the content of all the tables on the HTML page into a single array.

#include <Array.au3>
#include <IE.au3> ;  just for HTML extraction from the page of the table example
;
Local $oie = _IE_Example("table")
Local $sHtml = _IEBodyReadHTML($oie) ; extract whole HTML
;
Local $aResult = _TableWriteToArrayFromHTML($sHtml, 0) ; extracts table contents
;
_ArrayDisplay($aResult)

Func _TableWriteToArrayFromHTML($sHtml, $iTableNr = 0) ; second parameter should indicate which table
    Local $aRes = StringRegExp($sHtml, "(?isU)(?|<(/)tr>\s*|<t[dh].*>(.*)</t[dh]>)", 3)
    Local $aTempResult[UBound($aRes)][UBound($aRes)]
    Local $iRow = 0, $iCol = 0, $iMaxRow = 0

    For $i = 0 To UBound($aRes) - 1
        If $aRes[$i] = "/" Then
            $iRow += 1
            $iCol = 0
        Else
            $aTempResult[$iRow][$iCol] = $aRes[$i]
            $iCol += 1
            If $iCol > $iMaxRow Then $iMaxRow = $iCol
        EndIf
    Next

    ReDim $aTempResult[$iRow][$iMaxRow]
    Return $aTempResult
EndFunc   ;==>_TableWriteToArrayFromHTML

I would like to use that script (or some other approach) in a way similar to the _IETableWriteToArray() function, but on raw HTML instead of on an IE object instance.

How could it be modified to extract only one of the tables on the page, with the possibility of choosing which one?
For example, by starting the extraction at the n-th occurrence of the <table> tag and continuing up to </table>.

Any help will be appreciated.
Thanks


small minds discuss people average minds discuss events great minds discuss ideas.... and use AutoIt....




I would first extract each "<table> till </table>" block into an array; the n-th occurrence of a table is then simply its index in this array.


Thanks mikell,

I was thinking about something like that too.

Here is a possible solution using _StringBetween():

#include <Array.au3>
#include <IE.au3> ;  just for HTML extraction from the page of the table example
#include <String.au3>
;
; Local $oie = _IECreate("http://www.w3schools.com/html/html_tables.asp")
Local $oie = _IE_Example("table")
Local $sHtml = _IEBodyReadHTML($oie) ; extract whole HTML

$aTables = _StringBetween($sHtml, "<table", "</table>") ; each table goes into the array elements

$iWantedTable = 1 ; second table (zero based)

Local $aResult = _TableWriteToArrayFromHTML($aTables[$iWantedTable]) ; extracts table contents
;
_ArrayDisplay($aResult, "Table nr." & $iWantedTable)

_IEQuit($oie)

Func _TableWriteToArrayFromHTML($sHtml)
    Local $aRes = StringRegExp($sHtml, "(?isU)(?|<(/)tr>\s*|<t[dh].*>(.*)</t[dh]>)", 3)
    ; _ArrayDisplay($aRes)
    Local $aTempResult[UBound($aRes)][UBound($aRes)]
    Local $iRow = 0, $iCol = 0, $iMaxRow = 0

    For $i = 0 To UBound($aRes) - 1
        If $aRes[$i] = "/" Then
            $iRow += 1
            $iCol = 0
        Else
            $aTempResult[$iRow][$iCol] = $aRes[$i]
            $iCol += 1
            If $iCol > $iMaxRow Then $iMaxRow = $iCol
        EndIf
    Next

    ReDim $aTempResult[$iRow][$iMaxRow]
    Return $aTempResult
EndFunc   ;==>_TableWriteToArrayFromHTML

Does anyone have an alternative solution that uses a regular expression instead of _StringBetween()?

Thanks everybody




This should be exactly the same

$aTables = StringRegExp($sHtml, '(?s)<table.*?</table>', 3)

Edit

Fixed... sorry :)

Edited by mikell


 

This should be exactly the same

$aTables = StringRegExp($sHtml, '(?s)<table>.*?</table>', 3)

 

... nope, I don't get an array...

P.S. The first string is <table (without the closing >).



Edited



Edited

 

...it is still not an array...

The code below reports Int32 as the variable type:

$aTables = StringRegExp($sHtml, '(?s)<table.*?</table>', 3)
MsgBox(0, 0, VarGetType($aTables))
Edited by Chimp
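
A possible explanation, offered as an assumption rather than a diagnosis: when StringRegExp() is called with flag 3 and nothing matches, it sets @error and does not return an array, which would account for the Int32 result. If the HTML read from IE happens to contain upper-case tags such as <TABLE (older IE versions tend to upper-case tag names), the pattern above cannot match, because it is case-sensitive. The minimal sketch below uses a hypothetical sample string, adds the (?i) option, and checks @error before using the result.

```autoit
; Sketch only: the sample string is hypothetical and stands in for the page source.
Local $sHtml = '<TABLE border="1"><TR><TD>a</TD></TR></TABLE>'

; (?i) makes the match case-insensitive, (?s) lets . match line breaks as well.
Local $aTables = StringRegExp($sHtml, '(?is)<table.*?</table>', 3)
Local $iErr = @error ; save @error immediately, before any other call resets it

If $iErr Then
    ; No array is returned on failure, so do not index $aTables here.
    ConsoleWrite("No table matched, @error = " & $iErr & @CRLF)
Else
    ConsoleWrite("Tables found: " & UBound($aTables) & @CRLF)
EndIf
```

Checking IsArray($aTables) instead of @error would work as well; either way the test belongs before the first array access.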


I went from trying a regex or two to this, but keep in mind that this is not a great solution.

Nested tables, frames/iframes, and HTML strings that themselves contain regex-like data could all break these methods.

#include <Array.au3>

Global $gsStr = _myHTML()

Global $gaTables = _htmlraw_GetTables($gsStr)
_ArrayDisplay($gaTables)

Global $gaTable1Rows = _htmlraw_GetTableRows($gaTables[0])
_ArrayDisplay($gaTable1Rows)

Global $gaTable1Row1Cols = _htmlraw_GetTableCols($gaTable1Rows[0])
_ArrayDisplay($gaTable1Row1Cols)

Func _htmlraw_GetTables($sHTML) ; return an array of tables

    If Not StringLen($sHTML) Then
        Return SetError(1, 0, 0)
    EndIf

    ; some of the below pattern isn't necessary, but I code it as I think about conditions
    ; problem is with nested tables, this is not a good solution
    Local $sPatt = "(?si)<\s*table(?:\s*|\s.+?)>.*?<\s*/\s*table\s*>"
    Local $aReg = StringRegExp($sHTML, $sPatt, 3)
    If @error Then
        Return SetError(2, @error, 0)
    EndIf

    Return $aReg
EndFunc

Func _htmlraw_GetTableRows($sTable)

    ; believe it or not </tr> is not necessary
    ;  though most use it, so better look for </table> too
    ;  then there's the fun of not having nested tables
    ;  but I don't have the brain power to think through all that today, so simple it is
    Local $sPatt = "(?si)<\s*tr(?:\s*|\s.+?)>.*?<\s*/\s*tr\s*>"
    Local $aReg = StringRegExp($sTable, $sPatt, 3)
    If @error Then
        Return SetError(1, 0, 0)
    EndIf

    Return $aReg
EndFunc

Func _htmlraw_GetTableCols($sData)

    ; I've talked about nesting issues, just going to do it simple
    ; th/td
    Local $sPatt = "(?si)(?:<\s*th(?:\s*|\s.+?)>.*?<\s*/\s*th\s*>|" & _
        "<\s*td(?:\s*|\s.+?)>.*?<\s*/\s*td\s*>)+"
    Local $aReg = StringRegExp($sData, $sPatt, 3)
    If @error Then
        Return SetError(1, 0, 0)
    EndIf

    Return $aReg
EndFunc

Func _myHTML()
    Local $sHTML
    $sHTML &= "0x3C68746D6C3E0D0A3C626F64793E0D0A3C7461626C65207374796C653D"
    $sHTML &= "2277696474683A31303025223E0D0A20203C74723E0D0A202020203C7468"
    $sHTML &= "3E4E616D653A3C2F74683E0D0A202020203C74643E42696C6C2047617465"
    $sHTML &= "733C2F74643E0D0A20203C2F74723E0D0A20203C74723E0D0A202020203C"
    $sHTML &= "746820726F777370616E3D2232223E54656C6570686F6E653A3C2F74683E"
    $sHTML &= "0D0A202020203C74643E353535203737203835343C2F74643E0D0A20203C"
    $sHTML &= "2F74723E0D0A20203C74723E0D0A202020203C74643E3535352037372038"
    $sHTML &= "35353C2F74643E0D0A20203C2F74723E0D0A3C2F7461626C653E0D0A3C74"
    $sHTML &= "61626C65207374796C653D2277696474683A31303025223E0D0A20203C63"
    $sHTML &= "617074696F6E3E4D6F6E74686C7920736176696E67733C2F63617074696F"
    $sHTML &= "6E3E0D0A20203C74723E0D0A202020203C74683E4D6F6E74683C2F74683E"
    $sHTML &= "0D0A202020203C74683E536176696E67733C2F74683E0D0A20203C2F7472"
    $sHTML &= "3E0D0A20203C74723E0D0A202020203C74643E4A616E756172793C2F7464"
    $sHTML &= "3E0D0A202020203C74643E243130303C2F74643E0D0A20203C2F74723E0D"
    $sHTML &= "0A20203C74723E0D0A202020203C74643E46656272756172793C2F74643E"
    $sHTML &= "0D0A202020203C74643E2435303C2F74643E0D0A20203C2F74723E0D0A3C"
    $sHTML &= "2F7461626C653E0D0A3C7461626C65207374796C653D2277696474683A31"
    $sHTML &= "303025223E0D0A20203C74723E0D0A202020203C74683E4E616D653C2F74"
    $sHTML &= "683E0D0A202020203C746820636F6C7370616E3D2232223E54656C657068"
    $sHTML &= "6F6E653C2F74683E0D0A20203C2F74723E0D0A20203C74723E0D0A202020"
    $sHTML &= "203C74643E42696C6C2047617465733C2F74643E0D0A202020203C74643E"
    $sHTML &= "353535203737203835343C2F74643E0D0A202020203C74643E3535352037"
    $sHTML &= "37203835353C2F74643E0D0A20203C2F74723E0D0A3C2F7461626C653E0D"
    $sHTML &= "0A3C7461626C652069643D22743031223E0D0A20203C74723E0D0A202020"
    $sHTML &= "203C74683E46697273746E616D653C2F74683E0D0A202020203C74683E4C"
    $sHTML &= "6173746E616D653C2F74683E200D0A202020203C74683E506F696E74733C"
    $sHTML &= "2F74683E0D0A20203C2F74723E0D0A20203C74723E0D0A202020203C7464"
    $sHTML &= "3E4576653C2F74643E0D0A202020203C74643E4A61636B736F6E3C2F7464"
    $sHTML &= "3E200D0A202020203C74643E39343C2F74643E0D0A20203C2F74723E0D0A"
    $sHTML &= "3C2F7461626C653E0D0A3C2F626F64793E0D0A3C2F68746D6C3E"
    Return BinaryToString($sHTML)
EndFunc 
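
To illustrate the nested-table caveat mentioned above, here is a minimal sketch with a hypothetical sample string (it is not taken from the test page). A lazy pattern of the same shape as the one in _htmlraw_GetTables() stops at the first closing tag it meets, and that tag belongs to the inner table.

```autoit
; Hypothetical sample: one table nested inside another.
Local $sNested = "<table><tr><td>outer" & _
        "<table><tr><td>inner</td></tr></table>" & _
        "</td></tr></table>"

; The lazy .*? ends the match at the first </table>, i.e. the inner one.
Local $aMatch = StringRegExp($sNested, "(?is)<table.*?</table>", 3)
If Not @error Then
    For $i = 0 To UBound($aMatch) - 1
        ConsoleWrite($i & ": " & $aMatch[$i] & @CRLF)
    Next
EndIf
```

With this input the outer table comes back cut off at the inner closing tag, and the inner table is never returned as an element of its own. Matching opening and closing tags with a counter or a stack is one way around this.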

Anyway, have fun.


Common sense plays a role in the basics of understanding AutoIt... If you're lacking in that, do us all a favor, and step away from the computer.


Hi SmOke_N,
I'm back on this and trying to get some results.
I came up with a way to extract all the tables from a web page, even if they are nested.
I've also seen that my function for extracting data from a given table works quite well: it returns a 2D array containing the table's data (it does not work so well with tables that are not square, though).
Trying your two functions, I see that _htmlraw_GetTableCols returns all the cells in a 1D array, while _htmlraw_GetTableRows returns each row in a separate element of a 1D array.

I would like to merge your two functions so that they return a 2D array.

Suggestions on how to achieve this are welcome.

Here is my base code for experimenting with tables:

#include <IE.au3>
#include <String.au3>
#include <Array.au3>
;
; 1) open an HTML page containing tables (nested ones too)
;    it's a hodgepodge of tables, just for testing
Local $oie = _IECreate()
_IEDocWriteHTML($oie, MyHTML()) ; just to show the tables on the browser
Do
    Sleep(250)
Until IsObj($oie)
Local $sHtml = _IEBodyReadHTML($oie) ; extract whole raw HTML of the page
;
Local $aTables = ParseTables($sHtml) ; each table in each element of the array
;
Local $iWantedTable, $sError, $aResult
Do
    $iWantedTable = InputBox("select a table", "Please enter the nr. of the table to get data from (1 based)")
    $sError = @error
    If Not $sError Then
        $aResult = _TableWriteToArrayFromHTML($aTables[$iWantedTable]) ; extracts table contents in a 2D array
        ; $aResult = _htmlraw_GetTableRows($aTables[$iWantedTable]) ; by SmOke_N
        ; $aResult = _htmlraw_GetTableCols($aTables[$iWantedTable]) ; by SmOke_N
        $sError = @error
        _ArrayDisplay($aResult, "Content of table nr." & $iWantedTable)
    EndIf
Until $sError
;
; -----------------------------------------------------------------
; locates the <table and </table> tags in the HTML and returns a
; 1-based array containing one table per element
; -----------------------------------------------------------------
Func ParseTables($sHtml)
    ; finds how many tables are on the HTML page (tables collection)
    StringReplace($sHtml, "<table", "<table") ; @extended receives the number of occurrences
    Local $iNrOfTableTags = @extended
    ; ConsoleWrite(@CRLF & "Debug: This page contains " & $iNrOfTableTags & " tables." & @CRLF)
    ; I assume that <table and </table> tags are balanced (as should be)
    ; (so NO check is made to see if they are actually balanced)
    If $iNrOfTableTags Then ; if at least one table exists
        ; $aTableTagsPositions array will contain the positions of the
        ; starting <table and ending </table> tags within the HTML
        Local $aTableTagsPositions[$iNrOfTableTags * 2 + 1][3] ; 1 based (make room for all open and close tags)
        ; 2) find in the HTML the positions of the <table and </table> tags
        For $i = 1 To $iNrOfTableTags
            $aTableTagsPositions[$i][0] = StringInStr($sHtml, "<table", 0, $i) ; start position of $i occurrence of <table opening tag
            $aTableTagsPositions[$i][1] = "<table" ; mark tag of this location
            $aTableTagsPositions[$i][2] = $i ; nr of table
            $aTableTagsPositions[$iNrOfTableTags + $i][0] = StringInStr($sHtml, "</table>", 0, $i) + 7 ; end position of $i occurrence of </table> closing tag
            $aTableTagsPositions[$iNrOfTableTags + $i][1] = "</table>" ; mark tag of this location
        Next
        _ArraySort($aTableTagsPositions, 0, 1) ; now all opening and closing tags are in the same sequence as they appear in the HTML
        Local $aTables = ExtractTables($aTableTagsPositions, $sHtml) ; $aTables array will contain a table in each element
        If Not @error Then Return $aTables
        Return SetError(2, 0, 0)
    Else
        Return SetError(1, 0, 0) ; No tables in HTML
    EndIf
EndFunc   ;==>ParseTables

; ---------------------------------------------------
; returns an array containing a table in each element
; ---------------------------------------------------
Func ExtractTables(ByRef $aTableTagsPositions, $html)
    Local $aStack[UBound($aTableTagsPositions)][2]
    Local $aTables[Ceiling(UBound($aTableTagsPositions) / 2)] ; will contain the collection of tables
    For $i = 1 To UBound($aTableTagsPositions) - 1
        If $aTableTagsPositions[$i][1] = "<table" Then ; opening tag
            $aStack[0][0] += 1
            $aStack[$aStack[0][0]][0] = "<table"
            $aStack[$aStack[0][0]][1] = $i
        ElseIf $aTableTagsPositions[$i][1] = "</table>" Then ; a closing tag was found
            If Not $aStack[0][0] Or Not ArePair($aStack[$aStack[0][0]][0], $aTableTagsPositions[$i][1]) Then
                Return SetError(1, 0, 0) ; False ; something is not ok
            Else ; pair detected (the reciprocal tag)
                ; now get coordinates of the 2 tags
                ; 1) extract this table from the html to the array
                $aTables[$aTableTagsPositions[$aStack[$aStack[0][0]][1]][2]] = StringMid($html, $aTableTagsPositions[$aStack[$aStack[0][0]][1]][0], 1 + $aTableTagsPositions[$i][0] - $aTableTagsPositions[$aStack[$aStack[0][0]][1]][0])
                ; 2) remove that table from the html
                $html = StringLeft($html, $aTableTagsPositions[$aStack[$aStack[0][0]][1]][0] - 1) & StringMid($html, $aTableTagsPositions[$i][0] + 1)
                ; 3) adjust the references to the new positions of remaining tags
                For $ii = $i To UBound($aTableTagsPositions) - 1
                    $aTableTagsPositions[$ii][0] -= StringLen($aTables[$aTableTagsPositions[$aStack[$aStack[0][0]][1]][2]])
                Next
                $aStack[0][0] -= 1
            EndIf
        EndIf
    Next
    If Not $aStack[0][0] Then
        Return $aTables
    Else
        Return SetError(1, 0, 0)
    EndIf
EndFunc   ;==>ExtractTables

Func ArePair($sOpening, $sClosing)
    If ($sOpening = '<table' And $sClosing = '</table>') Then Return True
    Return False
EndFunc   ;==>ArePair

; ------------------------------------
; copy content of cells into the array
; ------------------------------------
Func _TableWriteToArrayFromHTML($sHtml)
    Local $aRes = StringRegExp($sHtml, "(?isU)(?|<(/)tr>\s*|<t[dh].*>(.*)</t[dh]>)", 3)
    ; _ArrayDisplay($aRes)
    Local $aTempResult[UBound($aRes)][UBound($aRes)]
    Local $iRow = 0, $iCol = 0, $iMaxRow = 0

    For $i = 0 To UBound($aRes) - 1
        If $aRes[$i] = "/" Then
            $iRow += 1
            $iCol = 0
        Else
            $aTempResult[$iRow][$iCol] = $aRes[$i]
            $iCol += 1
            If $iCol > $iMaxRow Then $iMaxRow = $iCol
        EndIf
    Next

    ReDim $aTempResult[$iRow][$iMaxRow]
    Return $aTempResult
EndFunc   ;==>_TableWriteToArrayFromHTML

Func MyHTML()
    Local $sData = '0x' & _
            '3C5441424C4520626F726465723D223122206267436F6C6F723D233030666630303E0D0A202020203C54523E0D0A20202020202020203C54443E5461626C6520' & _
            '31202872316331293C7461626C6520626F726465723D223122206267436F6C6F723D236666303030303E0D0A20203C74723E0D0A202020203C74683E5461626C' & _
            '6520322028743272316331293C2F74683E0D0A202020203C74683E5461626C65203220726F77203120436F6C756D6E20323C2F74683E0D0A202020203C74683E' & _
            '5432523143323C2F74683E0D0A20203C2F74723E0D0A20203C74723E0D0A202020203C74643E5432523243313C2F74643E0D0A202020203C74643E0D0A202020' & _
            '2020203C7461626C6520626F726465723D223122206267436F6C6F723D236666666630303E0D0A20202020202020203C74723E0D0A202020202020202020203C' & _
            '74643E5461626C652033206E6573746564207461626C6520636F6C756D6E20313C2F74643E0D0A202020202020202020203C74643E6E6573746564207461626C' & _
            '6520636F6C756D6E20323C2F74643E0D0A20202020202020203C2F74723E0D0A2020202020203C2F7461626C653E0D0A202020203C2F74643E0D0A202020203C' & _
            '74643E5432523243333C2F74643E0D0A20203C2F74723E0D0A20203C74723E0D0A202020203C74643E5432523343313C2F74643E0D0A202020203C74643E5432' & _
            '523343323C2F74643E0D0A202020203C74643E5432523343333C2F74643E0D0A20203C2F74723E0D0A3C2F7461626C653E203C2F54443E0D0A20202020202020' & _
            '203C54443E5431523143323C2F54443E0D0A20202020202020203C2F54523E0D0A202020203C54523E0D0A20202020202020203C54443E5431523243310D0A20' & _
            '202020202020202020203C7461626C6520626F726465723D31206267436F6C6F723D233939303030302020414C49474E3D43454E5445523E200D0A2020202020' & _
            '2020202020203C74723E3C74643E205461626C652034204162636465663C2F74643E3C74643E7434723163323C2F74643E3C74643E7434723163333C2F74643E' & _
            '3C74643E7434723163343C2F74643E3C74643E7434723163350D0A2020202020202020202020202020202020203C7461626C652020626F726465723D31206267' & _
            '436F6C6F723D233939393930303E0D0A2020202020202020202020202020202020203C74723E3C74643E205461626C652035204768696A6B3C2F74643E3C7464' & _
            '3E7435723163323C2F74643E3C74643E7435723163333C2F74643E3C74643E7435723163340D0A20202020202020202020202020202020202020202020202020' & _
            '3C7461626C6520626F726465723D31206267436F6C6F723D233939393939393E0D0A202020202020202020202020202020202020202020202020203C74723E3C' & _
            '74643E205461626C652036204C6D6E6F70713C2F74643E3C74643E7435723163323C2F74643E3C74643E7435723163330D0A2020202020202020202020202020' & _
            '20202020202020202020202020202020202020202020203C7461626C652020626F726465723D31206267436F6C6F723D234545303045453E203C74723E3C7464' & _
            '3E205461626C6520372052737475767778797A3C2F74643E3C74643E7437723163323C2F74643E3C74643E7437723163333C2F74643E3C2F74723E0D0A202020' & _
            '202020202020202020202020202020202020202020202020202020202020202020203C74723E3C74643E7437723263313C2F74643E3C74643E7437723263323C' & _
            '2F74643E3C74643E7437723263330D0A202020202020202020202020202020202020202020202020202020202020202020203C2F74643E3C2F74723E203C2F74' & _
            '61626C653E0D0A202020202020202020202020202020202020202020202020202020202020202020203C2F74643E3C2F74723E203C2F7461626C653E0D0A2020' & _
            '20202020202020202020202020202020202020202020202020202020202020203C2F74643E3C2F74723E203C2F7461626C653E0D0A2020202020202020202020' & _
            '202020202020202020202020202020202020202020203C2F74643E3C2F74723E203C2F7461626C653E3C2F54443E0D0A20202020202020203C54443E54315232' & _
            '43323C5441424C4520626F726465723D223122206267436F6C6F723D233030666666663E0D0A202020202020202020202020202020203C54523E0D0A20202020' & _
            '202020202020202020202020202020203C54443E5461626C6520380D0A2020202020202020202020202020202020202020202020203C5441424C4520626F7264' & _
            '65723D223122206267436F6C6F723D233030303066663E0D0A202020202020202020202020202020202020202020202020202020203C54523E0D0A2020202020' & _
            '2020202020202020202020202020202020202020202020202020203C54443E5461626C6520393C2F54443E0D0A20202020202020202020202020202020202020' & _
            '202020202020202020202020203C54443E54392052314332203C2F54443E0D0A2020202020202020202020202020202020202020202020202020202020202020' & _
            '3C2F54523E0D0A202020202020202020202020202020202020202020202020202020203C54523E0D0A2020202020202020202020202020202020202020202020' & _
            '2020202020202020203C54443E543920523243313C2F54443E0D0A20202020202020202020202020202020202020202020202020202020202020203C54443E54' & _
            '3920523243323C2F54443E0D0A20202020202020202020202020202020202020202020202020202020202020203C2F54523E0D0A202020202020202020202020' & _
            '202020202020202020202020202020203C2F5441424C453E0D0A2020202020202020202020202020202020202020202020203C2F54443E0D0A20202020202020' & _
            '202020202020202020202020203C54443E543820523143323C2F54443E0D0A20202020202020202020202020202020202020203C2F54523E0D0A202020202020' & _
            '202020202020202020203C54523E0D0A20202020202020202020202020202020202020203C54443E543820523243313C2F54443E0D0A20202020202020202020' & _
            '202020202020202020203C54443E543820523243323C2F54443E0D0A20202020202020202020202020202020202020203C2F54523E0D0A202020202020202020' & _
            '202020202020203C2F5441424C453E0D0A2020202020202020202020203C2F54443E0D0A20202020202020203C2F54523E0D0A3C54523E3C54443E5431205233' & _
            '204331202D20412073696E676C652063656C6C20726F772028576974686F75742063656C6C70616464696E67293C2F54443E3C2F54523E0D0A20203C74723E0D' & _
            '0A202020203C746420636F6C7370616E3D323E0D0A20202020202068656C6C6F2C2049276D20543152344331202873696E676C652063656C6C20576974682063' & _
            '656C6C70616464696E673D32290D0A3C7461626C6520626F726465723D332063656C6C70616464696E673D3520414C49474E3D4C454654206267436F6C6F723D' & _
            '233636363630303E0D0A20203C74723E0D0A202020203C746420636F6C7370616E3D323E0D0A2020202020205461626C6520313020524F573120434F4C554D4E' & _
            '310D0A202020203C2F74643E0D0A20203C2F74723E3C74723E0D0A202020203C74643E0D0A202020202020436F6E74656E742066726F6D20543130523243310D' & _
            '0A202020203C2F74643E3C74643E0D0A202020202020436F6E74656E742066726F6D20543130523243320D0A202020203C2F74643E0D0A20203C2F74723E3C74' & _
            '723E0D0A202020203C74643E0D0A202020202020436F6E74656E742066726F6D20543130523343310D0A202020203C2F74643E3C74643E0D0A20202020202043' & _
            '6F6E74656E742066726F6D20543130523343320D0A202020203C2F74643E0D0A20203C2F74723E0D0A3C2F7461626C653E0D0A3C7461626C6520626F72646572' & _
            '3D332063656C6C70616464696E673D313020414C49474E3D43454E544552206267436F6C6F723D233939393939393E0D0A3C74723E0D0A20203C74642076616C' & _
            '69676E3D746F703E0D0A2020202054616220313120723163310D0A20203C2F74643E3C74643E0D0A2020202054616220313120723163323C703E0D0A0909090D' & _
            '0A202020203C7461626C6520626F726465723D31206267436F6C6F723D233030393939393E0D0A202020203C74723E0D0A2020202020203C74643E5431325231' & _
            '43313C2F74643E0D0A2020202020203C74643E543132523143323C2F74643E0D0A202020203C2F74723E3C74723E0D0A2020202020203C74643E543132523243' & _
            '313C2F74643E0D0A2020202020203C74643E543132523243323C2F74643E0D0A202020203C2F74723E0D0A202020203C2F7461626C653E3C703E0D0A0909090D' & _
            '0A2020202054616220313120723163320D0A20203C2F74643E0D0A3C2F74723E0D0A3C2F7461626C653E0D0A3C7461626C6520626F726465723D3320414C4947' & _
            '4E3D5249474854206267436F6C6F723D233939303039393E0D0A20203C74723E0D0A202020203C746420726F777370616E3D333E0D0A20202020202054414231' & _
            '3320433120726F777370616E3D330D0A202020203C2F74643E3C74643E0D0A202020202020543133523143320D0A202020203C2F74643E0D0A20203C2F74723E' & _
            '3C74723E0D0A202020203C74643E0D0A202020202020543133523243320D0A202020203C2F74643E0D0A20203C2F74723E0D0A20203C74723E0D0A202020203C' & _
            '74643E0D0A202020202020543133523343320D0A202020203C2F74643E0D0A20203C2F74723E0D0A3C2F7461626C653E0D0A202020203C2F74643E0D0A20203C' & _
            '2F74723E3C74723E0D0A202020203C74643E0D0A2020202020205461626C653120726F773520636F6C756D6E310D0A202020203C2F74643E3C74643E0D0A2020' & _
            '202020205461626C653120726F773520636F6C756D6E320D0A202020203C2F74643E0D0A20203C2F74723E3C74723E0D0A202020203C74643E0D0A2020202020' & _
            '205461626C653120726F773620636F6C756D6E310D0A202020203C2F74643E3C74643E0D0A2020202020205461626C653120726F773620636F6C756D6E320D0A' & _
            '202020203C2F74643E0D0A20203C2F74723E0D0A202020203C2F5441424C453E'
    Return BinaryToString($sData)
EndFunc   ;==>MyHTML

; ------------------------------------
; following functions are from SmOke_N
; ------------------------------------

Func _htmlraw_GetTables($sHtml) ; return an array of tables

    If Not StringLen($sHtml) Then
        Return SetError(1, 0, 0)
    EndIf

    ; some of the below pattern isn't necessary, but I code it as I think about conditions
    ; problem is with nested tables, this is not a good solution
    Local $sPatt = "(?si)<\s*table(?:\s*|\s.+?)>.*?<\s*/\s*table\s*>"
    Local $aReg = StringRegExp($sHtml, $sPatt, 3)
    If @error Then
        Return SetError(2, @error, 0)
    EndIf

    Return $aReg
EndFunc   ;==>_htmlraw_GetTables


Func _htmlraw_GetTableRows($sTable)
    ; believe it or not </tr> is not necessary
    ;  though most use it, so better look for </table> too
    ;  then there's the fun of not having nested tables
    ;  but I don't have the brain power to think through all that today, so simple it is
    Local $sPatt = "(?si)<\s*tr(?:\s*|\s.+?)>.*?<\s*/\s*tr\s*>"
    Local $aReg = StringRegExp($sTable, $sPatt, 3)
    If @error Then
        Return SetError(1, 0, 0)
    EndIf

    Return $aReg
EndFunc   ;==>_htmlraw_GetTableRows

Func _htmlraw_GetTableCols($sData)

    ; I've talked about nesting issues, just going to do it simple
    ; th/td
    Local $sPatt = "(?si)(?:<\s*th(?:\s*|\s.+?)>.*?<\s*/\s*th\s*>|" & _
            "<\s*td(?:\s*|\s.+?)>.*?<\s*/\s*td\s*>)+"
    Local $aReg = StringRegExp($sData, $sPatt, 3)
    If @error Then
        Return SetError(1, 0, 0)
    EndIf

    Return $aReg
EndFunc   ;==>_htmlraw_GetTableCols


Do you mean like this:

#include <Array.au3>

Global $gsStr = _myHTML()

Global $gaTables = _htmlraw_GetTables($gsStr)
;~ _ArrayDisplay($gaTables)

Global $gaTableData = _htmlraw_TableToArray($gaTables[0])
_ArrayDisplay($gaTableData)

Func _htmlraw_TableToArray($sTable)

    If Not StringLen($sTable) Then
        Return SetError(1, 0, 0)
    EndIf

    Local $aRows = _htmlraw_GetTableRows($sTable)
    If Not IsArray($aRows) Then
        Return SetError(2, 0, 0)
    EndIf

    Local $iUBRow = UBound($aRows)
    Local $aRet[$iUBRow][1]

    Local $aCols, $iEnum = 0, $iUBCol

    For $i = 0 To $iUBRow - 1
        $aCols = _htmlraw_GetTableCols($aRows[$i])
        $iUBCol = UBound($aCols)
        If Not $iUBCol Then ContinueLoop
        If $iUBCol > UBound($aRet, 2) Then
            ReDim $aRet[$iUBRow][$iUBCol]
        EndIf
        For $j = 0 To $iUBCol - 1
            $aRet[$iEnum][$j] = $aCols[$j]
        Next
        $iEnum += 1
    Next

    Return $aRet
EndFunc

Func _htmlraw_GetTables($sHTML) ; return an array of tables

    If Not StringLen($sHTML) Then
        Return SetError(1, 0, 0)
    EndIf

    ; some of the below pattern isn't necessary, but I code it as I think about conditions
    ; problem is with nested tables, this is not a good solution
    Local $sPatt = "(?si)<\s*table(?:\s*|\s.+?)>.*?<\s*/\s*table\s*>"
    Local $aReg = StringRegExp($sHTML, $sPatt, 3)
    If @error Then
        Return SetError(2, @error, 0)
    EndIf

    Return $aReg
EndFunc

Func _htmlraw_GetTableRows($sTable)

    ; believe it or not </tr> is not necessary
    ;  though most use it, so better look for </table> too
    ;  then there's the fun of not having nested tables
    ;  but I don't have the brain power to think through all that today, so simple it is
    Local $sPatt = "(?si)<\s*tr(?:\s*|\s.+?)>.*?<\s*/\s*tr\s*>"
    Local $aReg = StringRegExp($sTable, $sPatt, 3)
    If @error Then
        Return SetError(1, 0, 0)
    EndIf

    Return $aReg
EndFunc

Func _htmlraw_GetTableCols($sData)

    ; I've talked about nesting issues, just going to do it simple
    ; th/td
    Local $sPatt = "(?si)(?:<\s*th(?:\s*|\s.+?)>.*?<\s*/\s*th\s*>|" & _
        "<\s*td(?:\s*|\s.+?)>.*?<\s*/\s*td\s*>)+"
    Local $aReg = StringRegExp($sData, $sPatt, 3)
    If @error Then
        Return SetError(1, 0, 0)
    EndIf

    Return $aReg
EndFunc

Func _myHTML()
    Local $sHTML
    $sHTML &= "0x3C68746D6C3E0D0A3C626F64793E0D0A3C7461626C65207374796C653D"
    $sHTML &= "2277696474683A31303025223E0D0A20203C74723E0D0A202020203C7468"
    $sHTML &= "3E4E616D653A3C2F74683E0D0A202020203C74643E42696C6C2047617465"
    $sHTML &= "733C2F74643E0D0A20203C2F74723E0D0A20203C74723E0D0A202020203C"
    $sHTML &= "746820726F777370616E3D2232223E54656C6570686F6E653A3C2F74683E"
    $sHTML &= "0D0A202020203C74643E353535203737203835343C2F74643E0D0A20203C"
    $sHTML &= "2F74723E0D0A20203C74723E0D0A202020203C74643E3535352037372038"
    $sHTML &= "35353C2F74643E0D0A20203C2F74723E0D0A3C2F7461626C653E0D0A3C74"
    $sHTML &= "61626C65207374796C653D2277696474683A31303025223E0D0A20203C63"
    $sHTML &= "617074696F6E3E4D6F6E74686C7920736176696E67733C2F63617074696F"
    $sHTML &= "6E3E0D0A20203C74723E0D0A202020203C74683E4D6F6E74683C2F74683E"
    $sHTML &= "0D0A202020203C74683E536176696E67733C2F74683E0D0A20203C2F7472"
    $sHTML &= "3E0D0A20203C74723E0D0A202020203C74643E4A616E756172793C2F7464"
    $sHTML &= "3E0D0A202020203C74643E243130303C2F74643E0D0A20203C2F74723E0D"
    $sHTML &= "0A20203C74723E0D0A202020203C74643E46656272756172793C2F74643E"
    $sHTML &= "0D0A202020203C74643E2435303C2F74643E0D0A20203C2F74723E0D0A3C"
    $sHTML &= "2F7461626C653E0D0A3C7461626C65207374796C653D2277696474683A31"
    $sHTML &= "303025223E0D0A20203C74723E0D0A202020203C74683E4E616D653C2F74"
    $sHTML &= "683E0D0A202020203C746820636F6C7370616E3D2232223E54656C657068"
    $sHTML &= "6F6E653C2F74683E0D0A20203C2F74723E0D0A20203C74723E0D0A202020"
    $sHTML &= "203C74643E42696C6C2047617465733C2F74643E0D0A202020203C74643E"
    $sHTML &= "353535203737203835343C2F74643E0D0A202020203C74643E3535352037"
    $sHTML &= "37203835353C2F74643E0D0A20203C2F74723E0D0A3C2F7461626C653E0D"
    $sHTML &= "0A3C7461626C652069643D22743031223E0D0A20203C74723E0D0A202020"
    $sHTML &= "203C74683E46697273746E616D653C2F74683E0D0A202020203C74683E4C"
    $sHTML &= "6173746E616D653C2F74683E200D0A202020203C74683E506F696E74733C"
    $sHTML &= "2F74683E0D0A20203C2F74723E0D0A20203C74723E0D0A202020203C7464"
    $sHTML &= "3E4576653C2F74643E0D0A202020203C74643E4A61636B736F6E3C2F7464"
    $sHTML &= "3E200D0A202020203C74643E39343C2F74643E0D0A20203C2F74723E0D0A"
    $sHTML &= "3C2F7461626C653E0D0A3C2F626F64793E0D0A3C2F68746D6C3E"
    Return BinaryToString($sHTML)
EndFunc

?



Hi SmOke_N, thanks for the reply.
Yes, your _htmlraw_TableToArray() function extracts the data into a 2D array (could the HTML tags be removed?),
but it has the same issue as my function: if you extract table 13, for example, rows 2 and 3 of column 2 end up in column 1 of the array.
I am including the code above again, now with your new function, to show the issue on table 13.

Thanks
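
On the two points raised here, a rough sketch under stated assumptions. Both helper names, _CellText and _CellSpan, are hypothetical and not part of any UDF. The first strips tags from a cell. The second reads rowspan or colspan from the cell's opening tag, which is the information a parser would need in order to skip the array positions already occupied by a spanned cell. The column shift on table 13 most likely comes from its first cell having rowspan=3: the following rows contain only one <td> each, and that cell is written to column 1 because nothing records that column 1 is already taken.

```autoit
; Both helper names below are hypothetical, for illustration only.

; Strip every tag from a cell and trim the surrounding whitespace.
Func _CellText($sCell)
    Local $sText = StringRegExpReplace($sCell, "(?s)<[^>]*>", "")
    Return StringStripWS($sText, 3) ; 3 = strip leading (1) + trailing (2) whitespace
EndFunc   ;==>_CellText

; Read a numeric attribute such as rowspan or colspan from the opening
; <td>/<th> tag of a cell; returns 1 when the attribute is absent.
Func _CellSpan($sCell, $sAttr)
    Local $aSpan = StringRegExp($sCell, "(?i)<t[dh]\b[^>]*?\b" & $sAttr & "\s*=\s*[""']?(\d+)", 1)
    If @error Then Return 1
    Return Int($aSpan[0])
EndFunc   ;==>_CellSpan

; Sample cell modelled on the first cell of table 13 in the test page.
Local $sCell = '<td rowspan=3>' & @CRLF & '      TAB13 C1 rowspan=3' & @CRLF & '    </td>'
ConsoleWrite(_CellText($sCell) & @CRLF)
ConsoleWrite(_CellSpan($sCell, "rowspan") & @CRLF)
```

This only extracts the span values; filling the 2D array correctly would still require tracking which positions earlier rows have reserved, and that step is not shown here.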

#include <IE.au3>
#include <String.au3>
#include <Array.au3>
;
; 1) open an HTML page containing tables (nested ones too)
;    it's a hodgepodge of tables, just for testing
Local $oie = _IECreate()
_IEDocWriteHTML($oie, MyHTML()) ; just to show the tables on the browser
Do
    Sleep(250)
Until IsObj($oie)
Local $sHtml = _IEBodyReadHTML($oie) ; extract whole raw HTML of the page
;
Local $aTables = ParseTables($sHtml) ; each table in each element of the array
;
Local $iWantedTable, $sError, $aResult
Do
    $iWantedTable = InputBox("select a table", "Please enter the nr. of the table to get data from (1 based)")
    $sError = @error
    If Not $sError Then
        ; $aResult = _TableWriteToArrayFromHTML($aTables[$iWantedTable]) ; extracts table contents in a 2D array
        ; $aResult = _htmlraw_GetTableRows($aTables[$iWantedTable]) ; by SmOke_N
        ; $aResult = _htmlraw_GetTableCols($aTables[$iWantedTable]) ; by SmOke_N
        $aResult = _htmlraw_TableToArray($aTables[$iWantedTable])
        $sError = @error
        _ArrayDisplay($aResult, "Content of table nr." & $iWantedTable)
    EndIf
Until $sError
;
; -----------------------------------------------------------------
; locates the <table and </table> tags, returns an array of tables
; -----------------------------------------------------------------
Func ParseTables($sHtml)
    ; finds how many tables are on the HTML page (tables collection)
    StringReplace($sHtml, "<table", "<table") ; @extended now holds the number of occurrences
    Local $iNrOfTableTags = @extended
    ; ConsoleWrite(@CRLF & "Debug: This page contains " & $iNrOfTableTags & " tables." & @CRLF)
    ; I assume that <table and </table> tags are balanced (as should be)
    ; (so NO check is made to see if they are actually balanced)
    If $iNrOfTableTags Then ; if at least one table exists
        ; $aTableTagsPositions array will contain the positions of the
        ; starting <table and ending </table> tags within the HTML
        Local $aTableTagsPositions[$iNrOfTableTags * 2 + 1][3] ; 1 based (make room for all open and close tags)
        ; 2) find in the HTML the positions of the <table and </table> tags
        For $i = 1 To $iNrOfTableTags
            $aTableTagsPositions[$i][0] = StringInStr($sHtml, "<table", 0, $i) ; start position of $i occurrence of <table opening tag
            $aTableTagsPositions[$i][1] = "<table" ; mark tag of this location
            $aTableTagsPositions[$i][2] = $i ; nr of table
            $aTableTagsPositions[$iNrOfTableTags + $i][0] = StringInStr($sHtml, "</table>", 0, $i) + 7 ; end position of $i occurrence of </table> closing tag
            $aTableTagsPositions[$iNrOfTableTags + $i][1] = "</table>" ; mark tag of this location
        Next
        _ArraySort($aTableTagsPositions, 0, 1) ; now all opening and closing tags are in the same sequence as they appear in the HTML
        Local $aTables = ExtractTables($aTableTagsPositions, $sHtml) ; $aTables array will contain one table in each element
        If Not @error Then Return $aTables
        Return SetError(2, 0, 0)
    Else
        Return SetError(1, 0, 0) ; No tables in HTML
    EndIf
EndFunc   ;==>ParseTables

; ---------------------------------------------------
; returns an array containing a table in each element
; ---------------------------------------------------
Func ExtractTables(ByRef $aTableTagsPositions, $html)
    Local $aStack[UBound($aTableTagsPositions)][2]
    Local $aTables[Ceiling(UBound($aTableTagsPositions) / 2)] ; will contain the collection of tables
    For $i = 1 To UBound($aTableTagsPositions) - 1
        If $aTableTagsPositions[$i][1] = "<table" Then ; opening tag
            $aStack[0][0] += 1
            $aStack[$aStack[0][0]][0] = "<table"
            $aStack[$aStack[0][0]][1] = $i
        ElseIf $aTableTagsPositions[$i][1] = "</table>" Then ; a closing tag was found
            If Not $aStack[0][0] Or Not ArePair($aStack[$aStack[0][0]][0], $aTableTagsPositions[$i][1]) Then
                Return SetError(1, 0, 0) ; False ; something is not ok
            Else ; pair detected (the reciprocal tag)
                ; now get coordinates of the 2 tags
                ; 1) extract this table from the html to the array
                $aTables[$aTableTagsPositions[$aStack[$aStack[0][0]][1]][2]] = StringMid($html, $aTableTagsPositions[$aStack[$aStack[0][0]][1]][0], 1 + $aTableTagsPositions[$i][0] - $aTableTagsPositions[$aStack[$aStack[0][0]][1]][0])
                ; 2) remove that table from the html
                $html = StringLeft($html, $aTableTagsPositions[$aStack[$aStack[0][0]][1]][0] - 1) & StringMid($html, $aTableTagsPositions[$i][0] + 1)
                ; 3) adjust the references to the new positions of remaining tags
                For $ii = $i To UBound($aTableTagsPositions) - 1
                    $aTableTagsPositions[$ii][0] -= StringLen($aTables[$aTableTagsPositions[$aStack[$aStack[0][0]][1]][2]])
                Next
                $aStack[0][0] -= 1
            EndIf
        EndIf
    Next
    If Not $aStack[0][0] Then
        Return $aTables
    Else
        Return SetError(1, 0, 0)
    EndIf
EndFunc   ;==>ExtractTables

Func ArePair($sOpening, $sClosing)
    If ($sOpening = '<table' And $sClosing = '</table>') Then Return True
    Return False
EndFunc   ;==>ArePair

; ------------------------------------
; copy content of cells into the array
; ------------------------------------
Func _TableWriteToArrayFromHTML($sHtml)
    Local $aRes = StringRegExp($sHtml, "(?isU)(?|<(/)tr>\s*|<t[dh].*>(.*)</t[dh]>)", 3)
    ; _ArrayDisplay($aRes)
    Local $aTempResult[UBound($aRes)][UBound($aRes)]
    Local $iRow = 0, $iCol = 0, $iMaxRow = 0

    For $i = 0 To UBound($aRes) - 1
        If $aRes[$i] = "/" Then
            $iRow += 1
            $iCol = 0
        Else
            $aTempResult[$iRow][$iCol] = $aRes[$i]
            $iCol += 1
            If $iCol > $iMaxRow Then $iMaxRow = $iCol
        EndIf
    Next

    ReDim $aTempResult[$iRow][$iMaxRow]
    Return $aTempResult
EndFunc   ;==>_TableWriteToArrayFromHTML

Func MyHTML()
    Local $sData = '0x' & _
            '3C5441424C4520626F726465723D223122206267436F6C6F723D233030666630303E0D0A202020203C54523E0D0A20202020202020203C54443E5461626C6520' & _
            '31202872316331293C7461626C6520626F726465723D223122206267436F6C6F723D236666303030303E0D0A20203C74723E0D0A202020203C74683E5461626C' & _
            '6520322028743272316331293C2F74683E0D0A202020203C74683E5461626C65203220726F77203120436F6C756D6E20323C2F74683E0D0A202020203C74683E' & _
            '5432523143323C2F74683E0D0A20203C2F74723E0D0A20203C74723E0D0A202020203C74643E5432523243313C2F74643E0D0A202020203C74643E0D0A202020' & _
            '2020203C7461626C6520626F726465723D223122206267436F6C6F723D236666666630303E0D0A20202020202020203C74723E0D0A202020202020202020203C' & _
            '74643E5461626C652033206E6573746564207461626C6520636F6C756D6E20313C2F74643E0D0A202020202020202020203C74643E6E6573746564207461626C' & _
            '6520636F6C756D6E20323C2F74643E0D0A20202020202020203C2F74723E0D0A2020202020203C2F7461626C653E0D0A202020203C2F74643E0D0A202020203C' & _
            '74643E5432523243333C2F74643E0D0A20203C2F74723E0D0A20203C74723E0D0A202020203C74643E5432523343313C2F74643E0D0A202020203C74643E5432' & _
            '523343323C2F74643E0D0A202020203C74643E5432523343333C2F74643E0D0A20203C2F74723E0D0A3C2F7461626C653E203C2F54443E0D0A20202020202020' & _
            '203C54443E5431523143323C2F54443E0D0A20202020202020203C2F54523E0D0A202020203C54523E0D0A20202020202020203C54443E5431523243310D0A20' & _
            '202020202020202020203C7461626C6520626F726465723D31206267436F6C6F723D233939303030302020414C49474E3D43454E5445523E200D0A2020202020' & _
            '2020202020203C74723E3C74643E205461626C652034204162636465663C2F74643E3C74643E7434723163323C2F74643E3C74643E7434723163333C2F74643E' & _
            '3C74643E7434723163343C2F74643E3C74643E7434723163350D0A2020202020202020202020202020202020203C7461626C652020626F726465723D31206267' & _
            '436F6C6F723D233939393930303E0D0A2020202020202020202020202020202020203C74723E3C74643E205461626C652035204768696A6B3C2F74643E3C7464' & _
            '3E7435723163323C2F74643E3C74643E7435723163333C2F74643E3C74643E7435723163340D0A20202020202020202020202020202020202020202020202020' & _
            '3C7461626C6520626F726465723D31206267436F6C6F723D233939393939393E0D0A202020202020202020202020202020202020202020202020203C74723E3C' & _
            '74643E205461626C652036204C6D6E6F70713C2F74643E3C74643E7435723163323C2F74643E3C74643E7435723163330D0A2020202020202020202020202020' & _
            '20202020202020202020202020202020202020202020203C7461626C652020626F726465723D31206267436F6C6F723D234545303045453E203C74723E3C7464' & _
            '3E205461626C6520372052737475767778797A3C2F74643E3C74643E7437723163323C2F74643E3C74643E7437723163333C2F74643E3C2F74723E0D0A202020' & _
            '202020202020202020202020202020202020202020202020202020202020202020203C74723E3C74643E7437723263313C2F74643E3C74643E7437723263323C' & _
            '2F74643E3C74643E7437723263330D0A202020202020202020202020202020202020202020202020202020202020202020203C2F74643E3C2F74723E203C2F74' & _
            '61626C653E0D0A202020202020202020202020202020202020202020202020202020202020202020203C2F74643E3C2F74723E203C2F7461626C653E0D0A2020' & _
            '20202020202020202020202020202020202020202020202020202020202020203C2F74643E3C2F74723E203C2F7461626C653E0D0A2020202020202020202020' & _
            '202020202020202020202020202020202020202020203C2F74643E3C2F74723E203C2F7461626C653E3C2F54443E0D0A20202020202020203C54443E54315232' & _
            '43323C5441424C4520626F726465723D223122206267436F6C6F723D233030666666663E0D0A202020202020202020202020202020203C54523E0D0A20202020' & _
            '202020202020202020202020202020203C54443E5461626C6520380D0A2020202020202020202020202020202020202020202020203C5441424C4520626F7264' & _
            '65723D223122206267436F6C6F723D233030303066663E0D0A202020202020202020202020202020202020202020202020202020203C54523E0D0A2020202020' & _
            '2020202020202020202020202020202020202020202020202020203C54443E5461626C6520393C2F54443E0D0A20202020202020202020202020202020202020' & _
            '202020202020202020202020203C54443E54392052314332203C2F54443E0D0A2020202020202020202020202020202020202020202020202020202020202020' & _
            '3C2F54523E0D0A202020202020202020202020202020202020202020202020202020203C54523E0D0A2020202020202020202020202020202020202020202020' & _
            '2020202020202020203C54443E543920523243313C2F54443E0D0A20202020202020202020202020202020202020202020202020202020202020203C54443E54' & _
            '3920523243323C2F54443E0D0A20202020202020202020202020202020202020202020202020202020202020203C2F54523E0D0A202020202020202020202020' & _
            '202020202020202020202020202020203C2F5441424C453E0D0A2020202020202020202020202020202020202020202020203C2F54443E0D0A20202020202020' & _
            '202020202020202020202020203C54443E543820523143323C2F54443E0D0A20202020202020202020202020202020202020203C2F54523E0D0A202020202020' & _
            '202020202020202020203C54523E0D0A20202020202020202020202020202020202020203C54443E543820523243313C2F54443E0D0A20202020202020202020' & _
            '202020202020202020203C54443E543820523243323C2F54443E0D0A20202020202020202020202020202020202020203C2F54523E0D0A202020202020202020' & _
            '202020202020203C2F5441424C453E0D0A2020202020202020202020203C2F54443E0D0A20202020202020203C2F54523E0D0A3C54523E3C54443E5431205233' & _
            '204331202D20412073696E676C652063656C6C20726F772028576974686F75742063656C6C70616464696E67293C2F54443E3C2F54523E0D0A20203C74723E0D' & _
            '0A202020203C746420636F6C7370616E3D323E0D0A20202020202068656C6C6F2C2049276D20543152344331202873696E676C652063656C6C20576974682063' & _
            '656C6C70616464696E673D32290D0A3C7461626C6520626F726465723D332063656C6C70616464696E673D3520414C49474E3D4C454654206267436F6C6F723D' & _
            '233636363630303E0D0A20203C74723E0D0A202020203C746420636F6C7370616E3D323E0D0A2020202020205461626C6520313020524F573120434F4C554D4E' & _
            '310D0A202020203C2F74643E0D0A20203C2F74723E3C74723E0D0A202020203C74643E0D0A202020202020436F6E74656E742066726F6D20543130523243310D' & _
            '0A202020203C2F74643E3C74643E0D0A202020202020436F6E74656E742066726F6D20543130523243320D0A202020203C2F74643E0D0A20203C2F74723E3C74' & _
            '723E0D0A202020203C74643E0D0A202020202020436F6E74656E742066726F6D20543130523343310D0A202020203C2F74643E3C74643E0D0A20202020202043' & _
            '6F6E74656E742066726F6D20543130523343320D0A202020203C2F74643E0D0A20203C2F74723E0D0A3C2F7461626C653E0D0A3C7461626C6520626F72646572' & _
            '3D332063656C6C70616464696E673D313020414C49474E3D43454E544552206267436F6C6F723D233939393939393E0D0A3C74723E0D0A20203C74642076616C' & _
            '69676E3D746F703E0D0A2020202054616220313120723163310D0A20203C2F74643E3C74643E0D0A2020202054616220313120723163323C703E0D0A0909090D' & _
            '0A202020203C7461626C6520626F726465723D31206267436F6C6F723D233030393939393E0D0A202020203C74723E0D0A2020202020203C74643E5431325231' & _
            '43313C2F74643E0D0A2020202020203C74643E543132523143323C2F74643E0D0A202020203C2F74723E3C74723E0D0A2020202020203C74643E543132523243' & _
            '313C2F74643E0D0A2020202020203C74643E543132523243323C2F74643E0D0A202020203C2F74723E0D0A202020203C2F7461626C653E3C703E0D0A0909090D' & _
            '0A2020202054616220313120723163320D0A20203C2F74643E0D0A3C2F74723E0D0A3C2F7461626C653E0D0A3C7461626C6520626F726465723D3320414C4947' & _
            '4E3D5249474854206267436F6C6F723D233939303039393E0D0A20203C74723E0D0A202020203C746420726F777370616E3D333E0D0A20202020202054414231' & _
            '3320433120726F777370616E3D330D0A202020203C2F74643E3C74643E0D0A202020202020543133523143320D0A202020203C2F74643E0D0A20203C2F74723E' & _
            '3C74723E0D0A202020203C74643E0D0A202020202020543133523243320D0A202020203C2F74643E0D0A20203C2F74723E0D0A20203C74723E0D0A202020203C' & _
            '74643E0D0A202020202020543133523343320D0A202020203C2F74643E0D0A20203C2F74723E0D0A3C2F7461626C653E0D0A202020203C2F74643E0D0A20203C' & _
            '2F74723E3C74723E0D0A202020203C74643E0D0A2020202020205461626C653120726F773520636F6C756D6E310D0A202020203C2F74643E3C74643E0D0A2020' & _
            '202020205461626C653120726F773520636F6C756D6E320D0A202020203C2F74643E0D0A20203C2F74723E3C74723E0D0A202020203C74643E0D0A2020202020' & _
            '205461626C653120726F773620636F6C756D6E310D0A202020203C2F74643E3C74643E0D0A2020202020205461626C653120726F773620636F6C756D6E320D0A' & _
            '202020203C2F74643E0D0A20203C2F74723E0D0A202020203C2F5441424C453E'
    Return BinaryToString($sData)
EndFunc   ;==>MyHTML

; ------------------------------------
; following functions are from SmOke_N
; ------------------------------------

Func _htmlraw_GetTables($sHtml) ; return an array of tables

    If Not StringLen($sHtml) Then
        Return SetError(1, 0, 0)
    EndIf

    ; some of the below pattern isn't necessary, but I code it as I think about conditions
    ; problem is with nested tables, this is not a good solution
    Local $sPatt = "(?si)<\s*table(?:\s*|\s.+?)>.*?<\s*/\s*table\s*>"
    Local $aReg = StringRegExp($sHtml, $sPatt, 3)
    If @error Then
        Return SetError(2, @error, 0)
    EndIf

    Return $aReg
EndFunc   ;==>_htmlraw_GetTables


Func _htmlraw_GetTableRows($sTable)
    ; believe it or not </tr> is not necessary
    ;  though most use it, so better look for </table> too
    ;  then there's the fun of not having nested tables
    ;  but I don't have the brain power to think through all that today, so simple it is
    Local $sPatt = "(?si)<\s*tr(?:\s*|\s.+?)>.*?<\s*/\s*tr\s*>"
    Local $aReg = StringRegExp($sTable, $sPatt, 3)
    If @error Then
        Return SetError(1, 0, 0)
    EndIf

    Return $aReg
EndFunc   ;==>_htmlraw_GetTableRows

Func _htmlraw_GetTableCols($sData)

    ; I've talked about nesting issues, just going to do it simple
    ; th/td
    Local $sPatt = "(?si)(?:<\s*th(?:\s*|\s.+?)>.*?<\s*/\s*th\s*>|" & _
            "<\s*td(?:\s*|\s.+?)>.*?<\s*/\s*td\s*>)+"
    Local $aReg = StringRegExp($sData, $sPatt, 3)
    If @error Then
        Return SetError(1, 0, 0)
    EndIf

    Return $aReg
EndFunc   ;==>_htmlraw_GetTableCols

Func _htmlraw_TableToArray($sTable)

    If Not StringLen($sTable) Then
        Return SetError(1, 0, 0)
    EndIf

    Local $aRows = _htmlraw_GetTableRows($sTable)
    If Not IsArray($aRows) Then
        Return SetError(2, 0, 0)
    EndIf

    Local $iUBRow = UBound($aRows)
    Local $aRet[$iUBRow][1]

    Local $aCols, $iEnum = 0, $iUBCol

    For $i = 0 To $iUBRow - 1
        $aCols = _htmlraw_GetTableCols($aRows[$i])
        $iUBCol = UBound($aCols)
        If Not $iUBCol Then ContinueLoop
        If $iUBCol > UBound($aRet, 2) Then
            ReDim $aRet[$iUBRow][$iUBCol]
        EndIf
        For $j = 0 To $iUBCol - 1
            $aRet[$iEnum][$j] = $aCols[$j]
        Next
        $iEnum += 1
    Next

    Return $aRet
EndFunc   ;==>_htmlraw_TableToArray


#12 ·  Posted (edited)

<table border=3 ALIGN=RIGHT bgColor=#990099>
  <tr>
    <td rowspan=3>
      TAB13 C1 rowspan=3
    </td><td>
      T13R1C2
    </td>
  </tr><tr>
    <td>
      T13R2C2
    </td>
  </tr>
  <tr>
    <td>
      T13R3C2
    </td>
  </tr>
</table>
 
The array returns exactly what I'd expect.
[0][0] = TAB13 C1 rowspan=3
[0][1] = T13R1C2
[1][0] = T13R2C2
[2][0] = T13R3C2
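
If the goal is instead to mirror the grid as the browser renders it, one possible approach is to remember which slots are already covered by a rowspan from a row above and shift later cells past them. The following is only a sketch under stated assumptions: _Sketch_ApplyRowspan is a hypothetical helper (not part of the code in this thread), it expects the raw cells with their opening tags still attached (as the earlier version of _htmlraw_TableToArray returns them), and it ignores colspan entirely.

```autoit
; Hypothetical helper, for illustration only.
; $aRaw: 2D array of raw cell HTML, opening <td>/<th> tag still present.
; Returns a new 2D array where cells are shifted right past any slot that is
; covered by a rowspan coming from a previous row. colspan is NOT handled.
Func _Sketch_ApplyRowspan($aRaw)
    If UBound($aRaw, 0) <> 2 Then Return SetError(1, 0, 0)

    Local $iRows = UBound($aRaw, 1), $iCols = UBound($aRaw, 2)
    Local $aGrid[$iRows][$iCols]  ; the shifted result
    Local $aTaken[$iRows][$iCols] ; set to 1 where a slot is filled or covered
    Local $aSpan, $iSpan, $iOut

    For $r = 0 To $iRows - 1
        $iOut = 0 ; next candidate output column in this row
        For $c = 0 To UBound($aRaw, 2) - 1
            If $aRaw[$r][$c] = "" Then ContinueLoop ; padding, not a real cell

            ; skip slots already covered by a rowspan from above
            While $iOut < $iCols And $aTaken[$r][$iOut] = 1
                $iOut += 1
            WEnd

            ; grow both arrays if the shift pushes past the last column
            If $iOut >= $iCols Then
                $iCols = $iOut + 1
                ReDim $aGrid[$iRows][$iCols]
                ReDim $aTaken[$iRows][$iCols]
            EndIf

            $aGrid[$r][$iOut] = $aRaw[$r][$c]

            ; read rowspan from the opening tag only; default to 1 when absent
            $aSpan = StringRegExp($aRaw[$r][$c], "(?i)^\s*<t[dh]\b[^>]*\browspan\s*=\D?(\d+)", 1)
            If @error Then
                $iSpan = 1
            Else
                $iSpan = Number($aSpan[0])
                If $iSpan < 1 Then $iSpan = 1
            EndIf

            ; mark this slot as taken in this row and in the rows it spans
            For $k = 0 To $iSpan - 1
                If $r + $k < $iRows Then $aTaken[$r + $k][$iOut] = 1
            Next
            $iOut += 1
        Next
    Next

    Return $aGrid
EndFunc   ;==>_Sketch_ApplyRowspan
```

For a layout like the table above, the first cell would mark column 0 as taken in all three rows, so the single cells of rows 2 and 3 would be placed in column 1 rather than column 0. The tags could then be stripped from the shifted grid afterwards.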
 
As for your other question (removing the HTML tags):
Func _htmlraw_TableToArray($sTable)

    If Not StringLen($sTable) Then
        Return SetError(1, 0, 0)
    EndIf

    Local $aRows = _htmlraw_GetTableRows($sTable)
    If Not IsArray($aRows) Then
        Return SetError(2, 0, 0)
    EndIf

    Local $iUBRow = UBound($aRows)
    Local $aRet[$iUBRow][1]

    Local $aCols, $iEnum = 0, $iUBCol

    For $i = 0 To $iUBRow - 1
        $aCols = _htmlraw_GetTableCols($aRows[$i])
        $iUBCol = UBound($aCols)
        If Not $iUBCol Then ContinueLoop
        If $iUBCol > UBound($aRet, 2) Then
            ReDim $aRet[$iUBRow][$iUBCol]
        EndIf
        For $j = 0 To $iUBCol - 1
            $aRet[$iEnum][$j] = StringRegExpReplace($aCols[$j], _
                "^(?i)\h*<(?:th|td).*?(?<!>)>\h*|(?:\h*<\h*/\h*(?:th|td)\h*>\h*)$", "")
        Next
        $iEnum += 1
    Next

    Return $aRet
EndFunc   ;==>_htmlraw_TableToArray
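
Note that the replacement pattern above uses \h, which matches horizontal whitespace only, so line breaks and indentation inside a cell may still be left around the text. A minimal sketch for trimming them afterwards, assuming that is wanted; _Sketch_TrimCells is a hypothetical helper name, not part of the code above:

```autoit
; Hypothetical helper, for illustration only.
; Trims leading and trailing whitespace (including CR/LF) from every
; element of a 2D array, in place.
Func _Sketch_TrimCells(ByRef $aCells)
    If UBound($aCells, 0) <> 2 Then Return SetError(1, 0, 0) ; expects a 2D array
    For $i = 0 To UBound($aCells, 1) - 1
        For $j = 0 To UBound($aCells, 2) - 1
            ; flag 3 = 1 (strip leading) + 2 (strip trailing)
            $aCells[$i][$j] = StringStripWS($aCells[$i][$j], 3)
        Next
    Next
    Return 1
EndFunc   ;==>_Sketch_TrimCells
```

It could be called on the result just before display, e.g. _Sketch_TrimCells($aResult) ahead of _ArrayDisplay($aResult).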

Edit:

Perhaps you meant to write your table code like this:

<table border=3 ALIGN=RIGHT bgColor=#990099>
  <tr>
    <td rowspan=3>
      TAB13 C1 rowspan=3
    </td><td>
      T13R1C2
    </td>
  </tr><tr>
    <td></td>
    <td>
      T13R2C2
    </td>
  </tr>
  <tr>
    <td></td>
    <td>
      T13R3C2
    </td>
  </tr>
</table>
 
If you did that, I believe this comes out as you expect:
#include <IE.au3>
#include <String.au3>
#include <Array.au3>

;
; 1) open an HTML page containing tables (some of them nested)
;    it's a hodgepodge of tables, just for testing
Local $oie = _IECreate()
_IEDocWriteHTML($oie, MyHTML()) ; just to show the tables on the browser
Do
    Sleep(250)
Until IsObj($oie)
Local $sHtml = _IEBodyReadHTML($oie) ; extract whole raw HTML of the page
;
Local $aTables = ParseTables($sHtml) ; each table in each element of the array
;
Local $iWantedTable, $sError, $aResult
Do
    $iWantedTable = InputBox("select a table", "Please enter the nr. of the table to get data from (1 based)")
    $sError = @error
    If Not $sError Then
        ; $aResult = _TableWriteToArrayFromHTML($aTables[$iWantedTable]) ; extracts table contents in a 2D array
        ; $aResult = _htmlraw_GetTableRows($aTables[$iWantedTable]) ; by SmOke_N
        ; $aResult = _htmlraw_GetTableCols($aTables[$iWantedTable]) ; by SmOke_N
        $aResult = _htmlraw_TableToArray($aTables[$iWantedTable])
        $sError = @error
        _ArrayDisplay($aResult, "Content of table nr." & $iWantedTable)
    EndIf
Until $sError
;
; -----------------------------------------------------------------
; locates the <table and </table> tags, returns an array of tables
; -----------------------------------------------------------------
Func ParseTables($sHtml)
    ; finds how many tables are on the HTML page (tables collection)
    StringReplace($sHtml, "<table", "<table") ; @extended now holds the number of occurrences
    Local $iNrOfTableTags = @extended
    ; ConsoleWrite(@CRLF & "Debug: This page contains " & $iNrOfTableTags & " tables." & @CRLF)
    ; I assume that <table and </table> tags are balanced (as should be)
    ; (so NO check is made to see if they are actually balanced)
    If $iNrOfTableTags Then ; if at least one table exists
        ; $aTableTagsPositions array will contain the positions of the
        ; starting <table and ending </table> tags within the HTML
        Local $aTableTagsPositions[$iNrOfTableTags * 2 + 1][3] ; 1 based (make room for all open and close tags)
        ; 2) find in the HTML the positions of the <table and </table> tags
        For $i = 1 To $iNrOfTableTags
            $aTableTagsPositions[$i][0] = StringInStr($sHtml, "<table", 0, $i) ; start position of $i occurrence of <table opening tag
            $aTableTagsPositions[$i][1] = "<table" ; mark tag of this location
            $aTableTagsPositions[$i][2] = $i ; nr of table
            $aTableTagsPositions[$iNrOfTableTags + $i][0] = StringInStr($sHtml, "</table>", 0, $i) + 7 ; end position of $i occurrence of </table> closing tag
            $aTableTagsPositions[$iNrOfTableTags + $i][1] = "</table>" ; mark tag of this location
        Next
        _ArraySort($aTableTagsPositions, 0, 1) ; now all opening and closing tags are in the same sequence as they appear in the HTML
        Local $aTables = ExtractTables($aTableTagsPositions, $sHtml) ; $aTables array will contain one table in each element
        If Not @error Then Return $aTables
        Return SetError(2, 0, 0)
    Else
        Return SetError(1, 0, 0) ; No tables in HTML
    EndIf
EndFunc   ;==>ParseTables

; ---------------------------------------------------
; returns an array containing a table in each element
; ---------------------------------------------------
Func ExtractTables(ByRef $aTableTagsPositions, $html)
    Local $aStack[UBound($aTableTagsPositions)][2]
    Local $aTables[Ceiling(UBound($aTableTagsPositions) / 2)] ; will contain the collection of tables
    For $i = 1 To UBound($aTableTagsPositions) - 1
        If $aTableTagsPositions[$i][1] = "<table" Then ; opening tag
            $aStack[0][0] += 1
            $aStack[$aStack[0][0]][0] = "<table"
            $aStack[$aStack[0][0]][1] = $i
        ElseIf $aTableTagsPositions[$i][1] = "</table>" Then ; a closing tag was found
            If Not $aStack[0][0] Or Not ArePair($aStack[$aStack[0][0]][0], $aTableTagsPositions[$i][1]) Then
                Return SetError(1, 0, 0) ; False ; something is not ok
            Else ; pair detected (the reciprocal tag)
                ; now get coordinates of the 2 tags
                ; 1) extract this table from the html to the array
                $aTables[$aTableTagsPositions[$aStack[$aStack[0][0]][1]][2]] = StringMid($html, $aTableTagsPositions[$aStack[$aStack[0][0]][1]][0], 1 + $aTableTagsPositions[$i][0] - $aTableTagsPositions[$aStack[$aStack[0][0]][1]][0])
                ; 2) remove that table from the html
                $html = StringLeft($html, $aTableTagsPositions[$aStack[$aStack[0][0]][1]][0] - 1) & StringMid($html, $aTableTagsPositions[$i][0] + 1)
                ; 3) adjust the references to the new positions of remaining tags
                For $ii = $i To UBound($aTableTagsPositions) - 1
                    $aTableTagsPositions[$ii][0] -= StringLen($aTables[$aTableTagsPositions[$aStack[$aStack[0][0]][1]][2]])
                Next
                $aStack[0][0] -= 1
            EndIf
        EndIf
    Next
    If Not $aStack[0][0] Then
        Return $aTables
    Else
        Return SetError(1, 0, 0)
    EndIf
EndFunc   ;==>ExtractTables

Func ArePair($sOpening, $sClosing)
    If ($sOpening = '<table' And $sClosing = '</table>') Then Return True
    Return False
EndFunc   ;==>ArePair

; ------------------------------------
; copy content of cells into the array
; ------------------------------------
Func _TableWriteToArrayFromHTML($sHtml)
    Local $aRes = StringRegExp($sHtml, "(?isU)(?|<(/)tr>\s*|<t[dh].*>(.*)</t[dh]>)", 3)
    ; _ArrayDisplay($aRes)
    Local $aTempResult[UBound($aRes)][UBound($aRes)]
    Local $iRow = 0, $iCol = 0, $iMaxRow = 0

    For $i = 0 To UBound($aRes) - 1
        If $aRes[$i] = "/" Then
            $iRow += 1
            $iCol = 0
        Else
            $aTempResult[$iRow][$iCol] = $aRes[$i]
            $iCol += 1
            If $iCol > $iMaxRow Then $iMaxRow = $iCol
        EndIf
    Next

    ReDim $aTempResult[$iRow][$iMaxRow]
    Return $aTempResult
EndFunc   ;==>_TableWriteToArrayFromHTML

Func MyHTML()
    Local $sData = "0x3C5441424C4520626F726465723D223122206267436F6C6F723D233030666630303E0D0A2" & _
            "02020203C54523E0D0A20202020202020203C54443E5461626C652031202872316331293C74" & _
            "61626C6520626F726465723D223122206267436F6C6F723D236666303030303E0D0A20203C7" & _
            "4723E0D0A202020203C74683E5461626C6520322028743272316331293C2F74683E0D0A2020" & _
            "20203C74683E5461626C65203220726F77203120436F6C756D6E20323C2F74683E0D0A20202" & _
            "0203C74683E5432523143323C2F74683E0D0A20203C2F74723E0D0A20203C74723E0D0A2020" & _
            "20203C74643E5432523243313C2F74643E0D0A202020203C74643E0D0A2020202020203C746" & _
            "1626C6520626F726465723D223122206267436F6C6F723D236666666630303E0D0A20202020" & _
            "202020203C74723E0D0A202020202020202020203C74643E5461626C652033206E657374656" & _
            "4207461626C6520636F6C756D6E20313C2F74643E0D0A202020202020202020203C74643E6E" & _
            "6573746564207461626C6520636F6C756D6E20323C2F74643E0D0A20202020202020203C2F7" & _
            "4723E0D0A2020202020203C2F7461626C653E0D0A202020203C2F74643E0D0A202020203C74" & _
            "643E5432523243333C2F74643E0D0A20203C2F74723E0D0A20203C74723E0D0A202020203C7" & _
            "4643E5432523343313C2F74643E0D0A202020203C74643E5432523343323C2F74643E0D0A20" & _
            "2020203C74643E5432523343333C2F74643E0D0A20203C2F74723E0D0A3C2F7461626C653E2" & _
            "03C2F54443E0D0A20202020202020203C54443E5431523143323C2F54443E0D0A2020202020" & _
            "2020203C2F54523E0D0A202020203C54523E0D0A20202020202020203C54443E54315232433" & _
            "10D0A20202020202020202020203C7461626C6520626F726465723D31206267436F6C6F723D" & _
            "233939303030302020414C49474E3D43454E5445523E200D0A20202020202020202020203C7" & _
            "4723E3C74643E205461626C652034204162636465663C2F74643E3C74643E7434723163323C" & _
            "2F74643E3C74643E7434723163333C2F74643E3C74643E7434723163343C2F74643E3C74643" & _
            "E7434723163350D0A2020202020202020202020202020202020203C7461626C652020626F72" & _
            "6465723D31206267436F6C6F723D233939393930303E0D0A202020202020202020202020202" & _
            "0202020203C74723E3C74643E205461626C652035204768696A6B3C2F74643E3C74643E7435" & _
            "723163323C2F74643E3C74643E7435723163333C2F74643E3C74643E7435723163340D0A202" & _
            "020202020202020202020202020202020202020202020203C7461626C6520626F726465723D" & _
            "31206267436F6C6F723D233939393939393E0D0A20202020202020202020202020202020202" & _
            "0202020202020203C74723E3C74643E205461626C652036204C6D6E6F70713C2F74643E3C74" & _
            "643E7435723163323C2F74643E3C74643E7435723163330D0A2020202020202020202020202" & _
            "02020202020202020202020202020202020202020202020203C7461626C652020626F726465" & _
            "723D31206267436F6C6F723D234545303045453E203C74723E3C74643E205461626C6520372" & _
            "052737475767778797A3C2F74643E3C74643E7437723163323C2F74643E3C74643E74377231" & _
            "63333C2F74643E3C2F74723E0D0A20202020202020202020202020202020202020202020202" & _
            "0202020202020202020202020203C74723E3C74643E7437723263313C2F74643E3C74643E74" & _
            "37723263323C2F74643E3C74643E7437723263330D0A2020202020202020202020202020202" & _
            "02020202020202020202020202020202020203C2F74643E3C2F74723E203C2F7461626C653E" & _
            "0D0A202020202020202020202020202020202020202020202020202020202020202020203C2" & _
            "F74643E3C2F74723E203C2F7461626C653E0D0A202020202020202020202020202020202020" & _
            "202020202020202020202020202020203C2F74643E3C2F74723E203C2F7461626C653E0D0A2" & _
            "020202020202020202020202020202020202020202020202020202020202020203C2F74643E" & _
            "3C2F74723E203C2F7461626C653E3C2F54443E0D0A20202020202020203C54443E543152324" & _
            "3323C5441424C4520626F726465723D223122206267436F6C6F723D233030666666663E0D0A" & _
            "202020202020202020202020202020203C54523E0D0A2020202020202020202020202020202" & _
            "0202020203C54443E5461626C6520380D0A2020202020202020202020202020202020202020" & _
            "202020203C5441424C4520626F726465723D223122206267436F6C6F723D233030303066663" & _
            "E0D0A202020202020202020202020202020202020202020202020202020203C54523E0D0A20" & _
            "202020202020202020202020202020202020202020202020202020202020203C54443E54616" & _
            "26C6520393C2F54443E0D0A2020202020202020202020202020202020202020202020202020" & _
            "2020202020203C54443E54392052314332203C2F54443E0D0A2020202020202020202020202" & _
            "0202020202020202020202020202020202020203C2F54523E0D0A2020202020202020202020" & _
            "20202020202020202020202020202020203C54523E0D0A20202020202020202020202020202" & _
            "020202020202020202020202020202020203C54443E543920523243313C2F54443E0D0A2020" & _
            "2020202020202020202020202020202020202020202020202020202020203C54443E5439205" & _
            "23243323C2F54443E0D0A202020202020202020202020202020202020202020202020202020" & _
            "20202020203C2F54523E0D0A202020202020202020202020202020202020202020202020202" & _
            "020203C2F5441424C453E0D0A2020202020202020202020202020202020202020202020203C" & _
            "2F54443E0D0A20202020202020202020202020202020202020203C54443E543820523143323" & _
            "C2F54443E0D0A20202020202020202020202020202020202020203C2F54523E0D0A20202020" & _
            "2020202020202020202020203C54523E0D0A202020202020202020202020202020202020202" & _
            "03C54443E543820523243313C2F54443E0D0A20202020202020202020202020202020202020" & _
            "203C54443E543820523243323C2F54443E0D0A2020202020202020202020202020202020202" & _
            "0203C2F54523E0D0A202020202020202020202020202020203C2F5441424C453E0D0A202020" & _
            "2020202020202020203C2F54443E0D0A20202020202020203C2F54523E0D0A3C54523E3C544" & _
            "43E5431205233204331202D20412073696E676C652063656C6C20726F772028576974686F75" & _
            "742063656C6C70616464696E67293C2F54443E3C2F54523E0D0A20203C74723E0D0A2020202" & _
            "03C746420636F6C7370616E3D323E0D0A20202020202068656C6C6F2C2049276D2054315234" & _
            "4331202873696E676C652063656C6C20576974682063656C6C70616464696E673D32290D0A3" & _
            "C7461626C6520626F726465723D332063656C6C70616464696E673D3520414C49474E3D4C45" & _
            "4654206267436F6C6F723D233636363630303E0D0A20203C74723E0D0A202020203C7464206" & _
            "36F6C7370616E3D323E0D0A2020202020205461626C6520313020524F573120434F4C554D4E" & _
            "310D0A202020203C2F74643E0D0A20203C2F74723E3C74723E0D0A202020203C74643E0D0A2" & _
            "02020202020436F6E74656E742066726F6D20543130523243310D0A202020203C2F74643E3C" & _
            "74643E0D0A202020202020436F6E74656E742066726F6D20543130523243320D0A202020203" & _
            "C2F74643E0D0A20203C2F74723E3C74723E0D0A202020203C74643E0D0A202020202020436F" & _
            "6E74656E742066726F6D20543130523343310D0A202020203C2F74643E3C74643E0D0A20202" & _
            "0202020436F6E74656E742066726F6D20543130523343320D0A202020203C2F74643E0D0A20" & _
            "203C2F74723E0D0A3C2F7461626C653E0D0A3C7461626C6520626F726465723D332063656C6" & _
            "C70616464696E673D313020414C49474E3D43454E544552206267436F6C6F723D2339393939" & _
            "39393E0D0A3C74723E0D0A20203C74642076616C69676E3D746F703E0D0A202020205461622" & _
            "0313120723163310D0A20203C2F74643E3C74643E0D0A202020205461622031312072316332" & _
            "3C703E0D0A0909090D0A202020203C7461626C6520626F726465723D31206267436F6C6F723" & _
            "D233030393939393E0D0A202020203C74723E0D0A2020202020203C74643E54313252314331" & _
            "3C2F74643E0D0A2020202020203C74643E543132523143323C2F74643E0D0A202020203C2F7" & _
            "4723E3C74723E0D0A2020202020203C74643E543132523243313C2F74643E0D0A2020202020" & _
            "203C74643E543132523243323C2F74643E0D0A202020203C2F74723E0D0A202020203C2F746" & _
            "1626C653E3C703E0D0A0909090D0A2020202054616220313120723163320D0A20203C2F7464" & _
            "3E0D0A3C2F74723E0D0A3C2F7461626C653E0D0A3C7461626C6520626F726465723D3320414" & _
            "C49474E3D5249474854206267436F6C6F723D233939303039393E0D0A20203C74723E0D0A20" & _
            "2020203C746420726F777370616E3D333E0D0A202020202020544142313320433120726F777" & _
            "370616E3D330D0A202020203C2F74643E3C74643E0D0A202020202020543133523143320D0A" & _
            "202020203C2F74643E0D0A20203C2F74723E3C74723E0D0A202020203C74643E3C2F74643E0" & _
            "D0A202020203C74643E0D0A202020202020543133523243320D0A202020203C2F74643E0D0A" & _
            "20203C2F74723E0D0A20203C74723E0D0A202020203C74643E3C2F74643E0D0A202020203C7" & _
            "4643E0D0A202020202020543133523343320D0A202020203C2F74643E0D0A20203C2F74723E" & _
            "0D0A3C2F7461626C653E0D0A202020203C2F74643E0D0A20203C2F74723E3C74723E0D0A202" & _
            "020203C74643E0D0A2020202020205461626C653120726F773520636F6C756D6E310D0A2020" & _
            "20203C2F74643E3C74643E0D0A2020202020205461626C653120726F773520636F6C756D6E3" & _
            "20D0A202020203C2F74643E0D0A20203C2F74723E3C74723E0D0A202020203C74643E0D0A20" & _
            "20202020205461626C653120726F773620636F6C756D6E310D0A202020203C2F74643E3C746" & _
            "43E0D0A2020202020205461626C653120726F773620636F6C756D6E320D0A202020203C2F74" & _
            "643E0D0A20203C2F74723E0D0A3C2F5441424C453E"

    Return BinaryToString($sData)
EndFunc   ;==>MyHTML

; ------------------------------------
; following functions are from SmOke_N
; ------------------------------------

Func _htmlraw_GetTables($sHtml) ; return an array of tables

    If Not StringLen($sHtml) Then
        Return SetError(1, 0, 0)
    EndIf

    ; some of the below pattern isn't necessary, but I code it as I think about conditions
    ; problem is with nested tables, this is not a good solution
    Local $sPatt = "(?si)<\s*table(?:\s*|\s.+?)>.*?<\s*/\s*table\s*>"
    Local $aReg = StringRegExp($sHtml, $sPatt, 3)
    If @error Then
        Return SetError(2, @error, 0)
    EndIf

    Return $aReg
EndFunc   ;==>_htmlraw_GetTables


Func _htmlraw_GetTableRows($sTable)
    ; believe it or not </tr> is not necessary
    ;  though most use it, so better look for </table> too
    ;  then there's the fun of not having nested tables
    ;  but I don't have the brain power to think through all that today, so simple it is
    Local $sPatt = "(?si)<\s*tr(?:\s*|\s.+?)>.*?<\s*/\s*tr\s*>"
    Local $aReg = StringRegExp($sTable, $sPatt, 3)
    If @error Then
        Return SetError(1, 0, 0)
    EndIf

    Return $aReg
EndFunc   ;==>_htmlraw_GetTableRows

Func _htmlraw_GetTableCols($sData)

    ; I've talked about nesting issues, just going to do it simple
    ; th/td
    Local $sPatt = "(?si)(?:<\s*th(?:\s*|\s.+?)>.*?<\s*/\s*th\s*>|" & _
            "<\s*td(?:\s*|\s.+?)>.*?<\s*/\s*td\s*>)+"
    Local $aReg = StringRegExp($sData, $sPatt, 3)
    If @error Then
        Return SetError(1, 0, 0)
    EndIf

    Return $aReg
EndFunc   ;==>_htmlraw_GetTableCols

Func _htmlraw_TableToArray($sTable)

    If Not StringLen($sTable) Then
        Return SetError(1, 0, 0)
    EndIf

    Local $aRows = _htmlraw_GetTableRows($sTable)
    If Not IsArray($aRows) Then
        Return SetError(2, 0, 0)
    EndIf

    Local $iUBRow = UBound($aRows)
    Local $aRet[$iUBRow][1]

    Local $aCols, $iEnum = 0, $iUBCol

    For $i = 0 To $iUBRow - 1
        $aCols = _htmlraw_GetTableCols($aRows[$i])
        $iUBCol = UBound($aCols)
        If Not $iUBCol Then ContinueLoop
        If $iUBCol > UBound($aRet, 2) Then
            ReDim $aRet[$iUBRow][$iUBCol]
        EndIf
        For $j = 0 To $iUBCol - 1
            $aRet[$iEnum][$j] = StringRegExpReplace($aCols[$j], _
                    "^(?is)\h*<(?:th|td).*?(?<!>)>\s*|(?:\s*<\h*/\h*(?:th|td)\h*>\h*)$", "")
        Next
        $iEnum += 1
    Next

    Return $aRet
EndFunc   ;==>_htmlraw_TableToArray


Edited by SmOke_N

Common sense plays a role in the basics of understanding AutoIt... If you're lacking in that, do us all a favor, and step away from the computer.


Nice work you've done here :D

How can I fix this output? (screenshot: wwq.png)

This is what the table looks like: (screenshot: table.png)

#include <Array.au3>
#NoTrayIcon
$b64tbl = "PHRhYmxlIGNsYXNzPSJkYXRhLWJvcmRlcmVkIG1heCI+DQogIDx0Ym9keT4NCiAgICA8dHI+DQogICAgICA8dGggd2lkdGg9IjcxIiBzY29wZT0iY29sIj5QbGF0Zm9ybTwvdGg+DQogICAgICA8dGggd2lkdGg9Ijg4IiBzY29wZT0iY29sIj5Ccm93c2Vy" & _
        "PC90aD4NCiAgICAgIDx0aCB3aWR0aD0iMTAxIiBzY29wZT0iY29sIj5QbGF5ZXImbmJzcDt2ZXJzaW9uPC90aD4NCiAgICA8L3RyPg0KICAgIDx0cj4NCiAgICAgIDx0ZCByb3dzcGFuPSI0Ij48c3Ryb25nPldpbmRvd3M8L3N0cm9uZz48L3RkPg0KICAgICA" & _
        "gPHRkPkludGVybmV0IEV4cGxvcmVyIC0gQWN0aXZlWDwvdGQ+DQogICAgICA8dGQ+MTYuMC4wLjI4NzwvdGQ+DQogICAgPC90cj4NCiAgICANCiAgICAgPHRyPg0KICAgICAgPHRkPkludGVybmV0IEV4cGxvcmVyIChXaW5kb3dzIDgueCkgLSBBY3RpdmVYPC" & _
        "90ZD4NCiAgICAgIDx0ZD4xNi4wLjAuMjg3PC90ZD4NCiAgICA8L3RyPg0KICAgIA0KICAgIDx0cj4NCiAgICAgIDx0ZD5GaXJlZm94LCBNb3ppbGxhIC0gTlBBUEk8L3RkPg0KICAgICAgPHRkPjE2LjAuMC4yODc8L3RkPg0KICAgIDwvdHI+DQogICAgPHRyP" & _
        "g0KICAgICAgPHRkPkNocm9tZSAoZW1iZWRkZWQpLCBPcGVyYSwgQ2hyb21pdW0tYmFzZWQgYnJvd3NlcnMgLSBQUEFQSTwvdGQ+DQogICAgICA8dGQ+MTYuMC4wLjI4NzwvdGQ+DQogICAgPC90cj4NCiAgICA8dHI+DQogICAgICA8dGQgcm93c3Bhbj0iMiI+" & _
        "PHN0cm9uZz5NYWNpbnRvc2g8YnIgLz5PUyBYPC9zdHJvbmc+PC90ZD4NCiAgICAgIDx0ZD5GaXJlZm94LCBTYWZhcmkgLSBOUEFQSTwvdGQ+DQogICAgICA8dGQ+MTYuMC4wLjI4NzwvdGQ+DQogICAgPC90cj4NCiAgICA8dHI+DQogICAgICA8dGQ+Q2hyb21" & _
        "lIChlbWJlZGRlZCksIE9wZXJhLCBDaHJvbWl1bS1iYXNlZCBicm93c2VycyAtIFBQQVBJPC90ZD4NCiAgICAgIDx0ZD4xNi4wLjAuMjg3PC90ZD4NCiAgICA8L3RyPg0KICAgIDx0cj4NCiAgICAgIDx0ZCByb3dzcGFuPSIyIj48c3Ryb25nPkxpbnV4PC9zdH" & _
        "Jvbmc+PC90ZD4NCiAgICAgIDx0ZD5Nb3ppbGxhLCBGaXJlZm94IC0gTlBBUEkgKEV4dGVuZGVkIFN1cHBvcnQgUmVsZWFzZSk8L3RkPg0KICAgICAgPHRkPjExLjIuMjAyLjQzODwvdGQ+DQogICAgPC90cj4NCiAgICA8dHI+DQogICAgICA8dGQ+Q2hyb21lI" & _
        "ChlbWJlZGRlZCksIENocm9taXVtLWJhc2VkIGJyb3dzZXJzIC0gUFBBUEk8L3RkPg0KICAgICAgPHRkPjE2LjAuMC4yOTE8L3RkPg0KICAgIDwvdHI+DQogICAgPHRyPg0KICAgICAgPHRkPjxzdHJvbmc+U29sYXJpczwvc3Ryb25nPjwvdGQ+DQogICAgICA8" & _
        "dGQ+Rmxhc2ggUGxheWVyIDExLjIuMjAyLjIyMyBpcyB0aGUgbGFzdCBzdXBwb3J0ZWQgRmxhc2ggUGxheWVyIHZlcnNpb24gZm9yIFNvbGFyaXMuPC90ZD4NCiAgICAgIDx0ZD4xMS4yLjIwMi4yMjM8L3RkPg0KICAgIDwvdHI+DQogIDwvdGJvZHk+DQo8L3R" & _
        "hYmxlPg=="
$stbl = BinaryToString(_Base64Decode($b64tbl))
$table = _htmlraw_TableToArray($stbl)
_ArrayDisplay($table)
Exit

Func _Base64Decode($input_string) ; by trancexx
    Local $struct = DllStructCreate('int')
    Local $a_Call = DllCall('Crypt32.dll', 'int', 'CryptStringToBinary', 'str', $input_string, 'int', 0, 'int', 1, 'ptr', 0, 'ptr', DllStructGetPtr($struct, 1), 'ptr', 0, 'ptr', 0)
    If @error Or Not $a_Call[0] Then Return SetError(1, 0, '')
    Local $a = DllStructCreate('byte[' & DllStructGetData($struct, 1) & ']')
    $a_Call = DllCall('Crypt32.dll', 'int', 'CryptStringToBinary', 'str', $input_string, 'int', 0, 'int', 1, 'ptr', DllStructGetPtr($a), 'ptr', DllStructGetPtr($struct, 1), 'ptr', 0, 'ptr', 0)
    If @error Or Not $a_Call[0] Then Return SetError(2, 0, '')
    Return DllStructGetData($a, 1)
EndFunc   ;==>_Base64Decode

Func _htmlraw_TableToArray($sTable)
    If Not StringLen($sTable) Then Return SetError(1, 0, 0)
    Local $aRows = _htmlraw_GetTableRows($sTable)
    If Not IsArray($aRows) Then Return SetError(2, 0, 0)
    Local $iUBRow = UBound($aRows), $aRet[$iUBRow][1]
    Local $aCols, $iEnum = 0, $iUBCol

    For $i = 0 To $iUBRow - 1
        $aCols = _htmlraw_GetTableCols($aRows[$i])
        $iUBCol = UBound($aCols)
        If Not $iUBCol Then ContinueLoop
        If $iUBCol > UBound($aRet, 2) Then ReDim $aRet[$iUBRow][$iUBCol]
        For $j = 0 To $iUBCol - 1
            $aRet[$iEnum][$j] = StringRegExpReplace($aCols[$j], "^(?i)\h*<(?:th|td).*?(?<!>)>\h*|(?:\h*<\h*/\h*(?:th|td)\h*>\h*)$", "")
        Next
        $iEnum += 1
    Next
    Return $aRet
EndFunc   ;==>_htmlraw_TableToArray

Func _htmlraw_GetTableRows($sTable)
    Local $sPatt = "(?si)<\s*tr(?:\s*|\s.+?)>.*?<\s*/\s*tr\s*>"
    Local $aReg = StringRegExp($sTable, $sPatt, 3)
    If @error Then Return SetError(1, 0, 0)
    Return $aReg
EndFunc   ;==>_htmlraw_GetTableRows

Func _htmlraw_GetTableCols($sData)
    Local $sPatt = "(?si)(?:<\s*th(?:\s*|\s.+?)>.*?<\s*/\s*th\s*>|" & _
            "<\s*td(?:\s*|\s.+?)>.*?<\s*/\s*td\s*>)+"
    Local $aReg = StringRegExp($sData, $sPatt, 3)
    If @error Then Return SetError(1, 0, 0)
    Return $aReg
EndFunc   ;==>_htmlraw_GetTableCols

Heroes, there is no such thing

One day I'll discover what IE.au3 has of special for so many users using it.
C'mon there's InetRead and WinHTTP, way better

#14 ·  Posted (edited)

What do you mean by "clean it up"? The functions produce exactly what the data contains.

Do you mean the HTML markup inside the cells (&nbsp;, <strong></strong>, <br />)? Well, you'd strip that out (I'd imagine before sending the data to the table functions).

Are you talking about the layout? I'm sure that's something the CSS and the scope/width/column attributes take care of, which is totally outside what I understood you were trying to accomplish.
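
The tag stripping mentioned above could look roughly like the sketch below. It is only an illustration: _Example_StripCellMarkup is a made-up name, not one of the _htmlraw_* functions, and it only decodes &nbsp; because that is the one entity in the sample table. Here it is applied to a single cell's content rather than to the whole page.

```autoit
; Hypothetical helper (not part of the functions posted in this thread):
; reduce the inner HTML of one cell to plain text.
Func _Example_StripCellMarkup($sCell)
    ; turn <br>, <br/> and <br /> into a space so neighbouring words do not run together
    $sCell = StringRegExpReplace($sCell, "(?i)<\s*br\s*/?\s*>", " ")
    ; drop every remaining tag, e.g. <strong>, </strong>, <font ...>
    $sCell = StringRegExpReplace($sCell, "(?s)<[^>]*>", "")
    ; decode the only entity that shows up in the sample table (assumption: others are not needed)
    $sCell = StringReplace($sCell, "&nbsp;", " ")
    ; 1 = strip leading, 2 = strip trailing, 4 = collapse repeated whitespace between words
    Return StringStripWS($sCell, 1 + 2 + 4)
EndFunc   ;==>_Example_StripCellMarkup

; one of the cells from the table posted above
ConsoleWrite(_Example_StripCellMarkup("<strong>Macintosh<br />OS X</strong>") & @CRLF)
```

Calling it on each value just before it is stored in the result array would leave the row and cell regexps untouched.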

Edit:

I guess you mean the rowspan... ugh

Edited by SmOke_N


Func _htmlraw_TableToArray($sTable)

    If Not StringLen($sTable) Then
        Return SetError(1, 0, 0)
    EndIf

    Local $aRows = _htmlraw_GetTableRows($sTable)
    If Not IsArray($aRows) Then
        Return SetError(2, 0, 0)
    EndIf

    Local $iUBRow = UBound($aRows)
    Local $aRet[$iUBRow][1]

    Local $aCols, $iEnum = 0, $iUBCol
    Local $aRowSpan
    Local Const $sRowSpanPatt = "(?is)<\s*(?:td|th)\h+rowspan=" & _
        "(?:\x22|\x27)(\d+)(?:\x22|\x27)\s*>"
    Local Const $sRemoveTagPatt = "^(?is)\h*<(?:th|td).*?(?<!>)>" & _
        "\s*|(?:\s*<\h*/\h*(?:th|td)\h*>\h*)$"
    Local $iRowCount = -1, $aTmp
    For $i = 0 To $iUBRow - 1
        $aCols = _htmlraw_GetTableCols($aRows[$i])
        $iUBCol = UBound($aCols)
        If Not $iUBCol Then ContinueLoop
        ; take care of rowspan
        If $iRowCount > -1 Then
            Dim $aTmp[$iUBCol + 1]
            For $j = 0 To $iUBCol - 1
                $aTmp[$j + 1] = $aCols[$j]
            Next
            $aCols = $aTmp
            $iUBCol = UBound($aCols)
            $iRowCount -= 1
        EndIf
        If $iUBCol > UBound($aRet, 2) Then
            ReDim $aRet[$iUBRow][$iUBCol]
        EndIf
        For $j = 0 To $iUBCol - 1
            If $iRowCount = -1 Then
                $aRowSpan = StringRegExp($aRows[$i], $sRowSpanPatt, 1)
                $iRowCount = ((Not @error) ? $aRowSpan[0] - 2 : -1)
            EndIf
            $aRet[$iEnum][$j] = StringRegExpReplace($aCols[$j], $sRemoveTagPatt, "")
        Next
        $iEnum += 1
    Next

    Return $aRet
EndFunc   ;==>_htmlraw_TableToArray


Works flawlessly, thank you SmOke_N!



Your rowspan version of _htmlraw_TableToArray is close to what I'm trying to accomplish.

I'd like to handle both COLSPAN and ROWSPAN by filling only the first array cell of the spanned area and leaving the other cells empty (repeating the same value in every cell of the spanned area could be an option too).

Also, dropping any tags between <td> and </td> and keeping only the data contained in the cell should give cleaner data.

Since my regexp skill is close to zero, I can't modify your regexp to achieve this, so I'll try to get there using string functions instead.

Thanks for your sample code.
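
The "fill only the first cell of the spanned area" idea described above can be sketched with a second array that remembers which slots are already claimed. This is only a sketch under assumptions: _Example_PlaceCell and the $aUsed map are invented for illustration, the span values are typed in by hand instead of being parsed from the tags, and only the top-left slot is checked when looking for a free column.

```autoit
#include <Array.au3>

; Hypothetical helper, not part of the _htmlraw_* functions in this thread.
; Writes $sText into the first free slot of row $iRow and marks the whole
; rowspan x colspan area as used, so later cells skip over it.
Func _Example_PlaceCell(ByRef $aGrid, ByRef $aUsed, $iRow, $sText, $iRowSpan = 1, $iColSpan = 1)
    Local $iRows = UBound($aGrid, 1), $iCols = UBound($aGrid, 2)

    ; make sure every row touched by this cell exists before probing it
    If $iRow + $iRowSpan > $iRows Then
        $iRows = $iRow + $iRowSpan
        ReDim $aGrid[$iRows][$iCols]
        ReDim $aUsed[$iRows][$iCols]
    EndIf

    ; first slot in this row that no earlier span has claimed
    Local $iCol = 0
    While $iCol < $iCols And $aUsed[$iRow][$iCol]
        $iCol += 1
    WEnd

    ; widen both arrays if the cell (plus its colspan) does not fit
    If $iCol + $iColSpan > $iCols Then
        $iCols = $iCol + $iColSpan
        ReDim $aGrid[$iRows][$iCols]
        ReDim $aUsed[$iRows][$iCols]
    EndIf

    ; the value goes only into the top-left slot of the spanned area
    $aGrid[$iRow][$iCol] = $sText
    For $r = $iRow To $iRow + $iRowSpan - 1
        For $c = $iCol To $iCol + $iColSpan - 1
            $aUsed[$r][$c] = 1
        Next
    Next
EndFunc   ;==>_Example_PlaceCell

; a header with rowspan=2 and colspan=2 cells, followed by one plain row
Local $aGrid[1][1], $aUsed[1][1]
_Example_PlaceCell($aGrid, $aUsed, 0, "State of Health", 2, 1)
_Example_PlaceCell($aGrid, $aUsed, 0, "Fasting Value", 1, 2)
_Example_PlaceCell($aGrid, $aUsed, 0, "After Eating")
_Example_PlaceCell($aGrid, $aUsed, 1, "Minimum")
_Example_PlaceCell($aGrid, $aUsed, 1, "Maximum")
_Example_PlaceCell($aGrid, $aUsed, 1, "2 hours after eating")
_ArrayDisplay($aGrid)
```

With this layout "Minimum" should land in the second column of the second row, because the first slot of that row is claimed by the rowspan above it. Repeating the value across the whole area instead would only mean writing $sText inside the nested loop.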



#18 ·  Posted (edited)

@Chimp

I can't find an example where colspan would matter here.

Do you have table code that would make it worth pursuing?

Edit:

Maybe I found one: the columns need to extend out by one more when colspan is used on, say, the 2nd of 3 columns (e.g. colspan="2" on the second column while there is still a 3rd to process).

Edited by SmOke_N


#19 ·  Posted (edited)

Hi SmOke_N

here are some tables.
You can also see that if the <td> tag contains extra attributes besides rowspan, like this for example,

<td bgcolor="#d3d3d3" align="center" valign="middle" rowspan="2">

then your rowspan handling via regexp fails.
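
One way around the attribute-order problem described above is to look for the attribute anywhere inside a single opening tag instead of directly after the tag name. A minimal sketch, assuming the opening <td>/<th> tag has already been isolated; _Example_GetSpan is a hypothetical name and is not SmOke_N's pattern:

```autoit
; Hypothetical helper: read rowspan or colspan from ONE opening <td>/<th> tag.
; Returns 1 when the attribute is missing, which is the HTML default.
Func _Example_GetSpan($sOpenTag, $sAttr)
    ; \s before the name avoids matching inside another attribute name;
    ; the quote (\x22 = ", \x27 = ') is optional because ROWSPAN=3 is also used in the samples
    Local $aMatch = StringRegExp($sOpenTag, "(?i)\s" & $sAttr & "\s*=\s*[\x22\x27]?(\d+)", 1)
    If @error Then Return 1
    Return Number($aMatch[0])
EndFunc   ;==>_Example_GetSpan

; tags taken from the sample tables in this post
ConsoleWrite(_Example_GetSpan('<td bgcolor="#d3d3d3" align="center" valign="middle" rowspan="2">', "rowspan") & @CRLF)
ConsoleWrite(_Example_GetSpan('<TD ROWSPAN=3 BGCOLOR="#99CCFF">', "rowspan") & @CRLF)
ConsoleWrite(_Example_GetSpan('<td align="center" width="300" colspan="2">', "rowspan") & @CRLF)
```

Because it works on one tag at a time, it does not matter whether rowspan comes first, last or in the middle, and the same call with "colspan" covers the other attribute.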

#include <Array.au3>
#include <ie.au3>

$stbl = MyHtml()
ConsoleWrite(@CRLF & $stbl & @CRLF)
Local $oie = _IECreate()
_IEDocWriteHTML($oie, $stbl) ; just to show the tables on the browser
Do
    Sleep(250)
Until IsObj($oie)
$table = _htmlraw_TableToArray($stbl)
_ArrayDisplay($table)
Exit

Func _htmlraw_GetTableRows($sTable)
    Local $sPatt = "(?si)<\s*tr(?:\s*|\s.+?)>.*?<\s*/\s*tr\s*>"
    Local $aReg = StringRegExp($sTable, $sPatt, 3)
    If @error Then Return SetError(1, 0, 0)
    Return $aReg
EndFunc   ;==>_htmlraw_GetTableRows

Func _htmlraw_GetTableCols($sData)
    Local $sPatt = "(?si)(?:<\s*th(?:\s*|\s.+?)>.*?<\s*/\s*th\s*>|" & _
            "<\s*td(?:\s*|\s.+?)>.*?<\s*/\s*td\s*>)+"
    Local $aReg = StringRegExp($sData, $sPatt, 3)
    If @error Then Return SetError(1, 0, 0)
    Return $aReg
EndFunc   ;==>_htmlraw_GetTableCols

Func _htmlraw_TableToArray($sTable)

    If Not StringLen($sTable) Then
        Return SetError(1, 0, 0)
    EndIf

    Local $aRows = _htmlraw_GetTableRows($sTable)
    ; _ArrayDisplay($aRows, '_htmlraw_GetTableRows')
    If Not IsArray($aRows) Then
        Return SetError(2, 0, 0)
    EndIf

    Local $iUBRow = UBound($aRows)
    Local $aRet[$iUBRow][1]

    Local $aCols, $iEnum = 0, $iUBCol
    Local $aRowSpan
    Local Const $sRowSpanPatt = "(?is)<\s*(?:td|th)\h+rowspan=" & _
            "(?:\x22|\x27)(\d+)(?:\x22|\x27)\s*>"
    Local Const $sRemoveTagPatt = "^(?is)\h*<(?:th|td).*?(?<!>)>" & _
            "\s*|(?:\s*<\h*/\h*(?:th|td)\h*>\h*)$"
    Local $iRowCount = -1, $aTmp
    For $i = 0 To $iUBRow - 1
        $aCols = _htmlraw_GetTableCols($aRows[$i])
        ; _ArrayDisplay($aCols, '_htmlraw_GetTableCols')
        $iUBCol = UBound($aCols)
        If Not $iUBCol Then ContinueLoop
        ; take care of rowspan
        If $iRowCount > -1 Then
            Dim $aTmp[$iUBCol + 1]
            For $j = 0 To $iUBCol - 1
                $aTmp[$j + 1] = $aCols[$j]
            Next
            $aCols = $aTmp
            $iUBCol = UBound($aCols)
            $iRowCount -= 1
        EndIf
        If $iUBCol > UBound($aRet, 2) Then
            ReDim $aRet[$iUBRow][$iUBCol]
        EndIf
        For $j = 0 To $iUBCol - 1
            If $iRowCount = -1 Then
                $aRowSpan = StringRegExp($aRows[$i], $sRowSpanPatt, 1)
                $iRowCount = ((Not @error) ? $aRowSpan[0] - 2 : -1)
            EndIf
            $aRet[$iEnum][$j] = StringRegExpReplace($aCols[$j], $sRemoveTagPatt, "")
        Next
        $iEnum += 1
    Next

    Return $aRet
EndFunc   ;==>_htmlraw_TableToArray
Func MyHtml()
    Local $sHTML = ""

    $sHTML &= @CRLF & '<table border=1 class="data-bordered max">'
    $sHTML &= @CRLF & '<tbody>'
    $sHTML &= @CRLF & '<tr>'
    $sHTML &= @CRLF & '<th width="71" scope="col">Platform</th>'
    $sHTML &= @CRLF & '<th width="88" scope="col">Browser</th>'
    $sHTML &= @CRLF & '<th width="101" scope="col">Player&nbsp;version</th>'
    $sHTML &= @CRLF & '</tr>'
    $sHTML &= @CRLF & '<tr>'
    $sHTML &= @CRLF & '<td rowspan="4"><strong>Windows</strong></td>'
    $sHTML &= @CRLF & '<td>Internet Explorer - ActiveX</td>'
    $sHTML &= @CRLF & '<td>16.0.0.287</td>'
    $sHTML &= @CRLF & '</tr>'

    $sHTML &= @CRLF & '<tr>'
    $sHTML &= @CRLF & '<td>Internet Explorer (Windows 8.x) - ActiveX</td>'
    $sHTML &= @CRLF & '<td>16.0.0.287</td>'
    $sHTML &= @CRLF & '</tr>'

    $sHTML &= @CRLF & '<tr>'
    $sHTML &= @CRLF & '<td>Firefox, Mozilla - NPAPI</td>'
    $sHTML &= @CRLF & '<td>16.0.0.287</td>'
    $sHTML &= @CRLF & '</tr>'
    $sHTML &= @CRLF & '<tr>'
    $sHTML &= @CRLF & '<td>Chrome (embedded), Opera, Chromium-based browsers - PPAPI</td>'
    $sHTML &= @CRLF & '<td>16.0.0.287</td>'
    $sHTML &= @CRLF & '</tr>'
    $sHTML &= @CRLF & '<tr>'
    $sHTML &= @CRLF & '<td rowspan="2"><strong>Macintosh<br />OS X</strong></td>'
    $sHTML &= @CRLF & '<td>Firefox, Safari - NPAPI</td>'
    $sHTML &= @CRLF & '<td>16.0.0.287</td>'
    $sHTML &= @CRLF & '</tr>'
    $sHTML &= @CRLF & '<tr>'
    $sHTML &= @CRLF & '<td>Chrome (embedded), Opera, Chromium-based browsers - PPAPI</td>'
    $sHTML &= @CRLF & '<td>16.0.0.287</td>'
    $sHTML &= @CRLF & '</tr>'
    $sHTML &= @CRLF & '<tr>'
    $sHTML &= @CRLF & '<td rowspan="2"><strong>Linux</strong></td>'
    $sHTML &= @CRLF & '<td>Mozilla, Firefox - NPAPI (Extended Support Release)</td>'
    $sHTML &= @CRLF & '<td>11.2.202.438</td>'
    $sHTML &= @CRLF & '</tr>'
    $sHTML &= @CRLF & '<tr>'
    $sHTML &= @CRLF & '<td>Chrome (embedded), Chromium-based browsers - PPAPI</td>'
    $sHTML &= @CRLF & '<td>16.0.0.291</td>'
    $sHTML &= @CRLF & '</tr>'
    $sHTML &= @CRLF & '<tr>'
    $sHTML &= @CRLF & '<td><strong>Solaris</strong></td>'
    $sHTML &= @CRLF & '<td>Flash Player 11.2.202.223 is the last supported Flash Player version for Solaris.</td>'
    $sHTML &= @CRLF & '<td>11.2.202.223</td>'
    $sHTML &= @CRLF & '</tr>'
    $sHTML &= @CRLF & '</tbody>'
    $sHTML &= @CRLF & '</table>'

    $sHTML &= @CRLF & '<br><br>'

    $sHTML &= @CRLF & '<TABLE BORDER=1 CELLPADDING=4>'
    $sHTML &= @CRLF & '<tbody>'
    $sHTML &= @CRLF & '<TR>'
    $sHTML &= @CRLF & '<TD rowspan=''3''>Production</TD>'
    $sHTML &= @CRLF & '<TD>Raha Mutisya</TD> <TD>1493</TD>'
    $sHTML &= @CRLF & '</TR>'
    $sHTML &= @CRLF & '<TR>'
    $sHTML &= @CRLF & '<TD>Shalom Buraka</TD> <TD>3829</TD> '
    $sHTML &= @CRLF & '</TR>'
    $sHTML &= @CRLF & '<TR>'
    $sHTML &= @CRLF & '<TD>Brandy Davis</TD> <TD>0283</TD>'
    $sHTML &= @CRLF & '</TR>'
    $sHTML &= @CRLF & '<TR>'
    $sHTML &= @CRLF & '<TD ROWSPAN=3 BGCOLOR="#99CCFF">Sales</TD>'
    $sHTML &= @CRLF & '<TD>Claire Horne</TD> <TD>4827</TD>'
    $sHTML &= @CRLF & '</TR>'
    $sHTML &= @CRLF & '<TR>'
    $sHTML &= @CRLF & '<TD>Bruce Eckel</TD> <TD>7246</TD>'
    $sHTML &= @CRLF & '</TR>'
    $sHTML &= @CRLF & '<TR>'
    $sHTML &= @CRLF & '<TD>Danny Zeman</TD> <TD>5689</TD>'
    $sHTML &= @CRLF & '</TR>'
    $sHTML &= @CRLF & '</tbody>'
    $sHTML &= @CRLF & '</TABLE>'

    $sHTML &= @CRLF & '<br><br>'

    $sHTML &= @CRLF & '<TABLE BORDER=2 CELLPADDING=4>'
    $sHTML &= @CRLF & '<TR> <TH COLSPAN=2>Production2</TH> </TR>'
    $sHTML &= @CRLF & '<TR> <TD>Raha Mutisya</TD>      <TD>1493</TD> </TR>'
    $sHTML &= @CRLF & '<TR> <TD>Shalom Buraka</TD>     <TD>3829</TD> </TR>'
    $sHTML &= @CRLF & '<TR> <TD>Brandy Davis</TD>      <TD>0283</TD> </TR>'
    $sHTML &= @CRLF & '<TR> <TH COLSPAN=2>Sales</TH> </TR>'
    $sHTML &= @CRLF & '<TR> <TD>Claire Horne</TD>      <TD>4827</TD> </TR>'
    $sHTML &= @CRLF & '<TR> <TD>Bruce Eckel</TD>       <TD>7246</TD> </TR>'
    $sHTML &= @CRLF & '<TR> <TD>Danny Zeman</TD>       <TD>5689</TD> </TR>'
    $sHTML &= @CRLF & '<TR> <TD></TD>       <TD></TD> </TR>'
    $sHTML &= @CRLF & '</TABLE>'

    $sHTML &= @CRLF & '<br><br>'

    $sHTML &= @CRLF & '<table border="1" cellpadding="0" cellspacing="0">'
    $sHTML &= @CRLF & '<tr height="50">'
    $sHTML &= @CRLF & '    <td align="center" width="150" rowspan="2">State of Health</td>'
    $sHTML &= @CRLF & '    <td align="center" width="300" colspan="2">Fasting Value</td>'
    $sHTML &= @CRLF & '    <td align="center" width="150">After Eating</td>'
    $sHTML &= @CRLF & '</tr>'
    $sHTML &= @CRLF & '<tr height="50">'
    $sHTML &= @CRLF & '    <td align="center" width="150">Minimum</td>'
    $sHTML &= @CRLF & '    <td align="center" width="150">Maximum</td>'
    $sHTML &= @CRLF & '    <td align="center" width="150">2 hours after eating</td>'
    $sHTML &= @CRLF & '</tr>'
    $sHTML &= @CRLF & '<tr height="50">'
    $sHTML &= @CRLF & '   <td align="center" width="150">Healthy</td>'
    $sHTML &= @CRLF & '    <td align="center" width="150">70</td>'
    $sHTML &= @CRLF & '    <td align="center" width="150">100</td>'
    $sHTML &= @CRLF & '    <td align="center" width="150">Less than 140</td>'
    $sHTML &= @CRLF & '</tr>'
    $sHTML &= @CRLF & '<tr height="50">'
    $sHTML &= @CRLF & '    <td align="center" width="150">Pre-Diabetes</td>'
    $sHTML &= @CRLF & '    <td align="center" width="150">101</td>'
    $sHTML &= @CRLF & '    <td align="center" width="150">126</td>'
    $sHTML &= @CRLF & '    <td align="center" width="150">140 to 200</td>'
    $sHTML &= @CRLF & '</tr>'
    $sHTML &= @CRLF & '<tr height="50">'
    $sHTML &= @CRLF & '    <td align="center" width="150">Diabetes</td>'
    $sHTML &= @CRLF & '    <td align="center" width="150">More than 126</td>'
    $sHTML &= @CRLF & '    <td align="center" width="150">N/A</td>'
    $sHTML &= @CRLF & '    <td align="center" width="150">More than 200</td>'
    $sHTML &= @CRLF & '</tr>'
    $sHTML &= @CRLF & '</table>'

    $sHTML &= @CRLF & '<br><br>'

    $sHTML &= @CRLF & '<table width="400" cellpadding="10" cellspacing="0" border="1">'
    $sHTML &= @CRLF & '<tr><td bgcolor="#fa8072" align="center" valign="middle">'
    $sHTML &= @CRLF & '<font size="2" color="#000000" face="verdana">'
    $sHTML &= @CRLF & '<b>Cell One</b></font>'
    $sHTML &= @CRLF & '</td><td bgcolor="#d3d3d3" align="center" valign="middle" rowspan="2">'
    $sHTML &= @CRLF & '<font size="2" color="#000000" face="verdana">'
    $sHTML &= @CRLF & '<b>Cell Two</b></font>'
    $sHTML &= @CRLF & '</td>'
    $sHTML &= @CRLF & '<td bgcolor="#fa8072" align="center" valign="middle">'
    $sHTML &= @CRLF & '<font size="2" color="#000000" face="verdana">'
    $sHTML &= @CRLF & '<b>Cell Three</b></font>'
    $sHTML &= @CRLF & '</td></tr>'
    $sHTML &= @CRLF & '<tr><td bgcolor="#90ee90" align="center" valign="middle">'
    $sHTML &= @CRLF & '<font size="2" color="#000000" face="verdana">'
    $sHTML &= @CRLF & '<b>Cell Four</b></font>'
    $sHTML &= @CRLF & '</td>'
    $sHTML &= @CRLF & '<td bgcolor="#90ee90" align="center" valign="middle">'
    $sHTML &= @CRLF & '<font size="2" color="#000000" face="verdana">'
    $sHTML &= @CRLF & '<b>Cell Five</b></font>'
    $sHTML &= @CRLF & '</td></tr></table>'
    Return $sHTML
EndFunc   ;==>MyHtml
Edited by Chimp


Ahh, well this may fix that... not sure... don't have time at the moment to mess around.

Func _htmlraw_TableToArray($sTable)

    If Not StringLen($sTable) Then
        Return SetError(1, 0, 0)
    EndIf

    Local $aRows = _htmlraw_GetTableRows($sTable)
    If Not IsArray($aRows) Then
        Return SetError(2, 0, 0)
    EndIf

    Local $iUBRow = UBound($aRows)
    Local $aRet[$iUBRow][1]

    Local $aCols, $iEnum = 0, $iUBCol
    Local $aRowSpan
    Local Const $sRowSpanPatt = "(?is)<\s*(?:td|th)\h+.*?(?<!>)\hrowspan=" & _
        "(?:\x22|\x27)(\d+)(?:\x22|\x27).*?(?<!>)>"
    Local Const $sRemoveTagPatt = "^(?is)\h*<(?:th|td).*?(?<!>)>" & _
        "\s*|(?:\s*<\h*/\h*(?:th|td)\h*>\h*)$"
    Local $iRowCount = -1, $aTmp
    For $i = 0 To $iUBRow - 1
        $aCols = _htmlraw_GetTableCols($aRows[$i])
        $iUBCol = UBound($aCols)
        If Not $iUBCol Then ContinueLoop
        ; take care of rowspan
        If $iRowCount > -1 Then
            Dim $aTmp[$iUBCol + 1]
            For $j = 0 To $iUBCol - 1
                $aTmp[$j + 1] = $aCols[$j]
            Next
            $aCols = $aTmp
            $iUBCol = UBound($aCols)
            $iRowCount -= 1
        EndIf
        If $iUBCol > UBound($aRet, 2) Then
            ReDim $aRet[$iUBRow][$iUBCol]
        EndIf
        For $j = 0 To $iUBCol - 1
            If $iRowCount = -1 Then
                $aRowSpan = StringRegExp($aRows[$i], $sRowSpanPatt, 1)
                $iRowCount = ((Not @error) ? $aRowSpan[0] - 2 : -1)
            EndIf
            $aRet[$iEnum][$j] = StringRegExpReplace($aCols[$j], $sRemoveTagPatt, "")
        Next
        $iEnum += 1
    Next

    Return $aRet
EndFunc   ;==>_htmlraw_TableToArray
