
Read data from HTML tables from raw HTML source

Chimp

After seeing the function __HTML_Filter() in this topic by Stilgar (https://www.autoitscript.com/forum/topic/124330-_htmlau3-v101/), I decided to include that function in this script as well.
Its purpose is to clean the data extracted from the table by removing the codes that are not visible in the browser but show up as "dirty" markup in the data when it is picked up from the table.

I have updated the UDF and the example script in the first post.

To see the difference in the extracted data with and without the HTML_Filter() function, extract the table data from the example page by clicking the "Preview array" button twice: once with the "tags to entities" filter checkbox unchecked and once with it checked.


small minds discuss people average minds discuss events great minds discuss ideas.... and use AutoIt....

barbossa

Hi Chimp!
Thanks a lot for your example - it saved me a lot of work!
I had to parse a table with almost 1400 rows (and lots of rowspans) in a 1.5 MB HTML file, and ran into some performance issues. Here is how I solved them:
First, I adapted the HTML tag position search in _ParseTags so that each search starts from the position of the last tag found; that way StringInStr doesn't need to count through thousands of "<tr" tags on every iteration. Then _ArraySort failed (too many rows...). So, to build the tag list already sorted, I search for the first opening and the first closing tag. If the opening tag comes before the closing tag, I write it to $aThisTagsPositions and find the next opening tag; otherwise the closing tag comes first, so I write that one to $aThisTagsPositions and find the next closing tag.

This made it possible to read that huge HTML file in less than 90 seconds.

Just replace the code on lines 208-216 with this:

        ; start from the first opening and the first closing tag in the HTML
        Local $iNextOpenPosition = StringInStr($sHtml, $sOpening, 0, 1)
        Local $iNextClosePosition = StringInStr($sHtml, $sClosing, 0, 1)
        Local $iOpenCount = 1

        ; 2) find in the HTML the positions of the $sOpening <tag and $sClosing </tag> tags
        For $i = 1 To $iNrOfThisTag * 2 ; search all the opening and closing tags
            ; whichever tag comes first is stored next, so the list is built already sorted
            If ($iNextOpenPosition < $iNextClosePosition) And $iNextOpenPosition <> 0 Then
                $aThisTagsPositions[$i][0] = $iNextOpenPosition
                $aThisTagsPositions[$i][1] = $sOpening ; it marks which kind of tag is this
                $aThisTagsPositions[$i][2] = $iOpenCount ; nr of this tag
                $iOpenCount += 1
                ; resume the search just after this tag instead of rescanning from the start
                $iNextOpenPosition = StringInStr($sHtml, $sOpening, 0, 1, $aThisTagsPositions[$i][0] + 1)
            Else
                ; for a closing tag, store the position of its last character
                $aThisTagsPositions[$i][0] = $iNextClosePosition + StringLen($sClosing) - 1
                $aThisTagsPositions[$i][1] = $sClosing ; it marks which kind of tag is this
                $iNextClosePosition = StringInStr($sHtml, $sClosing, 0, 1, $aThisTagsPositions[$i][0] + 1)
            EndIf
        Next
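
For anyone who wants to try the idea outside the UDF, here is a minimal stand-alone sketch of the same "take whichever tag comes first" loop. It is only an illustration under my own assumptions: the sample string, the $aPositions array and the way the tags are counted with StringReplace are hypothetical and not taken from Chimp's script, and it assumes well-formed HTML where every opening tag has a closing one.

```autoit
; Hypothetical stand-alone demo, not part of the original UDF
Local $sHtml = "<tr><td>a</td></tr><tr><td>b</td></tr>" ; made-up sample data
Local $sOpening = "<tr", $sClosing = "</tr>"

; Count the opening tags once: StringReplace returns the number of replacements in @extended
StringReplace($sHtml, $sOpening, $sOpening)
Local $iNrOfThisTag = @extended

Local $aPositions[$iNrOfThisTag * 2 + 1][2] ; row 0 is left unused, as in the snippet above
Local $iNextOpen = StringInStr($sHtml, $sOpening, 0, 1)
Local $iNextClose = StringInStr($sHtml, $sClosing, 0, 1)

For $i = 1 To $iNrOfThisTag * 2
    If $iNextOpen <> 0 And $iNextOpen < $iNextClose Then
        ; the opening tag comes first: store it and look for the next opening tag
        $aPositions[$i][0] = $iNextOpen
        $aPositions[$i][1] = $sOpening
        $iNextOpen = StringInStr($sHtml, $sOpening, 0, 1, $iNextOpen + 1)
    Else
        ; the closing tag comes first: store the position of its last character
        $aPositions[$i][0] = $iNextClose + StringLen($sClosing) - 1
        $aPositions[$i][1] = $sClosing
        $iNextClose = StringInStr($sHtml, $sClosing, 0, 1, $aPositions[$i][0] + 1)
    EndIf
Next

; The positions come out in ascending order, so no _ArraySort call is needed
For $i = 1 To $iNrOfThisTag * 2
    ConsoleWrite($aPositions[$i][0] & @TAB & $aPositions[$i][1] & @CRLF)
Next
```

Because both searches only ever move forward, each tag is located once, and the two position lists are merged in order as they are found.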

 
