Chimp

Read data from html Tables from raw HTML source

3 posts in this topic

After seeing the function __HTML_Filter() in this topic by Stilgar (https://www.autoitscript.com/forum/topic/124330-_htmlau3-v101/) I thought I'd include that function also in this script.
the purpose of that function is to clean the extracted data from the table by those codes that are not visible in the browser but are visible as code "dirty" in the data when they are picked up from the table.

Updated the udf and the example script in first post.

To see the difference in the extracted data with or without the use of the HTML_Filter() function, just extract the table data from the example page by clicking on the "Preview array" button with the filter CheckBox "tags to entities" one time unchecked and then checked instead.


small minds discuss people average minds discuss events great minds discuss ideas.... and use AutoIt....

Share this post


Link to post
Share on other sites

Hi Chimp!
Thanks a lot for your example - it saved me a lot work!
I had to parse a table with almost 1400 rows (and lots of rowspans) in an 1.5MB HTML file, and got some performance issues. Here is how I solved them:
First, I adapted the HTML tag position search in _ParseTags to search starting on the last tag found position, so StringInStr doesn't need to count thousands of "<tr" tags every iteration. Then, _ArraySort failed (too many rows...). So, to get the tag list pre-sorted, I search for the first opening and first closing tag. If the opening is before the closing, write to $aThisTagsPositions and find the next opening; if the closing is before the next opening, write to $aThisTagsPositions and find the next closing.

This made it possible to read that huge HTML file in less than 90 seconds.

Just replace the code on lines 208-216 with this:

Local $iNextOpenPosition = StringInStr($sHtml, $sOpening, 0, 1)
        Local $iNextClosePosition = StringInStr($sHtml, $sClosing, 0, 1)
        Local $iOpenCount = 1

        ; 2) find in the HTML the positions of the $sOpening <tag and $sClosing </tag> tags
        For $i = 1 To $iNrOfThisTag * 2 ;search all the opening and closing tags
            If ($iNextOpenPosition < $iNextClosePosition) And $iNextOpenPosition <> 0 Then
                $aThisTagsPositions[$i][0] = $iNextOpenPosition
                $aThisTagsPositions[$i][1] = $sOpening ; it marks which kind of tag is this
                $aThisTagsPositions[$i][2] = $iOpenCount; nr of this tag
                $iOpenCount += 1
                $iNextOpenPosition = StringInStr($sHtml, $sOpening, 0, 1, $aThisTagsPositions[$i][0] + 1)
            Else
                $aThisTagsPositions[$i][0] = $iNextClosePosition + StringLen($sClosing) - 1
                $aThisTagsPositions[$i][1] = $sClosing ; it marks which kind of tag is this
                $iNextClosePosition = StringInStr($sHtml, $sClosing, 0, 1, $aThisTagsPositions[$i][0] + 1)
            EndIf
        Next

 

1 person likes this

Share this post


Link to post
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!


Register a new account

Sign in

Already have an account? Sign in here.


Sign In Now

  • Similar Content

    • Ascer
      By Ascer
      1. Description.
      Udf working with MSDN System.Collections.ArrayList. Allow you to make fast operations on huge arrays, speed is even x10 better than basic _ArrayAdd.  Not prefered for small arrays < 600 items. 2. Requirements
      .NET Framework 1.1 - 4.5 (on this version Microsoft destroy old rules) System Windows 3. Possibilities.
      ;=============================================================================================================== ; UDF Name: List.au3 ; ; Date: 2018-02-17, 10:52 ; Description: Simple udf to create System Collections as ArrayList and make multiple actions on them. ; ; Function(s): _ListCreate -> Creates a new list ; _ListCapacity -> Gets a list size in bytes ; _ListCount -> Gets items count in list ; _ListIsFixedSize -> Get bool if list if fixed size ; _ListIsReadOnly -> Get bool if list is read only ; _ListIsSynchronized -> Get bool if list is synchronized ; _ListGetItem -> Get item on index ; _ListSetItem -> Set item on index ; ; _ListAdd -> Add item at end of list ; _ListClear -> Remove all list items ; _ListClone -> Duplicate list in new var ; _ListContains -> Get bool if item is in list ; _ListGetHashCode -> Get hash code for list ; _ListGetRange -> Get list with items between indexs ; _ListIndexOf -> Get index of item ; _ListInsert -> Insert a new item on index ; _ListInsertRange -> Insert list into list on index ; _ListLastIndexOf -> Get index last of item ; _ListRemove -> Remove first found item ; _ListRemoveAt -> Remove item in index ; _ListRemoveRange -> Remove items between indexs ; _ListReverse -> Reverse all items in list ; _ListSetRange -> Set new value for items in range ; _ListSort -> Sort items in list (speed of reading) ; _ListToString -> Get list object name ; _ListTrimToSize -> Remove unused space in list ; ; Author(s): Ascer ;=============================================================================================================== 4. Downloads
      List.au3 5. Examples
      SpeedTest _ArrayAdd vs ListAdd SpeedTest ArraySearch vs ListIndexOf Basic usage - crating guild with members  
    • ronmage
      By ronmage
      So I have a loop that keeps reading data from an array and searching it for the same value. If the value is no there it does work then adds the value to the array to prevent it from doing the same work.
      If _ArraySearch($ID,$filearray[$i]) = -1 Then Work.... _ArrayAdd($ID,$filearray[$i]) EndIf This is in a for loop hence $i
      So what is happening is the code works great for several hours. After a period of time _ArraySearch($ID,$filearray[$i]) will result in -1 even if $ID = $filearray. So it ready as if there is no data in the array. Anyone have this problem? 
       
      Also I am just running in using F5 not compiling it and running it if that makes a difference.
       
    • FMS
      By FMS
      Hello,
      I'm trying to read a binary file to an array but couln't get it to work.
      Also I coul not find any help in the forum around this subject whish was helpfull.
      Is there any way it could be done?
      I tried a lot of ways but maybe somebody know's the right way?
      #AutoIt3Wrapper_Au3Check_Parameters=-d -w 1 -w 2 -w 3 -w 4 -w 5 -w 6 -w 7 #include <File.au3> #include <Array.au3> #include <AutoItConstants.au3> Local $in=FileOpen("TEST_labels.idx1-ubyte",16) ; 16+0=Read binary Local $data = FileRead($in) Local $FileArray = BinaryToString($data,4) ;~ $FileArray = StringSplit($BinarydData, @CRLF, 1+2) ;~ Local $FileArray = StringRegExp($BinarydData, "[^\r\n]+", 3) FileClose($in) _ArrayDisplay($FileArray,"$FileArray","",32) MsgBox($MB_SYSTEMMODAL, "", "$FileArray = " & $FileArray )  
      TEST_labels.idx1-ubyte
    • ur
      By ur
      I am reading a CSV file which is tab seperated as below.
      Local $aArray = FileReadToArray($file) And now, I am splitting this main array record wise so that Array contains internally another arrow to represent each row.
      For $i = 0 to (UBound($aArray) - 1) ;MsgBox(0,"",$aArray[$i]) $aArray[$i] = StringSplit(StringStripCR($aArray[$i]), Chr(9),2);Chr(9) for tab ;_ArrayDisplay($aArray[$i]) Next Afther that, _ArrayDIsplay is able to see the individual internal arrays.
      _ArrayDisplay($aArray[1]) But If I try to access the individual element of it as below.It is not showing any result.
      MsgBox(0,"",$aArray[1][1]) Any suggestion, below is the sample csv file.
      New Text Document.csv