Chimp

Read data from html Tables from raw HTML source


After seeing the function __HTML_Filter() in this topic by Stilgar (https://www.autoitscript.com/forum/topic/124330-_htmlau3-v101/), I decided to include that function in this script as well.
Its purpose is to clean the data extracted from the table by removing the codes that are not visible in the browser but show up as "dirty" markup in the data when it is picked up from the table.

I updated the UDF and the example script in the first post.

To see the difference in the extracted data with and without the HTML_Filter() function, extract the table data from the example page by clicking the "Preview array" button, once with the "tags to entities" filter checkbox unchecked and once with it checked.
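To give a rough idea of what this kind of cleanup involves, here is a minimal sketch. It is not the filter used in the UDF or in Stilgar's script; the function name _Sketch_CleanCell and the short entity list are assumptions made up for illustration only.

```autoit
; Hypothetical helper for illustration, NOT the UDF's actual filter:
; it strips tags and decodes a handful of common entities.
Func _Sketch_CleanCell($sCell)
    ; remove anything that looks like an HTML tag
    Local $sText = StringRegExpReplace($sCell, "(?s)<[^>]*>", "")
    ; decode a few entities (&amp; goes last to avoid double decoding)
    $sText = StringReplace($sText, "&nbsp;", " ")
    $sText = StringReplace($sText, "&lt;", "<")
    $sText = StringReplace($sText, "&gt;", ">")
    $sText = StringReplace($sText, "&quot;", '"')
    $sText = StringReplace($sText, "&amp;", "&")
    ; flag 3 = strip leading (1) and trailing (2) whitespace
    Return StringStripWS($sText, 3)
EndFunc   ;==>_Sketch_CleanCell

; a "dirty" cell as it might come out of the raw HTML
ConsoleWrite(_Sketch_CleanCell('<b>Tom &amp; Jerry</b>&nbsp;<br>') & @CRLF)
```

A real filter would need to cover many more entities and edge cases (numeric entities, comments, scripts), so treat this only as a picture of the before/after difference.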


small minds discuss people average minds discuss events great minds discuss ideas.... and use AutoIt....


Hi Chimp!
Thanks a lot for your example, it saved me a lot of work!
I had to parse a table with almost 1400 rows (and lots of rowspans) in a 1.5 MB HTML file, and ran into some performance issues. Here is how I solved them:
First, I adapted the HTML tag position search in _ParseTags to start from the position of the last tag found, so StringInStr doesn't need to count thousands of "<tr" tags on every iteration. Then _ArraySort failed (too many rows...). So, to get the tag list pre-sorted, I search for the first opening and the first closing tag. If the opening tag comes before the closing tag, I write it to $aThisTagsPositions and find the next opening tag; if the closing tag comes before the next opening tag, I write it to $aThisTagsPositions and find the next closing tag.

This made it possible to read that huge HTML file in less than 90 seconds.

Just replace the code on lines 208-216 with this:

        Local $iNextOpenPosition = StringInStr($sHtml, $sOpening, 0, 1)
        Local $iNextClosePosition = StringInStr($sHtml, $sClosing, 0, 1)
        Local $iOpenCount = 1

        ; 2) find in the HTML the positions of the $sOpening <tag and $sClosing </tag> tags
        For $i = 1 To $iNrOfThisTag * 2 ;search all the opening and closing tags
            If ($iNextOpenPosition < $iNextClosePosition) And $iNextOpenPosition <> 0 Then
                $aThisTagsPositions[$i][0] = $iNextOpenPosition
                $aThisTagsPositions[$i][1] = $sOpening ; it marks which kind of tag is this
                $aThisTagsPositions[$i][2] = $iOpenCount; nr of this tag
                $iOpenCount += 1
                $iNextOpenPosition = StringInStr($sHtml, $sOpening, 0, 1, $aThisTagsPositions[$i][0] + 1)
            Else
                $aThisTagsPositions[$i][0] = $iNextClosePosition + StringLen($sClosing) - 1
                $aThisTagsPositions[$i][1] = $sClosing ; it marks which kind of tag is this
                $iNextClosePosition = StringInStr($sHtml, $sClosing, 0, 1, $aThisTagsPositions[$i][0] + 1)
            EndIf
        Next
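
For anyone who wants to try the idea outside the UDF, here is a self-contained sketch of the same merge-style scan on a tiny HTML string. The sample string, the array name $aPos and the extra guard for unbalanced markup are my own assumptions and are not part of the original script.

```autoit
; Sketch only: collect the positions of "<tr" and "</tr>" in document order,
; without counting occurrences from the start and without sorting afterwards.
Local $sHtml = "<table><tr><td>a</td></tr><tr><td>b</td></tr></table>"
Local $sOpening = "<tr", $sClosing = "</tr>"

; count the opening tags once (StringReplace reports the count in @extended)
StringReplace($sHtml, $sOpening, $sOpening)
Local $iNrOfThisTag = @extended

Local $aPos[$iNrOfThisTag * 2 + 1][2] ; row 0 unused, to mirror the 1-based loop
Local $iOpen = StringInStr($sHtml, $sOpening)
Local $iClose = StringInStr($sHtml, $sClosing)

For $i = 1 To $iNrOfThisTag * 2
    If $iOpen <> 0 And ($iClose = 0 Or $iOpen < $iClose) Then
        ; the next tag in the document is an opening tag
        $aPos[$i][0] = $iOpen
        $aPos[$i][1] = $sOpening
        $iOpen = StringInStr($sHtml, $sOpening, 0, 1, $iOpen + 1)
    ElseIf $iClose <> 0 Then
        ; the next tag is a closing tag: store the position of its last character
        $aPos[$i][0] = $iClose + StringLen($sClosing) - 1
        $aPos[$i][1] = $sClosing
        $iClose = StringInStr($sHtml, $sClosing, 0, 1, $aPos[$i][0] + 1)
    Else
        ExitLoop ; assumption: unbalanced markup, nothing left to record
    EndIf
Next

; print "position tag" for every entry found
For $i = 1 To $iNrOfThisTag * 2
    ConsoleWrite($aPos[$i][0] & " " & $aPos[$i][1] & @CRLF)
Next
```

Because each search resumes from the previous hit, the string is walked only once per tag type, and the entries come out already in ascending position order.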

 

