Chimp

Read data from html Tables from raw HTML source

3 posts in this topic




After seeing the function __HTML_Filter() in this topic by Stilgar (https://www.autoitscript.com/forum/topic/124330-_htmlau3-v101/) I thought I'd include that function also in this script.
the purpose of that function is to clean the extracted data from the table by those codes that are not visible in the browser but are visible as code "dirty" in the data when they are picked up from the table.

Updated the udf and the example script in first post.

To see the difference in the extracted data with or without the use of the HTML_Filter() function, just extract the table data from the example page by clicking on the "Preview array" button with the filter CheckBox "tags to entities" one time unchecked and then checked instead.


small minds discuss people average minds discuss events great minds discuss ideas.... and use AutoIt....

Share this post


Link to post
Share on other sites

Hi Chimp!
Thanks a lot for your example - it saved me a lot work!
I had to parse a table with almost 1400 rows (and lots of rowspans) in an 1.5MB HTML file, and got some performance issues. Here is how I solved them:
First, I adapted the HTML tag position search in _ParseTags to search starting on the last tag found position, so StringInStr doesn't need to count thousands of "<tr" tags every iteration. Then, _ArraySort failed (too many rows...). So, to get the tag list pre-sorted, I search for the first opening and first closing tag. If the opening is before the closing, write to $aThisTagsPositions and find the next opening; if the closing is before the next opening, write to $aThisTagsPositions and find the next closing.

This made it possible to read that huge HTML file in less than 90 seconds.

Just replace the code on lines 208-216 with this:

Local $iNextOpenPosition = StringInStr($sHtml, $sOpening, 0, 1)
        Local $iNextClosePosition = StringInStr($sHtml, $sClosing, 0, 1)
        Local $iOpenCount = 1

        ; 2) find in the HTML the positions of the $sOpening <tag and $sClosing </tag> tags
        For $i = 1 To $iNrOfThisTag * 2 ;search all the opening and closing tags
            If ($iNextOpenPosition < $iNextClosePosition) And $iNextOpenPosition <> 0 Then
                $aThisTagsPositions[$i][0] = $iNextOpenPosition
                $aThisTagsPositions[$i][1] = $sOpening ; it marks which kind of tag is this
                $aThisTagsPositions[$i][2] = $iOpenCount; nr of this tag
                $iOpenCount += 1
                $iNextOpenPosition = StringInStr($sHtml, $sOpening, 0, 1, $aThisTagsPositions[$i][0] + 1)
            Else
                $aThisTagsPositions[$i][0] = $iNextClosePosition + StringLen($sClosing) - 1
                $aThisTagsPositions[$i][1] = $sClosing ; it marks which kind of tag is this
                $iNextClosePosition = StringInStr($sHtml, $sClosing, 0, 1, $aThisTagsPositions[$i][0] + 1)
            EndIf
        Next

 

Share this post


Link to post
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!


Register a new account

Sign in

Already have an account? Sign in here.


Sign In Now

  • Similar Content

    • Dimmae
      By Dimmae
      Hello,
      at first: i'm new here, so please forgive me my mistakes, and show them to me, just that i can learn to do better in the future.
      Now to my Problem: i have an excel sheet, where i just need some columns for further actions, but i have no idea how to add single columns to a new array.
      I found the following code(the one i just added as a file) from 'water' in this forum, but i wont get how i could add multiple columns into a new array.
      The biggest problem in my situation is that i dont know the count of the rows i need for the array, i just got a fix number of rows, which is 4.
       
      Hope you can help me, and sry again for this 'unlucky illustration'.
       
      btw: how can i add code shown as code here, instead of posting it as a attached file?.
       
       
       
      autoit-select-column.au3
      defects.xlsx
    • islandspapand
      By islandspapand
      Hi all
      i am currently trying to click on an element in a HTML Table, but just can get it to work.
      i am able to click the top of the table so it changes to sort  but just can't click on the element in the table.
      an i need to click on element to continue in the site.
      i have attached the code so far and pictures of the table  element want to click plus the source of the table.
      i am able to get data in the table with $oTable = _IETableGetCollection($oIE, 2) but not able to click on them.
       
      Help is very much appreciated
       
      #cs ---------------------------------------------------------------------------- AutoIt Version: 3.3.14.2 Author: myName Script Function: Template AutoIt script. #ce ---------------------------------------------------------------------------- ; Script Start - Add your code below here #include <IE.au3> #include "DOM.au3" #include <Array.au3> #include <MsgBoxConstants.au3> Global $oIE = _IECreate("*") _IELoadWait($oIE) Sleep(2000) _PageLogin($oIE) _PageLoadWait() _PageNewReq($oIE) _PageLoadWait() _InputModelInf($oIE) _PageLoadWait() Sleep(1000) $aTableLink = BGe_IEGetDOMObjByXPathWithAttributes($oIE, "//table/tbody/tr/td[.='Name Of user']", 2000) ;~ $aTableLink = BGe_IEGetDOMObjByXPathWithAttributes($oIE, "//table/tbody/tr", 2000) ;~ _ArrayDisplay($aTableLink,"$aTableLink") If IsArray($aTableLink) Then ConsoleWrite("Able to BGe_IEGetDOMObjByXPathWithAttributes($oIE, //table/tbody/tr/td[.='Name Of user'])" & @CRLF) For $i = 0 To UBound($aTableLink)-1 ConsoleWrite(" OuterHTML : " & $aTableLink[$i].outerHTML & @CRLF) ConsoleWrite(" Parentnode : " & $aTableLink[$i].parentnode & @CRLF) ConsoleWrite(" Parentnode.click : " & $aTableLink[$i].parentnode.fireEvent("onclick","click") & @CRLF) $objClick = $aTableLink[$i].parentnode ;~ _IEAction($aTableLink[$i] , "focus") _IEAction($objClick , "focus") ;~ If _IEAction($aTableLink[$i], "click") Then If _IEAction($objClick, "click") Then ConsoleWrite("Able to _IEAction($aForumLink[0], 'click')" & @CRLF) _IELoadWait($oIE) Else ConsoleWrite("UNable to _IEAction($aForumLink[0], 'click')" & @CRLF) Exit 3 EndIf Next Else ConsoleWrite("Unable to BGe_IEGetDOMObjByXPathWithAttributes($oIE, //table/tbody/tr/td[.='Name Of user'])" & @CRLF) Exit 2 EndIf _PageLoadWait() Func _InputModelInf($oTmpIE) ; Add Var for Model & Serial in Func $oModelInput = _IEGetObjById($oTmpIE,"model") _IEAction($oModelInput,"focus") _IEDocInsertText($oModelInput, "*") $oSerialInput = _IEGetObjById($oTmpIE,"serial") _IEAction($oModelInput,"focus") _IEDocInsertText($oSerialInput, "*") $links = $oTmpIE.document.getElementsByClassName("btn btn-primary ng-scope") For $link In $links If $link.innertext = "Søg" Or $link.innertext = "Search" Then $link.click() ExitLoop EndIf Next Return True EndFunc Func _PageNewReq($oTmpIE) $links = $oTmpIE.document.getElementsByClassName("ng-scope k-link") For $link In $links If $link.innertext = "Send ny fejlmelding" Or $link.innertext = "Submit a New Service Request" Then $link.click() ExitLoop EndIf Next Return True EndFunc Func _PageLogin($oTmpIE) $oUserInput = _IEGetObjById($oTmpIE,"loginid") _IEDocInsertText($oUserInput, "*") $oPasswordInput = _IEGetObjById($oTmpIE,"password") _IEDocInsertText($oPasswordInput, "*") $links = $oTmpIE.document.getElementsByClassName("btn btn-primary login ng-scope") For $link In $links If $link.innertext = "Sign in" Then $link.click() ExitLoop EndIf Next Return True EndFunc Func _PageLoadWait() Local $PageLoadWait = False ;~ nav navbar-nav navbar-right ng-hide ;~ nav navbar-nav navbar-right $tags = $oIE.document.GetElementsByTagName("ul") For $tag in $tags $class_value = $tag.GetAttribute("class") If $class_value = "nav navbar-nav navbar-right" Then ConsoleWrite('@@ Debug(' & @ScriptLineNumber & ') : Webpage loading :) ' & @CRLF) ;### Debug Console $PageLoadWait = True ExitLoop EndIf Next Do sleep(250) For $tag in $tags $class_value = $tag.GetAttribute("class") If $class_value = "nav navbar-nav navbar-right ng-hide" Then ConsoleWrite('@@ Debug(' & @ScriptLineNumber & ') : Webpage load finished :)'& @CRLF) ;### Debug Console $PageLoadWait = False ExitLoop EndIf Next Until $PageLoadWait = False EndFunc  
      Thanks in advance
       
       


    • Gowrisankar
      By Gowrisankar
      Hello everyone,
      I'm trying to pass values to elements in a website.
      The elements are present within a table, which is again present within a table, which is inside a form.
      I tried to read the form, tables, etc., but with no results. It appears to me that the elements, tables, form, etc., were not read at all.
      The following is what I tried. Please guide me.
      ;I tried the following to read the tables into arrays #include <IE.au3> #include <MsgBoxConstants.au3> Local $oIE = _IECreate() _IENavigate($oIE, "---- URL HERE ----") _IELoadWait($oIE) $o_Table = _IETableGetCollection ($oIE) $i_NumTables = @extended For $i = 0 To $i_NumTables - 1 Step 1 $o_Table_Temp2 = _IETableGetCollection ($oIE, $i) $a_TableData = _IETableWriteToArray ($o_Table_Temp2) _ArrayDisplay($a_TableData) Next  
      ;I tried the following code to pass value to the field #include <IE.au3> #include <MsgBoxConstants.au3> Local $oIE = _IECreate() _IENavigate($oIE, "---- URL HERE ----", 0) _IELoadWait($oIE) Local $oForm = _IEFormGetObjByName($oIE, "default") Local $oField = _IEFormElementGetObjByName($oForm, "tGroup") _IEFormElementSetValue($oField, "---- VALUE HERE ----")  
      The following is the html view of the website and the highlighted field is the one that I want to pass values to. Since this is an official website, I can't share the exact url.

    • Spask
      By Spask
      Hi, I'm trying to find a text value inside of a html.
      This is what the line looks like normally:
      <p id="line1" class> <span class="bot">TEXT HERE</span> </p> The text then changes to a non breaking space:
      <p id="line1" class> <span class="bot">&nbsp;</span> </p> And then it changes back to normal text but it's different every time.
      Can I code this so that it grabs the text every time it changes and has a variable that represents it?
      I currently have this inside of my loop:
      $span = .document.getElementsByTagName("span") For $text In $span If $text.value = "&nbsp;" Then Sleep(50) MsgBox(0,0,0) ;messagebox to test if it can be found, but I don't know how to grab the text EndIf Next The problem is that there are many other lines in the html that have the same span but are called "line3", "line5", etc and the one I need is from "line1".
      I will appreciate if anyone can help with this!