Chimp

Read data from HTML tables from raw HTML source


After seeing the function __HTML_Filter() in this topic by Stilgar (https://www.autoitscript.com/forum/topic/124330-_htmlau3-v101/), I thought I'd include that function in this script as well.
Its purpose is to clean the data extracted from the table by removing the markup that is not visible in the browser but shows up as "dirty" code in the data once it is picked up from the table.

I have updated the UDF and the example script in the first post.

To see the difference between data extracted with and without the HTML_Filter() function, extract the table data from the example page by clicking the "Preview array" button twice: once with the "tags to entities" filter checkbox unchecked, and once with it checked.
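
To give a rough idea of what this kind of cleanup involves, here is a minimal sketch. It is not the __HTML_Filter() function from Stilgar's UDF, and _Example_CleanCell is a hypothetical name made up for this illustration. It only assumes that cleaning means stripping tags and decoding a handful of common entities; the real filter may well do more or work differently.

```autoit
; Hypothetical helper for illustration only (not part of any UDF).
Func _Example_CleanCell($sCell)
    ; remove anything that looks like an HTML tag, including tags spanning several lines
    Local $sText = StringRegExpReplace($sCell, "(?s)<[^>]*>", "")
    ; decode a few common entities (a real filter would handle many more)
    $sText = StringReplace($sText, "&nbsp;", " ")
    $sText = StringReplace($sText, "&lt;", "<")
    $sText = StringReplace($sText, "&gt;", ">")
    $sText = StringReplace($sText, "&quot;", '"')
    ; decode "&amp;" last, so that "&amp;lt;" ends up as "&lt;" and not "<"
    $sText = StringReplace($sText, "&amp;", "&")
    ; flag 3 = strip leading (1) and trailing (2) white space
    Return StringStripWS($sText, 3)
EndFunc   ;==>_Example_CleanCell

; sample cell content, invented for this example
Local $sRaw = '<b>Tom &amp; Jerry</b>&nbsp;<a href="#">link</a>'
ConsoleWrite("raw:      " & $sRaw & @CRLF)
ConsoleWrite("filtered: " & _Example_CleanCell($sRaw) & @CRLF)
```

Running it from SciTE shows the raw and the cleaned cell side by side in the console, which is the same comparison the checkbox in the example script lets you make.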


small minds discuss people average minds discuss events great minds discuss ideas.... and use AutoIt....


Hi Chimp!
Thanks a lot for your example - it saved me a lot of work!
I had to parse a table with almost 1400 rows (and lots of rowspans) in a 1.5 MB HTML file, and ran into some performance issues. Here is how I solved them:
First, I changed the tag position search in _ParseTags so that each search starts at the position of the last tag found. That way StringInStr doesn't need to count through thousands of "<tr" tags on every iteration. Then _ArraySort failed (too many rows...). So, to get the tag list already sorted, I search for the first opening tag and the first closing tag. If the opening tag comes before the closing tag, I write it to $aThisTagsPositions and find the next opening tag; if the closing tag comes before the next opening tag, I write it to $aThisTagsPositions and find the next closing tag.

This made it possible to read that huge HTML file in less than 90 seconds.

Just replace the code on lines 208-216 with this:

        Local $iNextOpenPosition = StringInStr($sHtml, $sOpening, 0, 1)
        Local $iNextClosePosition = StringInStr($sHtml, $sClosing, 0, 1)
        Local $iOpenCount = 1

        ; 2) find in the HTML the positions of the $sOpening <tag and $sClosing </tag> tags
        For $i = 1 To $iNrOfThisTag * 2 ;search all the opening and closing tags
            If ($iNextOpenPosition < $iNextClosePosition) And $iNextOpenPosition <> 0 Then
                $aThisTagsPositions[$i][0] = $iNextOpenPosition
                $aThisTagsPositions[$i][1] = $sOpening ; it marks which kind of tag is this
                $aThisTagsPositions[$i][2] = $iOpenCount; nr of this tag
                $iOpenCount += 1
                $iNextOpenPosition = StringInStr($sHtml, $sOpening, 0, 1, $aThisTagsPositions[$i][0] + 1)
            Else
                $aThisTagsPositions[$i][0] = $iNextClosePosition + StringLen($sClosing) - 1
                $aThisTagsPositions[$i][1] = $sClosing ; it marks which kind of tag is this
                $iNextClosePosition = StringInStr($sHtml, $sClosing, 0, 1, $aThisTagsPositions[$i][0] + 1)
            EndIf
        Next
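
For anyone who wants to try the idea outside the UDF, here is a minimal standalone sketch of the same merge on a tiny HTML string. The sample string, the $aPos array and the way the tags are counted are assumptions made for this illustration; in the UDF, $iNrOfThisTag and $aThisTagsPositions are already set up before this point, and the third column (tag number) is left out here.

```autoit
; Tiny sample input, invented for this example
Local $sHtml = "<tr><td>a</td></tr><tr><td>b</td></tr>"
Local $sOpening = "<tr", $sClosing = "</tr>"

; Count the opening tags: StringReplace reports the number of replacements in @extended
StringReplace($sHtml, $sOpening, $sOpening)
Local $iNrOfThisTag = @extended

; Row 0 is left unused to keep the 1-based indexing of the snippet above
Local $aPos[$iNrOfThisTag * 2 + 1][2]
Local $iNextOpen = StringInStr($sHtml, $sOpening)
Local $iNextClose = StringInStr($sHtml, $sClosing)

For $i = 1 To $iNrOfThisTag * 2
    If $iNextOpen <> 0 And $iNextOpen < $iNextClose Then
        ; the next tag in document order is an opening tag
        $aPos[$i][0] = $iNextOpen
        $aPos[$i][1] = $sOpening
        ; continue searching just after this tag instead of from the start
        $iNextOpen = StringInStr($sHtml, $sOpening, 0, 1, $iNextOpen + 1)
    Else
        ; the next tag is a closing tag: store the position of its last character
        $aPos[$i][0] = $iNextClose + StringLen($sClosing) - 1
        $aPos[$i][1] = $sClosing
        $iNextClose = StringInStr($sHtml, $sClosing, 0, 1, $aPos[$i][0] + 1)
    EndIf
    ConsoleWrite($aPos[$i][0] & @TAB & $aPos[$i][1] & @CRLF)
Next
```

For this sample the positions come out as 1, 19, 20 and 38, already in ascending order, so no _ArraySort call is needed afterwards. Note that, like the snippet above, this assumes well-formed HTML where every opening tag has a matching closing tag.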

 



  • Similar Content

    • By matwachich
      Hi guys!
      A pretty simple UDF to convert HTML to PDF using wkHTMLtoPDF.
      It uses the C API of the tool (DLL), so no external process, no ActiveX or COM sh*t.
      See the example, and the documentation of wkHTMLtoPDF.
      Cheers
      https://github.com/matwachich/wkhtmltopdf-au3
    • By nacerbaaziz
      Good morning, everybody.
      Today I'd like to share a small example with you:
      a function that reads the values of a registry key into an array.
      The result is a 2D array where
      $a_array[n][0] = value name
      $a_array[n][1] = value data
      $a_array[0][0] = number of values
      Here's the function:

      #include <Array.au3>
      #include <WinAPIReg.au3>
      #include <APIRegConstants.au3>

      Local $a_array = _RegReadToArray("HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\CurrentVersion\Run")
      If @error Then
          MsgBox(16, "error", @error)
          Exit
      EndIf
      _ArrayDisplay($a_array)

      Func _RegReadToArray($s_RegKey)
          Local $a_KeySplitInfo = StringSplit($s_RegKey, "\\", 2)
          If UBound($a_KeySplitInfo) <= 1 Then
              $a_KeySplitInfo = StringSplit($s_RegKey, "\", 2)
              If UBound($a_KeySplitInfo) <= 1 Then Return SetError(1, 1, 0) ; was "Return (1, 1, 0)", which is not valid syntax
          EndIf
          Local $H_KeyInfo = "", $s_RegKeyInfo = ""
          Switch $a_KeySplitInfo[0]
              Case "hklm", "HKEY_LOCAL_MACHINE", "hklm64", "HKEY_LOCAL_MACHINE64"
                  $H_KeyInfo = $HKEY_LOCAL_MACHINE
              Case "hkCu", "HKEY_CURRENT_USER", "hkCU64", "HKEY_CURRENT_USER64"
                  $H_KeyInfo = $HKEY_CURRENT_USER
              Case "hkCr", "HKEY_CLASSES_ROOT", "HKCR64", "HKEY_CLASSES_ROOT64"
                  $H_KeyInfo = $HKEY_CLASSES_ROOT
              Case "HKU", "HKEY_USERS", "HKU64", "HKEY_USERS64"
                  $H_KeyInfo = $HKEY_USERS
              Case Else
                  Return SetError(2, 2, 0)
          EndSwitch
          _ArrayDelete($a_KeySplitInfo, 0)
          $s_RegKeyInfo = _ArrayToString($a_KeySplitInfo, "\")
          Local $H_KeyInfoOpen = _WinAPI_RegOpenKey($H_KeyInfo, $s_RegKeyInfo, $KEY_READ)
          Local $A_KeyInfo = _WinAPI_RegQueryInfoKey($H_KeyInfoOpen)
          If @error Then Return SetError(1, 1, 0)
          _WinAPI_RegCloseKey($H_KeyInfoOpen)
          Local $A_RegVal[$A_KeyInfo[2] + 1][2]
          Local $iV = 1, $s_RegRead = ""
          While 1
              $s_RegVal = RegEnumVal($s_RegKey, $iV)
              If @error <> 0 Then ExitLoop
              $s_RegRead = RegRead($s_RegKey, $s_RegVal)
              If Not (@error) Then
                  $A_RegVal[$iV][0] = $s_RegVal
                  $A_RegVal[$iV][1] = $s_RegRead
              EndIf
              $iV += 1
          WEnd
          $A_RegVal[0][0] = UBound($A_RegVal) - 1
          If $A_RegVal[0][0] >= 1 Then
              Return $A_RegVal
          Else
              Return SetError(3, 3, 0)
          EndIf
      EndFunc   ;==>_RegReadToArray
      I hope you find it useful.
      Best regards
    • By JackER4565
      Hi, first of all, thanks to all the guys who always help people in the forums. I wouldn't be able to do anything if it weren't for your help, even when I don't ask for it myself.
       
      I've created this code to get some info from a network monitoring system at my work. It relies on _IETableGetCollection and _IETableWriteToArray.
      It works well, but it takes around 3 minutes 25 seconds to get the info from 28 pages (some of them are large and take longer to load, but most of them are small and fast).
      My question is whether you see a way to make the program run faster...
       
      I've tried to make it easy for you to understand, and I edited out some things containing sensitive info.
      (Some of the pages don't have the black divider with MIRA at the end, so I need to check whether it is there or not.)
       
      #include <IE.au3>
      #include <array.au3>

      Local $oIE = _IECreate("about:blank", 0, 0)
      Local $paginas[28] = [89, 90, 91, 92, 93, 96, 105, 113, 119, 125, 126, 129, 131, 133, 135, 137, 139, 140, 141, 144, 145, 146, 148, 149, 150, 151, 158, 159]
      Local $Datos_array[0][2]
      Local $oTable
      Local $tabla
      Local $aux_x = 1
      Local $ar = 1
      Local $Numtables_datos = 0
      MsgBox(0, "asd", "asd")

      For $pag = 0 To UBound($paginas) - 1 Step 1
          _IENavigate($oIE, "<WEBSITE URL>" & $paginas[$pag]) ; <<< the pages to load are always the same except for the last digits.
          _ArrayAdd($Datos_array, $paginas[$pag] & "|" & "Entrante", 0, "|") ; <<<<<<<<<<<<<<<< adds the page number to array [0, 0]
          ;############################################ START counts amount of tables with traffic
          $oTable = _IETableGetCollection($oIE)
          Local $iNumTables = @extended
          For $i = 3 To $iNumTables - 2 Step 1
              $oTable = _IETableGetCollection($oIE, $i)
              $nomb_tabla2 = _IETableWriteToArray($oTable) ; <<<<<<<< TABLE TO ARRAY.
              $string2 = StringStripWS($nomb_tabla2[1][0], 8)
              If $string2 <> "MIRA" Then $Numtables_datos = $Numtables_datos + 1
          Next
          $tabla_End = $iNumTables - $Numtables_datos
          ;############################################ END
          $tabla_Start = 4
          $tabla_trafico = 2
          For $for = 1 To $Numtables_datos Step 1
              $oTable = _IETableGetCollection($oIE, $tabla_Start - 1) ; <<<<<<<<<<< NAME OF THE TABLE; row2 = mira
              $nomb_tabla = _IETableWriteToArray($oTable) ; <<<<<<<< TABLE TO ARRAY
              ;########################################### ADDS the traffic number into the row
              $string = StringStripWS($nomb_tabla[1][0], 8)
              If $string == "MIRA" Then ; always passes through here once
                  _ArrayAdd($Datos_array, $nomb_tabla[0][0])
                  $nomb_aux = $nomb_tabla[0][0]
                  $aux_x = 1
                  $tabla_trafico = $tabla_trafico + 2
              Else ; this should be per row
                  _ArrayAdd($Datos_array, $nomb_aux & " " & $aux_x)
                  $aux_x = $aux_x + 1
                  $tabla_trafico = $tabla_trafico + 1
              EndIf
              $oTable = _IETableGetCollection($oIE, $tabla_trafico)
              Local $aTableData = _IETableWriteToArray($oTable)
              $bps = _ArrayToString($aTableData, "|", 0, 0, @CRLF, 0, 0)
              $bps = StringRight($bps, 5)
              $bps = StringLeft($bps, 4)
              $trafico_actual = _ArrayToString($aTableData, "|", 0, 0, @CRLF, 2, 2)
              If $bps == "Gbps" Then $trafico_actual = $trafico_actual * 1000
              If $bps == "Kbps" Then $trafico_actual = $trafico_actual / 1000
              $Datos_array[$ar][1] = $trafico_actual
              $ar = $ar + 1
              If $string == "MIRA" Then
                  $tabla_Start = $tabla_Start + 2
              Else
                  $tabla_Start = $tabla_Start + 1
              EndIf
          Next
          $ar = $ar + 1
          ;~ ############# TRAFFIC DOWN ############
          ;~ If $actual_entrante = 0 Then
          ;~ $xxx = 0
          ;~ Do
          ;~ MsgBox(0, "Tráfico Caído", $paginas[$i], 5)
          ;~ $xxx = $xxx + 1
          ;~ Until $xxx = 10
          ;~ EndIf
          ;~ ############# TRAFFIC DOWN ############.
          Local $Numtables_datos = 0
      Next
      _ArrayDisplay($Datos_array, "Array display")
      _IEQuit($oIE)

      Thanks!!


      monitoria.html
    • By Colduction
      Hello AutoIt Scriptwriters!
      I want to read an HTTPS-based site whose address is: Soft98 (https://soft98.ir/)
      I've tried "_INetGetSource", "BinaryToString(InetRead)" and "InetRead", but none of them helped.
       
      How can I get this site's HTML source code without opening an IE window?
       