Jump to content
Sign in to follow this  
Chimp

Read data from html Tables from raw HTML source

Recommended Posts

After seeing the function __HTML_Filter() in this topic by Stilgar (https://www.autoitscript.com/forum/topic/124330-_htmlau3-v101/) I thought I'd include that function also in this script.
the purpose of that function is to clean the extracted data from the table by those codes that are not visible in the browser but are visible as code "dirty" in the data when they are picked up from the table.

Updated the udf and the example script in first post.

To see the difference in the extracted data with or without the use of the HTML_Filter() function, just extract the table data from the example page by clicking on the "Preview array" button with the filter CheckBox "tags to entities" one time unchecked and then checked instead.


small minds discuss people average minds discuss events great minds discuss ideas.... and use AutoIt....

Share this post


Link to post
Share on other sites

Hi Chimp!
Thanks a lot for your example - it saved me a lot work!
I had to parse a table with almost 1400 rows (and lots of rowspans) in an 1.5MB HTML file, and got some performance issues. Here is how I solved them:
First, I adapted the HTML tag position search in _ParseTags to search starting on the last tag found position, so StringInStr doesn't need to count thousands of "<tr" tags every iteration. Then, _ArraySort failed (too many rows...). So, to get the tag list pre-sorted, I search for the first opening and first closing tag. If the opening is before the closing, write to $aThisTagsPositions and find the next opening; if the closing is before the next opening, write to $aThisTagsPositions and find the next closing.

This made it possible to read that huge HTML file in less than 90 seconds.

Just replace the code on lines 208-216 with this:

Local $iNextOpenPosition = StringInStr($sHtml, $sOpening, 0, 1)
        Local $iNextClosePosition = StringInStr($sHtml, $sClosing, 0, 1)
        Local $iOpenCount = 1

        ; 2) find in the HTML the positions of the $sOpening <tag and $sClosing </tag> tags
        For $i = 1 To $iNrOfThisTag * 2 ;search all the opening and closing tags
            If ($iNextOpenPosition < $iNextClosePosition) And $iNextOpenPosition <> 0 Then
                $aThisTagsPositions[$i][0] = $iNextOpenPosition
                $aThisTagsPositions[$i][1] = $sOpening ; it marks which kind of tag is this
                $aThisTagsPositions[$i][2] = $iOpenCount; nr of this tag
                $iOpenCount += 1
                $iNextOpenPosition = StringInStr($sHtml, $sOpening, 0, 1, $aThisTagsPositions[$i][0] + 1)
            Else
                $aThisTagsPositions[$i][0] = $iNextClosePosition + StringLen($sClosing) - 1
                $aThisTagsPositions[$i][1] = $sClosing ; it marks which kind of tag is this
                $iNextClosePosition = StringInStr($sHtml, $sClosing, 0, 1, $aThisTagsPositions[$i][0] + 1)
            EndIf
        Next

 

Share this post


Link to post
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now
Sign in to follow this  

  • Recently Browsing   0 members

    No registered users viewing this page.

  • Similar Content

    • By Colduction
      Hi dear friends!, i'm sorry for creating a new thread (a new problem), i have over than 9 lists that i want to combine them to be this (in this example, there are 3 test files):


      I've written a little code for splitting main information, but i really confused how to make results as "Output.txt", here is that code:
       
      $sRegex_1 = StringRegExp(FileRead("1.txt"), '(?s:(?<=\=\=\r\n)(.*?)(?=\r\n\=\=))', 3) $sRegex_2 = StringRegExp(FileRead("2.txt"), '(?s:(?<=\=\=\r\n)(.*?)(?=\r\n\=\=))', 3) $sRegex_3 = StringRegExp(FileRead("3.txt"), '(?s:(?<=\=\=\r\n)(.*?)(?=\r\n\=\=))', 3) For $i = 0 To UBound($sRegex_1) - 1 ConsoleWrite($sRegex_1[$i] & @CRLF) For $j = 0 To UBound($sRegex_2) - 1 ConsoleWrite($sRegex_2[$j] & @CRLF) For $k = 0 To UBound($sRegex_3) - 1 ConsoleWrite($sRegex_3[$k] & @CRLF) Next Next Next  
    • By Colduction
      Hi guys, i'm using Telegram UDF by @LinkOut from github with latest update, but this UDF has not any parse_mode ability in SendDocument and other send file's functions to make texts bold, italic or underline and i can't send Emojis via these functions too. i've tried to change HTML section of multipart/form-data but i did not get correct results.

      For example, i can't get correct results by sending a document with this URL Encoded caption: %F0%9F%93%84%20*Test*%20%F0%9F%93%84

      I will be happy to help me in this section. Thanks!
    • By nacerbaaziz
      hello evrybody
      here is an example about how to split your texts using a delimiter with the ability to select how much of delimiters shows in each colum  with $i_number
      e.g
      you have a long text and you want to split it in an array
      that evry colum have a number (n) of lines
      i made a function that do that for you
      just call it with a three params
      $s_text
      your text
      $i_number
      the number that you want to put in each col
      $s_siparator
      the siparator
      default is "|"
      here is the function with example
      i hope that it will be useful for you
       
      ****
       
      #include <Array.au3> $s_txt = "some text1some text2|some text3|some text4|some text5|some text6" $array = splitText($s_txt, 2) _ArrayDisplay($array) Func splitText($s_text, $i_number, $s_siparator = "|") Local $a_TXT = StringSplit($s_text, $s_siparator) Local $a_Return[$a_TXT[0] + 1] If ($a_TXT[0] <= $i_number) Or ($i_number <= 0) Then ReDim $a_Return[2] $a_Return[0] = 1 $a_Return[1] = $s_text Return $a_Return EndIf Local $i_Processed = 1, $i_arrayProcessed = 1 Do For $i = $i_Processed To ($i_Processed + $i_number) - 1 If ($a_TXT[0] < $i) Then ExitLoop If Not ($a_Return[$i_arrayProcessed]) Then $a_Return[$i_arrayProcessed] = $a_TXT[$i] Else $a_Return[$i_arrayProcessed] &= $s_siparator & $a_TXT[$i] EndIf $i_Processed += 1 Next $i_arrayProcessed += 1 Until ($a_TXT[0] < $i_Processed) ReDim $a_Return[$i_arrayProcessed] $a_Return[0] = $i_arrayProcessed - 1 Return $a_Return EndFunc ;==>splitText
      accept my greetings
      thanks to
      @Dan_555
      for his notes
       
    • By MesterPerfect
      good morning
      this is the first post here in the autoit forums
      i hope that you can help me in my problem
      i have a JSON encoded
      it a map of my forums
      where i want to make a treeview that have the same type of map
      e.g
      a system (as category)
      windows (as sub category)
      software (as an child item in the windows category)
      .....
      i don't know how to do that
      so, i know that i can do that using the json functions
      but i need your help about how we can do it as the type that i told you
      by the way i need to put the sub info for each item in an array that give me the ability to manage my items
      e.g
      can post thread
      can reply
      message cound ...
      you just give me a small example and i can continue.
      am sorry if this against the rules of the forum.
      but i realy searched a lot but i couldn't
      i hope some one give me the way.
      thank you very much in advance
       
      here is the link of json forum
      https://www.autoitscript.com/forum/topic/148114-a-non-strict-json-udf-jsmn/
      and here is my encoded json file
       
      { "tree_map": { "0": [ 1, 5, 6, 7 ], "1": [ 2 ], "2": [ 4 ], "5": [ 3 ], "6": [ 8 ], "8": [ 9, 10 ] }, "nodes": [ { "breadcrumbs": [], "description": "", "display_in_list": true, "display_order": 1, "node_id": 1, "node_name": null, "node_type_id": "Category", "parent_node_id": 0, "title": "Main category", "type_data": {} }, { "breadcrumbs": [ { "node_id": 1, "title": "Main category", "node_type_id": "Category" } ], "description": "", "display_in_list": true, "display_order": 1, "node_id": 2, "node_name": null, "node_type_id": "Forum", "parent_node_id": 1, "title": "Main forum", "type_data": { "allow_poll": true, "allow_posting": true, "can_create_thread": true, "can_upload_attachment": true, "discussion_count": 0, "last_post_date": 0, "last_post_id": 0, "last_post_username": "", "last_thread_id": 0, "last_thread_prefix_id": 0, "last_thread_title": "", "message_count": 0, "min_tags": 0, "require_prefix": false } }, { "breadcrumbs": [ { "node_id": 1, "title": "Main category", "node_type_id": "Category" }, { "node_id": 2, "title": "Main forum", "node_type_id": "Forum" } ], "description": "", "display_in_list": true, "display_order": 1, "node_id": 4, "node_name": null, "node_type_id": "Forum", "parent_node_id": 2, "title": "my forums1", "type_data": { "allow_poll": true, "allow_posting": true, "can_create_thread": true, "can_upload_attachment": true, "discussion_count": 0, "last_post_date": 0, "last_post_id": 0, "last_post_username": "", "last_thread_id": 0, "last_thread_prefix_id": 0, "last_thread_title": "", "message_count": 0, "min_tags": 0, "require_prefix": false } }, { "breadcrumbs": [], "description": "", "display_in_list": true, "display_order": 2, "node_id": 5, "node_name": null, "node_type_id": "Category", "parent_node_id": 0, "title": "Perfect", "type_data": {} }, { "breadcrumbs": [ { "node_id": 5, "title": "Perfect", "node_type_id": "Category" } ], "description": "", "display_in_list": true, "display_order": 2, "node_id": 3, "node_name": null, "node_type_id": "Forum", "parent_node_id": 5, "title": "ahmed", "type_data": { "allow_poll": true, "allow_posting": true, "can_create_thread": true, "can_upload_attachment": true, "discussion_count": 0, "last_post_date": 0, "last_post_id": 0, "last_post_username": "", "last_thread_id": 0, "last_thread_prefix_id": 0, "last_thread_title": "", "message_count": 0, "min_tags": 0, "require_prefix": false } }, { "breadcrumbs": [], "description": "", "display_in_list": true, "display_order": 3, "node_id": 6, "node_name": null, "node_type_id": "Forum", "parent_node_id": 0, "title": "autoit", "type_data": { "allow_poll": true, "allow_posting": true, "can_create_thread": true, "can_upload_attachment": true, "discussion_count": 0, "last_post_date": 0, "last_post_id": 0, "last_post_username": "", "last_thread_id": 0, "last_thread_prefix_id": 0, "last_thread_title": "", "message_count": 0, "min_tags": 0, "require_prefix": false } }, { "breadcrumbs": [ { "node_id": 6, "title": "autoit", "node_type_id": "Forum" } ], "description": "", "display_in_list": true, "display_order": 3, "node_id": 8, "node_name": null, "node_type_id": "Forum", "parent_node_id": 6, "title": "examples", "type_data": { "allow_poll": true, "allow_posting": true, "can_create_thread": true, "can_upload_attachment": true, "discussion_count": 0, "last_post_date": 0, "last_post_id": 0, "last_post_username": "", "last_thread_id": 0, "last_thread_prefix_id": 0, "last_thread_title": "", "message_count": 0, "min_tags": 0, "require_prefix": false } }, { "breadcrumbs": [ { "node_id": 6, "title": "autoit", "node_type_id": "Forum" }, { "node_id": 8, "title": "examples", "node_type_id": "Forum" } ], "description": "", "display_in_list": true, "display_order": 3, "node_id": 9, "node_name": null, "node_type_id": "Forum", "parent_node_id": 8, "title": "GUI", "type_data": { "allow_poll": true, "allow_posting": true, "can_create_thread": true, "can_upload_attachment": true, "discussion_count": 0, "last_post_date": 0, "last_post_id": 0, "last_post_username": "", "last_thread_id": 0, "last_thread_prefix_id": 0, "last_thread_title": "", "message_count": 0, "min_tags": 0, "require_prefix": false } }, { "breadcrumbs": [ { "node_id": 6, "title": "autoit", "node_type_id": "Forum" }, { "node_id": 8, "title": "examples", "node_type_id": "Forum" } ], "description": "", "display_in_list": true, "display_order": 31, "node_id": 10, "node_name": null, "node_type_id": "Forum", "parent_node_id": 8, "title": "windowEX", "type_data": { "allow_poll": true, "allow_posting": true, "can_create_thread": true, "can_upload_attachment": true, "discussion_count": 0, "last_post_date": 0, "last_post_id": 0, "last_post_username": "", "last_thread_id": 0, "last_thread_prefix_id": 0, "last_thread_title": "", "message_count": 0, "min_tags": 0, "require_prefix": false } }, { "breadcrumbs": [], "description": "", "display_in_list": true, "display_order": 4, "node_id": 7, "node_name": null, "node_type_id": "Category", "parent_node_id": 0, "title": "vbs", "type_data": {} } ] }  
    • By nooneclose
      I need to dynamically resize my 2d array while looping. 
      I know this code:
      ReDim $rArray[UBound($rArray) + 1] works for the rows, however, I also need to increase the columns. How would i go about increasing both rows and columns while looping? 
×
×
  • Create New...