Read data from html Tables from raw HTML source


  • 4 months later...

After seeing the function __HTML_Filter() in this topic by Stilgar (https://www.autoitscript.com/forum/topic/124330-_htmlau3-v101/), I thought I'd include that function in this script as well.
The purpose of that function is to clean the data extracted from the table: it removes the codes that the browser does not display but that show up as "dirty" markup in the data when it is picked up from the table.
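
To make the idea concrete, here is a minimal sketch of this kind of cleanup. It is not the __HTML_Filter() function from Stilgar's UDF: _DemoCleanCell is a hypothetical helper, and the handful of entities it handles is only an assumption about what typically ends up in table cells.

```autoit
; Hypothetical helper for illustration only, NOT the __HTML_Filter() from Stilgar's UDF
Func _DemoCleanCell($sCell)
    ; Drop anything that looks like a tag, e.g. <b>, </span>, <br />
    $sCell = StringRegExpReplace($sCell, "<[^>]*>", "")
    ; Translate a few common named entities (far from a complete list)
    $sCell = StringReplace($sCell, "&nbsp;", " ")
    $sCell = StringReplace($sCell, "&lt;", "<")
    $sCell = StringReplace($sCell, "&gt;", ">")
    $sCell = StringReplace($sCell, "&quot;", '"')
    ; &amp; goes last so that "&amp;lt;" ends up as "&lt;" and not as "<"
    $sCell = StringReplace($sCell, "&amp;", "&")
    ; Collapse runs of whitespace, then trim leading and trailing whitespace (flag 3 = 1 + 2)
    Return StringStripWS(StringRegExpReplace($sCell, "\s+", " "), 3)
EndFunc

; Raw cell content as it might be picked up from the HTML source
ConsoleWrite(_DemoCleanCell("<b>Tom &amp; Jerry</b>&nbsp;&lt;1940&gt;") & @CRLF)
; writes: Tom & Jerry <1940>
```

The tags are stripped before the entities are decoded, so a decoded "<1940>" is not mistaken for a tag.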

I have updated the UDF and the example script in the first post.

To see the difference in the extracted data with and without the HTML_Filter() function, extract the table data from the example page by clicking the "Preview array" button twice: once with the "tags to entities" filter checkbox unchecked, and once with it checked.

 

Chimp

small minds discuss people average minds discuss events great minds discuss ideas.... and use AutoIt....

  • 9 months later...

Hi Chimp!
Thanks a lot for your example - it saved me a lot of work!
I had to parse a table with almost 1400 rows (and lots of rowspans) in a 1.5 MB HTML file, and ran into some performance issues. Here is how I solved them:
First, I changed the HTML tag position search in _ParseTags so that each search starts from the position of the last tag found; that way StringInStr doesn't need to count through thousands of "<tr" tags on every iteration. Then _ArraySort failed (too many rows...). So, to get the tag list pre-sorted, I search for the first opening tag and the first closing tag. If the opening tag comes before the closing tag, I write it to $aThisTagsPositions and find the next opening tag; if the closing tag comes before the next opening tag, I write it to $aThisTagsPositions and find the next closing tag.

This made it possible to read that huge HTML file in less than 90 seconds.
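
As a rough illustration of the first change, the snippet below contrasts two ways of calling StringInStr on a toy string. That the original loop asked for the n-th occurrence is an assumption, since the old code is not shown in this post; the toy $sHtml and the variable $iPos are made up for the example.

```autoit
; Toy input; in the real script $sHtml holds the whole page
Local $sHtml = "<tr><td>a</td></tr><tr><td>b</td></tr><tr><td>c</td></tr>"
Local $sOpening = "<tr"

; Assumed "slow" pattern: ask for the n-th occurrence, so every call rescans from position 1
For $i = 1 To 3
    ConsoleWrite("occurrence " & $i & " at " & StringInStr($sHtml, $sOpening, 0, $i) & @CRLF)
Next

; "Fast" pattern: always ask for the first occurrence, but start just after the previous hit
Local $iPos = StringInStr($sHtml, $sOpening, 0, 1)
While $iPos > 0
    ConsoleWrite("found at " & $iPos & @CRLF)
    $iPos = StringInStr($sHtml, $sOpening, 0, 1, $iPos + 1)
WEnd
```

Both loops should report the same positions. The difference is only in how much of the string each call has to scan, which matters once there are thousands of rows.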

Just replace the code on lines 208-216 with this:

        ; start from the first opening tag and the first closing tag in the HTML
        Local $iNextOpenPosition = StringInStr($sHtml, $sOpening, 0, 1)
        Local $iNextClosePosition = StringInStr($sHtml, $sClosing, 0, 1)
        Local $iOpenCount = 1

        ; 2) find in the HTML the positions of the $sOpening <tag and $sClosing </tag> tags
        For $i = 1 To $iNrOfThisTag * 2 ; search all the opening and closing tags
            If ($iNextOpenPosition < $iNextClosePosition) And $iNextOpenPosition <> 0 Then
                ; the next tag in document order is an opening tag
                $aThisTagsPositions[$i][0] = $iNextOpenPosition
                $aThisTagsPositions[$i][1] = $sOpening ; it marks which kind of tag this is
                $aThisTagsPositions[$i][2] = $iOpenCount ; nr of this tag
                $iOpenCount += 1
                ; continue the search just after the tag that was found
                $iNextOpenPosition = StringInStr($sHtml, $sOpening, 0, 1, $aThisTagsPositions[$i][0] + 1)
            Else
                ; the next tag is a closing tag: store the position of its last character
                $aThisTagsPositions[$i][0] = $iNextClosePosition + StringLen($sClosing) - 1
                $aThisTagsPositions[$i][1] = $sClosing ; it marks which kind of tag this is
                $iNextClosePosition = StringInStr($sHtml, $sClosing, 0, 1, $aThisTagsPositions[$i][0] + 1)
            EndIf
        Next
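
To try the loop outside the UDF, a small harness like the following might help. It is only a sketch under assumptions: the toy $sHtml, counting the opening tags via StringReplace and @extended, and an array with an unused row 0 are choices made for this example and are not taken from the UDF.

```autoit
; Toy input with two balanced rows
Local $sHtml = "<tr><td>a</td></tr><tr><td>b</td></tr>"
Local $sOpening = "<tr", $sClosing = "</tr>"

; Assumed way to count the opening tags: StringReplace sets @extended to the number of replacements
StringReplace($sHtml, $sOpening, $sOpening)
Local $iNrOfThisTag = @extended

Local $aThisTagsPositions[$iNrOfThisTag * 2 + 1][3] ; row 0 stays unused in this sketch
Local $iNextOpenPosition = StringInStr($sHtml, $sOpening, 0, 1)
Local $iNextClosePosition = StringInStr($sHtml, $sClosing, 0, 1)
Local $iOpenCount = 1

For $i = 1 To $iNrOfThisTag * 2
    If ($iNextOpenPosition < $iNextClosePosition) And $iNextOpenPosition <> 0 Then
        $aThisTagsPositions[$i][0] = $iNextOpenPosition
        $aThisTagsPositions[$i][1] = $sOpening
        $aThisTagsPositions[$i][2] = $iOpenCount
        $iOpenCount += 1
        $iNextOpenPosition = StringInStr($sHtml, $sOpening, 0, 1, $aThisTagsPositions[$i][0] + 1)
    Else
        $aThisTagsPositions[$i][0] = $iNextClosePosition + StringLen($sClosing) - 1
        $aThisTagsPositions[$i][1] = $sClosing
        $iNextClosePosition = StringInStr($sHtml, $sClosing, 0, 1, $aThisTagsPositions[$i][0] + 1)
    EndIf
Next

; Print the collected positions; they come out already in ascending order
For $i = 1 To $iNrOfThisTag * 2
    ConsoleWrite($aThisTagsPositions[$i][0] & @TAB & $aThisTagsPositions[$i][1] & @CRLF)
Next
```

Because the loop always takes whichever tag comes first in the document, the list is built in sorted order and no _ArraySort call is needed afterwards. The sketch assumes balanced tags; with a missing closing tag the Else branch would record a meaningless position.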

 

