
Need to read contents of a web page and extract table text



There are so many examples here that it's very hard to find a specific item...

I need to load a web page and extract certain text items from within a table. Then I'd like to display those items in a custom GUI.

Is there anything like that floating around, or maybe just the extracting part?

The learning curve is high enough that an example would save me a lot of time.


The examples within the helpfile should be sufficient.

Check _IECreate and _IETableGetCollection & _IETableWriteToArray to obtain your array containing the desired table.

hf,

101011


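$oTable = _IETableGetCollection($yourIEObject, 3) ; index is zero-based, so 3 is the fourth table

$aTableData = _IETableWriteToArray($oTable) ; 2D array, laid out [column][row] by default

$rows = UBound($aTableData, 2) ; rows are the second dimension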
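For anyone who wants the whole round trip in one place, here is a rough sketch of loading a page, pulling one table into an array and walking it row by row. The URL and the table index are made up for illustration, and the [column][row] layout is what I understand the default of _IETableWriteToArray to be, so check the helpfile for your version before relying on it.

```autoit
#include <IE.au3>

; Hypothetical URL and table index, for illustration only
Local $oIE = _IECreate("http://www.example.com/")
Local $oTable = _IETableGetCollection($oIE, 3) ; zero-based: the fourth table on the page
If @error Then
    ConsoleWrite("Table not found" & @CRLF)
    Exit
EndIf

Local $aTableData = _IETableWriteToArray($oTable)
; Assumption: the array is laid out [column][row]
Local $iCols = UBound($aTableData, 1)
Local $iRows = UBound($aTableData, 2)

; Print each table row on its own line, cells separated by " | "
For $r = 0 To $iRows - 1
    Local $sLine = ""
    For $c = 0 To $iCols - 1
        $sLine &= $aTableData[$c][$r] & " | "
    Next
    ConsoleWrite($sLine & @CRLF)
Next
```

If you only want a quick look at the data, _ArrayDisplay($aTableData) from Array.au3 shows the same array in a grid.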

Thanks, I was able to get rolling on this and have completed quite a bit of the project.

Maybe someone can help on this issue. I'm retrieving a collection using

$oTableLinkText = _IETableGetCollection($oIE, 4)

This works as expected. What I need are the LINKS in this table. I know I can get all links with _IELinkGetCollection, but using the same index (4) does not retrieve anything. I have to use -1, which gives me EVERY link on the web page.

At that point I'd have to do some looping and comparing to get the links I need.

It seems there must be a better, more direct way to do this using only the original _IETableGetCollection($oIE, 4) data.

Edited by Jdop

Yup yup. If I'm understanding correctly, you're interested in all the links within a given table? This should give you some ideas:

$oTable = _IETableGetCollection($yourIEObject, 3)
$oLinks = _IETagNameGetCollection($oTable, "A")
For $oLink In $oLinks
    $href = $oLink.href
    ConsoleWrite($href & @CR)
Next

I'm retrieving a collection using

$oTableLinkText = _IETableGetCollection($oIE, 4)

This works as expected.

What is expected from that command is an object pointing to the table, not the text of anything, yet...


You might be able to get just the links from the table you cleverly selected the object for earlier:

$oLinks = _IELinkGetCollection($oTableLinkText)
$iNumLinks = @extended
MsgBox(0, "Link Info", $iNumLinks & " links found")
Dim $strLinks = "", $i = 0
For $oLink In $oLinks
    $strLinks &= $i & ":  " & $oLink.href & @LF
    $i += 1
Next
MsgBox(0, "Link Info", $strLinks)

Can't test without a page to try it on...

:lmao:

Edit: Added "$i += 1" to the loop. Had to get it in there before mikehu.... DOH! :whistle:

Edited by PsaltyDS

*sneaks an $i += 1 into PSaltyDS's loop*

:whistle:


Thanks for the quick help. That's what I need. Strange though, I'm seeing a 'bug' in that the second item in the table is not being output. All the other lines are output properly. Here's what I see.

Note: This is based on the first suggested code from mikehunt114, you guys are faster than I am.

http://www.trade-ideas.com/SingleAlertType/NHP/New_high.html

http://www.trade-ideas.com/Help.html#NHP ?

http://www.trade-ideas.com/Help.html#NLP ?

http://www.trade-ideas.com/SingleAlertType...w_high_ask.html

http://www.trade-ideas.com/Help.html#NHA ?

http://www.trade-ideas.com/SingleAlertType...ew_low_bid.html

http://www.trade-ideas.com/Help.html#NLB ?

etc.....

Line 3 should be >> http://www.trade-ideas.com/SingleAlertType/NLP/New_low.html

but it does not appear in the output. Is there some bug in the AutoIt code? (I doubt it, but you never know.)

Here is the relevant raw html from the actual web page I'm parsing. I don't see any 'malformed' code that would cause this to happen.

<TR><TD><IMG SRC='http://static.trade-ideas.com/Alerts/NHP.gif'></TD><TD><A HREF='http://www.trade-ideas.com/SingleAlertType/NHP/New_high.html'>New high</A></TD></TD><TD ALIGN='center'><A HREF='/Help.html#NHP' TARGET='Alerts Help'><B>?</B></A></TD></TR>

<TR><TD><IMG SRC='http://static.trade-ideas.com/Alerts/NLP.gif'></TD><TD><A HREF='http://www.trade-ideas.com/SingleAlertType/NLP/New_low.html'>New low</A></TD></TD><TD ALIGN='center'><A HREF='/Help.html#NLP' TARGET='Alerts Help'><B>?</B></A></TD></TR>

<TR><TD><IMG SRC='http://static.trade-ideas.com/Alerts/NHA.gif'></TD><TD><A HREF='http://www.trade-ideas.com/SingleAlertType/NHA/New_high_ask.html'>New high ask</A></TD></TD><TD ALIGN='center'><A HREF='/Help.html#NHA' TARGET='Alerts Help'><B>?</B></A></TD></TR>

<TR><TD><IMG SRC='http://static.trade-ideas.com/Alerts/NLB.gif'></TD><TD><A HREF='http://www.trade-ideas.com/SingleAlertType/NLB/New_low_bid.html'>New low bid</A></TD></TD><TD ALIGN='center'><A HREF='/Help.html#NLB' TARGET='Alerts Help'><B>?</B></A></TD></TR>

Edited by Jdop

That little link collection snippet works fine for me on that bit of HTML. Try running the code on only that section of the HTML (save it as a new .htm file on your computer, then open that in IE and apply the script). Other than that, I can only suggest checking @extended after the _IELinkGetCollection call to see if it is finding the correct number of elements. If that number is correct, try looping through the links using For $i = 0 To (@extended - 1), although I have no reason to believe For...In is misbehaving.
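A rough sketch of that index-based check, in case it helps. The URL is a placeholder, the table index 4 is the one used earlier in the thread, and .item() is the DOM collection's own method rather than anything from IE.au3, so treat that part as an assumption to verify on your setup.

```autoit
#include <IE.au3>

; Hypothetical URL; table index 4 is the one used earlier in the thread
Local $oIE = _IECreate("http://www.example.com/")
Local $oTable = _IETableGetCollection($oIE, 4)

Local $oLinks = _IETagNameGetCollection($oTable, "a")
Local $iNumLinks = @extended ; copy it right away, the next function call overwrites @extended
ConsoleWrite($iNumLinks & " links found" & @CRLF)

; Walk the collection by index instead of For...In
For $i = 0 To $iNumLinks - 1
    Local $oLink = $oLinks.item($i) ; assumption: DOM collection item() method
    ConsoleWrite($i & ": " & $oLink.href & @CRLF)
Next
```

If the count printed at the top matches the number of lines printed by the loop, the collection itself is fine and the missing link is not in the table to begin with.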

Edit: typo

Edited by mikehunt114

The second code example works, but returns all the links on the entire page. Not sure exactly what is going on there.

Interestingly, BOTH code examples are missing that

http://www.trade-ideas.com/SingleAlertType/NLP/New_low.html

in the output stream.

Very odd. I wonder if it's something in the page itself or in AutoIt? There's very little room for error here on my part.

Here's the link to the original source page if someone wants to examine it.

http://www.trade-ideas.com/SingleAlertType...pping_down.html

I'm trying to extract the links in the frame titled "View alerts by type"

Edited by Jdop

Y'all gave up on me already, lol.

The thing should work, but doesn't. I can kludge a fix for it, but it would be nice to know why that single line is 'invisible' to AutoIt and the Windows DOM.

Hey, not sure if I understood what you meant, but the following seems to be close to what you're after:

#include <IE.au3>

HttpSetProxy(0) ; use the default (Internet Explorer) proxy settings

; Don't attach to an existing window, keep the browser hidden, wait for the page to load
Dim $oIE = _IECreate("http://www.trade-ideas.com/SingleAlertType/SD/Offer_stepping_down.html", 0, 0, 1, -1)

$oTable = _IETableGetCollection($oIE, 4)

$oLinks = _IETagNameGetCollection($oTable, "a")
$iNumLinks = @extended ; number of <a> elements found in the table

Dim $strLinks = "", $i = 0
For $oLink In $oLinks
    ; Keep only the alert-type links, not the help file ones
    If StringInStr($oLink.href, "http://www.trade-ideas.com/SingleAlertType") Then
        $strLinks &= $i & ":  " & $oLink.href & @LF
        $i += 1
        ConsoleWrite("Match Found - " & $oLink.outerText & " : " & $oLink.href & @LF)
    EndIf
Next

ConsoleWrite("TOTAL MATCHING LINKS FOUND: " & $i & @LF)

Was it just the middle column links that you were after in that table? The above code returns 175 links which seems pretty close.
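Since the original question also asked about showing the results in a custom GUI, here is a minimal sketch of one way to do it: the filtered links go into a plain list box. The window title, sizes and variable names are made up, and GUIConstantsEx.au3 is the include name in current AutoIt releases (older versions may need GUIConstants.au3 instead), so adjust as needed.

```autoit
#include <IE.au3>
#include <GUIConstantsEx.au3>

; Same page and table index as in the script above, browser kept hidden
Local $oIE = _IECreate("http://www.trade-ideas.com/SingleAlertType/SD/Offer_stepping_down.html", 0, 0, 1)
Local $oTable = _IETableGetCollection($oIE, 4)
Local $oLinks = _IETagNameGetCollection($oTable, "a")

; Hypothetical window layout: one list box filling the window
GUICreate("Alert links", 620, 400)
Local $idList = GUICtrlCreateList("", 10, 10, 600, 380)

For $oLink In $oLinks
    ; Keep only the alert-type links, skip the help file ones
    If StringInStr($oLink.href, "/SingleAlertType/") Then
        GUICtrlSetData($idList, $oLink.href) ; appends one entry to the list
    EndIf
Next

_IEQuit($oIE) ; close the hidden browser once the links are copied
GUISetState(@SW_SHOW)

; Idle until the window is closed
While GUIGetMsg() <> $GUI_EVENT_CLOSE
WEnd
```

Note that a list box created with the default style sorts its entries, so the order will not match the order on the page.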


fu2m8, (love the nicks around here ;-) ) your code seems to work around the 'bug' I was talking about in the other two versions.

If you run those against the same page, you will see that the second link, 'new lows' does not get captured by the output routines.


Hmm, I thought I was getting the same thing as you originally, but I just ran the following and line 3 (i.e. the one starting with 2: ...) seemed to have the correct output. I'm running v3.2.4.0. I may have misunderstood which links you were after, which is why I took out the help file related ones in the script I posted above. This version should return 351 links.

#include <IE.au3>

Dim $oIE = _IECreate("http://www.trade-ideas.com/SingleAlertType/SD/Offer_stepping_down.html", 0, 0, 1, -1)

$oTable = _IETableGetCollection($oIE, 4)

$oLinks = _IETagNameGetCollection($oTable, "a")
$iNumLinks = @extended

Dim $strLinks = "", $i = 0
For $oLink In $oLinks
    ; Filter commented out here, so the help file links are included too
    ;If StringInStr($oLink.href, "http://www.trade-ideas.com/SingleAlertType") Then
    $strLinks &= $i & ":  " & $oLink.href & @LF
    $i += 1
    ;ConsoleWrite("Match Found - " & $oLink.outerText & " : " & $oLink.href & @LF)
    ;EndIf
Next

MsgBox(0, 0, $strLinks)
ConsoleWrite("TOTAL MATCHING LINKS FOUND: " & $i & @LF)

Good luck with it :whistle:


I had a quick look last night before I went home, and your desired table looked like it was nested in at least one more table. Maybe have a second look at the DOM structure to confirm. That said, if you referenced the first table, all links within subsequently nested tables should be returned in a collection call. I haven't given up, just busy :whistle:


Actually, I think I figured out what was happening. Each 'scan' has its own page, and that scan's own link is omitted from the table on its page.

Don't know why I missed it when debugging, but I've just about finished the project with all the bells and whistles.

AutoIt, pretty cool.
