
How to relate links and text when tables obscure connections


RobMac

This is a general question about the best way to process (read, analyze, click, download) complex websites. In general, I am dealing with crawling through websites and reconstructing the underlying databases (I am an academic and so use this for data gathering - also academic = novice programmer - sorry).

I was wondering what the best way to do this is when you have complicated tables with JavaScript that effectively separate the actual links from the text to which they refer. For example, once I am working with the full text, how can I refer back to an object/element I find there, especially if I don't have unique names or IDs to work from?

It seems that since all the _IE functions work with lists of a specific object type, they cannot connect these objects, for example, within rows of a table. I have been forced to read in the full text and count occurrences as I parse it, while at the same time extracting all links on the page, and then painstakingly thread these back together by referencing the correct link based on the count I got from my full-text parsing.

Is this a normal method? I want to see if there is an easier way to do this. Just looking for "best practice".

Thanks in advance.

-Rob


Thanks Valuater - that is a cool application.

Unfortunately it seems to cut off the HTML after element 488 or so... Actually, I guess it does not see inside the iframe...?

This is the website I am looking at:

http://www.wipo.int/classifications/ipc/ipc8/?lang=en

Click on one of the Class letters (A to F or so) to see the more complex page that I am working with.

I am using AutoIt v3.2.8.1

Edited by RobMac

Also suggest you download and use DebugBar (free - see my sig) to help decipher your pages and frames.

Dale

Free Internet Tools: DebugBar, AutoIt IE Builder, HTTP UDF, MODIV2, IE Developer Toolbar, IEDocMon, Fiddler, HTML Validator, WGet, curl

MSDN docs: InternetExplorer Object, Document Object, Overviews and Tutorials, DHTML Objects, DHTML Events, WinHttpRequest, XmlHttpRequest, Cross-Frame Scripting, Office object model

Automate input type=file (Related)

Alternative to _IECreateEmbedded? better: _IECreatePseudoEmbedded  Better Better?

IE.au3 issues with Vista - Workarounds

SciTe Debug mode - it's magic: #AutoIt3Wrapper_run_debug_mode=Y

Doesn't work needs to be ripped out of the troubleshooting lexicon. It means that what you tried did not produce the results you expected. It begs the questions 1) what did you try?, 2) what did you expect? and 3) what happened instead?

Reproducer: a small (the smallest?) piece of stand-alone code that demonstrates your trouble



Thanks Dale,

Yes I am using DebugBar and it is very helpful.

I am still having problems when I have tables where a link that runs JavaScript is in one cell and the associated text is in another, so I can't relate the link (and its results after a click) to the text. The problem is that the objects have no IDs, and their names appear multiple times.

Anyway, I am just wondering if best practice in these cases is to read the full HTML, parse it, count links manually and relate these links to other items of parsed text, and then finally relate this manual count of links (or images or whatever) to the _IE functions, which work by referencing the nth element on the page.

Thanks for the help.

-Rob


Typically in situations like that, studying the HTML will reveal patterns that can be used to your advantage. For example, if the label and the link are in adjacent TDs in a TR, you can loop through them in nested loops looking for what you want.
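As a rough illustration of that row-by-row idea, here is a minimal sketch in Python rather than AutoIt, chosen only so it runs without a browser. The markup in SAMPLE_HTML, the class RowPairer and the function pair_labels_with_links are all made up for the example, and it assumes the label sits in the first cell of each row with the link somewhere later in the same row. It does not try to handle nested tables.

```python
from html.parser import HTMLParser

# Hypothetical markup: the label is in one cell, the javascript link in the next.
SAMPLE_HTML = """
<table>
  <tr><td>A01</td><td><a href="javascript:open('A01')">view</a></td></tr>
  <tr><td>A21</td><td><a href="javascript:open('A21')">view</a></td></tr>
  <tr><td>No link here</td><td></td></tr>
</table>
"""


class RowPairer(HTMLParser):
    """Collects, for each table row, its cell texts and the hrefs found in it."""

    def __init__(self):
        super().__init__()
        self.rows = []         # finished rows: {"cells": [...], "links": [...]}
        self._row = None       # row currently being read, or None
        self._in_cell = False  # True while inside a <td> of the current row

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = {"cells": [], "links": []}
        elif tag == "td" and self._row is not None:
            self._row["cells"].append("")
            self._in_cell = True
        elif tag == "a" and self._in_cell:
            href = dict(attrs).get("href")
            if href:
                self._row["links"].append(href)

    def handle_data(self, data):
        if self._in_cell:
            self._row["cells"][-1] += data

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._in_cell = False


def pair_labels_with_links(html):
    """Return (label, href) for every row that has both a first cell and a link."""
    parser = RowPairer()
    parser.feed(html)
    pairs = []
    for row in parser.rows:
        if row["cells"] and row["links"]:
            pairs.append((row["cells"][0].strip(), row["links"][0]))
    return pairs


if __name__ == "__main__":
    for label, href in pair_labels_with_links(SAMPLE_HTML):
        print(label, "->", href)
```

Because the label and the link are collected together while the row is open, no separate counting of links across the whole page is needed.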

Dale



Thanks Dale, that is more or less what I am doing, so it is good to know I was not totally off track. I was afraid there was a really easy way and I was killing myself trying to figure out some patterns. To be honest, the problem I am dealing with here seems to be an unreliable (or too intelligent) interface that changes the way it displays things as you go through it.

Anyway thank you very much for the response on both my questions.

-Rob


_IETagNameAllGetCollection will return ALL elements, and you can loop through them or index into them with the .item method of the collection. Elements also have .parentNode (or .parentElement), .nextSibling, .previousSibling and .children properties. If you want to invest the time at MSDN, these may be useful to you, but it takes some work.
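To show what walking those relationships looks like, here is a small sketch using Python's xml.dom.minidom instead of the IE object model. It is only an analogy: minidom exposes parentNode, previousSibling, nextSibling and childNodes, which play the same role as the IE properties named above. The sample markup and the helper names (previous_element, text_of, label_for_link, labels_by_link) are invented for the example, and the sample is deliberately well-formed XML, which real pages often are not.

```python
from xml.dom import Node, minidom

# Hypothetical, well-formed markup: label cell followed by a link cell.
SAMPLE = """<table>
  <tr><td>A01</td><td><a href="#a01">view</a></td></tr>
  <tr><td>A21</td><td><a href="#a21">view</a></td></tr>
</table>"""


def previous_element(node):
    """Step back through siblings, skipping whitespace text nodes."""
    sib = node.previousSibling
    while sib is not None and sib.nodeType != Node.ELEMENT_NODE:
        sib = sib.previousSibling
    return sib


def text_of(node):
    """Concatenate all text beneath a node."""
    parts = []
    for child in node.childNodes:
        if child.nodeType == Node.TEXT_NODE:
            parts.append(child.data)
        elif child.nodeType == Node.ELEMENT_NODE:
            parts.append(text_of(child))
    return "".join(parts).strip()


def label_for_link(link):
    """Climb from the link to its <td>, then read the cell just before it."""
    cell = link.parentNode
    while cell is not None and getattr(cell, "tagName", None) != "td":
        cell = cell.parentNode
    if cell is None:
        return None
    label_cell = previous_element(cell)
    return text_of(label_cell) if label_cell is not None else None


def labels_by_link(xml_text):
    """Return (href, label) for every <a> element, in document order."""
    doc = minidom.parseString(xml_text)
    return [(a.getAttribute("href"), label_for_link(a))
            for a in doc.getElementsByTagName("a")]


if __name__ == "__main__":
    for href, label in labels_by_link(SAMPLE):
        print(href, "->", label)
```

Starting from the link and climbing to its row means the link never has to be matched to its text by position on the page.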

Dale

