RobMac (posted December 10, 2007):

This is a general question about the best way to process (read, analyze, click, download) complex websites. I crawl through websites and reconstruct the underlying databases (I am an academic and use this for data gathering; academic also means novice programmer, sorry).

What is the best way to do this when complicated tables use JavaScript in a way that effectively separates the actual links from the text they refer to? For example, once I am working with the full text, how can I refer to an object/element I find in it, especially if I don't have unique names or IDs to work from? Since all the _IE functions work with lists of a specific object type, it seems they cannot connect these objects, for example within the rows of a table.

I have been forced to read in the full text and count occurrences as I parse it, while at the same time extracting all links on the page, and then painstakingly thread these back together by referencing the correct link based on the count from my full-text parsing. Is this a normal method? I want to see if there is an easier way. Just looking for "best practice".

Thanks in advance.
-Rob
Valuater (posted December 10, 2007):

Give IE-Builder a look-see: http://www.autoitscript.com/forum/index.ph...st&p=1337678
RobMac (posted December 10, 2007; edited December 10, 2007 by RobMac):

Thanks Valuater, that is a cool application. Unfortunately it seems to cut off the HTML after element 488 or so... Actually, I guess it does not see inside the iframe?

This is the website I am looking at: http://www.wipo.int/classifications/ipc/ipc8/?lang=en

Click on one of the Class letters (A to F or so) to see the more complex page that I am working with.

I am using AutoIt v3.2.8.1.
DaleHohm (posted December 10, 2007):

Also suggest you download and use DebugBar (free, see my sig) to help decipher your pages and frames.

Dale

Free Internet Tools: DebugBar, AutoIt IE Builder, HTTP UDF, MODIV2, IE Developer Toolbar, IEDocMon, Fiddler, HTML Validator, WGet, curl
MSDN docs: InternetExplorer Object, Document Object, Overviews and Tutorials, DHTML Objects, DHTML Events, WinHttpRequest, XmlHttpRequest, Cross-Frame Scripting, Office object model
Automate input type=file (Related)
Alternative to _IECreateEmbedded? better: _IECreatePseudoEmbedded
Better Better?
IE.au3 issues with Vista - Workarounds
SciTe Debug mode - it's magic: #AutoIt3Wrapper_run_debug_mode=Y
"Doesn't work" needs to be ripped out of the troubleshooting lexicon. It means that what you tried did not produce the results you expected. It begs the questions 1) what did you try?, 2) what did you expect? and 3) what happened instead?
Reproducer: a small (the smallest?) piece of stand-alone code that demonstrates your trouble
RobMac (posted December 11, 2007):

Thanks Dale. Yes, I am using DebugBar and it is very helpful. I am still having problems with tables where a link that runs JavaScript is in one cell and the associated text is in another, so I can't relate the link (and its results after a click) to the text. The problem arises when objects have no IDs and have names that appear multiple times.

Anyway, I am just wondering if best practice in these cases is often to read the full HTML, parse it, count links manually and relate these links to other items of parsed text, then finally relate this manual count of links (or images or whatever) to the _IE functions, which work by referencing the nth element on the page.

Thanks for the help.
-Rob
DaleHohm (posted December 11, 2007):

Typically in situations like that, studying the HTML will reveal patterns that can be used to your advantage. For example, the label and the link are in adjacent TDs in a TR, and you can loop through them in nested loops looking for what you want.

Dale
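
As an illustration of the nested-loop idea, here is a minimal sketch rather than a definitive solution. It assumes each row of interest has the label text in its first cell and an anchor in its second cell, and that the tables sit inside the first frame of the page RobMac mentioned; neither assumption has been checked against the actual site. `_PrintLabelLinkPairs` is a hypothetical helper name, not part of IE.au3.

```autoit
#include <IE.au3>

; Hypothetical helper: for every table row, print "label -> href" when the
; first cell holds text and the second cell holds at least one link.
; ASSUMPTION: label in cell 0, link in cell 1. Adjust to the real layout.
Func _PrintLabelLinkPairs($oContainer)
    Local $oTables = _IETableGetCollection($oContainer)
    If @error Then Return SetError(1, 0, 0)
    For $oTable In $oTables
        For $oRow In $oTable.rows
            Local $oCells = $oRow.cells
            If $oCells.length < 2 Then ContinueLoop
            ; Look for anchors inside the second cell only
            Local $oLinks = $oCells.item(1).getElementsByTagName("a")
            If $oLinks.length = 0 Then ContinueLoop
            ConsoleWrite($oCells.item(0).innerText & " -> " & $oLinks.item(0).href & @CRLF)
        Next
    Next
    Return 1
EndFunc

Local $oIE = _IECreate("http://www.wipo.int/classifications/ipc/ipc8/?lang=en")

; ASSUMPTION: the tables live in the first frame/iframe of the page.
Local $oFrame = _IEFrameGetCollection($oIE, 0)
If @error Then
    _PrintLabelLinkPairs($oIE) ; no frame found, use the top-level document
Else
    _PrintLabelLinkPairs($oFrame)
EndIf
```

Because the label and the link are read from the same row object, no counting of links across the whole page is needed. For links that run JavaScript, the anchor object itself (`$oLinks.item(0)` here) can be passed to `_IEAction($oLink, "click")` instead of reading its `href`.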
RobMac (posted December 11, 2007):

Thanks Dale. That is more or less what I am doing, so it is good to know I was not totally off track. I was afraid there was a really easy way and I was killing myself trying to figure out some patterns. To be honest, the problem I am dealing with here seems to be an unreliable (or too intelligent) interface that changes the way it displays things as you go through it. Anyway, thank you very much for the response on both my questions.
-Rob
DaleHohm (posted December 11, 2007):

_IETagNameAllGetCollection will return ALL elements, and you can loop through them or index into them with the .item property of the collection. Elements also have .parent, .nextSibling, .previousSibling and .children properties. If you want to invest the time at MSDN, these may be useful to you, but it takes some work.

Dale
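
The navigation properties mentioned above can also be used in the other direction: start from each link and climb up to the row that contains it. The following is a rough sketch under stated assumptions, not a tested recipe for this particular site. `_GetEnclosingRow` is a hypothetical helper; it uses `parentElement`, which is the IE DOM name for an element's parent (`parentNode` is the other common spelling), and it assumes the links of interest are in the top-level document rather than in a frame.

```autoit
#include <IE.au3>

; Hypothetical helper: climb from any element to the nearest enclosing <tr>.
; Sets @error = 1 if the element is not inside a table row.
Func _GetEnclosingRow($oElement)
    Local $oNode = $oElement
    While IsObj($oNode)
        If $oNode.tagName = "TR" Then Return $oNode
        $oNode = $oNode.parentElement
    WEnd
    Return SetError(1, 0, 0)
EndFunc

Local $oIE = _IECreate("http://www.wipo.int/classifications/ipc/ipc8/?lang=en")

; ASSUMPTION: links are in the top-level document. For framed content,
; pass a frame object from _IEFrameGetCollection instead of $oIE.
Local $oLinks = _IELinkGetCollection($oIE)
For $oLink In $oLinks
    Local $oRow = _GetEnclosingRow($oLink)
    If @error Then ContinueLoop ; link is not inside a table row
    ; The row's text includes the label cell, so link and label are tied together
    ConsoleWrite($oRow.innerText & " -> " & $oLink.href & @CRLF)
Next
```

This keeps each link attached to its own row without relying on element IDs, names, or page-wide counts.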