
HTML extracting contents of a tag

Sokko

I'm working on a small project in which I need to download a web page and extract the source code that is inside a tag with a particular property. For instance, locate the div with an ID of "features" and extract the source inside it, being careful not to be tripped up by any divs inside that one. Since I don't feel like writing my own cut-down HTML parser just for this project, I looked at IE.au3 for some function that could do this.

So far I haven't been able to find anything useful. If I could get a handle on the div I could use _IEPropertyGet with innerHTML to pull out the contents, but there is no function (or if there is, I'm blind) that will even let me find a particular tag on the page, much less a tag with a specific ID, class, etc. Can this be done with the IE functions, or is there another way? (can't think of a RegEx that would work for this sort of thing at the moment)

DaleHohm

_IEGetObjByName() will get an element by name or ID. _IETagNameGetCollection() will get a collection of all elements with that tag or if you pass a zero-based index you can get a reference to a specific element.
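
As a rough sketch of how this could answer the original question, the snippet below looks up the div by ID and reads its contents with _IEPropertyGet. The URL and the "features" ID are placeholders taken from the question, not a real page, and the error handling is only illustrative.

```autoit
#include <IE.au3>

; Placeholder URL for illustration only; substitute the page you need.
Local $oIE = _IECreate("http://www.example.com/", 0, 0) ; do not attach, keep the window hidden

; Per the description above, _IEGetObjByName() matches on name or ID.
; Newer IE.au3 releases also ship _IEGetObjById(), if your version has it.
Local $oDiv = _IEGetObjByName($oIE, "features")
If @error Then
    ConsoleWrite("No element named 'features' was found." & @CRLF)
Else
    ; innerhtml returns the markup inside the div, including any nested divs.
    ConsoleWrite(_IEPropertyGet($oDiv, "innerhtml") & @CRLF)
EndIf

_IEQuit($oIE)
```

Because the browser has already parsed the page, nested divs inside the target come back as part of its innerHTML, so no hand-written parser or RegEx should be needed.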

Dale


Free Internet Tools: DebugBar, AutoIt IE Builder, HTTP UDF, MODIV2, IE Developer Toolbar, IEDocMon, Fiddler, HTML Validator, WGet, curl

MSDN docs: InternetExplorer Object, Document Object, Overviews and Tutorials, DHTML Objects, DHTML Events, WinHttpRequest, XmlHttpRequest, Cross-Frame Scripting, Office object model

Automate input type=file (Related)

Alternative to _IECreateEmbedded? better: _IECreatePseudoEmbedded  Better Better?

IE.au3 issues with Vista - Workarounds

SciTe Debug mode - it's magic: #AutoIt3Wrapper_run_debug_mode=Y

"Doesn't work" needs to be ripped out of the troubleshooting lexicon. It means that what you tried did not produce the results you expected. It begs the questions 1) what did you try?, 2) what did you expect? and 3) what happened instead?

Reproducer: a small (the smallest?) piece of stand-alone code that demonstrates your trouble

Sokko

Is there a way to get an element by looking at some other property (class comes to mind)? I'm guessing this would have to be done with _IETagNameGetCollection, but once you obtain the collection, how would you find out which elements have the class or other property you want?

DaleHohm

Example:

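#include <IE.au3>

; $oIE is assumed to be a browser object obtained earlier, e.g. from _IECreate() or _IEAttach()
$oDivs = _IETagNameGetCollection($oIE, "div")
For $oDiv In $oDivs
    If String($oDiv.className) = "the one I'm looking for" Then
        ; Yahoo! I found one
        ; do something
    EndIf
Next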

Dale

Sokko

Thanks for that example! But how did you know to use "className", and what else could you put in its place? It doesn't match up with the actual name of the property in the HTML, which is just "class", so I'm not sure how to extend this to other properties of tags. Also, why did you put a String function around it?

DaleHohm

>Thanks for that example!

You're welcome

>But how did you know to use "className", and what else could you put in its place?

>It doesn't match up with the actual name of the property in the HTML, which is just "class",

>so I'm not sure how to extend this to other properties of tags.

See the link for the MSDN Document Object documentation in my Sig... then drill down to the DIV tag to see what properties it has.

>Also, why did you put a String function around it?

Experience.

If a div has no class name, then $oDiv.className returns a numeric 0 instead of an empty string as you might expect. Since AutoIt uses variants rather than typed variables, it assumes that because the left side of the comparison is numeric, you want a numeric comparison, so it converts the right side to a number as well, and any non-numeric string evaluates to 0. So you get [If 0 = 0], which evaluates to True, instead of [If "" = "what I'm looking for"], which would evaluate to False.

Using String() on the left side forces a string comparison. You'll likely forget this, like I often do, but hopefully you'll remember when you are getting really strange results sometime and you just can't figure it out... and then, Doh!
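
A small stand-alone sketch of the effect, with no browser involved. The variable $vClassName is a hypothetical stand-in that simply assumes the numeric 0 described above for a div without a class.

```autoit
; Stand-in for $oDiv.className on a div with no class (assumed to be numeric 0).
Local $vClassName = 0
Local $sWanted = "features"

; Numeric comparison: "features" is converted to 0, so this should report True.
ConsoleWrite("Without String(): " & ($vClassName = $sWanted) & @CRLF)

; String comparison: "0" is compared with "features", so this should report False.
ConsoleWrite("With String():    " & (String($vClassName) = $sWanted) & @CRLF)
```

Without the String() wrapper, every classless div would look like a match for whatever class name you search for.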

Dale


Sokko

>See the link for the MSDN Document Object documentation in my Sig... then drill down to the DIV tag to see what properties it has.

Ah, I see. Took me about five minutes to figure out you had to choose the "Collections" button and click the "HTML Elements" link under the description for the "childNodes" collection. Thanks for the tip about String, I certainly hope I won't forget it.

DaleHohm

Ah yes... sorry I made it hard. I've now added a "DHTML Objects" link to my sig that I will direct others to for similar things in the future.

Dale

