_IETagNameGetCollection

empty75

I have a piece of code that downloads a webpage then strips out several meta tags:

CODE

$oTags = _IETagNameGetCollection($oIE, "META") ; Search for all meta tags

For $otheTag In $oTags
    If $otheTag.Name = "description" Then ; Find the description meta tag
        $fnDesc = $otheTag.content ; Get the data
    EndIf
    If $otheTag.Name = "keywords" Then
        $keyWords = $otheTag.content
    EndIf
Next

All goes well if the meta tags are correctly coded on the webpage, but if a quote is in the wrong place, then the above fails and only returns up to the first quote.

CODE

<meta name="description" content="Step-up to Towers II with over 430 animated tiles. Enjoy 500 awesome and unique layouts with up to nine levels tall. Also download up to 100 new layouts every day. Use the board editor to create custom tile layouts that can be shared with players around the world.">

If the description tag is as above, all is well, but sometimes the tag looks like the following, in which case my code returns only "Step-up to ".

<meta name="description" content="Step-up to "Towers II" with over 430 animated tiles. Enjoy 500 awesome and unique layouts with up to nine levels tall. Also download up to 100 new layouts every day. Use the board editor to create custom tile layouts that can be shared with players around the world.">

Is there any way to extend the IE.au3 library to take this into account, or is there other code that can read the whole tag, assuming that the webmaster has finished the tag with the final "> (quote, greater-than)?

Thanks.

Matthew.

empty75

Thanks for this. I have looked into it, but have now discovered that the web pages with the invalid meta tags are more messed up than I first thought.

Viewing the source in a browser, it looks only slightly wrong, with quotes in the wrong place.

I did a test reading the HTML source using $srcHTML = _IEDocReadHTML ($oIE)

The meta tag in this string is totally mucked up on some web pages, sometimes with the meta tag name sitting in the middle of the content value.

There is another page I can extract the description tag from, but it is a much briefer description than the one I would like.

I have just done a few more tests, and it looks like $srcHTML = _IEDocReadHTML ($oIE) returns a different source to _INetGetSource.

CODE

Using the following code and writing the result to a text file:

$srcHTML = _IEDocReadHTML ($oIE)

I find that this is the tag I want:

<META content="Embark on a journey through ancient China, in this classic story-filled mahjong adventure. The stunning gallery-quality art, unique game modes, web-integrated play along with numerous other mahjong firsts, make " name=description dimension!? another to game table classic the takes Tales Mahjong experience. gaming always-blossoming an in you immersing revealed, is story illustrated beautifully a level, level from progress As one-of-a-kind. Tales?>

Viewing the source of the same web page in Firefox/IE7, the same tag looks like:

<meta name="description" content="Embark on a journey through ancient China, in this classic story-filled mahjong adventure. The stunning gallery-quality art, unique game modes, web-integrated play along with numerous other mahjong firsts, make "Tales" one-of-a-kind. As you progress from level to level, a beautifully illustrated story is revealed, immersing you in an always-blossoming gaming experience. Mahjong Tales takes the classic table game to another dimension!">

Using _INetGetSource with the same URL, we get:

<meta name="description" content="Embark on a journey through ancient China, in this classic story-filled mahjong adventure. The stunning gallery-quality art, unique game modes, web-integrated play along with numerous other mahjong firsts, make "Tales" one-of-a-kind. As you progress from level to level, a beautifully illustrated story is revealed, immersing you in an always-blossoming gaming experience. Mahjong Tales takes the classic table game to another dimension!">

This could be a bug in _IEDocReadHTML ($oIE).

Thanks for your help.

lod3n

I'd be surprised if _IEDocReadHTML has bugs, but the IE functions sort of rely on well-formed HTML, and that's really not such a bad thing.

You can use _INetGetSource instead, as it just gets the raw HTML that makes up the file.

#include <INet.au3>

_INetGetSource ( $s_URL )
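
Once the raw source is in a string, the description could be pulled out with a regular expression that reads up to the closing quote and greater-than, as the original question suggests. The following is only a minimal sketch: _GetMetaContent is a hypothetical helper (not part of IE.au3 or INet.au3), the sample string is a shortened copy of the tag from the first post, and the pattern assumes the name attribute comes before content. In practice $sHTML would be the string returned by _INetGetSource.

```autoit
; Hypothetical helper (an assumption, not a library function): returns the
; content of a named meta tag, reading up to the quote followed by ">"
; rather than stopping at the first stray quote.
Func _GetMetaContent($sHTML, $sName)
    ; (?is) = case-insensitive, dot matches newlines; (.*?) is lazy, so it
    ; stops at the first quote that is followed by optional space, "/" and ">"
    Local $sPattern = '(?is)<meta\s+name="' & $sName & '"\s+content="(.*?)"\s*/?>'
    Local $aMatch = StringRegExp($sHTML, $sPattern, 1) ; 1 = return array of matches
    If @error Then Return SetError(1, 0, "") ; tag not found
    Return $aMatch[0]
EndFunc

; Shortened sample of the badly quoted tag from the first post
Local $sHTML = '<meta name="description" content="Step-up to "Towers II" with over 430 animated tiles.">'
Local $sDesc = _GetMetaContent($sHTML, "description")
If Not @error Then ConsoleWrite($sDesc & @CRLF)
; prints: Step-up to "Towers II" with over 430 animated tiles.
```

This would still cut the text short if the description itself contained a quote immediately followed by a greater-than sign, so treat it as a starting point rather than a general HTML parser.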

Another thing to consider is running the HTML through a command-line version of Tidy:

http://www.w3.org/People/Raggett/tidy/

This will make the HTML well formed, but I don't know how well it will preserve a meta tag that looks like that.



empty75

Thanks, this seems to be working better; I just need to do a little string trimming.

Thanks.

Matthew.

DaleHohm

Please note that _IEDocReadHTML returns the document source after it has been rendered in the browser; client-side script (usually JavaScript) can change it on the fly. View Source and _INetGetSource return the source as delivered to the browser, prior to any client-side script manipulation.
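
To see the difference for a given page, the two sources can be fetched side by side and compared. This is just a rough sketch: the URL is a placeholder, and it assumes IE is available on the machine and that both calls succeed.

```autoit
#include <IE.au3>
#include <INet.au3>

Local $sURL = "http://www.example.com/" ; placeholder URL (an assumption)

; Source as delivered by the server, before the browser touches it
Local $sRaw = _INetGetSource($sURL)

; Source as handed back by IE after it has parsed and rendered the page
Local $oIE = _IECreate($sURL, 0, 0) ; 0, 0 = do not attach to an existing window, keep it hidden
Local $sRendered = _IEDocReadHTML($oIE)
_IEQuit($oIE)

If $sRaw == $sRendered Then
    ConsoleWrite("Sources are identical" & @CRLF)
Else
    ConsoleWrite("Sources differ: raw = " & StringLen($sRaw) & " chars, rendered = " & StringLen($sRendered) & " chars" & @CRLF)
EndIf
```

Writing both strings to files and diffing them shows exactly where the rendered version departs from what the server sent.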

Dale


