empty75 Posted June 5, 2007

I have a piece of code that downloads a webpage and then extracts several meta tags:

CODE
$oTags = _IETagNameGetCollection($oIE, "META") ; Search for all meta tags
For $otheTag In $oTags
    If $otheTag.Name = "description" Then ; Find the description meta tag
        $fnDesc = $otheTag.content ; Get the data
    EndIf
    If $otheTag.Name = "keywords" Then
        $keyWords = $otheTag.content
    EndIf
Next

All goes well if the meta tags are correctly coded on the webpage, but if a quote is in the wrong place, the above fails and only returns the text up to the first quote.

CODE
<meta name="description" content="Step-up to Towers II with over 430 animated tiles. Enjoy 500 awesome and unique layouts with up to nine levels tall. Also download up to 100 new layouts every day. Use the board editor to create custom tile layouts that can be shared with players around the world.">

If the description tag is as above, all is well. Sometimes, however, the tag looks like the following, in which case my code returns only "Step-up to ".

CODE
<meta name="description" content="Step-up to "Towers II" with over 430 animated tiles. Enjoy 500 awesome and unique layouts with up to nine levels tall. Also download up to 100 new layouts every day. Use the board editor to create custom tile layouts that can be shared with players around the world.">

Is there any way to extend the IE.au3 library to take this into account, or is there other code that can read the whole tag, assuming that the webmaster has finished the tag with the final "> (quote, greater-than)?

Thanks.
Matthew.

lod3n Posted June 5, 2007

CODE
#include <String.au3>
$arrayOfMetatags = _StringBetween($sHTML, "<meta", ">")

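
To show how the _StringBetween suggestion could be applied to the question, here is a minimal sketch. It assumes $sHTML holds the raw page source and, as the original poster stipulates, that the tag really does end with ">. The helper name _GetMetaContent is hypothetical and is not part of IE.au3 or String.au3. It also assumes the name attribute is written in quotes, so it will not match source where the quotes have been stripped.

```autoit
#include <String.au3>

; Hypothetical helper, not part of any standard UDF.
; Returns everything after content=" in the first <meta ...> tag whose
; name attribute equals $sName, assuming the tag ends with the two characters ">
Func _GetMetaContent($sHTML, $sName)
    ; Each element is the text between "<meta" and the next '">'
    Local $aTags = _StringBetween($sHTML, "<meta", '">')
    If @error Then Return SetError(1, 0, "") ; no meta tags found

    Local $sMarker = 'content="'
    Local $iPos = 0
    For $i = 0 To UBound($aTags) - 1
        ; StringInStr is case-insensitive by default
        If StringInStr($aTags[$i], 'name="' & $sName & '"') Then
            $iPos = StringInStr($aTags[$i], $sMarker)
            ; Keep everything after the marker, stray quotes included
            If $iPos Then Return StringMid($aTags[$i], $iPos + StringLen($sMarker))
        EndIf
    Next
    Return SetError(2, 0, "") ; no tag with that name
EndFunc   ;==>_GetMetaContent
```

Usage would look like $fnDesc = _GetMetaContent($sHTML, "description"). Because the split happens on "> rather than on the first quote, quotes inside the description survive. A tag closed with " /> or with a space before the > would not be handled by this sketch.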
empty75 Posted June 6, 2007

Thanks for this. I have looked into it, but I have now discovered that the webpages with the invalid meta tags are more messed up than I first thought. Viewing the source in a browser, it looks only slightly wrong, with quotes in the wrong place. As a test I read the HTML source using

CODE
$srcHTML = _IEDocReadHTML($oIE)

and the meta tag in this string is totally mucked up on some webpages, sometimes with the meta tag name sitting in the middle of the meta tag content. There is another page I can extract the description tag from, but it is a much briefer description than the one I would like.

I have just done a few more tests, and it looks like _IEDocReadHTML($oIE) returns a different source from _INetGetSource.

Using $srcHTML = _IEDocReadHTML($oIE) and writing the result to a text file, I find that this is the tag I want:

CODE
<META content="Embark on a journey through ancient China, in this classic story-filled mahjong adventure. The stunning gallery-quality art, unique game modes, web-integrated play along with numerous other mahjong firsts, make " name=description dimension!? another to game table classic the takes Tales Mahjong experience. gaming always-blossoming an in you immersing revealed, is story illustrated beautifully a level, level from progress As one-of-a-kind. Tales?>

Viewing the source of the same webpage with Firefox/IE7, the same tag looks like:

CODE
<meta name="description" content="Embark on a journey through ancient China, in this classic story-filled mahjong adventure. The stunning gallery-quality art, unique game modes, web-integrated play along with numerous other mahjong firsts, make "Tales" one-of-a-kind. As you progress from level to level, a beautifully illustrated story is revealed, immersing you in an always-blossoming gaming experience. Mahjong Tales takes the classic table game to another dimension!">

Using _INetGetSource with the same URL we get:

CODE
<meta name="description" content="Embark on a journey through ancient China, in this classic story-filled mahjong adventure. The stunning gallery-quality art, unique game modes, web-integrated play along with numerous other mahjong firsts, make "Tales" one-of-a-kind. As you progress from level to level, a beautifully illustrated story is revealed, immersing you in an always-blossoming gaming experience. Mahjong Tales takes the classic table game to another dimension!">

Could be a bug in _IEDocReadHTML($oIE).

Thanks for your help.

lod3n Posted June 6, 2007

I'd be surprised if _IEDocReadHTML has bugs, but the IE functions sort of rely on well-formed HTML, and that's really not such a bad thing.

You can use _INetGetSource instead, as it just gets the raw HTML that makes up the file.

CODE
#include <INet.au3>
_INetGetSource($s_URL)

Another thing to consider is running the HTML through a command-line version of Tidy:
http://www.w3.org/People/Raggett/tidy/

This will make the HTML well formed, but I don't know how well it will preserve a meta tag that looks like that.

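
A rough sketch of the _INetGetSource route, for illustration only. It assumes the tag is written as name="description" followed by content="..." in that order, as in the raw source quoted earlier in the thread. The function names _DescriptionFromSource and _FetchDescription are made up for this example. The regular expression is lazy but only stops at a quote that is directly followed by the end of the tag, so quotes inside the description are kept.

```autoit
#include <INet.au3>

; Hypothetical helper: pull the description out of raw HTML source.
; Assumes the attribute order name="description" content="..."
Func _DescriptionFromSource($sHTML)
    ; (?is) = case-insensitive, dot matches newlines.
    ; The capture ends only at a quote followed by optional whitespace, optional / and >
    Local $aMatch = StringRegExp($sHTML, '(?is)<meta\s+name="description"\s+content="(.*?)"\s*/?>', 1)
    If @error Then Return SetError(1, 0, "") ; no matching tag
    Return $aMatch[0]
EndFunc   ;==>_DescriptionFromSource

; Hypothetical wrapper: download the raw source, then extract the description.
Func _FetchDescription($sURL)
    Local $sHTML = _INetGetSource($sURL)
    If @error Or $sHTML = "" Then Return SetError(2, 0, "") ; download failed
    Local $sDesc = _DescriptionFromSource($sHTML)
    Return SetError(@error, 0, $sDesc)
EndFunc   ;==>_FetchDescription
```

Calling $fnDesc = _FetchDescription($s_URL) and checking @error afterwards would replace the _IETagNameGetCollection loop for the description tag. Pages that put content before name, or that drop the quotes around the name, would need a different pattern.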
empty75 Posted June 8, 2007

Thanks, this seems to be working better. I just need to do a little string trimming.

Thanks.
Matthew.

DaleHohm Posted June 12, 2007

Please note that _IEDocReadHTML will return the document source after it has been rendered in the browser (client-side script, usually JavaScript, can change it on the fly). View Source and _INetGetSource return the source delivered to the browser prior to client-side script manipulation.

Dale
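
To see this difference for a given page, the two sources can be fetched and compared. The following is only a sketch: _FirstDifference and _CompareSources are hypothetical names, and the comparison is a plain character-by-character check. Since the rendered source usually differs from the delivered one, expect the reported position to be early in the document.

```autoit
#include <IE.au3>
#include <INet.au3>

; Hypothetical helper: 1-based position of the first differing character,
; or 0 if the two strings are identical (case-sensitive).
Func _FirstDifference($sA, $sB)
    Local $iMin = StringLen($sA)
    If StringLen($sB) < $iMin Then $iMin = StringLen($sB)
    For $i = 1 To $iMin
        If Not (StringMid($sA, $i, 1) == StringMid($sB, $i, 1)) Then Return $i
    Next
    ; One string is a prefix of the other
    If StringLen($sA) <> StringLen($sB) Then Return $iMin + 1
    Return 0
EndFunc   ;==>_FirstDifference

; Hypothetical wrapper: compare the delivered source with the rendered source.
Func _CompareSources($sURL)
    Local $sDelivered = _INetGetSource($sURL) ; source as sent by the server
    Local $oIE = _IECreate($sURL, 0, 0) ; new, invisible browser window
    If @error Then Return SetError(1, 0, -1)
    Local $sRendered = _IEDocReadHTML($oIE) ; source after rendering
    _IEQuit($oIE)
    Return _FirstDifference($sDelivered, $sRendered)
EndFunc   ;==>_CompareSources
```

A return value of 0 from _CompareSources($s_URL) would mean the two sources are identical. Any other positive value is the character position where they first diverge.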