jksmurf Posted March 8, 2011 (edited)

I have an AutoIt script which "seems" to work. It downloads HTML pages for later parsing into XML for a TV EPG. It just goes to a website and saves consecutive pages (reached through JavaScript links, which actually DO get clicked) as HTML. However, when I look at the saved pages, some of them are duplicated.

What is more odd is that when I go to IE and click a date on the webpage, which uses JavaScript to generate the page for any given date, the page is correct ON SCREEN, but viewing the source (and also saving the page as .htm) does not show the same EPG data that I can SEE on screen. Saving the page as a text file seems to save the correct data, but then my XML-based EPG parser doesn't work.

I'm still very new to AutoIt and I'm stumped. Any ideas?

Thanks, k.

; MAIN: Open Setanta Website and Save Each Javascript Generated EPG File for TVxB to Process
;
; This script is to download a series of webpages for a TV EPG, that can be processed by TVxB, a "scraper" which uses wget.
; The TV EPG site to scrape uses Javascript, so wget doesn't work. This script:
; 1. Loads http://www.setanta.com/HongKong/TV-Listings/ which loads today's EPG.
; 2. Saves that web page to a local dir in the format TVxb-Setanta.hk-20110215.html so that TVxB can parse it.
; 3. Clicks the NEXT date, which uses Javascript in the form javascript:__doPostBack('ctl00$cphForm$AllCols$tvlHeader$rptDays$ctl02$btnDay','') to load the next day's page.
; 4. Saves that web page to a local dir in the format TVxb-Setanta.hk-20110216.html, Web Page, HTML Only so that TVxB can parse it.
; 5. Repeat
;
; You can use the InetGet or the InetRead function in .au3 to download files from websites, but the site needs Javascript so the page has to be opened instead.
;
#include <Date.au3>
#include <IE.au3>
$savedate = _NowCalcDate()
$odates = _NowCalcDate()
$WebTitle="NOT Internet Explorer"
;$savedate = StringLeft(_DateAdd( 'h',8, _NowCalc()),10); Adds 8 hours to GMT to get HK Current Time
; Only needed if offset in posted times but Setanta says HK Time ???
;
_IEErrorHandlerRegister()
ConsoleWrite("Debug: Main Routine Setanta Window" & @LF)
;
; With these two lines you can see it working **
$oIE = _IECreate()
_IENavigate($oIE,"http://www.setanta.com/HongKong/TV-Listings/"); _IE that navigates to the first page and waits until "done".
; Opens Setanta Website ...
;
; ** Alternatively with just these two lines it will work silently without opening IE Window
;$oIE = _IECreate("http://www.setanta.com/HongKong/TV-Listings/",0,0); Open WebPage (Was just $oIE = _IECreate())
;_IELoadWait($oIE); _IE loads page and waits until REALLY "DONE" to avoid "The Requested Resource Is in Use" Error Messages ...
; NOTE: Several IE.au3 functions call _IELoadWait() automatically (e.g. _IECreate(), _IENavigate() etc.).
;
$hIE = _IEPropertyGet($oIE, "hwnd"); get the 'handle' (hwnd) for the IE window opened.
;
_IENavigate($oIE,"javascript:__doPostBack('ctl00$cphForm$AllCols$tvlHeader$rptDays$ctl01$btnDay','')",0); Invoke Javascript command to get this day's program
SaveTVxBhtml() ; Call Save Subroutine, to save it and then move on to the next day using Javascript command; save it; etc etc
_IENavigate($oIE,"javascript:__doPostBack('ctl00$cphForm$AllCols$tvlHeader$rptDays$ctl02$btnDay','')",0)
SaveTVxBhtml()
_IENavigate($oIE,"javascript:__doPostBack('ctl00$cphForm$AllCols$tvlHeader$rptDays$ctl03$btnDay','')",0)
SaveTVxBhtml()
_IENavigate($oIE,"javascript:__doPostBack('ctl00$cphForm$AllCols$tvlHeader$rptDays$ctl04$btnDay','')",0)
SaveTVxBhtml()
_IENavigate($oIE,"javascript:__doPostBack('ctl00$cphForm$AllCols$tvlHeader$rptDays$ctl05$btnDay','')",0)
SaveTVxBhtml()
_IENavigate($oIE,"javascript:__doPostBack('ctl00$cphForm$AllCols$tvlHeader$rptDays$ctl06$btnDay','')",0)
SaveTVxBhtml()
_IENavigate($oIE,"javascript:__doPostBack('ctl00$cphForm$AllCols$tvlHeader$btnNextWeek','')",0) ; This line takes you to the NEXT week
; Then go all through the buttons again
Sleep(3000)
_IENavigate($oIE,"javascript:__doPostBack('ctl00$cphForm$AllCols$tvlHeader$rptDays$ctl00$btnDay','')",0)
SaveTVxBhtml()
_IENavigate($oIE,"javascript:__doPostBack('ctl00$cphForm$AllCols$tvlHeader$rptDays$ctl01$btnDay','')",0)
SaveTVxBhtml()
_IENavigate($oIE,"javascript:__doPostBack('ctl00$cphForm$AllCols$tvlHeader$rptDays$ctl02$btnDay','')",0)
SaveTVxBhtml()
_IENavigate($oIE,"javascript:__doPostBack('ctl00$cphForm$AllCols$tvlHeader$rptDays$ctl03$btnDay','')",0)
SaveTVxBhtml()
_IENavigate($oIE,"javascript:__doPostBack('ctl00$cphForm$AllCols$tvlHeader$rptDays$ctl04$btnDay','')",0)
SaveTVxBhtml()
_IENavigate($oIE,"javascript:__doPostBack('ctl00$cphForm$AllCols$tvlHeader$rptDays$ctl05$btnDay','')",0)
SaveTVxBhtml()
_IENavigate($oIE,"javascript:__doPostBack('ctl00$cphForm$AllCols$tvlHeader$rptDays$ctl06$btnDay','')",0)
SaveTVxBhtml()
;
; Finally, close the Webpages
;
WinActivate($hIE)
WinWaitActive($hIE)
;
_IEQuit($oIE)
;
Func SaveTVxBhtml()
    GetDateofCurrentlyLoadedPage()
    _IEErrorHandlerRegister()
    ConsoleWrite("Debug: SaveTVxBhtml Routine" & @LF)
    sleep(3000)
    $sHTML = _IEDocReadHTML($oIE)
    filewrite("C:\Users\Kristian\Desktop\SetantaCache\TVxb-setanta.hk-" & stringreplace($savedate,"/","") & ".htm",$sHTML)
    ConsoleWrite("SaveDate=" & $savedate & @CRLF)
    $savedate = _DateAdd("d",1,$savedate)
    sleep(2000)
EndFunc
;
Func GetDateofCurrentlyLoadedPage()
    ; Doesn't seem to work: when the source is viewed, 'class="selected"' always seems to default to today's date.
    ConsoleWrite("Debug: GetDateofCurrentlyLoadedPage Routine" & @LF)
    sleep(3000)
    $sHTML = _IEdocReadHTML ($oIE)
    $html = StringSplit($sHTML, @CRLF)
    for $line in $html
        if Stringinstr($line, 'class="selected"',0,1) then
            $odates = StringRight( StringTrimRight($line, 13), 3)
            msgbox(0,"", $odates)
            exitloop
        endif
    NEXT
    ;ConsoleWrite("DateCurrPage=" & $odates & @CRLF)
EndFunc

Attachment: SetantapBSwithUpdatePost18Mod3.au3

Edited March 8, 2011 by jksmurf

jksmurf Posted March 8, 2011 (edited)

I discovered this page, and when I run the JS on the page it now SHOWS the CORRECT HTML source. Would anyone be able to help code this in AutoIt?

http://javascript.about.com/library/blsource.htm

javascript:(function(){c=unescape(document.documentElement.innerHTML);c=c.replace(/&/g,'&amp;');c=c.replace(/</g,'&lt;');c=c.replace(/>/g,'&gt;');c=c.replace(/</g,'&lt;');c=c.replace(/>/g,'&gt;');document.write('<html><head><title>Source of Page<\/title><\/head><body><pre>'+c+'<\/pre><\/body><\/html>');x.document.close();})();

Edited March 8, 2011 by jksmurf

jvanegmond Posted March 8, 2011

The reason behind what you described is simple. In short, this is how it works:

Source code as sent by the website -> Javascript/DOM -> Source code as visible on your screen.

If you click "View source" in IE, you are asking for the source code as sent by the website. If you use a tool such as Chrome/Firefox and do "Inspect element", you will get the source code as visible on your screen.

I don't know exactly what your issue is, since your post is rather wall-of-texty, but maybe this information will help you figure out the actual problem.

github.com/jvanegmond

DaleHohm Posted March 8, 2011

Be careful here... Manadar is correct that IE View Source will show the HTML source before client-side processing. DOM inspectors (like DebugBar and the Firefox/Chrome tools mentioned) typically allow you to see either the original HTML or the HTML after client processing.

However, the _IEDocReadHTML function used in this script reads the HTML AFTER client-side processing, so if I understand the question properly, this is not the source of your trouble.

I suggest you try to simplify your example. In the process you may find the problem, or at least you will give us something more straightforward to deal with when trying to help you.

Dale

Free Internet Tools: DebugBar, AutoIt IE Builder, HTTP UDF, MODIV2, IE Developer Toolbar, IEDocMon, Fiddler, HTML Validator, WGet, curl
MSDN docs: InternetExplorer Object, Document Object, Overviews and Tutorials, DHTML Objects, DHTML Events, WinHttpRequest, XmlHttpRequest, Cross-Frame Scripting, Office object model
Automate input type=file (Related)
Alternative to _IECreateEmbedded? better: _IECreatePseudoEmbedded Better Better?
IE.au3 issues with Vista - Workarounds
SciTe Debug mode - it's magic: #AutoIt3Wrapper_run_debug_mode=Y
"Doesn't work" needs to be ripped out of the troubleshooting lexicon. It means that what you tried did not produce the results you expected. It begs the questions 1) what did you try?, 2) what did you expect? and 3) what happened instead?
Reproducer: a small (the smallest?) piece of stand-alone code that demonstrates your trouble
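
The difference described in the two replies above can be checked directly from AutoIt. What follows is only a minimal sketch, assuming the listings URL from the original post is still reachable; the variable names are illustrative. It fetches the page once with _INetGetSource (the HTML as sent by the server) and once through IE with _IEDocReadHTML (the document after client-side scripts have run), then prints the length of each so the two can be compared.

```autoit
#include <IE.au3>
#include <Inet.au3>

; URL taken from the original post; assumed to be reachable.
Local $sUrl = "http://www.setanta.com/HongKong/TV-Listings/"

; HTML exactly as the server sends it (what "View Source" shows).
Local $sRaw = _INetGetSource($sUrl)

; HTML read back from the live document after client-side processing.
Local $oIE = _IECreate($sUrl) ; _IECreate waits for the page to load by default
Local $sDom = _IEDocReadHTML($oIE)

ConsoleWrite("Server source length: " & StringLen($sRaw) & @CRLF)
ConsoleWrite("DOM source length:    " & StringLen($sDom) & @CRLF)

_IEQuit($oIE)
```

If the two lengths differ noticeably, scripts on the page have changed the document after it was delivered.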

jksmurf Posted March 9, 2011

Quote: "However, the _IEDocReadHTML function used in this script reads the HTML AFTER client-side processing, so if I understand the question properly, this is not the source of your trouble."

Thank you both for the explanation. I subsequently found the help file http://www.autoitscript.com/autoit3/docs/libfunctions/_IEDocReadHTML.htm which says exactly what you folks have noted above, i.e. "This function returns the document source AFTER any client-side modifications (e.g. by AutoIt or by client-side Javascript). It may therefore be different than what is shown by the browser View Source or by _INetGetSource."

Sorry for the "wall of texty"! I understand the reason now, but not why my script won't actually (always) produce the correct HTML AFTER client-side processing. I do use _IEDocReadHTML to save my webpages after running the JS on them, so it "should" be giving me "the document source AFTER any client-side modifications", but for some reason the JS works for the first few pages only, then starts duplicating a few of them as if the JS did not run at all.

$sHTML = _IEDocReadHTML($oIE)

I will, as suggested, try to simplify the example and come back on this.

Thank you once again, k.

kylomas Posted March 9, 2011 (edited)

jksmurf,

Watch the value of $savedate in the console window as the script runs. The variable's value repeats across multiple pages, therefore you are overwriting some files, since you use this variable to construct the file name. The web page navigation looks like it is working perfectly.

kylomas

Edit: INCORRECT... the dates are correct as displayed. However, I would still look at the file save routines to make sure that you are not overwriting files.

Edited March 9, 2011 by kylomas

Forum Rules / Procedure for posting code
"I like pigs. Dogs look up to us. Cats look down on us. Pigs treat us as equals." - Sir Winston Churchill

kylomas Posted March 9, 2011

jksmurf,

I just ran this after changing the file names to c:\test- & $savedate & .htm. I looked briefly at each output file and it appears to be working perfectly.

kylomas

jksmurf Posted March 9, 2011

Hi kylomas,

Whoa! You're onto something! I was about to say I was pretty sure the files are not being overwritten, but then I wiped the output files and ran it a few times in succession. Lo and behold, the files grow in size, about 40~50k each time, like they are being appended to.

What would cause that, in my script? FileWrite? I need to find a way to say purge and overwrite.

Thank you! k.

kylomas Posted March 9, 2011

jksmurf,

Just wrote a small parser and verified that the pages are all accurate.

Quote: "What would cause that, in my script? Filewrite? I need to find a way to say purge and overwrite."

Just open the file in write mode:

fileopen("the_file",2)

kylomas
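
To spell out the suggestion above: when FileWrite is given a filename instead of a file handle, it appends to an existing file, which matches the growing files described earlier in the thread. Opening the file with FileOpen in mode 2 erases the previous contents first. The following is a minimal sketch of an overwriting save; the helper name _WriteFresh and the example path are hypothetical and not part of anyone's script.

```autoit
#include <FileConstants.au3>

; Hypothetical helper: write $sText to $sPath, replacing any existing contents.
; Returns True on success; on failure returns False and sets @error to 1.
Func _WriteFresh($sPath, $sText)
    Local $hFile = FileOpen($sPath, $FO_OVERWRITE) ; $FO_OVERWRITE = 2 (write mode, erase previous contents)
    If $hFile = -1 Then Return SetError(1, 0, False)
    FileWrite($hFile, $sText)
    FileClose($hFile)
    Return True
EndFunc

; Usage: running this twice leaves the file with a single copy of the text.
If Not _WriteFresh(@TempDir & "\epg-test.htm", "<html></html>") Then
    ConsoleWrite("Could not open the output file" & @CRLF)
EndIf
```

Checking the FileOpen return value matters here because a missing output directory would otherwise fail silently.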

kylomas Posted March 10, 2011 (edited)

jksmurf,

I modified your code as follows to correct the way you are creating the output files, and got rid of some of the "noise" (all the repetitious comments; personal preference).

;
;
;
#include <Date.au3>
#include <IE.au3>
$savedate = _NowCalcDate()
$WebTitle = "NOT Internet Explorer"
$comphtml = ''
_IEErrorHandlerRegister()
$oIE = _IECreate()
_IENavigate($oIE, "http://www.setanta.com/HongKong/TV-Listings/")
$hIE = _IEPropertyGet($oIE, "hwnd"); get the 'handle' (hwnd) for the IE window opened.
_IENavigate($oIE, "javascript:__doPostBack('ctl00$cphForm$AllCols$tvlHeader$rptDays$ctl01$btnDay','')", 0); Invoke Javascript command to get this day's program
SaveTVxBhtml_first()
_IENavigate($oIE, "javascript:__doPostBack('ctl00$cphForm$AllCols$tvlHeader$rptDays$ctl02$btnDay','')", 0)
SaveTVxBhtml()
_IENavigate($oIE, "javascript:__doPostBack('ctl00$cphForm$AllCols$tvlHeader$rptDays$ctl03$btnDay','')", 0)
SaveTVxBhtml()
_IENavigate($oIE, "javascript:__doPostBack('ctl00$cphForm$AllCols$tvlHeader$rptDays$ctl04$btnDay','')", 0)
SaveTVxBhtml()
_IENavigate($oIE, "javascript:__doPostBack('ctl00$cphForm$AllCols$tvlHeader$rptDays$ctl05$btnDay','')", 0)
SaveTVxBhtml()
_IENavigate($oIE, "javascript:__doPostBack('ctl00$cphForm$AllCols$tvlHeader$rptDays$ctl06$btnDay','')", 0)
SaveTVxBhtml()
_IENavigate($oIE, "javascript:__doPostBack('ctl00$cphForm$AllCols$tvlHeader$btnNextWeek','')", 0) ; This line takes you to the NEXT week
; Then go all through the buttons again
Sleep(3000)
_IENavigate($oIE, "javascript:__doPostBack('ctl00$cphForm$AllCols$tvlHeader$rptDays$ctl00$btnDay','')", 0)
SaveTVxBhtml()
_IENavigate($oIE, "javascript:__doPostBack('ctl00$cphForm$AllCols$tvlHeader$rptDays$ctl01$btnDay','')", 0)
SaveTVxBhtml()
_IENavigate($oIE, "javascript:__doPostBack('ctl00$cphForm$AllCols$tvlHeader$rptDays$ctl02$btnDay','')", 0)
SaveTVxBhtml()
_IENavigate($oIE, "javascript:__doPostBack('ctl00$cphForm$AllCols$tvlHeader$rptDays$ctl03$btnDay','')", 0)
SaveTVxBhtml()
_IENavigate($oIE, "javascript:__doPostBack('ctl00$cphForm$AllCols$tvlHeader$rptDays$ctl04$btnDay','')", 0)
SaveTVxBhtml()
_IENavigate($oIE, "javascript:__doPostBack('ctl00$cphForm$AllCols$tvlHeader$rptDays$ctl05$btnDay','')", 0)
SaveTVxBhtml()
_IENavigate($oIE, "javascript:__doPostBack('ctl00$cphForm$AllCols$tvlHeader$rptDays$ctl06$btnDay','')", 0)
SaveTVxBhtml()
;
; Finally, close the Webpages
;
WinActivate($hIE)
WinWaitActive($hIE)
;
_IEQuit($oIE)
;
Func SaveTVxBhtml_first()
    ConsoleWrite("Debug: SaveTVxBhtml Routine" & @LF)
    $sHTML = _IEbodyReadHTML($oIE)
    local $fl = fileopen("d:\tvbtest\test-" & StringReplace($savedate, "/", ""),2)
    if $fl = -1 then msgbox(0,'','Fileopen failed for file = ' & "d:\tvbtest\test-" & StringReplace($savedate, "/", ""))
    FileWrite($fl, $sHTML)
    fileclose($fl)
    ConsoleWrite("SaveDate=" & $savedate & @CRLF)
    $savedate = _DateAdd("d", 1, $savedate)
    $comphtml = $shtml
EndFunc ;==>SaveTVxBhtml_first

Func SaveTVxBhtml()
    ConsoleWrite("Debug: SaveTVxBhtml Routine" & @LF)
    local $t = timerinit()
    $sHTML = _IEbodyReadHTML($oIE)
    ;sleep(3000)
    while $comphtml = $shtml
        $sHTML = _IEbodyReadHTML($oIE)
        sleep(100)
    wend
    consolewrite('Time to get new html = ' & timerdiff($t) & @lf)
    local $fl = fileopen("d:\tvbtest\test-" & StringReplace($savedate, "/", ""),2)
    if $fl = -1 then msgbox(0,'','Fileopen failed for file = ' & "d:\tvbtest\test-" & StringReplace($savedate, "/", ""))
    FileWrite($fl, $sHTML)
    fileclose($fl)
    ConsoleWrite("SaveDate=" & $savedate & @CRLF)
    $savedate = _DateAdd("d", 1, $savedate)
    $comphtml = $shtml
EndFunc ;==>SaveTVxBhtml
;

The code only works when there is a Sleep(3000) between the _IENavigate calls. Since this could present timing issues, I tried to find a way to detect when the page is updated by comparing the last doc body to the current doc body (_IEBodyReadHTML). This does NOT work. I get all duplicate files until the Sleep(3000) occurs at the week change, followed by all duplicate files after the week change.

I read most of the IE doc, but my lack of understanding of DHTML behaviours is beating me. If none of the experts can help, you can still go with the Sleep. My perception of a timing issue might be because of my shitty DSL connection.

I also have parsing code to create a 2X (col by row) table that I have been using to verify results. You are welcome to this code, just PM me.

kylomas

Edited March 10, 2011 by kylomas
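
The polling loop in the post above compares the previous body HTML with the current one but puts no upper bound on the wait. The following is a minimal sketch of the same comparison with a timeout added, so that a page which never changes is reported through @error instead of being saved as a duplicate. The helper name _WaitForBodyChange and the 15 second default are assumptions for illustration. It has not been run against the Setanta site, and since kylomas reports that the comparison itself did not work in his test, it should be read as a diagnostic aid and not as a fix.

```autoit
#include <IE.au3>

; Hypothetical helper: poll the document until its body HTML differs from
; $sPrevious, or give up after $iTimeoutMs milliseconds.
; Returns the new HTML on success. On timeout it sets @error to 1 and
; returns the last HTML that was read.
Func _WaitForBodyChange($oIE, $sPrevious, $iTimeoutMs = 15000)
    Local $hTimer = TimerInit()
    Local $sCurrent = _IEBodyReadHTML($oIE)
    While $sCurrent == $sPrevious ; == is a case-sensitive string comparison
        If TimerDiff($hTimer) > $iTimeoutMs Then Return SetError(1, 0, $sCurrent)
        Sleep(250)
        $sCurrent = _IEBodyReadHTML($oIE)
    WEnd
    Return $sCurrent
EndFunc
```

A caller would pass the HTML saved for the previous day as $sPrevious and check @error before writing the next file, for example `$sHTML = _WaitForBodyChange($oIE, $comphtml)` followed by `If @error Then ConsoleWrite("Page did not change" & @CRLF)`.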

jksmurf Posted March 10, 2011

Hi kylomas,

Whew, I thought I was the only one who couldn't work out why it didn't seem to play very nicely, even though there seem to be quite a few different functions which say they wait for the page to load. I sort of played around with _IELoadWait($oIE) earlier on too.

I recently changed _IENavigate($oIE, ... to $oIE.document.Parentwindow.execScript ( ... thinking that would better handle the JavaScript. I was looking for some code that forces an automatic wait until the JavaScript had done its thing.

I'm sorry if all this sounds very untechnical, but I'm just sort of making it up as I go!

I'll PM you on the code, thanks!

k.
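
The execScript variant mentioned above also makes it easy to replace the repeated per-day calls with a loop, because the day buttons differ only in the ctl00 to ctl06 part of the postback target. The following is a minimal sketch under stated assumptions: the helper name _ClickDay is hypothetical, the target string is copied from the scripts in this thread, and the loop starts at button 0 although the original script starts the first week at ctl01. Whether _IELoadWait is enough after the postback depends on whether the postback reloads the whole page, which the thread does not establish.

```autoit
#include <IE.au3>

; Hypothetical helper: trigger the postback for day button $iDay (0 to 6).
Func _ClickDay($oIE, $iDay)
    ; Build ctl00 .. ctl06 from the day index
    Local $sTarget = "ctl00$cphForm$AllCols$tvlHeader$rptDays$ctl" & StringFormat("%02d", $iDay) & "$btnDay"
    $oIE.document.parentWindow.execScript("__doPostBack('" & $sTarget & "','')")
    _IELoadWait($oIE) ; may return early if the page updates without a full reload
EndFunc

Local $oIE = _IECreate("http://www.setanta.com/HongKong/TV-Listings/")
For $i = 0 To 6
    _ClickDay($oIE, $i)
    ConsoleWrite("Requested day button " & $i & @CRLF)
    ; Save the page here, for example from _IEDocReadHTML($oIE)
Next
_IEQuit($oIE)
```

Keeping the target string in one place also means a change in the site's control names only has to be fixed once.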

kylomas Posted March 11, 2011 (edited)

jksmurf,

Did you get this to work?

kylomas

Been at this entirely toooo long... good night

Edited March 11, 2011 by kylomas