
Viewing HTML source in IE differs from what shows in the browser



I have an AutoIt script which "seems" to work: it downloads HTML pages so that they can later be parsed into XML for a TV EPG.

It just goes to a website and saves consecutive HTML pages (reached by JavaScript links, which actually DO get clicked) as HTML.

However, when I look at the saved pages, some of them are duplicated.

But what is more odd is that when I go to IE and click a date on the webpage, which uses JavaScript to generate the page for any given date, the page is correct ON SCREEN, but the source shown by View Source (and the saved .htm) does not contain the same EPG data that I can SEE on screen??? Saving the page as a text file seems to save the correct data, but then my XML-based EPG parser doesn't work. I'm still very new to AutoIt and I'm stumped. Any ideas?

Thanks,

k.

; MAIN: Open Setanta Website and Save Each Javascript Generated EPG File for TVxB to Process
;
; This script is to download a series of webpages for a TV EPG, that can be processed by TVxB, a "scraper" which uses wget.
; The TV EPG site to scrape uses Javascript, so wget doesn't work. This script:
; 1. Loads http://www.setanta.com/HongKong/TV-Listings/ which loads today's EPG.
; 2. Saves that web page to a local dir in the format TVxb-Setanta.hk-20110215.html so that TVxB can parse it.
; 3. Clicks the NEXT date which uses Javascript in the form javascript:__doPostBack('ctl00$cphForm$AllCols$tvlHeader$rptDays$ctl02$btnDay','') to load the next day's page.
; 4. Saves that web page to a local dir in the format TVxb-Setanta.hk-20110216.html, Web Page, HTML Only so that TVxB can parse it.
; 5. Repeats.
;
; You can use the InetGet or the InetRead function in .au3 to download files from websites, but this site needs Javascript, so the page has to be opened in IE instead.
;
#include <Date.au3>
#include <IE.au3>
$savedate = _NowCalcDate()
$odates = _NowCalcDate()
$WebTitle="NOT Internet Explorer"
;$savedate = StringLeft(_DateAdd( 'h',8, _NowCalc()),10); Adds 8 hours to GMT to get HK Current Time
; Only needed if offset in posted times but Setantas Says HK Time ???
;
_IEErrorHandlerRegister() 
ConsoleWrite("Debug: Main Routine Setanta Window" & @LF)
;
; With these two lines you can see it working **
$oIE = _IECreate()
_IENavigate($oIE,"http://www.setanta.com/HongKong/TV-Listings/"); _IE that navigates to the first page and waits until "done". ; Opens Setanta Website ...
;
; ** Alternatively with just these two lines it will work silently without opening IE Window
;$oIE = _IECreate("http://www.setanta.com/HongKong/TV-Listings/",0,0); Open WebPage (Was just $oIE = _IECreate())
;_IELoadWait($oIE); _IE  loads page and waits until REALLY "DONE" to avoid "The Requested Resource Is in Use" Error Messages ...
; NOTE: Several IE.au3 functions call _IELoadWait() automatically (e.g. _IECreate(), _IENavigate() etc.). 
;
$hIE = _IEPropertyGet($oIE, "hwnd"); get the 'handle' (hwnd) for the IE window opened.
;
_IENavigate($oIE,"javascript:__doPostBack('ctl00$cphForm$AllCols$tvlHeader$rptDays$ctl01$btnDay','')",0); Invoke Javascript command to get this days' program
SaveTVxBhtml() ; Call Save Subroutine, to save it and then Move onto the next day using Javascript command; Save it; etc etc
_IENavigate($oIE,"javascript:__doPostBack('ctl00$cphForm$AllCols$tvlHeader$rptDays$ctl02$btnDay','')",0)
SaveTVxBhtml()
_IENavigate($oIE,"javascript:__doPostBack('ctl00$cphForm$AllCols$tvlHeader$rptDays$ctl03$btnDay','')",0)
SaveTVxBhtml()
_IENavigate($oIE,"javascript:__doPostBack('ctl00$cphForm$AllCols$tvlHeader$rptDays$ctl04$btnDay','')",0)
SaveTVxBhtml()
_IENavigate($oIE,"javascript:__doPostBack('ctl00$cphForm$AllCols$tvlHeader$rptDays$ctl05$btnDay','')",0)
SaveTVxBhtml()
_IENavigate($oIE,"javascript:__doPostBack('ctl00$cphForm$AllCols$tvlHeader$rptDays$ctl06$btnDay','')",0)
SaveTVxBhtml()
_IENavigate($oIE,"javascript:__doPostBack('ctl00$cphForm$AllCols$tvlHeader$btnNextWeek','')",0) ; This line Takes you to the NEXT Week
; Then go all through the buttons again
Sleep(3000)
_IENavigate($oIE,"javascript:__doPostBack('ctl00$cphForm$AllCols$tvlHeader$rptDays$ctl00$btnDay','')",0)
SaveTVxBhtml()
_IENavigate($oIE,"javascript:__doPostBack('ctl00$cphForm$AllCols$tvlHeader$rptDays$ctl01$btnDay','')",0)
SaveTVxBhtml()
_IENavigate($oIE,"javascript:__doPostBack('ctl00$cphForm$AllCols$tvlHeader$rptDays$ctl02$btnDay','')",0)
SaveTVxBhtml() 
_IENavigate($oIE,"javascript:__doPostBack('ctl00$cphForm$AllCols$tvlHeader$rptDays$ctl03$btnDay','')",0)
SaveTVxBhtml()
_IENavigate($oIE,"javascript:__doPostBack('ctl00$cphForm$AllCols$tvlHeader$rptDays$ctl04$btnDay','')",0)
SaveTVxBhtml()
_IENavigate($oIE,"javascript:__doPostBack('ctl00$cphForm$AllCols$tvlHeader$rptDays$ctl05$btnDay','')",0)
SaveTVxBhtml()
_IENavigate($oIE,"javascript:__doPostBack('ctl00$cphForm$AllCols$tvlHeader$rptDays$ctl06$btnDay','')",0)
SaveTVxBhtml()
;
; Finally, close the Webpages
;
WinActivate($hIE)
WinWaitActive($hIE)
;
_IEQuit ($oIE)
;
Func SaveTVxBhtml()
GetDateofCurrentlyLoadedPage()
_IEErrorHandlerRegister() 
ConsoleWrite("Debug: SaveTVxBhtml Routine" & @LF)
sleep(3000)
$sHTML = _IEDocReadHTML($oIE)
filewrite("C:\Users\Kristian\Desktop\SetantaCache\TVxb-setanta.hk-" & stringreplace($savedate,"/","") & ".htm",$sHTML)
ConsoleWrite("SaveDate=" & $savedate & @CRLF)
$savedate = _DateAdd("d",1,$savedate)
sleep(2000)
EndFunc
;
Func GetDateofCurrentlyLoadedPage() ; Doesn't seem to work as when source is viewed 'class="selected"' always seems to default to todays date.
ConsoleWrite("Debug: GetDateofCurrentlyLoadedPage Routine" & @LF)
sleep(3000)
$sHTML = _IEdocReadHTML ($oIE)
$html = StringSplit($sHTML, @CRLF)
for $line in $html
 if Stringinstr($line, 'class="selected"',0,1) then
  $odates = StringRight( StringTrimRight($line, 13), 3)
  msgbox(0,"", $odates)
  exitloop
 endif
NEXT
;ConsoleWrite("DateCurrPage=" & $odates & @CRLF)
EndFunc

SetantapBSwithUpdatePost18Mod3.au3


I discovered this page, and when I run its JS on the listings page it now SHOWS the CORRECT HTML source. Would anyone be able to help code this in AutoIt?

http://javascript.about.com/library/blsource.htm

javascript:(function(){c=unescape(document.documentElement.innerHTML);c=c.replace(/&/g,'&amp;');c=c.replace(/</g,'&lt;');c=c.replace(/>/g,'&gt;');document.write('<html><head><title>Source of Page<\/title><\/head><body><pre>'+c+'<\/pre><\/body><\/html>');document.close();})();


The reason behind what you described is simple. In short, this is how it works:

Source code as sent by the website -> Javascript/DOM -> Source code as visible on your screen.

If you click "View source" in IE, you are asking for the source code as sent by the website. If you use a tool such as Chrome/Firefox and do "Inspect element", you will get the source code as visible on your screen.

I don't know exactly what your issue is, since your post is rather wall-of-texty, but maybe this information will help you figure out the actual problem.


Be careful here...

Manadar is correct that IE View Source will show the HTML source before client-side processing. DOM inspectors (like DebugBar and the Firefox/Chrome tools mentioned) typically allow you to see either the original HTML or the HTML after client processing. However, the _IEDocReadHTML function used in this script reads the HTML AFTER client-side processing, so if I understand the question properly, this is not the source of your trouble.

I suggest you try to simplify your example. In the process you may find the problem, or at least you will give us something more straightforward to deal with when trying to help you.

Dale


However, the _IEDocReadHTML function used in this script reads the HTML AFTER client-side processing, so if I understand the question properly, this is not the source of your trouble.

Thank you both for the explanation. I subsequently found the help file http://www.autoitscript.com/autoit3/docs/libfunctions/_IEDocReadHTML.htm which says exactly what you folks have noted above, i.e.

"This function returns the document source AFTER any client-side modifications (e.g. by AutoIt or by client-side Javascript). It may therefore be different than what is shown by the browser View Source or by _INetGetSource."
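The difference the help file describes can be observed directly. The following is a minimal sketch, assuming the listings URL from the script earlier in the thread is still reachable and that `Inet.au3` (which provides `_INetGetSource`) is available; it only compares the two strings and makes no claim about what they contain.

```autoit
#include <IE.au3>
#include <Inet.au3>

; URL taken from the script earlier in the thread.
Local $sURL = "http://www.setanta.com/HongKong/TV-Listings/"

; Raw markup as sent by the server, before any JavaScript runs.
Local $sRaw = _INetGetSource($sURL)

; Markup serialized from the live DOM, after client-side scripts have run.
Local $oIE = _IECreate($sURL)
Local $sDom = _IEDocReadHTML($oIE)
_IEQuit($oIE)

ConsoleWrite("Raw source length: " & StringLen($sRaw) & @CRLF)
ConsoleWrite("DOM source length: " & StringLen($sDom) & @CRLF)

; == is the case-sensitive string comparison in AutoIt.
If Not ($sRaw == $sDom) Then ConsoleWrite("The two sources differ." & @CRLF)
```

If the two lengths differ, the saved file from `_IEDocReadHTML` is the post-script version, which is what the parser is supposed to receive.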

Sorry for the "wall of texty" :)! I understand the reason now, but not why my script won't (always) actually produce the correct HTML AFTER client-side processing.

I do use _IEDocReadHTML to save my webpages after running the JS on them, so it "should" be giving me "the document source AFTER any client-side modifications", but for some reason the JS works for the first few pages only, then starts duplicating a few of them as if the JS did not run at all.

$sHTML = _IEDocReadHTML($oIE)

I will as suggested try to simplify the example and come back on this. Thank you once again,

k.


jksmurf,

Watch the value of $savedate in the console window as the script runs. The variable's value repeats across multiple pages, so you are overwriting some files, since you use this variable to construct the file name. The web page navigation looks like it is working perfectly.

kylomas

Edit: INCORRECT... the dates are correct as displayed. However, I would still look at the file save routines to make sure that you are not overwriting files.


Forum Rules         Procedure for posting code

"I like pigs.  Dogs look up to us.  Cats look down on us.  Pigs treat us as equals."

- Sir Winston Churchill


Hi kylomas

Whoa! You're onto something! I was about to say I was pretty sure the files are not being overwritten, but then I wiped the output files and ran it a few times in succession. Lo and behold, the files grow in size, about 40~50k each time, like they are being appended to.

What would cause that, in my script? Filewrite? I need to find a way to say purge and overwrite.

Thank you!

k.


jksmurf,

Just wrote a small parser and verified that the pages are all accurate.

What would cause that, in my script? Filewrite? I need to find a way to say purge and overwrite.

Just open in write mode
fileopen("the_file",2)
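A minimal sketch of the overwrite pattern follows. Calling FileWrite with a file name instead of a handle appends to the file, which matches the growing file sizes reported above. The path below is a hypothetical demo location, not the one from the thread, and `$FO_OVERWRITE` is the named constant from `FileConstants.au3` for the literal mode 2.

```autoit
#include <FileConstants.au3>

; Hypothetical demo path; the thread's script builds its file name from $savedate.
Local $sPath = @TempDir & "\overwrite-demo.htm"
Local $hFile

For $i = 1 To 3
    ; $FO_OVERWRITE (value 2) erases any previous contents when the file is opened.
    $hFile = FileOpen($sPath, $FO_OVERWRITE)
    If $hFile = -1 Then
        ConsoleWrite("FileOpen failed for " & $sPath & @CRLF)
        Exit 1
    EndIf
    FileWrite($hFile, "<html><body>run " & $i & "</body></html>")
    FileClose($hFile)
    ; The size should stay constant across runs instead of growing.
    ConsoleWrite("Size after run " & $i & ": " & FileGetSize($sPath) & " bytes" & @CRLF)
Next
```

Replacing the FileOpen/FileWrite/FileClose trio with a single `FileWrite($sPath, ...)` in the same loop should show the size growing on every pass, which is the behaviour described earlier in the thread.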

kylomas


jksmurf,

I modified your code as follows to correct the way you are creating the output files, and got rid of some of the "noise" (all the repetitious comments; personal preference).

;
;
;

#include <Date.au3>
#include <IE.au3>
$savedate = _NowCalcDate()
$WebTitle = "NOT Internet Explorer"
$comphtml = ''

_IEErrorHandlerRegister()

$oIE = _IECreate()
_IENavigate($oIE, "http://www.setanta.com/HongKong/TV-Listings/")

$hIE = _IEPropertyGet($oIE, "hwnd"); get the 'handle' (hwnd) for the IE window opened.

_IENavigate($oIE, "javascript:__doPostBack('ctl00$cphForm$AllCols$tvlHeader$rptDays$ctl01$btnDay','')", 0); Invoke Javascript command to get this days' program
SaveTVxBhtml_first()
_IENavigate($oIE, "javascript:__doPostBack('ctl00$cphForm$AllCols$tvlHeader$rptDays$ctl02$btnDay','')", 0)
SaveTVxBhtml()
_IENavigate($oIE, "javascript:__doPostBack('ctl00$cphForm$AllCols$tvlHeader$rptDays$ctl03$btnDay','')", 0)
SaveTVxBhtml()
_IENavigate($oIE, "javascript:__doPostBack('ctl00$cphForm$AllCols$tvlHeader$rptDays$ctl04$btnDay','')", 0)
SaveTVxBhtml()
_IENavigate($oIE, "javascript:__doPostBack('ctl00$cphForm$AllCols$tvlHeader$rptDays$ctl05$btnDay','')", 0)
SaveTVxBhtml()
_IENavigate($oIE, "javascript:__doPostBack('ctl00$cphForm$AllCols$tvlHeader$rptDays$ctl06$btnDay','')", 0)
SaveTVxBhtml()
_IENavigate($oIE, "javascript:__doPostBack('ctl00$cphForm$AllCols$tvlHeader$btnNextWeek','')", 0) ; This line Takes you to the NEXT Week
; Then go all through the buttons again
Sleep(3000)
_IENavigate($oIE, "javascript:__doPostBack('ctl00$cphForm$AllCols$tvlHeader$rptDays$ctl00$btnDay','')", 0)
SaveTVxBhtml()
_IENavigate($oIE, "javascript:__doPostBack('ctl00$cphForm$AllCols$tvlHeader$rptDays$ctl01$btnDay','')", 0)
SaveTVxBhtml()
_IENavigate($oIE, "javascript:__doPostBack('ctl00$cphForm$AllCols$tvlHeader$rptDays$ctl02$btnDay','')", 0)
SaveTVxBhtml()
_IENavigate($oIE, "javascript:__doPostBack('ctl00$cphForm$AllCols$tvlHeader$rptDays$ctl03$btnDay','')", 0)
SaveTVxBhtml()
_IENavigate($oIE, "javascript:__doPostBack('ctl00$cphForm$AllCols$tvlHeader$rptDays$ctl04$btnDay','')", 0)
SaveTVxBhtml()
_IENavigate($oIE, "javascript:__doPostBack('ctl00$cphForm$AllCols$tvlHeader$rptDays$ctl05$btnDay','')", 0)
SaveTVxBhtml()
_IENavigate($oIE, "javascript:__doPostBack('ctl00$cphForm$AllCols$tvlHeader$rptDays$ctl06$btnDay','')", 0)
SaveTVxBhtml()
;
; Finally, close the Webpages
;
WinActivate($hIE)
WinWaitActive($hIE)
;
_IEQuit($oIE)
;
Func SaveTVxBhtml_first()
    ConsoleWrite("Debug: SaveTVxBhtml Routine" & @LF)
    $sHTML    = _IEbodyReadHTML($oIE)
    local $fl = fileopen("d:\tvbtest\test-" & StringReplace($savedate, "/", ""),2)
    if $fl    = -1 then msgbox(0,'','Fileopen failed for file = ' & "d:\tvbtest\test-" & StringReplace($savedate, "/", ""))
    FileWrite($fl, $sHTML)
    fileclose($fl)
    ConsoleWrite("SaveDate=" & $savedate & @CRLF)
    $savedate = _DateAdd("d", 1, $savedate)
    $comphtml = $shtml
EndFunc   ;==>SaveTVxBhtml_first

Func SaveTVxBhtml()
    ConsoleWrite("Debug: SaveTVxBhtml Routine" & @LF)
    local $t = timerinit()
    $sHTML    = _IEbodyReadHTML($oIE)
    ;sleep(3000)
    while $comphtml = $shtml
        $sHTML    = _IEbodyReadHTML($oIE)
        sleep(100)
    wend
    consolewrite('Time to get new html = ' & timerdiff($t) & @lf)
    local $fl = fileopen("d:\tvbtest\test-" & StringReplace($savedate, "/", ""),2)
    if $fl    = -1 then msgbox(0,'','Fileopen failed for file = ' & "d:\tvbtest\test-" & StringReplace($savedate, "/", ""))
    FileWrite($fl, $sHTML)
    fileclose($fl)
    ConsoleWrite("SaveDate=" & $savedate & @CRLF)
    $savedate = _DateAdd("d", 1, $savedate)
    $comphtml = $shtml
EndFunc   ;==>SaveTVxBhtml
;

The code only works when there is a Sleep(3000) between the _IENavigate calls. Since this could present timing issues, I tried to find a way to detect when the page is updated by comparing the last doc body to the current doc body (_IEBodyReadHTML). This does NOT work. I get all duplicate files until the Sleep(3000) occurs at the week change, followed by all duplicate files after the week change.
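The fixed-delay approach can be written as a loop rather than thirteen hand-written calls. This is only a sketch under assumptions: the day buttons are taken to be named `...rptDays$ctl00$btnDay` through `...rptDays$ctl06$btnDay` as in the calls above, `$iPauseMs` is a hypothetical tuning value, and the loop prints the body length instead of saving files.

```autoit
#include <IE.au3>

; Assumption: control names follow the pattern used in the calls above.
Local Const $sPrefix = "ctl00$cphForm$AllCols$tvlHeader$rptDays$ctl"
; Hypothetical pause; the thread reports that 3000 ms was needed on a slow line.
Local Const $iPauseMs = 3000

Local $oIE = _IECreate("http://www.setanta.com/HongKong/TV-Listings/")
Local $sTarget

For $iDay = 0 To 6
    ; StringFormat pads the index to two digits: 00, 01, ... 06.
    $sTarget = $sPrefix & StringFormat("%02d", $iDay) & "$btnDay"
    ; Third argument 0 tells _IENavigate not to wait, as in the original script.
    _IENavigate($oIE, "javascript:__doPostBack('" & $sTarget & "','')", 0)
    Sleep($iPauseMs)
    ConsoleWrite($sTarget & " -> " & StringLen(_IEBodyReadHTML($oIE)) & " characters" & @CRLF)
Next

_IEQuit($oIE)
```

Keeping the pause in one constant makes it easy to lengthen on a slow connection, though a fixed sleep remains a guess about how long the server takes.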

I read most of the IE doc but my lack of understanding of DHTML behaviours is beating me. If none of the experts can help you can still go with the SLEEP. My perception of a timing issue might be because of my shitty DSL connection.

I also have parsing code to create a 2X (col by row) table that I have been using to verify results. You are welcome to this code, just PM me.

kylomas


Hi kylomas,

Whew, I thought I was the only one who couldn't work out why it didn't seem to play very nicely, even though there seem to be quite a few different functions which say they wait for the page to load. I played around with _IELoadWait($oIE) earlier on too.

I recently changed

_IENavigate($oIE, ...

to $oIE.document.Parentwindow.execScript ( ...

thinking that would better handle the JavaScript. I was looking for some code that forces an automatic wait until the JavaScript had done its thing. I'm sorry if all this sounds very untechnical, but I'm just sort of making it up as I go!
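One way to combine execScript with a wait is sketched below. It is untested against this site and rests on two assumptions: that `__doPostBack` triggers a full page reload (not a partial AJAX update), and that a short delay before polling is enough for IE to leave the "complete" state of the old page. `_PostBackAndWait` is a hypothetical helper, not part of IE.au3; `_IELoadWait` and its delay and timeout parameters are.

```autoit
#include <IE.au3>

Local $oIE = _IECreate("http://www.setanta.com/HongKong/TV-Listings/")

; Control name copied from the script earlier in the thread.
If _PostBackAndWait($oIE, "ctl00$cphForm$AllCols$tvlHeader$rptDays$ctl02$btnDay") Then
    ConsoleWrite("Page reloaded, " & StringLen(_IEDocReadHTML($oIE)) & " characters" & @CRLF)
Else
    ConsoleWrite("Wait failed or timed out, @error = " & @error & @CRLF)
EndIf

_IEQuit($oIE)

; Hypothetical helper: fire the postback, then wait for the reload to finish.
Func _PostBackAndWait(ByRef $oIE, $sTarget, $iDelayMs = 500, $iTimeoutMs = 30000)
    $oIE.document.parentWindow.execScript("__doPostBack('" & $sTarget & "','')")
    ; The delay gives the browser time to start the request before readiness is polled;
    ; without it the old page may still report itself as fully loaded.
    Return _IELoadWait($oIE, $iDelayMs, $iTimeoutMs)
EndFunc   ;==>_PostBackAndWait
```

If the site updates the listing without reloading the document, `_IELoadWait` will return immediately and a different readiness check would be needed.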

I'll PM you on the code, thanks!

k.
