
Problem downloading perl-generated web page


Rarst

I want to download a Perl-generated web page in my script; I only need its source for further parsing, not to interact with it in any way.

I take some date variables and form a URL with them:

http://anidb.net/perl-bin/animedb.pl?show=latestanimes&last.anime.month="&$m&"&last.anime.year="&$y&"&last.anime.type=air&do.last.anime=Show

For the current month and year the URL looks like this:

http://anidb.net/perl-bin/animedb.pl?show=latestanimes&last.anime.month=2&last.anime.year=2008&last.anime.type=air&do.last.anime=Show

The URL opens fine in a browser if I ShellExecute it, but fetching it with InetGet or _INetGetSource gives me a few kilobytes of garbage data (it looks like a binary file opened in a text editor). Downloading plain HTML pages works fine, so I don't think it is a connection issue.

Is this some InetGet limitation, or am I missing something?

WinXP Pro SP2, IE6
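(As it turns out later in the thread, the "garbage" is a gzip stream, which always begins with the magic bytes 0x1f 0x8b. A minimal diagnostic sketch in Python, for illustration only since the thread's scripts are AutoIt:)

```python
# Hedged sketch (Python, for illustration; the thread's scripts are AutoIt):
# gzip streams begin with the magic bytes 0x1f 0x8b, so "binary garbage"
# from a download can be identified without any external tool.
import gzip

GZIP_MAGIC = b"\x1f\x8b"

def looks_gzipped(body: bytes) -> bool:
    """Return True if the payload starts with the gzip magic number."""
    return body[:2] == GZIP_MAGIC
```

Running a check like this on the bytes InetGet saved would immediately distinguish a compressed response from genuinely corrupted data.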


  • Moderators

Try this...

#include <IE.au3>

$sURL = "http://anidb.net/perl-bin/animedb.pl?" & _
        "show=latestanimes&" & _
        "last.anime.month=2&" & _
        "last.anime.year=2008&" & _
        "last.anime.type=air&" & _
        "do.last.anime=Show"

$oIE = _IECreate($sURL, 0, 0)
$sHTML = _IEDocReadHTML($oIE)
_IEQuit($oIE)
ConsoleWrite($sHTML & @CR)

Thanks for the answer; I had already tried that. I found that workaround after digging through about 20 pages of forum search results for InetGet. :) It seems to work fine (which is actually strange: I don't use Internet Explorer as my main browser, and given that my Windows install is almost three years old, IE seems to have lost some parts and usually doesn't behave well at all ;) )

I'm still curious why InetGet fails. Even if the page is Perl-generated, it should still return plain HTML as the result, right?..


I came to the forum looking into an issue with InetGet where the downloaded file is not the same as what you get by opening the same URL in a web browser. This is the only discussion I've found that appears to describe the same problem, although there is no resolution. The following script demonstrates it: it uses InetGet to download a web page from Microsoft and opens the downloaded page in the default browser, then opens the same URL in the browser directly. The difference between the two is quite apparent. Does anyone know why?

$sURL = "http://technet.microsoft.com/en-us/sysinternals/bb896653.aspx"
$sOutFile = @TempDir & "\InetGet-Test.htm"
InetGet($sURL, $sOutFile, 1, 0)
ShellExecute($sOutFile)
Sleep(3000)
ShellExecute($sURL)
MsgBox(4096, "InetGet Test", "Please view both pages currently in your web browser." & @LF & "They should be identical (for Process Explorer), but they are not.")
Exit

Thanks,

Phillip



I haven't looked directly at your code, but I suspect that the user-agent or some other field in the client request differs from what IE sends.

Historically, I believe the user-agent field was an empty string in InetGet(), though I have seen an indication somewhere that a default value was going to be, or has been, added.

Websites often try to serve different content based on their detection of your environment / browser.
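The user-agent sniffing point can be sketched like this (Python, purely for illustration; the header value is one of the IE strings quoted later in this thread and the URL is a placeholder):

```python
# Sketch of sending an explicit User-Agent so a sniffing server sees a
# "browser" request. By default urllib sends "Python-urllib/x.y", much as
# InetGet historically sent an empty or AutoIt-specific agent string.
import urllib.request

def browser_request(url: str) -> urllib.request.Request:
    ua = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
    return urllib.request.Request(url, headers={"User-Agent": ua})

req = browser_request("http://example.com/")
```

A server differentiating on user-agent would now serve this request the same content it serves the browser.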

Reading the help file before you post... Not only will it make you look smarter, it will make you smarter.


I don't think it's a user-agent issue in my case, but I still poked at it a bit.

INetGet - no user-agent

_INetGetSource - "AutoIt v3"

Both still fail in my case.

(bit later)

I might have figured something out: I saved that garbage result to a file and analyzed it with TrIDNet. It gave a pretty definitive result (100%, with no other candidates) that it is a gzipped file.

So does AutoIt request gzipped content and fail to decode it properly, or does the server send gzipped content instead of plain?
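Either way, the client-side fix is the same: trust the Content-Encoding header when present, sniff the gzip magic bytes otherwise, and decompress. A minimal sketch (Python for illustration; the AutoIt solution later in the thread uses zlib1.dll instead):

```python
# Decode a response body that may be gzipped even when the client never
# asked for compression (as this thread suggests anidb does). Trust the
# Content-Encoding header when present, otherwise sniff the magic bytes.
import gzip

def decode_body(body: bytes, content_encoding: str = "") -> bytes:
    if content_encoding == "gzip" or body[:2] == b"\x1f\x8b":
        return gzip.decompress(body)
    return body
```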


phillip123adams, try this code; it seems to solve the user-agent issue (I am not entirely sure this piece of COM code is perfect, I'm just trying to solve my own problem and learning stuff as I go :) )

$sURL = "http://technet.microsoft.com/en-us/sysinternals/bb896653.aspx"
$sOutFile = @TempDir & "\InetGet-Test.htm"
$oHTTP = ObjCreate("winhttp.winhttprequest.5.1")
$oHTTP.Open("GET", $sURL)
$oHTTP.SetRequestHeader("User-Agent", "Microsoft Internet Explorer 5.0")
$oHTTP.Send()
$page = $oHTTP.ResponseText
FileWrite($sOutFile, $page)
ShellExecute($sOutFile)
Sleep(3000)
ShellExecute($sURL)
MsgBox(4096, "InetGet Test", "Please view both pages currently in your web browser." & @LF & "They should be identical (for Process Explorer), but they are not.")
Exit

On my topic - I found in the docs that anidb always sends gzip content, no matter what you request in the HTTP headers. The forum has a topic on receiving gzipped pages with a modified UDF, _INetGetSourcePro, but it doesn't work for me. :D It gives a "calling HttpSendRequest() error".

Edit. Solved, using the fixed INetPro UDF (the version fixed for binary data and bugfixed after that):

http://www.autoitscript.com/forum/index.ph...680&hl=gzip

http://www.autoitscript.com/forum/index.ph...139&hl=gzip

The resulting code is a total mess. ;) I hope InetGet and/or _INetGetSource get upgraded to handle gzip someday (though that seems unlikely, as it would probably always require some external DLL or EXE).

;#include <IE.au3>
#include <INet.au3>
#include <inetpro.au3>

$link = "http://anidb.net/perl-bin/animedb.pl?show=latestanimes&last.anime.month=2&last.anime.year=2008&last.anime.type=air&do.last.anime=Show"

; Get binary data using the modified _InetGetSourcePro
; (the original _InetGetSourcePro can neither return binary data nor set headers)
$ReceiveBuffer = _InetGetSourcePro($link, 'GET', "Accept-Encoding: gzip, deflate" & @CR & @LF)

; For debug only: write the binary data to a file so you can inspect what was received in a hex editor
$file = FileOpen('httpresponse.gz', 2 + 16)
FileWrite($file, $ReceiveBuffer)
FileClose($file)
; end debug
;MsgBox(0, '', StringToBinary($ReceiveBuffer))

; Set up data structures for the DLL call
$InBufferSize = StringLen($ReceiveBuffer)
$InBuffer = DllStructCreate("ubyte[" & $InBufferSize & "]")
DllStructSetData($InBuffer, 1, $ReceiveBuffer)
$OutBufferSize = ''
For $i = $InBufferSize To $InBufferSize - 3 Step -1 ; read the little-endian size from the trailer, reversed to big-endian
    $OutBufferSize &= StringMid($ReceiveBuffer, $i, 1)
Next
$OutBufferSize = Dec(Hex(StringToBinary($OutBufferSize))) ; Note: the buffer size will be wrong if the data is big-endian! The OS flag in byte 10 of the gz header determines this.
$OutBufferSizePtr = DllStructCreate("uint")
DllStructSetData($OutBufferSizePtr, 1, $OutBufferSize)
$OutBuffer = DllStructCreate("char[" & $OutBufferSize & "]")

; Call the modified zlib.dll
$result = DllCall(@ScriptDir & '\' & "zlib1.dll", "int:cdecl", "uncompress", "ptr", DllStructGetPtr($OutBuffer), "ptr", DllStructGetPtr($OutBufferSizePtr), "ptr", DllStructGetPtr($InBuffer), "udword", $InBufferSize, "udword", 15 + 32)

; For debug: display results
;_ArrayDisplay($result, 'zlib.dll result: if unzipped correctly, first value should be 0.')
MsgBox(0, 'Resulting buffer size: ' & DllStructGetData($OutBufferSizePtr, 1) & '. Resulting decompressed data:', DllStructGetData($OutBuffer, 1))
; end debug

; Free memory
$OutBuffer = 0
$InBuffer = 0
$OutBufferSizePtr = 0
;End Example 1 ==========================================================================

;m($page)
Exit

Func m($input)
    MsgBox(0, "", $input)
EndFunc
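(For comparison, the two tricks the script above relies on - reading the uncompressed size from the little-endian gzip trailer, and inflating with the 15 + 32 window-bits flag so the gzip header is auto-detected - look like this in Python. An illustrative sketch only, not part of the original script:)

```python
import gzip
import struct
import zlib

def gunzip(data: bytes) -> bytes:
    # ISIZE: the last 4 bytes of a gzip stream hold the uncompressed
    # length modulo 2**32, little-endian (the same field the For loop
    # above reads byte by byte).
    (isize,) = struct.unpack("<I", data[-4:])
    # wbits = 15 + 32 tells zlib to auto-detect the gzip/zlib header,
    # the same flag passed to zlib1.dll's uncompress() above.
    out = zlib.decompress(data, wbits=15 + 32)
    assert len(out) % 2**32 == isize
    return out

# round-trip check against Python's own gzip writer
original = b"<html>example payload</html>" * 10
assert gunzip(gzip.compress(original)) == original
```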

Thank you very much for the code to try. Unfortunately, it produces the same result, that is, M$ saying "We're sorry, but the page you requested could not be found" instead of the "Process Explorer" page.

Try deleting the temp file, or add FileDelete($sOutFile) at the start. I used FileWrite for the output, which appends to the end of the file if it already exists (InetGet just overwrites the whole file, I think).
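(The append-versus-overwrite difference is easy to demonstrate. Python sketch for illustration; the file name is arbitrary:)

```python
# Mode "a" appends (like AutoIt's FileWrite on an existing file),
# mode "w" truncates first (like InetGet replacing the target file).
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "InetGet-Test.htm")
with open(path, "w") as f:
    f.write("old attempt")
with open(path, "a") as f:      # append: previous content survives
    f.write(" + new output")
appended = open(path).read()
with open(path, "w") as f:      # truncate: previous content is gone
    f.write("new output only")
overwritten = open(path).read()
```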

Edit. Weird, I rechecked the code and ran it 5-6 times... Sometimes it works, sometimes it doesn't... Once it even failed to open the URL directly with ShellExecute. :) I suspect it's more of a stupid Microsoft site problem. ;)


Thanks Rarst! I failed to realize I was getting the concatenation of my last attempt with the output of your script. I added a FileDelete and your script works better, but the page gets formatted badly and some graphics are missing.

This may be a separate issue, but I've seen a couple of other sites where using InetGet twice in a row (with a short sleep between calls) returns different HTML code. One site that I made note of is:

http://e-forum.online7casino.com/index.php

Plain HTML pages don't seem to be any problem for InetGet, but .pl, .php, and .aspx pages seem to present a set of issues.

Phillip



but the page gets formatted badly and some graphics are missing

Try using another user-agent. I put IE5 in the code because it was the easiest and first one for me to find (I ripped it from my download manager's options). IE6 should work better... Try these: the first is IE6 on XP SP2, the second is the same plus .NET Framework 3.5 installed (I don't even want to know why the IE user-agent reports itself as Mozilla... this browser is just too weird).

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1) ; .NET CLR 2.0.50727; .NET CLR 3.0.04506.648; .NET CLR 3.5.21022)

