trancexx

Getting page source

10 posts in this topic

I know there are a few different ways of doing this (InetGet(), _INetGetSource(), COM...), but it won't hurt to have another one.

This one is maybe the most fundamental.

We'll connect to the desired server (port 80), send a request, and collect the response.

Like this:

Dim $sURL = 'http://www.autoitscript.com/forum/index.php?showforum=9'

Dim $www = URLGetSource($sURL)
ConsoleWrite($www & @CRLF)


Func URLGetSource($URL)

    $URL = StringRegExpReplace($URL, '\A(http://|https://)(.*?)/?\Z', '$2') ; drop a leading http:// or https:// (and a trailing slash), e.g. returns www.autoitscript.com/forum/index.php?showforum=9 for us
    
    TCPStartup() ; initializing service
    
    Local $dom = StringRegExpReplace($URL, '\A(.*?)/.*', '$1') ; this part is domain name (www.autoitscript.com)
    Local $ip = TCPNameToIP($dom) ; will need this to connect to server
    If $ip = "" Then Return -1
    Local $get = StringRegExpReplace($URL, '\A(.*?)/(.*)', '$2') ; we want this (forum/index.php?showforum=9)
    If $get = $dom Then $get = '' ; in case requiring main page
    
    Local $header = 'GET /' & $get & ' HTTP/1.1' & @CRLF _
             & 'User-Agent: Test' & @CRLF _
             & 'Host: ' & $dom & @CRLF & @CRLF ; something about us and what we want; Host must name the requested server, and the request ends with a blank line (@CRLF & @CRLF)

    Local $socket = TCPConnect($ip, 80) ; connecting to server
    If $socket = -1 Then Return -2 ; will not check any more errors from here on (you do it :P)
    
    TCPSend($socket, $header) ; sending request
    ConsoleWrite('...waiting for the response...' & @CRLF)

    Local $rcv, $out, $x, $sw, $r, $lenght
    
    While 1
        
        If $rcv <> '' Then
            If $x <> 1 Then
                ConsoleWrite('Receiving data' & @CRLF & @CRLF)
                $lenght = Number(StringRegExpReplace($rcv, '(?s)(.*?)Content-Length: (\d+)(.*)', '$2') + StringLen(StringLeft($rcv, StringInStr($rcv, @CRLF & @CRLF))) + 3)
            EndIf
            $x = 1
        EndIf
        
        $rcv = TCPRecv($socket, 1024) ; receiving data from server
        
        $out &= $rcv ; adding to what we already have
        
        If $x = 1 Then
            If $rcv = '' Then
                $sw += 1
                If $sw = 10000 Then ExitLoop ; sometimes there is no end, so we'll have to end it
            Else
                $sw = 0
            EndIf
        EndIf
        
        If $lenght <> 0 Then
            If StringLen($out) = $lenght Then ; some servers are done once they have sent the amount of data they declared in Content-Length
                ExitLoop
            EndIf
        EndIf
        
        If StringRight($rcv, 5) = '0' & @CRLF & @CRLF Then ; chunked responses end with a zero-length chunk ("0" followed by a blank line)
            ExitLoop
        EndIf
        
    WEnd
    
    TCPCloseSocket($socket) ; closing socket
    TCPShutdown() ; stopping service
    
    Return $out
    
EndFunc

Header will be there too.
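
Since the returned string contains the response header as well as the page, here is a minimal sketch of separating the two and reading the declared Content-Length. The helper names _SplitResponse and _GetContentLength are made up for illustration and are not part of the script above; the sketch assumes the whole response is already held in one string.

```autoit
; Hypothetical helpers, not part of the original script.

; Split a raw HTTP response into [0] = header, [1] = body.
; The header ends at the first blank line (@CRLF & @CRLF).
Func _SplitResponse($sResponse)
    Local $aParts[2] = [$sResponse, ""]
    Local $iPos = StringInStr($sResponse, @CRLF & @CRLF)
    If $iPos > 0 Then
        $aParts[0] = StringLeft($sResponse, $iPos - 1)
        $aParts[1] = StringMid($sResponse, $iPos + 4)
    EndIf
    Return $aParts
EndFunc

; Return the declared Content-Length, or -1 if the header has none
; (for example when the server uses chunked transfer encoding).
Func _GetContentLength($sHeader)
    Local $aMatch = StringRegExp($sHeader, '(?i)Content-Length:\s*(\d+)', 1)
    If @error Then Return -1
    Return Number($aMatch[0])
EndFunc

; Usage with a made-up response string
Local $sRaw = 'HTTP/1.1 200 OK' & @CRLF & 'Content-Length: 5' & @CRLF & @CRLF & 'hello'
Local $aResp = _SplitResponse($sRaw)
ConsoleWrite('Declared length: ' & _GetContentLength($aResp[0]) & @CRLF)
ConsoleWrite('Body: ' & $aResp[1] & @CRLF)
```

Returning -1 for a missing Content-Length keeps "not declared" distinct from a declared length of 0.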





Nice, very barebones. The question is does it run faster than the already established functions?



Does anyone know a way to avoid loading the source into memory first? My script crashes because the source is too large to hold in memory and it cannot allocate enough. I would like to write it straight to a text file instead.

Here is the code...

#include <INet.au3>
#include <File.au3>

$logFile = FileOpen("targetlog.txt", 1)

$url = "http://www.somedomain.com/page.php?page_id=../../../var/log/pronto_extranet"

$page = _INetGetSource($url)
sleep(5000)

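$pageArray = StringSplit($page, @CRLF, 1) ; flag 1: split on the whole @CRLF string, not on each character
For $i = 1 To $pageArray[0]
    FileWriteLine($logFile, $pageArray[$i]) ; write the line itself, not its index
Next
FileClose($logFile)

MsgBox(0, "Download...", "Completed!")
Exit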

The target file is too large and I need it to write to the text file right away.


#4 ·  Posted (edited)

You could use that script from the first post, just replace
$out &= $rcv; adding to what we already have
with
FileWrite($logFile, $rcv)

Full solution:

Dim $sURL = 'http://www.somedomain.com/page.php?page_id=../../../var/log/pronto_extranet' ;this is yours
$sURL = 'www.yahoo.com' ; yahoo.com for this example

Dim $logFile = @ScriptDir & '\targetlog.txt'

If URLGetSource($sURL, $logFile) = 1 Then
    MsgBox(0, "Download...", "Completed!")
Else
    MsgBox(16, "Error", "An error occurred")
EndIf

Func URLGetSource($URL, $logFile)

    $URL = StringRegExpReplace($URL, '\A(http://|https://)(.*?)/?\Z', '$2') 
    
    TCPStartup()
    
    Local $dom = StringRegExpReplace($URL, '\A(.*?)/.*', '$1')
    Local $ip = TCPNameToIP($dom) 
    If $ip = "" Then Return -1
    
    Local $get = StringRegExpReplace($URL, '\A(.*?)/(.*)', '$2') 
    If $get = $dom Then $get = ''
    
    Local $header = 'GET /' & $get & ' HTTP/1.1' & @CRLF _
             & 'User-Agent: Test' & @CRLF _
             & 'Host: ' & $dom & @CRLF & @CRLF ; Host must name the requested server

    Local $socket = TCPConnect($ip, 80) 
    If $socket = -1 Then Return -2 
    
    TCPSend($socket, $header) 
    
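    Local $rcv = '', $x = 0, $sw = 0, $lenght = 0, $total = 0 ; $total counts the characters received so far
    Local $log_hw = FileOpen($logFile, 1) ; open for appending
    If $log_hw = -1 Then ; could not open the log file
        TCPCloseSocket($socket)
        TCPShutdown()
        Return -3
    EndIf

    While 1

        $rcv = TCPRecv($socket, 1024)

        If $rcv <> '' Then
            FileWrite($log_hw, $rcv) ; write each chunk straight to the file through the open handle
            $total += StringLen($rcv)
            If $x <> 1 Then ; first chunk: work out the expected total (header + declared body), if declared
                If StringRegExp($rcv, 'Content-Length: \d+') Then $lenght = Number(StringRegExpReplace($rcv, '(?s)(.*?)Content-Length: (\d+)(.*)', '$2') + StringLen(StringLeft($rcv, StringInStr($rcv, @CRLF & @CRLF))) + 3)
                $x = 1
            EndIf
            $sw = 0
        ElseIf $x = 1 Then
            $sw += 1
            If $sw = 10000 Then ExitLoop ; give up after 10000 empty reads in a row
        EndIf

        If $lenght <> 0 And $total >= $lenght Then ExitLoop ; received everything the server declared

        If StringRight($rcv, 5) = '0' & @CRLF & @CRLF Then ExitLoop ; end of a chunked response

    WEnd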
        
    WEnd
    
    TCPCloseSocket($socket) 
    TCPShutdown() 
    FileClose($log_hw)
    
    Return 1
    
EndFunc
Edited by trancexx


I do not know what you are trying to do with this script, but it is definitely not working correctly.


Don't tell me it only got you the source of the yahoo.com page. The example overwrites $sURL with 'www.yahoo.com'; remove that line to fetch your own URL.


I like this script because it seems a little lighter on resources than _INetGetSource().

I am trying to use this to connect to a NAS on my LAN, which I can reach in a web browser at:

http://nas_name:631/printers/

which brings up a very basic page. If no printers are powered on (which are connected to the NAS) then the entirety of the code in the page is "No Printers".

Can your code be used to connect to and get the source of the above page? Currently it is returning -1, but I don't understand enough of how your code works (and how TCP stuff works) to troubleshoot it.

Many thanks in advance!

:)


It's returning -1 because TCPNameToIP() is failing.

The script in the first post uses port 80; change it to 631.
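
TCPNameToIP() expects a bare host name, so the -1 most likely comes from the ':631' still being attached to the name when it is resolved. A rough sketch of splitting an address into host, port, and path before connecting follows. _ParseURL is a made-up helper name, not part of the original script, and the default port 80 and default path '/' are assumptions.

```autoit
; Hypothetical helper, not part of the original script.
; Returns [0] = host, [1] = port (default 80), [2] = path (default '/').
; Sets @error to 1 if the address cannot be parsed.
Func _ParseURL($sURL)
    Local $aResult[3] = ["", 80, "/"]
    Local $aMatch = StringRegExp($sURL, '^(?:https?://)?([^/:]+)(?::(\d+))?(/.*)?$', 1)
    If @error Then Return SetError(1, 0, $aResult)
    $aResult[0] = $aMatch[0]
    ; Optional groups may be empty or missing from the result array
    If UBound($aMatch) > 1 And $aMatch[1] <> "" Then $aResult[1] = Number($aMatch[1])
    If UBound($aMatch) > 2 And $aMatch[2] <> "" Then $aResult[2] = $aMatch[2]
    Return $aResult
EndFunc

Local $aURL = _ParseURL('http://nas_name:631/printers/')
ConsoleWrite('Host: ' & $aURL[0] & ', port: ' & $aURL[1] & ', path: ' & $aURL[2] & @CRLF)
```

The host part would then go to TCPNameToIP(), the port to TCPConnect(), and the path into the GET line.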



Thank you so much for your reply!

That in and of itself did not work, but it helped point me in the right direction and with some trial and error I have gotten it to work!

Below is the first part of your code, containing the portions that I changed, posted for others' future reference (I put CHANGED at the lines where I made changes):

Func URLGetSource();CHANGED
;~  $URL = StringRegExpReplace($URL, '\A(http://|https://)(.*?)/?\Z', '$2'); dropping http// https// if are there e.g. will return www.autoitscript.com/forum/index.php?showforum=9 for us
    
    TCPStartup(); initializing service
   
    Local $dom = "LOCAL_NAS_NAME";CHANGED
    Local $ip = TCPNameToIP($dom); will need this to connect to server
    If $ip = "" Then Return -1
    Local $get = "printers/";CHANGED
    If $get = $dom Then $get = ''; in case requiring main page
   
    Local $header = 'GET /' & $get & ' HTTP/1.1' & @CRLF _
             & 'User-Agent: Test' & @CRLF _
             & 'Host: ' & $dom & @CRLF & @CRLF; something about us and what we want, ending with a blank line (@CRLF & @CRLF)

    Local $socket = TCPConnect($ip, 631);CHANGED

For others' reference: the above is for connecting to an Infrant / Netgear ReadyNAS NV in order to get the printer status.

:)


I needed to change the code a bit to fit my needs, more specifically the contents of the $header that gets sent. I was trying to check whether a website had directory listing enabled for its downloads folder (http://www.delter.co.za/downloads/), but I kept getting errors like:

<html><head>

<title>404 Not Found</title>

</head><body>

<h1>Not Found</h1>

<p>The requested URL /downloads was not found on this server.</p>

</body></html>

I used this code:

    $header = "GET /" & $get & "/ HTTP/1.1" & @CRLF
    $header &= "Host: " & $dom & @CRLF
    $header &= "User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.12) Gecko/20080201 Firefox/2.0.0.12" & @CRLF
    $header &= "Connection: close" & @CRLF
    $header &= "" & @CRLF

Hope this helps anyone with similar problems.

The code above was derived from Greg "Overload" Laabs' HTTP UDF. Credit goes to him for creating it! So useful!
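
Putting the header changes together, here is a sketch of a request builder that always sends the real host name and asks the server to close the connection when it is done, plus a receive loop that relies on that close. _BuildRequest and _RecvAll are hypothetical names, and the 5000 ms idle timeout is an arbitrary choice. The sketch assumes TCPRecv() sets @error once the server has closed the connection; the timeout is there in case it does not.

```autoit
; Hypothetical helpers, not part of the original script.

; Build a minimal HTTP/1.1 GET request for $sPath on $sHost.
Func _BuildRequest($sHost, $sPath)
    If StringLeft($sPath, 1) <> '/' Then $sPath = '/' & $sPath
    Local $sRequest = 'GET ' & $sPath & ' HTTP/1.1' & @CRLF
    $sRequest &= 'Host: ' & $sHost & @CRLF
    $sRequest &= 'User-Agent: Test' & @CRLF
    $sRequest &= 'Connection: close' & @CRLF
    $sRequest &= @CRLF ; a blank line ends the header
    Return $sRequest
EndFunc

; Read from $iSocket until the connection ends or nothing arrives for $iTimeoutMs.
Func _RecvAll($iSocket, $iTimeoutMs = 5000)
    Local $sOut = '', $sChunk, $hTimer = TimerInit()
    While TimerDiff($hTimer) < $iTimeoutMs
        $sChunk = TCPRecv($iSocket, 4096)
        If @error Then ExitLoop ; assumed: connection closed or failed
        If $sChunk = '' Then
            Sleep(10) ; nothing yet, avoid spinning the CPU
        Else
            $sOut &= $sChunk
            $hTimer = TimerInit() ; data arrived, restart the idle timer
        EndIf
    WEnd
    Return $sOut
EndFunc

; Usage with the site from the post above
TCPStartup()
Local $sHost = 'www.delter.co.za'
Local $iSocket = TCPConnect(TCPNameToIP($sHost), 80)
If Not @error Then
    TCPSend($iSocket, _BuildRequest($sHost, 'downloads/'))
    ConsoleWrite(_RecvAll($iSocket) & @CRLF)
    TCPCloseSocket($iSocket)
EndIf
TCPShutdown()
```

With 'Connection: close' the end of the response no longer has to be guessed from Content-Length or a chunk terminator, which is what the retry counter in the first post works around.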

