Web Spider


AcidicChip

This script is designed to spider along the web and gather media URLs. It's still in the works, just thought I'd post it to see if I could get some help in making it better. Input and comments are more than welcome.

There are two things I know it currently lacks:

1) Speed. The script tends to slow down the longer it runs.

2) An accurate way to determine whether a URL is audio, video, or an image. I tried several ways of fetching the response headers to retrieve the server's Content-Type, but each was either too slow [using ObjCreate("winhttp.winhttprequest.5.1")] or caused the script to freeze after a number of checks (TCPConnect on port 80, reading the first 1024 bytes and parsing them for "Content-Type").
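For point 2, one option that might be worth timing is a HEAD request, which asks the server for the headers only and never downloads the body. The sketch below is not part of the script: it assumes a current AutoIt release, and the helper names (_GetContentType, _MediaKind, _ComErr) and the 2-second timeouts are made up for illustration. Whether it is fast enough for a spider is untested.

```autoit
; Keep the handler object alive so failed COM calls set @error instead of halting.
Global $g_oComErr = ObjEvent("AutoIt.Error", "_ComErr")

Func _ComErr($oError)
    ; Intentionally empty: callers check @error after each COM call.
EndFunc

; Hypothetical helper: returns the Content-Type header of $sURL, or "" on failure.
Func _GetContentType($sURL)
    Local $oHTTP = ObjCreate("WinHttp.WinHttpRequest.5.1")
    If Not IsObj($oHTTP) Then Return SetError(1, 0, "")
    ; Timeouts in ms: resolve, connect, send, receive (arbitrary values).
    $oHTTP.SetTimeouts(2000, 2000, 2000, 2000)
    $oHTTP.Open("HEAD", $sURL, False)
    $oHTTP.Send()
    If @error Then Return SetError(2, 0, "")
    Local $sType = $oHTTP.GetResponseHeader("Content-Type")
    If @error Then Return SetError(3, 0, "")
    Return $sType
EndFunc

; Hypothetical helper: maps a Content-Type to "audio", "video", "image", or "".
Func _MediaKind($sType)
    Local $aKinds[3] = ["audio", "video", "image"]
    For $sKind In $aKinds
        If StringLeft($sType, StringLen($sKind) + 1) = $sKind & "/" Then Return $sKind
    Next
    Return ""
EndFunc

ConsoleWrite(_MediaKind("audio/mpeg") & @CRLF) ; audio
; ConsoleWrite(_MediaKind(_GetContentType("http://example.com/")) & @CRLF)
```

Not every server answers HEAD correctly, so the extension check in CheckType would still be useful as a fallback.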

Release Notes

=================================================================

Version: 0.21 - Date: 2005-11-20

---------------------------------------

CHANGE: Collected URLs are now stored in a .txt file and read back with FileReadLine (works faster than an array)

ADDED: Start URL text box.

ADDED: History buffer that keeps the last 1024 URLs collected; new URLs are checked against it to avoid hitting the same URL twice (capped at 1024 to prevent slowdowns)

CHANGE: When Audio or Video files are found, it adds the file's root folder to the list to be spidered.

; ----------------------------------------------------------------------------
;
; AutoIt Version: 3.1.1.87
; Author:        AcidicChip <acidicchip@acidicchip.com>
;
; Script Name:  Web Media Spider
; Script Version: 0.21
;
; Script Function:
;   Spider the web and gather media file URLs
;
; ----------------------------------------------------------------------------

Opt("GUIOnEventMode", 1)
Opt("TrayIconDebug", 1)

#include <Array.au3>
#include <GUIConstants.au3>

Dim $collected[1]
Dim $urls[1]
Dim $urlon = 0
Dim $urlnum = 0
Dim $imagenum = 0
Dim $audionum = 0
Dim $videonum = 0

#region "GUI"
GUICreate("Media Spider", 600, 100)
$lblAction = GUICtrlCreateLabel("Action:", 0, 3, 35, 20)
$txtAction = GUICtrlCreateInput("", 40, 0, 560, 20)
GUICtrlSetState($txtAction, $GUI_DISABLE)
$lblURL = GUICtrlCreateLabel("URL:", 0, 23, 35, 20)
$txtURL = GUICtrlCreateInput("", 40, 20, 560, 20)
GUICtrlSetState($txtURL, $GUI_DISABLE)
$prgPercent = GUICtrlCreateProgress(0, 40, 560, 20)
$txtPercent = GUICtrlCreateInput("0%", 560, 40, 40, 20)
GUICtrlSetState($txtPercent, $GUI_DISABLE)
$lblURLs = GUICtrlCreateLabel("URLs:", 0, 63, 35, 20)
$txtURLs = GUICtrlCreateInput("0", 40, 60, 75, 20)
GUICtrlSetState($txtURLs, $GUI_DISABLE)
$lblAudio = GUICtrlCreateLabel("Audio:", 125, 63, 35, 20)
$txtAudio = GUICtrlCreateInput("0", 160, 60, 75, 20)
GUICtrlSetState($txtAudio, $GUI_DISABLE)
$lblImages = GUICtrlCreateLabel("Images:", 245, 63, 36, 20)
$txtImages = GUICtrlCreateInput("0", 285, 60, 75, 20)
GUICtrlSetState($txtImages, $GUI_DISABLE)
$lblVideos = GUICtrlCreateLabel("Videos:", 370, 63, 35, 20)
$txtVideos = GUICtrlCreateInput("0", 410, 60, 75, 20)
GUICtrlSetState($txtVideos, $GUI_DISABLE)
$lblHistory = GUICtrlCreateLabel("History:", 490, 63, 35, 20)
$txtHistory = GUICtrlCreateInput("0", 530, 60, 75, 20)
GUICtrlSetState($txtHistory, $GUI_DISABLE)
$lblStartURL = GUICtrlCreateLabel("Start URL:", 0, 83, 50, 20)
$txtStartURL = GUICtrlCreateInput("http://www.myspace.com/acidicchip", 55, 80, 490, 20)
$btnStartStop = GUICtrlCreateButton("Start", 550, 80, 50, 20)
GUISetState(@SW_SHOW)

GUISetOnEvent($GUI_EVENT_CLOSE, "GUIClose")
GUICtrlSetOnEvent($btnStartStop, "GUIStartStop")
#endregion "GUI"

Func GUIClose()
    Exit
EndFunc  ;==>GUIClose

Func GUIStartStop()
    If GUICtrlRead($btnStartStop) == "Start" Then
        GUICtrlSetData($btnStartStop, "Stop")
        GUICtrlSetState($txtStartURL, $GUI_DISABLE)
        FileDelete("spider.urls.txt")
        GetURLs(GUICtrlRead($txtStartURL))
        Do
        ;$url = $urls[1]
            $urlon = $urlon + 1
            $url = FileReadLine("spider.urls.txt", $urlon)
        ;_ArrayDelete($urls, 1)
            $urlnum = $urlnum - 1
            GetURLs($url)
        Until $urlnum <= 0 Or GUICtrlRead($btnStartStop) == "Start"
    ;Until UBound($urls) <= 1 Or GUICtrlRead($btnStartStop) == "Start"
    Else
        GUICtrlSetData($btnStartStop, "Start")
        GUICtrlSetState($txtStartURL, $GUI_ENABLE)
    EndIf
EndFunc  ;==>GUIStartStop

While 1
    Sleep(250)
Wend

Func Status($action, $url, $percent)
    GUICtrlSetData($txtAction, $action)
    If $url <> "" Then GUICtrlSetData($txtURL, $url)
    GUICtrlSetData($prgPercent, $percent)
    GUICtrlSetData($txtPercent, $percent & "%")
    
    GUICtrlSetData($txtURLs, $urlnum)
;GUICtrlSetData($txtURLs, UBound($urls))
    GUICtrlSetData($txtAudio, $audionum)
    GUICtrlSetData($txtImages, $imagenum)
    GUICtrlSetData($txtVideos, $videonum)
    GUICtrlSetData($txtHistory, UBound($collected))
EndFunc  ;==>Status

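; Returns an array of every substring found between $before and $after
; (case-insensitive, non-greedy; flag 3 = return all matches).
Func _ArrayParse($str, $before, $after)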
    Return StringRegExp($str, "(?i)" & $before & "(.*?)" & $after, 3)
EndFunc  ;==>_ArrayParse

Func AddURL($url)
    If Not WasCollected($url) Then
        _ArrayAdd($collected, $url)
    ;_ArrayAdd($urls, $url)
        FileWriteLine("spider.urls.txt", $url)
        $urlnum = $urlnum + 1
    EndIf
EndFunc  ;==>AddURL

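; Returns True if $url is already in the history buffer. On a miss, drops the
; oldest entry once the buffer holds 1024 items, so AddURL can append the new one.
Func WasCollected($url)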
    $return = False
    For $i = 1 To Ubound($collected) - 1 Step 1
        If $collected[$i] == $url Then
            $return = True
            ExitLoop
        EndIf
    Next
    If Not $return And UBound($collected) >= 1024 Then _ArrayDelete($collected, 1)
    Return $return
EndFunc  ;==>WasCollected

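; Returns the "directory" part of $url (scheme, host, and folder path with a
; trailing slash); used as the base for resolving relative links.
Func GetURI($url)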
    $uri = StringMid($url, 1, StringInStr($url, "://")) & "//"
    $turl = StringMid($url, StringLen($uri) + 1)
    If StringInStr($turl, "?") Then
        $temp = StringSplit($turl, "?")
        $turl = $temp[1]
        $temp = StringSplit($turl, "/")
        $uri = $uri & $temp[1] & "/"
        For $i = 2 To UBound($temp) - 1 Step 1
            If StringInStr($temp[$i], ".") Or Not StringLen($temp[$i]) Then ExitLoop
            $uri = $uri & $temp[$i] & "/"
        Next
        If Not InetGetSize(StringLeft($uri, StringLen($uri) - 1)) Then
            $uri = StringMid($url, 1, StringInStr($url, "://")) & "//"
            $temp = StringSplit($turl, "?")
            $turl = $temp[1]
            $temp = StringSplit($turl, "/")
            $uri = $uri & $temp[1] & "/"
            For $i = 2 To UBound($temp) - 2 Step 1
                If StringInStr($temp[$i], ".") Or Not StringLen($temp[$i]) Then ExitLoop
                $uri = $uri & $temp[$i] & "/"
            Next
        EndIf
    Else
        $temp = StringSplit($turl, "/")
        $uri = $uri & $temp[1] & "/"
        For $i = 2 To UBound($temp) - 1 Step 1
            If StringInStr($temp[$i], ".") Or Not StringLen($temp[$i]) Then ExitLoop
            $uri = $uri & $temp[$i] & "/"
        Next
    EndIf
    
    Return $uri
EndFunc  ;==>GetURI

Func GetURLs($url)
    $uri = GetURI($url)
    
    $file = "spider.html.txt"
    Status("Downloading", $url, 0)
    $filesize = InetGetSize($url)
    $lastsize = 0
    $strikes = 0
    InetGet($url, $file, 1, 1)
    While @InetGetActive
        If $lastsize == @InetGetBytesRead Then $strikes = $strikes + 1
        If $strikes >= 30 Then ExitLoop
        $lastsize = @InetGetBytesRead
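        ; InetGetSize returns 0 when the size is unknown; avoid dividing by zero.
        $percent = 0
        If $filesize > 0 Then $percent = Round(($lastsize / $filesize) * 100)
        Status("Downloading", $url, $percent)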
        Sleep(250)
    Wend
    $html = FileRead($file, FileGetSize($file))
    FileDelete($file)
    
    Status("Parsing URLs", $url, 0)
    $tags = _ArrayParse($html, "<a", ">")
    For $i = 0 To UBound($tags) - 1 Step 1
        Status("Checking <A> Tags for URLs", $url, Round((($i + 1) / UBound($tags)) * 100))
        CheckURL($uri, $tags[$i], $url)
    Next
    $tags = _ArrayParse($html, "<img", ">")
    For $i = 0 To UBound($tags) - 1 Step 1
        Status("Checking <IMG> Tags for URLs", $url, Round((($i + 1) / UBound($tags)) * 100))
        CheckURL($uri, $tags[$i], $url)
    Next
    $tags = _ArrayParse($html, "<embed", ">")
    For $i = 0 To UBound($tags) - 1 Step 1
        Status("Checking <EMBED> Tags for URLs", $url, Round((($i + 1) / UBound($tags)) * 100))
        CheckURL($uri, $tags[$i], $url)
    Next
EndFunc  ;==>GetURLs

Func CheckURL($uri, $str, $ref)
    If StringInStr($str, "href=") Then
        $turl = GetAttr($str, "href=")
        If Not StringInStr(StringLeft($turl, 10), "://") Then
            If StringLeft($turl, 1) == "/" Then
                $turl = $uri & StringMid($turl, 2)
            Else
                $turl = $uri & $turl
            EndIf
        EndIf
        CheckType($turl, $ref)
    EndIf
    If StringInStr($str, "src=") Then
        $turl = GetAttr($str, "src=")
        If Not StringInStr(StringLeft($turl, 10), "://") Then
            If StringLeft($turl, 1) == "/" Then
                $turl = $uri & StringMid($turl, 2)
            Else
                $turl = $uri & $turl
            EndIf
        EndIf
        CheckType($turl, $ref)
    EndIf
EndFunc  ;==>CheckURL

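; Extracts an attribute value from a tag; handles double-quoted, single-quoted,
; and unquoted values. $attr includes the "=", e.g. "href=".
Func GetAttr($str, $attr)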
    If StringInStr($str, $attr & '"') Then
        $temp = _ArrayParse($str, $attr & '"', '"')
        If UBound($temp) == 1 Then Return $temp[0]
    ElseIf StringInStr($str, $attr & "'") Then
        $temp = _ArrayParse($str, $attr & "'", "'")
        If UBound($temp) == 1 Then Return $temp[0]
    ElseIf StringInStr($str, $attr) Then
        $temp = StringMid($str, StringInStr($str, $attr) + StringLen($attr))
        If StringInStr($temp, " ") Then
            $temp = StringMid($temp, 1, StringInStr($temp, " ") - 1)
        EndIf
        Return $temp
    EndIf
EndFunc  ;==>GetAttr

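; Classifies $url by file extension: media URLs are logged together with the
; referring page, archives and executables are skipped, anything else is queued.
Func CheckType($url, $ref)
    If StringRight($url, 4) == ".jpg" Or _
            StringRight($url, 4) == ".gif" Or _
            StringRight($url, 4) == ".png" Or _
            StringRight($url, 4) == ".bmp" Then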
        
        FileWriteLine("spider.images.log", $url & @TAB & $ref)
        $imagenum = $imagenum + 1
    ElseIf StringRight($url, 4) == ".mp3" Or _
            StringRight($url, 4) == ".rbs" Then
        
        FileWriteLine("spider.audio.log", $url & @TAB & $ref)
        $audionum = $audionum + 1
        AddURL(GetURI($url))
    ElseIf StringRight($url, 4) == ".avi" Or _
            StringRight($url, 4) == ".wmv" Or _
            StringRight($url, 4) == ".mpg" Or _
            StringRight($url, 5) == ".mpeg" Then
        
        FileWriteLine("spider.video.log", $url & @TAB & $ref)
        $videonum = $videonum + 1
        AddURL(GetURI($url))
    ElseIf StringRight($url, 4) == ".exe" Or _
            StringRight($url, 4) == ".zip" Or _
            StringRight($url, 4) == ".rar" Or _
            StringRight($url, 4) == ".tar" Then
        
    ;Do Nothing
    Else
        AddURL($url)
    EndIf
EndFunc  ;==>CheckType

Keep in mind that this is my first script, and I am a complete newbie to AutoIt, so my code syntax may be a little dirty.

Edited by AcidicChip

WOW, that's awesome. I've always wondered what a spider would look like. Would FileOpen() make anything faster?

I'm using FileWriteLine and FileReadLine for the URL queue instead of an array, because the array slows the script down once it gets pretty big. I don't see that same technique being beneficial anywhere else.

One of the biggest keys to this bot gathering a good number of media links is its starting point, lol. I don't know which was the bigger challenge: writing the bot, or finding a good starting URL for it to gather links from.


Very nice, I like it. One thing to work on, however, is either coding really accurately and cleanly so it's easier to read, or running Tidy on your script. Tidy can be found in SciTE, the AutoIt text editor. Search the forums for it. I really like this, it's great!


I ran Tidy just now with the "Indent + Proper Case" option, and the only difference I see is that each EndFunc now has a ";==> FUNCNAME" comment. Everything else looked exactly the same, including the indents.

Did I use it incorrectly?


It was a compliment. He meant that you've written such a great first script that he can't wait until you're an advanced scripter... I still think I'm a newbie, however B)

Ah, well thanks killaz219. I'm a PHP and VB6/.NET developer, so I'm not new to the development scene, just new to the AutoIt aspect of developing.


  • 2 weeks later...

I meant that FileWriteLine and FileReadLine, when given a file name, each open and close the file on every call, which causes it to be opened and closed a lot. FileOpen() opens it once and leaves you a handle to write to the file. The thing is, if you did that, I don't think you could read the file with anything other than the script, like if you wanted to check on it using Notepad or something.
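A minimal sketch of the handle approach, not taken from the spider itself. As far as I can tell from the help file, calling FileReadLine with a file name and a line number re-reads the file from the top on every call, which would fit the slowdown described earlier; with a handle, each call just returns the next line. The file name and URLs here are placeholders, and using one handle for writing and another for reading at the same time is something to test rather than assume.

```autoit
Local $sQueue = @TempDir & "\spider.urls.demo.txt" ; placeholder file name
FileDelete($sQueue)

; Open once for appending (mode 1) and reuse the handle for every write.
Local $hWrite = FileOpen($sQueue, 1)
If $hWrite = -1 Then Exit 1
For $i = 1 To 5
    FileWriteLine($hWrite, "http://example.com/page" & $i)
Next
FileClose($hWrite) ; closing writes any buffered lines to disk

; Open once for reading (mode 0); each call on a handle returns the next line.
Local $hRead = FileOpen($sQueue, 0)
If $hRead = -1 Then Exit 1
Local $iRead = 0
While 1
    Local $sURL = FileReadLine($hRead)
    If @error Then ExitLoop ; @error is set at end of file
    $iRead += 1
    ConsoleWrite($sURL & @CRLF)
WEnd
FileClose($hRead)
FileDelete($sQueue)
```

In the spider this would replace the FileReadLine("spider.urls.txt", $urlon) call and the $urlon counter, since the read handle keeps its own position.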

I wonder if it's possible to also have robot exclusion on this. Does this robot have a name?
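On robot exclusion, a very rough sketch of what a check could look like. The helper name _RobotsAllows is made up, and it only understands plain Disallow prefixes under "User-agent: *": no Allow lines, wildcards, comments, or per-bot sections. The spider would still need to download /robots.txt for each host and pass its text in.

```autoit
; Hypothetical helper: returns False if $sPath starts with a Disallow rule
; listed under "User-agent: *" in $sRobotsTxt, otherwise True.
Func _RobotsAllows($sRobotsTxt, $sPath)
    Local $aLines = StringSplit(StringStripCR($sRobotsTxt), @LF)
    Local $bApplies = False
    For $i = 1 To $aLines[0]
        Local $sLine = StringStripWS($aLines[$i], 3) ; trim both ends
        If StringLeft($sLine, 11) = "User-agent:" Then
            ; Only rules in the section for all robots are considered.
            $bApplies = (StringStripWS(StringTrimLeft($sLine, 11), 3) = "*")
        ElseIf $bApplies And StringLeft($sLine, 9) = "Disallow:" Then
            Local $sRule = StringStripWS(StringTrimLeft($sLine, 9), 3)
            ; An empty Disallow means "nothing is disallowed".
            If $sRule <> "" And StringLeft($sPath, StringLen($sRule)) == $sRule Then Return False
        EndIf
    Next
    Return True
EndFunc

Local $sRobots = "User-agent: *" & @CRLF & "Disallow: /private/" & @CRLF
ConsoleWrite(_RobotsAllows($sRobots, "/private/a.mp3") & @CRLF) ; False
ConsoleWrite(_RobotsAllows($sRobots, "/music/a.mp3") & @CRLF)   ; True
```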


Naw, no name for it...

Doing a FileOpen and using it to read/write might be a faster solution. I'll give it a shot.


  • 3 years later...

What a great script! I do have an unrelated question, though.

It works fine, but I can't stop it.

I have just started using OnEvent mode (Opt("GUIOnEventMode", 1)) and was looking at this code to see how it works (pretty simple, really). When I run the script I can click Start and it gets into the GUIStartStop function, but while it's running, clicking Stop never gets into GUIStartStop.

Also, while it's running it doesn't respond to GUISetOnEvent($GUI_EVENT_CLOSE, "GUIClose"): when I try to close the GUI, it never gets to the GUIClose function either.

I have searched the script for anything that might be turning off or changing the GUICtrlSetOnEvent registration, but can't find anything.

Sorry if I missed something really simple.

Thanks, all.

Edited by JackDinn

Thx all,Jack Dinn.

 

JD's Auto Internet Speed Tester

JD's Clip Catch (With Screen Shot Helper)

Projects :- AutoIt - My projects

My software never has bugs. It just develops random features. :-D


Hmm, I think I found what the problem is:

When an event calls a function in OnEvent mode, no other event will be executed until that function returns.

So because GUIStartStop() calls the other functions itself, no event can call GUIStartStop() again until it has come back from all of those functions and returned to where it was first called from.

That's why I can start it (the first OnEvent call is fine), but after that neither GUIStartStop() nor GUIClose() can be called until the first call finishes. Here that doesn't happen until the crawl ends (I'm not quite sure how long that takes) and control gets back to the little While/WEnd loop it was originally called from.

http://www.autoitscript.com/forum/index.ph...&hl=OnEvent
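One common way around this is to have the handler only flip a flag and let the main loop do the long-running work. The sketch below is a standalone demo rather than a patch to the spider: it assumes a current AutoIt release (hence GUIConstantsEx.au3), and all names in it are made up, with _DoOneStep standing in for one GetURLs() call.

```autoit
#include <GUIConstantsEx.au3>

Opt("GUIOnEventMode", 1)

Global $g_bRunning = False ; set by the button handler, read by the main loop
Global $g_iStep = 0

Global $g_hGUI = GUICreate("Flag demo", 220, 60)
Global $g_idLabel = GUICtrlCreateLabel("Idle", 10, 10, 200, 20)
Global $g_idButton = GUICtrlCreateButton("Start", 10, 30, 60, 22)
GUISetOnEvent($GUI_EVENT_CLOSE, "_OnClose")
GUICtrlSetOnEvent($g_idButton, "_OnStartStop")
GUISetState(@SW_SHOW)

; The main loop does the work, one small step at a time.
While 1
    If $g_bRunning Then
        _DoOneStep()
    Else
        Sleep(100)
    EndIf
WEnd

; Handlers return immediately, so later clicks are not blocked.
Func _OnStartStop()
    $g_bRunning = Not $g_bRunning
    If $g_bRunning Then
        GUICtrlSetData($g_idButton, "Stop")
    Else
        GUICtrlSetData($g_idButton, "Start")
    EndIf
EndFunc

Func _OnClose()
    Exit
EndFunc

; Stand-in for one unit of crawling work.
Func _DoOneStep()
    $g_iStep += 1
    GUICtrlSetData($g_idLabel, "Step " & $g_iStep)
    Sleep(200)
EndFunc
```

Applied to the spider, Stop would take effect once the page currently being processed is finished, since the flag is only checked between steps.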

Edited by JackDinn

