
Fast URL Spider + email extractor


littleclown

Hello, lsakizada.

Thank you for your comment!

I added these functions to the script and everything works fine.

The issue is that this makes the URLs unreadable for me when I review the database :(.

Why do I need these two functions? I mean, what do they actually prevent?

And another question, if it's not some kind of secret: what is your project about?


  • 2 weeks later...

Maybe some DllCalls to SetSystemCursor could replace the "thinking" cursor with the same image as the default one, acting as makeup, since the "thinking" cursor would then look like the "normal" cursor (and of course, you restore the original "thinking" cursor when leaving the script).

Global Const $OCR_APPSTARTING = 32650
Global Const $OCR_NORMAL = 32512
Global Const $OCR_CROSS = 32515
Global Const $OCR_HAND = 32649
Global Const $OCR_IBEAM = 32513
Global Const $OCR_NO = 32648
Global Const $OCR_SIZEALL = 32646
Global Const $OCR_SIZENESW = 32643
Global Const $OCR_SIZENS = 32645
Global Const $OCR_SIZENWSE = 32642
Global Const $OCR_SIZEWE = 32644
Global Const $OCR_UP = 32516
Global Const $OCR_WAIT = 32514

; _SetCursor(@WindowsDir & "\cursors\3dgarro.cur", $OCR_NORMAL)
;_SetCursor(@WindowsDir & "\cursors\3dwarro.cur", $OCR_NORMAL)
;_SetCursor(@WindowsDir & "\cursors\banana.ani", $OCR_NORMAL)

;==================================================================
; $s_file - file to load cursor from
; $i_cursor - system cursor to change
;==================================================================
Func _SetCursor($s_file, $i_cursor)
   Local $newhcurs, $lResult
   ; Load a cursor handle from the given .cur/.ani file
   $newhcurs = DllCall("user32.dll", "int", "LoadCursorFromFile", "str", $s_file)
   If Not @error Then
      ; Replace the chosen system cursor (one of the $OCR_* IDs above) with the loaded one
      $lResult = DllCall("user32.dll", "int", "SetSystemCursor", "int", $newhcurs[0], "int", $i_cursor)
      If Not @error Then
         ; Clean up the cursor handle
         $lResult = DllCall("user32.dll", "int", "DestroyCursor", "int", $newhcurs[0])
      Else
         MsgBox(0, "Error", "Failed SetSystemCursor")
      EndIf
   Else
      MsgBox(0, "Error", "Failed LoadCursorFromFile")
   EndIf
EndFunc

That script, found earlier on the forum (I have neither the link nor the author's name), would be a good start.

The only thing I do not like is the LoadCursorFromFile call, which implies that we all use the default set of mouse cursors...

I suppose there is a way to get the handle of the default cursor without loading it from a file, but I'm not sure how to do that...

Edit: the WinAPIEx UDF seems to have some ready-to-use cursor functions...
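
For anyone wanting a file-free variant, here is a rough, untested sketch of the idea using plain user32 calls (LoadCursor, CopyIcon, SetSystemCursor and SystemParametersInfo); the function names are mine, not from any UDF, and the cursor IDs reuse the $OCR_* constants listed above.

Global Const $SPI_SETCURSORS = 0x57 ; SystemParametersInfo action that reloads the original cursors

; Untested sketch: make the "wait" (thinking) cursor look like the default arrow
; without loading anything from a .cur file.
Func _SetWaitCursorToArrow()
   ; LoadCursor with a NULL instance and IDC_ARROW (32512) returns the shared arrow cursor
   Local $aArrow = DllCall("user32.dll", "handle", "LoadCursorW", "ptr", 0, "ptr", 32512)
   If @error Then Return SetError(1, 0, False)
   ; SetSystemCursor takes ownership of the handle it is given, so hand it a copy
   Local $aCopy = DllCall("user32.dll", "handle", "CopyIcon", "handle", $aArrow[0])
   If @error Then Return SetError(2, 0, False)
   ; 32514 = $OCR_WAIT from the constant list above
   Local $aRet = DllCall("user32.dll", "bool", "SetSystemCursor", "handle", $aCopy[0], "dword", 32514)
   If @error Then Return SetError(3, 0, False)
   Return $aRet[0] <> 0
EndFunc

; Put every system cursor back the way the user had it (call this when leaving the script)
Func _RestoreSystemCursors()
   DllCall("user32.dll", "bool", "SystemParametersInfoW", "dword", $SPI_SETCURSORS, "dword", 0, "ptr", 0, "dword", 0)
EndFunc

SystemParametersInfo with SPI_SETCURSORS reloads the user's own cursor set from the registry, so the restore step also covers people who are not using the default cursors.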

Edited by SagePourpre

  • 1 year later...

Can anyone here help me with an issue in this script?

I am using it against a site which has multiple sub-domain levels, e.g.

sub1.mainsub.domain.com

sub.usa.domain.com

etc.

The script works great and is extremely fast; the trouble is that it breaks at the same spot every time I run it.

Now, I am not totally sure it is the sub-domains causing the errors, though.

In the URLs table it reaches record 306,255, level 4, and returns the following error message:

Line 89 (File "extract_new_gen_2.au3"):

$found_url="http://" & $domain[2] & "/" & $found_url

$found_url="http://" & $domain^ ERROR

Here is the actual line 89 code:

$found_url="http://" & $domain[2] & "/" & $found_url

I exported all of the URLs it obtained and went through them, and I found that it was in fact collecting the multiple sub-domain URLs like sub.main.domain.com.

I tried changing $found_url="http://" & $domain[2] to [3] and [4], but that just truncated the URLs on future crawls.

Any help with this would be greatly appreciated, or an alternative fast spider. I do not care about collecting email addresses; all I want is to obtain a complete list of ALL URLs on the site in a quick manner, and there are well over 3 million URLs on the site.


I forgot to mention the final error statement:

Error: Subscript used with non-Array variable

I just finished trying this script on a few other sites; some give the error fairly quickly into the crawl.

Look for the function that assigns data to that variable and add something like this, I guess:

If IsArray($domain) And UBound($domain) > 2 Then
   $found_url = "http://" & $domain[2] & "/" & $found_url
Else
   Return
EndIf

Or, depending on whether it's in a loop, you might want to use something other than Return, like ExitLoop or ContinueLoop.
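
As an untested aside, the host could also be re-derived from the URL itself with a regular expression, so the fix would not depend on how many elements end up in $domain. The helper name _GetHost below is made up for illustration and is not part of the original script:

; Hypothetical helper: pull the host part out of a URL without assuming
; a fixed number of sub-domain levels (handles sub1.mainsub.domain.com etc.).
; Returns "" and sets @error when the string has no host (e.g. a relative link).
Func _GetHost($sUrl)
   Local $aMatch = StringRegExp($sUrl, "(?i)^(?:[a-z][a-z0-9+.-]*://)?([^/\s?#]+)", 1)
   If @error Then Return SetError(1, 0, "")
   Return $aMatch[0]
EndFunc

; Quick check against the kinds of URLs mentioned above
Local $aTests[3] = ["http://sub1.mainsub.domain.com/page.html", "sub.usa.domain.com/a/b?q=1", "/relative/link.html"]
For $sUrl In $aTests
   Local $sHost = _GetHost($sUrl)
   If $sHost <> "" Then
      ConsoleWrite("host of " & $sUrl & " -> " & $sHost & @CRLF)
   Else
      ConsoleWrite("skipping (no host found): " & $sUrl & @CRLF)
   EndIf
Next

Line 89 could then become something along the lines of $found_url = "http://" & _GetHost($current_url) & "/" & $found_url, where $current_url stands for whatever variable holds the page the link was found on (a guess, since I do not have the script in front of me), with the empty-string case skipping the link as in the guard above.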

Edited by THAT1ANONYMOUSEDUDE

THAT1ANONYMOUSEDUDE,

I tried what you suggested, to no avail. Would you mind looking at the code and seeing where the problem is? I have attached a small db from usa.gov that I tried; it failed very fast and gave the same error message as before. If you simply download the two attachments and run the au3 file, you should instantly see the error.

I am not a coder, though many times I am able to decipher where the issue is; just not in this script. This is an extremely handy script for large sites if it works correctly, as it collects the information very fast.

Anyway, hoping someone here might be able to assist with this one.

There are no viruses in these two files, simply the source code and the RARed SQLite db file.

Edited by Valik


Wow, when I posted that comment I didn't even notice this was a spider. Awesome, and saved!

Anyway, posting the database was a little irrelevant.

It should be error-free now, though.

<Removed>
Edited by Valik

THAT1ANONYMOUSEDUDE,

That's fantastic, man. I'm trying it out now on a large site and will let you know if there are any issues. Thanks a lot for the assist, and a speedy one at that!

I do have a question for you, though: is there any way you can make this script traverse open directories? I ran a few different tests, and if it comes across an open directory it will not traverse it to obtain the file names. It simply lists the directories and files in the parent.

Here are two examples of open directory sites:

debian.mirror.iweb.ca

mirror.math.ku.edu/tex-archive/info/

The key, though, is that once it finds an open directory, no matter where it is, it would traverse the entries as if they were HTML links and retrieve the file names in every parent and sub-folder.

Just hoping you know how to modify it to grab that info as well.

FYI, another great app for average websites, say under 2 million URLs, is Xenu (http://home.snafu.de/tilman/xenulink.html). I have been using that app ever since it came out; the trouble is that once you get beyond the 2 million URL mark you had better have a ton of RAM, as 2.6 million URLs eats up about 4 GB of RAM, and eventually the app crashes, meaning you lose all the work it did completely. One huge advantage, though, is that it obtains a lot more information than just the URL itself. Anyway, I thought it was worth mentioning here for folks who have not heard about it.



Yeah, of course, but I'm getting really sleepy right now, so I'll work on this tomorrow and post the results here, I guess. For now you can try fiddling with this.

<Removed>

This version of the OP's code is a bit modified; later I'll attempt to implement the directory-scanning code into the actual spider script. I have a large directory where I store almost every interesting script I come across, and getting this to work is simply a matter of copy, paste and editing a few things here and there. The code that gets the files was taken from a script found somewhere on the forums; I always forget who I take stuff from.

:oops:


Edit: forgot the damn attachment...
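
Since the attachment is gone, here is a rough, untested sketch of the directory-walking idea for anyone reading along: fetch an Apache-style index page, pull out the href values with a regular expression, recurse into entries ending in "/" and record the rest as files. The function name and structure are illustrative only, not taken from the removed attachment.

#include <Array.au3>
#include <InetConstants.au3>

; Untested sketch: walk an open directory listing and collect full file URLs.
; $sBaseUrl must end with "/", e.g. "http://mirror.math.ku.edu/tex-archive/info/".
Func _CrawlOpenDir($sBaseUrl, ByRef $aFiles, $iDepth = 0, $iMaxDepth = 10)
   If $iDepth > $iMaxDepth Then Return ; safety brake for very deep listings

   Local $dPage = InetRead($sBaseUrl, $INET_FORCERELOAD)
   If @error Then Return
   Local $sHtml = BinaryToString($dPage)
   If $sHtml = "" Then Return

   ; Grab every href="..." on the index page
   Local $aLinks = StringRegExp($sHtml, '(?i)href="([^"]+)"', 3)
   If @error Then Return

   For $sLink In $aLinks
      ; Skip parent-directory links, sort links (?C=N;O=D) and links that leave this listing
      If $sLink = "../" Or StringLeft($sLink, 1) = "?" Then ContinueLoop
      If StringLeft($sLink, 1) = "/" Or StringInStr($sLink, "://") Then ContinueLoop

      If StringRight($sLink, 1) = "/" Then
         _CrawlOpenDir($sBaseUrl & $sLink, $aFiles, $iDepth + 1, $iMaxDepth) ; sub-directory: recurse
      Else
         _ArrayAdd($aFiles, $sBaseUrl & $sLink) ; plain file: record the full URL
      EndIf
   Next
EndFunc

; Example use against one of the sites mentioned above
Local $aFiles[0]
_CrawlOpenDir("http://mirror.math.ku.edu/tex-archive/info/", $aFiles)
_ArrayDisplay($aFiles, "Files found")

If this were merged into the spider itself rather than kept as a separate pass, the recursion could instead push each discovered URL onto whatever queue or table the spider already uses for ordinary links.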

Edited by Valik

You can sit here all day long and debate whether or not it's trivial to turn this into a malicious tool. Guess what, though, your opinions are irrelevant. This is especially true when I can rip apart every single one of your opinions and expose the flaws in logic.

This thread is not a good idea. Unlike some of you I do not underestimate what code people can abuse. I also don't see things quite so black and white with regards to why somebody would use this. I think there's a small section of people who know enough to modify the code but don't know or care to know how to obtain/use a traditional email harvester.

This thread is locked. Do not PM me to argue, debate, express your opinion on it, et cetera. I'm not the only moderator who feels this way and I have better things to do than to privately explain in nauseating detail how your logic is both irrelevant to me and fundamentally flawed.


This topic is now closed to further replies.