
Fast URL Spider + email extractor


  • This topic is locked
31 replies to this topic

#1 littleclown


Posted 15 March 2010 - 07:43 AM

Hello.
I think this is my first post in this subforum.

This is quick and dirty code. I am not a programmer, and I am sure there are much better ways to do this.
The main advantage of this code is speed. I use SQLite, which makes this script faster than every other example here.

Features:
Find all URLs on one domain
Find all e-mails on one domain (you can modify this to collect URLs and e-mails outside the domain too)
You can stop it and restart it; it resumes from the last URL
Deep Level Option
settings file
Progress bar

TODO:
Interface
Deep level option

Everything is stored in an SQLite database, but you can convert it to whatever you need. I use SQLite Database Browser to review the db.
If you need to reset the tool, just remove or rename the database file.

Edited by Valik, 13 March 2012 - 03:22 PM.






#2 JRowe


Posted 15 March 2010 - 01:30 PM

You should make sure you're abiding by the robots.txt standard.
http://www.robotstxt.org/

Otherwise, neat :)
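
A minimal sketch of what such a check could look like in AutoIt, assuming the crawler only needs to honor `Disallow` prefixes in the `User-agent: *` group. `_RobotsAllows` is a hypothetical helper, not part of the posted script; it ignores `Allow` lines, wildcards, and grouped user-agent lines, and the robots.txt URL in the usage lines is only the standard location for the site mentioned in this thread.

```autoit
; Hypothetical helper (not from the posted script).
; Returns False if $sPath starts with a Disallow prefix from the "User-agent: *" group.
Func _RobotsAllows($sRobotsTxt, $sPath)
    Local $aLines = StringSplit(StringStripCR($sRobotsTxt), @LF)
    Local $bApplies = False, $sLine, $aPair
    For $i = 1 To $aLines[0]
        ; Drop comments and surrounding whitespace
        $sLine = StringStripWS(StringRegExpReplace($aLines[$i], "#.*", ""), 3)
        $aPair = StringRegExp($sLine, "(?i)^(user-agent|disallow)\s*:\s*(.*)$", 1)
        If @error Then ContinueLoop ; blank line or a directive this sketch ignores
        If StringLower($aPair[0]) = "user-agent" Then
            $bApplies = ($aPair[1] == "*")
        ElseIf $bApplies And $aPair[1] <> "" Then
            ; URL paths are case-sensitive, hence ==
            If StringLeft($sPath, StringLen($aPair[1])) == $aPair[1] Then Return False
        EndIf
    Next
    Return True
EndFunc   ;==>_RobotsAllows

; Usage: fetch robots.txt once per site, then test each path before visiting it
Local $sRobots = BinaryToString(InetRead("http://www.autoitscript.com/robots.txt"))
If _RobotsAllows($sRobots, "/forum/") Then ConsoleWrite("/forum/ may be crawled" & @CRLF)
```

An empty or missing robots.txt yields `True` for every path, which matches the convention that no file means no restrictions.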

#3 littleclown


Posted 15 March 2010 - 02:57 PM

I created this for an internal site at my company, which has no robots.txt, but it's a good idea to check this file.

Please see the attached file.

Changes are:
.option to limit the search level (depth)
.option to choose whether to search outside the domain
.console output shows a little more info
.separate db file for every site

SQLite is cool :)

Edited by Valik, 13 March 2012 - 03:21 PM.


#4 JohnOne


Posted 15 March 2010 - 03:35 PM

I've no idea what this is, but would like to know.

Any chance of an example ?
AutoIt Absolute Beginners Require a serial
Run('hh mk:@MSITStore:'&StringReplace(@AutoItExe,'.exe','.chm')&'::/html/tutorials/helloworld/helloworld.htm','',@SW_MAXIMIZE)

#5 littleclown


Posted 15 March 2010 - 03:57 PM

This is a simple robot (crawler, spider). You set a start URL, for example autoitscript.com (without http://), and the script finds all e-mails (in plain text) on that site.

There are some variables that you can change:


$firsturl="strelki.info"
$max_level=0
$domain_only=1

This means that I want to scan the strelki.info site with no maximum level (a FULL site scan), following only internal links (no ads or other external links).

The robot "visits" the main URL, scans it for URLs and writes them to the db file, then scans it for e-mails and writes them too. After that it goes to the second URL (the first URL found on the main page), and so on.

When it finishes, you have all the e-mails that appear on the site and a list of all internal URLs.
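
The URL side of that loop could be sketched roughly as follows. This is not the attached script: the `urls` table, its `checked` column, the db file name, and the way `$max_level` is applied are all assumptions for illustration, only absolute http(s) links are followed, and the e-mail scan is left out.

```autoit
#include <SQLite.au3>

; Settings as in the post above; everything else is a hypothetical sketch
Local $firsturl = "strelki.info", $max_level = 0, $domain_only = 1

_SQLite_Startup() ; needs sqlite3.dll where AutoIt can find it
If @error Then Exit 1
Local $hDb = _SQLite_Open(@ScriptDir & "\" & $firsturl & ".db")
; "urls" and its columns are assumed names, not the script's real schema
_SQLite_Exec($hDb, "CREATE TABLE IF NOT EXISTS urls (url TEXT PRIMARY KEY, level INTEGER, checked INTEGER DEFAULT 0);")
_SQLite_Exec($hDb, "INSERT OR IGNORE INTO urls (url, level) VALUES (" & _SQLite_FastEscape("http://" & $firsturl & "/") & ", 0);")

Local $aRes, $iRows, $iCols, $sUrl, $iLevel, $sHtml, $aLinks
While 1
    ; Next unchecked URL; after a restart this is where the crawl resumes
    _SQLite_GetTable2d($hDb, "SELECT url, level FROM urls WHERE checked = 0 ORDER BY rowid LIMIT 1;", $aRes, $iRows, $iCols)
    If @error Or $iRows = 0 Then ExitLoop
    $sUrl = $aRes[1][0] ; row 0 holds the column names
    $iLevel = Number($aRes[1][1])
    $sHtml = BinaryToString(InetRead($sUrl))
    If $max_level = 0 Or $iLevel < $max_level Then
        ; Absolute http(s) links only; relative links are skipped to keep the sketch short
        $aLinks = StringRegExp($sHtml, '(?i)href\s*=\s*["''](https?://[^"''#\s]+)', 3)
        If Not @error Then
            For $sLink In $aLinks
                ; With $domain_only set, keep only links on $firsturl or its subdomains
                If $domain_only And Not StringRegExp($sLink, "(?i)^https?://([^/]*\.)?\Q" & $firsturl & "\E(/|$)") Then ContinueLoop
                _SQLite_Exec($hDb, "INSERT OR IGNORE INTO urls (url, level) VALUES (" & _SQLite_FastEscape($sLink) & ", " & ($iLevel + 1) & ");")
            Next
        EndIf
    EndIf
    _SQLite_Exec($hDb, "UPDATE urls SET checked = 1 WHERE url = " & _SQLite_FastEscape($sUrl) & ";")
    Sleep(250) ; small pause between requests to go easy on the server
WEnd

_SQLite_Close($hDb)
_SQLite_Shutdown()
```

Because the queue lives in the database and each URL is marked only after it has been processed, stopping the script and starting it again simply continues with the first row that is still unchecked.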

That's all.

Spammers usually use this way to get victims' emails.

If you edit the code, you can analyze the page source for other kinds of info: an image list, an external links list, the size of the HTML, images, or videos, whatever you need.

Edited by littleclown, 15 March 2010 - 04:00 PM.


#6 JohnOne


Posted 15 March 2010 - 04:55 PM

So if I use this on, say, my Googlemail account while logged in, I can find, for example, duplicate email addresses?

Sounds cool

#7 littleclown


Posted 15 March 2010 - 05:41 PM

I think Gmail can protect e-mail information from robots really well :)

#8 Manadar


Posted 15 March 2010 - 06:41 PM

JohnOne said:
"So if I use this on, say, my Googlemail account while logged in, I can find, for example, duplicate email addresses?

Sounds cool"

No, being logged in with Internet Explorer (or any browser) is not going to automatically log you in within this crawler. There is no simple way to add support for sessions, etc., either.

#9 JohnOne


Posted 15 March 2010 - 11:29 PM

I see, so this is probably more useful for your own server.

Thanks.

#10 littleclown


Posted 16 March 2010 - 07:17 AM

I will try to add streams for faster downloads.
Another idea is to make an image downloader.

#11 littleclown


Posted 21 March 2010 - 09:09 AM

Check out the latest version (in the first post).

Progress bar added.
Some bug fixes.
Settings file added.

I tried adding streams and it works, but it is not significantly faster than the old version and it complicates the code. That's why I will leave this idea for now.


Here is a sample settings.ini file:


[DEFAULT]
FIRST_URL=strelki.info
MAX_LEVEL=0
DOMAIN_ONLY=1
LOG_EXTERNAL=0
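
For anyone wondering how a file like this is typically consumed, here is a small sketch using AutoIt's built-in `IniRead`. It assumes settings.ini sits next to the script; the fallback values and the `$log_external` variable name are guesses, not taken from the attached script.

```autoit
; Read settings.ini from the script folder; the defaults here are assumptions
Local $sIni = @ScriptDir & "\settings.ini"
Local $firsturl = IniRead($sIni, "DEFAULT", "FIRST_URL", "")
Local $max_level = Number(IniRead($sIni, "DEFAULT", "MAX_LEVEL", "0"))
Local $domain_only = Number(IniRead($sIni, "DEFAULT", "DOMAIN_ONLY", "1"))
Local $log_external = Number(IniRead($sIni, "DEFAULT", "LOG_EXTERNAL", "0"))

If $firsturl = "" Then
    ConsoleWrite("FIRST_URL is missing from " & $sIni & @CRLF)
    Exit 1
EndIf
ConsoleWrite("Scanning " & $firsturl & ", max level " & $max_level & @CRLF)
```

`IniRead` always returns a string, so the numeric options are passed through `Number()` before being compared.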



Has nobody tried this yet?
Any suggestions?

#12 SagePourpre


Posted 23 March 2010 - 09:55 PM

Nice.

I was not sure how to make it work before your example of a valid settings.ini.

I'm currently trying it. (It will take a while before it has finished.)

I do not really have a use for collecting everybody's emails, but
I find this script very interesting.



Edit:
Since the operation can take a while, it would be nice if the
progress bar window did not have the on-top attribute and was possibly movable.
(I would then put it on my second display to keep watching
the progress without blocking the view on my first display.)
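
If the script uses AutoIt's built-in `ProgressOn` (an assumption, since the attachment is not shown here), both wishes map to its last parameter: option 2 drops the always-on-top attribute and option 16 makes the window movable. The title, texts, and loop below are placeholders.

```autoit
; opt 2 = no "always on top" attribute, 16 = window can be moved
ProgressOn("Spider", "Scanning...", "0%", -1, -1, 2 + 16)
For $i = 1 To 100
    Sleep(50) ; stand-in for fetching one page
    ProgressSet($i, $i & "%")
Next
ProgressOff()
```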

Edited by SagePourpre, 23 March 2010 - 10:32 PM.


#13 IchBistTod


Posted 23 March 2010 - 11:09 PM

This is an email harvester. Used to spam email addresses.

=]


#14 JRowe


Posted 23 March 2010 - 11:20 PM

This is a spider template. Used to harvest information off a web page. Spamming would require significantly more work, and anyone capable of that work would be more than capable of something like this.

It's a pattern matching tool, nothing more.

#15 IchBistTod


Posted 23 March 2010 - 11:46 PM

He said he made it for a company, yes?
It's currently designed to harvest emails out of web pages, yes?
He even stated "Spammers usually use this way to get victims' emails."
And more difficult code? I think not.
PHP Mail() + GET + AutoIt _INetGetSource() or a TCP variant would effectively send mass emails in a loop, with under 50 lines of code and under 10 minutes of coding.


I really don't think this should be openly available, given how easily it could be used to create a spammer.
As I stated, it is not more work to make a spammer, and any noob could write the code to do it. The spider and pattern matching are much more complex than the spamming component, so your statement does not hold true. Someone who could make the spammer would not necessarily know how to make an email harvester like this one.


This gives them the code they need to make a spammer + auto harvester.
This should at least have the ability to harvest emails stripped.

=]


#16 JRowe


Posted 24 March 2010 - 12:30 AM

Heh. You're both overestimating the capacity of noobs to make code work and underestimating the difficulty of this task. This converts the data to a sqlite format, presuming upon a level of technical experience that noobs don't have. You need an interface to access the data, or you need to be able to modify the code to make the data more accessible. Both require a level of proficiency that makes this task trivial.

It's not like this outputs a ready to use text file. You need to be able to interface with the code somehow... and any idiots trying to do that will migrate to the help forum, say "how can i use this to collect email addresses" and Jos or Valik or Smoke_N will ban their ass.

#17 IchBistTod


Posted 24 March 2010 - 12:39 AM

Simply stating the autoit help file + mysql documentation would easily show any user how to retrieve the email addresses out of the db with a few lines of code using a query with the "select *" command.

=]


#18 littleclown


Posted 24 March 2010 - 04:08 PM

I think there are a lot of tools for spammers already.
I am not sure how somebody would become a spammer because of this code :(. If somebody wants to be a spammer, he will find a way. I think this script would be his last choice :).
Yes, spammers use tools like this, but I think they already have what they need.

You can find many scripts here and use them for "bad" actions.
I published this script because there is nothing like it here, and maybe it will be helpful to somebody. I hope nobody will get more spam because of this simple script :).

Anybody can edit this to make a mass image downloader or a website downloader, or to do some analysis such as word counting (this would be useful for SEO projects).

Sorry about the always-on-top progress bar.

Be aware that a normal website has many, many pages, so if you need all URLs or all e-mails, this can sometimes take days. The word "Fast" is in the title because this is the fastest example here. I don't use third-party DLLs, and I use SQLite; I believe this makes the script faster and more reliable.


Some tips: if you stop this script, it will resume from the last checked URL.
If you need to reset the search, just remove or rename the db file.

It was interesting to me how simply this can be done. I developed the level system in 3 minutes! ;).

I am not a programmer, so if there is some stupid solution in my code and you know a way to speed up the process or make the code simpler or more reliable, please comment.

Edited by littleclown, 24 March 2010 - 04:10 PM.


#19 lsakizada


Posted 25 March 2010 - 11:09 AM

littleclown said:
"I am not a programmer, so if there is some stupid solution in my code and you know a way to speed up the process or make the code simpler or more reliable, please comment."

First, thanks for sharing.
Your script gave me an idea that could be useful for my project; it absolutely will not be used for spamming.

You may want to pass the data through a URI encode function before saving it to the database,
to make sure localized URLs are saved properly.

You can use these functions, for example:

Func _URIDecode($sData)
    ; Prog@ndy
    Local $aData = StringSplit(StringReplace($sData, "+", " ", 0, 1), "%")
    $sData = ""
    For $i = 2 To $aData[0]
        ; Turn the two hex digits after each "%" back into a byte
        $aData[1] &= Chr(Dec(StringLeft($aData[$i], 2))) & StringTrimLeft($aData[$i], 2)
    Next
    Return BinaryToString(StringToBinary($aData[1], 1), 4)
EndFunc   ;==>_URIDecode

Func _URIEncode($sData)
    ; Thanks to Prog@ndy
    Local $aData = StringSplit(BinaryToString(StringToBinary($sData, 4), 1), "")
    Local $nChar
    $sData = ""
    For $i = 1 To $aData[0]
        $nChar = Asc($aData[$i])
        Switch $nChar
            ; Unreserved characters: - . 0-9 A-Z _ a-z ~
            Case 45, 46, 48 To 57, 65 To 90, 95, 97 To 122, 126
                $sData &= $aData[$i]
            Case 32
                $sData &= "+"
            Case Else
                $sData &= "%" & Hex($nChar, 2)
        EndSwitch
    Next
    Return $sData
EndFunc   ;==>_URIEncode

Be Green Now or Never (BGNN)!

#20 PhilHibbs


Posted 25 March 2010 - 12:41 PM

IchBistTod said:
"Simply stating the autoit help file + mysql documentation would easily show any user how to retrieve the email addresses out of the db with a few lines of code using a query with the "select *" command."

Oh come on. 200 lines of AutoIt script, most of which is database code, is hardly a major spamming tool. Spammers have tools that crack CAPTCHAs and create user accounts to get full access to forums that need a log-in. This is very basic stuff and does not constitute a hacking tool.



