littleclown

Fast URL Spider + email extractor

32 posts in this topic

#1 ·  Posted (edited)

Hello.

I think this is my first post in this subforum.

This is quick and dirty code. I am not a programmer, and I am sure there are much better ways to do this.

The main advantage of this code is its speed. I use SQLite, and I believe this makes the script faster than the other examples posted here.

Features:

Find all URLs on one domain

Find all e-mails on one domain (you can modify this to collect URLs and e-mails outside the domain too)

You can stop and restart it; it resumes from the last URL

Deep level option

Settings file

Progress bar

TODO:

Interface

Deep level option

Everything is stored in an SQLite database, but you can convert it to whatever format you need. I use SQLite Database Browser to inspect the database.

If you need to reset the tool, just remove or rename the database file.

Edited by Valik




You should make sure you're abiding by the robots.txt standard.

http://www.robotstxt.org/

Otherwise, neat :(
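
A minimal sketch of what such a check could look like in AutoIt, for anyone who wants to add it. The function name `_RobotsAllows` is hypothetical and is not part of the posted script. It only honours `Disallow` prefixes in the `User-agent: *` group and ignores `Allow` lines and wildcards, so treat it as a starting point rather than a full implementation of the standard.

```autoit
; Hypothetical helper (not from the posted script).
; Returns False if $sPath starts with a Disallow prefix listed under
; "User-agent: *" in the given robots.txt text, otherwise True.
Func _RobotsAllows($sRobotsTxt, $sPath)
    Local $aLines = StringSplit(StringStripCR($sRobotsTxt), @LF)
    Local $bInStarGroup = False, $bPrevAgent = False
    For $i = 1 To $aLines[0]
        ; Drop comments, then leading and trailing whitespace
        Local $sLine = StringStripWS(StringRegExpReplace($aLines[$i], "#.*", ""), 3)
        Local $iColon = StringInStr($sLine, ":")
        If $iColon = 0 Then ContinueLoop
        Local $sField = StringLower(StringStripWS(StringLeft($sLine, $iColon - 1), 3))
        Local $sValue = StringStripWS(StringMid($sLine, $iColon + 1), 3)
        If $sField = "user-agent" Then
            ; Consecutive User-agent lines share one group of rules
            $bInStarGroup = ($bPrevAgent And $bInStarGroup) Or ($sValue = "*")
            $bPrevAgent = True
        Else
            $bPrevAgent = False
            If $sField = "disallow" And $bInStarGroup And $sValue <> "" Then
                ; Case-sensitive prefix match against the requested path
                If StringLeft($sPath, StringLen($sValue)) == $sValue Then Return False
            EndIf
        EndIf
    Next
    Return True
EndFunc   ;==>_RobotsAllows

; Usage with a sample robots.txt held in a string
Global $sSample = "User-agent: *" & @LF & "Disallow: /private/" & @LF & _
        "User-agent: otherbot" & @LF & "Disallow: /"
ConsoleWrite(_RobotsAllows($sSample, "/private/x.html") & @CRLF)
ConsoleWrite(_RobotsAllows($sSample, "/index.html") & @CRLF)
```

In a crawler you would fetch the site's /robots.txt once, keep the text, and call the check before visiting each URL.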


#3 ·  Posted (edited)

I created this for an internal site at my company that has no robots.txt, but it's a good idea to check that file.

Please see the attached file.

Changes are:

.option to limit the search level.

.option to choose whether to search outside the domain

.the console writes a little more info

.a separate db file for every site

SQLITE is cool :(

Edited by Valik


#5 ·  Posted (edited)

This is a simple robot (crawler, spider). You set a start URL, for example autoitscript.com (without http://), and the script finds all e-mails (in plain text) on that site.

There are some variables that you can change:

$firsturl="strelki.info"

$max_level=0

$domain_only=1

This means that I want to scan the strelki.info site with no maximum level (a FULL site scan) and follow only internal links (no ads or other external links).

The robot "visits" the main URL, scans it for URLs and writes them to the db file, then scans for e-mails and writes them too. After that it goes to the second URL (the first URL found on the main page), and so on.

At the end you have all the e-mails written on the site and a list of all its internal URLs.

That's all.

Spammers usually use this way to get victim's emails.

You can analyze the source for other kinds of information if you edit the code: an image list, an external link list, the size of the HTML, images, or videos, whatever you need.

Edited by littleclown


I think Gmail can protect e-mail information from robots really well :(


So if I use this on, say, my googlemail account while logged in, I can find, for example, duplicate email addresses?

Sounds cool

No, being logged in through Internet Explorer (or any browser) will not automatically log you in to this crawler. There is no simple way to add support for sessions and the like, either.


I will try to add streams for faster downloading.

Another idea is to make an image downloader.


Check out the latest version (in the first post).

Progress bar added.

Some bug fixes.

Settings file added.

I tried adding streams and it works, but it is not significantly faster than the old version and it makes the code more complicated. That's why I will leave this idea for now.

Here is a sample settings.ini file:

[DEFAULT]

FIRST_URL=strelki.info
MAX_LEVEL=0
DOMAIN_ONLY=1
LOG_EXTERNAL=0
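
For readers wondering how a file like this is consumed, here is a minimal sketch using AutoIt's built-in `IniRead`. The section and key names come from the sample above. The variable names, the defaults, and the file location (`@ScriptDir`) are assumptions, and the attached script may read its settings differently. The thread does not explain `LOG_EXTERNAL`, so it is read here only as an on/off flag.

```autoit
; Sketch only: variable names, defaults and file location are assumptions.
Global $sIni = @ScriptDir & "\settings.ini"

; IniRead returns strings; the last argument is the fallback value
Global $sFirstUrl = IniRead($sIni, "DEFAULT", "FIRST_URL", "")
Global $iMaxLevel = Int(IniRead($sIni, "DEFAULT", "MAX_LEVEL", "0")) ; 0 = no level limit
Global $bDomainOnly = (IniRead($sIni, "DEFAULT", "DOMAIN_ONLY", "1") = "1")
Global $bLogExternal = (IniRead($sIni, "DEFAULT", "LOG_EXTERNAL", "0") = "1")

If $sFirstUrl = "" Then
    ConsoleWrite("FIRST_URL is missing from " & $sIni & @CRLF)
    Exit 1
EndIf

ConsoleWrite("Start URL: " & $sFirstUrl & ", max level: " & $iMaxLevel & _
        ", domain only: " & $bDomainOnly & @CRLF)
```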

Has nobody tried this yet?

Any suggestions?


#12 ·  Posted (edited)

Nice.

I was not sure how to make it work before your example of a valid settings.ini.

I'm currently trying it. (It will take a while before it has finished.)

I do not really have a use for collecting everybody's emails, but I find this script very interesting.

edit :

Since the operation can take a while, it would be nice if the progress bar window did not have the on-top attribute and could be moved. (I would then put it on my second display to keep watching the progress without blocking the view on my first display.)

Edited by SagePourpre


This is an email harvester. Used to spam email addresses.



This is a spider template. Used to harvest information off a web page. Spamming would require significantly more work, and anyone capable of that work would be more than capable of something like this.

It's a pattern matching tool, nothing more.


He said he made it for a company, yes?

It's currently designed to harvest emails from web pages, yes?

He even stated "Spammers usually use this way to get victim's emails."

And more difficult code? I think not.

PHP Mail() + GET + AutoIt _INetGetSource() or a TCP variant would effectively send mass emails in a loop, with under 50 lines of code and under 10 minutes of coding.

I really don't think this should be openly available, given how easily it could be used to create a spammer.

As I stated, it is not more work to make a spammer, and any noob could write the code to do it. The spider and pattern matching are much more complex than the spamming component, so your statement does not hold true. Someone who could make the spammer would not necessarily know how to make an email harvester like this one.

This gives them the code they need to make a spammer + auto harvester.

At the very least, the ability to harvest emails should be stripped out.



Heh. You're both overestimating the capacity of noobs to make code work and underestimating the difficulty of this task. This converts the data to a sqlite format, presuming upon a level of technical experience that noobs don't have. You need an interface to access the data, or you need to be able to modify the code to make the data more accessible. Both require a level of proficiency that makes this task trivial.

It's not like this outputs a ready to use text file. You need to be able to interface with the code somehow... and any idiots trying to do that will migrate to the help forum, say "how can i use this to collect email addresses" and Jos or Valik or Smoke_N will ban their ass.


Simply stating the autoit help file + mysql documentation would easily show any user how to retrieve the email addresses out of the db with a few lines of code using a query with the "select *" command.



#18 ·  Posted (edited)

I think there are a lot of tools for spammers already.

I am not sure how somebody would become a spammer because of this code :(. If somebody wants to be a spammer, he will find a way. I think this script would be the last choice :).

Yes, spammers use tools like this, but I think they already have what they need.

You can find many scripts here and use them for "bad" actions.

I published this script because there is nothing like it here, and maybe it will be helpful to somebody. I hope nobody will get more spam because of this simple script :).

Anybody can edit this and make a mass image downloader, or a website downloader, or build some analysis such as word counting (this would be useful for SEO projects).

Sorry for the on-top progress bar.

Be aware that a normal website has a great many pages, and if you need all the URLs or all the e-mails, this can sometimes take days of work. There is a "Fast" in the title because this is the fastest example here. I don't use third-party DLLs and I use SQLite, and I believe this makes the script faster and more reliable.

Some tips: if you stop this script, it will resume work from the last checked URL.

If you need to reset the search, just remove or rename the db file.
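
A minimal sketch of how this kind of resume behaviour can be built on SQLite, for illustration only. It is not the attached script's actual code: the table and column names, the file name `crawl_sketch.db`, and the helper functions are all hypothetical. It uses the standard `SQLite.au3` UDF and assumes `sqlite3.dll` can be found by `_SQLite_Startup()`.

```autoit
#include <SQLite.au3>

; Hypothetical schema: one row per URL plus a flag for "already visited".
_SQLite_Startup()
If @error Then Exit MsgBox(16, "SQLite", "sqlite3.dll could not be loaded")
_SQLite_Open(@ScriptDir & "\crawl_sketch.db")
_SQLite_Exec(-1, "CREATE TABLE IF NOT EXISTS urls " & _
        "(url TEXT PRIMARY KEY, level INTEGER, checked INTEGER DEFAULT 0);")

; Queue a URL; INSERT OR IGNORE skips URLs that are already stored
Func _QueueUrl($sUrl, $iLevel)
    _SQLite_Exec(-1, "INSERT OR IGNORE INTO urls (url, level) VALUES (" & _
            _SQLite_FastEscape($sUrl) & "," & Int($iLevel) & ");")
EndFunc   ;==>_QueueUrl

; Return the oldest unvisited URL, or "" when nothing is left
Func _NextUrl()
    Local $aRow
    If _SQLite_QuerySingleRow(-1, "SELECT url FROM urls WHERE checked = 0 " & _
            "ORDER BY rowid LIMIT 1;", $aRow) <> $SQLITE_OK Then Return ""
    Return $aRow[0]
EndFunc   ;==>_NextUrl

; Mark a URL as done so that a restart skips it
Func _MarkChecked($sUrl)
    _SQLite_Exec(-1, "UPDATE urls SET checked = 1 WHERE url = " & _
            _SQLite_FastEscape($sUrl) & ";")
EndFunc   ;==>_MarkChecked

_QueueUrl("http://example.com/", 0)
While 1
    Local $sUrl = _NextUrl()
    If $sUrl = "" Then ExitLoop
    ; ... download and parse $sUrl here, queueing the links it contains ...
    _MarkChecked($sUrl)
WEnd
```

Because the queue lives in the database file and not in memory, stopping the script loses at most the page in progress, and deleting the file resets everything.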

It was interesting to me how simply this can be done. I developed the level system in 3 minutes! ;).

I am not a programmer, and if there is some stupid solution in my code and you know a way to speed up the process or to make the code simpler or more reliable, please comment.

Edited by littleclown


I am not a programmer, and if there is some stupid solution in my code and you know a way to speed up the process or to make the code simpler or more reliable, please comment.

First, thanks for sharing.

Your script gave me an idea that could be useful for my project, which absolutely will not be used for spamming.

You may want to pass the data through a URI encode function before saving it to the database, to make sure that URLs containing local (non-ASCII) characters are saved properly.

You can use these functions, for example:

; Decode a percent-encoded string ("+" becomes a space); the bytes are read as UTF-8
Func _URIDecode($sData)
    ; Prog@ndy
    Local $aData = StringSplit(StringReplace($sData, "+", " ", 0, 1), "%")
    For $i = 2 To $aData[0]
        ; The first two characters after each "%" are the hex value of one byte
        $aData[1] &= Chr(Dec(StringLeft($aData[$i], 2))) & StringTrimLeft($aData[$i], 2)
    Next
    Return BinaryToString(StringToBinary($aData[1], 1), 4)
EndFunc   ;==>_URIDecode

; Percent-encode a string as UTF-8; a space becomes "+"
Func _URIEncode($sData)
    ; Thanks to Prog@ndy
    Local $aData = StringSplit(BinaryToString(StringToBinary($sData, 4), 1), "")
    Local $nChar
    $sData = ""
    For $i = 1 To $aData[0]
        $nChar = Asc($aData[$i])
        Switch $nChar
            ; Unreserved characters: - . 0-9 A-Z _ a-z ~
            ; (the digit range must be "48 To 57"; "48 - 57" is a subtraction and never matches a digit)
            Case 45, 46, 48 To 57, 65 To 90, 95, 97 To 122, 126
                $sData &= $aData[$i]
            Case 32
                $sData &= "+"
            Case Else
                $sData &= "%" & Hex($nChar, 2)
        EndSwitch
    Next
    Return $sData
EndFunc   ;==>_URIEncode

Be Green Now or Never (BGNN)!


Simply stating the autoit help file + mysql documentation would easily show any user how to retrieve the email addresses out of the db with a few lines of code using a query with the "select *" command.

Oh come on. 200 lines of AutoIt script, most of which is database code, is hardly a major spamming tool. Spammers have tools that crack CAPTCHAs and create user accounts to get full access to forums that need a log-in. This is very basic stuff and does not constitute a hacking tool.

This topic is now closed to further replies.