Your own web crawler/spider

Creator

Here are a few examples for creating your own web crawler/spider using the free ActiveX component from Chilkat.

I found it very fast and easy to use.

Free download of the ActiveX component: http://www.chilkatsoft.com/download/SpiderActiveX.msi

Reference for the ActiveX component (if you don't want to wait for me to post more examples :) ): http://www.chilkatsoft.com/refdoc/xSpiderRef.html

Examples include:

  • Getting Started Spidering a Site.au3
  • Extract HTML Title, Description, Keywords.au3
  • Fetch robots.txt for a Site.au3
  • Avoid URLs Matching Any of a Set of Patterns.au3
  • Setting a Maximum Response Size.au3
  • Setting a Maximum URL Length.au3
  • Using the Disk Cache.au3
  • Crawling the Web.au3
  • Get Referenced Domains.au3
  • A Simple Web Crawler.au3

Did I mention it's fully robots.txt compliant?!

Have fun!

More examples to come.

Examples added in a new zip file:

  • Get Base Domains
  • GetBaseDomain
  • CanonicalizeUrl
  • Avoiding Outbound Links Matching Patterns
  • Must-Match Patterns

These examples are ported from the VBScript examples on the Chilkat site.

Updated the zip with A simple webcrawler.au3 (it crawls a Google directory... how ironic ^_^).

Spider_Examples.zip

Spider_Examples_2.zip

Edited by Creator

Creator

Added A simple webcrawler.au3, which crawls a Google directory and is pretty much complete.

If you want to build a full HTML index, the complete HTML of each crawled URL is available in the LastHtml property.

Just imagine doing an offline search of the AutoIt forums with all keywords allowed :)
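
To make the offline-search idea a bit more concrete, here is a minimal sketch of a keyword index built from crawled pages. It is written in plain Python and is not part of the Chilkat examples: the `pages` dictionary stands in for the HTML you would collect from LastHtml while crawling, and `build_index`, `search`, and the example URLs are hypothetical names chosen for illustration.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text from a page, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_words(html):
    """Return the set of lowercase words found in the page text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return set(re.findall(r"[a-z0-9]+", " ".join(parser.parts).lower()))


def build_index(pages):
    """pages maps URL -> raw HTML (assumed to come from the crawler)."""
    index = defaultdict(set)
    for url, html in pages.items():
        for word in html_to_words(html):
            index[word].add(url)
    return index


def search(index, query):
    """Return the sorted URLs that contain every word of the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return []
    hits = set(index.get(words[0], set()))
    for word in words[1:]:
        hits &= index.get(word, set())
    return sorted(hits)


# Hypothetical crawl results used only to demonstrate the index.
pages = {
    "http://example.com/a": "<html><body>Web crawler examples</body></html>",
    "http://example.com/b": "<p>Crawler cache</p>",
}
index = build_index(pages)
print(search(index, "web crawler"))
```

The index only records which pages contain a word, so a multi-word query returns pages containing all of the words, not exact phrases.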

Edited by Creator

Creator

Before I check it out and download it, have you included options to ignore robots.txt?

The ActiveX component obeys robots.txt natively. If you want to ignore robots.txt, you can't use this component.

On a personal note: if a webmaster doesn't want you to crawl certain parts of a website, it's only polite to comply (for security, privacy, performance, etc.).

:)

-edit- Here is a little more information on how "bad" robots (which ignore robots.txt) get banned: http://www.fleiner.com/bots/
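
For anyone curious what robots.txt compliance amounts to, here is a small sketch using Python's standard `urllib.robotparser`. It does not show how the Chilkat component works internally; it only illustrates the general check a polite crawler makes before fetching a URL. The robots.txt content, the `ExampleSpider` user agent, and the `make_checker` helper are made up for this example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed from a string so no network is needed.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /search
"""


def make_checker(robots_txt, user_agent="ExampleSpider"):
    """Return a function that tells whether user_agent may fetch a URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    def allowed(url):
        return parser.can_fetch(user_agent, url)

    return allowed


allowed = make_checker(ROBOTS_TXT)
for url in (
    "http://example.com/index.html",
    "http://example.com/private/data.html",
    "http://example.com/search?q=autoit",
):
    print(url, "->", "crawl" if allowed(url) else "skip")
```

A crawler that runs this check before every request simply never visits the disallowed paths, which is what the webmaster asked for.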

Edited by Creator

gseller

Nice!! This will come in handy. Thank You

jvanegmond

The ActiveX component obeys robots.txt natively. If you want to ignore robots.txt, you can't use this component.

On a personal note: if a webmaster doesn't want you to crawl certain parts of a website, it's only polite to comply (for security, privacy, performance, etc.).

:)

-edit- Here is a little more information on how "bad" robots (which ignore robots.txt) get banned: http://www.fleiner.com/bots/

Makes sense. I might have wanted to try it for personal use, just to gather some data from websites that I normally would not have found, but it is good to stay compliant with robots.txt.

ptrex
Creator

Updated first post with a new zip file. It contains the following examples:

  • Get Base Domains
  • GetBaseDomain
  • CanonicalizeUrl
  • Avoiding Outbound Links Matching Patterns
  • Must-Match Patterns

That's it! Now you should be well on your way to building a nice little crawler :)
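
As a rough illustration of the avoid-pattern and must-match-pattern ideas from the examples listed above, here is a sketch in plain Python. It assumes `*`-style wildcard patterns and uses the standard `fnmatch` module as a stand-in; the component's own matching rules may differ, and `should_crawl` is a hypothetical helper, not a Chilkat function.

```python
from fnmatch import fnmatchcase


def should_crawl(url, avoid_patterns=(), must_match_patterns=()):
    """Decide whether a URL passes the wildcard filters.

    A URL is rejected if it matches any avoid pattern. If must-match
    patterns are given, the URL must also match at least one of them.
    Note: fnmatch also treats '?' and '[...]' as wildcards, so patterns
    here should stick to '*'.
    """
    if any(fnmatchcase(url, pattern) for pattern in avoid_patterns):
        return False
    if must_match_patterns:
        return any(fnmatchcase(url, pattern) for pattern in must_match_patterns)
    return True


# Hypothetical URLs and patterns used only for demonstration.
urls = [
    "http://example.com/forum/topic1",
    "http://example.com/forum/login.php",
    "http://example.com/blog/post1",
]
for url in urls:
    keep = should_crawl(
        url,
        avoid_patterns=["*login*"],
        must_match_patterns=["*/forum/*"],
    )
    print(url, "->", "crawl" if keep else "skip")
```

Checking the avoid list first means an avoid pattern always wins, even when the URL also matches a must-match pattern.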

Edited by Creator

6monkeyrs

This is a nice tool. Are you planning to add the ability to read content within the page, much like a meta tag description? Just wondering.

coffeeturtle

Hello! I know this is an old thread, but is there a way to flag links that exist, but are dead links?

Or is the report in output.txt only of good links? If so, is there a way I can filter or search for dead links specifically?

Thanks!
