Creator

Your own web crawler/spider

14 posts in this topic

#1 ·  Posted (edited)

Here are a few examples for creating your own web crawler/spider using the free ActiveX component from Chilkat.

I found it very fast and easy to use.

Free download of the ActiveX: http://www.chilkatsoft.com/download/SpiderActiveX.msi

Reference for the ActiveX (if you don't want to wait for me to post more examples :) ): http://www.chilkatsoft.com/refdoc/xSpiderRef.html

Examples include:

  • Getting Started Spidering a Site.au3
  • Extract HTML Title, Description, Keywords.au3
  • Fetch robots.txt for a Site.au3
  • Avoid URLs Matching Any of a Set of Patterns.au3
  • Setting a Maximum Response Size.au3
  • Setting a Maximum URL Length.au3
  • Using the Disk Cache.au3
  • Crawling the Web.au3
  • Get Referenced Domains.au3
  • A Simple Web Crawler.au3

Did I mention it's fully robots.txt compliant?!

Have fun!

More examples to come.

Examples added as a new zip file:

  • Get Base Domains
  • GetBaseDomain
  • CanonicalizeUrl
  • Avoiding Outbound Links Matching Patterns
  • Must-Match Patterns

These examples are ports of the VBScript examples on the Chilkat site.

Updated the zip with A simple webcrawler.au3 (it crawls a Google directory... how ironic ^_^ )

Spider_Examples.zip

Spider_Examples_2.zip

Edited by Creator


Cool.


#4 ·  Posted (edited)

Added A simple webcrawler.au3, which crawls a Google directory and is pretty much complete.

If you want to build a full HTML index, the complete HTML of each crawled URL is available in the LastHtml property.

Just imagine doing an offline search of the AutoIt forums with all keywords allowed :)
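
For anyone curious what such an offline keyword index could look like, here is a rough sketch in Python using only the standard library. It is not part of the zip files and does not use the Chilkat component. It assumes you have already saved each page's HTML (for example, the text read from LastHtml) together with its URL. The names `TextExtractor`, `index_page` and `search` are made up for this illustration.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text content, skipping the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def index_page(index, url, html):
    """Add every word found in the page text to the word -> URLs index."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.parts).lower()
    for word in re.findall(r"[a-z0-9]+", text):
        index[word].add(url)


def search(index, *words):
    """Return the URLs whose text contains all of the given words."""
    sets = [index.get(word.lower(), set()) for word in words]
    return set.intersection(*sets) if sets else set()


index = defaultdict(set)
index_page(index, "http://example.com/a",
           "<html><title>AutoIt spider</title><p>crawl the forum</p></html>")
print(search(index, "autoit", "forum"))
```

The index is a plain word-to-URLs mapping, so a multi-word search is just a set intersection. A real index would also want stemming, stop words and some form of ranking.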

Edited by Creator


#6 ·  Posted (edited)

Before I check it out and download it, have you included options to ignore robots.txt?

The ActiveX complies with robots.txt natively. If you want to ignore robots.txt, you can't use this component.

On a personal note: if a webmaster doesn't want you to crawl certain parts of a website, it's only polite to comply (for security, privacy, performance, etc.).

:)

-edit- Here is a little more information on how "bad" robots (which ignore robots.txt) get banned: http://www.fleiner.com/bots/
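
To make concrete what "obeying robots.txt" means, here is a minimal sketch in Python using the standard library's `urllib.robotparser`. It is not the Chilkat component's code and says nothing about how the ActiveX implements its checks. The helper name `make_checker` and the example rules are invented for this illustration.

```python
from urllib.robotparser import RobotFileParser


def make_checker(robots_txt, user_agent="*"):
    """Build an allowed(url) function from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    def allowed(url):
        return parser.can_fetch(user_agent, url)

    return allowed


# Hypothetical rules, as a site might publish them at /robots.txt.
rules = """\
User-agent: *
Disallow: /private/
"""

allowed = make_checker(rules)
for url in ("http://example.com/index.html",
            "http://example.com/private/notes.html"):
    print(url, "->", "crawl" if allowed(url) else "skip")
```

A polite crawler runs a check like this before every fetch and simply skips the URLs that are disallowed.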

Edited by Creator


Nice!! This will come in handy. Thank you.


The ActiveX complies with robots.txt natively. If you want to ignore robots.txt, you can't use this component.

On a personal note: if a webmaster doesn't want you to crawl certain parts of a website, it's only polite to comply (for security, privacy, performance, etc.).

:)

-edit- Here is a little more information on how "bad" robots (which ignore robots.txt) get banned: http://www.fleiner.com/bots/

Makes sense. I might have wanted to try it for personal use, just to gather some data from websites that I would not normally have found, but it is good to stay compliant with robots.txt.


#10 ·  Posted (edited)

Updated first post with a new zip file. It contains the following examples:

  • Get Base Domains
  • GetBaseDomain
  • CanonicalizeUrl
  • Avoiding Outbound Links Matching Patterns
  • Must-Match Patterns

That's it!! Now you should be well on your way to building a nice little crawling thingy :)
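
As a rough picture of the ideas behind the canonicalization and pattern examples, here is a small Python sketch using only the standard library. It is an assumption-laden illustration, not a description of what the component's CanonicalizeUrl or its pattern methods actually do. The function names `canonicalize` and `should_crawl` and the specific normalization rules are chosen just for this example.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def canonicalize(url):
    """Lowercase scheme and host, drop default ports and fragments.

    Simplified on purpose: user:password@ parts are dropped and IPv6
    hosts are not handled.
    """
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = parts.hostname or ""  # hostname is already lowercased
    port = parts.port
    if port is None or port == DEFAULT_PORTS.get(scheme):
        netloc = host
    else:
        netloc = f"{host}:{port}"
    return urlunsplit((scheme, netloc, parts.path or "/", parts.query, ""))


def should_crawl(url, avoid=(), must_match=()):
    """Apply '*' wildcard patterns: reject on any avoid pattern, and if
    must-match patterns are given, require at least one of them to match.

    Note: fnmatch also treats '?' and '[' as special characters.
    """
    if any(fnmatchcase(url, pattern) for pattern in avoid):
        return False
    return not must_match or any(fnmatchcase(url, p) for p in must_match)


print(canonicalize("HTTP://Example.COM:80/docs/page.html#top"))
print(should_crawl("http://example.com/login.php", avoid=["*login*"]))
print(should_crawl("http://example.com/docs/a.html", must_match=["*/docs/*"]))
```

Canonicalizing before adding a URL to the queue keeps the crawler from visiting the same page twice under slightly different spellings. The pattern check then decides whether the URL is queued at all.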

Edited by Creator


This is a nice tool. Are you planning to add the ability to read content within the page, much like the meta tag description? Just wondering.


Hello! I know this is an old thread, but is there a way to flag links that exist but are dead?

Or does the report in output.txt contain only good links? If so, is there a way I can filter or search for dead links specifically?

Thanks!
