babyjoe

How to find and extract links on a website (keystroke / routine)

6 posts in this topic

#1 ·  Posted (edited)

Hello,

I would be happy with some advice on my project:

Purpose: I have to make a crawler that scans a single web page and visits every link on that page.

Then I just need to copy and paste the content of each visited link into a DB file.

After that, it goes back to the central page and visits the next link.

I have some mediocre programming skills in Java, but have just discovered AutoIt.

It seems to me that the simplest thing to do is to pass keystrokes to the browser, especially for the copy/paste.

However, this is my problem: how can I direct the browser to the next link? 'Tab' doesn't work as an advancing keystroke, so I do not know how to extract the links.

Should I try to find out which pixels are blue (only the hyperlinks are blue) and try to click those pixels?

Is there anyone who has had a similar project and has some advice?

Thanks!

Edited by babyjoe




I forgot to mention something significant: it is an HTTPS site, so you cannot save it, which is why I think keystrokes are the only option.

Is there any other browser helper software around that can automatically select the next link?


#3 ·  Posted (edited)

So what you want to do is:

- Browse through X pages (or links)

Question: Are the links preset?

Or is it more like: browse through www.example.com/ plus any related pages (not preset)

i.e.

www.example.com, www.example.com/page1, www.example.com/page2, etc.

Edited by _Kurt

Awaiting Diablo III..


#4 ·  Posted (edited)

Browse through all links on a page (they do not link to other physical pages, they are DB generated)

I have come up with the following, but I have one caveat.

I used Firefox because you can use the cursor to select text. However, when you push the back button, you return to the page but the link you clicked is not selected (in IE, when you go back, the last link you clicked is the active item).

If I could make FF select the last visited link, I could keep using the down key to go to the next link.

For now I have to keep a counter that tracks how many times I went down, and add 1 each time.

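HotKeySet("{F9}", "Zoek") ; pressing F9 runs Zoek()

; Idle until the hotkey is pressed
While 1
    Sleep(100)
WEnd

Func Zoek()
    Opt("WinTitleMatchMode", 2) ; match any substring of the window title
    Send("{DOWN}") ; move down to the next link
    Sleep(50)
    Send("{ENTER}") ; open the link
    Sleep(50)
    Send("^F") ; note: the uppercase F may be sent with Shift (Ctrl+Shift+F); "^f" is plain Ctrl+F
    Sleep(50)
    Send("^a") ; select all
    Sleep(50)
    Send("{SHIFTDOWN}")
    Sleep(50)
    Send("{END}")
    Sleep(50)
    Send("{SHIFTUP}")
    Sleep(50)
    Send("^c") ; copy the selection
    Sleep(50)
    WinActivate("WordPad")
    Send("^v") ; paste into WordPad
    Sleep(50)
    Send("{ENTER}")
    Sleep(50)
    WinActivate("Firefox") ; return to the browser
EndFunc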
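
The counter mentioned above could be kept in a global variable, roughly as sketched here. This is only an illustration: the names $g_iLinkIndex and NextLink are made up, and it assumes that after going back the focus starts at the top of the page and that each DOWN keystroke moves exactly one link, as the description above implies.

```autoit
; Hypothetical sketch: remember how many links have been visited so far
Global $g_iLinkIndex = 0

HotKeySet("{F9}", "NextLink")

; Idle until the hotkey is pressed
While 1
    Sleep(100)
WEnd

Func NextLink()
    ; One more step down than on the previous run
    $g_iLinkIndex += 1
    ; "{DOWN n}" repeats the DOWN key n times
    Send("{DOWN " & $g_iLinkIndex & "}")
    Sleep(50)
    Send("{ENTER}")
EndFunc
```

The copy/paste keystrokes from the function above would follow the {ENTER}, and the script would still need to send the browser back to the central page before the next F9 press.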

Edited by babyjoe


See if this is what you are looking for.

#include <IE.au3>

; Open the central page and collect every link on it
$sURL = "http://www.google.com"
$oIE = _IECreate($sURL)
$oLinks = _IELinkGetCollection($oIE)
For $oLink In $oLinks
    $sHREF = $oLink.href
    ; Load the link in a second IE instance (no attach, not visible)
    $oIE2 = _IECreate($sHREF, 0, 0)
    ; Read the visible text of the page body
    $sText = _IEBodyReadText($oIE2)
    ConsoleWrite("<<<<<<<<<<>>>>>>>>>>" & @CR)
    ConsoleWrite($sText & @CR)
    ConsoleWrite(">>>>>>>>>><<<<<<<<<<" & @CR & @CR)
    _IEQuit($oIE2)
Next
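
Since the goal is to store the text rather than print it, a variation along these lines might be closer to what is needed. It is a minimal sketch, not a tested solution: the URL and the output file name pages.txt are placeholders, a plain text file stands in for the "DB file", and links that do not start with "http" (mailto:, javascript: and so on) are simply skipped. Because the text is read through the browser's own document, an HTTPS page should not need special handling.

```autoit
#include <IE.au3>
#include <FileConstants.au3>

Local $sURL = "https://www.example.com/"       ; placeholder for the central page
Local $sOutFile = @ScriptDir & "\pages.txt"    ; hypothetical output file

Local $oIE = _IECreate($sURL)
Local $oLinks = _IELinkGetCollection($oIE)

; Copy the hrefs first, because navigating away invalidates the link collection
Local $sList = ""
For $oLink In $oLinks
    $sList &= $oLink.href & @LF
Next
Local $aHrefs = StringSplit(StringTrimRight($sList, 1), @LF)

Local $hFile = FileOpen($sOutFile, $FO_APPEND)
If $hFile = -1 Then Exit 1

For $i = 1 To $aHrefs[0]
    ; Skip empty entries and non-web links
    If StringLeft($aHrefs[$i], 4) <> "http" Then ContinueLoop
    ; Reuse the same browser window for every link
    _IENavigate($oIE, $aHrefs[$i])
    FileWriteLine($hFile, "=== " & $aHrefs[$i] & " ===")
    FileWriteLine($hFile, _IEBodyReadText($oIE))
Next

FileClose($hFile)
_IEQuit($oIE)
```

Reusing one browser window avoids opening and closing an IE instance per link, at the cost of leaving the central page; that is why the hrefs are collected up front.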


Would something like WebReaper or WebStripper do the job?
