IE - Extract Links from Source Page


Dieuz

Hey guys,

I am having a hard time extracting the links and their anchor text from a page's source.

#include <Array.au3>
#include <IE.au3> 

$Primary_url = "http://www.britannica.com/blogs/2008/04/are-newspapers-doomed-do-we-care-newspapers-the-net-forum/" ; Any URL

$IE = _IECreate($Primary_url,0,1,1)
$pagesource = _IEBodyReadHTML($IE)

$array = StringRegExp($pagesource,'(?:<A href=")(http.*?)(?:">)(.*?)(?:</A>)',3)

_IEQuit($IE)
_ArrayDisplay($array, "Test")

I am trying to extract the URL (http://...) and the related anchor text. The problem is that sometimes there is no anchor text at all, or there are other tags inside the link such as <B>, <COLOR>, etc., and these break my regular expression.

I am not very good at writing regular expressions, so I would appreciate a little help here.

Thanks!

OK, simple: use _IELinkGetCollection().

#include <IE.au3>

_IELinkGetCollection ( ByRef $o_object [, $i_index = -1] )

Parameters

$o_object: Object variable of an InternetExplorer.Application, Window or Frame object

$i_index: Optional: specifies whether to return a collection or indexed instance
    0 or positive integer: returns an indexed instance
    -1 (default): returns a collection
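
As a rough illustration of the two modes of $i_index described above, the sketch below fetches the whole collection first and then a single link by index. The URL is a placeholder and the variable names are made up for this example; the link count is read from @extended, which _IELinkGetCollection sets when it returns a collection.

```autoit
#include <IE.au3>

; Placeholder URL, used only for illustration
Local $oIE = _IECreate("http://www.example.com/", 0, 1, 1)

; Default index (-1): returns the collection of all links on the page
Local $oLinks = _IELinkGetCollection($oIE)
Local $iCount = @extended ; read right after the call: number of links found
ConsoleWrite("Links found: " & $iCount & @CRLF)

If $iCount > 0 Then
    ; Index 0: returns a single link object instead of the collection
    Local $oFirst = _IELinkGetCollection($oIE, 0)
    ConsoleWrite("First link: " & $oFirst.href & @CRLF)
EndIf

_IEQuit($oIE)
```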

Edited by logmein
_IELinkGetCollection() is great for extracting all the links, but I can't extract the anchor text with it. That's why I would like to use a regular expression...

Is there any way to gather the anchor text with _IELinkGetCollection()?
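
One possible approach, sketched here under the assumption that each item in the collection is an ordinary IE anchor element: such elements expose the standard DOM properties href and innerText, and innerText already has nested tags such as <B> stripped out. The array name $aLinks and the loop variables are invented for this example, and this is not an answer given in the thread itself.

```autoit
#include <Array.au3>
#include <IE.au3>

Local $sUrl = "http://www.britannica.com/blogs/2008/04/are-newspapers-doomed-do-we-care-newspapers-the-net-forum/"

Local $oIE = _IECreate($sUrl, 0, 1, 1)
Local $oLinks = _IELinkGetCollection($oIE)
Local $iCount = @extended ; number of links in the collection

If $iCount = 0 Then
    _IEQuit($oIE)
    Exit ; nothing to collect
EndIf

; Column 0 holds the URL, column 1 the anchor text (may be empty)
Local $aLinks[$iCount][2]
Local $i = 0
For $oLink In $oLinks
    $aLinks[$i][0] = $oLink.href
    ; innerText is the visible text only; trim leading/trailing whitespace
    $aLinks[$i][1] = StringStripWS($oLink.innerText, 3)
    $i += 1
Next

_IEQuit($oIE)
_ArrayDisplay($aLinks, "Links and anchor text")
```

Links without any anchor text (image links, for instance) would simply end up with an empty string in the second column, so no special case should be needed in a regular expression.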
