Regex version of _IELinkGetCollection needed


storme

G'day All

I'm working on a script that extracts links using _IELinkGetCollection then extracts information from the pages pointed at by the links.

I have a nice little piece of code (from here THANKS) :P to get the page source for the second part.

...Well, I wrote that an hour ago and thought "This can't be that hard".... AHHHHH!!!! I HATE REGEX!!!! Yeah, I know it's useful and great, but every time I try to read a reasonably sized one my brain explodes.... :x

Has anyone got a piece of code that will return a list of links on a web page?

The reason I need it (or think I might) is that this code may be used on an out-of-date computer (SP1 or NO service pack), so it may not have an IE that is compatible with IE_UDF, or someone may have removed IE....

Thanks in advance for any help you can offer!!!

It's kind of hard to develop something that works without seeing the page source. Gathering links isn't that hard though; you generally just need to collect all of the anchor tags.

Sorry I should have included that. :P

Yeah, anchors are what I'm after, and that is the most annoying thing. I know IT IS EASY.... But I can't get my head around regex.... I seem to have a block where it is concerned.

Anyway, here is the test code I am using:

#include <Array.au3>
#include <INet.au3>

Local $sURL = "http://driverpacks.net/downloads"
Local $sSource = _INetGetSource($sURL, True)
ConsoleWrite('@@ Debug(' & @ScriptLineNumber & ') : $sSource = ' & $sSource & @CRLF & '>Error code: ' & @error & @CRLF) ;### Debug Console
Local $aslinks = StringRegExp($sSource, 'href=(?:"(?<1>[^"]*)"|(?<1>))', 1) ; href\s*=\s*(?:"(?<1>[^"]*)"|(?<1>\S+))
ConsoleWrite('@@ Debug(' & @ScriptLineNumber & ') : ubound($aslinks) = ' & UBound($aslinks) & @CRLF & '>Error code: ' & @error & @CRLF) ;### Debug Console
_ArrayDisplay($aslinks)

The links I'm after are in this form

<a href="/driverpacks/windows/xp/x86/chipset/10.11">DriverPack Chipset 10.11 for Windows 2000/XP/2003 (x86)</a>

But I have another page I need to collect links from using _IELinkGetCollection where the link looks like

<a href="/driverpacks/windows/xp/x86/chipset/10.11/download/torrent">Download ↓</a>

What I'm really after is a replacement for _IELinkGetCollection (don't need the collection, an array is fine).

I've got code that is working perfectly using _IELinkGetCollection but I don't want to rely on IE being on the computer.

Thinking about it, IF it could include both the text and the link (this is only an extra, not necessary), that could come in very useful. :x

Thanks for any help.

You don't need that inet.au3 file at all. It was fine before we had InetRead(), but it's useless now.

Replace the _INetGetSource() line with

Global $Source = BinaryToString(InetRead($sURL))

George

Question about decompiling code? Read the decompiling FAQ and don't bother posting the question in the forums.

Be sure to read and follow the forum rules. -AKA the AutoIt Reading and Comprehension Skills test.***

The PCRE (Regular Expression) ToolKit for AutoIT - (Updated Oct 20, 2011 ver:3.0.1.13) - Please update your current version before filing any bug reports. The installer now includes both 32 and 64 bit versions. No change in version number.

Visit my Blog .. currently not active but it will soon be resplendent with news and views. Also please remove any links you may have to my website. it is soon to be closed and replaced with something else.

"Old age and treachery will always overcome youth and skill!"

This code will find 150 links. They are generally in the form of "link">text.

<SNIP>

A BIG thanks! I had something like that at one point, but I was also messing with the flags, and the right regex and the right flag never came together at the same time. :P

Now that I have the right code I came up with the following to get the links I wanted

StringRegExp($sSource, '(?s)(?i)<a href="/driverpacks/windows/(.*?)">', 3)

and it works perfectly! :x
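For anyone following along, here is a minimal sketch of what that call does, run against a made-up sample string instead of the real page (the sample markup and the `/about` link are assumptions for illustration only). Flag 3 tells StringRegExp to return an array of every match, and because only the part in brackets is captured, the fixed prefix has to be put back if the full path is wanted.

```autoit
#include <Array.au3>

; Made-up sample standing in for the downloaded page source
Local $sSource = '<a href="/driverpacks/windows/xp/x86/chipset/10.11">DriverPack Chipset</a>' & @CRLF & _
        '<a href="/driverpacks/windows/xp/x86/lan/10.11">DriverPack LAN</a>' & @CRLF & _
        '<a href="/about">About</a>'

; Flag 3 = return an array holding every match of the capture group
Local $aLinks = StringRegExp($sSource, '(?s)(?i)<a href="/driverpacks/windows/(.*?)">', 3)
If @error Then
    ConsoleWrite('No links matched' & @CRLF)
Else
    ; Each element holds only the captured tail, so put the fixed prefix back
    For $i = 0 To UBound($aLinks) - 1
        $aLinks[$i] = '/driverpacks/windows/' & $aLinks[$i]
    Next
    _ArrayDisplay($aLinks)
EndIf
```

The `/about` link is skipped because it does not start with the `/driverpacks/windows/` prefix written into the pattern.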

I love what regex can do but I've just got this block when I see the dots and brackets and run screaming from the room if I try too long to get it working.

THANKS again

Thanks George

I should have looked at the source of the UDF....it's all in there. :x

I'll add a bit of error checking around it but what you've suggested is great. :P
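A rough sketch of what that error checking might look like, assuming all that is needed is to stop when the download fails (the message text and the exit code are placeholders, not anything from the thread). InetRead() returns binary data and sets @error when the download fails, so @error is checked before the data is converted.

```autoit
Local $sURL = "http://driverpacks.net/downloads"

; InetRead() returns the page as binary data and sets @error on failure
Local $dData = InetRead($sURL)
Local $iError = @error ; save it before any other function call resets it
If $iError Then
    ConsoleWrite('Download of ' & $sURL & ' failed, @error = ' & $iError & @CRLF)
    Exit 1
EndIf

; Convert the binary data to a string for StringRegExp
Local $sSource = BinaryToString($dData)
ConsoleWrite('Fetched ' & StringLen($sSource) & ' characters' & @CRLF)
```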

BTW I did try and use PCRE but I had the syntax messed up and couldn't get it right. :nuke: sigh

Now that I have the syntax right :shifty:

I just wrote a function (using SRE) a few days ago that returns a two-dimensional array where element 0 holds the URL and element 1 the text. I was going to post it for you, but then I couldn't find it. I'll come across it one of these days and probably just add it to the Sample library in the toolkit.
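That function was never posted in this thread, so the following is only a sketch of the idea and not the toolkit's code. The name `_LinksToArray`, the pattern and the sample markup are all hypothetical. It relies on flag 3 returning a flat array (url, text, url, text, ...) when the pattern has two capture groups, and then folds that into two columns.

```autoit
#include <Array.au3>

; Hypothetical helper: returns a 2-D array, [n][0] = URL, [n][1] = link text.
; Sets @error to 1 and returns 0 if no anchors are found.
Func _LinksToArray($sSource)
    ; Two capture groups + flag 3 gives a flat array: url, text, url, text, ...
    Local $aFlat = StringRegExp($sSource, '(?is)<a\s[^>]*?href="([^"]*)"[^>]*>(.*?)</a>', 3)
    If @error Then Return SetError(1, 0, 0)

    Local $iCount = Int(UBound($aFlat) / 2)
    Local $aLinks[$iCount][2]
    For $i = 0 To $iCount - 1
        $aLinks[$i][0] = $aFlat[2 * $i]     ; URL
        $aLinks[$i][1] = $aFlat[2 * $i + 1] ; link text
    Next
    Return $aLinks
EndFunc   ;==>_LinksToArray

; Made-up sample based on the two link forms quoted earlier in the thread
Local $sSample = '<a href="/driverpacks/windows/xp/x86/chipset/10.11">DriverPack Chipset 10.11</a>' & _
        '<a href="/driverpacks/windows/xp/x86/chipset/10.11/download/torrent">Download</a>'
Local $aResult = _LinksToArray($sSample)
If Not @error Then _ArrayDisplay($aResult)
```

This version only handles double-quoted href values and leaves any nested tags inside the link text as they are.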

Another little trick is to use \x22 in place of the double quotes. In fact I allow for people getting sloppy when they write the HTML and use

[\x22\x27\s]?
Sloppy HTML coders will use any one (or none) of those, although it should just be the \x22. They leave it up to the browser to sort it all out, and some browsers are very forgiving in that respect, much like IE being forgiving when you use a backslash instead of a forward slash in a URL.
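A small sketch of that trick in use, against a made-up sample with one double-quoted, one single-quoted and one unquoted href. The pattern is a variation rather than the exact toolkit expression: the whitespace is handled by `\s*` around the equals sign, so the optional class only holds the two quote characters. Since \x22 and \x27 stand in for the quotes, nothing needs escaping inside the AutoIt string.

```autoit
#include <Array.au3>

; Made-up sample covering double-quoted, single-quoted and unquoted href values
Local $sSource = '<a href="/one">One</a>' & @CRLF & _
        "<a href='/two'>Two</a>" & @CRLF & _
        '<a href=/three>Three</a>'

; \x22 = double quote, \x27 = single quote; the opening quote is optional.
; The capture stops at the first quote, whitespace or > it meets.
Local $sPattern = '(?i)<a\s[^>]*?href\s*=\s*[\x22\x27]?([^\x22\x27\s>]+)'
Local $aLinks = StringRegExp($sSource, $sPattern, 3)
If Not @error Then _ArrayDisplay($aLinks)
```

One trade-off: a quoted URL that contains a space would be cut off at the space, which is the price of also accepting unquoted values.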

George

I was thinking of adding a few \s* through the code to eliminate blanks, but I never considered \x27. THANKS!

BTW the regex removed quite a bit of code that took me ages to write...Grrrr... wish I had asked sooner. :x

The 2 dimensional array sounds great. For one program I wrote a while ago it would have been perfect, as I needed to look at the link and the text to work out if the link was one I needed. Anyway, I did find a solution, but it was more spaghetti than code. :P

Thanks for taking the time to help!

Hope you had a great New Years!

John Morrison

No problem John and the same to you.

George
