storme Posted December 29, 2010

G'day All

I'm working on a script that extracts links using _IELinkGetCollection and then extracts information from the pages those links point to. I have a nice little piece of code (from here, THANKS) to get the page source for the second part.

...Well, I wrote that an hour ago and thought "This can't be that hard"... AHHHHH!!!! I HATE REGEX!!!! Yeah, I know it's useful and great, but every time I try to read a reasonably sized one my brain explodes...

Has anyone got a piece of code that will return a list of links on a web page?

The reason I need it (or think I might) is that this code may be used on an out-of-date computer (SP1 or no service pack), so it may not have an IE that is compatible with IE_UDF, or someone may have removed IE...

Thanks in advance for any help you can offer!

John Morrison aka Storm-E
Some of my small contributions to AutoIt: Browse for Folder Dialog - Automation SysTreeView32 | FileHippo Download and/or retrieve program information | Get installedpath from uninstall key in registry | RoboCopy function
storme Posted December 30, 2010

> It's kind of hard to develop something that works without seeing the page source. Gathering links isn't that hard though; you generally just need to collect all of the anchor tags.

Sorry, I should have included that. Yeah, anchors are what I'm after, and that is the most annoying thing. I know IT IS EASY... but I can't get my head around regex. I seem to have a block where it is concerned.

Anyway, here is the test code I am using:

    #include <INet.au3>
    #include <Array.au3>

    Local $sURL = "http://driverpacks.net/downloads"
    Local $sSource = _INetGetSource($sURL, True)
    ConsoleWrite('@@ Debug(' & @ScriptLineNumber & ') : $sSource = ' & $sSource & @CRLF & '>Error code: ' & @error & @CRLF) ;### Debug Console
    Local $aslinks = StringRegExp($sSource, 'href=(?:"(?<1>[^"]*)"|(?<1>))', 1) ; href\s*=\s*(?:"(?<1>[^"]*)"|(?<1>\S+))
    ConsoleWrite('@@ Debug(' & @ScriptLineNumber & ') : ubound($aslinks) = ' & UBound($aslinks) & @CRLF & '>Error code: ' & @error & @CRLF) ;### Debug Console
    _ArrayDisplay($aslinks)

The links I'm after are in this form:

    <a href="/driverpacks/windows/xp/x86/chipset/10.11">DriverPack Chipset 10.11 for Windows 2000/XP/2003 (x86)</a>

But I have another page I need to collect links from using _IELinkGetCollection, where the link looks like:

    <a href="/driverpacks/windows/xp/x86/chipset/10.11/download/torrent">Download ↓</a>

What I'm really after is a replacement for _IELinkGetCollection (I don't need the collection, an array is fine). I've got code that works perfectly using _IELinkGetCollection, but I don't want to rely on IE being on the computer.

Thinking about it, IF it could include both the link text and the link (this is only an extra, not a necessity), that could come in very useful.

Thanks for any help.
GEOSoft Posted December 30, 2010

You don't need that inet.au3 file at all. It was fine before we had InetRead(), but it's useless now. Replace the _INetGetSource() line with:

    Global $Source = BinaryToString(InetRead($sURL))

George
Question about decompiling code? Read the decompiling FAQ and don't bother posting the question in the forums.
Be sure to read and follow the forum rules. -AKA the AutoIt Reading and Comprehension Skills test.***
The PCRE (Regular Expression) ToolKit for AutoIT - (Updated Oct 20, 2011 ver:3.0.1.13) - Please update your current version before filing any bug reports. The installer now includes both 32 and 64 bit versions. No change in version number.
Visit my Blog .. currently not active but it will soon be resplendent with news and views. Also please remove any links you may have to my website. It is soon to be closed and replaced with something else.
"Old age and treachery will always overcome youth and skill!"
storme Posted December 30, 2010

> This code will find 150 links. They are generally in the form "link">text. <SNIP>

A BIG thanks! I had something like that at one point, but I was also messing with the flags, and the right regex and the right flag never came together at the same time.

Now that I have the right code, I came up with the following to get the links I wanted:

    StringRegExp($sSource, '(?s)(?i)<a href="/driverpacks/windows/(.*?)">', 3)

and it works perfectly!

I love what regex can do, but I've just got this block when I see the dots and brackets, and I run screaming from the room if I spend too long trying to get it working.

THANKS again
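For anyone following along, here is a minimal sketch of how that pattern behaves with flag 3, which returns a flat array containing the captured group from every match. The sample HTML is built from the two anchors quoted earlier in the thread rather than from the live page, and the variable names are only illustrative.

```autoit
; Sample HTML assembled from the anchors quoted earlier in the thread (not the live page)
Local $sSource = '<a href="/driverpacks/windows/xp/x86/chipset/10.11">DriverPack Chipset 10.11</a>' & @CRLF & _
        '<a href="/driverpacks/windows/xp/x86/chipset/10.11/download/torrent">Download</a>'

; Flag 3: global match, returns one array element per captured group per match
Local $aLinks = StringRegExp($sSource, '(?s)(?i)<a href="/driverpacks/windows/(.*?)">', 3)
Local $iError = @error ; save @error before any other call can reset it

If $iError Then
    ; @error = 1 means no matches, @error = 2 means the pattern itself is bad
    ConsoleWrite("No links captured (@error = " & $iError & ")" & @CRLF)
Else
    For $i = 0 To UBound($aLinks) - 1
        ; Only the part after the fixed prefix is captured, so the prefix is added back
        ConsoleWrite("/driverpacks/windows/" & $aLinks[$i] & @CRLF)
    Next
EndIf
```

Note that the capture group holds only the part of the path after /driverpacks/windows/, so the prefix has to be put back if the full relative URL is needed.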
storme Posted December 30, 2010

> You don't need that inet.au3 file at all. It was fine before we had InetRead(), but it's useless now. Replace the _INetGetSource() line with Global $Source = BinaryToString(InetRead($sURL))

Thanks George

I should have looked at the source of the UDF... it's all in there. I'll add a bit of error checking around it, but what you've suggested is great.

BTW, I did try to use PCRE, but I had the syntax messed up and couldn't get it right. Sigh. Now that I have the syntax right
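The error checking mentioned above could look something like the sketch below. It assumes only the documented behaviour of InetRead(), which returns binary data and sets @error to a non-zero value on failure; the messages and the exit code are illustrative choices, not anything from the original script.

```autoit
Local $sURL = "http://driverpacks.net/downloads"

; InetRead returns the page as binary data and sets @error if the download fails
Local $dData = InetRead($sURL)
Local $iError = @error ; save @error before any other call can reset it

If $iError Then
    ConsoleWrite("Download failed for " & $sURL & " (@error = " & $iError & ")" & @CRLF)
    Exit 1
EndIf

; Convert the binary data to a string before running any regex over it
Local $sSource = BinaryToString($dData)
ConsoleWrite("Fetched " & StringLen($sSource) & " characters" & @CRLF)
```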
GEOSoft Posted December 30, 2010

I just wrote a function (using SRE) a few days ago that returns a two-dimensional array where element 0 holds the URL and element 1 the text. I was going to post it for you, and then I couldn't find it. I'll come across it one of these days and probably just add it to the Sample library in the toolkit.

Another little trick is to use \x22 in place of the double quotes. In fact, I allow for people getting sloppy when they write the HTML and use [\x22\x27\s]? instead. Sloppy HTML coders will use any one (or none) of those, although it should just be the \x22. They leave it up to the browser to sort it all out, and some browsers are very forgiving in that respect, much like IE being forgiving when you use a backslash instead of a forward slash in a URL.

George
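Since the original function was never posted, the following is only a rough sketch of the idea and not George's code. The function name _GetLinks and the exact pattern are assumptions made for illustration. It uses \x22 and \x27 for the quote characters, tolerates a missing quote, and repacks the flat array returned by flag 3 into two columns (URL, then link text).

```autoit
; Hypothetical helper, NOT the function described above.
; Returns a 2D array: column 0 = URL, column 1 = link text.
; Sets @error = 1 and returns 0 if no anchors were found.
Func _GetLinks($sHTML)
    ; [\x22\x27]? accepts a double quote, a single quote, or no quote at all
    Local $sPattern = '(?is)<a\s[^>]*?href\s*=\s*[\x22\x27]?([^\x22\x27\s>]+)[\x22\x27]?[^>]*>(.*?)</a>'

    ; Flag 3 returns a flat array: url, text, url, text, ...
    Local $aFlat = StringRegExp($sHTML, $sPattern, 3)
    If @error Then Return SetError(1, 0, 0)

    ; Two captured groups per match, so half as many rows as flat elements
    Local $iCount = Int(UBound($aFlat) / 2)
    Local $aLinks[$iCount][2]
    For $i = 0 To $iCount - 1
        $aLinks[$i][0] = $aFlat[$i * 2]     ; URL
        $aLinks[$i][1] = $aFlat[$i * 2 + 1] ; link text
    Next
    Return $aLinks
EndFunc

; Usage with an anchor quoted earlier in the thread
Local $aResult = _GetLinks('<a href="/driverpacks/windows/xp/x86/chipset/10.11">DriverPack Chipset 10.11</a>')
If Not @error Then
    For $i = 0 To UBound($aResult) - 1
        ConsoleWrite($aResult[$i][0] & " => " & $aResult[$i][1] & @CRLF)
    Next
EndIf
```

The pattern handles whitespace around the equals sign with \s* rather than putting \s inside the quote class, which is a small departure from the [\x22\x27\s]? form mentioned above. Like any regex over HTML, it is a pragmatic shortcut and will miss anchors that span unusual markup.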
storme Posted January 1, 2011

I was thinking of adding a few \s* through the code to eliminate blanks, but I never considered \x27. THANKS!

BTW, the regex removed quite a bit of code that took me ages to write... Grrrr... wish I had asked sooner.

The two-dimensional array sounds great. For one program I wrote a while ago it would have been perfect, as I needed to look at both the link and the text to work out whether the link was one I needed. Anyway, I did find a solution, but it was more spaghetti than code.

Thanks for taking the time to help! Hope you had a great New Year!

John Morrison
GEOSoft Posted January 1, 2011

No problem, John, and the same to you.

George