Medic873 Posted March 27, 2014

Hello, I am pulling information from Yellow Pages and seem to be having an issue. I want to pull any websites that are not internal links or yellowpages.com links. Here is my current code:

#include <IE.au3>
#include <Array.au3>
#include <File.au3>
#include <String.au3>
#include <INet.au3>
#include <Excel.au3>

$YellowPagesUrl = "http://www.yellowpages.com/phoenix-az/pet-store?g=Phoenix%2C+AZ&page=2&q=pet+store" ; This will help us find the next URL.
$i = 1 ; This will keep track of how many profiles we have pulled
$YellowPages = _INetGetSource($YellowPagesUrl) ; Pulls the page source from the address (returns a string, so there is no handle to close)
$YellowPagesWebsite = _StringBetween($YellowPages, '<a href="', '"') ; Lists all Yellow Pages links
_ArrayDisplay($YellowPagesWebsite)
Medic873 Posted March 27, 2014 Author

Hmm, this is the second time this has happened: it didn't include what I put in my message after the code.

I want this to exclude anything that is a /ofiheif.html type of link or anything that is a yellowpages.com/ type of link. How would I do this?

Thanks
jguinch Posted March 27, 2014 (edited)

Does this work for you?

#include <Array.au3>

$YellowPagesUrl = "http://www.yellowpages.com/phoenix-az/pet-store?g=Phoenix%2C+AZ&page=2&q=pet+store" ; This will help us find the next URL.
$YellowPages = BinaryToString(InetRead($YellowPagesUrl)) ; Pulls the data from the address
$YellowPagesWebsite = StringRegExp($YellowPages, '<a href="(http://(?!www\.yellowpages\.com)[^"]+)', 3)
_ArrayDisplay($YellowPagesWebsite)

This matches only links starting with "http://" and excludes yellowpages.com.

Or this:

$YellowPagesWebsite = StringRegExp($YellowPages, '<a href="([^/#](?!.*yellowpages)[^"]+)', 3) ; for links not in "http://" format

Edited March 27, 2014 by jguinch
mikell Posted March 27, 2014 (edited)

If you want a more manageable solution, you can also do it like this:

$YellowPages = StringReplace($YellowPages, 'href="http://www.yellowpages', "")
$YellowPages = StringReplace($YellowPages, 'href="http://ads', "")
; etc
$YellowPagesWebsite = StringRegExp($YellowPages, '<a href="([^/#][^"]+)', 3)

Edited March 27, 2014 by mikell
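Another way to get the same result is to extract every absolute link first and drop the unwanted host afterwards, which keeps the pattern itself simple. The following is only a minimal sketch under a few assumptions: the helper name _FilterExternalLinks, its parameters, and the sample HTML are hypothetical and not from the posts above, and it assumes the page writes its href attributes with double quotes.

```autoit
#include <Array.au3>

; Hypothetical helper (not from the thread): returns a 0-based array of
; absolute links found in $sHtml whose URL does not contain $sExcludeHost.
Func _FilterExternalLinks($sHtml, $sExcludeHost)
    ; Flag 3 returns every captured match as an array. Relative links such as
    ; /ofiheif.html are skipped because the capture must start with http(s)://
    Local $aLinks = StringRegExp($sHtml, '(?i)<a\s[^>]*?href="(https?://[^"]+)"', 3)
    If @error Then Return SetError(1, 0, 0) ; no absolute links at all

    ; Compact the array in place, keeping only links without the excluded host
    Local $iKept = 0
    For $i = 0 To UBound($aLinks) - 1
        ; StringInStr returns 0 when the host does not occur in the link
        If StringInStr($aLinks[$i], $sExcludeHost) = 0 Then
            $aLinks[$iKept] = $aLinks[$i]
            $iKept += 1
        EndIf
    Next
    If $iKept = 0 Then Return SetError(2, 0, 0) ; only excluded links were found

    ReDim $aLinks[$iKept] ; trim the unused tail, kept entries are preserved
    Return $aLinks
EndFunc

; Made-up sample input for illustration only
Local $sSample = '<a href="/ofiheif.html">a</a>' & _
        '<a href="http://www.yellowpages.com/x">b</a>' & _
        '<a href="http://www.example.com/">c</a>'
Local $aExternal = _FilterExternalLinks($sSample, "yellowpages.com")
If IsArray($aExternal) Then _ArrayDisplay($aExternal)
```

Note that the StringInStr check is a plain substring test, so it would also drop an external link that merely mentions yellowpages.com somewhere in its query string. If that matters, anchoring the test to the host part of the URL would be the safer choice.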