Medic873

Removing Certain String

4 posts in this topic

Hello,

I am pulling information from Yellow Pages and seem to be having an issue.

I want to pull any websites that are not internal links or yellowpages.com links.

Here is my current code:

#include <IE.au3>
#include <array.au3>
#Include <File.au3>
#include <string.au3>
#include <INet.au3>
#include <Excel.au3>
 
 
$YellowPagesUrl = "http://www.yellowpages.com/phoenix-az/pet-store?g=Phoenix%2C+AZ&page=2&q=pet+store" ; This will help us find the next URL.
$i = 1 ; Keeps track of how many profiles we have pulled

$YellowPages = _INetGetSource($YellowPagesUrl) ; Pulls the page source from the address as a string
; No InetClose() here: _INetGetSource() returns the source text, not a download handle

$YellowPagesWebsite = _StringBetween($YellowPages, '<a href="', '"') ; Lists all links found on the Yellow Pages page

_ArrayDisplay($YellowPagesWebsite)

Hmm, this is the second time this has happened: it didn't include what I put in my message after the code.

I want this to exclude anything that is a /ofiheif.html type of link (a relative, internal link) or anything that is a yellowpages.com/ type of link.

How would I do this?

Thanks

#3 ·  Posted (edited)

Does this work for you?

#include <array.au3>

$YellowPagesUrl = "http://www.yellowpages.com/phoenix-az/pet-store?g=Phoenix%2C+AZ&page=2&q=pet+store" ; This will help us find the next URL.

$YellowPages = BinaryToString(InetRead($YellowPagesUrl)) ; Pulls the data from the address
$YellowPagesWebsite = StringRegExp($YellowPages, '<a href="(http://(?!www\.yellowpages\.com)[^"]+)', 3) ; Flag 3 returns an array of all matches
_ArrayDisplay($YellowPagesWebsite)

This matches only links starting with "http://" and excludes those pointing to www.yellowpages.com.
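
To see what the negative lookahead does, here is a minimal sketch that runs the same pattern against a small hand-written string instead of the live page. The sample markup and the names $sSample and $aLinks are made up for illustration; only the pattern comes from the code above.

```autoit
#include <Array.au3>

; Hypothetical sample markup standing in for the downloaded page source.
Local $sSample = '<a href="http://www.yellowpages.com/phoenix-az">internal</a>' & @CRLF & _
        '<a href="/some-listing.html">relative</a>' & @CRLF & _
        '<a href="http://www.example.com/">external</a>'

; Same pattern as above. (?!www\.yellowpages\.com) rejects a match whenever
; the text right after "http://" is www.yellowpages.com.
Local $aLinks = StringRegExp($sSample, '<a href="(http://(?!www\.yellowpages\.com)[^"]+)', 3)

If @error Then
    ; With flag 3, StringRegExp sets @error when nothing matches.
    ConsoleWrite("No external links found" & @CRLF)
Else
    _ArrayDisplay($aLinks) ; only the example.com link should be listed
EndIf
```

Checking @error before calling _ArrayDisplay avoids displaying a non-array when the page contains no matching links.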

 

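Or this:

$YellowPagesWebsite = StringRegExp($YellowPages, '<a href="([^/#](?!.*yellowpages)[^"]+)', 3)

This one is for links that are not necessarily in "http://" format: it skips any href starting with "/" or "#" and any match that is followed by "yellowpages".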
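
One thing to keep in mind with the second pattern: the (?!.*yellowpages) lookahead can look past the closing quote of the href, so a link may be dropped just because "yellowpages" appears later on the same line. A possible variation, shown here only as a sketch, confines the lookahead to the attribute value with [^"]*. The sample string and variable names are invented for the example.

```autoit
#include <Array.au3>

; Hypothetical sample markup; all links sit on one line on purpose.
Local $sSample = '<a href="https://www.example.org/shop">shop</a>' & _
        '<a href="/internal.html">internal</a>' & _
        '<a href="#top">anchor</a>' & _
        '<a href="http://www.yellowpages.com/foo">yp</a>'

; [^/#]                 : first character must not be "/" or "#"
; (?![^"]*yellowpages)  : rest of the attribute value must not contain "yellowpages"
Local $aLinks = StringRegExp($sSample, '<a href="([^/#](?![^"]*yellowpages)[^"]+)', 3)

If Not @error Then _ArrayDisplay($aLinks) ; only the example.org link should be listed
```

With the original .* version, the example.org link in this sample would be rejected, because "yellowpages" occurs further along the same line.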

Edited by jguinch

Share this post


Link to post
Share on other sites

#4 ·  Posted (edited)

If you want a more manageable solution, you can also do it like this:

$YellowPages = StringReplace($YellowPages, 'href="http://www.yellowpages', "")
$YellowPages = StringReplace($YellowPages, 'href="http://ads', "")
; etc
$YellowPagesWebsite = StringRegExp($YellowPages, '<a href="([^/#][^"]+)', 3)
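
The "; etc" part can be kept tidy by putting the unwanted prefixes in an array and looping over them. This is only a sketch of that idea: the sample markup, the $aExclude name, and the two prefixes in it are assumptions for illustration, and the list would need to be extended to whatever shows up on the real page.

```autoit
#include <Array.au3>

; Hypothetical sample markup; in the real script $YellowPages holds the page source.
Local $YellowPages = '<a href="http://www.yellowpages.com/x">a</a>' & _
        '<a href="http://ads.example.net/y">b</a>' & _
        '<a href="http://www.example.com/">c</a>'

; Prefixes to strip before matching (assumed list, extend as needed).
Local $aExclude[2] = ['href="http://www.yellowpages', 'href="http://ads']

; Removing the prefix breaks the <a href="..."> shape, so the regex below
; no longer sees those links.
For $i = 0 To UBound($aExclude) - 1
    $YellowPages = StringReplace($YellowPages, $aExclude[$i], "")
Next

Local $YellowPagesWebsite = StringRegExp($YellowPages, '<a href="([^/#][^"]+)', 3)
If Not @error Then _ArrayDisplay($YellowPagesWebsite) ; only the example.com link should remain
```

Adding another excluded site then means adding one array element rather than another StringReplace line.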
Edited by mikell
