phew

Google RegExp

Hello,

I'm trying to write my own Google script that searches for a pattern and returns the results from the first Google results page. It works so far, but I have one problem, and I guess it's my regexp (I'm not good at regexps, they confuse me!).

My $srcv = TCPRecv($sock, 10000) receives the data from Google, i.e. the page source of the first results page. For example, when I search for "testpattern", the source of http://www.google.de/search?hl=de&q=te...Suche&meta= is saved as a string in $srcv.
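One thing I'm not sure about: TCPRecv returns at most the requested number of bytes per call, and the page may arrive in several pieces. Below is a rough sketch of collecting everything before running the regexp. It assumes the server closes the connection once the page is sent (so TCPRecv eventually sets @error), and `_RecvAll` is a made-up helper name, not something from my script or from AutoIt itself.

```autoit
; _RecvAll is a hypothetical helper: it keeps calling TCPRecv until the
; connection is closed or a timeout expires, and returns everything received.
Func _RecvAll($sock, $timeoutMs = 10000)
    Local $all = "", $chunk
    Local $timer = TimerInit()
    While TimerDiff($timer) < $timeoutMs
        $chunk = TCPRecv($sock, 10000)
        If @error Then ExitLoop ; assumption: the server closed the connection
        If $chunk <> "" Then $all &= $chunk ; append whatever arrived this round
        Sleep(10) ; don't hog the CPU while waiting for more data
    WEnd
    Return $all
EndFunc

; Usage, with $sock being the already connected socket:
; $srcv = _RecvAll($sock)
```

If a link happens to be cut in half between two TCPRecv calls, matching each piece separately would miss it, so matching against the joined string seems safer.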

Now I want the script to find ALL URLs in this page source using:

$www = StringRegExp($srcv, '<a href="(.*?)" class=l>', 1)

It works so far: I can, for example, write a urls.txt with the strings matching the regexp. But that is also where I run into a problem.

My page source contains:

<a href="http://testpattern.msnbc.msn.com/" class=l>
<a href="http://testpattern.msnbc.msn.com/archive/2007/08/06/299549.aspx" class=l>
<a href="http://www.msnbc.msn.com/id/4326967/" class=l>
<a href="http://www.testpattern.de/" class=l>
<a href="http://forum.de.selfhtml.org/archiv/2007/5/t152535/" class=l>     <---------- this one is not matched
<a href="http://www.testpattern.org/" class=l>
<a href="http://ivs.cs.uni-magdeburg.de/~dumke/ST1/GBeleg.html" class=l>
<a href="http://forum.de.selfhtml.org/archiv/2007/5/t152535/" class=l>
<a href="http://www.prunejuice.net/testpattern/" class=l>
[...] plus a lot of other markup I don't need and a few more <a href="......" class=l> links

My script then writes all filtered URLs to urls.txt. Here is my result when searching for "testpattern" at google.com:

http://testpattern.msnbc.msn.com/
http://testpattern.msnbc.msn.com/archive/2007/08/06/299549.aspx
http://www.msnbc.msn.com/id/4326967/
http://www.testpattern.de/
http://www.testpattern.org/
http://ivs.cs.uni-magdeburg.de/~dumke/ST1/GBeleg.html
http://www.prunejuice.net/testpattern/

In $srcv (the page source of the Google results page), "forum.de.selfhtml.org/archiv/2007/5/t152535/" appears in this form:

[...] <a href="http://forum.de.selfhtml.org/archiv/2007/5/t152535/" class=l> [...]

Why is this link not matched by my regexp? It never shows up in my urls.txt. I guess there must be something wrong with my regexp, but I have no clue what!
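To narrow it down, here is a minimal sketch that runs the same pattern on a few of the lines quoted above, outside of the network code. Note that it uses flag 3 instead of the 1 from my script, since as far as I understand the help file flag 3 returns the captured group of every match in one array; `$sample` is just a stand-in for $srcv.

```autoit
; Sample built from lines quoted above; in the real script this would be $srcv.
Local $sample = '<a href="http://www.testpattern.de/" class=l>' & @CRLF & _
        '<a href="http://forum.de.selfhtml.org/archiv/2007/5/t152535/" class=l>' & @CRLF & _
        '<a href="http://www.testpattern.org/" class=l>'

; Flag 3: return an array holding the captured group of every match.
Local $urls = StringRegExp($sample, '<a href="(.*?)" class=l>', 3)

If @error Then
    ConsoleWrite("no match" & @CRLF)
Else
    ; Print each captured URL on its own line.
    For $i = 0 To UBound($urls) - 1
        ConsoleWrite($urls[$i] & @CRLF)
    Next
EndIf
```

If the selfhtml link is printed here, the pattern itself is probably fine and the difference would have to be in what actually ends up in $srcv.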

Help please. Greetings
