Dieuz, posted December 13, 2009 (edited December 13, 2009 by Dieuz)

Hey guys,

How can I extract the "<a></a>" tags from a page source string? I have tried:

    $array = StringRegExp($pagesource, "<a(.+</a>)", 1)
    _ArrayDisplay($array, "Test")

but it doesn't work! Can someone correct the RegExp above?

Thanks,

martin, posted December 13, 2009 (edited December 13, 2009 by martin)

Maybe like this:

    #include <Array.au3>
    ; Sample input containing three <a...> elements
    $Pagesource = "<aXX>ufirst set of letters</a>irrelevant material<aRf>secondgroup</a><aRf>bonus characters</a>"
    ; Flag 3 returns every match; .*? is lazy, so each match stops at the nearest </a>
    $array = StringRegExp($Pagesource, "(?:<a.*?>)(.*?)(?:</a>)", 3)
    _ArrayDisplay($array, "Test")

Signature: Serial port communications UDF (includes functions for binary transmission and reception). Printing UDF (useful for graphs, forms, labels, reports etc.). Add User Call Tips to SciTE for functions in UDFs not included with AutoIt and for your own scripts. Functions with parameters in OnEvent mode and for Hot Keys (one function replaces GuiSetOnEvent, GuiCtrlSetOnEvent and HotKeySet). UDF IsConnected2 for notification of status of connected state of many URLs or IPs, without slowing the script.

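Note: the difference between the original attempt and the suggested pattern comes down to greedy versus lazy matching and to the flag passed to StringRegExp. The following is only a minimal sketch on a made-up sample string ($sSource and its contents are assumptions, not taken from a real page), and it uses <a\b[^>]*> rather than <a.*?> so that tags such as <abbr> are not picked up.

```autoit
#include <Array.au3>

; Hypothetical sample input, not taken from a real page.
Local $sSource = "<a>one</a> filler <a>two</a>"

; Greedy: .+ runs on to the LAST </a>, and flag 1 returns only the first match,
; so the single captured group spans both links and the filler between them.
Local $aGreedy = StringRegExp($sSource, "<a(.+</a>)", 1)
_ArrayDisplay($aGreedy, "Greedy, flag 1")

; Lazy: .*? stops at the NEAREST </a>, and flag 3 returns every match,
; so each link text becomes its own array element.
Local $aLazy = StringRegExp($sSource, "<a\b[^>]*>(.*?)</a>", 3)
_ArrayDisplay($aLazy, "Lazy, flag 3")
```

On this sample the second call should list "one" and "two" separately, while the first call returns a single element covering everything up to the last closing tag.
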
Dieuz (author), posted December 13, 2009 (edited December 13, 2009 by Dieuz)

Thanks, I found a workaround...

Can someone tell me why this works:

    $array = StringRegExp($pagesource, "<A[^test]+</A>", 3)

But this doesn't:

    $array = StringRegExp($pagesource, "<A(^test)+</A>", 3)

How can I exclude the word "test"?

martin, posted December 13, 2009

Why? Without knowing what $pagesource is, it's difficult to say. But (^test) is a capturing group in which ^ is the start-of-string anchor followed by the literal text "test". It does not negate anything, and because the start of the string can never occur right after "<A", that pattern cannot match. [^test], on the other hand, is a negated character class: it matches any single character other than t, e or s, regardless of order. So the first pattern only matches links whose contents contain none of those three letters, which is not the same as excluding the word "test".

I think I misunderstood or misread your first post. If you want to remove the tags, then I think you need to use StringRegExpReplace. Is that what you meant?

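Note: on the question of how to exclude the word "test", one common PCRE technique is a negative lookahead, (?!test), checked before each character of the link text. The following is a minimal sketch of that idea and not something proposed in the thread; the sample string and variable names are assumptions chosen for illustration.

```autoit
#include <Array.au3>

; Hypothetical sample input.
Local $sSource = "<A>plain link</A> <A>test page</A> <A>street</A>"

; Negated character class: rejects any link whose text contains t, e or s at all,
; so "street" is dropped even though it does not contain the word "test".
Local $aClass = StringRegExp($sSource, "<A[^test]+</A>", 3)
_ArrayDisplay($aClass, "[^test]")

; Negative lookahead: (?!test|</A>). consumes one character at a time, but only
; where neither "test" nor the closing tag starts, so links containing the
; word "test" are skipped and everything else is kept.
Local $aLook = StringRegExp($sSource, "<A\b[^>]*>((?:(?!test|</A>).)*)</A>", 3)
_ArrayDisplay($aLook, "(?!test)")
```

On this sample the character-class pattern should keep only the "plain link" element, while the lookahead pattern should keep both "plain link" and "street". Both patterns are case-sensitive as written; prefixing (?i) would make them ignore case.
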
Dieuz (author), posted December 13, 2009 (edited December 13, 2009 by Dieuz)

Actually, I am trying to extract the URL and the anchor text to an array.

Example: <a href="www.applepie.com/test.html">apple pie</a>

I would like to extract www.applepie.com/test.html and apple pie.

Here's what I have got so far:

    #include <IE.au3>
    #include <Array.au3>
    $url = "http://www.britannica.com/blogs/2008/04/are-newspapers-doomed-do-we-care-newspapers-the-net-forum/" ; It can be ANY HTML page
    $IE = _IECreate($url, 0, 1, 1)
    $pagesource = _IEBodyReadHTML($IE)
    $array = StringRegExp($pagesource, 'href="(http://.+)".+>' & '(.+)</A>', 3)
    _ArrayDisplay($array, "Test")
    _IEQuit($IE)

It doesn't work really well...

martin, posted December 13, 2009

Then I think that what I showed before was the approach that was needed:

    #include <Array.au3>
    $Pagesource = '<a href="www.applepie.com/test.html">apple pie</a>'
    ; First group captures the URL, second group captures the anchor text
    $array = StringRegExp($Pagesource, '(?:<a href=")(.*?)(?:">)(.*?)(?:</a>)', 3)
    _ArrayDisplay($array, "Test")

Dieuz (author), posted December 13, 2009

Thanks! The thing is that if you try it with a normal HTML web page, it picks up a lot of junk.

martin, posted December 13, 2009

I haven't tried it, but maybe it needs to be made more restrictive:

    #include <Array.au3>
    $Pagesource = '<a href="www.applepie.com/test.html">apple pie</a>'
    ; www\. matches "www" plus a literal dot; .*? then lazily takes the rest of the URL
    $array = StringRegExp($Pagesource, '(?:<a href=")(www\..*?)(?:">)(.*?)(?:</a>)', 3)
    _ArrayDisplay($array, "Test")

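Note: on real pages the "junk" usually comes from attributes other than href, mixed-case tags, and link text that spans several lines. The following is a hedged sketch of a tighter pattern, not a tested solution for any particular site. The sample string is made up, and the pattern assumes double-quoted href values; a regular expression will still miss some valid HTML.

```autoit
; Hypothetical sample; real input would come from _IEBodyReadHTML as in the thread.
Local $sSource = '<A class="nav" href="http://www.applepie.com/test.html">apple pie</A>' & @CRLF & _
        '<a href="http://www.example.com/">example</a>'

; (?i) ignores case, (?s) lets . cross line breaks,
; [^>]*? skips attributes before href, [^"]+ stops the URL at the closing quote.
Local $sPattern = '(?is)<a\b[^>]*?\bhref="([^"]+)"[^>]*>(.*?)</a>'
Local $aLinks = StringRegExp($sSource, $sPattern, 3)

If @error Then
    ConsoleWrite("No links found" & @CRLF)
Else
    ; Flag 3 returns a flat array: URL, text, URL, text, ...
    For $i = 0 To UBound($aLinks) - 2 Step 2
        ConsoleWrite($aLinks[$i] & " -> " & $aLinks[$i + 1] & @CRLF)
    Next
EndIf
```

Because flag 3 flattens the two capture groups into one array, stepping through it two elements at a time pairs each URL with its anchor text.
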