Dieuz

HTML Analysis

8 posts in this topic

#1 ·  Posted (edited)

Hey guys,

How can I extract the <a ...></a> tags from a page source string?

I have tried:

$array = StringRegExp($pagesource,"<a(.+</a>)",1)
_ArrayDisplay($array, "Test")

but it doesn't work!

Can someone correct the RegExp above?

Thanks,

Edited by Dieuz

#2 ·  Posted (edited)

Maybe like this:

#include <Array.au3>
; Sample input containing three anchors
$Pagesource = "<aXX>ufirst set of letters</a>irrelevant material<aRf>secondgroup</a><aRf>bonus characters</a>"
; Flag 3 returns every match; only the text between the tags is captured
$array = StringRegExp($Pagesource,"(?:<a.*?>)(.*?)(?:</a>)",3)
_ArrayDisplay($array, "Test")
Edited by martin
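
The pattern above relies on a lazy .*? between the tags. A slightly stricter variant is sketched below, assuming the input is ordinary HTML; the sample string and the names $sHtml and $aText are made up for illustration. It adds a word boundary so that tags such as <abbr> are not mistaken for anchors, and it turns on case-insensitive and dot-matches-newline modes.

```autoit
#include <Array.au3>

; Hypothetical sample: two real links plus an <abbr> element that should be ignored.
Local $sHtml = '<a href="one.html">First</a> <abbr title="x">skip me</abbr> ' & _
        '<A HREF="two.html">Second' & @CRLF & 'line</A>'

; (?i)   case-insensitive, so <A ...> matches as well as <a ...>
; (?s)   lets . match line breaks, in case the anchor text spans several lines
; <a\b   the letter "a" followed by a word boundary, so <abbr> and <area> are skipped
; [^>]*  the rest of the opening tag; (.*?) lazily captures the anchor text
Local $aText = StringRegExp($sHtml, '(?is)<a\b[^>]*>(.*?)</a>', 3)

If @error Then
    ConsoleWrite("No anchors found" & @CRLF)
Else
    _ArrayDisplay($aText, "Anchor text")
EndIf
```

With this sample the array should hold the two anchor texts and nothing from the <abbr> element.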

Serial port communications UDF Includes functions for binary transmission and reception.printing UDF Useful for graphs, forms, labels, reports etc.Add User Call Tips to SciTE for functions in UDFs not included with AutoIt and for your own scripts.Functions with parameters in OnEvent mode and for Hot Keys One function replaces GuiSetOnEvent, GuiCtrlSetOnEvent and HotKeySet.UDF IsConnected2 for notification of status of connected state of many urls or IPs, without slowing the script.

#3 ·  Posted (edited)

Thanks, I found a workaround...

Can someone tell me why this works:

$array = StringRegExp($pagesource,"<A[^test]+</A>",3)

But this doesn't:

$array = StringRegExp($pagesource,"<A(^test)+</A>",3)

How can I exclude the word "test"?

Edited by Dieuz

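Without knowing what $pagesource is, it's difficult to say. But (^test) is a group in which ^ anchors the match to the start of the string, so it looks for the literal word "test" at that position and can never match right after "<A". It does not exclude anything. [^test] is a character class that matches any single character other than t, e or s, regardless of order, so it excludes those letters rather than the word "test".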

I think I misunderstood or misread your first post. If you want to remove the tags then I think you need to use StringRegExpReplace. Is that what you meant?
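
To exclude the whole word "test" rather than its letters, one common technique is a negative lookahead checked before every character. The sketch below assumes the goal is to keep only the anchors that do not contain "test" anywhere between <a and </a>; the sample string and the names $sHtml and $aKept are hypothetical.

```autoit
#include <Array.au3>

; Hypothetical sample: the middle anchor contains "test" and should be dropped.
Local $sHtml = '<a href="a.html">keep</a><a href="test.html">drop</a><a href="b.html">keep too</a>'

; (?:(?!test|</a>).)* consumes one character at a time, but only while neither
; "test" nor "</a>" starts at the current position. An anchor that contains
; "test" can therefore never reach its closing </a>, and the match fails.
Local $aKept = StringRegExp($sHtml, '(?is)<a\b(?:(?!test|</a>).)*</a>', 3)

If Not @error Then
    ; The pattern has no capturing groups, so each element is a whole matching anchor.
    _ArrayDisplay($aKept, "Anchors without 'test'")
EndIf
```

Including </a> in the lookahead keeps a match from running across several anchors.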


#5 ·  Posted (edited)

Actually, I am trying to extract the url & the anchor text to an array.

Example:

<a href="www.applepie.com/test.html">apple pie</a>

I would like to extract www.applepie.com/test.html & apple pie.

Here's what I have got so far:

#include <IE.au3>
#include <Array.au3>

$url = "http://www.britannica.com/blogs/2008/04/are-newspapers-doomed-do-we-care-newspapers-the-net-forum/" ; It can be any HTML page

$IE = _IECreate($url,0,1,1)
$pagesource = _IEBodyReadHTML($IE)

$array = StringRegExp($pagesource,'href="(http://.+)".+>' & '(.+)</A>',3)

_ArrayDisplay($array, "Test")

_IEQuit($IE)

It doesn't work very well...

Edited by Dieuz

Then I think the approach I showed before is what you need:

#include <Array.au3>
$Pagesource = '<a href="www.applepie.com/test.html">apple pie</a>'
; The (?:...) groups are matched but not returned; only the URL and the anchor text are captured
$array = StringRegExp($Pagesource,'(?:<a href=")(.*?)(?:">)(.*?)(?:</a>)',3)
_ArrayDisplay($array, "Test")
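
On a page with several links, flag 3 returns the captures as one flat array: URL, text, URL, text, and so on. A minimal sketch of walking that array in pairs follows. It assumes double-quoted href attributes, and the sample string and the names $sHtml and $aPairs are made up for illustration; real input would come from _IEBodyReadHTML.

```autoit
; Hypothetical sample with two links, one of which has an extra attribute before href.
Local $sHtml = '<a href="www.applepie.com/test.html">apple pie</a> and ' & _
        '<a class="nav" href="http://example.com/">Example <b>site</b></a>'

; [^>]*?  allows other attributes before href without leaving the opening tag
; ([^"]*) captures the URL and cannot run past its closing quote
; [^>]*>  finishes the opening tag; (.*?) captures the text up to the nearest </a>
Local $aPairs = StringRegExp($sHtml, '(?is)<a\b[^>]*?\bhref="([^"]*)"[^>]*>(.*?)</a>', 3)

If Not @error Then
    ; Step through the flat array two elements at a time: URL first, then anchor text.
    For $i = 0 To UBound($aPairs) - 2 Step 2
        ConsoleWrite($aPairs[$i] & " -> " & $aPairs[$i + 1] & @CRLF)
    Next
EndIf
```

Using [^"]* and [^>]* instead of .+ is what keeps each capture inside a single tag. Any markup nested inside the anchor, such as the <b> element here, stays in the captured text.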

Thanks!

The thing is that if you try it with a normal HTML web page, it picks up a lot of junk.

I haven't tried it, but maybe the pattern needs to be more restrictive:

#include <Array.au3>
$Pagesource = '<a href="www.applepie.com/test.html">apple pie</a>'
; Only accept URLs that start with "www."
$array = StringRegExp($Pagesource,'(?:<a href=")(www\..*?)(?:">)(.*?)(?:</a>)',3)
_ArrayDisplay($array, "Test")
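
If the regular expression keeps picking up junk on real pages, another option is to skip the page source and ask Internet Explorer for its parsed links. This is a different technique from the one in the posts above, sketched here on the assumption that the IE.au3 UDF already used in this thread is available and the page is reachable. The names $sUrl, $oIE, $oLinks and $oLink are illustrative.

```autoit
#include <IE.au3>

; Assumption: Internet Explorer is installed and the URL can be loaded.
Local $sUrl = "http://www.britannica.com/blogs/2008/04/are-newspapers-doomed-do-we-care-newspapers-the-net-forum/"
Local $oIE = _IECreate($sUrl, 0, 0) ; 0, 0 = do not attach to an existing window, keep the browser hidden

; _IELinkGetCollection returns every link in the document; @extended holds the count.
Local $oLinks = _IELinkGetCollection($oIE)
ConsoleWrite("Links found: " & @extended & @CRLF)

For $oLink In $oLinks
    ; .href is the URL as resolved by the browser; .innerText is the visible anchor text.
    ConsoleWrite($oLink.href & " -> " & $oLink.innerText & @CRLF)
Next

_IEQuit($oIE)
```

Because the browser has already parsed the markup, attribute order, quoting style and nested tags no longer matter.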
