Dieuz

HTML Analysis

Dieuz

Hey guys,

How can I extract the <a></a> tags from a page source string?

I have tried:

$array = StringRegExp($pagesource,"<a(.+</a>)",1)
_ArrayDisplay($array, "Test")

but it doesn't work!

Can someone correct the RegExp above?

Thanks,

martin

Maybe like this:

#include <Array.au3>
$Pagesource = "<aXX>ufirst set of letters</a>irrelevant material<aRf>secondgroup</a><aRf>bonus characters</a>"
; .*? is lazy, so each match stops at the nearest </a>; flag 3 returns every match, not just the first
$array = StringRegExp($Pagesource,"(?:<a.*?>)(.*?)(?:</a>)",3)
_ArrayDisplay($array, "Test")
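
For what it's worth, the original attempt with "<a(.+</a>)" and flag 1 probably misbehaves for two reasons: .+ is greedy, so one match can run from the first <a to the last </a> on a line, and flag 1 only returns the first match. A small sketch comparing greedy and lazy matching follows. The sample string and the variable names ($sSample, $aGreedy, $aLazy) are made up for illustration and are not real page source.

```autoit
#include <Array.au3>

; Hypothetical sample; real page source will be messier.
Local $sSample = '<a href="one.html">one</a> filler <a href="two.html">two</a>'

; Greedy: .+ grabs as much as it can, so a single match spans both links.
Local $aGreedy = StringRegExp($sSample, '<a(.+</a>)', 3)
_ArrayDisplay($aGreedy, "Greedy")

; Lazy: .+? stops at the first </a> it reaches, giving one element per link.
Local $aLazy = StringRegExp($sSample, '<a(.+?</a>)', 3)
_ArrayDisplay($aLazy, "Lazy")
```

Note that the patterns are case-sensitive unless they start with (?i), which may matter if the page source comes back with upper-case tags such as <A>.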

Serial port communications UDF: includes functions for binary transmission and reception.
Printing UDF: useful for graphs, forms, labels, reports etc.
Add User Call Tips to SciTE for functions in UDFs not included with AutoIt and for your own scripts.
Functions with parameters in OnEvent mode and for Hot Keys: one function replaces GuiSetOnEvent, GuiCtrlSetOnEvent and HotKeySet.
UDF IsConnected2 for notification of status of connected state of many urls or IPs, without slowing the script.

Dieuz

Thanks, I found a workaround...

Can someone tell me why this works:

$array = StringRegExp($pagesource,"<A[^test]+</A>",3)

But this doesn't:

$array = StringRegExp($pagesource,"<A(^test)+</A>",3)

How can I exclude the word "test"?

martin

Without knowing what $pagesource contains it's difficult to say. But [^test] is a negated character class: it matches any single character other than t, e or s, regardless of order, so it does not exclude the word "test" as such. (^test) is something else entirely: a capture group holding the start-of-string anchor ^ followed by the literal text test, and that cannot match right after <A.

I think I misunderstood or misread your first post. If you want to remove the tags then I think you need to use StringRegExpReplace. Is that what you meant?
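
On the question of excluding the word "test": one common regex technique is a negative lookahead, (?!test), checked before every character. The sketch below compares the three patterns on a made-up string. $sSample and the helper _CountMatches are hypothetical names introduced only for this illustration.

```autoit
; Hypothetical sample: three links, one of which contains the word "test".
Local $sSample = '<A>door</A><A>test</A><A>street</A>'

; Character class: refuses any link containing t, e or s, so "street" is lost too.
ConsoleWrite("[^test]+ : " & _CountMatches($sSample, '<A[^test]+</A>') & @CRLF)

; Anchor inside a group: ^ cannot match in the middle of the string.
ConsoleWrite("(^test)+ : " & _CountMatches($sSample, '<A(^test)+</A>') & @CRLF)

; Negative lookahead: a character is accepted unless "test" or the closing tag starts there.
ConsoleWrite("(?!test) : " & _CountMatches($sSample, '<A\b(?:(?!test|</A>).)*</A>') & @CRLF)

; Hypothetical helper: number of global matches, 0 when nothing matches.
Func _CountMatches($sText, $sPattern)
    Local $aMatches = StringRegExp($sText, $sPattern, 3)
    If Not IsArray($aMatches) Then Return 0
    Return UBound($aMatches)
EndFunc   ;==>_CountMatches
```

With this sample I would expect counts of 1, 0 and 2: the character class keeps only "door", the anchored group matches nothing, and the lookahead keeps "door" and "street" while dropping the link that contains "test".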


Dieuz

Actually, I am trying to extract the URL and the anchor text into an array.

Example:

<a href="www.applepie.com/test.html">apple pie</a>

I would like to extract www.applepie.com/test.html & apple pie.

Here's what I have got so far:

$url = "http://www.britannica.com/blogs/2008/04/are-newspapers-doomed-do-we-care-newspapers-the-net-forum/" ; it can be any HTML page

$IE = _IECreate($url,0,1,1)
$pagesource = _IEBodyReadHTML($IE)

$array = StringRegExp($pagesource,'href="(http://.+)".+>' & '(.+)</A>',3)

 _ArrayDisplay($array, "Test")

 _IEQuit($IE)

It doesn't work very well...

martin

Then I think the approach I showed before is what you need:

#include <Array.au3>
$Pagesource = '<a href="www.applepie.com/test.html">apple pie</a>'
; Two capture groups per link: the href value, then the anchor text
$array = StringRegExp($Pagesource,'(?:<a href=")(.*?)(?:">)(.*?)(?:</a>)',3)
_ArrayDisplay($array, "Test")

Dieuz

Thanks!

The thing is that if you try it with a normal HTML webpage, it picks up a lot of junk.

martin

I haven't tried it, but maybe the pattern needs to be more restrictive:

$Pagesource = '<a href="www.applepie.com/test.html">apple pie</a>'
; Only accept hrefs that start with "www." and stay inside the quotes
$array = StringRegExp($Pagesource,'(?:<a href=")(www\.[^"]*)(?:">)(.*?)(?:</a>)',3)
_ArrayDisplay($array, "Test")
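
Real pages tend to have extra attributes, mixed-case tags and other tags that start with "a", which may be where much of the junk comes from. Below is a rough sketch of a more tolerant pattern, not a full HTML parser. The sample string and the names $sSample, $sPattern and $aPairs are assumptions for illustration. It only handles double-quoted href values.

```autoit
; Hypothetical sample with extra attributes, mixed case and a non-link tag.
Local $sSample = '<abbr title="x">abbr</abbr>' & _
        '<A class="nav" href="http://www.applepie.com/test.html" target="_blank">apple pie</A>' & _
        '<a href="www.example.com/">example</a>'

; (?i)  ignore case; (?s) lets . cross line breaks
; <a\b  only real <a> tags, not <abbr> or <area>
; [^>]* skips other attributes; [^"]* keeps the URL inside its quotes
Local $sPattern = '(?is)<a\b[^>]*\bhref\s*=\s*"([^"]*)"[^>]*>(.*?)</a>'
Local $aPairs = StringRegExp($sSample, $sPattern, 3)

; Flag 3 returns a flat array: URL, text, URL, text, ...
If IsArray($aPairs) Then
    For $i = 0 To UBound($aPairs) - 2 Step 2
        ConsoleWrite($aPairs[$i] & " -> " & $aPairs[$i + 1] & @CRLF)
    Next
EndIf
```

Since the page is already loaded through IE.au3, it may be simpler to skip the regex and walk the links with _IELinkGetCollection, reading each link's href and innerText properties instead.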
