
HTML Analys


Dieuz

Hey guys,

How can I extract the <a XX ></a> tags from a page source string?

I have tried:

$array = StringRegExp($pagesource,"<a(.+</a>)",1)
 _ArrayDisplay($array, "Test")

but it doesn't work!

Can someone correct the RegExp above?

Thanks,

Maybe like this:

#include <array.au3>
$Pagesource = "<aXX>ufirst set of letters</a>irrelevant material<aRf>secondgroup</a><aRf>bonus characters</a>"
$array = StringRegExp($Pagesource,"(?:<a.*?>)(.*?)(?:</a>)",3)
 _ArrayDisplay($array, "Test")
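
For comparison, here is a small sketch of why the pattern in the question misbehaves, assuming the problem is the greedy `.+` combined with flag 1 (which stops after the first match). The sample string is made up for illustration.

```autoit
#include <Array.au3>

; Made-up sample string, not real page source.
Local $sSource = '<a x>one</a> filler <a y>two</a>'

; Greedy .+ with flag 1: a single match that runs on to the LAST </a>.
Local $aGreedy = StringRegExp($sSource, "<a(.+</a>)", 1)
_ArrayDisplay($aGreedy, "Greedy, flag 1")

; Lazy .*? with flag 3: one entry per link, containing only the link text.
Local $aLazy = StringRegExp($sSource, "<a.*?>(.*?)</a>", 3)
_ArrayDisplay($aLazy, "Lazy, flag 3")
```

The first display should show one long entry spanning both links, the second one entry per link.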
Edited by martin

Thanks, I found a workaround...

Can someone tell me why this works:

$array = StringRegExp($pagesource,"<A[^test]+</A>",3)

But this doesn't:

$array = StringRegExp($pagesource,"<A(^test)+</A>",3)

Why?

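Without knowing what $pagesource is, it's difficult to say. But the two patterns mean very different things. [^test] is a negated character class: it matches any single character that is not t, e or s, regardless of order. (^test) is a capturing group in which ^ is the start-of-string anchor followed by the literal text "test", so placed after <A it can never match.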
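
A minimal sketch of the difference between the character class and the anchored group, using made-up strings and flag 0 (which returns 1 for a match and 0 otherwise):

```autoit
; [^test] is a character class: one character that is not t, e or s.
ConsoleWrite(StringRegExp("<A>link</A>", "<A[^test]+</A>", 0) & @CRLF) ; 1, ">link" contains no t, e or s
ConsoleWrite(StringRegExp("<A>test</A>", "<A[^test]+</A>", 0) & @CRLF) ; 0, the class cannot get past "t"

; (^test) is a group: ^ anchors to the start of the string, then literal "test".
ConsoleWrite(StringRegExp("<A>test</A>", "<A(^test)+</A>", 0) & @CRLF) ; 0, ^ cannot match after "<A"
ConsoleWrite(StringRegExp("test</A>", "(^test)+</A>", 0) & @CRLF)      ; 1, "test" is at the very start
```

Note that neither form means "does not contain the word test"; that would need a negative lookahead such as (?!test).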

I think I misunderstood or misread your first post. If you want to remove the tags then I think you need to use StringRegExpReplace. Is that what you meant?
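
If removing the tags is the goal, something along these lines might do it. This is only a sketch on a made-up string; the pattern assumes the anchor tags contain no ">" inside attribute values.

```autoit
; Made-up sample string for illustration.
Local $sSource = 'Try <a href="page.html">this page</a> today'

; (?i) ignores case; [^>]* keeps each match inside a single tag.
; </?a\b matches both the opening <a ...> and the closing </a>.
Local $sPlain = StringRegExpReplace($sSource, '(?i)</?a\b[^>]*>', '')

ConsoleWrite($sPlain & @CRLF) ; Try this page today
```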


Actually, I am trying to extract the url & the anchor text to an array.

Example:

<a href="www.applepie.com/test.html">apple pie</a>

I would like to extract www.applepie.com/test.html & apple pie.

Here's what I have got so far:

$url = "http://www.britannica.com/blogs/2008/04/are-newspapers-doomed-do-we-care-newspapers-the-net-forum/" ; It can be ANY HTML page

$IE = _IECreate($url,0,1,1)
$pagesource = _IEBodyReadHTML($IE)

$array = StringRegExp($pagesource,'href="(http://.+)".+>' & '(.+)</A>',3)

 _ArrayDisplay($array, "Test")

 _IEQuit($IE)

It doesn't work very well...

Edited by Dieuz

Then I think that what I showed before is the approach that is needed:

$Pagesource = '<a href="www.applepie.com/test.html">apple pie</a>'
$array = StringRegExp($Pagesource,'(?:<a href=")(.*?)(?:">)(.*?)(?:</a>)',3)
_ArrayDisplay($array, "Test")

Thanks!

The thing is that if you try it with a normal HTML webpage, it picks up a lot of junk.

I haven't tried it, but maybe it needs to be made more restrictive:

$Pagesource = '<a href="www.applepie.com/test.html">apple pie</a>'
$array = StringRegExp($Pagesource,'(?:<a href=")(www\..*?)(?:">)(.*?)(?:</a>)',3)
_ArrayDisplay($array, "Test")
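
For real page source, a tighter pattern may cut down on the junk. The following is a sketch only, with a made-up sample string; it assumes href values are double-quoted and that no ">" appears inside attribute values, which is not guaranteed on arbitrary pages.

```autoit
; Made-up sample; real page source will be messier.
Local $sSource = '<A class="nav" href="http://www.applepie.com/test.html">apple pie</A>' & @CRLF & _
        '<a name="top">no link here</a>' & @CRLF & _
        '<a href="http://example.com/">example</a>'

; (?i)  ignore case, so <A ...> and <a ...> both match
; (?s)  let . cross line breaks inside the link text
; [^>]* and [^"]* cannot run past the end of the tag or of the attribute value
Local $sPattern = '(?is)<a\s[^>]*?href="([^"]*)"[^>]*>(.*?)</a>'
Local $aLinks = StringRegExp($sSource, $sPattern, 3)

If @error Then
    ConsoleWrite("No links found" & @CRLF)
Else
    ; Flag 3 returns a flat array: url, text, url, text, ...
    For $i = 0 To UBound($aLinks) - 1 Step 2
        ConsoleWrite($aLinks[$i] & " -> " & $aLinks[$i + 1] & @CRLF)
    Next
EndIf
```

The anchor without an href is skipped, and the other two come back as url/text pairs. Since the page is already being loaded through IE.au3, _IELinkGetCollection is another option: it returns the link objects directly, so no regular expression is needed.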