
Regular expression

empty75

Hello, I have a web page's source and want to extract a list of words from it, preferably into an array.

The search criterion looks like this:

<a href="BLAHBLAHBLAH.html?PAGE=Download">

Where BLAHBLAHBLAH is the bit I need to extract, and it will be different for each occurrence on the page.

I'm not entirely certain, but I think the BLAHBLAHBLAH string can contain virtually any character except a space, and can be of variable length.

I am pretty sure that a regular expression is needed, but I just can't grasp the syntax.

Is there an alternative function that returns the string between two substrings but searches backwards, matching the .html?PAGE=Download"> first and then the <a href="?

thanks Matthew.

smstroble

You could try reading each line and splitting it on the delimiters " and . with $split = StringSplit($line, '".'). Then search for a chunk that is equal to "html?PAGE=Download", for example $split[3] = "html?PAGE=Download", and check whether $split[1] is equal to "<a href=". If so, then $split[2] is BLAHBLAHBLAH.
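
A rough sketch of that idea on a single line, for illustration only. The sample $sLine and the loop over the pieces are assumptions, not something tested against the real page, and a name that itself contains a . would be split into more pieces than shown here.

```autoit
; Made-up sample line; on the real page this would be one line of the HTML source.
Local $sLine = '<a href="BLAHBLAHBLAH.html?PAGE=Download">'

; With the default flag, every character in '".' acts as a separate delimiter.
; $aSplit[0] holds the number of pieces returned.
Local $aSplit = StringSplit($sLine, '".')

; Look for the marker chunk, then check that the piece two places back ends with <a href=
For $i = 3 To $aSplit[0]
    If $aSplit[$i] = "html?PAGE=Download" And StringRight($aSplit[$i - 2], 8) = "<a href=" Then
        MsgBox(0, "Found", $aSplit[$i - 1])
    EndIf
Next
```

Starting the loop at 3 means the script simply shows nothing when a line has fewer than three pieces.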



Xenobiologist

Hi,

#include <INet.au3>
#include<String.au3>
$profile = 'www.myspace.com/tom'
$values = _StringBetween(_INetGetSource($profile), '<a href="', '.html?PAGE=Download">')
For $i = 0 To UBound($values) - 1
    MsgBox(0, "Comment count is:", $values[$i])
Next

So long,

Mega


Scripts & functions Organize Includes Let Scite organize the include files

Yahtzee The game "Yahtzee" (Kniffel, DiceLion)

LoginWrapper Secure scripts by adding a query (authentication)

_RunOnlyOnThis UDF Make sure that a script can only be executed on ... (Windows / HD / ...)

Internet-Café Server/Client Application Open CD, Start Browser, Lock remote client, etc.

MultipleFuncsWithOneHotkey Start different funcs by hitting one hotkey different times

Mast3rpyr0

Works 100% for me

#include <GUIConstants.au3>
#include <Constants.au3>
#include <String.au3>
#include <array.au3>

$file = FileOpen("HTMLfile.html", 0) ; replace with the path to your saved HTML file

; Check if file opened for reading OK
If $file = -1 Then
    MsgBox(0, "Error", "Unable to open file.")
    Exit
EndIf

Global $URLIN = FileRead($file)
$aArray1 = _StringBetween($URLIN, '<a href="', '.html?PAGE=Download">')

FileClose($file)

_ArrayDisplay($aArray1, 'Search HTML file')

EDIT: Not sure, but you might not need the first two include files. There's no GUI that comes up, but I don't know.

Xenobiologist

Hi,

For checking the include files, try my Organize Includes script (see my sig) :)

So long,

Mega


empty75

Thanks for all your quick responses, but maybe I should add a little more detail.

I tried _StringBetween, but it doesn't work here because the page has extra <a href=" tags that are not followed by the required .html?PAGE=Download">.

_StringBetween searches the HTML and finds the first <a href=", which may be a link to the home page right at the top of the page, then returns everything between this link and the first occurrence of .html?PAGE=Download">, which may be several dozen lines later.

I am just looking to return the one variable-length word between these two search parameters.

The problem is that there may be several hundred of the first <a href=", but I am only interested in the few that immediately precede the second search parameter, with an unknown word in between.

Just a thought, but is there a function like StringLeft that takes a substring to search up to instead of a number of characters?

I.e.:

CODE

$charpos = StringInStr ($src, ".html?PAGE=Download")

$reqstring = StringLeftToChars($src, $charpos, '<a href="')

Thanks.

Matthew.

Mast3rpyr0

Hmm, you're right. I had only tried it on a file where all the URLs were the same. Let me see if I can fix it.

EDIT: OK, I see the problem. I'm working on a fix, but it looks like you'll have to start from the right and delete everything to the left of the first " found from the right.
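
A rough sketch of that "start from the right" idea, assuming the chunks come from _StringBetween as in the earlier script. For each returned chunk it looks for the last <a href=" still inside it and keeps only what follows. The $sHtml sample is made up for illustration.

```autoit
#include <String.au3>

; Made-up sample: one ordinary link followed by one download link.
Local $sHtml = '<a href="index.html">Home</a> <a href="GAME1.html?PAGE=Download">Get</a>'
Local $sOpen = '<a href="'
Local $aChunks = _StringBetween($sHtml, $sOpen, '.html?PAGE=Download">')

If IsArray($aChunks) Then
    For $i = 0 To UBound($aChunks) - 1
        ; A negative occurrence makes StringInStr search from the right,
        ; so this finds the last <a href=" left inside the chunk (0 if there is none).
        Local $iPos = StringInStr($aChunks[$i], $sOpen, 0, -1)
        If $iPos > 0 Then $aChunks[$i] = StringMid($aChunks[$i], $iPos + StringLen($sOpen))
        ConsoleWrite($aChunks[$i] & @CRLF)
    Next
EndIf
```

On this sample the first chunk starts at the home-page link, and the trim should leave just GAME1.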

empty75

I think I have solved it. It just needs tidying up, but I need to go home now, so the code as it stands is below.

CODE

#include <Array.au3>
#include <GuiConstants.au3> ; Call Gui functions
#include <IE.au3>
#include <GuiStatusBar.au3>
#include <Date.au3>
#include <String.au3>
#include <Inet.au3>

Local $EndPos, $StartPos, $MarkPos, $occ

$srcHTMLNewest = _INetGetSource("http://www.reflexive.com/index.php?PAGE=GameList&SORT=Age")

;$fnGamesInCat = _StringBetween ( $srcHTMLNewest, "<a href=",".html?PAGE=Download_Landing")
;StringInStr ( "string", "substring" [, casesense [, occurrence]] )

$occ = 1
$EndPos = StringInStr($srcHTMLNewest, ".html?PAGE=Download_Landing") ; get pos of second search param

Do
    $StartPos = StringInStr($srcHTMLNewest, "<a href=", 0, $occ)
    If $StartPos < $EndPos Then ; find closest search param #1 to 1st occurrence of 2nd search param
        $MarkPos = $StartPos
        $occ = $occ + 1
    EndIf
Until $StartPos > $EndPos

$reqString = StringTrimLeft($srcHTMLNewest, $MarkPos)
$EndPos = StringInStr($reqString, ".html?PAGE=Download_Landing") ; find new position in new string
$reqString2 = StringLeft($reqString, $EndPos)
$srcHTMLNewest = ""

MsgBox(64, "the string", $reqString2)

thanks for your help,

Bye!

Matthew.

Mast3rpyr0

OK, here's what I came up with.

It works, but it will only get the last one on the page.

#include <GUIConstants.au3>
#include <Constants.au3>
#include <String.au3>
#include <array.au3>

$file = FileOpen("htmlfile.html", 0)

Global $URLIN = FileRead($file)
$aArray1 = _StringBetween($URLIN, '<a href="', '.html?PAGE=Download">')
FileClose($file)
Global $aArray2 = _ArrayToString($aArray1, "")
$Pos = StringInStr($aArray2, '"' , 2, -1)
$aArray2 = StringTrimLeft($aArray2, $Pos)
MsgBox(0, "File Name", $aArray2)

MisterBates

What about:

$sText = '<a href="BLAH.html"></a><a href="BLAHBLAH.html"></a><a href="BLAHBLAHBLAH.html?PAGE=Download"></a>'
$iOffset=1
$array = StringRegExp($sText, '.*(?i)href="(.*?)\.html\?PAGE', 1, $iOffset)
MsgBox(0, "RegExp result", $array[0])

There's no error checking in there, but for me, this found "BLAHBLAHBLAH".
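
For what it's worth, a minimal sketch of what that error checking could look like. It relies on StringRegExp setting @error to a non-zero value when nothing matches or the pattern is invalid. The shortened $sText sample and the message text are just illustrative.

```autoit
; Shortened sample text in the same shape as the example above.
Local $sText = '<a href="BLAH.html"></a><a href="BLAHBLAHBLAH.html?PAGE=Download"></a>'
Local $aMatch = StringRegExp($sText, '.*(?i)href="(.*?)\.html\?PAGE', 1)

; Test @error straight after the call, before touching the array:
; when there is no match, the return value is not an array and $aMatch[0] would fail.
If @error Then
    MsgBox(0, "RegExp result", "No match found")
Else
    MsgBox(0, "RegExp result", $aMatch[0])
EndIf
```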

MisterBates

empty75

Thanks for all your help. I have now managed to hack together this new function. Any comments or improvements are welcome!

It can easily be modified to search to a string on the right as well.

CODE

Func _StringLeftToStr(Const $InputString, Const $StrToSearch, Const $EndMarker, Const $CaseSensitive = 0)
    ; Takes a string, a string end position and a search string.
    ; Will return the string left of the marker up to the first occurrence of the search string.
    ; Else @error: 1 = Out of Bounds, @error: 2 = Search String Not Found.
    Local $OutputString, $StartPos, $MarkPos, $occ, $errorStrLen
    $occ = 1

    If ($EndMarker > StringLen($InputString)) Or (StringLen($StrToSearch) >= StringLen($InputString)) Or (StringLen($StrToSearch) >= $EndMarker) Or (StringLen($StrToSearch) = 0) Then
        ; If EndMarker is outside the input string,
        ; if SearchStr is longer than InputString,
        ; if EndMarker is inside the search string,
        ; if SearchStr is empty
        $errorStrLen = 1
    Else
        $errorStrLen = 0
    EndIf

    If $errorStrLen = 0 Then ; No out of bound errors
        Do
            $StartPos = StringInStr($InputString, $StrToSearch, $CaseSensitive, $occ)
            If $StartPos < $EndMarker Then ; find closest StrToSearch to EndMarker
                $MarkPos = $StartPos
                $occ = $occ + 1
            EndIf
        Until ($StartPos > $EndMarker) Or ($MarkPos = 0) ; Returns startpos of searchstr or 0 if not found.

        If $MarkPos > 0 Then ; Searchstr exists and marker is found
            $OutputString = StringTrimLeft(StringLeft($InputString, $EndMarker), $MarkPos + StringLen($StrToSearch))
            ; Return the string up to the end marker;
            ; find the end position of StrToSearch and trim left to this new position.
        EndIf
    EndIf

    If ($errorStrLen = 1) Then
        SetError(1) ; Various out of bounds error
        $OutputString = ""
    EndIf

    If ($StartPos = 0) Then
        SetError(2) ; Searchstr not found
        $OutputString = ""
    EndIf

    Return $OutputString ; Return the string or empty string if errors, check @error
EndFunc

Tested it with:

$srcHTMLNewest = _INetGetSource("http://www.reflexive.com/index.php?PAGE=GameList&SORT=Age")
$EndPos = StringInStr($srcHTMLNewest, ".html?PAGE=Download_Landing") - 1
$test = _StringLeftToStr($srcHTMLNewest, "<a href=", $EndPos)
If @error = 0 Then MsgBox(64, "test", $test)

Thanks. Matthew

empty75

I am still working on this. The function I wrote (_StringLeftToStr) has serious performance issues the further down the file you search: after about the 15th occurrence it crawls to a near halt, presumably while trying to find the edge of the left-hand search string.

I have taken a look at the regexp above. Am I wrong in thinking that it is supposed to return an array full of all the matches?

CODE

$srcHTMLNewest = _INetGetSource("http://www.reflexive.com/index.php?START=1&END=10&PAGE=GameList&CAT=Action&SORT=Age")
$iOffset = 1

While 1
    $array = StringRegExp($srcHTMLNewest, '.*(?i)href="(.*?)\.html\?PAGE=Download', 1, $iOffset)
    If @error = 0 Then
        $iOffset = @extended
    Else
        ExitLoop
    EndIf
    $arrayofgames = $array
WEnd

_ArrayDisplay($arrayofgames, "List of Games")

This only seems to return the last item found.

The code below does seem to work, but surely this is a workaround?

CODE

Local $arrayofgames[1] = [0]

$srcHTMLNewest = _INetGetSource("http://www.reflexive.com/index.php?START=1&END=10&PAGE=GameList&CAT=Action&SORT=Age")
$iOffset = 1
$cell = 1

While 1
    $array = StringRegExp($srcHTMLNewest, '.*(?i)href="(.*?)\.html\?PAGE=Download', 1, $iOffset)
    If @error = 0 Then
        $iOffset = @extended
    Else
        ExitLoop
    EndIf
    _ArrayAdd($arrayofgames, $array[0])
WEnd

_ArrayDisplay($arrayofgames, "List of Games in new")

Is this a bug in StringRegExp? I am using AutoIt v3.2.4.9.

thanks matthew

SmOke_N

Try switching $iOffset = 1 to $iOffset = 3 ... 1 will only return 1 occurrence, 3 will return all that are found.



MisterBates

SmOke_N's on the right track, but it's not the $iOffset that needs switching, it's the "1" in the StringRegExp call just before the $iOffset. Try:

$array = StringRegExp($sText, '.*(?i)href="(.*?)\.html\?PAGE', 3, $iOffset)

In fact, you can simplify the code to be:

#include <array.au3>
#include <inet.au3>

$srcHTMLNewest = _INetGetSource ( "http://www.reflexive.com/index.php?START=1&END=10&PAGE=GameList&CAT=Action&SORT=Age" )
$arrayofgames = StringRegExp($srcHTMLNewest, '.*(?i)href="(.*?)\.html\?PAGE=Download', 3)
_ArrayDisplay($arrayofgames, "List of Games in new")
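
To see the difference between the two flags without fetching the page, here is a small sketch on a made-up string with one link per line (the sample text and the variable names are assumptions for illustration):

```autoit
; Made-up sample with one link per line, similar in shape to the real page.
Local $sText = '<a href="index.html">Home</a>' & @CRLF & _
        '<a href="GAME1.html?PAGE=Download">One</a>' & @CRLF & _
        '<a href="GAME2.html?PAGE=Download">Two</a>'
Local $sPattern = '.*(?i)href="(.*?)\.html\?PAGE=Download'

Local $aFirst = StringRegExp($sText, $sPattern, 1) ; flag 1: captures from the first match only
Local $aAll = StringRegExp($sText, $sPattern, 3) ; flag 3: captures from every match in the text

If IsArray($aFirst) Then ConsoleWrite("Flag 1 found " & UBound($aFirst) & " item(s)" & @CRLF)
If IsArray($aAll) Then
    For $i = 0 To UBound($aAll) - 1
        ConsoleWrite("Flag 3 item " & $i & ": " & $aAll[$i] & @CRLF)
    Next
EndIf
```

One caveat with this pattern: by default . does not match across line breaks, so the leading .* keeps each match on a single line. If two download links ever share a line, the greedy .* would likely swallow the first one and only the last link on that line would be captured.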
amel27

The above pattern is wrong; try this for the whole text:

$array = StringRegExp ($sText,'(?i)\bhref\s*=\s*"([^"\?]+)(?:\.html|\?)[^"]*"',3)
empty75

Thanks for all your help. With MisterBates's simplified code above I now have what I wanted, and the only noticeable slowdown is while _INetGetSource is running. It now generates a list of about 300 games in about 30 seconds, instead of a list of 20 games in 60 seconds with the method I was using.

Thanks. Matthew
