Regular expression



Hello, I have a web page source and want to extract a list of words from it, preferably into an array.

The search criteria look like this:

<a href="BLAHBLAHBLAH.html?PAGE=Download">

where BLAHBLAHBLAH is the bit I need to extract; it will be different for each occurrence on the page.

I'm not entirely certain, but I think the BLAHBLAHBLAH string can contain virtually any character except a space, and it is of variable length.

I am pretty sure that a regular expression is needed, but I just can't grasp the syntax.

Is there an alternative function that returns the string between two substrings but searches backwards, matching the .html?PAGE=Download"> first and then the <a href="?

thanks Matthew.


You could try reading each line and splitting it on the delimiters " and . with $split = StringSplit($line, '".'). Then search for a chunk that is equal to "html?PAGE=Download". If, for example, $split[3] = html?PAGE=Download, check whether $split[1] is equal to "<a href="; if so, $split[2] is BLAHBLAHBLAH.
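
A rough sketch of that idea on a single line, in case it helps. The sample $line is made up for illustration; in a real script it would come from reading the file line by line. It assumes the name contains no dot and that nothing else precedes <a href= in the chunk before it.

```autoit
; Sketch of the StringSplit approach on one line of HTML.
; $line is a made-up sample, not taken from the real page.
Local $line = '<a href="BLAHBLAHBLAH.html?PAGE=Download">'

; Every " and every . acts as a delimiter; $split[0] holds the number of chunks.
Local $split = StringSplit($line, '".')

; Start at 3 so that $i - 2 is always a valid index.
For $i = 3 To $split[0]
    If $split[$i] = "html?PAGE=Download" And $split[$i - 2] = "<a href=" Then
        MsgBox(0, "Found", $split[$i - 1])
    EndIf
Next
```

If the name itself contains a dot, the chunks shift and this check no longer lines up, which is one reason the later replies move to other approaches.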

MUHAHAHAHAHA


Hi,

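#include <INet.au3>
#include <String.au3>
$profile = 'www.myspace.com/tom'
; Collect everything found between the two delimiters into an array
$values = _StringBetween(_INetGetSource($profile), '<a href="', '.html?PAGE=Download">')
If @error Then
    MsgBox(0, "Error", "No match found.")
    Exit
EndIf
For $i = 0 To UBound($values) - 1
    MsgBox(0, "Match " & $i, $values[$i])
Next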

So long,

Mega

Scripts & functions Organize Includes Let Scite organize the include files

Yahtzee The game "Yahtzee" (Kniffel, DiceLion)

LoginWrapper Secure scripts by adding a query (authentication)

_RunOnlyOnThis UDF Make sure that a script can only be executed on ... (Windows / HD / ...)

Internet-Café Server/Client Application Open CD, Start Browser, Lock remote client, etc.

MultipleFuncsWithOneHotkey Start different funcs by hitting one hotkey different times


Works 100% for me

#include <GUIConstants.au3>
#include <Constants.au3>
#include <String.au3>
#include <array.au3>

$file = FileOpen("htmlfile.html", 0) ; replace with the path to your saved HTML file

; Check if file opened for reading OK
If $file = -1 Then
    MsgBox(0, "Error", "Unable to open file.")
    Exit
EndIf

Global $URLIN = FileRead($file)
$aArray1 = _StringBetween($URLIN, '<a href="', '.html?PAGE=Download">')

FileClose($file)

_ArrayDisplay($aArray1, 'Search HTML file')

EDIT: I'm not sure, but you might not need the first two include files. No GUI comes up, but I don't know.


Hi,

for checking the include files try my organize includes script (see my sig) :)

So long,

Mega


Thanks for all your quick responses, but maybe I should add a little more detail.

I tried _StringBetween, but it fails because there are extra tags: <a href=" without the required .html?PAGE=Download"> following it.

_StringBetween searches the HTML and finds the first <a href=", which may be a link to the home page right at the top of the page, then returns everything between this link and the first occurrence of .html?PAGE=Download">, which may be several dozen lines later.

I am just looking to return the one variable-length word between these two search parameters.

The problem is that there may be several hundred occurrences of the first parameter, <a href=", but I am only interested in the few that immediately precede the second search parameter, with an unknown word in between.

Just a thought, but is there a function that works like StringLeft but takes a substring to search up to instead of a number of characters?

ie.

CODE

$charpos = StringInStr ($src, ".html?PAGE=Download")

$reqstring = StringLeftToChars($src, $charpos, '<a href="')

Thanks.

Matthew.
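
One way to get the behaviour asked for above with built-in functions is to cut the source at the end marker and then search that left-hand part from the right, since StringInStr accepts a negative occurrence. This is only a sketch: the sample $src and the variable names are made up for illustration, and it handles just the first end marker in the text.

```autoit
; Sketch: text between the end marker and the nearest '<a href="' before it.
; $src is a made-up sample standing in for the real page source.
Local $src = '<a href="index.html">Home</a> <a href="BLAHBLAHBLAH.html?PAGE=Download">Get</a>'
Local $sStart = '<a href="'
Local $sEnd = '.html?PAGE=Download">'

Local $iEnd = StringInStr($src, $sEnd) ; position of the end marker, 0 if absent
If $iEnd > 0 Then
    ; Everything before the end marker
    Local $sLeft = StringLeft($src, $iEnd - 1)
    ; A negative occurrence makes StringInStr search from the right
    Local $iStart = StringInStr($sLeft, $sStart, 0, -1)
    If $iStart > 0 Then
        ; Keep what follows the start marker
        MsgBox(0, "Found", StringMid($sLeft, $iStart + StringLen($sStart)))
    EndIf
EndIf
```

Because the right-to-left search starts at the end marker, the unrelated home-page link at the top of the sample is never considered.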


Hmm, you're right. I only tried it with all the same URL before. Let me see if I can fix it.

EDIT: OK, I see the problem. I'm working on a fix, but it looks like you'll have to start from the right and delete anything after that first " from the right.


I think I have solved it; it just needs tidying up, but I need to go home now. The code as it stands is below.

CODE

#include <Array.au3>
#include <GuiConstants.au3> ; Call Gui functions
#include <IE.au3>
#include <GuiStatusBar.au3>
#include <Date.au3>
#include <String.au3>
#include <Inet.au3>

Local $EndPos, $StartPos, $MarkPos = 0, $occ = 1

$srcHTMLNewest = _INetGetSource("http://www.reflexive.com/index.php?PAGE=GameList&SORT=Age")

;$fnGamesInCat = _StringBetween ( $srcHTMLNewest, "<a href=",".html?PAGE=Download_Landing")
;StringInStr ( "string", "substring" [, casesense [, occurrence]] )

; Get the position of the second search parameter
$EndPos = StringInStr($srcHTMLNewest, ".html?PAGE=Download_Landing")

; Find the occurrence of search parameter #1 closest to the first occurrence of parameter #2
Do
    $StartPos = StringInStr($srcHTMLNewest, "<a href=", 0, $occ)
    If $StartPos > 0 And $StartPos < $EndPos Then
        $MarkPos = $StartPos
        $occ = $occ + 1
    EndIf
Until $StartPos = 0 Or $StartPos > $EndPos ; also stop when there are no more occurrences

$reqString = StringTrimLeft($srcHTMLNewest, $MarkPos)
$EndPos = StringInStr($reqString, ".html?PAGE=Download_Landing") ; find the new position in the new string
$reqString2 = StringLeft($reqString, $EndPos)
$srcHTMLNewest = ""

MsgBox(64, "the string", $reqString2)

thanks for your help,

Bye!

Matthew.


OK, here's what I came up with.

It works but will only get the last one on the page.

#include <GUIConstants.au3>
#include <Constants.au3>
#include <String.au3>
#include <array.au3>

$file = FileOpen("htmlfile.html", 0)

Global $URLIN = FileRead($file)
$aArray1 = _StringBetween($URLIN, '<a href="', '.html?PAGE=Download">')
FileClose($file)
Global $aArray2 = _ArrayToString($aArray1, "")
$Pos = StringInStr($aArray2, '"' , 2, -1)
$aArray2 = StringTrimLeft($aArray2, $Pos)
MsgBox(0, "File Name", $aArray2)
What about:

$sText = '<a href="BLAH.html"></a><a href="BLAHBLAH.html"></a><a href="BLAHBLAHBLAH.html?PAGE=Download"></a>'
$iOffset=1
$array = StringRegExp($sText, '.*(?i)href="(.*?)\.html\?PAGE', 1, $iOffset)
MsgBox(0, "RegExp result", $array[0])

There's no error checking in there, but for me, this found "BLAHBLAHBLAH".

MisterBates
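
A variant with the missing error check might look like the sketch below. It is not the pattern from the post above: it swaps the leading .* and the lazy (.*?) for [^"]+, on the assumption that the name never contains a double quote, so the capture cannot run across several tags.

```autoit
Local $sText = '<a href="BLAH.html"></a><a href="BLAHBLAH.html"></a><a href="BLAHBLAHBLAH.html?PAGE=Download"></a>'

; [^"]+ cannot cross a closing quote, so the capture stays inside one href value.
Local $array = StringRegExp($sText, '(?i)href="([^"]+)\.html\?PAGE=Download"', 1)

If @error Then
    ; StringRegExp sets @error when nothing matches, so $array is not an array here.
    MsgBox(0, "RegExp result", "No match found")
Else
    MsgBox(0, "RegExp result", $array[0])
EndIf
```

Checking @error before indexing avoids a crash on $array[0] when the page contains no download link.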


Thanks for all your help. I have now managed to hack together this new function. Any comments or improvements are welcome!

It can easily be modified to search to a string on the right as well.

CODE

Func _StringLeftToStr(Const $InputString, Const $StrToSearch, Const $EndMarker, Const $CaseSensitive = 0)
    ; Takes a string, a search string and an end position (marker) in the string.
    ; Returns the text left of the marker, back to the closest occurrence of the search string.
    ; On failure returns "" and sets @error: 1 = out of bounds, 2 = search string not found.
    Local $StartPos, $MarkPos = 0, $occ = 1

    ; Out of bounds if the end marker is outside the input string, the search string is longer
    ; than the input string, the end marker is inside the search string, or the search string is empty.
    If ($EndMarker > StringLen($InputString)) Or (StringLen($StrToSearch) >= StringLen($InputString)) Or (StringLen($StrToSearch) >= $EndMarker) Or (StringLen($StrToSearch) = 0) Then
        SetError(1)
        Return ""
    EndIf

    ; Walk through the occurrences and keep the last one that starts before the end marker.
    Do
        $StartPos = StringInStr($InputString, $StrToSearch, $CaseSensitive, $occ)
        If $StartPos > 0 And $StartPos < $EndMarker Then
            $MarkPos = $StartPos
            $occ = $occ + 1
        EndIf
    Until ($StartPos = 0) Or ($StartPos >= $EndMarker) ; no more occurrences, or past the marker

    If $MarkPos = 0 Then ; search string not found before the marker
        SetError(2)
        Return ""
    EndIf

    ; Keep the string up to the end marker, then trim left through the end of StrToSearch.
    ; Note: this trims one character past StrToSearch (the opening quote after <a href= in the test below).
    Return StringTrimLeft(StringLeft($InputString, $EndMarker), $MarkPos + StringLen($StrToSearch))
EndFunc

Tested it with:

$srcHTMLNewest = _INetGetSource("http://www.reflexive.com/index.php?PAGE=GameList&SORT=Age")
$EndPos = StringInStr($srcHTMLNewest, ".html?PAGE=Download_Landing") - 1
$test = _StringLeftToStr($srcHTMLNewest, "<a href=", $EndPos)
If @error = 0 Then MsgBox(64, "test", $test)

Thanks. Matthew

I am still trying to do this. The function I wrote (_StringLeftToStr) has serious performance issues the further down the file you search; after about the 15th occurrence it crawls to a near halt, presumably while trying to find the edge of the left-hand search string.

I have taken a look at the above regexp. Am I wrong in thinking that it is supposed to return an array full of all the matches?

CODE

#include <Array.au3>
#include <Inet.au3>

$srcHTMLNewest = _INetGetSource("http://www.reflexive.com/index.php?START=1&END=10&PAGE=GameList&CAT=Action&SORT=Age")
$iOffset = 1

While 1
    $array = StringRegExp($srcHTMLNewest, '.*(?i)href="(.*?)\.html\?PAGE=Download', 1, $iOffset)
    If @error = 0 Then
        $iOffset = @extended ; continue after the previous match
    Else
        ExitLoop
    EndIf
    $arrayofgames = $array ; overwritten on every pass
WEnd

_ArrayDisplay($arrayofgames, "List of Games")

This only seems to return the last item found.

The code below does seem to work, but surely this is a workaround?

CODE

#include <Array.au3>
#include <Inet.au3>

Local $arrayofgames[1] = [0]

$srcHTMLNewest = _INetGetSource("http://www.reflexive.com/index.php?START=1&END=10&PAGE=GameList&CAT=Action&SORT=Age")
$iOffset = 1
$cell = 1

While 1
    $array = StringRegExp($srcHTMLNewest, '.*(?i)href="(.*?)\.html\?PAGE=Download', 1, $iOffset)
    If @error = 0 Then
        $iOffset = @extended ; continue after the previous match
    Else
        ExitLoop
    EndIf
    _ArrayAdd($arrayofgames, $array[0]) ; append each match instead of overwriting
WEnd

_ArrayDisplay($arrayofgames, "List of Games in new")

Is this a bug in StringRegExp? I am using AutoIt v3.2.4.9.

Thanks, Matthew

Try switching $iOffset = 1 to $iOffset = 3 ... 1 will only return 1 occurrence, 3 will return all that are found.

Common sense plays a role in the basics of understanding AutoIt... If you're lacking in that, do us all a favor, and step away from the computer.

SmOke_N's on the right track, but it's not the $iOffset that needs switching, it's the "1" in the StringRegExp call just before the $iOffset. Try:

$array = StringRegExp($sText, '.*(?i)href="(.*?)\.html\?PAGE', 3, $iOffset)

In fact, you can simplify the code to be:

#include <array.au3>
#include <inet.au3>

$srcHTMLNewest = _INetGetSource ( "http://www.reflexive.com/index.php?START=1&END=10&PAGE=GameList&CAT=Action&SORT=Age" )
$arrayofgames = StringRegExp($srcHTMLNewest, '.*(?i)href="(.*?)\.html\?PAGE=Download', 3)
_ArrayDisplay($arrayofgames, "List of Games in new")
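
One caveat about the leading .* may be worth a quick experiment. The sketch below uses a made-up sample with two download links on a single line and compares the pattern above with a variant that drops the .* and uses [^"]+ (assuming names never contain a double quote). By default . does not match a line break, so on a page where each link sits on its own line the original pattern can still find every link.

```autoit
#include <Array.au3>

; Made-up sample: two download links on one line.
Local $sText = '<a href="ONE.html?PAGE=Download"></a><a href="TWO.html?PAGE=Download"></a>'

; The leading .* is greedy: it swallows the whole line, then backtracks to the last href.
Local $aGreedy = StringRegExp($sText, '.*(?i)href="(.*?)\.html\?PAGE=Download', 3)

; Without the leading .*, each link is matched in turn from left to right.
Local $aEach = StringRegExp($sText, '(?i)href="([^"]+)\.html\?PAGE=Download', 3)

_ArrayDisplay($aGreedy, "With leading .*")
_ArrayDisplay($aEach, "Without leading .*")
```

On this sample the first array should hold only TWO, while the second should hold both ONE and TWO.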
The above pattern is wrong; try this for the whole text:
$array = StringRegExp ($sText,'(?i)\bhref\s*=\s*"([^"\?]+)(?:\.html|\?)[^"]*"',3)
Thanks for all your help. With the above code I now have what I wanted, and the only noticeable slowdown is while _INetGetSource is running. It now generates a list of about 300 games in about 30 seconds, instead of a list of 20 games in 60 seconds with the method I was using.

Thanks. Matthew
