Posted

Hello, I have the source of a web page and want to extract a list of words from it, preferably into an array.

The search criteria is like:

<a href="BLAHBLAHBLAH.html?PAGE=Download">

Where BLAHBLAHBLAH is the bit I need to extract, and it will be different for each occurrence on the page.

I'm not entirely certain, but I think the BLAHBLAHBLAH string can contain virtually any characters except a space, and can be of variable length.

I am pretty sure that a regular expression is needed, but I just can't grasp the syntax.

Is there an alternative function that will return a string between two substrings but search backwards, matching the .html?PAGE=Download"> first and then the <a href="?

thanks Matthew.

Posted

You could try reading each line and splitting it on the delimiters " and . with $split = StringSplit($line, '".'). Then look for a chunk that equals "html?PAGE=Download", for example $split[3] = html?PAGE=Download, and check whether $split[1] equals "<a href=". If so, $split[2] is BLAHBLAHBLAH.
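The splitting idea above could look roughly like the sketch below. It assumes the line starts with the tag and that the name itself contains no dot or quote, otherwise the chunk positions shift. The sample line is made up for illustration.

```autoit
; Sample line (assumption): the tag sits at the very start of the line
Local $line = '<a href="BLAHBLAHBLAH.html?PAGE=Download">'

; Every character in the delimiter string is treated as a separate delimiter.
; $split[0] holds the number of chunks, the chunks start at $split[1].
Local $split = StringSplit($line, '".')

If $split[0] >= 3 And $split[1] = '<a href=' And $split[3] = 'html?PAGE=Download' Then
    MsgBox(0, "Found", $split[2])
Else
    MsgBox(0, "Found", "No download link on this line.")
EndIf
```

Because the name sits between the first quote and the first dot, a file name such as my.game.html would break this approach.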

MUHAHAHAHAHA

Posted

Hi,

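#include <INet.au3>
#include <String.au3>

; Example URL; replace it with the page you want to scan
$profile = 'www.myspace.com/tom'
$values = _StringBetween(_INetGetSource($profile), '<a href="', '.html?PAGE=Download">')
; _StringBetween sets @error when nothing is found, so check before looping
If @error Then Exit
For $i = 0 To UBound($values) - 1
    MsgBox(0, "Match " & $i, $values[$i])
Next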

So long,

Mega

Scripts & functions Organize Includes Let Scite organize the include files

Yahtzee The game "Yahtzee" (Kniffel, DiceLion)

LoginWrapper Secure scripts by adding a query (authentication)

_RunOnlyOnThis UDF Make sure that a script can only be executed on ... (Windows / HD / ...)

Internet-Café Server/Client Application Open CD, Start Browser, Lock remote client, etc.

MultipleFuncsWithOneHotkey Start different funcs by hitting one hotkey different times

Posted (edited)

Works 100% for me

#include <GUIConstants.au3>
#include <Constants.au3>
#include <String.au3>
#include <array.au3>

$file = FileOpen("HTMLfile.html", 0) ; replace with the path to your saved HTML file

; Check if file opened for reading OK
If $file = -1 Then
    MsgBox(0, "Error", "Unable to open file.")
    Exit
EndIf

Global $URLIN = FileRead($file)
$aArray1 = _StringBetween($URLIN, '<a href="', '.html?PAGE=Download">')

FileClose($file)

_ArrayDisplay($aArray1, 'Search HTML file')

EDIT: Not sure, but you might not need the first 2 include files. There's no GUI that comes up, but idk.

Edited by Mast3rpyr0
Posted

Hi,

To check which include files are needed, try my Organize Includes script (see my sig) :)

So long,

Mega

Posted

Thanks for all your quick responses, but maybe I should add a little more detail.

I tried _StringBetween, but it does not work here because the page contains extra <a href=" tags that are not followed by the required .html?PAGE=Download">.

_StringBetween searches the HTML and finds the first <a href=", which may be a link to the home page right at the top of the page. It then returns everything between that link and the first occurrence of .html?PAGE=Download">, which may be several dozen lines later.

I am just looking to return the one variable-length word between these two search parameters.

The problem is that there may be several hundred occurrences of the first <a href=", but I am only interested in the few that immediately precede the second search parameter, with an unknown word in between.

Just a thought, but is there a function that works like StringLeft but takes a substring to search up to instead of a number of chars?

ie.

CODE

$charpos = StringInStr ($src, ".html?PAGE=Download")

$reqstring = StringLeftToChars($src, $charpos, '<a href="')

Thanks.

Matthew.
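One way to get the behaviour asked for above is to cut the source off at the end marker and then search that piece from the right. This is only a sketch: _TextBeforePos is a hypothetical helper name (not a built-in or a UDF from the standard includes), and the sample HTML is made up. It relies on StringInStr accepting a negative occurrence to search from the right side.

```autoit
; Hypothetical helper: returns the text between the last occurrence of $sStart
; that lies before position $iEndPos and the position $iEndPos itself.
Func _TextBeforePos($sSource, $sStart, $iEndPos)
    ; Keep only the part of the source that precedes the end marker
    Local $sHead = StringLeft($sSource, $iEndPos - 1)
    ; A negative occurrence makes StringInStr search from the right
    Local $iStart = StringInStr($sHead, $sStart, 0, -1)
    If $iStart = 0 Then Return SetError(1, 0, "")
    ; Drop everything up to and including $sStart
    Return StringTrimLeft($sHead, $iStart + StringLen($sStart) - 1)
EndFunc

; Made-up sample with one unrelated link in front of the wanted one
Local $sSrc = '<a href="index.html">Home</a> <a href="BLAHBLAHBLAH.html?PAGE=Download">Get</a>'
Local $iEnd = StringInStr($sSrc, '.html?PAGE=Download')
If $iEnd > 0 Then
    Local $sName = _TextBeforePos($sSrc, '<a href="', $iEnd)
    If Not @error Then MsgBox(0, "Found", $sName)
EndIf
```

Since only the text in front of the end marker is scanned, and it is scanned once from the right, the unrelated home page link is never picked up.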

Posted (edited)

Hmm, you're right. I only tried it on all the same URL before; let me see if I can fix it.

EDIT: OK, I see the problem. Working on a fix, but it looks like you'll have to start from the right and delete anything after that first " from the right.

Edited by Mast3rpyr0
Posted

I think I have solved it. It just needs tidying up, but I need to go home now, so the code as it stands is below.

CODE

#include <Array.au3>
#include <GuiConstants.au3> ; Call Gui functions
#include <IE.au3>
#include <GuiStatusBar.au3>
#include <Date.au3>
#include <String.au3>
#include <Inet.au3>

Local $EndPos, $StartPos, $MarkPos, $occ

$srcHTMLNewest = _INetGetSource("http://www.reflexive.com/index.php?PAGE=GameList&SORT=Age")

;$fnGamesInCat = _StringBetween ( $srcHTMLNewest, "<a href=",".html?PAGE=Download_Landing")
;StringInStr ( "string", "substring" [, casesense [, occurrence]] )

$occ = 1
$EndPos = StringInStr($srcHTMLNewest, ".html?PAGE=Download_Landing") ; get pos of second search param

Do
    $StartPos = StringInStr($srcHTMLNewest, "<a href=", 0, $occ)
    If $StartPos < $EndPos Then ; find closest search param #1 to 1st occurrence of 2nd search param
        $MarkPos = $StartPos
        $occ = $occ + 1
    EndIf
Until $StartPos > $EndPos

$reqString = StringTrimLeft($srcHTMLNewest, $MarkPos)
$EndPos = StringInStr($reqString, ".html?PAGE=Download_Landing") ; find new position in new string
$reqString2 = StringLeft($reqString, $EndPos)
$srcHTMLNewest = ""

MsgBox(64, "the string", $reqString2)

thanks for your help,

Bye!

Matthew.

Posted

OK, here's what I came up with.

It works, but it will only get the last one on the page.

#include <GUIConstants.au3>
#include <Constants.au3>
#include <String.au3>
#include <array.au3>

$file = FileOpen("htmlfile.html", 0)

Global $URLIN = FileRead($file)
$aArray1 = _StringBetween($URLIN, '<a href="', '.html?PAGE=Download">')
FileClose($file)
Global $aArray2 = _ArrayToString($aArray1, "")
$Pos = StringInStr($aArray2, '"' , 2, -1)
$aArray2 = StringTrimLeft($aArray2, $Pos)
MsgBox(0, "File Name", $aArray2)
Posted

What about:

$sText = '<a href="BLAH.html"></a><a href="BLAHBLAH.html"></a><a href="BLAHBLAHBLAH.html?PAGE=Download"></a>'
$iOffset=1
$array = StringRegExp($sText, '.*(?i)href="(.*?)\.html\?PAGE', 1, $iOffset)
MsgBox(0, "RegExp result", $array[0])

There's no error checking in there, but for me, this found "BLAHBLAHBLAH".

MisterBates
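Since the snippet above has no error checking, here is a hedged sketch of how that might be added. It reuses the same sample text and pattern, and assumes the documented behaviour that StringRegExp sets @error to 1 when nothing matches and to 2 when the pattern is bad (with @extended then holding the offset of the error in the pattern).

```autoit
Local $sText = '<a href="BLAH.html"></a><a href="BLAHBLAH.html"></a><a href="BLAHBLAHBLAH.html?PAGE=Download"></a>'
Local $aMatch = StringRegExp($sText, '.*(?i)href="(.*?)\.html\?PAGE', 1)

; Save both macros right away, before any other function call can change them
Local $iErr = @error, $iExt = @extended

If $iErr = 0 Then
    MsgBox(0, "RegExp result", $aMatch[0])
ElseIf $iErr = 1 Then
    MsgBox(48, "RegExp result", "No match found.")
Else
    MsgBox(16, "RegExp result", "Bad pattern, error at offset " & $iExt)
EndIf
```

Without the check, indexing $aMatch[0] on a failed call would stop the script with an error, because no array is returned.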

Posted

Thanks for all your help. I have now managed to hack together this new function. Any comments or improvements are welcome!

It can easily be modified to search to a string on the right as well.

CODE

Func _StringLeftToStr(Const $InputString, Const $StrToSearch, Const $EndMarker, Const $CaseSensitive = 0)
    ; Takes a string, a string end position and a search string.
    ; Will return the string left of the marker up to the first occurrence of the search string.
    ; Else @error: 1 = Out of Bounds, @error: 2 = Search String Not Found.

    Local $OutputString, $StartPos, $MarkPos, $occ, $errorStrLen

    $occ = 1

    If ($EndMarker > StringLen($InputString)) Or (StringLen($StrToSearch) >= StringLen($InputString)) Or (StringLen($StrToSearch) >= $EndMarker) Or (StringLen($StrToSearch) = 0) Then
        ; If EndMarker is outside the input string,
        ; if the search string is longer than the input string,
        ; if EndMarker is inside the search string,
        ; if the search string is empty
        $errorStrLen = 1
    Else
        $errorStrLen = 0
    EndIf

    If $errorStrLen = 0 Then ; No out of bounds errors
        Do
            $StartPos = StringInStr($InputString, $StrToSearch, $CaseSensitive, $occ)
            If $StartPos < $EndMarker Then ; find closest StrToSearch to EndMarker
                $MarkPos = $StartPos
                $occ = $occ + 1
            EndIf
        Until ($StartPos > $EndMarker) Or ($MarkPos = 0) ; Returns start pos of search string or 0 if not found.

        If $MarkPos > 0 Then ; Search string exists and marker is found
            $OutputString = StringTrimLeft(StringLeft($InputString, $EndMarker), $MarkPos + StringLen($StrToSearch))
            ; Return the string up to the end marker;
            ; find the end position of StrToSearch and trim left to this new position.
        EndIf
    EndIf

    If ($errorStrLen = 1) Then
        SetError(1) ; Various out of bounds errors
        $OutputString = ""
    EndIf

    If ($StartPos = 0) Then
        SetError(2) ; Search string not found
        $OutputString = ""
    EndIf

    Return $OutputString ; Return the string, or an empty string on errors; check @error
EndFunc

Tested it with:

$srcHTMLNewest = _INetGetSource("http://www.reflexive.com/index.php?PAGE=GameList&SORT=Age")
$EndPos = StringInStr($srcHTMLNewest, ".html?PAGE=Download_Landing") - 1
$test = _StringLeftToStr($srcHTMLNewest, "<a href=", $EndPos)
If @error = 0 Then MsgBox(64, "test", $test)

Thanks. Matthew

Posted

I am still trying to do this. The function I wrote (_StringLeftToStr) has serious performance issues the further down the file you search: after about the 15th occurrence it crawls to a near halt, presumably while trying to find the edge of the left-hand search string.

I have taken a look at the above regexp. Am I wrong in thinking that it is supposed to return an array full of all the matches?

CODE

#include <Array.au3>
#include <Inet.au3>

$srcHTMLNewest = _INetGetSource("http://www.reflexive.com/index.php?START=1&END=10&PAGE=GameList&CAT=Action&SORT=Age")
$iOffset = 1

While 1
    $array = StringRegExp($srcHTMLNewest, '.*(?i)href="(.*?)\.html\?PAGE=Download', 1, $iOffset)
    If @error = 0 Then
        $iOffset = @extended
    Else
        ExitLoop
    EndIf
    $arrayofgames = $array
WEnd

_ArrayDisplay($arrayofgames, "List of Games")

This only seems to return the last item found.

The code below does seem to work, but surely this is a workaround?

CODE

#include <Array.au3>
#include <Inet.au3>

Local $arrayofgames[1] = [0]

$srcHTMLNewest = _INetGetSource("http://www.reflexive.com/index.php?START=1&END=10&PAGE=GameList&CAT=Action&SORT=Age")
$iOffset = 1
$cell = 1

While 1
    $array = StringRegExp($srcHTMLNewest, '.*(?i)href="(.*?)\.html\?PAGE=Download', 1, $iOffset)
    If @error = 0 Then
        $iOffset = @extended
    Else
        ExitLoop
    EndIf
    _ArrayAdd($arrayofgames, $array[0])
WEnd

_ArrayDisplay($arrayofgames, "List of Games in new")

Is this a bug in StringRegExp? I am using AutoIt v3.2.4.9.

thanks matthew

  • Moderators
Posted

Try switching $iOffset = 1 to $iOffset = 3 ... 1 will only return 1 occurrence, 3 will return all that are found.

Common sense plays a role in the basics of understanding AutoIt... If you're lacking in that, do us all a favor, and step away from the computer.

Posted (edited)

SmOke_N's on the right track, but it's not the $iOffset that needs switching, it's the "1" in the StringRegExp call just before the $iOffset. Try:

$array = StringRegExp($sText, '.*(?i)href="(.*?)\.html\?PAGE', 3, $iOffset)

In fact, you can simplify the code to be:

#include <array.au3>
#include <inet.au3>

$srcHTMLNewest = _INetGetSource ( "http://www.reflexive.com/index.php?START=1&END=10&PAGE=GameList&CAT=Action&SORT=Age" )
$arrayofgames = StringRegExp($srcHTMLNewest, '.*(?i)href="(.*?)\.html\?PAGE=Download', 3)
_ArrayDisplay($arrayofgames, "List of Games in new")
Edited by MisterBates
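To see the difference between the two flag values without downloading anything, a small self-contained sketch may help. The sample HTML is made up, and the pattern is a variation (an assumption, not the one posted above): it uses [^"] instead of a leading .* so that a match can never run past the closing quote into another tag.

```autoit
#include <Array.au3>

; Made-up sample: one unrelated link followed by two download links
Local $sHtml = '<a href="index.html">Home</a>' & @CRLF & _
        '<a href="GameOne.html?PAGE=Download">Get</a>' & @CRLF & _
        '<a href="Game_Two.html?PAGE=Download">Get</a>'

; [^"]*? stays inside one href value, so unrelated links are skipped
Local $sPattern = '(?i)href="([^"]*?)\.html\?PAGE=Download'

; Flag 1 stops after the first match, flag 3 collects every match
Local $aFirst = StringRegExp($sHtml, $sPattern, 1)
Local $aAll = StringRegExp($sHtml, $sPattern, 3)

If IsArray($aFirst) Then MsgBox(0, "Flag 1", $aFirst[0])
If IsArray($aAll) Then _ArrayDisplay($aAll, "Flag 3")
```

With this sample, flag 1 should show only GameOne, while flag 3 should list GameOne and Game_Two.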
Posted (edited)

The above pattern is wrong; try this for the whole text:
$array = StringRegExp($sText, '(?i)\bhref\s*=\s*"([^"\?]+)(?:\.html|\?)[^"]*"', 3)
Edited by amel27
Posted

Thanks for all your help. With MisterBates' simplified code above, I now have what I wanted, and the only noticeable slowdown is while _INetGetSource is running. It now generates a list of about 300 games in about 30 seconds, instead of a list of 20 games in 60 seconds with the method I was using.

Thanks. Matthew
