empty75, Posted June 8, 2007

Hello,

I have a web page source and want to extract a list of words from it, preferably into an array. The search criterion looks like this:

    <a href="BLAHBLAHBLAH.html?PAGE=Download">

where BLAHBLAHBLAH is the bit I need to extract, and it will be different for each occurrence on the page. I'm not entirely certain, but I think the BLAHBLAHBLAH string can contain virtually any character except a space, and can be of variable length.

I am pretty sure that a regular expression is needed, but I just can't grasp the syntax. Is there an alternative function that will return a string between two substrings but search backwards, matching the .html?PAGE=Download"> first and then the <a href="?

Thanks,
Matthew.

smstroble, Posted June 8, 2007

You could try reading each line and splitting it on the delimiters " and . like this:

    $split = StringSplit($line, '".')

Then search for a chunk that is equal to "html?PAGE=Download", for example $split[3] = html?PAGE=Download, and check $split[1] to see whether it is equal to "<a href=". If so, $split[2] is BLAHBLAHBLAH.

MUHAHAHAHAHA

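The StringSplit idea above can be sketched as follows. This is only an illustration under stated assumptions: $sLine is a made-up sample line, and the approach only holds if the wanted name contains no dot or quote, because every " and . in the line acts as a delimiter.

```autoit
; Sketch: split one line on the delimiters " and . and inspect the pieces.
; $sLine is a hypothetical sample line, not taken from the real page.
Local $sLine = '<a href="BLAHBLAHBLAH.html?PAGE=Download">'

; Each character of '".' is treated as a separate delimiter.
; $aSplit[0] holds the number of pieces, the pieces start at $aSplit[1].
Local $aSplit = StringSplit($sLine, '".')

If $aSplit[0] >= 3 Then
    ; Piece 1 should be the tag start, piece 3 the tail after the dot.
    If $aSplit[1] = '<a href=' And $aSplit[3] = 'html?PAGE=Download' Then
        ConsoleWrite($aSplit[2] & @CRLF) ; the name between the two markers
    EndIf
EndIf
```

If the name itself can contain a dot, the pieces shift and the fixed indices 1 to 3 no longer line up, so this is probably too fragile for arbitrary pages.
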
Xenobiologist, Posted June 8, 2007

Hi,

    #include <INet.au3>
    #include <String.au3>
    $profile = 'www.myspace.com/tom'
    $values = _StringBetween(_INetGetSource($profile), '<a href="', '.html?PAGE=Download">')
    For $i = 0 To UBound($values) - 1
        MsgBox(0, "Comment count is:", $values[$i])
    Next

So long,
Mega

Scripts & functions
Organize Includes: Let Scite organize the include files
Yahtzee: The game "Yahtzee" (Kniffel, DiceLion)
LoginWrapper: Secure scripts by adding a query (authentication)
_RunOnlyOnThis UDF: Make sure that a script can only be executed on ... (Windows / HD / ...)
Internet-Café Server/Client Application: Open CD, Start Browser, Lock remote client, etc.
MultipleFuncsWithOneHotkey: Start different funcs by hitting one hotkey different times

Mast3rpyr0, Posted June 8, 2007 (edited)

Works 100% for me:

    #include <GUIConstants.au3>
    #include <Constants.au3>
    #include <String.au3>
    #include <array.au3>
    $file = FileOpen("____HTMLfile.html____", 0) ; replace with the path to your HTML file
    ; Check if file opened for reading OK
    If $file = -1 Then
        MsgBox(0, "Error", "Unable to open file.")
        Exit
    EndIf
    Global $URLIN = FileRead($file)
    $aArray1 = _StringBetween($URLIN, '<a href="', '.html?PAGE=Download">')
    FileClose($file)
    _ArrayDisplay($aArray1, 'Search HTML file')

EDIT: Not sure, but you might not need the first two include files. There's no GUI that comes up, but I don't know.

Edited June 8, 2007 by Mast3rpyr0

My UDF's: _INetUpdateCheck()
My Programs: GameLauncher vAlpha, InfoCrypt, WindowDesigner, ScreenCap, DailyReminders, Pick3Generator, BackupUtility!
Other: Bored? Click Here!

Xenobiologist, Posted June 8, 2007

Hi,

for checking the include files, try my Organize Includes script (see my sig).

So long,
Mega

Mast3rpyr0, Posted June 8, 2007

Ooh, nifty, thanks.

empty75 (Author), Posted June 8, 2007

Thanks for all your quick responses, but maybe I should add a little more detail. I tried _StringBetween, but the page contains extra <a href=" tags that are not followed by the required .html?PAGE=Download">. _StringBetween searches the HTML and finds the first <a href=", which may be a link to the home page right at the top of the page, then returns everything between this link and the first occurrence of .html?PAGE=Download">, which may be several dozen lines later.

I am just looking to return the one variable-length word between these two search parameters. The problem is that there may be several hundred of the first <a href=", but I am only interested in the few that immediately precede the second search parameter, with an unknown word in between.

Just a thought, but is there a function that returns a StringLeft given a substring to search up to, instead of a number of chars? I.e.:

    $charpos = StringInStr($src, ".html?PAGE=Download")
    $reqstring = StringLeftToChars($src, $charpos, '<a href="')

Thanks.
Matthew.

Mast3rpyr0, Posted June 8, 2007 (edited)

Hmm, you're right. I had only tried it with all the same URL before. Let me see if I can fix it.

EDIT: OK, I see the problem. I'm working on a fix, but it looks like you'll have to start from the right and delete anything after that first " from the right.

Edited June 8, 2007 by Mast3rpyr0

empty75 (Author), Posted June 8, 2007

I think I have solved it. It just needs tidying up, but I need to go home now. The code as it stands is below.

    #include <Array.au3>
    #include <GuiConstants.au3> ; Call Gui functions
    #include <IE.au3>
    #include <GuiStatusBar.au3>
    #include <Date.au3>
    #include <String.au3>
    #include <Inet.au3>

    Local $EndPos, $StartPos, $MarkPos, $occ
    $srcHTMLNewest = _INetGetSource("http://www.reflexive.com/index.php?PAGE=GameList&SORT=Age")
    ;$fnGamesInCat = _StringBetween($srcHTMLNewest, "<a href=", ".html?PAGE=Download_Landing")
    ;StringInStr ( "string", "substring" [, casesense [, occurrence]] )
    $occ = 1
    $EndPos = StringInStr($srcHTMLNewest, ".html?PAGE=Download_Landing") ; get pos of second search param
    Do
        $StartPos = StringInStr($srcHTMLNewest, "<a href=", 0, $occ)
        If $StartPos < $EndPos Then ; find closest search param #1 to 1st occurrence of 2nd search param
            $MarkPos = $StartPos
            $occ = $occ + 1
        EndIf
    Until $StartPos > $EndPos
    $reqString = StringTrimLeft($srcHTMLNewest, $MarkPos)
    $EndPos = StringInStr($reqString, ".html?PAGE=Download_Landing") ; find new position in new string
    $reqString2 = StringLeft($reqString, $EndPos)
    $srcHTMLNewest = ""
    MsgBox(64, "the string", $reqString2)

Thanks for your help. Bye!
Matthew.

Mast3rpyr0, Posted June 8, 2007

OK, here's what I came up with. It works, but it will only get the last one on the page.

    #include <GUIConstants.au3>
    #include <Constants.au3>
    #include <String.au3>
    #include <array.au3>
    $file = FileOpen("htmlfile.html", 0)
    Global $URLIN = FileRead($file)
    $aArray1 = _StringBetween($URLIN, '<a href="', '.html?PAGE=Download">')
    FileClose($file)
    Global $aArray2 = _ArrayToString($aArray1, "")
    $Pos = StringInStr($aArray2, '"', 2, -1) ; negative occurrence searches from the right
    $aArray2 = StringTrimLeft($aArray2, $Pos)
    MsgBox(0, "File Name", $aArray2)

MisterBates, Posted June 9, 2007

empty75 wrote:
    Just a thought, but is there a function that returns a StringLeft given a substring to search up to, instead of a number of chars?

What about:

    $sText = '<a href="BLAH.html"></a><a href="BLAHBLAH.html"></a><a href="BLAHBLAHBLAH.html?PAGE=Download"></a>'
    $iOffset = 1
    $array = StringRegExp($sText, '.*(?i)href="(.*?)\.html\?PAGE', 1, $iOffset)
    MsgBox(0, "RegExp result", $array[0])

There's no error checking in there, but for me, this found "BLAHBLAHBLAH".

MisterBates

Suspend/Resume Windows Screensaver
WatchWindows - Window watcher/logger
UDF: Click systray menu/submenu items
UDF: Outlook Express Folder/Message handling (+ example code)
HowTo: Multiple icons in one compiled script

empty75 (Author), Posted June 11, 2007

Thanks for all your help. I have now managed to hack together this new function. Any comments or improvements are welcome! It can easily be modified to search to a string on the right as well.

    Func _StringLeftToStr(Const $InputString, Const $StrToSearch, Const $EndMarker, Const $CaseSensitive = 0)
        ; Takes a string, a string end position and a search string.
        ; Will return the string left of the marker up to the first occurrence of the search string.
        ; Else @Error: 1 = Out of Bounds, @Error: 2 = Search String Not Found.
        Local $OutputString, $StartPos, $MarkPos, $occ, $errorStrLen
        $occ = 1
        If ($EndMarker > StringLen($InputString)) Or (StringLen($StrToSearch) >= StringLen($InputString)) Or (StringLen($StrToSearch) >= $EndMarker) Or (StringLen($StrToSearch) = 0) Then
            ; If EndMarker is outside the input string,
            ; if the search string is longer than the input string,
            ; if EndMarker is inside the search string,
            ; if the search string is empty
            $errorStrLen = 1
        Else
            $errorStrLen = 0
        EndIf
        If $errorStrLen = 0 Then ; No out of bounds errors
            Do
                $StartPos = StringInStr($InputString, $StrToSearch, $CaseSensitive, $occ)
                If $StartPos < $EndMarker Then ; find closest StrToSearch to EndMarker
                    $MarkPos = $StartPos
                    $occ = $occ + 1
                EndIf
            Until ($StartPos > $EndMarker) Or ($MarkPos = 0) ; Returns start pos of search string, or 0 if not found.
            If $MarkPos > 0 Then ; Search string exists and marker is found
                ; Return the string up to the end marker:
                ; find the end position of StrToSearch and trim left to this new position.
                $OutputString = StringTrimLeft(StringLeft($InputString, $EndMarker), $MarkPos + StringLen($StrToSearch))
            EndIf
        EndIf
        If ($errorStrLen = 1) Then
            SetError(1) ; Various out of bounds errors
            $OutputString = ""
        EndIf
        If ($StartPos = 0) Then
            SetError(2) ; Search string not found
            $OutputString = ""
        EndIf
        Return $OutputString ; Return the string, or an empty string if errors; check @Error
    EndFunc

Tested it with:

    $srcHTMLNewest = _INetGetSource("http://www.reflexive.com/index.php?PAGE=GameList&SORT=Age")
    $EndPos = StringInStr($srcHTMLNewest, ".html?PAGE=Download_Landing") - 1
    $test = _StringLeftToStr($srcHTMLNewest, "<a href=", $EndPos)
    If @Error = 0 Then MsgBox(64, "test", $test)

Thanks.
Matthew

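For comparison, the same "search backwards from the end marker" idea can be sketched with a single right-to-left StringInStr call (a negative occurrence searches from the right) instead of counting occurrences from the left. This is a minimal sketch, not a drop-in replacement: _StringBetweenBackward and the sample text are hypothetical names and data, and it only handles the first end marker in the text.

```autoit
; Sketch: return the text between the LAST $sStart that lies before
; the first $sEnd, and that $sEnd.
; _StringBetweenBackward is a hypothetical helper name.
Func _StringBetweenBackward($sText, $sStart, $sEnd)
    Local $iEnd = StringInStr($sText, $sEnd) ; position of the first end marker
    If $iEnd = 0 Then
        SetError(1) ; end marker not found
        Return ""
    EndIf
    Local $sHead = StringLeft($sText, $iEnd - 1) ; everything before the end marker
    Local $iStart = StringInStr($sHead, $sStart, 0, -1) ; -1 = first hit from the right
    If $iStart = 0 Then
        SetError(2) ; start marker not found before the end marker
        Return ""
    EndIf
    ; Skip past the start marker and return the remainder of $sHead.
    Return StringMid($sHead, $iStart + StringLen($sStart))
EndFunc

; Hypothetical sample: a home link followed by one download link.
Local $sSample = '<a href="index.html">Home</a> <a href="BLAH.html?PAGE=Download">'
Local $sName = _StringBetweenBackward($sSample, '<a href="', '.html?PAGE=Download')
If @error = 0 Then ConsoleWrite($sName & @CRLF)
```

To collect every link this way, the text after each end marker would have to be fed back in repeatedly, which is roughly what a regular expression with a global flag does in one call.
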
empty75 (Author), Posted June 12, 2007

MisterBates wrote:
    $array = StringRegExp($sText, '.*(?i)href="(.*?)\.html\?PAGE', 1, $iOffset)
    There's no error checking in there, but for me, this found "BLAHBLAHBLAH".

I am still trying to do this. The function I wrote (_StringLeftToStr) has serious performance issues the further down the file you search. After about the 15th occurrence it crawls to a near halt, presumably trying to find the edge of the left-hand search string.

I have taken a look at the above regexp. Am I wrong in thinking that this is supposed to return an array full of all the matches?

    $srcHTMLNewest = _INetGetSource("http://www.reflexive.com/index.php?START=1&END=10&PAGE=GameList&CAT=Action&SORT=Age")
    $iOffset = 1
    While 1
        $array = StringRegExp($srcHTMLNewest, '.*(?i)href="(.*?)\.html\?PAGE=Download', 1, $iOffset)
        If @error = 0 Then
            $iOffset = @extended
        Else
            ExitLoop
        EndIf
        $arrayofgames = $array
    WEnd
    _ArrayDisplay($arrayofgames, "List of Games")

This only seems to return the last item found. The code below does seem to work, but surely this is a workaround?

    Local $arrayofgames[1] = [0]
    $srcHTMLNewest = _INetGetSource("http://www.reflexive.com/index.php?START=1&END=10&PAGE=GameList&CAT=Action&SORT=Age")
    $iOffset = 1
    $cell = 1
    While 1
        $array = StringRegExp($srcHTMLNewest, '.*(?i)href="(.*?)\.html\?PAGE=Download', 1, $iOffset)
        If @error = 0 Then
            $iOffset = @extended
        Else
            ExitLoop
        EndIf
        _ArrayAdd($arrayofgames, $array[0])
    WEnd
    _ArrayDisplay($arrayofgames, "List of Games in new")

Is this a bug in StringRegExp? I am using AutoIt v3.2.4.9.

Thanks,
Matthew

SmOke_N (Moderator), Posted June 12, 2007

Try switching $iOffset = 1 to $iOffset = 3 ... 1 will only return 1 occurrence, 3 will return all that are found.

Common sense plays a role in the basics of understanding AutoIt... If you're lacking in that, do us all a favor, and step away from the computer.

MisterBates, Posted June 12, 2007 (edited)

SmOke_N wrote:
    Try switching $iOffset = 1 to $iOffset = 3 ... 1 will only return 1 occurrence, 3 will return all that are found.

SmOke_N's on the right track, but it's not the $iOffset that needs switching, it's the "1" in the StringRegExp call just before the $iOffset. Try:

    $array = StringRegExp($sText, '.*(?i)href="(.*?)\.html\?PAGE', 3, $iOffset)

In fact, you can simplify the code to be:

    #include <array.au3>
    #include <inet.au3>
    $srcHTMLNewest = _INetGetSource("http://www.reflexive.com/index.php?START=1&END=10&PAGE=GameList&CAT=Action&SORT=Age")
    $arrayofgames = StringRegExp($srcHTMLNewest, '.*(?i)href="(.*?)\.html\?PAGE=Download', 3)
    _ArrayDisplay($arrayofgames, "List of Games in new")

Edited June 12, 2007 by MisterBates

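The difference between the two flag values discussed above can be seen on a small sample. This is a minimal sketch, assuming a made-up three-line text with one link per line (GameOne and GameTwo are invented names); the pattern is the one posted in the thread.

```autoit
; Hypothetical sample text: one plain link and two download links, one per line.
Local $sText = '<a href="BLAH.html"></a>' & @CRLF
$sText &= '<a href="GameOne.html?PAGE=Download"></a>' & @CRLF
$sText &= '<a href="GameTwo.html?PAGE=Download"></a>'

Local $sPattern = '.*(?i)href="(.*?)\.html\?PAGE=Download'

; Flag 1: array holding the captured group(s) of the first match only.
Local $aFirst = StringRegExp($sText, $sPattern, 1)
; Flag 3: array holding the captured group(s) of every match in the text.
Local $aAll = StringRegExp($sText, $sPattern, 3)

If IsArray($aFirst) Then ConsoleWrite("Flag 1 entries: " & UBound($aFirst) & @CRLF)
If IsArray($aAll) Then
    For $i = 0 To UBound($aAll) - 1
        ConsoleWrite("Flag 3 entry " & $i & ": " & $aAll[$i] & @CRLF)
    Next
EndIf
```

With flag 3 there is no need for a loop driven by @extended, because one call already walks the whole text.
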
amel27, Posted June 13, 2007 (edited)

MisterBates wrote:
    SmOke_N's on the right track, but it's not the $iOffset that needs switching, it's the "1" in the StringRegExp call just before the $iOffset. Try:
    $array = StringRegExp($sText, '.*(?i)href="(.*?)\.html\?PAGE', 3, $iOffset)

The above pattern is wrong. Try this for the whole text:

    $array = StringRegExp($sText, '(?i)\bhref\s*=\s*"([^"\?]+)(?:\.html|\?)[^"]*"', 3)

Edited June 13, 2007 by amel27

empty75 (Author), Posted June 14, 2007

MisterBates wrote:
    #include <array.au3>
    #include <inet.au3>
    $srcHTMLNewest = _INetGetSource("http://www.reflexive.com/index.php?START=1&END=10&PAGE=GameList&CAT=Action&SORT=Age")
    $arrayofgames = StringRegExp($srcHTMLNewest, '.*(?i)href="(.*?)\.html\?PAGE=Download', 3)
    _ArrayDisplay($arrayofgames, "List of Games in new")

Thanks for all your help. With the above code I now have what I wanted, and the only noticeable slowdown is while _INetGetSource is running. It now generates a list of about 300 games in about 30 seconds, instead of a list of 20 games in 60 seconds with the method I was using.

Thanks.
Matthew