MuffettsMan Posted October 20, 2009

I may be attacking this all wrong, but I am trying to narrow down the fields I screen scrape by grabbing the object ID of a DIV. As a simple example, take finding the results of a search:

http://www.gamespot.com/search.html?qs=Pac-Pix+%28DS%29

DebugBar reports that the search results are in a <DIV id=results>. When I try to loop through all the links found in that DIV, sometimes it works and sometimes it gives me random links from elsewhere on the page:

    $oDiv = _IEGetObjById($oIE, "results")
    $oLinks = _IELinkGetCollection($oDiv)
    For $oLink In $oLinks
        $sLinkText = _IEPropertyGet($oLink, "innerText")
        If StringInStr($sLinkText, $sMyString) Then
            ConsoleWrite("Found It.... ")
            _IEAction($oLink, "click")
            _IELoadWait($oIE)
            ExitLoop
        Else
            ConsoleWrite($sLinkText & "NOT FOUND " & $sMyString & @CR)
        EndIf
    Next

Is it possible to attack this a different way, or at least more consistently? Sometimes it finds the link (even though it loops through nearly all the links on the page), yet running it again without changing a thing, it errors out, unable to find any search results. Frustrating.

My full chicken scratch is attached if anyone wants to try to replicate:

AllGames.txt (the text file of names to search for; the script loops through it)
loop3.au3 (the full script, which partially works when the planets align)

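
One possible reason the loop in the question wanders outside the DIV, not confirmed anywhere in this thread: `_IELinkGetCollection` may resolve its argument to the owning document and return that document's whole links collection, rather than only the links under the element passed in. A minimal sketch that scopes the search with `_IETagNameGetCollection` instead, assuming `$oIE` is an attached IE object; the helper name `_FindLinkInDiv` and its error codes are hypothetical:

```autoit
#include <IE.au3>

; Hypothetical helper: returns the first <a> element inside the element
; with id $sDivId whose text contains $sNeedle, or 0 with @error set.
Func _FindLinkInDiv($oIE, $sDivId, $sNeedle)
    Local $oDiv = _IEGetObjById($oIE, $sDivId)
    If Not IsObj($oDiv) Then Return SetError(1, 0, 0)

    ; Scoped to $oDiv, so links elsewhere on the page are not visited.
    Local $oAnchors = _IETagNameGetCollection($oDiv, "a")
    If Not IsObj($oAnchors) Then Return SetError(2, 0, 0)

    For $oAnchor In $oAnchors
        If StringInStr($oAnchor.innerText, $sNeedle) Then Return $oAnchor
    Next
    Return SetError(3, 0, 0) ; no matching link inside the div
EndFunc
```

A caller would check `@error` after the call and, on success, pass the returned object to `_IEAction($oLink, "click")` as in the question.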
ame1011 Posted October 20, 2009 (Edited October 20, 2009 by ame1011)

The problem is that the site is coded using AJAX. The results div looks like this before the search results have loaded:

    <div id="search_results" class="module search_results contain_all">
      <div class="head">
        <div class="wrap">
          <h2 class="module_title">Searching…</h2>
        </div>
      </div>
      ...*snip*...
    </div>

You need to wait until

    <h2 class="module_title">Searching…</h2>

becomes

    <h2 class="module_title">Search results for 'Pac-Pix (DS)'</h2>

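
The wait described above can be sketched as a polling loop on the heading text. This is a minimal sketch, assuming the first `<h2>` inside `search_results` is the `module_title` heading shown in the markup; the function name `_WaitForSearchResults` and the 10-second default are hypothetical:

```autoit
#include <IE.au3>

; Hypothetical helper: polls the first <h2> inside the "search_results"
; div until it no longer starts with "Searching", or until the timeout
; expires. Returns True when results look ready, False on timeout.
Func _WaitForSearchResults($oIE, $iTimeoutMs = 10000)
    Local $hTimer = TimerInit()
    Local $oDiv, $oHeading, $sText
    While TimerDiff($hTimer) < $iTimeoutMs
        $oDiv = _IEGetObjById($oIE, "search_results")
        If IsObj($oDiv) Then
            ; Index 0 asks for the first matching element only.
            $oHeading = _IETagNameGetCollection($oDiv, "h2", 0)
            If IsObj($oHeading) Then
                $sText = StringStripWS(_IEPropertyGet($oHeading, "innertext"), 3)
                ; Compare the leading word only, so "..." and "…" both match.
                If $sText <> "" And StringLeft($sText, 9) <> "Searching" Then Return True
            EndIf
        EndIf
        Sleep(250)
    WEnd
    Return False
EndFunc
```

Reading the rendered text instead of `innerHTML` avoids depending on how IE serializes the tag (uppercase names, unquoted attributes).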
PsaltyDS Posted October 20, 2009 (Edited October 20, 2009 by PsaltyDS)

Your page might be in a variable state of completeness while it is being loaded. _IELoadWait() can't always tell whether the page is really done. You might introduce some delays to ensure the page is loaded, or keep fetching $oDiv in a loop until you get a valid object. Try:

    While 1
        $oDiv = _IEGetObjById($oIE, "results")
        If IsObj($oDiv) Then ExitLoop
        Sleep(100) ; avoid spinning the CPU between attempts
    WEnd

You can add a timeout to test for failure.

Edit: Didn't see the reply from ame1011 before this post.

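
The timeout mentioned above could look something like this. It is a minimal sketch rather than anything posted in the thread; the function name `_WaitForObjById`, its parameters, and the default values are hypothetical:

```autoit
#include <IE.au3>

; Hypothetical helper: retries _IEGetObjById until it returns an object
; or $iTimeoutMs has elapsed. Returns the object on success, or 0 with
; @error = 1 on timeout.
Func _WaitForObjById($oIE, $sId, $iTimeoutMs = 10000, $iPollMs = 250)
    Local $hTimer = TimerInit()
    Local $oObj
    Do
        $oObj = _IEGetObjById($oIE, $sId)
        If IsObj($oObj) Then Return $oObj
        Sleep($iPollMs)
    Until TimerDiff($hTimer) >= $iTimeoutMs
    Return SetError(1, 0, 0)
EndFunc
```

A caller would write `$oDiv = _WaitForObjById($oIE, "results")` and test `@error` before using `$oDiv`. Note that an object check alone only proves the element exists; per ame1011's reply, on this page the content inside it may still be the "Searching…" placeholder.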
ame1011 Posted October 21, 2009

I was bored, so I whipped something up real quick. Don't mind the code if it's a bit messy; I tried to comment it a bit so you know what is going on.

The code below cleans up your original text file and creates a file named AllGames_new.txt. It then reads the new text file line by line and visits the game page for each game. After that, it's up to you to add your code in _GetGameInfo() to parse that game page and extract all the information.

Feel free to ask if you don't understand something.

    #include <Constants.au3>
    #include <IE.au3>
    #include <GuiConstants.au3>
    #include <Array.au3>
    #include <Date.au3>
    #include <File.au3>

    Global $sGamesFile = "AllGames.txt"
    Global $sCleanedUpGamesFile = "AllGames_new.txt"
    Global $sGamespotURL = "http://www.gamespot.com/search.html?qs="
    Global $oIE

    ;open first file for reading
    Global $hFile = FileOpen($sGamesFile, 0)
    If $hFile = -1 Then Exit

    ;open second file for writing
    Global $hFile_new = FileOpen($sCleanedUpGamesFile, 2)
    If $hFile_new = -1 Then Exit

    While 1
        ;iterate through file 1
        $line = FileReadLine($hFile)
        If @error = -1 Then ExitLoop
        ;clean up line and if valid write to new file
        $line = _CleanUp($line)
        If $line Then FileWriteLine($hFile_new, $line)
    WEnd

    ;close files
    FileClose($hFile)
    FileClose($hFile_new)

    $oIE = _IECreate()

    ;open new file for reading
    $hFile_new = FileOpen($sCleanedUpGamesFile, 0)
    While 1
        ;read from the new file (the first handle is already closed)
        $line = FileReadLine($hFile_new)
        If @error = -1 Then ExitLoop
        ;drop the first 7 characters (the [####] prefix plus one separator)
        If _VisitGamePage(StringTrimLeft($line, 7)) Then _GetGameInfo()
    WEnd
    _IEQuit($oIE)

    ;close file
    FileClose($hFile_new)

    Func _CleanUp($line, $method = "full")
        ;full method is to cleanup the file lines,
        ;short method is used to cleanup search results from gamespot
        Local $keepString = ""

        ;release group strings to remove
        Local $sReleaseGroups = "EUR|NDS|XPA|iNSTEON|SQUiRE|Multi1|Multi2|Multi3|Multi4|Multi5|Multi6|MULTi7|MULTi8|MULTi9|MULTi10|MULTi11|MULTi12|VORTEX|Y|WiNE|" _
                 & "FireX|SirVG|SERIAL|NFO|sUppLeX|iND|JAP|CNBS|Micronauts|PUPPA|EXiMiUS|FRA|DCS|USA|DUT|Goomba|REPACK|DiPLODOCUS|Penguinz|OneUp|TRM|" _
                 & "READNFO|JunkRat|NFOFiX|DUTCH|DITIT|DSRP|BAHAMUT|GUARDiAN|VENOM|DS"
        Local $aReleaseGroups = StringSplit($sReleaseGroups, "|")

        ;since all text between () will be removed, the following is an override to keep certain bracket text
        Local $sBracketTextToKeep = "THISISANEXAMPLE|EXAMPLE2"
        Local $aBracketTextToKeep = StringSplit($sBracketTextToKeep, "|")

        If $method = "full" Then
            ;only lines whose first 4 characters are a number are valid
            $sDigits = StringLeft($line, 4)
            If Number($sDigits) = 0 Then Return False
            ;convert first 4 digits to [####] format
            $line = "[" & $sDigits & "]" & StringTrimLeft($line, 4)
        ElseIf $method <> "short" Then
            Return False
        EndIf

        ;the steps below are shared by the full and short methods

        ;find bracket text to keep
        For $b = 1 To $aBracketTextToKeep[0]
            If StringInStr($line, "(" & $aBracketTextToKeep[$b] & ")") Then $keepString &= " (" & $aBracketTextToKeep[$b] & ")"
        Next

        ;remove trailing .rar, .zip, .7z
        If StringRight($line, 4) = ".rar" Then $line = StringTrimRight($line, 4)
        If StringRight($line, 4) = ".zip" Then $line = StringTrimRight($line, 4)
        If StringRight($line, 3) = ".7z" Then $line = StringTrimRight($line, 3)

        ;replace all - and _ with spaces
        $line = StringReplace($line, "_", " ")
        $line = StringReplace($line, "-", " ")

        ;remove anything in brackets ()
        $open = StringInStr($line, "(")
        $close = StringInStr($line, ")")
        While $open And $close
            $line = StringReplace($line, StringMid($line, $open, $close + 1 - $open), "")
            $open = StringInStr($line, "(")
            $close = StringInStr($line, ")")
        WEnd

        ;remove release group codes
        For $a = 1 To $aReleaseGroups[0]
            If StringInStr($line, " " & $aReleaseGroups[$a]) Then $line = StringReplace($line, $aReleaseGroups[$a], "")
        Next

        ;strip any leading, trailing or multiple spaces
        $line = StringStripWS($line, 7)

        ;add bracket text to keep
        $line &= $keepString
        Return $line
    EndFunc

    Func _VisitGamePage($gameName)
        Local $timer
        Local $attempts = 3
        Local $current_attempt = 0
        Local $sLinkHREF, $sLinkTEXT

        ;go to url
        _IENavigate($oIE, $sGamespotURL & $gameName & " (DS)")
        _IELoadWait($oIE)

        ;get search results div
        $oSearchResultsDiv = _IEGetObjById($oIE, "search_results")
        $sSearchResultsHTML = $oSearchResultsDiv.innerHTML

        ;wait until page really loaded
        $timer = TimerInit()
        While StringInStr($sSearchResultsHTML, '<H2 class=module_title>Searching...</H2>')
            Sleep(500)
            $sSearchResultsHTML = $oSearchResultsDiv.innerHTML
            ;if page hasn't loaded after 10 seconds, revisit page
            If TimerDiff($timer) > 10 * 1000 Then
                $current_attempt += 1
                ;quit if max attempts reached
                If $current_attempt > $attempts Then
                    Return False
                Else
                    _IENavigate($oIE, $sGamespotURL & $gameName & " (DS)")
                    _IELoadWait($oIE)
                    $oSearchResultsDiv = _IEGetObjById($oIE, "search_results")
                    $sSearchResultsHTML = $oSearchResultsDiv.innerHTML
                    ;restart the 10 second window for the new attempt
                    $timer = TimerInit()
                EndIf
            EndIf
        WEnd

        ;page is loaded
        ;get results div
        $oResultsDiv = _IEGetObjById($oIE, "results")

        ;the first result is stored in an LI with class containing the words "result game_result first"
        $oLIs = _IETagNameGetCollection($oResultsDiv, "LI")
        For $oLI In $oLIs
            If StringInStr($oLI.className, "result game_result first") Then
                ;div containing the link
                $oResultTitleDiv = $oLI.firstChild.firstChild
                ;get link in this div
                $oAs = _IETagNameGetCollection($oResultTitleDiv, "A")
                For $oA In $oAs
                    $sLinkHREF = $oA.href
                    $sLinkTEXT = _CleanUp($oA.innerText, "short")
                    If $gameName = $sLinkTEXT Then
                        ;exact match: visit page
                        _IENavigate($oIE, $sLinkHREF)
                        _IELoadWait($oIE)
                        Return True
                    ElseIf StringInStr($sLinkTEXT, $gameName) Then
                        ;gameName in its entirety found somewhere in the result: visit page
                        _IENavigate($oIE, $sLinkHREF)
                        _IELoadWait($oIE)
                        Return True
                    Else
                        ;do something with a non match
                        Return False
                    EndIf
                Next
            EndIf
        Next

        ;no first result found
        Return False
    EndFunc

    Func _GetGameInfo()
        ConsoleWrite("REACHED" & @CRLF)
    EndFunc

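
One thing to watch in the release-group step of `_CleanUp`: `StringInStr` and `StringReplace` are case-insensitive by default and match inside words, so, as far as I can tell, a short tag such as `Y` or `DS` can strip letters out of the title itself (a line like `[0012] Yoshi Touch and Go` contains " Y", so every "y" would be removed). A minimal sketch of a whole-word alternative using `StringRegExpReplace`; the helper name `_RemoveWholeWords` is hypothetical and this is a suggested variation, not the code from the post:

```autoit
; Hypothetical helper: removes each tag in $sTags (pipe-separated) only
; where it appears as a whole word, ignoring case.
Func _RemoveWholeWords($sLine, $sTags)
    Local $aTags = StringSplit($sTags, "|")
    For $i = 1 To $aTags[0]
        If $aTags[$i] = "" Then ContinueLoop
        ; \b anchors the match at word boundaries; \Q...\E treats the
        ; tag as literal text in case it contains regex characters.
        $sLine = StringRegExpReplace($sLine, "(?i)\b\Q" & $aTags[$i] & "\E\b", "")
    Next
    ; Collapse leading, trailing and repeated spaces (flags 1 + 2 + 4).
    Return StringStripWS($sLine, 7)
EndFunc

; Example call: "EUR" and "NDS" stand alone and are removed, while the
; "Y" tag should leave "Yoshi" untouched because it is not a whole word.
ConsoleWrite(_RemoveWholeWords("Yoshi Touch and Go EUR NDS", "EUR|NDS|Y") & @CRLF)
```

The trade-off is that a tag glued to other letters or digits (for example `EURv2`) would no longer be removed, so the tag list may need adjusting for the actual file names.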
MuffettsMan Posted October 21, 2009

Wow, thanks so much for the helpful advice, guys. I would never have figured out the AJAX / half-loading issue. And ame, thanks for the example code, it's rock solid. You must have laughed seeing my chicken scratch.

Thanks again,
Derrick