MuffettsMan

Is _IEGetObjById unreliable with DIV IDs?


I may be going about this all wrong, but I am trying to narrow down the fields I screen-scrape by grabbing the object ID of a DIV. As a simple example, take the results of a search: http://www.gamespot.com/search.html?qs=Pac-Pix+%28DS%29. DebugBar reports that the search results are in a <DIV id=results>. When I loop through all the links found in that DIV, sometimes it works and sometimes it gives me random links from elsewhere on the page.

$oDiv = _IEGetObjById($oIE, "results")
$oLinks = _IELinkGetCollection($oDiv)
For $oLink In $oLinks
    $sLinkText = _IEPropertyGet($oLink, "innerText")
    If StringInStr($sLinkText, $sMyString) Then
        ConsoleWrite("Found It.... ")
        _IEAction($oLink, "click")
        _IELoadWait($oIE)
        ExitLoop
    Else
        ConsoleWrite($sLinkText & "NOT FOUND " & $sMyString & @CR)
    EndIf
Next

Is it possible to attack this a different way, or at least more consistently? :) Sometimes it seems to find the result (even though it loops through nearly all the links on the page), yet when I run it again without changing a thing, it errors out, unable to find any search results... FrUsTrAtInG

My full chicken scratch is attached if anyone wants to try to replicate:

AllGames.txt (the text file of names to search for; the script loops through it)

loop3.au3 (the full script, which partially works when the planets align) :)


Don't let that status fool you, I am no advanced memeber!

#2 ·  Posted (edited)

The problem is that the site is built with AJAX. The results div looks like this before the search results have loaded:

<div id="search_results" class="module search_results contain_all">
 <div class="head">
  <div class="wrap">
   <h2 class="module_title">Searching&hellip;</h2>
  </div>
 </div>
 ...*snip*...
</div>

you need to wait until

<h2 class="module_title">Searching&hellip;</h2>
becomes

<h2 class="module_title">Search results for 'Pac-Pix (DS)'</h2>
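A minimal sketch of that wait, assuming the heading text ends up containing "Search results for" as shown above. The helper name _WaitForSearchDone and the 10-second default are hypothetical, not part of IE.au3. It checks the rendered text of the div rather than its HTML, so it does not depend on how IE serializes tags or the ellipsis entity:

```autoit
#include <IE.au3>

; Hypothetical helper (not part of IE.au3): poll the "search_results" div until
; its text shows the final heading. Returns True once loaded, False on timeout.
Func _WaitForSearchDone($oIE, $iTimeoutMs = 10000)
    Local $hTimer = TimerInit()
    Local $oDiv
    While TimerDiff($hTimer) < $iTimeoutMs
        $oDiv = _IEGetObjById($oIE, "search_results")
        If IsObj($oDiv) Then
            ; innerText is the rendered text, so no tag or entity matching is needed
            If StringInStr(_IEPropertyGet($oDiv, "innertext"), "Search results for") Then Return True
        EndIf
        Sleep(250) ; pause between polls
    WEnd
    Return False
EndFunc
```

Call it right after _IENavigate() and only read the links once it returns True.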
Edited by ame1011

[font="Impact"] I always thought dogs laid eggs, and I learned something today. [/font]

#3 ·  Posted (edited)

Your page might be only partially loaded when you grab the div. _IELoadWait() can't always tell whether the page is really done. You might introduce some delays to ensure the page has loaded, or fetch $oDiv in a loop until you get a valid object. Try:

While 1
     $oDiv = _IEGetObjById ($oIE, "results")
     If IsObj($oDiv) Then ExitLoop
     Sleep(100) ; brief pause so the loop doesn't peg the CPU while waiting
WEnd

You can add a timeout to test for failure.
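One way the timeout could look, as a rough sketch. The helper name _WaitForObjById, the 10-second default, and the 250 ms poll interval are all made up for illustration; only _IECreate, _IEGetObjById, IsObj, TimerInit/TimerDiff, and SetError are standard:

```autoit
#include <IE.au3>

; Hypothetical helper (not part of IE.au3): poll for an element by id until it
; exists or $iTimeoutMs elapses. Returns the object, or 0 with @error = 1 on timeout.
Func _WaitForObjById($oIE, $sId, $iTimeoutMs = 10000)
    Local $hTimer = TimerInit()
    Local $oObj
    While TimerDiff($hTimer) < $iTimeoutMs
        $oObj = _IEGetObjById($oIE, $sId)
        If IsObj($oObj) Then Return $oObj
        Sleep(250) ; yield the CPU between polls
    WEnd
    Return SetError(1, 0, 0)
EndFunc

Global $oIE = _IECreate("http://www.gamespot.com/search.html?qs=Pac-Pix+%28DS%29")
Global $oDiv = _WaitForObjById($oIE, "results")
If @error Then ConsoleWrite("Timed out waiting for the results div" & @CRLF)
```

Note that a valid object only tells you the div exists, not that the AJAX content inside it has finished loading.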

:)

Edit: Didn't see reply from ame1011 before this post.

Edited by PsaltyDS

Valuater's AutoIt 1-2-3, Class... Is now in Session! For those who want somebody to write the script for them: RentACoder. "Any technology distinguishable from magic is insufficiently advanced." -- Geek's corollary to Clarke's law

I was bored, so I whipped something up real quick. Don't mind the code if it's a bit messy; I tried to comment it a bit so you know what is going on.

The code below will cleanup your original text file and create a file named AllGames_new.txt

It will then read the new text file line by line and visit the game page for each game.

After that, it's up to you to add your own code to _GetGameInfo() to parse that game page and extract all the information.

Feel free to ask if you don't understand something.

#include <Constants.au3>
#include <IE.au3>
#include <GuiConstants.au3>
#include <Array.au3>
#Include <date.au3>
#include <file.au3>

Global $sGamesFile = "AllGames.txt"
Global $sCleanedUpGamesFile = "AllGames_new.txt"
Global $sGamespotURL = "http://www.gamespot.com/search.html?qs="
Global $oIE

;open first file for reading
Global $hFile = FileOpen($sGamesFile, 0)
if $hFile = -1 Then Exit
;open second file for writing
Global $hFile_new = FileOpen($sCleanedUpGamesFile,2)
if $hFile_new = -1 Then Exit

While 1 ;iterate through file 1
    $line = FileReadLine($hFile)
    If @error = -1 Then ExitLoop
    ;clean up line and if valid write to new file
    $line = _CleanUp($line)
    if $line Then FileWriteLine($hFile_new, $line)
Wend

;close files
FileClose($hFile)
FileClose($hFile_new)

$oIE = _IECreate()
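;open new file for reading
$hFile_new = FileOpen($sCleanedUpGamesFile, 0)
while 1
    ;read from the newly opened handle ($hFile was closed above)
    $line = FileReadLine($hFile_new)
    If @error = -1 Then ExitLoop
    ;drop the first 7 characters (the [####] tag from _CleanUp plus the following space)
    if _VisitGamePage(StringTrimLeft($line, 7)) Then _GetGameInfo()
WEnd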
_IEQuit($oIE)

;Close File
FileClose($hFile_new)

Func _CleanUp($line, $method = "full")
;full method is to cleanup the file lines, short method used to cleanup search results from gamespot

    Local $keepString = ""
    ;release group strings to remove
    Local $sReleaseGroups = "EUR|NDS|XPA|iNSTEON|SQUiRE|Multi1|Multi2|Multi3|Multi4|Multi5|Multi6|MULTi7|MULTi8|MULTi9|MULTi10|MULTi11|MULTi12|VORTEX|Y|WiNE|" _
                          & "FireX|SirVG|SERIAL|NFO|sUppLeX|iND|JAP|CNBS|Micronauts|PUPPA|EXiMiUS|FRA|DCS|USA|DUT|Goomba|REPACK|DiPLODOCUS|Penguinz|OneUp|TRM|" _
                          & "READNFO|JunkRat|NFOFiX|DUTCH|DITIT|DSRP|BAHAMUT|GUARDiAN|VENOM|DS"
    Local $aReleaseGroups = StringSplit($sReleaseGroups, "|")
    ;since all text between () will be removed, the following is an override to keep certain bracket text
    Local $sBracketTextToKeep = "THISISANEXAMPLE|EXAMPLE2"
    Local $aBracketTextToKeep = StringSplit($sBracketTextToKeep, "|")

    if $method = "short" Then
        ;reset keep string
        $keepString = ""
        ;find bracket text to keep
        for $b = 1 to $aBracketTextToKeep[0]
            if StringInStr($line, "(" & $aBracketTextToKeep[$b] & ")") Then $keepString &= " (" & $aBracketTextToKeep[$b] & ")"
        Next
        ;remove trailing .rar, .zip, .7z
        if StringRight ($line, 4) = ".rar" Then $line = StringTrimRight ($line, 4)
        if StringRight ($line, 4) = ".zip" Then $line = StringTrimRight ($line, 4)
        if StringRight ($line, 3) = ".7z" Then $line = StringTrimRight ($line, 3)
        ;replace all - and _ with spaces
        $line = StringReplace($line, "_", " ")
        $line = StringReplace($line, "-", " ")
        ;remove anything in brackets ()
        $open = StringInStr($line, "(")
        $close = StringInStr($line, ")")
        While $open And $close
            if $open And $close Then $line = StringReplace($line, StringMid($line, $open, $close+1-$open), "")
            $open = StringInStr($line, "(")
            $close = StringInStr($line, ")")
        WEnd
        ;remove release group codes
        for $a = 1 to $aReleaseGroups[0]
            if StringInStr($line, " " & $aReleaseGroups[$a]) Then $line = StringReplace($line, $aReleaseGroups[$a], "")
        Next
        ;strip any leading, trailing or multiple spaces
        $line = StringStripWS($line,7)
        ;add bracket text to keep
        $line &= $keepString
        return $line
    Elseif $method = "full" Then
        ;if first 4 digits are numbers
        $sDigits = StringLeft($line,4)
        if Number($sDigits) <> 0 Then
            ;reset keep string
            $keepString = ""
            ;convert first 4 digits to [####] format
            $line = StringTrimLeft($line, 4)
            $line = "["&$sDigits&"]" & $line
            ;find bracket text to keep
            for $b = 1 to $aBracketTextToKeep[0]
                if StringInStr($line, "(" & $aBracketTextToKeep[$b] & ")") Then $keepString &= " (" & $aBracketTextToKeep[$b] & ")"
            Next
            ;remove trailing .rar, .zip, .7z
            if StringRight ($line, 4) = ".rar" Then $line = StringTrimRight ($line, 4)
            if StringRight ($line, 4) = ".zip" Then $line = StringTrimRight ($line, 4)
            if StringRight ($line, 3) = ".7z" Then $line = StringTrimRight ($line, 3)
            ;replace all - and _ with spaces
            $line = StringReplace($line, "_", " ")
            $line = StringReplace($line, "-", " ")
            ;remove anything in brackets ()
            $open = StringInStr($line, "(")
            $close = StringInStr($line, ")")
            While $open And $close
                if $open And $close Then $line = StringReplace($line, StringMid($line, $open, $close+1-$open), "")
                $open = StringInStr($line, "(")
                $close = StringInStr($line, ")")
            WEnd
            ;remove release group codes
            for $a = 1 to $aReleaseGroups[0]
                if StringInStr($line, " " & $aReleaseGroups[$a]) Then $line = StringReplace($line, $aReleaseGroups[$a], "")
            Next
            ;strip any leading, trailing or multiple spaces
            $line = StringStripWS($line,7)
            ;add bracket text to keep
            $line &= $keepString
            return $line
        Else
            return false
        EndIf
    EndIf
EndFunc

Func _VisitGamePage($gameName)

    Local $timer
    Local $attempts = 3
    Local $current_attempt = 0
    Local $sLinkHREF, $sLinkTEXT

    ;go to url
    _IENavigate($oIE, $sGamespotURL & $gameName & " (DS)")
    _IELoadWait($oIE)
    ;get search results div
    $oSearchResultsDiv = _IEGetObjById($oIE, "search_results")
    $sSearchResultsHTML = $oSearchResultsDiv.innerHTML
    ;wait until page really loaded
    $timer = TimerInit()
    While StringInStr($sSearchResultsHTML, '<H2 class=module_title>Searching...</H2>')
        Sleep(500)
        $sSearchResultsHTML = $oSearchResultsDiv.innerHTML
        ;if page hasn't loaded after 10 seconds, revisit page
        if TimerDiff($timer) > 10 * 1000 Then
            $current_attempt += 1
            ;quit if max attempts reached
            if $current_attempt > $attempts Then
                return false
            Else
                _IENavigate($oIE, $sGamespotURL & $gameName & " (DS)")
                _IELoadWait($oIE)
                $oSearchResultsDiv = _IEGetObjById($oIE, "search_results")
                $sSearchResultsHTML = $oSearchResultsDiv.innerHTML
                ;restart the timeout for the new attempt
                $timer = TimerInit()
            EndIf
        EndIf
    WEnd

    ;page is loaded

    ;get results div
    $oResultsDiv = _IEGetObjById($oIE, "results")
    ;the first result is stored in an LI with class containing the words "result game_result  first"
    $oLIs = _IETagNameGetCollection($oResultsDiv, "LI")
    for $oLI in $oLIs
        if StringInStr($oLI.className, "result game_result  first") Then
            ;div containing the link
            $oResultTitleDiv = $oLi.firstChild.firstChild
            ;get link in this div
            $oAs = _IETagNameGetCollection($oResultTitleDiv, "A")
            for $oA in $oAs
                $sLinkHREF = $oA.href
                $sLinkTEXT = _CleanUp($oA.innerText, "short")
                if $gameName = $sLinkTEXT Then ;if exact match
                    ;visit page
                    _IENavigate($oIE, $sLinkHREF)
                    _IELoadWait($oIE)
                    return true
                elseif StringInStr($sLinkTEXT, $gameName) Then ;if gameName in its entirety is found somewhere in the result
                    ;visit page
                    _IENavigate($oIE, $sLinkHREF)
                    _IELoadWait($oIE)
                    return true
                Else ;do something with a non match
                    return false
                EndIf
            Next
        EndIf
    Next
EndFunc

Func _GetGameInfo()
    ConsoleWrite("REACHED" & @CRLF)
EndFunc

Wow, thanks so much for the helpful advice, guys. I would never have figured out the AJAX / half-loading issue. ame, thanks for the example code, it's rock solid. You must have laughed seeing my chicken scratch :)

thanks again,

Derrick

