Imbuter2000

Can I use the IE UDF to capture Google result titles in an array?


Imbuter2000

I want to extract the titles of the Google search results into an array.

I could solve it using a regular expression:

For example, I open http://www.google.it/#q=foobar and DebugBar shows that the markup for the title "foobar2000 - Wikipedia" is:

"<a class="l" onmousedown="return rwt(this,'','','','6','AFQjCNG5l1JlHEfLHSE1yqxjOCBlWP5Z4A','','0CFoQFjAF',null,event)" href="http://it.wikipedia.org/wiki/Foobar2000"><em>foobar2000</em> - Wikipedia</a>"

so I use code like this:

$bodyhtml = _IEBodyReadHTML($oIE)

$matches = StringRegExp($bodyhtml,'(?s:<a class="l"[^>]* href="[^"]+">((?:<em>|)[^<]*(?:</em>|)[^<]*)</a>)',3) ; I could do it even better...
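Note that the capture group in the pattern above appears to keep any <em> tags, so a captured title would look like "<em>foobar2000</em> - Wikipedia". A minimal sketch of cleaning that up, assuming a hypothetical helper named _StripTags (it is not part of IE.au3) and assuming the titles contain no literal "<" characters:

```autoit
; Hypothetical helper (not part of IE.au3): removes any HTML tags
; left inside a title captured by the regular expression.
Func _StripTags($sHtml)
    ; Replace every "<...>" run with an empty string
    Return StringRegExpReplace($sHtml, "<[^>]+>", "")
EndFunc   ;==>_StripTags

; Sample string taken from the DebugBar snippet above
ConsoleWrite(_StripTags("<em>foobar2000</em> - Wikipedia") & @CRLF)
```

Each element of $matches could be passed through this helper before being compared or displayed.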

How can I solve this in a better way with the IE UDF, preferably without a regular expression?

DaleHohm

Explain more precisely what you are trying to do.

Dale


Free Internet Tools: DebugBar, AutoIt IE Builder, HTTP UDF, MODIV2, IE Developer Toolbar, IEDocMon, Fiddler, HTML Validator, WGet, curl

MSDN docs: InternetExplorer Object, Document Object, Overviews and Tutorials, DHTML Objects, DHTML Events, WinHttpRequest, XmlHttpRequest, Cross-Frame Scripting, Office object model

Automate input type=file (Related)

Alternative to _IECreateEmbedded? better: _IECreatePseudoEmbedded  Better Better?

IE.au3 issues with Vista - Workarounds

SciTe Debug mode - it's magic: #AutoIt3Wrapper_run_debug_mode=Y

"Doesn't work" needs to be ripped out of the troubleshooting lexicon. It means that what you tried did not produce the results you expected. It begs the questions 1) what did you try?, 2) what did you expect? and 3) what happened instead?

Reproducer: a small (the smallest?) piece of stand-alone code that demonstrates your trouble

Imbuter2000


I want to track the position of a given result (title) on the search results page. Google is just an example; I need this for various other sites too.
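To make that goal concrete, here is a minimal sketch of the lookup step once the titles are already in an array. The helper name _TitlePosition and the sample array are assumptions for illustration only; how the array gets filled (regular expression or IE UDF) is a separate question.

```autoit
; Hypothetical helper: returns the 1-based position of the first title
; that contains $sWanted (case-insensitive), or 0 if no title matches.
Func _TitlePosition($aTitles, $sWanted)
    For $i = 0 To UBound($aTitles) - 1
        If StringInStr($aTitles[$i], $sWanted) Then Return $i + 1
    Next
    Return 0
EndFunc   ;==>_TitlePosition

; Illustrative sample data, not real search results
Local $aSample[3] = ["foobar2000", "foobar2000 - Wikipedia", "Foobar - Wikipedia"]
ConsoleWrite(_TitlePosition($aSample, "foobar2000 - Wikipedia") & @CRLF)
```

A substring match is used here so that a partial title is enough; an exact comparison would be stricter if near-duplicate titles are expected.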

DaleHohm

Apparently, you have no clue what "precise" means.

Dale


GMK

#include <Array.au3>
#include <IE.au3>

Local $aTitles[1], $oIE, $oForm, $oQuery, $oLinks, $iCount, $sLinkHTML, $sLinkText

$oIE = _IECreate("http://www.google.it/", 0, 1) ; Load Google
$oForm = _IEFormGetObjByName($oIE, "gbqf") ; Get Form object
$oQuery = _IEFormElementGetObjByName($oForm, "q") ; Get query text box
_IEFormElementSetValue($oQuery, "foobar") ; Populate query text box
_IEFormSubmit($oForm, 1) ; Submit form

$oLinks = _IELinkGetCollection($oIE) ; Get collection of links
If IsObj($oLinks) Then ; Check to make sure collection of links is an object
    $iCount = 0 ; Set count to 0
    For $oLink In $oLinks ; Loop through all links
        $sLinkHTML = _IEPropertyGet($oLink, "innerhtml") ; Get inner html of link
        $sLinkText = _IEPropertyGet($oLink, "innertext") ; Get inner text of link

        If StringInStr($sLinkHTML, "<em>") Then ; Since all the search results contain <EM>, we'll check our links for that
            $iCount += 1 ; Add one to count
            ReDim $aTitles[$iCount] ; Add to array rows
            $aTitles[$iCount - 1] = $sLinkText ; Set array row
        EndIf
    Next
EndIf
_ArrayDisplay($aTitles, "Titles") ; Display array

Edit: This doesn't work so well if <EM> isn't used, but in many cases it does anyway.

Edited by GMK

Imbuter2000


I see that Google uses the <em> tag to emphasize the keyword in the search results, so wherever the keyword is not in the title (you'll find some cases in the first 10 pages for "foobar"), your solution fails to detect the result.

Is the regular expression then the best solution for this example?

GMK

How about this?

#include <Array.au3>
#include <IE.au3>

Local $aTitles[1][2], $oIE, $oForm, $oQuery, $oLinks, $iCount, $sLinkHREF, $sLinkText

$oIE = _IECreate("http://www.google.it/", 0, 1) ; Load Google
$oForm = _IEFormGetObjByName($oIE, "gbqf") ; Get Form object
$oQuery = _IEFormElementGetObjByName($oForm, "q") ; Get query text box
_IEFormElementSetValue($oQuery, "foobar") ; Populate query text box
_IEFormSubmit($oForm, 1) ; Submit form

$oLinks = _IELinkGetCollection($oIE) ; Get collection of links
If IsObj($oLinks) Then ; Check to make sure collection of links is an object
    $iCount = 0 ; Set count to 0
    For $oLink In $oLinks ; Loop through all links
        $sLinkHREF = $oLink.href ; Get target URL of link
        $sLinkText = _IEPropertyGet($oLink, "innertext") ; Get inner text of link

        If Not StringInStr($sLinkHREF, "google") And Not StringInStr($sLinkHREF, "javascript") Then ; All non-Google links
            $iCount += 1 ; Add one to count
            ReDim $aTitles[$iCount][2] ; Add to array rows
            $aTitles[$iCount - 1][0] = $sLinkText ; Set text
            $aTitles[$iCount - 1][1] = $sLinkHREF ; Set href
        EndIf
    Next
EndIf
For $i = 1 To 2
    _ArrayDelete($aTitles, 0) ; Delete first two non-Google links (YouTube and Blogger)
Next
_ArrayDisplay($aTitles, "Titles") ; Display array

Imbuter2000


It wrongly includes ad links too...

So... is a regular expression still the best option for this example?

GMK

Hmmm... if it works, go with RegExp. But if I find another option, I'll post it here.
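One untested possibility, sketched under assumptions: the DebugBar snippet in the first post shows the result anchor carrying class="l", so the links could be filtered on that class through the DOM instead of on <em> or on the href. The helper name _LinkTitlesByClass is hypothetical (it is not part of IE.au3), and the class name is specific to that one page, so other sites would need their own value.

```autoit
#include <IE.au3>

; Hypothetical helper (not part of IE.au3): returns an array with the inner
; text of every <a> element whose class attribute equals $sClass.
; Sets @error = 1 if no link collection is available, 2 if nothing matched.
Func _LinkTitlesByClass($oIE, $sClass)
    Local $aTitles[1], $iCount = 0
    Local $oLinks = _IETagNameGetCollection($oIE, "a") ; All anchor elements
    If Not IsObj($oLinks) Then Return SetError(1, 0, 0)
    For $oLink In $oLinks
        If $oLink.className = $sClass Then ; Case-insensitive comparison
            $iCount += 1
            ReDim $aTitles[$iCount] ; Grow array by one row
            $aTitles[$iCount - 1] = _IEPropertyGet($oLink, "innertext")
        EndIf
    Next
    If $iCount = 0 Then Return SetError(2, 0, 0)
    Return $aTitles
EndFunc   ;==>_LinkTitlesByClass
```

After the form has been submitted as in the scripts above, a call such as `$aTitles = _LinkTitlesByClass($oIE, "l")` would take the place of the link loop. Whether ad links share that class is something to check with DebugBar on the page in question.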
