Posted

I want to extract the titles of the Google search results in an array.

I could solve it using a regular expression:

For example, when I open http://www.google.it/#q=foobar, DebugBar shows that the markup for the title "foobar2000 - Wikipedia" is:

"<a class="l" onmousedown="return rwt(this,'','','','6','AFQjCNG5l1JlHEfLHSE1yqxjOCBlWP5Z4A','','0CFoQFjAF',null,event)" href="http://it.wikipedia.org/wiki/Foobar2000"><em>foobar2000</em> - Wikipedia</a>"

so I use code like this:

$bodyhtml = _IEBodyReadHTML($oIE)

$matches = StringRegExp($bodyhtml,'(?s:<a class="l"[^>]* href="[^"]+">((?:<em>|)[^<]*(?:</em>|)[^<]*)</a>)',3) ; I could do it even better...
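Note that the capture group above keeps the <em> tags inside each match. As a minimal sketch of one way to get plain titles, the snippet below captures everything between the opening and closing anchor tags and then strips any inner tags in a second pass. It runs against a hard-coded string rather than a live page: the first anchor is a shortened copy of the markup quoted above (the rwt() arguments are trimmed), and the second anchor and its URL are made up for illustration.

```autoit
#include <Array.au3>

; Sample input: first anchor is shortened from the post, second one is hypothetical
Local $sHtml = '<a class="l" onmousedown="return rwt(this)" href="http://it.wikipedia.org/wiki/Foobar2000"><em>foobar2000</em> - Wikipedia</a>' & _
        '<a class="l" href="http://www.example.com/">A title without emphasis</a>'

; Flag 3 returns an array with every captured group from all matches
Local $aTitles = StringRegExp($sHtml, '(?s)<a class="l"[^>]*>(.*?)</a>', 3)
If @error Then
    ConsoleWrite("No result links found" & @CRLF)
    Exit 1
EndIf

; Second pass: drop any tags (such as <em>) left inside the captured text
For $i = 0 To UBound($aTitles) - 1
    $aTitles[$i] = StringRegExpReplace($aTitles[$i], '<[^>]+>', '')
Next

_ArrayDisplay($aTitles, "Titles")
```

Splitting the work into "find the anchor" and "strip the tags" avoids having to enumerate every combination of <em> placement in a single pattern.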

How can I solve it with IE UDF in a better way, possibly without using a regular expression?

Posted

Explain more precisely what you are trying to do.

Dale

Free Internet Tools: DebugBar, AutoIt IE Builder, HTTP UDF, MODIV2, IE Developer Toolbar, IEDocMon, Fiddler, HTML Validator, WGet, curl

MSDN docs: InternetExplorer Object, Document Object, Overviews and Tutorials, DHTML Objects, DHTML Events, WinHttpRequest, XmlHttpRequest, Cross-Frame Scripting, Office object model

Automate input type=file (Related)

Alternative to _IECreateEmbedded? better: _IECreatePseudoEmbedded  Better Better?

IE.au3 issues with Vista - Workarounds

SciTe Debug mode - it's magic: #AutoIt3Wrapper_run_debug_mode=Y

Doesn't work needs to be ripped out of the troubleshooting lexicon. It means that what you tried did not produce the results you expected. It begs the questions 1) what did you try?, 2) what did you expect? and 3) what happened instead?

Reproducer: a small (the smallest?) piece of stand-alone code that demonstrates your trouble

Posted

Explain more precisely what you are trying to do.

Dale

I want to track the position of a given result (title) in the search results page. Google is just an example; I need it for various other sites too.
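To make the goal concrete, here is a rough sketch of the lookup step, assuming the titles have already been collected into a one-dimensional array in page order. The function name _TitlePosition and the sample titles are hypothetical and only serve as an illustration.

```autoit
; Hypothetical helper: returns the 1-based position of the first title
; containing $sWanted, or 0 if no title matches
Func _TitlePosition($aTitles, $sWanted)
    For $i = 0 To UBound($aTitles) - 1
        If StringInStr($aTitles[$i], $sWanted) Then Return $i + 1
    Next
    Return 0
EndFunc   ;==>_TitlePosition

; Made-up sample data standing in for titles scraped from a results page
Local $aSample[3] = ["foobar2000", "foobar2000 - Wikipedia", "Foobar - Wikipedia"]
ConsoleWrite("Position: " & _TitlePosition($aSample, "foobar2000 - Wikipedia") & @CRLF)
```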

Posted

Apparently, you have no clue what "precise" means.

Dale

Posted (edited)

#include <Array.au3>
#include <IE.au3>

Local $aTitles[1], $oIE, $oForm, $oQuery, $oLinks, $iCount, $sLinkHTML, $sLinkText

$oIE = _IECreate("http://www.google.it/", 0, 1) ; Load Google
$oForm = _IEFormGetObjByName($oIE, "gbqf") ; Get Form object
$oQuery = _IEFormElementGetObjByName($oForm, "q") ; Get query text box
_IEFormElementSetValue($oQuery, "foobar") ; Populate query text box
_IEFormSubmit($oForm, 1) ; Submit form

$oLinks = _IELinkGetCollection($oIE) ; Get collection of links
If IsObj($oLinks) Then ; Check to make sure collection of links is an object
    $iCount = 0 ; Set count to 0
    For $oLink In $oLinks ; Loop through all links
        $sLinkHTML = _IEPropertyGet($oLink, "innerhtml") ; Get inner html of link
        $sLinkText = _IEPropertyGet($oLink, "innertext") ; Get inner text of link

        If StringInStr($sLinkHTML, "<em>") Then ; Since all the search results contain <EM>, we'll check our links for that
            $iCount += 1 ; Add one to count
            ReDim $aTitles[$iCount] ; Add to array rows
            $aTitles[$iCount - 1] = $sLinkText ; Set array row
        EndIf
    Next
EndIf
_ArrayDisplay($aTitles, "Titles") ; Display array

Edit: This doesn't work so well if <EM> isn't used, but in many cases it does anyway.

Edited by GMK
Posted

I see that Google uses the <em> tag to emphasize the keyword in the search results, so where the keyword is not in the title (you'll find some cases in the first 10 pages for "foobar") your solution fails to detect the result.

Is the solution with the regular expression the best solution then for this example?

Posted

How about this?

#include <Array.au3>
#include <IE.au3>

Local $aTitles[1][2], $oIE, $oForm, $oQuery, $oLinks, $iCount, $sLinkHREF, $sLinkText

$oIE = _IECreate("http://www.google.it/", 0, 1) ; Load Google
$oForm = _IEFormGetObjByName($oIE, "gbqf") ; Get Form object
$oQuery = _IEFormElementGetObjByName($oForm, "q") ; Get query text box
_IEFormElementSetValue($oQuery, "foobar") ; Populate query text box
_IEFormSubmit($oForm, 1) ; Submit form

$oLinks = _IELinkGetCollection($oIE) ; Get collection of links
If IsObj($oLinks) Then ; Check to make sure collection of links is an object
    $iCount = 0 ; Set count to 0
    For $oLink In $oLinks ; Loop through all links
        $sLinkHREF = $oLink.href
        $sLinkText = _IEPropertyGet($oLink, "innertext") ; Get inner text of link

        If Not StringInStr($sLinkHREF, "google") And Not StringInStr($sLinkHREF, "javascript") Then ; All non-Google links
            $iCount += 1 ; Add one to count
            ReDim $aTitles[$iCount][2] ; Add to array rows
            $aTitles[$iCount - 1][0] = $sLinkText ; Set text
            $aTitles[$iCount - 1][1] = $sLinkHREF ; Set href
        EndIf
    Next
EndIf
For $i = 1 To 2
    _ArrayDelete($aTitles, 0) ; Delete first two non-Google links (YouTube and Blogger)
Next
_ArrayDisplay($aTitles, "Titles") ; Display array
Posted

It wrongly includes ad links too...

so... is regular expression still the best option for this example?
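One regex-free possibility, sketched here under the assumption that result links still carry class="l" as in the markup quoted in the first post, is to filter the link collection on the DOM className property instead of on <em> or the href. This is untested against the live page: the class name can change at any time, other sites will need their own selector, and because the #q= URL loads results asynchronously, an extra wait may be needed before reading the links.

```autoit
#include <Array.au3>
#include <IE.au3>

; URL taken from the first post; results may still be loading when _IECreate returns
Local $oIE = _IECreate("http://www.google.it/#q=foobar")
Local $oLinks = _IELinkGetCollection($oIE) ; Get collection of all links
If @error Then Exit 1

Local $aTitles[1][2], $iCount = 0
For $oLink In $oLinks
    ; Assumption: only organic result titles use class "l"
    If $oLink.className = "l" Then
        $iCount += 1
        ReDim $aTitles[$iCount][2]
        $aTitles[$iCount - 1][0] = $oLink.innerText ; Title without markup
        $aTitles[$iCount - 1][1] = $oLink.href ; Target URL
    EndIf
Next

_ArrayDisplay($aTitles, "Titles")
```

If the class filter holds, the row index plus one gives the position of each result, and ad links are skipped as long as they use a different class.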
