dragan

Scrape Google Search results using Google APIs


Has anyone tried to use Google APIs for scraping search results?

I've built this simple script to demonstrate the problem I'm having with scraping Google results:

#include <Array.au3>

Global $oHTTP = ObjCreate("WinHttp.WinHttpRequest.5.1")

_PerformSearch();

Func _PerformSearch()
    dim $ShowResults[0][3];
    $searchPages = 3

    for $j = 1 to $searchPages*8 Step 8
        $SearchString = 'Apple+Juice';              Disable this line...
;~      $SearchString = 'intitle:"crazy+stink"';    ...And enable this one
        ;http://ajax.googleapis.com/ajax/services/search/web?v=1.0&start=1&rsz=large&q=intitle:%22crazy+stink%22
        $sURL = "http://ajax.googleapis.com/ajax/services/search/web?v=1.0&start=" & $j & "&rsz=large&q=" & $SearchString
        $oHTTP.Open("GET", $sURL, False)
        $oHTTP.SetRequestHeader("Referer", @IPAddress1)
        If (@error) Then Return SetError(1, 0, 0)
        $oHTTP.Send()
        If (@error) Then Return SetError(2, 0, 0)
        $retVal = $oHTTP.ResponseText
        If (@error) Then Return SetError(3, 0, 0)

        $aReturn = _JSON_Decode($retVal)
        if NOT @error then
            $responseData = $aReturn[0][1]
            $results = $responseData[0][1]
            for $i = 0 to UBound($results)-1
                $oneResult = $results[$i]
                $title = _OnlyBoldedDecode(_getJSonValue($oneResult, "title"))
                $url = _getJSonValue($oneResult, "url")
                $content = _OnlyBoldedDecode(_getJSonValue($oneResult, "content"))

                ReDim $ShowResults[UBound($ShowResults)+1][3]
                $arIndex = UBound($ShowResults)-1
                $ShowResults[$arIndex][0] = $title
                $ShowResults[$arIndex][1] = $url
                $ShowResults[$arIndex][2] = $content
            Next
        EndIf
    Next
    _ArrayDisplay($ShowResults);
EndFunc

Func _OnlyBoldedDecode($sData);decoding only most common code
    Return StringReplace(StringReplace($sData, "\u003c", "<"), "\u003e", ">");
EndFunc

Func _getJSonValue($_res, $getData)
    for $i = 0 to UBound($_res)-1
        if $_res[$i][0] == $getData then Return $_res[$i][1]
    Next
    Return "";
EndFunc

Func _JSON_Decode($sString)
    Local $iIndex, $aVal, $sOldStr = $sString, $b

    $sString = StringStripCR(StringStripWS($sString, 7))
    If Not StringRegExp($sString, "(?i)^\{.+}$") Then Return SetError(1, 0, 0)
    Local $aArray[1][2], $iIndex = 0
    $sString = StringMid($sString, 2)

    Do
        $b = False

        $aVal = StringRegExp($sString, '^"([^"]+)"\s*:\s*(["{[]|[-+]?\d+(?:(?:\.\d+)?[eE][+-]\d+)?|true|false|null)', 2) ; Get value & next token
        If @error Then
            ConsoleWrite("!> StringRegExp Error getting next Value." & @CRLF)
            ConsoleWrite($sString & @CRLF)
            $sString = StringMid($sString, 2) ; maybe it works when the string is trimmed by 1 char from the left ?
            ContinueLoop
        EndIf

        $aArray[$iIndex][0] = $aVal[1] ; Key
        $sString = StringMid($sString, StringLen($aVal[0]))

        Switch $aVal[2] ; Value Type (Array, Object, String) ?
            Case '"' ; String
                ; Value -> Array subscript. Trim String after that.

                $aArray[$iIndex][1] = StringMid($sString, 2, StringInStr($sString, """", 1, 2) - 2)
                $sString = StringMid($sString, StringLen($aArray[$iIndex][1]) + 3)

                ReDim $aArray[$iIndex + 2][2]
                $iIndex += 1

            Case '{' ; Object
                ; Recursive function call which will decode the object and return it.
                ; Object -> Array subscript. Trim String after that.

                $aArray[$iIndex][1] = _JSON_Decode($sString)
                $sString = StringMid($sString, @extended + 2)
                If StringLeft($sString, 1) = "," Then $sString = StringMid($sString, 2)

                $b = True
                ReDim $aArray[$iIndex + 2][2]
                $iIndex += 1

            Case '[' ; Array
                ; Decode Array
                $sString = StringMid($sString, 2)
                Local $aRet[1], $iArIndex = 0 ; create new array which will contain the Json-Array.

                Do
                    $sString = StringStripWS($sString, 3) ; Trim Leading & trailing spaces
                    $aNextArrayVal = StringRegExp($sString, '^\s*(["{[]|\d+(?:(?:\.\d+)?[eE]\+\d+)?|true|false|null)', 2)
                    if @error Then Return SetError(@error, 0, 0);
                    Switch $aNextArrayVal[1]
                        Case '"' ; String
                            ; Value -> Array subscript. Trim String after that.
                            $aRet[$iArIndex] = StringMid($sString, 2, StringInStr($sString, """", 1, 2) - 2)
                            $sString = StringMid($sString, StringLen($aRet[$iArIndex]) + 3)

                        Case "{" ; Object
                            ; Recursive function call which will decode the object and return it.
                            ; Object -> Array subscript. Trim String after that.
                            $aRet[$iArIndex] = _JSON_Decode($sString)
                            $sString = StringMid($sString, @extended + 2)

                        Case "["
                            MsgBox(0, "", "Array in Array. WTF is up with this JSON shit?")
                            MsgBox(0, "", "This should not happen! Please post this!")
                            Exit 0xDEADBEEF

                        Case Else
                            ConsoleWrite("Array Else (maybe buggy?)" & @CRLF)
                            $aRet[$iArIndex] = $aNextArrayVal[1]
                    EndSwitch

                    ReDim $aRet[$iArIndex + 2]
                    $iArIndex += 1

                    $sString = StringStripWS($sString, 3) ; Leading & trailing
                    If StringLeft($sString, 1) = "]" Then ExitLoop
                    $sString = StringMid($sString, 2)
                Until False

                $sString = StringMid($sString, 2)
                ReDim $aRet[$iArIndex]
                $aArray[$iIndex][1] = $aRet

                ReDim $aArray[$iIndex + 2][2]
                $iIndex += 1

            Case Else ; Number, bool
                ; Value (number (int/flaot), boolean, null) -> Array subscript. Trim String after that.
                $aArray[$iIndex][1] = $aVal[2]
                ReDim $aArray[$iIndex + 2][2]
                $iIndex += 1
                $sString = StringMid($sString, StringLen($aArray[$iIndex][1]) + 2)
        EndSwitch

        If StringLeft($sString, 1) = "}" Then
            StringMid($sString, 2)
            ExitLoop
        EndIf
        If Not $b Then $sString = StringMid($sString, 2)
    Until False

    ReDim $aArray[$iIndex][2]
    Return SetError(0, StringLen($sOldStr) - StringLen($sString), $aArray)
EndFunc   ;==>_JSON_Decode

This works as long as you're not combining search operators such as "intitle", "inurl" or "site" with a quoted phrase. A single word works (for example intitle:cake), but a phrase such as intitle:"crazy+stink" returns nothing, even though the same search on Google itself gives approximately 35 results:    https://www.google.com/search?q=intitle:"crazy+stink"
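One thing worth ruling out first is the encoding of the query itself, since the script above places the raw double quotes straight into the URL. A minimal sketch of a percent-encoding helper follows; _UrlEncodeQuery is a hypothetical name that is not part of the original script, and I am not claiming it fixes the empty results (the already-encoded %22 URL shown further down still returns nothing).

```autoit
; Hypothetical helper (not in the original script): percent-encode a search
; query so that quotes and colons survive inside the URL.
Func _UrlEncodeQuery($sText)
    Local $sOut = "", $c
    ; Encoding flag 2 = UTF-8, so each array element is a single byte value.
    Local $aBytes = StringToASCIIArray($sText, 0, StringLen($sText), 2)
    For $i = 0 To UBound($aBytes) - 1
        $c = $aBytes[$i]
        Select
            Case ($c >= 48 And $c <= 57) Or ($c >= 65 And $c <= 90) Or ($c >= 97 And $c <= 122)
                $sOut &= Chr($c) ; digits and letters pass through unchanged
            Case $c = 45 Or $c = 46 Or $c = 95 Or $c = 126
                $sOut &= Chr($c) ; "-", ".", "_" and "~" are safe as well
            Case $c = 32
                $sOut &= "+" ; a space becomes "+" in a query string
            Case Else
                $sOut &= "%" & Hex($c, 2) ; everything else becomes %XX
        EndSelect
    Next
    Return $sOut
EndFunc   ;==>_UrlEncodeQuery

; Example: write the phrase with a real space and let the helper encode it.
ConsoleWrite(_UrlEncodeQuery('intitle:"crazy stink"') & @CRLF)
```

With this, $SearchString in the script could be built as _UrlEncodeQuery('intitle:"crazy stink"') instead of being typed pre-encoded by hand.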

Has anyone found a better way to legally scrape Google? This JSON API was built to be free and without big limitations (a single query returns at most 64 results), but it's not working properly: it gives me no results when search operators are combined with phrases. I'm aware of the Google Custom Search API, which requires an API key (which I have), but that API can only search a specific website, and I need results from Google's own web search.

Any thoughts, suggestions, ideas?

Edit: July 4th 2014:

I have found a way to use the Google Custom Search API with an API key and still search the entire web (instead of only a single site). I found this: https://support.google.com/customsearch/answer/2631040?hl=en and followed the instructions. I got my CX code and formatted the URL:

https://www.googleapis.com/customsearch/v1?key=[MY_API_KEY]&cx=017576662512468239146:omuauf_lfve&q=intitle:%22crazy+stink%22

(the CX in this example is the one that Google provides as an example for the API here: https://developers.google.com/custom-search/json-api/v1/using_rest, however, even with my own CX I get the same results)
Here are the results:

{
 "kind": "customsearch#search",
 "url": {
  "type": "application/json",
  "template": "https://www.googleapis.com/customsearch/v1?q={searchTerms}&num={count?}&start={startIndex?}&lr={language?}&safe={safe?}&cx={cx?}&cref={cref?}&sort={sort?}&filter={filter?}&gl={gl?}&cr={cr?}&googlehost={googleHost?}&c2coff={disableCnTwTranslation?}&hq={hq?}&hl={hl?}&siteSearch={siteSearch?}&siteSearchFilter={siteSearchFilter?}&exactTerms={exactTerms?}&excludeTerms={excludeTerms?}&linkSite={linkSite?}&orTerms={orTerms?}&relatedSite={relatedSite?}&dateRestrict={dateRestrict?}&lowRange={lowRange?}&highRange={highRange?}&searchType={searchType}&fileType={fileType?}&rights={rights?}&imgSize={imgSize?}&imgType={imgType?}&imgColorType={imgColorType?}&imgDominantColor={imgDominantColor?}&alt=json"
 },
 "queries": {
  "request": [
   {
    "title": "Google Custom Search - intitle:\"crazy stink\"",
    "totalResults": "0",
    "searchTerms": "intitle:\"crazy stink\"",
    "count": 10,
    "inputEncoding": "utf8",
    "outputEncoding": "utf8",
    "safe": "off",
    "cx": "017576662512468239146:omuauf_lfve"
   }
  ]
 },
 "searchInformation": {
  "searchTime": 0.35068,
  "formattedSearchTime": "0.35",
  "totalResults": "0",
  "formattedTotalResults": "0"
 }
}

The results are almost the same as the ones I get from the AJAX JSON API at http://ajax.googleapis.com/ajax/services/search/web?v=1.0&start=1&rsz=large&q=intitle:%22crazy+stink%22:

{"responseData": {"results":[],"cursor":{"moreResultsUrl":"http://www.google.com/search?oe\u003dutf8\u0026ie\u003dutf8\u0026source\u003duds\u0026start\u003d1\u0026hl\u003den\u0026q\u003dintitle:%22crazy+stink%22","searchResultTime":"0.10"}}, "responseDetails": null, "responseStatus": 200}

That is, 0 results in both cases.

So... maybe there isn't any error on my part, but there is on Google's?

I'm just curious whether anyone has encountered an issue like this one, or has a better suggestion. Bear in mind that I want to keep this legal (scraping results from an IE object is not something I want to do).
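For anyone who wants to reproduce the Custom Search request from the edit above, here is a minimal sketch using the same WinHttp COM object as my first script. The function names _BuildCustomSearchUrl and _CustomSearch are hypothetical, the API key and CX values are placeholders you must supply, and the query is assumed to be URL-encoded already. The key, cx, q and start parameters are the ones that appear in the URL and in the "template" field of the response shown above.

```autoit
; Sketch only: both function names are hypothetical, not an existing UDF.

; Build the request URL. $sEncodedQuery must already be URL-encoded,
; e.g. intitle:%22crazy+stink%22
Func _BuildCustomSearchUrl($sApiKey, $sCx, $sEncodedQuery, $iStart = 1)
    Return "https://www.googleapis.com/customsearch/v1?key=" & $sApiKey & _
            "&cx=" & $sCx & "&start=" & $iStart & "&q=" & $sEncodedQuery
EndFunc   ;==>_BuildCustomSearchUrl

; Send the request and return the raw JSON text.
; On failure returns "" with @error = 1 (no COM object) or
; @error = 2 (HTTP status other than 200, status code in @extended).
Func _CustomSearch($sApiKey, $sCx, $sEncodedQuery, $iStart = 1)
    Local $oHTTP = ObjCreate("WinHttp.WinHttpRequest.5.1")
    If Not IsObj($oHTTP) Then Return SetError(1, 0, "")
    $oHTTP.Open("GET", _BuildCustomSearchUrl($sApiKey, $sCx, $sEncodedQuery, $iStart), False)
    $oHTTP.Send()
    If $oHTTP.Status <> 200 Then Return SetError(2, $oHTTP.Status, "")
    Return $oHTTP.ResponseText
EndFunc   ;==>_CustomSearch

; Usage (placeholders, replace with your own key and CX):
; Local $sJson = _CustomSearch("MY_API_KEY", "MY_CX", "intitle:%22crazy+stink%22")
; If @error Then ConsoleWrite("Request failed, @error = " & @error & @CRLF)
```

Checking the HTTP status before reading the body makes it easier to tell an API-side rejection apart from a valid response that simply contains zero results.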

Edited by dragan
