
Extracting data from the source of a website - won't work :(

andrewz

Hello!

I am currently facing a problem which I can't seem to solve.

What do I want the script to do?
Extract the links to all the hotels listed on this website: http://www.yelp.de/search?cflt=hotels&find_loc=Berlin%2C+Germany
For example, the link to the first hotel would be: http://www.yelp.de/biz/novum-hotel-city-b-berlin-zentrum-berlin (the listing changes sometimes, so the first link may be different).

To start off, I tried to extract only one hotel. I am using this code to read the page source and then get the content between two marker strings:

#NoTrayIcon
#include <Inet.au3>
#include <Array.au3>
#include <String.au3>
#include <AutoItConstants.au3>
#include <MsgBoxConstants.au3>
Global $url = "http://www.yelp.de/search?find_desc=Hotels&find_loc=Berlin&ns=1"

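Global $content = _INetGetSource($url)
; _StringBetween() returns an array of matches, or sets @error if nothing matched
Global $string_A = _StringBetween($content, '<div class="media-avatar">', '</div>')

If @error Then
    MsgBox($MB_ICONERROR, "", "No match found in the page source")
Else
    MsgBox($MB_OK, "", $string_A[0])
EndIf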

It's part of an older project which did almost the same thing, with the exception that this one is not as easy :(

The link is stored differently in the page source, and I can't find a way to extract it. Once the links are in an array, I am going to save them into a variable with a Do...Until loop. But first I need this step working.

Please, if anyone has an idea how to solve this, even the smallest help is appreciated!

water

If you run Microsoft's Internet Explorer, you could use the IE UDF to extract the links.


andrewz


It doesn't matter which browser; I could use any. But thanks, I will look into it ;)

MikahS

Simple example with the IE UDF functionality. Get all links:

#include <IE.au3>
#include <MsgBoxConstants.au3>

Global $url = "http://www.yelp.de/search?find_desc=Hotels&find_loc=Berlin&ns=1"

Local $oIE = _IECreate($url)
Local $oLinks = _IELinkGetCollection($oIE)
Local $iNumLinks = @extended

Local $sTxt = $iNumLinks & " links found" & @CRLF & @CRLF
For $oLink In $oLinks
    $sTxt &= $oLink.href & @CRLF
Next
MsgBox($MB_SYSTEMMODAL, "Link Info", $sTxt)

The only difference between this and the help file example for _IELinkGetCollection() is the _IECreate() call with $url, so it should be pretty easy to start using the IE UDF. :D
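
To narrow that list down to the hotel pages the original question asked about, one option is to keep only the links whose href contains "/biz/". This is a minimal sketch, assuming hotel links on the page still contain that path segment (taken from the example link in the first post). The helper name _GetBizLinks is made up for illustration and is not part of IE.au3:

```autoit
#include <IE.au3>
#include <MsgBoxConstants.au3>

Local $oIE = _IECreate("http://www.yelp.de/search?find_desc=Hotels&find_loc=Berlin&ns=1")
MsgBox($MB_SYSTEMMODAL, "Hotel links", _GetBizLinks($oIE))

; Hypothetical helper: returns a @CRLF-separated string of all links
; whose href contains "/biz/" (assumed marker for a hotel page).
Func _GetBizLinks($oIE)
    Local $oLinks = _IELinkGetCollection($oIE)
    If @error Then Return SetError(1, 0, "")

    Local $sResult = "", $sHref
    For $oLink In $oLinks
        $sHref = $oLink.href
        ; StringInStr() returns 0 (False) when the substring is absent.
        If StringInStr($sHref, "/biz/") Then $sResult &= $sHref & @CRLF
    Next
    Return $sResult
EndFunc   ;==>_GetBizLinks
```

The same hotel may be linked more than once on a page (image and title, for example), so the result can contain duplicates that still need filtering.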

andrewz

@mikahs and water, you both are brilliant!

MikahS


I'd say that's just Water. ;)

Nonetheless, it is my pleasure. :)


water

MikahS,

you are brilliant too :)

I just pointed him in the right direction, but you showed him working code.

MikahS


Thank you Water, I appreciate it. :)

water

Give credit where credit is due :)

andrewz

So I used it like this:

Global $url = "http://www.yelp.de/search?find_desc=Hotels&find_loc=Berlin&ns=1#start="&$timesran&"&cflt=hotels"
_IECreate($url)
Sleep(5000)
Local $oIE = _IECreate($url,1,1)
Local $oLinks = _IELinkGetCollection($oIE)
Local $iNumLinks = @extended

I first call _IECreate() to open the window because yelp.de doesn't load the items immediately: it first shows hotels 1-10 and then, after a few seconds, displays the ones from the requested page. So if $timesran is "10" (that is, page 2), the script first lets the page load completely, then attaches to the already opened IE window, stores it in the variable, and finally collects all the links.

But as of today this doesn't seem to work :( Is there any workaround, so that the program waits not only for the page to load completely, but also for the items that usually load 2-3 seconds afterwards?

Try: http://www.yelp.de/search?find_desc=Hotels&find_loc=Berlin&ns=1#start=10&cflt=hotels

Thanks in advance

MikahS

Worked fine for me:

#include <IE.au3>
Global $timesran = "10"
Global $url = "http://www.yelp.de/search?find_desc=Hotels&find_loc=Berlin&ns=1#start="&$timesran&"&cflt=hotels"
_IECreate($url)
Sleep(5000)
Local $oIE = _IECreate($url,1,1)
Local $oLinks = _IELinkGetCollection($oIE)
Local $iNumLinks = @extended
MsgBox(0, "", $iNumLinks)

andrewz

 


Did it show you hotels 1-10 or 11-20?

Maybe I should have posted more code:

#NoTrayIcon
#include <IE.au3>
#include <MsgBoxConstants.au3>
#include <File.au3>
#include <Inet.au3>

FileDelete("Links.txt")
_FileCreate("Edited.txt")

$timesran = 0

Do
_FileCreate("Links.txt")
Global $url = "http://www.yelp.de/search?find_desc=&find_loc=Frankfurt+am+Main%2C+Hessen&ns=1#find_desc=Hotel&start="&$timesran&"&cflt=hotels"
_IECreate($url)
Sleep(5000)
Local $oIE = _IECreate($url,1,1)
Local $oLinks = _IELinkGetCollection($oIE)
Local $iNumLinks = @extended

Local $sTxt = $iNumLinks & " links found" & @CRLF & @CRLF
For $oLink In $oLinks
    $sTxt &= $oLink.href & @CRLF
Next

FileWrite("Links.txt",$sTxt)


$file = "Links.txt"
FileOpen($file, 0)
For $i = 1 to _FileCountLines($file)
    $line = FileReadLine($file, $i)
If StringInStr($line,"http://www.yelp.de/biz/") = true Then
    $content = FileRead("Edited.txt")
    If StringInStr($content,$line) = false Then
    FileWrite("Edited.txt",$line & @CRLF)
    EndIf
    EndIf
Next
FileClose($file)
FileDelete("Links.txt")
sleep(1000)
$timesran += 10

Until $timesran = "1000"

$msg=_FileCountLines ("Edited.txt")
MsgBox(0,"",$msg)
MikahS

Hotels 11-20.


andrewz


Hmm, I edited my last post to include the full code. I don't know why it worked yesterday. :/

Am I missing something? Thanks in advance

MikahS

What is the expected order of hotels?


andrewz


Oh, I still have "Until $timesran = 1000" left over from Berlin; I have switched to Frankfurt, though.

Frankfurt has 704 hotels listed on the site: http://www.yelp.de/search?find_desc=&find_loc=Frankfurt+am+Main%2C+Hessen&ns=1#find_desc=Hotel&start=0&cflt=hotels

The order doesn't matter, as long as all the links from pages 1-71 are saved into a txt file. However, yelp.de loads the results in a different way, which is why the usual link collection didn't work until I changed the script to first open IE and load the page, then attach to the existing window and grab all the links.

Somehow this worked yesterday and suddenly stopped working today.

&start=0 means page 1

&start=10 means page 2

and so on... you probably already understood that.

MikahS

Maybe because you are using a different URL?

; one used before
"http://www.yelp.de/search?find_desc=Hotels&find_loc=Berlin&ns=1"
; one used now
"http://www.yelp.de/search?find_desc=&find_loc=Frankfurt+am+Main%2C+Hessen&ns=1#find_desc=Hotel&start="&$timesran&"&cflt=hotels"

andrewz

 


It doesn't make a difference; the displayed results are the same. I also tried it with: http://www.yelp.de/search?find_desc=Hotel&find_loc=Frankfurt+am+Main%2C+Hessen&cflt=hotels#start=10

which showed the same results and still didn't work. It might also be the PC here, which uses an older IE version and is kind of slow.

The program opens all the links perfectly fine, but the collected links are always hotels 1-10. I don't know why it doesn't grab the ones which load after a few seconds.

EDIT: Got it to work by removing the FileDelete() and _FileCreate() calls that ran on every iteration! They were screwing everything up :o

The intention of deleting and recreating the file was to reduce the search time in the links file. Now I will just empty the file instead.

Thanks for the ideas/help though :)

I just added this:

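; Mode 2 ($FO_OVERWRITE) erases the previous contents when the file is opened
Local $hFile = FileOpen($file, 2)
; FileClose() expects the handle returned by FileOpen(), not the file name
FileClose($hFile)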
MikahS


My pleasure. ;)


SmOke_N

I'm so confused...

1.  You're not assigning the first _IECreate() to a variable

2.  You're using Sleep() instead of _IELoadWait()

3.  You then go to the same URL again right after the sleep, this time assigning the _IECreate() object to a variable

4.  You're constantly declaring variables and creating files (overwriting older ones) within a loop
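
Put together, the four points suggest a loop along these lines. This is only a sketch and has not been run against the live site. It assumes that hotel links still contain "/biz/" and that paging still works through "#start=", both taken from the posts above. $iLastStart and the 10-second polling limit are made-up values. Note that waiting for the page load alone may not be enough here, since the results are injected a few seconds later, so the sketch also polls until new links show up:

```autoit
#include <IE.au3>
#include <FileConstants.au3>
#include <MsgBoxConstants.au3>

Local Const $sBase = "http://www.yelp.de/search?find_desc=Hotel&find_loc=Frankfurt+am+Main%2C+Hessen&cflt=hotels#start="
Local Const $iLastStart = 700 ; assumption: about 704 results, 10 per page
Local $oSeen = ObjCreate("Scripting.Dictionary") ; in-memory set of unique links
Local $oLinks, $sHref, $iBefore, $hTimer

; Point 1: create the browser once and keep the object in a variable.
Local $oIE = _IECreate("about:blank")

For $iStart = 0 To $iLastStart Step 10
    ; Points 2 and 3: reuse the same window. With its default $iWait = 1,
    ; _IENavigate() calls _IELoadWait() itself, so no blind Sleep() is needed.
    _IENavigate($oIE, $sBase & $iStart)

    ; Poll for up to 10 seconds until this page contributes new /biz/ links.
    $iBefore = $oSeen.Count
    $hTimer = TimerInit()
    Do
        $oLinks = _IELinkGetCollection($oIE)
        If Not @error Then
            For $oLink In $oLinks
                $sHref = $oLink.href
                If StringInStr($sHref, "/biz/") And Not $oSeen.Exists($sHref) Then $oSeen.Add($sHref, 1)
            Next
        EndIf
        If $oSeen.Count = $iBefore Then Sleep(500)
    Until $oSeen.Count > $iBefore Or TimerDiff($hTimer) > 10000
Next
_IEQuit($oIE)

; Point 4: write the output file once, after the loop.
Local $aKeys = $oSeen.Keys()
Local $hFile = FileOpen("Edited.txt", $FO_OVERWRITE)
For $sKey In $aKeys
    FileWriteLine($hFile, $sKey)
Next
FileClose($hFile)
MsgBox($MB_SYSTEMMODAL, "", $oSeen.Count & " unique links saved")
```

Keeping the unique links in a dictionary avoids re-reading Edited.txt for every line, which is what the intermediate Links.txt file was working around.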


Common sense plays a role in the basics of understanding AutoIt... If you're lacking in that, do us all a favor, and step away from the computer.
