Extracting Data from the source of a website - won't work :(

andrewz

Hello!

I am currently facing a problem which I can't seem to be able to solve.

What do I want to do with the script ?
Extract all the links of the hotels on this website: http://www.yelp.de/search?cflt=hotels&find_loc=Berlin%2C+Germany
For example the first link to the first hotel would be: http://www.yelp.de/biz/novum-hotel-city-b-berlin-zentrum-berlin  - changes sometimes, so the link will be different.

To start off, I tried to extract only one hotel. I am using this code to read the page source and then get the content between two "functions" (or whatever these are called):

#NoTrayIcon
#include <Inet.au3>
#include <Array.au3>
#include <String.au3>
#include <AutoItConstants.au3>
#include <MsgBoxConstants.au3>
Global $url = "http://www.yelp.de/search?find_desc=Hotels&find_loc=Berlin&ns=1"

Global $content = _INetGetSource($url)
Global $string_A = _StringBetween($content, '<div class="media-avatar">', '</div>')

MsgBox(0,"",$string_A[0])

It's part of an older project that did almost the same thing, with the exception that this one is not as easy :(

The link is stored differently, and I can't find a way to extract it. Once the links are in an array, I am going to save them into a variable with a Do...Until loop. But first I need this step working.

Please, if anyone has an idea how to solve this, even the smallest help is appreciated!

water

If you run Microsoft's Internet Explorer, you could use the IE UDF to extract the links.


andrewz

Doesn't matter which browser, I could use any. But thanks, I will look into it ;)

MikahS

Simple example with the IE UDF functionality. Get all links:

#include <IE.au3>
#include <MsgBoxConstants.au3>

Global $url = "http://www.yelp.de/search?find_desc=Hotels&find_loc=Berlin&ns=1"

Local $oIE = _IECreate($url)
Local $oLinks = _IELinkGetCollection($oIE)
Local $iNumLinks = @extended

Local $sTxt = $iNumLinks & " links found" & @CRLF & @CRLF
For $oLink In $oLinks
    $sTxt &= $oLink.href & @CRLF
Next
MsgBox($MB_SYSTEMMODAL, "Link Info", $sTxt)

The only difference between this and the help file example for _IELinkGetCollection() is the _IECreate() call with $url, so it should be pretty easy to start using the IE UDF. :D

andrewz

@mikahs and water, you both are brilliant!

MikahS

I'd say that's just Water. ;)

Nonetheless, it is my pleasure. :)


water

MikahS,

you are brilliant too :)

I just pointed him in the right direction, but you showed him working code.

MikahS

Thank you Water, I appreciate it. :)

water

Give credit where credit is due :)

andrewz

So I used it like this:

Global $url = "http://www.yelp.de/search?find_desc=Hotels&find_loc=Berlin&ns=1#start="&$timesran&"&cflt=hotels"
_IECreate($url)
Sleep(5000)
Local $oIE = _IECreate($url,1,1)
Local $oLinks = _IELinkGetCollection($oIE)
Local $iNumLinks = @extended

I first call _IECreate to open the window because yelp.de doesn't load the items immediately: it first shows hotels 1-10 and then, after a few seconds, displays the ones from the requested page. So if the variable $timesran is "10" (that means page 2), it would first load the page completely, then attach to the already opened IE window, store it in the variable, and finally collect all the links.

But as of today this doesn't seem to work :( Is there a workaround so that the program waits not only for the page to load completely, but also for the items that usually appear 2-3 seconds afterwards?

Try :  http://www.yelp.de/search?find_desc=Hotels&find_loc=Berlin&ns=1#start=10&cflt=hotels

Thanks in advance

MikahS

worked fine for me:

#include <IE.au3>
Global $timesran = "10"
Global $url = "http://www.yelp.de/search?find_desc=Hotels&find_loc=Berlin&ns=1#start="&$timesran&"&cflt=hotels"
_IECreate($url)
Sleep(5000)
Local $oIE = _IECreate($url,1,1)
Local $oLinks = _IELinkGetCollection($oIE)
Local $iNumLinks = @extended
MsgBox(0, "", $iNumLinks)

andrewz

 


Did it show you hotels 1-10 or 11-20?

Maybe I should have posted more code:

#NoTrayIcon
#include <IE.au3>
#include <MsgBoxConstants.au3>
#include <File.au3>
#include <Inet.au3>

FileDelete("Links.txt")
_FileCreate("Edited.txt")

$timesran = 0

Do
_FileCreate("Links.txt")
Global $url = "http://www.yelp.de/search?find_desc=&find_loc=Frankfurt+am+Main%2C+Hessen&ns=1#find_desc=Hotel&start="&$timesran&"&cflt=hotels"
_IECreate($url)
Sleep(5000)
Local $oIE = _IECreate($url,1,1)
Local $oLinks = _IELinkGetCollection($oIE)
Local $iNumLinks = @extended

Local $sTxt = $iNumLinks & " links found" & @CRLF & @CRLF
For $oLink In $oLinks
    $sTxt &= $oLink.href & @CRLF
Next

FileWrite("Links.txt",$sTxt)


$file = "Links.txt"
FileOpen($file, 0)
For $i = 1 to _FileCountLines($file)
    $line = FileReadLine($file, $i)
If StringInStr($line,"http://www.yelp.de/biz/") = true Then
    $content = FileRead("Edited.txt")
    If StringInStr($content,$line) = false Then
    FileWrite("Edited.txt",$line & @CRLF)
    EndIf
    EndIf
Next
FileClose($file)
FileDelete("Links.txt")
sleep(1000)
$timesran += 10

Until $timesran = "1000"

$msg=_FileCountLines ("Edited.txt")
MsgBox(0,"",$msg)
MikahS

Hotels 11-20.


andrewz

Hmm, I edited my last post with the full code. I don't know why it worked yesterday. :/

Am I missing something? Thanks in advance

MikahS

What is the expected order of hotels?


andrewz

Oh, I still have "Until $timesran = 1000" left over from Berlin; I have switched to Frankfurt though.

Frankfurt has 704 hotels listed on the site: http://www.yelp.de/search?find_desc=&find_loc=Frankfurt+am+Main%2C+Hessen&ns=1#find_desc=Hotel&start=0&cflt=hotels

The order doesn't matter, as long as all the links from pages 1-71 are saved into a txt file. Yelp.de, however, loads the results in a different way, which is why the usual link-collecting function didn't work until I changed the script to first run IE and load the page, then attach to the existing window and grab all the links. Somehow this worked yesterday and suddenly stopped working today.

&start=0 means page 1

&start=10 means page 2

and so on... you probably already understood that.

MikahS

Maybe because you are using a different URL?

; one used before
"http://www.yelp.de/search?find_desc=Hotels&find_loc=Berlin&ns=1"

; one used now
"http://www.yelp.de/search?find_desc=&find_loc=Frankfurt+am+Main%2C+Hessen&ns=1#find_desc=Hotel&start="&$timesran&"&cflt=hotels"

andrewz

 


It doesn't make a difference; the displayed results are the same. I also tried it with: http://www.yelp.de/search?find_desc=Hotel&find_loc=Frankfurt+am+Main%2C+Hessen&cflt=hotels#start=10

which showed the same results and still didn't work. It might also be the PC here, which uses an older IE version and is kind of slow. The program opens all the links perfectly fine, but the collected links are always hotels 1-10. I don't know why it doesn't grab the ones that load after a few seconds.

EDIT: Got it to work by removing the FileDelete and _FileCreate calls that ran on each repetition! They were screwing everything up :o

The intention of deleting and recreating the file was to lower the search time in the links file. Now I will just empty the file instead.

Thanks for the ideas/help tho :)

Just added this:

FileOpen($file, 2)
FileClose($file)
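
For readers following along, the two lines above could be wrapped in a small helper. This is only a sketch: the function name TruncateFile is hypothetical and not part of any UDF. It relies on FileOpen in overwrite mode (2, or $FO_OVERWRITE from FileConstants.au3) erasing the previous contents, and it passes the handle returned by FileOpen to FileClose rather than the file name.

```autoit
#include <FileConstants.au3>

; Hypothetical helper (not from the thread): empty a file without deleting it.
Func TruncateFile($sPath)
    ; $FO_OVERWRITE (2) opens the file for writing and erases its previous contents.
    Local $hFile = FileOpen($sPath, $FO_OVERWRITE)
    ; FileOpen returns -1 when the file cannot be opened.
    If $hFile = -1 Then Return SetError(1, 0, False)
    ; FileClose expects the handle returned by FileOpen.
    FileClose($hFile)
    Return True
EndFunc   ;==>TruncateFile

TruncateFile("Links.txt")
```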
MikahS

My pleasure. ;)


SmOke_N

I'm so confused...

1.  You're not assigning the result of the first _IECreate() to a variable

2.  You're using Sleep() instead of _IELoadWait()

3.  You then go to the same URL again right after the sleep, this time storing the _IECreate() object in a variable

4.  You're constantly declaring variables and creating files (overwriting older ones) within a loop
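
A minimal sketch of how those four points might be applied to the loop posted earlier. The URL pattern, the "/biz/" prefix, the 704-result count, and the Edited.txt file name come from the thread. The variable names, the in-memory duplicate check, and the fixed 3-second pause are assumptions. The pause is kept because the results are reportedly filled in by script after the page itself has finished loading, so _IELoadWait() alone may not be enough.

```autoit
#include <IE.au3>
#include <FileConstants.au3>

; Sketch only, under the assumptions stated above.
Local Const $sBaseUrl = "http://www.yelp.de/search?find_desc=Hotel&find_loc=Frankfurt+am+Main%2C+Hessen&cflt=hotels#start="
Local Const $sBizPrefix = "http://www.yelp.de/biz/"

; Point 4: declare variables once, outside the loop.
Local $oLinks, $sHref
Local $sSeen = @CRLF ; every stored link is followed by @CRLF for exact-match lookups

; Points 1 and 3: create the browser once and keep the object in a variable.
Local $oIE = _IECreate("about:blank")
If @error Then Exit 1

; Point 4: open the output file once instead of recreating it on every pass.
Local $hOut = FileOpen("Edited.txt", $FO_OVERWRITE)
If $hOut = -1 Then Exit 2

; 704 results at 10 per page: start=0 .. start=700
For $iStart = 0 To 700 Step 10
    _IENavigate($oIE, $sBaseUrl & $iStart)
    ; Point 2: wait for the document instead of sleeping blindly.
    _IELoadWait($oIE)
    ; Assumption: the result list is injected after the load event,
    ; so a short extra pause may still be required.
    Sleep(3000)

    $oLinks = _IELinkGetCollection($oIE)
    If @error Then ContinueLoop

    For $oLink In $oLinks
        $sHref = $oLink.href
        ; Keep only hotel pages.
        If StringLeft($sHref, StringLen($sBizPrefix)) <> $sBizPrefix Then ContinueLoop
        ; Skip links that were already written.
        If StringInStr($sSeen, @CRLF & $sHref & @CRLF) Then ContinueLoop
        $sSeen &= $sHref & @CRLF
        FileWriteLine($hOut, $sHref)
    Next
Next

FileClose($hOut)
_IEQuit($oIE)
```

Keeping the seen links in a string avoids re-reading Edited.txt for every link, which was the search-time concern behind deleting and recreating Links.txt.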


Common sense plays a role in the basics of understanding AutoIt... If you're lacking in that, do us all a favor, and step away from the computer.
