
Extracting Data from the source of a website - wont work :(


Solved by SmOke_N


Hello!

I am currently facing a problem which I can't seem to be able to solve.

What do I want to do with the script ?
Extract all the links of the hotels on this website: http://www.yelp.de/search?cflt=hotels&find_loc=Berlin%2C+Germany
For example the first link to the first hotel would be: http://www.yelp.de/biz/novum-hotel-city-b-berlin-zentrum-berlin  - changes sometimes, so the link will be different.

To start off, I tried to extract only one hotel. I am using this code to read the page source and then get the content between two HTML tags:

#NoTrayIcon
#include <Inet.au3>
#include <Array.au3>
#include <String.au3>
#include <AutoItConstants.au3>
#include <MsgBoxConstants.au3>
Global $url = "http://www.yelp.de/search?find_desc=Hotels&find_loc=Berlin&ns=1"

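Global $content = _INetGetSource($url)
If @error Then Exit MsgBox($MB_ICONERROR, "", "Could not download " & $url)

; _StringBetween returns an array of matches and sets @error when nothing is found
Global $string_A = _StringBetween($content, '<div class="media-avatar">', '</div>')
If @error Then Exit MsgBox($MB_ICONERROR, "", "No match found in the page source")

MsgBox($MB_OK, "", $string_A[0])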

It's part of an older project that did almost the same thing, with the exception that this one is not as easy :(

The link is stored differently, and I can't find a way to extract it. Once the links are in an array, I am going to save them into a variable with a Do...Until loop. But first I need this step working.

Please, if anyone has an idea how to solve this, even the smallest help is appreciated!
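
For reference, a minimal sketch of pulling hotel links straight out of the page source with StringRegExp instead of _StringBetween. The sample markup in $sHtml is made up for illustration and the real Yelp source may look different; the pattern also assumes the links are relative paths starting with /biz/.

```autoit
#include <StringConstants.au3>

; Made-up sample markup (assumption); in practice this would come from _INetGetSource().
Local $sHtml = '<a class="biz-name" href="/biz/hotel-a-berlin">Hotel A</a>' & @CRLF & _
        '<a class="biz-name" href="/biz/hotel-b-berlin?osq=Hotels">Hotel B</a>'

; Capture every relative /biz/ path, stopping at a quote, "?" or "#".
Local $aPaths = StringRegExp($sHtml, 'href="(/biz/[^"?#]+)', $STR_REGEXPARRAYGLOBALMATCH)
If @error Then
    ConsoleWrite("No hotel links found" & @CRLF)
Else
    For $i = 0 To UBound($aPaths) - 1
        ; Prefix the host to turn the relative path into a full link.
        ConsoleWrite("http://www.yelp.de" & $aPaths[$i] & @CRLF)
    Next
EndIf
```

This only sees what is in the downloaded HTML, so results that the page loads later via script would not show up.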

Edited by andrewz

If you run Microsoft's Internet Explorer, you could use the IE UDF to extract the links.

My UDFs and Tutorials:

Spoiler

UDFs:
Active Directory (NEW 2022-02-19 - Version 1.6.1.0) - Download - General Help & Support - Example Scripts - Wiki
ExcelChart (2017-07-21 - Version 0.4.0.1) - Download - General Help & Support - Example Scripts
OutlookEX (2021-11-16 - Version 1.7.0.0) - Download - General Help & Support - Example Scripts - Wiki
OutlookEX_GUI (2021-04-13 - Version 1.4.0.0) - Download
Outlook Tools (2019-07-22 - Version 0.6.0.0) - Download - General Help & Support - Wiki
PowerPoint (2021-08-31 - Version 1.5.0.0) - Download - General Help & Support - Example Scripts - Wiki
Task Scheduler (NEW 2022-07-28 - Version 1.6.0.1) - Download - General Help & Support - Wiki

Standard UDFs:
Excel - Example Scripts - Wiki
Word - Wiki

Tutorials:
ADO - Wiki
WebDriver - Wiki

 


Simple example with the IE UDF functionality. Get all links:

#include <IE.au3>
#include <MsgBoxConstants.au3>

Global $url = "http://www.yelp.de/search?find_desc=Hotels&find_loc=Berlin&ns=1"

Local $oIE = _IECreate($url)
Local $oLinks = _IELinkGetCollection($oIE)
Local $iNumLinks = @extended

Local $sTxt = $iNumLinks & " links found" & @CRLF & @CRLF
For $oLink In $oLinks
    $sTxt &= $oLink.href & @CRLF
Next
MsgBox($MB_SYSTEMMODAL, "Link Info", $sTxt)

The only difference between this and the help file example for _IELinkGetCollection() is the _IECreate() call and $url, so it should be pretty easy to start using the IE UDF. :D

Edited by MikahS

Snips & Scripts


My Snips: graphCPUTemp ~ getENVvars
My Scripts: Short-Order Encrypter - message and file encryption V1.6.1 ~ AuPad - Notepad written entirely in AutoIt V1.9.4

Feel free to use any of my code for your own use. Forum FAQ

 


@mikahs and water, you both are brilliant!

 

I'd say that's just Water. ;)

Nonetheless, it is my pleasure. :)


MikahS,

you are brilliant too :)

I just pointed him in the right direction, but you showed him working code.



 

Thank you Water, I appreciate it. :)

Edited by MikahS


Give credit where credit is due :)


So I used it like this:

Global $url = "http://www.yelp.de/search?find_desc=Hotels&find_loc=Berlin&ns=1#start="&$timesran&"&cflt=hotels"
_IECreate($url)
Sleep(5000)
Local $oIE = _IECreate($url,1,1)
Local $oLinks = _IELinkGetCollection($oIE)
Local $iNumLinks = @extended

I first call _IECreate to open the window because yelp.de doesn't load the items immediately: it first shows hotels 1-10 and then, after a few seconds, displays the ones from the requested page. So if $timesran is 10 (that is, page 2), the script first lets the page load completely, then attaches to the already opened IE window, stores it in the variable, and finally collects all the links.

But as of today this doesn't seem to work :( Is there a workaround, so that the program waits not only for the page to load completely but also for the items, which usually appear 2-3 seconds afterwards?

Try :  http://www.yelp.de/search?find_desc=Hotels&find_loc=Berlin&ns=1#start=10&cflt=hotels

Thanks in advance

Edited by andrewz

worked fine for me:

#include <IE.au3>
Global $timesran = "10"
Global $url = "http://www.yelp.de/search?find_desc=Hotels&find_loc=Berlin&ns=1#start="&$timesran&"&cflt=hotels"
_IECreate($url)
Sleep(5000)
Local $oIE = _IECreate($url,1,1)
Local $oLinks = _IELinkGetCollection($oIE)
Local $iNumLinks = @extended
MsgBox(0, "", $iNumLinks)


 


 

Did it show you hotels 1-10 or 11-20?

Maybe I should have posted more code:

#NoTrayIcon
#include <IE.au3>
#include <MsgBoxConstants.au3>
#include <File.au3>
#include <Inet.au3>

FileDelete("Links.txt")
_FileCreate("Edited.txt")

$timesran = 0

Do
    _FileCreate("Links.txt")
    Global $url = "http://www.yelp.de/search?find_desc=&find_loc=Frankfurt+am+Main%2C+Hessen&ns=1#find_desc=Hotel&start="&$timesran&"&cflt=hotels"
    _IECreate($url)
    Sleep(5000)
    Local $oIE = _IECreate($url,1,1)
    Local $oLinks = _IELinkGetCollection($oIE)
    Local $iNumLinks = @extended

    Local $sTxt = $iNumLinks & " links found" & @CRLF & @CRLF
    For $oLink In $oLinks
        $sTxt &= $oLink.href & @CRLF
    Next

    FileWrite("Links.txt",$sTxt)

    $file = "Links.txt"
    FileOpen($file, 0)
    For $i = 1 to _FileCountLines($file)
        $line = FileReadLine($file, $i)
        If StringInStr($line,"http://www.yelp.de/biz/") = true Then
            $content = FileRead("Edited.txt")
            If StringInStr($content,$line) = false Then
                FileWrite("Edited.txt",$line & @CRLF)
            EndIf
        EndIf
    Next
    FileClose($file)
    FileDelete("Links.txt")
    sleep(1000)
    $timesran += 10

Until $timesran = "1000"

$msg=_FileCountLines ("Edited.txt")
MsgBox(0,"",$msg)
Edited by andrewz

Hotels 11-20.


What is the expected order of hotels?



 

Oh, I still have "Until $timesran = 1000" left over from Berlin; I switched to Frankfurt though.

Frankfurt has 704 hotels listed on the site: http://www.yelp.de/search?find_desc=&find_loc=Frankfurt+am+Main%2C+Hessen&ns=1#find_desc=Hotel&start=0&cflt=hotels

The order doesn't matter, as long as all the links from pages 1-71 are saved to a txt file. Yelp.de, however, loads its results in a different way, which is why the usual link-collecting function didn't work until I changed the script to first run IE and load the page, then attach to the existing window and grab all the links.

Somehow this worked yesterday and suddenly stopped working today.

&start=0 means page 1

&start=10 means page 2

and so on... you probably already understood that.
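
The page arithmetic can be sketched like this, assuming 10 results per page as described above. _ExamplePageUrl is a hypothetical helper name, not part of any UDF.

```autoit
; Hypothetical helper: build the search URL for a 1-based page number.
Func _ExamplePageUrl($iPage)
    Local $iStart = ($iPage - 1) * 10 ; page 1 -> start=0, page 2 -> start=10, ...
    Return "http://www.yelp.de/search?find_desc=&find_loc=Frankfurt+am+Main%2C+Hessen&ns=1" & _
            "#find_desc=Hotel&start=" & $iStart & "&cflt=hotels"
EndFunc   ;==>_ExamplePageUrl

Local $iHotels = 704
Local $iPages = Ceiling($iHotels / 10) ; 704 / 10 = 70.4, rounded up to 71 pages

ConsoleWrite($iPages & " pages, last start offset " & ($iPages - 1) * 10 & @CRLF)
ConsoleWrite(_ExamplePageUrl(2) & @CRLF)
```

Deriving the page count from the number of hotels would avoid a hard-coded loop limit that has to be changed for every city.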

Edited by andrewz

Maybe because you are using a different URL?

; one used before
"http://www.yelp.de/search?find_desc=Hotels&find_loc=Berlin&ns=1"
; one used now
"http://www.yelp.de/search?find_desc=&find_loc=Frankfurt+am+Main%2C+Hessen&ns=1#find_desc=Hotel&start="&$timesran&"&cflt=hotels"


 


 

It doesn't make a difference; the displayed results are the same. I also tried it with: http://www.yelp.de/search?find_desc=Hotel&find_loc=Frankfurt+am+Main%2C+Hessen&cflt=hotels#start=10

which showed the same results and still didn't work. It might also be the PC here, which uses an older IE version and is kind of slow.

The program also opens all the links perfectly fine, but the collected links are always hotels 1-10; I don't know why it doesn't grab the ones that load after a few seconds.

EDIT: Got it to work by removing the FileDelete and _FileCreate calls that ran on every repetition! They were screwing everything up :o

The intention of deleting and recreating the file was to lower the search time in the links file. Now I will just empty the file.

Thanks for the ideas/help tho :)

Just added this:

; Open in overwrite mode (2) to empty the file, then close the returned handle
FileClose(FileOpen($file, 2))
Edited by andrewz


 

My pleasure. ;)

  • Moderators

I'm so confused...

1.  You're not assigning the result of the "first" _IECreate() to a variable

2.  You're using Sleep() instead of _IELoadWait()

3.  You then go to the same URL "again" right after the sleep, this time assigning the _IECreate() object to a variable

4.  You're constantly declaring variables and creating files (overwriting older ones) within a loop

Common sense plays a role in the basics of understanding AutoIt... If you're lacking in that, do us all a favor, and step away from the computer.
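
The four points above could be addressed along these lines. This is an untested sketch, not a verified fix: _ExampleBizLinks is a hypothetical helper, the three-page range is only for illustration, and the 10-second polling timeout is an assumption about how long the late-loading results take.

```autoit
#include <IE.au3>

; Hypothetical helper: return every /biz/ link currently in the page, one per line.
Func _ExampleBizLinks(ByRef $oIE)
    Local $sOut = ""
    Local $oLinks = _IELinkGetCollection($oIE)
    If @error Then Return ""
    For $oLink In $oLinks
        ; Keep hotel links only and skip ones already collected from this page.
        If StringInStr($oLink.href, "/biz/") And Not StringInStr($sOut, $oLink.href & @CRLF) Then
            $sOut &= $oLink.href & @CRLF
        EndIf
    Next
    Return $sOut
EndFunc   ;==>_ExampleBizLinks

Local $sBase = "http://www.yelp.de/search?find_desc=Hotels&find_loc=Berlin&ns=1#start="

; Points 1 and 3: create one browser window and keep the object for every page.
Local $oIE = _IECreate("about:blank")
If @error Then Exit 1

; Point 4: declare variables and open the output file once, outside the loop.
Local $hOut = FileOpen("Edited.txt", 2)
Local $sPrev = "", $sLinks = "", $hTimer

For $iStart = 0 To 20 Step 10
    _IENavigate($oIE, $sBase & $iStart & "&cflt=hotels", 0)
    _IELoadWait($oIE) ; point 2: wait for the document instead of a fixed Sleep()

    ; The results arrive after the document is ready, so poll until they differ
    ; from the previous page or the assumed 10-second timeout expires.
    $hTimer = TimerInit()
    Do
        Sleep(250)
        $sLinks = _ExampleBizLinks($oIE)
    Until ($sLinks <> "" And $sLinks <> $sPrev) Or TimerDiff($hTimer) > 10000

    ; Write only when the page actually produced a new set of links.
    If $sLinks <> "" And $sLinks <> $sPrev Then FileWrite($hOut, $sLinks)
    $sPrev = $sLinks
Next

FileClose($hOut)
_IEQuit($oIE)
```

Polling for a change in the collected links, rather than sleeping a fixed five seconds, is one way to cope with a slow PC, though it still depends on how the site loads its results.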
