
Extracting data from the source of a website - won't work :(


andrewz

Go to solution Solved by SmOke_N,


I'm so confused...

1.  You're not assigning the "first" _IECreate() call to a variable

2.  You're using Sleep() instead of _IELoadWait()

3.  You then go to the same URL "again" right after the sleep, this time storing the _IECreate() object in a declared variable

4.  You're constantly declaring variables and creating files (overwriting the older ones) inside a loop

 

1. + 2. Because of the way this website loads (some kind of JavaScript, I don't know exactly), neither of these would work.

I solved it the way I did in the code above.

3.) The Sleep is there so the page can load completely. With the approach from points 1 and 2, the website loads fully, but then a small loading popup appears.

4.) It doesn't matter if I overwrite them: I go through all the pages and then filter each page for the hotel links.

Here is the full WORKING code (if it can be optimized, feel free to correct me :), thanks!):

#NoTrayIcon
#include <IE.au3>
#include <MsgBoxConstants.au3>
#include <File.au3>
#include <Inet.au3>

FileDelete("Links.txt")
_FileCreate("Edited.txt")
_FileCreate("Links.txt")

$timesran = 0
Do

Global $url = "http://www.yelp.de/search?find_desc=Hotel&find_loc=Frankfurt+am+Main%2C+Hessen&cflt=hotels#start="&$timesran
_IECreate($url)
Sleep(5000)
Local $oIE = _IECreate($url,1,1)
Local $oLinks = _IELinkGetCollection($oIE)
Local $iNumLinks = @extended

Local $sTxt = $iNumLinks & " links found" & @CRLF & @CRLF
For $oLink In $oLinks
    $sTxt &= $oLink.href & @CRLF
Next

FileWrite("Links.txt",$sTxt)


$file = "Links.txt"
FileOpen($file, 0)
For $i = 1 to _FileCountLines($file)
    $line = FileReadLine($file, $i)
If StringInStr($line,"http://www.yelp.de/biz/") = true Then
    $content = FileRead("Edited.txt")
    If StringInStr($content,$line) = false Then
    FileWrite("Edited.txt",$line & @CRLF)
    EndIf
    EndIf
Next
FileClose($file)
FileOpen($file, 2)
FileClose($file)

WinKill("Suche in Hotel für Hotel – Frankfurt am Main, Hessen | Yelp - Internet Explorer")
sleep(500)

$timesran += 10

Until $timesran = "710"

$msg=_FileCountLines ("Edited.txt")
MsgBox(0,"",$msg)
Edited by andrewz

There is really no point in calling _IECreate twice, since _IECreate already runs _IELoadWait internally.

Sleep does the same job: _IELoadWait pauses internally until the page reports that loading is complete (it is not literally a Sleep, but the effect is essentially the same).

Hey, if it works for you, that is awesome! But there are always little things that can be fixed here and there. :D
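As a rough illustration of the point about the built-in wait, here is a minimal sketch with a single _IECreate call. The URL is only an example and the 10 second timeout is an arbitrary assumption; it needs Internet Explorer on the machine to run.

```autoit
#include <IE.au3>

; Assumption: any reachable page works here; the URL is only an example
Local $sURL = "http://www.yelp.de"

; The 4th parameter ($iWait) defaults to 1, so _IECreate waits for the
; page load itself; no second _IECreate call is needed
Local $oIE = _IECreate($sURL)
If @error Then
    MsgBox(16, "Error", "Could not create the IE window")
    Exit 1
EndIf

; Optional explicit wait with a timeout of 10000 ms (arbitrary value)
_IELoadWait($oIE, 0, 10000)
If @error Then ConsoleWrite("Page did not finish loading in time" & @CRLF)

_IEQuit($oIE)
```

If the page keeps changing after the load event (a popup drawn by script, for instance), a short Sleep after the wait may still be needed.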


  • Moderators
  • Solution

I imagine you could benefit immensely from code that doesn't constantly open and close the IE window, as well as code that doesn't constantly open and close files.

The links file seems to be redundant, since you delete it every time the script starts.

Your FileClose function calls are incorrect: FileClose requires the handle returned by the FileOpen call, not the file name.
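To make the handle point concrete, here is a minimal sketch. The file name and the link written to it are made up for the demonstration.

```autoit
#include <FileConstants.au3>

; Hypothetical file name, used only for this demonstration
Local $sFile = @TempDir & "\handle_demo.txt"

; FileOpen returns a handle, or -1 if the file could not be opened
Local $hFile = FileOpen($sFile, $FO_OVERWRITE)
If $hFile = -1 Then Exit 1

FileWriteLine($hFile, "http://www.yelp.de/biz/example") ; made-up link
FileClose($hFile) ; pass the handle, not the file name

; Reading by file name opens and closes the file within the call
ConsoleWrite(FileRead($sFile))
FileDelete($sFile) ; clean up the demo file
```

Functions such as FileRead and _FileCountLines also accept a plain file name, in which case no FileOpen or FileClose is needed at all.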

Give this a whirl, and see where I shortened your code and improved its speed and reliability.

#NoTrayIcon
#include <IE.au3>
#include <File.au3>
#include <MsgBoxConstants.au3>

; keep track of your globals, know what is being used
Global $gsURL, $goLinks
Global $gsHREFs, $gsLink, $gsFRead
Global $gsFile = "Edited.txt"

Global $gsPreURL = "http://www.yelp.de/search?find_desc=Hotel&find_loc=" & _
    "Frankfurt+am+Main%2C+Hessen&cflt=hotels#start="
Global $goIE = _IECreate("http://www.yelp.de")
_IELoadWait($goIE) ; wait for page to load

; ideally, detect the real number of result pages first;
; that way you don't request non-existent pages, which
;  speeds things up considerably
Global $giMaxRunTimes = 710

; read the links already saved, then open the file for appending
If Not FileExists($gsFile) Then _FileCreate($gsFile)
$gsFRead = FileRead($gsFile) ; reading by name opens and closes the file itself
Global $ghFile = FileOpen($gsFile, 1) ; 1 = append mode; $ghFile is the handle FileClose needs

For $i = 0 To $giMaxRunTimes Step 10
    $gsURL = $gsPreURL & $i ; base URL plus the result offset
    _IENavigate($goIE, $gsURL)
    _IELoadWait($goIE) ; wait for page to load
    Sleep(1000) ; sanity pause
    
    $goLinks = _IELinkGetCollection($goIE)
    ; no sense in continuing if there are no links
    If Not @extended Then ContinueLoop
    
    $gsHREFs = "" ; clear container variable
    For $oLink In $goLinks
        $gsLink = $oLink.href
        If StringInStr($gsLink, "http://www.yelp.de/biz/") Then
            ; get only unique strings
            If Not StringInStr($gsHREFs, $gsLink & @CRLF, 0, 1) And _
                Not StringInStr($gsFRead, $gsLink & @CRLF, 0, 1) Then
                $gsHREFs &= $gsLink & @CRLF
            EndIf
        EndIf
    Next
    
    If $gsHREFs Then
        FileWrite($ghFile, $gsHREFs)
        $gsFRead &= $gsHREFs ; remember what has been written so far
    EndIf
    
Next
; attempt to close IE window properly
_IEQuit($goIE)

; close the handle first so buffered writes reach the disk before counting
FileClose($ghFile)
MsgBox(64, "Total Lines In Edit file", _FileCountLines($gsFile))
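One way to avoid a hard-coded page limit, sketched under assumptions: count how many new business links each page contributes and stop once a page adds none. The helper name _CollectNewLinks and the sample arrays are made up for illustration; in the real script the hrefs would come from _IELinkGetCollection.

```autoit
; Hypothetical helper: appends the business links from $aHrefs that are not
; yet listed in $sSeen (one link per line) and returns how many were new.
Func _CollectNewLinks(ByRef $sSeen, $aHrefs)
    Local $iNew = 0
    For $sHref In $aHrefs
        If StringInStr($sHref, "http://www.yelp.de/biz/") And _
                Not StringInStr($sSeen, $sHref & @CRLF, 0, 1) Then
            $sSeen &= $sHref & @CRLF
            $iNew += 1
        EndIf
    Next
    Return $iNew
EndFunc   ;==>_CollectNewLinks

; Made-up sample data standing in for the links of two result pages
Local $aPage1[3] = ["http://www.yelp.de/biz/hotel-a", "http://www.yelp.de/about", "http://www.yelp.de/biz/hotel-b"]
Local $aPage2[2] = ["http://www.yelp.de/biz/hotel-a", "http://www.yelp.de/biz/hotel-b"]

Local $sSeen = ""
ConsoleWrite(_CollectNewLinks($sSeen, $aPage1) & @CRLF) ; 2 new links
ConsoleWrite(_CollectNewLinks($sSeen, $aPage2) & @CRLF) ; 0 new links, a loop could stop here
```

In a paging loop, a return value of 0 would be the signal to ExitLoop instead of running all the way to a fixed maximum.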


Edited by SmOke_N



Wow, thank you! Your code looks 100x better than mine. It's so neat and perfect ;)

I didn't think the _IELoadWait function would wait long enough, but apparently it does with the sanity pause you included. I love it, as it is also sooo fast now.

And yeah, I understand that creating an IE window every time, and doing the same with the files, is totally inefficient, slow, and really unreliable, as IE often throws errors in AutoIt on this older computer, dunno why.

It will probably be a long way until my code looks as neat as yours, but it's fun doing this stuff in AutoIt rather than sitting here and exporting the hotels manually. (Damn school-forced internship)

The last thing I appreciate is the way you solved the thing with the Links.txt file. It seemed a bit complicated, so I used the easy but bad method ^^.

Best regards,

Andrewz

Edited by andrewz