Exporting Data from a Website


andrewz

Go to solution Solved by andrewz,

Hey ;P

I don't know how to start, so here is an explanation of what I want to automate:

Export data from a website called immobilienscout24.de (a site where people list properties), for example the name of the owner, where the property is located, and how much it costs.

The data is ALWAYS stored at the same place in the page source. For instance:

<div class="margin-bottom font-line-l">
    <span data-qa="contactName" class="font-bold">Herr Thomas und Uschi Westhoff</span>

Is there any function in AutoIt to extract this kind of data? In this case it would be the name "Herr Thomas und Uschi Westhoff" (yeah, German names, haha). I can't seem to find one :( By exporting I mean just saving it to a variable or the clipboard.

Here is the link I used for the example:

http://www.immobilienscout24.de/expose/78279770

 

I would be so thankful if anyone could give me an idea of how to start, as it takes ages to copy and paste all this data into Excel by hand.

Thanks in advance & best regards

Andrewz

Edited by andrewz

Have you had a chance to look at the _IE functions reference?

EDIT: By the way, welcome to the AutoIt forum! :D

Edited by MikahS


  • Solution

 

Example of an XPath to use in my sig:

$xpath = "//span[@data-qa='contactName']"

 

Sort of ..

$txt = BinaryToString(InetRead("http://www.immobilienscout24.de/expose/78279770", 1))
$name = StringRegExpReplace($txt, '(?is).*contactname.*?>([^<]+).*', "$1")
Msgbox(0,"", $name)

:)
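
One caveat with the StringRegExpReplace approach: when the pattern does not match, the function returns its input unchanged, so $name would end up holding the whole page source. A minimal sketch of a variant that captures the name with StringRegExp and checks @error instead is shown here. The sample markup is copied from the first post so the snippet runs offline; the variable names are only illustrative, and the pattern assumes the page still uses the data-qa="contactName" attribute.

```autoit
; Sample markup copied from the first post, so no network access is needed
Local $sHtml = '<div class="margin-bottom font-line-l">' & @CRLF & _
        '    <span data-qa="contactName" class="font-bold">Herr Thomas und Uschi Westhoff</span>'

; Capture the text of the span whose data-qa attribute is "contactName".
; Flag 1 returns an array of captured groups and sets @error when nothing matches.
Local $aName = StringRegExp($sHtml, '(?i)data-qa="contactName"[^>]*>([^<]+)', 1)
Local $sName = @error ? "" : $aName[0]

ClipPut($sName) ; the original question also mentioned the clipboard
ConsoleWrite($sName & @CRLF)
```

For a live page, $sHtml would come from BinaryToString(InetRead(...)) as in the snippet above.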

 

Thank you both so much! I guess I couldn't have figured that out myself since I am still a beginner :P

@mikell, that works perfectly :P I'm going to build a full application out of it to grab all the required data from these properties and save it to a CSV table; that should be easy.

It's because I'm currently doing a school internship at an estate agency (I didn't have a choice, otherwise I would have gone for an IT company straight away, lol), and they always export hundreds of properties into Excel by copy and paste, which of course takes ages.

best regards

Edited by andrewz

I've almost got it now, but there is an error that I don't know how to bypass or fix.

Currently, the script only works if all the data is present on the website. If one field is missing because the owner didn't include it, the script writes nothing and exits.

So here it is:

If FileExists("Immobilien.csv") =false Then
FileWrite("Immobilien.csv","Name;Adresse;Tel;Objekt;Ort;Baujahr;Zi;frei/vermie.;Wfl./ qm;Kaltmiete;Warmmietpreis;Scout- ID"& @CRLF)
EndIf

#include <Inet.au3>
#include <Array.au3>
#include <String.au3>
Global $mobil_A= "0"
Global $telefon_A = "0"
Global $url = InputBox("ScoutID","Enter the Scout-ID")
Global $content = _INetGetSource($url)
Global $name_A = _StringBetween($content, '<span data-qa="contactName" class="font-bold">', '</span>')
Global $preis_A = _StringBetween($content, ' "offerPrice": "', '",')
Global $strase_A = _StringBetween($content, '<strong class="font-standard">' , '</strong><br/>')
Global $telefon_A = _StringBetween($content, '<div class="is24-phone-number hide">' ,'</div>')
Global $objekttyp_A = _StringBetween($content, '<dd class="is24qa-wohnungstyp">' ,'</dd>')
Global $ort_A = _StringBetween($content, '</strong><br/>' , '<br/>')
Global $baujahr_A = _StringBetween($content, '<dd class="is24qa-baujahr">','</dd>')
Global $zimmer_A = _StringBetween($content, '<dd class="is24qa-zimmer">','</dd>')
Global $bezugsfrei_A = _StringBetween($content, '<dd class="is24qa-bezugsfrei-ab">' ,'</dd>')
Global $wohnflache_A = _StringBetween($content, '<dd class="is24qa-wohnflaeche-ca">' ,'</dd>')
Global $preiswarm_A =_StringBetween($content, '<strong class="is24qa-gesamtmiete">','</strong>')




$aio= $name_A[0]&";"&$strase_A[0]&";"&$telefon_A[0]&";"&$objekttyp_A[0]&";"&$ort_A[0]&";"&$baujahr_A[0]&";"&$zimmer_A[0]&";"&$bezugsfrei_A[0]&";"&$wohnflache_A[0]&";"&$preis_A[0]&",00"&";"&$preiswarm_A[0]&";"&$url

$sString1 = StringReplace($aio, " ", "") ;removing spaces -to format it later to csv
$sString2 = StringReplace($sString1, "<p>", "") ;removing <p> -useless
$sString3 = StringReplace($sString2, "<span>Mobil:</span>", "") ;removing <span>Mobil:</span> -useless
$sString4 = StringReplace($sString3, "</p>", "") ;removing </p> - useless
$sString5 = StringReplace($sString4, "Â", "") ;removing the stray Â that shows up before the ² in m²
$sString6 = StringReplace($sString5, '<spanclass="is24-operator">=</span>', "") ;removing <spanclass="is24-operator">=</span> -useless
$sString7 = StringReplace($sString6, "EUR", "") ;removing EUR -useless cuz we will format it later in excel
$sString8 = StringReplace($sString7, "<span>Telefon:</span>","") ;removing <span>Telefon:</span> -useless
$sStringfinal = StringReplace($sString8, @CRLF, "") ;finally removing @CRLF to get a csv format


FileWrite ( "Immobilien.csv", $sStringfinal & @CRLF )

I did it a bit differently because it was easier for me this way.

Now if you try it with this link: http://www.immobilienscout24.de/expose/78294144 it works perfectly.

BUT with this link: http://www.immobilienscout24.de/expose/78295011 it exits, because of course it can't find the address, for example, which is given in the first link as "Grasserstr. 5" but is missing in the second link.

Is there any way to skip a variable or set it to 0 if it can't be found?

Thanks in advance!

Edited by andrewz

This is a bit of code that I use in a script of mine. If any or all of the fields from the fourth one onward are blank, the script continues; it doesn't exit. Note that I am using some of the _IE* functions.

$oForm = _IEFormGetObjByName($oIE, "cpform")

$Spammer[0] = _IEFormElementGetObjByName($oForm, "user[username]")
$Spammer[0] = _IEFormElementGetValue($Spammer[0])
$Spammer[1] = _IEFormElementGetObjByName($oForm, "user[email]")
$Spammer[1] = _IEFormElementGetValue($Spammer[1])
$Spammer[2] = _IEFormElementGetObjByName($oForm, "user[ipaddress]")
$Spammer[2] = _IEFormElementGetValue($Spammer[2])
$Spammer[3] = _IEFormElementGetObjByName($oForm, "user[homepage]")
$Spammer[3] = _IEFormElementGetValue($Spammer[3])
$Spammer[4] = _IEFormElementGetObjByName($oForm, "profile[field1]") ;Biography
$Spammer[4] = _IEFormElementGetValue($Spammer[4])
$Spammer[5] = _IEFormElementGetObjByName($oForm, "profile[field2]") ;Location
$Spammer[5] = _IEFormElementGetValue($Spammer[5])
$Spammer[6] = _IEFormElementGetObjByName($oForm, "profile[field3]") ;Interests
$Spammer[6] = _IEFormElementGetValue($Spammer[6])
$Spammer[7] = _IEFormElementGetObjByName($oForm, "profile[field4]") ;Occupation
$Spammer[7] = _IEFormElementGetValue($Spammer[7])
This is a bit of code from another script that I have written. Note that it uses the native Inet function '_INetGetSource'. If any of the array elements don't exist, the script does not quit. I don't know if either of these 'code bits' will help you, but good luck with your project!

 

Global $Banyan_Calico[5] = ["Registering", "Activating", "Modifying", "Viewing User Profile", "Viewing User Control Panel"], $Quatrain

While 1
Local $Source = _INetGetSource("http://forum.powweb.com/online.php?who=members")
If StringInStr($Source, "The server is too busy at the moment.") <> 0 Then MsgBox(48 + 4096, "Oh No!!", "Busy server.", 3) ;If text does exist
For $a = 0 To UBound($Banyan_Calico) - 1
If StringInStr($Source, $Banyan_Calico[$a], 1) <> 0 Then ;If text does exist
SoundPlay(@ScriptDir & "\foghorn.mp3")
MsgBox(48 + 4096, @ScriptName, $Banyan_Calico[$a], 3)
Whoson()
EndIf
Next
TraySetIcon("hourglass.ico")
Timer()
WEnd
Edited by somdcomputerguy

- Bruce /*somdcomputerguy */  If you change the way you look at things, the things you look at change.


Spammer :P, anyway, thanks a lot! Let's see if I can get this working now...

best regards,

Andrewz


Ya, $Spammer[] :) I chose that variable name since I use the script to get info from another forum that I moderate. That way I don't have to clip/paste all the necessary info individually, which takes quite a long time. BTW, you don't need to quote any or all of my post(s), I know what I have written. A partial quote may help someone else know what you are replying to, but again it's not really necessary.


One way to solve the error problem is error checking - obviously :)

$preis_A = _StringBetween($content, ' "offerPrice": "', '",')
$preis = (IsArray($preis_A) = 1) ? $preis_A[0] : "not found"

Using this small example, if the _StringBetween fails then the returned result is "not found" instead of nothing


Hey, thanks, that will work too :P

I did it this way:

If  IsArray($preis_A) Then
$preis_B = $preis_A[0]
Else
$preis_B = "not found"
EndIf

Later on I use $preis_B to display the data.

Which way do you think is better (maybe in terms of resource usage)? The one I use or the one you provided? Yours looks shorter, so maybe it is better, but I don't know anything about this...

Edited by andrewz
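
As far as I can tell, the two forms above perform the same IsArray check, so any difference in resource use should be negligible. The more practical concern is repeating the check for every field. A minimal sketch of one way to avoid that is shown here; _FirstBetween is a hypothetical helper name (it is not part of String.au3), and the sample text is made up so the snippet runs without a network connection.

```autoit
#include <String.au3>

; Hypothetical helper (not from the thread): returns the first match
; between $sStart and $sEnd, or $sDefault when nothing is found.
Func _FirstBetween($sText, $sStart, $sEnd, $sDefault = "not found")
    Local $aMatch = _StringBetween($sText, $sStart, $sEnd)
    ; _StringBetween sets @error and returns a non-array value on failure
    If @error Or Not IsArray($aMatch) Then Return $sDefault
    Return $aMatch[0]
EndFunc

; Made-up sample text standing in for the downloaded page source
Local $sContent = '<dd class="is24qa-zimmer">3</dd> "offerPrice": "750",'

Local $sZimmer = _FirstBetween($sContent, '<dd class="is24qa-zimmer">', '</dd>')
Local $sPreis = _FirstBetween($sContent, ' "offerPrice": "', '",')
; This field is missing in the sample, so the default "0" is used instead
Local $sBaujahr = _FirstBetween($sContent, '<dd class="is24qa-baujahr">', '</dd>', "0")

ConsoleWrite($sZimmer & ";" & $sPreis & ";" & $sBaujahr & @CRLF)
```

With a helper like this, each field in the CSV line becomes one call, and a missing field yields the default instead of stopping the script.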

  • 1 month later...

Hi All,

I've been struggling with something similar for the last couple of days (I've been browsing the forums for a possible solution; arrays and such are still kind of new to me).

The solution above seemed like a great one for me too, but it doesn't do exactly what it's supposed to do.

It does create a .csv, and every time I run the script it adds another line, but it doesn't seem to find the info in the HTML/website (all it gives are 0's). So I suspect that the script either doesn't read the site or can't find the info that I want. I've been racking my brain over it all weekend, but I can't find where I've gone wrong. :ermm:

Here is the script I use to test it, and the HTML page I test it with:

HotKeySet("{ESC}", "Terminate")

Opt("WinTextMatchMode", 2)      ;1=complete, 2=quick
Opt("WinTitleMatchMode", 1)     ;1=start, 2=subStr, 3=exact, 4=advanced, -1 to -4=Nocase
AutoItSetOption("MouseCoordMode", 0)
opt("SendKeyDelay",90)
opt("WinWaitDelay",35)
opt("TrayIconDebug",1)
#include <IE.au3>
#include <Inet.au3>
#include <Array.au3>
#include <String.au3>
#include <MsgBoxConstants.au3>
If FileExists("C:\Data\Auto ITs\check\check.csv") =false Then
FileWrite("C:\Data\Auto ITs\check\check.csv","Actief;Lidstaat;nummer;Tijdstip waarop de aanvraag werd ontvangen;Naam;Adres;Cnummer"& @CRLF)
EndIf

$content = _INetGetSource("C:\Data\Auto ITs\check\Test.htm")
$Status = _StringBetween($content, '<span class="validStyle">', "</span></b></td>")
$Lidstaat = _StringBetween($content, '<td class="labelStyle">Lidstaat</td> <td>' , '</td>')
$nr = _StringBetween($content, '<td class="labelStyle">nummer</td> <td>' , '</td>')
$Tijd = _StringBetween($content, '<td class="labelStyle">Tijdstip waarop de aanvraag werd ontvangen</td> <td>' , '</td>')
$Naam = _StringBetween($content, '<td class="labelStyle">Naam</td> <td>' , '</td>')
$Adres= _StringBetween($content, '<td class="labelStyle">Adres</td> <td>' , '</td>')
$Cnummer = _StringBetween($content, '<td class="labelStyle">Cnummer</td> <td>' , '</td>')

$aio= $Status&";"&$Lidstaat&";"&$nr&";"&$Tijd&";"&$Naam&";"&$Adres&";"&$Cnummer

$sString1 = StringReplace($aio, " ", "") ;removing spaces -to format it later to csv
$sString2 = StringReplace($sString1, "<p>", "") ;removing <p> -useless
$sString3 = StringReplace($sString2, "<span>Mobil:</span>", "") ;removing <span>Mobil:</span> -useless
$sString4 = StringReplace($sString3, "</p>", "") ;removing </p> - useless
$sString5 = StringReplace($sString4, "Â", "") ;removing the stray Â that shows up before the ² in m²
$sString6 = StringReplace($sString5, '<spanclass="is24-operator">=</span>', "") ;removing <spanclass="is24-operator">=</span> -useless
$sString7 = StringReplace($sString6, "EUR", "") ;removing EUR -useless cuz we will format it later in excel
$sString8 = StringReplace($sString7, "<span>Telefon:</span>","") ;removing <span>Telefon:</span> -useless
$sStringfinal = StringReplace($sString8, @CRLF, "") ;finally removing @CRLF to get a csv format

FileWrite ( "check.csv", $sStringfinal & @CRLF )

Func Terminate()
    Exit 0
EndFunc

The HTML test page

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
    
    
    <title>Test</title>
</head>
<body>
<a id="top-page" name="top-page"></a>
<div id="layout" class="layout">








<div id="header">
    

    <h2>Info</h2>
    <fieldset>
        <table id="vatResponseFormTable">
            <tr>
                <td class="labelLeft" colspan="3"><b><span class="validStyle">Ja, correct</span></b></td> 
           </tr>
           <tr>
                <td><br /></td>
            </tr>
            <tr>
                <td class="labelStyle">Lidstaat</td> 
                <td>NL</td>
                <td class="errorFormStyle"></td>
            </tr>
            <tr>
                <td class="labelStyle">nummer</td> 
                <td>820471616gdwsg01</td>
            </tr>
            <tr>
                <td class="labelStyle">Tijdstip waarop de aanvraag werd ontvangen</td> 
                <td>2015/01/12 12:28:03</td>
            </tr>
            
                <tr>
                    <td class="labelStyle">Naam</td> 
                    <td>T. Est
</td>
                    
                </tr>
             
             
             
            
                <tr>
                    <td class="labelStyle">Adres</td> 
                    <td><br />Straat 00189<br />1234AA Stad<br />
</td>
                </tr>
            
             
            
                <tr>
                    <td class="labelStyle">Cnummer</td> 
                    <td></td>
                </tr>
            
        </table>
        <br />
        <p><a href="backtest.html">Back</a></p>
    </fieldset>

                </div>
            </div>
        </div>
    </div>
</div>



</div>

</body>
</html>

If somebody could point out where I've gone wrong or point me in the right direction, it would be greatly appreciated :)

Thanks in advance!

-Kap
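
A few things in the script above could plausibly explain the 0's. _StringBetween returns an array on success and 0 on failure, so joining its results directly with & never yields the matched text. The start delimiters such as '</td> <td>' assume a single space between the cells, while the test page has a line break and indentation there. _INetGetSource is intended for URLs, so for a local file FileRead is the usual choice. The header check and the final FileWrite also point at different paths. Below is a minimal sketch of one way around the whitespace problem, using StringRegExp with \s* between the cells. _CellAfterLabel is a hypothetical helper name, and the inline HTML is an excerpt of the test page so the snippet runs as-is.

```autoit
; Hypothetical helper: finds the <td> that follows the label cell,
; allowing any whitespace (including line breaks) between the two cells.
Func _CellAfterLabel($sHtml, $sLabel, $sDefault = "")
    Local $sPattern = '(?s)<td class="labelStyle">\Q' & $sLabel & '\E</td>\s*<td>(.*?)</td>'
    Local $aMatch = StringRegExp($sHtml, $sPattern, 1)
    If @error Then Return $sDefault ; label not present in the page
    ; Collapse <br /> tags and runs of whitespace into single spaces
    Local $sValue = StringStripWS(StringRegExpReplace($aMatch[0], '(?:<br\s*/?>|\s)+', ' '), 3)
    If $sValue == "" Then Return $sDefault ; cell exists but is empty
    Return $sValue
EndFunc

; Inline excerpt of the test page above. For the real file,
; FileRead("C:\Data\Auto ITs\check\Test.htm") would supply $sHtml instead.
Local $sHtml = '<td class="labelStyle">Lidstaat</td> ' & @CRLF & _
        '                <td>NL</td>' & @CRLF & _
        '<td class="labelStyle">Adres</td> ' & @CRLF & _
        '                <td><br />Straat 00189<br />1234AA Stad<br />' & @CRLF & '</td>' & @CRLF & _
        '<td class="labelStyle">Cnummer</td> ' & @CRLF & _
        '                <td></td>'

Local $sLine = _CellAfterLabel($sHtml, "Lidstaat") & ";" & _
        _CellAfterLabel($sHtml, "Adres") & ";" & _
        _CellAfterLabel($sHtml, "Cnummer", "0")
ConsoleWrite($sLine & @CRLF)
```

The chain of StringReplace calls copied from the earlier script (for "EUR", "Mobil:" and so on) would not be needed for this page, and removing every space would also glue the address parts together.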


  • 3 years later...

Hi,

I have a question on this topic too. I am very new to XPath and I am trying to import the address of an exposé on ImmobilienScout into Google Spreadsheets.

I am using this URL as an example: https://www.immobilienscout24.de/expose/104781577

With the function =importxml(URL;//div[@class="address-block"]/div) I get the address, but since the address appears twice on the page, I get it twice. I would like to get it only once; I tried many ways of specifying precisely where one of the two versions of the address is, but they don't work... any ideas?

Best,
Claire

 


Well... writing my question inspired me to find the answer! I focused on the 2nd version of the address and wrote this: //div[@class="grid-item automatic-width padding-right"]/div

and it works! But I am still interested in learning about a way to get data from a precise place in the document, with the example of the 1st version of the address. I think it would be more advanced than the solution I found (?).


Hi,

Me again! I have a more important question that has been blocking me for quite some time already!

Still on the same page on ImmobilienScout (https://www.immobilienscout24.de/expose/104781577), I would like to extract all the images of the flat that is on this page. All their urls look the same : 

There are 11 of them in this precise case.

I tried all kinds of Xpaths (with the Importxml function on GoogleSheets) to automatically extract all these image urls, but it doesn't work. Sometimes I get 3 urls while there are 11, no idea why! Any idea? 


Not so difficult using a regular expression on the source code of the page  :)

#Include <Array.au3>

$txt = BinaryToString(InetRead("https://www.immobilienscout24.de/expose/104781577"))

; get all img
; $img = StringRegExp($txt, 'https://pictures.immobilienscout24.de/listings/[^"]+', 3)

; specific size
$img = StringRegExp($txt, '(https://pictures.immobilienscout24.de/listings/[^"]+?.jpg)[^"]+?1106x830[^"]+', 3)

_ArrayDisplay($img)

 

Edited by mikell