
Collecting Website data w/ Array


Kap

Go to solution Solved by SmOke_N,


Hello All,

My apologies for the (sort of) double post, but I just realised it probably wasn't the best idea to ask my question in an already answered topic.

The above-mentioned topic helped me a great deal, but the script doesn't do exactly what it is supposed to do (or at least what I want it to do). And being new to arrays, I'm not quite sure where I go wrong.

It does create a .csv, and every time I run the script it adds another line, but it doesn't seem to find the info in the HTML/website (all it gives are 0's). So I suspect that the script either doesn't read the site or doesn't find the info that I want. I've been racking my brain over it all weekend, but can't find where I went wrong.

My script:

HotKeySet("{ESC}", "Terminate")

Opt("WinTextMatchMode", 2)      ;1=complete, 2=quick
Opt("WinTitleMatchMode", 1)     ;1=start, 2=subStr, 3=exact, 4=advanced, -1 to -4=Nocase
AutoItSetOption("MouseCoordMode", 0)
opt("SendKeyDelay",90)
opt("WinWaitDelay",35)
opt("TrayIconDebug",1)
#include <IE.au3>
#include <Inet.au3>
#include <Array.au3>
#include <String.au3>
#include <MsgBoxConstants.au3>
If FileExists("C:\Data\Auto ITs\check\check.csv") =false Then
FileWrite("C:\Data\Auto ITs\check\check.csv","Actief;Lidstaat;nummer;Tijdstip waarop de aanvraag werd ontvangen;Naam;Adres;Cnummer"& @CRLF)
EndIf

$content = _INetGetSource("C:\Data\Auto ITs\check\Test.htm")
$Status = _StringBetween($content, '<span class="validStyle">', "</span></b></td>")
$Lidstaat = _StringBetween($content, '<td class="labelStyle">Lidstaat</td> <td>' , '</td>')
$nr = _StringBetween($content, '<td class="labelStyle">nummer</td> <td>' , '</td>')
$Tijd = _StringBetween($content, '<td class="labelStyle">Tijdstip waarop de aanvraag werd ontvangen</td> <td>' , '</td>')
$Naam = _StringBetween($content, '<td class="labelStyle">Naam</td> <td>' , '</td>')
$Adres= _StringBetween($content, '<td class="labelStyle">Adres</td> <td>' , '</td>')
$Cnummer = _StringBetween($content, '<td class="labelStyle">Cnummer</td> <td>' , '</td>')

$aio= $Status&";"&$Lidstaat&";"&$nr&";"&$Tijd&";"&$Naam&";"&$Adres&";"&$Cnummer

$sString1 = StringReplace($aio, " ", "") ;removing spaces -to format it later to csv
$sString2 = StringReplace($sString1, "<p>", "") ;removing <p> -useless
$sString3 = StringReplace($sString2, "<span>Mobil:</span>", "") ;removing <span>Mobil:</span> -useless
$sString4 = StringReplace($sString3, "</p>", "") ;removing </p> - useless
$sString5 = StringReplace($sString4, "Â", "") ;removing  from m²
$sString6 = StringReplace($sString5, '<spanclass="is24-operator">=</span>', "") ;removing <spanclass="is24-operator">=</span> -useless
$sString7 = StringReplace($sString6, "EUR", "") ;removing EUR -useless cuz we will format it later in excel
$sString8 = StringReplace($sString7, "<span>Telefon:</span>","") ;removing <span>Telefon:</span> -useless
$sStringfinal = StringReplace($sString8, @CRLF, "") ;finally removing @CRLF to get a csv format

FileWrite ( "check.csv", $sStringfinal & @CRLF )

Func Terminate()
    Exit 0
EndFunc

And the test HTML

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
    
    
    <title>Test</title>
</head>
<body>
<a id="top-page" name="top-page"></a>
<div id="layout" class="layout">

<div id="header">
    

    <h2>Info</h2>
    <fieldset>
        <table id="vatResponseFormTable">
            <tr>
                <td class="labelLeft" colspan="3"><b><span class="validStyle">Ja, correct</span></b></td> 
           </tr>
           <tr>
                <td><br /></td>
            </tr>
            <tr>
                <td class="labelStyle">Lidstaat</td> 
                <td>NL</td>
                <td class="errorFormStyle"></td>
            </tr>
            <tr>
                <td class="labelStyle">nummer</td> 
                <td>820471616gdwsg01</td>
            </tr>
            <tr>
                <td class="labelStyle">Tijdstip waarop de aanvraag werd ontvangen</td> 
                <td>2015/01/12 12:28:03</td>
            </tr>
            
                <tr>
                    <td class="labelStyle">Naam</td> 
                    <td>T. Est
</td>
                    
                </tr>
         
                <tr>
                    <td class="labelStyle">Adres</td> 
                    <td><br />Straat 00189<br />1234AA Stad<br />
</td>
                </tr>

                <tr>
                    <td class="labelStyle">Cnummer</td> 
                    <td></td>
                </tr>
            
        </table>
        <br />
        <p><a href="backtest.html">Back</a></p>
    </fieldset>

                </div>
            </div>
        </div>
    </div>
</div>

</div>

</body>
</html>

If somebody could point out where I went wrong or send me in the right direction, it would be greatly appreciated :)

Thanks in advance!

-Kap


  • Moderators

Have you checked what the data looks like after the $content = _INetGetSource() call?

Is it in binary or is it a regular string?

Common sense plays a role in the basics of understanding AutoIt... If you're lacking in that, do us all a favor, and step away from the computer.


I haven't finished looking at your code yet, and cannot run it at the moment, but you might like to know there is a string function that removes spaces and other whitespace characters such as CRLF: StringStripWS.
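As a rough illustration of that suggestion, here is a minimal sketch of StringStripWS. The sample value is made up to resemble the "Naam" cell in the test HTML; the flag constants come from StringConstants.au3.

```autoit
#include <StringConstants.au3>

; Hypothetical sample value, shaped like the "Naam" cell in the test HTML.
Local $sRaw = "  T. Est" & @CRLF

; Strip leading and trailing whitespace only; spaces inside the value are kept.
Local $sTrimmed = StringStripWS($sRaw, $STR_STRIPLEADING + $STR_STRIPTRAILING)
ConsoleWrite("[" & $sTrimmed & "]" & @CRLF) ; prints [T. Est]

; Strip every whitespace character, including the space inside the value.
Local $sNoSpaces = StringStripWS($sRaw, $STR_STRIPALL)
ConsoleWrite("[" & $sNoSpaces & "]" & @CRLF) ; prints [T.Est]
```

For CSV output the first form is probably the one you want, since it keeps the spaces inside names and addresses.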

Make sure brain is in gear before opening mouth!
Remember, what is not said, can be just as important as what is said.

Spoiler

What is the Secret Key? Life is like a Donut

If I put effort into communication, I expect you to read properly & fully, or just not comment.
Ignoring those who try to divert conversation with irrelevancies.
If I'm intent on insulting you or being rude, I will be obvious, not ambiguous about it.
I'm only big and bad, to those who have an over-active imagination.

I may have the Artistic Liesense ;) to disagree with you. TheSaint's Toolbox (be advised many downloads are not working due to ISP screwup with my storage)


Hi Kap,

For me _INetGetSource only works for domains (e.g. www.google.com) not for local html documents. For local documents you could use:

FileRead("C:\Data\Auto ITs\check\Test.htm")

_StringBetween for $Lidstaat doesn't really work, because you are asking _StringBetween to match across a line break: in the HTML the label cell and the value cell are on separate lines, so the start string would have to include the @CRLF and @TAB (or space) characters that sit between them. I would recommend trying a completely different method, e.g. finding "<td>" (it's the second one for the NL you want to have). Or get the line number of "Lidstaat" and assign line+1 to a variable. Strip the last 5 characters (</td>) and the first 8 or so characters (the leading whitespace plus <td>), then you should have your NL.
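One possible way around the line-break problem, sketched here with StringRegExp instead of the methods described above, is to let the pattern skip whatever whitespace sits between the two cells. The helper name _GetCellAfterLabel is made up for this example, and the file location is an assumption.

```autoit
#include <StringConstants.au3>

; Assumption: the test page is saved next to the script.
Local $sContent = FileRead(@ScriptDir & "\Test.htm")

ConsoleWrite("Lidstaat: " & _GetCellAfterLabel($sContent, "Lidstaat") & @CRLF)
ConsoleWrite("nummer: " & _GetCellAfterLabel($sContent, "nummer") & @CRLF)

; Hypothetical helper: returns the text of the <td> that follows a label cell,
; or an empty string when the label is not found.
Func _GetCellAfterLabel($sHtml, $sLabel)
    ; (?s) lets "." match line breaks, \Q...\E treats the label literally,
    ; and \s* skips the newline and indentation between the two cells.
    Local $sPattern = '(?s)<td class="labelStyle">\Q' & $sLabel & '\E</td>\s*<td>(.*?)</td>'
    Local $aMatch = StringRegExp($sHtml, $sPattern, $STR_REGEXPARRAYMATCH)
    If @error Then Return ""
    Return StringStripWS($aMatch[0], $STR_STRIPLEADING + $STR_STRIPTRAILING)
EndFunc
```

This still depends on the exact markup of the label cells, so treat it as a sketch for this particular page rather than a general HTML parser.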

Also: _StringBetween returns an Array (0-based , look in the Help file :) )

So when assigning the contents you should write:

$aio= $Status[0]&";"&$Lidstaat[0]&";"&$nr[0]&";"&$Tijd[0]&";"&$Naam[0]&";"&$Adres[0]&";"&$Cnummer[0]
Edited by draien

Are you using MsgBoxes to test variable values at every point where they are created?

I don't see any variable declarations (Global, etc), but they might just be missing from example.

I don't see the required FileOpen preceding FileWrite. You could use _FileWriteLine instead.

If you just want a pre-created blank file, use _FileCreate

I personally prefer to use - If Not FileExists rather than your use of false.

Some of the commands I mention, are in the UDF section of the Help file.
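A minimal sketch of the handle-based approach, assuming the CSV sits next to the script and using a made-up data row. Note that FileWrite and FileWriteLine also accept a file name directly, so the FileOpen handle is optional; it is mostly useful when writing many lines in a row.

```autoit
#include <FileConstants.au3>

; Assumption: the CSV lives next to the script; adjust the path as needed.
Local $sCsv = @ScriptDir & "\check.csv"

; Write the header only when the file does not exist yet.
If Not FileExists($sCsv) Then
    FileWriteLine($sCsv, "Actief;Lidstaat;nummer;Tijdstip waarop de aanvraag werd ontvangen;Naam;Adres;Cnummer")
EndIf

; Open in append mode so earlier rows are kept.
Local $hFile = FileOpen($sCsv, $FO_APPEND)
If $hFile = -1 Then
    ConsoleWrite("Could not open " & $sCsv & @CRLF)
    Exit 1
EndIf

; Hypothetical row; FileWriteLine adds the line break itself.
FileWriteLine($hFile, "Ja, correct;NL;820471616gdwsg01;2015/01/12 12:28:03;T. Est;Straat 00189 1234AA Stad;")
FileClose($hFile)
```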

Edited by TheSaint


@draien also seems to know what he/she is talking about and makes some good points ... some of which I haven't tried or have no experience with.


Even though this is horribly scripted (I tried to fit in several different ways of handling this), it works for me.

Edit the $content = _GetSource line to suit your needs. If you want to fetch the page via _INetGetSource, you have to call the function like this:

$content = _GetSource("http://yourdomainhere.com",2)

The full script:

HotKeySet("{ESC}", "Terminate")

Opt("WinTextMatchMode", 2)      ;1=complete, 2=quick
Opt("WinTitleMatchMode", 1)     ;1=start, 2=subStr, 3=exact, 4=advanced, -1 to -4=Nocase
AutoItSetOption("MouseCoordMode", 0)
opt("SendKeyDelay",90)
opt("WinWaitDelay",35)
opt("TrayIconDebug",1)
#include <IE.au3>
#include <Inet.au3>
#include <Array.au3>
#include <String.au3>
#include <MsgBoxConstants.au3>
If FileExists(@ScriptDir & "\check.csv") =false Then
FileWrite(@ScriptDir & "\check.csv","Actief;Lidstaat;nummer;Tijdstip waarop de aanvraag werd ontvangen;Naam;Adres;Cnummer"& @CRLF)
EndIf

$content = _GetSource(@ScriptDir & "\Test.htm")
$start_Status = '<td class="labelLeft" colspan="3"><b><span class="validStyle">'
$end_Status = '</span></b></td>'
$Status = _StringBetween($content,$start_Status,$end_Status)



$Lidstaat = _GrabValue($content,4)
$nr = _GrabValue($content,7)
$Tijd = _GrabValue($content,9)
$Naam = _GrabValue($content,11)
$Naam = StringReplace($Naam,@CRLF,"")
$Adres = _GrabValue($content,13)
$Adres = StringReplace($Adres,"<br />","")
$Adres = StringReplace($Adres,@CRLF,"")
$Cnummer = _GrabValue($content,15)
$aio= $Status[0]&";"&$Lidstaat&";"&$nr&";"&$Tijd&";"&$Naam&";"&$Adres&";"&$Cnummer
MsgBox(0,"",$aio)

FileWrite ( "check.csv", $aio & @CRLF )

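Func _GrabValue($sContent,$iOccurence)
    ; Returns the text between the Nth "<td" and the Nth "</td>".
    ; Assumes the Nth cell is a plain "<td>" without attributes.
    Local $start
    Local $end

    $start = StringInStr($sContent,"<td",0,$iOccurence)
    $start += 4 ; skip past "<td>"
    $end = StringInStr($sContent,"</td>",0,$iOccurence)

    Return StringMid($sContent,$start,$end - $start)

EndFunc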

Func _GetSource($sHandle,$iMode=1)
    Switch $iMode
        Case 1
            Return FileRead($sHandle)
        Case 2
            Return _INetGetSource($sHandle)
        Case Else
            Return 0
    EndSwitch
EndFunc

Func Terminate()
    Exit 0
EndFunc
Edited by draien

Thanks draien! (and TheSaint too)

I'll go through your script to see what it does exactly :)

One thing though, it doesn't work for me :/ I get an error I've already seen a lot during my tinkering with this... (it might be that that's part of my problem)

"C:\Data\Auto ITs\BTW check\test2.au3" (34) : ==> Subscript used on non-accessible variable.:
$aio= $Status[0]&";"&$Lidstaat&";"&$nr&";"&$Tijd&";"&$Naam&";"&$Adres&";"&$Cnummer
$aio= $Status^ ERROR

For some reason I get this error when I use [0] after my variables.

Since the code works for you, could it be that I've got a wrong library or am missing something?

I use v3.3.12.0


  • Moderators

Did you run the consolewrite after _InetGetSource() like I suggested?

Try this:

$content = _INetGetSource("C:\Data\Auto ITs\check\Test.htm", True)

 

Your data is coming out in binary form, you're searching for strings.
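A quick way to check this is to print the variable's type before searching it. The sketch below uses the placeholder URL from earlier in the thread and assumes the page is UTF-8 encoded if it does come back as binary.

```autoit
#include <Inet.au3>
#include <StringConstants.au3>

; Placeholder URL from the thread; replace it with the real page.
Local $vContent = _INetGetSource("http://yourdomainhere.com")
If @error Then
    ConsoleWrite("Download failed" & @CRLF)
    Exit 1
EndIf

; VarGetType reports "String" or "Binary".
ConsoleWrite("Type: " & VarGetType($vContent) & @CRLF)

; Assumption: the page is UTF-8; convert binary data before searching it.
If IsBinary($vContent) Then
    $vContent = BinaryToString($vContent, $SB_UTF8)
EndIf

; Show the first part of what the string functions will actually see.
ConsoleWrite(StringLeft($vContent, 200) & @CRLF)
```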

Edited by SmOke_N


Would be much easier to load up the html into a DOM object, and traverse that than regexp.  It's very difficult to cover every possible html scenario via a regexp.
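A rough sketch of the DOM idea using the IE.au3 UDF, assuming Internet Explorer is still available on the machine and that the test page is at the path used in the original script. It walks the cells of the vatResponseFormTable table rather than matching strings.

```autoit
#include <IE.au3>
#include <StringConstants.au3>

; Assumption: IE is installed; the page is loaded in a hidden window.
Local $oIE = _IECreate("C:\Data\Auto ITs\check\Test.htm", 0, 0)
If @error Then Exit 1

; The table id is taken from the test HTML posted above.
Local $oTable = _IEGetObjById($oIE, "vatResponseFormTable")
If @error Then
    _IEQuit($oIE)
    Exit 1
EndIf

; Print the trimmed text of every cell, labels and values alike.
Local $oCells = _IETagNameGetCollection($oTable, "td")
For $oCell In $oCells
    ConsoleWrite("[" & StringStripWS($oCell.innerText, $STR_STRIPLEADING + $STR_STRIPTRAILING) & "]" & @CRLF)
Next

_IEQuit($oIE)
```

Because the browser does the parsing, line breaks and indentation between cells stop mattering, which is the main advantage over string matching here.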

IEbyXPATH-Grab IE DOM objects by XPATH IEscriptRecord-Makings of an IE script recorder ExcelFromXML-Create Excel docs without excel installed GetAllWindowControls-Output all control data on a given window.


This error indicates that $Status is not an array. So let's debug here for a second:

  • _StringBetween returns an array if something is found
  • The variable is not an array
  • The question you have to ask yourself here is: did _StringBetween find something? (No.)
    • Possible cause: the path/URL to the file is wrong. Did you edit this?
    • $content = _GetSource(@ScriptDir & "\Test.htm")

      You have to change @ScriptDir & "\Test.htm" to the path of your htm file (or copy the htm into your script directory)... I think it would be this:

    • $content = _INetGetSource("C:\Data\Auto ITs\check\Test.htm")
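The checks above can be sketched as a guard around the array access, so the script reports what went wrong instead of stopping with a subscript error. The path is the one from the original post, and FileRead is used here on the assumption that the page is a local file.

```autoit
#include <String.au3>

; Path from the original post; adjust it to where Test.htm really is.
Local $sPath = "C:\Data\Auto ITs\check\Test.htm"
If Not FileExists($sPath) Then
    ConsoleWrite("File not found: " & $sPath & @CRLF)
    Exit 1
EndIf

Local $sContent = FileRead($sPath)

; _StringBetween returns an array on success; on failure it sets @error
; and returns a non-array value, so check before using [0].
Local $aStatus = _StringBetween($sContent, '<span class="validStyle">', '</span>')
If @error Or Not IsArray($aStatus) Then
    ConsoleWrite("No match; content length was " & StringLen($sContent) & @CRLF)
    Exit 1
EndIf

ConsoleWrite("Status: " & $aStatus[0] & @CRLF)
```

A content length of 0 would point at the file or URL, while a non-zero length with no match would point at the start and end strings.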

SmOke_N: Yup :) I've tried the ConsoleWrite, and it gives me the full HTML code of the page (I even tried it in my full script, so with the real website, not the test HTML I posted above).

I also put ConsoleWrites here:

Local $start_Status = '<td class="labelLeft" colspan="3"><b><span class="validStyle">'
consolewrite($start_Status & @crlf)
Local $end_Status = '</span></b></td>'
consolewrite( _StringBetween($content,$start_Status,$end_Status) & @crlf)
Local $Status = _StringBetween($content,$start_Status,$end_Status)
consolewrite($Status & @crlf)

which gave these results:

<td class="labelLeft" colspan="3"><b><span class="validStyle">
0
0
"C:\Data\Auto ITs\BTW check\test.au3" (60) : ==> Subscript used on non-accessible variable.:
Local $aio= $Status[0]&";"&$Lidstaat&";"&$nr&";"&$Tijd&";"&$Naam&";"&$Adres&";"&$Cnummer
Local $aio= $Status^ ERROR

So the _StringBetween doesn't find anything (0)

(@draien I also tried it with the test script and HTML, same results, and yup, I did change the @ScriptDir ;))

But now that I think of it, I also got this error earlier, when I started looking:

$aio= $Status[0]&";"&$Lidstaat&";"&$nr&";"&$Tijd&";"&$Naam&";"&$Adres&";"&$Cnummer
$aio= $Status^ ERROR

The first thing I did then was check my AutoIt version (it was V3.3.08 or something) and update it. Apparently that didn't help (at least not with the error).

I think I'd better start again and see if I can make that work, to get the hang of it and to make sure that I'm not overlooking some simple thing (I kind of have the feeling it's something really small that I'm overlooking, because I tried so many options that it turned a bit chaotic).

(At the moment I've made a quick and really dirty solution to get the data I need, with coordinates, WinWaits etc. So I'm getting the data I need.)

But I now know it should also be possible this way :)

Thanks all for all your help, tips and input!


  • Moderators
  • Solution

I'm fairly sure your "string" data is not what you think it is.

Run this example, if this works, then you need to change the approach to the string vs binary data that I've suggested already:

#include <Array.au3>
#include <String.au3>
; stringbetween returns an array
; checking it as a string return is not going to help you

Local $sHTMLString = "<html>" & @CRLF
$sHTMLString &= "<body>" & @CRLF
$sHTMLString &= '<table id="mytable">' & @CRLF
$sHTMLString &= "<tr>" & @CRLF
$sHTMLString &= "<td>noting much</td>" & @CRLF
$sHTMLString &= '<td class="labelLeft" colspan="3"><b><span class="validStyle">somedata here</span></b></td>' & @CRLF
$sHTMLString &= "</tr>" & @CRLF
$sHTMLString &= "</table>" & @CRLF
$sHTMLString &= "</body>" & @CRLF
$sHTMLString &= "</html>"

Local $content = $sHTMLString
Local $start_Status = '<td class="labelLeft" colspan="3"><b><span class="validStyle">'
consolewrite($start_Status & @crlf)
Local $end_Status = '</span></b></td>'
Local $Status = _StringBetween($content,$start_Status,$end_Status)
_ArrayDisplay($Status)

... Good luck
