xml scraping


Solved by mikell

hi all,

I'm trying to scrape some info off a website's API.

When I open the link, it's an XML document, and the info I want is as follows:

<availability>

<members date="2014-05-30" count="2" day="2" night="1" OOA="0" na="0" />

</availability>

The info I want is the day= number.

Would I use _IEGetObjByName or _IEGetObjById, with day as the id or name?

Cheers

P.S. First time using XML and an API.


Assuming you don't need more from this xml you can probably use this:

Local $sXML = 'Blah ... <availability>' & @CRLF & '<members date="2014-05-30" count="2" day="2" night="1" OOA="0" na="0" />' & @CRLF & '</availability> ... more blah'
Local $day = StringRegExpReplace($sXML, '(?is)(?:.*<availability>.*? day=")(\d+)(?:".*)', '$1')
ConsoleWrite($day & @LF)


You could use the _StringBetween function to get the text between day=" and ". (Use it first to get the text between <availability> and </availability>.)

I'm sure there is also some XML UDF on this forum you could use, or even the functions you're referring to; I don't really know them.


Thanks, guys, for the replies.

@Jchd: that is all the info I need from the XML.

@Geir1983: I did look at the UDFs, but it seems all the examples read from a file on the hard drive, whereas this would come directly off the web using the IE.au3 library.

Again, thanks.

Are there any tips for reading XML directly off the net?


@Jchd, that works well, but I forgot to mention one part as I was in a rush.

I use _IECreate() pointed at the website's URL with the API key,

so now I need to extract that "day" number from that website. The "day" number always changes; it's not a static number.

Any ideas?

Link to comment
Share on other sites

The regular expression will extract whatever unsigned integer number is after day="


Right, so what I have tried is:

#include <IE.au3>
#include <Constants.au3>
#include <String.au3>
Global $oIE = _IECreate("website ")
_IELoadWait ($oIE)
Local $sXML ='<availability>' & @CRLF & '<members date="2014-05-30" count="2" day="2" night="1" OOA="0" na="0" />' & @CRLF & '</availability>'
Local $day = StringRegExpReplace($oIE,$sXML, '(?is)(?:.*<availability>.*? night=")(\d+)(?:".*)', '$1')
ConsoleWrite($day & @LF)

Then all i get is an error.

Local $sXML = $oIE,'<availability>' & @CRLF & '<members date="2014-05-30" count="2" day="2" night="1" OOA="0" na="0" />' & @CRLF & '</availability>'
Local $sXML = $oIE,^ ERROR
>Exit code: 1    Time: 0.907

My goal is to load the website's API and, from there, find the "day" number.

In the end I will do an If $day < 2 Then ...

so that day number has to come off that website, as the "day" number changes.

Thanks thus far.


The error shown doesn't appear to match the code you posted. However, it looks like you are trying to use the $oIE object directly instead of calling one of its functions. To retrieve the XML from the webpage, you can try this:

Local $sXML = _IEBodyReadHTML($oIE)

You may also want to review the _IEBody* and _IEDoc* functions to see if one of them will work better for your given situation.


Thanks @Danp2, I shall try that code shortly. After looking at the code I just posted, I can see what I did wrong: I added $oIE on the front, thinking it would read from that source.

I had looked at _IEBodyReadHTML but assumed it was for HTML only. Then I realized XML is just markup too (I'm new to XML and web scraping), so I will try it later.

Thanks for all the help.


If you know the address of the xml then you can get its text without using _IE* (faster)

Here is an example

$sXML = BinaryToString(InetRead("http://api.openweathermap.org/data/2.5/weather?q=London&mode=xml"))
Msgbox(0,"content", $sXML)

$clouds = StringRegExpReplace($sXML, '(?is).*<clouds.*?name="([^"]+).*', '$1')
Msgbox(0,"clouds", $clouds)
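One caveat with the snippet above: StringRegExpReplace returns its input unchanged when the pattern does not match, so a changed layout would leave the whole response in the result. A minimal sketch that checks both the download and the match, assuming a placeholder address (the example.com URL is hypothetical) and the <availability> layout from the first post:

```autoit
; Hypothetical endpoint: substitute the real API address and key.
Local $sUrl = "http://example.com/api/availability.xml"

; InetRead returns binary data and sets @error on failure.
Local $dData = InetRead($sUrl)
If @error Then
    ConsoleWrite("Download failed" & @CRLF)
    Exit 1
EndIf
Local $sXML = BinaryToString($dData)

; Flag 1 returns an array of captured groups and sets @error when nothing matches.
Local $aDay = StringRegExp($sXML, '(?is)<availability>.*?\sday="(\d+)"', 1)
If @error Then
    ConsoleWrite("No day attribute found" & @CRLF)
Else
    ConsoleWrite("day = " & $aDay[0] & @CRLF)
EndIf
```

Using StringRegExp in array mode here is a design choice, not the only option: it makes "no match" distinguishable from a real value.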
Thanks, worked like a charm, and you're right, it works a lot faster than _IE. What UDF was this from, so I can read and learn more?


hi all,

The script has been working great, up until my XML had more data added.

In my original post I had one date. But now that I have added more dates, the script is finding information for all dates, not just the current one.

How can I make it read only today's date?

XML example:

<availability>
<members date="2014-06-6" count="2" day="2" night="1" OOA="0" na="0" />

<members date="2014-06-7" count="6" day="5" night="1" OOA="0" na="0" />

<members date="2014-06-8" count="8" day="4" night="1" OOA="0" na="0" />

<members date="2014-06-9" count="9" day="9" night="1" OOA="0" na="0" />

</availability>

cheers shane


  • Solution

$sXML = '<availability>' & @crlf & _
    '<members date="2014-06-6" count="2" day="2" night="1" OOA="0" na="0" />' & @crlf & _
    '<members date="2014-06-7" count="6" day="5" night="1" OOA="0" na="0" />' & @crlf & _
    '<members date="2014-06-8" count="8" day="4" night="1" OOA="0" na="0" />' & @crlf & _
    '<members date="2014-06-9" count="9" day="9" night="1" OOA="0" na="0" />' & @crlf & _
    '</availability>'

$day = StringRegExpReplace($sXML, '(?is).*<availability.*?day="([^"]+).*</availability.*', '$1')  ; gets the first one (2)
Msgbox(0,"day", $day)
$day = StringRegExpReplace($sXML, '(?is).*<availability.*day="([^"]+).*?</availability.*', '$1')  ; gets the last one (9)
Msgbox(0,"day", $day)

:)
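The two patterns above pick the first or the last entry, not a particular date. One way to select by date with a regular expression is to put the date into the pattern itself. A minimal sketch, assuming the date format seen in the sample (zero-padded month, unpadded day); the variable names are illustrative only:

```autoit
Local $sXML = '<availability>' & @CRLF & _
    '<members date="2014-06-6" count="2" day="2" night="1" OOA="0" na="0" />' & @CRLF & _
    '<members date="2014-06-7" count="6" day="5" night="1" OOA="0" na="0" />' & @CRLF & _
    '<members date="2014-06-8" count="8" day="4" night="1" OOA="0" na="0" />' & @CRLF & _
    '<members date="2014-06-9" count="9" day="9" night="1" OOA="0" na="0" />' & @CRLF & _
    '</availability>'

; Fixed date so the sample is reproducible. For a live feed, something like
;   @YEAR & "-" & @MON & "-" & Number(@MDAY)
; would build today's date in the same shape (assumption based on the sample).
Local $sDate = "2014-06-7"

; Match only the <members> tag carrying that date, then capture its day attribute.
; [^>]*? keeps the search inside a single tag.
Local $aDay = StringRegExp($sXML, '(?i)<members\s+date="' & $sDate & '"[^>]*?\sday="(\d+)"', 1)
If @error Then
    ConsoleWrite("No entry for " & $sDate & @CRLF)
Else
    ConsoleWrite("day = " & $aDay[0] & @CRLF) ; day = 5 for the sample above
EndIf
```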

I should explain: the XML is a calendar that shows dates for 2 weeks at a time. What I'm trying to achieve is to get the data for the current day, e.g. get today's data today, tomorrow's data tomorrow, etc.

I did try

$date = _Date_Time_SystemTimeToDateStr($tDate, 1)
$day = StringRegExpReplace($sXML, & $date & '(?is).*<availability.*?day="([^"]+).*</availability.*', '$1')

but got an error:

$day = StringRegExpReplace($sXML, & $date & '(?is).*<members.*?day="([^"]+).*', '$1')

$day = StringRegExpReplace($sXML, ^ ERROR

Edited by shaggy89

XMLDOM would be an easier route (my opinion):

#include <File.au3>

$file = @DesktopDir & "\some.xml"
_FileCreate($file)
FileWrite($file,'<SomeXML>' & @CRLF & _
'<availability>' & @CRLF & _
'<members date="2014-06-6" count="2" day="2" night="1" OOA="0" na="0" />' & @CRLF & _
'<members date="2014-06-7" count="6" day="5" night="1" OOA="0" na="0" />' & @CRLF & _
'<members date="2014-06-8" count="8" day="4" night="1" OOA="0" na="0" />' & @CRLF & _
'<members date="2014-06-9" count="9" day="9" night="1" OOA="0" na="0" />' & @CRLF & _
'</availability>' & @CRLF & _
'</SomeXML>')

$oXML = ObjCreate("Microsoft.XMLDOM")
$oXML.Load($file)
$oMembers= $oXML.selectNodes('//availability/members')

For $oMember In $oMembers
    ConsoleWrite("date=" & $oMember.getAttribute("date") & _
        "; count=" & $oMember.getAttribute("count") & _
        "; day=" & $oMember.getAttribute("day") & _
        "; night=" & $oMember.getAttribute("night") & _
        "; OOA=" & $oMember.getAttribute("OOA") & _
        "; na=" & $oMember.getAttribute("na") & @CRLF)
Next

Exit

output:

date=2014-06-6; count=2; day=2; night=1; OOA=0; na=0
date=2014-06-7; count=6; day=5; night=1; OOA=0; na=0
date=2014-06-8; count=8; day=4; night=1; OOA=0; na=0
date=2014-06-9; count=9; day=9; night=1; OOA=0; na=0

In the loop, you can add a condition to check whether the date is today. You may also need to format the day when it's a single digit, since the sample does not zero-pad it:

If String($oMember.getAttribute("date")) = @YEAR & "-" & @MON & "-" & StringRegExpReplace(@MDAY,"(0)(\d+)","\2") Then
    ConsoleWrite(@TAB & "This date ^ is for today" & @CRLF)
EndIf



Or, you can just add it to the xpath, and only that day will return:

; The date value must be quoted inside the predicate, otherwise it is not compared as a string
$oMembers = $oXML.selectNodes('//availability/members[@date="' & @YEAR & "-" & @MON & "-" & StringRegExpReplace(@MDAY,"(0)(\d+)","\2") & '"]')

The power of XMLDOM

Edited by jdelaney

You can use _ie functions, and load the html into the xmldom (generally; it's not a 1:1 conversion).  I've added more contributions, above.

Or, just use the _ie functions, and you can do similar to what I've done.  I thought this was an actual XML document, not just HTML source.

Edited by jdelaney
I was going to use the IE functions, but then I was told about BinaryToString(InetRead($Site)), and I would like to keep using this as it's a lot faster than opening IE. I'm sure there is a way to make it read only the day's information.


Try loading that into XMLDOM and see if it works:

$oXML = ObjCreate("Microsoft.XMLDOM")
$oXML.LoadXML(BinaryToString(InetRead($Site)))
; The date value must be quoted inside the predicate, otherwise it is not compared as a string
$oMembers = $oXML.selectNodes('//availability/members[@date="' & @YEAR & "-" & @MON & "-" & StringRegExpReplace(@MDAY,"(0)(\d+)","\2") & '"]')

For $oMember In $oMembers

    ConsoleWrite("date=" & $oMember.getAttribute("date") & _
        "; count=" & $oMember.getAttribute("count") & _
        "; day=" & $oMember.getAttribute("day") & _
        "; night=" & $oMember.getAttribute("night") & _
        "; OOA=" & $oMember.getAttribute("OOA") & _
        "; na=" & $oMember.getAttribute("na") & @CRLF)

    If String($oMember.getAttribute("date")) = @YEAR & "-" & @MON & "-" & StringRegExpReplace(@MDAY,"(0)(\d+)","\2") Then
        ConsoleWrite ( @TAB & "This date ^ is for today" & @CRLF)
    EndIf
Next
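To tie this back to the original goal (the If $day < 2 test), here is a minimal self-contained sketch that parses the sample in memory, checks for parse errors, and reads a single day value as a number. The fixed date and the variable names are assumptions for illustration; with a live feed the string would come from BinaryToString(InetRead($Site)) and the date from the macros shown above.

```autoit
Local $sXML = '<availability>' & @CRLF & _
    '<members date="2014-06-6" count="2" day="2" night="1" OOA="0" na="0" />' & @CRLF & _
    '<members date="2014-06-7" count="6" day="5" night="1" OOA="0" na="0" />' & @CRLF & _
    '<members date="2014-06-8" count="8" day="4" night="1" OOA="0" na="0" />' & @CRLF & _
    '<members date="2014-06-9" count="9" day="9" night="1" OOA="0" na="0" />' & @CRLF & _
    '</availability>'

Local $oXML = ObjCreate("Microsoft.XMLDOM")
If Not IsObj($oXML) Then Exit 1

; loadXML returns False when the text is not well-formed XML.
If Not $oXML.loadXML($sXML) Then
    ConsoleWrite("Parse error: " & $oXML.parseError.reason & @CRLF)
    Exit 1
EndIf

; Fixed date so the sample is reproducible (assumption; build it from @YEAR/@MON/@MDAY for real use).
Local $sDate = "2014-06-8"

; selectSingleNode yields a non-object when no node matches.
Local $oNode = $oXML.selectSingleNode('//availability/members[@date="' & $sDate & '"]')
If Not IsObj($oNode) Then
    ConsoleWrite("No entry for " & $sDate & @CRLF)
    Exit
EndIf

; getAttribute returns text, so convert before comparing numerically.
Local $iDay = Number($oNode.getAttribute("day"))
ConsoleWrite("day = " & $iDay & @CRLF) ; day = 4 for the sample above
If $iDay < 2 Then
    ConsoleWrite("Fewer than 2 on days" & @CRLF)
EndIf
```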