shaggy89

Web scrape problem

8 posts in this topic

#1 ·  Posted (edited)

Hi all,

I've made a script that scrapes an XML file off the web. The XML is below:

<availability>
<members date="2015-07-18" daytag="Today" count="11" day="8" night="9" ooa="0" s44="" na="0">
<qualification abbrev="2YR" name="2 Years Experience" category="Ability" count="4" day="3" night="3" ooa="0" s44="0" na="0"/>
<qualification abbrev="BA" name="Breathing Apparatus Operator" category="Operator" count="4" day="3" night="4" ooa="0" s44="0" na="0"/>
</members>
<members date="2015-07-19" daytag="Tomorrow" count="11" day="8" night="11" ooa="0" s44="0" na="0">
<qualification abbrev="2YR" name="2 Years Experience" category="Ability" count="4" day="4" night="4" ooa="0" s44="0" na="0"/>
<qualification abbrev="BA" name="Breathing Apparatus Operator" category="Operator" count="6" day="6" night="4" ooa="0" s44="0" na="0"/>
</members>
</availability>

 

My script is meant to scrape the "Today" section. The first part of my script works and picks up the correct "day" count, but when it comes to the "Breathing Apparatus Operator" it collects the number from "Tomorrow". How can I fix this? My code is below:

 

 

$sXML = BinaryToString(InetRead($Site))
$day = StringRegExpReplace($sXML, '(?is).*<availability.*?day="([^"]+).*</availability.*', '$1')
$BA = StringRegExpReplace($sXML, '(?is).*<members.*? name="Breathing Apparatus Operator".*?day="([^"]+).*</members.*', '$1') ; this gets the info we need

 

Edited by shaggy89
coding from mobile SUCKS


That looks a lot more complex and involved. Surely there's a way using the code I have?


You should rewrite your question and make it cleaner. You should also show the output you expect.

Regards


Errr, OK? The question and code seemed clear to me. Basically, the output I want is the "day" number for "Breathing Apparatus Operator" from today.

So from my example I want 3, not 6.


#6 ·  Posted (edited)

Change the start of the regular expression for $BA from (?is).* to (?is).*? (edit: the leading .* is greedy, so it currently picks up the last <members> block instead of the first one).
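To see the difference, here is a minimal sketch that runs both patterns against a trimmed copy of the sample data from the first post. The string literal is an illustrative stand-in for the real download, and the variable names $sGreedy and $sLazy are made up for this example.

```autoit
; Trimmed sample from the first post (only the BA rows are kept).
; This is a stand-in for BinaryToString(InetRead($Site)), not the real feed.
Local $sXML = '<availability>' & _
        '<members date="2015-07-18" daytag="Today" count="11" day="8" night="9" ooa="0" s44="" na="0">' & _
        '<qualification abbrev="BA" name="Breathing Apparatus Operator" category="Operator" count="4" day="3" night="4" ooa="0" s44="0" na="0"/>' & _
        '</members>' & _
        '<members date="2015-07-19" daytag="Tomorrow" count="11" day="8" night="11" ooa="0" s44="0" na="0">' & _
        '<qualification abbrev="BA" name="Breathing Apparatus Operator" category="Operator" count="6" day="6" night="4" ooa="0" s44="0" na="0"/>' & _
        '</members>' & _
        '</availability>'

; Greedy leading .* swallows as much as it can, so "<members" binds to the LAST block (Tomorrow).
Local $sGreedy = StringRegExpReplace($sXML, '(?is).*<members.*? name="Breathing Apparatus Operator".*?day="([^"]+).*</members.*', '$1')

; Lazy leading .*? stops as early as it can, so "<members" binds to the FIRST block (Today).
Local $sLazy = StringRegExpReplace($sXML, '(?is).*?<members.*? name="Breathing Apparatus Operator".*?day="([^"]+).*</members.*', '$1')

ConsoleWrite("greedy: " & $sGreedy & @CRLF) ; greedy: 6
ConsoleWrite("lazy:   " & $sLazy & @CRLF)   ; lazy:   3
```

Note that the lazy version only works because "Today" happens to be the first <members> block in the file. It does not actually check the daytag attribute.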

That's the quick fix. But I agree with using a real XML parser if you want this to be more robust. Parsing XML with regex is just shaky.
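For reference, a rough sketch of what the parser route could look like, using the MSXML COM object that ships with Windows. It assumes the live feed is well-formed XML; the string below is a cleaned-up copy of the sample from the first post, not the real download, and the XPath expression is only one possible way to select the node.

```autoit
; Assumption: the real feed is well-formed. This literal is a cleaned-up stand-in
; for BinaryToString(InetRead($Site)).
Local $sXML = '<availability>' & _
        '<members date="2015-07-18" daytag="Today" count="11" day="8" night="9" ooa="0" s44="" na="0">' & _
        '<qualification abbrev="2YR" name="2 Years Experience" category="Ability" count="4" day="3" night="3" ooa="0" s44="0" na="0"/>' & _
        '<qualification abbrev="BA" name="Breathing Apparatus Operator" category="Operator" count="4" day="3" night="4" ooa="0" s44="0" na="0"/>' & _
        '</members>' & _
        '<members date="2015-07-19" daytag="Tomorrow" count="11" day="8" night="11" ooa="0" s44="0" na="0">' & _
        '<qualification abbrev="BA" name="Breathing Apparatus Operator" category="Operator" count="6" day="6" night="4" ooa="0" s44="0" na="0"/>' & _
        '</members>' & _
        '</availability>'

Local $oXML = ObjCreate("Microsoft.XMLDOM")
If Not IsObj($oXML) Then
    ConsoleWrite("MSXML is not available" & @CRLF)
    Exit 1
EndIf

$oXML.async = False                              ; load synchronously
$oXML.setProperty("SelectionLanguage", "XPath")  ; use standard XPath for queries

If Not $oXML.loadXML($sXML) Then
    ; Malformed input ends up here instead of silently returning a wrong number
    ConsoleWrite("Parse error: " & $oXML.parseError.reason & @CRLF)
    Exit 1
EndIf

; Select by daytag and abbrev explicitly, so the order of the <members> blocks no longer matters
Local $oNode = $oXML.selectSingleNode('//members[@daytag="Today"]/qualification[@abbrev="BA"]')
If IsObj($oNode) Then
    ConsoleWrite("BA day (Today): " & $oNode.getAttribute("day") & @CRLF)
Else
    ConsoleWrite("No matching qualification found" & @CRLF)
EndIf
```

The trade-off is that a parser rejects input that is not well-formed, whereas a regex will happily return something from it.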

Edited by SadBunny

#7 ·  Posted (edited)

I would do it like this. If anything, it's not "shaky":

; $sXML = BinaryToString(InetRead(...))

$sXML = '-<availability>' & _
        '-<members date="2015-07-18" daytag="Today" count="11" day="8" night="9" ooa="0" s44="" na="0">' & _
        '<qualification abbrev="2YR" name="2 Years Experience" category="Ability" count="4" day="3" night="3" ooa="0" s44="0"na="0"/>' & _
        '<qualification abbrev="BA" name="Breathing Apparatus Operator" category="Operator" count="4" day="3" night="4" ooa="0"s44="0" na="0"/>' & _
        '</members>' & _
        '-<members date="2015-07-19" daytag="Tomorrow" count="11" day="8" night="11" ooa="0" s44="0" na="0">' & _
        '<qualification abbrev="2YR" name="2 Years Experience" category="Ability" count="4" day="4" night="4" ooa="0" s44="0"na="0"/>' & _
        '<qualification abbrev="BA" name="Breathing Apparatus Operator" category="Operator" count="6" day="6" night="4" ooa="0"s44="0" na="0"/>' & _
        '</members>' & _
        '<availability>'


MsgBox(4096, "bzz...", "daytag = Today, abbrev = BA, day = " & ThatThingFromXML($sXML, "Today"))
MsgBox(4096, "bzz...", "daytag = Tomorrow, abbrev = BA, day = " & ThatThingFromXML($sXML, "Tomorrow"))


Func ThatThingFromXML($sXML, $sDayTag, $sAbbrev = "BA", $sAttrib = "day")
    ; Clean the XML
    $sXML = StringRegExpReplace($sXML, "(?s)<!--.*?-->", "") ; removing comments
    $sXML = StringRegExpReplace($sXML, "(?s)<!\[CDATA\[.*?\]\]>", "") ; removing CDATA

    ; Find all members
    Local $aMembers = StringRegExp($sXML, "(?si)<\s*members(?:[^\w])\s*(.*?)(?:(?:<\s*/members\s*>)|\Z)", 3)
    If @error Then Return SetError(1, 0, "") ; There are no members available

    Local $sMember, $sAttributes, $aDesc

    ; Loop through members
    For $iMemberOrdinal = 0 To UBound($aMembers) - 1
        $sMember = $aMembers[$iMemberOrdinal] ; currently examined member

        $sAttributes = StringRegExp($sMember, "(?s)(.*?)>", 3)
        If Not @error Then $sAttributes = $sAttributes[0]

        If AttribVal($sAttributes, "daytag") = $sDayTag Then
            $aDesc = StringRegExp($sMember, "(?si)<\h*(?:qualification|whatever)\h*(.*?)/*\h*>", 3)
            For $i = 0 To UBound($aDesc) - 1
                If AttribVal($aDesc[$i], "abbrev") = $sAbbrev Then
                    Return AttribVal($aDesc[$i], $sAttrib) ; first matching qualification wins
                EndIf

            Next
            ExitLoop

        EndIf

    Next

    Return SetError(2, 0, "") ; Conditions not met
EndFunc



Func AttribVal($sIn, $sAttrib)
    Local $aArray = StringRegExp($sIn, '(?i).*?' & $sAttrib & '\h*=(\h*"(.*?)"|' & "\h*'(.*?)'|" & '\h*(.*?)(?: |\Z))', 3) ; e.g. id="abc" or id='abc' or id=abc

    If @error Then Return ""
    Return $aArray[UBound($aArray) - 1]
EndFunc

 

Edited by trancexx

Thanks @SadBunny, that worked. Something so simple. Thanks to everyone else for their ideas.

 

SOLVED

