Web scrape problem

shaggy89

Hi all,

I've made a script that scrapes an XML file off the web. The XML is below:

<availability>
<members date="2015-07-18" daytag="Today" count="11" day="8" night="9" ooa="0" s44="" na="0">
<qualification abbrev="2YR" name="2 Years Experience" category="Ability" count="4" day="3" night="3" ooa="0" s44="0" na="0"/>
<qualification abbrev="BA" name="Breathing Apparatus Operator" category="Operator" count="4" day="3" night="4" ooa="0" s44="0" na="0"/>
</members>
<members date="2015-07-19" daytag="Tomorrow" count="11" day="8" night="11" ooa="0" s44="0" na="0">
<qualification abbrev="2YR" name="2 Years Experience" category="Ability" count="4" day="4" night="4" ooa="0" s44="0" na="0"/>
<qualification abbrev="BA" name="Breathing Apparatus Operator" category="Operator" count="6" day="6" night="4" ooa="0" s44="0" na="0"/>
</members>
</availability>

 

My script is meant to scrape the "Today" section. The first part of my script works and picks up the correct "day" count, but when it comes to "Breathing Apparatus Operator" it collects the number from "Tomorrow". How can I fix this? My code is below:

 

 

$sXML = BinaryToString(InetRead($Site))

  $day = StringRegExpReplace($sXML, '(?is).*<availability.*?day="([^"]+).*</availability.*', '$1')

  $BA = StringRegExpReplace($sXML, '(?is).*<members.*? name="Breathing Apparatus Operator".*?day="([^"]+).*</members.*', '$1');this gets the info we need

 

Edited by shaggy89
coding from mobile SUCKS

shaggy89

That looks a lot more complex and involved. Surely there's a way using the code I have?

Danyfirex

You should rewrite your question and make it cleaner. You should also show what output you expect.

Regards

shaggy89

Errr, OK? The question and code seemed clear to me. Basically, the output I want is the "day" number for "Breathing Apparatus Operator" from today.

So from my example I want 3, not 6.

SadBunny

Change the start of the regular expression for $BA from (?is).* to (?is).*? (edit: the leading .* is greedy, so it currently picks up the last <members> block instead of the first one.)
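
A minimal sketch of the difference, using a trimmed, hand-cleaned copy of the sample feed. The $sXML literal and the $sGreedy / $sLazy names are only for illustration; the two patterns are the original one and the same one with the extra ? added:

```autoit
; Trimmed stand-in for the downloaded feed (assumption: the real feed is shaped like this).
Local $sXML = '<availability>' & _
        '<members date="2015-07-18" daytag="Today" count="11" day="8" night="9">' & _
        '<qualification abbrev="BA" name="Breathing Apparatus Operator" count="4" day="3" night="4"/>' & _
        '</members>' & _
        '<members date="2015-07-19" daytag="Tomorrow" count="11" day="8" night="11">' & _
        '<qualification abbrev="BA" name="Breathing Apparatus Operator" count="6" day="6" night="4"/>' & _
        '</members>' & _
        '</availability>'

; Greedy: the leading .* runs to the end of the string and backtracks,
; so <members ends up matching the LAST block (Tomorrow).
Local $sGreedy = StringRegExpReplace($sXML, '(?is).*<members.*? name="Breathing Apparatus Operator".*?day="([^"]+).*</members.*', '$1')

; Lazy: the leading .*? grows one character at a time,
; so <members matches the FIRST block (Today).
Local $sLazy = StringRegExpReplace($sXML, '(?is).*?<members.*? name="Breathing Apparatus Operator".*?day="([^"]+).*</members.*', '$1')

ConsoleWrite("greedy: " & $sGreedy & @CRLF)
ConsoleWrite("lazy:   " & $sLazy & @CRLF)
```

With this sample the greedy pattern should give 6 (Tomorrow) and the lazy one 3 (Today). Note that the lazy version still only works because Today happens to be the first block in the feed.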

That's the quick fix. But I agree with using a real XML parser if you want this to be more robust. Parsing XML with regex is just shaky.
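
For comparison, a rough sketch of the parser route using the MSXML COM object. It assumes the feed is well-formed XML and that MSXML is installed (the $sXML literal is a hand-cleaned, trimmed sample, and the XPath expression is my guess at what you would want to select):

```autoit
; Sketch only: read the value through the MSXML DOM instead of a regex.
Local $sXML = '<availability>' & _
        '<members date="2015-07-18" daytag="Today" count="11" day="8" night="9">' & _
        '<qualification abbrev="BA" name="Breathing Apparatus Operator" count="4" day="3" night="4"/>' & _
        '</members>' & _
        '<members date="2015-07-19" daytag="Tomorrow" count="11" day="8" night="11">' & _
        '<qualification abbrev="BA" name="Breathing Apparatus Operator" count="6" day="6" night="4"/>' & _
        '</members>' & _
        '</availability>'

Local $oXML = ObjCreate("MSXML2.DOMDocument")
If Not IsObj($oXML) Then
    ConsoleWrite("MSXML is not available" & @CRLF)
    Exit 1
EndIf

$oXML.async = False ; load synchronously
$oXML.setProperty("SelectionLanguage", "XPath") ; use XPath rather than the older XSLPattern syntax

If Not $oXML.loadXML($sXML) Then
    ConsoleWrite("Parse error: " & $oXML.parseError.reason & @CRLF)
    Exit 1
EndIf

; Select by daytag and abbrev, so the order of the <members> blocks no longer matters.
Local $oNode = $oXML.selectSingleNode('//members[@daytag="Today"]/qualification[@abbrev="BA"]')
If IsObj($oNode) Then
    ConsoleWrite("BA day count for Today: " & $oNode.getAttribute("day") & @CRLF)
Else
    ConsoleWrite("No matching node" & @CRLF)
EndIf
```

The point of this version is that it asks for Today by name instead of relying on it being the first block, at the cost of failing outright if the feed is not valid XML.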

trancexx

I would do it like this. If anything, it's not "shaky":

; $sXML = BinaryToString(InetRead(...))

$sXML = '-<availability>' & _
        '-<members date="2015-07-18" daytag="Today" count="11" day="8" night="9" ooa="0" s44="" na="0">' & _
        '<qualification abbrev="2YR" name="2 Years Experience" category="Ability" count="4" day="3" night="3" ooa="0" s44="0"na="0"/>' & _
        '<qualification abbrev="BA" name="Breathing Apparatus Operator" category="Operator" count="4" day="3" night="4" ooa="0"s44="0" na="0"/>' & _
        '</members>' & _
        '-<members date="2015-07-19" daytag="Tomorrow" count="11" day="8" night="11" ooa="0" s44="0" na="0">' & _
        '<qualification abbrev="2YR" name="2 Years Experience" category="Ability" count="4" day="4" night="4" ooa="0" s44="0"na="0"/>' & _
        '<qualification abbrev="BA" name="Breathing Apparatus Operator" category="Operator" count="6" day="6" night="4" ooa="0"s44="0" na="0"/>' & _
        '</members>' & _
        '<availability>'


MsgBox(4096, "bzz...", "daytag = Today, abbrev = BA, day = " & ThatThingFromXML($sXML, "Today"))
MsgBox(4096, "bzz...", "daytag = Tomorrow, abbrev = BA, day = " & ThatThingFromXML($sXML, "Tomorrow"))


Func ThatThingFromXML($sXML, $sDayTag, $sAbbrev = "BA", $sAttrib = "day")
    ; Clean the XML
    $sXML = StringRegExpReplace($sXML, "(?s)<!--.*?-->", "") ; removing comments
    $sXML = StringRegExpReplace($sXML, "(?s)<!\[CDATA\[.*?\]\]>", "") ; removing CDATA

    ; Find all members
    Local $aMembers = StringRegExp($sXML, "(?si)<\s*members(?:[^\w])\s*(.*?)(?:(?:<\s*/members\s*>)|\Z)", 3)
    If @error Then Return SetError(1, 0, "") ; There are no members available

    Local $sMember, $sAttributes, $aDesc

    ; Loop through members
    For $iMemberOrdinal = 0 To UBound($aMembers) - 1
        $sMember = $aMembers[$iMemberOrdinal] ; currently examined member

        $sAttributes = StringRegExp($sMember, "(?s)(.*?)>", 3)
        If Not @error Then $sAttributes = $sAttributes[0]

        If AttribVal($sAttributes, "daytag") = $sDayTag Then
            $aDesc = StringRegExp($sMember, "(?si)<\h*(?:qualification|whatever)\h*(.*?)/*\h*>", 3)
            For $i = 0 To UBound($aDesc) - 1
                If AttribVal($aDesc[$i], "abbrev") = $sAbbrev Then
                    Return AttribVal($aDesc[$i], $sAttrib) ; first match wins; Return also leaves both loops
                EndIf

            Next
            ExitLoop

        EndIf

    Next

    Return SetError(2, 0, "") ; Conditions not met
EndFunc



Func AttribVal($sIn, $sAttrib)
    Local $aArray = StringRegExp($sIn, '(?i).*?\b' & $sAttrib & '\h*=(\h*"(.*?)"|' & "\h*'(.*?)'|" & '\h*(.*?)(?: |\Z))', 3) ; e.g. id="abc" or id='abc' or id=abc; \b stops "day" matching inside a longer attribute name

    If @error Then Return ""
    Return $aArray[UBound($aArray) - 1]
EndFunc

 

shaggy89

Thanks @SadBunny, that worked. Something so simple. Thanks to everyone else for their ideas.

 

SOLVED
