What UDF to work with XML/HTML files?


Go to solution Solved by TheXman,

On 3/10/2023 at 3:35 PM, littlebigman said:

FWIW, none of the *ML files I loop through are particularly big, eg. only a few pages worth each.

What do you mean?

 

Ps.

Sorry but my English skills are poor.

Signature beginning:
Please remember: "AutoIt"..... *  Wondering who uses AutoIt and what it can be used for ? * Forum Rules *
ADO.au3 UDF * POP3.au3 UDF * XML.au3 UDF * IE on Windows 11 * How to ask ChatGPT for AutoIt Codefor other useful stuff click the following button:

Spoiler

Any of my own code posted anywhere on the forum is available for use by others without any restriction of any kind. 

My contribution (my own projects): * Debenu Quick PDF Library - UDF * Debenu PDF Viewer SDK - UDF * Acrobat Reader - ActiveX Viewer * UDF for PDFCreator v1.x.x * XZip - UDF * AppCompatFlags UDF * CrowdinAPI UDF * _WinMergeCompare2Files() * _JavaExceptionAdd() * _IsBeta() * Writing DPI Awareness App - workaround * _AutoIt_RequiredVersion() * Chilkatsoft.au3 UDF * TeamViewer.au3 UDF * JavaManagement UDF * VIES over SOAP * WinSCP UDF * GHAPI UDF - modest begining - comunication with GitHub REST APIErrorLog.au3 UDF - A logging Library * Include Dependency Tree (Tool for analyzing script relations) * Show_Macro_Values.au3 *

 

My contribution to others projects or UDF based on  others projects: * _sql.au3 UDF  * POP3.au3 UDF *  RTF Printer - UDF * XML.au3 UDF * ADO.au3 UDF SMTP Mailer UDF * Dual Monitor resolution detection * * 2GUI on Dual Monitor System * _SciLexer.au3 UDF * SciTE - Lexer for console pane

Useful links: * Forum Rules * Forum etiquette *  Forum Information and FAQs * How to post code on the forum * AutoIt Online Documentation * AutoIt Online Beta Documentation * SciTE4AutoIt3 getting started * Convert text blocks to AutoIt code * Games made in Autoit * Programming related sites * Polish AutoIt Tutorial * DllCall Code Generator * 

Wiki: Expand your knowledge - AutoIt Wiki * Collection of User Defined Functions * How to use HelpFile * Good coding practices in AutoIt * 

OpenOffice/LibreOffice/XLS Related: WriterDemo.au3 * XLS/MDB from scratch with ADOX

IE Related:  * How to use IE.au3  UDF with  AutoIt v3.3.14.x * Why isn't Autoit able to click a Javascript Dialog? * Clicking javascript button with no ID * IE document >> save as MHT file * IETab Switcher (by LarsJ ) * HTML Entities * _IEquerySelectorAll() (by uncommon) * IE in TaskSchedulerIE Embedded Control Versioning (use IE9+ and HTML5 in a GUI) * PDF Related:How to get reference to PDF object embeded in IE * IE on Windows 11

I encourage you to read: * Global Vars * Best Coding Practices * Please explain code used in Help file for several File functions * OOP-like approach in AutoIt * UDF-Spec Questions *  EXAMPLE: How To Catch ConsoleWrite() output to a file or to CMD *

I also encourage you to check awesome @trancexx code:  * Create COM objects from modules without any demand on user to register anything. * Another COM object registering stuffOnHungApp handlerAvoid "AutoIt Error" message box in unknown errors  * HTML editor

winhttp.au3 related : * https://www.autoitscript.com/forum/topic/206771-winhttpau3-download-problem-youre-speaking-plain-http-to-an-ssl-enabled-server-port/

"Homo sum; humani nil a me alienum puto" - Publius Terentius Afer
"Program are meant to be read by humans and only incidentally for computers and execute" - Donald Knuth, "The Art of Computer Programming"
:naughty:  :ranting:, be  :) and       \\//_.

Anticipating Errors :  "Any program that accepts data from a user must include code to validate that data before sending it to the data store. You cannot rely on the data store, ...., or even your programming language to notify you of problems. You must check every byte entered by your users, making sure that data is the correct type for its field and that required fields are not empty."

Signature last update: 2023-04-24


2 hours ago, mLipok said:

What do you mean?

 

Ps.

Sorry but my English skills are poor.

Here's my interpretation: 

FWIW = For what it's worth

*ML = XML or HTML

only a few pages worth each = probably less than 100 kb

 

So put together:

Quote

For what it's worth none of the XML/HTML files I loop through are particularly big, probably less than 100 kb

 

And just to provide some more information, though it sounds like @littlebigman may already be set: the topic that @ioa747 linked has another post on this subject, with an example from mLipok.

 

Edited by mistersquirrle

We ought not to misbehave, but we should look as though we could.


It looks like you are expecting to have a pill for all your problems.

XML.au3 shows how to use the XMLDOM object. It helps in most cases.

If you need a specific usage, you will need to combine XMLDOM with your other specific technologies.

But if you need some unique solution, you will need to build it on your own, or buy some kind of advanced component that does the magic you need right out of the box.


  • 2 months later...
On 3/10/2023 at 2:35 PM, littlebigman said:

Hello,

I need to read and modify XML/HTML files.
Before diving in, which UDF would you recommend? In addition to the XML UDFs, there's also TheXman's Xml2Json UDF.
FWIW, none of the *ML files I loop through are particularly big, eg. only a few pages worth each.

Thank you.

Just saying that you need to read/modify XML & HTML (or JSON) doesn't provide enough detail to suggest which tool(s) might be best for the job.  The most important parts that you left out are details like what type of information you are trying to gather (values, calculations, transformations, grouping, etc.), what format you need it in, and any other constraints or restrictions that you may have.  Also, the best tool for processing or reading data may not be the best tool for modifying or writing that data.

For example, you posted a simple script HERE where you seem to be trying to read (process) an XML dataset and output some values.  Of course there are several ways to do it.  For the sake of this example, let's assume that you wanted to gather the GPS track points from the XML dataset in order to process them further.  I would probably have done it something like the example below.  If your XML (or JSON) datasets are large and speed is a concern, then the example below would be one of the quickest ways to process the data and generate the result.  On one of my older Win 7 PCs, the small example below took about 0.003 seconds to convert the XML to JSON and about 0.055 seconds to process the JSON to get the result.  The time it would take to process much larger datasets would grow incrementally, not linearly.

Example based on post HERE:

#AutoIt3Wrapper_AU3Check_Parameters=-w 3 -w 4 -w 5 -w 6 -d

#include <Constants.au3>
#include <Array.au3>
#include <xml2json\xml2json.au3> ;<== Modify path as needed
#include <jq\jq.au3>             ;<== Modify path as needed


Const $gkJqExe = "C:\Utils\jq\jq-win64.exe" ;<== Modify as needed

Const $gkXML = _
    '<?xml version="1.0" encoding="UTF-8"?><gpx><metadata><name>S' & _
    'ome name</name></metadata><trk><name>Track 1</name><trkseg><' & _
    'trkpt lat="48.81782" lon="2.24906"><ele>123</ele><time>Dummy' & _
    ' time</time></trkpt><trkpt lat="48.81784" lon="2.24906"><ele' & _
    '>456</ele></trkpt></trkseg></trk><trk><name>Track 2</name><t' & _
    'rkseg><trkpt lat="48.81782" lon="2.24906"><ele>321</ele></tr' & _
    'kpt><trkpt lat="48.81784" lon="2.24906"><ele>654</ele></trkp' & _
    't></trkseg></trk></gpx>'

Const $gkJqTrackPointsFilter = _
    '.gpx.trk[]                                  #For each gpx track'                      & @CRLF & _
    '|  .name as $track_name                     #  Save the track name'                   & @CRLF & _
    '|  .trkseg.trkpt[]                          #  For each segment track point'          & @CRLF & _
    '|    [$track_name, .lat, .lon, .ele, .time] #    Create an array of specified values' & @CRLF & _
    '|    join("|")                              #    Output the array as a list of |-separated-values'


xml2json_jq_example()

Func xml2json_jq_example()

    Local $sJson   = "", _
          $sResult = ""

    Local $aResult[0][5]


    ;Initialize jq by declaring path to exe
    _jqInit($gkJqExe)
    If @error Then Return MsgBox($MB_ICONERROR, "_jqInit Error", "@error = " & @error)

    ;Transform XML to JSON
    $sJson = _Xml2Json($gkXML)
    If @error Then Return MsgBox($MB_ICONERROR, "_Xml2Json Error", "@error = " & @error)

    ;Display JSON
    ConsoleWrite("Transformed Xml to JSON"  & @CRLF)
    ConsoleWrite(_jqPrettyPrintJson($sJson) & @CRLF & @CRLF)

    ;Process JSON data set to get values of interest (in desired format)
    $sResult = _jqExec($sJson, $gkJqTrackPointsFilter)
    If @error Then Return MsgBox($MB_ICONERROR, "_jqExec Error", $sResult)

    ;Display result
    ConsoleWrite("jq Result" & @CRLF)
    ConsoleWrite($sResult    & @CRLF & @CRLF)

    ;Add result to an array and display it
    _ArrayAdd($aResult, $sResult)
    _ArrayDisplay($aResult, "GPS Exchange Info", "", 0, Default, "Name|Lat|Long|Elevation|Time")

EndFunc

Console output from example:

Transformed Xml to JSON
{
  "gpx": {
    "metadata": {
      "name": "Some name"
    },
    "trk": [
      {
        "name": "Track 1",
        "trkseg": {
          "trkpt": [
            {
              "lat": "48.81782",
              "lon": "2.24906",
              "ele": "123",
              "time": "Dummy time"
            },
            {
              "lat": "48.81784",
              "lon": "2.24906",
              "ele": "456"
            }
          ]
        }
      },
      {
        "name": "Track 2",
        "trkseg": {
          "trkpt": [
            {
              "lat": "48.81782",
              "lon": "2.24906",
              "ele": "321"
            },
            {
              "lat": "48.81784",
              "lon": "2.24906",
              "ele": "654"
            }
          ]
        }
      }
    ]
  }
}

jq Result
Track 1|48.81782|2.24906|123|Dummy time
Track 1|48.81784|2.24906|456|
Track 2|48.81782|2.24906|321|
Track 2|48.81784|2.24906|654|

Array generated from the example:

[Image: _ArrayDisplay window showing the four track points under the columns Name, Lat, Long, Elevation, Time]
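The timings quoted earlier could be checked against your own data with something like the sketch below. It is only an illustration: `_TimeXmlToResult` is a hypothetical helper (not part of either UDF), it uses only the `_Xml2Json()` and `_jqExec()` calls shown in the example above, and it assumes `_jqInit()` has already been called successfully.

```autoit
#include <xml2json\xml2json.au3> ;<== Modify path as needed
#include <jq\jq.au3>             ;<== Modify path as needed

; Hypothetical helper: times the two stages separately and returns the jq result.
; Assumes _jqInit() was called beforehand, as in the example above.
Func _TimeXmlToResult($sXml, $sFilter)
    ;Stage 1: transform XML to JSON
    Local $hTimer = TimerInit()
    Local $sJson  = _Xml2Json($sXml)
    If @error Then Return SetError(1, 0, "")
    ConsoleWrite(StringFormat("XML to JSON : %.3f seconds", TimerDiff($hTimer) / 1000) & @CRLF)

    ;Stage 2: run the jq filter against the JSON
    $hTimer = TimerInit()
    Local $sResult = _jqExec($sJson, $sFilter)
    If @error Then Return SetError(2, 0, "")
    ConsoleWrite(StringFormat("jq filter   : %.3f seconds", TimerDiff($hTimer) / 1000) & @CRLF)

    Return $sResult
EndFunc
```

Calling it with `$gkXML` and `$gkJqTrackPointsFilter` from the example, and then with a larger file read via `FileRead()`, would show how the two stages scale on a given machine.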

Transforming XML to JSON for processing gives you a lot more flexibility and power in gathering information from the dataset.  Also, you can usually get those results much faster than parsing data using the XML DOM and then processing that parsed data in AutoIt.  For the record, I can think of one or two rare cases where processing XML using the XML DOM would be better (not faster) than processing the XML as JSON.  But as I said, those cases are rare.

If you truly need to write or modify XML, then using functions that use the XML DOM (whether it's one of the xml UDF's or the COM objects themselves) is probably one of the best methods.  There are tools that transform JSON to XML, but I haven't had a need to use them and haven't dug too deeply into researching their usefulness.

If nothing else, the Xml2Json and jq UDF's give you more tools in your tool belt.  Using the best tool(s) for the job, whichever they may be, always gets the job done more effectively and efficiently.  :)

Edited by TheXman
Corrected typos

Thanks much. I need to read and modify GPX/KML files, i.e. rename elements, add/modify values, add new parent/sibling/child elements, and remove elements.

The XML UDF seems easier to use than working with the MSXML COM object. It just took a bit of time to figure out how to use it by going through the examples + user questions in the forum.

I'll look at the JSON UDF if I can't get the XML UDF to do what I need, which is unlikely considering it's nothing fancy.
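For comparison, the operations listed above could be sketched with the MSXML DOM object directly, roughly as follows. This is a minimal illustration, not a recommendation over the XML UDF: the sample XML, the `<desc>` and `<title>` element names, and the function name are made up for the example. Real GPX/KML files usually declare a default namespace, in which case the XPath expressions would need a prefix registered through the `SelectionNamespaces` property.

```autoit
#include <MsgBoxConstants.au3>

_ModifyGpxSketch()

Func _ModifyGpxSketch()
    ;Hypothetical sample data (no namespace, to keep the XPath simple)
    Local $sXml = '<gpx><trk><name>Track 1</name><trkseg>' & _
                  '<trkpt lat="48.81782" lon="2.24906"><ele>123</ele><time>Dummy time</time></trkpt>' & _
                  '</trkseg></trk></gpx>'

    Local $oDoc = ObjCreate("Msxml2.DOMDocument.6.0")
    If Not IsObj($oDoc) Then Return MsgBox($MB_ICONERROR, "Error", "Unable to create the MSXML object.")

    $oDoc.loadXML($sXml)
    If $oDoc.parseError.errorCode Then Return MsgBox($MB_ICONERROR, "XML parsing error", $oDoc.parseError.reason)

    Local $oPoint = $oDoc.selectSingleNode("//trkpt")

    ;Modify an attribute value and an element value
    $oPoint.setAttribute("lat", "48.90000")
    $oPoint.selectSingleNode("./ele").text = "150"

    ;Add a new child element
    Local $oDesc = $oDoc.createElement("desc")
    $oDesc.text = "Added by script"
    $oPoint.appendChild($oDesc)

    ;Remove an element, if it exists
    Local $oTime = $oPoint.selectSingleNode("./time")
    If IsObj($oTime) Then $oPoint.removeChild($oTime)

    ;"Rename" an element: the DOM has no rename method, so a new element
    ;is created, given the old text, and swapped in for the old one
    Local $oOld = $oDoc.selectSingleNode("//trk/name")
    Local $oNew = $oDoc.createElement("title")
    $oNew.text = $oOld.text
    $oOld.parentNode.replaceChild($oNew, $oOld)

    ;Show the modified document; $oDoc.save(<path>) would write it to a file
    ConsoleWrite($oDoc.xml & @CRLF)
EndFunc
```

The rename step only copies the text content; an element with attributes or child elements would need those moved across as well.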

Edited by littlebigman

2 hours ago, littlebigman said:

I need to read and modify GPX/KML files, i.e. rename elements, add/modify values, add new parent/sibling/child elements, and remove elements.

I understand.  A very long time ago, I did something similar for a company that designed and manufactured small tracking devices.  The devices, among other things like accelerometer, temperature and sensor data, sent GPS NMEA data over-the-air.  I created tools that converted the GPS NMEA RMC data to KML files for plotting the movement, location, or geo-fences of the devices in Google Earth, or for simpler plotting in Google Maps.  Most of those tools were created using AutoIt.  :thumbsup:

Fun stuff! :)

Edited by TheXman

7 hours ago, littlebigman said:

The XML UDF seems easier to use than working with the MSXML COM object. It just took a bit of time to figure out how to use it by going through the examples + user questions in the forum.

Looking at your script and its output, it appears that you are still trying to get a good grasp of how to process XML nodes and how to get individual values.  :)  Personally, I find it much easier and faster to work with the actual XML DOM objects than to use XML UDF's.  Plus, there's a lot less overhead.

Below, you will find an example that produces the exact same array as my xml2json example above.  Of course it could have been done in several different ways.  I just chose to do it that way because it uses the exact same logic as what I used in the previous XML2JSON example.  As you can see, working with the XML DOM objects is not that much different than using some of the XML UDF's.  As you can also see, if you just need to get information from an XML file, the script using the xml2json method is far less code and much easier to create, modify and maintain.  :dance:

I hope the example helps you understand some of the concepts related to XML node processing.

Example using XML DOM objects:

#AutoIt3Wrapper_AU3Check_Parameters=-w 3 -w 4 -w 5 -w 6 -d

#include <Constants.au3>
#include <Array.au3>

Const $gkXML = _
    '<?xml version="1.0" encoding="UTF-8"?>'     & _
    '<gpx>'                                      & _
    '  <metadata>'                               & _
    '    <name>Some name</name>'                 & _
    '  </metadata>'                              & _
    '  <trk>'                                    & _
    '    <name>Track 1</name>'                   & _
    '    <trkseg>'                               & _
    '      <trkpt lat="48.81782" lon="2.24906">' & _
    '        <ele>123</ele>'                     & _
    '        <time>Dummy time</time>'            & _
    '      </trkpt>'                             & _
    '      <trkpt lat="48.81784" lon="2.24906">' & _
    '        <ele>456</ele>'                     & _
    '      </trkpt>'                             & _
    '    </trkseg>'                              & _
    '  </trk>'                                   & _
    '  <trk>'                                    & _
    '    <name>Track 2</name>'                   & _
    '    <trkseg>'                               & _
    '      <trkpt lat="48.81782" lon="2.24906">' & _
    '        <ele>321</ele>'                     & _
    '      </trkpt>'                             & _
    '      <trkpt lat="48.81784" lon="2.24906">' & _
    '        <ele>654</ele>'                     & _
    '      </trkpt>'                             & _
    '    </trkseg>'                              & _
    '  </trk>'                                   & _
    '</gpx>'


xml_example()

Func xml_example()
    Local $oComErr = ObjEvent("AutoIt.Error", "com_error_handler")
    #forceref $oComErr

    Local $oTrackNodes      = Null, _
          $oTrackPointNodes = Null

    Local $sTrackName = ""

    Local $aTrackPoints[0][5]


    With ObjCreate("Msxml2.DOMDocument.6.0")
        ;Load XML document object from the string
        .PreserveWhitespace = True
        .loadXML($gkXML)
        If .parseError.errorCode Then Exit MsgBox($MB_ICONERROR + $MB_TOPMOST, "XML PARSING ERROR", .parseError.reason)

        ;Select all track nodes
        $oTrackNodes = .selectNodes('//trk')
        If Not IsObj($oTrackNodes) Then Return MsgBox($MB_ICONERROR, "Error", "No track nodes found." & @error)

        ;For each track node
        For $oTrackNode in $oTrackNodes
            ;Save track name
            $sTrackName = $oTrackNode.selectSingleNode("./name").text

            ;Select all child segment track points
            $oTrackPointNodes = $oTrackNode.selectNodes('./trkseg/trkpt')
            If Not IsObj($oTrackPointNodes) Then Return MsgBox($MB_ICONERROR, "Error", "No track points found." & @error)

            ;For each child segment track point
            For $oTrackPointNode in $oTrackPointNodes
                With $oTrackPointNode
                    ;Add specified values to the array
                    _ArrayAdd($aTrackPoints, _
                        $sTrackName                     & "|" & _
                        .getAttribute("lat")            & "|" & _
                        .getAttribute("lon")            & "|" & _
                        .selectSingleNode("./ele").text & "|" & _
                        (IsObj(.selectSingleNode("./time")) ? .selectSingleNode("./time").text : "") _ ;Handle optional node
                    )
                EndWith
            Next
        Next
    EndWith

    ;Display array
    _ArrayDisplay($aTrackPoints, "GPS Exchange Info", "", 0, Default, "Name|Lat|Long|Elevation|Time")
EndFunc

Func com_error_handler($oError)
    With $oError
        ConsoleWrite(@CRLF & "COM ERROR DETECTED!" & @CRLF)
        ConsoleWrite("  Error ScriptLine....... " & .scriptline & @CRLF)
        ConsoleWrite("  Error Number........... " & StringFormat("0x%08x (%i)", .number, .number) & @CRLF)
        ConsoleWrite("  Error WinDescription... " & StringStripWS(.windescription, $STR_STRIPTRAILING) & @CRLF)
        ConsoleWrite("  Error RetCode.......... " & StringFormat("0x%08x (%i)", .retcode, .retcode) & @CRLF)
        ConsoleWrite("  Error Description...... " & StringStripWS(.description   , $STR_STRIPTRAILING) & @CRLF)
    EndWith
    Exit
EndFunc

Resulting array:

[Image: _ArrayDisplay window showing the same four track points as the previous example]

Edited by TheXman

Indeed, while learning about mLipok's XML UDF, I wondered what its benefit is compared to working with the MSXML COM object directly because the calls looked so similar.

As for converting XML to JSON with jq.exe: Besides adding a second binary to the mix, I have a preliminary question: When reading data, is there a way to know whether the value came from the node's text or one of its attributes? If I need to then write data to a new file, I must know.

For example:

<trkpt lat="48.81782" lon="2.24906">
<ele>123</ele>
<time>Dummy time</time>
</trkpt>

If "48.81782", "2.24906", "123", and "Dummy time" all end up in the array, the only way to tell them apart is by keeping this info in the source (eg. lat = attribute, time = value).
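For comparison, when the XML DOM is used directly the distinction is explicit, because attributes and child elements are reached through different collections. The snippet below is a small sketch of that, using the track point above as sample data; the function name is made up for the example.

```autoit
#include <MsgBoxConstants.au3>

_ListAttributesAndElements()

Func _ListAttributesAndElements()
    Local $oDoc = ObjCreate("Msxml2.DOMDocument.6.0")
    If Not IsObj($oDoc) Then Return MsgBox($MB_ICONERROR, "Error", "Unable to create the MSXML object.")

    $oDoc.loadXML('<trkpt lat="48.81782" lon="2.24906"><ele>123</ele><time>Dummy time</time></trkpt>')
    If $oDoc.parseError.errorCode Then Return MsgBox($MB_ICONERROR, "XML parsing error", $oDoc.parseError.reason)

    Local $oPoint = $oDoc.documentElement

    ;Attributes are held in the node's attributes collection
    For $oAttr In $oPoint.attributes
        ConsoleWrite("attribute : " & $oAttr.name & " = " & $oAttr.value & @CRLF)
    Next

    ;Child elements are held in childNodes (nodeType 1 = element node)
    For $oChild In $oPoint.childNodes
        If $oChild.nodeType = 1 Then ConsoleWrite("element   : " & $oChild.nodeName & " = " & $oChild.text & @CRLF)
    Next
EndFunc
```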

 

 


  • Solution
10 hours ago, littlebigman said:

When reading data, is there a way to know whether the value came from the node's text or one of its attributes?

I don't understand the question.  Don't you already know the schema of the XML files that you are working with?  The only way I can see that being an issue is when a parent node has an attribute name that's the same name as one of its child nodes.  That would be rather illogical but I guess it could happen.  Also, keep in mind that not all XML files lend themselves to being easily transformed into JSON.  Some may need custom stylesheets or may need the XML file to be modified before transformation.  One example that comes to mind is XML files that make use of CDATA.  I don't think the default xsl stylesheet has rules for handling CDATA.

With that said, if you need XML attribute names to be transformed to JSON key names that identify them as attributes, then it just involves a small change to the default .xsl stylesheet that is used to do the transformations.  I modified the attached xsl file below to prepend "attr_" to transformed attribute names.  Since xsl stylesheets basically contain rules for how to transform XML, you could make the name look however you'd like (@id, _id, id_, id_attr, etc.).

The optional second parameter of the _Xml2Json() function allows you to override the default xsl stylesheet that is used. Make the change below and put the modified xsl stylesheet in the script directory and it will produce the output below.  You will also need to modify the jq filter to use the "attr_" attribute names (i.e.  .attr_lat).  The _Xml2Json() function was already set up to allow the use of modified xsl stylesheets for scenarios like yours and other one-offs that users may desire.

Change the existing _Xml2Json() line to:

;Transform XML to JSON
$sJson = _Xml2Json($gkXML, FileRead(@ScriptDir & "\xml2json_identify_attrs.xsl"))

JSON Output:

{
  "gpx": {
    "metadata": {
      "name": "Some name"
    },
    "trk": [
      {
        "name": "Track 1",
        "trkseg": {
          "trkpt": [
            {
              "attr_lat": "48.81782",
              "attr_lon": "2.24906",
              "ele": "123",
              "time": "Dummy time"
            },
            {
              "attr_lat": "48.81784",
              "attr_lon": "2.24906",
              "ele": "456"
            }
          ]
        }
      },
      {
        "name": "Track 2",
        "trkseg": {
          "trkpt": [
            {
              "attr_lat": "48.81782",
              "attr_lon": "2.24906",
              "ele": "321"
            },
            {
              "attr_lat": "48.81784",
              "attr_lon": "2.24906",
              "ele": "654"
            }
          ]
        }
      }
    ]
  }
}
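For reference, the track point filter from the earlier xml2json example, adjusted for the "attr_" key names shown in the output above, might look like the sketch below. It assumes the stylesheet prepends exactly "attr_"; the jq comments from the original filter have been left out for brevity.

```autoit
;Same filter as in the earlier example, but reading the renamed attribute keys
Const $gkJqTrackPointsFilter = _
    '.gpx.trk[]'                                            & @CRLF & _
    '|  .name as $track_name'                               & @CRLF & _
    '|  .trkseg.trkpt[]'                                    & @CRLF & _
    '|    [$track_name, .attr_lat, .attr_lon, .ele, .time]' & @CRLF & _
    '|    join("|")'
```

Only the two attribute names change; `.ele` and `.time` come from child elements, so their keys are not touched by the modified stylesheet.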

 

xml2json_identify_attrs.xsl

Edited by TheXman