XML-Cleanup

TiC01 · March 8, 2010

Hi all,

I haven't posted much, but I've created a script I just needed to share.

I've been working on a little project that involves the Microsoft wsuscab (which is barely readable). To make it readable, I wrote a script that reads the entire file, strips empty spaces, indents the lines, and writes them back to a file.

Please have a look at it. Any comments and suggestions are welcome!

XMLCleanup.au3

Edited March 9, 2010 by TiC01

Fulano · March 9, 2010

Two suggestions came to mind while looking at it:

1: You can force it into this format:

<tag>
string
</tag>

with the following:

StringRegExpReplace ($XMLText, ">[^\n]", ">" & @CRLF)
StringRegExpReplace ($XMLText, "[^\n]<", @CRLF & "<")

Then you can split it into an array and walk the array, computing the indent based on how many open tags you find before you find the appropriate close tag.

2: This is massively more efficient:

For $i = 1 to $Indent
    $NewLine&="    "
Next

Than this:

For $i = 1 to ($Indent*4)
    $NewLine&=" "
Next

Suppose we are indenting this bit of xml:

<xml>
    <first>
        <second>
            <third>
                <fourth>
note
                </fourth>
            </third>
        </second>
    </first>
</xml>

In order to indent the 'note' line correctly, the original version would have to loop 20 times. The modified version has to loop only 5 times.
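The loop can also be dropped entirely. A minimal sketch, assuming the standard String.au3 UDF library is available; `_StringRepeat` comes from that library, and the value of `$Indent` is just the depth of the 'note' line in the sample above:

```autoit
#include <String.au3> ; standard UDF library that provides _StringRepeat

Local $Indent = 5 ; depth of the 'note' line in the sample above
; Build the whole indent in one call instead of appending in a loop
Local $NewLine = _StringRepeat("    ", $Indent) & "note"
ConsoleWrite($NewLine & @CRLF)
```

Whether this is measurably faster than the 5-iteration loop is not something tested here; it mainly keeps the indent logic to one line.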

Happy Coding :mellow:

Fulano

TiC01 · March 9, 2010

Thank you for your reply. I've been checking your suggestions:

1: You can force it into this format:
<tag>
string
</tag>
with the following:
StringRegExpReplace ($XMLText, ">[^\n]", ">" & @CRLF)
StringRegExpReplace ($XMLText, "[^\n]<", @CRLF & "<")
Then you can split it into an array and walk the array, computing the indent based on how many open tags you find before you find the appropriate close tag

I understand what your intention is, but wouldn't this piece of code (comes from Microsoft Office Patchdata.xml) fail on that??

<TITLE>
Security Update for Office XP: WordPerfect 5.x Converter (KB873379) (Turkish version)
</TITLE>

I'd figure the 2nd line would either not be in the array or the 2nd and 3rd line would be one.

2:This is massively more efficient:

For $i = 1 to $Indent
    $NewLine&="    "
Next

Than this:

For $i = 1 to ($Indent*4)
    $NewLine&=" "
Next

Noted. I've changed it in the code.

Fulano · March 9, 2010

<TITLE>
Security Update for Office XP: WordPerfect 5.x Converter (KB873379) (Turkish version)
</TITLE>
I'd figure the 2nd line would either not be in the array or the 2nd and 3rd line would be one.

Surprisingly enough, it's OK. The magic is in the [^\n] character class: basically it says, for every instance of ">" that is not followed by a line break, add one. The other line works the same way, except that the newline it checks for comes before the "<" rather than after the ">".

Seriously, whoever created regular expressions was a friggin' genius :mellow:

TiC01 · March 10, 2010

Well, I've been trying to get it to work with the piece of code you provided, but either I'm doing it wrong or there's a bug in AutoIt, 'cause the result ain't what it should be.

I'm using the piece of XML below:

<?xml version="1.0"?><!--7/9/2009 6:30:10 PM--><BUNDLEDATA MAJORVERSION="11" MINORVERSION="8120" BUILD="12.09.100.014" DATE="2009-07-09" URL="http://go.microsoft.com/fwlink/?LinkId=19074"><BUNDLECATALOG><BUNDLE ID="WRD11FF" VER="2" LANG="0" OFFICEVER="3" UPDATETYPE="5" EXTERNAL="False" FROZEN="False" MSIPATCH="True" EXPIRED="True" TYPE="Security Update" KB_NUMBER="887979" BULLETIN_ID="MS05-023" SEVERITY="Critical"><PACKLETSET><PACKLET PUID="512165" REQUIRED="False" /></PACKLETSET></BUNDLE>

As I understand it, the code should just put an ENTER after each > character when using this:

Local $sOutput = StringRegExpReplace ($TotalFileRead, ">[^\n]", ">" & @CRLF)
ConsoleWrite($sOutput)

This is the result I get:

<?xml version="1.0"?>
!--7/9/2009 6:30:10 PM-->
BUNDLEDATA MAJORVERSION="11" MINORVERSION="8120" BUILD="12.09.100.014" DATE="2009-07-09" URL="http://go.microsoft.com/fwlink/?LinkId=19074">
BUNDLECATALOG>
BUNDLE ID="WRD11FF" VER="2" LANG="0" OFFICEVER="3" UPDATETYPE="5" EXTERNAL="False" FROZEN="False" MSIPATCH="True" EXPIRED="True" TYPE="Security Update" KB_NUMBER="887979" BULLETIN_ID="MS05-023" SEVERITY="Critical">
PACKLETSET>
PACKLET PUID="512165" REQUIRED="False" />
/PACKLETSET>
/BUNDLE>
BUNDLE ID="ACC_9" VER="1" LANG="1028" OFFICEVER="1" UPDATETYPE="2" EXTERNAL="False" FROZEN="False" MSIPATCH="True" EXPIRED="True" TYPE="Critical Update" KB_NUMBER="287484">

It doesn't just seem to add the ENTER, but it also strips the < character from the next line.....

So, what's wrong here ??

Fulano · March 10, 2010

Found the bug. Here is the fix in context:

Local $XMLText = '<?xml version="1.0"?><!--7/9/2009 6:30:10 PM--><BUNDLEDATA MAJORVERSION="11" MINORVERSION="8120" BUILD="12.09.100.014" DATE="2009-07-09" URL="http://go.microsoft.com/fwlink/?LinkId=19074"><BUNDLECATALOG><BUNDLE ID="WRD11FF" VER="2" LANG="0" OFFICEVER="3" UPDATETYPE="5" EXTERNAL="False" FROZEN="False" MSIPATCH="True" EXPIRED="True" TYPE="Security Update" KB_NUMBER="887979" BULLETIN_ID="MS05-023" SEVERITY="Critical"><PACKLETSET><PACKLET PUID="512165" REQUIRED="False" /></PACKLETSET></BUNDLE>'

Local $CleanXML = StringRegExpReplace ($XMLText, ">([^\n])", ">" & @CRLF & "\1")
$CleanXML = StringRegExpReplace ($CleanXML, "([^\n])<", "\1" & @CRLF & "<")


ConsoleWrite  ($CleanXML)

The problem was that the pattern consumed the character it tested (to make sure there wasn't already a line break) and I had neglected to save it. The capture group and "\1" now put that character back.
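Another way to avoid eating the neighbouring character, not from the code above, is to only peek at it with lookahead and lookbehind assertions, which AutoIt's PCRE-based regex engine supports. A minimal sketch, assuming no "<" or ">" appears inside comments or attribute values; the variable names and the sample string are illustrative:

```autoit
Local $sXML = '<a><b>text</b></a>'

; ">" followed by any character other than CR/LF: add a line break after it.
; The lookahead (?=...) only peeks, so the following character is kept.
Local $sOut = StringRegExpReplace($sXML, ">(?=[^\r\n])", ">" & @CRLF)

; "<" preceded by any character other than CR/LF: add a line break before it.
; The lookbehind (?<=...) likewise leaves the preceding character in place.
$sOut = StringRegExpReplace($sOut, "(?<=[^\r\n])<", @CRLF & "<")

ConsoleWrite($sOut & @CRLF)
```

With this input, each tag and the text run end up on their own line, and no capture group or "\1" is needed in the replacement.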

Ahh the joys of coding w/o debugging :mellow:

Fulano · March 10, 2010

Here is a slightly more fleshed out version. It has a minor bug with empty tags like <br><br/>, but handles everything else fine.

Global $INDENT_TEXT = "    "

Local $XMLText = '<?xml version="1.0"?><!--7/9/2009 6:30:10 PM--><BUNDLEDATA MAJORVERSION="11" MINORVERSION="8120" BUILD="12.09.100.014" DATE="2009-07-09" URL="http://go.microsoft.com/fwlink/?LinkId=19074"><BUNDLECATALOG>Test<BUNDLE ID="WRD11FF" VER="2" LANG="0" OFFICEVER="3" UPDATETYPE="5" EXTERNAL="False" FROZEN="False" MSIPATCH="True" EXPIRED="True" TYPE="Security Update" KB_NUMBER="887979" BULLETIN_ID="MS05-023" SEVERITY="Critical"><PACKLETSET><PACKLET PUID="512165" REQUIRED="False" /></PACKLETSET></BUNDLE>'

Local $XMLarray = SplitXML ($XMLText)

Local $indentLevel = -1
For $line = 0 to Ubound ($XMLarray) - 1
    If StringInStr ($XMLarray[$line], "</") Then ; Closing Tag
        $indentLevel -= 1
        $XMLarray[$line] = MakeIndent ($indentLevel) & $XMLarray[$line]
    ElseIf StringInStr ($XMLarray[$line], "<") Then ; Opening Tag
        $indentLevel += 1
        $XMLarray[$line] = MakeIndent ($indentLevel) & $XMLarray[$line]
    Else ; Data Line
        $XMLarray[$line] = MakeIndent ($indentLevel + 1) & $XMLarray[$line]
    EndIf
Next

Local $CleanXML = ""
For $line in $XMLarray
    $CleanXML &= $line & @CRLF
Next

ConsoleWrite ($CleanXML)

Func MakeIndent ($Level)
    Local $indent = ""
    For $i = 0 to $Level
        $indent &= $INDENT_TEXT
    Next
    Return $indent
EndFunc

Func SplitXML ($XMLString)
    ;Remove any existing line breaks
    $XMLString = StringReplace ($XMLString, @CR, "")
    $XMLString = StringReplace ($XMLString, @LF, "")
    ;Add one line break per tag
    $XMLString = StringRegExpReplace ($XMLString, "([^\n])<", "\1" & @LF & "<")
    $XMLString = StringRegExpReplace ($XMLString, ">([^\n])", ">" & @LF & "\1")
    ;Split into an array and return
    Return StringSplit ($XMLString, @LF, 2)
EndFunc

I could make it smarter, but I really don't feel like implementing a stack right at this moment :mellow:
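For indentation alone, a depth counter may be enough without a full stack. The following is a sketch of one possible variant, not the code posted above: `IndentXMLLines` is a hypothetical helper name, it expects the kind of array SplitXML returns (one tag or text run per element), and it assumes comments and attribute values contain no "<" or ">". It leaves the depth unchanged for declarations, comments, and self-closing tags, and aligns each closing tag with its opening tag.

```autoit
#include <String.au3> ; standard UDF library, provides _StringRepeat

; Hypothetical helper (not from the thread). $aLines holds one tag or one
; text run per element, as returned by SplitXML() above.
Func IndentXMLLines(ByRef $aLines, $sIndentText = "    ")
    Local $iLevel = 0, $sLine
    For $i = 0 To UBound($aLines) - 1
        $sLine = StringStripWS($aLines[$i], 3) ; trim both ends
        If StringLeft($sLine, 2) = "</" Then
            ; Closing tag: step out first so it aligns with its opening tag
            If $iLevel > 0 Then $iLevel -= 1
            $aLines[$i] = _StringRepeat($sIndentText, $iLevel) & $sLine
        ElseIf StringLeft($sLine, 1) = "<" Then
            $aLines[$i] = _StringRepeat($sIndentText, $iLevel) & $sLine
            ; Only a real opening tag goes one level deeper; declarations (<?),
            ; comments (<!) and self-closing tags (/>) leave the level unchanged
            If Not (StringLeft($sLine, 2) = "<?" Or StringLeft($sLine, 2) = "<!" Or StringRight($sLine, 2) = "/>") Then $iLevel += 1
        Else
            ; Text line: same depth as the children of the enclosing tag
            $aLines[$i] = _StringRepeat($sIndentText, $iLevel) & $sLine
        EndIf
    Next
EndFunc

; Small demo with a hand-made array
Local $aDemo[6] = ['<?xml version="1.0"?>', '<a>', '<b/>', 'text', '</a>', '<!--done-->']
IndentXMLLines($aDemo)
For $sOut In $aDemo
    ConsoleWrite($sOut & @CRLF)
Next
```

A stack would still be needed to check that each closing tag actually matches the most recent opening tag; the counter only tracks depth.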

weaponx · March 10, 2010

Using the XML UDF you can indent an XML file using one line...

_XMLTransform("", "", "")

This calls a default stylesheet and applies it to the loaded file.

Fulano · March 10, 2010

There's always an easier way

TiC01 · March 13, 2010

@WeaponX

I tried the _XMLTransform function, but it only cleaned up the first two lines, e.g.:

<?xml version="1.0"?><!--7/9/2009 6:30:10 PM--><BUNDLEDATA MAJORVERSION="11" MINORVERSION="8120" BUILD="12.09.100.014" DATE="2009-07-09" URL="http://go.microsoft.com/fwlink/?LinkId=19074"><BUNDLECATALOG><BUNDLE ID="WRD11FF" VER="2" LANG="0" OFFICEVER="3" UPDATETYPE="5" EXTERNAL="False" FROZEN="False" MSIPATCH="True" EXPIRED="True" TYPE="Security Update" KB_NUMBER="887979" BULLETIN_ID="MS05-023" SEVERITY="Critical">

became

<?xml version="1.0" encoding="UTF-16"?>
<!--7/9/2009 6:30:10 PM-->
<BUNDLEDATA MAJORVERSION="11" MINORVERSION="8120" BUILD="12.09.100.014" DATE="2009-07-09" URL="http://go.microsoft.com/fwlink/?LinkId=19074"><BUNDLECATALOG><BUNDLE ID="WRD11FF" VER="2" LANG="0" OFFICEVER="3" UPDATETYPE="5" EXTERNAL="False" FROZEN="False" MSIPATCH="True" EXPIRED="True" TYPE="Security Update" KB_NUMBER="887979" BULLETIN_ID="MS05-023" SEVERITY="Critical">

It probably did something more than just that, 'cause the filesize got twice as big, but as you can see, this is not exactly what I expected.

@Fulano

This did the trick. I had to add a third StringReplace call, though (for tabs). The only thing that bothers me a little is the StringSplit function: it seems to take forever on a 35MB file.

Fulano · March 13, 2010

The only thing I am a little bothered with is the stringsplit function. It seems to take forever on a 35MB file.

Well ... there is that ...

Frankly speaking, given that AutoIt is not really designed for doing string manipulation on files of that size, I don't think there's a lot you can do to speed it up. I wrote a routine to do a binary search of zip codes from a file, and I had to write it in Python and call it from the main script to get decent speed out of it. :mellow:

I have been known to be wrong before, so if you do run into a clever trick that speeds things up, I'd be interested in hearing about it.
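One thing that might be worth measuring is tokenising in a single regex pass, instead of two StringRegExpReplace passes followed by StringSplit. This is only a sketch under assumptions: it has not been timed on a 35MB file, it may or may not be faster, and it assumes no "<" or ">" inside comments or attribute values. The sample string and variable names are illustrative.

```autoit
Local $sXML = '<a>' & @CRLF & '  <b>text</b>' & @CRLF & '</a>'

; Flag 3 returns every match in a 0-based array: either one whole tag
; ("<...>") or one run of text between tags.
Local $aTokens = StringRegExp($sXML, "<[^>]*>|[^<]+", 3)
If Not IsArray($aTokens) Then Exit ; no match at all

For $sToken In $aTokens
    $sToken = StringStripWS($sToken, 3) ; trim leading and trailing whitespace
    If $sToken <> "" Then ConsoleWrite($sToken & @CRLF) ; skip whitespace-only runs
Next
```

Because existing line breaks and tabs between tags show up as whitespace-only tokens and are skipped, the separate StringReplace calls for @CR, @LF, and tabs would not be needed in this variant.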
