XML-Cleanup


TiC01

Hi all,

I haven't posted much, but I've created a script I just needed to share.

I've been working on a small project that involves the Microsoft wsuscab file, which is barely readable. To make it readable, I wrote a script that reads the entire file, strips the empty space, indents the lines, and writes the result back to a file.

Please have a look at it. Any comments and suggestions are welcome!

XMLCleanup.au3

Edited by TiC01

Two suggestions came to mind while looking at it:

1: You can force it into this format:

<tag>
string
</tag>

with the following:

StringRegExpReplace ($XMLText, ">[^\n]", ">" & @CRLF)
StringRegExpReplace ($XMLText, "[^\n]<", @CRLF & "<")

Then you can split it into an array and walk the array, computing the indent based on how many open tags you find before you find the appropriate close tag

2: This is massively more efficient:

For $i = 1 to $Indent
    $NewLine&="    "
Next
Than this:

For $i = 1 to ($Indent*4)
    $NewLine&=" "
Next

Suppose we are indenting this bit of xml:

<xml>
    <first>
        <second>
            <third>
                <fourth>
note
                </fourth>
            </third>
        </second>
    </first>
</xml>
In order to indent the 'note' line correctly, the original version would have to loop 20 times. The modified version has to loop only 5 times.
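The loop can also be dropped entirely. Here is a minimal sketch, assuming the standard String.au3 UDF is available; BuildIndent is a made-up helper name and not part of the original script:

```autoit
#include <String.au3> ; provides _StringRepeat

; Hypothetical helper: returns $iLevel copies of a four-space indent unit.
Func BuildIndent($iLevel)
    ; Guard against zero or negative levels, which need no indent at all.
    If $iLevel < 1 Then Return ""
    Return _StringRepeat("    ", $iLevel)
EndFunc

; The 'note' line from the example above sits five levels deep.
ConsoleWrite(BuildIndent(5) & "note" & @CRLF)
```

Whether this is noticeably faster than the short loop would need measuring on a real file; the main gain is that the intent is easier to read.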

Happy Coding :mellow:

Fulano

#fgpkerw4kcmnq2mns1ax7ilndopen (Q, $0); while ($l = <Q>){if ($l =~ m/^#.*/){$l =~ tr/a-z1-9#/Huh, Junketeer's Alternate Pro Ace /; print $l;}}close (Q);[code] tag ninja!


Thank you for your reply. I've been checking your suggestions:

1: You can force it into this format:

<tag>
string
</tag>

with the following:

StringRegExpReplace ($XMLText, ">[^\n]", ">" & @CRLF)
StringRegExpReplace ($XMLText, "[^\n]<", @CRLF & "<")

Then you can split it into an array and walk the array, computing the indent based on how many open tags you find before you find the appropriate close tag

I understand what you intend, but wouldn't it fail on this piece of XML (taken from the Microsoft Office Patchdata.xml)?

<TITLE>
Security Update for Office XP: WordPerfect 5.x Converter (KB873379) (Turkish version)
</TITLE>

I'd expect that either the 2nd line would be missing from the array, or the 2nd and 3rd lines would end up as one.

2: This is massively more efficient:

For $i = 1 to $Indent
    $NewLine&="    "
Next
Than this:

For $i = 1 to ($Indent*4)
    $NewLine&=" "
Next

Noted. I've changed it in the code.

<TITLE>
Security Update for Office XP: WordPerfect 5.x Converter (KB873379) (Turkish version)
</TITLE>

I'd figure the 2nd line would either not be in the array or the 2nd and 3rd line would be one.

Surprisingly enough, it's OK. The magic is in the [^\n] character class: basically it says that for every ">" that is not followed by a line break, add one. The other line works the same way, differing only in that the line break it checks for comes before the opening "<" rather than after the ">".

Seriously, whoever created regular expressions was a friggin' genius :mellow:


Well, I've been trying to get it to work with the code you provided, but either I'm doing it wrong or there's a bug in AutoIt, because the result isn't what it should be.

I'm using the piece of XML below:

<?xml version="1.0"?><!--7/9/2009 6:30:10 PM--><BUNDLEDATA MAJORVERSION="11" MINORVERSION="8120" BUILD="12.09.100.014" DATE="2009-07-09" URL="http://go.microsoft.com/fwlink/?LinkId=19074"><BUNDLECATALOG><BUNDLE ID="WRD11FF" VER="2" LANG="0" OFFICEVER="3" UPDATETYPE="5" EXTERNAL="False" FROZEN="False" MSIPATCH="True" EXPIRED="True" TYPE="Security Update" KB_NUMBER="887979" BULLETIN_ID="MS05-023" SEVERITY="Critical"><PACKLETSET><PACKLET PUID="512165" REQUIRED="False" /></PACKLETSET></BUNDLE>

As I understand it, the code should just put a line break after each > character when used like this:

Local $sOutput = StringRegExpReplace ($TotalFileRead, ">[^\n]", ">" & @CRLF)
ConsoleWrite($sOutput)

This is the result I get:

<?xml version="1.0"?>
!--7/9/2009 6:30:10 PM-->
BUNDLEDATA MAJORVERSION="11" MINORVERSION="8120" BUILD="12.09.100.014" DATE="2009-07-09" URL="http://go.microsoft.com/fwlink/?LinkId=19074">
BUNDLECATALOG>
BUNDLE ID="WRD11FF" VER="2" LANG="0" OFFICEVER="3" UPDATETYPE="5" EXTERNAL="False" FROZEN="False" MSIPATCH="True" EXPIRED="True" TYPE="Security Update" KB_NUMBER="887979" BULLETIN_ID="MS05-023" SEVERITY="Critical">
PACKLETSET>
PACKLET PUID="512165" REQUIRED="False" />
/PACKLETSET>
/BUNDLE>
BUNDLE ID="ACC_9" VER="1" LANG="1028" OFFICEVER="1" UPDATETYPE="2" EXTERNAL="False" FROZEN="False" MSIPATCH="True" EXPIRED="True" TYPE="Critical Update" KB_NUMBER="287484">

It doesn't just add the line break; it also strips the < character from the start of the next line.

So, what's wrong here?


Found the bug. Here is the fix in context:

Local $XMLText = '<?xml version="1.0"?><!--7/9/2009 6:30:10 PM--><BUNDLEDATA MAJORVERSION="11" MINORVERSION="8120" BUILD="12.09.100.014" DATE="2009-07-09" URL="http://go.microsoft.com/fwlink/?LinkId=19074"><BUNDLECATALOG><BUNDLE ID="WRD11FF" VER="2" LANG="0" OFFICEVER="3" UPDATETYPE="5" EXTERNAL="False" FROZEN="False" MSIPATCH="True" EXPIRED="True" TYPE="Security Update" KB_NUMBER="887979" BULLETIN_ID="MS05-023" SEVERITY="Critical"><PACKLETSET><PACKLET PUID="512165" REQUIRED="False" /></PACKLETSET></BUNDLE>'

Local $CleanXML = StringRegExpReplace ($XMLText, ">([^\n])", ">" & @CRLF & "\1")
$CleanXML = StringRegExpReplace ($CleanXML, "([^\n])<", "\1" & @CRLF & "<")


ConsoleWrite  ($CleanXML)

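The problem was that I had neglected to save the character we tested to make sure there wasn't already a line break. [^\n] matches a real character, so that character was part of the match and got replaced along with the ">". Capturing it in a group and writing it back with \1 keeps it in the output.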
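Another way to avoid losing the tested character is to check it without consuming it. This is a sketch that assumes AutoIt's regular expression engine accepts PCRE-style lookahead and lookbehind; the sample string is a made-up fragment, not the real file:

```autoit
; Shortened, made-up input for illustration only.
Local $sXML = '<a><b>text</b></a>'

; ">" followed by any character other than a line feed: add a break after it.
; The lookahead (?=...) tests the next character but leaves it out of the match.
Local $sClean = StringRegExpReplace($sXML, ">(?=[^\n])", ">" & @CRLF)

; "<" preceded by any character other than a line feed: add a break before it.
; The lookbehind (?<=...) tests the previous character the same way.
$sClean = StringRegExpReplace($sClean, "(?<=[^\n])<", @CRLF & "<")

ConsoleWrite($sClean & @CRLF)
```

Because nothing extra is matched, there is no need for \1 in the replacement, and each tag and the text between them should end up on a line of their own.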

Ahh the joys of coding w/o debugging :mellow:


Here is a slightly more fleshed-out version. It has a minor bug with empty tags like <br><br/>, but handles everything else fine.

Global $INDENT_TEXT = "    "

Local $XMLText = '<?xml version="1.0"?><!--7/9/2009 6:30:10 PM--><BUNDLEDATA MAJORVERSION="11" MINORVERSION="8120" BUILD="12.09.100.014" DATE="2009-07-09" URL="http://go.microsoft.com/fwlink/?LinkId=19074"><BUNDLECATALOG>Test<BUNDLE ID="WRD11FF" VER="2" LANG="0" OFFICEVER="3" UPDATETYPE="5" EXTERNAL="False" FROZEN="False" MSIPATCH="True" EXPIRED="True" TYPE="Security Update" KB_NUMBER="887979" BULLETIN_ID="MS05-023" SEVERITY="Critical"><PACKLETSET><PACKLET PUID="512165" REQUIRED="False" /></PACKLETSET></BUNDLE>'

Local $XMLarray = SplitXML ($XMLText)

Local $indentLevel = -1
For $line = 0 to Ubound ($XMLarray) - 1
    If StringInStr ($XMLarray[$line], "</") Then ; Closing Tag
        $indentLevel -= 1;
        $XMLarray[$line] = MakeIndent ($indentLevel) & $XMLarray[$line]
    ElseIf StringInStr ($XMLarray[$line], "<") Then ; Opening Tag
        $indentLevel += 1;
        $XMLarray[$line] = MakeIndent ($indentLevel) & $XMLarray[$line]
    Else ; Data Line
        $XMLarray[$line] = MakeIndent ($indentLevel + 1) & $XMLarray[$line]
    EndIf
Next

Local $CleanXML = ""
For $line in $XMLarray
    $CleanXML &= $line & @CRLF
Next

ConsoleWrite ($CleanXML)

Func MakeIndent ($Level)
    Local $indent = ""
    For $i = 0 to $Level
        $indent &= $INDENT_TEXT
    Next
    Return $indent
EndFunc

Func SplitXML ($XMLString)
    ;Remove any existing line breaks
    $XMLString = StringReplace ($XMLString, @CR, "")
    $XMLString = StringReplace ($XMLString, @LF, "")
    ;Add one line break per tag
    $XMLString = StringRegExpReplace ($XMLString, "([^\n])<", "\1" & @LF & "<")
    $XMLString = StringRegExpReplace ($XMLString, ">([^\n])", ">" & @LF & "\1")
    ;Split into an array and return
    Return StringSplit ($XMLString, @LF, 2)
EndFunc

I could make it smarter, but I really don't feel like implementing a stack right at this moment :mellow:
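For the empty-tag case, a full stack may not be needed if each line is only classified as opening, closing, or neutral. The following is a rough sketch under that assumption; TagDelta and the sample array are made up for illustration, and it presumes the lines have already been split one tag per line as SplitXML does:

```autoit
#include <String.au3> ; provides _StringRepeat

; Hypothetical helper: 1 for an opening tag, -1 for a closing tag, 0 otherwise.
Func TagDelta($sLine)
    If StringLeft($sLine, 2) = "</" Then Return -1  ; closing tag
    If StringLeft($sLine, 2) = "<?" Then Return 0   ; XML declaration
    If StringLeft($sLine, 2) = "<!" Then Return 0   ; comment or DOCTYPE
    If StringRight($sLine, 2) = "/>" Then Return 0  ; self-closing (empty) tag
    If StringLeft($sLine, 1) = "<" Then Return 1    ; opening tag
    Return 0                                        ; plain text
EndFunc

; Made-up sample, already split one item per line.
Local $aLines[6] = ['<?xml version="1.0"?>', '<a>', '<b x="1" />', 'text', '</a>', '<!--done-->']

Local $iDepth = 0, $iDelta
For $i = 0 To UBound($aLines) - 1
    $iDelta = TagDelta($aLines[$i])
    ; A closing tag steps back first so it lines up with its opening tag.
    If $iDelta < 0 And $iDepth > 0 Then $iDepth -= 1
    $aLines[$i] = _StringRepeat("    ", $iDepth) & $aLines[$i]
    ; An opening tag indents everything that follows it.
    If $iDelta > 0 Then $iDepth += 1
    ConsoleWrite($aLines[$i] & @CRLF)
Next
```

This still does not check that closing tags match their openers, and a comment that contains ">" would be split across lines before it ever reaches TagDelta, so it is a partial fix rather than a real parser.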


@WeaponX

I tried the _XMLTransform function, but it only cleaned up the first two lines, e.g.:

<?xml version="1.0"?><!--7/9/2009 6:30:10 PM--><BUNDLEDATA MAJORVERSION="11" MINORVERSION="8120" BUILD="12.09.100.014" DATE="2009-07-09" URL="http://go.microsoft.com/fwlink/?LinkId=19074"><BUNDLECATALOG><BUNDLE ID="WRD11FF" VER="2" LANG="0" OFFICEVER="3" UPDATETYPE="5" EXTERNAL="False" FROZEN="False" MSIPATCH="True" EXPIRED="True" TYPE="Security Update" KB_NUMBER="887979" BULLETIN_ID="MS05-023" SEVERITY="Critical">

became

<?xml version="1.0" encoding="UTF-16"?>
<!--7/9/2009 6:30:10 PM-->
<BUNDLEDATA MAJORVERSION="11" MINORVERSION="8120" BUILD="12.09.100.014" DATE="2009-07-09" URL="http://go.microsoft.com/fwlink/?LinkId=19074"><BUNDLECATALOG><BUNDLE ID="WRD11FF" VER="2" LANG="0" OFFICEVER="3" UPDATETYPE="5" EXTERNAL="False" FROZEN="False" MSIPATCH="True" EXPIRED="True" TYPE="Security Update" KB_NUMBER="887979" BULLETIN_ID="MS05-023" SEVERITY="Critical">

It probably did more than just that, because the file size doubled, but as you can see, this is not exactly what I expected.

@Fulano

This did the trick. I had to add a third string replacement, though, to strip tabs. The only thing that bothers me a little is the StringSplit function: it seems to take forever on a 35MB file.


The only thing I am a little bothered with is the stringsplit function. It seems to take forever on a 35MB file.

Well ... there is that ...

Frankly speaking, given that AutoIt is not really designed for doing string manipulation on files of that size, I don't think there's a lot you can do to speed it up. I wrote a routine to do a binary search of ZIP codes from a file, and I had to write it in Python and call it from the main script to get decent speed out of it. :mellow:

I have been known to be wrong before, so if you do run into a clever trick that speeds things up, I'd be interested in hearing about it.
