TiC01

XML-Cleanup

TiC01

Hi all,

I haven't posted much, but I've created a script I just needed to share.

I've been working on a little project that involves the Microsoft wsuscab, whose XML is barely readable. To make it readable, I wrote a script that reads the entire file, strips the empty space, indents the lines, and writes the result back to a file.

Please have a look at it. Any comments and suggestions are welcome!

XMLCleanup.au3

Edited by TiC01

Fulano

Two suggestions came to mind while looking at it:

1: You can force it into this format:

<tag>
string
</tag>

with the following:

StringRegExpReplace ($XMLText, ">[^\n]", ">" & @CRLF)
StringRegExpReplace ($XMLText, "[^\n]<", @CRLF & "<")

Then you can split it into an array and walk the array, computing the indent based on how many open tags you find before you find the appropriate close tag

2:This is massively more efficient:

For $i = 1 to $Indent
    $NewLine&="    "
Next
Than this:

For $i = 1 to ($Indent*4)
    $NewLine&=" "
Next

Suppose we are indenting this bit of xml:

<xml>
    <first>
        <second>
            <third>
                <fourth>
note
                </fourth>
            </third>

        </second>

    </first>
</xml>
In order to indent the 'note' line correctly, the original version would have to loop 20 times. The modified version has to loop only 5 times.
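The comparison above can be sketched as follows. The loop is the one from the post; the second variant assumes the `_StringRepeat` helper from the standard `String.au3` include, which nobody in this thread uses, so treat it as an optional alternative rather than part of the original script. `$Indent = 5` is the depth of the 'note' line.

```autoit
#include <String.au3>

Local $Indent = 5

; Loop version from the post: one concatenation per indent level (5 iterations).
Local $NewLine = ""
For $i = 1 To $Indent
    $NewLine &= "    "
Next

; Library version (an assumption, not from the thread): build the same padding in one call.
Local $Repeated = _StringRepeat("    ", $Indent)

; Both strings are 20 characters long.
ConsoleWrite(StringLen($NewLine) & " " & StringLen($Repeated) & @CRLF)
```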

Happy Coding :mellow:

Fulano


#fgpkerw4kcmnq2mns1ax7ilndopen (Q, $0); while ($l = <Q>){if ($l =~ m/^#.*/){$l =~ tr/a-z1-9#/Huh, Junketeer's Alternate Pro Ace /; print $l;}}close (Q);[code] tag ninja!

TiC01

Thank you for your reply. I've been checking your suggestions:

1: You can force it into this format:

<tag>
string
</tag>

with the following:

StringRegExpReplace ($XMLText, ">[^\n]", ">" & @CRLF)
StringRegExpReplace ($XMLText, "[^\n]<", @CRLF & "<")

Then you can split it into an array and walk the array, computing the indent based on how many open tags you find before you find the appropriate close tag

I understand what you intend, but wouldn't it fail on a piece of XML like this one (it comes from the Microsoft Office Patchdata.xml)?

<TITLE>
Security Update for Office XP: WordPerfect 5.x Converter (KB873379) (Turkish version)
</TITLE>

I'd figure the 2nd line would either not be in the array or the 2nd and 3rd line would be one.

2:This is massively more efficient:

For $i = 1 to $Indent
    $NewLine&="    "
Next
Than this:

For $i = 1 to ($Indent*4)
    $NewLine&=" "
Next

Noted. I've changed it in the code.

Fulano

<TITLE>
Security Update for Office XP: WordPerfect 5.x Converter (KB873379) (Turkish version)
</TITLE>

I'd figure the 2nd line would either not be in the array or the 2nd and 3rd line would be one.

Surprisingly enough, it's OK. The magic is in the [^\n] character class: the first pattern says that wherever a ">" is not followed by a line break, one should be added. The second pattern does the same job, except that the line break it checks for comes before the "<" rather than after the ">".

Seriously, whoever created regular expressions was a friggin' genius :mellow:


TiC01

Well, I've been trying to get it to work with the piece of code you provided, but either I'm doing it wrong or there's a bug in AutoIt, because the result isn't what it should be.

I'm using the piece of XML below:

<?xml version="1.0"?><!--7/9/2009 6:30:10 PM--><BUNDLEDATA MAJORVERSION="11" MINORVERSION="8120" BUILD="12.09.100.014" DATE="2009-07-09" URL="http://go.microsoft.com/fwlink/?LinkId=19074"><BUNDLECATALOG><BUNDLE ID="WRD11FF" VER="2" LANG="0" OFFICEVER="3" UPDATETYPE="5" EXTERNAL="False" FROZEN="False" MSIPATCH="True" EXPIRED="True" TYPE="Security Update" KB_NUMBER="887979" BULLETIN_ID="MS05-023" SEVERITY="Critical"><PACKLETSET><PACKLET PUID="512165" REQUIRED="False" /></PACKLETSET></BUNDLE>

As I understand it, the code should just put an ENTER after each > character when using this:

Local $sOutput = StringRegExpReplace ($TotalFileRead, ">[^\n]", ">" & @CRLF)
ConsoleWrite($sOutput)

This is the result I get:

<?xml version="1.0"?>
!--7/9/2009 6:30:10 PM-->
BUNDLEDATA MAJORVERSION="11" MINORVERSION="8120" BUILD="12.09.100.014" DATE="2009-07-09" URL="http://go.microsoft.com/fwlink/?LinkId=19074">
BUNDLECATALOG>
BUNDLE ID="WRD11FF" VER="2" LANG="0" OFFICEVER="3" UPDATETYPE="5" EXTERNAL="False" FROZEN="False" MSIPATCH="True" EXPIRED="True" TYPE="Security Update" KB_NUMBER="887979" BULLETIN_ID="MS05-023" SEVERITY="Critical">
PACKLETSET>
PACKLET PUID="512165" REQUIRED="False" />
/PACKLETSET>
/BUNDLE>
BUNDLE ID="ACC_9" VER="1" LANG="1028" OFFICEVER="1" UPDATETYPE="2" EXTERNAL="False" FROZEN="False" MSIPATCH="True" EXPIRED="True" TYPE="Critical Update" KB_NUMBER="287484">

It doesn't just add the ENTER; it also strips the < character from the start of the next line.

So, what's wrong here?

Fulano

Found the bug. Here it is in context:

Local $XMLText = '<?xml version="1.0"?><!--7/9/2009 6:30:10 PM--><BUNDLEDATA MAJORVERSION="11" MINORVERSION="8120" BUILD="12.09.100.014" DATE="2009-07-09" URL="http://go.microsoft.com/fwlink/?LinkId=19074"><BUNDLECATALOG><BUNDLE ID="WRD11FF" VER="2" LANG="0" OFFICEVER="3" UPDATETYPE="5" EXTERNAL="False" FROZEN="False" MSIPATCH="True" EXPIRED="True" TYPE="Security Update" KB_NUMBER="887979" BULLETIN_ID="MS05-023" SEVERITY="Critical"><PACKLETSET><PACKLET PUID="512165" REQUIRED="False" /></PACKLETSET></BUNDLE>'

Local $CleanXML = StringRegExpReplace ($XMLText, ">([^\n])", ">" & @CRLF & "\1")
$CleanXML = StringRegExpReplace ($CleanXML, "([^\n])<", "\1" & @CRLF & "<")


ConsoleWrite  ($CleanXML)

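The problem was that I had neglected to save the character we tested to make sure there wasn't already a line break. [^\n] matches, and therefore consumes, the character next to the bracket, so the pattern has to capture it in parentheses and the replacement has to write it back with \1.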

Ahh the joys of coding w/o debugging :mellow:
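To make the difference concrete, here is a small side-by-side sketch on a made-up input string. The first two variants are the patterns from this thread. The third uses lookahead and lookbehind, which is my own addition and assumes the PCRE-style engine behind StringRegExpReplace; it tests the neighbouring character without consuming it, so nothing needs to be written back.

```autoit
Local $XMLText = "<a><b>text</b></a>"

; Consuming pattern: [^\n] swallows the character after ">" and the
; replacement never puts it back, so that character is lost.
Local $Broken = StringRegExpReplace($XMLText, ">[^\n]", ">" & @LF)

; Capturing pattern (the fix above): \1 writes the swallowed character back.
Local $Fixed = StringRegExpReplace($XMLText, ">([^\n])", ">" & @LF & "\1")
$Fixed = StringRegExpReplace($Fixed, "([^\n])<", "\1" & @LF & "<")

; Lookaround pattern (alternative, not from the thread): the neighbouring
; character is only inspected, never consumed.
Local $Look = StringRegExpReplace($XMLText, ">(?=[^\n])", ">" & @LF)
$Look = StringRegExpReplace($Look, "(?<=[^\n])<", @LF & "<")

ConsoleWrite($Broken & @LF & "----" & @LF)
ConsoleWrite($Fixed & @LF & "----" & @LF)
ConsoleWrite($Look & @LF)
```

For this input the capturing and lookaround variants give the same five lines, one tag or text run per line, while the consuming variant loses the first character after each ">".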


Fulano

Here is a slightly more fleshed out version. It has a minor bug with empty tags like <br><br/>, but handles everything else fine.

Global $INDENT_TEXT = "    "

Local $XMLText = '<?xml version="1.0"?><!--7/9/2009 6:30:10 PM--><BUNDLEDATA MAJORVERSION="11" MINORVERSION="8120" BUILD="12.09.100.014" DATE="2009-07-09" URL="http://go.microsoft.com/fwlink/?LinkId=19074"><BUNDLECATALOG>Test<BUNDLE ID="WRD11FF" VER="2" LANG="0" OFFICEVER="3" UPDATETYPE="5" EXTERNAL="False" FROZEN="False" MSIPATCH="True" EXPIRED="True" TYPE="Security Update" KB_NUMBER="887979" BULLETIN_ID="MS05-023" SEVERITY="Critical"><PACKLETSET><PACKLET PUID="512165" REQUIRED="False" /></PACKLETSET></BUNDLE>'

Local $XMLarray = SplitXML ($XMLText)

Local $indentLevel = -1
For $line = 0 to Ubound ($XMLarray) - 1
    If StringInStr ($XMLarray[$line], "</") Then ; Closing Tag
        $indentLevel -= 1
        $XMLarray[$line] = MakeIndent ($indentLevel) & $XMLarray[$line]
    ElseIf StringInStr ($XMLarray[$line], "<") Then ; Opening Tag
        $indentLevel += 1
        $XMLarray[$line] = MakeIndent ($indentLevel) & $XMLarray[$line]
    Else ; Data Line
        $XMLarray[$line] = MakeIndent ($indentLevel + 1) & $XMLarray[$line]
    EndIf
Next

Local $CleanXML = ""
For $line in $XMLarray
    $CleanXML &= $line & @CRLF
Next

ConsoleWrite ($CleanXML)

Func MakeIndent ($Level)
    Local $indent = ""
    For $i = 0 to $Level
        $indent &= $INDENT_TEXT
    Next
    Return $indent
EndFunc

Func SplitXML ($XMLString)
    ;Remove any existing line breaks
    $XMLString = StringReplace ($XMLString, @CR, "")
    $XMLString = StringReplace ($XMLString, @LF, "")
    ;Add one line break per tag
    $XMLString = StringRegExpReplace ($XMLString, "([^\n])<", "\1" & @LF & "<")
    $XMLString = StringRegExpReplace ($XMLString, ">([^\n])", ">" & @LF & "\1")
    ;Split into an array and return (flag 2 = no count in element 0)
    Return StringSplit ($XMLString, @LF, 2)
EndFunc

I could make it smarter, but I really don't feel like implementing a stack right at this moment :mellow:
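One way to cover self-closing tags without a full stack might be to keep the single counter but only step in on a real opening tag. The sketch below is an illustration under that assumption, not a replacement for the script above: `IndentLines`, `IsOpeningTag`, `Pad` and `$PAD_UNIT` are hypothetical names, and the input is expected to be one tag or one text run per array element, as SplitXML returns it. Here a closing tag is printed at the same depth as its opening tag, and declarations, comments and self-closing tags leave the depth alone.

```autoit
Global Const $PAD_UNIT = "    " ; hypothetical name, same width as $INDENT_TEXT

; Indents the elements of $aLines in place.
Func IndentLines(ByRef $aLines)
    Local $iDepth = 0
    Local $sLine
    For $i = 0 To UBound($aLines) - 1
        $sLine = $aLines[$i]
        ; A closing tag steps out before it is printed.
        If StringLeft($sLine, 2) = "</" And $iDepth > 0 Then $iDepth -= 1
        $aLines[$i] = Pad($iDepth) & $sLine
        ; Only a real opening tag steps in for the lines that follow.
        If IsOpeningTag($sLine) Then $iDepth += 1
    Next
EndFunc

Func IsOpeningTag($sLine)
    If StringLeft($sLine, 1) <> "<" Then Return False  ; text line
    If StringLeft($sLine, 2) = "</" Then Return False  ; closing tag
    If StringLeft($sLine, 2) = "<?" Then Return False  ; XML declaration
    If StringLeft($sLine, 2) = "<!" Then Return False  ; comment or DOCTYPE
    If StringRight($sLine, 2) = "/>" Then Return False ; self-closing tag
    Return True
EndFunc

Func Pad($iDepth)
    Local $sPad = ""
    For $i = 1 To $iDepth
        $sPad &= $PAD_UNIT
    Next
    Return $sPad
EndFunc

; Small demo on a made-up array.
Local $aDemo[6] = ["<a>", "<b/>", "<c>", "text", "</c>", "</a>"]
IndentLines($aDemo)
For $sLine In $aDemo
    ConsoleWrite($sLine & @CRLF)
Next
```

This still does not check that tag names match, and it would be thrown off by CDATA sections or by a ">" inside an attribute value, so a stack-based version would remain the more robust option.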


weaponx

Using the XML UDF you can indent an XML file using one line...

_XMLTransform("", "", "")

This calls a default stylesheet and applies it to the loaded file.

Fulano
:mellow: There's always an easier way :(

TiC01

@WeaponX

I tried the _XMLTransform function, but it only cleaned up the first two lines, e.g.:

<?xml version="1.0"?><!--7/9/2009 6:30:10 PM--><BUNDLEDATA MAJORVERSION="11" MINORVERSION="8120" BUILD="12.09.100.014" DATE="2009-07-09" URL="http://go.microsoft.com/fwlink/?LinkId=19074"><BUNDLECATALOG><BUNDLE ID="WRD11FF" VER="2" LANG="0" OFFICEVER="3" UPDATETYPE="5" EXTERNAL="False" FROZEN="False" MSIPATCH="True" EXPIRED="True" TYPE="Security Update" KB_NUMBER="887979" BULLETIN_ID="MS05-023" SEVERITY="Critical">

became

<?xml version="1.0" encoding="UTF-16"?>
<!--7/9/2009 6:30:10 PM-->
<BUNDLEDATA MAJORVERSION="11" MINORVERSION="8120" BUILD="12.09.100.014" DATE="2009-07-09" URL="http://go.microsoft.com/fwlink/?LinkId=19074"><BUNDLECATALOG><BUNDLE ID="WRD11FF" VER="2" LANG="0" OFFICEVER="3" UPDATETYPE="5" EXTERNAL="False" FROZEN="False" MSIPATCH="True" EXPIRED="True" TYPE="Security Update" KB_NUMBER="887979" BULLETIN_ID="MS05-023" SEVERITY="Critical">

It probably did something more than just that, because the file size doubled, but as you can see, this is not exactly what I expected.

@Fulano

This did the trick. I had to add a third string replacement, though (namely for tabs). The only thing I am a little bothered with is the stringsplit function. It seems to take forever on a 35MB file.

Fulano

The only thing I am a little bothered with is the stringsplit function. It seems to take forever on a 35MB file.

Well ... there is that ...

Frankly speaking, given that AutoIt is not really designed for doing string manipulation on files of that size, I don't think there's a lot you can do to speed it up. I wrote a routine to do a binary search of zip codes from a file, and I had to write it in Python and call it from the main script to get decent speed out of it. :mellow:

I have been known to be wrong before, so if you do run into a clever trick that speeds things up, I'd be interested in hearing about it.
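One thing that might be worth trying is to skip the array altogether: walk the @LF-separated string with StringInStr and write each line straight to the output file. The sketch below only shows the walking part. `WriteLinesStreaming` and the demo file name are hypothetical, and I have not measured it against StringSplit on a 35MB file, so whether it is actually faster would need to be timed.

```autoit
; Hypothetical helper: walks a @LF-separated string line by line and writes
; each line to $sOutPath, so no array and no large result string is built.
Func WriteLinesStreaming($sText, $sOutPath)
    Local $hOut = FileOpen($sOutPath, 2) ; 2 = write mode, erase previous contents
    If $hOut = -1 Then Return SetError(1, 0, 0)

    Local $iStart = 1, $iPos, $iCount = 0
    While 1
        ; Case-sensitive search for the next @LF, starting at $iStart.
        $iPos = StringInStr($sText, @LF, 1, 1, $iStart)
        If $iPos = 0 Then ExitLoop
        ; Any per-line work (such as indenting) would go here.
        FileWriteLine($hOut, StringMid($sText, $iStart, $iPos - $iStart))
        $iCount += 1
        $iStart = $iPos + 1
    WEnd

    ; Whatever follows the last @LF is the final line.
    If $iStart <= StringLen($sText) Then
        FileWriteLine($hOut, StringMid($sText, $iStart))
        $iCount += 1
    EndIf

    FileClose($hOut)
    Return $iCount
EndFunc

Local $sDemo = "<a>" & @LF & "<b/>" & @LF & "</a>"
Local $iLines = WriteLinesStreaming($sDemo, @TempDir & "\xml_lines_demo.txt")
ConsoleWrite($iLines & " lines written" & @CRLF)
```

Writing through a file handle also avoids the `$CleanXML &= ...` concatenation loop, which is another place where a file that size could cost time.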

