TiC01
Posted March 8, 2010 (edited March 9, 2010 by TiC01)

Hi all,

I haven't posted much, but I've created a script I just needed to share.

I've been working on a little project that involves the Microsoft wsuscab (which is barely readable). To make it readable I've written a script that reads the entire file, strips empty spaces, indents the lines and writes them back to a file.

Please have a look at it. Any comments and suggestions are welcome!

Attachment: XMLCleanup.au3

Fulano
Posted March 9, 2010

Two suggestions came to mind while looking at it:

1: You can force it into this format:

    <tag>
    string
    </tag>

with the following:

    StringRegExpReplace ($XMLText, ">[^\n]", ">" & @CRLF)
    StringRegExpReplace ($XMLText, "[^\n]<", @CRLF & "<")

Then you can split it into an array and walk the array, computing the indent based on how many open tags you find before you find the appropriate close tag.

2: This is massively more efficient:

    For $i = 1 to $Indent
        $NewLine &= "    "
    Next

than this:

    For $i = 1 to ($Indent*4)
        $NewLine &= " "
    Next

Suppose we are indenting this bit of XML:

    <xml>
        <first>
            <second>
                <third>
                    <fourth>
                        note
                    </fourth>
                </third>
            </second>
        </first>
    </xml>

In order to indent the 'note' line correctly, the original version would have to loop 20 times. The modified version has to loop only 5 times.

Happy Coding
Fulano

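The loop counts in the second suggestion can be checked by counting iterations directly. This is only a sketch, written in Python rather than AutoIt; the function names are hypothetical, and it assumes four spaces per indent level, which is what the 20-versus-5 figures imply.

```python
INDENT_UNIT = "    "  # assumption: four spaces per indent level


def indent_per_char(level):
    """Append one space per iteration (level * 4 iterations)."""
    out, iterations = "", 0
    for _ in range(level * 4):
        out += " "
        iterations += 1
    return out, iterations


def indent_per_level(level):
    """Append one whole indent unit per iteration (level iterations)."""
    out, iterations = "", 0
    for _ in range(level):
        out += INDENT_UNIT
        iterations += 1
    return out, iterations


def indent_no_loop(level):
    """Build the same string with no explicit loop at all."""
    return INDENT_UNIT * level


if __name__ == "__main__":
    # The 'note' line in the example sits five levels deep.
    print(indent_per_char(5)[1], indent_per_level(5)[1])
```

Both loops produce the same 20-character string; only the number of appends differs. In AutoIt, `_StringRepeat` from String.au3 should serve the same purpose as the last function.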
TiC01 (Author)
Posted March 9, 2010

Thank you for your reply. I've been checking your suggestions:

Quote (Fulano):
    1: You can force it into this format:

        <tag>
        string
        </tag>

    with the following:

        StringRegExpReplace ($XMLText, ">[^\n]", ">" & @CRLF)
        StringRegExpReplace ($XMLText, "[^\n]<", @CRLF & "<")

    Then you can split it into an array and walk the array, computing the indent based on how many open tags you find before you find the appropriate close tag.

I understand what your intention is, but wouldn't this piece of code (it comes from the Microsoft Office Patchdata.xml) fail on that?

    <TITLE>
    Security Update for Office XP: WordPerfect 5.x Converter (KB873379) (Turkish version)
    </TITLE>

I'd figure the 2nd line would either not be in the array, or the 2nd and 3rd lines would be one.

Quote (Fulano):
    2: This is massively more efficient:

        For $i = 1 to $Indent
            $NewLine &= "    "
        Next

    than this:

        For $i = 1 to ($Indent*4)
            $NewLine &= " "
        Next

Noted. I've changed it in the code.

Fulano
Posted March 9, 2010

Quote (TiC01):
    <TITLE>
    Security Update for Office XP: WordPerfect 5.x Converter (KB873379) (Turkish version)
    </TITLE>

    I'd figure the 2nd line would either not be in the array, or the 2nd and 3rd lines would be one.

Surprisingly enough it's OK. The magic is in the [^\n] character class: basically it says, for every instance of ">" that is not followed by a line break, add one. The other line is identical in function, differing only in that the line break it checks for is before the open tag rather than after.

Seriously, whoever created regular expressions was a friggin' genius.

TiC01 (Author)
Posted March 10, 2010

Well, I've been trying to get it to work with the piece of code you provided, but either I'm doing it wrong or there's a bug in AutoIt, because the result isn't what it should be.

I'm using this piece of an XML file:

    <?xml version="1.0"?><!--7/9/2009 6:30:10 PM--><BUNDLEDATA MAJORVERSION="11" MINORVERSION="8120" BUILD="12.09.100.014" DATE="2009-07-09" URL="http://go.microsoft.com/fwlink/?LinkId=19074"><BUNDLECATALOG><BUNDLE ID="WRD11FF" VER="2" LANG="0" OFFICEVER="3" UPDATETYPE="5" EXTERNAL="False" FROZEN="False" MSIPATCH="True" EXPIRED="True" TYPE="Security Update" KB_NUMBER="887979" BULLETIN_ID="MS05-023" SEVERITY="Critical"><PACKLETSET><PACKLET PUID="512165" REQUIRED="False" /></PACKLETSET></BUNDLE>

As I understand it, the code should just put an ENTER after each > character when using this:

    Local $sOutput = StringRegExpReplace ($TotalFileRead, ">[^\n]", ">" & @CRLF)
    ConsoleWrite($sOutput)

This is the result I get:

    <?xml version="1.0"?>
    !--7/9/2009 6:30:10 PM-->
    BUNDLEDATA MAJORVERSION="11" MINORVERSION="8120" BUILD="12.09.100.014" DATE="2009-07-09" URL="http://go.microsoft.com/fwlink/?LinkId=19074">
    BUNDLECATALOG>
    BUNDLE ID="WRD11FF" VER="2" LANG="0" OFFICEVER="3" UPDATETYPE="5" EXTERNAL="False" FROZEN="False" MSIPATCH="True" EXPIRED="True" TYPE="Security Update" KB_NUMBER="887979" BULLETIN_ID="MS05-023" SEVERITY="Critical">
    PACKLETSET>
    PACKLET PUID="512165" REQUIRED="False" />
    /PACKLETSET>
    /BUNDLE>
    BUNDLE ID="ACC_9" VER="1" LANG="1028" OFFICEVER="1" UPDATETYPE="2" EXTERNAL="False" FROZEN="False" MSIPATCH="True" EXPIRED="True" TYPE="Critical Update" KB_NUMBER="287484">

It doesn't just add the ENTER, it also strips the < character from the start of the next line.

So, what's wrong here?

Fulano
Posted March 10, 2010

Found the bug. Here it is in context:

    Local $XMLText = '<?xml version="1.0"?><!--7/9/2009 6:30:10 PM--><BUNDLEDATA MAJORVERSION="11" MINORVERSION="8120" BUILD="12.09.100.014" DATE="2009-07-09" URL="http://go.microsoft.com/fwlink/?LinkId=19074"><BUNDLECATALOG><BUNDLE ID="WRD11FF" VER="2" LANG="0" OFFICEVER="3" UPDATETYPE="5" EXTERNAL="False" FROZEN="False" MSIPATCH="True" EXPIRED="True" TYPE="Security Update" KB_NUMBER="887979" BULLETIN_ID="MS05-023" SEVERITY="Critical"><PACKLETSET><PACKLET PUID="512165" REQUIRED="False" /></PACKLETSET></BUNDLE>'

    Local $CleanXML = StringRegExpReplace ($XMLText, ">([^\n])", ">" & @CRLF & "\1")
    $CleanXML = StringRegExpReplace ($CleanXML, "([^\n])<", "\1" & @CRLF & "<")

    ConsoleWrite ($CleanXML)

The problem was that I had neglected to save the character we tested to make sure there wasn't already a line break. The [^\n] part of the pattern matches a real character, so the replacement has to put it back with \1.

Ahh, the joys of coding w/o debugging.

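The difference between the two versions comes down to whether the character matched by [^\n] is written back. The sketch below is in Python rather than AutoIt (for patterns this simple, Python's `re` module should behave the same way); the sample string and function names are made up for illustration. It shows the dropped character, the capture-group fix, and a lookaround variant that never consumes the neighbouring character in the first place.

```python
import re

CRLF = "\r\n"
SAMPLE = "<a><b>hi</b></a>"  # hypothetical one-line input


def split_dropping(text):
    """First attempt: the character after '>' is matched and then lost."""
    return re.sub(r">[^\n]", ">" + CRLF, text)


def split_capturing(text):
    """Fix from the thread: capture the tested character and re-insert it."""
    text = re.sub(r">([^\n])", ">" + CRLF + r"\1", text)
    return re.sub(r"([^\n])<", r"\1" + CRLF + "<", text)


def split_lookaround(text):
    """Alternative: lookahead/lookbehind test the neighbour without consuming it."""
    text = re.sub(r">(?=[^\n])", ">" + CRLF, text)
    return re.sub(r"(?<=[^\n])<", CRLF + "<", text)


if __name__ == "__main__":
    print(repr(split_dropping(SAMPLE)))    # '<a>\r\nb>\r\ni</b>\r\n/a>'
    print(repr(split_capturing(SAMPLE)))   # '<a>\r\n<b>\r\nhi\r\n</b>\r\n</a>'
    print(split_capturing(SAMPLE) == split_lookaround(SAMPLE))  # True
```

The lookaround form needs no backreference in the replacement, which removes the class of mistake entirely. Whether it is any faster on a large file is not something this sketch measures.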
Fulano
Posted March 10, 2010

Here is a slightly more fleshed-out version. It has a minor bug with empty tags like <br><br/>, but handles everything else fine.

    Global $INDENT_TEXT = " "
    Local $XMLText = '<?xml version="1.0"?><!--7/9/2009 6:30:10 PM--><BUNDLEDATA MAJORVERSION="11" MINORVERSION="8120" BUILD="12.09.100.014" DATE="2009-07-09" URL="http://go.microsoft.com/fwlink/?LinkId=19074"><BUNDLECATALOG>Test<BUNDLE ID="WRD11FF" VER="2" LANG="0" OFFICEVER="3" UPDATETYPE="5" EXTERNAL="False" FROZEN="False" MSIPATCH="True" EXPIRED="True" TYPE="Security Update" KB_NUMBER="887979" BULLETIN_ID="MS05-023" SEVERITY="Critical"><PACKLETSET><PACKLET PUID="512165" REQUIRED="False" /></PACKLETSET></BUNDLE>'

    Local $XMLarray = SplitXML ($XMLText)
    Local $indentLevel = -1

    For $line = 0 to Ubound ($XMLarray) - 1
        If StringInStr ($XMLarray[$line], "</") Then ; Closing Tag
            $indentLevel -= 1
            $XMLarray[$line] = MakeIndent ($indentLevel) & $XMLarray[$line]
        ElseIf StringInStr ($XMLarray[$line], "<") Then ; Opening Tag
            $indentLevel += 1
            $XMLarray[$line] = MakeIndent ($indentLevel) & $XMLarray[$line]
        Else ; Data Line
            $XMLarray[$line] = MakeIndent ($indentLevel + 1) & $XMLarray[$line]
        EndIf
    Next

    Local $CleanXML = ""
    For $line in $XMLarray
        $CleanXML &= $line & @CRLF
    Next
    ConsoleWrite ($CleanXML)

    Func MakeIndent ($Level)
        Local $indent = ""
        For $i = 0 to $Level
            $indent &= $INDENT_TEXT
        Next
        Return $indent
    EndFunc

    Func SplitXML ($XMLString)
        ;Remove any existing line breaks
        $XMLString = StringReplace ($XMLString, @CR, "")
        $XMLString = StringReplace ($XMLString, @LF, "")

        ;Add one line break per tag
        $XMLString = StringRegExpReplace ($XMLString, "([^\n])<", "\1" & @LF & "<")
        $XMLString = StringRegExpReplace ($XMLString, ">([^\n])", ">" & @LF & "\1")

        ;Split into an array and return
        Return StringSplit ($XMLString, @LF, 2)
    EndFunc

I could make it smarter, but I really don't feel like implementing a stack right at this moment.

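The version above treats every line that contains "<" but not "</" as an opening tag, so self-closing tags, comments and the <?xml ...?> declaration all push the indent one level deeper. One possible "smarter" variant is sketched below in Python rather than AutoIt. It uses a plain depth counter instead of a full stack, so it does not check that closing tags match their openers, and the names `TOKEN` and `pretty` are hypothetical. It also assumes no ">" appears inside comments, CDATA sections or attribute values.

```python
import re

# A token is either one tag (<...>) or one run of text between tags.
TOKEN = re.compile(r"<[^>]*>|[^<]+")


def pretty(xml, unit="    "):
    """Return xml with one tag or text run per line, indented by nesting depth."""
    depth = 0
    lines = []
    for token in TOKEN.findall(xml):
        text = token.strip()
        if not text:
            continue  # whitespace between tags
        if text.startswith("</"):
            # Closing tag: step back out first, then print at the opener's depth.
            depth = max(depth - 1, 0)
            lines.append(unit * depth + text)
        elif text.startswith(("<?", "<!")) or text.endswith("/>"):
            # Declaration, comment or self-closing tag: depth is unchanged.
            lines.append(unit * depth + text)
        elif text.startswith("<"):
            # Opening tag: print, then nest everything that follows.
            lines.append(unit * depth + text)
            depth += 1
        else:
            # Text content sits at the current depth, inside its parent.
            lines.append(unit * depth + text)
    return "\n".join(lines)


if __name__ == "__main__":
    print(pretty('<?xml version="1.0"?><!--c--><A X="1"><B>Test<C><D P="1" /></C></B></A>'))
```

A real stack would only be needed to report mismatched tags; for indentation alone the counter carries the same information.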
weaponx
Posted March 10, 2010

Using the XML UDF you can indent an XML file using one line:

    _XMLTransform("", "", "")

This calls a default stylesheet and applies it to the loaded file.

Fulano
Posted March 10, 2010

There's always an easier way.

TiC01 (Author)
Posted March 13, 2010

@WeaponX
I tried the _XMLTransform function, but it only cleaned up the first two lines. For example:

    <?xml version="1.0"?><!--7/9/2009 6:30:10 PM--><BUNDLEDATA MAJORVERSION="11" MINORVERSION="8120" BUILD="12.09.100.014" DATE="2009-07-09" URL="http://go.microsoft.com/fwlink/?LinkId=19074"><BUNDLECATALOG><BUNDLE ID="WRD11FF" VER="2" LANG="0" OFFICEVER="3" UPDATETYPE="5" EXTERNAL="False" FROZEN="False" MSIPATCH="True" EXPIRED="True" TYPE="Security Update" KB_NUMBER="887979" BULLETIN_ID="MS05-023" SEVERITY="Critical">

became

    <?xml version="1.0" encoding="UTF-16"?>
    <!--7/9/2009 6:30:10 PM-->
    <BUNDLEDATA MAJORVERSION="11" MINORVERSION="8120" BUILD="12.09.100.014" DATE="2009-07-09" URL="http://go.microsoft.com/fwlink/?LinkId=19074"><BUNDLECATALOG><BUNDLE ID="WRD11FF" VER="2" LANG="0" OFFICEVER="3" UPDATETYPE="5" EXTERNAL="False" FROZEN="False" MSIPATCH="True" EXPIRED="True" TYPE="Security Update" KB_NUMBER="887979" BULLETIN_ID="MS05-023" SEVERITY="Critical">

It probably did something more than just that, because the file size doubled, but as you can see, this is not exactly what I expected.

@Fulano
This did the trick. I had to add a third string replacement though (for tabs).

The only thing that bothers me a little is the StringSplit function. It seems to take forever on a 35MB file.

Fulano
Posted March 13, 2010

Quote (TiC01):
    The only thing that bothers me a little is the StringSplit function. It seems to take forever on a 35MB file.

Well ... there is that ...

Frankly speaking, given that AutoIt is not really designed for doing string manipulations on files of that size, I don't think there's a lot you can do to speed it up. I wrote a routine to do a binary search of zipcodes from a file, and I had to write it in Python and call it from the main script to get decent speed out of it.

I have been known to be wrong before, so if you do run into a clever trick that speeds things up, I'd be interested in hearing about it.

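One direction worth trying for a file of that size, in the spirit of the call-out-to-Python approach mentioned above, is to avoid building one huge array and instead read the file in chunks and write each indented line as soon as its tag is complete. The following is only a sketch under assumptions: the names `stream_tokens` and `indent_stream` are hypothetical, tags are assumed not to contain "<" or ">" inside attribute values or comments, and it has not been timed against a 35MB file.

```python
import io
import re

TOKEN = re.compile(r"<[^>]*>|[^<]+")


def stream_tokens(handle, chunk_size=64 * 1024):
    """Yield tags and text runs from a file-like object, one chunk at a time."""
    buffer = ""
    while True:
        chunk = handle.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        # Everything before the last "<" is made of complete tokens;
        # the tail may be a tag cut off by the chunk boundary, so keep it.
        cut = buffer.rfind("<")
        if cut > 0:
            head, buffer = buffer[:cut], buffer[cut:]
            yield from TOKEN.findall(head)
    yield from TOKEN.findall(buffer)


def indent_stream(src, dst, unit="    "):
    """Write one indented tag or text run per line from src to dst."""
    depth = 0
    for token in stream_tokens(src):
        text = token.strip()
        if not text:
            continue
        if text.startswith("</"):
            depth = max(depth - 1, 0)
        dst.write(unit * depth + text + "\n")
        is_opening = (
            text.startswith("<")
            and not text.startswith(("</", "<?", "<!"))
            and not text.endswith("/>")
        )
        if is_opening:
            depth += 1


if __name__ == "__main__":
    out = io.StringIO()
    indent_stream(io.StringIO("<a><b>x</b><c/></a>"), out)
    print(out.getvalue(), end="")
```

With real files, `src` and `dst` would be handles from `open(...)`; memory use then stays close to one chunk plus the longest single tag, regardless of file size.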