leuce Posted November 22, 2011

G'day everyone

Background: I'm trying to write a file splitter for a specific type of XML file. The format of the XML is very simple, so I don't need to take into account things like schema or DTD etc.

My question: I have a simple and a complex question, but I'll start simple :-)

1. How can I split the file into parts so that the delimiter is included in the split parts (without simply adding the delimiter manually)?

In other words, if I have this as the file:

    $i = something</tu><tu>something</tu><tu>something</tu><tu>something

If I use StringSplit ($i, "</tu><tu>", 1) then I get this in the array:

    something
    something
    something
    something

But what I want to get in the array is this:

    something</tu>
    <tu>something</tu>
    <tu>something</tu>
    <tu>something

Is there a simple, easy way of doing that?

2. Once I know the above, the idea is to use regex to specify a string split based on multiples of blocks. In other words, if I write a file splitter, it would split the file not by every block but by e.g. every three blocks. To illustrate:

    $j = a</tu><tu>b</tu><tu>c</tu><tu>d</tu><tu>e</tu><tu>f</tu><tu>g</tu><tu>h</tu><tu>i</tu><tu>j</tu><tu>k

If the number of blocks per split is 3, then I want it to be split like this:

    a</tu><tu>b</tu><tu>c</tu> (3 blocks)
    <tu>d</tu><tu>e</tu><tu>f</tu> (3 blocks)
    <tu>g</tu><tu>h</tu><tu>i</tu> (3 blocks)
    <tu>j</tu><tu>k (the remainder blocks)

Though I guess that would involve an insanely complex regular expression (to be figured out later).

Any thoughts? Or am I doing this all wrong?

Thanks
Samuel
UEZ Posted November 22, 2011 (edited)

Try this:

    #include <array.au3>
    $i = "something<tu>something</tu><tu>something</tu><tu>something"
    $aNew = StringSplit(StringReplace($i, "><", ">" & @LF & "<"), @LF, 2)
    _ArrayDisplay($aNew)

Br,
UEZ
czardas Posted November 22, 2011 (edited)

Here's my attempt.

    #include <Array.au3>
    $i = "something</tu><tu>something</tu><tu>something</tu><tu>something"
    $aArray = StringRegExp($i, "(\A\w+</tu>|<tu>\w+</tu>|<tu>\w+)", 3)
    _ArrayDisplay($aArray)

It can probably be improved.

As for the second question, I would simply concatenate three elements at a time, and create a new array.

Edit: Actually I think you can use a repeat (at most) 3 times in the regexp, but it's confusing me a little, since the start and the end of the string are different in the example you gave.
leuce Posted November 22, 2011 (Author)

Quote:

    #include <array.au3>
    $i = "something<tu>something</tu><tu>something</tu><tu>something"
    $aNew = StringSplit(StringReplace($i, "><", ">" & @LF & "<"), @LF, 2)
    _ArrayDisplay($aNew)

Wow, what a novel approach! I'll remember that one.

Samuel
czardas Posted November 22, 2011 (edited)

If you had an opening tag at the start of the string and a closing tag at the end, it would make this a simple affair.

    #include <Array.au3>
    $i = "<tu>something</tu><tu>something</tu><tu>something</tu><tu>something</tu>" & _
            "<tu>something</tu><tu>something</tu><tu>something</tu><tu>something</tu>"
    $aArray = StringRegExp($i, "((<tu>\w+</tu>){1,3})", 2)
    _ArrayDisplay($aArray)

Edit: Minor error. Changed the repeat in the regexp from {0,3} to {1,3}.

This only works with word characters.
leuce Posted November 22, 2011 (Author)

Quote:

    #include <Array.au3>
    $i = "something</tu><tu>something</tu><tu>something</tu><tu>something"
    $aArray = StringRegExp($i, "(\A\w+</tu>|<tu>\w+</tu>|<tu>\w+)", 3)
    _ArrayDisplay($aArray)

This is fantastic, thanks. It even works if there is junk between the closing and opening block-level tags (e.g. "</tu><!-- junk--><tu>"), as it strips away the junk, which would be ideal in my particular case.

Quote:

    ...it's confusing me a little, since the start and the end of the string are different in the example you gave.

Sorry about that -- in fact, now that I think about it, it is likely that the initial opening <tu> and final closing </tu> would be present in the original string anyway. Thanks... from this point onwards it should be easy to write the rest of the script.
czardas Posted November 22, 2011 (edited)

Quote:

    This is fantastic, thanks. It even works if there is junk between the closing and opening block-level tags (e.g. "</tu><!-- junk--><tu>"), as it strips away the junk, which would be ideal in my particular case.

    Sorry about that -- in fact, now that I think about it, it is likely that the initial opening <tu> and final closing </tu> would be present in the original string anyway. Thanks... from this point onwards it should be easy to write the rest of the script.

There are still a few things wrong with my second try, but if it gives you some ideas then I'm happy. Probably someone will come along with a fix for my broken code. You will need to double-check that it catches all the characters you want in every possible scenario. Good luck.
leuce Posted November 22, 2011 (Author)

Quote:

    #include <Array.au3>
    $i = "<tu>something</tu><tu>something</tu><tu>something</tu><tu>something</tu>" & _
            "<tu>something</tu><tu>something</tu><tu>something</tu><tu>something</tu>"
    $aArray = StringRegExp($i, "((<tu>\w+</tu>){1,3})", 2)
    _ArrayDisplay($aArray)

This looks like it's working, until you try it with other values:

    #include <Array.au3>
    $i = "<tu>a</tu><tu>b</tu><tu>c</tu><tu>d</tu>" & _
            "<tu>e</tu><tu>f</tu><tu>g</tu><tu>h</tu>"
    $aArray = StringRegExp($i, "((<tu>\w+</tu>){1,3})", 2)
    _ArrayDisplay($aArray)

What would be needed is this:

    <tu>a</tu><tu>b</tu><tu>c</tu>
    <tu>d</tu><tu>e</tu><tu>f</tu>
    <tu>g</tu><tu>h</tu>

...but what the above script gives me is this:

    <tu>a</tu><tu>b</tu><tu>c</tu>
    <tu>a</tu><tu>b</tu><tu>c</tu>
    <tu>c</tu>

Anyway, unfortunately I have realised that I would have to make it work with more than just word characters...
leuce Posted November 22, 2011 (Author)

Quote:

    There are still a few things wrong with my second try, but if it gives you some ideas then I'm happy. You will need to double check that it catches all the characters you want in all possible case scenarios.

Well, for your first example, this seems to work just nicely:

    $aArray = StringRegExp($i, "(<tu>.+?</tu>)", 3)

As for your second example, this *almost* works:

    $aArray = StringRegExp($i, "((<tu>.+?</tu>){0,3})", 3)

...all I have to do is ignore the array items 1, 3, 5, 7 etc. :-)

I'll probably use the first option and simply concatenate them into threes, because that is simpler (for me, as I'm a simple kind of person).
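A minimal sketch of the "concatenate them into threes" option, assuming every block is wrapped in a complete <tu>...</tu> pair. `_GroupBlocks` and its parameter names are hypothetical, not something from the thread or from AutoIt's standard UDFs. Because the pattern has no capturing group, flag 3 should return one whole block per array element, so there are no extra items to skip. The `(?s)` prefix is an added assumption so that a block may span several lines.

```autoit
#include <Array.au3>

; Hypothetical helper (not from the thread): join every $iPerGroup blocks into one string.
Func _GroupBlocks($sXml, $iPerGroup = 3)
    ; Flag 3 = return all matches. No capturing group, so each element is one whole block.
    ; (?s) lets the dot match line breaks inside a block (an assumption about the input).
    Local $aBlocks = StringRegExp($sXml, "(?s)<tu>.+?</tu>", 3)
    If @error Then Return SetError(1, 0, 0) ; no blocks found

    ; One output element per group; the last group holds the remainder.
    Local $aGroups[Ceiling(UBound($aBlocks) / $iPerGroup)]
    For $n = 0 To UBound($aBlocks) - 1
        $aGroups[Int($n / $iPerGroup)] &= $aBlocks[$n]
    Next
    Return $aGroups
EndFunc

Local $j = "<tu>a</tu><tu>b</tu><tu>c</tu><tu>d</tu>" & _
        "<tu>e</tu><tu>f</tu><tu>g</tu><tu>h</tu>"
_ArrayDisplay(_GroupBlocks($j, 3))
```

With the eight-block sample this should display three elements: blocks a to c, blocks d to f, and the remainder g to h. Anything between a closing and an opening tag is dropped, which matches the "junk" behaviour described earlier in the thread.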
czardas Posted November 22, 2011 (edited)

My second example was completely broken. Here is an improvement on my first attempt. This should grab any character between the tags.

    #include <Array.au3>
    $i = "something</tu><tu>something else</tu><tu>more stuff</tu><tu>something"
    $aArray = StringRegExp($i, "(\A[^<]+</tu>|<tu>[^<]+</tu>|<tu>[^<]+)", 3)
    _ArrayDisplay($aArray)

Or with tags included at both the start and the end of the string, which is probably more useful to you.

    #include <Array.au3>
    $i = "<tu>something</tu><tu>something else</tu><tu>more stuff</tu><tu>something</tu>"
    $aArray = StringRegExp($i, "(<tu>[^<]+</tu>)", 3)
    _ArrayDisplay($aArray)

For some reason I'm struggling to get the repeat syntax to work. I still find using regexp quite tricky. Perhaps it will come to me later.
Malkey Posted November 22, 2011

These two RegExp patterns appear to work, capturing all characters except "<". They also capture the preceding and trailing characters shown in the example of post #1.

    #include <Array.au3>

    Local $i = "a</tu><tu>b</tu><tu>c</tu><tu>d</tu><tu>e</tu><tu>f</tu>" & _
            "<tu>g</tu><tu>h</tu><tu>i</tu><tu>j</tu><tu>k"
    Local $aArray = StringRegExp($i, "((?:<?t?u?>?[^<]+<?/?t?u?>?){1,3})", 3)
    ; or
    ;Local $aArray = StringRegExp($i, "((?:(?:<tu>)?[^<]+(?:</tu>)?){1,3})", 3)
    _ArrayDisplay($aArray)
czardas Posted November 22, 2011 (edited)

Quote:

    #include <Array.au3>

    Local $i = "a</tu><tu>b</tu><tu>c</tu><tu>d</tu><tu>e</tu><tu>f</tu>" & _
            "<tu>g</tu><tu>h</tu><tu>i</tu><tu>j</tu><tu>k"
    Local $aArray = StringRegExp($i, "((?:<?t?u?>?[^<]+<?/?t?u?>?){1,3})", 3)
    ; or
    ;Local $aArray = StringRegExp($i, "((?:(?:<tu>)?[^<]+(?:</tu>)?){1,3})", 3)
    _ArrayDisplay($aArray)

That's neat. I can see where I was going wrong.
UEZ Posted November 22, 2011 (edited)

Here is a non-RegEx version for your 2nd question.

    #include <Array.au3>

    Global $xml = "a</tu><tu>b</tu><tu>c</tu><tu>d</tu><tu>e</tu><tu>f</tu><tu>g</tu><tu>h</tu><tu>i</tu><tu>j</tu><tu>k"
    Global $array = Group_XML($xml)
    _ArrayDisplay($array)

    Func Group_XML($xml, $groups = 3, $delimiter = "><")
        ;coded by UEZ
        StringReplace($xml, $delimiter, $delimiter) ; only used to count the delimiters (count is in @extended)
        Local $amount = @extended
        Local $b = 1 ; start of the current group
        Local $e = 0 ; position of the delimiter that ends the current group
        Local $strings, $j, $aNew
        For $j = $groups To $amount Step $groups
            $e = StringInStr($xml, $delimiter, 0, $j) ; position of the $j-th delimiter
            $strings &= StringMid($xml, $b, $e - $b + 1) & @LF ; group up to and including the ">"
            $b = $e + 1
        Next
        ; append the remainder blocks, if any
        If $b < StringLen($xml) Then $strings &= StringMid($xml, $b, StringLen($xml))
        $aNew = StringSplit($strings, @LF, 2) ; flag 2: no count in element 0
        Return $aNew
    EndFunc

Br,
UEZ