leuce

How to do StringSplit, but include delimiter in split parts

13 posts in this topic

G'day everyone

Background: I'm trying to write a file splitter for a specific type of XML file. The format of the XML is very simple, so I don't need to take things like a schema or DTD into account.

My question: I have a simple and a complex question, but I'll start simple :-)

1. How can I split the file into parts so that the delimiter is included in the split parts (without simply adding the delimiter manually)?

In other words, if I have this as the file:

$i = something</tu><tu>something</tu><tu>something</tu><tu>something

If I use StringSplit ($i, "</tu><tu>", 1) then I get this in the array:

something

something

something

something

But what I want to get in the array, is this:

something</tu>

<tu>something</tu>

<tu>something</tu>

<tu>something

Is there a simple, easy way of doing that?

2. Once I know the above, then the idea is to use regex to specify a string split based on multiples of blocks. In other words, if I write a file splitter, it would split the file not by every block but by e.g. every three blocks. To illustrate:

$j = a</tu><tu>b</tu><tu>c</tu><tu>d</tu><tu>e</tu><tu>f</tu><tu>g</tu><tu>h</tu><tu>i</tu><tu>j</tu><tu>k

If the number of blocks per split is 3, then I want it to be split like this:

a</tu><tu>b</tu><tu>c</tu> (3 blocks)

<tu>d</tu><tu>e</tu><tu>f</tu> (3 blocks)

<tu>g</tu><tu>h</tu><tu>i</tu> (3 blocks)

<tu>j</tu><tu>k (the remainder blocks)

Though I guess that that would involve an insanely complex regular expression (to be figured out later).

Any thoughts? Or am I doing this all wrong?

Thanks

Samuel




#2 ·  Posted (edited)

Try this:

#include <array.au3>
$i = "something</tu><tu>something</tu><tu>something</tu><tu>something"
$aNew = StringSplit(StringReplace($i, "><", ">" & @LF & "<"), @LF, 2)
_ArrayDisplay($aNew)

Br,

UEZ

Edited by UEZ


#3 ·  Posted (edited)

Here's my attempt.

#include <Array.au3>
$i = "something</tu><tu>something</tu><tu>something</tu><tu>something"

$aArray = StringRegExp($i,"(\A\w+</tu>|<tu>\w+</tu>|<tu>\w+)", 3)

_ArrayDisplay($aArray)

It can probably be improved.

As for the second question, I would simply concatenate three elements at a time, and create a new array.

Edit

Actually I think you can use a repeat (at most) 3 times in the regexp, but it's confusing me a little, since the start and the end of the string are different in the example you gave..

Edited by czardas


#5 ·  Posted (edited)

If you had an opening tag at the start of the string and a closing tag at the end, it would make this a simple affair.

#include <Array.au3>
$i = "<tu>something</tu><tu>something</tu><tu>something</tu><tu>something</tu>" & _
"<tu>something</tu><tu>something</tu><tu>something</tu><tu>something</tu>"

$aArray = StringRegExp($i,"((<tu>\w+</tu>){1,3})", 2)
_ArrayDisplay($aArray)

Edit

Minor error. Changed the repeat in the regexp from {0,3} to {1,3}.

This only works with word characters. :D

Edited by czardas


#include <Array.au3>
$i = "something</tu><tu>something</tu><tu>something</tu><tu>something"
$aArray = StringRegExp($i,"(\A\w+</tu>|<tu>\w+</tu>|<tu>\w+)", 3)
_ArrayDisplay($aArray)
This is fantastic, thanks. It even works if there is junk between the closing and opening block-level tags (e.g. "</tu><!-- junk--><tu>", as it strips away the junk, which would be ideal in my particular case).

...it's confusing me a little, since the start and the end of the string are different in the example you gave.

Sorry about that -- in fact, now that I think about it, it is likely that the initial opening <tu> and final closing </tu> would be present in the original string anyway. Thanks... from this point onwards it should be easy to write the rest of the script.


#7 ·  Posted (edited)

There are still a few things wrong with my second try, but if it gives you some ideas then I'm happy. Probably someone will come along with a fix for my broken code. You will need to double check that it catches all the characters you want in every possible scenario. Good luck.

Edited by czardas


#include <Array.au3>
$i = "<tu>something</tu><tu>something</tu><tu>something</tu><tu>something</tu>" & _
"<tu>something</tu><tu>something</tu><tu>something</tu><tu>something</tu>"
$aArray = StringRegExp($i,"((<tu>\w+</tu>){1,3})", 2)
_ArrayDisplay($aArray)
This looks like it's working, until you try it with other values:

#include <Array.au3>
$i = "<tu>a</tu><tu>b</tu><tu>c</tu><tu>d</tu>" & _
"<tu>e</tu><tu>f</tu><tu>g</tu><tu>h</tu>"
$aArray = StringRegExp($i,"((<tu>\w+</tu>){1,3})", 2)
_ArrayDisplay($aArray)

What I need is this:

<tu>a</tu><tu>b</tu><tu>c</tu>

<tu>d</tu><tu>e</tu><tu>f</tu>

<tu>g</tu><tu>h</tu>

...but what the above script gives me, is this:

<tu>a</tu><tu>b</tu><tu>c</tu>

<tu>a</tu><tu>b</tu><tu>c</tu>

<tu>c</tu>

Anyway, unfortunately I have realised that I would have to make it work with more than just word characters...
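
A note on the output above: flag 2 makes StringRegExp return only the first match, as the full match followed by each capturing group. That would explain the three rows (the whole match, the outer group, and the last repetition of the inner group). A minimal sketch of one possible fix, assuming a block never contains a "<" of its own (no nested tags): make the repeated group non-capturing and use flag 3 (global match), so that each array element is one complete match.

```autoit
#include <Array.au3>

Local $i = "<tu>a</tu><tu>b</tu><tu>c</tu><tu>d</tu>" & _
        "<tu>e</tu><tu>f</tu><tu>g</tu><tu>h</tu>"

; (?:...) groups without capturing, so flag 3 returns one element per whole match.
; [^<]+ accepts any character except "<" (assumption: no nested tags inside a block).
Local $aArray = StringRegExp($i, "(?:<tu>[^<]+</tu>){1,3}", 3)

If @error Then
    ; StringRegExp sets @error when nothing matched
    ConsoleWrite("No <tu> blocks found" & @CRLF)
Else
    ; Three elements: blocks a-c, blocks d-f, and the remainder g-h
    _ArrayDisplay($aArray)
EndIf
```

The {1,3} quantifier is greedy, so each match takes three blocks where it can and the last match takes whatever is left. Real files with nested elements inside <tu> would need a different inner pattern.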


Well, for your first example, this seems to work just nicely:

$aArray = StringRegExp($i,"(<tu>.+?</tu>)", 3)

As for your second example, this *almost* works:

$aArray = StringRegExp($i,"((<tu>.+?</tu>){0,3})", 3)

...all I have to do is ignore the array items 1, 3, 5, 7 etc :-)

I'll probably use the first option and simply concatenate them into threes, because that is simpler (for me, as I'm a simple kind of person).
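
The "concatenate them into threes" idea could look roughly like this. It is only a sketch: _GroupBlocks is a made-up helper name (not a built-in or UDF function), and it assumes the string starts with <tu> and ends with </tu>, as discussed above.

```autoit
#include <Array.au3>

Local $i = "<tu>a</tu><tu>b</tu><tu>c</tu><tu>d</tu>" & _
        "<tu>e</tu><tu>f</tu><tu>g</tu><tu>h</tu>"

; Step 1: one array element per <tu>...</tu> block (flag 3 = global match)
Local $aBlocks = StringRegExp($i, "(<tu>.+?</tu>)", 3)
Local $aThrees

If @error Then
    ConsoleWrite("No <tu> blocks found" & @CRLF)
Else
    ; Step 2: join the blocks three at a time
    $aThrees = _GroupBlocks($aBlocks, 3)
    ; Three elements: blocks a-c, blocks d-f, and the remainder g-h
    _ArrayDisplay($aThrees)
EndIf

; Hypothetical helper: joins every $iSize consecutive elements of $aBlocks
; into one string and returns the joined strings as a new array.
; Expects a non-empty array.
Func _GroupBlocks(Const ByRef $aBlocks, $iSize = 3)
    Local $iCount = UBound($aBlocks)
    Local $aGroups[Ceiling($iCount / $iSize)] ; elements start out as empty strings
    For $n = 0 To $iCount - 1
        ; Int($n / $iSize) is the index of the group that block $n belongs to
        $aGroups[Int($n / $iSize)] &= $aBlocks[$n]
    Next
    Return $aGroups
EndFunc
```

Changing the second argument of _GroupBlocks changes the number of blocks per part, which keeps the regular expression itself simple.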


#10 ·  Posted (edited)

My second example was completely broken. :D

An improvement on my first attempt. This should grab any character between the tags.

#include <Array.au3>

$i = "something</tu><tu>something else</tu><tu>more stuff</tu><tu>something"

$aArray = StringRegExp($i,"(\A[^<]+</tu>|<tu>[^<]+</tu>|<tu>[^<]+)", 3)
_ArrayDisplay($aArray)

Or with tags included at both the start and the end of the string, which is probably more useful to you.

#include <Array.au3>

$i = "<tu>something</tu><tu>something else</tu><tu>more stuff</tu><tu>something</tu>"

$aArray = StringRegExp($i,"(<tu>[^<]+</tu>)", 3)
_ArrayDisplay($aArray)

For some reason I'm struggling to get the repeat syntax to work. I still find using regexp quite tricky. Perhaps it will come to me later.

Edited by czardas


These two RegExp patterns appear to work, capturing all characters except "<". They also capture the leading and trailing fragments shown in the example in post #1.

#include <Array.au3>

Local $i = "a</tu><tu>b</tu><tu>c</tu><tu>d</tu><tu>e</tu><tu>f</tu>" & _
        "<tu>g</tu><tu>h</tu><tu>i</tu><tu>j</tu><tu>k"

Local $aArray = StringRegExp($i, "((?:<?t?u?>?[^<]+<?/?t?u?>?){1,3})", 3)
; or
;Local $aArray = StringRegExp($i, "((?:(?:<tu>)?[^<]+(?:</tu>)?){1,3})", 3)

_ArrayDisplay($aArray)


#12 ·  Posted (edited)

That's neat. I can see where I was going wrong. :D

Edited by czardas


#13 ·  Posted (edited)

Here is a non-RegEx version for your second question.

#include <Array.au3>

Global $xml = "a</tu><tu>b</tu><tu>c</tu><tu>d</tu><tu>e</tu><tu>f</tu><tu>g</tu><tu>h</tu><tu>i</tu><tu>j</tu><tu>k"

Global $array = Group_XML($xml)

_ArrayDisplay($array)

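Func Group_XML($xml, $groups = 3, $delimiter = "><") ;coded by UEZ
    ; Replacing the delimiter with itself changes nothing, but sets @extended to the number of occurrences
    StringReplace($xml, $delimiter, $delimiter)
    Local $amount = @extended
    Local $b = 1 ; start position of the current group
    Local $e = 0 ; position of the delimiter that ends the current group
    Local $strings = "", $j, $aNew
    For $j = $groups To $amount Step $groups
        $e = StringInStr($xml, $delimiter, 0, $j) ; position of the $j-th delimiter
        $strings &= StringMid($xml, $b, $e - $b + 1) & @LF ; keep the closing ">" with this group
        $b = $e + 1 ; the next group starts at the "<"
    Next
    ; Append whatever is left after the last complete group
    If $b < StringLen($xml) Then $strings &= StringMid($xml, $b, StringLen($xml))
    $aNew = StringSplit($strings, @LF, 2) ; flag 2: no count in element 0
    Return $aNew
EndFunc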

Br,

UEZ

Edited by UEZ
