How to do StringSplit when delimiter is included in split parts

Pumbaa · September 8, 2012

Hi All!

I need to split a string like the one below into parts, using a comma as the delimiter:

text1,text2,text3{Join('Errors','Name','id',Error,'Group',Group)}text4,text5

As you can see, the 3rd field includes text delimited by { and } signs. Such constructions can appear in any field, more than once per string or even per field, and can also make up the whole content of a field.

A simple StringSplit gives me the wrong result. I've tried splitting on the { and } signs first and then, after analysing the result, splitting the parts that are not between {}, but that's a rather messy method.

I was thinking of replacing the "right" commas with, for example, @ so that I could use a regular StringSplit afterwards. Maybe the RegExp functions could be useful here, but I'm not familiar enough with them to solve my problem.

Any suggestions?

czardas · September 8, 2012

Recommendations.

1. I recommend you read the specs for the CSV format.

2. Use one of the CSV scripts in Example Scripts. Search the forum for

3. Don't use commas within fields, or use Chr(130) instead.

4. Use a different delimiter, such as a semicolon or TAB.

5. You may be able to create a regular expression if you can identify a pattern.

Malkey · September 8, 2012

Hope this helps.

#include <Array.au3>
Local $sString = "text1,text2,text3{Join('Errors','Name','id',Error,'Group',Group)}text4, text5"
; Create an array of each instance of the text between "{" and "}".
; The lazy quantifier (.*?) stops at the first "}" so separate {...} groups are not merged.
Local $aArray = StringRegExp($sString, "\{(.*?)\}", 3)
; Replace all occurrences of "{..text in between..}" with a comma.
Local $sNewString = StringRegExpReplace($sString, "\{.*?\}", ",")
If StringRight($sNewString, 1) = "," Then $sNewString = StringTrimRight($sNewString, 1) ; Delete the trailing comma if one exists.
Local $aArray2 = StringSplit($sNewString, ",", 2)
Local $iUbndA2 = UBound($aArray2)
; Join the arrays: append the bracketed texts after the plain fields.
ReDim $aArray2[$iUbndA2 + UBound($aArray)]
For $i = $iUbndA2 To UBound($aArray2) - 1
    $aArray2[$i] = $aArray[$i - $iUbndA2]
Next
_ArrayDisplay($aArray2)

Pumbaa · September 8, 2012

To czardas: Unfortunately I work with externally defined files and the string structures in them, so converting them to CSV format and similar recommendations are out of my reach. I wrote about possible patterns and my willingness to use RegEx, but I lack understanding of and experience with them. Thanks anyway. The CSV UDF could be useful in the future.

To Malkey: Thanks, I'll try to build on your suggestion.

Examples and suggestions are still welcome.

czardas · September 8, 2012

Perhaps this will help or give you some ideas.

#include <Array.au3>
Local $sString = "text0,{m,o,r,e},text1,text2,text3{Join('Errors','Name','id',Error,'Group',Group)}text4, text5"
Local $sReplacement = ",", $sTemp = $sString ; In case you need the original string later.

; Create an array of each "{...": an opening bracket and everything up to, but not including, the next "}".
Local $aArray = StringRegExp($sString, "{[^}]*", 3)
;_ArrayDisplay($aArray)

If IsArray($aArray) Then ; Added error check!
    For $i = 255 To 1 Step -1 ; Search for a replacement character.
        If Not StringInStr($sString,Chr($i)) Then ExitLoop
    Next
    If $i = 0 Then Exit ; In the most unlikely event that no suitable delimiter is found.

    $sReplacement = Chr($i) ; Already declared above, so just assign.
    For $i = 0 To UBound($aArray) -1 ; Replace the commas we wish to ignore
        $sTemp = StringReplace($sTemp, $aArray[$i], StringReplace($aArray[$i], ",", $sReplacement))
    Next
EndIf

$aArray = StringSplit($sTemp, ",", 2) ; Might as well use the same array name again.
For $i = 0 To UBound($aArray) -1 ; Put the removed commas back.
    $aArray[$i] = StringReplace($aArray[$i], $sReplacement, ",")
Next
_ArrayDisplay($aArray)

Edit

Added an error check to the code.

Edited September 8, 2012 by czardas

Pumbaa · September 8, 2012

Good. Using RegEx with "{[^}]*" finally gives me something like this:

#include <Array.au3>

Local $sString = "text0,{m,o,r,e},text1,text2,text3{Join('Errors','Name','id',Error,'Group',Group)}text4, text5"
Local $ExprArray = StringRegExp ($sString, "{[^}]*", 3) ; array of all {formulas} in string
Local $j = @error
$sString = StringRegExpReplace ($sString, "{[^}]*", "{") ; replace every {formula} in the string with the placeholder "{}"
Local $FinalArray = StringSplit ($sString, ",")
If $j = 0 Then ; if any {formula} existed, restore them
    For $i = 1 To $FinalArray [0] ; element 0 holds the count, so start at 1
        If StringInStr ($FinalArray [$i], "}") > 0 Then
            $FinalArray [$i] = StringReplace ($FinalArray [$i], "{}", $ExprArray [$j] & "}", 1)
            $j = $j + 1
        EndIf
    Next
EndIf

_ArrayDisplay ($FinalArray)

It seems to work with all possible variants.

The only thing left for me to understand is what "{[^}]*" really means.

RegEx rules

Thanks.

Edited September 8, 2012 by Pumbaa

czardas · September 8, 2012

Ha I introduced a bug when I made changes to the above code. It should be okay now.

The only thing left for me to understand is what "{[^}]*" really means.

This is easy to pick apart.

{ = match a literal opening curly bracket, followed by

[^}] = any single character that is not a closing curly bracket

* = repeated zero or more times

So each match starts at "{" and stops just before the next "}".
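
To see this in practice, here is a minimal sketch that runs the pattern over a shortened sample string (the sample is made up for illustration and is not one of the real input lines). With flag 3, StringRegExp returns every match in a 0-based array.

```autoit
#include <Array.au3>

; Hypothetical sample string, shortened from the ones used in this thread.
Local $sSample = "text0,{m,o,r,e},text1{Join('a','b')}text2"

; Flag 3 = return an array of all matches of the pattern.
Local $aMatches = StringRegExp($sSample, "{[^}]*", 3)

; @error is non-zero when nothing matched, so only display on success.
; The two elements should be  {m,o,r,e  and  {Join('a','b')
; (each match keeps the opening bracket and stops before the closing one).
If Not @error Then _ArrayDisplay($aMatches)
```

Note that the closing "}" is never part of a match, which is why the scripts in this thread put it back by hand.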

Edited September 8, 2012 by czardas

Pumbaa · September 8, 2012

Yes, I've got it. Tried some more combinations, but yours is the most useful. Thanks again.

I upgraded my script to handle several occurrences of {formula} in one field.

#include <Array.au3>

Local $sString = "text0,{m,o,r,e}{m,o,r,e},text1,text2,text3{Join('Errors','Name','id',Error,'Group',Group)}text4, text5"
Local $ExprArray = StringRegExp ($sString, "{[^}]*", 3) ; array of all {formulas} in string
Local $j = @error
$sString = StringRegExpReplace ($sString, "{[^}]*", "{") ; replace every {formula} in the string with the placeholder "{}"
Local $FinalArray = StringSplit ($sString, ",")
Local $NumOfExpr [1], $k
If $j = 0 Then ; if {formula} existed, then recover them
    For $i = 1 To $FinalArray [0]
        If StringInStr ($FinalArray [$i], "}") > 0 Then
            $NumOfExpr = StringSplit ($FinalArray [$i], "{}", 1) ; count the placeholders in case there is more than 1 {formula} in the field
            If $NumOfExpr [0] > 1 Then
                For $k = 1 To $NumOfExpr [0] - 1
                    $FinalArray [$i] = StringReplace ($FinalArray [$i], "{}", $ExprArray [$j] & "}", 1)
                    $j = $j + 1
                Next
            EndIf
        EndIf
    Next
EndIf

_ArrayDisplay ($FinalArray)

Edited September 8, 2012 by Pumbaa

czardas · September 8, 2012

I'm glad you found it useful.

Pumbaa · September 8, 2012

Well... after taking care of all the damn {formula} cases, my old code seems to be shorter and easier:

#include <Array.au3>

Local $sString = "text0,{m,o,r,e}{m,o,r,e},text1,text2,text3{Join('Errors','Name','id',Error,'Group',Group)}text4, text5"
Local $sStringWithNewDelimiter = ""
Local $WithFormulasArray = StringSplit ($sString, "{}")
For $i = 1 To $WithFormulasArray [0]
    If mod ($i, 2) <> 0 Then
        $WithFormulasArray [$i] = StringReplace ($WithFormulasArray [$i], ",", @TAB)
    Else
        $WithFormulasArray [$i] = "{" & $WithFormulasArray [$i] & "}"
    EndIf
    $sStringWithNewDelimiter = $sStringWithNewDelimiter & $WithFormulasArray [$i]
Next

Local $FinalArray = StringSplit ($sStringWithNewDelimiter, @TAB)
_ArrayDisplay ($FinalArray)

But that was still a good experience of using RegEx.

Edited September 8, 2012 by Pumbaa

Pumbaa · September 8, 2012

Ufff... I'm not satisfied.

I had imagined something like

#include  <Array.au3>

Local $sString = "text0,{m,o,r,e}{m,o,r,e},text1,text2,text3{Join('Errors','Name','id',Error,'Group',Group)}text4, text5"

$sString = StringRegExpReplace ($sString, "some tricky RegEx to define all commas that are not situated somewhere between {}", @TAB)

Local $FinalArray = StringSplit ($sString, @TAB)

_ArrayDisplay ($FinalArray)

czardas · September 8, 2012

Don't overcomplicate things. Providing the input follows a clear set of rules that can be used to extract the required information, then you should be able to parse it. Sometimes a single regular expression will do all (or most of) the job, but a few extra lines of code may be easier to write and understand. You also need to be clear exactly how you want the returned data to be formatted.

Edited September 8, 2012 by czardas

xeroTechnologiesLLC · September 8, 2012

I'm pretty new to programming and AutoIt in general, but to answer the topic of the thread without any super coding like that already provided: I usually change the delimiter symbol inside the cell to something entirely unused, like one of the extended Latin characters, then run StringSplit, then replace that symbol with the delimiter again.

If a cell contains ",", swap it to Œ, then run your string split. Run another replace to turn "Œ" back into ",".

This obviously doesn't work in 100% of the situations where you have to do this, but... it's fast and easy to do for us noob programmers.

Good luck and have fun.

dany · September 8, 2012

Indeed, don't overcomplicate things. You can do this without regular expressions.

#include <String.au3> ;  _StringBetween
#include <Array.au3> ;  _ArrayDisplay

Local $sFields = "text0,{m,o,r,e}{m,o,r,e},text1,text2,text3{Join('Errors','Name','id',Error,'Group',Group)}text4, text5"
Local $sBracket, $sTabs, $aFields
While 1
    $sBracket = _StringBetween($sFields, '{', '}')
    If 0 = $sBracket Then ExitLoop
    $sTabs = StringReplace($sBracket[0], ',', @TAB)
    ; Remove the brackets, or the loop will never exit.
    $sFields = StringReplace($sFields, '{' & $sBracket[0] & '}', '_A_' & $sTabs & '_Z_')
WEnd
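; Put brackets back in place.
$sFields = StringReplace($sFields, '_A_', '{')
$sFields = StringReplace($sFields, '_Z_', '}')
$aFields = StringSplit($sFields, ',')
; Turn the protected tabs inside the brackets back into commas.
For $i = 1 To $aFields[0]
    $aFields[$i] = StringReplace($aFields[$i], @TAB, ',')
Next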
_ArrayDisplay($aFields)

UEZ · September 8, 2012

Here is my version for this particular string:

#include <Array.au3>
#include <String.au3>

$s = "text1,text2,text3{Join('Errors','Name','id',Error,'Group',Group)}text4,text5"
$aRes = _StringBetween($s, "{", "}")
$aNew = StringSplit(StringReplace(StringReplace(StringReplace($s, $aRes[0], StringReplace($aRes[0], ",", "°^°")), ",", "|"), "°^°", ","), "|", 2)
_ArrayDisplay($aNew)

Or

#include <Array.au3>
#include <String.au3>

$s = "text1,text2,text3{Join('Errors','Name','id',Error,'Group',Group)}text4,text5"
$aRes = _StringBetween($s, "{", "}")
$aNew = StringSplit(StringReplace(StringReplace(StringReplace(StringReplace(StringReplace($s, $aRes[0], StringReplace($aRes[0], ",", "°^°")), ",", "|"), "°^°", ","), "{", "|"), "}", "|"), "|", 2)
_ArrayDisplay($aNew)

Br,

UEZ

Edited September 8, 2012 by UEZ

Pumbaa · September 9, 2012

Thanks guys. These are also possible ways to reach my goal. I'll put them in my "scripts" bank.

But the imagined single-RegEx solution still seems more efficient, judging by the total number of loops and calls to complex functions like Split, Between and Replace. On a large number of long strings it should be noticeably faster.

Maybe someday a RegEx genius will visit this topic and show us a master class... or explain that it's impossible, or that it will take much more CPU time than any other string operations.

dany · September 9, 2012

On a large number of long strings it should be noticeably faster.

Actually not always. It heavily depends on the complexity of the regular expression pattern and your ability to write efficient patterns.

The more complex the pattern, the slower the RegExp function will be. RegExp functions can be slower than ordinary String* functions because the engine tries the pattern at every position in the string and, whenever a partial match fails, backtracks and tries the remaining alternatives. With very long strings this can become a slow process.

An unoptimized pattern can have a severe speed impact as well. For instance, using groups ( ... ) extensively will severely slow down any RegExp if they aren't optimized. The pattern

\b(integer|insert|in)\b

is slower than

\b(?>integer|insert|in)\b

for the subject 'integers'. Neither matches, but the first RegExp takes more time to figure that out: after "integer" fails at the word boundary it backtracks into the group and tries the other alternatives, while the atomic group (?>...) gives up on that position at once. The reason is explained here: http://www.regular-expressions.info/atomic.html
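
A rough way to check this on your own machine is to time both patterns against the same subject. This is only a sketch: the iteration count of 100000 is an arbitrary assumption, the measured difference may be small, and timings will vary from run to run.

```autoit
; Compare an ordinary capturing group with an atomic group (?>...) on a subject that neither matches.
Local $sSubject = "integers"
Local $aPatterns[2] = ["\b(integer|insert|in)\b", "\b(?>integer|insert|in)\b"]
Local $hTimer

For $p = 0 To 1
    $hTimer = TimerInit()
    For $n = 1 To 100000 ; arbitrary repeat count, chosen only to get a measurable time
        StringRegExp($sSubject, $aPatterns[$p]) ; default flag 0: returns 1 on match, 0 on no match
    Next
    ConsoleWrite($aPatterns[$p] & " : " & Round(TimerDiff($hTimer), 1) & " ms" & @CRLF)
Next
```

Run it several times before drawing conclusions, since other processes affect the numbers.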

If you want to handle the test string you gave with only one regular expression, well, that's going to be a real beast if it has to take into account all the edge cases you've given. Therefore it will actually be slower than my method with String* functions.

For more info and insight on the inner workings of regular expressions I recommend http://www.regular-expressions.info/

edit: forum software screwed up the links...

edit 2: Well, it's Sunday and I've got nothing to do, so I had a stab at your test string. To my surprise my RegExp was actually faster than the String* functions. However, the general rule of thumb that String* functions are faster than RegExp* functions still stands. It just depends heavily on what you want to do. I once wrote a syntax highlighter in PHP and found str* faster than preg*. Anyway, here's what I got:

#include <String.au3> ;  _StringBetween
#include <Array.au3> ;  _ArrayDisplay

Local $sFields = "text0,{m,o,r,e}{m,o,r,e},text1,text2,text3{Join('Errors','Name','id',Error,'Group',Group)}text4, text5"
Local $sBracket, $sTabs, $aFields
Local $iStart = TimerInit()
While 1
    $sBracket = _StringBetween($sFields, '{', '}')
    If 0 = $sBracket Then ExitLoop
    $sTabs = StringReplace($sBracket[0], ',', @TAB)
    $sFields = StringReplace($sFields, '{' & $sBracket[0] & '}', '_A_' & $sTabs & '_Z_')
WEnd
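$sFields = StringReplace($sFields, '_A_', '{')
$sFields = StringReplace($sFields, '_Z_', '}')
$aFields = StringSplit($sFields, ',')
For $i = 1 To $aFields[0] ; Turn the protected tabs inside the brackets back into commas.
    $aFields[$i] = StringReplace($aFields[$i], @TAB, ',')
Next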
_ArrayDisplay($aFields, TimerDiff($iStart) / 1000)

$sFields = "text0,{m,o,r,e}{m,o,r,e},text1,text2,text3{Join('Errors','Name','id',Error,'Group',Group)}text4, text5"
Local $rPattern = '([a-z0-9]+{[^}]+}|{[^}]+}|[^{},]+)'
$iStart = TimerInit()
Local $aMatches = StringRegExp($sFields, $rPattern, 3)
_ArrayDisplay($aMatches, TimerDiff($iStart) / 1000)

Also note they produce different results.

Edited September 9, 2012 by dany

Pumbaa · September 10, 2012

Splendid work. I've tried to compare the timings with my own code, but got different results each time. It seems to depend on other processes running in Windows.

"text3{Join('Errors','Name','id',Error,'Group',Group)}text4" - text4 is a part of the pattern, but nevertheless that's what I was looking for. Thanks for your collaboration.

Edited September 10, 2012 by Pumbaa
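
For reference, the "tricky RegEx" imagined earlier in the thread (one that touches only the commas outside {}) can be sketched with a negative lookahead. This is a sketch under stated assumptions, not a tested answer from any of the posters: it assumes every "{" has a matching "}", that brackets are never nested, and that the text contains no TAB characters.

```autoit
#include <Array.au3>

Local $sString = "text0,{m,o,r,e}{m,o,r,e},text1,text2,text3{Join('Errors','Name','id',Error,'Group',Group)}text4, text5"

; ,(?![^{}]*\})  matches a comma that is NOT followed by a run of non-bracket characters and then "}".
; A comma inside {...} always reaches a "}" before any other bracket, so it is left alone.
; Assumption: brackets are balanced and not nested; TAB does not occur in the data.
$sString = StringRegExpReplace($sString, ",(?![^{}]*\})", @TAB)

; Only the "real" delimiters are TABs now, so a plain split is enough.
Local $aFinal = StringSplit($sString, @TAB)
_ArrayDisplay($aFinal)
```

For the test string this should give six fields, with "text3{Join('Errors','Name','id',Error,'Group',Group)}text4" kept together as one field and its inner commas untouched.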
