StringRegExp(Replace) counts Unicode characters incorrectly

unipark · July 24, 2018

I have a given string like "This is a test string".
I have a given offset like 15.
I need to insert something after the offset with regexp.

$string = "This is a test string"
$offset = 15
$stringtoinsert = "|inserted|"

$regexpattern = "^(.{" & $offset & "})(.*)$"
$regexreplace = "$1" & $stringtoinsert & "$2"
$outputstring = StringRegExpReplace($string, $regexpattern, $regexreplace)
ConsoleWrite($outputstring & @CRLF)

;output:This is a test |inserted|string

The problem occurs when the string contains special Unicode characters.

$string = "This is a 😅 👨‍💻 🇧🇷 🏳️‍🌈 string"
$offset = 31
$stringtoinsert = "|inserted|"

$regexpattern = "^(.{" & $offset & "})(.*)$"
$regexreplace = "$1" & $stringtoinsert & "$2"
$outputstring = StringRegExpReplace($string, $regexpattern, $regexreplace)
ConsoleWrite($outputstring & @CRLF)

According to this regex tester, it should insert right before the last word, but it doesn't.
If I change the offset to 24, it does what I expected from offset 31.

I guess it may be some encoding problem, but I have no clue how to fix it.

Note that I know I could solve this problem without regex, but this is just a simplified example that shows the problem.
The real pattern is far more complex, and there is no easy way to achieve it with simple string manipulation.

jchd · July 24, 2018

AutoIt uses UCS2 (a subset of UTF16 limited to the Unicode BMP) encoding and your string is exactly 29 Unicode characters long. The offset of 24 is correct as the regex engine used (PCRE) correctly parses full UTF16.

The string contains codepoints > 0xFFFF, which need two 16-bit encoding units each, and it isn't as "simple" as it looks:

https://r12a.github.io/uniview/?charlist="This is 😅 👨‍💻 🇧🇷 🏳️‍🌈 string"

Moral: beware of string functions when dealing with non-basic Unicode constructs. In this particular example, the embedded LF doesn't clearly show up. The combined sequence of 3 Unicode codepoints (MAN, ZERO WIDTH JOINER, PERSONAL COMPUTER) produces the glyph of a man behind a notebook, although the individual codepoints don't express that. The same goes for the Brazilian flag, which is a pair of regional indicator symbols (B and R), and for the rainbow flag, which is a white flag joined to the rainbow codepoint.

Edited July 24, 2018 by jchd

unipark · July 25, 2018

Well, assuming AutoIt's PCRE engine does work correctly, how come this PCRE tester does not do the same: https://regex101.com/r/UPvXfl/1 (and it's not just this one)?
One of them must be doing something wrong or different, and I need to know what exactly in order to fix it.
Also, as stated above, in my case the offset is given as 31 for the example string, so the code that generates this offset (an external program) counts the same way regex101.com does.
It's hard to believe that two independent programs would make the same error.
But now back to AutoIt.
StringLen($string) says the string is 37 characters long, just like regex101.com does.
StringTrimLeft($string, 31) removes everything before the word "string", just as expected.
In other words, AutoIt itself counts the same way too, but PCRE in AutoIt doesn't.

So regardless of which is correct, I'm obviously still searching for a way to use PCRE with the given offset.
Can I use PCRE with UCS2 like AutoIt does? Or can I encode the string somehow differently?
Maybe regex the BIN/HEX representation of the string to avoid encoding problems?

Any help would be much appreciated.

jchd · July 25, 2018

Once again this particular string (as copy/pasted from your post) is only 29 codepoints long, nothing more nothing less. The provided link precisely shows what's inside.

Every Unicode codepoint > 0xFFFF requires two 16-bit encoding units to represent. For programs using UCS2 that means two 16-bit characters.
So some programs or functions are fooled by codepoints > 0xFFFF and count two characters (2x 16-bit encoding units, as in UCS2) instead of one codepoint represented by 2x 16-bit units.

Here's the hex representation of the string, along with codepoint offsets, encoding unit offsets and its UTF16 encoding:

Codepoint Name                                     Codepoint offset    Encoding unit offset      Hex UTF16
 ‎0054 LATIN CAPITAL LETTER T                              1                     1                0x0054
 ‎0068 LATIN SMALL LETTER H                                2                     2                0x0068
 ‎0069 LATIN SMALL LETTER I                                3                     3                0x0069
 ‎0073 LATIN SMALL LETTER S                                4                     4                0x0073
 ‎0020 SPACE                                               5                     5                0x0020
 ‎0069 LATIN SMALL LETTER I                                6                     6                0x0069
 ‎0073 LATIN SMALL LETTER S                                7                     7                0x0073
 ‎000A [control]                                           8                     8                0x000A
 ‎0020 SPACE                                               9                     9                0x0020
 ‎1F605 SMILING FACE WITH OPEN MOUTH AND COLD SWEAT       10                    10                0xD83D  0xDE05
 ‎0020 SPACE                                              11                    12                0x0020
 ‎1F468 MAN                                               12                    13                0xD83D  0xDC68
 ‎200D ZERO WIDTH JOINER                                  13                    15                0x200D
 ‎1F4BB PERSONAL COMPUTER                                 14                    16                0xD83D  0xDCBB
 ‎0020 SPACE                                              15                    18                0x0020
 ‎1F1E7 REGIONAL INDICATOR SYMBOL LETTER B                16                    19                0xD83C  0xDDE7
 ‎1F1F7 REGIONAL INDICATOR SYMBOL LETTER R                17                    21                0xD83C  0xDDF7
 ‎0020 SPACE                                              18                    23                0x0020
 ‎1F3F3 WAVING WHITE FLAG                                 19                    24                0xD83C  0xDFF3
 ‎FE0F VARIATION SELECTOR-16                              20                    26                0xFE0F
 ‎200D ZERO WIDTH JOINER                                  21                    27                0x200D
 ‎1F308 RAINBOW                                           22                    28                0xD83C  0xDF08
 ‎0020 SPACE                                              23                    30                0x0020
 ‎0073 LATIN SMALL LETTER S                               24                    31                0x0073
 ‎0074 LATIN SMALL LETTER T                               25                    32                0x0074
 ‎0072 LATIN SMALL LETTER R                               26                    33                0x0072
 ‎0069 LATIN SMALL LETTER I                               27                    34                0x0069
 ‎006E LATIN SMALL LETTER N                               28                    35                0x006E
 ‎0067 LATIN SMALL LETTER G                               29                    36                0x0067

The offset discrepancy (24 vs. 31) you observe comes from the fact that some programs/functions count 16-bit encoding units while others correctly recognize and support Unicode surrogates, hence count actual UTF16 codepoints.

So beware of codepoints > 0xFFFF and, as far as possible, keep away from using literal offsets blindly.
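To make the two ways of counting concrete, here is a minimal sketch. It assumes AscW() returns the value of the first 16-bit unit of its argument, and _CountCodepoints is a hypothetical helper name, not a built-in. It is an illustration under those assumptions, not a tested library routine.

```autoit
; Sketch: count 16-bit encoding units vs. codepoints in an AutoIt string.
; _CountCodepoints is a hypothetical helper name, not a built-in function.
Func _CountCodepoints($sText)
    Local $iCount = 0, $iUnit
    For $i = 1 To StringLen($sText)
        $iUnit = AscW(StringMid($sText, $i, 1))
        ; A low (trailing) surrogate only completes a codepoint that was
        ; already counted at its high surrogate, so it is skipped.
        If $iUnit < 0xDC00 Or $iUnit > 0xDFFF Then $iCount += 1
    Next
    Return $iCount
EndFunc

; U+1F605 is built from its surrogate pair, so the encoding of the script file does not matter.
Local $sSample = "a" & ChrW(0xD83D) & ChrW(0xDE05) & "b"
ConsoleWrite("Encoding units: " & StringLen($sSample) & @CRLF)
ConsoleWrite("Codepoints:     " & _CountCodepoints($sSample) & @CRLF)
```

The sample holds one surrogate pair, so the two counts should differ by exactly one. In the string from the first post there are seven such pairs.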

About the string length being returned as 37, I suspect that some invisible codepoint has escaped copy/paste somewhere, or there is some obscure bug.

unipark · July 27, 2018

I just saw there is in fact a difference between your test string and the one I used.

"This is a 😅 👨‍💻 🇧🇷 🏳️‍🌈 string"
"This is 😅 👨‍💻 🇧🇷 🏳️‍🌈 string"

In my string there is
0020 SPACE
0061 LATIN SMALL LETTER A
0020 SPACE

And in your string there is
000A [control]
0020 SPACE

There is a missing "a" and a missing space but an additional linefeed, which in total gives a string that is 1 character shorter. Hence your string has a total length of 36 encoding units, not 37, and 29 codepoints, not 30.

The difference of 7 comes from the different ways to count (counting codepoints or encoding units).
So we can conclude both values are correct, as your table states.

The problem I am trying to solve with AutoIt and RegExp requires me to count by encoding unit and not by codepoint.
AutoIt does count by encoding unit, but its RegExp implementation doesn't.

So far I got detailed information about the encoding but still no hint on how to solve the problem.

Here is my attempt to circumvent the problem:

$string = "This is a 😅 👨‍💻 🇧🇷 🏳️‍🌈 string"
$offset = 31
$stringtoinsert = "|inserted|"

$regexpattern = "^(.{" & $offset & "})(.*)$"
$regexreplace = "$1" & $stringtoinsert & "$2"
$outputstring = StringRegExpReplace($string, $regexpattern, $regexreplace)
ConsoleWrite($outputstring & @CRLF)
;ConsoleWrite(StringLen($string) & @CRLF) ; = 37

ConsoleWrite("-------------------------------------------------" & @CRLF)

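; Work on the hex dump of the UTF-16 big-endian bytes (flag 3): every 16-bit unit is exactly 4 hex digits
$stringHex = Hex(StringToBinary($string, 3))
$regexpattern = "^(.{" & $offset*4 & "})(.*)$"
; ${1} instead of $1, so the hex digits that follow are not read as part of the group number
$regexreplace = "${1}" & Hex(StringToBinary($stringtoinsert, 3)) & "${2}"
$outputstring = StringRegExpReplace($stringHex, $regexpattern, $regexreplace)
ConsoleWrite($outputstring & @CRLF)
; Convert the patched hex dump back into a string
ConsoleWrite(BinaryToString("0x" & $outputstring, 3) & @CRLF)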

But I hope there is something better than this.

jchd · July 27, 2018

Sorry for the emasculated copy/paste, I must have made an error.

Let me ask two questions:

1) Where does your literal offset come from and why do you need to have it precomputed or forced thusly?
2) Are you really routinely in need of using Unicode codepoints beyond BMP (> 0xFFFF)?

I ask 1) because it's very unusual to have an offset forced there, coming from some external source.

Your count-in-hex approach won't keep you away from issues: if the supplied offset falls in the middle of a 2x 16-bit codepoint (a surrogate pair), things are certain to go astray. Another situation where unexpected results will occur is when the "alien-dictated" offset falls inside a grapheme cluster. A grapheme cluster has a potentially unbounded length.
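One possible way around the hex detour, sketched under assumptions: translate the externally supplied offset (counted in 16-bit units) into a codepoint offset before building the pattern, and refuse offsets that split a surrogate pair. _UnitsToCodepoints is a hypothetical helper name, and the sketch relies on "." consuming one codepoint per match, as observed earlier in this thread. It does nothing about offsets that land inside a grapheme cluster.

```autoit
; Sketch: convert an offset counted in 16-bit units into an offset counted in codepoints.
; _UnitsToCodepoints is a hypothetical helper name, not a built-in function.
Func _UnitsToCodepoints($sText, $iUnits)
    If $iUnits < 0 Or $iUnits > StringLen($sText) Then Return SetError(1, 0, -1)
    Local $iCodepoints = 0, $iUnit = 0
    For $i = 1 To $iUnits
        $iUnit = AscW(StringMid($sText, $i, 1))
        ; Count every unit except a low (trailing) surrogate.
        If $iUnit < 0xDC00 Or $iUnit > 0xDFFF Then $iCodepoints += 1
    Next
    ; If the last unit inside the offset is a high (leading) surrogate, the offset splits a pair.
    If $iUnits > 0 And $iUnit >= 0xD800 And $iUnit <= 0xDBFF Then Return SetError(2, 0, -1)
    Return $iCodepoints
EndFunc

; U+1F605 is built from its surrogate pair, so the encoding of the script file does not matter.
Local $sText = "x " & ChrW(0xD83D) & ChrW(0xDE05) & " y"
Local $iOffset = _UnitsToCodepoints($sText, 5) ; 5 units: "x ", the surrogate pair, one space
If @error Then
    ConsoleWrite("Offset is out of range or splits a surrogate pair" & @CRLF)
Else
    ; (?s) lets "." match line breaks as well, which the original pattern did not do.
    Local $sPattern = "(?s)^(.{" & $iOffset & "})(.*)$"
    ConsoleWrite(StringRegExpReplace($sText, $sPattern, "${1}|inserted|${2}") & @CRLF)
EndIf
```

With this conversion the rest of a more complex pattern can stay as it is; only the quantifier built from the offset changes.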

Jury · July 28, 2018

Yes this would work but for the limitations of AutoIt Unicode:

😅 \x{D83D}\x{DE05}

💻 \x{D83D}\x{DCBB}

👨 \x{D83D}\x{DC68}

🇧🇷 \x{D83C}\x{DDE7}\x{D83C}\x{DDF7}

🏳️ \x{D83C}\x{DFF3}

🌈 \x{D83C}\x{DF08}

"(*UCP)^(?i)(?-s)This is a \x{D83D}\x{DE05}\s\x{D83D}\x{DC68}.\x{D83D}\x{DCBB}\s\x{D83C}\x{DDE7}\x{D83C}\x{DDF7}\s\x{D83C}\x{DFF3}..\x{D83C}\x{DF08} string"
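Counted in 16-bit units, "This is a " covers offsets 1 to 10, the escapes and dots after it cover offsets 11 to 30, and the space before "string" is at offset 31.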

indeed you'd have to analyse everything to find out if there is going to be a \x2E ending a grapheme cluster which won't be counted.

jchd · July 28, 2018

Just as a (related) aside, StringRegExp can deal with Unicode grapheme clusters. See the help about \X, which requires the (*UCP) modifier.
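A small sketch of what that could look like, using a simple combining sequence rather than emoji. How the bundled PCRE version segments ZWJ emoji sequences is not something this sketch assumes, so check that case separately.

```autoit
; Sketch: split a string into grapheme clusters with \X.
; "e" followed by U+0301 (combining acute accent) is two codepoints
; that display as a single character.
Local $sText = "e" & ChrW(0x0301) & "a"
Local $aClusters = StringRegExp($sText, "(*UCP)\X", 3) ; flag 3: array of all matches
If @error Then
    ConsoleWrite("No match" & @CRLF)
Else
    ; All codepoints here are in the BMP, so StringLen() equals the codepoint count.
    ConsoleWrite("Codepoints: " & StringLen($sText) & @CRLF)
    ConsoleWrite("Clusters:   " & UBound($aClusters) & @CRLF)
EndIf
```

If \X behaves as the help describes, the cluster count should come out lower than the codepoint count, because the accent is grouped with the letter before it.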
