Posted

I have a given string like "This is a test string".
I have a given offset like 15.
I need to insert something after the offset with regexp.

$string = "This is a test string"
$offset = 15
$stringtoinsert = "|inserted|"

$regexpattern = "^(.{" & $offset & "})(.*)$"
$regexreplace = "$1" & $stringtoinsert & "$2"
$outputstring = StringRegExpReplace($string, $regexpattern, $regexreplace)
ConsoleWrite($outputstring & @CRLF)

;output:This is a test |inserted|string

The problem occurs when the string contains special Unicode characters.

$string = "This is a 😅 👨‍💻 🇧🇷 🏳️‍🌈 string"
$offset = 31
$stringtoinsert = "|inserted|"

$regexpattern = "^(.{" & $offset & "})(.*)$"
$regexreplace = "$1" & $stringtoinsert & "$2"
$outputstring = StringRegExpReplace($string, $regexpattern, $regexreplace)
ConsoleWrite($outputstring & @CRLF)

According to this regex tester it should insert right before the last word but it doesn't.
If I change the offset to 24 it does what it should do with offset 31.

I guess it may be some encoding problem but I have no clue how to fix this.

Note that I know I could solve this problem without regex, but this is just a simplified example that shows the problem.
The real pattern is far more complex, and there is no easy way to achieve it with simple string manipulation.


Posted (edited)

AutoIt uses UCS2 encoding (a subset of UTF16 limited to the Unicode BMP) and your string is exactly 29 Unicode characters long. The offset of 24 is correct, because the regex engine used (PCRE) correctly parses full UTF16.

The string contains codepoints > 0xFFFF, which need two 16-bit encoding units each, and it isn't as "simple" as it looks:

https://r12a.github.io/uniview/?charlist="This is 😅 👨‍💻 🇧🇷 🏳️‍🌈 string"

Moral: beware of string functions when dealing with non-basic Unicode constructs. In this particular example, the embedded LF doesn't clearly show up; a sequence of 3 Unicode codepoints produces the glyph of a man behind a notebook (although none of the codepoints expresses that on its own, and the middle one is an invisible zero width joiner); the Brazilian flag is a pair of regional indicator symbols (B and R); and the rainbow flag is a white flag joined to a rainbow codepoint.

Edited by jchd

This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)

Posted

Well, assuming AutoIt's PCRE engine does work correctly, how come this PCRE tester does not behave the same way: https://regex101.com/r/UPvXfl/1 (and it is not just this one)?
One of them must be doing something wrong or different, and I need to know exactly what in order to fix it.
Also, as stated above, in my case the given offset for the example string is 31, so the code that generates this offset (an external program) counts the same way as regex101.com.
It is hard to believe that two independent programs would make the same error.
But now back to AutoIt.

StringLen($string) says the string is 37 characters long, just like regex101.com does.
StringTrimLeft($string, 31) removes everything before the word "string", just as expected.
In other words, AutoIt itself counts the same way too, but PCRE in AutoIt doesn't.

So regardless of which is correct, I'm still searching for a way to use PCRE with the given offset.
Can I use PCRE with UCS2 like AutoIt does? Or can I encode the string differently somehow?
Maybe run the regex on the BIN/HEX representation of the string to avoid encoding problems?

Any help would be much appreciated.


Posted

Once again, this particular string (as copied and pasted from your post) is only 29 codepoints long, nothing more, nothing less. The provided link shows precisely what's inside.

Every Unicode codepoint > 0xFFFF requires two 16-bit encoding units to represent. For programs using UCS2, that means two 16-bit characters.
So some programs or functions are fooled by codepoints > 0xFFFF and count two characters (2x 16-bit encoding units, as in UCS2) instead of one codepoint represented by 2x 16-bit units.
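
As a worked example of that arithmetic, here is a minimal sketch that splits a codepoint above 0xFFFF into its two 16-bit units. The function name _ToSurrogates is made up for illustration; BitShift, BitAND and Hex are standard AutoIt functions.

```autoit
; Hypothetical helper (not a built-in): returns the two UTF16 encoding units
; of a codepoint in the range 0x10000..0x10FFFF as a printable string.
Func _ToSurrogates($iCodePoint)
    Local $iValue = $iCodePoint - 0x10000           ; 20 significant bits remain
    Local $iHigh = 0xD800 + BitShift($iValue, 10)   ; top 10 bits (positive shift = shift right)
    Local $iLow = 0xDC00 + BitAND($iValue, 0x3FF)   ; bottom 10 bits
    Return "0x" & Hex($iHigh, 4) & "  0x" & Hex($iLow, 4)
EndFunc

ConsoleWrite(_ToSurrogates(0x1F605) & @CRLF) ; prints 0xD83D  0xDE05
```

The result agrees with the UTF16 encoding listed for 1F605 in the table in this post.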

Here's the hex representation of the string, along with codepoint offsets, encoding unit offsets and its UTF16 encoding:

Codepoint Name                                     Codepoint offset    Encoding unit offset      Hex UTF16
0054 LATIN CAPITAL LETTER T                                1                     1               0x0054
0068 LATIN SMALL LETTER H                                  2                     2               0x0068
0069 LATIN SMALL LETTER I                                  3                     3               0x0069
0073 LATIN SMALL LETTER S                                  4                     4               0x0073
0020 SPACE                                                 5                     5               0x0020
0069 LATIN SMALL LETTER I                                  6                     6               0x0069
0073 LATIN SMALL LETTER S                                  7                     7               0x0073
000A [control]                                             8                     8               0x000A
0020 SPACE                                                 9                     9               0x0020
1F605 SMILING FACE WITH OPEN MOUTH AND COLD SWEAT         10                    10               0xD83D  0xDE05
0020 SPACE                                                11                    12               0x0020
1F468 MAN                                                 12                    13               0xD83D  0xDC68
200D ZERO WIDTH JOINER                                    13                    15               0x200D
1F4BB PERSONAL COMPUTER                                   14                    16               0xD83D  0xDCBB
0020 SPACE                                                15                    18               0x0020
1F1E7 REGIONAL INDICATOR SYMBOL LETTER B                  16                    19               0xD83C  0xDDE7
1F1F7 REGIONAL INDICATOR SYMBOL LETTER R                  17                    21               0xD83C  0xDDF7
0020 SPACE                                                18                    23               0x0020
1F3F3 WAVING WHITE FLAG                                   19                    24               0xD83C  0xDFF3
FE0F VARIATION SELECTOR-16                                20                    26               0xFE0F
200D ZERO WIDTH JOINER                                    21                    27               0x200D
1F308 RAINBOW                                             22                    28               0xD83C  0xDF08
0020 SPACE                                                23                    30               0x0020
0073 LATIN SMALL LETTER S                                 24                    31               0x0073
0074 LATIN SMALL LETTER T                                 25                    32               0x0074
0072 LATIN SMALL LETTER R                                 26                    33               0x0072
0069 LATIN SMALL LETTER I                                 27                    34               0x0069
006E LATIN SMALL LETTER N                                 28                    35               0x006E
0067 LATIN SMALL LETTER G                                 29                    36               0x0067

The offset discrepancy (24 vs. 31) you observe comes from the fact that some programs/functions count 16-bit encoding units while others correctly recognize and support Unicode surrogates, hence count actual UTF16 codepoints.

So beware of codepoints > 0xFFFF and, as far as possible, keep away from using literal offsets blindly.

About the string length being returned as 37, I suspect that some invisible codepoints have escaped copy/paste somewhere, or that there is some obscure bug.

Posted

I just saw there is in fact a difference between your test string and the one I used.

"This is a 😅 👨‍💻 🇧🇷 🏳️‍🌈 string"
"This is 😅 👨‍💻 🇧🇷 🏳️‍🌈 string"

In my string there is
0020 SPACE
0061 LATIN SMALL LETTER A
0020 SPACE

And in your string there is
000A [control]
0020 SPACE

Your string is missing an "a" and a space but has an additional linefeed, which makes it 1 character shorter in total. Hence your string has a total length of 36 encoding units instead of 37, or 29 codepoints instead of 30.

The difference of 7 comes from the different ways of counting (codepoints versus encoding units).
So we can conclude that both values are correct, as your table states.

The problem I am trying to solve with AutoIt and RegExp requires me to count by encoding unit and not by codepoint.
AutoIt does count by encoding unit, but its RegExp implementation doesn't.

So far I have received detailed information about the encoding, but still no hint on how to solve the problem.

Here is my attempt to circumvent the problem:

$string = "This is a 😅 👨‍💻 🇧🇷 🏳️‍🌈 string"
$offset = 31
$stringtoinsert = "|inserted|"

$regexpattern = "^(.{" & $offset & "})(.*)$"
$regexreplace = "$1" & $stringtoinsert & "$2"
$outputstring = StringRegExpReplace($string, $regexpattern, $regexreplace)
ConsoleWrite($outputstring & @CRLF)
;ConsoleWrite(StringLen($string) & @CRLF) ; = 37

ConsoleWrite("-------------------------------------------------" & @CRLF)

; Work on a hex dump instead: UTF16 BE (flag 3) gives 4 hex digits per encoding unit
$stringHex = Hex(StringToBinary($string, 3))
$regexpattern = "^(.{" & $offset*4 & "})(.*)$"
$regexreplace = "${1}" & Hex(StringToBinary($stringtoinsert, 3)) & "${2}"
$outputstring = StringRegExpReplace($stringHex, $regexpattern, $regexreplace)
ConsoleWrite($outputstring & @CRLF)
; Convert the patched hex dump back into a string
ConsoleWrite(BinaryToString("0x" & $outputstring, 3) & @CRLF)

But I hope there is something better than this.
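
One possible alternative to the hex round trip is to translate the encoding unit offset into the codepoint count that "." consumes, and then keep the original pattern. The sketch below rests on two observations from this thread: StringLeft counts 16-bit encoding units (like StringLen and StringTrimLeft), while "." in StringRegExpReplace consumes whole codepoints. The helper name _UnitOffsetToCodePointOffset is made up for illustration, and the string literal assumes the script file is saved in a Unicode encoding so that it survives intact.

```autoit
; Hypothetical helper (not a built-in): counts the codepoints contained in
; the first $iUnitOffset encoding units of $sText.
Func _UnitOffsetToCodePointOffset($sText, $iUnitOffset)
    ; StringLeft cuts by 16-bit encoding units.
    Local $sPrefix = StringLeft($sText, $iUnitOffset)
    ; (?s) lets "." match linefeeds too; each match consumes one codepoint.
    StringRegExpReplace($sPrefix, "(?s).", "")
    Local $iCodePoints = @extended ; number of replacements performed
    Return $iCodePoints
EndFunc

Local $string = "This is a 😅 👨‍💻 🇧🇷 🏳️‍🌈 string"
Local $offset = 31 ; encoding unit offset, as supplied by the external program
Local $stringtoinsert = "|inserted|"

Local $cpoffset = _UnitOffsetToCodePointOffset($string, $offset)
Local $regexpattern = "(?s)^(.{" & $cpoffset & "})(.*)$"
Local $regexreplace = "${1}" & $stringtoinsert & "${2}"
ConsoleWrite(StringRegExpReplace($string, $regexpattern, $regexreplace) & @CRLF)
```

For the example string the helper should come out at 24, the value that was already found to work by hand. If the supplied offset falls between the two halves of a surrogate pair, the prefix ends in a lone surrogate and the count can no longer be trusted, so this is only a sketch for offsets that sit on codepoint boundaries.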

Posted

Sorry for the mangled copy/paste, I must have made an error.

Let me ask two questions:

1) Where does your literal offset come from, and why does it need to be precomputed or imposed like this?
2) Do you really need to handle Unicode codepoints beyond the BMP (> 0xFFFF) routinely?

I ask 1) because it's very unusual to have an offset imposed this way by some external source.

Your count-in-hex approach won't keep you away from issues: if the supplied offset falls in the middle of a 2x 16-bit codepoint, things are certain to go astray. Another situation where unexpected results will occur is when the "alien-dictated" offset falls inside a grapheme cluster. A grapheme cluster has a potentially unbounded length.
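
The first of those two failure cases can at least be detected before cutting. Here is a minimal sketch, assuming that AscW returns the raw 16-bit value when it is handed a single surrogate unit; the function name _OffsetSplitsSurrogatePair is made up for illustration.

```autoit
; Hypothetical helper (not a built-in): True when a cut after $iUnitOffset
; encoding units would land between the two halves of a surrogate pair.
Func _OffsetSplitsSurrogatePair($sText, $iUnitOffset)
    ; A cut at the very start or end of the string cannot split anything.
    If $iUnitOffset < 1 Or $iUnitOffset >= StringLen($sText) Then Return False
    Local $iBefore = AscW(StringMid($sText, $iUnitOffset, 1))     ; last unit before the cut
    Local $iAfter = AscW(StringMid($sText, $iUnitOffset + 1, 1))  ; first unit after the cut
    ; High surrogates are 0xD800..0xDBFF, low surrogates are 0xDC00..0xDFFF.
    Return ($iBefore >= 0xD800 And $iBefore <= 0xDBFF) And ($iAfter >= 0xDC00 And $iAfter <= 0xDFFF)
EndFunc
```

This only guards against splitting a single codepoint. It does nothing for an offset that lands inside a grapheme cluster, which needs the cluster boundaries themselves.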

Posted

Yes this would work but for the limitations of AutoIt Unicode:

😅 \x{D83D}\x{DE05}

💻 \x{D83D}\x{DCBB}

👨 \x{D83D}\x{DC68}

🇧🇷 \x{D83C}\x{DDE7}\x{D83C}\x{DDF7}

🏳️ \x{D83C}\x{DFF3}

🌈 \x{D83C}\x{DF08}

"(*UCP)^(?i)(?-s)This is a \x{D83D}\x{DE05}\s\x{D83D}\x{DC68}.\x{D83D}\x{DCBB}\s\x{D83C}\x{DDE7}\x{D83C}\x{DDF7}\s\x{D83C}\x{DFF3}..\x{D83C}\x{DF08} string"
              11       12      13    14  15     16       17      18     19   20      21      22     25    26   27    28         29      30   31

Indeed, you'd have to analyse everything to find out if there is going to be a \x2E ending a grapheme cluster, which won't be counted.

Posted

Just as a (related) aside, StringRegExp can deal with Unicode grapheme clusters. See the help about \X, which requires the (*UCP) modifier.
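
A minimal sketch of what that looks like, using a small test string built with ChrW so it does not depend on the script file's encoding. How emoji sequences joined by a zero width joiner are grouped depends on the Unicode version built into the PCRE release in use, so treat any count for the emoji string in this thread as something to verify rather than assume.

```autoit
; "e" followed by COMBINING ACUTE ACCENT (U+0301), then "x":
; 3 codepoints, but the first two should form a single grapheme cluster.
Local $sText = "e" & ChrW(0x0301) & "x"

; Flag 3 returns an array with every match of the pattern.
Local $aClusters = StringRegExp($sText, "(*UCP)\X", 3)
If IsArray($aClusters) Then
    ConsoleWrite(UBound($aClusters) & " grapheme clusters, " & StringLen($sText) & " encoding units" & @CRLF)
EndIf
```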
