unipark Posted July 24, 2018

I have a given string like "This is a test string" and a given offset like 15. I need to insert something after the offset with a regexp.

    $string = "This is a test string"
    $offset = 15
    $stringtoinsert = "|inserted|"
    $regexpattern = "^(.{" & $offset & "})(.*)$"
    $regexreplace = "$1" & $stringtoinsert & "$2"
    $outputstring = StringRegExpReplace($string, $regexpattern, $regexreplace)
    ConsoleWrite($outputstring & @CRLF)
    ;output: This is a test |inserted|string

The problem occurs when the string contains special Unicode characters.

    $string = "This is a 😅 👨‍💻 🇧🇷 🏳️‍🌈 string"
    $offset = 31
    $stringtoinsert = "|inserted|"
    $regexpattern = "^(.{" & $offset & "})(.*)$"
    $regexreplace = "$1" & $stringtoinsert & "$2"
    $outputstring = StringRegExpReplace($string, $regexpattern, $regexreplace)
    ConsoleWrite($outputstring & @CRLF)

According to this regex tester it should insert right before the last word, but it doesn't. If I change the offset to 24, it does what it should do with offset 31. I guess it may be some encoding problem, but I have no clue how to fix it.

Note that I know I could solve this problem without regex, but this is just a simplified example that shows the problem. The real pattern is far more complex and there is no easy way to achieve it with simple string manipulation.

jchd Posted July 24, 2018 (edited)

AutoIt uses UCS-2 encoding (a subset of UTF-16 limited to the Unicode BMP) and your string is exactly 29 Unicode characters long. The offset of 24 is correct, as the regex engine used (PCRE) correctly parses full UTF-16.

The string contains codepoints > 0xFFFF, which need two 16-bit encoding units each, and it isn't as "simple" as it looks:
https://r12a.github.io/uniview/?charlist="This is 😅 👨‍💻 🇧🇷 🏳️‍🌈 string"

Moral: beware of string functions when dealing with non-basic Unicode constructs.

In this particular example, the embedded LF doesn't clearly show up. The combined sequence of 3 Unicode codepoints produces the glyph of a man behind a notebook (although the codepoints don't express that and even include a zero-width joiner in between). The same goes for the localised flag (two regional indicator symbols, B and R, producing a Brazilian flag) and finally for the white flag joined to the rainbow codepoint.

Edited July 24, 2018 by jchd

This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.
SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)

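To make the "two 16-bit encoding units" point concrete, here is a minimal sketch of the standard UTF-16 surrogate-pair arithmetic. `_CodePointToUtf16` is a hypothetical helper name, not an AutoIt built-in, and the sketch assumes that `ChrW` and `AscW` operate on single 16-bit units.

```autoit
; Hypothetical helper: encode one Unicode codepoint as UTF-16 code units.
Func _CodePointToUtf16($iCodePoint)
    ; Codepoints inside the BMP fit in a single 16-bit unit.
    If $iCodePoint < 0x10000 Then Return ChrW($iCodePoint)
    ; Otherwise split the remaining 20 bits into a high and a low surrogate.
    Local $iRest = $iCodePoint - 0x10000
    Local $iHigh = 0xD800 + BitShift($iRest, 10) ; positive shift = shift right
    Local $iLow = 0xDC00 + BitAND($iRest, 0x3FF)
    Return ChrW($iHigh) & ChrW($iLow)
EndFunc

Local $sFace = _CodePointToUtf16(0x1F605)
ConsoleWrite(StringLen($sFace) & @CRLF) ; 2, because StringLen counts 16-bit units
ConsoleWrite(Hex(AscW(StringLeft($sFace, 1)), 4) & " " & _
        Hex(AscW(StringRight($sFace, 1)), 4) & @CRLF) ; D83D DE05
```

Building test strings this way also sidesteps any doubt about how the script file itself is encoded.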
unipark Posted July 25, 2018 Author

16 hours ago, jchd said:
    AutoIt uses UCS2 (a subset of UTF16 limited to the Unicode BMP) encoding and your string is exactly 29 Unicode characters long. The offset of 24 is correct as the regex engine used (PCRE) correctly parses full UTF16. [...]

Well, assuming AutoIt's PCRE engine does work correctly, how come this PCRE tester does not do the same: https://regex101.com/r/UPvXfl/1 (and not just this one)? One of them must be doing something wrong or different, and I need to know what exactly in order to fix it.

Also, as stated above, in my case the offset is given as 31 for the example string, so the code that generates this offset (an external program) counts the same way regex101.com does. It is hard to believe that two independent programs would make the same error.

But now back to AutoIt.

    StringLen($string)

says the string is 37 characters long, just like regex101.com does.

    StringTrimLeft($string, 31)

removes everything before the word "string", just as expected.

In other words, AutoIt itself counts the same way too, but PCRE in AutoIt doesn't. So regardless of which is correct, I'm obviously still searching for a way to use PCRE with the given offset.

Can I use PCRE with UCS2 like AutoIt does? Or can I encode the string somehow differently? Maybe regex the BIN/HEX representation of the string to avoid encoding problems?

Any help would be much appreciated.

jchd Posted July 25, 2018

Once again, this particular string (as copy/pasted from your post) is only 29 codepoints long, nothing more, nothing less. The provided link shows precisely what's inside.

Every Unicode codepoint > 0xFFFF requires two 16-bit encoding units to represent. For programs using UCS-2 that means two 16-bit characters. So some programs or functions are fooled by codepoints > 0xFFFF and count two characters (2x 16-bit encoding units, as in UCS-2) instead of one codepoint represented by 2x 16-bit units.

Here's the hex representation of the string, along with codepoint offsets, encoding unit offsets and its UTF-16 encoding:

Codepoint Name                                         Codepoint offset  Encoding unit offset  Hex UTF-16
0054      LATIN CAPITAL LETTER T                       1                 1                     0x0054
0068      LATIN SMALL LETTER H                         2                 2                     0x0068
0069      LATIN SMALL LETTER I                         3                 3                     0x0069
0073      LATIN SMALL LETTER S                         4                 4                     0x0073
0020      SPACE                                        5                 5                     0x0020
0069      LATIN SMALL LETTER I                         6                 6                     0x0069
0073      LATIN SMALL LETTER S                         7                 7                     0x0073
000A      [control]                                    8                 8                     0x000A
0020      SPACE                                        9                 9                     0x0020
1F605     SMILING FACE WITH OPEN MOUTH AND COLD SWEAT  10                10                    0xD83D 0xDE05
0020      SPACE                                        11                12                    0x0020
1F468     MAN                                          12                13                    0xD83D 0xDC68
200D      ZERO WIDTH JOINER                            13                15                    0x200D
1F4BB     PERSONAL COMPUTER                            14                16                    0xD83D 0xDCBB
0020      SPACE                                        15                18                    0x0020
1F1E7     REGIONAL INDICATOR SYMBOL LETTER B           16                19                    0xD83C 0xDDE7
1F1F7     REGIONAL INDICATOR SYMBOL LETTER R           17                21                    0xD83C 0xDDF7
0020      SPACE                                        18                23                    0x0020
1F3F3     WAVING WHITE FLAG                            19                24                    0xD83C 0xDFF3
FE0F      VARIATION SELECTOR-16                        20                26                    0xFE0F
200D      ZERO WIDTH JOINER                            21                27                    0x200D
1F308     RAINBOW                                      22                28                    0xD83C 0xDF08
0020      SPACE                                        23                30                    0x0020
0073      LATIN SMALL LETTER S                         24                31                    0x0073
0074      LATIN SMALL LETTER T                         25                32                    0x0074
0072      LATIN SMALL LETTER R                         26                33                    0x0072
0069      LATIN SMALL LETTER I                         27                34                    0x0069
006E      LATIN SMALL LETTER N                         28                35                    0x006E
0067      LATIN SMALL LETTER G                         29                36                    0x0067

The offset discrepancy (24 vs. 31) you observe comes from the fact that some programs/functions count 16-bit encoding units while others correctly recognize and support Unicode surrogates, and hence count actual UTF-16 codepoints.

So beware of codepoints > 0xFFFF and, as far as possible, keep away from using literal offsets blindly.

As for the string length being returned as 37, I suspect that some invisible codepoint has escaped copy/paste somewhere, or that there is some obscure bug.

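One way to bridge the two counting schemes is to translate an offset given in 16-bit encoding units into an offset in codepoints before building the pattern. The following is only a sketch under assumptions: `_UnitsToCodePoints` is a hypothetical helper name, the shortened test string is built with `ChrW` so the script does not depend on file encoding, and it relies on `.` consuming a surrogate pair as one character, as described above.

```autoit
; Hypothetical helper: convert an offset counted in UTF-16 code units
; (what StringLen counts) into an offset counted in codepoints.
Func _UnitsToCodePoints($sText, $iUnits)
    Local $iPairs = 0, $iCode
    For $i = 1 To $iUnits
        $iCode = AscW(StringMid($sText, $i, 1))
        ; A high (lead) surrogate is the first half of a two-unit codepoint.
        If $iCode >= 0xD800 And $iCode <= 0xDBFF Then $iPairs += 1
    Next
    Return $iUnits - $iPairs
EndFunc

; U+1F605 written as its surrogate pair 0xD83D 0xDE05.
Local $sEmoji = ChrW(0xD83D) & ChrW(0xDE05)
Local $sString = "This is a " & $sEmoji & " string"
Local $iOffset = 13 ; code units in front of the word "string"
Local $iCodePoints = _UnitsToCodePoints($sString, $iOffset) ; 13 units - 1 pair = 12

; (?s) lets "." match line feeds as well.
Local $sPattern = "(?s)^(.{" & $iCodePoints & "})(.*)$"
ConsoleWrite(StringRegExpReplace($sString, $sPattern, "$1|inserted|$2") & @CRLF)
```

If the supplied offset lands between the two halves of a surrogate pair, this helper effectively rounds down to the start of that codepoint, which may or may not be what the external program intended.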
unipark Posted July 27, 2018 Author

I just saw there is in fact a difference between your test string and the one I used.

    "This is a 😅 👨‍💻 🇧🇷 🏳️‍🌈 string"
    "This is 😅 👨‍💻 🇧🇷 🏳️‍🌈 string"

In my string there is

    0020 SPACE
    0061 LATIN SMALL LETTER A
    0020 SPACE

and in your string there is

    000A [control]
    0020 SPACE

Your string is missing an "a" and a space but has an additional linefeed, which in total makes it one character shorter. Hence your string has a length of 36 rather than 37 encoding units, or 29 rather than 30 codepoints. The difference of 7 comes from the different ways of counting (codepoints versus encoding units). So we can conclude that both values are correct, as your table states.

The problem I am trying to solve with AutoIt and RegExp requires me to count by encoding units and not by codepoints. AutoIt does count by encoding units, but its RegExp implementation doesn't.

So far I have received detailed information about the encoding, but still no hint on how to solve the problem. Here is my attempt to circumvent it:

    $string = "This is a 😅 👨‍💻 🇧🇷 🏳️‍🌈 string"
    $offset = 31
    $stringtoinsert = "|inserted|"
    $regexpattern = "^(.{" & $offset & "})(.*)$"
    $regexreplace = "$1" & $stringtoinsert & "$2"
    $outputstring = StringRegExpReplace($string, $regexpattern, $regexreplace)
    ConsoleWrite($outputstring & @CRLF)
    ;ConsoleWrite(StringLen($string) & @CRLF) ; = 37
    ConsoleWrite("-------------------------------------------------" & @CRLF)
    ; Work on the hex dump of the UTF-16 data instead: 4 hex digits per encoding unit
    $stringHex = Hex(StringToBinary($string, 3))
    $regexpattern = "^(.{" & $offset*4 & "})(.*)$"
    $regexreplace = "${1}" & Hex(StringToBinary($stringtoinsert, 3)) & "${2}"
    $outputstring = StringRegExpReplace($stringHex, $regexpattern, $regexreplace)
    ConsoleWrite($outputstring & @CRLF)
    ConsoleWrite(BinaryToString("0x" & $outputstring, 3) & @CRLF)

But I hope there is something better than this.

jchd Posted July 27, 2018

Sorry for the emasculated copy/paste, I must have made an error.

Let me ask two questions:
1) Where does your literal offset come from, and why do you need to have it precomputed or forced this way?
2) Are you really routinely in need of using Unicode codepoints beyond the BMP (> 0xFFFF)?

I ask 1) because it's very unusual to have an offset forced there, coming from some external source.

Your count-in-hex approach won't keep you away from issues: if the supplied offset falls in the middle of a 2x 16-bit codepoint, things are certain to go astray. Another situation where unexpected results will occur is when the "alien-dictated" offset falls inside a grapheme cluster. A grapheme cluster has a potentially unbounded length.

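The first failure mode mentioned above, an offset that cuts a surrogate pair in half, can at least be detected before inserting anything. A minimal sketch, assuming `AscW` returns the raw 16-bit unit value; `_SplitsSurrogatePair` is a hypothetical helper name:

```autoit
; Hypothetical helper: True if inserting after $iUnits code units would
; separate a high surrogate from the low surrogate that follows it.
Func _SplitsSurrogatePair($sText, $iUnits)
    If $iUnits < 1 Or $iUnits >= StringLen($sText) Then Return False
    Local $iBefore = AscW(StringMid($sText, $iUnits, 1))    ; last unit before the cut
    Local $iAfter = AscW(StringMid($sText, $iUnits + 1, 1)) ; first unit after the cut
    Return ($iBefore >= 0xD800 And $iBefore <= 0xDBFF) And _
            ($iAfter >= 0xDC00 And $iAfter <= 0xDFFF)
EndFunc

Local $sText = "a" & ChrW(0xD83D) & ChrW(0xDE05) & "b" ; "a", U+1F605, "b"
ConsoleWrite(_SplitsSurrogatePair($sText, 1) & @CRLF) ; False: cut is before the pair
ConsoleWrite(_SplitsSurrogatePair($sText, 2) & @CRLF) ; True: cut is inside the pair
ConsoleWrite(_SplitsSurrogatePair($sText, 3) & @CRLF) ; False: cut is after the pair
```

This only guards against broken surrogate pairs. It does not detect an offset that falls inside a longer grapheme cluster such as a joined emoji sequence.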
Jury Posted July 28, 2018

Yes, this would work but for the limitations of AutoIt Unicode:

    😅   \x{D83D}\x{DE05}
    💻   \x{D83D}\x{DCBB}
    👨   \x{D83D}\x{DC68}
    🇧🇷   \x{D83C}\x{DDE7}\x{D83C}\x{DDF7}
    🏳️   \x{D83C}\x{DFF3}
    🌈   \x{D83C}\x{DF08}

    "(*UCP)^(?i)(?-s)This is a \x{D83D}\x{DE05}\s\x{D83D}\x{DC68}.\x{D83D}\x{DCBB}\s\x{D83C}\x{DDE7}\x{D83C}\x{DDF7}\s\x{D83C}\x{DFF3}..\x{D83C}\x{DF08} string"

    11 12 13 14 15 16 17 18 19 20 21 22 25 26 27 28 29 30 31

Indeed, you'd have to analyse everything to find out whether there is going to be a \x2E ending a grapheme cluster, which won't be counted.

jchd Posted July 28, 2018

Just as a (related) aparté, StringRegExp allows you to deal with Unicode grapheme clusters. See the help about \X, which requires the (*UCP) modifier.

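As a rough illustration of `\X`, the sketch below splits a short string into grapheme clusters and reports how many 16-bit units each one occupies. It uses a simple base letter plus combining accent rather than the emoji from this thread, because how joined emoji sequences are grouped depends on the Unicode rules built into the PCRE version in use, which has not been verified here.

```autoit
; "e" followed by U+0301 COMBINING ACUTE ACCENT, then a plain "a".
Local $sText = "e" & ChrW(0x0301) & "a"

; Flag 3 returns an array with all global matches; \X matches one grapheme cluster.
Local $aClusters = StringRegExp($sText, "(*UCP)\X", 3)
If IsArray($aClusters) Then
    For $i = 0 To UBound($aClusters) - 1
        ; Print the size of each cluster in 16-bit code units.
        ConsoleWrite("cluster " & $i & ": " & StringLen($aClusters[$i]) & " code unit(s)" & @CRLF)
    Next
Else
    ConsoleWrite("no match, @error = " & @error & @CRLF)
EndIf
```

A cluster longer than one code unit is exactly the kind of span an externally supplied offset must not split.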