Strange Behavior with Unicode Characters

I'm seeing some strange behavior when using StringReplace on certain Unicode characters.

The example is very simple. I have a string and I want to remove a certain character, i.e. StringReplace it with nothing ("").

Here are some Unicode characters:

"ZERO WIDTH NON-JOINER" or ChrW(8204) or U+200C

"ZERO WIDTH JOINER" or ChrW(8205) or U+200D

"ARABIC TATWEEL" or ChrW(1600) or U+0640

The first two characters are invisible; they only change the behavior of the previous and next characters (in case you didn't know).

Now let's say I have a generated string that might have all those characters in it and I only want to get rid of ChrW(8204) and ChrW(8205). After lots of testing I narrowed it down to:



Both commands above will remove ChrW(1600) too.

This might be a bug.

I'm not so sure it's a bug in AutoIt.

What I suspect is that the function you use (StringReplace) merely wraps a native Windows Unicode function. The behavior of several Unicode codepoints, like the ZWNJ and ZWJ you want to get rid of, isn't as simple as you might think, at least in the hands of an actual Unicode-compliant function. See for example this article.

In short, I believe (but can't say for sure) that the underlying function applies Unicode-defined treatment to the string you supply. Both codepoints act on neighboring codepoints that would otherwise form a ligature, changing how they are rendered, so the end effect is that the "meaningful" codepoint (no longer part of any ligature) vanishes in both of your examples.

You may have better luck with StringRegExpReplace, since PCRE doesn't give codepoints in the subject and pattern strings their complex Unicode semantics, but simply treats them individually.
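
As a rough sketch of that suggestion (the sample string and the variable names are made up for illustration, and I haven't checked it against the original poster's data), a character class built from the two codepoints should strip only ZWNJ and ZWJ. The loop just prints the remaining code points so you can see what survived:

```autoit
; Hypothetical sample: letters separated by ZWNJ, ZWJ and TATWEEL.
Local $sText = "a" & ChrW(8204) & "b" & ChrW(8205) & "c" & ChrW(1600) & "d"

; Character class holding only U+200C and U+200D, built from literal characters.
Local $sPattern = "[" & ChrW(8204) & ChrW(8205) & "]"
Local $sClean = StringRegExpReplace($sText, $sPattern, "")

; Dump the code points of the result so invisible characters can be checked.
For $i = 1 To StringLen($sClean)
    ConsoleWrite(AscW(StringMid($sClean, $i, 1)) & " ")
Next
ConsoleWrite(@CRLF)
```

If 1600 is still in the printed list, the tatweel was left alone.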

No, it's not an AutoIt bug. Nor a Windows bug.

When you want exact character replacement with no mumbo jumbo, you specify the casesense parameter for StringReplace().

...Being 1 of course.
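
A minimal sketch of what that looks like, assuming a made-up sample string (the variable names are only for illustration). casesense is the fifth parameter, so the occurrence parameter has to be passed too; 0 means replace all occurrences:

```autoit
; Hypothetical sample: letters separated by ZWNJ, ZWJ and TATWEEL.
Local $sText = "a" & ChrW(8204) & "b" & ChrW(8205) & "c" & ChrW(1600) & "d"

; occurrence = 0 (all), casesense = 1 (exact, case-sensitive comparison).
Local $sClean = StringReplace($sText, ChrW(8204), "", 0, 1)
$sClean = StringReplace($sClean, ChrW(8205), "", 0, 1)

; Dump the code points of the result so invisible characters can be checked.
For $i = 1 To StringLen($sClean)
    ConsoleWrite(AscW(StringMid($sClean, $i, 1)) & " ")
Next
ConsoleWrite(@CRLF)
```

Running the same two calls without the last two arguments and comparing the printed code points is an easy way to see the difference between the default comparison and the exact one.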




