leomoon

Strange Behavior with Unicode Characters

5 posts in this topic

Hello,

I'm seeing some strange behavior when using StringReplace on certain Unicode characters.

The example is very simple: I have a string and I want to remove a certain character, i.e. StringReplace it with nothing ("").

Here are some Unicode characters:

"ZERO WIDTH NON-JOINER" or ChrW(8204) or U+200C

"ZERO WIDTH JOINER" or ChrW(8205) or U+200D

"ARABIC TATWEEL" or ChrW(1600) or U+0640

The first two characters are invisible; they only change how the preceding and following characters are rendered (in case you didn't know).

Now let's say I have a generated string that might have all those characters in it and I only want to get rid of ChrW(8204) and ChrW(8205). After lots of testing I narrowed it down to:

MsgBox(0, '', StringReplace(ChrW(8205) & ChrW(1600), ChrW(8205), ''))

MsgBox(0, '', StringReplace(ChrW(8204) & ChrW(1600), ChrW(8204), ''))

Both commands above will remove ChrW(1600) too.

This might be a bug.

I'm not so sure it's a bug in AutoIt.

What I suspect is that the function you use (StringReplace) merely wraps a native Windows Unicode function. Now the behavior of several Unicode codepoints like the ZWNJ and ZWJ you want to get rid of isn't as simple as you might think, at least in the hands of an actual Unicode-compliant function. See for example this article.

In short, I believe (but can't say for sure) that the underlying function applies Unicode-defined treatment to the string you supply. Both code points act on their neighboring code points to control ligature formation and so change how the text is rendered. The end effect in both of your examples is that the "meaningful" code point, which is then no longer part of any ligature, vanishes as well.

You may have better luck with StringRegExpReplace, since PCRE doesn't give code points in the subject and pattern strings their complex Unicode semantics but simply treats them individually.
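A minimal sketch of that suggestion, assuming a test string made of ZWJ, tatweel, and ZWNJ (the variable names are made up for illustration). It builds a character class from the two literal characters and inspects the result by length and code point, since the characters themselves are hard to see in a message box. If the regex engine really treats each code point individually, only the tatweel (1600) should remain.

```autoit
; Hypothetical test string: ZWJ + ARABIC TATWEEL + ZWNJ
Local $sInput = ChrW(8205) & ChrW(1600) & ChrW(8204)

; Character class containing the two literal characters: [<ZWNJ><ZWJ>]
Local $sPattern = "[" & ChrW(8204) & ChrW(8205) & "]"

; Replace every match with nothing (the default count of 0 means all matches)
Local $sClean = StringRegExpReplace($sInput, $sPattern, "")

; Report length and first code point instead of displaying invisible characters
MsgBox(0, "", "Length: " & StringLen($sClean) & ", first code point: " & AscW($sClean))
```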




No, it's not an AutoIt bug, nor a Windows bug.

When you want exact character replacement with no mumbo jumbo, you specify the casesense parameter of StringReplace().

...set to 1, of course.
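As a rough illustration of this advice, the original test could be rerun with the fifth argument (casesense) set to 1. The variable names are hypothetical, and the result is not verified here: going by the reply above, the tatweel should survive, which the length and code point check would confirm.

```autoit
; Same input as in the first post: ZWJ followed by ARABIC TATWEEL
Local $sInput = ChrW(8205) & ChrW(1600)

; Arguments: string, search, replacement, occurrence (0 = all), casesense (1 = case-sensitive)
Local $sClean = StringReplace($sInput, ChrW(8205), "", 0, 1)

; Report length and first code point instead of displaying invisible characters
MsgBox(0, "", "Length: " & StringLen($sClean) & ", first code point: " & AscW($sClean))
```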

