Problem with conversion to and from UTF-8

Stomp

Hello, I'm writing my first script with AutoIt and I like it very much so far.

I've encountered a problem that I can't solve. The following script converts a string from Unicode to UTF-8 and back. It should produce the same string, but it doesn't. The first ConsoleWrite produces "ウィキペディアはオープンコンテントの百科事典です。" But the second produces "ウィキペディアはオープンコンチEトE百科事Eです". As you can see, the resulting string is broken. Why does this happen?

Here is my script:

#include <WinAPI.au3>

_Main()

Func _Main()
    Local $str = "ウィキペディアはオープンコンテントの百科事典です。" ; a Japanese string
    ConsoleWrite($str & @CRLF)
    $str = _WinAPI_WideCharToMultiByte($str, 65001)
    $str = _WinAPI_MultiByteToWideChar($str, 65001, 0, 1)
    ConsoleWrite($str & @CRLF)
EndFunc

I'm on Windows XP SP3 with AutoIt 3.3.6.1. The code page for non-Unicode applications is set to Japanese.

jchd

AFAIK the console displays UTF-8 correctly when you set the SciTE code page parameter to Unicode.

In SciTE, choose Options > Open Global Options File, then locate and edit the code.page parameter:

# Internationalisation
# Japanese input code page 932 and ShiftJIS character set 128
#code.page=932
#character.set=128
# Unicode
code.page=65001 <<-- here
#code.page=0
#character.set=204

I believe that if you send Japanese ANSI (which is a non-Unicode double-byte encoding) to the console you'll see, as your example demonstrates, things interpreted wrongly.

So choose a code page in SciTE and then send either Japanese double-byte ANSI _or_ UTF-8; you can't have both displayed correctly with the same settings.

Also, be sure to work with scripts saved in UTF-8 + BOM encoding to get consistent display.


This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)

Stomp

ConsoleWrite is not the problem here, I think. The following script, which uses MsgBox instead, has exactly the same problem. I tried doing it directly in C and it works. I also tried using UTF-8 and ANSI for my script and it didn't make a difference.

#include <WinAPI.au3>

_Main()

Func _Main()
    Local $str = "ウィキペディアはオープンコンテントの百科事典です。"
    MsgBox(0, "", $str & @CRLF)
    $str = _WinAPI_WideCharToMultiByte($str, 65001)
    $str = _WinAPI_MultiByteToWideChar($str, 65001, 0, 1)
    MsgBox(0, "", $str & @CRLF)
EndFunc

jchd

There is something wrong here.

Can you run the same as this and tell us any difference:

post-44800-1275995044016_thumb.jpg

I don't have the oriental language pack installed so I can't use ConsoleWrite to write it in Japanese ANSI, but the UTF-8 version works as expected for me as you can see.

post-44800-12759953125136_thumb.jpg


Stomp

I changed the code page for non-Unicode programs back to German and the script worked. When I changed it back to Japanese, I had the same problem again. It seems AutoIt has problems with the Japanese code page.

I tried running this script with the Japanese code page:

#include <WinAPI.au3>

_Main()

Func _Main()
    Local $str = "ウィキペディアはオープンコンテントの百科事典です。" ; a Japanese string
    ConsoleWrite(StringToBinary($str, 4) & @CRLF)
    $str = _WinAPI_WideCharToMultiByte($str, 65001)
    ConsoleWrite(Binary($str) & @CRLF)
    $str = _WinAPI_MultiByteToWideChar($str, 65001, 0, 1)
    ConsoleWrite(StringToBinary($str, 4) & @CRLF)
EndFunc

And this is what I got:

0xE382A6E382A3E382ADE3839AE38387E382A3E382A2E381AFE382AAE383BCE38397E383B3E382B3E383B3E38386E383B3E38388E381AEE799BEE7A791E4BA8BE585B8E381A7E38199E38082
0xE382A6E382A3E382ADE3839AE38387E382A3E382A2E381AFE382AAE383BCE38397E383B3E382B3E383B3E383
0xE382A6E382A3E382ADE3839AE38387E382A3E382A2E381AFE382AAE383BCE38397E383B3E382B3E383B3E3838145E3838845E799BEE7A791E4BA8B45E381A7E38199
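The first line of the dump, produced by StringToBinary with flag 4, holds all 75 bytes of the UTF-8 encoding, while the second line is already cut short before the round trip completes. That suggests the native conversion is not affected. As a minimal sketch (assuming the script file is saved as UTF-8 with BOM; the variable names are only illustrative), a round trip that stays with the native functions and never stores the UTF-8 bytes in a string could look like this:

```autoit
; Sketch only: UTF-8 round trip using AutoIt's native conversion functions.
Local $sText = "ウィキペディアはオープンコンテントの百科事典です。"

; Flag 4 = UTF-8. The result is a binary variant, not a string.
Local $dUtf8 = StringToBinary($sText, 4)
ConsoleWrite($dUtf8 & @CRLF) ; hex dump of the UTF-8 bytes

; Decode the same bytes back into a native AutoIt string.
Local $sBack = BinaryToString($dUtf8, 4)
ConsoleWrite("Round trip intact: " & ($sBack == $sText) & @CRLF)
```

If the last line does not report True on a system with the Japanese code page, the problem lies elsewhere than in the WinAPI wrappers.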

jchd

While it's fairly likely that Germany would win over Japan at the soccer World Cup, I see no good reason for the behavior you show.

Sorry to ask again: are you positive that the script was saved in UTF-8 encoding? Note that it's not enough to change the encoding in SciTE; you have to make some change to the file and save it for the new encoding to take effect (a dummy change will do).


Stomp

Yes, I'm sure that the script is saved in UTF-8. When I open it in SciTE, it reports the encoding as UTF-8.

jchd

Sorry for the delay in answering.

Can you please tell me precisely which Japanese code page you're using?

932 shift_jis
10001 x-mac_japanese (Mac!)
20290 IBM290 (EBCDIC!)
20932 EUC-JP (JIS X 0208-1990 and JIS X 0212-1990)
50220 iso-2022-jp
50221 csISO2022JP
50222 iso-2022-jp (JIS X 0201-1989)
50930 (EBCDIC!)
50931 (EBCDIC!)
50939
51932 euc-jp


Stomp

I meant that the language for non-Unicode programs is set to Japanese in Regional and Language Options in the Control Panel. Here is a link to a Microsoft page explaining how to set the language. Sorry if it was not clear what I meant.

Stomp

Can someone reproduce this problem? I can't use AutoIt if this doesn't work.

jchd

I'm sorry, but I just can't reproduce that issue here. I nonetheless believe that your problem has a simple solution, as there are a significant number of Asian AutoIt users in this forum and in the Chinese forum (most of whom also use a double-byte code page, Big5).

Perhaps you would have a better chance of attracting seasoned Asian users with a thread subject that is more explicit about the character set you use. Try opening a new one with "Help with Japanese charset" or something close.

I must confess I have no clue at the moment about what happens in your case, although I've tackled some issues with Unicode, UTF-*, Asian charsets, ... and AutoIt before.

Don't give up too quickly, there _must_ be a way out. As soon as I can over the weekend, I'll try to set up a vanilla XP SP3 x86 machine with Asian language support enabled and try to find out what the issue is.


Stomp

Thanks a lot for trying to help me. I tried once again and found a workaround.

The following script works.

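#include <WinAPI.au3>

_Main()

Func _Main()
    Local $str = "ウィキペディアはオープンコンテントの百科事典です。" ; a Japanese string
    MsgBox(0, "", $str & @CRLF)
    ; Third argument 0: return a struct holding the UTF-8 bytes instead of a string
    $str = _WinAPI_WideCharToMultiByte($str, 65001, 0)
    ; Convert the struct back; fourth argument 1 returns a string
    $str = _WinAPI_MultiByteToWideChar($str, 65001, 0, 1)
    MsgBox(0, "", $str & @CRLF)
EndFunc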

I changed the call to _WinAPI_WideCharToMultiByte so that it returns a struct instead of a string. When I pass a multibyte string to a C function, I have to declare the argument as "ptr" rather than "str" and pass a pointer to the struct obtained with DllStructGetPtr.
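The "ptr" approach described above can be sketched by calling the two kernel32 functions directly through DllCall. This is only an illustration under assumptions: _ToUtf8Struct and _FromUtf8Struct are hypothetical helper names, not part of WinAPI.au3, and error handling is kept to a minimum.

```autoit
; Sketch only: the UTF-8 bytes live in a struct and are passed as "ptr",
; so AutoIt never gets a chance to reinterpret them as an ANSI string.

Func _ToUtf8Struct($sText)
    ; First call with no output buffer returns the required size in bytes
    ; (including the terminating null, because the input length is -1).
    Local $aRet = DllCall("kernel32.dll", "int", "WideCharToMultiByte", "uint", 65001, "dword", 0, _
            "wstr", $sText, "int", -1, "ptr", 0, "int", 0, "ptr", 0, "ptr", 0)
    If @error Then Return SetError(1, 0, 0)
    If $aRet[0] = 0 Then Return SetError(2, 0, 0)
    Local $iSize = $aRet[0]
    Local $tUtf8 = DllStructCreate("byte[" & $iSize & "]")
    ; Second call fills the struct with the UTF-8 bytes.
    $aRet = DllCall("kernel32.dll", "int", "WideCharToMultiByte", "uint", 65001, "dword", 0, _
            "wstr", $sText, "int", -1, "ptr", DllStructGetPtr($tUtf8), "int", $iSize, "ptr", 0, "ptr", 0)
    If @error Then Return SetError(3, 0, 0)
    If $aRet[0] = 0 Then Return SetError(4, 0, 0)
    Return $tUtf8
EndFunc

Func _FromUtf8Struct($tUtf8)
    Local $pUtf8 = DllStructGetPtr($tUtf8)
    ; First call returns the required size in 16-bit units, including the null.
    Local $aRet = DllCall("kernel32.dll", "int", "MultiByteToWideChar", "uint", 65001, "dword", 0, _
            "ptr", $pUtf8, "int", -1, "ptr", 0, "int", 0)
    If @error Then Return SetError(1, 0, "")
    If $aRet[0] = 0 Then Return SetError(2, 0, "")
    Local $iChars = $aRet[0]
    Local $tWide = DllStructCreate("wchar[" & $iChars & "]")
    $aRet = DllCall("kernel32.dll", "int", "MultiByteToWideChar", "uint", 65001, "dword", 0, _
            "ptr", $pUtf8, "int", -1, "ptr", DllStructGetPtr($tWide), "int", $iChars)
    If @error Then Return SetError(3, 0, "")
    If $aRet[0] = 0 Then Return SetError(4, 0, "")
    Return DllStructGetData($tWide, 1)
EndFunc

; Usage: convert to UTF-8 and back, then show the result.
Local $tBytes = _ToUtf8Struct("百科事典です。")
MsgBox(0, "", _FromUtf8Struct($tBytes))
```

The struct returned by the first helper can be handed to any other DLL function that expects a UTF-8 char pointer, again declared as "ptr".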

jchd

You're right that a "str" argument to DllCall will be silently converted to ANSI by AutoIt. The actual issue here is that your UTF-8 output from the first conversion was converted to the double-byte ANSI code page (which I'm not a big fan of). I've had this "str" type problem as well in the SQLite UDF, which uses its own wide-to-multibyte routines (using structs). But what I don't see is how your last script differs from the previous one(s).

$str = _WinAPI_WideCharToMultiByte($str, 65001, 0)

is equivalent to

$str = _WinAPI_WideCharToMultiByte($str, 65001)

as the third parameter (conversion flags) defaults to 0. The _fourth_ parameter is a switch to return string or struct. I believe you used that instead:

$str = _WinAPI_WideCharToMultiByte($str, 65001, 0, 0)

Anyway, glad to know you have things working now, and don't hesitate to post again if you encounter other Unicode issues.

BTW, you may want to know that current AutoIt only handles the UCS-2 character set rather than genuine UTF-16LE. In practice, that means only Unicode plane 0 characters can be dealt with in AutoIt. In other words, native AutoIt strings consist of 16-bit encoding units, with one unit = one character. Unicode characters in the upper planes each need two 16-bit units (a surrogate pair). While most characters in those upper planes are not widely used, recent additions map Chinese extensions there. As a consequence, they can't be used with today's AutoIt, and if you have a data source using characters in that range, you'll need to find workarounds to avoid data loss. I'm unable to tell you how widely used those problematic ranges are today.
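To make the limitation concrete, here is a small sketch that builds the two 16-bit units for one upper-plane character by hand. U+20BB7 (a CJK Extension B ideograph) is only an example chosen for illustration, and the variable names are arbitrary.

```autoit
; Sketch only: split a code point above U+FFFF into its UTF-16 surrogate pair.
Local $iCodePoint = 0x20BB7
Local $iOffset = $iCodePoint - 0x10000

Local $iHigh = 0xD800 + BitShift($iOffset, 10) ; high surrogate (positive shift = right shift)
Local $iLow = 0xDC00 + BitAND($iOffset, 0x3FF) ; low surrogate

Local $sChar = ChrW($iHigh) & ChrW($iLow)

ConsoleWrite("Surrogates: 0x" & Hex($iHigh, 4) & " 0x" & Hex($iLow, 4) & @CRLF)
ConsoleWrite("StringLen: " & StringLen($sChar) & @CRLF)
```

Because StringLen counts 16-bit units, the single character should be reported as length 2, and functions such as StringMid or StringLeft can cut it in half.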


Stomp

jchd wrote: "The _fourth_ parameter is a switch to return string or struct."

Yes, that's what the documentation says, but in WinAPI.au3 the function is declared like this:

Func _WinAPI_WideCharToMultiByte($pUnicode, $iCodePage = 0, $bRetString = True)

Maybe AutoIt shouldn't convert multibyte strings to ANSI; that would make things much easier.

jchd

Geez, good catch. I didn't use this function myself and never got into this trap. I'll take care of reporting/fixing this.

Edit: that's now ticket #1671.
