How do I convert strings between ANSI and UTF-8 / UTF-16?


jchd

Hello group,

I've been trying almost everything possible to convert strings between ANSI (I personally use the Latin-1 codepage) and UTF-8 or UTF-16, but I've had no real success so far.

I need this because I have to deal with a pure ANSI database using an ODBC layer, a complex GUI interface and two separate SQLite3 (v3.6.13) databases (one utf-8 and one utf-16). Given that this will be routinely used with a significant dataflow volume, I'd like to know which would be the most practical (and if possible efficient) way.

I may be misunderstanding obvious things, but it seems to me that many standard UDFs and functions still work in ANSI mode. Are simple GUI controls (say InputBoxes) delivering ANSI or UTF-8 strings? What data format does the SQLite3 interface expect?

In the same vein, why are UTF-16 (improperly called Unicode) strings passed to and from DLL calls through obscure structures instead of the wstr type? Aren't wstrs first-class citizens?

Another question haunting me: how can I hex-dump a string without going into any conversion? StringToBinary is not an option since it forces you to declare which format the string is using, which is precisely what I need to know!

I really can't understand how all this is supposed to be used in simple or more complex developments!

I apologize for asking that much, but it's a consequence of me wasting ___way___ too much time trying to solve these issues.

Warm thanks in advance for any help.

This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)

I'm bumping this post in the hope that someone can help.


Well, I've been playing around with _WinAPI_MultiByteToWideChar() and _WinAPI_WideCharToMultiByte() but try as I might I couldn't get them to work. I've extracted both functions from the WinAPI.au3 file and tweaked them a little to get them working. Below is an example of converting from ANSI to UTF-8. For UTF-16, you'd be as well taking the result from _WBD_WinAPI_MultiByteToWideChar() and re-using it in your function calls with DllStructGetPtr(). I hope this makes sense.

MsgBox(64, "UTF-8", _ConvertAnsiToUtf8("Café á ©®"), 5)

Exit

Func _ConvertAnsiToUtf8($sText)
    Local $tUnicode = _WBD_WinAPI_MultiByteToWideChar($sText)
    If @error Then Return SetError(@error, 0, "")
    Local $sUtf8 = _WBD_WinAPI_WideCharToMultiByte(DllStructGetPtr($tUnicode), 65001)
    If @error Then Return SetError(@error, 0, "")
    Return SetError(0, 0, $sUtf8)
EndFunc   ;==>_ConvertAnsiToUtf8

; Converts an ANSI string to UTF-16 and returns a DllStruct holding the wide characters.
Func _WBD_WinAPI_MultiByteToWideChar($sText, $iCodePage = 0, $iFlags = 0)
    Local $iText, $pText, $tText

    ; +1 for the terminating null; assumes a single-byte codepage (one byte per character)
    $iText = StringLen($sText) + 1
    $tText = DllStructCreate("wchar[" & $iText & "]")
    $pText = DllStructGetPtr($tText)
    DllCall("Kernel32.dll", "int", "MultiByteToWideChar", "uint", $iCodePage, "dword", $iFlags, "str", $sText, "int", $iText, "ptr", $pText, "int", $iText)
    If @error Then Return SetError(@error, 0, $tText)
    Return $tText
EndFunc   ;==>_WBD_WinAPI_MultiByteToWideChar

; Converts a null-terminated UTF-16 buffer to the given codepage (65001 = UTF-8).
Func _WBD_WinAPI_WideCharToMultiByte($pUnicode, $iCodePage = 0)
    Local $aResult, $tText, $pText

    ; First call: output buffer size 0 asks the API for the required size in bytes
    $aResult = DllCall("Kernel32.dll", "int", "WideCharToMultiByte", "uint", $iCodePage, "dword", 0, "ptr", $pUnicode, "int", -1, "ptr", 0, "int", 0, "ptr", 0, "ptr", 0)
    If @error Then Return SetError(@error, 0, "")
    $tText = DllStructCreate("char[" & $aResult[0] + 1 & "]")
    $pText = DllStructGetPtr($tText)
    ; Second call: perform the conversion into the allocated buffer
    $aResult = DllCall("Kernel32.dll", "int", "WideCharToMultiByte", "uint", $iCodePage, "dword", 0, "ptr", $pUnicode, "int", -1, "ptr", $pText, "int", $aResult[0], "ptr", 0, "ptr", 0)
    If @error Then Return SetError(@error, 0, "")
    ; The bytes are read back as an ANSI string, one character per byte
    Return DllStructGetData($tText, 1)
EndFunc   ;==>_WBD_WinAPI_WideCharToMultiByte

WBD
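
If only the byte-level conversion is needed, AutoIt's built-in BinaryToString() and StringToBinary() may be a simpler route than calling kernel32 directly. Both take an encoding flag (1 = ANSI, 2 = UTF-16 LE, 3 = UTF-16 BE, 4 = UTF-8). The following is a minimal sketch, not a tested replacement for the functions above. It assumes a Unicode build of AutoIt; the helper names _AnsiBytesToUtf8Bytes and _Utf8BytesToAnsiBytes are hypothetical, and "ANSI" here means whatever codepage is active on the system.

```autoit
; Hypothetical helper: re-encode bytes from the system ANSI codepage as UTF-8.
Func _AnsiBytesToUtf8Bytes($bAnsi)
    Local $sText = BinaryToString($bAnsi, 1) ; decode ANSI bytes into a native string
    Return StringToBinary($sText, 4)         ; encode that string as UTF-8 bytes
EndFunc   ;==>_AnsiBytesToUtf8Bytes

; Hypothetical helper: re-encode UTF-8 bytes in the system ANSI codepage.
Func _Utf8BytesToAnsiBytes($bUtf8)
    Local $sText = BinaryToString($bUtf8, 4) ; decode UTF-8 bytes into a native string
    Return StringToBinary($sText, 1)         ; encode that string in the ANSI codepage
EndFunc   ;==>_Utf8BytesToAnsiBytes

; "Café" as Latin-1 bytes; the result is printed as a hex dump
ConsoleWrite(_AnsiBytesToUtf8Bytes(Binary("0x436166E9")) & @CRLF)
```

Working on binary values rather than strings keeps the encoding explicit at each step, which may matter when the same script talks to an ANSI ODBC source and to UTF-8 and UTF-16 SQLite databases.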

There is no one-to-one relationship between the two, so "converting" is not involved unless you limit to a small subset of Unicode values that happen to have ANSI analogues. For all the rest, it will be a matter of "translating" (as in Arabic to English), not "converting" (as in Decimal to Hex).

How would you convert ü to ANSI? What if there is a mix of French accents, Hebrew characters, and Greek scientific constants?

^_^

Valuater's AutoIt 1-2-3, Class... Is now in Session! For those who want somebody to write the script for them: RentACoder. "Any technology distinguishable from magic is insufficiently advanced." -- Geek's corollary to Clarke's law

Hi PsaltyDS, nice to see you on board!

I'm in no way confusing conversion with translation. Indeed, ü belongs to both Unicode (of course it does!) and the Latin-1 ANSI codepage.

We have the right to expect a conversion between some codepage (say, Latin-1) and Unicode. In this precise case, Latin-1 ü has the hex representation FC, while its Unicode code point is U+00FC, whose UTF-8 representation is C3 BC. Such a one-to-one, bidirectional conversion is obviously limited to the subset: Unicode ∩ ANSI codepage ≍ ANSI codepage (since Unicode is the full code universe).
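
The byte values quoted above can be checked with a short script. This is only a sketch, assuming a Unicode build of AutoIt: the character is built with ChrW() from its code point so that the encoding of the script file does not matter, and the ANSI line depends on the system's active codepage (FC would be expected only where that codepage maps ü to FC, as Latin-1 does).

```autoit
; Dump the bytes of U+00FC (ü) in three encodings.
Local $sChar = ChrW(0xFC)

ConsoleWrite("ANSI:      " & StringToBinary($sChar, 1) & @CRLF) ; system ANSI codepage
ConsoleWrite("UTF-16 LE: " & StringToBinary($sChar, 2) & @CRLF) ; two bytes, low byte first
ConsoleWrite("UTF-8:     " & StringToBinary($sChar, 4) & @CRLF) ; multi-byte sequence
```

Because native AutoIt strings in a Unicode build are UTF-16, the flag passed to StringToBinary() selects the output encoding rather than declaring what the string "is".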

When converting the other way round, it's generally accepted practice to substitute a placeholder for any Unicode input character that has no ANSI code in the working codepage.

BTW, I can see no use for a "mixed ANSI string" containing elements from two or more codepages. Codepaged sets are a bit like x86 segmented addressing, where the contents of SP alone don't define an unambiguous address: only SS:SP does, but once SS (the codepage) is fixed you can't access more than 64 KB (O memories!).

What drove me crazy is the fact that, as the previous post pointed out, there are untold problems in making kernel32.dll's WideCharToMultiByte and its sister MultiByteToWideChar work as they should through the WinAPI functions. As I understand it, UTF-16 "strings" returned from DllCalls are not strings but rather structures containing a hex representation of UTF-16 strings. I suppose that is a side effect of AutoIt's typelessness. But then it's impossible to pass such parameters to a third-party library function that expects genuine wstrs and is unaware of the exotic hex format.

Also I've struggled to understand the following definition of the wstr parameter type in DllCall function: "a UNICODE wide character string (converted to/from an ANSI string during the call if needed)." This hardly makes any sense to me as it is and I believe it should be ( clarified | rephrased | corrected | removed ).
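
As an illustration of what the wstr type does in practice, the sketch below passes a native AutoIt string straight to a wide-character API without building a structure. It assumes a Unicode build of AutoIt and uses lstrlenW from kernel32.dll only as a convenient function that takes a UTF-16 string; the test string is an arbitrary example.

```autoit
; Pass a native string to a wide-character (UTF-16) API via the "wstr" type.
Local $sText = "Caf" & ChrW(0xE9) ; "Café", built without relying on the script file's encoding

Local $aRet = DllCall("kernel32.dll", "int", "lstrlenW", "wstr", $sText)
If @error Then Exit MsgBox(16, "DllCall failed", "@error = " & @error)

; $aRet[0] holds the return value: the length in UTF-16 code units
ConsoleWrite("lstrlenW reports " & $aRet[0] & " UTF-16 code units" & @CRLF)
```

A structure with a wchar array is mainly needed when the called function writes into a caller-supplied buffer of a given size, as in the MultiByteToWideChar wrapper earlier in the thread.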

As a sidenote, yes I do have a French accent ... I'm a français de France!

Can you post a short runnable example of one of the conversion issues you have? This stuff goes over my head too fast for me to try to figure out that many cases. Maybe something like reading from a short GUI example and a short SQLite example and comparing the values in the way you intend?

^_^
