DeltaRocked

[Solved] GB2312 to UTF8 - Character encoding


DeltaRocked

Hello,

I want to convert text from the GB2312 (Chinese) character encoding to UTF-8. I describe the problem below; please let me know where I am going wrong in my understanding of character encoding (source code included).

The Subject headers of the emails contain the following:
Eg1:
=?GB2312?B?MzE2Njg3OTU4o6zA67K7v6q1xM/6Y8rb?=
Eg2:
=?utf-8?B?56ym54+C6bmm?=

where the format is:
=?CharSet?B/Q?Encoded_string?=
(B means the string is Base64-encoded, Q means it is quoted-printable; both examples above use B.)
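
The layout described above can be taken apart by hand. The following is a minimal sketch in Python (used here only to illustrate the byte-level steps, not as an AutoIt solution). The helper name `decode_encoded_word` is hypothetical, and the sketch assumes a single, well-formed encoded word.

```python
import base64
import quopri
import re

# =?charset?encoding?encoded_text?=  (encoding is B for Base64, Q for quoted-printable)
ENCODED_WORD = re.compile(r"=\?([^?]+)\?([BbQq])\?([^?]*)\?=")

def decode_encoded_word(word):
    """Return (charset, raw_bytes) for one encoded word; hypothetical helper."""
    match = ENCODED_WORD.fullmatch(word.strip())
    if match is None:
        raise ValueError("not an encoded word: %r" % word)
    charset, encoding, text = match.groups()
    if encoding.upper() == "B":
        raw = base64.b64decode(text)
    else:
        # header=True also turns the underscores of the Q encoding into spaces.
        raw = quopri.decodestring(text.encode("ascii"), header=True)
    return charset.lower(), raw

for subject in ("=?GB2312?B?MzE2Njg3OTU4o6zA67K7v6q1xM/6Y8rb?=",
                "=?utf-8?B?56ym54+C6bmm?="):
    charset, raw = decode_encoded_word(subject)
    # Decode with the charset named in the header, then re-encode as UTF-8.
    utf8_bytes = raw.decode(charset).encode("utf-8")
    print(charset, utf8_bytes)
```

The important point is that the charset comes from the header itself, so each subject is decoded with its own encoding before everything is brought to UTF-8.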

Displaying UTF-8 or GB2312 *individually*, in separate emails, is not a problem. However, when I want to display both character encodings together, only one of them is displayed correctly.

Displaying a single encoding can be achieved by defining the charset in the email message body:
 

Content-Type: text/html;
charset="UTF-8"
OR
Content-Type: text/html;
charset="GB2312"


If all goes well, you can view the Chinese characters.
 

Base64_Decode("MzE2Njg3OTU4o6zA67K7v6q1xM/6Y8rb")

output:
MzE2Njg3OTU4o6zA67K7v6q1xM/6Y8rb = 316687958£¬Àë²»¿ªµÄÏúcÊÛ
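
The garbled output above is what the GB2312 bytes look like when each byte is displayed as a single Latin-1 / Windows-1252 character. Below is a minimal Python sketch of that effect and of undoing it. It assumes the garbled text came from a byte-for-byte Latin-1 interpretation, and the variable names are illustrative.

```python
import base64

raw = base64.b64decode("MzE2Njg3OTU4o6zA67K7v6q1xM/6Y8rb")

# Interpreting each byte as one Latin-1 character reproduces the garbled text.
garbled = raw.decode("latin-1")

# Undo it: turn the characters back into the original bytes, decode those
# bytes as GB2312, and only then encode the result as UTF-8.
chinese = garbled.encode("latin-1").decode("gb2312")
utf8_bytes = chinese.encode("utf-8")

print(raw)         # the GB2312 bytes
print(utf8_bytes)  # the same text as UTF-8 bytes
print(len(raw), "GB2312 bytes ->", len(utf8_bytes), "UTF-8 bytes")
```

Converting the garbled characters straight to UTF-8, without first decoding the underlying bytes as GB2312, only yields the same garbled characters in UTF-8.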

Save the mentioned .7z, extract the .eml file, and open it in your favorite email client.

The code I am using is as follows. When its output is placed in the .eml file, no Chinese characters are displayed.
 

#include <WinAPI.au3>
$hFile=FileOpen('utf_t.txt',256+2)
$sText = '316687958£¬Àë²»¿ªµÄÏúcÊÛ'
FileWrite($hFile,_ConvertAnsiToUtf8($sText))
FileClose($hFile)
Func _ConvertAnsiToUtf8($sText)
Local $tUnicode = _WinAPI_MultiByteToWideChar($sText)
If @error Then Return SetError(@error, 0, "")
Local $sUtf8 =_WinAPI_WideCharToMultiByte(DllStructGetPtr($tUnicode), 65001)
If @error Then Return SetError(@error, 0, "")
Return SetError(0, 0, $sUtf8)
EndFunc ;==>_ConvertAnsiToUtf8

Thanks in advance.

Regards
Del.

[UPDATE]
After searching, I found this:
http://hi.baidu.com/qianyiyidu/item/579ee4a1f6ca1b3e030a4df0

 

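// Convert GB2312 to UTF-8. The caller must delete[] the buffer returned in utf8.
#include <windows.h>
static int GB2312ToUtf8(const char* gb2312, char*& utf8)
{
// 936 is the Windows code page for GB2312; CP_ACP only works if the system ANSI code page is already 936.
int len = MultiByteToWideChar(936, 0, gb2312, -1, NULL, 0);
if (len == 0) return 0;
// With an input length of -1, len already counts the terminating NUL.
wchar_t* wstr = new wchar_t[len];
MultiByteToWideChar(936, 0, gb2312, -1, wstr, len);
len = WideCharToMultiByte(CP_UTF8, 0, wstr, -1, NULL, 0, NULL, NULL);
// utf8 is passed by reference so the caller actually receives the new buffer.
utf8 = new char[len];
WideCharToMultiByte(CP_UTF8, 0, wstr, -1, utf8, len, NULL, NULL);
delete[] wstr;
return len; // number of UTF-8 bytes, including the terminating NUL
}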
Edited by DeltaRocked

DeltaRocked

Hello,

I have partially solved this issue using the various resources available in the AutoIt Forums themselves. A big thanks to AZJIO for making encoding.au3 available in one of the posts.

_EncodingToUnicode_API() was taken from encoding.au3, which is available in the post over here.

Another version by Arilvv can also be found over here.

Note to self: to get the conversion right, one needs to know the *correct* code page identifier, which is available here:

http://msdn.microsoft.com/en-us/library/windows/desktop/dd317756%28v=vs.85%29.aspx
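
As a small illustration of the note above, charset names from an email header can be mapped to Windows code page identifiers with a lookup table. The sketch below is in Python and lists only a handful of identifiers from Microsoft's table. Both the table and the helper name `code_page_for` are illustrative, not a complete mapping.

```python
# A few MIME charset names and their Windows code page identifiers.
CODE_PAGES = {
    "gb2312": 936,         # Simplified Chinese
    "gbk": 936,
    "big5": 950,           # Traditional Chinese
    "shift_jis": 932,      # Japanese
    "windows-1252": 1252,  # Western European
    "utf-8": 65001,
}

def code_page_for(charset):
    """Look up the code page for a charset name, ignoring case; hypothetical helper."""
    try:
        return CODE_PAGES[charset.strip().lower()]
    except KeyError:
        raise ValueError("unknown charset: %r" % charset)

print(code_page_for("GB2312"))
```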

Image of the conversion: http://imm.io/15phW

; No #include is needed here: the script below uses only built-in functions.
$hFile = FileOpen("GB2312ToUnicode.txt", 256 + 2) ; 256 = UTF-8 without BOM, 2 = overwrite
$sCodePage_Identifier = 936 ; GB2312
; refer to http://msdn.microsoft.com/en-us/library/windows/desktop/dd317756%28v=vs.85%29.aspx
; for more information on the codepage identifier.
; Base64 of the below mentioned string : MzE2Njg3OTU4o6zA67K7v6q1xM/6Y8rb
$sString = '316687958£¬Àë²»¿ªµÄÏúcÊÛ'
; Pass the variable that was actually defined above, and close the handle that was opened.
FileWrite($hFile, _EncodingToUnicode_API($sString, $sCodePage_Identifier))
FileClose($hFile)

Func _EncodingToUnicode_API($sString,$sCodePage_Identifier)
Local $BufferSize = StringLen($sString) * 2
Local $Buffer = DllStructCreate("byte[" & $BufferSize & "]")

Local $Return = DllCall("Kernel32.dll", "int", "MultiByteToWideChar", _
"int", $sCodePage_Identifier, _
"int", 0, _
"str", $sString, _
"int", StringLen($sString), _
"ptr", DllStructGetPtr($Buffer), _
"int", $BufferSize)

Local $UnicodeBinary = DllStructGetData($Buffer, 1)
Local $UnicodeHex1 = StringReplace($UnicodeBinary, "0x", "")
Local $StrLen = StringLen($UnicodeHex1)
Local $UnicodeString, $UnicodeHex2, $UnicodeHex3

For $i = 1 To $StrLen Step 4
$UnicodeHex2 = StringMid($UnicodeHex1, $i, 4)
$UnicodeHex3 = StringMid($UnicodeHex2, 3, 2) & StringMid($UnicodeHex2, 1, 2)
$UnicodeString &= ChrW(Dec($UnicodeHex3))
Next
$Buffer = 0
Return $UnicodeString
EndFunc

TODO:

MIME-decode the Subject header, correctly identify the character encoding, and complete the conversion.
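
The TODO above could be sketched with Python's standard `email.header` module, which splits a Subject into its encoded words and reports the charset of each one. This is an illustrative Python sketch, not an AutoIt solution. `subject_to_utf8` is a hypothetical helper, and treating parts without a declared charset as ASCII is an assumption.

```python
from email.header import decode_header

def subject_to_utf8(subject):
    """Decode a MIME Subject header and return it as UTF-8 bytes; hypothetical helper."""
    parts = []
    for value, charset in decode_header(subject):
        if isinstance(value, bytes):
            # Encoded words carry their own charset; other parts are assumed to be ASCII.
            parts.append(value.decode(charset or "ascii", errors="replace"))
        else:
            parts.append(value)
    return "".join(parts).encode("utf-8")

# Both sample headers from this thread end up in one encoding.
for header in ("=?GB2312?B?MzE2Njg3OTU4o6zA67K7v6q1xM/6Y8rb?=",
               "=?utf-8?B?56ym54+C6bmm?="):
    print(subject_to_utf8(header))
```

Once every subject is UTF-8, a single `charset="UTF-8"` declaration is enough to display GB2312 and UTF-8 subjects together.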

Edited by DeltaRocked

leuce

Hello

I also want to convert GB2312 to UTF8 and I would like to try the script that is mentioned in the second post. However, when I try to run it, AutoIt says: "Cannot parse #include". Any idea what might be wrong?

Thanks

Samuel

