Help: Unicode String

svkhtn · October 5, 2006

Hi,

I want to post a unicode string to an object of a javascript page.

Manually, when I copy/paste directly to that object textbox, everything is ok.

However, when I tried to use AutoIt, it seems AutoIt screwed my unicode text.

I copied some unicode text of my language into clipboard before running the following code:

_IEFormElementSetValue($oMsgField, ClipGet())

or

$text = ClipGet()
_IEFormElementSetValue($oMsgField, $text)

Both didn't return the text correctly.

Has anyone experienced this before? Any help please!!!

Many thanks!

svkhtn · October 6, 2006

Is the STRING data type in autoit not able to handle Unicode text (2 bytes) ?

Any help, please?

Many thanks.

svkhtn · October 8, 2006

Is it because IE.au3 library couldn't handle unicode???

Any idea? Any help please?

Thanks.

Edited October 8, 2006 by svkhtn

DaleHohm · October 9, 2006

Is it because IE.au3 library couldn't handle unicode???
Any idea? Any help please?
Thanks.

A similar question has come up before and the issue needs to be investigated further to isolate where the trouble exists. Please check out this post for some ideas and suggestions on how to narrow it down.

IE.au3 does no magic in this area -- it simply sets the form element .value property to the string you pass in. I have little experience with unicode so I need you to do the characterization testing to see where the data characteristics are getting lost. The post above has some suggestions on how to approach this.

We need to be able to point to the specific interaction in the chain that is the source of the trouble.

Dale

sulfurious · October 9, 2006

Unicode text here is 16-bit (UTF-16). You can get at it by opening the file in raw mode and, if it is little-endian, stripping away the first 4 hex characters (FFFE, the byte order mark). Then you can use some hex-to-string or string-to-hex functions to do your operations. I found the most help in Scripts and Scraps, searching for Unicode.

Here is an example of opening the file, and checking for Unicode and stripping the FFFE.

$file = "path & name of file"
Local $TempFile = FileOpen($file,4) ; mode 4 = raw read
$Rread = StringReplace(String(FileRead($TempFile,FileGetSize($file))),"0x","")
FileClose($TempFile) ; release the handle once the data is read
If StringLeft($Rread,4) = "FFFE" Then ; this is Unicode (UTF-16 little-endian byte order mark)
    $Rread = StringTrimLeft($Rread,4) ; trim out this to use the text
EndIf
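For comparison, the same read-raw-and-strip-FFFE idea can be sketched in Python. This is not AutoIt; Python is used only because its encoding behaviour is easy to check. The function name `read_utf16_le_text` and the temporary sample file are illustrative assumptions, not anything from the thread.

```python
import codecs
import os
import tempfile


def read_utf16_le_text(path):
    """Read a file as raw bytes, drop a UTF-16 LE BOM if present, then decode."""
    with open(path, "rb") as handle:
        raw = handle.read()
    if raw.startswith(codecs.BOM_UTF16_LE):  # the bytes FF FE
        raw = raw[len(codecs.BOM_UTF16_LE):]
    return raw.decode("utf-16-le")


# Build a small sample file: BOM followed by UTF-16 LE text.
sample = "Ki\u1ec3m tra"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(codecs.BOM_UTF16_LE + sample.encode("utf-16-le"))
    sample_path = tmp.name

decoded = read_utf16_le_text(sample_path)
os.remove(sample_path)
```

The point of the sketch is that the extended character survives because the bytes are decoded as UTF-16 rather than treated as one byte per character.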

late,

Sul

Edit: meaning, if you use a hex2string converter, then pass the string in your script, it will not be unicode, but ascii. Which is great unless you actually have Unicode characters in the output, in which case the function needs to accept unicode.

Edited October 9, 2006 by sulfurious

svkhtn · October 9, 2006

Thanks DaleHohm and sulfurious for your suggestions.

However, first of all, I wrote the small test below to check whether my text gets screwed up when it goes through the clipboard:

1. I used Ctrl-C to copy this text "Kiểm tra Tiếng Việt" to the clipboard.

2. Then, run this code

$text = ClipGet()
ClipPut($text)

3. Used Ctrl-V to paste it to a word document, it appeared "Ki?m tra ti?ng Vi?t"

It means now the unicode text appeared as question marks "?"

Do you have any idea why it screwed up my unicode text?

Any help please?

Many thanks!!!

PS. Of course, if I just used Ctrl-C and Ctrl-V manually without running the AutoIt code above, everything appeared OK.
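One plausible explanation for the question marks, offered here as an assumption rather than a confirmed diagnosis, is that the text is converted to a single-byte (ANSI) code page somewhere between ClipGet() and ClipPut(). The Python sketch below imitates that kind of lossy conversion; the helper name `to_ansi_lossy` and the choice of code page cp1252 are illustrative assumptions.

```python
def to_ansi_lossy(text, codepage="cp1252"):
    """Encode to a single-byte code page, replacing unmappable characters with '?'."""
    return text.encode(codepage, errors="replace").decode(codepage)


original = "Ki\u1ec3m tra Ti\u1ebfng Vi\u1ec7t"  # the thread's Vietnamese test string
round_tripped = to_ansi_lossy(original)
```

Each Vietnamese character that has no slot in the code page comes back as a single "?", while the plain ASCII letters are untouched, which matches the pattern described in the post.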

sulfurious · October 9, 2006

The text you are attempting to use would be encoded as UTF-16, meaning a character such as Lower case 'r' is, in hex, 7200. In UTF-8, it would be just 72. Unicode characters for standard ASCII would be seen as appending 00 to the ASCII Chr.

You are using extended characters, which might look something like this 726e. What is happening is that your unicode is being converted to ASCII. Basically, you need a function to handle all 4 hex values. Right now, I believe, your function is seeing 726e as 72,6e, and outputting r?. Remember in ASCII, a ? or a little square box will be a representative for any unknown characters. FF is the max hex value for a character, so anything after that usually gets a ? put in its place.
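The byte layouts described above can be checked with a short Python sketch (Python rather than AutoIt, purely for illustration). The helper name `hex_bytes` is an assumption; the character U+1EC3 is the "ể" from the test string in this thread.

```python
def hex_bytes(text, encoding):
    """Return the encoded bytes of text as a lowercase hex string."""
    return text.encode(encoding).hex()


ascii_utf16 = hex_bytes("r", "utf-16-le")          # ASCII letter, two bytes
ascii_utf8 = hex_bytes("r", "utf-8")               # ASCII letter, one byte
extended_utf16 = hex_bytes("\u1ec3", "utf-16-le")  # extended character, two bytes
extended_utf8 = hex_bytes("\u1ec3", "utf-8")       # extended character, three bytes
```

For the ASCII letter the second UTF-16 byte is 00, as the post says. For the extended character both bytes carry data, so any routine that reads one byte per character will misread it.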

late,

Sul

Edit: maybe you could try bringing in the clipboard value or whatever it is as binary. Remember too that whatever you output to must be encoded UTF-16, or you will lose the extended characters and have ?'s.

Edited October 9, 2006 by sulfurious

svkhtn · October 10, 2006

Thanks sulfurious.

I understood your idea, but not fully understood your suggestion.

When I used Ctrl-C to copy that unicode text (i.e. "Kiểm tra Tiếng Việt") to the clipboard, the clipboard contains that unicode text. However, when I tried to use ClipGet() which returns a string, I think this string is not in unicode anymore.

I am wondering how I could handle this (i.e. get the string out of the clipboard as binary as you read a unicode file in RAW format) ???

Any help please?

Thanks,

SV

sulfurious · October 10, 2006

Erm, binary, meaning raw or hex, not actual binary. Sorry.

If it is UTF-16, you need to read file raw, strip FFFE, and manipulate with hex to string or string to hex functions.

If it is UTF-8, you are out of luck. Heh. You need that encoding for extended characters. Try to bring the file in raw.

Sul

Edit: er, on second thought, just try to read the file as raw. Unless you plan on working in all hex for any string manipulation, any string functions will be useless for what you are doing. You also need the output location to be capable of UTF-16 encoding. Some of the controls, I believe, will allow that. I think the status bar is one of them that you can pass unicode to. Message boxes are not capable of unicode, however.

Edited October 10, 2006 by sulfurious

svkhtn · October 10, 2006

Thanks sulfurious.

What happens if I want to read UTF-8 text from an object of a website?

How can I read it in RAW mode?

For example, as I mentioned in my first post, I have an object $oMsgField that I want to get the unicode text from and assign another unicode text to it. How can I read the text content of that object in RAW mode in order to manipulate the text later on?

In other words, if the unicode text is in a file, I can read it in raw mode with the option FileOpen(...,4). How can we set the read mode as RAW if we read it from an object of a webpage?

Thanks!

sulfurious · October 10, 2006

Ok. Unicode is 16bit, so you have extra 00 on ascii characters, or up to FF for extended. Reading in the entire unicode value, and stripping the 0x, will result with the first 4 characters being:

FFFE for little endian, meaning 22,00,34,00 etc

FEFF for big endian, meaning 00,22,00,34 etc

You normally want to work with little and not with big.
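As a rough illustration of the byte order mark check, here is a Python sketch (not AutoIt) that looks at the first two bytes of the data. The function name `detect_utf16_bom` is hypothetical; `codecs.BOM_UTF16_LE` and `codecs.BOM_UTF16_BE` are the standard-library constants for the bytes FF FE and FE FF.

```python
import codecs


def detect_utf16_bom(data):
    """Return the UTF-16 byte order implied by a leading BOM, or None."""
    if data.startswith(codecs.BOM_UTF16_LE):  # FF FE
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):  # FE FF
        return "utf-16-be"
    return None


little = codecs.BOM_UTF16_LE + "A".encode("utf-16-le")
big = codecs.BOM_UTF16_BE + "A".encode("utf-16-be")
plain = b"A"
```

Note that a UTF-32 little-endian file also begins with FF FE, so this check is only a first guess when the source of the data is unknown.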

If you read a value into a variable the normal way (not raw) and pass it to something like a message box, which only handles 8-bit text, it will see the first part of the unicode, return some non-ASCII characters, and stop. So passing the unicode data unchanged to any non-unicode file or control means the extended characters will not display correctly, if anything displays at all.

Here is an example of how to check a file for UTF type.

#include <file.au3>
#include <array.au3>

$file = " path to .reg file"
$codeVAL = _chkRegFile($file)
If $codeVAL = 105 Then
    Exit
ElseIf $codeVAL = 110 Then
    ;you have work to do, export as 9x format .reg file
    Exit
ElseIf $codeVAL = 115 Then
    ;sorry, no Big Endians
    Exit
EndIf
$fo = FileRead($file,0)
$ar = StringSplit($fo,"[")
For $x = 1 To UBound($ar) - 1
    If StringInStr($ar[$x],"search string") Then $ar[$x] = StringReplace($ar[$x],"search string","replacement")
    ; make your filewriteline code here ie. FileWriteLine($handle,"[" & $ar[$x])
Next

Func _chkRegFile($checkF) ; check registry file for UTF type
    Local $tmpCHECK = FileOpen($checkF,4)
    If $tmpCHECK = -1 Then ; unable to open reg file
        Return 105
    EndIf
    Local $chkFILE = StringReplace(String(FileRead($tmpCHECK,FileGetSize($checkF))),"0x","")
    FileClose($tmpCHECK) ; close the handle before any of the returns below
    If StringMid($chkFILE,1,4) = "FFFE" Then ; little-endian byte order mark
        MsgBox(4096,"Error","This registry file appears to be UTF-16 (Unicode)")
        Return 110
    ElseIf StringMid($chkFILE,1,4) = "FEFF" Then ; big-endian byte order mark
        MsgBox(4096,"Error","This registry file is Big Endian Unicode.")
        Return 115
    EndIf
    Return 0
EndFunc  ;--->>> _chkRegFile

[The rest of this code box, which appears to have held two hex/string conversion helper functions, was garbled in the forum archive and cannot be recovered.]

I think what you are doing with the value determines everything. If you are just reading it, and writing it, then read it as raw, and filewrite. If you are manipulating it, you can try to convert to string, but you will not be able to manipulate any extended characters. If you want to manipulate extended characters, do not convert to string, but find the hex values of the extended characters and manipulate in hex. It is more work, but you ensure you keep the UTF-16 encoding.

Hope that helps,

Sul

Edit: whenever you do hex16 conversions, you may lose the extended characters. In my routines, I don't manipulate the extended characters, but I do watch for them so that I don't strip them out. Pretty easy to do. If you step 2, every other value is normally 00 for ascii. If not 00, then it is extended, and you should not convert it. The conversion normally messes up the extended. I use a lot of hex values in this situation.
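The "step 2 and look at every other value" check can be sketched in Python (not AutoIt) as follows. The function name `extended_positions` is hypothetical, and the sketch assumes little-endian UTF-16 data with the byte order mark already removed.

```python
def extended_positions(utf16_le_bytes):
    """Return the indexes of 16-bit code units whose high byte is not 00."""
    positions = []
    # Walk the data two bytes at a time; in little-endian order the
    # high byte of each code unit is the second byte of the pair.
    for offset in range(0, len(utf16_le_bytes) - 1, 2):
        high_byte = utf16_le_bytes[offset + 1]
        if high_byte != 0x00:
            positions.append(offset // 2)
    return positions


data = "Ki\u1ec3m".encode("utf-16-le")
found = extended_positions(data)
```

A high byte of 00 covers U+0000 through U+00FF, so accented Latin-1 letters also pass this test even though they are not ASCII. The Vietnamese characters in this thread all have a non-zero high byte and are flagged.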

Edited October 10, 2006 by sulfurious

svkhtn · October 10, 2006

Thanks a lot for your detailed instructions.

Basically, I got your ideas on reading the unicode from a FILE in RAW mode, dealing with it, and writing it back to another RAW file with hex values.

However, as I asked in the previous post, for example, I want to read a unicode text from an object of a webpage (or simpler it is the address bar of IE or status bar where we can have unicode text), then do some comparisons and assign it back to the place it comes from.

I mean I am NOT working with a file; I am working with an object of a website. That object contains innerText as unicode.

For example, $object.innerText contains this text "Kiểm tra Tiếng Việt.". This text is unicode text.

If you want to manipulate extended characters, do not convert to string, but find the hex values of the extended characters and manipulate in hex. It is more work, but you ensure you keep the UTF-16 encoding.

As you mentioned (in the quote above), if I assign it to a string, I will lose extended characters (i.e. it will become question marks).

So, how can I get the hex value or the binary value of $object.innerText ???

Is there any way to do that?

Thanks...!

svkhtn · October 10, 2006

OK. This is what I just tested.

1. I put this unicode text "Kiểm tra Tiếng Việt." into a file, named Unicode.txt.

Used the code below to run and get result:

#include <string.au3>

$strFile1 = "Unicode.txt"
$file1 = FileOpen($strFile1,4)

$text = FileRead($file1,FileGetSize($strFile1))
$text = StringReplace(String($text),"0x","")

If (StringLeft($text,4) = "FFFE") OR (StringLeft($text,4) = "FEFF") Then ; this is Unicode (FFFE = little endian, FEFF = big endian)
    $text = StringTrimLeft($text,4) ; trim out this to use the text
EndIf

ClipPut($text)
FileClose($file1)

[The rest of this post, which appears to have shown the output of the code above and a second test that used ClipGet() and ClipPut(), was garbled in the forum archive and cannot be recovered.]

Here is the result: KiÃm tra ti¿ng ViÇt.

So, from the original unicode text in the file: "Kiểm tra tiếng Việt."

Now it returned: "KiÃm tra ti¿ng ViÇt."

All the extended characters appeared wrongly ????
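The specific wrong characters are consistent with one possible cause: each 16-bit character being cut down to its low byte and that byte then being shown as an 8-bit (Latin-1) character. This is an inference from the output, not something confirmed in the thread. The Python sketch below (not AutoIt; the function name `keep_low_bytes` is hypothetical) reproduces the reported result under that assumption.

```python
def keep_low_bytes(text):
    """Keep only the low byte of each UTF-16 code unit and show it as Latin-1."""
    data = text.encode("utf-16-le")
    low_bytes = data[0::2]  # little-endian: the low byte comes first in each pair
    return low_bytes.decode("latin-1")


original = "Ki\u1ec3m tra ti\u1ebfng Vi\u1ec7t."
mangled = keep_low_bytes(original)
```

For example, "ể" is U+1EC3; dropping the high byte 1E leaves C3, which Latin-1 displays as "Ã". The same truncation turns U+1EBF into "¿" and U+1EC7 into "Ç".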

Anybody help ??? Thanks!

Edited October 10, 2006 by svkhtn

SmOke_N · October 10, 2006

The system implicitly converts data between certain clipboard formats: if a window requests data in a format that is not on the clipboard, the system converts an available format to the requested format. The system can convert data as indicated in the following table.

http://msdn.microsoft.com/library/default....oardformats.asp

Edited October 10, 2006 by SmOke_N

sulfurious · October 10, 2006

From my tests, you have to bring in the data as hexadecimal values. That is the only way to keep the extended characters. There is no function in AutoIt, that I know of, that will convert to UTF-16. _StringToHex() will give you hex, but only up to 254 chr.

You can filewrite hex using FileWrite($fileHandle,BinaryString("0x" & "$variable")), but this is only good for Ascii. You need the hex values to write the hex values, thus the RAW mode.
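To show what a UTF-16-aware hex conversion would need to do, here is a Python sketch (not AutoIt, and not a claim about any AutoIt function). The names `text_to_hex16` and `hex16_to_text` are hypothetical; the only assumption is little-endian UTF-16 with no byte order mark.

```python
def text_to_hex16(text):
    """Encode text as UTF-16 LE and return the bytes as a hex string."""
    return text.encode("utf-16-le").hex()


def hex16_to_text(hex_string):
    """Turn a hex string of UTF-16 LE bytes back into text."""
    return bytes.fromhex(hex_string).decode("utf-16-le")


as_hex = text_to_hex16("Ki\u1ec3m")
back = hex16_to_text(as_hex)
```

Because every character is converted as a full 16-bit unit, the extended character round-trips intact instead of being limited to values up to FF.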

Sorry.

Sul

svkhtn · October 10, 2006

http://msdn.microsoft.com/library/default....oardformats.asp

I don't think it is because of the implicit conversion of the clipboard. If I used Ctrl-C and Ctrl-V to copy that unicode text from one place to paste to another place, everything appears correctly.

The main problem is after I used ClipGet(), the return string was screwed up (i.e. string in AutoIt does not keep the Unicode encoding anymore).

When reading from file, it is easy because we can set the file reading to RAW mode and read everything in binary. However, if I tried to get the unicode text from an object of a webpage, it is like I read unicode text from a file with NORMAL mode. In this case, the text was screwed.

Anyway, thanks SmOke_N and sulfurious for your help!!!

SmOke_N · October 10, 2006

I don't think it is because of the implicit conversion of the clipboard. If I used Ctrl-C and Ctrl-V to copy that unicode text from one place to paste to another place, everything appears correctly.
The main problem is after I used ClipGet(), the return string was screwed up (i.e. string in AutoIt does not keep the Unicode encoding anymore).
When reading from file, it is easy because we can set the file reading to RAW mode and read everything in binary. However, if I tried to get the unicode text from an object of a webpage, it is like I read unicode text from a file with NORMAL mode. In this case, the text was screwed.
Anyway, thanks SmOke_N and sulfurious for your help!!!

It is because of the conversion, I don't think that ClipPut() does Unicode.

sulfurious · October 10, 2006

You are right SmOke_N. No extended characters here.

late,

Sul

Edited October 10, 2006 by sulfurious

svkhtn · October 10, 2006

It is because of the conversion, I don't think that ClipPut() does Unicode.

I am wondering if it is because the string argument of ClipPut($string) doesn't hold the unicode text (i.e. the argument was messed up before the ClipPut() does anything) OR the function ClipPut() of AutoIt itself couldn't handle unicode ???

I am asking this because if we do it manually (i.e. use Ctrl-C to copy to the clipboard), there is no conversion.

Thanks again SmOke_N and sulfurious !!!

sulfurious · October 10, 2006

I would bet it is because AutoIt does not handle unicode. If it did, then other string controls, such as a msgbox would be able to display it. You can put unicode in clip, and paste it in a unicode text file just fine. But there is no way to retrieve it from the clip as hex. If the clip acted like a file, you could do it.

Sul
