Help: Unicode String


Hi,

I want to post a Unicode string to a form field on a JavaScript-driven page.

Manually, when I copy and paste directly into that textbox, everything is OK.

However, when I try to do the same with AutoIt, it seems AutoIt mangles my Unicode text.

I copied some Unicode text in my language to the clipboard before running the following code:

_IEFormElementSetValue($oMsgField, ClipGet())

or

$text = ClipGet()
_IEFormElementSetValue($oMsgField, $text)

Neither version set the text correctly.

Has anyone experienced this before? Any help, please!

Many thanks!


Is it because the IE.au3 library can't handle Unicode?

Any idea? Any help please?

Thanks.

A similar question has come up before and the issue needs to be investigated further to isolate where the trouble exists. Please check out this post for some ideas and suggestions on how to narrow it down.

IE.au3 does no magic in this area -- it simply sets the form element's .value property to the string you pass in. I have little experience with Unicode, so I need you to do the characterization testing to see where the data characteristics are getting lost. The post above has some suggestions on how to approach this.

We need to be able to point to the specific interaction in the chain that is the source of the trouble.

Dale

"Doesn't work" needs to be ripped out of the troubleshooting lexicon. It means that what you tried did not produce the results you expected. It raises the questions: 1) what did you try? 2) what did you expect? and 3) what happened instead?

Reproducer: a small (the smallest?) piece of stand-alone code that demonstrates your trouble


Unicode here is 16-bit (UTF-16). You can get at it by opening the file in raw mode and, if it is little endian, stripping away the first 4 hex characters, FFFE (the byte order mark). Then you can use hex-to-string or string-to-hex functions to do your operations. I found the most help in Scripts and Scraps, searching for Unicode.

Here is an example of opening the file, checking for Unicode, and stripping the FFFE.

$file = "path & name of file"
Local $TempFile = FileOpen($file, 4) ; mode 4 = raw read
$Rread = StringReplace(String(FileRead($TempFile, FileGetSize($file))), "0x", "")
If StringLeft($Rread, 4) = "FFFE" Then ; little-endian byte order mark: this is Unicode
    $Rread = StringTrimLeft($Rread, 4) ; trim the mark off to get at the text
EndIf
FileClose($TempFile) ; release the file handle

late,

Sul

Edit: meaning, if you use a hex-to-string converter and then pass the string along in your script, it will no longer be Unicode but ASCII. That is fine unless you actually have Unicode characters in the output, in which case the receiving function needs to accept Unicode.

Edited by sulfurious

Thanks DaleHohm and sulfurious for your suggestions.

However, first of all, I wrote the small test below to check whether the text gets mangled on its way through the clipboard:

1. I used Ctrl-C to copy this text "Kiểm tra Tiếng Việt" to the clipboard.

2. Then I ran this code:

$text = ClipGet()
ClipPut($text)

3. I used Ctrl-V to paste it into a Word document, and it appeared as "Ki?m tra ti?ng Vi?t".

So the Unicode characters now appear as question marks ("?").

Do you have any idea why it mangled my Unicode text?

Any help please?

Many thanks!!!

PS. Of course, if I just used Ctrl-C and Ctrl-V manually without running the AutoIt code above, everything appeared OK.
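The question marks in step 3 are the usual symptom of text being forced through a single-byte ANSI code page that lacks the characters. The sketch below reproduces that symptom. It is Python rather than AutoIt, and it assumes the code page is Windows-1252 (cp1252); the thread does not say which code page was actually active.

```python
# Sketch: force Unicode text through a single-byte code page.
# Assumption: the ANSI code page is Windows-1252; the real one may differ.
# The escapes spell the sample text "Kiem tra Tieng Viet" with its diacritics.
original = "Ki\u1ec3m tra Ti\u1ebfng Vi\u1ec7t"

# errors="replace" substitutes "?" for every character the code page lacks.
ansi_bytes = original.encode("cp1252", errors="replace")
round_tripped = ansi_bytes.decode("cp1252")

print(round_tripped)
```

Only the three characters outside the code page are replaced; the plain ASCII letters survive, which matches what was pasted into Word.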


The text you are attempting to use would be encoded as UTF-16, meaning a character such as lowercase 'r' is, in hex, 7200 (little endian). In UTF-8, it would be just 72. For standard ASCII characters, the UTF-16 form is simply the ASCII value with 00 appended.

You are using extended characters, which might look something like 726e. What is happening is that your Unicode is being converted to a single-byte encoding. Basically, you need a function that handles all 4 hex digits of each character. Right now, I believe, your function is seeing 726e as 72,6e and outputting r?. Remember that in ASCII, a ? or a little square box stands in for any unknown character. FF is the maximum value of a single byte, so anything beyond that usually gets a ? put in its place.

late,

Sul

Edit: maybe you could try bringing in the clipboard value, or whatever it is, as binary. Remember too that whatever you output to must be encoded as UTF-16, or you will lose the extended characters and get ?'s.
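To make the byte layouts concrete, here is a small sketch that prints the hex form of an ASCII letter and of one of the extended letters from the sample text. It is Python rather than AutoIt, and the helper name hex_forms is made up for this illustration.

```python
def hex_forms(ch: str) -> tuple:
    """Return (UTF-16LE hex, UTF-8 hex) for one character."""
    return ch.encode("utf-16-le").hex(), ch.encode("utf-8").hex()

for ch in ("r", "\u1ec3"):  # "\u1ec3" is the letter ể from the sample text
    utf16, utf8 = hex_forms(ch)
    print(f"U+{ord(ch):04X}  UTF-16LE={utf16}  UTF-8={utf8}")
```

The ASCII letter is 72 followed by 00 in UTF-16LE, while the extended letter has a non-zero second byte, and in UTF-8 it takes three bytes rather than one.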

Edited by sulfurious

Thanks sulfurious.

I understand your idea, but I don't fully understand your suggestion.

When I used Ctrl-C to copy that Unicode text (i.e. "Kiểm tra Tiếng Việt") to the clipboard, the clipboard contained that Unicode text. However, ClipGet() returns a string, and I think that string is no longer Unicode.

I am wondering how I could handle this, i.e. get the string out of the clipboard as binary, the way you read a Unicode file in RAW format?

Any help please?

Thanks,

SV


Erm, binary, meaning raw or hex, not actual binary. Sorry.

If it is UTF-16, you need to read the file raw, strip FFFE, and manipulate it with hex-to-string or string-to-hex functions.

If it is UTF-8, you are out of luck. Heh. You need that encoding for extended characters. Try to bring the file in raw.

Sul

Edit: er, on second thought, just try to read the file as raw. Unless you plan on working entirely in hex for any string manipulation, the string functions will be useless for what you are doing. You also need the output location to be capable of UTF-16 encoding. Some of the controls, I believe, will allow that. I think the status bar is one that you can pass Unicode to. MsgBoxes are not capable of Unicode, however.

Edited by sulfurious

Thanks sulfurious.

What happens if I want to read UTF-8 text from an object of a website?

How can I read it in RAW mode?

For example, as I mentioned in my first post, I have an object $oMsgField that I want to get the unicode text from and assign another unicode text to it. How can I read the text content of that object in RAW mode in order to manipulate the text later on?

In other words, if the unicode text is in a file, I can read it in raw mode with the option FileOpen(...,4). How can we set the read mode as RAW if we read it from an object of a webpage?

Thanks!


OK. Unicode here is 16-bit, so ASCII characters carry an extra 00 byte, and extended characters have something other than 00 there, up to FF. Reading in the entire Unicode value and stripping the 0x leaves the first 4 characters as:

FFFE for little endian, meaning 22,00,34,00 etc.

FEFF for big endian, meaning 00,22,00,34 etc.

You normally want to work with little endian, not big.
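The byte order mark check described above can be sketched as follows. This is Python rather than AutoIt, and detect_bom is a hypothetical helper name; the BOM constants come from Python's standard codecs module. A UTF-32 little-endian mark also begins with FF FE, and this sketch deliberately ignores that case.

```python
import codecs

def detect_bom(data: bytes) -> str:
    """Return a label for the byte order mark at the start of data."""
    if data.startswith(codecs.BOM_UTF8):      # EF BB BF
        return "utf-8-sig"
    if data.startswith(codecs.BOM_UTF16_LE):  # FF FE
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):  # FE FF
        return "utf-16-be"
    return "none"

# The characters 0x22 and 0x34 from the post, in both byte orders.
print(detect_bom(bytes.fromhex("FFFE22003400")))
print(detect_bom(bytes.fromhex("FEFF00220034")))
```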

If you read a value into a variable, not raw, then something like a message box, which is not Unicode-capable, will see the first part of the Unicode data, return some non-ASCII characters, and stop. So leaving the Unicode data as it is means that any non-Unicode file or control will not display extended characters correctly, if it displays anything at all.

Here is an example of how to check a file for UTF type.

#include <file.au3>
#include <array.au3>

$file = " path to .reg file"
$codeVAL = _chkRegFile($file)
If $codeVAL = 105 Then
    Exit
ElseIf $codeVAL = 110 Then
    ;you have work to do, export as 9x format .reg file
    Exit
ElseIf $codeVAL = 115 Then
    ;sorry, no Big Endians
    Exit
EndIf
$fo = FileRead($file,0)
$ar = StringSplit($fo,"[")
For $x = 1 To UBound($ar) - 1
    If StringInStr($ar[$x],"search string") Then $ar[$x] = StringReplace($ar[$x],"search string","replacement")
    ; make your filewriteline code here ie. FileWriteLine($handle,"[" & $ar[$x])
Next

Func _chkRegFile($checkF) ; check registry file for UTF type
    Local $tmpCHECK = FileOpen($checkF,4)
    If $tmpCHECK = -1 Then ; unable to open reg file
        Return 105
    EndIf
    Local $chkFILE = StringReplace(String(FileRead($tmpCHECK,FileGetSize($checkF))),"0x","")
    If StringMid($chkFILE,1,4) = "FFFE" Then
            MsgBox(4096,"Error","This registry file appears to be UTF-16 (Unicode)")
            Return 110
    ElseIf StringMid($chkFILE,1,4) = "FEFF" Then
            MsgBox(4096,"Error","This registry file is Big Endian Unicode.")
            Return 115
    EndIf
    FileClose($tmpCHECK)
    Return 0
EndFunc  ;--->>> _chkRegFile

[Text garbled in the original post]

Func _Hex16toString($hByte)
    Local $i, $hVal, $sResult
    Local $iLEN = StringLen($hByte)
    If Mod(StringLen($hByte), 4) <> 0 Then
        SetError(1)
        Return -1
    EndIf
    For $i = 1 To $iLEN Step 8
        $hVal = Chr(Dec(StringMid($hByte, $i, 2))) & Chr(Dec(StringMid($hByte, $i + 4, 2)))
        $sResult &= $hVal
    Next
    Return $sResult
EndFunc  ;--->>> _Hex16toString

Func _StringToHex16($strChar, $Comm = False)
    Local $i, $hSt, $hStr
    $iLEN = StringLen($strChar)
    For $i = 1 To $iLEN
        If $Comm = True Then
            $hSt = Hex(BinaryString(StringMid($strChar, $i, 1)), 2) & ",00,"
        Else
            $hSt = Hex(BinaryString(StringMid($strChar, $i, 1)), 2) & "00"
        EndIf
        $hStr &= $hSt
    Next
    Return $hStr
EndFunc  ;--->>> _StringToHex16

I think what you are doing with the value determines everything. If you are just reading it and writing it, then read it as raw and FileWrite it. If you are manipulating it, you can try to convert it to a string, but you will not be able to manipulate any extended characters. If you want to manipulate extended characters, do not convert to string, but find the hex values of the extended characters and manipulate them in hex. It is more work, but you ensure you keep the UTF-16 encoding.

Hope that helps,

Sul

Edit: whenever you do hex16 conversions, you may lose the extended characters. In my routines, I don't manipulate the extended characters, but I do watch for them so that I don't strip them out. Pretty easy to do. If you step by 2 through the hex values, every other value is normally 00 for ASCII. If it is not 00, then the character is extended, and you should not convert it. The conversion normally messes up the extended characters. I use a lot of hex values in this situation.
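The "watch for a non-00 second byte" check can be sketched like this. It is Python rather than AutoIt, extended_units is a hypothetical helper name, and it assumes the hex text is UTF-16 little endian with the byte order mark already stripped.

```python
def extended_units(hex_text: str) -> list:
    """Return the 4-digit UTF-16LE code units whose high byte is not 00."""
    units = [hex_text[i:i + 4] for i in range(0, len(hex_text), 4)]
    # In little-endian order the high byte is the second pair of digits.
    return [u for u in units if u[2:4].upper() != "00"]

# "Kiểm" as UTF-16LE hex; only the third character is extended.
sample = "Ki\u1ec3m".encode("utf-16-le").hex().upper()
print(sample)
print(extended_units(sample))
```

Anything this returns is a character that a byte-at-a-time conversion would damage, so those units are the ones to leave in hex.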

Edited by sulfurious

Thanks a lot for your detailed instructions.

Basically, I got your ideas on reading the unicode from a FILE in RAW mode, dealing with it, and writing it back to another RAW file with hex values.

However, as I asked in the previous post, for example, I want to read a unicode text from an object of a webpage (or simpler it is the address bar of IE or status bar where we can have unicode text), then do some comparisons and assign it back to the place it comes from.

I mean I am NOT working with a file; I am working with an object on a web page. That object's innerText is Unicode.

For example, $object.innerText contains this text: "Kiểm tra Tiếng Việt.". This is Unicode text.

If you want to manipulate extended characters, do not convert to string, but find the hex values of the extended characters and manipulate in hex. It is more work, but you ensure you keep the UTF-16 encoding.

As you mentioned (in the quote above), if I assign it to a string, I will lose extended characters (i.e. it will become question marks).

So, how can I get the hex value or the binary value of $object.innerText ???

Is there any way to do that?

Thanks...!
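If the scripting environment hands the property back as a genuine Unicode string, the hex values can be read off per character rather than per byte. A minimal sketch in Python (not AutoIt), where inner_text is a hypothetical stand-in for the value of $object.innerText:

```python
# Hypothetical stand-in for the text held by the web page object.
inner_text = "Ki\u1ec3m tra Ti\u1ebfng Vi\u1ec7t."

# One 4-digit hex code point per character.
code_points = [f"{ord(ch):04X}" for ch in inner_text]
print(" ".join(code_points))

# The hex list converts back to the original text without loss.
restored = "".join(chr(int(h, 16)) for h in code_points)
print(restored == inner_text)
```

Whether a given AutoIt build returns the property as Unicode is exactly what this thread is trying to establish, so treat this only as a picture of the goal.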


OK. This is what I just tested.

1. I put this unicode text "Kiểm tra Tiếng Việt." into a file, named Unicode.txt.

I ran the code below to get the result:

#include <string.au3>

$strFile1 = "Unicode.txt"
$file1 = FileOpen($strFile1,4)

$text = FileRead($file1,FileGetSize($strFile1))
$text = StringReplace(String($text),"0x","")

If (StringLeft($text,4) = "FFFE") Or (StringLeft($text,4) = "FEFF") Then ; byte order mark present: this is Unicode
    $text = StringTrimLeft($text,4) ; trim out this to use the text
EndIf

ClipPut($text)
FileClose($file1)
[Text garbled in the original post]

$text = _StringToHex16(ClipGet())
ClipPut($text)

[Text garbled in the original post]

Here is the result: KiÃm tra ti¿ng ViÇt.

So, from the original unicode text in the file: "Kiểm tra tiếng Việt."

Now it comes back as: "KiÃm tra ti¿ng ViÇt."

All the extended characters came out wrong.

Can anybody help? Thanks!
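The result above is consistent with keeping only the low byte of each 16-bit character, which is what happens when a conversion reads two hex digits out of every four. A hedged sketch in Python (not AutoIt); decoding the surviving bytes as Latin-1 is an assumption made for display.

```python
# The sample text as it was stored in the file.
text = "Ki\u1ec3m tra ti\u1ebfng Vi\u1ec7t."

utf16 = text.encode("utf-16-le")
low_bytes = utf16[0::2]            # keep each low byte, drop each high byte
mangled = low_bytes.decode("latin-1")

print(mangled)
```

For the three extended letters the discarded high byte is 1E, so only C3, BF and C7 survive, and those display as the accented Latin-1 characters seen in the result.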

Edited by svkhtn
The system implicitly converts data between certain clipboard formats: if a window requests data in a format that is not on the clipboard, the system converts an available format to the requested format. The system can convert data as indicated in the following table.

http://msdn.microsoft.com/library/default....oardformats.asp

Edited by SmOke_N

Common sense plays a role in the basics of understanding AutoIt... If you're lacking in that, do us all a favor, and step away from the computer.


From my tests, you have to bring in the data as hexadecimal values. That is the only way to keep the extended characters. There is no function in AutoIt, that I know of, that will convert to UTF-16. _StringToHex() will give you hex, but only up to 254 chr.

You can write hex to a file using FileWrite($fileHandle, BinaryString("0x" & $variable)), but this is only good for ASCII. You need the hex values to write the hex values, hence the RAW mode.

Sorry.

Sul


I don't think it is because of the implicit conversion of the clipboard. If I used Ctrl-C and Ctrl-V to copy that unicode text from one place to paste to another place, everything appears correctly.

The main problem is after I used ClipGet(), the return string was screwed up (i.e. string in AutoIt does not keep the Unicode encoding anymore).

When reading from file, it is easy because we can set the file reading to RAW mode and read everything in binary. However, if I tried to get the unicode text from an object of a webpage, it is like I read unicode text from a file with NORMAL mode. In this case, the text was screwed.

Anyway, thanks SmOke_N and sulfurious for your help!!!

It is because of the conversion, I don't think that ClipPut() does Unicode.


It is because of the conversion, I don't think that ClipPut() does Unicode.

I am wondering if it is because the string argument of ClipPut($string) doesn't hold the unicode text (i.e. the argument was messed up before the ClipPut() does anything) OR the function ClipPut() of AutoIt itself couldn't handle unicode ???

I am asking this because if we do it manually (i.e. use Ctrl-C to copy to the clipboard), there is no conversion.

Thanks again SmOke_N and sulfurious !!!


I would bet it is because AutoIt does not handle unicode. If it did, then other string controls, such as a msgbox would be able to display it. You can put unicode in clip, and paste it in a unicode text file just fine. But there is no way to retrieve it from the clip as hex. If the clip acted like a file, you could do it.

Sul
