Opened 20 months ago

Last modified 5 months ago

#3731 assigned Bug

Binary() performs hidden and wrong conversion on strings

Reported by: jchd18
Owned by: Jpm
Milestone:
Component: AutoIt
Version:
Severity: None
Keywords:
Cc:


One would expect Binary(<string>) to return the binary image of <string>, but that is not at all what happens.
The string below contains the first five uppercase Latin letters, a space, and the corresponding five Greek letters.

ConsoleWrite(Binary("ABCDE ΑΒΓΔΕ") & @LF)

In memory the string looks like this:

0041 0042 0043 0044 0045 0020 0391 0392 0393 0394 0395

and this is what one would expect from invoking Binary(), since AutoIt uses UCS-2 (UTF-16LE limited to the Unicode BMP).

Instead we get something completely unusable. The Greek letters Alpha, Beta, Delta and Epsilon appear as question marks (they have no equivalent in ASCII), but the letter Gamma is, surprisingly, converted to ASCII G.
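
For comparison, the expected image can be requested explicitly. The following is a minimal sketch, assuming the documented StringToBinary() function with the $SB_UTF16LE flag from StringConstants.au3; the Greek letters are built with ChrW() so the result does not depend on how the script file is encoded.

```autoit
#include <StringConstants.au3>

; Build "ABCDE ΑΒΓΔΕ" from code points so the script file encoding is irrelevant.
Local $s = "ABCDE "
For $i = 0x0391 To 0x0395
	$s &= ChrW($i)
Next

; Explicit UTF-16LE image: two bytes per character, low byte first.
ConsoleWrite(StringToBinary($s, $SB_UTF16LE) & @LF)
; Expected: 0x41004200430044004500200091039203930394039503

; Implicit conversion reported in this ticket, shown for side-by-side comparison.
ConsoleWrite(Binary($s) & @LF)
```

The first line of output carries the code units listed above in little-endian byte order; the second shows whatever Binary() produces on the local system.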


Change History (5)

comment:1 Changed 19 months ago by Jpm

The documentation is certainly incomplete: it does not say that the conversion is to bytes, not to UCS-2.

comment:2 Changed 19 months ago by jchd18

The documentation is indeed incomplete, but there are also a number of very unexpected "conversions" in the range > 0xFF (maybe even in the range [0x7F,0xFF], depending on the local codepage), which makes Binary(<string>) deceptive.

ConsoleWrite(Binary("ABCDE ΑΒΓΔΕ") & @LF)

Local $c, $b, $u
For $i = 0x100 To 0xFFFF
	$c = ChrW($i)
	$b = Binary($c)
	If $b <> "0x3F" Then _U8ConsoleWrite(Hex($i, 4) & @TAB & $c & "    -->     " & @TAB & $b & @TAB & ChrW($b))
Next

; Unicode-aware ConsoleWrite (set console to UTF8 for decent result)
Func _U8ConsoleWrite($s)
	ConsoleWrite(BinaryToString(StringToBinary($s & @LF, 4), 1))
EndFunc   ;==>_U8ConsoleWrite

For instance, some code points are converted, but not all of those that could be, and not always correctly:

β (lowercase Greek beta) turned into ß (German eszett) ?!?!?
Γ -> G but γ (lowercase Greek gamma) isn't converted
Many codepoints are unexpectedly converted to control characters!

I'm not completely against attempts to convert, say, Ā to A in a distinct function, but at the very least this has to be clearly documented, AND it is better to have it right and consistent (which is much, much harder than it looks). In any case, a function named Binary shouldn't silently alter its input, and OTOH any attempt to map UCS-2 code points > 0x7F to the local Windows codepage is doomed to fail.

All in all, I doubt a simple approach can be really satisfactory. From this point of view, _StringToHex() [which produces the hex of the string in UTF-8] and StringToASCIIArray() [which returns an array of code points] are more robust.
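
The two alternatives named above can be sketched as follows. This is only an illustration: it assumes _StringToHex() from String.au3 behaves as described in this comment (UTF-8 hex), which may depend on the AutoIt version, and the sample string is arbitrary.

```autoit
#include <String.au3>

; Sample string "ABΓ", built with ChrW() to avoid script-encoding issues.
Local $s = "AB" & ChrW(0x0393)

; StringToASCIIArray() returns one array element per UTF-16 code unit.
Local $a = StringToASCIIArray($s)
For $i = 0 To UBound($a) - 1
	ConsoleWrite(Hex($a[$i], 4) & @LF) ; 0041, 0042, 0393
Next

; _StringToHex() returns a hex string; per the comment above it encodes in UTF-8.
ConsoleWrite(_StringToHex($s) & @LF)
```

Neither call goes through the local codepage, so no character is replaced by "?" or by a look-alike.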

comment:3 Changed 19 months ago by Jpm

  • Owner set to Jon
  • Status changed from new to assigned

I leave it to Jon to decide whether to change only the documentation, or the code so that it follows your recommendation ...

comment:4 Changed 19 months ago by jchd18

Fine. The issue finally boils down to: "what should the correct semantics of Binary be when applied to a native (UCS-2) AutoIt string?"

comment:5 Changed 5 months ago by Jpm

  • Owner changed from Jon to Jpm
