Opened 3 weeks ago

Last modified 5 days ago

#3731 assigned Bug

Binary() performs hidden and wrong conversion on strings

Reported by: jchd18
Owned by: Jon
Milestone:
Component: AutoIt
Version: 3.3.14.5
Severity: None
Keywords:
Cc:

Description

One would expect Binary(<string>) to return the binary image of <string>, but that is not at all what happens.
The string below contains the first 5 ASCII letters, a space and the corresponding 5 Greek letters.

ConsoleWrite(Binary("ABCDE ΑΒΓΔΕ") & @LF)

In memory the string looks like this:

0041 0042 0043 0044 0045 0020 0391 0392 0393 0394 0395

and this is what one would expect from invoking Binary(), since AutoIt uses UCS2 (UTF16-LE limited to the Unicode BMP).

Instead we get something completely unusable. The Greek letters Alpha, Beta, Delta and Epsilon appear as question marks (they have no equivalent in ASCII), but the letter Gamma surprisingly gets converted to ASCII G.

0x4142434445203F3F473F3F
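
For comparison, here is a minimal sketch of how the expected bytes can be requested explicitly, assuming the documented StringToBinary() flag 2 (UTF16 little-endian). The string is built with ChrW() so the result does not depend on the encoding of the script file; the variable name $s is only illustrative.

```autoit
; Build "ABCDE ΑΒΓΔΕ" from codepoints so the script's file encoding doesn't matter
Local $s = "ABCDE " & ChrW(0x0391) & ChrW(0x0392) & ChrW(0x0393) & ChrW(0x0394) & ChrW(0x0395)

; Implicit conversion reported in this ticket (result depends on the local ANSI codepage)
ConsoleWrite(Binary($s) & @LF)

; Explicit UTF16-LE encoding: two bytes per character, low byte first
ConsoleWrite(StringToBinary($s, 2) & @LF)
```

The second line should show every code unit with its low byte first, for example 4100 for A and 9103 for Α, so the byte order differs from the 16-bit values listed above even though the content is the same.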

Attachments (0)

Change History (4)

comment:1 Changed 6 days ago by Jpm

The doc is certainly incomplete: it does not say that the conversion is to bytes rather than to UCS2.

comment:2 Changed 6 days ago by jchd18

The doc is indeed incomplete, but there are a number of very unexpected "conversions" elsewhere in the range > 0xFF (maybe even in the range [0x7F,0xFF] depending on local codepage), making Binary(<string>) deceptive.

ConsoleWrite(Binary("ABCDE ΑΒΓΔΕ") & @LF)

; Report every codepoint above 0xFF that Binary() does not turn into "?" (0x3F)
Local $c, $b
For $i = 0x100 To 0xFFFF
	$c = ChrW($i)
	$b = Binary($c)
	If $b <> "0x3F" Then _U8ConsoleWrite(Hex($i, 4) & @TAB & $c & "    -->     " & @TAB & $b & @TAB & ChrW($b))
Next

; Unicode-aware ConsoleWrite (set console to UTF8 for decent result)
Func _U8ConsoleWrite($s)
	ConsoleWrite(BinaryToString(StringToBinary($s & @LF, 4), 1))
EndFunc   ;==>_U8ConsoleWrite

For instance, some codepoints are converted, but not all that could be, and not always correctly:

β (lowercase Greek beta) turned into ß (German eszett) ?!?!?
Γ -> G but γ (lowercase Greek gamma) isn't converted
Many codepoints are unexpectedly converted to control characters!

I'm not completely against attempts to convert, say, Ā to A in a distinct function, but at the very least this has to be clearly documented, and it had better be right and consistent (which is much, much harder than it looks). In any case, a function named Binary shouldn't alter anything, and OTOH an attempt to map UCS2 > 0x7F to the local Windows codepage is doomed to fail.

All in all I doubt a simple approach can be really satisfactory. From this point of view, _StringToHex() [which produces hex of the string in UTF8] and StringToASCIIArray() [which returns an array of codepoints] are more robust.
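
A short sketch of those alternatives, for illustration only. It assumes the standard String.au3 UDF for _StringToHex() and the documented StringToBinary() flag 4 (UTF8); the sample string is made up for the example.

```autoit
#include <String.au3> ; provides _StringToHex()

; "ABΓ", built from codepoints so the script's file encoding doesn't matter
Local $s = "AB" & ChrW(0x0393)

; One array element per character, each holding its codepoint
Local $a = StringToASCIIArray($s)
For $i = 0 To UBound($a) - 1
	ConsoleWrite(Hex($a[$i], 4) & " ")
Next
ConsoleWrite(@LF)

; Hex of the string as produced by the UDF (UTF8, as noted above)
ConsoleWrite(_StringToHex($s) & @LF)

; Explicit UTF8 bytes, with no codepage-dependent substitution
ConsoleWrite(StringToBinary($s, 4) & @LF)
```

The first line should list the codepoints 0041 0042 0393, and the last should end with the two-byte UTF8 sequence CE93 for Γ. Neither result involves the local codepage.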

comment:3 Changed 5 days ago by Jpm

  • Owner set to Jon
  • Status changed from new to assigned

I leave it to Jon to decide whether to change only the doc, or the code as you recommend.

comment:4 Changed 5 days ago by jchd18

Fine. The issue finally boils down to: "what should the correct semantics of Binary be when applied to a native (UCS2) AutoIt string?"
