Help with UTF-8 Hex bytes

DeviaAnimus · October 28, 2012

Hi,

I have a number of files that I've converted from binary to YAML text files. Everything in the text files looks good except that the UTF-8 Latin characters have been replaced with byte sequences. For example, à is replaced with \xC3\xA0 and ç with \xC3\xA7.

I want to find these two-byte sequences inside my files and replace them with the appropriate character. So if I have the string: "La nature de mon travail m'am\xC3\xA8ne \xC3\xA0 c\xC3\xB4toyer quotidiennement", I want the script to output "La nature de mon travail m'amène à côtoyer quotidiennement" having successfully replaced the bytes with the correct characters.

As I understand it, I can use BinaryToString to convert the bytes to characters, but I can't figure out how to find the sequences within the string. I've looked at StringRegExp, but I've never used it before and I don't understand how to use it for this purpose.

How would I go about achieving this?

Thanks in advance!

Edited October 28, 2012 by DeviaAnimus

jchd · October 28, 2012

You're confused about what UTF-8 is all about. The byte sequences you see for à, ç, etc. are UTF-8 encoded representations themselves.

If the UTF-8 text is in a file, then read it as UTF-8 using FileOpen with the UTF-8 option, then FileRead.

BinaryToString and StringRegExp are different beasts best left alone in your case.

DeviaAnimus · October 28, 2012

None of the modes for FileOpen returns a string without the byte sequences.

Please explain what I've gotten wrong about UTF-8, because I can't understand why it's so hard to get the characters instead of the byte sequences.

jchd · October 28, 2012

That's just because what you call "character" (e.g. à, ç, whatever) needs what you call a byte sequence in UTF-8 encoding.

Depending on whether your input files carry a BOM or not, you need to FileOpen them with option 128 or 256. That will allow you to FileRead them correctly to a native AutoIt string variable (AutoIt uses UTF-16 internally). Now if you want to convert them to, say, your default Windows Latin charset (presumably), all you have to do is FileWrite the contents to new files or overwrite the inputs (be cautious then!).
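As a rough sketch of the FileOpen/FileRead approach described above: the file name below is hypothetical, and the numeric modes are the ones mentioned in this post (128 for UTF-8 with a BOM, 256 for UTF-8 without one). Note that this only decodes real UTF-8 bytes stored in the file. If the file contains the literal text \xC3\xA0 (a backslash, an x and two hex digits), FileRead returns that text unchanged.

```autoit
; Hypothetical input path, for illustration only.
Local $sPath = @ScriptDir & "\input.yaml"

; Mode 256 = read, UTF-8 without BOM. Use 128 instead if the file starts with a BOM.
Local $hFile = FileOpen($sPath, 256)
If $hFile = -1 Then
    ConsoleWrite("Could not open " & $sPath & @CRLF)
    Exit 1
EndIf

; The result is a native AutoIt (UTF-16) string.
Local $sText = FileRead($hFile)
FileClose($hFile)

ConsoleWrite(StringLen($sText) & " characters read" & @CRLF)
```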

DeviaAnimus · October 28, 2012

Well, that doesn't work. Both options 128 and 256 return a string with the byte sequences and not the characters.

trancexx · October 28, 2012

Somebody got something wrong. May I try?

; Read
Local $sReadString = "La nature de mon travail m'am\xC3\xA8ne \xC3\xA0 c\xC3\xB4toyer quotidiennement"

; Replace unwanted characters
$sReadString = Execute('"' & StringRegExpReplace($sReadString, '\\xC3\\x([[:xdigit:]]{2})', '" & ChrW(Dec(''$1'') + 64) & "') & '"')

;... whatever
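A note on the + 64, as far as I can reconstruct it: in a two-byte UTF-8 sequence the code point is the low 5 bits of the lead byte shifted left by 6, combined with the low 6 bits of the trail byte. For the lead byte C3 this works out to the trail byte's value plus 64, which is presumably why the one-liner only looks for \xC3 pairs. The small check below is just a sketch of that arithmetic; the variable names are made up for illustration.

```autoit
; Worked example for "\xC3\xA8" (è). Variable names are illustrative only.
Local $iLead = 0xC3
Local $iTrail = 0xA8

; Generic two-byte decode: (lead AND 0x1F) shifted left 6, OR (trail AND 0x3F).
; BitShift with a negative count shifts to the left.
Local $iCodePoint = BitOR(BitShift(BitAND($iLead, 0x1F), -6), BitAND($iTrail, 0x3F))

ConsoleWrite(Hex($iCodePoint, 4) & @CRLF)            ; 00E8, the code point of è
ConsoleWrite(($iCodePoint = $iTrail + 64) & @CRLF)   ; True
```

This shortcut only holds for lead byte C3, i.e. characters U+00C0 to U+00FF. Sequences starting with another lead byte (C2, C5, E2, ...) are left untouched by the pattern.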

DeviaAnimus · October 28, 2012

Somebody got something wrong. May I try?

; Read
Local $sReadString = "La nature de mon travail m'am\xC3\xA8ne \xC3\xA0 c\xC3\xB4toyer quotidiennement"

; Replace unwanted characters
$sReadString = Execute('"' & StringRegExpReplace($sReadString, '\\xC3\\x([[:xdigit:]]{2})', '" & ChrW(Dec(''$1'') + 64) & "') & '"')

;... whatever

@trancexx.

Thank you so much, it worked perfectly. This is exactly what I was looking for.

A thousand thanks to you.

jchd · October 29, 2012

Ha! I don't know why I kept looking at \x as a representation of a hex character.

trancexx got it right that I got it wrong: I misread the \x sequences as actual bytes rather than face-value characters.

However the above code won't work for some characters even if they are included in Windows Latin charset (which is what I suppose you use). Take for instance the character €: it takes three bytes when encoded in UTF-8.

Should you ever encounter such characters, the best way would be to convert every \x.. sequence into an individual byte of said value, then convert the resulting string from UTF-8 to ANSI.
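A minimal sketch of that idea, under some assumptions: the function name _DecodeHexEscapes is made up, the escapes are assumed to always have the form \x followed by exactly two hex digits, and the result is kept as a native AutoIt string (decoded with BinaryToString flag 4, UTF-8) rather than converted to ANSI. Each run of consecutive escapes is collected as hex digits, turned into binary, and decoded in one go, so sequences of any length are handled.

```autoit
; Hypothetical helper: replaces runs of literal \xHH escapes by the
; characters those bytes encode in UTF-8.
Func _DecodeHexEscapes($sEscaped)
    Local $sOut = "", $sHex = "", $sChunk
    Local $iPos = 1, $iLen = StringLen($sEscaped)

    While $iPos <= $iLen
        $sChunk = StringMid($sEscaped, $iPos, 4)
        If StringRegExp($sChunk, '^\\x[[:xdigit:]]{2}$') Then
            ; Collect the two hex digits of this escape.
            $sHex &= StringRight($sChunk, 2)
            $iPos += 4
        Else
            ; End of a run of escapes: decode the collected bytes as UTF-8.
            If $sHex <> "" Then
                $sOut &= BinaryToString(Binary("0x" & $sHex), 4)
                $sHex = ""
            EndIf
            $sOut &= StringMid($sEscaped, $iPos, 1)
            $iPos += 1
        EndIf
    WEnd

    ; Flush a run that ends the string.
    If $sHex <> "" Then $sOut &= BinaryToString(Binary("0x" & $sHex), 4)
    Return $sOut
EndFunc

; Usage: a three-byte sequence (the euro sign) next to a two-byte one.
Local $sResult = _DecodeHexEscapes("Prix : 5 \xE2\x82\xAC, m'am\xC3\xA8ne")
MsgBox(0, "Decoded", $sResult)
```

Writing $sResult out with FileWrite then takes care of whatever target encoding the output file is opened with.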

Myicq · October 29, 2012

Take for instance the character €: it takes three bytes when encoded in UTF-8.

Well, here's where it gets funny, because converting from a unique identification of characters (like €), which is possible in Unicode/UTF-8, to a simpler character set such as ANSI causes some characters to depend on interpretation.

Take the € again. It's not present in en-US ANSI (normally), but it IS present in, for example, da-DK ANSI (ISO-8859-15).

So you cannot make the conversion from UTF-8 to ANSI without at the same time specifying which ANSI variant you mean. There are several. The same byte value may mean at least 12 different characters, especially in the range 0x80..0xFF.

jchd · October 29, 2012

This part is even much worse than you think it is. There are thousands of 8-bit charsets in use, even leaving aside multibyte charsets. Also, I used the term ANSI as shorthand for Windows Latin, but this term itself is essentially meaningless. What most people use is the default charset defined by the locale they use. Most users in western countries (US included) use Windows Western Latin.
