Help with UTF-8 Hex bytes

DeviaAnimus

Hi,

I have a number of files that I've converted from binary to YAML text files. Everything in the text files looks good, except that the non-ASCII Latin characters have been replaced with escaped UTF-8 byte sequences. For example, à is replaced with \xC3\xA0 and ç with \xC3\xA7.

I want to find these two-byte sequences inside my files and replace them with the appropriate character. So if I have the string: "La nature de mon travail m'am\xC3\xA8ne \xC3\xA0 c\xC3\xB4toyer quotidiennement", I want the script to output "La nature de mon travail m'amène à côtoyer quotidiennement" having successfully replaced the bytes with the correct characters.

As I understand it, I can use BinaryToString to convert the bytes to characters, but I can't figure out how to find the sequences within the string. I've looked at StringRegExp, but I've never used it before and I don't understand how to use it for this purpose.

How would I go about achieving this?

Thanks in advance!

jchd

You're confused about what UTF-8 is all about. The byte sequences you see for à, ç, etc. are UTF-8 encoded representations themselves.

If the UTF-8 text is in a file, read it as UTF-8: use FileOpen with the UTF-8 option, then FileRead.

BinaryToString and StringRegExp are different beasts best left alone in your case.


This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)

DeviaAnimus

None of the modes for FileOpen returns a string without the byte sequences.

Please explain what I've gotten wrong about UTF-8, because I can't understand why it's so hard to get the characters instead of the byte sequences.

jchd

That's simply because what you call a "character" (e.g. à, ç, whatever) is stored as what you call a byte sequence in UTF-8 encoding.

Depending on whether your input files carry a BOM or not, you need to FileOpen them with option 128 (UTF-8 with BOM) or 256 (UTF-8 without BOM). That will allow you to FileRead them correctly into a native AutoIt string variable (AutoIt uses UTF-16 internally). Now if you want to convert them to, say, your default Windows Latin charset (presumably), all you have to do is FileWrite the contents to new files or overwrite the inputs (be cautious then!).
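
As a minimal sketch of the FileOpen/FileRead step described above: the file name is hypothetical, and the numeric modes are used directly so no include is needed. Note that this decodes real UTF-8 bytes stored on disk; it does not change a literal backslash-x escape that is spelled out as text in the file, which is what the files in this thread turned out to contain.

```autoit
; Hypothetical file name, for illustration only.
Local $sPath = "input.yaml"

; Mode 0 = read. Add 256 for UTF-8 without a BOM, or 128 for UTF-8 with a BOM.
Local $hFile = FileOpen($sPath, 0 + 256)
If $hFile = -1 Then
    MsgBox(16, "Error", "Cannot open " & $sPath)
    Exit 1
EndIf

; FileRead decodes the UTF-8 bytes into a native AutoIt string.
Local $sText = FileRead($hFile)
FileClose($hFile)

; Show the start of the decoded text.
MsgBox(0, "First 200 characters", StringLeft($sText, 200))
```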


DeviaAnimus

Well, that doesn't work. Both options 128 and 256 return a string with the byte sequences and not the characters.

trancexx

Somebody got something wrong. May I try?

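; Read
Local $sReadString = "La nature de mon travail m'am\xC3\xA8ne \xC3\xA0 c\xC3\xB4toyer quotidiennement"

; Replace unwanted characters: each literal \xC3\xHH becomes ChrW(0xHH + 64).
; With lead byte 0xC3, the second byte maps onto U+00C0..U+00FF, hence the + 64.
; Note: Execute() misbehaves if the text itself contains double quotes.
$sReadString = Execute('"' & StringRegExpReplace($sReadString, '\\xC3\\x([[:xdigit:]]{2})', '" & ChrW(Dec(''$1'') + 64) & "') & '"')

;... whatever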
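
For anyone wondering where the + 64 comes from, here is a small sketch of the two-byte UTF-8 arithmetic. The variable names are illustrative and not from the post above. A two-byte sequence has the bit layout 110xxxxx 10yyyyyy, and the code point is the x bits followed by the y bits.

```autoit
; Illustrative values: the two bytes spelled out by \xC3\xA8 (è).
Local $iLead = 0xC3, $iTrail = 0xA8

; General two-byte rule: low 5 bits of the lead byte, then low 6 bits of the trail byte.
; BitShift with a negative count shifts left.
Local $iGeneral = BitOR(BitShift(BitAND($iLead, 0x1F), -6), BitAND($iTrail, 0x3F))

; Shortcut that is valid only when the lead byte is 0xC3: trail byte + 64.
Local $iShortcut = $iTrail + 64

; Both evaluate to 0xE8 (232), the code point of è.
MsgBox(0, "Code point", "General: 0x" & Hex($iGeneral, 4) & @CRLF & "Shortcut: 0x" & Hex($iShortcut, 4))
```

The shortcut holds because a lead byte of 0xC3 always contributes 0xC0 to the code point, and the trail byte contributes its value minus 0x80. With any other lead byte the constant is different, which is why the one-liner only covers U+00C0 to U+00FF.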

DeviaAnimus

Somebody got something wrong. May I try?

; Read
Local $sReadString = "La nature de mon travail m'am\xC3\xA8ne \xC3\xA0 c\xC3\xB4toyer quotidiennement"

; Replace unwanted characters
$sReadString = Execute('"' & StringRegExpReplace($sReadString, '\\xC3\\x([[:xdigit:]]{2})', '" & ChrW(Dec(''$1'') + 64) & "') & '"')

;... whatever

@trancexx.

Thank you so much, it worked perfectly. This is exactly what I was looking for.

A thousand thanks to you.

jchd

Ha! I don't know why I kept reading \x as notation for a hex byte.

trancexx is right that I got it wrong: I took the \x sequences for escape notation rather than face-value characters in the text.

However, the above code won't work for some characters, even ones included in the Windows Latin charset (which is what I suppose you use). It only handles two-byte sequences that start with \xC3. Take for instance the character €: it takes three bytes when encoded in UTF-8 (\xE2\x82\xAC).

Should you ever encounter such characters, the best way would be to convert every \x.. sequence into an individual byte of that value, then convert the resulting byte string from UTF-8 to ANSI.
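
The general approach just described could be sketched roughly as follows. The function name _DecodeHexEscapes is hypothetical, and the sketch assumes every run of escapes spells out valid UTF-8. It stops at a native AutoIt string; writing that string out in a particular ANSI charset is a separate step that is not shown here.

```autoit
; Hypothetical helper (not from the thread): turn runs of literal \xHH escapes,
; which spell out UTF-8 bytes, back into the characters they encode.
Func _DecodeHexEscapes($sText)
    Local $sOut = "", $iPos = 1, $iNext, $iStart, $aRun, $sHex
    While $iPos <= StringLen($sText)
        ; Flag 1 = return the first match at or after offset $iPos.
        $aRun = StringRegExp($sText, "((?:\\x[[:xdigit:]]{2})+)", 1, $iPos)
        If @error Then ExitLoop
        $iNext = @extended ; offset just past the matched run
        $iStart = $iNext - StringLen($aRun[0])
        ; Copy the plain text that precedes the run.
        If $iStart > $iPos Then $sOut &= StringMid($sText, $iPos, $iStart - $iPos)
        ; "\xC3\xA8" -> "C3A8" -> binary 0xC3A8 -> decoded as UTF-8 (flag 4).
        $sHex = StringReplace($aRun[0], "\x", "")
        $sOut &= BinaryToString(Binary("0x" & $sHex), 4)
        $iPos = $iNext
    WEnd
    ; Append whatever follows the last run.
    Return $sOut & StringMid($sText, $iPos)
EndFunc

; Usage: two-byte sequences and the three-byte euro sign in one string.
Local $sSample = "m'am\xC3\xA8ne \xC3\xA0 c\xC3\xB4toyer, 5 \xE2\x82\xAC"
MsgBox(0, "Decoded", _DecodeHexEscapes($sSample))
```

Decoding a whole run of escapes at once, rather than one escape at a time, keeps multi-byte sequences together, so the length of each sequence never has to be worked out by hand.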


Myicq

Take for instance the character €: it takes three bytes when encoded in UTF-8.

Well, here's where it gets funny: converting from a character set that identifies every character uniquely (like €), as Unicode/UTF-8 does, to a simpler character set such as ANSI makes some characters depend on interpretation.

Take € again. It's not present in en-US ANSI (normally), but it IS present in, for example, da-DK ANSI (ISO-8859-15).

So you cannot convert from UTF-8 to ANSI without also specifying which ANSI variant you mean. There are several. The same byte value may mean at least 12 different characters, especially in the range 0x80..0xFF.


I am just a hobby programmer, and nothing great to publish right now.

jchd

This part is even worse than you think. There are thousands of 8-bit charsets in use, even leaving multibyte charsets aside. Also, I used the term ANSI as shorthand for Windows Latin, but the term itself is essentially meaningless. What most people use is the default charset defined by their locale. Most users in western countries (US included) use Windows Western Latin.

