Torment Posted November 9, 2011 (Author)

Thanks for the updated version, Jos! I tested it with files encoded as ANSI/ASCII, UTF-8, UTF-8 no BOM, UTF-16, UTF-16 no BOM, UTF-16 Big Endian, UTF-16 Big Endian no BOM, and Unicode ASCII-Escaped. It worked with all encodings except "UTF-16 no BOM" and "UTF-16 Big Endian no BOM". I'm not sure if anybody else uses those encodings, but if you want to keep tinkering with autoit3wrapper I can post some files with those encodings for you. Just let me know.

I'm surprised to hear that you two both recommend using a BOM for UTF-8, especially since the Unicode Standard itself states that "Use of a BOM is neither required nor recommended for UTF-8..." [page 30, section 2.6, Encoding Schemes].

My understanding is that the BOM (Byte Order Mark) was originally developed to indicate the byte order (big-endian vs. little-endian) of an encoding. It was not initially intended as an encoding signature, but it has since been adopted by many as such. Because UTF-8 does not have a byte order, though, a BOM can ONLY serve as a signature for that particular encoding. Unfortunately, because the BOM isn't always present in UTF-8, it can't reliably be used to determine the correct encoding. Sure, if it IS there, then you can conclude the encoding is UTF-8; but if it's missing, other means are needed. For that reason, it's been recommended to always use another means of determining the encoding. According to Wikipedia, it's fairly easy to reliably detect UTF-8 encoding with a simple heuristic algorithm.

I'm interested in hearing more of your justification for using the BOM with UTF-8. Everything that I've read so far seems to indicate that a BOM should not be used, so if you have information that I haven't come across yet, I'd be interested in hearing it. Is your reasoning based solely on ease of detection, or are there other factors?
And thank you again for your help with this issue and your work on the autoit3wrapper program, Jos! I really appreciate it!
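The "BOM as signature" idea discussed above is easy to sketch. The following minimal Python helper (the function name and encoding labels are mine, purely for illustration) inspects a file's first bytes for the common BOMs; returning None is exactly the ambiguous no-BOM case the rest of the thread wrestles with:

```python
# Minimal sketch: identify an encoding from its BOM, if one is present.
# Longer BOMs must be tested first (the UTF-32-LE BOM begins with the
# UTF-16-LE BOM), so the list is ordered by length.
BOMS = [
    (b"\xef\xbb\xbf", "utf-8-sig"),
    (b"\xff\xfe\x00\x00", "utf-32-le"),
    (b"\x00\x00\xfe\xff", "utf-32-be"),
    (b"\xff\xfe", "utf-16-le"),
    (b"\xfe\xff", "utf-16-be"),
]

def detect_bom(data: bytes):
    """Return an encoding name if data starts with a known BOM, else None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None
```

When this returns None, the file may be ANSI, BOM-less UTF-8, or BOM-less UTF-16, and no signature can settle it; that ambiguity is the subject of the replies below.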
jchd Posted November 9, 2011

Please read me again: while (as your link correctly points out) few random ANSI files happen to be valid UTF-8-no-BOM files, the converse doesn't hold. Every UTF-8 file is a perfectly (100%) valid ANSI file. So in the absence of a BOM, and without knowing which ANSI locale (~ Windows codepage) the file uses (which makes things even worse), can you see that we have a really big problem deciding whether a given file is ANSI or UTF-8 no BOM?

The easy lesson is: always use UTF-8 + BOM when you know the file's consumer will process both UTF-8 and the BOM correctly. If you don't agree, I'll be pleased to exhibit counter-examples at will where the mystical "heuristic algorithm" falls flat on its face.

This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must-have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here.
RegExp tutorial: enough to get started.
PCRE v8.33 regexp documentation: latest available release and currently implemented in AutoIt beta.
SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well).
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt).
Torment Posted November 10, 2011 (Author)

Please don't misunderstand me; I'm not trying to say that you're wrong. I'm just trying to understand your position on this matter. If I understand you correctly, the issue with not using a BOM is that a UTF-8 file might be misinterpreted as an ANSI file. Maybe I don't know enough about the differences between the two, but where would that kind of misinterpretation cause a problem? Can you give me an example that might help me wrap my mind around this a little better? Thank you for the response. I appreciate it!
jchd Posted November 11, 2011 (edited)

No, of course I don't take it badly. We're not playing a "mine is bigger than yours" game.

1) Case: the input file is UTF-8 no BOM

Look at this very simple example. Suppose you receive a price list containing the following initial input, encoded as UTF-8 without a BOM:

"141-5489-2",1214.56₨,91.0514₪,18.13€

You have the item reference and a list of prices in various currencies; here the prices are listed in Indian Rupees, Israeli New Shekels and Euros. Since your file has no BOM, it is likely to be interpreted as perfectly valid ANSI (all files are!), and this is where the issues arise. To see why, let's look at a hex dump of the file (the text part at the right is bogus for chars > 0x7F, so don't even look there):

0000H  22 31 34 31 2D 35 34 38 39 2D 32 22 2C 31 32 31   '"141-5489-2",121'
0010H  34 2E 35 36 E2 82 A8 2C 39 31 2E 30 35 31 34 E2   '4.56b.(,91.0514b'
0020H  82 AA 2C 31 38 2E 31 33 E2 82 AC                  '.*,18.13b.,'

Say you use a Latin Windows codepage (1252); then your file displays as:

"141-5489-2",1214.56â‚¨,91.0514â‚ª,18.13â‚¬

Not only have the currencies become meaningless, but disturbing comma-like characters have magically appeared, which completely ruin the structure of your CSV file. Now, if you receive the very same file as a Russian user, using the Cyrillic Windows-1251 codepage, the same file will display:

"141-5489-2",1214.56в‚Ё,91.0514в‚Є,18.13€

Geez, even a € symbol got thrown in, due to byte-per-byte misinterpretation of course! Say you are in Japan and use the Japanese JIS Windows codepage; then your file will display yet another kind of garbage:

"141-5489-2",1214.56竄ィ,91.0514竄ェ,18.13竄ャ

And so on. Depending on your choice of ANSI codepage at your end, the same file will display apparently random garbage, but _all_ of these are perfectly valid ANSI files; no rule is violated, because ANSI has no rule regarding the encoding of characters. In short, you just can't decide what to do with this file, unless some oracle instructs you that the file is UTF-8 no BOM.

On the contrary, it _is_ extremely easy and fast to decide whether a given file _may_ be interpreted as UTF-8 no BOM: just use a simple regular expression (as Jos and I tested some time ago). But this is a half-baked answer: you know either that it isn't valid UTF-8 (with justified 100% assurance), or that it may be UTF-8... or maybe it's not! That is why no algorithm can replace human inspection and a basic understanding of the semantics of the text. Not even Siri will get anywhere close to that.

2) Case: the input file is ANSI

Well, but which ANSI? Or is it UTF-8? Worse, it isn't easy to be an oracle in this case either: without an external hint, you don't know whether the producer of the file used his own ANSI codepage to produce it. If you look at where the Euro symbol was in the Unicode (first) display (the last character) and look at the hex dump, you see that it takes 3 bytes, E2 82 AC, in UTF-8. Even though it's possible to interpret this sequence as UTF-8, it is equally valid to interpret it as 3 ANSI characters of some unknown codepage. Not knowing which, it could well be an acronym meaning 'Ex special Tax' or 'worldwide shipping cost included' or 'obsolete' in some Indic script (plenty of choices there) or East-Asian codepage (more interesting choices there as well). Even if you restrict files to AutoIt sources, you still have a big problem with string literals, which may contain sequences of bytes that have a meaning in the user's codepage but resemble UTF-8.
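jchd's hex-dump walkthrough can be reproduced in a few lines. This sketch (Python used purely for illustration; it is not code from the thread) encodes the sample price line as UTF-8 and then decodes those same bytes under two Windows codepages. Both decodes succeed without any error, which is precisely the "every UTF-8 file is a valid ANSI file" trap:

```python
# The sample price line from the post: rupee (U+20A8), new shekel (U+20AA)
# and euro (U+20AC) signs, written with escapes for clarity.
line = '"141-5489-2",1214.56\u20a8,91.0514\u20aa,18.13\u20ac'
raw = line.encode("utf-8")   # the bytes shown in the hex dump (E2 82 A8, ...)

# Both "ANSI" decodes succeed -- no exception warns you anything is wrong.
latin = raw.decode("cp1252")     # Western European: the euro becomes â‚¬
cyrillic = raw.decode("cp1251")  # Russian: the same bytes become в‚¬ etc.

# The stray U+201A characters (every 0x82 byte) are the comma look-alikes
# that ruin the CSV structure, as described above.
```

Note that the decoder raises no error in either case; the garbage is only detectable by a human reading the result.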
The article you quoted is right to point out that the chances of such a misinterpretation of ANSI as UTF-8 are low and that such files are uncommon; still, the chances aren't zero, far from it. Without bold assumptions which can be proven wrong at any time, there is no way out of the dilemma, short of switching to a BOMmed UTF of some sort, which removes any ambiguity.

FYI, here's how a Unicode codepoint (the numeric value of the character's position) gets encoded in the UTFs:

** Notes on UTF-8:
**
**   Byte-0    Byte-1    Byte-2    Byte-3     Value
**  0xxxxxxx                                  00000000 00000000 0xxxxxxx
**  110yyyyy  10xxxxxx                        00000000 00000yyy yyxxxxxx
**  1110zzzz  10yyyyyy  10xxxxxx              00000000 zzzzyyyy yyxxxxxx
**  11110uuu  10uuzzzz  10yyyyyy  10xxxxxx    000uuuuu zzzzyyyy yyxxxxxx
**
** Notes on UTF-16: (with wwww+1 == uuuuu)
**
**   Word-0             Word-1              Value
**  110110ww wwzzzzyy   110111yy yyxxxxxx   000uuuuu zzzzyyyy yyxxxxxx
**  zzzzyyyy yyxxxxxx                       00000000 zzzzyyyy yyxxxxxx

I hope this clarifies my statements. Don't hesitate to tell me if something is still unclear.

Edited November 11, 2011 by jchd
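The UTF-8 rows of the table above can be transcribed directly into code. This hypothetical encoder (a sketch of the bit patterns, not code from AutoIt or any library) distributes a codepoint's bits over 1 to 4 bytes exactly as the table describes:

```python
# Direct transcription of the UTF-8 table: split a codepoint's bits
# across 1-4 bytes according to its magnitude.
def utf8_encode(cp: int) -> bytes:
    if cp < 0x80:                          # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                         # 110yyyyy 10xxxxxx
        return bytes([0xC0 | cp >> 6,
                      0x80 | cp & 0x3F])
    if cp < 0x10000:                       # 1110zzzz 10yyyyyy 10xxxxxx
        return bytes([0xE0 | cp >> 12,
                      0x80 | (cp >> 6 & 0x3F),
                      0x80 | cp & 0x3F])
    return bytes([0xF0 | cp >> 18,         # 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
                  0x80 | (cp >> 12 & 0x3F),
                  0x80 | (cp >> 6 & 0x3F),
                  0x80 | cp & 0x3F])
```

For instance, the euro sign U+20AC comes out as the 3 bytes E2 82 AC seen in the hex dump earlier. A production encoder would additionally reject surrogate codepoints (U+D800 to U+DFFF) and anything above U+10FFFF; this sketch omits those checks.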
Torment Posted November 18, 2011 (Author)

Sorry for the delay in my reply; it's been a hectic week. Thank you, jchd, for the clarification and the examples! That helps me quite a bit (and hopefully anybody else searching the forum for info on this subject). With your help, I now have a much better understanding of UTF-8 and BOMs. Thank you, I appreciate your help!
jchd Posted November 18, 2011

You're welcome to the band. Character sets and encodings are a genuine maze, and this (short) exposition is just the visible part of a huge iceberg. Finding one's way through this utterly complex matter isn't as easy as it looks at first. We've all said at some point "That's too easy, just do ...", only to be proven plainly wrong on looking closely at the details (where, as we know, the devil resides).

To make a long story finally short: the only assurance we can get about a BOM-less file is by testing whether it might be valid UTF-8 no BOM. Files rejected for violating the UTF-8 rules might be in any codepage (or even be pure binary executables).

To draw a parallel with another well-known partial-knowledge situation: it's like using a primality test to decide whether a given integer is a perfect number. If it's prime, you know it can't be perfect, but that alone doesn't give you enough knowledge to reach a decision. At least in that case there is another criterion you can use to decide.
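The one-sided nature of that test is worth seeing concretely. The sketch below uses Python's strict UTF-8 decoder in place of the regular expression mentioned earlier in the thread (an assumption of mine; both accept exactly the well-formed sequences): rejection is conclusive, but acceptance only means "may be UTF-8":

```python
# One-sided test: False is certain, True is only "might be UTF-8".
def may_be_utf8(data: bytes) -> bool:
    try:
        data.decode("utf-8")   # strict decode in place of the regex test
        return True            # may be UTF-8 -- but could still be ANSI
    except UnicodeDecodeError:
        return False           # certain: NOT well-formed UTF-8

# Conclusive rejection: a lone 0xE9 byte ('é' in Latin-1) can never occur
# in well-formed UTF-8.
# Inconclusive acceptance: these bytes are valid UTF-8 ("déjà vu") AND
# valid cp1252 ("dÃ©jÃ  vu") -- only context can say which was intended.
ambiguous = b"d\xc3\xa9j\xc3\xa0 vu"
```

This mirrors the primality analogy above: a failed test settles the question, a passed test does not.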