au3check.exe not working


Torment

Thanks for the updated version, Jos! I tested it with files encoded as ANSI/ASCII, UTF-8, UTF-8 no BOM, UTF-16, UTF-16 no BOM, UTF-16 Big Endian, UTF-16 Big Endian no BOM, and Unicode ASCII-escaped. It worked with all encodings except "UTF-16 no BOM" and "UTF-16 Big Endian no BOM". I'm not sure if anybody else uses those encodings, but if you want to keep tinkering with AutoIt3Wrapper I can post some files with those encodings for you. Just let me know.
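In case it helps, here's roughly how I'd generate such test files. A quick Python sketch (not AutoIt; the file names and the sample line are made up), writing in binary mode so the BOM is fully under our control:

    # Sketch: produce one sample file per encoding variant, with and without BOM.
    text = 'MsgBox(0, "Test", "héllo €")\r\n'

    variants = {
        "ansi.au3":          text.encode("cp1252"),                  # ANSI (Western codepage)
        "utf8_bom.au3":      b"\xEF\xBB\xBF" + text.encode("utf-8"),
        "utf8_nobom.au3":    text.encode("utf-8"),
        "utf16le_bom.au3":   b"\xFF\xFE" + text.encode("utf-16-le"),
        "utf16le_nobom.au3": text.encode("utf-16-le"),
        "utf16be_bom.au3":   b"\xFE\xFF" + text.encode("utf-16-be"),
        "utf16be_nobom.au3": text.encode("utf-16-be"),
    }

    for name, data in variants.items():
        with open(name, "wb") as f:    # binary mode: no codec interference
            f.write(data)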

I'm surprised to hear that both of you recommend using a BOM for UTF-8, especially since the Unicode Standard itself states that "Use of a BOM is neither required nor recommended for UTF-8 ..." [page 30, section 2.6 (Encoding Schemes)].

My understanding is that the BOM (Byte Order Mark) was originally devised to indicate the byte order (big-endian vs little-endian) of an encoding. It was not initially intended as an encoding signature, but has since been adopted by many as such. Since UTF-8 has no byte order, though, a BOM can ONLY serve as a signature for that particular encoding. Unfortunately, because the BOM isn't always present in UTF-8, it can't be relied upon to determine the correct encoding. Sure, if it IS there, then you can conclude the encoding is UTF-8. But if it's missing, other means are needed. For that reason, it's been recommended to always have another way to determine the encoding. According to Wikipedia, it's fairly easy to reliably detect UTF-8 with a simple heuristic algorithm.
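As a sketch of the kind of heuristic I mean (assuming its simplest form: check for a BOM signature first, then do a strict trial decode; real detectors are more elaborate):

    # Sketch of a naive encoding guesser: BOM signatures first, then a
    # strict UTF-8 trial decode as the heuristic.
    import codecs

    def guess_encoding(data: bytes) -> str:
        if data.startswith(codecs.BOM_UTF8):
            return "utf-8 (BOM)"
        if data.startswith(codecs.BOM_UTF16_LE):
            return "utf-16-le (BOM)"
        if data.startswith(codecs.BOM_UTF16_BE):
            return "utf-16-be (BOM)"
        try:
            data.decode("utf-8")       # strict: any ill-formed sequence raises
            return "utf-8 (maybe)"     # valid UTF-8, but could still be ANSI
        except UnicodeDecodeError:
            return "ansi (some codepage)"

(Note that FF FE is also how a UTF-32-LE BOM starts, so the order of checks matters in real code.)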

I'm interested in hearing more of your justification for using the BOM with UTF-8. Everything that I've read so far seems to indicate that a BOM should not be used, so if you have information that I haven't come across yet, I'd be interested in hearing it. Is your reasoning based solely on ease of detection? Or are there other factors?

And thank you again for your help with this issue and your work on the AutoIt3Wrapper program, Jos! I really appreciate it! :D


Please read me again: while (as your link correctly points out) few random ANSI files happen to be valid UTF-8-no-BOM files, the converse doesn't hold at all.

Every UTF-8 file is a perfectly (100%) valid ANSI file.

So in the absence of a BOM, and without knowing which ANSI locale (~ Windows codepage) the file uses (which makes things even worse), can you see that we have a really big problem deciding whether a given file is ANSI or UTF-8 no BOM?

The easy lesson is: always use UTF-8 + BOM when you know the file's consumer will process both UTF-8 and the BOM correctly.
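In Python terms, for instance, this is exactly what the 'utf-8-sig' codec does (a sketch; the file name is made up): it writes the EF BB BF signature on output and strips it back off on input:

    # Sketch: 'utf-8-sig' adds the BOM when writing, removes it when reading.
    with open("prices.csv", "w", encoding="utf-8-sig", newline="") as f:
        f.write('"141-5489-2",1214.56₨,91.0514₪,18.13€\n')

    with open("prices.csv", "r", encoding="utf-8-sig") as f:
        print(f.read())    # the BOM has already been stripped by the codec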

If you don't agree, I'll be pleased to exhibit counter-examples at will where the mystical "heuristic algorithm" falls flat on its face.


Please don't misunderstand me. I'm not trying to say that you're wrong. I'm just trying to understand your position on this matter. :D

If I understand you correctly, the issue with not using a BOM is that a UTF-8 file might be misinterpreted as an ANSI file. Maybe I don't know enough about the differences between the two, but where would that kind of misinterpretation cause a problem? Can you give me an example that might help me wrap my mind around this a little better?

Thank you for the response. I appreciate it!


No, of course I don't take it badly. We're not playing a "mine is bigger than yours" game.

1) Case input file is UTF-8 no BOM

Look at this very simple example: suppose you receive a price list containing the following input, encoded in UTF-8 no BOM:

"141-5489-2",1214.56₨,91.0514₪,18.13€

You have the item reference and a list of prices in various currencies: here they are listed in Indian Rupees (₨), Israeli New Shekels (₪) and Euros (€).

Since your file has no BOM, it is likely to be interpreted as perfectly valid ANSI (all files are!), and this is where the issues arise.

To see why, let's look at a hex dump of the file (the text column at the right is bogus for chars > 0x7F, so don't even look there):

0000H  22 31 34 31 2D 35 34 38 39 2D 32 22 2C 31 32 31   '"141-5489-2",121'
0010H  34 2E 35 36 E2 82 A8 2C 39 31 2E 30 35 31 34 E2   '4.56b.(,91.0514b'
0020H  82 AA 2C 31 38 2E 31 33 E2 82 AC               '.*,18.13b.,'

Say you use the Latin Windows codepage (1252); then your file reads like this:

"141-5489-2",1214.56₨,91.0514₪,18.13€

Not only have the currencies become meaningless, but disturbing commas have magically appeared which completely ruin the structure of your CSV file.

Now if you receive the very same file as a Russian user, using the Cyrillic Windows 1251 codepage, then the same file will display:

"141-5489-2",1214.56₨,91.0514₪,18.13€

Geez, Cyrillic letters and yet more bogus commas got thrown in, due to byte-by-byte misinterpretation of course!

Say you are in Japan and use the Japanese Shift-JIS Windows codepage (932); then your file will display yet another kind of garbage:

"141-5489-2",1214.56竄ィ,91.0514竄ェ,18.13竄ャ

And so on.
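These misreadings are trivial to reproduce, by the way. A Python sketch (same bytes, same codepages as above):

    # Sketch: one UTF-8 byte string re-read through several ANSI codepages.
    data = '"141-5489-2",1214.56₨,91.0514₪,18.13€'.encode("utf-8")

    for codepage in ("cp1252", "cp1251", "cp932"):    # Latin, Cyrillic, Japanese
        print(codepage, "->", data.decode(codepage, errors="replace"))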

Depending on your choice of ANSI codepage at your end, the same file will display apparently random garbage, yet _all_ of these are perfectly valid ANSI files; no rule is violated, because ANSI has no rule regarding the encoding of characters.

In short, you just can't decide what to do with this file, unless some oracle instructs you that the file is UTF-8 no BOM.

On the contrary, it _is_ extremely easy and fast to decide whether a given file _may_ be interpreted as UTF-8 no BOM: just use a simple regular expression (as Jos and I tested some time ago). But this is a half-baked answer: you learn either that the file isn't valid UTF-8 (with justified 100% assurance), or that it may be UTF-8 ... or maybe it's not! That is why no algorithm can replace human inspection and a basic understanding of the semantics of the text. Not even Siri will get anywhere close to that.
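For reference, here is a sketch of such a test using the well-known W3C byte pattern for well-formed UTF-8 (not necessarily the exact expression we used back then):

    # Sketch: a bytes regex matching exactly the well-formed UTF-8 sequences.
    import re

    UTF8_PATTERN = re.compile(rb"""^(?:
          [\x00-\x7F]                          # ASCII
        | [\xC2-\xDF][\x80-\xBF]               # 2-byte sequences, no overlongs
        | \xE0[\xA0-\xBF][\x80-\xBF]           # 3-byte, no overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
        | \xED[\x80-\x9F][\x80-\xBF]           # no UTF-16 surrogates
        | \xF0[\x90-\xBF][\x80-\xBF]{2}        # 4-byte, no overlongs
        | [\xF1-\xF3][\x80-\xBF]{3}
        | \xF4[\x80-\x8F][\x80-\xBF]{2}        # nothing above U+10FFFF
    )*$""", re.X)

    def may_be_utf8(data: bytes) -> bool:
        # True  -> the file MAY be UTF-8 (an ANSI file can still match by luck).
        # False -> the file is definitely NOT valid UTF-8.
        return UTF8_PATTERN.match(data) is not None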

2) Case input file is ANSI

Well, but which ANSI, or is it UTF-8?

Worse, it isn't easy to be an oracle in this case either: without an external hint, you don't know whether the producer of the file simply used his own ANSI codepage to write it. Look at where the Euro symbol sits in the Unicode (first) display (the last character) and find it in the hex dump: it takes 3 bytes, E2 82 AC, in UTF-8. Even though it's possible to interpret this sequence as UTF-8, it is equally valid to interpret it as 3 ANSI characters from some unknown codepage. Not knowing which, it could well be an acronym meaning 'Ex special Tax' or 'worldwide shipping cost included' or 'obsolete' in some Indic script (plenty of choices there) or East-Asian codepage (more interesting choices there as well).

Even if you restrict files to AutoIt sources, you still have a big problem with string literals, which may contain sequences of bytes that have a meaning in the user's codepage yet resemble UTF-8. The article you quoted is right to point out that the chances of such misinterpretation of ANSI as UTF-8 are low and that such files are uncommon; it remains that the chances aren't zero, by far.

Without bold assumptions that can be proven wrong at any time, there is no way out of the dilemma, short of switching to a BOMmed UTF of some sort, which removes all ambiguity.

FYI, here's how a Unicode codepoint (the numeric value of a character's position) gets encoded in the UTFs:

** Notes on UTF-8:
**
**   Byte-0    Byte-1    Byte-2    Byte-3     Value
**  0xxxxxxx                                 00000000 00000000 0xxxxxxx
**  110yyyyy  10xxxxxx                       00000000 00000yyy yyxxxxxx
**  1110zzzz  10yyyyyy  10xxxxxx             00000000 zzzzyyyy yyxxxxxx
**  11110uuu  10uuzzzz  10yyyyyy  10xxxxxx   000uuuuu zzzzyyyy yyxxxxxx
**
**
** Notes on UTF-16:  (with wwww+1 == uuuuu)
**
**    Word-0              Word-1              Value
**  110110ww wwzzzzyy   110111yy yyxxxxxx   000uuuuu zzzzyyyy yyxxxxxx
**  zzzzyyyy yyxxxxxx                       00000000 zzzzyyyy yyxxxxxx
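To make those bit layouts concrete, here's a sketch that encodes a codepoint by hand, straight from the rows above (with the € from the hex dump as a sanity check):

    # Sketch: hand-rolled UTF-8 encoder following the bit-layout table above.
    def utf8_encode(cp: int) -> bytes:
        if cp < 0x80:                                  # 0xxxxxxx
            return bytes([cp])
        if cp < 0x800:                                 # 110yyyyy 10xxxxxx
            return bytes([0xC0 | cp >> 6, 0x80 | cp & 0x3F])
        if cp < 0x10000:                               # 1110zzzz 10yyyyyy 10xxxxxx
            return bytes([0xE0 | cp >> 12,
                          0x80 | cp >> 6 & 0x3F,
                          0x80 | cp & 0x3F])
        return bytes([0xF0 | cp >> 18,                 # 11110uuu 10uuzzzz ...
                      0x80 | cp >> 12 & 0x3F,
                      0x80 | cp >> 6 & 0x3F,
                      0x80 | cp & 0x3F])

    # Sketch: UTF-16 code units per the second table (wwww == uuuuu - 1).
    def utf16_units(cp: int) -> list:
        if cp < 0x10000:                               # one word: zzzzyyyy yyxxxxxx
            return [cp]
        v = cp - 0x10000                               # 20 bits left after the offset
        return [0xD800 | v >> 10, 0xDC00 | v & 0x3FF]  # 110110wwwwzzzzyy 110111yyyyxxxxxx

    assert utf8_encode(0x20AC) == b"\xE2\x82\xAC"      # the Euro from the hex dump above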

I hope this clarifies my statements. Don't hesitate to tell me if something is still unclear.

Edited by jchd


Sorry for the delay in my reply. It's been a hectic week. Thank you jchd for the clarification and the examples! That helps me quite a bit (and hopefully anybody else searching the forum for info on this subject). With your help, I now have a much better understanding of UTF-8 and BOMs. Thank you, I appreciate your help! :D


Welcome to the club. Character sets and encodings are a genuine maze, and this (short) exposition is just the visible part of a huge iceberg. Finding one's way through this utterly complex matter isn't as obvious as it first looks. We've all said at some point "That's too easy, just do ...", only to be proven plainly wrong once we looked closely at the details (where, as we know, the devil resides).

To make a long story finally short: the only assurance we can get about a BOM-less file comes from testing whether it might be valid UTF-8 no BOM. The files that get rejected for violating the UTF-8 rules might be in any codepage whatsoever (or even be pure binary, like an executable).

To draw a parallel with another well-known partial-knowledge situation, it's like using a primality test to decide whether a given integer is a perfect number. If the number is prime, you know it can't be perfect; but if it isn't prime, that alone doesn't give you enough knowledge to decide. At least in that case there is another criterion one can use to settle the question.

