Wrong result with FileGetEncoding()


Hi,

If I use FileGetEncoding() to get the encoding of a file, I always get 256 instead of 512 for an ANSI file.

I can check the format with Notepad++, and it is definitely an ANSI file.

thanks for assistance

 

cheers mike


Attach an example to stop guesswork.

This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)

Posted (edited)

From the Help File under the "Unicode Support" topic:

  • File operations on text files not opened with FileOpen() and explicit unicode flags auto-detect encoding similar to most modern editors. This includes all file functions that are used with a filename, for example FileRead("filename.txt"). Specifically:
    • Files containing a BOM will be opened in the relevant mode as per that BOM. UTF-8 and UTF-16 BOMs are checked.
    • UTF-8 and UTF-16 files without a BOM will be automatically detected and opened in the relevant mode.
    • Files containing nulls are opened in Binary ($FO_BINARY) mode by default (unless they are detected as valid UTF-16). Previously they would be opened in ANSI mode. Use the $FO_ANSI flag to override.
    • Files containing only characters 1-127 are opened in UTF-8 with no BOM ($FO_UTF8_NOBOM) mode by default. Previously they would be opened in ANSI mode. Use the $FO_ANSI flag to override.
    • Files containing only characters 1-255 are opened in ANSI ($FO_ANSI) mode by default.
    • Due to the above FileGetEncoding() now returns 512 ($FO_ANSI) or 256 ($FO_UTF8_NOBOM) instead of 0 which was undocumented but indicated ANSI.
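
The rules quoted above can be tried out with a small probe script. This is only a sketch: `_ProbeEncoding` and the scratch file name are made up for illustration, and the values in the comments are what the quoted rules predict, not verified output.

```autoit
#include <FileConstants.au3>

; Hypothetical helper (not an AutoIt built-in): writes raw bytes to a scratch file
; and returns what FileGetEncoding() auto-detects when given the file name.
Func _ProbeEncoding($dBytes)
    Local $sPath = @TempDir & "\encoding_probe.txt" ; scratch path is an assumption
    Local $hFile = FileOpen($sPath, $FO_OVERWRITE + $FO_BINARY)
    If $hFile = -1 Then Return SetError(1, 0, -1)
    FileWrite($hFile, $dBytes)
    FileClose($hFile)
    Local $iEncoding = FileGetEncoding($sPath)
    FileDelete($sPath)
    Return $iEncoding
EndFunc

; Only characters 1-127 ("Espana"): the rules predict $FO_UTF8_NOBOM (256)
ConsoleWrite("ASCII only     : " & _ProbeEncoding(Binary("0x457370616E61")) & @CRLF)
; One byte above 127 that is not valid UTF-8 (0xF1): the rules predict $FO_ANSI (512)
ConsoleWrite("ANSI, byte 0xF1: " & _ProbeEncoding(Binary("0x45737061F161")) & @CRLF)
; Valid UTF-8 sequence C3 B1 without BOM: the rules predict $FO_UTF8_NOBOM (256)
ConsoleWrite("UTF-8, no BOM  : " & _ProbeEncoding(Binary("0x45737061C3B161")) & @CRLF)
; UTF-8 BOM (EF BB BF) present: the rules predict $FO_UTF8 (128)
ConsoleWrite("UTF-8 with BOM : " & _ProbeEncoding(Binary("0xEFBBBF45737061C3B161")) & @CRLF)
```

Writing the bytes in binary mode keeps AutoIt from re-encoding the test data, so the only thing being measured is the auto-detection itself.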
Edited by TheXman
Posted (edited)

Not quite. FileGetEncoding() with a file name, as opposed to a handle that opened the file with an explicit encoding flag, will open an ANSI file that contains only characters 1-127 as UTF8 no BOM. So the return value is $FO_UTF8_NOBOM.

 

That was a "code breaking" change that was documented in a previous version of AutoIt.  You can look up which version if you feel so inclined.

Edited by TheXman

Hi again,

This is quite confusing.

Why is an ANSI file opened as a UTF8 no BOM file?
But this would only matter for checking the encoding.

In my case I would check the encoding with FileGetEncoding().
If I get ANSI or UTF8 no BOM, I would then open the file with the ANSI flag for overwriting or appending, and save it like that.

Is this OK then?

cheers mike


Not exactly. If you look at the example provided with FileGetEncoding, it says:

; The value returned for this example should be 0 or $FO_ANSI.

But it is not. It returns $FO_UTF8_NOBOM (256). However, if you add a character above 127 (as a comment or whatever), it will now return 512. Like this:

; ¢

:)

Posted (edited)

I don't understand your issue.  FileGetEncoding() tells you what encoding was used when the file was opened.  If FileGetEncoding() used a file name or a handle gotten from an explicit FileOpen without an encoding flag, then the encoding was determined using a set of predefined rules.  Keep in mind that FileGetEncoding, when supplied with a file name, still opens the file.

1 hour ago, mike1950r said:

why is an ansi file opened as an utf8 nobom file.

Because UTF8 no BOM can read/write an ANSI encoded file, as long as it contains only characters 1-127: those bytes are identical in both encodings.

 

Why don't you discuss the problem you are trying to solve instead of the solution that you've come up with?  Maybe there's a better way to do whatever it is you are trying to do.

 

Edited by TheXman

thanks xman,

I understood that I should overwrite the UTF8 no BOM with ANSI, right?
If so, in my case I treat the detected encoding UTF8 no BOM as ANSI and overwrite as ANSI.

this is alright for me.
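
The approach described above could look roughly like this. It is a minimal sketch, not a recommended implementation: `_AppendAsAnsi` and the example path are hypothetical, and it simply refuses to write when the detected encoding is anything other than ANSI or UTF8 no BOM.

```autoit
#include <FileConstants.au3>

; Hypothetical helper: append one line to a file, writing it as ANSI when
; FileGetEncoding() reports ANSI (512) or UTF8 without BOM (256).
Func _AppendAsAnsi($sPath, $sLine)
    Local $iEncoding = FileGetEncoding($sPath)
    If $iEncoding = -1 Then Return SetError(1, 0, False) ; file could not be examined

    ; A file with only characters 1-127 is reported as $FO_UTF8_NOBOM,
    ; so both values are treated as "ANSI" here.
    If $iEncoding <> $FO_ANSI And $iEncoding <> $FO_UTF8_NOBOM Then
        Return SetError(2, $iEncoding, False) ; some other encoding: do not touch the file
    EndIf

    Local $hFile = FileOpen($sPath, $FO_APPEND + $FO_ANSI)
    If $hFile = -1 Then Return SetError(3, 0, False)
    FileWriteLine($hFile, $sLine)
    FileClose($hFile)
    Return True
EndFunc

; Example call; the path is an assumption
If Not _AppendAsAnsi(@ScriptDir & "\notes.txt", "appended line") Then
    ConsoleWrite("Append failed, @error = " & @error & @CRLF)
EndIf
```

One caveat with this shortcut: a file that really is UTF8 without BOM and contains multi-byte sequences is also reported as 256, and appending ANSI text to it would mix two encodings in one file.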

(Strange, though, that other editors like Notepad, Notepad++ etc. are able to detect this kind of file as ANSI.)

Maybe they have another method for detecting the encoding.

Thanks a lot for your help, and excuse my slow understanding.

Fortunately, in other topics I'm much faster.

🙂

cheers mike

Posted (edited)

Here's the underlying issue with text files.

Extended ANSI uses one byte per character and has 128 "upper" character codes [0x80, 0xFF] which are assigned to a set of characters defined by the codepage in use. The codepage is not explicit and this is a problem for information interchange.

Unicode has a very large character set intended to cover all writing systems. The range of Unicode codepoints is [0x000000, 0x10FFFF], which is 1 114 112 possible codepoints!

Obviously a Unicode character (a codepoint) needs more than one byte to be represented, contrary to the previous codepages. This is where encoding enters the scene.

A useful encoding is UTF8, which uses sequences of 1 to 4 bytes to represent a character. See UTF8 to understand how this encoding works.
The lower part of ANSI is mapped verbatim to the first 128 Unicode codepoints. In UTF8, a byte > 0x7F is always part of a sequence of more than one byte, and this sequence has to conform to the UTF8 encoding. This is what FileGetEncoding tries to determine.

The word "España" has different representations in Windows Occidental codepage and UTF8:

ANSI Occidental codepage 1252
E  s  p  a  ñ  a
45 73 70 61 F1 61

UTF8 (NoBOM)
E  s  p  a    ñ   a
            ┌─┴─┐
45 73 70 61 C3 B1 61

UTF8 (BOM)
         E  s  p  a    ñ   a
                     ┌─┴─┐
EF BB BF 45 73 70 61 C3 B1 61
└──┬───┘
  BOM

The optional BOM (Byte Order Mark) serves as a special marker to help distinguish UTF8 from byte codepages.
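
The byte sequences shown above can be reproduced with StringToBinary(). This is a small sketch; the ANSI line assumes the system ANSI codepage is 1252 and will differ on other codepages.

```autoit
#include <StringConstants.au3>

; Build the word with ChrW() so the result does not depend on how this script file is saved.
Local $sWord = "Espa" & ChrW(0xF1) & "a"

; With codepage 1252 as the system ANSI codepage this prints 0x45737061F161
ConsoleWrite("ANSI : " & StringToBinary($sWord, $SB_ANSI) & @CRLF)
; Prints 0x45737061C3B161 (the single ñ becomes the two bytes C3 B1)
ConsoleWrite("UTF8 : " & StringToBinary($sWord, $SB_UTF8) & @CRLF)
```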

If you FileOpen a file with the first content (the ANSI bytes) without specifying a mode, AutoIt scans the first 64k bytes for invalid UTF8 sequences. If any is found, the file is opened as ANSI, otherwise as UTF8. The sequence 0xF1 0x61 is an invalid UTF8 sequence, hence the file is treated as ANSI.
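
To illustrate what "invalid UTF8 sequence" means, here is a simplified structural check. It is not AutoIt's actual detection code: `_LooksLikeUtf8` is a hypothetical helper that only verifies lead and continuation byte patterns and ignores overlong forms, surrogates and the 64k limit.

```autoit
; Hypothetical helper: True if every byte > 0x7F belongs to a well-formed
; multi-byte sequence (lead byte followed by the right number of 10xxxxxx bytes).
Func _LooksLikeUtf8($dBytes)
    Local $iLen = BinaryLen($dBytes), $i = 1, $iByte, $iExtra
    While $i <= $iLen
        $iByte = Dec(Hex(BinaryMid($dBytes, $i, 1)))
        Select
            Case $iByte < 0x80
                $iExtra = 0 ; plain ASCII byte
            Case BitAND($iByte, 0xE0) = 0xC0
                $iExtra = 1 ; lead byte of a 2-byte sequence
            Case BitAND($iByte, 0xF0) = 0xE0
                $iExtra = 2 ; lead byte of a 3-byte sequence
            Case BitAND($iByte, 0xF8) = 0xF0
                $iExtra = 3 ; lead byte of a 4-byte sequence
            Case Else
                Return False ; stray continuation byte or invalid lead byte
        EndSelect
        If $i + $iExtra > $iLen Then Return False ; sequence is cut off
        For $j = 1 To $iExtra
            ; every following byte must look like 10xxxxxx
            If BitAND(Dec(Hex(BinaryMid($dBytes, $i + $j, 1))), 0xC0) <> 0x80 Then Return False
        Next
        $i += $iExtra + 1
    WEnd
    Return True
EndFunc

ConsoleWrite(_LooksLikeUtf8(Binary("0x45737061F161")) & @CRLF)   ; False: F1 announces 3 more bytes, only 1 follows
ConsoleWrite(_LooksLikeUtf8(Binary("0x45737061C3B161")) & @CRLF) ; True: C3 B1 is a well-formed 2-byte sequence
```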

If a file with the second content is mistakenly opened as ANSI, it would display as "EspaÃ±a" (the bytes C3 and B1 shown as two separate codepage 1252 characters), which is probably not what users want.
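
The misreading can be simulated without any file by decoding UTF8 bytes as ANSI. A short sketch, again assuming codepage 1252 as the system ANSI codepage:

```autoit
#include <MsgBoxConstants.au3>
#include <StringConstants.au3>

; UTF8 bytes of "España": 45 73 70 61 C3 B1 61
Local $dUtf8 = StringToBinary("Espa" & ChrW(0xF1) & "a", $SB_UTF8)

; Decode the same bytes as if they were ANSI: each byte becomes one character
Local $sMisread = BinaryToString($dUtf8, $SB_ANSI)

; On a single-byte codepage this reports 7 characters instead of 6
ConsoleWrite("Length when misread: " & StringLen($sMisread) & @CRLF)
MsgBox($MB_OK, "UTF8 bytes read as ANSI", $sMisread)
```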

If a UTF8 BOM is found, it is ignored but the file is treated as UTF8 without further examination.

 

EDIT:

The file you provided is empty, hence will by default be considered as UTF8 w/o BOM.

Edited by jchd

Posted (edited)

Most of the time it's because the font used to display file content doesn't have a representation for the Unicode codepoints found. But there may be other reasons. If you have an example I'll be happy to help.

For instance I use the latest DejaVu Sans Mono font for all fixed-size uses, including my SciTE UTF8 console. This allows the following code

; Mixed language strings
$s = "Μεγάλο πρόβλημα  Большая проблема  大问题  बड़ी समस्या  مشكلة كبيرة"
CW($s)

; A family with different Fitzpatrick settings = only one glyph
$s = ChrW(0xD83D) & ChrW(0xDC68) & ChrW(0xD83C) & ChrW(0xDFFB) & ChrW(0x200D) & ChrW(0xD83D) & ChrW(0xDC69) & ChrW(0xD83C) & ChrW(0xDFFF) & ChrW(0x200D) & ChrW(0xD83D) & ChrW(0xDC66) & ChrW(0xD83C) & ChrW(0xDFFD)
CW($s)

to display this (CW() is a Unicode-aware ConsoleWrite):

Μεγάλο πρόβλημα  Большая проблема  大问题  बड़ी समस्या  مشكلة كبيرة
👨🏻‍👩🏿‍👦🏽

You can also open cmd.exe then use chcp 65001 and try to paste the content of the result above. Several codepoints show as blank rectangular placeholders, others as unknown (a question mark in black hexagonal background).

If you use a Unicode font with poor coverage (no font covers all of Unicode) you're most likely going to see some garbage or rather many question marks, depending on how the font is coded to represent codepoints it has no representation for.
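
The definition of CW() is not shown in this post. One common way to write such a Unicode-aware ConsoleWrite, assuming the receiving console (for instance SciTE's output pane) is set to UTF8, is sketched below; treat it as an assumption rather than the exact function used above.

```autoit
#include <StringConstants.au3>

; Possible CW(): convert the string to UTF8 bytes, then hand those bytes to
; ConsoleWrite() as a string in which every character carries exactly one byte.
Func CW($sText)
    ConsoleWrite(BinaryToString(StringToBinary($sText & @LF, $SB_UTF8), $SB_ANSI))
EndFunc

CW("Espa" & ChrW(0xF1) & "a")
```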

EDIT:

Forgot to mention that there are codepoints reserved for surrogates [0xD800, 0xDFFF] which, if found standalone, are detected as invalid codepoints by font rendering engines. There are also private use ranges where Unicode doesn't define a representation, and currently unassigned codepoints which may get assigned in a future version of the character set.
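
The ChrW() pairs in the family example above are such surrogates used correctly, two per codepoint. The standard UTF-16 arithmetic behind them can be sketched like this; `_CodepointToUtf16` is a hypothetical helper name.

```autoit
; Hypothetical helper: returns the AutoIt (UTF-16) string for one Unicode codepoint.
Func _CodepointToUtf16($iCodepoint)
    ; Codepoints up to 0xFFFF fit in a single 16-bit unit
    If $iCodepoint < 0x10000 Then Return ChrW($iCodepoint)
    ; Above that, the value minus 0x10000 is split into two 10-bit halves
    Local $iOffset = $iCodepoint - 0x10000
    Local $iHigh = 0xD800 + BitShift($iOffset, 10) ; positive shift = shift right
    Local $iLow = 0xDC00 + BitAND($iOffset, 0x3FF)
    Return ChrW($iHigh) & ChrW($iLow)
EndFunc

; U+1F468 (MAN) becomes the pair D83D DC68, as used in the example above
Local $sMan = _CodepointToUtf16(0x1F468)
ConsoleWrite(Hex(AscW(StringMid($sMan, 1, 1)), 4) & " " & Hex(AscW(StringMid($sMan, 2, 1)), 4) & @CRLF)
```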

Edited by jchd
Typo

6 hours ago, jchd said:

ANSI Occidental codepage 1252
E  s  p  a  ñ  a
45 73 70 61 F1 61

UTF8 (NoBOM)
E  s  p  a    ñ   a
            ┌─┴─┐
45 73 70 61 C3 B1 61

UTF8 (BOM)
         E  s  p  a    ñ   a
                     ┌─┴─┐
EF BB BF 45 73 70 61 C3 B1 61
└──┬───┘
  BOM

 

jchd,

this was very helpful, thanks a lot

cheers mike

 
