Wrong result with FileGetEncoding(

mike1950r · July 6, 2021

Hi,

If I use FileGetEncoding() to get the encoding of a file, I always get 256 instead of 512 for an ANSI file.

I checked the format with Notepad++, and it is definitely an ANSI file.

thanks for assistance

cheers mike

jchd · July 6, 2021

Attach an example to stop guesswork.

mike1950r · July 6, 2021

ok, thanks for reply.

I've attached just a normal txt file.

#include <MsgBoxConstants.au3>

Local $iEncoding = FileGetEncoding("test.txt")
MsgBox($MB_TOPMOST, "", $iEncoding, 0)

cheers mike

test.txt

Edited July 6, 2021 by mike1950r

TheXman · July 6, 2021

From the Help File under the "Unicode Support" topic:

File operations on text files not opened with FileOpen() and explicit unicode flags auto-detect encoding similar to most modern editors. This includes all file functions that are used with a filename, for example FileRead("filename.txt"). Specifically:
- Files containing a BOM will be opened in the relevant mode as per that BOM. UTF-8 and UTF-16 BOMs are checked.
- UTF-8 and UTF-16 files without a BOM will be automatically detected and opened in the relevant mode.
- Files containing nulls are opened in Binary ($FO_BINARY) mode by default (unless they are detected as valid UTF-16). Previously they would be opened in ANSI mode. Use the $FO_ANSI flag to override.
- Files containing only characters 1-127 are opened in UTF-8 with no BOM ($FO_UTF8_NOBOM) mode by default. Previously they would be opened in ANSI mode. Use the $FO_ANSI flag to override.
- Files containing only characters 1-255 are opened in ANSI ($FO_ANSI) mode by default.
- Due to the above FileGetEncoding() now returns 512 ($FO_ANSI) or 256 ($FO_UTF8_NOBOM) instead of 0 which was undocumented but indicated ANSI.

Edited July 6, 2021 by TheXman

mike1950r · July 6, 2021

thanks xman,

if i understand you right:

would $iEncoding = $FO_UTF8_NOBOM mean ANSI, just like $iEncoding = $FO_ANSI?

cheers mike

TheXman · July 6, 2021

Not quite. FileGetEncoding() with a file name, as opposed to a handle from a FileOpen() with an explicit encoding flag, auto-detects the encoding. An ANSI file that contains only characters 1-127 is opened as UTF8 no BOM, so the return value is $FO_UTF8_NOBOM.

That was a "code breaking" change that was documented in a previous version of AutoIt. You can look up which version if you feel so inclined.

Edited July 6, 2021 by TheXman

mike1950r · July 6, 2021

Hi again,

this is quite confusing.

Why is an ANSI file opened as a UTF8 no BOM file?
But this would be only for checking the encoding.

In my case I would check the encoding with FileGetEncoding().
If I get ANSI or UTF8 no BOM, I would then open the file with the ANSI flag for overwriting or appending, and save it like that.

Is this ok then?

cheers mike

Nine · July 6, 2021

Not exactly. If you look at the example provided with FileGetEncoding, it says:

; The value returned for this example should be 0 or $FO_ANSI.

But it does not. It returns $FO_UTF8_NOBOM (256). However, if you add a character above 127 (as a comment or whatever), it will now return 512. Like this:

; ¢
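
For anyone who wants to reproduce this outside the Help File example, here is a minimal sketch. It relies only on the detection rules quoted from the Help File earlier in the thread. The two temp file names are made up for the illustration, and the bytes are written in binary mode so the encoding of the script itself plays no role.

```autoit
#include <FileConstants.au3>

; Hypothetical file names, used only for this illustration.
Local $sAsciiFile = @TempDir & "\enc_test_ascii.txt"
Local $sAnsiFile = @TempDir & "\enc_test_ansi.txt"

; File 1: the bytes of "abc", i.e. only characters 1-127.
Local $hFile = FileOpen($sAsciiFile, $FO_OVERWRITE + $FO_BINARY)
FileWrite($hFile, Binary("0x616263"))
FileClose($hFile)

; File 2: "abc" followed by the byte 0xA2 (the cent sign in codepage 1252).
$hFile = FileOpen($sAnsiFile, $FO_OVERWRITE + $FO_BINARY)
FileWrite($hFile, Binary("0x616263A2"))
FileClose($hFile)

; Going by the posts above, the first call should report
; $FO_UTF8_NOBOM (256) and the second $FO_ANSI (512).
ConsoleWrite("ASCII only: " & FileGetEncoding($sAsciiFile) & @CRLF)
ConsoleWrite("With 0xA2:  " & FileGetEncoding($sAnsiFile) & @CRLF)

; Clean up the temp files.
FileDelete($sAsciiFile)
FileDelete($sAnsiFile)
```

A lone 0xA2 byte is not a valid UTF8 sequence, which is why the second file cannot be taken for UTF8.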

mike1950r · July 6, 2021

nine,

thanks a lot for your assistance.

I fear I'm just too stupid to understand.

this really confuses me.

cheers mike

TheXman · July 6, 2021

I don't understand your issue. FileGetEncoding() tells you what encoding was used when the file was opened. If FileGetEncoding() used a file name or a handle gotten from an explicit FileOpen without an encoding flag, then the encoding was determined using a set of predefined rules. Keep in mind that FileGetEncoding, when supplied with a file name, still opens the file.

1 hour ago, mike1950r said:

Why is an ANSI file opened as a UTF8 no BOM file?

Because as long as the file contains only characters 1-127, its bytes are identical in ANSI and UTF8, so UTF8 no BOM can read/write it just as well.

Why don't you discuss the problem you are trying to solve instead of the solution that you've come up with? Maybe there's a better way to do whatever it is you are trying to do.

Edited July 6, 2021 by TheXman

mike1950r · July 6, 2021

thanks xman,

I understood that I should overwrite the UTF8 no BOM file as ANSI, right?
If so, in my case I'll treat a detected encoding of UTF8 no BOM the same as ANSI and overwrite as ANSI.

this is alright for me.
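
A rough sketch of that approach, under one important assumption: the files are known never to contain real UTF8 text. According to the Help File quote above, genuine UTF8 files without a BOM are also reported as $FO_UTF8_NOBOM, so the return value alone cannot tell them apart from a 7-bit ANSI file. _AppendAsAnsi is a hypothetical helper name, not an AutoIt built-in.

```autoit
#include <FileConstants.au3>

; Hypothetical helper for illustration, not an AutoIt built-in.
; Appends $sText to $sPath as ANSI if the file was detected as ANSI or UTF8 no BOM.
Func _AppendAsAnsi($sPath, $sText)
    Local $iEncoding = FileGetEncoding($sPath)
    If $iEncoding = -1 Then Return SetError(1, 0, False) ; encoding could not be determined

    ; ASSUMPTION: a result of $FO_UTF8_NOBOM means "only characters 1-127",
    ; which is byte-identical to ANSI. This is NOT true for real UTF8 files.
    If $iEncoding <> $FO_ANSI And $iEncoding <> $FO_UTF8_NOBOM Then
        Return SetError(2, $iEncoding, False) ; some other encoding, leave the file alone
    EndIf

    Local $hFile = FileOpen($sPath, $FO_APPEND + $FO_ANSI)
    If $hFile = -1 Then Return SetError(3, 0, False)
    FileWrite($hFile, $sText)
    FileClose($hFile)
    Return True
EndFunc   ;==>_AppendAsAnsi
```

Usage would be something like _AppendAsAnsi("test.txt", "new line" & @CRLF), checking @error afterwards.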

(Strange though that other editors like Notepad, Notepad++ etc. are able to detect this kind of file as ANSI.)

Maybe they have another method for detecting the encoding.

Thanks a lot for your help,

and sorry that it took me so long to understand.

fortunately in other themes i'm much faster.

🙂

cheers mike

jchd · July 7, 2021

Here's the underlying issue with text files.

Extended ANSI uses one byte per character and has 128 "upper" character codes [0x80, 0xFF] which are assigned to a set of characters defined by the codepage in use. The codepage is not explicit and this is a problem for information interchange.

Unicode has a very large character set encompassing all glyphs ever used by humans. The range of Unicode characters is [0x000000, 0x10FFFF] which is 1 114 112 possible characters!

Obviously a Unicode character (a codepoint) needs something larger than one byte to represent it, contrary to previous codepages. This is where encoding enters the scene.

A useful encoding is UTF8 which uses sequences of 1 to 4 bytes to represent a character. See UTF8 to understand how this encoding works.
The lower part of ANSI is mapped verbatim to the first 128 Unicode codepoints. In UTF8, a byte > 0x7F introduces a sequence of more than one byte and this sequence has to conform to UTF8 encoding. This is what FileGetEncoding tries to determine.

The word "España" has different representations in Windows Occidental codepage and UTF8:

ANSI Occidental codepage 1252
E  s  p  a  ñ  a
45 73 70 61 F1 61

UTF8 (NoBOM)
E  s  p  a    ñ   a
            ┌─┴─┐
45 73 70 61 C3 B1 61

UTF8 (BOM)
         E  s  p  a    ñ   a
                     ┌─┴─┐
EF BB BF 45 73 70 61 C3 B1 61
└──┬───┘
  BOM

The optional BOM (Byte Order Mark) serves as a special marker to help distinguish UTF8 from byte codepages.

If you FileOpen a file with the first content without specifying a mode, AutoIt will look in the first 64k bytes for invalid UTF8 sequences. If any is found the file will be opened as ANSI, else as UTF8. The sequence 0xF1 0x61 is an invalid UTF8 sequence (0xF1 announces a 4-byte sequence, but 0x61 is not a continuation byte), hence the file is treated as ANSI.

If a file with the second content is mistakenly opened as ANSI it would display as "EspaÃ±a" which is probably not what users want.

If a UTF8 BOM is found, it is ignored but the file is treated as UTF8 without further examination.
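
The byte sequences above can be checked from a script. This is only a sketch, and it assumes a system whose ANSI codepage is 1252; the variable names are made up. ChrW(0xF1) is used for ñ so the result does not depend on how the script file itself is encoded.

```autoit
#include <StringConstants.au3>

; Build "España" without putting a non-ASCII literal in the script.
Local $sWord = "Espa" & ChrW(0xF1) & "a"

Local $dAnsi = StringToBinary($sWord, $SB_ANSI) ; one byte per character
Local $dUtf8 = StringToBinary($sWord, $SB_UTF8) ; ñ becomes a 2-byte sequence

; Deliberately decode the UTF8 bytes with the wrong (ANSI) flag.
Local $sWrong = BinaryToString($dUtf8, $SB_ANSI)

MsgBox(0, "Encoding demo", "ANSI bytes: " & String($dAnsi) & @CRLF & _
        "UTF8 bytes: " & String($dUtf8) & @CRLF & _
        "UTF8 read as ANSI: " & $sWrong)
```

On a codepage 1252 system the two byte strings should match the hex dumps in the diagrams, and the last line shows the typical garbling that appears when UTF8 content is opened as ANSI.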

EDIT:

The file you provided is empty, hence will by default be considered as UTF8 w/o BOM.

Edited July 7, 2021 by jchd

JockoDundee · July 7, 2021

Well said @jchd !

Since you’re so smart, why don’t you explain why sometimes you see all those ???? when opening a file

jchd · July 7, 2021

Most of the time it's because the font used to display file content doesn't have a representation for the Unicode codepoints found. But there may be other reasons. If you have an example I'll be happy to help.

For instance I use the latest DejaVu Sans Mono font for all fixed-size uses, including my SciTE UTF8 console. This allows the following code

; Mixed language strings
$s = "Μεγάλο πρόβλημα  Большая проблема  大问题  बड़ी समस्या  مشكلة كبيرة"
CW($s)

; A family with different Fitzpatrick settings = only one glyph
$s = ChrW(0xD83D) & ChrW(0xDC68) & ChrW(0xD83C) & ChrW(0xDFFB) & ChrW(0x200D) & ChrW(0xD83D) & ChrW(0xDC69) & ChrW(0xD83C) & ChrW(0xDFFF) & ChrW(0x200D) & ChrW(0xD83D) & ChrW(0xDC66) & ChrW(0xD83C) & ChrW(0xDFFD)
CW($s)

to display this (CW() is a Unicode-aware ConsoleWrite):

Μεγάλο πρόβλημα  Большая проблема  大问题  बड़ी समस्या  مشكلة كبيرة
👨🏻‍👩🏿‍👦🏽

You can also open cmd.exe then use chcp 65001 and try to paste the content of the result above. Several codepoints show as blank rectangular placeholders, others as unknown (a question mark in black hexagonal background).

If you use an incomplete Unicode font (no font covers all of Unicode) you're most likely going to see some garbage or rather many question marks, depending on how the font is coded to represent codepoints it has no representation for.

EDIT:

Forgot to mention that there are codepoints reserved for surrogates [0xD800, 0xDFFF] which, if found standalone, cause an invalid codepoint detection by font rendering engines. There are also private use ranges where Unicode doesn't define a representation, and currently unassigned codepoints which may get assigned in a future version of the character set.
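
To show where the ChrW() pairs in the family example come from, here is a small sketch of the standard UTF-16 surrogate calculation. _CodepointToUtf16 is a hypothetical helper name, not an AutoIt built-in, and it does no validation (for instance it does not reject input inside the surrogate range itself).

```autoit
; Hypothetical helper for illustration, not an AutoIt built-in.
; Returns the UTF-16 string for a Unicode codepoint (0x0000 to 0x10FFFF).
Func _CodepointToUtf16($iCodepoint)
    ; Codepoints below 0x10000 fit in a single 16-bit unit.
    If $iCodepoint < 0x10000 Then Return ChrW($iCodepoint)

    Local $iOffset = $iCodepoint - 0x10000
    Local $iHigh = 0xD800 + BitShift($iOffset, 10) ; high surrogate: top 10 bits
    Local $iLow = 0xDC00 + BitAND($iOffset, 0x3FF) ; low surrogate: bottom 10 bits
    Return ChrW($iHigh) & ChrW($iLow)
EndFunc   ;==>_CodepointToUtf16

; U+1F468 (man) gives the pair 0xD83D 0xDC68 used in the code above.
Local $sMan = _CodepointToUtf16(0x1F468)
ConsoleWrite(Hex(AscW(StringMid($sMan, 1, 1)), 4) & " " & Hex(AscW(StringMid($sMan, 2, 1)), 4) & @CRLF)
```

A high surrogate is only meaningful when followed by a low one, which is why a standalone value from that range is flagged as invalid.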

Edited July 7, 2021 by jchd
Typo

mike1950r · July 7, 2021

6 hours ago, jchd said:

ANSI Occidental codepage 1252
E  s  p  a  ñ  a
45 73 70 61 F1 61

UTF8 (NoBOM)
E  s  p  a    ñ   a
            ┌─┴─┐
45 73 70 61 C3 B1 61

UTF8 (BOM)
         E  s  p  a    ñ   a
                     ┌─┴─┐
EF BB BF 45 73 70 61 C3 B1 61
└──┬───┘
  BOM

jchd,

this was very helpful, thanks a lot

cheers mike

JockoDundee · August 11, 2021

On 7/7/2021 at 2:12 AM, jchd said:

But there may be other reasons. If you have an example I'll be happy to help.

As it turns out, your expertise is needed at bogus cybersymposium, as no one knows what to make of:

[attached image: 96945A01-4316-4133-8543-954636C5364F.jpeg]

more information here:
