The FileOpen() guess


1) Short story :

Help file, FileOpen topic :

...When reading without an explicit unicode mode flag, the content of the file is examined and a guess is made whether the file is UTF8, UTF16 or ANSI.

My question is : how is this "guess" made ?
Because in the script below, opening the file "product.dbf" in read mode doesn't detect that it's an ANSI file, so the results are incorrect (the file "product.dbf" is attached at the end of this post).

#include <FileConstants.au3>
#include <MsgBoxConstants.au3>

Opt("MustDeclareVars", 1)

Local $hFileOpen = FileOpen("Product.dbf", $FO_READ)
If $hFileOpen = -1 Then Exit MsgBox($MB_TOPMOST, "", "Open error")

Local $sFileRead = FileRead($hFileOpen)
Local $iKeepError = @error, $iKeepExtended = @extended
If $iKeepError <> 0 Then Exit MsgBox($MB_TOPMOST, "", "Read error")

ConsoleWrite("@extended = " & $iKeepExtended & @crlf) ; 146 (strangely correct)

ConsoleWrite(Asc(StringMid($sFileRead, 1, 1)) & "   " & _
             Asc(StringMid($sFileRead, 2, 1)) & @crlf) ; 48, 120 (should be 3, 121)

FileClose($hFileOpen)

2) Longer story :

This test file was created today with a shareware program, after I encountered the same issue yesterday in this post.

So here is how I created "product.dbf" today :

[Image: product_1.png]

[Image: product_2.png]

I could explain the values found in the memory dump above, but it would be off-topic. Anyway, accurate explanations of the values can be found in this link and/or that link.

Forget the 16 green marked bytes above (they correspond to the 1st record) and let's focus on byte 0 (0x03) and byte 1 (0x79, i.e. 121 in decimal).

3) Back to the initial script :
The values returned by ConsoleWrite are wrong : 48   120
The correct values are 3   121 and you will get them only if you explicitly add $FO_ANSI (512) when opening the file.
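For reference, here is a minimal sketch of that workaround, reusing the file name from the script above. The variable names are only illustrative, and the 3   121 result is the one reported in this post, not something re-verified here.

```autoit
#include <FileConstants.au3>

; Force ANSI decoding instead of relying on the automatic guess.
Local $hFile = FileOpen("Product.dbf", $FO_READ + $FO_ANSI)
If $hFile = -1 Then Exit ConsoleWrite("Open error" & @CRLF)

Local $sData = FileRead($hFile)
FileClose($hFile)

; With $FO_ANSI the post reports 3   121 for the first two bytes.
ConsoleWrite(Asc(StringMid($sData, 1, 1)) & "   " & _
             Asc(StringMid($sData, 2, 1)) & @CRLF)
```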

That's why I asked : how is the FileOpen() guess made ?
Thanks :)

Product.dbf


From what I recall, the leading part of the file is scanned for conformance to one of the UTF8 or UTF16-LE (w/ or w/o BOM) encodings. If ever a byte > 0x7F not introducing a valid UTF8 sequence is found and the file is not valid UTF16-LE, the file is considered codepage encoded, aka ANSI (improper term here). Not mentioned in the help, it seems that if a null byte (0x00) is encountered, then the file is read as binary.

And your example shows exactly this. In the script below, the function vd() is a variable dump (not provided here to keep things short).

#include <FileConstants.au3>
#include <MsgBoxConstants.au3>

Opt("MustDeclareVars", 1)

Local $hFileOpen = FileOpen("Product.dbf", $FO_READ)
If $hFileOpen = -1 Then Exit MsgBox($MB_TOPMOST, "", "Open error")

Local $sFileRead = FileRead($hFileOpen)
Local $iKeepError = @error, $iKeepExtended = @extended
If $iKeepError <> 0 Then Exit MsgBox($MB_TOPMOST, "", "Read error")
FileClose($hFileOpen)
ConsoleWrite("@extended = " & $iKeepExtended & @crlf) ; 146 (strangely correct)

vd($sFileRead, 0, 0, 0)
vd(String($sFileRead), 0, 0)
vd(BinaryMid($sFileRead, 1, 1))
vd(BinaryMid($sFileRead, 2, 1))

ConsoleWrite(Asc(StringMid($sFileRead, 1, 1)) & "   " & _
             Asc(StringMid($sFileRead, 2, 1)) & @crlf) ; 48, 120 (correct!)

The console output I get is:

@extended = 146
Binary (146)             0x03790113030000006100100000000000 ... 6D6E6F7020202020203131312E31311A

String (294)             '0x037901130300000061001000000000 ... 6E6F7020202020203131312E31311A'

Binary (1)               0x03

Binary (1)               0x79

String (3)               '121'

48   120

Here you see that the output of FileRead is a binary variant. StringMid forces this binary to be converted to a string. The first character of this string is '0' whose ASCII code is decimal 48. The next character in the string is 'x' whose ASCII code is decimal 120.
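The same conversion can be reproduced without the file. The sketch below uses a hypothetical 4-byte sample (the first bytes of the dump above) in place of the FileRead result, and shows one possible way to get at the real byte values with BinaryMid().

```autoit
; Hypothetical 4-byte sample standing in for the first bytes of Product.dbf.
Local $dData = Binary("0x03790113")

ConsoleWrite(VarGetType($dData) & @CRLF) ; variant type of the sample
ConsoleWrite(String($dData) & @CRLF)     ; the hex text, starting with "0x"

; StringMid() works on that hex text, so it sees '0' and 'x', not the bytes.
ConsoleWrite(Asc(StringMid($dData, 1, 1)) & "   " & _
             Asc(StringMid($dData, 2, 1)) & @CRLF) ; 48   120

; BinaryMid() extracts the real bytes; Dec(Hex()) turns each one into a number.
ConsoleWrite(Dec(Hex(BinaryMid($dData, 1, 1))) & "   " & _
             Dec(Hex(BinaryMid($dData, 2, 1))) & @CRLF) ; 3   121
```

So when FileRead hands back a binary variant, BinaryMid() (or an explicit $FO_ANSI flag at open time) is the safer route than StringMid().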

Edited by jchd

This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)


Bravo jchd :)

After reading your post, I tested the following: I replaced many bytes with 0x20 (starting from the 1st 0x00, found at position 5, to the end of the file), then :

1) If you leave only 1 byte = 0x00 (pic below) then ConsoleWrite shows :
48  120

[Image: product_3.png]

2) If you overwrite that 0x00 byte with 0x20 (so not a single 0x00 byte exists anymore) then ConsoleWrite shows :
3  121

Shouldn't the help file be amended with your sentence, stipulating that "if a null byte (0x00) is encountered, then the file is read as binary." instead of "a guess is made" ?


This is just guesswork on my part, nothing close to a specification. Only @jpm & @Jon can tell: maybe other control characters also trigger the switch to binary, or even 0x7F.


@jchd The code is actually open-source and published as a library :)

https://github.com/AutoItConsulting/text-encoding-detect

A detailed write-up of how it works on the AutoIt Consulting website: https://www.autoitconsulting.com/site/development/utf-8-utf-16-text-encoding-detection-library/

EasyCodeIt - A cross-platform AutoIt implementation - Fund the development! (GitHub will double your donations for a limited time)

DcodingTheWeb Forum - Follow for updates and Join for discussion


I wasn't aware.

First, from a quick look at the top of this library's C++ code, the presence of NULL(s) denotes binary if UTF16 is ruled out. But we have no proof that this is exactly what current AutoIt implements in full gory detail, even if both pieces of code must be quite similar.
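To make the rule discussed so far concrete, here is a simplified sketch of it in AutoIt. It is not AutoIt's actual implementation and not a port of the library. `_GuessEncoding` is a hypothetical helper. It ignores BOMs and UTF-16 entirely, scans the whole buffer instead of only the leading part, and does not reject overlong UTF-8 forms.

```autoit
; Rough classification of a binary buffer, following the rule of thumb from this thread.
Func _GuessEncoding($dData)
    Local $iLen = BinaryLen($dData), $iByte, $iFollow

    ; Pass 1: any null byte means "binary" (UTF-16 is not considered in this sketch).
    For $i = 1 To $iLen
        If Dec(Hex(BinaryMid($dData, $i, 1))) = 0 Then Return "binary"
    Next

    ; Pass 2: every byte > 0x7F must start or continue a well-formed UTF-8 sequence.
    Local $bHigh = False, $iPos = 1
    While $iPos <= $iLen
        $iByte = Dec(Hex(BinaryMid($dData, $iPos, 1)))
        $iPos += 1
        If $iByte < 0x80 Then ContinueLoop
        $bHigh = True
        If $iByte >= 0xC2 And $iByte <= 0xDF Then
            $iFollow = 1 ; lead byte of a 2-byte sequence
        ElseIf $iByte >= 0xE0 And $iByte <= 0xEF Then
            $iFollow = 2 ; lead byte of a 3-byte sequence
        ElseIf $iByte >= 0xF0 And $iByte <= 0xF4 Then
            $iFollow = 3 ; lead byte of a 4-byte sequence
        Else
            Return "ANSI" ; stray continuation byte or invalid lead byte
        EndIf
        For $j = 1 To $iFollow
            If $iPos > $iLen Then Return "ANSI" ; truncated sequence
            $iByte = Dec(Hex(BinaryMid($dData, $iPos, 1)))
            If $iByte < 0x80 Or $iByte > 0xBF Then Return "ANSI"
            $iPos += 1
        Next
    WEnd

    If $bHigh Then Return "UTF-8 without BOM"
    Return "plain ASCII"
EndFunc

ConsoleWrite(_GuessEncoding(Binary("0x0379011303000000")) & @CRLF)     ; binary
ConsoleWrite(_GuessEncoding(Binary("0xC270C280C280C280C280")) & @CRLF) ; ANSI
ConsoleWrite(_GuessEncoding(Binary("0xC3A9E282ACC3A9C3A9")) & @CRLF)   ; UTF-8 without BOM
```

The first sample is the start of the Product.dbf dump. The other two are arbitrary byte strings chosen to hit the ANSI and UTF-8 branches. The labels are only the verdicts of this sketch, so FileGetEncoding() remains the way to see what AutoIt itself decides.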


6 hours ago, jchd said:

Not mentioned in the help, it seems that if a null byte (0x00) is encountered, then the file is read as binary.

Luckily, I just found it written in the help file, not in the FileOpen topic, but... in the Unicode Support topic :

Files containing nulls are opened in Binary ($FO_BINARY) mode by default (unless they are detected as valid UTF-16). Previously they would be opened in ANSI mode. Use the $FO_ANSI flag to override.

This Unicode topic is found in our .chm help file, when we click this line in FileOpen topic (I discovered this 1 hour ago !) 

See "Unicode Support" for a detailed description.

Now I just tried, without FileOpen, the FileGetEncoding("product.dbf") function :
It returned... 16 (which means binary) for the "product.dbf" file. This value is not indicated in the "Success" return values of the function in the help file (the success list goes from 32 to 512)

Also, the help file example of FileGetEncoding() is a bit strange : it checks for @error, but @error will always be 0. A test with FileGetEncoding("sdfgsdfgggfghhsfg.txt") will pass the @error test and return -1 in the help file example.
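Based on the two observations above, a check along these lines may be more robust. It is only a sketch: it tests the return value for -1 instead of @error, and falls back to $FO_ANSI when the file is reported as binary (16), which is the override the Unicode Support topic mentions. The variable names are illustrative.

```autoit
#include <FileConstants.au3>

Local $sFile = "Product.dbf"
Local $iEncoding = FileGetEncoding($sFile)

; Failure is signalled by the return value -1, so test that rather than @error.
If $iEncoding = -1 Then Exit ConsoleWrite("Cannot determine the encoding of " & $sFile & @CRLF)

; 16 ($FO_BINARY) is what this post reports for a file containing nulls.
Local $iMode = $FO_READ
If $iEncoding = $FO_BINARY Then $iMode += $FO_ANSI ; force ANSI instead of binary

Local $hFile = FileOpen($sFile, $iMode)
If $hFile = -1 Then Exit ConsoleWrite("Open error" & @CRLF)

Local $sData = FileRead($hFile)
FileClose($hFile)

ConsoleWrite("Detected encoding = " & $iEncoding & ", open mode used = " & $iMode & @CRLF)
```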

6 hours ago, jchd said:

If ever a byte > 0x7F not introducing a valid UTF8 sequence is found and the file is not valid UTF16-LE, the file is considered codepage encoded, aka ANSI

Very true ( after test :D ) I just made this test on a 10-byte file :

C2 70 C2 80 C2 80 C2 80 C2 80

It opens as $FO_ANSI (512) because there is no BOM and C2 70 is not a valid UTF-8 sequence [UTF-8 encodes character 127 as 0x7F, then jumps to C2 80 for character 128, according to Wikipedia]

ConsoleWrite would return 194   112 if this was our "product.dbf" in the script above :
0xC2 = 194
0x70 = 112
 

Now, the complementary test : the following 9-byte file would open as $FO_UTF8_NOBOM (256)
C3 A9 E2 82 AC C3 A9 C3 A9

Because there are 4 valid UTF-8 sequences in it :

[Image: 4 valid UTF-8 codes]
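Both tests can be repeated with a short script such as the one below. It is a sketch only: `_EncodingOfBytes` and the file names are hypothetical, the files are written to @TempDir, and the values in the comments are the ones reported in this post, not independently confirmed.

```autoit
#include <FileConstants.au3>

; Hypothetical helper: writes raw bytes to a temporary file and
; returns what FileGetEncoding() reports for it (-1 on failure).
Func _EncodingOfBytes($sName, $dBytes)
    Local $sPath = @TempDir & "\" & $sName
    Local $hFile = FileOpen($sPath, $FO_OVERWRITE + $FO_BINARY)
    If $hFile = -1 Then Return SetError(1, 0, -1)
    FileWrite($hFile, $dBytes)
    FileClose($hFile)
    Local $iEncoding = FileGetEncoding($sPath)
    FileDelete($sPath)
    Return $iEncoding
EndFunc

; 10-byte file: C2 70 is not a valid UTF-8 sequence (reported above: 512, $FO_ANSI).
ConsoleWrite(_EncodingOfBytes("test_ansi.bin", Binary("0xC270C280C280C280C280")) & @CRLF)

; 9-byte file: four valid UTF-8 sequences (reported above: 256, $FO_UTF8_NOBOM).
ConsoleWrite(_EncodingOfBytes("test_utf8.bin", Binary("0xC3A9E282ACC3A9C3A9")) & @CRLF)
```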

Edit: thx @TheDcoder for the 2 links, it looks very interesting.

Edited by pixelsearch