UTF-8 encoding/charset detection

Pardalito

Hello,

Is there a way to detect a file's encoding/charset?

I need to detect whether my file(s) are encoded in UTF-8 (without BOM).

I tried reading the first 3 bytes (shown in hex), but the value changes from file to file:

0x3C3F70

0x3C3439

0x3C6D65

0x3C6874

0x3C7461

0x3C2144

0x093C64 (UTF-8 without BOM)

0xEFBBBF

...

Does anyone know a good way to tell whether a file is encoded in UTF-8 without BOM?

Best regards, Pardalito.

Edit: Typo

jchd

Pardalito wrote: "Does anyone know a good way to tell whether a file is encoded in UTF-8 without BOM?"

Obviously, finding a BOM as in your last example line is the easiest case.
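
The BOM check can be sketched in a few lines. This is a Python illustration rather than AutoIt, and `detect_bom` and `BOMS` are names made up for the example. It only recognises files that start with a byte-order mark, so a None result says nothing about BOM-less UTF-8.

```python
import codecs

# Known byte-order marks. UTF-32 LE comes before UTF-16 LE because its BOM
# (FF FE 00 00) begins with the UTF-16 LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32 LE"),
    (codecs.BOM_UTF32_BE, "UTF-32 BE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16 LE"),
    (codecs.BOM_UTF16_BE, "UTF-16 BE"),
]


def detect_bom(head: bytes):
    """Return the encoding named by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if head.startswith(bom):
            return name
    return None


print(detect_bom(bytes.fromhex("EFBBBF")))  # last sample from the question
print(detect_bom(bytes.fromhex("3C3F70")))  # "<?p": no BOM present
```

All the other samples in the question start with 0x3C ("<") or a tab, which is just the beginning of the markup and carries no encoding information.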

The ambiguity between a codepage (whichever it is) and UTF-8 w/o BOM is more difficult. Jon recently had a thread about this here, and the result made its way into the latest release. So current AutoIt does this automagically, but as far as I have seen, the detected encoding is not exposed by the read functions.

If you really need this information, what you should probably do is read the whole file, check each successive byte sequence for valid UTF-8 encoding, and exit at the first invalid one. This will be very slow if coded in AutoIt, but it can be done with the help of a small .DLL if you need to do it routinely.
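
As an illustration of "validate and stop at the first invalid sequence", here is a minimal Python sketch (not AutoIt, and not the .DLL mentioned above). It leans on Python's strict UTF-8 decoder, and the function names are hypothetical.

```python
def first_invalid_utf8_offset(data: bytes):
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")  # strict decoding raises at the first bad sequence
    except UnicodeDecodeError as err:
        return err.start
    return None


def is_valid_utf8_file(path: str) -> bool:
    """Read the whole file and report whether every byte sequence is valid UTF-8."""
    with open(path, "rb") as fh:
        return first_invalid_utf8_offset(fh.read()) is None


print(first_invalid_utf8_offset("h\u00e9llo".encode("utf-8")))   # valid UTF-8
print(first_invalid_utf8_offset("h\u00e9llo".encode("cp1252")))  # ANSI bytes
```

Note that a file made only of 7-bit bytes passes this test too, since plain ASCII is valid UTF-8.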

May I ask what the purpose of your request is?


This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)

Pardalito

Hello jchd,

I need an app that scans my web files (.php) and logs the files that don’t have UTF-8 charset/encoding.

For example:

I have 1000 files, and 2 of them are in ANSI.

When I run this future app, the log must include those 2 ANSI files.

In the thread that you mention, I don’t see any function that detects UTF-8 files.

In the latest release of AutoIt, I don’t see any procedure to detect the file charset.

Any help?

Thanks for your reply.

Best regards, Pardalito.

Pardalito

Hello again,

In the release notes of v3.3.4.0:

Added: Ability to read and write UTF-8 files with no BOM including automatic detection during reading.

Sorry... :D But I don't see any function/procedure in the includes folder... :huggles:

Best regards, Pardalito.

jchd

Am I misrepresenting your need if I say it's a one-time conversion?

If it's the case, then I believe you can brute force the conversion very easily.

Read every file (FileRead will switch the read to ANSI or UTF-8 w/o BOM transparently)

Rewrite it, forcing UTF-8 with BOM.

It will be as fast as possible and as fail-proof as the auto-detection routine that Jon has introduced.

Keep a list of files already converted to apply the procedure to new files only to save time (if that's important).
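
For readers who want to see the shape of that brute-force pass, here is a rough Python analogue (not AutoIt). Everything in it is an assumption made for illustration: the function name, and in particular the choice of Windows-1252 as the "ANSI" fallback codepage, which you would replace with whatever codepage your files really use.

```python
def convert_to_utf8_bom(path: str, fallback: str = "cp1252") -> str:
    """Rewrite the file as UTF-8 with BOM; return the encoding it was read as."""
    with open(path, "rb") as fh:
        raw = fh.read()
    try:
        # "utf-8-sig" accepts UTF-8 with or without BOM and strips an existing BOM.
        text, used = raw.decode("utf-8-sig"), "utf-8"
    except UnicodeDecodeError:
        # Not valid UTF-8: assume the fallback codepage (this may still raise
        # if the file contains bytes that the codepage does not define).
        text, used = raw.decode(fallback), fallback
    with open(path, "wb") as fh:
        fh.write(text.encode("utf-8-sig"))  # writes EF BB BF, then the text
    return used
```

Running it twice on the same file is harmless: the second pass reads UTF-8 with BOM and writes back the same bytes. The return value doubles as a log entry for files that were not UTF-8.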

Would it work?


jchd

Pardalito wrote: "Sorry... But I don't see any function/procedure in the includes folder..."

You won't find a UDF for that. The feature is built into the FileRead* functions, which are part of the AutoIt core.

Pardalito

Hello jchd,

This is a nice idea... Slower, but a nice idea.

If you could pick out only the ANSI files and convert just those files to UTF-8, then you would have a fast process.

For future versions, a function that determines a file's charset/encoding would be welcome.

Something like this:

FileEncoding('ansi.php') = 1

FileEncoding('utf-8.php') = 2

FileEncoding('utf-16.php') = 3

...

1 = ANSI Charset

2 = UTF-8 Charset

3 = UTF-16 Charset

...
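
To make the proposal concrete, here is a hedged Python sketch of what such a function could look like, together with the "log the non-UTF-8 files" use case. It is not an AutoIt feature. `file_encoding` and `non_utf8_files` are hypothetical names, UTF-16 is recognised only by its BOM, "ANSI" simply means "not valid UTF-8", and a file containing only 7-bit bytes is reported as UTF-8, which is a choice rather than a fact.

```python
import codecs
from pathlib import Path

ANSI, UTF8, UTF16 = 1, 2, 3  # codes taken from the proposal above


def file_encoding(path) -> int:
    """Guess the encoding of a file and return one of the codes above."""
    raw = Path(path).read_bytes()
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return UTF16  # only detected through its BOM in this sketch
    try:
        raw.decode("utf-8-sig")  # accepts UTF-8 with or without BOM
    except UnicodeDecodeError:
        return ANSI  # anything that is not valid UTF-8
    return UTF8


def non_utf8_files(root, pattern="*.php"):
    """Return the sorted paths under root whose guessed encoding is not UTF-8."""
    return sorted(p for p in Path(root).rglob(pattern) if file_encoding(p) != UTF8)
```

Calling `non_utf8_files` on the web root would give the list to write to the log.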

Thanks again jchd.

P.S.: If anybody has more ideas on how to do this, please post :D

jchd

Pardalito wrote: "This is a nice idea... Slower, but a nice idea.

If you could pick out only the ANSI files and convert just those files to UTF-8, then you would have a fast process.

For future versions, a function that determines a file's charset/encoding would be welcome.

Something like this:

FileEncoding('ansi.php') = 1

FileEncoding('utf-8.php') = 2

FileEncoding('utf-16.php') = 3"

I guess there can't be such magic without reading the file until an invalid UTF-8 sequence is found.

For files already in UTF-8, it won't be better than reading the file, writing it (possibly to /dev/null) and comparing the lengths read and written (must be the _byte_ length).
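
The read-then-rewrite comparison can be imitated in Python (an analogue, not AutoIt's behaviour). One deliberate difference, stated as an assumption of this sketch: it compares the re-encoded bytes rather than only their length, because with Python's replacement of bad sequences by U+FFFD the lengths may happen to coincide. `survives_utf8_round_trip` is a hypothetical name.

```python
def survives_utf8_round_trip(raw: bytes) -> bool:
    """True if decoding as UTF-8 and re-encoding gives back the same bytes."""
    # Invalid sequences are replaced by U+FFFD instead of raising, which mimics
    # a lenient reader; any replacement changes the re-encoded bytes.
    text = raw.decode("utf-8", errors="replace")
    return text.encode("utf-8") == raw


print(survives_utf8_round_trip("h\u00e9llo".encode("utf-8")))   # UTF-8 input
print(survives_utf8_round_trip("h\u00e9llo".encode("cp1252")))  # ANSI input
```

Either way the whole file has to be read once, which is the point made above.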

For files in ANSI, it depends on whether they include 8-bit chars and how far into the file the first invalid sequence is.

So if only few (e.g. 2 among 1000) are ANSI, then it will be "slow" anyway.

Anyway, note that this procedure can give wrong results. An ANSI file is absolutely entitled to contain a sequence of 8-bit chars identical to a single UTF-8 char. It will be valid UTF-8, hence classified as UTF-8. But hopefully, since typical UTF-8 sequences make little or no sense when interpreted as series of ANSI chars, the odds are low that this happens in "human-readable" files, .php files or other common types.
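
A tiny Python example of such a false positive, assuming Windows-1252 as the ANSI codepage: the two characters "Ã" and "©" are perfectly legal ANSI text, yet their bytes form one valid UTF-8 character.

```python
ansi_text = "\u00c3\u00a9"         # "Ã©": two legitimate Windows-1252 characters
raw = ansi_text.encode("cp1252")   # bytes C3 A9
as_utf8 = raw.decode("utf-8")      # decodes without error, as one character

print(raw.hex(), len(ansi_text), len(as_utf8), ascii(as_utf8))
```

A validator that only checks well-formedness would therefore classify this two-byte ANSI file as UTF-8 containing "é".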


Pardalito

Hello jchd,

Yes... You are right. I understand now. :D

Thanks.

Jon

Pardalito wrote: "In the latest release of AutoIt, I don’t see any procedure to detect the file charset."

Hopefully we'll have a function (or a @extended code from FileOpen()) that does this in the next beta. Soon.

FWIW the AutoIt internal procedure is:

Read first 64KB of file

While chars
  If char = valid UTF8 sequence Then
    Skip sequence (1, 2, 3 or 4 bytes)
  Else
    Return NOT_UTF8
  EndIf
WEnd

Also, at the end if NO chars read were >127 then also return NOT_UTF8 because we can't tell if it's really UTF8 or standard ANSI.
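
The procedure above can be sketched as runnable code. This is a Python reconstruction from the description only, not AutoIt's actual source. The function names are made up, the byte ranges follow the standard well-formed UTF-8 table, and a sequence cut off by the 64KB boundary is simply treated as invalid here, which the real implementation may handle differently.

```python
SAMPLE_SIZE = 64 * 1024  # "Read first 64KB of file"


def _sequence_length(data: bytes, i: int) -> int:
    """Length of the valid UTF-8 sequence starting at data[i], or 0 if invalid."""
    b0 = data[i]
    if b0 < 0x80:
        return 1                            # plain 7-bit character
    if 0xC2 <= b0 <= 0xDF:
        need, lo, hi = 1, 0x80, 0xBF
    elif 0xE0 <= b0 <= 0xEF:
        need = 2
        lo = 0xA0 if b0 == 0xE0 else 0x80   # reject overlong forms
        hi = 0x9F if b0 == 0xED else 0xBF   # reject UTF-16 surrogates
    elif 0xF0 <= b0 <= 0xF4:
        need = 3
        lo = 0x90 if b0 == 0xF0 else 0x80   # reject overlong forms
        hi = 0x8F if b0 == 0xF4 else 0xBF   # stay at or below U+10FFFF
    else:
        return 0                            # stray continuation or illegal lead
    tail = data[i + 1 : i + 1 + need]
    if len(tail) < need or not (lo <= tail[0] <= hi):
        return 0
    if any(not (0x80 <= b <= 0xBF) for b in tail[1:]):
        return 0
    return need + 1


def looks_like_utf8(sample: bytes) -> bool:
    """True only if the sample is valid UTF-8 AND has at least one char > 127."""
    i, saw_high = 0, False
    while i < len(sample):
        n = _sequence_length(sample, i)
        if n == 0:
            return False                    # "Return NOT_UTF8"
        saw_high = saw_high or n > 1
        i += n
    return saw_high                         # pure 7-bit input is NOT_UTF8 too


def file_looks_like_utf8(path: str) -> bool:
    with open(path, "rb") as fh:
        return looks_like_utf8(fh.read(SAMPLE_SIZE))
```

The final `return saw_high` is the "is requiring UTF-8" semantics: a pure 7-bit sample is reported as not UTF-8 because nothing in it distinguishes it from ANSI.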

jchd

Jon wrote: "Hopefully we'll have a function (or a @extended code from FileOpen()) that does this in the next beta. Soon."

That's nice, thanks.

Jon wrote: "FWIW the AutoIt internal procedure is:

Read first 64KB of file"

Only 64K and not the whole beef? Isn't this a bit of a gamble?

Being picky, don't you also check the bytes n+1 and more, when needed?

Jon wrote: "Also, at the end if NO chars read were >127 then also return NOT_UTF8 because we can't tell if it's really UTF8 or standard ANSI."

:ahem: in this case, you can rightfully return UTF8 as well!

EDIT: the above remark is because I interpret(ed) the semantics of your return value differently. I thought it was to mean "is compatible with" while you seem to mean "is requiring". You get the idea.

Jon

jchd wrote: ":ahem: in this case, you can rightfully return UTF8 as well!

EDIT: the above remark is because I interpret(ed) the semantics of your return value differently. I thought it was to mean "is compatible with" while you seem to mean "is requiring". You get the idea."

Not really. Open a text file in something like Notepad++ enter normal letters like "abcdefghijklmnopqrstuvwxyz" and save it as "UTF-8 with no BOM". Close the file and then open it again. It will say it's encoded as ANSI. If all chars are <127 then it can't assume anything else.
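
The underlying reason is easy to check in Python: for 7-bit text, the bytes written to disk are identical whatever the encoding is called (Windows-1252 stands in for "ANSI" here, as an assumption).

```python
text = "abcdefghijklmnopqrstuvwxyz"

as_utf8 = text.encode("utf-8")     # what "UTF-8 with no BOM" writes
as_ansi = text.encode("cp1252")    # what an ANSI save writes
as_ascii = text.encode("ascii")

print(as_utf8 == as_ansi == as_ascii)  # same bytes on disk
print(max(as_utf8) < 0x80)             # no byte has the high bit set
```

With nothing but these bytes to go on, a detector has no evidence either way.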

Jon

jchd wrote: "Only 64K and not the whole beef? Isn't this a bit of a gamble?"

Maybe. If it doesn't work out well then I can increase it to read the whole file at the cost of perf, but it's statistically unlikely that valid UTF8 sequences would occur at random, and if there is no character > 127 in the first 64KB then how likely is one to be in the rest of the file? You can also force the issue with a flag in FileOpen() if required.
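
For completeness, this is the kind of file where a 64KB sample and a whole-file scan disagree. The Python snippet below builds a contrived example in memory; the content is made up for illustration and says nothing about how common such files are.

```python
SAMPLE_SIZE = 64 * 1024

# 84,000 bytes of pure ASCII followed by one non-ASCII character in UTF-8.
body = b"<?php // plain ASCII\n" * 4000
data = body + "\u00e9".encode("utf-8")

sample = data[:SAMPLE_SIZE]
print(len(data) > SAMPLE_SIZE)  # the file is larger than the sample
print(sample.isascii())         # the sample contains no byte > 127
print(data.isascii())           # the whole file does
```

A sample-based detector would call this file ANSI, which is the case the FileOpen() flag is meant to cover.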

jchd wrote: "Being picky, don't you also check the bytes n+1 and more, when needed?"

Of course.

jchd

Jon wrote: "Not really. Open a text file in something like Notepad++ enter normal letters like "abcdefghijklmnopqrstuvwxyz" and save it as "UTF-8 with no BOM". Close the file and then open it again. It will say it's encoded as ANSI. If all chars are <127 then it can't assume anything else."

But 7-bit ASCII is UTF-8 compatible.

That's probably because I regard ANSI as outdated and UTF as the "untold default" (should be with BOM, anyway). It's a shame that something as dumb as codepages is still the default for so many editors/programs. I wonder how many decades after the advent of Unicode we (or our children) will still have to wait before the ANSI crap is gone.


KaFu

Maybe writing a wrapper function for this will do the job?

IsTextUnicode Function

http://msdn.microsoft.com/en-us/library/dd318672(VS.85).aspx

jchd

From what I understand, here (and at other places in MSDN as well), Microsoft uses Unicode to mean UTF-16.

MSDN: "IS_TEXT_UNICODE_ODD_LENGTH: The number of characters in the string is odd. A string of odd length cannot (by definition) be Unicode text."

It doesn't seem that they consider UTF-8 as a possibility (or I missed it). It's still possible that another call would do it.

There are also several Unicode transformations that could be of interest, like normalizations. It's possible to have all valid Unicode sequences on a character basis, but invalid sequences of characters. This should be of marginal use and there is a risk of utter confusion for many if not most users. I bet that those with such demanding needs will either ask or know to do that by themselves using Windows calls.


Jon

jchd wrote: "It doesn't seem that they consider UTF-8 as a possibility (or I missed it)."

It doesn't. And it's pretty crappy for a lot of text (search for "bush hid the facts" and IsTextUnicode). It only uses the first 256 bytes as well.
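
The "bush hid the facts" case is easy to reproduce without calling IsTextUnicode at all. The Python snippet below only shows that the 18 ASCII bytes of that sentence also form a well-formed UTF-16LE string, which is why a statistical guess can go wrong. It does not call or emulate the Windows API.

```python
raw = b"Bush hid the facts"          # 18 bytes of plain ASCII

as_utf16 = raw.decode("utf-16-le")   # pairs of bytes read as 16-bit code units

print(len(raw), len(as_utf16))
print([f"U+{ord(c):04X}" for c in as_utf16])
```

Every resulting code unit falls in the CJK Unified Ideographs block (U+4E00 to U+9FFF), so the text looks like plausible Chinese to a heuristic that assumes UTF-16.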

Pardalito

Hello Lazycat,

Good UDF. Thanks.

Best Regards, Pardalito.

jchd

While not handling plane 1 of Unicode is probably harmless for most everyday use, I am told that plane 2 (supplementary CJK extensions) is getting more fashionable. That would mean that for a large number of people, UTF-8 encoding using 3 or 4 bytes will become routine.
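
To put numbers on that, here is a short Python check of how many bytes a few sample characters take in UTF-8, and how many 16-bit units they take in UTF-16. The sample characters are arbitrary picks for illustration.

```python
samples = {
    "U+0041 (ASCII)": "A",
    "U+00E9 (Latin-1 range)": "\u00e9",
    "U+4E2D (BMP CJK)": "\u4e2d",
    "U+1F600 (plane 1)": "\U0001F600",
    "U+20000 (plane 2, CJK Extension B)": "\U00020000",
}

utf8_len = {label: len(ch.encode("utf-8")) for label, ch in samples.items()}
utf16_units = {label: len(ch.encode("utf-16-le")) // 2 for label, ch in samples.items()}

for label in samples:
    print(label, utf8_len[label], "UTF-8 bytes,", utf16_units[label], "UTF-16 units")
```

Characters outside the Basic Multilingual Plane need 4 bytes in UTF-8 and a surrogate pair in UTF-16, so a validator limited to 1 to 3 byte sequences would reject them.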

I do have to deal with Asia a lot and I need to take care of this. Just to let you know.

