Detect the codepage of a string or file


zeffy

Hi, in my script I have to read a dynamic file, which could be in any code page, from ANSI to UTF-8 to Shift_JIS, so I need a way to detect which code page it uses and convert it to UTF-8 accordingly with _WinAPI_MultiByteToWideChar.

I have found IMultiLanguage2 (MLang) on MSDN, but I'm horrible with DllCall and structs etc, and I need some help understanding how exactly to use it.

Thanks

Edited by zeffy

I don't believe you can determine which codepage is used in the general case. The closest you can safely get is to determine which encodings the file is _not_ using: for instance, if you find an invalid UTF-8 sequence, then the file is not UTF-8.
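As an illustration of ruling an encoding out, here is a minimal Python sketch (Python rather than AutoIt, purely so the idea can be tested in isolation). The function name `is_not_utf8` is hypothetical. Note the asymmetry: a `True` result is conclusive, while a `False` result only means UTF-8 has not been excluded.

```python
def is_not_utf8(data: bytes) -> bool:
    """Return True when data contains a byte sequence that is invalid in UTF-8."""
    try:
        data.decode("utf-8")  # strict decoding raises on the first invalid sequence
    except UnicodeDecodeError:
        return True   # conclusive: the file cannot be UTF-8
    return False      # inconclusive: the bytes merely happen to be valid UTF-8


# Shift-JIS bytes for Japanese text contain sequences that UTF-8 forbids.
print(is_not_utf8("日本語".encode("shift_jis")))
# Plain ASCII is valid UTF-8, but also valid in most other codepages.
print(is_not_utf8(b"plain ascii"))
```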

The problem here is that a UTF-8 file is always a valid ANSI file (whatever the codepage). Not all UTF-* files use a BOM, and even a BOM _is_ valid ANSI. Any JIS file (independent of the JIS flavor) _is_ a valid ANSI file (the JIS encodings are ANSI codepages).

So, for instance, any Shift-JIS file is a valid Latin1-ANSI file. That the result generally doesn't make sense is another matter: that issue is about semantics, not encoding.
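To make the point concrete, the following Python sketch decodes Shift-JIS bytes and BOM-prefixed UTF-8 bytes as Latin-1. It uses Python's `latin-1` codec (ISO-8859-1) as a stand-in for the Latin1 ANSI codepage, which is an assumption: that codec accepts every byte value, so decoding never fails. The helper name `decode_as_latin1` is hypothetical.

```python
def decode_as_latin1(data: bytes) -> str:
    """Decode any byte string as Latin-1; this never raises, whatever the input."""
    return data.decode("latin-1")


text = "日本語のテキスト"

# Shift-JIS bytes are accepted as Latin-1, but the result is meaningless.
sjis_bytes = text.encode("shift_jis")
as_latin1 = decode_as_latin1(sjis_bytes)

# Even the UTF-8 BOM is just three ordinary Latin-1 characters.
utf8_with_bom = b"\xef\xbb\xbf" + text.encode("utf-8")
bom_as_latin1 = decode_as_latin1(utf8_with_bom)

print(repr(as_latin1))
print(repr(bom_as_latin1[:3]))
```

Both decodes succeed, so validity alone says nothing here; only the content reveals that the Latin-1 reading is wrong.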

Your best bet is to ask yourself how _you_ would sort out which encoding the file actually uses. Try using only a hex editor, and explain to a (virtual or real) friend with zero prior knowledge how you proceed to investigate; take notes of the steps. That will require you to derive explicit rules about which characters can be found and which combinations make sense, and possibly to use character statistics techniques. While I believe that the general case is essentially impossible to solve, you can end up with workable rules, as you probably know enough about the expected contents of the files you process. It is unlikely that your files are random sequences of characters: they probably contain words and sentences and obey many other conditions that apply in your case (no word longer than 40 characters, no more than two 1-char words in a row, whatever applies to your specific language and specific typical file contents).
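A rough Python sketch of such content rules follows. Everything in it is an assumption for illustration: the candidate list, the two rules, the 0.5 penalty and the function names are hypothetical and would need tuning to the files actually processed. It decodes the data with each candidate, discards the ones that fail, and scores the rest.

```python
import unicodedata

CANDIDATES = ("utf-8", "shift_jis", "cp1252")  # hypothetical candidate list


def plausibility(text: str, max_word_len: int = 40) -> float:
    """Score decoded text between 0 and 1 using two simple content rules."""
    if not text:
        return 0.0
    # Rule 1: share of characters that are letters, digits, punctuation,
    # separators or common line/tab controls.
    ok = sum(
        1 for ch in text
        if ch in "\r\n\t" or unicodedata.category(ch)[0] in "LNPZ"
    )
    score = ok / len(text)
    # Rule 2: halve the score if any whitespace-delimited word is too long.
    if any(len(word) > max_word_len for word in text.split()):
        score *= 0.5
    return score


def rank_encodings(data: bytes, candidates=CANDIDATES):
    """Return (score, encoding) pairs, best first; undecodable candidates are dropped."""
    results = []
    for enc in candidates:
        try:
            text = data.decode(enc)
        except UnicodeDecodeError:
            continue  # this encoding is ruled out
        results.append((plausibility(text), enc))
    results.sort(key=lambda pair: -pair[0])  # stable: ties keep candidate order
    return results


print(rank_encodings(b"hello world"))
```

Pure ASCII input scores the same under all three candidates, so the ranking falls back on the order of the candidate list. That tie is exactly the ambiguity described above, and rule 2 would wrongly penalise languages written without spaces, which is why the rules have to come from knowledge of the expected contents.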

This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)


In IT, when we say something is essentially impossible in the general case using approach XYZ, we mean that it's easy to come up with a counter-example on which XYZ will fail and that the number of potential failures is large enough to tag XYZ as "may work for you most of the time" instead of "will always work". The same wording appears in mathematics, with comparable semantics.

I strongly suspect that the Windows API you mention employs some sort of lexicographic histogram (statistical techniques) which works "reasonably well" on "suitable input". This is more or less what I suggested. Read the comments in the article you mention, and note that the author stresses that it won't always work correctly. That's why I wrote that, in the general case, it is impossible: the outcome of such an approach depends heavily on the contents of the file, and human language is just too complex to fit such a simple "model".

Now I agree that these routines have some interest and may suit your needs, but you must still be prepared to see them return plain wrong results at any time, and make sure your application doesn't then take some ground-shaking, idiotic decision with harmful consequences.
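One way to stay prepared for wrong results is to treat the detector's answer as a guess and refuse to act on it silently. The Python sketch below assumes the detector reports a confidence value between 0 and 1 alongside its guess; the function name `decode_guarded`, the `confidence` parameter and the 0.9 threshold are all hypothetical.

```python
from typing import Optional


def decode_guarded(data: bytes, guessed: str, confidence: float,
                   threshold: float = 0.9) -> Optional[str]:
    """Decode with a guessed encoding only when the guess looks trustworthy.

    Returns None instead of text whenever the guess is weak or the bytes do
    not decode, so the caller has to handle the uncertain case explicitly.
    """
    if confidence < threshold:
        return None
    try:
        return data.decode(guessed)  # strict: raises on invalid sequences
    except (UnicodeDecodeError, LookupError):
        # Invalid bytes for this encoding, or an encoding name Python does not know.
        return None


result = decode_guarded(b"abc", "utf-8", confidence=0.95)
if result is None:
    print("encoding uncertain: ask the user or skip the file")
else:
    print(result)
```

Returning `None` keeps a wrong guess from flowing into later processing unnoticed. It does not protect against a confident guess that decodes cleanly yet is still wrong, so irreversible actions should not depend on the detected encoding alone.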

