
Major bug with string encodings in files with a BOM


jchd

AutoIt has a really big problem with the encoding of string literals when the script is encoded in UTF-* with a BOM.

AutoIt string literals only behave correctly when the file is encoded as ANSI or UTF-8 without BOM.

Could someone please confirm this before I file a report?

Be very careful not to destroy existing data when you try the various encodings.

Beware: the script must start with the three characters ;;€, since its first 3 bytes are used to detect the file encoding.

;;€                       IMPORTANT: semicolon, semicolon, Euro

Global $s = "éà€"   ; must be é à €
Global $b = Binary($s)
Global $hdl, $bom, $encoding, $shouldbe


$hdl = FileOpen(@ScriptFullPath, 16); read this script file in binary mode
$bom = FileRead($hdl, 3)            ; read the first 3 bytes
FileClose($hdl)

Switch Binary($bom)
    Case "0x3B3B80"
        $encoding = "ANSI"
        $shouldbe = "0xE9E080"
    Case "0x3B3BE2"
        $encoding = "UTF-8"
        $shouldbe = "0xC3A9C3A0E282AC"
    Case "0xEFBBBF"
        $encoding = "UTF-8 with BOM"
        $shouldbe = "0xC3A9C3A0E282AC"
    Case "0xFEFF00"
        $encoding = "UTF-16be (with BOM)"
        $shouldbe = "0x00E900E020AC"; see note
    Case "0xFFFE3B"
        $encoding = "UTF-16le (with BOM)"
        $shouldbe = "0xE900E000AC20"; see note
    Case Else
        $encoding = "Unknown"
        $shouldbe = "I have no idea!"
EndSwitch

MsgBox(0, "Encoding dependency", _
            "Script is encoded as " & $encoding & @LF & _
            "The reference string contains " & $b & @LF & _
            "and this should be equal to     " & $shouldbe & @LF & _
            "which is " & ($b = $shouldbe))


; NOTE: it is quite probable that a single encoding is used inside AutoIt,
; in which case AutoIt may convert UTF-16 strings read from the script file
; to, say, UTF-8. If so, the comparisons with UTF-16 data above are bound
; to fail.
;
; BUT, in any case, there is absolutely NO reason for AutoIt to store them as
; ANSI.
Edited by jchd
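
As a cross-check outside AutoIt, the byte values hard-coded in the Switch above can be recomputed independently. This is only a minimal Python sketch, not part of the original script. It assumes that "ANSI" means Windows-1252 (the usual code page on a French or Western European Windows system), and the names MARKER, LITERAL and ENCODINGS are made up for the illustration.

```python
import codecs

MARKER = ";;\u20ac"          # ;;€  the three characters that must open the script
LITERAL = "\u00e9\u00e0\u20ac"  # éà€  the reference string literal

# (label, Python codec, BOM bytes). "cp1252" for ANSI is an assumption.
ENCODINGS = [
    ("ANSI", "cp1252", b""),
    ("UTF-8", "utf-8", b""),
    ("UTF-8 with BOM", "utf-8", codecs.BOM_UTF8),
    ("UTF-16be (with BOM)", "utf-16-be", codecs.BOM_UTF16_BE),
    ("UTF-16le (with BOM)", "utf-16-le", codecs.BOM_UTF16_LE),
]


def first_three_bytes(codec, bom):
    """First 3 bytes of a file that starts with MARKER in this encoding."""
    return (bom + MARKER.encode(codec))[:3]


def literal_bytes(codec):
    """Bytes of LITERAL when stored in this encoding (no BOM)."""
    return LITERAL.encode(codec)


for label, codec, bom in ENCODINGS:
    head = first_three_bytes(codec, bom).hex().upper()
    body = literal_bytes(codec).hex().upper()
    print(f"{label:22} file starts 0x{head}  literal 0x{body}")
```

Under that assumption, the printed values can be compared line by line with the Case labels and the $shouldbe values in the script.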

This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)


Are you sure that UTF8 always makes characters double-byte? I thought that wasn't the case.

Hi Jos, it's really a blessing that you're taking an interest in this. I've wasted an indecent amount of time trying to understand what the problem(s) with the various string representations were.

Here's a convenient (and accurate) digest of the UTF-* encodings. It comes from utf.c in the SQLite source, so it's public domain.

*************************************************************************
** This file contains routines used to translate between UTF-8,
** UTF-16, UTF-16BE, and UTF-16LE.
**
** $Id: utf.c,v 1.73 2009/04/01 18:40:32 drh Exp $
**
** Notes on UTF-8:
**
**   Byte-0    Byte-1    Byte-2    Byte-3    Value
**  0xxxxxxx                                 00000000 00000000 0xxxxxxx
**  110yyyyy  10xxxxxx                       00000000 00000yyy yyxxxxxx
**  1110zzzz  10yyyyyy  10xxxxxx             00000000 zzzzyyyy yyxxxxxx
**  11110uuu  10uuzzzz  10yyyyyy  10xxxxxx   000uuuuu zzzzyyyy yyxxxxxx
**
**
** Notes on UTF-16:  (with wwww+1==uuuuu)
**
**      Word-0               Word-1          Value
**  110110ww wwzzzzyy   110111yy yyxxxxxx    000uuuuu zzzzyyyy yyxxxxxx
**  zzzzyyyy yyxxxxxx                        00000000 zzzzyyyy yyxxxxxx
**
**
** BOM or Byte Order Mark:
**     0xff 0xfe   little-endian utf-16 follows
**     0xfe 0xff   big-endian utf-16 follows
**
*/
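
The UTF-8 rows of the table above can be turned directly into code. The following is a small Python sketch (Python rather than AutoIt, so that it can be checked against a built-in codec). The function name utf8_encode is made up for the illustration, and the shifts and masks simply follow the bit patterns in the table.

```python
def utf8_encode(cp):
    """Encode one Unicode code point following the bit layout in the table."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x80:                      # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                     # 110yyyyy 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:                   # 1110zzzz 10yyyyyy 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# A (1 byte), é (2 bytes), € (3 bytes), musical G clef U+1D11E (4 bytes)
for ch in ["A", "\u00e9", "\u20ac", "\U0001d11e"]:
    encoded = utf8_encode(ord(ch))
    assert encoded == ch.encode("utf-8")   # agrees with Python's own codec
    hex_bytes = " ".join(f"{b:02X}" for b in encoded)
    print(f"U+{ord(ch):04X} -> {hex_bytes} ({len(encoded)} byte(s))")
```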

So the answer to your question is NO.

Valid UTF-8 characters take from 1 to 4 bytes per sequence. UTF-16 (le or be) characters take 1 or 2 ushorts, i.e. 2 or 4 bytes. Only UTF-32 makes it safe to assume 1 character = 1 storage element. You can also build arbitrarily long invalid sequences, but these can/should be flagged as such with a special code point (invalid character substitution). Very informative documents about Unicode are available at unicode.org or icu-project.org.
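
To make the "1 character is not 1 storage element" point concrete, here is a short Python sketch that counts storage elements per character and derives UTF-16 surrogate pairs by hand. The helper names utf16_units and storage_elements are invented for this example, and the input is assumed to be a valid code point that is not itself a surrogate.

```python
def utf16_units(cp):
    """Return the UTF-16 code units (16-bit integers) for one code point."""
    if cp < 0x10000:
        return [cp]                      # one unit: zzzzyyyy yyxxxxxx
    v = cp - 0x10000                     # 20 bits remain (the "wwww+1==uuuuu" step)
    high = 0xD800 | (v >> 10)            # 110110ww wwzzzzyy
    low = 0xDC00 | (v & 0x3FF)           # 110111yy yyxxxxxx
    return [high, low]


def storage_elements(ch):
    """Storage elements needed for one character in UTF-8, UTF-16, UTF-32."""
    return (len(ch.encode("utf-8")),            # bytes
            len(ch.encode("utf-16-le")) // 2,   # 16-bit units
            len(ch.encode("utf-32-le")) // 4)   # 32-bit units


# A, é, €, and U+1D11E (outside the Basic Multilingual Plane)
for ch in ["A", "\u00e9", "\u20ac", "\U0001d11e"]:
    units = " ".join(f"{u:04X}" for u in utf16_units(ord(ch)))
    print(f"U+{ord(ch):04X}: UTF-16 units {units}, "
          f"elements (UTF-8, UTF-16, UTF-32) = {storage_elements(ch)}")
```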

Please note that in the example the Euro character is ANSI "0x80", UTF-8 "0xE2 82 AC" (three bytes) and UTF-16le "0xAC 20" (code point U+20AC). I put this common character there on purpose, as an example of a 3-byte sequence that has to be expected without even having to pick obscure (to us) Asian code points.

BTW, there's also a bug in WinAPI _WinAPI_MultiByteToWideChar: StringLen is documented to return a length in characters (not in bytes), but the function only allocates twice as many _bytes_. It should be four times as many in the most general case. These functions also show some weird (buggy) behavior, but I'll keep that for another post after more investigation.

Cheers.


  • Developers

So if the answer is indeed NO then why do you expect the result of:

Global $s = "éà"   ; must be é à €
Global $b = Binary($s)

To be the same as:

Global $s = "éà"   ; must be é à €
Global $b = StringToBinary ($s,4)

?

SciTE4AutoIt3 Full installer Download page   - Beta files       Read before posting     How to post scriptsource   Forum etiquette  Forum Rules 
 
Live for the present,
Dream of the future,
Learn from the past.
  :)


I never coded that. I simply said that AutoIt should read literals the same way whether they come from a (UTF-8 without BOM)-encoded file or from a (UTF-8 with BOM)-encoded file!

The problem here is that it reads and stores the former (correctly) as UTF-8 strings, but reads and stores the latter (incorrectly) as ANSI strings. When you read, update and store those values in a production database, it makes a huge difference!

The doc says AutoIt is a "Unicode" program. But there is no such beast. The best it can be is "Unicode aware", but there must be a way to differentiate ANSI, UTF-8, UTF-16le and UTF-16be data, along with means to convert back and forth between those distinct representations. The problem here is typelessness. An ANSI string is not the same as a UTF-8 string, and not the same as a string in any other UTF encoding. Even if they contain the same characters, they are distinct values. I can't keep on destroying databases with incorrectly formatted data.

It is the same distinction as between 5 and 0x05 and "5" and "0x05". These four representations can be made to operate in (more or less) the same way (using internal object-oriented polymorphism and inheritance), but we have means (functions) to sort out which is a string and which is a number, and functions to convert between any two representations.
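
The idea of keeping the encoding attached to the data, with explicit conversions, can be sketched in a few lines of Python. This is an illustration of the argument only, not something AutoIt offers: the EncodedText type and its methods are hypothetical, and "cp1252" again stands in for ANSI as an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncodedText:
    """Raw bytes plus the name of the encoding they are in (hypothetical type)."""
    data: bytes
    encoding: str

    def text(self):
        """Decode the bytes into characters using the attached encoding."""
        return self.data.decode(self.encoding)

    def to(self, encoding):
        """Transcode; raises if a character cannot be represented."""
        return EncodedText(self.text().encode(encoding), encoding)


ansi = EncodedText(b"\xe9\xe0\x80", "cp1252")   # éà€ as ANSI (cp1252 assumed)
utf8 = ansi.to("utf-8")
utf16le = ansi.to("utf-16-le")

print(ansi.data.hex(), utf8.data.hex(), utf16le.data.hex())
assert ansi.data != utf8.data != utf16le.data          # distinct byte values ...
assert ansi.text() == utf8.text() == utf16le.text()    # ... same characters
```

Because the encoding travels with the bytes, a value can never be silently reinterpreted; converting is always an explicit call to to().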

Try this:

$s = "éà"

MsgBox(0, "", $s)

It displays as expected when the script file is ANSI, UTF-8 with BOM, UTF-16le and UTF-16be, but shows garbage when the script is encoded as UTF-8 without BOM.

This is consistent with the fact that AutoIt reads and interprets the file as ANSI when it doesn't find a BOM.
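
The garbage produced when UTF-8 bytes are interpreted as ANSI can be reproduced outside AutoIt. The Python sketch below assumes ANSI is Windows-1252. It also shows what happens if the misread text is then written out as UTF-8 again, which is one plausible way (an assumption, not something confirmed in this thread) for stored data to end up doubly encoded.

```python
original = "\u00e9\u00e0\u20ac"             # éà€
utf8_bytes = original.encode("utf-8")       # content of a UTF-8 file without BOM

# Same bytes, wrongly interpreted as ANSI (cp1252 assumed): 7 junk characters
misread = utf8_bytes.decode("cp1252")

# If the junk is then stored as UTF-8, the damage is baked into the data
double_encoded = misread.encode("utf-8")

print("UTF-8 bytes     :", utf8_bytes.hex(" "))
print("misread as ANSI :", ascii(misread))
print("stored again    :", double_encoded.hex(" "))
print("lengths         :", len(original), len(misread), len(double_encoded))
```

Printing with ascii() keeps the output readable on any console. The three original characters become seven when misread.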

But if you look at how the string is encoded (using Binary()), then it behaves otherwise. This is completely awkward.

It's possible that there's a silly conversion acting somewhere, but since the source code is secret, no one can tell what to do.


  • Developers

Ok, got you this time. ^_^

The strange thing is that ConsoleWrite works fine but MsgBox doesn't.

Global $s = "éà"   ; must be é à €
MsgBox(262144,'Debug line ~' & @ScriptLineNumber,'Selection:' & @lf & '$s' & @lf & @lf & 'Return:' & @lf & $s);### Debug MSGBOX
ConsoleWrite('@@ Debug(' & @ScriptLineNumber & ') : $s = ' & $s & @crlf & '>Error code: ' & @error & @crlf);### Debug Console



Sorry, I don't get you (my turn!).

What I see is that the result is OK for both MsgBox and the console with UTF-8 with BOM, but fails consistently without BOM (the literal is then read and displayed as a sequence of ANSI bytes).

I've been going crazy with these issues for weeks and in the meantime have almost irremediably corrupted two large databases, asymmetrically.


  • Developers

When I run the script saved as UTF-8 without BOM, the MsgBox shows:

---------------------------
Debug line ~2
---------------------------
Selection:
$s

Return:
éà€
---------------------------
OK
---------------------------

The ConsoleWrite shows:

@@ Debug(3) : $s = éà€

Edited by Jos


I beg to differ, sorry for that.

Console:

@@ Debug(3) : $s = éàâ¬

MsgBox:

same as yours, incorrect display (éà â¬)

No question, there are strange things happening.

I'm sorry, but I just can't avoid attending to utterly urgent stuff tonight (VAT form due ... today, deadline).

I'll do my best to test on other machines here as soon as I can. Here I have XP Pro corp SP3 (clean) and the latest AutoIt release everywhere. As for locale, I have French input and a French keyboard, if locale matters (I doubt it).

We need to find out what the problem really is and how to work around it. This is vital for me.


  • Developers

I beg to differ, sorry for that.

I don't doubt you are seeing something else, but what I posted is what I see. ^_^

Straightforward copy/paste exercise.


I think I know the difference. I am using the full SciTE4AutoIt3 installer which runs AutoIt3Wrapper to start AutoIt3.exe (and other stuff).

I installed the standard SciTE4AutoIt3 editor that comes with the AutoIt3 installer and hit F5, which gave the same result for both:

>"C:\Program Files\AutoIt3\SciTE\..\autoit3.exe" /ErrorStdOut "C:\Development\test.au3"

@@ Debug(3) : $s = éàâ¬

>Error code: 0

>Exit code: 0 Time: 2.183



I apologize for not telling you at once which precise setup I was using. Indeed, you found out which it was: the "vanilla" setup from SciTE4AutoIt3.

Now I suspect there are also what could be called "cosmetic" (display-only) issues on top of deeper problems that are more difficult to solve. But, I'm sorry to insist on this, I bear a big responsibility for having selected AutoIt, SQLite and other tools to solve a practical problem. I would even be ready to cheerfully pay for ad hoc service to help us get out of the trap I managed to get into. I can't give millions away (and it's not _my_ money), but our business is in havoc (my fault) because I just can't sort out this kind of problem. Not having access to the source code makes this worse.

AutoIt is, I say it sincerely to anyone listening, a wonderful piece of software, but it turns out it has, like almost all software, its own share of problems. I certainly invested too much and underestimated what was really necessary. Now it turns out I've spent months developing a big thing (15,000 to 20,000 lines) that has served no purpose except destroying useful valid data. Right now I'm too exhausted to be anything other than 100% hopeless.


  • Developers


Maybe you could tell us what the real issue you are having is?

Is data stored in the database the wrong way because the source is UTF-8 without BOM?

Do you need to convert the source file to something readable again?

Like:

Global $s = "éà€"  ; must be é à €
; Read this script with the default decoding and print it
$source = FileRead(@ScriptFullPath)
ConsoleWrite($source)
ConsoleWrite(@LF)
ConsoleWrite(@LF)
; Decode the same content explicitly as UTF-8 (flag 4) and print it again
$source = BinaryToString(FileRead(@ScriptFullPath), 4)
ConsoleWrite($source)
Edited by Jos
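
The same "read the bytes, then decode them explicitly" idea, plus a way to repair text that has already been garbled, can be sketched in Python. This is a hedged illustration rather than a fix for AutoIt itself: read_text and repair_mojibake are made-up helper names, the temporary file stands in for the script, and ANSI is assumed to be Windows-1252.

```python
import os
import tempfile


def read_text(path, encoding):
    """Read a whole file as raw bytes, then decode with an explicit encoding."""
    with open(path, "rb") as handle:
        return handle.read().decode(encoding)


def repair_mojibake(garbled):
    """Undo 'UTF-8 bytes were decoded as cp1252'; raises if that is not the case."""
    return garbled.encode("cp1252").decode("utf-8")


# Stand-in for the script file: one line saved as UTF-8 without BOM.
fd, path = tempfile.mkstemp(suffix=".au3")
with os.fdopen(fd, "wb") as handle:
    handle.write('Global $s = "\u00e9\u00e0\u20ac"\n'.encode("utf-8"))

as_ansi = read_text(path, "cp1252")   # wrong guess: garbled text
as_utf8 = read_text(path, "utf-8")    # right guess: readable text
os.remove(path)

print(ascii(as_ansi))
print(ascii(as_utf8))
assert repair_mojibake(as_ansi) == as_utf8
```

The repair only works while the garbled text has not been altered further. Once characters have been dropped or replaced, the original bytes can no longer be recovered this way.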
