Reading Unicode symbols with FileRead ()

Knivy · August 10, 2010

Hello.

I was writing a simple program that looks for certain Unicode characters in a text and replaces them with others (transliteration), but I got blank output instead of the characters.

Eventually I found out that in some cases FileRead() behaves strangely and doesn't recognize Unicode symbols correctly. I haven't found any information about this in the documentation or on the forum, so here are the details.

Illustration - a program that writes $chars into a text file twice:

$txt=FileOpen("text.txt",2+256) ;open for writing (erasing old contents) as UTF-8 without BOM
$chars="ж" ;a Cyrillic (Russian) letter
FileWrite($txt,$chars)
FileFlush($txt)
FileSetPos($txt,0,0) ;move back to the beginning of the file
$chars=FileRead($txt,1) ;the char is not read back, so the file ends up with one "ж" instead of two
FileWrite($txt,$chars)
FileClose($txt)

The behavior of this procedure differs depending on some factors:

1) the ratio between the parameter for FileRead() and the number of chars:

Reading a block at least twice as long as the string (counted in characters) seems to work correctly, as does reading full lines or whole files.

FileRead($txt,2), FileReadLine($txt) or FileRead($txt) will work correctly in the example above.

"жжж" and FileRead($txt,3) ===> Р¶РР¶¶ instead of "жжжжжж" (unreadable)

"жжж" and FileRead($txt,4) ===> жжжж instead of "жжжжжж" (some chars lost)

"жжж" and FileRead($txt,5) ===> Р¶Р¶РР¶Р¶ (again unreadable)

"жжж" and FileRead($txt,6) ===> жжжжжж (correct but why reading 6 chars to write 3?)

2) Latin chars are read correctly:

"j" and FileRead($txt,1) ===> jj (correct)

"jjj" and FileRead($txt,3) ===> jjjjjj (correct)

"jжж" and FileRead($txt,3) ===> jжjж instead of "jжжjжж" (some cyrillic chars lost)

"jжж" and FileRead($txt,4) ===> jР¶РjР¶ (cyrillic chars are unreadable)

"jjж" and FileRead($txt,4) ===> jjжjjж (correct)

3) the file reading mode:

The above example will work correctly with FileOpen("text.txt",2).

But files containing Unicode chars show the same problems when opened in the default read-only mode.

In my program I needed to read Russian symbols one by one from a file written in Notepad and opened in read-only mode. The chars could not be read with FileRead($txt, 1); the program silently skipped them. In the end I got it working by creating a temporary file in the default writing mode (2) and copying the whole contents of my input files there. After that they seemed to be read correctly.
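The unreadable "Р¶" sequences in the results above are consistent with the two UTF-8 bytes of "ж" being shown through a single-byte Cyrillic code page. A minimal sketch of that effect follows, written in Python only because the byte arithmetic is language-independent. Windows-1251 is an assumption about the code page involved; the thread does not say which one was used.

```python
# Assumption: the bytes are displayed as Windows-1251 (cp1251) text
# instead of being decoded as UTF-8.
utf8_bytes = "ж".encode("utf-8")        # two bytes: 0xD0 0xB6
mojibake = utf8_bytes.decode("cp1251")  # each byte becomes one character

print(utf8_bytes.hex())                 # d0b6
# mojibake is now the two-character string "Р¶" (U+0420, U+00B6)
assert mojibake == "\u0420\u00b6"
```

If this is what happens, every "Р¶" pair in the output stands for one intact "ж", and a lone "Р" or "¶" is half of a character that was cut in two.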

I suppose all this is related to Unicode chars being 2-byte characters. The documentation says the parameter of FileRead() is the number of characters to read, while in fact it behaves more like a number of bytes in the cases illustrated above. The problem is:

1) either it's a bug, or it's something not documented for the FileRead() and FileOpen() functions (in the English version of the documentation);

2) it's really inconvenient, especially when reading text files containing a mix of Latin and Unicode chars.
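The byte-count hypothesis can be checked against the results listed above with a small model. This is a sketch in Python, not AutoIt's actual implementation: `read_then_write` is a hypothetical helper that assumes the count is a number of bytes and that the following write starts at the byte offset where the read stopped. Counts that split a character are left out, because the thread does not establish what AutoIt does with a partial UTF-8 sequence.

```python
def read_then_write(text: str, count: int) -> str:
    """Hypothetical model of the test program under a byte-count assumption."""
    data = bytearray(text.encode("utf-8"))  # file contents after the first FileWrite
    chunk = bytes(data[:count])             # FileSetPos to 0, then FileRead(count) as bytes
    pos = len(chunk)                        # file pointer after the read
    data[pos:pos + len(chunk)] = chunk      # second FileWrite overwrites from the pointer
    return data.decode("utf-8")

# Results reported in the post, reproduced by the model:
assert read_then_write("жжж", 4) == "жжжж"      # "some chars lost"
assert read_then_write("жжж", 6) == "жжжжжж"    # correct
assert read_then_write("jжж", 3) == "jжjж"      # "some cyrillic chars lost"
assert read_then_write("jjж", 4) == "jjжjjж"    # correct
assert read_then_write("jjj", 3) == "jjjjjj"    # Latin only, correct
```

Under this model nothing is actually lost when reading: the read stops partway through the file, and the write that follows overwrites the remaining bytes.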

bogQ · August 10, 2010

When I set File->Encoding->UTF-8 without BOM in SciTE, I don't have any problems reading ж from a txt file:

$txt=FileOpen("text.txt",2+256) ;Write mode, creates a file in UTF-8 without BOM writing mode
$chars="ж" ;a cyrillic (Russian) letter
FileWrite($txt,$chars)
FileWrite($txt,$chars)
FileClose($txt)
$txt=FileOpen("text.txt",256) ;Read mode, file in UTF-8 without BOM
$chars=FileRead($txt) ;reads the whole file at once
MsgBox(0,"",$chars)
FileClose($txt)

You'll need to close the file after writing before attempting to read it.

Knivy · August 11, 2010

I've tried reading Unicode files in read-only mode; that doesn't change the result.

The example above works well for Latin characters: they are read correctly even though the file is opened in write mode.

In your example you use reading the whole file at once:

$chars=FileRead($txt)

which works correctly.

The problem appears when reading the Unicode characters from a file one by one:

$chars=FileRead($txt,1)

This fails for Unicode in many of the cases mentioned above, and I think there is a bug there.

Thanks for the idea with StringReplace() and File->Encoding, I'll try that.

Edited August 11, 2010 by Knivy

bogQ · August 13, 2010

Yes, I see the problem: you're trying to read the file one char at a time.

I don't know if it's a bug, if it's meant to work that way, or if we overlooked something.

Workaround:

$txt=FileOpen("text.txt",2+256) ;Write mode, creates a file in UTF-8 without BOM
$chars="ж" ;a Cyrillic (Russian) letter
FileWrite($txt,$chars)
FileWrite($txt,$chars)
FileClose($txt)
$txt=FileOpen("text.txt",256) ;Read mode, file in UTF-8 without BOM
$chars=StringSplit(FileRead($txt),"") ;read the whole file, then split it into single chars
For $x = 1 To $chars[0]
    MsgBox(0,"",$chars[$x])
Next
FileClose($txt)

jchd · August 14, 2010

@Knivy,

You are in fact creating your own problem.

Writing > 0x7F characters in UTF-8 encoding means you are writing 2 to 4 bytes which represent those characters.
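As a quick illustration of those sizes, here is a small Python sketch (the sample characters are chosen for this example only and do not come from the thread):

```python
# UTF-8 byte length of a few sample characters.
samples = ["j", "ж", "€", "\U0001F600"]  # Latin, Cyrillic, euro sign, emoji
sizes = [len(ch.encode("utf-8")) for ch in samples]

print(sizes)  # [1, 2, 3, 4]
```

Only characters up to 0x7F occupy a single byte, which is why pure Latin text is unaffected by the byte/character confusion.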

The system functions behind FileSetPos/FileGetPos don't have the faintest clue what the file is or how data is represented inside it. They refer to a byte position, not a character position.

The presence of a BOM makes the issue even worse.

In short, you have two options. Either you use (byte) file positions and read/write from there without regard to character encoding, in which case you must make sure you read a stream of bytes with valid alignment or, more precisely, read/write at a UTF-8 character boundary. Or you read the file as a sequence of characters (or full lines) and let the various functions work at their own level, and the result is UTF-8 (or UTF-16) text that is recognized properly.

Any attempt to mix high-level (character-wise) functions with low-level (byte-wise) functions will fail at some point.

The same reasoning applies equally to BIG-5 and other Asian multi-byte ANSI codepages.
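The first option (working at byte level while always stopping on a UTF-8 character boundary) can be sketched as follows. This is a Python illustration under stated assumptions, not an AutoIt recipe: `read_chars` is a hypothetical helper, and the stream is assumed to be positioned at the start of a character when it is called.

```python
import codecs
import io

def read_chars(stream, n):
    """Read up to n characters from a binary stream holding UTF-8 text."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    chars = ""
    while len(chars) < n:
        byte = stream.read(1)
        if not byte:                   # end of stream
            break
        chars += decoder.decode(byte)  # yields "" until a character is complete
    return chars

stream = io.BytesIO("jжж€".encode("utf-8"))  # 1 + 2 + 2 + 3 = 8 bytes
first = read_chars(stream, 1)   # "j"
middle = read_chars(stream, 2)  # "жж"
offset = stream.tell()          # byte offset 5, exactly on a character boundary
```

Because the loop only stops once a character is complete, the byte position left behind is always safe to use for a later byte-level read or write.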
