Reading Unicode symbols with FileRead()

Knivy

Hello.

I was writing a simple program that finds certain Unicode characters in a text and replaces them with others (transliteration), but I got blank output instead of the characters.

Eventually I found that in some cases FileRead() behaves strangely and does not recognize Unicode characters correctly. I could not find anything about this in the documentation or on the forum, so here are the details.

Illustration - a program that writes $chars into a text file twice:

$txt=FileOpen("text.txt",2+256) ;overwrite mode (2), UTF-8 without BOM (256)
$chars="ж" ;a Cyrillic (Russian) letter
FileWrite($txt,$chars)
FileFlush($txt)
FileSetPos($txt,0,0) ;back to the beginning of the file
$chars=FileRead($txt,1) ;the char is not read, so nothing is written below: I get one "ж" instead of two
FileWrite($txt,$chars)

The behavior of this procedure differs depending on some factors:

1) the ratio between the FileRead() count parameter and the number of chars:

Reading a block of at least twice the string's length seems to work correctly, as does reading full lines or the whole file.

FileRead($txt,2), FileReadLine($txt) or FileRead($txt) all work correctly in the example above.

"жжж" and FileRead($txt,3) ===> Р¶РР¶¶ instead of "жжжжжж" (unreadable)

"жжж" and FileRead($txt,4) ===> жжжж instead of "жжжжжж" (some chars lost)

"жжж" and FileRead($txt,5) ===> жжРжж (again unreadable)

"жжж" and FileRead($txt,6) ===> жжжжжж (correct, but why read 6 chars to write 3?)

2) Latin chars are read correctly:

"j" and FileRead($txt,1) ===> jj (correct)

"jjj" and FileRead($txt,3) ===> jjjjjj (correct)

"jжж" and FileRead($txt,3) ===> jжjж instead of "jжжjжж" (some Cyrillic chars lost)

"jжж" and FileRead($txt,4) ===> jР¶РjР¶ (Cyrillic chars are unreadable)

"jjж" and FileRead($txt,4) ===> jjжjjж (correct)

3) the file reading mode:

The above example works correctly with FileOpen("text.txt",2).

But files containing Unicode chars that are opened in the default read-only mode show the same problems.

In my program I needed to read Russian characters one by one from a file written in Notepad and opened in read-only mode. The chars could not be read with FileRead($txt, 1); the program silently skipped them. In the end I managed it by creating a temporary file in the default writing mode (2) and copying the full contents of my input files there. After that they seemed to be read correctly.

I suppose all this is related to these Unicode chars taking 2 bytes each. The documentation says that the parameter of FileRead() is the number of characters to read, while in the cases illustrated above it behaves more like a number of bytes. The problem is:

1) either it's a bug, or it's something not documented for the FileRead() and FileOpen() functions (in the English version of the documentation);

2) it's really inconvenient, especially when reading text files that mix Latin and other Unicode chars.

bogQ

When I set File->Encoding->UTF-8 without BOM in SciTE,

I don't have any problem reading ж from the txt file:

$txt=FileOpen("text.txt",2+256) ;write mode (2), UTF-8 without BOM (256)
$chars="ж" ;a Cyrillic (Russian) letter
FileWrite($txt,$chars)
FileWrite($txt,$chars)
FileClose($txt) ;close the write handle before reading
$txt=FileOpen("text.txt",256) ;read mode, UTF-8 without BOM
$chars=FileRead($txt) ;reads the whole file, so both chars come back
FileClose($txt)
MsgBox(0,"",$chars)

You'll need to close the file after writing before attempting to read it.


Knivy

I've tried reading Unicode files in read-only mode; that doesn't change the result.

The example above works well for Latin characters: they are read correctly even though the file is opened in write mode.

Your example reads the whole file at once:

$chars=FileRead($txt)

which works correctly.

The problem appears when reading the Unicode characters from a file one by one:

$chars=FileRead($txt,1)

This fails for Unicode in many of the cases listed above, and I think there is a bug there.

Thanks for the idea with StringReplace() and File->Encoding, I'll try that.

bogQ

Yes, I see the problem: you're trying to read every char separately.

I don't know if it's a bug, if it's meant to work that way, or if we overlooked something.

Workaround:

$txt=FileOpen("text.txt",2+256) ;write mode (2), UTF-8 without BOM (256)
$chars="ж" ;a Cyrillic (Russian) letter
FileWrite($txt,$chars)
FileWrite($txt,$chars)
FileClose($txt)
$txt=FileOpen("text.txt",256) ;read mode, UTF-8 without BOM
$chars=StringSplit(FileRead($txt),"") ;read the whole file, then split it into single chars
For $x = 1 To $chars[0] ;$chars[0] holds the number of chars
    MsgBox(0,"",$chars[$x])
Next
FileClose($txt)

jchd

@Knivy,

You are in fact creating your own problem.

Writing characters above 0x7F in UTF-8 encoding means you are writing 2 to 4 bytes to represent each of those characters.
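A small sketch of what that means for the letter used in this thread. It assumes the numeric flag 4 selects UTF-8 in StringToBinary(), and it builds the letter with ChrW() so the result does not depend on how the .au3 file itself is saved:

```autoit
; One Cyrillic letter, built from its code point (U+0436 = "ж").
Local $sChar = ChrW(0x0436)

; Encode it as UTF-8 (flag 4) to get the bytes that end up in the file.
Local $bUtf8 = StringToBinary($sChar, 4)

ConsoleWrite("Characters:  " & StringLen($sChar) & @CRLF) ; 1 character
ConsoleWrite("UTF-8 bytes: " & BinaryLen($bUtf8) & @CRLF) ; 2 bytes (D0 B6)

; A plain ASCII letter stays at one byte in UTF-8.
ConsoleWrite("Bytes for j: " & BinaryLen(StringToBinary("j", 4)) & @CRLF) ; 1 byte
```

A byte count or byte offset that falls between D0 and B6 lands in the middle of the character. Those two bytes, shown in a Cyrillic ANSI codepage, are presumably where output like "Р¶" in the first post comes from.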

System-level functions such as FileSetPos don't have the faintest clue as to what the file is or how data is represented inside it. They refer to a byte position, not a character position.

The presence of a BOM makes the issue even worse.

In short, you have two options. Either you use the (byte) file position and read/write from there with no regard to character encoding, in which case you need to be sure you read a stream of bytes with valid alignment or, more precisely, that you read/write at a UTF-8 character boundary. Or you read the file as a sequence of characters (or full lines) and let the various functions work at their own level, so that the UTF-8 (or UTF-16) text is recognized properly.
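The character-level route could look roughly like the sketch below. The file name is hypothetical, and the sample text is built with ChrW() so the script does not depend on its own encoding. The whole file is read in one call and only then walked character by character, so FileSetPos and partial reads never come into play:

```autoit
; Hypothetical sample file in the temp directory.
Local $sPath = @TempDir & "\utf8_chars_sample.txt"

; Write "j" followed by two Cyrillic letters as UTF-8 without BOM (2 + 256).
Local $hOut = FileOpen($sPath, 2 + 256)
If $hOut = -1 Then Exit 1
FileWrite($hOut, "j" & ChrW(0x0436) & ChrW(0x0436))
FileClose($hOut)

; Reopen for reading as UTF-8 without BOM and read everything at once.
Local $hIn = FileOpen($sPath, 256)
If $hIn = -1 Then Exit 1
Local $sAll = FileRead($hIn)
FileClose($hIn)

; Walk the decoded string one character at a time.
For $i = 1 To StringLen($sAll)
    Local $sChar = StringMid($sAll, $i, 1)
    ConsoleWrite($i & ": U+" & Hex(AscW($sChar), 4) & @CRLF)
Next
```

For large files the same idea works line by line with FileReadLine() instead of a single FileRead().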

Any attempt to mix high-level (character-wise) functions with low-level (byte-wise) functions will fail at some point.
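For comparison, a sketch of staying entirely at the byte level: the file is opened in binary mode (flag 16), where FileRead() returns raw bytes, and the complete byte sequence is decoded once with BinaryToString() (flag 4 assumed to mean UTF-8). The file name is again hypothetical:

```autoit
; Hypothetical sample file in the temp directory.
Local $sPath = @TempDir & "\utf8_bytes_sample.txt"

; Write "j" followed by two Cyrillic letters as UTF-8 without BOM (2 + 256).
Local $hOut = FileOpen($sPath, 2 + 256)
If $hOut = -1 Then Exit 1
FileWrite($hOut, "j" & ChrW(0x0436) & ChrW(0x0436))
FileClose($hOut)

; 16 = binary mode: FileRead() returns the raw bytes of the file.
Local $hIn = FileOpen($sPath, 16)
If $hIn = -1 Then Exit 1
Local $bAll = FileRead($hIn)
FileClose($hIn)

ConsoleWrite("Bytes in file: " & BinaryLen($bAll) & @CRLF) ; 5 = 1 + 2 + 2

; Decode the complete byte sequence as UTF-8 (flag 4) in one step.
Local $sText = BinaryToString($bAll, 4)
ConsoleWrite("Characters:    " & StringLen($sText) & @CRLF) ; 3
```

Decoding only after all bytes are in hand avoids cutting a multi-byte character in half, which is what a partial read at an arbitrary byte count risks.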

The same reasoning applies equally to BIG-5 and other Asian multi-byte ANSI codepages.

