H5N1

Search & Replace Unicode Text

5 posts in this topic

Hi all! I'm new here and I'm a total amateur.

Here's my problem:

I've downloaded a huge number of Czech text files, but on my system some letters in these files aren't displayed correctly.

For example, the text in the file shows as "Dobøe", but it should be "Dobře".

So I want to write a script that corrects the wrong letters automatically.

My beginner script:

$szFile = "D:\Downloads\Czechtext.txt"

$szText = FileRead($szFile,FileGetSize($szFile))

$szText = StringReplace($szText, "ø", "ř")
$szText = StringReplace($szText, "Ø", "Ř")
$szText = StringReplace($szText, "Ù", "Ů")
$szText = StringReplace($szText, "ù", "ů")
$szText = StringReplace($szText, "ì", "ě")
$szText = StringReplace($szText, "ò", "ň")
$szText = StringReplace($szText, "È", "Č")
$szText = StringReplace($szText, "è", "č")
$szText = StringReplace($szText, "ï", "ď")

FileDelete($szFile)
FileWrite($szFile,$szText)

I get no errors, but the letters are still incorrect. After running the script, the letter is just a plain "r" instead of "ř" (or "u", "e", "c", etc.).

Where is my mistake?


You definitely need to save your source files in UTF-8 + BOM format to prevent any partial or erroneous "translation" of Unicode into some ANSI codepage (= destruction, like what you've experienced). You can do that in SciTE: File >> Encoding >> UTF8 + BOM. Then modify the file (even a dummy action, like inserting a space and then backspacing) and save it, or else the change won't be made!

When your source file is in Unicode (UTF-8 + BOM) you can check that replacing arbitrary Unicode characters, e.g.

$szText = StringReplace($szText, "Skrýchov u Opařan", "БОЛЬШОЕ ГРИДИНО")

$szText = StringReplace($szText, " فرنسيّ عربيّ", "เขาจะได้ไปเที่ยวเมืองลาว")

will work flawlessly, even if the font you use in your editor doesn't display all these characters correctly.
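
To make the point about UTF-8 + BOM concrete, here is a small sketch. It is written in Python rather than AutoIt, purely as a neutral illustration that is not part of the original script. The helper names `fits_in_codepage` and `utf8_with_bom` are made up for this example, and `cp1252` is assumed to stand for the Latin-1 ANSI codepage. The sketch shows that "ř" has no slot in a Latin-1 ANSI codepage at all, while a UTF-8 + BOM file stores it without loss.

```python
# Illustrative sketch (Python, not AutoIt). Helper names are hypothetical.

BOM = b"\xef\xbb\xbf"  # the three bytes that mark a file as UTF-8


def fits_in_codepage(text, codepage):
    """Return True if every character of text exists in the given codepage."""
    try:
        text.encode(codepage)
        return True
    except UnicodeEncodeError:
        return False


def utf8_with_bom(text):
    """Encode text the way a 'UTF-8 + BOM' file stores it (BOM first)."""
    return text.encode("utf-8-sig")


if __name__ == "__main__":
    # cp1252 = Windows Latin-1 ANSI, cp1250 = Windows Central Europe ANSI
    print(fits_in_codepage("ř", "cp1252"))
    print(fits_in_codepage("ř", "cp1250"))
    print(utf8_with_bom("ř").hex(" "))
```

A source file squeezed into a codepage where the character does not fit has to lose or approximate it, which matches the plain "r" reported above.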


This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)


First, thanks a lot for your answer. But the problem has not been solved.

First I just changed the encoding option to UTF-8 + BOM, did a dummy action and saved -> incorrect letters.

Then I created a new au3 file, changed the encoding option and pasted my code from the original file -> incorrect letters.

After that I opened the (original ANSI) text file, saved it as Unicode, and ran the new (UTF-8 + BOM) script, but the letters are still incorrect.

By the way: I'm using AutoIt 3.3.6.1 and Win7.


This raises a question:

Which ANSI codepage does your system use? Easy guess: Latin-1 ANSI.

Explanation

Open the charmap.exe applet to see that the following characters have the same encoding:

ANSI Latin1 codepage = ANSI central Europe codepage

øØùìòÈèï = řŘůěňČčď

Here I mean 'ø' has encoding 0xF8 in Windows Latin1 ANSI, and 'ř' has the same 0xF8 encoding in Windows Central Europe ANSI. Same for 'Ø' and 'Ř', ...
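
The same observation can be checked without charmap.exe. The following is a minimal sketch in Python (used only for illustration, not AutoIt), assuming that `cp1252` corresponds to the Windows Latin1 ANSI codepage and `cp1250` to the Windows Central Europe ANSI codepage. The function name `shared_byte_table` is hypothetical.

```python
# Illustrative sketch (Python, not AutoIt): each pair of characters below is
# one and the same byte, interpreted under two different ANSI codepages.

LATIN1_VIEW = "øØùìòÈèï"          # the bytes as shown under cp1252
CENTRAL_EUROPE_VIEW = "řŘůěňČčď"  # the same bytes as shown under cp1250


def shared_byte_table():
    """Return (latin1_char, central_europe_char, byte_value) for each pair."""
    rows = []
    for latin, czech in zip(LATIN1_VIEW, CENTRAL_EUROPE_VIEW):
        latin_bytes = latin.encode("cp1252")
        czech_bytes = czech.encode("cp1250")
        if latin_bytes != czech_bytes:
            raise ValueError("pair does not share an encoding: " + latin + czech)
        rows.append((latin, czech, latin_bytes[0]))
    return rows


if __name__ == "__main__":
    for latin, czech, value in shared_byte_table():
        print(latin, czech, hex(value))
```

So nothing in the file is actually wrong: the bytes are fine, only the codepage used to interpret them differs.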

Your input Czech text uses the Central Europe ANSI codepage (not surprisingly), but you're using another codepage to display it.

Temporarily change your system codepage to see that.

Possible solution: read the input in ANSI mode (with the Central Europe codepage active) and rewrite it in Unicode (UTF-8 + BOM is recommended) so that the characters get encoded correctly, independently of the system codepage. Use any Unicode-aware tool to display/process the Unicode output.
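
As a rough sketch of that read-as-ANSI, write-as-Unicode step, here is what the conversion amounts to at the byte level. It is in Python rather than AutoIt, for illustration only. The function names `recode_to_utf8_bom` and `convert_file` are hypothetical, and the source codepage is assumed to be `cp1250` (Windows Central Europe). Because the codepage is named explicitly, the result does not depend on the system codepage.

```python
# Illustrative sketch (Python, not AutoIt). Names are hypothetical.
from pathlib import Path


def recode_to_utf8_bom(raw, source_codepage="cp1250"):
    """Decode raw ANSI bytes with an explicit codepage, return UTF-8 + BOM bytes.

    Raises UnicodeDecodeError if the input contains a byte that the
    source codepage leaves undefined.
    """
    text = raw.decode(source_codepage)
    return text.encode("utf-8-sig")


def convert_file(src, dst, source_codepage="cp1250"):
    """Read src as ANSI bytes and write dst as UTF-8 + BOM."""
    Path(dst).write_bytes(recode_to_utf8_bom(Path(src).read_bytes(), source_codepage))


if __name__ == "__main__":
    sample = b"Dob\xf8e"  # the bytes behind "Dobøe" / "Dobře"
    print(recode_to_utf8_bom(sample).decode("utf-8-sig"))
```

No character-by-character replacement table is needed with this approach, since every byte is reinterpreted in one pass.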





I feel so silly now. Changing the codepage was the right thing to do. Everything is fine now. So it wasn't AutoIt's fault.

Thank you very much! (or "Děkuji mnohokrát" ;) )

Edited by H5N1
