Search & Replace Unicode Text

H5N1 · October 20, 2010

Hi all! I'm new here and I'm a total amateur.

Here's my problem:

I've downloaded a huge number of Czech text files. But on my system some letters in these files aren't displayed correctly.

For example, the text in the file shows as "Dobøe", but it should be "Dobře".

So I want to write a script that corrects the wrong letters automatically.

My beginner script:

$szFile = "D:\Downloads\Czechtext.txt"

$szText = FileRead($szFile,FileGetSize($szFile))

$szText = StringReplace($szText, "ø", "ř")
$szText = StringReplace($szText, "Ø", "Ř")
$szText = StringReplace($szText, "Ù", "Ů")
$szText = StringReplace($szText, "ù", "ů")
$szText = StringReplace($szText, "ì", "ě")
$szText = StringReplace($szText, "ò", "ň")
$szText = StringReplace($szText, "È", "Č")
$szText = StringReplace($szText, "è", "č")
$szText = StringReplace($szText, "ï", "ď")

FileDelete($szFile)
FileWrite($szFile,$szText)

I get no errors, but the letters are still incorrect. After running the script, the letter is just a normal "r" instead of "ř" (or "u" or "e" or "c", etc.).

Where is my mistake?

jchd · October 21, 2010

You definitely need to save your source files in UTF-8 + BOM format, to prevent any partial or erroneous "translation" of Unicode into some ANSI codepage (= destruction, like what you've experienced). You can do that in SciTE: File >> Encoding >> UTF8 + BOM, then modify the file (even a dummy action, like inserting a space and then backspacing) and save it, or else the change won't be made!

When your source file is in Unicode (UTF-8 + BOM) you can check that replacing arbitrary Unicode characters, e.g.

$szText = StringReplace($szText, "Skrýchov u Opařan", "БОЛЬШОЕ ГРИДИНО")

$szText = StringReplace($szText, " فرنسيّ عربيّ", "เขาจะได้ไปเที่ยวเมืองลาว")

will work flawlessly, even if the font you use in your editor doesn't display all these characters correctly.
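One way to confirm that a script file really was saved as UTF-8 + BOM is to look at its first three bytes, which should be EF BB BF. The following is a minimal sketch in Python rather than AutoIt, chosen only because it is easy to run and check; the function name `has_utf8_bom` is hypothetical and not part of any tool mentioned in this thread.

```python
import codecs


def has_utf8_bom(path):
    """Return True if the file at `path` starts with the UTF-8 BOM (EF BB BF)."""
    with open(path, "rb") as handle:
        # Read only as many bytes as the BOM is long (3 bytes).
        return handle.read(len(codecs.BOM_UTF8)) == codecs.BOM_UTF8
```

If this returns False for the .au3 file, the editor has not yet rewritten it in the new encoding.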

H5N1 · October 21, 2010

First, thanks a lot for your answer. But the problem has not been solved.

First I just changed the encoding option to UTF-8 + BOM, did a dummy action and saved -> incorrect letters.

Then I created a new au3 file, changed the encoding option and pasted my code from the original file -> incorrect letters.

After that I opened the (original ANSI) text file, saved it as Unicode, and ran the new (UTF-8 + BOM) script, but the letters are still incorrect.

By the way: I'm using AutoIt 3.3.6.1 and Win7.

jchd · October 21, 2010

This raises a question:

Which ANSI codepage does your system use? Easy guess: Latin-1 ANSI.

Explanation

Open the charmap.exe applet to see that the following characters have the same encoding:

ANSI Latin1 codepage = ANSI central Europe codepage

øØùìòÈèï = řŘůěňČčď

Here I mean 'ø' has encoding 0xF8 in Windows Latin1 ANSI, and 'ř' has the same 0xF8 encoding in Windows Central Europe ANSI. Same for 'Ø' and 'Ř', ...
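The same thing can be seen without charmap.exe by decoding identical bytes under both codepages. This is a small sketch in Python rather than AutoIt, assuming that "Windows Latin1 ANSI" corresponds to Python's `cp1252` codec and "Windows Central Europe ANSI" to `cp1250`; the variable names are purely illustrative.

```python
# The eight byte values behind the characters listed above.
raw = bytes([0xF8, 0xD8, 0xF9, 0xEC, 0xF2, 0xC8, 0xE8, 0xEF])

# Same bytes, two interpretations.
as_latin1 = raw.decode("cp1252")          # how a Latin-1 system shows them
as_central_europe = raw.decode("cp1250")  # what the Czech author intended

for byte, wrong, right in zip(raw, as_latin1, as_central_europe):
    print(f"0x{byte:02X}: cp1252 {wrong!r}  cp1250 {right!r}")
```

The bytes in the file never change; only the codepage used to interpret them does, which is why a character-by-character replacement is not the natural fix.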

Your input Czech text uses the Central Europe ANSI codepage (not surprisingly), but you're using another codepage to display it.

Temporarily change your system codepage to see that.

Possible solution: read the input in ANSI mode (with the Central Europe codepage active) and rewrite it in Unicode (UTF-8 + BOM is recommended) so that the characters are encoded correctly, independently of the system codepage. Use any Unicode-aware tool to display/process the Unicode output.
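The conversion described above could look roughly like the following. It is a hedged sketch in Python, not AutoIt, and it names the source codepage explicitly instead of relying on the active system codepage. The function name `convert_to_utf8_bom`, its parameters and the choice of `cp1250` as the default are assumptions made for illustration.

```python
import codecs
from pathlib import Path


def convert_to_utf8_bom(src, dst, source_encoding="cp1250"):
    """Re-encode a Central Europe ANSI text file as UTF-8 with a BOM."""
    raw = Path(src).read_bytes()
    # Interpret the bytes with the codepage the file was written in.
    text = raw.decode(source_encoding)
    # Write the BOM followed by the UTF-8 bytes; line endings are left untouched.
    Path(dst).write_bytes(codecs.BOM_UTF8 + text.encode("utf-8"))
    return text
```

A call such as `convert_to_utf8_bom("Czechtext.txt", "Czechtext_utf8.txt")` would leave the original file intact and write the converted copy next to it; both file names here are placeholders.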

H5N1 · October 21, 2010

I feel so silly now. Changing the codepage was the right thing to do. Everything is fine now. So it wasn't AutoIt's fault.

Thank you very much! (or "Děkuji mnohokrát")

Edited October 21, 2010 by H5N1
