Script to Remove UniCode/Illegal Characters from Files' Name

asgarcymed · December 6, 2007

Many files downloaded by eMule (ed2k/Kad) contain Unicode characters in their names (such as Chinese, Japanese, Korean, Arabic, Hebrew, Russian), which the English version of Windows XP's explorer.exe treats as "illegal characters". This causes serious trouble when managing such files.

Thus, I would like a script that automatically deletes such characters from file names, to avoid problems when trying to access them.

PS - even when we download an eBook written entirely in English, the file names stupidly contain such Unicode/illegal characters.

Thanks.

Regards.

weaponx · December 6, 2007

Can you paste an example of a string you need modified?

asgarcymed · December 6, 2007

In English Windows XP, every Unicode/illegal character appears as a "square" or a "?".

Only someone who has the MUI (Multilingual User Interface) installed will see the correct Chinese/Japanese/Korean/Arabic/Hebrew/Russian characters.

?????? ????? ??????????

=> this was my attempt at a copy-paste. This forum does not support Chinese/Japanese/Korean/Arabic/Hebrew/Russian characters either.

asgarcymed · December 6, 2007

PS - if you want to see such characters, you can look at Wikipedia in any of these (esoteric) languages...

Regards.

weaponx · December 6, 2007

I'm gonna go out on a limb here...maybe:

StringRegExpReplace ( "titlestringwithforeigncharacters", "[\x10-\x1F\x21-\x2F\x3A-\x40\x5B-\x60\x80-\xFF]", "")
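To see what the character class in this suggestion actually covers, here is a minimal sketch in Python rather than AutoIt (an assumption made only so the behaviour can be checked independently; AutoIt's regex engine may treat the ranges differently). The helper name `strip_chars` is hypothetical. The class only reaches up to `\xFF`, so characters beyond that, such as Chinese ones, are left alone, while ordinary punctuation such as `.`, `-` and `_` is removed.

```python
import re

# weaponx's suggested character class, copied from the post above.
PATTERN = re.compile(r"[\x10-\x1F\x21-\x2F\x3A-\x40\x5B-\x60\x80-\xFF]")


def strip_chars(name: str) -> str:
    """Delete every character matched by the class (hypothetical helper)."""
    return PATTERN.sub("", name)


# Dot, dash and underscore fall inside the ranges, so the extension dot is lost.
print(strip_chars("My_Book-v2.pdf"))

# U+4E2D and U+6587 are above \xFF, so they are NOT removed by this class.
print(strip_chars("\u4e2d\u6587 book.txt"))
```

In this sketch the pattern removes accented Latin letters (which sit in `\x80-\xFF`) but keeps the very characters the original poster wants gone, so it may not be a good fit for the stated goal.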

therks · December 6, 2007

I think you should probably have them replaced with something, even if just dashes or underscores. Otherwise you could get an error trying to rename a file to nothing (in the case that all the characters are unacceptable).
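The point about empty names can be illustrated with a short Python sketch (Python is used here only for illustration, not because the thread uses it). The pattern `[^\x00-\x7F]`, meaning "anything that is not ASCII", and both helper names are assumptions chosen for the example.

```python
import re

# Assumed pattern for illustration: any non-ASCII character.
NON_ASCII = re.compile(r"[^\x00-\x7F]")


def delete_chars(stem: str) -> str:
    """Remove offending characters outright."""
    return NON_ASCII.sub("", stem)


def replace_chars(stem: str, filler: str = "_") -> str:
    """Swap each offending character for a filler character."""
    return NON_ASCII.sub(filler, stem)


stem = "\u4e2d\u6587"  # a name made up entirely of Chinese characters
print(repr(delete_chars(stem)))   # nothing left to use as a file name
print(repr(replace_chars(stem)))  # one filler per removed character
```

Deleting leaves an empty string for a name that is entirely non-ASCII, while replacing always leaves something to rename to.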

weaponx · December 6, 2007

I'm not even sure that's what the OP is after. If it works, I will leave the replacement to his discretion.

asgarcymed · December 6, 2007

Thanks to all of you for replying!

Saunders - you are correct - Chinese/Japanese/Korean/Arabic/Hebrew/Russian characters MUST be replaced by underscores (for the reason you explained so well)...

I still need help with that StringRegExpReplace...

Maybe someone from one of those countries, who has to deal with both languages, can help...

Thanks.

Regards.

asgarcymed · December 6, 2007

A "MUST"!... Look at:

http://www.isthisthingon.org/unicode/allchars1.php

All characters are there!

My problem is that I do not know how to make the RegExp...

Please help!

Thank you!

Regards.

asgarcymed · December 6, 2007

I need to allow: All English/German and Latin (Portuguese/Spanish/French/Italian) letters, lower and upper case [A..Z; À; Ã; É; Ê; Í; Ì; Ó; Ò; Õ; Ñ; Ç]

AND

!; ""; #; $; %; &; @; £; §; {; }; '; «; »; [American and European Keyboard]

I very urgently need to kill ALL Chinese, Japanese, Korean, Arabic, Hebrew, Russian characters (all the letters are "crazy")...

Could this be?:

StringRegExpReplace ("", "[^\u0000-\u024F]+", "_")

I am going by these Unicode blocks:

\p{InBasic_Latin}: U+0000..U+007F

\p{InLatin-1_Supplement}: U+0080..U+00FF

\p{InLatin_Extended-A}: U+0100..U+017F

\p{InLatin_Extended-B}: U+0180..U+024F

Is there a Unicode and RegEx expert here?

Thanks.

Regards.
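The negated range proposed above can be tried out in a few lines of Python, which accepts the `\uXXXX` notation inside a pattern. This is a sketch for checking the logic, not a statement about AutoIt: PCRE-style engines generally write the same range as `[^\x{0000}-\x{024F}]`, so whether `\u` is accepted by a given AutoIt build is something to confirm in its documentation. The helper name `latin_only` is hypothetical.

```python
import re

# Everything outside Basic Latin .. Latin Extended-B (U+0000..U+024F).
# The trailing "+" collapses a run of such characters into one underscore.
NOT_LATIN = re.compile(r"[^\u0000-\u024F]+")


def latin_only(name: str) -> str:
    """Replace each run of non-Latin characters with a single underscore."""
    return NOT_LATIN.sub("_", name)


print(latin_only("\u4e2d\u6587\u7dad\u57fa Wiki.txt"))
print(latin_only("A\u00e7\u00e3o \u00d1and\u00fa.pdf"))  # accented Latin is kept
```

Note that the range U+0000..U+024F still lets through control characters and the characters Windows itself forbids in names (`\ / : * ? " < > |`), so a real script would probably want to handle those separately.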

weaponx · December 6, 2007

It looks correct to me. What is the problem?

asgarcymed · December 6, 2007

I am now using "RegexBuddy", a superb Win32 app for working with and learning about regular expressions...

Using Google, I found a txt file (inside the attached zip) which has many, many Chinese characters and a few English characters... I opened it with RegexBuddy and tested both regexes:

[\x10-\x1F\x21-\x2F\x3A-\x40\x5B-\x60\x80-\xFF]

and

[^\u0000-\u024F]+

But the results of test/debug were very confusing...

What's more, I got the Windows XP MUI (Multilingual User Interface) and installed all the languages I mentioned (Chinese/Japanese/Korean/Arabic/Hebrew/Russian)...

My confusion is now even bigger: some apps load the Chinese characters correctly (for example), but most apps still cannot deal with them (they show "squares", "???????????", or distorted characters, like when we try to read a binary file with a text editor)...

I am thoroughly confused... Must I have the MUI installed? What is the best regex to kill such characters in file names? If I have the MUI installed, do I still need such a regex/script? What should I do to solve this once and for all?

Is there any Chinese/Japanese/Korean/Arabic/Hebrew/Russian person here? If so, how do you manage the character conflicts between your native language and English?

Help is very appreciated!

Thanks in advance.

Regards.

CHINES.zip
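One plausible explanation for the "???????????" symptom (an assumption, not something confirmed in this thread) is that programs which are not Unicode-aware convert names to the system's ANSI code page, which on English Windows is typically Windows-1252, and any character without a mapping there comes out as "?". Squares, by contrast, usually mean the text is intact but the font has no glyph for it. The Python sketch below imitates the first effect; the function name is hypothetical.

```python
def as_seen_by_ansi_app(name: str, codepage: str = "cp1252") -> str:
    """Round-trip a name through a legacy code page.

    Characters the code page cannot represent are replaced by "?",
    which mimics what a non-Unicode application would display.
    """
    return name.encode(codepage, errors="replace").decode(codepage)


print(as_seen_by_ansi_app("\u4e2d\u6587 book.txt"))    # Chinese becomes "?"
print(as_seen_by_ansi_app("A\u00e7\u00e3o.pdf"))       # Western accents survive
```

If this is what is happening, it would also explain why the results differ from one application to another even after the extra languages are installed.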

asgarcymed · December 6, 2007

PS - If you have problems with the attached file, please see:

http://www.xys.org/xys/netters/others/net/wiki2.txt

Thanks.

Regards.

Bowmore · December 6, 2007

[The body of this reply, including its code, is garbled in the source and cannot be recovered.]

asgarcymed · December 6, 2007

Bowmore - thank you for replying! To solve this once and for all, could you please post the complete script (with the correct sequence of the different regexes)?

Please note that today is the first day in my life that I have dealt with regexes... If you help me, you can be sure I will use your precious help to study this important subject that I had been missing out on...

Thank you!

Regards.
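As a rough idea of what a complete script could look like, here is a Python sketch (not AutoIt, and not the script posted in this thread). It replaces runs of characters outside U+0020..U+024F, plus the characters Windows forbids in file names, with an underscore, and adds a counter when two files would end up with the same name. All function names, the `dry_run` parameter and the `(1)` suffix style are assumptions made for the example.

```python
import os
import re
from pathlib import Path

# Runs of characters outside U+0020..U+024F, or runs of characters that
# Windows does not allow in file names.
BAD = re.compile(r'[^\u0020-\u024F]+|[\\/:*?"<>|]+')


def clean_name(name: str, filler: str = "_") -> str:
    """Return the name with each offending run replaced by the filler."""
    stem, ext = os.path.splitext(name)
    return BAD.sub(filler, stem) + BAD.sub(filler, ext)


def rename_all(folder, dry_run: bool = True):
    """Clean every file name in a folder; return (old, new) name pairs.

    With dry_run=True nothing is renamed, the planned changes are only listed.
    """
    folder = Path(folder)
    taken = {p.name for p in folder.iterdir()}  # names already in use
    changes = []
    for path in sorted(folder.iterdir()):
        if not path.is_file():
            continue
        new = clean_name(path.name)
        if new == path.name:
            continue
        stem, ext = os.path.splitext(new)
        counter = 1
        while new in taken:  # avoid overwriting another file
            new = f"{stem}({counter}){ext}"
            counter += 1
        taken.add(new)
        changes.append((path.name, new))
        if not dry_run:
            path.rename(path.with_name(new))
    return changes
```

Calling `rename_all` on a download folder with the default `dry_run=True` first lists what would change; passing `dry_run=False` performs the renames. The name comparison is case-sensitive, which is a simplification on Windows.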

Confuzzled · December 15, 2007

Try the 'cleanup' button in the Mass Rename function in eMule.
