asgarcymed

Script to Remove Unicode/Illegal Characters from File Names


Many files downloaded by eMule (ed2k/Kad) contain, in their names, Unicode characters (such as Chinese, Japanese, Korean, Arabic, Hebrew, or Russian) that the English version of Windows XP's explorer.exe treats as "illegal characters". This causes serious trouble when managing such files.

So I would like a script that automatically deletes such characters from file names, to avoid problems when trying to access the files.

PS - even when we download an eBook written entirely in English, the file names annoyingly contain such Unicode/illegal characters.

Thanks.

Regards.


MLMK - my blogging craziness...


Can you paste an example of a string you need modified?


In the English version of Windows XP, every Unicode/illegal character appears as a "square" or a "?".

Only someone who has the MUI (Multilingual User Interface) installed will see the correct Chinese/Japanese/Korean/Arabic/Hebrew/Russian characters.

?????? ????? ??????????

=> this was my attempt at a copy-paste. This forum also does not support Chinese/Japanese/Korean/Arabic/Hebrew/Russian characters.



PS - if you want to see such characters, you can look at Wikipedia in any of these (esoteric) languages...

Regards.



I'm gonna go out on a limb here...maybe:

StringRegExpReplace ( "titlestringwithforeigncharacters", "[\x10-\x1F\x21-\x2F\x3A-\x40\x5B-\x60\x80-\xFF]", "")

I think you should probably replace them with something, even if just dashes or underscores. Otherwise you could get an error trying to rename a file to nothing (in the case that all the characters are unacceptable).
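
The difference between deleting and replacing can be sketched as follows. This is only an illustration: it assumes a current AutoIt release whose regex engine accepts \x{hhhh} code points, it treats everything above U+024F as "foreign" (an assumed cut-off), and _StripForeign and _ReplaceForeign are hypothetical names.

```autoit
; Hypothetical helpers, not from the thread. Both treat anything above
; U+024F as "foreign"; that cut-off is an assumption.
Func _StripForeign($sName)
    ; Deleting: an all-foreign name collapses to an empty string.
    Return StringRegExpReplace($sName, "[^\x{0000}-\x{024F}]+", "")
EndFunc

Func _ReplaceForeign($sName)
    ; Replacing: each run of foreign characters becomes one underscore.
    Return StringRegExpReplace($sName, "[^\x{0000}-\x{024F}]+", "_")
EndFunc

; ChrW() builds the sample so the script file can stay plain ASCII.
Local $sName = ChrW(0x4E2D) & ChrW(0x6587)   ; two Chinese characters
ConsoleWrite("deleted:  [" & _StripForeign($sName) & "]" & @CRLF)
ConsoleWrite("replaced: [" & _ReplaceForeign($sName) & "]" & @CRLF)
```

With the empty replacement nothing is left to rename the file to, whereas the underscore version always leaves at least one character.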


I'm not even sure if that's what the OP is after. If it works, I will leave the replacement to his discretion.


Thanks to all of you for replying!

Saunders - you are correct - Chinese/Japanese/Korean/Arabic/Hebrew/Russian characters MUST be replaced by underscores (for the reason you explained so well)...

I still need help with this StringRegExpReplace...

Maybe someone who is from one of those countries, and so has to deal with both languages, can help...

Thanks.

Regards.



I need to allow: All English/German and Latin (Portuguese/Spanish/French/Italian) letters, lower and upper case [A..Z; À; Ã; É; Ê; Í; Ì; Ó; Ò; Õ; Ñ; Ç]

AND

!; ""; #; $; %; &; @; £; §; {; }; '; «; »; [American and European Keyboard]

I urgently need to remove ALL Chinese, Japanese, Korean, Arabic, Hebrew, and Russian characters (all of those letters look "crazy")...

Could this be it?

StringRegExpReplace ("", "[^\u0000-\u024F]+", "_")

I am trying to keep these blocks:

\p{InBasic_Latin}: U+0000..U+007F

\p{InLatin-1_Supplement}: U+0080..U+00FF

\p{InLatin_Extended-A}: U+0100..U+017F

\p{InLatin_Extended-B}: U+0180..U+024F
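
The four blocks listed above are contiguous, so together they run from U+0000 through U+024F and one negated range can stand in for all of them. Here is a small check of that idea. It assumes AutoIt's regex engine writes code points as \x{hhhh} (the \u notation belongs to flavors such as JavaScript or Java), and the sample characters are picked only for illustration.

```autoit
; One negated class standing in for Basic Latin, Latin-1 Supplement,
; Latin Extended-A and Latin Extended-B (U+0000..U+024F).
Local Const $sOutsideLatin = "[^\x{0000}-\x{024F}]"

; StringRegExp() in its default mode returns 1 on a match, 0 otherwise.
ConsoleWrite("A          (U+0041): " & StringRegExp("A", $sOutsideLatin) & @CRLF)
ConsoleWrite("C cedilla  (U+00C7): " & StringRegExp(ChrW(0x00C7), $sOutsideLatin) & @CRLF)
ConsoleWrite("Cyrillic   (U+0416): " & StringRegExp(ChrW(0x0416), $sOutsideLatin) & @CRLF)
ConsoleWrite("Chinese    (U+4E2D): " & StringRegExp(ChrW(0x4E2D), $sOutsideLatin) & @CRLF)
```

Under these assumptions the first two lines should report 0 (kept) and the last two 1 (would be replaced). The symbols £, §, « and » from the allow-list sit in the Latin-1 Supplement block, so the same range keeps them.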

Is there any Unicode and RegEx expert here?

Thanks.

Regards.



It looks correct to me. What is the problem?


I am now using "RegexBuddy", a superb Win32 app for working with and learning about regular expressions...

Using Google, I found a txt file (inside the attached zip) that has many, many Chinese characters and a few English characters. I opened it with RegexBuddy and tested both regexes:

[\x10-\x1F\x21-\x2F\x3A-\x40\x5B-\x60\x80-\xFF]

and

[^\u0000-\u024F]+

But the results of test/debug were very confusing...
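
One possible source of the confusion is that the two patterns do quite different jobs. The first removes ASCII punctuation (including the dot before the extension) and its highest range stops at \xFF, so it never reaches Chinese characters, which start far above that. The second targets everything above U+024F. A side-by-side sketch, assuming AutoIt's \x{hhhh} notation in place of \u and a made-up file name:

```autoit
; Made-up sample: "my-book" + two Chinese characters + ".pdf"
Local $sName = "my-book" & ChrW(0x4E2D) & ChrW(0x6587) & ".pdf"

; Pattern 1 from the thread, with the empty replacement it was posted with.
Local $sFirst = StringRegExpReplace($sName, "[\x10-\x1F\x21-\x2F\x3A-\x40\x5B-\x60\x80-\xFF]", "")

; Pattern 2 from the thread, rewritten with \x{hhhh} escapes (assumption).
Local $sSecond = StringRegExpReplace($sName, "[^\x{0000}-\x{024F}]+", "_")

; Report the length of the first result rather than the text itself,
; because the console may not be able to display the Chinese characters.
ConsoleWrite("first:  length " & StringLen($sFirst) & @CRLF)
ConsoleWrite("second: " & $sSecond & @CRLF)
```

Under these assumptions the first result keeps both Chinese characters but loses the hyphen and the dot, while the second should come out as my-book_.pdf.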

On top of that, I got the Windows XP MUI (Multilingual User Interface) and installed all the languages I mentioned (Chinese/Japanese/Korean/Arabic/Hebrew/Russian)...

My confusion is now even bigger: some apps load the Chinese characters correctly (for example), but the majority of apps still cannot deal with such characters (they show "squares", "???????????", or distorted characters like the ones we get when reading a binary file with a text editor).

I am thoroughly confused. Must I have MUI installed? What is the best RegEx to remove such characters from file names? If I have MUI installed, do I still need such a regex/script? What should I do to solve this once and for all?

Is there any Chinese/Japanese/Korean/Arabic/Hebrew/Russian person here? If so, how do you manage the character conflicts between your native language and English?

Help is very much appreciated!

Thanks in advance.

Regards.

CHINES.zip




StringRegExpReplace ("", "[^\u0000-\u024F]+", "_")

[The rest of this reply is garbled in the source and could not be recovered.]

"Programming today is a race between software engineers striving to build bigger and better idiot-proof programs, and the universe trying to build bigger and better idiots. So far, the universe is winning."- Rick Cook


Bowmore - thank you for replying! To solve this once and for all, could you please post the complete script (with the correct sequence of different regexes)?

Please note that today is the first day in my life that I have dealt with regexes. If you help me, you can be sure I will use your help to study this important subject, which I had been missing out on.

Thank you!

Regards.
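
For reference, a complete script along the lines requested here might look like the sketch below. It is not Bowmore's script. It assumes a Unicode-capable AutoIt release, uses the \x{hhhh} form of the U+0000..U+024F range discussed earlier, and _RenameForeign and the example path are hypothetical.

```autoit
; Hypothetical helper: rename every file in $sDir whose name contains
; characters above U+024F, replacing each run with one underscore.
; Returns the number of files renamed.
Func _RenameForeign($sDir)
    Local $iRenamed = 0
    Local $hSearch = FileFindFirstFile($sDir & "\*.*")
    If $hSearch = -1 Then Return 0   ; folder missing or empty

    While 1
        Local $sOld = FileFindNextFile($hSearch)
        If @error Then ExitLoop          ; no more entries
        If @extended Then ContinueLoop   ; entry is a folder: skip it

        Local $sNew = StringRegExpReplace($sOld, "[^\x{0000}-\x{024F}]+", "_")
        ; Rename only if the name changed and the new name is not taken.
        If Not ($sNew == $sOld) And Not FileExists($sDir & "\" & $sNew) Then
            If FileMove($sDir & "\" & $sOld, $sDir & "\" & $sNew) Then $iRenamed += 1
        EndIf
    WEnd

    FileClose($hSearch)
    Return $iRenamed
EndFunc

; Example call; the path is a placeholder.
; ConsoleWrite(_RenameForeign("C:\Downloads") & " file(s) renamed" & @CRLF)
```

Files whose cleaned name would collide with an existing file are left untouched rather than overwritten, so it is worth trying this on a copy of the folder first.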




Try the 'cleanup' button in the Mass Rename function in eMule.
