Unicode and StringRegExpReplace issues

No problem but I'm beginning to think you might want to do another replacement after the loop and that is to delete invalid characters. That way you probably wouldn't need to validate. I can whip up that SRER quickly since I already had it earlier but didn't think to save it.

That's what this is for:

$Testarray = StringRegExp($sStr, "([[:alnum:]\-\s\x27\x9e\x9f\xc9-\xff]+)", 3)
MsgBox(0, "Result2", _ArrayToString($Testarray, ""))

Actually, this code just concatenates out the bad characters. My SRER is working fine right now...for the characters I need.

Edited by lgwapnitsky


Instead of bothering to create the array with the RegEx you might want to think about just deleting anything except the allowable characters.

$sUserName = StringRegExpReplace($sUserName, "[^[:alnum:]\-\s\x27\x9e\x9f\xc9-\xff]", "")

In this one, the leading "^" negates the character class, so anything listed between the first "[" and the last "]" is excluded from deletion.
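A quick way to see the negated class at work is to run it on a throwaway string. This is only a sketch: $sSample and the other variable names are made up for illustration, and the pattern is the one posted above, unchanged.

```autoit
; Sketch: strip everything that is NOT in the allow-list.
; $sSample is a hypothetical test value; ChrW(0xE9) is e with acute,
; built from its code point so the script's file encoding does not matter.
Local $sSample = "Jos" & ChrW(0xE9) & " O'Neil-Smith #42!"

; The leading ^ inverts the class, so only characters outside the list are replaced by "".
Local $sClean = StringRegExpReplace($sSample, "[^[:alnum:]\-\s\x27\x9e\x9f\xc9-\xff]", "")
Local $iRemoved = @extended ; number of replacements, read before any other call resets it

MsgBox(0, "Cleaned", $sClean & @CRLF & "Characters removed: " & $iRemoved)
```

The "#" and "!" fall outside the list and should be dropped, while the letters, digits, space, hyphen and apostrophe stay.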


George

Question about decompiling code? Read the decompiling FAQ and don't bother posting the question in the forums.

Be sure to read and follow the forum rules. -AKA the AutoIt Reading and Comprehension Skills test.***

The PCRE (Regular Expression) ToolKit for AutoIT - (Updated Oct 20, 2011 ver:3.0.1.13) - Please update your current version before filing any bug reports. The installer now includes both 32 and 64 bit versions. No change in version number.

Visit my Blog .. currently not active but it will soon be resplendent with news and views. Also please remove any links you may have to my website. it is soon to be closed and replaced with something else.

"Old age and treachery will always overcome youth and skill!"

Very clean...I like it. Thanks.


btw - shouldn't it be \xc0 instead of \xc9? That's where A with grave is...

Should it? I just copied the characters you were using, however \xc0 is correct. Easily tested if you follow that link I gave you.
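If you would rather check it in a script than on a web page, something like the following should do. It is a minimal sketch with made-up variable names, and it only compares the two range starts being discussed.

```autoit
; Sketch: does A with grave (U+00C0) survive each version of the range?
Local $sGrave = ChrW(0xC0)

; Range starting at \xc9: U+00C0 is outside the allow-list, so it gets deleted.
Local $sWithC9 = StringRegExpReplace($sGrave, "[^\xc9-\xff]", "")

; Range starting at \xc0: U+00C0 is inside the allow-list, so it is kept.
Local $sWithC0 = StringRegExpReplace($sGrave, "[^\xc0-\xff]", "")

MsgBox(0, "Range check", "Start at \xc9: [" & $sWithC9 & "]" & @CRLF & _
        "Start at \xc0: [" & $sWithC0 & "]")
```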

George


Though I don't want to threadjack this erm thread, I'm curious.. if we want to check for specific Unicode characters, we can do it as long as we know the range of hex codes, like for Cyrillic, the range is something like 0x400-0x4FF and some in the 0x5xx range.

I just did a test on Cyrillic characters, mixed with Japanese and English, and the following PCRE picked out the correct Cyrillic characters, but I'm kinda lost because I can't get hex codes for all character sets.

Grab (a subset of) Cyrillic Unicode characters: "[\x{0400}-\x{04FF}]"
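For anyone who wants to reproduce that test, here is a rough sketch. The test string and variable names are my own, and I widened the range to \x{052F} on the assumption that you also want the Cyrillic Supplement block (U+0500 to U+052F), which is where the "0x5xx" characters live.

```autoit
#include <Array.au3>

; Hypothetical mixed string: Latin text, three Cyrillic letters, two CJK ideographs.
Local $sMixed = "abc " & ChrW(0x041F) & ChrW(0x0440) & ChrW(0x0438) & " " & ChrW(0x65E5) & ChrW(0x672C)

; Flag 3 returns an array of all matches; each match is one run of Cyrillic characters.
Local $aRuns = StringRegExp($sMixed, "[\x{0400}-\x{052F}]+", 3)

If @error Then
    MsgBox(0, "Cyrillic", "No Cyrillic characters found")
Else
    MsgBox(0, "Cyrillic", _ArrayToString($aRuns, " | "))
EndIf
```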

GEOSoft, do you know of any good sites which list the hex-code ranges for different languages? I just sorta stumbled on what I did, and it's really a general overview. And I can't get my brain around this 'U+' notation. Plus now I'm reading that 16 bits isn't going to capture the entire Unicode character set (there are 32-bit versions though..)
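On the 'U+' notation: it is just the code point written in hexadecimal, so U+0416 is the same number as 0x0416 in a script or \x{0416} in a pattern. A small sketch, with a code point I picked arbitrarily for illustration:

```autoit
; Sketch: the same code point written three ways.
Local $iCodePoint = 0x0416          ; U+0416, a Cyrillic capital letter
Local $sChar = ChrW($iCodePoint)    ; build the character from its code point

; AscW() gives the number back; Hex(..., 4) formats it as the four digits used after "U+".
Local $sLabel = "U+" & Hex(AscW($sChar), 4)

; With the default flag, StringRegExp returns 1 on a match and 0 otherwise.
Local $iMatch = StringRegExp($sChar, "\x{0416}")

MsgBox(0, "U+ notation", $sLabel & " is " & $sChar & @CRLF & "Matched by \x{0416}: " & $iMatch)
```

As for the 16-bit remark, Unicode code points run up to U+10FFFF, and UTF-16 stores anything above U+FFFF as a pair of 16-bit units, so this sketch only covers characters up to U+FFFF.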



I don't have a link to the page I found when I first went looking for this information. I searched the forums earlier to find the script I did for getting the Unicode characters (Full as I recall) using a RegEx. I still haven't found it and I think that may be because I either didn't have the correct search term or the limit on the number of search returns cut off the list.

George


Wait a second here. I do remember that I found the correct codes for the sets some place on http://unicode.org/


George


Ascend4nt,

The right spot for much Unicode stuff is there. Be sure to visit UniView, Converters and other shortcuts Ishida offers.

You can also benefit looking there to discover in detail which script maps where and in which Unicode plane.

Finally, a Unicode primer, not very tech but correct. While you are there, take a few hours to read the 1075 articles Joel posted on his blog. Most are worth it.


This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)


Awesome, thanks for the links guys!



In case you are interested, I PMed you another link that you might be interested in keeping track of. Also there was a minor update to the toolkit today, same link.

George


Strange- I thought I posted my final testing code yesterday, but I don't see it now:

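#include <String.au3>

; Each row pairs a character-class body with its plain ASCII replacement.
; 17 rows are defined, so the array is sized [17][2]; with [19][2] the two
; empty rows would build the invalid pattern "[]" inside the loop.
Global $UnicodeReplacements[17][2] = [["\xc0-\xc5","A"],["\xe0-\xe5","a"], ["\xc6","AE"], ["\xc7","C"],["\xe7","c"], _
                                        ["\xc8-\xcb","E"], ["\xe8-\xeb","e"], ["\xcc-\xcf","I"], ["\xec-\xef","i"], _
                                        ["\xd1","N"],["\xf1","n"], ["\xd2-\xd6\xd8","O"], ["\xf2-\xf6\xf8","o"], _
                                        ["\xd9-\xdc","U"], ["\xf9-\xfc","u"], ["\xdd","Y"], ["\xfd","y"]]

$ib = InputBox("Test","enter test string")

; Delete everything outside the allow-list. The range starts at \xc0 (A with grave),
; not \xc9, otherwise \xc0-\xc8 would be deleted before the loop can map them.
$ib = StringRegExpReplace($ib, "[^[:alnum:]\-\s\x27\x9e\x9f\xc0-\xff]", "")

; Fold each accented range to its ASCII base letter.
for $ur = 0 to UBound($UnicodeReplacements) -1
    $ib = StringRegExpReplace($ib, "[" & $UnicodeReplacements[$ur][0] & "]", $UnicodeReplacements[$ur][1])
next

ConsoleWrite(_StringProper($ib) & @CRLF)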

Also, I'm curious about the toolkit link you PM'ed to Ascend4nt...

Edited by lgwapnitsky


I think you did and if I'm correct it's way back on page 1 of this thread. Your array is different and I'm guessing that you already allowed for the changes.

ConsoleWrite(), contrary to popular belief, is not a great method of testing. Use MsgBoxes instead. For me it doesn't matter because it's only a single click to convert them anyway, and in your case there is only one to be done. This portion of the code seems to be fine when I test it, but I'm curious to know how it is with the rest of your script.

It was just a link to the revisions for the latest version of PCRE which AutoIt isn't using yet.

Edited by GEOSoft

George


Ahhh...i'm not awake yet.

One more question...for now...

\x2122 doesn't seem to be matching ™ (which I'm also still trying to figure out why I can't type it on my keyboard). This is for a different portion of the utility.

Thanks


Give me a while to get some caffeine into me and I'll look at that.

George


You can type it with your keyboard using the compose sequence Alt0153 and it indeed is \x2122 or 8482 decimal.



Actually that should be \x99 which also doesn't work. I suspect it's a bug in the PCRE engine and not in AutoIt.


George


Hi George,

Beware that \x99 is the Latin1 ANSI assignment for ™ (the Unicode code point U+2122).

Charmap.exe shows it nicely, once you check the "advanced view" checkbox (I don't know how it gets called in non frenchy Windows). You can switch the displayed charset between Unicode, Ansi (select your codepage), OEM, ...


I also tried the literal (\xe2\x84\xa2) which did not work either.
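For what it's worth, the attempts so far can be compared side by side with the braced form \x{2122}. This is a sketch rather than a verified fix. It assumes AutoIt hands the string to PCRE as Unicode text, and it relies on the PCRE rule that \x without braces takes at most two hex digits, so "\x2122" is read as \x21 ("!") followed by the literal text "22". The variable names are made up.

```autoit
; Sketch: compare four ways of writing the trademark sign in a pattern.
; $sTM is a hypothetical test value built from the code point U+2122.
Local $sTM = "Widget" & ChrW(0x2122)

Local $iBraced = StringRegExp($sTM, "\x{2122}")     ; one code point, U+2122
Local $iBare = StringRegExp($sTM, "\x2122")         ; \x21 then "22", i.e. the text "!22"
Local $iAnsi = StringRegExp($sTM, "\x99")           ; code point U+0099, the ANSI byte value
Local $iUtf8 = StringRegExp($sTM, "\xe2\x84\xa2")   ; three separate code points, not UTF-8 bytes

MsgBox(0, "Matching U+2122", _
        "\x{2122}: " & $iBraced & @CRLF & _
        "\x2122: " & $iBare & @CRLF & _
        "\x99: " & $iAnsi & @CRLF & _
        "\xe2\x84\xa2: " & $iUtf8)

; If the braced form matches, the same escape can be used in a replacement pass.
Local $sPlain = StringRegExpReplace($sTM, "\x{2122}", "(TM)")
MsgBox(0, "Replaced", $sPlain)
```

If only the first line reports 1, the issue is the escape syntax and the ANSI versus Unicode values rather than a PCRE bug.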
