Unicode and StringRegExpReplace issues


Forgive me if this has been covered, but I'm experiencing an odd issue while testing code prior to implementation.

This is the code:

$foreignchars = "([\\u00e9])?"

$text = InputBox("Test","Enter text with foreign characters")

$regtext = StringRegExpReplace($text, $foreignchars, "e")

ConsoleWrite(@ERROR & @CRLF & @EXTENDED & @CRLF & $regtext)

My output is as follows if I enter the character é:

CD: E:\testing
Current directory: E:\testing
"D:\Program Files\AutoIt3\AutoIt3.exe" "E:\testing\stringregexpforeign.au3" /ErrorStdOut
Process started >>>
0
2
eΘe<<< Process finished.
================ READY ================

Any ideas why I'm getting double output? Here's a longer example if I use the word éclair:

CD: E:\testing
Current directory: E:\testing
"D:\Program Files\AutoIt3\AutoIt3.exe" "E:\testing\stringregexpforeign.au3" /ErrorStdOut
Process started >>>
0
7
eΘeceleaeiere<<< Process finished.
================ READY ================

Thanks in advance,

Larry

Two problems:

  • The PCRE engine AutoIt uses doesn't recognize \u
  • ConsoleWrite() only displays ASCII text or special characters (which is what you have). Just thought I'd point that out since you are asking about Unicode.

Anyway, the way to check for certain Unicode characters is to use "\x{####}", and for special characters like the one you have, it can simply be "\xE9".

So change $foreignchars to this:

$foreignchars = "\xE9"

More info from PCRE.org:

5. The following Perl escape sequences are not supported: \l, \u, \L, \U, and \N. In fact these are implemented by Perl's general string-handling and are not part of its pattern matching engine. If any of these are encountered by PCRE, an error is generated.

6. The Perl escape sequences \p, \P, and \X are supported only if PCRE is built with Unicode character property support. The properties that can be tested with \p and \P are limited to the general category properties such as Lu and Nd, script names such as Greek or Han, and the derived properties Any and L&. PCRE does support the Cs (surrogate) property, which Perl does not; the Perl documentation says "Because Perl hides the need for the user to understand the internal representation of Unicode characters, there is no need to implement the somewhat messy concept of surrogates."

Also see http://perldoc.perl.org/perlre.html for information on proper use of \x. (Note that this is the *lowercase* \x hex escape, not the uppercase \X mentioned in #6 above, which is a different escape.)
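
For reference, the doubled output in the first post seems to come from the pattern itself rather than from the console. AutoIt string literals do not treat the backslash as an escape, so "([\\u00e9])?" reaches PCRE as an optional group around a class of five literal characters (\, u, 0, e, 9), and the trailing ? lets it match the empty string wherever none of those characters is found. Below is a minimal sketch comparing it with "\xE9", assuming a Unicode build of AutoIt; the variable names are illustrative only.

```autoit
; ChrW(0xE9) builds "é" without depending on how the script file is encoded.
Local $sText = ChrW(0xE9) & "clair" ; "éclair"

; Original pattern: matches the empty string before every character and at the
; end of the string, so "e" is inserted at each of those positions.
Local $sBad = StringRegExpReplace($sText, "([\\u00e9])?", "e")
Local $iBadCount = @extended ; number of replacements, read before any other call resets it

; Corrected pattern: \xE9 matches only the code point U+00E9.
Local $sGood = StringRegExpReplace($sText, "\xE9", "e")
Local $iGoodCount = @extended

MsgBox(0, "Result", $sBad & " (" & $iBadCount & ")" & @CRLF & $sGood & " (" & $iGoodCount & ")")
```

For "éclair" the first call makes 7 replacements, which matches the @EXTENDED value in Larry's output; the second makes 1.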

Edited by Ascend4nt

Nevermind...Got it.

Thanks,

Larry

Edited by lgwapnitsky

Update: I'm trying to use the REGEX below now for other purposes, but it does not seem to work in AutoIT while in my tester it is accurate:

[ \-_\w\p{IsLatin-1Supplement}]+

Please advise.

Thanks,

Larry

That's because your tester (like most) is not AutoIt-specific and probably doesn't handle PCRE expressions properly.

If you look in the help file > StringRegExp() and check the meaning of [...] you should spot the trouble immediately. It means match any character within the [].

Also, { and } have special meanings, so if you need to use them literally you have to escape them with \.

George

Question about decompiling code? Read the decompiling FAQ and don't bother posting the question in the forums.

Be sure to read and follow the forum rules. -AKA the AutoIt Reading and Comprehension Skills test.***

The PCRE (Regular Expression) ToolKit for AutoIT - (Updated Oct 20, 2011 ver:3.0.1.13) - Please update your current version before filing any bug reports. The installer now includes both 32 and 64 bit versions. No change in version number.

Visit my Blog .. currently not active but it will soon be resplendent with news and views. Also please remove any links you may have to my website. it is soon to be closed and replaced with something else.

"Old age and treachery will always overcome youth and skill!"


lgwapnitsky,

Check out the String Regular Expression Tester (by Szhlopp). It will let you play around with AutoIt's implementation of PCRE.

GEOSoft is correct about [], but I'm curious why something like '\p{Latin}' isn't working - it reports an error with '\p'. Perhaps AutoIt wasn't compiled with the Unicode version of the regular expression engine? That wouldn't explain why \x{####} worked, though. Hmmm...

*edit: ahh, never mind. According to the following, it would need UTF-8 Unicode mode:

Unicode character properties

When PCRE is built with Unicode character property support, three additional escape sequences to match character properties are available when UTF-8 mode is selected. They are:

\p{xx} a character with the xx property

\P{xx} a character without the xx property

\X an extended Unicode sequence

from pcrepattern specification (linked to from the Help)

Edited by Ascend4nt
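
One workaround that does not depend on \p support is to spell the wanted characters out as a code-point range. Going by the PCRE excerpt quoted earlier in the thread, \p accepts general categories and script names, so a block-style name such as IsLatin-1Supplement would probably be rejected even with property support enabled. The sketch below assumes that U+00C0 through U+00FF is the set Larry is after; the pattern and the sample name are illustrative, not from the thread.

```autoit
; Explicit range in place of \p{IsLatin-1Supplement}.
; Caveat: U+00C0..U+00FF also contains two non-letters, × (U+00D7) and ÷ (U+00F7).
Local $sPattern = "^[ \-_\w\xC0-\xFF]+$"

; "Renée O-Brien", built with ChrW() so the script's file encoding does not matter.
Local $sName = "Ren" & ChrW(0xE9) & "e O-Brien"

If StringRegExp($sName, $sPattern) Then
    MsgBox(0, "Check", "Only allowed characters")
Else
    MsgBox(0, "Check", "Contains a character outside the allowed set")
EndIf
```

Anchoring with ^ and $ makes this an all-or-nothing check on the whole string rather than a search for the first run of allowed characters.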

That's been an ongoing problem with UTF-8. Although Jon says that it is properly compiled to support it, none of the UTF functions work.

I also have a beta version of an AutoIt-specific tester available. I was hoping to update it today but that isn't going to happen. Probably tomorrow when I finish testing the latest bug fix.

http://dundats.mvps.org/beta/pcretest.zip

Edited by GEOSoft

George

Thanks for the explanation, GEOSoft. Oh, and I just tried your PCRE tester - VERY nice job! So many cool little additions that make it so much more than just a PCRE tester. I'd love to see a portable version (since I take AutoIt with me on thumbdrives), but I'm assuming that wouldn't be easy with the reliance on other installed components. In any case, very professional work. I can see it boosting productivity. You should put a link to it in your sig.

Thank you for the compliments. Actually it would take very little to create a portable version of it, and in reality that script already exists. All anyone would have to do is use the released version of the source code. The only external dependency is that the extra files are stored in the AppData folder. The source that I release for others to check moves those to the script dir, so nothing more would be required as long as it's compiled. If it's not compiled then it needs several of the standard AutoIt UDFs. The icons that are in the current dll file will eventually be part of the compiled file itself, so there will be nothing left to worry about.

As far as a link in the signature, I'm sure that will happen, but I'm waiting until I get a final release ready. This is still in beta stage and I have a few ideas to implement yet as well as a few bugs to iron out. The bugs are all minor and I'm probably the only one who even notices, but they annoy me.

George

I'll have to test all this out later today. Thanks for the tips. I'll get back to you.

-Larry

UPDATE: I'm going to have to write some sloppy code to accomplish what I need. It's kind of disappointing that AutoIT doesn't support what appears to be a standard regex feature.

If you post some examples of what your strings are and the results you expect, there may still be ways to do it. I did manage to work out a method for Unicode which I posted someplace in Example Scripts. I think it's probably buried in Valuater's Wrappers thread, which is a sticky and easy to find.

George

What I'm trying to write is a user provisioning application for our new Win2k8 servers going into place in a few months. Part of what needs to be populated is the users' name, address, etc. as well as generating a username. I'm trying to validate the characters in the name fields to see that there are only alphanumeric, extended Latin, space and hyphen characters. Everything else needs to be tossed out. After that, the username would be generated based on the first/last name, and extended Latin characters need to be replaced with their English equivalents.

I'm 95% of the way done with the user interface, and once that's completed, the provisioning portion of the application can be written.

BTW - here's my testing code...

#include <array.au3>

; Code points (hex) to fold, each paired with its plain-ASCII replacement.
Global $UnicodeReplacements[24][2] = [["c0","A"],["c1","A"],["c2","A"],["c3","A"],["c4","A"],["c5","A"], _
                                     ["e0","a"],["e1","a"],["e2","a"],["e3","a"],["e4","a"],["e5","a"], _
                                     ["c7","C"],["e7","c"],["d1","N"],["f1","n"], _
                                     ["c8","E"],["c9","E"],["ca","E"],["cb","E"], _
                                     ["e8","e"],["e9","e"],["ea","e"],["eb","e"]]

; Starts with one empty element so that joining with "\x" below also prefixes the first code.
Global $UnicodeRegExArray[1]

$ib = InputBox("Test","enter test string")

for $ur = 0 to UBound($UnicodeReplacements) -1
    _ArrayAdd($UnicodeRegExArray, $UnicodeReplacements[$ur][0])
next
; Yields "\xc0\xc1...\xeb", ready to drop into a character class.
$UnicodeRegEx = _ArrayToString($UnicodeRegExArray, "\x")

; Validation: collect every run of allowed characters.
$array = StringRegExp($ib,"[ \-_\w" & $UnicodeRegEx & "]+", 3)
ConsoleWrite(_ArrayToString($array,""))

; Replacement: fold each accented character to its ASCII equivalent.
for $ur = 0 to UBound($UnicodeReplacements) -1
    $ib = StringRegExpReplace($ib, "[\x" & $UnicodeReplacements[$ur][0] & "]", $UnicodeReplacements[$ur][1])
next

; Username: keep only hyphens and word characters.
$fixedUname = _ArrayToString(StringRegExp($ib, "[\-\w]+", 3), "")
ConsoleWrite($fixedUname & @CRLF)
Edited by lgwapnitsky

This is getting easier by the moment.

So what you want is only upper/lower alpha, digits, space, hyphen and decimal characters 192 through 255? Basically just a valid name verification check.

EDIT: You may also want to consider an apostrophe in there since it's also a valid name character.

And a further question. Do you want this to simply verify or do you want it to actually replace unwanted characters?

Edited by GEOSoft

George

Verify when entering the name(s), then replace when the username needs to be created. That's what this sample code is doing. Ignore the ConsoleWrite functions I've put in. I know they don't output unicode, but it was part of my testing.
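
For anyone who does want readable console output while testing: the Θ in the first post's output is consistent with the byte 0xE9 being displayed in code page 437. A commonly used workaround is to hand ConsoleWrite() the UTF-8 bytes of the string. This is a minimal sketch, and it assumes the console receiving STDOUT is set to UTF-8 (for SciTE that would be the output.code.page=65001 property, which is an assumption about the reader's setup and not something from the thread).

```autoit
; "éclair", built with ChrW() so the script's file encoding does not matter.
Local $sText = ChrW(0xE9) & "clair"

; StringToBinary(..., 4) encodes the string as UTF-8 bytes.
; BinaryToString(..., 1) reads those bytes back one byte per character,
; so ConsoleWrite() passes the UTF-8 sequence through unchanged.
ConsoleWrite(BinaryToString(StringToBinary($sText, 4), 1) & @CRLF)
```

If the console is not in UTF-8 mode, the accented character will show up as two unrelated symbols instead, so MsgBox() remains the simpler check.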

This should be the correct verify SRE (includes apostrophe), although I'm not certain that you need it.

Use MsgBoxes instead of ConsoleWrite()

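; Allowed: ASCII letters and digits, hyphen, whitespace, apostrophe (\x27),
; \x9e, \x9f, and code points 192 through 255 (\xc0-\xff).
; This string contains an underscore, which is not allowed: "No Match".
$sStr = "Abdénago o'neil-123 _É"
If StringRegExp($sStr, "^([[:alnum:]\-\s\x27\x9e\x9f\xc0-\xff]+)$", 0) Then
    $sOut = "Match"
Else
    $sOut = "No Match"
EndIf
MsgBox(0, "Result", $sStr & @CRLF & $sOut)

; Same string without the underscore: "Match".
$sStr = "Abdénago o'neil-123 É"
If StringRegExp($sStr, "^([[:alnum:]\-\s\x27\x9e\x9f\xc0-\xff]+)$", 0) Then
    $sOut = "Match"
Else
    $sOut = "No Match"
EndIf
MsgBox(0, "Result", $sStr & @CRLF & $sOut)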

EDIT: Duh, call it a "Georgie is brain dead" moment, of course you need to verify.

Edited by GEOSoft

George

Your replacements are even easier.

Global $UnicodeReplacements[24][2] = [["c0","A"],["c1","A"],["c2","A"],["c3","A"],["c4","A"],["c5","A"], _
                                     ["e0","a"],["e1","a"],["e2","a"],["e3","a"],["e4","a"],["e5","a"], _
                                     ["c7","C"],["e7","c"],["d1","N"],["f1","n"], _
                                     ["c8","E"],["c9","E"],["ca","E"],["cb","E"], _
                                     ["e8","e"],["e9","e"],["ea","e"],["eb","e"]]

For $i = 0 To Ubound($UnicodeReplacements) -1
    $sUserName = StringRegExpReplace($sUserName, "\x" & $UnicodeReplacements[$i][0], $UnicodeReplacements[$i][1])
Next
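
The same 24 mappings can also be grouped by target letter, which cuts the loop from 24 passes to 8. This is only a sketch of that variation: the array name $g_aRanges and the function _FoldLatin1() are hypothetical, not part of the thread or of any AutoIt UDF.

```autoit
; Each row: a pattern covering every code point that folds to one letter.
Global $g_aRanges[8][2] = [["[\xC0-\xC5]", "A"], ["[\xE0-\xE5]", "a"], _
                           ["\xC7", "C"], ["\xE7", "c"], _
                           ["\xD1", "N"], ["\xF1", "n"], _
                           ["[\xC8-\xCB]", "E"], ["[\xE8-\xEB]", "e"]]

; Returns $sText with the accented characters above replaced by plain letters.
Func _FoldLatin1($sText)
    For $i = 0 To UBound($g_aRanges) - 1
        $sText = StringRegExpReplace($sText, $g_aRanges[$i][0], $g_aRanges[$i][1])
    Next
    Return $sText
EndFunc

; "Éclair à la crème", built with ChrW() so the file encoding does not matter.
MsgBox(0, "Folded", _FoldLatin1(ChrW(0xC9) & "clair " & ChrW(0xE0) & " la cr" & ChrW(0xE8) & "me"))
```

Like the original table, this leaves characters such as ì, ò, ù and ü untouched; they would need rows of their own.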

George

Thanks, Georgie! The replacements I already had down. I'll have to test the other code in a bit and report back.

Regards,

Larry

I like it. I'll modify it back with my _ArrayToString, but I think this is the ticket.

No problem but I'm beginning to think you might want to do another replacement after the loop and that is to delete invalid characters. That way you probably wouldn't need to validate. I can whip up that SRER quickly since I already had it earlier but didn't think to save it.
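
A clean-up pass of that kind might look like the sketch below. It is not the SRER George had in mind, which was never posted; it assumes a username should keep only what Larry's "[\-\w]+" extraction kept (letters, digits, underscore and hyphen), and the sample input is made up.

```autoit
; Hypothetical input, with the accent replacements already applied.
Local $sUserName = "O'Neil-Garcia Jr."

; Delete every character that is neither a hyphen nor a word character.
$sUserName = StringRegExpReplace($sUserName, "[^\-\w]", "")

MsgBox(0, "Username", $sUserName)
```

Here the apostrophe, the space and the period are removed. Deleting silently is convenient for generating usernames, but for the name fields themselves a validation check still tells the operator that something was mistyped.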

George
