MFerris

StringRegExp and (non)Ascii characters

16 posts in this topic

Hi there -

I'm trying to detect non-"Western" characters in a string - specifically, anything outside ASCII codes 0-127. Based on the documentation for StringRegExp, I should be able to use the class [:ascii:] - however, this is not working.

Consider the following script:

#include <Array.au3>

$string = "This is a regular string."
$string2 = "This string has code 128 in it and should trigger non-ascii: Ç"

$testString1 = StringRegExp($string,"[^:ascii:]+")
$testString2 = StringRegExp($string2,"[^:ascii:]+")

if $testString1 Then MsgBox(0,"","String 1 has a non-ascii character.")
if $teststring2 Then MsgBox(0,"","String 2 has a non-ascii character.")
    
$testString1Array = StringRegExp($string,"([^:ascii:]+)",3)
$testString2Array = StringRegExp($string2,"([^:ascii:]+)",3)

_ArrayDisplay($testString1Array,"all-ascii string")
_ArrayDisplay($testString2Array,"string w non-ascii character")

Based on that, I should not see the first MsgBox, and I should see the second. (I'm searching both strings for one or more characters that are NOT ASCII codes 0-127.) However, I get the MsgBox for the first string.

In the second section, I'm writing all parts of the string to an array - this definitely tells me that this is not working. It is actually looking for characters 'a', 's', 'c', and 'i'. The array returned is anything in between those 4 characters.

I'm somewhat new to using RegExp, but I think I have this right. Does [:ascii:] not work as a class, or am I implementing this wrong?

Thanks for any help!


Also, I'm using beta ver 3.2.1.11.


I'm not sure those groups are allowed... does it say they are in the help file?



Don't know about [:ascii:] (you will have to take a look at the PCRE documentation, I think). There is also the lack of Unicode support in AutoIt (so that it stays usable on Win9x).

But maybe you can rewrite your pattern to something like this?

#include <Array.au3>

Func testNonASCII()

   $string = "This is a regular string."
   $string2 = "This string has code 128 in it and should trigger non-ascii: Ç"
   ; Negated whitelist of allowed characters; the original pattern is kept as a trailing comment
   $rp1 = "[^a-zA-Z0-9_\.\:\- ]+" ; was "[^:ascii:]+"
   $rp2 = "[^a-zA-Z0-9_\.\:\- ]+" ; was "[^:ascii:]+"
   $testString1 = StringRegExp($string, $rp1)
   $testString2 = StringRegExp($string2, $rp2)

   If $testString1 Then MsgBox(0,"","String 1 has a non-ascii character.")
   If $testString2 Then MsgBox(0,"","String 2 has a non-ascii character.")

   ; Flag 3 returns an array of all matches
   $testString1Array = StringRegExp($string,"(" & $rp1 & ")",3)
   $testString2Array = StringRegExp($string2,"(" & $rp2 & ")",3)

   _ArrayDisplay($testString1Array,"all-ascii string")
   _ArrayDisplay($testString2Array,"string w non-ascii character")

EndFunc

testNonASCII() ; run the test


I'm not sure those groups are allowed... does it say they are in the help file?

Yes, the help file for StringRegExp lists a number of new classes that weren't in the older version of the help file. Other classes are 'alnum', 'alpha', 'digit', 'upper', 'word', etc. Since it doesn't seem to work, I guess the help file needs to be revised.


Don't know about [:ascii:] (you will have to take a look at the PCRE documentation, I think). There is also the lack of Unicode support in AutoIt (so that it stays usable on Win9x).

But maybe you can rewrite your pattern to something like this?

Yes, I think I'll have to go that way if :ascii: doesn't really work. I'm actually parsing html, so I'll have to also take into account everything that goes with that. Just seems like :ascii: would be the optimal choice (assuming it worked).

I assume you did a test with pcretest.exe. If not, you must do it.

In fact I did. Unless Jon did a bad job, this module reacts the same as the AutoIt code. So I assume that something must be wrong with the pattern you are using. I cannot imagine pcretest.exe being wrong...

I am not an expert in regular expressions; pcretest.exe IS ;)

No, I have not even heard of pcretest. I just googled it and most results seem to relate to a Perl-based regexp testing program; however, the syntax for that program seems different from how it is used in AutoIt. Or maybe I just need more coffee and time to digest the instructions.

... after a little more googling, I did find this page which shows the character classes in Perl, which include :ascii:. The only difference between the AutoIt and Perl syntax is how negation is written: "^:ascii:" (AutoIt) versus ":^ascii:" (Perl). I've tried both in AutoIt; neither seems to work.

However, after looking through the help file, I realized that what I need can be accomplished much more easily using StringIsASCII(), which also checks for ASCII codes 0-127. This works perfectly, so I'll use that.

I don't know that my regexp statement could be any simpler ([^:ascii:]), so I'm still not sure it's working properly. If :ascii: should work, I'd be interested in seeing how it should be properly implemented. Now that I have an alternate means of accomplishing this task, this isn't a problem for me, but it may be for others in the future.
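
For anyone finding this later, here is a minimal sketch comparing the two approaches mentioned above. It assumes a recent AutoIt release whose StringRegExp follows PCRE's rule that a named POSIX class must sit inside an enclosing bracket expression, i.e. [^[:ascii:]] rather than [^:ascii:]. The helper name _HasNonAscii and the sample strings are made up for illustration, and results for the Ç sample may depend on the script's file encoding and the AutoIt version.

```autoit
; Sketch only. _HasNonAscii is a hypothetical helper name, not an AutoIt built-in.
Func _HasNonAscii($sText)
    ; Nested brackets: the outer [^ ] negates, the inner [:ascii:] names the class.
    ; With the default flag, StringRegExp returns 1 on a match and 0 otherwise.
    Return StringRegExp($sText, "[^[:ascii:]]") = 1
EndFunc

; Illustrative sample strings (assumptions, loosely based on the first post)
Local $aSamples[2] = ["This is a regular string.", "This string has a non-ASCII character: Ç"]

For $sSample In $aSamples
    ; StringIsASCII returns 1 only when every character is in the range 0x00-0x7F
    ConsoleWrite("StringIsASCII: " & StringIsASCII($sSample) & _
            "  regex finds non-ASCII: " & _HasNonAscii($sSample) & @CRLF)
Next
```

If the two checks disagree for a given string, that would be worth reporting together with the exact AutoIt version.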


No, I have not even heard of pcretest.

pcretest can be found here.

I hope a PCRE expert can help you because, judging by the results from pcretest and AutoIt, it looks like your expression is not quite right.

Jon implemented this PCRE port in AutoIt; if both return the same result, he will say NOBUG.


In the PCRE documentation I found the line about :ascii:

I saw the same. That's the reason for all my PMs ;)

I saw the same as well. I'm not sure what the point of that statement is, though.

As I can use the StringIsASCII() function, I no longer have a problem. I just wanted to point out that the :ascii: class for RegExp in AutoIt may not be working properly, based on my test. I don't know how to use pcretest, so I can't confirm whether it works in that environment. Someone familiar with pcretest and/or with a better understanding of RegExp syntax may want to take a closer look. My concern was for future users who may need to use the :ascii: class and can't rely on StringIsASCII().

Thanks for all your help.

I know your concern; I was answering Jon, not you ;)


In fact I don't know why the PCRE implementation behaves this way.

^ negates the class, but only if it is the first character inside the brackets.

But see the detailed documentation.
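
To make the point about ^ and the brackets concrete, here is a rough sketch. It assumes StringRegExp follows PCRE here: with single brackets, [^:ascii:] is read as a negated set of the literal characters ':', 'a', 's', 'c' and 'i', which matches what the first post observed. Only the nested form [^[:ascii:]] refers to the named class. The variable names are made up for illustration.

```autoit
#include <Array.au3>

Local $sText = "This is a regular string."

; Single brackets: a negated set of the literal characters : a s c i
Local $aSingle = StringRegExp($sText, "([^:ascii:]+)", 3)

; The same set written out by hand, for comparison with $aSingle
Local $aLiteral = StringRegExp($sText, "([^:asci]+)", 3)

; Nested brackets: negated POSIX class. With flag 3, @error is set when nothing
; matches, so it is checked immediately after the call.
Local $aNested = StringRegExp($sText, "([^[:ascii:]]+)", 3)
If @error Then
    ConsoleWrite("Nested form: no non-ASCII run found" & @CRLF)
Else
    _ArrayDisplay($aNested, "nested brackets")
EndIf

_ArrayDisplay($aSingle, "single brackets")
_ArrayDisplay($aLiteral, "literal set")
```

If the first two arrays come out identical, that supports reading the single-bracket pattern as a plain character set rather than as a broken :ascii: class.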


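To replace non-alphanumeric characters, put the named class inside the brackets (notice the double square brackets). Note that a + inside the brackets is a literal plus sign, not a quantifier:

1. $temp = StringRegExpReplace($filename, '[^[:alnum:]+]', "") ; removes everything except alphanumerics and '+'

2. $temp = StringRegExpReplace($filename, '[[:alnum:]+]', "") ; removes alphanumerics and '+'

3. $temp = StringRegExpReplace($filename, '[[:alnum:]]', "") ; removes alphanumerics only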

(decided to post it here so that others can search it)
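
As a small follow-up sketch: if the goal is to strip whole runs of non-alphanumeric characters, the quantifier can go outside the brackets. This assumes the same PCRE-style nested class syntax as above. The sample value assigned to $filename is hypothetical.

```autoit
; Hypothetical sample value; $filename is the variable name used in the post above
Local $filename = "my report (final) + v2.txt"

; Quantifier outside the class: each run of non-alphanumeric characters is removed
Local $temp = StringRegExpReplace($filename, "[^[:alnum:]]+", "")

ConsoleWrite($temp & @CRLF)
```

Putting the + outside also removes literal plus signs, which pattern 1 above leaves in place.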
