StringRegExp and (non)Ascii characters


MFerris
Hi there -

I'm trying to detect non-"Western" characters in a string, specifically anything outside ASCII codes 0-127. Based on the documentation for StringRegExp, I should be able to use the class [:ascii:]; however, this is not working.

Consider the following script:

#include <Array.au3>

$string = "This is a regular string."
$string2 = "This string has code 128 in it and should trigger non-ascii: Ç"

$testString1 = StringRegExp($string,"[^:ascii:]+")
$testString2 = StringRegExp($string2,"[^:ascii:]+")

if $testString1 Then MsgBox(0,"","String 1 has a non-ascii character.")
if $teststring2 Then MsgBox(0,"","String 2 has a non-ascii character.")
    
$testString1Array = StringRegExp($string,"([^:ascii:]+)",3)
$testString2Array = StringRegExp($string2,"([^:ascii:]+)",3)

_ArrayDisplay($testString1Array,"all-ascii string")
_ArrayDisplay($testString2Array,"string w non-ascii character")

Based on that, I should not see the first msgbox, and I should see the second. (I'm searching both strings for 1 or more characters that are NOT ascii codes 0-127.) However, I get the msgbox for the first string.

In the second section, I'm writing all parts of the string to an array - this definitely tells me that this is not working. It is actually looking for characters 'a', 's', 'c', and 'i'. The array returned is anything in between those 4 characters.

I'm somewhat new to using RegExp, but I think I have this right. Does [:ascii:] not work as a class, or am I implementing this wrong?

Thanks for any help!
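
For readers following along, the behaviour described in this post is consistent with the regex engine reading "[^:ascii:]" as an ordinary negated set of the literal characters ':', 'a', 's', 'c' and 'i', rather than as a POSIX class. The sketch below is not from the thread; the variable names are illustrative, and it assumes a reasonably recent AutoIt build.

```autoit
Local $sPlain = "This is a regular string."

; "[^:ascii:]" is treated as a plain negated set: any character
; except ':', 'a', 's', 'c' or 'i'. The repeated ':' and 'i' are redundant,
; so the set is the same as "[^:asci]".
Local $iOriginal = StringRegExp($sPlain, "[^:ascii:]+")
Local $iEquivalent = StringRegExp($sPlain, "[^:asci]+")

; Both calls report a match, because "T", "h", the spaces and so on
; all fall outside that small set of literal characters.
ConsoleWrite($iOriginal & " " & $iEquivalent & @CRLF)
```

This is why the first message box appears even though the first string is pure ASCII.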


  • Moderators

I'm not sure those groups are allowed... does it say they are in the help file?

Common sense plays a role in the basics of understanding AutoIt... If you're lacking in that, do us all a favor, and step away from the computer.


I don't know about [:ascii:] (you will have to take a look at the PCRE documentation, I think). There is also AutoIt's lack of Unicode support (so that it remains usable on Win9x).

But maybe you can rewrite your pattern to something like this?

#include <Array.au3> ; needed for _ArrayDisplay

testNonASCII()

Func testNonASCII()

   $string = "This is a regular string."
   $string2 = "This string has code 128 in it and should trigger non-ascii: Ç"
   ; Whitelist of allowed characters, used in place of "[^:ascii:]+".
   ; Note: any character not listed here (commas, quotes, etc.) is reported too.
   $rp1 = "[^a-zA-Z0-9_\.\:\- ]+"
   $rp2 = "[^a-zA-Z0-9_\.\:\- ]+"
   $testString1 = StringRegExp($string, $rp1)
   $testString2 = StringRegExp($string2, $rp2)

   If $testString1 Then MsgBox(0, "", "String 1 has a non-ascii character.")
   If $testString2 Then MsgBox(0, "", "String 2 has a non-ascii character.")

   ; Flag 3 returns an array of all matches
   $testString1Array = StringRegExp($string, "(" & $rp1 & ")", 3)
   $testString2Array = StringRegExp($string2, "(" & $rp2 & ")", 3)

   _ArrayDisplay($testString1Array, "all-ascii string")
   _ArrayDisplay($testString2Array, "string w non-ascii character")

EndFunc

Yes, the help file for StringRegExp lists a number of new classes, which weren't in the older version of the help file. Other classes are 'alnum', 'alpha', 'digit', 'upper', 'word', etc. Since it doesn't work, I guess the help file needs to be revised.

Yes, I think I'll have to go that way if :ascii: doesn't really work. I'm actually parsing html, so I'll have to also take into account everything that goes with that. Just seems like :ascii: would be the optimal choice (assuming it worked).

I assume you ran a test with pcretest.exe. If not, you should.

In fact I did. Unless Jon did a bad job, pcretest.exe reacts the same way as the AutoIt code. So I assume that something must be wrong with the pattern you are using. I cannot imagine pcretest.exe being wrong...

I am not an expert in regular expressions; pcretest.exe IS ;)

No, I have not even heard of pcretest. I just did a google and most results seem to relate to a perl-based regexp testing program, however the syntax for that program seems different than that of how it is used in AutoIt. Or maybe I just need more coffee and time to digest the instructions.

... after a little more googling, I did find this page which shows the character classes in Perl, which include :ascii:. The only difference between the AutoIt and Perl syntax is the negation: "^:ascii:" (AutoIt) versus ":^ascii:" (Perl). I've tried both in AutoIt, and neither seems to work.

However, after looking through the help file, I realized that what I need can be accomplished much more easily using StringIsASCII(), which also checks for ascii codes 0-127. This works perfectly, so I'll use that.

I don't know that my regexp statement could be any more simple ([^:ascii:]), so I'm still not sure it's working properly. If :ascii: should work, I'd be interested in seeing how it should be properly implemented. Now that I have alternate means of accomplishing this task, this isn't a problem for me, but it may be for others in the future.
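
For anyone landing here later, a minimal sketch of how the class is usually written in PCRE-style syntax: the POSIX name goes inside its own brackets, which in turn sit inside the character class, so "not ASCII" becomes "[^[:ascii:]]". This is offered as an illustration, not as the official fix from the thread. The helper name _HasNonAscii is hypothetical, and ChrW() assumes a Unicode-capable AutoIt version rather than the one in use when this was posted.

```autoit
; Hypothetical helper: True if $sText contains any character outside ASCII 0-127.
Func _HasNonAscii($sText)
    ; Outer [ ] is the character class, inner [:ascii:] is the POSIX class,
    ; and the leading ^ negates the whole class.
    Return StringRegExp($sText, "[^[:ascii:]]") = 1
EndFunc

Local $sPlain = "This is a regular string."
; ChrW(199) is the character Ç, built by code point to avoid file-encoding issues
Local $sAccent = "This string has a non-ASCII character: " & ChrW(199)

ConsoleWrite(_HasNonAscii($sPlain) & @CRLF)
ConsoleWrite(_HasNonAscii($sAccent) & @CRLF)

; An explicit range is an alternative that does not rely on POSIX classes
ConsoleWrite(StringRegExp($sAccent, "[^\x00-\x7F]") & @CRLF)

; The built-in check mentioned above: 1 only if every character is in 0-127
ConsoleWrite(StringIsASCII($sPlain) & " " & StringIsASCII($sAccent) & @CRLF)
```

The regex form is mainly useful when the non-ASCII characters need to be located or replaced; for a plain yes/no test, StringIsASCII() is simpler.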

I hope a PCRE expert can help you, because judging by the results from pcretest and AutoIt, it looks like your expression is not quite right.

Jon implemented this PCRE port in AutoIt; if both return the same result, he will say NOBUG.


I saw the same. That's the reason for all my PMs ;)

I saw the same as well. I'm not sure what the point of that statement is, though.

As I can use the StringIsASCII() function, I no longer have a problem, I just wanted to point out that the :ascii: class for RegExp in Autoit may not be working properly, based on my test. I don't know how to use pcretest so I can't confirm if it works in that environment or not. Someone familiar with pcretest and/or a better understanding of RegExp syntax may want to take a closer look. My concern was for future users who may need to use the :ascii: class and can't rely on StringIsASCII().

Thanks for all your help.

I know your concern; I was answering Jon, not you ;)

  • 6 months later...
  • 7 months later...

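To replace alphanumeric or non-alphanumeric characters (notice the double square brackets):

$temp = StringRegExpReplace($filename, '[^[:alnum:]+]', "") ; removes everything except alphanumerics and a literal "+"

$temp = StringRegExpReplace($filename, '[[:alnum:]+]', "") ; removes alphanumerics and any literal "+"

$temp = StringRegExpReplace($filename, '[[:alnum:]]', "") ; removes alphanumerics only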

(decided to post it here so that others can search it)
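
One detail worth illustrating from the patterns above: a "+" placed inside the brackets is a literal plus sign, while a "+" placed after the closing bracket is a quantifier. A small sketch, with a made-up filename chosen for illustration:

```autoit
; Hypothetical input containing a plus sign, spaces, parentheses and a dot
Local $sFilename = "my+file (v2).txt"

; "+" inside the class is a literal, so plus signs are kept along with alphanumerics
Local $sKeepPlus = StringRegExpReplace($sFilename, "[^[:alnum:]+]", "")

; "+" after the class is a quantifier, so only alphanumerics survive
Local $sAlnumOnly = StringRegExpReplace($sFilename, "[^[:alnum:]]+", "")

ConsoleWrite($sKeepPlus & @CRLF)
ConsoleWrite($sAlnumOnly & @CRLF)
```

Which form is wanted depends on whether plus signs should be preserved in the result.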
