StringRegExp with European letters

Uten · October 31, 2006

I need to pick out words containing characters from the Norwegian alphabet (most European languages have some special characters).

So I was wondering, do any of you run StringRegExp searches with those characters?

I know AutoIt does not use Unicode strings, but StringInStr does find those letters (æ, ø, å), so I thought StringRegExp would (or should) too.

The reason I'm doing this is that I'm creating something like a T9 (the keypad on cell phones) touch-screen keyboard.

Obviously I might have goofed in my regexp pattern, so if you have suggestions please let me know.

Some test code:

testEuropeanLetters()
Exit 
Func testEuropeanLetters()
   ;PURPOSE: See if StringRegExp handles European special chars
   ; Norwegian [æÆøØåÅ]
   Local $data, $regexp, $expect, $msg, $res
   Local $flag=3
   ConsoleWrite('Asc("æ"):=' & Asc('æ') & @LF)
   ConsoleWrite('Asc("ø"):=' & Asc('ø') & @LF)
   ConsoleWrite('Asc("å"):=' & Asc('å') & @LF)
   If StringInStr("æ ø å", "æ") Then ConsoleWrite("StringInStr works with æ" & @LF)
   If StringInStr("æ ø å", "ø") Then ConsoleWrite("StringInStr works with ø" & @LF)
   If StringInStr("æ ø å", "å") Then ConsoleWrite("StringInStr works with å" & @LF)
   #region - data
   $data = 'ægiden' & @CRLF & 'ær' & @CRLF & 'æra' & @CRLF & 'æraen' & @CRLF & _
      'ærvær' & @CRLF & 'æser' & @CRLF & 'æsj' & @CRLF & 'æte' & @CRLF & 'ætling' & @CRLF & _
      'ætlingen' & @CRLF & 'ætt' & @CRLF & 'ætta' & @CRLF & 'ættbåren' & @CRLF & 'æva' & @CRLF & _
      'æve' & @CRLF & 'ævelig' & @CRLF & 'æven' & @CRLF & 'æver' & @CRLF & 'øde' & @CRLF & _
      'ødegård' & @CRLF & 'ødela' & @CRLF & 'ødelagt' & @CRLF & 'ødelagte' & @CRLF & 'ødeland' & @CRLF & _
      'øglene' & @CRLF & 'øgler' & @CRLF & 'øk' & @CRLF & 'øke' & @CRLF & 'økede' & @CRLF & _
      'økenavn' & @CRLF & 'økenavnet' & @CRLF & 'øl' & @CRLF & 'ølebrød' & @CRLF & 'ølebrødet' & @CRLF & _
      'ølen' & @CRLF & 'øm' & @CRLF & 'ømfintlig' & @CRLF & 'ømt' & @CRLF & 'ømtålelige' & @CRLF & _
      'ør' & @CRLF & 'øra' & @CRLF & 'øre' & @CRLF & 'ørebetennelse' & @CRLF & 'ørebro' & @CRLF & _
      'østsiden' & @CRLF & 'øv' & @CRLF & 'øvd' & @CRLF & 'øye' & @CRLF & 'å' & @CRLF & _
      'åa' & @CRLF & 'åbit' & @CRLF & 'åbor' & @CRLF & 'åk' & @CRLF & 'åker' & @CRLF & 'åkrer' & @CRLF & _
      'ål' & @CRLF & 'åla' & @CRLF & 'ålborg'
   #endregion - data
   $regexp = "\b[æøå]\b"     ;Returns æ,å,å,ø,ø,å -> Expected å
   $regexp = "\b[æøå]{1}\b" ;Returns æ,å,å,ø,ø,å -> Expected å
   $regexp = "\b[\O230\O248\O229]\b" ;Nothing -> Expected å
   $regexp = "\b\O229\b" ;Nothing -> Expected å
   $res = StringRegExp($data, $regexp, 3)
   Local $i 
   if IsArray($res) Then 
      For $i = 0 to UBound($res) - 1
         ConsoleWrite("+>$res[" & $i & "]:=" & $res[$i] & @LF)
      Next 
   Else 
      ConsoleWrite("! Nothing returned to array" & @LF)
   EndIf 
EndFunc

EDIT: Typo in the code.

Edited October 31, 2006 by Uten
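
A note on the two escape-based patterns in the test code: as far as I know, `\O230` is not an escape PCRE recognizes, and Asc() returns a decimal value, so 230 would have to be written in hex (`\xE6`, octal 346) before it can go into a pattern. The sketch below builds the character class from Asc() and Hex(). It assumes a Windows-1252 style code page where æ, ø and å are single-byte codes (230, 248, 229), and `_CharToHexEscape` is a helper name made up for this example, not an AutoIt function.

```autoit
; Hypothetical helper (not part of AutoIt): turn one character into a PCRE \xhh escape.
; Assumes the character has a single-byte code below 256, as æ, ø, å do in Windows-1252.
Func _CharToHexEscape($sChar)
   Return "\x" & Hex(Asc($sChar), 2)
EndFunc

; Build a class such as [\xE6\xF8\xE5] instead of typing the letters literally.
Local $sClass = "[" & _CharToHexEscape("æ") & _CharToHexEscape("ø") & _CharToHexEscape("å") & "]"
ConsoleWrite("Character class: " & $sClass & @LF)

; Look for any æ, ø or å in a short test string, using the escaped class.
Local $aRes = StringRegExp("ærvær ødegård", $sClass, 3)
If IsArray($aRes) Then
   For $i = 0 To UBound($aRes) - 1
      ConsoleWrite("+>$aRes[" & $i & "]:=" & $aRes[$i] & @LF)
   Next
Else
   ConsoleWrite("! Nothing returned to array" & @LF)
EndIf
```

This only addresses how the letters are written in the pattern; it does not change how `\b` treats them.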

MHz · October 31, 2006

Obviously I might have goofed in my regexp pattern, so if you have suggestions please let me know.

I don't think you goofed; rather, Unicode support seems to be nonexistent in AutoIt's PCRE.

Some known references that seem to fail:

\X
Matches a single Unicode grapheme, whether encoded as a single code point or multiple code points using combining marks. A grapheme most closely resembles the everyday concept of a "character".
\X matches à encoded as U+0061 U+0300, à encoded as U+00E0, ©, etc.

\uFFFF where FFFF are 4 hexadecimal digits
Matches a specific Unicode code point. Can be used inside character classes.
\u00E0 matches à encoded as U+00E0 only. \u00A9 matches ©

\p{L} or \p{Letter}
Matches a single Unicode code point that has the property "letter". See Unicode Character Properties in the tutorial for a complete list of properties. Each Unicode code point has exactly one property. Can be used inside character classes.
\p{L} matches à encoded as U+00E0; \p{S} matches ©

\P{L} or \P{Letter}
Matches a single Unicode code point that does not have the property "letter". Can be used inside character classes.
\P{L} matches ©

Unfortunately, Unicode brings its own requirements and pitfalls when it comes to regular expressions. Of the regex flavors discussed in this tutorial, Java and the .NET framework use Unicode-based regex engines. Perl supports Unicode starting with version 5.6.

PCRE?

Uten · October 31, 2006

I think the Unicode part of PCRE was excluded at compile time to shave off some kilobytes, justified by the lack of Unicode support in AutoIt (due to compatibility with Win95).

I'll take another look at the switches you have provided, @MHz, to see if I can get any of them to work.

Thanks
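
For anyone trying those switches on a newer AutoIt: to my knowledge, later releases ship a Unicode-capable PCRE, and the StringRegExp documentation lists a `(*UCP)` pattern prefix that makes `\b` and `\w` classify characters by Unicode properties instead of ASCII only. The sketch below assumes such a build and a script file saved in an encoding AutoIt reads correctly. It is not expected to work on the 2006 build discussed in this thread.

```autoit
; Assumption: this AutoIt build accepts the (*UCP) prefix in StringRegExp patterns.
; With (*UCP), æ, ø and å should count as word characters for \b and \w.
Local $sData = "æra" & @CRLF & "å" & @CRLF & "ødegård"

; Intended to match a special letter only when it stands alone as a word.
Local $aRes = StringRegExp($sData, "(*UCP)\b[æøå]\b", 3)
If IsArray($aRes) Then
   For $i = 0 To UBound($aRes) - 1
      ConsoleWrite("+>$aRes[" & $i & "]:=" & $aRes[$i] & @LF)
   Next
Else
   ConsoleWrite("! Nothing returned to array (no match, or (*UCP) not supported)" & @LF)
EndIf
```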

Uten · October 31, 2006

I rewrote my test to work with thomasl's Perl regexp UDFs, and the regexp patterns I had used did not work well there either.

It looks like using \b is a no-no, or it does not work the way I expect it to.

But this returns the expected letter å as the only match. I have to play a bit more with this to see if I can get words containing the letters.

$regexp = "[^\w]([æøå]{1})[^\w]"

Still open to suggestions, though.
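
The `\b` results reported earlier are consistent with a `\w` that only knows ASCII letters. In that case æ, ø and å count as non-word characters, so `\b` fires between an ASCII letter and one of them (the æ in ærvær, the å in ødegård) but not around a lone å sitting between line breaks. One way around this, sketched below, is to list the letters explicitly and use lookarounds in place of `\b`, so the result does not depend on what the engine considers a word character. `$sLetters` and `_ShowMatches` are names made up for this example, and the alphabet is an assumption (ASCII letters plus the Norwegian ones).

```autoit
; Sketch only: letters are listed explicitly, so nothing here relies on \w or \b.
Local $sLetters = "A-Za-zæøåÆØÅ"   ; assumed alphabet for this example
Local $sData = "æra" & @CRLF & "å" & @CRLF & "ødegård" & @CRLF & "test" & @CRLF & "ål"

; A special letter standing alone: no listed letter before it and none after it.
Local $sAlone = "(?<![" & $sLetters & "])[æøåÆØÅ](?![" & $sLetters & "])"

; A whole word that contains at least one special letter.
Local $sWord = "(?<![" & $sLetters & "])[" & $sLetters & "]*[æøåÆØÅ][" & $sLetters & "]*"

_ShowMatches($sData, $sAlone)
_ShowMatches($sData, $sWord)

; Hypothetical helper: print every global match of $sPattern in $sText.
Func _ShowMatches($sText, $sPattern)
   Local $aRes = StringRegExp($sText, $sPattern, 3)
   If IsArray($aRes) Then
      For $i = 0 To UBound($aRes) - 1
         ConsoleWrite("+>" & $aRes[$i] & @LF)
      Next
   Else
      ConsoleWrite("! No match for " & $sPattern & @LF)
   EndIf
EndFunc
```

Unlike `[^\w]([æøå]{1})[^\w]`, the lookarounds do not consume the neighbouring characters, so a letter at the very start or end of the data can still match.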
