StringRegExp with European letters


Uten
I need to pick out words containing characters from the Norwegian alphabet (most European languages have some special characters).

So I was wondering, do any of you run StringRegExp searches with some of those characters?

I know AutoIt does not use Unicode strings, but StringInStr does find those letters (æ, ø, å), so I thought StringRegExp would (or should) too.

The reason I'm doing this is that I'm creating something like a T9 (the keyboard on cell phones) touch screen keyboard.

Obviously I might have goofed in my regexp pattern, so if you have suggestions please let me know.

Some test code:

testEuropeanLetters()
Exit 
Func testEuropeanLetters()
   ;PURPOSE: See if StringRegExp handles European special chars
   ; Norwegian [æÆøØåÅ]
   Local $data, $regexp, $expect, $msg, $res
   Local $flag=3
   ConsoleWrite('Asc("æ"):=' & Asc('æ') & @LF)
   ConsoleWrite('Asc("ø"):=' & Asc('ø') & @LF)
   ConsoleWrite('Asc("å"):=' & Asc('å') & @LF)
   If StringInStr("æ ø å", "æ") Then ConsoleWrite("StringInStr works with æ" & @LF)
   If StringInStr("æ ø å", "ø") Then ConsoleWrite("StringInStr works with ø" & @LF)
   If StringInStr("æ ø å", "å") Then ConsoleWrite("StringInStr works with å" & @LF)
   #region - data
   $data = 'ægiden' & @CRLF & 'ær' & @CRLF & 'æra' & @CRLF & 'æraen' & @CRLF & _
      'ærvær' & @CRLF & 'æser' & @CRLF & 'æsj' & @CRLF & 'æte' & @CRLF & 'ætling' & @CRLF & _
      'ætlingen' & @CRLF & 'ætt' & @CRLF & 'ætta' & @CRLF & 'ættbåren' & @CRLF & 'æva' & @CRLF & _
      'æve' & @CRLF & 'ævelig' & @CRLF & 'æven' & @CRLF & 'æver' & @CRLF & 'øde' & @CRLF & _
      'ødegård' & @CRLF & 'ødela' & @CRLF & 'ødelagt' & @CRLF & 'ødelagte' & @CRLF & 'ødeland' & @CRLF & _
      'øglene' & @CRLF & 'øgler' & @CRLF & 'øk' & @CRLF & 'øke' & @CRLF & 'økede' & @CRLF & _
      'økenavn' & @CRLF & 'økenavnet' & @CRLF & 'øl' & @CRLF & 'ølebrød' & @CRLF & 'ølebrødet' & @CRLF & _
      'ølen' & @CRLF & 'øm' & @CRLF & 'ømfintlig' & @CRLF & 'ømt' & @CRLF & 'ømtålelige' & @CRLF & _
      'ør' & @CRLF & 'øra' & @CRLF & 'øre' & @CRLF & 'ørebetennelse' & @CRLF & 'ørebro' & @CRLF & _
      'østsiden' & @CRLF & 'øv' & @CRLF & 'øvd' & @CRLF & 'øye' & @CRLF & 'å' & @CRLF & _
      'åa' & @CRLF & 'åbit' & @CRLF & 'åbor' & @CRLF & 'åk' & @CRLF & 'åker' & @CRLF & 'åkrer' & @CRLF & _
      'ål' & @CRLF & 'åla' & @CRLF & 'ålborg'
   #endregion - data
   $regexp = "\b[æøå]\b"     ;Returns æ,å,å,ø,ø,å -> Expected å
   $regexp = "\b[æøå]{1}\b" ;Returns æ,å,å,ø,ø,å -> Expected å
   $regexp = "\b[\O230\O248\O229]\b" ;Nothing -> Expected å
   $regexp = "\b\O229\b" ;Nothing -> Expected å
   $res = StringRegExp($data, $regexp, $flag) ; $flag = 3 returns an array of all matches
   Local $i 
   if IsArray($res) Then 
      For $i = 0 to UBound($res) - 1
         ConsoleWrite("+>$res[" & $i & "]:=" & $res[$i] & @LF)
      Next 
   Else 
      ConsoleWrite("! Nothing returned to array" & @LF)
   EndIf 
EndFunc

EDIT: Typo in the code.

Edited by Uten

Obviously I might have goofed in my regexp pattern, so if you have suggestions please let me know.

I don't think you goofed; rather, Unicode support seems nonexistent in AutoIt's PCRE.

Some known references that seem to fail:

\X

Matches a single Unicode grapheme, whether encoded as a single code point or multiple code points using combining marks. A grapheme most closely resembles the everyday concept of a "character".

\X matches à encoded as U+0061 U+0300, à encoded as U+00E0, ©, etc.

\uFFFF where FFFF are 4 hexadecimal digits

Matches a specific Unicode code point. Can be used inside character classes.

\u00E0 matches à encoded as U+00E0 only. \u00A9 matches ©

\p{L} or \p{Letter}

Matches a single Unicode code point that has the property "letter". See Unicode Character Properties in the tutorial for a complete list of properties. Each Unicode code point has exactly one property. Can be used inside character classes.

\p{L} matches à encoded as U+00E0; \p{S} matches ©

\P{L} or \P{Letter}

Matches a single Unicode code point that does not have the property "letter". Can be used inside character classes.

\P{L} matches ©

Unfortunately, Unicode brings its own requirements and pitfalls when it comes to regular expressions. Of the regex flavors discussed in this tutorial, Java and the .NET framework use Unicode-based regex engines. Perl supports Unicode starting with version 5.6.

PCRE?
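As far as I know, \uFFFF is not PCRE syntax at all, while the two-digit hex escape \xhh is, and it addresses a single character code. The values 230, 248 and 229 that Asc() reports are E6, F8 and E5 in hex. A minimal sketch that builds such a character class from the Asc() values; the function name _ClassFromChars is made up for illustration, and it assumes Asc() returns the single-byte code page value as in the first post:

```autoit
; Sketch only: _ClassFromChars is a hypothetical helper, not part of AutoIt.
Func _ClassFromChars($sChars)
    Local $sClass = "["
    For $i = 1 To StringLen($sChars)
        ; Hex() with a length of 2 gives the two-digit form that \xhh expects.
        $sClass &= "\x" & Hex(Asc(StringMid($sChars, $i, 1)), 2)
    Next
    Return $sClass & "]"
EndFunc

; Print the generated class so it can be pasted into a pattern.
ConsoleWrite(_ClassFromChars("æøå") & @LF)
```

Whether the generated class matches depends on the script file and the subject string using the same single-byte encoding, so treat it as something to try rather than a guaranteed fix.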

I think the Unicode part of PCRE was excluded at compile time to shave off some kilobytes, and that this was justified by AutoIt's own lack of Unicode support (which is due to compatibility with Win95).

I'll take another look at the switches you have provided @MHz to see if I can get any of them to work.

Thanks
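One way to check the switches one by one is to let the engine say whether it accepts each pattern. A rough sketch, assuming StringRegExp sets @error to 2 for a pattern it cannot compile (that is how the current function reference describes it; older builds may differ). The helper name _PatternIsAccepted and the probe list are illustrative only:

```autoit
; Sketch only: _PatternIsAccepted is a hypothetical helper name.
Func _PatternIsAccepted($sPattern)
    ; Flag 0 just tests for a match; only the error status matters here.
    StringRegExp("", $sPattern, 0)
    ; Assumption: @error = 2 signals a pattern the engine could not compile.
    Return @error <> 2
EndFunc

Func _ProbePatterns()
    Local $aProbes[3] = ["\p{L}", "\X", "\xE5"]
    For $i = 0 To UBound($aProbes) - 1
        ConsoleWrite($aProbes[$i] & " accepted: " & _PatternIsAccepted($aProbes[$i]) & @LF)
    Next
EndFunc

_ProbePatterns()
```

A pattern being accepted only means it compiled; it still has to be tested against real data to see whether it matches the letters.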


I rewrote my test to work with thomasl's Perl regexp UDFs, and the regexp patterns I had used did not work well there either.

Looks like using \b is a no-no, or at least it does not work the way I expect it to.

But this returns the expected letter å as the only match. I'll have to play a bit more with this to see if I can get words containing the letters.

$regexp = "[^\w]([æøå]{1})[^\w]"

Still open for suggestions, though.
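A likely explanation for the \b results: without Unicode or locale tables, \w in PCRE covers only A-Z, a-z, 0-9 and _, so æ, ø and å count as non-word characters. \b then fires between "v" and "æ" in "ærvær" but not around a lone "å" on its own line, which fits the six matches listed in the first post (ærvær, ættbåren, ødegård, ølebrød, ølebrødet, ømtålelige). A minimal sketch that sidesteps \b and \w by spelling out the letters explicitly; the function name and the sample words are made up for illustration, and it assumes the script is saved in the encoding AutoIt reads it in:

```autoit
; Sketch only: _WordsWithNorwegianLetters is a hypothetical helper name.
Func _WordsWithNorwegianLetters($sText)
    ; Spell out every letter that may be part of a word, because \w and \b
    ; are assumed to know only A-Z, a-z, 0-9 and _ in this build.
    Local $sLetters = "A-Za-zæøåÆØÅ"
    ; Any run of letters that contains at least one of the special letters.
    Local $sPattern = "[" & $sLetters & "]*[æøåÆØÅ][" & $sLetters & "]*"
    ; Flag 3 returns an array with all matches.
    Return StringRegExp($sText, $sPattern, 3)
EndFunc

Func _Demo()
    Local $sData = "hus" & @CRLF & "ødegård" & @CRLF & "å" & @CRLF & "bil"
    Local $aWords = _WordsWithNorwegianLetters($sData)
    If IsArray($aWords) Then
        For $i = 0 To UBound($aWords) - 1
            ConsoleWrite("+>" & $aWords[$i] & @LF)
        Next
    Else
        ConsoleWrite("! Nothing returned to array" & @LF)
    EndIf
EndFunc

_Demo()
```

The leading and trailing letter runs are greedy, so each match should cover the whole word rather than just the special letter. Letters from other languages would have to be added to $sLetters by hand.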
