Change characters in a string with StringRegExpReplace



Hi,

I would like to replace the accented Hungarian characters in a string with their unaccented equivalents, but I can't figure out how to do it.

Help, pls.

 

#include <MsgBoxConstants.au3>

Local $sInput = "Árvíztűrő tükörfúrógép"
Local $sOutput = StringRegExpReplace($sInput, "(?-i)(á)|(Á)|(é)|(É)|(í)|(Í)|(ó)|(Ó)|(ö)|(Ö)|(ő)|(Ő)|(ú)|(Ú)|(ü)|(Ü)|(ű)|(Ű)", "(?1a)(?2A)(?3e)(?4E)(?5i)(?6I)(?7o)(?8O)(?9o)(?10O)(?11o)(?12O)(?13u)(?14U)(?15u)(?16U)(?17u)(?18U)")
Display($sInput, $sOutput)


Func Display($sInput, $sOutput)
    ; Format the output.
    Local $sMsg = StringFormat("Input:\t%s\n\nOutput:\t%s", $sInput, $sOutput)
    MsgBox($MB_SYSTEMMODAL, "Results", $sMsg)
EndFunc   ;==>Display

 


Your regex can't do that: the replacement string tries to use conditional syntax such as (?1a), which is not valid in AutoIt.

Use this approach instead:

Removing Unicode accents boils down to converting the string to normalization form D (or KD), then removing all diacritic and/or modifier codepoints. In your case, removing the combining diacritics works fine.

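#include <MsgBoxConstants.au3>

; Unicode Normalization Forms
Global Enum $UNF_NormC = 1, $UNF_NormD, $UNF_NormKC = 5, $UNF_NormKD

Func _UNF_Change($sIn, $iForm)
    If $iForm = $UNF_NormC Or $iForm = $UNF_NormD Or $iForm = $UNF_NormKC Or $iForm = $UNF_NormKD Then
        ; First call, with a null buffer, returns an estimate of the required buffer size.
        Local $aRet = DllCall("Normaliz.dll", "int", "NormalizeString", "int", $iForm, "wstr", $sIn, "int", -1, "ptr", 0, "int", 0)
        If @error Or $aRet[0] <= 0 Then Return SetError(2, 0, $sIn)
        Local $tOut = DllStructCreate("wchar[" & $aRet[0] & "]")
        ; Second call writes the normalized string into the buffer.
        $aRet = DllCall("Normaliz.dll", "int", "NormalizeString", "int", $iForm, "wstr", $sIn, "int", -1, "ptr", DllStructGetPtr($tOut, 1), "int", $aRet[0])
        If @error Or $aRet[0] <= 0 Then Return SetError(3, 0, $sIn)
        Return DllStructGetData($tOut, 1)
    Else
        ; Unknown form: return the input unchanged and flag the error.
        Return SetError(1, 0, $sIn)
    EndIf
EndFunc   ;==>_UNF_Change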


Local $sInput = "Árvíztűrő tükörfúrógép"
Local $sFormD = _UNF_Change($sInput, $UNF_NormD)
Local $sOutput = StringRegExpReplace($sFormD, "(*UCP)\p{Mn}", "")
Display($sInput, $sOutput)


Func Display($sInput, $sOutput)
    ; Format the output.
    Local $sMsg = StringFormat("Input:\t%s\n\nOutput:\t%s", $sInput, $sOutput)
    MsgBox($MB_SYSTEMMODAL, "Results", $sMsg)
EndFunc   ;==>Display

In case the string contains other codepoints of category Mn that you don't want removed, you can restrict removal to just the Combining Diacritical Marks range (U+0300 to U+036F) used with Latin letters.

Replace the regex line with this one:

Local $sOutput = StringRegExpReplace($sFormD, "(*UCP)[\x{300}-\x{36F}]", "")
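To make the difference between the two patterns concrete, here is a small sketch. The test string is my own assumption, not from the posts above: it mixes a decomposed "é" with a Hebrew letter carrying a vowel point (U+05B4, which also has category Mn). The characters are built with ChrW so the result doesn't depend on the script file's encoding.

```autoit
; "e" + COMBINING ACUTE ACCENT, then HEBREW LETTER BET + HEBREW POINT HIRIQ (hypothetical sample)
Local $sSample = "e" & ChrW(0x0301) & ChrW(0x05D1) & ChrW(0x05B4)

; \p{Mn} removes every nonspacing mark, including the Hebrew point
Local $sAllMarks = StringRegExpReplace($sSample, "(*UCP)\p{Mn}", "")

; The explicit range removes only U+0300..U+036F and leaves the Hebrew point alone
Local $sLatinOnly = StringRegExpReplace($sSample, "(*UCP)[\x{300}-\x{36F}]", "")

ConsoleWrite("Original length:     " & StringLen($sSample) & @CRLF)
ConsoleWrite("After \p{Mn}:        " & StringLen($sAllMarks) & @CRLF)
ConsoleWrite("After range U+03xx:  " & StringLen($sLatinOnly) & @CRLF)
```

The second result should be one code unit longer than the first, because the Hebrew point survives the range-based removal.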

 

Edited by jchd

This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)


For current and future interested readers, here's a bit of explanation.

In Unicode, many (but not all) characters with diacritic signs (so-called "accents") can be represented in a string in essentially two forms, called "normalization forms". For instance, the characters Â and Ç can be represented either in form C (composed characters) or in form D (decomposed characters):
Â in form C = 'Â' (U+00C2)
Â in form D = 'A' (U+0041) followed by the combining circumflex accent (U+0302)
Ç in form C = 'Ç' (U+00C7)
Ç in form D = 'C' (U+0043) followed by the combining cedilla (U+0327)

The idea behind this method of unaccenting Latin characters is to convert the string into form D, then remove the combining marks specifically.

This works for most Latin-script text but not in general. For instance, letters such as Ł or ø have no decomposition, so they come through unchanged. Human scripts are very complex and subtle, thus Unicode itself has to be complex as well.
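As an illustration of that limit, here is a sketch under my own assumptions: the sample word "Łódź" and the helper name _ToFormD are not from the posts above. The helper is only a compact restatement of _UNF_Change fixed to form D, so that the snippet runs on its own.

```autoit
; Hypothetical helper: same NormalizeString calls as _UNF_Change, fixed to form D (value 2)
Func _ToFormD($sIn)
    Local $aRet = DllCall("Normaliz.dll", "int", "NormalizeString", "int", 2, "wstr", $sIn, "int", -1, "ptr", 0, "int", 0)
    If @error Or $aRet[0] <= 0 Then Return SetError(1, 0, $sIn)
    Local $tOut = DllStructCreate("wchar[" & $aRet[0] & "]")
    $aRet = DllCall("Normaliz.dll", "int", "NormalizeString", "int", 2, "wstr", $sIn, "int", -1, "ptr", DllStructGetPtr($tOut), "int", $aRet[0])
    If @error Or $aRet[0] <= 0 Then Return SetError(2, 0, $sIn)
    Return DllStructGetData($tOut, 1)
EndFunc   ;==>_ToFormD

; "Łódź" built from codepoints: Ł (U+0141), ó (U+00F3), d, ź (U+017A)
Local $sCity = ChrW(0x0141) & ChrW(0x00F3) & "d" & ChrW(0x017A)
Local $sStripped = StringRegExpReplace(_ToFormD($sCity), "(*UCP)[\x{300}-\x{36F}]", "")

; ó and ź decompose and lose their accents; Ł has no decomposition and should remain as is
ConsoleWrite("Length after stripping: " & StringLen($sStripped) & @CRLF)
ConsoleWrite("First letter is still Ł: " & (StringLeft($sStripped, 1) == ChrW(0x0141)) & @CRLF)
```

If letters like Ł must also be mapped to plain ASCII, an explicit replacement table would be needed on top of the normalization step.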

From the above you can infer that the notion of a character in Unicode isn't as simple as it was with codepages (ANSI, Windows, you-name-it). For instance, decomposed "characters" (form D) may use several codepoints, each of which counts individually, so StringLen(Ç in form D) returns 2. Unicode strings are almost always in form C, as this makes it simpler to count "characters" (in a sense) and shorter to represent.
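The length difference is easy to check. A minimal sketch, building both forms of Ç with ChrW so that nothing depends on how the script file is saved:

```autoit
Local $sFormC = ChrW(0x00C7)          ; precomposed Ç
Local $sFormD = "C" & ChrW(0x0327)    ; C followed by the combining cedilla

ConsoleWrite("Form C length: " & StringLen($sFormC) & @CRLF) ; 1
ConsoleWrite("Form D length: " & StringLen($sFormD) & @CRLF) ; 2

; The two strings look the same on screen but are different codepoint sequences
ConsoleWrite("Same string:   " & ($sFormC == $sFormD) & @CRLF) ; False
```

This is also why comparing user input against stored text can fail unexpectedly when the two sides use different normalization forms.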

Another useful concept in Unicode is the "extended grapheme cluster": a series of codepoints, some characters, some modifiers, that the rendering system is to represent all together. Scripts that use complex extended grapheme clusters include Arabic, Hebrew, Thai, Indic scripts, and others. For examples and differences between decomposed characters and extended grapheme clusters, see this Unicode page to get a feeling for how complex representing human scripts can really be.

Btw, our AutoIt implementation of regex (PCRE1) has a character type for matching an extended grapheme cluster: \X, which matches as many codepoints as needed to cover the whole of what the Unicode rendering engine will represent as a single (complex) visual glyph (what you as a user perceive as a "character").
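A small sketch of \X in use, with a sample string of my own choosing (a decomposed Ç followed by a plain "a"). It counts clusters by collecting all global matches:

```autoit
; Three UTF-16 code units: "C", combining cedilla, "a"
Local $sText = "C" & ChrW(0x0327) & "a"

; Flag 3 returns an array of all global matches
Local $aClusters = StringRegExp($sText, "\X", 3)
If @error Then
    ConsoleWrite("No match or bad pattern, @error = " & @error & @CRLF)
Else
    ConsoleWrite("Code units: " & StringLen($sText) & @CRLF)
    ConsoleWrite("Clusters:   " & UBound($aClusters) & @CRLF)
EndIf
```

With a PCRE build that supports \X as described above, the cluster count should come out lower than the code unit count, since the C and its cedilla are matched as one unit.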

Finally, please realize that all the notions discussed above are independent of the Unicode encoding in use. You can think of Unicode in layers, mimicking the OSI model:
bit (not a very useful level)
encoding unit (byte for UTF-8, 16-bit word for UTF-16, 32-bit dword for UTF-32; byte order conventional or explicit)
codepoint (takes one or more encoding units, depending on the encoding in use)
extended grapheme cluster (as many codepoints as needed to designate a glyph)
font (set of drawing rules used to draw a glyph)
rendered glyph (graphical output of what we humans visually perceive as a "character", produced by the rendering engine)
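The lower layers can be observed directly from AutoIt. A minimal sketch, assuming the standard StringToBinary flags (2 for UTF-16 LE, 4 for UTF-8), that shows how the same text occupies a different number of bytes per encoding while the codepoints stay the same:

```autoit
Local $sFormC = ChrW(0x00C7)          ; Ç as one codepoint
Local $sFormD = "C" & ChrW(0x0327)    ; Ç as two codepoints

; Byte counts in UTF-8 (flag 4) and UTF-16 LE (flag 2)
ConsoleWrite("Form C: UTF-8 = " & BinaryLen(StringToBinary($sFormC, 4)) & " bytes, UTF-16 = " & BinaryLen(StringToBinary($sFormC, 2)) & " bytes" & @CRLF)
ConsoleWrite("Form D: UTF-8 = " & BinaryLen(StringToBinary($sFormD, 4)) & " bytes, UTF-16 = " & BinaryLen(StringToBinary($sFormD, 2)) & " bytes" & @CRLF)
```

Form C takes 2 bytes in either encoding here, while form D takes 3 bytes in UTF-8 and 4 bytes in UTF-16, even though both render as the same glyph.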

I told you Unicode wasn't trivial, didn't I?

Edited by jchd
Fix typo, clarify and expand explanations
