Change characters in a string with StringRegExpReplace

fs1234 · May 3, 2019

Hi,

I would like to replace the Hungarian accented characters in a string with their unaccented equivalents, but I can't figure out how to do it.

Help, pls.

#include <MsgBoxConstants.au3>

Local $sInput = "Árvíztűrő tükörfúrógép"
Local $sOutput = StringRegExpReplace($sInput, "(?-i)(á)|(Á)|(é)|(É)|(í)|(Í)|(ó)|(Ó)|(ö)|(Ö)|(ő)|(Ő)|(ú)|(Ú)|(ü)|(Ü)|(ű)|(Ű)", "(?1a)(?2A)(?3e)(?4E)(?5i)(?6I)(?7o)(?8O)(?9o)(?10O)(?11o)(?12O)(?13u)(?14U)(?15u)(?16U)(?17u)(?18U)")
Display($sInput, $sOutput)


Func Display($sInput, $sOutput)
    ; Format the output.
    Local $sMsg = StringFormat("Input:\t%s\n\nOutput:\t%s", $sInput, $sOutput)
    MsgBox($MB_SYSTEMMODAL, "Results", $sMsg)
EndFunc   ;==>Display

jchd · May 4, 2019

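Your regex can't do that: the replacement string uses conditional syntax such as (?1a), which StringRegExpReplace doesn't support. In AutoIt the replacement may only contain literal text and back-references ($1, \1, ...).

Use this approach instead:

Removing Unicode accentuation boils down to converting the string to normalization form D (or KD), then removing all diacritic and/or modifier codepoints. In your case, removing combining diacritics works fine.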

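#include <MsgBoxConstants.au3> ; needed for $MB_SYSTEMMODAL in Display()

; Unicode Normalization Forms (values of the Windows NORM_FORM enumeration)
Global Enum $UNF_NormC = 1, $UNF_NormD, $UNF_NormKC = 5, $UNF_NormKD

; Returns $sIn converted to normalization form $iForm.
; On failure, returns $sIn unchanged and sets @error (1 = invalid form, 2 = call failed).
Func _UNF_Change($sIn, $iForm)
    If $iForm = $UNF_NormC Or $iForm = $UNF_NormD Or $iForm = $UNF_NormKC Or $iForm = $UNF_NormKD Then
        ; First call: no output buffer, so NormalizeString returns an estimate of the required size.
        Local $aRet = DllCall("Normaliz.dll", "int", "NormalizeString", "int", $iForm, "wstr", $sIn, "int", -1, "ptr", 0, "int", 0)
        If @error Or $aRet[0] <= 0 Then Return SetError(2, 0, $sIn)
        Local $tOut = DllStructCreate("wchar[" & $aRet[0] & "]")
        ; Second call: write the normalized string into the buffer.
        $aRet = DllCall("Normaliz.dll", "int", "NormalizeString", "int", $iForm, "wstr", $sIn, "int", -1, "ptr", DllStructGetPtr($tOut, 1), "int", $aRet[0])
        If @error Or $aRet[0] <= 0 Then Return SetError(2, 0, $sIn)
        Return DllStructGetData($tOut, 1)
    Else
        ; Return is required here: SetError() alone would not pass $sIn back to the caller.
        Return SetError(1, 0, $sIn)
    EndIf
EndFunc   ;==>_UNF_Change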


Local $sInput = "Árvíztűrő tükörfúrógép"
Local $sFormD = _UNF_Change($sInput, $UNF_NormD)
Local $sOutput = StringRegExpReplace($sFormD, "(*UCP)\p{Mn}", "")
Display($sInput, $sOutput)


Func Display($sInput, $sOutput)
    ; Format the output.
    Local $sMsg = StringFormat("Input:\t%s\n\nOutput:\t%s", $sInput, $sOutput)
    MsgBox($MB_SYSTEMMODAL, "Results", $sMsg)
EndFunc   ;==>Display

In case there are other codepoints of category Mn that you don't want removed, you can restrict removal to the Combining Diacritical Marks block (U+0300 to U+036F), which covers the diacritics commonly used with Latin letters.

Replace the regex line with this one:

Local $sOutput = StringRegExpReplace($sFormD, "(*UCP)[\x{300}-\x{36F}]", "")
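
To see where the two patterns differ, here is a minimal sketch. The sample string is my own assumption, not something from the original question: it mixes a Latin letter carrying a combining acute accent with a Hebrew letter carrying a vowel point, which is also category Mn but lies outside U+0300 to U+036F.

```autoit
; Sketch only: the sample is built from code points so the script's file encoding does not matter.
; "e" + combining acute accent (U+0301), then Hebrew bet (U+05D1) + point sheva (U+05B0, category Mn).
Local $sSample = "e" & ChrW(0x0301) & ChrW(0x05D1) & ChrW(0x05B0)

; \p{Mn} strips every non-spacing mark, including the Hebrew point.
Local $sAllMarks = StringRegExpReplace($sSample, "(*UCP)\p{Mn}", "")

; The explicit range strips only marks in U+0300..U+036F and leaves the Hebrew point alone.
Local $sRangeOnly = StringRegExpReplace($sSample, "(*UCP)[\x{300}-\x{36F}]", "")

ConsoleWrite("Original:      " & StringLen($sSample) & " code units" & @CRLF)
ConsoleWrite("After \p{Mn}:  " & StringLen($sAllMarks) & " code units" & @CRLF)
ConsoleWrite("After range:   " & StringLen($sRangeOnly) & " code units" & @CRLF)
```

The range version should leave one more code unit than the \p{Mn} version, because the Hebrew point survives.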

Edited May 4, 2019 by jchd

fs1234 · May 4, 2019

Thank you, jchd!

Works fine.

jchd · May 4, 2019

For current and future interested readers, here's a bit of explanation.

In Unicode, many (but not all) characters with diacritic signs (so-called "accents") can be represented in a string in essentially two forms, called "normalization forms". For instance, the characters Â and Ç can be represented either in form C (composed characters) or in form D (decomposed characters):
Â in form C = 'Â' (U+00C2)
Â in form D = 'A' (U+0041) followed by the combining circumflex accent (U+0302)
Ç in form C = 'Ç' (U+00C7)
Ç in form D = 'C' (U+0043) followed by the combining cedilla (U+0327)

The idea behind this method of unaccenting Latin characters is to convert the string to form D, then remove the combining marks specifically.

This works for most Latin-based scripts but not in general. For example, letters such as ø or đ have no decomposition, so they pass through unchanged. Human scripts are very complex and subtle, thus Unicode itself has to be complex as well.

From the above you can infer that the notion of character in Unicode isn't as simple as it was with codepages (ANSI, Windows, you-name-it). For instance, decomposed "characters" (form D) may use several codepoints, each of which counts individually. So StringLen(Ç in form D) will return 2. Unicode strings are almost always in form C, as this makes it simpler to count "characters" (in a sense) and shorter to represent.
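
A minimal sketch of that length difference, building both forms of Ç from their code points with ChrW (the variable names are only for illustration):

```autoit
; Build "Ç" both ways from code points, so the script's file encoding does not matter.
Local $sFormC = ChrW(0x00C7)                ; precomposed Ç
Local $sFormD = ChrW(0x0043) & ChrW(0x0327) ; C followed by the combining cedilla

ConsoleWrite("Form C length: " & StringLen($sFormC) & @CRLF) ; 1
ConsoleWrite("Form D length: " & StringLen($sFormD) & @CRLF) ; 2

; The two strings render the same but are not equal as strings.
ConsoleWrite("Equal: " & ($sFormC == $sFormD) & @CRLF) ; False
```

This is also why comparing or searching text that mixes both forms can fail unless you normalize first.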

Another useful concept in Unicode is the "extended grapheme cluster". That is a series of codepoints, some characters, some modifiers, that are to be rendered together by the rendering system. Scripts that use complex extended grapheme clusters include Arabic, Hebrew, Thai, Indic, and others. For examples and differences between decomposed characters and extended grapheme clusters, see this Unicode page to get a feeling for how complex representing human scripts can really be.

Btw, our AutoIt implementation of regex (PCRE1) has a character type for matching an extended grapheme cluster: \X, which matches as many codepoints as needed to cover the whole of what will be rendered as a single (complex) visual glyph by the Unicode rendering engine (what you as a user feel is a "character").
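
As a rough sketch of \X in use, the following counts clusters instead of code units. The sample string is an assumption chosen for illustration: Ç in form D followed by a plain "a".

```autoit
; Sketch only: "Ç" in form D (two code points) followed by a plain "a".
Local $sText = ChrW(0x0043) & ChrW(0x0327) & "a"

; Flag 3 returns an array of all matches; each \X match is one extended grapheme cluster.
Local $aClusters = StringRegExp($sText, "\X", 3)

ConsoleWrite("Code units: " & StringLen($sText) & @CRLF)
ConsoleWrite("Clusters:   " & UBound($aClusters) & @CRLF)
```

StringLen reports 3 here, while the cluster count should come out as 2, matching what a reader sees on screen.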

Finally, please realize that all the notions discussed above are independent of the Unicode encoding in use. You can think of Unicode as a stack of layers, similar to the OSI model:
bit (not a very useful level)
encoding unit (byte for UTF-8, 16-bit word for UTF-16, 32-bit dword for UTF-32; byte order conventional or explicit)
codepoint (needs one or more encoding units, depending on the encoding)
extended grapheme cluster (as many codepoints as needed to designate a glyph)
font (set of drawing rules used to draw a glyph)
rendered glyph (graphical output of what we humans visually perceive as a "character", done by the rendering engine)
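
To make the encoding unit and codepoint layers concrete, here is a small sketch that measures the same text in two encodings. The helper _ShowUnits is a hypothetical name made up for this example; StringToBinary and BinaryLen are the standard AutoIt functions.

```autoit
#include <StringConstants.au3> ; for $SB_UTF8 and $SB_UTF16LE

_ShowUnits("Form C", ChrW(0x00C7))                ; precomposed Ç, 1 codepoint
_ShowUnits("Form D", ChrW(0x0043) & ChrW(0x0327)) ; C + combining cedilla, 2 codepoints

; Hypothetical helper: prints how many encoding units $sText needs in UTF-8 and UTF-16.
Func _ShowUnits($sLabel, $sText)
    Local $iUtf8Bytes = BinaryLen(StringToBinary($sText, $SB_UTF8))
    Local $iUtf16Units = BinaryLen(StringToBinary($sText, $SB_UTF16LE)) / 2 ; 2 bytes per unit
    ConsoleWrite($sLabel & ": " & $iUtf8Bytes & " UTF-8 bytes, " & $iUtf16Units & " UTF-16 units" & @CRLF)
EndFunc   ;==>_ShowUnits
```

Form C takes 2 UTF-8 bytes and 1 UTF-16 unit, form D takes 3 UTF-8 bytes and 2 UTF-16 units, yet both render as the same single glyph.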

I told you Unicode wasn't trivial, didn't I?

Edited May 4, 2019 by jchd
Fix typo, clarify and expand explanations
