Jump to content
fs1234

Change characters in a string with StringRegExpReplace

Recommended Posts

Hi,

I would like to change the hungarian characters in a string, but I can't figure out how to do it.

Help, pls.

 

#include <MsgBoxConstants.au3>

Local $sInput = "Árvíztűrő tükörfúrógép"
Local $sOutput = StringRegExpReplace($sInput, "(?-i)(á)|(Á)|(é)|(É)|(í)|(Í)|(ó)|(Ó)|(ö)|(Ö)|(ő)|(Ő)|(ú)|(Ú)|(ü)|(Ü)|(ű)|(Ű)", "(?1a)(?2A)(?3e)(?4E)(?5i)(?6I)(?7o)(?8O)(?9o)(?10O)(?11o)(?12O)(?13u)(?14U)(?15u)(?16U)(?17u)(?18U)")
Display($sInput, $sOutput)


Func Display($sInput, $sOutput)
    ; Format the output.
    Local $sMsg = StringFormat("Input:\t%s\n\nOutput:\t%s", $sInput, $sOutput)
    MsgBox($MB_SYSTEMMODAL, "Results", $sMsg)
EndFunc   ;==>Display

 

Share this post


Link to post
Share on other sites
Posted (edited)

Your regex can't do that and it tries to use invalid syntax.

You need to use this:

Removing Unicode accentuation boils down to convert the string to norm form D (or KD) then remove all diacritic and/or modifier codepoints. In your case, removing combining diacritics works fine.

; Unicode Normalization Forms
Global Enum $UNF_NormC = 1, $UNF_NormD, $UNF_NormKC = 5, $UNF_NormKD

Func _UNF_Change($sIn, $iForm)
    If $iForm = $UNF_NormC Or $iForm = $UNF_NormD Or $iForm = $UNF_NormKC Or $iForm = $UNF_NormKD Then
        Local $aRet = DllCall("Normaliz.dll", "int", "NormalizeString", "int", $iForm, "wstr", $sIn, "int", -1, "ptr", 0, "int", 0)
        Local $tOut = DllStructCreate("wchar[" & $aRet[0] & "]")
        $aRet = DllCall("Normaliz.dll", "int", "NormalizeString", "int", $iForm, "wstr", $sIn, "int", -1, "ptr", DllStructGetPtr($tOut, 1), "int", $aRet[0])
        Return DllStructGetData($tOut, 1)
    Else
        SetError(1, 0, $sIn)
    EndIf
EndFunc   ;==>_UNF_Change


Local $sInput = "Árvíztűrő tükörfúrógép"
Local $sFormD = _UNF_Change($sInput, $UNF_NormD)
Local $sOutput = StringRegExpReplace($sFormD, "(*UCP)\p{Mn}", "")
Display($sInput, $sOutput)


Func Display($sInput, $sOutput)
    ; Format the output.
    Local $sMsg = StringFormat("Input:\t%s\n\nOutput:\t%s", $sInput, $sOutput)
    MsgBox($MB_SYSTEMMODAL, "Results", $sMsg)
EndFunc   ;==>Display

Just in case there may be other codepoints having category Mn that you don't want removed, you can restrict removal to just the latin diacritics codepoints range.

Replace the regex line by this one:

Local $sOutput = StringRegExpReplace($sFormD, "(*UCP)[\x{300}-\x{36F}]", "")

 

Edited by jchd

This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)

Share this post


Link to post
Share on other sites
Posted (edited)

For current and future interested readers, here's an bit of explanation.

In Unicode, many (but not all) characters with diacritic signs (so-called "accents") can be represented in a string by essentially two forms, called "normalization forms". For instance the characters  and Ç can be represented by either sequences of form C (composed characters) or form D (decomposed characters):
 in form C = 'Â' (U+00C2)
 in form D = 'A' (U+0041) followed by circumflex combining mark '^' (U+0302)
Ç in form C = 'Ç' (U+00C7)
Ç in form D = 'C' (U+0043) followed by cedilla combining mark '̧̧' (U+0327)

The idea under this method of unaccenting latin characters is to convert the string into form D then remove combining marks specifically.

This works for most latin scripts but not in general. Human scripts are very complex and subtle, thus Unicode itself has to be complex as well.

From the above you can infer that the notion of character in Unicode isn't as simple as it was with codepages (ANSI, Windows, you-name-it). For instance, decomposed "characters" (form D) may use several codepoints, which will count individually. So StringLen(Ç in form D) will return 2. Unicode string are almost always in form C as this makes it simpler to count "characters" (in a sense) and shorter to represent.

Another useful concept in Unicode is the "extended grapheme cluster". That is a series of codepoints, some characters, some modifiers that are to be represented alltogether by the rendering system. Such languages using complex extended grapheme clusters are Arabic, Hebrew, Thai, Indic, and others. For examples and differences between decomposed characters and extended grapheme clusters, see this page Unicode to get a feeling about how complex representing human scripts can really be.

Btw, our AutoIt implementation of regex (PCRE1) has a character type for matching an extended grapheme cluster: \X which matches as many codepoints as needed to comprehend the whole of what will be represented as a single (complex) visual glyph by the Unicode renderer engine (what you as user feel is a "character").

Finally please realize that all the notions discussed above are independant of the Unicode encoding in use. You can then mimic OSI layers applied to Unicode:
bit                                                (not very useful level)
encoding unit                            (byte for UTF8, 16-bit word for UTF16, 32-bit dword for UTF32; byte-order conventional or explicit)
codepoint                                   (how many encoding units a codepoint needs for its representation in a given encoding)
extended grapheme cluster   (as many codepoints needed to designate a glyph)
font                                             (set of drawing rules used to draw a glyph)
rendered glyph                         (graphical output of what we human visually perceive as a "character", done by the rendering engine)

I told you Unicode wasn't trivial, didn't I?

Edited by jchd
Fix typo, clarify and expand explanations

This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)

Share this post


Link to post
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

  • Similar Content

    • By Skysnake
      I need some regex help
      I inherited some data 
      The data is massive and I need a clean, fast solution 
      source is text and complex.  
      I need to find dates such as "31-01-2018" and replace with "31-JAN-2018"
      Problem is that my regex "31-01-2018" takes for ever and replaces all.
      The ideal would be to search like this, but I am not managing
      \d{2}-(01)-\d{4} replace (01) with JAN But if I do it that way, the entire search string gets replaced by JAN.  This is not an error, but typically regex behaviour.  Any ideas?
      Skysnake
    • By luckyluke
      $t = '... 1-347-318-9643 1-347-318-9647 1-347-318-9648 1-347-318-9650 1-347-318-9651 1-347-318-9652 1-347-318-9653 1-347-318-9655 1-347-318-&nbsp;...' $pattern = '347.*?318.*?9655' $tmp = StringRegExpReplace($t, $pattern, "|||", 1) ConsoleWrite($tmp & @CRLF) However i got this output:
      ... 1-|||  1-347-318-&nbsp;...
      Why i got only that, where is the other string, i thought the output should be this:
      ... 1-347-318-9643  1-347-318-9647  1-347-318-9648  1-347-318-9650  1-347-318-9651  1-347-318-9652  1-347-318-9653  1-|||  1-347-318-&nbsp;...
    • By ViciousXUSMC
      So I ran into this crazy "program" that cant be uninstalled via WMI, MSIExec, etc.
      The only way to uninstall it was from Add/Remove programs manually... Or I found if you find it in the registry under HKCU and run the  uninstall string, it will also uninstall.
      However the string in the registry cant be run directly in a cmd window because of the format errors.
      It has spaces without quotations, it has invalid characters, etc, etc 
      I know things run different when executed in the registry, so maybe there is a way I can run the regsitry key just like how the system does?  If so chime in.
      Otherwise I did this a crude way using several stringregexpreplace() functions and have it working.
      The solution feels so barbaric and crude that I wanted to post it so some of you guys better than me can clean up the code, maybe offer alternative ways to do it, or reduce the number of times I process the string.
      Here is the string right out of the registry:
      c:\Program Files\Common Files\Microsoft Shared\VSTO\10.0\VSTOInstaller.exe /Uninstall file:///C:/Users/it022565/AppData/Local/Temp/OOBAXTOWordAddIn/ApplicationXtender.AXTO.Word.vsto Here is my cave man scripting to turn this into a run able string.
       
      Func _UninstallOld() For $i = 1 to 100 ;Enumerate Registry $sEnumBase = "HKEY_CURRENT_USER\SOFTWARE\Microsoft\Windows\CurrentVersion\Uninstall\" ;Look in HKCU for the uninstall string for the old version $sEnum = RegEnumKey($sEnumBase, $i) If @Error Then Return If $iDebug = 1 Then MsgBox(0, "", $sEnum) If StringInStr(RegRead($sEnumBase & $sEnum, "DisplayName"), "Word Addin") Then ExitLoop Next If $iDebug = 1 Then MsgBox(0, "", $sEnum) $sKey = "HKEY_CURRENT_USER\SOFTWARE\Microsoft\Windows\CurrentVersion\Uninstall\" & $sEnum $sKey2 = RegRead($sKey, "UninstallString") If $iDebug = 1 Then MsgBox(0, "Original Install Location", $sKey2) $sKey3 = StringRegExpReplace($sKey2, "(?i)(c:.*exe)", '"$1"') If $iDebug = 1 Then MsgBox(0, "", $sKey3) $sKey4 = StringRegExpReplace($sKey3, "(?i)file:///", "") If $iDebug = 1 Then MsgBox(0, "", $sKey4) $sKey5 = StringRegExpReplace($sKey4, "%20", " ") If $iDebug = 1 Then MsgBox(0, "", $sKey5) $sKey6 = StringRegExpReplace($sKey5, '(?i)((?<!")c:.*vsto)', '"$1"') If $iDebug = 1 Then MsgBox(0, "", $sKey6) RunWait(@ComSpec & ' /c ' & '"' & $sKey6 & ' /s"', "", @SW_HIDE) EndFunc Basically step by step I add quotations, strip bad characters, etc.  Kind of proud for using look behind for once
      Looking forward to what you guys come up with.
    • By VIP
      Need help to make function better  with full infomation
      #include <Array.au3> #include <File.au3> _TEST(@ScriptFullPath) _TEST("A:") _TEST("A:\B.c") _TEST("D:\E\F\") _TEST("G:\H/../J.k/") _TEST("M:\N\k..J.k") _TEST("D:\E\F\..\G\G\I..J.K.M") Func _TEST($sFilePath) Local $sDrive = "", $sFullPathDir = "", $sDirPath = "", $sDirName = "", $sFileName = "", $sFileNameExt = "", $sExtension = "", $sExt = "" Local $aPathSplit = _PathSplitByRef($sFilePath, $sDrive, $sFullPathDir, $sDirPath, $sDirName, $sFileName, $sFileNameExt, $sExtension, $sExt) ConsoleWrite("!Path IN : " & $sFilePath & @CRLF) ; C:\Windows\System32\etc\hosts.exe ConsoleWrite("- Driver : " & $sDrive & @CRLF) ; C: ConsoleWrite("- DirPath : " & $sFullPathDir & @CRLF) ; C:\Windows\System32\etc\etc ConsoleWrite("- DirPath : " & $sDirPath & @CRLF) ; \Windows\System32\etc\ ConsoleWrite("- DirName : " & $sDirName & @CRLF) ; etc ConsoleWrite("- FileName : " & $sFileName & @CRLF) ; hosts ConsoleWrite("- FileNameExt: " & $sFileNameExt & @CRLF) ; hosts.exe ConsoleWrite("- Extension : " & $sExtension & @CRLF) ; .exe ConsoleWrite("- Ext : " & $sExt & @CRLF & @CRLF) ; exe ;~ ConsoleWrite("!Path IN : " & $aPathSplit[0] & @CRLF) ; C:\Windows\System32\etc\hosts.exe ;~ ConsoleWrite("- Driver : " & $aPathSplit[1] & @CRLF) ; C: ;~ ConsoleWrite("- DirPath : " & $aPathSplit[2] & @CRLF) ; C:\Windows\System32\etc\etc ;~ ConsoleWrite("- DirPath : " & $aPathSplit[3] & @CRLF) ; \Windows\System32\etc\ ;~ ConsoleWrite("- DirName : " & $aPathSplit[4] & @CRLF) ; etc ;~ ConsoleWrite("- FileName : " & $aPathSplit[5] & @CRLF) ; hosts ;~ ConsoleWrite("- FileNameExt: " & $aPathSplit[6] & @CRLF) ; hosts.exe ;~ ConsoleWrite("- Extension : " & $aPathSplit[7] & @CRLF) ; .exe ;~ ConsoleWrite("- Ext : " & $aPathSplit[8] & @CRLF) ; exe ;~ _ArrayDisplay($aPathSplit, "_PathSplit of " & $sFilePath) EndFunc ;==>_TEST Func _PathSplitByRef($sFilePath, ByRef $sDrive, ByRef $sFullPathDir, ByRef $sDirPath, ByRef $sDirName, ByRef $sFileName, ByRef $sFileNameExt, ByRef $sExtension, ByRef $sExt) If StringInStr($sFilePath,"..") Then $sFilePath=_PathFull($sFilePath) Local $aPartOfPath=StringRegExp($sFilePath, "^\h*((?:\\\\\?\\)*(\\\\[^\?\/\\]+|[A-Za-z]:)?(.*[\/\\]\h*)?((?:[^\.\/\\]|(?(?=\.[^\/\\]*\.)\.))*)?([^\/\\]*))$", $STR_REGEXPARRAYMATCH) ;~ If @error Then ReDim $aPartOfPath[9] ;~ $aPartOfPath[0] = $sFilePath ;~ EndIf $aPartOfPath[0] = $sFilePath ; C:\Windows\System32\etc\hosts.exe $sDrive = $aPartOfPath[1] ; C: $sFullPathDir = $aPartOfPath[1] & $aPartOfPath[2] ; C:\Windows\System32\etc If StringLeft($aPartOfPath[2], 1) == "/" Then $sDirPath = StringRegExpReplace($aPartOfPath[2], "\h*[\/\\]+\h*", "\/") Else $sDirPath = StringRegExpReplace($aPartOfPath[2], "\h*[\/\\]+\h*", "\\") EndIf $aPartOfPath[2] = $sFullPathDir ; C:\Windows\System32\etc $sDirName=StringReplace($sDirPath,"\","") $sDirName=StringReplace($sDirPath,"/","") $sFileName = $aPartOfPath[3] ; hosts $aPartOfPath[5] = $sFileName ; hosts $sExtension = $aPartOfPath[4] ; .exe $aPartOfPath[7] = $sExtension ; .exe $aPartOfPath[3] = $sDirPath ; \Windows\System32\etc\ $aPartOfPath[4] = $sDirName ; etc $aPartOfPath[6] = $sFileName & $sExtension ; hosts.exe $sFileNameExt = $aPartOfPath[6] ; hosts.exe $sExt = StringReplace($sExtension,".","") ; exe $aPartOfPath[8] = $sExt ; exe Return $aPartOfPath EndFunc ;==>_PathSplitByRef  
    • By hawkair
      Hi
      I am trying to insert line numbers in to a string
      with this script
      Func _MyInc () Static Local $i = 0 $i += 1 Return $i EndFunc Exit _InsertLines() Func _InsertLines()     $String = "A" & @CRLF & "B" & @CRLF & "C" & @CRLF & "D" $NewString =  Execute("'" & StringRegExpReplace($String,"[\r\n]*",  "' & _MyInc () & '\1" ) & "'") MsgBox (0, "", $NewString) EndFunc but I get this:
      1A23B45C67D8
      I never really could master how Execute works here and I always get some working example and make substitutions.
      But this is the closest i could get...
       
×
×
  • Create New...