fs1234

Change characters in a string with StringRegExpReplace


Hi,

I would like to replace the Hungarian accented characters in a string with their unaccented equivalents, but I can't figure out how to do it.

Help, please.

 

#include <MsgBoxConstants.au3>

Local $sInput = "Árvíztűrő tükörfúrógép"
Local $sOutput = StringRegExpReplace($sInput, "(?-i)(á)|(Á)|(é)|(É)|(í)|(Í)|(ó)|(Ó)|(ö)|(Ö)|(ő)|(Ő)|(ú)|(Ú)|(ü)|(Ü)|(ű)|(Ű)", "(?1a)(?2A)(?3e)(?4E)(?5i)(?6I)(?7o)(?8O)(?9o)(?10O)(?11o)(?12O)(?13u)(?14U)(?15u)(?16U)(?17u)(?18U)")
Display($sInput, $sOutput)


Func Display($sInput, $sOutput)
    ; Format the output.
    Local $sMsg = StringFormat("Input:\t%s\n\nOutput:\t%s", $sInput, $sOutput)
    MsgBox($MB_SYSTEMMODAL, "Results", $sMsg)
EndFunc   ;==>Display

 


Your regex can't do that: the replacement string uses conditional syntax such as (?1a), which StringRegExpReplace does not support.

You need to use this:

Removing accents from Unicode text boils down to converting the string to normalization form D (or KD) and then removing all diacritic and/or modifier codepoints. In your case, removing the combining diacritics works fine.

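#include <MsgBoxConstants.au3>

; Unicode Normalization Forms
Global Enum $UNF_NormC = 1, $UNF_NormD, $UNF_NormKC = 5, $UNF_NormKD

; Returns $sIn converted to the requested normalization form.
; On failure, sets @error and returns $sIn unchanged.
Func _UNF_Change($sIn, $iForm)
    If $iForm = $UNF_NormC Or $iForm = $UNF_NormD Or $iForm = $UNF_NormKC Or $iForm = $UNF_NormKD Then
        ; First call: no output buffer, returns an estimate of the required buffer size.
        Local $aRet = DllCall("Normaliz.dll", "int", "NormalizeString", "int", $iForm, "wstr", $sIn, "int", -1, "ptr", 0, "int", 0)
        If @error Or $aRet[0] <= 0 Then Return SetError(2, 0, $sIn)
        Local $tOut = DllStructCreate("wchar[" & $aRet[0] & "]")
        ; Second call: performs the actual conversion into $tOut.
        $aRet = DllCall("Normaliz.dll", "int", "NormalizeString", "int", $iForm, "wstr", $sIn, "int", -1, "ptr", DllStructGetPtr($tOut, 1), "int", $aRet[0])
        If @error Or $aRet[0] <= 0 Then Return SetError(3, 0, $sIn)
        Return DllStructGetData($tOut, 1)
    Else
        ; Unknown form: hand the input back unchanged with @error = 1.
        Return SetError(1, 0, $sIn)
    EndIf
EndFunc   ;==>_UNF_Change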


Local $sInput = "Árvíztűrő tükörfúrógép"
Local $sFormD = _UNF_Change($sInput, $UNF_NormD)
Local $sOutput = StringRegExpReplace($sFormD, "(*UCP)\p{Mn}", "")
Display($sInput, $sOutput)


Func Display($sInput, $sOutput)
    ; Format the output.
    Local $sMsg = StringFormat("Input:\t%s\n\nOutput:\t%s", $sInput, $sOutput)
    MsgBox($MB_SYSTEMMODAL, "Results", $sMsg)
EndFunc   ;==>Display

In case there are other codepoints of category Mn (nonspacing mark) that you don't want removed, you can restrict removal to the Combining Diacritical Marks block (U+0300 to U+036F), which holds the diacritics commonly used with Latin letters.

Replace the regex line with this one:

Local $sOutput = StringRegExpReplace($sFormD, "(*UCP)[\x{300}-\x{36F}]", "")

 

Edited by jchd

This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)


For current and future interested readers, here's a bit of explanation.

In Unicode, many (but not all) characters with diacritic signs (so-called "accents") can be represented in a string by essentially two forms, called "normalization forms". For instance the characters Â and Ç can be represented by either sequences of form C (composed characters) or form D (decomposed characters):
Â in form C = 'Â' (U+00C2)
Â in form D = 'A' (U+0041) followed by the combining circumflex accent (U+0302)
Ç in form C = 'Ç' (U+00C7)
Ç in form D = 'C' (U+0043) followed by the combining cedilla (U+0327)

The idea behind this method of unaccenting Latin characters is to convert the string into form D and then remove the combining marks specifically.

This works for most Latin scripts but not in general. Human scripts are very complex and subtle, thus Unicode itself has to be complex as well.
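
To illustrate the "not in general" caveat: some Latin letters, such as ø (U+00F8) and ł (U+0142), have no canonical decomposition, so form D leaves them untouched and there is no combining mark to strip. Below is a minimal sketch of that, not a definitive implementation. `_ToFormD` is a hypothetical helper name; it wraps the same NormalizeString call used earlier in this thread, with 2 being the value that selects form D.

```autoit
; Hypothetical helper: converts $sIn to form D via Normaliz.dll (2 = form D).
; On failure, sets @error and returns $sIn unchanged.
Func _ToFormD($sIn)
    Local $aRet = DllCall("Normaliz.dll", "int", "NormalizeString", "int", 2, "wstr", $sIn, "int", -1, "ptr", 0, "int", 0)
    If @error Or $aRet[0] <= 0 Then Return SetError(1, 0, $sIn)
    Local $tOut = DllStructCreate("wchar[" & $aRet[0] & "]")
    $aRet = DllCall("Normaliz.dll", "int", "NormalizeString", "int", 2, "wstr", $sIn, "int", -1, "ptr", DllStructGetPtr($tOut), "int", $aRet[0])
    If @error Or $aRet[0] <= 0 Then Return SetError(2, 0, $sIn)
    Return DllStructGetData($tOut, 1)
EndFunc   ;==>_ToFormD

; ChrW() keeps the script independent of the source file encoding.
Local $sIn = ChrW(0x00E9) & ChrW(0x00F8) & ChrW(0x0142) ; e-acute, o-slash, l-stroke
Local $sOut = StringRegExpReplace(_ToFormD($sIn), "(*UCP)[\x{300}-\x{36F}]", "")

; Print the codepoints that remain after stripping combining marks.
; Only the first character is expected to lose its accent (plain "e");
; the other two have no decomposition and should come through unchanged.
For $i = 1 To StringLen($sOut)
    ConsoleWrite("U+" & Hex(AscW(StringMid($sOut, $i, 1)), 4) & @CRLF)
Next
```

For letters like these, a small explicit replacement table applied after the normalization step is one possible workaround.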

From the above you can infer that the notion of a character in Unicode isn't as simple as it was with codepages (ANSI, Windows, you-name-it). For instance, decomposed "characters" (form D) may use several codepoints, each of which counts individually, so StringLen() of Ç in form D returns 2. Unicode strings are almost always in form C, as this makes it simpler to count "characters" (in a sense) and shorter to represent.
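
The length difference is easy to check. This short sketch builds both representations of Ç with ChrW(), so the result doesn't depend on how the script file itself is encoded. The variable names are only illustrative.

```autoit
; Same visible character, two representations.
Local $sFormC = ChrW(0x00C7)        ; precomposed C with cedilla
Local $sFormD = "C" & ChrW(0x0327)  ; "C" followed by the combining cedilla

; StringLen() counts each codepoint here, not what the eye sees.
ConsoleWrite("Form C length: " & StringLen($sFormC) & @CRLF) ; 1
ConsoleWrite("Form D length: " & StringLen($sFormD) & @CRLF) ; 2
```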

Another useful concept in Unicode is the "extended grapheme cluster". That is a series of codepoints, some characters and some modifiers, that the rendering system represents together as a single unit. Scripts that use complex extended grapheme clusters include Arabic, Hebrew, Thai, the Indic scripts, and others. For examples and differences between decomposed characters and extended grapheme clusters, see this Unicode page to get a feeling for how complex representing human scripts can really be.

Btw, AutoIt's regex implementation (PCRE1) has an escape sequence for matching an extended grapheme cluster: \X, which matches as many codepoints as needed to cover everything that the Unicode rendering engine will draw as a single (complex) glyph (what you, as a user, perceive as a "character").
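
As a rough sketch of how \X can be used to count user-perceived characters rather than codepoints (the sample string is made up for illustration):

```autoit
; "C" + combining cedilla, then a plain "a":
; 3 codepoints, but only 2 user-perceived characters.
Local $sText = "C" & ChrW(0x0327) & "a"

; Flag 3 asks StringRegExp() for an array holding every match of the pattern.
Local $aClusters = StringRegExp($sText, "\X", 3)

ConsoleWrite("Codepoints: " & StringLen($sText) & @CRLF)
If IsArray($aClusters) Then
    ConsoleWrite("Grapheme clusters: " & UBound($aClusters) & @CRLF)
EndIf
```

With this input the two counts should come out as 3 and 2 respectively, since \X is expected to consume the base letter together with its combining mark.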

Finally, please realize that all the notions discussed above are independent of the Unicode encoding in use. By analogy with the OSI layers, Unicode text can be viewed as a stack of levels:
bit: not a very useful level
encoding unit: byte for UTF-8, 16-bit word for UTF-16, 32-bit dword for UTF-32; byte order conventional or explicit
codepoint: takes one or more encoding units, depending on the encoding
extended grapheme cluster: as many codepoints as needed to designate a glyph
font: set of drawing rules used to draw a glyph
rendered glyph: graphical output of what we humans visually perceive as a "character", produced by the rendering engine
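
To make the encoding-unit and codepoint levels concrete, here is a small sketch that encodes both forms of Ç and counts the resulting bytes. It relies only on the built-in StringToBinary() and BinaryLen(); the variable names are illustrative.

```autoit
#include <StringConstants.au3>

Local $sFormC = ChrW(0x00C7)        ; 1 codepoint
Local $sFormD = "C" & ChrW(0x0327)  ; 2 codepoints

; BinaryLen() counts bytes: UTF-8 encoding units are bytes,
; UTF-16 encoding units are 2 bytes each.
ConsoleWrite("Form C: " & BinaryLen(StringToBinary($sFormC, $SB_UTF8)) & " bytes in UTF-8, " & _
        BinaryLen(StringToBinary($sFormC, $SB_UTF16LE)) & " bytes in UTF-16" & @CRLF) ; 2 and 2
ConsoleWrite("Form D: " & BinaryLen(StringToBinary($sFormD, $SB_UTF8)) & " bytes in UTF-8, " & _
        BinaryLen(StringToBinary($sFormD, $SB_UTF16LE)) & " bytes in UTF-16" & @CRLF) ; 3 and 4
```

The same glyph therefore has a different size at every level, depending on both the normalization form and the encoding.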

I told you Unicode wasn't trivial, didn't I?


