Loc Posted July 27, 2021

I'm Vietnamese, so the words in my strings look like this: 'Nếu tôi là cậu thì sẽ biết mệt mỏi'

I want to convert this string to: 'Neu toi la cau thi se biet met moi'

Of course I could use StringReplace() to map one fixed string to the other, but that isn't possible when the string is arbitrary. StringRegExpReplace() should be able to do the replacement. I tried wrapping it in a function, but it doesn't return anything:

Func _SStringReplace($text)
    StringRegExpReplace($text, '[ă â ắ ặ ẵ ẳ]', 'a')
EndFunc

Please help me.
jchd Posted July 27, 2021

Use this to unaccent your strings:

; Unicode Normalization Forms
Global Enum $UNF_NormC = 1, $UNF_NormD, $UNF_NormKC = 5, $UNF_NormKD

Func _UNF_Change($sIn, $iForm)
    If $iForm = $UNF_NormC Or $iForm = $UNF_NormD Or $iForm = $UNF_NormKC Or $iForm = $UNF_NormKD Then
        ; First call with a null buffer: returns an estimate of the required length
        Local $aRet = DllCall("Normaliz.dll", "int", "NormalizeString", "int", $iForm, "wstr", $sIn, "int", -1, "ptr", 0, "int", 0)
        Local $tOut = DllStructCreate("wchar[" & 2 * ($aRet[0] + 20) & "]")
        ; Second call: write the normalized string into the buffer
        $aRet = DllCall("Normaliz.dll", "int", "NormalizeString", "int", $iForm, "wstr", $sIn, "int", -1, "ptr", DllStructGetPtr($tOut, 1), "int", 2 * ($aRet[0] + 20))
        Return DllStructGetData($tOut, 1)
    Else
        ; Unknown form: flag the error and hand back the input unchanged
        Return SetError(1, 0, $sIn)
    EndIf
EndFunc   ;==>_UNF_Change

Func _Unaccent($s, $iMode = 0)
    Local Static $aPat = [ _
            "(*UCP)[\x{300}-\x{36F}`'¨^¸¯]", _    ; $iMode = 0 : remove combining accents only
            "(*UCP)\p{Mn}|\p{Lm}|\p{Sk}" _        ; $iMode = 1 : " " " and modifying letters
            ]
    Return StringRegExpReplace(_UNF_Change($s, $UNF_NormD), $aPat[Mod($iMode, 2)], "")
EndFunc   ;==>_Unaccent

Local $s = 'Nếu tôi là cậu thì sẽ biết mệt mỏi'
Local $t = _Unaccent($s)
MsgBox(0, "", $s & @LF & $t)

This has nothing to do with UTF8, let alone unsigned UTF8 (which doesn't make sense).

This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.
SQLitespeed is another feature-rich premier SQLite manager (includes import/export).
Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)
jchd Posted July 28, 2021

Explanation:

The Unicode character set includes a (large) number of precomposed accented letters, like à, î, Õ, ç, etc., each of which has been assigned one Unicode codepoint. Some languages use accented letters that aren't in the character set: to form these letters, the base letter is followed by one or more accents or modifying letters. When such a string is displayed, the font renderer parses the codepoints including all diacritics and draws the whole thing as one letter.

Unicode strings can adopt one of several Normalization Forms (see https://unicode.org/reports/tr15/ for gory details). In short, characters like letters can be decorated with a number of diacritics or modifying letters, like accents or strokes. An "all-in-one" character like Ñ can be represented in a string either by the single codepoint 0x00D1 in normalization form C or by the sequence 0x004E 0x0303 in form D, respectively N and the combining tilde.

The idea of the above script is to transform the input string into normalization form D (decomposed characters), then use a regular expression to remove the diacritics. You're then left with base letters only.

Run this to see the decomposed form of the Vietnamese string in question:

#include <Array.au3> ; needed for _ArrayDisplay

Local $s = 'Nếu tôi là cậu thì sẽ biết mệt mỏi'
Local $a = StringToASCIIArray(_UNF_Change($s, $UNF_NormD))
For $i = 0 To UBound($a) - 1
    $a[$i] = StringFormat("0x%04X", $a[$i])
Next
_ArrayDisplay($a)

The diacritics are codepoints in the range 0x0300..0x036F.

You can display fancy letters by appending diacritics to a letter, even if that combination is not used on Earth: this is a small Latin letter "o" with a "combining candrabindu", a "combining ring below" and a "combining long solidus overlay". I bet this glyph isn't in use. I formed it with "o" & ChrW(0x310) & ChrW(0x325) & ChrW(0x338). How it looks depends on deep details of the font renderer.
The thing below can only be selected as one unit, even though it is an uninterrupted sequence of 4 Unicode codepoints.

o̸̥̐

Hope this sheds some light on what's under the hood.
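The two representations of Ñ described above can be built by hand to see the difference. This is only a minimal sketch: the variable names are made up for illustration, and the pattern covers just the 0x0300..0x036F range mentioned in the post.

```autoit
; Two ways to store the letter Ñ in a string.
Local $sComposed = ChrW(0x00D1)            ; one precomposed codepoint (form C)
Local $sDecomposed = "N" & ChrW(0x0303)    ; base letter N + combining tilde (form D)

; Both render as the same letter, yet the strings have different lengths.
Local $sLengths = StringLen($sComposed) & " and " & StringLen($sDecomposed)

; Removing codepoints in the combining range leaves only the base letter.
Local $sBase = StringRegExpReplace($sDecomposed, "(*UCP)[\x{300}-\x{36F}]", "")

MsgBox(0, "", "Lengths: " & $sLengths & @LF & "Base letter: " & $sBase)
```

The same replacement does nothing to $sComposed, because the precomposed codepoint 0x00D1 is outside the combining range. That is why the script normalizes to form D before applying the regular expression.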
JockoDundee Posted July 28, 2021

44 minutes ago, jchd said:
The idea of the above script is to transform the input string in normalization form D (decomposed characters) then use a regular expression to remove diacritics. You're then left with base letters only.

Does that mean that the resulting unaccented string might seem arbitrary, depending upon whether the source contained an “all-in-one” character or an aggregate character?

Code hard, but don’t hard code...
jchd Posted July 28, 2021

On the contrary. Decomposing composed characters is a complex but formal process obeying strict rules. The reverse process, transforming form D to form C, is symmetric, so you can round-trip from form C to form D to form C verbatim.

Whatever combination of diacritics you add to "A", the base letter still remains "A". "À" (a precomposed Unicode character) decomposes to "A" followed by the combining grave accent U+0300. Hence there is nothing arbitrary there.

Other normalization forms (those with a K) operate on compatibility [de]composition, while those without K are based on canonical (strict) equivalence. Compatibility means that things are not always reversible. For instance, the fraction character ¼ is compatibility-decomposed into 1⁄4, i.e. the digits 1 and 4 around U+2044 FRACTION SLASH.
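To check this on your own machine, here is a hedged sketch. _Normalize is a hypothetical helper condensed from _UNF_Change above, and the form values 1 (C) and 2 (D) are the ones used in that script's enum. It assumes Normaliz.dll is available, as the original script does.

```autoit
; Hypothetical helper, condensed from _UNF_Change: $iForm is 1 for form C, 2 for form D.
Func _Normalize($sIn, $iForm)
    ; A first call with a null buffer returns an estimate of the required length.
    Local $aRet = DllCall("Normaliz.dll", "int", "NormalizeString", "int", $iForm, "wstr", $sIn, "int", -1, "ptr", 0, "int", 0)
    If @error Or $aRet[0] <= 0 Then Return SetError(1, 0, $sIn)
    Local $iSize = $aRet[0] + 20
    Local $tOut = DllStructCreate("wchar[" & $iSize & "]")
    ; The second call writes the normalized string into the buffer.
    $aRet = DllCall("Normaliz.dll", "int", "NormalizeString", "int", $iForm, "wstr", $sIn, "int", -1, "ptr", DllStructGetPtr($tOut), "int", $iSize)
    If @error Or $aRet[0] <= 0 Then Return SetError(2, 0, $sIn)
    Return DllStructGetData($tOut, 1)
EndFunc   ;==>_Normalize

Local $sComposed = ChrW(0x00C0)            ; À as one precomposed codepoint
Local $sAggregate = "A" & ChrW(0x0300)     ; A followed by the combining grave accent

; Decompose both inputs, then recompose the first one.
Local $sD1 = _Normalize($sComposed, 2)
Local $sD2 = _Normalize($sAggregate, 2)
Local $sBack = _Normalize($sD1, 1)

; == is AutoIt's case-sensitive string comparison.
MsgBox(0, "", "Same form D: " & ($sD1 == $sD2) & @LF & "Round trip C -> D -> C: " & ($sBack == $sComposed))
```

If normalization behaves as described in the post, both comparisons report True: the "all-in-one" and the aggregate input end up as the same form D string, so the unaccented result does not depend on which one the source contained.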
Loc (Author) Posted July 29, 2021

Thanks for helping me. I have just returned to my hometown and am currently quarantined because of the COVID epidemic. Seeing the code you all posted, I can't wait to get to work right away.
Loc (Author) Posted July 29, 2021 (edited)

$text = 'Nếu tôi là cậu thì sẽ biết mệt mỏi'
MsgBox(0, 0, _Removemark($text))

Func _Removemark($s)
    Local $a = StringToASCIIArray(_UNF_Change($s, $UNF_NormD))
    For $i = 0 To UBound($a) - 1
        $a[$i] = StringFormat("0x%04X", $a[$i])
    Next
EndFunc

Is it okay to wrap it in a function and pass data like that? I've noticed that when I wrap many String functions in a function of my own to pass data, it returns nothing. I don't have a PC right now 😰

Edited July 29, 2021 by Loc
jchd Posted July 29, 2021

The code you've pasted in your _Removemark function doesn't do anything practical: it prepares the display of codepoints in hex but doesn't display anything. To remove diacritics, use the _Unaccent function I posted above.

You must use either Return $s or make the $s parameter ByRef to make changes effective at the end of your function.
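The two options named above can be sketched as follows. The function names _StripA_Return and _StripA_ByRef are made up for illustration, the character class only covers the few precomposed variants of "a" from the first post, and the script is assumed to be saved in a Unicode encoding that AutoIt reads (for example UTF-8 with BOM).

```autoit
; Option 1: hand the result back with Return.
; A user-defined function without Return yields 0, not the new string.
Func _StripA_Return($sText)
    Return StringRegExpReplace($sText, "[ăâắặẵẳ]", "a")
EndFunc   ;==>_StripA_Return

; Option 2: declare the parameter ByRef and overwrite the caller's variable.
Func _StripA_ByRef(ByRef $sText)
    $sText = StringRegExpReplace($sText, "[ăâắặẵẳ]", "a")
EndFunc   ;==>_StripA_ByRef

Local $sInput = "ăn mặn"
Local $sCopy = _StripA_Return($sInput)    ; $sInput itself is left untouched
_StripA_ByRef($sInput)                    ; $sInput is now modified in place
MsgBox(0, "", $sCopy & @LF & $sInput)
```

Note that the character class here contains no spaces. Inside [...] a space is a literal character, so the pattern '[ă â ắ ặ ẵ ẳ]' from the first post would also replace every space with "a".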
Skysnake Posted July 29, 2021

@jchd Brilliant

Do you perhaps know if this can be replicated in SQL? (just point me in the right direction)

Skysnake
Why is the snake in the sky?
jchd Posted July 29, 2021 Posted July 29, 2021 (edited) You can have better in SQL, especially with SQLite! You can build the SQLite DLL with ICU (full Unicode) support or you can [auto]load the extension separately. Or you can use my unifuzz.dll extension which offer many string functions like a fuzzy search and more. Both ways include collation functions, among many other goodies. Edit: x86 binary code + C source can be downloaded via this post: Edited July 29, 2021 by jchd Add link to old post This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe hereRegExp tutorial: enough to get startedPCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta. SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)