
How to replace signed string to unsigned UTF8 format



I'm Vietnamese so the words in my string would be like this:

'Nếu tôi là cậu thì sẽ biết mệt mỏi'

I want to convert this string to:

'Neu toi la cau thi se biet met moi'

Of course I could use StringReplace() when both strings are known in advance, but that is not possible when the string is arbitrary (set by the user). StringRegExpReplace() looks like it could do the replacement.

I tried turning it into a function, but it doesn't return anything:

Func _SStringReplace($text)

StringRegExpReplace($text, '[ă â ắ ặ ẵ ẳ]', 'a')

Endfunc

Please help me


Use this to unaccent your strings:

; Unicode Normalization Forms
Global Enum $UNF_NormC = 1, $UNF_NormD, $UNF_NormKC = 5, $UNF_NormKD



Func _UNF_Change($sIn, $iForm)
    If $iForm = $UNF_NormC Or $iForm = $UNF_NormD Or $iForm = $UNF_NormKC Or $iForm = $UNF_NormKD Then
        Local $aRet = DllCall("Normaliz.dll", "int", "NormalizeString", "int", $iForm, "wstr", $sIn, "int", -1, "ptr", 0, "int", 0)
        Local $tOut = DllStructCreate("wchar[" & 2 * ($aRet[0] + 20) & "]")
        $aRet = DllCall("Normaliz.dll", "int", "NormalizeString", "int", $iForm, "wstr", $sIn, "int", -1, "ptr", DllStructGetPtr($tOut, 1), "int", 2 * ($aRet[0] + 20))
        Return DllStructGetData($tOut, 1)
    Else
        Return SetError(1, 0, $sIn) ; unknown form: flag the error and hand back the input unchanged
    EndIf
EndFunc   ;==>_UNF_Change


Func _Unaccent($s, $iMode = 0)
    Local Static $aPat = [ _
            "(*UCP)[\x{300}-\x{36F}`'¨^¸¯]", _    ; $iMode = 0 : remove combining accents only
            "(*UCP)\p{Mn}|\p{Lm}|\p{Sk}" _        ; $iMode = 1 :     "       "       "    and modifying letters
            ]
    Return StringRegExpReplace(_UNF_Change($s, $UNF_NormD), $aPat[Mod($iMode, 2)], "")
EndFunc   ;==>_Unaccent

Local $s = 'Nếu tôi là cậu thì sẽ biết mệt mỏi'
Local $t = _Unaccent($s)
MsgBox(0, "", $s & @LF & $t)

This has nothing to do with UTF8, let alone unsigned UTF8 (which doesn't make sense).

This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)


Explanation:

The Unicode character set includes a (large) number of already accented letters, like à, î, Õ, ç, etc., each of which has been assigned one Unicode codepoint. Some languages use accented letters that aren't in the character set: to form these letters, the base letter is followed by one or more accents or modifying letters. When such a string is displayed, the font renderer parses the codepoints including all diacritics and draws the whole thing as one letter.

Unicode strings can adopt one of several Normalization Forms (see https://unicode.org/reports/tr15/ for gory details). In short, characters like letters can be decorated with a number of diacritics or modifying letters, like accents or strokes. An "all-in-one" character like Ñ can be represented in a string either by the single codepoint 0x00D1 in normalization form C or by the sequence 0x004E 0x0303 in form D, respectively N and the combining tilde.
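To make the two representations of Ñ concrete, here is a minimal sketch that builds both by hand with ChrW() and compares them. The variable names are only illustrative; no normalization is performed here, the strings are simply assembled from the codepoints mentioned above.

```autoit
; Ñ as one precomposed codepoint (what form C uses)
Local $sComposed = ChrW(0x00D1)
; N followed by the combining tilde (what form D uses)
Local $sDecomposed = "N" & ChrW(0x0303)

ConsoleWrite(StringLen($sComposed) & @LF)           ; 1
ConsoleWrite(StringLen($sDecomposed) & @LF)         ; 2
; == compares the strings codepoint by codepoint, so they are not equal
ConsoleWrite(($sComposed == $sDecomposed) & @LF)    ; False
```

Both strings render as the same letter, which is why comparing or searching text usually calls for normalizing both sides to the same form first.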

The idea of the above script is to transform the input string into normalization form D (decomposed characters), then use a regular expression to remove the diacritics. You're then left with base letters only.

Run this to see the decomposed form of the Vietnamese string in question:

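#include <Array.au3> ; needed for _ArrayDisplay(); _UNF_Change() and $UNF_NormD come from the script above

Local $s = 'Nếu tôi là cậu thì sẽ biết mệt mỏi'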
Local $a = StringToASCIIArray(_UNF_Change($s, $UNF_NormD))
For $i = 0 To UBound($a) - 1
    $a[$i] = StringFormat("0x%04X", $a[$i])
Next
_ArrayDisplay($a)

These diacritics are codepoints in the range 0x0300..0x036F.
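As a small illustration of why that range matters, the sketch below assembles the decomposed form of "ế" by hand (so it does not depend on _UNF_Change) and strips every codepoint in 0x0300..0x036F with a simplified version of the character class used in _Unaccent. The variable names are made up for this example.

```autoit
; "ế" in decomposed form: e + combining circumflex (U+0302) + combining acute (U+0301)
Local $sDecomposed = "e" & ChrW(0x0302) & ChrW(0x0301)

; Remove every combining diacritic in the range 0x0300..0x036F
Local $sBase = StringRegExpReplace($sDecomposed, "[\x{300}-\x{36F}]", "")

ConsoleWrite(StringLen($sDecomposed) & " -> " & StringLen($sBase) & @LF) ; 3 -> 1
ConsoleWrite($sBase & @LF)                                               ; e
```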

 

You can display fancy letters by appending diacritics to a letter, even if that combination is not used on Earth: this is a small Latin letter "o" with a "combining candrabindu", a "combining ring below" and a "combining long solidus overlay". I bet this glyph isn't in use. I formed it with "o" & ChrW(0x310) & ChrW(0x325) & ChrW(0x338). How it looks depends on deep details of the font renderer. Although the thing below is a sequence of 4 Unicode codepoints, it is uninterruptible and can be selected only as one unit.

o̸̥̐

Hope this sheds some light on what's under the hood.

44 minutes ago, jchd said:

The idea of the above script is to transform the input string into normalization form D (decomposed characters), then use a regular expression to remove the diacritics. You're then left with base letters only.

Does that mean that the resulting unaccented string might seem arbitrary, depending upon whether the source contained an “all-in-one” character or an aggregate character?

Code hard, but don’t hard code...


On the contrary. Decomposing composed characters is a complex but formal process obeying strict rules. The reverse process, transforming form D to form C, is symmetric, so you can round-trip from form C to form D and back to form C verbatim. Whatever combination of diacritics you add to "A", the base letter still remains "A". "À" (a precomposed Unicode character) decomposes to "A" followed by U+0300 (combining grave accent). Hence there is nothing arbitrary there.

Other normalization forms (those with a K) operate on compatibility [de]composition, while those without K are based on canonical (strict) equivalence. Compatibility means that the transformation is not always reversible. For instance, the fraction character ¼ is compatibility-decomposed into the three characters 1, ⁄ (U+2044, fraction slash) and 4, which reads like the ASCII 1/4.
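To see the difference between canonical and compatibility decomposition on ¼, here is a rough sketch. _NormSketch is a hypothetical helper written only for this example (a trimmed-down variant of _UNF_Change from the first answer, with an arbitrary 16-character safety margin on the buffer); the form values 2 and 6 correspond to $UNF_NormD and $UNF_NormKD in the Enum above.

```autoit
; Hypothetical helper for this sketch only: 2 = form D, 6 = form KD
Func _NormSketch($sIn, $iForm)
    ; A first call with a null buffer returns an estimate of the required length
    Local $aRet = DllCall("Normaliz.dll", "int", "NormalizeString", "int", $iForm, _
            "wstr", $sIn, "int", -1, "ptr", 0, "int", 0)
    If @error Or $aRet[0] <= 0 Then Return SetError(1, 0, $sIn)
    Local $iLen = $aRet[0] + 16 ; estimate plus an arbitrary margin
    Local $tOut = DllStructCreate("wchar[" & $iLen & "]")
    ; The second call writes the normalized string into the buffer
    $aRet = DllCall("Normaliz.dll", "int", "NormalizeString", "int", $iForm, _
            "wstr", $sIn, "int", -1, "ptr", DllStructGetPtr($tOut), "int", $iLen)
    If @error Or $aRet[0] <= 0 Then Return SetError(2, 0, $sIn)
    Return DllStructGetData($tOut, 1)
EndFunc

Local $sQuarter = ChrW(0x00BC) ; the fraction character ¼
Local $sNFD = _NormSketch($sQuarter, 2)  ; canonical: ¼ has no canonical decomposition
Local $sNFKD = _NormSketch($sQuarter, 6) ; compatibility: U+0031 U+2044 U+0034 per the Unicode data

; Print the codepoints of both results in hex for comparison
For $sOut In StringSplit($sNFD & "|" & $sNFKD, "|", 2)
    Local $aCp = StringToASCIIArray($sOut)
    For $i = 0 To UBound($aCp) - 1
        ConsoleWrite(StringFormat("0x%04X ", $aCp[$i]))
    Next
    ConsoleWrite(@LF)
Next
```

Recomposing the form KD result (with form KC) does not bring ¼ back, which is the sense in which compatibility normalization is not reversible.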


Thanks for helping me. I have just returned to my hometown and am currently quarantined because of the COVID epidemic. Seeing everyone's code, I can't wait to get to work on this right away.

$text = 'Nếu tôi là cậu thì sẽ biết mệt mỏi'
Msgbox(0,0, _Removemark($text))
Func _Removemark($s)
Local $a = StringToASCIIArray(_UNF_Change($s, $UNF_NormD))
For $i = 0 To UBound($a) - 1
    $a[$i] = StringFormat("0x%04X", $a[$i])
Next
EndFunc

Is it okay to wrap it in a function and pass data like that? I've noticed that when I wrap String functions in a function of my own to pass data, the function often returns nothing. I don't have a PC right now 😰


The code you've pasted in your _Removemark function doesn't do anything practical: it's preparing the display of codepoints in hex but doesn't display anything.

To remove diacritics, use the _Unaccent() function posted above.

You must use either Return $s or make the $s parameter ByRef to make changes effective at the end of your function.
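A minimal sketch of the three cases, assuming nothing beyond built-in functions: the function names are made up for illustration, and the pattern only covers â and ă to keep the example short. The point is that StringRegExpReplace() returns a new string and leaves its argument untouched, so the result has to be returned or assigned.

```autoit
; Result is computed, then thrown away: the function returns the default 0
Func _StripNoReturn($sText)
    StringRegExpReplace($sText, "[\x{E2}\x{103}]", "a")
EndFunc

; Result is handed back to the caller
Func _StripWithReturn($sText)
    Return StringRegExpReplace($sText, "[\x{E2}\x{103}]", "a")
EndFunc

; Result is written back into the caller's variable
Func _StripByRef(ByRef $sText)
    $sText = StringRegExpReplace($sText, "[\x{E2}\x{103}]", "a")
EndFunc

Local $sIn = "c" & ChrW(0x00E2) & "u" ; "câu"
ConsoleWrite(_StripNoReturn($sIn) & @LF)   ; 0
ConsoleWrite(_StripWithReturn($sIn) & @LF) ; cau
_StripByRef($sIn)
ConsoleWrite($sIn & @LF)                   ; cau
```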


You can do better in SQL, especially with SQLite!

You can build the SQLite DLL with ICU (full Unicode) support, or you can [auto]load the extension separately.
Or you can use my unifuzz.dll extension, which offers many string functions, such as fuzzy search and more.

Both ways include collation functions, among many other goodies.

Edit: x86 binary code + C source can be downloaded via this post:

 

