
Removing emojis from text string


Solved by photonbuddy


Hi All,

I am writing a script that I use to save an image from a Reddit post. As most of these save as a random string of letters, my script takes the post title (from the window title of the browser), and uses that as a filename.

Problem is, while Windows will save and display the emojis in file explorer, my image viewer (ACDSee - very old pre-bloat version) can't display the file.

How do I process the string to remove all emojis?

Thanks for any help.


Looks like AutoIt replaces the emojis with ??.

Local $a = 'Test Title 😭'

ConsoleWrite($a)

Output:

Quote

Test Title ??

Maybe try and StringReplace the question marks with nothing? I guess one of the downsides is it would remove legitimate question marks, should it work at all.

Edited by Luke94

ConsoleWrite() silently "converts" Unicode text to ANSI, replacing almost all non-ANSI characters by question marks. This works poorly with non-Latin scripts. CW() below is a homebrew Unicode-aware replacement for ConsoleWrite() (its definition is not included in this post):

; Mixed language strings
$s = "Большая проблема  大问题  बड़ी समस्या  مشكلة كبيرة"
CW($s)
ConsoleWrite($s & @LF)


; A family whose members have different Fitzpatrick skin-tone modifiers renders as a single glyph
$s = "Our familly " & ChrW(0xD83D) & ChrW(0xDC68) & ChrW(0xD83C) & ChrW(0xDFFB) & ChrW(0x200D) & ChrW(0xD83D) & ChrW(0xDC69) & ChrW(0xD83C) & ChrW(0xDFFF) & ChrW(0x200D) & ChrW(0xD83D) & ChrW(0xDC66) & ChrW(0xD83C) & ChrW(0xDFFD)
CW($s)
ConsoleWrite($s & @LF)

Result:

Большая проблема  大问题  बड़ी समस्या  مشكلة كبيرة
??????? ????????  ???  ???? ??????  ????? ?????
Our familly 👨🏻‍👩🏿‍👦🏽
Our familly ??????????????
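
The CW() helper is used above without its definition. A minimal sketch of what such a function might look like follows; it is a hypothetical stand-in, not the poster's actual code. It assumes a single-byte ANSI codepage (such as 1252) and a console that displays UTF-8, for example SciTE with output.code.page=65001.

```autoit
; Hypothetical stand-in for the CW() helper used above (not the poster's code).
; Encodes the string as UTF-8 bytes, then reinterprets those bytes as an ANSI
; string so that ConsoleWrite() passes them through instead of writing "?".
Func CW($sText = "")
    Local $dUtf8 = StringToBinary($sText & @LF, 4) ; 4 = $SB_UTF8
    ConsoleWrite(BinaryToString($dUtf8, 1))        ; 1 = $SB_ANSI
EndFunc   ;==>CW

CW("Test Title " & ChrW(0xD83D) & ChrW(0xDE2D))
```

Whether the emoji actually shows up still depends on the console font and encoding settings.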

I don't know which charset this legacy version of ACDSee handles for filenames. You can explicitly remove emojis, or a wider range of Unicode characters, using a regexp.

There is a pitfall, however: AutoIt strings are UCS-2, i.e. UTF-16 restricted to the BMP (Unicode plane 0), with 16-bit code units. Codepoints in planes 1..16 are therefore represented by a pair of surrogate values. For instance, 😭 is stored in an AutoIt string as ChrW(0xD83D) & ChrW(0xDE2D).
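
To make the surrogate arithmetic concrete, here is a small sketch of the standard UTF-16 encoding rule. The helper name _CodepointToString is hypothetical and only serves as an illustration.

```autoit
; Hypothetical helper: returns the AutoIt (UTF-16) string for a Unicode codepoint.
Func _CodepointToString($iCodepoint)
    If $iCodepoint < 0x10000 Then Return ChrW($iCodepoint) ; BMP: a single 16-bit unit
    Local $iOffset = $iCodepoint - 0x10000                 ; 20-bit value
    Local $iHigh = 0xD800 + BitShift($iOffset, 10)         ; top 10 bits -> high surrogate
    Local $iLow = 0xDC00 + BitAND($iOffset, 0x3FF)         ; low 10 bits -> low surrogate
    Return ChrW($iHigh) & ChrW($iLow)
EndFunc   ;==>_CodepointToString

Local $sCry = _CodepointToString(0x1F62D) ; LOUDLY CRYING FACE
; Prints the two 16-bit units in hex: D83D DE2D
ConsoleWrite(Hex(AscW(StringMid($sCry, 1, 1)), 4) & " " & Hex(AscW(StringMid($sCry, 2, 1)), 4) & @LF)
```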

You might think: pretty easy, just use a regexp pattern to match and replace these values, using StringRegExpReplace($s, "[\x{D800}-\x{DFFF}]", "-")

No! That fails because PCRE (the regexp engine used by AutoIt) internally merges the two surrogates into the actual codepoint, here 0x1F62D (LOUDLY CRYING FACE) for 😭, so the pattern never sees the individual surrogate values.

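This replaces each run of plane 1 codepoints (0x10000 to 0x1FFFF, where most emojis live) with an underscore; widening the upper bound to \x{10FFFF} should cover every non-BMP codepoint: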

$s = "Большая проблема  大问题  बड़ी समस्या  مشكلة كبيرة Test Title 😭" & @LF
$s &= "Our familly " & ChrW(0xD83D) & ChrW(0xDC68) & ChrW(0xD83C) & ChrW(0xDFFB) & ChrW(0x200D) & ChrW(0xD83D) & ChrW(0xDC69) & ChrW(0xD83C) & ChrW(0xDFFF) & ChrW(0x200D) & ChrW(0xD83D) & ChrW(0xDC66) & ChrW(0xD83C) & ChrW(0xDFFD)

CW($s)
$t = StringRegExpReplace($s, "[\x{10000}-\x{1FFFF}]+", "_")
CW($t)

Result:

Большая проблема  大问题  बड़ी समस्या  مشكلة كبيرة Test Title 😭
Our familly 👨🏻‍👩🏿‍👦🏽
Большая проблема  大问题  बड़ी समस्या  مشكلة كبيرة Test Title _
Our familly _‍_‍_

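Note that the last line contains three "people" joined by ChrW(0x200D) [ZERO WIDTH JOINER]. The joiner is a BMP character, so the pattern does not match it and it splits the sequence into three runs, hence three underscores.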
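
For building a filename, the joiners can be removed along with the emojis. The following is a minimal sketch under stated assumptions: _StripEmoji is a hypothetical helper that does not appear in the thread, and the character class is only an approximation of "emoji". It misses emojis that live in the BMP, such as U+2764 HEAVY BLACK HEART.

```autoit
; Hypothetical helper, not from the thread: removes non-BMP codepoints together
; with the joiners/selectors that glue emoji sequences, then tidies the spaces.
Func _StripEmoji($sText)
    ; \x{10000}-\x{10FFFF} : planes 1..16 (PCRE sees the merged surrogate pairs)
    ; \x{200D}             : ZERO WIDTH JOINER
    ; \x{FE0F}             : VARIATION SELECTOR-16 (emoji presentation)
    Local $sClean = StringRegExpReplace($sText, "[\x{10000}-\x{10FFFF}\x{200D}\x{FE0F}]+", "")
    Return StringStripWS($sClean, 7) ; 1+2+4: trim both ends, collapse inner runs of spaces
EndFunc   ;==>_StripEmoji

Local $sTitle = "Test Title " & ChrW(0xD83D) & ChrW(0xDE2D)
MsgBox(0, "", "[" & _StripEmoji($sTitle) & "]")
```

Replacing with an empty string rather than "_" avoids leaving placeholder characters in the filename; pass the result through further filtering if the viewer also rejects other non-ANSI characters.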

Yet I suspect that your image viewer will choke on any codepoint outside the default 8-bit codepage of your system. If you still get question marks in the last example above, then your only option is to convert each character to its 8-bit codepage counterpart, or to a usable substitution character when no counterpart exists.

Func _StringToCodepage($sStr, $iCodepage = Default)
    ; Defaults to UTF-8 (65001); for the system OEM codepage use
    ; Int(RegRead("HKLM\SYSTEM\CurrentControlSet\Control\Nls\Codepage", "OEMCP"))
    If $iCodepage = Default Then $iCodepage = 65001
    ; First call: no output buffer, so the function returns the required size in bytes
    Local $aResult = DllCall("kernel32.dll", "int", "WideCharToMultiByte", "uint", $iCodepage, "dword", 0, "wstr", $sStr, "int", StringLen($sStr), _
            "ptr", 0, "int", 0, "ptr", 0, "ptr", 0)
    If @error Or $aResult[0] = 0 Then Return SetError(1, 0, "") ; call failed or empty input
    Local $tCP = DllStructCreate("char[" & $aResult[0] & "]")
    ; Second call: convert into the buffer that was just allocated
    $aResult = DllCall("kernel32.dll", "int", "WideCharToMultiByte", "uint", $iCodepage, "dword", 0, "wstr", $sStr, "int", StringLen($sStr), _
            "struct*", $tCP, "int", $aResult[0], "ptr", 0, "ptr", 0)
    If @error Then Return SetError(2, 0, "")
    Return DllStructGetData($tCP, 1)
EndFunc   ;==>_StringToCodepage

Invoke this conversion function with the codepage ID which suits your needs. See https://docs.microsoft.com/en-us/windows/win32/intl/code-page-identifiers

This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)

  • Solution
2 hours ago, Luke94 said:

Maybe try and StringReplace the question marks with nothing? I guess one of the downsides is it would remove legitimate question marks, should it work at all.

I tried this after seeing the two question marks, but AutoIt sees the emoji, not the question marks. Ironically, I can use StringReplace and pass in the copied emoji character, and it works fine, but then I have to do a StringReplace for every emoji.

The really annoying thing is that if I use StringIsASCII, AutoIt happily tells me the string is ASCII, probably because internally it's converting the emojis to "??", which are ASCII.

58 minutes ago, jchd said:

ConsoleWrite() silently "converts" Unicode text to ANSI, replacing almost all non-ANSI characters by question marks.

While most of what you wrote went a little over my head, this little bit took me down a path which looks to have solved my issue.

Using StringToBinary converts emojis (the couple of test ones anyway) to the aforementioned "??", and then BinaryToString gives me a string I can use.
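
The exact code was not posted, so the following is only a sketch of how such a round trip might look. The variable names are illustrative, and the final cleanup step is an assumption added here because "?" is not a legal character in Windows filenames.

```autoit
; Sketch of the ANSI round trip described above (not the poster's actual code).
Local $sTitle = "Test Title " & ChrW(0xD83D) & ChrW(0xDE2D)

; 1 = $SB_ANSI: characters with no ANSI equivalent are expected to come back as "?"
Local $sAnsi = BinaryToString(StringToBinary($sTitle, 1), 1)

; Assumption: drop every "?" (illegal in Windows filenames anyway), then trim both ends
Local $sFileName = StringStripWS(StringReplace($sAnsi, "?", ""), 3)
ConsoleWrite($sFileName & @LF)
```

Unlike a regexp limited to non-BMP codepoints, this approach also discards any BMP character that has no counterpart in the system ANSI codepage.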

Thanks to all who replied. Much appreciated.


This should give an equally usable result:

$s = "Большая проблема  大问题  बड़ी समस्या  مشكلة كبيرة* (Test Title) ?😭 (Our familly) "

MsgBox(0, "", StringStripWS(StringRegExpReplace($s, "[\x00-\x7F]\K|\W", ""), 7))

; Or

MsgBox(0, "", StringStripWS(StringRegExpReplace($s, "[^ -~]", ""), 7))

 

Edited by Deye

@Deye not true. There are significant differences between Unicode and the upper 8-bit ANSI. The badly mapped characters depend on which locale is in effect.

Your code removes all characters beyond 0x7F, whereas I find it better to have mappable Unicode characters converted to their 0x80-0xFF counterparts in the current locale.

Edited by jchd

54 minutes ago, jchd said:

characters depend on which locale is in effect.

Yes, it really depends on usability and manipulation. In this case, the code should be changed to make it usable.
