AXLE

UTF-8 Strings in AutoIt


Posted (edited)

I am trying to find information on using UTF-8 strings in AutoIt. After searching extensively I cannot find anything conclusive on this topic. What I need to do is FileRead() into a string variable (or array) and keep the UTF-8 encoding. Some articles, and even the Help documents on FileOpen(), suggest that AutoIt (current versions) can read and store UTF-8 internally, but my tests on reading a test web page containing UTF-8 encoded characters into a variable fail.

Does/can AutoIt use strings encoded as UTF-8, and if so, how?

If not, does anyone know of a UDF, or a C/Win-API routine, that would allow the use of a UTF-8 array in AutoIt?

What does AutoIt use internally for strings? Is it converting the UTF-8 file to a UCS-2 string in the variable?

The following is an example which fails for me.

;UTF-8 Tests
#include <FileConstants.au3>
#include <MsgBoxConstants.au3>
#include <WinAPIFiles.au3>

;https://www.w3.org/2001/06/utf-8-test/UTF-8-demo.html
;Also all checked in Notepad++ UTF-8 Encoding (Many Characters are scrambled)
Local $sFile1 = "UTF-8 test file.htm"; 414 Lines | 76,412 characters. "UTF-8 test file.htm" = "/UTF-8-demo.html"
Local $sFile2 = "test2.html"

Local $hfile1 = FileOpen($sFile1, BitOR($FO_READ, $FO_UTF8_NOBOM))
If @error Then
    MsgBox($MB_SYSTEMMODAL, "FileOpen1", "Value of @error is: " & @error & @CRLF & "Value of @extended is: " & @extended)
EndIf

Local $sAm_I_UTF_8 = FileRead($hfile1, -1) ;Does not appear to read UTF-8 characters correctly from the "UTF-8 test file.htm"
If @error Then
    MsgBox($MB_SYSTEMMODAL, "FileRead", "Value of @error is: " & @error & @CRLF & "Value of @extended is: " & @extended)
EndIf

FileClose($hfile1)

Local $sAm_I_Still_UTF_8 = $sAm_I_UTF_8 ;Are these two strings stored internally as UTF-8 ?
If @error Then
    MsgBox($MB_SYSTEMMODAL, "String=String", "Value of @error is: " & @error & @CRLF & "Value of @extended is: " & @extended)
EndIf

Local $iStrLen1 = StringLen($sAm_I_UTF_8)
Local $iStrLen2 = StringLen($sAm_I_Still_UTF_8)
MsgBox($MB_SYSTEMMODAL, "String Length of $sAm_I_UTF_8", $iStrLen1); 414 Lines | 70,174 characters
MsgBox($MB_SYSTEMMODAL, "String Length of $sAm_I_Still_UTF_8", $iStrLen2); 414 Lines | 70,174 characters

Local $hfile2 = FileOpen($sFile2, BitOR($FO_OVERWRITE, $FO_BINARY))
If @error Then
    MsgBox($MB_SYSTEMMODAL, "FileOpen2", "Value of @error is: " & @error & @CRLF & "Value of @extended is: " & @extended)
EndIf

FileWrite($hfile2, $sAm_I_Still_UTF_8) ;If $sAm_I_Still_UTF_8 is actual UTF-8 it should be an exact copy of the original "UTF-8 test file.htm"
If @error Then
    MsgBox($MB_SYSTEMMODAL, "FileOpen2", "Value of @error is: " & @error & @CRLF & "Value of @extended is: " & @extended)
EndIf
FileClose($hfile2)

 


"Writing code to carry out an intended set of tasks is relatively easy.
Writing code to carry out ONLY an intended set of tasks, well that can be a little more challenging."

Alex Maddern


Thanks argumentum. The above link seems to refer more to the encoding of the script file itself rather than internal "types", although there is much suggestion that UTF types will be automatically detected at FileOpen(), FileRead(), etc.; I can't confirm any of this at the moment. As the above example shows, the file is being loaded into the variable with some other type of encoding that is not a character-count equivalent of the original UTF-8 test file. Also, I can't make sense of how I could use what appear to be pre-compiler directives ("Look inside AutoIt3Wrapper.au3, and look for $UTFtype") within my script. Is there info or documentation on forcing the use of UTF-8 variable types in my scripts?

Any further assistance will be appreciated.

Axle


Posted (edited)

Native AutoIt strings use UCS-2, i.e. a subset of UTF-16LE restricted to the BMP.

AutoIt File* functions can detect UTF-8 files on read, or be forced to write them, depending on options. The resulting data read will be UCS-2 encoded (except when reading binary, of course).
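A minimal round-trip sketch along those lines (the file names are just the ones used earlier in this thread): force UTF-8 on both the read and the write side, rather than writing in binary mode.

```autoit
#include <FileConstants.au3>

; Read the UTF-8 file into a native (UCS-2) string, then write it
; back out forcing UTF-8 without BOM. File names as used above.
Local $hIn = FileOpen("UTF-8 test file.htm", BitOR($FO_READ, $FO_UTF8_NOBOM))
Local $sText = FileRead($hIn) ; stored internally as UCS-2
FileClose($hIn)

Local $hOut = FileOpen("test2.html", BitOR($FO_OVERWRITE, $FO_UTF8_NOBOM))
FileWrite($hOut, $sText) ; re-encoded to UTF-8 on write
FileClose($hOut)
```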

8 hours ago, AXLE said:

;Does not appear to read UTF-8 characters correctly from the "UTF-8 test file.htm"

I suspect it does read UTF8 correctly.

1 hour ago, AXLE said:

As with the above example it shows that the file is being loaded into the variable with some other type of encoding that is not a character count equivalent of the original UTF-8 test file.

A single codepoint can use from 1 to 4 bytes in UTF-8, whereas it consists of only one 16-bit word in UCS-2 and 1 to 2 16-bit words in UTF-16. Hence there is no surprise that, in general, [UTF-8 file byte count] ≠ [UCS-2 codepoint count].
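A quick way to see that divergence from within AutoIt (a small sketch; $SB_UTF8 comes from StringConstants.au3):

```autoit
#include <StringConstants.au3>

; "é" needs 2 bytes in UTF-8 and "€" needs 3, yet each is a single
; UCS-2 code unit, so character count and UTF-8 byte count differ.
Local $s = "ABé€"
ConsoleWrite(StringLen($s) & @CRLF)                           ; 4 characters
ConsoleWrite(BinaryLen(StringToBinary($s, $SB_UTF8)) & @CRLF) ; 7 bytes (1+1+2+3)
```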

During AutoIt's internal UTF-8 to UCS-2 conversion, codepoints above the BMP are mangled, since they would need an extra 16-bit word to be represented in UTF-16. Said otherwise, AutoIt doesn't recognize and handle UTF-16 surrogates. This may be a serious problem for people whose script (= writing system) uses the growing number of planes/blocks outside the BMP (SMP, SIP, SSP, private planes). Yet the BMP allows encoding a large number of scripts: https://en.wikipedia.org/wiki/Unicode_block

That said, AutoIt offers ways to convert between any two of {UCS-2, UTF-8, ANSI (any codepage), Windows (any codepage), OEM, double-byte, and any codepage supported by your Windows version}. It is also possible to build "beyond BMP" UTF-16LE strings (manually or programmatically), which Windows or other Unicode-aware applications will handle gracefully, provided the appropriate font(s) are used. But keep in mind that most AutoIt string functions won't handle UTF-16 surrogates correctly.
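For example, a beyond-BMP codepoint can be built by hand from its UTF-16 surrogate pair (a sketch, using U+1D11E MUSICAL SYMBOL G CLEF):

```autoit
; U+1D11E encodes in UTF-16 as the surrogate pair 0xD834 0xDD1E.
Local $sClef = ChrW(0xD834) & ChrW(0xDD1E)
ConsoleWrite(StringLen($sClef) & @CRLF) ; 2 - AutoIt counts code units, not codepoints
MsgBox(0, "Beyond BMP", $sClef) ; shows one glyph if a suitable font is installed
```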

This site offers a load of information, examples, data, applets about Unicode: https://r12a.github.io/scripts/tutorial/part3 (also check docs and apps links.)

Don't hesitate to post if you encounter any encoding issue.

NOTE about Unicode planes:
BMP = Basic Multilingual Plane
SMP = Supplementary Multilingual Plane
SIP = Supplementary Ideographic Plane
SSP = Supplementary Special-purpose Plane


This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)


Thank you very much for your excellent reply jchd :)  You confirmed what I had originally believed about AutoIt's internal types: that all AutoIt strings are UCS-2 (the MS variant of UTF-16) and that UTF-8 documents are read and converted to AutoIt's internal type. The documentation https://www.autoitscript.com/autoit3/docs/intro/unicode.htm isn't overly clear on this and had me a little confused, as I thought maybe AutoIt had introduced native support for UTF-8 types.

I can quite likely achieve what I need with UCS-2, keeping in mind possible conversion "gotchas"; alternatively, I'll have a go at creating a native C UTF-8 string DLL library/UDF based on something like ICU or another lightweight UTF-8 header library.

Thanks for the assistance, it is very much appreciated.

Axle


Posted (edited)

ICU is a really huge hog and doesn't easily solve all the issues that arise in practice with Unicode.

By mere curiosity, why do you think you need a set of functions for UTF8?

As far as I can think, this is more or less all you need:

Func _CodepageToString($sCP, $iCodepage = Default)
    If $iCodepage = Default Then $iCodepage = Int(RegRead("HKLM\SYSTEM\CurrentControlSet\Control\Nls\Codepage", "OEMCP"))
    Local $tText = DllStructCreate("byte[" & StringLen($sCP) & "]")
    DllStructSetData($tText, 1, $sCP)
    Local $aResult = DllCall("kernel32.dll", "int", "MultiByteToWideChar", "uint", $iCodepage, "dword", 0, "struct*", $tText, "int", StringLen($sCP), _
            "ptr", 0, "int", 0)
    Local $tWstr = DllStructCreate("wchar[" & $aResult[0] & "]")
    $aResult = DllCall("kernel32.dll", "int", "MultiByteToWideChar", "uint", $iCodepage, "dword", 0, "struct*", $tText, "int", StringLen($sCP), _
            "struct*", $tWstr, "int", $aResult[0])
    Return DllStructGetData($tWstr, 1)
EndFunc   ;==>_CodepageToString

Func _StringToCodepage($sStr, $iCodepage = Default)
    If $iCodepage = Default Then $iCodepage = Int(RegRead("HKLM\SYSTEM\CurrentControlSet\Control\Nls\Codepage", "OEMCP"))
    Local $aResult = DllCall("kernel32.dll", "int", "WideCharToMultiByte", "uint", $iCodepage, "dword", 0, "wstr", $sStr, "int", StringLen($sStr), _
            "ptr", 0, "int", 0, "ptr", 0, "ptr", 0)
    Local $tCP = DllStructCreate("char[" & $aResult[0] & "]")
    $aResult = DllCall("Kernel32.dll", "int", "WideCharToMultiByte", "uint", $iCodepage, "dword", 0, "wstr", $sStr, "int", StringLen($sStr), _
            "struct*", $tCP, "int", $aResult[0], "ptr", 0, "ptr", 0)
    Return DllStructGetData($tCP, 1)
EndFunc   ;==>_StringToCodepage

Supply 65001 as the codepage to convert between UTF-8 and native strings. It's trivial to change the default codepage to UTF-8 instead of OEM.
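For instance, a round-trip through UTF-8 using the two functions above might look like this (a sketch; the sample string is arbitrary):

```autoit
; Native UCS-2 -> UTF-8 byte string -> back to native UCS-2, via codepage 65001.
Local $sNative = "Zürich déjà vu"
Local $sUtf8 = _StringToCodepage($sNative, 65001) ; byte string holding UTF-8
Local $sBack = _CodepageToString($sUtf8, 65001)
ConsoleWrite(($sBack == $sNative) & @CRLF) ; expected: True
```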

Convert your UTF-8 or codepage input data to native strings, process and massage them ad nauseam, convert them if necessary to the output codepage, and you're done.


Hi jchd, ICU is huge; I was looking at a few single-header libraries like https://github.com/sheredom/utf8.h just for C type-width and string manipulation, maybe wrapped up in a DLL for convenience. Just thinking outside the box for a moment (future projects etc.). My Unicode knowledge is still in a learning phase.

For now I just need to do some inline Base64 data URIs for HTML pages and some direct image-to-b64 conversions. Most of this will be binary and 7-bit ANSI anyway, so the codepage shouldn't really matter. The main thing was just confirming that AutoIt still uses UCS-2 as its internal type so I can test and check for type-conversion "gotchas" along the way. Do enough of these codepage conversions and it's almost certain mojibake will happen sooner or later lol. I'd rather it sooner so I can correct it :)

Also thank you for the informative reply above. From my research I was of the belief that the 65001 codepage is only available in Windows 10, and to some console and internal functions prior to W10. It would be nice if everything was just UTF-8 or byte code.

 


11 hours ago, AXLE said:

From my research I was of the belief that 65001 codepage is only available in windows 10, and to some console and internal functions prior to W10.

Windows inaugurated Unicode support with an upgrade to Win 9x, and Microsoft was one of the very first large software companies to do so.
With Win NT, system calls used UCS-2, and with Win 2000 and up the encoding settled on UTF-16LE.

What's new with Win 10 is indeed that you can select the local codepage to be 65001 (UTF-8) system-wide, not only for the DOS console (CHCP) and for use in conversion functions (as in the code I posted above). That only changes the behavior of the explicitly-ANSI system calls, those ending in *A, which then consider the byte string as UTF-8 data. The encoding used in all other primitives is UTF-16LE and will remain such, until UTF-32 becomes a good incentive to sell more memory and storage (just guessing here).

In short: apps designed to run on 99.5% of today's PCs should use the conversion functions above for converting codepage input and output when required, but everything else (main code) remains UCS-2 (the BMP of Unicode).

You'll find a number of Base64-related posts when searching here.

If you find a use for it, note that the regexp support functions (PCRE1) accept the (*UCP) switch (see the StringRegExp help).

Local $String = "Sample simple english text 한국어    텍스트의 예 טקסט עברית ירושלים русский образец អត្ថបទថៃ"
Local $aLang = StringRegExp($String, "(*UCP)(\p{Hangul}+(?:\s+\p{Hangul}+)*)", 1)
If not @error Then MsgBox(64, "Korean text found", $aLang[0])
$aLang = StringRegExp($String, "(*UCP)(\p{Khmer}+(?:\s+\p{Khmer}+)*)", 1)
If not @error Then MsgBox(64, "Khmer text found", $aLang[0])
$aLang = StringRegExp($String, "(*UCP)(\p{Latin}+(?:\s+\p{Latin}+)*)", 1)
If not @error Then MsgBox(64, "Latin text found", $aLang[0])
$aLang = StringRegExp($String, "(*UCP)(\p{Hebrew}+(?:\s+\p{Hebrew}+)*)", 1)
If not @error Then MsgBox(64, "Hebrew text found", $aLang[0])
$aLang = StringRegExp($String, "(*UCP)(\p{Cyrillic}+(?:\s+\p{Cyrillic}+)*)", 1)
If not @error Then MsgBox(64, "Cyrillic text found", $aLang[0])

 



Awesome 😃 thanks for the information jchd. I think I have it all nailed atm. I am using a modded version of CryptBinaryToString by trancexx (Added an extra parameter flag for String|Binary mode). The Binary B64 Enc/Dec on images is byte perfect, and the W3C UTF-8 test page is converting byte perfect, so... so far all good :)
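For reference, here is a minimal sketch of my own (not trancexx's UDF) of the same idea: calling CryptBinaryToStringW from crypt32.dll with CRYPT_STRING_BASE64 combined with CRYPT_STRING_NOCRLF (0x40000001). The API name and flags are from the Win32 documentation; the wrapper function itself is illustrative.

```autoit
; Base64-encode binary data via the Win32 CryptBinaryToStringW API.
Func _Base64Encode($dBinary)
    Local $tBuf = DllStructCreate("byte[" & BinaryLen($dBinary) & "]")
    DllStructSetData($tBuf, 1, $dBinary)
    ; First call with a NULL output buffer to query the required length.
    Local $aRet = DllCall("crypt32.dll", "bool", "CryptBinaryToStringW", "struct*", $tBuf, _
            "dword", BinaryLen($dBinary), "dword", 0x40000001, "ptr", 0, "dword*", 0)
    If @error Or Not $aRet[0] Then Return SetError(1, 0, "")
    Local $tOut = DllStructCreate("wchar[" & $aRet[5] & "]")
    $aRet = DllCall("crypt32.dll", "bool", "CryptBinaryToStringW", "struct*", $tBuf, _
            "dword", BinaryLen($dBinary), "dword", 0x40000001, "struct*", $tOut, "dword*", $aRet[5])
    If @error Or Not $aRet[0] Then Return SetError(2, 0, "")
    Return DllStructGetData($tOut, 1)
EndFunc   ;==>_Base64Encode

ConsoleWrite(_Base64Encode(StringToBinary("Man")) & @CRLF) ; "TWFu"
```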

As far as UTF-8 text manipulation Dll/UDF goes, I've added it to my whiteboard along with the many other projects that I will get to as time permits. I have a few large tertiary assessment modules coming up on web programming, so maybe I will slip it in amidst that.



@AXLE: I haven't followed this thread in detail, but I want to mention a couple of things I'm reminded of whenever I see discussions about UTF/encoding in AU3.

First, make sure you confirm the character encoding of the script itself. I recall fighting all kinds of problems when there was a mismatch between the strings declared in the script and the encoding of the file it was processing. Once I got everything on the same page, things got a lot easier.

Second, get yourself a copy of XVI32 so you can quickly check the encoding of individual files. Again, I recall chasing ghosts when the file wasn't actually what I was declaring it as in the Read/Write statements.

Hope these help.
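As a quick check from within AutoIt itself, FileGetEncoding() reports which encoding AutoIt will assume for a file (a sketch; the file name is a placeholder):

```autoit
#include <FileConstants.au3>

; FileGetEncoding() returns a value matching the $FO_* encoding constants,
; e.g. $FO_UTF8 (128), $FO_UTF8_NOBOM (256), $FO_ANSI (512).
Local $iEnc = FileGetEncoding("UTF-8 test file.htm")
ConsoleWrite("Detected encoding flag: " & $iEnc & @CRLF)
```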


Yeah, I try to keep all coding in ANSI or UTF-8. I use both XVI32 and HxD (I prefer HxD most times). Even the hex-editor plugin for Notepad++ is OK for a quick check on the fly :)

At the moment all my conversions (B64 Enc/Dec) in both Bin, and Text are coming out byte perfect with no mojibake. Text conversion tests are based on the W3C UTF-8 test page (https://www.w3.org/2001/06/utf-8-test/UTF-8-demo.html) and all good so far. Thanks for the pointers just the same :D



