Xenobiologist

Converting with e.g. StringToASCIIArray

11 posts in this topic

Hi,

First of all, I'm no encoding expert. Is this correct? Should the function show negative numbers?

#include <Array.au3> ; provides _ArrayToString

; Test string: U+0081, space, „ (U+201E), space, ” (U+201D).
; The invisible U+0081 is built with ChrW() so it survives copy and paste.
Local $s = ChrW(0x81) & ' „ ”'
Local $a = StringToASCIIArray($s, Default, Default, 0)
ConsoleWrite('UTF-16 ' & @TAB & _ArrayToString($a, @TAB) & @CRLF)
Local $b = StringToASCIIArray($s, Default, Default, 1)
ConsoleWrite('ANSI ' & @TAB & _ArrayToString($b, @TAB) & @CRLF)
Local $c = StringToASCIIArray($s, Default, Default, 2)
ConsoleWrite('UTF-8 ' & @TAB & _ArrayToString($c, @TAB) & @CRLF)
ConsoleWrite('! - - - - - - - - - - - - - - - - - - - - ' & @CRLF)
; Codepoint of each character, for comparison
ConsoleWrite(@TAB & AscW(ChrW(0x81)) & @TAB)
ConsoleWrite(AscW(" ") & @TAB)
ConsoleWrite(AscW("„") & @TAB)
ConsoleWrite(AscW(" ") & @TAB)
ConsoleWrite(AscW("”") & @CRLF)

UTF-16  129     32      8222    32      8221
ANSI    -127    32      -124    32      -108
UTF-8   -62     -127    32      -30     -128
! - - - - - - - - - - - - - - - - - - - - 
        129     32      8222    32      8221

If all is correct then I'm fine. :D

Mega


Scripts & functions Organize Includes Let Scite organize the include files

Yahtzee The game "Yahtzee" (Kniffel, DiceLion)

LoginWrapper Secure scripts by adding a query (authentication)

_RunOnlyOnThis UDF Make sure that a script can only be executed on ... (Windows / HD / ...)

Internet-Café Server/Client Application Open CD, Start Browser, Lock remote client, etc.

MultipleFuncsWithOneHotkey Start different funcs by hitting one hotkey different times




#2 ·  Posted (edited)

I view the ANSI and UTF-8 as incorrect results.

It looks like StringToASCIIArray is treating the individual codes for both ANSI and UTF-8 as signed, wrongly.
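
To illustrate the signedness point, here is a small sketch in Python (used only for illustration; it is not AutoIt and says nothing about AutoIt's internals). It reinterprets each value from the first post as an unsigned byte by masking with 0xFF. The helper name `to_unsigned_bytes` is made up for this example.

```python
def to_unsigned_bytes(values):
    """Reinterpret signed 8-bit values (-128..127) as unsigned bytes (0..255)."""
    return [v & 0xFF for v in values]

# Values copied from the output in the first post
ansi_reported = [-127, 32, -124, 32, -108]
utf8_reported = [-62, -127, 32, -30, -128]

ansi_unsigned = to_unsigned_bytes(ansi_reported)
utf8_unsigned = to_unsigned_bytes(utf8_reported)

print("ANSI ", ansi_unsigned)   # [129, 32, 132, 32, 148]
print("UTF-8", utf8_unsigned)   # [194, 129, 32, 226, 128]
```

Read as unsigned, 132 and 148 are the Windows-1252 codes for „ and ”, which is consistent with the values simply being printed as signed bytes.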

Let's try it for UTF-16 with some codepoints having the MSB set in their UTF-16 representation.

$str contains five full-width characters

Local $str = "ＡＢＣＤＥ"

Local $a = StringToASCIIArray($str, Default, Default, 0)
ConsoleWrite('UTF-16 ' & @TAB & _ArrayToString($a, @TAB) & @CRLF)

The result is unsigned and OK.

That was v3.3.5.1 under XP SP3 x86. Looks like you're ready for a ticket!

EDIT: it turns out to be a little more buggy than that:

$str2 contains the four "wind" mahjong tiles (codepoints > 0x10000)

gives wrong results.

This also shows up in the result of your first post.

In short, we have several bugs here:
StringLen doesn't count Unicode characters; it counts every 16-bit code unit as a character. I admit that codepoints >= 0x10000 are not in routine use for everyone (except for those using the new Asian blocks!), but it's nonetheless wrong behavior w.r.t. Unicode.
StringToASCIIArray($string, Default, Default, 0) returns StringLen($string) 16-bit codes from the UTF-16LE encoding instead of returning actual Unicode codepoints. Two errors here.
StringToASCIIArray($string, Default, Default, 1) returns the first StringLen($string) 8-bit codes (as signed values) of the ANSI characters corresponding to the first StringLen($string) 16-bit codes from the UTF-16LE encoding. Three errors here.
StringToASCIIArray($string, Default, Default, 2) returns only the first StringLen($string) 8-bit codes (as signed values) of the UTF-8 encoding. Two errors here.
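
The difference between codepoints and code units can be sketched in Python (illustration only, not AutoIt). The string below is an assumption standing in for the four "wind" tiles, taken here to be U+1F000 to U+1F003 and written with escapes so nothing depends on the editor's encoding.

```python
import struct

# Assumed stand-in for the four "wind" mahjong tiles: U+1F000..U+1F003
winds = "\U0001F000\U0001F001\U0001F002\U0001F003"

codepoints = len(winds)                    # Python strings count codepoints
utf16 = winds.encode("utf-16-le")
utf16_units = len(utf16) // 2              # number of 16-bit code units
utf8_len = len(winds.encode("utf-8"))      # number of 8-bit code units

# Unpack the UTF-16LE bytes into 16-bit values to see the surrogate pairs
units = struct.unpack("<%dH" % utf16_units, utf16)

print(codepoints, utf16_units, utf8_len)   # 4 8 16
print([hex(u) for u in units[:2]])         # ['0xd83c', '0xdc00']
```

A length function that counts 16-bit positions would report 8 here rather than 4, and each tile shows up as a surrogate pair instead of a single codepoint.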

I hope I didn't misrepresent the issues.

Edited by jchd

This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)

Xenobiologist

#3 · Posted

Thanks!
Hopefully, some of the mods or devs will find the topic and reply.
Otherwise, I'll try my luck with a ticket tomorrow.

Mega

Xenobiologist

#4 · Posted

I created a ticket to get a comment. Ticket

Valik

#5 · Posted

My first comment is this: Read the fucking guidelines when you create a new ticket. I didn't write those to have them ignored.

Valik

#6 · Posted (edited)

Second comment: What are the correct UTF-8 values? Are they:
UTF-8   194 129 32  226 128
Edited by Valik


The sign stuff was trivial and obvious to fix. Think no more about that.

With that out of the way, that's not all the UTF-8 data, it's only half of it. It's complicated, though, or at least it will be if it is done correctly. All input is UTF-16 LE because that's what AutoIt stores things as internally. I didn't take into consideration that a single UTF-16 character might expand to 2+ UTF-8 characters. That's why you see the length capped at the length of the input string (and why only half the UTF-8 data is present). The problem is, it has to be done this way or the function can't honor the length parameter. The proper fix is to parse the UTF-8 characters in order to return the correct number of characters even if it's a lot more bytes. Rather obvious, but a pain since this was supposed to be a simple function. Ugh.
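
The "only half of it" effect can be reproduced with a short Python sketch (illustration only; it mimics the capping described above and is not AutoIt's actual code). The string is the five-character test string from the first post, written with escapes.

```python
# U+0081, space, U+201E, space, U+201D: the five characters from the first post
text = "\u0081 \u201e \u201d"

full = list(text.encode("utf-8"))   # every UTF-8 byte
capped = full[:len(text)]           # output capped at the input length

print(len(text), len(full))         # 5 10
print(capped)                       # [194, 129, 32, 226, 128]
print(full)                         # [194, 129, 32, 226, 128, 158, 32, 226, 128, 157]

# The capped output ends in the middle of a three-byte sequence,
# so it is no longer valid UTF-8 on its own.
try:
    bytes(capped).decode("utf-8")
    capped_is_valid = True
except UnicodeDecodeError:
    capped_is_valid = False

print("capped output decodes cleanly:", capped_is_valid)   # False
```

Five characters expand to ten bytes, so a cap at the input length returns exactly half of the data and cuts the „ sequence after its second byte.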


Where did I ignore the (your) guidelines?

FMPOV, there is a problem with that function.

At least, there is a need to update the description.



I didn't take into consideration that a single UTF-16 character might expand to 2+ UTF-8 characters.

Valik, I understand what you mean but this is not correct the way you put it. A codepoint (= a single Unicode character) may need from 1 to 4 ubyte in UTF-8, from 1 to 2 ushort in UTF-16*E and 1 ulong in UTF-32.

Here's a visual representation that I found useful when dealing with UTF (it comes from SQLite.c):

** Notes on UTF-8:
**
** Byte-0   Byte-1  Byte-2  Byte-3 _______ Value
** 0xxxxxxx ______________________________ 00000000 00000000 0xxxxxxx
** 110yyyyy 10xxxxxx _____________________ 00000000 00000yyy yyxxxxxx
** 1110zzzz 10yyyyyy 10xxxxxx ____________ 00000000 zzzzyyyy yyxxxxxx
** 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx ___ 000uuuuu zzzzyyyy yyxxxxxx
**
**
** Notes on UTF-16: (with wwww+1==uuuuu)
**
**  Word-0  Word-1  Value
** 110110ww wwzzzzyy 110111yy yyxxxxxx ___ 000uuuuu zzzzyyyy yyxxxxxx
** zzzzyyyy yyxxxxxx _____________________ 00000000 zzzzyyyy yyxxxxxx

(I had to put underlines so the alignment doesn't vanish.)
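
The bit layouts above can be turned into a minimal encoder sketch in Python (illustration only; the function names are made up for this example, and it deliberately skips validation such as rejecting lone surrogates or values above 0x10FFFF).

```python
def utf8_bytes(cp):
    """Encode one codepoint following the UTF-8 bit layout in the table."""
    if cp < 0x80:                      # 0xxxxxxx
        return [cp]
    if cp < 0x800:                     # 110yyyyy 10xxxxxx
        return [0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)]
    if cp < 0x10000:                   # 1110zzzz 10yyyyyy 10xxxxxx
        return [0xE0 | (cp >> 12),
                0x80 | ((cp >> 6) & 0x3F),
                0x80 | (cp & 0x3F)]
    # 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
    return [0xF0 | (cp >> 18),
            0x80 | ((cp >> 12) & 0x3F),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F)]


def utf16_units(cp):
    """Encode one codepoint as one 16-bit unit or a surrogate pair."""
    if cp < 0x10000:
        return [cp]
    v = cp - 0x10000                   # the "wwww + 1 == uuuuu" adjustment
    return [0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)]


for cp in (0x41, 0x81, 0x201E, 0x1F000):
    print(hex(cp), utf8_bytes(cp), [hex(u) for u in utf16_units(cp)])
```

For these four sample codepoints the UTF-8 results agree with Python's built-in encoder, and 0x1F000 is the only one that needs two 16-bit units.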

The proper fix is to parse the UTF-8 characters in order to return the correct number of characters even if it's a lot more bytes. Rather obvious, but a pain since this was supposed to be a simple function. Ugh.

:D I know! See what Jon just said about it. And, fortunately, we don't even imagine dealing with anything other than C-form normalized strings. Never, ever, let an ounce of normalization stuff get inside AutoIt, let alone grapheme parsing... If you do then I suggest fixing a large amount of padding on the table in front of you right in the spot where you will be repeatedly banging your head*.

* Roger Binns on SQLite list, about XML.



It's fine, I've talked to Jon about it and he showed me how to use stuff he's written in the past few months to fix the function. However, if there are still problems with multi-word UTF-16 characters, that's an AutoIt-wide issue.


Thank you for caring.

OTOH, multi-code-unit characters (codepoints >= 0x10000) may be seen today as a pedantic "extension" (e.g. Aztec, Deseret or Byzantine music symbols), and indeed many editors (Scite, PsPad, NotePad++) display, or rather hash, them as per the UCS-2 standard, which is no less than 12 years old! But investing in support for the full UTF representations is certainly a safe bet for the future, as we see more blocks in planes 1, 2 and 14 being increasingly used (for instance, large enhancements to unified CJK and use of language tags). Worldwide exchange of documents and the need for support of multi-language data[base] can only push the trend forward.


