Algorithm needed: counting non-Latin characters in a string

chenxu

How can I get the count of non-Latin characters in a string that may contain both Latin and non-Latin characters? I need an O(1) or O(log n) solution; I already have an O(n) one.

ProgAndy

You could make a string with all Latin chars and then check each char of your input against this string :)

$latin = "ABCDE....."
$String = "A String %&32"
$Split = StringSplit($String, "") ; every single character
$NonLatin = 0
For $i = 1 To $Split[0]
    ; StringInStr returns 0 when the character is not in $latin
    If Not StringInStr($latin, $Split[$i]) Then $NonLatin += 1
Next
MsgBox(0, "", $NonLatin)

weaponx

This will probably count non-alphanumeric characters along with the non-Latin characters but it should provide some ideas.

$MixedString = "ABCDEYíäýñíèé"
$NonLatinCharArray = StringRegExp($MixedString, "[^[:alnum:]]", 3)

If IsArray($NonLatinCharArray) Then
    $count = Ubound($NonLatinCharArray)
    
    For $X = 0 to $count -1
        ConsoleWrite($NonLatinCharArray[$X] & @CRLF)
    Next
    
    MsgBox(0,"","Found " & $count & " Non-Latin characters ")
EndIf
chenxu

$MixedString = "..." ; the original test string is not recoverable
MsgBox(0, "", countChineseChar($MixedString))

Func countChineseChar($str)
    Local $NonLatinCharArray = StringRegExp($str, "[^[:alnum:]]", 3)
    If Not IsArray($NonLatinCharArray) Then Return 0
    Return UBound($NonLatinCharArray)
EndFunc
Siao

The code failed!

That's because you failed twice:

1) to specify exactly what you want and stick with it (first you just said "none latin", and now you expect "chinese only")

2) to comprehend what weaponx said right above his code example.

Anyway, my version:

$s = ClipGet()
$a = StringSplit($s, "")
$iNonLatin = 0
For $i = 1 To $a[0]
    If AscW($a[$i]) >= 0x250 Then  $iNonLatin += 1
Next
ConsoleWrite($iNonLatin & @CRLF)

Short explanation of the above:

Range 0-0x24F includes Basic Latin, Latin-1, Latin Extended-A and Latin Extended-B; any char that doesn't fall within it will be counted. This doesn't really guarantee that only alpha chars will be counted; for example, the Spacing Modifier Letters subset (0x2B0-0x2FF) would be counted too. So go to unicode.org or wherever to get range charts, and tweak the code as you need.
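As a rough sketch of that kind of tweak: the helper below counts characters inside an arbitrary code range, so ranges can be added or subtracted as needed. `_CountInRange` is a made-up name, not a built-in, and the sample string and the choice to exclude only the Spacing Modifier Letters block are assumptions for illustration.

```autoit
; Hypothetical helper (not a built-in): counts the characters of $sText whose
; code, as returned by AscW, lies within [$iLow, $iHigh].
Func _CountInRange($sText, $iLow, $iHigh)
    Local $iCount = 0, $iCode
    For $i = 1 To StringLen($sText)
        $iCode = AscW(StringMid($sText, $i, 1))
        If $iCode >= $iLow And $iCode <= $iHigh Then $iCount += 1
    Next
    Return $iCount
EndFunc

; ChrW builds the sample so the script file's encoding does not matter:
; "a", a spacing modifier letter (0x2B0) and a CJK ideograph (0x4E2D).
Local $sSample = "a" & ChrW(0x2B0) & ChrW(0x4E2D)

; Everything from 0x250 up, minus the Spacing Modifier Letters block.
Local $iNonLatin = _CountInRange($sSample, 0x250, 0xFFFF) - _CountInRange($sSample, 0x2B0, 0x2FF)
ConsoleWrite($iNonLatin & @CRLF)
```

This is still one pass per range over the string, so it is no faster than the loop above; it only makes the ranges easier to adjust.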

chenxu

This code certainly does what I want, but it takes O(n) time. I need to call the utility many times in my script, so I need one that runs in O(1) or O(log n) time.

Anyway, thank you very much.

weaponx

This isn't math. The longer your string is, the longer it will take. There should be a linear increase in time with the length, just like the relationship between my stress level and the length of this thread.

Siao

Exactly. I would really like to know how O(1) can be expected when counting characters in a string: every character has to be looked at at least once.

@chenxu:

These would be faster than StringSplit approach (and it has nothing to do with your dubious understanding of big O, just the fact that compiled code is much faster than script code):

StringRegExpReplace($s, "[^\x00-\x{24F}]", "")
$iNonLatin = @extended ; @extended holds the number of replacements made

if you expect most characters in a string to be in the specified range

or

$stmp = StringRegExpReplace($s, "[\x00-\x{24F}]", "")
$iNonLatin = StringLen($stmp) ; whatever is left is outside the range

if you expect most characters in a string to be outside the range

Again, I'm suggesting that the 0-0x24F used here likely is not exactly what you need, so you should adjust as necessary.

If you want to count Chinese only, the tricky part is the multitude of possible Chinese charsets.
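For the common case, a minimal sketch could reuse the same regex trick with a different range. It assumes that "Chinese" means the CJK Unified Ideographs block (0x4E00-0x9FFF) and that the string is already a normal AutoIt Unicode string; extension blocks and text still encoded in a legacy code page are not handled. The sample string is made up.

```autoit
; ChrW builds the sample so the script file's encoding does not matter.
Local $sSample = "abc" & ChrW(0x4E2D) & ChrW(0x6587) & "!"

; Assumption: only the CJK Unified Ideographs block counts as Chinese.
StringRegExpReplace($sSample, "[\x{4E00}-\x{9FFF}]", "")
Local $iChinese = @extended ; number of replacements = number of matches
ConsoleWrite($iChinese & @CRLF)
```

@extended has to be read right after the call, because the next function call overwrites it.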
