
How to compare 2 strings to get a similarity percent in result


sambalec

Hello,

How can I compare two strings and get their similarity as a percentage?

Example :

String 1 : "Hello Worlds !"

String 2 : "Hello my World !!!"

I need a % result, for example : 70 % similar...

Many thanks ! :-)

FireFox

Hi,

By using the StringCompare function with some math tricks.

Here you go :

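Local Const $s1 = "toto"
Local Const $s2 = "tata"

; Split both strings into single characters; element [0] holds the character count.
Local Const $a1 = StringSplit($s1, ""), $a2 = StringSplit($s2, "")

; Only the overlapping part is compared, so $iMax is the length of the shorter string.
; (_Iif() from Misc.au3 was removed in later AutoIt releases, hence the plain If.)
Local $iMax = $a1[0]
If $a2[0] < $iMax Then $iMax = $a2[0]

Local $iDiffCount = 0

For $i = 1 To $iMax
    ; Flag 2 = not case sensitive
    If StringCompare($a1[$i], $a2[$i], 2) <> 0 Then $iDiffCount += 1
Next

ConsoleWrite("Diff: " & $iDiffCount / $iMax * 100 & "%" & @CRLF)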

Br, FireFox.


 

OS : Win XP SP2 (32 bits) / Win 7 SP1 (64 bits) / Win 8 (64 bits) | Autoit version: latest stable / beta.
Hardware : Intel(R) Core(TM) i5-2400 CPU @ 3.10Ghz / 8 GiB RAM DDR3.

My UDFs : Skype UDF | TrayIconEx UDF | GUI Panel UDF | Excel XML UDF | Is_Pressed_UDF

My Projects : YouTube Multi-downloader | FTP Easy-UP | Lock'n | WinKill | AVICapture | Skype TM | Tap Maker | ShellNew | Scriptner | Const Replacer | FT_Pocket | Chrome theme maker

My Examples : Capture toolIP Camera | Crosshair | Draw Captured Region | Picture Screensaver | Jscreenfix | Drivetemp | Picture viewer

My Snippets : Basic TCP | Systray_GetIconIndex | Intercept End task | Winpcap various | Advanced HotKeySet | Transparent Edit control

 

jdelaney

Wish I could cite the source:

Func _Typos(Const $st1, Const $st2, $anychar = '_', $anytail = '%') ; Get amount of typos between two strings
    Local $s1, $s2, $pen, $del, $ins, $subst
    If Not IsString($st1) Then Return SetError(-1, -1, -1)
    If Not IsString($st2) Then Return SetError(-2, -2, -1)
    If $st2 = '' Then Return StringLen($st1)
    If $st2 == $anytail Then Return 0
    If $st1 = '' Then
        Return (StringInStr($st2 & $anytail, $anytail, 1) - 1)
    EndIf
;~ $s1 = StringSplit(_LowerUnaccent($st1)), "", 2) ;; _LowerUnaccent() addon function not available here
;~ $s2 = StringSplit(_LowerUnaccent($st2)), "", 2) ;; _LowerUnaccent() addon function not available here
    $s1 = StringSplit(StringLower($st1), "", 2)
    $s2 = StringSplit(StringLower($st2), "", 2)
    Local $l1 = UBound($s1), $l2 = UBound($s2)
    Local $r[$l1 + 1][$l2 + 1]
    For $x = 0 To $l2 - 1
        Switch $s2[$x]
            Case $anychar
                If $x < $l1 Then
                    $s2[$x] = $s1[$x]
                EndIf
            Case $anytail
                $l2 = $x
                If $l1 > $l2 Then
                    $l1 = $l2
                EndIf
                ExitLoop
        EndSwitch
        $r[0][$x] = $x
    Next
    $r[0][$l2] = $l2
    For $x = 0 To $l1
        $r[$x][0] = $x
    Next
    For $x = 1 To $l1
        For $y = 1 To $l2
            $pen = Not ($s1[$x - 1] == $s2[$y - 1])
            $del = $r[$x - 1][$y] + 1
            $ins = $r[$x][$y - 1] + 1
            $subst = $r[$x - 1][$y - 1] + $pen
            If $del > $ins Then $del = $ins
            If $del > $subst Then $del = $subst
            $r[$x][$y] = $del
            If ($pen And $x > 1 And $y > 1 And $s1[$x - 1] == $s2[$y - 2] And $s1[$x - 2] == $s2[$y - 1]) Then
                If $r[$x][$y] >= $r[$x - 2][$y - 2] Then $r[$x][$y] = $r[$x - 2][$y - 2] + 1
                $r[$x - 1][$y - 1] = $r[$x][$y]
            EndIf
        Next
    Next
    Return ($r[$l1][$l2])
;~ ; usage
;~ Local $reference = "lexicographically"
;~ Local $Words[11][2] = [ _
;~     [$reference], _
;~     ["Lexicôgraphicaly"], _
;~     ["lexkographicaly"], _
;~     ["Lexico9raphically"], _
;~     ["lexioo9asdasraphically"], _
;~     ["Lexicographical"], _
;~     ["lexicographlcally"], _
;~     ["Lex1cogr@phically"], _
;~     ["lexic0graphïca1yl"], _
;~     ["lexIcOgraphically"], _
;~     ["Lexlcographically"] _
;~ ]
;~ For $i = 0 To UBound($Words) - 1
;~     $Words[$i][1] = _Typos($Words[$i][0], $reference)
;~ Next
;~ _ArrayDisplay($Words, "Number of typos")
;~ ConsoleWrite("Usage of '_' and '%' wildcards in pattern:" & @LF & @TAB & "_Typos('lex1c0gr@fhlâofznho', 'LEx_c_gr%') = " & _Typos('lex1c0gr@fhlofznho', 'lex_c_gr%') & @LF)
;~ ConsoleWrite("Does not always return the absolute minimum edit distance:" & @LF & @TAB & "_Typos('bdac', 'abcd') = " & _Typos('bdac', 'abcd') & @LF)
;~
EndFunc

got it, jchd:


IEbyXPATH-Grab IE DOM objects by XPATH IEscriptRecord-Makings of an IE script recorder ExcelFromXML-Create Excel docs without excel installed GetAllWindowControls-Output all control data on a given window.

sambalec

Nice! I'll be waiting. Many thanks!

sambalec

Thanks, FireFox and jdelaney, for your help.

When I try this in your script (FireFox):

Local Const $s1 = "pizza service"

Local Const $s2 = "Pizza Service"

the result is 0 % (perfect for me).

But with:

Local Const $s1 = "pizza service"

Local Const $s2 = "the pizza Service"

the result is 100 % (not good, I would like about 20 % of difference).

water

If you need it to be case sensitive, just change this line in the example FireFox provided

If StringCompare($a1[$i], $a2[$i], 2) <> 0 Then $iDiffCount += 1
to this
If StringCompare($a1[$i], $a2[$i], 1) <> 0 Then $iDiffCount += 1
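
To see what the third parameter changes, here is a tiny sketch (my own illustration, not part of FireFox's script). StringCompare returns 0 when the two strings are equal under the chosen mode; only the zero / non-zero distinction is relied on here.

```autoit
; Flag 1 = case sensitive, flag 2 = not case sensitive.
Local $iSensitive = StringCompare("Pizza", "pizza", 1)
Local $iInsensitive = StringCompare("Pizza", "pizza", 2)

ConsoleWrite("Case sensitive:     " & $iSensitive & @CRLF)   ; non-zero, the strings differ
ConsoleWrite("Not case sensitive: " & $iInsensitive & @CRLF) ; 0, the strings are equal
```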

My UDFs and Tutorials:

Spoiler

UDFs:
Active Directory (NEW 2018-10-19 - Version 1.4.10.0) - Download - General Help & Support - Example Scripts - Wiki
OutlookEX (2018-09-01 - Version 1.3.4.0) - Download - General Help & Support - Example Scripts - Wiki
ExcelChart (2017-07-21 - Version 0.4.0.1) - Download - General Help & Support - Example Scripts
PowerPoint (2017-06-06 - Version 0.0.5.0) - Download - General Help & Support
Excel - Example Scripts - Wiki
Word - Wiki
 
Tutorials:

ADO - Wiki

 

sambalec

Thanks, water.

My problem is not case sensitivity.

The problem is:

Local Const $s1 = "pizza service"

Local Const $s2 = "the pizza Service"

The result is 100 % of difference (not good for me, I would like about 20 % of difference).

FireFox

result is 100 % of difference (not good for me, I would like about 20 % of difference)

Yes, because it compares position by position from left to right. I don't know which algorithm would best fit your need.

Maybe run a second check from the opposite direction and take the smaller difference?
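
A minimal sketch of that idea, under a few assumptions of mine: the helper names _PositionalDiff and _BestPositionalDiff are hypothetical, and unlike the script above, characters without a counterpart in the shorter string are counted as differences and the total is divided by the longer length.

```autoit
; Hypothetical helper: percentage of differing characters when the strings are
; aligned at the left ($bFromRight = False) or at the right ($bFromRight = True).
Func _PositionalDiff($s1, $s2, $bFromRight = False)
    Local $iLen1 = StringLen($s1), $iLen2 = StringLen($s2)
    Local $iMin = $iLen1, $iMax = $iLen2
    If $iLen2 < $iLen1 Then
        $iMin = $iLen2
        $iMax = $iLen1
    EndIf
    If $iMax = 0 Then Return 0 ; two empty strings are identical

    ; Assumption: the length gap itself counts as differing characters.
    Local $iDiff = $iMax - $iMin
    Local $c1, $c2
    For $i = 1 To $iMin
        If $bFromRight Then
            $c1 = StringMid($s1, $iLen1 - $i + 1, 1)
            $c2 = StringMid($s2, $iLen2 - $i + 1, 1)
        Else
            $c1 = StringMid($s1, $i, 1)
            $c2 = StringMid($s2, $i, 1)
        EndIf
        If StringCompare($c1, $c2, 2) <> 0 Then $iDiff += 1 ; not case sensitive
    Next
    Return $iDiff / $iMax * 100
EndFunc

; Run the check in both directions and keep the smaller difference.
Func _BestPositionalDiff($s1, $s2)
    Local $fLeft = _PositionalDiff($s1, $s2, False)
    Local $fRight = _PositionalDiff($s1, $s2, True)
    If $fRight < $fLeft Then Return $fRight
    Return $fLeft
EndFunc

ConsoleWrite("Diff: " & Round(_BestPositionalDiff("pizza service", "the pizza Service"), 2) & "%" & @CRLF)
```

For "pizza service" against "the pizza Service", the right-aligned pass only sees the four extra leading characters, so the difference works out to 4 / 17, roughly 23.5 %. This still breaks down as soon as the extra text sits in the middle of the string, which is where an edit-distance approach fits better.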

Br, FireFox.


 

water

My best bet is: Search for an algorithm written in Visual Basic and then translate it to AutoIt.


FireFox

Thanks for your link, water. Like I said, there are different algorithms for checking the similarity of strings, and from this search I came up with the link below.

@sambalec

Can you choose an algorithm from this page? I or someone else will be glad to translate it for you ;)

Br, FireFox.

jdelaney

#include <Array.au3> ; needed for _ArrayDisplay

Local $reference = "pizza service"
Local $Words[4][4] = [ _
  [$reference], _
  ["the pizza service"], _
  ["tha piza service"], _
  ["pitza sarvace"]]
For $i = 0 To UBound($Words) - 1
 $Words[$i][1] = _Typos($Words[$i][0], $reference)
 $Words[$i][2] = (StringLen($reference) - $Words[$i][1]) / StringLen($reference)
 $Words[$i][3] = Abs(1-(StringLen($reference) - $Words[$i][1]) / StringLen($reference))
Next
_ArrayDisplay($Words, "Number of typos")
Exit
Func _Typos(Const $st1, Const $st2, $anychar = '_', $anytail = '%') ; Get amount of typos between two strings
 Local $s1, $s2, $pen, $del, $ins, $subst
 If Not IsString($st1) Then Return SetError(-1, -1, -1)
 If Not IsString($st2) Then Return SetError(-2, -2, -1)
 If $st2 = '' Then Return StringLen($st1)
 If $st2 == $anytail Then Return 0
 If $st1 = '' Then
  Return (StringInStr($st2 & $anytail, $anytail, 1) - 1)
 EndIf
;~ $s1 = StringSplit(_LowerUnaccent($st1)), "", 2) ;; _LowerUnaccent() addon function not available here
;~ $s2 = StringSplit(_LowerUnaccent($st2)), "", 2) ;; _LowerUnaccent() addon function not available here
 $s1 = StringSplit(StringLower($st1), "", 2)
 $s2 = StringSplit(StringLower($st2), "", 2)
 Local $l1 = UBound($s1), $l2 = UBound($s2)
 Local $r[$l1 + 1][$l2 + 1]
 For $x = 0 To $l2 - 1
  Switch $s2[$x]
   Case $anychar
    If $x < $l1 Then
     $s2[$x] = $s1[$x]
    EndIf
   Case $anytail
    $l2 = $x
    If $l1 > $l2 Then
     $l1 = $l2
    EndIf
    ExitLoop
  EndSwitch
  $r[0][$x] = $x
 Next
 $r[0][$l2] = $l2
 For $x = 0 To $l1
  $r[$x][0] = $x
 Next
 For $x = 1 To $l1
  For $y = 1 To $l2
   $pen = Not ($s1[$x - 1] == $s2[$y - 1])
   $del = $r[$x - 1][$y] + 1
   $ins = $r[$x][$y - 1] + 1
   $subst = $r[$x - 1][$y - 1] + $pen
   If $del > $ins Then $del = $ins
   If $del > $subst Then $del = $subst
   $r[$x][$y] = $del
   If ($pen And $x > 1 And $y > 1 And $s1[$x - 1] == $s2[$y - 2] And $s1[$x - 2] == $s2[$y - 1]) Then
    If $r[$x][$y] >= $r[$x - 2][$y - 2] Then $r[$x][$y] = $r[$x - 2][$y - 2] + 1
    $r[$x - 1][$y - 1] = $r[$x][$y]
   EndIf
  Next
 Next
 Return ($r[$l1][$l2])
EndFunc   ;==>_Typos

output (against the expected):

   |String|Count wrong|Percent correct|Percent wrong
[0]|pizza service|0|1|0
[1]|the pizza service|4|0.692307692307692|0.307692307692308
[2]|tha piza service|5|0.615384615384615|0.384615384615385
[3]|pitza sarvace|3|0.769230769230769|0.230769230769231

Or compute the percentages against the length of the actual string instead, using:

Local $reference = "pizza service"
Local $Words[4][4] = [ _
  [$reference], _
  ["the pizza service"], _
  ["tha piza service"], _
  ["pitza sarvace"]]
For $i = 0 To UBound($Words) - 1
 $Words[$i][1] = _Typos($Words[$i][0], $reference)
 $Words[$i][2] = (StringLen($Words[$i][0]) - $Words[$i][1]) / StringLen($Words[$i][0])
 $Words[$i][3] = Abs(1-(StringLen($Words[$i][0]) - $Words[$i][1]) / StringLen($Words[$i][0]))
Next
_ArrayDisplay($Words, "Number of typos")

output:

[0]|pizza service|0|1|0
[1]|the pizza service|4|0.764705882352941|0.235294117647059
[2]|tha piza service|5|0.6875|0.3125
[3]|pitza sarvace|3|0.769230769230769|0.230769230769231


kylomas

sambalec,

What percent similar are the following two sets of strings (by your definition of similar)?

abcd

acbd

and

z

zzz

kylomas

also: these strings

the boy

the boy


Forum Rules         Procedure for posting code

"I like pigs.  Dogs look up to us.  Cats look down on us.  Pigs treat us as equals."

- Sir Winston Churchill

jchd

A percentage is not necessarily the most informative measure, since it depends on the length of the strings. My function returns the number of edits required to change string1 into string2.
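
If a percentage is still wanted, one common convention is to divide the edit count by the length of the longer string. The sketch below assumes that convention and uses a plain Levenshtein distance; _EditDistance and _SimilarityPercent are hypothetical names, and this is not the _Typos() function (no wildcards, no transposition handling).

```autoit
; Hypothetical helper: plain Levenshtein distance (insert, delete, substitute),
; not case sensitive because both strings are lower-cased first.
Func _EditDistance($s1, $s2)
    $s1 = StringLower($s1)
    $s2 = StringLower($s2)
    Local $iLen1 = StringLen($s1), $iLen2 = StringLen($s2)
    Local $aDist[$iLen1 + 1][$iLen2 + 1]
    For $i = 0 To $iLen1
        $aDist[$i][0] = $i ; delete all characters of $s1
    Next
    For $j = 0 To $iLen2
        $aDist[0][$j] = $j ; insert all characters of $s2
    Next
    Local $iCost, $iBest
    For $i = 1 To $iLen1
        For $j = 1 To $iLen2
            $iCost = 1
            If StringMid($s1, $i, 1) == StringMid($s2, $j, 1) Then $iCost = 0
            $iBest = $aDist[$i - 1][$j - 1] + $iCost ; substitution or match
            If $aDist[$i - 1][$j] + 1 < $iBest Then $iBest = $aDist[$i - 1][$j] + 1 ; deletion
            If $aDist[$i][$j - 1] + 1 < $iBest Then $iBest = $aDist[$i][$j - 1] + 1 ; insertion
            $aDist[$i][$j] = $iBest
        Next
    Next
    Return $aDist[$iLen1][$iLen2]
EndFunc

; Assumption: normalise by the longer string so the result stays between 0 and 100.
Func _SimilarityPercent($s1, $s2)
    Local $iMax = StringLen($s1)
    If StringLen($s2) > $iMax Then $iMax = StringLen($s2)
    If $iMax = 0 Then Return 100 ; two empty strings are identical
    Return (1 - _EditDistance($s1, $s2) / $iMax) * 100
EndFunc

ConsoleWrite(Round(_SimilarityPercent("Hello Worlds !", "Hello my World !!!"), 2) & "% similar" & @CRLF)
```

For the two strings from the first post this gives 6 edits over 18 characters, so about 66.7 % similar. Dividing by the reference length or by the average length instead gives different percentages, which is exactly the length dependence mentioned above.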


This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)

sambalec

Many thanks for your help ! :-)

@jdelaney: your script works very well for me... only the handling of reversed word order is missing.

Example:

the Pizza Service

Service Pizza the

I need to get the same result for both :-)

jdelaney

the Pizza Service

Service Pizza the

I need to get same result :-)

Those two wouldn't return the same result, but these would:

the Pizza Service

Pizza Service the

Local $reference = "Pizza Service"
Local $Words[4][4] = [ _
  [$reference], _
  ["the Pizza Service"], _
  ["Pizza Service the"], _
  ["Service Pizza the"]]
For $i = 0 To UBound($Words) - 1
 $Words[$i][1] = _Typos($Words[$i][0], $reference)
 $Words[$i][2] = (StringLen($Words[$i][0]) - $Words[$i][1]) / StringLen($Words[$i][0])
 $Words[$i][3] = Abs(1-(StringLen($Words[$i][0]) - $Words[$i][1]) / StringLen($Words[$i][0]))
Next
_ArrayDisplay($Words, "Number of typos")

output:

[0]|Pizza Service|0|1|0
[1]|the Pizza Service|4|0.764705882352941|0.235294117647059
[2]|Pizza Service the|4|0.764705882352941|0.235294117647059
[3]|Service Pizza the|14|0.176470588235294|0.823529411764706
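
If word order should not matter at all, one possible workaround is to normalise both strings before comparing them: lower-case, split into words, sort the words and join them again. The sketch below assumes words are separated by spaces; _SortWords is a hypothetical helper name.

```autoit
#include <Array.au3> ; _ArraySort, _ArrayToString

; Hypothetical helper: returns the words of $sText in lower case and sorted order,
; joined by single spaces, so that word order no longer influences a comparison.
Func _SortWords($sText)
    ; Flag 7 strips leading, trailing and repeated whitespace between words.
    Local $sClean = StringLower(StringStripWS($sText, 7))
    ; Flag 2 returns a 0-based array without the count in element [0].
    Local $aWords = StringSplit($sClean, " ", 2)
    _ArraySort($aWords)
    Return _ArrayToString($aWords, " ")
EndFunc

ConsoleWrite(_SortWords("the Pizza Service") & @CRLF)
ConsoleWrite(_SortWords("Service Pizza the") & @CRLF)
```

Both calls should give the same normalised string, so passing the two results to _Typos() would report no typos for a pure reordering. The trade-off is that a single misspelled word can change the sort order and inflate the count.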


kylomas

A percentage is not necessarily the most informative measure, since it depends on the length of the strings. My function returns the number of edits required to change string1 into string2.

I know, I'm trying to understand the OP's rules...

sambalec,

Try this

local $str1 = 'the pizaa service', $init_len1 = stringlen($str1)
local $str2 = 'pizaa service the', $init_len2 = stringlen($str2)

for $1 = 1 to stringlen($str1)
    for $2 = 1 to stringlen($str2)
        if stringmid($str1,$1,1) = stringmid($str2,$2,1) then
            $str2 = stringreplace($str2,stringmid($str2,$2,1),'_')
            $str1 = stringreplace($str1,stringmid($str1,$1,1),'_')
        endif
    next
next

$str1 = stringreplace($str1,'_','')
$str2 = stringreplace($str2,'_','')

ConsoleWrite('String1 is ' & round( (stringlen($str1)/$init_len2)*100,2 ) & '% different from string2' & @LF)
ConsoleWrite('String2 is ' & round( (stringlen($str2)/$init_len1)*100,2 ) & '% different from string1' & @LF)

kylomas


jchd

My remark was not towards you kylomas.

Fuzzy question, fuzzy answer.

kylomas

My remark was not towards you kylomas.

Fuzzy question, fuzzy answer.

Yes, I know, been trying to get specifications.

@sambalec,

Please define exactly what you want.

kylomas


kylomas

sambalec, (follow up from 04/02/2013)

The code that I posted simply eliminates "like" letters from each subject string. Therefore, this will produce differences of "0" percent:

;local $str1 = 'the pizaa service', $init_len1 = stringlen($str1)
;local $str2 = 'pizaa service the', $init_len2 = stringlen($str2)
local $str1 = 'zzzzzz', $init_len1 = stringlen($str1)
local $str2 = 'z', $init_len2 = stringlen($str2)

for $1 = 1 to stringlen($str1)
    for $2 = 1 to stringlen($str2)
        if stringmid($str1,$1,1) = stringmid($str2,$2,1) then
            $str2 = stringreplace($str2,stringmid($str2,$2,1),'_')
            $str1 = stringreplace($str1,stringmid($str1,$1,1),'_')
        endif
    next
next

$str1 = stringreplace($str1,'_','')
$str2 = stringreplace($str2,'_','')

ConsoleWrite('String1 is ' & round( (stringlen($str1)/$init_len2)*100,2 ) & '% different from string2' & @LF)
ConsoleWrite('String2 is ' & round( (stringlen($str2)/$init_len1)*100,2 ) & '% different from string1' & @LF)

Do you see why we are asking for further specifications?

kylomas

edit: additional info

This version replaces only one occurrence per match and so leaves duplicate characters behind: "z" compared to "zzzzzz" is 500% different (because there are 5 "z" left over).

;local $str1 = 'the pizaa service', $init_len1 = stringlen($str1)
;local $str2 = 'pizaa service the', $init_len2 = stringlen($str2)
local $str1 = 'zzzzzz', $init_len1 = stringlen($str1)
local $str2 = 'z', $init_len2 = stringlen($str2)

for $1 = 1 to stringlen($str1)
    for $2 = 1 to stringlen($str2)
        if stringmid($str1,$1,1) = stringmid($str2,$2,1) then
            $str2 = stringreplace($str2,stringmid($str2,$2,1),'_',1)
            $str1 = stringreplace($str1,stringmid($str1,$1,1),'_',1)
        endif
    next
next

$str1 = stringreplace($str1,'_','')
$str2 = stringreplace($str2,'_','')

ConsoleWrite('String1 is ' & round( (stringlen($str1)/$init_len2)*100,3 ) & '% different from string2' & @LF)
ConsoleWrite('String2 is ' & round( (stringlen($str2)/$init_len1)*100,3 ) & '% different from string1' & @LF)