Jump to content

This site uses cookies. By continuing to browse the site you are agreeing to our use of cookies. Find out more here. X
X


Photo

Calculate string similarity


  • Please log in to reply
4 replies to this topic

#1 Stumpii

Stumpii

    Mr. Stumpii

  • Active Members
  • PipPipPipPipPipPip
  • 473 posts

Posted 08 February 2007 - 01:50 AM

I found some great code called SimMetrics that mathematically calculates the similarity of two strings, returning a value of 0.0 (not a match) to 1.0 (exact match).

It is written in .Net 2.0, so I wrote a COM wrapper and installation package so I could use the feature in Excel. It could also be used in AutoIt.

The installation package (which installs the SimMetrics dll, wrapper dll and example au3 code) can be downloaded from here: SimMetrics COM Wrapper

Some examples of use could be:
  • Custom spell checker
  • Search a music database for similar named songs
  • Find the best match to a name in an address book
  • Cross reference part numbers in two list that may have typos
The example given in the download shows the user a list of names and then asks the user to enter a name. The program picks the name that most closely matches the name the user enters.

At the moment only the Levenshtein Distance Metric is included in the wrapper. If anyone finds another metric that they think would be useful, I can add it to the wrapper. For those interested, the SourceForge page for the SimMetrics program is here.

Thanks go to sohfeyr for helping register the .Net dlls without the MS installer.

Here is the example code for those interested:

CODE
#include <Array.au3>

$MySimMetrics = ObjCreate("SimMetricsCOMWrapper.SimMetrics")
If IsObj($MySimMetrics) = 0 Then
MsgBox(16, "SimMetrics COM Wrapper Example", "Could not create SimMetrics object.")
Exit
EndIf

Dim $avArray
$avArray = _ArrayCreate("JPM", "Holger", "Jon", "Larry", "Jeremy", "Valik", "Cyberslug", "Nutster", "Tylo", "JdeB")
_ArrayDisplay($avArray, "List of users")

$Value = InputBox("SimMetrics COM Wrapper Example", "Enter a user to search for (guess/mis-spell a name)")

$MostSimilar = ""
$MostSimilarSimilarity = 0
For $i = 0 To UBound($avArray) - 1
$Similarity = $MySimMetrics.LevensteinSimilarity ($Value, $avArray[$i])
If Number($Similarity) > Number($MostSimilarSimilarity) Then
$MostSimilar = $avArray[$i]
$MostSimilarSimilarity = $Similarity
EndIf
Next

MsgBox(64, "SimMetrics COM Wrapper Example", "Closest match to '" & $Value & "' is '" & $MostSimilar & "' with " & _
$MostSimilarSimilarity * 100 & "% similarity.")

Edited by Stumpii, 08 February 2007 - 02:03 AM.

“Give a man a script; you have helped him for today. Teach a man to script; and you will not have to hear him whine for help.”AutoIt4UE - Custom AutoIt toolbar and wordfile for UltraEdit/UEStudio users.AutoIt Graphical Debugger - A graphical debugger for AutoIt.SimMetrics COM Wrapper - Calculate string similarity.







#2 Eusebio

Eusebio

    Seeker

  • Active Members
  • 18 posts

Posted 08 February 2007 - 08:01 PM

Here my old code in au3:

http://www.autoitscript.com/forum/index.ph...st&p=224962

#3 SmOke_N

SmOke_N

    It's not what you know ... It's what you can prove!

  • Moderators
  • 15,730 posts

Posted 08 February 2007 - 08:58 PM

Couldn't resist... I have no idea how close in concept this is though.
MsgBox(0, '', 'Case Sensitive: ' & _StringLikePerc('I am a strinG.', 'I am a string.', 1) & '%' & @CR & _         'Not Sensitive: ' & _StringLikePerc('I am a strinG.', 'I am a string.') & '%') Func _StringLikePerc($sString1, $sString2, $nCase = 0)     Local $nLen1 = StringLen($sString1), $nLen2 = StringLen($sString2)     Local $nMinLen = $nLen1, $nMaxLen = $nLen2, $iAdd     If $nLen2 < $nLen1 Then         $nMinLen = $nLen2         $nMaxLen = $nLen1     EndIf     If $nCase Then         For $iCC = 1 To $nMinLen             If StringMid($sString1, $iCC, 1) == StringMid($sString2, $iCC, 1) Then $iAdd += 1         Next     Else         For $iCC = 1 To $nMinLen             If StringMid($sString1, $iCC, 1) = StringMid($sString2, $iCC, 1) Then $iAdd += 1         Next     EndIf     Return Round(($iAdd / $nMaxLen) * 100, 2) EndFunc


Edit:
Now that I think about it, the above isn't a good example, it would only do a % based on if the first char was the same or not... if that varied, then it would return a 0% match.... ahh well.

Edited by SmOke_N, 08 February 2007 - 09:02 PM.

Common sense plays a role in the basics of understanding AutoIt... If you're lacking in that, do us all a favor, and step away from the computer.


#4 WideBoyDixon

WideBoyDixon

    Code Monkey

  • Active Members
  • PipPipPipPipPipPip
  • 381 posts

Posted 10 February 2009 - 02:58 PM

The following implements two "similarity" functions. The first is the traditional Levenshtein method. The second is an algorithm adapted from http://siderite.blogspot.com/2007/01/super...-algorithm.html and re-coded for AutoIt. You can choose which implementation to use by passing a non-zero third parameter. This is my first code post so be kind!

AutoIt         
#include <Math.au3> Dim Const $maxOffset = 5 ; Change according to how far you want to search; a bigger value means a slower comparison ; Example usage MsgBox(0, "Levenshtein", _Similarity("The cat sat on the mat", "The dog sat on the cat")) MsgBox(0, "Similarity", _Similarity("The cat sat on the mat", "The dog sat on the cat", 1)) Exit ; Compute similarity between two strings Func _Similarity($s1, $s2, $alg = 0)     Local $dis     If ($alg = 0) Then         $dis = _Levenshtein($s1, $s2)     Else         $dis = _Distance($s1, $s2)     EndIf     Local $maxLen = _Max(StringLen($s1), StringLen($s2))     If ($maxLen = 0) Then Return 100     Return (1 - $dis / $maxLen) * 100 EndFunc ; Compute Levenshtein Distance Func _Levenshtein($s, $t)     Local $m, $n, $i, $j, $s_i, $t_j, $cost     $n = StringLen($s)     $m = StringLen($t)     If $n * $m = 0 Then Return $m + $n     Local $d[$n + 1][$m + 1]         For $i = 0 To $n         $d[$i][0] = $i     Next         For $j = 0 To $m         $d[0][$j] = $j     Next     For $i = 1 To $n         $s_i = StringMid($s, $i, 1)                 For $j = 1 To $m             $t_j = StringMid($t, $j, 1)             If $s_i = $t_j Then                 $cost = 0             Else                 $cost = 1             EndIf                         $d[$i][$j] = _Minimum3($d[$i - 1][$j] + 1, $d[$i][$j - 1] + 1, $d[$i - 1][$j - 1] + $cost)         Next     Next     Return $d[$n][$m]   EndFunc ; Get minimum of three values Func _Minimum3($a, $b, $c)     Local $mi     $mi = $a     If $b < $mi Then $mi = $b     If $c < $mi Then $mi = $c         Return $mi EndFunc ; Efficient calculation of similarity Func _Distance($s1, $s2)     Local $ls1 = StringLen($s1), $ls2 = StringLen($s2)     If $ls1 * $ls2 = 0 Then Return $ls1 + $ls2         Local $c = 0, $offset1 = 0, $offset2 = 0, $dist = 0, $i         While (($c + $offset1 < $ls1) And ($c + $offset2 < $ls2))         If (StringMid($s1, $c + $offset1 + 1, 1) <> StringMid($s2, $c + $offset2 + 1, 1)) Then             $offset1 = 0             $offset2 = 0             For $i = 0 to $maxOffset - 1                 If (($c + $i < $ls1) And (StringMid($s1, $c + $i + 1, 1) = StringMid($s2, $c + 1, 1))) Then                     If ($i > 0) Then                         $dist += 1                         $offset1 = $i                     EndIf                     $dist -= 1                     ExitLoop                 EndIf                 If (($c + $i < $ls2) And (StringMid($s1, $c + 1, 1) = StringMid($s2, $c + $i + 1, 1))) Then                     If ($i > 0) Then                         $dist += 1                         $offset2 = $i                     EndIf                     $dist -= 1                     ExitLoop                 EndIf             Next             $dist += 1         EndIf         $c += 1     Wend         Return $dist + ($ls1 - $offset1 + $ls2 - $offset2) / 2 - $c EndFunc


WBD

#5 Malkey

Malkey

  • MVPs
  • 1,525 posts

Posted 17 September 2009 - 11:51 PM

Here is another example using the Levenshtein Distance or Edit distance algorithm.

Also, a string alignment, longest common subsequence. or _TraceBack function which uses the same table/matrix created by the _EditDistance() function.

The results of the TraceBack function, eg.:-
vintner-    | || writ-ers No. of Differences: 5 appropriate m-eaning |||||   |||||    ||| approximate matching No. of Differences: 7
are written to console. MsgBox does not format the display very well.

In the above two examples - referring to the middle line between the two strings - the vertical lines indicate matches, the spaces are differences.

AutoIt         
; #include <Math.au3> #include <array.au3> ;Edit distance, Levenshtein distance, or Dynamic time warping ; http: www.csse.monash.edu.au/~lloyd/tildeAlgDS/Dynamic/Edit/ ; String alignment, sequence alignment, longest common subsequence. or _TraceBack approach. ; http: books.google.com.au/books?id=STGlsyqtjYMC&pg=PA215&lpg=PA215&dq=string+difference+edit+distane&source=bl&ots=kiQ8vAL_v3&sig=TovNa041KRZDRGIa5bVI9HBMp1E&hl=en&ei=nd-iSt-SGIGYkQXd3OWBBA&sa=X&oi=book_result&ct=result&resnum=9#v=onepage&q=&f=false ; http: en.wikipedia.org/wiki/Sequence_alignment ; http: en.wikipedia.org/wiki/Longest_common_subsequence_problem ; [url="http://www.autoitscript.com/forum/index.php?showtopic=40843&view=findpost&p=303848"]http://www.autoitscript.com/forum/index....p?showtopic=40843&view=findpost&p=303848[/url] Local $sD Local $s1 = "appropriate meaning", $s2 = "approximate matching" ;Local $s1 = "vintner", $s2 = "writers" ;Local  $s1='test', $s2='jesT' ;Local  $s1 = "456043",  $s2 =  "1234567" Local $aArr = _EditDistance($s1, $s2) _TraceBack('', '', '', $aArr, StringLen($s1), StringLen($s2), $s1, $s2, $sD) ConsoleWrite(@CRLF & $sD & @CRLF & "No. of Differences: " & $aArr[StringLen($s1)][StringLen($s2)] & @CRLF) ;MsgBox(0, "", $sD & @CRLF & @CRLF & "No. of Differences: " & $aArr[StringLen($s1)][StringLen($s2)]) ;_ArrayDisplay($aArr) Func _EditDistance($s1, $s2)     Local $m[StringLen($s1) + 1][StringLen($s2) + 1], $i, $j     $m[0][0] = 0;   boundary conditions     For $j = 1 To StringLen($s2)         $m[0][$j] = $m[0][$j - 1] + 1;   boundary conditions     Next     For $i = 1 To StringLen($s1)         $m[$i][0] = $m[$i - 1][0] + 1;   boundary conditions     Next     For $j = 1 To StringLen($s2);                      outer loop         For $i = 1 To StringLen($s1) ;                      inner loop             If (StringMid($s1, $i, 1) = StringMid($s2, $j, 1)) Then                 $diag = 0;             Else                 $diag = 1             EndIf             $m[$i][$j] = _Min($m[$i - 1][$j] + 1, _ ;  insertion                     (_Min($m[$i][$j - 1] + 1, _  ;      deletion                     $m[$i - 1][$j - 1] + $diag))) ;   substitution         Next     Next     ;_TraceBack('', '', '', $m, StringLen($s1), StringLen($s2), $s1, $s2, $Dis);     ;_ArrayDisplay($m)     Return $m ; $m[StringLen($s1)][StringLen($s2)] EndFunc   ;==>_EditDistance Func _TraceBack($row1, $row2, $row3, $m, $i, $j, $s1, $s2, ByRef $sD)     Local $diag, $diagCh     If ($i > 0) And ($j > 0) Then         $diag = $m[$i - 1][$j - 1]         $diagCh = '|';         If (StringMid($s1, $i, 1) <> StringMid($s2, $j, 1)) Then             $diag += 1             $diagCh = ' '; }         EndIf         If ($m[$i][$j] >= $diag) Then             _TraceBack(StringMid($s1, $i, 1) & $row1, $diagCh & $row2, StringMid($s2, $j, 1) & $row3, $m, $i - 1, $j - 1, $s1, $s2, $sD);  change or match         ElseIf $m[$i][$j] = ($m[$i - 1][$j] + 1) Then ;  delete             _TraceBack(StringMid($s1, $i, 1) & $row1, $diagCh & $row2, '-' & $row3, $m, $i - 1, $j, $s1, $s2, $sD);         Else             _TraceBack("-" & $row1, $diagCh & $row2, StringMid($s2, $j, 1) & $row3, $m, $i, $j - 1, $s1, $s2, $sD);  insertion         EndIf     ElseIf ($i > 0) Then         _TraceBack(StringMid($s1, $i, 1) & $row1, " " & $row2, '-' & $row3, $m, $i - 1, $j, $s1, $s2, $sD);     ElseIf ($j > 0) Then         _TraceBack("-" & $row1, " " & $row2, StringMid($s2, $j, 1) & $row3, $m, $i, $j - 1, $s1, $s2, $sD);     Else ;   $i==0 and $j==0         ;ConsoleWrite( @CRLF & $row1 & @CRLF & $row2 & @CRLF & $row3 & @CRLF)         $sD = $row1 & @CRLF & $row2 & @CRLF & $row3 & @CRLF     EndIf EndFunc   ;==>_TraceBack ;





0 user(s) are reading this topic

0 members, 0 guests, 0 anonymous users