Jump to content

Calculate string similarity


Stumpii
 Share

Recommended Posts

I found some great code called SimMetrics that mathematically calculates the similarity of two strings, returning a value of 0.0 (not a match) to 1.0 (exact match).

It is written in .Net 2.0, so I wrote a COM wrapper and installation package so I could use the feature in Excel. It could also be used in AutoIt.

The installation package (which installs the SimMetrics dll, wrapper dll and example au3 code) can be downloaded from here: SimMetrics COM Wrapper

Some examples of use could be:

  • Custom spell checker
  • Search a music database for similar named songs
  • Find the best match to a name in an address book
  • Cross reference part numbers in two list that may have typos
The example given in the download shows the user a list of names and then asks the user to enter a name. The program picks the name that most closely matches the name the user enters.

At the moment only the Levenshtein Distance Metric is included in the wrapper. If anyone finds another metric that they think would be useful, I can add it to the wrapper. For those interested, the SourceForge page for the SimMetrics program is here.

Thanks go to sohfeyr for helping register the .Net dlls without the MS installer.

Here is the example code for those interested:

CODE
#include <Array.au3>

$MySimMetrics = ObjCreate("SimMetricsCOMWrapper.SimMetrics")

If IsObj($MySimMetrics) = 0 Then

MsgBox(16, "SimMetrics COM Wrapper Example", "Could not create SimMetrics object.")

Exit

EndIf

Dim $avArray

$avArray = _ArrayCreate("JPM", "Holger", "Jon", "Larry", "Jeremy", "Valik", "Cyberslug", "Nutster", "Tylo", "JdeB")

_ArrayDisplay($avArray, "List of users")

$Value = InputBox("SimMetrics COM Wrapper Example", "Enter a user to search for (guess/mis-spell a name)")

$MostSimilar = ""

$MostSimilarSimilarity = 0

For $i = 0 To UBound($avArray) - 1

$Similarity = $MySimMetrics.LevensteinSimilarity ($Value, $avArray[$i])

If Number($Similarity) > Number($MostSimilarSimilarity) Then

$MostSimilar = $avArray[$i]

$MostSimilarSimilarity = $Similarity

EndIf

Next

MsgBox(64, "SimMetrics COM Wrapper Example", "Closest match to '" & $Value & "' is '" & $MostSimilar & "' with " & _

$MostSimilarSimilarity * 100 & "% similarity.")

Edited by Stumpii

“Give a man a script; you have helped him for today. Teach a man to script; and you will not have to hear him whine for help.”AutoIt4UE - Custom AutoIt toolbar and wordfile for UltraEdit/UEStudio users.AutoIt Graphical Debugger - A graphical debugger for AutoIt.SimMetrics COM Wrapper - Calculate string similarity.

Link to comment
Share on other sites

  • Moderators

Couldn't resist... I have no idea how close in concept this is though.

MsgBox(0, '', 'Case Sensitive: ' & _StringLikePerc('I am a strinG.', 'I am a string.', 1) & '%' & @CR & _
        'Not Sensitive: ' & _StringLikePerc('I am a strinG.', 'I am a string.') & '%')
Func _StringLikePerc($sString1, $sString2, $nCase = 0)
    Local $nLen1 = StringLen($sString1), $nLen2 = StringLen($sString2)
    Local $nMinLen = $nLen1, $nMaxLen = $nLen2, $iAdd
    If $nLen2 < $nLen1 Then 
        $nMinLen = $nLen2
        $nMaxLen = $nLen1
    EndIf
    If $nCase Then
        For $iCC = 1 To $nMinLen
            If StringMid($sString1, $iCC, 1) == StringMid($sString2, $iCC, 1) Then $iAdd += 1
        Next
    Else
        For $iCC = 1 To $nMinLen
            If StringMid($sString1, $iCC, 1) = StringMid($sString2, $iCC, 1) Then $iAdd += 1
        Next
    EndIf
    Return Round(($iAdd / $nMaxLen) * 100, 2)
EndFunc

Edit:

Now that I think about it, the above isn't a good example, it would only do a % based on if the first char was the same or not... if that varied, then it would return a 0% match.... ahh well.

Edited by SmOke_N

Common sense plays a role in the basics of understanding AutoIt... If you're lacking in that, do us all a favor, and step away from the computer.

Link to comment
Share on other sites

  • 2 years later...

The following implements two "similarity" functions. The first is the traditional Levenshtein method. The second is an algorithm adapted from http://siderite.blogspot.com/2007/01/super...-algorithm.html and re-coded for AutoIt. You can choose which implementation to use by passing a non-zero third parameter. This is my first code post so be kind!

#include <Math.au3>

Dim Const $maxOffset = 5 ; Change according to how far you want to search; a bigger value means a slower comparison

; Example usage
MsgBox(0, "Levenshtein", _Similarity("The cat sat on the mat", "The dog sat on the cat"))
MsgBox(0, "Similarity", _Similarity("The cat sat on the mat", "The dog sat on the cat", 1))

Exit

; Compute similarity between two strings
Func _Similarity($s1, $s2, $alg = 0)
    Local $dis
    If ($alg = 0) Then
        $dis = _Levenshtein($s1, $s2)
    Else 
        $dis = _Distance($s1, $s2)
    EndIf
    Local $maxLen = _Max(StringLen($s1), StringLen($s2))
    If ($maxLen = 0) Then Return 100
    Return (1 - $dis / $maxLen) * 100
EndFunc

; Compute Levenshtein Distance
Func _Levenshtein($s, $t)
    Local $m, $n, $i, $j, $s_i, $t_j, $cost

    $n = StringLen($s)
    $m = StringLen($t)

    If $n * $m = 0 Then Return $m + $n

    Local $d[$n + 1][$m + 1]
    
    For $i = 0 To $n
        $d[$i][0] = $i
    Next
    
    For $j = 0 To $m
        $d[0][$j] = $j
    Next

    For $i = 1 To $n
        $s_i = StringMid($s, $i, 1)
        
        For $j = 1 To $m
            $t_j = StringMid($t, $j, 1)

            If $s_i = $t_j Then
                $cost = 0
            Else
                $cost = 1
            EndIf
            
            $d[$i][$j] = _Minimum3($d[$i - 1][$j] + 1, $d[$i][$j - 1] + 1, $d[$i - 1][$j - 1] + $cost)
        Next
    Next

    Return $d[$n][$m]
  
EndFunc

; Get minimum of three values
Func _Minimum3($a, $b, $c)
    Local $mi

    $mi = $a
    If $b < $mi Then $mi = $b
    If $c < $mi Then $mi = $c
    
    Return $mi
EndFunc

; Efficient calculation of similarity
Func _Distance($s1, $s2)
    Local $ls1 = StringLen($s1), $ls2 = StringLen($s2)
    If $ls1 * $ls2 = 0 Then Return $ls1 + $ls2
    
    Local $c = 0, $offset1 = 0, $offset2 = 0, $dist = 0, $i
    
    While (($c + $offset1 < $ls1) And ($c + $offset2 < $ls2))
        If (StringMid($s1, $c + $offset1 + 1, 1) <> StringMid($s2, $c + $offset2 + 1, 1)) Then
            $offset1 = 0
            $offset2 = 0
            For $i = 0 to $maxOffset - 1
                If (($c + $i < $ls1) And (StringMid($s1, $c + $i + 1, 1) = StringMid($s2, $c + 1, 1))) Then
                    If ($i > 0) Then
                        $dist += 1
                        $offset1 = $i
                    EndIf
                    $dist -= 1
                    ExitLoop
                EndIf
                If (($c + $i < $ls2) And (StringMid($s1, $c + 1, 1) = StringMid($s2, $c + $i + 1, 1))) Then
                    If ($i > 0) Then
                        $dist += 1
                        $offset2 = $i
                    EndIf
                    $dist -= 1
                    ExitLoop
                EndIf
            Next
            $dist += 1
        EndIf
        $c += 1
    Wend
    
    Return $dist + ($ls1 - $offset1 + $ls2 - $offset2) / 2 - $c
EndFunc

WBD

Link to comment
Share on other sites

  • 7 months later...

Here is another example using the Levenshtein Distance or Edit distance algorithm.

Also, a string alignment, longest common subsequence. or _TraceBack function which uses the same table/matrix created by the _EditDistance() function.

The results of the TraceBack function, eg.:-

vintner-
   | ||
writ-ers
No. of Differences: 5

appropriate m-eaning
|||||   |||||    |||
approximate matching
No. of Differences: 7
are written to console. MsgBox does not format the display very well.

In the above two examples - referring to the middle line between the two strings - the vertical lines indicate matches, the spaces are differences.

;
#include <Math.au3>
#include <array.au3>
;Edit distance, Levenshtein distance, or Dynamic time warping
; http: www.csse.monash.edu.au/~lloyd/tildeAlgDS/Dynamic/Edit/

; String alignment, sequence alignment, longest common subsequence. or _TraceBack approach.
; http: books.google.com.au/books?id=STGlsyqtjYMC&pg=PA215&lpg=PA215&dq=string+difference+edit+distane&source=bl&ots=kiQ8vAL_v3&sig=TovNa041KRZDRGIa5bVI9HBMp1E&hl=en&ei=nd-iSt-SGIGYkQXd3OWBBA&sa=X&oi=book_result&ct=result&resnum=9#v=onepage&q=&f=false
; http: en.wikipedia.org/wiki/Sequence_alignment
; http: en.wikipedia.org/wiki/Longest_common_subsequence_problem

; [url="http://www.autoitscript.com/forum/index.php?showtopic=40843&view=findpost&p=303848"]http://www.autoitscript.com/forum/index....p?showtopic=40843&view=findpost&p=303848[/url]
Local $sD
Local $s1 = "appropriate meaning", $s2 = "approximate matching"
;Local $s1 = "vintner", $s2 = "writers"
;Local  $s1='test', $s2='jesT'
;Local  $s1 = "456043",  $s2 =  "1234567"

Local $aArr = _EditDistance($s1, $s2)
_TraceBack('', '', '', $aArr, StringLen($s1), StringLen($s2), $s1, $s2, $sD)

ConsoleWrite(@CRLF & $sD & @CRLF & "No. of Differences: " & $aArr[StringLen($s1)][StringLen($s2)] & @CRLF)
;MsgBox(0, "", $sD & @CRLF & @CRLF & "No. of Differences: " & $aArr[StringLen($s1)][StringLen($s2)])
;_ArrayDisplay($aArr)

Func _EditDistance($s1, $s2)
    Local $m[StringLen($s1) + 1][StringLen($s2) + 1], $i, $j
    $m[0][0] = 0;   boundary conditions
    For $j = 1 To StringLen($s2)
        $m[0][$j] = $m[0][$j - 1] + 1;   boundary conditions
    Next
    For $i = 1 To StringLen($s1)
        $m[$i][0] = $m[$i - 1][0] + 1;   boundary conditions
    Next
    For $j = 1 To StringLen($s2);                      outer loop
        For $i = 1 To StringLen($s1) ;                      inner loop
            If (StringMid($s1, $i, 1) = StringMid($s2, $j, 1)) Then
                $diag = 0;
            Else
                $diag = 1
            EndIf
            $m[$i][$j] = _Min($m[$i - 1][$j] + 1, _ ;  insertion
                    (_Min($m[$i][$j - 1] + 1, _  ;      deletion
                    $m[$i - 1][$j - 1] + $diag))) ;   substitution
        Next
    Next
    ;_TraceBack('', '', '', $m, StringLen($s1), StringLen($s2), $s1, $s2, $Dis);
    ;_ArrayDisplay($m)
    Return $m ; $m[StringLen($s1)][StringLen($s2)]
EndFunc   ;==>_EditDistance


Func _TraceBack($row1, $row2, $row3, $m, $i, $j, $s1, $s2, ByRef $sD)
    Local $diag, $diagCh
    If ($i > 0) And ($j > 0) Then
        $diag = $m[$i - 1][$j - 1]
        $diagCh = '|';
        If (StringMid($s1, $i, 1) <> StringMid($s2, $j, 1)) Then
            $diag += 1
            $diagCh = ' '; }
        EndIf
        If ($m[$i][$j] >= $diag) Then
            _TraceBack(StringMid($s1, $i, 1) & $row1, $diagCh & $row2, StringMid($s2, $j, 1) & $row3, $m, $i - 1, $j - 1, $s1, $s2, $sD);  change or match
        ElseIf $m[$i][$j] = ($m[$i - 1][$j] + 1) Then ;  delete
            _TraceBack(StringMid($s1, $i, 1) & $row1, $diagCh & $row2, '-' & $row3, $m, $i - 1, $j, $s1, $s2, $sD);
        Else
            _TraceBack("-" & $row1, $diagCh & $row2, StringMid($s2, $j, 1) & $row3, $m, $i, $j - 1, $s1, $s2, $sD);  insertion
        EndIf
    ElseIf ($i > 0) Then
        _TraceBack(StringMid($s1, $i, 1) & $row1, " " & $row2, '-' & $row3, $m, $i - 1, $j, $s1, $s2, $sD);
    ElseIf ($j > 0) Then
        _TraceBack("-" & $row1, " " & $row2, StringMid($s2, $j, 1) & $row3, $m, $i, $j - 1, $s1, $s2, $sD);
    Else ;   $i==0 and $j==0
        ;ConsoleWrite( @CRLF & $row1 & @CRLF & $row2 & @CRLF & $row3 & @CRLF)
        $sD = $row1 & @CRLF & $row2 & @CRLF & $row3 & @CRLF
    EndIf
EndFunc   ;==>_TraceBack
;
Link to comment
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now
 Share

  • Recently Browsing   0 members

    • No registered users viewing this page.
×
×
  • Create New...