AutoIt

Comparing strings with error correction

4 posts in this topic

I am trying to compare strings against an array of names, and I want to include error correction, such as correcting interchanged letters. For example: Smith = smiht.

Since the array is pretty big and I have to compare several strings, this should be as fast as possible. I have working code that does it with brute force.

$input = "smiht"

Global $namearray[4]
$namearray[0] = "Willis"
$namearray[1] = "Schwarzenegger"
$namearray[2] = "Smith"
$namearray[3] = "Norris"

; First pass: look for an exact (case-insensitive) match.
For $i = 0 To UBound($namearray) - 1
    If StringCompare($namearray[$i], $input) = 0 Then
        MsgBox(0, "Match", "Number: " & $i + 1)
        Exit
    EndIf
Next

; Second pass: retry with every pair of adjacent letters swapped.
$length = StringLen($input)
For $j = 0 To UBound($namearray) - 1
    For $k = 1 To $length - 1
        ; Swap the letters at positions $k and $k + 1.
        $letter1 = StringMid($input, $k, 1)
        $letter2 = StringMid($input, $k + 1, 1)
        $replaced = StringLeft($input, $k - 1) & $letter2 & $letter1 & StringRight($input, $length - $k - 1)
        If StringCompare($namearray[$j], $replaced) = 0 Then
            MsgBox(0, "Match", "Number: " & $j + 1)
            Exit
        EndIf
    Next
Next

That is obviously not very elegant and much too slow. I would appreciate any suggestions on how to approach this problem.
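The same brute-force idea can be made much faster by turning the nested scan around: generate the input plus every adjacent-letter swap once, then probe a lookup table built from the name array, so each query costs a handful of hash probes instead of a full pass over the list for every swap. Here is a sketch in Python (rather than AutoIt, purely for illustration; the names and function names are invented for the example):

```python
def adjacent_swaps(word):
    # Yield the word itself, then every variant with one pair of
    # adjacent letters interchanged, e.g. "smiht" -> "smith".
    yield word
    for k in range(len(word) - 1):
        yield word[:k] + word[k + 1] + word[k] + word[k + 2:]

def find_match(user_input, names):
    # Build a case-insensitive lookup table once; each query then
    # costs O(length) dictionary probes instead of scanning the list.
    table = {name.lower(): i for i, name in enumerate(names)}
    for candidate in adjacent_swaps(user_input.lower()):
        if candidate in table:
            return table[candidate]
    return None

names = ["Willis", "Schwarzenegger", "Smith", "Norris"]
print(find_match("smiht", names))  # 2 (index of "Smith")
```

For a fixed list of ~14,000 names the table is built once, so the per-query cost depends only on the length of the input string.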


I could be off target here, but maybe a simple array is not best suited for this task.

I don't know much about SQL, but maybe a database is better suited; I think you could then use the "LIKE" keyword.

I know this is not a solution to what you asked, but rather a suggestion.




Thanks for the answer, JohnOne.

I have never worked with SQL, but I probably will sooner or later. I just took a look at the LIKE operator, and I do not see how it can help with interchanged letters. I only found the "_" and "%" wildcards, which could help in identifying partial matches.
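To see concretely what LIKE can and cannot do here, this small Python/sqlite3 sketch (table and data invented for the example) shows that the wildcards catch partial matches but not transposed letters:

```python
import sqlite3

# LIKE matches wildcard patterns only: '_' matches exactly one
# character, '%' matches any run of characters. It knows nothing
# about transpositions or edit distance.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE names (name TEXT)")
con.executemany("INSERT INTO names VALUES (?)",
                [("Willis",), ("Smith",), ("Norris",)])

# A partial match works ...
print(con.execute(
    "SELECT name FROM names WHERE name LIKE 'Smi%'").fetchall())
# [('Smith',)]

# ... but a transposition does not: 'smiht' matches nothing.
print(con.execute(
    "SELECT name FROM names WHERE name LIKE 'smiht'").fetchall())
# []
```

Note that SQLite's LIKE is case-insensitive for ASCII by default, so 'smith' would still match 'Smith'; it is only the swapped letters that defeat it.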

Do you think SQL would make sense speed-wise for just a single list of names (~14,000 entries)?


#4 ·  Posted (edited)

I happened to have the need for a more general set of text functions for Unicode string manipulation in the context of SQLite. I therefore developed a reasonably small and efficient SQLite extension in C offering, among others, a typos() function using the Damerau-Levenshtein distance. This function returns the number of typos (missing letters, extra letters, swapped letters and exchanged letters) between two strings. Since it's coded in C, it may prove faster than the AutoIt translation (which you should find by searching the Examples forum) when applied to a significant number of entries. Keep in mind that whatever language the function is implemented in, it's still O(m n) for strings of m and n characters respectively, so it should be reserved for short strings (like names, street names, city names, etc.).
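For readers who want to see the algorithm itself, here is a minimal Python sketch of the optimal-string-alignment variant of Damerau-Levenshtein (an illustration of the general technique, not the actual C extension): it counts insertions, deletions, substitutions and swaps of adjacent characters, in O(m n) time and space.

```python
def osa_distance(a, b):
    # Optimal string alignment distance between strings a and b.
    # d[i][j] = edits needed to turn a[:i] into b[:j].
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete all of a[:i]
    for j in range(n + 1):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                # Swap of two adjacent characters counts as one edit.
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[m][n]

print(osa_distance("smith", "smiht"))  # 1: one adjacent swap
```

With plain Levenshtein the same transposition would cost 2 edits, which is exactly why the Damerau variant suits the "interchanged letters" case in the original question.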

This typos() function can help a lot in fuzzy searches involving accented or otherwise non-English letters (it first unaccents strings using built-in simplified Unicode tries and then folds [lowercases] them, so you don't have to worry about missing or extra diacritics, should they appear).

I invite you to download the package (source code including full explanations and ready-to-use x86 build) at this address.

To make it easy for you to play with it and see for yourself if and how it can help solve your issue, first download the SQLite Expert freeware version, which has support for autoloaded SQLite extensions and is really the easiest tool you can find to avoid messing with SQL[ite] code yourself to a great extent.

Then design a basic table (the schema doesn't matter much at this stage), populate it from some CSV file of sample data, or do it manually or programmatically (not hard given the SQLite UDF included in AutoIt), and determine for yourself whether this path is worth the burden for your particular application.

FYI, here's a recent result I posted to the SQLite mailing list to give you a rough idea of what you can expect in the case of a fuzzy search:

If you need to search names with uncertain spelling, you can use my typos() function to perform a fuzzy search. Here's a sample of its use on a decently populated ZipCodes table (848207 rows):

select pays, zip, ville, region from allcountries where typos(ville, 'saopaul%') < 3 group by pays, ville, region
RecNo Pays Zip       Ville                  Region
----- ---- --------- ---------------------- --------------------------
    1 AR   6221      LA PAULINA             LA PAMPA
    2 AU   2031      St Pauls               New South Wales
    8 ES   22281     La Paul                Aragon
    9 ES   22471     Laspaules              Aragon
   10 ES   07691     Sa Taulera             Baleares
   11 FR   29400     Lampaul Guimiliau      Bretagne
   12 FR   29810     Lampaul Plouarzel      Bretagne
   13 FR   29830     Lampaul Ploudalmezeau  Bretagne
   14 FR   33390     St Paul                Aquitaine
   15 FR   61100     St Paul                Basse-Normandie
   16 FR   87260     St Paul                Limousin
   17 FR   88170     St Paul                Lorraine
   18 FR   65150     St Paul                Midi-Pyrenees
   19 FR   60650     St Paul                Picardie
   20 FR   06570     St Paul                Provence-Alpes-Cote D'Azur
   21 FR   73170     St Paul                Rhone-Alpes
   22 FR   02300     St Paul Aux Bois       Picardie
   23 FR   81220     St Paul Cap De Joux    Midi-Pyrenees
   24 FR   82400     St Paul D Espis        Midi-Pyrenees
>>> big snip >>>
   68 FR   11320     St Paulet              Languedoc-Roussillon
   69 FR   30130     St Paulet De Caisson   Languedoc-Roussillon
   70 FR   43350     St Paulien             Auvergne
   71 GB   EC4       St Paul's              (null)
   72 GB   BR5       St Paul's Cray         (null)
   73 GB   SG4       St Paul's Walden       (null)
   75 IN   281307    Sahpau                 Uttar Pradesh
   76 IN   328027    Saipau                 Rajasthan
   77 IN   171006    Sanjauli               Himachal Pradesh
   78 IT   39050     St.Paul                Trentino-Alto Adige
   79 PK   47701     Sanpal                 Norhern Punajb Rawalpindi
   80 PT   8900-121  Sapal                  Faro
   81 PT   4560-042  Sopal                  Porto
   93 RE   97460     St Paul                (null)

Sorry for the fixed spacing and the diacritics, which both seem to get destroyed by the new board code.

EDIT: I needed to remove completely any accented letter above (i.e. rows containing those) as the board doesn't like Unicode anymore...

Result obtained by a full table scan in 1.7 s on a 3-year-old PC. Remember that SQL allows you to order the result set by increasing distance without any extra work, for instance:

select pays, zip, ville, region, typos(ville, 'saopaul%') as typos from allcountries where typos < 3 group by typos, pays, ville, region order by typos desc

You get the closest matches first in the resulting array by using the above construct. It's up to you to estimate how many typos you allow, to avoid dragging in too much unrelated data.

Note that you can couple this with a soundex comparison at relatively decent cost, but soundex is by far too sensitive to errors in key letters and is inappropriate for non-English languages (I do have code for two distinct French soundex-like adaptations, but they give less-than-average results for the same reason as US soundex).

A small table with only 14,000 entries would obviously give you good responsiveness, somewhere in the realm of 50-80 ms for a similar search on similar data.

At any rate, once the bulk task of inserting into a names table is done (presumably once for its largest part), updating it, inserting a few new entries, deleting some, and fuzzy-searching the table would probably be _way_ faster than anything you could write in AutoIt.
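As a rough sketch of that workflow (shown in Python's sqlite3 for illustration, with a plain case-folded Levenshtein distance standing in for the real typos() extension, and an invented table):

```python
import sqlite3

def typos(a, b):
    # Hypothetical stand-in for the typos() extension function:
    # plain Levenshtein distance, case-folded. The real C extension
    # also strips accents and charges transpositions only one edit.
    a, b = a.lower(), b.lower()
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

con = sqlite3.connect(":memory:")
# Register the Python function so SQL queries can call typos(...).
con.create_function("typos", 2, typos)
con.execute("CREATE TABLE names (name TEXT)")
con.executemany("INSERT INTO names VALUES (?)",
                [("Willis",), ("Schwarzenegger",),
                 ("Smith",), ("Norris",)])

rows = con.execute(
    "SELECT name FROM names WHERE typos(name, 'smiht') < 3 "
    "ORDER BY typos(name, 'smiht')").fetchall()
print(rows)  # [('Smith',)]
```

The insert happens once; every later fuzzy query is a single SQL statement, which is the responsiveness argument made above.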

Edited by jchd


