nAutoIT Posted September 16, 2011 Share Posted September 16, 2011 I am trying to compare strings with an array of names and i want to include error correction, like correcting interchanged letters. For example: Smith = smiht Since the array is pretty big and i have to compare several strings this should be as fast as possible. I have working code, that does it with brute force. $input = "smiht" Global $namearray[4] $namearray[0] = "Willis" $namearray[1] = "Schwarzenegger" $namearray[2] = "Smith" $namearray[3] = "Norris" For $i = 0 To UBound($namearray)-1 If StringCompare($namearray[$i], $input) = 0 Then MsgBox(0, "Match", "Number: " & $i+1) Exit EndIf Next $length = StringLen($input) For $j = 0 To UBound($namearray)-1 For $k = 1 To $length-1 $letter1 = StringTrimLeft($input, $k-1) $letter1 = StringTrimRight($letter1, $length-$k) $letter2 = StringTrimLeft($input, $k) $letter2 = StringTrimRight($letter2, $length-$k-1) $replaced = StringLeft($input, $k-1) & $letter2 & $letter1 & StringRight($input, $length-$k-1) If StringCompare($namearray[$j], $replaced) = 0 Then MsgBox(0, "Match", "Number: " & $j+1) Exit EndIf Next Next That is obviously not very elegant and much to slow. Would appreciate any suggestions on how to approach that problem. Link to comment Share on other sites More sharing options...
JohnOne Posted September 16, 2011 Share Posted September 16, 2011 I could be off target here, but maybe a simple array is not best suited for this task. I don't know much about SQL but maybe a database is better suited, I think then you could use the "LIKE" keyword. I know this is not a solution to what you asked, rather a suggestion. AutoIt Absolute Beginners Require a serial Pause Script Video Tutorials by Morthawt ipify Monkey's are, like, natures humans. Link to comment Share on other sites More sharing options...
nAutoIT Posted September 16, 2011 Author Share Posted September 16, 2011 Thanks for the answer, JohnOne. I never worked with SQL, but i probably will sooner or later. I just took a look at the LIKE operator and i do not see, how that can help with interchanged letters. I only found "_" and "%" wildcards that could help in identifying partial matches. Do you think SQL would make sense speedwise, for only a single list of names (~14000 entries)? Link to comment Share on other sites More sharing options...
jchd Posted September 16, 2011 Share Posted September 16, 2011 (edited) I happen to had the need for a more general set of text functions for Unicode strings manipulations in the context of SQLite. Therefore I developped a reaonnably small and efficient SQLite extension in C offering, among others, a Typo() function using Damerau-Levenshtein distance. This function returns the number of typos (mssing letters, extra letters, swapped letters and exchanged letters) between two strings. Since it's coded in C, it may prove faster when applied to a significant number of entries than the AutoIt translation (which you should find by searching the Examples forum). Keep in mind that whatever language the function is implemented in, it's still O(m n) for strings of resp. m and n characters, so it should be reserved to short strings (like names, street names, city name, etc.) This Typos() function can help much in fuzzy searches involving accented or otherwise non-english letters (it first unaccent strings by using built-in simplified Unicode tries and then folds [lowercases] them, so you don't have to worry about missing or extra diacritics, should they appear). I invite you to download the package (source code including full explanations and ready-to-use x86 build) at this address. To make it easy for your to play with it and see by yourself if and how it can help solve your issue anyhow, first download SQLiteExpert freeware version (which has support for autoloaded SQLite extensions) and is really the easiest tool you can find to avoid messing with SQL[ite] code yourself to a great extent. Then design a basic table (the schema doesn't matter much at this stage), populate it with some CSV file with sample data or do it manually or programmatically (not hard given the SQLite UDF included in AutoIt) and determine by yourself if this path is worth the burden for your particular application. FYI, here's a recent result I posted to the SQLite mailing list to give you a rough idea of what you can expect in the case of a fuzzy search: If you need to search names with uncertain spelling, you can use my typos() function to perform a fuzzy search. Here's a sample of its use on a decently populated ZipCodes table (848207 rows): expandcollapse popupselect pays, zip, ville, region from allcountries where typos(ville, 'saopaul%') < 3 group by pays, ville, region RecNo Pays Zip Ville Region ----- ---- --------- ------------------------ -------------------------- 1 AR 6221 LA PAULINA LA PAMPA 2 AU 2031 St Pauls New South Wales 8 ES 22281 La Paul Aragon 9 ES 22471 Laspaules Aragon 10 ES 07691 Sa Taulera Baleares 11 FR 29400 Lampaul Guimiliau Bretagne 12 FR 29810 Lampaul Plouarzel Bretagne 13 FR 29830 Lampaul Ploudalmezeau Bretagne 14 FR 33390 St Paul Aquitaine 15 FR 61100 St Paul Basse-Normandie 16 FR 87260 St Paul Limousin 17 FR 88170 St Paul Lorraine 18 FR 65150 St Paul Midi-Pyrenees 19 FR 60650 St Paul Picardie 20 FR 06570 St Paul Provence-Alpes-Cote D'Azur 21 FR 73170 St Paul Rhone-Alpes 22 FR 02300 St Paul Aux Bois Picardie 23 FR 81220 St Paul Cap De Joux Midi-Pyrenees 24 FR 82400 St Paul D Espis Midi-Pyrenees >>> big snip >>> 68 FR 11320 St Paulet Languedoc-Roussillon 69 FR 30130 St Paulet De Caisson Languedoc-Roussillon 70 FR 43350 St Paulien Auvergne 71 GB EC4 St Paul's (null) 72 GB BR5 St Paul's Cray (null) 73 GB SG4 St Paul's Walden (null) 75 IN 281307 Sahpau Uttar Pradesh 76 IN 328027 Saipau Rajasthan 77 IN 171006 Sanjauli Himachal Pradesh 78 IT 39050 St.Paul Trentino-Alto Adige 79 PK 47701 Sanpal Norhern Punajb Rawalpindi 80 PT 8900-121 Sapal Faro 81 PT 4560-042 Sopal Porto 93 RE 97460 St Paul (null) Sorry for fixed spacing and diacritics which both seem to get destroyed by the new board code. EDIT: I needed to remove completely any accented letter above (i.e. rows containing those) as the board doesn't like Unicode anymore... Result obtained by full table scan in 1.7s on a 3-year old PC. Remember that SQL allows you to order the result set by increasing distance without involving more work, for instance: select pays, zip, ville, region, typos(ville, 'saopaul%') as typos from allcountries where typos < 3 group by typos, pays, ville, region order by typos desc You get "closest" matches head first in the resulting array by using above construct. Up to you to estimate how much typos you allow to avoid dragging to much unrelated data. Note that you can couple this with a soundex comparison at relatively decent cost, but soundex is by far too sensitive to errors in key letters and is inappropriate for non-english languages (I do have code for 2 distinct French soundex-like adaptations, but they give less than average results for the same reason as US soundex). A small table with only 14 000 entries would obviously you good responsiveness, somewhere in the realm of 50-80 ms for a similar search on similar data. At any rate and once the bulk task of inserting into a names table would be done (presumably once for its largest part), updating it, inserting few new entries, deleting some and fuzzy searching the new table would probably be _way_ faster than anything you could imagine written in AutoIt. Edited September 17, 2011 by jchd This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe hereRegExp tutorial: enough to get startedPCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta. SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt) Link to comment Share on other sites More sharing options...
Recommended Posts
Create an account or sign in to comment
You need to be a member in order to leave a comment
Create an account
Sign up for a new account in our community. It's easy!
Register a new accountSign in
Already have an account? Sign in here.
Sign In Now