
compare strings


fgthhhh

I have a main string and several sub-strings, and I want to find the sub-string that is most like the main string. I don't know how to do it.

example: main-string: aqwert

sub-strings:

+qwerb

+gfdgf

+qwbt

result:

+qwerb : 80% ( have same qwer)

+gfdgf: 0% ( nothing like)

+qwbt: 20% ( have same qw)

Please help me, thanks.

water

I think "StringRegExp" will do what you need. I'm no expert so the following example checks for any character in the pattern and therefore doesn't give the result you need:

#include <array.au3>
$string = "aqwert"
$pattern = "qwbt"
$R = StringRegExp($string,"[" & $pattern & "]",3)
If IsArray($R) Then
    MsgBox(0,"",UBound($R)*100/StringLen($pattern) & "% match")
Else
    MsgBox(0,"","0% match")
EndIf

Maybe some RegExp guru can jump in and give you the correct expression.
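
For comparison, here is a minimal sketch of a different measure that follows the original example more closely (the "have same qwer" idea): the length of the longest run of characters shared by both strings, as a percentage of the sub-string length. The helper name _LongestCommonRun and the choice to divide by the sub-string length are assumptions made for illustration; they are not from the thread or from any UDF.

```autoit
; Hypothetical helper (not from the thread): length of the longest run of
; characters that appears in both strings, compared case-insensitively.
Func _LongestCommonRun($sA, $sB)
    Local $iLenA = StringLen($sA), $iLenB = StringLen($sB), $iBest = 0
    If $iLenA = 0 Or $iLenB = 0 Then Return 0
    ; $aRun[$i][$j] = length of the common run ending at char $i of $sA and char $j of $sB
    Local $aRun[$iLenA + 1][$iLenB + 1]
    For $i = 0 To $iLenA
        For $j = 0 To $iLenB
            $aRun[$i][$j] = 0
        Next
    Next
    For $i = 1 To $iLenA
        For $j = 1 To $iLenB
            If StringMid($sA, $i, 1) = StringMid($sB, $j, 1) Then
                $aRun[$i][$j] = $aRun[$i - 1][$j - 1] + 1
                If $aRun[$i][$j] > $iBest Then $iBest = $aRun[$i][$j]
            EndIf
        Next
    Next
    Return $iBest
EndFunc   ;==>_LongestCommonRun

Local $sMain = "aqwert"
Local $aSubs[3] = ["qwerb", "gfdgf", "qwbt"]
For $n = 0 To UBound($aSubs) - 1
    Local $iRun = _LongestCommonRun($sMain, $aSubs[$n])
    ; Percentage of the sub-string covered by the longest shared run
    ConsoleWrite($aSubs[$n] & ": " & Round(100 * $iRun / StringLen($aSubs[$n])) & "%" & @CRLF)
Next
```

With this measure "qwerb" scores 80% and "gfdgf" 0%, as in the question. "qwbt" scores 50% rather than the 20% quoted there, so the exact formula the original poster had in mind remains a guess.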


My UDFs and Tutorials:

UDFs:
Active Directory (NEW 2018-06-01 - Version 1.4.9.0) - Download - General Help & Support - Example Scripts - Wiki
OutlookEX (2018-01-27 - Version 1.3.3.1) - Download - General Help & Support - Example Scripts - Wiki
ExcelChart (2015-04-01 - Version 0.4.0.0) - Download - General Help & Support - Example Scripts
Excel - Example Scripts - Wiki
Word - Wiki
PowerPoint (2015-06-06 - Version 0.0.5.0) - Download - General Help & Support

Tutorials:
ADO - Wiki

 

whim

This might help as well

wim

fgthhhh

StringRegExp worked like magic, but I still don't understand how it works :)

Approximate string matching turned out to be really complicated :(

Anyway, thank you both so much. I will need to research more.

jchd

You can use my Typos() fuzzy comparison function: Typos.au3

It computes the edit distance between two strings, that is, the number of omissions, insertions, changes or swaps of letters necessary to transform one string into the other. If you compare several strings in succession and keep the one with the fewest errors (typos), you'll be home.

Optionally, you can use two distinct wildcards in the second string: _ and % (the same characters as in SQL LIKE).

_ is a single-character joker, much like ? in Windows filename patterns

% may represent zero or more characters, like Windows * (but % may only appear at the end of the second parameter)

Try it and post again if you have problems using it.


This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)

fgthhhh

Hi jchd, you wrote an awesome script,

but I don't understand what the function returns.

Does 0 mean the strings are the same?

Does a higher number mean more mistakes?

I tried:

$asd = _Typos("aqwert", "qwertb")

MsgBox(0,"",$asd)

It returns 2.

What does that mean?

czardas

I took a quick look at jchd's code. It seems that the return value 2 means that there are two changes needed to convert one string to the other. The changes are as follows:

1. Delete the first character => a

2. Add a character on the end => b.

This converts one string to the other in 2 steps. jchd will be able to tell you if I'm wrong about this.

jchd

That's correct.

If typos($str1, $str2) = 0 Then MsgBox(0, '', $str1 & ' and ' & $str2 & ' are identical (ignoring case).')

; Computes the number of typos (Damerau-Levenshtein distance) between two strings.
; Four types of differences are counted:
;   insertion of a character       abcd -> ab#cd
;   deletion of a character        abcd -> acd
;   exchange of a character        abcd -> ab$d
;   inversion of adjacent chars    abcd -> acbd
;
; This function does NOT satisfy the so-called "triangle inequality", which means
; more simply that it makes NO attempt to compute the MINIMUM edit distance in all
; cases. If you need that, you should use more complex algorithms.
;
; This simple function allows a fuzzy compare for e.g. recovering from typical
; human typos in short strings like names, address, cities... while getting rid of
; minor scripting differences (accents, ligatures).
;
; Strings are unaccented then lowercased.
; String $st2 can be used as a pattern similar to the SQL 'LIKE' operator:
; '_' and trailing '%' act as in LIKE. These wildcards can be passed as parameters
; but % should appear at most once for the function to work properly.

Another comment, from the C version I use as an SQLite extension:

** TYPOS($str1, $str2)
** returns the "Damerau-Levenshtein distance" between StringLower(str1) and
** StringLower(str2). This is the number of insertions, omissions, changes
** and transpositions (of adjacent letters only).
**
** If the reference string is 'abcdef', it will return 1 (one typo) for
**   'abdef'    missing c
**   'abcudef'  u inserted
**   'abzef'    c changed into z
**   'abdcef'   c & d exchanged
**
** Only one level of "typo" is considered, e.g. the function will
** consider the following transformations to be 3 typos:
**   'abcdef'   reference
**   'abdcef'   c & d exchanged
**   'abdzcef'  z inserted inside (c & d exchanged)
** In this case, it will return 3. Technically, it does not
** always return the minimum edit distance and doesn't satisfy
** the "triangle inequality" in all cases. It is nonetheless
** very useful to anyone having to look up simple entries subject to
** user typos (e.g. name or city name).
**
** It will also accept '_' and a trailing '%' in str2, both acting
** as in the SQL LIKE operator.
**
** You can use it this way:
**   $str = "Leiwenschtein"
**   If typos($str, 'leivencht%') <= 2;
** or this way:
**   $nbErrors = typos($str1, $str2)
**
** NOTE: the implementation may seem naive but is open to several
**   evolutions. Due to the complexity in O(n*m) you
**   should reserve its use to _short_ fields only. There
**   are much better algorithms for large fields (most of
**   which are terrible for small strings.) The choice made
**   reflects the typical need to match names, surnames,
**   street addresses, cities or such data prone to typos
**   in user input. Flexibility has been chosen over mere
**   performance, because fuzzy search is _slow_ anyway.
**   So you better have a 380% slower algo that retrieves
**   the data you're after, than a 100% slow algo that misses
**   them most of the time.
**
** | DO NOT use TYPOS in case StringInStr would do! For instance, if
** | your data contains a fixed substring (without typo),
** | then use:
** |     If StringInStr($cityname, 'angel') Then
** | It will match 'Los Angeles' without question. If you try:
** |     If typos($cityname, 'angel%') <= 4 Then
** | you will be overwhelmed with data from everywhere, since up
** | to 4 typos allows for typically _many_ values (cities, here).

Hope this clears some mud. If you still have practical problems using it in the real world, post here.
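
Since the original question asked for a percentage, here is one possible way to turn a typo count into a rough similarity score. This is only a sketch under assumptions: the helper _SimilarityPercent is hypothetical (it is not part of Typos.au3), and dividing by the length of the longer string is just one common convention. It takes the typo count as a parameter, so it works with whichever distance function you use.

```autoit
; Hypothetical helper, not part of Typos.au3: converts a typo count into a
; rough similarity percentage relative to the longer of the two strings.
Func _SimilarityPercent($iTypos, $s1, $s2)
    Local $iLongest = StringLen($s1)
    If StringLen($s2) > $iLongest Then $iLongest = StringLen($s2)
    If $iLongest = 0 Then Return 100 ; two empty strings are identical
    Local $fPercent = 100 * (1 - $iTypos / $iLongest)
    If $fPercent < 0 Then $fPercent = 0 ; clamp in case the count exceeds the length
    Return Round($fPercent, 1)
EndFunc   ;==>_SimilarityPercent

; 2 is the typo count reported earlier in this thread for "aqwert" vs "qwertb".
ConsoleWrite(_SimilarityPercent(2, "aqwert", "qwertb") & "%" & @CRLF)
```

For that pair the score comes out at roughly 67%. The percentage is less meaningful when the second string contains the _ or % wildcards, because the pattern length then no longer reflects the length of the text it matches.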


fgthhhh

Thanks for the answer, mate.

I want to ask a question:

can I use your script to auto-correct a word?

If yes, can you show me an example?

For example, how can "thraa" be auto-corrected to "three"?

Can I compare "thraa" with some possible words and choose the best?

jchd

You may have some (relative) success in doing so, but mostly for limited cases. For instance, this function works well in selecting words from a list which have a spelling close to a given word. It was designed for this purpose, as an extension to a database engine.

In your example, only a human brain or a really "smart" program can choose which of threw, three, tharm (for instance) should be the replacement for thraa. To make the (right) correction by program, you have to identify the context, the grammar, the partial semantics and devise a target global semantics to infer the right correction.

For spelling or grammar correction, you'll have a much better time using one of the available libraries specialized in those tasks.
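
To illustrate why the distance alone cannot settle the choice, here is a small self-contained sketch that lists every candidate tied for the smallest distance. It uses a plain Levenshtein distance under the hypothetical name _Lev, not Typos(): unlike Typos(), it counts a swap of adjacent letters as two edits, and it has no wildcard support.

```autoit
; Plain Levenshtein distance (insert, delete, substitute), case-insensitive.
; Hypothetical stand-in for illustration only; this is not Typos().
Func _Lev($s1, $s2)
    Local $n = StringLen($s1), $m = StringLen($s2)
    Local $d[$n + 1][$m + 1]
    For $i = 0 To $n
        $d[$i][0] = $i ; cost of deleting the first $i chars of $s1
    Next
    For $j = 0 To $m
        $d[0][$j] = $j ; cost of inserting the first $j chars of $s2
    Next
    For $i = 1 To $n
        For $j = 1 To $m
            Local $cost = 1
            If StringMid($s1, $i, 1) = StringMid($s2, $j, 1) Then $cost = 0
            Local $best = $d[$i - 1][$j - 1] + $cost ; substitution or match
            If $d[$i - 1][$j] + 1 < $best Then $best = $d[$i - 1][$j] + 1 ; deletion
            If $d[$i][$j - 1] + 1 < $best Then $best = $d[$i][$j - 1] + 1 ; insertion
            $d[$i][$j] = $best
        Next
    Next
    Return $d[$n][$m]
EndFunc   ;==>_Lev

Local $sWord = "thraa"
Local $aCandidates[3] = ["threw", "three", "tharm"]
Local $iMin = -1, $sTied = ""
For $k = 0 To UBound($aCandidates) - 1
    Local $iDist = _Lev($aCandidates[$k], $sWord)
    If $iMin = -1 Or $iDist < $iMin Then
        $iMin = $iDist ; new best: restart the list of ties
        $sTied = $aCandidates[$k]
    ElseIf $iDist = $iMin Then
        $sTied &= ", " & $aCandidates[$k] ; same distance: keep both
    EndIf
Next
ConsoleWrite("Closest to '" & $sWord & "' (" & $iMin & " edits): " & $sTied & @CRLF)
```

With this plain distance, threw and three both sit 2 edits away from thraa, so the program has no basis for preferring one over the other.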


fgthhhh

All my words are limited to the numbers from one to twenty (1 -> 20), so the list will not contain threw or tharm.

Can you show me a way to correct the word?

I really need an example to understand the code.

jchd

Do you mean the numbers 1 to 20 in plain text?

If so, place the text in an array $A and find the minimum of typos($A[$i], $word), if any.

Try to come up with something of your own.


fgthhhh

Please help me check whether this is OK:

$numeros[0]="one"
$numeros[1]="two"
$numeros[2]="three"
$numeros[3]="four"
$numeros[4]="five"
$numeros[5]="six"
$numeros[6]="seven"
$numeros[7]="eight"
$numeros[8]="nine"
$numeros[9]="ten"
$numeros[10]="eleven"
$numeros[11]="twelve"
$numeros[12]="thirteen"
$numeros[13]="fourteen"
$numeros[14]="fifteen"
$numeros[15]="sixteen"
$numeros[16]="seventeen"
$numeros[17]="eighteen"
$numeros[18]="nineteen"
$numeros[19]="twenty"

$test_word = "thraa"
dim $result[20]
for $k = 0 to 19
    $result[$k] = typos($numeros[$k], $test_word)
next
_ArraySort($result) ; or _arraymin($result)

; then I can get the lowest result, but I can't get the matching word

I'm stuck here. I don't know how to get the correct answer.

jchd

Hey, calm down. There is no need to brag like you do!

Use something along this line:

#include <String.au3>

Local Const $numeros[20] = [ _
    "one", _
    "two", _
    "three", _
    "four", _
    "five", _
    "six", _
    "seven", _
    "eight", _
    "nine", _
    "ten", _
    "eleven", _
    "twelve", _
    "thirteen", _
    "fourteen", _
    "fifteen", _
    "sixteen", _
    "seventeen", _
    "eighteen", _
    "nineteen", _
    "twenty" _
]

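Local $test_word = "thraa"
; Start from the word length as the worst acceptable score; -1 means "no match yet".
Local $bestMatch = StringLen($test_word), $bestMatchIdx = -1, $typos
For $k = 0 To UBound($numeros) - 1
    $typos = Typos($numeros[$k], $test_word)
    If $typos < $bestMatch Then
        $bestMatch = $typos
        $bestMatchIdx = $k
    EndIf
Next
; Only report a match if some candidate scored below the starting threshold.
If $bestMatchIdx >= 0 Then
    ConsoleWrite(StringFormat("Best match for '%s' is %s (%u) with %u spelling errors.\n", $test_word, $numeros[$bestMatchIdx], $bestMatchIdx + 1, $bestMatch))
Else
    ConsoleWrite(StringFormat("No close match found for '%s'.\n", $test_word))
EndIf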

fgthhhh

Great!

You are my hero, that's exactly what I need.

Malkey

The _EditDistance() function from here appears to be another version of the Typos() function from post #5 of this thread.

#include <String.au3>
#include <Array.au3>
#include <Math.au3>

Local Const $numeros[21] = ["zero", "one", "two", "three", "four", "five", "six", _
        "seven", "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", _
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen", "twenty"]

Local $test_word = "thraa"
Local $bestMatch = StringLen($test_word), $bestMatchIdx, $typos
For $k = 0 To UBound($numeros) - 1
    $typos1 = Typos($numeros[$k], $test_word)
    ConsoleWrite("Typos => " & $typos1 & " ")
    $typos = _EditDistance($numeros[$k], $test_word)
    ConsoleWrite($typos & " <= _EditDistance" & @CRLF)
    If $typos < $bestMatch Then
        $bestMatch = $typos
        $bestMatchIdx = $k
    EndIf
Next

ConsoleWrite(StringFormat("Best match for '%s' is '%s' with %u different, non-matching characters.\n", $test_word, $numeros[$bestMatchIdx], $bestMatch))


Func _EditDistance($s1, $s2)
    Local $m[StringLen($s1) + 1][StringLen($s2) + 1], $i, $j
    $m[0][0] = 0; boundary conditions
    For $j = 1 To StringLen($s2)
        $m[0][$j] = $m[0][$j - 1] + 1; boundary conditions
    Next
    For $i = 1 To StringLen($s1)
        $m[$i][0] = $m[$i - 1][0] + 1; boundary conditions
    Next
    For $j = 1 To StringLen($s2);   outer loop
        For $i = 1 To StringLen($s1) ;  inner loop
            If (StringMid($s1, $i, 1) = StringMid($s2, $j, 1)) Then
                $diag = 0;
            Else
                $diag = 1
            EndIf
            $m[$i][$j] = _Min($m[$i - 1][$j] + 1, _ ; insertion
                    (_Min($m[$i][$j - 1] + 1, _ ;   deletion
                    $m[$i - 1][$j - 1] + $diag))) ; substitution
        Next
    Next
    Return $m[StringLen($s1)][StringLen($s2)] ; $m ;
EndFunc ;==>_EditDistance

Func Typos(Const $st1, Const $st2, $anychar = '_', $anytail = '%')
    Local $s1, $s2, $pen, $del, $ins, $subst
    If Not IsString($st1) Then Return SetError(-1, -1, -1)
    If Not IsString($st2) Then Return SetError(-2, -2, -1)
    If $st2 = '' Then Return StringLen($st1)
    If $st2 == $anytail Then Return 0
    If $st1 = '' Then
        Return (StringInStr($st2 & $anytail, $anytail, 1) - 1)
    EndIf
;~  $s1 = StringSplit(_LowerUnaccent($st1)), "", 2)     ;; _LowerUnaccent() addon function not available here
;~  $s2 = StringSplit(_LowerUnaccent($st2)), "", 2)     ;; _LowerUnaccent() addon function not available here
    $s1 = StringSplit(StringLower($st1), "", 2)
    $s2 = StringSplit(StringLower($st2), "", 2)
    Local $l1 = UBound($s1), $l2 = UBound($s2)
    Local $r[$l1 + 1][$l2 + 1]
    For $x = 0 To $l2 - 1
        Switch $s2[$x]
            Case $anychar
                If $x < $l1 Then
                    $s2[$x] = $s1[$x]
                EndIf
            Case $anytail
                $l2 = $x
                If $l1 > $l2 Then
                    $l1 = $l2
                EndIf
                ExitLoop
        EndSwitch
        $r[0][$x] = $x
    Next
    $r[0][$l2] = $l2
    For $x = 0 To $l1
        $r[$x][0] = $x
    Next
    For $x = 1 To $l1
        For $y = 1 To $l2
            $pen = Not ($s1[$x - 1] == $s2[$y - 1])
            $del = $r[$x - 1][$y] + 1
            $ins = $r[$x][$y - 1] + 1
            $subst = $r[$x - 1][$y - 1] + $pen
            If $del > $ins Then $del = $ins
            If $del > $subst Then $del = $subst
            $r[$x][$y] = $del
            If ($pen And $x > 1 And $y > 1 And $s1[$x - 1] == $s2[$y - 2] And $s1[$x - 2] == $s2[$y - 1]) Then
                If $r[$x][$y] >= $r[$x - 2][$y - 2] Then $r[$x][$y] = $r[$x - 2][$y - 2] + 1
                $r[$x - 1][$y - 1] = $r[$x][$y]
            EndIf
        Next
    Next
    Return ($r[$l1][$l2])
EndFunc ;==>Typos
