duckling78

What's a good way to compare two dictionary files?

duckling78

I am trying to compare two dictionary files (about 2 megs in length each) and need to know which words in one file are not in the other.

The dictionary files have one word on each line.

I made a quick script to parse the files and compare each word, but it's taking forever to finish.

Is it possible to speed this up much, or is there a better solution? WinDiff works very quickly, but it doesn't give the results I need.

$fileA = FileOpen("A.txt", 0)
$fileB = FileRead("B.txt")

$fileNotInB = FileOpen("NotInB.txt", 9)

While True
    $stringA = FileReadLine($fileA)
    If @error == -1 Then ExitLoop
    
    $stringA = StringStripWS($stringA, 2)
    
    If StringInStr($fileB, $stringA, 2) == 0 Then
        ConsoleWrite('"' & $stringA & '" is not in $fileB.' & @CRLF)
        FileWriteLine($fileNotInB, $stringA)
    EndIf
WEnd

; Use & for string concatenation; + would attempt numeric addition.
ConsoleWrite("Finished at " & @MON & "/" & @MDAY & "/" & @YEAR & " " & _
    @HOUR & ":" & @MIN & ":" & @SEC & "." & @CRLF)

ChrisFromBoston

Might be faster to load both files into arrays using _FileReadToArray, then loop over one array and check each word against the other. I've seen a few UDFs that do array comparisons, here is the first one in the search list. Now, maybe someone will have an even faster way, but that's a good start.
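
As a rough sketch of the "load into arrays, then compare" idea: the function below takes two arrays in the layout that _FileReadToArray produces (count in element [0], lines from [1]) and returns the words of the first that are missing from the second. The lookup uses the Windows Scripting.Dictionary COM object, which is an assumption of this sketch and not something suggested in the thread, and WordsNotIn is a made-up name. The demo builds its arrays with StringSplit, which yields the same layout, so it runs without any files.

```autoit
; Returns a 1-based array (count in [0]) of the words in $aA that are missing from $aB.
; Both inputs use the _FileReadToArray layout: count in [0], items from [1].
Func WordsNotIn(Const ByRef $aA, Const ByRef $aB)
    ; Assumption: Scripting.Dictionary is available (standard on Windows).
    Local $oSeen = ObjCreate("Scripting.Dictionary")
    $oSeen.CompareMode = 1 ; 1 = text compare (case-insensitive); set before adding keys

    ; Index every word of B once.
    For $i = 1 To $aB[0]
        $oSeen.Item(StringStripWS($aB[$i], 3)) = True
    Next

    ; Collect the words of A that were never seen in B.
    Local $aMissing[$aA[0] + 1]
    Local $iCount = 0
    For $i = 1 To $aA[0]
        Local $sWord = StringStripWS($aA[$i], 3)
        If Not $oSeen.Exists($sWord) Then
            $iCount += 1
            $aMissing[$iCount] = $sWord
        EndIf
    Next
    $aMissing[0] = $iCount
    ReDim $aMissing[$iCount + 1]
    Return $aMissing
EndFunc

; Demo with in-memory word lists instead of A.txt and B.txt.
Local $aDemoA = StringSplit("apple|cat|pear", "|")
Local $aDemoB = StringSplit("Apple|category|pear", "|")
Local $aDemoMissing = WordsNotIn($aDemoA, $aDemoB)
For $i = 1 To $aDemoMissing[0]
    ConsoleWrite($aDemoMissing[$i] & @CRLF) ; prints "cat"
Next
```

Note that this matches whole words only. A substring test such as StringInStr would treat "cat" as present because it occurs inside "category".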

duckling78

Just as a side note it took about 12 hours to parse the words between two dictionary files (about 3 megs in length each).

I'm reading through Steven S. Skiena's "The Algorithm Design Manual" now and hopefully I won't have such horribly inefficient coding in the future.

^_^

One algorithm that I've read about so far, binary search, would have significantly reduced the duration, assuming the dictionary files were sorted alphabetically (which they weren't):

1. Load all the words into an array (as ChrisFromBoston suggested)

2. Compare the word in the middle of the array to the current search word

3. If the middle word comes before or after the search word, rule out the half of the array that cannot contain it

4. Keep repeating on the remaining half until the middle word is the search word (found) or no words are left to check (the word does not exist)

This would take far fewer than 100 checks per word (binary search needs roughly log2(n) comparisons, so under 20 for a list of a few hundred thousand words) instead of a scan of the whole dictionary, and each word would probably take milliseconds instead of seconds to process.
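
The steps above can be sketched roughly as follows. This is a minimal illustration and not the code used later in the thread: FindWord and $iProbes are hypothetical names, and the array is assumed to be sorted and to use the _FileReadToArray layout (count in [0], words from [1]).

```autoit
; Binary search over a sorted, 1-based array (count in [0]).
; Returns the index of $sWord, or 0 if it is absent.
; $iProbes is set to the number of comparisons that were made.
Func FindWord(Const ByRef $aSorted, $sWord, ByRef $iProbes)
    Local $iLow = 1, $iHigh = $aSorted[0], $iMid, $iCmp
    $iProbes = 0
    While $iLow <= $iHigh
        $iMid = Int(($iLow + $iHigh) / 2)
        $iProbes += 1
        $iCmp = StringCompare($sWord, $aSorted[$iMid]) ; default flag = case-insensitive
        If $iCmp < 0 Then
            $iHigh = $iMid - 1 ; search word sorts before the middle word: drop the upper half
        ElseIf $iCmp > 0 Then
            $iLow = $iMid + 1 ; search word sorts after the middle word: drop the lower half
        Else
            Return $iMid ; found
        EndIf
    WEnd
    Return 0 ; nothing left to check: the word is not in the list
EndFunc

; Demo on a small sorted list (StringSplit gives the same count-in-[0] layout).
Local $aDemoWords = StringSplit("apple|banana|cherry|grape|pear", "|")
Local $iDemoProbes = 0
ConsoleWrite(FindWord($aDemoWords, "grape", $iDemoProbes) & @CRLF) ; prints 4
ConsoleWrite(FindWord($aDemoWords, "fig", $iDemoProbes) & @CRLF)   ; prints 0
```

Each pass halves the range that is still in play, which is why the number of comparisons grows with log2(n) instead of n.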

Hmm, I'll keep reading this book, it's kind'a nice ;)

duckling78

Ha. I optimized the code to compare two dictionary files FROM ~12 hours DOWN TO about 1 minute 21 seconds!

Here's a log of the results:

090430_001159: Reading contents of file to $arrayA...
090430_001159: Reading contents of file to $arrayB...
090430_001200: Starting comparisons...  UBound($arrayA) - 1: 161017
090430_001205: Status: 10000/161017 (6%)
090430_001210: Status: 20000/161017 (12%)
090430_001214: Status: 30000/161017 (18%)
090430_001219: Status: 40000/161017 (24%)
090430_001224: Status: 50000/161017 (31%)
090430_001228: Status: 60000/161017 (37%)
090430_001233: Status: 70000/161017 (43%)
090430_001238: Status: 80000/161017 (49%)
090430_001244: Status: 90000/161017 (55%)
090430_001249: Status: 100000/161017 (62%)
090430_001254: Status: 110000/161017 (68%)
090430_001259: Status: 120000/161017 (74%)
090430_001304: Status: 130000/161017 (80%)
090430_001309: Status: 140000/161017 (86%)
090430_001314: Status: 150000/161017 (93%)
090430_001319: Status: 160000/161017 (99%)
090430_001320: Finished!

Here's the source!

#include <File.au3>
#include <Array.au3>

HotKeySet("!+^x", "exitScript")

Dim $arrayA, $arrayB

FileDelete("NotInB.txt")

Blah("Reading contents of file to $arrayA...")
_FileReadToArray("sortedA.txt", $arrayA)
Blah("Reading contents of file to $arrayB...")
_FileReadToArray("sortedB.txt", $arrayB)

$fileNotInB = FileOpen("NotInB.txt", 9)

#comments-start --- uncomment the below section to create sorted files ---
; Element [0] holds the line count from _FileReadToArray, so sort and
; write from index 1 only; otherwise the count ends up in the word list.
Blah("Starting _ArraySort($arrayA)...")
_ArraySort($arrayA, 0, 1)
Blah("Starting _ArraySort($arrayB)...")
_ArraySort($arrayB, 0, 1)
Blah("Finished sorting arrays.")
Blah("Writing sortedA.txt...")
_FileWriteFromArray("sortedA.txt", $arrayA, 1)
Blah("Writing sortedB.txt...")
_FileWriteFromArray("sortedB.txt", $arrayB, 1)
Blah("Exiting.")
Exit
#comments-end --- uncomment the above section to create sorted files ---

$maxA = UBound($arrayA) - 1
Blah("Starting comparisons...  UBound($arrayA) - 1: " & $maxA)

For $checkA = 1 To $maxA
    ; Report progress every 10000 words.
    If Mod($checkA, 10000) = 0 Then
        Blah("Status: " & $checkA & "/" & $maxA & " (" & Int($checkA / $maxA * 100) & "%)")
    EndIf
    $arrayA[$checkA] = StringStripWS($arrayA[$checkA], 2)
    If Not AinB($arrayA[$checkA]) Then
        ;Blah('"' & $arrayA[$checkA] & '" is not in $arrayB.')
        FileWriteLine($fileNotInB, $arrayA[$checkA])
    EndIf
Next

FileClose($fileNotInB)
Blah("Finished!")

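Func AinB($wordCheckA)
    ; Binary search over $arrayB[1..n]; element [0] holds the line count.
    If UBound($arrayB) < 2 Then Return False ; no words to search
    $start  = 1
    $end    = UBound($arrayB) ; one past the last index
    $last   = 0 ; nothing probed yet (starting at 1 skipped the comparison for a one-word list)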
    
    While True
        $check = Int(($start + $end) / 2)
        
        If $check == $last Then
            ;Blah(">>>>>>>>>>>>>> FALSE: " & $wordCheckA & " is NOT in $fileB. -- $check: " & $check & " $last: " & $last)
            Return False
        EndIf
        
        $wordCheckB = StringStripWS($arrayB[$check], 2)
        
        $compare = StringCompare($wordCheckA, $wordCheckB, 2)
        
        ;Blah("$start: " & $start & "  $end: " & $end & "  $check: " & $check & "  $wordCheck: " & $wordCheckA & "  $arrayB[$check]: " & $wordCheckB)
        
        If $compare < 0 Then
            ;Blah("$compare < 0 ... 1: " & $wordCheckA & "  2: " & $wordCheckB)
            $end = $check
        ElseIf $compare > 0 Then
            ;Blah("$compare > 0 ... 1: " & $wordCheckA & "  2: " & $wordCheckB)
            $start = $check
        Else
            ;Blah(">>>>>>>>>>>>>> TRUE: " & $wordCheckA & " is in $fileB. -- $check: " & $check & " $last: " & $last)
            Return True
        EndIf
        $last = $check
    WEnd
EndFunc

Func timeStamp()
    Return StringRight(@YEAR, 2) & @MON & @MDAY & "_" & @HOUR & @MIN & @SEC
EndFunc

Func exitScript()
    ConsoleWrite("Exiting script" & @CRLF)
    Exit
EndFunc

Func Blah($text)
    ConsoleWrite(timeStamp() & ": " & $text & @CRLF)
EndFunc