duckling78 Posted April 29, 2009 (edited)

I am trying to compare two dictionary files (about 2 megs in length each) and need to know which words in one are not in the other. The dictionary files have one word on each line.

I made a quick script to parse the files and compare each word, but it's taking forever to finish. Is it possible to improve the speed of this much, or is there a better solution? Windiff seems to work so quickly, but it doesn't give the results I need.

$fileA = FileOpen("A.txt", 0)
$fileB = FileRead("B.txt")
$fileNotInB = FileOpen("NotInB.txt", 9)
While True
    $stringA = FileReadLine($fileA)
    If @error == -1 Then ExitLoop
    $stringA = StringStripWS($stringA, 2)
    If StringInStr($fileB, $stringA, 2) == 0 Then
        ConsoleWrite('"' & $stringA & '" is not in $fileB.' & @CRLF)
        FileWriteLine($fileNotInB, $stringA)
    EndIf
WEnd
ConsoleWrite("Finished at " & @MON & "/" & @MDAY & "/" & @YEAR & " " & _
        @HOUR & ":" & @MIN & ":" & @SEC & "." & @CRLF)

Edited April 29, 2009 by duckling78
ChrisFromBoston Posted April 29, 2009

It might be faster to load both files into arrays using _FileReadToArray, then loop over one array and compare it against the other. I've seen a few UDFs that do array comparisons; here is the first one in the search list. Maybe someone will have an even faster way, but that's a good start.
duckling78 Posted April 30, 2009 (edited)

Just as a side note, it took about 12 hours to parse the words between two dictionary files (about 3 megs in length each).

I'm reading through Steven S. Skiena's "The Algorithm Design Manual" now, and hopefully I won't write such horribly inefficient code in the future.

One algorithm that I've read about so far would have significantly reduced the duration, assuming the dictionary files were alphabetical (which they weren't):

1. The words would all go into array values (as ChrisFromBoston suggested).
2. The word in the middle of the array would be compared to the current search word.
3. If the middle word comes after or before the current word, then half the values in the dictionary are ruled out.
4. Keep repeating until the middle word is neither after nor before the current word (it is the current word), or until there is nothing left to check (the word does not exist).

This would take fewer than 100 checks per word, versus a straight scan of the whole dictionary, and each word would probably take milliseconds rather than seconds to process.

Hmm, I'll keep reading this book, it's kind of nice.

Edited April 30, 2009 by duckling78
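The halving steps listed in the post above can be sketched roughly as follows. This is only a minimal illustration, not the poster's script: the function name `_WordInSorted` and the demo array `$aDemo` are hypothetical, and it assumes a 0-based array that is already sorted in the same case-insensitive order that `StringCompare` uses by default. (An array filled by `_FileReadToArray` with default settings keeps its element count at index 0, so the lower bound would start at 1 there.)

```autoit
; Hypothetical helper, for illustration only.
; Returns True if $sWord is found in the sorted array $aWords, otherwise False.
; Assumes $aWords is 0-based and sorted case-insensitively.
Func _WordInSorted(ByRef $aWords, $sWord)
    Local $iLow = 0
    Local $iHigh = UBound($aWords) - 1
    Local $iMid, $iCmp
    While $iLow <= $iHigh
        ; Look at the word in the middle of the remaining range.
        $iMid = Int(($iLow + $iHigh) / 2)
        $iCmp = StringCompare($sWord, $aWords[$iMid])
        If $iCmp = 0 Then Return True
        If $iCmp < 0 Then
            ; Search word sorts before the middle word: drop the upper half.
            $iHigh = $iMid - 1
        Else
            ; Search word sorts after the middle word: drop the lower half.
            $iLow = $iMid + 1
        EndIf
    WEnd
    ; The range is empty, so the word is not in the array.
    Return False
EndFunc

; Small demo with a hand-sorted list.
Local $aDemo[5] = ["apple", "banana", "cherry", "grape", "melon"]
ConsoleWrite("cherry: " & _WordInSorted($aDemo, "cherry") & @CRLF)
ConsoleWrite("kiwi: " & _WordInSorted($aDemo, "kiwi") & @CRLF)
```

Because the range halves on every pass, a list of a few hundred thousand words needs only around 20 comparisons per lookup. Array.au3 also provides `_ArrayBinarySearch`, which may be worth trying before hand-rolling a search, provided the array is sorted in the order that function expects.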
duckling78 Posted April 30, 2009 (edited)

Ha. I optimized the code to parse two dictionary files FROM ~12 hours DOWN TO about 1 minute 21 seconds (going by the timestamps below)!

Here's a log of the results:

090430_001159: Reading contents of file to $arrayA...
090430_001159: Reading contents of file to $arrayB...
090430_001200: Starting comparisons... UBound($arrayA) - 1: 161017
090430_001205: Status: 10000/161017 (6%)
090430_001210: Status: 20000/161017 (12%)
090430_001214: Status: 30000/161017 (18%)
090430_001219: Status: 40000/161017 (24%)
090430_001224: Status: 50000/161017 (31%)
090430_001228: Status: 60000/161017 (37%)
090430_001233: Status: 70000/161017 (43%)
090430_001238: Status: 80000/161017 (49%)
090430_001244: Status: 90000/161017 (55%)
090430_001249: Status: 100000/161017 (62%)
090430_001254: Status: 110000/161017 (68%)
090430_001259: Status: 120000/161017 (74%)
090430_001304: Status: 130000/161017 (80%)
090430_001309: Status: 140000/161017 (86%)
090430_001314: Status: 150000/161017 (93%)
090430_001319: Status: 160000/161017 (99%)
090430_001320: Finished!

Here's the source!

#include <File.au3>
#include <Array.au3>

HotKeySet("!+^x", "exitScript")

Dim $arrayA, $arrayB

FileDelete("NotInB.txt")

; _FileReadToArray stores the line count in element 0; the words start at index 1.
Blah("Reading contents of file to $arrayA...")
_FileReadToArray("sortedA.txt", $arrayA)
Blah("Reading contents of file to $arrayB...")
_FileReadToArray("sortedB.txt", $arrayB)

$fileNotInB = FileOpen("NotInB.txt", 9)

#comments-start --- uncomment the below section to create sorted files ---
Blah("Starting _ArraySort($arrayA)...")
_ArraySort($arrayA)
Blah("Starting _ArraySort($arrayB)...")
_ArraySort($arrayB)
Blah("Finished sorting arrays.")
Blah("Writing sortedA.txt...")
_FileWriteFromArray("sortedA.txt", $arrayA)
Blah("Writing sortedB.txt...")
_FileWriteFromArray("sortedB.txt", $arrayB)
Blah("Exiting.")
Exit
#comments-end --- uncomment the above section to create sorted files ---

$maxA = UBound($arrayA) - 1
Blah("Starting comparisons... UBound($arrayA) - 1: " & $maxA)
For $checkA = 1 To $maxA
    ; Print a progress line every 10000 words.
    If Mod($checkA, 10000) == 0 Then
        Blah("Status: " & $checkA & "/" & $maxA & " (" & Int($checkA / $maxA * 100) & "%)")
    EndIf
    $arrayA[$checkA] = StringStripWS($arrayA[$checkA], 2)
    If AinB($arrayA[$checkA]) = False Then
        ;Blah('"' & $arrayA[$checkA] & '" is not in $arrayB.')
        FileWriteLine($fileNotInB, $arrayA[$checkA])
    EndIf
Next
Blah("Finished!")

; Binary search: returns True if $wordCheckA is in the sorted global $arrayB.
Func AinB($wordCheckA)
    $start = 1
    $end = UBound($arrayB)
    $last = 1
    While True
        $check = Int(($start + $end) / 2)
        ; The midpoint stopped moving, so the range is exhausted.
        If $check == $last Then
            ;Blah(">>>>>>>>>>>>>> FALSE: " & $wordCheckA & " is NOT in $fileB. -- $check: " & $check & " $last: " & $last)
            Return False
        EndIf
        $wordCheckB = StringStripWS($arrayB[$check], 2)
        $compare = StringCompare($wordCheckA, $wordCheckB, 2)
        ;Blah("$start: " & $start & " $end: " & $end & " $check: " & $check & " $wordCheck: " & $wordCheckA & " $arrayB[$check]: " & $wordCheckB)
        If $compare < 0 Then
            ;Blah("$compare < 0 ... 1: " & $wordCheckA & " 2: " & $wordCheckB)
            $end = $check
        ElseIf $compare > 0 Then
            ;Blah("$compare > 0 ... 1: " & $wordCheckA & " 2: " & $wordCheckB)
            $start = $check
        Else
            ;Blah(">>>>>>>>>>>>>> TRUE: " & $wordCheckA & " is in $fileB. -- $check: " & $check & " $last: " & $last)
            Return True
        EndIf
        $last = $check
    WEnd
EndFunc

; Timestamp in YYMMDD_HHMMSS form, as used in the log above.
Func timeStamp()
    Return StringRight(@YEAR, 2) & @MON & @MDAY & "_" & @HOUR & @MIN & @SEC
EndFunc

Func exitScript()
    ConsoleWrite("Exiting script" & @CRLF)
    Exit
EndFunc

Func Blah($text)
    ConsoleWrite(timeStamp() & ": " & $text & @CRLF)
EndFunc

Edited April 30, 2009 by duckling78