Fast file Compare


SnArF
I had to compare two files with more than one million lines each.
I tested several examples, but all of them were too slow.
Most of them ran for several hours to compare 1 million lines.
 
I have written a script that compares 2 text files with 1 million lines each in less than 5 minutes (once the files are loaded into arrays).
It writes the missing lines to 2 text files.
 
It compares 10.000 lines in 1.8 sec, 100.000 lines in 21 sec, and 1.000.000 lines in 250 sec on my laptop.
 
The example script creates 2 arrays with 1.000.000 lines each and then removes some entries from both.
At the end it writes 2 text files with the lines missing from each array.
 
Please test it and leave comments.

#include <Array.au3>
#include <File.au3>
#include <Timers.au3>

Local $NrOfRows = 1000000 ; Set number of rows to test

; Index lists for _ArrayDelete. Start empty: starting at 0 would also delete row 0.
Local $delString1 = ""
Local $delString2 = ""
Local $Array1[$NrOfRows]
Local $Array2[$NrOfRows]
Local $Index
Local $StartTime = _Timer_Init()
Local $Timer = _Timer_Init()

; Creating 2 arrays
For $i = 0 To $NrOfRows - 1
    $Array1[$i] = "Just some text to emulate data to compare " & $i
Next
$Array2 = $Array1
ConsoleWrite("Arrays created in " & Round(_Timer_Diff($Timer)) & " milliseconds" & @CRLF)
$Timer = _Timer_Init()

; Removing some entries from both arrays to show functionality
_ArrayDelete($Array1, "333;5555;7777")
_ArrayDelete($Array2, "222;4444;6666")
ConsoleWrite("Removed some values in " & Round(_Timer_Diff($Timer)) & " milliseconds" & @CRLF)
$Timer = _Timer_Init()

; The array must be sorted before using binary search.
; The arrays are 0-based without a count in element 0, so sorting starts at index 0.
_ArraySort($Array1, 0, 0, 0, 0, 1)
ConsoleWrite("Sorted Array 1 in " & Round(_Timer_Diff($Timer)) & " milliseconds" & @CRLF)
$Timer = _Timer_Init()

; Comparing the 2 arrays
For $i = 0 To UBound($Array2) - 1
    ; Search the whole sorted array, including index 0
    $Index = _ArrayBinarySearch($Array1, $Array2[$i])

    ; Add the indexes of equal rows to the delete lists
    If $Index <> -1 Then
        $delString1 &= ";" & $Index
        $delString2 &= ";" & $i
    EndIf
Next
ConsoleWrite("Arrays compared in " & Round(_Timer_Diff($Timer)) & " milliseconds" & @CRLF)
$Timer = _Timer_Init()

; Removing the equal rows from the arrays (skipped when nothing matched)
If $delString1 <> "" Then
    ; Strip the leading ";" before passing the index list
    _ArrayDelete($Array1, StringTrimLeft($delString1, 1))
    _ArrayDelete($Array2, StringTrimLeft($delString2, 1))
EndIf
ConsoleWrite("Removed equal rows in " & Round(_Timer_Diff($Timer)) & " milliseconds" & @CRLF)
$Timer = _Timer_Init()

; Writing the result to files
_FileWriteFromArray("missing in array 1.txt", $Array2)
_FileWriteFromArray("missing in array 2.txt", $Array1)
ConsoleWrite("Wrote missing values to file in " & Round(_Timer_Diff($Timer)) & " milliseconds" & @CRLF)

ConsoleWrite("Compare complete in " & Round(_Timer_Diff($StartTime)) & " milliseconds" & @CRLF)
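
The example above generates its test data, so the step of loading the files into arrays is not shown. A minimal sketch of that step, assuming the built-in FileReadToArray function is used (it returns a 0-based array without a count element); the file names "file1.txt" and "file2.txt" are placeholders and not part of the original script:

```autoit
; Sketch only: load two text files into 0-based arrays before comparing.
; "file1.txt" and "file2.txt" are placeholder names (assumption).
Local $Array1 = FileReadToArray("file1.txt")
If @error Then
    ConsoleWrite("Could not read file1.txt (error " & @error & ")" & @CRLF)
    Exit 1
EndIf

Local $Array2 = FileReadToArray("file2.txt")
If @error Then
    ConsoleWrite("Could not read file2.txt (error " & @error & ")" & @CRLF)
    Exit 1
EndIf

ConsoleWrite("Loaded " & UBound($Array1) & " and " & UBound($Array2) & " lines" & @CRLF)
```

With arrays loaded this way, the sort, binary search and delete steps can presumably be applied to them in place of the generated data.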

Do you only need to know whether the 2 files differ, or also what differs (the content)?

Br,

UEZ


@UEZ,

The script shows what is different (the content).

I have a script that builds an index of 2 servers, with about 1.500.000 files per server.

The results are saved to 2 text files.

The text files are then compared, and only the files that differ are saved to new text files.

The complete process of indexing 2 file servers with 1.5 million files each and comparing them takes about 11 minutes. I think that's very fast.
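
The indexing script itself is not posted in this thread. As a rough illustration of what such an indexing step could look like, here is a sketch that assumes _FileListToArrayRec from File.au3 is used; the share path and the output file name are placeholders:

```autoit
#include <File.au3>

; Sketch only: write a recursive file index of one server share to a text file.
; "\\server1\share" and "index server1.txt" are placeholders (assumption).
Local $sRoot = "\\server1\share"
Local $aFiles = _FileListToArrayRec($sRoot, "*", $FLTAR_FILES, $FLTAR_RECUR, $FLTAR_NOSORT, $FLTAR_RELPATH)
If @error Then
    ConsoleWrite("Indexing failed (error " & @error & ")" & @CRLF)
    Exit 1
EndIf

; Element 0 holds the number of files, so writing starts at index 1
_FileWriteFromArray("index server1.txt", $aFiles, 1)
```

Relative paths are used in this sketch so that the same file on both servers produces an identical line, which is what a line-by-line comparison of the two index files needs.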
