JCarson

Best way to ensure capture of duplicate files?

7 posts in this topic

I have written a script to store the file length and md5 hash of a file in a database for the purposes of eliminating duplicate files.

I have googled around and am looking for any input from the community as to the probability that two files with the same length would generate the same MD5 file hash.

I am not opposed to using SHA1 which has a much lower incidence of possible collision.

Any input would be appreciated.

Thank you,

Joe




Understandable in theory, as Google indicates, but in practice, does anyone have an opinion on the reliability of this method, or a different one that they have found workable?

Thanks again,

Joe

Probability of coincidence for MD5 - 1/2^252.

:mellow:


Thank you, I appreciate both your comments. This gives me some sense of tranquility in going forward with my script.

Thanks again!

Joe

I use MD5 in SMF and it works fine: not a single false positive so far.


Forget about storing the file length: it's uselessly redundant with a hash (except if you actually have another use for this information).

Now about "some sense of tranquility": people who haven't been exposed to security, cryptography, or number theory often have difficulty getting a feel for what such large numbers mean.

While it's true that MD5 is broken for any cryptographic signature scheme, you may still use it without hesitation in any file de-duplicating application in a "friendly" context. The reason is that the two contexts are very different: in the former, you have to stop attackers from compromising your scheme, while in the latter, you only rely on probabilities.

Forging a pair of different files with the same MD5 and stumbling on existing files in the wild that happen to collide are not the same beast. Advances in cryptanalysis and hardware have made forgery possible, but "innocent" files (I mean actual user files, not ones specially built to collide) still have the same probability of colliding as they had when MD5 was published and regarded as the most effective hash available.

BTW, this collision probability is 2^-128 (MD5 is a 128-bit hash) for two random files, and it can only grow as you handle more files in the same pool (birthday paradox). There is no way to make it smaller, so 2^-252 is out of the question.
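
To get a feel for how slowly the birthday effect kicks in, here is a small Python sketch (not taken from anyone's script in this thread). It uses the usual birthday approximation and assumes the hash behaves like a uniform random function, which only covers the "friendly" case. The name `collision_probability` and its parameters are made up for illustration.

```python
import math


def collision_probability(num_files, hash_bits=128):
    """Approximate chance that at least two of num_files random inputs share a hash.

    Birthday approximation: p ~ 1 - exp(-n(n-1)/2 / 2^bits).
    Assumption: hash values are uniformly distributed (no attacker involved).
    """
    pairs = num_files * (num_files - 1) / 2
    # -expm1(-x) equals 1 - exp(-x) but stays accurate when x is tiny
    return -math.expm1(-pairs / 2.0 ** hash_bits)


if __name__ == "__main__":
    for count in (2, 10**6, 10**9, 2**64):
        print(count, collision_probability(count))
```

For two files this gives back 2^-128, for a billion files the result is still on the order of 1e-21, and the probability only becomes sizable near 2^64 files.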

Moral: unless you have to reckon with malevolence or attackers determined to ruin the usefulness of your program, you can still use MD5 to characterize files in a pool that is not too huge. (If you had to handle the hashes of a titanic number of files, a collision would become more probable due to the birthday paradox.)

If you're not comfortable with the idea that some Joe Doe could provoke an artificial collision, then select SHA-1, SHA-2, or something else.
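
Swapping the hash is usually a one-word change. As a rough sketch in Python (the thread does not show the original script, so `file_digest`, its parameters, and the SHA-256 default are my own assumptions), the algorithm can simply be passed by name to `hashlib.new`:

```python
import hashlib


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Return the hex digest of the file at path using the named algorithm.

    The file is read in chunks so that large files do not have to fit in memory.
    """
    hasher = hashlib.new(algorithm)  # e.g. "md5", "sha1", "sha256"
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()
```

Calling `file_digest(path, "md5")` and `file_digest(path, "sha256")` differ only in the digest length and the cost per byte; the surrounding de-duplication logic stays the same.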




Quote: "Forget about storing the file length: it's uselessly redundant with a hash (except if you actually have another use for this information)."

In SMF I use the file size to pre-screen the files. I store the size of every file in a SQLite DB, select the files whose size occurs more than once, and calculate the hash only for those. This massively reduces the number of hashes to calculate: only a file whose size is non-unique can possibly be a duplicate :mellow:.

Edited by KaFu
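
For anyone who wants to try this size-first idea outside SMF, here is a minimal Python sketch. It is not SMF's code: it groups sizes in an in-memory dict instead of a SQLite DB, and `find_duplicates` and `md5_of` are hypothetical helper names.

```python
import hashlib
import os
from collections import defaultdict


def md5_of(path, chunk_size=1 << 20):
    """Hex MD5 of a file, read in chunks."""
    hasher = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def find_duplicates(root):
    """Return lists of paths under root that share both size and MD5."""
    # Pass 1: cheap grouping by file size (no file contents are read)
    by_size = defaultdict(list)
    for folder, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(folder, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    # Pass 2: hash only the files whose size occurs more than once
    duplicates = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # unique size, cannot be a duplicate
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[md5_of(path)].append(path)
        duplicates.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return duplicates
```

In this form the size is not stored for its own sake; it only decides which files are worth hashing at all.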
