Best way to ensure capture of duplicate files?

JCarson · June 16, 2010

I have written a script to store the file length and md5 hash of a file in a database for the purposes of eliminating duplicate files.

I have googled around and am looking for any input from the community as to the probability that two files with the same length would generate the same MD5 file hash.

I am not opposed to using SHA-1, which has a much lower probability of collision.

Any input would be appreciated.

Thank you,

Joe

Yashied · June 16, 2010

Probability of coincidence for MD5 - 1/2^252.


JCarson · June 16, 2010

Understandable in theory, as Google indicates, but in practice, does anyone have an opinion on the reliability of this method, or a different one that they have found workable?

Thanks again,

Joe


KaFu · June 16, 2010

I use md5 in SMF and it works fine, not a single false positive up to now.

JCarson · June 16, 2010

Thank you, I appreciate both your comments, this gives me some sense of tranquility with going forward with my script.

Thanks again !

Joe


jchd · June 16, 2010

Forget about storing the file length: it's uselessly redundant with a hash (except if you actually have another use for this information).

Now about "some sense of tranquility": people who haven't been exposed to various security/cryptography/number theory have difficulty getting a feeling of what large numbers mean.

While it's true that MD5 is broken for any cryptographic signature scheme, you may still use it without any question in any file de-duplicating application in "friendly" context. The reason is that the contexts are very different. In the former case, you have to stop attackers from compromising your scheme, while in the latter case, you only rely on probabilities.

Forging a pair of different files with the same MD5 and finding existing files in the wild that happen to collide are not the same beast. Advances in cryptanalysis and hardware have made forgery possible, but "innocent" files (I mean actual user files, not ones specially built to collide) still have the same probability of colliding as they had when MD5 was published and regarded as the most effective hash available.

BTW, this collision probability is 2^-128 (MD5 is a 128-bit hash) for 2 random files, and it can only grow as you handle more files in the same pool (birthday paradox). It simply cannot be any lower, so 2^-252 is out of the question.
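To put rough numbers on the birthday effect, here is a minimal sketch. It assumes hash values of non-adversarial files behave like uniformly random 128-bit strings, and it uses the simple pair-counting upper bound n(n-1)/2 divided by 2^bits. The function name `collision_probability` is only illustrative.

```python
from fractions import Fraction


def collision_probability(num_files: int, hash_bits: int = 128) -> Fraction:
    """Upper bound on the chance that any two of num_files hashes coincide.

    Assumes uniformly random hash values (no attacker crafting files).
    """
    # Number of distinct pairs of files that could collide.
    pairs = num_files * (num_files - 1) // 2
    # Each pair collides with probability 2^-hash_bits; summing over pairs
    # gives an upper bound, capped at 1.
    return min(Fraction(pairs, 2 ** hash_bits), Fraction(1))


if __name__ == "__main__":
    for count in (2, 10 ** 6, 10 ** 9):
        print(count, float(collision_probability(count)))
```

For two files the bound is exactly 2^-128, and it grows roughly with the square of the number of files in the pool.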

Moral: unless you have to reckon with malevolence or attackers determined to ruin the usefulness of your program, you can still use MD5 for file characterization in a space that is not too huge (if you had to handle the hashes of a titanic number of files, a collision would become more probable due to the birthday paradox).

If you're not comfortable with the idea that some Joe Dow could provoke an artificial collision, then select SHA-1, SHA-2, or something else.
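Switching the hash can be a one-parameter change if the hashing routine is written generically. The following is only a sketch in Python (not the original AutoIt script); `file_digest` and its parameters are hypothetical names, and it relies on the standard `hashlib.new` constructor.

```python
import hashlib


def file_digest(path, algorithm="md5", chunk_size=1 << 20):
    """Return the hex digest of a file using the named hashlib algorithm.

    The file is read in chunks so large files do not have to fit in memory.
    """
    hasher = hashlib.new(algorithm)  # e.g. "md5", "sha1", "sha256"
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()
```

Calling `file_digest(path, "sha256")` instead of `file_digest(path)` is then enough to move away from MD5, at the cost of longer digests to store.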

KaFu · June 17, 2010

jchd wrote: "Forget about storing the file length: it's uselessly redundant with a hash (except if you actually have another use for this information)."

In SMF I use the file size to pre-evaluate the files. I store the size of all files in a SQLite DB, extract all files whose grouped file size count is >1, and calculate the hash only for those files. This massively decreases the number of hashes to calculate. Only if its file size is non-unique can a file possibly be a duplicate.

Edited June 17, 2010 by KaFu
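The size pre-filter described above might look roughly like this. It is a Python sketch, not SMF's actual code: the table layout, the in-memory database, and the name `find_duplicates` are all assumptions made for illustration.

```python
import hashlib
import os
import sqlite3
from collections import defaultdict


def find_duplicates(root):
    """Return lists of paths under root whose size and MD5 both match."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE files (path TEXT PRIMARY KEY, size INTEGER)")
    for folder, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(folder, name)
            if os.path.isfile(path) and not os.path.islink(path):
                db.execute("INSERT INTO files VALUES (?, ?)",
                           (path, os.path.getsize(path)))

    # Only sizes shared by at least two files can belong to duplicates.
    candidates = db.execute(
        "SELECT path, size FROM files WHERE size IN "
        "(SELECT size FROM files GROUP BY size HAVING COUNT(*) > 1)"
    ).fetchall()
    db.close()

    groups = defaultdict(list)
    for path, size in candidates:
        hasher = hashlib.md5()
        with open(path, "rb") as handle:
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                hasher.update(chunk)
        groups[(size, hasher.hexdigest())].append(path)

    # Keep only groups where two or more files share size and hash.
    return [sorted(paths) for paths in groups.values() if len(paths) > 1]
```

Files with a unique size are never opened, which is where the saving comes from; the hash is computed only for the remaining candidates.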
