JCarson Posted June 16, 2010
I have written a script that stores the file length and MD5 hash of each file in a database in order to eliminate duplicate files. I have googled around and am looking for input from the community on the probability that two files with the same length would generate the same MD5 hash. I am not opposed to using SHA-1, which has a much lower incidence of possible collision. Any input would be appreciated.
Thank you,
Joe
Yashied Posted June 16, 2010
Probability of coincidence for MD5 - 1/2^252.
JCarson Posted June 16, 2010 (Author)
Yashied wrote: "Probability of coincidence for MD5 - 1/2^252."
Understandable in theory, as Google indicates, but in practice, does anyone have an opinion on the reliability of this method, or a different one that they have found workable?
Thanks again,
Joe
KaFu Posted June 16, 2010
I use MD5 in SMF (Search my Files) and it works fine, not a single false positive up to now.
JCarson Posted June 16, 2010 (Author)
KaFu wrote: "I use md5 in SMF and it works fine, not a single false positive up to now."
Thank you, I appreciate both your comments. This gives me some sense of tranquility in going forward with my script.
Thanks again!
Joe
jchd Posted June 16, 2010
Forget about storing the file length: it's uselessly redundant with a hash (unless you actually have another use for this information).

Now about "some sense of tranquility": people who haven't been exposed to security, cryptography or number theory have difficulty getting a feel for what large numbers mean. While it's true that MD5 is broken for any cryptographic signature scheme, you may still use it without question in a file de-duplicating application in a "friendly" context.

The reason is that the contexts are very different. In the former case, you have to stop attackers from compromising your scheme, while in the latter case you rely only on probabilities. Forging a pair of different files with the same MD5 and finding existing files in the wild that collide are not the same beast. Advances in cryptanalysis and hardware have made forgery possible, but "innocent" files (I mean actual user files, not ones specially built to collide) still have the same probability of colliding that they had when MD5 was published and regarded as the most effective hash available.

BTW, this collision probability is 2^-128 (MD5 is a 128-bit hash) for 2 random files, and it can only grow as you handle more files in the same pool (birthday paradox). It cannot possibly be lower, so 2^-252 is out of the question.

Moral: unless you have to reckon with malevolence or attackers determined to ruin the usefulness of your program, you can still use MD5 for file characterization in a not too huge space (if you had to handle the hashes of a titanic number of files, a collision would be more probable due to the birthday paradox). If you're not comfortable with the idea that some Joe Dow could provoke an artificial collision, then select SHA-1, SHA-2, or something else.
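To put rough numbers on the birthday-paradox remark, here is a minimal Python sketch. It assumes hash outputs behave like uniformly random 128-bit values (reasonable only for "innocent" files, not forged ones) and uses the usual approximation p ≈ 1 - exp(-n(n-1)/2^(b+1)). The function name `collision_probability` is made up for this illustration.

```python
import math


def collision_probability(num_files: int, hash_bits: int = 128) -> float:
    """Approximate chance that at least two of num_files inputs share a hash.

    Assumption: hash values are uniformly random over 2**hash_bits outputs.
    """
    # Number of distinct pairs of files that could collide.
    pairs = num_files * (num_files - 1) / 2
    # 1 - exp(-pairs / 2**bits); expm1 keeps precision when the result is tiny.
    return -math.expm1(-pairs / 2.0 ** hash_bits)


if __name__ == "__main__":
    for n in (2, 10**6, 10**9, 2**64):
        print(f"{n:>22} files: p ~ {collision_probability(n):.3e}")
```

With two files the result reduces to about 2^-128, and it only grows with the pool size, reaching the order of one half only around 2^64 files.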
KaFu Posted June 17, 2010 (edited)
jchd wrote: "Forget about storing the file length: it's uselessly redundant with a hash (except if you actually have another use for this information)."
In SMF I use the file size to pre-evaluate the files. I store the size of all files in a SQLite DB, extract all files whose grouped file size count is > 1, and calculate the hash only for those files. This massively decreases the number of hashes to calculate. Only a file whose size is non-unique can possibly be a duplicate.
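The size pre-filter described above can be sketched roughly as follows in Python. This is not SMF's code: it swaps the SQLite table for an in-memory dictionary to stay short, and the names `md5_of_file` and `find_duplicates` are hypothetical.

```python
import hashlib
import os
from collections import defaultdict


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(paths):
    """Group paths into duplicate candidates, hashing only files whose size is shared."""
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)

    by_hash = defaultdict(list)
    for size, group in by_size.items():
        if len(group) < 2:
            continue  # a unique size cannot belong to a duplicate, so skip hashing
        for path in group:
            by_hash[(size, md5_of_file(path))].append(path)

    # Keep only groups where at least two files share both size and MD5.
    return [sorted(group) for group in by_hash.values() if len(group) > 1]
```

Calling `find_duplicates` on a list of file paths returns lists of paths that share both size and MD5. If an accidental hash match is still a concern, a byte-for-byte comparison within each returned group would settle it.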