KaFu

SMF - The fastest duplicate files finder... [Updated 2017-Jun-18]

192 posts in this topic

#181 ·  Posted (edited)

KaFu,

Thanks for everything, you MVP guys... I just have to learn ;)

I have a small request, sorry to bother you. On the first page I saw:

added trimmed md5 short calculation (false md5, but sufficient for dup-search and a great speed improvement!)

Well, I need a function like that. I need to take a file (it can be 1 MB or 7 GB, which is the main problem), get a CRC, MD5 or whatever (the faster the better) and keep it for future reference/comparison. Your code isn't the best-commented I have ever seen; no offense, I have great respect :D

Can you just extract that "trimmed md5 short calculation" and make a little UDF/Func? Something easy like:

Func MD5ShortCalc($sFilePath)
; code
Return $MD5ShortCalc
EndFunc

Thanks again

Edited by Terenz

Nothing is so strong as gentleness. Nothing is so gentle as real strength

 




#182 ·  Posted (edited)

Here you go :)... I'll add some comments on the technique and limitations later on, have to go now.

Edit #1:

The function reads the first 8 KB, 8 KB from the middle and the last 8 KB of the file, and calculates a hash over these 24 KB. In SMF I use this hash plus the exact file size in bytes to look for duplicates: when hash and file size are both the same, the files are most likely the same. In theory this method is of course prone to false positives; in practice I've never encountered a problem. What might be a problem is the structure of the file. In SMF, for example, I exclude "doc;docx;ppt;pptx;xls;xlsx" files from this method and perform a full MD5 for those, as these files contain huge chunks of metadata. But combined with the file size check this method should be quite safe and really quick. And on the other hand it should at least not produce false negatives :)...
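To make the three sampled regions concrete, here is a minimal sketch that only does the offset arithmetic. The helper name `_SampleOffsets` is hypothetical and not part of SMF; the middle offset mirrors the expression used in the posted function, assuming the first read has already advanced the file pointer by one full chunk (so it only makes sense for files of at least three chunks).

```autoit
; Hypothetical helper (not part of SMF): returns the absolute start offsets
; of the three sampled regions for a file of $iFileSize bytes.
; Assumption: $iFileSize >= 3 * $iChunk, so the regions do not overlap.
Func _SampleOffsets($iFileSize, $iChunk = 8192)
    Local $aOffsets[3]
    ; Region 1: start of the file
    $aOffsets[0] = 0
    ; Region 2: the posted code seeks relative to the current position,
    ; which is $iChunk after the first read, so add $iChunk back here
    $aOffsets[1] = $iChunk + Int(($iFileSize / 2) - ($iChunk / 2 + 1) - $iChunk)
    ; Region 3: the last $iChunk bytes of the file
    $aOffsets[2] = $iFileSize - $iChunk
    Return $aOffsets
EndFunc   ;==>_SampleOffsets

; Usage: print the offsets for a 1,000,000 byte file
Local $aDemo = _SampleOffsets(1000000)
ConsoleWrite($aDemo[0] & ", " & $aDemo[1] & ", " & $aDemo[2] & @CRLF)
```

Everything between those regions is never read, which is where the speed comes from and also why two files of the same size that differ only in an unsampled region would be reported as duplicates.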

Edit #2:

Replaced FileGetSize() with an internal function to make it safe for large files (you mentioned 7GB).

Edit #3:

Cleaned up some unnecessary variables used in other parts of SMF :)...

$h_DLL_Kernel32 = DllOpen("kernel32.dll")
$h_DLL_Advapi32 = DllOpen("advapi32.dll")
; Start - Init Globals for _Hash_Calculation_Fast_MD5_ReadFile_DLL
Global $__MD5_Hash_Calculation_Short_Factor = 8192
Global $__MD5_Overlapped_tBuffer_Parent, $__MD5_Overlapped_tBuffer
Global $__MD5_Overlapped_tBuffer_Parts1, $__MD5_Overlapped_tBuffer_Parts2, $__MD5_Overlapped_tBuffer_Parts3
Global $__MD5_MD5CTX
; End - Init Globals for _Hash_Calculation_Fast_MD5_ReadFile_DLL
_MD5_Buffers_Initialize()
OnAutoItExitRegister("_MD5_Buffers_UnInitialize")

$Checksum_Result = _Hash_Calculation_Fast_MD5_ReadFile_DLL(@ScriptFullPath)
MsgBox(0, "", @ScriptFullPath & @CRLF & $Checksum_Result)


Func _Hash_Calculation_Fast_MD5_ReadFile_DLL($Checksum_Filename)
    If StringLeft($Checksum_Filename, 4) <> "\\?\" Then $Checksum_Filename = "\\?\" & $Checksum_Filename

    Local $hFile = DllCall($h_DLL_Kernel32, "ptr", "CreateFileW", "wstr", $Checksum_Filename, "dword", 0, "dword", 7, "ptr", 0, "dword", 3, "dword", 0, "ptr", 0) ; $iAccess = 0, $iShare = 7
    If $hFile[0] = Ptr(-1) Then Return SetError(1, 0, 0) ; file not found
    Local $aFileSize = DllCall($h_DLL_Kernel32, "bool", "GetFileSizeEx", "handle", $hFile[0], "int64*", 0)
    DllCall($h_DLL_Kernel32, "bool", "CloseHandle", "handle", $hFile[0]) ; close the size-query handle
    If $aFileSize[2] = 0 Then Return SetError(2, 0, 0) ; 0 byte file

    $hFile = DllCall($h_DLL_Kernel32, "ptr", "CreateFileW", "wstr", $Checksum_Filename, "dword", 0x80000000, "dword", 0, "ptr", 0, "dword", 3, "dword", 0x10000000, "ptr", 0)
    If $hFile[0] = Ptr(-1) Then Return SetError(3, 0, 0)

    DllCall($h_DLL_Kernel32, "bool", "ReadFile", "handle", $hFile[0], "struct*", $__MD5_Overlapped_tBuffer_Parts1, "dword", $__MD5_Hash_Calculation_Short_Factor, "dword*", 0, "ptr", 0)

    DllCall($h_DLL_Kernel32, "int", "SetFilePointerEx", "handle", $hFile[0], "int64", Int(($aFileSize[2] / 2) - ($__MD5_Hash_Calculation_Short_Factor / 2 + 1) - $__MD5_Hash_Calculation_Short_Factor), "int64*", 0, "dword", 1)
    DllCall($h_DLL_Kernel32, "bool", "ReadFile", "handle", $hFile[0], "struct*", $__MD5_Overlapped_tBuffer_Parts2, "dword", $__MD5_Hash_Calculation_Short_Factor, "dword*", 0, "ptr", 0)

    DllCall($h_DLL_Kernel32, "int", "SetFilePointerEx", "handle", $hFile[0], "int64", -$__MD5_Hash_Calculation_Short_Factor, "int64*", 0, "dword", 2)
    DllCall($h_DLL_Kernel32, "bool", "ReadFile", "handle", $hFile[0], "struct*", $__MD5_Overlapped_tBuffer_Parts3, "dword", $__MD5_Hash_Calculation_Short_Factor, "dword*", 0, "ptr", 0)

    DllCall($h_DLL_Kernel32, "bool", "CloseHandle", "handle", $hFile[0])

    DllCall($h_DLL_Advapi32, "none", "MD5Init", "struct*", $__MD5_MD5CTX)
    DllCall($h_DLL_Advapi32, "none", "MD5Update", "struct*", $__MD5_MD5CTX, "struct*", $__MD5_Overlapped_tBuffer_Parent, "dword", $__MD5_Hash_Calculation_Short_Factor * 3)
    DllCall($h_DLL_Advapi32, "none", "MD5Final", "struct*", $__MD5_MD5CTX)

    Return DllStructGetData($__MD5_MD5CTX, 4)
EndFunc   ;==>_Hash_Calculation_Fast_MD5_ReadFile_DLL


Func _MD5_Buffers_Initialize()
    $__MD5_Overlapped_tBuffer_Parent = DllStructCreate("byte[" & $__MD5_Hash_Calculation_Short_Factor * 3 & "]")
    $__MD5_Overlapped_tBuffer = DllStructCreate("byte[" & $__MD5_Hash_Calculation_Short_Factor & "];byte[" & $__MD5_Hash_Calculation_Short_Factor & "];byte[" & $__MD5_Hash_Calculation_Short_Factor & "]", DllStructGetPtr($__MD5_Overlapped_tBuffer_Parent))
    $__MD5_Overlapped_tBuffer_Parts1 = DllStructCreate("byte[" & $__MD5_Hash_Calculation_Short_Factor & "]", DllStructGetPtr($__MD5_Overlapped_tBuffer, 1))
    $__MD5_Overlapped_tBuffer_Parts2 = DllStructCreate("byte[" & $__MD5_Hash_Calculation_Short_Factor & "]", DllStructGetPtr($__MD5_Overlapped_tBuffer, 2))
    $__MD5_Overlapped_tBuffer_Parts3 = DllStructCreate("byte[" & $__MD5_Hash_Calculation_Short_Factor & "]", DllStructGetPtr($__MD5_Overlapped_tBuffer, 3))
    $__MD5_MD5CTX = DllStructCreate("dword i[2];dword buf[4];ubyte in[64];ubyte digest[16]")
EndFunc   ;==>_MD5_Buffers_Initialize

Func _MD5_Buffers_UnInitialize()
    $__MD5_Overlapped_tBuffer_Parts1 = 0
    $__MD5_Overlapped_tBuffer_Parts2 = 0
    $__MD5_Overlapped_tBuffer_Parts3 = 0
    $__MD5_Overlapped_tBuffer = 0
    $__MD5_Overlapped_tBuffer_Parent = 0
    $__MD5_MD5CTX = 0
    DllClose($h_DLL_Kernel32)
    DllClose($h_DLL_Advapi32)
EndFunc   ;==>_MD5_Buffers_UnInitialize
Edited by KaFu


My respect, thanks. I'll try it and post the result. So from your experience it is hard to get a false positive (yes, I'll check the size before comparing), but why 8 KB * 3, and not 10, 12, etc.?

Increasing that value in theory decreases the chance of a false positive but increases the MD5 calculation time.



#184 ·  Posted (edited)

I don't understand this line:

Replaced FileGetSize() with internal function to make it safe for large files (you mentioned 7GB)

Why? For me FileGetSize works with big files:

$sSize = FileGetSize("C:\Path\BigFile.iso")
ConsoleWrite("RESULT: " & ($sSize/1073741824) & " GB"& @CRLF)
RESULT: 7.10991006496472 GB

Is there something I don't know about FileGetSize? Is the result different for you?

EDIT: Try both versions:

$h_DLL_Kernel32 = DllOpen("kernel32.dll")
$h_DLL_Advapi32 = DllOpen("advapi32.dll")
; Start - Init Globals for _Hash_Calculation_Fast_MD5_ReadFile_DLL
Global $__MD5_Hash_Calculation_Short_Factor = 8192
Global $__MD5_Overlapped_tBuffer_Parent, $__MD5_Overlapped_tBuffer
Global $__MD5_Overlapped_tBuffer_Parts1, $__MD5_Overlapped_tBuffer_Parts2, $__MD5_Overlapped_tBuffer_Parts3
Global $__MD5_MD5CTX
; End - Init Globals for _Hash_Calculation_Fast_MD5_ReadFile_DLL
_MD5_Buffers_Initialize()
OnAutoItExitRegister("_MD5_Buffers_UnInitialize")

$Checksum_Result = _Hash_Calculation_Fast_MD5_ReadFile_DLL("C:\Path\BigFile.iso")
ConsoleWrite($Checksum_Result & @CR)
$Checksum_Resultv2 = _Hash_Calculation_Fast_MD5_ReadFile_DLLv2("C:\Path\BigFile.iso")
ConsoleWrite($Checksum_Resultv2 & @CR) ; print the v2 result, not the first result again

Func _Hash_Calculation_Fast_MD5_ReadFile_DLLv2($Checksum_Filename)
    If StringLeft($Checksum_Filename, 4) <> "\\?\" Then $Checksum_Filename = "\\?\" & $Checksum_Filename
    $hFile = DllCall($h_DLL_Kernel32, "ptr", "CreateFileW", "wstr", $Checksum_Filename, "dword", 0x80000000, "dword", 0, "ptr", 0, "dword", 3, "dword", 0x10000000, "ptr", 0)
    If $hFile[0] = Ptr(-1) Then Return SetError(3, 0, 0)
    DllCall($h_DLL_Kernel32, "bool", "ReadFile", "handle", $hFile[0], "struct*", $__MD5_Overlapped_tBuffer_Parts1, "dword", $__MD5_Hash_Calculation_Short_Factor, "dword*", 0, "ptr", 0)
    DllCall($h_DLL_Kernel32, "int", "SetFilePointerEx", "handle", $hFile[0], "int64", Int((FileGetSize($Checksum_Filename) / 2) - ($__MD5_Hash_Calculation_Short_Factor / 2 + 1) - $__MD5_Hash_Calculation_Short_Factor), "int64*", 0, "dword", 1)
    DllCall($h_DLL_Kernel32, "bool", "ReadFile", "handle", $hFile[0], "struct*", $__MD5_Overlapped_tBuffer_Parts2, "dword", $__MD5_Hash_Calculation_Short_Factor, "dword*", 0, "ptr", 0)
    DllCall($h_DLL_Kernel32, "int", "SetFilePointerEx", "handle", $hFile[0], "int64", -$__MD5_Hash_Calculation_Short_Factor, "int64*", 0, "dword", 2)
    DllCall($h_DLL_Kernel32, "bool", "ReadFile", "handle", $hFile[0], "struct*", $__MD5_Overlapped_tBuffer_Parts3, "dword", $__MD5_Hash_Calculation_Short_Factor, "dword*", 0, "ptr", 0)
    DllCall($h_DLL_Kernel32, "bool", "CloseHandle", "handle", $hFile[0])
    DllCall($h_DLL_Advapi32, "none", "MD5Init", "struct*", $__MD5_MD5CTX)
    DllCall($h_DLL_Advapi32, "none", "MD5Update", "struct*", $__MD5_MD5CTX, "struct*", $__MD5_Overlapped_tBuffer_Parent, "dword", $__MD5_Hash_Calculation_Short_Factor * 3)
    DllCall($h_DLL_Advapi32, "none", "MD5Final", "struct*", $__MD5_MD5CTX)
    Return DllStructGetData($__MD5_MD5CTX, 4)
EndFunc   ;==>_Hash_Calculation_Fast_MD5_ReadFile_DLLv2

Func _Hash_Calculation_Fast_MD5_ReadFile_DLL($Checksum_Filename)
    If StringLeft($Checksum_Filename, 4) <> "\\?\" Then $Checksum_Filename = "\\?\" & $Checksum_Filename

    Local $hFile = DllCall($h_DLL_Kernel32, "ptr", "CreateFileW", "wstr", $Checksum_Filename, "dword", 0, "dword", 7, "ptr", 0, "dword", 3, "dword", 0, "ptr", 0) ; $iAccess = 0, $iShare = 7
    If $hFile[0] = Ptr(-1) Then Return SetError(1, 0, 0) ; file not found
    Local $aFileSize = DllCall($h_DLL_Kernel32, "bool", "GetFileSizeEx", "handle", $hFile[0], "int64*", 0)
    DllCall($h_DLL_Kernel32, "bool", "CloseHandle", "handle", $hFile[0]) ; close the size-query handle
    If $aFileSize[2] = 0 Then Return SetError(2, 0, 0) ; 0 byte file

    $hFile = DllCall($h_DLL_Kernel32, "ptr", "CreateFileW", "wstr", $Checksum_Filename, "dword", 0x80000000, "dword", 0, "ptr", 0, "dword", 3, "dword", 0x10000000, "ptr", 0)
    If $hFile[0] = Ptr(-1) Then Return SetError(3, 0, 0)

    DllCall($h_DLL_Kernel32, "bool", "ReadFile", "handle", $hFile[0], "struct*", $__MD5_Overlapped_tBuffer_Parts1, "dword", $__MD5_Hash_Calculation_Short_Factor, "dword*", 0, "ptr", 0)

    DllCall($h_DLL_Kernel32, "int", "SetFilePointerEx", "handle", $hFile[0], "int64", Int(($aFileSize[2] / 2) - ($__MD5_Hash_Calculation_Short_Factor / 2 + 1) - $__MD5_Hash_Calculation_Short_Factor), "int64*", 0, "dword", 1)
    DllCall($h_DLL_Kernel32, "bool", "ReadFile", "handle", $hFile[0], "struct*", $__MD5_Overlapped_tBuffer_Parts2, "dword", $__MD5_Hash_Calculation_Short_Factor, "dword*", 0, "ptr", 0)

    DllCall($h_DLL_Kernel32, "int", "SetFilePointerEx", "handle", $hFile[0], "int64", -$__MD5_Hash_Calculation_Short_Factor, "int64*", 0, "dword", 2)
    DllCall($h_DLL_Kernel32, "bool", "ReadFile", "handle", $hFile[0], "struct*", $__MD5_Overlapped_tBuffer_Parts3, "dword", $__MD5_Hash_Calculation_Short_Factor, "dword*", 0, "ptr", 0)

    DllCall($h_DLL_Kernel32, "bool", "CloseHandle", "handle", $hFile[0])

    DllCall($h_DLL_Advapi32, "none", "MD5Init", "struct*", $__MD5_MD5CTX)
    DllCall($h_DLL_Advapi32, "none", "MD5Update", "struct*", $__MD5_MD5CTX, "struct*", $__MD5_Overlapped_tBuffer_Parent, "dword", $__MD5_Hash_Calculation_Short_Factor * 3)
    DllCall($h_DLL_Advapi32, "none", "MD5Final", "struct*", $__MD5_MD5CTX)

    Return DllStructGetData($__MD5_MD5CTX, 4)
EndFunc   ;==>_Hash_Calculation_Fast_MD5_ReadFile_DLL


Func _MD5_Buffers_Initialize()
    $__MD5_Overlapped_tBuffer_Parent = DllStructCreate("byte[" & $__MD5_Hash_Calculation_Short_Factor * 3 & "]")
    $__MD5_Overlapped_tBuffer = DllStructCreate("byte[" & $__MD5_Hash_Calculation_Short_Factor & "];byte[" & $__MD5_Hash_Calculation_Short_Factor & "];byte[" & $__MD5_Hash_Calculation_Short_Factor & "]", DllStructGetPtr($__MD5_Overlapped_tBuffer_Parent))
    $__MD5_Overlapped_tBuffer_Parts1 = DllStructCreate("byte[" & $__MD5_Hash_Calculation_Short_Factor & "]", DllStructGetPtr($__MD5_Overlapped_tBuffer, 1))
    $__MD5_Overlapped_tBuffer_Parts2 = DllStructCreate("byte[" & $__MD5_Hash_Calculation_Short_Factor & "]", DllStructGetPtr($__MD5_Overlapped_tBuffer, 2))
    $__MD5_Overlapped_tBuffer_Parts3 = DllStructCreate("byte[" & $__MD5_Hash_Calculation_Short_Factor & "]", DllStructGetPtr($__MD5_Overlapped_tBuffer, 3))
    $__MD5_MD5CTX = DllStructCreate("dword i[2];dword buf[4];ubyte in[64];ubyte digest[16]")
EndFunc   ;==>_MD5_Buffers_Initialize

Func _MD5_Buffers_UnInitialize()
    $__MD5_Overlapped_tBuffer_Parts1 = 0
    $__MD5_Overlapped_tBuffer_Parts2 = 0
    $__MD5_Overlapped_tBuffer_Parts3 = 0
    $__MD5_Overlapped_tBuffer = 0
    $__MD5_Overlapped_tBuffer_Parent = 0
    $__MD5_MD5CTX = 0
    DllClose($h_DLL_Kernel32)
    DllClose($h_DLL_Advapi32)
EndFunc   ;==>_MD5_Buffers_UnInitialize
0xD3AA4E42362721362316AB37C68A4558
0xD3AA4E42362721362316AB37C68A4558

Anyway, it is very, very fast; I'd only like to know about FileGetSize :blink:

Last, if I want to increase the 8 KB, do I just need to edit this line:

Global $__MD5_Hash_Calculation_Short_Factor = 8192

To something like 16384, or does the code need more changes? Thanks

Edited by Terenz


#185 ·  Posted (edited)

Is there something I don't know about FileGetSize? Is the result different for you?

 

Nope, my fault. I thought FileGetSize had problems with files > 2GB; in fact that was fixed some 10 years ago (3.0.102) o:).

The 8192 bytes is just a trial-and-error value which I determined in the course of developing SMF; it has always worked fine for me. Changing $__MD5_Hash_Calculation_Short_Factor to a different value should dynamically adjust the needed buffers and should work fine.

One more thing: in SMF I use the default full hash (the standard _Crypt_HashFile() function) for files smaller than 3 * $__MD5_Hash_Calculation_Short_Factor (24 KB in the example above). I guess the results will not be consistent otherwise (e.g. if you reuse the function, some parts of the buffer might still contain data from the last file).
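The small-file fallback described above could be wired up roughly like this. It is only a sketch: the wrapper name `_Hash_File_FastOrFull` is hypothetical, and it assumes the globals and `_Hash_Calculation_Fast_MD5_ReadFile_DLL()` from post #182 are already part of the script. How SMF itself does the switch is not shown in this thread.

```autoit
#include <Crypt.au3>

; Hypothetical wrapper (illustrative name, not from SMF).
; Assumes $__MD5_Hash_Calculation_Short_Factor and
; _Hash_Calculation_Fast_MD5_ReadFile_DLL() from post #182 are defined.
Func _Hash_File_FastOrFull($sFilePath)
    Local $iSize = FileGetSize($sFilePath)
    If @error Then Return SetError(1, 0, 0) ; size could not be read

    If $iSize < 3 * $__MD5_Hash_Calculation_Short_Factor Then
        ; Small file: the three sampled regions would overlap or leave stale
        ; buffer data, so hash the whole file with the standard UDF instead
        Local $dFull = _Crypt_HashFile($sFilePath, $CALG_MD5)
        If @error Then Return SetError(2, @error, 0)
        Return $dFull
    EndIf

    ; Large enough: use the sampled (first/middle/last chunk) hash
    Local $dFast = _Hash_Calculation_Fast_MD5_ReadFile_DLL($sFilePath)
    If @error Then Return SetError(3, @error, 0)
    Return $dFast
EndFunc   ;==>_Hash_File_FastOrFull
```

Note that the two branches produce hashes that are not comparable with each other, which is fine as long as the file size is part of the comparison: two files of the same size always go through the same branch.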

Edited by KaFu


Nope, my fault. I thought FileGetSize had problems with files > 2GB; in fact that was fixed some 10 years ago (3.0.102) o:).

 

Practically yesterday :D

I'll leave the buffer size as is, I'll use FileGetSize, and I'll clear the buffer if I reuse the function. Thanks
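For clearing the buffer between calls, one possible approach is to zero the struct with `RtlZeroMemory`. This is a minimal sketch; the helper name `_Buffer_Zero` is hypothetical and nothing in the thread says SMF does it this way.

```autoit
; Hypothetical helper (not from SMF): overwrite a DllStruct with zero bytes
; so that no data from a previously hashed file is left behind.
Func _Buffer_Zero(ByRef $tBuffer)
    DllCall("kernel32.dll", "none", "RtlZeroMemory", _
            "struct*", $tBuffer, "ulong_ptr", DllStructGetSize($tBuffer))
    If @error Then Return SetError(1, @error, False)
    Return True
EndFunc   ;==>_Buffer_Zero

; Demo on a small stand-alone buffer
Local $tDemo = DllStructCreate("byte[16]")
DllStructSetData($tDemo, 1, Binary("0xFFFFFFFF"))
_Buffer_Zero($tDemo)
ConsoleWrite(DllStructGetData($tDemo, 1) & @CRLF)
```

With the code from post #182, calling `_Buffer_Zero($__MD5_Overlapped_tBuffer_Parent)` at the start of the hash function should be enough, since the three part buffers are views onto that same memory.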



#187 ·  Posted (edited)

<snip>

Edited by Melba23
Post removed


Kathygib,

We do not permit advertising for paid products - please do not do it again. :naughty:

M23


Any of my own code posted anywhere on the forum is available for use by others without any restriction of any kind.

My UDFs:

ArrayMultiColSort ---- Sort arrays on multiple columns
ChooseFileFolder ---- Single and multiple selections from specified path treeview listing
Date_Time_Convert -- Easily convert date/time formats, including the language used
ExtMsgBox --------- A highly customisable replacement for MsgBox
GUIExtender -------- Extend and retract multiple sections within a GUI
GUIFrame ---------- Subdivide GUIs into many adjustable frames
GUIListViewEx ------- Insert, delete, move, drag, sort, edit and colour ListView items
GUITreeViewEx ------ Check/clear parent and child checkboxes in a TreeView
Marquee ----------- Scrolling tickertape GUIs
NoFocusLines ------- Remove the dotted focus lines from buttons, sliders, radios and checkboxes
Notify ------------- Small notifications on the edge of the display
Scrollbars ---------- Automatically sized scrollbars with a single command
StringSize ---------- Automatically size controls to fit text
Toast -------------- Small GUIs which pop out of the notification area

 


Hi everyone,

 

Has anyone updated the sources to run with AutoIt 3.3.12.0?

I tried, but it is not straightforward.

 

 

 


#190 ·  Posted (edited)

I doubt anyone has ^_^. I thought about it, but currently I'm still happy with 3.3.8.1, as I managed to code around any limitation or bug I've encountered. I'm preparing to release v10 sometime soon, maybe I'll (try to) update the code to the most recent AU version.

Meanwhile you might want to give the latest v10 Beta a try and let me know if it works for you ;)...

Edited by KaFu


2016-Jan-03, Changelog v9 > v10

  • Updated -   Improved file and duplicates search speed
  • Fixed   -   Duplicates Search "Hash-Cache" functionality worked sub-optimally, now brings a real improvement on repeated searches
  • Report  -   Improved speed of TNP Thumbnail Provider
  • Report  -   Added thumbnail and icon cache functionality
  • Report  -   Added custom cell highlighting feature
  • Report  -   Added optional checkboxes in Filename Column
  • Report  -   Improved Copy/MoveTo dialog
  • Report  -   Column order/size, OFFSET and LIMIT are now saved
  • Report  -   Save Styles fixed
  • Updated -   Treeview functions
  • Added   -   Optional Explorer Contextmenu entry to "Search with SMF for duplicate files"
  • Updated -   Lots of other bug fixes and style changes
  • Updated -   SQLite Dll to 3.9.2
  • Updated -   MediaInfo Dll to 0.7.81
  • Updated -   TrID Definitions to version 2015 Dec 29

Source and Executable are available at http://www.funk.eu
Best Regards
Updated first Post... Enjoy :)...


#192 ·  Posted

2017-Jun-18, Changelog v10 > v11

  • Fixed   -   Error in Report thumbnails cached in DB
  • Fixed   -   Single quote ' in filenames led to errors in report
  • Fixed   -   Win XP compatibility
  • Updated -   Improved file and duplicates search speed
  • Updated -   Lots of other small bug fixes and style changes
  • Updated -   SQLite Dll to 3.19.3
  • Updated -   MediaInfo Dll to 0.7.96
  • Updated -   TrID Definitions to version 2017 Jun 15

Source and Executable are available at http://www.funk.eu
Best Regards
Updated first Post... Enjoy :)...
