
finding duplicates - trying to find a way to improve speed

Recommended Posts

gcue

Here is my script. It is part of a larger script, but this component is taking forever to process.

#include <File.au3>
#include <Crypt.au3>
#include <Array.au3>

$msg_normal = 0

$source_parent_dir = "C:\"

Local $full_array[1][4] ; declared 2D up front so the ReDim below only has to grow it

;get full file paths array for all files in all dirs and sub-dirs - this is taking very long
$files_array = _FileListToArrayRec($source_parent_dir, "*", $FLTAR_FILES, 1, $FLTAR_SORT, $FLTAR_FULLPATH)
Debug("pause 1")

;i need the array to be in the following format for later script processing - this is taking very long
For $x = 1 To UBound($files_array) - 1
    $file_name_extension = StringRegExpReplace($files_array[$x], "^.*\\", "")
    $file_path_dir = StringReplace($files_array[$x], "\" & $file_name_extension, "")

    ReDim $full_array[UBound($full_array) + 1][4]

    $full_array[UBound($full_array) - 1][0] = $file_path_dir
    $full_array[UBound($full_array) - 1][1] = $file_name_extension
Next
Debug("pause 2")

;here's how i find duplicates - this is taking very long
$final_array = _GetFileDupes($full_array)

If UBound($final_array) - 1 = 0 Then
    MsgBox($msg_normal, @ScriptName, "There are NO duplicate files found.")
    Exit
EndIf

Debug($final_array)

Func _GetFileDupes($full_array)

    _Crypt_Startup()

    For $x = 1 To UBound($full_array) - 1
        $path = $full_array[$x][0]
        $file_name = $full_array[$x][1]

        $sha1 = _Crypt_HashFile($path & "\" & $file_name, $CALG_SHA1)

        $full_array[$x][2] = $sha1
    Next
;~  Debug($full_array)
    _Crypt_Shutdown()

    Local $final_array[1][3] ; 2D up front, grown by the ReDim below

    For $x = 1 To UBound($full_array) - 1
        $search = _ArrayFindAll($full_array, $full_array[$x][2], 1, 0, 0, 0, 2)

        If UBound($search) = 1 Then ContinueLoop

        For $y = 0 To UBound($search) - 1
            $index = $search[$y]

            If $full_array[$index][3] <> "DUPLICATE" Then
                $full_array[$index][3] = "DUPLICATE"

                ReDim $final_array[UBound($final_array) + 1][3]

                $final_array[UBound($final_array) - 1][0] = $full_array[$index][0]
                $final_array[UBound($final_array) - 1][1] = $full_array[$index][1]
                $final_array[UBound($final_array) - 1][2] = $full_array[$index][2]
            EndIf
        Next
    Next
;~  Debug($final_array)
    Return $final_array

EndFunc   ;==>_GetFileDupes

Func Debug($variable1 = "", $variable2 = "", $variable3 = "", $variable4 = "", $variable5 = "")

;~  #include <array.au3>
;~  $msg_normal = 0

    If IsArray($variable1) Or IsArray($variable2) Then
        If IsArray($variable1) Then _ArrayDisplay($variable1, $variable2)
        If IsArray($variable2) Then _ArrayDisplay($variable2, $variable1)
    Else
        $variable = ""

        If $variable1 <> "" Then $variable &= $variable1 & @CRLF
        If $variable2 <> "" Then $variable &= $variable2 & @CRLF
        If $variable3 <> "" Then $variable &= $variable3 & @CRLF
        If $variable4 <> "" Then $variable &= $variable4 & @CRLF
        If $variable5 <> "" Then $variable &= $variable5 & @CRLF

        $variable = StringStripWS($variable, 2)

        ClipPut($variable)

        MsgBox($msg_normal, "Debug", $variable)
    EndIf

EndFunc   ;==>Debug

Any help is greatly appreciated!

spudw2k

Perhaps you could use the FileFindFirstFile/FileFindNextFile functions to build your array yourself and populate it in the desired format from the get-go, versus populating the array and then updating each array entry afterwards. That could help cut down some of the time.
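
Untested sketch of what I mean (the starting folder and the 1000-row chunk size are placeholders): a recursive FileFindFirstFile/FileFindNextFile walker that fills the dir and name columns directly, so no regex split is needed afterwards.

#include <Array.au3>

Global $g_aFiles[1000][4] ; pre-sized; grown in chunks instead of one ReDim per file
Global $g_iCount = 0

_ListFiles("C:\some\dir") ; placeholder starting folder
ReDim $g_aFiles[$g_iCount][4] ; trim the unused rows at the end
_ArrayDisplay($g_aFiles)

Func _ListFiles($sDir)
    Local $hSearch = FileFindFirstFile($sDir & "\*")
    If $hSearch = -1 Then Return ; empty or inaccessible folder

    Local $sName
    While 1
        $sName = FileFindNextFile($hSearch)
        If @error Then ExitLoop

        If @extended Then ; @extended = 1 means the entry is a folder
            _ListFiles($sDir & "\" & $sName) ; recurse into the sub-dir
        Else
            If $g_iCount = UBound($g_aFiles) Then ReDim $g_aFiles[$g_iCount + 1000][4]
            $g_aFiles[$g_iCount][0] = $sDir  ; dir column, already split
            $g_aFiles[$g_iCount][1] = $sName ; name + extension column
            $g_iCount += 1
        EndIf
    WEnd
    FileClose($hSearch)
EndFunc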

iamtheky

MD5 might be faster.

And why are you splitting them into name and extension? Wouldn't simply hashing the first array tell you if there were dupes?
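
For what it's worth, that swap is a single constant in _GetFileDupes:

$sha1 = _Crypt_HashFile($path & "\" & $file_name, $CALG_MD5) ; was $CALG_SHA1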


gcue

I need the name and extension later - that's why I am processing them.

I'll try MD5 - but I still can't get away from the first part being slow.

gcue
24 minutes ago, spudw2k said:

Perhaps you could use the FileFindFirstFile/FileFindNextFile functions to build your array yourself and populate it in the desired format from the get-go, versus populating the array and then updating each array entry afterwards. That could help cut down some of the time.

Let me try that =)

Danyfirex

Hello. My suggestions are:

  • Build your own recursive file-list routine.
  • Build your formatted array inside the recursion.
  • To speed up your compare routine, first check whether the file sizes are equal; only if they are, go on to hash checking (rough sketch below).
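
Untested sketch of that size-first idea (the folder path and names like $oSizes are placeholders, not from your script): bucket the full paths by FileGetSize in a Scripting.Dictionary, then hash only the buckets that hold more than one file.

#include <File.au3>
#include <Crypt.au3>

Local $aFiles = _FileListToArrayRec("C:\some\dir", "*", $FLTAR_FILES, 1, $FLTAR_NOSORT, $FLTAR_FULLPATH)
Local $oSizes = ObjCreate("Scripting.Dictionary")
Local $sKey, $aGroup

; pass 1: bucket full paths by size (pipe is a safe separator - it cannot occur in a Windows path)
For $i = 1 To $aFiles[0]
    $sKey = String(FileGetSize($aFiles[$i]))
    $oSizes($sKey) = $oSizes($sKey) & "|" & $aFiles[$i]
Next

; pass 2: hash only buckets with 2+ files - a unique size cannot be a duplicate
_Crypt_Startup()
For $sKey In $oSizes.Keys
    $aGroup = StringSplit($oSizes($sKey), "|")
    If $aGroup[0] < 3 Then ContinueLoop ; element 1 is an empty leading entry, so a count of 2 means just one file
    For $i = 2 To $aGroup[0]
        ConsoleWrite(_Crypt_HashFile($aGroup[$i], $CALG_MD5) & " " & $aGroup[$i] & @CRLF)
    Next
Next
_Crypt_Shutdown()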

Regards


iamtheky

And you can scrap the sort since you are testing the hashes; that will no doubt speed it up (just over 3x faster in my testing just now).
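
i.e. pass $FLTAR_NOSORT instead of $FLTAR_SORT in the listing call, everything else unchanged:

$files_array = _FileListToArrayRec($source_parent_dir, "*", $FLTAR_FILES, 1, $FLTAR_NOSORT, $FLTAR_FULLPATH)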

Beege

Doing a ReDim on the array for each item in that first loop has a cost too, and it shouldn't be needed:

;get full file paths array for all files and in all dirs and sub dirs - this is taking very long
$files_array = _FileListToArrayRec($source_parent_dir, "*", $FLTAR_FILES, 1, $FLTAR_SORT, $FLTAR_FULLPATH)
Debug("pause 1")

Local $full_array[UBound($files_array)][4]

;i need the array to be in the following format for later script processing - this is taking very long
For $x = 1 To UBound($files_array) - 1
    $file_name_extension = StringRegExpReplace($files_array[$x], "^.*\\", "")
    $file_path_dir = StringReplace($files_array[$x], "\" & $file_name_extension, "")

    ;ReDim $full_array[UBound($full_array) + 1][4]

    $full_array[$x][0] = $file_path_dir
    $full_array[$x][1] = $file_name_extension
Next

 

gcue
5 minutes ago, Beege said:

Doing a ReDim on the array for each item in that first loop has a cost too, and it shouldn't be needed


Then how do I add another record to the array without doing that?

Beege

Just like I posted, it should work. For that portion of code you already know how large the array needs to be, so it's better to create the whole array just once, then walk through it and fill in the elements.

 

gcue

Sorry, I overlooked that - great suggestion! =)

Thank you!

gcue

Wow, just that alone was incredibly faster, Beege.

Thank you so much for catching/suggesting it.

Playing with some of the other suggestions.

Thank you everyone! =)

mrider

You'll need to change how this displays, but this is noticeably faster than using arrays.  Note also that I had to download the SQLite DLLs from https://www.autoitscript.com/autoit3/pkgmgr/sqlite/ and put the files in the script directory in order to get this to work, since my computer is behind a proxy.

 

My code:

#include <File.au3>
#include <Crypt.au3>
#include <Array.au3>
#include <SQLite.au3>
#include <SQLite.dll.au3>

_SQLite_Startup()
_SQLite_Open()
_SQLite_Exec(-1, "CREATE TABLE HashSums (Count, Sum, Path);")
_Crypt_Startup()


Local $source_parent_dir = "C:\"

Local $row, $count, $path
Local $files = _FileListToArrayRec($source_parent_dir, "*", $FLTAR_FILES, 1, $FLTAR_SORT, $FLTAR_FULLPATH)
For $i = 1 To $files[0]
    Local $sum = _Crypt_HashFile($files[$i], $CALG_SHA1)
    _SQLite_QuerySingleRow(-1, "SELECT Count, Path FROM HashSums WHERE Sum = '" & $sum & "';", $row)
    If @error Then
        $path = _SQLite_FastEscape(@LF & $files[$i])
        _SQLite_Exec(-1, "INSERT INTO HashSums (Count, Sum, Path) VALUES (1, '" & $sum & "', " & $path & ");")
    Else
        $count = Int($row[0])+1
        $path = _SQLite_FastEscape($row[1] & @LF & $files[$i])
        _SQLite_Exec(-1, "UPDATE HashSums SET Count = '" & $count & "', Path = " & $path & " WHERE SUM = '" & $sum & "';")
        If @error Then
            ConsoleWrite("@error = " & @error & @CRLF)
        EndIf
    EndIf
Next

Local $query
_SQLite_Query(-1, "SELECT Sum, Path FROM HashSums WHERE Count > 1;", $query)
While _SQLite_FetchData($query, $row) = $SQLITE_OK

    ; Change this to display however you wish...
    ConsoleWrite($row[0])
    ConsoleWrite($row[1] & @LF & @LF & @LF)

WEnd

_Crypt_Shutdown()
_SQLite_Shutdown()

 

AspirinJunkie

It will take a very long time if you really want to hash every file on "C:\"!
Better to reduce the list of files to hash by filtering in the _FileListToArrayRec function.

Then you need a fast comparison method to find existing matches.
Two nested For loops are very inefficient.
A Dictionary is a possible solution for this.
The SQLite solution from mrider goes in the same direction.

So here's my solution:

#include <File.au3>
#include <Crypt.au3>
#include <Array.au3>

Global $s_Path_Parent = "C:\programming\AutoIt"

Global $o_Hashes = ObjCreate("Scripting.Dictionary")
Global $o_DoubleHashes = ObjCreate("Scripting.Dictionary")
Global $s_Hash, $a_Temp, $s_File

Global $a_Files = _FileListToArrayRec($s_Path_Parent, "*.au3", 1, 1, 0, 2)

; Hash all files and create List of double files:
_Crypt_Startup()
For $i = 1 To $a_Files[0]
    $s_Hash = String(_Crypt_HashFile($a_Files[$i], $CALG_MD5))
    If $o_Hashes.Exists($s_Hash) Then
        $a_Temp = $o_Hashes($s_Hash)
        If UBound($a_Temp) = 1 Then $o_DoubleHashes($s_Hash) = 0
        _ArrayAdd($a_Temp, $a_Files[$i])
    Else
        Local $a_Temp[] = [$a_Files[$i]]
    EndIf
    $o_Hashes($s_Hash) = $a_Temp
Next
_Crypt_Shutdown()


; output the doubled files:
For $s_Hash in $o_DoubleHashes.Keys
   For $s_File in $o_Hashes($s_Hash)
      ConsoleWrite($s_File & @CRLF)
   Next
   ConsoleWrite(@CRLF)
Next

 

gcue

I tried the SQL way - it wasn't much faster. Still playing with some of the suggestions.

gcue

After finding out how much time ReDim'ing took, I went through and removed any extra arrays I was using and worked through the same array - much faster... I also took your advice, AspirinJunkie, and am only going through certain file types instead of * (the call below shows the kind of filter I mean).
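
The extension list here is just an example, not my exact set:

$files_array = _FileListToArrayRec($source_parent_dir, "*.doc;*.docx;*.xls;*.jpg", $FLTAR_FILES, 1, $FLTAR_NOSORT, $FLTAR_FULLPATH)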

Thank you all!

