finding duplicates - trying to find a way to improve speed



Here is my script. It is part of a larger script, but this component is taking forever to process.

#include <File.au3>
#include <Crypt.au3>
#include <Array.au3>

$msg_normal = 0

$source_parent_dir = "C:\"

Local $full_array[1][4] ; declared 2D up front to match the ReDim in the loop below

;get full file paths array for all files and in all dirs and sub dirs - this is taking very long
$files_array = _FileListToArrayRec($source_parent_dir, "*", $FLTAR_FILES, 1, $FLTAR_SORT, $FLTAR_FULLPATH)
Debug("pause 1")

;i need the array to be in the following format for later script processing - this is taking very long
For $x = 1 To UBound($files_array) - 1
    $file_name_extension = StringRegExpReplace($files_array[$x], "^.*\\", "")
    $file_path_dir = StringReplace($files_array[$x], "\" & $file_name_extension, "")

    ReDim $full_array[UBound($full_array) + 1][4]

    $full_array[UBound($full_array) - 1][0] = $file_path_dir
    $full_array[UBound($full_array) - 1][1] = $file_name_extension
Next
Debug("pause 2")

;here's how i find duplicates - this is taking very long
$final_array = _GetFileDupes($full_array)

If UBound($final_array) - 1 = 0 Then
    MsgBox($msg_normal, @ScriptName, "There are NO duplicate files found.")
    Exit
EndIf

Debug($final_array)

Func _GetFileDupes($full_array)

    _Crypt_Startup()

    For $x = 1 To UBound($full_array) - 1
        $path = $full_array[$x][0]
        $file_name = $full_array[$x][1]

        $sha1 = _Crypt_HashFile($path & "\" & $file_name, $CALG_SHA1)

        $full_array[$x][2] = $sha1
    Next
;~  Debug($full_array)
    _Crypt_Shutdown()

    Local $final_array[1][3] ; 2D to match the ReDim below

    For $x = 1 To UBound($full_array) - 1
        $search = _ArrayFindAll($full_array, $full_array[$x][2], 1, 0, 0, 0, 2)

        If UBound($search) = 1 Then ContinueLoop

        For $y = 0 To UBound($search) - 1
            $index = $search[$y]

            If $full_array[$index][3] <> "DUPLICATE" Then
                $full_array[$index][3] = "DUPLICATE"

                ReDim $final_array[UBound($final_array) + 1][3]

                $final_array[UBound($final_array) - 1][0] = $full_array[$index][0]
                $final_array[UBound($final_array) - 1][1] = $full_array[$index][1]
                $final_array[UBound($final_array) - 1][2] = $full_array[$index][2]
            EndIf
        Next
    Next
;~  Debug($final_array)
    Return $final_array

EndFunc   ;==>_GetFileDupes

Func Debug($variable1 = "", $variable2 = "", $variable3 = "", $variable4 = "", $variable5 = "")

;~  #include <array.au3>
;~  $msg_normal = 0

    If IsArray($variable1) Or IsArray($variable2) Then
        If IsArray($variable1) Then _ArrayDisplay($variable1, $variable2)
        If IsArray($variable2) Then _ArrayDisplay($variable2, $variable1)
    Else
        $variable = ""

        If $variable1 <> "" Then $variable &= $variable1 & @CRLF
        If $variable2 <> "" Then $variable &= $variable2 & @CRLF
        If $variable3 <> "" Then $variable &= $variable3 & @CRLF
        If $variable4 <> "" Then $variable &= $variable4 & @CRLF
        If $variable5 <> "" Then $variable &= $variable5 & @CRLF

        $variable = StringStripWS($variable, 2)

        ClipPut($variable)

        MsgBox($msg_normal, "Debug", $variable)
    EndIf

EndFunc   ;==>Debug

Any help is greatly appreciated!


Perhaps you could use the FindFile functions to build your array yourself and populate it in the desired format from the get-go, versus populating the array and then updating each array entry afterwards. That could help cut some time down.
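A minimal, untested sketch of that idea, using the native FileFindFirstFile/FileFindNextFile functions; the helper name _BuildFileArray and the example path are only placeholders:

#include <Array.au3>

; Sketch: recurse with the native FindFile functions and fill the [dir, name] columns directly.
Local $a_Files[1][4] ; row 0 left unused, same convention as in the opening post

_BuildFileArray("C:\some\dir", $a_Files) ; hypothetical helper; path is just an example
_ArrayDisplay($a_Files)

Func _BuildFileArray($sDir, ByRef $aOut)
    Local $hSearch = FileFindFirstFile($sDir & "\*")
    If $hSearch = -1 Then Return ; empty or inaccessible folder

    Local $sName
    While 1
        $sName = FileFindNextFile($hSearch)
        If @error Then ExitLoop
        If @extended Then ; @extended = 1 means the entry is a folder
            _BuildFileArray($sDir & "\" & $sName, $aOut) ; recurse into the sub-dir
        Else
            ReDim $aOut[UBound($aOut) + 1][4] ; growing in chunks would be faster still, see below
            $aOut[UBound($aOut) - 1][0] = $sDir ; directory column
            $aOut[UBound($aOut) - 1][1] = $sName ; file name column
        EndIf
    WEnd
    FileClose($hSearch)
EndFunc ;==>_BuildFileArray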


 


MD5 might be faster.

And why are you splitting them into name and extension? Wouldn't simply hashing the first array tell you if there were dupes?
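Something like this minimal sketch, hashing the full paths straight out of _FileListToArrayRec without splitting them first (untested; the path and algorithm choice are just examples):

#include <File.au3>
#include <Crypt.au3>

Local $files_array = _FileListToArrayRec("C:\", "*", $FLTAR_FILES, 1, $FLTAR_NOSORT, $FLTAR_FULLPATH)

_Crypt_Startup()
Local $hashes[UBound($files_array)] ; one hash per row, index-aligned with $files_array
For $i = 1 To $files_array[0]
    $hashes[$i] = _Crypt_HashFile($files_array[$i], $CALG_MD5) ; MD5 as suggested; $CALG_SHA1 also works
Next
_Crypt_Shutdown()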



24 minutes ago, spudw2k said:

Perhaps you could use the FindFile functions to build your array yourself and populate it in the desired format from the get-go, versus populating the array and then updating each array entry afterwards. That could help cut some time down.

Let me try that =)


Hello, my suggestions are:

  • Build your own recursive file-list routine.
  • Build your formatted array inside the recursion.
  • To speed up your compare routine, first check whether the file sizes are equal; only if they are, go on to hash checking (see the sketch right below this list).
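A rough, untested sketch of the size check idea (simplified: each same-size file is only compared against the first file seen with that size; the path and variable names are just illustrative):

#include <File.au3>
#include <Crypt.au3>

; Sketch: group files by size first; the expensive hash is only computed for
; files that share their size with at least one earlier file.
Local $a_Files = _FileListToArrayRec("C:\some\dir", "*", $FLTAR_FILES, 1, $FLTAR_NOSORT, $FLTAR_FULLPATH)
Local $o_Sizes = ObjCreate("Scripting.Dictionary") ; size -> first file seen with that size

_Crypt_Startup()
For $i = 1 To $a_Files[0]
    $iSize = FileGetSize($a_Files[$i])
    If $o_Sizes.Exists($iSize) Then
        ; same size as an earlier file: only now pay for the hashes
        If String(_Crypt_HashFile($a_Files[$i], $CALG_SHA1)) = String(_Crypt_HashFile($o_Sizes($iSize), $CALG_SHA1)) Then
            ConsoleWrite("Possible duplicate: " & $a_Files[$i] & @CRLF)
        EndIf
    Else
        $o_Sizes($iSize) = $a_Files[$i] ; unique size so far, no hash needed
    EndIf
Next
_Crypt_Shutdown()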

Regards

Edited by Danyfirex

And you can scrap the sort since you are testing the hashes; that will no doubt speed it up (just over 3x faster in my testing just now).
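For example (only the sort flag changes compared to the opening post):

#include <File.au3>

; Same listing call as in the opening post, but with $FLTAR_NOSORT:
; the duplicate check is done by hash anyway, so sorting buys nothing.
Local $source_parent_dir = "C:\"
Local $files_array = _FileListToArrayRec($source_parent_dir, "*", $FLTAR_FILES, 1, $FLTAR_NOSORT, $FLTAR_FULLPATH)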

Edited by iamtheky



Doing a ReDim on the array for each item within that first loop has a cost too, and it should not be needed:

;get full file paths array for all files and in all dirs and sub dirs - this is taking very long
$files_array = _FileListToArrayRec($source_parent_dir, "*", $FLTAR_FILES, 1, $FLTAR_SORT, $FLTAR_FULLPATH)
Debug("pause 1")

Local $full_array[UBound($files_array)][4]

;i need the array to be in the following format for later script processing - this is taking very long
For $x = 1 To UBound($files_array) - 1
    $file_name_extension = StringRegExpReplace($files_array[$x], "^.*\\", "")
    $file_path_dir = StringReplace($files_array[$x], "\" & $file_name_extension, "")

    ;ReDim $full_array[UBound($full_array) + 1][4]

    $full_array[$x][0] = $file_path_dir
    $full_array[$x][1] = $file_name_extension
Next

 


5 minutes ago, Beege said:

Doing a ReDim on the array for each item within that first loop has a cost too, and it should not be needed.


Then how do I add another record to the array without doing that?


Just like I posted it should work. For that portion of code you already know how large the array is going to need to be, so it's better to create the whole array just once, then walk through it and fill in the elements.
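When the final count is not known in advance, a common alternative (just a rough sketch, the values are placeholders) is to grow the array in large chunks and trim it once at the end:

; Sketch: grow in chunks of 1000 rows instead of one ReDim per row, then trim once.
Local $iRows = 1 ; next free row (row 0 left unused, as above)
Local $aData[1000][4] ; start with room for 1000 rows (arbitrary chunk size)

For $i = 1 To 5000 ; stand-in for "for each file found"
    If $iRows >= UBound($aData) Then ReDim $aData[UBound($aData) + 1000][4] ; grow by a whole chunk
    $aData[$iRows][0] = "some dir" ; placeholder values
    $aData[$iRows][1] = "file " & $i
    $iRows += 1
Next

ReDim $aData[$iRows][4] ; trim the unused rows once at the end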

 


Wow, just that alone was incredibly faster, Beege.

Thank you sooo much for catching/suggesting it.

Playing with some of the other suggestions now.

Thank you everyone! =)


You'll need to change how this displays, but this is noticeably faster than using arrays.  Note also that I had to download the SQLite DLLs from https://www.autoitscript.com/autoit3/pkgmgr/sqlite/ and put the files in the script directory in order to get this to work, since my computer is behind a proxy.

 

My code:

#include <File.au3>
#include <Crypt.au3>
#include <Array.au3>
#include <SQLite.au3>
#include <SQLite.dll.au3>

_SQLite_Startup()
_SQLite_Open()
_SQLite_Exec(-1, "CREATE TABLE HashSums (Count, Sum, Path);")
_Crypt_Startup()


Local $source_parent_dir = "C:\"

Local $row, $count, $path
Local $files = _FileListToArrayRec($source_parent_dir, "*", $FLTAR_FILES, 1, $FLTAR_SORT, $FLTAR_FULLPATH)
For $i = 1 To $files[0]
    Local $sum = _Crypt_HashFile($files[$i], $CALG_SHA1)
    _SQLite_QuerySingleRow(-1, "SELECT Count, Path FROM HashSums WHERE Sum = '" & $sum & "';", $row)
    If @error Then
        $path = _SQLite_FastEscape(@LF & $files[$i])
        _SQLite_Exec(-1, "INSERT INTO HashSums (Count, Sum, Path) VALUES (1, '" & $sum & "', " & $path & ");")
    Else
        $count = Int($row[0])+1
        $path = _SQLite_FastEscape($row[1] & @LF & $files[$i])
        _SQLite_Exec(-1, "UPDATE HashSums SET Count = '" & $count & "', Path = " & $path & " WHERE SUM = '" & $sum & "';")
        If @error Then
            ConsoleWrite("@error = " & @error & @CRLF)
        EndIf
    EndIf
Next

Local $query
_SQLite_Query(-1, "SELECT Sum, Path FROM HashSums WHERE Count > 1;", $query)
While _SQLite_FetchData($query, $row) = $SQLITE_OK

    ; Change this to display however you wish...
    ConsoleWrite($row[0])
    ConsoleWrite($row[1] & @LF & @LF & @LF)

WEnd

_Crypt_Shutdown()
_SQLite_Shutdown()

 

Edited by mrider
Oops, left experimental path in place instead of the default "C:"



Of course it will take very long if you really want to hash every file on "C:\"!
Better to reduce the list of files to hash by using the filtering options of the _FileListToArrayRec function.

Then you need a fast comparison method to find existing matches.
Two nested For loops are very inefficient.
A Dictionary is one possible solution for this; mrider's SQLite solution goes in the same direction.

So here is my solution:

#include <File.au3>
#include <Crypt.au3>
#include <Array.au3>

Global $s_Path_Parent = "C:\programming\AutoIt"

Global $o_Hashes = ObjCreate("Scripting.Dictionary")
Global $o_DoubleHashes = ObjCreate("Scripting.Dictionary")
Global $s_Hash, $a_Temp, $s_File

Global $a_Files = _FileListToArrayRec($s_Path_Parent, "*.au3", 1, 1, 0, 2)

; Hash all files and create List of double files:
_Crypt_Startup()
For $i = 1 To $a_Files[0]
    $s_Hash = String(_Crypt_HashFile($a_Files[$i], $CALG_MD5))
    If $o_Hashes.Exists($s_Hash) Then
        $a_Temp = $o_Hashes($s_Hash)
        If UBound($a_Temp) = 1 Then $o_DoubleHashes($s_Hash) = 0
        _ArrayAdd($a_Temp, $a_Files[$i])
    Else
        Local $a_Temp[] = [$a_Files[$i]]
    EndIf
    $o_Hashes($s_Hash) = $a_Temp
Next
_Crypt_Shutdown()


; output the doubled files:
For $s_Hash in $o_DoubleHashes.Keys
   For $s_File in $o_Hashes($s_Hash)
      ConsoleWrite($s_File & @CRLF)
   Next
   ConsoleWrite(@CRLF)
Next

 

Edited by AspirinJunkie

Check out trancexx's file mapping examples and KaFu's solution of using hashes on parts of the files.
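The partial-hash idea in rough form (a sketch only, not KaFu's actual code; the 64 KB chunk size is just an example): hash the first part of each file as a cheap pre-filter, and only fall back to a full hash when those partial hashes collide.

#include <Crypt.au3>

; Sketch: hash only the first 64 KB of a file. Only files whose partial hashes
; match need a full-file hash afterwards. Call _Crypt_Startup() once beforehand
; when hashing many files.
Func _PartialHash($sFile, $iBytes = 65536)
    Local $hFile = FileOpen($sFile, 16) ; 16 = binary read mode
    If $hFile = -1 Then Return SetError(1, 0, "")
    Local $bData = FileRead($hFile, $iBytes) ; at most the first $iBytes bytes
    FileClose($hFile)
    Return String(_Crypt_HashData($bData, $CALG_SHA1))
EndFunc ;==>_PartialHash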



After finding out how much time the ReDims took, I went through and removed any extra arrays I was using and worked within the same array instead; much faster. I also took your advice, AspirinJunkie, and am only going through certain file types instead of *.

thank you all!

