Jump to content

Speeding up mass value comparison?


Quinch
 Share

Recommended Posts

I've got a small problem I could use some help with. I've put together a script that downloads a lot of small files, then compares them against a list to see which ones to keep. The list is simple enough, a twodimensional array consisting of a MD5 hash and filenames that match that hash. The script basically goes through an for-to loop comparing each value, but since the array will grow into tens or hundreds of thousands of entries, the one-by-one kind of approach is likely to turn into something that runs at a snail's pace. With that in mind, are there ways, preferably simple, to speed it up? Would loading the entire index as a single string variable and using stringinstr and similar commands to check for and edit values be faster?

Edited by Quinch
Link to comment
Share on other sites

Are the lists sorted by name?

My UDFs and Tutorials:

Spoiler

UDFs:
Active Directory (NEW 2022-02-19 - Version 1.6.1.0) - Download - General Help & Support - Example Scripts - Wiki
ExcelChart (2017-07-21 - Version 0.4.0.1) - Download - General Help & Support - Example Scripts
OutlookEX (2021-11-16 - Version 1.7.0.0) - Download - General Help & Support - Example Scripts - Wiki
OutlookEX_GUI (2021-04-13 - Version 1.4.0.0) - Download
Outlook Tools (2019-07-22 - Version 0.6.0.0) - Download - General Help & Support - Wiki
PowerPoint (2021-08-31 - Version 1.5.0.0) - Download - General Help & Support - Example Scripts - Wiki
Task Scheduler (NEW 2022-07-28 - Version 1.6.0.1) - Download - General Help & Support - Wiki

Standard UDFs:
Excel - Example Scripts - Wiki
Word - Wiki

Tutorials:
ADO - Wiki
WebDriver - Wiki

 

Link to comment
Share on other sites

Nope... although... if I sort the list by hash value, maybe I could go by a "by half" route - start looking at the middle, if the hash to be compared against is greater/smaller than the midpoint, split the remains, etc until the un-eliminated section to be searched is something reasonable.

That still leaves me looking for filenames, though - each hash has one or more filenames attached to it, and if a match is found, return the first filename associated with that hash, so the sort-and-split approach wouldn't work there.

I've attached both the script and an example of the list - the index is the result of a single page backup, and since the script is intended as a mid-to-long term solution, it's going to grow - a lot.

archiver.au3

MD5_index.txt

Link to comment
Share on other sites

  • Moderators

DicatoroftheUSA,

Have you looked inside the _FilelistToArray function? :)

Probably not or you would have seen: ;)

Func _FileListToArray($sPath, $sFilter = "*", $iFlag = 0)
    Local $hSearch, $sFile, $sFileList, $sDelim = "|"
    $sPath = StringRegExpReplace($sPath, "[/]+z", "") & "" ; ensure single trailing backslash
    If Not FileExists($sPath) Then Return SetError(1, 1, "")
    If StringRegExp($sFilter, "[/:><|]|(?s)As*z") Then Return SetError(2, 2, "")
    If Not ($iFlag = 0 Or $iFlag = 1 Or $iFlag = 2) Then Return SetError(3, 3, "")
    $hSearch = FileFindFirstFile($sPath & $sFilter) ; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
    If @error Then Return SetError(4, 4, "")
    While 1
        $sFile = FileFindNextFile($hSearch) ; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
        If @error Then ExitLoop
        If ($iFlag + @extended = 2) Then ContinueLoop
        $sFileList &= $sDelim & $sFile
    WEnd
    FileClose($hSearch)
    If Not $sFileList Then Return SetError(4, 4, "")
    Return StringSplit(StringTrimLeft($sFileList, 1), "|")
EndFunc   ;==>_FileListToArray

How else did you think it worked? ;)

M23

Public_Domain.png.2d871819fcb9957cf44f4514551a2935.png Any of my own code posted anywhere on the forum is available for use by others without any restriction of any kind

Open spoiler to see my UDFs:

Spoiler

ArrayMultiColSort ---- Sort arrays on multiple columns
ChooseFileFolder ---- Single and multiple selections from specified path treeview listing
Date_Time_Convert -- Easily convert date/time formats, including the language used
ExtMsgBox --------- A highly customisable replacement for MsgBox
GUIExtender -------- Extend and retract multiple sections within a GUI
GUIFrame ---------- Subdivide GUIs into many adjustable frames
GUIListViewEx ------- Insert, delete, move, drag, sort, edit and colour ListView items
GUITreeViewEx ------ Check/clear parent and child checkboxes in a TreeView
Marquee ----------- Scrolling tickertape GUIs
NoFocusLines ------- Remove the dotted focus lines from buttons, sliders, radios and checkboxes
Notify ------------- Small notifications on the edge of the display
Scrollbars ----------Automatically sized scrollbars with a single command
StringSize ---------- Automatically size controls to fit text
Toast -------------- Small GUIs which pop out of the notification area

 

Link to comment
Share on other sites

That works similarly to something I have been doing recently - I was using the pipe as a delimiter and splitting at that to make my arrays.

I would suggest something like that. It doesn't really matter if it doesn't look nice in the archive, does it? You could just have "FileName|MD5|AnotherFileName|MD5"

Then you could create the array based on stringsplitting by the pipe ( | - that is a pipe, if you dont know.)

Link to comment
Share on other sites

Even though the MD5's are not unique, your filenames are, so sorting by name, as Water mentioned, would allow you to do binary searches. Beyond that, you could split your huge list into parts, for instance, 27 arrays, A-Z + non-alpha, and search the smaller arrays depending on the first character of the filename. Or, you could always go with using a database.

Come to think of it, you could probably also "cheat" and use Yashied's Assign/IsDeclared route which is lightning fast. Assign your filenames as the variable name (with some sort of appended prefix or suffix to avoid conflicts with "real" variables) and set the MD5 as the value of the dummy variable. Maybe something like:

Global $array[5][2] = [[4,0],["Acrobat.exe", 3333333333],["Explorer.exe", 7777777777],["Windows.exe", 1234567890],["Winzip.exe", 99999999999]]
Global $filename, $filehash

For $x = 1 to $array[0][0] ; load array to variables
    Assign("__" & $array[$x][0], $array[$x][1])
Next

$filename = "Explorer.exe"
$filehash = 6666666666
MsgBox(0,"","Found = " & Search())

$filename = "Windows.exe"
$filehash = 1234567890
MsgBox(0,"","Found = " & Search())

Func Search()
    If IsDeclared("__" & $filename) And Eval("__" & $filename) = $filehash Then Return 1
EndFunc
Link to comment
Share on other sites

Quinch,

Look around for ScriptingDictionary. It should do the job efficiently for your purpose. Of course, use the hash as key and filename as value.

Another option would be to use SQLite for that but it may be a bit slower and less easy to use for such a simple application if it's used once. The advantage of an SQLite database is that you can grow your database as much as you need without (essentially) any performance penalty until you reach an undecent number of hash/filename couples.

This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)

Link to comment
Share on other sites

  • Moderators

DicatoroftheUSA,

Especially if re-dim is required

Now there you are quite correct! :)

But arrays are very useful and there are ways to reduce the number of ReDims you need - even if you do not know how big your array will become. Take a look at the final example in the Recursion tutorial in the Wiki. ;)

M23

Public_Domain.png.2d871819fcb9957cf44f4514551a2935.png Any of my own code posted anywhere on the forum is available for use by others without any restriction of any kind

Open spoiler to see my UDFs:

Spoiler

ArrayMultiColSort ---- Sort arrays on multiple columns
ChooseFileFolder ---- Single and multiple selections from specified path treeview listing
Date_Time_Convert -- Easily convert date/time formats, including the language used
ExtMsgBox --------- A highly customisable replacement for MsgBox
GUIExtender -------- Extend and retract multiple sections within a GUI
GUIFrame ---------- Subdivide GUIs into many adjustable frames
GUIListViewEx ------- Insert, delete, move, drag, sort, edit and colour ListView items
GUITreeViewEx ------ Check/clear parent and child checkboxes in a TreeView
Marquee ----------- Scrolling tickertape GUIs
NoFocusLines ------- Remove the dotted focus lines from buttons, sliders, radios and checkboxes
Notify ------------- Small notifications on the edge of the display
Scrollbars ----------Automatically sized scrollbars with a single command
StringSize ---------- Automatically size controls to fit text
Toast -------------- Small GUIs which pop out of the notification area

 

Link to comment
Share on other sites

Well... crap. That's one step forward and two steps back. I've been looking at some of these filenames and it's pretty obvious that they can't be the one and the same file. You're right - MD5 hashes are't unique - or unique enough, at least - to reliably use as a check if two files are identical in content.

Which brings me to a new question - short of comparing file contents outright, what would be? Would a layered approach work? For example, check for filesize matches, then check those for MD5 if a match is found? When dealing with thousands of files mostly under 10Kb, what would the reliability rate be between those two, offhand?

Link to comment
Share on other sites

If you're using a 2D array, you could sort the array on the "column" that you want to search in and use the _Array.au3 file in my signature, which includes a 2D version of _ArrayBinarySearch which would greatly speed up the searching process.

If I posted any code, assume that code was written using the latest release version unless stated otherwise. Also, if it doesn't work on XP I can't help with that because I don't have access to XP, and I'm not going to.
Give a programmer the correct code and he can do his work for a day. Teach a programmer to debug and he can do his work for a lifetime - by Chirag Gude
How to ask questions the smart way!

I hereby grant any person the right to use any code I post, that I am the original author of, on the autoitscript.com forums, unless I've specifically stated otherwise in the code or the thread post. If you do use my code all I ask, as a courtesy, is to make note of where you got it from.

Back up and restore Windows user files _Array.au3 - Modified array functions that include support for 2D arrays.  -  ColorChooser - An add-on for SciTE that pops up a color dialog so you can select and paste a color code into a script.  -  Customizable Splashscreen GUI w/Progress Bar - Create a custom "splash screen" GUI with a progress bar and custom label.  -  _FileGetProperty - Retrieve the properties of a file  -  SciTE Toolbar - A toolbar demo for use with the SciTE editor  -  GUIRegisterMsg demo - Demo script to show how to use the Windows messages to interact with controls and your GUI.  -   Latin Square password generator

Link to comment
Share on other sites

I beg your pardon. Do any of you guys seriously expect MD5 collisions on random 10k files? I mean spurious MD5 collisions on non-cooked data/executable, real-world files?

Can you exhibit one non_cooked example (just curious)?

Hint: that would be a world record!

This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)

Link to comment
Share on other sites

Yes, but what I mean is I personally tend to avoid using arrays entirely.

Learn when, and when not, to use them and they will serve you well, young ... erm.

"Straight_and_Crooked_Thinking" : A "classic guide to ferreting out untruths, half-truths, and other distortions of facts in political and social discussions."
"The Secrets of Quantum Physics" : New and excellent 2 part documentary on Quantum Physics by Jim Al-Khalili. (Dec 2014)

"Believing what you know ain't so" ...

Knock Knock ...
 

Link to comment
Share on other sites

I beg your pardon. Do any of you guys seriously expect MD5 collisions on random 10k files? I mean spurious MD5 collisions on non-cooked data/executable, real-world files?

Not sure if sarcastic {of if I'm not awake enough to read this right} but check the MD5 index file attached a few posts up - there's a couple of MD5 entries that contain files in both gif, jog and png format, so... I guess yes?

Link to comment
Share on other sites

No sarcastic at all. To be honest, I didn't look at your index file before.

Now it's just plain impossible to collect as many _spurious_ MD5 collision on distinct _non-cooked_ binary files as your index file shows.

What that suggests is that there must be something wrong in the MD5s you obtain.

Can you post a couple of distinct binary files colliding (please no cheating here)?

Please get me right: I'm not saying that MD5 collisions can't be computed, but that requires very (and I mean VERY) special conditions on both input files (and then you're not free to choose the resulting MD5).

The odds of random (or so, like jpg, png) 10kb binary files colliding spuriously while being internally completely distinct are unchanged and still much, much less than the odds of having a neutrino interact with your PC and changing a single bit somewhere. And the latter odds are so small that you can bet you life safely on it.

What is commonly known is that given an input file A with MD5 hash M[A], it only takes seconds to compute another distinct file B with M[A] = M[A].

That's what makes MD5 unsecure for any cryptographic purposes: an opponent may _compute_ a distinct file and exhibit it, claiming it's the original. Turning the counterfeit file with same MD5 into a meaningful file is quite another task which hasn't yet be fully addressed unless we talk 5 to (say) 20 letters passwords or allow for unlimited special characters in text files, for instance. Short inputs like passwords are being more and more tabulated by rainbow tables.

What most people don't get is that the odds of a "natural" collision between two _meaningful_ distinct, real-world files (say images) A and B is still extroardinary low. What I say implies we don't put ourselves in a context where we can expect determined opponents, which I understand isn't your case.

Files in 10kb range produce MD5 hashes that noone on Earth can distinguish from 10Mb files or greater. In practice, your result is a random MD5 byte sequence. Cryptanalysis of MD5 estimate the practical number of possible MD5 to be somewhere in the 2120 to 2127 range. Let's be overly conservative and only estimate the possible number of hashes to be 2100.

Comparing 100 billion of random files isn't likely to cause a single collision in any reasonable time framework. 2100 is nothing less than 1,267,650,600,228,229,401,496,703,205,376 which means we still have a very safe margin before the odds of a natural collision occuring tomorrow exceed the odds of the Sun going nova in the same time.

Your situation (comparing contents of image files) boils down to the above, which is called a "birthday attack": accumulate enough real-world data hashes to find a match.

Note that no hash standard is 100% immune to such "spurious collide" event, but the longer the hash is bitwise, the smaller the odds are. Anyway, the 128-bit MD5 hash is still perfectly useable for routine file compare with excellent safety, of course unless a malicious hand can get in the process.

Edit: I fought a long time against unwanted bold type, sorry!

Edited by jchd

This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)

Link to comment
Share on other sites

Well, that was definitely an educational post. Anyway, I'll check and send again when I get the chance - I'm at work right now, so it's not an option. If it's of consequence, I'm using the built-in _Crypt_HashData UFD for the hashing.

Link to comment
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now
 Share

  • Recently Browsing   0 members

    • No registered users viewing this page.
×
×
  • Create New...