Jump to content

Which is the fastest method to compare two array


rootx
 Share

Recommended Posts

Sorry LarsJ, I didn't look closely enough at your code. I thought that as soon as there's a mismatch you can  declare the arrays as not being equal. If that is not the case then I apologize. Otherwise I don't understand how binary search could ever be faster because it requires multiple comparisons for each element instead of just one . . . $aArr1[$i] = $aArr2[$i]. You are clearly not making single comparisons in either algorithm. I guess I misunderstood the question.

Edited by czardas
Link to comment
Share on other sites

The purpose is to find all differences between the two arrays.

Explanation for everybody. $aAr2 is not a random array. 100 elements are copied from $aAr1. And there are 100 rows in $aAr1 between each element.

Because there are 10,000 elements in $aAr1 I need no more than 14 comparisons (2^14 = 16,384) in a binary search to find a specific element. To find all 100 elements I need no more than 1,400 comparisons in a binary search. And the binary search is further optimized because both arrays are sorted. This means an even smaller number of comparisons.

Because there are 100 rows in $aAr1 between each element which is copied to $aAr2, I need exactly 100 new comparisons in a linear scan to find the next element from $aAr2 in $aAr1. To find all 100 elements I need 100 * 100 = 10,000 comparisons which exactly matches the number of rows in $aAr1. Again it's utilized that both arrays are sorted.

This can be verified by removing the comment character in front of the ConsoleWrites.

Because the compare command, StringCompare( $s, $aAr1[$mi] ), is the most time consuming command in the code the fastest method is approximately decided by the number of comparisons. Since the binary search has the fewest number of comparisons it wins.

Edited by LarsJ
Added first line
Link to comment
Share on other sites

ftr:  powershell's compare-object is retarded fast, I have no idea what it is doing.

,-. .--. ________ .-. .-. ,---. ,-. .-. .-. .-.
|(| / /\ \ |\ /| |__ __||| | | || .-' | |/ / \ \_/ )/
(_) / /__\ \ |(\ / | )| | | `-' | | `-. | | / __ \ (_)
| | | __ | (_)\/ | (_) | | .-. | | .-' | | \ |__| ) (
| | | | |)| | \ / | | | | | |)| | `--. | |) \ | |
`-' |_| (_) | |\/| | `-' /( (_)/( __.' |((_)-' /(_|
'-' '-' (__) (__) (_) (__)

Link to comment
Share on other sites

I think it's necessary to use StringCompare to handle special national characters and unicode strings. _ArraySort is implemented with StringCompare. But _ArrayBinarySearch is coded with the simple comparison operators. I think it should have been StringCompare to be able to handle all strings.

Link to comment
Share on other sites

@LarsJ,

You still read me as if I recommended a 100 * 100 nested loop, which is absolutely not the idea of a linear scan of both inpurt arrays in parallel.

I place myself in the OP case of comparing presumably comparable directories to find out differences (files in X but not in Y or vice-versa).

Obviously when inputs are not expectedly  and essentially comparable, things depend on other considerations. I'll try to put an example together soon.

 

This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)

Link to comment
Share on other sites

Fun to analyse my harddrive (its a mess of duplicates ;-)) directly finding all duplicates 

Based on

  • Dir /b /s                       (seems to be faster then findfirst/filefindnext logic. and certainly faster then powershell)
  • 1 array
  • On windows folder > 100k of files
  • many subfolders
  • Many duplicates I found on my system
  • < 20 seconds for windows folder
Dir data received at: 5862.30989681335
Dir data read out at: 5898.1037270181
Dir going to regex  : 5898.92918516491
Dir data regex done : 6310.39039228707
split done : 6826.75232951424
sort done on 1 d array: 12907.4355269818
Dir data written to new file: 14385.3360269644
Read back from file to 2d array: 15883.6121557265
...
Compare finished  : 16682.6079798966
Compared filecount: 132580
Duplicates        : 59086
  • To have it working for:
    • C-drive  Local $sFilePath = "C:\" ; Search the C directory. Normally quite some files there
  • Issues
    • Breaks on semicolons in name (due to reading back in array no 2d array possible),
    • So far was unable to get it working with tab. "$2;$1$3"as regex not working "$2\t$1$3"
    • Going from 1d array to 2 d array with file to drive in between. Unsure if there is a better way
    • Version with dir /s is more complicated but would be more elegant by comparing date, time, size also
#include <array.au3>
#include <file.au3>
#include <FileConstants.au3>
#include <MsgBoxConstants.au3>
#include <WinAPIFiles.au3>
example1()

Func Example1()
;~     Local $sFilePath = @ScriptDir ; Search the current script directory.
    Local $sFilePath = @WindowsDir; Search the windows directory. Normally quite some files there

    Local $sFilter = "*.*" ; Search for all files in the current directory. For a list of valid wildcards, search for 'Wildcards' in the Help file.

    local $newString, $outline, $arrFileInfo
    
    Local $hTimer = TimerInit()

    ; If the file path isn't a directory then return from the 'Example' function.
    If Not StringInStr(FileGetAttrib($sFilePath), "D") Then
        Return SetError(1, 0, 0)
    EndIf

    ; Remove trailing backslashes and append a single trailing backslash.
    $sFilePath = StringRegExpReplace($sFilePath, "[\\/]+\z", "") & "\"

    #cs
        Commandline parameters for DIR:
        /B - Simple output.
        /A-D - Search for all files, minus folders.
        /S - Search within subfolders.
    #ce
    Local $iPID = Run(@ComSpec & ' /C DIR "' & $sFilePath & $sFilter & '" /A-D /B /S', $sFilePath, @SW_HIDE, $STDOUT_CHILD)

    ; Wait until the process has closed using the PID returned by Run.
    ProcessWaitClose($iPID)

    consolewrite("Dir data received at: " & TimerDiff($hTimer) & @CRLF)
    
    ; Read the Stdout stream of the PID returned by Run. This can also be done in a while loop. Look at the example for StderrRead.
    Local $sOutput = StdoutRead($iPID)

    consolewrite("Dir data read out at: " & TimerDiff($hTimer) & @CRLF)

    ; Create a constant variable in Local scope of the filepath that will be read/written to.
    Local Const $sFilePath2 = _WinAPI_GetTempFileName(@TempDir)
    
    ; Open the file for writing (append to the end of a file) and store the handle to a variable.
    Local $hFileOpen = FileOpen($sFilePath2, $FO_APPEND)
    If $hFileOpen = -1 Then
        MsgBox($MB_SYSTEMMODAL, "", "An error occurred whilst writing the temporary file.")
        Return False
    EndIf

    consolewrite("Dir going to regex  : " & TimerDiff($hTimer) & @CRLF)

    ;Make a colon separated string file     . filename first
    $newString=$sOutput
    $newString= StringRegExpReplace($newString,"(.*\\)(.*)(\r\n)","$2;$1$3")
    
    consolewrite("Dir data regex done : " & TimerDiff($hTimer) & @CRLF)
    local $aRetArray
    $aRetArray=stringsplit($newString, @CRLF, $STR_ENTIRESPLIT)
;~  dELETE RECORD 0
    _ArrayDelete($aRetArray, 0)

    consolewrite("split done : " & TimerDiff($hTimer) & @CRLF)
    _arraysort($aRetArray)
    consolewrite("sort done : " & TimerDiff($hTimer) & @CRLF)

;~  Change from 1 d to 2 d array
    _FileWriteFromArray($hFileOpen,$aRetArray,1)
    FileClose($hFileOpen)
;~  filewrite($hFileOpen,$newString)
    consolewrite("Dir data written to new file: " & TimerDiff($hTimer) & @CRLF)
;~ consolewrite($sfilepath2 & @CRLF)

    _filereadtoarray($sFilePath2,$aRetArray,$FRTA_NOCOUNT ,";")
;~  consolewrite(@Error & @CRLF)
;~  _arraydisplay($aRetArray,"info","100")
    consolewrite("Read back from file : " & TimerDiff($hTimer) & @CRLF)

;~ Now do the compare
    local $dupcount
    for $i=0 to ubound($aRetArray)-2
        if ($aRetArray[$i][0]=$aRetArray[$i+1][0]) Then
            $dupcount=$dupcount+1
            consolewrite("Duplicate found: " & $aRetArray[$i][0] & " ;folder a: ;" & $aRetArray[$i][1] & " ;folder b: ;" & $aRetArray[$i+1][1]& @CRLF)
        EndIf
    Next
    consolewrite("Compare finished  : " & TimerDiff($hTimer) & @CRLF)
    consolewrite("Compared filecount: " & ubound($aRetArray) & @CRLF)
    consolewrite("Duplicates        : " & $dupCount & @CRLF)
    
endfunc

 

Link to comment
Share on other sites

9 hours ago, junkew said:

So far was unable to get it working with tab. "$2;$1$3"as regex not working "$2\t$1$3"

Replace can be done with: "$2" & @TAB é "$1$3"

This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)

Link to comment
Share on other sites

The "replace" part of StringRegexReplace isn't in any way related nor handled by the PCRE library. That's why only $1, $2 ... (or \1, \2, ...) are recognized there.

This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)

Link to comment
Share on other sites

jchd, Please show some code.

The linear scan in post 18 is just my own simple implementation to get an idea of how fast the code is compared to a binary search. The results shows that the binary search is 3 times as fast as the linear scan.

If you ConsoleWrite $i and $j indices in the linear scan you'll see that they both runs from zero to number of rows in the arrays. The number of loops matches exactly the number of rows. Which is characteristic for a linear scan.

If it's possible to implement some kind of a linear scan that is faster than the binary search I would like to see the code.


In the code in post 18 all numbers are more or less multiples of 100. They are easy to check in SciTE console.

Please replace line 28:

$aAr2[$i] = $aAr1[$i*100+99]

with this line:

$aAr2[$i] = $aAr1[$i*100+($i<99?Random(55,145,1):99)]

This change will not affect the results at all. Now only array sizes are multiples of 100.

Link to comment
Share on other sites

Please tell jchd. There were no such exceptions in either post 14 or 16.

Edited by LarsJ
Link to comment
Share on other sites

Then maybe another little test script can help.
Just play with the both values $N1 and $N2 (the array sizes):

#include <Math.au3>
#include <Array.au3>

; ------------------- global settings
Global $N1 = 100, $N2 = 1000, $iT

; ------------------- create two random arrays
Global $aTest1[$N1], $aTest2[$N2]
For $i = 0 To $N1-1
    $aTest1[$i] = Random(0, _Max($N1,$N2), 1)
Next
For $i = 0 To $N2-1
    $aTest2[$i] = Random(0, _Max($N1,$N2), 1)
Next
_ArraySort($aTest1)
_ArraySort($aTest2)

; ------------------- paralel linear scan
$iT = TimerInit()
Global $i = 0, $j = 0
While ($i < $N1) And ($j < $N2)
    If $aTest1[$i] < $aTest2[$j] Then
        $i += 1
    ElseIf $aTest1[$i] > $aTest2[$j] Then
        $j += 1
    Else
;~      ConsoleWrite($i & @TAB & $j & @TAB & $aTest1[$i] & @TAB & $aTest2[$j] & @CRLF)
        $i += 1
    EndIf
WEnd
$iT = TimerDiff($iT)
ConsoleWrite(StringFormat("% 20s: %8.3f ms\n", "parallel linear scan", $iT))

; ------------------- binary search
$iT = TimerInit()
For $i = 0 To $N1 -1
    $j = _ArrayBinarySearch($aTest2, $aTest1[$i])
    If $j <> -1 Then
;~      ConsoleWrite($i & @TAB & $j & @TAB & $aTest1[$i] & @TAB & $aTest2[$j] & @CRLF)
    EndIf
Next
$iT = TimerDiff($iT)
ConsoleWrite(StringFormat("% 20s: %8.3f ms\n", "Binary search", $iT))

 

Link to comment
Share on other sites

Your code shows that a linear scan is always faster. That was not what you wrote in post 32. What do you mean?

Link to comment
Share on other sites

Read again:

1 hour ago, AspirinJunkie said:

play with the both values $N1 and $N2 (the array sizes):

That's what i wrote.

For example try this first: $N1 = 1000, $N2 = 1000 and then $N1 = 100, $N2 = 10000.
And you will see that the code shows that it depends on the relative array sizes.

Edit: If the arrays aren't already sorted then the dictionary-variant is probably the best option:

#include <Math.au3>
#include <Array.au3>

; ------------------- global settings
Global $N1 = 1000, $N2 = 1000, $iT

; ------------------- create two random arrays
Global $aTest1[$N1], $aTest2[$N2]
For $i = 0 To $N1-1
    $aTest1[$i] = Random(0, _Max($N1,$N2), 1)
Next
For $i = 0 To $N2-1
    $aTest2[$i] = Random(0, _Max($N1,$N2), 1)
Next
Global $iT_Sort = TimerInit()
_ArraySort($aTest1)
_ArraySort($aTest2)
$iT_Sort = TimerDiff($iT_Sort)

; ------------------- paralel linear scan
$iT = TimerInit()
Global $i = 0, $j = 0
While ($i < $N1) And ($j < $N2)
    If $aTest1[$i] < $aTest2[$j] Then
        $i += 1
    ElseIf $aTest1[$i] > $aTest2[$j] Then
        $j += 1
    Else
;~      ConsoleWrite($aTest1[$i] & @CRLF)
        $i += 1
    EndIf
WEnd
$iT = TimerDiff($iT)
ConsoleWrite(StringFormat("% 20s: %8.3f ms\n", "parallel linear scan", $iT+$iT_Sort))

; ------------------- binary search
$iT = TimerInit()
For $i = 0 To $N1 -1
    $j = _ArrayBinarySearch($aTest2, $aTest1[$i])
    If $j <> -1 Then
;~      ConsoleWrite($aTest1[$i] & @CRLF)
    EndIf
Next
$iT = TimerDiff($iT)
ConsoleWrite(StringFormat("% 20s: %8.3f ms\n", "Binary search", $iT+$iT_Sort))

; ------------------- dictionary search
$iT = TimerInit()
Global $o_Dic = ObjCreate("Scripting.Dictionary")
For $j In $aTest2
    $o_Dic($j) = ""
Next
For $i In $aTest1
    If $o_Dic.Exists($i) Then
;~      ConsoleWrite($i & @CRLF)
    EndIf
Next
$iT = TimerDiff($iT)
ConsoleWrite(StringFormat("% 20s: %8.3f ms\n", "Dictionary", $iT))

 

Edited by AspirinJunkie
Link to comment
Share on other sites

Well guys, I am happy that the tread interests you a lot ...

thx AspirinJunkie

I have implemented your code, parameterizing the cycles and the length of the string to compare, you can stress your CPU and see who will win.

#include <Math.au3>
#include <Array.au3>
#include <MsgBoxConstants.au3>
#include <ButtonConstants.au3>
#include <GuiButton.au3>
#include <GUIConstantsEx.au3>
#include <WindowsConstants.au3>
#include <INet.au3>
#include <File.au3>
#include <EditConstants.au3>
#include <Date.au3>
#include <GuiStatusBar.au3>
#include <IE.au3>
#include <WinAPI.au3>
#include <StringConstants.au3>
; ------------------- global settings


Local $Arr1[0]
Local $Arr2[0]

Global $iT,$iT2,$aTest2,$aTest1
Global $N1=1000 ;cilcle
Global $Length = 5 ;sting length


For $c = 1 to $N1
    _ArrayAdd($Arr1, _RString($Length))
    _ArrayAdd($Arr2, _RString($Length))
Next
_ArrayDisplay($Arr1)
_ArrayDisplay($Arr2)
$aTest1 = $Arr1
$aTest2 = $Arr2

_ArraySort($aTest1)
_ArraySort($aTest2)


; ------------------- paralel linear scan
$iT = TimerInit()
Global $i = 0, $j = 0
While ($i < $N1) And ($j < $N1)
    If $aTest1[$i] < $aTest2[$j] Then
        $i += 1
    ElseIf $aTest1[$i] > $aTest2[$j] Then
        $j += 1
    Else
;~      ConsoleWrite($i & @TAB & $j & @TAB & $aTest1[$i] & @TAB & $aTest2[$j] & @CRLF)
        $i += 1
    EndIf
WEnd
$iT = TimerDiff($iT)
ConsoleWrite(StringFormat("% 20s: %8.3f ms\n", "parallel linear scan", $iT))

; ------------------- binary search
$iT2 = TimerInit()
For $i = 0 To $N1 -1
    $j = _ArrayBinarySearch($aTest2, $aTest1[$i])
    If $j <> -1 Then
;~      ConsoleWrite($i & @TAB & $j & @TAB & $aTest1[$i] & @TAB & $aTest2[$j] & @CRLF)
    EndIf
Next
$iT2 = TimerDiff($iT2)
ConsoleWrite(StringFormat("% 20s: %8.3f ms\n", "Binary search", $iT2))

If Number($iT) < Number($iT2) Then
    MsgBox("","","The winner is "&@CRLF&"parallel linear scan")
Else
    MsgBox("","","The winner is "&@CRLF&"Binary search")
EndIf


Func _RString($iLength)
    $aChars = StringSplit("ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789", "")
    $sString = ""
    While $iLength > StringLen($sString)
        If $sString < "A" Then $sString = ""
        $sString &= $aChars[Random(1, $aChars[0],1)]
    WEnd
    Return $sString
EndFunc

 

Edited by rootx
Link to comment
Share on other sites

Once you get to the end of the smallest either array, the job is finished because both arrays are alphabetically sorted and already unique. You only have to search for the next value which is greater or equal to the current test element. Once you overstep the value in one array, start searching for the greater value in the other array. Repeat the process (switching arrays) until you reach the end of one array. This parallel linear search method ought to be efficient in particular cases.

Edited by czardas
Link to comment
Share on other sites

33 minutes ago, czardas said:

Once you get to the end of either array, the job is finished because both arrays are alphabetically sorted and already unique.

I'm not sure if this was adressed to me but this condition is my while condition.
So mabye i didn't understand you or i was not the right recipient of your message.

Link to comment
Share on other sites

In the linear scan in my code I need 10,000 comparisons. If I implement the linear scan with the code by AspirinJunkie I also need 10,000 comparisons. Why always 10,000 comparisons? Because in the code in post 18 I've carefully selected the last element in $aAr2 (the small array) as the last element in $aAr1. That way, I assured that the linear scan always required 10,000 comparisons no matter how it's implemented. Since the binary search need no more than 1,400 comparisons it's faster.

I've already spent too much time on this thread. So I think this is my last post.

Link to comment
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now
 Share

×
×
  • Create New...