_ArrayUniqueConcatenate


czardas
After some insightful input by kylomas in this topic, I decided to make a general reusable function using the same ideas. It handles up to 24 arrays. Zero-based arrays go in, and the target array is returned ByRef. Before passing arrays to this function, you need to delete element 0 if it contains the item count, and you need to use UBound to get the new size of the target array after using the function. If processing large arrays, it is a good idea to delete the arrays you no longer need after the concatenation.

The function returns the number of removed duplicates. Set case sensitivity using the second parameter: 0 = case insensitive, 1 = case sensitive.

;

Func _ArrayUniqueConcatenate(ByRef $aTarget, $iCasesense = 0, _ ; up to 23 more arrays can be included
    $a0 = 0, $a1 = 0, $a2 = 0, $a3 = 0, $a4 = 0, $a5 = 0, $a6 = 0, $a7 = 0, $a8 = 0, $a9 = 0, $a10 = 0, $a11 = 0, _
    $a12 = 0, $a13 = 0, $a14 = 0, $a15 = 0, $a16 = 0, $a17 = 0, $a18 = 0, $a19 = 0, $a20 = 0, $a21 = 0, $a22 = 0)

    #forceref $a0, $a1, $a2, $a3, $a4, $a5, $a6, $a7, $a8, $a9, $a10, $a11, $a12, $a13, $a14, $a15, $a16, $a17, $a18, $a19, $a20, $a21, $a22
    If Not IsArray($aTarget) Or UBound($aTarget, 0) <> 1 Then Return SetError(1)

    Local $iTotalSize = UBound($aTarget), $iItems = 0, $tVarName
    If $iCasesense Then
        For $i = 0 To $iTotalSize -1
            $tVarName = "_" & StringToBinary($aTarget[$i], 2)
            If IsDeclared($tVarName) = -1 Then ContinueLoop

            Assign($tVarName, "", 1)
            $aTarget[$iItems] = $aTarget[$i]
            $iItems += 1
        Next

    Else
        For $i = 0 To $iTotalSize -1
            $tVarName = "_" & StringToBinary(StringLower($aTarget[$i]), 2)
            If IsDeclared($tVarName) = -1 Then ContinueLoop

            Assign($tVarName, "", 1)
            $aTarget[$iItems] = $aTarget[$i]
            $iItems += 1
        Next
    EndIf

    Local $iParams = @NumParams
    If $iParams > 2 Then
        Local $aNextArray, $iBound
        For $i = 0 To $iParams -3
            $aNextArray = Eval('a' & $i)
            If Not IsArray($aNextArray) Or UBound($aNextArray, 0) <> 1 Then Return SetError(2, $i +3) ; Sets @Extended to the parameter which failed
            $iBound = UBound($aNextArray)

            $iTotalSize += $iBound
            ReDim $aTarget[$iItems + $iBound]

            If $iCasesense Then
                For $j = 0 To $iBound -1
                    $tVarName = "_" & StringToBinary($aNextArray[$j], 2)
                    If IsDeclared($tVarName) = -1 Then ContinueLoop

                    Assign($tVarName, "", 1)
                    $aTarget[$iItems] = $aNextArray[$j]
                    $iItems += 1
                Next

            Else
                For $j = 0 To $iBound -1
                    $tVarName = "_" & StringToBinary(StringLower($aNextArray[$j]), 2)
                    If IsDeclared($tVarName) = -1 Then ContinueLoop

                    Assign($tVarName, "", 1)
                    $aTarget[$iItems] = $aNextArray[$j]
                    $iItems += 1
                Next
            EndIf
            Execute('_FreeMemory($a' & $i & ')')
        Next
    EndIf
    ReDim $aTarget[$iItems]

    Return $iTotalSize - $iItems ; Return the number of duplicates removed
EndFunc ; _ArrayUniqueConcatenate

Func _FreeMemory(ByRef $vParam)
    $vParam = 0
EndFunc

;
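A minimal usage sketch of the steps described above, assuming _ArrayUniqueConcatenate() and _FreeMemory() from this post are pasted into the same script. The sample arrays are made up for illustration. With this data the call should report 2 duplicates removed and leave 4 items in the target array.

```autoit
#include <Array.au3>

; Hypothetical sample data: the first array carries its item count in element 0.
Local $aFirst[4] = [3, "apple", "Banana", "cherry"]
Local $aSecond[3] = ["APPLE", "banana", "date"]

; Delete the count first, otherwise it is treated as an ordinary item.
_ArrayDelete($aFirst, 0)

; Case-insensitive merge (second parameter = 0): "APPLE" and "banana" count as duplicates.
Local $iRemoved = _ArrayUniqueConcatenate($aFirst, 0, $aSecond)
If @error Then
    ConsoleWrite("Failed: @error = " & @error & ", @extended = " & @extended & @LF)
Else
    ConsoleWrite("Duplicates removed = " & $iRemoved & @LF)
    ; The target array was resized ByRef, so ask UBound for its new size.
    ConsoleWrite("Unique items = " & UBound($aFirst) & @LF)
EndIf
```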

In the following test, after randomly filling 24 arrays of 50000 elements (each with 2 ASCII characters), the function searches (case insensitive) through 1200000 elements, removing all duplicates in just a few seconds. Filling the arrays takes a few seconds to begin with (watch the SciTE console). It should hit the expected limit of 38416 possible case-insensitive 2-character combinations and remove 1161584 duplicates. It takes about 12 seconds on my machine. It also works with Unicode.

;

#include <Array.au3>
#include <String.au3>

Global $a1[50000], $a2[50000], $a3[50000], $a4[50000], $a5[50000], $a6[50000], $a7[50000], $a8[50000], _
$a9[50000], $a10[50000], $a11[50000], $a12[50000], $a13[50000], $a14[50000], $a15[50000], $a16[50000], _
$a17[50000], $a18[50000], $a19[50000], $a20[50000], $a21[50000], $a22[50000], $a23[50000], $a24[50000]

ConsoleWrite("Populating Arrays" & @LF)
For $i = 1 To 24
    Execute('_Fill($a' & $i & ')')
Next

ConsoleWrite("Starting Timer" & @LF)
Local $iTimer = TimerInit()
Local $ret = _ArrayUniqueConcatenate($a1, 0, $a2, $a3, $a4, $a5, $a6, $a7, $a8, $a9, $a10, $a11, $a12, $a13, $a14, $a15, $a16, $a17, $a18, $a19, $a20, $a21, $a22, $a23, $a24)
ConsoleWrite("Error = " & @error & @lf & "Seconds = " & TimerDiff($iTimer)/1000 & @LF & "Unique Items = " & UBound($a1) & @LF & "Duplicates removed = " & $ret & @LF)

For $i = 2 To 24
    Execute('_FreeMemory($a' & $i & ')')
Next

_ArrayDisplay($a1)

Func _FreeMemory(ByRef $vParam)
    $vParam = 0
EndFunc

Func _Fill(ByRef $aArray)
    For $i = 0 To UBound($aArray) -1
        $aArray[$i] = _HexToString(_RandomHexStr(4))
    Next
EndFunc

Func _RandomHexStr($sLen)
    Local $sHexString = ""
    For $i = 1 To $sLen
        $sHexString &= StringRight(Hex(Random(0, 15, 1)), 1)
    Next
    Return $sHexString
EndFunc ;==> _RandomHexStr

Func _ArrayUniqueConcatenate(ByRef $aTarget, $iCasesense = 0, _ ; up to 23 more arrays can be included
    $a0 = 0, $a1 = 0, $a2 = 0, $a3 = 0, $a4 = 0, $a5 = 0, $a6 = 0, $a7 = 0, $a8 = 0, $a9 = 0, $a10 = 0, $a11 = 0, _
    $a12 = 0, $a13 = 0, $a14 = 0, $a15 = 0, $a16 = 0, $a17 = 0, $a18 = 0, $a19 = 0, $a20 = 0, $a21 = 0, $a22 = 0)

    #forceref $a0, $a1, $a2, $a3, $a4, $a5, $a6, $a7, $a8, $a9, $a10, $a11, $a12, $a13, $a14, $a15, $a16, $a17, $a18, $a19, $a20, $a21, $a22
    If Not IsArray($aTarget) Or UBound($aTarget, 0) <> 1 Then Return SetError(1)

    Local $iTotalSize = UBound($aTarget), $iItems = 0, $tVarName
    If $iCasesense Then
        For $i = 0 To $iTotalSize -1
            $tVarName = "_" & StringToBinary($aTarget[$i], 2)
            If IsDeclared($tVarName) = -1 Then ContinueLoop

            Assign($tVarName, "", 1)
            $aTarget[$iItems] = $aTarget[$i]
            $iItems += 1
        Next

    Else
        For $i = 0 To $iTotalSize -1
            $tVarName = "_" & StringToBinary(StringLower($aTarget[$i]), 2)
            If IsDeclared($tVarName) = -1 Then ContinueLoop

            Assign($tVarName, "", 1)
            $aTarget[$iItems] = $aTarget[$i]
            $iItems += 1
        Next
    EndIf

    Local $iParams = @NumParams
    If $iParams > 2 Then
        Local $aNextArray, $iBound
        For $i = 0 To $iParams -3
            $aNextArray = Eval('a' & $i)
            If Not IsArray($aNextArray) Or UBound($aNextArray, 0) <> 1 Then Return SetError(2, $i +3) ; Sets @Extended to the parameter which failed
            $iBound = UBound($aNextArray)

            $iTotalSize += $iBound
            ReDim $aTarget[$iItems + $iBound]

            If $iCasesense Then
                For $j = 0 To $iBound -1
                    $tVarName = "_" & StringToBinary($aNextArray[$j], 2)
                    If IsDeclared($tVarName) = -1 Then ContinueLoop

                    Assign($tVarName, "", 1)
                    $aTarget[$iItems] = $aNextArray[$j]
                    $iItems += 1
                Next

            Else
                For $j = 0 To $iBound -1
                    $tVarName = "_" & StringToBinary(StringLower($aNextArray[$j]), 2)
                    If IsDeclared($tVarName) = -1 Then ContinueLoop

                    Assign($tVarName, "", 1)
                    $aTarget[$iItems] = $aNextArray[$j]
                    $iItems += 1
                Next
            EndIf
            Execute('_FreeMemory($a' & $i & ')')
        Next
    EndIf
    ReDim $aTarget[$iItems]

    Return $iTotalSize - $iItems ; Return the number of duplicates removed
EndFunc ; _ArrayUniqueConcatenate

9.5 seconds here :)

What if, instead of taking a load of arrays as params, and checking how many were passed, you took an array of arrays by reference?

It would remove the limit on the number of arrays that can be passed, but put the onus on the caller to create the array of arrays.

I've done something like that before, and I'm certain it sped things up too.
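Something along these lines is what I mean. It is only a rough sketch: _ArrayUniqueConcatenateAoA is a made-up name, it returns a new array instead of changing the first one, and it reports the number of removed duplicates in @extended. It reuses the same Assign/IsDeclared trick as the function in the first post, and I have not timed it against that function.

```autoit
; Sketch only: _ArrayUniqueConcatenateAoA is a hypothetical name, not part of the posted function.
Func _ArrayUniqueConcatenateAoA(ByRef $aArrays, $iCasesense = 0)
    If Not IsArray($aArrays) Or UBound($aArrays, 0) <> 1 Then Return SetError(1, 0, 0)

    ; First pass: check that every element is a 1D array and add up the sizes.
    Local $iTotalSize = 0
    For $i = 0 To UBound($aArrays) - 1
        If Not IsArray($aArrays[$i]) Or UBound($aArrays[$i], 0) <> 1 Then Return SetError(2, $i, 0)
        $iTotalSize += UBound($aArrays[$i])
    Next
    If $iTotalSize = 0 Then Return SetError(3, 0, 0) ; nothing to process

    ; Second pass: copy each item the first time its key is seen.
    Local $aResult[$iTotalSize], $iItems = 0, $aNextArray, $sKey, $tVarName
    For $i = 0 To UBound($aArrays) - 1
        $aNextArray = $aArrays[$i]
        For $j = 0 To UBound($aNextArray) - 1
            $sKey = $aNextArray[$j]
            If Not $iCasesense Then $sKey = StringLower($sKey)
            $tVarName = "_" & StringToBinary($sKey, 2)
            If IsDeclared($tVarName) = -1 Then ContinueLoop ; already seen

            Assign($tVarName, "", 1) ; remember the key as a local variable
            $aResult[$iItems] = $aNextArray[$j]
            $iItems += 1
        Next
    Next
    ReDim $aResult[$iItems]

    ; Return the unique items; @extended holds the number of duplicates removed.
    Return SetExtended($iTotalSize - $iItems, $aResult)
EndFunc

; Usage: the caller builds the array of arrays.
Local $aA[3] = ["a", "b", "c"], $aB[2] = ["B", "d"], $aAll[2]
$aAll[0] = $aA
$aAll[1] = $aB
Local $aUnique = _ArrayUniqueConcatenateAoA($aAll)
Local $iRemoved = @extended ; read it straight away, before calling anything else
ConsoleWrite("Unique items = " & UBound($aUnique) & ", duplicates removed = " & $iRemoved & @LF)
```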


Well, using an array of arrays is generally not recommended, or at least it didn't use to be. It also looks as if the code can be simplified, but that might well introduce a time penalty. Using a helper function is not feasible, since the code needs to test the existence of local variables created within the function. For these reasons, part of the code repeats.


The only prohibition against putting an array inside an array is that you have to know how to address it correctly. There are generally no problems actually using them if you do know how. It's an advanced feature, not recommended for the faint of heart or a newbie.

If you're not using an array of arrays, you should probably use ByRef for all your arrays being passed if you want to limit the amount of memory used by the function. As long as you're not altering the incoming arrays, there shouldn't be any downside to doing it that way.


Thanks for testing this.


There is no way to pass optional ByRef parameters in AutoIt, otherwise I would have done so. I thought of passing an array of arrays but decided that nobody ever does this for a reason. Sure it can be done. If you want to pass arrays of arrays then it's easy enough to modify. I've never needed to use an array of arrays, and I've seldom seen one used as a function parameter. I was under the impression that there are performance related issues when doing this. I've added a line to free up memory as you go. This should allow larger input.

The function could also be modified to return the item count in the first element. I may do this later, however I wanted to keep it simple and practical, so all input and output ended up 0-based. I think 24 arrays are enough for most practical purposes. Add as many extra parameters as you want to the function. It won't break anything. :)


Fixed => I got the case sensitivity working backwards. Updated first post. >_<

After increasing the length of the strings in the arrays, it just processed 146 MB of string data in 60 seconds. :D


After further tests, I must warn anyone using this function that performance will degrade with very large arrays. The limits are not clear, because they depend on the number of duplicates and the available RAM. The results of one test showed that 2,400,000 elements containing random strings of between 1 and 64 characters (approximately 76,800,000 characters in total) returned 2,347,389 unique items after removing 52,611 duplicates in 5 minutes and 15 seconds, using 2 GB of RAM. Performance degrades because a local variable is created for each new item (more unique items = lower performance). Therefore the number of expected duplicates (more duplicates = better performance) affects the amount of data this function can handle.
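For comparison, a commonly used alternative for this kind of membership test is the COM Scripting.Dictionary object, which avoids creating one local variable per unique item. The sketch below is not the method used by the function in the first post, _UniqueWithDictionary is a made-up name, and I have not measured whether it copes any better with very large input.

```autoit
; Sketch only: _UniqueWithDictionary is a hypothetical helper, not the posted function.
Func _UniqueWithDictionary(ByRef $aTarget, $iCasesense = 0)
    If Not IsArray($aTarget) Or UBound($aTarget, 0) <> 1 Then Return SetError(1, 0, 0)

    Local $oSeen = ObjCreate("Scripting.Dictionary")
    If Not IsObj($oSeen) Then Return SetError(2, 0, 0)

    ; CompareMode: 0 = binary (case sensitive), 1 = text (case insensitive).
    If $iCasesense Then
        $oSeen.CompareMode = 0
    Else
        $oSeen.CompareMode = 1
    EndIf

    Local $iTotalSize = UBound($aTarget), $iItems = 0, $sKey
    For $i = 0 To $iTotalSize - 1
        $sKey = String($aTarget[$i]) ; compare items as strings
        If $oSeen.Exists($sKey) Then ContinueLoop

        $oSeen.Add($sKey, "")
        $aTarget[$iItems] = $aTarget[$i]
        $iItems += 1
    Next
    If $iItems > 0 Then ReDim $aTarget[$iItems]

    Return $iTotalSize - $iItems ; number of duplicates removed
EndFunc
```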


Initial attempts to simplify the code in the first post produced a 20% reduction in speed. I think this is interesting, because what appears to be an unnecessary duplication of arguments is noticeably more efficient than the less bulky code below. The original function (in the 1st post) is 20% faster.

Func _ArrayUniqueConcatenate(ByRef $aTarget, $iCasesense = 0, _ ; up to 23 more arrays can be included
    $a0 = 0, $a1 = 0, $a2 = 0, $a3 = 0, $a4 = 0, $a5 = 0, $a6 = 0, $a7 = 0, $a8 = 0, $a9 = 0, $a10 = 0, $a11 = 0, _
    $a12 = 0, $a13 = 0, $a14 = 0, $a15 = 0, $a16 = 0, $a17 = 0, $a18 = 0, $a19 = 0, $a20 = 0, $a21 = 0, $a22 = 0)

    #forceref $a0, $a1, $a2, $a3, $a4, $a5, $a6, $a7, $a8, $a9, $a10, $a11, $a12, $a13, $a14, $a15, $a16, $a17, $a18, $a19, $a20, $a21, $a22
    If Not IsArray($aTarget) Or UBound($aTarget, 0) <> 1 Then Return SetError(1)

    Local $iTotalSize = UBound($aTarget), $iItems = 0, $tVarName, $aExpression[2]
    $aExpression[0] = "StringToBinary(StringLower($aTarget[$i]), 2)"
    $aExpression[1] = "StringToBinary($aTarget[$i], 2)"
    
    If $iCasesense <> 0 Then $iCasesense = 1

    For $i = 0 To $iTotalSize -1
        $tVarName = "_" & Execute($aExpression[$iCasesense])
        If IsDeclared($tVarName) = -1 Then ContinueLoop

        Assign($tVarName, "", 1)
        $aTarget[$iItems] = $aTarget[$i]
        $iItems += 1
    Next

    Local $iParams = @NumParams
    If $iParams > 2 Then
        $aExpression[0] = "StringToBinary(StringLower($aNextArray[$j]), 2)"
        $aExpression[1] = "StringToBinary($aNextArray[$j], 2)"
        
        Local $aNextArray, $iBound
        For $i = 0 To $iParams -3
            $aNextArray = Eval('a' & $i)
            If Not IsArray($aNextArray) Or UBound($aNextArray, 0) <> 1 Then Return SetError(2, $i +3) ; Sets @Extended to the parameter which failed
            $iBound = UBound($aNextArray)

            $iTotalSize += $iBound
            ReDim $aTarget[$iItems + $iBound]

            For $j = 0 To $iBound -1
                $tVarName = "_" & Execute($aExpression[$iCasesense])
                If IsDeclared($tVarName) = -1 Then ContinueLoop
                    
                Assign($tVarName, "", 1)
                $aTarget[$iItems] = $aNextArray[$j]
                $iItems += 1
            Next
            Execute('_FreeMemory($a' & $i & ')')
        Next
    EndIf
    ReDim $aTarget[$iItems]

    Return $iTotalSize - $iItems ; Return the number of duplicates removed
EndFunc ; _ArrayUniqueConcatenate

Func _FreeMemory(ByRef $vParam)
    $vParam = 0
EndFunc

This would appear to question the validity of some good coding practices in certain situations, i.e. the practice of using encapsulation instead of simply repeating the same arguments. I don't know of any more ways to encapsulate this function with the methods it uses. I don't believe using recursion will improve it. :unsure:
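A rough way to see where the time goes is to compare a direct call with the same expression run through Execute(), which has to evaluate the string on every pass. This is only a micro-benchmark sketch: the loop count and sample string are arbitrary, and the timings will differ from machine to machine.

```autoit
; Rough micro-benchmark sketch: loop count and sample string are arbitrary assumptions.
Local $sSample = "Example", $sExpression = "StringLower($sSample)", $vResult
Local $iLoops = 100000

Local $hTimer = TimerInit()
For $i = 1 To $iLoops
    $vResult = StringLower($sSample) ; direct call
Next
ConsoleWrite("Direct call: " & TimerDiff($hTimer) & " ms" & @LF)

$hTimer = TimerInit()
For $i = 1 To $iLoops
    $vResult = Execute($sExpression) ; same expression, evaluated from a string each time
Next
ConsoleWrite("Execute:     " & TimerDiff($hTimer) & " ms" & @LF)
```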


After renaming one or two parameters, I could easily rewrite this function using more compact and better organized code. While the code in the previous post is surprisingly sluggish, this version appears to have a slight edge over the original function.

;

Func _ArrayUniqueConcatenate(ByRef $a1, $iCasesense = 0, _ ; up to 23 more arrays can be included
    $a2 = 0, $a3 = 0, $a4 = 0, $a5 = 0, $a6 = 0, $a7 = 0, $a8 = 0, $a9 = 0, $a10 = 0, $a11 = 0, $a12 = 0, $a13 = 0, _
    $a14 = 0, $a15 = 0, $a16 = 0, $a17 = 0, $a18 = 0, $a19 = 0, $a20 = 0, $a21 = 0, $a22 = 0, $a23 = 0, $a24 = 0)
    #forceref $a1, $a2, $a3, $a4, $a5, $a6, $a7, $a8, $a9, $a10, $a11, $a12, $a13, $a14, $a15, $a16, $a17, $a18, $a19, $a20, $a21, $a22, $a23, $a24

    Local $aNextArray, $iBound, $tVarName, $iTotalSize = 0, $iItems = 0, $iParams = @NumParams

    If $iParams = 1 Then $iParams = 2
    For $i = 1 To $iParams -1
        $aNextArray = Eval('a' & $i)
        If Not IsArray($aNextArray) Or UBound($aNextArray, 0) <> 1 Then Return SetError(1, $i + ($i > 1)) ; Sets @Extended to the parameter which failed

        $iBound = UBound($aNextArray)
        If $i > 1 Then ReDim $a1[$iItems + $iBound]

        If $iCasesense Then
            For $j = 0 To $iBound -1
                $tVarName = "_" & StringToBinary($aNextArray[$j], 2)
                If IsDeclared($tVarName) = -1 Then ContinueLoop

                Assign($tVarName, "", 1)
                $a1[$iItems] = $aNextArray[$j]
                $iItems += 1
            Next

        Else
            For $j = 0 To $iBound -1
                $tVarName = "_" & StringToBinary(StringLower($aNextArray[$j]), 2)
                If IsDeclared($tVarName) = -1 Then ContinueLoop

                Assign($tVarName, "", 1)
                $a1[$iItems] = $aNextArray[$j]
                $iItems += 1
            Next
        EndIf
        If $i > 1 Then Execute('_FreeMemory($a' & $i & ')')
        $iTotalSize += $iBound
    Next
    ReDim $a1[$iItems]

    Return $iTotalSize - $iItems ; Return the number of duplicates removed
EndFunc ; _ArrayUniqueConcatenate

Func _FreeMemory(ByRef $vParam)
    $vParam = 0
EndFunc

;

Don't mean to bump threads, just letting you know the code has been improved.


Nice, I'll have to try this updated version in the data backup script I have written.  Thanks!

 

Your question and kylomas' idea inspired me. Having to wait so long for _ArrayUnique() has been an issue for me in the past too, so creating this brings rewards for me also. The new version is just more compact and I think the code is neater. Time for some proper documentation after all this testing. :)
