Duplicating Data

czardas · June 4, 2010

I figured this was the best forum to post this topic.

I have a 2D array of approx 10,000 elements. It contains about 8,800 unassigned elements. My concern is that testing all these unassigned elements during a repeated array search seems long winded and tedious. So why not create a smaller array with only 1,000 assigned elements and search that. :mellow:

Scaled down reproducer

#include <Array.au3>

Local $data[14] = [4,5,2,3,0,1,7,6,9,8,12,11,10,13]
Local $2D_array[8][7] ; This array is needed for another process.

; The 1st column is a key which determines the number of entries in each row.
; Numbers in this 1st column are excluded from the search results.
$2D_array[2][0] = 1
$2D_array[3][0] = 1
$2D_array[4][0] = 6
$2D_array[5][0] = 2
$2D_array[6][0] = 3
$2D_array[7][0] = 1

; Populate the remaining columns of the 2D array.
$a = 0
For $i = 2 To 7
    For $j = 1 To $2D_array[$i][0]
        $2D_array[$i][$j] = $data[$a]
        $a += 1
    Next
Next

; NONE OF THE CODE ABOVE CAN NOT BE ALTERED.

; Breakng convention.
Local $index_matching_array[14] ; Array created to avoid testing many empty cells during the search (for sparce population in a large array).
$a = 0
For $i = 2 To 7
    For $j = 1 To $2D_array[$i][0]
        $index_matching_array[$a] = $i
        $a += 1
    Next
Next

find()

Func find()
    Local $x = Random(0,13,1) ; Whatever element we want to find in the 2D array.
    Local $data_index = _ArraySearch($data, $x, 0, 0, 0, 1) ; This search will be repeated thousands of times.
    Local $return_value = $index_matching_array[$data_index]
    ; Display result =>
    MsgBox(0, "Index in 2D_array", $x & " was found at " & @CRLF & "array index " & $return_value)
    _ArrayDisplay($2D_array)
EndFunc

So now I am breaking convention by duplicating information. I know this topic has been discussed before, and that data duplication is considered bad practice. But is it always so bad? Maybe another dumb ass question.

Edited June 4, 2010 by czardas

evilertoaster · June 4, 2010

My concern is that testing all these unassigned elements during a repeated array search seems long winded and tedious

It usually is. If you know a bounds or range limit to the set you're searching, use it.

So why not create a smaller array with only 1,000 assigned elements and search that.

You could... it should be unnecessary. Like I mentioned, it's better to simply limit the range of the search.

So now I am breaking convention by duplicating information. I know this topic has been discussed before, and that data duplication is considered bad practice. But is it always so bad?

I'd hesitate to say it's always bad to duplicate information... In this case I would think it is bad though, since there should be no reason too. I thought the _ArraySearch allowed you to specify a starting a stopping point for the search (aka the 'range' to search) I'm not sure how it works with 2d arrays, but you could most certainly write your own 2D array searching function if speed is your concern.

jvanegmond · June 4, 2010

See for yourself.

Also, $a is always the same as $i - 2.

#include <Array.au3>

Local $data[14] = [4,5,2,3,0,1,7,6,9,8,12,11,10,13]
Local $2D_array[8][7] ; This array is needed for another process.

; The 1st column is a key which determines the number of entries in each row.
; Numbers in this 1st column are excluded from the search results.
$2D_array[2][0] = 1
$2D_array[3][0] = 1
$2D_array[4][0] = 6
$2D_array[5][0] = 2
$2D_array[6][0] = 3
$2D_array[7][0] = 1

; Populate the remaining columns of the 2D array.
$a = 0
For $i = 2 To 7
    For $j = 1 To $2D_array[$i][0]
        $2D_array[$i][$j] = $data[$a]
        $a += 1
    Next
Next

; NONE OF THE CODE ABOVE CAN NOT BE ALTERED.

$a2 = 0
$b2 = 0

For $i = 0 to 10000
    $n = Random(0, 13, 1) ; This is the value we want to find

    $a1 = TimerInit()
    $a = createAndFind($n)
    $a2 += TimerDiff($a1)

    $b1 = TimerInit()
    $b = justFind($n)
    $b2 += TimerDiff($b1)
Next

ConsoleWrite("Create array and find total time: " & $a2 & @CRLF)
ConsoleWrite("Just find total time: " & $b2 & @CRLF)



Func justFind($x)
    $a = 0
    For $i = 2 To 7
        For $j = 1 To $2D_array[$i][0]
            if $i = $x Then
                Return $a
            EndIf
            $a += 1
        Next
    Next
    Return $a
EndFunc

Func createAndFind($x) ; Breaks convention
    Local $index_matching_array[14]
    $a = 0
    For $i = 2 To 7
        For $j = 1 To $2D_array[$i][0]
            $index_matching_array[$a] = $i
            $a += 1 ; $a is always the same as $i - 2
        Next
    Next

    $index = _ArraySearch($data, $x, 0, 0, 0, 1)
    Return $index_matching_array[$index]
EndFunc

Edit: And when a programmer tells you that it's bad to duplicate data, then he's talking about databases.

Edited June 4, 2010 by Manadar

czardas · June 4, 2010

I am beginning to think perhaps I ought to have posted this in Help and Support. Manadar, you make everything seem so easy. Thank you for this illustration. I'm sure it will be quicker regardless of the array size. For some reason I was fixated on using the built in _arraySearch function. A classic case of I couldn't see the wood for the trees. Still I thought the topic may be of general interest, even if just to discover how others would approach such a problem. And your solution is so very much to the point. Great! :party:

Thanks also to evilertoaster for your comments.

Edit: I just spotted one thing about it I need to test. The second array $index_matching_array only needs to be created once.

#include <Array.au3>

Local $data[14] = [4,5,2,3,0,1,7,6,9,8,12,11,10,13]
Local $2D_array[8][7] ; This array is needed for another process.

; The 1st column is a key which determines the number of entries in each row.
; Numbers in this 1st column are excluded from the search results.
$2D_array[2][0] = 1
$2D_array[3][0] = 1
$2D_array[4][0] = 6
$2D_array[5][0] = 2
$2D_array[6][0] = 3
$2D_array[7][0] = 1

; Populate the remaining columns of the 2D array.
$a = 0
For $i = 2 To 7
    For $j = 1 To $2D_array[$i][0]
        $2D_array[$i][$j] = $data[$a]
        $a += 1
    Next
Next

; NONE OF THE CODE ABOVE CAN NOT BE ALTERED.

Local $index_matching_array[14]
$a = 0
For $i = 2 To 7
   For $j = 1 To $2D_array[$i][0]
        $index_matching_array[$a] = $i
        $a += 1
    Next
Next

$a2 = 0
$b2 = 0

For $i = 0 to 10000
    $n = Random(0, 13, 1) ; This is the value we want to find

    $a1 = TimerInit()
    $a = Find($n)
    $a2 += TimerDiff($a1)

    $b1 = TimerInit()
    $b = justFind($n)
    $b2 += TimerDiff($b1)
Next

ConsoleWrite("Create array and find total time: " & $a2 & @CRLF)
ConsoleWrite("Just find total time: " & $b2 & @CRLF)

Func justFind($x)
    $a = 0
    For $i = 2 To 7
        For $j = 1 To $2D_array[$i][0]
            if $i = $x Then
                Return $a
            EndIf
            $a += 1
        Next
    Next
    Return $a
EndFunc

Func Find($x)
    $index = _ArraySearch($data, $x, 0, 0, 0, 1)
    Return $index_matching_array[$index]
EndFunc

Update: I took the array creation out of the loop and ran the test again. It appears that your search function => justFind() is still faster than using _arraysearch() on the smaller 1D array. :mellow:

Edited June 5, 2010 by czardas

jvanegmond · June 4, 2010

Hey! I was working on that. :mellow: It's only _ArraySearch that makes it slow.

If your data never changes, you can create the special array once, and then search through it manually. This is faster (not including the time creating the array)

#include <Array.au3>

Local $data[14] = [4,5,2,3,0,1,7,6,9,8,12,11,10,13]
Local $2D_array[8][7] ; This array is needed for another process.

; The 1st column is a key which determines the number of entries in each row.
; Numbers in this 1st column are excluded from the search results.
$2D_array[2][0] = 1
$2D_array[3][0] = 1
$2D_array[4][0] = 6
$2D_array[5][0] = 2
$2D_array[6][0] = 3
$2D_array[7][0] = 1

; Populate the remaining columns of the 2D array.
$a = 0
For $i = 2 To 7
    For $j = 1 To $2D_array[$i][0]
        $2D_array[$i][$j] = $data[$a]
        $a += 1
    Next
Next

; NONE OF THE CODE ABOVE CAN NOT BE ALTERED.

Local $index_matching_array[14]
$a = 0
For $i = 2 To 7
   For $j = 1 To $2D_array[$i][0]
        $index_matching_array[$a] = $i
        $a += 1 ; $a is always the same as $i - 2
    Next
Next

$a2 = 0
$b2 = 0

For $i = 0 to 10000
    $n = Random(0, 13, 1) ; This is the value we want to find

    $a1 = TimerInit()
    $a = Find($n)
    $a2 += TimerDiff($a1)

    $b1 = TimerInit()
    $b = justFind($n)
    $b2 += TimerDiff($b1)
Next

ConsoleWrite("Find from array total time: " & $a2 & @CRLF)
ConsoleWrite("Just find total time: " & $b2 & @CRLF)

Func justFind($x)
    $a = 0
    For $i = 2 To 7
        For $j = 1 To $2D_array[$i][0]
            if $i = $x Then
                Return $a
            EndIf
            $a += 1
        Next
    Next
    Return $a
EndFunc

Func Find($x)
    For $i = 0 to 13
        if $data[$i] = $x then
            return $index_matching_array[$i]
        EndIf
    Next
EndFunc

Edited June 4, 2010 by Manadar

czardas · June 4, 2010

If your data never changes, you can create the special array once, and then search through it manually. This is faster (not including the time creating the array)

Let's see if I got this right? Duplication of data is possibly a good idea sometimes. Wow!

Thanks again. Much appreciated. :mellow:

Edited June 5, 2010 by czardas

evilertoaster · June 4, 2010

Let's see if I got this right? Duplication of data is possibly a good idea sometimes. Wow!

Humm, I don't see how we're getting to this conclusion...

The goal is finding the index of the value isn't it?

Comparing a function searches a 1d array for an index to a function that searches a 2d array? In the 2n array you'd need to return 2 values anyways, even if the 1d search goes faster you still haven't resolved it to the original (2d) index... (or if the original is not important why not just sort it and to a 1d binary search to begin with?)

*ramble ramble*

... o well, if you found your answer congrats :mellow:

czardas · June 4, 2010

In the 2n array you'd need to return 2 values anyways

If that was the case then I would have to agree with you. However I am only interested in the returned row index value, and not the column in which the match was found. The 2D array stores information to be used elsewhere for a different process, otherwise it would be superfluous. I had originally planned to delete the original data and only search the 2D array. On reflection, with some redesigning, I should be able to avoid this duplication. But I can imagine circumstances where I might need the 2D array, as well as calling on the original data to more quickly reproduce these values. If any of that makes sense. :mellow:

It is a consequence of my attempts to avoid duplicating data, that I ran into this scenario in the first place. I certainly agree that duplication should be avoided. But it occasionally seems more straight forward to simply allow it. Perhaps that's just me being lazy, or too short sighted to see another solution.

I also would like to point out that: I would not post a topic like this, if I did not value people's opinions. What it boils down to is whether duplication avoidance should be treated as a general guideline or as a definitive rule written in stone.

Edited June 5, 2010 by czardas

czardas · June 5, 2010

The more I think about this, the more I keep thinking there must be a better way. After a while of pondering I have come up with a solution. I only need to populate the first column of the 2D array. I can still find the index values as if I had searched a fully populated array. Only if and when required should I populate the array with more data. Bingo! :mellow:

Edit: Bah! This still doesn't actually avoid duplicating data. Once the array is populated, the data will be duplicated.

I guess I'll just have to follow my intuition on this one. Manadar's speed test is swaying my decision.

Edited June 5, 2010 by czardas

SmOke_N · June 6, 2010

Why not use SQLite to manage such big arrays and dupe issues. You'd avoid all your speed issue concerns.

evilertoaster · June 6, 2010

There are some cases where 'duplicating' (or preprocessing) data is fine as far as I'm concerned. Sorting lists before processing is common for speed purposes. If the 2-D array you're searching doesn't have any quantifiable bounds and you're going to be doing a lot of intense processing over a small subset (which you can't logicality describe for looping) it would make sense to pre-arrange the data...

Then again, depending on how important speed really is for you, AutoIt isn't always the best option in general.

AndyG · June 6, 2010

If you have trouble with the speed in searching arrays, and the "Double Data", why use arrays?

There are much faster ways sometimes if you use a lists, a queue, a stack or a hash-table instead of an array.

In this forum I have not searched intensively for examples, because in the german forum is a little "tutorial" with some examples. One of them might help you...

http://www.autoit.de/index.php?page=Thread&postID=24276#post24276

czardas · June 6, 2010

Thanks everyone for your input. Lots of interesting suggestions and ideas to chew over. I ought to explain what I'm up to. The concept is very simple but the implementation I find quite tricky. :mellow:

The code is part of a fully functional encryption program which has undergone multiple tests and rebuilds over the last few months. The architecture has been clearly defined for a while now, but from the perspective of another programmer looking at it, the code needs further attention. Fortunately I am not confined by a deadline. I actually need to rip the heart out of it and make it presentable.

It is vitaly important that the data be held in RAM, so using a database is out of the question. Holding the information in an array apparently simplifies several tasks, though I am very interested to look into your suggested alternatives. The size of the 2D array is not really so large (256 rows x 54 columns), but the number of repeat searches will depend entirely on the length of the encrypted data.

Edited June 6, 2010 by czardas

jchd · June 7, 2010

It is vitaly important that the data be held in RAM, so using a database is out of the question.

You miss the point about SQLite flexibility: it will happily create a memory database which you can then search with all the power of SQL yet using fast SQLite C code.

Depending on your precise requirements, you can create indexes on whatever combination of column / type you need and still enjoy very fast answer.

I routinely use it for temporary storage and searches and when you have to repeatedly lookup the data with non-trivial conditions, it quickly becomes much faster than AutoIt loops.

As a bonus, you can load an SQLite DB from disk and save it back if you need it a some later point in time, all of this without even closing it. See here.

That is exactly one of the reasons for which SQLite is so widely used: high-efficiency embedded application data storage / retrieval.

czardas · June 7, 2010

Yeah, not sure what you meant by bad.

When I say bad, I mean that it is considered bad to duplicate data because it generally hinders performance. In this case the duplication appears to be enhancing performance, which perhaps seems counter-intuative. I am therefore reluctant to change it. That doesn't necessarily mean that it will be the final solution: though probably.

You miss the point about SQLite flexibility: it will happily create a memory database which you can then search with all the power of SQL yet using fast SQLite C code.

Ah yes, I missunderstood that point. Thanks. However I am not so much concerned with speed, as I am with presenting my method in code that is well written and easy to understand. I am finding that to be a bit of a challenge. Also I'm not keen on bundling 3rd party software (I'm pressuming that would be required) with a portable application: which is how I intend to release it. However it sounds like I ought to investigate the functionality of SQLite at some point. It sounds very interesting. Any further enhancements to performance can be made later, if it turns out to be a worthwhile project.

Changing the subject slightly :mellow:

Nobody spotted the mistake yet?

Edited June 7, 2010 by czardas

Valik · June 7, 2010

Performance entails more than just "speed". Resource usage (memory for example) is also a metric in performance. Duplicating data may allow speed to improve under certain circumstances but it does affect memory usage adversely. It's a trade-off and you must determine whether the increase in usage justifies the speed.

GEOSoft · June 7, 2010

czardas, you are aware that there is a SQLite UDF included with AutoIt are you not? Nothing more to be installed.

wraithdu · June 7, 2010

If you just need to find an index, you could use a Scripting.Dictionary and store the array of data as the data element.

czardas · June 7, 2010

czardas, you are aware that there is a SQLite UDF included with AutoIt are you not? Nothing more to be installed.

Now that's a totally differnt story. Very useful indeed. :mellow:

Thanks again to everyone for sharing your knowledge with me, and to Valik for clarifying my understanding of the subject.

Edited June 7, 2010 by czardas

Duplicating Data

Recommended Posts

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Create an account or sign in to comment

Create an account

Sign in

Recently Browsing 0 members