Map/Dictionary/HashTable - which hash algorithm is used?

AspirinJunkie · December 21, 2020

I made some performance tests to get a rough indication of which data structure is most appropriate and when. For this, I measured the time needed for adding and reading out at different numbers of elements.

These are the results (beware of the logarithmic scale for the axes):

Especially the behavior of map is interesting: At small numbers of elements the map structure performs very well and scales quite linear with the number of elements. But then (here at 10,000 elements) there is some sort of cut and the results are getting bad. Scripting.dictionary also shows this behavior but only later a larger number of elements.

Now for my question: I suspect, that the AutoIt-Map is internally implemented as hash-table and not with sorting trees. As far as I know the used hash-algorithm in a hash-table will reach the point where it start to produce a lot of collisions which leads to extra treatments. This would explain the showed behavior. But because there is a difference between AutoIt-Map, Scripting.Dictionary and System.Collections.HashTable i have to assume that these each use different hash algorithms. Does anybody know which exact hash algorithms are used internally for these?

hashtable.au3

DataStructureComparison.zip

Edited December 29, 2020 by AspirinJunkie
added pure AutoIt hash-table

JockoDundee · December 21, 2020

Nice.

It would be interesting to see where SQLite would place in the mix.

AspirinJunkie · December 21, 2020

Sure - i can add scenarios for sqlite. But not until wednesday (feel free to add them yourself and go through the tests).

Anyway - if sqlite with AutoIt is used as a simple key-value datastructure i don't expect much impressive results.
In my scenario i add and read every element separately. This means each time a _SQLite_Exec() resp. _SQLite_Query().
I expect, that the performance benefits from the optimized internal implementation of sqlite would eaten up again by the overhead for the wrapper functions.

Since sqlite is based on b-trees for its indexes, the runtime behavior during readout should be quite similar to that of the "2D array binary search" (with an absolute offset).
For building the data structure, I lack information on how this was implemented internally in sqlite. But since the index has to be built in parallel and sorted, I don't expect any miracles here either.

Of course, this is mainly due to the unfavorable scenario for sqlite that elements are added and read individually, instead of bundled.
Especially if you can use the ".import" function with _SQLite_SQLiteExe() then quiet nothing is as fast as this. But this is not my scenario here.

But if you want to include external databases in the comparison, the really interesting candidates would be the special mem-cached key-value databases like redis. These can be expected to perform at a high level even with large amounts of data.

JockoDundee · December 22, 2020

4 hours ago, AspirinJunkie said:

But if you want to include external databases in the comparison, the really interesting candidates would be the special mem-cached key-value databases like redis.

I mention SQLite only because it is a good reference. Though, loading aside, at some number of rows, it might beat the colliding hashes, perhaps?

JockoDundee · December 22, 2020

7 hours ago, AspirinJunkie said:

Since sqlite is based on b-trees for its indexes, the runtime behavior during readout should be quite similar to that of the "2D array binary search" (with an absolute offset).

Yes, I would agree. FWIW, the 2D array seems to be holding well at the end, whereas the Map looks like it’s gonna go asymptotic.

AspirinJunkie · December 22, 2020

8 hours ago, JockoDundee said:

Though, loading aside, at some number of rows, it might beat the colliding hashes, perhaps?

Yes of course. There are no collisions because it's index is a sorted (as a b-tree) structure with one element per index.
As the number of elements increases, the number of iterations required to find a particular element also increases due to the levels in the tree.
However, this is quite harmless and should still work quickly even with a really large number of elements.

5 hours ago, JockoDundee said:

the 2D array seems to be holding well at the end, whereas the Map looks like it’s gonna go asymptotic.

Really? Map looks also really stable for me. Hm - ok this is just my subjective view based on the optics. I did'nt made a statistical test for linearity (btw i made a statistic-udf to realize such tests 😅 )

To get deeper into the Functionality of hash-tables i wrote a hash-table udf in pure AutoIt as an exercise for myself. Here i learned, that you need a good balance between maximum number of collisions per index and as less as possible resizing when the number of elements increase. If the base-array is big enough, then a big number of collisions is unlikely. Because there are such a lot of optimization parameters - not only the used hash-function - i slowly begin to understand why the 3 hash-tables here perform so different.

Melba23 · December 22, 2020

AspirinJunkie,

I went back looking at some old private threads and found that AutoIt Maps currently use the "djb2" hash function and a 256 entry hash table.

M23

AspirinJunkie · December 22, 2020

13 hours ago, Melba23 said:

AutoIt Maps currently use the "djb2" hash function

Thank you - i think i will play a little bit with this.
Funnily enough, I once stumbled across a collision generator for djb2 where the following statement was made about DJB2:
"So never ever use this function in hash table." 😅

13 hours ago, Melba23 said:

AutoIt Maps currently use [...] a 256 entry hash table.

Fixed size? - no resizing?
That would explain the determined behavior.
I think linked lists are used for the sub-structures - right?
Even with a perfect hash/index-function the result is, that when you get an element we have an average linear scan over 2 elements at size 1,000; 20 elements at size 10,000; 200 elements at size 100,000 and so on. So if you read every single element of such a hash-table with 1,000,000 elements you have to iterate over at least 2 billion elements...

There are many reasons for this concrete implementation. But to know this special detail helps me a lot to understand why it behaves as we see.
~~I suspect, that scripting.dictionary takes a quite similar approach - but with a bigger hash-table size.~~ (as stated >>here<< scripting.dictionary shall do resizes)
And system.collections.hashtable - puh - either the size is also fixed but significantly larger or they dynamically resize there.

Edited December 23, 2020 by AspirinJunkie
found one source, that stated that scripting.dictionary do resizes

Melba23 · December 22, 2020

AspirinJunkie,

I am not a Dev and so have no further knowledge of why or how this particular function or table size were chosen. They only came up in a much wider-ranging private conversation while discussing Maps.

M23

jchd · December 23, 2020

It would be interesting to find out what F14 gives: https://engineering.fb.com/2019/04/25/developer-tools/f14/

AspirinJunkie · December 23, 2020

@JockoDundee i added sqlite. The results are as expected because the scenario here is horrible for sqlite.
In real life nobody would (at least I hope so) use single inserts and selects for such a amount of elements if it's possible to avoid.
That's why i also added the results for one single insert for all elements just to show the huge difference, even if it doesn't match the actual scenario here.

12 hours ago, Melba23 said:

I am not a Dev and so have no further knowledge of why or how this particular function or table size were chosen.

I already realized that - but this forum is named "technical discussion" - so why not discuss even if we are not the devs.
And your hint already gave me a much further understanding. I learned - that was my intention here - so many thanks to you!

@jchd
Oh that looks interesting. Especially the discussion of which parameters of a hash-table have what effect.
The library is C++ - i don't think it is useful to write a wrapper for it and use it in AutoIt.
But maybe i make some tests with it in C++ sometime.

Edit: I said i wrote my own hash-table in pure AutoIt for a better understanding. I add this structure to the comparison and i am quite surprised how well this one does. It's even better as AutoIt-maps and scripting.dictionary (only reading) at 1,000,000 elements.

Edited December 23, 2020 by AspirinJunkie

jchd · December 23, 2020

Of course SQLite is slower than any other {key, value} library since its overhead for enforcing ACID properties is paramount. Bulk inserts can be made faster by using the following guidelines:

Use a memory DB
If not possible to use a Memory DB, enclose all inserts in one exclusive transaction
Do not add indices and triggers to your target table, do that after all inserts
Have your input data cleaned up so you can spare constraints
Group new rows in chained insert statements: insert into T ... values (...), (...), (...), ..., (...)
Raise memory and cache limits as fit
Some fine-tune pragmas can speed things up

Even with these steps SQLite can't compete with other libraries on the bulk insert test; the overhead of invoking primitives thru DllCall makes things worse. SQLite isn't bad, it's only poorly suited for the job.

JockoDundee · December 23, 2020

4 hours ago, jchd said:

Of course SQLite is slower than any other {key, value} library since its overhead for enforcing ACID properties is paramount.

It’s my fault SQLite was invited to the party. The question wasn’t if it would be slower, just how much it would be slower. And to see at what size that the performance might converge (10,000,000 rows?) Useful also as a sanity check on the other values, too.

That said I agree there may be things to improve its performance.

4 hours ago, jchd said:

Use a memory DB

Have your input data cleaned up so you can spare constraints

Group new rows in chained insert statements: insert into T ... values (...), (...), (...), ..., (...)

Raise memory and cache limits as fit

Some fine-tune pragmas can speed things up

Agree.

4 hours ago, jchd said:

If not possible to use a Memory DB, enclose all inserts in one exclusive transaction

Disagree. Not fair to the one at a time loading scenario set forth by AJ. Also, not clear that it would benefit performance to be part of a million row transaction, since that would imply substantial journaling writes, in order to rollback.

4 hours ago, jchd said:

Do not add indices and triggers to your target table, do that after all inserts

Agree it’s faster, disagree it’s allowed by the loading rules.

Another thing that *may* speed up SQLite is the use of the WITHOUT ROWID clause, see here.

jchd · December 23, 2020

8 minutes ago, JockoDundee said:

Also, not clear that it would benefit performance to be part of a million row transaction, since that would imply substantial journaling writes, in order to rollback.

Try it on a sample and you'll see 😉
Divide your bulk insert in, say, chunks of 10k rows and compare with 10k individual inserts...

9 minutes ago, JockoDundee said:

Another thing that *may* speed up SQLite is the use of the WITHOUT ROWID clause

Most probably not. In a w/o rowid table you must declare a primary key by your own and either it's an integer (making no difference with a rowid) or something else which is going to be slower to handle than an integer.

AspirinJunkie · December 23, 2020

4 hours ago, jchd said:

Bulk inserts can be made faster by using the following guidelines:

The results in the first post already based on these (code is attached). I used a in-memory db with index creation after all elements are inserted (beware not to define a primary key - this would lead to an implicit index). Also the cache size is increased.

4 hours ago, jchd said:

SQLite isn't bad, it's only poorly suited for the job.

my words

JockoDundee · December 23, 2020

2 hours ago, jchd said:

Try it on a sample and you'll see 😉
Divide your bulk insert in, say, chunks of 10k rows and compare with 10k individual inserts...

No need to test, I wouldn’t “include all inserts in one exclusive transaction” or a bunch of 10,000 row transactions.

Both are way slower than just turning off transactions altogether.

jchd · December 23, 2020

Are you sure? ;-))

#include <SQLite.au3>

Const $SQLITE_DLL = "C:\SQLite\bin\sqlite3.dll" ;<-- Change to the location of your sqlite dll

; Init sqlite
_SQLite_Startup($SQLITE_DLL, False, 1)
If @error Then Exit MsgBox($MB_ICONERROR, "SQLite Error", "Unable to start SQLite. Check existence of DLL")
OnAutoItExitRegister(_SQLite_Shutdown)

ConsoleWrite("SQlite version " & _SQLite_LibVersion() & @LF & @LF)    ; optional!

Local $hDB = _SQLite_Open()
If @error Then Exit
OnAutoItExitRegister(_SQ3Close)


_SQLite_Exec($hDB, "create table T (k integer primary key, v char)")
Local $t = TimerInit()
For $i = 1 To 10000
    _SQLite_Exec($hDB, "insert into T values (" & $i & ", 'abcdef')")
Next
Local $dt = TimerDiff($t)
ConsoleWrite("10k inserts, no transaction, individual inserts took " & $dt / 1000 & "s" & @LF)
_SQLite_Exec($hDB, "drop table T")


_SQLite_Exec($hDB, "create table T (k integer primary key, v char)")
$t = TimerInit()
_SQLite_Exec($hDB, "begin exclusive")
For $i = 1 To 10000
    _SQLite_Exec($hDB, "insert into T values (" & $i & ", 'abcdef')")
Next
_SQLite_Exec($hDB, "end")
Local $dt = TimerDiff($t)
ConsoleWrite("10k inserts, one transaction, individual inserts took " & $dt / 1000 & "s" & @LF)
_SQLite_Exec($hDB, "drop table T")


_SQLite_Exec($hDB, "create table T (k integer primary key, v char)")
$t = TimerInit()
Local $s = "insert into T values (1, 'abcdef')"
For $i = 2 To 10000
    $s &= ",(" & $i & ", 'abcdef')"
Next
_SQLite_Exec($hDB, $s)
Local $dt = TimerDiff($t)
ConsoleWrite("10k inserts, one (implicit) transaction, chained inserts took " & $dt / 1000 & "s" & @LF)
_SQLite_Exec($hDB, "drop table T")

Func _SQ3Close()
    _SQLite_Close($hDB)
EndFunc   ;==>_SQ3Close

SQlite version 3.32.3

10k inserts, no transaction, individual inserts took 5.9761773s
10k inserts, one transaction, individual inserts took 5.6808805s
10k inserts, one (implicit) transaction, chained inserts took 0.0446869s

The beta is even much faster (I added test with a Map):

SQlite version 3.32.3

10k inserts, no transaction, individual inserts took 1.2966753s
10k inserts, one transaction, individual inserts took 1.2246717s
10k inserts, one (implicit) transaction, chained inserts took 0.0381699s
10k inserts in a Map took 0.0103734s

Edited December 23, 2020 by jchd

JockoDundee · December 23, 2020

43 minutes ago, jchd said:

Are you sure? ;-))

I’ll check it out. Thanks!

JockoDundee · December 23, 2020

1 hour ago, jchd said:

Are you sure? ;-))

what i meant by "transactions off" was this:

SQlite version 3.27.2

PRAGMA journal_mode=OFF
100k inserts, no transaction, individual inserts took 8.3460208s
100k inserts, one transaction, individual inserts took 8.2206863s
100k inserts, one (implicit) transaction, chained inserts took 0.2979496s

SQlite version 3.27.2

WITHOUT PRAGMA journal_mode=OFF
100k inserts, no transaction, individual inserts took 8.5634186s
100k inserts, one transaction, individual inserts took 8.8829882s
100k inserts, one (implicit) transaction, chained inserts took 0.3040778s

only a minor difference addmitedly.

I bumped the 10K to 100K to event out the averaging a bit.

Not familiar with "chained inserts", is it standard SQL? Regardless, it looks to be quit a winner!

AspirinJunkie · December 24, 2020

12 hours ago, JockoDundee said:

what i meant by "transactions off" was this:

This parameter (also PRAGMA SYNCHRONOUS=OFF) has an effect mainly on file-databases. For in-memory databases like this one, its effect disappears.
Change the sqlite-open statement to this one: _SQLite_Open(@DesktopDir & "\Test.db") and you will see the effect.

12 hours ago, JockoDundee said:

Not familiar with "chained inserts", is it standard SQL?

I do not have the standard here but I firmly assume yes. This construct can be found in almost all big databases (except oracle - they have an extra "insert all" statement). And since PostgreSQL has this as well, there is a lot to be said for it being in the standard as well.

Edited December 24, 2020 by AspirinJunkie

Sign In

Advert

Map/Dictionary/HashTable - which hash algorithm is used?

Recommended Posts

AspirinJunkie

Advert

JockoDundee

AspirinJunkie

JockoDundee

JockoDundee

AspirinJunkie

Melba23

AspirinJunkie

Melba23

jchd

AspirinJunkie

jchd

JockoDundee

jchd

AspirinJunkie

JockoDundee

jchd

JockoDundee

JockoDundee

AspirinJunkie

Create an account or sign in to comment

Create an account

Sign in

Browse

AutoIt Resources

Release

Beta