Jump to content
AspirinJunkie

Map/Dictionary/HashTable - which hash algorithm is used?

Recommended Posts

I made some performance tests to get a rough indication of which data structure is most appropriate and when. For this, I measured the time needed for adding and reading out at different numbers of elements.

These are the results (beware of the logarithmic scale for the axes):

build_associative.thumb.png.18add773a07c7f3d3d53c824e566bfff.pngbuild_sequential.thumb.png.67f8a3e01212534edb5e79c967c0ec35.pngread_associative.thumb.png.7920f63ff9345d94c71bb317f958d88b.pngread_sequential.thumb.png.98113523ece89cf6c4c56413a5d31be8.png

Especially the behavior of map is interesting: At small numbers of elements the map structure performs very well and scales quite linear with the number of elements. But then (here at 10,000 elements) there is some sort of cut and the results are getting bad. Scripting.dictionary also shows this behavior but only later a larger number of elements.

Now for my question: I suspect, that the AutoIt-Map is internally implemented as hash-table and not with sorting trees. As far as I know the used hash-algorithm in a hash-table will reach the point where it start to produce a lot of collisions which leads to extra treatments. This would explain the showed behavior. But because there is a difference between AutoIt-Map, Scripting.Dictionary and System.Collections.HashTable i have to assume that these each use different hash algorithms. Does anybody know which exact hash algorithms are used internally for these?

hashtable.au3

DataStructureComparison.zip

Edited by AspirinJunkie
added pure AutoIt hash-table

Share this post


Link to post
Share on other sites

Sure - i can add scenarios for sqlite. But not until wednesday (feel free to add them yourself and go through the tests).

Anyway - if sqlite with AutoIt is used as a simple key-value datastructure i don't expect much impressive results.
In my scenario i add and read every element separately. This means each time a _SQLite_Exec()  resp. _SQLite_Query().
I expect, that the performance benefits from the optimized internal implementation of sqlite would eaten up again by the overhead for the wrapper functions.

Since sqlite is based on b-trees for its indexes, the runtime behavior during readout should be quite similar to that of the "2D array binary search" (with an absolute offset).
For building the data structure, I lack information on how this was implemented internally in sqlite. But since the index has to be built in parallel and sorted, I don't expect any miracles here either.

Of course, this is mainly due to the unfavorable scenario for sqlite that elements are added and read individually, instead of bundled.
Especially if you can use the ".import" function with _SQLite_SQLiteExe() then quiet nothing is as fast as this. But this is not my scenario here.

But if you want to include external databases in the comparison, the really interesting candidates would be the special mem-cached key-value databases like redis. These can be expected to perform at a high level even with large amounts of data.

Share this post


Link to post
Share on other sites
4 hours ago, AspirinJunkie said:

But if you want to include external databases in the comparison, the really interesting candidates would be the special mem-cached key-value databases like redis.

I mention SQLite only because it is a good reference. Though, loading aside, at some number of rows, it might beat the colliding hashes, perhaps?


Code hard, but don’t hard code...

Share this post


Link to post
Share on other sites
7 hours ago, AspirinJunkie said:

Since sqlite is based on b-trees for its indexes, the runtime behavior during readout should be quite similar to that of the "2D array binary search" (with an absolute offset).

Yes, I would agree.  FWIW,  the 2D array seems to be holding well at the end, whereas the Map looks like it’s gonna go asymptotic.


Code hard, but don’t hard code...

Share this post


Link to post
Share on other sites
8 hours ago, JockoDundee said:

Though, loading aside, at some number of rows, it might beat the colliding hashes, perhaps?

Yes of course. There are no collisions because it's index is a sorted (as a b-tree) structure with one element per index.
As the number of elements increases, the number of iterations required to find a particular element also increases due to the levels in the tree.
However, this is quite harmless and should still work quickly even with a really large number of elements.

5 hours ago, JockoDundee said:

the 2D array seems to be holding well at the end, whereas the Map looks like it’s gonna go asymptotic.

Really? Map looks also really stable for me. Hm - ok this is just my subjective view based on the optics. I did'nt made a statistical test for linearity (btw i made a statistic-udf to realize such tests 😅 )

To get deeper into the Functionality of hash-tables i wrote a hash-table udf in pure AutoIt as an exercise for myself. Here i learned, that you need a good balance between maximum number of collisions per index and as less as possible resizing when the number of elements increase. If the base-array is big enough, then a big number of collisions is unlikely. Because there are such a lot of optimization parameters - not only the used hash-function - i slowly begin to understand why the 3 hash-tables here perform so different.

Share this post


Link to post
Share on other sites

AspirinJunkie,

I went back looking at some old private threads and found that AutoIt Maps currently use the "djb2" hash function and a 256 entry hash table.

M23


Public_Domain.png.2d871819fcb9957cf44f4514551a2935.png Any of my own code posted anywhere on the forum is available for use by others without any restriction of any kind

Open spoiler to see my UDFs:

Spoiler

ArrayMultiColSort ---- Sort arrays on multiple columns
ChooseFileFolder ---- Single and multiple selections from specified path treeview listing
Date_Time_Convert -- Easily convert date/time formats, including the language used
ExtMsgBox --------- A highly customisable replacement for MsgBox
GUIExtender -------- Extend and retract multiple sections within a GUI
GUIFrame ---------- Subdivide GUIs into many adjustable frames
GUIListViewEx ------- Insert, delete, move, drag, sort, edit and colour ListView items
GUITreeViewEx ------ Check/clear parent and child checkboxes in a TreeView
Marquee ----------- Scrolling tickertape GUIs
NoFocusLines ------- Remove the dotted focus lines from buttons, sliders, radios and checkboxes
Notify ------------- Small notifications on the edge of the display
Scrollbars ----------Automatically sized scrollbars with a single command
StringSize ---------- Automatically size controls to fit text
Toast -------------- Small GUIs which pop out of the notification area

 

Share this post


Link to post
Share on other sites
13 hours ago, Melba23 said:

AutoIt Maps currently use the "djb2" hash function

Thank you - i think i will play a little bit with this.
Funnily enough, I once stumbled across a collision generator for djb2 where the following statement was made about DJB2:
"So never ever use this function in hash table." 😅

13 hours ago, Melba23 said:

AutoIt Maps currently use [...] a 256 entry hash table.

Fixed size? - no resizing?
That would explain the determined behavior.
I think linked lists are used for the sub-structures - right?
Even with a perfect hash/index-function the result is, that when you get an element we have an average linear scan over 2 elements at size 1,000; 20 elements at size 10,000; 200 elements at size 100,000 and so on. So if you read every single element of such a hash-table with 1,000,000 elements you have to iterate over at least 2 billion elements... 

There are many reasons for this concrete implementation. But to know this special detail helps me a lot to understand why it behaves as we see.
I suspect, that scripting.dictionary takes a quite similar approach - but with a bigger hash-table size. (as stated >>here<< scripting.dictionary shall do resizes)
And system.collections.hashtable - puh - either the size is also fixed but significantly larger or they dynamically resize there.

Edited by AspirinJunkie
found one source, that stated that scripting.dictionary do resizes

Share this post


Link to post
Share on other sites

AspirinJunkie,

I am not a Dev and so have no further knowledge of why or how this particular function or table size were chosen. They only came up in a much wider-ranging private conversation while discussing Maps.

M23


Public_Domain.png.2d871819fcb9957cf44f4514551a2935.png Any of my own code posted anywhere on the forum is available for use by others without any restriction of any kind

Open spoiler to see my UDFs:

Spoiler

ArrayMultiColSort ---- Sort arrays on multiple columns
ChooseFileFolder ---- Single and multiple selections from specified path treeview listing
Date_Time_Convert -- Easily convert date/time formats, including the language used
ExtMsgBox --------- A highly customisable replacement for MsgBox
GUIExtender -------- Extend and retract multiple sections within a GUI
GUIFrame ---------- Subdivide GUIs into many adjustable frames
GUIListViewEx ------- Insert, delete, move, drag, sort, edit and colour ListView items
GUITreeViewEx ------ Check/clear parent and child checkboxes in a TreeView
Marquee ----------- Scrolling tickertape GUIs
NoFocusLines ------- Remove the dotted focus lines from buttons, sliders, radios and checkboxes
Notify ------------- Small notifications on the edge of the display
Scrollbars ----------Automatically sized scrollbars with a single command
StringSize ---------- Automatically size controls to fit text
Toast -------------- Small GUIs which pop out of the notification area

 

Share this post


Link to post
Share on other sites

It would be interesting to find out what F14 gives: https://engineering.fb.com/2019/04/25/developer-tools/f14/


This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)

Share this post


Link to post
Share on other sites

@JockoDundee i added sqlite. The results are as expected because the scenario here is horrible for sqlite.
In real life nobody would (at least I hope so) use single inserts and selects for such a amount of elements if it's possible to avoid.
That's why i also added the results for one single insert for all elements just to show the huge difference, even if it doesn't match the actual scenario here.

12 hours ago, Melba23 said:

I am not a Dev and so have no further knowledge of why or how this particular function or table size were chosen.

I already realized that - but this forum is named "technical discussion" - so why not discuss even if we are not the devs.
And your hint already gave me a much further understanding. I learned - that was my intention here - so many thanks to you!

@jchd
Oh that looks interesting. Especially the discussion of which parameters of a hash-table have what effect.
The library is C++ - i don't think it is useful to write a wrapper for it and use it in AutoIt.
But maybe i make some tests with it in C++ sometime.

Edit: I said i wrote my own hash-table in pure AutoIt for a better understanding. I add this structure to the comparison and i am quite surprised how well this one does. It's even better as AutoIt-maps and scripting.dictionary (only reading) at 1,000,000 elements.

Edited by AspirinJunkie

Share this post


Link to post
Share on other sites

Of course SQLite is slower than any other {key, value} library since its overhead for enforcing ACID properties is paramount. Bulk inserts can be made faster by using the following guidelines:

  • Use a memory DB
  • If not possible to use a Memory DB, enclose all inserts in one exclusive transaction
  • Do not add indices and triggers to your target table, do that after all inserts
  • Have your input data cleaned up so you can spare constraints
  • Group new rows in chained insert statements: insert into T ... values (...), (...), (...), ..., (...)
  • Raise memory and cache limits as fit
  • Some fine-tune pragmas can speed things up

Even with these steps SQLite can't compete with other libraries on the bulk insert test; the overhead of invoking primitives thru DllCall makes things worse. SQLite isn't bad, it's only poorly suited for the job.


This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)

Share this post


Link to post
Share on other sites
4 hours ago, jchd said:

Of course SQLite is slower than any other {key, value} library since its overhead for enforcing ACID properties is paramount.

It’s my fault SQLite was invited to the party.  The question wasn’t if it would be slower, just how much it would be slower.  And to see at what size that the performance might converge (10,000,000 rows?) Useful also as a sanity check on the other values, too. 

That said I agree there may be things to improve its performance.

4 hours ago, jchd said:

  • Use a memory DB
  • Have your input data cleaned up so you can spare constraints
  • Group new rows in chained insert statements: insert into T ... values (...), (...), (...), ..., (...)
  • Raise memory and cache limits as fit
  • Some fine-tune pragmas can speed things up

Agree.  

4 hours ago, jchd said:

If not possible to use a Memory DB, enclose all inserts in one exclusive transaction

Disagree.  Not fair to the one at a time loading scenario set forth by AJ. Also, not clear that it would benefit performance to be part of a million row transaction, since that would imply substantial journaling writes, in order to rollback.

4 hours ago, jchd said:

Do not add indices and triggers to your target table, do that after all inserts

Agree it’s faster, disagree it’s allowed by the loading rules.

Another thing that *may* speed up SQLite is the use of the WITHOUT ROWID clause, see here.

 

 


Code hard, but don’t hard code...

Share this post


Link to post
Share on other sites
8 minutes ago, JockoDundee said:

Also, not clear that it would benefit performance to be part of a million row transaction, since that would imply substantial journaling writes, in order to rollback.

Try it on a sample and you'll see 😉
Divide your bulk insert in, say, chunks of 10k rows and compare with 10k individual inserts...
 

9 minutes ago, JockoDundee said:

Another thing that *may* speed up SQLite is the use of the WITHOUT ROWID clause

Most probably not. In a w/o rowid table you must declare a primary key by your own and either it's an integer (making no difference with a rowid) or something else which is going to be slower to handle than an integer.


This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)

Share this post


Link to post
Share on other sites
4 hours ago, jchd said:

Bulk inserts can be made faster by using the following guidelines:

The results in the first post already based on these (code is attached). I used a in-memory db with index creation after all elements are inserted (beware not to define a primary key - this would lead to an implicit index). Also the cache size is increased.

4 hours ago, jchd said:

SQLite isn't bad, it's only poorly suited for the job.

my words

Share this post


Link to post
Share on other sites
2 hours ago, jchd said:

Try it on a sample and you'll see 😉
Divide your bulk insert in, say, chunks of 10k rows and compare with 10k individual inserts...

No need to test, I wouldn’t “include all inserts in one exclusive transaction” or a bunch of 10,000 row transactions.

Both are way slower than just turning off transactions altogether.


Code hard, but don’t hard code...

Share this post


Link to post
Share on other sites

Are you sure? ;-))

#include <SQLite.au3>

Const $SQLITE_DLL = "C:\SQLite\bin\sqlite3.dll" ;<-- Change to the location of your sqlite dll

; Init sqlite
_SQLite_Startup($SQLITE_DLL, False, 1)
If @error Then Exit MsgBox($MB_ICONERROR, "SQLite Error", "Unable to start SQLite. Check existence of DLL")
OnAutoItExitRegister(_SQLite_Shutdown)

ConsoleWrite("SQlite version " & _SQLite_LibVersion() & @LF & @LF)    ; optional!

Local $hDB = _SQLite_Open()
If @error Then Exit
OnAutoItExitRegister(_SQ3Close)


_SQLite_Exec($hDB, "create table T (k integer primary key, v char)")
Local $t = TimerInit()
For $i = 1 To 10000
    _SQLite_Exec($hDB, "insert into T values (" & $i & ", 'abcdef')")
Next
Local $dt = TimerDiff($t)
ConsoleWrite("10k inserts, no transaction, individual inserts took " & $dt / 1000 & "s" & @LF)
_SQLite_Exec($hDB, "drop table T")


_SQLite_Exec($hDB, "create table T (k integer primary key, v char)")
$t = TimerInit()
_SQLite_Exec($hDB, "begin exclusive")
For $i = 1 To 10000
    _SQLite_Exec($hDB, "insert into T values (" & $i & ", 'abcdef')")
Next
_SQLite_Exec($hDB, "end")
Local $dt = TimerDiff($t)
ConsoleWrite("10k inserts, one transaction, individual inserts took " & $dt / 1000 & "s" & @LF)
_SQLite_Exec($hDB, "drop table T")


_SQLite_Exec($hDB, "create table T (k integer primary key, v char)")
$t = TimerInit()
Local $s = "insert into T values (1, 'abcdef')"
For $i = 2 To 10000
    $s &= ",(" & $i & ", 'abcdef')"
Next
_SQLite_Exec($hDB, $s)
Local $dt = TimerDiff($t)
ConsoleWrite("10k inserts, one (implicit) transaction, chained inserts took " & $dt / 1000 & "s" & @LF)
_SQLite_Exec($hDB, "drop table T")

Func _SQ3Close()
    _SQLite_Close($hDB)
EndFunc   ;==>_SQ3Close
SQlite version 3.32.3

10k inserts, no transaction, individual inserts took 5.9761773s
10k inserts, one transaction, individual inserts took 5.6808805s
10k inserts, one (implicit) transaction, chained inserts took 0.0446869s

The beta is even much faster (I added test with a Map):

SQlite version 3.32.3

10k inserts, no transaction, individual inserts took 1.2966753s
10k inserts, one transaction, individual inserts took 1.2246717s
10k inserts, one (implicit) transaction, chained inserts took 0.0381699s
10k inserts in a Map took 0.0103734s

 

Edited by jchd

This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)

Share this post


Link to post
Share on other sites
1 hour ago, jchd said:

Are you sure? ;-))

what i meant by "transactions off" was this:

SQlite version 3.27.2

PRAGMA journal_mode=OFF
100k inserts, no transaction, individual inserts took 8.3460208s
100k inserts, one transaction, individual inserts took 8.2206863s
100k inserts, one (implicit) transaction, chained inserts took 0.2979496s

SQlite version 3.27.2

WITHOUT PRAGMA journal_mode=OFF
100k inserts, no transaction, individual inserts took 8.5634186s
100k inserts, one transaction, individual inserts took 8.8829882s
100k inserts, one (implicit) transaction, chained inserts took 0.3040778s

only a minor difference addmitedly.

I bumped the 10K to 100K to event out the averaging a bit.

Not familiar with "chained inserts", is it standard SQL?  Regardless, it looks to be quit a winner!


Code hard, but don’t hard code...

Share this post


Link to post
Share on other sites
12 hours ago, JockoDundee said:

what i meant by "transactions off" was this:

This parameter (also PRAGMA SYNCHRONOUS=OFF) has an effect mainly on file-databases. For in-memory databases like this one, its effect disappears.
Change the sqlite-open statement to this one: _SQLite_Open(@DesktopDir & "\Test.db") and you will see the effect.

12 hours ago, JockoDundee said:

Not familiar with "chained inserts", is it standard SQL? 

I do not have the standard here but I firmly assume yes. This construct can be found in almost all big databases (except oracle - they have an extra "insert all" statement). And since PostgreSQL has this as well, there is a lot to be said for it being in the standard as well.

Edited by AspirinJunkie

Share this post


Link to post
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

  • Recently Browsing   0 members

    No registered users viewing this page.

×
×
  • Create New...