Concurrent Hash Tables: Fast and General(?)!

Tobias Maier¹, Peter Sanders¹, and Roman Dementiev²

¹ Karlsruhe Institute of Technology, Karlsruhe, Germany, {t.maier,sanders}@kit.edu
² Intel Deutschland GmbH, [email protected]

Abstract

Concurrent hash tables are among the most important concurrent data structures, used in numerous applications. Since hash table accesses can dominate the execution time of whole applications, we need implementations that achieve good speedup even in these cases. Unfortunately, currently available concurrent hashing libraries fall far short of this requirement, in particular when adaptively sized tables are necessary or contention on some elements occurs.

Our starting point for better performing data structures is a fast and simple lock-free concurrent hash table based on linear probing that is, however, limited to word-sized key-value types and does not support dynamic size adaptation. We explain how to lift these limitations in a provably scalable way and demonstrate that dynamic growing has a performance overhead comparable to the same generalization in sequential hash tables.

We perform extensive experiments comparing the performance of our implementations with six of the most widely used concurrent hash tables. Ours are considerably faster than the best algorithms with similar restrictions and an order of magnitude faster than the best more general tables. In some extreme cases, the difference even approaches four orders of magnitude.

Category: [D.1.3] Programming Techniques – Concurrent Programming; [E.1] Data Structures – Tables; [E.2] Data Storage Representation – Hash-table representations

Terms: Performance, Experimentation, Measurement, Design, Algorithms

Keywords: Concurrency, dynamic data structures, experimental analysis, hash table, lock-freedom, transactional memory

1 Introduction

A hash table is a dynamic data structure which stores a set of elements that are accessible by their key. It supports insertion, deletion, find, and update in constant expected time. In a concurrent hash table, multiple threads have access to the same table. This allows threads to share information in a flexible and efficient way. Therefore, concurrent hash tables are one of the most important concurrent data structures. See Section 4 for a more detailed discussion of concurrent hash table functionality.

To show the ubiquity of hash tables we give a short list of example applications: A very simple use case is storing sparse sets of precomputed solutions (e.g. [27], [3]). A more complicated one is aggregation as it is frequently used in analytical database queries of the form SELECT FROM... COUNT... GROUP BY x [25]. Such a query selects rows from one or several relations and counts for every key x how many rows have been found (similar queries work with SUM, MIN, or MAX). Hashing can also be used for a database join [5]. Another group of examples is the exploration of a large combinatorial search space where a hash table is used to remember the already explored elements (e.g., in dynamic programming [36], itemset mining [28], a chess program, or when exploring an implicitly defined graph in model checking [37]). Similarly, a hash table can maintain a set of cached objects to save I/Os [26]. Further examples are duplicate removal, storing the edge set of a sparse graph in order to support edge queries [23], maintaining the set of nonempty cells in a grid data structure used in geometry processing (e.g. [7]), or maintaining the children in tree data structures such as van Emde Boas search trees [6] or suffix trees [21].

Many of these applications have in common that – even in the sequential version of the program – hash table accesses constitute a significant fraction of the running time. Thus, it is essential to have highly scalable concurrent hash tables that actually deliver significant speedups in order to parallelize these applications. Unfortunately, currently available general purpose concurrent hash tables do not offer the needed scalability (see Section 8 for concrete numbers). On the other hand, it seems to be folklore that a lock-free linear probing hash table where keys and values are machine words, which is preallocated to a bounded size, and which supports no true deletion operation can be implemented using atomic compare-and-swap (CAS) instructions [36]. Find-operations can even proceed naively and without any write operations. In Section 4 we explain our own implementation (folklore) in detail, after elaborating on some related work and introducing the necessary notation (in Sections 2 and 3, respectively). To see the potential big performance differences, consider an exemplary situation with mostly read-only access to the table and heavy contention for a small number of elements that are accessed again and again by all threads. folklore actually profits from this situation because the contended elements are likely to be replicated into local caches. On the other hand, any implementation that needs locks or CAS instructions for find-operations will become much slower than the sequential code on current machines. The purpose of our paper is to document and explain performance differences and, more importantly, to explore to what extent we can make folklore more general with an acceptable deterioration in performance.

These generalizations are discussed in Section 5. We explain how to grow (and shrink) such a table, and how to support deletions and more general data types. In Section 6 we explain how hardware transactional memory can be used to speed up insertions and updates and how it may help to handle more general data types.

After describing implementation details in Section 7, Section 8 experimentally compares our hash tables with six of the most widely used concurrent hash tables on microbenchmarks including insertion, finding, and aggregating data. We look at both uniformly distributed and skewed input distributions. Section 9 summarizes the results and discusses possible lines of future research.

2 Related Work

This publication follows up on our previous findings about generalizing fast concurrent hash tables [18]. In addition to describing how to generalize a fast linear probing hash table, we offer an extensive experimental analysis comparing many concurrent hash tables from several libraries.

There has been extensive previous work on concurrent hashing. The widely used textbook "The Art of Multiprocessor Programming" [12] by Herlihy and Shavit devotes an entire chapter to concurrent hashing and gives an overview of previous work. However, it seems to us that a lot of previous work focuses more on concepts and correctness but surprisingly little on scalability. For example, most of the discussed growing mechanisms assume that the size of the hash table is known exactly, without discussing that this introduces a performance bottleneck limiting the speedup to a constant. Similarly, the actual migration is often done sequentially.

Stivala et al. [36] describe a bounded concurrent linear probing hash table specialized for dynamic programming that only supports insert and find. Their insert operation starts from scratch when the CAS fails, which seems suboptimal in the presence of contention. An interesting point is that they need only word-size CAS instructions at the price of reserving a special empty value. This technique could also be adapted to port our code to machines without 128-bit CAS.

Kim and Kim [14] compare this table with a cache-optimized lockless implementation of hashing with chaining and with hopscotch hashing [13]. The experiments use only uniformly distributed keys, i.e., there is little contention. Both linear probing and hashing with chaining perform well in that case. The evaluation of find-performance is a bit inconclusive: chaining wins, but uses more space than linear probing. Moreover, it is not specified whether this is for successful (using keys of inserted elements) or mostly unsuccessful (generating fresh keys) accesses. We suspect that varying these parameters could reverse the result.

Gao et al. [10] present a theoretical dynamic linear probing hash table that is lock-free. The main contribution is a formal correctness proof. Not all details of the algorithm, or even an implementation, are given. There is also no analysis of the complexity of the growing procedure.

Shun and Blelloch [34] propose phase-concurrent hash tables which are allowed to use only a single operation within a globally synchronized phase. They show how phase concurrency helps to implement some operations more efficiently and even deterministically in a linear probing context. For example, deletions can adapt the approach from [15] and rearrange elements. This is not possible in a general hash table since it might cause find-operations to report false negatives. They also outline an elegant growing mechanism, albeit without implementing it and without filling in all the details, like how to initialize newly allocated tables. They propose to trigger a growing operation when any operation has to scan more than k log n elements, where k is a tuning parameter. This approach is tempting since it is somewhat faster than the approximate size estimator we use. We actually tried that but found that this trigger has a very high variance – sometimes it triggers late, making operations rather slow; sometimes it triggers early, wasting a lot of space. We also have theoretical concerns, since the bound k log n on the length of the longest probe sequence implies strong assumptions on certain properties of the hash function. Shun and Blelloch make extensive experiments including applications from the problem based benchmark suite [35].

Li et al. [17] use the bucket cuckoo hashing method by Dietzfelbinger and Weidling [8] and develop a concurrent implementation. They exploit that, using a BFS-based insertion algorithm, the number of element moves per insertion is very small. They use fine-grained locks which can sometimes be avoided using transactional memory (Intel TSX). As a result of their work, they implemented the small open source library libcuckoo, which we measure against (and which does not use TSX). This approach has the potential to achieve very good space efficiency. However, our measurements indicate that the performance penalty is high.

The practical importance of concurrent hash tables also leads to new and innovative implementations outside of the scientific community. A good example of this is the Junction library, which was published by Preshing [31] in the beginning of 2016, shortly after our initial publication [19].

3 Preliminaries

We assume that each application thread has its own designated hardware thread or processing core and denote the number of these threads with p. A data structure is non-blocking if no blocked thread currently accessing this data structure can block an operation on the data structure by another thread. A data structure is lock-free if it is non-blocking and guarantees global progress, i.e., there must always be at least one thread finishing its operation in a finite number of steps.

Hash Tables store a set of 〈Key, Value〉 pairs (elements).¹ A hash function h maps each key to a cell of a table (an array). The number of elements in the hash table is denoted n and the number of operations m. For the purpose of algorithm analysis, we assume that n and m are Ω(p²) – this allows us to simplify algorithm complexities by hiding O(p) terms that are independent of n and m in the overall cost. Sequential hash tables support the insertion of elements, and finding, updating, or deleting an element with a given key – all of this in constant expected time. Further operations compute n (size), build a table with a given number of initial elements, and iterate over all elements (forall).

Linear Probing is one of the most popular sequential hash table schemes used in practice. An element 〈x, a〉 is stored at the first free table entry following position h(x) (wrapping around when the end of the table is reached). Linear probing is at the same time simple and efficient – if the table is not too full, a single cache line access will be enough most of the time. Deletion can be implemented by rearranging the elements locally [15] to avoid holes violating the invariant mentioned above. When the table becomes too full or too empty, the elements can be migrated to a larger or smaller table, respectively. The migration cost can be charged to insertions and deletions, causing amortized constant overhead.

¹ Much of what is said here can be generalized to the case where elements are black boxes from which keys are extracted by an accessor function.


4 Concurrent Hash Table Interface and Folklore Implementation

Although it seems quite clear what a hash table is and how this generalizes to concurrent hash tables, there is a surprising number of details to consider. Therefore, we will quickly go over some of our interface decisions and detail how this interface can be implemented in a simple, fast, lock-free concurrent linear probing hash table.

This hash table will have a bounded capacity c that has to be specified when the table is constructed. It is the basis for all other hash table variants presented in this publication. We call this table the folklore solution, because variations of it are used in many publications and it is not clear to us by whom it was first published.

The most important requirement for concurrent data structures is that they should be linearizable, i.e., it must be possible to order the hash table operations in some sequence – without reordering two operations of the same thread – so that executing them sequentially in that order yields the same results as the concurrent processing. For a hash table data structure, this basically means that all operations should be executed atomically at some time between their invocation and their return. For example, it has to be avoided that a find returns an inconsistent state, e.g., a half-updated data field that was never actually stored at the corresponding key.

Our variant of the folklore solution ensures the atomicity of operations using 2-word atomic CAS operations for all changes of the table. As long as the key and the value each use only one machine word, we can use 2-word CAS operations to atomically manipulate a stored key together with the corresponding value. There are other variants that avoid 2-word compare-and-swap operations, but they often need a designated empty value (see [31]). Since the corresponding machine instructions are widely available on modern hardware, using them should not be a problem. If the target architecture does not support the needed instructions, the implementation can easily be switched to a variant of the folklore solution which does not use 2-word CAS. As it can easily be deduced from the context, we will usually omit the "2-word" prefix and use the abbreviation CAS for both single and double word CAS operations.
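To make this concrete, the following is a minimal sketch (not the authors' exact code; all names are illustrative) of a 16-byte cell manipulated with a 2-word CAS via the GCC/Clang __uint128_t extension, which compiles to cmpxchg16b on x86-64 (with -mcx16):

    #include <atomic>
    #include <cstdint>
    #include <cstring>

    struct Cell { uint64_t key; uint64_t data; };

    struct alignas(16) AtomicCell {        // 16-byte alignment required for cmpxchg16b
        std::atomic<__uint128_t> word;

        Cell load() const {
            __uint128_t w = word.load(std::memory_order_acquire);
            Cell c;
            std::memcpy(&c, &w, sizeof c);
            return c;
        }
        bool cas(Cell expected, Cell desired) {  // the 2-word CAS used for all table changes
            __uint128_t e, d;
            std::memcpy(&e, &expected, sizeof e);
            std::memcpy(&d, &desired, sizeof d);
            return word.compare_exchange_strong(e, d);
        }
    };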

Initialization The constructor allocates an array of size c consisting of 128-bit aligned cells whose keys are initialized to the empty value.

Modifications We propose to categorize all changes to the hash table content into one of the following three functions, which can be implemented very similarly (this does not cover deletions).

insert(k, d): Returns false if an element with the specified key is already present. Only one operation should succeed if multiple threads are inserting the same key at the same time.

update(k, d, up): Returns false if there is no value stored at the specified key; otherwise this function atomically updates the stored value to new = up(current, d). Notice that the resulting value can depend on both the current value and the input parameter d.

insertOrUpdate(k, d, up): This operation updates the current value if one is present; otherwise the given data element is inserted as the new value. The function returns true if insertOrUpdate performed an insert (key was not present), and false if an update was executed.

We chose this interface for two main reasons. It allows applications to quickly differentiate between inserting and changing an element – this is especially useful since the thread that first inserted a key can be identified uniquely. Additionally, it allows transparent, lockless updates that can be more complex than just replacing the current value (think of CAS or fetch-and-add).

The update interface using an update function deserves some special attention, as it is a novel approach compared to most interfaces we encountered during our research. Most implementations fall into one of two categories: They return mutable references to table elements – forcing the user to implement atomic operations on the data type; or they offer an update function which usually replaces the current value with a new one – making it very hard to implement atomic changes like a simple counter (find + increment + overwrite is not necessarily atomic).
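For example, assuming an interface shaped like Algorithm 1 below, an atomic per-key counter becomes a single call (this usage is our illustration, not library code); a replace-only update interface cannot express this atomically:

    // Hypothetical usage: count occurrences of 'key' atomically.
    table.insertOrUpdate(key, 1,
        [](uint64_t /*k*/, uint64_t current, uint64_t d) { return current + d; });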

In Algorithm 1 we show the pseudocode of the insertOrUpdate function.

ALGORITHM 1: Pseudocode for the insertOrUpdate operation

Input: Key k, Data Element d, Update Function up : Key × Val × Val → Val
Output: Boolean: true when a new key was inserted, false if an update occurred

     1  i = h(k);
     2  while true do
     3      i = i % c;
     4      current = table[i];
     5      if current.key == empty_key then       // Key is not present yet ...
     6          if table[i].CAS(current, 〈k, d〉) then
     7              return true
     8          else
     9              i--;
    10      else if current.key == k then          // Same key already present ...
    11          if table[i].atomicUpdate(current, d, up) then
                    // default: atomicUpdate(·) = CAS(current, up(k, current.data, d))
    12              return false
    13          else
    14              i--;
    15      i++;

The operation computes the hash value of the key and proceeds to look for an element with the appropriate key (beginning at the corresponding position). If no element matching the key is found (i.e., when an empty cell is encountered), the new element has to be inserted. This is done using a CAS operation. A failed swap can only be caused by another insertion into the same cell. In this case, we have to revisit the same cell to check whether the inserted element matches the current key. If a cell storing the same key is found, it will be updated using the atomicUpdate function. This function is usually implemented by evaluating the passed update function (up) and using a CAS operation to change the cell. In the case of multiple concurrent updates, at least one will be successful.

In our (C++) implementation, partial template specialization can be used to implement more efficient atomicUpdate variants using atomic operations – changing the default in line 11, e.g., overwrite (using a single word store) or increment (using fetch-and-add).
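A minimal sketch of this idea (simplified to overload selection on a tag type; the actual library interface may differ):

    // Generic default: evaluate up(...) and publish the result with a CAS.
    template <class Up>
    bool atomicUpdate(std::atomic<uint64_t>& slot, uint64_t k, uint64_t d, Up up) {
        uint64_t cur = slot.load();
        return slot.compare_exchange_strong(cur, up(k, cur, d));
    }

    // Tag for the common "add d to the current value" update.
    struct AddUpdate {};

    // Specialized overload: a single hardware fetch-and-add, no CAS loop needed.
    inline bool atomicUpdate(std::atomic<uint64_t>& slot, uint64_t, uint64_t d, AddUpdate) {
        slot.fetch_add(d, std::memory_order_relaxed);
        return true;   // a fetch-and-add cannot fail
    }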

The code presented in Algorithm 1 can easily be modified to implement the insert function (return false when the key is already present – line 10) and the update function (return true after a successful update – line 12, and false when the key is not found – line 5). All modification functions have a constant expected running time.

Lookup Since this folklore implementation does not move elements within the table, it would be possible for find(k) to return a reference to the corresponding element. In our experience, returning references directly tempts inexperienced programmers to operate on these references in a way that is not necessarily thread-safe. Therefore, our implementation returns a copy of the corresponding cell (〈k, d〉) if one is found (〈empty_key, ·〉 otherwise). The find operation has a constant expected running time.

Our implementation of find is somewhat non-trivial, because it is not possible to read two machine words at once using an atomic instruction². Therefore, it is possible for a cell to be changed between reading its key and reading its value – this is called a torn read. We have to make sure that torn reads cannot lead to any wrong behavior. There are two kinds of interesting torn reads: First, an empty key is read while the searched key is inserted into the same cell; in this case the element is not found (consistent, since it has not been fully inserted). Second, the element is updated between the key being read and the data being read; since the data is read second, only the newer data is read (consistent with a finished update).

² The element is not read atomically, because x86 does not support that. One could use a 2-word CAS to achieve the same effect, but this would have disastrous effects on performance when many threads try to find the same element.
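The following sketch illustrates the read order that makes torn reads harmless, assuming (for illustration only) a cell whose key and data are readable as separate atomic words:

    // Key is read strictly before data; see the two torn-read cases above.
    Cell find(uint64_t k) const {
        for (size_t i = h(k); ; ++i) {
            size_t pos = i % c;
            uint64_t key = table[pos].key.load(std::memory_order_acquire);
            if (key == empty_key) return {empty_key, 0};  // not (yet fully) inserted
            if (key == k) {
                uint64_t data = table[pos].data.load(std::memory_order_relaxed);
                return {k, data};                         // newer data on a torn read
            }
        }
    }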

Deletions The folklore solution can only handle deletions using dummy elements – called tombstones. Usually the key stored in a cell is replaced with del_key. Afterwards, the cell cannot be used anymore. This method of handling deleted elements is usually not feasible, as it does not increase the capacity for new elements. In Section 5.4 we will show how our generalizations can be used to handle tombstones more efficiently.

Bulk Operations While not often used in practice, the folklore table can be modified to support operations like buildFrom(·) (see Section 5.5) – using a bulk insertion which can be more efficient than element-wise insertion – or forall(f), which can be implemented embarrassingly parallel by splitting the table between threads.

Size Keeping track of the number of contained elements deserves special notice here, because it turns out to be significantly harder in concurrent hash tables. In sequential hash tables, it is trivial to count the number of contained elements using a single counter. The same method is possible in parallel tables using atomic fetch-and-add operations, but it introduces a massive amount of contention on one single counter, creating a performance bottleneck.

Because of this, we did not include a counting method in the folklore implementation. In Section 5.2 we show how this can be alleviated using an approximate count.

5 Generalizations and Extensions

In this section, we detail how to adapt the concurrent hash table implementation described in the previous section to be universally applicable to all hash table workloads. Most of our effort has gone into a scalable migration method that is used to move all elements stored in one table into another table. It turns out that a fast migration can solve most shortcomings of the folklore implementation (especially deletions and adaptable size).

5.1 Storing Thread-Local Data

By itself, storing thread-specific data connected to a hash table does not offer additional functionality, but it is necessary to efficiently implement some of our other extensions. Per-thread data can be used in many different ways, from counting the number of insertions to caching shared resources.

From a theoretical point of view, it is easy to store thread-specific data. The additional space is usually only dependent on the number of threads (O(p) additional space), since the stored data is often constant sized. Compared to the hash table this is usually negligible (p ≪ n < c).

Storing thread-specific data is challenging from a software design and performance perspective. Some of our competitors use a register(·) function that each thread has to call before accessing the table. This allocates some memory that can be accessed through the global hash table object.

Our solution uses explicit handles. Each thread has to create a handle before accessing the hash table. These handles can store thread-specific data, since they are not shared between threads. This is not only in line with the RAII idiom (resource acquisition is initialization [24]), it also protects our implementation from some performance pitfalls like unnecessary indirections and false sharing³. Moreover, the data can easily be deleted once the thread does not use the hash table anymore (by deleting the handle).

³ Significant slowdown created by the cache coherency protocol due to multiple threads repeatedly changing distinct values within the same cache line.

5.2 Approximating the Size

Keeping an exact count of the elements stored in the hash table can often lead to contention on one count variable. Therefore, we propose to support only an approximate size operation.

To keep an approximate count of all elements, each thread maintains a local counter of its successful insertions (using the method described in Section 5.1). Every Θ(p) such insertions, this counter is atomically added to a global insertion counter I and then reset. Contention at I can provably be made small by randomizing the exact number of local insertions accepted before adding to the global counter, e.g., between 1 and p. I underestimates the size by at most O(p²). Since we assume the size to be Ω(p²), this still means a small relative error. By adding the maximal error, we also get an upper bound for the table size.

If deletions are also allowed, we maintain a global counter D in a similar way. S = I − D is then a good estimate of the total size as long as S = Ω(p²).

When a table is migrated for growing or shrinking (see Section 5.3.1), each migration thread locally counts the elements it moves. At the end of the migration, the local counters are added up to create the initial count for I (D is set to 0).

This method can also be extended to give an exact count – in the absence of concurrent insertions/deletions. To do this, a list of all handles has to be stored at the global hash table object. A thread can then iterate over all handles, computing the actual element count.
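A sketch of the counting scheme on the insertion path (names are ours, not the library's): each handle buffers insertions locally and flushes them to the global counter after a randomized number of insertions in 1..p.

    struct Handle {
        std::atomic<size_t>* globalI;  // global insertion counter I
        size_t localI = 0;             // successful insertions since the last flush
        size_t threshold;              // randomized in [1, p] to reduce contention at I

        void countInsert() {
            if (++localI >= threshold) {
                globalI->fetch_add(localI, std::memory_order_relaxed);
                localI = 0;
            }
        }
    };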

5.3 Table Migration

While Gao et al. [10] have shown that lock-free dynamic linear probing hash tables are possible, there is no result on their practical feasibility. Our focus is geared more towards engineering the fastest migration possible; therefore, we are fine with small amounts of locking, as long as it improves the overall performance.

5.3.1 Eliminating Unnecessary Contention from the Migration

If the table size is not fixed, it makes sense to assume that the hash function h yields a large pseudorandom integer which is then mapped to a cell position in 0..c − 1, where c is the current capacity.⁴ We will discuss a way to do this by scaling: if h yields values in the global range 0..U − 1, we map key x to cell h_c(x) := ⌊h(x) · c / U⌋. Note that when both c and U are powers of two, the mapping can be implemented by a simple shift operation.

⁴ We use x..y as a shorthand for x, . . . , y in this paper.
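Concretely, with U = 2⁶⁴ and c = 2^k, h_c(x) = ⌊h(x) · c / U⌋ is just the k most significant bits of the 64-bit hash, so the mapping is a single shift (a minimal illustration under these assumptions):

    inline size_t scaleToCell(uint64_t hash, unsigned logCapacity) {
        return hash >> (64 - logCapacity);  // keep the top logCapacity bits
    }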

Growing Now suppose that we want to migrate the table into a table that has at least the same size (growing factor γ ≥ 1). Exploiting the properties of linear probing and our scaling function, there is a surprisingly simple way to migrate the elements from the old table to the new table in parallel which results in exactly the same order a sequential algorithm would produce and completely avoids synchronization between threads.

Lemma 1. Consider a range a..b of nonempty cells in the old table with the property that the cells (a − 1) mod c and (b + 1) mod c are both empty – call such a range a cluster (see Figure 1a). When migrating a table, sequential migration will map the elements stored in that cluster into the range ⌊γa⌋..⌊γ(b + 1)⌋ in the target table, regardless of the rest of the source array.

Proof. Let x be an element stored in the cluster a..b at position p(x) = h_c(x) + d(x). Then h_c(x) has to be in the cluster a..b, because linear probing does not displace elements over empty cells. From h_c(x) = ⌊h(x) · c / U⌋ ≥ a it follows that h(x) · c′ / U ≥ a · c′ / c ≥ γa. Similarly, from ⌊h(x) · c / U⌋ ≤ b it follows that h(x) · c / U < b + 1, and therefore h(x) · c′ / U < γ(b + 1).

Therefore, two distinct clusters in the source table cannot overlap in the target table. We can exploit this lemma by assigning entire clusters to migrating threads, which can then process each cluster completely independently. Distributing clusters between threads can easily be achieved by first splitting the table into blocks (regardless of the table's contents) which we assign to threads for parallel migration. A thread assigned a block d..e will migrate those clusters that start within this range – implicitly moving the block borders to free cells (as seen in Figure 1b). Since the average cluster length is short and c = Ω(p²), it is sufficient to deal out blocks of size Ω(p) using a single shared global variable and atomic fetch-and-add operations. Additionally, each thread is responsible for initializing all cells in its region of the target table. This is important because sequentially initializing the hash table can quickly become infeasible.
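A sketch of the deal-out loop (helper functions are hypothetical placeholders): a single shared atomic counter hands out fixed-size blocks, and each thread migrates exactly those clusters that start inside its block.

    std::atomic<size_t> nextBlock{0};
    constexpr size_t blockSize = 4096;            // the block size used in Section 7

    void migrationWorker(size_t oldCapacity) {
        for (;;) {
            size_t begin = nextBlock.fetch_add(blockSize);
            if (begin >= oldCapacity) break;
            size_t end = std::min(begin + blockSize, oldCapacity);
            initializeTargetRegion(begin, end);    // each thread initializes its share
            migrateClustersStartingIn(begin, end); // a cluster may extend past 'end'
        }
    }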

Note that waiting for the last thread at the endof the migration introduces some waiting (locking),but this does not create significant work imbalance,since the block/cluster migration is really fast andclusters are expected to be short.

Figure 1: Cluster migration and work distribution. (a) Two neighboring clusters and their non-overlapping target areas (γ = 2). (b) Left: table split into even blocks. Right: resulting cluster distribution (moved implicit block borders).

Shrinking Unfortunately, the nice structural Lemma 1 no longer applies. We can still parallelize the migration with little synchronization. Once more, we cut the source table into blocks that we assign to threads for migration. The scaling function maps each block a..b in the source table to a block a′..b′ in the target table. We have to be careful with rounding issues so that the blocks in the target table are non-overlapping. We can then proceed in two phases. First, a migrating thread migrates those elements that move from a..b to a′..b′. These migrations can be done in a sequential manner, since target blocks are disjoint. The majority of elements will fit into their target block. Then, after a barrier synchronization, all elements that did not fit into their respective target blocks are migrated using concurrent insertion, i.e., using atomic operations. This has negligible overhead since such elements only exist at the boundaries of blocks. The resulting allocation of elements in the target table will no longer be the same as for a sequential migration, but as long as the data structure invariants of a linear probing hash table are fulfilled, this is not a problem.

5.3.2 Hiding the Migration from the Underlying Application

To make the concurrent hash table more general and easy to use, we would like to avoid all explicit synchronization. The growing (and shrinking) operations should be performed asynchronously when needed, without involvement of the underlying application. A migration is triggered once the table is filled to a factor ≥ α (e.g. 50 %); this is estimated using the approximate count from Section 5.2 and checked whenever the global count is updated. When a growing operation is triggered, the capacity will be increased by a factor of γ ≥ 1 (usually γ = 2). The difficulty is ensuring that this operation is done in a transparent way, without introducing any inconsistent behavior and without incurring undue overheads.

To hide the migration process from the user, two problems have to be solved. First, we have to find threads to grow the table, and second, we have to ensure that changing elements in the source table will not lead to any inconsistent states in the target table (possibly reverting changes made during the migration). Each of these problems can be solved in multiple ways. We implemented two strategies for each of them, resulting in four different variants of the hash table (mix and match).

Recruiting User-Threads A simple approach to dynamically allocating threads for growing the table is to "enslave" threads that try to perform table accesses that would otherwise have to wait for the completion of the growing process anyway. This works really well when the table is regularly accessed by all user-threads, but it is inefficient in the worst case when most threads stop accessing the table at some point, e.g., waiting for the completion of a global computation phase at a barrier. The few threads still accessing the table at this point will need a lot of time for growing (up to Ω(n)) while most threads are waiting for them. One could try to also enslave waiting threads, but it looks difficult to do this in a sufficiently general and portable way.

Using a Dedicated Thread Pool A provably efficient approach is to maintain a pool of p threads dedicated to growing the table. They are blocked until a growing operation is triggered. This is when they are awoken to collectively perform the migration in time O(n/p) and then go back to sleep. During a migration, application threads might have to sleep until the migration threads are finished. This will increase the CPU time of our migration threads, making this method nearly as efficient as the enslavement variant. Using a reasonable computation model, one can show that using thread pools for migration increases the cost of each table access by at most a constant in a globally amortized sense (over the non-growing folklore solution). We omit the relatively simple proof.

To remain fair to all competitors, we used exactly as many threads for the thread pool as there were application threads accessing the table. Additionally, each migration thread was bound to a core that was also used by one corresponding application thread.

Marking Moved Elements for Consistency (asynchronous) During the migration it is important that no element can be changed in the old table after it has been copied to the new table. Otherwise, it would be hard to guarantee that changes are correctly applied to the new table. The easiest solution to this problem is to mark each cell before it is copied. Marking each cell can be done using a CAS operation to set a special marked bit which is stored in the key. In practice, this reduces the possible key space; if this reduction is a problem, see Section 5.6 on how to circumvent it. To ensure that no copied cell can be changed, it suffices to ensure that no marked cell can be changed. This can easily be done by checking the bit before each writing operation, and by using CAS operations for each update. This, however, prohibits the use of fast atomic operations to change element values.
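A sketch of the marking step, reusing the hypothetical AtomicCell from Section 4 and assuming the marked bit lives in the key's topmost bit (cf. Section 5.6):

    constexpr uint64_t marked_bit = 1ull << 63;

    // Returns once the cell is marked; copying may proceed only after this.
    void markCell(AtomicCell& cell) {
        for (;;) {
            Cell cur = cell.load();
            if (cur.key & marked_bit) return;     // already marked by another thread
            if (cell.cas(cur, {cur.key | marked_bit, cur.data}))
                return;                           // we set the mark
            // CAS failed: a concurrent update changed the cell; retry with its new value
        }
    }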

After the migration, the old hash table has to be deallocated. Before deallocating an old table, we have to make sure that no thread is still using it. This problem can generally be solved by using reference counting. Instead of storing the table with a plain pointer, we use a reference-counted pointer (e.g. std::shared_ptr) to ensure that the table is eventually freed.

The main disadvantage of counting pointers is that acquiring a counting pointer requires an atomic increment on a shared counter. Therefore, it is not feasible to acquire a counting pointer for each operation. Instead, a copy of the shared pointer can be stored locally, together with the increasing version number of the corresponding hash table (using the method from Section 5.1). At the beginning of each operation, we use the local version number to make sure that the local counting pointer still points to the newest table version. If this is not the case, a new pointer is acquired. This happens only once per version of the hash table. The old table will automatically be freed once every thread has updated its local pointer. Note that counting pointers cannot be exchanged in a lock-free manner, which increases the cost of changing the current table (using a lock). This lock could be avoided by using a hazard pointer; we did not do this.

Preventing Concurrent Updates to Ensure Consistency (synchronized) We propose a simple protocol inspired by read-copy-update protocols [22]. The thread t triggering the growing operation sets a global growing flag using a CAS instruction. A thread t′ performing a table access sets a local busy flag when starting an operation. Then it inspects the growing flag; if that flag is set, the local flag is unset and t′ either waits for the completion of the growing operation or helps with migrating the table, depending on the current growing strategy. Thread t waits until all busy flags have been unset at least once before starting the migration. When the migration is completed, the growing flag is reset, signaling to the waiting threads that they can safely continue their table operations. Because this protocol ensures that no thread is accessing the previous table after the beginning of the migration, the table can be freed without using reference counting.

We call this method (semi-)synchronized because grow and update operations are disjoint. Threads participating in one growing step still arrive asynchronously, e.g., when the parent application calls a hash table operation. Compared to the marking-based protocol, we save cost during migration by avoiding CAS operations. However, this comes at the expense of setting the busy flags for every operation. Our experiments indicate that overall this is only advantageous for updates using atomic operations like fetch-and-add that cannot coexist with the marker flags.
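A sketch of the handshake on the accessor side (the flag layout is our assumption, not the authors' exact code): the busy flag is published before the growing flag is inspected, so the grower's wait for all busy flags to clear cannot miss an in-flight operation.

    std::atomic<bool> growing{false};  // global; set via CAS by the growing thread

    void guardedOperation(Handle& h) { // h.busy: per-handle flag, assumed here
        for (;;) {
            h.busy.store(true, std::memory_order_seq_cst);   // publish first
            if (!growing.load(std::memory_order_seq_cst))
                break;                                       // safe to proceed
            h.busy.store(false, std::memory_order_release);
            waitForOrHelpMigration();                        // hypothetical helper
        }
        // ... perform the table operation on the current table ...
        h.busy.store(false, std::memory_order_release);
    }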


5.4 Deletions

For concurrent linear probing, we combine tombstoning (see Section 4) with our migration algorithm to clean the table once it is filled with too many tombstones.

A tombstone is an element that has del_key in place of its key. The key x of a deleted entry 〈x, a〉 is atomically changed to 〈del_key, a〉. Other table operations scan over these deleted elements like over any other nonempty entry. No inconsistencies can arise from deletions. In particular, a concurrent find-operation with a torn read will return the element before the deletion, since the delete-operation leaves the value slot a untouched. A concurrent insert 〈x, b〉 might read the key x before it is overwritten by the deletion and return false because it concludes that an element with key x is already present. This is consistent with the outcome when the insertion is performed before the deletion in a linearization.

This method of deletion can easily be implemented in the folklore solution from Section 4, but the starting capacity has to be set depending on the number of overall insertions, since this form of deletion does not free up any of the deleted cells. Even worse, tombstones will fill up the table and slow down find queries.

Both of these problems can be solved by migrating all non-tombstone elements into a new table. The decision when to migrate the table should be made solely based on the number of insertions I (= the number of nonempty cells). The count of all non-deleted elements I − D is then used to decide whether the table should grow, keep the same size (notice that γ = 1 is a special case for our optimized migration), or shrink. Either way, all tombstones are removed in the course of the element migration.

5.5 Bulk Operations

Building a hash table for n elements passed to the constructor can be parallelized using integer sorting by hash function value. This works in time O(n/p) regardless of how many times an element is inserted, i.e., sorting circumvents contention. See the work of Müller et al. [25] for a discussion of this phenomenon in the context of aggregation.

This can be generalized to processing batches of size m = Ω(n) that may even contain a mix of insertions, deletions, and updates. We outline a simple algorithm for bulk insertion that works without explicit sorting, albeit without avoiding contention. Let a denote the old size of the hash table and b the number of insertions. Then a + b is an upper bound for the new table size. If necessary, grow the table to that size or larger (see below). Finally, in parallel, insert the new elements.

More generally, processing batches of size m = Ω(n) in a globally synchronized way can use the same strategy. We outline it for the case of bulk insertions; generalization to deletions, updates, or mixed batches is possible: Integer sort the elements to be inserted by their hash key in expected time O(m/p). Among elements with the same hash value, remove all but the last. Then "merge" the batch and the hash table into a new hash table (that may have to be larger to provide space for the new elements). We can adapt ideas from parallel merging [11]: we co-partition the sorted insertion array and the hash table into corresponding pieces of size O(m/p). Most of the work can now be done on these pieces in an embarrassingly parallel way – each piece of the insertion array is scanned sequentially by one thread. Consider an element 〈x, a〉 and the previous insertion position i in the table. Then we start looking for a free cell at position max(h(x), i).

5.6 Restoring the Full Key Space

Our table uses special keys, like the empty key (empty_key) and the deleted key (del_key). Elements that actually have these keys cannot be stored in the hash table. This can easily be fixed by using two special slots in the global hash table data structure. This makes some case distinctions necessary, but should have a rather low impact on the overall performance.

One of our growing variants (asynchronous) uses a marker bit in its key field. This halves the possible key space from 2⁶⁴ to 2⁶³. To regain the lost key space, we can store the lost bit implicitly. Instead of using one hash table that holds all elements, we use two subtables t0 and t1. The subtable t0 holds all elements whose key does not have its topmost bit set, while t1 stores all elements whose key does have the topmost bit set; instead of storing the topmost bit explicitly, it is removed.

Each element can still be found in constant time, because when looking for a certain key it is immediately obvious in which table the corresponding element will be stored. After choosing the right table, comparing the 63 explicitly stored bits uniquely identifies the correct element. Notice that both empty keys have to be stored distinctly (as described above).
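A sketch of the dispatch under the layout just described: the topmost bit selects the subtable and is stripped before storing, so all 64 key bits remain representable with a 63-bit key field.

    bool insert(uint64_t key, uint64_t data) {
        bool top = key >> 63;                     // implicit bit = subtable index
        uint64_t stripped = key & ~(1ull << 63);  // the 63 explicitly stored bits
        return subtable[top].insert(stripped, data);
    }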

5.7 Complex Key and Value Types

Using CAS instructions to change the content of hash table cells makes our data structure fast, but limits its use to cases where keys and values fit into memory words. Lifting this restriction is bound to have some impact on performance, but we want to outline ways to keep this penalty small. The general idea is to replace the keys and/or values by references to the actual data.

Complex Keys To make things more concrete, we outline a way where the keys are strings and the hash table data structure itself manages space for the keys. When an element 〈s, a〉 is inserted, space for string s is allocated. The hash table stores 〈r, a〉 where r is a pointer to s. Unfortunately, we get a considerable performance penalty during table operations because looking for an element with a given key now has to follow this indirection for every key comparison – effectively destroying the advantage of linear probing over other hashing schemes with respect to cache efficiency. This overhead can be reduced by two measures: First, we can make the table bigger, thus reducing the necessary search distance – considering that the keys are large anyway, this has a relatively small impact on overall storage consumption. A more sophisticated idea is to store a signature of the key in some unused bits of the reference to the key (on modern machines, pointers actually only use 48 bits). This signature can be obtained from the master hash function h by extracting bits that were not used for finding the position in the table (i.e., the least significant digits). While searching for a key y, one can then first compare the signatures before actually making a full key comparison that involves a costly pointer dereference.
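A sketch of the signature trick under the stated 48-bit-pointer assumption (x86-64 specific and not strictly portable C++):

    // Pack a 16-bit signature from h into the unused high bits of the pointer.
    uint64_t packRef(const std::string* s, uint64_t hash) {
        uint64_t sig = hash & 0xFFFF;  // least significant bits, unused for the cell index
        uint64_t ptr = reinterpret_cast<uint64_t>(s) & ((1ull << 48) - 1);
        return ptr | (sig << 48);
    }

    // Cheap filter before the costly dereference and full key comparison.
    bool probablyEqual(uint64_t ref, uint64_t hash) {
        return (ref >> 48) == (hash & 0xFFFF);
    }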

Deletions do not immediately deallocate the space for the key because concurrent operations might still be scanning through it. The space for deleted keys can be reclaimed when the array grows. At that time, our migration protocols make sure that no concurrent table operations are going on.

The memory management is challenging, since we need high-throughput allocation for very fine-grained, variable-sized objects and a kind of garbage collection. On the positive side, we can find all the pointers to the strings using the hash function. All in all, these properties might be sufficiently unique that a carefully designed special purpose implementation is faster than currently available general purpose allocators. We outline one such approach: New strings are allocated into memory pages of size Ω(p). Each thread has one current page that is only used locally for allocating short strings. Long strings are allocated using a general purpose allocator. When the local page of a thread is full, the thread allocates a fresh page and remembers the old one on a stack. During a shrinking phase, a garbage collection is done on the string memory. This can be parallelized on a page-by-page basis. Each thread works on two pages A and B at a time, where A is a partially filled page. B is scanned and the strings stored there are moved to A (updating their pointers in the hash table). When A runs full, B replaces A. When B runs empty, it is freed. In either case, an unprocessed page is obtained to become B.

Complex Values We can take a similar approach as for complex keys – the hash table data structure itself allocates space for complex values. This space is only deallocated during migration/cleanup phases, which make sure that no concurrent table operations are affected. The find-operation only hands out copies of the values, so that there is no danger of stale data. There are now two types of update operations: one that modifies part of a complex value using an atomic CAS operation, and one that allocates an entirely new value object and performs the update by atomically setting the value reference to the new object. Unfortunately, it is not possible to use both types concurrently.

Complex Keys and Values Of course, we can combine the two approaches described above. However, in that case it will be more efficient to store a single reference to a combined key-value object together with a signature.


6 Using Hardware Memory Transactions

The biggest difference between a concurrent hash table and a sequential hash table is the use of atomic processor instructions. We use them for accessing and modifying data which is shared between threads. An additional way to achieve atomicity is the use of hardware transactional memory synchronization, introduced recently by Intel and IBM. The new instruction extensions can group many memory accesses into a single transaction. All changes from one transaction are committed at the same time; to other threads they appear to be atomic. General purpose memory transactions have no progress guarantees (i.e., they can always be aborted); therefore, they require a fall-back path implementing atomicity (a lock or an implementation using traditional atomic instructions).

We believe that transactional memory synchronization is an important opportunity for concurrent data structures. Therefore, we analyze how to efficiently use memory transactions for our concurrent linear probing hash tables. In the following, we discuss which aspects of our hash table can be improved by using restricted transactional memory as implemented in Intel Transactional Synchronization Extensions (Intel TSX).

We use Intel TSX by wrapping sequential code in a memory transaction. Since the sequential code is simpler (e.g., fewer branches, more freedom for compiler optimizations), it can outperform inherently more complex code based on (expensive 128-bit CAS) atomic instructions. As the transaction fall-back mechanism, we employ our atomic variants of the hash table operations. Replacing the insert and update functions of our specialized growing hash table with Intel TSX variants increases the throughput of our hash table by up to 28 % (see Section 8.4). Speedups like this are easy to obtain on workloads without contentious accesses (simultaneous write accesses to the same cell). Contentious write accesses lead to transaction aborts, which have a higher latency than a failed CAS. Our atomic fall-back minimizes the penalty for such scenarios compared to the classic lock-based fall-back, which causes more overhead and serialization.
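A sketch of such a wrapper using the RTM intrinsics (requires a CPU with TSX and compilation with -mrtm; insertSequential and insertAtomic are placeholders for the plain and the CAS-based code paths):

    #include <immintrin.h>

    bool insertTSX(uint64_t k, uint64_t d) {
        unsigned status = _xbegin();
        if (status == _XBEGIN_STARTED) {      // transaction is running
            bool r = insertSequential(k, d);  // simple code: plain loads and stores
            _xend();                          // commit: all writes become visible atomically
            return r;
        }
        // Transaction aborted (contention, capacity, ...): atomic fall-back path.
        return insertAtomic(k, d);
    }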

Another aspect that can be improved through the use of memory transactions is the key and value size. On current x86 hardware, there is no atomic instruction that can change words bigger than 128 bits at once. The amount of memory that can be manipulated within one memory transaction, however, can be far greater than 128 bits. Therefore, one could easily implement hash tables with complex keys and values using transactional memory synchronization. However, using atomic functions as the fall-back will then not be possible. Solutions with fine-grained locks that are only needed when the transactions actually fail are still possible.

With general purpose memory transactions, it is even possible to atomically change multiple values that are not stored consecutively. Therefore, it is possible to implement a hash table that separates the keys from the values, storing each in a separate table. In theory, this could improve the cache locality of linear probing.

Overall, transactional memory synchronizationcan be used to improve performance and to makethe data structure more flexible.

7 Implementation Details

Bounded Hash Tables. All of our implementations are constructed around a highly optimized variant of the circular bounded folklore hash table that was described in Section 4. The main performance optimization was to restrict the table size to powers of two, replacing expensive modulo operations with fast bit operations. When initializing the capacity c, we compute the lowest power of two that is still at least twice as large as the expected number of insertions n (2n ≤ c ≤ 4n).
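With c a power of two, the circular probing increment needs no modulo (a minimal illustration):

    size_t nextCell(size_t i) const { return (i + 1) & (c - 1); }  // i % c via bitmask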

We also built a second non-growing hash table variant called tsxfolklore. This variant forgoes the usual CAS operations that are used to change cells; instead, tsxfolklore uses TSX transactions to change elements in the table atomically. As described in Section 6, we use our usual atomic operations as a fallback in case a TSX transaction is aborted.

Growing Hash Tables. All of our growing hash tables use folklore or tsxfolklore to represent the current status of the hash table. When the table is approximately 60 % filled, a migration is started. With each migration, we double the capacity. The migration works in cell blocks of size 4096.


Blocks are migrated with a minimum number of atomic operations by using the cluster migration described in Section 5.3.1.

We use a user-space memory pool from Intel's TBB library to prevent a slowdown due to the remapping of virtual to physical memory (protected by a coarse lock in the Linux kernel). This improves the performance of our growing variants, especially when using more than 24 threads. By allocating memory from this memory pool, we ensure that the virtual memory we receive is already mapped to physical memory, bypassing the kernel lock.
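The paper does not show the pool code; as a rough sketch of the idea, allocations can be routed through TBB's scalable allocator, which keeps freed memory pooled in user space, so a later table allocation of similar size does not have to be remapped by the kernel again:

```cpp
#include <tbb/scalable_allocator.h>  // scalable_malloc / scalable_free
#include <cstddef>

// Table memory comes from TBB's user-space pool instead of plain malloc.
void* allocate_table(std::size_t bytes) { return scalable_malloc(bytes); }
void  release_table(void* p)            { scalable_free(p); }
```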

In Section 5.3.1 we identified two orthogonal problems that have to be solved to migrate hash tables: which threads execute the migration, and how can we make sure that copied elements cannot be changed? For each of these problems we formulated two strategies. The table can either be migrated by user threads that execute operations on the table (u), or by a pool of threads which is only responsible for the migration (p). To ensure that copied elements cannot be changed, we proposed to wait for each currently running operation, synchronizing update and growing phases (s), or to mark elements before they are copied, thus proceeding fully asynchronously (a).

These strategies can be combined, creating the following four growing hash table variants: uaGrow uses enslavement of user threads and asynchronous marking for consistency; usGrow also uses user threads, but ensures consistency by synchronizing updates and growing routines; paGrow uses a pool of dedicated migration threads and asynchronous marking of migrated entries; and psGrow combines the dedicated thread pool with the synchronized exclusion mechanism.

All of these versions can also be instantiated using the TSX-based non-growing table tsxfolklore as a basis.

8 Experimental Evaluation

We performed a large number of experiments to investigate the performance of different concurrent hash tables in a variety of circumstances (an overview of all tested hash tables can be found in Table 1). We begin by describing the tested competitors (Section 8.1; our variants are introduced in Section 7), the test environment (Section 8.2), and the test methodology (Section 8.3). Then Section 8.4 discusses the actual measurements. In Section 8.5, we conclude by summarizing our experiments and reflecting on how different generalizations affect the performance of hash tables.

8.1 Competitors

To compare our implementation to the current state of the art, we use a broad selection of other concurrent hash tables. These tables were chosen on the basis of their popularity in applications and academic publications. We split these hash table implementations into the following three groups depending on their growing functionality.

8.1.1 Efficiently Growing Hash Tables

This group contains all hash tables that are able to grow efficiently from a very small initial size. They are used in our growing benchmarks, where we initialize tables with an initial size of 4096, thus making growing necessary.

Junction Linear, Junction Grampa, and Junction Leapfrog The junction library consists of three different variants of a dynamic concurrent hash table. It was published by Jeff Preshing on GitHub [31], after our first publication on the subject ([19]). There are no scientific publications, but on his blog [32] Preshing writes some insightful posts on his implementation. In theory, junction's hash tables use an approach to growing which is similar to ours: a filled bounded hash table is migrated into a newly allocated bigger table. Although they are constructed from a similar idea, the execution seems to differ quite significantly. The junction hash tables use a quiescent-state based reclamation (QSBR) protocol for memory reclamation. Under this protocol, in order to reclaim freed hash table memory, the user has to regularly call a designated function.

Contrary to the other hash tables, we used the provided standard hash function (avalanche), because junction assumes its hash function to be invertible. Therefore, the hash function which is used for all other tables (see Section 8.3) is not usable.

The different hash tables within junction all perform different variants of open addressing. These variants are described in more detail in one of Preshing's blog posts (see [32]).

tbbHM F and tbbUM F (correspond to the TBB maps tbb::concurrent_hash_map and tbb::concurrent_unordered_map, respectively) The Threading Building Blocks [30] (TBB) library (Version 4.3 Update 6) developed by Intel is one of the most widely used libraries for shared memory concurrent programming. The two concurrent hash tables it contains behave relatively similarly in most of our tests; therefore, we sometimes only plot the results of tbbHM F. However, they differ in how accessed elements are locked and therefore behave very differently under contention.

8.1.2 Hash Tables with Limited Growing Capabilities

This group contains all hash tables that can only grow by a limited amount (a constant factor of the initial size) or become very slow when growing is required. When testing their growing capabilities, we usually initialize these tables with half their target size. This is comparable to a workload where the approximate number of elements is known but cannot be bounded strictly.

folly + (folly::AtomicHashMap) This hash table was developed at Facebook as part of their open source library folly [9] (Version 57:0). It places restrictions on key and data types similar to our folklore implementation. In contrast to our growing procedure, the folly table grows by allocating additional hash tables. This increases the cost of future queries and bounds the total growing factor to ≈ 18 (× initial size).

cuckoo (cuckoohash_map) This hash table, which uses (bucket) cuckoo hashing as its collision resolution method, is part of the small libcuckoo library (Version 1.0). It uses a fine-grained locking approach presented by Li et al. [17] to ensure consistency. Cuckoo is notable for its interesting interface, which combines easy container-style access with an update routine similar to our update interface.

RCU ×/RCU QSBR × This hash table is part of the Userspace RCU library (Version 0.8.7) [29], which brings the read-copy-update principle to userspace applications. Read-copy-update is a set of protocols for concurrent programming that are popular in the Linux kernel community [22]. The hash table uses split-ordered lists to grow in a lock-free manner, an approach proposed by Shalev and Shavit [33].

RCU uses the recommended read-copy-update variant called urcu. RCU QSBR uses a QSBR-based protocol that is comparable to the one used by the junction hash tables; it forces the user to repeatedly call a function from each participating thread. We tested both variants, but in many plots we show only RCU × because both behaved very similarly in our tests.

8.1.3 Non-Growing Hash Tables

One of the most important subjects of this publication is offering a scalable asynchronous migration for the simple folklore hash table. While this makes it usable in circumstances where bounded tables cannot be used, we want to show that even when no growing is necessary, we can compete with bounded hash tables. It is therefore reasonable to use our growing hash table even in applications where the number of elements can be bounded in a reasonable manner, as it offers graceful degradation in edge cases and improved memory usage if the bound is not reached.

Folklore Our implementation of the folklore solution described in Section 4. Notice that this hash table is the core of our growing variants. Therefore, we can immediately determine the overhead that the ability to grow places on this implementation (overhead for approximate counting and shared pointers).

Phase Concurrent This hash table implementation proposed by Shun and Blelloch [34] is designed to support only phase-concurrent accesses, i.e., no reads can occur concurrently with writes. We tested this table anyway, because several of our test instances satisfy this constraint and it showed promising running times.


Hopscotch Hash N Hopscotch hashing (ver 2.0) is one of the more popular variants of open addressing. The version we tested was published by Herlihy et al. [13] together with their original publication proposing the technique. Interestingly, the provided implementation only offers the functionality of a hash set (it cannot retrieve or update stored data). Therefore, we had to adapt some tests to account for that (insert ∼= put and find ∼= contains).

LeaHash H This hash table was designed by Lea [16] as part of Java's Concurrency Package. We obtained a C++ implementation which was published together with the hopscotch table. It was previously used for experiments by Herlihy et al. [13] and Shun and Blelloch [34]. LeaHash uses hashing with chaining, and the implementation that we use has the same hash set interface as hopscotch.

As previously described, we used hash set implementations for hopscotch hashing as well as LeaHash (they were published like this). They should easily be convertible into common hash map implementations without losing too much performance, but probably using quite a bit more memory.

8.1.4 Sequential Variants

To report absolute speedup numbers, we implemented sequential variants of growing and fixed-size tables. They do not use any atomic instructions or similar slowdowns. They outperform popular choices like Google's dense hash map significantly (80 % increased insert throughput), making them a reasonable approximation of the optimal sequential performance.

8.1.5 Color/Marker Choice

For practicality reasons, we chose not to print a legend with all of our figures. Instead, we use this section to explain the color and marker choices for our plots (see Section 8.1 and Table 1), hopefully making them more readable.

Some of the tested hash tables are part of the same library. In these cases, we use the same marker for all hash tables within that library. The different variants of the hash table are then differentiated using the line color (and the filling of the marker).

For our own tables, we mostly show the uaGrow and usGrow variants, each with its own marker.

8.2 Hardware Overview

Most of our experiments were run on a two-socket machine with Intel Xeon E5-2670 v3 processors (previously codenamed Haswell-EP). Each processor has 12 cores running at 2.3 GHz base frequency. The two sockets are connected by two Intel QPI links. Distributed over the two sockets there are 128 GB of main memory (64 GB each). The processors support Intel Hyper-Threading, AVX2, and TSX technologies.

This system runs an Ubuntu distribution with kernel version 3.13.0-91-generic. We compiled all our tests with gcc 5.2.0, using optimization level -O3 and the necessary compiler flags (e.g., -mcx16 and -msse4.2, among others).

Additionally, we executed some experiments on a 32-core 4-socket Intel Xeon E5-4640 (SandyBridge-EP) machine with 512 GB main memory (using the same operating system and compiler) to verify our findings and to show improved scalability even on 4-socket machines.

8.3 Test Methodology

Each test measures the time it takes to execute 10^8 hash table operations (strong scaling). Each data point was computed by taking the average of five separate execution times. Different tests use different hash table operations and key distributions. The keys are precomputed before the benchmark is started. Each speedup given in this section is computed as the absolute speedup over our hand-optimized sequential hash table.

The work is distributed between threads dynamically. While there is work to do, threads reserve blocks of 4096 operations to execute (using an atomic counter). This ensures a minimal amount of work imbalance, making the measurements less prone to variance.

Two executions of the same test will always use the same input keys. Most experiments are performed with uniformly random generated keys (using the Mersenne twister random number generator [20]). Since real world inputs may have recurring elements, there can be contention, which can potentially lead to performance issues.


Table 1: Overview of table functionalities.

name                plot  std. interface       growing       atomic updates   deletion  arbitrary types
uaGrow                    using handles        yes           yes              yes       –
usGrow                    "                    yes           yes              yes       –
paGrow                    "                    yes           yes              yes       –
psGrow                    "                    yes           yes              yes       –
junction linear           qsbr function        yes           only overwrite   yes       –
junction grampa           "                    yes           "                yes       –
junction leapfrog         "                    yes           "                yes       –
tbb hash map        F     yes                  yes           yes              yes       yes
tbb unordered       F     yes                  yes           yes              unsafe    yes
folly               +     yes                  const factor  yes              –         –
cuckoo                    yes                  slow          yes              yes       yes
rcu (urcu)          ×     register thread      very slow     yes              yes       yes
rcu (qsbr)          ×     qsbr function        "             yes              yes       yes
folklore                  yes                  –             yes              –         –
phase concurrent          sync phases          –             only overwrite   yes       –
hopscotch           N     yes (set interface)  –             –                yes       –
leahash             H     yes (set interface)  –             –                yes       –

To test hash table performance under contention, we use Zipf's distribution to create skewed key sequences. Using the Zipf distribution, the probability for any given key k is P(k) = 1 / (k^s · H_{N,s}), where H_{N,s} = Σ_{k=1}^{N} 1/k^s is the N-th generalized harmonic number (the normalization factor) and N is the universe size (N = 10^8). The exponent s can be altered to regulate the contention. We use the Zipf distribution because it closely models some real world inputs like natural language, natural size distributions (e.g., of firms or internet pages), and even user behavior ([2], [4], [1]). Notice that key generation is done prior to the benchmark execution so as not to influence the measurements unnecessarily (this is especially necessary for skewed inputs).
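As a sketch of such a generator (our own code and naming; the paper does not show its generator), the keys can be drawn by an inverse-CDF binary search over precomputed prefix sums of the unnormalized weights 1/k^s. The prefix table for N = 10^8 is large (one double per key), which is tolerable here because keys are generated once, before the benchmark:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <random>
#include <vector>

// Draws `count` keys from {1, ..., N} with P(k) proportional to 1 / k^s.
std::vector<uint64_t> zipf_keys(std::size_t count, std::size_t N, double s,
                                std::mt19937_64& rng) {
    std::vector<double> prefix(N);
    double sum = 0.0;
    for (std::size_t k = 1; k <= N; ++k) {
        sum += 1.0 / std::pow(double(k), s);  // unnormalized weight of key k
        prefix[k - 1] = sum;                  // running prefix sum
    }
    std::uniform_real_distribution<double> uni(0.0, sum);  // normalizes implicitly
    std::vector<uint64_t> keys(count);
    for (auto& key : keys) {
        // Inverse CDF: the first prefix sum reaching the drawn value.
        key = std::lower_bound(prefix.begin(), prefix.end(), uni(rng))
              - prefix.begin() + 1;
    }
    return keys;
}
```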

As a hash function, we use two CRC32C x86 instructions with different seeds to generate the upper and lower 32 bits of each hash value. Their hardware implementation minimizes the computational overhead.
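A minimal sketch of this scheme (the seed constants are illustrative assumptions; the paper does not state its seeds):

```cpp
#include <nmmintrin.h>  // SSE4.2 CRC32C intrinsics; compile with -msse4.2
#include <cstdint>

// Two hardware CRC32C computations with different seeds supply the
// lower and upper 32 bits of each 64-bit hash value.
inline uint64_t hash64(uint64_t key) {
    uint32_t lo = uint32_t(_mm_crc32_u64(0x12345678u, key));
    uint32_t hi = uint32_t(_mm_crc32_u64(0x87654321u, key));
    return (uint64_t(hi) << 32) | lo;
}
```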

8.4 Experiments

The most basic functionality of each hash table is inserting and finding elements. The performance of many parallel algorithms depends on the scalability of parallel insertions and finds. Therefore, we begin our experiments with a thorough investigation into the scalability of these basic hash table operations.

Insert Performance We begin with the very basic test of inserting 10^8 different uniformly random keys into a previously empty hash table. For this first test, all hash tables have been initialized to the final size, making growing unnecessary. The results presented in Figure 2a show clearly that the folklore solution is optimal in this case: no migration is necessary, and the table can be initialized large enough that long search distances become very improbable. The large discrepancy between the folklore solution and all previous growable hash tables is what motivated us to work on growable hash tables in the first place. As shown in the plot, our growing hash table uaGrow loses about 10 % of performance relative to folklore (9.6× vs. 8.7× speedup). This performance loss can be explained by some overheads that are necessary for eventually growing the table (e.g., estimating the number of elements). All hash tables that have a reasonable performance (> 50 % of folklore's performance) are variants of open addressing (junction leapfrog 4.4 at p = 12, folly + 5.1, phase 8.3) that have similar restrictions on key and value types. All hash tables that can handle generic data types are severely outclassed (F, F, ×, and ×).


Figure 2: Throughput while inserting 10^8 elements into a previously empty table (legend see Table 1). (a) Insert into a pre-initialized table; (b) insert into a growing table. Both panels plot throughput in MOps/s and absolute speedup over the number of threads p.


After this introductory experiment, we take a look at the growing capability of each table. We again insert 10^8 elements into a previously empty table. This time, the table has only been initialized to hold 4096 elements (5 · 10^7 for all semi-growing tables). We can clearly see from the plots in Figure 2b that our hash table variants are significantly faster than any comparable tables. The difference becomes especially obvious once two sockets are used (> 12 cores). With more than one socket, none of our competitors achieves any significant speedups. On the contrary, many tables become slower when executed on more cores. This effect does not occur for our table.

Junction grampa is the only growing hash table, apart from our growing variants, which achieves absolute speedups higher than 2. Overall, it is still severely outperformed by our hash table uaGrow (factor 2.5×). Compared to all other tables, we achieve at least seven times the performance (descending order; using 48 threads): folly + (7.4×), junction leapfrog (7.7×), tbb hm F (9.6×), tbb um F (10.7×), junction linear (22.6×), cuckoo (61.3×), rcu × (63.2×), and rcu with qsbr × (64.5×).

The speedup in this growing instance is even better than the speedup in our non-growing tests. Overall, we reach absolute speedups of > 9× compared to the sequential version (also with growing). This is slightly better than the absolute speedup in the non-growing test (≈ 8.5), suggesting that our migration is at least as scalable as plain hash table accesses. Overall, the insert performance of our implementation behaves as one would have hoped: it performs similarly to folklore in the non-growing case, while also performing well in tests where growing is necessary.

Find Performance When looking for a key in a hash table there are two possible outcomes: either it is in the table or it is not. For most hash tables, not finding an element takes longer than finding it. Therefore, we present two distinct measurements for the two cases (Figure 3a and Figure 3b). The measurement for successful finds has been made by looking for 10^8 elements that have previously been inserted into a hash table. For the unsuccessful measurement, 10^8 uniformly random keys are searched in this same hash table.

All the measurements made for these plots were done on a preinitialized table (preinitialized before insertion). This does not make a difference for our implementation, but it has an influence on some of our competitors. All tables that grow by allocating additional tables (namely cuckoo and folly +) have significantly worse find performance on a grown table, as they can have multiple active tables at the same time (all of them have to be checked).


Figure 3: Performance and scalability of 10^8 unique find operations on a table containing 10^8 unique keys (legend see Table 1). (a) Successful finds; (b) unsuccessful finds. Both panels plot throughput in MOps/s and absolute speedup over the number of threads p.


Obviously, find workloads achieve bigger throughputs than insert-heavy workloads: no memory is changed and no coordination is necessary between processors (i.e., no atomic operations). It is interesting that find operations also seem to scale better with multiple processors. Here, our growable implementations achieve speedups of 12.8, compared to 9 in the insertion case.

When comparing the find performance between different tables, we can see that other implementations with open addressing narrow the gap towards our implementation. Especially hopscotch hashing N and the phase concurrent approach seem to perform well when finding elements. Hopscotch hashing N performs especially well in the unsuccessful case, where it outperforms all other hash tables by a significant margin. However, this has to be taken with a grain of salt, because the tested implementation only offers the functionality of a hash set (contains instead of find). Therefore, less memory is needed per element and more elements fit into one cache line, making lookups significantly more cache efficient.

For our hash tables, the performance reduction between successful and unsuccessful finds is around 20 to 23 %. The difference in absolute speedups between both cases is relatively small, suggesting that sequential hash tables suffer from the same performance penalties. The biggest difference has been measured for folly + (51 to 55 % reduced performance). Later we see that the reason for this is likely that folly + is configured to use only relatively little memory (see Figure 10). When initialized with more memory, its performance gets closer to that of the other hash tables using open addressing.

Performance under Contention Up to this point, all data sets we looked at contained uniformly random keys sampled from the whole key space. This is not necessarily the case in real world data sets. In some data sets one key might appear many times; in some, one key might even dominate the input. Access to this key's element can slow down the global progress significantly, especially if hash table operations use (fine-grained) locking to protect hash table accesses.

To benchmark the robustness of the compared hash tables on these degenerate inputs, we construct the following test setup. Before the execution, we compute a sequence of skewed keys using the Zipf distribution described in Section 8.3 (10^8 keys from the range 1..10^8). Then the table is filled with all keys from the same range 1..10^8.

For the first benchmark, we execute an update operation for each key of the skewed key sequence, overwriting its previously stored element (Figure 4a). These update operations create contending write accesses to the hash table. Note that updates perform simple overwrites, i.e., the resulting value of the element does not depend on the previous value. The hash table remains at a constant size for the whole execution, making it easy to compare different implementations independent of effects introduced through growing. In the second benchmark, we execute find operations instead of updates, thus creating contending read accesses.

For sequential hash tables, contention on some elements can have very positive effects. When one cell is visited repeatedly, its contents will be cached and future accesses will be faster. The sequential performance is shown in our figures using a dashed black line. For concurrent hash tables, contention has very different effects.

Unsurprisingly, the effects of contention differ between writing and reading operations. The reason is that multiple threads can read the same value simultaneously, but only one thread at a time can change a value (on current CPU architectures). Therefore, read accesses can profit from cache effects, much like in a sequential hash table, while write accesses are hindered by the contention. This goes so far that for workloads with high contention, no concurrent hash table can achieve the performance of a sequential table.

Apart from the slowdown caused by exclusive write accesses, there is also the additional problem of cache invalidation. When a value is repeatedly changed by different cores of a multi-socket architecture, cached copies have to be invalidated whenever this value is changed. This leads to bad cache efficiency and also to high traffic on the QPI links (the connections between sockets).

From the update measurement shown in Figure 4a it is clearly visible that the serious impact of contention begins between s = 0.85 and 0.95. Up until that point, contention has a positive effect even on update operations. For a skew between s = 0.85 and 0.95, about 1 % to 3 % of all accesses go to the most common element (key k1). This is exactly the point where 1/p ≈ P(k1) (with p = 48 threads, 1/p ≈ 2 %); therefore, on average there will be one thread changing the value of k1 at any time.

It is noteworthy that the usGrow version of our hash table is more efficient when updating than the uaGrow version. The reason for this is that uaGrow has to use 128-bit CAS operations to update elements while simultaneously making sure that the marked bit of the element has not been set before the change. This can be avoided in the usGrow variant by specializing the update method to use atomic operations on the data part of the element. This is possible because updates and grow routines cannot overlap in this variant.

The plot in Figure 4b shows that concurrent hash tables achieve performance improvements similar to sequential ones when repeatedly accessing the same elements. Our hash table can even increase its speedups over uniform access patterns; the highest speedup of uaGrow is 17.9 at s = 1.25. Since the speedup is this high, we also included scaled plots showing 5× and 10× the throughput of the sequential variant. Unfortunately, our growable variants cannot improve as much with contention as the non-growing folklore and phase concurrent tables (both 23.2 at s = 1.25). This is probably due to minor overheads compared to the folklore implementation, which become pronounced as the overall function execution time shrinks.

Overall, we see that our folklore implementation, which our growable variants are based upon, outperforms all other competitors. Our growable variant usGrow is consistently close to folklore's performance, outperforming all hash tables that have the ability to grow.

Aggregation – a common Use Case Hash tables are often used for key aggregation. The idea is that all data elements connected to the same key are aggregated using a commutative and associative function. For our test, we implemented a simple key count program. To implement the key count routine with a concurrent hash table, an insert-or-increment function is necessary. For some tables, we were not able to implement an update function where the resulting value depends on the previous value within the given interface (junction tables, rcu tables, phase concurrent, hopscotch, and leahash). This was mainly a problem of the provided interfaces and could probably be solved by reimplementing a more functional interface. For our table, this can easily be achieved with the insertOrUpdate interface, using an increment as the update function (see Section 4); a sketch follows below.
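The following sketch illustrates insert-or-increment on a linear probing table; the cell layout and names are our own assumptions, not the paper's implementation. Key 0 is reserved as the empty marker, and the count is changed with a contention-friendly fetch-and-add:

```cpp
#include <atomic>
#include <cstdint>

struct Cell {
    std::atomic<uint64_t> key{0};    // 0 = empty
    std::atomic<uint64_t> count{0};
};

// Linear probing insert-or-increment: claim an empty cell for a new key,
// or bump the count of an existing one. `mask` = capacity - 1.
void insert_or_increment(Cell* table, uint64_t mask, uint64_t k, uint64_t h) {
    for (uint64_t i = h;; ++i) {
        Cell& c = table[i & mask];
        uint64_t cur = c.key.load(std::memory_order_relaxed);
        if (cur == 0) {
            uint64_t expected = 0;
            if (c.key.compare_exchange_strong(expected, k)) {
                c.count.fetch_add(1);  // first occurrence of k
                return;
            }
            cur = expected;            // another thread claimed this cell
        }
        if (cur == k) {
            c.count.fetch_add(1);      // atomic increment, no CAS cycle
            return;
        }
        // Otherwise the cell holds a different key: probe the next cell.
    }
}
```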

The aggregation benchmark uses the same Zipf key distribution as the other contention tests. For 10^8 skewed keys, the insert-or-increment function is called.


Figure 4: Throughput executing 10^8 operations using a varying amount of skew in the key sequence (all keys were previously inserted; using 48 threads; legend see Table 1). (a) Update (overwrite); (b) successful finds (1×, 5×, and 10× sequential). The sequential performance is indicated using dashed lines; both panels plot throughput in MOps/s over the contention parameter s.

Contrary to the previous contention test, there is no pre-initialization. Therefore, the number of distinct elements in the hash table depends on the contention of the key sequence (given by s). This makes growable hash tables even more desirable, because the final size can only be guessed before the execution.

As in previous tests, we make two distinct measurements: one without growing (Figure 5a) and one with growing (Figure 5b). In the test without growing, we initialize the table with a size of 10^8 to ensure that there is enough room for all keys, even if they are all distinct. We excluded the semi-growing tables from Figure 5b, as approximating the number of unique keys can be difficult. To put the growing performance into relation, we show some non-growing tests. Growing actually costs less in the presence of contentious updates: the resulting table will be smaller than without contention, so fewer growing steps have to be amortized over the same number of operations.

The result of this measurement is clearly related to the result of the contentious overwrite test shown in Figure 4a. However, changing a value by incrementing differs slightly from overwriting it, since the updated value of an insert-or-increment depends on the previous value. In the best case, this increment can be implemented using an atomic fetch-and-add operation (i.e., usGrow, folklore, and folly +). However, this is not possible in all hash tables; sometimes dependent updates are implemented using a read-modify-CAS cycle (i.e., uaGrow) or fine-grained locking (i.e., tbb hash map F or cuckoo).

Until s = 0.85, uaGrow seems to be the more efficient option, since it has increased writing performance and the update cycle succeeds most of the time. From that point on, usGrow is clearly more efficient, because fetch-and-add behaves better under contention. For highly skewed workloads, it comes really close to the performance of our folklore implementation, which again performs best out of all implementations.

Deletion Tests As described in Section 5.4, we use migration not only to implement an efficiently growing hash table, but also to clean up the table after deletions. This way, all tombstones are removed and freed cells are reclaimed. But how does this fare against different ways of removing elements? This is what we investigate with the following benchmark.
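For reference, a sketch of the tombstone side of this scheme under our own assumptions (key 0 empty, all-ones reserved as the tombstone marker; retry logic omitted): a deletion overwrites the key with the marker so other probe sequences stay intact, and the marked cells are reclaimed later by a migration.

```cpp
#include <atomic>
#include <cstdint>

constexpr uint64_t kTombstone = ~uint64_t(0);  // reserved deletion marker

// Marks key k as deleted. The cell is not emptied, so linear probing
// sequences passing over it remain valid until the next migration.
bool erase(std::atomic<uint64_t>* keys, uint64_t mask, uint64_t k, uint64_t h) {
    for (uint64_t i = h;; ++i) {
        std::atomic<uint64_t>& slot = keys[i & mask];
        uint64_t cur = slot.load(std::memory_order_relaxed);
        if (cur == 0) return false;  // reached an empty cell: k not present
        if (cur == k) return slot.compare_exchange_strong(cur, kTombstone);
    }
}
```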

The test starts on a prefilled table (10^7 elements) and consists of 10^8 insertions, each immediately followed by a deletion. Therefore, the table remains at approximately the same size throughout the test (±p elements). All keys used in the test are generated before the benchmark execution (uniform distribution).


Figure 5: Throughput of an aggregation executing 10^8 insert-or-increment operations using skewed key distributions (using 48 threads; legend see Table 1). (a) Aggregation using a pre-initialized size of 10^8 (⇒ size = |operations|); (b) aggregation with growing, where dashed plots indicate non-growing performance. The dashed black line indicates sequential performance. Some tables are not shown because their interface does not support insert-or-increment in a convenient way.

As described in Section 8.3, all keys are stored in one array. Each insert uses an entry from this array, distributed in blocks of 4096 from the beginning. The corresponding deletion uses the key that lies 10^7 elements before the corresponding insert. The keys stored within the hash table are thus contained in a sliding window over the key array.

We constructed the test to keep a constant table size, because this allows us to test non-growing tables without significantly overestimating the necessary capacity. All hash tables are initialized with a capacity of 1.5 × 10^7; therefore, it is necessary to reclaim deleted cells to successfully execute the benchmark.

The measurements shown in Figure 6 indicate that only the phase concurrent hash table by Shun and Blelloch [34] can outperform our table. The reason for this is simple: their table performs linear probing comparable to our technique, but it does not use any tombstones for deletion. Instead, deleted cells are reclaimed immediately (possibly moving elements). This is only possible because the table does not allow concurrent lookup operations, thus removing the possibility of the so-called ABA problem (a lookup of an element while it is deleted returns wrong data if there is also a concurrent insert into the newly freed cell).

Of all remaining hash tables that support fully concurrent access, ours is clearly the fastest, even though there are other hash tables, like cuckoo and hopscotch N, that also get around full table migrations.

Mixed Insertions and Finds It can be argued that some of our tests are just micro-benchmarks which are not representative of real world workloads that often mix insertions with lookups. To address these concerns, we want to show that mixed-function workloads (i.e., combined find and insert workloads) behave similarly.

As in previous tests, we generate a key sequence for our test. Each key of this sequence is used for an insert or a find operation. Overall, we generate 10^8 keys for our benchmark. For each key, insert or find is chosen at random according to the write percentage wp. In addition to the keys used in the benchmark, we generate a small number of keys (pre = 8192 · p = 2 blocks · p) that are inserted prior to the benchmark. This ensures that the table is not empty and there are keys that can be found with lookups.

The keys used for insertions are drawn uniformly from the key space. Our goal is to pre-construct the find keys in a way that makes find operations successful and is also fair to all data structures. If all find operations targeted the pre-inserted keys, then linear probing hash tables would have an unfair advantage, because elements that are inserted early have very short probing distances, while later elements can take much longer to find.


Figure 6: Throughput in a test using deletion. Some tables have been left out of this test, because they do not support deletion with memory reclamation (legend see Table 1). By alternating between insertions and deletions, we keep the number of elements in the table at approximately 10^7. For the purpose of computing the throughput, 10^8 such alternations are executed, each counting as one operation (1 Op = insert + delete).

Therefore, any find looks for a random key that was inserted at least 8192 · p elements earlier in the key sequence. This key is usually already in the table when the find operation is called. Looking for a random inserted element is representative of the overall distribution of probing distances in the table.

Notice that this method does not strictly ensure that all search keys are already inserted. In our practical tests, we found that the number of keys which were not found was negligible for performance purposes (usually below 1000).
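A sketch of this selection with our own naming (the helper and its parameters are illustrative): each find key is drawn uniformly from the prefix of the key array that ends 8192 · p positions before the current insertion frontier.

```cpp
#include <cstdint>
#include <random>
#include <vector>

// Assumes next_insert > 8192 * p, i.e. enough keys have been inserted.
uint64_t pick_find_key(const std::vector<uint64_t>& keys,
                       std::size_t next_insert, unsigned p,
                       std::mt19937_64& rng) {
    std::size_t safe_end = next_insert - std::size_t{8192} * p;
    std::uniform_int_distribution<std::size_t> dist(0, safe_end - 1);
    return keys[dist(rng)];
}
```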

Comparable to previous tests, we test all hash tables with and without the necessity to grow the table. In the non-growing test, the size of each table is pre-initialized to c = pre + (wp · 10^8). In the growing tests, semi-growing hash tables are initialized with half that capacity.

As in previous tests, it is obvious that our non-growing linear probing hash table folklore outperforms most other tables, especially on find-heavy workloads. Overall, our hash tables behave similarly to the sequential solution, with a constant speedup of around 10×. Interestingly, the running time does not seem to be a linear function (over wp); instead, performance decreases super-linearly. One reason for this could be that for find-heavy workloads, the table remains relatively small for most of the execution. Therefore, cache effects and similar influences could play a role, since lookups only look for a small sample of elements that is already in the table.

Using Dedicated Growing Threads In Sections 5.3.2 and 7 we describe the possibility of using a pool of dedicated migration threads which grow the table cooperatively. Usually, the performance of this method does not differ greatly from the performance of the enslavement variant used throughout our testing. This can be seen in Figure 8. Therefore, we omitted these variants from most plots.

In Figure 8a one can clearly see the similarities between the variants using a thread pool and their counterparts (uaGrow ∼= paGrow and usGrow ∼= psGrow). The biggest consistent difference we found between the two options was measured during the deletion benchmark in Figure 8b. During this benchmark, insert and delete are called alternately. This keeps the actual table size constant. For our implementation, this means that there are frequent migrations of a relatively small table. This is difficult when using additional migration threads, since those threads have to be woken regularly, introducing some operating system overhead (scheduling and notification).

Using Intel TSX Technology As described in Section 6, concurrent linear probing hash tables can be implemented using Intel TSX technology to reduce the number of atomic operations. Figure 9 shows some of the results of this approach.

The implementation used in these tests changes only the operations within our bounded hash table (folklore) to use TSX transactions. Atomic fallback implementations are used when a transaction fails. We also instantiated our growing hash table variants to use the TSX-optimized table as the underlying hash table implementation.

We tested this variant with a uniform insert workload (see "Insert Performance"), because the lookup implementation does not actually need a transaction. We also show the non-TSX variant, using dashed lines, to indicate the relative performance benefits.



Figure 7: Executing 10^8 operations mixed between insertions and finds (using 48 threads; legend see Table 1). (a) Mixed insertions and finds on a pre-initialized table (wp · 10^8 + pre); (b) mixed insertions and finds on a growing table. Dashed lines indicate sequential performance (1× and 10×). Find keys are generated in a fair way that ensures that most find operations are successful.

In Figure 9a one can clearly see that TSX-optimized hash tables offer improved performance as long as growing is not necessary. Unfortunately, Figure 9b paints a different picture for instances where growing is necessary. While TSX can be used to improve the usGrow variant of our hash table, especially when using hyperthreading, it offers no performance benefits in the uaGrow variant. The reason for this is that the running time in these measurements is dominated by the table migration, which is not optimized for TSX transactions.

In theory, the migration algorithm can make use of transactions similarly to single operations. It would be interesting to see whether an optimized migration could further improve the growing instances of this test. We have not implemented such a migration, as it introduces the need for some complex parameter optimizations, e.g., partitioning the migration into smaller blocks or executing each block migration in multiple transactions. We estimate that a well-optimized TSX migration could gain performance increases on the order of those witnessed in the non-growing case.

Memory Consumption One aspect of parallel hash tables that we have not discussed so far is memory consumption. Overall, a low memory consumption is preferable, but having fewer cells means that there will be more hash collisions. This leads to longer running times, especially for unsuccessful find operations.

Most hash tables do not allow the user to set a specific table size directly. Instead, they are initialized using the expected number of elements. We use this mechanism to create tables of different sizes. Using these hash tables with different sizes, we find out how well any one hash table scales when it is given more memory. This is interesting for applications where hash table speed is more important than memory footprint (e.g., lookups into a small or medium sized hash table within an application's inner loop).

The values presented in the following plot are acquired by initializing the hash tables with different table capacities (4096, 0.5×, 1.0×, 1.25×, 1.5×, 2.0×, 2.5×, 3.0 × 10^8 expected elements; semi- and non-growing hash tables start at 0.5× and 1×, respectively). During the test, the memory consumption is measured by logging the size of each allocation and deallocation during the execution (done by replacing the allocation methods, e.g., malloc and memalign). Measurements with growing (initial capacity < 10^8) are marked with dashed lines. Afterwards, the table is filled with 10^8 elements. The plotted measurements show the throughput that can be achieved when doing 10^8 unsuccessful lookups on the preinitialized table. This throughput is plotted over the amount of allocated memory each hash table used.


Figure 8: Comparison between our regular implementation and the variant using a dedicated migration thread pool (dashed lines mark the variants enslaving user threads; legend see Table 1). (a) Insertions into a growing table; (b) alternating insertions and deletions.

Figure 9: Comparison between our regular implementation and the variant using TSX transactions in place of atomics (dashed lines are the regular variants without TSX). (a) Without growing (pre-initialized); (b) insertions with growing.


The minimum size for any hash table should be around 1.5 GiB ≈ 10^8 · (8 B + 8 B) (key and value have 8 B each). Our hash table uses a number of cells equal to the smallest power of two that is at least twice as large as the expected number of elements. In this case, this means we use 2^28 ≈ 2.7 · 10^8 cells; the table will therefore be filled to ≈ 37 % and use exactly 4 GiB. We believe that this memory usage is reasonable, especially for heavily accessed tables where performance is important. This is supported by our measurements, as all hash tables that use less memory show bad performance.

Most hash tables round the number of cells in some convenient way. Therefore, there are often multiple measurement points using the same amount of memory. As expected, using the same amount of memory will usually achieve comparable performance. Of the tested hash tables, only the folly + hash table grows linearly with the expected final size. It is also the hash table that gains the most performance from increasing its memory. This makes a lot of sense considering that it uses linear probing and is by default configured to use more than 50 % of its cells.

The plot also shows that some hash tables do not gain any performance benefits from the increased size. Most notable here are cuckoo, all variations of junction, and the urcu hash tables ×. The TBB hash tables F and F seem to use a constant amount of memory, independent of the preinitialized number of elements.


This might be a measurement error, caused by the fact that they use different memory allocation methods (not logged in our test).

There are also some things that can be learned about growing hash tables from this plot. Our migration technique ensures that our hash table has exactly the same size when growing is required as when it is preinitialized with the same number of elements. Therefore, lookup operations on the grown table take the same time as they would on a preinitialized table. This is not true for many of our competitors. All junction tables and RCU produce smaller tables when growing was used; they also suffer from a minor slowdown when using lookups on these smaller tables. Folly fares even worse: it produces a bigger table when growing is needed and still suffers from significantly worse performance.

Scalability on a 4-Socket Machine Bad performance on multi-socket workloads is a recurring theme throughout our testing. This is especially true for some of our competitors, whose 2-socket running times are often worse than their 1-socket running times. To further expand our understanding of this problem, we ran some tests on the 4-socket machine described in Section 8.2.

The test instances are generated similarly to the insert/find tests described at the beginning of this section (10^8 executed operations with uniformly random keys). The results can be seen in Figure 11a (insertions) and Figure 11b (unsuccessful finds).

Our competitor’s hash tables seem to be a lotmore effective when using only one of the four sock-ets (compared to one of two sockets on the two-socket machine). This is especially true for theLookup workload where the junction hash tablesstart out more efficient than our implementation.However this effect seems to invert once multiplesockets are used.

In the test using lookups, there seems to be a performance problem with our hash table: it scales sub-optimally on one socket. On two sockets, however, it scales significantly better.

Overall, the four-socket machine confirms our observations. None of our competitors scales well when a growing hash table is used over multiple sockets. On the contrary, using multiple sockets will generally reduce their throughput. This is not the case for our hash table. Its efficiency is reduced when using more than two sockets, but the absolute throughput at least remains stable.

8.5 The Price of Generality

Having looked at many detailed measurements, let us now try to get a bigger picture by asking which hash table performs well for specific requirements and how much performance has to be sacrificed for additional flexibility. This gives us an intuition of where performance is lost on the way to a fully general hash table. Seeing that all tested hash tables fail to scale linearly on multi-socket machines, we also try to answer the question whether concurrent hash tables are worth their overhead at all.

At the most restricted level – no growing or deletions and word-sized key and value types – we have shown that common linear probing hash tables offer the best performance (over a number of operations). Our implementation of this "folklore" solution outperforms different approaches consistently, and performs at least as well as other similar implementations (i.e., the phase concurrent approach). We also showed that this performance can be improved by using Intel TSX technology. Furthermore, we have shown that our approach to growing hash tables does not significantly affect the performance on known input sizes (table preinitialized to the correct size).

Sticking to fixed data types but allowing dynamic growing, the best data structures are our growing variants (ua-, us-, pa-, psGrow). The differences in our measurements between pool growing (pxGrow) and the corresponding variants with enslavement (uxGrow) are not very big. Growing with marking performs better than globally synchronized growing, except for update-heavy workloads. The price of growing compared to a fixed size is less than a factor of two for insertions and updates (aggregation) and negligible for find operations. Moreover, this slowdown is comparable to the slowdown experienced in sequential hash tables when growing is necessary. None of the other data structures that support growing comes even close to our data structures. For insertions and updates we are an order of magnitude faster than many of our competitors. Furthermore, only one competitor achieves speedups above one when inserting into a growing table (junction grampa).


Figure 10: For these tests, 10^8 keys are searched (unsuccessfully) in a hash table containing 10^8 elements. Prior to the setup of the benchmark, the tables were initialized with different sizes (there can be many points at one x-coordinate) (legend see Table 1). The plot shows the throughput of unsuccessful find operations over the size of the data structure in GiB.

Figure 11: Basic tests made on our 4-socket machine, consisting of four eight-core E5-4640 processors with 2.4 GHz each (codenamed Sandy Bridge) and 512 GB main memory (legend see Table 1). (a) Insertions into a growing hash table; (b) unsuccessful lookups into the filled table.


Among the tested hash tables, only TBB, cuckoo, and RCU have the ability to store arbitrary key/value-type combinations. Therefore, using arbitrary data objects with one of these hash tables can be considered to cost at least an order of magnitude in performance (TBB[arbitrary] ≤ TBB[word sized] ≈ 1/10 · xyGrow). In our opinion, this restricts the use of these data structures to situations where hash table accesses are not a computational bottleneck. For more demanding applications, the only way forward is to get rid of the general data types or of the need for concurrent hash tables altogether. We believe that the generalizations we have outlined in Section 5.7 will be able to close this gap. Actual implementations and experiments are therefore interesting future work.

Finally, let us consider the situation where we need general data types but no growing. Again, all the competitors are an order of magnitude slower for insertion than our bounded hash tables. The single exception is cuckoo, which is only five times slower for insertion and six times slower for successful reads. However, it severely suffers from contention, being an almost record-breaking factor of 5600 slower under find operations with contention.


Again, it seems that better data structures should be possible.

9 Conclusion

We demonstrate that a bounded linear probing hash table specialized to pairs of machine words has much higher performance than currently available general purpose hash tables like Intel TBB, LeaHash, or RCU-based implementations. This is not surprising from a qualitative point of view given previous publications [36, 14, 34]. However, we found it surprising how big the differences can be, in particular in the presence of contention. For example, the simple decision to require a lock for reading can decrease performance by almost four orders of magnitude.

Perhaps our main contribution is to show that integrating an adaptive growing mechanism into that data structure incurs only a moderate performance penalty. Furthermore, the migration algorithm used can also be used to implement deletions in a way that reclaims freed memory. We also explain how to further generalize the data structure to allow more general data types.

The next logical steps are to implement these further generalizations efficiently and to integrate them into an easy-to-use library that hides most of the variants from the user, e.g., using programming techniques like partial template specialization.

A further direction of research could be to look for a practical growable lock-free hash table.

Acknowledgments We would like to thank Markus Armbruster, Ingo Müller, and Julian Shun for fruitful discussions.

References

[1] Lada A. Adamic and Bernardo A. Huberman. Zipf's law and the internet. Glottometrics, 3(1):143–150, 2002.

[2] Robert L. Axtell. Zipf distribution of US firm sizes. Science, 293(5536):1818–1820, 2001.

[3] Holger Bast, Stefan Funke, Domagoj Matijevic, Peter Sanders, and Dominik Schultes. In transit to constant time shortest-path queries in road networks. In Proceedings of the Meeting on Algorithm Engineering & Experiments (ALENEX), pages 46–59, 2007.

[4] Lee Breslau, Pei Cao, Li Fan, Graham Phillips, and Scott Shenker. Web caching and Zipf-like distributions: Evidence and implications. In INFOCOM '99. Eighteenth Annual Joint Conference of the IEEE Computer and Communications Societies, volume 1, pages 126–134, March 1999.

[5] Shimin Chen, Anastassia Ailamaki, Phillip B. Gibbons, and Todd C. Mowry. Improving hash join performance through prefetching. ACM Transactions on Database Systems (TODS), 32(3):17, 2007.

[6] Roman Dementiev, Lutz Kettner, Jens Mehnert, and Peter Sanders. Engineering a sorted list data structure for 32 bit keys. In 6th Workshop on Algorithm Engineering & Experiments, pages 142–151, New Orleans, 2004.

[7] Martin Dietzfelbinger, Torben Hagerup, Jyrki Katajainen, and Martti Penttonen. A reliable randomized algorithm for the closest-pair problem. Journal of Algorithms, 25(1):19–51, 1997.

[8] Martin Dietzfelbinger and Christoph Weidling. Balanced allocation and dictionaries with tightly packed constant size bins. Theoretical Computer Science, 380(1–2):47–68, 2007.

[9] Facebook. folly version 57:0. https://github.com/facebook/folly, 2016.

[10] Hui Gao, Jan Friso Groote, and Wim H. Hesselink. Lock-free dynamic hash tables with open addressing. Distributed Computing, 18(1), 2005.

[11] Torben Hagerup and Christine Rüb. Optimal merging and sorting on the EREW-PRAM. Information Processing Letters, 33:181–185, 1989.

[12] Maurice Herlihy and Nir Shavit. The Art of Multiprocessor Programming, Revised Reprint. Elsevier, 2012.


[13] Maurice Herlihy, Nir Shavit, and Moran Tzafrir. Hopscotch hashing. In Distributed Computing, pages 350–364. Springer, 2008.

[14] Euihyeok Kim and Min-Soo Kim. Performance analysis of cache-conscious hashing techniques for multi-core CPUs. International Journal of Control & Automation (IJCA), 6(2), 2013.

[15] Donald E. Knuth. The Art of Computer Programming—Sorting and Searching, volume 3. Addison Wesley, 2nd edition, 1998.

[16] Doug Lea. Hash table util.concurrent.ConcurrentHashMap, revision 1.3. JSR-166, the proposed Java Concurrency Package. http://gee.cs.oswego.edu/cgi-bin/viewcvs.cgi/jsr166/src/main/java/util/concurrent, 2003.

[17] Xiaozhou Li, David G. Andersen, Michael Kaminsky, and Michael J. Freedman. Algorithmic improvements for fast concurrent cuckoo hashing. In Proceedings of the Ninth European Conference on Computer Systems, EuroSys '14. ACM, 2014.

[18] Tobias Maier, Peter Sanders, and Roman Dementiev. Concurrent hash tables: Fast and general?(!). In Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '16, pages 34:1–34:2, New York, NY, USA, 2016. ACM.

[19] Tobias Maier, Peter Sanders, and Roman Dementiev. Concurrent hash tables: Fast and general?(!). CoRR, abs/1601.04017, 2016.

[20] Makoto Matsumoto and Takuji Nishimura. Mersenne twister: A 623-dimensionally equidistributed uniform pseudo-random number generator. ACM Transactions on Modeling and Computer Simulation, 8:3–30, 1998. http://www.math.keio.ac.jp/~matumoto/emt.html.

[21] Edward M. McCreight. A space-economical suffix tree construction algorithm. Journal of the ACM, 23(2):262–272, April 1976.

[22] Paul E. McKenney and John D. Slingwine. Read-copy update: Using execution history to solve concurrency problems. Parallel and Distributed Computing and Systems, pages 509–518, 1998.

[23] Kurt Mehlhorn and Peter Sanders. Algorithms and Data Structures — The Basic Toolbox. Springer, 2008.

[24] Scott Meyers. Effective C++: 55 Specific Ways to Improve Your Programs and Designs. Pearson Education, 2005.

[25] Ingo Müller, Peter Sanders, Arnaud Lacurie, Wolfgang Lehner, and Franz Färber. Cache-efficient aggregation: Hashing is sorting. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pages 1123–1136. ACM, 2015.

[26] Rajesh Nishtala, Hans Fugal, Steven Grimm, Marc Kwiatkowski, Herman Lee, Harry C. Li, Ryan McElroy, Mike Paleczny, Daniel Peek, Paul Saab, et al. Scaling memcache at Facebook. In 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI), volume 13, pages 385–398, 2013.

[27] Philippe Oechslin. Making a faster cryptanalytic time-memory trade-off. In Dan Boneh, editor, Advances in Cryptology - CRYPTO 2003: 23rd Annual International Cryptology Conference, pages 617–630, Berlin, Heidelberg, 2003. Springer Berlin Heidelberg.

[28] Jong Soo Park, Ming-Syan Chen, and Philip S. Yu. An effective hash-based algorithm for mining association rules. In ACM SIGMOD Conference on Management of Data, pages 175–186, 1995.

[29] Paul E. McKenney, Mathieu Desnoyers, and Lai Jiangshan. LWN: URCU-protected hash tables. http://lwn.net/Articles/573431/, 2013.

[30] Chuck Pheatt. Intel® Threading Building Blocks. J. Comput. Sci. Coll., 23(4):298–298, April 2008.

[31] Jeff Preshing. Junction. https://github.com/preshing/junction, 2016.

[32] Jeff Preshing. New concurrent hash maps for C++. http://preshing.com/20160201/new-concurrent-hash-maps-for-cpp/, 2016.

[33] Ori Shalev and Nir Shavit. Split-ordered lists: Lock-free extensible hash tables. J. ACM, 53(3):379–405, May 2006.

[34] Julian Shun and Guy E. Blelloch. Phase-concurrent hash tables for determinism. In 26th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pages 96–107. ACM, 2014.

[35] Julian Shun, Guy E. Blelloch, Jeremy T. Fineman, Phillip B. Gibbons, Aapo Kyrola, Harsha Vardhan Simhadri, and Kanat Tangwongsan. Brief announcement: The Problem Based Benchmark Suite. In 24th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pages 68–70. ACM, 2012.

[36] Alex Stivala, Peter J. Stuckey, Maria Garcia de la Banda, Manuel Hermenegildo, and Anthony Wirth. Lock-free parallel dynamic programming. Journal of Parallel and Distributed Computing, 70(8), 2010.

[37] Tony Stornetta and Forrest Brewer. Implementation of an efficient parallel BDD package. In 33rd Design Automation Conference, pages 641–644. ACM, 1996.
