

References

• Project page: http://idav.ucdavis.edu/~dfalcant/
• Fredman, M. L., Komlós, J., and Szemerédi, E. 1984. Storing a sparse table with O(1) worst case access time. Journal of the ACM 31, 3 (July), 538-544.
• Pagh, R., and Rodler, F. F. 2001. Cuckoo hashing. In 9th Annual European Symposium on Algorithms, Springer, vol. 2161 of Lecture Notes in Computer Science, 121-133.

Goal: Parallel-friendly data structure allowing efficient random access to millions of items and buildable at interactive rates.

On the CPU, hash tables are generally used. Collisions are handled using techniques like chaining, where items hashing to the same location are put in a linked list.
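Chaining can be sketched in a few lines (an illustrative CPU sketch, not the paper's code; the class name and slot count are invented for the example):

```python
# Minimal chained hash table: colliding keys share a per-slot list.
class ChainedHashTable:
    def __init__(self, num_slots=8):
        self.slots = [[] for _ in range(num_slots)]

    def insert(self, key, value):
        chain = self.slots[hash(key) % len(self.slots)]
        for i, (k, _) in enumerate(chain):
            if k == key:              # key already present: overwrite
                chain[i] = (key, value)
                return
        chain.append((key, value))

    def get(self, key):
        # The probe count varies with chain length -- the "variable
        # work per access" problem discussed below.
        for k, v in self.slots[hash(key) % len(self.slots)]:
            if k == key:
                return v
        return None
```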

Hash techniques don’t parallelize well for three reasons:

• Synchronization: Insertions usually involve sequential operations
• Variable work per access: Number of probes varies per query, forcing threads to wait for the worst-case number of probes
• Sparse storage: Little locality exhibited in either construction or access, so caching and computational hierarchies have little ability to improve performance.

Can be addressed with perfect hash tables, where each element can be accessed in O(1), but previously required CPU construction.

Get efficient GPU construction by combining two strategies: classical FKS perfect hashing and relatively new cuckoo hashing.

FKS perfect hashing

• For n keys and a table with n² slots, randomly chosen hash functions will likely be collision-free (left)
• Can shrink to O(n) slots by first hashing the items into small buckets, each needing much less space (right)
• Requires many small and empty buckets, still using over 3n space
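The two-level FKS scheme can be sketched as follows (a simplified illustration; the linear hash family and the prime are assumptions for the example, not the paper's parameters):

```python
import random

def build_fks(keys):
    """Two-level FKS: a first-level hash splits n keys into n buckets,
    then each bucket of size s gets a collision-free table of s*s slots,
    retrying random hash functions until one is collision-free."""
    n = len(keys)
    p = 4294967311                      # a prime larger than any 32-bit key

    def make_hash(m):                   # random function from a linear family
        a, b = random.randrange(1, p), random.randrange(p)
        return lambda k: ((a * k + b) % p) % m

    h1 = make_hash(n)
    groups = [[] for _ in range(n)]
    for k in keys:
        groups[h1(k)].append(k)

    tables = []
    for g in groups:
        m = len(g) ** 2                 # quadratic space makes collisions unlikely
        if m == 0:
            tables.append((None, []))   # empty bucket
            continue
        while True:                     # expected O(1) retries per bucket
            h2 = make_hash(m)
            table = [None] * m
            if all(table[h2(k)] is None and not table.__setitem__(h2(k), k)
                   for k in g):
                break
        tables.append((h2, table))
    return h1, tables

def fks_lookup(h1, tables, key):
    h2, table = tables[h1(key)]
    return bool(table) and table[h2(key)] == key
```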

Voxelized Lucy model. X, Y, and Z axes are mapped to red, green, and blue.

Dan A. Alcantara Andrei Sharf Fatemeh Abbasinejad Shubhabrata Sengupta Michael Mitzenmacher* John D. Owens Nina Amenta

[Chart: construction and retrieval time in milliseconds (0-60) vs. key-value pairs in millions (0-3.5). Series: GPU Hash: Construction; GPU Hash: Retrieval; Sorted array: Radix sort; Sorted array: Binary search; CPU PSH: Retrieval.]

[Chart: per-stage timing in milliseconds (0-30) vs. key-value pairs in millions (0-10). Stages: Cuckoo hashing; Assigning keys to buckets and counting; Shuffling the points into the buckets; Initialization; Determining bucket data locations.]

Real-Time Parallel Hashing on the GPU


Single bucket’s contents. Each bucket contains at most 512 items; on average, a bucket receives 409.

Tier 2. Parallel cuckoo hashing is used to build a hash table for each bucket in shared memory independently. This structure has 5 million slots, but retrieval for any item requires looking at only 3 places.

Input. 3.5 million voxels from Lucy stored as 32-bit keys.

Cuckoo sub-tables. Each table is composed of three smaller sub-tables. Colored blocks are voxels while white blocks represent empty spaces. Sub-tables are grouped together when written to global memory.

Tier 1. As in FKS hashing, the input is partitioned into smaller buckets using a first-level hash function.

Timings for increasingly finer voxelizations of Lucy. We compare against a sorted array, using binary searches for retrieval. Construction takes roughly the same amount of time as the radix sort, while our retrievals are consistently faster than binary searches for random access.

Timing for each stage of our algorithm for increasingly large inputs of random keys. We see roughly linear scaling for each stage.

Background

Cuckoo hashing

• Items can hash into one of multiple locations (left)
• Items are inserted into any empty slot if possible (middle)
• If no slots are available, an item is evicted to make room, forcing recursive insertions and evictions until convergence (right)
• Get better space usage with more sub-tables (up to 90% with three)
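The eviction loop above can be sketched serially with two sub-tables (a minimal illustration; the table size and the two simple hash functions are assumptions for the example):

```python
# Minimal serial cuckoo hashing with two sub-tables of SIZE slots each.
SIZE = 11

def h1(k): return k % SIZE
def h2(k): return (k // SIZE) % SIZE

def cuckoo_insert(t1, t2, key, max_evictions=100):
    tables, hashes = (t1, t2), (h1, h2)
    for i in range(max_evictions):
        t, h = tables[i % 2], hashes[i % 2]   # alternate sub-tables
        slot = h(key)
        key, t[slot] = t[slot], key           # place key, evict old occupant
        if key is None:
            return True                       # slot was empty: converged
    return False   # eviction chain too long: would rehash in practice

def cuckoo_lookup(t1, t2, key):
    # At most two probes: one slot per sub-table.
    return t1[h1(key)] == key or t2[h2(key)] == key
```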

[Figure: cuckoo hashing example. Keys a, b, c, and d are inserted into Sub-table 1 and Sub-table 2, with evictions making room as needed.]

[Figure: parallel insertion of keys w, x, y, z into sub-tables T1 and T2.
(1) Threads simultaneously insert; black arrows show successful insertions, while x and z fail.
(2) x and z failed to write to T1 and try T2, where they collide: x is written and z fails.
(3) z goes back to T1 and evicts y.
(4) y writes itself into T2.]

We parallelized cuckoo hashing insertion for the GPU. Parallel cuckoo hashing:

• Inserts all items simultaneously, iterating through sub-tables in round-robin fashion
• Assumes that exactly one write will succeed for colliding items
• Typically completes in O(lg n) iterations for two sub-tables
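A CPU simulation of this round-robin scheme conveys the idea (last-writer-wins stands in for the GPU's arbitrary ordering of simultaneous writes; the hash functions in the usage are invented for the example, and this is a sketch, not the CUDA kernel):

```python
def parallel_cuckoo(keys, num_slots, hashes, max_iters=64):
    """Simulate parallel cuckoo insertion: each iteration, every
    unplaced key writes to the current sub-table at once; exactly one
    colliding write survives, and the losers retry on the next table."""
    tables = [[None] * num_slots for _ in hashes]
    pending = list(keys)
    for it in range(max_iters):
        if not pending:
            break
        t, h = tables[it % len(hashes)], hashes[it % len(hashes)]
        evicted = []
        for k in pending:                       # "simultaneous" writes
            slot = h(k)
            if t[slot] is not None and t[slot] not in pending:
                evicted.append(t[slot])         # a placed key is kicked out
            t[slot] = k                         # last writer wins the slot
        # Each thread re-reads its slot; only the winner stays placed.
        pending = [k for k in pending if t[h(k)] != k] + evicted
    return tables if not pending else None      # None: rehash needed

def cuckoo_find(tables, hashes, key):
    return any(t[h(key)] == key for t, h in zip(tables, hashes))
```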

FKS and cuckoo hashing are problematic for the GPU:

• FKS hashing is fast, but requires too much space
• Cuckoo hashing can be space-efficient, but requires slow global memory and global synchronization for parallel insertion

Get an efficient algorithm by combining both:

• As with FKS hashing, first partition into small buckets of at most 512 items
• Build a parallel cuckoo hash on the smaller buckets in fast on-chip memory
• Retrieval takes at most 3 probes: one for each cuckoo sub-table of the bucket the item hashes into
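The two-tier lookup path can be sketched as follows (an illustrative data layout only: greedy first-fit placement stands in for the per-bucket cuckoo build, and the hash functions in the usage are invented for the example):

```python
def build_two_tier(keys, num_buckets, sub_hashes, sub_size):
    """Tier 1: a first-level hash partitions keys into buckets.
    Tier 2: each bucket gets its own small sub-tables; greedy
    placement here stands in for the shared-memory cuckoo build."""
    groups = [[] for _ in range(num_buckets)]
    for k in keys:
        groups[k % num_buckets].append(k)       # tier-1 hash: k mod buckets
    buckets = []
    for g in groups:
        tables = [[None] * sub_size for _ in sub_hashes]
        for k in g:
            for t, h in zip(tables, sub_hashes):
                if t[h(k)] is None:             # first free sub-table slot
                    t[h(k)] = k                 # (real build would evict)
                    break
        buckets.append(tables)
    return buckets

def retrieve(key, buckets, num_buckets, sub_hashes):
    tables = buckets[key % num_buckets]         # tier 1 picks one bucket
    for t, h in zip(tables, sub_hashes):        # tier 2: one probe per
        if t[h(key)] == key:                    # sub-table, so <= 3 probes
            return True
    return False
```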

The algorithm can be generalized to handle multiple values per key, or to generate a two-way index between keys and unique IDs.

[Figure: hash functions H1 and H2 map each item (w, x, y, z) to two places in the sub-tables T1 and T2.]

[Figure: a hash function maps n keys into a collision-free but very sparse table of size n².]

In our paper, we demonstrate hash table use with two different GPU applications. Geometric hashing finds the image in the upper left within the image in the upper right. Spatial hashing finds the boolean intersection between two moving point clouds while allowing the user to interactively change the transformation.

To appear in SIGGRAPH Asia 2009