28
1 Data Structures Dana Shapira Hash Tables

1 Data Structures Dana Shapira Hash Tables. 2 Element Uniqueness Problem Let Determine whether there exist i j such that x i =x j Sort Algorithm Bucket

Embed Size (px)

Citation preview

Page 1: 1 Data Structures Dana Shapira Hash Tables. 2 Element Uniqueness Problem Let Determine whether there exist i j such that x i =x j Sort Algorithm Bucket

1

Data Structures

Dana Shapira

Hash Tables

Page 2: 1 Data Structures Dana Shapira Hash Tables. 2 Element Uniqueness Problem Let Determine whether there exist i j such that x i =x j Sort Algorithm Bucket

2

Element Uniqueness Problem

Let Determine whether there exist ij such that xi=xj

Sort Algorithm Bucket Sort

for (i=0;i<m;i++)T[i]=NULL;

for (i=0;i<n;i++){if (T[xi]= = NULL)

T[xi]= ielse{

output (i, T[xi])return;

}}

What happens when m is large or when we are dealing with real numbers??

0 10 ,..., nx x m

Page 3: 1 Data Structures Dana Shapira Hash Tables. 2 Element Uniqueness Problem Let Determine whether there exist i j such that x i =x j Sort Algorithm Bucket

3

Notations: U universe of keys of size |U|, K an actual set of keys of size n, T a hash table of size m

Use a hash-function h:U{0,…,m-1}, h(x)=i that computes the slot i in array T where element x is to be stored, for all x in U.

h(k) is computed in O(|k|) = O(1).

x1

x2

x3

x4

h(x1) h(x

2) h(x

4)h(x

3)

U

h

Set of array indices

Hash Tables

Page 4: 1 Data Structures Dana Shapira Hash Tables. 2 Element Uniqueness Problem Let Determine whether there exist i j such that x i =x j Sort Algorithm Bucket

4

Example

h:U{0,…, m-1}

h(x)=x mod 10 (what is m?)

input: 17,62,19,81,53

Collision: x ≠ y but h(x) = h(y).

m « |U|. Solutions:

1. Chaining

2. Open addressing

81

62

53

17

19

1

2

3

7

9

0

4

56

8

Page 5: 1 Data Structures Dana Shapira Hash Tables. 2 Element Uniqueness Problem Let Determine whether there exist i j such that x i =x j Sort Algorithm Bucket

5

Collision-Resolution by Chaining

1 81

62 12

53

17 37 57

19

Insert(T,x): Insert new element x at the head of list T[h(x.key)]. Delete(T,x): Delete element x from list T[h(x.key)]. Search(T,x): Search list T[h(x.key)].

Page 6: 1 Data Structures Dana Shapira Hash Tables. 2 Element Uniqueness Problem Let Determine whether there exist i j such that x i =x j Sort Algorithm Bucket

6

Simple Uniform Hashing Any given element is equally likely to hash to any slot in the hash table. The slot an element hashes to is independent of where other elements

hash.

Load factor:

α = n/m (elements stored in the hash table / number of slots in the hash table)

Analysis of Chaining

Page 7: 1 Data Structures Dana Shapira Hash Tables. 2 Element Uniqueness Problem Let Determine whether there exist i j such that x i =x j Sort Algorithm Bucket

7

Analysis of Chaining

Theorem: In a hash table with chaining, under the assumption of simple uniform hashing, both successful and unsuccessful searches take expected time Θ(1+α) on the average, where α is the hash table load factor.

Proof: Unsuccessful Search: Under the assumption of simple uniform hashing, any key k is equally

likely to hash to any slot in the hash table. The expected time to search unsuccessfully for a key k is the expected time to search to the end of list T[h(k)] which has expected length α.

expected time - (1 + α) including time for computing h(k). Successful search: The number of elements examined during a successful search is 1 more than the number of

elements that appear before k in T[h(k)].

expected time - (1 + α)

Corollary: If m = (n), then Insert, Delete, and Search take expected constant time.

1 1 1 1

1 1 1 1 1 11 1 1 1

1 11 11 1 1 1 1

2 2 2 2 2 2

n n n n

i i i i

i ii

n m n m n nm

n n n

nm m m n

Page 8: 1 Data Structures Dana Shapira Hash Tables. 2 Element Uniqueness Problem Let Determine whether there exist i j such that x i =x j Sort Algorithm Bucket

8

Example:

Input = reals drawn uniformly at random from [0,1)

Hash function: h(x) = mx

Often, the input distribution is unknown.

Then we can use heuristics or universal hashing.

Designing Good Hash Functions

Page 9: 1 Data Structures Dana Shapira Hash Tables. 2 Element Uniqueness Problem Let Determine whether there exist i j such that x i =x j Sort Algorithm Bucket

9

Hash function: h(x) = x mod m

m = 2k

h(x) = the lowest k bits of x

Heuristic:

m = prime number not too close to a power of 2

The Division Method

Page 10: 1 Data Structures Dana Shapira Hash Tables. 2 Element Uniqueness Problem Let Determine whether there exist i j such that x i =x j Sort Algorithm Bucket

10

Hash function: h(x) = m (cx mod 1), for some 0 < c < 1

Optimal choice of c depends on input distribution.

Heuristic:

Knuth suggests the inverse of the golden ratio as a value that works well:

Example: x=123,456, m=10,000h(x) = 10,000 ·(123,456·0.61803… mod 1) =

= 10,000 ·(76,300.0041151… mod 1) = = 10,000 ·0.0041151… = 41.151... = 41

The Multiplication Method

Page 11: 1 Data Structures Dana Shapira Hash Tables. 2 Element Uniqueness Problem Let Determine whether there exist i j such that x i =x j Sort Algorithm Bucket

11

Efficient Implementation of the Multiplication Method

w bits w bits

w bits w bits

w bits

p bits

*

Fractional part

Integer part aftermultiplying by m = 2p

Let w be the size of a machine word

Assume that key x fits into a machine word

Assume that m = 2p

Restrict ourselves to values of c of the form

c = s / 2w

Then cx = sx / 2w

sx is a number that fits into two machine words

h(x) = p most significant bits of the lower word

h(x) = m (cx mod 1)

0 2ws

Page 12: 1 Data Structures Dana Shapira Hash Tables. 2 Element Uniqueness Problem Let Determine whether there exist i j such that x i =x j Sort Algorithm Bucket

12

x = 123456, p = 14, m = 214 = 16384, w = 32,

Then sx = (76300 ⋅ 232) + 17612864 The 14 most significant bits of

17612864 are 67; that is, h(x) = 67

x = 0000 0000 0000 0001 1110 0010 0100 0000

s = 1001 1110 0011 0111 0111 1001 1011 1001

sx = 0000 0000 0000 0001 0010 1010 0000 1100 0000 0001 0000 1100 1100 0000 0100 0000

h(x) = 00 0000 0100 0011 = 67

Example

h(x) = m (cx mod 1)

Page 13: 1 Data Structures Dana Shapira Hash Tables. 2 Element Uniqueness Problem Let Determine whether there exist i j such that x i =x j Sort Algorithm Bucket

13

Open Addressing

h(x)

All elements are stored directly in the hash table.➥ Load factor α cannot exceed 1. If slot T[h(x)] is already occupied for a key x, we probe alternative locations until we find an empty slot. Searching probes slots starting at T[h(x)] until x is found or we are sure that x is not in T. Instead of computing h(x), we compute h(x, i) i -the probe number.

Page 14: 1 Data Structures Dana Shapira Hash Tables. 2 Element Uniqueness Problem Let Determine whether there exist i j such that x i =x j Sort Algorithm Bucket

14

Hash function:

h(k, i) = (h'(k) + i) mod m, where h' is an original hash function.

Benefits:

Easy to implement Problem:

Primary Clustering - Long runs of occupied slots build up as table becomes fuller.

Linear Probing

Page 15: 1 Data Structures Dana Shapira Hash Tables. 2 Element Uniqueness Problem Let Determine whether there exist i j such that x i =x j Sort Algorithm Bucket

15

Hash function:

h(k, i) = (h'(k) + c1i + c

2i2) mod m, where h' is an original hash function.

Benefits:

No more primary clustering Problem:

Secondary Clustering - Two elements x and y with h'(x) = h'(y) have same probe sequence.

Quadratic Probing

Page 16: 1 Data Structures Dana Shapira Hash Tables. 2 Element Uniqueness Problem Let Determine whether there exist i j such that x i =x j Sort Algorithm Bucket

16

Hash function:

h(k, i) = (h1(k) + ih

2(k)) mod m, where h

1 and h

2 are two original hash functions.

h2(k) has to be prime w.r.t. m; that is, gcd(h

2(k), m) = 1.

Two methods:

Choose m to be a power of 2 and guarantee that h2(k) is always odd.

Choose m to be a prime number and guarantee that h2(k) < m.

Benefits:No more clustering

Drawback:More complicated than linear and quadratic probing

Double Hashing

Page 17: 1 Data Structures Dana Shapira Hash Tables. 2 Element Uniqueness Problem Let Determine whether there exist i j such that x i =x j Sort Algorithm Bucket

17

Analysis of Open Addressing

Uniform hashing: The probe sequence h(k, 0), …, h(k, m – 1) is equally likely to be any permutation of 0, …, m – 1.

Theorem: In an open-address hash table with load factor α < 1, the expected number of probes in an unsuccessful search is at most 1 / (1 – α), assuming uniform hashing.

Proof: Let X be the number of probes in an unsuccessful search.

Ai = “there is an i-th probe, and it accesses a non-empty slot”

0 0 1

[ ] 1i i i

E X iP X i i P X i P X i P X i

Page 18: 1 Data Structures Dana Shapira Hash Tables. 2 Element Uniqueness Problem Let Determine whether there exist i j such that x i =x j Sort Algorithm Bucket

18

Analysis of Open Addressing – Cont.

Page 19: 1 Data Structures Dana Shapira Hash Tables. 2 Element Uniqueness Problem Let Determine whether there exist i j such that x i =x j Sort Algorithm Bucket

19

Corollary: The expected number of probes performed during an insertion into an open-address hash table with uniform hashing is 1 / (1 – α).

Analysis of Open Addressing – Cont.

Page 20: 1 Data Structures Dana Shapira Hash Tables. 2 Element Uniqueness Problem Let Determine whether there exist i j such that x i =x j Sort Algorithm Bucket

20

A successful search for an element x follows the same probe sequence as the insertion of element x.

Consider the (i + 1)-st element x that was inserted. The expected number of probes performed when inserting x is at most

Averaging over all n elements, the expected number of probes in a successful search is

Theorem: Given an open-address hash table with load factor α < 1, the expected number of probes in a successful search is (1/α) ln (1 / (1 – α)), assuming uniform hashing and assuming that each key in the table is equally likely to be searched for.

Analysis of Open Addressing – Cont.

Page 21: 1 Data Structures Dana Shapira Hash Tables. 2 Element Uniqueness Problem Let Determine whether there exist i j such that x i =x j Sort Algorithm Bucket

21

Analysis of Open Addressing – Cont.

Page 22: 1 Data Structures Dana Shapira Hash Tables. 2 Element Uniqueness Problem Let Determine whether there exist i j such that x i =x j Sort Algorithm Bucket

22

A family of hash functions is universal if for each pair k, l of keys, there are at most || / m functions in such that h(k) = h(l).

This means: For any two keys k and l and any function h chosen uniformly at

random, the probability that h(k) = h(l) is at most 1/m.

P= (|| / m )/ || =1/m

This is the same as if we chose h(k) and h(l) uniformly at random from [0, m – 1].

Universal Hashing

Page 23: 1 Data Structures Dana Shapira Hash Tables. 2 Element Uniqueness Problem Let Determine whether there exist i j such that x i =x j Sort Algorithm Bucket

23

Analysis of Universal Hashing

Theorem: For a hash function h chosen uniformly at random from a universal family , the expected length of the list T[h(x)] is α if x is not in the hash table and 1 + α if x is in the hash table.

x xyy Tx y

Y

Proof: Indicator variables:

Yx = the number of keys ≠ x that hash to the same slot as x

x xyy Tx y

E Y E

1

0ij

h i h j

h i h j

Page 24: 1 Data Structures Dana Shapira Hash Tables. 2 Element Uniqueness Problem Let Determine whether there exist i j such that x i =x j Sort Algorithm Bucket

24

If x is not in T, then |{y T : x ≠ y}| = n. Hence, E[Yx] = n / m = α.

If x is in T, then |{y T : x ≠ y} = n – 1. Hence, E[Yx] = (n – 1) / m < α.

The length of list T[h(x)] is one more, that is, 1 + α.

1x xy xy

y T y T y Tx y x y x y

E Y E Em

Analysis of Universal Hashing Cont.

Page 25: 1 Data Structures Dana Shapira Hash Tables. 2 Element Uniqueness Problem Let Determine whether there exist i j such that x i =x j Sort Algorithm Bucket

25

Universal Family of Hash Functions

Choose a prime p so that m = p.

Let such

For each define the hash function as follows:

= || =

Example: m=p=253

a=[248,223,101]

x=1025=[0,2,1] .

0[ ,..., ]rx x x ix m

10[ ,..., ] {0,..., 1}r

ra a a m

0

modr

a i ii

h x a x m

aa

h 1rm

240,..., 2U

248 0 223 2 101 1 mod 253 41ah x

Page 26: 1 Data Structures Dana Shapira Hash Tables. 2 Element Uniqueness Problem Let Determine whether there exist i j such that x i =x j Sort Algorithm Bucket

26

Universal Family of Hash Functions

Theorem: The class is universal.

Proof: Let

such that x ≠ y, w.l.o.g

For all a1,…,ar there exists a single a0 such that

For all there exists a single w such that z·w=1(mod m)

for mr values

The number of hash functions h in , for which h(k1) = h(k2) is at most mr/ mr+1 =1/m

0 0[ ,..., ], [ ,..., ]r rx x x y y y

a ah x h y

0 0x y

0

0 0 00

0(mod )

(mod )

r

a a i i ii

r

i i ii

h x h y a x y m

a x y a x y m

1

0 0 0 0 01

0 (mod )r

i i ii

z x y a a x y x y m

a ah x h y

pz

Page 27: 1 Data Structures Dana Shapira Hash Tables. 2 Element Uniqueness Problem Let Determine whether there exist i j such that x i =x j Sort Algorithm Bucket

27

Choose a prime p so that m < p.

For any 1 ≤ a < p and 0 ≤ b < p, we define a function

ha,b

(x) = ((ax + b) mod p) mod m.

Let Hp,m

be the family

Hp,m

= {ha,b

: 1 ≤ a < p and 0 ≤ b < p}.

Theorem: The class Hp,m

is universal.

Universal Family of Hash Functions

Page 28: 1 Data Structures Dana Shapira Hash Tables. 2 Element Uniqueness Problem Let Determine whether there exist i j such that x i =x j Sort Algorithm Bucket

28

Hash tables are the most efficient dictionaries if only operations Insert, Delete, and Search have to be supported.

If uniform hashing is used, the expected time of each of these operations is constant.

Universal hashing is somewhat complicated, but performs well even for adversarial input distributions.

If the input distribution is known, heuristics perform well and are much simpler than universal hashing.

For collision-resolution, chaining is the simplest method, but it requires more space than open addressing.

Open addressing is either more complicated or suffers from clustering effects.

Summary