Upload
james-atkins
View
220
Download
0
Embed Size (px)
Citation preview
Data Structures
Hash Tables
Hashing Tables
Motivation: symbol tables A compiler uses a symbol table to relate symbols
to associated data Symbols: variable names, procedure names, etc. Associated data: memory location, call graph, etc.
For a symbol table (also called a dictionary), we care about search, insertion, and deletion
We typically don’t care about sorted order
Hash Tables
More formally: Given a table T and a record x, with key (= symbol) and
satellite data, we need to support: Insert (T, x) Delete (T, x) Search(T, x)
We want these to be fast, but don’t care about sorting the records
The structure we will use is a hash table Supports all the above in O(1) expected time!
Hashing: Keys
In the following discussions we will consider all keys to be (possibly large) natural numbers
How can we convert floats to natural numbers for hashing purposes?
How can we convert ASCII strings to natural numbers for hashing purposes?
Hashing a String
Let Ai represent the ascii code for a string character. Since ascii codes go from 0 to 127 we can perform the following calculation.
A3X3 + A2X2 + A1X1 +A0X0 = (((A3)X + A2)X + A1)X + A0
What is this called? Note that if X = 128 we can easily overflow. Solution:public static int hash(String key, int tableSize)
{
int hashVal = 0;
for(int i=0; i < key.length(); i++)
hashVal = ( hashVal * 128 + key.charAt(i))%tableSize;
return hashVal;
}
A better version
Hash functions should be fast and distribute the keys equitably! The 37 slows down the shifting of lower keys and we allow overflow to occur(sort of auto mod-ing) which can result in a neg.public static int hash(String key, int tableSize)
{
int hashVal = 0;
for(int i=0; i < key.length(); i++)
hashVal = hashVal * 37 + key.charAt(i);
hashVal %= tableSize;
if( hashVal < 0)hashVal += tableSize;// fix the sign
return hashVal;
}
Direct Addressing
Suppose: The range of keys is 0..m-1 Keys are distinct
The idea: Set up an array T[0..m-1] in which
T[i] = x if x T and key[x] = i T[i] = NULL otherwise
This is called a direct-address table Operations take O(1) time! So what’s the problem?
The Problem With Direct Addressing
Direct addressing works well when the range m of keys is relatively small
But what if the keys are 32-bit integers? Problem 1: direct-address table will have
232 entries, more than 4 billion Problem 2: even if memory is not an issue, the time to
initialize the elements to NULL may be Solution: map keys to smaller range 0..m-1 This mapping is called a hash function
Hash Functions
Next problem: collisionT
0
m - 1
h(k1)
h(k4)
h(k2) = h(k5)
h(k3)
k4
k2 k3
k1
k5
U(universe of keys)
K(actualkeys)
Resolving Collisions
How can we solve the problem of collisions? Solution 1: chaining Solution 2: open addressing
Open Addressing
Basic idea: To insert: if slot is full, try another slot, …, until an open
slot is found (probing) To search, follow same sequence of probes as would be
used when inserting the element If reach element with correct key, return it If reach a NULL pointer, element is not in table
Good for fixed sets (adding but no deletion) Example: spell checking
Table needn’t be much bigger than n
Open Addressing(Linear Probing)
Let H(k,p) denote the pth position tried for key K where p=0,1,...
Thus H(k,0)= h(k) where h is the basic hash function. H(k,p) represents the probe sequence we use to find the key.
If key K hashes to index i, and it is full, then try i+1, i+2, until an empty slot is found.
The Linear Probing sequence is therefore shown below.
H(K,0) = h(K)
H(K, p+1) = (H(K,p) + 1) mod m
Open Addressing(Double Hashing)
Here we use a second hash function h2 to determine the probe sequence. Note that linear probing is just h2(K)=1
H(K,0) = h(K)
H(K, p+1) = (H(K,p) + h2(K)) mod m
To ensure that the probe sequence visits all positions in the table h2(K) must be greater than zero and relatively prime to m for any K. Hence we usually let m be prime.
Mathematical Defn’s
Let us count as one probe each access to the data structure and n the size of the dictionary to be stored and m the size of the table. Let =n/m be called the load factor.
From a statistical stand point let S() be the expected number of probes to perform a LookUp on a key that is actually in the hash table, and U() be the expected number of probes in an unsuccessful LookUp.
Performance of Linear Probing
12 S. Adams 134 J. Adams 15 W. Floyd 26 T. Heyward 178 J. Hancock 1910 C.Braxton 111 J. Hart 112 J. Hewes 31314 C. Carroll 115 A. Clark 116 R. Ellery 217 B. Franklin 118 W. Hooper 5
key # probesThe names were inserted in alphabetical order. The third column shows the number of probes required to find a key in the table; this is the same as the number of slots that were inspected when it was inserted.
primary
clustering
S = 21/13=1.615U = 47/18=2.6load is 13/18=.72
Linear Probing
Knuth analyzed sequential probing and obtained the following.
1
11
2
1nS
2)1(
11
2
1
nU
=.9
SnUn
Linear
5.550.5
Double
2.5610.0
The small extra effort required to implement double hashing certainly pays off it seems.
Performance of Double Hashing
Assumption: each probe inot the hash table is independent and has a probability of hitting an occupied position exactly equal to the load factor. Let i = i/m for every in. This is then the probability of a collision on any probe after i keys have been inserted.
The expected number of probes in an unsuccessful search when n-1 items have been inserted is
Un-1 1(1- n-1)+2 n-1(1- n-1)+3 n-12 (1- n-1)+ . . .
= 1 + n-1 + n-12 + . . .
= 1/(1- n-1 )Prob. that third probe is empty and first 2 filled
Successful Searching(Double)
The # of probes in a successful search is the average of the number of probes it took to insert each of the n items.
The expected number of probes to insert the ith item is the expected # of probes in an unsuccessful search when i-1 items have been inserted.
n
i i
n
iin n
Un
S1 11
1 1
111
Successful Search Continued
nmm
n
i
n
i in HH
n
m
imn
m
nS
11 1 1
1
1
11
Where ii
H i ln1
3
1
2
11
So
nm
m
n
mnmm
n
mSn ln)ln(ln
nn
1
1ln
1
Observation(Double Hashing)
nnnS
1
1ln
1
When n = 20/31 we have Sn 1.606
Even when the table is 90% full Sn is 2.56
although Un 1/1-.9 = 10.0
Chaining
Chaining puts elements that hash to the same slot in a linked list:
——
——
——
——
——
——
T
k4
k2k3
k1
k5
U(universe of keys)
K(actualkeys)
k6
k8
k7
k1 k4 ——
k5 k2
k3
k8 k6 ——
——
k7 ——
Chaining
How do we insert an element?
——
——
——
——
——
——
T
k4
k2k3
k1
k5
U(universe of keys)
K(actualkeys)
k6
k8
k7
k1 k4 ——
k5 k2
k3
k8 k6 ——
——
k7 ——
Chaining
How do we delete an element?
——
——
——
——
——
——
T
k4
k2k3
k1
k5
U(universe of keys)
K(actualkeys)
k6
k8
k7
k1 k4 ——
k5 k2
k3
k8 k6 ——
——
k7 ——
Chaining
How do we search for a element with a given key?
——
——
——
——
——
——
T
k4
k2k3
k1
k5
U(universe of keys)
K(actualkeys)
k6
k8
k7
k1 k4 ——
k5 k2
k3
k8 k6 ——
——
k7 ——
Analysis of Chaining
Assume simple uniform hashing: each key in table is equally likely to be hashed to any slot
Given n keys and m slots in the table: the load factor = n/m = average # keys per slot
What will be the average cost of an unsuccessful search for a key?
Analysis of Chaining
Assume simple uniform hashing: each key in table is equally likely to be hashed to any slot
Given n keys and m slots in the table, the load factor = n/m = average # keys per slot
What will be the average cost of an unsuccessful search for a key? A: O(1+)
Analysis of Chaining
Assume simple uniform hashing: each key in table is equally likely to be hashed to any slot
Given n keys and m slots in the table, the load factor = n/m = average # keys per slot
What will be the average cost of an unsuccessful search for a key? A: O(1+)
What will be the average cost of a successful search?
Analysis of Chaining
Assume simple uniform hashing: each key in table is equally likely to be hashed to any slot
Given n keys and m slots in the table, the load factor = n/m = average # keys per slot
What will be the average cost of an unsuccessful search for a key? A: O(1+)
What will be the average cost of a successful search? A: O(1 + /2) = O(1 + )
Analysis of Chaining Continued
So the cost of searching = O(1 + ) If the number of keys n is proportional to the
number of slots in the table, what is ? A: = O(1)
In other words, we can make the expected cost of searching constant if we make constant
Choosing A Hash Function
Clearly choosing the hash function well is crucial What will a worst-case hash function do? What will be the time to search in this case?
What are desirable features of the hash function? Should distribute keys uniformly into slots Should not depend on patterns in the data
Hash Functions:The Division Method
h(k) = k mod m In words: hash k into a table with m slots using the slot
given by the remainder of k divided by m What happens to elements with adjacent
values of k? What happens if m is a power of 2 (say 2P)? What if m is a power of 10? Upshot: pick table size m = prime number not too
close to a power of 2 (or 10)
Hash Functions:The Multiplication Method
For a constant A, 0 < A < 1: h(k) = m (kA - kA)
What does this term represent?
Hash Functions:The Multiplication Method
For a constant A, 0 < A < 1: h(k) = m (kA - kA)
Choose m = 2P
Choose A not too close to 0 or 1 Knuth: Good choice for A = (5 - 1)/2
Fractional part of kA
Hash Functions: Worst Case Scenario
Scenario: You are given an assignment to implement hashing You will self-grade in pairs, testing and grading your
partner’s implementation In a blatant violation of the honor code, your partner:
Analyzes your hash function Picks a sequence of “worst-case” keys, causing your
implementation to take O(n) time to search
What’s an honest CS student to do?
Hash Functions: Universal Hashing
As before, when attempting to foil an malicious adversary: randomize the algorithm
Universal hashing: pick a hash function randomly in a way that is independent of the keys that are actually going to be stored Guarantees good performance on average, no
matter what keys adversary chooses
Universal Hashing
Let be a (finite) collection of hash functions …that map a given universe U of keys… …into the range {0, 1, …, m - 1}.
is said to be universal if: for each pair of distinct keys x, y U,
the number of hash functions h for which h(x) = h(y) is ||/m
In other words: With a random hash function from , the chance of a collision
between x and y (x y) is exactly 1/m
Universal Hashing
Theorem 12.3: Choose h from a universal family of hash functions Hash n keys into a table of m slots, n m Then the expected number of collisions involving a particular key
x is less than 1 Proof:
For each pair of keys y, z, let cyx = 1 if y and z collide, 0 otherwise
E[cyz] = 1/m (by definition)
Let Cx be total number of collisions involving key x
Since n m, we have E[Cx] < 1m
ncC
xyTy
xyx
1][E][E
A Universal Hash Function
Choose table size m to be prime Decompose key x into r+1 bytes, so that
x = {x0, x1, …, xr} Only requirement is that max value of byte < m Let a = {a0, a1, …, ar} denote a sequence of r+1 elements
chosen randomly from {0, 1, …, m - 1} Define corresponding hash function ha :
With this definition, has mr+1 members
r
iiia mxaxh
0
mod
A Universal Hash Function
is a universal collection of hash functions (Theorem 12.4)
How to use: Pick r based on m and the range of keys in U Pick a hash function by (randomly) picking the a’s Use that hash function on all keys
The End