Motivation
• Many applications require a dynamic set that supports only the dictionary operations:
– insert
– search
– delete
• Example:
– A symbol table in a compiler.
Keys
• We will consider all keys to be (possibly large) natural numbers.
Direct Addressing
• Suppose:
– The keys are drawn from a universe U = {0, 1, …, u−1}
– The keys are distinct
• The idea:
– Set up an array T[0..u−1] in which
• T[i] = x if x ∈ T and x.key = i
• T[i] = null otherwise
– Operations take O(1) time!
• search(T, k): return T[k]
• insert(T, x): T[x.key] ← x
• delete(T, x): T[x.key] ← null
• So – what's the problem?
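The direct-addressing operations can be sketched in a few lines. This is a minimal illustration (the class and element representation are our own choices, not from the slides): one slot per possible key, so every operation is a single array access.

```python
# A minimal direct-address table sketch: T has one slot for every
# possible key in U = {0, 1, ..., u-1}; all three operations are O(1).

class DirectAddressTable:
    def __init__(self, u):
        self.T = [None] * u          # T[i] holds the element whose key is i

    def search(self, k):
        return self.T[k]             # None means "not present"

    def insert(self, x):
        self.T[x["key"]] = x

    def delete(self, x):
        self.T[x["key"]] = None

t = DirectAddressTable(16)
t.insert({"key": 5, "value": "foo"})
```

Note how the array itself does all the work: no hashing, no collision handling, but the table must be as large as the whole universe.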
Direct Addressing
• Direct addressing works well when the size of
the universe U is relatively small
• But what if the keys are 32-bit integers?
– The table would need 2^32 entries
Hashing
• Solution:
– Map the keys to a smaller range 0 .. m−1
• This mapping is called hashing
Collisions
• Two distinct keys may hash to the same slot – a collision
• Solutions:
– chaining
– open addressing
Chaining
• Chaining puts elements that hash to the same
slot in a linked list.
Search, Insert and Delete
• search(T, k)
– search for an element with key k in list T[h(k)]
• insert(T, x)
– insert x at the head of list T[h(x.key)]
• delete(T, x)
– delete x from the list T[h(x.key)]
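The three chaining operations above can be sketched as follows. This is an illustrative implementation (the class name and the use of Python's built-in `hash` are our own assumptions): each slot holds a list, and insertion goes at the head of the list, as the slides specify.

```python
# A chained hash table sketch: one list ("chain") per slot,
# insertion at the head of the chain.

class ChainedHashTable:
    def __init__(self, m):
        self.m = m
        self.T = [[] for _ in range(m)]

    def h(self, k):
        return hash(k) % self.m

    def search(self, k):
        for key, val in self.T[self.h(k)]:    # scan the chain for key k
            if key == k:
                return val
        return None

    def insert(self, k, v):
        self.T[self.h(k)].insert(0, (k, v))   # insert at the head of the list

    def delete(self, k):
        chain = self.T[self.h(k)]
        self.T[self.h(k)] = [x for x in chain if x[0] != k]

t = ChainedHashTable(8)
t.insert("apple", 1)
t.insert("pear", 2)
```

Inserting at the head makes insert O(1); search and delete cost time proportional to the chain length, which the next slides analyze.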
Analysis of Chaining
• Assume simple uniform hashing:
– each key is equally likely to be hashed to any slot.
• Given n keys and m slots in the table: the load factor α = n/m is the average number of keys per slot.
• We will show that the average cost of an unsuccessful search for a key is Θ(1+α).
• We will show that the average cost of a successful search is Θ(1+α/2) = Θ(1+α).
• Hence, the average cost is Θ(1+α).
• Thus, if n = O(m), then α = n/m = O(m)/m = O(1), and the average cost is Θ(1).
Analysis of Chaining
• Theorem:
– An unsuccessful search takes expected time Θ(1+α).
• Proof:
– Simple uniform hashing ⟹ any key not already in the table is equally likely to hash to any of the m slots.
– To search unsuccessfully for any key k, we need to search to the end of the list T[h(k)].
– This list has expected length E[length of T[h(k)]] = α.
– Therefore, the expected number of elements examined in an unsuccessful search is α.
– Adding in the time to compute the hash function, the total time required is Θ(1+α).
Analysis of Chaining
• Theorem:
– A successful search takes expected time Θ(1+α).
• Proof:
– Assume that the element x being searched for is equally likely to be any of the n elements stored in the table.
– The number of elements examined during a successful search for x is 1 more than the number of elements that appear before x in x's list.
– These are the elements inserted after x was inserted (because we insert at the head of the list).
– So we need to find the average, over the n elements x in the table, of how many elements were inserted into x's list after x was inserted.
– For i = 1, 2, . . . , n, let x_i be the i-th element inserted into the table, and let k_i = x_i.key.
– For all i and j, define the indicator random variable X_ij = I{h(k_i) = h(k_j)}.
– By simple uniform hashing, E[X_ij] = 1/m, so the expected number of elements examined in a successful search is

E[(1/n) Σ_{i=1..n} (1 + Σ_{j=i+1..n} X_ij)] = 1 + (n−1)/(2m) = 1 + α/2 − α/(2n) = Θ(1+α).
Keys
• How can we convert floats or ASCII strings to natural numbers?
• Example:
– Consider "CLRS"
• ASCII values: C = 67, L = 76, R = 82, S = 83.
– There are 128 basic ASCII values.
– So interpret "CLRS" as
• (67 · 128³) + (76 · 128²) + (82 · 128¹) + (83 · 128⁰) = 141,764,947.
Horner's rule
• Horner's rule:
• Code:
y = a_d
for i = d−1 downto 0
  y = a_i + x·y
• If d is large, the value of y becomes too big.
• Solution: evaluate the polynomial modulo m:
y = a_d
for i = d−1 downto 0
  y = (a_i + x·y) mod m
a_0 + a_1·x + a_2·x² + … + a_d·x^d = a_0 + x·(a_1 + x·(a_2 + … + x·(a_{d−1} + x·a_d)…))
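Applied to strings, Horner's rule turns the "CLRS" computation above into a simple left-to-right loop. A sketch (the modulus m = 701 is an arbitrary prime chosen for illustration):

```python
# Hashing a string with Horner's rule, radix 128 as in the "CLRS" example.
# The leftmost character is the most significant digit, so a left-to-right
# scan implements a_0 + x*(a_1 + x*(...)) with x = 128.

def string_to_number(s, radix=128):
    y = 0
    for ch in s:
        y = ord(ch) + radix * y          # full (possibly huge) value
    return y

def horner_hash(s, m, radix=128):
    y = 0
    for ch in s:
        y = (ord(ch) + radix * y) % m    # reduce mod m to keep y small
    return y
```

Taking the modulus at every step keeps every intermediate value below radix·m, avoiding the overflow the slide warns about, while producing the same result mod m.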
Choosing A Hash Function
• Clearly choosing the hash function well is
crucial
– What will a worst-case hash function do?
– What will be the time to search in this case?
• What are desirable features of the hash
function?
– Should distribute keys uniformly into slots
– Should not depend on patterns in the data
Choosing A Hash Function
• Unfortunately, it is typically not possible to check these conditions
– One rarely knows the probability distribution according to which the keys are drawn
– The keys may not be drawn independently.
• Occasionally we do know the distribution
– If the keys are known to be random real numbers k, independently and uniformly distributed in the range 0 ≤ k < 1,
– then the hash function h(k) = ⌊km⌋ satisfies the condition of simple uniform hashing.
The Division Method
h(k) = k mod m
• For example, if the hash table has size m = 12 and the key is k = 100, then h(k) = 4.
• Hashing by division is quite fast.
• What happens if m is a power of 2 (say 2^p)?
– Then h(k) is just the p lowest-order bits of k – a poor choice unless all low-order bit patterns are equally likely.
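A quick numeric illustration of the division method and of the power-of-2 pitfall (the key values below are our own example, not from the slides):

```python
# The division method h(k) = k mod m, and why m = 2^p is risky:
# with m = 2^p, h(k) is just the p low-order bits of k, so keys that
# agree in those bits all collide.

def h_div(k, m):
    return k % m

assert h_div(100, 12) == 4        # the slide's example: m = 12, k = 100

# Keys that differ only above the low 3 bits all collide when m = 8 = 2^3:
keys = [5, 13, 21, 29]            # all equal to 5 mod 8
slots_pow2 = {h_div(k, 8) for k in keys}    # one slot for all four keys
slots_prime = {h_div(k, 7) for k in keys}   # a prime m spreads them out
```

A prime m not too close to a power of 2 is the usual recommendation for the division method.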
The Multiplication Method
• For a constant 0 < A < 1: h(k) = ⌊m·frac(kA)⌋, where frac(x) is the fractional part of x.
• Advantage: the value of m is not critical.
• Usually m = 2^p for some integer p.
• Suppose that the word size is w bits.
• Choose A = s/2^w, where s is an integer in the range 0 < s < 2^w.
• The computation of h(k) can be done using integer operations:
– k·s = kA·2^w, so the rightmost w bits of k·s are equal to frac(kA)·2^w.
– ⌊m·frac(kA)⌋ = the first p bits of frac(kA) after the binary point = the leftmost p bits of the rightmost w bits of k·s.
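The integer-only computation above can be sketched directly. The choices of w, p and s below are illustrative; s is chosen so that s/2^w approximates (√5 − 1)/2 ≈ 0.618, a constant often suggested for A:

```python
# The multiplication method with integer operations:
# w-bit words, A = s / 2^w, table size m = 2^p.

w = 32                      # word size in bits (assumed)
p = 14                      # table size m = 2^p
s = 2654435769              # s/2^32 ~ (sqrt(5)-1)/2, a common choice for A

def h_mul(k):
    r = (k * s) & (2**w - 1)      # rightmost w bits of k*s = frac(kA)*2^w
    return r >> (w - p)           # leftmost p bits of those w bits

hv = h_mul(123456)
```

A multiply, a mask, and a shift: no division is needed, which is why this method is attractive when m = 2^p.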
Universal Hashing
• Pick a hash function randomly when the algorithm begins
– A way to randomize the algorithm so that no single input always produces worst-case behavior.
– Need a good family of hash functions to choose from
Universal Set
• A finite collection H of hash functions is
universal
– if for each pair k, l ∈ U, where k ≠ l, the number of hash functions h ∈ H for which h(k) = h(l) is at most |H|/m, where m is the size of the hash table.
• Alternatively, H is universal
– if, with a hash function h chosen randomly from H, the probability of a collision between two different keys is no more than 1/m.
Universal Hashing
• Theorem:
– Choose h from a universal family of hash functions.
– Hash n keys into a table T of m slots, n ≤ m.
– Then the expected number of collisions involving a particular key x is less than 1.
• Proof:
– For each pair of keys x, y, let I_xy be the indicator random variable of the event that x and y collide.
– By universality, E[I_xy] ≤ 1/m.
– Let C_x be the total number of collisions involving key x. Then

E[C_x] = E[Σ_{y∈T, y≠x} I_xy] = Σ_{y∈T, y≠x} E[I_xy] ≤ (n−1)/m

– Since n ≤ m, we have E[C_x] < 1.
• Corollary
– Using chaining and universal hashing, the expected time for each search operation is O(1).
A Universal Hash Set of Functions
• Let U = {1, …, u}.
• Fix a prime p > u.
• For every a ∈ {1, …, p−1}, b ∈ {0, …, p−1}, define h_ab(k) = ((ak + b) mod p) mod m, where m is the size of the table.
• H = { h_ab } is universal.
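A sketch of this family in code, with a randomly drawn member as universal hashing prescribes. The sizes are toy values for illustration: p = 101 is a prime larger than the universe {1, …, 100}:

```python
# The universal family h_ab(k) = ((a*k + b) mod p) mod m,
# with a and b drawn at random when the algorithm begins.
import random

p = 101          # prime larger than the universe size u = 100
m = 10           # table size

def make_hash():
    a = random.randrange(1, p)    # a in {1, ..., p-1}
    b = random.randrange(0, p)    # b in {0, ..., p-1}
    return lambda k: ((a * k + b) % p) % m

h = make_hash()                   # the random choice happens once, up front
values = [h(k) for k in range(1, 101)]
```

Once chosen, h is fixed and deterministic; the randomness lies only in which member of the family was picked, which is what defeats any single adversarial input.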
Open Addressing
• Basic idea:
– If slot is full, try another slot, etc., until an open slot is found (probing)
– To search, follow same sequence of probes as would be used when inserting the element
• If reach element with correct key, return it
• If reach a null pointer, element is not in table
• Good for fixed sets (adding but no deletion)
– Example: spell checking
• Table needn't be much bigger than n
– And, compared to chaining, no pointer storage is needed, so the same memory buys more slots.
Hash Function
h : U × {0, 1, ..., m−1} → {0, 1, ..., m−1}
such that for each k ∈ U,
{h(k,0), h(k,1), ..., h(k, m−1)}
is a permutation of
{0, 1, ..., m−1}
Insert & Search
• Insert: probe h(k,0), h(k,1), … until an empty slot is found; store the element there.
• Search: probe in the same order until the key is found or an empty slot is reached.
Deletion
• Cannot just put null into the slot containing
the key we want to delete.
• Solution:
– Use a special value DELETED when marking a slot
as empty during deletion.
• The disadvantage:
– Search time is no longer dependent on the load
factor α.
Linear Probing
Let
h' : U → {0, 1, ..., m − 1} be an ordinary hash function
Define
h(k, i) = (h'(k) + i) mod m
• Easy to implement
• Suffers from primary clustering
– Long runs of occupied slots build up.
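Putting the open-addressing pieces together – linear probing plus the DELETED marker from the deletion slide – gives a sketch like the following (the class and use of Python's built-in `hash` as h' are our own choices):

```python
# Open addressing with linear probing h(k, i) = (h'(k) + i) mod m,
# using a DELETED sentinel so search can keep probing past removed keys.

DELETED = object()                 # special marker, distinct from any element

class OpenAddressTable:
    def __init__(self, m):
        self.m = m
        self.T = [None] * m

    def probe(self, k, i):
        return (hash(k) + i) % self.m

    def insert(self, k, v):
        for i in range(self.m):
            j = self.probe(k, i)
            if self.T[j] is None or self.T[j] is DELETED or self.T[j][0] == k:
                self.T[j] = (k, v)
                return
        raise RuntimeError("table full")

    def search(self, k):
        for i in range(self.m):
            j = self.probe(k, i)
            if self.T[j] is None:          # truly empty slot: not present
                return None
            if self.T[j] is not DELETED and self.T[j][0] == k:
                return self.T[j][1]
        return None

    def delete(self, k):
        for i in range(self.m):
            j = self.probe(k, i)
            if self.T[j] is None:
                return
            if self.T[j] is not DELETED and self.T[j][0] == k:
                self.T[j] = DELETED        # cannot simply set to None
                return

t = OpenAddressTable(8)
t.insert("a", 1)
t.insert("b", 2)
```

Note why DELETED is needed: if delete wrote None, a later search for a key that had probed past the deleted slot would stop early and wrongly report "not found".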
Double hashing
h(k, i) = (h1(k) + ih2(k)) mod m
where
h1 and h2 are hash functions
• Advantage:
– Two keys with the same h1 value may still probe with different steps.
– Θ(m²) different probe sequences, compared with only m for linear probing.
Example
• h1(k) = k mod 13
• h2(k) = 1 + (k mod 11)
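Using the slide's two functions, the probe sequence of any key can be listed directly. Here we assume the table size is m = 13, matching h1:

```python
# Probe sequences for double hashing with the example functions
# h1(k) = k mod 13 and h2(k) = 1 + (k mod 11), table size m = 13.

def probe_sequence(k, m=13):
    h1 = k % 13
    h2 = 1 + (k % 11)
    return [(h1 + i * h2) % m for i in range(m)]

seq = probe_sequence(14)    # h1 = 1, h2 = 4: probes 1, 5, 9, 0, 4, ...
```

Because m = 13 is prime and 1 ≤ h2(k) ≤ 11 < m, h2(k) is always relatively prime to m, so every key's probe sequence visits all m slots.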
Analysis of Open-Address Hashing
• Theorem:
– Given an open-address hash table with load factor α < 1,
– the expected number of probes in a successful search is at most (1/α)·ln(1/(1−α)),
• assuming uniform hashing
– each key is equally likely to have any of the m! permutations of ⟨0, 1, . . . , m − 1⟩ as its probe sequence,
• and assuming that each key in the table is equally likely to be searched for.
Analysis of Open-Address Hashing
• Theorem:
– Given an open-address hash table with load factor α < 1,
– the expected number of probes in an unsuccessful search is at most 1/(1−α),
• assuming uniform hashing.
• If α is a constant, a search runs in O(1) time.
• Theorem:
– The expected number of probes to insert is at most 1/(1−α).
Bloom Filter
Motivation
• Suppose we must store a set of 10,000,000 URLs (e.g., a blacklist of URLs).
• Storing the URLs themselves takes a great deal of space, and we only need to answer membership queries.
• We are willing to tolerate a small probability of a false positive – answering "yes" for a URL that is not in the set – but never a false negative.

A First Attempt – A Single Bit Array
• We have 10,000,000 URLs.
• Take a bit array T of 80,000,000 bits and a hash function h; initialize all of T to 0.
• For every URL x in the set, set T[h(x)] = 1.
• To query a URL x, examine T[h(x)]:
– If T[h(x)] = 0, then x is certainly not in the set.
– If T[h(x)] = 1, we answer that x is in the set.
• But if T[h(x)] = 1, is x necessarily in the set?
– No: there may be some y in the set with h(x) = h(y) – a false positive.
• The probability of a false positive is about 10,000,000/80,000,000 = 0.125.
The Bloom Filter
• A bit array of m bits and k hash functions h_1, …, h_k.
• Initialization: all bits are 0.
• To insert an element x, compute the k hashes of x and set the corresponding k bits to 1.
• To query an element y, compute its k hashes; answer "yes" iff all k corresponding bits are 1.
• In total, n elements are inserted.

Analysis Assumptions
• For the analysis, we assume the hash functions behave like independent, uniformly random functions.
• We will compute the false-positive probability as a function of k, m and n under this assumption.
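The insert and query operations can be sketched in a few lines. The k hash functions are simulated here by hashing the element together with an index – an illustrative choice, not something the slides prescribe:

```python
# A small Bloom-filter sketch: k bits per inserted element,
# query answers "yes" iff all k bits are set.

class BloomFilter:
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = [0] * m

    def _hashes(self, x):
        # simulate k hash functions by salting with the index i
        return [hash((i, x)) % self.m for i in range(self.k)]

    def insert(self, x):
        for j in self._hashes(x):
            self.bits[j] = 1

    def query(self, x):
        return all(self.bits[j] for j in self._hashes(x))

bf = BloomFilter(m=64, k=3)
for url in ["a.com", "b.com", "c.com"]:
    bf.insert(url)
```

Every inserted element is guaranteed to answer "yes" (no false negatives); only non-members can be answered incorrectly, which is the false-positive event analyzed below.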
Example (m = 12, k = 3)
• Initially the array is all zeros: 0 0 0 0 0 0 0 0 0 0 0 0
• Insert x1: its 3 hash positions are set to 1 → 0 0 0 0 0 1 0 1 0 0 1 0
• Insert x2: its 3 hash positions are set to 1 → 0 0 1 1 0 1 0 1 0 0 1 0
• Query y1: at least one of its 3 positions is 0, so it is correctly reported as not in the set.
• Query y2: all 3 of its positions happen to hold 1, so it is reported as in the set – a false positive.
The False-Positive Probability
• A query for an element not in the set is a false positive iff all k of its bit positions hold 1.
• Inserting n elements sets k bits each, so the probability that a particular bit is still 0 afterwards is
(1 − 1/m)^{kn} ≈ e^{−kn/m}.
• Hence the probability that a particular bit is 1 is about 1 − e^{−kn/m}.
• The false-positive probability is therefore approximately
g(k) = (1 − e^{−kn/m})^k.

Choosing k
• Given m (the array size) and n (the number of elements), which k minimizes the false-positive probability?
• There is a trade-off: a small k checks few bits, while a large k fills the array with 1s.
• Minimizing g(k) (e.g., by differentiating ln g(k)) gives
k = (m/n)·ln 2,
• for which the false-positive probability is (1/2)^k ≈ (0.6185)^{m/n}.
• Example: spending one byte per element (m/n = 8 bits) gives an optimal false-positive probability of about 0.0214.
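The formulas above are easy to check numerically. The sketch below evaluates g(k) = (1 − e^{−kn/m})^k for m/n = 8 and confirms both the optimum k = (m/n)·ln 2 and the resulting rate of about 0.0214:

```python
# Numeric check of the false-positive formula and its optimum.
import math

def false_positive(k, m, n):
    return (1.0 - math.exp(-k * n / m)) ** k

m_over_n = 8                              # 8 bits (one byte) per element
k_opt = m_over_n * math.log(2)            # optimal k, about 5.55
p_opt = false_positive(k_opt, m_over_n, 1)   # about 0.0214
```

At the optimum, e^{−kn/m} = 1/2, so each checked bit is 1 with probability exactly 1/2; integer k values near 5.55 (k = 5 or 6) give rates only slightly worse.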