Lecture 17 April 11, 11 Chapter 5, Hashing • dictionary operations • general idea of hashing • hash functions • chaining • closed hashing


Lecture 17 April 11, 11

Chapter 5, Hashing

• dictionary operations

• general idea of hashing

• hash functions

• chaining

• closed hashing

Dictionary operations

o search
o insert
o delete

Applications:

• database search
• books in a library
• patient records, GIS data, etc.

• web page caching (web search)

• combinatorial search (game tree)

Dictionary operations: search, insert, delete

            ARRAY                   LINKED LIST
            sorted      unsorted    sorted      unsorted
Search      O(log n)    O(n)        O(n)        O(n)
Insert      O(n)        O(1)        O(n)        O(n)
Delete      O(n)        O(n)        O(n)        O(n)

comparisons and data movements combined (Assuming keys can be compared with <, > and = outcomes)

Exercise: Create a similar table separately for data movements and for comparisons.

Performance goal for dictionary operations:

O(n) is too inefficient.

Goal
• O(log n) on average
• O(log n) in the worst case
• O(1) on average

Data structures that achieve these goals, respectively:

(a) binary search tree

(b) balanced binary search tree (AVL tree)

(c) hashing (but the worst case is O(n))


Hashing

o An important and widely useful technique for implementing dictionaries.

o Constant time per operation (on the average).

o Worst case time proportional to the size of the set for each operation (just like array and linked list implementation)

General idea

U = set of all possible keys (e.g., 9-digit SS #s)

If n = |U| is not very large, a simple way to support dictionary operations is:

Map each key e in U to a unique integer h(e) in the range 0 .. n – 1.

Use a Boolean array H[0 .. n – 1] to record which keys are present: H[h(e)] is true exactly when e is in the dictionary.
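The direct-address idea can be sketched in a few lines of C++. This is only an illustration under the assumption that keys are already integers in 0 .. n – 1, so that h(e) = e; the class name DirectAddressSet is hypothetical and does not come from the lecture.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Direct-address set: one Boolean flag per possible key.
// Assumption: keys are integers in 0 .. n-1, so h(e) = e.
class DirectAddressSet
{
  public:
    explicit DirectAddressSet( std::size_t n ) : H( n, false ) { }

    bool contains( std::size_t key ) const
    {
        assert( key < H.size( ) );
        return H[ key ];
    }

    // Returns false if the key was already present.
    bool insert( std::size_t key )
    {
        assert( key < H.size( ) );
        if( H[ key ] )
            return false;
        H[ key ] = true;
        return true;
    }

    // Returns false if the key was not present.
    bool remove( std::size_t key )
    {
        assert( key < H.size( ) );
        if( !H[ key ] )
            return false;
        H[ key ] = false;
        return true;
    }

  private:
    std::vector<bool> H;   // H[k] is true exactly when key k is in the set
};
```

Every operation touches a single array cell, which is why this scheme runs in constant time but needs space proportional to |U|.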


Ideal case not realistic

• U, the set of all possible keys, is usually very large, so we can’t create an array of size n = |U|.

• Create an array H of size m, much smaller than n.

• The number of keys actually present at any time is usually much smaller than n.

• A mapping h: U -> {0, 1, …, m – 1} is called a hash function.

Example: D = students currently enrolled in courses, U = set of all SS #’s, hash table of size 1000.

Hash function: h(x) = last three digits of x.

Example (continued)

Insert student “Dan”, SS# = 1238769871
h(1238769871) = 871

[Figure: hash table with buckets 0, 1, 2, 3, ..., 999; bucket 871 points to a list holding Dan, terminated by NULL.]

Example (continued)

Insert student “Tim”, SS# = 1872769871
h(1872769871) = 871, the same as that of Dan.

Collision

[Figure: hash table with buckets 0, 1, 2, 3, ..., 999; bucket 871 already holds Dan.]
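The collision between the two example keys can be checked with a small sketch. The function names lastThreeDigits and collide are illustrative choices, not names from the lecture.

```cpp
#include <cassert>
#include <cstdint>

// h(x) = last three digits of the key, for a hash table of size 1000.
int lastThreeDigits( std::int64_t ssn )
{
    assert( ssn >= 0 );
    return static_cast<int>( ssn % 1000 );
}

// Two distinct keys collide when they hash to the same bucket.
bool collide( std::int64_t k1, std::int64_t k2 )
{
    return k1 != k2 && lastThreeDigits( k1 ) == lastThreeDigits( k2 );
}
```

Both 1238769871 and 1872769871 map to bucket 871, so collide reports true for this pair.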

Hash Functions

If h(k1) == h(k2), then k1 and k2 collide at that slot.

There are two approaches to resolving collisions.

Collision Resolution Policies

Two ways to resolve collisions:
(1) Open hashing, also known as separate chaining
(2) Closed hashing, a.k.a. open addressing

Chaining: keys that collide are stored in a linked list.

Previous Example:

Insert student “Tim”, SS# = 1872769871
h(1872769871) = 871, the same as that of Dan.

Collision

[Figure: hash table with buckets 0, 1, 2, 3, ..., 999; the linked list of bucket 871 now holds both Dan and Tim and is terminated by NULL.]

Open Hashing

Each entry of the hash table is a pointer to the head of a linked list.

All elements that hash to a particular bucket are placed on that bucket’s linked list.

Records within a bucket can be ordered in several ways: by order of insertion, by key value, or by frequency of access.

Open Hashing Data Organization

[Figure: array of buckets 0, 1, 2, 3, 4, ..., D-1, each pointing to its own linked list.]

Implementation of open hashing - search

bool contains( const HashedObj & x )
{
    // Bind by reference to avoid copying the bucket's list.
    const list<HashedObj> & whichList = theLists[ myhash( x ) ];
    return find( whichList.begin( ), whichList.end( ), x ) != whichList.end( );
}

find is a function template declared in the STL header <algorithm>. It behaves like the following code:

template<class InputIterator, class T>
InputIterator find( InputIterator first, InputIterator last, const T & value )
{
    for( ; first != last; ++first )
        if( *first == value )
            break;
    return first;
}

Implementation of open hashing - insert

bool insert( const HashedObj & x )
{
    // Bind by reference so that push_back modifies the table's list, not a copy.
    list<HashedObj> & whichList = theLists[ myhash( x ) ];
    if( find( whichList.begin( ), whichList.end( ), x ) != whichList.end( ) )
        return false;              // x is already present
    whichList.push_back( x );
    return true;
}

The new key is inserted at the end of the list.


Implementation of open hashing - delete
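The code for this slide did not survive extraction. The following is a minimal sketch of how delete could look in the same style as search and insert, not the lecture's own code. The class name ChainedHashSet is hypothetical, and std::hash is assumed as a stand-in for the hash function.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <list>
#include <vector>

template <typename HashedObj>
class ChainedHashSet
{
  public:
    explicit ChainedHashSet( int size = 101 ) : theLists( size ) { assert( size > 0 ); }

    bool contains( const HashedObj & x ) const
    {
        const std::list<HashedObj> & whichList = theLists[ myhash( x ) ];
        return std::find( whichList.begin( ), whichList.end( ), x ) != whichList.end( );
    }

    bool insert( const HashedObj & x )
    {
        std::list<HashedObj> & whichList = theLists[ myhash( x ) ];
        if( std::find( whichList.begin( ), whichList.end( ), x ) != whichList.end( ) )
            return false;
        whichList.push_back( x );
        return true;
    }

    // Delete: locate x in its bucket's list and unlink it.
    bool remove( const HashedObj & x )
    {
        std::list<HashedObj> & whichList = theLists[ myhash( x ) ];
        auto itr = std::find( whichList.begin( ), whichList.end( ), x );
        if( itr == whichList.end( ) )
            return false;          // x is not in the table
        whichList.erase( itr );
        return true;
    }

  private:
    std::vector<std::list<HashedObj>> theLists;   // one list per bucket

    // Assumption: std::hash stands in for the lecture's hash function.
    std::size_t myhash( const HashedObj & x ) const
    { return std::hash<HashedObj>{ }( x ) % theLists.size( ); }
};
```

As with search and insert, the cost of remove is the cost of one hash computation plus a scan of a single bucket's list.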


Choice of hash function

A good hash function should:

• be easy to compute

• distribute the keys uniformly to the buckets

• use all the fields of the key object.

Example: key is a string over {a, …, z, 0, …, 9, _}.
Suppose the hash table size is n = 10007.

(Choose the table size to be a prime number.)

Good hash function: interpret the string as a number in base 37 and compute its value mod 10007.

h(“word”) = ? “w” = 23, “o” = 15, “r” = 18 and “d” = 4.

h(“word”) = (23 * 37^3 + 15 * 37^2 + 18 * 37^1 + 4) % 10007 = 1186224 % 10007 = 5398
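The base-37 computation can be sketched as follows. The letter values a..z -> 1..26 follow the slide; the values for digits and the underscore are assumptions made here for illustration, as are the function names digitValue and base37Hash.

```cpp
#include <cassert>
#include <string>

// Digit value of a character in base 37.
// From the slide: a..z -> 1..26. Assumed here: 0..9 -> 27..36 and '_' -> 0.
int digitValue( char c )
{
    if( c >= 'a' && c <= 'z' )
        return c - 'a' + 1;
    if( c >= '0' && c <= '9' )
        return c - '0' + 27;
    assert( c == '_' );
    return 0;
}

// Interpret key as a base-37 number and reduce it modulo tableSize.
// Reducing after every step keeps the intermediate values small.
int base37Hash( const std::string & key, int tableSize )
{
    assert( tableSize > 0 );
    long long hashVal = 0;
    for( char c : key )
        hashVal = ( 37 * hashVal + digitValue( c ) ) % tableSize;
    return static_cast<int>( hashVal );
}
```

Under these assumptions, base37Hash( "word", 10007 ) evaluates to 5398.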

Computing hash function for a string

Horner’s rule: $(\cdots((a_0 x + a_1)x + a_2)x + \cdots + a_{n-2})x + a_{n-1}$

int hash( const string & key )
{
    // Unsigned arithmetic wraps around on overflow instead of being undefined.
    unsigned int hashVal = 0;

    for( int i = 0; i < key.length( ); i++ )
        hashVal = 37 * hashVal + key[ i ];    // Horner's rule with x = 37

    return hashVal;
}

Computing hash function for a string

int myhash( const HashedObj & x ) const
{
    int hashVal = hash( x );
    hashVal %= theLists.size( );
    return hashVal;
}

Alternatively, we can apply % theLists.size() after each iteration of the loop in the hash function.

int myHash( const string & key )
{
    int hashVal = 0;
    int s = theLists.size( );

    for( int i = 0; i < key.length( ); i++ )
        hashVal = ( 37 * hashVal + key[ i ] ) % s;

    return hashVal % s;
}

Analysis of open hashing/chaining

Open hashing uses more memory than open addressing (because of pointers), but is generally more efficient in terms of time.

If the keys arriving are random and the hash function is good, keys will be nicely distributed to different buckets and so each list will be roughly the same size.

Let n = the number of keys present in the hash table,
    m = the number of buckets (lists) in the hash table.

If there are n elements in the set, then each bucket will hold roughly n/m elements.

If we can estimate n and choose m to be ~ n, then the average bucket size will be about 1. (Most buckets will have a small number of items.)

Analysis continued

Average time per dictionary operation:

With m buckets and n elements in the dictionary, there are on average n/m elements per bucket.

n/m = λ is called the load factor.

Insert, search and remove operations take O(1 + n/m) = O(1 + λ) time each (the 1 is for the hash function computation).

If we can choose m ~ n, we get constant time per operation on average. (Assuming each element is equally likely to be hashed to any bucket, the running time is constant, independent of n.)
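To make the n/m estimate concrete, here is a small sketch that hashes n keys into m buckets and reports the chain lengths. It assumes, purely for illustration, that the keys are the consecutive integers 0 .. n-1 and that h(k) = k % m; the function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Chain lengths after hashing the keys 0 .. n-1 into m buckets with h(k) = k % m.
// Assumption: consecutive integers stand in for "well distributed" keys.
std::vector<std::size_t> chainLengths( std::size_t n, std::size_t m )
{
    assert( m > 0 );
    std::vector<std::size_t> length( m, 0 );
    for( std::size_t k = 0; k < n; ++k )
        ++length[ k % m ];
    return length;
}

// Load factor n/m: the average chain length.
double loadFactor( std::size_t n, std::size_t m )
{
    assert( m > 0 );
    return static_cast<double>( n ) / static_cast<double>( m );
}
```

For n = 1000 and m = 101, every chain has length 9 or 10, close to the load factor of about 9.9. Real key sets are less regular, so actual chain lengths vary more around n/m.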

Closed Hashing

Associated with closed hashing is a rehash strategy: “If we try to place x in bucket h(x) and find it occupied, find alternative locations h1(x), h2(x), etc. Try each in turn until an empty cell is found. If all the cells have been probed without finding one, the hash table is full.”

h(x) is called the home bucket.

The simplest rehash strategy is called linear hashing (linear probing): hi(x) = (h(x) + i) % D, where D is the table size.

In general, the collision resolution strategy is to generate a sequence of hash table addresses (the probe sequence) and test each slot until you find an empty one (probing).

Closed Hashing

Example: m = 8, keys a, b, c, d have hash values h(a) = 3, h(b) = 0, h(c) = 4, h(d) = 3.

Table after inserting a, b and c:

bucket   0  1  2  3  4  5  6  7
key      b        a  c

Where do we insert d? Bucket 3 is already filled.

Probe sequence using linear hashing:

h1(d) = (h(d) + 1) % 8 = 4 % 8 = 4
h2(d) = (h(d) + 2) % 8 = 5 % 8 = 5 *
h3(d) = (h(d) + 3) % 8 = 6 % 8 = 6
Etc.

* Bucket 4 is occupied by c, so d is placed in bucket 5, the first empty bucket in the sequence. If necessary, the probe sequence wraps around to the beginning of the table.
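The example can be replayed with a short sketch of insertion under linear probing. The function name linearProbeInsert and the use of single characters as keys are illustrative assumptions.

```cpp
#include <cassert>
#include <vector>

// Insert into a closed hash table with linear probing: h_i = (home + i) % m.
// table[b] holds a key character, or 0 when bucket b is empty.
// Returns the bucket used, or -1 if every bucket has been probed (table full).
int linearProbeInsert( std::vector<char> & table, char key, int home )
{
    const int m = static_cast<int>( table.size( ) );
    assert( home >= 0 && home < m );
    for( int i = 0; i < m; ++i )
    {
        int pos = ( home + i ) % m;
        if( table[ pos ] == 0 )
        {
            table[ pos ] = key;
            return pos;
        }
    }
    return -1;
}
```

Inserting a, b, c and d with home buckets 3, 0, 4 and 3 into a table of size 8 places them in buckets 3, 0, 4 and 5.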

Operations Using Linear Hashing

• Test for membership: search

• Examine h(k), h1(k), h2(k), …, until we find k, reach an empty bucket, or return to the home bucket:

  case 1: successful search -> return true
  case 2: unsuccessful search (empty bucket reached) -> return false
  case 3: unsuccessful search and the table is full (back at the home bucket) -> return false

• If deletions are not allowed, this strategy works!
• What if deletions are allowed?

Dictionary Operations with Linear Hashing

• What if deletions are allowed? If we reach an empty bucket, we cannot be sure that k is not somewhere further along the probe sequence: the empty bucket may have been occupied when k was inserted.

• We need a special placeholder, deleted, to distinguish a bucket that was never used from one that once held a value.
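The role of the deleted placeholder can be seen in a small self-contained sketch. It assumes non-negative int keys and h(k) = k % m; the names ProbingTable, Slot and SlotState are hypothetical and simpler than the textbook class.

```cpp
#include <cassert>
#include <vector>

enum SlotState { EMPTY, ACTIVE, DELETED };

struct Slot { int key = 0; SlotState state = EMPTY; };

// Linear-probing table for non-negative int keys; h(k) = k % m (an assumption).
class ProbingTable
{
  public:
    explicit ProbingTable( int m ) : slots( m ) { assert( m > 0 ); }

    bool insert( int k )
    {
        assert( k >= 0 );
        if( contains( k ) )
            return false;
        int m = size( );
        for( int i = 0; i < m; ++i )
        {
            Slot & s = slots[ ( k % m + i ) % m ];
            if( s.state != ACTIVE )        // EMPTY and DELETED slots are both reusable
            { s.key = k; s.state = ACTIVE; return true; }
        }
        return false;                      // table is full
    }

    bool contains( int k ) const { return find( k ) >= 0; }

    bool remove( int k )
    {
        int pos = find( k );
        if( pos < 0 )
            return false;
        slots[ pos ].state = DELETED;      // leave a placeholder, not EMPTY
        return true;
    }

  private:
    std::vector<Slot> slots;

    int size( ) const { return static_cast<int>( slots.size( ) ); }

    // Probing stops at an EMPTY slot but continues past DELETED ones.
    int find( int k ) const
    {
        int m = size( );
        for( int i = 0; i < m; ++i )
        {
            int pos = ( k % m + i ) % m;
            if( slots[ pos ].state == EMPTY )
                return -1;
            if( slots[ pos ].state == ACTIVE && slots[ pos ].key == k )
                return pos;
        }
        return -1;
    }
};
```

With m = 8, keys 3 and 11 share home bucket 3. After removing 3, a search for 11 still succeeds because bucket 3 is marked DELETED rather than EMPTY.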

Implementation of closed hashing

Code slightly modified from the text.

// CONSTRUCTION: an approximate initial size or default of 101
//
// ******************PUBLIC OPERATIONS*********************
// bool insert( x )       --> Insert x
// bool remove( x )       --> Remove x
// bool contains( x )     --> Return true if x is present
// void makeEmpty( )      --> Remove all items
// int hash( string str ) --> Global method to hash strings

There is no distinction between the hash functions used in closed hashing and open hashing. (I.e., they can be used in either context interchangeably.)

template <typename HashedObj>
class HashTable
{
  public:
    explicit HashTable( int size = 101 ) : array( nextPrime( size ) )
      { makeEmpty( ); }

    bool contains( const HashedObj & x )
    {
        return isActive( findPos( x ) );
    }

    void makeEmpty( )
    {
        currentSize = 0;
        for( int i = 0; i < array.size( ); i++ )
            array[ i ].info = EMPTY;
    }

    bool insert( const HashedObj & x )
    {
        int currentPos = findPos( x );
        if( isActive( currentPos ) )
            return false;

        array[ currentPos ] = HashEntry( x, ACTIVE );

        if( ++currentSize > array.size( ) / 2 )
            rehash( );    // rehash when load factor exceeds 0.5

        return true;
    }

    bool remove( const HashedObj & x )
    {
        int currentPos = findPos( x );
        if( !isActive( currentPos ) )
            return false;

        array[ currentPos ].info = DELETED;
        return true;
    }

    enum EntryType { ACTIVE, EMPTY, DELETED };

  private:
    struct HashEntry
    {
        HashedObj element;
        EntryType info;

        // Constructor needed by insert, which builds HashEntry( x, ACTIVE ).
        HashEntry( const HashedObj & e = HashedObj( ), EntryType i = EMPTY )
          : element( e ), info( i ) { }
    };

    vector<HashEntry> array;
    int currentSize;

    bool isActive( int currentPos ) const
      { return array[ currentPos ].info == ACTIVE; }

    int findPos( const HashedObj & x )
    {
        int offset = 1;               // int offset = s_hash( x );  /* double hashing */
        int currentPos = myhash( x );

        while( array[ currentPos ].info != EMPTY &&
               array[ currentPos ].element != x )
        {
            currentPos += offset;     // Compute ith probe
            // offset += 2;           /* quadratic probing */
            if( currentPos >= array.size( ) )
                currentPos -= array.size( );
        }

        return currentPos;
    }
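The constructor above calls nextPrime, which is not shown on the slides. A simple stand-in, assuming trial division is fast enough for table sizes, might look like this; it is a sketch and not necessarily the textbook's version.

```cpp
#include <cassert>

// Trial division; adequate for the sizes of hash tables.
bool isPrime( int n )
{
    if( n < 2 )
        return false;
    if( n % 2 == 0 )
        return n == 2;
    for( int d = 3; d <= n / d; d += 2 )
        if( n % d == 0 )
            return false;
    return true;
}

// Smallest prime greater than or equal to n.
int nextPrime( int n )
{
    assert( n >= 0 );
    while( !isPrime( n ) )
        ++n;
    return n;
}
```

For example, nextPrime( 101 ) returns 101 and nextPrime( 10000 ) returns 10007, the table size used in the string hashing example.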

Performance Analysis - Worst Case

• Initialization: O(m), m = # of buckets

• Insert and search: O(n), n = number of elements currently in the table
  – Suppose close to n elements in the table form one long chain (a run of consecutive occupied buckets). Now we want to search for x, and say x is not in the table. It may happen that h(x) is the start address of this very long chain. Then it will take O(c) time to conclude failure, where c ~ n is the length of the chain.

• No better than an unsorted array.

Example

h(k) = k % 11 (for example, h(1001) = 0)

I. Table after inserting 1001, 9537, 3016, 9874, 2009 and 9875:

bucket   0     1     2     3   4   5   6   7     8     9     10
key      1001  9537  3016                  9874  2009  9875

1. What if the next element has home bucket 0? It goes to bucket 3. The same holds for elements with home bucket 1 or 2! Only a record with home position 3 will stay in its home bucket. So p = 4/11 that the next record will go to bucket 3.

2. Similarly, records hashing to 7, 8, 9 will end up in bucket 10.

3. Only records hashing to 4 will end up in bucket 4 (p = 1/11); the same holds for 5 and 6.

II. Insert 1052 (home bucket 7); it ends up in bucket 10:

bucket   0     1     2     3   4   5   6   7     8     9     10
key      1001  9537  3016                  9874  2009  9875  1052

The next element goes to bucket 3 with p = 8/11.
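The probabilities in this example can be recomputed with a short sketch. It assumes every home bucket is equally likely for the next key; the function name landingCounts is hypothetical.

```cpp
#include <cassert>
#include <vector>

// For a linear-probing table, count for each empty bucket how many home
// buckets would send the next inserted key there. Dividing a count by the
// table size gives the probability, assuming all home buckets are equally likely.
std::vector<int> landingCounts( const std::vector<bool> & occupied )
{
    const int m = static_cast<int>( occupied.size( ) );
    std::vector<int> count( m, 0 );
    for( int home = 0; home < m; ++home )
        for( int i = 0; i < m; ++i )
        {
            int pos = ( home + i ) % m;
            if( !occupied[ pos ] )
            {
                ++count[ pos ];
                break;
            }
        }
    return count;
}
```

With buckets 0, 1, 2, 7, 8 and 9 occupied in a table of size 11, the counts are 4 for bucket 3, 4 for bucket 10, and 1 each for buckets 4, 5 and 6. Once bucket 10 is filled as well, the count for bucket 3 rises to 8, which is the 8/11 of the example.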

Performance Analysis - Average Case

• Distinguish between successful and unsuccessful searches
  • Delete = successful search for the record to be deleted
  • Insert = unsuccessful search along its probe sequence

• The expected cost of hashing is a function of how full the table is: load factor λ = n/m

Random probing model vs. linear probing model

• It can be shown that the average costs under linear hashing (probing), at load factor λ, are:

  • Insertion: $\frac{1}{2}\left(1 + \frac{1}{(1 - \lambda)^2}\right)$

  • Deletion: $\frac{1}{2}\left(1 + \frac{1}{1 - \lambda}\right)$

• Random probing: Suppose we use the following approach: we create a sequence of hash functions $h_1, h_2, \ldots$, all of which are independent of each other.

  • Insertion: $\frac{1}{1 - \lambda}$

  • Deletion: $\frac{1}{\lambda} \ln \frac{1}{1 - \lambda}$
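To get a feel for these formulas, the sketch below evaluates them for a given load factor. The function names are illustrative; "insert" corresponds to an unsuccessful search and "delete" to a successful one, as on the slides.

```cpp
#include <cassert>
#include <cmath>

// Expected number of probes at load factor lambda, with 0 < lambda < 1.

double linearInsert( double lambda )
{
    assert( lambda > 0 && lambda < 1 );
    return 0.5 * ( 1 + 1 / ( ( 1 - lambda ) * ( 1 - lambda ) ) );
}

double linearDelete( double lambda )
{
    assert( lambda > 0 && lambda < 1 );
    return 0.5 * ( 1 + 1 / ( 1 - lambda ) );
}

double randomInsert( double lambda )
{
    assert( lambda > 0 && lambda < 1 );
    return 1 / ( 1 - lambda );
}

double randomDelete( double lambda )
{
    assert( lambda > 0 && lambda < 1 );
    return std::log( 1 / ( 1 - lambda ) ) / lambda;
}
```

At λ = 0.5 the values are 2.5, 1.5, 2 and about 1.39 probes. At λ = 0.9 insertion under linear probing already costs 50.5 probes, against 10 under random probing, which illustrates why the table is kept at most half full.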

Random probing – analysis of insertion (unsuccessful search)

What is the expected number of times one should roll a die before getting a 4?

Answer: 6 (probability of success = 1/6).

More generally, if the probability of success is p, the expected number of times you repeat until you succeed is 1/p.

If the current load factor is λ, then the probability of success is 1 – λ, since the proportion of empty slots is 1 – λ.

Improved Collision Resolution

• Linear probing: hi(x) = (h(x) + i) % D
  • all buckets in the table will be candidates for inserting a new record before the probe sequence returns to the home position
  • clustering of records leads to long probe sequences

• Linear probing with increment c > 1: hi(x) = (h(x) + ic) % D
  • c is a constant other than 1
  • records with adjacent home buckets will not follow the same probe sequence

• Double hashing: hi(x) = (h(x) + i g(x)) % D
  • g is another hash function that is used as the increment amount.
  • Avoids the clustering problems associated with linear probing.
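The three strategies differ only in the step added at each probe, which the following sketch makes explicit. The second hash function g(x) = R - (x % R) is a commonly used choice assumed here for illustration; it is not taken from the slides, and both function names are hypothetical.

```cpp
#include <cassert>
#include <vector>

// First `count` probe positions h_i = (home + i * step) % D for i = 0, 1, ...
//   step = 1      : linear probing
//   step = c > 1  : linear probing with increment c
//   step = g(x)   : double hashing
std::vector<int> probeSequence( int home, int step, int D, int count )
{
    assert( D > 0 && home >= 0 && home < D && step > 0 );
    std::vector<int> seq;
    for( int i = 0; i < count; ++i )
        seq.push_back( ( home + i * step ) % D );
    return seq;
}

// Hypothetical second hash function for int keys: g(x) = R - (x % R), R a prime < D.
// It never returns 0, so the probe sequence always moves.
int secondHash( int x, int R )
{
    assert( x >= 0 && R > 0 );
    return R - ( x % R );
}
```

With D = 11 and R = 7, the keys 14 and 25 share home bucket 3 but get steps 7 and 3, so their probe sequences 3, 10, 6 and 3, 6, 9 diverge after the first probe. Under linear probing both keys would follow 3, 4, 5, ...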

Comparison with Closed Hashing

• Worst-case performance is O(n) for both. The average case is a small constant in both cases when λ is small.

• Closed hashing – uses less space.

• Open hashing – behavior is not sensitive to load factor. Also no need to resize the table since memory is dynamically allocated.


Probes are assumed to be independent. Success in the case of insertion involves finding an empty slot to insert.

Proof for the case of insertion: 1/(1 – λ)

Recall: the geometric distribution involves a sequence of independent random experiments, each with outcome success (with prob = p) or failure (with prob = 1 – p).

We repeat the experiment until we get a success.

The question is: what is the expected number of trials performed?

Answer: 1/p

In the case of insertion, success means finding an empty slot. The probability of success is thus 1 – λ.

Thus, the expected number of probes = 1/(1 – λ)
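The value 1/p can be checked directly from the definition of expectation. A short derivation, writing q = 1 – p for the failure probability (the symbol q is introduced here only for convenience):

```latex
\begin{align*}
E[\text{trials}]
  &= \sum_{k=1}^{\infty} k\, q^{k-1} p
   = p \,\frac{d}{dq} \sum_{k=0}^{\infty} q^{k}
   = p \,\frac{d}{dq}\, \frac{1}{1-q}
   = \frac{p}{(1-q)^{2}}
   = \frac{1}{p}, \\
p = 1 - \lambda \;&\Longrightarrow\; E[\text{probes}] = \frac{1}{1-\lambda}.
\end{align*}
```

The first line uses the fact that the first success occurs on trial k with probability $q^{k-1} p$, together with the geometric series for $|q| < 1$.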


[Figure: average number of probes (vertical axis, 2 to 20) versus load factor (horizontal axis, 0 to 1) for a successful search; curves for linear probing, double hashing and separate chaining.]

[Figure: average number of probes (vertical axis, 2 to 20) versus load factor (horizontal axis, 0 to 1) for an unsuccessful search; curves for linear probing, double hashing and separate chaining.]

Another hash function - Multiplication Method

We choose m to be a power of 2 (m = 2^p) and define

$$h(\text{key}) = \lfloor m \,(\text{key} \cdot A \bmod 1) \rfloor, \quad \text{where } 0 < A < 1,$$

with

$$A = \frac{\sqrt{5} - 1}{2} = 0.6180339887\ldots$$

For example, if key = 12345 and m = 512, then:

$$h(\text{key}) = \lfloor 512 \,(12345 \cdot 0.618\ldots \bmod 1) \rfloor = \lfloor 512 \,(7629.6296\ldots \bmod 1) \rfloor = \lfloor 512 \cdot 0.6296\ldots \rfloor = \lfloor 322.35\ldots \rfloor = 322$$

Multiplication Method: Implementation

[Figure: the w-bit key is multiplied by the w-bit value A · 2^w; the product consists of a high-order word and a low-order word, and h(key) is obtained by extracting p bits from the product.]
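A sketch of the multiplication method in both forms follows. The floating-point version mirrors the formula directly. The integer version assumes the standard formulation with w = 32, in which the key is multiplied by s = floor(A · 2^32) and the p most significant bits of the low-order word are extracted. The function names and the fixed word size are assumptions made for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Floating-point form: h(key) = floor( m * (key * A mod 1) ), A = (sqrt(5) - 1) / 2.
int multHashFloat( std::uint32_t key, int m )
{
    assert( m > 0 );
    const double A = ( std::sqrt( 5.0 ) - 1.0 ) / 2.0;
    double product = key * A;
    double frac = product - std::floor( product );     // key * A mod 1
    return static_cast<int>( std::floor( m * frac ) );
}

// Integer form for w = 32 and m = 2^p: multiply by s = floor(A * 2^32), keep the
// low-order word of the product, and extract its p most significant bits.
std::uint32_t multHashInt( std::uint32_t key, int p )
{
    assert( p >= 1 && p <= 32 );
    const std::uint32_t s = 2654435769u;                // floor(A * 2^32)
    std::uint32_t low = key * s;                        // unsigned multiply wraps mod 2^32
    return low >> ( 32 - p );
}
```

For key = 12345 and m = 512 = 2^9, both versions return 322. The integer form needs only one multiplication and one shift, which is why m is chosen as a power of 2.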