CSE 1302 Lecture 23 Hashing and Hash Tables Richard Gesick

CSE 1302

Lecture 23

Hashing and Hash Tables

Richard Gesick

Lookup tables

• A lookup table is not really a search method but a method to avoid having to conduct searches.

• The key either is, or can be converted to, an integer value that is used as the index. The mapping of all valid keys to the computed indices must be unambiguous.

• In a lookup table the key is used as the index, so the record can be accessed directly. This may waste a great deal of storage space, but the running time is O(1).

Lookup tables (2)

• Consider a 5-digit ZIP code: there are 100,000 possible combinations. Not all are used by the postal service, but a lookup table that checks whether a ZIP code is valid (has a locale) would have to be sized to hold all 100,000 possible combinations.
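A minimal sketch of such a table, assuming a hypothetical class name ZipLookup and an arbitrary sample ZIP code; it is only meant to show that the array needs one slot per possible key, not how a real postal database works.

```java
// Direct lookup table: the ZIP code itself is the array index.
// ZipLookup and its method names are illustrative, not from the lecture.
final class ZipLookup {
    // One slot for every possible 5-digit code, used or not.
    private final boolean[] valid = new boolean[100_000];

    // Records that a ZIP code is in use; zip must be in 0..99999.
    void markValid(int zip) {
        valid[zip] = true;
    }

    // O(1) check: a bounds test plus a single array access.
    boolean isValid(int zip) {
        return zip >= 0 && zip < valid.length && valid[zip];
    }
}
```

After `markValid(30144)`, the call `isValid(30144)` returns true while every unmarked code returns false. The array occupies 100,000 slots no matter how few codes are marked, which is the wasted space the slide refers to.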

Hashing

• Hashing is a method of devising a lookup table that tries not to waste as much space.

• The purpose of the hash is to map all possible key combinations into a smaller range of values and to cover that range with a fairly uniform distribution.

Hashing (2)

• A hash table uses a hash function to convert keys into table indices.

• Hashing has a side effect: it loses the one-to-one correspondence between keys and table locations. Two or more keys can be mapped to the same location in the table.

• These multiple mappings are called collisions.

Hashing (3)

• A hash table requires that the original key also be stored with the record. Then, when we retrieve a record from the hash table, we can verify that it is indeed the proper record.

• Hash table design relies on successfully handling two elements:
  – choosing a good hash function
  – good collision handling

Hash functions

• Properties of a good hash function:
  – easy to calculate
  – maps all possible key values into a reasonable range
  – covers the range uniformly and minimizes collisions

• If the key is a string, one option is to convert its characters to their ASCII codes and combine them arithmetically to produce an integer from 0 to table size - 1.

• Too simplistic a method, such as truncating the key or using modulo division on it, may create clusters of collisions if the incoming data is not uniformly distributed.
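To make the clustering warning concrete, here is a small sketch, assuming a hypothetical helper class ClusteringDemo. It counts how many distinct slots a set of keys occupies when the only hash step is modulo division by the table size.

```java
// Counts the distinct table slots used when keys are reduced with key % tableSize.
// ClusteringDemo is an illustrative name, not part of the lecture.
final class ClusteringDemo {
    static int distinctSlots(int[] keys, int tableSize) {
        boolean[] used = new boolean[tableSize];
        int count = 0;
        for (int key : keys) {
            int index = Math.floorMod(key, tableSize); // always in 0..tableSize-1
            if (!used[index]) {
                used[index] = true;
                count++;
            }
        }
        return count;
    }
}
```

For the keys 0, 10, 20, ..., 90, a table of size 10 puts all ten keys in slot 0, while a table of size 11 spreads them over ten different slots. The data here is deliberately non-uniform (every key is a multiple of 10), which is exactly the situation the last bullet describes.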

Hash functions (2)

• There are two primary ways to resolve collisions: chaining and probing.

• Chaining: each entry in the hash table is designed as a structure that can hold more than one element.

• Each element of the table is referred to as a bucket.

• A bucket can be implemented as a linked list, a sorted array, or even a binary search tree.

• Chaining works well on densely populated hash tables.

Hash functions (3)

• Probing: storing the colliding element in a different slot in the same hash table. If the indexed position is already in use, we use a probing function to find a vacant slot.

• The simplest probing function is to increment the index by 1. This is known as linear probing.

• Probing should be used only on relatively sparsely populated tables so that the probing sequences can be kept fairly short.
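Linear probing can be sketched as follows. The class name LinearProbingSet, the choice of String keys, and the rule of always leaving one slot empty (so that a search is guaranteed to stop) are assumptions made for this example; removal is left out because it needs extra bookkeeping that the slides do not cover.

```java
// A fixed-size set of strings that resolves collisions by linear probing.
// Illustrative only: no resizing and no removal.
final class LinearProbingSet {
    private final String[] slots;
    private int count;

    LinearProbingSet(int capacity) {
        if (capacity < 2) throw new IllegalArgumentException("capacity must be at least 2");
        slots = new String[capacity];
    }

    // Starting slot for a key: hash code reduced to 0..slots.length-1.
    private int home(String key) {
        return Math.floorMod(key.hashCode(), slots.length);
    }

    boolean contains(String key) {
        int i = home(key);
        while (slots[i] != null) {          // an empty slot ends the probe sequence
            if (slots[i].equals(key)) return true;
            i = (i + 1) % slots.length;     // linear probing: try the next slot
        }
        return false;
    }

    boolean add(String key) {
        if (contains(key)) return false;
        // Keep at least one slot empty so contains() always terminates.
        if (count == slots.length - 1) throw new IllegalStateException("table is full");
        int i = home(key);
        while (slots[i] != null) {
            i = (i + 1) % slots.length;
        }
        slots[i] = key;
        count++;
        return true;
    }
}
```

With a table of 3 slots, "eat" (Java hash code 100184) and "tea" (114704) both reduce to index 2, so whichever is added second is placed in the next free slot by probing. The fuller the table, the longer these probe sequences get, which is why the slide recommends probing only for sparsely populated tables.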

Hash Tables

• Hashing can be used to find elements in a data structure quickly without making a linear search

• A hash table can be used to implement sets and maps

• A hash function computes an integer value (called the hash code) from an object

Hash Tables

• A hash table can be implemented as an array of buckets

• Buckets are sequences of nodes that hold elements with the same hash code

• If there are few collisions, then adding, locating, and removing hash table elements takes constant time
  – Big-Oh notation: O(1)

Hash Tables

• A good hash function minimizes collisions (identical hash codes for different objects)

• To compute the hash code of object x:
      int h = x.hashCode();

Hash Tables

• For this algorithm to be effective, the bucket sizes must be small

• The table size should be a prime number larger than the expected number of elements
  – An excess capacity of 30% is typically recommended
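One way to turn this rule of thumb into code is sketched below. The class TableSize and its method names are made up for illustration, and "30% excess, then the next prime" is just one reading of the guideline.

```java
// Picks a table size: expected element count plus 30%, rounded up to a prime.
// Intended for modest sizes; very large inputs could overflow int arithmetic.
final class TableSize {
    static boolean isPrime(int n) {
        if (n < 2) return false;
        for (int d = 2; (long) d * d <= n; d++) {
            if (n % d == 0) return false;
        }
        return true;
    }

    static int forExpected(int expected) {
        // expected + ceil(30% of expected), in integer arithmetic.
        int candidate = expected + (expected * 3 + 9) / 10;
        while (!isPrime(candidate)) {
            candidate++;
        }
        return candidate;
    }
}
```

For 100 expected elements the candidate is 130, which is not prime, so the method returns 131. For 10 expected elements it returns 13.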

Computing Hash Codes

• A hash function computes an integer hash code from an object

• Choose a hash function so that different objects are likely to have different hash codes.

Computing Hash Codes

• Bad choice of hash function for a string: adding the Unicode values of the characters in the string
      int h = 0;
      for (int i = 0; i < s.length(); i++)
          h = h + s.charAt(i);
  – Because permutations ("eat" and "tea") would have the same hash code
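The loop above can be wrapped in a method to see the collision directly. The class name SumHash is an assumption made for this sketch.

```java
// The "bad" hash from the slide: the sum of the character codes.
final class SumHash {
    static int hash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = h + s.charAt(i); // order of characters does not matter
        }
        return h;
    }
}
```

Both `SumHash.hash("eat")` and `SumHash.hash("tea")` return 314 (101 + 97 + 116), because addition ignores the position of each character.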

Computing Hash Codes

• For example, the hash code of "eat" is
      -1741487359

• The hash code of "tea" is quite different,
      -626004239
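A position-sensitive hash avoids the permutation problem. As a sketch, the multiply-by-31-and-add scheme below is the one documented for Java's String.hashCode; the class name PolyHash is made up for illustration. The numbers it produces differ from the values quoted above, which presumably come from a different platform's string hash, but the point is the same: the two words no longer collide.

```java
// Polynomial string hash: each character is weighted by its position.
final class PolyHash {
    private static final int HASH_MULTIPLIER = 31;

    static int hash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = HASH_MULTIPLIER * h + s.charAt(i); // int overflow simply wraps around
        }
        return h;
    }
}
```

Here `PolyHash.hash("eat")` is 100184 and `PolyHash.hash("tea")` is 114704, and both agree with what `"eat".hashCode()` and `"tea".hashCode()` return in Java.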

Sample Strings and Their Hash Codes

String         Hash Code
"Adam"         -377491708
"Eve"          1700577644
"Harry"        400611269
"Jim"          1344150687
"Joe"          1699987831
"Juliet"       -1536662019
"Katherine"    -1522213679
"Sue"          1700643198

Simplistic Implementation of a Hash Table

• To implement
  – Generate hash codes for objects
  – Make an array
  – Insert each object at the location of its hash code

• To test if an object is contained in the set
  – Compute its hash code
  – Check if the array position with that hash code is already occupied

[Figure: simplistic implementation of a hash table]

Problems with Simplistic Implementation

• It is not possible to allocate an array that is large enough to hold all possible integer index positions

• It is possible for two different objects to have the same hash code

Solutions

• Pick a reasonable array size and reduce the hash codes to fall inside the array
      int h = x.hashCode();
      // floorMod keeps h in 0..size-1 even for negative hash codes;
      // negating h first would fail for Integer.MIN_VALUE, whose negation is still negative
      h = Math.floorMod(h, size);

• When elements have the same hash code:
  – Use a node sequence to store multiple objects in the same array position
  – These node sequences are called buckets

[Figure: hash table with buckets to store elements with the same hash code]

Algorithm for Finding an Object x in a Hash Table

• Get the index h into the hash table
  – Compute the hash code
  – Reduce it modulo the table size

• Iterate through the elements of the bucket at position h
  – For each element of the bucket, check whether it is equal to x

• If a match is found among the elements of that bucket, then x is in the set
  – Otherwise, x is not in the set
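The three steps above can be sketched as a single method. The class BucketSearch, the helper emptyTable, and the use of java.util lists as buckets are assumptions for this example; the slides only say that a bucket is a node sequence.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

// Searching a table of buckets for an object x.
final class BucketSearch {
    // Builds a table with the given number of empty buckets.
    static List<List<Object>> emptyTable(int size) {
        List<List<Object>> buckets = new ArrayList<>();
        for (int i = 0; i < size; i++) {
            buckets.add(new LinkedList<>());
        }
        return buckets;
    }

    static boolean contains(List<List<Object>> buckets, Object x) {
        // Step 1: hash code, reduced modulo the table size.
        int h = Math.floorMod(x.hashCode(), buckets.size());
        // Step 2: walk the bucket at position h.
        for (Object element : buckets.get(h)) {
            if (element.equals(x)) {
                return true; // Step 3: a match means x is in the set
            }
        }
        return false;
    }
}
```

Only one bucket is ever examined, so the cost of a search depends on the length of that bucket rather than on the total number of elements.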

Hash Tables

• Adding an element: simple extension of the algorithm for finding an object
  – Compute the hash code to locate the bucket in which the element should be inserted
  – Try finding the object in that bucket
  – If it is already present, do nothing; otherwise, insert it

Hash Tables

• Removing an element is equally simple
  – Compute the hash code to locate the bucket in which the element would be stored
  – Try finding the object in that bucket
  – If it is present, remove it; otherwise, do nothing

• If there are few collisions, adding or removing takes O(1) time
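Adding and removing can be sketched together in a small chained hash set. The class name ChainedHashSet, the fixed bucket count, and the use of java.util.LinkedList for buckets are assumptions for illustration; a production table would also resize itself as it fills.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

// A minimal hash set that resolves collisions by chaining.
final class ChainedHashSet<E> {
    private final List<LinkedList<E>> buckets = new ArrayList<>();
    private int size;

    ChainedHashSet(int bucketCount) {
        if (bucketCount < 1) throw new IllegalArgumentException("need at least one bucket");
        for (int i = 0; i < bucketCount; i++) {
            buckets.add(new LinkedList<>());
        }
    }

    // Locates the bucket for x from its hash code.
    private LinkedList<E> bucketFor(Object x) {
        return buckets.get(Math.floorMod(x.hashCode(), buckets.size()));
    }

    boolean contains(Object x) {
        return bucketFor(x).contains(x);
    }

    // Adds x unless an equal element is already present.
    boolean add(E x) {
        LinkedList<E> bucket = bucketFor(x);
        if (bucket.contains(x)) return false;
        bucket.add(x);
        size++;
        return true;
    }

    // Removes x if present; otherwise does nothing.
    boolean remove(Object x) {
        if (bucketFor(x).remove(x)) {
            size--;
            return true;
        }
        return false;
    }

    int size() {
        return size;
    }
}
```

Both operations first locate one bucket and then work only inside it, which is why they stay close to O(1) as long as buckets remain short.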

Creating Hash Codes for your Classes

• Use a prime number as the HASH_MULTIPLIER

• Compute the hash codes of each instance field

• For an integer instance field just use the field value

• Combine the hash codes
      int h = HASH_MULTIPLIER * h1 + h2;
      h = HASH_MULTIPLIER * h + h3;
      h = HASH_MULTIPLIER * h + h4;
      . . .
      return h;

Creating Hash Codes for your Classes

• Your hashCode method must be compatible with the equals method
  – if x.equals(y), then x.hashCode() == y.hashCode()
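As a sketch of a compatible pair, the hypothetical Student class below (its fields and the multiplier 31 are assumptions for this example) builds hashCode from exactly the fields that equals compares, combined with a prime multiplier.

```java
import java.util.Objects;

// A class whose hashCode is compatible with its equals.
final class Student {
    private static final int HASH_MULTIPLIER = 31; // a prime number

    private final String name;
    private final int id;

    Student(String name, int id) {
        this.name = Objects.requireNonNull(name);
        this.id = id;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof Student)) return false;
        Student s = (Student) other;
        return id == s.id && name.equals(s.name);
    }

    @Override
    public int hashCode() {
        int h1 = name.hashCode(); // hash code of the object field
        int h2 = id;              // an integer field is used as is
        return HASH_MULTIPLIER * h1 + h2;
    }
}
```

Two Student objects with the same name and id are equal and therefore produce the same hash code, so a hash table looks for both in the same bucket.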

Creating Hash Codes for your Classes

• You get into trouble if your class defines an equals method but not a hashCode method
  – If we forget to define a hashCode method for a class, it inherits the method from the Object superclass
  – That method typically computes a hash code from the memory location of the object, so two distinct objects that are equal will usually get different hash codes
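The following sketch shows the trouble in a java.util.HashSet. The classes Broken and Fixed are made up for this example and differ only in whether hashCode is overridden.

```java
import java.util.HashSet;
import java.util.Set;

// Contrasts a class that overrides only equals with one that overrides both.
final class EqualsWithoutHashCode {
    static final class Broken {
        final int value;
        Broken(int value) { this.value = value; }

        @Override
        public boolean equals(Object other) {
            return other instanceof Broken && ((Broken) other).value == value;
        }
        // hashCode is inherited from Object: equal objects may hash differently.
    }

    static final class Fixed {
        final int value;
        Fixed(int value) { this.value = value; }

        @Override
        public boolean equals(Object other) {
            return other instanceof Fixed && ((Fixed) other).value == value;
        }

        @Override
        public int hashCode() {
            return value; // compatible with equals
        }
    }

    // Adds one Fixed object, then looks up a different but equal one.
    static boolean fixedLookupSucceeds() {
        Set<Fixed> set = new HashSet<>();
        set.add(new Fixed(5));
        return set.contains(new Fixed(5));
    }

    // Same experiment with Broken; the result is not guaranteed either way.
    static boolean brokenLookupSucceeds() {
        Set<Broken> set = new HashSet<>();
        set.add(new Broken(5));
        return set.contains(new Broken(5));
    }
}
```

`fixedLookupSucceeds()` returns true. `brokenLookupSucceeds()` will almost always return false even though the two Broken objects are equal, because the set usually looks in the wrong bucket.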

Hash Maps

• In a hash map, only the keys are hashed

• The keys need compatible hashCode and equals methods
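A short sketch with java.util.HashMap illustrates this. The class HashMapDemo and the sample names and scores are invented for the example; String keys already have compatible hashCode and equals methods.

```java
import java.util.HashMap;
import java.util.Map;

// Keys are hashed to find a bucket; values are simply stored alongside them.
final class HashMapDemo {
    static Map<String, Integer> sampleScores() {
        Map<String, Integer> scores = new HashMap<>();
        scores.put("Adam", 90);
        scores.put("Eve", 90);  // same value under another key: values are never hashed
        scores.put("Adam", 75); // same key again: the old value is replaced
        return scores;
    }
}
```

The resulting map has two entries: "Adam" maps to 75 and "Eve" maps to 90.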

Priority Queues

• A priority queue collects elements, each of which has a priority

• Example: collection of work requests, some of which may be more urgent than others

• When removing an element, the element with the highest priority is retrieved
  – It is customary to give low values to high priorities, with priority 1 denoting the highest priority
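The work-request example can be sketched with java.util.PriorityQueue, which removes the smallest element first and so matches the convention that priority 1 is the most urgent. The WorkRequest class and its fields are assumptions for illustration.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

// A work request with a numeric priority; lower numbers are more urgent.
final class WorkRequest {
    final int priority;
    final String description;

    WorkRequest(int priority, String description) {
        this.priority = priority;
        this.description = description;
    }

    // Creates a queue that hands out requests in order of increasing priority value.
    static PriorityQueue<WorkRequest> newQueue() {
        return new PriorityQueue<>(Comparator.comparingInt((WorkRequest r) -> r.priority));
    }
}
```

If requests with priorities 3, 1, and 2 are added in that order, successive calls to `remove()` return the priority 1 request first, then priority 2, then priority 3.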