
ADSA : Hashing/10 1

241-423 Advanced Data Structures and Algorithms

Objectives
– introduce hashing, hash functions, hash tables, collisions, linear probing, double hashing, bucket hashing, JDK hash classes

Semester 2, 2013-2014

10. Hashing


Contents

1. Searching an Array

2. Hashing

3. Creating a Hash Function

4. Solution #1: Linear Probing

5. Solution #2: Double Hashing

6. Solution #3: Bucket Hashing

7. Table Resizing

8. Java's hashCode()

9. Hash Tables in Java

10. Efficiency

11. Ordered/Unordered Sets and Maps


1. Searching an Array

• If the array is not sorted, a search requires O(n) time.

• If the array is sorted, a binary search requires O(log n) time.

• If the array is organized using hashing, then a search can take constant time on average: O(1).


2. Hashing

• A hash function takes a search item and returns its array location (its index position).

• The array + hash function is called a hash table.

• Hash tables support efficient insertion and search in O(1) time
– but the hash function must be carefully chosen


A Simple Hash Function

• The simple hash function:

hashCode("apple") = 5
hashCode("watermelon") = 3
hashCode("grapes") = 8
hashCode("cantaloupe") = 7
hashCode("kiwi") = 0
hashCode("strawberry") = 9
hashCode("mango") = 6
hashCode("banana") = 2

• The resulting table:

Index  Item
0      kiwi
1      (empty)
2      banana
3      watermelon
4      (empty)
5      apple
6      mango
7      cantaloupe
8      grapes
9      strawberry


A Table with Key-Value Pairs

• A hash table for a map storing (ID, Name) items
– ID is a nine-digit integer

• The hash table is an array of size N = 10,000

• The hash function is hashCode(ID) = last four digits of the ID key

Index  Item
0      (empty)
1      (025610001, jim)
2      (981100002, ad)
3      (empty)
4      (451220004, tim)
…
9997   (empty)
9998   (200759998, tim)
9999   (empty)
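The "last four digits" hash above can be sketched in a few lines of Java. This is only an illustration: the class name `IdHashDemo` and the use of a plain `String[]` as the table are assumptions, not part of the slides.

```java
class IdHashDemo {
    static final int N = 10_000;   // table size N from the slide

    // hashCode(ID) = last four digits of the nine-digit ID
    static int hash(int id) {
        return id % N;
    }

    public static void main(String[] args) {
        String[] table = new String[N];   // one name per cell, null = empty

        // The (ID, name) pairs from the slide. 025610001 is written without
        // its leading zero, because a leading 0 marks an octal literal in Java.
        int[] ids = {451220004, 981100002, 200759998, 25610001};
        String[] names = {"tim", "ad", "tim", "jim"};

        for (int i = 0; i < ids.length; i++) {
            table[hash(ids[i])] = names[i];
        }
        for (int id : ids) {
            System.out.println(id + " -> cell " + hash(id) + " holds " + table[hash(id)]);
        }
    }
}
```

Two IDs that share their last four digits would overwrite each other here; handling that case is the subject of the collision slides.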


Applications of Hash Tables

• Small databases

• Compilers

• Web Browser caches


3. Creating a Hash Function

• The hash function should return the table index where an item is to be placed
– but it is usually not possible to write a perfect hash function (one that never sends two items to the same index)

– the best we can usually do is to write a hash function that tells us where to start looking for a location to place the item


A Real Hash Function

• A more realistic hash function produces:

hash("apple") = 5
hash("watermelon") = 3
hash("grapes") = 8
hash("cantaloupe") = 7
hash("kiwi") = 0
hash("strawberry") = 9
hash("mango") = 6
hash("banana") = 2
hash("honeydew") = 6

Index  Item
0      kiwi
1      (empty)
2      banana
3      watermelon
4      (empty)
5      apple
6      mango
7      cantaloupe
8      grapes
9      strawberry

• Now what? ("honeydew" hashes to index 6, which already holds "mango")


Collisions

• A collision occurs when two items hash to the same array location
– e.g. "mango" and "honeydew" both hash to 6

• Where should we place the second and other items that hash to this same location?

• Three popular solutions:
– linear probing, double hashing, bucket hashing (chaining)


4. Solution #1: Linear Probing

o Linear probing handles collisions by placing the colliding item in the next empty table cell (perhaps after cycling around the table)

o Each inspected table cell is called a probe
– in the example below, three probes (cells 6, 7 and 8) were needed to insert 32

Key = 32 to be added [hash(32) = 6]

Cell:  0   1   2   3   4   5   6   7   8   9   10  11  12
Key:           41          18  44  59  32  22


Example 1

• A hash table with N = 13 and h(k) = k mod 13

• Insert keys: 18, 41, 22, 44, 59, 32, 31, 73

• Total number of probes: 19

k    h(k)  Probes
18   5     5
41   2     2
22   9     9
44   5     5 6
59   7     7
32   6     6 7 8
31   5     5 6 7 8 9 10
73   8     8 9 10 11

Cell:  0   1   2   3   4   5   6   7   8   9   10  11  12
Key:           41          18  44  59  32  22  31  73
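The probe counts in Example 1 can be reproduced with a short Java sketch. The class `LinearProbingTable` and its method names are hypothetical, and the code assumes non-negative integer keys with h(k) = k mod N as in the example.

```java
class LinearProbingTable {
    private final Integer[] cells;   // null marks an empty cell

    LinearProbingTable(int size) {
        cells = new Integer[size];
    }

    // Inserts key and returns the number of cells probed,
    // or -1 if every cell was probed without finding an empty one.
    int insert(int key) {
        int i = key % cells.length;               // h(k) = k mod N
        for (int probes = 1; probes <= cells.length; probes++) {
            if (cells[i] == null) {
                cells[i] = key;
                return probes;
            }
            i = (i + 1) % cells.length;           // next cell, cycling around
        }
        return -1;
    }

    Integer cellAt(int i) {
        return cells[i];
    }

    public static void main(String[] args) {
        LinearProbingTable table = new LinearProbingTable(13);
        int total = 0;
        for (int k : new int[] {18, 41, 22, 44, 59, 32, 31, 73}) {
            total += table.insert(k);
        }
        System.out.println("Total probes: " + total);   // Total probes: 19
    }
}
```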


Example 2: Insertion

• Suppose we want to add "seagull":
– hash(seagull) = 143

– 3 probes:
  table[143] is not empty;
  table[144] is not empty;
  table[145] is empty

– put seagull at location 145

[Figure: table cells 141 to 148 holding robin, sparrow (cell 143), hawk (cell 144), bluejay (cell 147) and owl (cell 148); seagull is placed in cell 145]


Searching

• Look up "seagull":
– hash(seagull) = 143

– 3 probes:
  table[143] != seagull;
  table[144] != seagull;
  table[145] == seagull

– found seagull at location 145

[Figure: table cells 141 to 148 holding robin, sparrow (cell 143), hawk (cell 144), seagull (cell 145), bluejay (cell 147) and owl (cell 148)]


Searching Again

• Look up "cow":
– hash(cow) = 144

– 3 probes:
  table[144] != cow;
  table[145] != cow;
  table[146] is empty

– "cow" is not in the table since the probes reached an empty cell

[Figure: table cells 141 to 148 holding robin, sparrow (cell 143), hawk (cell 144), seagull (cell 145), bluejay (cell 147) and owl (cell 148); cell 146 is empty]


Insertion Again

• Add "hawk":
– hash(hawk) = 143

– 2 probes:
  table[143] != hawk;
  table[144] == hawk

– hawk is already in the table, so do nothing

[Figure: table cells 141 to 148 holding robin, sparrow (cell 143), hawk (cell 144), seagull (cell 145), bluejay (cell 147) and owl (cell 148)]


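Insertion Again

• Add "cardinal":
– hash(cardinal) = 147

– 3 or more probes:
  147 and 148 are occupied, and 148 is the last cell, so the probing wraps around to the start of the table;
  "cardinal" goes in location 0 (or 1, or 2, or ...)

[Figure: table cells 141 to 148 holding robin, sparrow (cell 143), hawk (cell 144), seagull (cell 145), bluejay (cell 147) and owl (cell 148); 148 is the last cell of the table]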


Search with Linear Probing

o Search hash table A

o get(k)
– start at cell h(k)
– probe consecutive locations until:
  • An item with key k is found, or
  • An empty cell is found, or
  • N cells have been unsuccessfully probed (N is the table size)

Algorithm get(k)
  i ← h(k)
  p ← 0                  // count number of probes
  repeat
    c ← A[i]
    if c = ∅             // empty cell
      return null
    else if c.key() = k
      return c.element()
    else                 // linear probing
      i ← (i + 1) mod N
      p ← p + 1
  until p = N
  return null
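A possible Java rendering of get(k) is sketched below. The class `ProbingMap`, its parallel key/value arrays and the `put()` helper are assumptions added so that the example is self-contained; only `get()` mirrors the pseudocode.

```java
class ProbingMap {
    private final Integer[] keys;    // null marks an empty cell
    private final String[] values;

    ProbingMap(int size) {
        keys = new Integer[size];
        values = new String[size];
    }

    private int h(int k) {
        return Math.floorMod(k, keys.length);
    }

    // Helper (not on the slide): store k -> v using linear probing.
    boolean put(int k, String v) {
        int i = h(k);
        for (int p = 0; p < keys.length; p++) {
            if (keys[i] == null || keys[i] == k) {
                keys[i] = k;
                values[i] = v;
                return true;
            }
            i = (i + 1) % keys.length;
        }
        return false;                // table is full
    }

    // Follows the get(k) pseudocode: stop at a match, an empty cell,
    // or after N unsuccessful probes.
    String get(int k) {
        int i = h(k);
        int p = 0;                   // count number of probes
        do {
            if (keys[i] == null) {   // empty cell: k is not in the table
                return null;
            }
            if (keys[i] == k) {
                return values[i];
            }
            i = (i + 1) % keys.length;   // linear probing
            p++;
        } while (p < keys.length);
        return null;
    }
}
```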


Lazy Deletion

• Deletions are done by marking a table cell as deleted, rather than emptying the cell.

• Deleted locations are treated as empty when inserting and as occupied during a search.


Updates with Lazy Deletion

o delete(k)
– Start at cell h(k)
– Probe consecutive cells until:
  • A cell with key k is found:
    • put DELETED in cell;
    • return true
  • or an empty cell is found
    • return false
  • or N cells have been probed
    • return false

o insert(k, v)
– Start at cell h(k)
– Probe consecutive cells until:
  • A cell i is found that is either empty or contains DELETED
    • put v in cell;
    • return true
  • or N cells have been probed
    • return false
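The delete and insert rules above can be sketched as follows. This is a simplified illustration rather than the course's code: the hypothetical class `LazyDeletionTable` stores bare integer keys instead of (k, v) pairs, and uses a private sentinel object as the DELETED marker.

```java
class LazyDeletionTable {
    private static final Object DELETED = new Object();   // marker for deleted cells
    private final Object[] cells;                         // null marks an empty cell

    LazyDeletionTable(int size) {
        cells = new Object[size];
    }

    private int h(int k) {
        return Math.floorMod(k, cells.length);
    }

    // A DELETED cell is treated as empty when inserting.
    boolean insert(int k) {
        int i = h(k);
        for (int p = 0; p < cells.length; p++) {
            if (cells[i] == null || cells[i] == DELETED) {
                cells[i] = k;
                return true;
            }
            i = (i + 1) % cells.length;
        }
        return false;                       // N cells probed: table is full
    }

    // A DELETED cell is treated as occupied when searching.
    boolean contains(int k) {
        int i = h(k);
        for (int p = 0; p < cells.length; p++) {
            if (cells[i] == null) return false;
            if (cells[i].equals(k)) return true;   // DELETED never equals a key
            i = (i + 1) % cells.length;
        }
        return false;
    }

    boolean delete(int k) {
        int i = h(k);
        for (int p = 0; p < cells.length; p++) {
            if (cells[i] == null) return false;
            if (cells[i].equals(k)) {
                cells[i] = DELETED;         // mark the cell, do not empty it
                return true;
            }
            i = (i + 1) % cells.length;
        }
        return false;
    }
}
```

With N = 13, inserting 18, 44 and 31 fills cells 5, 6 and 7. After delete(44), a search for 31 still succeeds because the marked cell 6 does not stop the probing, which is the reason the cell is marked instead of emptied.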


Clustering

• Linear probing tends to form “clusters”.
– a cluster is a sequence of non-empty array locations with no empty cells among them
  • e.g. cells 5 to 11 in Example 1

• The bigger a cluster gets, the more likely it is that new items will hash into that cluster, and make it ever bigger.

• Clusters reduce hash table efficiency
– searching becomes sequential (O(n))


• If the size of the table is large relative to the number of items, linear probing is fast (O(1))
– a good hash function generates indices that are evenly distributed over the table range, and collisions will be minimal

• As the ratio of the number of items (n) to the table size (N) approaches 1, hashing slows down to the speed of a sequential search (O(n)).


5. Solution #2: Double Hashing

• In the event of a collision, apply a second 'offset' hash function to the key
– the result is used as an offset (step size) from the collision location

• Linear probing always uses an offset of 1, which contributes to clustering. Double hashing makes the offset depend on the key, so keys that collide usually follow different probe sequences.


Example of Double Hashing

• A hash table with N = 13, h(k) = k mod 13, and d(k) = 7 - (k mod 7); the j-th probe inspects cell (h(k) + j*d(k)) mod 13, for j = 0, 1, 2, ...

• Insert keys 18, 41, 22, 44, 59, 32, 31, 73

• Total number of probes: 11

k    h(k)  d(k)  Probes
18   5     3     5
41   2     1     2
22   9     6     9
44   5     5     5 10
59   7     4     7
32   6     3     6
31   5     4     5 9 0
73   8     4     8

Cell:  0   1   2   3   4   5   6   7   8   9   10  11  12
Key:   31      41          18  32  59  73  22  44
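The double hashing example can be checked with a sketch along the same lines as linear probing. The class `DoubleHashingTable` is hypothetical; it hard-codes d(k) = 7 - (k mod 7) from the example and assumes non-negative keys.

```java
class DoubleHashingTable {
    private final Integer[] cells;   // null marks an empty cell

    DoubleHashingTable(int size) {
        cells = new Integer[size];
    }

    private int h(int k) {
        return k % cells.length;     // primary hash: k mod N
    }

    private static int d(int k) {
        return 7 - (k % 7);          // offset hash, always in the range 1..7
    }

    // Inserts key and returns the number of cells probed, or -1 on failure.
    int insert(int key) {
        int i = h(key);
        int step = d(key);
        for (int probes = 1; probes <= cells.length; probes++) {
            if (cells[i] == null) {
                cells[i] = key;
                return probes;
            }
            i = (i + step) % cells.length;   // jump by the key's own offset
        }
        return -1;
    }

    Integer cellAt(int i) {
        return cells[i];
    }

    public static void main(String[] args) {
        DoubleHashingTable table = new DoubleHashingTable(13);
        int total = 0;
        for (int k : new int[] {18, 41, 22, 44, 59, 32, 31, 73}) {
            total += table.insert(k);
        }
        System.out.println("Total probes: " + total);   // Total probes: 11
    }
}
```

Because 13 is prime and every offset lies between 1 and 7, each probe sequence can reach all 13 cells before repeating.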


6. Solution #3: Bucket Hashing

• The previous solutions use open addressing:
– all items go into the array

• In bucket hashing, an array cell points to a linked list of items that all hash to that cell.
– also called chaining

[Figure: table cells 141 to 148, each pointing to a linked list of the items that hash to it (robin, sparrow, hawk, bluejay, owl); seagull, which hashes to 143, is stored in the list for cell 143]


• Chaining is generally faster than linear probing:
– searching only examines items that hash to the same table location

• With linear probing and double hashing, the number of table items (n) is limited to the table size (N), whereas the linked lists in chaining can keep growing.

• To delete an element, just erase it from its list.
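A minimal chained hash set might look like the sketch below. The class `ChainedHashSet` and its methods are illustrative assumptions; the index calculation masks off the sign bit of hashCode() so that the index is never negative.

```java
import java.util.LinkedList;

class ChainedHashSet {
    private final LinkedList<String>[] buckets;   // one linked list per cell
    private int size = 0;

    @SuppressWarnings("unchecked")
    ChainedHashSet(int tableSize) {
        buckets = new LinkedList[tableSize];
        for (int i = 0; i < tableSize; i++) {
            buckets[i] = new LinkedList<>();
        }
    }

    private int index(String item) {
        return (item.hashCode() & Integer.MAX_VALUE) % buckets.length;
    }

    boolean add(String item) {
        LinkedList<String> bucket = buckets[index(item)];
        if (bucket.contains(item)) {
            return false;             // already present, do nothing
        }
        bucket.addFirst(item);
        size++;
        return true;
    }

    // Only the items that hash to the same cell are examined.
    boolean contains(String item) {
        return buckets[index(item)].contains(item);
    }

    // Deletion just erases the item from its list.
    boolean remove(String item) {
        boolean removed = buckets[index(item)].remove(item);
        if (removed) {
            size--;
        }
        return removed;
    }

    int size() {
        return size;
    }
}
```

A table built with only 4 cells can still hold 6 or more items, which illustrates the point that the number of items is not limited by the table size.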


7. Table Resizing

• As the number of items in the hash table increases, search speed goes down.

• Increase the hash table size when the number of items in the table is a specified percentage of its size.

Works with open chaining also.


• Create a new table with the specified size and cycle through the items in the original table.

• For each item, use the hash() value modulo the new table size to hash to a new index.

• Insert the item at the front of the linked list.
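The three resizing steps can be sketched for a chained table as follows. The class `ResizeDemo` and its helper methods are hypothetical, and the table holds strings for simplicity.

```java
import java.util.LinkedList;

class ResizeDemo {
    // hash() value modulo the table size, kept non-negative
    static int index(String item, int tableSize) {
        return (item.hashCode() & Integer.MAX_VALUE) % tableSize;
    }

    @SuppressWarnings("unchecked")
    static LinkedList<String>[] newTable(int size) {
        LinkedList<String>[] table = new LinkedList[size];
        for (int i = 0; i < size; i++) {
            table[i] = new LinkedList<>();
        }
        return table;
    }

    // Create a new table and cycle through the items of the old one,
    // rehashing each item with the new table size.
    static LinkedList<String>[] resize(LinkedList<String>[] old, int newSize) {
        LinkedList<String>[] fresh = newTable(newSize);
        for (LinkedList<String> bucket : old) {
            for (String item : bucket) {
                fresh[index(item, newSize)].addFirst(item);   // front of the list
            }
        }
        return fresh;
    }
}
```

Items must be rehashed instead of copied cell by cell, because an item's index depends on the table size.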


8. Java's hashCode()

• public int hashCode() is defined in Object
– by default it returns an integer derived from the object's identity (often described as its memory address, though this is implementation-dependent)

• hashCode() does not know the size of the hash table
– the returned value must be adjusted
  • e.g. hashCode() % N

• hashCode() can be overridden in your classes


Coding your own hashCode()

• Your hashCode() must:
– always return the same value for the same item
  • it can't use random numbers, or the time of day

– always return the same value for equal items
  • if o1.equals(o2) is true then hashCode() for o1 and o2 must be the same number


• A good hashCode() should:
– be fast to evaluate

– produce uniformly distributed hash values
  • this spreads the hash table indices around the table, which helps minimize collisions

– not assign similar hash values to similar items


String Hash Function

• In the majority of hash table applications, the key is a string.
– combine the string's characters to form an integer

public int hashCode()
{ // s is assumed to be the char[] holding the string's characters
  int hash = 0;
  for (int i = 0; i < s.length; i++)
    hash = 31*hash + s[i];
  return hash;
}


String strA = "and";
String strB = "uncharacteristically";
String strC = "algorithm";
hashVal = strA.hashCode();   // hashVal = 96727
hashVal = strB.hashCode();   // hashVal = -2112884372
hashVal = strC.hashCode();   // hashVal = 225490031

A hash function might overflow and return a negative number. The following code ensures that the table index is nonnegative.

tableIndex = (hashVal & Integer.MAX_VALUE) % tableSize
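The sketch below puts the two pieces together: the 31-based polynomial hash and the masking step. The class `TableIndexDemo` and its method names are assumptions; `String.hashCode()` is documented to use the same polynomial, so the two values agree.

```java
class TableIndexDemo {
    // The polynomial string hash from the slide; int overflow simply wraps.
    static int polyHash(String s) {
        int hash = 0;
        for (int i = 0; i < s.length(); i++) {
            hash = 31 * hash + s.charAt(i);
        }
        return hash;
    }

    // Clearing the sign bit keeps the index in the range 0..tableSize-1.
    static int tableIndex(int hashVal, int tableSize) {
        return (hashVal & Integer.MAX_VALUE) % tableSize;
    }

    public static void main(String[] args) {
        System.out.println(polyHash("and"));                 // 96727
        System.out.println(polyHash("and") == "and".hashCode());   // true

        int negative = "uncharacteristically".hashCode();    // overflowed, negative
        System.out.println(negative % 13 < 0);               // true: plain % keeps the sign
        System.out.println(tableIndex(negative, 13) >= 0);   // true
    }
}
```

`Math.floorMod(hashVal, tableSize)` is another way to obtain a non-negative index, although it generally maps negative hash values to different cells than the masking approach does.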


Time24 Hash Function

• For the Time24 class, the hash value for an object is its time converted to minutes.

• Since each hour is 60 mins more than the last, and a minute is between 0 and 59, each time has a unique hash.

public int hashCode()
{ return hour*60 + minute; }
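To show how such a hashCode() fits with equals(), here is a cut-down stand-in for the Time24 class. `Time24Sketch` is a hypothetical class written for this example, not the textbook's Time24; only the hashCode() body comes from the slide.

```java
import java.util.HashSet;
import java.util.Set;

class Time24Sketch {
    private final int hour;     // 0..23
    private final int minute;   // 0..59

    Time24Sketch(int hour, int minute) {
        this.hour = hour;
        this.minute = minute;
    }

    // Equal times must produce equal hash codes, so equals() and
    // hashCode() are both based on the same two fields.
    @Override
    public boolean equals(Object other) {
        if (!(other instanceof Time24Sketch)) {
            return false;
        }
        Time24Sketch t = (Time24Sketch) other;
        return hour == t.hour && minute == t.minute;
    }

    @Override
    public int hashCode() {
        return hour * 60 + minute;   // the time converted to minutes
    }

    public static void main(String[] args) {
        Set<Time24Sketch> times = new HashSet<>();
        times.add(new Time24Sketch(9, 30));
        System.out.println(new Time24Sketch(9, 30).hashCode());          // 570
        System.out.println(times.contains(new Time24Sketch(9, 30)));     // true
    }
}
```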


9. Hash Tables in Java

• Java provides HashSet, Hashtable and HashMap in java.util
– HashSet is a set
– Hashtable and HashMap are maps


• Hashtable is synchronized; it can be accessed safely from multiple threads
– Hashtable is an "open" hash table in the JDK documentation's sense (colliding entries share a bucket that is searched sequentially), and has rehash() for resizing the table

• HashMap is newer, faster, and usually better, but it is not synchronized
– HashMap also stores colliding entries in buckets, and has a remove() method


Hash Table Operations

• HashSet, Hashtable and HashMap have no-argument constructors, and constructors that take an integer table size.

• HashSet has add(), contains(), remove(), iterator(), etc.


• Hashtable and HashMap include:
– public Object put(Object key, Object value)
  • returns the previous value for this key, or null

– public Object get(Object key)

– public void clear()

– public Set keySet()
  • dynamically reflects changes in the hash table

– many others


Using HashMap

• A HashMap with Strings as keys and values

A telephone book:

Key                   Value
"Charles Nguyen"      "(531) 9392 4587"
"Lisa Jones"          "(402) 4536 4674"
"William H. Smith"    "(998) 5488 0123"


Coding a Map

HashMap<String, String> phoneBook = new HashMap<String, String>();

phoneBook.put("Charles Nguyen", "(531) 9392 4587");
phoneBook.put("Lisa Jones", "(402) 4536 4674");
phoneBook.put("William H. Smith", "(998) 5488 0123");

String phoneNumber = phoneBook.get("Lisa Jones");

System.out.println( phoneNumber );

prints: (402) 4536 4674


HashMap<String, String> h =
    new HashMap<String, String>(100,     /* capacity */
                                0.75f);  /* load factor */

h.put( "WA", "Washington" );

h.put( "NY", "New York" );

h.put( "RI", "Rhode Island" );

h.put( "BC", "British Columbia" );


Capacities and Load Factors

• HashMaps round capacities up to powers of two
– e.g. 100 --> 128

– default capacity is 16; load factor is 0.75

• The load factor is used to decide when it is time to double the size of the table
– just after you have added 96 elements in this example

– 128 * 0.75 == 96

• Hashtables work best with capacities that are prime numbers.


10. Efficiency

• Hash tables are efficient
– until the table is about 70% full, the number of probes (places looked at in the table) is typically only 2 or 3

• Cost of insertion / accessing is O(1)


• Even if the table is nearly full (leading to long searches), efficiency remains quite high.

• Hash tables work best when the table size (N) is a prime number. HashMaps use powers of 2 for N.


o In the worst case, searches, insertions and removals on a hash table take O(n) time (n = no. of items)
– the worst case occurs when all the keys inserted into the map collide

o The load factor λ = n/N affects the performance of a hash table (N = table size).
– for linear probing, 0 ≤ λ ≤ 1

– for chaining with lists, it is possible that λ > 1


• Assume that the hash function uniformly distributes indices around the hash table.
– we can expect λ = n/N elements in each cell.
  • on average, an unsuccessful search makes λ comparisons before arriving at the end of a list and returning failure
  • mathematical analysis shows that the average number of probes for a successful search is approximately 1 + λ/2

– so keep λ small!
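As a quick numerical illustration of these estimates, the sketch below evaluates them for a chained table with n = 96 items and N = 128 cells (the figures from the HashMap capacity example). The class and method names are made up for this example, and the load factor is written as `lambda`.

```java
class LoadFactorDemo {
    // lambda = n / N
    static double loadFactor(int n, int tableSize) {
        return (double) n / tableSize;
    }

    // Approximate average probes for a successful search: 1 + lambda / 2
    static double successfulSearchProbes(double lambda) {
        return 1 + lambda / 2;
    }

    public static void main(String[] args) {
        double lambda = loadFactor(96, 128);
        System.out.println(lambda);                           // 0.75
        System.out.println(successfulSearchProbes(lambda));   // 1.375
    }
}
```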


11. Ordered/Unordered Sets and Maps

• Use an ordered set or map if an iteration should return items in order
– average search time: O(log₂ n)

• Use an unordered set or map with hashing when fast access and updates are needed without caring about the ordering of items
– average search time: O(1)


Timing Tests

• SearchComp.java:

– read a file of 25025 randomly ordered words and insert each word into a TreeSet and a HashSet.

– report the amount of time required to build both data structures

– shuffle the file input and time a search of the TreeSet and HashSet for each shuffled word

– report the total time required for both search techniques
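A rough outline of such a test, using the JDK's TreeSet and HashSet, is sketched below. This is not the SearchComp.java program itself: the class `SearchCompSketch` is hypothetical, it generates random lowercase words instead of reading a file, and the timings it prints will vary from run to run and machine to machine.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;
import java.util.TreeSet;

class SearchCompSketch {
    // Stand-in for the word file: count random lowercase words.
    static List<String> randomWords(int count, long seed) {
        Random rnd = new Random(seed);
        List<String> words = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            StringBuilder sb = new StringBuilder();
            int len = 3 + rnd.nextInt(8);
            for (int j = 0; j < len; j++) {
                sb.append((char) ('a' + rnd.nextInt(26)));
            }
            words.add(sb.toString());
        }
        return words;
    }

    // Inserts every word and returns the elapsed time in nanoseconds.
    static long timeBuild(Set<String> set, List<String> words) {
        long start = System.nanoTime();
        for (String w : words) {
            set.add(w);
        }
        return System.nanoTime() - start;
    }

    // Searches for every word and returns how many were found.
    static int countFound(Set<String> set, List<String> words) {
        int found = 0;
        for (String w : words) {
            if (set.contains(w)) {
                found++;
            }
        }
        return found;
    }

    public static void main(String[] args) {
        List<String> words = randomWords(25025, 42L);
        Set<String> tree = new TreeSet<>();
        Set<String> hash = new HashSet<>();
        System.out.println("TreeSet build (ns): " + timeBuild(tree, words));
        System.out.println("HashSet build (ns): " + timeBuild(hash, words));

        Collections.shuffle(words, new Random(7L));   // search in a different order
        long start = System.nanoTime();
        countFound(tree, words);
        System.out.println("TreeSet search (ns): " + (System.nanoTime() - start));
        start = System.nanoTime();
        countFound(hash, words);
        System.out.println("HashSet search (ns): " + (System.nanoTime() - start));
    }
}
```

A single pass like this is sensitive to JIT warm-up, so the numbers are only indicative.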


Ford & Topp's HashSet build and search times are much better than TreeSet.


• SearchJComp.java
– replace Ford & Topp's TreeSet and HashSet by the ones in the JDK.

JDK HashSet and TreeSet are much the same speed for searching.