COMP2402 Hash Tables Pat Morin

COMP2402 Hash Tables - cglab.cacglab.ca/~morin/teaching/2402/notes/hashtables.pdf · • Hash tables are really designed to store distinct integers • Java has lots of types that



COMP2402 Hash Tables

Pat Morin


Outline

• Hashing with chaining
• Multiplicative hashing
• Hash table implementations of
  – Set
  – Map
• Designing a good hashCode() method


Hash: (noun) Food, especially meat and potatoes, chopped and mixed together; A confused mess;



Hash tables

• Hashing is one of the most widely used techniques in computer science
  – Data structures
  – Error detection
  – Security
• Hash tables are one of the most useful data structures
  – store integers
    • or data that can be converted to integers (via hashCode())
  – allow for exact search only
    • data is either there or it's not (e.g., Set, Map)


Hashing with chaining

• Data is stored in an array of lists (table)
• Data value x is stored in the list
  – table[hash(x)]

    public class HashTable<T> extends AbstractCollection<T> {
        List<T>[] table;  // data goes in these
        int n;            // total number of elements
        ...
    }

[Figure: the array table, with indices 0 to 7, each entry holding a list of elements; an element x with hash(x) = 1 is stored in the list table[1].]


Hash tables and hashCode()

• Hash tables are really designed to store distinct integers
• Java has lots of types that are not integers
• Every class has a method
  – public int hashCode()
  that converts an object into an integer
• Every class must ensure that equals() and hashCode() satisfy:
  – If x.equals(y) then x.hashCode() = y.hashCode()
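To make the contract concrete, here is a minimal sketch of a class that overrides both methods using the same fields. The Point class is a hypothetical example, not something from the course code, and Objects.hash is just one convenient way to combine the fields.

```java
import java.util.Objects;

// Hypothetical example class (not from the slides): a 2D point whose
// equals() and hashCode() are computed from the same two fields.
final class Point {
    private final int x;
    private final int y;

    Point(int x, int y) {
        this.x = x;
        this.y = y;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Point)) return false;
        Point p = (Point) o;
        return x == p.x && y == p.y;
    }

    @Override
    public int hashCode() {
        // Uses exactly the fields that equals() compares, so equal
        // points are guaranteed to produce equal hash codes.
        return Objects.hash(x, y);
    }
}
```

Two Point objects built from the same coordinates are equal and therefore report the same hashCode(), which is what a hash table relies on when it searches only table[hash(x)].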


The hashing process

Java object
    ↓ hashCode()
{-2^31,...,2^31-1} (32 bits)
    ↓ hash()
{0,...,table.length-1}


List size distribution

• For good performance we need a good hash function
• Universal hashing assumption:
  – if x.hashCode() ≠ y.hashCode() then Pr{hash(x) = hash(y)} < c/table.length
• The expected length of table[hash(x)] is
  – ≤ k + c(n-k)/table.length
  where k is the number of elements y such that x.hashCode() = y.hashCode()

Pr{•} means "the probability that •"
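The bound on the expected list length follows from linearity of expectation. A short derivation, assuming the table holds n elements, k of which share x's hashCode() (each of these contributes at most 1), while each of the other n-k elements lands in the same list with probability less than c/table.length:

```latex
\mathbb{E}\bigl[\text{length of table[hash}(x)\text{]}\bigr]
  = \sum_{y \in \text{table}} \Pr\{\mathrm{hash}(y) = \mathrm{hash}(x)\}
  \le k \cdot 1 + (n-k) \cdot \frac{c}{\text{table.length}}
```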


List size distribution (cont'd)

• If all hashCode()s are unique then k = 1
  – The expected length of the list table[hash(x)] is at most 1 + c(n-1)/table.length
• If we keep table.length > n, then
  – The expected length of table[hash(x)] is at most 1 + c

n/table.length is called the occupancy


Hash table insertion

• To add x to a hash table we store x in table[hash(x)]
  – Takes constant time (amortized, because of the occasional call to grow())

    public boolean add(T x) {
        if (n+1 > table.length) grow();
        table[hash(x)].add(x);
        n++;
        return true;
    }


Hash table search

• To find elements equal to x, we search the list table[hash(x)]
  – Time is O(1 + occupancy)

Remember: all elements equal to x have the same hashCode()

    public T find(Object x) {
        for (T y : table[hash(x)])
            if (y.equals(x)) return y;
        return null;
    }


Hash table search (cont'd)

• We can also find all items equal to x

    public List<T> findAll(Object x) {
        List<T> l = new LinkedList<T>();
        for (T y : table[hash(x)])
            if (y.equals(x)) l.add(y);
        return l;
    }


Hash table removal

• Remove all elements equal to x
  – takes O(1 + k) time, where k = # elements equal to x

    public int removeAll(Object x) {
        int r = 0;
        Iterator<T> it = table[hash(x)].iterator();
        while (it.hasNext()) {
            T y = it.next();
            if (y.equals(x)) {
                it.remove();
                n--;
                r++;
            }
        }
        return r;
    }


Hash table removal (cont'd)

• Or remove just one element

    public T removeOne(Object x) {
        Iterator<T> it = table[hash(x)].iterator();
        while (it.hasNext()) {
            T y = it.next();
            if (y.equals(x)) {
                it.remove();
                n--;
                return y;
            }
        }
        return null;
    }


Growing and shrinking

• Hash tables grow and shrink in the same manner as ArrayStacks and ArrayDeques
  – allocate new table
  – add elements into new table
• Cost is amortized over add/remove operations
  – constant amortized cost per operation
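The pieces above (add, find, removeOne, and growing) can be put together in one small class. This is only a sketch under stated assumptions: the class name ChainedHashTable, the doubling policy in grow(), and the placeholder hash() based on Math.floorMod are all illustrative choices rather than the course implementation, and shrinking is left out for brevity.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of hashing with chaining. The class name, the doubling
// policy, and the placeholder hash() are assumptions for illustration.
class ChainedHashTable<T> {
    private List<List<T>> table = newTable(2);
    private int n = 0; // total number of elements

    private static <E> List<List<E>> newTable(int length) {
        List<List<E>> t = new ArrayList<List<E>>(length);
        for (int i = 0; i < length; i++) t.add(new ArrayList<E>());
        return t;
    }

    // Placeholder: maps a hashCode() into {0,...,table.size()-1}.
    private int hash(Object x) {
        return Math.floorMod(x.hashCode(), table.size());
    }

    public boolean add(T x) {
        if (n + 1 > table.size()) grow();
        table.get(hash(x)).add(x);
        n++;
        return true;
    }

    public T find(Object x) {
        for (T y : table.get(hash(x)))
            if (y.equals(x)) return y;
        return null;
    }

    public T removeOne(Object x) {
        List<T> list = table.get(hash(x));
        for (int i = 0; i < list.size(); i++) {
            if (list.get(i).equals(x)) {
                n--;
                return list.remove(i); // remove by index, returns the element
            }
        }
        return null;
    }

    // Allocate a table twice as long and re-add every element.
    private void grow() {
        List<List<T>> old = table;
        table = newTable(2 * old.size());
        for (List<T> list : old)
            for (T y : list)
                table.get(hash(y)).add(y);
    }

    public int size() { return n; }
    public int tableLength() { return table.size(); }
}
```

Because the table doubles, each grow() that copies n elements is paid for by the roughly n/2 insertions since the previous grow(), which is where the constant amortized cost comes from.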


The hash(x) function

• There are many, many possible hash functions
• In multiplicative hashing we use
  – table.length = 2^d, a power of 2
  – hash(x) = ((x.hashCode() * z) mod 2^w) div 2^(w-d)
    where
    • w is the number of bits in an integer and
    • z is a randomly chosen odd integer in {0,...,2^w-1}
  – Equivalently (in Java, w = 32):
    • hash(x) = (x.hashCode() * z) >>> (w-d)

    protected final int hash(Object x) {
        return (x.hashCode() * z) >>> (w-d);
    }
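A self-contained sketch of this hash function for w = 32 follows. The class name MultiplicativeHash, the use of java.util.Random to pick z, and the restriction 1 ≤ d ≤ 30 are assumptions made here for illustration. The lower bound matters because Java takes int shift distances modulo 32, so d = 0 would shift by 0 rather than by 32.

```java
import java.util.Random;

// Sketch of multiplicative hashing for w = 32. The class name and the use
// of java.util.Random to pick z are assumptions, not part of the slides.
class MultiplicativeHash {
    static final int W = 32;  // bits in a Java int
    final int d;              // table length is 2^d
    final int z;              // random odd multiplier

    MultiplicativeHash(int d, Random rng) {
        // d >= 1 because ">>> 32" is a shift by 0 in Java;
        // d <= 30 so that 2^d is a valid array length.
        if (d < 1 || d > 30) throw new IllegalArgumentException("need 1 <= d <= 30");
        this.d = d;
        this.z = rng.nextInt() | 1;  // force the low bit so z is odd
    }

    int hash(Object x) {
        // int multiplication wraps around, which is the "mod 2^w" step;
        // the unsigned shift keeps the top d bits, which is "div 2^(w-d)".
        return (x.hashCode() * z) >>> (W - d);
    }
}
```

For example, new MultiplicativeHash(10, new Random()).hash(key) always returns an index in {0,...,1023}, even when key.hashCode() is negative, because >>> is an unsigned shift.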


Example

• w = 16, d = 10

    x                       =                 0000000000010110
    z                       =                 1010110001101011
    x*z                     = 00000000000011101101000100110010
    x*z mod 2^16            =                 1101000100110010
    (x*z mod 2^16) div 2^6  =                 1101000100

• With 16-bit integers, (x*z) >>> 6 gives the same result: 1101000100
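The worked example can be checked mechanically. The sketch below simulates 16-bit arithmetic with Java ints by masking with 0xFFFF; the class and method names are hypothetical.

```java
// Reproduces the w = 16, d = 10 example using Java ints; masking with
// 0xFFFF stands in for "mod 2^16". The class name is hypothetical.
class MultiplicativeExample {
    static int hash16(int x, int z, int d) {
        int product = (x * z) & 0xFFFF;  // (x*z) mod 2^16
        return product >>> (16 - d);     // div 2^(16-d)
    }

    public static void main(String[] args) {
        int x = 0b0000000000010110;      // 22
        int z = 0b1010110001101011;      // 44139, odd
        int h = hash16(x, z, 10);
        System.out.println(Integer.toBinaryString(h)); // prints 1101000100
    }
}
```

The result, 1101000100 in binary (836 in decimal), is the top 10 bits of the 16-bit product and matches the last line of the example.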


Multiplicative Hashing Theorem

• Theorem 1: With the multiplicative hash function
  – if x.hashCode() ≠ y.hashCode() then Pr{hash(x) = hash(y)} ≤ 2/table.length
• Proof sketch:
  – If x.hashCode() ≠ y.hashCode(), then there are at most 2^(w-d) odd values of z ∈ {1,...,2^w-1} such that hash(x) = hash(y)
  – We have 2^w/2 choices for z
  – Pr{hash(x) = hash(y)} ≤ 2^(w-d) / (2^w/2) = 2/2^d = 2/table.length


Multiplicative hash table summary

• Theorem 2: With a multiplicative hash table
  – find(x) takes O(1) expected time
  – add(x) takes O(1) expected amortized time
  – remove(x) takes O(1) expected amortized time
  provided that the table stores elements with distinct hashCode()s


Hash tables and the Set interface

• The Set interface is easily implemented as a hash table

    public class MultiplicativeHashSet<T> extends AbstractSet<T> {
        MultiplicativeHashTable<T> tab;
        ...
    }

    public boolean add(T x) {
        if (tab.contains(x)) {
            return false;
        } else {
            return tab.add(x);
        }
    }

    public boolean remove(Object x) {
        return tab.remove(x);
    }

    public boolean contains(Object x) {
        return tab.find(x) != null;
    }


Hash tables and the Map interface

• The Map methods are easily implemented using a hash table
• Use a hash table that stores key/value Pairs
  – two Pairs are equal if their keys are equal
  – the hashCode() of a Pair is the hashCode() of its key


Map Pairs

    class Pair<V> {
        public Object key;
        public V value;
        ...
        public boolean equals(Object o) {
            return ((o instanceof Pair) && key.equals(((Pair)o).key));
        }
        public int hashCode() {
            return key.hashCode();
        }
    }


Hash maps

    public V put(K key, V value) {
        Pair<V> p = new Pair<V>(key, value);
        Pair<V> r = tab.removeOne(p);
        tab.add(p);
        return (r == null) ? null : r.value;
    }

    public V get(Object key) {
        Pair<V> p = new Pair<V>(key, null);
        Pair<V> r = tab.find(p);
        return (r == null) ? null : r.value;
    }

    public V remove(Object key) {
        Pair<V> p = new Pair<V>(key, null);
        Pair<V> r = tab.removeOne(p);
        return (r == null) ? null : r.value;
    }
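For readers who want something they can compile, here is a compact sketch of the same idea: a map whose buckets hold key/value pairs and whose operations mirror put, get, and remove above. It is a simplification under stated assumptions: the names are hypothetical, the number of buckets is fixed, the bucket index uses Math.floorMod rather than multiplicative hashing, and keys are compared directly instead of through a Pair.equals() override.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Sketch of a map built from key/value pairs stored in chained buckets.
// The fixed bucket count and all names here are illustrative assumptions;
// a real table would grow and use a better hash() than floorMod.
class PairMapSketch<K, V> {
    private static final class Pair<K, V> {
        final K key;
        final V value;
        Pair(K key, V value) { this.key = key; this.value = value; }
    }

    private final List<List<Pair<K, V>>> table = new ArrayList<List<Pair<K, V>>>();

    PairMapSketch(int buckets) {
        for (int i = 0; i < buckets; i++) table.add(new ArrayList<Pair<K, V>>());
    }

    private List<Pair<K, V>> bucket(Object key) {
        return table.get(Math.floorMod(key.hashCode(), table.size()));
    }

    // Remove the pair with this key, if any, and return it.
    private Pair<K, V> removeOne(Object key) {
        Iterator<Pair<K, V>> it = bucket(key).iterator();
        while (it.hasNext()) {
            Pair<K, V> p = it.next();
            if (p.key.equals(key)) { it.remove(); return p; }
        }
        return null;
    }

    public V put(K key, V value) {
        Pair<K, V> old = removeOne(key);  // drop any previous mapping
        bucket(key).add(new Pair<K, V>(key, value));
        return (old == null) ? null : old.value;
    }

    public V get(Object key) {
        for (Pair<K, V> p : bucket(key))
            if (p.key.equals(key)) return p.value;
        return null;
    }

    public V remove(Object key) {
        Pair<K, V> old = removeOne(key);
        return (old == null) ? null : old.value;
    }
}
```

As in the slide code, put() removes any existing pair with the same key before adding the new one, so each key appears at most once.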


Hash Maps and Sets

• Using multiplicative hashing:
  – Theorem: A MultiplicativeHashSet supports
    • contains(x) in O(1) expected time
    • add(x) and remove(x) in O(1) expected amortized time
  – Theorem: A MultiplicativeHashMap supports
    • get(k) in O(1) expected time
    • put(k,v) and remove(k) in O(1) expected amortized time
• Both theorems hold under the assumption that all stored objects have distinct hashCode()s


The hashCode() method

• Default hashCode() and equals() are based on object identity (conceptually, memory locations)
  – a.equals(b) if and only if a and b refer to the same object
  – a.hashCode() is an integer derived from the identity of a (conceptually, its memory location)
• Therefore distinct objects get distinct hashCode()s, as far as is reasonably practical
• We run into problems when we override the equals() method


Designing a good hashCode()

• Recall:
  – x.equals(y) → x.hashCode() = y.hashCode()
• We would like:
  – x.hashCode() = y.hashCode() → x.equals(y)
• But we can't always have this
  – hashCode() returns a 32 (or 64) bit integer
    • only 2^32 (or 2^64) possible return values
  – Many objects can take on more values than this
    • e.g. there are 2^80 > 2^32 ASCII strings of length 10


Example of a bad hashCode()

• This code will be very slow to execute
  – The last for loop takes a loooong time. Why?

    int n = 100000;
    Map<Integer,Integer> m = new HashMap<Integer,Integer>();
    for (int i = 1; i <= n; i++) {
        m.put(i, i);
    }
    Set<Map.Entry<Integer,Integer>> s = new HashSet<Map.Entry<Integer,Integer>>();
    for (Map.Entry<Integer,Integer> e : m.entrySet()) {
        s.add(e);
    }


Answer

• From the Map.Entry documentation:
  – e.hashCode() = e.getKey().hashCode() ^ e.getValue().hashCode()

    public int hashCode()
    Returns the hash code value for this map entry. The hash code of a map entry e is defined to be:
        (e.getKey()==null ? 0 : e.getKey().hashCode()) ^
        (e.getValue()==null ? 0 : e.getValue().hashCode())
    ...

^ is the bitwise exclusive-or (XOR) operation


Answer (cont'd)

• If the key and value are the same, then
  – e.getKey() = e.getValue() so
  – e.getKey().hashCode() = e.getValue().hashCode()
• The XOR of two equal values is always 0
  – a XOR a = 0
• So all 100,000 elements have the same hashCode()
  – The hash table degenerates into 1 linked list
  – contains(x) takes O(n) time!
  – Creating the Set takes O(n^2) time!

[Figure: the table with a single non-empty entry, whose list holds all n pairs (n,n), ..., (4,4), (3,3), (2,2), (1,1).]
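The collision is easy to observe directly. The sketch below counts how many distinct hashCode() values occur among entries whose key equals their value, using the JDK's AbstractMap.SimpleEntry as the Map.Entry implementation; the class and method names are hypothetical.

```java
import java.util.AbstractMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Checks the collision described above using a JDK Map.Entry
// implementation; the class and method names are hypothetical.
class XorCollisionDemo {
    // Number of distinct hashCode() values among the entries (i, i), 1 <= i <= n.
    static int distinctEntryHashCodes(int n) {
        Set<Integer> codes = new HashSet<Integer>();
        for (int i = 1; i <= n; i++) {
            Map.Entry<Integer, Integer> e =
                new AbstractMap.SimpleEntry<Integer, Integer>(i, i);
            codes.add(e.hashCode());  // key.hashCode() ^ value.hashCode() = i ^ i
        }
        return codes.size();
    }
}
```

Calling XorCollisionDemo.distinctEntryHashCodes(100000) returns 1: every entry hashes to 0, so a hash table storing these entries cannot spread them across its lists.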


Some bad ideas

• There are lots of bad ways to combine hashCode()s
  – XOR: x.hashCode() ^ y.hashCode()
    • always gives 0 if x = y
  – Commutative operators
    • addition, multiplication, bitwise operators
    • x.hashCode() + y.hashCode()
      – gives the same value even if we swap x and y
      – e.g. ("Craig", "James") versus ("James", "Craig")
  – Lots of other bad examples


A good hashCode() recipe

• Look at all the fields that are compared in the equals() method
  – these, and only these, should be used
  – Recursively compute the hashCode() for each field to get 32-bit values
    • a_1, a_2, ..., a_k
  – Output
    • (a_1 z_1 + a_2 z_2 + a_3 z_3 + ... + a_(k-1) z_(k-1) + a_k z_k) mod 2^32
    • z_1,...,z_k are randomly chosen 32-bit integers


    public int hashCode() {
        long[] z = {0x2058cc50L, 0xcb19137eL, 0x2cb6b6fdL};  // random
        long zz = 0xbea0107e5067d19dL;                        // random

        long h0 = x0.hashCode() & ((1L<<32)-1);  // unsigned int to long
        long h1 = x1.hashCode() & ((1L<<32)-1);
        long h2 = x2.hashCode() & ((1L<<32)-1);

        return (int)(((z[0]*h0 + z[1]*h1 + z[2]*h2)*zz) >>> 32);
    }


Good hashCode() theorem

• If (a_1,...,a_k) ≠ (b_1,...,b_k) then
  – Pr{a.hashCode() = b.hashCode()} ≤ 3/2^w
• In Java, this means that, using the previous recipe,
  – Pr{x.hashCode() = y.hashCode()} ≤ 3/2^32 = 3/4,294,967,296


Another recipe

• Random numbers z_1,...,z_k can be hard to get
  – If our object includes arrays then we might need a variable amount
  – We could use the Java Random class
• In practice, powers of 37 are often used
  – a_1·37^(k-1) + a_2·37^(k-2) + a_3·37^(k-3) + ... + a_(k-1)·37 + a_k

    public static int hashIt(Object[] a) {
        int h = 0;
        for (int i = 0; i < a.length; i++)
            h = (37 * h) + a[i].hashCode();
        return h;
    }
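To see why a position-dependent combination helps, the sketch below compares a plain sum of hash codes (a commutative combination) with the powers-of-37 recipe on the two name pairs mentioned earlier. hashIt() follows the slide; sumIt() and the class name CombineDemo are hypothetical additions.

```java
// Compares a commutative combination (sum) with the powers-of-37 recipe.
// hashIt() follows the slide; sumIt() and the class name are hypothetical.
class CombineDemo {
    // Commutative: the order of the elements does not affect the result.
    static int sumIt(Object[] a) {
        int h = 0;
        for (Object o : a) h += o.hashCode();
        return h;
    }

    // Powers of 37: each position gets a different multiplier.
    static int hashIt(Object[] a) {
        int h = 0;
        for (Object o : a) h = (37 * h) + o.hashCode();
        return h;
    }
}
```

For {"Craig", "James"} and {"James", "Craig"}, sumIt() returns the same value for both orders, while hashIt() returns two different values. This does not make powers of 37 provably good; it only removes this particular systematic collision.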


The Prime Field Method

• Pick a large prime number p and a random z in {0,...,p-1}.
• h(a_1,...,a_k) = (a_1 z^0 + a_2 z^1 + ... + a_k z^(k-1)) mod p
• Theorem: If (a_1,...,a_k) ≠ (b_1,...,b_k) then
  – Pr{a.hashCode() = b.hashCode()} ≤ k/p
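A possible implementation of the prime field method is sketched below. The choice p = 2^31 - 1 (a Mersenne prime), the class name PrimeFieldHash, and the use of java.util.Random are assumptions made for illustration. Each a_i is reduced modulo p before use, so two 32-bit values that differ by a multiple of p are treated as the same value; the theorem applies to the reduced sequences.

```java
import java.util.Random;

// Sketch of the prime field method with p = 2^31 - 1 (a Mersenne prime).
// The class name, the choice of p, and the use of java.util.Random are
// assumptions made for illustration.
class PrimeFieldHash {
    static final long P = 2147483647L;  // 2^31 - 1
    final long z;                       // element of {0,...,P-1}

    PrimeFieldHash(Random rng) {
        this.z = rng.nextInt(Integer.MAX_VALUE);  // uniform in {0,...,P-1}
    }

    PrimeFieldHash(long z) {
        this.z = Math.floorMod(z, P);
    }

    // Evaluates (a_1 z^0 + a_2 z^1 + ... + a_k z^(k-1)) mod P by Horner's rule.
    long hash(int[] a) {
        long h = 0;
        for (int i = a.length - 1; i >= 0; i--) {
            long ai = Math.floorMod((long) a[i], P);  // reduce a_i into {0,...,P-1}
            h = (h * z + ai) % P;                     // h*z < 2^62, so no overflow
        }
        return h;
    }
}
```

With z = 2, the sequence (1, 2, 3) hashes to 1 + 2·2 + 3·4 = 17, which is a quick way to check the evaluation order by hand.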


Hash tables summary

• Hash tables allow for implementations of Set and Map where basic operations take constant expected time
  – Requires a good hash function
    • Multiplicative hashing is efficient and provably good
  – Requires a good hashCode() method
    • For x ≠ y, Pr{x.hashCode() = y.hashCode()} < c/2^w
• We can find a lot of bad implementations of hash(x) and hashCode() online
  – even in things like the Java Collections Framework!


Hash tables: some perspectives

• Hash tables and hashCode() use random numbers
  – Should we pick these in advance, or at run-time?
• In advance:
  – Can get real random numbers
    • from random.org for example
  – Doesn't protect us from an adversarial user
• At run-time:
  – Harder to get real random numbers
    • usually settle for pseudorandom numbers (java.util.Random)
  – Can help protect against adversarial users
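The two options can be contrasted in a few lines of code. This is a sketch only: the class name MultiplierChoice is hypothetical, Z_FIXED is an arbitrary odd constant chosen here for illustration, and passing a java.security.SecureRandom instead of a plain java.util.Random is one possible way to get less predictable run-time values.

```java
import java.security.SecureRandom;
import java.util.Random;

// Contrasts a multiplier fixed in advance with one chosen at run-time.
// The class name and the constant are hypothetical illustrations.
class MultiplierChoice {
    // In advance: identical in every run, so an adversary who knows the
    // code can construct inputs that collide.
    static final int Z_FIXED = 0x9E3779B1;  // arbitrary odd constant

    // At run-time: differs from run to run; forcing the low bit keeps it odd,
    // as multiplicative hashing requires.
    static int chooseAtRuntime(Random rng) {
        return rng.nextInt() | 1;
    }

    public static void main(String[] args) {
        int z = chooseAtRuntime(new SecureRandom());
        System.out.println("run-time multiplier is odd: " + ((z & 1) == 1));
    }
}
```

SecureRandom extends java.util.Random, so it can be passed wherever a Random is expected.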