Hashing

O(1) data access (almost)
  access, insertion, deletion, updating in constant time (on average)
  but at a price…
references: Weiss, Goodrich & Tamassia, Main

“associative memory”

Access to data
  O(n): linked list, array
  O(log n): sorted array, search tree
  O(1): array by index
index access is O(1) because the data location is found by computation, not search

Computed access to array data
e.g. array of objects
  arr is at location 8423400 in memory
  address of arr[12]: 8423400 + 4 * 12 = 8423448

Access by Hashing
Hashing applies the same concept at the software level:
  access operations do not search for data keys; they compute data indexes
  index = f(data.key)
  performance is “almost” O(1)

Access example
student number is key: s01324092
  i = f(key) = 01324092 % 10000 = 4092
  data for student s01324092 is at location 4092 in the data array
  arr[4092].key = “s01324092”
problems
  wasted storage: array must have 10 000 elements
  competition for space: s01324092 and s02894092 both map to index 4092
  iterated operations are more difficult

Access example (figure): key “s01324092” -> f -> index, where i = f(key) = 01324092 % 10000 = 4092, so f(“s01324092”) = 4092

Hashing terminology
  hash function: f, as in i = f(key) = 01324092 % 10000 = 4092
  hash table: the data array, where data for student s01324092 is at location 4092 (arr[4092].key = “s01324092”)
  collision: competition for space, e.g. s01324092 and s02894092
  remaining problem: wasted storage, since the array must have 10 000 elements

Hashing Fact-of-Life
Collisions are unavoidable
Solution strategy:
  minimize the number of collisions
  resolve the collisions that do occur

Hash functions
for a hash table of size n
  map key -> {0, …, n-1}
  typical function: key -> integer % n
e.g.
    // student number key
    int hash(String stuNo, int n) {
        return Integer.parseInt(stuNo.substring(1)) % n;
    }
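The student-number hash above can be run as a small sketch to see both the computed index and a collision. The class name StudentHash and the main method are illustrative additions, not part of the slides.

```java
class StudentHash {
    // Hash a student number such as "s01324092" into the range 0..n-1:
    // drop the leading letter, parse the digits, take the remainder mod n.
    static int hash(String stuNo, int n) {
        return Integer.parseInt(stuNo.substring(1)) % n;
    }

    public static void main(String[] args) {
        System.out.println(hash("s01324092", 10000)); // 4092
        System.out.println(hash("s02894092", 10000)); // 4092, a collision
    }
}
```

Two different keys land on the same index because only the last four digits survive the % 10000 step.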

Hash function goals
  simple as possible (speed)
  distribute keys uniformly over indices (minimize collisions)
two steps:
  1. transform key to integer if necessary (hashCode())
  2. restrict integer to range of data array (hash())
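A minimal sketch of the two steps, assuming any Java object as the key. The class name TwoStepHash is hypothetical. Math.floorMod is used instead of % because hashCode() may return a negative integer, and % would then give a negative index.

```java
class TwoStepHash {
    // Step 1: key -> integer via hashCode().
    // Step 2: integer -> 0..n-1 via floorMod (never negative for n > 0).
    static int hash(Object key, int n) {
        return Math.floorMod(key.hashCode(), n);
    }

    public static void main(String[] args) {
        int n = 10007; // illustrative table size
        int i = hash("s01324092", n);
        System.out.println(0 <= i && i < n); // true
        // Integer.valueOf(-7).hashCode() is -7; floorMod maps it into range.
        System.out.println(hash(Integer.valueOf(-7), 5)); // 3
    }
}
```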

Java’s hashCode() method
public int hashCode()
Returns a hash code value for the object. This method is supported for the benefit of hashtables such as those provided by java.util.Hashtable.
The general contract of hashCode is:
  Whenever it is invoked on the same object more than once during an execution of a Java application, the hashCode method must consistently return the same integer, provided no information used in equals comparisons on the object is modified. This integer need not remain consistent from one execution of an application to another execution of the same application.
  If two objects are equal according to the equals(Object) method, then calling the hashCode method on each of the two objects must produce the same integer result.
  It is not required that if two objects are unequal according to the equals(java.lang.Object) method, then calling the hashCode method on each of the two objects must produce distinct integer results. However, the programmer should be aware that producing distinct integer results for unequal objects may improve the performance of hashtables.
As much as is reasonably practical, the hashCode method defined by class Object does return distinct integers for distinct objects. (This is typically implemented by converting the internal address of the object into an integer, but this implementation technique is not required by the Java™ programming language.)
Returns: a hash code value for this object.

http://java.sun.com/j2se/1.4.1/docs/api/java/lang/Object.html#equals(java.lang.Object)


Other hashing methods
hashCode can be overridden for any class
hashCode usually should be overridden to
  fit actual data
  improve performance
  remove dependence on location in memory

Design model
equals() is based on key field
  match implies same record
hashCode function also based on key field
  key is used for access
BUT
hash function is also based on table size
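The design model might look like the following sketch, assuming a record class whose key is the student number. The class Student and its fields are hypothetical. Both equals() and hashCode() use only the key field, and the table size enters later, in the table's own hash function.

```java
final class Student {
    private final String stuNo; // key field: the only input to equals/hashCode
    private final String name;  // non-key data, ignored when matching

    Student(String stuNo, String name) {
        this.stuNo = stuNo;
        this.name = name;
    }

    String getName() { return name; }

    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof Student)) return false;
        return stuNo.equals(((Student) other).stuNo); // match implies same record
    }

    @Override
    public int hashCode() {
        return stuNo.hashCode(); // equal keys give equal hash codes
    }
}
```

Because equal objects must produce equal hash codes, overriding one method without the other would break lookups in hashed collections.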

Resizing table
if table is resized, all data must be re-entered into new array, not just copied
e.g.:
    int hash(String stuNo, int n) {
        return Integer.parseInt(stuNo.substring(1)) % n;
    }
hash(“s01324092”, 10000) => 4092
hash(“s01324092”, 6667) => 4026
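A sketch of re-entering keys on resize, assuming a table that stores only the key strings and resolves collisions by stepping to the next cell. The class name Rehash and the method resize are hypothetical. The new array must have room for every key, otherwise the inner loop would not terminate.

```java
class Rehash {
    static int hash(String stuNo, int n) {
        return Integer.parseInt(stuNo.substring(1)) % n;
    }

    // Re-enter every key: each index is recomputed with the new table size.
    static String[] resize(String[] old, int newSize) {
        String[] table = new String[newSize];
        for (String key : old) {
            if (key == null) continue;          // skip empty cells
            int i = hash(key, newSize);
            while (table[i] != null)            // step past occupied cells
                i = (i + 1) % newSize;
            table[i] = key;
        }
        return table;
    }

    public static void main(String[] args) {
        String[] small = new String[6667];
        small[hash("s01324092", 6667)] = "s01324092";  // index 4026
        String[] big = resize(small, 10000);
        System.out.println(big[4092]);                 // s01324092
        System.out.println(big[4026]);                 // null
    }
}
```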

Resolving collisions
When a collision occurs on insertion:
  internal: store new element at another location in the table
  external: store new element outside the table

Linear probing
sequential search for next available location to store data when collision occurs
e.g.
  hash(key) -> index = 4
  if table[4] is occupied, try table[5], then table[6], …, until an empty location is found
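A minimal sketch of insertion with linear probing, assuming non-negative integer keys, -1 as the empty marker, and key % n as the hash function. The class name LinearProbe is hypothetical, and the keys 89, 18, 49, 58, 9 are an illustrative sequence for a table of size 10.

```java
import java.util.Arrays;

class LinearProbe {
    static final int EMPTY = -1;

    static int[] newTable(int n) {
        int[] table = new int[n];
        Arrays.fill(table, EMPTY);
        return table;
    }

    // Insert a non-negative key and return the index used.
    // Assumes the table has at least one empty cell.
    static int insert(int[] table, int key) {
        int i = key % table.length;
        while (table[i] != EMPTY)
            i = (i + 1) % table.length; // try the next cell, wrapping around
        table[i] = key;
        return i;
    }

    public static void main(String[] args) {
        int[] table = newTable(10);
        for (int key : new int[] {89, 18, 49, 58, 9})
            insert(table, key);
        System.out.println(Arrays.toString(table));
        // [49, 58, 9, -1, -1, -1, -1, -1, 18, 89]
    }
}
```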

Linear probing hash table after each insertion (Weiss)

Data Structures & Problem Solving using JAVA/2E Mark Allen Weiss © 2002 Addison Wesley

The Deletion Problem (Weiss, 2002)
find(58), delete(89), find(58): the second find(58) fails

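Lazy deletion
operations: find(58), delete(89), find(58), insert(99), find(58)
state: a = active, d = deleted

initial table
index   0   1   2   3   4   5   6   7   8   9
value  49  58   9  -1  -1  -1  -1  -1  18  89
state   a   a   a   a   a   a   a   a   a   a

after delete(89)
index   0   1   2   3   4   5   6   7   8   9
value  49  58   9  -1  -1  -1  -1  -1  18  89
state   a   a   a   a   a   a   a   a   a   d

insert criterion: value == -1 OR state == d
continue search criterion: value != -1

after insert(99)
index   0   1   2   3   4   5   6   7   8   9
value  49  58   9  -1  -1  -1  -1  -1  18  99
state   a   a   a   a   a   a   a   a   a   a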
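The two criteria can be sketched as follows, assuming non-negative integer keys, -1 as the empty value, and a boolean per cell for the a/d state. The class LazyTable is hypothetical, and insert does not check for an existing copy of the key.

```java
import java.util.Arrays;

class LazyTable {
    static final int EMPTY = -1;
    final int[] value;
    final boolean[] deleted; // state per cell: false = a (active), true = d (deleted)

    LazyTable(int n) {
        value = new int[n];
        deleted = new boolean[n];
        Arrays.fill(value, EMPTY);
    }

    // Continue-search criterion: keep probing while value != -1.
    int find(int key) {
        int i = key % value.length;
        for (int count = 0; count < value.length && value[i] != EMPTY; count++) {
            if (value[i] == key && !deleted[i]) return i;
            i = (i + 1) % value.length;
        }
        return -1;
    }

    // Insert criterion: use the first cell with value == -1 OR state == d.
    void insert(int key) {
        int i = key % value.length;
        for (int count = 0; count < value.length; count++) {
            if (value[i] == EMPTY || deleted[i]) {
                value[i] = key;
                deleted[i] = false;
                return;
            }
            i = (i + 1) % value.length;
        }
        throw new IllegalStateException("Table is full.");
    }

    // Lazy deletion: mark the cell instead of emptying it.
    boolean delete(int key) {
        int i = find(key);
        if (i == -1) return false;
        deleted[i] = true;
        return true;
    }
}
```

Because delete only marks the cell, a later find(58) still probes past index 9 and reaches 58 at index 1.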

Linear probing performance
Ideal performance depends on the fraction of the table that is full
  k items in table of size n
  probability of insertion collision: k/n = p
  average probes to free space: n/(n-k) or 1/(1-p)
e.g. table half full: 2 probes
BUT…

Linear probing performance
Linear probing for insertion produces primary clustering:
  average probes for insertion: (1 + 1/(1-p)^2)/2
e.g. table half full: 2.5 probes
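The two estimates can be evaluated side by side. This sketch only computes the formulas, 1/(1-p) for the ideal case and (1 + 1/(1-p)^2)/2 with primary clustering; the class and method names are hypothetical.

```java
class ProbeEstimates {
    // Ideal (no clustering): average probes to find a free cell.
    static double unbiased(double p) {
        return 1.0 / (1.0 - p);
    }

    // Linear probing with primary clustering: average probes for insertion.
    static double linear(double p) {
        return (1.0 + 1.0 / ((1.0 - p) * (1.0 - p))) / 2.0;
    }

    public static void main(String[] args) {
        for (double p : new double[] {0.5, 0.7, 0.9})
            System.out.printf("p=%.1f unbiased=%.2f linear=%.2f%n",
                              p, unbiased(p), linear(p));
    }
}
```

At p = 0.5 the values are 2 and 2.5 probes, and the gap widens quickly as the table fills.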

Illustration of primary clustering in linear probing (b) versus no clustering (a) and the less significant secondary clustering in quadratic probing (c). Long lines represent occupied cells, and the load factor is 0.7.

Data Structures & Problem Solving using JAVA/2E Mark Allen Weiss © 2002 Addison Wesley

Linear probing performance (plot): average probes for insertion (y axis, 0 to 25) vs. load factor (x axis, 0 to 1); curves: Unbiased, Linear

Clustering
primary clustering
  from linear probing
solution:
  alternate probing actions, e.g. quadratic probing
  constraint: minimize computation of probe

Clustering
primary clustering
  linear probing
secondary clustering
  different probes from different indices
  quadratic probing
even better:
  different probes for different keys at same index
  secondary hashing
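A minimal sketch of quadratic probing, assuming non-negative integer keys, -1 as the empty marker, and probes at home, home+1, home+4, home+9, and so on (mod n). The class name QuadraticProbe is hypothetical. The probe sequence may miss free cells, so the sketch gives up after n attempts.

```java
import java.util.Arrays;

class QuadraticProbe {
    static final int EMPTY = -1;

    // Insert a non-negative key; the k-th probe is (home + k*k) % n.
    // Assumes n is small enough that k*k does not overflow an int.
    static int insert(int[] table, int key) {
        int n = table.length;
        int home = key % n;
        for (int k = 0; k < n; k++) {
            int i = (home + k * k) % n;
            if (table[i] == EMPTY) {
                table[i] = key;
                return i;
            }
        }
        throw new IllegalStateException("No free cell on the probe sequence.");
    }

    public static void main(String[] args) {
        int[] table = new int[10];
        Arrays.fill(table, EMPTY);
        for (int key : new int[] {89, 18, 49, 58, 9})
            insert(table, key);
        System.out.println(Arrays.toString(table));
        // [49, -1, 58, 9, -1, -1, -1, -1, 18, 89]
    }
}
```

Keys with different home indices follow different probe paths, which breaks up the runs that linear probing builds; keys with the same home index still share one path (secondary clustering).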

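Probing comparison (figure)

linear (key shown: 16)
index   0   1   2   3   4   5   6   7   8   9
value  49  58  -1  -1  -1  -1  66  26  18  89
state   a   a   a   a   a   a   a   a   a   a

non-linear (values shown: 16, 48)
index   0   1   2   3   4   5   6   7   8   9
value  -1  -1  -1  -1  -1  -1  66  -1  18  89
state   a   a   a   a   a   a   a   a   a   a

secondary hash (values shown: 16, 96)
index   0   1   2   3   4   5   6   7   8   9
value  -1  -1  -1  -1  -1  -1  66  -1  18  89
state   a   a   a   a   a   a   a   a   a   a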

Secondary hashing
  Hash function determines initial index
  Secondary hash function determines step size for probe after collision

Table class (Main, p. 571)
    public class Table {
        private int manyItems;
        private Object[] keys;
        private Object[] data;
        private boolean[] hasBeenUsed;
        …

constructor
    public Table(int capacity) {
        // zero is rejected as well as negative values
        if (capacity <= 0)
            throw new IllegalArgumentException("Capacity must be positive");
        keys = new Object[capacity];
        data = new Object[capacity];
        hasBeenUsed = new boolean[capacity];
    }

search for an object by key
    public boolean containsKey(Object key) {
        return findIndex(key) != -1;
    }

    private int findIndex(Object key) {
        int count = 0;
        int i = hash(key);
        // stop at a never-used cell, or after every cell has been probed
        while (count < data.length && hasBeenUsed[i]) {
            if (key.equals(keys[i]))
                return i;
            count++;
            i = nextIndex(i);
        }
        return -1;
    }

wrap around indexing
    private int nextIndex(int i) {
        if (i + 1 == data.length)
            return 0;
        else
            return i + 1;
    }

get an object
    public Object get(Object key) {
        int index = findIndex(key);
        if (index == -1)
            return null;
        else
            return data[index];
    }

insert a key and object
    public Object put(Object key, Object element) {
        int index = findIndex(key);
        Object answer;
        if (index != -1) {                    // replace object for key
            answer = data[index];
            data[index] = element;
            return answer;
        } else if (manyItems < data.length) { // new key and object
            index = hash(key);
            while (keys[index] != null)
                index = nextIndex(index);
            keys[index] = key;
            data[index] = element;
            hasBeenUsed[index] = true;
            manyItems++;
            return null;
        } else {                              // table is full
            throw new IllegalStateException("Table is full.");
        }
    }

remove a key and object
    public Object remove(Object key) {
        int index = findIndex(key);
        Object answer = null;
        if (index != -1) {
            answer = data[index];
            keys[index] = null;
            data[index] = null;
            // hasBeenUsed[index] stays true, so later searches probe past this cell
            manyItems--;
        }
        return answer;
    }

Changing probe strategy: double hash
    private int findIndex(Object key) {
        int count = 0;
        int i = hash1(key);
        int p = hash2(key);   // step size from the secondary hash
        while (count < data.length && hasBeenUsed[i]) {
            if (key.equals(keys[i]))
                return i;
            count++;
            i = nextIndex(i, p);
        }
        return -1;
    }

    private int nextIndex(int i, int p) {
        return (i + p) % data.length;
    }

Picking good hash strategies
division hash functions
  prime table size (n) is required
  index is hashCode % n
  stepSize is 1 + hashCode % (n-2)
  (Knuth: best if (n-2) is also prime)
mid-square
  hashCode^2: take ‘middle’ digits
multiplicative
  hashCode * r (0 < r < 1): take fraction digits
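The division strategy can be sketched as below, assuming a table size n = 13 (so that n - 2 = 11 is also prime) and an illustrative hash code of 100. The class DivisionHash and its method names are hypothetical, and Math.floorMod stands in for % so that negative hash codes stay in range.

```java
class DivisionHash {
    // Initial index: hashCode % n.
    static int index(int hashCode, int n) {
        return Math.floorMod(hashCode, n);
    }

    // Step size: 1 + hashCode % (n - 2), always in 1..n-2.
    static int stepSize(int hashCode, int n) {
        return 1 + Math.floorMod(hashCode, n - 2);
    }

    // The first n cells probed for a given hash code.
    static int[] probeSequence(int hashCode, int n) {
        int[] seq = new int[n];
        int i = index(hashCode, n);
        int step = stepSize(hashCode, n);
        for (int k = 0; k < n; k++) {
            seq[k] = i;
            i = (i + step) % n;
        }
        return seq;
    }

    public static void main(String[] args) {
        // index 100 % 13 = 9, step 1 + 100 % 11 = 2
        System.out.println(java.util.Arrays.toString(probeSequence(100, 13)));
        // [9, 11, 0, 2, 4, 6, 8, 10, 12, 1, 3, 5, 7]
    }
}
```

Since n is prime and the step is between 1 and n-2, the sequence visits every cell exactly once before repeating.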

External Hashing (Chaining)
  array of linked lists of objects
  for map, objects contain map entry pairs
(figure: key -> hash function -> index into array 0, 1, 2, 3, …; each array cell heads a list of data pairs)

External Hashing (Chaining)
  less sensitive to load factor
  more memory access (list)
  easier to manage
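A minimal sketch of a chained table for a map, assuming one java.util.LinkedList of key/value pairs per array cell. The class ChainedTable and its nested Entry class are hypothetical and offer only put and get.

```java
import java.util.LinkedList;

class ChainedTable<K, V> {
    // One map entry pair stored in a chain.
    private static final class Entry<K, V> {
        final K key;
        V value;
        Entry(K key, V value) { this.key = key; this.value = value; }
    }

    private final LinkedList<Entry<K, V>>[] buckets;

    @SuppressWarnings("unchecked")
    ChainedTable(int n) {
        buckets = new LinkedList[n];
        for (int i = 0; i < n; i++)
            buckets[i] = new LinkedList<>();
    }

    private int hash(K key) {
        return Math.floorMod(key.hashCode(), buckets.length);
    }

    // Replace the value for an existing key, or append a new pair to the chain.
    V put(K key, V value) {
        LinkedList<Entry<K, V>> chain = buckets[hash(key)];
        for (Entry<K, V> e : chain) {
            if (e.key.equals(key)) {
                V old = e.value;
                e.value = value;
                return old;
            }
        }
        chain.add(new Entry<>(key, value));
        return null;
    }

    // Search only the chain at the key's index.
    V get(K key) {
        for (Entry<K, V> e : buckets[hash(key)])
            if (e.key.equals(key)) return e.value;
        return null;
    }
}
```

Collisions never consume other cells of the array, so the table keeps working even when it holds more entries than it has cells.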

Comparison of Hashing Performance (plot): number of comparisons (y) vs load (x); curves: Linear probe, Double hash, Chaining

Analysis of performance
Linear probing:
  ½(1 + 1/(1-α)) comparisons for successful search, where α is load factor (Knuth)
  assumptions: uniform hashing, no deletions
  e.g., 1365 entries in table of 1709: α = .80, expect 3 comparisons

Analysis of performance
Double hashing:
  -ln(1-α)/α comparisons for successful search, where α is load factor (Knuth)
  assumptions: uniform hashing, no deletions
  e.g., 1365 entries in table of 1709: α = .80, expect 2 comparisons

Analysis of performance
Chained hashing:
  1 + α/2 comparisons for successful search, where α is load factor
  assumptions: uniform hashing
  e.g., 1365 entries in table of 1709: α = .80, expect 1.4 comparisons
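The three estimates for a successful search can be checked numerically. This sketch only evaluates the formulas quoted above at α = 1365/1709; the class and method names are hypothetical.

```java
class SearchCost {
    static double linearProbing(double a) {
        return 0.5 * (1.0 + 1.0 / (1.0 - a));
    }

    static double doubleHashing(double a) {
        return -Math.log(1.0 - a) / a;
    }

    static double chaining(double a) {
        return 1.0 + a / 2.0;
    }

    public static void main(String[] args) {
        double a = 1365.0 / 1709.0; // load factor, about 0.80
        System.out.printf("linear probing: %.2f%n", linearProbing(a));
        System.out.printf("double hashing: %.2f%n", doubleHashing(a));
        System.out.printf("chaining:       %.2f%n", chaining(a));
    }
}
```

The results are close to 3, 2, and 1.4 comparisons, matching the rounded figures on the slides.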

Hash table summary
hash table: array
  computed access into array based on key
  n to 1 relation of keys to indexes
collisions
collision resolution
  open-address hashing
  double hashing
  chained hashing

JAVA Collections
Interfaces
  Collection
    List
    Queue
    Set
      SortedSet
  Map
    SortedMap
Implementations
  array (resizable)
  linked list
  balanced search tree
  hash table
  hash table plus linked list
hashed implementations
  HashSet implements Set
  HashMap implements Map
  constructors: capacity, load factor
  performance
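A short sketch of the hashed implementations and their capacity and load factor constructor arguments. The class CollectionsDemo, the record strings, and the values 16 and 0.75f are illustrative choices.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class CollectionsDemo {
    // HashMap(int initialCapacity, float loadFactor)
    static Map<String, String> buildMap() {
        Map<String, String> records = new HashMap<>(16, 0.75f);
        records.put("s01324092", "record A");
        records.put("s02894092", "record B");
        return records;
    }

    // HashSet(int initialCapacity, float loadFactor)
    static Set<String> buildSet() {
        Set<String> keys = new HashSet<>(16, 0.75f);
        keys.add("s01324092");
        keys.add("s01324092"); // duplicate is ignored
        return keys;
    }

    public static void main(String[] args) {
        System.out.println(buildMap().get("s01324092")); // record A
        System.out.println(buildSet().size());           // 1
    }
}
```

A lower load factor trades memory for fewer collisions; the table is resized and its entries rehashed once the number of entries exceeds capacity times load factor.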
