
Page 1: Chapter 2.5: Dictionaries and Hash Tables

Chapter 2.5: Dictionaries and Hash Tables

[Figure: a hash table with buckets 0-4 holding the keys 025-612-0001, 451-229-0004, and 981-101-0002.]

Page 2: Chapter 2.5: Dictionaries and Hash Tables

Dictionary ADT (§2.5.1)

The dictionary ADT models a searchable collection of key-element items.

The main operations of a dictionary are searching, inserting, and deleting items.

Multiple items with the same key are allowed.

Applications:
- address book
- credit card authorization
- mapping host names (e.g., cs16.net) to internet addresses (e.g., 128.148.34.101)

Dictionary ADT methods:
- findElement(k): if the dictionary has an item with key k, returns its element; otherwise returns the special element NO_SUCH_KEY
- insertItem(k, o): inserts item (k, o) into the dictionary
- removeElement(k): if the dictionary has an item with key k, removes it from the dictionary and returns its element; otherwise returns the special element NO_SUCH_KEY
- size(), isEmpty()
- keys(), elements()

Page 3: Chapter 2.5: Dictionaries and Hash Tables

Log File (§2.5.1)

A log file is a dictionary implemented by means of an unsorted sequence.

We store the items of the dictionary in a sequence (based on a doubly-linked list or a circular array), in arbitrary order.

Performance:
- insertItem takes O(1) time, since we can insert the new item at the beginning or at the end of the sequence
- findElement and removeElement take O(n) time, since in the worst case (the item is not found) we traverse the entire sequence looking for an item with the given key

The log file is effective only for dictionaries of small size, or for dictionaries on which insertions are the most common operations while searches and removals are rarely performed (e.g., a historical record of logins to a workstation).
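A minimal log-file sketch in C, with insertion at the head of an unsorted linked list (integer keys and string elements are illustrative; removeElement is omitted for brevity):

    /* Log file: an unsorted linked list used as a dictionary. */
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct Node {
        int key;
        const char *elem;
        struct Node *next;
    } Node;

    /* insertItem: O(1) -- prepend at the head */
    Node *insertItem(Node *head, int key, const char *elem) {
        Node *n = malloc(sizeof *n);
        n->key = key; n->elem = elem; n->next = head;
        return n;
    }

    /* findElement: O(n) -- may scan the whole list */
    const char *findElement(const Node *head, int key) {
        for (; head; head = head->next)
            if (head->key == key) return head->elem;
        return NULL;  /* plays the role of NO_SUCH_KEY */
    }

    int main(void) {
        Node *log = NULL;
        log = insertItem(log, 16, "cs16.net");
        log = insertItem(log, 42, "example.org");
        const char *e = findElement(log, 16);
        printf("%s\n", e ? e : "NO_SUCH_KEY");
        return 0;
    }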

Page 4: Chapter 2.5: Dictionaries and Hash Tables

Lookup Table

A lookup table is a dictionary implemented with a sorted sequence:
- We store the items of the dictionary in an array-based sequence, sorted by key
- We use an external comparator for the keys

Performance:
- findElement takes O(log n) time, using binary search
- insertItem takes O(n) time, since in the worst case we have to shift O(n) items to make room for the new item
- removeElement takes O(n) time, since in the worst case we have to shift O(n) items to compact the items after the removal

Effective for small dictionaries, or for dictionaries where searches are common but inserts and deletes are rare (e.g., credit card authorizations).
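A minimal findElement sketch in C, assuming an array of (key, element) items sorted by key (names are illustrative):

    /* Lookup table: binary search over a key-sorted array, O(log n). */
    #include <stdio.h>

    typedef struct { int key; const char *elem; } Item;

    const char *findElement(const Item *a, int n, int key) {
        int lo = 0, hi = n - 1;
        while (lo <= hi) {
            int mid = lo + (hi - lo) / 2;
            if (a[mid].key == key)     return a[mid].elem;
            else if (a[mid].key < key) lo = mid + 1;
            else                       hi = mid - 1;
        }
        return NULL;  /* NO_SUCH_KEY */
    }

    int main(void) {
        Item table[] = { {2, "b"}, {5, "e"}, {9, "i"} };  /* sorted by key */
        const char *e = findElement(table, 3, 5);
        printf("%s\n", e ? e : "NO_SUCH_KEY");
        return 0;
    }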

Page 5: Chapter 2.5: Dictionaries and Hash Tables

Hashing (§2.5.2)

Application: word occurrence statistics (operations: insert, find). Dictionary: insert, delete, find. Are O(log n) comparisons necessary? (No.)

Hashing, basic plan:
- create a big array for the items to be stored
- use a function (the hash function) to compute a storage location from the key
- a collision resolution scheme is necessary

Page 6: Chapter 2.5: Dictionaries and Hash Tables

Hash Table Example

A simple hash function: treat the key as a large integer K and let h(K) = K mod M, where M is the table size; let M be a prime number.

Example: suppose we have 101 buckets in the hash table.
- 'abcd' in hex is 0x61626364; converted to decimal, it's 1633837924
- 1633837924 mod 101 = 11, so h('abcd') = 11: store the key at location 11
- 'dcba' hashes to 57; 'abbc' also hashes to 57: a collision. What to do?

If you have billions of possible keys and hundreds of buckets, lots of collisions are possible!
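The arithmetic can be checked directly; a tiny C sketch of the slide's computation:

    /* Treat "abcd" as the 32-bit integer 0x61626364 and hash with K mod M. */
    #include <stdio.h>

    int main(void) {
        unsigned K = 0x61626364u;  /* bytes 'a' 'b' 'c' 'd' read as one integer */
        unsigned M = 101;          /* table size: a prime */
        printf("%u mod %u = %u\n", K, M, K % M);  /* 1633837924 mod 101 = 11 */
        return 0;
    }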

Page 7: Chapter 2.5: Dictionaries and Hash Tables

Hash Functions (§2.5.3)

A hash function is usually specified as the composition of two functions:
- Hash code map: h1: keys -> integers
- Compression map: h2: integers -> [0, N - 1]

The hash code map is applied first, and the compression map is applied next on the result, i.e., h(x) = h2(h1(x)).

The goal of the hash function is to "disperse" the keys in an apparently random way.

Page 8: Chapter 2.5: Dictionaries and Hash Tables

Hash Code Maps (§2.5.3)

- Memory address: interpret the memory address of the key as an integer
- Integer cast: interpret the bits of the key as an integer (for short keys)
- Component sum: partition the bits of the key into chunks (e.g., 16 or 32 bits) and sum, ignoring overflows (for long keys)
- Polynomial accumulation: like component sum, but multiply each term by 1, z, z^2, z^3, ...

      p(z) = a_0 + a_1 z + a_2 z^2 + ... + a_{n-1} z^{n-1}

  evaluated at a fixed value z, ignoring overflows.

  Can be evaluated in O(n) time using Horner's rule, where each term is computed from the previous in O(1) time:

      p_0(z) = a_{n-1}
      p_i(z) = a_{n-i-1} + z p_{i-1}(z)    (i = 1, 2, ..., n - 1)

Page 9: Chapter 2.5: Dictionaries and Hash Tables

Compression Maps (§2.5.4)

- Division: h2(y) = y mod N. The size N of the hash table is usually chosen to be a prime; the reason has to do with number theory and is beyond the scope of this course.
- Multiply, Add and Divide (MAD): h2(y) = (a y + b) mod N, where a and b are nonnegative integers such that a mod N != 0. Otherwise, every integer would map to the same value b.
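A sketch of the MAD map in C (the constants a = 3, b = 1, N = 101 are illustrative choices, not prescribed by the slide):

    /* MAD compression map: h2(y) = (a*y + b) mod N, with a mod N != 0. */
    #include <stdio.h>

    unsigned mad(unsigned y, unsigned a, unsigned b, unsigned N) {
        return (a * y + b) % N;
    }

    int main(void) {
        unsigned N = 101, a = 3, b = 1;   /* a % N != 0, so keys disperse */
        for (unsigned y = 0; y < 5; y++)
            printf("h2(%u) = %u\n", y, mad(y, a, b, N));
        return 0;
    }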

Page 10: Chapter 2.5: Dictionaries and Hash Tables

Hashing Strings

h('aVeryLongVariableName')? Horner's method example (z = 256, M = 101):

    256 * 97 + 86 = 24918;  24918 mod 101 = 72
    256 * 72 + 101 = 18533; 18533 mod 101 = 50
    256 * 50 + 114 = 12914; 12914 mod 101 = 87
    ...

Scramble by replacing 256 with 117:

    int hash(char *v, int M) {           /* Horner's rule, reducing mod M at each step */
        int h, a = 117;
        for (h = 0; *v != '\0'; v++)
            h = (a * h + *v) % M;
        return h;
    }
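Taking the remainder at every step keeps h within int range and, by modular arithmetic, yields the same value as reducing the full polynomial mod M at the end. Replacing 256 (the number of character values) with 117 scrambles the contribution of each character position rather than simply shifting bytes.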

Page 11: Chapter 2.5: Dictionaries and Hash Tables

Collisions (§2.5.5)

How likely are collisions? Birthday paradox: inserting random keys into a table with M buckets, the first collision is expected after about sqrt(pi*M/2), i.e., about 1.25 sqrt(M), insertions:

    M        ~1.25 sqrt(M)
    100      12
    1000     40
    10000    125

[1.25 sqrt(365) is about 24.]

Experiment: generate random numbers 0..100:

    84 35 45 32 89 1 58 16 38 69 5 90 16 16 53 61 ...

Collision at the 13th number (16 repeats), as predicted.

What to do about collisions?
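A quick way to rerun the slide's experiment, as a C sketch (the seed is an illustrative choice):

    /* Birthday-paradox experiment: draw random numbers in 0..100 and report
       the first repeat (expected around 1.25 * sqrt(101), i.e., ~13 draws). */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        int seen[101] = {0};
        srand(42);                 /* fixed seed so the run is repeatable */
        for (int i = 1; ; i++) {
            int r = rand() % 101;  /* 0..100 */
            if (seen[r]) {
                printf("collision at draw %d (value %d)\n", i, r);
                break;
            }
            seen[r] = 1;
        }
        return 0;
    }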

Page 12: Chapter 2.5: Dictionaries and Hash Tables

Collision Resolution: Chaining

- Build a linked list for each bucket
- Linear search within each list
- Simple, practical, widely used
- Cuts search time by a factor of M over sequential search
- But requires extra memory outside of the table

[Figure: buckets 0-4 with chains; 451-229-0004 and 981-101-0004 share one chain, 025-612-0001 sits in another.]

Page 13: Chapter 2.5: Dictionaries and Hash Tables

Chaining 2

- Insertion time? O(1)
- Average search cost, successful search? O(N/(2M))
- Average search cost, unsuccessful? O(N/M)
- M large: constant average search time
- Worst case: N ("probabilistically unlikely")
- Keep lists sorted? insert time becomes O(N/(2M)); unsuccessful search time O(N/(2M))
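A minimal chaining sketch in C (M = 5 and integer keys are illustrative; malloc errors are ignored for brevity):

    /* Separate chaining: M buckets, each an unsorted linked list. */
    #include <stdio.h>
    #include <stdlib.h>

    #define M 5

    typedef struct Node { int key; struct Node *next; } Node;
    Node *table[M];                      /* M initially empty chains */

    void insert(int key) {               /* O(1): prepend to the chain */
        Node *n = malloc(sizeof *n);
        n->key = key;
        n->next = table[key % M];
        table[key % M] = n;
    }

    int search(int key) {                /* O(N/M) on average */
        for (Node *p = table[key % M]; p; p = p->next)
            if (p->key == key) return 1;
        return 0;
    }

    int main(void) {
        insert(11); insert(7); insert(17);         /* 7 and 17 collide in bucket 2 */
        printf("%d %d\n", search(17), search(3));  /* prints: 1 0 */
        return 0;
    }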

Page 14: Chapter 2.5: Dictionaries and Hash Tables

Linear Probing (§2.5.5)

- Or, we could keep everything in the same table
- Insert: upon collision, scan forward for a free spot
- Search: probe the same way (if you reach a free spot, the search fails)
- Runtime? Still O(1) if the table is sparse
- But as the table fills, clustering occurs
- Skipping c spots instead of 1 doesn't help...
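A minimal linear-probing sketch in C, under the assumption that the table never fills (M = 13 and integer keys are illustrative):

    /* Open addressing with linear probing; EMPTY marks free slots. */
    #include <stdio.h>

    #define M 13
    #define EMPTY (-1)

    int table[M];

    void insert(int key) {             /* assumes at least one EMPTY slot */
        int i = key % M;
        while (table[i] != EMPTY)      /* collision: scan for a free slot */
            i = (i + 1) % M;
        table[i] = key;
    }

    int search(int key) {
        int i = key % M;
        while (table[i] != EMPTY) {    /* first empty slot means a miss */
            if (table[i] == key) return 1;
            i = (i + 1) % M;
        }
        return 0;
    }

    int main(void) {
        for (int i = 0; i < M; i++) table[i] = EMPTY;
        insert(18); insert(31); insert(44);         /* all hash to 5: a cluster */
        printf("%d %d\n", search(44), search(57));  /* prints: 1 0 */
        return 0;
    }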

Page 15: Chapter 2.5: Dictionaries and Hash Tables

Clustering

- Long clusters tend to get longer
- Precise analysis difficult
- Theorem (Knuth):
    Insert cost: approx. (1 + 1/(1 - N/M)^2)/2   (50% full: 2.5 probes; 80% full: 13 probes)
    Search (hit) cost: approx. (1 + 1/(1 - N/M))/2   (50% full: 1.5 probes; 80% full: 3 probes)
    Search (miss): same as insert
- Too slow when the table gets 70-80% full

How to reduce/avoid clustering?

Page 16: Chapter 2.5: Dictionaries and Hash Tables

Double Hashing

- Use a second hash function to compute the increment sequence
- Analysis extremely difficult, but behaves about like the ideal (random probing)
- Theorem (Guibas-Szemeredi):
    Insert: approx. 1/(1 - N/M)
    Search hit: approx. ln(1/(1 - N/M))/(N/M)
    Search miss: same as insert
- Not too slow until the table is about 90% full

Page 17: Chapter 2.5: Dictionaries and Hash Tables

Example of Double Hashing

Consider a hash table storing integer keys that handles collisions with double hashing:

    N = 13,  h(k) = k mod 13,  d(k) = 7 - (k mod 7)

Insert keys 18, 41, 22, 44, 59, 32, 31, 73, in this order:

    k    h(k)  d(k)  probes
    18   5     3     5
    41   2     1     2
    22   9     6     9
    44   5     5     5, 10
    59   7     4     7
    32   6     3     6
    31   5     4     5, 9, 0
    73   8     4     8

Final table:

    index:  0   1   2   3   4   5   6   7   8   9   10  11  12
    key:    31  -   41  -   -   18  32  59  73  22  44  -   -
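A sketch of the insertion loop in C, reproducing this example (the EMPTY sentinel and names are illustrative; it assumes the table never fills):

    /* Double hashing: probe h(k), h(k)+d(k), h(k)+2d(k), ... (mod 13). */
    #include <stdio.h>

    #define M 13
    #define EMPTY (-1)

    int table[M];

    void insert(int key) {
        int i = key % M;
        int d = 7 - key % 7;           /* second hash: step size, never 0 */
        while (table[i] != EMPTY)
            i = (i + d) % M;
        table[i] = key;
    }

    int main(void) {
        int keys[] = {18, 41, 22, 44, 59, 32, 31, 73};
        for (int i = 0; i < M; i++) table[i] = EMPTY;
        for (int j = 0; j < 8; j++) insert(keys[j]);
        for (int i = 0; i < M; i++)    /* prints the slide's final layout; -1 = empty */
            printf("%2d: %d\n", i, table[i]);
        return 0;
    }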

Page 18: Chapter 2.5: Dictionaries and Hash Tables

Dynamic Hash Tables

Suppose you are making a symbol table for a compiler. How big should you make the hash table? If you don't know in advance how big a table to make, what to do?

Could grow the table when it "fills" (e.g., 50% full):
- Make a new table of twice the size
- Make a new hash function
- Re-hash all of the items into the new table
- Dispose of the old table
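A sketch of grow-on-50%-full in C, using linear probing for the underlying table (the initial size, growth threshold, and key choices are illustrative; a real implementation would pick prime capacities and a fresh hash function):

    /* Dynamic table: double the capacity and re-hash when 50% full. */
    #include <stdio.h>
    #include <stdlib.h>

    #define EMPTY (-1)

    int *table, M = 8, N = 0;       /* table, capacity, item count */

    void insert(int key);            /* forward declaration for rehash() */

    void rehash(void) {
        int *old = table, oldM = M;
        M *= 2; N = 0;
        table = malloc(M * sizeof *table);
        for (int i = 0; i < M; i++) table[i] = EMPTY;
        for (int i = 0; i < oldM; i++)          /* re-insert every old item */
            if (old[i] != EMPTY) insert(old[i]);
        free(old);
    }

    void insert(int key) {
        int i = key % M;
        while (table[i] != EMPTY) i = (i + 1) % M;  /* linear probing */
        table[i] = key;
        if (++N * 2 >= M) rehash();             /* grow at 50% load */
    }

    int main(void) {
        table = malloc(M * sizeof *table);
        for (int i = 0; i < M; i++) table[i] = EMPTY;
        for (int k = 1; k <= 20; k++) insert(k * 37);
        printf("capacity %d, items %d\n", M, N);
        return 0;
    }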

Page 19: Chapter 2.5: Dictionaries and Hash Tables

Table Growing Analysis

- Worst case insertion: Θ(n), to re-hash all items. Can we make any better statements?
- Average case? O(1), since insertions n through 2n cost O(n) (on average) for the insertions and O(2n) (on average) for rehashing: O(n) total, with 3x the constant.
- Amortized analysis? The result above is actually an amortized result for the rehashing: any sequence of j insertions into an empty table has O(j) average cost for insertions and O(2j) for rehashing.
- Or, think of it as billing 3 time units for each insertion, storing 2 in the bank. Withdraw them later for rehashing.

Page 20: Chapter 2.5: Dictionaries and Hash Tables

Separate Chaining vs. Double Hashing

Assume the same amount of space for keys and links (use pointers for long or variable-length keys).

Separate chaining:
- 1M buckets, 4M keys, 4M links in nodes
- 9M words total; average search time 2

Double hashing in the same space:
- 4M items, 9M buckets in the table
- average search time: 1/(1 - 4/9) = 1.8, i.e., 10% faster

Double hashing in the same time:
- 4M items, average search time 2
- space needed: 8M words (1/(1 - 4/8) = 2), i.e., 11% less space

Page 21: Chapter 2.5: Dictionaries and Hash Tables

Deletion

How to implement delete() with separate chaining?
- Simply unlink the unwanted item
- Runtime? Same as search()

How to implement delete() with linear probing?
- Can't just erase it. (Why not?)
- Re-hash the entire cluster
- Or mark the slot as deleted?

How to delete() with double hashing?
- Re-hashing the cluster doesn't work: which "cluster"?
- Mark the slot as deleted
- Every so often, re-hash the entire table to prune the "deadwood"

Page 22: Chapter 2.5: Dictionaries and Hash Tables

Comparisons

Separate chaining advantages:
- Idiot-proof (degrades gracefully)
- No large chunks of memory needed (but is this better?)

Why use hashing?
- Fastest dictionary implementation: constant-time search and insert, on average
- Easy to implement; built into many environments

Why not use hashing?
- No performance guarantees
- Uses extra space
- Doesn't support pred, succ, sort, etc.: no notion of order

Where did Perl "hashes" get their name?

Page 23: Chapter 2.5: Dictionaries and Hash Tables

Hashing Summary

- Separate chaining: easiest to deploy
- Linear probing: fastest (but takes more memory)
- Double hashing: least memory (but takes more time to compute the second hash function)
- Dynamic (grow): handles any number of inserts at < 3x time

Curious use of hashing: the early Unix spell checker (back in the days of 3M machines...).

Timings by number of keys:

             Construction                Search Miss
    N      Chain  Probe  Dbl   Grow    Chain  Probe  Dbl   Grow
    5k     1      4      4     3       1      0      1     0
    50k    18     11     12    22      15     8      8     8
    100k   35     21     23    47      45     23     21    15
    190k   79     106    59    155     144    2194   261   30
    200k   84     -      -     159     156    -      -     33

Page 24: Chapter 2.5: Dictionaries and Hash Tables

File Tamper Test

Problem: you want to guarantee that a file hasn't been tampered with. How?

Page 25: Chapter 2.5: Dictionaries and Hash Tables

Password Verification

Problem: you run a website. You need to verify people's logins. How? Possible techniques? Ethics?

Page 26: Chapter 2.5: Dictionaries and Hash Tables

Cache Filenames

Problem: caching FlexScores.

    $outfileName = $outDirectory . "/" . $cfg->hymn . "-" . $cfg->instrument;
    if ($custom)
        $outfileName .= "-" . substr(md5($cfg->asXML()), 0, 6);
    $outfileNameLy = $outfileName . ".ly";

Page 27: Chapter 2.5: Dictionaries and Hash Tables

Turing Test

You want to generate random text that sounds like it makes sense. How?


Page 28: Chapter 2.5: Dictionaries and Hash Tables

Hash Tables in C#

System.Collections provides basic data structures:
- Hashtable
- ArrayList

(See handout)

Page 29: Chapter 2.5: Dictionaries and Hash Tables

GUI Programming in C#

Microsoft has provided a number of different GUI programming environments over the years:
- MFC (Microsoft Foundation Classes): C++, legacy
- Windows Forms: .NET wrapper for access to the native Windows interface
- WPF (Windows Presentation Foundation): .NET, managed code, primarily C# and VB; an XML file defines the user interface; better support for media, animation, etc.