Running Time of Kruskal’s Algorithm
Huffman Codes
Monday, July 14th
Outline For Today
1. Runtime of Kruskal's Algorithm (Union-Find Data Structure)
2. Data Encodings & Finding an Optimal Prefix-free Encoding
3. Prefix-free Encodings ↔ Binary Trees
4. Huffman Codes
Recap: Kruskal's Algorithm Simulation

[Figure: step-by-step simulation on an 8-vertex graph (A-H) whose edge weights, in sorted order, are 1, 2, 2.5, 3, 4, 5, 6, 7, 7.5, 8, 9. Edges are considered in that order; an edge is skipped whenever it would create a cycle. The surviving edges form the final tree, which is the same as Tprim.]
Recap: Kruskal's Algorithm Pseudocode

procedure kruskal(G(V, E)):
    sort E in order of increasing weights
    rename E so w(e1) < w(e2) < ... < w(em)
    T = {}  // final tree edges
    for i = 1 to m:
        if T ∪ {ei = (u, v)} doesn't create a cycle:
            add ei to T
    return T
Recap: For Correctness We Proved 2 Things
1. Kruskal outputs a spanning tree Tkrsk
2. Tkrsk is a minimum spanning tree
1: Kruskal Outputs a Spanning Tree
Need to prove Tkrsk is spanning AND acyclic.
Acyclic: by definition of the algorithm, since it never adds a cycle-creating edge.
Why is Tkrsk spanning (i.e., connected)?
Recall the Empty Cut Lemma: a graph is not connected iff there exists a cut (X, Y) with no crossing edges.
So if all cuts have a crossing edge => the graph is connected!
2: Kruskal is Optimal (by the Cut Property)
Let (u, v) be any edge added by Kruskal's algorithm.
u and v are in different components (because Kruskal checks for cycles).
[Figure: the cut separating u's component {u, x, y} from the rest of the vertices {v, t, z, w}.]
Claim: (u, v) is the minimum-weight edge crossing this cut!
Kruskal's Runtime

procedure kruskal(G(V, E)):
    sort E in order of increasing weights          // O(mlog(n))
    rename E so w(e1) < w(e2) < ... < w(em)
    T = {}  // final tree edges
    for i = 1 to m:                                // m iterations
        if T ∪ {ei = (u, v)} doesn't create a cycle:   // ?
            add ei to T
    return T

How do we check for cycles?
Option 1: check if a u ⤳ v path exists!
Run a BFS/DFS from u or v => O(|T| + n) = O(n) per edge.
***BFS/DFS Total Runtime: O(mn)***
Can we speed up cycle checking?
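As a concrete sketch of Option 1, here is Kruskal's with the naive DFS cycle check, in Python. The (weight, u, v) edge-list format and the function name are assumptions for illustration, not from the slides; each edge costs an O(n) DFS over the partial tree T, giving the O(mn) total above.

```python
from collections import defaultdict

def kruskal_naive(n, edges):
    """edges: list of (weight, u, v); vertices are 0..n-1."""
    adj = defaultdict(list)   # adjacency list of the partial tree T
    T = []
    for w, u, v in sorted(edges):       # edges in increasing weight order
        # DFS from u inside T to see if u can already reach v: O(|T| + n)
        stack, seen = [u], {u}
        while stack:
            x = stack.pop()
            for y in adj[x]:
                if y not in seen:
                    seen.add(y)
                    stack.append(y)
        if v not in seen:               # no u ~> v path, so no cycle
            T.append((u, v, w))
            adj[u].append(v)
            adj[v].append(u)
    return T
```

On a 4-vertex example with weights 1, 2, 3, 4 where the weight-3 edge closes a triangle, the algorithm skips exactly that edge and returns a tree of total weight 7.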
Speeding Up Kruskal's Algorithm
Goal: check for cycles in log(n) time.
Observation: (u, v) creates a cycle iff u and v are in the same connected component.
Option 2: check if u's component = v's component.
More specific goal: check the component of each vertex in log(n) time.
Union-Find Data Structure
Operation 1 (Union): maintain the component structure of T as we add new edges to it.
Operation 2 (Find): query the component of a vertex v.
Kruskal's With Union-Find (Conceptually)

[Figure: the same 8-vertex graph (A-H); next to each vertex, the leader of its current component.]

Processing the edges in sorted order (1, 2, 2.5, 3, 4, 5, 6, 7, 7.5, 8, 9):
Find(A) = A, Find(D) = D => Union(A, D)
Find(D) = A, Find(E) = E => Union(A, E)
Find(C) = C, Find(F) = F => Union(C, F)
Find(E) = A, Find(F) = C => Union(A, C)
Find(A) = A, Find(B) = B => Union(A, B)
Find(D) = A, Find(C) = A => Skip (D, C)
Find(A) = A, Find(C) = A => Skip (A, C)
Find(C) = A, Find(H) = H => Union(A, H)
Find(F) = A, Find(G) = G => Union(A, G)
Find(B) = A, Find(C) = A => Skip (B, C)
Find(H) = A, Find(G) = A => Skip (H, G)
Union-Find Implementation Simulation

[Figure: each vertex starts as its own component of size 1 (A1, B1, ..., H1). As the unions are performed, non-leader vertices point to their leader and the leader's size counter grows: A's component reaches size 2 after Union(A, D) and 3 after Union(A, E); C's reaches size 2 after Union(C, F); then A's component grows to 5 after Union(A, C), 6 after Union(A, B), 7 after Union(A, H), and finally 8 after Union(A, G).]
Linked Structure Per Connected Component

[Figure: a component {X, C, A, W, Z, Y, T} drawn as a linked structure; every vertex points toward the leader X, which stores the component size 7.]
Union Operation

[Figure: a component led by X with size 7 and a component {E, F, G} led by E with size 3.]

Union: **Make the leader of the small component point to the leader of the large component.**

[Figure: after the union, E points to X, which now leads the merged component of size 10.]

Cost: O(1) (1 pointer update, 1 increment)
Find Operation

[Figure: the merged component of size 10 led by X; Find on a vertex follows pointers up to X.]

Find: "pointer chase" until the leader.
Cost: # pointers followed to reach the leader.
Cost of Find Operation
Claim: For any v, #-pointers to leader(v) ≤ log2(|component(v)|) ≤ log2(n)
Proof: Each time v's path to the leader increases by 1, the size of its component at least doubles!
|component(v)| starts at 1 and can grow to at most n, so it can double at most log2(n) times!
Summary of Union-Find
Initialization: each v is a component of size 1 and points to itself.
Union(u, v): make the leader of the smaller component point to the leader of the larger one (break ties arbitrarily). Cost: 1 pointer update, 1 increment => O(1).
Find(v): pointer-chase to the leader. Cost: O(log2(|component|)) = O(log2(n)).
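The summary can be sketched directly in Python. This is a minimal leader-pointer implementation of union-by-size as described here (no path compression, so Find is the O(log n) pointer chase argued above); the class and method names are illustrative choices.

```python
class UnionFind:
    def __init__(self, vertices):
        self.parent = {v: v for v in vertices}  # each v points to itself
        self.size = {v: 1 for v in vertices}    # component sizes

    def find(self, v):
        # pointer-chase until we reach the leader (the self-pointing vertex)
        while self.parent[v] != v:
            v = self.parent[v]
        return v

    def union(self, u, v):
        ru, rv = self.find(u), self.find(v)
        if ru == rv:
            return                       # already in the same component
        if self.size[ru] < self.size[rv]:
            ru, rv = rv, ru              # make ru the larger component's leader
        self.parent[rv] = ru             # 1 pointer update
        self.size[ru] += self.size[rv]   # 1 increment
```

Running the first few steps of the simulation: after Union(A, D) and Union(A, E), Find(E) and Find(D) both return A, while B is still its own leader.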
Kruskal's Runtime With Union-Find

procedure kruskal(G(V, E)):
    sort E in order of increasing weights          // O(mlog(n))
    rename E so w(e1) < w(e2) < ... < w(em)
    init Union-Find                                // O(n)
    T = {}  // final tree edges
    for i = 1 to m:                                // m iterations
        ei = (u, v)
        if find(u) != find(v):                     // O(log(n))
            add ei to T
            Union(find(u), find(v))                // O(1)
    return T

***Total Runtime: O(mlog(n))*** Same as Prim's with heaps.
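A self-contained sketch of this full pipeline: sort the edges (O(m log n)), then one find/union pass per edge. A minimal leader-pointer union-find is inlined so the block stands alone; the (weight, u, v) edge-list format is an assumption for illustration.

```python
def kruskal(vertices, edges):
    """vertices: iterable of vertex names; edges: list of (weight, u, v)."""
    parent = {v: v for v in vertices}   # init Union-Find: O(n)
    size = {v: 1 for v in vertices}

    def find(v):
        while parent[v] != v:           # pointer-chase: O(log n)
            v = parent[v]
        return v

    T = []
    for w, u, v in sorted(edges):       # sort: O(m log n)
        ru, rv = find(u), find(v)
        if ru != rv:                    # adding (u, v) creates no cycle
            T.append((u, v, w))
            if size[ru] < size[rv]:     # union by size: O(1)
                ru, rv = rv, ru
            parent[rv] = ru
            size[ru] += size[rv]
    return T
```

On the same 4-vertex example as before, the weight-3 cycle edge is skipped and the returned tree has total weight 7.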
Data Encodings and Compression
All data in the digital world gets represented as 0s and 1s:

010010100010010100011110110010010101010110100001110100010011000010010101011010100010...
Goal of Data Compression: Make the binary
blob as small as possible, satisfying the
protocol.
Encoding-Decoding Protocol

[Figure: a text goes through an encoder to produce a binary blob, which a decoder turns back into the text.]
Option 1: Fixed Length Codes
Alphabet A = {a, b, c, ..., z}; assume |A| = 32.
Each letter is mapped to exactly 5 bits: a => 00000, b => 00001, ..., z => 11111.
Example: ASCII encoding.
Example: Fixed Length Codes
A = {a, b, c, ..., z}. "cat" => encoder => 000110000010100 => decoder => "cat".
Output Size of Fixed Length Codes
Input: alphabet A, text document of length n.
Each letter is mapped to log2(|A|) bits.
Output size: nlog2(|A|).
This is optimal if letters appear with the same frequencies in the text! In practice, letters appear with different frequencies.
Ex: In English, the letters a, t, e are much more frequent than q, z, x.
Question: Can we do better?
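A small sketch of Option 1 in Python: each letter of a 32-symbol alphabet gets exactly 5 bits (log2(32) = 5), so a length-n document costs 5n bits. The six punctuation symbols padding a-z up to 32 are an assumption for illustration.

```python
import string

A = string.ascii_lowercase + "._,!?;"   # 26 letters + 6 assumed symbols = 32
code = {ch: format(i, "05b") for i, ch in enumerate(A)}  # a -> 00000, ...

def encode(text):
    return "".join(code[ch] for ch in text)

def decode(bits):
    # fixed length makes decoding trivial: chop into 5-bit chunks
    return "".join(A[int(bits[i:i + 5], 2)] for i in range(0, len(bits), 5))
```

Here encode("cat") is 3 * 5 = 15 bits, and decode inverts it exactly.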
Option 2: Variable Length Binary Codes
Goal is to assign:
frequently appearing letters short bit strings,
infrequently appearing ones long bit strings.
Hope: on average use ≤ nlog2(|A|) encoded bits for documents of size n (i.e., ≤ log2(|A|) bits per letter).
Example 1: Morse Code (not binary)
Two symbols: dot (●) and dash (−), or light and dark.
But the end of a letter is indicated with a pause (effectively a third symbol, written P below).
Frequent letters: e => ●, t => −, a => ●−
Infrequent letters: c => −●−●, j => ●−−−
"cat" => encoder => −●−●P ●−P −P, and the decoder splits at the pauses to recover c, a, t.
Can We Have a Morse Code with 2 Symbols?
Goal: same idea as Morse code but with only 2 symbols.
Frequent letters: e => 0, t => 1, a => 01
Infrequent letters: c => 1010, j => 0111
"cat" => encoder => 1010 01 1 = 1010011. But decoding 1010011 is ambiguous: is it "taeett"? "teteat"? "cat"?
**Decoding is Ambiguous**
Why Was There Ambiguity?
The encoding of one letter was a prefix of another letter's.
Ex: e => 0 is a prefix of a => 01.
Goal: use a "prefix-free" encoding, i.e., no letter's encoding is a prefix of another's!
Note: the fixed-length encoding was naturally prefix-free.
Ex: Variable Length Prefix-free Encoding
Ex: A = {a, b, c, d} with code a => 0, b => 10, c => 110, d => 111.

Decoding 110010, left to right: 110 => c, 0 => a, 10 => b, giving "cab".
Decoding 11101101100: 111 => d, 0 => a, 110 => c, 110 => c, 0 => a, giving "dacca".
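The decoding walk in this example can be sketched as a few lines of Python: scan left to right, and the moment the bits read so far match a codeword, emit that letter and reset. The code table is the one from the slide; the function name is an illustrative choice.

```python
CODE = {"a": "0", "b": "10", "c": "110", "d": "111"}
DECODE = {bits: ch for ch, bits in CODE.items()}

def decode(bits):
    out, cur = [], ""
    for b in bits:
        cur += b
        if cur in DECODE:        # unambiguous match because the code is prefix-free
            out.append(DECODE[cur])
            cur = ""
    assert cur == "", "leftover bits: not a valid encoding"
    return "".join(out)
```

This reproduces both slide examples: 110010 decodes to "cab" and 11101101100 to "dacca".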
Benefits of Variable Length Codes
Ex: A = {a, b, c, d}, frequencies: a: 45%, b: 40%, c: 10%, d: 5%.

Fixed length code: a => 00, b => 01, c => 10, d => 11.
Variable length code: a => 0, b => 10, c => 110, d => 111.

For a document of length 100K:
Fixed length code: 200K bits (2 bits/letter).
Variable length code: a: 45K, b: 80K, c: 30K, d: 15K => total 170K bits (1.7 bits/letter).
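The 100K-letter arithmetic above checks out mechanically; a few lines of Python confirm it:

```python
freq = {"a": 0.45, "b": 0.40, "c": 0.10, "d": 0.05}
var_len = {"a": 1, "b": 2, "c": 3, "d": 3}   # lengths of 0, 10, 110, 111

n = 100_000
fixed_bits = 2 * n                            # every letter costs 2 bits
var_bits = sum(freq[x] * n * var_len[x] for x in freq)
abl = var_bits / n                            # average bits per letter
```

The fixed code uses 200,000 bits while the variable code uses 170,000, i.e. 1.7 bits/letter.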
Formal Problem Statement
Input: an alphabet A, and frequencies 𝓕 of the letters in A.
Output: a prefix-free encoding Ɣ, i.e., a mapping A -> {0,1}*, that minimizes the average bits per letter.
Prefix-free Encodings ↔ Binary Trees
We can represent each prefix-free code Ɣ as a binary tree T as follows:

Code 1: a => 0, b => 10, c => 110, d => 111
[Figure: the root's 0-child is the leaf a; its 1-child has the leaf b as 0-child; one level deeper, c and d are the 0- and 1-children.]
Encoding of letter x = path from the root to the leaf labeled x.

Code 2: a => 00, b => 01, c => 10, d => 11
[Figure: the complete binary tree of depth 2 with leaves a, b, c, d.]
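The correspondence can be sketched in Python: given a binary tree with letters at the leaves, the code for x is the 0/1 path from the root to x. Representing internal nodes as (left, right) tuples and leaves as strings is an assumed encoding for illustration; the tree below is Code 1.

```python
def codes_from_tree(tree, prefix=""):
    if isinstance(tree, str):            # leaf: the path so far is its code
        return {tree: prefix}
    left, right = tree
    out = codes_from_tree(left, prefix + "0")    # 0-branch
    out.update(codes_from_tree(right, prefix + "1"))  # 1-branch
    return out

code1_tree = ("a", ("b", ("c", "d")))    # Code 1: a at depth 1, b at 2, c/d at 3
```

Walking code1_tree recovers exactly the table from the slide: a => 0, b => 10, c => 110, d => 111.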
Reverse is Also True
Each labeled binary tree T corresponds to a prefix-free code for an alphabet A, where |A| = # leaves in T.
[Figure: a tree with 5 leaves a, b, c, d, e, giving the code a => 01, b => 10, c => 000, d => 001, e => 11.]
Why is this code prefix-free?
Reverse is Also True
Claim: Each labeled binary tree T corresponds to a prefix-free code for an alphabet A, where |A| = # leaves in T.
Proof: Take the path P ∈ {0,1}* from the root to leaf x as x's encoding. Since each letter x is at a leaf, the path from the root to x is a dead end and cannot be part of a path to another letter y.
Number of Bits for Letter x?
Let A be an alphabet, and T a binary tree whose leaves are the letters of A.
Question: What's the number of bits for each letter x in the encoding corresponding to T?
[Figure: the Code 1 tree with leaves a, b, c, d.]
Answer: depthT(x)
Formal Problem Statement Restated
Input: an alphabet A, and frequencies 𝓕 of the letters in A.
Output: a binary tree T, where the letters of A are the leaves of T, that has the minimum average bit length (ABL):
ABL(T) = Σ_{x ∈ A} f(x) · depthT(x)
Observation 1 About Optimal T
Claim: The optimal binary tree T is full, i.e., each non-leaf vertex u has exactly 2 children.
[Figure: a tree T containing an internal vertex with only one child, next to the tree T` obtained by splicing that vertex out.]
Why? Exchange argument: we can replace u with its only child and decrease the depths of some leaves, giving a better tree T`.
First Algorithm: Shannon-Fano Codes
From 1948. A top-down, divide-and-conquer type approach:
1. Divide the alphabet into A0 and A1 s.t. the total frequencies of the letters in A0 and in A1 are each roughly 50%.
2. Find an encoding Ɣ0 for A0, and Ɣ1 for A1.
3. Prepend 0 to the encodings of Ɣ0 and 1 to those of Ɣ1.
First Algorithm: Shannon-Fano Codes
Ex: A = {a, b, c, d}, frequencies: a: 45%, b: 40%, c: 10%, d: 5%.
A0 = {a, d}, A1 = {b, c} (45% + 5% vs. 40% + 10%).
[Figure: the resulting tree, with a and d under the 0-branch and b and c under the 1-branch.]
This is a fixed-length encoding, which we saw was suboptimal!
Observation 2 About Optimal T
Claim: In any optimal tree T, if leaf x has depth i and leaf y has depth j with i < j, then f(x) ≥ f(y).
Why? Exchange argument: swap x and y and get a better tree T`.
Observation 2 About Optimal T
Ex: A = {a, b, c, d}, frequencies: a: 45%, b: 40%, c: 10%, d: 5%.
[Figure: a tree T giving c => 0, b => 10, a => 110, d => 111, and the tree T` obtained by swapping a and c: a => 0, b => 10, c => 110, d => 111.]
T => 2.4 bits/letter
T` => 1.7 bits/letter
Corollary
In any optimal tree T, the two lowest-frequency letters are both in the lowest level of the tree!
Huffman's Key Insight
Observation 1 => optimal Ts are full => each leaf has a sibling.
Corollary => the 2 lowest-frequency letters x, y are at the lowest level.
Swapping letters within the same level does not change the cost of T.
Therefore, there is an optimal tree T in which the two lowest-frequency letters are siblings (in the lowest level of the tree).
[Figure: the Code 1 tree, where the depth-3 leaves c and d are siblings.]
Possible Greedy Algorithm
Possible greedy algorithm:
1. Let x, y be the two lowest-frequency letters; since they can be made siblings, treat them as a single meta-letter xy.
2. Find an optimal tree T* for A - {x, y} + {xy}.
3. Expand xy back into x and y in T*.
Possible Greedy Algorithm (Example)
Ex: A = {x, y, z, t}, and let x, y be the two lowest-frequency letters. Let A` = {xy, z, t}.
[Figure: T* for A` has leaves xy, z, t; expanding the leaf xy into an internal node with children x and y gives T.]
The Weight of the Meta-letter?
Q: What weight should be attached to the meta-letter xy?
A: f(x) + f(y)

Huffman's Algorithm (1951)

procedure Huffman(A, 𝓕):
    if |A| = 2:
        return T where branches 0, 1 point to A[0] and A[1], respectively
    let x, y be the two lowest-frequency letters
    let A` = A - {x, y} + {xy}
    let 𝓕` = 𝓕 - {x, y} + {xy: f(x) + f(y)}
    T* = Huffman(A`, 𝓕`)
    expand x, y in T* to get T
    return T
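Huffman's recursion can be sketched iteratively in Python: a min-heap of (frequency, tree) pairs replaces the "find the two lowest-frequency letters" step, repeatedly merging the two smallest trees into a meta-letter of weight f(x) + f(y). This heap formulation also gives the O(|A|log(|A|)) runtime asked about in the exercise at the end. A counter breaks frequency ties so heap comparisons never reach the tree objects; all names here are illustrative choices.

```python
import heapq
from itertools import count

def huffman(freqs):
    """freqs: dict letter -> frequency. Returns dict letter -> code."""
    tie = count()
    heap = [(f, next(tie), ch) for ch, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        fx, _, x = heapq.heappop(heap)   # two lowest-frequency subtrees
        fy, _, y = heapq.heappop(heap)
        # merge them into a meta-letter of weight f(x) + f(y)
        heapq.heappush(heap, (fx + fy, next(tie), (x, y)))
    _, _, tree = heap[0]

    codes = {}
    def walk(node, prefix):              # expand the tree back into codes
        if isinstance(node, tuple):
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            codes[node] = prefix
    walk(tree, "")
    return codes
```

On the running example (a: 45, b: 40, c: 10, d: 5) the code lengths come out as 1, 2, 3, 3 bits, matching the 1.7 bits/letter optimum seen earlier.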
Huffman's Algorithm Correctness (1)
By induction on |A|.
Base case: |A| = 2 => return the simple full tree with 2 leaves, which is optimal.
IH: Assume the claim is true for all alphabets of size k-1.
Inductive step: on an alphabet of size k, Huffman gets an optimal tree Tk-1opt for A` (which contains the meta-letter xy) and expands xy.
Huffman's Algorithm Correctness (2)
[Figure: Tk-1opt with leaf xy, and T obtained by expanding xy into siblings x, y one level deeper.]
In Tk-1opt, the leaf xy contributes f(xy)*depth(xy) = (f(x) + f(y))*depth(xy) to the ABL.
After expanding, x and y together contribute (f(x) + f(y))*(depth(xy) + 1).
Total difference: ABL(T) - ABL(Tk-1opt) = f(x) + f(y).
Huffman's Algorithm Correctness (3)
Take any optimal Z; we'll argue ABL(T) ≤ ABL(Z).
By the corollary, we can assume that in Z, x and y are also siblings at the lowest level.
Consider Z` obtained by merging them => Z` is a valid prefix-code tree for A`, which has size k-1.
ABL(Z) = ABL(Z`) + f(x) + f(y)
ABL(T) = ABL(T*) + f(x) + f(y)
By the IH: ABL(T*) ≤ ABL(Z`) => ABL(T) ≤ ABL(Z).
Q.E.D.
Huffman's Algorithm Runtime
Exercise: Make Huffman run in O(|A|log(|A|)) time.