Huffman Codes
Bahareh Sarrafzadeh
Fall 2009
Overview

• What are Huffman Codes?
• How can they be helpful?
• Fixed-Length Codes vs. Variable-Length Codes
• Encoding vs. Decoding
• Prefix Codes
• How to Construct the Huffman Code?
• Greedy Works!
• Problem Definition
• Proof of Correctness
• Huffman's Code and Entropy
Huffman Codes - Intro

• A very efficient technique for data compression
• Savings of 20% to 90%
• Proposed by David Huffman, 1952
• A greedy algorithm which yields an optimal encoding for characters based on their frequencies
Fixed-length vs. Variable-length

  Symbol                     a    b    c    d    e     f
  Frequency (in thousands)   45   13   12   16   9     5
  Fixed-length codeword      000  001  010  011  100   101
  Variable-length codeword   0    101  100  111  1101  1100
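To see where the savings come from, here is a quick sketch (assuming, as in the classic CLRS version of this example, a file of 100,000 characters with the frequencies above) comparing the total size under each encoding:

    # Total encoded size for the example table, assuming a 100,000-character
    # file with the given frequencies (counted in thousands).
    freq = {"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}   # thousands
    fixed_len = {c: 3 for c in freq}                              # all codewords 3 bits
    var_len = {"a": 1, "b": 3, "c": 3, "d": 3, "e": 4, "f": 4}    # codeword lengths

    fixed_bits = sum(freq[c] * fixed_len[c] for c in freq)   # 300 (thousand bits)
    var_bits = sum(freq[c] * var_len[c] for c in freq)       # 224 (thousand bits)
    print(fixed_bits, var_bits)   # 300 224 -> roughly a 25% saving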
An Example

[Figure: the two code trees for the table above, with edges labeled 0 (left) and 1 (right) and each internal node labeled with the sum of its leaves' frequencies. Left, the fixed-length code: root 100 with internal nodes 86, 14, 58, 28 over leaves a:45, b:13, c:12, d:16, e:9, f:5. Right, the optimal variable-length code: root 100 with a:45 on one side and internal nodes 55, 25, 30, 14 over the remaining leaves.]
Encoding vs. Decoding

  A prefix code:        A non-prefix code:
  E  0                  E  0
  T  11                 T  10
  N  100                N  100
  I  1010               I  0111
  S  1011               S  1010

With the first code, every encoded bit string decodes in exactly one way. With the second code, T = 10 is a prefix of both N = 100 and S = 1010, so a string such as 100100 is ambiguous: it can be read as N N, or as T E T E, among other parsings.
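The difference is easy to test mechanically. Here is a small sketch (the helper name is_prefix_free is mine) that compares every pair of codewords; it rejects the second code above precisely because T = 10 is a prefix of N = 100 and of S = 1010:

    def is_prefix_free(code):
        """True iff no codeword is a prefix of another codeword."""
        words = list(code.values())
        return not any(u != v and v.startswith(u) for u in words for v in words)

    prefix_code = {"E": "0", "T": "11", "N": "100", "I": "1010", "S": "1011"}
    bad_code    = {"E": "0", "T": "10", "N": "100", "I": "0111", "S": "1010"}
    print(is_prefix_free(prefix_code))  # True  -> every string decodes one way
    print(is_prefix_free(bad_code))     # False -> e.g. 100100 is ambiguous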
Prefix Codes

• A symbol code is called a prefix code if no codeword is a prefix of any other codeword.
• With a prefix code, the decoder can identify the end of a codeword as soon as its last bit arrives.
• Prefix codes simplify decoding.
A Convenient Data Structure

• The decoding process needs a convenient representation for the prefix code.
• A binary tree:
  – Leaves: characters
  – Paths: codewords
• It is not a BST!

[Figure: a binary code tree with leaves A through F; each left edge is labeled 0 and each right edge 1, so the root-to-leaf path spells a character's codeword.]
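A minimal decoding sketch over such a tree (the tuple layout is my own: a leaf is a one-character string, an internal node is a (left, right) pair):

    # Walk the code tree bit by bit; emit a character at each leaf.
    def decode(root, bits):
        out, node = [], root
        for bit in bits:
            node = node[int(bit)]         # 0 -> left child, 1 -> right child
            if isinstance(node, str):     # reached a leaf: emit its character
                out.append(node)
                node = root               # restart at the root
        return "".join(out)

    # Tree for the prefix code E=0, T=11, N=100, I=1010, S=1011:
    tree = ("E", (("N", ("I", "S")), "T"))
    print(decode(tree, "110100"))  # TEN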
Optimal Code

• An optimal code is always represented by a full binary tree: every internal node has exactly two children.
• If C is the alphabet and all character frequencies are positive, then the tree for an optimal prefix code has exactly |C| leaves and |C| − 1 internal nodes.
Cost of a Tree

• Given a tree T, the number of bits required to encode the file is

$$\mathrm{Cost}(T) = \sum_{c \in C} f(c)\, d_T(c)$$

where C is the alphabet, f(c) is the frequency of character c, and d_T(c) is the depth of c's leaf in T.
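A sketch of this cost computation, reusing the tuple-based tree shape from the decoding sketch above (the tree literal is the optimal tree from the earlier example):

    # Cost(T) = sum over characters c of f(c) * depth of c's leaf in T.
    def leaf_depths(node, depth=0, out=None):
        out = {} if out is None else out
        if isinstance(node, str):                  # leaf: record its depth
            out[node] = depth
        else:                                      # internal node: recurse
            for child in node:
                leaf_depths(child, depth + 1, out)
        return out

    freq = {"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}
    tree = ("a", (("c", "b"), (("f", "e"), "d")))  # optimal tree from the example
    print(sum(freq[c] * d for c, d in leaf_depths(tree).items()))  # 224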
Greedy Algorithm - Overview
1. Take the two least probable symbols in the alphabet.
2. Combine these two symbols into a single symbol, and repeat (see the sketch below).
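A sketch of the merge order this produces on the example frequencies, using Python's heapq as the priority queue:

    import heapq

    # Print the sequence of greedy merges for the example alphabet.
    freq = {"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}
    heap = [(f, name) for name, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, n1 = heapq.heappop(heap)   # least probable symbol
        f2, n2 = heapq.heappop(heap)   # second least probable
        print(f"merge {n1}:{f1} + {n2}:{f2} -> {f1 + f2}")
        heapq.heappush(heap, (f1 + f2, f"({n1}+{n2})"))
    # merges: f+e=14, c+b=25, 14+d=30, 25+30=55, a+55=100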
An Example

[Figure: the greedy construction on the example alphabet, shown in five stages. Stage 1: merge e:9 and f:5 into a node of weight 14. Stage 2: merge c:12 and b:13 into 25. Stage 3: merge 14 and d:16 into 30. Stage 4: merge 25 and 30 into 55. Stage 5: merge a:45 and 55 into the root, 100. Edges are labeled 0 (left) and 1 (right).]
Specifications

• Preconditions: We have a set of characters C (i.e., an alphabet) and can derive a frequency table for them.
• Postconditions: We have a full binary tree corresponding to the list of prefix codes assigned to the characters in C, such that the total cost is minimum.
• Greedy Choice: Take the two nodes in the tree with the least frequencies and merge them.
  – An adaptive decision
Specifications – Cont.

• Loop Invariant: We have built a binary tree that is consistent with an optimal solution.
• Establishing the LI (⟨Pre⟩ ⇒ LI): Initially we haven't made any choice, so our current solution is trivially consistent with the optimal solution.
• Maintaining the LI: LI + code ⇒ LI′
Maintaining the LI

[Figure: the exchange argument relating the algorithm's choice to an optimal solution ("instructions for the Fairy Godmother"). In the optimal tree T, a and b are siblings of maximum depth, while the two least-frequent characters x and y may sit elsewhere. Tree T′ is T with a and x exchanged; tree T″ is T′ with b and y exchanged, making x and y siblings of maximum depth.]
Proof of Correctness:

1. Validity
– We have a tree!

2. Consistency
– Let x and y be the two least-frequent characters, with f[x] ≤ f[y], and let a and b be siblings of maximum depth in the optimal tree T, with f[a] ≤ f[b]. Then f[x] ≤ f[a] and f[y] ≤ f[b]. Exchanging a with x in T gives T′; exchanging b with y in T′ gives T″.

• We need to prove Cost(T′) is not more than Cost(T):

$$
\begin{aligned}
\mathrm{Cost}(T) - \mathrm{Cost}(T')
&= \sum_{c \in C} f(c)\, d_T(c) - \sum_{c \in C} f(c)\, d_{T'}(c) \\
&= f[x]\, d_T(x) + f[a]\, d_T(a) - f[x]\, d_{T'}(x) - f[a]\, d_{T'}(a) \\
&= f[x]\, d_T(x) + f[a]\, d_T(a) - f[x]\, d_T(a) - f[a]\, d_T(x) \\
&= (f[a] - f[x])\,(d_T(a) - d_T(x)) \\
&\ge 0,
\end{aligned}
$$

since f[a] ≥ f[x], and a is at maximum depth so d_T(a) ≥ d_T(x). A symmetric argument gives Cost(T″) ≤ Cost(T′), so T″ is optimal as well, and in T″ the greedy choice (x and y as siblings) is consistent with an optimal solution.

3. Optimality
– When the loop ends, the single remaining tree is consistent with an optimal solution, and is therefore itself an optimal prefix-code tree.
Running Time

Huffman(C)
  n ← |C|
  Q ← C                               ▹ build Q as a binary min-heap: O(n)
  for i ← 1 to n − 1
    do allocate a new node z
       left[z] ← x ← Extract-Min(Q)   ▹ O(lg n)
       right[z] ← y ← Extract-Min(Q)  ▹ O(lg n)
       f[z] ← f[x] + f[y]
       Insert(Q, z)                   ▹ O(lg n)
  return Extract-Min(Q)

Running time: O(n lg n)
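A runnable Python sketch of the same procedure (heapq plays the role of the binary min-heap Q; the counter field only breaks frequency ties so heap tuples stay comparable), extended to read the codewords back off the finished tree:

    import heapq
    from itertools import count

    def huffman(freq):
        """Build a Huffman code {symbol: codeword} from {symbol: frequency}."""
        tiebreak = count()  # keeps heap entries comparable on equal frequencies
        heap = [(f, next(tiebreak), sym) for sym, f in freq.items()]
        heapq.heapify(heap)                     # Q <- C: O(n)
        while len(heap) > 1:                    # n - 1 merges
            f1, _, left = heapq.heappop(heap)   # x <- Extract-Min(Q): O(lg n)
            f2, _, right = heapq.heappop(heap)  # y <- Extract-Min(Q): O(lg n)
            heapq.heappush(heap, (f1 + f2, next(tiebreak), (left, right)))
        codes = {}
        def walk(node, path):                   # read codewords off the tree
            if isinstance(node, tuple):
                walk(node[0], path + "0")
                walk(node[1], path + "1")
            else:
                codes[node] = path or "0"       # lone-symbol edge case
        walk(heap[0][2], "")
        return codes

    # Codeword lengths match the example: a:1, b,c,d:3, e,f:4.
    print(huffman({"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}))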
Conclusion

• Huffman Coding
  – Introduction and Application
  – Greedy Algorithm
  – Proof of Correctness
• Entropy
Huffman's Code and Entropy

• As defined by Shannon, the information content h (in bits) of a symbol c_i with non-null probability p_i is

$$h(c_i) = \log_2 \frac{1}{p_i}$$

• The entropy H (in bits) is the weighted sum, across all symbols c_i with non-zero probability p_i, of the information content of each symbol:

$$H(C) = \sum_i p_i \, h(c_i)$$
Input (C, f):

  Symbol (c_i)                  a      b      c      d      e      Sum
  Probability (p_i)             0.10   0.15   0.30   0.16   0.29   = 1

Huffman code:

  Codeword (cw_i)               000    001    10     01     11
  Codeword length l_i (bits)    3      3      2      2      2
  Cost (l_i p_i)                0.30   0.45   0.60   0.32   0.58   L(C) = 2.25

Optimality:

  Probability budget (2^(−l_i))     1/8    1/8    1/4    1/4    1/4    = 1.00
  Information content (−log2 p_i)   3.32   2.74   1.74   2.64   1.79
  Entropy (−p_i log2 p_i)           0.332  0.411  0.521  0.423  0.518  H(C) = 2.205

where

$$L(C) = \sum_i p(c_i)\,\mathrm{length}(\mathrm{code}(c_i)), \qquad H(C) = -\sum_i p(c_i)\,\log_2 p(c_i)$$
• Huffman coding reaches the entropy limit when all probabilities are negative powers of 2 (i.e., 1/2, 1/4, 1/8, 1/16, etc.).
• In general, H ≤ Code Length < H + 1.
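A sketch checking the table's numbers and the bound (the printed values match L(C) = 2.25 and H(C) ≈ 2.205 above):

    from math import log2

    # Entropy vs. Huffman average codeword length for the table's alphabet.
    p = {"a": 0.10, "b": 0.15, "c": 0.30, "d": 0.16, "e": 0.29}
    length = {"a": 3, "b": 3, "c": 2, "d": 2, "e": 2}

    H = sum(-pi * log2(pi) for pi in p.values())   # entropy: ~2.205 bits/symbol
    L = sum(p[s] * length[s] for s in p)           # average length: 2.25 bits
    print(f"H = {H:.3f}  L = {L:.2f}")             # H <= L < H + 1 holds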