
Data Structures

Huffman Codes


Huffman Codes

• An optimal static technique for data compression using binary character codes

– Savings of 20% to 90% are typical, depending on the characteristics of the data

• Binary codes

– Fixed-length

– Variable-length


Example

• Suppose we have a 100,000-character data file over six characters whose frequencies (in thousands) are 45, 13, 12, 16, 9, and 5

• Fixed-length code

– 3 bits per character, so 300,000 bits to code the entire file

• Variable-length code

– With codeword lengths 1, 3, 3, 3, 4, 4 for the six characters:
(45·1 + 13·3 + 12·3 + 16·3 + 9·4 + 5·4)·1,000 = 224,000 bits
(a short arithmetic check follows below)

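A quick arithmetic check of the two totals above, added for convenience (the six frequencies and codeword lengths are read off the sum on this slide):

  freqs    = [45_000, 13_000, 12_000, 16_000, 9_000, 5_000]   # character counts in the 100,000-character file
  var_lens = [1, 3, 3, 3, 4, 4]                                # codeword lengths in the variable-length code

  print(3 * sum(freqs))                                        # fixed-length: 300000 bits (3 bits per character)
  print(sum(f * l for f, l in zip(freqs, var_lens)))           # variable-length: 224000 bits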

Prefix Codes

• Encoding is always simple for any binary character code

– Just concatenate the codewords representing each character

• A variable-length code is not necessarily uniquely decodable

– With the code: 0 01 10 101

– There are many ways to parse 001011101

• Prefix codes

– No codeword is a prefix of any other codeword

– Decoding is unambiguous

• Example (a parsing check follows below)

– With the prefix code: 0 101 100 111 1101 1100

– 001011101 parses uniquely as 0 · 0 · 101 · 1101

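A brute-force parsing check of the two claims above; this sketch is our own addition (the string 0101 is our example of ambiguity, the 9-bit string is the one from the slide):

  def parses(bits, code, so_far=()):
      # Return every way to segment `bits` into codewords from `code`.
      if not bits:
          return [so_far]
      result = []
      for w in code:
          if bits.startswith(w):
              result.extend(parses(bits[len(w):], code, so_far + (w,)))
      return result

  not_prefix  = ["0", "01", "10", "101"]                      # 0 is a prefix of 01; 10 is a prefix of 101
  prefix_code = ["0", "101", "100", "111", "1101", "1100"]

  print(parses("0101", not_prefix))        # [('0', '101'), ('01', '01')] -- two different decodings
  print(parses("001011101", prefix_code))  # [('0', '0', '101', '1101')] -- exactly one decoding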

Representation of Prefix Codes

• A binary tree whose leaves are the given characters

– A left edge is read as 0 and a right edge as 1, so the codeword of a character is the sequence of edge labels on the path from the root to its leaf

• Node x also stores the total frequency of the characters in the sub-tree rooted at x

• Decoding 001011101 (a tree-walking sketch follows below):

– With the fixed-length code tree: 001 · 011 · 101

– With the variable-length prefix code tree: 0 · 0 · 101 · 1101

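A minimal tree-walking decoder, added as an illustration; the trie-building helper and the character names are our own assumptions, while the codewords are the ones from the example above:

  def build_tree(code):
      # Build the binary trie of a prefix code given as {character: codeword}.
      root = {}
      for ch, word in code.items():
          node = root
          for bit in word:
              node = node.setdefault(bit, {})
          node["char"] = ch
      return root

  def decode(bits, root):
      # Walk edge labels from the root; emit a character at each leaf, then restart at the root.
      out, node = [], root
      for bit in bits:
          node = node[bit]
          if "char" in node:
              out.append(node["char"])
              node = root
      return out

  code = {"a": "0", "b": "101", "c": "100", "d": "111", "e": "1101", "f": "1100"}  # hypothetical character names
  print(decode("001011101", build_tree(code)))   # ['a', 'a', 'b', 'e']  <- codewords 0, 0, 101, 1101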

Cost of a Binary Code Tree

• f(c) is the frequency of c in the file

• dT(c) is the depth of c's leaf in the tree (= the length of c's codeword)

• The cost of the tree, i.e., the number of bits needed to encode the file, is

B(T) = Σc∈C f(c) · dT(c)

(a small sketch that evaluates B(T) follows below)

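A small sketch that evaluates B(T) from a codeword table, using the fact that dT(c) equals the length of c's codeword; the frequency and code tables below are the ones implied by the earlier example (character names are hypothetical):

  def cost(freq, code):
      # B(T) = sum of frequency(c) * depth of c's leaf (= codeword length).
      return sum(freq[c] * len(code[c]) for c in freq)

  freq = {"a": 45_000, "b": 13_000, "c": 12_000, "d": 16_000, "e": 9_000, "f": 5_000}
  code = {"a": "0", "b": "101", "c": "100", "d": "111", "e": "1101", "f": "1100"}
  print(cost(freq, code))   # 224000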

Constructing a Huffman Code

• Input:

– A set of n characters C.

– For each character c ∈ C, the frequency f(c) of c.

• Output:

– A tree T, called the Huffman tree, corresponding to an optimal prefix code.

• General idea:

– Assign the characters that occur more frequently a shorter code (near the top of the tree).

• Method:

– Begin with a set of |C| leaves.

– Repeatedly, using a min-priority queue Q keyed on frequencies, identify the two least-frequent objects and merge them.

– Stop when all objects have been merged into a single tree.



Constructing a Huffman Code

Huffman(C)
  n = |C|
  Q = a min-priority queue holding the characters of C, keyed on freq
  for i = 1 to n − 1 do
    allocate a new node z
    z.left  = x = ExtractMin(Q)
    z.right = y = ExtractMin(Q)
    z.freq  = x.freq + y.freq
    Insert(Q, z)
  return ExtractMin(Q)    // the root of the Huffman tree

• Running time O(n log n), where n = |C|: each of the n − 1 iterations performs a constant number of heap operations, each costing O(log n).

• The Huffman code is not unique; ties between equal frequencies may be broken arbitrarily (a runnable Python sketch follows below).

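A runnable Python sketch of the same greedy construction, using the standard-library heapq module as the min-priority queue; the nested-tuple node representation and the codeword-extraction walk are our own additions, not part of the slides:

  import heapq
  from itertools import count

  def huffman(freq):
      # freq: {character: frequency}; returns one optimal {character: codeword} table.
      order = count()                                   # tie-breaker so the heap never compares tree nodes
      heap = [(f, next(order), c) for c, f in freq.items()]
      heapq.heapify(heap)
      for _ in range(len(freq) - 1):                    # n - 1 merges, as in the pseudocode
          fx, _, x = heapq.heappop(heap)                # the two least-frequent subtrees
          fy, _, y = heapq.heappop(heap)
          heapq.heappush(heap, (fx + fy, next(order), (x, y)))
      root = heap[0][2]

      code = {}
      def walk(node, word):
          if isinstance(node, tuple):                   # internal node: 0 goes left, 1 goes right
              walk(node[0], word + "0")
              walk(node[1], word + "1")
          else:
              code[node] = word or "0"                  # single-character alphabet edge case
      walk(root, "")
      return code

  print(huffman({"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}))
  # one optimal code with codeword lengths 1, 3, 3, 3, 4, 4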

Correctness

• Huffman codes are prefix codes

– Every codeword corresponds to a leaf of the tree, so no codeword can be the prefix of another (a proper prefix would correspond to an internal node on the path to the longer codeword)


Correctness

• Claim:

– An optimal prefix code tree is a full tree: every internal node has exactly two children.

– Otherwise, an internal node with a single child could be spliced out, shortening every codeword below it and reducing the cost.

– (The Huffman tree is a full tree by construction, since every merge creates a node with exactly two children.)


Correctness

• Lemma:

– There is an optimal prefix code tree in which the two symbols with smallest frequencies are siblings (in the last level).

• Proof (the exchange cost is computed below):

– An optimal tree is a full binary tree, so a deepest leaf and its sibling are both leaves in the last level.

– If necessary, interchange the two symbols with smallest frequencies with two sibling symbols in the last level of the tree; the exchange cannot increase the cost, so the resulting tree is still optimal.

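The cost calculation behind the exchange step, written in the notation of the cost slide (this derivation is our addition; x is a symbol with smallest frequency and b a leaf at maximal depth):

B(T') − B(T) = f(x)·dT(b) + f(b)·dT(x) − f(x)·dT(x) − f(b)·dT(b)
             = (f(b) − f(x)) · (dT(x) − dT(b)) ≤ 0,

since f(x) ≤ f(b) and dT(x) ≤ dT(b). The same exchange for the second-smallest symbol y gives the lemma; because the original tree was optimal, the cost in fact stays equal.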

Correctness

• Theorem:

– The Huffman code is an optimal prefix code.

• Proof idea (by induction): if the Huffman tree built for the reduced alphabet (with the two least-frequent symbols merged into one) is optimal, then the tree obtained by expanding the merged leaf back into its two children is an optimal Huffman tree as well.


Proof of the Theorem

• By induction on the size of the alphabet.

• For |C| = 2 the result is trivially true.

• Assume by induction that the result holds whenever |C| < k; let |C| = k and let T be a Huffman tree of C.

• For the sake of contradiction, assume that T is not optimal, i.e., there is a tree S such that B(S) < B(T).

• By the lemma, we can assume that the two symbols x and y with smallest frequencies are siblings in S.

• By the algorithm, since x and y have minimal frequencies, they are siblings in T.

• Let T' and S' be the trees obtained from T and S, respectively, by removing these two siblings and replacing their parent by a leaf with frequency x.freq + y.freq.

• T' is a Huffman tree of a smaller alphabet than C, so by induction it is an optimal prefix code tree.

• Removing the two leaves decreases the cost by exactly x.freq + y.freq (each of x and y sits one level below the new leaf), hence B(S') + x.freq + y.freq = B(S) < B(T) = B(T') + x.freq + y.freq.

• It follows that B(S') < B(T'), contradicting the optimality of T'.

• Hence, T is optimal.
