
Huffman Codes

• Message consisting of five characters: a, b, c, d, e
• Probabilities: .12, .4, .15, .08, .25
• Encode each character into a sequence of 0's and 1's so that no code for a character is the prefix of the code for any other character (see the check below)
  – Prefix property
  – Can decode a string of 0's and 1's by repeatedly deleting prefixes of the string that are codes for characters
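
A quick way to check the prefix property is to compare every pair of codewords. A minimal sketch (the helper name has_prefix_property is illustrative, not from the slides):

```python
def has_prefix_property(code):
    """Return True if no codeword is the prefix of another codeword."""
    words = list(code.values())
    return not any(i != j and w.startswith(v)
                   for i, w in enumerate(words)
                   for j, v in enumerate(words))

# Code 1 from the next slide: every codeword is 3 bits, so none can prefix another.
print(has_prefix_property({"a": "000", "b": "001", "c": "010", "d": "011", "e": "100"}))  # True
# Not a prefix code: "1" is a prefix of "10".
print(has_prefix_property({"x": "1", "y": "10"}))  # False
```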

Example

• Both codes have the prefix property
• Decode Code 1: "grab" 3 bits at a time and translate each group into a character (see the sketch after the table)
• Ex.: 001010011 → bcd

Symbol   Probability   Code 1   Code 2
a        .12           000      000
b        .40           001      11
c        .15           010      01
d        .08           011      001
e        .25           100      10
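
Because every Code 1 codeword is exactly 3 bits long, decoding is just slicing. A minimal sketch (the function name decode_fixed is illustrative):

```python
CODE1 = {"a": "000", "b": "001", "c": "010", "d": "011", "e": "100"}
DECODE1 = {bits: sym for sym, bits in CODE1.items()}

def decode_fixed(bits, width=3):
    """Decode a fixed-width code by slicing the input into equal-sized groups."""
    return "".join(DECODE1[bits[i:i + width]] for i in range(0, len(bits), width))

print(decode_fixed("001010011"))  # bcd
```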

Example Cont’d

• Decode Code 2: repeatedly "grab" prefixes that are codes for characters and remove them from the input (see the sketch after the table)
• Only difference: the input cannot be "sliced" up all at once – how many bits to grab depends on the encoded character
• Ex.: 1101001 → bcd

Symbol   Probability   Code 1   Code 2
a        .12           000      000
b        .40           001      11
c        .15           010      01
d        .08           011      001
e        .25           100      10
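
Decoding a variable-length prefix code follows exactly the rule stated above: keep reading bits until the buffer matches a codeword, emit that character, and start over. A minimal sketch, using the Code 2 assignments from the table:

```python
CODE2 = {"a": "000", "b": "11", "c": "01", "d": "001", "e": "10"}
DECODE2 = {bits: sym for sym, bits in CODE2.items()}

def decode_prefix(bits):
    """Repeatedly strip a leading codeword; valid for any code with the prefix property."""
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in DECODE2:          # prefix property: at most one codeword can match here
            out.append(DECODE2[buf])
            buf = ""
    if buf:
        raise ValueError("input ends in the middle of a codeword")
    return "".join(out)

print(decode_prefix("1101001"))  # bcd
```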

Big Deal?

• Huffman coding results in a shorter average length of the compressed (encoded) message
• Code 1 has an average length of 3
  – multiply the length of the code for each symbol by the probability of occurrence of that symbol, then sum
• Code 2 has an average length of 2.2
  – (3*.12) + (2*.40) + (2*.15) + (3*.08) + (2*.25) (computed in the sketch below)
• Can we do better?
• Problem: given a set of characters and their probabilities, find a code with the prefix property such that the average length of a code for a character is minimum
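
The average length is simply the probability-weighted sum of codeword lengths. A minimal sketch (the helper name average_length is illustrative):

```python
PROBS = {"a": 0.12, "b": 0.40, "c": 0.15, "d": 0.08, "e": 0.25}
CODE1 = {"a": "000", "b": "001", "c": "010", "d": "011", "e": "100"}
CODE2 = {"a": "000", "b": "11", "c": "01", "d": "001", "e": "10"}

def average_length(code, probs):
    """Expected number of bits per symbol under the given probabilities."""
    return sum(len(code[s]) * p for s, p in probs.items())

print(average_length(CODE1, PROBS))  # ≈ 3.0
print(average_length(CODE2, PROBS))  # ≈ 2.2
```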

Representation

• Label the leaves of the tree by the characters they represent
• Think of prefix codes as paths in binary trees (see the sketch below)
  – Following a path from a node to its left child corresponds to appending a 0 to the code, and proceeding from a node to its right child to appending a 1
• Can represent any prefix code as a binary tree
• The prefix property guarantees that no character's code corresponds to an interior node
• Conversely, labeling the leaves of a binary tree with characters gives us a code with the prefix property
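
The correspondence can be made concrete by inserting each codeword into a binary tree, branching left on 0 and right on 1. A minimal sketch (class and function names are illustrative):

```python
class Node:
    def __init__(self):
        self.symbol = None   # set only at leaves
        self.child = {}      # "0" -> left subtree, "1" -> right subtree

def tree_from_code(code):
    """Build a binary tree in which each codeword is the path to a labeled leaf."""
    root = Node()
    for symbol, word in code.items():
        node = root
        for bit in word:
            node = node.child.setdefault(bit, Node())
        node.symbol = symbol   # prefix property => this node is a leaf
    return root

root = tree_from_code({"a": "000", "b": "001", "c": "010", "d": "011", "e": "100"})
print(root.child["0"].child["0"].child["0"].symbol)  # a
```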

Sample Binary Trees

[Figure: the binary trees for Code 1 (left) and Code 2 (right). Edges to a left child are labeled 0, edges to a right child 1, and each character labels the leaf reached by following its codeword.]

Huffman's Algorithm

• Select the two characters a and b having the lowest probabilities and replace them with a single (imaginary) character, say x
  – x's probability of occurrence is the sum of the probabilities for a and b
• Now find an optimal prefix code for this smaller set of characters, using the above procedure recursively
  – The code for the original character set is obtained by using the code for x with a 0 appended for a and a 1 appended for b (see the sketch below)
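
In practice the recursion is usually implemented with a priority queue: repeatedly remove the two lowest-probability trees, merge them, and reinsert the result. A minimal sketch of that standard construction (my own illustration, not the textbook's code):

```python
import heapq

def huffman_code(probs):
    """Build a prefix code by repeatedly merging the two lowest-probability trees."""
    # Heap entries are (probability, tie-breaker, tree); a tree is either a
    # single symbol or a (left, right) pair of subtrees.
    heap = [(p, i, sym) for i, (sym, p) in enumerate(sorted(probs.items()))]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p0, _, left = heapq.heappop(heap)    # lowest probability
        p1, _, right = heapq.heappop(heap)   # second lowest
        heapq.heappush(heap, (p0 + p1, counter, (left, right)))
        counter += 1
    codes = {}
    def assign(tree, prefix):
        # Left branch appends 0, right branch appends 1, as on the Representation slide.
        if isinstance(tree, tuple):
            assign(tree[0], prefix + "0")
            assign(tree[1], prefix + "1")
        else:
            codes[tree] = prefix or "0"      # lone-symbol edge case
    assign(heap[0][2], "")
    return codes

print(huffman_code({"a": 0.12, "b": 0.40, "c": 0.15, "d": 0.08, "e": 0.25}))
# {'b': '0', 'e': '10', 'c': '110', 'd': '1110', 'a': '1111'} -- the codes on the Final Tree slide
```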

Steps in the Construction of a Huffman Tree

• Sort input characters by frequency

[Figure: the five characters as single-node trees in increasing order of probability: d (.08), a (.12), c (.15), e (.25), b (.40).]

Merge a and d

[Figure: d (.08) and a (.12) merged under a new node with probability .20; the forest now has roots {d, a} (.20), c (.15), e (.25), b (.40).]

Merge a, d with c

[Figure: the {d, a} tree (.20) merged with c (.15) into a tree with probability .35; the forest now has roots {c, d, a} (.35), e (.25), b (.40).]

Merge a, c, d with e

[Figure: the {c, d, a} tree (.35) merged with e (.25) into a tree with probability .60; the forest now has roots {e, c, d, a} (.60) and b (.40).]

Final Tree

[Figure: the final Huffman tree with root probability 1.00; left edges are labeled 0 and right edges 1, b is a child of the root, and e, c, d, a sit at increasing depth in the other subtree.]

Codes:
a - 1111
b - 0
c - 110
d - 1110
e - 10

Average code length: 2.15 (checked in the sketch below)
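
The 2.15 figure is the same probability-weighted sum used earlier, applied to the Huffman codewords:

```python
probs = {"a": 0.12, "b": 0.40, "c": 0.15, "d": 0.08, "e": 0.25}
codes = {"a": "1111", "b": "0", "c": "110", "d": "1110", "e": "10"}

# 4*.12 + 1*.40 + 3*.15 + 4*.08 + 2*.25 = 0.48 + 0.40 + 0.45 + 0.32 + 0.50 = 2.15
print(sum(len(codes[s]) * p for s, p in probs.items()))  # ≈ 2.15
```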

Huffman Algorithm

• Example of a greedy algorithm
  – Combine nodes whenever possible without considering the potential drawbacks inherent in making such a move
  – I.e., at each individual stage, select the option that is "locally optimal"
  – Recall: the vertex coloring problem
• A greedy strategy does not always yield an optimal solution; however, Huffman coding is optimal
• See the textbook for the proof

Finishing Remarks

• Works well in theory, but under several restrictive assumptions:
  (1) The frequency of a letter is assumed to be independent of the context of that letter in the message
    – Not true in the English language
  (2) Huffman coding works better when there is large variation in the frequency of letters
    – Actual frequencies must match the expected ones
    – Examples:
      DEED: 8 bits (12 bits ASCII)
      FUZZ: 20 bits (12 bits ASCII)