Coding and Entropy
Squeezing out the “Air”
• Suppose you want to ship pillows in boxes and are charged by the size of the box
• Lossless data compression
• Entropy = lower limit of compressibility
Claude Shannon (1916-2001), A Mathematical Theory of Communication (1948)
Communication over a Channel
• Source → coded bits → channel → received bits → decoded message:
  S (symbols) → X (bits) → Y (bits) → T (symbols)
• Encode bits before putting them in the channel; decode bits when they come out of the channel
• E.g. the transformation from S into X changes “yea” --> 1, “nay” --> 0
• Changing Y into T does the reverse
• For now, assume no noise in the channel, i.e. X = Y
Example: Telegraphy
• Source: English letters → Morse Code
• [Diagram: the letter D is encoded as “-..” in Washington, transmitted over the wire, and decoded back to D in Baltimore]
Low and High Information Content Messages
• The more frequent a message is, the less information it conveys when it occurs
• Two weather forecast messages: Boston vs. LA
• In LA “Sunny” is a low information message and “cloudy” is a high information message
Harvard Grades
Less information in Harvard grades now than in recent past
Year   A    A-   B+   B    B-   C+   (% of grades)
2005   24   25   21   13   6    2
1995   21   23   20   14   8    3
1986   14   19   21   17   10   5
Fixed Length Codes (Block Codes)
• Example: 4 symbols, A, B, C, D: A=00, B=01, C=10, D=11
• In general, with n symbols, codes need to be of length lg n, rounded up
• For English text, 26 letters + space = 27 symbols, so length = 5 since 2^4 < 27 < 2^5 (replace all punctuation marks by space)
• AKA “block codes”
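A small Python sketch of the “lg n, rounded up” rule (the helper name is mine, not from the slides):

```python
import math

def block_code_length(num_symbols):
    # A fixed-length (block) code needs ceil(lg n) bits per symbol.
    return math.ceil(math.log2(num_symbols))

print(block_code_length(4))    # 2 bits for A, B, C, D
print(block_code_length(27))   # 5 bits for 26 letters + space
```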
Modeling the Message Source
Characteristics of the stream of messages coming from the source affect the choice of the coding method
We need a model for a source of English text that can be described and analyzed mathematically
[Diagram: Source → Destination]
How can we improve on block codes?
• Simple 4-symbol example: A, B, C, D
• If that is all we know, need 2 bits/symbol
• What if we know symbol frequencies? Use shorter codes for more frequent symbols
• Morse Code does something like this
• Example:

Symbol:  A    B    C    D
Freq:    .7   .1   .1   .1
Code:    0    100  101  110
Prefix Codes
• Only one way to decode left to right

Symbol:  A    B    C    D
Freq:    .7   .1   .1   .1
Code:    0    100  101  110
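To illustrate the “only one way to decode left to right” property, here is a minimal Python decoding sketch for the code above (function and variable names are my own):

```python
CODE = {"A": "0", "B": "100", "C": "101", "D": "110"}
DECODE = {bits: sym for sym, bits in CODE.items()}

def decode(bitstring):
    symbols, current = [], ""
    for bit in bitstring:
        current += bit
        # Because no codeword is a prefix of another, the first match
        # is the only possible one, so decoding is unambiguous.
        if current in DECODE:
            symbols.append(DECODE[current])
            current = ""
    return "".join(symbols)

print(decode("0100101110"))  # -> ABCD
```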
Minimum Average Code Length?
Average bits per symbol:

Symbol:   A    B    C    D
Freq:     .7   .1   .1   .1
Code 1:   0    100  101  110   →  .7·1 + .1·3 + .1·3 + .1·3 = 1.6
Code 2:   0    10   110  111   →  .7·1 + .1·2 + .1·3 + .1·3 = 1.5

1.5 bits/symbol (down from 2)
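The same arithmetic as a short Python sketch (names are illustrative):

```python
freqs = {"A": 0.7, "B": 0.1, "C": 0.1, "D": 0.1}
code1 = {"A": "0", "B": "100", "C": "101", "D": "110"}
code2 = {"A": "0", "B": "10", "C": "110", "D": "111"}

def average_length(code, freqs):
    # Average bits per symbol = sum of frequency * codeword length.
    return sum(p * len(code[s]) for s, p in freqs.items())

print(average_length(code1, freqs))  # ≈ 1.6
print(average_length(code2, freqs))  # ≈ 1.5
```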
Entropy of this source ≤ 1.5 bits/symbol

Symbol:  A    B    C    D
Freq:    .7   .1   .1   .1
Code:    0    10   110  111

Average length = .7·1 + .1·2 + .1·3 + .1·3 = 1.5
Possibly lower? How low?
Self-Information
• If a symbol S has frequency p, its self-information is H(S) = lg(1/p) = -lg p

S       A     B     C     D
p       .25   .25   .25   .25
H(S)    2     2     2     2

p       .7    .1    .1    .1
H(S)    .51   3.32  3.32  3.32
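A tiny Python check of the self-information values in the table (helper name is mine):

```python
import math

def self_information(p):
    # Self-information in bits of a symbol with frequency p: lg(1/p) = -lg p.
    return -math.log2(p)

print(self_information(0.25))  # 2.0
print(self_information(0.7))   # ≈ 0.51
print(self_information(0.1))   # ≈ 3.32
```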
First-Order Entropy of Source = Average Self-Information

S         A      B      C      D      -∑ p·lg p
p         .25    .25    .25    .25
-lg p     2      2      2      2
-p·lg p   .5     .5     .5     .5     2

p         .7     .1     .1     .1
-lg p     .51    3.32   3.32   3.32
-p·lg p   .360   .332   .332   .332   1.357
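A short sketch of the first-order entropy computation (function name is mine):

```python
import math

def first_order_entropy(freqs):
    # Average self-information: -sum of p * lg p over all symbols.
    return -sum(p * math.log2(p) for p in freqs)

print(first_order_entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0
print(first_order_entropy([0.7, 0.1, 0.1, 0.1]))      # ≈ 1.357
```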
Entropy, Compressibility, Redundancy
• Lower entropy: more redundant, more compressible, less information
• Higher entropy: less redundant, less compressible, more information
• A source of “yea”s and “nay”s takes 24 bits per symbol but contains at most one bit per symbol of information:
  010110010100010101000001 = yea
  010011100100000110101001 = nay
Entropy and Compression

Symbol:  A    B    C    D
Freq:    .7   .1   .1   .1
Code:    0    10   110  111

• Average length for this code = .7·1 + .1·2 + .1·3 + .1·3 = 1.5
• No code taking only symbol frequencies into account can be better than first-order entropy
• First-order entropy of this source = .7·lg(1/.7) + .1·lg(1/.1) + .1·lg(1/.1) + .1·lg(1/.1) = 1.357
• First-order entropy of English is about 4 bits/character based on “typical” English texts
• “Efficiency” of code = (entropy of source)/(average code length) = 1.357/1.5 ≈ 90%
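Putting the two numbers together, a minimal efficiency check in Python (variable names are illustrative):

```python
import math

freqs = {"A": 0.7, "B": 0.1, "C": 0.1, "D": 0.1}
code = {"A": "0", "B": "10", "C": "110", "D": "111"}

entropy = -sum(p * math.log2(p) for p in freqs.values())   # ≈ 1.357 bits/symbol
avg_len = sum(p * len(code[s]) for s, p in freqs.items())  # 1.5 bits/symbol
print(f"efficiency ≈ {entropy / avg_len:.0%}")             # ≈ 90%
```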
A Simple Prefix Code: Huffman Codes
• Suppose we know the symbol frequencies. We can calculate the (first-order) entropy. Can we design a code to match?
• There is an algorithm that transforms a set of symbol frequencies into a variable-length prefix code that achieves average code length approximately equal to the entropy.
• David Huffman, 1951
Huffman Code Example

Symbol:  A     B     C     D     E
Freq:    .35   .05   .2    .15   .25

[Tree construction, merging the two lowest-frequency nodes at each step:
 B(.05) + D(.15) → BD(.2);  BD(.2) + C(.2) → BCD(.4);
 A(.35) + E(.25) → AE(.6);  AE(.6) + BCD(.4) → ABCDE(1.0)]
Huffman Code Example (continued)

[Same tree, with 0 and 1 labels on each pair of branches:
 ABCDE(1.0): 0 → AE(.6), 1 → BCD(.4);  AE: 0 → A, 1 → E;
 BCD: 0 → BD(.2), 1 → C;  BD: 0 → B, 1 → D]

Resulting code:
A   00
B   100
C   11
D   101
E   01

Entropy: 2.12
Average length: 2.20
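The merging procedure shown in the example can be written as a short program. This is a minimal Python sketch (not the lecture's notation) that keeps a heap of partial code tables and merges the two lowest-frequency entries at each step:

```python
import heapq

def huffman_code(freqs):
    # Each heap entry: (total weight, tie-breaker, {symbol: partial codeword}).
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(freqs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)   # two least-frequent subtrees
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (w1 + w2, counter, merged))
        counter += 1
    return heap[0][2]

freqs = {"A": 0.35, "B": 0.05, "C": 0.2, "D": 0.15, "E": 0.25}
code = huffman_code(freqs)
print(code)
print(sum(p * len(code[s]) for s, p in freqs.items()))  # average length ≈ 2.2
```

The result need not be bit-for-bit identical to the code on the slide (ties can be broken either way), but the codeword lengths (A, C, E: 2 bits; B, D: 3 bits) and the 2.2 bits/symbol average come out the same.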
Efficiency of Huffman Codes
• Huffman codes are as efficient as possible if only first-order information (symbol frequencies) is taken into account.
• The average length of a Huffman code is always within 1 bit/symbol of the entropy.
Second-Order Entropy
• Second-order entropy of a source is calculated by treating digrams (adjacent pairs of symbols) as single symbols according to their frequencies
• Occurrences of q and u are not independent, so it is helpful to treat qu as one symbol
• Second-order entropy of English is about 3.3 bits/character
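A rough Python illustration of the idea, under the assumption that “second-order” here means computing the entropy of the digram distribution and dividing by two to get bits per character:

```python
import math
from collections import Counter

def digram_entropy_per_char(text):
    # Treat overlapping adjacent pairs (digrams) as single symbols,
    # compute the entropy of that distribution, and divide by 2
    # to express the result in bits per character.
    digrams = [text[i:i + 2] for i in range(len(text) - 1)]
    total = len(digrams)
    counts = Counter(digrams)
    h = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return h / 2

sample = "the quick brown fox jumps over the lazy dog"
print(digram_entropy_per_char(sample))  # toy sample; a real estimate needs a large corpus
```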
How English Would Look
Based on letter frequencies alone:
• 0: xfoml rxkhrjffjuj zlpwcfwkcyj ffjeyvkcqsghyd qpaamkbzaacibzlhjqd
• 1: ocroh hli rgwr nmielwis eu ll nbnesebya th eei alhenhttpa oobttva
• 2: On ie antsoutinys are t inctore st be s deamy achin d ilonasive tucoowe at
• 3: IN NO IST LAT WHEY CRATICT FROURE BIRS GROCID PONDENOME OF DEMONSTURES OF THE REPTAGIN IS REGOACTIONA
How English Would Look
Based on word frequencies:
• 1) REPRESENTING AND SPEEDILY IS AN GOOD APT OR COME CAN DIFFERENT NATURAL HERE HE THE A IN CAME THE TO OF TO EXPERT GRAY COME TO FURNISHES THE LINE MESSAGE HAD BE THESE
• 2) THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE CHARACTER OF THIS POINT IS THEREFORE ANOTHER METHOD FOR THE LETTERS THAT THE TIME OF WHO EVER TOLD THE PROBLEM FOR AN UNEXPECTED
What is the entropy of English?
• Entropy is the “limit” of the information per symbol using single symbols, digrams, trigrams, …
• Not really calculable because English is a finite language!
• Nonetheless it can be determined experimentally using Shannon’s game
• Answer: a little more than 1 bit/character
Shannon’s Remarkable 1948 paper
Shannon’s Source Coding Theorem
• No code can achieve efficiency greater than 1, but
• For any source, there are codes with efficiency as close to 1 as desired.
• The proof does not give a method to find the best codes. It just sets a limit on how good they can be.
Huffman coding used widely
• E.g. JPEGs use Huffman codes for the pixel-to-pixel changes in color values
• Colors usually change gradually, so there are many small numbers (0, 1, 2) in this sequence
• JPEGs sometimes use a fancier compression method called “arithmetic coding”
• Arithmetic coding produces 5% better compression
Why don’t JPEGs use arithmetic coding?
• Because it is patented by IBM:

United States Patent 4,905,297
Langdon, Jr., et al. February 27, 1990
Arithmetic coding encoder and decoder system

Abstract: Apparatus and method for compressing and de-compressing binary decision data by arithmetic coding and decoding wherein the estimated probability Qe of the less probable of the two decision events, or outcomes, adapts as decisions are successively encoded. To facilitate coding computations, an augend value A for the current number line interval is held to approximate …
What if Huffman had patented his code?