Multimedia Communications
Mathematical Preliminaries for Lossless Compression
Copyright S. Shirani
What we will see in this chapter
• Definition of information and entropy
• Modeling a data source
• Definition of coding and when a code is decodable
Why information theory?
• Compression schemes can be divided into two classes: lossy and lossless.
• Lossy compression involves loss of some information; data that have been compressed generally cannot be recovered exactly.
• Lossless schemes compress the data without loss of information, and the original data can be recovered exactly from the compressed data.
• There is a close relation between lossless compression and the notions of information and entropy.
Information Theory
• Discrete information source with N symbols (the set of symbols is often called the alphabet): AN = {a1, …, aN}.
• The probability function p : AN → [0,1] gives the probability of occurrence of each symbol (p(a1) = p1, …, p(aN) = pN).
• When we receive one of the symbols, how much information do we get?
• If p1 = 1, there is no surprise (no information), since we know what the message must be.
• If the probabilities are very different, then when a symbol with low probability arrives, we feel more surprised and get more information.
• Information is therefore inversely related to probability.
Information
• The self-information of a symbol x ∈ AN,
  i : AN → ℝ, i(x) = -log(p(x)),
  is a measure of the information one receives upon being told that symbol x was received.
• i(x) increases to infinity as the probability of the symbol decreases to zero.
• Logarithm base 2 → unit of information = bit (our choice)
  Logarithm base e → unit of information = nat
  Logarithm base 10 → unit of information = hartley
• Example: flipping a fair coin, P(H) = P(T) = 1/2: i(H) = i(T) = 1 bit.
Entropy
• The entropy of an information source is the expected (average) value of its self-information:
  H(X) = Σi pi i(ai) = -Σi pi log2 pi
• H(X) is the average amount of information we get from a symbol of the source.
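The definition above translates directly into a few lines of code. A minimal sketch in Python (the function name is illustrative, not from the slides); terms with pi = 0 are skipped, since p log p → 0 as p → 0:

```python
import math

def entropy(probs):
    """First-order entropy H(X) = -sum p_i * log2(p_i), in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A fair coin delivers exactly 1 bit per flip.
print(entropy([0.5, 0.5]))  # 1.0
```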
Entropy
• The entropy defined on the previous slide is in fact the first-order entropy H1.
• If X = {X1, …, Xm} is a sequence of outputs of an information source S, the entropy of S is
  H∞ = lim m→∞ (1/m) H(X1, X2, …, Xm)
• For i.i.d. (independent, identically distributed) sources, H∞ = H1.
• For most sources, H∞ is not equal to H1.
Entropy
• In general, it is not possible to know the actual entropy of a physical source; we have to estimate it.
• The estimate of the entropy depends on our assumptions about the structure of the source.
• Example: 1 2 3 2 3 4 5 4 5 6 7 8 9 8 9 10
  – Assumption 1: i.i.d. source
    • P(1) = P(6) = P(7) = P(10) = 1/16
    • P(2) = P(3) = P(4) = P(5) = P(8) = P(9) = 2/16, H = 3.25 bits
  – Assumption 2: sample-to-sample correlation; encode the residual (difference) sequence instead:
    • 1 1 1 –1 1 1 1 –1 1 1 1 1 1 –1 1 1
    • P(1) = 13/16, P(–1) = 3/16, H = 0.70 bits
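Both estimates on this slide can be reproduced numerically. A sketch in Python, keeping the first sample as its own residual so the difference sequence matches the slide's 16 values:

```python
import math
from collections import Counter

def entropy(seq):
    """Empirical first-order entropy of a sequence, in bits."""
    counts = Counter(seq)
    n = len(seq)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

samples = [1, 2, 3, 2, 3, 4, 5, 4, 5, 6, 7, 8, 9, 8, 9, 10]

# Assumption 1: treat the samples as i.i.d.
h_iid = entropy(samples)                      # 3.25 bits

# Assumption 2: model x_n = x_{n-1} + r_n and measure the residuals r_n.
residuals = [samples[0]] + [b - a for a, b in zip(samples, samples[1:])]
h_resid = entropy(residuals)                  # ~0.70 bits
```

The second model captures the sample-to-sample correlation, cutting the entropy estimate from 3.25 to about 0.70 bits per sample.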
Entropy
• Our assumptions about the structure of the source are called models.
• In the previous example the model is: xn = xn-1 + rn, where the residual rn takes the values 1 or –1.
• This is a static model: its parameters do not change with n.
• Adaptive models: the parameters change, or adapt, with n to the changing characteristics of the data.
Properties of Entropy
• 0 ≤ H(X) ≤ log2 N
• The entropy is zero when one of the symbols occurs with probability 1.
• The entropy is maximum (log2 N) when all symbols occur with equal probability.
• H is a continuous function of the probabilities (a small change in a probability causes a small change in the average information).
• If all symbols are equally likely, increasing the number of symbols increases H.
  – The more possible outcomes there are, the more information should be contained in the occurrence of any particular outcome.
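The bounds above are easy to check numerically. A sketch in Python (the skewed distribution is an arbitrary example, not from the slides):

```python
import math

def entropy(probs):
    """First-order entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

N = 8
uniform = [1 / N] * N
skewed = [0.5, 0.2, 0.1, 0.05, 0.05, 0.04, 0.03, 0.03]

# Equal probabilities attain the upper bound log2(8) = 3 bits...
print(entropy(uniform))
# ...any other distribution over the same 8 symbols gives strictly less.
print(entropy(skewed))
```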
Models for Information Sources
• Good models for sources lead to more efficient compression algorithms.
• Physical models: if we know something about the physics of the data generation, we can use that information to construct a model.
  – Example: the physics of speech production.
• Probability models:
  – Simplest statistical model: each symbol generated by the source is independent of every other symbol, and each occurs with the same probability (the ignorance model).
  – Next step: symbols are independent, with a separate probability for each symbol.
  – Next step: discard the independence assumption and come up with a description of the dependency.
Models for Information Sources
• One of the most popular ways of representing dependence in data is through the use of Markov models.
• kth-order Markov model: P(xn | xn-1, …, xn-k) = P(xn | xn-1, …, xn-k, xn-k-1, …)
  – Knowledge of the past k symbols is equivalent to knowledge of the entire past.
  – If xn belongs to a discrete set, this is also called a finite-state process.
  – The values taken by {xn-1, …, xn-k} are called the state of the Markov process.
  – If the size of the source alphabet is N, the number of states is N^k.
• First-order Markov model: P(xn | xn-1).
Models for Information Sources
• How do we describe the dependency between samples?
  – Linear models
  – Markov chains
• Entropy of a finite-state process:
  H(X) = Σi P(Si) H(X | Si)
  where Si is the ith state of the Markov model.
Example: Binary image
• The image has two types of pixels: white and black.
• The type of the next pixel depends on whether the current pixel is white or black.
• We can model the pixels as a first-order discrete Markov chain with two states, Sw and Sb, and transition probabilities P(w|w), P(b|w), P(w|b), P(b|b).
• P(Sw) = 30/31, P(Sb) = 1/31, P(b|w) = 0.01, P(w|b) = 0.3
• Entropy based on the iid assumption:
  H = -(30/31)log(30/31) - (1/31)log(1/31) = 0.206 bits
• Entropy based on the Markov model:
  H(X|Sb) = -0.3 log 0.3 - 0.7 log 0.7 = 0.881 bits
  H(X|Sw) = -0.01 log 0.01 - 0.99 log 0.99 = 0.081 bits
  H(X) = (30/31)(0.081) + (1/31)(0.881) = 0.107 bits
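The binary-image numbers above can be verified with a short computation. A sketch in Python using the slide's probabilities:

```python
import math

def h(probs):
    """Entropy of a distribution, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Stationary state probabilities and transition probabilities from the slide.
p_w, p_b = 30 / 31, 1 / 31
p_b_given_w = 0.01          # P(b|w); so P(w|w) = 0.99
p_w_given_b = 0.3           # P(w|b); so P(b|b) = 0.7

# Entropy under the iid assumption ignores the state.
h_iid = h([p_w, p_b])                              # ~0.206 bits

# Entropy of the finite-state model: H(X) = sum_i P(S_i) H(X|S_i)
h_given_w = h([p_b_given_w, 1 - p_b_given_w])      # ~0.081 bits
h_given_b = h([p_w_given_b, 1 - p_w_given_b])      # ~0.881 bits
h_markov = p_w * h_given_w + p_b * h_given_b       # ~0.107 bits
```

The Markov model roughly halves the entropy estimate relative to the iid assumption, because runs of same-colored pixels are highly predictable.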
Formal definition of encoding
An encoding scheme for a source alphabet S = {s1, s2, …, sN} in terms of a code alphabet A = {a1, …, aM} is a list of mappings
  s1 -> w1
  s2 -> w2
  …
  sN -> wN
in which w1, …, wN ∈ A+, where A+ is defined as
  A+ = ∪k=1..∞ A^k
and A^k is the Cartesian product of A with itself k times.
Formal definition of encoding
• Example: S = {a, b, c}, code alphabet A = {0, 1}; the scheme
  a -> 01
  b -> 10
  c -> 111
  is an encoding scheme.
• Suppose that si -> wi ∈ A+ is an encoding scheme for a source alphabet S = {s1, …, sN}, and that the source letters s1, …, sN occur with relative frequencies (probabilities) f1, …, fN respectively. The average codeword length of the code is defined as
  l̄ = Σi fi li
  where li is the length of wi.
Fixed Length Codes
• If the source has an alphabet with N symbols, these can be encoded using a fixed-length coder with B bits per symbol, where:
  B = ⌈log2 N⌉
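The fixed-length bound is a one-liner. A sketch in Python (the function name is illustrative):

```python
import math

def fixed_length_bits(n_symbols):
    """Bits per symbol for a fixed-length code: B = ceil(log2 N)."""
    return math.ceil(math.log2(n_symbols))

# 26 letters need 5 bits per symbol; 256 byte values need exactly 8.
print(fixed_length_bits(26))   # 5
print(fixed_length_bits(256))  # 8
```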
Optimal codes
• The average number of code letters required to encode a source text consisting of P source letters is P·l̄.
• It may be expensive and time consuming to transmit long sequences of code letters, so it is desirable for l̄ to be as small as possible.
• Common sense or intuition suggests that, in order to minimize l̄, we ought to represent frequently occurring source letters by short codewords and reserve the longer codewords for rarely occurring source letters (i.e., use variable-length codes).
• When using variable-length codes, we must make sure that the code is decodable.
Variable Length Codes: Examples

Letters   P(ak)   Code I   Code II   Code III   Code IV
a1        0.5     0        0         0          0
a2        0.25    0        1         10         01
a3        0.125   0        00        110        011
a4        0.125   10       11        111        0111
Average length    1.125    1.25      1.75       1.875

• Code I: not uniquely decodable (a1, a2, a3 share the codeword 0).
• Code II: not uniquely decodable.
• Code III: uniquely decodable. (Note: its rate is exactly equal to H = 1.75 bits.)
• Code IV: uniquely decodable.
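The table's average lengths follow directly from l̄ = Σ fi li. A sketch in Python reproducing the last row and the entropy:

```python
import math

probs = [0.5, 0.25, 0.125, 0.125]
codes = {
    "Code I":   ["0", "0", "0", "10"],
    "Code II":  ["0", "1", "00", "11"],
    "Code III": ["0", "10", "110", "111"],
    "Code IV":  ["0", "01", "011", "0111"],
}

def avg_length(codewords, probs):
    """Average codeword length: sum of f_i * l_i."""
    return sum(p * len(w) for p, w in zip(probs, codewords))

H = -sum(p * math.log2(p) for p in probs)   # 1.75 bits

for name, cw in codes.items():
    print(name, avg_length(cw, probs))
```

Codes I and II beat the entropy only because they are not uniquely decodable; Code III, which is, attains H exactly.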
A test for unique decodability
• Consider two binary codewords, a (k bits long) and b (n bits long), with n > k.
• If the first k bits of b are identical to a, then a is called a prefix of b.
• The last n–k bits of b are called the dangling suffix.
• The test:
  1. Construct a list of all the codewords.
  2. Examine all pairs of items in the list to see if any codeword is a prefix of another item.
  3. Whenever there is such a pair, add the dangling suffix to the list, unless it was added in a previous iteration.
  4. Continue until either:
     – a dangling suffix is itself a codeword: the code is not uniquely decodable; or
     – there are no more dangling suffixes to add: the code is uniquely decodable.
A test for unique decodability
• Example: {0, 01, 11}
• Dangling suffix: 1 (from the pair 0, 01)
• List: {0, 01, 11, 1}
• No new dangling suffixes appear, and no dangling suffix is a codeword: the code is uniquely decodable.
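The procedure above (the Sardinas-Patterson test) can be implemented in a few lines. A sketch in Python; the function name is illustrative:

```python
def is_uniquely_decodable(codewords):
    """Sardinas-Patterson test, following the steps above."""
    code = set(codewords)
    if len(code) != len(codewords):
        return False  # duplicate codewords are trivially ambiguous
    # Initial dangling suffixes from pairs where one codeword prefixes another.
    suffixes = {b[len(a):] for a in code for b in code
                if a != b and b.startswith(a)}
    seen = set()
    while suffixes:
        new = set()
        for s in suffixes:
            if s in code:
                return False  # a dangling suffix is a codeword
            for c in code:
                if c.startswith(s):
                    new.add(c[len(s):])
                if s.startswith(c):
                    new.add(s[len(c):])
        seen |= suffixes
        suffixes = new - seen  # only suffixes not seen in earlier iterations
    return True  # no more dangling suffixes

print(is_uniquely_decodable(["0", "01", "11"]))  # True
print(is_uniquely_decodable(["0", "01", "10"]))  # False: "010" is ambiguous
```

The loop terminates because every dangling suffix is a suffix of some codeword, so only finitely many can ever appear.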
Prefix codes
• One type of code in which we never face the possibility of a dangling suffix being a codeword is a code in which no codeword is a prefix of another.
• Such codes are called prefix codes.
• A simple way to check whether a code is a prefix code is to draw the binary tree of the code.
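Checking the prefix condition directly needs no tree: compare every pair of codewords. A sketch in Python:

```python
def is_prefix_code(codewords):
    """A code is a prefix code iff no codeword is a prefix of another."""
    for i, a in enumerate(codewords):
        for j, b in enumerate(codewords):
            if i != j and b.startswith(a):
                return False
    return True

print(is_prefix_code(["0", "10", "110", "111"]))   # True  (Code III)
print(is_prefix_code(["0", "01", "011", "0111"]))  # False (Code IV)
```

Code IV fails the prefix condition yet is still uniquely decodable, which shows the prefix property is sufficient but not necessary for unique decodability.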
Tree Representation of Codes

[Figure: binary code trees for Code III and Code IV, with branches labeled 0 and 1 and symbols a1–a4 placed at the corresponding nodes.]

• In a prefix code, all codewords are external nodes (leaves) of the tree.
Instantaneously Decodable Codes

[Figure: prefix codes form a subset of the uniquely decodable codes.]

• Instantaneous codes decode a symbol as soon as its codeword is received.
• This simplifies the decoding logic.
• It is both necessary and sufficient for a code to be instantaneous that no codeword be a prefix of another codeword (the prefix condition).
McMillan and Kraft theorems
• Theorem (McMillan's inequality): If |S| = N, |A| = M, and si -> wi ∈ A^li, i = 1, 2, …, N, is an encoding scheme resulting in a uniquely decodable code, then
  Σi=1..N M^(-li) ≤ 1
• For binary codes the condition is: Σi=1..N 2^(-li) ≤ 1
• Theorem (Kraft's inequality): Suppose that S = {s1, …, sN} is a source alphabet, A = {a1, …, aM} is a code alphabet, and l1, l2, …, lN are positive integers. Then there is an encoding scheme si -> wi, i = 1, 2, …, N, for S in terms of A satisfying the prefix condition with length(wi) = li if and only if
  Σi=1..N M^(-li) ≤ 1
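The left-hand side of the inequality is a one-line computation. A sketch in Python:

```python
def kraft_sum(lengths, M=2):
    """Left-hand side of the Kraft-McMillan inequality: sum of M^(-l_i)."""
    return sum(M ** -l for l in lengths)

# Lengths 1, 2, 2, 3: the sum exceeds 1, so no uniquely decodable
# (and hence no prefix) code with these lengths exists.
print(kraft_sum([1, 2, 2, 3]))   # 1.125
# Lengths 1, 2, 3, 3 (Code III): the sum equals 1, so a prefix code exists.
print(kraft_sum([1, 2, 3, 3]))   # 1.0
```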
McMillan and Kraft theorems

[Figure: the two theorems as a pair of implications.]
• Kraft: lengths li satisfy the inequality ⇒ a prefix encoding scheme with lengths li exists.
• McMillan: the code is uniquely decodable ⇒ its lengths li satisfy the inequality.
Kraft-McMillan inequalities
• Note that the theorem refers to the existence of such a code, not to a particular code. A particular code may obey the Kraft inequality and still not be instantaneous, but there will exist codes with the same lengths li that are instantaneous.
• Example 1: Is there an instantaneous code with codeword lengths 1, 2, 2, 3?
  – Kraft inequality: 2^(-1) + 2^(-2) + 2^(-2) + 2^(-3) = 1.125 > 1: No.
• It is convenient to work with prefix codes; are we losing something (in terms of codeword length) if we restrict ourselves to prefix codes?
• No. If there is a code which is uniquely decodable and non-prefix, the values of l1, l2, …, lN for this code satisfy the Kraft-McMillan inequality. Thus, according to Kraft's theorem, a prefix code with the same codeword lengths also exists.
Kraft-McMillan inequalities
• If a set of lengths {li} is available that obeys the Kraft inequality, an instantaneous code can be built systematically.
• Example: M = 3, l = {1, 2, 2, 2, 2, 2, 3, 3, 3}; find an instantaneous code.

[Figure: ternary code tree for the example, with symbols a1–a9 placed at depths 1, 2, and 3.]
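One systematic construction assigns codewords in order of increasing length by counting in base M, which keeps every codeword off the subtree of the previous ones. A sketch in Python, assuming the lengths satisfy the Kraft inequality (function names are illustrative):

```python
def to_base(n, M, width):
    """Write n in base M using exactly `width` digits."""
    digits = []
    for _ in range(width):
        digits.append(str(n % M))
        n //= M
    return "".join(reversed(digits))

def build_prefix_code(lengths, M=2):
    """Build a prefix (instantaneous) code with the given codeword lengths,
    assuming they satisfy the Kraft inequality: sum of M^(-l) <= 1."""
    assert sum(M ** -l for l in lengths) <= 1 + 1e-12, "Kraft inequality fails"
    codes = []
    value, prev_len = 0, 0
    for l in sorted(lengths):
        value *= M ** (l - prev_len)   # extend with trailing zeros
        codes.append(to_base(value, M, l))
        value += 1                     # step past the subtree just used
        prev_len = l
    return codes

# The slide's example: ternary alphabet, lengths {1,2,2,2,2,2,3,3,3}.
print(build_prefix_code([1, 2, 2, 2, 2, 2, 3, 3, 3], M=3))
# ['0', '10', '11', '12', '20', '21', '220', '221', '222']
```

The same routine with M = 2 and lengths 1, 2, 3, 3 reproduces Code III: 0, 10, 110, 111.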
Kraft-McMillan inequalities
• One approach to building a uniquely decodable code is:
  1. For the particular values of M and N, find all sets of {li} that satisfy the Kraft inequality.
  2. Systematically build the codewords.
  3. Assign the shorter codewords to the source letters with higher probability (relative frequency) and the longer codewords to the less likely letters.
  4. Find the average codeword length l̄ of each of the above codes.
  5. Pick the code that has the minimum l̄.
• This "brute force" approach is useful in "mixed" optimization problems, in which we want to keep l̄ small and also serve some other purpose.
• When minimization of l̄ is our only objective, a faster and more elegant approach is available (the Huffman algorithm).
Lossless Source Coding Theorem
• Consider a source with entropy H. Then, for every encoding scheme for S in terms of A resulting in a uniquely decodable code, the average codeword length satisfies:
  l̄ ≥ H
  – It is possible to code the source, without distortion, using H + ε bits per symbol, where ε is an arbitrarily small positive number. However, it is not possible to code the source using B bits per symbol, where B < H.
• The theorem does not tell us how such a coder can be constructed.
Kolmogorov complexity
• The Kolmogorov complexity K(x) of a sequence x is the size of the program needed to generate x.
• In this size we include all inputs that might be needed by the program.
• We do not specify the programming language, since it is always possible to translate a program in one language into a program in another language.
• If x is a random sequence with no structure, the only program that could generate it is one that contains the sequence itself.
• There is a correspondence between the size of the smallest program and the amount of compression that can be obtained.
• Problem: there is no systematic way of computing (or even approximating) the Kolmogorov complexity.
Minimum Description Length
• Let Mj be a model, from a set of models, that attempts to characterize the structure in a sequence x.
• Let DMj be the number of bits required to describe the model Mj.
  – Example: if Mj has coefficients, then DMj will depend on how many coefficients the model has and how many bits are used to represent each.
• Let RMj(x) be the number of bits required to represent x with respect to the model Mj.
• The minimum description length is then given by:
  MDL(x) = min over j of [ DMj + RMj(x) ]