Multimedia Communications
Mathematical Preliminaries for Lossless Compression
Copyright S. Shirani
What we will see in this chapter
• Definition of information and entropy
• Modeling a data source
• Definition of coding and when a code is decodable
Why information theory?
• Compression schemes can be divided into two classes: lossy and lossless.
• Lossy compression involves loss of some information; data that have been compressed generally cannot be recovered exactly.
• Lossless schemes compress the data without loss of information, and the original data can be recovered exactly from the compressed data.
• There is a close relation between lossless compression and the notions of information and entropy.
Information Theory
• Discrete information source with N symbols (the set of symbols is often called the alphabet): AN = {a1, …, aN}.
• The probability function p : AN → [0,1] gives the probability of occurrence of each symbol (p(a1) = p1, …, p(aN) = pN).
• When we receive one of the symbols, how much information do we get?
• If p1 = 1, there is no surprise (no information), since we know what the message must be.
• If the probabilities are very different, then when a symbol with low probability arrives, we feel more surprised and get more information.
• Information is therefore inversely related to probability.
Information
• The self-information of a symbol x ∈ AN,
  i : AN → ℝ, i(x) = -log(p(x)),
  is a measure of the information one receives upon being told that symbol x was received.
• i(x) increases to infinity as the probability of the symbol decreases to zero.
• Logarithm base 2 → unit of information = bit (our choice)
  Logarithm base e → unit of information = nat
  Logarithm base 10 → unit of information = hartley
• Example: flipping a fair coin, P(H) = P(T) = 1/2: i(H) = i(T) = 1 bit.
Entropy
• The entropy of an information source is the expected (average) value of its self-information:
  H(X) = Σi pi i(ai) = -Σi pi log2 pi
• H(X) is the average amount of information we get from a symbol of the source.
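The definition above translates directly into a few lines of code. A minimal sketch in Python (the function name is illustrative, not from the slides); terms with pi = 0 are skipped, since p log p → 0 as p → 0:

```python
import math

def entropy(probs):
    """First-order entropy H(X) = -sum p_i * log2(p_i), in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A fair coin delivers exactly 1 bit per flip.
print(entropy([0.5, 0.5]))  # 1.0
```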
Entropy
• The entropy defined on the previous slide is in fact the first-order entropy H1.
• If X = {X1, …, Xm} is a sequence of outputs of an information source S, the entropy of S is
  H∞ = lim m→∞ (1/m) H(X1, X2, …, Xm)
• For i.i.d. (independent, identically distributed) sources, H∞ = H1.
• For most sources, H∞ is not equal to H1.
Entropy
• In general, it is not possible to know the actual entropy of a physical source; we have to estimate it.
• The estimate of the entropy depends on our assumptions about the structure of the source.
• Example: 1 2 3 2 3 4 5 4 5 6 7 8 9 8 9 10
  – Assumption 1: i.i.d. source
    • P(1) = P(6) = P(7) = P(10) = 1/16
    • P(2) = P(3) = P(4) = P(5) = P(8) = P(9) = 2/16, H = 3.25 bits
  – Assumption 2: sample-to-sample correlation; encode the residual (difference) sequence instead:
    • 1 1 1 –1 1 1 1 –1 1 1 1 1 1 –1 1 1
    • P(1) = 13/16, P(–1) = 3/16, H = 0.70 bits
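Both estimates on this slide can be reproduced numerically. A sketch in Python, keeping the first sample as its own residual so the difference sequence matches the slide's 16 values:

```python
import math
from collections import Counter

def entropy(seq):
    """Empirical first-order entropy of a sequence, in bits."""
    counts = Counter(seq)
    n = len(seq)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

samples = [1, 2, 3, 2, 3, 4, 5, 4, 5, 6, 7, 8, 9, 8, 9, 10]

# Assumption 1: treat the samples as i.i.d.
h_iid = entropy(samples)                      # 3.25 bits

# Assumption 2: model x_n = x_{n-1} + r_n and measure the residuals r_n.
residuals = [samples[0]] + [b - a for a, b in zip(samples, samples[1:])]
h_resid = entropy(residuals)                  # ~0.70 bits
```

The second model captures the sample-to-sample correlation, cutting the entropy estimate from 3.25 to about 0.70 bits per sample.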
Entropy
• Our assumptions about the structure of the source are called models.
• In the previous example the model is: xn = xn-1 + rn, where the residual rn takes the values 1 or –1.
• This is a static model: its parameters do not change with n.
• Adaptive models: the parameters change, or adapt, with n to the changing characteristics of the data.
Properties of Entropy
• 0 ≤ H(X) ≤ log2 N
• The entropy is zero when one of the symbols occurs with probability 1.
• The entropy is maximum (log2 N) when all symbols occur with equal probability.
• H is a continuous function of the probabilities (a small change in a probability causes a small change in the average information).
• If all symbols are equally likely, increasing the number of symbols increases H.
  – The more possible outcomes there are, the more information should be contained in the occurrence of any particular outcome.
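The bounds above are easy to check numerically. A sketch in Python (the skewed distribution is an arbitrary example, not from the slides):

```python
import math

def entropy(probs):
    """First-order entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

N = 8
uniform = [1 / N] * N
skewed = [0.5, 0.2, 0.1, 0.05, 0.05, 0.04, 0.03, 0.03]

# Equal probabilities attain the upper bound log2(8) = 3 bits...
print(entropy(uniform))
# ...any other distribution over the same 8 symbols gives strictly less.
print(entropy(skewed))
```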
Models for Information Sources
• Good models for sources lead to more efficient compression algorithms.
• Physical models: if we know something about the physics of the data generation, we can use that information to construct a model.
  – Example: the physics of speech production.
• Probability models:
  – Simplest statistical model: each symbol generated by the source is independent of every other symbol, and each occurs with the same probability (the ignorance model).
  – Next step: symbols are independent, with a separate probability for each symbol.
  – Next step: discard the independence assumption and come up with a description of the dependency.
Models for Information Sources
• One of the most popular ways of representing dependence in data is through the use of Markov models.
• kth-order Markov model: P(xn | xn-1, …, xn-k) = P(xn | xn-1, …, xn-k, xn-k-1, …)
  – Knowledge of the past k symbols is equivalent to knowledge of the entire past.
  – If xn belongs to a discrete set, this is also called a finite-state process.
  – The values taken by {xn-1, …, xn-k} are called the state of the Markov process.
  – If the size of the source alphabet is N, the number of states is N^k.
• First-order Markov model: P(xn | xn-1).
Models for Information Sources
• How do we describe the dependency between samples?
  – Linear models
  – Markov chains
• Entropy of a finite-state process:
  H(X) = Σi P(Si) H(X | Si)
  where Si is the ith state of the Markov model.
Example: Binary image
• The image has two types of pixels: white and black.
• The type of the next pixel depends on whether the current pixel is white or black.
• We can model the pixels as a first-order discrete Markov chain with two states, Sw and Sb, and transition probabilities P(w|w), P(b|w), P(w|b), P(b|b).
• P(Sw) = 30/31, P(Sb) = 1/31, P(b|w) = 0.01, P(w|b) = 0.3
• Entropy based on the iid assumption:
  H = -(30/31)log(30/31) - (1/31)log(1/31) = 0.206 bits
• Entropy based on the Markov model:
  H(X|Sb) = -0.3 log 0.3 - 0.7 log 0.7 = 0.881 bits
  H(X|Sw) = -0.01 log 0.01 - 0.99 log 0.99 = 0.081 bits
  H(X) = (30/31)(0.081) + (1/31)(0.881) = 0.107 bits
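The binary-image numbers above can be verified with a short computation. A sketch in Python using the slide's probabilities:

```python
import math

def h(probs):
    """Entropy of a distribution, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Stationary state probabilities and transition probabilities from the slide.
p_w, p_b = 30 / 31, 1 / 31
p_b_given_w = 0.01          # P(b|w); so P(w|w) = 0.99
p_w_given_b = 0.3           # P(w|b); so P(b|b) = 0.7

# Entropy under the iid assumption ignores the state.
h_iid = h([p_w, p_b])                              # ~0.206 bits

# Entropy of the finite-state model: H(X) = sum_i P(S_i) H(X|S_i)
h_given_w = h([p_b_given_w, 1 - p_b_given_w])      # ~0.081 bits
h_given_b = h([p_w_given_b, 1 - p_w_given_b])      # ~0.881 bits
h_markov = p_w * h_given_w + p_b * h_given_b       # ~0.107 bits
```

The Markov model roughly halves the entropy estimate relative to the iid assumption, because runs of same-colored pixels are highly predictable.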
Formal definition of encoding
An encoding scheme for a source alphabet S = {s1, s2, …, sN} in terms of a code alphabet A = {a1, …, aM} is a list of mappings
  s1 -> w1
  s2 -> w2
  …
  sN -> wN
in which w1, …, wN ∈ A+, where A+ is defined as
  A+ = ∪k=1..∞ A^k
and A^k is the Cartesian product of A with itself k times.
Formal definition of encoding
• Example: S = {a, b, c}, code alphabet A = {0, 1}; the scheme
  a -> 01
  b -> 10
  c -> 111
  is an encoding scheme.
• Suppose that si -> wi ∈ A+ is an encoding scheme for a source alphabet S = {s1, …, sN}, and that the source letters s1, …, sN occur with relative frequencies (probabilities) f1, …, fN respectively. The average codeword length of the code is defined as
  l̄ = Σi fi li
  where li is the length of wi.
Fixed Length Codes
• If the source has an alphabet with N symbols, these can be encoded using a fixed-length coder with B bits per symbol, where:
  B = ⌈log2 N⌉
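The fixed-length bound is a one-liner. A sketch in Python (the function name is illustrative):

```python
import math

def fixed_length_bits(n_symbols):
    """Bits per symbol for a fixed-length code: B = ceil(log2 N)."""
    return math.ceil(math.log2(n_symbols))

# 26 letters need 5 bits per symbol; 256 byte values need exactly 8.
print(fixed_length_bits(26))   # 5
print(fixed_length_bits(256))  # 8
```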
Optimal codes
• The average number of code letters required to encode a source text consisting of P source letters is P·l̄.
• It may be expensive and time consuming to transmit long sequences of code letters, so it is desirable for l̄ to be as small as possible.
• Common sense or intuition suggests that, in order to minimize l̄, we ought to represent frequently occurring source letters by short codewords and reserve the longer codewords for rarely occurring source letters (i.e., use variable-length codes).
• When using variable-length codes, we must make sure that the code is decodable.
Variable Length Codes: Examples

Letters   P(ak)   Code I   Code II   Code III   Code IV
a1        0.5     0        0         0          0
a2        0.25    0        1         10         01
a3        0.125   0        00        110        011
a4        0.125   10       11        111        0111
Average length    1.125    1.25      1.75       1.875

• Code I: not uniquely decodable (a1, a2, a3 share the codeword 0).
• Code II: not uniquely decodable.
• Code III: uniquely decodable. (Note: its rate is exactly equal to H = 1.75 bits.)
• Code IV: uniquely decodable.
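The table's average lengths follow directly from l̄ = Σ fi li. A sketch in Python reproducing the last row and the entropy:

```python
import math

probs = [0.5, 0.25, 0.125, 0.125]
codes = {
    "Code I":   ["0", "0", "0", "10"],
    "Code II":  ["0", "1", "00", "11"],
    "Code III": ["0", "10", "110", "111"],
    "Code IV":  ["0", "01", "011", "0111"],
}

def avg_length(codewords, probs):
    """Average codeword length: sum of f_i * l_i."""
    return sum(p * len(w) for p, w in zip(probs, codewords))

H = -sum(p * math.log2(p) for p in probs)   # 1.75 bits

for name, cw in codes.items():
    print(name, avg_length(cw, probs))
```

Codes I and II beat the entropy only because they are not uniquely decodable; Code III, which is, attains H exactly.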
A test for unique decodability
• Consider two binary codewords, a (k bits long) and b (n bits long), with n > k.
• If the first k bits of b are identical to a, then a is called a prefix of b.
• The last n–k bits of b are called the dangling suffix.
• The test:
  1. Construct a list of all the codewords.
  2. Examine all pairs of items in the list to see if any codeword is a prefix of another item.
  3. Whenever there is such a pair, add the dangling suffix to the list, unless it was added in a previous iteration.
  4. Continue until either:
     – a dangling suffix is itself a codeword: the code is not uniquely decodable; or
     – there are no more dangling suffixes to add: the code is uniquely decodable.
A test for unique decodability
• Example: {0, 01, 11}
• Dangling suffix: 1 (from the pair 0, 01)
• List: {0, 01, 11, 1}
• No new dangling suffixes appear, and no dangling suffix is a codeword: the code is uniquely decodable.
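The procedure above (the Sardinas-Patterson test) can be implemented in a few lines. A sketch in Python; the function name is illustrative:

```python
def is_uniquely_decodable(codewords):
    """Sardinas-Patterson test, following the steps above."""
    code = set(codewords)
    if len(code) != len(codewords):
        return False  # duplicate codewords are trivially ambiguous
    # Initial dangling suffixes from pairs where one codeword prefixes another.
    suffixes = {b[len(a):] for a in code for b in code
                if a != b and b.startswith(a)}
    seen = set()
    while suffixes:
        new = set()
        for s in suffixes:
            if s in code:
                return False  # a dangling suffix is a codeword
            for c in code:
                if c.startswith(s):
                    new.add(c[len(s):])
                if s.startswith(c):
                    new.add(s[len(c):])
        seen |= suffixes
        suffixes = new - seen  # only suffixes not seen in earlier iterations
    return True  # no more dangling suffixes

print(is_uniquely_decodable(["0", "01", "11"]))  # True
print(is_uniquely_decodable(["0", "01", "10"]))  # False: "010" is ambiguous
```

The loop terminates because every dangling suffix is a suffix of some codeword, so only finitely many can ever appear.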
Prefix codes
• One type of code in which we never face the possibility of a dangling suffix being a codeword is a code in which no codeword is a prefix of another.
• Such codes are called prefix codes.
• A simple way to check whether a code is a prefix code is to draw the binary tree of the code.
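Checking the prefix condition directly needs no tree: compare every pair of codewords. A sketch in Python:

```python
def is_prefix_code(codewords):
    """A code is a prefix code iff no codeword is a prefix of another."""
    for i, a in enumerate(codewords):
        for j, b in enumerate(codewords):
            if i != j and b.startswith(a):
                return False
    return True

print(is_prefix_code(["0", "10", "110", "111"]))   # True  (Code III)
print(is_prefix_code(["0", "01", "011", "0111"]))  # False (Code IV)
```

Code IV fails the prefix condition yet is still uniquely decodable, which shows the prefix property is sufficient but not necessary for unique decodability.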
Tree Representation of Codes

[Figure: binary code trees for Code III and Code IV, with branches labeled 0 and 1 and symbols a1–a4 placed at the corresponding nodes.]

• In a prefix code, all codewords are external nodes (leaves) of the tree.
Instantaneously Decodable Codes

[Figure: prefix codes form a subset of the uniquely decodable codes.]

• Instantaneous codes decode a symbol as soon as its codeword is received.
• This simplifies the decoding logic.
• It is both necessary and sufficient for a code to be instantaneous that no codeword be a prefix of another codeword (the prefix condition).
McMillan and Kraft theorems
• Theorem (McMillan's inequality): If |S| = N, |A| = M, and si -> wi ∈ A^li, i = 1, 2, …, N, is an encoding scheme resulting in a uniquely decodable code, then
  Σi=1..N M^(-li) ≤ 1
• For binary codes the condition is: Σi=1..N 2^(-li) ≤ 1
• Theorem (Kraft's inequality): Suppose that S = {s1, …, sN} is a source alphabet, A = {a1, …, aM} is a code alphabet, and l1, l2, …, lN are positive integers. Then there is an encoding scheme si -> wi, i = 1, 2, …, N, for S in terms of A satisfying the prefix condition with length(wi) = li if and only if
  Σi=1..N M^(-li) ≤ 1
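The left-hand side of the inequality is a one-line computation. A sketch in Python:

```python
def kraft_sum(lengths, M=2):
    """Left-hand side of the Kraft-McMillan inequality: sum of M^(-l_i)."""
    return sum(M ** -l for l in lengths)

# Lengths 1, 2, 2, 3: the sum exceeds 1, so no uniquely decodable
# (and hence no prefix) code with these lengths exists.
print(kraft_sum([1, 2, 2, 3]))   # 1.125
# Lengths 1, 2, 3, 3 (Code III): the sum equals 1, so a prefix code exists.
print(kraft_sum([1, 2, 3, 3]))   # 1.0
```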
McMillan and Kraft theorems

[Figure: the two theorems as a pair of implications.]
• Kraft: lengths li satisfy the inequality ⇒ a prefix encoding scheme with lengths li exists.
• McMillan: the code is uniquely decodable ⇒ its lengths li satisfy the inequality.
Kraft-McMillan inequalities
• Note that the theorem refers to the existence of such a code, not to a particular code. A particular code may obey the Kraft inequality and still not be instantaneous, but there will exist codes with the same lengths li that are instantaneous.
• Example 1: Is there an instantaneous code with codeword lengths 1, 2, 2, 3?
  – Kraft inequality: 2^(-1) + 2^(-2) + 2^(-2) + 2^(-3) = 1.125 > 1: No.
• It is convenient to work with prefix codes; are we losing something (in terms of codeword length) if we restrict ourselves to prefix codes?
• No. If there is a code which is uniquely decodable and non-prefix, the values of l1, l2, …, lN for this code satisfy the Kraft-McMillan inequality. Thus, according to Kraft's theorem, a prefix code with the same codeword lengths also exists.
Kraft-McMillan inequalities
• If a set of lengths {li} is available that obeys the Kraft inequality, an instantaneous code can be built systematically.
• Example: M = 3, l = {1, 2, 2, 2, 2, 2, 3, 3, 3}; find an instantaneous code.

[Figure: ternary code tree for the example, with symbols a1–a9 placed at depths 1, 2, and 3.]
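One systematic construction assigns codewords in order of increasing length by counting in base M, which keeps every codeword off the subtree of the previous ones. A sketch in Python, assuming the lengths satisfy the Kraft inequality (function names are illustrative):

```python
def to_base(n, M, width):
    """Write n in base M using exactly `width` digits."""
    digits = []
    for _ in range(width):
        digits.append(str(n % M))
        n //= M
    return "".join(reversed(digits))

def build_prefix_code(lengths, M=2):
    """Build a prefix (instantaneous) code with the given codeword lengths,
    assuming they satisfy the Kraft inequality: sum of M^(-l) <= 1."""
    assert sum(M ** -l for l in lengths) <= 1 + 1e-12, "Kraft inequality fails"
    codes = []
    value, prev_len = 0, 0
    for l in sorted(lengths):
        value *= M ** (l - prev_len)   # extend with trailing zeros
        codes.append(to_base(value, M, l))
        value += 1                     # step past the subtree just used
        prev_len = l
    return codes

# The slide's example: ternary alphabet, lengths {1,2,2,2,2,2,3,3,3}.
print(build_prefix_code([1, 2, 2, 2, 2, 2, 3, 3, 3], M=3))
# ['0', '10', '11', '12', '20', '21', '220', '221', '222']
```

The same routine with M = 2 and lengths 1, 2, 3, 3 reproduces Code III: 0, 10, 110, 111.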
Kraft-McMillan inequalities
• One approach to building a uniquely decodable code is:
  1. For the particular values of M and N, find all sets of {li} that satisfy the Kraft inequality.
  2. Systematically build the codewords.
  3. Assign the shorter codewords to the source letters with higher probability (relative frequency) and the longer codewords to the less likely letters.
  4. Find the average codeword length l̄ of each of the above codes.
  5. Pick the code that has the minimum l̄.
• This "brute force" approach is useful in "mixed" optimization problems, in which we want to keep l̄ small and also serve some other purpose.
• When minimization of l̄ is our only objective, a faster and more elegant approach is available (the Huffman algorithm).
Lossless Source Coding Theorem
• Consider a source with entropy H. Then, for every encoding scheme for S in terms of A resulting in a uniquely decodable code, the average codeword length satisfies:
  l̄ ≥ H
  – It is possible to code the source, without distortion, using H + ε bits per symbol, where ε is an arbitrarily small positive number. However, it is not possible to code the source using B bits per symbol, where B < H.
• The theorem does not tell us how such a coder can be constructed.
Kolmogorov complexity
• The Kolmogorov complexity K(x) of a sequence x is the size of the program needed to generate x.
• In this size we include all inputs that might be needed by the program.
• We do not specify the programming language, since it is always possible to translate a program in one language into a program in another language.
• If x is a random sequence with no structure, the only program that could generate it is one that contains the sequence itself.
• There is a correspondence between the size of the smallest program and the amount of compression that can be obtained.
• Problem: there is no systematic way of computing (or even approximating) the Kolmogorov complexity.
Minimum Description Length
• Let Mj be a model, from a set of models, that attempts to characterize the structure in a sequence x.
• Let DMj be the number of bits required to describe the model Mj.
  – Example: if Mj has coefficients, then DMj will depend on how many coefficients the model has and how many bits are used to represent each.
• Let RMj(x) be the number of bits required to represent x with respect to the model Mj.
• The minimum description length is then given by:
  MDL(x) = min over j of [ DMj + RMj(x) ]