
    Chapter 5

    Optimal Source Coding

In practice there is redundancy in the data we are using. A text, for example, is most often still readable if every third letter is replaced with an erasure symbol. That means these letters can be removed and the text only takes about two thirds of the space in the memory. The same reasoning holds, e.g., for images where there are a lot of dependencies between the pixels. In this chapter we will take a closer look at the limits for how much it is possible to compress a vector of source symbols. One of the interpretations used for the entropy in the previous chapter is that it is the average amount of information needed to describe the random variable. This should then mean that the entropy should be the lower bound for how much the outcome (on average) can be compressed. In this chapter optimal source coding will be considered. One of the main results in Shannon's paper is the source coding theorem. Here, it is shown that for a vector of length n the compression ratio per symbol can approach the entropy as n grows to infinity. It is also shown that, as the interpretation suggests, the entropy is a lower bound on the compression ratio, if there is also a requirement that the decoding should be without errors. So, summarising, the entropy will be shown to be an achievable lower limit for the compression rate of a sequence of i.i.d. variables. We will also give a simple algorithm, the Huffman code, that is optimal in terms of compression rate.

    5.1 Asymptotic Equipartition Property

To show that it is possible to construct a compression algorithm that removes all redundancy and only has the information left, it is needed to show that the entropy can be reached. By using the law of large numbers it is possible to get a very efficient tool for this purpose, called the Asymptotic Equipartition Property (AEP). The basic idea is that a series of i.i.d. events is viewed as a vector of events. The probability for each vector becomes very small as the length grows, giving that it is pointless to consider the probability for each individual vector. Instead it is the probability for the type of vector that is interesting. Then it turns out that only a fraction of all possible outcomes does not have a negligibly small probability. To understand what this means, consider 100 consecutive flips with a fair coin. Since, assuming a fair coin, all resulting outcome vectors have the same probability,


$$P(x_1, \ldots, x_{100}) = 2^{-100} \approx 8 \cdot 10^{-31}$$

Hence, the probability for each of the vectors in the outcome is very small and it is, for example, not very likely that the vector with 100 Heads will occur. However, since there are $\binom{100}{50} \approx 10^{29}$ vectors with 50 Heads and 50 Tails, the probability for getting one such vector is

$$P(50\ \text{Heads},\ 50\ \text{Tails}) = \binom{100}{50} 2^{-100} \approx 0.08$$

which is relatively high. To conclude, it is most likely that the outcome of 100 flips with a fair coin will result in approximately the same number of Heads and Tails. This is a direct consequence of the weak law of large numbers, which is recapitulated here.

THEOREM 5.1 (THE WEAK LAW OF LARGE NUMBERS) Let $X_1, X_2, \ldots, X_n$ be i.i.d. random variables with mean $E[X]$. Then,

$$\frac{1}{n}\sum_i X_i \xrightarrow{\;p\;} E[X]$$

where $\xrightarrow{\;p\;}$ denotes convergence in probability.

The notation of convergence in probability can also be expressed as

$$\lim_{n\to\infty} P\Big(\Big|\frac{1}{n}\sum_i X_i - E[X]\Big| < \varepsilon\Big) = 1$$

for an arbitrary real number $\varepsilon > 0$. Hence, the arithmetic mean of $n$ i.i.d. random variables approaches the expected value of the distribution as $n$ grows.

Consider instead the logarithmic probability for a sequence of length $n$, consisting of i.i.d. random variables. By the weak law of large numbers

$$-\frac{1}{n}\log p(\mathbf{x}) = -\frac{1}{n}\log \prod_i p(x_i) = -\frac{1}{n}\sum_i \log p(x_i) \xrightarrow{\;p\;} E[-\log p(X)] = H(X)$$

or, equivalently,

$$\lim_{n\to\infty} P\Big(\Big|-\frac{1}{n}\log p(\mathbf{x}) - H(X)\Big| < \varepsilon\Big) = 1$$

for an arbitrary $\varepsilon > 0$. That is, the normalised logarithmic probability of a random sequence approaches the entropy as the length of the sequence grows. For finite sequences not all sequences will fulfil the criterion set up by the probabilistic convergence and $\varepsilon$. But those that do are the ones that are most likely to happen, and they are called typical sequences. This behaviour is named the asymptotic equipartition property (AEP) [24]. It is defined as follows.

DEFINITION 5.1 (AEP) The set of $\varepsilon$-typical sequences $A_\varepsilon(X)$ is the set of all $n$-dimensional vectors $\mathbf{x} = (x_1, x_2, \ldots, x_n)$ with i.i.d. variables such that

$$\Big|-\frac{1}{n}\log p(\mathbf{x}) - H(X)\Big| \le \varepsilon \tag{5.1}$$


It is possible to rewrite (5.1) as

$$-\varepsilon \le -\frac{1}{n}\log p(\mathbf{x}) - H(X) \le \varepsilon$$

which is equivalent to

$$2^{-n(H(X)+\varepsilon)} \le p(\mathbf{x}) \le 2^{-n(H(X)-\varepsilon)}$$

    In this way there can be an alternative definition of the AEP.

DEFINITION 5.2 (AEP, ALTERNATIVE DEFINITION) The set of $\varepsilon$-typical sequences $A_\varepsilon(X)$ is the set of all $n$-dimensional vectors $\mathbf{x} = (x_1, x_2, \ldots, x_n)$ with i.i.d. variables such that

$$2^{-n(H(X)+\varepsilon)} \le p(\mathbf{x}) \le 2^{-n(H(X)-\varepsilon)} \tag{5.2}$$

Both definitions of the AEP are frequently used in the literature. It differs which one is used as the main definition, but most often both are presented. In the next example an interpretation of the $\varepsilon$-typical sequences is given.

EXAMPLE 5.1 Consider a binary sequence of length $n = 5$ with i.i.d. elements where $p_X(0) = 1/3$ and $p_X(1) = 2/3$. The entropy for each symbol is $H(X) = h(1/3) = 0.918$. In the following table all possible sequences are listed together with their respective probability.

x      p(x)        x      p(x)        x      p(x)
00000  0.0041      01011  0.0329 *    10110  0.0329 *
00001  0.0082      01100  0.0165      10111  0.0658 *
00010  0.0082      01101  0.0329 *    11000  0.0165
00011  0.0165      01110  0.0329 *    11001  0.0329 *
00100  0.0082      01111  0.0658 *    11010  0.0329 *
00101  0.0165      10000  0.0082      11011  0.0658 *
00110  0.0165      10001  0.0165      11100  0.0329 *
00111  0.0329 *    10010  0.0165      11101  0.0658 *
01000  0.0082      10011  0.0329 *    11110  0.0658 *
01001  0.0165      10100  0.0165      11111  0.1317
01010  0.0165      10101  0.0329 *

(The $\varepsilon$-typical sequences are marked with a *.)

As expected the all-zero sequence is the least probable sequence, while the all-one sequence is the most likely. However, even this most likely sequence is not very likely to happen; it only has probability 0.1317. Still, picking one single vector as a guess for what the outcome will be, this should be the one. In the case when the order of the symbols in the sequence is not important, it appears to be better to guess on the type of sequence, meaning the number of ones and zeros. The probability for a sequence containing $k$ ones and $5-k$ zeros is

$$P(k\ \text{ones}) = \binom{n}{k} \Big(\frac{2}{3}\Big)^{k} \Big(\frac{1}{3}\Big)^{n-k} = \binom{n}{k} \frac{2^{k}}{3^{n}}$$


    which was seen already in Chapter 3. Viewing these numbers in a table gives

k    P(#1 in x) = \binom{5}{k}\, 2^k / 3^5
0    0.0041
1    0.0412
2    0.1646
3    0.3292
4    0.3292
5    0.1317

Here it is clear that the most likely sequence, the all-one sequence, does not belong to the most likely type of sequence. When guessing the number of ones, it is more likely to get 3 or 4 ones. This is of course due to the fact that there are more vectors fulfilling this criterion than the single all-one vector. So, this concludes that vectors with 3 or 4 ones are, in a sense, the typical ones that will happen. The question is then how this relates to the previous definitions of typical sequences. To see this, choose an $\varepsilon$. Here 15% of the entropy is used, which gives $\varepsilon = 0.138$,

$$2^{-n(H(X)+\varepsilon)} = 2^{-5(h(1/3)+0.138)} \approx 0.0257$$
$$2^{-n(H(X)-\varepsilon)} = 2^{-5(h(1/3)-0.138)} \approx 0.0669$$

According to Definition 5.2 the $\varepsilon$-typical sequences are the ones with probabilities between these numbers. In the table above these sequences are marked with a *. Luckily, these are the same sequences as we intuitively concluded should be typical, i.e. sequences with 3 or 4 ones.
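The typicality test in Definition 5.2 is easy to carry out numerically. The following Python sketch (the variable names are ours, not the text's) lists all 32 sequences of Example 5.1 and marks the $\varepsilon$-typical ones; it reproduces the starred rows in the table above.

```python
import itertools
import math

# Source from Example 5.1: p(0) = 1/3, p(1) = 2/3, sequences of length n = 5.
p = {0: 1/3, 1: 2/3}
n = 5
H = -sum(q * math.log2(q) for q in p.values())   # h(1/3) ~ 0.918
eps = 0.15 * H                                   # 15% of the entropy, ~0.138

lower = 2 ** (-n * (H + eps))                    # ~0.0257
upper = 2 ** (-n * (H - eps))                    # ~0.0669

for x in itertools.product((0, 1), repeat=n):
    px = math.prod(p[s] for s in x)              # p(x) = prod_i p(x_i)
    mark = '*' if lower <= px <= upper else ' '  # Definition 5.2
    print(''.join(map(str, x)), f'{px:.4f}', mark)
```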

In the previous example it was seen that the typical sequences constitute the types of sequences that are most likely to appear. In this example very short sequences were used, to be able to list all of them, but for longer sequences it can be seen that the $\varepsilon$-typical sequences are just a fraction of all the sequences. On the other hand, it can also be seen that the probability for a random sequence to belong to the typical sequences is close to one. More formally, the following theorem can be stated.

THEOREM 5.2 Consider sequences of length $n$ of i.i.d. random variables. For each $\varepsilon > 0$ there exists an integer $n_0$ such that, for each $n > n_0$, the set of $\varepsilon$-typical sequences, $A_\varepsilon(X)$, fulfills

$$P\big(\mathbf{x} \in A_\varepsilon(X)\big) \ge 1 - \varepsilon \tag{5.3}$$
$$(1-\varepsilon)\, 2^{n(H(X)-\varepsilon)} \le |A_\varepsilon(X)| \le 2^{n(H(X)+\varepsilon)} \tag{5.4}$$

The first part of the theorem, (5.3), is a direct consequence of the law of large numbers, stating that $-\frac{1}{n}\log p(\mathbf{x})$ approaches $H(X)$ as $n$ grows. That means there exists an $n_0$ such that for all $n \ge n_0$

$$P\Big(\Big|-\frac{1}{n}\log p(\mathbf{x}) - H(X)\Big| < \varepsilon\Big) \ge 1 - \delta$$


for any $\delta$ between zero and one. Replacing $\delta$ with $\varepsilon$ gives

$$P\Big(\Big|-\frac{1}{n}\log p(\mathbf{x}) - H(X)\Big| < \varepsilon\Big) \ge 1 - \varepsilon$$

which is equivalent to (5.3). It shows that the probability for an arbitrary sequence to belong to the typical set approaches one as $n$ grows.

To show the second part, that the number of $\varepsilon$-typical sequences is bounded by (5.4), start with the left hand inequality. According to (5.3), for large enough $n_0$

$$1-\varepsilon \le P\big(\mathbf{x}\in A_\varepsilon(X)\big) = \sum_{\mathbf{x}\in A_\varepsilon(X)} p(\mathbf{x}) \le \sum_{\mathbf{x}\in A_\varepsilon(X)} 2^{-n(H(X)-\varepsilon)} = |A_\varepsilon(X)|\, 2^{-n(H(X)-\varepsilon)}$$

where the second inequality follows directly from the upper bound on $p(\mathbf{x})$ in (5.2). The right hand side of (5.4) can be shown by

$$1 = \sum_{\mathbf{x}\in X^n} p(\mathbf{x}) \ge \sum_{\mathbf{x}\in A_\varepsilon(X)} p(\mathbf{x}) \ge \sum_{\mathbf{x}\in A_\varepsilon(X)} 2^{-n(H(X)+\varepsilon)} = |A_\varepsilon(X)|\, 2^{-n(H(X)+\varepsilon)}$$

where $X^n$ denotes all possible sequences. The next example, which is heavily inspired by [13], shows the consequences of the theorem for longer sequences.

EXAMPLE 5.2 Let $X^n$ be the set of all binary random sequences of length $n$ with i.i.d. variables where $p(0) = 1/3$ and $p(1) = 2/3$. Let $\varepsilon = 0.046$, i.e. 5% of the entropy $h(1/3)$. Then the number of $\varepsilon$-typical sequences and their bounding functions are given in the next table for $n = 100$, $n = 500$ and $n = 1000$. As a comparison, the fraction of $\varepsilon$-typical sequences compared to the total number of sequences is also shown.

n      (1-ε)2^{n(H(X)-ε)}   |A_ε(X)|        2^{n(H(X)+ε)}    |A_ε(X)|/|X^n|
100    1.17·10^26           7.51·10^27      1.05·10^29       5.9·10^-3
500    1.90·10^131          9.10·10^142     1.34·10^145      2.78·10^-8
1000   4.16·10^262          1.00·10^287     1.79·10^290      9.38·10^-15

This table shows that the $\varepsilon$-typical sequences are only a fraction of the total number of sequences.

Next, the probability for the $\varepsilon$-typical sequences is given together with the probability for the most probable sequence, the all-one sequence. Here it can be clearly seen that the most likely sequence has a very low probability, and is in fact very unlikely to happen. Instead, the most likely event is that a random sequence is taken from the typical sequences, for which the probability approaches one.


n      P(A_ε(X))   P(x = 11...1)
100    0.660       2.4597·10^-18
500    0.971       9.0027·10^-89
1000   0.998       8.1048·10^-177
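Since typicality for this source depends only on the number of ones $k$ in the sequence, both $|A_\varepsilon(X)|$ and $P(A_\varepsilon(X))$ can be computed exactly as sums of binomial terms. The sketch below (our own variable names, same source as Example 5.2) should reproduce the order of magnitude of the two tables above.

```python
from math import comb, log2

# Same source as Example 5.2: p(1) = 2/3, so H = h(1/3) = log2(3) - 2/3.
p1 = 2 / 3
H = log2(3) - 2 / 3
eps = 0.05 * H                       # 5% of the entropy

for n in (100, 500, 1000):
    # -(1/n) log2 p(x) depends only on the number of ones k in the sequence.
    rate = lambda k: -(k * log2(p1) + (n - k) * log2(1 - p1)) / n
    ks = [k for k in range(n + 1) if abs(rate(k) - H) <= eps]
    size = sum(comb(n, k) for k in ks)                    # |A_eps(X)|
    prob = sum(comb(n, k) * 2**k / 3**n for k in ks)      # P(A_eps(X))
    print(n, f'|A| = {size:.3g}', f'P(A) = {prob:.3f}',
          f'P(all ones) = {p1**n:.4g}')
```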

    5.2 Source Coding Theorem

In Figure 5.1 a block model for a system using source coding is shown. The sequence from the source is fed to the source encoder where the redundancy is removed, that is, the sequence is compressed. The source decoder is the inverse function and reconstructs the source data. In the figure $X$ is an $n$-dimensional vector with i.i.d. random variables $X_i$ with $k$ outcomes. In general, the length $n$ of the vector is considered to be a random variable and the average length is $E[n]$. The corresponding compressed sequence, $Y$, is an $\ell$-dimensional vector of random variables $Y_j$ with $D$ outcomes. Also $\ell$ can be viewed as a random variable, so the average length of the codeword is $E[\ell]$. Normally, the code symbols will be considered to be drawn from the alphabet $\mathbb{Z}_D = \{0, 1, \ldots, D-1\}$. The decoder estimates the source vector as $\hat{X}$. In case of lossless source coding the aim is to have $\hat{X} = X$. This is typically the case when the source is text or program code, which needs to be reconstructed exactly as the source. If a certain amount of distortion is acceptable we talk about lossy source coding, which is the case for example in image, video or audio coding. Then the reconstructed message should be similar enough to the original, $\hat{X} \approx X$. What similar enough means varies from case to case. For example, we do not expect the same sound quality in a phone as we do for the stereo equipment. Therefore the phone sound can be compressed much harder than the sound on a CD. In this section, only lossless source coding will be considered. Lossy compression will be dealt with in Section 11 where the concept of distortion is introduced.

Figure 5.1: Block model for a source coding system.

In the previous section the entropy was considered to be a measure of the uncertainty of the outcome of a random variable. This can be interpreted as the amount of information needed to determine the outcome. It is then natural to consider the entropy as a limit for how much it is possible to compress the data, in view of its statistics. In the following it will not only be shown that the entropy is indeed a lower bound on the compression capability, but also that it is a reachable bound. The latter is shown by Shannon's source coding theorem, using the AEP.

The compression rate is defined as the ratio between the average length of the code vector and the average length of the source vector,

$$R = \frac{E[\ell]}{E[n]}$$

If the alphabet sizes are equal there is compression if $R$ is less than one.

The reason to view both $n$ and $\ell$ as random variables is that, to have compression, either the length of the source vector, or the code vector, or both, must vary. If both lengths were fixed for all messages there could not be any compression. In this section the source symbols will be considered to be random variables and the codewords varying-length vectors of $D$-ary symbols. The following definition will be used in this section.

DEFINITION 5.3 (SOURCE CODING) A source code is a mapping from the outcomes, $\{x_1, \ldots, x_k\}$, of a random variable $X$ to a vector of variable length, $\mathbf{y} = (y_1, \ldots, y_\ell)$, where $y_i \in \{0, 1, \ldots, D-1\}$. The length of the codeword $\mathbf{y}$ corresponding to the source symbol $x$ is denoted $\ell_x$.

Since the source symbols are assumed to have fixed length, the efficiency of the code can be measured by the average codeword length,

$$L = E[\ell_x] = \sum_x p(x)\,\ell_x$$

To derive the source coding theorem, stating the existence of a source code where the average codeword length approaches the entropy of the source, a code based on the AEP will be used. Letting the source symbols be $n$-dimensional vectors of i.i.d. variables, it is known from the previous section that the typical vectors constitute a small fraction of all vectors but are the ones most likely to happen. Starting with a list of all sequences, it can be partitioned in two parts, one with the typical sequences and one with the non-typical. To construct the codewords, use a binary prefix stating which set the codeword belongs to, e.g. use 0 for the typical sequences and 1 for the non-typical. Each of the sets is listed and indexed by binary vectors. The following two tables show the idea of the look-up tables.

Typical (prefix = 0)             Non-typical (prefix = 1)
x      Index vector              x      Index vector
x0     0...00                    xa     0......00
x1     0...01                    xb     0......01
...    ...                       ...    ...

Since the number of typical sequences is bounded by $|A_\varepsilon(X)| \le 2^{n(H(X)+\varepsilon)}$, the length of the corresponding codeword is

$$\ell_{\mathbf{x}} = \big\lceil \log |A_\varepsilon(X)| \big\rceil + 1 \le \log |A_\varepsilon(X)| + 2 \le n(H(X)+\varepsilon) + 2$$

Similarly, the length of codewords corresponding to the non-typical sequences can be bounded by

$$\ell_{\mathbf{x}} = \big\lceil \log k^n \big\rceil + 1 \le \log k^n + 2 = n\log k + 2$$
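As an illustration, the two-table construction can be sketched in Python for a binary source (the function name and the way the indices are assigned are our own choices): typical sequences get prefix 0 followed by a short index, all other sequences get prefix 1 followed by a longer index.

```python
import itertools
import math

def typical_set_code(p1, n, eps):
    """AEP-based code for a binary i.i.d. source with p(1) = p1 (illustrative)."""
    H = -p1 * math.log2(p1) - (1 - p1) * math.log2(1 - p1)

    def rate(x):                                   # -(1/n) log2 p(x)
        k = sum(x)
        return -(k * math.log2(p1) + (n - k) * math.log2(1 - p1)) / n

    seqs = list(itertools.product((0, 1), repeat=n))
    typ = [x for x in seqs if abs(rate(x) - H) <= eps]
    non = [x for x in seqs if abs(rate(x) - H) > eps]

    len_typ = max(1, math.ceil(math.log2(len(typ))))   # index bits, typical set
    len_non = max(1, math.ceil(math.log2(len(non))))   # index bits, the rest

    code = {}
    for i, x in enumerate(typ):                    # prefix '0' + short index
        code[x] = '0' + format(i, f'0{len_typ}b')
    for i, x in enumerate(non):                    # prefix '1' + long index
        code[x] = '1' + format(i, f'0{len_non}b')
    return code

# The source of Example 5.1: p(1) = 2/3, n = 5, eps = 0.138.
code = typical_set_code(2 / 3, 5, 0.138)
avg = sum((2 / 3) ** sum(x) * (1 / 3) ** (5 - sum(x)) * len(cw)
          for x, cw in code.items())
print(f'average codeword length per source symbol: {avg / 5:.3f}')
```

For such a short block the result is above 1 bit per symbol, so there is no compression yet; the point of the construction is that, by Theorem 5.2, the index length per symbol for the typical set approaches $H(X) + \varepsilon$ as $n$ grows.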


In the above reasoning it has been assumed that the symbols in the vectors are i.i.d. If, however, there is a dependency among the symbols we should consider random processes. In [24] it was shown that every ergodic source has the asymptotic equipartition property (AEP). Hence, if $\mathbf{x} = x_1, x_2, \ldots, x_n$ is a sequence from an ergodic source, then

$$-\frac{1}{n}\log p(\mathbf{x}) \xrightarrow{\;p\;} H_\infty(X), \quad n \to \infty$$

The set of typical sequences should then be defined as sequences such that

$$\Big|-\frac{1}{n}\log p(\mathbf{x}) - H_\infty(X)\Big| \le \varepsilon$$

This leads to the source coding theorem for ergodic sources [24], which is stated here without proof.

THEOREM 5.4 Let $\{X_n\}$ be a stationary ergodic process. Then there exists a code which maps sequences $\mathbf{x}$ of length $n$ into binary sequences such that the mapping is invertible and the average codeword length per symbol satisfies

$$\frac{1}{n}E[\ell(\mathbf{x})] \le H_\infty(X) + \varepsilon$$

where $\varepsilon$ can be made arbitrarily small for sufficiently large $n$.

    Generally, a random process is said to be ergodic if the law of large numbers is satisfied.

    5.3 Kraft inequality and optimal codeword length

The source coding theorem shows the existence of a code where the average codeword length approaches $H(X)$. The code is also such that it is uniquely decodable, since the same table can be used by the receiver. In the following it will be shown that the entropy is also a lower bound on the codeword length, under the condition that the encoded sequence should have a unique mapping back to the source symbols, i.e. that it is possible to decode.

In Table 5.1 the probability density function for a random variable $X$ is given. There are also five different examples of codes for this variable. The first code, $C_1$, is an example of a direct binary mapping with equal lengths, not taking the distribution into account. With four source symbols we need at least $\log 4 = 2$ bits in a vector to have unique codewords. This can be used as a reference code representing the uncoded case. Since all codewords are equal in length, the average codeword length will also be $L(C_1) = 2$. In the other codes symbols with high probability are mapped to short codewords, since this gives a low average codeword length.

The second code, $C_2$, has two symbols, $x_1$ and $x_2$, that are both mapped to the same codeword. This means it is impossible to find a unique mapping, and such a code should be avoided.


Uniquely decodable codes. Each sequence of source symbols is mapped to a sequence of code symbols, different from any other valid code sequence that might appear.

Prefix codes¹. No codeword is a prefix of any other codeword.

In the case of a prefix code each codeword can be decoded as soon as it is received, whereas for a uniquely decodable code it might be that the complete sequence of codewords must be received before decoding can start.

From the above reasoning we can conclude that prefix codes are desirable since they are easily decoded. Clearly, the class of prefix codes is a subclass of the uniquely decodable codes. One basic criterion for a code to be uniquely decodable is that the set of codewords is non-overlapping. That is, the class of uniquely decodable codes is a subclass of the non-singular codes. Furthermore, the class of non-singular codes is a subclass of all codes. In Figure 5.3 a graphical representation of the relation between the classes is shown.

Figure 5.3: All prefix codes are uniquely decodable, and all uniquely decodable codes are non-singular.

In the continuation of this section we will mainly consider prefix codes. For this analysis it is necessary to consider a tree structure. A general tree² has a root node which may have one or more child nodes. Each child node may also have one or more child nodes, and so on. A node that does not have any child nodes is called a leaf. In a $D$-ary tree each node has either zero or $D$ child nodes. In Figure 5.4 two examples of $D$-ary trees are shown, the left with $D = 2$ and the right with $D = 3$. Notice that the trees grow to the right from the root. Normally, in computer science trees grow downwards, but in many topics related to information theory and communication theory they are drawn from left to right.

The depth of a node is the number of branches from the root node to it in the tree. Starting with the root node, it has depth 0. In the left tree of Figure 5.4 the node labeled A has depth 2 and the node labeled B has depth 4. A tree is said to be full if all leaves are located at the same depth.

¹ The notation prefix code is a bit misleading since the code should contain no prefixes. However, it is today the most common name for this class of codes, see e.g. [3, 32], and will therefore be adopted in this text. Sometimes in the literature it is mentioned as a prefix condition code, e.g. [5], or prefix-free code, e.g. [22]. It is also mentioned as an instantaneous code, e.g. [2].

² Even more generally, in graph theory a tree is a directed graph where two nodes are connected by exactly one path.


Figure 5.4: Examples of a binary ($D = 2$) and a 3-ary tree.

In Figure 5.5 a full $D$-ary tree with $D = 2$ and depth 3 is shown. In a full $D$-ary tree of depth $d$ there are $D^d$ leaves.

    Figure 5.5: Example of a full binary tree of depth 3.

A prefix code of alphabet size $D$ can be represented in a $D$-ary tree. The first letter of the codeword is represented by a branch stemming from the root. The second letter is represented by a branch stemming from a node at depth 1, and so on. The end of a codeword is reached when there are no children, i.e. a leaf is reached. In this way a structure is built where no codeword in the tree can be a prefix of another codeword.

In Figure 5.6 the prefix code $C_5$ from Table 5.1 is shown in a binary tree representation. In this representation the probabilities for each source symbol are also added. The label in a tree node is the sum of the probabilities for the source symbols stemming from that node, i.e. the probability that a codeword goes through that node. Among the codes in Table 5.1, the reference code $C_1$ is also a prefix code. In Figure 5.7 a representation of this code is shown in a tree. Since all codewords are of equal length we get a full binary tree.

x     p(x)   C5
x1    1/2    0
x2    1/4    10
x3    1/8    110
x4    1/8    111

Figure 5.6: Representation of the prefix code C5 in a binary tree.

There are many advantages of the tree representation. One is that it gives a graphical interpretation of the code, which on many occasions is a great help for the intuitive understanding. Another advantage is that there is an alternative way to calculate the average codeword length, formulated in the next lemma [22].


To show this, consider a $D$-ary prefix code where the longest codeword length is $\ell_{\max} = \max_x \ell_x$. This code can be represented in a $D$-ary tree. A full $D$-ary tree of depth $\ell_{\max}$ has $D^{\ell_{\max}}$ leaves. In the tree representation the codeword corresponding to $x_i$ is a length-$\ell_i$ path from the root. Since it should be a prefix code, the sub-tree spanned with $x_i$ as a root is not allowed to be used by any other codeword, hence it should be removed from the tree, see Figure 5.8. So, when considering the codeword for symbol $x_i$, $D^{\ell_{\max}-\ell_i}$ leaves are removed from the tree. The number of removed leaves cannot exceed the total number of leaves in the full $D$-ary tree of depth $\ell_{\max}$,

$$\sum_i D^{\ell_{\max}-\ell_i} \le D^{\ell_{\max}}$$

By cancelling $D^{\ell_{\max}}$ on both sides, this proves that for a prefix code

$$\sum_i D^{-\ell_i} \le 1$$


    Figure 5.8: Tree construction for a general prefix code.

To show that if the inequality is fulfilled it is possible to construct a prefix code, start by assuming that $\sum_i D^{-\ell_i} \le 1$ and that the codeword lengths are ordered, $\ell_1 \le \ell_2 \le \cdots \le \ell_k$, where $\ell_k = \ell_{\max}$. Then use the same construction as above; start with the shortest codeword and remove the corresponding sub-tree. After $i < k$ steps the number of leaves left is

$$D^{\ell_{\max}} - \sum_{n=1}^{i} D^{\ell_{\max}-\ell_n} = D^{\ell_{\max}}\Big(1 - \sum_{n=1}^{i} D^{-\ell_n}\Big) > 0$$

where the last inequality comes from the assumption, i.e. $\sum_{n=1}^{i} D^{-\ell_n} < \sum_{n=1}^{k} D^{-\ell_n} \le 1$ since $i < k$. Hence, in every step there is at least one leaf left where the next codeword can be placed, and a prefix code with the given lengths can be constructed.


EXAMPLE 5.4 To construct a binary prefix code with codeword lengths $\{2, 2, 2, 3, 4\}$, first check that it is possible according to Kraft inequality. The derivation

$$\sum_i 2^{-\ell_i} = 2^{-2} + 2^{-2} + 2^{-2} + 2^{-3} + 2^{-4} = 3\cdot\frac{1}{4} + \frac{1}{8} + \frac{1}{16} = \frac{15}{16} \le 1$$

shows that it is possible to construct such a code: the leaves in a binary tree of depth 4 are enough to fit these lengths.
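This check, and the tree construction used in the proof, can be sketched in a few lines of Python (the function names are ours, not the text's). Codewords are assigned in order of increasing length; whenever Kraft inequality holds, the counter never runs out of nodes at the current depth.

```python
def kraft_sum(lengths, D=2):
    """Left-hand side of Kraft inequality for the given codeword lengths."""
    return sum(D ** (-l) for l in lengths)

def prefix_code_from_lengths(lengths):
    """Binary prefix code with the given lengths, built by walking a tree from
       the shortest length upwards; returns None if Kraft inequality fails."""
    if kraft_sum(lengths) > 1:
        return None
    codewords, value, prev = [], 0, 0
    for l in sorted(lengths):
        value <<= l - prev            # descend to depth l in the tree
        codewords.append(format(value, f'0{l}b'))
        value += 1                    # next unused node at this depth
        prev = l
    return codewords

print(kraft_sum([2, 2, 2, 3, 4]))                 # 0.9375 = 15/16 <= 1
print(prefix_code_from_lengths([2, 2, 2, 3, 4]))  # ['00', '01', '10', '110', '1110']
```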

As a complement to Kraft inequality we also give a generalization by McMillan for uniquely decodable codes.

THEOREM 5.7 (MCMILLAN) There exists a uniquely decodable $D$-ary code with codeword lengths $\ell_1, \ell_2, \ldots, \ell_k$ if and only if

$$\sum_{i=1}^{k} D^{-\ell_i} \le 1$$

Since a prefix code is also uniquely decodable, the existence of such a code follows directly from Theorem 5.6. Then, to show that a uniquely decodable code satisfies Kraft inequality,


assume that the codeword lengths in the code are given by $\ell_1, \ldots, \ell_k$ and that the maximum length is $\ell_{\max} = \max_i \ell_i$. Then the sum can be rewritten as

$$\sum_{i=1}^{k} D^{-\ell_i} = \sum_{j=1}^{\ell_{\max}} w_j D^{-j}$$

where $w_j$ denotes the total number of codewords of length $j$. Consider the $n$th power of this sum,

$$\Big(\sum_{i=1}^{k} D^{-\ell_i}\Big)^n = \Big(\sum_{j=1}^{\ell_{\max}} w_j D^{-j}\Big)^n = \sum_{j_1=1}^{\ell_{\max}} \cdots \sum_{j_n=1}^{\ell_{\max}} (w_{j_1}\cdots w_{j_n})\, D^{-(j_1+\cdots+j_n)} = \sum_{\ell=n}^{n\ell_{\max}} W_\ell D^{-\ell}$$

where

$$W_\ell = \sum_{j_1+\cdots+j_n=\ell} (w_{j_1}\cdots w_{j_n})$$

That is, $W_\ell$ is the total number of code sequences of length $\ell$ obtained by concatenating $n$ codewords. Since the code is uniquely decodable, all such vectors must be distinct, and the number of length-$\ell$ code vectors cannot exceed the total number of length-$\ell$ vectors, $D^\ell$. In other words $W_\ell D^{-\ell} \le 1$, which gives

$$\Big(\sum_{i=1}^{k} D^{-\ell_i}\Big)^n \le \sum_{\ell=n}^{n\ell_{\max}} 1 \le n\ell_{\max}$$

Taking the $n$th root of this expression gives $\sum_{i=1}^{k} D^{-\ell_i} \le \sqrt[n]{n\ell_{\max}}$, which should hold for any number of concatenated codewords $n$. Considering infinitely long code sequences completes the proof of McMillan's inequality,

$$\sum_{i=1}^{k} D^{-\ell_i} \le \lim_{n\to\infty} \sqrt[n]{n\ell_{\max}} = 1$$

In the continuation of this text we will consider prefix codes, even though many of the results are based on Kraft's inequality and therefore also hold for uniquely decodable codes. The reason for this is that it is much easier to construct and make use of prefix codes, and in light of McMillan's inequality there is no gain in the average codeword length in considering the greater class of uniquely decodable codes. What it says is that, given a uniquely decodable code, it is always possible to construct a prefix code with the same set of codeword lengths.

With Kraft inequality we have the mathematical foundation needed to set up an optimisation problem for the codeword lengths. One standard method for minimisation of a function under a side constraint is the Lagrange multiplier method.


This implies that $L \ge H_D(X)$ for all prefix codes. In the first inequality the IT-inequality was used and in the second Kraft inequality. Both inequalities are satisfied with equality if and only if $\ell_i = -\log_D p(x_i)$. When deriving the optimal codeword lengths the logarithmic function typically will not give integer values, so in practice it might not be possible to reach this lower limit. The above result is summarised in the following theorem.

THEOREM 5.8 The average codeword length $L = E[\ell_x]$ for a prefix code is lower bounded by the entropy of the source,

$$L \ge H_D(X) = \frac{H(X)}{\log D} \tag{5.8}$$

with equality if and only if $\ell_x = -\log_D p(x)$.

From the previous theorem on the optimal codeword length and the construction method in the proof of Kraft inequality, it is possible to find an algorithm for a code design. The optimal codeword length for source symbol $x_i$ is $\ell_i^{(opt)} = -\log_D p(x_i)$. To assure that the codeword length is an integer, use instead

$$\ell_i = \big\lceil -\log_D p(x_i) \big\rceil$$

From the following derivation it can be seen that Kraft inequality is still satisfied, which shows that it is possible to construct a corresponding code,

$$\sum_i D^{-\ell_i} = \sum_i D^{-\lceil -\log_D p(x_i)\rceil} \le \sum_i D^{-(-\log_D p(x_i))} = \sum_i p(x_i) = 1$$

This code construction is named the Shannon-Fano code. Since the codeword length is the upper integer part of the optimal length, it can be bounded as

$$-\log_D p(x_i) \le \ell_i < -\log_D p(x_i) + 1$$
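As a small illustration (our own function name, binary case only), the Shannon-Fano lengths and the resulting average length can be computed directly; since the lengths satisfy Kraft inequality, a prefix code with these lengths can then be laid out in a tree exactly as in the construction above.

```python
import math

def shannon_fano_lengths(probs):
    """Binary Shannon-Fano codeword lengths, l_i = ceil(-log2 p_i)."""
    return [math.ceil(-math.log2(p)) for p in probs]

probs = [0.45, 0.25, 0.20, 0.10]            # the source used in Example 5.6 below
lengths = shannon_fano_lengths(probs)       # [2, 2, 3, 4]
L = sum(p * l for p, l in zip(probs, lengths))
H = -sum(p * math.log2(p) for p in probs)
print(lengths, f'L = {L:.2f} bits, H = {H:.2f} bits')   # L = 2.40, H = 1.82
```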


However, the following example shows that the Shannon-Fano code is not necessarily an optimal code.

EXAMPLE 5.6 Consider a random variable with four outcomes according to the table below. The table also lists the optimal codeword lengths and the lengths of the codewords in a Shannon-Fano code.

x     p(x)    -log p(x)    l = ceil(-log p(x))
x1    0.45    1.152        2
x2    0.25    2            2
x3    0.20    2.32         3
x4    0.10    3.32         4

From above we know that Kraft inequality is fulfilled, but as a further clarification we derive it,

$$\sum_i 2^{-\ell_i} = \frac{1}{4} + \frac{1}{4} + \frac{1}{8} + \frac{1}{16} = \frac{11}{16} \le 1$$


Equivalently, the average codeword length per source symbol is

$$\frac{1}{n}H(X_1 X_2 \ldots X_n) \le \frac{1}{n}E[\ell_{\mathbf{x}}] < \frac{1}{n}H(X_1 X_2 \ldots X_n) + \frac{1}{n}$$

As the length of the vector, $n$, grows, the term $\frac{1}{n}$ will tend to zero and both sides of the inequality will approach the entropy rate, since $\frac{1}{n}H(X_1 X_2 \ldots X_n) \to H_\infty(X)$. Expressed more formally, the following corollary can be formulated.

COROLLARY 5.10 If $X_1 X_2 \ldots X_n$ forms a stationary random process, the average codeword length per source symbol for an optimal binary prefix code satisfies

$$L \to H_\infty(X), \quad n \to \infty$$

where $H_\infty(X)$ is the entropy rate for the process.

    5.4 Huffman Codes

The Shannon-Fano code construction introduced in the previous section gives a code with average codeword length at most one more than the entropy, which is within the limit for an optimal code. However, in Example 5.6 it was seen that the construction does not necessarily give an optimal code. By an optimal code is meant a code with minimum average codeword length over all prefix codes for the source. In 1951 Robert Fano at MIT lectured the first ever course in information theory. One of the students, David Huffman, developed as part of a class assignment an algorithm for constructing an optimal code. The code was later published [11] and the produced code is normally referred to as a Huffman code. In the following the algorithm will be described for the binary case, and it will then be shown that the procedure generates an optimal code. The algorithm will also be generalized to construct D-ary codes.

The idea of the Huffman code construction is to list all source symbols as nodes. Then, iteratively find the two least probable nodes and merge them in a binary tree, letting the root represent a new node instead of the two merged ones. Written as an algorithm we get the following.

ALGORITHM 5.1 (BINARY HUFFMAN CODE)
To construct the binary code tree for a random variable with $k$ outcomes:

1. Sort the symbols according to their probabilities.

2. Let $x_i$ and $x_j$, with probabilities $p_i$ and $p_j$, respectively, be the two least probable symbols in the list.
   Remove $x_i$ and $x_j$ from the list and connect them in a binary tree. Add the root node $\{x_i, x_j\}$ as one symbol with probability $p_{ij} = p_i + p_j$ to the list.

3. If one symbol remains in the list: STOP.
   Else: GOTO 2.
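A compact Python sketch of Algorithm 5.1 follows (the function name and the use of a priority queue are our choices, not the text's); instead of drawing the tree it carries, for each node, the partial codewords of the symbols below it. Applied to the probabilities of the upcoming Example 5.7 it returns the same code table as in Figure 5.10.

```python
import heapq

def huffman_code(probs):
    """Binary Huffman code for a dict {symbol: probability} (Algorithm 5.1)."""
    # Each heap entry: (probability, tie-break counter, {symbol: partial codeword}).
    heap = [(p, i, {x: ''}) for i, (x, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p0, _, code0 = heapq.heappop(heap)      # the two least probable nodes
        p1, _, code1 = heapq.heappop(heap)
        merged = {x: '0' + c for x, c in code0.items()}
        merged.update({x: '1' + c for x, c in code1.items()})
        heapq.heappush(heap, (p0 + p1, counter, merged))
        counter += 1
    return heap[0][2]

probs = {'x1': 1/2, 'x2': 1/4, 'x3': 1/8, 'x4': 1/8}     # Example 5.7 below
code = huffman_code(probs)
L = sum(probs[x] * len(c) for x, c in code.items())
print(code, f'L = {L} bits')                              # L = 1.75 bits
```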


Before showing the optimality of the constructed code we will study two examples. The first is a small example that we will go through in detail, and then follows a slightly bigger example.

EXAMPLE 5.7 The random variable $X$ has four outcomes with probabilities

x      x1    x2    x3    x4
p(x)   1/2   1/4   1/8   1/8

In Figure 5.9(a) each of the symbols in $X$ is represented by one node. To start the algorithm, find the two nodes with the least probabilities. In this case it is the nodes $x_3$ and $x_4$ with probability 1/8 each. Merge these nodes in a binary tree and add the root as a node to the list. Now the list contains three nodes, $x_1$, $x_2$ and $x_3x_4$, where the last one represents the newly added tree. This list is shown in Figure 5.9(b). Since there is more than one node left, continue from the beginning and find the two least probable nodes. It is the nodes $x_2$ and $x_3x_4$ that should be merged, see Figure 5.9(c). Then there are two nodes left, $x_1$ and $x_2x_3x_4$, and they represent the two least probable nodes. Merge and let the root represent a new node to get Figure 5.9(d). After this there is only one node left and the algorithm stops.


    Figure 5.9: Construction of the Huffman tree.

The average codeword length in the constructed code is, according to the path length lemma,

$$L = 1 + \frac{1}{2} + \frac{1}{4} = 1.75\ \text{bits}$$

    In Figure 5.10 a redrawn tree is shown together with the code table.

x     p(x)   y
x1    1/2    0
x2    1/4    10
x3    1/8    110
x4    1/8    111

Figure 5.10: The tree and the code table for the constructed code.

From Theorem 5.8 we know that the average codeword length is lower bounded by the entropy of the source. So, to compare, we derive the entropy as

$$H\Big(\frac{1}{2}, \frac{1}{4}, \frac{1}{8}, \frac{1}{8}\Big) = 1.75\ \text{bits}$$
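A one-line numerical check of this value (illustrative): the bound is met with equality here because every probability is a negative power of two, so $\ell_x = -\log p(x)$ is an integer for every symbol.

```python
from math import log2

probs = [1/2, 1/4, 1/8, 1/8]
print(-sum(p * log2(p) for p in probs))   # 1.75 bits, equal to L above
```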


To show that Huffman's algorithm gives an optimal code we first make a couple of observations that apply to an optimal code. Then it can be shown that by applying these observations the constructed code has to be optimal, and that it all boils down to the construction given by Huffman.

The first observation is that in an optimal code the lengths of the codewords must follow the probabilities of the source symbols. That is, codewords corresponding to source symbols with low probability should not be shorter than codewords corresponding to more probable source symbols. Intuitively it is clear that if a less probable symbol has a shorter codeword than a more probable symbol, we will get a shorter average length if we swap the codewords. To show it mathematically, assume that the source symbols $x_i$ and $x_j$ have probabilities $p_i$ and $p_j$, where $p_i < p_j$, and that the corresponding codeword lengths $\ell_i$ and $\ell_j$ satisfy $\ell_i \le \ell_j$. Notice that the codeword lengths are here chosen such that the lowest probability is mapped to the shortest codeword. Then, by swapping the codewords the average codeword length will decrease,

$$L = \sum_{m} p_m \ell_m = p_i \ell_i + p_j \ell_j + \sum_{m\ne i,j} p_m \ell_m$$
$$= p_i \ell_i + p_j \ell_j + p_i \ell_j + p_j \ell_i - p_i \ell_j - p_j \ell_i + \sum_{m\ne i,j} p_m \ell_m$$
$$= \underbrace{(p_j - p_i)}_{>0}\underbrace{(\ell_j - \ell_i)}_{\ge 0} + \underbrace{p_i \ell_j + p_j \ell_i + \sum_{m\ne i,j} p_m \ell_m}_{\tilde{L}} \ge \tilde{L}$$

where $\tilde{L}$ is the average length when the codewords are swapped. This shows that by swapping the codewords we get $L \ge \tilde{L}$, with equality when $\ell_i = \ell_j$. Hence, in an optimal code codewords corresponding to less probable symbols are not shorter than codewords corresponding to more probable symbols.

The second observation is that in a tree corresponding to an optimal code there are no unused leaves. This can be seen from an assumption that there is an optimal code with an unused leaf in the tree. Then the parent node of this leaf has only one used branch. By removing this last branch and placing the leaf at the parent node a shorter codeword is obtained, and therefore the average length decreases. Hence, the assumption that the code was optimal cannot be true.

The third observation is that the two least probable codewords are of the same length, and that the tree can be constructed such that they differ only in the last bit. Assume that the least probable symbol has a codeword $y_1 \ldots y_{\ell-1} 0$. Then, since there are no unused leaves, there exists a codeword $y_1 \ldots y_{\ell-1} 1$. If this is not the second least probable codeword, it follows from the first observation that the second least probable codeword has the same length. So, this codeword can be swapped with $y_1 \ldots y_{\ell-1} 1$ to get the desired result, without any change in the average length.

Then a binary code for a random variable $X$ with $k$ outcomes corresponds to a binary tree with $k$ leaves, since there are no unused leaves. The two least probable symbols, $x_k$ and $x_{k-1}$, corresponding to the probabilities $p_k$ and $p_{k-1}$, can be assumed to be located as siblings in the tree, see Figure 5.12. The parent node of these leaves has the probability $\tilde{p}_{k-1} = p_k + p_{k-1}$.


Figure 5.12: A code tree for an optimal code.

Then a fourth observation is that a new code tree with $k-1$ leaves can be constructed by replacing the codewords for $x_k$ and $x_{k-1}$ by their parent node, called $\tilde{x}_{k-1}$ in the figure. Denote by $L$ the average length in the original code with $k$ codewords and by $\tilde{L}$ the average length in the new code. From the path length lemma

$$L = \tilde{L} + \tilde{p}_{k-1} = \tilde{L} + (p_k + p_{k-1})$$

Since $p_k$ and $p_{k-1}$ are the two least probabilities, the code with $k$ codewords can only be optimal if the code with $k-1$ elements is optimal. Continuing this reasoning until there are only two codewords left, the optimal code has the codewords 0 and 1. The steps taken here to construct an optimal code are exactly the same steps as used in the Huffman algorithm. Hence, we conclude that a Huffman code is an optimal code.

THEOREM 5.11 A binary Huffman code is an optimal (prefix) code.

On many occasions there can be more than one way to merge the nodes in the algorithm, for example if there are more than two nodes with the same least probability. That means the algorithm can produce different codes depending on which merges are chosen. Independent of which code is considered, the average codeword length will be minimal, as in the following example.

EXAMPLE 5.9 The random variable $X$ with five outcomes has the probabilities

x      x1    x2    x3    x4    x5
p(x)   0.4   0.2   0.2   0.1   0.1

The Huffman algorithm can produce two different trees, and thus two different codes, for these statistics. In Figure 5.13 the two trees are shown. The difference in the construction comes after the first step in the algorithm. Then there are three nodes with least probability, $x_2$, $x_3$ and $x_4x_5$. In the first alternative the nodes $x_3$ and $x_4x_5$ are merged into $x_3x_4x_5$ with probability 0.4, and in the second alternative the two nodes $x_2$ and $x_3$ are merged to $x_2x_3$.

The average codeword length for the two alternatives is given by the same calculation,

$$L_1 = L_2 = 1 + 0.6 + 0.4 + 0.2 = 2.2\ \text{bits}$$

In the example both codes give the same (optimal) average codeword length. However, the difference can be of importance from another perspective.



    Figure 5.13: Two alternative Huffman trees for one source.

The source coding gives variations in the length of the coded symbols, i.e. the rate out of the encoder varies. In, for example, video coding this might be an important design factor when choosing codes. Here the rate is approximately 2-3 Mb/s on average, but the peak levels can go as high as 6-8 Mb/s for standard definition. For high definition (HD) the problem is even more pronounced. However, in most communication schemes the transmission is done with a fixed maximum rate. To handle this mismatch the transmitter and receiver are often equipped with buffers. At the same time the delays in the system should be kept as small as possible, and therefore the buffer sizes should also be small. This implies that the variations in the rate from the source encoder should be as small as possible. In the example, for the first alternative of the code tree the variation in length is much larger than in the second alternative. This will be reflected in the variations in the rate of the code symbols. One way to construct minimum variance Huffman codes is to always merge the shortest sub-trees when there is a choice. In the example Alternative 2 is a minimum variance Huffman code, and might be preferable to the first alternative.

    5.4.1 D-ary Huffman code

So far only binary Huffman codes have been considered. In most applications this is enough, but there are also cases when a larger alphabet is required. In these cases D-ary Huffman codes should be considered. The algorithm and the theory are in many aspects very similar. Instead of a binary tree a $D$-ary tree is constructed. Such a tree with depth 1 has $D$ leaves. Then it can be expanded by adding $D$ children to one of the leaves. In the original tree one leaf has then become an internal node and there are $D$ new leaves, so there are $D-1$ additional leaves. Every following such expansion will also give $D-1$ additional leaves, which gives the following lemma.

LEMMA 5.12 The number of leaves in a $D$-ary tree is

$$D + q(D-1)$$

for some non-negative integer $q$.

This means that a $D$-ary tree corresponding to an optimal code must have at most $D-2$ unused leaves. Furthermore, these unused leaves must be located at the same depth, and it is possible to rearrange the tree such that they stem from the same parent node. This observation corresponds, in the binary case, to the fact that there are no unused leaves.


The optimal code construction for the $D$-ary case can then be shown in a similar way as for the binary case. It ends up with the algorithm described next.

ALGORITHM 5.2 (D-ARY HUFFMAN CODE)
To construct the code tree for a random variable with $k$ outcomes:

1. Sort the source symbols according to their probabilities.
   Fill up with zero-probability nodes so that $K = D + q(D-1) \ge k$.

2. Connect the $D$ least probable symbols in a $D$-ary tree and remove them from the list. Add the root of the tree as a symbol in the list.

3. If one symbol remains in the list: STOP.
   Else: GOTO 2.
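A minimal Python sketch of Algorithm 5.2 follows (the function name and the heap-based bookkeeping are ours). The digit labels on the branches may come out differently from Figure 5.14, but the codeword lengths, and therefore the average length, are the same.

```python
import heapq
import math

def huffman_code_dary(probs, D=3):
    """D-ary Huffman code for a dict {symbol: probability} (Algorithm 5.2)."""
    k = len(probs)
    # Step 1: pad with zero-probability dummy symbols so that K = D + q(D-1) >= k.
    q = max(0, math.ceil((k - D) / (D - 1)))
    dummies = D + q * (D - 1) - k
    heap = [(p, i, {x: ''}) for i, (x, p) in enumerate(probs.items())]
    heap += [(0.0, k + j, {}) for j in range(dummies)]      # unused leaves
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        # Step 2: merge the D least probable nodes into one subtree.
        nodes = [heapq.heappop(heap) for _ in range(D)]
        merged, p_sum = {}, 0.0
        for digit, (p, _, code) in enumerate(nodes):
            p_sum += p
            merged.update({x: str(digit) + c for x, c in code.items()})
        heapq.heappush(heap, (p_sum, counter, merged))
        counter += 1
    return heap[0][2]                                       # Step 3: one node left

probs = {'x1': 0.27, 'x2': 0.23, 'x3': 0.20, 'x4': 0.15, 'x5': 0.10, 'x6': 0.05}
code = huffman_code_dary(probs, D=3)                        # the source of Example 5.10
L = sum(probs[x] * len(c) for x, c in code.items())
print(code, f'L = {L:.2f}')                                 # L = 1.65
```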

    The procedure is shown with an example.

EXAMPLE 5.10 Construct an optimal code with $D = 3$ for the random variable $X$ with $k = 6$ outcomes and probabilities

x      x1     x2     x3     x4     x5     x6
p(x)   0.27   0.23   0.20   0.15   0.10   0.05

First, find the number of unused leaves in the tree. There will be $K = D + q(D-1)$ leaves in the tree, where $K$ is the least such integer with $K \ge k$ and $q$ an integer. That is,

$$q = \Big\lceil \frac{k-D}{D-1} \Big\rceil = \Big\lceil \frac{6-3}{3-1} \Big\rceil = \Big\lceil \frac{3}{2} \Big\rceil = 2$$

and the number of leaves in the 3-ary tree is

$$K = 3 + 2(3-1) = 7$$

Since only 6 codewords are used there will be one unused leaf in the tree. To incorporate this in the algorithm, add one symbol, $x_7$, with probability 0. Then the code tree can be constructed as in Figure 5.14. The branches are labeled with the code alphabet $\{0, 1, 2\}$. In the figure the tree representation is also translated into a code table. The average length can, as before, be calculated with the path length lemma,

$$L = 1 + 0.5 + 0.15 = 1.65$$

    As a comparison, the lower bound of the length is the 3-ary entropy,

$$H_3(X) = \frac{H(X)}{\log 3} \approx \frac{2.42}{1.59} = 1.52$$

We see that the lower bound is not reached. Still, this is an optimal code and it is not possible to find a prefix code with a smaller average length.


x     y
x1    0
x2    1
x3    20
x4    21
x5    220
x6    221

Figure 5.14: A 3-ary tree for a Huffman code.

To start the algorithm the number of unused leaves in the tree must be found. The relation between the number of source symbol alternatives $k$ and the number of leaves in the tree is

$$D + (q-1)(D-1) < k \le D + q(D-1)$$

By rearrangement, this is equivalent to

$$q - 1 < \frac{k-D}{D-1} \le q$$

Since $q$ is an integer it can be derived as

$$q = \Big\lceil \frac{k-D}{D-1} \Big\rceil$$

which was also used in the previous example. Then the total number of leaves in the tree is $K = D + q(D-1)$ and the number of unused leaves becomes $m = K - k = D + q(D-1) - k$. Assuming that all unused leaves are located in the same subtree, i.e. have the same parent node, the corresponding number of used leaves in that subtree is $N = D - m$. Then, the total number of used leaves in the tree is the same as the total number of symbols, and

$$k = q(D-1) + N$$

In an optimal code there must be at least two used leaves in the subtree with the unused leaves, i.e. $2 \le N \le D$, or equivalently, $0 \le N - 2 < D - 1$. Subtracting two from the above equation yields

$$k - 2 = q(D-1) + N - 2$$

where it is assumed that $k \ge 2$. From Euclid's division theorem⁴ it follows that $N - 2$ is the remainder when $k - 2$ is divided by $D - 1$, i.e. $N - 2 = R_{D-1}(k-2)$. Expressed as a theorem we have the following.

⁴ Euclid's division theorem: Let $a$ and $b$ be two positive integers. Then there exist two unique integers $q$ and $r$ such that

$$a = qb + r, \quad 0 \le r < b$$

where $r = R_b(a)$ is the remainder and $q = \lfloor a/b \rfloor$ the quotient. An alternative notation for the remainder is the modulo operator, $a \equiv r \bmod b$.


THEOREM 5.13 The number of unused leaves in the tree for an optimal $D$-ary prefix code with $k$ codewords is

$$m = D - N$$

where

$$N = R_{D-1}(k-2) + 2$$

is the number of nodes merged in the first iteration of Huffman's algorithm.

With this result the algorithm for constructing a $D$-ary Huffman code can be slightly changed. Instead of filling up with $m$ zero-probability dummy symbols in the first step, the number of merged nodes in the first iteration is set to $N$, i.e. the number of used leaves in the subtree with the unused leaves. For Example 5.10 this means that $N = R_2(4) + 2 = 2$ nodes should be merged in the first iteration. The result is then a tree where one of the subtrees might not be full. For the case of Example 5.10 the tree is shown in Figure 5.15. The result is of course the same as in Figure 5.14 with the dummy node absent.
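The bookkeeping in Theorem 5.13 can be checked with a few lines of Python (the helper name is ours, not the text's):

```python
import math

def first_merge_size(k, D):
    """For an optimal D-ary code with k >= 2 codewords: the number of nodes N
       merged in the first iteration, the unused leaves m, and q (Theorem 5.13)."""
    N = (k - 2) % (D - 1) + 2           # N = R_{D-1}(k - 2) + 2
    m = D - N                           # unused leaves in the code tree
    q = math.ceil((k - D) / (D - 1))    # number of D-ary expansions of the tree
    return N, m, q

print(first_merge_size(k=6, D=3))       # (2, 1, 2), as in Example 5.10
```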


Figure 5.15: A 3-ary tree for a Huffman code.