
Information Theory for Single-User Systems

Part I

by

Fady Alajaji† and Po-Ning Chen‡

†Department of Mathematics & Statistics, Queen’s University, Kingston, ON K7L 3N6, Canada

Email: [email protected]

‡Department of Electrical Engineering
Institute of Communication Engineering
National Chiao Tung University
1001, Ta Hsueh Road
Hsin Chu, Taiwan 30056
Republic of China

Email: [email protected]

August 27, 2014


© Copyright by Fady Alajaji† and Po-Ning Chen‡

August 27, 2014


Preface

The reliable transmission of information bearing signals over a noisy communication channel is at the heart of what we call communication. Information theory—founded by Claude E. Shannon in 1948—provides a mathematical framework for the theory of communication; it describes the fundamental limits to how efficiently one can encode information and still be able to recover it with negligible loss. These lecture notes will examine basic and advanced concepts of this theory with a focus on single-user (point-to-point) systems. What follows is a tentative list of topics to be covered.

1. Part I:

(a) Information measures for discrete systems: self-information, entropy, mutual information and divergence, data processing theorem, Fano’s inequality, Pinsker’s inequality, simple hypothesis testing and the Neyman-Pearson lemma.

(b) Fundamentals of lossless source coding (data compression): discrete memoryless sources, fixed-length (block) codes for asymptotically lossless compression, Asymptotic Equipartition Property (AEP) for discrete memoryless sources, coding for stationary ergodic sources, entropy rate and redundancy, variable-length codes for lossless compression, prefix codes, Kraft inequality, Huffman codes, Shannon-Fano-Elias codes and Lempel-Ziv codes.

(c) Fundamentals of channel coding: discrete memoryless channels, block codes for data transmission, channel capacity, coding theorem for discrete memoryless channels, calculation of channel capacity, channels with symmetry structures.

(d) Information measures for continuous alphabet systems and Gaussian channels: differential entropy, mutual information and divergence, AEP for continuous memoryless sources, capacity and channel coding theorem of discrete-time memoryless Gaussian channels, capacity of uncorrelated parallel Gaussian channels and the water-filling principle, capacity of correlated Gaussian channels, non-Gaussian discrete-time memoryless channels, capacity of band-limited (continuous-time) white Gaussian channels.

(e) Fundamentals of lossy source coding and joint source-channel coding: distortion measures, rate-distortion theorem for memoryless sources, rate-distortion function and its properties, rate-distortion function for memoryless Gaussian sources, lossy joint source-channel coding theorem.

(f) Overview on suprema and limits (Appendix A).

(g) Overview in probability and random processes (Appendix B): random variable and random process, statistical properties of random processes, Markov chains, convergence of sequences of random variables, ergodicity and laws of large numbers, central limit theorem, concavity and convexity, Jensen’s inequality, Lagrange multipliers and the Karush-Kuhn-Tucker condition for the optimization of convex functions.

2. Part II:

(a) General information measures: information spectra and quantiles and their properties, Renyi’s information measures.

(b) Lossless data compression for arbitrary sources with memory: fixed-length lossless data compression theorem for arbitrary sources, variable-length lossless data compression theorem for arbitrary sources, entropy of English, Lempel-Ziv codes.

(c) Randomness and resolvability: resolvability and source coding, approximation of output statistics for arbitrary systems with memory.

(d) Coding for arbitrary channels with memory: channel capacity for arbitrary single-user channels, optimistic Shannon coding theorem, strong converse, ε-capacity.

(e) Lossy data compression for arbitrary sources with memory.

(f) Hypothesis testing: error exponent and divergence, large deviations theory, Berry-Esseen theorem.

(g) Channel reliability: random coding exponent, expurgated exponent, partitioning exponent, sphere-packing exponent, the asymptotic largest minimum distance of block codes, Elias bound, Varshamov-Gilbert bound, Bhattacharyya distance.

(h) Information theory of networks: distributed detection, data compression for distributed sources, capacity of multiple access channels, degraded broadcast channel, Gaussian multiple terminal channels.


As shown from the list, the lecture notes are divided into two parts. The first part is suitable for a 12-week introductory course such as the one taught at the Department of Mathematics and Statistics, Queen’s University at Kingston, Ontario, Canada. It also meets the need of a fundamental course for senior undergraduates such as the one given at the Department of Computer Science and Information Engineering, National Chi Nan University, Taiwan. For an 18-week graduate course such as the one given in the Department of Electrical and Computer Engineering, National Chiao-Tung University, Taiwan, the lecturer can selectively add advanced topics covered in the second part to enrich the lecture content, and provide a more complete and advanced view on information theory to students.

The authors are very much indebted to all people who provided insightful comments on these lecture notes. Special thanks are devoted to Prof. Yunghsiang S. Han with the Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology, Taipei, Taiwan, for his enthusiasm in testing these lecture notes at his previous school (National Chi-Nan University), and providing the authors with valuable feedback.

Remarks to the reader. In these notes, all assumptions, claims, conjectures, corollaries, definitions, examples, exercises, lemmas, observations, properties, and theorems are numbered under the same counter. For example, the lemma that immediately follows Theorem 2.1 is numbered as Lemma 2.2, instead of Lemma 2.1.

In addition, one may obtain the latest version of the lecture notes from http://shannon.cm.nctu.edu.tw. The interested reader is welcome to submit comments to [email protected].


Acknowledgements

Thanks are given to our families for their full support during the period of writing these lecture notes.


Table of Contents

List of Tables
List of Figures

1 Introduction
  1.1 Overview
  1.2 Communication system model

2 Information Measures for Discrete Systems
  2.1 Entropy, joint entropy and conditional entropy
    2.1.1 Self-information
    2.1.2 Entropy
    2.1.3 Properties of entropy
    2.1.4 Joint entropy and conditional entropy
    2.1.5 Properties of joint entropy and conditional entropy
  2.2 Mutual information
    2.2.1 Properties of mutual information
    2.2.2 Conditional mutual information
  2.3 Properties of entropy and mutual information for multiple random variables
  2.4 Data processing inequality
  2.5 Fano’s inequality
  2.6 Divergence and variational distance
  2.7 Convexity/concavity of information measures
  2.8 Fundamentals of hypothesis testing

3 Lossless Data Compression
  3.1 Principles of data compression
  3.2 Block codes for asymptotically lossless compression
    3.2.1 Block codes for discrete memoryless sources
    3.2.2 Block codes for stationary ergodic sources
    3.2.3 Redundancy for lossless block data compression
  3.3 Variable-length codes for lossless data compression
    3.3.1 Non-singular codes and uniquely decodable codes
    3.3.2 Prefix or instantaneous codes
    3.3.3 Examples of binary prefix codes
      A) Huffman codes: optimal variable-length codes
      B) Shannon-Fano-Elias code
    3.3.4 Examples of universal lossless variable-length codes
      A) Adaptive Huffman code
      B) Lempel-Ziv codes

4 Data Transmission and Channel Capacity
  4.1 Principles of data transmission
  4.2 Discrete memoryless channels
  4.3 Block codes for data transmission over DMCs
  4.4 Calculating channel capacity
    4.4.1 Symmetric, weakly-symmetric and quasi-symmetric channels
    4.4.2 Karush-Kuhn-Tucker condition for channel capacity

5 Differential Entropy and Gaussian Channels
  5.1 Differential entropy
  5.2 Joint and conditional differential entropies, divergence and mutual information
  5.3 AEP for continuous memoryless sources
  5.4 Capacity and channel coding theorem for the discrete-time memoryless Gaussian channel
  5.5 Capacity of uncorrelated parallel Gaussian channels: The water-filling principle
  5.6 Capacity of correlated parallel Gaussian channels
  5.7 Non-Gaussian discrete-time memoryless channels
  5.8 Capacity of the band-limited white Gaussian channel

6 Lossy Data Compression and Transmission
  6.1 Fundamental concept on lossy data compression
    6.1.1 Motivations
    6.1.2 Distortion measures
    6.1.3 Frequently used distortion measures
  6.2 Fixed-length lossy data compression codes
  6.3 Rate-distortion function for discrete memoryless sources
  6.4 Property and calculation of rate-distortion functions
    6.4.1 Rate-distortion function for binary sources and Hamming additive distortion measure
    6.4.2 Rate-distortion function for Gaussian source and squared error distortion measure
  6.5 Joint source-channel information transmission

A Overview on Suprema and Limits
  A.1 Supremum and maximum
  A.2 Infimum and minimum
  A.3 Boundedness and suprema operations
  A.4 Sequences and their limits
  A.5 Equivalence

B Overview in Probability and Random Processes
  B.1 Probability space
  B.2 Random variable and random process
  B.3 Distribution functions versus probability space
  B.4 Relation between a source and a random process
  B.5 Statistical properties of random sources
  B.6 Convergence of sequences of random variables
  B.7 Ergodicity and laws of large numbers
    B.7.1 Laws of large numbers
    B.7.2 Ergodicity and law of large numbers
  B.8 Central limit theorem
  B.9 Convexity, concavity and Jensen’s inequality
  B.10 Lagrange multipliers technique and Karush-Kuhn-Tucker (KKT) condition

List of Tables

3.1 An example of the δ-typical set with n = 2 and δ = 0.4, where F2(0.4) = {AB, AC, BA, BB, BC, CA, CB}. The codeword set is {001(AB), 010(AC), 011(BA), 100(BB), 101(BC), 110(CA), 111(CB), 000(AA, AD, BD, CC, CD, DA, DB, DC, DD)}, where the parenthesis following each binary codeword indicates those sourcewords that are encoded to this codeword. The source distribution is PX(A) = 0.4, PX(B) = 0.3, PX(C) = 0.2 and PX(D) = 0.1.

5.1 Quantized random variable qn(X) under an n-bit accuracy: H(qn(X)) and H(qn(X)) − n versus n.

List of Figures

1.1 Block diagram of a general communication system.
2.1 Binary entropy function hb(p).
2.2 Relation between entropy and mutual information.
2.3 Communication context of the data processing lemma.
2.4 Permissible (Pe, H(X|Y)) region due to Fano’s inequality.
3.1 Block diagram of a data compression system.
3.2 Possible codebook C∼n and its corresponding Sn. The solid box indicates the decoding mapping from C∼n back to Sn.
3.3 (Ultimate) Compression rate R versus source entropy HD(X) and behavior of the probability of block decoding error as block length n goes to infinity for a discrete memoryless source.
3.4 Classification of variable-length codes.
3.5 Tree structure of a binary prefix code. The codewords are those residing on the leaves, which in this case are 00, 01, 10, 110, 1110 and 1111.
3.6 Example of the Huffman encoding.
3.7 Example of the sibling property based on the code tree from P_X^(16). The arguments inside the parenthesis following aj respectively indicate the codeword and the probability associated with aj. “b” is used to denote the internal nodes of the tree with the assigned (partial) code as its subscript. The number in the parenthesis following b is the probability sum of all its children.
3.8 (Continuation of Figure 3.7) Example of violation of the sibling property after observing a new symbol a3 at n = 17. Note that node a1 is not adjacent to its sibling a2.
3.9 (Continuation of Figure 3.8) Updated Huffman code. The sibling property holds now for the new code.
4.1 A data transmission system, where W represents the message for transmission, Xn denotes the codeword corresponding to message W, Y n represents the received word due to channel input Xn, and Ŵ denotes the reconstructed message from Y n.
4.2 Binary symmetric channel.
4.3 Binary erasure channel.
4.4 Binary symmetric erasure channel.
4.5 Ultimate channel coding rate R versus channel capacity C and behavior of the probability of error as blocklength n goes to infinity for a DMC.
5.1 The water-pouring scheme for uncorrelated parallel Gaussian channels. The horizontal dashed line, which indicates the level where the water rises to, indicates the value of θ for which ∑_{i=1}^{k} Pi = P.
5.2 Band-limited waveform channel with additive white Gaussian noise.
5.3 Water-pouring for the band-limited colored Gaussian channel.
6.1 Example for applications of lossy data compression codes.
6.2 “Grouping” as one kind of lossy data compression.
6.3 The Shannon limits for (2, 1) and (3, 1) codes under antipodal binary-input AWGN channel.
6.4 The Shannon limits for (2, 1) and (3, 1) codes under continuous-input AWGN channels.
A.1 Illustration of Lemma A.17.
B.1 General relations of random processes.
B.2 Relation of ergodic random processes respectively defined through time-shift invariance and ergodic theorem.
B.3 The support line y = ax + b of the convex function f(x).

Chapter 1

Introduction

1.1 Overview

Since its inception, the main role of Information Theory has been to provide the engineering and scientific communities with a mathematical framework for the theory of communication by establishing the fundamental limits on the performance of various communication systems. The birth of Information Theory was initiated with the publication of the groundbreaking works [43, 45] of Claude Elwood Shannon (1916-2001) who asserted that it is possible to send information-bearing signals at a fixed positive rate through a noisy communication channel with an arbitrarily small probability of error as long as the transmission rate is below a certain fixed quantity that depends on the channel statistical characteristics; he “baptized” this quantity with the name of channel capacity. He further proclaimed that random (stochastic) sources, representing data, speech or image signals, can be compressed distortion-free at a minimal rate given by the source’s intrinsic amount of information, which he called source entropy and defined in terms of the source statistics. He went on proving that if a source has an entropy that is less than the capacity of a communication channel, then the source can be reliably transmitted (with asymptotically vanishing probability of error) over the channel. He further generalized these “coding theorems” from the lossless (distortionless) to the lossy context where the source can be compressed and reproduced (possibly after channel transmission) within a tolerable distortion threshold [44].

Inspired and guided by the pioneering ideas of Shannon,¹ information theorists gradually expanded their interests beyond communication theory, and investigated fundamental questions in several other related fields. Among them we cite:

¹See [47] for accessing most of Shannon’s works, including his yet untapped doctoral dissertation on an algebraic framework for population genetics.


• statistical physics (thermodynamics, quantum information theory);

• computer science (algorithmic complexity, resolvability);

• probability theory (large deviations, limit theorems);

• statistics (hypothesis testing, multi-user detection, Fisher information, estimation);

• economics (gambling theory, investment theory);

• biology (biological information theory);

• cryptography (data security, watermarking);

• data networks (self-similarity, traffic regulation theory).

In this textbook, we focus our attention on the study of the basic theory of communication for single-user (point-to-point) systems for which Information Theory was originally conceived.

1.2 Communication system model

A simple block diagram of a general communication system is depicted in Fig. 1.1.

[Figure: Source → Source Encoder → Channel Encoder → Modulator → Physical Channel → Demodulator → Channel Decoder → Source Decoder → Destination. The modulator, physical channel and demodulator together form the discrete channel; the source/channel encoders and decoders together with this discrete channel are the focus of this text.]

Figure 1.1: Block diagram of a general communication system.


Let us briefly describe the role of each block in the figure.

• Source: The source, which usually represents data or multimedia signals, is modelled as a random process (the necessary background regarding random processes is introduced in Appendix B). It can be discrete (finite or countable alphabet) or continuous (uncountable alphabet) in value and in time.

• Source Encoder: Its role is to represent the source in a compact fashion by removing its unnecessary or redundant content (i.e., by compressing it).

• Channel Encoder: Its role is to enable the reliable reproduction of the source encoder output after its transmission through a noisy communication channel. This is achieved by adding redundancy (usually using an algebraic structure) to the source encoder output.

• Modulator: It transforms the channel encoder output into a waveform suitable for transmission over the physical channel. This is typically accomplished by varying the parameters of a sinusoidal signal in proportion with the data provided by the channel encoder output.

• Physical Channel: It consists of the noisy (or unreliable) medium that the transmitted waveform traverses. It is usually modelled via a sequence of conditional (or transition) probability distributions of receiving an output given that a specific input was sent.

• Receiver Part: It consists of the demodulator, the channel decoder and the source decoder where the reverse operations are performed. The destination represents the sink where the source estimate provided by the source decoder is reproduced.

In this text, we will model the concatenation of the modulator, physical channel and demodulator via a discrete-time² channel with a given sequence of conditional probability distributions. Given a source and a discrete channel, our objectives will include determining the fundamental limits of how well we can construct a (source/channel) coding scheme so that:

• the smallest number of source encoder symbols can represent each source symbol distortion-free or within a prescribed distortion level D, where D > 0 and the channel is noiseless;

²Except for a brief interlude with the continuous-time (waveform) Gaussian channel in Chapter 5, we will consider discrete-time communication systems throughout the text.


• the largest rate of information can be transmitted over a noisy channel between the channel encoder input and the channel decoder output with an arbitrarily small probability of decoding error;

• we can guarantee that the source is transmitted over a noisy channel and reproduced at the destination within distortion D, where D > 0.


Chapter 2

Information Measures for Discrete Systems

In this chapter, we define information measures for discrete-time discrete-alphabet¹ systems from a probabilistic standpoint and develop their properties. Elucidating the operational significance of probabilistically defined information measures vis-a-vis the fundamental limits of coding constitutes a main objective of this book; this will be seen in the subsequent chapters.

¹By discrete alphabets, one usually means finite or countably infinite alphabets. We however mostly focus on finite alphabet systems, although the presented information measures allow for countable alphabets (when they exist).

2.1 Entropy, joint entropy and conditional entropy

2.1.1 Self-information

Let E be an event belonging to a given event space and having probability Pr(E) ≜ pE, where 0 ≤ pE ≤ 1. Let I(E) – called the self-information of E – represent the amount of information one gains when learning that E has occurred (or equivalently, the amount of uncertainty one had about E prior to learning that it has happened). A natural question to ask is “what properties should I(E) have?” Although the answer to this question may vary from person to person, here are some common properties that I(E) is reasonably expected to have.

1. I(E) should be a decreasing function of pE.

In other words, this property first states that I(E) = I(pE), where I(·) is a real-valued function defined over [0, 1]. Furthermore, one would expect that the less likely event E is, the more information is gained when one learns it has occurred. In other words, I(pE) is a decreasing function of pE.

2. I(pE) should be continuous in pE.

Intuitively, one should expect that a small change in pE corresponds to a small change in the amount of information carried by E.

3. If E1 and E2 are independent events, then I(E1 ∩ E2) = I(E1) + I(E2), or equivalently, I(pE1 × pE2) = I(pE1) + I(pE2).

This property declares that when events E1 and E2 are independent from each other (i.e., when they do not affect each other probabilistically), the amount of information one gains by learning that both events have jointly occurred should be equal to the sum of the amounts of information of each individual event.

Next, we show that the only function that satisfies properties 1-3 above is the logarithmic function.

Theorem 2.1 The only function defined over p ∈ [0, 1] and satisfying

1. I(p) is monotonically decreasing in p;

2. I(p) is a continuous function of p for 0 ≤ p ≤ 1;

3. I(p1 × p2) = I(p1) + I(p2);

is I(p) = −c · logb(p), where c is a positive constant and the base b of the logarithm is any number larger than one.

Proof:

Step 1: Claim. For n = 1, 2, 3, · · · ,

I(1/n) = −c · logb(1/n),

where c > 0 is a constant.

Proof: First note that for n = 1, condition 3 directly shows the claim, since it yields that I(1) = I(1) + I(1). Thus I(1) = 0 = −c logb(1).

Now let n be a fixed positive integer greater than 1. Conditions 1 and 3 respectively imply

n < m ⇒ I(1/n) < I(1/m)   (2.1.1)

and

I(1/(mn)) = I(1/m) + I(1/n)   (2.1.2)

where n, m = 1, 2, 3, · · · . Now using (2.1.2), we can show by induction (on k) that

I(1/n^k) = k · I(1/n)   (2.1.3)

for all non-negative integers k.

Now for any positive integer r, there exists a non-negative integer k such that

n^k ≤ 2^r < n^{k+1}.

By (2.1.1), we obtain

I(1/n^k) ≤ I(1/2^r) < I(1/n^{k+1}),

which together with (2.1.3), yields

k · I(1/n) ≤ r · I(1/2) < (k + 1) · I(1/n).

Hence, since I(1/n) > I(1) = 0,

k/r ≤ I(1/2)/I(1/n) ≤ (k + 1)/r.

On the other hand, by the monotonicity of the logarithm, we obtain

logb(n^k) ≤ logb(2^r) ≤ logb(n^{k+1})  ⇔  k/r ≤ logb(2)/logb(n) ≤ (k + 1)/r.

Therefore,

| logb(2)/logb(n) − I(1/2)/I(1/n) | < 1/r.

Since n is fixed, and r can be made arbitrarily large, we can let r → ∞ to get

I(1/n) = c · logb(n) = −c · logb(1/n),

where c = I(1/2)/logb(2) > 0. This completes the proof of the claim.

Step 2: Claim. I(p) = −c · logb(p) for every positive rational number p, where c > 0 is a constant.

Proof: A positive rational number p can be represented by a ratio of two integers, i.e., p = r/s, where r and s are both positive integers. Then condition 3 yields that

I(1/s) = I((r/s) · (1/r)) = I(r/s) + I(1/r),

which, from Step 1, implies that

I(p) = I(r/s) = I(1/s) − I(1/r) = c · logb(s) − c · logb(r) = −c · logb(p).

Step 3: For any p ∈ [0, 1], it follows by continuity and the density of the rationals in the reals that

I(p) = lim_{a↑p, a rational} I(a) = lim_{b↓p, b rational} I(b) = −c · logb(p).

The constant c above is by convention normalized to c = 1. Furthermore, the base b of the logarithm determines the type of units used in measuring information. When b = 2, the amount of information is expressed in bits (i.e., binary digits). When b = e – i.e., the natural logarithm (ln) is used – information is measured in nats (i.e., natural units or digits). For example, if the event E concerns a Heads outcome from the toss of a fair coin, then its self-information is I(E) = − log2(1/2) = 1 bit or − ln(1/2) = 0.693 nats.

More generally, under base b > 1, information is in b-ary units or digits. For the sake of simplicity, we will throughout use the base-2 logarithm unless otherwise specified. Note that one can easily convert information units from bits to b-ary units by dividing the former by log2(b).
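To make the units discussion concrete, here is a small Python sketch (added for illustration; it is not part of the original notes, and the function name is arbitrary) that evaluates self-information in bits and nats and checks the additivity property for independent events:

```python
import math

def self_information(p, base=2):
    """Self-information I(p) = -log_base(p) of an event with probability p."""
    assert 0 < p <= 1
    return -math.log(p, base)

# A fair-coin Heads outcome: 1 bit, or about 0.693 nats.
print(self_information(0.5))                 # 1.0 bit
print(self_information(0.5, base=math.e))    # ~0.693 nats

# Additivity for independent events: I(p1 * p2) = I(p1) + I(p2).
p1, p2 = 0.25, 0.1
print(math.isclose(self_information(p1 * p2),
                   self_information(p1) + self_information(p2)))  # True

# Converting bits to b-ary units: divide by log2(b), here b = 3.
print(self_information(0.5) / math.log2(3))  # self-information in ternary units
```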

2.1.2 Entropy

Let X be a discrete random variable taking values in a finite alphabet X under a probability distribution or probability mass function (pmf) PX(x) ≜ P[X = x] for all x ∈ X. Note that X generically represents a memoryless source, which is a random process {Xn}∞n=1 with independent and identically distributed (i.i.d.) random variables (cf. Appendix B).


Definition 2.2 (Entropy) The entropy of a discrete random variable X with pmf PX(·) is denoted by H(X) or H(PX) and defined by

H(X) ≜ −∑_{x∈X} PX(x) · log2 PX(x)   (bits).

Thus H(X) represents the statistical average (mean) amount of information one gains when learning that one of its |X| outcomes has occurred, where |X| denotes the size of alphabet X. Indeed, we directly note from the definition that

H(X) = E[− log2 PX(X)] = E[I(X)]

where I(x) ≜ − log2 PX(x) is the self-information of the elementary event [X = x].

When computing the entropy, we adopt the convention

0 · log2 0 = 0,

which can be justified by a continuity argument since x log2 x → 0 as x → 0. Also note that H(X) only depends on the probability distribution of X and is not affected by the symbols that represent the outcomes. For example, when tossing a fair coin, we can denote Heads by 2 (instead of 1) and Tail by 100 (instead of 0), and the entropy of the random variable representing the outcome would remain equal to log2(2) = 1 bit.

Example 2.3 Let X be a binary (valued) random variable with alphabet X = {0, 1} and pmf given by PX(1) = p and PX(0) = 1 − p, where 0 ≤ p ≤ 1 is fixed. Then H(X) = −p · log2 p − (1 − p) · log2(1 − p). This entropy is conveniently called the binary entropy function and is usually denoted by hb(p): it is illustrated in Figure 2.1. As shown in the figure, hb(p) is maximized for a uniform distribution (i.e., p = 1/2).

The units for H(X) above are in bits as the base-2 logarithm is used. Setting

HD(X) ≜ −∑_{x∈X} PX(x) · logD PX(x)

yields the entropy in D-ary units, where D > 1. Note that we abbreviate H2(X) as H(X) throughout the book since bits are common measure units for a coding system, and hence

HD(X) = H(X) / log2(D).

Thus

He(X) = H(X) / log2(e) = (ln 2) · H(X)

gives the entropy in nats, where e is the base of the natural logarithm.
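As a numerical illustration (again an addition to the notes, with an arbitrary example pmf), the following sketch computes H(X) in bits, converts it to ternary units and nats as above, and evaluates the binary entropy function hb(p):

```python
import math

def entropy(pmf, base=2):
    """H(X) = -sum p log_base p, using the convention 0 * log 0 = 0."""
    return -sum(p * math.log(p, base) for p in pmf if p > 0)

def hb(p):
    """Binary entropy function in bits."""
    return entropy([p, 1.0 - p])

pmf = [0.4, 0.3, 0.2, 0.1]                   # arbitrary example pmf
H = entropy(pmf)                             # entropy in bits, ~1.846
print(H)
print(entropy(pmf, base=3))                  # ternary units = H / log2(3)
print(math.isclose(entropy(pmf, base=math.e), math.log(2) * H))  # nats = (ln 2) * H

print(hb(0.5))                               # 1.0: hb(p) is maximized at p = 1/2
print(hb(0.11))                              # ~0.5 bit
```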


[Figure: plot of hb(p) for p ∈ [0, 1]; hb(p) rises from 0 at p = 0 to its maximum value 1 at p = 0.5 and falls back to 0 at p = 1.]

Figure 2.1: Binary entropy function hb(p).

2.1.3 Properties of entropy

When developing or proving the basic properties of entropy (and other information measures), we will often use the following fundamental inequality on the logarithm (its proof is left as an exercise).

Lemma 2.4 (Fundamental inequality (FI)) For any x > 0 and D > 1, we have that

logD(x) ≤ logD(e) · (x − 1)

with equality if and only if (iff) x = 1.

Setting y = 1/x and using the FI above directly yield that for any y > 0, we also have that

logD(y) ≥ logD(e) · (1 − 1/y),

also with equality iff y = 1. In the above the base-D logarithm was used. Specifically, for a logarithm with base 2, the above inequalities become

log2(e) · (1 − 1/x) ≤ log2(x) ≤ log2(e) · (x − 1)

with equality iff x = 1.

Lemma 2.5 (Non-negativity) H(X) ≥ 0. Equality holds iff X is deterministic (when X is deterministic, the uncertainty of X is obviously zero).


Proof: 0 ≤ PX(x) ≤ 1 implies that log2[1/PX(x)] ≥ 0 for every x ∈ X. Hence,

H(X) = ∑_{x∈X} PX(x) · log2[1/PX(x)] ≥ 0,

with equality holding iff PX(x) = 1 for some x ∈ X.

Lemma 2.6 (Upper bound on entropy) If a random variable X takes values from a finite set X, then

H(X) ≤ log2 |X|,

where |X| denotes the size of the set X. Equality holds iff X is equiprobable or uniformly distributed over X (i.e., PX(x) = 1/|X| for all x ∈ X).

Proof:

log2 |X| − H(X) = log2 |X| · [∑_{x∈X} PX(x)] − [−∑_{x∈X} PX(x) log2 PX(x)]
               = ∑_{x∈X} PX(x) · log2 |X| + ∑_{x∈X} PX(x) log2 PX(x)
               = ∑_{x∈X} PX(x) log2[|X| · PX(x)]
               ≥ ∑_{x∈X} PX(x) · log2(e) · (1 − 1/(|X| · PX(x)))
               = log2(e) · ∑_{x∈X} (PX(x) − 1/|X|)
               = log2(e) · (1 − 1) = 0

where the inequality follows from the FI Lemma, with equality iff (∀ x ∈ X) |X| · PX(x) = 1, which means PX(·) is a uniform distribution on X.

Intuitively, H(X) tells us how random X is. Indeed, X is deterministic (not random at all) iff H(X) = 0. If X is uniform (equiprobable), H(X) is maximized, and is equal to log2 |X|.

Lemma 2.7 (Log-sum inequality) For non-negative numbers a1, a2, . . ., an and b1, b2, . . ., bn,

∑_{i=1}^{n} ai logD(ai/bi) ≥ (∑_{i=1}^{n} ai) logD(∑_{i=1}^{n} ai / ∑_{i=1}^{n} bi)   (2.1.4)

with equality holding iff (∀ 1 ≤ i ≤ n) ai/bi = a1/b1, a constant independent of i. (By convention, 0 · logD(0) = 0, 0 · logD(0/0) = 0 and a · logD(a/0) = ∞ if a > 0. Again, this can be justified by “continuity.”)


Proof: Let a ≜ ∑_{i=1}^{n} ai and b ≜ ∑_{i=1}^{n} bi. Then

∑_{i=1}^{n} ai logD(ai/bi) − a logD(a/b)
  = a ∑_{i=1}^{n} (ai/a) logD(ai/bi) − a (∑_{i=1}^{n} ai/a) logD(a/b)     [the second sum equals 1]
  = a ∑_{i=1}^{n} (ai/a) logD[(ai/bi)(b/a)]
  ≥ a logD(e) ∑_{i=1}^{n} (ai/a) [1 − (bi/ai)(a/b)]
  = a logD(e) (∑_{i=1}^{n} ai/a − ∑_{i=1}^{n} bi/b)
  = a logD(e) (1 − 1) = 0

where the inequality follows from the FI Lemma, with equality holding iff (ai/bi)(b/a) = 1 for all i; i.e., ai/bi = a/b for all i.

We also provide another proof using Jensen’s inequality (cf. Theorem B.17 in Appendix B). Without loss of generality, assume that ai > 0 and bi > 0 for every i. Jensen’s inequality states that

∑_{i=1}^{n} αi f(ti) ≥ f(∑_{i=1}^{n} αi ti)

for any strictly convex function f(·), αi ≥ 0, and ∑_{i=1}^{n} αi = 1; equality holds iff ti is a constant for all i. Hence by setting αi = bi/∑_{j=1}^{n} bj, ti = ai/bi, and f(t) = t · logD(t), we obtain the desired result.
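The log-sum inequality (2.1.4) is easy to sanity-check numerically. The sketch below is an added illustration under the stated conventions; the vectors are arbitrary:

```python
import math, random

def log_sum_lhs(a, b, base=2):
    # sum_i a_i log(a_i / b_i), with the convention 0 * log(0/b) = 0
    return sum(ai * math.log(ai / bi, base) for ai, bi in zip(a, b) if ai > 0)

def log_sum_rhs(a, b, base=2):
    # (sum_i a_i) log(sum_i a_i / sum_i b_i)
    return sum(a) * math.log(sum(a) / sum(b), base)

random.seed(1)
for _ in range(5):
    a = [random.random() for _ in range(4)]
    b = [random.random() for _ in range(4)]
    assert log_sum_lhs(a, b) >= log_sum_rhs(a, b) - 1e-12   # (2.1.4) holds

# Equality when a_i / b_i is constant:
a = [1.0, 2.0, 3.0]
b = [0.5, 1.0, 1.5]   # a_i / b_i = 2 for all i
print(math.isclose(log_sum_lhs(a, b), log_sum_rhs(a, b)))  # True
```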

2.1.4 Joint entropy and conditional entropy

Given a pair of random variables (X, Y) with a joint pmf PX,Y(·, ·) defined on X × Y, the self-information of the (two-dimensional) elementary event [X = x, Y = y] is defined by

I(x, y) ≜ − log2 PX,Y(x, y).

This leads us to the definition of joint entropy.

Definition 2.8 (Joint entropy) The joint entropy H(X, Y) of random variables (X, Y) is defined by

H(X, Y) ≜ −∑_{(x,y)∈X×Y} PX,Y(x, y) · log2 PX,Y(x, y) = E[− log2 PX,Y(X, Y)].

The conditional entropy can also be similarly defined as follows.

Definition 2.9 (Conditional entropy) Given two jointly distributed random variables X and Y, the conditional entropy H(Y|X) of Y given X is defined by

H(Y|X) ≜ ∑_{x∈X} PX(x) · (−∑_{y∈Y} PY|X(y|x) · log2 PY|X(y|x))   (2.1.5)

where PY|X(·|·) is the conditional pmf of Y given X.

Equation (2.1.5) can be written into three different but equivalent forms:

H(Y|X) = −∑_{(x,y)∈X×Y} PX,Y(x, y) · log2 PY|X(y|x)
       = E[− log2 PY|X(Y|X)]
       = ∑_{x∈X} PX(x) · H(Y|X = x)

where H(Y|X = x) ≜ −∑_{y∈Y} PY|X(y|x) log2 PY|X(y|x).

The relationship between joint entropy and conditional entropy is exhibited by the fact that the entropy of a pair of random variables is the entropy of one plus the conditional entropy of the other.

Theorem 2.10 (Chain rule for entropy)

H(X, Y) = H(X) + H(Y|X).   (2.1.6)

Proof: Since PX,Y(x, y) = PX(x) PY|X(y|x), we directly obtain that

H(X, Y) = E[− log2 PX,Y(X, Y)]
        = E[− log2 PX(X)] + E[− log2 PY|X(Y|X)]
        = H(X) + H(Y|X).

By its definition, joint entropy is commutative; i.e., H(X, Y) = H(Y, X). Hence,

H(X, Y) = H(X) + H(Y|X) = H(Y) + H(X|Y) = H(Y, X),

which implies that

H(X) − H(X|Y) = H(Y) − H(Y|X).   (2.1.7)

The above quantity is exactly equal to the mutual information, which will be introduced in the next section.
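These identities can be verified numerically for a small joint pmf. The following sketch is an added illustration (the joint distribution is an arbitrary example, not one from the notes); it checks the chain rule (2.1.6) and identity (2.1.7):

```python
import math

# Joint pmf P_{X,Y}(x, y) on {0,1} x {0,1} (an arbitrary example).
P = {(0, 0): 0.30, (0, 1): 0.20, (1, 0): 0.10, (1, 1): 0.40}

def H(pmf):
    return -sum(p * math.log2(p) for p in pmf if p > 0)

PX = {x: sum(p for (a, b), p in P.items() if a == x) for x in (0, 1)}
PY = {y: sum(p for (a, b), p in P.items() if b == y) for y in (0, 1)}

HXY = H(P.values())
HX, HY = H(PX.values()), H(PY.values())
# H(Y|X) = sum_x P(x) H(Y | X = x), and symmetrically for H(X|Y).
HY_given_X = sum(PX[x] * H([P[(x, y)] / PX[x] for y in (0, 1)]) for x in (0, 1))
HX_given_Y = sum(PY[y] * H([P[(x, y)] / PY[y] for x in (0, 1)]) for y in (0, 1))

print(math.isclose(HXY, HX + HY_given_X))              # chain rule (2.1.6)
print(math.isclose(HX - HX_given_Y, HY - HY_given_X))  # identity (2.1.7)
# Also H(X|Y) <= H(X) and H(Y|X) <= H(Y) (conditioning reduces entropy, cf. Lemma 2.12).
print(HX_given_Y <= HX, HY_given_X <= HY)
```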

The conditional entropy can be thought of in terms of a channel whose input is the random variable X and whose output is the random variable Y. H(X|Y) is then called the equivocation² and corresponds to the uncertainty in the channel input from the receiver’s point-of-view. For example, suppose that the set of possible outcomes of the random vector (X, Y) is {(0, 0), (0, 1), (1, 0), (1, 1)}, where none of the elements has zero probability mass. When the receiver Y receives 1, he still cannot determine exactly what the sender X observes (it could be either 1 or 0); therefore, the uncertainty, from the receiver’s view point, depends on the probabilities PX|Y(0|1) and PX|Y(1|1).

Similarly, H(Y|X), which is called prevarication,³ is the uncertainty in the channel output from the transmitter’s point-of-view. In other words, the sender knows exactly what he sends, but is uncertain about what the receiver will finally obtain.

A case that is of specific interest is when H(X|Y) = 0. By its definition, H(X|Y) = 0 if X becomes deterministic after observing Y. In such a case, the uncertainty of X given Y is completely zero.

The next corollary can be proved similarly to Theorem 2.10.

Corollary 2.11 (Chain rule for conditional entropy)

H(X, Y |Z) = H(X|Z) +H(Y |X,Z).

2.1.5 Properties of joint entropy and conditional entropy

Lemma 2.12 (Conditioning never increases entropy) Side information Y cannot increase the uncertainty about X:

H(X|Y ) ≤ H(X)

with equality holding iff X and Y are independent. In other words, “conditioning” reduces entropy.

²Equivocation is an ambiguous statement one uses deliberately in order to deceive or avoid speaking the truth.

³Prevarication is the deliberate act of deviating from the truth (it is a synonym of “equivocation”).


Proof:

H(X) − H(X|Y) = ∑_{(x,y)∈X×Y} PX,Y(x, y) · log2 [PX|Y(x|y) / PX(x)]
             = ∑_{(x,y)∈X×Y} PX,Y(x, y) · log2 [PX|Y(x|y) PY(y) / (PX(x) PY(y))]
             = ∑_{(x,y)∈X×Y} PX,Y(x, y) · log2 [PX,Y(x, y) / (PX(x) PY(y))]
             ≥ (∑_{(x,y)∈X×Y} PX,Y(x, y)) · log2 [∑_{(x,y)∈X×Y} PX,Y(x, y) / ∑_{(x,y)∈X×Y} PX(x) PY(y)]
             = 0

where the inequality follows from the log-sum inequality, with equality holding iff

PX,Y(x, y) / (PX(x) PY(y)) = constant   ∀ (x, y) ∈ X × Y.

Since probabilities must sum to 1, the above constant equals 1, which is exactly the case of X being independent of Y.

Lemma 2.13 Entropy is additive for independent random variables; i.e.,

H(X, Y ) = H(X) +H(Y ) for independent X and Y.

Proof: By the previous lemma, independence of X and Y implies H(Y|X) = H(Y). Hence

H(X, Y ) = H(X) +H(Y |X) = H(X) +H(Y ).

Since conditioning never increases entropy, it follows that

H(X, Y ) = H(X) +H(Y |X) ≤ H(X) +H(Y ). (2.1.8)

The above lemma tells us that equality holds for (2.1.8) only when X is independent of Y.

A result similar to (2.1.8) also applies to conditional entropy.

Lemma 2.14 Conditional entropy is lower additive; i.e.,

H(X1, X2|Y1, Y2) ≤ H(X1|Y1) +H(X2|Y2).


Equality holds iff

PX1,X2|Y1,Y2(x1, x2|y1, y2) = PX1|Y1(x1|y1)PX2|Y2(x2|y2)

for all x1, x2, y1 and y2.

Proof: Using the chain rule for conditional entropy and the fact that conditioning reduces entropy, we can write

H(X1, X2|Y1, Y2) = H(X1|Y1, Y2) +H(X2|X1, Y1, Y2)

≤ H(X1|Y1, Y2) +H(X2|Y1, Y2), (2.1.9)

≤ H(X1|Y1) +H(X2|Y2). (2.1.10)

For (2.1.9), equality holds iff X1 and X2 are conditionally independent given (Y1, Y2): PX1,X2|Y1,Y2(x1, x2|y1, y2) = PX1|Y1,Y2(x1|y1, y2) PX2|Y1,Y2(x2|y1, y2). For (2.1.10), equality holds iff X1 is conditionally independent of Y2 given Y1 (i.e., PX1|Y1,Y2(x1|y1, y2) = PX1|Y1(x1|y1)), and X2 is conditionally independent of Y1 given Y2 (i.e., PX2|Y1,Y2(x2|y1, y2) = PX2|Y2(x2|y2)). Hence, the desired equality condition of the lemma is obtained.

2.2 Mutual information

For two random variables X and Y, the mutual information between X and Y is the reduction in the uncertainty of Y due to the knowledge of X (or vice versa). A dual definition of mutual information states that it is the average amount of information that Y has (or contains) about X or X has (or contains) about Y.

We can think of the mutual information between X and Y in terms of a channel whose input is X and whose output is Y. Thereby the reduction of the uncertainty is by definition the total uncertainty of X (i.e., H(X)) minus the uncertainty of X after observing Y (i.e., H(X|Y)). Mathematically, it is

mutual information = I(X;Y) ≜ H(X) − H(X|Y).   (2.2.1)

It can be easily verified from (2.1.7) that mutual information is symmetric; i.e., I(X;Y) = I(Y;X).

2.2.1 Properties of mutual information

Lemma 2.15

1. I(X;Y) = ∑_{x∈X} ∑_{y∈Y} PX,Y(x, y) log2 [PX,Y(x, y) / (PX(x) PY(y))].

[Figure: Venn-style diagram in which H(X,Y) is partitioned into H(X|Y), I(X;Y) and H(Y|X), with H(X) = H(X|Y) + I(X;Y) and H(Y) = H(Y|X) + I(X;Y).]

Figure 2.2: Relation between entropy and mutual information.

2. I(X;Y ) = I(Y ;X).

3. I(X;Y ) = H(X) +H(Y )−H(X, Y ).

4. I(X;Y) ≤ H(X) with equality holding iff X is a function of Y (i.e., X = f(Y) for some function f(·)).

5. I(X;Y ) ≥ 0 with equality holding iff X and Y are independent.

6. I(X;Y) ≤ min{log2 |X|, log2 |Y|}.

Proof: Properties 1, 2, 3, and 4 follow immediately from the definition. Property 5 is a direct consequence of Lemma 2.12. Property 6 holds iff I(X;Y) ≤ log2 |X| and I(X;Y) ≤ log2 |Y|. To show the first inequality, we write I(X;Y) = H(X) − H(X|Y), use the fact that H(X|Y) is non-negative and apply Lemma 2.6. A similar proof can be used to show that I(X;Y) ≤ log2 |Y|.

The relationships between H(X), H(Y), H(X, Y), H(X|Y), H(Y|X) and I(X;Y) can be illustrated by the Venn diagram in Figure 2.2.
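As a quick numerical check (an added illustration with an arbitrary joint pmf), the sketch below evaluates I(X;Y) via property 1 of Lemma 2.15 and confirms that it agrees with H(X) − H(X|Y) and with H(X) + H(Y) − H(X, Y):

```python
import math

P = {(0, 0): 0.30, (0, 1): 0.20, (1, 0): 0.10, (1, 1): 0.40}  # arbitrary joint pmf
PX = {x: sum(p for (a, b), p in P.items() if a == x) for x in (0, 1)}
PY = {y: sum(p for (a, b), p in P.items() if b == y) for y in (0, 1)}

def H(pmf):
    return -sum(p * math.log2(p) for p in pmf if p > 0)

# Property 1: I(X;Y) = sum_{x,y} P(x,y) log2 [P(x,y) / (P(x)P(y))]
I1 = sum(p * math.log2(p / (PX[x] * PY[y])) for (x, y), p in P.items() if p > 0)

HX, HY, HXY = H(PX.values()), H(PY.values()), H(P.values())
HX_given_Y = HXY - HY          # from the chain rule
I2 = HX - HX_given_Y           # definition (2.2.1)
I3 = HX + HY - HXY             # property 3

print(I1, math.isclose(I1, I2), math.isclose(I1, I3))
print(0 <= I1 <= min(math.log2(2), math.log2(2)))   # properties 5 and 6
```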

2.2.2 Conditional mutual information

The conditional mutual information, denoted by I(X;Y|Z), is defined as the common uncertainty between X and Y under the knowledge of Z. It is mathematically defined by

I(X;Y|Z) ≜ H(X|Z) − H(X|Y, Z).   (2.2.2)


Lemma 2.16 (Chain rule for mutual information)

I(X;Y, Z) = I(X;Y ) + I(X;Z|Y ) = I(X;Z) + I(X;Y |Z).

Proof: Without loss of generality, we only prove the first equality:

I(X;Y, Z) = H(X) − H(X|Y, Z)
          = H(X) − H(X|Y) + H(X|Y) − H(X|Y, Z)
          = I(X;Y) + I(X;Z|Y).

The above lemma can be read as: the information that (Y, Z) has about X is equal to the information that Y has about X plus the information that Z has about X when Y is already known.

2.3 Properties of entropy and mutual information for multiple random variables

Theorem 2.17 (Chain rule for entropy) Let X1, X2, . . ., Xn be drawn according to PXn(xn) ≜ PX1,···,Xn(x1, . . . , xn), where we use the common superscript notation to denote an n-tuple: Xn ≜ (X1, · · · , Xn) and xn ≜ (x1, . . . , xn). Then

H(X1, X2, . . . , Xn) = ∑_{i=1}^{n} H(Xi|Xi−1, . . . , X1),

where H(Xi|Xi−1, . . . , X1) ≜ H(X1) for i = 1. (The above chain rule can also be written as

H(Xn) = ∑_{i=1}^{n} H(Xi|X^{i−1}),

where X^i ≜ (X1, . . . , Xi).)

Proof: From (2.1.6),

H(X1, X2, . . . , Xn) = H(X1, X2, . . . , Xn−1) +H(Xn|Xn−1, . . . , X1). (2.3.1)

Once again, applying (2.1.6) to the first term of the right-hand side of (2.3.1), we have

H(X1, X2, . . . , Xn−1) = H(X1, X2, . . . , Xn−2) +H(Xn−1|Xn−2, . . . , X1).

The desired result can then be obtained by repeatedly applying (2.1.6).


Theorem 2.18 (Chain rule for conditional entropy)

H(X1, X2, . . . , Xn|Y) = ∑_{i=1}^{n} H(Xi|Xi−1, . . . , X1, Y).

Proof: The theorem can be proved similarly to Theorem 2.17.

Theorem 2.19 (Chain rule for mutual information)

I(X1, X2, . . . , Xn;Y) = ∑_{i=1}^{n} I(Xi;Y|Xi−1, . . . , X1),

where I(Xi;Y|Xi−1, . . . , X1) ≜ I(X1;Y) for i = 1.

Proof: This can be proved by first expressing mutual information in terms of entropy and conditional entropy, and then applying the chain rules for entropy and conditional entropy.

Theorem 2.20 (Independence bound on entropy)

H(X1, X2, . . . , Xn) ≤ ∑_{i=1}^{n} H(Xi).

Equality holds iff all the Xi’s are independent from each other.⁴

Proof: By applying the chain rule for entropy,

H(X1, X2, . . . , Xn) = ∑_{i=1}^{n} H(Xi|Xi−1, . . . , X1) ≤ ∑_{i=1}^{n} H(Xi).

Equality holds iff each conditional entropy is equal to its associated entropy, that is, iff Xi is independent of (Xi−1, . . . , X1) for all i.

⁴This condition is equivalent to requiring that Xi be independent of (Xi−1, . . . , X1) for all i. The equivalence can be directly proved using the chain rule for joint probabilities, i.e., PXn(xn) = ∏_{i=1}^{n} PXi|X1^{i−1}(xi|x1^{i−1}); it is left as an exercise.


Theorem 2.21 (Bound on mutual information) If {(Xi, Yi)}_{i=1}^{n} is a process satisfying the conditional independence assumption PY^n|X^n = ∏_{i=1}^{n} PYi|Xi, then

I(X1, . . . , Xn;Y1, . . . , Yn) ≤ ∑_{i=1}^{n} I(Xi;Yi)

with equality holding iff {Xi}_{i=1}^{n} are independent.

Proof: From the independence bound on entropy, we have

H(Y1, . . . , Yn) ≤ ∑_{i=1}^{n} H(Yi).

By the conditional independence assumption, we have

H(Y1, . . . , Yn|X1, . . . , Xn) = E[− log2 PY^n|X^n(Y^n|X^n)]
                             = E[− ∑_{i=1}^{n} log2 PYi|Xi(Yi|Xi)]
                             = ∑_{i=1}^{n} H(Yi|Xi).

Hence

I(X^n;Y^n) = H(Y^n) − H(Y^n|X^n)
           ≤ ∑_{i=1}^{n} H(Yi) − ∑_{i=1}^{n} H(Yi|Xi)
           = ∑_{i=1}^{n} I(Xi;Yi)

with equality holding iff {Yi}_{i=1}^{n} are independent, which holds iff {Xi}_{i=1}^{n} are independent.

2.4 Data processing inequality

Lemma 2.22 (Data processing inequality) (This is also called the data processing lemma.) If X → Y → Z, then I(X;Y) ≥ I(X;Z).

Proof: The Markov chain relationship X → Y → Z means that X and Z are conditionally independent given Y (cf. Appendix B); we directly have that I(X;Z|Y) = 0. By the chain rule for mutual information,

I(X;Z) + I(X;Y|Z) = I(X;Y, Z)   (2.4.1)
                  = I(X;Y) + I(X;Z|Y)
                  = I(X;Y).   (2.4.2)

Since I(X;Y|Z) ≥ 0, we obtain that I(X;Y) ≥ I(X;Z) with equality holding iff I(X;Y|Z) = 0.

[Figure: Source → Encoder → Channel → Decoder chain, with U the encoder input, X the channel input, Y the channel output and V the decoder output, illustrating I(U;V) ≤ I(X;Y). “By processing, we can only reduce (mutual) information, but the processed information may be in a more useful form!”]

Figure 2.3: Communication context of the data processing lemma.

The data processing inequality means that the mutual information will not increase after processing. This result is somewhat counter-intuitive since given two random variables X and Y, we might believe that applying a well-designed processing scheme to Y, which can be generally represented by a mapping g(Y), could possibly increase the mutual information. However, for any g(·), X → Y → g(Y) forms a Markov chain which implies that data processing cannot increase mutual information. A communication context for the data processing lemma is depicted in Figure 2.3, and summarized in the next corollary.

Corollary 2.23 For jointly distributed random variables X and Y and any function g(·), we have X → Y → g(Y) and

I(X;Y ) ≥ I(X; g(Y )).

We also note that if Z obtains all the information about X through Y, then knowing Z will not help increase the mutual information between X and Y; this is formalized in the following.

Corollary 2.24 If X → Y → Z, then

I(X;Y |Z) ≤ I(X;Y ).

Proof: The proof directly follows from (2.4.1) and (2.4.2).

It is worth pointing out that it is possible that I(X;Y|Z) > I(X;Y) when X, Y and Z do not form a Markov chain. For example, let X and Y be independent equiprobable binary zero-one random variables, and let Z = X + Y. Then,

I(X;Y|Z) = H(X|Z) − H(X|Y, Z)
         = H(X|Z)
         = PZ(0) H(X|Z = 0) + PZ(1) H(X|Z = 1) + PZ(2) H(X|Z = 2)
         = 0 + 0.5 + 0
         = 0.5 bits,

which is clearly larger than I(X;Y ) = 0.
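This counterexample can be verified by brute force. The sketch below is an added illustration (the helper function is ad hoc, not from the notes): it builds the joint pmf of (X, Y, Z) with Z = X + Y and evaluates I(X;Y|Z) = H(X|Z) − H(X|Y, Z):

```python
import math
from itertools import product

def H_cond(joint, cond_of):
    """H(X | conditioning value) for a joint pmf over (x, y, z) triples.
    joint: dict mapping (x, y, z) -> prob; cond_of: function extracting the
    conditioning value from a triple (x is always the first coordinate)."""
    conds = {}
    for xyz, p in joint.items():
        bucket = conds.setdefault(cond_of(xyz), {})
        bucket[xyz[0]] = bucket.get(xyz[0], 0.0) + p
    h = 0.0
    for px in conds.values():
        pc = sum(px.values())
        h += sum(-p * math.log2(p / pc) for p in px.values() if p > 0)
    return h

joint = {}
for x, y in product((0, 1), repeat=2):   # X, Y independent and uniform
    joint[(x, y, x + y)] = 0.25          # Z = X + Y

H_X_given_Z = H_cond(joint, lambda t: t[2])
H_X_given_YZ = H_cond(joint, lambda t: (t[1], t[2]))
print(H_X_given_Z - H_X_given_YZ)   # I(X;Y|Z) = 0.5 bit, while I(X;Y) = 0
```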

Finally, we observe that we can extend the data processing inequality to a sequence of random variables forming a Markov chain:

Corollary 2.25 If X1 → X2 → · · · → Xn, then for any i, j, k, l such that 1 ≤ i ≤ j ≤ k ≤ l ≤ n, we have that

I(Xi;Xl) ≤ I(Xj ;Xk).

2.5 Fano’s inequality

Fano’s inequality is a quite useful tool widely employed in Information Theory to prove converse results for coding theorems (as we will see in the following chapters).

Lemma 2.26 (Fano’s inequality) Let X and Y be two random variables, correlated in general, with alphabets X and Y, respectively, where X is finite but Y can be countably infinite. Let X̂ ≜ g(Y) be an estimate of X from observing Y, where g : Y → X is a given estimation function. Define the probability of error as

Pe ≜ Pr[X̂ ≠ X].

Then the following inequality holds:

H(X|Y) ≤ hb(Pe) + Pe · log2(|X| − 1),   (2.5.1)

where hb(x) ≜ −x log2 x − (1 − x) log2(1 − x) for 0 ≤ x ≤ 1 is the binary entropy function.

Observation 2.27

• Note that when Pe = 0, we obtain that H(X|Y) = 0 (see (2.5.1)) as intuition suggests, since if Pe = 0, then X̂ = g(Y) = X (with probability 1) and thus H(X|Y) = H(g(Y)|Y) = 0.


• Fano’s inequality yields upper and lower bounds on Pe in terms of H(X|Y). This is illustrated in Figure 2.4, where we plot the region for the pairs (Pe, H(X|Y)) that are permissible under Fano’s inequality. In the figure, the boundary of the permissible (dashed) region is given by the function

f(Pe) ≜ hb(Pe) + Pe · log2(|X| − 1),

the right-hand side of (2.5.1). We obtain that when

log2(|X | − 1) < H(X|Y ) ≤ log2(|X |),

Pe can be upper and lower bounded as follows:

0 < inf{a : f(a) ≥ H(X|Y)} ≤ Pe ≤ sup{a : f(a) ≥ H(X|Y)} < 1.

Furthermore, when

0 < H(X|Y ) ≤ log2(|X | − 1),

only the lower bound holds:

Pe ≥ inf{a : f(a) ≥ H(X|Y)} > 0.

Thus for all non-zero values of H(X|Y), we obtain a lower bound (of the same form above) on Pe; the bound implies that if H(X|Y) is bounded away from zero, Pe is also bounded away from zero.

• A weaker but simpler version of Fano’s inequality can be directly obtained from (2.5.1) by noting that hb(Pe) ≤ 1:

H(X|Y ) ≤ 1 + Pe log2(|X | − 1), (2.5.2)

which in turn yields that

Pe ≥ [H(X|Y) − 1] / log2(|X| − 1)   (for |X| > 2)

which is weaker than the above lower bound on Pe.

Proof of Lemma 2.26:

Define a new random variable,

E ≜ 1 if g(Y) ≠ X, and E ≜ 0 if g(Y) = X.


[Figure: the permissible (Pe, H(X|Y)) pairs lie below the curve f(Pe); the vertical axis H(X|Y) runs up to log2(|X|), with log2(|X| − 1) marked, the horizontal axis Pe runs from 0 to 1, and the curve peaks at Pe = (|X| − 1)/|X|.]

Figure 2.4: Permissible (Pe, H(X|Y)) region due to Fano’s inequality.

Then using the chain rule for conditional entropy, we obtain

H(E,X|Y ) = H(X|Y ) +H(E|X, Y )

= H(E|Y ) +H(X|E, Y ).

Observe that E is a function of X and Y; hence, H(E|X, Y) = 0. Since conditioning never increases entropy, H(E|Y) ≤ H(E) = hb(Pe). The remaining term, H(X|E, Y), can be bounded as follows:

H(X|E, Y ) = Pr[E = 0]H(X|Y,E = 0) + Pr[E = 1]H(X|Y,E = 1)

≤ (1− Pe) · 0 + Pe · log2(|X | − 1),

since X = g(Y) for E = 0, and given E = 1, we can upper bound the conditional entropy by the logarithm of the number of remaining outcomes, i.e., (|X| − 1). Combining these results completes the proof.

Fano’s inequality cannot be improved in the sense that the lower bound, H(X|Y), can be achieved for some specific cases. Any bound that can be achieved in some cases is often referred to as sharp.⁵ From the proof of the above lemma, we can observe that equality holds in Fano’s inequality if H(E|Y) = H(E) and H(X|Y, E = 1) = log2(|X| − 1). The former is equivalent to E being independent of Y, and the latter holds iff PX|Y(·|y) is uniformly distributed over the set X − {g(y)}. We can therefore create an example in which equality holds in Fano’s inequality.

⁵Definition. A bound is said to be sharp if the bound is achievable for some specific cases. A bound is said to be tight if the bound is achievable for all cases.

Example 2.28 Suppose that X and Y are two independent random variables which are both uniformly distributed on the alphabet {0, 1, 2}. Let the estimating function be given by g(y) = y. Then

Pe = Pr[g(Y) ≠ X] = Pr[Y ≠ X] = 1 − ∑_{x=0}^{2} PX(x) PY(x) = 2/3.

In this case, equality is achieved in Fano’s inequality, i.e.,

hb(2/3) + (2/3) · log2(3 − 1) = H(X|Y) = H(X) = log2 3.
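A quick numerical check of the bound (2.5.1) and of the equality case in Example 2.28 (an added illustration, not part of the original notes):

```python
import math

def hb(p):
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def fano_bound(pe, alphabet_size):
    """Right-hand side of (2.5.1): hb(Pe) + Pe * log2(|X| - 1)."""
    return hb(pe) + pe * math.log2(alphabet_size - 1)

# Example 2.28: X, Y independent and uniform on {0,1,2}, estimator g(y) = y.
# Then Pe = 2/3 and H(X|Y) = H(X) = log2(3), so the bound holds with equality.
pe = 2 / 3
print(fano_bound(pe, 3), math.log2(3))                 # both ~1.585 bits
print(math.isclose(fano_bound(pe, 3), math.log2(3)))   # True
```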

To conclude this section, we present an alternative proof for Fano’s inequality to illustrate the use of the data processing inequality and the FI Lemma.

Alternative Proof of Fano’s inequality: Noting that X → Y → X̂ forms a Markov chain, we directly obtain via the data processing inequality that

I(X;Y) ≥ I(X;X̂),

which implies that H(X|Y) ≤ H(X|X̂). Thus, if we show that H(X|X̂) is no larger than the right-hand side of (2.5.1), the proof of (2.5.1) is complete.

Noting that

Pe = ∑_{x∈X} ∑_{x̂∈X: x̂≠x} PX,X̂(x, x̂)

and

1 − Pe = ∑_{x∈X} ∑_{x̂∈X: x̂=x} PX,X̂(x, x̂) = ∑_{x∈X} PX,X̂(x, x),

we obtain that

H(X|X̂) − hb(Pe) − Pe log2(|X| − 1)
  = ∑_{x∈X} ∑_{x̂∈X: x̂≠x} PX,X̂(x, x̂) log2 [1/PX|X̂(x|x̂)] + ∑_{x∈X} PX,X̂(x, x) log2 [1/PX|X̂(x|x)]
    − [∑_{x∈X} ∑_{x̂∈X: x̂≠x} PX,X̂(x, x̂)] log2 [(|X| − 1)/Pe] + [∑_{x∈X} PX,X̂(x, x)] log2(1 − Pe)
  = ∑_{x∈X} ∑_{x̂∈X: x̂≠x} PX,X̂(x, x̂) log2 [Pe / (PX|X̂(x|x̂) (|X| − 1))] + ∑_{x∈X} PX,X̂(x, x) log2 [(1 − Pe)/PX|X̂(x|x)]   (2.5.3)
  ≤ log2(e) ∑_{x∈X} ∑_{x̂∈X: x̂≠x} PX,X̂(x, x̂) [Pe / (PX|X̂(x|x̂) (|X| − 1)) − 1] + log2(e) ∑_{x∈X} PX,X̂(x, x) [(1 − Pe)/PX|X̂(x|x) − 1]
  = log2(e) [ (Pe/(|X| − 1)) ∑_{x∈X} ∑_{x̂∈X: x̂≠x} PX̂(x̂) − ∑_{x∈X} ∑_{x̂∈X: x̂≠x} PX,X̂(x, x̂) ] + log2(e) [ (1 − Pe) ∑_{x∈X} PX̂(x) − ∑_{x∈X} PX,X̂(x, x) ]
  = log2(e) [ (Pe/(|X| − 1)) (|X| − 1) − Pe ] + log2(e) [ (1 − Pe) − (1 − Pe) ]
  = 0

where the inequality follows by applying the FI Lemma to each logarithm term in (2.5.3).

2.6 Divergence and variational distance

In addition to the probabilistically defined entropy and mutual information, another measure that is frequently considered in information theory is divergence or relative entropy. In this section, we define this measure and study its statistical properties.

Definition 2.29 (Divergence) Given two discrete random variables X and X̂ defined over a common alphabet X, the divergence (other names are Kullback-Leibler divergence or distance, relative entropy and discrimination) is denoted by D(X‖X̂) or D(PX‖PX̂) and defined by⁶

D(X‖X̂) = D(PX‖PX̂) ≜ EX[log2 (PX(X)/PX̂(X))] = ∑_{x∈X} PX(x) log2 (PX(x)/PX̂(x)).

⁶In order to be consistent with the units (in bits) adopted for entropy and mutual information, we will also use the base-2 logarithm for divergence unless otherwise specified.

In other words, the divergence D(PX‖PX̂) is the expectation (with respect to PX) of the log-likelihood ratio log2[PX/PX̂] of distribution PX against distribution PX̂. D(X‖X̂) can be viewed as a measure of “distance” or “dissimilarity” between distributions PX and PX̂. D(X‖X̂) is also called relative entropy since it can be regarded as a measure of the inefficiency of mistakenly assuming that the distribution of a source is PX̂ when the true distribution is PX. For example, if we know the true distribution PX of a source, then we can construct a lossless data compression code with average codeword length achieving entropy H(X) (this will be studied in the next chapter). If, however, we mistakenly thought that the “true” distribution is PX̂ and employ the “best” code corresponding to PX̂, then the resultant average codeword length becomes

∑_{x∈X} [−PX(x) · log2 PX̂(x)].

As a result, the difference between the resultant average codeword length and H(X) is the relative entropy D(X‖X̂). Hence, divergence is a measure of the system cost (e.g., storage consumed) paid due to mis-classifying the system statistics.
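This coding interpretation can be illustrated numerically. In the added sketch below (the two pmfs are arbitrary examples), the ideal average codeword length is taken to be H(P), the mismatched length is the cross-entropy −∑x P(x) log2 Q(x), and their difference equals D(P‖Q); here Q plays the role of PX̂:

```python
import math

P = [0.5, 0.25, 0.125, 0.125]   # true source distribution (arbitrary example)
Q = [0.25, 0.25, 0.25, 0.25]    # mistakenly assumed distribution

H_P   = -sum(p * math.log2(p) for p in P if p > 0)               # ideal rate, 1.75 bits
cross = -sum(p * math.log2(q) for p, q in zip(P, Q) if p > 0)    # mismatched rate, 2.0 bits
D_PQ  = sum(p * math.log2(p / q) for p, q in zip(P, Q) if p > 0)

print(H_P, cross)
print(math.isclose(cross - H_P, D_PQ))   # the penalty equals the divergence D(P||Q)
```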

Note that when computing divergence, we follow the convention that

0 · log2(0/p) = 0  and  p · log2(p/0) = ∞ for p > 0.

We next present some properties of the divergence and discuss its relation with entropy and mutual information.

Lemma 2.30 (Non-negativity of divergence)

D(X‖X̂) ≥ 0

with equality iff PX(x) = PX̂(x) for all x ∈ X (i.e., the two distributions are equal).

Proof:

D(X‖X̂) = ∑_{x∈X} PX(x) log2 (PX(x)/PX̂(x))
       ≥ (∑_{x∈X} PX(x)) log2 (∑_{x∈X} PX(x) / ∑_{x∈X} PX̂(x))
       = 0

where the second step follows from the log-sum inequality with equality holding iff for every x ∈ X,

PX(x)/PX̂(x) = ∑_{a∈X} PX(a) / ∑_{b∈X} PX̂(b),

or equivalently PX(x) = PX̂(x) for all x ∈ X.

Lemma 2.31 (Mutual information and divergence)

I(X;Y ) = D(PX,Y ‖PX × PY ),

where PX,Y(·, ·) is the joint distribution of the random variables X and Y and PX(·) and PY(·) are the respective marginals.

Proof: The observation follows directly from the definitions of divergence and mutual information.

Definition 2.32 (Refinement of distribution) Given a distribution PX on X, divide X into k mutually disjoint sets, U1, U2, . . . , Uk, satisfying

X = ⋃_{i=1}^{k} Ui.

Define a new distribution PU on U = {1, 2, · · · , k} as

PU(i) = ∑_{x∈Ui} PX(x).

Then PX is called a refinement (or more specifically, a k-refinement) of PU .

Let us briefly discuss the relation between the processing of information andits refinement. Processing of information can be modeled as a (many-to-one)mapping, and refinement is actually the reverse operation. Recall that thedata processing lemma shows that mutual information can never increase dueto processing. Hence, if one wishes to increase mutual information, he shouldsimultaneously “anti-process” (or refine) the involved statistics.

From Lemma 2.31, the mutual information can be viewed as the divergenceof a joint distribution against the product distribution of the marginals. It istherefore reasonable to expect that a similar effect due to processing (or a reverseeffect due to refinement) should also apply to divergence. This is shown in thenext lemma.

28

Page 41: It Lecture Notes2014

Lemma 2.33 (Refinement cannot decrease divergence) Let PX and PX

be the refinements (k-refinements) of PU and PU respectively. Then

D(PX‖PX) ≥ D(PU‖PU).

Proof: By the log-sum inequality, we obtain that for any i ∈ 1, 2, · · · , k∑

x∈Ui

PX(x) log2PX(x)

PX(x)≥

(∑

x∈Ui

PX(x)

)

log2

x∈UiPX(x)

x∈UiPX(x)

= PU(i) log2PU(i)

PU(i), (2.6.1)

with equality iffPX(x)

PX(x)=

PU(i)

PU(i)

for all x ∈ U . Hence,

D(PX‖PX) =k∑

i=1

x∈Ui

PX(x) log2PX(x)

PX(x)

≥k∑

i=1

PU(i) log2PU(i)

PU(i)

= D(PU‖PU),

with equality iff

(∀ i)(∀ x ∈ Ui)PX(x)

PX(x)=

PU(i)

PU(i).

Observation 2.34 One drawback of adopting the divergence as a measure be-tween two distributions is that it does not meet the symmetry requirement of atrue distance,7 since interchanging its two arguments may yield different quan-tities. In other words, D(PX‖PX) 6= D(PX‖PX) in general. (It also does notsatisfy the triangular inequality.) Thus divergence is not a true distance or met-ric. Another measure which is a true distance, called variational distance, issometimes used instead.

7Given a non-empty set A, the function d : A × A → [0,∞) is called a distance or metric

if it satisfies the following properties.

1. Non-negativity: d(a, b) ≥ 0 for every a, b ∈ A with equality holding iff a = b.

2. Symmetry: d(a, b) = d(b, a) for every a, b ∈ A.

3. Triangular inequality: d(a, b) + d(b, c) ≥ d(a, c) for every a, b, c ∈ A.

29

Page 42: It Lecture Notes2014

Definition 2.35 (Variational distance) The variational distance (or L1-distance)between two distributions PX and PX with common alphabet X is defined by

‖PX − PX‖ ,∑

x∈X|PX(x)− PX(x)|.

Lemma 2.36 The variational distance satisfies

‖PX − PX‖ = 2 · supE⊂X

|PX(E)− PX(E)| = 2 ·∑

x∈X :PX(x)>PX(x)

[PX(x)− PX(x)] .

Proof: We first show that ‖PX − PX‖ = 2 ·∑x∈X :PX(x)>PX(x) [PX(x)− PX(x)] .

Setting A , x ∈ X : PX(x) > PX(x), we have

‖PX − PX‖ =∑

x∈X|PX(x)− PX(x)|

=∑

x∈A|PX(x)− PX(x)|+

x∈Ac

|PX(x)− PX(x)|

=∑

x∈A[PX(x)− PX(x)] +

x∈Ac

[PX(x)− PX(x)]

=∑

x∈A[PX(x)− PX(x)] + PX (Ac)− PX (Ac)

=∑

x∈A[PX(x)− PX(x)] + PX(A)− PX(A)

=∑

x∈A[PX(x)− PX(x)] +

x∈A[PX(x)− PX(x)]

= 2 ·∑

x∈A[PX(x)− PX(x)]

where Ac denotes the complement set of A.

We next prove that ‖PX − PX‖ = 2 · supE⊂X |PX(E)− PX(E)| by showingthat each quantity is greater than or equal to the other. For any set E ⊂ X , wecan write

‖PX − PX‖ =∑

x∈X|PX(x)− PX(x)|

=∑

x∈E|PX(x)− PX(x)|+

x∈Ec

|PX(x)− PX(x)|

≥∣∣∣∣∣

x∈E[PX(x)− PX(x)]

∣∣∣∣∣+

∣∣∣∣∣

x∈Ec

[PX(x)− PX(x)]

∣∣∣∣∣

30

Page 43: It Lecture Notes2014

= |PX(E)− PX(E)|+ |PX(Ec)− PX(E

c)|= |PX(E)− PX(E)|+ |PX(E)− PX(E)|= 2 · |PX(E)− PX(E)|.

Thus ‖PX − PX‖ ≥ 2 · supE⊂X |PX(E)− PX(E)|. Conversely, we have that

2 · supE⊂X

|PX(E)− PX(E)| ≥ 2 · |PX(A)− PX(A)|

= |PX(A)− PX(A)|+ |PX(Ac)− PX(Ac)|

=

∣∣∣∣∣

x∈A[PX(x)− PX(x)]

∣∣∣∣∣+

∣∣∣∣∣

x∈Ac

[PX(x)− PX(x)]

∣∣∣∣∣

=∑

x∈A|PX(x)− PX(x)|+

x∈Ac

|PX(x)− PX(x)|

= ‖PX − PX‖.

Therefore, ‖PX − PX‖ = 2 · supE⊂X |PX(E)− PX(E)| .

Lemma 2.37 (Variational distance vs divergence: Pinsker’s inequality)

D(X‖X) ≥ log2(e)

2· ‖PX − PX‖2.

This result is referred to as Pinsker’s inequality.

Proof:

1. With A , x ∈ X : PX(x) > PX(x), we have from the previous lemmathat

‖PX − PX‖ = 2[PX(A)− PX(A)].

2. Define two random variables U and U as:

U =

1, if X ∈ A;0, if X ∈ Ac,

and

U =

1, if X ∈ A;

0, if X ∈ Ac.

Then PX and PX are refinements (2-refinements) of PU and PU , respec-tively. From Lemma 2.33, we obtain that

D(PX‖PX) ≥ D(PU‖PU).

31

Page 44: It Lecture Notes2014

3. The proof is complete if we show that

D(PU‖PU) ≥ 2 log2(e)[PX(A)− PX(A)]2

= 2 log2(e)[PU(1)− PU(1)]2.

For ease of notations, let p = PU(1) and q = PU(1). Then to prove theabove inequality is equivalent to show that

p · ln p

q+ (1− p) · ln 1− p

1− q≥ 2(p− q)2.

Define

f(p, q) , p · ln p

q+ (1− p) · ln 1− p

1− q− 2(p− q)2,

and observe that

df(p, q)

dq= (p− q)

(

4− 1

q(1− q)

)

≤ 0 for q ≤ p.

Thus, f(p, q) is non-increasing in q for q ≤ p. Also note that f(p, q) = 0for q = p. Therefore,

f(p, q) ≥ 0 for q ≤ p.

The proof is completed by noting that

f(p, q) ≥ 0 for q ≥ p,

since f(1− p, 1− q) = f(p, q).

Observation 2.38 The above lemma tells us that for a sequence of distributions(PXn

, PXn)n≥1, when D(PXn

‖PXn) goes to zero as n goes to infinity, ‖PXn

−PXn

‖ goes to zero as well. But the converse does not necessarily hold. For aquick counterexample, let

PXn(0) = 1− PXn

(1) = 1/n > 0

andPXn

(0) = 1− PXn(1) = 0.

In this case,D(PXn

‖PXn) → ∞

since by convention, (1/n) · log2((1/n)/0) → ∞. However,

‖PX − PX‖ = 2

[

PX

(x : PX(x) > PX(x)

)− PX

(x : PX(x) > PX(x)

)]

32

Page 45: It Lecture Notes2014

=2

n→ 0.

We however can upper bound D(PX‖PX) by the variational distance betweenPX and PX when D(PX‖PX) < ∞.

Lemma 2.39 If D(PX‖PX) < ∞, then

D(PX‖PX) ≤log2(e)

minx : PX(x)>0

minPX(x), PX(x)· ‖PX − PX‖.

Proof: Without loss of generality, we assume that PX(x) > 0 for all x ∈ X .Since D(PX‖PX) < ∞, we have that for any x ∈ X , PX(x) > 0 implies thatPX(x) > 0. Let

t , minx∈X : PX(x)>0

minPX(x), PX(x).

Then for all x ∈ X ,

lnPX(x)

PX(x)≤

∣∣∣∣ln

PX(x)

PX(x)

∣∣∣∣

≤∣∣∣∣

maxminPX(x),P

X(x)≤s≤maxPX(x),P

X(x)

d ln(s)

ds

∣∣∣∣· |PX(x)− PX(x)|

=1

minPX(x), PX(x)· |PX(x)− PX(x)|

≤ 1

t· |PX(x)− PX(x)|.

Hence,

D(PX‖PX) = log2(e)∑

x∈XPX(x) · ln

PX(x)

PX(x)

≤ log2(e)

t

x∈XPX(x) · |PX(x)− PX(x)|

≤ log2(e)

t

x∈X|PX(x)− PX(x)|

=log2(e)

t· ‖PX − PX‖.

The next lemma discusses the effect of side information on divergence. Asstated in Lemma 2.12, side information usually reduces entropy; it, however,

33

Page 46: It Lecture Notes2014

increases divergence. One interpretation of these results is that side informationis useful. Regarding entropy, side information provides us more information, souncertainty decreases. As for divergence, it is the measure or index of how easyone can differentiate the source from two candidate distributions. The largerthe divergence, the easier one can tell apart between these two distributionsand make the right guess. At an extreme case, when divergence is zero, onecan never tell which distribution is the right one, since both produce the samesource. So, when we obtain more information (side information), we should beable to make a better decision on the source statistics, which implies that thedivergence should be larger.

Definition 2.40 (Conditional divergence) Given three discrete random vari-ables, X, X and Z, where X and X have a common alphabet X , we define theconditional divergence between X and X given Z by

D(X‖X|Z) = D(PX|Z‖PX|Z) ,∑

z∈Z

x∈XPX,Z(x, z) log

PX|Z(x|z)PX|Z(x|z)

.

In other words, it is the expected value with respect to PX,Z of the log-likelihood

ratio logPX|Z

PX|Z

.

Lemma 2.41 (Conditional mutual information and conditional diver-gence) Given three discrete random variables X, Y and Z with alphabets X ,Y and Z, respectively, and joint distribution PX,Y,Z , then

I(X;Y |Z) = D(PX,Y |Z‖PX|ZPY |Z)

=∑

x∈X

y∈Y

z∈ZPX,Y,Z(x, y, z) log2

PX,Y |Z(x, y|z)PX|Z(x|z)PY |Z(y|z)

,

where PX,Y |Z is conditional joint distribution of X and Y given Z, and PX|Z andPY |Z are the conditional distributions of X and Y , respectively, given Z.

Proof: The proof follows directly from the definition of conditional mutual in-formation (2.2.2) and the above definition of conditional divergence.

Lemma 2.42 (Chain rule for divergence) For three discrete random vari-ables, X, X and Z, where X and X have a common alphabet X , we have that

D(PX,Z‖PX,Z) = D(PX‖PX) +D(PZ|X‖PZ|X).

Proof: The proof readily follows from the divergence definitions.

34

Page 47: It Lecture Notes2014

Lemma 2.43 (Conditioning never decreases divergence) For three discreterandom variables, X, X and Z, where X and X have a common alphabet X ,we have that

D(PX|Z‖PX|Z) ≥ D(PX‖PX).

Proof:

D(PX|Z‖PX|Z)−D(PX‖PX)

=∑

z∈Z

x∈XPX,Z(x, z) · log2

PX|Z(x|z)PX|Z(x|z)

−∑

x∈XPX(x) · log2

PX(x)

PX(x)

=∑

z∈Z

x∈XPX,Z(x, z) · log2

PX|Z(x|z)PX|Z(x|z)

−∑

x∈X

(∑

z∈ZPX,Z(x, z)

)

· log2PX(x)

PX(x)

=∑

z∈Z

x∈XPX,Z(x, z) · log2

PX|Z(x|z)PX(x)

PX|Z(x|z)PX(x)

≥∑

z∈Z

x∈XPX,Z(x, z) · log2(e)

(

1−PX|Z(x|z)PX(x)

PX|Z(x|z)PX(x)

)

(by the FI Lemma)

= log2(e)

(

1−∑

x∈X

PX(x)

PX(x)

z∈ZPZ(z)PX|Z(x|z)

)

= log2(e)

(

1−∑

x∈X

PX(x)

PX(x)PX(x)

)

= log2(e)

(

1−∑

x∈XPX(x)

)

= 0

with equality holding iff for all x and z,

PX(x)

PX(x)=

PX|Z(x|z)PX|Z(x|z)

.

Note that it is not necessary that

D(PX|Z‖PX|Z) ≥ D(PX‖PX).

In other words, the side information is helpful for divergence only when it pro-vides information on the similarity or difference of the two distributions. For theabove case, Z only provides information about X, and Z provides informationabout X; so the divergence certainly cannot be expected to increase. The nextlemma shows that if (Z, Z) are independent of (X, X), then the side informationof (Z, Z) does not help in improving the divergence of X against X.

35

Page 48: It Lecture Notes2014

Lemma 2.44 (Independent side information does not change diver-gence) If (X, X) is independent of (Z, Z), then

D(PX|Z‖PX|Z) = D(PX‖PX),

where

D(PX|Z‖PX|Z) ,∑

x∈X

z∈ZPX,Z(x, z) log2

PX|Z(x|z)PX|Z(x|z)

.

Proof: This can be easily justified by the definition of divergence.

Lemma 2.45 (Additivity of divergence under independence)

D(PX,Z‖PX,Z) = D(PX‖PX) +D(PZ‖PZ),

provided that (X, X) is independent of (Z, Z).

Proof: This can be easily proved from the definition.

2.7 Convexity/concavity of information measures

We next address the convexity/concavity properties of information measureswith respect to the distributions on which they are defined. Such properties willbe useful when optimizing the information measures over distribution spaces.

Lemma 2.46

1. H(PX) is a concave function of PX , namely

H(λPX + (1− λ)PX) ≥ λH(PX) + (1− λ)H(PX).

2. Noting that I(X;Y ) can be re-written as I(PX , PY |X), where

I(PX , PY |X) ,∑

x∈X

y∈YPY |X(y|x)PX(x) log2

PY |X(y|x)∑

a∈X PY |X(y|a)PX(a),

then I(X;Y ) is a concave function of PX (for fixed PY |X), and a convexfunction of PY |X (for fixed PX).

36

Page 49: It Lecture Notes2014

3. D(PX‖PX) is convex with respect to both the first argument PX and thesecond argument PX . It is also convex in the pair (PX , PX); i.e., if (PX , PX)and (QX , QX) are two pairs of probability mass functions, then

D(λPX + (1− λ)QX‖λPX + (1− λ)QX)

≤ λ ·D(PX‖PX) + (1− λ) ·D(QX‖QX), (2.7.1)

for all λ ∈ [0, 1].

Proof:

1. The proof uses the log-sum inequality:

H(λPX + (1− λ)PX)−[λH(PX) + (1− λ)H(PX)

]

= λ∑

x∈XPX(x) log2

PX(x)

λPX(x) + (1− λ)PX(x)

+(1− λ)∑

x∈XPX(x) log2

PX(x)

λPX(x) + (1− λ)PX(x)

≥ λ

(∑

x∈XPX(x)

)

log2

x∈X PX(x)∑

x∈X [λPX(x) + (1− λ)PX(x)]

+(1− λ)

(∑

x∈XPX(x)

)

log2

x∈X PX(x)∑

x∈X [λPX(x) + (1− λ)PX(x)]

= 0,

with equality holding iff PX(x) = PX(x) for all x.

2. We first show the concavity of I(PX , PY |X) with respect to PX . Let λ =1− λ.

I(λPX + λPX , PY |X)− λI(PX , PY |X)− λI(PX , PY |X)

= λ∑

y∈Y

x∈XPX(x)PY |X(y|x) log2

x∈XPX(x)PY |X(y|x)

x∈X[λPX(x) + λPX(x)]PY |X(y|x)]

+ λ∑

y∈Y

x∈XPX(x)PY |X(y|x) log2

x∈XPX(x)PY |X(y|x)

x∈X[λPX(x) + λPX(x)]PY |X(y|x)]

≥ 0 (by the log-sum inequality)

37

Page 50: It Lecture Notes2014

with equality holding iff

x∈XPX(x)PY |X(y|x) =

x∈XPX(x)PY |X(y|x)

for all y ∈ Y . We now turn to the convexity of I(PX , PY |X) with respect to

PY |X . For ease of notation, let PYλ(y) , λPY (y)+λPY (y), and PYλ|X(y|x) ,

λPY |X(y|x) + λPY |X(y|x). Then

λI(PX , PY |X) + λI(PX , PY |X)− I(PX , λPY |X + λPY |X)

= λ∑

x∈X

y∈YPX(x)PY |X(y|x) log2

PY |X(y|x)PY (y)

+λ∑

x∈X

y∈YPX(x)PY |X(y|x) log2

PY |X(y|x)PY (y)

−∑

x∈X

y∈YPX(x)PYλ|X(y|x) log2

PYλ|X(y|x)PYλ

(y)

= λ∑

x∈X

y∈YPX(x)PY |X(y|x) log2

PY |X(y|x)PYλ(y)

PY (y)PYλ|X(y|x)

+λ∑

x∈X

y∈YPX(x)PY |X(y|x) log2

PY |X(y|x)PYλ(y)

PY (y)PYλ|X(y|x)

≥ λ log2(e)∑

x∈X

y∈YPX(x)PY |X(y|x)

(

1− PY (y)PYλ|X(y|x)PY |X(y|x)PYλ

(y)

)

+λ log2(e)∑

x∈X

y∈YPX(x)PY |X(y|x)

(

1− PY (y)PYλ|X(y|x)PY |X(y|x)PYλ

(y)

)

= 0,

where the inequality follows from the FI Lemma, with equality holding iff

(∀ x ∈ X , y ∈ Y)PY (y)

PY |X(y|x)=

PY (y)

PY |X(y|x).

3. For ease of notation, let PXλ(x) , λPX(x) + (1− λ)PX(x).

λD(PX‖PX) + (1− λ)D(PX‖PX)−D(PXλ‖PX)

= λ∑

x∈XPX(x) log2

PX(x)

PXλ(x)

+ (1− λ)∑

x∈XPX(x) log2

PX(x)

PXλ(x)

= λD(PX‖PXλ) + (1− λ)D(PX‖PXλ

)

38

Page 51: It Lecture Notes2014

≥ 0

by the non-negativity of the divergence, with equality holding iff PX(x) =PX(x) for all x.

Similarly, by letting PXλ(x) , λPX(x) + (1− λ)PX(x), we obtain:

λD(PX‖PX) + (1− λ)D(PX‖PX)−D(PX‖PXλ)

= λ∑

x∈XPX(x) log2

PXλ(x)

PX(x)+ (1− λ)

x∈XPX(x) log2

PXλ(x)

PX(x)

≥ λ

ln 2

x∈XPX(x)

(

1− PX(x)

PXλ(x)

)

+(1− λ)

ln 2

x∈XPX(x)

(

1− PX(x)

PXλ(x)

)

= log2(e)

(

1−∑

x∈XPX(x)

λPX(x) + (1− λ)PX(x)

PXλ(x)

)

= 0,

where the inequality follows from the FI Lemma, with equality holding iffPX(x) = PX(x) for all x.

Finally, by the log-sum inequality, for each x ∈ X , we have

(λPX(x) + (1− λ)QX(x)) log2λPX(x) + (1− λ)QX(x)

λPX(x) + (1− λ)QX(x)

≤ λPX(x) log2λPX(x)

λPX(x)+ (1− λ)QX(x) log2

(1− λ)QX(x)

(1− λ)QX(x).

Summing over x, we yield (2.7.1).

Note that the last result (convexity of D(PX‖PX) in the pair (PX , PX))actually implies the first two results: just set PX = QX to show convexityin the first argument PX , and set PX = QX to show convexity in the secondargument PX .

2.8 Fundamentals of hypothesis testing

One of the fundamental problems in statistics is to decide between two alternativeexplanations for the observed data. For example, when gambling, one may wishto test whether it is a fair game or not. Similarly, a sequence of observations onthe market may reveal the information that whether a new product is successful

39

Page 52: It Lecture Notes2014

or not. This is the simplest form of the hypothesis testing problem, which isusually named simple hypothesis testing.

It has quite a few applications in information theory. One of the frequentlycited examples is the alternative interpretation of the law of large numbers.Another example is the computation of the true coding error (for universal codes)by testing the empirical distribution against the true distribution. All of thesecases will be discussed subsequently.

The simple hypothesis testing problem can be formulated as follows:

Problem: Let X1, . . . , Xn be a sequence of observations which is possibly dr-awn according to either a “null hypothesis” distribution PXn or an “alternativehypothesis” distribution PXn . The hypotheses are usually denoted by:

• H0 : PXn

• H1 : PXn

Based on one sequence of observations xn, one has to decide which of the hy-potheses is true. This is denoted by a decision mapping φ(·), where

φ(xn) =

0, if distribution of Xn is classified to be PXn ;1, if distribution of Xn is classified to be PXn .

Accordingly, the possible observed sequences are divided into two groups:

Acceptance region for H0 : xn ∈ X n : φ(xn) = 0Acceptance region for H1 : xn ∈ X n : φ(xn) = 1.

Hence, depending on the true distribution, there are possibly two types of pro-bability of errors:

Type I error : αn = αn(φ) , PXn (xn ∈ X n : φ(xn) = 1)Type II error : βn = βn(φ) , PXn (xn ∈ X n : φ(xn) = 0) .

The choice of the decision mapping is dependent on the optimization criterion.Two of the most frequently used ones in information theory are:

1. Bayesian hypothesis testing.

Here, φ(·) is chosen so that the Bayesian cost

π0αn + π1βn

is minimized, where π0 and π1 are the prior probabilities for the nulland alternative hypotheses, respectively. The mathematical expression forBayesian testing is:

minφ

[π0αn(φ) + π1βn(φ)] .

40

Page 53: It Lecture Notes2014

2. Neyman Pearson hypothesis testing subject to a fixed test level.

Here, φ(·) is chosen so that the type II error βn is minimized subject to aconstant bound on the type I error; i.e.,

αn ≤ ε

where ε > 0 is fixed. The mathematical expression for Neyman-Pearsontesting is:

minφ : αn(φ)≤ε

βn(φ).

The set φ considered in the minimization operation could have two differentranges: range over deterministic rules, and range over randomization rules. Themain difference between a randomization rule and a deterministic rule is that theformer allows the mapping φ(xn) to be random on 0, 1 for some xn, while thelatter only accept deterministic assignments to 0, 1 for all xn. For example, arandomization rule for specific observations xn can be

φ(xn) = 0, with probability 0.2;

φ(xn) = 1, with probability 0.8.

The Neyman-Pearson lemma shows the well-known fact that the likelihoodratio test is always the optimal test.

Lemma 2.47 (Neyman-Pearson Lemma) For a simple hypothesis testingproblem, define an acceptance region for the null hypothesis through the likeli-hood ratio as

An(τ) ,

xn ∈ X n :PXn(xn)

PXn(xn)> τ

,

and letα∗n , PXn Ac

n(τ) and β∗n , PXn An(τ) .

Then for type I error αn and type II error βn associated with another choice ofacceptance region for the null hypothesis, we have

αn ≤ α∗n ⇒ βn ≥ β∗

n.

Proof: Let B be a choice of acceptance region for the null hypothesis. Then

αn + τβn =∑

xn∈Bc

PXn(xn) + τ∑

xn∈BPXn(x

n)

=∑

xn∈Bc

PXn(xn) + τ

[

1−∑

xn∈Bc

PXn(xn)

]

41

Page 54: It Lecture Notes2014

= τ +∑

xn∈Bc

[PXn(xn)− τPXn(xn)] . (2.8.1)

Observe that (2.8.1) is minimized by choosing B = An(τ). Hence,

αn + τβn ≥ α∗n + τβ∗

n,

which immediately implies the desired result.

The Neyman-Pearson lemma indicates that no other choices of acceptance re-gions can simultaneously improve both type I and type II errors of the likelihoodratio test. Indeed, from (2.8.1), it is clear that for any αn and βn, one can alwaysfind a likelihood ratio test that performs as good. Therefore, the likelihood ratiotest is an optimal test. The statistical properties of the likelihood ratio thusbecome essential in hypothesis testing. Note that, when the observations arei.i.d. under both hypotheses, the divergence, which is the statistical expectationof the log-likelihood ratio, plays an important role in hypothesis testing (fornon-memoryless observations, one is then concerned with the divergence rate, anextended notion of divergence for systems with memory which will be defined ina following chapter).

42

Page 55: It Lecture Notes2014

Chapter 3

Lossless Data Compression

3.1 Principles of data compression

As mentioned in Chapter 1, data compression describes methods of representinga source by a code whose average codeword length (or code rate) is acceptablysmall. The representation can be: lossless (or asymptotically lossless) wherethe reconstructed source is identical (or asymptotically identical) to the originalsource; or lossy where the reconstructed source is allowed to deviate from theoriginal source, usually within an acceptable threshold. We herein focus onlossless data compression.

Since a memoryless source is modelled as a random variable, the averagedcodeword length of a codebook is calculated based on the probability distributionof that random variable. For example, a ternary memoryless source X exhibitsthree possible outcomes with

PX(x = outcomeA) = 0.5;

PX(x = outcomeB) = 0.25;

PX(x = outcomeC) = 0.25.

Suppose that a binary code book is designed for this source, in which outcomeA,outcomeB and outcomeC are respectively encoded as 0, 10, and 11. Then theaverage codeword length (in bits/source outcome) is

length(0)× PX(outcomeA) + length(10)× PX(outcomeB)

+length(11)× PX(outcomeC)

= 1× 0.5 + 2× 0.25 + 2× 0.25

= 1.5 bits.

There are usually no constraints on the basic structure of a code. In thecase where the codeword length for each source outcome can be different, the

43

Page 56: It Lecture Notes2014

code is called a variable-length code. When the codeword lengths of all sourceoutcomes are equal, the code is referred to as a fixed-length code . It is obviousthat the minimum average codeword length among all variable-length codes isno greater than that among all fixed-length codes, since the latter is a subclassof the former. We will see in this chapter that the smallest achievable averagecode rate for variable-length and fixed-length codes coincide for sources withgood probabilistic characteristics, such as stationarity and ergodicity. But formore general sources with memory, the two quantities are different (cf. Part IIof the book).

For fixed-length codes, the sequence of adjacent codewords are concate-nated together for storage or transmission purposes, and some punctuationmechanism—such as marking the beginning of each codeword or delineatinginternal sub-blocks for synchronization between encoder and decoder—is nor-mally considered an implicit part of the codewords. Due to constraints on spaceor processing capability, the sequence of source symbols may be too long for theencoder to deal with all at once; therefore, segmentation before encoding is oftennecessary. For example, suppose that we need to encode using a binary code thegrades of a class with 100 students. There are three grade levels: A, B and C.By observing that there are 3100 possible grade combinations for 100 students,a straightforward code design requires ⌈log2(3100)⌉ = 159 bits to encode thesecombinations. Now suppose that the encoder facility can only process 16 bitsat a time. Then the above code design becomes infeasible and segmentation isunavoidable. Under such constraint, we may encode grades of 10 students at atime, which requires ⌈log2(310)⌉ = 16 bits. As a consequence, for a class of 100students, the code requires 160 bits in total.

In the above example, the letters in the grade set A,B,C and the lettersfrom the code alphabet 0, 1 are often called source symbols and code symbols,respectively. When the code alphabet is binary (as in the previous two examples),the code symbols are referred to as code bits or simply bits (as already used).A tuple (or grouped sequence) of source symbols is called a sourceword and theresulting encoded tuple consisting of code symbols is called a codeword. (In theabove example, each sourceword consists of 10 source symbols (students) andeach codeword consists of 16 bits.)

Note that, during the encoding process, the sourceword lengths do not have tobe equal. In this text, we however only consider the case where the sourcewordshave a fixed length throughout the encoding process (except for the Lempel-Zivcode briefly discussed at the end of this chapter), but we will allow the codewordsto have fixed or variable lengths as defined earlier.1 The block diagram of a source

1In other words, our fixed-length codes are actually “fixed-to-fixed length codes” and ourvariable-length codes are “fixed-to-variable length codes” since, in both cases, a fixed number

44

Page 57: It Lecture Notes2014

Source sourcewords

Sourceencoder

codewords

Sourcedecoder

sourcewords

Figure 3.1: Block diagram of a data compression system.

coding system is depicted in Figure 3.1.

When adding segmentation mechanisms to fixed-length codes, the codes canbe loosely divided into two groups. The first consists of block codes in which theencoding (or decoding) of the next segment of source symbols is independent ofthe previous segments. If the encoding/decoding of the next segment, somehow,retains and uses some knowledge of earlier segments, the code is called a fixed-length tree code. As will not investigate such codes in this text, we can use“block codes” and “fixed-length codes” as synonyms.

In this chapter, we first consider data compression for block codes in Sec-tion 3.2. Data compression for variable-length codes is then addressed in Sec-tion 3.3.

3.2 Block codes for asymptotically lossless compression

3.2.1 Block codes for discrete memoryless sources

We first focus on the study of asymptotically lossless data compression of discretememoryless sources via block (fixed-length) codes. Such sources were alreadydefined in Appendix B and the previous chapter; but we nevertheless recall theirdefinition.

Definition 3.1 (Discrete memoryless source) A discrete memoryless source(DMS) Xn∞n=1 consists of a sequence of independent and identically distributed(i.i.d.) random variables, X1, X2, X3, . . ., all taking values in a common finitealphabet X . In particular, if PX(·) is the common distribution or probabilitymass function (pmf) of the Xi’s, then

PXn(x1, x2, . . . , xn) =n∏

i=1

PX(xi).

of source symbols is mapped onto codewords with fixed and variable lengths, respectively.

45

Page 58: It Lecture Notes2014

Definition 3.2 An (n,M) block code of blocklength n and size M (which canbe a function of n in general,2 i.e., M = Mn) for a discrete source Xn∞n=1 is a setc1, c2, . . . , cM ⊆ X n consisting of M reproduction (or reconstruction) words,where each reproduction word is a sourceword (an n-tuple of source symbols).3

The block code’s operation can be symbolically represented as4

(x1, x2, . . . , xn) → cm ∈ c1, c2, . . . , cM.

This procedure will be repeated for each consecutive block of length n, i.e.,

· · · (x3n, . . . , x31)(x2n, . . . , x21)(x1n, . . . , x11) → · · · |cm3 |cm2 |cm1 ,

where “|” reflects the necessity of “punctuation mechanism” or ”synchronizationmechanism” for consecutive source block coders.

The next theorem provides a key tool for proving Shannon’s source codingtheorem.

2In the literature, both (n,M) and (M,n) have been used to denote a block code withblocklength n and size M . For example, [53, p. 149] adopts the former one, while [13, p. 193]uses the latter. We use the (n,M) notation since M = Mn is a function of n in general.

3One can binary-index the reproduction words in c1, c2, . . . , cM using k , ⌈log2 M⌉ bits.As such k-bit words in 0, 1k are usually stored for retrieval at a later date, the (n,M) blockcode can be represented by an encoder-decoder pair of functions (f, g), where the encodingfunction f : Xn → 0, 1k maps each sourceword xn to a k-bit word f(xn) which we call acodeword. Then the decoding function g : 0, 1k → c1, c2, . . . , cM is a retrieving operationthat produces the reproduction words. Since the codewords are binary-valued, such a blockcode is called a binary code. More generally, a D-ary block code (where D > 1 is an integer)would use an encoding function f : Xn → 0, 1, · · · , D − 1k where each codeword f(xn)contains k D-ary code symbols.Furthermore, since the behavior of block codes is investigated for sufficiently large n and

M (tending to infinity), it is legitimate to replace ⌈log2 M⌉ by log2 M for the case of binarycodes. With this convention, the data compression rate or code rate is

bits required per source symbol =k

n=

1

nlog2 M.

Similarly, for D-ary codes, the rate is

D-ary code symbols required per source symbol =k

n=

1

nlogD M.

For computational convenience, nats (under the natural logarithm) can be used instead ofbits or D-ary code symbols; in this case, the code rate becomes:

nats required per source symbol =1

nlogM.

4When one uses an encoder-decoder pair (f, g) to describe the block code, the code’s oper-ation can be expressed as: cm = g(f(xn)).

46

Page 59: It Lecture Notes2014

Theorem 3.3 (Shannon-McMillan) (Asymptotic equipartition propertyor AEP5) If Xn∞n=1 is a DMS with entropy H(X), then

− 1

nlog2 PXn(X1, . . . , Xn) → H(X) in probability.

In other words, for any δ > 0,

limn→∞

Pr

∣∣∣∣− 1

nlog2 PXn(X1, . . . , Xn)−H(X)

∣∣∣∣> δ

= 0.

Proof: This theorem follows by first observing that for an i.i.d. sequence Xn∞n=1,

− 1

nlog2 PXn(X1, . . . , Xn) =

1

n

n∑

i=1

[− log2 PX(Xi)]

and that the sequence − log2 PX(Xi)∞i=1 is i.i.d., and then applying the weaklaw of large numbers (WLLN) on the later sequence.

The AEP indeed constitutes an “information theoretic” analog of WLLN asit states that if − log2 PX(Xi)∞i=1 is an i.i.d. sequence, then for any δ > 0,

Pr

∣∣∣∣∣

1

n

n∑

i=1

[− log2 PX(Xi)]−H(X)

∣∣∣∣∣≤ δ

→ 1 as n → ∞.

As a consequence of the AEP, all the probability mass will be ultimately placedon the weakly δ-typical set, which is defined as

Fn(δ) ,

xn ∈ X n :

∣∣∣∣− 1

nlog2 PXn(xn)−H(X)

∣∣∣∣≤ δ

=

xn ∈ X n :

∣∣∣∣∣− 1

n

n∑

i=1

log2 PX(xi)−H(X)

∣∣∣∣∣≤ δ

.

Note that since the source is memoryless, for any xn ∈ Fn(δ), −(1/n) log2 PXn(xn),the normalized self-information of xn, is equal to (1/n)

∑ni=1 [− log2 PX(xi)],

which is the empirical (arithmetic) average self-information or “apparent” en-tropy of the source. Thus, a sourceword xn is δ-typical if it yields an apparentsource entropy within δ of the “true” source entropyH(X). Note that the source-words in Fn(δ) are nearly equiprobable or equally surprising (cf. Property 1 ofTheorem 3.4); this justifies naming Theorem 3.3 by AEP.

Theorem 3.4 (Consequence of the AEP) Given a DMS Xn∞n=1 with en-tropy H(X) and any δ greater than zero, then the weakly δ-typical set Fn(δ)satisfies the following.

5This is also called the entropy stability property.

47

Page 60: It Lecture Notes2014

1. If xn ∈ Fn(δ), then

2−n(H(X)+δ) ≤ PXn(xn) ≤ 2−n(H(X)−δ).

2. PXn (F cn(δ)) < δ for sufficiently large n, where the superscript “c” denotes

the complementary set operation.

3. |Fn(δ)| > (1−δ)2n(H(X)−δ) for sufficiently large n, and |Fn(δ)| ≤ 2n(H(X)+δ)

for every n, where |Fn(δ)| denotes the number of elements in Fn(δ).

Note: The above theorem also holds if we define the typical set using the base-D logarithm logD for any D > 1 instead of the base-2 logarithm; in this case,one just needs to appropriately change the base of the exponential terms in theabove theorem (by replacing 2x terms with Dx terms) and also substitute H(X)with HD(X)).

Proof: Property 1 is an immediate consequence of the definition of Fn(δ).Property 2 is a direct consequence of the AEP, since the AEP states that for afixed δ > 0, limn→∞ PXn(Fn(δ)) = 1; i.e., ∀ε > 0, there exists n0 = n0(ε) suchthat for all n ≥ n0,

PXn(Fn(δ)) > 1− ε.

In particular, setting ε = δ yields the result. We nevertheless provide a directproof of Property 2 as we give an explicit expression for n0: observe that byChebyshev’s inequality,6

PXn(F cn(δ)) = PXn

xn ∈ X n :

∣∣∣∣− 1

nlog2 PXn(xn)−H(X)

∣∣∣∣> δ

≤ σ2X

nδ2< δ,

for n > σ2X/δ

3, where the variance

σ2X , Var[− log2 PX(X)] =

x∈XPX(x) [log2 PX(x)]

2 − (H(X))2

is a constant7 independent of n.

6The detail for Chebyshev’s inequality as well as its proof can be found on p. 203 inAppendix B.

7In the proof, we assume that the variance σ2X = Var[− log2 PX(X)] < ∞. This holds since

the source alphabet is finite:

Var[− log2 PX(X)] ≤ E[(log2 PX(X))2] =∑

x∈X

PX(x)(log2 PX(x))2

≤∑

x∈X

4

e2[log2(e)]

2 =4

e2[log2(e)]

2 × |X | < ∞.

48

Page 61: It Lecture Notes2014

To prove Property 3, we have from Property 1 that

1 ≥∑

xn∈Fn(δ)

PXn(xn) ≥∑

xn∈Fn(δ)

2−n(H(X)+δ) = |Fn(δ)|2−n(H(X)+δ),

and, using Properties 2 and 1, we have that

1− δ < 1− σ2X

nδ2≤

xn∈Fn(δ)

PXn(xn) ≤∑

xn∈Fn(δ)

2−n(H(X)−δ) = |Fn(δ)|2−n(H(X)−δ),

for n ≥ σ2X/δ

3.

Note that for any n > 0, a block code C∼n = (n,M) is said to be uniquelydecodable or completely lossless if its set of reproduction words is trivially equalto the set of all source n-tuples: c1, c2, . . . , cM = X n. In this case, if we arebinary-indexing the reproduction words using an encoding-decoding pair (f, g),every sourceword xn will be assigned to a distinct binary codeword f(xn) oflength k = log2 M and all the binary k-tuples are the image under f of somesourceword. In other words, f is a bijective (injective and surjective) map andhence invertible with the decoding map g = f−1 and M = |X |n = 2k. Thus thecode rate is (1/n) log2 M = log2 |X | bits/source symbol.

Now the question becomes: can we achieve a better (i.e., smaller) compres-sion rate? The answer is affirmative: we can achieve a compression rate equal tothe source entropyH(X) (in bits), which can be significantly smaller than log2 Mwhen this source is strongly non-uniformly distributed, if we give up unique de-codability (for every n) and allow n to be sufficiently large to asymptoticallyachieve lossless reconstruction by having an arbitrarily small (but positive) pro-bability of decoding error Pe( C∼n) , PXnxn ∈ X n : g(f(xn)) 6= xn.

Thus, block codes herein can perform data compression that is asymptoticallylossless with respect to blocklength; this contrasts with variable-length codeswhich can be completely lossless (uniquely decodable) for every finite block-length.

We now can formally state and prove Shannon’s asymptotically lossless sourcecoding theorem for block codes. The theorem will be stated for general D-ary block codes, representing the source entropy HD(X) in D-ary code sym-bol/source symbol as the smallest (infimum) possible compression rate for asymp-totically lossless D-ary block codes. Without loss of generality, the theoremwill be proved for the case of D = 2. The idea behind the proof of the for-ward (achievability) part is basically to binary-index the source sequence in theweakly δ-typical set Fn(δ) to a binary codeword (starting from index one withcorresponding k-tuple codeword 0 · · · 01); and to encode all sourcewords outsideFn(δ) to a default all-zero binary codeword, which certainly cannot be repro-duced distortionless due to its many-to-one-mapping property. The resultant

49

Page 62: It Lecture Notes2014

Source

∣∣∣∣∣−1

2

2∑

i=1

log2 PX(xi)−H(X)

∣∣∣∣∣

codewordreconstructedsource sequence

AA 0.525 bits 6∈ F2(0.4) 000 ambiguousAB 0.317 bits ∈ F2(0.4) 001 ABAC 0.025 bits ∈ F2(0.4) 010 ACAD 0.475 bits 6∈ F2(0.4) 000 ambiguousBA 0.317 bits ∈ F2(0.4) 011 BABB 0.109 bits ∈ F2(0.4) 100 BBBC 0.183 bits ∈ F2(0.4) 101 BCBD 0.683 bits 6∈ F2(0.4) 000 ambiguousCA 0.025 bits ∈ F2(0.4) 110 CACB 0.183 bits ∈ F2(0.4) 111 CBCC 0.475 bits 6∈ F2(0.4) 000 ambiguousCD 0.975 bits 6∈ F2(0.4) 000 ambiguousDA 0.475 bits 6∈ F2(0.4) 000 ambiguousDB 0.683 bits 6∈ F2(0.4) 000 ambiguousDC 0.975 bits 6∈ F2(0.4) 000 ambiguousDD 1.475 bits 6∈ F2(0.4) 000 ambiguous

Table 3.1: An example of the δ-typical set with n = 2 and δ = 0.4,where F2(0.4) = AB, AC, BA, BB, BC, CA, CB . The codewordset is 001(AB), 010(AC), 011(BA), 100(BB), 101(BC), 110(CA),111(CB), 000(AA, AD, BD, CC, CD, DA, DB, DC, DD) , wherethe parenthesis following each binary codeword indicates those source-words that are encoded to this codeword. The source distribution isPX(A) = 0.4, PX(B) = 0.3, PX(C) = 0.2 and PX(D) = 0.1.

code rate is (1/n)⌈log2(|Fn(δ)|+ 1)⌉ bits per source symbol. As revealed in theShannon-McMillan AEP theorem and its Consequence, almost all the probabi-lity mass will be on Fn(δ) as n sufficiently large, and hence, the probability ofnon-reconstructable source sequences can be made arbitrarily small. A simpleexample for the above coding scheme is illustrated in Table 3.1. The conversepart of the proof will establish (by expressing the probability of correct decodingin terms of the δ-typical set and also using the Consequence of the AEP) thatfor any sequence of D-ary codes with rate strictly below the source entropy, theirprobability of error cannot asymptotically vanish (is bounded away from zero).Actually a stronger result is proven: it is shown that their probability of errornot only does not asymptotically vanish, it actually ultimately grows to 1 (thisis why we call this part a “strong” converse).

50

Page 63: It Lecture Notes2014

Theorem 3.5 (Shannon’s source coding theorem) Given integer D > 1,consider a discrete memoryless source Xn∞n=1 with entropy HD(X). Then thefollowing hold.

• Forward part (achievability): For any 0 < ε < 1, there exists 0 < δ < εand a sequence of D-ary block codes C∼n = (n,Mn)∞n=1 with

lim supn→∞

1

nlogD Mn ≤ HD(X) + δ (3.2.1)

satisfyingPe( C∼n) < ε (3.2.2)

for all sufficiently large n, where Pe( C∼n) denotes the probability of decodingerror for block code C∼n.

8

• Strong converse part: For any 0 < ε < 1, any sequence of D-ary blockcodes C∼n = (n,Mn)∞n=1 with

lim supn→∞

1

nlogD Mn < HD(X) (3.2.3)

satisfiesPe( C∼n) > 1− ε

for all n sufficiently large.

Proof:

Forward Part: Without loss of generality, we will prove the result for the caseof binary codes (i.e., D = 2). Also recall that subscript D in HD(X) will bedropped (i.e., omitted) specifically when D = 2.

Given 0 < ε < 1, fix δ such that 0 < δ < ε and choose n > 2/δ. Nowconstruct a binary C∼n block code by simply mapping the δ/2-typical sourcewordsxn onto distinct not all-zero binary codewords of length k , ⌈log2 Mn⌉ bits. Inother words, binary-index (cf. the footnote in Definition 3.2) the sourcewords inFn(δ/2) with the following encoding map:

xn → binary index of xn, if xn ∈ Fn(δ/2);xn → all-zero codeword, if xn 6∈ Fn(δ/2).

8(3.2.2) is equivalent to lim supn→∞ Pe( C∼n) ≤ ε. Since ε can be made arbitrarily small,the forward part actually indicates the existence of a sequence of D-ary block codes C∼n∞n=1

satisfying (3.2.1) such that lim supn→∞ Pe( C∼n) = 0.Based on this, the converse should be that any sequence of D-ary block codes satisfying

(3.2.3) satisfies lim supn→∞ Pe( C∼n) > 0. However, the so-called strong converse actually givesa stronger consequence: lim supn→∞ Pe( C∼n) = 1 (as ǫ can be made arbitrarily small).

51

Page 64: It Lecture Notes2014

Then by the Shannon-McMillan AEP theorem, we obtain that

Mn = |Fn(δ/2)|+ 1 ≤ 2n(H(X)+δ/2) + 1 < 2 · 2n(H(X)+δ/2) < 2n(H(X)+δ),

for n > 2/δ. Hence, a sequence of C∼n = (n,Mn) block code satisfying (3.2.1) isestablished. It remains to show that the error probability for this sequence of(n,Mn) block code can be made smaller than ε for all sufficiently large n.

By the Shannon-McMillan AEP theorem,

PXn(F cn(δ/2)) <

δ

2for all sufficiently large n.

Consequently, for those n satisfying the above inequality, and being bigger than2/δ,

Pe( C∼n) ≤ PXn(F cn(δ/2)) < δ ≤ ε.

(For the last step, the readers can refer to Table 3.1 to confirm that only the“ambiguous” sequences outside the typical set contribute to the probability oferror.)

Strong Converse Part: Fix any sequence of block codes C∼n∞n=1 with

lim supn→∞

1

nlog2 | C∼n| < H(X).

Let Sn be the set of source symbols that can be correctly decoded through C∼n-coding system. (A quick example is depicted in Figure 3.2.) Then |Sn| = | C∼n|.By choosing δ small enough with ε/2 > δ > 0, and also by definition of limsupoperation, we have

(∃ N0)(∀ n > N0)1

nlog2 |Sn| =

1

nlog2 | C∼n| < H(X)− 2δ,

which implies|Sn| < 2n(H(X)−2δ).

Furthermore, from Property 2 of the Consequence of the AEP, we obtain that

(∃ N1)(∀ n > N1) PXn(F cn(δ)) < δ.

Consequently, for n > N , maxN0, N1, log2(2/ε)/δ, the probability of cor-rectly block decoding satisfies

1− Pe( C∼n) =∑

xn∈Sn

PXn(xn)

=∑

xn∈Sn∩Fcn(δ)

PXn(xn) +∑

xn∈Sn∩Fn(δ)

PXn(xn)

52

Page 65: It Lecture Notes2014

Source SymbolsSn

❲ ❲

Codewords

Figure 3.2: Possible codebook C∼n and its corresponding Sn. The solidbox indicates the decoding mapping from C∼n back to Sn.

≤ PXn(F cn(δ)) + |Sn ∩ Fn(δ)| · max

xn∈Fn(δ)PXn(xn)

< δ + |Sn| · maxxn∈Fn(δ)

PXn(xn)

2+ 2n(H(X)−2δ) · 2−n(H(X)−δ)

2+ 2−nδ

< ε,

which is equivalent to Pe( C∼n) > 1− ε for n > N .

Observation 3.6 The results of the above theorem is illustrated in Figure 3.3,where R = lim supn→∞(1/n) logD Mn is usually called the ultimate (or asymp-totic) code rate of block codes for compressing the source. It is clear from thefigure that the (ultimate) rate of any block code with arbitrarily small decodingerror probability must be greater than or equal to the source entropy.9 Con-versely, the probability of decoding error for any block code of rate smaller thanentropy ultimately approaches 1 (and hence is bounded away from zero). Thusfor a DMS, the source entropy HD(X) is the infimum of all “achievable” source(block) coding rates; i.e., it is the infimum of all rates for which there exists asequence of D-ary block codes with asymptotically vanishing (as the blocklengthgoes to infinity) probability of decoding error.

For a source with (statistical) memory, Shannon-McMillan’s theorem cannotbe directly applied in its original form, and thereby Shannon’s source coding

9Note that it is clear from the statement and proof of the forward part of Theorem 3.5 thatthe source entropy can be achieved as an ultimate compression rate as long as (1/n) logD Mn

approaches it from above with increasing n.

53

Page 66: It Lecture Notes2014

HD(X)

Pen→∞−→ 1

for all block codesPe

n→∞−→ 0for the best data compression block code

R

Figure 3.3: (Ultimate) Compression rate R versus source entropyHD(X) and behavior of the probability of block decoding error as blocklength n goes to infinity for a discrete memoryless source.

theorem appears restricted to only memoryless sources. However, by exploringthe concept behind these theorems, one can find that the key for the validityof Shannon’s source coding theorem is actually the existence of a set An =xn

1 , xn2 , . . . , x

nM with M ≈ DnHD(X) and PXn(Ac

n) → 0, namely, the existenceof a “typical-like” set An whose size is prohibitively small and whose probabilitymass is asymptotically large. Thus, if we can find such typical-like set for asource with memory, the source coding theorem for block codes can be extendedfor this source. Indeed, with appropriate modifications, the Shannon-McMillantheorem can be generalized for the class of stationary ergodic sources and hencea block source coding theorem for this class can be established; this is consideredin the next subsection. The block source coding theorem for general (e.g., non-stationary non-ergodic) sources in terms of a “generalized entropy” measure (seethe end of the next subsection for a brief description) will be studied in detailin Part II of the book.

3.2.2 Block codes for stationary ergodic sources

In practice, a stochastic source used to model data often exhibits memory orstatistical dependence among its random variables; its joint distribution is hencenot a product of its marginal distributions. In this subsection, we consider theasymptotic lossless data compression theorem for the class of stationary ergodicsources.10

Before proceeding to generalize the block source coding theorem, we needto first generalize the “entropy” measure for a sequence of dependent randomvariables Xn (which certainly should be backward compatible to the discretememoryless cases). A straightforward generalization is to examine the limit ofthe normalized block entropy of a source sequence, resulting in the concept ofentropy rate.

Definition 3.7 (Entropy rate) The entropy rate for a source Xn∞n=1 is de-

10The definitions of stationarity and ergodicity can be found on p. 196 in Appendix B.

54

Page 67: It Lecture Notes2014

noted by H(X ) and defined by

H(X ) , limn→∞

1

nH(Xn)

provided the limit exists, where Xn = (X1, · · · , Xn).

Next we will show that the entropy rate exists for stationary sources (here,we do not need ergodicity for the existence of entropy rate).

Lemma 3.8 For a stationary source Xn∞n=1, the conditional entropy

H(Xn|Xn−1, . . . , X1)

is non-increasing in n and also bounded from below by zero. Hence by LemmaA.20, the limit

limn→∞

H(Xn|Xn−1, . . . , X1)

exists.

Proof: We have

H(Xn|Xn−1, . . . , X1) ≤ H(Xn|Xn−1, . . . , X2) (3.2.4)

= H(Xn, · · · , X2)−H(Xn−1, · · · , X2)

= H(Xn−1, · · · , X1)−H(Xn−2, · · · , X1) (3.2.5)

= H(Xn−1|Xn−2, . . . , X1)

where (3.2.4) follows since conditioning never increases entropy, and (3.2.5) holdsbecause of the stationarity assumption. Finally, recall that each conditionalentropy H(Xn|Xn−1, . . . , X1) is non-negative.

Lemma 3.9 (Cesaro-mean theorem) If an → a as n → ∞ and bn = (1/n)∑n

i=1 ai,then bn → a as n → ∞.

Proof: an → a implies that for any ε > 0, there exists N such that for all n > N ,|an − a| < ε. Then

|bn − a| =

∣∣∣∣∣

1

n

n∑

i=1

(ai − a)

∣∣∣∣∣

≤ 1

n

n∑

i=1

|ai − a|

55

Page 68: It Lecture Notes2014

=1

n

N∑

i=1

|ai − a|+ 1

n

n∑

i=N+1

|ai − a|

≤ 1

n

N∑

i=1

|ai − a|+ n−N

nε.

Hence, limn→∞ |bn − a| ≤ ε. Since ε can be made arbitrarily small, the lemmaholds.

Theorem 3.10 For a stationary source Xn∞n=1, its entropy rate always existsand is equal to

H(X ) = limn→∞

H(Xn|Xn−1, . . . , X1).

Proof: The result directly follows by writing

1

nH(Xn) =

1

n

n∑

i=1

H(Xi|Xi−1, . . . , X1) (chain-rule for entropy)

and applying the Cesaro-mean theorem.

Observation 3.11 It can also be shown that for a stationary source, (1/n)H(Xn)is non-increasing in n and (1/n)H(Xn) ≥ H(Xn|Xn−1, . . . , X1) for all n ≥ 1.(The proof is left as an exercise. See Problem ??.)

It is obvious that when Xn∞n=1 is a discrete memoryless source, H(Xn) =n×H(X) for every n. Hence,

H(X ) = limn→∞

1

nH(Xn) = H(X).

For a first-order stationary Markov source,

H(X ) = limn→∞

1

nH(Xn) = lim

n→∞H(Xn|Xn−1, . . . , X1) = H(X2|X1),

where

H(X2|X1) , −∑

x1∈X

x2∈Xπ(x1)PX2|X1(x2|x1) · logPX2|X1(x2|x1),

and π(·) is the stationary distribution for the Markov source. Furthermore, ifthe Markov source is binary with PX2|X1(0|1) = α and PX2|X1(1|0) = β, then

H(X ) =β

α + βhb(α) +

α

α + βhb(β),

where hb(α) , −α log2 α− (1− α) log2(1− α) is the binary entropy function.

56

Page 69: It Lecture Notes2014

Theorem 3.12 (Generalized AEP or Shannon-McMillan-Breiman The-orem [13]) If Xn∞n=1 is a stationary ergodic source, then

− 1

nlog2 PXn(X1, . . . , Xn)

a.s.−→ H(X ).

Since the AEP theorem (law of large numbers) is valid for stationary ergodicsources, all consequences of AEP will follow, including Shannon’s lossless sourcecoding theorem.

Theorem 3.13 (Shannon’s source coding theorem for stationary ergo-dic sources) Given integer D > 1, let Xn∞n=1 be a stationary ergodic sourcewith entropy rate (in base D)

HD(X ) , limn→∞

1

nHD(X

n).

Then the following hold.

• Forward part (achievability): For any 0 < ε < 1, there exists δ with0 < δ < ε and a sequence of D-ary block codes C∼n = (n,Mn)∞n=1 with

lim supn→∞

1

nlogD Mn < HD(X ) + δ,

and probability of decoding error satisfied

Pe( C∼n) < ε

for all sufficiently large n.

• Strong converse part: For any 0 < ε < 1, any sequence of D-ary blockcodes C∼n = (n,Mn)∞n=1 with

lim supn→∞

1

nlogD Mn < HD(X )

satisfiesPe( C∼n) > 1− ε

for all n sufficiently large.

A discrete memoryless (i.i.d.) source is stationary and ergodic (so Theorem3.5 is clearly a special case of Theorem 3.13). In general, it is hard to checkwhether a stationary process is ergodic or not. It is known though that if astationary process is a mixture of two or more stationary ergodic processes,

57

Page 70: It Lecture Notes2014

i.e., its n-fold distribution can be written as the mean (with respect to somedistribution) of the n-fold distributions of stationary ergodic processes, then itis not ergodic.11

For example, let P and Q be two distributions on a finite alphabet X suchthat the process Xn∞n=1 is i.i.d. with distribution P and the process Yn∞n=1

is i.i.d. with distribution Q. Flip a biased coin (with Heads probability equal toθ, 0 < θ < 1) once and let

Zi =

Xi if Heads

Yi if Tails

for i = 1, 2, · · · . Then the resulting process Zi∞i=1 has its n-fold distribution asa mixture of the n-fold distributions of Xn∞n=1 and Yn∞n=1:

PZn(an) = θPXn(an) + (1− θ)PY n(an)

for all an ∈ X n, n = 1, 2, · · · . Then the process Zi∞i=1 is stationary but notergodic.

A specific case for which ergodicity can be easily verified (other than thecase of i.i.d. sources) is the case of stationary Markov sources. Specifically, ifa (finite-alphabet) stationary Markov source is irreducible,12 then it is ergodicand hence the Generalized AEP holds for this source. Note that irreducibilitycan be verified in terms of the source’s transition probability matrix.

In more complicated situations such as when the source is non-stationary(with time-varying statistics) and/or non-ergodic, the source entropy rate H(X )(if the limit exists; otherwise one can look at the lim inf/lim sup of (1/n)H(Xn))has no longer an operational meaning as the smallest possible compression rate.This causes the need to establish new entropy measures which appropriatelycharacterize the operational limits of an arbitrary stochastic system with mem-ory. This is achieved in [24] where Han and Verdu introduce the notions ofinf/sup-entropy rates and illustrate the key role these entropy measures playin proving a general lossless block source coding theorem. More specifically,they demonstrate that for an arbitrary finite-alphabet source X = Xn =(X1, X2, . . . , Xn)∞n=1 (not necessarily stationary and ergodic), the expression forthe minimum achievable (block) source coding rate is given by the sup-entropyrate H(X), defined by

H(X) , infβ∈R

β : lim supn→∞

Pr

[

− 1

nlogPXn(Xn) > β

]

= 0

.

More details will be provided in Part II of the book.

11The converse is also true; i.e., if a stationary process cannot be represented as a mixtureof stationary ergodic processes, then it is ergodic.

12See p. 198 in Appendix B for the definition of irreducibility of Markov sources.

58

Page 71: It Lecture Notes2014

3.2.3 Redundancy for lossless block data compression

Shannon’s block source coding theorem establishes that the smallest data com-pression rate for achieving arbitrarily small error probability for stationary er-godic sources is given by the entropy rate. Thus one can define the sourceredundancy as the reduction in coding rate one can achieve via asymptoticallylossless block source coding versus just using uniquely decodable (completelylossless for any value of the sourceword blocklength n) block source coding. Inlight of the fact that the former approach yields a source coding rate equal to theentropy rate while the later approach provides a rate of log2 |X |, we thereforedefine the total block source-coding redundancy ρt (in bits/source symbol) for astationary ergodic source Xn∞n=1 as

ρt , log2 |X | −H(X ).

Hence ρt represents the amount of “useless” (or superfluous) statistical sourceinformation one can eliminate via binary13 block source coding.

If the source is i.i.d. and uniformly distributed, then its entropy rate is equalto log2 |X | and as a result its redundancy is ρt = 0. This means that the sourceis incompressible, as expected, since in this case every sourceword xn will belongto the δ-typical set Fn(δ) for every n > 0 and δ > 0 (i.e., Fn(δ) = X n) andhence there are no superfluous sourcewords that can be dispensed of via sourcecoding. If the source has memory or has a non-uniform marginal distribution,then its redundancy is strictly positive and can be classified into two parts:

• Source redundancy due to the non-uniformity of the source marginal dis-tribution ρd:

ρd , log2 |X | −H(X1).

• Source redundancy due to the source memory ρm:

ρm , H(X1)−H(X ).

As a result, the source total redundancy ρt can be decomposed in two parts:

ρt = ρd + ρm.

We can summarize the redundancy of some typical stationary ergodic sourcesin the following table.

13Since we are measuring ρt in code bits/source symbol, all logarithms in its expression arein base 2 and hence this redundancy can be eliminated via asymptotically lossless binary blockcodes (one can also change the units to D-ary code symbol/source symbol by using base-Dlogarithms for the case of D-ary block codes).

59

Page 72: It Lecture Notes2014

Source ρd ρm ρt

i.i.d. uniform 0 0 0i.i.d. non-uniform log2 |X | −H(X1) 0 ρd1st-order symmetricMarkov14

0 H(X1)−H(X2|X1) ρm

1st-order non-symmetric Markov

log2 |X | −H(X1) H(X1)−H(X2|X1) ρd + ρm

3.3 Variable-length codes for lossless data compression

3.3.1 Non-singular codes and uniquely decodable codes

We next study variable-length (completely) lossless data compression codes.

Definition 3.14 Consider a discrete source Xn∞n=1 with finite alphabet Xalong with a D-ary code alphabet B = 0, 1, · · · , D − 1, where D > 1 is aninteger. Fix integer n ≥ 1, then a D-ary n-th order variable-length code (VLC)is a map

f : X n → B∗

mapping (fixed-length) sourcewords of length n to D-ary codewords in B∗ ofvariable lengths, where B∗ denotes the set of all finite-length strings from B (i.e.,c ∈ B∗ ⇔ ∃ integer l ≥ 1 such that c ∈ Bl).

The codebook C of a VLC is the set of all codewords:

C = f(X n) = f(xn) ∈ B∗ : xn ∈ X n.

A variable-length lossless data compression code is a code in which thesource symbols can be completely reconstructed without distortion. In orderto achieve this goal, the source symbols have to be encoded unambiguously inthe sense that any two different source symbols (with positive probabilities) arerepresented by different codewords. Codes satisfying this property are callednon-singular codes. In practice however, the encoder often needs to encode asequence of source symbols, which results in a concatenated sequence of code-words. If any concatenation of codewords can also be unambiguously recon-structed without punctuation, then the code is said to be uniquely decodable. In

14A first-order Markov process is symmetric if for any x1 and x1,

a : a = PX2|X1(y|x1) for some y = a : a = PX2|X1

(y|x1) for some y.

60

Page 73: It Lecture Notes2014

other words, a VLC is uniquely decodable if all finite sequences of sourcewords(xn ∈ X n) are mapped onto distinct strings of codewords; i.e., for any m andm′, (xn

1 , xn2 , · · · , xn

m) 6= (yn1 , yn2 , · · · , ynm′) implies that

(f(xn1 ), f(x

n2 ), · · · , f(xn

m)) 6= (f(yn1 ), f(yn2 ), · · · , f(ynm′)).

Note that a non-singular VLC is not necessarily uniquely decodable. For exam-ple, consider a binary (first-order) code for the source with alphabet

X = A,B,C,D,E, F

given by

code of A = 0,

code of B = 1,

code of C = 00,

code of D = 01,

code of E = 10,

code of F = 11.

The above code is clearly non-singular; it is however not uniquely decodablebecause the codeword sequence, 010, can be reconstructed as ABA, DA or AE(i.e., (f(A), f(B), f(A)) = (f(D), f(A)) = (f(A), f(E)) even if (A,B,A), (D,A)and (A,E) are all non-equal).

One important objective is to find out how “efficiently” we can representa given discrete source via a uniquely decodable n-th order VLC and providea construction technique that (at least asymptotically, as n → ∞) attains theoptimal “efficiency.” In other words, we want to determine what is the smallestpossible average code rate (or equivalently, average codeword length) can ann-th order uniquely decodable VLC have when (losslessly) representing a givensource, and we want to give an explicit code construction that can attain thissmallest possible rate (at least asymptotically in the sourceword length n).

Definition 3.15 Let C be a D-ary n-th order VLC code

f : X n → 0, 1, · · · , D − 1∗

for a discrete source Xn∞n=1 with alphabet X and distribution PXn(xn), xn ∈X n. Setting ℓ(cxn) as the length of the codeword cxn = f(xn) associated withsourceword xn, then the average codeword length for C is given by

ℓ ,∑

xn∈Xn

PXn(xn)ℓ(cxn)

61

Page 74: It Lecture Notes2014

and its average code rate (in D-ary code symbols/source symbol) is given by

Rn ,ℓ

n=

1

n

xn∈Xn

PXn(xn)ℓ(cxn).

The following theorem provides a strong condition which a uniquely decod-able code must satisfy.

Theorem 3.16 (Kraft inequality for uniquely decodable codes) Let C bea uniquely decodable D-ary n-th order VLC for a discrete source Xn∞n=1 withalphabet X . Let the M = |X |n codewords of C have lengths ℓ1, ℓ2, . . . , ℓM ,respectively. Then the following inequality must hold

M∑

m=1

D−ℓm ≤ 1.

Proof: Suppose that we use the codebook $\mathcal{C}$ to encode $N$ sourcewords ($x_i^n \in \mathcal{X}^n$, $i = 1, \cdots, N$) arriving in a sequence; this yields a concatenated codeword sequence
$$c_1 c_2 c_3 \ldots c_N.$$
Let the lengths of the codewords be respectively denoted by
$$\ell(c_1), \ell(c_2), \ldots, \ell(c_N).$$
Consider
$$\left( \sum_{c_1 \in \mathcal{C}} \sum_{c_2 \in \mathcal{C}} \cdots \sum_{c_N \in \mathcal{C}} D^{-[\ell(c_1) + \ell(c_2) + \cdots + \ell(c_N)]} \right).$$
It is obvious that the above expression is equal to
$$\left( \sum_{c \in \mathcal{C}} D^{-\ell(c)} \right)^N = \left( \sum_{m=1}^{M} D^{-\ell_m} \right)^N.$$
(Note that $|\mathcal{C}| = M$.) On the other hand, all the code sequences with length
$$i = \ell(c_1) + \ell(c_2) + \cdots + \ell(c_N)$$
contribute equally to the sum of the identity, which is $D^{-i}$. Let $A_i$ denote the number of $N$-codeword sequences that have length $i$. Then the above identity can be re-written as
$$\left( \sum_{m=1}^{M} D^{-\ell_m} \right)^N = \sum_{i=1}^{LN} A_i D^{-i},$$
where
$$L \triangleq \max_{c \in \mathcal{C}} \ell(c).$$
Since $\mathcal{C}$ is by assumption a uniquely decodable code, the codeword sequence must be unambiguously decodable. Observe that a code sequence with length $i$ has at most $D^i$ unambiguous combinations. Therefore, $A_i \leq D^i$, and
$$\left( \sum_{m=1}^{M} D^{-\ell_m} \right)^N = \sum_{i=1}^{LN} A_i D^{-i} \leq \sum_{i=1}^{LN} D^i D^{-i} = LN,$$
which implies that
$$\sum_{m=1}^{M} D^{-\ell_m} \leq (LN)^{1/N}.$$
The proof is completed by noting that the above inequality holds for every $N$, and the upper bound $(LN)^{1/N}$ goes to 1 as $N$ goes to infinity.
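As a quick numerical companion to Theorem 3.16, the following sketch (written for these notes; the function name is ours) computes the Kraft sum $\sum_m D^{-\ell_m}$ for a list of codeword lengths, so one can test whether a proposed set of lengths could possibly belong to a uniquely decodable code.

```python
def kraft_sum(lengths, D=2):
    """Return sum_m D^(-l_m); a uniquely decodable code must have kraft_sum <= 1."""
    return sum(D ** (-l) for l in lengths)

print(kraft_sum([2, 2, 2, 3, 4, 4]))   # 1.0  -> consistent with unique decodability
print(kraft_sum([1, 1, 2, 2, 2, 2]))   # 2.0  -> violates the Kraft inequality
```

The second call uses the six codeword lengths of the non-singular code in the example above and shows that it cannot be uniquely decodable.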

The Kraft inequality is a very useful tool, especially for showing that the fundamental lower bound of the average rate of uniquely decodable VLCs for discrete memoryless sources is given by the source entropy.

Theorem 3.17 The average rate of every uniquely decodable D-ary n-th order VLC for a discrete memoryless source $\{X_n\}_{n=1}^{\infty}$ is lower-bounded by the source entropy $H_D(X)$ (measured in D-ary code symbols/source symbol).

Proof: Consider a uniquely decodable D-ary n-th order VLC for the source $\{X_n\}_{n=1}^{\infty}$
$$f : \mathcal{X}^n \to \{0, 1, \cdots, D-1\}^*$$
and let $\ell(c_{x^n})$ denote the length of the codeword $c_{x^n} = f(x^n)$ for sourceword $x^n$. Hence,
\begin{align}
R_n - H_D(X) &= \frac{1}{n} \sum_{x^n \in \mathcal{X}^n} P_{X^n}(x^n)\, \ell(c_{x^n}) - \frac{1}{n} H_D(X^n) \\
&= \frac{1}{n} \left[ \sum_{x^n \in \mathcal{X}^n} P_{X^n}(x^n)\, \ell(c_{x^n}) - \sum_{x^n \in \mathcal{X}^n} \left( -P_{X^n}(x^n) \log_D P_{X^n}(x^n) \right) \right] \\
&= \frac{1}{n} \sum_{x^n \in \mathcal{X}^n} P_{X^n}(x^n) \log_D \frac{P_{X^n}(x^n)}{D^{-\ell(c_{x^n})}} \\
&\geq \frac{1}{n} \left[ \sum_{x^n \in \mathcal{X}^n} P_{X^n}(x^n) \right] \log_D \frac{\sum_{x^n \in \mathcal{X}^n} P_{X^n}(x^n)}{\sum_{x^n \in \mathcal{X}^n} D^{-\ell(c_{x^n})}} \quad \text{(log-sum inequality)} \\
&= -\frac{1}{n} \log_D \left[ \sum_{x^n \in \mathcal{X}^n} D^{-\ell(c_{x^n})} \right] \\
&\geq 0,
\end{align}
where the last inequality follows from the Kraft inequality for uniquely decodable codes and the fact that the logarithm is a strictly increasing function.

From the above theorem, we know that the average code rate is no smaller than the source entropy. Indeed a lossless data compression code whose average code rate achieves entropy should be optimal (since if its average code rate were below entropy, the Kraft inequality would be violated and the code would no longer be uniquely decodable). We summarize:

1. Unique decodability ⇒ the Kraft inequality holds.

2. Unique decodability ⇒ the average code rate of VLCs for memoryless sources is lower bounded by the source entropy.

Exercise 3.18

1. Find a non-singular and also non-uniquely decodable code that violates the Kraft inequality. (Hint: The answer is already provided in this subsection.)

2. Find a non-singular and also non-uniquely decodable code that beats the entropy lower bound.

3.3.2 Prefix or instantaneous codes

A prefix code is a VLC which is self-punctuated in the sense that there is no need to append extra symbols for differentiating adjacent codewords. A more precise definition follows:

Definition 3.19 (Prefix code) A VLC is called a prefix code or an instantaneous code if no codeword is a prefix of any other codeword.

A prefix code is also named an instantaneous code because the codeword sequence can be decoded instantaneously (it is immediately recognizable) without reference to future codewords in the same sequence. Note that a uniquely decodable code is not necessarily prefix-free and may not be decodable instantaneously. The relationship between the different codes encountered thus far is depicted in Figure 3.4.

A D-ary prefix code can be represented graphically as an initial segment of a D-ary tree. An example of a tree representation for a binary ($D = 2$) prefix code is shown in Figure 3.5.


Figure 3.4: Classification of variable-length codes: prefix codes ⊂ uniquely decodable codes ⊂ non-singular codes.

Theorem 3.20 (Kraft inequality for prefix codes) There exists a D-ary n-th order prefix code with codeword lengths $\ell_m$, $m = 1, \ldots, M$, for a discrete source $\{X_n\}_{n=1}^{\infty}$ with alphabet $\mathcal{X}$ iff these lengths satisfy the Kraft inequality, where $M = |\mathcal{X}|^n$.

Proof: Without loss of generality, we provide the proof for the case of $D = 2$ (binary codes).

1. [The forward part] Prefix codes satisfy the Kraft inequality.

The codewords of a prefix code can always be put on a tree. Pick a length
$$\ell_{\max} \triangleq \max_{1 \leq m \leq M} \ell_m.$$
A tree originally has $2^{\ell_{\max}}$ nodes on level $\ell_{\max}$. When any node is chosen as a codeword, all its children are excluded from being codewords (since for a prefix code, no codeword can be a prefix of any other codeword); hence each codeword of length $\ell_m$ obstructs exactly $2^{\ell_{\max}-\ell_m}$ nodes on level $\ell_{\max}$. Note that no two codewords obstruct the same nodes on level $\ell_{\max}$. Hence the total number of obstructed nodes on level $\ell_{\max}$ can be no greater than $2^{\ell_{\max}}$, i.e.,
$$\sum_{m=1}^{M} 2^{\ell_{\max}-\ell_m} \leq 2^{\ell_{\max}},$$
which immediately implies the Kraft inequality:
$$\sum_{m=1}^{M} 2^{-\ell_m} \leq 1.$$


Figure 3.5: Tree structure of a binary prefix code. The codewords are those residing on the leaves, which in this case are 00, 01, 10, 110, 1110 and 1111.

(This part can also be proven by noting that a prefix code is a uniquely decodable code. The objective of including this proof is to illustrate the characteristics of a tree-structured prefix code.)

2. [The converse part] Kraft inequality implies the existence of a prefix code.

Suppose that $\ell_1, \ell_2, \ldots, \ell_M$ satisfy the Kraft inequality. We will show that there exists a binary tree with $M$ selected nodes where the $i$th node resides on level $\ell_i$.

Let $n_i$ be the number of nodes (among the $M$ nodes) residing on level $i$ (namely, $n_i$ is the number of codewords with length $i$, or $n_i = |\{m : \ell_m = i\}|$), and let
$$\ell_{\max} \triangleq \max_{1 \leq m \leq M} \ell_m.$$
Then from the Kraft inequality, we have
$$n_1 2^{-1} + n_2 2^{-2} + \cdots + n_{\ell_{\max}} 2^{-\ell_{\max}} \leq 1.$$
The above inequality can be re-written in a form that is more suitable for this proof as:
\begin{align}
n_1 2^{-1} &\leq 1 \\
n_1 2^{-1} + n_2 2^{-2} &\leq 1 \\
&\ \ \vdots \\
n_1 2^{-1} + n_2 2^{-2} + \cdots + n_{\ell_{\max}} 2^{-\ell_{\max}} &\leq 1.
\end{align}
Hence,
\begin{align}
n_1 &\leq 2 \\
n_2 &\leq 2^2 - n_1 2^1 \\
&\ \ \vdots \\
n_{\ell_{\max}} &\leq 2^{\ell_{\max}} - n_1 2^{\ell_{\max}-1} - \cdots - n_{\ell_{\max}-1} 2^1,
\end{align}
which can be interpreted in terms of a tree model as follows: the 1st inequality says that the number of codewords of length 1 is no more than the available number of nodes on the 1st level, which is 2. The 2nd inequality says that the number of codewords of length 2 is no more than the total number of nodes on the 2nd level, which is $2^2$, minus the number of nodes obstructed by the 1st-level nodes already occupied by codewords. The succeeding inequalities demonstrate the availability of a sufficient number of nodes at each level after the nodes blocked by shorter-length codewords have been removed. Because this is true at every codeword length up to the maximum codeword length, the assertion of the theorem is proved.
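The converse argument is constructive and easy to mimic in code. The sketch below (our own illustration; assigning each length the binary expansion of the running Kraft sum, truncated to that length, is one standard way among several to realize the proof's tree construction) produces binary codewords for any list of lengths satisfying the Kraft inequality.

```python
def prefix_code_from_lengths(lengths):
    """Given lengths satisfying the Kraft inequality, return binary codewords
    forming a prefix code (codeword i = first l_i fractional bits of the running sum)."""
    assert sum(2.0 ** (-l) for l in lengths) <= 1.0 + 1e-12, "Kraft inequality violated"
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    codewords = [None] * len(lengths)
    cumulative = 0.0
    for i in order:
        value, bits = cumulative, []
        for _ in range(lengths[i]):           # first lengths[i] binary fractional digits
            value *= 2
            bits.append("1" if value >= 1 else "0")
            value -= int(value)
        codewords[i] = "".join(bits)
        cumulative += 2.0 ** (-lengths[i])
    return codewords

print(prefix_code_from_lengths([2, 2, 2, 3, 4, 4]))
# ['00', '01', '10', '110', '1110', '1111'] -- the codewords of Figure 3.5
```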

Theorems 3.16 and 3.20 unveil the following relation between a variable-length uniquely decodable code and a prefix code.

Corollary 3.21 A uniquely decodable D-ary n-th order code can always be replaced by a D-ary n-th order prefix code with the same average codeword length (and hence the same average code rate).

The following theorem interprets the relationship between the average code rate of a prefix code and the source entropy.

Theorem 3.22 Consider a discrete memoryless source $\{X_n\}_{n=1}^{\infty}$.

1. For any D-ary n-th order prefix code for the source, the average code rate is no less than the source entropy $H_D(X)$.

2. There must exist a D-ary n-th order prefix code for the source whose average code rate is no greater than $H_D(X) + \frac{1}{n}$, namely,
$$R_n \triangleq \frac{1}{n} \sum_{x^n \in \mathcal{X}^n} P_{X^n}(x^n)\, \ell(c_{x^n}) \leq H_D(X) + \frac{1}{n}, \qquad (3.3.1)$$
where $c_{x^n}$ is the codeword for sourceword $x^n$, and $\ell(c_{x^n})$ is the length of codeword $c_{x^n}$.


Proof: A prefix code is uniquely decodable, and hence it directly follows from Theorem 3.17 that its average code rate is no less than the source entropy.

To prove the second part, we can design a prefix code satisfying both (3.3.1) and the Kraft inequality, which immediately implies the existence of the desired code by Theorem 3.20. Choose the codeword length for sourceword $x^n$ as
$$\ell(c_{x^n}) = \lfloor -\log_D P_{X^n}(x^n) \rfloor + 1. \qquad (3.3.2)$$
Then
$$D^{-\ell(c_{x^n})} \leq P_{X^n}(x^n).$$
Summing both sides over all sourcewords, we obtain
$$\sum_{x^n \in \mathcal{X}^n} D^{-\ell(c_{x^n})} \leq 1,$$
which is exactly the Kraft inequality. On the other hand, (3.3.2) implies
$$\ell(c_{x^n}) \leq -\log_D P_{X^n}(x^n) + 1,$$
which in turn implies
$$\sum_{x^n \in \mathcal{X}^n} P_{X^n}(x^n)\, \ell(c_{x^n}) \leq \sum_{x^n \in \mathcal{X}^n} \left[ -P_{X^n}(x^n) \log_D P_{X^n}(x^n) \right] + \sum_{x^n \in \mathcal{X}^n} P_{X^n}(x^n) = H_D(X^n) + 1 = n H_D(X) + 1,$$
where the last equality holds since the source is memoryless. Dividing both sides by $n$ yields (3.3.1).
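The length assignment (3.3.2) is easy to experiment with. The short sketch below (our illustration; variable names are ours) computes these lengths for the product distribution of a memoryless source and checks both the Kraft inequality and the bound (3.3.1).

```python
import itertools, math

def shannon_lengths(pX, n, D=2):
    """Lengths l(x^n) = floor(-log_D P(x^n)) + 1 for all sourcewords of length n."""
    lengths = {}
    for xn in itertools.product(pX.keys(), repeat=n):
        p = math.prod(pX[s] for s in xn)           # memoryless source
        lengths[xn] = math.floor(-math.log(p, D)) + 1
    return lengths

pX = {"A": 0.8, "B": 0.1, "C": 0.1}
n = 2
lengths = shannon_lengths(pX, n)
kraft = sum(2.0 ** (-l) for l in lengths.values())
rate = sum(math.prod(pX[s] for s in xn) * l for xn, l in lengths.items()) / n
HD = -sum(p * math.log2(p) for p in pX.values())
print(kraft <= 1, rate <= HD + 1 / n)              # True True
```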

We note that n-th order prefix codes (which encode sourcewords of length $n$) for memoryless sources can yield an average code rate arbitrarily close to the source entropy when allowing $n$ to grow without bound. For example, a memoryless source with alphabet
$$\{A, B, C\}$$
and probability distribution
$$P_X(A) = 0.8, \quad P_X(B) = P_X(C) = 0.1$$
has entropy equal to
$$-0.8 \log_2 0.8 - 0.1 \log_2 0.1 - 0.1 \log_2 0.1 = 0.92 \text{ bits}.$$
One of the best binary first-order or single-letter encoding (with $n = 1$) prefix codes for this source is given by $c(A) = 0$, $c(B) = 10$ and $c(C) = 11$, where $c(\cdot)$ is the encoding function. The resultant average code rate for this code is then
$$0.8 \times 1 + 0.2 \times 2 = 1.2 \text{ bits} > 0.92 \text{ bits}.$$


Now if we consider a second-order (with $n = 2$) prefix code by encoding two consecutive source symbols at a time, the new source alphabet becomes
$$\{AA, AB, AC, BA, BB, BC, CA, CB, CC\},$$
and the resultant probability distribution is calculated by
$$(\forall\, x_1, x_2 \in \{A, B, C\}) \quad P_{X^2}(x_1, x_2) = P_X(x_1) P_X(x_2)$$
as the source is memoryless. Then one of the best binary prefix codes for the source is given by

c(AA) = 0
c(AB) = 100
c(AC) = 101
c(BA) = 110
c(BB) = 111100
c(BC) = 111101
c(CA) = 1110
c(CB) = 111110
c(CC) = 111111.

The average code rate of this code now becomes
$$\frac{0.64 (1 \times 1) + 0.08 (3 \times 3 + 4 \times 1) + 0.01 (6 \times 4)}{2} = 0.96 \text{ bits},$$
which is closer to the source entropy of 0.92 bits. As $n$ increases, the average code rate is brought closer to the source entropy.
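The numbers in this example are easy to reproduce. The following sketch (our own check, not part of the original notes) recomputes the source entropy and the average code rates of the first-order and second-order codes given above.

```python
import math

pX = {"A": 0.8, "B": 0.1, "C": 0.1}
entropy = -sum(p * math.log2(p) for p in pX.values())            # about 0.922 bits

code1 = {"A": "0", "B": "10", "C": "11"}                         # the n = 1 code above
rate1 = sum(pX[s] * len(code1[s]) for s in pX)                   # 1.2 bits/source symbol

code2 = {"AA": "0", "AB": "100", "AC": "101", "BA": "110", "BB": "111100",
         "BC": "111101", "CA": "1110", "CB": "111110", "CC": "111111"}
rate2 = sum(pX[a] * pX[b] * len(code2[a + b]) for a in pX for b in pX) / 2
print(round(entropy, 2), rate1, rate2)                           # 0.92 1.2 0.96
```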

From Theorems 3.17 and 3.22, we obtain the lossless variable-length source coding theorem for discrete memoryless sources.

Theorem 3.23 (Lossless variable-length source coding theorem) Fix integer $D > 1$ and consider a discrete memoryless source $\{X_n\}_{n=1}^{\infty}$ with distribution $P_X$ and entropy $H_D(X)$ (measured in D-ary units). Then the following hold.

• Forward part (achievability): For any $\varepsilon > 0$, there exists a D-ary n-th order prefix (hence uniquely decodable) code
$$f : \mathcal{X}^n \to \{0, 1, \cdots, D-1\}^*$$
for the source with an average code rate $R_n$ satisfying
$$R_n \leq H_D(X) + \varepsilon$$
for $n$ sufficiently large.


• Converse part: Every uniquely decodable code
$$f : \mathcal{X}^n \to \{0, 1, \cdots, D-1\}^*$$
for the source has an average code rate $R_n \geq H_D(X)$.

Thus, for a discrete memoryless source, its entropy $H_D(X)$ (measured in D-ary units) represents the smallest variable-length lossless compression rate for $n$ sufficiently large.

Proof: The forward part follows directly from Theorem 3.22 by choosing $n$ large enough such that $1/n < \varepsilon$, and the converse part is already given by Theorem 3.17.

Observation 3.24 Theorem 3.23 actually also holds for the class of stationary sources by replacing the source entropy $H_D(X)$ with the source entropy rate
$$H_D(\mathcal{X}) \triangleq \lim_{n \to \infty} \frac{1}{n} H_D(X^n),$$
measured in D-ary units. The proof is very similar to the proofs of Theorems 3.17 and 3.22 with slight modifications (such as using the fact that $\frac{1}{n} H_D(X^n)$ is non-increasing with $n$ for stationary sources).

3.3.3 Examples of binary prefix codes

A) Huffman codes: optimal variable-length codes

Given a discrete source with alphabet $\mathcal{X}$, we next construct an optimal binary first-order (single-letter) uniquely decodable variable-length code
$$f : \mathcal{X} \to \{0, 1\}^*,$$
where optimality is in the sense that the code's average codeword length (or equivalently, its average code rate) is minimized over the class of all binary uniquely decodable codes for the source. Note that finding optimal n-th order codes with $n > 1$ follows directly by considering $\mathcal{X}^n$ as a new source with expanded alphabet (i.e., by mapping $n$ source symbols at a time).

By Corollary 3.21, we remark that in our search for optimal uniquely decodable codes, we can restrict our attention to the (smaller) class of optimal prefix codes. We thus proceed by observing the following necessary conditions of optimality for binary prefix codes.


Lemma 3.25 Let $\mathcal{C}$ be an optimal binary prefix code with codeword lengths $\ell_i$, $i = 1, \cdots, M$, for a source with alphabet $\mathcal{X} = \{a_1, \ldots, a_M\}$ and symbol probabilities $p_1, \ldots, p_M$. We assume, without loss of generality, that
$$p_1 \geq p_2 \geq p_3 \geq \cdots \geq p_M,$$
and that any group of source symbols with identical probability is listed in order of increasing codeword length (i.e., if $p_i = p_{i+1} = \cdots = p_{i+s}$, then $\ell_i \leq \ell_{i+1} \leq \cdots \leq \ell_{i+s}$). Then the following properties hold.

1. Higher probability source symbols have shorter codewords: $p_i > p_j$ implies $\ell_i \leq \ell_j$, for $i, j = 1, \cdots, M$.

2. The two least probable source symbols have codewords of equal length: $\ell_{M-1} = \ell_M$.

3. Among the codewords of length $\ell_M$, two of the codewords are identical except in the last digit.

Proof:

1) If $p_i > p_j$ and $\ell_i > \ell_j$, then it is possible to construct a better code $\mathcal{C}'$ by interchanging ("swapping") codewords $i$ and $j$ of $\mathcal{C}$, since
$$\bar{\ell}(\mathcal{C}') - \bar{\ell}(\mathcal{C}) = p_i \ell_j + p_j \ell_i - (p_i \ell_i + p_j \ell_j) = (p_i - p_j)(\ell_j - \ell_i) < 0.$$
Hence code $\mathcal{C}'$ is better than code $\mathcal{C}$, contradicting the fact that $\mathcal{C}$ is optimal.

2) We first know that $\ell_{M-1} \leq \ell_M$, since:

• If $p_{M-1} > p_M$, then $\ell_{M-1} \leq \ell_M$ by result 1) above.

• If $p_{M-1} = p_M$, then $\ell_{M-1} \leq \ell_M$ by our assumption about the ordering of codewords for source symbols with identical probability.

Now, if $\ell_{M-1} < \ell_M$, we may delete the last digit of codeword $M$, and the deletion cannot result in another codeword since $\mathcal{C}$ is a prefix code. Thus the deletion forms a new prefix code with a smaller average codeword length than $\mathcal{C}$, contradicting the fact that $\mathcal{C}$ is optimal. Hence, we must have $\ell_{M-1} = \ell_M$.

3) Among the codewords of length $\ell_M$, if no two codewords agree in all digits except the last, then we may delete the last digit in all such codewords to obtain a better code.


The above observation suggests that if we can construct an optimal code for the entire source except for its two least likely symbols, then we can construct an optimal overall code. Indeed, the following lemma due to Huffman follows from Lemma 3.25.

Lemma 3.26 (Huffman) Consider a source with alphabet $\mathcal{X} = \{a_1, \ldots, a_M\}$ and symbol probabilities $p_1, \ldots, p_M$ such that $p_1 \geq p_2 \geq \cdots \geq p_M$. Consider the reduced source alphabet $\mathcal{Y}$ obtained from $\mathcal{X}$ by combining the two least likely source symbols $a_{M-1}$ and $a_M$ into an equivalent symbol $a_{M-1,M}$ with probability $p_{M-1} + p_M$. Suppose that $\mathcal{C}'$, given by $f' : \mathcal{Y} \to \{0,1\}^*$, is an optimal code for the reduced source $\mathcal{Y}$. We now construct a code $\mathcal{C}$, $f : \mathcal{X} \to \{0,1\}^*$, for the original source $\mathcal{X}$ as follows:

• The codewords for symbols $a_1, a_2, \cdots, a_{M-2}$ are exactly the same as the corresponding codewords in $\mathcal{C}'$:
$$f(a_1) = f'(a_1),\ f(a_2) = f'(a_2),\ \cdots,\ f(a_{M-2}) = f'(a_{M-2}).$$

• The codewords associated with symbols $a_{M-1}$ and $a_M$ are formed by appending a "0" and a "1", respectively, to the codeword $f'(a_{M-1,M})$ associated with the letter $a_{M-1,M}$ in $\mathcal{C}'$:
$$f(a_{M-1}) = [f'(a_{M-1,M})\,0] \quad \text{and} \quad f(a_M) = [f'(a_{M-1,M})\,1].$$

Then code $\mathcal{C}$ is optimal for the original source $\mathcal{X}$.

Hence the problem of finding the optimal code for a source of alphabet size $M$ is reduced to the problem of finding an optimal code for the reduced source of alphabet size $M-1$. In turn we can reduce the problem to that of size $M-2$ and so on. Indeed the above lemma yields a recursive algorithm for constructing optimal binary prefix codes.

Huffman encoding algorithm: Repeatedly apply the above lemma until one is left with a reduced source with two symbols. An optimal binary prefix code for this source consists of the codewords 0 and 1. Then proceed backwards, constructing (as outlined in the above lemma) optimal codes for each reduced source until one arrives at the original source.

Example 3.27 Consider a source with alphabet $\{1, 2, 3, 4, 5, 6\}$ and symbol probabilities 0.25, 0.25, 0.25, 0.1, 0.1 and 0.05, respectively. By following the Huffman encoding procedure as shown in Figure 3.6, we obtain the Huffman code
$$\{00, 01, 10, 110, 1110, 1111\}.$$
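A compact way to carry out the recursive construction in practice is with a priority queue. The sketch below (our own heap-based rendition of the algorithm, not taken from the notes) builds a binary Huffman code for the distribution of Example 3.27; since ties may be resolved differently, the resulting codewords can differ from those above, but the multiset of codeword lengths, and hence the average code rate, is the same.

```python
import heapq

def huffman_code(probs):
    """Return a dict symbol -> binary codeword built by repeatedly merging
    the two least probable entries (Lemma 3.26 applied recursively)."""
    heap = [(p, i, [sym]) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    code = {sym: "" for sym in probs}
    counter = len(heap)
    while len(heap) > 1:
        p0, _, group0 = heapq.heappop(heap)   # least probable subtree
        p1, _, group1 = heapq.heappop(heap)   # second least probable subtree
        for sym in group0:
            code[sym] = "0" + code[sym]
        for sym in group1:
            code[sym] = "1" + code[sym]
        heapq.heappush(heap, (p0 + p1, counter, group0 + group1))
        counter += 1
    return code

probs = {1: 0.25, 2: 0.25, 3: 0.25, 4: 0.1, 5: 0.1, 6: 0.05}
code = huffman_code(probs)
print(sorted(len(w) for w in code.values()))                 # [2, 2, 2, 3, 4, 4]
print(sum(probs[s] * len(code[s]) for s in probs))           # about 2.4 bits/source symbol
```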

Figure 3.6: Example of the Huffman encoding.

Observation 3.28

• Huffman codes are not unique for a given source distribution; e.g., by inverting all the code bits of a Huffman code, one gets another Huffman code, or by resolving ties in different ways in the Huffman algorithm, one also obtains different Huffman codes (but all of these codes have the same minimal $R_n$).

• One can obtain optimal codes that are not Huffman codes; e.g., by interchanging two codewords of the same length of a Huffman code, one can get another non-Huffman (but optimal) code. Furthermore, one can construct an optimal suffix code (i.e., a code in which no codeword can be a suffix of another codeword) from a Huffman code (which is a prefix code) by reversing the Huffman codewords.

• Binary Huffman codes always satisfy the Kraft inequality with equality (their code tree is "saturated"); e.g., see [14, p. 72].

• Any n-th order binary Huffman code $f : \mathcal{X}^n \to \{0,1\}^*$ for a stationary source $\{X_n\}_{n=1}^{\infty}$ with finite alphabet $\mathcal{X}$ satisfies:
$$H(\mathcal{X}) \leq \frac{1}{n} H(X^n) \leq R_n < \frac{1}{n} H(X^n) + \frac{1}{n}.$$
Thus, as $n$ increases to infinity, $R_n \to H(\mathcal{X})$ but the complexity as well as the encoding-decoding delay grows exponentially with $n$.

• Finally, note that non-binary (i.e., for $D > 2$) Huffman codes can also be constructed in a mostly similar way as for the case of binary Huffman codes by designing a D-ary tree and iteratively applying Lemma 3.26, where now the $D$ least likely source symbols are combined at each stage. The only difference from the case of binary Huffman codes is that we have to ensure that we are ultimately left with $D$ symbols at the last stage of the algorithm to guarantee the code's optimality. This is remedied by expanding the original source alphabet $\mathcal{X}$ by adding "dummy" symbols (each with zero probability) so that the alphabet size of the expanded source $|\mathcal{X}'|$ is the smallest positive integer greater than or equal to $|\mathcal{X}|$ with
$$|\mathcal{X}'| = 1 \pmod{D-1}.$$
For example, if $|\mathcal{X}| = 6$ and $D = 3$ (ternary codes), we obtain that $|\mathcal{X}'| = 7$, meaning that we need to enlarge the original source $\mathcal{X}$ by adding one dummy (zero-probability) source symbol.

We thus obtain that the necessary conditions for optimality of Lemma 3.25 also hold for D-ary prefix codes when replacing $\mathcal{X}$ with the expanded source $\mathcal{X}'$ and replacing "two" with "$D$" in the statement of the lemma. The resulting D-ary Huffman code will be an optimal code for the original source $\mathcal{X}$ (e.g., see [19, Chap. 3] and [37, Chap. 11]).

B) Shannon-Fano-Elias code

Assume $\mathcal{X} = \{1, \ldots, M\}$ and $P_X(x) > 0$ for all $x \in \mathcal{X}$. Define
$$F(x) \triangleq \sum_{a \leq x} P_X(a),$$
and
$$\bar{F}(x) \triangleq \sum_{a < x} P_X(a) + \frac{1}{2} P_X(x).$$

Encoder: For any $x \in \mathcal{X}$, express $\bar{F}(x)$ in binary (base-2) decimal form, say
$$\bar{F}(x) = .c_1 c_2 \ldots c_k \ldots,$$
and take the first $k$ (fractional) bits as the codeword of source symbol $x$, i.e.,
$$(c_1, c_2, \ldots, c_k),$$
where $k \triangleq \lceil \log_2(1/P_X(x)) \rceil + 1$.

Decoder: Given codeword $(c_1, \ldots, c_k)$, compute the cumulative sum $F(\cdot)$ starting from the smallest element in $\{1, 2, \ldots, M\}$ until the first $x$ satisfying
$$F(x) \geq .c_1 \ldots c_k.$$
Then $x$ should be the original source symbol.

Proof of decodability: For any number $a \in [0, 1]$, let $[a]_k$ denote the operation that chops the binary representation of $a$ after $k$ bits (i.e., removing the $(k+1)$th bit, the $(k+2)$th bit, etc.). Then
$$\bar{F}(x) - \left[ \bar{F}(x) \right]_k < \frac{1}{2^k}.$$
Since $k = \lceil \log_2(1/P_X(x)) \rceil + 1$,
$$\frac{1}{2^k} \leq \frac{1}{2} P_X(x) = \left[ \sum_{a < x} P_X(a) + \frac{P_X(x)}{2} \right] - \sum_{a \leq x-1} P_X(a) = \bar{F}(x) - F(x-1).$$
Hence,
$$F(x-1) = \left[ F(x-1) + \frac{1}{2^k} \right] - \frac{1}{2^k} \leq \bar{F}(x) - \frac{1}{2^k} < \left[ \bar{F}(x) \right]_k.$$
In addition,
$$F(x) > \bar{F}(x) \geq \left[ \bar{F}(x) \right]_k.$$
Consequently, $x$ is the first element satisfying
$$F(x) \geq .c_1 c_2 \ldots c_k.$$

Average codeword length:
$$\bar{\ell} = \sum_{x \in \mathcal{X}} P_X(x) \left( \left\lceil \log_2 \frac{1}{P_X(x)} \right\rceil + 1 \right) < \sum_{x \in \mathcal{X}} P_X(x) \left( \log_2 \frac{1}{P_X(x)} + 2 \right) = (H(X) + 2) \text{ bits}.$$

Observation 3.29 The Shannon-Fano-Elias code is a prefix code.
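As a concrete illustration of the encoder described above (our own sketch; the helper names are ours), the following Python fragment computes the codeword of each symbol by truncating the binary expansion of $\bar{F}(x)$ to $k = \lceil \log_2(1/P_X(x)) \rceil + 1$ bits.

```python
import math

def sfe_code(probs):
    """Shannon-Fano-Elias codewords for symbols 1..M with probabilities `probs`."""
    symbols = sorted(probs)
    code, cumulative = {}, 0.0
    for x in symbols:
        Fbar = cumulative + probs[x] / 2                      # midpoint of the CDF step
        k = math.ceil(math.log2(1 / probs[x])) + 1
        bits, value = [], Fbar
        for _ in range(k):                                    # first k fractional bits
            value *= 2
            bits.append("1" if value >= 1 else "0")
            value -= int(value)
        code[x] = "".join(bits)
        cumulative += probs[x]
    return code

print(sfe_code({1: 0.25, 2: 0.5, 3: 0.125, 4: 0.125}))
# {1: '001', 2: '10', 3: '1101', 4: '1111'}  -- a prefix code, as Observation 3.29 asserts
```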


3.3.4 Examples of universal lossless variable-length codes

In Section 3.3.3, we assumed that the source distribution is known. Thus we can use either Huffman codes or Shannon-Fano-Elias codes to compress the source. What if the source distribution is not known a priori? Is it still possible to establish a completely lossless data compression code which is universally good (or asymptotically optimal) for all sources of interest? The answer is affirmative. Two examples are the adaptive Huffman codes and the Lempel-Ziv codes (which, unlike Huffman and Shannon-Fano-Elias codes, map variable-length sourcewords onto codewords).

A) Adaptive Huffman code

A straightforward universal coding scheme is to use the empirical distribution (or relative frequencies) as the true distribution, and then apply the optimal Huffman code according to the empirical distribution. If the source is i.i.d., the relative frequencies will converge to its true marginal probability. Therefore, such universal codes should be good for all i.i.d. sources. However, in order to get an accurate estimate of the true distribution, one must observe a sufficiently long sourceword sequence, over which the coder will suffer a long delay. This can be improved by using the adaptive universal Huffman code [20].

The working procedure of the adaptive Huffman code is as follows. Start with an initial guess of the source distribution (based on the assumption that the source is a DMS). As a new source symbol arrives, encode the data in terms of the Huffman coding scheme according to the current estimated distribution, and then update the estimated distribution and the Huffman codebook according to the newly arrived source symbol.

To be specific, let the source alphabet be $\mathcal{X} \triangleq \{a_1, \ldots, a_M\}$. Define
$$N(a_i|x^n) \triangleq \text{number of occurrences of } a_i \text{ in } x_1, x_2, \ldots, x_n.$$
Then the (current) relative frequency of $a_i$ is $N(a_i|x^n)/n$. Let $c_n(a_i)$ denote the Huffman codeword of source symbol $a_i$ with respect to the distribution
$$\left[ \frac{N(a_1|x^n)}{n},\ \frac{N(a_2|x^n)}{n},\ \cdots,\ \frac{N(a_M|x^n)}{n} \right].$$

Now suppose that $x_{n+1} = a_j$. The codeword $c_n(a_j)$ is set as output, and the relative frequency of each source outcome becomes:
$$\frac{N(a_j|x^{n+1})}{n+1} = \frac{n \times (N(a_j|x^n)/n) + 1}{n+1}$$
and
$$\frac{N(a_i|x^{n+1})}{n+1} = \frac{n \times (N(a_i|x^n)/n)}{n+1} \quad \text{for } i \neq j.$$
This observation results in the following distribution update policy:
$$P_X^{(n+1)}(a_j) = \frac{n P_X^{(n)}(a_j) + 1}{n+1}$$
and
$$P_X^{(n+1)}(a_i) = \frac{n}{n+1} P_X^{(n)}(a_i) \quad \text{for } i \neq j,$$
where $P_X^{(n+1)}$ represents the estimate of the true distribution $P_X$ at time $n+1$.
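A minimal sketch of this update rule (our illustration; it only maintains the running estimate and does not rebuild the Huffman tree) is given below.

```python
def update_estimate(p_est, n, observed):
    """One step of the adaptive estimate: P^(n+1) from P^(n) after seeing `observed`."""
    return {a: (n * p + (1 if a == observed else 0)) / (n + 1)
            for a, p in p_est.items()}

p = {"a1": 3/8, "a2": 1/4, "a3": 1/8, "a4": 1/8, "a5": 1/16, "a6": 1/16}
p = update_estimate(p, 16, "a3")
print(p["a1"], p["a3"])   # about 0.353 (= 6/17) and 0.176 (= 3/17), as in the text below
```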

Note that in the adaptive Huffman coding scheme, the encoder and decoder need not be re-designed at every time instant, but only when a sufficient change in the estimated distribution occurs such that the so-called sibling property is violated.

Definition 3.30 (Sibling property) A prefix code is said to have the sibling property if its code tree satisfies:

1. every node in the code tree (except for the root node) has a sibling (i.e., the code tree is saturated), and

2. the nodes can be listed in non-decreasing order of probabilities with each node being adjacent to its sibling.

The next observation indicates that the Huffman code is the only prefix code satisfying the sibling property.

Observation 3.31 A prefix code is a Huffman code iff it satisfies the sibling property.

An example of a code tree satisfying the sibling property is shown in Figure 3.7. The first requirement is satisfied since the tree is saturated. The second requirement can be checked via the node list in Figure 3.7.

Figure 3.7: Example of the sibling property based on the code tree from $P_X^{(16)}$, with leaves $a_1(00, 3/8)$, $a_2(01, 1/4)$, $a_3(100, 1/8)$, $a_4(101, 1/8)$, $a_5(110, 1/16)$, $a_6(111, 1/16)$ and internal nodes $b_0(5/8)$, $b_1(3/8)$, $b_{10}(1/4)$, $b_{11}(1/8)$. The arguments inside the parentheses following $a_j$ respectively indicate the codeword and the probability associated with $a_j$; "$b$" is used to denote the internal nodes of the tree with the assigned (partial) code as its subscript, and the number in the parentheses following $b$ is the probability sum of all its children. The node list
$$b_0\!\left(\tfrac{5}{8}\right) \geq b_1\!\left(\tfrac{3}{8}\right) \geq a_1\!\left(\tfrac{3}{8}\right) \geq a_2\!\left(\tfrac{1}{4}\right) \geq b_{10}\!\left(\tfrac{1}{4}\right) \geq b_{11}\!\left(\tfrac{1}{8}\right) \geq a_3\!\left(\tfrac{1}{8}\right) \geq a_4\!\left(\tfrac{1}{8}\right) \geq a_5\!\left(\tfrac{1}{16}\right) \geq a_6\!\left(\tfrac{1}{16}\right)$$
places each node adjacent to its sibling.

If the next observation (say at time $n = 17$) is $a_3$, then its codeword 100 is set as output (using the Huffman code corresponding to $P_X^{(16)}$). The estimated distribution is updated as
$$P_X^{(17)}(a_1) = \frac{16 \times (3/8)}{17} = \frac{6}{17}, \quad P_X^{(17)}(a_2) = \frac{16 \times (1/4)}{17} = \frac{4}{17},$$
$$P_X^{(17)}(a_3) = \frac{16 \times (1/8) + 1}{17} = \frac{3}{17}, \quad P_X^{(17)}(a_4) = \frac{16 \times (1/8)}{17} = \frac{2}{17},$$
$$P_X^{(17)}(a_5) = \frac{16 \times (1/16)}{17} = \frac{1}{17}, \quad P_X^{(17)}(a_6) = \frac{16 \times (1/16)}{17} = \frac{1}{17}.$$

The sibling property is then violated (cf. Figure 3.8). Hence, the codebook needs to be updated according to the new estimated distribution, and the observation at $n = 18$ shall be encoded using the new codebook in Figure 3.9. Details about adaptive Huffman codes can be found in [20].

B) Lempel-Ziv codes

We now introduce a well-known and feasible universal coding scheme, which is named after its inventors, Lempel and Ziv (e.g., cf. [13]). These codes, unlike Huffman and Shannon-Fano-Elias codes, map variable-length sourcewords (as opposed to fixed-length codewords) onto codewords.

Figure 3.8: (Continuation of Figure 3.7) Example of violation of the sibling property after observing a new symbol $a_3$ at $n = 17$. The updated tree has leaves $a_1(00, 6/17)$, $a_2(01, 4/17)$, $a_3(100, 3/17)$, $a_4(101, 2/17)$, $a_5(110, 1/17)$, $a_6(111, 1/17)$ and internal nodes $b_0(10/17)$, $b_1(7/17)$, $b_{10}(5/17)$, $b_{11}(2/17)$; in the ordered node list $b_0 \geq b_1 \geq a_1 \geq b_{10} \geq a_2 \geq a_3 \geq a_4 \geq b_{11} \geq a_5 \geq a_6$, node $a_1$ is not adjacent to its sibling $a_2$.

Suppose the source alphabet is binary. Then the Lempel-Ziv encoder can be described as follows.

Encoder:

1. Parse the input sequence into strings that have never appeared before. For example, if the input sequence is $1011010100010\ldots$, the algorithm first grabs the first letter 1 and finds that it has never appeared before. So 1 is the first string. Then the algorithm scoops the second letter 0 and also determines that it has not appeared before, and hence makes it the next string. The algorithm moves on to the next letter 1, and finds that this string has appeared. Hence, it takes another letter 1 and obtains a new string 11, and so on. Under this procedure, the source sequence is parsed into the strings
$$1,\ 0,\ 11,\ 01,\ 010,\ 00,\ 10.$$

Figure 3.9: (Continuation of Figure 3.8) Updated Huffman code, with leaves $a_1(10, 6/17)$, $a_2(00, 4/17)$, $a_3(01, 3/17)$, $a_4(110, 2/17)$, $a_5(1110, 1/17)$, $a_6(1111, 1/17)$. The sibling property now holds for the new code.

2. Let $L$ be the number of distinct strings of the parsed source. Then we need $\lceil \log_2 L \rceil$ bits to index these strings (starting from one). In the above example, the indices are:

parsed source: 1, 0, 11, 01, 010, 00, 10
index: 001, 010, 011, 100, 101, 110, 111

The codeword of each string is then the index of its prefix concatenated with the last bit of its source string. For example, the codeword of source string 010 will be the index of 01, i.e., 100, concatenated with the last bit of the source string, i.e., 0. Through this procedure, encoding the above parsed strings with $\lceil \log_2 L \rceil = 3$ index bits yields the codeword sequence
$$(000, 1)(000, 0)(001, 1)(010, 1)(100, 0)(010, 0)(001, 0)$$
or equivalently,
$$0001000000110101100001000010.$$


Note that the conventional Lempel-Ziv encoder requires two passes: the first pass to decide $L$, and the second pass to generate the codewords. The algorithm, however, can be modified so that it requires only one pass over the entire source string. Also note that the above algorithm uses an equal number of bits ($\lceil \log_2 L \rceil$) for all the location indices, which can also be relaxed by proper modification.

Decoder: The decoding is straightforward from the encoding procedure.
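The following sketch (our own rendition for illustration; it performs the two passes described above and uses fixed-width indices, sized to accommodate the empty prefix) reproduces the parsing and the codeword sequence of the example.

```python
import math

def lz_encode(bits):
    """Two-pass Lempel-Ziv-style encoder: parse into new phrases, then emit
    (prefix index, last bit) pairs with fixed-width binary indices."""
    phrases, index = [], {"": 0}          # phrase -> index; 0 denotes the empty prefix
    current = ""
    for b in bits:                        # pass 1: parsing
        current += b
        if current not in index:
            index[current] = len(phrases) + 1
            phrases.append(current)
            current = ""
    width = max(1, math.ceil(math.log2(len(phrases) + 1)))
    out = []                              # pass 2: codeword generation
    for p in phrases:
        out.append(format(index[p[:-1]], "0{}b".format(width)) + p[-1])
    return phrases, "".join(out)

phrases, code = lz_encode("1011010100010")
print(phrases)   # ['1', '0', '11', '01', '010', '00', '10']
print(code)      # 0001000000110101100001000010
```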

Theorem 3.32 The above algorithm asymptotically achieves the entropy rate of any stationary ergodic source (with unknown statistics).

Proof: Refer to [13, Sec. 13.5].


Chapter 4

Data Transmission and Channel Capacity

4.1 Principles of data transmission

A noisy communication channel is an input-output medium in which the output is not completely or deterministically specified by the input. The channel is indeed stochastically modeled, where given channel input $x$, the channel output $y$ is governed by a transition (conditional) probability distribution denoted by $P_{Y|X}(y|x)$. Since two different inputs may give rise to the same output, the receiver, upon receipt of an output, needs to guess the most probable sent input. In general, words of length $n$ are sent and received over the channel; in this case, the channel is characterized by a sequence of n-dimensional transition distributions $P_{Y^n|X^n}(y^n|x^n)$, for $n = 1, 2, \cdots$. A block diagram depicting a data transmission or channel coding system (with no feedback$^1$) is given in Figure 4.1.

Figure 4.1: A data transmission system ($W \to$ Channel Encoder $\to X^n \to$ Channel $P_{Y^n|X^n}(\cdot|\cdot) \to Y^n \to$ Channel Decoder $\to \hat{W}$), where $W$ represents the message for transmission, $X^n$ denotes the codeword corresponding to message $W$, $Y^n$ represents the received word due to channel input $X^n$, and $\hat{W}$ denotes the reconstructed message from $Y^n$.

$^1$The capacity of channels with (output) feedback will be studied in Part II of the book.


The designer of a data transmission (or channel) code needs to carefully select codewords from the set of channel input words (of a given length) so that a minimal ambiguity is obtained at the channel receiver. For example, suppose that a channel has binary input and output alphabets and that its transition probability distribution induces the following conditional probability on its output symbols given that input words of length 2 are sent:
$$P_{Y|X^2}(y=0|x^2=00) = 1, \quad P_{Y|X^2}(y=0|x^2=01) = 1,$$
$$P_{Y|X^2}(y=1|x^2=10) = 1, \quad P_{Y|X^2}(y=1|x^2=11) = 1,$$

which can be graphically depicted as a transition diagram mapping inputs 00 and 01 to output 0 and inputs 10 and 11 to output 1 (each with probability 1), and a binary message (either event A or event B) is required to be transmitted from the sender to the receiver. Then the data transmission code with (codeword 00 for event A, codeword 10 for event B) obviously induces less ambiguity at the receiver than the code with (codeword 00 for event A, codeword 01 for event B).

In short, the objective in designing a data transmission (or channel) code is to transform a noisy channel into a reliable medium for sending messages and recovering them at the receiver with minimal loss. To achieve this goal, the designer of a data transmission code needs to take advantage of the common parts between the sender and the receiver sites that are least affected by the channel noise. We will see that these common parts are probabilistically captured by the mutual information between the channel input and the channel output.

As illustrated in the previous example, if a "least-noise-affected" subset of the channel input words is appropriately selected as the set of codewords, the messages intended to be transmitted can be reliably sent to the receiver with arbitrarily small error. One then raises the question:

What is the maximum amount of information (per channel use) that can be reliably transmitted over a given noisy channel?


In the above example, we can transmit a binary message error-free, and hence the amount of information that can be reliably transmitted is at least 1 bit per channel use (or channel symbol). It can be expected that the amount of information that can be reliably transmitted for a highly noisy channel should be less than that for a less noisy channel. But such a comparison requires a good measure of the "noisiness" of channels.

From an information theoretic viewpoint, "channel capacity" provides a good measure of the noisiness of a channel; it represents the maximal amount of informational messages (per channel use) that can be transmitted via a data transmission code over the channel and recovered with arbitrarily small probability of error at the receiver. In addition to its dependence on the channel transition distribution, channel capacity also depends on the coding constraint imposed on the channel input, such as "only block (fixed-length) codes are allowed." In this chapter, we will study channel capacity for block codes (namely, only block transmission codes can be used).$^2$ Throughout the chapter, the noisy channel is assumed to be memoryless (as defined in the next section).

4.2 Discrete memoryless channels

Definition 4.1 (Discrete channel) A discrete communication channel is characterized by

• A finite input alphabet $\mathcal{X}$.

• A finite output alphabet $\mathcal{Y}$.

• A sequence of n-dimensional transition distributions
$$\{P_{Y^n|X^n}(y^n|x^n)\}_{n=1}^{\infty}$$
such that $\sum_{y^n \in \mathcal{Y}^n} P_{Y^n|X^n}(y^n|x^n) = 1$ for every $x^n \in \mathcal{X}^n$, where $x^n = (x_1, \cdots, x_n) \in \mathcal{X}^n$ and $y^n = (y_1, \cdots, y_n) \in \mathcal{Y}^n$. We assume that the above sequence of n-dimensional distributions is consistent, i.e.,
\begin{align}
P_{Y^i|X^i}(y^i|x^i) &= \frac{\sum_{x_{i+1} \in \mathcal{X}} \sum_{y_{i+1} \in \mathcal{Y}} P_{X^{i+1}}(x^{i+1}) P_{Y^{i+1}|X^{i+1}}(y^{i+1}|x^{i+1})}{\sum_{x_{i+1} \in \mathcal{X}} P_{X^{i+1}}(x^{i+1})} \\
&= \sum_{x_{i+1} \in \mathcal{X}} \sum_{y_{i+1} \in \mathcal{Y}} P_{X_{i+1}|X^i}(x_{i+1}|x^i) P_{Y^{i+1}|X^{i+1}}(y^{i+1}|x^{i+1})
\end{align}
for every $x^i$, $y^i$, $P_{X_{i+1}|X^i}$ and $i = 1, 2, \cdots$.

$^2$See [48] for recent results regarding channel capacity when no coding constraints are applied on the channel input (so that variable-length codes can be employed).


In general, real-world communications channels exhibit statistical memory in the sense that current channel outputs statistically depend on past outputs as well as past, current and (possibly) future inputs. However, for the sake of simplicity, we restrict our attention in this chapter to the class of memoryless channels (channels with memory will later be treated in Part II).

Definition 4.2 (Discrete memoryless channel) A discrete memoryless channel (DMC) is a channel whose sequence of transition distributions $P_{Y^n|X^n}$ satisfies
$$P_{Y^n|X^n}(y^n|x^n) = \prod_{i=1}^{n} P_{Y|X}(y_i|x_i) \qquad (4.2.1)$$
for every $n = 1, 2, \cdots$, $x^n \in \mathcal{X}^n$ and $y^n \in \mathcal{Y}^n$. In other words, a DMC is fully described by the channel's transition distribution matrix $Q \triangleq [p_{x,y}]$ of size $|\mathcal{X}| \times |\mathcal{Y}|$, where
$$p_{x,y} \triangleq P_{Y|X}(y|x)$$
for $x \in \mathcal{X}$, $y \in \mathcal{Y}$. Furthermore, the matrix $Q$ is stochastic; i.e., the sum of the entries in each of its rows is equal to 1 (since $\sum_{y \in \mathcal{Y}} p_{x,y} = 1$ for all $x \in \mathcal{X}$).

Observation 4.3 We note that the DMC's condition (4.2.1) is actually equivalent to the following two sets of conditions:
\begin{align}
P_{Y_n|X^n,Y^{n-1}}(y_n|x^n, y^{n-1}) &= P_{Y|X}(y_n|x_n) \quad \forall\, n = 1, 2, \cdots,\ x^n,\ y^n; \qquad (4.2.2a) \\
P_{Y^{n-1}|X^n}(y^{n-1}|x^n) &= P_{Y^{n-1}|X^{n-1}}(y^{n-1}|x^{n-1}) \quad \forall\, n = 2, 3, \cdots,\ x^n,\ y^{n-1}. \qquad (4.2.2b)
\end{align}
\begin{align}
P_{Y_n|X^n,Y^{n-1}}(y_n|x^n, y^{n-1}) &= P_{Y|X}(y_n|x_n) \quad \forall\, n = 1, 2, \cdots,\ x^n,\ y^n; \qquad (4.2.3a) \\
P_{X_n|X^{n-1},Y^{n-1}}(x_n|x^{n-1}, y^{n-1}) &= P_{X_n|X^{n-1}}(x_n|x^{n-1}) \quad \forall\, n = 1, 2, \cdots,\ x^n,\ y^{n-1}. \qquad (4.2.3b)
\end{align}
Condition (4.2.2a) (also, (4.2.3a)) implies that the current output $Y_n$ only depends on the current input $X_n$ but not on past inputs $X^{n-1}$ and outputs $Y^{n-1}$. Condition (4.2.2b) indicates that the past outputs $Y^{n-1}$ do not depend on the current input $X_n$. These two conditions together give
$$P_{Y^n|X^n}(y^n|x^n) = P_{Y^{n-1}|X^n}(y^{n-1}|x^n)\, P_{Y_n|X^n,Y^{n-1}}(y_n|x^n, y^{n-1}) = P_{Y^{n-1}|X^{n-1}}(y^{n-1}|x^{n-1})\, P_{Y|X}(y_n|x_n);$$
hence, (4.2.1) holds recursively on $n = 1, 2, \cdots$. The converse (i.e., (4.2.1) implies both (4.2.2a) and (4.2.2b)) is a direct consequence of
$$P_{Y_n|X^n,Y^{n-1}}(y_n|x^n, y^{n-1}) = \frac{P_{Y^n|X^n}(y^n|x^n)}{\sum_{y_n \in \mathcal{Y}} P_{Y^n|X^n}(y^n|x^n)}$$
and
$$P_{Y^{n-1}|X^n}(y^{n-1}|x^n) = \sum_{y_n \in \mathcal{Y}} P_{Y^n|X^n}(y^n|x^n).$$
Similarly, (4.2.3b) states that the current input $X_n$ is independent of past outputs $Y^{n-1}$, which together with (4.2.3a) implies again
\begin{align}
P_{Y^n|X^n}(y^n|x^n) &= \frac{P_{X^n,Y^n}(x^n, y^n)}{P_{X^n}(x^n)} \\
&= \frac{P_{X^{n-1},Y^{n-1}}(x^{n-1}, y^{n-1})\, P_{X_n|X^{n-1},Y^{n-1}}(x_n|x^{n-1}, y^{n-1})\, P_{Y_n|X^n,Y^{n-1}}(y_n|x^n, y^{n-1})}{P_{X^{n-1}}(x^{n-1})\, P_{X_n|X^{n-1}}(x_n|x^{n-1})} \\
&= P_{Y^{n-1}|X^{n-1}}(y^{n-1}|x^{n-1})\, P_{Y|X}(y_n|x_n);
\end{align}
hence, recursively yielding (4.2.1). The converse for (4.2.3b), i.e., that (4.2.1) implies (4.2.3b), can be analogously proved by noting that
$$P_{X_n|X^{n-1},Y^{n-1}}(x_n|x^{n-1}, y^{n-1}) = \frac{P_{X^n}(x^n) \sum_{y_n \in \mathcal{Y}} P_{Y^n|X^n}(y^n|x^n)}{P_{X^{n-1}}(x^{n-1})\, P_{Y^{n-1}|X^{n-1}}(y^{n-1}|x^{n-1})}.$$

Note that the above definition of DMC in (4.2.1) prohibits the use of channel feedback, as feedback allows the current channel input to be a function of past channel outputs (therefore, conditions (4.2.2b) and (4.2.3b) cannot hold with feedback). Instead, a causality condition generalizing condition (4.2.2a) (e.g., see Definition 7.4 in [53]) will be needed to define a channel with feedback (feedback will be considered in Part II of this book).

Examples of DMCs:

1. Identity (noiseless) channels: An identity channel has equal-size input and output alphabets ($|\mathcal{X}| = |\mathcal{Y}|$) and channel transition probability satisfying
$$P_{Y|X}(y|x) = \begin{cases} 1 & \text{if } y = x \\ 0 & \text{if } y \neq x. \end{cases}$$
This is a noiseless or perfect channel as the channel input is received error-free at the channel output.

2. Binary symmetric channels: A binary symmetric channel (BSC) is a channel with binary input and output alphabets such that each input has a (conditional) probability given by $\varepsilon$ for being received inverted at the output, where $\varepsilon \in [0,1]$ is called the channel's crossover probability or bit error rate. The channel's transition distribution matrix is given by
$$Q = [p_{x,y}] = \begin{bmatrix} p_{0,0} & p_{0,1} \\ p_{1,0} & p_{1,1} \end{bmatrix} = \begin{bmatrix} P_{Y|X}(0|0) & P_{Y|X}(1|0) \\ P_{Y|X}(0|1) & P_{Y|X}(1|1) \end{bmatrix} = \begin{bmatrix} 1-\varepsilon & \varepsilon \\ \varepsilon & 1-\varepsilon \end{bmatrix} \qquad (4.2.4)$$
and can be graphically represented via a transition diagram as shown in Figure 4.2.

Figure 4.2: Binary symmetric channel.

If we set $\varepsilon = 0$, then the BSC reduces to the binary identity (noiseless) channel. The channel is called "symmetric" since $P_{Y|X}(1|0) = P_{Y|X}(0|1)$; i.e., it has the same probability for flipping an input bit into a 0 or a 1. DMCs with various symmetry properties will be discussed in detail at the end of this chapter.

Despite its simplicity, the BSC is rich enough to capture most of the complexity of coding problems over more general channels. For example, it can exactly model the behavior of practical channels with additive memoryless Gaussian noise used in conjunction with binary symmetric modulation and hard-decision demodulation (e.g., see [51, p. 240]). It is also worth pointing out that the BSC can be explicitly represented via a binary modulo-2 additive noise channel whose output at time $i$ is the modulo-2 sum of its input and noise variables:
$$Y_i = X_i \oplus Z_i \quad \text{for } i = 1, 2, \cdots$$
where $\oplus$ denotes addition modulo-2; $Y_i$, $X_i$ and $Z_i$ are the channel output, input and noise, respectively, at time $i$; the alphabets $\mathcal{X} = \mathcal{Y} = \mathcal{Z} = \{0, 1\}$ are all binary; and it is assumed that $X_i$ and $Z_j$ are independent from each other for any $i, j = 1, 2, \cdots$, and that the noise process is a Bernoulli($\varepsilon$) process, i.e., a binary i.i.d. process with $\Pr[Z = 1] = \varepsilon$ (a small simulation sketch of this additive-noise representation, covering general modulo-$q$ addition, is given right after these examples).

3. Binary erasure channels: In the BSC, some input bits are received perfectly and others are received corrupted (flipped) at the channel output. In some channels however, some input bits are lost during transmission instead of being received corrupted (for example, packets in data networks may get dropped or blocked due to congestion or bandwidth constraints). In this case, the receiver knows the exact location of these bits in the received bitstream or codeword, but not their actual value. Such bits are then declared as "erased" during transmission and are called "erasures." This gives rise to the so-called binary erasure channel (BEC) as illustrated in Figure 4.3, with input alphabet $\mathcal{X} = \{0, 1\}$ and output alphabet $\mathcal{Y} = \{0, E, 1\}$, where $E$ represents an erasure, and channel transition matrix given by
$$Q = [p_{x,y}] = \begin{bmatrix} p_{0,0} & p_{0,E} & p_{0,1} \\ p_{1,0} & p_{1,E} & p_{1,1} \end{bmatrix} = \begin{bmatrix} P_{Y|X}(0|0) & P_{Y|X}(E|0) & P_{Y|X}(1|0) \\ P_{Y|X}(0|1) & P_{Y|X}(E|1) & P_{Y|X}(1|1) \end{bmatrix} = \begin{bmatrix} 1-\alpha & \alpha & 0 \\ 0 & \alpha & 1-\alpha \end{bmatrix} \qquad (4.2.5)$$
where $0 \leq \alpha \leq 1$ is called the channel's erasure probability.

4. Binary channels with errors and erasures: One can combine the BSC with the BEC to obtain a binary channel with both errors and erasures, as shown in Figure 4.4. We will call such a channel the binary symmetric erasure channel (BSEC). In this case, the channel's transition matrix is given by
$$Q = [p_{x,y}] = \begin{bmatrix} p_{0,0} & p_{0,E} & p_{0,1} \\ p_{1,0} & p_{1,E} & p_{1,1} \end{bmatrix} = \begin{bmatrix} 1-\varepsilon-\alpha & \alpha & \varepsilon \\ \varepsilon & \alpha & 1-\varepsilon-\alpha \end{bmatrix} \qquad (4.2.6)$$
where $\varepsilon, \alpha \in [0, 1]$ are the channel's crossover and erasure probabilities, respectively. Clearly, setting $\alpha = 0$ reduces the BSEC to the BSC, and setting $\varepsilon = 0$ reduces the BSEC to the BEC.

Figure 4.3: Binary erasure channel.

More generally, the channel need not have a symmetric property in the sense of having identical transition distributions when input bits 0 or 1 are sent. For example, the channel's transition matrix can be given by

$$Q = [p_{x,y}] = \begin{bmatrix} p_{0,0} & p_{0,E} & p_{0,1} \\ p_{1,0} & p_{1,E} & p_{1,1} \end{bmatrix} = \begin{bmatrix} 1-\varepsilon-\alpha & \alpha & \varepsilon \\ \varepsilon' & \alpha' & 1-\varepsilon'-\alpha' \end{bmatrix} \qquad (4.2.7)$$
where in general $\varepsilon \neq \varepsilon'$ and $\alpha \neq \alpha'$. We call such a channel an asymmetric channel with errors and erasures (this model might be useful to represent practical channels using asymmetric or non-uniform modulation constellations).

Figure 4.4: Binary symmetric erasure channel.

5. q-ary symmetric channels: Given an integer $q \geq 2$, the q-ary symmetric channel is a non-binary extension of the BSC; it has alphabets $\mathcal{X} = \mathcal{Y} = \{0, 1, \cdots, q-1\}$ of size $q$ and channel transition matrix given by
$$Q = [p_{x,y}] = \begin{bmatrix} p_{0,0} & p_{0,1} & \cdots & p_{0,q-1} \\ p_{1,0} & p_{1,1} & \cdots & p_{1,q-1} \\ \vdots & \vdots & \ddots & \vdots \\ p_{q-1,0} & p_{q-1,1} & \cdots & p_{q-1,q-1} \end{bmatrix} = \begin{bmatrix} 1-\varepsilon & \frac{\varepsilon}{q-1} & \cdots & \frac{\varepsilon}{q-1} \\ \frac{\varepsilon}{q-1} & 1-\varepsilon & \cdots & \frac{\varepsilon}{q-1} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\varepsilon}{q-1} & \frac{\varepsilon}{q-1} & \cdots & 1-\varepsilon \end{bmatrix} \qquad (4.2.8)$$
where $0 \leq \varepsilon \leq 1$ is the channel's symbol error rate (or probability). When $q = 2$, the channel reduces to the BSC with bit error rate $\varepsilon$, as expected.

As with the BSC, the q-ary symmetric channel can be expressed as a modulo-q additive noise channel with common input, output and noise alphabets $\mathcal{X} = \mathcal{Y} = \mathcal{Z} = \{0, 1, \cdots, q-1\}$ and whose output $Y_i$ at time $i$ is given by $Y_i = X_i \oplus_q Z_i$, for $i = 1, 2, \cdots$, where $\oplus_q$ denotes addition modulo-q, and $X_i$ and $Z_i$ are the channel's input and noise variables, respectively, at time $i$. Here, the noise process $\{Z_n\}_{n=1}^{\infty}$ is assumed to be an i.i.d. process with distribution
$$\Pr[Z = 0] = 1 - \varepsilon \quad \text{and} \quad \Pr[Z = a] = \frac{\varepsilon}{q-1} \quad \forall\, a \in \{1, \cdots, q-1\}.$$
It is also assumed that the input and noise processes are independent from each other.

6. q-ary erasure channels: Given an integer $q \geq 2$, one can also consider a non-binary extension of the BEC, yielding the so-called q-ary erasure channel. Specifically, this channel has input and output alphabets given by $\mathcal{X} = \{0, 1, \cdots, q-1\}$ and $\mathcal{Y} = \{0, 1, \cdots, q-1, E\}$, respectively, where $E$ denotes an erasure, and channel transition distribution given by
$$P_{Y|X}(y|x) = \begin{cases} 1-\alpha & \text{if } y = x,\ x \in \mathcal{X} \\ \alpha & \text{if } y = E,\ x \in \mathcal{X} \\ 0 & \text{if } y \neq x \text{ and } y \neq E,\ x \in \mathcal{X} \end{cases} \qquad (4.2.9)$$
where $0 \leq \alpha \leq 1$ is the erasure probability. As expected, setting $q = 2$ reduces the channel to the BEC.
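To tie the additive-noise viewpoint of Examples 2 and 5 to something executable, here is a small simulation sketch (our own illustration, with arbitrarily chosen parameters) of the modulo-$q$ additive noise channel $Y_i = X_i \oplus_q Z_i$; setting $q = 2$ gives the BSC.

```python
import random

def q_ary_symmetric_channel(x, eps, q=2, rng=random):
    """Pass the symbol sequence x through Y_i = X_i (+) Z_i mod q, where
    Z_i = 0 w.p. 1-eps and Z_i = a w.p. eps/(q-1) for each a in {1,...,q-1}."""
    out = []
    for xi in x:
        if rng.random() < eps:
            zi = rng.randint(1, q - 1)     # a uniformly chosen nonzero error symbol
        else:
            zi = 0
        out.append((xi + zi) % q)
    return out

random.seed(0)
x = [random.randint(0, 1) for _ in range(10000)]
y = q_ary_symmetric_channel(x, eps=0.1, q=2)
print(sum(xi != yi for xi, yi in zip(x, y)) / len(x))   # empirical crossover rate near 0.1
```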


4.3 Block codes for data transmission over DMCs

Definition 4.4 (Fixed-length data transmission code) Given positive integers $n$ and $M$, and a discrete channel with input alphabet $\mathcal{X}$ and output alphabet $\mathcal{Y}$, a fixed-length data transmission code (or block code) for this channel with blocklength $n$ and rate $\frac{1}{n} \log_2 M$ message bits per channel symbol (or channel use) is denoted by $\mathcal{C}_n = (n, M)$ and consists of:

1. $M$ information messages intended for transmission.

2. An encoding function
$$f : \{1, 2, \ldots, M\} \to \mathcal{X}^n$$
yielding codewords $f(1), f(2), \cdots, f(M) \in \mathcal{X}^n$, each of length $n$. The set of these $M$ codewords is called the codebook and we also usually write $\mathcal{C}_n = \{f(1), f(2), \cdots, f(M)\}$ to list the codewords.

3. A decoding function $g : \mathcal{Y}^n \to \{1, 2, \ldots, M\}$.

The set $\{1, 2, \ldots, M\}$ is called the message set and we assume that a message $W$ follows a uniform distribution over the set of messages: $\Pr[W = w] = \frac{1}{M}$ for all $w \in \{1, 2, \ldots, M\}$. A block diagram for the channel code is given at the beginning of this chapter; see Figure 4.1. As depicted in the diagram, to convey message $W$ over the channel, the encoder sends its corresponding codeword $X^n = f(W)$ at the channel input. Finally, $Y^n$ is received at the channel output (according to the memoryless channel distribution $P_{Y^n|X^n}$) and the decoder yields $\hat{W} = g(Y^n)$ as the message estimate.

Definition 4.5 (Average probability of error) The average probability of error for a channel block code $\mathcal{C}_n = (n, M)$ with encoder $f(\cdot)$ and decoder $g(\cdot)$ used over a channel with transition distribution $P_{Y^n|X^n}$ is defined as
$$P_e(\mathcal{C}_n) \triangleq \frac{1}{M} \sum_{w=1}^{M} \lambda_w(\mathcal{C}_n),$$
where
$$\lambda_w(\mathcal{C}_n) \triangleq \Pr[\hat{W} \neq W \,|\, W = w] = \Pr[g(Y^n) \neq w \,|\, X^n = f(w)] = \sum_{y^n \in \mathcal{Y}^n :\, g(y^n) \neq w} P_{Y^n|X^n}(y^n|f(w))$$
is the code's conditional probability of decoding error given that message $w$ is sent over the channel.


Note that, since we have assumed that the message $W$ is drawn uniformly from the set of messages, we have that
$$P_e(\mathcal{C}_n) = \Pr[\hat{W} \neq W].$$
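For a very small code, $P_e(\mathcal{C}_n)$ can be computed exactly by enumerating all output words. The sketch below (our own example with an arbitrarily chosen repetition code over a binary-output channel) evaluates the definition directly.

```python
import itertools

def block_error_prob(codebook, decode, p_y_given_x):
    """Average error probability of a block code used over a memoryless channel
    with binary output alphabet. codebook: message -> codeword (tuple of inputs)."""
    n = len(next(iter(codebook.values())))
    total = 0.0
    for w, xn in codebook.items():
        for yn in itertools.product([0, 1], repeat=n):
            if decode(yn) != w:
                prob = 1.0
                for xi, yi in zip(xn, yn):
                    prob *= p_y_given_x[xi][yi]
                total += prob
    return total / len(codebook)

eps = 0.1
bsc = {0: {0: 1 - eps, 1: eps}, 1: {0: eps, 1: 1 - eps}}         # P_{Y|X} of a BSC
code = {0: (0, 0, 0), 1: (1, 1, 1)}                              # length-3 repetition code
majority = lambda yn: int(sum(yn) >= 2)
print(block_error_prob(code, majority, bsc))                     # 0.028 = 3*eps^2*(1-eps) + eps^3
```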

Observation 4.6 Another more conservative error criterion is the so-called maximal probability of error
$$\lambda(\mathcal{C}_n) \triangleq \max_{w \in \{1, 2, \cdots, M\}} \lambda_w(\mathcal{C}_n).$$
Clearly, $P_e(\mathcal{C}_n) \leq \lambda(\mathcal{C}_n)$; so one would expect that $P_e(\mathcal{C}_n)$ behaves differently than $\lambda(\mathcal{C}_n)$. However it can be shown that from a code $\mathcal{C}_n = (n, M)$ with arbitrarily small $P_e(\mathcal{C}_n)$, one can construct (by throwing away from $\mathcal{C}_n$ half of its codewords with largest conditional probability of error) a code $\mathcal{C}'_n = (n, \frac{M}{2})$ with arbitrarily small $\lambda(\mathcal{C}'_n)$ at essentially the same code rate as $n$ grows to infinity (e.g., see [13, p. 204], [53, p. 163]).$^3$ Hence, we will only use $P_e(\mathcal{C}_n)$ as our criterion when evaluating the "goodness" or reliability$^4$ of channel block codes.

$^3$Note that this fact holds for single-user channels with known transition distributions (as given in Definition 4.1) that remain constant throughout the transmission of a codeword. It does not however hold for single-user channels whose statistical descriptions may vary in an unknown manner from symbol to symbol during a codeword transmission; such channels, which include the class of "arbitrarily varying channels" (see [14, Chapter 2, Section 6]), will not be considered in this textbook.

$^4$We interchangeably use the terms "goodness" or "reliability" for a block code to mean that its (average) probability of error asymptotically vanishes with increasing blocklength.

Our target is to find a good channel block code (or to show the existence of a good channel block code). From the perspective of the (weak) law of large numbers, a good choice is to draw the code's codewords based on the jointly typical set between the input and the output of the channel, since all the probability mass is ultimately placed on the jointly typical set. The decoding failure then occurs only when the channel input-output pair does not lie in the jointly typical set, which implies that the probability of decoding error is ultimately small. We next define the jointly typical set.

Definition 4.7 (Jointly typical set) The set $F_n(\delta)$ of jointly δ-typical n-tuple pairs $(x^n, y^n)$ with respect to the memoryless distribution $P_{X^n,Y^n}(x^n, y^n) = \prod_{i=1}^{n} P_{X,Y}(x_i, y_i)$ is defined by
\begin{align}
F_n(\delta) \triangleq \bigg\{ (x^n, y^n) \in \mathcal{X}^n \times \mathcal{Y}^n :\ & \left| -\frac{1}{n} \log_2 P_{X^n}(x^n) - H(X) \right| < \delta, \\
& \left| -\frac{1}{n} \log_2 P_{Y^n}(y^n) - H(Y) \right| < \delta, \\
\text{and}\ & \left| -\frac{1}{n} \log_2 P_{X^n,Y^n}(x^n, y^n) - H(X, Y) \right| < \delta \bigg\}.
\end{align}

In short, a pair $(x^n, y^n)$ generated by independently drawing $n$ times under $P_{X,Y}$ is jointly δ-typical if its joint and marginal empirical entropies are respectively δ-close to the true joint and marginal entropies.

With the above definition, we directly obtain the joint AEP theorem.

Theorem 4.8 (Joint AEP) If $(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n), \ldots$ are i.i.d., i.e., $\{(X_i, Y_i)\}_{i=1}^{\infty}$ is a dependent pair of DMSs, then
\begin{align}
-\frac{1}{n} \log_2 P_{X^n}(X_1, X_2, \ldots, X_n) &\to H(X) \quad \text{in probability}, \\
-\frac{1}{n} \log_2 P_{Y^n}(Y_1, Y_2, \ldots, Y_n) &\to H(Y) \quad \text{in probability}, \\
-\frac{1}{n} \log_2 P_{X^n,Y^n}((X_1, Y_1), \ldots, (X_n, Y_n)) &\to H(X, Y) \quad \text{in probability}
\end{align}
as $n \to \infty$.

Proof: By the weak law of large numbers, we have the desired result.

Theorem 4.9 (Shannon-McMillan theorem for pairs) Given a dependent pair of DMSs with joint entropy $H(X, Y)$ and any $\delta$ greater than zero, we can choose $n$ big enough so that the jointly δ-typical set satisfies:

1. $P_{X^n,Y^n}(F_n^c(\delta)) < \delta$ for sufficiently large $n$.

2. The number of elements in $F_n(\delta)$ is at least $(1-\delta)\, 2^{n(H(X,Y)-\delta)}$ for sufficiently large $n$, and at most $2^{n(H(X,Y)+\delta)}$ for every $n$.

3. If $(x^n, y^n) \in F_n(\delta)$, its probability of occurrence satisfies
$$2^{-n(H(X,Y)+\delta)} < P_{X^n,Y^n}(x^n, y^n) < 2^{-n(H(X,Y)-\delta)}.$$


Proof: The proof is quite similar to that of the Shannon-McMillan theorem for a single memoryless source presented in the previous chapter; we hence leave it as an exercise.
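Definition 4.7 is easy to check numerically for short sequences. The sketch below (our own helper, for illustration only) tests joint δ-typicality of a given pair $(x^n, y^n)$ with respect to a joint distribution $P_{X,Y}$.

```python
import math

def jointly_typical(xn, yn, pXY, delta):
    """Check the three conditions of Definition 4.7 for the pair (x^n, y^n)."""
    n = len(xn)
    pX = {x: sum(pXY[(x, y)] for (xx, y) in pXY if xx == x) for (x, _) in pXY}
    pY = {y: sum(pXY[(x, y)] for (x, yy) in pXY if yy == y) for (_, y) in pXY}
    HX = -sum(p * math.log2(p) for p in pX.values() if p > 0)
    HY = -sum(p * math.log2(p) for p in pY.values() if p > 0)
    HXY = -sum(p * math.log2(p) for p in pXY.values() if p > 0)
    ex = -sum(math.log2(pX[x]) for x in xn) / n           # empirical -1/n log P(x^n)
    ey = -sum(math.log2(pY[y]) for y in yn) / n
    exy = -sum(math.log2(pXY[(x, y)]) for x, y in zip(xn, yn)) / n
    return abs(ex - HX) < delta and abs(ey - HY) < delta and abs(exy - HXY) < delta

pXY = {(0, 0): 0.45, (0, 1): 0.05, (1, 0): 0.05, (1, 1): 0.45}   # BSC(0.1) with uniform input
xn = [0, 1] * 10
yn = list(xn); yn[0] ^= 1; yn[10] ^= 1                   # flip 2 of 20 positions (10% "errors")
print(jointly_typical(xn, yn, pXY, delta=0.2))           # True for this fairly typical pair
```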

We herein arrive at the main result of this chapter, Shannon's channel coding theorem for DMCs. It basically states that a quantity $C$, termed the channel capacity and defined as the maximum of the channel's mutual information over the set of its input distributions (see below), is the supremum of all "achievable" channel block code rates; i.e., it is the supremum of all rates for which there exists a sequence of block codes for the channel with asymptotically decaying (as the blocklength grows to infinity) probability of decoding error. In other words, for a given DMC, its capacity $C$, which can be calculated by solely using the channel's transition matrix $Q$, constitutes the largest rate at which one can reliably transmit information via a block code over this channel. Thus, it is possible to communicate reliably over an inherently noisy DMC at a fixed rate (without decreasing it) as long as this rate is below $C$ and the code's blocklength is allowed to be large.

Theorem 4.10 (Shannon's channel coding theorem) Consider a DMC with finite input alphabet $\mathcal{X}$, finite output alphabet $\mathcal{Y}$ and transition probability distribution $P_{Y|X}(y|x)$, $x \in \mathcal{X}$ and $y \in \mathcal{Y}$. Define the channel capacity$^5$
$$C \triangleq \max_{P_X} I(X; Y) = \max_{P_X} I(P_X, P_{Y|X}),$$
where the maximum is taken over all input distributions $P_X$. Then the following hold.

• Forward part (achievability): For any $0 < \varepsilon < 1$, there exist $\gamma > 0$ and a sequence of data transmission block codes $\{\mathcal{C}_n = (n, M_n)\}_{n=1}^{\infty}$ with
$$\liminf_{n \to \infty} \frac{1}{n} \log_2 M_n \geq C - \gamma$$
and
$$P_e(\mathcal{C}_n) < \varepsilon \quad \text{for sufficiently large } n,$$
where $P_e(\mathcal{C}_n)$ denotes the (average) probability of error for block code $\mathcal{C}_n$.

• Converse part: For any $0 < \varepsilon < 1$, any sequence of data transmission block codes $\{\mathcal{C}_n = (n, M_n)\}_{n=1}^{\infty}$ with
$$\liminf_{n \to \infty} \frac{1}{n} \log_2 M_n > C$$
satisfies
$$P_e(\mathcal{C}_n) > (1 - \varepsilon)\mu \quad \text{for sufficiently large } n, \qquad (4.3.1)$$
where
$$\mu = 1 - \frac{C}{\liminf_{n \to \infty} \frac{1}{n} \log_2 M_n} > 0,$$
i.e., the codes' probability of error is bounded away from zero for all $n$ sufficiently large.$^6$

$^5$First note that the mutual information $I(X;Y)$ is actually a function of the input statistics $P_X$ and the channel statistics $P_{Y|X}$. Hence, we may write it as
$$I(P_X, P_{Y|X}) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} P_X(x) P_{Y|X}(y|x) \log_2 \frac{P_{Y|X}(y|x)}{\sum_{x' \in \mathcal{X}} P_X(x') P_{Y|X}(y|x')}.$$
Such an expression is more suitable for calculating the channel capacity. Note also that the channel capacity $C$ is well-defined since, for a fixed $P_{Y|X}$, $I(P_X, P_{Y|X})$ is concave and continuous in $P_X$ (with respect to both the variational distance and the Euclidean distance (i.e., $L_2$-distance) [53, Chapter 2]), and since the set of all input distributions $P_X$ is a compact (closed and bounded) subset of $\mathbb{R}^{|\mathcal{X}|}$ due to the finiteness of $\mathcal{X}$. Hence there exists a $P_X$ that achieves the supremum of the mutual information and the maximum is attainable.
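The footnote's expression for $I(P_X, P_{Y|X})$ can be evaluated numerically; the brute-force sketch below (our own illustration, not an efficient capacity algorithm such as Blahut-Arimoto) sweeps the input distribution of a BSC and compares the resulting maximum with the known closed form $1 - h_b(\varepsilon)$ for the BSC's capacity.

```python
import math

def mutual_information(pX, Q):
    """I(P_X, P_{Y|X}) = sum_x sum_y P_X(x) Q[x][y] log2( Q[x][y] / P_Y(y) )."""
    pY = [sum(pX[x] * Q[x][y] for x in range(len(pX))) for y in range(len(Q[0]))]
    I = 0.0
    for x in range(len(pX)):
        for y in range(len(Q[0])):
            if pX[x] > 0 and Q[x][y] > 0:
                I += pX[x] * Q[x][y] * math.log2(Q[x][y] / pY[y])
    return I

eps = 0.1
Q = [[1 - eps, eps], [eps, 1 - eps]]                        # BSC transition matrix
capacity = max(mutual_information([p, 1 - p], Q) for p in
               [i / 1000 for i in range(1001)])             # brute-force over P_X
print(capacity, 1 - (-eps * math.log2(eps) - (1 - eps) * math.log2(1 - eps)))
# both approximately 0.531 bits/channel use
```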

Proof of the forward part: It suffices to prove the existence of a good block code sequence (satisfying the rate condition, i.e., $\liminf_{n \to \infty} (1/n) \log_2 M_n \geq C - \gamma$ for some $\gamma > 0$) whose average error probability is ultimately less than $\varepsilon$. Since the forward part holds trivially when $C = 0$ by setting $M_n = 1$, we assume in the sequel that $C > 0$.

We will use Shannon's original random coding proof technique in which the good block code sequence is not deterministically constructed; instead, its existence is implicitly proven by showing that for a class (ensemble) of block code sequences $\{\mathcal{C}_n\}_{n=1}^{\infty}$ and a code-selecting distribution $\Pr[\mathcal{C}_n]$ over these block code sequences, the expectation of the average error probability, evaluated under the code-selecting distribution on these block code sequences, can be made smaller than $\varepsilon$ for $n$ sufficiently large:
$$E_{\mathcal{C}_n}[P_e(\mathcal{C}_n)] = \sum_{\mathcal{C}_n} \Pr[\mathcal{C}_n]\, P_e(\mathcal{C}_n) \to 0 \quad \text{as } n \to \infty.$$
Hence, there must exist at least one such desired good code sequence $\{\mathcal{C}_n^*\}_{n=1}^{\infty}$ among them (with $P_e(\mathcal{C}_n^*) \to 0$ as $n \to \infty$).

$^6$(4.3.1) actually implies that $\liminf_{n \to \infty} P_e(\mathcal{C}_n) \geq \lim_{\varepsilon \downarrow 0} (1-\varepsilon)\mu = \mu$, where the error probability lower bound has nothing to do with $\varepsilon$. Here we state the converse of Theorem 4.10 in a form parallel to the converse statements in Theorems 3.5 and 3.13.


Fix $\varepsilon \in (0, 1)$ and some $\gamma$ in $(0, \min\{4\varepsilon, C\})$. Observe that there exists $N_0$ such that for $n > N_0$, we can choose an integer $M_n$ with
$$C - \frac{\gamma}{2} \geq \frac{1}{n} \log_2 M_n > C - \gamma. \qquad (4.3.2)$$
(Since we are only concerned with the case of "sufficiently large $n$," it suffices to consider only those $n$'s satisfying $n > N_0$, and ignore those $n$'s with $n \leq N_0$.)

Define $\delta \triangleq \gamma/8$. Let $P_{\hat{X}}$ be the probability distribution achieving the channel capacity:
$$C \triangleq \max_{P_X} I(P_X, P_{Y|X}) = I(P_{\hat{X}}, P_{Y|X}).$$
Denote by $P_{\hat{Y}^n}$ the channel output distribution due to the channel input product distribution $P_{\hat{X}^n}$ (with $P_{\hat{X}^n}(x^n) = \prod_{i=1}^{n} P_{\hat{X}}(x_i)$), i.e.,
$$P_{\hat{Y}^n}(y^n) = \sum_{x^n \in \mathcal{X}^n} P_{\hat{X}^n,\hat{Y}^n}(x^n, y^n), \quad \text{where} \quad P_{\hat{X}^n,\hat{Y}^n}(x^n, y^n) \triangleq P_{\hat{X}^n}(x^n)\, P_{Y^n|X^n}(y^n|x^n)$$
for all $x^n \in \mathcal{X}^n$ and $y^n \in \mathcal{Y}^n$. Note that since $P_{\hat{X}^n}(x^n) = \prod_{i=1}^{n} P_{\hat{X}}(x_i)$ and the channel is memoryless, the resulting joint input-output process $\{(\hat{X}_i, \hat{Y}_i)\}_{i=1}^{\infty}$ is also memoryless with
$$P_{\hat{X}^n,\hat{Y}^n}(x^n, y^n) = \prod_{i=1}^{n} P_{\hat{X},\hat{Y}}(x_i, y_i) \quad \text{and} \quad P_{\hat{X},\hat{Y}}(x, y) = P_{\hat{X}}(x)\, P_{Y|X}(y|x) \quad \text{for } x \in \mathcal{X},\ y \in \mathcal{Y}.$$

We next present the proof in three steps.

Step 1: Code construction.

For any blocklength $n$, independently select $M_n$ channel inputs with replacement$^7$ from $\mathcal{X}^n$ according to the distribution $P_{\hat{X}^n}(x^n)$. For the selected $M_n$ channel inputs yielding codebook $\mathcal{C}_n \triangleq \{c_1, c_2, \ldots, c_{M_n}\}$, define the encoder $f_n(\cdot)$ and decoder $g_n(\cdot)$, respectively, as follows:
$$f_n(m) = c_m \quad \text{for } 1 \leq m \leq M_n,$$
and
$$g_n(y^n) = \begin{cases} m, & \text{if } c_m \text{ is the only codeword in } \mathcal{C}_n \text{ satisfying } (c_m, y^n) \in F_n(\delta); \\ \text{any one in } \{1, 2, \ldots, M_n\}, & \text{otherwise}, \end{cases}$$
where $F_n(\delta)$ is defined in Definition 4.7 with respect to the distribution $P_{\hat{X}^n,\hat{Y}^n}$. (We evidently assume that the codebook $\mathcal{C}_n$ and the channel distribution $P_{Y|X}$ are known at both the encoder and the decoder.) Hence, the code $\mathcal{C}_n$ operates as follows. A message $W$ is chosen according to the uniform distribution from the set of messages. The encoder $f_n$ then transmits the $W$th codeword $c_W$ in $\mathcal{C}_n$ over the channel. Then $Y^n$ is received at the channel output and the decoder guesses the sent message via $\hat{W} = g_n(Y^n)$.

Note that there are a total of $|\mathcal{X}|^{n M_n}$ possible randomly generated codebooks $\mathcal{C}_n$ and the probability of selecting each codebook is given by
$$\Pr[\mathcal{C}_n] = \prod_{m=1}^{M_n} P_{\hat{X}^n}(c_m).$$

$^7$Here, the channel inputs are selected with replacement; i.e., it is possible and acceptable that all the selected $M_n$ channel inputs are identical.

Step 2: Conditional error probability.

For each (randomly generated) data transmission code $\mathcal{C}_n$, the conditional probability of error given that message $m$ was sent, $\lambda_m(\mathcal{C}_n)$, can be upper bounded by:
$$\lambda_m(\mathcal{C}_n) \leq \sum_{y^n \in \mathcal{Y}^n :\, (c_m, y^n) \notin F_n(\delta)} P_{Y^n|X^n}(y^n|c_m) + \sum_{\substack{m'=1 \\ m' \neq m}}^{M_n} \sum_{y^n \in \mathcal{Y}^n :\, (c_{m'}, y^n) \in F_n(\delta)} P_{Y^n|X^n}(y^n|c_m), \qquad (4.3.3)$$
where the first term in (4.3.3) considers the case that the received channel output $y^n$ is not jointly δ-typical with $c_m$ (and hence, the decoding rule $g_n(\cdot)$ would possibly result in a wrong guess), and the second term in (4.3.3) reflects the situation when $y^n$ is jointly δ-typical with not only the transmitted codeword $c_m$, but also with another codeword $c_{m'}$ (which may cause a decoding error).

By taking expectation in (4.3.3) with respect to the $m$th codeword-selecting distribution $P_{\hat{X}^n}(c_m)$, we obtain
\begin{align}
\sum_{c_m \in \mathcal{X}^n} P_{\hat{X}^n}(c_m)\, \lambda_m(\mathcal{C}_n) &\leq \sum_{c_m \in \mathcal{X}^n} \sum_{y^n \notin F_n(\delta|c_m)} P_{\hat{X}^n}(c_m)\, P_{Y^n|X^n}(y^n|c_m) \\
&\quad + \sum_{c_m \in \mathcal{X}^n} \sum_{\substack{m'=1 \\ m' \neq m}}^{M_n} \sum_{y^n \in F_n(\delta|c_{m'})} P_{\hat{X}^n}(c_m)\, P_{Y^n|X^n}(y^n|c_m) \\
&= P_{\hat{X}^n,\hat{Y}^n}\left(F_n^c(\delta)\right) + \sum_{\substack{m'=1 \\ m' \neq m}}^{M_n} \sum_{c_m \in \mathcal{X}^n} \sum_{y^n \in F_n(\delta|c_{m'})} P_{\hat{X}^n,\hat{Y}^n}(c_m, y^n), \qquad (4.3.4)
\end{align}
where
$$F_n(\delta|x^n) \triangleq \{ y^n \in \mathcal{Y}^n : (x^n, y^n) \in F_n(\delta) \}.$$

Step 3: Average error probability.

We now can analyze the expectation of the average error probability

    E_{C̃n}[Pe(C̃n)]

over the ensemble of all codebooks C̃n generated at random according to Pr[C̃n] and show that it asymptotically vanishes as n grows without bound. We obtain the following series of inequalities:

    E_{C̃n}[Pe(C̃n)] = ∑_{C̃n} Pr[C̃n] Pe(C̃n)

    = ∑_{c1 ∈ X^n} · · · ∑_{cMn ∈ X^n} PX^n(c1) · · · PX^n(cMn) ( (1/Mn) ∑_{m=1}^{Mn} λm(C̃n) )

    = (1/Mn) ∑_{m=1}^{Mn} ∑_{c1 ∈ X^n} · · · ∑_{cm−1 ∈ X^n} ∑_{cm+1 ∈ X^n} · · · ∑_{cMn ∈ X^n}
          PX^n(c1) · · · PX^n(cm−1) PX^n(cm+1) · · · PX^n(cMn) × ( ∑_{cm ∈ X^n} PX^n(cm) λm(C̃n) )

    ≤ (1/Mn) ∑_{m=1}^{Mn} ∑_{c1 ∈ X^n} · · · ∑_{cm−1 ∈ X^n} ∑_{cm+1 ∈ X^n} · · · ∑_{cMn ∈ X^n}
          PX^n(c1) · · · PX^n(cm−1) PX^n(cm+1) · · · PX^n(cMn) × PX^n,Y^n(Fn^c(δ))

      + (1/Mn) ∑_{m=1}^{Mn} ∑_{c1 ∈ X^n} · · · ∑_{cm−1 ∈ X^n} ∑_{cm+1 ∈ X^n} · · · ∑_{cMn ∈ X^n}
          PX^n(c1) · · · PX^n(cm−1) PX^n(cm+1) · · · PX^n(cMn)
          × ∑_{m′=1, m′≠m}^{Mn} ∑_{cm ∈ X^n} ∑_{y^n ∈ Fn(δ|cm′)} PX^n,Y^n(cm, y^n)    (4.3.5)

    = PX^n,Y^n(Fn^c(δ))
      + (1/Mn) ∑_{m=1}^{Mn} ∑_{m′=1, m′≠m}^{Mn} ∑_{c1 ∈ X^n} · · · ∑_{cm−1 ∈ X^n} ∑_{cm+1 ∈ X^n} · · · ∑_{cMn ∈ X^n}
          PX^n(c1) · · · PX^n(cm−1) PX^n(cm+1) · · · PX^n(cMn)
          × ∑_{cm ∈ X^n} ∑_{y^n ∈ Fn(δ|cm′)} PX^n,Y^n(cm, y^n),

where (4.3.5) follows from (4.3.4), and the last step holds since PX^n,Y^n(Fn^c(δ)) is a constant independent of c1, . . . , cMn and m. Observe that for n > N0,

    ∑_{m′=1, m′≠m}^{Mn} ∑_{c1 ∈ X^n} · · · ∑_{cm−1 ∈ X^n} ∑_{cm+1 ∈ X^n} · · · ∑_{cMn ∈ X^n}
          PX^n(c1) · · · PX^n(cm−1) PX^n(cm+1) · · · PX^n(cMn)
          × ∑_{cm ∈ X^n} ∑_{y^n ∈ Fn(δ|cm′)} PX^n,Y^n(cm, y^n)

    = ∑_{m′=1, m′≠m}^{Mn} ∑_{cm ∈ X^n} ∑_{cm′ ∈ X^n} ∑_{y^n ∈ Fn(δ|cm′)} PX^n(cm′) PX^n,Y^n(cm, y^n)

    = ∑_{m′=1, m′≠m}^{Mn} ∑_{cm′ ∈ X^n} ∑_{y^n ∈ Fn(δ|cm′)} PX^n(cm′) ( ∑_{cm ∈ X^n} PX^n,Y^n(cm, y^n) )

    = ∑_{m′=1, m′≠m}^{Mn} ∑_{cm′ ∈ X^n} ∑_{y^n ∈ Fn(δ|cm′)} PX^n(cm′) PY^n(y^n)

    = ∑_{m′=1, m′≠m}^{Mn} ∑_{(cm′, y^n) ∈ Fn(δ)} PX^n(cm′) PY^n(y^n)

    ≤ ∑_{m′=1, m′≠m}^{Mn} |Fn(δ)| 2^{−n(H(X)−δ)} 2^{−n(H(Y)−δ)}

    ≤ ∑_{m′=1, m′≠m}^{Mn} 2^{n(H(X,Y)+δ)} 2^{−n(H(X)−δ)} 2^{−n(H(Y)−δ)}

    = (Mn − 1) 2^{n(H(X,Y)+δ)} 2^{−n(H(X)−δ)} 2^{−n(H(Y)−δ)}

    < Mn · 2^{n(H(X,Y)+δ)} 2^{−n(H(X)−δ)} 2^{−n(H(Y)−δ)}

    ≤ 2^{n(C−4δ)} · 2^{−n(I(X;Y)−3δ)} = 2^{−nδ},

where the first inequality follows from the definition of the jointly typical set Fn(δ), the second inequality holds by the Shannon-McMillan theorem for pairs (Theorem 4.9), and the last inequality follows since C = I(X;Y) by definition of X and Y, and since (1/n) log2 Mn ≤ C − (γ/2) = C − 4δ. Consequently,

    E_{C̃n}[Pe(C̃n)] ≤ PX^n,Y^n(Fn^c(δ)) + 2^{−nδ},

which for sufficiently large n (and n > N0) can be made smaller than 2δ = γ/4 < ε by the Shannon-McMillan theorem for pairs.

Before proving the converse part of the channel coding theorem, let us recall Fano's inequality in a channel coding context. Consider an (n, Mn) channel block code C̃n with encoding and decoding functions given by

    fn : {1, 2, · · · , Mn} → X^n

and

    gn : Y^n → {1, 2, · · · , Mn},

respectively. Let message W, which is uniformly distributed over the set of messages {1, 2, · · · , Mn}, be sent via codeword X^n(W) = fn(W) over the DMC, and let Y^n be received at the channel output. At the receiver, the decoder estimates the sent message via Ŵ = gn(Y^n), and the probability of estimation error is given by the code's average error probability:

    Pr[Ŵ ≠ W] = Pe(C̃n)

since W is uniformly distributed. Then Fano's inequality (2.5.2) yields

    H(W|Y^n) ≤ 1 + Pe(C̃n) log2(Mn − 1)
             < 1 + Pe(C̃n) log2 Mn.    (4.3.6)

We next proceed with the proof of the converse part.

Proof of the converse part: For any (n, Mn) block channel code C̃n as described above, we have that W → X^n → Y^n form a Markov chain; we thus obtain by the data processing inequality that

    I(W; Y^n) ≤ I(X^n; Y^n).    (4.3.7)

We can also upper bound I(X^n; Y^n) in terms of the channel capacity C as follows:

    I(X^n; Y^n) ≤ max_{PX^n} I(X^n; Y^n)
                ≤ max_{PX^n} ∑_{i=1}^{n} I(Xi; Yi)    (by Theorem 2.21)
                ≤ ∑_{i=1}^{n} max_{PX^n} I(Xi; Yi)
                = ∑_{i=1}^{n} max_{PXi} I(Xi; Yi)
                = nC.    (4.3.8)

Consequently, code C̃n satisfies the following:

    log2 Mn = H(W)    (since W is uniformly distributed)
            = H(W|Y^n) + I(W; Y^n)
            ≤ H(W|Y^n) + I(X^n; Y^n)    (by (4.3.7))
            ≤ H(W|Y^n) + nC    (by (4.3.8))
            < 1 + Pe(C̃n) · log2 Mn + nC.    (by (4.3.6))

This implies that

    Pe(C̃n) > 1 − C/((1/n) log2 Mn) − 1/(log2 Mn) = 1 − (C + 1/n)/((1/n) log2 Mn).

So if lim inf_{n→∞} (1/n) log2 Mn = C/(1 − µ), then for any 0 < ε < 1, there exists an integer N such that for n ≥ N,

    (1/n) log2 Mn ≥ (C + 1/n)/(1 − (1 − ε)µ),    (4.3.9)

because otherwise (4.3.9) would be violated for infinitely many n, implying the contradiction

    lim inf_{n→∞} (1/n) log2 Mn ≤ lim inf_{n→∞} (C + 1/n)/(1 − (1 − ε)µ) = C/(1 − (1 − ε)µ) < C/(1 − µ).

Hence, for n ≥ N,

    Pe(C̃n) > 1 − [1 − (1 − ε)µ] (C + 1/n)/(C + 1/n) = (1 − ε)µ > 0;

i.e., Pe(C̃n) is bounded away from zero for n sufficiently large.

Observation 4.11 The results of the above channel coding theorem are illustrated in Figure 4.5,^8 where R = lim inf_{n→∞} (1/n) log2 Mn (measured in message bits/channel use) is usually called the ultimate (or asymptotic) coding rate of channel block codes. As indicated in the figure, the ultimate rate of any good block code for the DMC must be smaller than or equal to the channel capacity C.^9 Conversely, any block code with (ultimate) rate greater than C will have its probability of error bounded away from zero. Thus for a DMC, its capacity C is the supremum of all "achievable" channel block coding rates; i.e., it is the supremum of all rates for which there exists a sequence of channel block codes with asymptotically vanishing (as the blocklength goes to infinity) probability of error.

[Figure 4.5: Ultimate channel coding rate R versus channel capacity C and behavior of the probability of error as the blocklength n goes to infinity for a DMC: lim sup_{n→∞} Pe = 0 for the best channel block code when R < C, and lim sup_{n→∞} Pe > 0 for all channel block codes when R > C.]

8 Note that Theorem 4.10 actually indicates lim_{n→∞} Pe = 0 for R < C and lim inf_{n→∞} Pe > 0 for R > C, rather than the behavior of lim sup_{n→∞} Pe indicated in Figure 4.5; these however only hold for a DMC. For a more general channel, three regions instead of two may result, i.e., R < C, C < R < C̄ and R > C̄, which respectively correspond to lim sup_{n→∞} Pe = 0 for the best block code, lim sup_{n→∞} Pe > 0 but lim inf_{n→∞} Pe = 0 for the best block code, and lim inf_{n→∞} Pe > 0 for all channel block codes, where C̄ is named the optimistic channel capacity. Since C̄ = C for DMCs, the three regions are reduced to two. C̄ will later be introduced in Part II.

9 It can be seen from the theorem that C can be achieved as an ultimate transmission rate as long as (1/n) log2 Mn approaches C from below with increasing n (see (4.3.2)).

Shannon's channel coding theorem, established in 1948 [43], provides the ultimate limit for reliable communication over a noisy channel. However, it does not provide an explicit efficient construction for good codes, since searching for a good code from the ensemble of randomly generated codes is prohibitively complex, as its size grows double-exponentially with the blocklength (see Step 1 of the proof of the forward part). It thus spurred the entire area of coding theory, which flourished over the last 65 years with the aim of constructing powerful error-correcting codes operating close to the capacity limit. Particular advances were made for the class of linear codes (also known as group codes), whose rich^10 yet elegantly simple algebraic structures made them amenable to efficient practically-implementable encoding and decoding. Examples of such codes include Hamming codes, Golay codes, BCH and Reed-Solomon codes, and convolutional codes. In 1993, the so-called Turbo codes were introduced by Berrou et al. [3, 4] and shown experimentally to perform close to the channel capacity limit for the class of memoryless channels. Similar near-capacity achieving linear codes were later established with the re-discovery of Gallager's low-density parity-check codes [17, 18, 33, 34]. Many of these codes are used with increased sophistication in today's ubiquitous communication, information and multimedia technologies. For detailed studies on coding theory, see the following texts [8, 10, 26, 31, 35, 41, 51].

4.4 Calculating channel capacity

Given a DMC with finite input alphabet X, finite output alphabet Y and channel transition matrix Q = [px,y] of size |X| × |Y|, where px,y ≜ PY|X(y|x) for x ∈ X and y ∈ Y, we would like to calculate

    C ≜ max_{PX} I(X;Y),

where the maximization (which is well-defined) is carried over the set of input distributions PX, and I(X;Y) is the mutual information between the channel's input and output.

Note that C can be determined numerically via non-linear optimization techniques – such as the iterative algorithms developed by Arimoto [1] and Blahut [7, 9]; see also [15] and [53, Chap. 9]. In general, there are no closed-form (single-letter) analytical expressions for C. However, for many "simplified" channels, it is possible to analytically determine C under some "symmetry" properties of their channel transition matrix.
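To make the numerical route concrete, the following is a minimal sketch (in Python, assuming NumPy is available) of the Blahut-Arimoto iteration for a DMC specified by its transition matrix Q. The function name and the stopping tolerance are our own illustrative choices, not notation from these notes.

```python
import numpy as np

def blahut_arimoto(Q, tol=1e-12, max_iter=10**4):
    """Estimate C = max_{P_X} I(X;Y) for a DMC with |X| x |Y| transition matrix Q.

    Q[x, y] = P_{Y|X}(y|x); rows must sum to 1.  Returns (capacity in bits, P_X).
    """
    n_inputs = Q.shape[0]
    p = np.full(n_inputs, 1.0 / n_inputs)          # start from the uniform input
    for _ in range(max_iter):
        q_y = p @ Q                                 # output distribution P_Y
        # D(x) = sum_y Q(y|x) log2 [Q(y|x)/P_Y(y)], the mutual information for symbol x
        with np.errstate(divide="ignore", invalid="ignore"):
            ratio = np.where(Q > 0, Q / q_y, 1.0)
            D = np.sum(np.where(Q > 0, Q * np.log2(ratio), 0.0), axis=1)
        new_p = p * np.exp2(D)
        new_p /= new_p.sum()
        # Standard capacity bounds give a stopping criterion: sum_x p(x)D(x) <= C <= max_x D(x).
        if np.max(D) - p @ D < tol:
            p = new_p
            break
        p = new_p
    q_y = p @ Q
    ratio = np.where(Q > 0, Q / q_y, 1.0)
    C = float(np.sum(p[:, None] * np.where(Q > 0, Q * np.log2(ratio), 0.0)))
    return C, p

if __name__ == "__main__":
    eps = 0.1                                       # BSC with crossover probability 0.1
    Q = np.array([[1 - eps, eps], [eps, 1 - eps]])
    C, p = blahut_arimoto(Q)
    print(C)   # ~ 1 - h_b(0.1) = 0.531 bits/channel use, with p close to uniform
```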

10 Indeed, there exist linear codes that can achieve the capacity of memoryless channels with additive noise (e.g., see [14, p. 114]). Such channels include the BSC and the q-ary symmetric channel.

4.4.1 Symmetric, weakly-symmetric and quasi-symmetric channels

Definition 4.12 A DMC with finite input alphabet X, finite output alphabet Y and channel transition matrix Q = [px,y] of size |X| × |Y| is said to be symmetric if the rows of Q are permutations of each other and the columns of Q are permutations of each other. The channel is said to be weakly-symmetric if the rows of Q are permutations of each other and all the column sums in Q are equal.

It directly follows from the definition that symmetry implies weak-symmetry. Examples of symmetric DMCs include the BSC, the q-ary symmetric channel and the following ternary channel with X = Y = {0, 1, 2} and transition matrix

    Q = [ PY|X(0|0)  PY|X(1|0)  PY|X(2|0) ]   [ 0.4  0.1  0.5 ]
        [ PY|X(0|1)  PY|X(1|1)  PY|X(2|1) ] = [ 0.5  0.4  0.1 ]
        [ PY|X(0|2)  PY|X(1|2)  PY|X(2|2) ]   [ 0.1  0.5  0.4 ].

The following DMC with |X| = |Y| = 4 and

    Q = [ 0.5  0.25  0.25  0   ]
        [ 0.5  0.25  0.25  0   ]
        [ 0    0.25  0.25  0.5 ]
        [ 0    0.25  0.25  0.5 ]    (4.4.1)

is weakly-symmetric (but not symmetric). Noting that all the above channels involve square transition matrices, we emphasize that Q can be rectangular while satisfying the symmetry or weak-symmetry properties. For example, the DMC with |X| = 2, |Y| = 4 and

    Q = [ (1−ε)/2   ε/2      (1−ε)/2   ε/2     ]
        [ ε/2       (1−ε)/2  ε/2       (1−ε)/2 ]    (4.4.2)

is symmetric (where ε ∈ [0, 1]), while the DMC with |X| = 2, |Y| = 3 and

    Q = [ 1/3  1/6  1/2 ]
        [ 1/3  1/2  1/6 ]

is weakly-symmetric.

Lemma 4.13 The capacity of a weakly-symmetric channel Q is achieved by a uniform input distribution and is given by

    C = log2 |Y| − H(q1, q2, · · · , q|Y|),    (4.4.3)

where (q1, q2, · · · , q|Y|) denotes any row of Q and

    H(q1, q2, · · · , q|Y|) ≜ − ∑_{i=1}^{|Y|} qi log2 qi

is the row entropy.

Proof: The mutual information between the channel's input and output is given by

    I(X;Y) = H(Y) − H(Y|X) = H(Y) − ∑_{x∈X} PX(x) H(Y|X = x),

where H(Y|X = x) = − ∑_{y∈Y} PY|X(y|x) log2 PY|X(y|x) = − ∑_{y∈Y} px,y log2 px,y.

Noting that every row of Q is a permutation of every other row, we obtain that H(Y|X = x) is independent of x and can be written as

    H(Y|X = x) = H(q1, q2, · · · , q|Y|),

where (q1, q2, · · · , q|Y|) is any row of Q. Thus

    H(Y|X) = ∑_{x∈X} PX(x) H(q1, q2, · · · , q|Y|) = H(q1, q2, · · · , q|Y|) ( ∑_{x∈X} PX(x) ) = H(q1, q2, · · · , q|Y|).

This implies

    I(X;Y) = H(Y) − H(q1, q2, · · · , q|Y|) ≤ log2 |Y| − H(q1, q2, · · · , q|Y|),

with equality achieved iff Y is uniformly distributed over Y. We next show that choosing a uniform input distribution, PX(x) = 1/|X| for all x ∈ X, yields a uniform output distribution, hence maximizing mutual information. Indeed, under a uniform input distribution, we obtain that for any y ∈ Y,

    PY(y) = ∑_{x∈X} PX(x) PY|X(y|x) = (1/|X|) ∑_{x∈X} px,y = A/|X|,

where A ≜ ∑_{x∈X} px,y is a constant given by the sum of the entries in any column of Q, since by the weak-symmetry property all column sums in Q are identical. Noting that ∑_{y∈Y} PY(y) = 1 yields

    ∑_{y∈Y} A/|X| = 1,

and hence

    A = |X|/|Y|.    (4.4.4)

Accordingly,

    PY(y) = A/|X| = (|X|/|Y|)(1/|X|) = 1/|Y|

for any y ∈ Y; thus the uniform input distribution induces a uniform output distribution and achieves channel capacity as given by (4.4.3).

Observation 4.14 Note that if the weakly-symmetric channel has a square (i.e., with |X| = |Y|) transition matrix Q, then Q is a doubly-stochastic matrix; i.e., both its row sums and its column sums are equal to 1. Note however that having a square transition matrix does not necessarily make a weakly-symmetric channel symmetric; e.g., see (4.4.1).

Example 4.15 (Capacity of the BSC) Since the BSC with crossover probability (or bit error rate) ε is symmetric, we directly obtain from Lemma 4.13 that its capacity is achieved by a uniform input distribution and is given by

    C = log2(2) − H(1 − ε, ε) = 1 − hb(ε),    (4.4.5)

where hb(·) is the binary entropy function.

Example 4.16 (Capacity of the q-ary symmetric channel) Similarly, the q-ary symmetric channel with symbol error rate ε described in (4.2.8) is symmetric; hence, by Lemma 4.13, its capacity is given by

    C = log2 q − H(1 − ε, ε/(q − 1), · · · , ε/(q − 1)) = log2 q + ε log2 (ε/(q − 1)) + (1 − ε) log2(1 − ε).

Note that when q = 2, the channel capacity is equal to that of the BSC, as expected. Furthermore, when ε = 0, the channel reduces to the identity (noiseless) q-ary channel and its capacity is given by C = log2 q.
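As a quick numerical illustration of Lemma 4.13, the short sketch below (Python with NumPy; the helper names are ours) evaluates the weakly-symmetric capacity formula C = log2 |Y| − H(row) and checks it against the closed forms of Examples 4.15 and 4.16.

```python
import numpy as np

def row_entropy(row):
    """H(q1,...,q_|Y|) in bits, ignoring zero entries."""
    row = np.asarray(row, dtype=float)
    nz = row[row > 0]
    return float(-np.sum(nz * np.log2(nz)))

def weakly_symmetric_capacity(Q):
    """C = log2|Y| - H(any row of Q), valid for (weakly-)symmetric channels."""
    Q = np.asarray(Q, dtype=float)
    return float(np.log2(Q.shape[1]) - row_entropy(Q[0]))

eps = 0.1
# BSC: capacity equals 1 - h_b(eps)
Q_bsc = [[1 - eps, eps], [eps, 1 - eps]]
print(weakly_symmetric_capacity(Q_bsc))          # ~ 0.531

# q-ary symmetric channel with q = 4
q = 4
Q_qsc = [[1 - eps if x == y else eps / (q - 1) for y in range(q)] for x in range(q)]
print(weakly_symmetric_capacity(Q_qsc))          # log2 q + eps*log2(eps/(q-1)) + (1-eps)*log2(1-eps)

# Weakly-symmetric (but not symmetric) example from (4.4.1)
Q_ws = [[0.5, 0.25, 0.25, 0.0],
        [0.5, 0.25, 0.25, 0.0],
        [0.0, 0.25, 0.25, 0.5],
        [0.0, 0.25, 0.25, 0.5]]
print(weakly_symmetric_capacity(Q_ws))           # = 2 - H(0.5, 0.25, 0.25, 0) = 0.5 bits
```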

We next note that one can further weaken the weak-symmetry property and define a class of "quasi-symmetric" channels for which the uniform input distribution still achieves capacity and yields a simple closed-form formula for capacity.

Definition 4.17 A DMC with finite input alphabet X, finite output alphabet Y and channel transition matrix Q = [px,y] of size |X| × |Y| is said to be quasi-symmetric^11 if Q can be partitioned along its columns into m weakly-symmetric sub-matrices Q1, Q2, · · · , Qm for some integer m ≥ 1, where each sub-matrix Qi has size |X| × |Yi| for i = 1, 2, · · · , m, with Y1 ∪ · · · ∪ Ym = Y and Yi ∩ Yj = ∅ for all i ≠ j, i, j = 1, 2, · · · , m.

11 This notion of "quasi-symmetry" is slightly more general than Gallager's notion [19, p. 94], as we herein allow each sub-matrix to be weakly-symmetric (instead of symmetric as in [19]).

Hence, quasi-symmetry is our weakest symmetry notion, since a weakly-symmetric channel is clearly quasi-symmetric (just set m = 1 in the above definition); we thus have: symmetry =⇒ weak-symmetry =⇒ quasi-symmetry.

Lemma 4.18 The capacity of a quasi-symmetric channel Q as defined above is achieved by a uniform input distribution and is given by

    C = ∑_{i=1}^{m} ai Ci,    (4.4.6)

where

    ai ≜ ∑_{y∈Yi} px,y = sum of any row in Qi,    i = 1, · · · , m,

and

    Ci = log2 |Yi| − H(any row in the matrix (1/ai) Qi),    i = 1, · · · , m,

is the capacity of the ith weakly-symmetric "sub-channel" whose transition matrix is obtained by multiplying each entry of Qi by 1/ai (this normalization renders sub-matrix Qi into a stochastic matrix and hence a channel transition matrix).

Proof: We first observe that for each i = 1, · · · , m, ai is independent of the input value x, since sub-matrix i is weakly-symmetric (so any row in Qi is a permutation of any other row); and hence ai is the sum of any row in Qi.

For each i = 1, · · · , m, define

    PYi|X(y|x) ≜ px,y/ai  if y ∈ Yi and x ∈ X;  0 otherwise,

where Yi is a random variable taking values in Yi. It can be easily verified that PYi|X(y|x) is a legitimate conditional distribution. Thus [PYi|X(y|x)] = (1/ai) Qi is the transition matrix of the weakly-symmetric "sub-channel" i with input alphabet X and output alphabet Yi. Let I(X;Yi) denote its mutual information. Since each such sub-channel i is weakly-symmetric, we know that its capacity Ci is given by

    Ci = max_{PX} I(X;Yi) = log2 |Yi| − H(any row in the matrix (1/ai) Qi),

where the maximum is achieved by a uniform input distribution.

Now, the mutual information between the input and the output of our original quasi-symmetric channel Q can be written as

    I(X;Y) = ∑_{y∈Y} ∑_{x∈X} PX(x) px,y log2 [ px,y / ∑_{x′∈X} PX(x′) px′,y ]

           = ∑_{i=1}^{m} ∑_{y∈Yi} ∑_{x∈X} ai PX(x) (px,y/ai) log2 [ (px,y/ai) / ∑_{x′∈X} PX(x′) (px′,y/ai) ]

           = ∑_{i=1}^{m} ai ∑_{y∈Yi} ∑_{x∈X} PX(x) PYi|X(y|x) log2 [ PYi|X(y|x) / ∑_{x′∈X} PX(x′) PYi|X(y|x′) ]

           = ∑_{i=1}^{m} ai I(X;Yi).

Therefore, the capacity of channel Q is

    C = max_{PX} I(X;Y)
      = max_{PX} ∑_{i=1}^{m} ai I(X;Yi)
      = ∑_{i=1}^{m} ai max_{PX} I(X;Yi)    (as the same uniform PX maximizes each I(X;Yi))
      = ∑_{i=1}^{m} ai Ci.

Example 4.19 (Capacity of the BEC) The BEC with erasure probability α as given in (4.2.5) is quasi-symmetric (but neither weakly-symmetric nor symmetric). Indeed, its transition matrix Q can be partitioned along its columns into two symmetric (hence weakly-symmetric) sub-matrices

    Q1 = [ 1 − α   0     ]
         [ 0       1 − α ]

and

    Q2 = [ α ]
         [ α ].

Thus applying the capacity formula for quasi-symmetric channels of Lemma 4.18 yields that the capacity of the BEC is given by

    C = a1 C1 + a2 C2,

where a1 = 1 − α, a2 = α,

    C1 = log2(2) − H( (1 − α)/(1 − α), 0/(1 − α) ) = 1 − H(1, 0) = 1 − 0 = 1,

and

    C2 = log2(1) − H(α/α) = 0 − 0 = 0.

Therefore, the BEC capacity is given by

    C = (1 − α)(1) + (α)(0) = 1 − α.    (4.4.7)

Example 4.20 (Capacity of the BSEC) Similarly, the BSEC with crossover probability ε and erasure probability α as described in (4.2.6) is quasi-symmetric; its transition matrix can be partitioned along its columns into two symmetric sub-matrices

    Q1 = [ 1 − ε − α   ε         ]
         [ ε           1 − ε − α ]

and

    Q2 = [ α ]
         [ α ].

Hence by Lemma 4.18, the channel capacity is given by C = a1 C1 + a2 C2, where a1 = 1 − α, a2 = α,

    C1 = log2(2) − H( (1 − ε − α)/(1 − α), ε/(1 − α) ) = 1 − hb( (1 − ε − α)/(1 − α) ),

and

    C2 = log2(1) − H(α/α) = 0.

We thus obtain that

    C = (1 − α) [ 1 − hb( (1 − ε − α)/(1 − α) ) ] + (α)(0)
      = (1 − α) [ 1 − hb( (1 − ε − α)/(1 − α) ) ].    (4.4.8)

As already noted, the BSEC is a combination of the BSC with bit error rate ε and the BEC with erasure probability α. Indeed, setting α = 0 in (4.4.8) yields that C = 1 − hb(1 − ε) = 1 − hb(ε), which is the BSC capacity. Furthermore, setting ε = 0 results in C = 1 − α, the BEC capacity.
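The quasi-symmetric decomposition of Lemma 4.18 is easy to automate once the column partition is supplied by hand. The sketch below (Python/NumPy; the function name and the chosen partitions are our own illustration) reproduces the BEC and BSEC capacities (4.4.7) and (4.4.8).

```python
import numpy as np

def entropy_bits(p):
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(-np.sum(nz * np.log2(nz)))

def quasi_symmetric_capacity(Q, partition):
    """C = sum_i a_i * C_i for a quasi-symmetric channel.

    Q:         |X| x |Y| transition matrix.
    partition: list of column-index lists, each giving a weakly-symmetric sub-matrix Q_i.
    """
    Q = np.asarray(Q, dtype=float)
    C = 0.0
    for cols in partition:
        Qi = Q[:, cols]
        a_i = float(Qi[0].sum())                 # same for every row, by weak symmetry
        C_i = np.log2(Qi.shape[1]) - entropy_bits(Qi[0] / a_i)
        C += a_i * C_i
    return C

alpha, eps = 0.2, 0.05
# BEC: outputs ordered as (0, E, 1); the erasure column has index 1.
Q_bec = [[1 - alpha, alpha, 0.0], [0.0, alpha, 1 - alpha]]
print(quasi_symmetric_capacity(Q_bec, [[0, 2], [1]]))    # = 1 - alpha = 0.8

# BSEC: outputs (0, E, 1) with crossover eps and erasure alpha.
Q_bsec = [[1 - eps - alpha, alpha, eps], [eps, alpha, 1 - eps - alpha]]
h_b = lambda p: entropy_bits([p, 1 - p])
print(quasi_symmetric_capacity(Q_bsec, [[0, 2], [1]]))   # matches (4.4.8)
print((1 - alpha) * (1 - h_b((1 - eps - alpha) / (1 - alpha))))
```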


4.4.2 Karush-Kuhn-Tucker condition for channel capacity

When the channel does not satisfy any symmetry property, the following necessary and sufficient Karush-Kuhn-Tucker (KKT) condition (e.g., cf. Appendix B.10, [19, pp. 87-91] or [5, 11]) for calculating channel capacity can be quite useful.

Definition 4.21 (Mutual information for a specific input symbol) The mutual information for a specific input symbol is defined as

    I(x;Y) ≜ ∑_{y∈Y} PY|X(y|x) log2 [ PY|X(y|x) / PY(y) ].

From the above definition, the mutual information becomes

    I(X;Y) = ∑_{x∈X} PX(x) ∑_{y∈Y} PY|X(y|x) log2 [ PY|X(y|x) / PY(y) ] = ∑_{x∈X} PX(x) I(x;Y).

Lemma 4.22 (KKT condition for channel capacity) For a given DMC, an input distribution PX achieves its channel capacity iff there exists a constant C such that

    I(x;Y) = C  for all x ∈ X with PX(x) > 0;
    I(x;Y) ≤ C  for all x ∈ X with PX(x) = 0.    (4.4.9)

Furthermore, the constant C is the channel capacity (justifying the choice of notation).

Proof: The forward (if) part holds directly; hence, we only prove the converse (only-if) part.

Without loss of generality, we assume that PX(x) < 1 for all x ∈ X, since PX(x) = 1 for some x implies that I(X;Y) = 0. The problem of calculating the channel capacity is to maximize

    I(X;Y) = ∑_{x∈X} ∑_{y∈Y} PX(x) PY|X(y|x) log2 [ PY|X(y|x) / ∑_{x′∈X} PX(x′) PY|X(y|x′) ],    (4.4.10)

subject to the condition

    ∑_{x∈X} PX(x) = 1    (4.4.11)

for a given channel distribution PY|X. By using the Lagrange multiplier method (e.g., see Appendix B.10 or [5]), maximizing (4.4.10) subject to (4.4.11) is equivalent to maximizing

    f(PX) ≜ ∑_{x∈X} ∑_{y∈Y} PX(x) PY|X(y|x) log2 [ PY|X(y|x) / ∑_{x′∈X} PX(x′) PY|X(y|x′) ] + λ ( ∑_{x∈X} PX(x) − 1 ).

We then take the derivative of the above quantity with respect to PX(x′′) and obtain that^12

    ∂f(PX)/∂PX(x′′) = I(x′′;Y) − log2(e) + λ.

By Property 2 of Lemma 2.46, I(X;Y) = I(PX, PY|X) is a concave function in PX (for a fixed PY|X). Therefore, the maximum of I(PX, PY|X) occurs for a zero derivative when PX(x) does not lie on the boundary, namely when 1 > PX(x) > 0. For those PX(x) lying on the boundary, i.e., PX(x) = 0, the maximum occurs iff a displacement from the boundary to the interior decreases the quantity, which implies a non-positive derivative, namely

    I(x;Y) ≤ −λ + log2(e)  for those x with PX(x) = 0.

To summarize, if an input distribution PX achieves the channel capacity, then

    I(x′′;Y) = −λ + log2(e)  for PX(x′′) > 0;
    I(x′′;Y) ≤ −λ + log2(e)  for PX(x′′) = 0,

for some λ. With the above result, setting C = −λ + log2(e) yields (4.4.9). Finally, multiplying both sides of each equation in (4.4.9) by PX(x) and summing over x yields max_{PX} I(X;Y) on the left and the constant C on the right, thus proving that the constant C is indeed the channel's capacity.

12 The details for taking the derivative are as follows:

    ∂/∂PX(x′′) [ ∑_{x∈X} ∑_{y∈Y} PX(x) PY|X(y|x) log2 PY|X(y|x)
                 − ∑_{x∈X} ∑_{y∈Y} PX(x) PY|X(y|x) log2 ( ∑_{x′∈X} PX(x′) PY|X(y|x′) )
                 + λ ( ∑_{x∈X} PX(x) − 1 ) ]

    = ∑_{y∈Y} PY|X(y|x′′) log2 PY|X(y|x′′) − ∑_{y∈Y} PY|X(y|x′′) log2 ( ∑_{x′∈X} PX(x′) PY|X(y|x′) )
      − log2(e) ∑_{x∈X} ∑_{y∈Y} PX(x) PY|X(y|x) [ PY|X(y|x′′) / ∑_{x′∈X} PX(x′) PY|X(y|x′) ] + λ

    = I(x′′;Y) − log2(e) ∑_{y∈Y} [ ∑_{x∈X} PX(x) PY|X(y|x) ] [ PY|X(y|x′′) / ∑_{x′∈X} PX(x′) PY|X(y|x′) ] + λ

    = I(x′′;Y) − log2(e) ∑_{y∈Y} PY|X(y|x′′) + λ

    = I(x′′;Y) − log2(e) + λ.

Example 4.23 (Quasi-symmetric channels) For a quasi-symmetric channel, one can directly verify that the uniform input distribution satisfies the KKT condition of Lemma 4.22 and yields that the channel capacity is given by (4.4.6); this is left as an exercise. As we already saw, the BSC, the q-ary symmetric channel, the BEC and the BSEC are all quasi-symmetric.

Example 4.24 Consider a DMC with ternary input alphabet X = {0, 1, 2}, binary output alphabet Y = {0, 1} and the following transition matrix:

    Q = [ 1    0   ]
        [ 1/2  1/2 ]
        [ 0    1   ].

This channel is not quasi-symmetric. However, one may guess that the capacity of this channel is achieved by the input distribution (PX(0), PX(1), PX(2)) = (1/2, 0, 1/2), since the input x = 1 has an equal conditional probability of being received as 0 or 1 at the output. Under this input distribution, we obtain that I(x = 0;Y) = I(x = 2;Y) = 1 and that I(x = 1;Y) = 0. Thus the KKT condition of (4.4.9) is satisfied; hence confirming that the above input distribution achieves channel capacity and that the channel capacity is equal to 1 bit.
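The KKT check in Example 4.24 is also easy to carry out numerically; the short sketch below (Python/NumPy, helper name ours) computes I(x;Y) for each input symbol under the guessed distribution and verifies (4.4.9).

```python
import numpy as np

def per_symbol_information(Q, p_x):
    """Return the vector of I(x;Y) (in bits) for a DMC Q under input distribution p_x."""
    Q, p_x = np.asarray(Q, float), np.asarray(p_x, float)
    p_y = p_x @ Q
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(Q > 0, Q * np.log2(Q / p_y), 0.0)
    return terms.sum(axis=1)

Q = np.array([[1.0, 0.0],
              [0.5, 0.5],
              [0.0, 1.0]])
p_x = np.array([0.5, 0.0, 0.5])

I_x = per_symbol_information(Q, p_x)
print(I_x)                      # [1.0, 0.0, 1.0]: I(x;Y)=C=1 where p_x>0, and <= C where p_x=0
print(float(p_x @ I_x))         # channel capacity C = 1 bit
```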

Observation 4.25 (Capacity achieved by a uniform input distribution) We close this chapter by noting that there is a class of DMCs that is larger than that of quasi-symmetric channels for which the uniform input distribution achieves capacity. It concerns the class of so-called "T-symmetric" channels [40, Section V, Definition 1], for which

    T(x) ≜ I(x;Y) − log2 |X| = ∑_{y∈Y} PY|X(y|x) log2 [ PY|X(y|x) / ∑_{x′∈X} PY|X(y|x′) ]

is a constant function of x (i.e., independent of x), where I(x;Y) is the mutual information for input x under a uniform input distribution. Indeed, the T-symmetry condition is equivalent to the property of having the uniform input distribution achieve capacity. This directly follows from the KKT condition of Lemma 4.22. An example of a T-symmetric channel that is not quasi-symmetric is the binary-input ternary-output channel with the following transition matrix:

    Q = [ 1/3  1/3  1/3 ]
        [ 1/6  1/6  2/3 ].

Hence its capacity is achieved by the uniform input distribution. See [40, Figure 2] for (infinitely many) other examples of T-symmetric channels. However, unlike quasi-symmetric channels, T-symmetric channels do not admit in general a simple closed-form expression for their capacity (such as the one given in (4.4.6)).
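For the transition matrix above, T-symmetry can be verified directly; the following sketch (Python/NumPy, variable names ours) checks that T(x) is constant in x and reports the resulting capacity under the uniform input.

```python
import numpy as np

Q = np.array([[1/3, 1/3, 1/3],
              [1/6, 1/6, 2/3]])
n_inputs = Q.shape[0]
uniform = np.full(n_inputs, 1.0 / n_inputs)

# T(x) = sum_y P(y|x) log2 [ P(y|x) / sum_{x'} P(y|x') ]
col_sums = Q.sum(axis=0)
T = np.sum(Q * np.log2(Q / col_sums), axis=1)
print(T)                                   # both entries ~ -0.9183, so the channel is T-symmetric

# Capacity achieved by the uniform input: C = log2|X| + T(x) for any x,
# equivalently the mutual information I(X;Y) under the uniform distribution.
p_y = uniform @ Q
C = float(np.sum(uniform[:, None] * Q * np.log2(Q / p_y)))
print(C)                                   # ~ 0.0817 bits/channel use
```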


Chapter 5

Differential Entropy and Gaussian Channels

We have so far examined information measures and their operational characterization for discrete-time discrete-alphabet systems. In this chapter, we turn our focus to continuous-alphabet (real-valued) systems. Except for a brief interlude with the continuous-time (waveform) Gaussian channel, we consider discrete-time systems, as treated throughout the book.

We first recall that a real-valued (continuous) random variable X is described by its cumulative distribution function (cdf)

    FX(x) ≜ Pr[X ≤ x]

for x ∈ R, the set of real numbers. The distribution of X is called absolutely continuous (with respect to the Lebesgue measure) if a probability density function (pdf) fX(·) exists such that

    FX(x) = ∫_{−∞}^{x} fX(t) dt,

where fX(t) ≥ 0 for all t and ∫_{−∞}^{+∞} fX(t) dt = 1. If FX(·) is differentiable everywhere, then the pdf fX(·) exists and is given by the derivative of FX(·): fX(t) = dFX(t)/dt. The support of a random variable X with pdf fX(·) is denoted by SX and can be conveniently given as

    SX = {x ∈ R : fX(x) > 0}.

We will deal with random variables that admit a pdf.^1

1 A rigorous (measure-theoretic) study for general continuous systems, initiated by Kolmogorov [28], can be found in [38, 25].

5.1 Differential entropy

Recall that the definition of entropy for a discrete random variable X representing a DMS is

    H(X) ≜ ∑_{x∈X} −PX(x) log2 PX(x)    (in bits).

As already seen in Shannon's source coding theorem, this quantity is the minimum average code rate achievable for the lossless compression of the DMS. But if the random variable takes on values in a continuum, the minimum number of bits per symbol needed to losslessly describe it must be infinite. This is illustrated in the following example, where we take a discrete approximation (quantization) of a random variable uniformly distributed on the unit interval and study the entropy of the quantized random variable as the quantization becomes finer and finer.

Example 5.1 Consider a real-valued random variable X that is uniformly distributed on the unit interval, i.e., with pdf given by

    fX(x) = 1 if x ∈ [0, 1);  0 otherwise.

Given a positive integer m, we can discretize X by uniformly quantizing it into m levels by partitioning the support of X into equal-length segments of size ∆ = 1/m (∆ is called the quantization step-size) such that

    qm(X) = i/m,  if (i − 1)/m ≤ X < i/m,

for 1 ≤ i ≤ m. Then the entropy of the quantized random variable qm(X) is given by

    H(qm(X)) = − ∑_{i=1}^{m} (1/m) log2(1/m) = log2 m    (in bits).

Since the entropy H(qm(X)) of the quantized version of X is a lower bound to the entropy of X (as qm(X) is a function of X) and satisfies in the limit

    lim_{m→∞} H(qm(X)) = lim_{m→∞} log2 m = ∞,

we obtain that the entropy of X is infinite.

The above example indicates that to compress a continuous source without incurring any loss or distortion indeed requires an infinite number of bits. Thus, when studying continuous sources, the entropy measure is limited in its effectiveness and the introduction of a new measure is necessary. Such a new measure is indeed obtained upon close examination of the entropy of a uniformly quantized real-valued random variable minus the quantization accuracy, as the accuracy increases without bound.

Lemma 5.2 Consider a real-valued random variable X with support [a, b) and pdf fX such that −fX log2 fX is integrable^2 (where −∫_a^b fX(x) log2 fX(x) dx is finite). Then a uniform quantization of X with an n-bit accuracy (i.e., with a quantization step-size of ∆ = 2^{−n}) yields an entropy approximately equal to −∫_a^b fX(x) log2 fX(x) dx + n bits for n sufficiently large. In other words,

    lim_{n→∞} [H(qn(X)) − n] = −∫_a^b fX(x) log2 fX(x) dx,

where qn(X) is the uniformly quantized version of X with quantization step-size ∆ = 2^{−n}.

Proof:

Step 1: Mean-value theorem. Let ∆ = 2^{−n} be the quantization step-size, and let

    ti ≜ a + i∆ for i = 0, 1, · · · , j − 1,  and  tj ≜ b,

where j = ⌈(b − a) 2^n⌉. From the mean-value theorem (e.g., cf. [36]), we can choose xi ∈ [t_{i−1}, ti] for 1 ≤ i ≤ j such that

    pi ≜ ∫_{t_{i−1}}^{ti} fX(x) dx = fX(xi)(ti − t_{i−1}) = ∆ · fX(xi).

Step 2: Definition of h^(n)(X). Let

    h^(n)(X) ≜ − ∑_{i=1}^{j} [fX(xi) log2 fX(xi)] 2^{−n}.

Since −fX(x) log2 fX(x) is integrable,

    h^(n)(X) → −∫_a^b fX(x) log2 fX(x) dx  as n → ∞.

Therefore, given any ε > 0, there exists N such that for all n > N,

    | −∫_a^b fX(x) log2 fX(x) dx − h^(n)(X) | < ε.

2 By integrability, we mean the usual Riemann integrability (e.g., see [42]).

Step 3: Computation of H(qn(X)). The entropy of the (uniformly) quantized version of X, qn(X), is given by

    H(qn(X)) = − ∑_{i=1}^{j} pi log2 pi
             = − ∑_{i=1}^{j} (fX(xi)∆) log2(fX(xi)∆)
             = − ∑_{i=1}^{j} (fX(xi) 2^{−n}) log2(fX(xi) 2^{−n}),

where the pi's are the probabilities of the different values of qn(X).

Step 4: H(qn(X)) − h^(n)(X). From Steps 2 and 3,

    H(qn(X)) − h^(n)(X) = − ∑_{i=1}^{j} [fX(xi) 2^{−n}] log2(2^{−n})
                        = n ∑_{i=1}^{j} ∫_{t_{i−1}}^{ti} fX(x) dx
                        = n ∫_a^b fX(x) dx = n.

Hence, we have that for n > N,

    [ −∫_a^b fX(x) log2 fX(x) dx + n ] − ε < H(qn(X)) = h^(n)(X) + n < [ −∫_a^b fX(x) log2 fX(x) dx + n ] + ε,

yielding that

    lim_{n→∞} [H(qn(X)) − n] = −∫_a^b fX(x) log2 fX(x) dx.

More generally, the following result due to Renyi [39] can be shown for (absolutely continuous) random variables with arbitrary support.

Theorem 5.3 [39, Theorem 1] For any real-valued random variable with pdf fX, if −∑_{i=1}^{j} pi log2 pi is finite, where the pi's are the probabilities of the different values of the uniformly quantized qn(X) over the support SX, then

    lim_{n→∞} [H(qn(X)) − n] = −∫_{SX} fX(x) log2 fX(x) dx,

provided the integral on the right-hand side exists.

In light of the above results, we can define the following information measure.

Definition 5.4 (Differential entropy) The differential entropy (in bits) of a continuous random variable X with pdf fX and support SX is defined as

    h(X) ≜ −∫_{SX} fX(x) log2 fX(x) dx = E[− log2 fX(X)],

when the integral exists.

Thus the differential entropy h(X) of a real-valued random variable X has an operational meaning in the following sense. Since H(qn(X)) is the minimum average number of bits needed to losslessly describe qn(X), we thus obtain that h(X) + n bits are approximately needed to describe X when uniformly quantizing it with an n-bit accuracy. Therefore, we may conclude that the larger h(X) is, the larger is the average number of bits required to describe a uniformly quantized X within a fixed accuracy.

Example 5.5 A continuous random variable X with support SX = [0, 1) and pdf fX(x) = 2x for x ∈ SX has differential entropy equal to

    ∫_0^1 −2x log2(2x) dx = [ x²(log2 e − 2 log2(2x))/2 ]_0^1 = 1/(2 ln 2) − log2(2) = −0.278652 bits.

We herein illustrate Lemma 5.2 by uniformly quantizing X to an n-bit accuracy and computing the entropy H(qn(X)) and H(qn(X)) − n for increasing values of n, where qn(X) is the quantized version of X.

We have that qn(X) is given by

    qn(X) = i/2^n,  if (i − 1)/2^n ≤ X < i/2^n,

for 1 ≤ i ≤ 2^n. Hence,

    Pr[ qn(X) = i/2^n ] = (2i − 1)/2^{2n},

which yields

    H(qn(X)) = − ∑_{i=1}^{2^n} [(2i − 1)/2^{2n}] log2 [(2i − 1)/2^{2n}]
             = − (1/2^{2n}) ∑_{i=1}^{2^n} (2i − 1) log2(2i − 1) + 2 log2(2^n).

    n    H(qn(X))         H(qn(X)) − n
    1    0.811278 bits    −0.188722 bits
    2    1.748999 bits    −0.251000 bits
    3    2.729560 bits    −0.270440 bits
    4    3.723726 bits    −0.276275 bits
    5    4.722023 bits    −0.277977 bits
    6    5.721537 bits    −0.278463 bits
    7    6.721399 bits    −0.278600 bits
    8    7.721361 bits    −0.278638 bits
    9    8.721351 bits    −0.278648 bits

Table 5.1: Quantized random variable qn(X) under an n-bit accuracy: H(qn(X)) and H(qn(X)) − n versus n.

As shown in Table 5.1, we indeed observe that as n increases, H(qn(X)) tends to infinity while H(qn(X)) − n converges to h(X) = −0.278652 bits.

Thus a continuous random variable X contains an infinite amount of information; but we can measure the information contained in its n-bit quantized version qn(X) as H(qn(X)) ≈ h(X) + n (for n large enough).
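The entries of Table 5.1 can be regenerated directly from the closed-form expression for H(qn(X)) derived above; a minimal sketch follows (Python/NumPy; no assumptions beyond the formulas of Example 5.5).

```python
import numpy as np

h_X = 1 / (2 * np.log(2)) - 1          # differential entropy of f_X(x)=2x on [0,1): -0.278652 bits

for n in range(1, 10):
    i = np.arange(1, 2**n + 1)
    p = (2 * i - 1) / 4**n             # Pr[q_n(X) = i/2^n] = (2i-1)/2^(2n)
    H = float(-np.sum(p * np.log2(p))) # entropy of the quantized variable
    print(n, round(H, 6), round(H - n, 6))

# H(q_n(X)) grows roughly like n, while H(q_n(X)) - n converges to h(X).
print(round(h_X, 6))
```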

Example 5.6 Let us determine the minimum average number of bits required to describe the uniform quantization with 3-digit accuracy of the decay time (in years) of a radium atom, assuming that the half-life of the radium (i.e., the median of the decay time) is 80 years and that its pdf is given by fX(x) = λ e^{−λx} for x > 0.

Since the median of the decay time is 80, we obtain

    ∫_0^{80} λ e^{−λx} dx = 0.5,

which implies that λ = 0.00866. Also, 3-digit accuracy is approximately equivalent to log2 999 = 9.96 ≈ 10 bits of accuracy. Therefore, by Theorem 5.3, the number of bits required to describe the quantized decay time is approximately

    h(X) + 10 = log2(e/λ) + 10 = 18.29 bits.

We close this section by computing the differential entropy for two common real-valued random variables: the uniformly distributed random variable and the Gaussian distributed random variable.

Example 5.7 (Differential entropy of a uniformly distributed random variable) Let X be a continuous random variable that is uniformly distributed over the interval (a, b), where b > a; i.e., its pdf is given by

    fX(x) = 1/(b − a) if x ∈ (a, b);  0 otherwise.

So its differential entropy is given by

    h(X) = −∫_a^b (1/(b − a)) log2 (1/(b − a)) dx = log2(b − a) bits.

Note that if (b − a) < 1 in the above example, then h(X) is negative, unlike entropy. The above example indicates that although differential entropy has a form analogous to entropy (in the sense that summation and pmf for entropy are replaced by integration and pdf, respectively, for differential entropy), differential entropy does not retain all the properties of entropy (one such operational difference was already highlighted in the previous lemma and theorem).^3

Example 5.8 (Differential entropy of a Gaussian random variable) Let X ∼ N(µ, σ²); i.e., X is a Gaussian (or normal) random variable with finite mean µ, variance Var(X) = σ² > 0 and pdf

    fX(x) = (1/√(2πσ²)) e^{−(x−µ)²/(2σ²)}

for x ∈ R. Then its differential entropy is given by

    h(X) = ∫_R fX(x) [ (1/2) log2(2πσ²) + ((x − µ)²/(2σ²)) log2 e ] dx
         = (1/2) log2(2πσ²) + (log2 e/(2σ²)) E[(X − µ)²]
         = (1/2) log2(2πσ²) + (1/2) log2 e
         = (1/2) log2(2πeσ²) bits.    (5.1.1)

Note that for a Gaussian random variable, its differential entropy is only a function of its variance σ² (it is independent of its mean µ). This is similar to the differential entropy of a uniform random variable, which only depends on the difference (b − a) but not on the mean (a + b)/2.

3 By contrast, entropy and differential entropy are sometimes called discrete entropy and continuous entropy, respectively.
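As a sanity check on (5.1.1), one can estimate h(X) = E[−log2 fX(X)] by Monte Carlo sampling; a small sketch follows (Python/NumPy; the mean, variance and sample size are arbitrary illustrative choices).

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 2.0

# Closed form: h(X) = 0.5 * log2(2*pi*e*sigma^2)
h_exact = 0.5 * np.log2(2 * np.pi * np.e * sigma**2)

# Monte Carlo estimate of E[-log2 f_X(X)]
x = rng.normal(mu, sigma, size=1_000_000)
log2_pdf = -0.5 * np.log2(2 * np.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2) * np.log2(np.e)
h_mc = float(np.mean(-log2_pdf))

print(h_exact, h_mc)   # both ~ 3.047 bits; note the answer does not depend on mu
```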

5.2 Joint and conditional differential entropies, divergence and mutual information

Definition 5.9 (Joint differential entropy) If X^n = (X1, X2, · · · , Xn) is a continuous random vector of size n (i.e., a vector of n continuous random variables) with joint pdf fX^n and support SX^n ⊆ R^n, then its joint differential entropy is defined as

    h(X^n) ≜ −∫_{SX^n} fX^n(x1, x2, · · · , xn) log2 fX^n(x1, x2, · · · , xn) dx1 dx2 · · · dxn = E[− log2 fX^n(X^n)]

when the n-dimensional integral exists.

Definition 5.10 (Conditional differential entropy) Let X and Y be two jointly distributed continuous random variables with joint pdf fX,Y and support SX,Y ⊆ R², such that the conditional pdf of Y given X, given by

    fY|X(y|x) = fX,Y(x, y)/fX(x),

is well defined for all (x, y) ∈ SX,Y, where fX is the marginal pdf of X. Then the conditional differential entropy of Y given X is defined as

    h(Y|X) ≜ −∫_{SX,Y} fX,Y(x, y) log2 fY|X(y|x) dx dy = E[− log2 fY|X(Y|X)],

when the integral exists.

Note that, as in the case of (discrete) entropy, the chain rule holds for differential entropy:

    h(X, Y) = h(X) + h(Y|X) = h(Y) + h(X|Y).

Definition 5.11 (Divergence or relative entropy) Let X and Y be two continuous random variables with marginal pdfs fX and fY, respectively, such that their supports satisfy SX ⊆ SY ⊆ R. Then the divergence (or relative entropy or Kullback-Leibler distance) between X and Y is written as D(X‖Y) or D(fX‖fY) and defined by

    D(X‖Y) ≜ ∫_{SX} fX(x) log2 [ fX(x)/fY(x) ] dx = E[ log2 ( fX(X)/fY(X) ) ]

when the integral exists. The definition carries over similarly in the multivariate case: for X^n = (X1, X2, · · · , Xn) and Y^n = (Y1, Y2, · · · , Yn) two random vectors with joint pdfs fX^n and fY^n, respectively, and supports satisfying SX^n ⊆ SY^n ⊆ R^n, the divergence between X^n and Y^n is defined as

    D(X^n‖Y^n) ≜ ∫_{SX^n} fX^n(x1, x2, · · · , xn) log2 [ fX^n(x1, x2, · · · , xn) / fY^n(x1, x2, · · · , xn) ] dx1 dx2 · · · dxn

when the integral exists.

Definition 5.12 (Mutual information) Let X and Y be two jointly distributed continuous random variables with joint pdf fX,Y and support SX,Y ⊆ R². Then the mutual information between X and Y is defined by

    I(X;Y) ≜ D(fX,Y‖fX fY) = ∫_{SX,Y} fX,Y(x, y) log2 [ fX,Y(x, y) / (fX(x) fY(y)) ] dx dy,

assuming the integral exists, where fX and fY are the marginal pdfs of X and Y, respectively.

Observation 5.13 Consider two jointly distributed continuous random variables X and Y with joint pdf fX,Y, support SX,Y ⊆ R² and joint differential entropy

    h(X, Y) = −∫_{SX,Y} fX,Y(x, y) log2 fX,Y(x, y) dx dy.

Then, as in Lemma 5.2 and the ensuing discussion, one can write

    H(qn(X), qm(Y)) ≈ h(X, Y) + n + m

for n and m sufficiently large, where qk(Z) denotes the (uniformly) quantized version of random variable Z with a k-bit accuracy.

On the other hand, for the above continuous X and Y,

    I(qn(X); qm(Y)) = H(qn(X)) + H(qm(Y)) − H(qn(X), qm(Y))
                    ≈ [h(X) + n] + [h(Y) + m] − [h(X, Y) + n + m]
                    = h(X) + h(Y) − h(X, Y)
                    = ∫_{SX,Y} fX,Y(x, y) log2 [ fX,Y(x, y) / (fX(x) fY(y)) ] dx dy

for n and m sufficiently large; in other words,

    lim_{n,m→∞} I(qn(X); qm(Y)) = h(X) + h(Y) − h(X, Y).

Furthermore, it can be shown that

    lim_{n→∞} D(qn(X)‖qn(Y)) = ∫_{SX} fX(x) log2 [ fX(x)/fY(x) ] dx.

Thus mutual information and divergence can be considered as the true tools of Information Theory, as they retain the same operational characteristics and properties for both discrete and continuous probability spaces (as well as general spaces, where they can be defined in terms of Radon-Nikodym derivatives; e.g., cf. [25]).^4

4 This justifies using identical notations for both I(·;·) and D(·‖·), as opposed to the discerning notations of H(·) for entropy and h(·) for differential entropy.

The following lemma illustrates that for continuous systems, I(·;·) and D(·‖·) keep the same properties already encountered for discrete systems, while differential entropy (as already seen with its possibility of being negative) satisfies some properties that differ from those of entropy. The proof is left as an exercise.

Lemma 5.14 The following properties hold for the information measures of continuous systems.

1. Non-negativity of divergence: Let X and Y be two continuous random variables with marginal pdfs fX and fY, respectively, such that their supports satisfy SX ⊆ SY ⊆ R. Then

    D(fX‖fY) ≥ 0,

with equality iff fX(x) = fY(x) for all x ∈ SX except in a set of fX-measure zero (i.e., X = Y almost surely).

2. Non-negativity of mutual information: For any two continuous random variables X and Y,

    I(X;Y) ≥ 0,

with equality iff X and Y are independent.

3. Conditioning never increases differential entropy: For any two continuous random variables X and Y with joint pdf fX,Y and well-defined conditional pdf fX|Y,

    h(X|Y) ≤ h(X),

with equality iff X and Y are independent.

4. Chain rule for differential entropy: For a continuous random vector X^n = (X1, X2, . . . , Xn),

    h(X1, X2, . . . , Xn) = ∑_{i=1}^{n} h(Xi|X1, X2, . . . , Xi−1),

where h(Xi|X1, X2, . . . , Xi−1) ≜ h(X1) for i = 1.

5. Chain rule for mutual information: For a continuous random vector X^n = (X1, X2, · · · , Xn) and a random variable Y with joint pdf fX^n,Y and well-defined conditional pdfs fXi,Y|X^{i−1}, fXi|X^{i−1} and fY|X^{i−1} for i = 1, · · · , n, we have that

    I(X1, X2, · · · , Xn; Y) = ∑_{i=1}^{n} I(Xi; Y|Xi−1, · · · , X1),

where I(Xi; Y|Xi−1, · · · , X1) ≜ I(X1; Y) for i = 1.

6. Data processing inequality: For continuous random variables X, Y and Z such that X → Y → Z, i.e., X and Z are conditionally independent given Y (cf. Appendix B),

    I(X;Y) ≥ I(X;Z).

7. Independence bound for differential entropy: For a continuous random vector X^n = (X1, X2, · · · , Xn),

    h(X^n) ≤ ∑_{i=1}^{n} h(Xi),

with equality iff all the Xi's are independent from each other.

8. Invariance of differential entropy under translation: For continuous random variables X and Y with joint pdf fX,Y and well-defined conditional pdf fX|Y,

    h(X + c) = h(X)  for any constant c ∈ R,

and

    h(X + Y|Y) = h(X|Y).

The results also generalize to the multivariate case: for two continuous random vectors X^n = (X1, X2, · · · , Xn) and Y^n = (Y1, Y2, · · · , Yn) with joint pdf fX^n,Y^n and well-defined conditional pdf fX^n|Y^n,

    h(X^n + c^n) = h(X^n)

for any constant n-tuple c^n = (c1, c2, · · · , cn) ∈ R^n, and

    h(X^n + Y^n|Y^n) = h(X^n|Y^n),

where the addition of two n-tuples is performed component-wise.

9. Differential entropy under scaling: For any continuous random variable X and any non-zero real constant a,

    h(aX) = h(X) + log2 |a|.

10. Joint differential entropy under linear mapping: Consider the random (column) vector X = (X1, X2, · · · , Xn)^T with joint pdf fX^n, where T denotes transposition, and let Y = (Y1, Y2, · · · , Yn)^T be a random (column) vector obtained from the linear transformation Y = AX, where A is an invertible (non-singular) n × n real-valued matrix. Then

    h(Y) = h(Y1, Y2, · · · , Yn) = h(X1, X2, · · · , Xn) + log2 |det(A)|,

where det(A) is the determinant of the square matrix A.

11. Joint differential entropy under nonlinear mapping: Consider the random (column) vector X = (X1, X2, · · · , Xn)^T with joint pdf fX^n, and let Y = (Y1, Y2, · · · , Yn)^T be a random (column) vector obtained from the nonlinear transformation

    Y = g(X) ≜ (g1(X1), g2(X2), · · · , gn(Xn))^T,

where each gi : R → R is a differentiable function, i = 1, 2, · · · , n. Then

    h(Y) = h(Y1, Y2, · · · , Yn) = h(X1, · · · , Xn) + ∫_{R^n} fX^n(x1, · · · , xn) log2 |det(J)| dx1 · · · dxn,

where J is the n × n Jacobian matrix given by

    J ≜ [ ∂g1/∂x1  ∂g1/∂x2  · · ·  ∂g1/∂xn ]
        [ ∂g2/∂x1  ∂g2/∂x2  · · ·  ∂g2/∂xn ]
        [   ...      ...     · · ·    ...   ]
        [ ∂gn/∂x1  ∂gn/∂x2  · · ·  ∂gn/∂xn ].

Observation 5.15 Property 9 of the above lemma indicates that for a continuous random variable X, h(X) ≠ h(aX) (except for the trivial case of a = 1), and hence differential entropy is not in general invariant under invertible maps. This is in contrast to entropy, which is always invariant under invertible maps: given a discrete random variable X with alphabet X,

    H(f(X)) = H(X)

for all invertible maps f : X → Y, where Y is a discrete set; in particular, H(aX) = H(X) for all non-zero reals a.

On the other hand, for both discrete and continuous systems, mutual information and divergence are invariant under invertible maps:

    I(X;Y) = I(g(X);Y) = I(g(X); h(Y))

and

    D(X‖Y) = D(g(X)‖g(Y))

for all invertible maps g and h properly defined on the alphabet/support of the concerned random variables. This reinforces the notion that mutual information and divergence constitute the true tools of Information Theory.

Definition 5.16 (Multivariate Gaussian) A continuous random vector X = (X1, X2, · · · , Xn)^T is called a size-n (multivariate) Gaussian random vector with a finite mean vector µ ≜ (µ1, µ2, · · · , µn)^T, where µi ≜ E[Xi] < ∞ for i = 1, 2, · · · , n, and an n × n invertible (real-valued) covariance matrix

    KX = [Ki,j] ≜ E[(X − µ)(X − µ)^T]
       = [ Cov(X1, X1)  Cov(X1, X2)  · · ·  Cov(X1, Xn) ]
         [ Cov(X2, X1)  Cov(X2, X2)  · · ·  Cov(X2, Xn) ]
         [     ...          ...      · · ·      ...     ]
         [ Cov(Xn, X1)  Cov(Xn, X2)  · · ·  Cov(Xn, Xn) ],

where Ki,j = Cov(Xi, Xj) ≜ E[(Xi − µi)(Xj − µj)] is the covariance^5 between Xi and Xj for i, j = 1, 2, · · · , n, if its joint pdf is given by the multivariate Gaussian pdf

    fX^n(x1, x2, · · · , xn) = [ 1 / ( (√(2π))^n √(det(KX)) ) ] e^{−(1/2)(x − µ)^T KX^{−1} (x − µ)}

for any (x1, x2, · · · , xn) ∈ R^n, where x = (x1, x2, · · · , xn)^T. As in the scalar case (i.e., for n = 1), we write X ∼ Nn(µ, KX) to denote that X is a size-n Gaussian random vector with mean vector µ and covariance matrix KX.

5 Note that the diagonal components of KX yield the variances of the different random variables: Ki,i = Cov(Xi, Xi) = Var(Xi) = σ²_{Xi}, i = 1, · · · , n.

Observation 5.17 In light of the above definition, we make the following remarks.

1. Note that a covariance matrix K is always symmetric (i.e., K^T = K) and positive-semidefinite.^6 But as we require KX to be invertible in the definition of the multivariate Gaussian distribution above, we will hereafter assume that the covariance matrix of Gaussian random vectors is positive-definite (which is equivalent to having all the eigenvalues of KX positive), thus rendering the matrix invertible.

2. If a random vector X = (X1, X2, · · · , Xn)^T has a diagonal covariance matrix KX (i.e., all the off-diagonal components of KX are zero: Ki,j = 0 for all i ≠ j, i, j = 1, · · · , n), then all its component random variables are uncorrelated but not necessarily independent. However, if X is Gaussian and has a diagonal covariance matrix, then all its component random variables are independent from each other.

3. Any linear transformation of a Gaussian random vector yields another Gaussian random vector. Specifically, if X ∼ Nn(µ, KX) is a size-n Gaussian random vector with mean vector µ and covariance matrix KX, and if Y = Amn X, where Amn is a given m × n real-valued matrix, then

    Y ∼ Nm(Amn µ, Amn KX Amn^T)

is a size-m Gaussian random vector with mean vector Amn µ and covariance matrix Amn KX Amn^T.

More generally, any affine transformation of a Gaussian random vector yields another Gaussian random vector: if X ∼ Nn(µ, KX) and Y = Amn X + bm, where Amn is an m × n real-valued matrix and bm is a size-m real-valued vector, then

    Y ∼ Nm(Amn µ + bm, Amn KX Amn^T).

6 An n × n real-valued symmetric matrix K is positive-semidefinite (e.g., cf. [16]) if for every real-valued vector x = (x1, x2, · · · , xn)^T,

    x^T K x ≥ 0.

Furthermore, the matrix is positive-definite if x^T K x > 0 for all real-valued vectors x ≠ 0, where 0 is the all-zero vector of size n.

Theorem 5.18 (Joint differential entropy of the multivariate Gaussian) If X ∼ Nn(µ, KX) is a Gaussian random vector with mean vector µ and (positive-definite) covariance matrix KX, then its joint differential entropy is given by

    h(X) = h(X1, X2, · · · , Xn) = (1/2) log2 [(2πe)^n det(KX)].    (5.2.1)

In particular, in the univariate case of n = 1, (5.2.1) reduces to (5.1.1).

Proof: Without loss of generality we assume that X has a zero mean vector, since its differential entropy is invariant under translation by Property 8 of Lemma 5.14:

    h(X) = h(X − µ);

so we assume that µ = 0.

Since the covariance matrix KX is a real-valued symmetric matrix, it is orthogonally diagonalizable; i.e., there exists a square (n × n) orthogonal matrix A (i.e., satisfying A^T = A^{−1}) such that A KX A^T is a diagonal matrix whose entries are given by the eigenvalues of KX (A is constructed using the eigenvectors of KX; e.g., see [16]). As a result, the linear transformation Y = AX ∼ Nn(0, A KX A^T) is a Gaussian vector with the diagonal covariance matrix KY = A KX A^T and has therefore independent components (as noted in Observation 5.17). Thus

    h(Y) = h(Y1, Y2, · · · , Yn)
         = h(Y1) + h(Y2) + · · · + h(Yn)    (5.2.2)
         = ∑_{i=1}^{n} (1/2) log2 [2πe Var(Yi)]    (5.2.3)
         = (n/2) log2(2πe) + (1/2) log2 [ ∏_{i=1}^{n} Var(Yi) ]
         = (n/2) log2(2πe) + (1/2) log2 [det(KY)]    (5.2.4)
         = (1/2) log2 (2πe)^n + (1/2) log2 [det(KX)]    (5.2.5)
         = (1/2) log2 [(2πe)^n det(KX)],    (5.2.6)

where (5.2.2) follows by the independence of the random variables Y1, . . . , Yn (e.g., see Property 7 of Lemma 5.14), (5.2.3) follows from (5.1.1), (5.2.4) holds since the matrix KY is diagonal and hence its determinant is given by the product of its diagonal entries, and (5.2.5) holds since

    det(KY) = det(A KX A^T) = det(A) det(KX) det(A^T) = det(A)² det(KX) = det(KX),

where the last equality holds since (det(A))² = 1, as the matrix A is orthogonal (A^T = A^{−1} =⇒ det(A) = det(A^T) = 1/det(A); thus det(A)² = 1).

Now invoking Property 10 of Lemma 5.14 and noting that |det(A)| = 1 yield that

    h(Y1, Y2, · · · , Yn) = h(X1, X2, · · · , Xn) + log2 |det(A)| = h(X1, X2, · · · , Xn).

We therefore obtain using (5.2.6) that

    h(X1, X2, · · · , Xn) = (1/2) log2 [(2πe)^n det(KX)],

hence completing the proof.

An alternate (but rather mechanical) proof to the one presented above consists of directly evaluating the joint differential entropy of X by integrating −fX^n(x^n) log2 fX^n(x^n) over R^n; it is left as an exercise.
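For a numerical check of Theorem 5.18, the following sketch (Python/NumPy; the particular covariance matrix is an arbitrary example of ours) compares the closed form (5.2.1) with a Monte Carlo estimate of E[−log2 fX(X)].

```python
import numpy as np

rng = np.random.default_rng(1)
K = np.array([[2.0, 0.5, 0.0],
              [0.5, 1.0, 0.3],
              [0.0, 0.3, 1.5]])          # positive-definite covariance (illustrative)
mu = np.zeros(3)
n = K.shape[0]

# Closed form: h(X) = 0.5 * log2((2*pi*e)^n * det(K))
h_exact = 0.5 * np.log2((2 * np.pi * np.e) ** n * np.linalg.det(K))

# Monte Carlo estimate of E[-log2 f_X(X)] using the multivariate Gaussian pdf
x = rng.multivariate_normal(mu, K, size=200_000)
Kinv = np.linalg.inv(K)
quad = np.einsum("ij,jk,ik->i", x - mu, Kinv, x - mu)      # (x-mu)^T K^{-1} (x-mu)
log2_pdf = -0.5 * (n * np.log2(2 * np.pi) + np.log2(np.linalg.det(K))) - 0.5 * quad * np.log2(np.e)
h_mc = float(np.mean(-log2_pdf))

print(h_exact, h_mc)                     # the two values should agree to about two decimal places
```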

Corollary 5.19 (Hadamard's inequality) For any real-valued n × n positive-definite matrix K = [Ki,j]_{i,j=1,··· ,n},

    det(K) ≤ ∏_{i=1}^{n} Ki,i,

with equality iff K is a diagonal matrix, where the Ki,i are the diagonal entries of K.

Proof: Since every positive-definite matrix is a covariance matrix (e.g., see [22]), let X = (X1, X2, · · · , Xn)^T ∼ Nn(0, K) be a jointly Gaussian random vector with zero mean vector and covariance matrix K. Then

    (1/2) log2 [(2πe)^n det(K)] = h(X1, X2, · · · , Xn)    (5.2.7)
                                ≤ ∑_{i=1}^{n} h(Xi)    (5.2.8)
                                = ∑_{i=1}^{n} (1/2) log2 [2πe Var(Xi)]    (5.2.9)
                                = (1/2) log2 [ (2πe)^n ∏_{i=1}^{n} Ki,i ],    (5.2.10)

where (5.2.7) follows from Theorem 5.18, (5.2.8) follows from Property 7 of Lemma 5.14, and (5.2.9)-(5.2.10) hold using (5.1.1) along with the fact that each random variable Xi ∼ N(0, Ki,i) is Gaussian with zero mean and variance Var(Xi) = Ki,i for i = 1, 2, · · · , n (as the marginals of a multivariate Gaussian are also Gaussian (e.g., cf. [22])).

Finally, from (5.2.10), we directly obtain that

    det(K) ≤ ∏_{i=1}^{n} Ki,i,

with equality iff the jointly Gaussian random variables X1, X2, . . . , Xn are independent from each other, or equivalently iff the covariance matrix K is diagonal.
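A two-line numerical check of Hadamard's inequality (Python/NumPy; the matrix below is the same illustrative covariance used in the previous sketch) may help fix ideas.

```python
import numpy as np

K = np.array([[2.0, 0.5, 0.0],
              [0.5, 1.0, 0.3],
              [0.0, 0.3, 1.5]])

print(np.linalg.det(K))          # ~ 2.445
print(np.prod(np.diag(K)))       # 3.0 >= det(K), with equality only for diagonal K
```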

The next theorem states that among all real-valued size-n random vectors (of support R^n) with identical mean vector and covariance matrix, the Gaussian random vector has the largest differential entropy.

Theorem 5.20 (Maximal differential entropy for real-valued random vectors) Let X = (X1, X2, · · · , Xn)^T be a real-valued random vector with support SX^n = R^n, mean vector µ and covariance matrix KX. Then

    h(X1, X2, · · · , Xn) ≤ (1/2) log2 [(2πe)^n det(KX)],    (5.2.11)

with equality iff X is Gaussian; i.e., X ∼ Nn(µ, KX).

Proof: We will present the proof in two parts: the scalar or univariate case, and the multivariate case.

(i) Scalar case (n = 1): For a real-valued random variable with support SX = R, mean µ and variance σ², let us show that

    h(X) ≤ (1/2) log2 (2πeσ²),    (5.2.12)

with equality iff X ∼ N(µ, σ²).

For a Gaussian random variable Y ∼ N(µ, σ²), using the non-negativity of divergence, we can write

    0 ≤ D(X‖Y)
      = ∫_R fX(x) log2 [ fX(x) / ( (1/√(2πσ²)) e^{−(x−µ)²/(2σ²)} ) ] dx
      = −h(X) + ∫_R fX(x) [ log2 (√(2πσ²)) + ((x − µ)²/(2σ²)) log2 e ] dx
      = −h(X) + (1/2) log2(2πσ²) + (log2 e/(2σ²)) ∫_R (x − µ)² fX(x) dx
      = −h(X) + (1/2) log2 [2πeσ²],

where the last step uses ∫_R (x − µ)² fX(x) dx = σ². Thus

    h(X) ≤ (1/2) log2 [2πeσ²],

with equality iff X = Y (almost surely), i.e., X ∼ N(µ, σ²).

(ii) Multivariate case (n > 1): As in the proof of Theorem 5.18, we can use an orthogonal square matrix A (i.e., satisfying A^T = A^{−1} and hence |det(A)| = 1) such that A KX A^T is diagonal. Therefore, the random vector generated by the linear map

    Z = AX

will have a covariance matrix given by KZ = A KX A^T and hence have uncorrelated (but not necessarily independent) components. Thus

    h(X) = h(Z) − log2 |det(A)|    (5.2.13)
         = h(Z1, Z2, · · · , Zn)
         ≤ ∑_{i=1}^{n} h(Zi)    (5.2.14)
         ≤ ∑_{i=1}^{n} (1/2) log2 [2πe Var(Zi)]    (5.2.15)
         = (n/2) log2(2πe) + (1/2) log2 [ ∏_{i=1}^{n} Var(Zi) ]
         = (1/2) log2 (2πe)^n + (1/2) log2 [det(KZ)]    (5.2.16)
         = (1/2) log2 (2πe)^n + (1/2) log2 [det(KX)]    (5.2.17)
         = (1/2) log2 [(2πe)^n det(KX)],

where (5.2.13) holds by Property 10 of Lemma 5.14 and since |det(A)| = 1 (so that log2 |det(A)| = 0), (5.2.14) follows from Property 7 of Lemma 5.14, (5.2.15) follows from (5.2.12) (the scalar case above), (5.2.16) holds since KZ is diagonal, and (5.2.17) follows from the fact that det(KZ) = det(KX) (as A is orthogonal). Finally, equality is achieved in both (5.2.14) and (5.2.15) iff the random variables Z1, Z2, . . . , Zn are Gaussian and independent from each other, or equivalently iff X ∼ Nn(µ, KX).

Observation 5.21 The following two results can also be shown (the proof is left as an exercise):

1. Among all continuous random variables admitting a pdf with support the interval (a, b), where b > a are real numbers, the uniformly distributed random variable maximizes differential entropy.

2. Among all continuous random variables admitting a pdf with support the interval [0, ∞) and finite mean µ, the exponential distribution with parameter (or rate parameter) λ = 1/µ maximizes differential entropy.

A systematic approach to finding distributions that maximize differential entropy subject to various support and moment constraints can be found in [13, 53].

5.3 AEP for continuous memoryless sources

The AEP theorem and its consequence for discrete memoryless (i.i.d.) sources reveal to us that the number of elements in the typical set is approximately 2^{nH(X)}, where H(X) is the source entropy, and that the typical set carries almost all the probability mass asymptotically (see Theorems 3.3 and 3.4). An extension of this result from discrete to continuous memoryless sources by just counting the number of elements in a continuous (typical) set defined via a law-of-large-numbers argument is not possible, since the total number of elements in a continuous set is infinite. However, when considering the volume of that continuous typical set (which is a natural analog to the size of a discrete set), such an extension, with differential entropy playing a similar role as entropy, becomes straightforward.

Theorem 5.22 (AEP for continuous memoryless sources) Let {Xi}_{i=1}^∞ be a continuous memoryless source (i.e., an infinite sequence of continuous i.i.d. random variables) with pdf fX(·) and differential entropy h(X). Then

    −(1/n) log2 fX(X1, . . . , Xn) → E[−log2 fX(X)] = h(X)  in probability,

where fX(x1, . . . , xn) ≜ ∏_{i=1}^{n} fX(xi) denotes the source's n-fold joint pdf.

Proof: The proof is an immediate result of the law of large numbers (e.g., see Theorem 3.3).

Definition 5.23 (Typical set) For δ > 0 and any n given, define the typical set for the above continuous source as

    Fn(δ) ≜ { x^n ∈ R^n : | −(1/n) log2 fX(x1, . . . , xn) − h(X) | < δ }.

Definition 5.24 (Volume) The volume of a set A ⊂ R^n is defined as

    Vol(A) ≜ ∫_A dx1 · · · dxn.

Theorem 5.25 (Consequence of the AEP for continuous memoryless sources) For a continuous memoryless source {Xi}_{i=1}^∞ with differential entropy h(X), the following hold.

1. For n sufficiently large, PX^n{Fn(δ)} > 1 − δ.

2. Vol(Fn(δ)) ≤ 2^{n(h(X)+δ)} for all n.

3. Vol(Fn(δ)) ≥ (1 − δ) 2^{n(h(X)−δ)} for n sufficiently large.

Proof: The proof is quite analogous to the corresponding theorem for discrete memoryless sources (Theorem 3.4) and is left as an exercise.
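To see the continuous AEP in action, the sketch below (Python/NumPy; the Gaussian source and the parameter values are an illustrative choice of ours) estimates the probability that −(1/n) log2 fX(X1, . . . , Xn) falls within δ of h(X).

```python
import numpy as np

rng = np.random.default_rng(7)
sigma, delta = 1.0, 0.1
h_X = 0.5 * np.log2(2 * np.pi * np.e * sigma**2)   # differential entropy of N(0, sigma^2)

def typicality_probability(n, trials=20_000):
    """Monte Carlo estimate of P[ |-(1/n) log2 f(X^n) - h(X)| < delta ]."""
    x = rng.normal(0.0, sigma, size=(trials, n))
    log2_pdf = -0.5 * np.log2(2 * np.pi * sigma**2) - x**2 / (2 * sigma**2) * np.log2(np.e)
    norm_logpdf = -np.mean(log2_pdf, axis=1)        # -(1/n) sum_i log2 f(X_i)
    return float(np.mean(np.abs(norm_logpdf - h_X) < delta))

for n in (10, 100, 1000):
    print(n, typicality_probability(n))             # the probability approaches 1 as n grows
```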

5.4 Capacity and channel coding theorem for the discrete-time memoryless Gaussian channel

We next study the fundamental limits for error-free communication over the discrete-time memoryless Gaussian channel, which is the most important continuous-alphabet channel and is widely used to model real-world wired and wireless channels. We first state the definition of discrete-time continuous-alphabet memoryless channels.

Definition 5.26 (Discrete-time continuous memoryless channels) Consider a discrete-time channel with continuous input and output alphabets given by X ⊆ R and Y ⊆ R, respectively, and described by a sequence of n-dimensional transition (conditional) pdfs {fY^n|X^n(y^n|x^n)}_{n=1}^∞ that govern the reception of y^n = (y1, y2, · · · , yn) ∈ Y^n at the channel output when x^n = (x1, x2, · · · , xn) ∈ X^n is sent as the channel input.

The channel (without feedback) is said to be memoryless with a given (marginal) transition pdf fY|X if its sequence of transition pdfs fY^n|X^n satisfies

    fY^n|X^n(y^n|x^n) = ∏_{i=1}^{n} fY|X(yi|xi)    (5.4.1)

for every n = 1, 2, · · ·, x^n ∈ X^n and y^n ∈ Y^n.

In practice, the real-valued input to a continuous channel satisfies a certain constraint or limitation on its amplitude or power; otherwise, one would have a realistically implausible situation where the input can take on any value from the uncountably infinite set of real numbers. We will thus impose an average cost constraint (t, P) on any input n-tuple x^n = (x1, x2, · · · , xn) transmitted over the channel by requiring that

    (1/n) ∑_{i=1}^{n} t(xi) ≤ P,    (5.4.2)

where t(·) is a given non-negative real-valued function describing the cost for transmitting an input symbol, and P is a given positive number representing the maximal average amount of available resources per input symbol.

Definition 5.27 The capacity (or capacity-cost function) of a discrete-time continuous memoryless channel with input average cost constraint (t, P) is denoted by C(P) and defined as

    C(P) ≜ sup_{FX : E[t(X)] ≤ P} I(X;Y)    (in bits/channel use),    (5.4.3)

where the supremum is over all input distributions FX.

Lemma 5.28 (Concavity of capacity) If C(P ) as defined in (5.4.3) is finitefor any P > 0, then it is concave, continuous and strictly increasing in P .

Proof: Fix P1 > 0 and P2 > 0. Then since C(P ) is finite for any P > 0, then bythe 3rd property in Property A.4, there exist two input distributions FX1 andFX2 such that for all ǫ > 0,

I(Xi;Yi) ≥ C(Pi)− ǫ (5.4.4)

andE[t(Xi)] ≤ Pi (5.4.5)

where Xi denotes the input with distribution FXiand Yi is the corresponding

channel output for i = 1, 2. Now, for 0 ≤ λ ≤ 1, let Xλ be a random variablewith distribution FXλ

, λFX1 + (1− λ)FX2 . Then by (5.4.5)

EXλ[t(X)] = λEX1 [t(X)] + (1− λ)EX2 [t(X)] ≤ λP1 + (1− λ)P2. (5.4.6)

Furthermore,

C(λP1 + (1− λ)P2) = supFX : E[t(X)]≤λP1+(1−λ)P2

I(FX , fY |X)

134

Page 147: It Lecture Notes2014

≥ I(FXλ, fY |X)

≥ λI(FX1 , fY |X) + (1− λ)I(FX2 , fY |X)

= λI(X1;Y1) + (1− λ)I(X2 : Y2)

≥ λC(P1)− ǫ+ (1− λ)C(P2)− ǫ,

where the first inequality holds by (5.4.6), the second inequality follows fromthe concavity of the mutual information with respect to its first argument (cf.Lemma 2.46) and the third inequality follows from (5.4.4). Letting ǫ → 0 yieldsthat

C(λP1 + (1− λ)P2) ≥ λC(P1) + (1− λ)C(P2)

and hence C(P ) is concave in P .

Finally, it can directly be seen by definition that C(·) is non-decreasing,which, together with its concavity, imply that it is continuous and strictly in-creasing.

The most commonly used cost function is the power cost function, t(x) = x2,resulting in the average power constraint P for each transmitted input n-tuple:

1

n

n∑

i=1

x2i ≤ P. (5.4.7)

Throughout this chapter, we will adopt this average power constraint on thechannel input.

We herein focus on the discrete-time memoryless Gaussian channel with av-erage input power constraint P and establish an operational meaning for thechannel capacity C(P ) as the largest coding rate for achieving reliable commu-nication over the channel. The channel is described by the following additivenoise equation:

Yi = Xi + Zi, for i = 1, 2, · · · , (5.4.8)

where Yi, Xi and Zi are the channel output, input and noise at time i. The inputand noise processes are assumed to be independent from each other and the noisesource Zi∞i=1 is i.i.d. Gaussian with each Zi having mean zero and variance σ2,Zi ∼ N (0, σ2). Since the noise process is i.i.d, we directly get that the channelsatisfies (5.4.1) and is hence memoryless, where the channel transition pdf isexplicitly given in terms of the noise pdf as follows:

fY |X(y|x) = fZ(y − x) =1√2πσ2

e−(y−x)2

2σ2 .

As mentioned above, we impose the average power constraint (5.4.7) on thechannel input.

135

Page 148: It Lecture Notes2014

Observation 5.29 The memoryless Gaussian channel is a good approximatingmodel for many practical channels such as radio, satellite and telephone linechannels. The additive noise is usually due to a multitude of causes, whosecumulative effect can be approximated via the Gaussian distribution. This isjustified by the Central Limit Theorem which states that for an i.i.d. processUi∞i=1 with mean µ and variance σ2, 1√

n

∑ni=1(Ui − µ) converges in distribu-

tion as n → ∞ to a Gaussian distributed random variable with mean zero andvariance σ2 (see Appendix B).

Before proving the channel coding theorem for the above memoryless Gaus-sian channel with input power constraint P , we first show that its capacity C(P )as defined in (5.4.3) with t(x) = x2 admits a simple expression in terms of Pand the channel noise variance σ2. Indeed, we can write the channel mutualinformation I(X;Y ) between its input and output as follows:

I(X;Y ) = h(Y )− h(Y |X)

= h(Y )− h(X + Z|X) (5.4.9)

= h(Y )− h(Z|X) (5.4.10)

= h(Y )− h(Z) (5.4.11)

= h(Y )− 1

2log2

(2πeσ2

), (5.4.12)

where (5.4.9) follows from (5.4.8), (5.4.10) holds since differential entropy isinvariant under translation (see Property 8 of Lemma 5.14), (5.4.11) followsfrom the independence of X and Z, and (5.4.12) holds since Z ∼ N (0, σ2) isGaussian (see (5.1.1)). Now since Y = X + Z, we have that

E[Y 2] = E[X2] + E[Z2] + 2E[X]E[Z] = E[X2] + σ2 + 2E[X](0) ≤ P + σ2

since the input in (5.4.3) is constrained to satisfy E[X2] ≤ P . Thus the varianceof Y satisfies Var(Y ) ≤ E[Y 2] ≤ P + σ2, and

h(Y ) ≤ 1

2log2 (2πeVar(Y )) ≤ 1

2log2

(2πe(P + σ2)

)

where the first inequality follows by Theorem 5.20 since Y is real-valued (withsupport R). Noting that equality holds in the first inequality above iff Y isGaussian and in the second inequality iff Var(Y ) = P + σ2, we obtain thatchoosing the input X as X ∼ N (0, P ) yields Y ∼ N (0, P + σ2) and hence max-imizes I(X;Y ) over all inputs satisfying E[X2] ≤ P . Thus, the capacity of thediscrete-time memoryless Gaussian channel with input average power constraintP and noise variance (or power) σ2 is given by

C(P ) =1

2log2

(2πe(P + σ2)

)− 1

2log2

(2πeσ2

)

136

Page 149: It Lecture Notes2014

=1

2log2

(

1 +P

σ2

)

. (5.4.13)

Definition 5.30 Given positive integers n and M , and a discrete-time memory-less Gaussian channel with input average power constraint P , a fixed-length datatransmission code (or block code) C∼n = (n,M) for this channel with blocklengthn and rate 1

nlog2 M message bits per channel symbol (or channel use) consists

of:

1. M information messages intended for transmission.

2. An encoding function

f : 1, 2, . . . ,M → Rn

yielding real-valued codewords c1 = f(1), c2 = f(2), · · · , cM = f(M),where each codeword cm = (cm1, . . . , cmn) is of length n and satisfies thepower constraint P

1

n

n∑

i=1

c2i ≤ P,

for m = 1, 2, · · · ,M . The set of these M codewords is called the codebookand we usually write C∼n = c1, c2, · · · , cM to list the codewords.

3. A decoding function g : Rn → 1, 2, . . . ,M.

As in Chapter 4, we assume that a message W follows a uniform distributionover the set of messages: Pr[W = w] = 1

Mfor all w ∈ 1, 2, . . . ,M. Similarly,

to convey message W over the channel, the encoder sends its correspondingcodeword Xn = f(W ) ∈ C∼n at the channel input. Finally, Y n is received atthe channel output and the decoder yields W = g(Y n) as the message estimate.Also, the average probability of error for this block code used over the memorylessGaussian channel is defined as

Pe( C∼n) ,1

M

M∑

w=1

λw( C∼n),

where

λw( C∼n) , Pr[W 6= W |W = w] = Pr[g(Y n) 6= w|Xn = f(w)]

=

yn∈Rn: g(yn) 6=w

fY n|Xn(yn|f(w)) dyn

137

Page 150: It Lecture Notes2014

is the code’s conditional probability of decoding error given that message w issent over the channel. Here fY n|Xn(yn|xn) =

∏ni=1 fY |X(yi|xi) as the channel is

memoryless, where fY |X is the channel’s transition pdf.

We next prove that for a memoryless Gaussian channel with input averagepower constraint P , its capacity C(P ) has an operational meaning in the sensethat it is the supremum of all rates for which there exists a sequence of datatransmission block codes satisfying the power constraint and having a probabilityof error that vanishes with increasing blocklength.

Theorem 5.31 (Shannon’s coding theorem for the memoryless Gaus-sian channel) Consider a discrete-time memoryless Gaussian channel withinput average power constraint P , channel noise variance σ2 and capacity C(P )as given by (5.4.13).

• Forward part (achievability): For any ε ∈ (0, 1), there exist 0 < γ < 2ε anda sequence of data transmission block code C∼n = (n,Mn)∞n=1 satisfying

1

nlog2 Mn > C(P )− γ

with each codeword c = (c1, c2, . . . , cn) in C∼n satisfying

1

n

n∑

i=1

c2i ≤ P (5.4.14)

such that the probability of error Pe( C∼n) < ε for sufficiently large n.

• Converse part: If for any sequence of data transmission block codes C∼n =(n,Mn)∞n=1 whose codewords satisfy (5.4.14), we have that

lim infn→∞

1

nlog2 Mn > C(P ),

then the codes’ probability of error Pe( C∼n) is bounded away from zero forall n sufficiently large.

Proof of the forward part: The theorem holds trivially when C(P ) = 0because we can choose Mn = 1 for every n and have Pe( C∼n) = 0. Hence, weassume without loss of generality C(P ) > 0.

Step 0:

Take a positive γ satisfying γ < min2ε, C(P ). Pick ξ > 0 small enoughsuch that 2[C(P )−C(P − ξ)] < γ, where the existence of such ξ is assured

138

Page 151: It Lecture Notes2014

by the strictly increasing property of C(P ). Hence, we have C(P − ξ) −γ/2 > C(P )− γ > 0. Choose Mn to satisfy

C(P − ξ)− γ

2>

1

nlog2 Mn > C(P )− γ,

for which the choice should exist for all sufficiently large n. Take δ = γ/8.Let FX be the distribution that achieves C(P − ξ), where C(P ) is givenby (5.4.13). In this case, FX is the Gaussian distribution with mean zeroand variance P − ξ and admits a pdf fX . Hence, E[X2] ≤ P − ξ andI(X;Y ) = C(P − ξ).

Step 1: Random coding with average power constraint.

Randomly draw Mn codewords according to pdf fXn with

fXn(xn) =n∏

i=1

fX(xi).

By law of large numbers, each randomly selected codeword

cm = (cm1, . . . , cmn)

satisfies

limn→∞

1

n

n∑

i=1

c2mi = E[X2] ≤ P − ξ

for m = 1, 2, . . . ,Mn.

Step 2: Code construction.

For Mn selected codewords c1, . . . , cMn, replace the codewords that vio-

late the power constraint (i.e., (5.4.14)) by an all-zero (default) codeword0. Define the encoder as

fn(m) = cm for 1 ≤ m ≤ Mn.

Given a received output sequence yn, the decoder gn(·) is given by

gn(yn) =

m, if (cm, yn) ∈ Fn(δ)

and (∀ m′ 6= m) (cm′ , yn) 6∈ Fn(δ),arbitrary, otherwise,

where the set

Fn(δ) ,

(xn, yn) ∈ X n × Yn :

∣∣∣∣− 1

nlog2 fXnY n(xn, yn)− h(X, Y )

∣∣∣∣< δ,

139

Page 152: It Lecture Notes2014

∣∣∣∣− 1

nlog2 fXn(xn)− h(X)

∣∣∣∣< δ,

and

∣∣∣∣− 1

nlog2 fY n(yn)− h(Y )

∣∣∣∣< δ

is generated by fXnY n(xn, yn) =∏n

i=1 fXY (xi, yi) where fXnY n(xn, yn) isthe joint input-output pdf realized when the memoryless Gaussian channel(with n-fold transition pdf fY n|Xn(yn|xn) =

∏ni=1 fY |X(yi|xi)) is driven by

input Xn with pdf fXn(xn) =∏n

i=1 fX(xi) (where fX achieves C(P − ξ)).

Step 3: Conditional probability of error.

Let λm denote the conditional error probability given codeword m is trans-mitted. Define

E0 ,

xn ∈ X n :1

n

n∑

i=1

x2i > P

.

Then by following similar argument as (4.3.4),7 we get:

E[λm] ≤ PXn(E0) + PXn,Y n (F cn(δ))

+Mn∑

m′=1m′ 6=m

cm∈Xn

yn∈Fn(δ|cm′ )

fXn,Y n(cm, yn) dyndcm, (5.4.15)

7In this proof, specifically, (4.3.3) becomes:

λm( C∼n) ≤∫

yn 6∈Fn(δ|cm)

fY n|Xn(yn|cm) dyn +

Mn∑

m′=1m′ 6=m

yn∈Fn(δ|cm′ )

fY n|Xn(yn|cm) dyn.

By taking expectation with respect to the mth codeword-selecting distribution fXn(cm), weobtain

E[λm] =

cm∈Xn

fXn(cm)λm( C∼n) dcm

=

cm∈Xn∩E0

fXn(cm)λm( C∼n) dcm +

cm∈Xn∩Ec

0

fXn(cm)λm( C∼n) dcm

≤∫

cm∈E0

fXn(cm) dcm +

cm∈Xn

fXn(cm)λm( C∼n) dcm

≤ PXn(E0) +∫

cm∈Xn

yn 6∈Fn(δ|cm)

fXn(cm)fY n|Xn(yn|cm) dyndcm

+

cm∈Xn

Mn∑

m′=1m′ 6=m

yn∈Fn(δ|cm′ )

fXn(cm)fY n|Xn(yn|cm) dyndcm.

140

Page 153: It Lecture Notes2014

whereFn(δ|xn) , yn ∈ Yn : (xn, yn) ∈ Fn(δ) .

Note that the additional term PXn(E0) in (5.4.15) is to cope with theerrors due to all-zero codeword replacement, which will be less than δ forall sufficiently large n by the law of large numbers. Finally, by carryingout a similar procedure as in the proof of the channel coding theorem fordiscrete channels (cf. page 100), we obtain:

E[Pe(Cn)] ≤ PXn(E0) + PXn,Y n (F cn(δ))

+Mn · 2n(h(X,Y )+δ)2−n(h(X)−δ)2−n(h(Y )−δ)

≤ PXn(E0) + PXn,Y n (F cn(δ)) + 2n(C(P−ξ)−4δ) · 2−n(I(X;Y )−3δ)

= PXn(E0) + PXn,Y n (F cn(δ)) + 2−nδ.

Accordingly, we can make the average probability of error, E[Pe(Cn)], lessthan 3δ = 3γ/8 < 3ε/4 < ε for all sufficiently large n.

Proof of the converse part: Consider an (n,Mn) block data transmissioncode satisfying the power constraint (5.4.14) with encoding function

fn : 1, 2, . . . ,Mn → X n

and decoding functiongn : Yn → 1, 2, . . . ,Mn.

Since the message W is uniformly distributed over 1, 2, . . . ,Mn, we haveH(W ) = log2 Mn. Since W → Xn = fn(W ) → Y n forms a Markov chain(as Y n only depends on Xn), we obtain by the data processing lemma thatI(W ;Y n) ≤ I(Xn;Y n). We can also bound I(Xn;Y n) by C(P ) as follows:

I(Xn;Y n) ≤ supFXn : (1/n)

∑ni=1 E[X2

i ]≤P

I(Xn;Y n)

≤ supFXn : (1/n)

∑ni=1 E[X2

i ]≤P

n∑

j=1

I(Xj ;Yj) (by Theorem 2.21)

= sup(P1,P2,...,Pn) : (1/n)

∑ni=1 Pi=P

supFXn : (∀ i) E[X2

i ]≤Pi

n∑

j=1

I(Xj;Yj)

≤ sup(P1,P2,...,Pn) : (1/n)

∑ni=1 Pi=P

n∑

j=1

supFXn : (∀ i) E[X2

i ]≤Pi

I(Xj;Yj)

= sup(P1,P2,...,Pn) : (1/n)

∑ni=1 Pi=P

n∑

j=1

supFXj

: E[X2j ]≤Pj

I(Xj;Yj)

141

Page 154: It Lecture Notes2014

= sup(P1,P2,...,Pn):(1/n)

∑ni=1 Pi=P

n∑

j=1

C(Pj)

= sup(P1,P2,...,Pn):(1/n)

∑ni=1 Pi=P

nn∑

j=1

1

nC(Pj)

≤ sup(P1,P2,...,Pn):(1/n)

∑ni=1 Pi=P

nC

(

1

n

n∑

j=1

Pj

)

(by concavity of C(P ))

= nC(P ).

Consequently, recalling that Pe( C∼n) is the average error probability incurred byguessingW from observing Y n via the decoding function gn : Yn → 1, 2, . . . ,Mn,we get

log2 Mn = H(W )

= H(W |Y n) + I(W ;Y n)

≤ H(W |Y n) + I(Xn;Y n)

≤ hb(Pe( C∼n)) + Pe( C∼n) · log2(|W| − 1) + nC(P )

(by Fano′s inequality)

≤ 1 + Pe( C∼n) · log2(Mn − 1) + nC(P ),

(by the fact that (∀ t ∈ [0, 1]) hb(t) ≤ 1)

< 1 + Pe( C∼n) · log2Mn + nC(P ),

which implies that

Pe( C∼n) > 1− C(P )

(1/n) log2 Mn

− 1

log2 Mn

.

So if lim infn→∞(1/n) log2 Mn > C(P ), then there exists δ > 0 and an integer Nsuch that for n ≥ N ,

1

nlog2 Mn > C(P ) + δ.

Hence, for n ≥ N0 , maxN, 2/δ,

Pe( C∼n) ≥ 1− C(P )

C(P ) + δ− 1

n(C(P ) + δ)≥ δ

2(C(P ) + δ).

We next show that among all power-constrained continuous memoryless chan-nels with additive noise admitting a pdf, choosing a Gaussian distributed noiseyields the smallest channel capacity. In other words, the memoryless Gaussianmodel results in the most pessimistic (smallest) capacity within the class ofadditive-noise continuous memoryless channels.

142

Page 155: It Lecture Notes2014

Theorem 5.32 (Gaussian noise minimizes capacity of additive-noisechannels) Every discrete-time continuous memoryless channel with additivenoise (admitting a pdf) of mean zero and variance σ2 and input average powerconstraint P has its capacity C(P ) lower bounded by the capacity of the mem-oryless Gaussian channel with identical input constraint and noise variance:

C(P ) ≥ 1

2log2

(

1 +P

σ2

)

.

Proof: Let fY |X and fYg |Xgdenote the transition pdfs of the additive-noise chan-

nel and the Gaussian channel, respectively, where both channels satisfy inputaverage power constraint P . Let Z and Zg respectively denote their zero-meannoise variables of identical variance σ2. Writing the mutual information in termsof the channel’s transition pdf and input distribution as in Lemma 2.46, thenfor any Gaussian input with pdf fXg

with corresponding outputs Y and Yg whenapplied to channels fY |X and fYg |Xg

, respectively, we have that

I(fXg, fY |X)− I(fXg

, fYg |Xg)

=

X

YfXg

(x)fZ(y − x) log2fZ(y − x)

fY (y)dydx

−∫

X

YfXg

(x)fZg(y − x) log2

fZg(y − x)

fYg(y)

dydx

=

X

YfXg

(x)fZ(y − x) log2fZ(y − x)

fY (y)dydx

−∫

X

YfXg

(x)fZ(y − x) log2fZg

(y − x)

fYg(y)

dydx

=

X

YfXg

(x)fZ(y − x) log2fZ(y − x)fYg

(y)

fZg(y − x)fY (y)

dydx

≥∫

X

YfXg

(x)fZ(y − x)(log2 e)

(

1− fZg(y − x)fY (y)

fZ(y − x)fYg(y)

)

dydx

= (log2 e)

[

1−∫

Y

fY (y)

fYg(y)

(∫

XfXg

(x)fZg(y − x)dx

)

dy

]

= 0,

with equality holding in the inequality iff fY (y)/fYg(y) = fZ(y − x)/fZg

(y − x)for all x. Therefore,

1

2log2

(

1 +P

σ2

)

= supFX : E[X2]≤P

I(FX , fYg |Xg)

= I(f ∗Xg, fYg |Xg

)

143

Page 156: It Lecture Notes2014

≤ I(f ∗Xg, fY |X)

≤ supFX : E[X2]≤P

I(FX , fY |X)

= C(P ).

Observation 5.33 (Channel coding theorem for continuous memory-less channels) We close this section by noting that Theorem 5.31 can be gen-eralized to a wide class of discrete-time continuous memoryless channels withinput cost constraint (5.4.2) where the cost function t(·) is arbitrary, by show-ing that C(P ) , supFX :E[t(X)]≤P I(X;Y ) is the largest rate for which there existblock codes for the channel satisfying (5.4.2) which are reliably good (i.e., withasymptotically vanishing error probability). The proof is quite similar to thatof Theorem 5.31, except that some modifications are needed in the forward partas for a general (non-Gaussian) channel, the input distribution FX used to con-struct the random code may not admit a pdf (e.g., cf. [19, Chapter 7], [53,Theorem 11.14]).

5.5 Capacity of uncorrelated parallel Gaussian channels:The water-filling principle

Consider a network of k mutually-independent discrete-time memoryless Gaus-sian channels with respective positive noise powers (variances) σ2

1, σ22, . . . and σ2

k.If one wants to transmit information using these channels simultaneously (inparallel), what will be the system’s channel capacity, and how should the signalpowers for each channel be apportioned given a fixed overall power budget ? Theanswer to the above question lies in the so-called water-filling or water-pouringprinciple.

Theorem 5.34 (Capacity of uncorrelated parallel Gaussian channels)The capacity of k uncorrelated parallel Gaussian channels under an overall inputpower constraint P is given by

C(P ) =k∑

i=1

1

2log2

(

1 +Pi

σ2i

)

,

where σ2i is the noise variance of channel i,

Pi = max0, θ − σ2i ,

and θ is chosen to satisfy∑k

i=1 Pi = P . This capacity is achieved by a tupleof independent Gaussian inputs (X1, X2, · · · , Xk), where Xi ∼ N (0, Pi) is theinput to channel i, for i = 1, 2, · · · , k.

144

Page 157: It Lecture Notes2014

Proof: By definition,

C(P ) = supFXk :

∑ki=1 E[X2

k]≤P

I(Xk;Y k).

Since the noise random variables Z1, . . . , Zk are independent from each other,

I(Xk;Y k) = h(Y k)− h(Y k|Xk)

= h(Y k)− h(Zk +Xk|Xk)

= h(Y k)− h(Zk|Xk)

= h(Y k)− h(Zk)

= h(Y k)−k∑

i=1

h(Zi)

≤k∑

i=1

h(Yi)−k∑

i=1

h(Zi)

≤k∑

i=1

1

2log2

(

1 +Pi

σ2i

)

where the first inequality follows from the chain rule for differential entropy andthe fact that conditioning cannot increase differential entropy, and the secondinequality holds since output Yi of channel i due to input Xi with E[X2

i ] = Pi hasits differential entropy maximized if it is Gaussian distributed with zero-meanand variance Pi + σ2

i . Equalities hold above if all the Xi inputs are independentof each other with each input Xi ∼ N (0, Pi) such that

∑ki=1 Pi = P .

Thus the problem is reduced to finding the power allotment that maximizesthe overall capacity subject to the constraint

∑ki=1 Pi = P with Pi ≥ 0. By

using the Lagrange multiplier technique and verifying the KKT condition (seeExample B.20 in Appendix B.10), the maximizer (P1, . . . , Pk) of

max

k∑

i=1

1

2log2

(

1 +Pi

σ2i

)

+k∑

i=1

λiPi − ν

(k∑

i=1

Pi − P

)

can be found by taking the derivative of the above equation (with respect to Pi)and setting it to zero, which yields

λi =

− 1

2 ln(2)

1

Pi + σ2i

+ ν = 0, if Pi > 0;

− 1

2 ln(2)

1

Pi + σ2i

+ ν ≥ 0, if Pi = 0.

Hence,

Pi = θ − σ2i , if Pi > 0;

Pi ≥ θ − σ2i , if Pi = 0,

(equivalently, Pi = max0, θ − σ2i ),

145

Page 158: It Lecture Notes2014

where θ , log2 e/(2ν) is chosen to satisfy∑k

i=1 Pi = P .

We illustrate the above result in Figure 5.1 and elucidate why the Pi powerallotments form a water-filling (or water-pouting) scheme. In the figure, wehave a vessel where the height of each of the solid bins represents the noisepower of each channel (while the width is set to unity so that the area of eachbin yields the noise power of the corresponding Gaussian channel). We can thusvisualize the system as a vessel with an uneven bottom where the optimal inputsignal allocation Pi to each channel is realized by pouring an amount P units ofwater into the vessel (with the resulting overall area of filled water equal to P ).Since the vessel has an uneven bottom, water is unevenly distributed among thebins: noisier channels are allotted less signal power (note that in this example,channel 3, whose noise power is largest, is given no input power at all and ishence not used).

σ21

σ22

σ23

σ24

P1

θP2

P4

P = P1 + P2 + P4

Figure 5.1: The water-pouring scheme for uncorrelated parallel Gaus-sian channels. The horizontal dashed line, which indicates the levelwhere the water rises to, indicates the value of θ for which

∑ki=1 Pi = P .

Observation 5.35 Although it seems reasonable to allocate more power to lessnoisy channels for the optimization of the channel capacity according to thewater-filling principle, the Gaussian inputs required for achieving channel ca-pacity do not fit the digital communication system in practice. One may wonderwhat an optimal power allocation scheme will be when the channel inputs aredictated to be practically discrete in values, such as binary phase-shift keying(BPSK), quadrature phase-shift keying (QPSK), or 16 quadrature-amplitudemodulation (16-QAM). Surprisingly, a different procedure from the water-fillingprinciple results under certain conditions. By characterizing the relationship be-tween mutual information and minimum mean square error (MMSE) [23], the

146

Page 159: It Lecture Notes2014

optimal power allocation for parallel AWGN channels with inputs constrainedto be discrete is established, resulting in a new graphical power allocation inter-pretation called the mercury/water-filling principle [32]: i.e., mercury of properamounts [32, eq. (43)] must be individually poured into each bin before the wa-ter of amount P =

∑ki=1 Pi is added to the vessel. It is hence named (i.e., the

mercury/water-filling principle) because mercury is heavier than water and doesnot dissolve in it, and so can play the role of a pre-adjustment of bin heights.This research concludes that when the total transmission power P is small, thestrategy that maximizes the capacity follows approximately the equal signal-to-noise ratio (SNR) principle, i.e., a larger power should be allotted to a noisierchannel to optimize the capacity. Very recently, it was found that when the ad-ditive noise is no longer Gaussian, the mercury adjustment fails to interpret theoptimal power allocation scheme. For additive Gaussian noise with arbitrary dis-crete inputs, the pre-adjustment before the water pouring step is always upward;hence, a mercury-filling scheme is proposed to materialize the lifting of heightsof bin bases. However, since the pre-adjustment of heights of bin bases generallycan be in both up and down directions (see Example 1 in [49] for quaternary-input additive Laplacian noise channels), the use of the name mercury/waterfilling becomes inappropriate. For this reason, the graphical interpretation ofthe optimal power allocation principle is simply named two-phase water-fillingprinciple in [49]. We end this observation by emphasizing that the true measurefor a digital communication system is of no doubt the effective transmission ratesubject to an acceptably good transmission error (e.g., an overall bit error rate≤ 10−5). In order to make the analysis tractable, researchers adopt the channelcapacity as the design criterion instead, in the hope of obtaining a simple ref-erence scheme for practical systems. Further study is thus required to examinethe feasibility of such an approach (even though the water-filling principle iscertainly a theoretically interesting result).

5.6 Capacity of correlated parallel Gaussian channels

In the previous section, we considered a network of k parallel discrete-time mem-oryless Gaussian channels in which the noise samples from different channels areindependent from each other. We found out that the power allocation strat-egy that maximizes the system’s capacity is given by the water-filling scheme.We next study a network of k parallel memoryless Gaussian channels where thenoise variables from different channels are correlated. Surprisingly, we obtainthat water-filling provides also the optimal power allotment policy.

Let KZ denote the covariance matrix of the noise tuple (Z1, Z2, . . . , Zk), andlet KX denote the covariance matrix of the system input (X1, . . . , Xk), where weassume (without loss of the generality) that each Xi has zero mean. We assume

147

Page 160: It Lecture Notes2014

that KZ is positive definite. The input power constraint becomes

k∑

i=1

E[X2i ] = tr(KX) ≤ P,

where tr(·) denotes the trace of the k× k matrix KX . Since in each channel, theinput and noise variables are independent from each other, we have

I(Xk;Y k) = h(Y k)− h(Y k|Xk)

= h(Y k)− h(Zk +Xk|Xk)

= h(Y k)− h(Zk|Xk)

= h(Y k)− h(Zk).

Since h(Zk) is not determined by the input, determining the system’s capacityreduces to maximizing h(Y k) over all possible inputs (X1, . . . , Xk) satisfying thepower constraint.

Now observe that the covariance matrix of Y k is equal to KY = KX +KZ ,which implies by Theorem 5.20 that the differential entropy of Y k is upperbounded by

h(Y k) ≤ 1

2log2

[(2πe)kdet(KX +KZ)

],

with equality iff Y k Gaussian. It remains to find out whether we can find in-puts (X1, . . . , Xk) satisfying the power constraint which achieve the above upperbound and maximize it.

As in the proof of Theorem 5.18, we can orthogonally diagonalize KZ as

KZ = AΛAT ,

where AAT = Ik (and thus det(A)2 = 1), Ik is the k × k identity matrix,and Λ is a diagonal matrix with positive diagonal components consisting of theeigenvalues of KZ (as KZ is positive definite). Then

det(KX +KZ) = det(KX +AΛAT )

= det(AATKXAAT +AΛAT )

= det(A) · det(ATKXA+Λ) · det(AT )

= det(ATKXA+Λ)

= det(B+Λ),

where B , ATKXA. Since for any two matrices C and D, tr(CD) = tr(DC),we have that tr(B) = tr(ATKXA) = tr(AATKX) = tr(IkKX) = tr(KX). Thusthe capacity problem is further transformed to maximizing det(B +Λ) subjectto tr(B) ≤ P .

148

Page 161: It Lecture Notes2014

By observing that B+Λ is positive definite (because Λ is positive definite)and using Hadamard’s inequality given in Corollary 5.19, we have

det(B+Λ) ≤k∏

i=1

(Bii + λi),

where λi is the component of matrix Λ locating at ith row and ith column, whichis exactly the i-th eigenvalue of KZ . Thus, the maximum value of det(B + Λ)under tr(B) ≤ P is realized by a diagonal matrix B (to achieve equality inHadamard’s inequality) with

k∑

i=1

Bii = P.

Finally, as in the proof of Theorem 5.34, we obtain a water-filling allotment forthe optimal diagonal elements of B:

Bii = max0, θ − λi,

where θ is chosen to satisfy∑k

i=1 Bii = P . We summarize this result in the nexttheorem.

Theorem 5.36 (Capacity of correlated parallel Gaussian channels) Thecapacity of k correlated parallel Gaussian channels with positive-definite noisecovariance matrix KZ under overall input power constraint P is given by

C(P ) =k∑

i=1

1

2log2

(

1 +Pi

λi

)

,

where λi is the i-th eigenvalue of KZ ,

Pi = max0, θ − λi,

and θ is chosen to satisfy∑k

i=1 Pi = P . This capacity is achieved by a tuple ofzero-mean Gaussian inputs (X1, X2, · · · , Xk) with covariance matrix KX havingthe same eigenvectors as KZ , where the i-th eigenvalue of KX is Pi, for i =1, 2, · · · , k.

5.7 Non-Gaussian discrete-time memoryless channels

If a discrete-time channel has an additive but non-Gaussian memoryless noiseand an input power constraint, then it is often hard to calculate its capacity.Hence, in this section, we introduce an upper bound and a lower bound on thecapacity of such a channel (we assume that the noise admits a pdf).

149

Page 162: It Lecture Notes2014

Definition 5.37 (Entropy power) For a continuous random variable Z with(well-defined) differential entropy h(Z) (measured in bits), its entropy power isdenoted by Ze and defined as

Ze ,1

2πe22·h(Z).

Lemma 5.38 For a discrete-time continuous-alphabet memoryless additive-noisechannel with input power constraint P and noise variance σ2, its capacity satis-fies

1

2log2

P + σ2

Ze

≥ C(P ) ≥ 1

2log2

P + σ2

σ2. (5.7.1)

Proof: The lower bound in (5.7.1) is already proved in Theorem 5.32. The upperbound follows from

I(X;Y ) = h(Y )− h(Z)

≤ 1

2log2[2πe(P + σ2)]− 1

2log2[2πeZe].

The entropy power of Z can be viewed as the variance of a correspondingGaussian random variable with the same differential entropy as Z. Indeed, if Zis Gaussian, then its entropy power is equal to

Ze =1

2πe22h(Z) = Var(Z),

as expected.

Whenever two independent Gaussian random variables, Z1 and Z2, are added,the power (variance) of the sum is equal to the sum of the powers (variances) ofZ1 and Z2. This relationship can then be written as

22h(Z1+Z2) = 22h(Z1) + 22h(Z2),

or equivalentlyVar(Z1 + Z2) = Var(Z1) + Var(Z2).

However, when two independent random variables are non-Gaussian, the rela-tionship becomes

22h(Z1+Z2) ≥ 22h(Z1) + 22h(Z2), (5.7.2)

or equivalentlyZe(Z1 + Z2) ≥ Ze(Z1) + Ze(Z2). (5.7.3)

Inequality (5.7.2) (or equivalently (5.7.3)), whose proof can be found in [13,Section 17.8] or [9, Theorem 7.10.4], is called the entropy-power inequality. Itreveals that the sum of two independent random variables may introduce moreentropy power than the sum of each individual entropy power, except in theGaussian case.

150

Page 163: It Lecture Notes2014

5.8 Capacity of the band-limited white Gaussian channel

We have so far considered discrete-time channels (with discrete or continuousalphabets). We close this chapter by briefly presenting the capacity expressionof the continuous-time (waveform) band-limited channel with additive whiteGaussian noise. The reader is referred to [52], [19, Chapter 8], [2, Sections 8.2and 8.3] and [25, Chapter 6] for rigorous and detailed treatments (includingcoding theorems) of waveform channels.

The continuous-time band-limited channel with additive white Gaussian noiseis a common model for a radio network or a telephone line. For such a channel,illustrated in Figure 5.2, the output waveform is given by

Y (t) = (X(t) + Z(t)) ∗ h(t), t ≥ 0,

where “∗” represents the convolution operation (recall that the convolution be-tween two signals a(t) and b(t) is defined as a(t) ∗ b(t) =

∫∞−∞ a(τ)b(t − τ)dτ).

Here X(t) is the channel input waveform with average power constraint

limT→∞

1

T

∫ T/2

−T/2

E[X2(t)]dt ≤ P (5.8.1)

and bandwidth W cycles per second or Hertz (Hz); i.e., its spectrum or Fouriertransform X(f) , F [X(t)] =

∫ +∞−∞ X(t)e−j2πftdt = 0 for all frequencies |f | > W ,

where j =√−1 is the imaginary unit number. Z(t) is the noise waveform of a

zero-mean stationary white Gaussian process with power spectral density N0/2;i.e., its power spectral density PSDZ(f), which is the Fourier transform of theprocess covariance (equivalently, correlation) function KZ(τ) , E[Z(s)Z(s+τ)],s, τ ∈ R, is given by

PSDZ(f) = F [KZ(t)] =

∫ +∞

−∞KZ(t)e

−j2πftdt =N0

2∀f.

Finally, h(t) is the impulse response of an ideal bandpass filter with cutoff fre-quencies at ±W Hz:

H(f) = F [(h(t)] =

1 if −W ≤ f ≤ W ,

0 otherwise.

Recall that one can recover h(t) by taking the inverse Fourier transform of H(f);this yields

h(t) = F−1[H(f)] =

∫ +∞

−∞H(f)ej2πftdf = 2W sinc(2Wt),

151

Page 164: It Lecture Notes2014

where

sinc(t) ,sin(πt)

πt

is the sinc function and is defined to equal 1 at t = 0 by continuity.

X(t)

H(f)+

Waveform channel

Z(t)

Y (t)

Figure 5.2: Band-limited waveform channel with additive white Gaus-sian noise.

Note that we can write the channel output as

Y (t) = X(t) + Z(t)

where Z(t) , Z(t) ∗ h(t) is the filtered noise waveform. The input X(t) is notaffected by the ideal unit-gain bandpass filter since it has an identical bandwidthas h(t). Note also that the power spectral density of the filtered noise is givenby

PSDZ(f) = PSDZ(f)|H(f)|2 =

N0

2if −W ≤ f ≤ W ,

0 otherwise.

Taking the inverse Fourier transform of PSDZ(f) yields the covariance functionof the filtered noise process:

KZ(τ) = F−1[PSDZ(f)] = N0W sinc(2Wτ) τ ∈ R. (5.8.2)

To determine the capacity (in bits per second) of this continuous-time band-limited white Gaussian channel with parameters, P , W and N0, we convertit to an “equivalent” discrete-time channel with power constraint P by usingthe well-known Sampling theorem (due to Nyquist, Kotelnikov and Shannon),which states that sampling a band-limited signal with bandwidth W at a rate of1/(2W ) is sufficient to reconstruct the signal from its samples. Since X(t), Z(t)and Y (t) are all band-limited to [−W,W ], we can thus represent these signals by

152

Page 165: It Lecture Notes2014

their samples taken 12W

seconds apart and model the channel by a discrete-timechannel described by:

Yn = Xn + Zn, n = 0,±1,±2, · · · ,

where Xn , X( n2W

) are the input samples and Zn = Z( n2W

) and Yn = Y ( n2W

)

are the random samples of the noise Z(t) and output Y (t) signals, respectively.

Since Z(t) is a filtered version of Z(t), which is a zero-mean stationary Gaus-sian process, we obtain that Z(t) is also zero-mean, stationary and Gaussian.This directly implies that the samples Zn, n = 1, 2, · · · , are zero-mean Gaussianidentically distributed random variables. Now an examination of the expressionof KZ(τ) in (5.8.2) reveals that KZ(τ) = 0 for τ = n

2W, n = 1, 2, · · · , since

sinc(t) = 0 for all non-zero integer values of t. Hence, the random variables Zn,n = 1, 2, · · · , are uncorrelated and hence independent (since they are Gaussian)and their variance is given by E[Z2

n] = KZ(0) = N0W . We conclude that thediscrete-time process Zn∞n=1 is i.i.d. Gaussian with each Zn ∼ N (0, N0W ). Asa result, the above discrete-time channel is a discrete-time memoryless Gaussianchannel with power constraint P and noise variance N0W ; thus the capacity ofthe band-limited white Gaussian channel in bits per channel use is given using(5.4.13) by

1

2log2

(

1 +P

N0W

)

bits/channel use.

Given that we are using the channel (with inputs Xn) every1

2Wseconds, we ob-

tain that the capacity in bits/second of the band-limited white Gaussian channelis given by

C(P ) = W log2

(

1 +P

N0W

)

bits/second, (5.8.3)

where PN0W

is typically referred to as the signal-to-noise ratio (SNR).8

We emphasize that the above derivation of (5.8.3) is heuristic as we have notrigorously shown the equivalence between the original band-limited Gaussian

8(5.8.3) is achieved by zero-mean i.i.d. Gaussian Xn∞n=−∞ with E[X2n] = P , which can

be obtained by sampling a zero-mean, stationary and Gaussian X(t) with

PSDX(f) =

P2W if −W ≤ f ≤ W ,

0 otherwise.

Examining this X(t) confirms that it satisfies (5.8.1):

1

T

∫ T/2

−T/2

E[X2(t)]dt = E[X2(t)] = KX(0) = P · sinc(2W · 0) = P.

153

Page 166: It Lecture Notes2014

channel and its discrete-time version and we have not established a coding the-orem for the original channel. We point the reader to the references mentionedat the beginning of the section for a full development of this subject.

Example 5.39 (Telephone line channel) Suppose telephone signals are band-limited to 4 kHz. Given an SNR of 40 decibels (dB) – i.e., 10 log10

PN0W

= 40 dB– then from (5.8.3), we calculate that the capacity of the telephone line channel(when modeled via the band-limited white Gaussian channel) is given by

4000 log2(1 + 10000) = 53151.4 bits/second.

Example 5.40 (Infinite bandwidth white Gaussian channel) As the chan-nel bandwidth W grows without bound, we obtain from (5.8.3) that

limW→∞

C(P ) =P

N0

log2 e bits/second,

which indicates that in the infinite-bandwidth regime, capacity grows linearlywith power.

Observation 5.41 (Band-limited colored Gaussian channel) If the aboveband-limited channel has a stationary colored (non-white) additive Gaussiannoise, then it can be shown (e.g., see [19]) that the capacity of this channelbecomes

C(P ) =1

2

∫ W

−W

max

[

0, log2θ

PSDZ(f)

]

df,

where θ is the solution of

P =

∫ W

−W

max [0, θ − PSDZ(f)] df.

The above capacity formula is indeed reminiscent of the water-pouring scheme wesaw in Sections 5.5 and 5.6, albeit it is herein applied in the spectral domain. Inother words, we can view the curve of PSDZ(f) as a bowl, and water is imaginedbeing poured into the bowl up to level θ under which the area of the water isequal to P (see Figure 5.3.(a)). Furthermore, the distributed water indicates theshape of the optimum transmission power spectrum (see Figure 5.3.(b)).

154

Page 167: It Lecture Notes2014

(a) The spectrum of PSDZ(f) where the horizontal line represents θ,the level at which water rises to.

(b) The input spectrum that achieves capacity.

Figure 5.3: Water-pouring for the band-limited colored Gaussian chan-nel.

155

Page 168: It Lecture Notes2014

Chapter 6

Lossy Data Compression andTransmission

6.1 Fundamental concept on lossy data compression

6.1.1 Motivations

In operation, one sometimes needs to compress a source in a rate less thanentropy, which is known to be the minimum code rate for lossless data compres-sion. In such case, some sort of data loss is inevitable. People usually refer theresultant codes as lossy data compression code.

Some of the examples for requiring lossy data compression are made below.

Example 6.1 (Digitization or quantization of continuous signals) Theinformation content of continuous signals, such as voice or multi-dimensionalimages, is usually infinity. It may require an infinite number of bits to digitizesuch a source without data loss, which is not feasible. Therefore, a lossy datacompression code must be used to reduce the output of a continuous source toa finite number of bits.

Example 6.2 (Constraint on channel capacity) To transmit a source thr-ough a channel with capacity less than the source entropy is challenging. Asstated in channel coding theorem, it always introduces certain error for rateabove channel capacity. Recall that Fano’s inequality only provides a lowerbound on the amount of the decoding error, and does not tell us how large theerror is. Hence, the error could go beyond control when one desires to conveysource that generates information at rates above channel capacity.

In order to have a good control of the transmission error, another approachis to first reduce the data rate with manageable distortion, and then transmitthe source data at rates less than the channel capacity. With such approach,

156

Page 169: It Lecture Notes2014

the transmission error is only introduced at the (lossy) data compression stepsince the error due to transmission over channel can be made arbitrarily small(cf. Figure 6.1).

Channel withcapacity C

Source withH(X) > C

Output withunmanageable

error

Lossy datacompressorintroducingerror E

Compresseddata with

reduced rateRr < C Channel with

capacity C

Source withH(X) > C

Output withmanageableerror E

Figure 6.1: Example for applications of lossy data compression codes.

Example 6.3 (Extracting useful information) In some situation, some ofthe information is not useful for operational objective. A quick example willbe the hypothesis testing problem in which case the system designer concernsonly the likelihood ratio of the null hypothesis distribution against the alternativehypothesis distribution. Therefore, any two distinct source letters which producethe same likelihood ratio should not be encoded into different codewords. Theresultant code is usually a lossy data compression code since reconstruction ofthe source from certain code is usually impossible.

6.1.2 Distortion measures

A source is modelled as a random process Z1, Z2, . . . , Zn. For simplicity, weassume that the source discussed in this section is memoryless and with finitegeneric alphabet. Our objective is to compress the source with rate less thanentropy under a pre-specified criterion. In general, the criterion is given by adistortion measure as defined below.

Definition 6.4 (Distortion measure) A distortion measure is a mapping

ρ : Z × Z → R+,

157

Page 170: It Lecture Notes2014

where Z is the source alphabet, Z is the reproduction alphabet for the com-pressed code, and R+ is the set of non-negative real number.

From the above definition, the distortion measure ρ(z, z) can be viewed as acost of representing the source symbol z by a code symbol z. It is then expectedto choose a certain number of (typical) reproduction letters in Z to representthe source letters, which cost least.

When Z = Z, the selection of typical reproduction letters is similar to dividethe source letters into several groups, and then choose one element in each groupto represent the rest of the members in the same group. For example, supposethat Z = Z = 1, 2, 3, 4. Due to some constraints, we need to reduce thenumber of outcomes to 2, and the resultant expected cost can not be largerthan 0.5.1 The source is uniformly distributed. Given a distortion measure by amatrix as:

[ρ(i, j)] ,

0 1 2 21 0 2 22 2 0 12 2 1 0

,

the resultant two groups which cost least should be 1, 2 and 3, 4. We maychoose respectively 1 and 3 as the typical elements for these two groups (cf. Fig-ure 6.2). The expected cost of such selection is

1

4ρ(1, 1) +

1

4ρ(2, 1) +

1

4ρ(3, 3) +

1

4ρ(4, 3) =

1

2.

Note that the entropy of the source is reduced from 2 bits to 1 bit.

Sometimes, it is convenient to have |Z| = |Z|+ 1. For example,

|Z = 1, 2, 3| = 3, |Z = 1, 2, 3, E| = 4,

where E can be regarded as the erasure symbol, and the distortion measure isdefined by

[ρ(i, j)] ,

0 2 2 0.52 0 2 0.52 2 0 0.5

.

The source is again uniformly distributed. In this example, to denote sourceletters by distinct letters in 1, 2, 3 will cost four times than to represent themby E. Therefore, if only 2 outcomes are allowed, and the expected distortion

1Note that the constraints for lossy data compression code are usually specified on theresultant entropy and expected distortion. Here, instead of putting constraints on entropy, weadopt the number-of-outcome constraint simply because it is easier to understand, especiallyfor those who are not familiar with this subject.

158

Page 171: It Lecture Notes2014

1

3

2

4

⇒1

Representative for group 1, 2

3

Representative for group 3, 4

2

4

Figure 6.2: “Grouping” as one kind of lossy data compression.

cannot be greater than 1/3, then employing typical elements 1 and E to representsource 1 and 2, 3 is an optimal choice. The resultant entropy is reduced fromlog2(3) bits to [log2(3)− 2/3] bits.

It needs to be pointed out that to have |Z| > |Z| + 1 is usually not advan-tageous. Indeed, it has been proved that under some reasonable assumptions onthe distortion measure, to have larger reproduction alphabet than |Z| + 1 willnot perform better.

6.1.3 Frequently used distortion measures

Example 6.5 (Hamming distortion measure) Let source alphabet and re-production alphabet be the same, i.e., Z = Z. Then the Hamming distortionmeasure is given by

ρ(z, z) ,

0, if z = z;1, if z 6= z.

It is also named the probability-of-error distortion measure because

E[ρ(Z, Z)] = Pr(Z 6= Z).

Example 6.6 (Squared error distortion) Let source alphabet and reproduc-tion alphabet be the same, i.e., Z = Z. Then the squared error distortion isgiven by

ρ(z, z) , (z − z)2.

The squared error distortion measure is perhaps the most popular distortionmeasure used for continuous alphabets.

The squared error distortion has the advantages of simplicity and having closeform solution for most cases of interest, such as using least squares prediction.

159

Page 172: It Lecture Notes2014

Yet, such distortion measure has been criticized as an unhumanized criterion.For example, two speech waveforms in which one is a slightly time-shifted versionof the other may have large square error distortion; however, they sound verysimilar to human.

The above definition for distortion measures can be viewed as a single-letterdistortion measure since they consider only one random variable Z which draws asingle letter. For sources modelled as a sequence of random variables Z1, . . . , Zn,some extension needs to be made. A straightforward extension is the additivedistortion measure.

Definition 6.7 (Additive distortion measure) The additive distortion mea-sure ρn between sequences zn and zn is defined by

ρn(zn, zn) =

n∑

i=1

ρ(zi, zi).

Another example that is also based on a per-symbol distortion is the maxi-mum distortion measure:

Definition 6.8 (Maximum distortion measure)

ρn(zn, zn) = max

1≤i≤nρ(zi, zi).

After defining the distortion measures for source sequences, a natural ques-tion to ask is whether to reproduce source sequence zn by sequence zn of thesame length is a must or not. To be more precise, can we use zk to representzn for k 6= n? The answer is certainly yes if a distortion measure for zn andzk is defined. A quick example will be that the source is a ternary sequenceof length n, while the (fixed-length) data compression result is a set of binaryindexes of length k, which is taken as small as possible subject to some givenconstrains. Hence, k is not necessarily equal to n. One of the problems for tak-ing k 6= n is that the distortion measure for sequences can no longer be definedbased on per-letter distortions, and hence a per-letter formula for the best lossydata compression rate cannot be rendered.

In order to alleviate the aforementioned (k 6= n) problem, we claim that formost cases of interest, it is reasonable to assume k = n. This is because onecan actually implement the lossy data compression from Zn to Zk in two steps:the first step corresponds to lossy compression mapping hn : Zn → Zn, and thesecond step performs indexing hn(Zn) into Zk. For ease of understanding, thesetwo steps are illustrated below.

160

Page 173: It Lecture Notes2014

Step 1 : Find the data compression code

hn : Zn → Zn

for which the pre-specified distortion constraint and rate constraint areboth satisfied.

Step 2 : Derive the (asymptotically) lossless data compression block code forsource hn(Z

n). When n is sufficiently large, the existence of such code withblock length

k > H(h(Zn)) (equivalently, R =k

n>

1

nH(h(Zn))

is guaranteed by Shannon’s source coding theorem.

Through the above two steps, a lossy data compression code from

Zn → Zn︸ ︷︷ ︸

Step 1

→ 0, 1k

is established. Since the second step is already discussed in (asymptotically)lossless data compression context, we can say that the theorem regarding thelossy data compression is basically a theorem on the first step.

6.2 Fixed-length lossy data compression codes

Similar to lossless source coding theorem, the objective of information theoristsis to find the theoretical limit of the compression rate for lossy data compressioncodes. Before introducing the main theorem, we need to first define lossy datacompression codes.

Definition 6.9 (Fixed-length lossy data compression code subject toaverage distortion constraint) An (n,M,D) fixed-length lossy data com-pression code for source alphabet Zn and reproduction alphabet Zn consists ofa compression function

h : Zn → Zn

with the size of the codebook (i.e., the image h(Zn)) being |h(Zn)| = M , andthe average distortion satisfying

E

[1

nρn(Z

n, h(Zn))

]

≤ D.

161

Page 174: It Lecture Notes2014

Since the size of the codebook isM , it can be binary-indexed by k = ⌈log2 M⌉bits. Therefore, the average rate of such code is (1/n)⌈log2 M⌉ ≈ (1/n) log2 Mbits/sourceword.

Note that a parallel definition for variable-length source compression codecan also be defined. However, there is no conclusive results for the bound ofsuch code rate, and hence we omit it for the moment. This is also an interestingopen problem to research on.

After defining fixed-length lossy data compression codes, we are ready todefine the achievable rate-distortion pair.

Definition 6.10 (Achievable rate-distortion pair) For a given sequence ofdistortion measures ρnn≥1, a rate distortion pair (R,D) is achievable if thereexists a sequence of fixed-length lossy data compression codes (n,Mn, D) withultimate code rate lim supn→∞(1/n) logMn ≤ R.

With the achievable rate-distortion region, we define the rate-distortion func-tion as follows.

Definition 6.11 (Rate-distortion function) The rate-distortion function, de-noted by R(D), is equal to

R(D) , infR ∈ R : (R,D) is an achievable rate-distortion pair.

6.3 Rate-distortion function for discrete memorylesssources

The main result here is based on the discrete memoryless source (DMS) andbounded additive distortion measure. “Boundedness” of a distortion measuremeans that

max(z,z)∈Z×Z

ρ(z, z) < ∞.

The basic idea of choosing the data compression codewords from the set ofsource input symbols under DMS is to draw the codewords from the distortiontypical set. This set is defined similarly as the joint typical set for channels.

Definition 6.12 (Distortion typical set) For a memoryless distribution withgeneric marginal PZ,Z and a bounded additive distortion measure ρn(·, ·), the dis-tortion δ-typical set is defined by

Dn(δ) ,

(zn, zn) ∈ Zn × Zn :

162

Page 175: It Lecture Notes2014

∣∣∣∣− 1

nlogPZn(zn)−H(Z)

∣∣∣∣< δ,

∣∣∣∣− 1

nlogPZn(z

n)−H(Z)

∣∣∣∣< δ,

∣∣∣∣− 1

nlogPZn,Zn(z

n, zn)−H(Z, Z)

∣∣∣∣< δ,

and

∣∣∣∣

1

nρn(z

n, zn)− E[ρ(Z, Z)]

∣∣∣∣< δ

.

Note that this is the definition of the jointly typical set with additional con-straint on the distortion being close to the expected value. Since the additivedistortion measure between two joint i.i.d. random sequences is actually the sumof i.i.d. random variable, i.e.,

ρn(Zn, Zn) =

n∑

i=1

ρ(Zi, Zi),

the (weak) law-of-large-number holds for the distortion typical set. Therefore,an AEP-like theorem can be derived for distortion typical set.

Theorem 6.13 If (Z1, Z1), (Z2, Z2), . . ., (Zn, Zn), . . . are i.i.d., and ρn are bo-unded additive distortion measure, then

− 1

nlogPZn(Z1, Z2, . . . , Zn) → H(Z) in probability;

− 1

nlogPZn(Z1, Z2, . . . , Zn) → H(Z) in probability;

− 1

nlogPZn,Zn((Z1, Z1), . . . , (Zn, Zn)) → H(Z, Z) in probability;

and1

nρn(Z

n, Zn) → E[ρ(Z, Z)] in probability.

Proof: Functions of independent random variables are also independent randomvariables. Thus by the weak law of large numbers, we have the desired result.

It needs to be pointed out that without boundedness assumption, the nor-malized sum of an i.i.d. sequence is not necessary convergence in probability to afinite mean. That is why an additional condition, “boundedness” on distortionmeasure, is imposed, which guarantees the required convergence.

Theorem 6.14 (AEP for distortion measure) Given a discrete memory-less sources Z, a single-letter conditional distribution PZ|Z , and any δ > 0, theweakly distortion δ-typical set satisfies

163

Page 176: It Lecture Notes2014

1. PZn,Zn(Dcn(δ)) < δ for n sufficiently large;

2. for all (zn, zn) in Dn(δ),

PZn(zn) ≥ PZn|Zn(z

n|zn)e−n[I(Z;Z)+3δ]. (6.3.1)

Proof: The first one follows from the definition. The second one can be provedby

PZn|Zn(zn|zn) =

PZn,Zn(zn, zn)

PZn(zn)

= PZn(zn)

PZn,Zn(zn, zn)

PZn(zn)PZn(zn)

≤ PZn(zn)

e−n[H(Z,Z)−δ]

e−n[H(Z)+δ]e−n[H(Z)+δ]

= PZn(zn)en[I(Z;Z)+3δ].

Before we go further to the lossy data compression theorem, we also need thefollowing inequality.

Lemma 6.15 For 0 ≤ x ≤ 1, 0 ≤ y ≤ 1, and n > 0,

(1− xy)n ≤ 1− x+ e−yn, (6.3.2)

with equality holds if, and only if, (x, y) = (1, 0).

Proof: Let gy(t) , (1−yt)n. It can be shown by taking the second derivative ofgy(t) with respect to t that this function is strictly convex for t ∈ [0, 1]. Hence,for any x ∈ [0, 1],

(1− xy)n = gy((1− x) · 0 + x · 1

)

≤ (1− x) · gy(0) + x · gy(1)with equality holds if, and only if, (x = 0) ∨ (x = 1) ∨ (y = 0)

= (1− x) + x · (1− y)n

≤ (1− x) + x ·(e−y)n

with equality holds if, and only if, (x = 0) ∨ (y = 0)

≤ (1− x) + e−ny

with equality holds if, and only if, (x = 1).

164

Page 177: It Lecture Notes2014

From the above derivation, we know that equality holds for (6.3.2) if, and onlyif,

[(x = 0) ∨ (x = 1) ∨ (y = 0)] ∧ [(x = 0) ∨ (y = 0)] ∧ [x = 1] = (x = 1, y = 0).

(Note that (x = 0) represents (x, y) ∈ R2 : x = 0 and y ∈ [0, 1]. Similardefinition applies to the other sets.)

Theorem 6.16 (Rate distortion theorem) For DMS and bounded additivedistortion measure (namely,

ρmax , max(z,z)∈Z×Z

ρ(z, z) < ∞ and ρn(zn, zn) =

n∑

i=1

ρ(zi, zi)),

the rate-distortion function is

R(D) = minPZ|Z : E[ρ(Z,Z)]≤D

I(Z; Z).

Proof: Denote f(D) , minPZ|Z : E[ρ(Z,Z)]≤D I(Z; Z). Then we shall show that

R(D) defined in Definition 6.11 equals f(D).

1. Achievability (i.e., R(D + ε) ≤ f(D) + 4ε for arbitrarily small ε > 0): Weneed to show that for any ε > 0, there exist 0 < γ < 4ε and a sequence of lossydata compression codes (n,Mn, D + ε)∞n=1 with

lim supn→∞

1

nlogMn ≤ f(D) + γ.

The proof is as follows.

Step 1: Optimizer. Let PZ|Z be the optimizer of f(D), i.e.,

f(D) = minPZ|Z : E[ρ(Z,Z)]≤D

I(Z; Z) = I(Z; Z).

Then

E[ρ(Z, Z)] =1

nE[ρn(Z

n, Zn)] ≤ D.

Choose Mn to satisfy

f(D) +1

2γ ≤ 1

nlogMn ≤ f(D) + γ

for some γ in (0, 4ε), for which the choice should exist for all sufficientlylarge n > N0 for some N0. Define

δ , min

γ

8,

ε

1 + 2ρmax

.

165

Page 178: It Lecture Notes2014

Step 2: Random coding. Independently select Mn codewords from Zn ac-cording to

PZn(zn) =n∏

i=1

PZ(zi),

and denote this random codebook by C∼n, where

PZ(z) =∑

z∈ZPZ(z)PZ|Z(z|z).

Step 3: Encoding rule. Define a subset of Zn as

J ( C∼n) , zn ∈ Zn : ∃ zn ∈ C∼n such that (zn, zn) ∈ Dn(δ) ,

where Dn(δ) is defined under PZ|Z . Based on the codebook

C∼n = c1, c2, . . . , cMn,

define the encoding rule as:

hn(zn) =

cm, if (zn, cm) ∈ Dn(δ);(when more than one satisfying the requirement,just pick any.)

0, otherwise.

Note that when zn ∈ J ( C∼n), we have (zn, hn(zn)) ∈ Dn(δ) and

1

nρn(z

n, hn(zn)) ≤ E[ρ(Z, Z)] + δ ≤ D + δ.

Step 4: Calculation of probability outside J ( C∼n). Let N1 satisfying thatfor n > N1,

PZn,Zn(Dcn(δ)) < δ.

LetΩ , PZn(J c( C∼n)).

Then by random coding argument,

E[Ω] =∑

C∼n

PZn( C∼n)

zn 6∈J ( C∼n)

PZn(zn)

=∑

zn∈Zn

PZn(zn)

C∼n : zn 6∈J ( C∼n)

PZn( C∼n)

.

166

Page 179: It Lecture Notes2014

For any zn given, to select a codebook C∼n satisfying zn 6∈ J ( C∼n) is equiv-alent to independently draw Mn n-tuple from Zn which is not distortionjoint typical with zn. Hence,

C∼n : zn 6∈J ( C∼n)

PZn( C∼n) =(

Pr[

(zn, Zn) 6∈ Dn(δ)])Mn

.

For convenience, we let K(zn, zn) be the indicator function of Dn(δ), i.e.,

K(zn, zn) =

1, if (zn, zn) ∈ Dn(δ);0, otherwise.

Then

C∼n : zn 6∈J ( C∼n)

PZn( C∼n) =

1−∑

zn∈Zn

PZn(zn)K(zn, zn)

Mn

.

Continuing the computation of E[Ω], we get

E[Ω] =∑

zn∈Zn

PZn(zn)

1−∑

zn∈Zn

PZn(zn)K(zn, zn)

Mn

≤∑

zn∈Zn

PZn(zn)

1−∑

zn∈Zn

PZn|Zn(zn|zn)e−n(I(Z;Z)+3δ)K(zn, zn)

Mn

(by (6.3.1))

=∑

zn∈Zn

PZn(zn)

1− e−n(I(Z;Z)+3δ)∑

zn∈Zn

PZn|Zn(zn|zn)K(zn, zn)

Mn

≤∑

zn∈Zn

PZn(zn)

(

1−∑

zn∈Zn

PZn|Zn(zn|zn)K(zn, zn)

+ exp

−Mn · e−n(I(Z;Z)+3δ))

(from (6.3.2))

≤∑

zn∈Zn

PZn(zn)

(

1−∑

zn∈Zn

PZn|Zn(zn|zn)K(zn, zn)

+ exp

−en(f(D)+γ/2) · e−n(I(Z;Z)+3δ))

,

(for f(D) + γ/2 < (1/n) logMn)

≤ 1− PZn,Zn(Dn(δ)) + exp−enδ

,

167

Page 180: It Lecture Notes2014

(for f(D) = I(Z; Z) and δ ≤ γ/8)

= PZn,Zn(Dcn(δ)) + exp

−enδ

≤ δ + δ = 2δ, for n > N , max

N0, N1,1

δlog log

(1

minδ, 1

)

.

Since E[Ω] = E [PZn (J c( C∼n))] ≤ 2δ, there must exists a codebook C∼∗n

such that PZn (J c( C∼∗n)) is no greater than 2δ.

Step 5: Calculation of distortion. For the optimal codebook C∼∗n (from the

previous step) at n > N , its distortion is:

1

nE[ρn(Z

n, hn(Zn))] =

zn∈J ( C∼∗n)

PZn(zn)1

nρn(z

n, hn(zn))

+∑

zn 6∈J ( C∼∗n)

PZn(zn)1

nρn(z

n, hn(zn))

≤∑

zn∈J ( C∼∗n)

PZn(zn)(D + δ) +∑

zn 6∈J ( C∼∗n)

PZn(zn)ρmax

≤ (D + δ) + 2δ · ρmax

≤ D + δ(1 + 2ρmax)

≤ D + ε.

2. Converse Part (i.e., R(D + ε) ≥ f(D) for arbitrarily small ε > 0 andany D ∈ D ≥ 0 : f(D) > 0): We need to show that for any sequence of(n,Mn, Dn)∞n=1 code with

lim supn→∞

1

nlogMn < f(D),

there exists ε > 0 such that

Dn =1

nE[ρn(Z

n, hn(Zn))] > D + ε

for n sufficiently large.

The proof is as follows.

Step 1: Convexity of mutual information. By the convexity of mutual in-formation I(Z; Z) with respective to PZ|Z ,

I(Z; Zλ) ≤ λ · I(Z; Z1) + (1− λ) · I(Z; Z2),

where λ ∈ [0, 1], and

PZλ|Z(z|z) , λPZ1|Z(z|z) + (1− λ)PZ2|Z(z|z).

168

Page 181: It Lecture Notes2014

Step 2: Convexity of f(D). Let PZ1|Z and PZ2|Z be two distributions achiev-ing f(D1) and f(D2), respectively. Since

E[ρ(Z, Zλ)] =∑

z∈ZPZ(z)

z∈Z

PZλ|Z(z|z)ρ(z, z)

=∑

z∈ZPZ(z)

z∈Z

[

λPZ1|Z(z|z) + (1− λ)PZ2|Z(z|z)]

ρ(z, z)

= λD1 + (1− λ)D2,

we have

f(λD1 + (1− λ)D2) ≤ I(Z; Zλ)

≤ λI(Z; Z1) + (1− λ)I(Z; Z2)

= λf(D1) + (1− λ)f(D2).

Therefore, f(D) is a convex function.

Step 3: Strictly decreasingness and continuity of f(D).

By definition, f(D) is non-increasing in D. Also, f(D) = 0 for

D ≥ minPZ

z∈Z

z∈ZPZ(z)PZ(z)ρ(z, z)

(which is finite from boundedness of distortion measure). Together withits convexity, the strictly decreasingness and continuity of f(D) over D ≥0 : f(D) > 0 is proved.

Step 4: Main proof.

logMn ≥ H(hn(Zn))

= H(hn(Zn))−H(hn(Z

n)|Zn), since H(hn(Zn)|Zn) = 0;

= I(Zn;hn(Zn))

= H(Zn)−H(Zn|hn(Zn))

=n∑

i=1

H(Zi)−n∑

i=1

H(Zi|hn(Zn), Z1, . . . , Zi−1)

by the independence of Zn,

and chain rule for conditional entropy;

≥n∑

i=1

H(Zi)−n∑

i=1

H(Zi|Zi)

where Zi is the ith component of hn(Zn);

169

Page 182: It Lecture Notes2014

=n∑

i=1

I(Zi; Zi)

≥n∑

i=1

f(Di), where Di , E[ρ(Zi, Zi)];

= n

n∑

i=1

1

nf(Di)

≥ nf

(n∑

i=1

1

nDi

)

, by convexity of f(D);

= nf

(1

nE[ρn(Z

n, hn(Zn))]

)

,

where the last step follows since the distortion measure is additive. Finally,lim supn→∞(1/n) logMn < f(D) implies the existence of N and γ > 0 suchthat (1/n) logMn < f(D)− γ for all n > N . Therefore, for n > N ,

f

(1

nE[ρn(Z

n, hn(Zn))]

)

< f(D)− γ,

which, together with the strictly decreasing of f(D), implies

1

nE[ρn(Z

n, hn(Zn))] > D + ε

for some ε = ε(γ) > 0 and for all n > N .

3. Summary: For D ∈ D ≥ 0 : f(D) > 0, the achievability and converseparts jointly imply that f(D) + 4ε ≥ R(D + ε) ≥ f(D) for arbitrarily smallε > 0. Together with the continuity of f(D), we obtain that R(D) = f(D) forD ∈ D ≥ 0 : f(D) > 0.

For D ∈ D ≥ 0 : f(D) = 0, the achievability part gives us f(D) + 4ε =4ε ≥ R(D + ε) ≥ 0 for arbitrarily small ε > 0. This immediately implies thatR(D) = 0(= f(D)) as desired.

The formula of the rate-distortion function obtained in the previous theoremis also valid for the squared error distortion over real numbers, even if it isunbounded. Here, we put the boundedness assumption just to facilitate theexposition of the current proof. Readers may refer to Part II of the lecture notesfor a more general proof.

The discussion on lossy data compression, especially on continuous sources,will be continued in Section 6.4. Examples of the calculation of rate-distortionfunctions will also be given in the same section.

170

Page 183: It Lecture Notes2014

After introducing Shannon’s source coding theorem for block codes, Shan-non’s channel coding theorem for block codes and rate-distortion theorem fori.i.d. or stationary ergodic system setting, we would like to once again makeclear the “key concepts” behind these lengthy proofs, that is, typical-set andrandom-coding. The typical-set argument (specifically, δ-typical set for sourcecoding, joint δ-typical set for channel coding, and distortion typical set for rate-distortion) uses the law-of-large-number or AEP reasoning to claim the existenceof a set with very high probability; hence, the respective information manipu-lation can just focus on the set with negligible performance loss. The random-coding argument shows that the expectation of the desired performance over allpossible information manipulation schemes (randomly drawn according to someproperly chosen statistics) is already acceptably good, and hence the existenceof at least one good scheme that fulfills the desired performance index is vali-dated. As a result, in situations where the two above arguments apply, a similartheorem can often be established. Question is “Can we extend the theorems tocases where the two arguments fail?” It is obvious that only when new prov-ing technique (other than the two arguments) is developed can the answer beaffirmative. We will further explore this issue in Part II of the lecture notes.

6.4 Property and calculation of rate-distortion functions

6.4.1 Rate-distortion function for binary sources and Ham-ming additive distortion measure

A specific application of rate-distortion function that is useful in practice is whenbinary alphabets and Hamming additive distortion measure are assumed. TheHamming additive distortion measure is defined as:

ρn(zn, zn) =

n∑

i=1

zi ⊕ zi,

where “⊕” denotes modulo two addition. In such case, ρ(zn, zn) is exactly thenumber of bit errors or changes after compression. Therefore, the distortionbound D becomes a bound on the average probability of bit error. Specifically,among n compressed bits, it is expected to have E[ρ(Zn, Zn)] bit errors; hence,the expected value of bit-error-rate is (1/n)E[ρ(Zn, Zn)]. The rate-distortionfunction for binary sources and Hamming additive distortion measure is givenby the next theorem.

Theorem 6.17 Fix a memoryless binary source

Z = Zn = (Z1, Z2, . . . , Zn)∞n=1

171

Page 184: It Lecture Notes2014

with marginal distribution PZ(0) = 1 − PZ(1) = p. Assume the Hammingadditive distortion measure is employed. Then the rate-distortion function

R(D) =

hb(p)− hb(D), if 0 ≤ D < minp, 1− p;0, if D ≥ minp, 1− p,

where hb , −p · log2(p)− (1− p) · log2(1− p) is the binary entropy function.

Proof: Assume without loss of generality that p ≤ 1/2.

We first prove the theorem under 0 ≤ D < minp, 1− p = p. Observe that

H(Z|Z) = H(Z ⊕ Z|Z).

Also observe that

E[ρ(Z, Z)] ≤ D implies PrZ ⊕ Z = 1 ≤ D.

Then

I(Z; Z) = H(Z)−H(Z|Z)= hb(p)−H(Z ⊕ Z|Z)≥ hb(p)−H(Z ⊕ Z) (conditioning never increase entropy)

≥ hb(p)− hb(D),

where the last inequality follows since the binary entropy function hb(x) is in-creasing for x ≤ 1/2, and PrZ ⊕ Z = 1 ≤ D. Since the above derivation istrue for any PZ|Z , we have

R(D) ≥ hb(p)− hb(D).

It remains to show that the lower bound is achievable by some PZ|Z , or equiv-

alently, H(Z|Z) = hb(D) for some PZ|Z . By defining PZ|Z(0|0) = PZ|Z(1|1) =1 − D, we immediately obtain H(Z|Z) = hb(D). The desired PZ|Z can beobtained by solving

1 = PZ(0) + PZ(1)

=PZ(0)

PZ|Z(0|0)PZ|Z(0|0) +

PZ(0)

PZ|Z(0|1)PZ|Z(1|0)

=p

1−DPZ|Z(0|0) +

p

D(1− PZ|Z(0|0))

and

1 = PZ(0) + PZ(1)

172

Page 185: It Lecture Notes2014

=PZ(1)

PZ|Z(1|0)PZ|Z(0|1) +

PZ(1)

PZ|Z(1|1)PZ|Z(1|1)

=1− p

D(1− PZ|Z(1|1)) +

1− p

1−DPZ|Z(1|1),

and yield

PZ|Z(0|0) =1−D

1− 2D

(

1− D

p

)

and PZ|Z(1|1) =1−D

1− 2D

(

1− D

1− p

)

.

This completes the proof for 0 ≤ D < minp, 1− p = p.

Now in the case of p ≤ D < 1 − p, we can let PZ|Z(1|0) = PZ|Z(1|1) = 1 to

obtain I(Z; Z) = 0 and

E[ρ(Z, Z)] =1∑

z=0

1∑

z=0

PZ(z)PZ|Z(z|z)ρ(z, z) = p ≤ D.

Similarly, in the case of D ≥ 1−p, we let PZ|Z(0|0) = PZ|Z(0|1) = 1 to obtain

I(Z; Z) = 0 and

E[ρ(Z, Z)] =1∑

z=0

1∑

z=0

PZ(z)PZ|Z(z|z)ρ(z, z) = 1− p ≤ D.

6.4.2 Rate-distortion function for Gaussian source andsquared error distortion measure

In this subsection, we actually show that the Gaussian source maximizes the rate-distortion function under square error distortion measure. The rate-distortionfunction for Gaussian source and squared error distortion measure is thus anupper bound for all other continuous sources of the same variance as summarizedin the next theorem. The result parallels Theorem 5.20 which states that theGaussian source maximizes the differential entropy among all continuous sourcesof the same covariance matrix. In a sense, one may draw the conclusion that theGaussian source is rich in content and is the most difficult one to be compressed(subject to squared-error distortion measure).

Theorem 6.18 Under the squared error distortion measure, namely

ρ(z, z) = (z − z)2,

173

Page 186: It Lecture Notes2014

the rate-distortion function for continuous source Z with zero mean and varianceσ2 satisfies

R(D) ≤

1

2log

σ2

D, for 0 ≤ D ≤ σ2;

0, for D > σ2.

Equality holds when Z is Gaussian.

Proof: By Theorem 6.16 (extended to “unbounded” squared-error distortionmeasure),

R(D) = minpZ|Z : E[(Z−Z)2]≤D

I(Z; Z).

So for any pZ|Z satisfying the distortion constraint,

R(D) ≤ I(pZ , pZ|Z).

For 0 ≤ D ≤ σ2, choose a dummy Gaussian random variable W with zeromean and variance aD, where a = 1 − D/σ2, and is independent of Z. LetZ = aZ +W . Then

E[(Z − Z)2] = E[(1− a)2Z2] + E[W 2]

= (1− a)2σ2 + aD = D

which satisfies the distortion constraint. Note that the variance of Z is equal toE[a2Z2] + E[W 2] = σ2 −D. Consequently,

R(D) ≤ I(Z; Z)

= h(Z)− h(Z|Z)= h(Z)− h(W + aZ|Z)= h(Z)− h(W |Z) (By Lemma 5.14)

= h(Z)− h(W ) (By Independence of W and Z)

= h(Z)− 1

2log(2πe(aD))

≤ 1

2log(2πe(σ2 −D))− 1

2log(2πe(aD))

=1

2log

σ2

D.

For D > σ2, let Z satisfy PrZ = 0 = 1, and be independent of Z. ThenE[(Z − Z)2] = E[Z2] + E[Z2]− 2E[Z]E[Z] = σ2 < D, and I(Z; Z) = 0. Hence,R(D) = 0 for D > σ2.

174

Page 187: It Lecture Notes2014

The achievability of this upper bound by Gaussian source can be proved byshowing that under the Gaussian source, (1/2) log(σ2/D) is a lower bound toR(D) for 0 ≤ D ≤ σ2.

For Gaussian source Z with E[(Z − Z)2] ≤ D,

I(Z; Z) = h(Z)− h(Z|Z)=

1

2log(2πeσ2)− h(Z − Z|Z) (Lemma 5.14)

≥ 1

2log(2πeσ2)− h(Z − Z) (Lemma 5.14)

≥ 1

2log(2πeσ2)− 1

2log(

2πeVar[(Z − Z)])

(Theorem 5.20)

≥ 1

2log(2πeσ2)− 1

2log(

2πeE[(Z − Z)2])

≥ 1

2log(2πeσ2)− 1

2log (2πeD)

=1

2log

σ2

D.

6.5 Joint source-channel information transmission

At the end of Part I of the lecture notes, we continue from Fig. 6.1 the discussionabout whether a source can be transmitted via a noisy channel of capacity C(S)within distortionD. Note that if the source entropy rate is less than the capacity,reliable transmission is achievable and hence the answer to the above query isaffirmative for arbitrary D. So the query is mostly concerned when the capacityis not large enough to reliably convey the source (i.e., the source entropy rateexceeds the capacity). This query then leads to the following theorem.

Theorem 6.19 (Joint source-channel coding theorem) Fix a distortionmeasure. A DMS can be reproduced at the output of a channel with distortionless than D (by taking sufficiently large blocklength), if

R(D)

Ts

<C(S)

Tc

,

where Ts and Tc represent the durations per source letter and per channel input,respectively. Conversely, all data transmission codes will have average distortionlarger than D for sufficiently large blocklength, if

R(D)

Ts

>C(S)

Tc

.

175

Page 188: It Lecture Notes2014

Note that the units of R(D) and C(S) should be the same, i.e., they should bemeasured both in nats (by taking natural logarithm), or both in bits (by takingbase-2 logarithm).

The theorem is simply a consequence of Shannon’s channel coding theorem(i.e., Theorem 4.10) and Rate-distortion theorem (i.e., Theorem 6.16) and hencewe omit its proof but focus on examples that can reveal its significance.

Example 6.20 (Binary-input additive white Gaussian noise (AWGN)channel) Assume that the discrete-time binary source to be transmitted is mem-oryless with uniform marginal distribution, and the discrete-time channel hasbinary input alphabet and real-line output alphabet with Gaussian transitionprobability. Denote by Pb the probability of bit error.

From Theorem 6.17, the rate-distortion function for binary input and Ham-ming additive distortion measure is

R(D) =

1− hb(D), if 0 ≤ D ≤ 1

2;

0, if D >1

2,

where hb(x) = −x log2(x) − (1 − x) log2(1 − x) is the binary entropy function.Notably, the distortion bound D is exactly a bound on bit error rate Pb sinceHamming additive distortion measure is used.

According to [12], the channel capacity-cost function for antipodal binary-input AWGN channel is

C(S) =S

σ2log2(e)−

1√2π

∫ ∞

−∞e−y2/2 log2

[

cosh

(

S

σ2+ y

S

σ2

)]

dy

=EbTc/Ts

N0/2log2(e)−

1√2π

∫ ∞

−∞e−y2/2 log2

[

cosh

(

EbEc/Ts

N0/2+ y

EbTc/Ts

N0/2

)]

dy

= 2Rγb log2(e)−1√2π

∫ ∞

−∞e−y2/2 log2[cosh(2Rγb + y

2Rγb)]dy,

where R = Tc/Ts is the code rate for data transmission and is measured in theunit of source letter/channel usage (or information bit/channel bit), and γb (oftendenoted by Eb/N0) is the signal-to-noise ratio per information bit.

Then from the joint source-channel coding theorem, good codes exist when

R(D) <Ts

Tc

C(S),

176

Page 189: It Lecture Notes2014

or equivalently

1−hb(Pb) <1

R

[

2Rγb log2(e)−1√2π

∫ ∞

−∞e−y2/2 log2[cosh(2Rγb + y

2Rγb)]dy

]

.

By re-formulating the above inequality as

hb(Pb) > 1− 2γb log2(e) +1

R√2π

∫ ∞

−∞e−y2/2 log2[cosh(2Rγb + y

2Rγb)]dy,

a lower bound on the bit error probability as a function of γb is established.This is the Shannon limit for any code to achieve binary-input Gaussian channel(cf. Fig. 6.3).

γb (dB)

Pb

Shannon Limit

R = 1/2R = 1/3

1

10−1

10−2

10−3

10−4

10−5

10−6

−6 −5 −4 −3 −2 −1 −.496 .186 1

Figure 6.3: The Shannon limits for (2, 1) and (3, 1) codes under antipo-dal binary-input AWGN channel.

The result in the above example becomes important due to the inventionof the Turbo coding [3, 4], for which a near-Shannon-limit performance is firstobtained. Specifically, the half-rate turbo coding system proposed in [3, 4] canapproach the bit error rate of 10−5 at γb = 0.9 dB, which is only 0.7 dB awayfrom the Shannon limit 0.19 dB. This implies that a near-optimal channel codehas been constructed, since in principle, no codes can perform better than theShannon limit.

177

Page 190: It Lecture Notes2014

Example 6.21 (AWGN channel with real input) Assume that the binarysource is memoryless with uniform marginal distribution, and the channel hasboth real-line input and real-line output alphabets with Gaussian transitionprobability. Denote by Pb the probability of bit error.

Again, the rate-distortion function for binary input and Hamming additivedistortion measure is

R(D) =

1− hb(D), if 0 ≤ D ≤ 1

2;

0, if D >1

2,

In addition, the channel capacity-cost function for real-input AWGN channel is

C(S) =1

2log2

(

1 +S

σ2

)

=1

2log2

(

1 +EbTc/Ts

N0/2

)

=1

2log2 (1 + 2Rγb) bits/channel symbol,

where R = Tc/Ts is the code rate for data transmission and is measured in theunit of information bit/channel usage, and γb = Eb/N0 is the signal-to-noiseratio per information bit.

Then from the joint source-channel coding theorem, good codes exist when

R(D) <Ts

Tc

C(S),

or equivalently

1− hb(Pb) <1

R

[1

2log2 (1 + 2Rγb)

]

.

By re-formulating the above inequality as

hb(Pb) > 1− 1

2Rlog2 (1 + 2Rγb) ,

a lower bound on the bit error probability as a function of γb is established. Thisis the Shannon limit for any code to achieve for real input Gaussian channel(cf. Fig. 6.4).

178

Page 191: It Lecture Notes2014

γb (dB)

Pb

Shannon Limit

R = 1/2R = 1/3

1

10−1

10−2

10−3

10−4

10−5

10−6

−6 −5 −4 −3 −2 −1 −.55−.001 1

Figure 6.4: The Shannon limits for (2, 1) and (3, 1) codes undercontinuous-input AWGN channels.

179

Page 192: It Lecture Notes2014

Appendix A

Overview on Suprema and Limits

We herein review basic results on suprema and limits which are useful for thedevelopment of information theoretic coding theorems; they can be found instandard real analysis texts (e.g., see [36, 50]).

A.1 Supremum and maximum

Throughout, we work on subsets of R, the set of real numbers.

Definition A.1 (Upper bound of a set) A real number u is called an upperbound of a non-empty subset A of R if every element of A is less than or equalto u; we say that A is bounded above. Symbolically, the definition becomes:

A ⊂ R is bounded above ⇐⇒ (∃ u ∈ R) such that (∀ a ∈ A), a ≤ u.

Definition A.2 (Least upper bound or supremum) Suppose A is a non-empty subset of R. Then we say that a real number s is a least upper bound orsupremum of A if s is an upper bound of the set A and if s ≤ s′ for each upperbound s′ of A. In this case, we write s = supA; other notations are s = supx∈A xand s = supx : x ∈ A.

Completeness Axiom: (Least upper bound property) Let A be a non-empty subset of R that is bounded above. Then A has a least upper bound.

It follows directly that if a non-empty set in R has a supremum, then thissupremum is unique. Furthermore, note that the empty set (∅) and any setnot bounded above do not admit a supremum in R. However, when workingin the set of extended real numbers given by R ∪ −∞,∞, we can define the

180

Page 193: It Lecture Notes2014

supremum of the empty set as −∞ and that of a set not bounded above as ∞.These extended definitions will be adopted in the text.

We now distinguish between two situations: (i) the supremum of a set Abelongs to A, and (ii) the supremum of a set A does not belong to A. It is quiteeasy to create examples for both situations. A quick example for (i) involvesthe set (0, 1], while the set (0, 1) can be used for (ii). In both examples, thesupremum is equal to 1; however, in the former case, the supremum belongs tothe set, while in the latter case it does not. When a set contains its supremum,we call the supremum the maximum of the set.

Definition A.3 (Maximum) If supA ∈ A, then supA is also called the max-imum of A, and is denoted by maxA. However, if supA 6∈ A, then we say thatthe maximum of A does not exist.

Property A.4 (Properties of the supremum)

1. The supremum of any set in R ∪ −∞,∞ always exits.

2. (∀ a ∈ A) a ≤ supA.

3. If −∞ < supA < ∞, then (∀ ε > 0)(∃ a0 ∈ A) a0 > supA− ε.(The existence of a0 ∈ (supA−ε, supA] for any ε > 0 under the conditionof | supA| < ∞ is called the approximation property for the supremum.)

4. If supA = ∞, then (∀ L ∈ R)(∃ B0 ∈ A) B0 > L.

5. If supA = −∞, then A is empty.

Observation A.5 In Information Theory, a typical channel coding theoremestablishes that a (finite) real number α is the supremum of a set A. Thus, toprove such a theorem, one must show that α satisfies both properties 3 and 2above, i.e.,

(∀ ε > 0)(∃ a0 ∈ A) a0 > α− ε (A.1.1)

and(∀ a ∈ A) a ≤ α, (A.1.2)

where (A.1.1) and (A.1.2) are called the achievability (or forward) part and theconverse part, respectively, of the theorem. Specifically, (A.1.2) states that α isan upper bound of A, and (A.1.1) states that no number less than α can be anupper bound for A.

181

Page 194: It Lecture Notes2014

Property A.6 (Properties of the maximum)

1. (∀ a ∈ A) a ≤ maxA, if maxA exists in R ∪ −∞,∞.

2. maxA ∈ A.

From the above property, in order to obtain α = maxA, one needs to showthat α satisfies both

(∀ a ∈ A) a ≤ α and α ∈ A.

A.2 Infimum and minimum

The concepts of infimum and minimum are dual to those of supremum andmaximum.

Definition A.7 (Lower bound of a set) A real number ℓ is called a lowerbound of a non-empty subset A in R if every element of A is greater than orequal to ℓ; we say that A is bounded below. Symbolically, the definition becomes:

A ⊂ R is bounded below ⇐⇒ (∃ ℓ ∈ R) such that (∀ a ∈ A) a ≥ ℓ.

Definition A.8 (Greatest lower bound or infimum) Suppose A is a non-empty subset of R. Then we say that a real number ℓ is a greatest lower boundor infimum of A if ℓ is a lower bound of A and if ℓ ≥ ℓ′ for each lower boundℓ′ of A. In this case, we write ℓ = infA; other notations are ℓ = infx∈A x andℓ = infx : x ∈ A.

Completeness Axiom: (Greatest lower bound property) Let A be anon-empty subset of R that is bounded below. Then A has a greatest lowerbound.

As for the case of the supremum, it directly follows that if a non-empty setin R has an infimum, then this infimum is unique. Furthermore, working in theset of extended real numbers, the infimum of the empty set is defined as ∞ andthat of a set not bounded below as −∞.

Definition A.9 (Minimum) If infA ∈ A, then infA is also called the min-imum of A, and is denoted by minA. However, if infA 6∈ A, we say that theminimum of A does not exist.

182

Page 195: It Lecture Notes2014

Property A.10 (Properties of the infimum)

1. The infimum of any set in R ∪ −∞,∞ always exists.

2. (∀ a ∈ A) a ≥ infA.

3. If ∞ > infA > −∞, then (∀ ε > 0)(∃ a0 ∈ A) a0 < infA+ ε.(The existence of a0 ∈ [infA, infA+ε) for any ε > 0 under the assumptionof | infA| < ∞ is called the approximation property for the infimum.)

4. If infA = −∞, then (∀A ∈ R)(∃ B0 ∈ A)B0 < L.

5. If infA = ∞, then A is empty.

Observation A.11 Analogously to Observation A.5, a typical source codingtheorem in Information Theory establishes that a (finite) real number α is theinfimum of a set A. Thus, to prove such a theorem, one must show that αsatisfies both properties 3 and 2 above, i.e.,

(∀ ε > 0)(∃ a0 ∈ A) a0 < α + ε (A.2.1)

and(∀ a ∈ A) a ≥ α. (A.2.2)

Here, (A.2.1) is called the achievability or forward part of the coding theorem;it specifies that no number greater than α can be a lower bound for A. Also,(A.2.2) is called the converse part of the theorem; it states that α is a lowerbound of A.

Property A.12 (Properties of the minimum)

1. (∀ a ∈ A) a ≥ minA, if minA exists in R ∪ −∞,∞.

2. minA ∈ A.

A.3 Boundedness and suprema operations

Definition A.13 (Boundedness) A subset A of R is said to be bounded if itis both bounded above and bounded below; otherwise it is called unbounded.

Lemma A.14 (Condition for boundedness) A subset A of R is bounded iff(∃ k ∈ R) such that (∀ a ∈ A) |a| ≤ k.

183

Page 196: It Lecture Notes2014

Lemma A.15 (Monotone property) Suppose that A and B are non-emptysubsets of R such that A ⊂ B. Then

1. supA ≤ supB.

2. infA ≥ inf B.

Lemma A.16 (Supremum for set operations) Define the “addition” of twosets A and B as

A+ B , c ∈ R : c = a+ b for some a ∈ A and b ∈ B.

Define the “scaler multiplication” of a set A by a scalar k ∈ R as

k · A , c ∈ R : c = k · a for some a ∈ A.

Finally, define the “negation” of a set A as

−A , c ∈ R : c = −a for some a ∈ A.

Then the following hold.

1. If A and B are both bounded above, then A + B is also bounded aboveand sup(A+ B) = supA+ supB.

2. If 0 < k < ∞ and A is bounded above, then k · A is also bounded aboveand sup(k · A) = k · supA.

3. supA = − inf(−A) and infA = − sup(−A).

Property 1 does not hold for the “product” of two sets, where the “product”of sets A and B is defined as as

A · B , c ∈ R : c = ab for some a ∈ A and b ∈ B.

In this case, both of these two situations can occur:

sup(A · B) > (supA) · (supB)sup(A · B) = (supA) · (supB).

184

Page 197: It Lecture Notes2014

Lemma A.17 (Supremum/infimum for monotone functions)

1. If f : R → R is a non-decreasing function, then

supx ∈ R : f(x) < ε = infx ∈ R : f(x) ≥ ε

andsupx ∈ R : f(x) ≤ ε = infx ∈ R : f(x) > ε.

2. If f : R → R is a non-increasing function, then

supx ∈ R : f(x) > ε = infx ∈ R : f(x) ≤ ε

andsupx ∈ R : f(x) ≥ ε = infx ∈ R : f(x) < ε.

The above lemma is illustrated in Figure A.1.

A.4 Sequences and their limits

Let N denote the set of “natural numbers” (positive integers) 1, 2, 3, · · · . Asequence drawn from a real-valued function is denoted by

f : N → R.

In other words, f(n) is a real number for each n = 1, 2, 3, · · · . It is usual to writef(n) = an, and we often indicate the sequence by any one of these notations

a1, a2, a3, · · · , an, · · · or an∞n=1.

One important question that arises with a sequence is what happens when ngets large. To be precise, we want to know that when n is large enough, whetheror not every an is close to some fixed number L (which is the limit of an).

Definition A.18 (Limit) The limit of an∞n=1 is the real number L satisfying:(∀ ε > 0)(∃ N) such that (∀ n > N)

|an − L| < ε.

In this case, we write L = limn→∞ an. If no such L satisfies the above statement,we say that the limit of an∞n=1 does not exist.

185

Page 198: It Lecture Notes2014

f(x)

ε

supx : f(x) < ε= infx : f(x) ≥ ε

supx : f(x) ≤ ε= infx : f(x) > ε

f(x)

ε

supx : f(x) ≥ ε= infx : f(x) < ε

supx : f(x) > ε= infx : f(x) ≤ ε

Figure A.1: Illustration of Lemma A.17.

Property A.19 If an∞n=1 and bn∞n=1 both have a limit in R, then the fol-lowing hold.

1. limn→∞(an + bn) = limn→∞ an + limn→∞ bn.

2. limn→∞(α · an) = α · limn→∞ an.

3. limn→∞(anbn) = (limn→∞ an)(limn→∞ bn).

Note that in the above definition, −∞ and ∞ cannot be a legitimate limitfor any sequence. In fact, if (∀ L)(∃ N) such that (∀ n > N) an > L, then we

186

Page 199: It Lecture Notes2014

say that an diverges to ∞ and write an → ∞. A similar argument applies toan diverging to −∞. For convenience, we will work in the set of extended realnumbers and thus state that a sequence an∞n=1 that diverges to either ∞ or−∞ has a limit in R ∪ −∞,∞.

Lemma A.20 (Convergence of monotone sequences) If an∞n=1 is non-de-creasing in n, then limn→∞ an exists in R∪−∞,∞. If an∞n=1 is also boundedfrom above – i.e., an ≤ L ∀n for some L in R – then limn→∞ an exists in R.

Likewise, if an∞n=1 is non-increasing in n, then limn→∞ an exists in R ∪−∞,∞. If an∞n=1 is also bounded from below – i.e., an ≥ L ∀n for some Lin R – then limn→∞ an exists in R.

As stated above, the limit of a sequence may not exist. For example, an =(−1)n. Then an will be close to either −1 or 1 for n large. Hence, more general-ized definitions that can describe the general limiting behavior of a sequence isrequired.

Definition A.21 (limsup and liminf) The limit supremum of an∞n=1 is theextended real number in R ∪ −∞,∞ defined by

lim supn→∞

an , limn→∞

(supk≥n

ak),

and the limit infimum of an∞n=1 is the extended real number defined by

lim infn→∞

an , limn→∞

(infk≥n

ak).

Some also use the notations lim and lim to denote limsup and liminf, respectively.

Note that the limit supremum and the limit infimum of a sequence is alwaysdefined in R ∪ −∞,∞, since the sequences supk≥n ak = supak : k ≥ n andinfk≥n ak = infak : k ≥ n are monotone in n (cf. Lemma A.20). An immediateresult follows from the definitions of limsup and liminf.

Lemma A.22 (Limit) For a sequence an∞n=1,

limn→∞

an = L ⇐⇒ lim supn→∞

an = lim infn→∞

an = L.

Some properties regarding the limsup and liminf of sequences (which areparallel to Properties A.4 and A.10) are listed below.

187

Page 200: It Lecture Notes2014

Property A.23 (Properties of the limit supremum)

1. The limit supremum always exists in R ∪ −∞,∞.

2. If | lim supm→∞ am| < ∞, then (∀ ε > 0)(∃ N) such that (∀ n > N)an < lim supm→∞ am + ε. (Note that this holds for every n > N .)

3. If | lim supm→∞ am| < ∞, then (∀ ε > 0 and integer K)(∃ N > K) suchthat aN > lim supm→∞ am−ε. (Note that this holds only for one N , whichis larger than K.)

Property A.24 (Properties of the limit infimum)

1. The limit infimum always exists in R ∪ −∞,∞.

2. If | lim infm→∞ am| < ∞, then (∀ ε > 0 and K)(∃ N > K) such thataN < lim infm→∞ am + ε. (Note that this holds only for one N , which islarger than K.)

3. If | lim infm→∞ am| < ∞, then (∀ ε > 0)(∃ N) such that (∀ n > N) an >lim infm→∞ am − ε. (Note that this holds for every n > N .)

The last two items in Properties A.23 and A.24 can be stated using theterminology of sufficiently large and infinitely often, which is often adopted inInformation Theory.

Definition A.25 (Sufficiently large) We say that a property holds for a se-quence an∞n=1 almost always or for all sufficiently large n if the property holdsfor every n > N for some N .

Definition A.26 (Infinitely often) We say that a property holds for a se-quence an∞n=1 infinitely often or for infinitely many n if for every K, the prop-erty holds for one (specific) N with N > K.

Then properties 2 and 3 of Property A.23 can be respectively re-phrased as:if | lim supm→∞ am| < ∞, then (∀ ε > 0)

an < lim supm→∞

am + ε for all sufficiently large n

andan > lim sup

m→∞am − ε for infinitely many n.

188

Page 201: It Lecture Notes2014

Similarly, properties 2 and 3 of Property A.24 becomes: if | lim infm→∞ am| < ∞,then (∀ ε > 0)

an < lim infm→∞

am + ε for infinitely many n

andan > lim inf

m→∞am − ε for all sufficiently large n.

Lemma A.27

1. lim infn→∞ an ≤ lim supn→∞ an.

2. If an ≤ bn for all sufficiently large n, then

lim infn→∞

an ≤ lim infn→∞

bn and lim supn→∞

an ≤ lim supn→∞

bn.

3. lim supn→∞ an < r ⇒ an < r for all sufficiently large n.

4. lim supn→∞ an > r ⇒ an > r for infinitely many n.

5.

lim infn→∞

an + lim infn→∞

bn ≤ lim infn→∞

(an + bn)

≤ lim supn→∞

an + lim infn→∞

bn

≤ lim supn→∞

(an + bn)

≤ lim supn→∞

an + lim supn→∞

bn.

6. If limn→∞ an exists, then

lim infn→∞

(an + bn) = limn→∞

an + lim infn→∞

bn

andlim supn→∞

(an + bn) = limn→∞

an + lim supn→∞

bn.

Finally, one can also interpret the limit supremum and limit infimum in termsof the concept of clustering points. A clustering point is a point that a sequencean∞n=1 approaches (i.e., belonging to a ball with arbitrarily small radius andthat point as center) infinitely many times. For example, if an = sin(nπ/2),then an∞n=1 = 1, 0,−1, 0, 1, 0,−1, 0, . . .. Hence, there are three clusteringpoints in this sequence, which are −1, 0 and 1. Then the limit supremum ofthe sequence is nothing but its largest clustering point, and its limit infimumis exactly its smallest clustering point. Specifically, lim supn→∞ an = 1 andlim infn→∞ an = −1. This approach can sometimes be useful to determine thelimsup and liminf quantities.

189

Page 202: It Lecture Notes2014

A.5 Equivalence

We close this appendix by providing some equivalent statements that are oftenused to simplify proofs. For example, instead of directly showing that quantityx is less than or equal to quantity y, one can take an arbitrary constant ε > 0and prove that x < y + ε. Since y + ε is a larger quantity than y, in some casesit might be easier to show x < y + ε than proving x ≤ y. By the next theorem,any proof that concludes that “x < y + ε for all ε > 0” immediately gives thedesired result of x ≤ y.

Theorem A.28 For any x, y and a in R,

1. x < y + ε for all ε > 0 iff x ≤ y;

2. x < y − ε for some ε > 0 iff x < y;

3. x > y − ε for all ε > 0 iff x ≥ y;

4. x > y + ε for some ε > 0 iff x > y;

5. |a| < ε for all ε > 0 iff a = 0.

190

Page 203: It Lecture Notes2014

Appendix B

Overview in Probability and RandomProcesses

This appendix presents a quick overview of basic concepts from probability the-ory and the theory of random processes. The reader can consult comprehensivetexts on these subjects for a thorough study (see, e.g., [2, 6, 22]). We closethe appendix with a brief discussion of Jensen’s inequality and the Lagrangemultipliers technique for the optimization of convex functions [5, 11].

B.1 Probability space

Definition B.1 (σ-Fields) Let F be a collection of subsets of a non-empty setΩ. Then F is called a σ-field (or σ-algebra) if the following hold:

1. Ω ∈ F .

2. Closedness of F under complementation: If A ∈ F , then Ac ∈ F , whereAc = ω ∈ Ω : ω 6∈ A.

3. Closedness of F under countable union: If Ai ∈ F for i = 1, 2, 3, . . ., then⋃∞

i=1 Ai ∈ F .

It directly follows that the empty set ∅ is also an element of F (as Ωc = ∅)and that F is closed under countable intersection since

∞⋂

i=1

Aci =

( ∞⋃

i=1

Ai

)c

.

The largest σ-field of subsets of a given set Ω is the collection of all subsets ofΩ (i.e., its powerset), while the smallest σ-field is given by Ω, ∅. Also, if A is

191

Page 204: It Lecture Notes2014

a proper (strict) non-empty subset of Ω, then the smallest σ-field containing Ais given by Ω, ∅, A,Ac.

Definition B.2 (Probability space) A probability space is a triple (Ω,F , P ),where Ω is a given set called sample space containing all possible outcomes(usually observed from an experiment), F is a σ-field of subsets of Ω and Pis a probability measure P : F → [0, 1] on the σ-field satisfying the following:

1. 0 ≤ P (A) ≤ 1 for all A ∈ F .

2. P (Ω) = 1.

3. Countable additivity: If A1, A2, . . . is a sequence of disjoint sets (i.e.,Ai ∩ Aj = ∅ for all i 6= j) in F , then

P

( ∞⋃

k=1

Ak

)

=∞∑

k=1

P (Ak).

It directly follows from properties 1-3 of the above definition that P (∅) = 0.Usually, the σ-field F is called the event space and its elements (which aresubsets of Ω satisfying the properties of Definition B.1) are called events.

B.2 Random variable and random process

A random variable X defined over probability space (Ω,F , P ) is a real-valuedfunction X : Ω → R that is measurable (or F-measurable), i.e., satisfying theproperty that

X−1((−∞, t]) , ω ∈ Ω : X(ω) ≤ t ∈ Ffor each real t.1

A random process is a collection of random variables that arise from the sameprobability space. It can be mathematically represented by the collection

Xt, t ∈ I,

1One may question why bother defining random variables based on some abstract proba-bility space. One may continue “A random variable X can simply be defined based on itsdistribution PX ,” which is indeed true (cf. Section B.3).A perhaps easier way to understand the abstract definition of random variables is that the

underlying probability space (Ω,F , P ) on which the random variable is defined is what trulyoccurs internally, but it is possibly non-observable. In order to infer which of the non-observableω occurs, an experiment that results in observable x that is a function of ω is performed. Suchan experiment results in the random variable X whose probability is so defined over theprobability space (Ω,F , P ).

192

Page 205: It Lecture Notes2014

where Xt denotes the tth random variable in the process, and the index t runsover an index set I which is arbitrary. The index set I can be uncountably infinite— e.g., I = R — in which case we are effectively dealing with a continuous-timeprocess. We will however exclude such a case in this chapter for the sake ofsimplicity. To be precise, we will only consider the following cases of index setI:

case a) I consists of one index only.case b) I is finite.case c) I is countably infinite.

B.3 Distribution functions versus probability space

In applications, we are perhaps more interested in the distribution functionsof the random variables and random processes than the underlying probabilityspace on which the random variables and random processes are defined. It canbe proved [6, Thm. 14.1] that given a real-valued non-negative function F (·)that is non-decreasing and right-continuous and satisfies limx↓−∞ F (x) = 0 andlimx↑∞ F (x) = 1, there exist a random variable and an underlying probabilityspace such that the cumulative distribution function (cdf) of the random variabledefined over the probability space is equal to F (·). This result releases us fromthe burden of referring to a probability space before defining the random variable.In other words, we can define a random variableX directly by its cdf, i.e., Pr[X ≤x], without bothering to refer to its underlying probability space. Nevertheless,it is better to keep in mind (and learn) that, formally, random variables andrandom processes are defined over underlying probability spaces.

B.4 Relation between a source and a random process

In statistical communication, a discrete-time source (X1, X2, X3, . . . , Xn) , Xn

consists of a sequence of random quantities, where each quantity usually takeson values from a source generic alphabet X , namely

(X1, X2, . . . , Xn) ∈ X × X × · · · × X , X n.

Another merit in defining random variables based on an abstract probability space can beobserved from the extension definition of random variables to random processes. With theunderlying probability space, any finite dimensional distribution of Xt, t ∈ I is well-defined.For example,

Pr[X1 ≤ x1, X5 ≤ x5, X9 ≤ x9] = P (ω ∈ Ω : X1(ω) ≤ x1, X5(ω) ≤ x5, X9(ω) ≤ x9) .

193

Page 206: It Lecture Notes2014

The elements in X are usually called letters.

B.5 Statistical properties of random sources

For a random process X = X1, X2, . . . with alphabet X (i.e., X ⊆ R is therange of each Xi) defined over probability space (Ω,F , P ), consider X∞, the setof all sequences x , (x1, x2, x3, . . .) of real numbers in X . An event E in2 FX , thesmallest σ-field generated by all open sets of X∞ (i.e., the Borel σ-field of X∞),is said to be T-invariant with respect to the left-shift (or shift transformation)T : X∞ → X∞ if

TE ⊆ E,

where

TE , Tx : x ∈ E and Tx , T(x1, x2, x3, . . .) = (x2, x3, . . .).

In other words, T is equivalent to “chopping the first component.” For example,applying T onto an event E1 defined below,

E1 , (x1 = 1, x2 = 1, x3 = 1, x4 = 1, . . .), (x1 = 0, x2 = 1, x3 = 1, x4 = 1, . . .),

(x1 = 0, x2 = 0, x3 = 1, x4 = 1, . . .) , (B.5.1)

yields

TE1 = (x1 = 1, x2 = 1, x3 = 1, . . .), (x1 = 1, x2 = 1, x3 = 1 . . .),

(x1 = 0, x2 = 1, x3 = 1, . . .)= (x1 = 1, x2 = 1, x3 = 1, . . .), (x1 = 0, x2 = 1, x3 = 1, . . .) .

We then have TE1 ⊆ E1, and hence E1 is T-invariant.

It can be proved3 that if TE ⊆ E, then T2E ⊆ TE. By induction, we canfurther obtain

· · · ⊆ T3E ⊆ T2E ⊆ TE ⊆ E. (B.5.2)

Thus, if an element say (1, 0, 0, 1, 0, 0, . . .) is in a T-invariant set E, then all itsleft-shift counterparts (i.e., (0, 0, 1, 0, 0, 1 . . .) and (0, 1, 0, 0, 1, 0, . . .)) should becontained in E. As a result, for a T-invariant set E, an element and all its left-shift counterparts are either all in E or all outside E, but cannot be partiallyinside E. Hence, a “T-invariant group” such as one containing (1, 0, 0, 1, 0, 0, . . .),

2Note that by definition, the events E ∈ FX are measurable, and PX(E) = P (A), wherePX is the distribution of X induced from the underlying probability measure P .

3If A ⊆ B, then TA ⊆ TB. Thus T2E ⊆ TE holds whenever TE ⊆ E.

194

Page 207: It Lecture Notes2014

(0, 0, 1, 0, 0, 1 . . .) and (0, 1, 0, 0, 1, 0, . . .) should be treated as an indecomposablegroup in T-invariant sets.

Although we are in particular interested in these “T-invariant indecomposablegroups” (especially when defining an ergodic random process), it is possible thatsome single “transient” element, such as (0, 0, 1, 1, . . .) in (B.5.1), is included in aT-invariant set, and will be excluded after applying left-shift operation T. Thishowever can be resolved by introducing the inverse operation T−1. Note thatT is a many-to-one mapping, so its inverse operation does not exist in general.Similar to taking the closure of an open set, the definition adopted below [46, p. 3]allows us to “enlarge” the T-invariant set such that all right-shift counterpartsof the single “transient” element are included.

T−1E , x ∈ X∞ : Tx ∈ E .

We then notice from the above definition that if

T−1E = E, (B.5.3)

then4

TE = T(T−1E) = E,

and hence E is constituted only by the T-invariant groups because

· · · = T−2E = T−1E = E = TE = T2E = · · · .

The sets that satisfy (B.5.3) are sometimes referred to as ergodic sets becauseas time goes by (the left-shift operator T can be regarded as a shift to a futuretime), the set always stays in the state that it has been before. A quick exampleof an ergodic set for X = 0, 1 is one that consists of all binary sequences thatcontain finitely many 0’s.5

We now classify several useful statistical properties of random process X =X1, X2, . . ..

4The proof of T(T−1E) = E is as follows. If y ∈ T(T−1E) = T(x ∈ X∞ : Tx ∈ E), thenthere must exist an element x ∈ x ∈ X∞ : Tx ∈ E such that y = Tx. Since Tx ∈ E, wehave y ∈ E and T(T−1E) ⊆ E.

On the contrary, if y ∈ E, all x’s satisfying Tx = y belong to T−1E. Thus, y ∈ T(T−1E),which implies E ⊆ T(T−1E).

5As the textbook only deals with one-sided random processes, the discussion on T-invariance only focuses on sets of one-sided sequences. When a two-sided random process. . . , X−2, X−1, X0, X1, X2, . . . is considered, the left-shift operation T of a two-sided sequenceactually has a unique inverse. Hence, TE ⊆ E implies TE = E. Also, TE = E if, andonly if, T−1E = E. Ergodicity for two-sided sequences can therefore be directly defined usingTE = E.

195

Page 208: It Lecture Notes2014

• Memoryless: A random process is said to be memoryless if the sequence ofrandom variables Xi is independent and identically distributed (i.i.d.).

• First-order stationary: A process is first-order stationary if the marginal dis-tribution is unchanged for every time instant.

• Second-order stationary: A process is second-order stationary if the joint dis-tribution of any two (not necessarily consecutive) time instances is invari-ant before and after left (time) shift.

• Weakly stationary process: A process is weakly stationary (or wide-sense sta-tionary or stationary in the weak or wide sense) if the mean and auto-correlation function are unchanged by a left (time) shift.

• Stationary process: A process is stationary (or strictly stationary) if the pro-bability of every sequence or event is unchanged by a left (time) shift.

• Ergodic process: A process is ergodic if any ergodic set (satisfying (B.5.3)) inFX has probability either 1 or 0. This definition is not very intuitive, butsome interpretations and examples may shed some light.

Observe that the definition has nothing to do with stationarity. Itsimply states that events that are unaffected by time-shifting (both left-and right-shifting) must have probability either zero or one.

Ergodicity implies that all convergent sample averages6 converge to aconstant (but not necessarily to the ensemble average), and stationarityassures that the time average converges to a random variable; hence, itis reasonably to expect that they jointly imply the ultimate time averageequals the ensemble average. This is validated by the well-known ergodictheorem by Birkhoff and Khinchin.

Theorem B.3 (Pointwise ergodic theorem) Consider a discrete-timestationary random process, X = Xn∞n=1. For real-valued function f(·) onR with finite mean (i.e., |E[f(Xn)]| < ∞), there exists a random variableY such that

limn→∞

1

n

n∑

k=1

f(Xk) = Y with probability 1.

If, in addition to stationarity, the process is also ergodic, then

limn→∞

1

n

n∑

k=1

f(Xk) = E[f(X1)] with probability 1.

6Two alternative names for sample average are time average and Cesaro mean. In thisbook, these names will be used interchangeably.

196

Page 209: It Lecture Notes2014

Example B.4 Consider the process Xi∞i=1 consisting of a family of i.i.d. bi-nary random variables (obviously, it is stationary and ergodic). Define thefunction f(·) by f(0) = 0 and f(1) = 1, Hence,7

E[f(Xn)] = PXn(0)f(0) + PXn

(1)f(1) = PXn(1)

is finite. By the pointwise ergodic theorem, we have

limn→∞

f(X1) + f(X2) + · · ·+ f(Xn)

n= lim

n→∞

X1 +X2 + · · ·+Xn

n= PX(1).

As seen in the above example, one of the important consequencesthat the pointwise ergodic theorem indicates is that the time average canultimately replace the statistical average, which is a useful result in engi-neering. Hence, with stationarity and ergodicity, one, who observes

X301 = 154326543334225632425644234443

from the experiment of rolling a dice, can draw the conclusion that thetrue distribution of rolling the dice can be well approximated by:

PrXi = 1 ≈ 1

30PrXi = 2 ≈ 6

30PrXi = 3 ≈ 7

30

PrXi = 4 ≈ 9

30PrXi = 5 ≈ 4

30PrXi = 6 ≈ 3

30

Such result is also known by the law of large numbers. The relation be-tween ergodicity and the law of large numbers will be further explored inSection B.7.

We close the discussion on ergodicity by remarking that in communica-tions theory, one may assume that the source is stationary or the source isstationary ergodic. But it is rare to see the assumption of the source beingergodic but non-stationary. This is perhaps because an ergodic but non-stationary source not only does not facilitate the analytical study of com-munications problems, but may have limited application in practice. Fromthis, we note that assumptions are made either to facilitate the analyticalstudy of communications problems or to fit a specific need of applications.Without these two objectives, an assumption becomes of minor interest.This, to some extent, justifies that the ergodicity assumption usually comesafter stationarity assumption. A specific example is the pointwise ergodictheorem, where the random processes considered is presumed to be sta-tionary.

7In terms of notation, we use PXn(0) to denote PrXn = 0. The above two representations

will be used alternatively throughout the book.

197

Page 210: It Lecture Notes2014

• First-order Markov chain: Three random variables X, Y and Z are said toform a Markov chain or a first-order Markov chain if

PX,Y,Z(x, y, z) = PX(x) · PY |X(y|x) · PZ|Y (z|y); (B.5.4)

i.e., PZ|X,Y (z|x, y) = PZ|Y (z|y). This is usually denoted by X → Y → Z.

X → Y → Z is sometimes read as “X and Z are conditionally independentgiven Y ” because it can be shown that (B.5.4) is equivalent to

PX,Z|Y (x, z|y) = PX|Y (x|y) · PZ|Y (z|y).Therefore, X → Y → Z is equivalent to Z → Y → X. Accordingly, theMarkovian notation is sometimes expressed as X ↔ Y ↔ Z.

• Markov chain for random sequences: The random variables X1, X2, X3, . . .are said to form a k-th order Markov chain if for all k < n,

Pr[Xn = xn|Xn−1 = xn−1, . . . , X1 = x1]

= Pr[Xn = xn|Xn−1 = xn−1, . . . , Xn−k = xn−k].

Each xn−1n−k ∈ X k is called the state at time n.

A Markov chain is irreducible if with some probability, we can go fromany state in X k to another state in a finite number of steps, i.e., for allxk, yk ∈ X k there exists j ≥ 1 such that

Pr

Xk+j−1j = xk

∣∣∣Xk

1 = yk

> 0.

AMarkov chain is said to be time-invariant or homogeneous, if for everyn > k,

Pr[Xn = xn|Xn−1 = xn−1, . . . , Xn−k = xn−k]

= Pr[Xk+1 = xk+1|Xk = xk, . . . , X1 = x1].

Therefore, a homogeneous first-order Markov chain can be defined throughits transition probability:

[PrX2 = x2|X1 = x1

]

|X |×|X |,

and its initial state distribution PX1(x). A distribution π(x) on X is saidto be a stationary distribution for a homogeneous first-order Markov chain,if for every y ∈ X ,

π(y) =∑

x∈Xπ(x) PrX2 = y|X1 = x.

If the initial state distribution is equal to the stationary distribution, thenthe homogeneous first-order Markov chain is a stationary process.

198

Page 211: It Lecture Notes2014

Markov

i.i.d. Stationary

Ergodic

Figure B.1: General relations of random processes.

The general relations among i.i.d. sources, Markov sources, stationary sourcesand ergodic sources are depicted in Figure B.1

B.6 Convergence of sequences of random variables

In this section, we will discuss modes in which a random process X1, X2, . . .converges to a limiting random variable X. Recall that a random variable is areal-valued function from Ω to R, where Ω the sample space of the probabilityspace over which the random variable is defined. So the following two expressionswill be used interchangeably.

X1(ω), X2(ω), X3(ω), . . . ≡ X1, X2, X3, . . .

for ω ∈ Ω. Note that the random variables in a random process are defined overthe same probability space (Ω,F , P ),

Definition B.5 (Convergence modes for random sequences)

1. Point-wise convergence on Ω.8

8Although such mode of convergence is not used in probability theory, we introduce itherein to contrast it with the almost sure convergence mode (see Example B.6).

199

Page 212: It Lecture Notes2014

Xn∞n=1 is said to converge to X pointwisely on Ω if

limn→∞

Xn(ω) = X(ω) for all ω ∈ Ω.

This is usually denoted by Xnp.w.−→ X.

2. Almost sure convergence or convergence with probability 1.

Xn∞n=1 is said to converge to X with probability 1, if

Pω ∈ Ω : limn→∞

Xn(ω) = X(ω) = 1.

This is usually denoted by Xna.s.−→ X.

3. Convergence in probability.

Xn∞n=1 is said to converge to X in probability, if for any ε > 0,

limn→∞

Pω ∈ Ω : |Xn(ω)−X(ω)| > ε = limn→∞

Pr|Xn −X| > ε = 0.

This is usually denoted by Xnp−→ X.

4. Convergence in rth mean.

Xn∞n=1 is said to converge to X in rth mean, if

limn→∞

E[|X −Xn|r] = 0.

This is usually denoted by XnLr−→ X.

5. Convergence in distribution.

Xn∞n=1 is said to converge to X in distribution, if

limn→∞

FXn(x) = FX(x),

for every continuity point of F (x), where

FXn(x) , PrXn ≤ x and FX(x) , PrX ≤ x.

This is usually denoted by Xnd−→ X.

An example that facilitates the understanding of pointwise convergence andalmost sure convergence is as follows.

200

Page 213: It Lecture Notes2014

Example B.6 Give a probability space (Ω, 2Ω, P ), where Ω = 0, 1, 2, 3, andP (0) = P (1) = P (2) = 1/3 and P (3) = 0. Define a random variable as Xn(ω) =ω/n. Then

PrXn = 0 = Pr

Xn =1

n

= Pr

Xn =2

n

=1

3.

It is clear that for every ω in Ω, Xn(ω) converges to X(ω), where X(ω) = 0 forevery ω ∈ Ω; so

Xnp.w.−→ X.

Now let X(ω) = 0 for ω = 0, 1, 2 and X(ω) = 1 for ω = 3. Then both of thefollowing statements are true:

Xna.s.−→ X and Xn

a.s.−→ X,

since

Pr

limn→∞

Xn = X

=3∑

ω=0

P (ω) · 1

limn→∞

Xn(ω) = X(ω)

= 1,

where 1· represents the set indicator function. However, Xn does not convergeto X pointwise because limn→∞ Xn(3) 6= X(3). So to speak, pointwise conver-gence requires “equality” even for those samples without probability mass, forwhich almost surely convergence does not take into consideration.

Observation B.7 (Uniqueness of convergence)

1. If Xnp.w.−→ X and Xn

p.w.−→ Y , then X = Y pointwisely. I.e., (∀ ω ∈ Ω)X(ω) = Y (ω).

2. If Xna.s.−→ X and Xn

a.s.−→ Y (or Xnp−→ X and Xn

p−→ Y ) (or XnLr−→ X

and XnLr−→ Y ), then X = Y with probability 1. I.e., PrX = Y = 1.

3. Xnd−→ X and Xn

d−→ Y , then FX(x) = FY (x) for all x.

For ease of understanding, the relations of the five modes of convergence canbe depicted as follows. As usual, a double arrow denotes implication.

Thm. B.9

Thm. B.8

Xnp.w.−→ X⇓

Xna.s.−→ X Xn

Lr−→ X (r ≥ 1)

Xnp−→ X⇓

Xnd−→ X

201

Page 214: It Lecture Notes2014

There are some other relations among these five convergence modes that arealso depicted in the above graph (via the dotted line); they are stated below.

Theorem B.8 (Monotone convergence theorem [6])

Xna.s.−→ X, (∀ n)Y ≤ Xn ≤ Xn+1, and E[|Y |] < ∞ ⇒ Xn

L1−→ X

⇒ E[Xn] → E[X].

Theorem B.9 (Dominated convergence theorem [6])

Xna.s.−→ X, (∀ n)|Xn| ≤ Y, and E[|Y |] < ∞ ⇒ Xn

L1−→ X

⇒ E[Xn] → E[X].

The implication of XnL1−→ X to E[Xn] → E[X] can be easily seen from

|E[Xn]− E[X]| = |E[Xn −X]| ≤ E[|Xn −X|].

B.7 Ergodicity and laws of large numbers

B.7.1 Laws of large numbers

Consider a random process X1, X2, . . . with common marginal mean µ. Supposethat we wish to estimate µ on the basis of the observed sequence x1, x2, x3, . . . .The weak and strong laws of large numbers ensure that such inference is possible(with reasonable accuracy), provided that the dependencies between Xn’s aresuitably restricted: e.g., the weak law is valid for uncorrelated Xn’s, while thestrong law is valid for independent Xn’s. Since independence is a more restrictivecondition than the absence of correlation, one expects the strong law to be morepowerful than the weak law. This is indeed the case, as the weak law states thatthe sample average

X1 + · · ·+Xn

nconverges to µ in probability, while the strong law asserts that this convergencetakes place with probability 1.

The following two inequalities will be useful in the discussion of this subject.

Lemma B.10 (Markov’s inequality) For any integer k > 0, real numberα > 0 and any random variable X,

Pr[|X| ≥ α] ≤ 1

αkE[|X|k].

202

Page 215: It Lecture Notes2014

Proof: Let FX(·) be the cdf of random variable X. Then,

E[|X|k] =

∫ ∞

−∞|x|kdFX(x)

≥∫

x∈R : |x|≥α|x|kdFX(x)

≥∫

x∈R : |x|≥ααkdFX(x)

= αk

x∈R : |x|≥αdFX(x)

= αk Pr[|X| ≥ α].

Equality holds iff∫

x∈R : |x|<α|x|kdFX(x) = 0 and

x∈R : |x|>α|x|kdFX(x) = 0,

namely,Pr[X = 0] + Pr[|X| = α] = 1.

In the proof of Markov’s inequality, we use the general representation forintegration with respect to a (cumulative) distribution function FX(·), i.e.,

X·dFX(x), (B.7.1)

which is named the Lebesgue-Stieltjes integral. Such a representation can beapplied for both discrete and continuous supports as well as the case that theprobability density function does not exist. We use this notational convention toremove the burden of differentiating discrete random variables from continuousones.

Lemma B.11 (Chebyshev’s inequality) For any random variableX and realnumber α > 0,

Pr[|X − E[X]| ≥ α] ≤ 1

α2Var[X].

Proof: By Markov’s inequality with k = 2, we have:

Pr[|X − E[X]| ≥ α] ≤ 1

α2E[|X − E[X]|2].

Equality holds iff

Pr[|X − E[X]| = 0] + Pr[|X − E[X]| = α

]= 1,

203

Page 216: It Lecture Notes2014

equivalently, there exists p ∈ [0, 1] such that

Pr[X = E[X] + α

]= Pr

[X = E[X]− α

]= p and Pr[X = E[X]] = 1− 2p.

In the proofs of the above two lemmas, we also provide the condition underwhich equality holds. These conditions indicate that equality usually cannot befulfilled. Hence in most cases, the two inequalities are strict.

Theorem B.12 (Weak law of large numbers) Let Xn∞n=1 be a sequenceof uncorrelated random variables with common mean E[Xi] = µ. If the variablesalso have common variance, or more generally,

limn→∞

1

n2

n∑

i=1

Var[Xi] = 0, (equivalently,X1 + · · ·+Xn

n

L2−→ µ)

then the sample averageX1 + · · ·+Xn

nconverges to the mean µ in probability.

Proof: By Chebyshev’s inequality,

Pr

∣∣∣∣∣

1

n

n∑

i=1

Xi − µ

∣∣∣∣∣≥ ε

≤ 1

n2ε2

n∑

i=1

Var[Xi].

Note that the right-hand side of the above Chebyshev’s inequality is just thesecond moment of the difference between the n-sample average and the mean

µ. Thus the variance constraint is equivalent to the statement that XnL2−→ µ

implies Xnp−→ µ.

Theorem B.13 (Kolmogorov’s strong law of large numbers) Let Xn∞n=1

be an independent sequence of random variables with common mean E[Xn] = µ.If either

1. Xn’s are identically distributed; or

2. Xn’s are square-integrable9 with

∞∑

i=1

Var[Xi]

i2< ∞,

9A random variable X is said to be square-integrable if E[|X|2] < ∞.

204

Page 217: It Lecture Notes2014

thenX1 + · · ·+Xn

n

a.s.−→ µ.

Note that the above i.i.d. assumption does not exclude the possibility ofµ = ∞ (or µ = −∞), in which case the sample average converges to ∞ (or −∞)with probability 1. Also note that there are cases of independent sequences towhich the weak law applies, but the strong law does not. This is due to the factthat

n∑

i=1

Var[Xi]

i2≥ 1

n2

n∑

i=1

Var[Xi].

The final remark is that Kolmogorov’s strong law of large number can beextended to a function of an independent sequence of random variables:

g(X1) + · · ·+ g(Xn)

n

a.s.−→ E[g(X1)].

But such extension cannot be applied to the weak law of large number, sinceg(Yi) and g(Yj) can be correlated even if Yi and Yi are not.

B.7.2 Ergodicity and law of large numbers

After the introduction of Kolmogorov’s strong law of large numbers, one may findthat the pointwise ergodic theorem (Theorem B.3) actually indicates a similarresult. In fact, the pointwise ergodic theorem can be viewed as another versionof strong law of large numbers, which states that for stationary and ergodicprocesses, time averages converge with probability 1 to the ensemble expectation.

The notion of ergodicity is often misinterpreted, since the definition is notvery intuitive. Some engineering texts may provide a definition that a stationaryprocess satisfying the ergodic theorem is also ergodic.10 However, the ergodic

10Here is one example. A stationary random process Xn∞n=1 is called ergodic if for arbitrary

integer k and function f(·) on X k of finite mean,

1

n

n∑

i=1

f(Xi+1, . . . , Xi+k)a.s.−→ E[f(X1, . . . , Xk)].

As a result of this definition, a stationary ergodic source is the most general dependent randomprocess for which the strong law of large numbers holds. This definition somehow implies thatif a process is not stationary-ergodic, then the strong law of large numbers is violated (orthe time average does not converge with probability 1 to its ensemble expectation). But thisis not true. One can weaken the conditions of stationarity and ergodicity from its originalmathematical definitions to asymptotic stationarity and ergodicity, and still make the stronglaw of large numbers hold. (Cf. the last remark in this section and also Figure B.2)

205

Page 218: It Lecture Notes2014

theorem is indeed a consequence of the original mathematical definition of ergod-icity in terms of the shift-invariant property (see Section B.5 and the discussionin [21, pp. 174-175])).

Ergodicitydefined throughshift-invariance

property

Ergodicitydefined throughergodic theorem

i.e., stationarity andtime averageconverging tosample average

(law of large numbers)

Figure B.2: Relation of ergodic random processes respectively definedthrough time-shift invariance and ergodic theorem.

Let us try to clarify the notion of ergodicity by the following remarks.

• The concept of ergodicity does not require stationarity. In other words, anon-stationary process can be ergodic.

• Many perfectly good models of physical processes are not ergodic, yet theyhave a form of law of large numbers. In other words, non-ergodic processescan be perfectly good and useful models.

• There is no finite-dimensional equivalent definition of ergodicity as there isfor stationarity. This fact makes it more difficult to describe and interpretergodicity.

• I.i.d. processes are ergodic; hence, ergodicity can be thought of as a (kindof) generalization of i.i.d.

• As mentioned earlier, stationarity and ergodicity imply the time averageconverges with probability 1 to the ensemble mean. Now if a processis stationary but not ergodic, then the time average still converges, butpossibly not to the ensemble mean.

For example, let {A_n}_{n=−∞}^∞ and {B_n}_{n=−∞}^∞ be two i.i.d. binary 0-1 processes with Pr{A_n = 0} = Pr{B_n = 1} = 1/4. Suppose that X_n = A_n if U = 1, and X_n = B_n if U = 0, where U is an equiprobable binary


random variable, and {A_n}_{n=1}^∞, {B_n}_{n=1}^∞ and U are independent. Then {X_n}_{n=1}^∞ is stationary. Is the process ergodic? The answer is negative. If the stationary process were ergodic, then from the pointwise ergodic theorem (Theorem B.3), its relative frequency would converge to

Pr(X_n = 1) = Pr(U = 1) Pr(X_n = 1|U = 1) + Pr(U = 0) Pr(X_n = 1|U = 0)
            = Pr(U = 1) Pr(A_n = 1) + Pr(U = 0) Pr(B_n = 1) = 1/2.

However, one should observe that the outputs (X_1, . . . , X_n) form a Bernoulli process whose relative frequency of 1's is either 3/4 or 1/4, depending on the value of U. Therefore,

(1/n) ∑_{i=1}^n X_i  a.s.−→  Y,

where Pr(Y = 1/4) = Pr(Y = 3/4) = 1/2, which contradicts the ergodic theorem.

From the above example, the pointwise ergodic theorem can actually be made useful even in such a stationary but non-ergodic case, since what is observed is an outcome of a stationary ergodic process (either {A_n}_{n=−∞}^∞ or {B_n}_{n=−∞}^∞), as revealed by measuring the relative frequency (3/4 or 1/4). This reflects a surprising fundamental result on random processes, the ergodic decomposition theorem: under fairly general assumptions, any (not necessarily ergodic) stationary process is in fact a mixture of stationary ergodic processes, and hence one always observes a stationary ergodic outcome. As in the above example, one always observes either A_1, A_2, A_3, . . . or B_1, B_2, B_3, . . ., depending on the value of U, and both sequences are stationary ergodic (i.e., the time-stationary observation X_n satisfies X_n = U · A_n + (1 − U) · B_n). A small simulation sketch of this example is given at the end of this subsection.

• The previous remark implies that ergodicity is not required for the strong law of large numbers to be useful. The next question is whether or not stationarity is required. Again the answer is negative. In fact, the main concern of the law of large numbers is the convergence of sample averages to the ensemble expectation. It is reasonable to expect that a random process could exhibit transient behavior that violates the stationarity definition while its sample average still converges. One can then introduce the notion of asymptotic stationarity to obtain a law of large numbers. For example, a finite-alphabet time-invariant (but not necessarily stationary) irreducible Markov chain satisfies the law of large numbers.


Accordingly, one should not take the notions of stationarity and ergodicity too seriously (if the main concern is the law of large numbers), since they can be significantly weakened while laws of large numbers continue to hold (i.e., time averages and relative frequencies have the desired, well-defined limits).
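The following Python sketch (not part of the original notes) simulates the stationary but non-ergodic mixture example above. The parameters mirror the text (Pr{A_n = 0} = Pr{B_n = 1} = 1/4, U equiprobable); across independent runs the time average of X_n lands near either 3/4 or 1/4, not near the ensemble mean 1/2.

import numpy as np

rng = np.random.default_rng(1)
n = 200000                       # length of each observed sample path

for run in range(5):
    u = rng.integers(0, 2)       # equiprobable switch U
    if u == 1:
        x = (rng.random(n) < 0.75).astype(int)   # A_n: Pr{A_n = 1} = 3/4
    else:
        x = (rng.random(n) < 0.25).astype(int)   # B_n: Pr{B_n = 1} = 1/4
    print(f"U = {u}, time average = {x.mean():.4f}")   # ~0.75 or ~0.25, never ~0.5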

B.8 Central limit theorem

Theorem B.14 (Central limit theorem) If {X_n}_{n=1}^∞ is a sequence of i.i.d. random variables with finite common marginal mean µ and variance σ², then

(1/√n) ∑_{i=1}^n (X_i − µ)  d−→  Z ∼ N(0, σ²),

where the convergence is in distribution (as n → ∞) and Z ∼ N(0, σ²) is a Gaussian random variable with mean 0 and variance σ².
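As a numerical illustration (not from the original notes), the sketch below forms the normalized sum (1/√n) ∑_{i=1}^n (X_i − µ) for i.i.d. Exponential(1) variables (an arbitrary choice with µ = σ² = 1) and compares its empirical moments and a tail probability with the N(0, σ²) limit.

import numpy as np

rng = np.random.default_rng(2)
n, trials = 1000, 5000
mu, sigma2 = 1.0, 1.0                    # Exponential(1): mean 1, variance 1

x = rng.exponential(1.0, size=(trials, n))
z = (x - mu).sum(axis=1) / np.sqrt(n)    # (1/sqrt(n)) * sum_i (X_i - mu), one value per trial

print("empirical mean      :", z.mean())           # ~ 0
print("empirical variance  :", z.var())             # ~ sigma2 = 1
print("Pr(Z <= 1) estimate :", (z <= 1.0).mean())   # ~ Phi(1) ≈ 0.8413 for N(0,1)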

B.9 Convexity, concavity and Jensen’s inequality

Jensen’s inequality provides a useful bound for the expectation of convex (orconcave) functions.

Definition B.15 (Convexity) Consider a convex set11 O ⊆ R^m, where m is a fixed positive integer. A function f : O → R is said to be convex over O if for every x, y in O and every 0 ≤ λ ≤ 1,

f(λx + (1 − λ)y) ≤ λ f(x) + (1 − λ) f(y).

Furthermore, a function f is said to be strictly convex if equality holds only when λ = 0 or λ = 1.

Note that, differently from the usual notations x^n = (x_1, x_2, . . . , x_n) and x = (x_1, x_2, . . .) used throughout this book, in this section we use x to denote a column vector.

Definition B.16 (Concavity) A function f is concave if −f is convex.

11 A set O ⊆ R^m is said to be convex if for every x = (x_1, x_2, · · · , x_m)^T and y = (y_1, y_2, · · · , y_m)^T in O (where T denotes transposition) and every 0 ≤ λ ≤ 1, we have λx + (1 − λ)y ∈ O; in other words, the “convex combination” of any two “points” x and y in O also belongs to O.


Note that when O = (a, b) is an interval in R and the function f : O → R has a non-negative (resp. positive) second derivative over O, the function is convex (resp. strictly convex) over O. This can be shown via the Taylor series expansion of the function.

Theorem B.17 (Jensen’s inequality) If f : O → R is convex over a convexset O ⊂ Rm, and X = (X1, X2, · · · , Xm)

T is an m-dimensional random vectorwith alphabet X ⊂ O, then

E[f(X)] ≥ f(E[X]).

Moreover, if f is strictly convex, then equality in the above inequality immedi-ately implies X = E[X] with probability 1.

Note: O is a convex set; hence, X ⊂ O implies E[X] ∈ O. This guarantees that f(E[X]) is defined. Similarly, if f is concave, then

E[f(X)] ≤ f(E[X]).

Furthermore, if f is strictly concave, then equality in the above inequality immediately implies that X = E[X] with probability 1.

Proof: Let y = a^T x + b be a “support hyperplane” for f with “slope” vector a^T and affine parameter b that passes through the point (E[X], f(E[X])), where a support hyperplane12 for the function f at x′ is by definition a hyperplane passing through the point (x′, f(x′)) and lying entirely below the graph of f (see Figure B.3 for an illustration of a support line for a convex function over R). Thus,

(∀ x ∈ X)   a^T x + b ≤ f(x).

By taking the expectation of both sides, we obtain

a^T E[X] + b ≤ E[f(X)],

but we know that a^T E[X] + b = f(E[X]). Consequently,

f(E[X]) ≤ E[f(X)].

12 A hyperplane y = a^T x + b is said to be a support hyperplane for a function f with “slope” vector a^T ∈ R^m and affine parameter b ∈ R if, among all hyperplanes with the same slope vector a^T, it is the largest one satisfying a^T x + b ≤ f(x) for every x ∈ O. A support hyperplane may not necessarily be made to pass through the desired point (x′, f(x′)); here, since we only consider convex functions, the existence of a support hyperplane passing through (x′, f(x′)) is guaranteed. Note that when x is one-dimensional (i.e., m = 1), a support hyperplane is simply referred to as a support line.


[Figure B.3 depicts a convex function f(x) together with a support line y = ax + b touching its graph from below.]

Figure B.3: The support line y = ax + b of the convex function f(x).
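As a quick sanity check of Jensen's inequality (not part of the original notes), the Python sketch below compares E[f(X)] with f(E[X]) by Monte Carlo; the convex function f(x) = x² and the uniform distribution are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-1.0, 2.0, size=100000)   # samples of X ~ Uniform[-1, 2]

f = lambda t: t**2                         # a convex function
lhs = f(x).mean()                          # Monte Carlo estimate of E[f(X)]
rhs = f(x.mean())                          # f(E[X]), with E[X] estimated from the samples

print("E[f(X)] ≈", lhs)    # about Var[X] + (E[X])^2 = 0.75 + 0.25 = 1.0
print("f(E[X]) ≈", rhs)    # about 0.25
print("Jensen holds:", lhs >= rhs)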

B.10 Lagrange multipliers technique and Karush-Kuhn-Tucker (KKT) condition

Optimization of a function f(x) over x = (x_1, . . . , x_n) ∈ X ⊆ R^n subject to inequality constraints g_i(x) ≤ 0 for 1 ≤ i ≤ m and equality constraints h_j(x) = 0 for 1 ≤ j ≤ ℓ is a technique central to many problems in information theory. An immediate example is the maximization of the mutual information subject to an “inequality” power constraint and an “equality” probability unity-sum constraint in order to determine the channel capacity. We can formulate such an optimization problem [11, Eq. (5.1)] mathematically as13

min_{x ∈ Q} f(x),    (B.10.1)

where

Q ≜ { x ∈ X : g_i(x) ≤ 0 for 1 ≤ i ≤ m and h_j(x) = 0 for 1 ≤ j ≤ ℓ }.

In most cases, solving the constrained optimization problem defined in (B.10.1) is hard. Instead, one may introduce a dual optimization problem without constraints as

L(λ, ν) ≜ min_{x ∈ X} ( f(x) + ∑_{i=1}^m λ_i g_i(x) + ∑_{j=1}^ℓ ν_j h_j(x) ).    (B.10.2)

13 Since maximization of f(·) is equivalent to minimization of −f(·), it suffices to discuss the KKT condition for the minimization problem defined in (B.10.1).


In the literature, λ = (λ_1, . . . , λ_m) and ν = (ν_1, . . . , ν_ℓ) are usually referred to as Lagrange multipliers, and L(λ, ν) is called the Lagrange dual function. Note that L(λ, ν) is a concave function of λ and ν since it is the pointwise minimum of affine functions of λ and ν.

It can be verified that when λ_i ≥ 0 for 1 ≤ i ≤ m,

L(λ, ν) ≤ min_{x ∈ Q} ( f(x) + ∑_{i=1}^m λ_i g_i(x) + ∑_{j=1}^ℓ ν_j h_j(x) ) ≤ min_{x ∈ Q} f(x).    (B.10.3)

We are, however, interested in when the above inequality becomes an equality (i.e., when the so-called strong duality holds), because if there exist a λ with non-negative components and a ν that achieve equality in (B.10.3), then

f(x*) = min_{x ∈ Q} f(x) = L(λ, ν)
      = min_{x ∈ X} [ f(x) + ∑_{i=1}^m λ_i g_i(x) + ∑_{j=1}^ℓ ν_j h_j(x) ]
      ≤ f(x*) + ∑_{i=1}^m λ_i g_i(x*) + ∑_{j=1}^ℓ ν_j h_j(x*)
      ≤ f(x*),    (B.10.4)

where the last inequality in (B.10.4) follows because the minimizer x* of (B.10.1) lies in Q. Hence, if strong duality holds, the same x* achieves both min_{x ∈ Q} f(x) and L(λ, ν), and λ_i g_i(x*) = 0 for 1 ≤ i ≤ m.14

Strong duality does not hold in general. A situation that guarantees its validity was determined by William Karush [27], and separately by Harold W. Kuhn and Albert W. Tucker [30]. In particular, when f(·) and {g_i(·)}_{i=1}^m are convex, {h_j(·)}_{j=1}^ℓ are affine, and these functions are all differentiable, they found that strong duality holds if, and only if, the KKT condition is satisfied [11, p. 258].

Definition B.18 (Karush-Kuhn-Tucker (KKT) condition) A point x = (x_1, . . . , x_n) and multipliers λ = (λ_1, . . . , λ_m) and ν = (ν_1, . . . , ν_ℓ) are said to satisfy the KKT condition if

g_i(x) ≤ 0,  λ_i ≥ 0,  λ_i g_i(x) = 0,    i = 1, . . . , m
h_j(x) = 0,    j = 1, . . . , ℓ
∂f/∂x_k (x) + ∑_{i=1}^m λ_i ∂g_i/∂x_k (x) + ∑_{j=1}^ℓ ν_j ∂h_j/∂x_k (x) = 0,    k = 1, . . . , n

14 Equality in (B.10.4) implies ∑_{i=1}^m λ_i g_i(x*) = 0. Since λ_i g_i(x*) ≤ 0 for every 1 ≤ i ≤ m, it follows that λ_i g_i(x*) = 0 for 1 ≤ i ≤ m.


Note that when f(·) and the constraints {g_i(·)}_{i=1}^m and {h_j(·)}_{j=1}^ℓ are arbitrary functions, the KKT condition is only a necessary condition for the validity of strong duality. In other words, for a non-convex optimization problem, we can only claim that if strong duality holds, then the KKT condition is satisfied, but not vice versa.
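To make the definition concrete, here is a small Python sketch (not from the original notes) for the toy convex problem of minimizing f(x) = (x − 2)² subject to g(x) = x − 1 ≤ 0; the problem, the names f_prime, g, g_prime, x_star, lam, and the closed-form solution x* = 1 with multiplier λ = 2 are all illustrative choices. The code simply checks the KKT requirements numerically.

# Toy example: minimize (x - 2)^2 subject to x - 1 <= 0.
# Analytically x* = 1; stationarity 2(x* - 2) + lambda = 0 gives lambda = 2.
f_prime = lambda x: 2.0 * (x - 2.0)   # derivative of f(x) = (x - 2)^2
g = lambda x: x - 1.0                 # inequality constraint g(x) <= 0
g_prime = lambda x: 1.0

x_star = 1.0
lam = -f_prime(x_star) / g_prime(x_star)   # solve the stationarity equation for lambda

print("primal feasibility     :", g(x_star) <= 0)                   # True
print("dual feasibility       :", lam >= 0)                         # True (lambda = 2)
print("complementary slackness:", abs(lam * g(x_star)) < 1e-12)     # True
print("stationarity           :", abs(f_prime(x_star) + lam * g_prime(x_star)) < 1e-12)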

A case that is particularly useful in information theory is when x is restricted to be a probability distribution. In this case, apart from other problem-specific constraints, we additionally have n inequality constraints g_{m+i}(x) = −x_i ≤ 0 for 1 ≤ i ≤ n and one equality constraint h_{ℓ+1}(x) = ∑_{k=1}^n x_k − 1 = 0. Hence, the KKT condition becomes

g_i(x) ≤ 0,  λ_i ≥ 0,  λ_i g_i(x) = 0,    i = 1, . . . , m
g_{m+k}(x) = −x_k ≤ 0,  λ_{m+k} ≥ 0,  λ_{m+k} x_k = 0,    k = 1, . . . , n
h_j(x) = 0,    j = 1, . . . , ℓ
h_{ℓ+1}(x) = ∑_{k=1}^n x_k − 1 = 0
∂f/∂x_k (x) + ∑_{i=1}^m λ_i ∂g_i/∂x_k (x) − λ_{m+k} + ∑_{j=1}^ℓ ν_j ∂h_j/∂x_k (x) + ν_{ℓ+1} = 0,    k = 1, . . . , n

From λ_{m+k} ≥ 0 and λ_{m+k} x_k = 0, we obtain the well-known relation below.

λ_{m+k} = ∂f/∂x_k (x) + ∑_{i=1}^m λ_i ∂g_i/∂x_k (x) + ∑_{j=1}^ℓ ν_j ∂h_j/∂x_k (x) + ν_{ℓ+1}
        = 0   if x_k > 0,
        ≥ 0   if x_k = 0.

The above relation is the most commonly seen form of the KKT condition when it is used in problems of information theory.

Example B.19 Suppose that, for non-negative {q_{i,j}}_{1≤i≤n, 1≤j≤n′} with ∑_{j=1}^{n′} q_{i,j} = 1,

f(x) = − ∑_{i=1}^n ∑_{j=1}^{n′} x_i q_{i,j} log ( q_{i,j} / ∑_{i′=1}^n x_{i′} q_{i′,j} )
g_i(x) = −x_i ≤ 0,    i = 1, . . . , n
h(x) = ∑_{i=1}^n x_i − 1 = 0.

Then the KKT condition implies

x_i ≥ 0,  λ_i ≥ 0,  λ_i x_i = 0,    i = 1, . . . , n
∑_{i=1}^n x_i = 1
− ∑_{j=1}^{n′} q_{k,j} log ( q_{k,j} / ∑_{i′=1}^n x_{i′} q_{i′,j} ) + 1 − λ_k + ν = 0,    k = 1, . . . , n


which further implies that

λ_k = − ∑_{j=1}^{n′} q_{k,j} log ( q_{k,j} / ∑_{i′=1}^n x_{i′} q_{i′,j} ) + 1 + ν
    = 0   if x_k > 0,
    ≥ 0   if x_k = 0.

With this, the input distributions that achieve the channel capacities of channels such as the BSC and the BEC can be identified.
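As a numerical illustration (not part of the original notes), the Python sketch below checks the above condition for a binary symmetric channel with an assumed crossover probability of 0.1 and the uniform input x = (1/2, 1/2): the quantity ∑_j q_{k,j} log(q_{k,j} / ∑_{i′} x_{i′} q_{i′,j}) is the same for both input letters, and (with natural logarithms) equals the BSC capacity log 2 − H_b(0.1) in nats. The names Q, x, eps are illustrative.

import numpy as np

eps = 0.1                                  # assumed BSC crossover probability
Q = np.array([[1 - eps, eps],              # Q[i, j] = q_{i,j} = Pr(output j | input i)
              [eps, 1 - eps]])
x = np.array([0.5, 0.5])                   # candidate capacity-achieving input

y = x @ Q                                  # output distribution: y_j = sum_i x_i q_{i,j}
D = (Q * np.log(Q / y)).sum(axis=1)        # D_k = sum_j q_{k,j} log(q_{k,j} / y_j), in nats

print("D_k per input letter:", D)          # equal entries: KKT condition met on the support
print("capacity (nats)     :", D.max())    # log 2 - H_b(0.1) ≈ 0.368 nats ≈ 0.531 bits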

The next example shows that the problem of optimal power allocation can be treated analogously to the determination of channel capacity.

Example B.20 (Water-filling) Suppose that, with σ_i² > 0 for 1 ≤ i ≤ n and P > 0,

f(x) = − ∑_{i=1}^n log ( 1 + x_i/σ_i² )
g_i(x) = −x_i ≤ 0,    i = 1, . . . , n
h(x) = ∑_{i=1}^n x_i − P = 0.

Then the KKT condition implies

x_i ≥ 0,  λ_i ≥ 0,  λ_i x_i = 0,    i = 1, . . . , n
∑_{i=1}^n x_i = P
− 1/(σ_i² + x_i) − λ_i + ν = 0,    i = 1, . . . , n

which further implies that

λ_i = − 1/(σ_i² + x_i) + ν
    = 0   if x_i > 0,
    ≥ 0   if x_i = 0,

or equivalently,

x_i = 1/ν − σ_i²   if σ_i² < 1/ν,
x_i = 0            if σ_i² ≥ 1/ν.

This then gives the water-filling solution for the power allocation over parallel continuous-input additive white Gaussian noise channels.
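To illustrate, here is a small Python sketch (not part of the original notes) that computes the water level 1/ν by bisection and returns the resulting allocation x_i = max(1/ν − σ_i², 0); the noise variances and the power budget P are arbitrary example values.

import numpy as np

def water_filling(noise_vars, P, tol=1e-10):
    """Return x_i = max(level - sigma_i^2, 0) with sum_i x_i = P, level = 1/nu found by bisection."""
    s = np.asarray(noise_vars, dtype=float)
    lo, hi = s.min(), s.max() + P               # the water level 1/nu lies in this interval
    while hi - lo > tol:
        level = 0.5 * (lo + hi)
        if np.maximum(level - s, 0.0).sum() > P:
            hi = level                          # too much water: lower the level
        else:
            lo = level                          # not enough water: raise the level
    return np.maximum(lo - s, 0.0)

sigma2 = [0.5, 1.0, 2.0, 4.0]                   # assumed noise variances
P = 3.0                                         # assumed total power budget
x = water_filling(sigma2, P)
print("power allocation:", x, "sum =", x.sum()) # noisiest channels may receive zero power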


Bibliography

[1] S. Arimoto, “An algorithm for computing the capacity of arbitrary discrete memoryless channel,” IEEE Trans. Inform. Theory, vol. 18, no. 1, pp. 14-20, Jan. 1972.

[2] R. B. Ash and C. A. Doleans-Dade, Probability and Measure Theory, Academic Press, MA, 2000.

[3] C. Berrou, A. Glavieux and P. Thitimajshima, “Near Shannon limit error-correcting coding and decoding: Turbo-codes(1),” Proc. IEEE Int. Conf. Commun., pp. 1064-1070, Geneva, Switzerland, May 1993.

[4] C. Berrou and A. Glavieux, “Near optimum error correcting coding and decoding: Turbo-codes,” IEEE Trans. Commun., vol. 44, no. 10, pp. 1261-1271, Oct. 1996.

[5] D. P. Bertsekas, with A. Nedic and A. E. Ozdagler, Convex Analysis and Optimization, Athena Scientific, Belmont, MA, 2003.

[6] P. Billingsley, Probability and Measure, 2nd Ed., John Wiley and Sons, NY, 1995.

[7] R. E. Blahut, “Computation of channel capacity and rate-distortion functions,” IEEE Trans. Inform. Theory, vol. 18, no. 4, pp. 460-473, Jul. 1972.

[8] R. E. Blahut, Theory and Practice of Error Control Codes, Addison-Wesley, MA, 1983.

[9] R. E. Blahut, Principles and Practice of Information Theory, Addison-Wesley, MA, 1988.

[10] R. E. Blahut, Algebraic Codes for Data Transmission, Cambridge Univ. Press, 2003.

[11] S. Boyd and L. Vandenberghe, Convex Optimization, Cambridge University Press, Cambridge, UK, 2003.


[12] S. A. Butman and R. J. McEliece, “The ultimate limits of binary coding for a wideband Gaussian channel,” DSN Progress Report 42-22, Jet Propulsion Lab., Pasadena, CA, pp. 78-80, August 1974.

[13] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd Ed., Wiley, NY, 2006.

[14] I. Csiszar and J. Korner, Information Theory: Coding Theorems for Discrete Memoryless Systems, Academic, NY, 1981.

[15] I. Csiszar and G. Tusnady, “Information geometry and alternating minimization procedures,” Statistics and Decision, Supplement Issue, vol. 1, pp. 205-237, 1984.

[16] S. H. Friedberg, A. J. Insel and L. E. Spence, Linear Algebra, 4th Ed., Prentice Hall, 2002.

[17] R. G. Gallager, “Low-density parity-check codes,” IRE Trans. Inform. Theory, vol. 28, no. 1, pp. 8-21, Jan. 1962.

[18] R. G. Gallager, Low-Density Parity-Check Codes, MIT Press, 1963.

[19] R. G. Gallager, Information Theory and Reliable Communication, Wiley, 1968.

[20] R. Gallager, “Variations on a theme by Huffman,” IEEE Trans. Inform. Theory, vol. 24, no. 6, pp. 668-674, Nov. 1978.

[21] R. M. Gray and L. D. Davisson, Random Processes: A Mathematical Approach for Engineers, Prentice-Hall, 1986.

[22] G. R. Grimmett and D. R. Stirzaker, Probability and Random Processes, 3rd Ed., Oxford University Press, NY, 2001.

[23] D. Guo, S. Shamai, and S. Verdu, “Mutual information and minimum mean-square error in Gaussian channels,” IEEE Trans. Inform. Theory, vol. 51, no. 4, pp. 1261-1283, Apr. 2005.

[24] T. S. Han and S. Verdu, “Approximation theory of output statistics,” IEEE Trans. Inform. Theory, vol. 39, no. 3, pp. 752-772, May 1993.

[25] S. Ihara, Information Theory for Continuous Systems, World-Scientific, Singapore, 1993.

[26] R. Johanesson and K. Zigangirov, Fundamentals of Convolutional Coding, IEEE, 1999.


[27] W. Karush, Minima of Functions of Several Variables with Inequalities as Side Constraints, M.Sc. Dissertation, Dept. Mathematics, Univ. Chicago, Chicago, Illinois, 1939.

[28] A. N. Kolmogorov, “On the Shannon theory of information transmission in the case of continuous signals,” IEEE Trans. Inform. Theory, vol. 2, no. 4, pp. 102-108, Dec. 1956.

[29] A. N. Kolmogorov and S. V. Fomin, Introductory Real Analysis, Dover Publications, NY, 1970.

[30] H. W. Kuhn and A. W. Tucker, “Nonlinear programming,” Proc. 2nd Berkeley Symposium, Berkeley, University of California Press, pp. 481-492, 1951.

[31] S. Lin and D. J. Costello, Error Control Coding: Fundamentals and Applications, 2nd Ed., Prentice Hall, NJ, 2004.

[32] A. Lozano, A. M. Tulino and S. Verdu, “Optimum power allocation for parallel Gaussian channels with arbitrary input distributions,” IEEE Trans. Inform. Theory, vol. 52, no. 7, pp. 3033-3051, Jul. 2006.

[33] D. J. C. MacKay and R. M. Neal, “Near Shannon limit performance of low density parity check codes,” Electronics Letters, vol. 33, no. 6, Mar. 1997.

[34] D. J. C. MacKay, “Good error correcting codes based on very sparse matrices,” IEEE Trans. Inform. Theory, vol. 45, no. 2, pp. 399-431, Mar. 1999.

[35] F. J. MacWilliams and N. J. A. Sloane, The Theory of Error Correcting Codes, North-Holland Pub. Co., 1978.

[36] J. E. Marsden and M. J. Hoffman, Elementary Classical Analysis, W. H. Freeman & Company, 1993.

[37] R. J. McEliece, The Theory of Information and Coding, 2nd Ed., Cambridge University Press, 2002.

[38] M. S. Pinsker, Information and Information Stability of Random Variables and Processes, Holden-Day, San Francisco, 1964.

[39] A. Renyi, “On the dimension and entropy of probability distributions,” Acta Math. Acad. Sci. Hung., vol. 10, pp. 193-215, 1959.

[40] M. Rezaeian and A. Grant, “Computation of total capacity for discrete memoryless multiple-access channels,” IEEE Trans. Inform. Theory, vol. 50, no. 11, pp. 2779-2784, Nov. 2004.


[41] T. J. Richardson and R. L. Urbanke, Modern Coding Theory, Cambridge University Press, 2008.

[42] H. L. Royden, Real Analysis, 3rd Ed., Macmillan Publishing Company, NY, 1988.

[43] C. E. Shannon, “A mathematical theory of communication,” Bell Syst. Tech. Journal, vol. 27, pp. 379-423, 1948.

[44] C. E. Shannon, “Coding theorems for a discrete source with a fidelity criterion,” IRE Nat. Conv. Rec., Pt. 4, pp. 142-163, 1959.

[45] C. E. Shannon and W. W. Weaver, The Mathematical Theory of Communication, Univ. of Illinois Press, Urbana, IL, 1949.

[46] P. C. Shields, The Ergodic Theory of Discrete Sample Paths, American Mathematical Society, 1991.

[47] N. J. A. Sloane and A. D. Wyner, Ed., Claude Elwood Shannon: Collected Papers, IEEE Press, NY, 1993.

[48] S. Verdu and S. Shamai, “Variable-rate channel capacity,” IEEE Trans. Inform. Theory, vol. 56, no. 6, pp. 2651-2667, June 2010.

[49] S.-W. Wang, P.-N. Chen and C.-H. Wang, “Optimal power allocation for (N,K)-limited access channels,” IEEE Trans. Inform. Theory, vol. 58, no. 6, pp. 3725-3750, June 2012.

[50] W. R. Wade, An Introduction to Analysis, Prentice Hall, NJ, 1995.

[51] S. Wicker, Error Control Systems for Digital Communication and Storage, Prentice Hall, NJ, 1995.

[52] A. D. Wyner, “The capacity of the band-limited Gaussian channel,” Bell Syst. Tech. Journal, vol. 45, pp. 359-371, 1966.

[53] R. W. Yeung, Information Theory and Network Coding, Springer, NY, 2008.
