
Chapter 6

Continuous Sources and Channels

Po-Ning Chen, Professor

Department of Communications Engineering

National Chiao Tung University

Hsin Chu, Taiwan 300, R.O.C.

Continuous Sources and Channels I: 6-1

• Model: {X_t ∈ X, t ∈ I}

– Discrete sources

∗ Both X and I are discrete.

– Continuous sources

∗ Discrete-time continuous sources

· X is continuous; I is discrete.

∗ Waveform sources

· Both X and I are continuous.

Information Content of Continuous Sources I: 6-2

• A straightforward extension of entropy from discrete sources to continuous sources.

  Entropy: $H(X) \triangleq \sum_{x\in\mathcal{X}} -P_X(x)\log P_X(x)$ nats.

Example 6.1 (extension of entropy to continuous sources) Given a source X with source alphabet [0, 1) and uniform generic distribution, we can make it discrete by quantizing it into m levels as
$$q_m(X) = \frac{i}{m}, \quad \text{if } \frac{i-1}{m} \le X < \frac{i}{m}, \quad \text{for } 1 \le i \le m.$$

Then the resultant entropy of the quantized source is
$$H(q_m(X)) = -\sum_{i=1}^{m} \frac{1}{m}\log\frac{1}{m} = \log(m) \ \text{nats}.$$

Since the entropy H(q_m(X)) of the quantized source is a lower bound to the entropy H(X) of the uniformly distributed continuous source,
$$H(X) \ge \lim_{m\to\infty} H(q_m(X)) = \infty.$$
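A small numerical sketch of Example 6.1 (Python, assuming NumPy is available; the sample size and the values of m are illustrative choices): it estimates H(q_m(X)) from uniform samples and checks that it tracks log(m), which diverges as m grows.

# Numerical illustration of Example 6.1: the entropy of an m-level uniform
# quantizer of a Uniform[0,1) source equals log(m) nats and diverges as m grows.
import numpy as np

rng = np.random.default_rng(0)
samples = rng.uniform(0.0, 1.0, size=1_000_000)

for m in (2, 8, 64, 1024):
    # q_m(X) = i/m on [(i-1)/m, i/m); the bin index suffices for the entropy.
    bins = np.floor(samples * m).astype(int)
    probs = np.bincount(bins, minlength=m) / bins.size
    probs = probs[probs > 0]
    H = -np.sum(probs * np.log(probs))          # empirical entropy in nats
    print(f"m = {m:5d}   H(q_m(X)) = {H:.4f}   log(m) = {np.log(m):.4f}")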

Information Content of Continuous Sources I: 6-3

• A quantization extension of entropy seems ineffective for continuous sources, because their entropies are all infinite.

Proof: For any continuous source X, there must exist a non-empty open interval in which the cumulative distribution function F_X(·) is strictly increasing. Now quantize the source into m + 1 levels as follows:

• Assign one level to the complement of this open interval, and
• assign m levels to this open interval such that the probability mass on this interval, denoted by a, is equally distributed among these m levels. (This is similar to making equal partitions on the concerned domain of F_X^{-1}(·).)

Then
$$H(X) \ge H(X_\Delta) = -(1-a)\log(1-a) - a\log\frac{a}{m},$$
where X_Δ represents the quantized version of X. The lower bound goes to infinity as m tends to infinity.

Differential Entropy I: 6-4

• Alternative extension of entropy to continuous sources (from its formula)

Definition 6.2 (differential entropy) The differential entropy (in nats) of a continuous source with generic probability density function (pdf) p_X is defined as
$$h(X) \triangleq -\int_{\mathcal{X}} p_X(x)\log p_X(x)\,dx.$$

The next example demonstrates the quantitative difference between the entropy and the differential entropy.

Example 6.3 A continuous source X with source alphabet [0, 1) and pdf f(x) = 2x has differential entropy equal to
$$\int_0^1 -2x\log(2x)\,dx = \left.\frac{x^2\bigl(1-2\log(2x)\bigr)}{2}\right|_0^1 = \frac{1}{2} - \log(2) \approx -0.193 \ \text{nats}.$$
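A quick sanity check of Example 6.3 (a sketch assuming NumPy/SciPy): numerically integrate −f(x) log f(x) and compare with the closed form 1/2 − log 2, confirming that differential entropy can indeed be negative.

# Numerical check of Example 6.3: h(X) for the pdf f(x) = 2x on [0, 1)
# should equal 1/2 - log(2) ≈ -0.193 nats, a negative differential entropy.
import numpy as np
from scipy.integrate import quad

integrand = lambda x: -2.0 * x * np.log(2.0 * x)
h, _ = quad(integrand, 0.0, 1.0)
print(f"numerical  h(X) = {h:.6f} nats")
print(f"closed form     = {0.5 - np.log(2.0):.6f} nats")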

Examples of Differential Entropy I: 6-5

• Note that the differential entropy, unlike the entropy, can be negative in its value.

Example 6.4 (differential entropy of continuous sources with uniform generic distribution) A continuous source X with uniform generic distribution over (a, b) has differential entropy
$$h(X) = \log|b-a| \ \text{nats}.$$

Example 6.5 (differential entropy of Gaussian sources) A continuous source X with Gaussian generic distribution of mean µ and variance σ² has differential entropy
$$\begin{aligned}
h(X) &= \int_{\mathbb{R}} \varphi(x)\left[\tfrac{1}{2}\log(2\pi\sigma^2) + \tfrac{(x-\mu)^2}{2\sigma^2}\right]dx \\
     &= \tfrac{1}{2}\log(2\pi\sigma^2) + \tfrac{1}{2\sigma^2}\,E[(X-\mu)^2] \\
     &= \tfrac{1}{2}\log(2\pi\sigma^2) + \tfrac{1}{2} \\
     &= \tfrac{1}{2}\log(2\pi e\sigma^2) \ \text{nats},
\end{aligned}$$
where φ(x) is the pdf of the Gaussian distribution with mean µ and variance σ².
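A sketch (assuming NumPy/SciPy) that estimates h(X) = E[−log φ(X)] by Monte Carlo for an arbitrary mean/variance choice and compares it with the closed form ½ log(2πeσ²) of Example 6.5.

# Monte Carlo check of Example 6.5: h(X) = E[-log φ(X)] for X ~ N(µ, σ²)
# should match (1/2) log(2π e σ²) nats.
import numpy as np
from scipy.stats import norm

mu, sigma = 1.5, 2.0
rng = np.random.default_rng(1)
x = rng.normal(mu, sigma, size=1_000_000)

h_mc = -np.mean(norm.logpdf(x, loc=mu, scale=sigma))   # empirical E[-log φ(X)]
h_cf = 0.5 * np.log(2 * np.pi * np.e * sigma**2)       # closed form
print(f"Monte Carlo h(X) = {h_mc:.4f} nats, closed form = {h_cf:.4f} nats")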

Properties of Differential Entropy I: 6-6

• The extension of the AEP theorem from discrete cases to continuous cases is not based on "number counting" (which is always infinity for continuous sources), but on "volume measuring."

Theorem 6.6 (AEP for continuous sources) Let X_1, . . . , X_n be a sequence of sources drawn i.i.d. according to the density p_X(·). Then
$$-\frac{1}{n}\log p_{X^n}(X_1,\ldots,X_n) \to E[-\log p_X(X)] = h(X) \quad \text{in probability}.$$

Proof: The proof is an immediate consequence of the law of large numbers.

Definition 6.7 (typical set) For δ > 0 and any given n, define the typical set as
$$\mathcal{F}_n(\delta) \triangleq \left\{ x^n \in \mathcal{X}^n : \left|-\frac{1}{n}\log p_{X^n}(x_1,\ldots,x_n) - h(X)\right| < \delta \right\}.$$

Definition 6.8 (volume) The volume of a set A is defined as
$$\mathrm{Vol}(A) \triangleq \int_A dx_1\cdots dx_n.$$
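A minimal sketch (assuming NumPy/SciPy) of the AEP in Theorem 6.6 for a standard Gaussian source: as n grows, −(1/n) log p_{X^n}(X^n) concentrates near h(X).

# Illustration of Theorem 6.6 (AEP): for i.i.d. X_i ~ N(0, 1), the normalized
# log-density -(1/n) Σ log p(X_i) concentrates around h(X) = 0.5*log(2πe).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
h = 0.5 * np.log(2 * np.pi * np.e)        # differential entropy of N(0,1), in nats

for n in (10, 100, 10_000):
    x = rng.normal(size=n)
    empirical = -np.mean(norm.logpdf(x))  # -(1/n) log p_{X^n}(X^n)
    print(f"n = {n:6d}   -(1/n) log p(X^n) = {empirical:.4f}   h(X) = {h:.4f}")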

Properties of Differential Entropy I: 6-7

Theorem 6.9 (Shannon-McMillan theorem for continuous sources)

1. For n sufficiently large, P_{X^n}{F_n^c(δ)} < δ.

2. Vol(F_n(δ)) ≤ e^{n(h(X)+δ)} for all n.

3. Vol(F_n(δ)) ≥ (1 − δ) e^{n(h(X)−δ)} for n sufficiently large.

Proof: The proof is an extension of the Shannon-McMillan theorem for discrete sources, and hence we omit it.

We can also derive a source coding theorem for continuous sources and obtain that when a continuous source is compressed by quantization, it is beneficial to put most of the quantization effort on its typical set, instead of on the entire source space.

E.g.

• Assign (m − 1) levels to the elements in F_n(δ);

• Assign 1 level to the elements outside F_n(δ).


Properties of Differential Entropy I: 6-8

Remarks:

• If the differential entropy of a continuous source is larger, it is expected that

a larger number of quantization levels is required in order to minimize the

distortion introduced via quantization.

• So we may conclude that continuous sources with higher differential entropy contain more information in the volume sense.

• Question: Which continuous source is the richest in information (i.e., has the highest differential entropy and costs the most in quantization)? Answer: Gaussian.

Properties of Differential Entropy I: 6-9

Theorem 6.10 (maximal differential entropy of Gaussian source) The Gaussian source has the largest differential entropy among all continuous sources with identical mean and variance.

Proof: Let p(·) be the pdf of a continuous source X, and let φ(·) be the pdf of a Gaussian source Y. Assume that these two sources have the same mean µ and variance σ². Observe that
$$\begin{aligned}
-\int_{\mathbb{R}} \varphi(y)\log\varphi(y)\,dy
&= \int_{\mathbb{R}} \varphi(y)\left[\tfrac{1}{2}\log(2\pi\sigma^2) + \tfrac{(y-\mu)^2}{2\sigma^2}\right]dy \\
&= \tfrac{1}{2}\log(2\pi\sigma^2) + \tfrac{1}{2\sigma^2}\,E[(Y-\mu)^2] \\
&= \tfrac{1}{2}\log(2\pi\sigma^2) + \tfrac{1}{2\sigma^2}\,E[(X-\mu)^2] \\
&= -\int_{\mathbb{R}} p(x)\left[-\tfrac{1}{2}\log(2\pi\sigma^2) - \tfrac{(x-\mu)^2}{2\sigma^2}\right]dx \\
&= -\int_{\mathbb{R}} p(x)\log\varphi(x)\,dx.
\end{aligned}$$

Properties of Differential Entropy I: 6-10

Hence,
$$\begin{aligned}
h(Y) - h(X)
&= -\int_{\mathbb{R}} \varphi(y)\log\varphi(y)\,dy + \int_{\mathbb{R}} p(x)\log p(x)\,dx \\
&= -\int_{\mathbb{R}} p(x)\log\varphi(x)\,dx + \int_{\mathbb{R}} p(x)\log p(x)\,dx \\
&= \int_{\mathbb{R}} p(x)\log\frac{p(x)}{\varphi(x)}\,dx \\
&\ge \int_{\mathbb{R}} p(x)\left(1 - \frac{\varphi(x)}{p(x)}\right)dx \quad \text{(fundamental inequality)} \\
&= \int_{\mathbb{R}} \bigl(p(x) - \varphi(x)\bigr)\,dx \\
&= 0,
\end{aligned}$$
with equality if, and only if, p(x) = φ(x) for all x ∈ ℝ.
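A short sketch (assuming NumPy; the closed-form differential entropies used below are standard results): among a Gaussian, a Laplace, and a uniform density, all matched to the same variance, the Gaussian attains the largest differential entropy, as Theorem 6.10 asserts.

# Illustration of Theorem 6.10: among zero-mean, unit-variance densities,
# the Gaussian has the largest differential entropy (values in nats).
import numpy as np

sigma = 1.0
h_gauss   = 0.5 * np.log(2 * np.pi * np.e * sigma**2)
h_uniform = np.log(np.sqrt(12.0) * sigma)          # Uniform(-σ√3, σ√3), variance σ²
h_laplace = 1.0 + np.log(np.sqrt(2.0) * sigma)     # Laplace scale b = σ/√2, variance σ²

print(f"Gaussian : {h_gauss:.4f}")
print(f"Laplace  : {h_laplace:.4f}")
print(f"Uniform  : {h_uniform:.4f}")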

Properties of Differential Entropy I: 6-11

• Properties of differential entropy for continuous sources that are the same as in the discrete cases.

Lemma 6.11

1. h(X|Y) ≤ h(X) with equality if, and only if, X and Y are independent.

2. (chain rule for differential entropy)
$$h(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} h(X_i \mid X_1, X_2, \ldots, X_{i-1}).$$

3. h(X^n) ≤ ∑_{i=1}^{n} h(X_i) with equality if, and only if, {X_i}_{i=1}^{n} are independent.

Properties of Differential Entropy I: 6-12

• There are some properties that are conceptually different in continuous cases from the original ones in discrete cases.

Lemma 6.12

– (In discrete cases) For any one-to-one correspondence mapping f,
$$H(f(X)) = H(X).$$

– (In continuous cases) For a mapping f(x) = ax with some non-zero constant a,
$$h(f(X)) = h(X) + \log|a|.$$

Proof: (For continuous cases only) Let p_X(·) and p_{f(X)}(·) be respectively the pdfs of the original source and the mapped source. Then
$$p_{f(X)}(u) = \frac{1}{|a|}\,p_X\!\left(\frac{u}{a}\right).$$
Substituting the new pdf into the formula of differential entropy gives the desired result.
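A sketch (assuming NumPy/SciPy) checking the continuous-case part of Lemma 6.12 by Monte Carlo for a Gaussian source scaled by an illustrative a = 3.

# Monte Carlo check of Lemma 6.12: for Y = aX, h(Y) = h(X) + log|a|.
# Uses X ~ N(0, 1), so that both densities have known closed forms.
import numpy as np
from scipy.stats import norm

a = 3.0
rng = np.random.default_rng(3)
x = rng.normal(size=1_000_000)
y = a * x                                    # Y = aX ~ N(0, a²)

h_x = -np.mean(norm.logpdf(x))               # estimate of h(X)
h_y = -np.mean(norm.logpdf(y, scale=abs(a))) # estimate of h(aX)
print(f"h(aX) - h(X) = {h_y - h_x:.4f},  log|a| = {np.log(abs(a)):.4f}")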

Important Notes on Differential Entropy I: 6-13

• The above lemma says that the differential entropy can be increased by a one-to-one correspondence mapping.

• Therefore, if we viewed this quantity as a measure of the information content of continuous sources, we would conclude that information content can be increased by a function mapping, which is somewhat contrary to intuition.

• For this reason, it may not be appropriate to interpret the differential entropy as an index of the information content of continuous sources.

• Indeed, it may be better to view this quantity as a measure of quantization

efficiency.


Important Notes on Differential Entropy I: 6-14

• Some researchers interpret the differential entropy as a convenient intermediate quantity for calculating the mutual information and divergence of systems with continuous alphabets, which is in general true.

• Before introducing the formula of relative entropy and mutual information for

continuous settings, we point out beforehand that the interpretation, as well

as the operational characteristics, of the mutual information and divergence

for systems with continuous alphabets is exactly the same as for systems with

discrete alphabets.

More Properties on Differential Entropy I: 6-15

Corollary 6.13 For a sequence of continuous sources X^n = (X_1, . . . , X_n) and a non-singular n × n matrix A,
$$h(AX^n) = h(X^n) + \log|\det A|,$$
where |det A| denotes the absolute value of the determinant of the matrix A.

Corollary 6.14 h(X + c) = h(X) and h(X|Y) = h(X + Y|Y).

Operational Meaning of Differential Entropy I: 6-16

Lemma 6.15 Given the pdf f(x) of a continuous source X, suppose that −f(x) log₂ f(x) is Riemann-integrable. Then uniformly quantizing the source to n-bit accuracy, i.e., with quantization width no greater than 2^{−n}, requires approximately h(X) + n bits (for n sufficiently large), where h(X) is measured in bits.

Proof:

Step 1: Mean-value theorem.

Let Δ = 2^{−n} be the width between two adjacent quantization levels. Let t_i = iΔ for integer i ∈ (−∞, ∞). From the mean-value theorem, we can choose x_i ∈ [t_{i−1}, t_i] such that
$$\int_{t_{i-1}}^{t_i} f(x)\,dx = f(x_i)(t_i - t_{i-1}) = \Delta\cdot f(x_i).$$

Operational Meaning of Differential Entropy I: 6-17

Step 2: Definition of h_Δ(X).

Let
$$h_\Delta(X) \triangleq \sum_{i=-\infty}^{\infty} \bigl[-f(x_i)\log_2 f(x_i)\bigr]\Delta.$$

Since −f(x) log₂ f(x) is Riemann-integrable,
$$h_\Delta(X) \to h(X) \quad \text{as } \Delta = 2^{-n} \to 0.$$

Therefore, given any ε > 0, there exists N such that for all n > N,
$$|h(X) - h_\Delta(X)| < \varepsilon.$$

Step 3: Computation of H(X_Δ).

The entropy of the quantized source X_Δ is
$$H(X_\Delta) = -\sum_{i=-\infty}^{\infty} p_i\log_2 p_i = -\sum_{i=-\infty}^{\infty} \bigl(f(x_i)\Delta\bigr)\log_2\bigl(f(x_i)\Delta\bigr) \ \text{bits}.$$

Operational Meaning of Differential Entropy I: 6-18

Step 4: H(X_Δ) − h_Δ(X).

From Steps 2 and 3,
$$\begin{aligned}
H(X_\Delta) - h_\Delta(X) &= -\sum_{i=-\infty}^{\infty} [f(x_i)\Delta]\log_2\Delta \\
&= (-\log_2\Delta)\sum_{i=-\infty}^{\infty}\int_{t_{i-1}}^{t_i} f(x)\,dx \\
&= (-\log_2\Delta)\int_{-\infty}^{\infty} f(x)\,dx = -\log_2\Delta = n.
\end{aligned}$$

Hence,
$$[h(X) + n] - \varepsilon < H(X_\Delta) = h_\Delta(X) + n < [h(X) + n] + \varepsilon,$$
for n > N.

• Remark 1: Since H(X_Δ) is the minimum average codeword length for lossless data compression of the quantized source, uniformly quantizing a continuous source up to n-bit accuracy requires approximately h(X) + n bits.

• Remark 2: We may conclude that the larger the differential entropy, the larger the average number of bits required to uniformly quantize the source subject to a fixed accuracy.
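A sketch (assuming NumPy/SciPy) of the operational statement just proved: quantize a standard Gaussian with cell width 2^{−n} and compare H(X_Δ) with h(X) + n, both in bits. Truncating the grid to ±10 standard deviations is an assumption that discards only negligible probability mass.

# Numerical illustration of Lemma 6.15: for X ~ N(0, 1) uniformly quantized
# with width Δ = 2^{-n}, the entropy H(X_Δ) is approximately h(X) + n bits.
import numpy as np
from scipy.stats import norm

n = 8
delta = 2.0 ** (-n)
edges = np.arange(-10.0, 10.0 + delta, delta)   # cell boundaries covering ~all mass
p = np.diff(norm.cdf(edges))                    # probability of each quantization cell
p = p[p > 0]

H_quantized = -np.sum(p * np.log2(p))           # H(X_Δ) in bits
h_bits = 0.5 * np.log2(2 * np.pi * np.e)        # h(X) of N(0,1) in bits
print(f"H(X_Δ) = {H_quantized:.4f} bits,  h(X) + n = {h_bits + n:.4f} bits")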


Operational Meaning of Differential Entropy I: 6-19

• This operational meaning of differential entropy can be used to interpret its

properties introduced in the previous subsection. For example, “h(X + c) =

h(X)” can be interpreted as “A shift in value does not change the quantization

efficiency of the original source.”

Riemann Integral Versus Lebesgue Integral I: 6-20

Riemann integral:

Let s(x) represent a step function on [a, b), i.e., a function for which there exists a partition a = x_0 < x_1 < · · · < x_n = b such that s(x) is constant on each (x_i, x_{i+1}) for 0 ≤ i < n.

If a function f(x) is Riemann integrable, then
$$\int_a^b f(x)\,dx \triangleq \sup_{\{s(x)\,:\,s(x)\le f(x)\}} \int_a^b s(x)\,dx = \inf_{\{s(x)\,:\,s(x)\ge f(x)\}} \int_a^b s(x)\,dx.$$

Example of a non-Riemann-integrable function:

f(x) = 0 if x is irrational; f(x) = 1 if x is rational.

Then
$$\sup_{\{s(x)\,:\,s(x)\le f(x)\}} \int_a^b s(x)\,dx = 0,
\quad\text{but}\quad
\inf_{\{s(x)\,:\,s(x)\ge f(x)\}} \int_a^b s(x)\,dx = (b-a).$$

Riemann Integral Versus Lebesgue Integral I: 6-21

Lebesgue integral:

Let t(x) represent a simple function, which is defined as a linear combination of indicator functions of mutually disjoint sets.

For example, let U_1, . . . , U_m be mutually disjoint subsets partitioning the domain X, i.e., ∪_{i=1}^{m} U_i = X. The indicator function of U_i is 1(x; U_i) = 1 if x ∈ U_i, and 0 otherwise.

Then t(x) = ∑_{i=1}^{m} a_i 1(x; U_i) is a simple function.

If a function f(x) is Lebesgue integrable, then
$$\int_a^b f(x)\,dx = \sup_{\{t(x)\,:\,t(x)\le f(x)\}} \int_a^b t(x)\,dx = \inf_{\{t(x)\,:\,t(x)\ge f(x)\}} \int_a^b t(x)\,dx.$$

The previous example is actually Lebesgue integrable, and its Lebesgue integral is equal to zero.

Example of Quantization Efficiency I: 6-22

Example 6.16 Find the minimum average number of bits required to uniformly quantize the decay time (in years) of a radium atom up to 3-digit accuracy, if the half-life of the radium is 80 years. Note that the half-life of a radium atom is the median of its decay time distribution f(x) = λe^{−λx} (x > 0).

Since the median is 80, we obtain
$$\int_0^{80} \lambda e^{-\lambda x}\,dx = 0.5,$$
which implies λ = 0.00866. Also, 3-digit accuracy is approximately equivalent to log₂ 999 = 9.96 ≈ 10-bit accuracy. Therefore, the number of bits required to quantize the source is approximately
$$h(X) + 10 = \log_2\frac{e}{\lambda} + 10 = 18.29 \ \text{bits}.$$
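A short computation sketch of Example 6.16 (assuming NumPy); it follows the slide in rounding log₂ 999 up to 10 bits.

# Reproduces the numbers in Example 6.16: λ from the 80-year half-life,
# h(X) of the exponential decay time in bits, plus 10 bits for 3-digit accuracy.
import numpy as np

half_life = 80.0
lam = np.log(2.0) / half_life               # median of Exp(λ) is ln(2)/λ = 80
h_bits = np.log2(np.e / lam)                # differential entropy of Exp(λ), in bits
accuracy_bits = np.ceil(np.log2(999))       # ≈ 10 bits for 3-digit accuracy
print(f"λ = {lam:.5f} per year")
print(f"h(X) = {h_bits:.2f} bits, total = {h_bits + accuracy_bits:.2f} bits")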

Relative Entropy and Mutual Information I: 6-23

Definition 6.17 (relative entropy) Define the relative entropy between two densities p_X and p_X̂ by
$$D(X\|\hat{X}) \triangleq \int_{\mathcal{X}} p_X(x)\log\frac{p_X(x)}{p_{\hat{X}}(x)}\,dx.$$

Definition 6.18 (mutual information) The mutual information with input-output joint density p_{X,Y}(x, y) is defined as
$$I(X;Y) \triangleq \int_{\mathcal{X}\times\mathcal{Y}} p_{X,Y}(x,y)\log\frac{p_{X,Y}(x,y)}{p_X(x)\,p_Y(y)}\,dx\,dy.$$

Relative Entropy and Mutual Information I: 6-24

• Unlike the situation for entropy, the properties of relative entropy and mutual information in continuous cases are the same as those in the discrete cases.

• In particular, the mutual information of the quantized version of a continuous channel converges to the mutual information of the same continuous channel (as the quantization step size goes to zero).

• Hence, some researchers prefer to define the mutual information of a continuous channel directly as this limit over quantized channels.

• Here, we quote some of the properties of relative entropy and mutual information from the discrete setting.

Lemma 6.19

1. D(X‖X̂) ≥ 0 with equality if, and only if, p_X = p_X̂.

2. I(X;Y) ≥ 0 with equality if, and only if, X and Y are independent.

3. I(X;Y) = h(Y) − h(Y|X).
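A sketch (assuming NumPy/SciPy) evaluating Definition 6.17 for two Gaussian densities and comparing the result with the well-known closed form; the means and variances below are arbitrary illustrative choices.

# Relative entropy between two Gaussian densities: numerical integration of
# Definition 6.17 versus the standard closed form, both in nats.
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

m1, s1 = 0.0, 1.0
m2, s2 = 1.0, 2.0

integrand = lambda x: norm.pdf(x, m1, s1) * (norm.logpdf(x, m1, s1) - norm.logpdf(x, m2, s2))
D_num, _ = quad(integrand, -20, 20)
D_cf = np.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5
print(f"numerical D = {D_num:.6f} nats, closed form = {D_cf:.6f} nats")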


Rate Distortion Theorem Revisited I: 6-25

• Since the entropy of continuous sources is infinity, to compress a continuous

source without distortion is impossible according to Shannon’s source coding

theorem.

• Thus, one way to characterize the data compression for continuous sources is

to encode the original source subject to a constraint on the distortion, which

yields the rate-distortion function for data compression.

• In concept, the rate-distortion function is the minimum data compression rate

(nats required per source letter or nats required per source sample) for which

the distortion constraint is satisfied.

Specific Formula of Rate Distortion Theorem I: 6-26

Theorem 6.20 Under the squared error distortion measure, namely
$$\rho(z,\hat{z}) = (z-\hat{z})^2,$$
the rate-distortion function for a continuous source Z with zero mean and variance σ² satisfies
$$R(D) \le \begin{cases} \dfrac{1}{2}\log\dfrac{\sigma^2}{D}, & \text{for } 0 \le D \le \sigma^2; \\[2mm] 0, & \text{for } D > \sigma^2. \end{cases}$$
Equality holds when Z is Gaussian.

Proof: By Theorem 5.16 (extended to the squared-error distortion measure),
$$R(D) = \min_{\{p_{\hat{Z}|Z}\,:\,E[(Z-\hat{Z})^2]\le D\}} I(Z;\hat{Z}).$$
So for any p_{Ẑ|Z} satisfying the distortion constraint,
$$R(D) \le I(p_Z, p_{\hat{Z}|Z}).$$
For 0 ≤ D ≤ σ², choose a dummy Gaussian random variable W with zero mean and variance aD, where a = 1 − D/σ², independent of Z. Let Ẑ = aZ + W. Then
$$E[(Z-\hat{Z})^2] = E[(1-a)^2 Z^2] + E[W^2] = (1-a)^2\sigma^2 + aD = D,$$

Specific Formula of Rate Distortion Theorem I: 6-27

which satisfies the distortion constraint. Note that the variance of Ẑ is equal to E[a²Z²] + E[W²] = σ² − D. Consequently,
$$\begin{aligned}
R(D) \le I(Z;\hat{Z}) &= h(\hat{Z}) - h(\hat{Z}|Z) \\
&= h(\hat{Z}) - h(W + aZ\,|\,Z) \\
&= h(\hat{Z}) - h(W|Z) \quad \text{(by Corollary 6.14)} \\
&= h(\hat{Z}) - h(W) \quad \text{(by Lemma 6.11)} \\
&= h(\hat{Z}) - \tfrac{1}{2}\log\bigl(2\pi e(aD)\bigr) \\
&\le \tfrac{1}{2}\log\bigl(2\pi e(\sigma^2 - D)\bigr) - \tfrac{1}{2}\log\bigl(2\pi e(aD)\bigr) \\
&= \tfrac{1}{2}\log\frac{\sigma^2}{D}.
\end{aligned}$$

For D > σ², let Ẑ satisfy Pr{Ẑ = 0} = 1 and be independent of Z. Then E[(Z − Ẑ)²] = E[Z²] + E[Ẑ²] − 2E[Z]E[Ẑ] = σ² < D, and I(Z;Ẑ) = 0. Hence, R(D) = 0 for D > σ².

The achievability of the upper bound by the Gaussian source will be proved in Theorem 6.22.

Rate Distortion Function for Binary Source I: 6-28

Theorem 6.21 Fix a memoryless binary source
$$\{Z^n = (Z_1, Z_2, \ldots, Z_n)\}_{n=1}^{\infty}$$
with marginal distribution P_Z(0) = 1 − P_Z(1) = p. Assume that the Hamming additive distortion measure is employed. Then the rate-distortion function is
$$R(D) = \begin{cases} H_b(p) - H_b(D), & \text{if } 0 \le D \le \min\{p, 1-p\}; \\ 0, & \text{if } D > \min\{p, 1-p\}, \end{cases}$$
where H_b(p) ≜ −p·log(p) − (1 − p)·log(1 − p) is the binary entropy function.

Proof: Assume without loss of generality that p ≤ 1/2.

We first prove the theorem for 0 ≤ D < min{p, 1 − p} = p. Observe that for any binary random variable Ẑ,
$$H(Z|\hat{Z}) = H(Z \oplus \hat{Z} \mid \hat{Z}).$$
Also observe that
$$E[\rho(Z,\hat{Z})] \le D \ \text{ implies } \ \Pr\{Z \oplus \hat{Z} = 1\} \le D.$$

Rate Distortion Function for Binary Source I: 6-29

Then
$$\begin{aligned}
I(Z;\hat{Z}) &= H(Z) - H(Z|\hat{Z}) \\
&= H_b(p) - H(Z\oplus\hat{Z}\,|\,\hat{Z}) \\
&\ge H_b(p) - H(Z\oplus\hat{Z}) \quad \text{(conditioning never increases entropy)} \\
&\ge H_b(p) - H_b(D),
\end{aligned}$$
where the last inequality follows since H_b(x) is increasing for x ≤ 1/2 and Pr{Z ⊕ Ẑ = 1} ≤ D. Since the above derivation is true for any P_{Ẑ|Z}, we have
$$R(D) \ge H_b(p) - H_b(D).$$

It remains to show that the lower bound is achievable by some P_{Ẑ|Z}, or equivalently, that H(Z|Ẑ) = H_b(D) for some P_{Ẑ|Z}. By defining P_{Z|Ẑ}(0|0) = P_{Z|Ẑ}(1|1) = 1 − D, we immediately obtain H(Z|Ẑ) = H_b(D). The desired P_{Ẑ|Z} can be obtained by solving
$$1 = P_{\hat{Z}}(0) + P_{\hat{Z}}(1)
= \frac{P_Z(0)}{P_{Z|\hat{Z}}(0|0)}P_{\hat{Z}|Z}(0|0) + \frac{P_Z(0)}{P_{Z|\hat{Z}}(0|1)}P_{\hat{Z}|Z}(1|0)
= \frac{p}{1-D}P_{\hat{Z}|Z}(0|0) + \frac{p}{D}\bigl(1 - P_{\hat{Z}|Z}(0|0)\bigr)$$

Rate Distortion Function for Binary Source I: 6-30

and
$$1 = P_{\hat{Z}}(0) + P_{\hat{Z}}(1)
= \frac{P_Z(1)}{P_{Z|\hat{Z}}(1|0)}P_{\hat{Z}|Z}(0|1) + \frac{P_Z(1)}{P_{Z|\hat{Z}}(1|1)}P_{\hat{Z}|Z}(1|1)
= \frac{1-p}{D}\bigl(1 - P_{\hat{Z}|Z}(1|1)\bigr) + \frac{1-p}{1-D}P_{\hat{Z}|Z}(1|1),$$
which yield
$$P_{\hat{Z}|Z}(0|0) = \frac{1-D}{1-2D}\left(1 - \frac{D}{p}\right)
\quad\text{and}\quad
P_{\hat{Z}|Z}(1|1) = \frac{1-D}{1-2D}\left(1 - \frac{D}{1-p}\right).$$

Now in the case of p ≤ D < 1 − p, we can let P_{Ẑ|Z}(1|0) = P_{Ẑ|Z}(1|1) = 1 to obtain I(Z;Ẑ) = 0 and
$$E[\rho(Z,\hat{Z})] = \sum_{z=0}^{1}\sum_{\hat{z}=0}^{1} P_Z(z)P_{\hat{Z}|Z}(\hat{z}|z)\,\rho(z,\hat{z}) = p \le D.$$

Similarly, in the case of D ≥ 1 − p, we let P_{Ẑ|Z}(0|0) = P_{Ẑ|Z}(0|1) = 1 to obtain I(Z;Ẑ) = 0 and
$$E[\rho(Z,\hat{Z})] = \sum_{z=0}^{1}\sum_{\hat{z}=0}^{1} P_Z(z)P_{\hat{Z}|Z}(\hat{z}|z)\,\rho(z,\hat{z}) = 1 - p \le D.$$

Rate Distortion Function for Binary Source I: 6-31

• Remark: The Hamming additive distortion measure is defined as
$$\rho_n(z^n, \hat{z}^n) = \sum_{i=1}^{n} z_i \oplus \hat{z}_i,$$
where "⊕" denotes modulo-two addition. In such a case, ρ_n(z^n, ẑ^n) is exactly the number of bit changes or bit errors after compression.
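A short sketch (assuming NumPy) evaluating the rate-distortion function of Theorem 6.21 for an illustrative p = 0.2; H_b and R(D) are computed in nats to match the slides.

# Evaluates the binary rate-distortion function of Theorem 6.21:
# R(D) = H_b(p) - H_b(D) for 0 <= D < min{p, 1-p}, and 0 otherwise (in nats).
import numpy as np

def binary_entropy(x):
    # H_b(x) in nats, with the convention 0*log(0) = 0.
    x = np.clip(x, 1e-12, 1 - 1e-12)
    return -x * np.log(x) - (1 - x) * np.log(1 - x)

def rate_distortion_binary(p, D):
    Dmax = min(p, 1 - p)
    return binary_entropy(p) - binary_entropy(D) if D < Dmax else 0.0

p = 0.2
for D in (0.0, 0.05, 0.1, 0.2, 0.3):
    print(f"D = {D:.2f}   R(D) = {rate_distortion_binary(p, D):.4f} nats")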

Rate Distortion Function for Gaussian Sources I: 6-32

Theorem 6.22 Fix a memoryless source
$$\{Z^n = (Z_1, Z_2, \ldots, Z_n)\}_{n=1}^{\infty}$$
with zero-mean Gaussian marginal distribution of variance σ². Assume that the squared error distortion measure is employed. Then the rate-distortion function is given by
$$R(D) = \begin{cases} \dfrac{1}{2}\log\dfrac{\sigma^2}{D}, & \text{if } 0 \le D \le \sigma^2; \\[2mm] 0, & \text{if } D > \sigma^2. \end{cases}$$

Proof: From Theorem 6.20, it suffices to show that under the Gaussian source, (1/2) log(σ²/D) is a lower bound to R(D) for 0 ≤ D ≤ σ².

Rate Distortion Function for Gaussian Sources I: 6-33

This can be proved as follows. For a Gaussian source Z and any Ẑ with E[(Z − Ẑ)²] ≤ D,
$$\begin{aligned}
I(Z;\hat{Z}) &= h(Z) - h(Z|\hat{Z}) \\
&= \tfrac{1}{2}\log(2\pi e\sigma^2) - h(Z - \hat{Z}\,|\,\hat{Z}) \quad \text{(Corollary 6.14)} \\
&\ge \tfrac{1}{2}\log(2\pi e\sigma^2) - h(Z - \hat{Z}) \quad \text{(Lemma 6.11)} \\
&\ge \tfrac{1}{2}\log(2\pi e\sigma^2) - \tfrac{1}{2}\log\bigl(2\pi e\,\mathrm{Var}[Z - \hat{Z}]\bigr) \quad \text{(Theorem 6.10)} \\
&\ge \tfrac{1}{2}\log(2\pi e\sigma^2) - \tfrac{1}{2}\log\bigl(2\pi e\,E[(Z - \hat{Z})^2]\bigr) \\
&\ge \tfrac{1}{2}\log(2\pi e\sigma^2) - \tfrac{1}{2}\log(2\pi e D) \\
&= \tfrac{1}{2}\log\frac{\sigma^2}{D}.
\end{aligned}$$
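A small sketch (assuming NumPy) of the Gaussian rate-distortion function just derived; it also inverts the formula to the distortion-rate form D(R) = σ²e^{−2R} as a consistency check.

# Gaussian rate-distortion function of Theorem 6.22: R(D) = max(0, 0.5*log(σ²/D)),
# together with the equivalent distortion-rate form D(R) = σ² e^{-2R} (nats).
import numpy as np

def rate_distortion_gaussian(sigma2, D):
    return 0.5 * np.log(sigma2 / D) if 0 < D <= sigma2 else 0.0

sigma2 = 4.0
for D in (0.5, 1.0, 2.0, 4.0):
    R = rate_distortion_gaussian(sigma2, D)
    print(f"D = {D:4.1f}   R(D) = {R:.4f} nats   check D(R) = {sigma2*np.exp(-2*R):.4f}")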

Channel Coding Theorem for Continuous Sources I: 6-34

• Power constraint on channel input

– Deriving the channel capacity of a memoryless continuous channel without any constraint on the inputs is somewhat impractical, especially when the input can be any number on the infinite real line.

– Such a constraint is usually of the form
$$E[t(X)] \le S \quad\text{or}\quad \frac{1}{n}\sum_{i=1}^{n} E[t(X_i)] \le S \ \text{ for a sequence of random inputs},$$
where t(·) is a non-negative cost function.

Example 6.23 (average power constraint) t(x) ≜ x², i.e., the constraint is that the average input power is bounded above by S.

Channel Coding Theorem for Continuous Sources I: 6-35

• As extended from the discrete cases, the channel capacity of a discrete-time continuous channel with the input cost constraint is of the form
$$C(S) \triangleq \max_{\{p_X\,:\,E[t(X)]\le S\}} I(X;Y). \qquad (6.3.1)$$

• Claim: C(S) is a concave function of S.

Channel Coding Theorem for Continuous Sources I: 6-36

Lemma 6.24 (concavity of capacity-cost function) C(S) is concave, continuous, and strictly increasing in S.

Proof: Let P_{X_1} and P_{X_2} be two distributions that respectively achieve C(P_1) and C(P_2). Denote P_{X_λ} ≜ λP_{X_1} + (1 − λ)P_{X_2}. Then
$$\begin{aligned}
C(\lambda P_1 + (1-\lambda)P_2) &= \max_{\{P_X\,:\,E[t(X)]\le \lambda P_1 + (1-\lambda)P_2\}} I(P_X, P_{Y|X}) \\
&\ge I(P_{X_\lambda}, P_{Y|X}) \\
&\ge \lambda I(P_{X_1}, P_{Y|X}) + (1-\lambda) I(P_{X_2}, P_{Y|X}) \\
&= \lambda C(P_1) + (1-\lambda) C(P_2),
\end{aligned}$$
where the first inequality holds since
$$\begin{aligned}
E_{X_\lambda}[t(X)] &= \int_{\mathbb{R}} t(x)\,dP_{X_\lambda}(x)
= \lambda\int_{\mathbb{R}} t(x)\,dP_{X_1}(x) + (1-\lambda)\int_{\mathbb{R}} t(x)\,dP_{X_2}(x) \\
&= \lambda E_{X_1}[t(X)] + (1-\lambda)E_{X_2}[t(X)]
\le \lambda P_1 + (1-\lambda)P_2,
\end{aligned}$$
and the second inequality follows from the concavity of mutual information with respect to its first argument. Accordingly, C(S) is concave in S.

Channel Coding Theorem for Continuous Sources I: 6-37

Furthermore, it can easily be seen by definition that C(S) is non-decreasing, which, together with its concavity, implies that it is continuous and strictly increasing.

Channel Coding Theorem for Continuous Sources I: 6-38

• Although the capacity-cost function formula in (6.3.1) is valid for a general cost function t(·), we only substantiate it under the average power constraint in the next forward channel coding theorem.

• Its validity in more general cases can be proved similarly, based on the same concept.

Channel Coding Theorem for Continuous Sources I: 6-39

Theorem 6.25 (forward channel coding theorem for continuous channels under average power constraint) For any ε ∈ (0, 1), there exist 0 < γ < 2ε and a data transmission code sequence {C_n = (n, M_n)}_{n=1}^{∞} satisfying
$$\frac{1}{n}\log M_n > C(S) - \gamma$$
and, for each codeword c = (c_1, c_2, . . . , c_n),
$$\frac{1}{n}\sum_{i=1}^{n} c_i^2 \le S \qquad (6.3.2)$$
such that the probability of decoding error P_e(C_n) is less than ε for sufficiently large n, where
$$C(S) \triangleq \max_{\{p_X\,:\,E[X^2]\le S\}} I(X;Y).$$

Channel Coding Theorem for Continuous Sources I: 6-40

Proof: The theorem holds trivially when C(S) = 0, because we can choose M_n = 1 for every n, which yields P_e(C_n) = 0. Hence, assume without loss of generality that C(S) > 0.

Step 0:

Take a positive γ satisfying
$$\gamma < \min\{2\varepsilon, C(S)\}.$$
Pick ξ > 0 small enough that
$$2[C(S) - C(S - \xi)] < \gamma,$$
where the existence of such ξ is assured by the continuity of C(S) (Lemma 6.24). Hence, we have C(S − ξ) − γ/2 > C(S) − γ > 0. Choose M_n to satisfy
$$C(S - \xi) - \frac{\gamma}{2} > \frac{1}{n}\log M_n > C(S) - \gamma,$$
where such a choice exists for all sufficiently large n. Take δ = γ/8. Let P_X be the distribution that achieves C(S − ξ); hence, E[X²] ≤ S − ξ and I(X;Y) = C(S − ξ).

Channel Coding Theorem for Continuous Sources I: 6-41

Step 1: Random coding with average power constraint.

Randomly draw M_n − 1 codewords according to the distribution P_{X^n} with
$$P_{X^n}(x^n) = \prod_{i=1}^{n} P_X(x_i).$$
By the law of large numbers, each randomly selected codeword c_m = (c_{m1}, . . . , c_{mn}) satisfies
$$\lim_{n\to\infty} \frac{1}{n}\sum_{i=1}^{n} c_{mi}^2 \le S - \xi \quad \text{almost surely}$$
for m = 1, 2, . . . , M_n − 1.

Channel Coding Theorem for Continuous Sources I: 6-42

Step 2: Coder.

For the M_n selected codewords {c_1, . . . , c_{M_n}}, replace the codewords that violate the power constraint (i.e., (6.3.2)) by the all-zero codeword 0. Define the encoder as
$$f_n(m) = c_m \quad \text{for } 1 \le m \le M_n.$$
Upon receiving an output sequence y^n, the decoder g_n(·) is given by
$$g_n(y^n) = \begin{cases} m, & \text{if } (c_m, y^n) \in \mathcal{F}_n(\delta) \text{ and } (\forall\, m' \ne m)\ (c_{m'}, y^n) \notin \mathcal{F}_n(\delta), \\ \text{arbitrary}, & \text{otherwise}, \end{cases}$$
where
$$\mathcal{F}_n(\delta) \triangleq \Bigl\{ (x^n, y^n) \in \mathcal{X}^n\times\mathcal{Y}^n :
\Bigl|-\tfrac{1}{n}\log p_{X^nY^n}(x^n,y^n) - h(X,Y)\Bigr| < \delta,\
\Bigl|-\tfrac{1}{n}\log p_{X^n}(x^n) - h(X)\Bigr| < \delta,\
\text{and } \Bigl|-\tfrac{1}{n}\log p_{Y^n}(y^n) - h(Y)\Bigr| < \delta \Bigr\}.$$

Channel Coding Theorem for Continuous Sources I: 6-43

Step 3: Probability of error.

Let λ_m denote the error probability given that codeword m is transmitted. Define
$$\mathcal{E}_0 \triangleq \left\{ x^n \in \mathcal{X}^n : \frac{1}{n}\sum_{i=1}^{n} x_i^2 > S \right\}.$$
Then, by following a similar argument as (4.3.2) (discrete cases), we get
$$E[\lambda_m] \le P_{X^n}(\mathcal{E}_0) + P_{X^n,Y^n}\bigl(\mathcal{F}_n^c(\delta)\bigr)
+ \sum_{\substack{m'=1 \\ m'\ne m}}^{M_n}\ \sum_{c_m\in\mathcal{X}^n}\ \sum_{y^n\in\mathcal{F}_n(\delta|c_{m'})} P_{X^n,Y^n}(c_m, y^n),$$
where
$$\mathcal{F}_n(\delta|x^n) \triangleq \{y^n \in \mathcal{Y}^n : (x^n, y^n) \in \mathcal{F}_n(\delta)\}.$$
Note that the additional term P_{X^n}(E_0) (compared to the discrete case) copes with the errors due to the all-zero codeword replacement; by the law of large numbers it is less than δ for all sufficiently large n.

Channel Coding Theorem for Continuous Sources I: 6-44

Finally, by carrying out a similar procedure as in the proof of the capacity for discrete channels, we obtain
$$\begin{aligned}
E[P_e(\mathcal{C}_n)] &\le P_{X^n}(\mathcal{E}_0) + P_{X^n,Y^n}\bigl(\mathcal{F}_n^c(\delta)\bigr) + M_n \cdot e^{n(h(X,Y)+\delta)}e^{-n(h(X)-\delta)}e^{-n(h(Y)-\delta)} \\
&\le P_{X^n}(\mathcal{E}_0) + P_{X^n,Y^n}\bigl(\mathcal{F}_n^c(\delta)\bigr) + e^{n(C(S-\xi)-4\delta)}\cdot e^{-n(I(X;Y)-3\delta)} \\
&= P_{X^n}(\mathcal{E}_0) + P_{X^n,Y^n}\bigl(\mathcal{F}_n^c(\delta)\bigr) + e^{-n\delta}.
\end{aligned}$$
Accordingly, we can make the average probability of error, namely E[P_e(C_n)], less than 3δ = 3γ/8 < 3ε/4 < ε for all sufficiently large n.

Channel Coding Theorem for Continuous Sources I: 6-45

Theorem 6.26 (converse channel coding theorem for continuous channels) For any data transmission code sequence {C_n = (n, M_n)}_{n=1}^{∞} satisfying the power constraint, if the ultimate data transmission rate satisfies
$$\liminf_{n\to\infty} \frac{1}{n}\log M_n > C(S),$$
then its probability of decoding error is bounded away from zero for all sufficiently large n.

Proof: For an (n, M_n) block data transmission code, an encoding function is chosen as
$$f_n : \{1, 2, \ldots, M_n\} \to \mathcal{X}^n,$$
and each index i is equally likely under the average probability of block decoding error criterion. Hence, we can assume that the information message is drawn from {1, 2, . . . , M_n} by a uniformly distributed random variable, which we denote by W. As a result,
$$H(W) = \log M_n.$$
Since W → X^n → Y^n forms a Markov chain (because Y^n depends only on X^n), we obtain by the data processing lemma that I(W; Y^n) ≤ I(X^n; Y^n).

Channel Coding Theorem for Continuous Sources I: 6-46

We can also bound I(X^n; Y^n) by nC(S) as follows:
$$\begin{aligned}
I(X^n;Y^n) &\le \max_{\{P_{X^n}\,:\,(1/n)\sum_{i=1}^n E[X_i^2]\le S\}} I(X^n;Y^n) \\
&\le \max_{\{P_{X^n}\,:\,(1/n)\sum_{i=1}^n E[X_i^2]\le S\}} \sum_{j=1}^{n} I(X_j;Y_j) \\
&= \max_{\{(P_1,P_2,\ldots,P_n)\,:\,(1/n)\sum_{i=1}^n P_i = S\}}\ \max_{\{P_{X^n}\,:\,(\forall i)\ E[X_i^2]\le P_i\}} \sum_{j=1}^{n} I(X_j;Y_j) \\
&\le \max_{\{(P_1,P_2,\ldots,P_n)\,:\,(1/n)\sum_{i=1}^n P_i = S\}} \sum_{j=1}^{n}\ \max_{\{P_{X^n}\,:\,(\forall i)\ E[X_i^2]\le P_i\}} I(X_j;Y_j) \\
&\le \max_{\{(P_1,P_2,\ldots,P_n)\,:\,(1/n)\sum_{i=1}^n P_i = S\}} \sum_{j=1}^{n}\ \max_{\{P_{X_j}\,:\,E[X_j^2]\le P_j\}} I(X_j;Y_j) \\
&= \max_{\{(P_1,P_2,\ldots,P_n)\,:\,(1/n)\sum_{i=1}^n P_i = S\}} \sum_{j=1}^{n} C(P_j) \\
&= \max_{\{(P_1,P_2,\ldots,P_n)\,:\,(1/n)\sum_{i=1}^n P_i = S\}} n\sum_{j=1}^{n} \frac{1}{n}\,C(P_j)
\end{aligned}$$

Channel Coding Theorem for Continuous Sources I: 6-47

$$\begin{aligned}
&\le \max_{\{(P_1,P_2,\ldots,P_n)\,:\,(1/n)\sum_{i=1}^n P_i = S\}} n\,C\!\left(\frac{1}{n}\sum_{j=1}^{n} P_j\right) \quad \text{(by concavity of } C(S)\text{)} \\
&= nC(S).
\end{aligned}$$

Consequently, by defining P_e(C_n) as the probability of error in guessing W from the observation Y^n via a decoding function
$$g_n : \mathcal{Y}^n \to \{1, 2, \ldots, M_n\},$$
which is exactly the average block decoding failure, we get
$$\begin{aligned}
\log M_n &= H(W) \\
&= H(W|Y^n) + I(W;Y^n) \\
&\le H(W|Y^n) + I(X^n;Y^n) \\
&\le H_b\bigl(P_e(\mathcal{C}_n)\bigr) + P_e(\mathcal{C}_n)\cdot\log(|\mathcal{W}|-1) + nC(S) \quad \text{(by Fano's inequality)} \\
&\le \log(2) + P_e(\mathcal{C}_n)\cdot\log(M_n - 1) + nC(S) \quad \text{(by the fact that } H_b(t)\le\log(2)\ \forall\,t\in[0,1]\text{)} \\
&\le \log(2) + P_e(\mathcal{C}_n)\cdot\log M_n + nC(S),
\end{aligned}$$

Channel Coding Theorem for Continuous Sources I: 6-48

which implies that
$$P_e(\mathcal{C}_n) \ge 1 - \frac{C(S)}{(1/n)\log M_n} - \frac{\log(2)}{\log M_n}.$$

So if liminf_{n→∞} (1/n) log M_n > C(S), then there exist δ > 0 and an integer N such that for n ≥ N,
$$\frac{1}{n}\log M_n > C(S) + \delta.$$
Hence, for n ≥ N_0 ≜ max{N, 2 log(2)/δ},
$$P_e(\mathcal{C}_n) \ge 1 - \frac{C(S)}{C(S)+\delta} - \frac{\log(2)}{n\bigl(C(S)+\delta\bigr)} \ge \frac{\delta}{2\bigl(C(S)+\delta\bigr)}.$$

Remarks:

• This is a weak converse statement.

• Since the capacity is now a function of the cost constraint, it is named the capacity-cost function.

Next, we will derive the capacity-cost function of some frequently used channel models.

Memoryless Additive Gaussian Channels I: 6-49

Definition 6.27 (memoryless additive channel) Let X_1, . . . , X_n and Y_1, . . . , Y_n be the input and output sequences of the channel, and let N_1, . . . , N_n be the noise. Then a memoryless additive channel is defined by
$$Y_i = X_i + N_i$$
for each i, where {(X_i, Y_i, N_i)}_{i=1}^{n} are i.i.d. and X_i is independent of N_i.

Definition 6.28 (memoryless additive Gaussian channel) A memoryless additive channel is called a memoryless additive Gaussian channel if the noise is a Gaussian random variable.

Theorem 6.29 (capacity of memoryless additive Gaussian channel under average power constraint) The capacity of a memoryless additive Gaussian channel with noise model N(0, σ²) and average power constraint S is equal to
$$C(S) = \frac{1}{2}\log\left(1 + \frac{S}{\sigma^2}\right) \ \text{nats/channel symbol}.$$
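A one-line computation sketch (assuming NumPy) of Theorem 6.29 for a few illustrative SNR values, reporting the capacity in both nats and bits per channel symbol.

# Capacity of the memoryless additive Gaussian channel (Theorem 6.29):
# C(S) = 0.5 * log(1 + S/σ²) nats per channel symbol.
import numpy as np

def awgn_capacity(S, noise_var):
    return 0.5 * np.log(1.0 + S / noise_var)

noise_var = 1.0
for S in (0.5, 1.0, 10.0, 100.0):
    C = awgn_capacity(S, noise_var)
    print(f"S = {S:6.1f}   C(S) = {C:.4f} nats = {C/np.log(2):.4f} bits per symbol")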

Memoryless Additive Gaussian Channels I: 6-50

Proof: By definition,
$$\begin{aligned}
C(S) &= \max_{\{p_X\,:\,E[X^2]\le S\}} I(X;Y) \\
&= \max_{\{p_X\,:\,E[X^2]\le S\}} \bigl(h(Y) - h(Y|X)\bigr) \\
&= \max_{\{p_X\,:\,E[X^2]\le S\}} \bigl(h(Y) - h(N + X|X)\bigr) \\
&= \max_{\{p_X\,:\,E[X^2]\le S\}} \bigl(h(Y) - h(N|X)\bigr) \\
&= \max_{\{p_X\,:\,E[X^2]\le S\}} \bigl(h(Y) - h(N)\bigr) \\
&= \left(\max_{\{p_X\,:\,E[X^2]\le S\}} h(Y)\right) - h(N),
\end{aligned}$$
where N represents the additive Gaussian noise. We thus need to find an input distribution satisfying E[X²] ≤ S that maximizes the differential entropy of Y. Recall that the differential entropy subject to mean and variance constraints is maximized by a Gaussian random variable; also, the differential entropy of a Gaussian random variable with variance σ² is (1/2) log(2πeσ²) nats, independently of its mean.

Memoryless Additive Gaussian Channels I: 6-51

Therefore, by taking X to be a Gaussian random variable with distribution N(0, S), Y achieves its largest variance S + σ² under the constraint E[X²] ≤ S. Consequently,
$$C(S) = \frac{1}{2}\log\bigl(2\pi e(S+\sigma^2)\bigr) - \frac{1}{2}\log(2\pi e\sigma^2)
= \frac{1}{2}\log\left(1 + \frac{S}{\sigma^2}\right) \ \text{nats/channel symbol}.$$

Theorem 6.30 (Gaussian noise is the worst noise for capacity) For every memoryless additive discrete-time continuous channel whose noise has zero mean and variance σ², the capacity subject to the average power constraint is lower bounded by
$$\frac{1}{2}\log\left(1 + \frac{S}{\sigma^2}\right),$$
which is the capacity of the memoryless additive discrete-time Gaussian channel. (This means that Gaussian noise is the "worst" kind of noise in the sense of channel capacity.)

Memoryless Additive Gaussian Channels I: 6-52

Proof: Let p_{Y_g|X_g} and p_{Y|X} denote the transition probabilities of the Gaussian channel and of some other channel satisfying the cost constraint, respectively. Let N_g and N respectively denote their noises. Note that log p_{N_g}(·) and log p_{Y_g}(·) are quadratic functions, and the pairs (N, N_g) and (Y, Y_g) have identical first and second moments; this justifies the second equality below. Then for any Gaussian input p_{X_g},
$$\begin{aligned}
&I(p_{X_g}, p_{Y|X}) - I(p_{X_g}, p_{Y_g|X_g}) \\
&= \int_{\mathbb{R}}\int_{\mathbb{R}} p_{X_g}(x)\,p_N(y-x)\log\frac{p_N(y-x)}{p_Y(y)}\,dy\,dx
 - \int_{\mathbb{R}}\int_{\mathbb{R}} p_{X_g}(x)\,p_{N_g}(y-x)\log\frac{p_{N_g}(y-x)}{p_{Y_g}(y)}\,dy\,dx \\
&= \int_{\mathbb{R}}\int_{\mathbb{R}} p_{X_g}(x)\,p_N(y-x)\log\frac{p_N(y-x)}{p_Y(y)}\,dy\,dx
 - \int_{\mathbb{R}}\int_{\mathbb{R}} p_{X_g}(x)\,p_N(y-x)\log\frac{p_{N_g}(y-x)}{p_{Y_g}(y)}\,dy\,dx \\
&= \int_{\mathbb{R}}\int_{\mathbb{R}} p_{X_g}(x)\,p_N(y-x)\log\frac{p_N(y-x)\,p_{Y_g}(y)}{p_{N_g}(y-x)\,p_Y(y)}\,dy\,dx \\
&\ge \int_{\mathbb{R}}\int_{\mathbb{R}} p_{X_g}(x)\,p_N(y-x)\left(1 - \frac{p_{N_g}(y-x)\,p_Y(y)}{p_N(y-x)\,p_{Y_g}(y)}\right)dy\,dx \\
&= 1 - \int_{\mathbb{R}} \frac{p_Y(y)}{p_{Y_g}(y)}\left(\int_{\mathbb{R}} p_{X_g}(x)\,p_{N_g}(y-x)\,dx\right)dy \\
&= 0,
\end{aligned}$$

Memoryless Additive Gaussian Channels I: 6-53

with equality if, and only if,
$$\frac{p_Y(y)}{p_{Y_g}(y)} = \frac{p_N(y-x)}{p_{N_g}(y-x)} \quad \text{for all } x.$$

Therefore,
$$\begin{aligned}
\max_{\{p_X\,:\,E[X^2]\le S\}} I(p_X, p_{Y_g|X_g}) &= I(p^*_{X_g}, p_{Y_g|X_g}) \\
&\le I(p^*_{X_g}, p_{Y|X}) \\
&\le \max_{\{p_X\,:\,E[X^2]\le S\}} I(p_X, p_{Y|X}).
\end{aligned}$$

Uncorrelated Parallel Gaussian Channels I: 6-54

• Water-pouring scheme

– In concept, more channel input power should be placed on those channels with smaller noise power.

– The water-pouring scheme not only substantiates the above intuition, but also gives it a quantitative meaning.

[Figure: water-pouring over four parallel channels with noise powers σ₁², σ₂², σ₃², σ₄². Power S₁, S₂, S₄ is poured above the smaller noise levels up to a common water level, the noisiest channel receives none, and S = S₁ + S₂ + S₄.]

Uncorrelated Parallel Gaussian Channels I: 6-55

Theorem 6.31 (capacity for parallel additive Gaussian channels) The capacity of k parallel additive Gaussian channels under an overall input power constraint S is
$$C(S) = \sum_{i=1}^{k} \frac{1}{2}\log\left(1 + \frac{S_i}{\sigma_i^2}\right),$$
where σᵢ² is the noise variance of channel i, Sᵢ = max{0, θ − σᵢ²}, and θ is chosen to satisfy ∑_{i=1}^{k} Sᵢ = S.

This capacity is achieved by a set of independent Gaussian inputs with zero mean and variance Sᵢ.

Uncorrelated Parallel Gaussian Channels I: 6-56

Proof: By definition,
$$C(S) = \max_{\{p_{X^k}\,:\,\sum_{i=1}^{k} E[X_i^2]\le S\}} I(X^k; Y^k).$$
Since the noises N_1, . . . , N_k are independent,
$$\begin{aligned}
I(X^k;Y^k) &= h(Y^k) - h(Y^k|X^k)
\quad\bigl(= h(Y^k) - h(N^k + X^k|X^k) = h(Y^k) - h(N^k|X^k)\bigr) \\
&= h(Y^k) - h(N^k) \\
&= h(Y^k) - \sum_{i=1}^{k} h(N_i) \\
&\le \sum_{i=1}^{k} h(Y_i) - \sum_{i=1}^{k} h(N_i) \\
&= \sum_{i=1}^{k} I(X_i;Y_i) \\
&\le \sum_{i=1}^{k} \frac{1}{2}\log\left(1 + \frac{S_i}{\sigma_i^2}\right)
\end{aligned}$$
with equality if each input is a Gaussian random variable with zero mean

Uncorrelated Parallel Gaussian Channels I: 6-57

and variance S_i, and the inputs are independent, where S_i is the individual power constraint applied on channel i with ∑_{i=1}^{k} S_i = S.

So the problem is reduced to finding the power allotment that maximizes the capacity subject to the constraint ∑_{i=1}^{k} S_i = S. Using the Lagrange multiplier technique, the maximizer of
$$\max\left\{\sum_{i=1}^{k} \frac{1}{2}\log\left(1 + \frac{S_i}{\sigma_i^2}\right) + \lambda\left(\sum_{i=1}^{k} S_i - S\right)\right\}$$
can be found by taking the derivative (w.r.t. S_i) of the above expression and setting it to zero, which yields
$$\begin{cases}
\dfrac{1}{2}\,\dfrac{1}{S_i + \sigma_i^2} + \lambda = 0, & \text{if } S_i > 0; \\[2mm]
\dfrac{1}{2}\,\dfrac{1}{S_i + \sigma_i^2} + \lambda \le 0, & \text{if } S_i = 0.
\end{cases}$$
Hence,
$$\begin{cases}
S_i = \theta - \sigma_i^2, & \text{if } S_i > 0; \\
S_i \ge \theta - \sigma_i^2, & \text{if } S_i = 0,
\end{cases}$$
where θ = −1/(2λ).
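A small sketch (assuming NumPy) of the water-pouring allocation just derived: bisect on the water level θ so that Σᵢ max{0, θ − σᵢ²} = S, then evaluate C(S). The noise variances and S below are arbitrary illustrative values.

# Water-filling power allocation of Theorem 6.31: find the water level θ such
# that Σ_i max(0, θ - σ_i²) = S, then evaluate the resulting capacity (nats).
import numpy as np

def water_filling(noise_vars, S, iters=100):
    noise_vars = np.asarray(noise_vars, dtype=float)
    lo, hi = noise_vars.min(), noise_vars.max() + S   # θ lies in this range
    for _ in range(iters):                            # bisection on the water level
        theta = 0.5 * (lo + hi)
        if np.maximum(0.0, theta - noise_vars).sum() > S:
            hi = theta
        else:
            lo = theta
    powers = np.maximum(0.0, theta - noise_vars)
    capacity = 0.5 * np.sum(np.log1p(powers / noise_vars))
    return theta, powers, capacity

theta, powers, C = water_filling([1.0, 2.0, 4.0, 8.0], S=5.0)
print("water level θ =", round(theta, 4))
print("powers S_i    =", np.round(powers, 4), " sum =", round(powers.sum(), 4))
print("capacity C(S) =", round(C, 4), "nats")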

Rate Distortion for Parallel Gaussian Sources I: 6-58

A theorem on the rate-distortion function parallel to that on the capacity-cost function can also be established.

Theorem 6.32 (rate distortion for parallel Gaussian sources) Given k mutually independent Gaussian sources with variances σ₁², . . . , σ_k², the overall rate-distortion function for the additive squared error distortion, namely
$$\sum_{i=1}^{k} E[(Z_i - \hat{Z}_i)^2] \le D,$$
is given by
$$R(D) = \sum_{i=1}^{k} \frac{1}{2}\log\frac{\sigma_i^2}{D_i},$$
where D_i = min{θ, σᵢ²} and θ is chosen to satisfy ∑_{i=1}^{k} D_i = D.
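A companion sketch (assuming NumPy) of the reverse water-filling in Theorem 6.32: bisect on θ so that Σᵢ min{θ, σᵢ²} = D, then evaluate R(D). The variances and D below are illustrative, with 0 < D ≤ Σᵢ σᵢ² assumed.

# Reverse water-filling of Theorem 6.32: pick θ with Σ_i min(θ, σ_i²) = D,
# set D_i = min(θ, σ_i²), and evaluate R(D) = Σ_i 0.5*log(σ_i²/D_i) in nats.
import numpy as np

def reverse_water_filling(variances, D, iters=100):
    variances = np.asarray(variances, dtype=float)
    lo, hi = 0.0, variances.max()                 # θ lies between 0 and max σ_i²
    for _ in range(iters):                        # bisection on the water level
        theta = 0.5 * (lo + hi)
        if np.minimum(theta, variances).sum() > D:
            hi = theta
        else:
            lo = theta
    Di = np.minimum(theta, variances)
    R = 0.5 * np.sum(np.log(variances / Di))
    return theta, Di, R

theta, Di, R = reverse_water_filling([1.0, 2.0, 4.0], D=2.0)
print("θ =", round(theta, 4), " D_i =", np.round(Di, 4), " R(D) =", round(R, 4), "nats")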

Rate Distortion for Parallel Gaussian Sources I: 6-59

[Figure: water-pouring for lossy data compression of parallel Gaussian sources. Water of amount D is poured to a common level θ over bars of heights σ₁², σ₂², σ₃², σ₄²; the overall distortion is D = D₁ + D₂ + D₃ + D₄ = σ₁² + σ₂² + θ + σ₄², i.e., Dᵢ = σᵢ² for the sources with σᵢ² ≤ θ and Dᵢ = θ for the remaining source.]

Correlated Parallel Additive Gaussian Channels I: 6-60

Theorem 6.33 (capacity for correlated parallel additive Gaussian channels) The capacity of k parallel additive Gaussian channels under an overall input power constraint S is
$$C(S) = \sum_{i=1}^{k} \frac{1}{2}\log\left(1 + \frac{S_i}{\lambda_i}\right),$$
where λᵢ is the i-th eigenvalue of the positive-definite noise covariance matrix K_N, Sᵢ = max{0, θ − λᵢ}, and θ is chosen to satisfy ∑_{i=1}^{k} Sᵢ = S.

This capacity is achieved by Gaussian inputs that are independent, with zero mean and variance Sᵢ, along the eigen-directions of K_N.

• A k × k matrix K is positive definite if for every (x₁, . . . , x_k),
$$[x_1, \ldots, x_k]\,K\,\begin{bmatrix} x_1 \\ \vdots \\ x_k \end{bmatrix} \ge 0,$$
with equality only when xᵢ = 0 for 1 ≤ i ≤ k.

• The matrix Λ in the decomposition of a positive-definite matrix K, i.e., K = QΛQᵗ, is a diagonal matrix with non-zero (in fact positive) diagonal components.

Correlated Parallel Additive Gaussian Channels I: 6-61

Proof: Let K_X be the covariance matrix of the input X_1, . . . , X_k. The input power constraint then becomes
$$\sum_{i=1}^{k} E[X_i^2] = \mathrm{tr}(K_X) \le S,$$
where tr(·) denotes the trace of the k × k matrix K_X. Assume that the input is independent of the noise. Then
$$\begin{aligned}
I(X^k;Y^k) &= h(Y^k) - h(Y^k|X^k) \\
&= h(Y^k) - h(N^k + X^k|X^k) \\
&= h(Y^k) - h(N^k|X^k) \\
&= h(Y^k) - h(N^k).
\end{aligned}$$
Since h(N^k) is not determined by the input, the capacity-finding problem reduces to maximizing h(Y^k) over all possible inputs satisfying the power constraint.

Now observe that the covariance matrix of Y^k is K_Y = K_X + K_N, which implies that the differential entropy of Y^k is upper bounded by
$$h(Y^k) \le \frac{1}{2}\log\bigl((2\pi e)^k\,|K_X + K_N|\bigr).$$
It remains to find the K_X (if possible) under which the above upper bound is achieved and this achievable upper bound is maximized.

Correlated Parallel Additive Gaussian Channels I: 6-62

Decompose K_N into its diagonal form as
$$K_N = Q\Lambda Q^t,$$
where the superscript "t" denotes matrix transposition, QQᵗ = I_{k×k}, and I_{k×k} is the identity matrix of order k. Note that since K_N is positive definite, Λ is a diagonal matrix with positive diagonal components equal to the eigenvalues of K_N. Then
$$|K_X + K_N| = |K_X + Q\Lambda Q^t| = |Q|\cdot|Q^t K_X Q + \Lambda|\cdot|Q^t| = |Q^t K_X Q + \Lambda| = |A + \Lambda|,$$
where A ≜ QᵗK_X Q. Since tr(A) = tr(K_X), the problem is further transformed into maximizing |A + Λ| subject to tr(A) ≤ S.

Correlated Parallel Additive Gaussian Channels I: 6-63

Lemma [Hadamard's inequality] Any positive definite k × k matrix K satisfies
$$|K| \le \prod_{i=1}^{k} K_{ii},$$
where K_{ii} is the component of K located at the i-th row and i-th column. Equality holds if, and only if, the matrix is diagonal.

By observing that A + Λ is positive definite (because Λ is positive definite), together with the above lemma, we have
$$|A + \Lambda| \le \prod_{i=1}^{k} (A_{ii} + \lambda_i),$$
where λᵢ is the component of Λ located at the i-th row and i-th column, which is exactly the i-th eigenvalue of K_N. Thus, the maximum value of |A + Λ| under tr(A) ≤ S is achieved by a diagonal A with
$$\sum_{i=1}^{k} A_{ii} = S.$$

Correlated Parallel Additive Gaussian Channels I: 6-64

Finally, we can apply the Lagrange multiplier technique as used previously to obtain
$$A_{ii} = \max\{0, \theta - \lambda_i\},$$
where θ is chosen to satisfy ∑_{i=1}^{k} A_{ii} = S.

Waveform channels will be discussed next!

Bandlimited Waveform Channels with AWGN I: 6-65

• A common model for communication over a radio network or a telephone line is a band-limited channel with white noise. This is a continuous-time channel, modelled as
$$Y_t = (X_t + N_t) * h(t),$$
where "∗" denotes convolution, X_t is the waveform source, Y_t is the waveform output, N_t is the white noise, and h(t) is the band-limited filter.

[Block diagram: inside the channel, the noise N_t is added to X_t and the sum passes through the filter H(f) to produce Y_t.]

Bandlimited Waveform Channels with AWGN I: 6-66

Coding over waveform channels

• For a fixed interval [0, T), select M different functions (a waveform codebook)
$$c_1(t), c_2(t), \ldots, c_M(t),$$
one for each information message.

• Based on the received function y(t) for t ∈ [0, T) at the channel output, a decision on the input message is made.

Sampling theorem

• Sampling a band-limited signal with sampling period 1/(2W) (i.e., at rate 2W) is sufficient to reconstruct the signal from its samples, if W is the bandwidth of the signal.

Conventional Remark: Based on the sampling theorem, one can sample the filtered waveform signal c_m(t) and the filtered waveform noise N_t, and reconstruct them without distortion from samples taken at frequency 2W (from the receiver Y_t point of view).

Bandlimited Waveform Channels with AWGN I: 6-67

My Remarks: Assume for the moment that the channel is noiseless.

• The input waveform codeword c_j(t) is time-limited to [0, T); so it cannot be band-limited.

• However, the receiver can only observe a band-limited version c̃_j(t) of c_j(t), due to the ideal band-limited filter h(t). Notably, c̃_j(t) is no longer time-limited.

• By the sampling theorem, the band-limited but time-unlimited c̃_j(t) can be reconstructed without distortion from its (possibly infinitely many) samples taken with sampling period 1/(2W).

• Implicit System Constraint: Yet, the receiver can only use those samples within time [0, T) to guess what the transmitter originally sent out.

– Notably, these 2WT samples may not reconstruct c̃_j(t) without distortion.

• As a result, the waveform codewords {c_j(t)}_{j=1}^{M} are chosen such that their residual signals {c̃_j(t)}_{j=1}^{M}, after experiencing the ideal lowpass filter h(t) and the implicit 2WT-sample constraint, are more "resistant" to noise.

Bandlimited Waveform Channels with AWGN I: 6-68

What we learn from these remarks:

• The time-limited waveform c_j(t) passes through an ideal lowpass filter and is then sampled; only 2WT samples survive at the receiver end in the noiseless case.

• X̃_t (not X_t) is the true residual signal that survives at the receiver end without noise.

• The power constraint in the capacity-cost function is actually applied to X̃_t (the true signal that can be reconstructed from the 2WT samples seen at the receiver end), rather than to the transmitted signal X_t.

– Indeed, the signal-to-noise ratio of interest in most communication problems is the ratio of the signal power surviving at the receiver end to the noise power experienced by this received signal.

– Do not misinterpret the signal power in this ratio as the transmitted power at the transmitter end.

Bandlimited Waveform Channels with AWGN I: 6-69

The noise process

• How about the band-limited AWGN Ñ_t = N_t ∗ h(t)?

• Ñ_t is no longer white?

• However, with the right sampling rate, the samples are still uncorrelated.

• Consequently, the noises experienced by the 2WT signal samples are still independent Gaussian distributed (cf. the next slide):
$$Y_t = (X_t + N_t) * h(t) = \tilde{X}_t + \tilde{N}_t,$$
$$Y_{k/(2W)} = \tilde{X}_{k/(2W)} + \tilde{N}_{k/(2W)} \quad \text{for } 0 \le k < 2WT.$$

Bandlimited Waveform Channels with AWGN I: 6-70

$$\begin{aligned}
E\bigl[\tilde{N}_{i/(2W)}\tilde{N}_{k/(2W)}\bigr]
&= E\left[\left(\int_{\mathbb{R}} h(\tau)N_{i/(2W)-\tau}\,d\tau\right)\left(\int_{\mathbb{R}} h(\tau')N_{k/(2W)-\tau'}\,d\tau'\right)\right] \\
&= \int_{\mathbb{R}}\int_{\mathbb{R}} h(\tau)h(\tau')\,E\bigl[N_{i/(2W)-\tau}N_{k/(2W)-\tau'}\bigr]\,d\tau'\,d\tau \\
&= \int_{\mathbb{R}}\int_{\mathbb{R}} h(\tau)h(\tau')\,\frac{N_0}{2}\,\delta\!\left(\frac{i}{2W}-\frac{k}{2W}-\tau+\tau'\right)d\tau'\,d\tau \\
&= \frac{N_0}{2}\int_{\mathbb{R}} h(\tau)\,h\bigl(\tau-(i-k)/(2W)\bigr)\,d\tau \qquad \left(\int_{-\infty}^{\infty}|H(f)|^2\,df = 1\right) \\
&= \frac{N_0}{2}\int_{\mathbb{R}} \left(\int_{-W}^{W}\frac{1}{\sqrt{2W}}e^{j2\pi f\tau}\,df\right)\left(\int_{-W}^{W}\frac{1}{\sqrt{2W}}e^{j2\pi f'(\tau-(i-k)/(2W))}\,df'\right)d\tau \\
&= \frac{N_0}{4W}\int_{-W}^{W}\int_{-W}^{W}\left(\int_{\mathbb{R}} e^{j2\pi(f+f')\tau}\,d\tau\right)e^{-j2\pi f'(i-k)/(2W)}\,df'\,df \\
&= \frac{N_0}{4W}\int_{-W}^{W}\int_{-W}^{W}\delta(f+f')\,e^{-j2\pi f'(i-k)/(2W)}\,df'\,df \\
&= \frac{N_0}{4W}\int_{-W}^{W} e^{j2\pi f(i-k)/(2W)}\,df
\;\;\left(= \frac{N_0}{2}\,\frac{\sin\bigl(2\pi W(i-k)/(2W)\bigr)}{\pi(i-k)}\right) \\
&= \frac{N_0}{2}\,\frac{\sin(\pi(i-k))}{\pi(i-k)} \qquad \text{(with the right sampling rate, } W \text{ cancels out)} \\
&= \begin{cases} N_0/2, & \text{if } i = k; \\ 0, & \text{if } i \ne k. \end{cases}
\end{aligned}$$

Bandlimited Waveform Channels with AWGN I: 6-71

Hence, the capacity-cost function of this channel subject to input waveform duration T (and the implicit system constraint) is equal to
$$\begin{aligned}
C_T(S) &= \max_{\{p_{\tilde{X}^{2WT}}\,:\,\sum_{i=0}^{2WT-1} E[\tilde{X}^2_{i/(2W)}]\le S\}} I(\tilde{X}^{2WT}; Y^{2WT}) \\
&= \sum_{i=0}^{2WT-1} \frac{1}{2}\log\left(1 + \frac{S_i}{\sigma_i^2}\right) \\
&= \sum_{i=0}^{2WT-1} \frac{1}{2}\log\left(1 + \frac{S/(2WT)}{N_0/2}\right) \\
&= \sum_{i=0}^{2WT-1} \frac{1}{2}\log\left(1 + \frac{S}{WTN_0}\right) \\
&= WT\cdot\log\left(1 + \frac{S}{WTN_0}\right),
\end{aligned}$$
where the input samples X̃^{2WT} that achieve the capacity are also i.i.d. Gaussian distributed.

• It can be shown, similarly to "white N_t ⇒ i.i.d. Ñ^{2WT}," that a white Gaussian process X_t renders the filtered samples X̃^{2WT} i.i.d. Gaussian distributed, if the right sampling rate is employed.

Bandlimited Waveform Channels with AWGN I: 6-72

Remark on notation

• C_T(S) denotes the capacity-cost function subject to input waveform duration T. This notation is specifically used for waveform channels.

• C_T(S) should not be confused with the notation C(S), which represents the capacity-cost function of a discrete-time channel, where the channel input is transmitted only at each sampled time instance, and hence no duration T is involved.

• One, however, can measure C(S) in units of bits per sample period to relate the quantity to the usual unit of data transmission speed, such as bits per second.

• To obtain the maximum reliable data transmission speed over a waveform channel in units of bits per second, we need to calculate
$$\frac{C_T(S)}{T}.$$


Bandlimited Waveform Channels with AWGN I: 6-73

Example 6.34 (telephone line channel) Suppose telephone signals are band-limited to 4 kHz. Given a signal-to-noise ratio (SNR) of 20 dB (namely, S/(WN_0) = 20 dB) and T = 1 millisecond, the capacity of the bandlimited Gaussian waveform channel is equal to:

$$
C_T(S) = WT\log\left(1+\frac{S}{WTN_0}\right)
= 4000\times\left(1\times 10^{-3}\right)\times\log\left(1+\frac{100}{1\times 10^{-3}}\right).
$$

Therefore, the maximum reliable transmission speed is:

$$
\frac{C_T(S)}{T} = 46051.7\ \text{nats per second} = 66438.6\ \text{bits per second}.
$$
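As a quick numerical check of this example, the sketch below (plain Python; the variable names are ours) reproduces both figures.

```python
import math

W = 4000.0             # telephone bandwidth in Hz
T = 1e-3               # input waveform duration in seconds
snr = 10 ** (20 / 10)  # 20 dB, i.e. S/(W*N0) = 100 in the slide's convention

C_T = W * T * math.log(1 + snr / T)      # C_T(S) in nats per waveform of width T
print(C_T / T)                           # ~46051.7 nats per second
print(C_T / T / math.log(2))             # ~66438.6 bits per second
```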


Bandlimited Waveform Channels with AWGN I: 6-74

Remarks:

• Special attention is needed here: the capacity-cost formula used above is based on a greatly simplified channel model, i.e., the noise is additive white Gaussian.

• The capacity formula from such a simplified model only provides a lower bound to the true channel capacity, since AWGN is the worst noise in the sense of capacity.

• So the true channel capacity is possibly higher than the quantity obtained in the above example!


Band-Unlimited Waveform Channels with AWGN I: 6-75

Taking W → ∞ in the above formula, the channel capacity of the Gaussian waveform channel of infinite bandwidth becomes

$$
C_T(S) = WT\log\left(1+\frac{S}{WTN_0}\right) \to \frac{S}{N_0}\quad\text{(nats per waveform of duration $T$)},
$$

since log(1 + x) ≈ x for small x.

• The capacity grows linearly with the input power.
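A quick numerical look at this limit (a sketch; the values of S, N_0, and T are arbitrary):

```python
import math

S, N0, T = 4.0, 1.0, 1.0      # arbitrary power budget, noise level, duration
for W in [1e1, 1e3, 1e5, 1e7]:
    C_T = W * T * math.log(1 + S / (W * T * N0))
    print(f"W = {W:9.0e} Hz:  C_T(S) = {C_T:.6f} nats  (limit S/N0 = {S / N0})")
```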


Filtered Waveform Stationary Gaussian Channels I: 6-76

[Block diagram] The channel filters the input X_t by H(f) and then adds the stationary Gaussian noise N_t to produce Y_t.

The filtered waveform stationary Gaussian channel can be transformed into an equivalent channel in which a noise N′_t is added to X_t before the filter H(f), where the power spectral density PSD_{N′}(f) of N′_t satisfies

$$
\mathrm{PSD}_{N'}(f) \triangleq \frac{\mathrm{PSD}_N(f)}{|H(f)|^2},
$$

and PSD_N(f) is the power spectral density of N_t.


Filtered Waveform Stationary Gaussian Channels I: 6-77

Notes

• We cannot use the sampling theorem in this case (as we did previously) because

– N′_t is no longer assumed white, so the noise samples are not i.i.d.;

– h(t) is not necessarily band-limited.

Lemma 6.35 Any real-valued function v(t) defined over [0, T) can be decomposed into

$$
v(t) = \sum_{i=1}^{\infty} v_i\,\Psi_i(t),
$$

where the real-valued functions {Ψ_i(t)} form any orthonormal set of functions on [0, T) (which spans the space containing v(t)), namely

$$
\int_0^T \Psi_i(t)\Psi_j(t)\,dt = \begin{cases} 1, & \text{if } i=j;\\ 0, & \text{if } i\neq j,\end{cases}
$$

and

$$
v_i = \int_0^T \Psi_i(t)\,v(t)\,dt.
$$


Filtered Waveform Stationary Gaussian Channels I: 6-78

Examples

• The orthonormal set is not unique.

– The sampling theorem employs sinc functions as the orthonormal set to span a bandlimited signal.

– The Fourier expansion is another example of an orthonormal expansion.

• These two orthonormal sets are not suitable here, because their resulting coefficients are not necessarily i.i.d.

• We therefore have to introduce the Karhunen-Loeve expansion.


Filtered Waveform Stationary Gaussian Channels I: 6-79

Lemma 6.36 (Karhunen-Loeve expansion) Given a stationary random process V_t and its autocorrelation function

$$
\phi_V(t) = E[V_\tau V_{\tau+t}],
$$

let {Ψ_i(t)}_{i=1}^∞ and {λ_i}_{i=1}^∞ be the eigenfunctions and eigenvalues of φ_V(t), namely

$$
\int_0^T \phi_V(t-s)\,\Psi_i(s)\,ds = \lambda_i\,\Psi_i(t), \quad 0 \le t \le T.
$$

Then the expansion coefficients {Λ_i}_{i=1}^∞ of V_t with respect to the orthonormal functions {Ψ_i(t)}_{i=1}^∞ are uncorrelated. In addition, if V_t is Gaussian, then {Λ_i}_{i=1}^∞ are independent Gaussian random variables.


Filtered Waveform Stationary Gaussian Channels I: 6-80

Proof:

$$
\begin{aligned}
E[\Lambda_i\Lambda_j]
&= E\left[\int_0^T \Psi_i(t)V_t\,dt \times \int_0^T \Psi_j(s)V_s\,ds\right]\\
&= \int_0^T\!\!\int_0^T \Psi_i(t)\Psi_j(s)\,E[V_tV_s]\,dt\,ds\\
&= \int_0^T\!\!\int_0^T \Psi_i(t)\Psi_j(s)\,\phi_V(t-s)\,dt\,ds\\
&= \int_0^T \Psi_i(t)\left(\int_0^T \Psi_j(s)\,\phi_V(t-s)\,ds\right)dt\\
&= \int_0^T \Psi_i(t)\,(\lambda_j\Psi_j(t))\,dt\\
&= \begin{cases} \lambda_i, & \text{if } i=j;\\ 0, & \text{if } i\neq j.\end{cases}
\end{aligned}
$$
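As a numerical illustration (our own, not from the text), the eigenfunctions can be approximated by discretizing the integral equation: sample φ_V on a grid, form the matrix K[m, n] = φ_V((m − n)Δ)·Δ, and eigendecompose it. The exponential autocorrelation below is an arbitrary choice for the sketch.

```python
import numpy as np

T, n = 1.0, 500
dt = T / n
t = np.arange(n) * dt

phi = lambda tau: np.exp(-5.0 * np.abs(tau))   # assumed autocorrelation (example only)
C = phi(t[:, None] - t[None, :])               # matrix of phi_V(t_m - t_n)
K = C * dt                                     # discretized integral operator

eigvals, eigvecs = np.linalg.eigh(K)           # K is symmetric
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Check: the expansion coefficients of Gaussian sample paths are (nearly)
# uncorrelated, with variances close to the eigenvalues.
rng = np.random.default_rng(0)
L = np.linalg.cholesky(C + 1e-9 * np.eye(n))
V = rng.standard_normal((10000, n)) @ L.T      # paths with autocorrelation phi
Lam = (V @ eigvecs[:, :5]) * np.sqrt(dt)       # coefficients Lambda_1, ..., Lambda_5
print(np.round(np.cov(Lam.T), 4))              # ~ diag(lambda_1, ..., lambda_5)
print(np.round(eigvals[:5], 4))
```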


Filtered Waveform Stationary Gaussian Channels I: 6-81

Theorem 6.37 Given a filtered Gaussian waveform channel with noise power spectral density PSD_N(f) and filter H(f),

$$
C(S) = \lim_{T\to\infty}\frac{C_T(S)}{T}
= \frac{1}{2}\int_{-\infty}^{\infty}\max\left[0,\ \log\frac{\theta}{\mathrm{PSD}_N(f)/|H(f)|^2}\right]df,
$$

where θ is the solution of

$$
S = \int_{-\infty}^{\infty}\max\left[0,\ \theta - \frac{\mathrm{PSD}_N(f)}{|H(f)|^2}\right]df.
$$

• Remark: This also follows the water-pouring scheme.
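A sketch of how the water level θ can be found numerically (bisection; the effective noise spectrum PSD_N(f)/|H(f)|² used below is an arbitrary illustrative choice, not from the text):

```python
import numpy as np

# Water-pouring over the effective noise spectrum PSD_N(f)/|H(f)|^2.
f = np.linspace(-10.0, 10.0, 20001)           # frequency grid (Hz)
df = f[1] - f[0]
noise_over_H2 = 0.5 * (1 + (f / 3.0) ** 2)    # assumed PSD_N(f)/|H(f)|^2

def power_used(theta):
    return np.sum(np.maximum(0.0, theta - noise_over_H2)) * df

S = 4.0                                       # input power budget
lo, hi = noise_over_H2.min(), noise_over_H2.max() + S
for _ in range(100):                          # bisection on the water level theta
    theta = 0.5 * (lo + hi)
    if power_used(theta) < S:
        lo = theta
    else:
        hi = theta

# Capacity in nats per second (Theorem 6.37)
C = 0.5 * np.sum(np.maximum(0.0, np.log(theta / noise_over_H2))) * df
print(theta, C)
```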


Filtered Waveform Stationary Gaussian Channels I: 6-82

[Figure] (a) The water-pouring scheme; the curve is PSD_N(f)/|H(f)|². (b) The input spectral density that achieves capacity.


Filtered Waveform Stationary Gaussian Channels I: 6-83

Proof: Let {Ψ_i(t)}_{i=1}^∞ denote the Karhunen-Loeve eigenfunctions of the autocorrelation function of N′_t.

Abusing the notation without ambiguity, we let N′_i and X_i be the Karhunen-Loeve coefficients of N′_t and X_t with respect to Ψ_i(t).

Since {N′_i}_{i=1}^∞ are independent Gaussian distributed, we obtain that the channel capacity subject to input waveform width T is

$$
\begin{aligned}
C_T(S) &= \sum_{i=1}^{\infty}\frac{1}{2}\log\left(1+\frac{\max(0,\,\theta-\lambda_i)}{\lambda_i}\right)\\
&= \sum_{i=1}^{\infty}\frac{1}{2}\log\left[\max\left(1,\frac{\theta}{\lambda_i}\right)\right]\\
&= \sum_{i=1}^{\infty}\frac{1}{2}\max\left[0,\ \log\frac{\theta}{\lambda_i}\right],
\end{aligned}
$$

where θ is the solution of

$$
S = \sum_{i=1}^{\infty}\max\left[0,\ \theta-\lambda_i\right],
$$

and E[(N′_i)²] = λ_i is the i-th eigenvalue of the autocorrelation function of N′_t (corresponding to eigenfunction Ψ_i(t)).


Filtered Waveform Stationary Gaussian Channels I: 6-84

The proof is then completed by applying the Toeplitz distribution theorem.

[Toeplitz distribution theorem] Consider a zero-mean stationary random process V_t with power spectral density PSD_V(f) and

$$
\int_{-\infty}^{\infty}\mathrm{PSD}_V(f)\,df < \infty.
$$

Denote by λ_1(T), λ_2(T), λ_3(T), ... the eigenvalues of the Karhunen-Loeve expansion corresponding to the autocorrelation function of V_t over a time interval of width T. Then for any real-valued continuous function a(·) satisfying

$$
a(t) \le K\cdot t \quad\text{for } 0 \le t \le \max_{f\in\mathbb{R}}\mathrm{PSD}_V(f)
$$

for some finite constant K,

$$
\lim_{T\to\infty}\frac{1}{T}\sum_{i=1}^{\infty}a(\lambda_i(T)) = \int_{-\infty}^{\infty}a\big(\mathrm{PSD}_V(f)\big)\,df.
$$


Information Transmission Theorem I: 6-85

Theorem 6.38 (joint source-channel coding theorem) Fix a distortion measure. A DMS can be reproduced at the output of a channel with distortion less than D (by taking sufficiently large blocklength) if

$$
\frac{R(D)\ \text{nats/source letter}}{T_s\ \text{seconds/source letter}} < \frac{C(S)\ \text{nats/channel usage}}{T_c\ \text{seconds/channel usage}},
$$

where T_s and T_c represent the durations per source letter and per channel input, respectively.

• Note that R(D) and C(S) should be measured in the same units, i.e., both in nats (by taking the natural logarithm) or both in bits (by taking the base-2 logarithm).

Theorem 6.39 (joint source-channel coding converse) All data transmission codes will have average distortion larger than D for sufficiently large blocklength if

$$
\frac{R(D)}{T_s} > \frac{C(S)}{T_c}.
$$


Information Transmission Theorem I: 6-86

Example 6.40 (additive white Gaussian noise (AWGN) channel with

binary channel input)

• The source is discrete-time binary memoryless with uniform marginal.

• The discrete-time continuous channel has binary input alphabet and real-line

output alphabet with Gaussian transition probability.

• Denote by P_b the probability of bit error (i.e., the Hamming distortion measure is adopted).


Information Transmission Theorem I: 6-87

• The rate-distortion function for a binary source under the Hamming additive distortion measure is

$$
R(D) = \begin{cases} \log(2) - H_b(D), & \text{if } 0 \le D \le \tfrac{1}{2};\\ 0, & \text{if } D > \tfrac{1}{2}.\end{cases}
$$

• Due to Butman and McEliece, the channel capacity-cost function for the binary-input AWGN channel is

$$
\begin{aligned}
C(S) &= \frac{S}{\sigma^2} - \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}e^{-y^2/2}\log\left[\cosh\left(\frac{S}{\sigma^2}+y\sqrt{\frac{S}{\sigma^2}}\right)\right]dy\\
&= \frac{E_bT_c/T_s}{N_0/2} - \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}e^{-y^2/2}\log\left[\cosh\left(\frac{E_bT_c/T_s}{N_0/2}+y\sqrt{\frac{E_bT_c/T_s}{N_0/2}}\right)\right]dy\\
&= 2R\gamma_b - \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}e^{-y^2/2}\log\left[\cosh\left(2R\gamma_b+y\sqrt{2R\gamma_b}\right)\right]dy,
\end{aligned}
$$

where R = T_c/T_s is the code rate for data transmission, measured in source letters/channel usage (or information bits/channel bit), and γ_b (often denoted by E_b/N_0) is the signal-to-noise ratio per information bit.


Information Transmission Theorem I: 6-88

• Then from the joint source-channel coding theorem, good codes exist when

$$
R(D) < \frac{T_s}{T_c}\,C(S),
$$

or equivalently

$$
\log(2) - H_b(P_b) < 2\gamma_b - \frac{1}{R\sqrt{2\pi}}\int_{-\infty}^{\infty}e^{-y^2/2}\log\left[\cosh\left(2R\gamma_b+y\sqrt{2R\gamma_b}\right)\right]dy.
$$


Information Transmission Theorem I: 6-89

• By re-formulating the above inequality as

$$
H_b(P_b) > \log(2) - 2\gamma_b + \frac{1}{R\sqrt{2\pi}}\int_{-\infty}^{\infty}e^{-y^2/2}\log\left[\cosh\left(2R\gamma_b+y\sqrt{2R\gamma_b}\right)\right]dy,
$$

a lower bound on the bit error probability as a function of γ_b is established.

• This is the Shannon limit that any code must obey over the binary-input Gaussian channel.
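A sketch of how this limit can be evaluated numerically (our own code; it computes the Gaussian integral with scipy.integrate.quad and inverts H_b by root finding; the stable log-cosh helper is ours). The resulting values can be compared against the curve in the figure below.

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq

def binary_entropy(p):                   # H_b(p) in nats
    return -p * np.log(p) - (1 - p) * np.log(1 - p)

def logcosh(x):                          # numerically stable log(cosh(x))
    ax = np.abs(x)
    return ax + np.log1p(np.exp(-2 * ax)) - np.log(2)

def shannon_limit_pb(gamma_b_dB, R=0.5):
    """Lower bound on P_b for the binary-input AWGN channel at code rate R."""
    g = 10 ** (gamma_b_dB / 10)          # gamma_b = Eb/N0 on a linear scale
    a = 2 * R * g
    integrand = lambda y: np.exp(-y ** 2 / 2) * logcosh(a + y * np.sqrt(a))
    I, _ = quad(integrand, -np.inf, np.inf)
    rhs = np.log(2) - 2 * g + I / (R * np.sqrt(2 * np.pi))
    if rhs <= 0:
        return 0.0                       # arbitrarily small P_b is achievable
    return brentq(lambda p: binary_entropy(p) - rhs, 1e-300, 0.5)

for snr_dB in [-1.0, -0.5, 0.0, 0.19]:
    print(snr_dB, shannon_limit_pb(snr_dB, R=0.5))
```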

[Figure: P_b versus γ_b (dB), showing the Shannon-limit curves for R = 1/2 and R = 1/3.]

The Shannon limits for (2, 1) and (3, 1) codes under the binary-input AWGN channel.


Information Transmission Theorem I: 6-90

• The result in the above example became important with the invention of Turbo coding, for which near-Shannon-limit performance was first obtained.

• This implies that a near-optimal channel code has been constructed, since, in principle, no code can perform better than the Shannon limit.


Information Transmission Theorem I: 6-91

Example 6.41 (AWGN channel with real number input)

• The source is discrete-time binary memoryless with uniform marginal.

• The discrete-time continuous channel has real-line input alphabet and real-line

output alphabet with Gaussian transition probability.

• Denote by P_b the probability of bit error (i.e., the Hamming distortion measure is adopted).


Information Transmission Theorem I: 6-92

• The rate-distortion function for a binary source under the Hamming additive distortion measure is

$$
R(D) = \begin{cases} \log(2) - H_b(D), & \text{if } 0 \le D \le \tfrac{1}{2};\\ 0, & \text{if } D > \tfrac{1}{2}.\end{cases}
$$

• The channel capacity-cost function is

$$
C(S) = \frac{1}{2}\log\left(1+\frac{S}{\sigma^2}\right)
= \frac{1}{2}\log\left(1+\frac{E_bT_c/T_s}{N_0/2}\right)
= \frac{1}{2}\log\left(1+2R\gamma_b\right)\ \text{nats/channel symbol},
$$

where R = T_c/T_s is the code rate for data transmission, measured in information bits/channel usage, and γ_b = E_b/N_0 is the signal-to-noise ratio per information bit.


Information Transmission Theorem I: 6-93

• Then from the joint source-channel coding theorem, good codes exist when

$$
R(D) < \frac{T_s}{T_c}\,C(S),
$$

or equivalently

$$
\log(2) - H_b(P_b) < \frac{1}{R}\left[\frac{1}{2}\log\left(1+2R\gamma_b\right)\right].
$$


Information Transmission Theorem I: 6-94

• By re-formulating the above inequality as

$$
H_b(P_b) > \log(2) - \frac{1}{2R}\log\left(1+2R\gamma_b\right),
$$

a lower bound on the bit error probability as a function of γ_b is established.

• This is the Shannon limit that any code must obey over the real-input Gaussian channel.
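This bound is easy to evaluate; a minimal sketch (our own code, inverting H_b numerically):

```python
import math
from scipy.optimize import brentq

def binary_entropy(p):                   # H_b(p) in nats
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

def shannon_limit_pb(gamma_b_dB, R=0.5):
    """Lower bound on P_b for the real-input AWGN channel at code rate R."""
    g = 10 ** (gamma_b_dB / 10)          # Eb/N0 on a linear scale
    rhs = math.log(2) - math.log(1 + 2 * R * g) / (2 * R)
    if rhs <= 0:
        return 0.0                       # arbitrarily small P_b is achievable
    return brentq(lambda p: binary_entropy(p) - rhs, 1e-300, 0.5)

# For R = 1/2 the bound allows P_b -> 0 only when gamma_b >= 1 (i.e. 0 dB).
for snr_dB in [-1.0, -0.5, 0.0, 0.5]:
    print(snr_dB, shannon_limit_pb(snr_dB, R=0.5))
```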

[Figure: P_b versus γ_b (dB), showing the Shannon-limit curves for R = 1/2 and R = 1/3.]

The Shannon limits for (2, 1) and (3, 1) codes under continuous-input AWGN channels.


Capacity Bounds for Non-Gaussian Channels I: 6-95

• If a channel has additive but non-Gaussian noise and an input power constraint, then it is often hard to calculate the channel capacity, let alone to derive a closed-form capacity formula.

• We therefore only introduce an upper bound and a lower bound on the capacity of non-Gaussian channels.


Capacity Bounds for Non-Gaussian Channels I: 6-96

Definition 6.42 (entropy power) The entropy power of a random variable N

is defined as

$$
N_e \triangleq \frac{1}{2\pi e}\,e^{2\cdot h(N)}.
$$

Lemma 6.43 For a discrete-time continuous-alphabet additive-noise channel, the channel capacity-cost function satisfies

$$
\frac{1}{2}\log\frac{S+\sigma^2}{N_e} \ \ge\ C(S)\ \ge\ \frac{1}{2}\log\frac{S+\sigma^2}{\sigma^2},
$$

where S is the bound in the input power constraint and σ² is the noise power.

Proof: The lower bound is already proved in Theorem 6.30. The upper bound follows from

$$
I(X;Y) = h(Y) - h(N) \le \frac{1}{2}\log[2\pi e(S+\sigma^2)] - \frac{1}{2}\log[2\pi e N_e].
$$
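As an illustration of Lemma 6.43 (our own example, not from the text), take zero-mean Laplacian noise with scale b: its differential entropy is h(N) = 1 + log(2b) nats and its variance is σ² = 2b², so its entropy power N_e = (2e/π)b² is strictly smaller than σ² and the two bounds do not coincide.

```python
import math

def capacity_bounds_laplacian(S, b):
    """Bounds of Lemma 6.43 for an additive Laplacian-noise channel (scale b)."""
    var = 2 * b ** 2                               # noise power sigma^2
    h = 1 + math.log(2 * b)                        # differential entropy in nats
    Ne = math.exp(2 * h) / (2 * math.pi * math.e)  # entropy power = (2e/pi) b^2
    lower = 0.5 * math.log((S + var) / var)
    upper = 0.5 * math.log((S + var) / Ne)
    return lower, upper

lo, up = capacity_bounds_laplacian(S=10.0, b=1.0)
print(f"{lo:.4f} nats <= C(S) <= {up:.4f} nats")
```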


Capacity Bounds for Non-Gaussian Channels I: 6-97

• The entropy power of a noise N can be viewed as the average power of a Gaussian random variable that has the same differential entropy as N.

• For a Gaussian noise N, its entropy power equals

$$
N_e = \frac{1}{2\pi e}\,e^{2h(N)} = \mathrm{Var}(N),
$$

from which the name comes.

• The larger the entropy power, the smaller the upper bound on the capacity-cost function.


Capacity Bounds for Non-Gaussian Channels I: 6-98

• Whenever two independent Gaussian noises N_1 and N_2 are added, the power (variance) of the sum equals the sum of the powers (variances) of the two noises, i.e.,

$$
e^{2h(N_1+N_2)} = e^{2h(N_1)} + e^{2h(N_2)},
$$

or equivalently

$$
\mathrm{Var}(N_1+N_2) = \mathrm{Var}(N_1) + \mathrm{Var}(N_2).
$$

• However, when the two independent noises are non-Gaussian, the relationship becomes

$$
e^{2h(N_1+N_2)} \ge e^{2h(N_1)} + e^{2h(N_2)},
$$

or equivalently

$$
N_e(N_1+N_2) \ge N_e(N_1) + N_e(N_2).
$$

This is called the entropy-power inequality.

– The entropy-power inequality indicates that summing two independent noises may introduce more entropy power than the sum of the individual entropy powers, except for Gaussian noises.
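A small numerical check (our own example): for two independent uniform(0, 1) noises, h(N_1) = h(N_2) = 0, and their sum has a triangular density on [0, 2] with h(N_1 + N_2) = 1/2 nat, so e^{2h(N_1+N_2)} = e ≈ 2.718 ≥ e^0 + e^0 = 2. The sketch below reproduces the differential entropies by numerical integration.

```python
import numpy as np
from scipy.integrate import quad

def diff_entropy(f, a, b):
    """Differential entropy -int f ln f of a density f supported on [a, b]."""
    integrand = lambda x: -f(x) * np.log(f(x)) if f(x) > 0 else 0.0
    val, _ = quad(integrand, a, b)
    return val

uniform = lambda x: 1.0                        # density of U(0, 1)
triangular = lambda x: x if x <= 1 else 2 - x  # density of the sum of two U(0, 1)

h_single = diff_entropy(uniform, 0, 1)                                   # = 0
h_sum = diff_entropy(triangular, 0, 1) + diff_entropy(triangular, 1, 2)  # = 0.5
print(h_single, h_sum)
print(np.exp(2 * h_sum), ">=", 2 * np.exp(2 * h_single))   # EPI: 2.718... >= 2
```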


Key Notes I: 6-99

• Models of discrete-time continuous sources and channels

• Models of waveform sources and channels.

• Differential entropy and its operational meaning in quantization efficiency

• Maximal differential entropy of Gaussian source, among all sources with the

same mean and variance

• The mismatch in properties of entropy and differential entropy

• Relative entropy and mutual information of continuous systems

• Rate-distortion function for continuous sources

– Its calculation for Gaussian sources under squared error distortion

– Its calculation for binary sources under Hamming additive distortion mea-

sure

• Capacity-cost function and its proof

• Calculation of the capacity-cost function for specific channels

– Memoryless additive Gaussian channels

– Uncorrelated and correlated parallel Gaussian channels


Key Notes I: 6-100

– Water pouring scheme (graphical interpretation)

– Band-limited waveform channels with white Gaussian noise

– Filtered waveform stationary Gaussian channels

• Information-transmission theorem (joint source-channel coding theorem)

– Shannon limit (BER versus SNRb)

• Interpretation of entropy power (it provides an upper bound on the capacity of non-Gaussian channels)

– Operational characteristics of entropy-power inequality