

612 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 43, NO. 2, MARCH 1997

Empirical Quantizer Design in the Presence of Source Noise or Channel Noise

Tamás Linder, Member, IEEE, Gábor Lugosi, and Kenneth Zeger, Senior Member, IEEE

Abstract—The problem of vector quantizer empirical design for noisy channels or for noisy sources is studied. It is shown that the average squared distortion of a vector quantizer designed optimally from observing clean independent and identically distributed (i.i.d.) training vectors converges in expectation, as the training set size grows, to the minimum possible mean-squared error obtainable for quantizing the clean source and transmitting across a discrete memoryless noisy channel. Similarly, it is shown that if the source is corrupted by additive noise, then the average squared distortion of a vector quantizer designed optimally from observing i.i.d. noisy training vectors converges in expectation, as the training set size grows, to the minimum possible mean-squared error obtainable for quantizing the noisy source and transmitting across a noiseless channel. Rates of convergence are also provided.

Index Terms—Empirical vector quantizer design, lossy source coding, training sets, convergence rates, channel noise.

I. INTRODUCTION

THE design of quantizers has been studied over the last four decades from various perspectives. On the practical side, the Lloyd–Max [1], [2] algorithm provides an efficient iterative method of designing locally optimal quantizers from known source statistics or from training samples. The generalized Lloyd algorithm [3], [4] similarly is useful for designing vector quantizers. A theoretical problem motivated by practice is the question of consistency: if the observed training set size is large enough, can one expect a performance nearly as good as in the case of known source statistics? The consistency of design based on global minimization of the empirical distortion was established with various levels of generality by Pollard [5], Abaya and Wise [6], and Sabin [7]. The finite sample performance was also analyzed by Pollard [8], Linder, Lugosi, and Zeger [9], and Chou [10]. The consistency of the generalized Lloyd algorithm was also established by Sabin [7] and Sabin and Gray [11].

Manuscript received November 10, 1995; revised June 19, 1996. This research was supported in part by the National Science Foundation, the Hungarian National Foundation for Scientific Research, and the Foundation for Hungarian Higher Education and Research. The material in this paper was presented in part at the Data Compression Conference, Salt Lake City, UT, March 1996.

T. Linder and G. Lugosi are with the Department of Mathematics and Computer Science, Technical University of Budapest, Budapest, Hungary.

K. Zeger was with the Department of Electrical and Computer Engineering, Coordinated Science Laboratory, University of Illinois at Urbana-Champaign, Urbana, IL 61801 USA. He is now with the Department of Electrical and Computer Engineering, University of California at San Diego, La Jolla, CA 92093 USA.

Publisher Item Identifier S 0018-9448(97)00782-7.

An interesting interpretation of the quantizer design problem was given by Merhav and Ziv [12], who obtained lower bounds on the amount of side information a quantizer design algorithm needs to perform nearly optimally for all sources.

Less is known about the more general situation when the quantized source is to be transmitted through a noisy channel (joint source and channel coding), or when the source is corrupted by noise prior to quantization (quantization of a noisy source). In the noisy channel case, theoretical research has mostly concentrated on the questions of optimal rate-distortion performance in the limit of large block length, either for separate [13] or joint [14] source and channel coding, as well as for high-resolution source-channel coding [15], [16]. Practical algorithms have also been proposed to iteratively design (locally) optimal source and channel coding schemes [17], [18].

For the noisy source quantization problem, the optimal rate-distortion performance was analyzed by Dobrushin and Tsybakov [19] and Berger [20]. The structure of the optimal noisy source quantizer for squared distortion was studied by Fine [21], Sakrison [22], and Wolf and Ziv [23]. The framework of these works also included transmission through a noisy channel. Properties of optimal noisy source quantizers, as well as a treatment of Gaussian sources corrupted by additive independent Gaussian noise, were given by Ayanoglu [24]. A Lloyd–Max-type iterative design algorithm was given by Ephraim and Gray [25] for the design of vector quantizers for noisy sources. A design approach based on deterministic annealing was reported by Rao et al. [26]. No consistency results have yet been proved for empirical design of noisy channel or noisy source vector quantizers.

In empirical design of standard vector quantizers one can observe a finite number of independent samples of the source vector. The procedure chooses the quantizer which minimizes the average distortion over this data. One is interested in the expected distortion of the designed quantizer when it is used on a source which is independent of the training data. An empirical design procedure is called consistent if the expected distortion of the empirical quantizer approaches the distortion of the quantizer which is optimal for the source, as the size of the training data increases. If consistency is established, one can investigate the rate of convergence of the algorithm, i.e., how fast the expected distortion of the empirically optimal quantizer approaches the optimal distortion. Tight convergence rates have practical significance, since consistency alone gives no indication of the relationship between the resulting distortion and the size of the training data.

In this paper, we investigate the consistency of vector quantizers obtained by global empirical error minimization

0018–9448/97$10.00 © 1997 IEEE


for noisy channels and noisy sources. In both cases, the notion of empirical (sample) distortion is not as simple as in standard vector quantizer design. For noisy channels, the channel transition probabilities are assumed to be known, and the empirical distortion is defined as the expected value of the distortion between a source symbol and its random reproduction, where the expectation is taken with respect to the channel. For sources corrupted by noise, the density of the noise is assumed to be known, and the estimation-quantization structure (see, e.g., [23]) of the optimal quantizer is used. Here the sample distortion has no unique counterpart. Although a modified distortion measure can be introduced [25] which converts the problem into a standard quantization problem, this modified measure cannot directly be used since it is a function of the unknown source statistics. The main difficulty lies in the fact that, in general, the encoding regions of a noisy source vector quantizer need not be either convex or connected. Thus the set of quantizers to be considered in the minimization procedure is more complex than in the clean source or noisy channel case.

In this paper, Section II gives the necessary definitions for noisy channel and noisy source quantization problems. In Section III, consistency of the empirical design for noisy channel quantization is established. In particular, Theorem 1 proves that the expected squared error distortion of the quantizer minimizing the appropriately defined empirical distortion over $n$ training vectors is within $O(\sqrt{\log n / n})$ of the distortion of the quantizer which is optimal for the given source and channel. This is the same rate as that obtained in [9] for the standard vector quantizer problem. In Section IV, empirical design for sources corrupted by additive noise is considered. A method is presented which combines nonparametric estimation with empirical error minimization. Theorem 2 proves that if the conditional mean of the clean source given the noisy source can be consistently estimated, then the method is consistent. Based on this result, Corollary 1 establishes the consistency of empirical design for additive, independent noise. We conjecture that the noisy source design problem is likely to be more difficult than the noisy channel quantizer design problem, when only noisy source samples are available. In Theorem 3 it is shown that consistency and convergence rates can be obtained under much more general conditions on the noise, if training samples from the clean source are also available.

II. PRELIMINARIES

A. Vector Quantizers for Noisy Channels

An $N$-level noisy-channel vector quantizer is defined via two mappings. The encoder $\alpha$ maps $\mathbb{R}^d$ into the finite set $\{1,\ldots,N\}$, and the decoder $\beta$ maps $\{1,\ldots,N\}$ onto the set of codewords $\{y_1,\ldots,y_N\} \subset \mathbb{R}^d$ by the rule $\beta(j) = y_j$, for $j = 1,\ldots,N$. The rate of the quantizer is $(\log_2 N)/d$ bits per source symbol. The quantizer takes an $\mathbb{R}^d$-valued random vector $X$ as its input, and produces the index $I = \alpha(X)$. The index $I$ is then transmitted through a noisy channel, and the decoder receives the index $J$, a random variable whose conditional distribution given $I$ is

$$\Pr\{J = j \mid I = i\} = P(j\,|\,i), \qquad i, j \in \{1,\ldots,N\}$$

where the $P(j|i)$ are the channel transition probabilities. The channel is assumed to be discrete with $N$ input and $N$ output symbols, with known transition probabilities, and the channel is assumed to work independently of the source. The output of the quantizer is

$$\hat X = \beta(J) = y_J$$

and the joint distribution of $(X, \hat X)$ is determined by the source distribution $\mu$ and the conditional distribution of $\hat X$ given $X$.

We will use the notation $Q(X) = \hat X$ as for an ordinary vector quantizer, but now $Q$ is not a deterministic mapping. The performance of $Q$ will be measured by the mean-squared distortion $D(Q) = E\|X - Q(X)\|^2$, where $\|x\|$ denotes the Euclidean norm of the vector $x$. The quantizer distortion can be written as

$$D(Q) = \sum_{i=1}^{N} \int_{S_i} \sum_{j=1}^{N} P(j\,|\,i)\, \|x - y_j\|^2\, \mu(dx) \qquad (1)$$

where the encoding regions $S_i = \{x : \alpha(x) = i\}$, for $i = 1,\ldots,N$, completely determine the encoder $\alpha$. It is obvious from (1) that given the decoder $\beta$, the encoder regions

$$S_i = \Big\{ x : \sum_{j=1}^{N} P(j\,|\,i)\,\|x - y_j\|^2 \le \sum_{j=1}^{N} P(j\,|\,m)\,\|x - y_j\|^2 \ \text{for all } m \Big\}$$

determine an encoder (with ties broken arbitrarily) which minimizes the distortion over all encoders. The above encoding rule is sometimes called the weighted nearest neighbor condition (see, e.g., [14], [17], [27], [28]). Note that some of the $S_i$ may be empty in an optimal noisy channel vector quantizer (in contrast to the noiseless channel case).
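To make the weighted nearest neighbor rule concrete, here is a minimal numerical sketch (ours, not from the paper): given a codebook and a channel transition matrix `P` (both assumed names), the encoder picks the index whose channel-averaged squared error is smallest.

```python
import numpy as np

def weighted_nn_encode(x, codebook, P):
    """Weighted nearest neighbor encoding: pick the index i minimizing
    the channel-averaged squared error sum_j P[i, j] * ||x - y_j||^2."""
    d2 = np.sum((codebook - x) ** 2, axis=1)   # ||x - y_j||^2 for all j
    expected = P @ d2                          # expected distortion per index i
    return int(np.argmin(expected))

# Toy usage: 4 codevectors in R^2 and a symmetric channel with error rate 0.1.
rng = np.random.default_rng(0)
codebook = rng.uniform(-1.0, 1.0, size=(4, 2))
P = np.full((4, 4), 0.1 / 3)
np.fill_diagonal(P, 0.9)
print(weighted_nn_encode(np.array([0.3, -0.2]), codebook, P))
```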

Assuming that $E\|X\|^2 < \infty$, there always exists an $N$-level quantizer minimizing the distortion over all $N$-level quantizers. This is easily seen by adapting an argument for deterministic quantizers by Pollard [5]. Let us denote the distortion of such an optimal quantizer by

$$D^* = \min_{\alpha,\beta} E\|X - Q(X)\|^2$$


where the minimum is taken over all ($N$-level) encoders and decoders operating on the fixed channel and source. Thus $D^*$ depends on $N$, the source statistics, and on the channel transition probabilities, which we will assume to be fixed and known throughout this paper.

B. Vector Quantizers for Noisy Sources

Assume that $Y$ is the noisy version of the source $X$. $Y$ can be viewed as the output of a channel whose input is $X$. The noisy source is to be quantized by an $N$-level quantizer $Q$ such that the mean-squared distortion

$$D(Q) = E\|X - Q(Y)\|^2$$

is as small as possible. In this problem, an $N$-level quantizer $Q$ is characterized by its codevectors $y_1,\ldots,y_N$ and the measurable sets $S_i = \{y : Q(y) = y_i\}$, $i = 1,\ldots,N$, called encoding regions. As was noted in several papers dealing with this problem (see, e.g., [19], [21]–[23]), the structure of the optimal $N$-level quantizer can be obtained via a useful decomposition. Let $r(y) = E[X \mid Y = y]$ denote a version of the conditional expectation $E[X|Y]$. Then

$$E\|X - Q(Y)\|^2 = E\|X - r(Y)\|^2 + E\|r(Y) - Q(Y)\|^2 \qquad (2)$$

where the cross term disappears after taking iterated expectations, first conditioned on $Y$. Thus to minimize $E\|X - Q(Y)\|^2$, the quantizer has to minimize $E\|r(Y) - Q(Y)\|^2$. If the codevectors $y_1,\ldots,y_N$ are given, then the encoding regions minimizing the distortion must satisfy

$$y \in S_i \ \text{ only if } \ \|r(y) - y_i\| \le \|r(y) - y_m\| \quad \text{for } m = 1,\ldots,N. \qquad (3)$$

This means that for any $y$

$$Q(y) = q(r(y))$$

where $q$ is an ordinary nearest neighbor quantizer which has the same codevectors as $Q$. Thus by (2) we have

$$\inf_Q E\|X - Q(Y)\|^2 = E\|X - r(Y)\|^2 + \inf_q E\|r(Y) - q(r(Y))\|^2$$

where the second infimum is taken over all $N$-level nearest neighbor quantizers $q$. Since $E\|r(Y)\|^2 \le E\|X\|^2 < \infty$, it follows from, e.g., Pollard [5] that an optimal quantizer $q^*$ minimizing $E\|r(Y) - q(r(Y))\|^2$ exists. Therefore, the quantizer $Q^*$ minimizing $E\|X - Q(Y)\|^2$ is obtained by first transforming $Y$ by $r$ and then quantizing $r(Y)$ by a nearest neighbor quantizer $q^*$, that is,

$$Q^*(y) = q^*(r(y)).$$

Furthermore

$$D^* = \inf_Q E\|X - Q(Y)\|^2 = E\|X - r(Y)\|^2 + E\|r(Y) - q^*(r(Y))\|^2. \qquad (4)$$
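The decomposition (2) is easy to verify numerically. The sketch below (our toy model, chosen for illustration: scalar $X$ uniform on $[-1,1]$, additive Gaussian noise, and a fixed two-level quantizer acting through $r$) checks that the two sides agree up to Monte Carlo error.

```python
import numpy as np

# Monte Carlo check of decomposition (2):
#   E||X - Q(Y)||^2 = E||X - r(Y)||^2 + E||r(Y) - Q(Y)||^2.
rng = np.random.default_rng(1)
n, sigma = 10000, 0.3
X = rng.uniform(-1, 1, n)
Y = X + sigma * rng.normal(size=n)

# r(y) = E[X|Y=y], computed by quadrature over the known uniform prior.
xs = np.linspace(-1, 1, 300)
w = np.exp(-((Y[:, None] - xs) ** 2) / (2 * sigma**2))
r = (w * xs).sum(axis=1) / w.sum(axis=1)

Q = np.where(r < 0, -0.5, 0.5)   # 2-level nearest neighbor quantizer on r(Y)
lhs = np.mean((X - Q) ** 2)
rhs = np.mean((X - r) ** 2) + np.mean((r - Q) ** 2)
print(lhs, rhs)                  # the two values agree up to Monte Carlo error
```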

III. EMPIRICAL DESIGN FOR NOISY CHANNELS

In most applications one does not know the actual source statistics, but instead can observe a sequence of independent and identically distributed (i.i.d.) copies $X_1,\ldots,X_n$ of $X$. These “training samples” induce the empirical distribution $\mu_n$ which assigns probability to every measurable $A \subset \mathbb{R}^d$ according to the rule

$$\mu_n(A) = \frac{1}{n}\sum_{i=1}^{n} I_{\{X_i \in A\}}$$

where $I$ is the indicator function of the event of its argument. When the source statistics are not known, one cannot directly search for an optimal quantizer $Q^*$. Instead, one generally attempts to minimize the empirical distortion, which is a functional of $\mu_n$ rather than of the true source distribution. The empirical distortion $D_n(Q)$ is the expected value (expectation taken over the channel use) of the average distortion of the quantizer when $X_1,\ldots,X_n$ is quantized

$$D_n(Q) = \frac{1}{n}\sum_{i=1}^{n} \sum_{j=1}^{N} P(j\,|\,\alpha(X_i))\, \|X_i - y_j\|^2. \qquad (5)$$

The empirical distortion can be rewritten in the simple form

$$D_n(Q) = \frac{1}{n}\sum_{i=1}^{n} g_Q(X_i)$$

where $g_Q : \mathbb{R}^d \to [0,\infty)$ is a function which depends on the quantizer $Q$ as

$$g_Q(x) = \sum_{j=1}^{N} P(j\,|\,\alpha(x))\, \|x - y_j\|^2. \qquad (6)$$

Note that the empirical distortion is a random variable, a function of the training data $X_1,\ldots,X_n$. We remark here that by using the function $g_Q$, the expected distortion of $Q$ in (1) can be rewritten as

$$D(Q) = E\, g_Q(X).$$

Assume we design a quantizer based on the training data by minimizing the empirical distortion over all possible quantizers. This minimization can be carried out in principle, since given $X_1,\ldots,X_n$ and the channel transition probabilities, we can calculate $D_n(Q)$ for any quantizer $Q$ using weighted nearest neighbor encoding.
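As an illustration of how $D_n(Q)$ can be evaluated, the sketch below (our code; the names `X`, `codebook`, and `P` are assumptions) computes the empirical distortion (5) of a codebook under weighted nearest neighbor encoding, and scores a crude random search in place of the exact global minimization, which is combinatorially expensive.

```python
import numpy as np

def empirical_distortion(X, codebook, P):
    """D_n(Q) of (5): channel-averaged squared error over the training set,
    with each sample encoded by the weighted nearest neighbor rule."""
    d2 = ((X[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)  # (n, N)
    g = d2 @ P.T          # g[i, k] = sum_j P[k, j] * ||X_i - y_j||^2
    return g.min(axis=1).mean()

# Crude stand-in for global empirical minimization: keep the best of many
# random codebooks (the paper minimizes exactly, e.g., over a fine grid).
rng = np.random.default_rng(2)
X = rng.normal(size=(500, 2))
N = 8
P = np.full((N, N), 0.05 / (N - 1))
np.fill_diagonal(P, 0.95)
best = min((rng.uniform(-3, 3, size=(N, 2)) for _ in range(200)),
           key=lambda C: empirical_distortion(X, C, P))
```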

Let $Q_n^*$ be the quantizer minimizing the empirical distortion,

$$Q_n^* = \arg\min_Q D_n(Q)$$

and let

$$D(Q_n^*) = E\big[\, g_{Q_n^*}(X) \mid X_1,\ldots,X_n \big]$$

where $X$ is independent of $X_1,\ldots,X_n$. Then $D(Q_n^*)$ is the average distortion of the empirically optimal quantizer $Q_n^*$ when it is used on data independent of the training set. A fundamental question is how close this distortion gets to the optimal $D^*$


as the size of the training data increases, and therefore as the source statistics are more and more revealed by the empirical distribution.

One goal in this paper is to investigate how fast the difference between the expected distortion of the empirically optimal quantizer and the optimal distortion

$$E\,D(Q_n^*) - D^*$$

decreases as the training set size $n$ increases. An upper bound on this difference, converging to zero as $n \to \infty$, is given, which indicates how large the training set size should be so that the designed quantizer has a distortion near the optimum.

In what follows we assume that the source is bounded almost surely (a.s.), so that $\Pr\{\|X\| \le K\} = 1$ for some $K < \infty$. With this assumption we have the following theorem.

Theorem 1: Assume that a source $X$ is bounded as $\|X\| \le K$ a.s. for some $K < \infty$, and let $X_1,\ldots,X_n$ be i.i.d. copies of $X$. Suppose an $N$-level noisy channel vector quantizer $Q_n^*$ is designed by using empirical distortion minimization over the training set $X_1,\ldots,X_n$. Then the average distortion of this quantizer is bounded above as

$$E\,D(Q_n^*) \le D^* + c\,\sqrt{\frac{\log n}{n}}$$

where $D^*$ is the distortion of the $N$-level quantizer that is optimal for the source and the channel, and $c$ is a constant that depends only on $K$, $N$, and the dimension $d$.

Proof: The proof of the theorem is based on a technique often used in the statistical and learning theory literature (see, e.g., [29]). First we note that the condition $\|X\| \le K$ a.s. implies that both $Q^*$ (the globally optimal quantizer) and $Q_n^*$ (the empirically optimal quantizer) must have codevectors lying inside the sphere $B_K = \{x : \|x\| \le K\}$ of radius $K$ centered at the origin, since projecting any codevector outside this sphere back to the surface of the sphere clearly reduces the distortion. Let $Q$ be a quantizer for the noisy channel and introduce the notation

$$D(Q) = E\,g_Q(X), \qquad D_n(Q) = \frac{1}{n}\sum_{i=1}^{n} g_Q(X_i)$$

where $g_Q$ is defined in (6). Let $\mathcal{G}$ be the class of all functions $g_Q$, where $Q$ ranges through all $N$-level noisy channel quantizers whose codepoints lie inside the sphere $B_K$. These quantizers can be assumed to use the weighted nearest neighbor encoding rule since both $Q^*$ and $Q_n^*$ use such encoders. For a fixed arbitrary $\epsilon > 0$, let $\mathcal{G}_\epsilon$ be an $\epsilon$-covering of $\mathcal{G}$, i.e., let $\mathcal{G}_\epsilon$ be a set of functions $\{g_{Q_1},\ldots,g_{Q_M}\} \subset \mathcal{G}$ such that for each $g_Q \in \mathcal{G}$, there exists an $N$-level noisy channel quantizer $Q_k$ with $g_{Q_k} \in \mathcal{G}_\epsilon$ satisfying

$$\sup_{x \in B_K} |g_Q(x) - g_{Q_k}(x)| \le \epsilon.$$

Let $Q^*$ be an arbitrary fixed optimal quantizer (i.e., $Q^*$ has codevectors in $B_K$ and distortion $D^*$), and let $\bar Q$ denote a quantizer such that $g_{\bar Q} \in \mathcal{G}_\epsilon$ satisfies

$$\sup_{x \in B_K} |g_{Q_n^*}(x) - g_{\bar Q}(x)| \le \epsilon. \qquad (7)$$

Then

$$D(Q_n^*) - D^* \le \big( D(\bar Q) - D_n(\bar Q) \big) + \big( D_n(Q^*) - D(Q^*) \big) + 2\epsilon \qquad \text{a.s.}$$

where in the inequality we used (7) and the fact that $D_n$ is minimized by the empirically optimal quantizer $Q_n^*$. Thus we have that

$$D(Q_n^*) - D^* \le 2 \max_{g_Q \in \mathcal{G}_\epsilon \cup \{g_{Q^*}\}} |D_n(Q) - D(Q)| + 2\epsilon \qquad \text{a.s.} \qquad (8)$$

The right-hand side of the above inequality is a random variable whose expectation gives an upper bound on $E\,D(Q_n^*) - D^*$. To upper-bound this expectation we will use Hoeffding's [30] probability inequality, which says that if $Z_1,\ldots,Z_n$ are i.i.d. real-valued random variables such that $a \le Z_i \le b$ for some $a < b$, and $t > 0$, then

$$\Pr\Big\{ \Big| \frac{1}{n}\sum_{i=1}^{n} Z_i - E Z_1 \Big| > t \Big\} \le 2\, e^{-2nt^2/(b-a)^2}. \qquad (9)$$
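A quick numerical sanity check of (9) (our snippet, using uniform $Z_i$ on $[0,1]$, so $a = 0$ and $b = 1$): the empirical deviation probability stays well below the Hoeffding bound.

```python
import numpy as np

rng = np.random.default_rng(3)
n, t, trials = 200, 0.05, 20000
Z = rng.uniform(0.0, 1.0, size=(trials, n))
deviation = np.abs(Z.mean(axis=1) - 0.5)   # |sample mean - E Z_1|
print((deviation > t).mean())              # empirical deviation probability
print(2 * np.exp(-2 * n * t**2))           # Hoeffding bound (9)
```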

Bounding the Cardinality of a Minimal $\epsilon$-Covering: In order to use the facts above, we derive an upper bound on the cardinality of a minimal $\epsilon$-covering $\mathcal{G}_\epsilon$ of the class $\mathcal{G} = \{g_Q : Q \in \mathcal{Q}\}$, where $\mathcal{Q}$ is the set of all $N$-level noisy channel vector quantizers with weighted nearest neighbor encoders and whose codevectors have norm at most $K$. Since the codevectors all lie in the sphere $B_K$ and $\|X\| \le K$ a.s., the set of functions $\mathcal{G}$ has a constant envelope of $4K^2$ on the support of the source. Let us assume now that we are given the quantizers $Q$ and $Q'$ having codevectors $y_1,\ldots,y_N$ and $y'_1,\ldots,y'_N$, respectively, such that for some $\delta > 0$, we have $\|y_j - y'_j\| \le \delta$ for all $j$. For a given $x \in B_K$, assume without loss of generality that $g_Q(x) \le g_{Q'}(x)$. Setting $i = \alpha(x)$ and $i' = \alpha'(x)$, the weighted nearest neighbor encoding indices of $x$ under $Q$ and $Q'$, we have by the weighted nearest neighbor property that

$$g_{Q'}(x) - g_Q(x) \le \sum_{j=1}^{N} P(j\,|\,i)\big( \|x - y'_j\|^2 - \|x - y_j\|^2 \big) \le 4K\delta + \delta^2. \qquad (10)$$


If we consider a rectangular grid of width $\delta/\sqrt{d}$ in $B_K$, then for any $y \in B_K$ there is a point $y'$ on this grid such that $\|y - y'\| \le \delta$. Thus letting $\mathcal{Q}_\delta$ be the set of all noisy channel quantizers which have all their codepoints on this grid and which use weighted nearest neighbor encoding, we obtain from (10) that for any $Q \in \mathcal{Q}$ there exists a $Q' \in \mathcal{Q}_\delta$ such that

$$\sup_{x \in B_K} |g_Q(x) - g_{Q'}(x)| \le 4K\delta + \delta^2.$$

This implies that $\{g_{Q'} : Q' \in \mathcal{Q}_\delta\}$ is an $\epsilon$-covering of $\mathcal{G}$ for $\delta$ chosen so that $4K\delta + \delta^2 = \epsilon$ (in particular, $\delta \ge \epsilon/(8K)$ when $\epsilon \le 4K^2$). Letting $V_d$ denote the volume of the unit ball of $\mathbb{R}^d$, the number of grid points in $B_K$ is bounded by a constant multiple of $V_d (K\sqrt{d}/\delta)^d$, and since a quantizer in $\mathcal{Q}_\delta$ is specified by $N$ such codepoints, we thus obtain

$$|\mathcal{G}_\epsilon| \le \left( \frac{c_1 K^2 \sqrt{d}}{\epsilon} \right)^{Nd}$$

for a constant $c_1$ depending only on $d$.

With this we obtain from (8), the union bound, and Hoeffding's inequality that for any $t > 0$ and $\epsilon$ such that $0 < \epsilon \le 4K^2$

$$\Pr\{ D(Q_n^*) - D^* > 2t + 2\epsilon \} \le 2\big( |\mathcal{G}_\epsilon| + 1 \big)\, e^{-2nt^2/(4K^2)^2} \qquad (11)$$

since each $g_Q(X_i)$ takes values in $[0, 4K^2]$.

This inequality holds for all $t > 0$. Choose $t = \epsilon$. The difference inside the probability on the left-hand side is a.s. upper-bounded by $4K^2$. Using the simple bound

$$E\,Z \le u + c\,\Pr\{Z > u\}$$

valid for any $u > 0$ and random variable $Z$ such that $Z \le c$ a.s., we obtain

$$E\,D(Q_n^*) - D^* \le 4\epsilon + 8K^2 \big( |\mathcal{G}_\epsilon| + 1 \big)\, e^{-n\epsilon^2/(8K^4)}.$$

Finally, if we choose $\epsilon = c_2\sqrt{\log n / n}$ with a suitable constant $c_2$ depending only on $K$, $N$, and $d$, then the second term on the right-hand side of the above inequality is on the order of $\sqrt{\log n / n}$, and the proof of the theorem is complete.

IV. EMPIRICAL DESIGN FOR NOISY SOURCES

In the noisy source quantizer design problem we are given the samples $Y_1,\ldots,Y_n$ drawn independently from the distribution of $Y$. We also assume that the conditional distribution of the noisy source $Y$ given $X$ is known (i.e., the channel between $X$ and $Y$ is known), and that $\|X\| \le K$ a.s. for some known constant $K$. In this situation the method of empirical distortion minimization cannot be applied directly, since we only have the indirect (noisy) observations $Y_1,\ldots,Y_n$ about $X$. However, the decomposition (4) suggests the following method for noisy source quantizer design:

i) Split the data into two parts, $Y_1,\ldots,Y_{n/2}$ and $Y_{n/2+1},\ldots,Y_n$ (assume $n$ is even), and estimate $r(y) = E[X \mid Y = y]$ from the first half of the samples and the known conditional distribution of $Y$ given $X$. The estimate $\hat r_n$ is required to be consistent:

$$E\|r(Y) - \hat r_n(Y)\|^2 \to 0 \qquad \text{as } n \to \infty. \qquad (12)$$

Since the upper bound $K$ on $\|X\|$ is known, we also require that

$$\sup_y \|\hat r_n(y)\| \le K. \qquad (13)$$

ii) Using the second half of the training data define a new set of training vectors

$$\hat r_n(Y_{n/2+1}),\ldots,\hat r_n(Y_n)$$

and consider a nearest neighbor quantizer $q_n$ minimizing the empirical distortion

$$\frac{2}{n}\sum_{i=n/2+1}^{n} \|\hat r_n(Y_i) - q(\hat r_n(Y_i))\|^2. \qquad (14)$$

Here the minimization is over all $N$-level nearest neighbor quantizers. The quantizer for the noisy source designed from the noisy samples is then obtained from $\hat r_n$ and $q_n$ as

$$Q_n(y) = q_n(\hat r_n(y)).$$
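A compact sketch of the two-step design i)–ii) for a toy scalar model (our code, not the paper's): for brevity, step i) is replaced by a conditional-mean computation that uses the known noise density together with the true uniform prior, rather than a deconvolution estimate built from the first half of the samples as in Lemma 1 below; `lloyd` is a plain k-means iteration standing in for exact minimization of (14).

```python
import numpy as np

def lloyd(Z, N, iters=50, seed=0):
    """Plain Lloyd (k-means) iteration: a practical stand-in for exact
    empirical distortion minimization in step ii)."""
    rng = np.random.default_rng(seed)
    C = Z[rng.choice(len(Z), size=N, replace=False)].copy()
    for _ in range(iters):
        idx = ((Z[:, None, :] - C[None]) ** 2).sum(axis=2).argmin(axis=1)
        for k in range(N):
            if np.any(idx == k):
                C[k] = Z[idx == k].mean(axis=0)
    return C

rng = np.random.default_rng(4)
n, N, sigma = 2000, 4, 0.3
X = rng.uniform(-1, 1, n)
Y = X + sigma * rng.normal(size=n)

xs = np.linspace(-1, 1, 400)          # quadrature grid over supp(X)
def r_hat(y):                         # stand-in for the step-i) estimate
    w = np.exp(-((y[:, None] - xs) ** 2) / (2 * sigma**2))
    return (w * xs).sum(axis=1) / w.sum(axis=1)

Z = r_hat(Y[n // 2:])[:, None]        # transformed second half of the data
codebook = lloyd(Z, N)
# Designed quantizer: Q_n(y) = codevector of `codebook` nearest to r_hat(y).
```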

The following theorem gives an estimate for the difference between the distortion of $Q_n$ and the minimum achievable distortion $D^*$.

Theorem 2: Assume that a source $X$ is bounded as $\|X\| \le K$ a.s. for some $K < \infty$, and let $Y_1,\ldots,Y_n$ be i.i.d. samples of the noisy source. Suppose, furthermore, that the conditional distribution of $Y$ given $X$, and the constant $K$ are known, and that the estimator $\hat r_n$ of $r$ has error

$$\Delta_n = E\|r(Y) - \hat r_n(Y)\|^2$$

and is bounded as $\sup_y \|\hat r_n(y)\| \le K$. Then the $N$-level quantizer $Q_n$ designed in steps i) and ii) above satisfies

$$E\,D(Q_n) \le D^* + c\,\sqrt{\frac{\log n}{n}} + c'\sqrt{\Delta_n}$$

where $D^*$ is the distortion of the optimal $N$-level quantizer for the noisy source problem, and $c$, $c'$ are constants depending only on $K$, $N$, and $d$.


Additive Independent Noise: Before proving the theorem we show how it can be applied to the special (but very important) case when $Y = X + W$, where the noise $W$ is independent of $X$. Theorem 2 implies that if there exists a consistent estimate of $r$, then the quantizer design procedure using this estimate will be consistent, i.e.,

$$\lim_{n\to\infty} E\,D(Q_n) = D^*.$$

Such consistent estimators exist, for example, when $X$ has a density $f$, $W$ has a bounded density $g$, and the characteristic function of $W$ is nonzero almost everywhere. To see this, we use the following lemma (proved in the Appendix).

Lemma 1: Let $Y = X + W$ be a random vector in $\mathbb{R}^d$, where $X$ and $W$ are independent absolutely continuous random variables. Assume that the density $g$ of $W$ is known, and its characteristic function $\psi(t) = E\,e^{it^T W}$ is nonzero for almost all $t \in \mathbb{R}^d$. Assume that i.i.d. copies $Y_1,\ldots,Y_n$ of $Y$ are observed. Then for every density $f$ of $X$ there exists an estimator $f_n(x) = f_n(x; Y_1,\ldots,Y_n)$ of $f$ such that

$$\lim_{n\to\infty} \int |f_n(x) - f(x)|\,dx = 0 \qquad \text{a.s.}$$
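A one-dimensional sketch of the deconvolution idea behind Lemma 1 (our code, following the spirit of Devroye [33] rather than its exact construction): the empirical characteristic function is divided by the known noise characteristic function, damped by a kernel, truncated where the noise characteristic function is small, and inverted by quadrature. The parameter names `h` (smoothing), `a` (noise control), and `T` (frequency truncation) are ours.

```python
import numpy as np

def deconv_density(Y, psi, x_grid, h=0.2, a=1e-2, T=40.0, m=1001):
    """1-D deconvolving density estimate: invert phi_n(t)*H(h t)/psi(t)
    over frequencies where |psi(t)| > a, by direct quadrature."""
    t = np.linspace(-T, T, m)
    dt = t[1] - t[0]
    phi_n = np.exp(1j * np.outer(t, Y)).mean(axis=1)   # empirical char. fn.
    H = np.exp(-0.5 * (h * t) ** 2)                    # Gaussian kernel FT
    psi_t = psi(t)
    integrand = np.where(np.abs(psi_t) > a, phi_n * H / psi_t, 0.0)
    # f_n(x) = (1 / 2 pi) * Re( integral of e^{-itx} * integrand dt )
    f = np.real(np.exp(-1j * np.outer(x_grid, t)) @ integrand) * dt / (2 * np.pi)
    return np.maximum(f, 0.0)                          # clip negative values

# Usage: X uniform on [-1, 1], Gaussian noise with sigma = 0.3.
rng = np.random.default_rng(5)
Yn = rng.uniform(-1, 1, 1000) + 0.3 * rng.normal(size=1000)
xs = np.linspace(-2, 2, 201)
fx = deconv_density(Yn, lambda t: np.exp(-0.5 * (0.3 * t) ** 2), xs)
```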

Take $f_n$ to be the estimator given by Lemma 1. The estimator in the lemma integrates to one, but it may take negative values. Also, even though $f$ has a bounded support, $f_n$ can have an unbounded support since it is obtained by deconvolving a kernel density estimate of the density $h$ of $Y$. Let $B_K = \{x : \|x\| \le K\}$. Then

$$\tilde f_n(x) = \frac{\max(f_n(x), 0)\, I_{\{x \in B_K\}}}{\int_{B_K} \max(f_n(u), 0)\, du}$$

is a probability density with support contained in $B_K$. Moreover, by [31, pp. 12–13] we have

$$\int |\tilde f_n - f| \le 2 \int |f_n - f|$$

so that $\tilde f_n$ is also strongly consistent, and we can actually use the notation $f_n$ instead of $\tilde f_n$. Since

$$h(y) = \int f(x)\, g(y - x)\, dx$$

we have

$$r(y) = \frac{\int x f(x)\, g(y - x)\, dx}{\int f(x)\, g(y - x)\, dx}$$

for all $y$ such that $h(y) > 0$. We define our estimate as

$$\hat r_n(y) = \frac{\int x f_n(x)\, g(y - x)\, dx}{\int f_n(x)\, g(y - x)\, dx}.$$

It is immediate that $\|\hat r_n(y)\| \le K$ since

$$\frac{f_n(x)\, g(y - x)}{\int f_n(u)\, g(y - u)\, du}$$

is a probability density in $x$, with support contained in the convex set $B_K$, and thus $\hat r_n(y)$ is the mean of a distribution concentrated on $B_K$. We also have

$$\sup_y \Big| \int f_n(x)\, g(y - x)\, dx - h(y) \Big| \le \|g\|_\infty \int |f_n - f| \to 0 \qquad \text{a.s.} \qquad (15)$$

where $\|g\|_\infty = \sup_x g(x)$. Thus

$$\sup_y \Big\| \int x f_n(x)\, g(y - x)\, dx - \int x f(x)\, g(y - x)\, dx \Big\| \le K \|g\|_\infty \int |f_n - f| \to 0 \qquad \text{a.s.} \qquad (16)$$

that is, $\hat r_n$ is pointwise strongly consistent: letting $n \to \infty$ we have by (15) and (16) that for all $y$ such that $h(y) > 0$

$$\hat r_n(y) \to r(y) \qquad \text{a.s.}$$

(both integrals are effectively over $B_K$ since both $f_n$ and $f$ vanish outside $B_K$). It follows that

$$\|\hat r_n(y) - r(y)\| \to 0 \qquad \text{a.s.}$$

for almost every $y$ with respect to the distribution of $Y$. Fubini's theorem and the dominated convergence theorem then imply that

$$\lim_{n\to\infty} E\|r(Y) - \hat r_n(Y)\|^2 = 0.$$

The consistency of the design procedure now follows from Theorem 2. Thus we have proved the following.

Corollary 1: Assume the conditions of Theorem 2 and suppose $Y = X + W$, where $W$ is independent of $X$ and has a bounded density whose characteristic function is almost everywhere nonzero. Then there exists a bounded estimator $\hat r_n$ of $r$ such that

$$\lim_{n\to\infty} E\|r(Y) - \hat r_n(Y)\|^2 = 0$$

and the noisy source design procedure is consistent, i.e.,

$$\lim_{n\to\infty} E\,D(Q_n) = D^*.$$

Proof of Theorem 2: Using the same decomposition as in (2), the distortion of $Q_n$ can be written

$$D(Q_n) = E\|X - r(Y)\|^2 + E\|r(Y) - q_n(\hat r_n(Y))\|^2. \qquad (17)$$

Then, by the Cauchy–Schwarz inequality, one obtains

$$E\|r(Y) - q_n(\hat r_n(Y))\|^2 \le E\|\hat r_n(Y) - q_n(\hat r_n(Y))\|^2 + 2\sqrt{\Delta_n\, E\|\hat r_n(Y) - q_n(\hat r_n(Y))\|^2} + \Delta_n \qquad (18)$$


where $\Delta_n = E\|r(Y) - \hat r_n(Y)\|^2$. Recall now that $\hat r_n$ depends only on the samples $Y_1,\ldots,Y_{n/2}$ but is independent of $Y_{n/2+1},\ldots,Y_n$. With this in mind, we introduce an auxiliary $N$-level quantizer $\tilde q_n$ (used only in the analysis) which minimizes the conditional distortion

$$E\big[ \|\hat r_n(Y) - q(\hat r_n(Y))\|^2 \mid Y_1,\ldots,Y_{n/2} \big]$$

over all $N$-level nearest neighbor quantizers $q$. Note that $\tilde q_n$ depends on $Y_1,\ldots,Y_{n/2}$. By definition, $q_n$ is an $N$-level quantizer minimizing the empirical distortion over the samples $\hat r_n(Y_{n/2+1}),\ldots,\hat r_n(Y_n)$ for a given $\hat r_n$. This fact and the independence of $\hat r_n$ and $Y_{n/2+1},\ldots,Y_n$ imply that the conditional probability that the conditional distortion of $q_n$ exceeds that of $\tilde q_n$ by more than a given amount can be upper-bounded using the same technique as in the proof of Theorem 1. In fact, if the channel is made noiseless by substituting the transition probabilities $P(j\,|\,i) = 1$ if $j = i$, and $P(j\,|\,i) = 0$ otherwise, in Theorem 1, then the quantizers there become ordinary nearest neighbor quantizers. Since $\|\hat r_n(y)\| \le K$, for a fixed $\epsilon$, the inequality (11) implies, after replacing $n$ by $n/2$, that for a.e. realization of $Y_1,\ldots,Y_{n/2}$ the above conditional probability obeys the bound of (11). (19)

Since the upper bound is independent of $Y_1,\ldots,Y_{n/2}$, it follows that the same bound holds for the unconditional probability, and one obtains in the same way as in Theorem 1 that

$$E\|\hat r_n(Y) - q_n(\hat r_n(Y))\|^2 \le E\|\hat r_n(Y) - \tilde q_n(\hat r_n(Y))\|^2 + c\,\sqrt{\frac{\log n}{n}} \qquad (20)$$

where $c$ is the constant of Theorem 1.

Now recall that $q^*$ is an optimal nearest neighbor quantizer for $r(Y)$ and that $\tilde q_n$ is an optimal nearest neighbor quantizer for the conditional distribution of $\hat r_n(Y)$ given $Y_1,\ldots,Y_{n/2}$. Thus

$$E\|\hat r_n(Y) - \tilde q_n(\hat r_n(Y))\|^2 \le E\|\hat r_n(Y) - q^*(\hat r_n(Y))\|^2 \le E\big( \|r(Y) - q^*(r(Y))\| + \|r(Y) - \hat r_n(Y)\| \big)^2$$

where the first inequality holds because $\tilde q_n$ is optimal for the distribution of $\hat r_n(Y)$ given $Y_1,\ldots,Y_{n/2}$, and the second inequality follows because $q^*$ is a nearest neighbor quantizer, so that $\|\hat r_n(Y) - q^*(\hat r_n(Y))\| \le \|\hat r_n(Y) - q^*(r(Y))\|$. Therefore,

$$E\|\hat r_n(Y) - \tilde q_n(\hat r_n(Y))\|^2 \le E\|r(Y) - q^*(r(Y))\|^2 + 6K\sqrt{\Delta_n}.$$

In the last inequality the uniform boundedness of $r$ and $\hat r_n$ (by $K$), the Cauchy–Schwarz inequality, and the triangle inequality were used. It follows that

$$E\|\hat r_n(Y) - q_n(\hat r_n(Y))\|^2 \le E\|r(Y) - q^*(r(Y))\|^2 + c\,\sqrt{\frac{\log n}{n}} + 6K\sqrt{\Delta_n}.$$

Combining this with (18) and (20) gives

$$E\|r(Y) - q_n(\hat r_n(Y))\|^2 \le E\|r(Y) - q^*(r(Y))\|^2 + c\,\sqrt{\frac{\log n}{n}} + c'\sqrt{\Delta_n}$$

and since $E\|X - r(Y)\|^2 + E\|r(Y) - q^*(r(Y))\|^2 = D^*$ by (4), one finally gets from (17) that

$$E\,D(Q_n) \le D^* + c\,\sqrt{\frac{\log n}{n}} + c'\sqrt{\Delta_n}$$

and the proof is complete.

V. EMPIRICAL DESIGN FROM CLEAN SOURCE SAMPLES

So far we have assumed that the training data consisted of samples from the noisy source. In practice, it is often the case that there might be samples available from the clean source. In what follows this situation is explored and the consistency of empirical design is proved. Moreover, it will be shown that, as opposed to the case of empirical design from noisy samples, in this case the convergence rate $O(\sqrt{\log n / n})$ is easily achievable.

Assume that we are given as training data the i.i.d. samples $X_1,\ldots,X_n$ drawn from the distribution of the clean source $X$, and that the conditional distribution of $Y$ given $X$ is known. For the sake of concreteness suppose that $Y$ has a conditional density $f(y|x)$ given $X = x$. Then $r(y)$ is estimated again using the first half of the samples and $f(y|x)$. The empirical design of Theorem 2 can be used with the modification that now $q_n$ is defined as

$$q_n = \arg\min_q\; \frac{2}{n}\sum_{i=n/2+1}^{n} \int \|\hat r_n(y) - q(\hat r_n(y))\|^2\, f(y\,|\,X_i)\, dy \qquad (21)$$

where the minimization is over all $N$-level nearest neighbor vector quantizers whose codepoints lie inside $B_K$. The following result states that the procedure is consistent in general, and if $f(y|x)$ satisfies some additional conditions, then we can obtain the convergence rate $O(\sqrt{\log n / n})$.

Theorem 3: Assume that source $X$ is bounded as $\|X\| \le K$ a.s., and let $X_1,\ldots,X_n$ be i.i.d. copies of $X$. Suppose that $K$ and the conditional density $f(y|x)$ of $Y$, given $X = x$, are known. Then the quantizer $Q_n(y) = q_n(\hat r_n(y))$ is consistent, i.e.,

$$\lim_{n\to\infty} E\,D(Q_n) = D^*$$

where the estimator is

$$\hat r_n(y) = \frac{\sum_{i=1}^{n/2} X_i\, f(y\,|\,X_i)}{\sum_{i=1}^{n/2} f(y\,|\,X_i)}$$

and

$$q_n = \arg\min_q\; \frac{2}{n}\sum_{i=n/2+1}^{n} \int \|\hat r_n(y) - q(\hat r_n(y))\|^2\, f(y\,|\,X_i)\, dy.$$

If, additionally, $f(y|x)$ is uniformly bounded and $Y$ is almost surely bounded, then

$$E\,D(Q_n) \le D^* + c\,\sqrt{\frac{\log n}{n}}$$

where $c$ depends only on $K$, $N$, and $d$.

Proof: To prove the consistency of $Q_n$, we first show that $\hat r_n$ is consistent. Introduce the notation

$$a_n(y) = \frac{2}{n}\sum_{i=1}^{n/2} X_i\, f(y\,|\,X_i)$$

and

$$b_n(y) = \frac{2}{n}\sum_{i=1}^{n/2} f(y\,|\,X_i)$$

(note that $E\,b_n(y) = \int f(y|x)\,\mu(dx) = h(y)$ is the density of $Y$). Then by the strong law of large numbers, for every $y$ we have

$$a_n(y) \to E[X f(y\,|\,X)] = r(y)\,h(y)$$

and

$$b_n(y) \to h(y) \qquad \text{a.s.}$$

Thus for all $y$ such that $h(y) > 0$, we obtain

$$\hat r_n(y) = \frac{a_n(y)}{b_n(y)} \to r(y) \qquad \text{a.s.}$$

Since $h(Y) > 0$ a.s., it follows that $\hat r_n(y) \to r(y)$ a.s. for almost all $y$ with respect to the distribution of $Y$. Note also that $\hat r_n(y)$ is a convex combination of the $X_i$, so that $\|\hat r_n(y)\| \le K$. Then the dominated convergence theorem implies that $\hat r_n$ is consistent, i.e.,

$$\lim_{n\to\infty} E\|r(Y) - \hat r_n(Y)\|^2 = 0. \qquad (22)$$
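The estimator $\hat r_n$ of Theorem 3 is straightforward to implement; here is a minimal sketch (ours), assuming a Gaussian channel density `f_cond` (any normalizing constant cancels in the ratio, so it may be omitted).

```python
import numpy as np

def r_hat_from_clean(X_half, f_cond):
    """Estimate r(y) = E[X|Y=y] from clean samples: the numerator and
    denominator of r(y) = E[X f(y|X)] / f_Y(y) are replaced by sample
    averages over the first half of the clean training data."""
    def r_hat(y):
        w = np.array([f_cond(y, x) for x in X_half])       # f(y | X_i)
        return (w[:, None] * X_half).sum(axis=0) / w.sum()
    return r_hat

sigma = 0.3
f_cond = lambda y, x: np.exp(-np.sum((y - x) ** 2) / (2 * sigma**2))
rng = np.random.default_rng(6)
X = rng.uniform(-1, 1, size=(500, 2))
r_hat = r_hat_from_clean(X[:250], f_cond)
print(r_hat(np.array([0.2, -0.1])))
```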

To finish the consistency part of the theorem, we copy the proof of Theorem 2 after redefining the training data as the first half $X_1,\ldots,X_{n/2}$ (used to form $\hat r_n$) and the second half $X_{n/2+1},\ldots,X_n$ (used to form $q_n$). Clearly, one need only check that the $q_n$ defined in (21) satisfies (19), i.e., that conditioned on the first half of the data, the empirical minimization behind (21) concentrates around its conditional optimum at the rate given by (11). (23)

This is seen by noticing that according to (21), the empirically optimal $q_n$ has to minimize the functional

$$\frac{2}{n}\sum_{i=n/2+1}^{n} g_q(X_i)$$

where $g_q$ is defined as

$$g_q(x) = \int \|\hat r_n(y) - q(\hat r_n(y))\|^2\, f(y\,|\,x)\, dy = E\big[ \|\hat r_n(Y) - q(\hat r_n(Y))\|^2 \mid X = x \big].$$

Let $q$ and $q'$ be $N$-level nearest neighbor vector quantizers whose codevectors $y_1,\ldots,y_N$ and $y'_1,\ldots,y'_N$ lie inside $B_K$ and satisfy $\|y_j - y'_j\| \le \delta$ for all $j$. Since $\|\hat r_n(y)\| \le K$, the nearest neighbor property implies that

$$\big| \|\hat r_n(y) - q(\hat r_n(y))\|^2 - \|\hat r_n(y) - q'(\hat r_n(y))\|^2 \big| \le 4K\delta + \delta^2$$

and therefore,

$$\sup_x |g_q(x) - g_{q'}(x)| \le 4K\delta + \delta^2.$$

Thus for fixed $\hat r_n$, the family of functions $g_q$ parameterized by $q$ has the same $\epsilon$-covering as between (10) and (11). It follows as in the proof of Theorem 1 that $q_n$ of (21) satisfies (23). The rest of the proof is identical to that of Theorem 2, and we obtain that

$$E\,D(Q_n) \le D^* + c\,\sqrt{\frac{\log n}{n}} + c'\sqrt{E\|r(Y) - \hat r_n(Y)\|^2} \qquad (24)$$

where $c$ and $c'$ are the constants of Theorem 2. Since $E\|r(Y) - \hat r_n(Y)\|^2 \to 0$ by (22), the consistency part of the theorem is proved.

To obtain the convergence rate it suffices to prove that

$$E\|r(Y) - \hat r_n(Y)\|^2 \le \frac{C}{n} \qquad (25)$$

for some constant $C$, since the boundedness of $r$ and $\hat r_n$ implies that all the error terms contributed by the estimation of $r$ are of order $\sqrt{E\|r(Y) - \hat r_n(Y)\|^2}$, and thus of order $O(1/\sqrt{n})$. The term $\sqrt{\Delta_n}$ in (24) comes from the upper bound on the cross terms in the proof of Theorem 2, and can be replaced by $\sqrt{E\|r(Y) - \hat r_n(Y)\|^2}$. Substituting this and (25) into (24) gives the stated convergence rate.


Finally, the estimate (25) is proved. For all $y$ such that $h(y) > 0$ we have

$$\hat r_n(y) - r(y) = \frac{a_n(y) - r(y)\,h(y)}{b_n(y)} + r(y)\,\frac{h(y) - b_n(y)}{b_n(y)}.$$

The expectation of the first term can be upper-bounded as

$$E\|a_n(y) - r(y)\,h(y)\|^2 = \frac{2}{n}\,\mathrm{Var}\big( X f(y\,|\,X) \big) \le \frac{C_1}{n}$$

for a constant $C_1$, where the first equality follows from the fact that $\|X\| \le K$ a.s., and the last inequality holds because $f(y|x)$ is uniformly bounded. For the second term, we similarly have

$$E\big( h(y) - b_n(y) \big)^2 = \frac{2}{n}\,\mathrm{Var}\big( f(y\,|\,X) \big) \le \frac{C_2}{n}$$

for some constant $C_2$. By the assumption on the distribution of $Y$, $h(y) = 0$ outside some compact set, so that integrating these bounds over that set yields

$$E\|r(Y) - \hat r_n(Y)\|^2 \le \frac{C}{n}$$

for a constant $C$, which proves (25).

VI. CONCLUSION

We have investigated the problem of empirical vector quantizer design for noisy channels or noisy sources. The notion of empirical distortion minimization was suitably defined for both cases, and proofs of consistency of the methods were given. For the noisy channel problem it was shown that the average squared distortion of an optimal vector quantizer designed from observing clean i.i.d. training vectors converges, in expectation, as the training set size grows, to the minimum possible mean-squared error obtainable for quantizing the clean source and transmitting across a discrete memoryless noisy channel. The convergence rate $O(\sqrt{\log n / n})$ was also obtained. The comparison of this rate with that obtained in [9] for empirical design for ordinary vector quantizers shows that noisy channel vector quantizer design is not a harder

problem from a statistical viewpoint. Consistency of an empirical design method for sources corrupted by noise was also proved under some regularity conditions. Determining a good convergence rate is an open problem for the case when only noisy training samples are available. The estimation problem involved in the design indicates that, in general, this problem is significantly harder than ordinary vector quantizer design. When training samples from the clean source are available, we can obtain the same convergence rate as for the standard vector quantizer design problem or for the noisy channel problem under mild conditions on the noise distribution.

The method of empirical distortion minimization (searching for a quantizer globally optimal over the training samples) is computationally prohibitive in practice. It is therefore of practical significance to carry out analyses similar to what we presented here for suboptimal, but computationally feasible, methods of design. Such an analysis of consistency was given for the generalized Lloyd–Max algorithm in ordinary vector quantizer design by Sabin and Gray [11]. An interesting area of future research would be to provide convergence rates for suboptimal algorithms for ordinary, as well as noisy channel or noisy source, vector quantizer design.

APPENDIX

PROOF OF LEMMA 1

The estimate with the required property is a $d$-dimensional extension of the estimator proposed by Devroye [33]. The proof is based on [33], where convergence in expectation was proved. First some notation is introduced. $\phi(t) = E\,e^{it^T X}$ and $\psi(t) = E\,e^{it^T W}$ are the characteristic functions of $X$ and $W$, respectively, and the empirical characteristic function of the data is denoted by

$$\phi_n(t) = \frac{1}{n}\sum_{j=1}^{n} e^{it^T Y_j}.$$

The estimator uses a kernel function $k \in L_1 \cap L_2$ with $\int k = 1$, such that its Fourier transform $\kappa$ satisfies $0 \le \kappa \le 1$, $\kappa(t) = 1$ if $t \in B_{0,c}$, and $\kappa(t) = 0$ if $t \notin B_{0,c'}$, for some constants $0 < c < c'$, where $B_{0,r}$ denotes the $d$-dimensional ball of radius $r$ centered at the origin. We also define a smoothing parameter $s > 0$, a tail parameter $T > 0$, and a noise-control parameter $a > 0$. All of these parameters may change with the sample size $n$. Introduce the set $A = \{t : |\psi(t)| > a\}$, and let $\mathrm{Re}(z)$ denote the real part of the complex number $z$. Our estimate is defined as follows:

$$f_n(x) = \begin{cases} \mathrm{Re}\left( \dfrac{1}{(2\pi)^d}\displaystyle\int_{A} e^{-it^T x}\, \dfrac{\phi_n(t)\,\kappa(st)}{\psi(t)}\, dt \right), & \text{if } \|x\| \le T \\[6pt] 0, & \text{if } \|x\| > T. \end{cases}$$

We claim that this estimate satisfies the required consistency property if the parameters vary with $n$ as follows:

$$T \to \infty \qquad (26)$$

$$s \to 0 \qquad (27)$$

$$\lambda\big( B_{0,c'/s} \setminus A \big) \to 0 \qquad (28)$$

$$\frac{\lambda(B_{0,T})\, \lambda\big( A \cap B_{0,c'/s} \big)\, \log n}{n\, a^2} \to 0 \qquad (29)$$

where $\lambda$ denotes the Lebesgue measure.

To see why the estimate is consistent, we write the $L_1$ error $\int |f_n - f|$ as a sum of four terms: the tail term due to truncating the estimate outside $B_{0,T}$; the smoothing bias introduced by the kernel; the term due to the frequencies discarded outside $A$; and the stochastic term measuring the deviation of $\phi_n$ from its mean $\phi\psi$. It is now shown that each of the four terms tends to zero as $n \to \infty$, almost surely.

Clearly the tail term tends to zero by (26). Since $s \to 0$ by (27), we have by the well-known “approximation of the identity” property of the family $k_s(x) = s^{-d} k(x/s)$ that the smoothing bias tends to zero (see, e.g., [34, Theorem 9.6]). Also, the term due to the discarded frequencies is bounded by a constant multiple of $\lambda(B_{0,c'/s} \setminus A)$, which converges to zero by (28). To show that the stochastic term tends to zero, we bound its expectation, via the Cauchy–Schwarz and Parseval identities, by a constant multiple of

$$\sqrt{\frac{\lambda(B_{0,T})\, \lambda\big( A \cap B_{0,c'/s} \big)}{n\, a^2}}$$

(note that $\lambda(B_{0,T}) = V_d T^d$, where $V_d$ denotes the volume of the unit ball in $\mathbb{R}^d$), which converges to zero by (29). Summarizing, we have proved that for every density $f$

$$\lim_{n\to\infty} E \int |f_n - f| = 0.$$

To prove convergence with probability one, recall a powerful inequality of McDiarmid [35] (see also Devroye [36]). According to this inequality, if $F : (\mathbb{R}^d)^n \to \mathbb{R}$ is an arbitrary function satisfying the boundedness condition

$$\sup_{y_1,\ldots,y_n,\, y'_i} \big| F(y_1,\ldots,y_n) - F(y_1,\ldots,y_{i-1},\, y'_i,\, y_{i+1},\ldots,y_n) \big| \le c_i, \qquad i = 1,\ldots,n$$

then for any independent random variables $Y_1,\ldots,Y_n$ and $t > 0$

$$\Pr\big\{ \big| F(Y_1,\ldots,Y_n) - E\,F(Y_1,\ldots,Y_n) \big| > t \big\} \le 2\, e^{-2t^2 / \sum_{i=1}^{n} c_i^2}.$$

We apply this inequality to the $L_1$-error

$$\int |f_n - f|.$$

It suffices to obtain a good upper bound on the variability of the $L_1$-error if we replace $Y_i$ by an arbitrary $y'_i$. Denote the modified estimate by $f'_n$. Then (see (30) below) each $c_i$ is bounded by a constant multiple of $\frac{1}{na}\sqrt{\lambda(B_{0,T})\,\lambda(A \cap B_{0,c'/s})}$. Therefore, McDiarmid's inequality implies that

$$\Pr\Big\{ \Big| \int |f_n - f| - E\int|f_n - f| \Big| > t \Big\} \le 2\exp\left( - \frac{\mathrm{const.}\; n\, t^2\, a^2}{\lambda(B_{0,T})\, \lambda(A \cap B_{0,c'/s})} \right).$$

The upper bound is summable in $n$ for every $t > 0$ if the exponent grows faster than $\log n$, which is satisfied by (29). Thus by the Borel–Cantelli lemma,

$$\int |f_n - f| - E\int |f_n - f| \to 0$$

with probability one.


$$\int |f_n - f'_n| \le \sqrt{\lambda(B_{0,T})}\, \Big( \int |f_n - f'_n|^2 \Big)^{1/2} \qquad \text{(by the Cauchy–Schwarz inequality)}$$

$$\le \sqrt{\lambda(B_{0,T})}\, \left( \frac{1}{(2\pi)^d} \int_A \big| \phi_n(t) - \phi'_n(t) \big|^2\, \frac{\kappa(st)^2}{|\psi(t)|^2}\, dt \right)^{1/2} \qquad \text{(by Parseval's identity)}$$

$$\le \mathrm{const.}\; \frac{1}{n\, a}\, \sqrt{\lambda(B_{0,T})\, \lambda\big( A \cap B_{0,c'/s} \big)} \qquad (30)$$

since $|\phi_n(t) - \phi'_n(t)| \le 2/n$ pointwise, $\kappa(st)$ vanishes outside $B_{0,c'/s}$, and $|\psi(t)| > a$ on $A$.

To complete the proof of the lemma, it suffices to demonstrate the existence of parameters of the estimate satisfying the conditions (26)–(29).

For each positive integer $k$, set $s_k = 1/k$ and choose $a_k > 0$ so small that

$$\lambda\big( \{ t \in B_{0,c'k} : |\psi(t)| \le a_k \} \big) \le \frac{1}{k}.$$

To see that such an $a_k$ exists, note that since $\psi \ne 0$ almost everywhere, the continuity of the Lebesgue measure implies that for any fixed $r > 0$, $\lambda(\{ t \in B_{0,r} : |\psi(t)| \le a \}) \to 0$ as $a \to 0$. Let $(n_k)$ be a strictly increasing sequence of integers, chosen to grow so fast that $n_k a_k^2 \ge k\, \lambda(B_{0,c'k})\, \log n_k$. For all $n_k \le n < n_{k+1}$, define $s_n$ and $a_n$ to be the same as their values for $n_k$. Then as $n \to \infty$, $s_n \to 0$ and $\lambda(B_{0,c'/s_n} \setminus A) \le 1/k \to 0$, and therefore (27) and (28) are satisfied. Finally, define $T_n \to \infty$ so slowly that

$$\lambda(B_{0,T_n}) \le \frac{n\, a_n^2}{k\, \lambda\big( A \cap B_{0,c'/s_n} \big)\, \log n} \qquad \text{for } n_k \le n < n_{k+1}$$

which is possible by the choice of $(n_k)$. Then (26) is satisfied by construction, and the displayed inequality yields

$$\frac{\lambda(B_{0,T_n})\, \lambda\big( A \cap B_{0,c'/s_n} \big)\, \log n}{n\, a_n^2} \le \frac{1}{k} \to 0$$

which implies (29).

ACKNOWLEDGMENT

The authors wish to thank L. Györfi for helpful discussions and J. Fan for pointing out some relevant references in the statistical literature.

REFERENCES

[1] S. P. Lloyd, “Least squares quantization in PCM,” unpublished memorandum, Bell Labs., 1957; reprinted in IEEE Trans. Inform. Theory, vol. IT-28, no. 2, pp. 129–137, Mar. 1982.

[2] J. Max, “Quantizing for minimum distortion,” IEEE Trans. Inform. Theory, vol. IT-6, pp. 7–12, Mar. 1960.

[3] Y. Linde, A. Buzo, and R. M. Gray, “An algorithm for vector quantizer design,” IEEE Trans. Commun., vol. COM-28, pp. 84–95, Jan. 1980.

[4] R. M. Gray, J. C. Kieffer, and Y. Linde, “Locally optimal block quantizer design,” Inform. Contr., vol. 45, pp. 178–198, 1980.

[5] D. Pollard, “Quantization and the method of k-means,” IEEE Trans. Inform. Theory, vol. IT-28, no. 2, pp. 199–205, Mar. 1982.

[6] E. A. Abaya and G. L. Wise, “Convergence of vector quantizers with applications to optimal quantization,” SIAM J. Appl. Math., vol. 44, pp. 183–189, 1984.

[7] M. J. Sabin, “Global convergence and empirical consistency of the generalized Lloyd algorithm,” Ph.D. dissertation, Stanford Univ., Stanford, CA, 1984.

[8] D. Pollard, “A central limit theorem for k-means clustering,” Ann. Prob., vol. 10, no. 4, pp. 919–926, 1982.

[9] T. Linder, G. Lugosi, and K. Zeger, “Rates of convergence in the source coding theorem, in empirical quantizer design, and in universal lossy source coding,” IEEE Trans. Inform. Theory, vol. 40, no. 6, pp. 1728–1740, Nov. 1994.

[10] P. A. Chou, “The distortion of vector quantizers trained on n vectors decreases to the optimum as O_p(1/n),” in Proc. IEEE Int. Symp. on Information Theory (Trondheim, Norway, 1994).

[11] M. J. Sabin and R. M. Gray, “Global convergence and empirical consistency of the generalized Lloyd algorithm,” IEEE Trans. Inform. Theory, vol. IT-32, no. 2, pp. 148–155, Mar. 1986.

[12] N. Merhav and J. Ziv, “On the amount of side information required for lossy data compression,” preprint, 1995.

[13] C. E. Shannon, “Coding theorems for a discrete source with a fidelity criterion,” in IRE Nat. Conv. Rec., pt. 4, 1959, pp. 138–143.


[14] J. Dunham and R. M. Gray, “Joint source and noisy channel trellis encoding,” IEEE Trans. Inform. Theory, vol. IT-27, no. 4, pp. 516–519, July 1981.

[15] K. Zeger and V. Manzella, “Asymptotic bounds on optimal noisy channel quantization via random coding,” IEEE Trans. Inform. Theory, vol. 40, no. 6, pp. 1926–1938, Nov. 1994.

[16] S. McLaughlin and D. Neuhoff, “Asymptotic bounds in source-channel coding,” in Proc. IEEE Int. Symp. on Information Theory (Budapest, Hungary, 1991).

[17] N. Farvardin and V. Vaishampayan, “Optimal quantizer design for noisy channels: An approach to combined source-channel coding,” IEEE Trans. Inform. Theory, vol. IT-33, no. 6, pp. 827–838, Nov. 1987.

[18] E. Ayanoglu and R. M. Gray, “The design of joint source and channel trellis waveform coders,” IEEE Trans. Inform. Theory, vol. IT-33, no. 6, pp. 855–865, Nov. 1987.

[19] R. L. Dobrushin and B. S. Tsybakov, “Information transmission with additional noise,” IRE Trans. Inform. Theory, vol. IT-8, pp. 293–304, 1962.

[20] T. Berger, Rate Distortion Theory. Englewood Cliffs, NJ: Prentice-Hall, 1971.

[21] T. Fine, “Optimum mean-square quantization of a noisy input,” IEEE Trans. Inform. Theory, vol. IT-11, pp. 293–294, Apr. 1965.

[22] D. J. Sakrison, “Source encoding in the presence of random disturbance,” IEEE Trans. Inform. Theory, vol. IT-14, pp. 165–167, Jan. 1968.

[23] J. K. Wolf and J. Ziv, “Transmission of noisy information to a noisy receiver with minimum distortion,” IEEE Trans. Inform. Theory, vol. IT-16, pp. 406–411, July 1970.

[24] E. Ayanoglu, “On optimal quantization of noisy sources,” IEEE Trans. Inform. Theory, vol. 36, no. 6, pp. 1450–1452, Nov. 1990.

[25] Y. Ephraim and R. M. Gray, “A unified approach for encoding clean and noisy sources by means of waveform and autoregressive model vector quantization,” IEEE Trans. Inform. Theory, vol. 34, no. 4, pp. 826–834, July 1988.

[26] A. Rao, D. Miller, K. Rose, and A. Gersho, “Generalized vector quantization: Jointly optimal quantization and estimation,” in Proc. IEEE Int. Symp. on Information Theory (Whistler, BC, Canada, Sept. 1995).

[27] H. Kumazawa, M. Kasahara, and T. Namekawa, “A construction of vector quantizers for noisy channels,” Electron. and Eng. in Japan, vol. 67-B, no. 4, pp. 39–47, 1984.

[28] K. Zeger and A. Gersho, “Vector quantizer design for memoryless noisy channels,” in Proc. IEEE Int. Conf. on Communications, 1988, pp. 1593–1597.

[29] A. R. Barron, “Complexity regularization with application to artificial neural networks,” in Nonparametric Functional Estimation and Related Topics, G. Roussas, Ed. Dordrecht, The Netherlands: Kluwer, 1991, NATO ASI Ser., pp. 561–576.

[30] W. Hoeffding, “Probability inequalities for sums of bounded random variables,” J. Amer. Statist. Assoc., vol. 58, pp. 13–30, 1963.

[31] L. Devroye, A Course in Density Estimation. Boston, MA: Birkhäuser, 1987.

[32] G. T. Hwang and L. A. Stefanski, “Monotonicity of regression functions in structural measurement error models,” Stat. Prob. Lett., vol. 20, pp. 113–116, 1993.

[33] L. Devroye, “Consistent deconvolution in density estimation,” Can. J. Stat., vol. 17, no. 2, pp. 235–239, 1989.

[34] R. L. Wheeden and A. Zygmund, Measure and Integral. New York: Dekker, 1977.

[35] C. McDiarmid, “On the method of bounded differences,” in Surv. in Combinatorics 1989. Cambridge, U.K.: Cambridge Univ. Press, 1989, pp. 148–188.

[36] L. Devroye, “Exponential inequalities in nonparametric estimation,” in Nonparametric Functional Estimation and Related Topics, G. Roussas, Ed. Dordrecht, The Netherlands: Kluwer, 1991, NATO ASI Ser., pp. 31–44.