M.I.T. Media Laboratory Perceptual Computing Section Technical Report No. 351
Also appears in the IEEE Trans. on Image Processing, Vol. 6, No. 2, February 1997, pp. 268-284

Cluster-based probability model and its application to image and texture processing

Kris Popat and Rosalind W. Picard
Rm E15-383, The MIT Media Laboratory, 20 Ames St., Cambridge, MA
[email protected], [email protected]   Tel: (617) 253-6271   Fax: (617) 253-8874
EDICS Number: IP 1.9 (with applications to 1.1, 1.4, and 1.5)
(This work was supported in part by HP Labs and NEC Corp.)

Abstract

We develop, analyze, and apply a specific form of mixture modeling for density estimation, within the context of image and texture processing. The technique captures much of the higher-order, nonlinear statistical relationships present among vector elements by combining aspects of kernel estimation and cluster analysis. Experimental results are presented in the following applications: image restoration, image and texture compression, and texture classification.

1  Introduction

In many signal processing tasks, uncertainty plays a fundamental role. Examples of such tasks are compression, detection, estimation, classification, and restoration; in all of these, the future inputs are not known perfectly at the time of system design, but instead must be characterized only in terms of their "typical" or "likely" behavior, by means of some probabilistic model. Every such system has a probabilistic model, be it explicit or implicit. Often, the level of performance achieved by such a system depends strongly on the accuracy of the probabilistic model it employs.

This paper presents a method for finding an explicit probability distribution estimate, and demonstrates its application to a variety of image processing problems. In particular, the focus is on obtaining accurate estimates of conditional distributions, where the number of conditioning variables is relatively large (on the order of ten). If conditional distributions are estimated directly, then care must be taken to ensure consistency [1]. In this work, we begin by estimating the joint distribution; in this way, we avoid consistency problems. Once the joint distribution has been estimated, the conditional can be computed by a simple normalization.

As mentioned, the goal here is obtaining a high-dimensional joint probability distribution, i.e., on the order of d = 10 joint variables. Traditional attempts usually stop at d = 3 variables or less. Major obstacles exist when estimating high-d distributions [2, 3]. Foremost is the exponential growth of the amount of data required to obtain an estimate of prescribed quality as d is increased. Large regions in the d-dimensional space are likely to be devoid of observations. In the discrete case, the size of the vector alphabet is usually astronomical: for example, a 3 × 3 neighborhood of 8-bit pixels can assume 2^72 distinct values. Consequently, processing in the vector alphabet must be bypassed altogether. The situation is no better when the conditional distribution formulation is used. Although the variable is then one-dimensional, the number of conditioning states replaces the vector alphabet size as the astronomical quantity, and the same situation follows. These obstacles are consequences of what is commonly referred to as "the curse of dimensionality," a term whose first use D.W. Scott [4] attributes to R.E. Bellman, in describing the exponential growth with dimension of the complexity of combinatorial optimization.

A term that is more specific to the density estimation problem, coined by Scott and Thompson [3], is the empty space phenomenon.

The estimation technique described in this paper combines two weapons in combating the empty space phenomenon: kernel estimation and cluster analysis. Kernel estimation (reviewed in Section 1.2) provides a means of interpolating probability to fill in empty regions; it is a means of generalizing the observed data. However, kernel estimation is poor at modeling rare events (tail regions), and in high dimensions, almost all events are rare. On the other hand, cluster analysis identifies critical regions of space that need to be covered by kernels, and is a means of summarizing the observed data. The combination of these two techniques results in an economized, heterogeneous kernel estimate that works well in both mode and tail regions.

The cluster-based probability model is a type of mixture model, and mixture models are not new. Their estimation and use dates back at least to the 1894 work of Karl Pearson; see [5] or [6] for a survey. Mixture models have customarily been used in situations calling for unsupervised learning. Specifically, a mixture model naturally arises when an observation x is believed to obey probability law p_c(x | ω_c) with probability P(ω_c), where ω_c is one of several "states of nature," or classes [7]. Alternatively, a mixture model may be viewed as a means of estimating an arbitrary probability law, even in situations where there is no reason to believe that the true probability law is a mixture [5, p. 118ff]. The cluster-based probability model is viewed in this way.

Mixture models have received considerable attention from the speech processing community over the past two decades [8]. They are also a topic of current interest among researchers in the field of artificial neural networks (ANNs) [9], where the emphasis has been on estimating the system output values themselves, rather than on estimating predictive probability distributions for those values (see Section 6.2). However, the use of mixture models, and more generally, the application of high-dimensional probabilistic modeling, are subjects which are rarely dealt with in the image processing literature. The current paper develops, analyzes, and applies a particular type of mixture for high-dimensional probabilistic modeling, within the context of image and texture processing.

1.1  Terms and notation

Let x = [x_1, ..., x_d] be a random vector, and let X and X_i denote particular values of x and x_i respectively. It is assumed that the d elements of x share the same range of values 𝒳 ⊆ R. In the continuous case, 𝒳 is assumed to be a bounded interval of the real line. In the discrete case, 𝒳 is assumed to be a set of K real numbers on a bounded interval. The set of possible values of x is denoted 𝒳^d ⊆ R^d; it is the d-fold Cartesian product of 𝒳 with itself.

Note that, in the discrete case, the number of possible values for 𝒳^d is K^d.

It is assumed that successive realizations of x are independent and that they obey one and the same probability law. (In practice, vectors of image features are generally not independent, so the assumption of independence is usually violated to some degree. However, if the vectors are formed in such a way that the intravector dependence is much stronger than the intervector dependence, then a system that fails to exploit the latter may still perform well.) In the continuous case, x is governed by a probability density function (PDF) f(x), which satisfies

∫_V f(X) dX = Prob{x ∈ V}     (1.1)

for all measurable V ⊆ 𝒳^d. In addition to the usual requirements of nonnegativity and integrating to one, it is assumed throughout that f(x) is continuous and bounded. In the discrete case, x is governed by a probability mass function (PMF) p(x) defined as

p(X) = Prob{x = X},   X ∈ 𝒳^d.     (1.2)

The notation f(x_1, ..., x_d) will be used interchangeably with f(x); likewise for p(x_1, ..., x_d) and p(x). The main use for f and p in the applications will be in providing the conditional, one-dimensional probability laws

f(x_d | X_1, ..., X_{d-1}) = f(X_1, X_2, ..., x_d) / f(X_1, ..., X_{d-1})
and     (1.3)
p(x_d | X_1, ..., X_{d-1}) = p(X_1, X_2, ..., x_d) / p(X_1, ..., X_{d-1}).

The probability law is to be estimated from a learning sample L of size N (also called the training data), where

L = {X_n}_{n=1}^N.

The i-th element of the n-th sample vector is denoted X_{n,i}, and an estimate of f or p based on L is denoted ~f or ~p.

The quality of an estimate can be measured in a variety of ways. The most commonly used criteria are the L1 and L2 norms [2, 10]. A criterion which is relevant in compression and classification applications is the relative entropy, defined as

D(f || ~f) = ∫_{𝒳^d} f(X) ln [ f(X) / ~f(X) ] dX
and     (1.4)
D(p || ~p) = Σ_{X ∈ 𝒳^d} p(X) log2 [ p(X) / ~p(X) ]

in the continuous and discrete case, respectively. The relative entropy is directly related to efficiency in compression and to error rate in classification [11].

A partition 𝒰 of 𝒳^d is a collection of M cells 𝒰 = {U_m ⊆ 𝒳^d}_{m=1}^M such that

∪_{m=1}^M U_m = 𝒳^d   and   U_m ∩ U_{m'} = ∅ for m ≠ m'.

Given a set of vectors Y_1, ..., Y_M in 𝒳^d, a nearest-neighbor partition 𝒰(Y_1, ..., Y_M) is obtained by including in cell U_m those X in 𝒳^d which are closer to Y_m than to every other cell. The Euclidean norm is assumed.
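In the discrete case, the normalization in (1.3) amounts to slicing the estimated joint PMF at the observed conditioning values and renormalizing over x_d. A minimal NumPy sketch (illustrative only; the function name and the array layout, with the joint stored as a d-dimensional array indexed by alphabet position, are assumptions of this illustration rather than anything specified in the paper):

    import numpy as np

    def conditional_from_joint(joint, conditioning_indices):
        # Slice the d-dimensional joint PMF at the observed values X_1, ..., X_{d-1}
        # (given as indices into the alphabet) and renormalize over x_d, as in (1.3).
        slice_over_xd = joint[tuple(conditioning_indices)]      # 1-D array over x_d
        total = slice_over_xd.sum()
        if total == 0:
            raise ValueError("conditioning state has zero estimated probability")
        return slice_over_xd / total

    # Tiny example with d = 2 and K = 3 values per element:
    joint = np.array([[0.10, 0.05, 0.05],
                      [0.05, 0.30, 0.15],
                      [0.05, 0.10, 0.15]])
    print(conditional_from_joint(joint, [1]))    # conditional PMF of x_2 given x_1 = value 1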

A pixel neighborhood N is a collection of (row, column) index pairs which specify the locations of conditioning pixels relative to a given current pixel location. A neighborhood is causal if it includes only pixel locations that precede the current location in raster scan order. In this paper, the vector x is formed by taking the conditioning pixels as the first d-1 elements, and appending the current pixel as the d-th element. The neighborhoods used in this paper are shown in Figure 1.

1.2  Histograms and kernel estimates

This section lays the groundwork for the cluster-based estimation technique by briefly reviewing the two most commonly used nonparametric PDF/PMF estimators: histograms and kernel estimates.

For discrete x, the normalized histogram ~p_H(x) of L is defined as

~p_H(X) = N(X) / N,     (1.5)

where N(X), termed a bin, is the number of times that X appears in L. The histogram is the maximum-likelihood estimator of p, which implies that it is asymptotically unbiased (as N → ∞) and consistent [12]. However, in practice usually N ≪ K^d, so that asymptotic behavior is not reached. In fact, typically ~p_H(X) = 0 for all but a small fraction of 𝒳^d, even in regions of relatively high probability. Thus, the empty space phenomenon becomes an empty bin problem.

The learning sample is not only well represented by the histogram, but it is too well represented: the histogram overfits the learning sample. Yet in overfitting the learning sample, the histogram serves as a relatively compact summary of it. What the histogram in the discrete case lacks is not the ability to summarize the learning data, but to generalize it. Both properties are indispensable when working in higher dimensions.

Generalizing means inferring probabilities of previously unseen vectors from those in the learning sample. This requires that an assumption be made about how the vectors relate. A smoothness assumption about the probability law is often reasonable: small changes in a vector imply small changes in its probability. (Smoothness of the probability law should not be confused with smoothness of image neighborhoods; the latter is not required for the former.) The smoothness assumption is implicitly used in continuous-valued histograms when adjacent values are grouped into the same bin. As the bin size is increased, the histogram both summarizes and generalizes the data better, but at the cost of decreased resolution.

An alternative means of generalizing the learning sample is kernel estimation. A kernel estimate ~f_K is typically of the form

~f_K(X) = (1/N) Σ_{n=1}^N k(X - X_n),     (1.6)

where the kernel function k(X) is itself a PDF that is usually chosen to be spherically symmetric and local to the origin [4]. A Gaussian kernel is often used. The effect of kernel estimation is to "radiate" probability from each vector in the learning sample to the space immediately around it, which is justified by the smoothness assumption. In this way, the learning sample is generalized.

Kernel estimation is a powerful technique in nonparametric statistics with many practical successes reported and a rich supporting theory [13]. However, it is not without its shortcomings. Foremost is its inability to summarize the learning sample. In kernel estimation, a kernel is placed at each sample, requiring each training vector to be retained and used whenever the estimate is evaluated. In high-dimensional spaces, where large learning samples are necessary, this makes the kernel method prohibitively expensive in terms of both computation and storage.
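For concreteness, the plain kernel estimate (1.6) with a spherically symmetric Gaussian kernel can be sketched as follows (illustrative code; the learning sample is assumed to be stored as an N x d array, and the bandwidth sigma is a free smoothing parameter not prescribed by the paper):

    import numpy as np

    def kernel_estimate(X, learning_sample, sigma=1.0):
        # Average of d-dimensional spherical Gaussians centered on the N training vectors.
        L = np.asarray(learning_sample, dtype=float)     # shape (N, d)
        X = np.asarray(X, dtype=float)                   # shape (d,)
        N, d = L.shape
        sq_dist = np.sum((L - X) ** 2, axis=1)
        norm = (2.0 * np.pi * sigma ** 2) ** (d / 2.0)
        return float(np.mean(np.exp(-sq_dist / (2.0 * sigma ** 2)) / norm))

    rng = np.random.default_rng(0)
    sample = rng.normal(size=(500, 3))
    print(kernel_estimate(np.zeros(3), sample, sigma=0.5))

Note that every training vector is retained and visited each time the estimate is evaluated, which is the storage and computation burden discussed above.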

Figure 1 (neighborhoods 1-6): Conditioning neighborhoods used in this paper. In each case, solid circles correspond to x_1, ..., x_{d-1}, and the open circle to x_d.

Attempting to economize by subsampling the training data is tantamount to using a smaller sample, which leads to inaccuracy, most notably in the tails. In higher dimensions, this type of economized kernel estimation becomes problematic, as the tails usually contain most of the total probability (for a nice illustration of this point, see [2], pp. 91-93). Adaptive kernel estimates have been proposed to mitigate this problem [2, pp. 100-110], but they too rely on distribution sampling for the kernel locations, and therefore are prone to poor performance in tail regions.

In the next section, a modification of the kernel method is proposed wherein important regions of 𝒳^d are identified via cluster analysis, then region-specific kernels are fit to these locations. The result is a model that represents both mode and tail regions well, while combining the summarizing strength of histograms with the generalizing strength of kernel estimates.

2  Cluster-based probability estimation

Scott and Thompson [3] have observed, "... the problem of density estimation in higher dimensions involves first of all finding where the action is." Cluster-based kernel probability estimation begins by identifying the locations of important regions of 𝒳^d, by means of cluster analysis. The details of the cluster analysis are taken up in Section 2.1. Here, we assume that a partition 𝒰 consisting of M cells has already been obtained.

To each cell U_m ∈ 𝒰, fit a single product kernel Π_{i=1}^d k_{m,i}(X_i), where each k_{m,i} is a one-dimensional PDF. Define w_m = N(U_m)/N, where N(U_m) denotes the number of training points that lie in U_m. For continuous x, the cluster-based estimate of the PDF is defined as

~f_C(X) = Σ_{m=1}^M w_m Π_{i=1}^d k_{m,i}(X_i).     (2.1)

A natural choice for the kernels is the Gaussian form,

k_{m,i}(X_i) = (1 / (√(2π) γ σ̂_{m,i})) exp( -(X_i - μ̂_{m,i})^2 / (2 γ^2 σ̂^2_{m,i}) ).     (2.2)

The parameters μ̂_{m,i} and σ̂^2_{m,i} are taken to be the one-dimensional sample mean X̄_{m,i} and sample variance S^2_{m,i} of the training points that lie in U_m.

In the discrete case, the cluster-based probability estimate is of identical form:

~p_C(X) = Σ_{m=1}^M w_m Π_{i=1}^d k_{m,i}(X_i),     (2.3)

where the one-dimensional product kernels k_{m,i} are obtained by sampling the corresponding continuous kernels at the discrete values in 𝒳 (e.g., integer grayscale values between 0 and 255), then normalizing to obtain a PMF. All of the experimental results presented in this paper were obtained using Gaussian kernels discretized in this way.
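Evaluating (2.1) with Gaussian kernels of the form (2.2) is a weighted sum over clusters of products of one-dimensional Gaussians. The following sketch is illustrative only (the argument names and array shapes are assumptions of this illustration, not the authors' code); gamma plays the role of the kernel-width factor γ in (2.2):

    import numpy as np

    def cluster_pdf(X, weights, means, variances, gamma=1.2):
        # weights: (M,) cell populations divided by N; means, variances: (M, d)
        # one-dimensional sample means and variances of the points in each cell.
        X = np.asarray(X, dtype=float)                       # shape (d,)
        mu = np.asarray(means, dtype=float)
        var = gamma ** 2 * np.asarray(variances, dtype=float)
        log_k = -0.5 * ((X - mu) ** 2 / var + np.log(2.0 * np.pi * var))
        return float(np.sum(np.asarray(weights) * np.exp(log_k.sum(axis=1))))

    # Two clusters in d = 2:
    w  = [0.7, 0.3]
    mu = [[50.0, 60.0], [180.0, 170.0]]
    s2 = [[90.0, 110.0], [200.0, 150.0]]
    print(cluster_pdf([55.0, 58.0], w, mu, s2))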

The necessity of the parameter γ in (2.2) is discussed in Appendix B. An appropriate value is determined empirically. In this study, the best values for γ were found to be between 1.0 and 1.5.

2.1  Obtaining the partition: clustering

Though we use the term "clustering," our goal is different from that of traditional cluster analysis. Essentially, what we are after from cluster analysis is much the same as what vector quantization is after: representational efficiency, as opposed to efficiency of discrimination.

A clustering technique that is widely used in vector quantization is the k-means procedure [7], whose origin is often attributed to Forgy [14, 15]. Stagewise application of the k-means procedure, in which the initial guesses for the cluster centroids at each stage are obtained by splitting the centroids resulting from a previous stage, is known as the LBG algorithm [14, 16]. We have adopted the LBG algorithm for all of the experiments here.

The clustering algorithm is now described. Define X̄_m = [X̄_{m,1}, ..., X̄_{m,d}], for m = 1, ..., M. The following basic procedure is repeated for successively larger values of M, the number of kernels. The parameters ε_1, ε_2, and ε_3 are discussed after the algorithm.

1. Initialization: set M = 1, set X̄_1 to be the centroid of all training points, and set D to be the total squared distortion incurred by substituting X̄_1 for each training point. Initialize a partition of L to consist of a single cell that includes all training points.

2. If the desired M has been reached, or if all cells have population less than ε_3, then stop. Otherwise, increase M by splitting every X̄_m into two points, one staying in place and the other shifting by a random offset in each dimension, where the offset is uniformly distributed between zero and ε_2 times the magnitude of X̄_m. However, never split X̄_m's whose cells have populations of less than ε_3.

3. Using {X̄_m}, compute a nearest-neighbor partition of L.

4. Recompute {X̄_m} using this partition.

5. Recompute D, the total squared distortion incurred by substituting the nearest X̄ for each training vector. If the {X̄_m} resulting from Step 4 reduces D by a factor of more than ε_1 relative to its previous value, return to Step 3. Otherwise, return to Step 2.

Suitable values in all applications described here were determined empirically to be ε_1 = 0.01, ε_2 = 0.01, and ε_3 = 10. In the case of discrete x, a random dither uniformly distributed on (-Δ/2, Δ/2) is added to every element of every training point prior to clustering, where Δ is the average spacing between values in the discrete alphabet 𝒳 (e.g., Δ = 1 for integer-valued pixel data). Adding this dither allows cell boundaries to migrate more smoothly by preventing multiple points from lying on top of one another.
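The splitting-and-refinement procedure above can be sketched as follows. This is an illustrative reading of Steps 1-5, not the authors' implementation; in particular, applying the ε_2 jitter per dimension against the magnitude of each centroid component is one interpretation of the splitting rule, and for discrete pixel data the dither described above would be added to the points before calling it.

    import numpy as np

    def lbg_partition(points, M_target, eps1=0.01, eps2=0.01, eps3=10, seed=0):
        rng = np.random.default_rng(seed)
        pts = np.asarray(points, dtype=float)                 # shape (N, d)
        centroids = pts.mean(axis=0, keepdims=True)           # Step 1: M = 1
        while centroids.shape[0] < M_target:
            d2 = ((pts[:, None, :] - centroids[None]) ** 2).sum(axis=2)
            labels = d2.argmin(axis=1)
            counts = np.bincount(labels, minlength=centroids.shape[0])
            splittable = counts >= eps3                       # never split small cells
            if not splittable.any():
                break                                         # sample size limits M
            jitter = rng.uniform(0.0, eps2, size=centroids[splittable].shape) \
                     * np.abs(centroids[splittable])
            centroids = np.vstack([centroids, centroids[splittable] + jitter])
            prev_D = None
            while True:                                       # Steps 3-5: Lloyd iterations
                d2 = ((pts[:, None, :] - centroids[None]) ** 2).sum(axis=2)
                labels = d2.argmin(axis=1)
                for m in range(centroids.shape[0]):
                    in_cell = labels == m
                    if in_cell.any():
                        centroids[m] = pts[in_cell].mean(axis=0)
                D = ((pts - centroids[labels]) ** 2).sum()    # total squared distortion
                if prev_D is not None and prev_D - D <= eps1 * prev_D:
                    break
                prev_D = D
        d2 = ((pts[:, None, :] - centroids[None]) ** 2).sum(axis=2)
        return centroids, d2.argmin(axis=1)                   # centroids and cell labels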

The number of kernels M can be chosen in one of the following ways. First, if it is determined empirically that choosing M larger than a certain value results in no improvement in performance, then this value can be taken as M. If no such limiting value is found, then M can be chosen on the basis of the available computation resources, or alternatively, it can be chosen to minimize the overall description length of the model and data, i.e., application of the minimum-description-length principle [17]. Finally, M can be determined indirectly by the size of the learning sample; eventually, the LBG algorithm will stop creating new clusters on account of low cell populations.

2.2  Optimizing the model parameters via the EM algorithm

An alternative to the method of estimating the weights and kernel parameters described above is to use the expectation-maximization (EM) algorithm [18, 6]. This algorithm results in a local maximum of the model likelihood, which for large training samples approximates a local minimum of the relative entropy D(p||~p).

The EM algorithm is closely related to the k-means algorithm; in fact it is a "soft" version of it, as will be clear from its description below. However, it is much more computationally expensive than k-means, and is highly sensitive to the initial guess for the parameters being estimated. For these reasons, it is suggested that the EM algorithm be used only to refine the parameter values obtained by the method described previously, rather than to obtain them from scratch.

Preliminary experiments have shown that the performance advantage of optimizing via the EM algorithm can be substantial (about 0.3 bits per pixel improvement in lossless compression of several natural images, using M = 64 and d = 3). Moreover, use of the EM algorithm obviates the parameter γ in (2.2).

The EM algorithm consists of two steps, the "expectation" step (E-step) and the "maximization" step (M-step). These are iterated until the rate of improvement of the likelihood falls below a specified convergence threshold.

The E-step involves a soft assignment of the training points to clusters, where the strength of assignment of training point X_n to cluster m is given by

β_{n,m} = w_m Π_{i=1}^d k_{m,i}(X_{n,i}) / Σ_{m'=1}^M w_{m'} Π_{i=1}^d k_{m',i}(X_{n,i}).

The M-step then updates the values of {w_m}, {σ̂^2_{m,i}}, and {μ̂_{m,i}}, using update rules which can be regarded as weighted versions of the usual maximum-likelihood estimators:

μ̂_{m,i} = Σ_{n=1}^N β_{n,m} X_{n,i} / Σ_{n=1}^N β_{n,m}

σ̂^2_{m,i} = Σ_{n=1}^N β_{n,m} (X_{n,i} - μ̂_{m,i})^2 / Σ_{n=1}^N β_{n,m}

w_m = Σ_{n=1}^N β_{n,m} / Σ_{m'=1}^M Σ_{n=1}^N β_{n,m'}

Note that if β_{n,m} is replaced by a hard, nearest-neighbor membership assignment, i.e., if we set β_{n,m} = 1 if m is the closest cluster to X_n, and β_{n,m} = 0 otherwise, then the EM algorithm becomes the k-means algorithm.

The main difficulty we have encountered in applying the EM algorithm is its time complexity for large models. The high complexity comes about because in EM, each training point effectively "belongs" to every cluster, whereas in k-means, each training point belongs to only one cluster.
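For the diagonal-covariance Gaussian kernels used here, the E- and M-steps above take the following form. The sketch is illustrative (not the authors' implementation); the soft assignments are computed in the log domain to avoid underflow, and a small floor keeps the re-estimated variances positive.

    import numpy as np

    def em_refine(points, weights, means, variances, n_iter=20):
        X = np.asarray(points, dtype=float)                   # (N, d)
        w = np.asarray(weights, dtype=float).copy()           # (M,)
        mu = np.asarray(means, dtype=float).copy()            # (M, d)
        var = np.asarray(variances, dtype=float).copy()       # (M, d)
        for _ in range(n_iter):
            # E-step: soft assignments beta[n, m] of training points to clusters.
            log_k = -0.5 * (((X[:, None, :] - mu[None]) ** 2) / var[None]
                            + np.log(2.0 * np.pi * var[None])).sum(axis=2)
            log_post = np.log(w)[None, :] + log_k
            log_post -= log_post.max(axis=1, keepdims=True)   # shift to avoid underflow
            beta = np.exp(log_post)
            beta /= beta.sum(axis=1, keepdims=True)
            # M-step: weighted maximum-likelihood re-estimates.
            Nm = beta.sum(axis=0)                             # effective cell populations
            w = Nm / Nm.sum()
            mu = (beta.T @ X) / Nm[:, None]
            var = (beta.T @ (X ** 2)) / Nm[:, None] - mu ** 2
            var = np.maximum(var, 1e-6)
        return w, mu, var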

The time complexity of EM, assuming a direct implementation of the steps given above, is O(dMN), while that of k-means is only O(dN). Though some computational savings is possible by eliminating terms that are multiplied by negligible β_{n,m}'s, the cost is still much higher than that of k-means. It may not be feasible to apply it to the large (e.g., M = 1024) cluster-based models considered in this paper. More investigation is required to assess and reduce the computational complexity of EM optimization for very large cluster-based models.

2.3  Componentwise separability

The restriction to separable kernels might seem to be unnecessary and even harmful to accuracy, but the situation is not as clear as it appears. There are at least three good reasons for imposing the restriction.

First, what appears to be greater approximation efficiency in the nonseparable case comes at the price of increased model complexity, since the entire covariance matrix (d(d+1)/2 degrees of freedom) must be stored for each cluster, instead of just d variances.

Second, the estimation problem is more difficult when nonseparable kernels are used. The entire covariance matrix, instead of just the dimension variances, must be estimated from the within-cell training data. The quality of the estimates is likely to be more sensitive to low cell populations (which occur frequently in practice), since a greater number of parameters must be computed from the same amount of data. The sensitivity to low populations may be alleviated somewhat by using the EM algorithm to optimize the parameters obtained by k-means, but the time complexity of EM, which in this case is O(d^2 MN), makes this impractical for large models (e.g., M = 1024).

Finally, the use of separable kernels greatly simplifies computation when the estimate is evaluated, as described in the following section.

2.4  Computation

In a huge variety of image processing applications, including all of those considered in this paper, what is needed is the one-dimensional PDF or PMF of a pixel, conditioned on a set of neighborhood pixels. This can be obtained directly from the estimated vector probability law using (1.3). The componentwise separability of the cluster-based estimate simplifies the computation, by allowing the conditional PDF or PMF to be written as a weighted sum of the one-dimensional kernels k_{m,d}. In particular, the conditional PDF is

~f_C(x_d | X_1, ..., X_{d-1}) = Σ_{m=1}^M r_m k_{m,d}(x_d),     (2.4)

where the factors {r_m} are given by

r_m ∝ w_m k_{m,1}(X_1) k_{m,2}(X_2) ··· k_{m,d-1}(X_{d-1}),     (2.5)

normalized such that Σ_m r_m = 1. The discrete case is the same. The normalization of r_m can be distributed over the d-1 conditioning dimensions by growing (2.5) as a sequence of partial products and renormalizing after each dimension is multiplied into it. Alternatively, the product can be formed by summing in the logarithm domain, then normalized in two steps: first by shifting the accumulated logarithm to a range that avoids underflow, then exponentiating and renormalizing to sum to one. Both strategies rely on the product structure of r_m, which derives from the componentwise separability of the kernels.
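A sketch of the conditional computation for the discrete case, forming the mixing factors r_m of (2.5) in the logarithm domain and then mixing the discretized kernels over x_d as in (2.4). The code is illustrative (the names, shapes, and integer-grayscale alphabet are assumptions of this illustration), and gamma is the kernel-width factor of (2.2):

    import numpy as np

    def conditional_pmf(cond_values, weights, means, variances, alphabet, gamma=1.2):
        Xc = np.asarray(cond_values, dtype=float)             # (d-1,) conditioning pixels
        mu = np.asarray(means, dtype=float)                   # (M, d)
        var = gamma ** 2 * np.asarray(variances, dtype=float) # (M, d)
        a = np.asarray(alphabet, dtype=float)                 # e.g. np.arange(256)

        # log r_m up to a constant: log w_m + sum over i < d of log k_{m,i}(X_i).
        log_r = np.log(weights) - 0.5 * (((Xc - mu[:, :-1]) ** 2) / var[:, :-1]
                                         + np.log(2.0 * np.pi * var[:, :-1])).sum(axis=1)
        log_r -= log_r.max()                                  # shift to avoid underflow
        r = np.exp(log_r)
        r /= r.sum()                                          # normalize so the r_m sum to one

        # Discretized kernels over x_d: sample the Gaussians on the alphabet, normalize.
        k_d = np.exp(-0.5 * (a[None, :] - mu[:, -1:]) ** 2 / var[:, -1:])
        k_d /= k_d.sum(axis=1, keepdims=True)
        return r @ k_d                                        # conditional PMF over x_d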

For a given conditioning set X_1, ..., X_{d-1}, not all of the kernels will contribute significantly to the conditional distribution; i.e., some of the r_m's will be negligible. This makes possible savings in computation by omitting the insignificant kernels. The difficulty lies in knowing which kernels to omit without actually computing them. One method is to weed out insignificant kernels as the product (2.5) is grown, by deleting those r_m for which k_{m,i}(X_i) is smaller than some suitable threshold (determined empirically).

In a previous paper [19] it was suggested that the discretized kernels k_{m,i} be precomputed to further speed execution time. This is appropriate when the need to save execution time far outweighs the need to save memory. However, if the computation of r_m is carried out in the logarithm domain, then the execution-time savings achieved by precomputing {ln k_{m,i}} is negligible, so that the memory advantage of computing-as-needed may take precedence. This is particularly true in applications where multiple cluster-based probability models are to be used.

2.5  Example

We illustrate the cluster-based estimation technique with a simple two-dimensional example. Real applications are considered in Sections 3-5.

A learning sample consisting of N = 19,700 vectors was extracted by sliding neighborhood N1 (see Figure 1) over a 150 × 150 patch of the aluminum wire texture D1 (see Figure 10). The logarithm of the resulting histogram is shown as a density plot in Figure 2 (a). Notice the speckling throughout, even in the relatively dark (high-probability) areas. This is indicative of the high pointwise variance that arises as a result of the empty-space phenomenon.

Figure 2 (b) and (c) illustrate the cluster-based modeling approach, using M = 8. In (b), centroids resulting from the clustering procedure of Section 2.1 are shown with the induced nearest-neighbor partition. Equiprobability ellipses for the corresponding kernels are shown in (c), each marked with its weight w_m.

The structure of the underlying PMF apparent in the histogram has several noteworthy characteristics. The marginal distribution (common to x_1 and x_2) has a strong mode, and there is a great deal of correlation. But the correlation is substantially nonlinear, so that much of it would be left unexploited by linear dimensionality-reduction techniques. Also, parametric techniques are unlikely to succeed because of the highly irregular shape. It is a broad-tailed distribution for which we expect kernel estimation to perform poorly when the number of kernels is restricted, as it must always be in practice. This point is made in Figure 2 (d), where a kernel estimate was economized by random subsampling of the learning sample to obtain 128 kernel centers. The tail regions are poorly represented, as expected.

Figures 2 (e)-(h) show the cluster-based estimates for several values of M. The tails are well represented, and the estimate does not suffer from the speckling that plagues the histogram. In other words, the cluster-based probability model both summarizes and generalizes the training data.

2.6  Asymptotic properties

It is clear from Figure 2 that as M increases, the potential for the cluster-based model to approximate an arbitrary continuous probability law improves. This improvement comes at the cost of increased model complexity. Nevertheless, it would be satisfying to know that the approximation can be made as close as desired by choosing a sufficiently large M. In this section we establish that as M gets indefinitely large, the cluster-based model converges in probability (N → ∞) to the true probability law, in both the discrete and continuous cases.

Let the diameter of the smallest d-sphere that covers cell U be denoted Diam(U). For a given partition 𝒰, define Diam(𝒰) as

Diam(𝒰) = sup_{U ∈ 𝒰} Diam(U).     (2.6)

Let {𝒰_j}_{j=1}^∞ be a sequence of partitions such that, for every δ > 0, there exists a j_0 such that Diam(𝒰_j) < δ whenever j > j_0. Associated with each partition 𝒰_j is a cluster-based probability model ~f_{C,j}.

Let V ⊆ 𝒳^d be a measurable set, and define the δ-skin S_δ as

S_δ = {X : ||X - X_B|| < δ for some X_B ∈ B},     (2.7)

where B is the boundary of V. It is assumed that B is smooth in the sense that

Volume{S_δ} → 0 as δ → 0.     (2.8)

Let U_{j,B} be the union of those cells in 𝒰_j that intersect B. Also, let U_{j,I} be the union of the cells in 𝒰_j that lie completely within V. Thus, U_{j,B} ∩ U_{j,I} = ∅ for all j.

Proposition 1. For every ε > 0, there exists a j_0 for which

lim_{N→∞} Prob{ | ∫_V ( f(X) - ~f_{C,j}(X) ) dX | > ε } = 0

whenever j > j_0.

A proof is given in Appendix A.

The smoothness restriction (2.8) on the boundary of V is not burdensome, since the sets that are of interest in most applications have boundaries that are decided by some sort of distance-based criterion, and such boundaries are smooth.

A remaining complication is that the LBG algorithm does not, in general, lead to sequences of partitions with the needed property of decreasing maximum cell diameter. The problem is that regions of zero probability are never populated in the learning sample; hence these regions can never be split. Nevertheless, it is clear that any region of nonzero probability will always be populated in a sufficiently large learning sample, and will therefore eventually be split. Hence, when the LBG algorithm is used to obtain the sequence of partitions, the asymptotic analysis holds on all regions of 𝒳^d that have nonzero probability.

We now consider asymptotic behavior in the discrete case. We argue that as Diam(𝒰_j) → 0, the discrete cluster-based probability model degenerates into the histogram, and therefore inherits all of its asymptotic properties. First note that as N → ∞, every element of 𝒳^d that has nonzero probability will be represented. Now suppose that, for a given j, one or more of the cells in 𝒰_j contains more than a single point in 𝒳^d. Then at least one of these must be split in a finer partition, so that in the limiting case, every cell contains exactly one point X ∈ 𝒳^d. This implies that each cell has zero variance, so that, informally, the kernels are "impulses" of height N(X)/N, which is precisely the definition of the histogram. As mentioned in Section 1.2, the histogram is the maximum-likelihood estimate, so that it is asymptotically unbiased and consistent. Hence, the discrete cluster-based probability model converges to the true probability law as required.

It is interesting to contrast the manner in which the histogram, the kernel estimate, and the cluster-based probability estimate construct the probability surface in high-probability regions. The kernel estimate requires that a large number of learning examples be situated in a small region, so that the kernel function can effectively blur them into a single mode. This is wasteful of storage in the sense that many different microscopic configurations lead to the same blurred macroscopic mode, yet the specific configuration must be stored exactly.

[Figure 2 panels: (a) HISTOGRAM; (b) CENTROIDS AND CELLS; (c) ELLIPSES AND WEIGHTS, with kernel weights 0.41, 0.11, 0.04, 0.21, 0.05, 0.07, 0.07, 0.04; (d) KERNEL ESTIMATE (128 RANDOM PTS); (e) M = 4; (f) M = 8; (g) M = 32; (h) M = 128. Both axes span pixel values 0 to 250.]

Figure 2: (a) The logarithm of the histogram is shown as a density plot. (b) Centroids and partition, and (c) kernel ellipses, for the cluster-based probability model with M = 8. (d) Log density plot of a Gaussian kernel estimate (σ^2 = 30) using 128 points randomly selected from the training set as the kernel locations. Note that the kernel approach does not represent the tail regions well. Log density plots (e)-(h) correspond to cluster-based probability models with increasing M (γ = 2.0).

In its favor, the kernel estimate has the advantage of data-driven adaptation to local variations in density of the learning sample. At the other extreme is the histogram, which represents the probability surface as a single number for every cell in a partition. The precise configuration of the learning examples within each cell is forgotten, giving the technique the potential for storage efficiency. The disadvantage is that constant probability is assigned over entire cells, even when the training vectors happen to lie in some limited (possibly remote) part of a cell. The cluster-based probability model combines desirable properties of both approaches: it uses a single kernel which may be centered anywhere in the cell, and uses cell population instead of the precise configuration of training points to establish the height of the probability surface.

The remainder of this paper considers some applications of the cluster-based probability estimate. Since all of these applications involve working with discrete pixel intensities, the PMF formulation is used rather than the PDF.

3  Image restoration

Suppose an image has been degraded in an unknown way, or else in a way that is so difficult to describe mathematically that direct inversion of the degradation process is infeasible. We wish to recover an estimate of the original from this degraded version; how can this be accomplished?

Regardless of the difficulty in describing the degradation, the technique illustrated in Figure 3 can be used. First, a learning sample is formed by extracting a vector for every pixel location, using a neighborhood which need not be causal. Only one training pair is shown in the figure, but it should be understood that a large number of such pairs make up the training set, and that the test image to be restored is excluded from that set. After the learning sample has been obtained, a cluster-based PMF is trained on it. The choice of M can be made in one of the ways described at the end of Section 2.1; in our experiments, we obtained acceptable results using values of M ranging from 128 to 2048.
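The construction of the learning sample just described can be sketched as follows (illustrative code; the neighborhood is given as a list of (row, column) offsets relative to the current pixel, pixels too close to the border are simply skipped, which is one possible convention not specified in the paper, and clean_img and noisy_img in the usage line are placeholder arrays):

    import numpy as np

    def restoration_training_vectors(original, degraded, offsets):
        # For each interior pixel: degraded neighborhood values first (d-1 elements),
        # then the co-located original pixel appended as the d-th element.
        orig = np.asarray(original, dtype=float)
        deg = np.asarray(degraded, dtype=float)
        H, W = orig.shape
        margin = max(max(abs(dr), abs(dc)) for dr, dc in offsets)
        rows = []
        for r in range(margin, H - margin):
            for c in range(margin, W - margin):
                neigh = [deg[r + dr, c + dc] for dr, dc in offsets]
                rows.append(neigh + [orig[r, c]])
        return np.asarray(rows)                               # shape (N, d)

    # Square 3 x 3 neighborhood (d = 9 + 1 = 10), as in the restoration experiments.
    offsets_3x3 = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)]
    # learning_sample = restoration_training_vectors(clean_img, noisy_img, offsets_3x3)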

After the cluster-based estimate has been obtained, it can be used to restore a previously unseen degraded image, in the following way. For a given pixel location, let X_1, ..., X_{d-1} denote the values of the degraded neighborhood pixels, and let x_d be the unknown original pixel value. The value of x_d is then chosen which best explains the occurrence of X_1, ..., X_{d-1} with respect to some appropriate criterion. Common criteria are maximum likelihood (ML), maximum a posteriori probability (MAP), and least expected squared error (LSE):

ML:  X_d = argmax_{x_d} ~p_C(X_1, ..., X_{d-1} | x_d)
MAP: X_d = argmax_{x_d} ~p_C(x_d | X_1, ..., X_{d-1})
LSE: X_d = E_{~p_C}[ x_d | X_1, ..., X_{d-1} ]

where the right-hand side in the LSE formula is the conditional expectation with respect to the model. All of these criteria can be used with the cluster-based probability model.

3.1  Restoration experiments

Figure 4 shows the result of cluster-based MAP restoration in an example where a 128 × 128 8-bit image (a) is degraded by additive white Gaussian noise with a variance of 100 (b). The result of nonadaptive, separable eleven-tap Wiener filtering is shown in (c); the filter length was chosen to give the best result perceptually. For the cluster-based probability model, the square 3 × 3 neighborhood (d = 9 + 1 = 10) of Figure 3 was used, with M = 1024. The training set for this example, shown in Figure 9, consisted of twenty-five natural images together with their degraded versions. The restored image (d) exhibits significant noise reduction while maintaining reasonable sharpness. The improvement in signal-to-noise ratio is 3.01 dB.

To get an idea of the effect of the cluster-based MAP technique on objective image quality, the experiment was repeated by alternately using each of the training images as the test image.
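Given a conditional PMF for x_d over the grayscale alphabet (for example, computed as in the Section 2.4 sketch), the MAP and LSE criteria above reduce to an argmax and a conditional mean, respectively. A minimal illustrative sketch:

    import numpy as np

    def restore_pixel(cond_pmf, alphabet, criterion="MAP"):
        cond_pmf = np.asarray(cond_pmf, dtype=float)
        alphabet = np.asarray(alphabet, dtype=float)
        if criterion == "MAP":
            return float(alphabet[np.argmax(cond_pmf)])       # mode of the conditional
        if criterion == "LSE":
            return float(np.dot(cond_pmf, alphabet))          # conditional expectation
        raise ValueError("criterion must be 'MAP' or 'LSE'")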

[Figure 3 diagram: TRAINING (ORIGINAL and DEGRADED images; vector to be modeled) and RESTORATION (DEGRADED to RESTORED via arg max of conditional PMF).]

Figure 3: In the training phase (left), training data vectors x are formed by appending an original pixel to the corresponding degraded neighborhood pixels. In the restoration phase (right), the final component of the vector is unknown; its value is estimated by maximizing its PMF conditioned on the degraded pixels.

Figure 4: Image restoration example (see text). (a) original image; (b) degraded; (c) Wiener filtered; (d) restored using cluster-based model.

For every test image, the cluster-based estimate was retrained using a training set that excluded that test image. This prevented overtraining, which would have led to unrealistically optimistic performance estimates. In each case, there was an improvement in root-mean-square noise level ranging between 3 and 5 dB. Similar improvement resulted when the conditional expectation (LSE) was used instead of the MAP estimate. The ML estimate is usually used when no prior distribution is available for the quantity being estimated, i.e., x_d is treated as a deterministic but unknown parameter. Since a prior is available for x_d in the cluster-based probability model, the ML estimate was not implemented.

Because cluster-based restoration requires only weak assumptions about the statistics of the degradation (stationarity and locality of spatial dependence), it is flexible and can accommodate high-order, nonlinear statistical interactions that might be present in the degradations. For this reason, it is expected to perform well in restoration problems where the degradation process is spatially local but highly nonlinear and/or difficult to express mathematically. Examples of restoration problems that fit this description are dehalftoning, film grain reduction, and compensating the effects of quantization in lossy compression schemes.

3.2  Relationship to nonlinear interpolation

Cluster-based restoration is similar in spirit to a VQ-based nonlinear interpolation technique proposed by Gersho [20]. Both approaches have the potential to learn nonlinear statistical relationships from training data and to use those relationships to fill in missing values.

However, the techniques differ in one important respect. In the cluster-based technique, several kernels interact to determine the restored value, instead of the value being determined by a single codebook entry. Thus, restored values are not limited to only those appearing explicitly in a codebook; instead, they are synthesized from the model and the available conditioning information.

4  Lossless compression

Most of the published research in image compression deals with lossy compression: the image that is reconstructed from the compressed representation approximates the original; it is not required to be identical to it. Such techniques routinely achieve compression ratios of 10:1 or more, with little or no noticeable distortion.

On the other hand, lossless (strictly reversible) compression techniques typically yield compression ratios no better than 2:1 on natural images [21, 22]. This comparatively poor performance is a consequence of the requirement that the reconstructed image be bit-for-bit identical to the original. An upper bound on the compression ratio that can be achieved comes from noise that is inevitably introduced in the acquisition process. However, in practice it is not just noise, but also any noiselike phenomena that contribute inordinately to bit rate. Often, the regions which appear noiselike are textured regions. In this section we consider losslessly compressing both natural scenes (which include textured regions) and pure textures.

[Figure 5 (a) block diagram: a GRAYSCALE IMAGE feeds a CLUSTER-BASED PROBABILITY MODEL, which drives a multisymbol-alphabet ARITHMETIC CODER producing the CODE BITS. (b) plot: average bits/pixel (4.0 to 7.0) versus log2(number of clusters) (2 to 12) for the LENNA and D1 images, with curves for neighborhoods N2 and N5.]

Figure 5: (a) Performance of the lossless compression system depends on the probabilistic model. (b) Dependence of bit rate on M for lossless compression of two images: lenna and D1.

Despite their low compression ratios, lossless techniques are required in certain applications. For example, in image and video processing research, the original reference images and video sequences must be intact if the research conclusions are to be valid. Some image data, such as satellite imagery, may have been obtained at great expense, and the risk of losing expensive information which might later be required for some unforeseen purpose precludes any archiving scheme based on lossy compression. Another often-cited application for lossless coding is the archiving of digital radiology and other medical images, where ethical and legal considerations make the use of lossy techniques questionable.

But the main reason for considering lossless compression here is that lossless compression is ultimately a modeling problem. Shannon established that an optimal lossless compression system uses -log2 p(X) bits to encode the message element X, where p(X) is the probability of X. (For economy of notation, the conditioning is not explicitly mentioned, but it should be understood that p is conditioned on whatever is both relevant and available at both the encoder and decoder.) The optimal average bit rate is thus

H(x) = - Σ_{X ∈ 𝒳^d} p(X) log2 p(X),

which is the entropy of x. If, instead of the true probability, an estimate ~p is used to construct a code, then the average bit rate R_{~p} is

R_{~p} = - Σ_{X ∈ 𝒳^d} p(X) log2 ~p(X) = H(x) + D(p||~p),     (4.1)

where D(p||~p) is the relative entropy defined by expression (1.4). It is easy to verify that D(p||~p) ≥ 0, with equality if and only if ~p(X) = p(X) for all X. Thus, the goal of designing a lossless compression system is exactly the same as that of estimating p.

The above assumes an optimal code. It is natural to question whether codes that are achievable in practice also perform best when ~p = p. As will be discussed in the following section, arithmetic coding is a practical technique for lossless compression that very nearly achieves Shannon's optimal code length assignment [23], so that the above analysis pertains to practice as well as theory.

4.1  Lossless compression with arithmetic coding

Arithmetic coding is a form of entropy coding that offers significant advantages over other methods in many applications [23, 24, 25, 26].

Table I: Estimated Bits per Pixel (M = 2048)

    Image   N2     N3     N4     N5
    cman    5.50   4.81   4.77   4.79
    lenna   5.11   4.53   4.38   4.41
    D1      4.83   4.40   4.24   4.14
    D77     6.08   6.04   5.46   5.30

It has near-optimal efficiency (relative to the assumed probability law) for a broad class of sources and over a wide range of coding rates. It is also inherently adaptive, and simplifies the encoding of large-alphabet low-entropy sources. Most importantly, it allows the probabilistic model to be specified explicitly and separately from the actual encoder.

The first use of arithmetic coding as an image compression technique was by Langdon and Rissanen [27]. In their system, which was for binary images, each pixel was encoded using a PMF conditioned on a nearby set of previously encoded pixels, i.e., on a causal neighborhood. Since the input was binary, the number of conditioning states remained manageable even for large neighborhoods, so that a histogram PMF estimate was feasible. For example, ten conditioning pixels means only 1,024 conditioning states.

Extending the Langdon-Rissanen scheme to handle grayscale images is greatly complicated by the empty space phenomenon. For example, in the case of eight-bit pixels, a ten-pixel conditioning neighborhood implies 2^80 conditioning states, most of which will never be observed in a reasonably sized training sample. Moreover, in the grayscale case, we have good reason to believe that the underlying probability law is smooth, but the histogram PMF estimate makes no use of this prior knowledge.

We substitute the cluster-based probability estimate for the histogram estimate in the Langdon-Rissanen scheme, thereby extending it to handle grayscale images. The system is shown in Figure 5 (a). Pixels are arithmetically encoded in raster order. As in the Langdon-Rissanen scheme, the PMF used for each pixel is conditioned on a set of previously encoded pixels, so that the decoder has access to the same conditioning information that the encoder had. (At the top and left boundaries, unavailable conditioning pixels are arbitrarily set to 128; the resulting local inefficiency has little effect on the overall bit rate.)

4.2  Compression experiments

Using neighborhoods N2-N5 shown in Figure 1, cluster-based probability models were trained on the set of 25 natural images shown in Figure 9. These PMF estimates were then used in an arithmetic coding system to compress two natural images not in the training set: cman and lenna.
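Because arithmetic coding comes within a small overhead of the ideal code length, a bit rate of this kind can be approximated by accumulating -log2 of the modeled conditional probability of each pixel in raster order. The sketch below is illustrative only (it is not the procedure behind Table I); cond_model is an assumed callable returning a PMF over the 8-bit alphabet, and the boundary convention is the one noted above.

    import numpy as np

    def estimated_bits_per_pixel(image, causal_offsets, cond_model, boundary=128):
        img = np.asarray(image)
        H, W = img.shape
        total_bits = 0.0
        for r in range(H):
            for c in range(W):
                neigh = []
                for dr, dc in causal_offsets:                 # offsets precede (r, c) in raster order
                    rr, cc = r + dr, c + dc
                    inside = 0 <= rr < H and 0 <= cc < W
                    neigh.append(float(img[rr, cc]) if inside else float(boundary))
                pmf = cond_model(np.asarray(neigh))           # conditional PMF over 0..255
                total_bits += -np.log2(max(float(pmf[int(img[r, c])]), 1e-12))
        return total_bits / (H * W)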

The resulting estimated bit rates for M = 2048 are shown in the first two rows of Table I. The rates are based on the assumption that 16-bit arithmetic is used in the encoder and decoder. Next, the experiment was repeated using two natural textures: D1 (aluminum wire mesh) and D77 (cotton canvas), but using a 352 × 352 portion of each texture for training and a disjoint 128 × 128 portion for testing. The textures are shown in Figure 10.

The performance listed in the first two rows of the table compares favorably with that reported in the literature for natural scenes [21, 22]. The compression performance for the two textures is more difficult to interpret, since no previous results seem to have been published. One could argue that the textures, having fewer blank regions, are more difficult to compress than natural scenes. But this difficulty is offset by their relative homogeneity, which should allow a single model to work well over the entire texture.

The dependence of bit rate on M (the number of kernels) is shown graphically for the lenna and D1 test images in Figure 5 (b), for each of the neighborhoods N2 and N5. For both test images, the N2 curve reaches a limit at about M = 256, while the N5 curve continues to improve as M increases to 2048.

4.3  Comparing PMF estimates

How good is an estimate ~p in terms of relative entropy? It is difficult to estimate D(p||~p) in (4.1) directly from a sample of limited size, since the true p remains at all times unknown. However, if ~p_1 and ~p_2 are two competing estimates of p, then their difference in relative entropy,

D(p||~p_1) - D(p||~p_2) = R_{~p_1} - R_{~p_2},

is easily and reliably estimated from a moderate-size sample by subtracting the sample average of -log ~p_2(x) from that of -log ~p_1(x). These sample averages are just the average bit rates produced by arithmetic coders based on ~p_1(x) and ~p_2(x). The comparison can be carried out for more than two estimates in a similar way, since the unknown entropy term is common to all of them. Thus, practical lossless compression serves as a fair test of the relative accuracy or predictive power of a model with respect to the relative entropy measure; it is a level playing field upon which competing models can battle. Whichever model produces the fewest bits wins. The test is decisive even when the sample sizes are small. As will be seen in the next section, this type of competition can be used as the basis for classification.

5  Texture classification

Classification is an activity humans carry out constantly, to make sense of the world we perceive around us. We recognize similarities and differences among sensory stimuli, and group objects accordingly. The specific similarity metrics we employ seem extraordinarily complex, but one thing about them is certain: they are usually high-dimensional, and involve the integration of diverse elements of knowledge, some high-level and some primitive, some conscious and some at the level of intuition.

Much simpler is the situation in which a machine is to classify objects on the basis of objectively measured features and with respect to some clearly stated criterion (possibly Bayesian). We will see an example shortly involving textures where the result of such classification matches closely what seems subjectively reasonable.

For simplicity and concreteness, we consider the case in which the prior probabilities of the classes are equal, and where the goal is to minimize the overall probability of classification error. In this case, the Bayes decision rule reduces to the familiar ML rule, which is to choose the class which makes the observed data most likely [7].

ML classification can be applied in many different ways. For instance, if it were known beforehand that a particular patch in an image consists of a single texture class, then the model with the greatest likelihood could be chosen.

More typical is the situation where we are not assured that the entire patch came from a single class; indeed, the task is frequently to determine the boundaries between classes in an image that is assumed to be heterogeneous (e.g., image segmentation). In this case, the classification decision should be more or less independent for each pixel. This can be accomplished by centering a neighborhood (not necessarily causal) at each pixel location, and choosing for that pixel the class whose modeled conditional probability is greatest. The resulting classification will of course appear "noisy," since each decision is based on only one neighborhood observation. An alternative is to assume limited spatial homogeneity, and to choose a class that maximizes "local" likelihood, appropriately defined. This idea will be made more precise shortly.

Another way around the single-observed-neighborhood problem is to use several different models for each class, each conditioned on a different neighborhood around the current pixel. It is natural to choose these neighborhoods to be at different scales, resulting in a multiresolution system. The following heuristic can then be invoked: to be assigned to a certain class, a pixel must have high conditional likelihood simultaneously with respect to several different neighborhoods. This idea has recently been generalized as an independent method of attack on the curse of dimensionality in density estimation [28].

Consider now the problem of classifying regions in heterogeneous images. Suppose that there are J classes, and that vectors drawn from class j follow PMF p_j(x). Associated with each class is a cluster-based probability model ~p_j(x), trained previously. Let S = {X_1, X_2, ..., X_N} be a sample drawn from an unknown class which we wish to identify. If we assume for now that the vectors in S are independent, the likelihood function for class j is simply

p_j(X_1) p_j(X_2) ··· p_j(X_N).     (5.1)

After taking the logarithm, we obtain the decision rule:

Choose the j for which Σ_{n=1}^N log p_j(X_n) is maximized.     (5.2)

To evaluate each log p_j(X_n), the technique described in Section 2.4 can be used.

Typically, the vectors are formed at each pixel location using a neighborhood like N6. Formed in this way, vectors corresponding to adjacent pixels are not independent, so that (5.1) is not strictly justified. The resulting decision rule does not take advantage of the statistical dependence among vectors. However, one can argue that if the dependence among vectors is similar for all classes, ignoring it for all classes results in a useful decision rule.

It is worthwhile noting that the summation in (5.2), after dividing by N, equals the negative of the average bit rate of an ideal entropy coder fed by model ~p_j. Consequently, the decision rule can in principle be implemented using the structure shown in Figure 7. Of course, in practice, real entropy coders need not be used, since only the rates and not the actual code bits are needed.

The possibility of realizing the decision rule in this way illustrates a connection between data compression and classification. This connection seems especially important with the growth of large libraries of image data, where one will want to search and make decisions on compressed data [29]. Recent work by Perlmutter et al. indicates that combining these two tasks can result in improved performance for both [30].
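Decision rule (5.2), evaluated with the conditional models of Section 2.4, can be sketched as follows (illustrative code; each entry of class_cond_models is an assumed callable that returns the conditional PMF of the last vector element given the first d-1 elements):

    import numpy as np

    def classify_sample(sample_vectors, class_cond_models):
        # Sum the log conditional probabilities over the sample for each class and
        # choose the maximum; equivalently, choose the minimum ideal bit count.
        scores = []
        for cond_model in class_cond_models:
            total = 0.0
            for x in sample_vectors:                          # x = (X_1, ..., X_d)
                pmf = cond_model(np.asarray(x[:-1], dtype=float))
                total += float(np.log(max(float(pmf[int(x[-1])]), 1e-300)))
            scores.append(total)
        return int(np.argmax(scores)), scores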

Figure 6: Four-class example of texture classification using the cluster-based model. From left to right, the images are: the original composite test image, classification using resolution averaging only, and classification using both resolution and spatial averaging.

[Figure 7 diagram: sample vectors from an unknown class feed J probabilistic models (one per class), each driving an ideal entropy coder; the classification decision chooses the class with the minimum average bit rate.]

Figure 7: A connection between data compression and classification: the minimum-error-probability rule can be realized using a bank of ideal entropy coders, each tuned to a different source.

The above assumes that nearby pixels come from the same class. In practice, when working with heterogeneous images, it is desirable to impose this assumption in a soft manner. How can this be accomplished? Thinking of the summation in (5.2) as a local averaging operation is suggestive of spatial lowpass filtering. In particular, for each class, the logarithm of the likelihood is obtained for each pixel location, and the resulting "image" of log-likelihoods can be spatially lowpass filtered. The decision rule then assigns to each pixel location the class with the maximum smoothed log-likelihood function at that location.

As mentioned earlier, it is also possible to average across several models, each working at a different spatial resolution. While spatial averaging reflects the assumption that nearby pixels are likely to have come from the same class, resolution averaging imposes the requirement that a candidate match the sample texture at several scales simultaneously. This is because averaging logarithms corresponds to taking a product, which will be small if any of the factors is small. Spatial and multiresolution averaging are not mutually exclusive; in fact, our best results employ both.
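The combination of resolution averaging and spatial averaging of the log-likelihoods can be sketched as follows (illustrative code; a uniform box filter from scipy.ndimage stands in for a separable lowpass filter, and the log-likelihood images are assumed to be precomputed, one per class and per resolution scale):

    import numpy as np
    from scipy.ndimage import uniform_filter

    def classify_pixels(log_like, filter_size=7):
        # log_like: array of shape (J, S, H, W); entry [j, s] is the image of log
        # conditional likelihoods for class j at resolution scale s.
        L = np.asarray(log_like, dtype=float)
        per_class = L.mean(axis=1)                            # average across resolutions
        smoothed = np.stack([uniform_filter(img, size=filter_size) for img in per_class])
        return np.argmax(smoothed, axis=0)                    # (H, W) map of class labels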

5.1  Classification example

The cluster-based classification scheme was applied to the composite test image shown on the left in Figure 6. The image consists of four Brodatz textures; they are (clockwise from top left) D68, D55, D77, and D84. For each class, three cluster-based probability models were trained on data that did not include the test data. Models were obtained at three different resolution scales: one model was trained using neighborhood N6, another using the same neighborhood but with the pixel location offsets scaled by a factor of 2 around the center, and a third with the offsets scaled by a factor of 4. Since N6 contains 12 conditioning pixels, the total dimensionality for each of these models is 13.

To perform the classification, three vectors were formed for every pixel in the test image, one for each of the three resolutions. For each class and each resolution, the logarithm of the conditional probability of the center pixel was computed using the corresponding model. These values were then averaged across resolutions. The classification results are shown in the middle image of Figure 6. The total classification error rate is less than five percent. The test was repeated using spatial averaging of the logarithms in addition to the resolution averaging; the results are shown on the right in Figure 6. A 7-tap separable lowpass filter was used to perform the spatial averaging. The overall classification error rate in this case is below one percent.

6  Discussion

The preceding sections described the cluster-based probability modeling technique, and considered some applications. This section follows up on some of the issues raised in previous sections, and suggests topics for further study.

6.1  Preprocessing vs. modeling

In a complex signal processing system, some preprocessing is usually carried out on the input to make the subsequent probabilistic modeling and processing tasks easier. Image compression systems, for example, often employ an invertible, energy-compacting transformation as a first step, resulting in a signal that is easier than the original to quantize and encode efficiently.

6 Discussion

The preceding sections described the cluster-based probability modeling technique and considered some applications. This section follows up on some of the issues raised in previous sections and suggests topics for further study.

6.1 Preprocessing vs. modeling

In a complex signal processing system, some preprocessing is usually carried out on the input to make the subsequent probabilistic modeling and processing tasks easier. Image compression systems, for example, often employ an invertible, energy-compacting transformation as a first step, resulting in a signal that is easier than the original to quantize and encode efficiently.

The problem of improving preprocessing operations has received much attention among researchers in various disciplines over the past several decades. The philosophical direction of such research effort has been toward more sophisticated preprocessing, enabling less sophisticated probabilistic models to be used.

Of interest here is the complementary research direction: toward more sophisticated models, enabling less sophisticated preprocessing to be used. Examples of techniques in this direction are vector quantization (VQ) in the areas of compression and interpolation [14], and artificial neural networks in the areas of regression and classification [31]. These techniques reduce the effect of preprocessing on system performance by exploiting nonlinear, higher-order statistical relationships that exist among the signal elements. A striking aspect of these approaches is that they can function largely as "black boxes": one need not understand the information source in order to process it. This is either good or bad, depending on one's objectives. The approach advocated here shares the spirit of these approaches by capturing whatever statistical relationships exist. It differs from them in that these relationships are made available as an explicit probability law, which may then be used for a variety of purposes. Moreover, having the probability law allows us to build a conceptual bridge between these "black-box" approaches and classical approaches.

The strategy taken in the experimental parts of this paper has been to work directly in the untransformed observation space of pixel neighborhoods. This choice draws attention to the fact that more powerful modeling makes the initial transformation or feature-selection step less critical. Were maximum performance the main goal, then substantial effort would be justified in devising suitable transformations for use in tandem with the modeling technique described here.

6.2 Relationship to artificial neural networks and radial basis functions

It was commented in Section 1 that the proposed technique differs in philosophy from traditional mixture modeling in that the goal is to approximate a probability law, not to decompose it into physically significant components. Thus, we can view the modeling problem as one of function approximation, where the function to be approximated is the PDF or PMF. In particular, if the $k_{m,i}$'s are chosen appropriately, then (2.3) amounts to a radial basis function (RBF) approximation to $p(x)$. One might hope, therefore, that the RBF literature would provide insight into such issues as training, means of implementation, and bounds on approximation accuracy [32, 33, 34]. This is true to some extent, but two differences are apparent. The first is that values of the function being approximated (a probability law) are never actually observed; instead our observations consist only of samples that we believe to be governed by the function. The second difference is in the relevant approximation criterion, which in the applications of interest here is the relative entropy (1.4). This criterion is not a distance metric, since it is asymmetric and does not satisfy the triangle inequality. Consequently, much of the RBF approximation theory does not apply directly.
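To make the function-approximation view concrete, the sketch below evaluates a density of the general weighted-Gaussian-kernel form used by the cluster-based model. Since equation (2.3) is not reproduced in this section, the diagonal-covariance parameterization, the argument names, and the default spread value are assumptions made for illustration rather than the paper's exact definition.

    import numpy as np

    def mixture_density(x, centers, variances, weights, gamma=1.2):
        """Evaluate a cluster-based density estimate at the query points x.

        x         : (n, d) query points
        centers   : (M, d) kernel (cluster) centers
        variances : (M, d) per-cluster, per-coordinate sample variances
        weights   : (M,)  normalized cell populations w_m (summing to 1)
        gamma     : global spread multiplier applied to the variances
                    (see Appendix B); the diagonal-covariance Gaussian
                    form used here is an assumption for illustration.
        """
        var = (gamma ** 2) * variances                           # (M, d)
        diff = x[:, None, :] - centers[None, :, :]               # (n, M, d)
        expo = -0.5 * np.sum(diff ** 2 / var[None, :, :], axis=2)
        norm = np.sqrt((2 * np.pi) ** x.shape[1] * np.prod(var, axis=1))
        return np.sum(weights[None, :] * np.exp(expo) / norm[None, :], axis=1)

Structurally this is an RBF expansion whose basis functions are Gaussian bumps at the cluster centers and whose coefficients are the normalized cell populations; the difference emphasized above is that the expansion is fit from samples governed by the law, under a relative-entropy criterion, rather than from observed function values.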
The relationship to RBF's suggests a connection to artificial neural networks (ANN's). As mentioned in Section 6.1, a system using the cluster-based probability model does have certain elements in common with an ANN. Both are capable of learning complex, nonlinear relationships from training data, and of exploiting them to perform various information processing tasks. We expect that the applications we are considering could be handled by an appropriate type of ANN with a comparable level of performance. However, there is nothing inherently "neural" about the cluster-based probability model, and it is not connected to any specific class of hardware topology.

If we compare the cluster-based model with a classical ANN like a multilayer perceptron, another distinction emerges. The set of weights in the perceptron, which completely determines its function, has no obvious interpretation outside the network. For instance, the set of weights cannot be used by some other perceptron to perform a different task. In contrast, the method of this paper provides an explicit probabilistic model for the source, which can be used equally well in a variety of applications such as compression, restoration, and classification. In principle, this distinction vanishes when the goal of the ANN is specifically to estimate the PMF [35], rather than to carry out the ultimate information processing task. In this case, the approaches may be accomplishing the same thing in different ways.

Viewing the cluster-based model as an artificial neural network might prove to be useful when it is desired to adapt the kernels as data are being processed, as opposed to the current technique, in which the kernels are fixed after an initial training phase. Also, techniques used in pruning insignificant nodes in ANN's might prove useful in eliminating insignificant kernels during computation. Understanding the connection to neural networks more fully requires study beyond the scope of this paper.

6.3 Hierarchies of cluster-based models

When cluster-based models are defined directly on pixel neighborhoods, a large number of the clusters are inevitably allocated along the main diagonal of the probability space. This is a consequence of the high positive correlation of spatially adjacent pixels in natural images and textures, and it is one reason for working with some other features besides pixels, for which the high correlation is absent. However, this type of commonality in the kernel distributions among models is suggestive of a plan for organizing a large collection of models in such a way that they can easily share common attributes when appropriate. In particular, a tree structure can be used, as sketched below. Kernels that are common to all of the models can be stored at the root of the tree. The leaves of the tree would correspond to the individual models, and the path from the root to each leaf would specify which kernels would have to be added to result in the corresponding model. Such hierarchies of models can be expected to play a substantial role when the cluster-based technique is applied to large, real-world classification problems, as occur in digital libraries.
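A minimal sketch of such a kernel-sharing tree follows. The class structure, method names, and the two-level example are illustrative assumptions, not the organization actually used in the paper.

    class KernelNode:
        """A node in a tree of shared kernels. Kernels stored at a node are
        shared by every model (leaf) in that node's subtree; a model's full
        kernel set is the union of the kernels on its root-to-leaf path."""

        def __init__(self, kernels=None, children=None, name=None):
            self.kernels = list(kernels or [])   # kernels introduced at this node
            self.children = list(children or [])
            self.name = name                     # leaf nodes name a model/class

        def model_kernels(self, name, path=None):
            """Collect the kernels defining the model stored at leaf `name`."""
            path = (path or []) + self.kernels
            if not self.children:
                return path if self.name == name else None
            for child in self.children:
                found = child.model_kernels(name, path)
                if found is not None:
                    return found
            return None

    # Example: two texture models sharing the near-diagonal kernels at the root
    # (shared_kernels, d68_specific, d55_specific are hypothetical kernel lists).
    # root = KernelNode(kernels=shared_kernels, children=[
    #     KernelNode(kernels=d68_specific, name="D68"),
    #     KernelNode(kernels=d55_specific, name="D55")])
    # kernels_for_d68 = root.model_kernels("D68")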
An interesting alternative means of sharing common attributes among several models has been suggested recently by Pudil et al. [36], in a classification setting. The approach is to posit a common "background" density for all of the classes, and to express each class-conditional density as a mixture of products of this background density with a class-specific modulating function. Expressing the density in this way simplifies the sharing of common attributes and provides a basis for feature selection: choose the features that provide maximum deviation from the background. The classification results reported in [36] are for small mixtures (2, 3, and 4 components). In applications where the goal is to characterize the precise shape of the density, not just the portion that provides discriminatory power, much larger mixtures may be necessary.

7 Conclusions

A multidimensional probability model, based on cluster analysis, has been presented, analyzed, and applied to certain problems in image and texture processing. The model combines the summarizing ability of a histogram with the generalizing ability of kernel estimation, while avoiding some of the drawbacks of each. In particular, it avoids the empty-bin problem associated with high-dimensional histograms, and it performs better in tail regions than a traditional kernel estimate of the same complexity.

It was shown that under reasonable conditions, the estimate converges asymptotically to the true probability law as its complexity is allowed to increase arbitrarily. In practice, the model was shown to be successful in applications requiring the estimation of a joint PMF with up to 13 variables.

Applied to image restoration, the model was used to learn complex degradations that cannot be expressed easily in mathematical form. Lossless compression provided a fair test of the accuracy of the model; for natural images the results were competitive with methods designed specifically for lossless compression. In texture classification, several cluster-based models were used effectively in a standard Bayesian framework. Performance was also shown to improve when within-class averaging of log-likelihoods was carried out both spatially and across scales.

By capturing the high-order, nonlinear relationships that exist among features, and by providing an explicit estimate of the governing probability law, the model is able to extend the performance of otherwise traditional systems that rely on probabilistic models. The practical value of the cluster-based probability model was demonstrated by its effectiveness in the following applications: image restoration, lossless image and texture compression, and texture classification.

Acknowledgments

Thanks to the anonymous reviewers for their helpful suggestions, and to Michael I. Jordan for bringing to our attention the relationship between the k-means and EM algorithms.

A Proof of Proposition 1

We consider the case where the kernels have finite support. The infinite-support case is more involved and is not considered here.

Since $U_{j,I} \subseteq V \subseteq (U_{j,I} \cup U_{j,B})$,

$$\int_{U_{j,I}} f(X)\,dX \;\le\; \int_{V} f(X)\,dX \;\le\; \int_{U_{j,I}} f(X)\,dX + \int_{U_{j,B}} f(X)\,dX \qquad (A.1)$$

and

$$\int_{U_{j,I}} \tilde{f}_{C,j}(X)\,dX \;\le\; \int_{V} \tilde{f}_{C,j}(X)\,dX \;\le\; \int_{U_{j,I}} \tilde{f}_{C,j}(X)\,dX + \int_{U_{j,B}} \tilde{f}_{C,j}(X)\,dX. \qquad (A.2)$$

Expressions (A.1) and (A.2) can be combined to yield

$$\alpha - \beta \;\le\; \int_{V} \left[ f(X) - \tilde{f}_{C,j}(X) \right] dX \;\le\; \alpha + \gamma \qquad (A.3)$$

where

$$\alpha = \int_{U_{j,I}} \left[ f(X) - \tilde{f}_{C,j}(X) \right] dX, \qquad \beta = \int_{U_{j,B}} \tilde{f}_{C,j}(X)\,dX, \qquad \text{and} \qquad \gamma = \int_{U_{j,B}} f(X)\,dX.$$

Let $N(U_{j,I})$ and $N(U_{j,B})$ denote the number of training samples that lie in $U_{j,I}$ and $U_{j,B}$, respectively. Choose the kernel supports sufficiently small that all kernels are strictly contained in their respective cells; this is possible because of the finite-support hypothesis. Under this condition, the integral of the cluster-based probability model over the $m$th cell is simply the weight $w_m$ of that cell, so that

$$\int_{U_{j,I}} \tilde{f}_{C,j}(X)\,dX = N(U_{j,I})/N \;\xrightarrow{P}\; \mathrm{Prob}\{X \in U_{j,I}\} = \int_{U_{j,I}} f(X)\,dX \qquad (A.4)$$

and

$$\int_{U_{j,B}} \tilde{f}_{C,j}(X)\,dX = N(U_{j,B})/N \;\xrightarrow{P}\; \mathrm{Prob}\{X \in U_{j,B}\} = \int_{U_{j,B}} f(X)\,dX, \qquad (A.5)$$

where the arrows with the P's above them signify convergence in probability (as $N \to \infty$) by the weak law of large numbers. We can rewrite (A.4) and (A.5) as

$$\alpha \xrightarrow{P} 0 \qquad \text{and} \qquad \beta \xrightarrow{P} \gamma. \qquad (A.6)$$

Combining (A.3) and (A.6) gives

$$\lim_{N \to \infty} \mathrm{Prob}\left\{ \left| \int_{V} \left[ f(X) - \tilde{f}_{C,j}(X) \right] dX \right| > \epsilon \right\} = 0,$$

where we have used the definition of convergence in probability, and, in simplifying the left-hand side of (A.3), the fact that if $a \xrightarrow{P} a'$ and $b \xrightarrow{P} b'$, then $(a+b) \xrightarrow{P} (a'+b')$.

All that remains is to show that $\gamma < \epsilon$. To this end, choose $\delta$ sufficiently small so that

$$\int_{S_\delta} f(X)\,dX < \epsilon. \qquad (A.7)$$

This is possible because of condition (2.8) and the boundedness of $f$. Let $j_0$ be such that $\mathrm{Diam}(U_j) < \delta$ whenever $j > j_0$. It follows that $U_{j,B} \subseteq S_\delta$, so that

$$\gamma = \int_{U_{j,B}} f(X)\,dX \;\le\; \int_{S_\delta} f(X)\,dX < \epsilon. \qquad (A.8)$$

Notice the crucial role played by $\{w_m\}$, the normalized cell populations. If these weights were absent, then convergence of the estimate would be contingent on whether the distribution of the kernel centers is the same as the distribution being modeled. Defining the weights in this way makes the estimate asymptotically insensitive to the precise choice of partition.

B The parameter γ

The spread parameter γ in (2.2) is necessary because the sample variance tends to underestimate the value required to allow the Gaussian kernels to add up to a uniform density in regions where the density is in fact uniform. Without γ > 1, the resulting density estimate would be too high at the kernel centers and too low at the cell boundaries.

This phenomenon is most easily illustrated in one dimension.

Figure 8: Ratio of estimated cell-center to cell-edge density for a uniform distribution in one dimension, as a function of the global spread parameter γ, for large M.

Suppose that the true density $f$ is uniform on $[-a, a] \subset \mathbb{R}$. Assume that a partition is obtained by dividing $[-a, a]$ into $M$ equal-sized cells. The within-cell variance is then $a^2/(3M^2)$. Suppose that kernels of the form (2.2) are centered at every cell. For the innermost cell, we can examine the ratio of $\tilde{f}$ evaluated at the center of this cell ($X = 0$) to its value at the right boundary ($X = a/M$). Call this ratio $R(\gamma)$. Since the true distribution is uniform, we desire that $R(\gamma)$ be unity, at least when $M$ is large. Assuming $M$ is odd, the ratio is given by

$$R(\gamma) = \frac{\tilde{f}_C(0)}{\tilde{f}_C(a/M)} = \frac{1 + 2\sum_{m=1}^{(M-1)/2} e^{-6m^2/\gamma^2}}{e^{-3M^2/(2\gamma^2)} + 2\sum_{m=1}^{(M-1)/2} e^{-3(2m-1)^2/(2\gamma^2)}}, \qquad (B.1)$$

which is plotted in the limit as $M \to \infty$ in Figure 8. The figure shows that if γ = 1 (corresponding to leaving γ out altogether), then the center-to-edge probability ratio is 2.25, implying that $\tilde{f}$ is strongly nonuniform. As γ increases, the ratio approaches the desired value of unity as a limit. But using an excessively large value for γ reduces the ability of the estimate to adapt to local variations in regions where the true density is nonuniform. The best choice for γ depends on both the true distribution and on the particular partition U. In practice, a good choice can be determined empirically by trying several values, then selecting the one that results in the best performance in the given application. Often this corresponds to choosing the value of γ which maximizes the likelihood of the given data. In obtaining the experimental results presented in Sections 3 through 5, the best values found for γ ranged between 1.1 and 1.5. It should be noted that the use of a single, global variance multiplier γ is suboptimal; it would be better (though computationally more expensive) to optimize all of the variances (and all of the means, for that matter) by some optimization technique such as the expectation-maximization (EM) algorithm, as discussed in Section 2.2. However, for large M (e.g., M = 1,024), the computational cost of EM appears to be prohibitive, so the device of introducing the spread parameter γ is a reasonable recourse.
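A short numerical check of (B.1), as reconstructed above, can be written as follows; taking a large odd M approximates the limiting curve plotted in Figure 8. The specific M used and the function name are arbitrary choices for illustration.

    import numpy as np

    def center_to_edge_ratio(gamma, M=1001):
        """Evaluate R(gamma) of (B.1): the estimated density at the center of
        the innermost cell divided by its value at the cell edge, for a uniform
        true density split into M equal cells (M odd)."""
        m = np.arange(1, (M - 1) // 2 + 1)
        num = 1.0 + 2.0 * np.sum(np.exp(-6.0 * m**2 / gamma**2))
        den = (np.exp(-3.0 * M**2 / (2.0 * gamma**2))
               + 2.0 * np.sum(np.exp(-3.0 * (2*m - 1)**2 / (2.0 * gamma**2))))
        return num / den

    # center_to_edge_ratio(1.0) gives approximately 2.25, the value quoted in
    # the text; the ratio decreases toward 1 as gamma grows (e.g., roughly 1.01
    # at gamma = 2), consistent with the behavior shown in Figure 8.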

References

[1] J. Besag, "Spatial interaction and the statistical analysis of lattice systems (with discussion)," J. Roy. Stat. Soc., Ser. B, vol. 36, pp. 192-236, 1974.
[2] B. W. Silverman, Density Estimation for Statistics and Data Analysis. London: Chapman and Hall, 1986.
[3] D. Scott and J. Thompson, "Probability density estimation in higher dimensions," in Computer Science and Statistics: Proceedings of the Fifteenth Symposium on the Interface (J. Gentle, ed.), (Amsterdam), pp. 173-179, North-Holland, 1983.
[4] D. W. Scott, Multivariate Density Estimation. New York: John Wiley & Sons, 1992.
[5] B. Everitt and D. Hand, Finite Mixture Distributions. London: Chapman and Hall, 1981.
[6] R. A. Redner and H. F. Walker, "Mixture densities, maximum likelihood, and the EM algorithm," SIAM Review, vol. 26, pp. 195-239, April 1984.
[7] R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis. New York: Wiley, 1973.
[8] L. R. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition. PTR Prentice Hall, 1993.
[9] S. Haykin, Neural Networks: A Comprehensive Foundation. Macmillan, 1994.
[10] L. Devroye, A Course in Density Estimation. Boston: Birkhauser, 1987.
[11] T. M. Cover and J. A. Thomas, Elements of Information Theory. John Wiley and Sons, 1991.
[12] J. A. Rice, Mathematical Statistics and Data Analysis. Wadsworth and Brooks/Cole, 1988.
[13] D. Hand, Kernel Discriminant Analysis. Research Studies Press, 1982.
[14] A. Gersho and R. M. Gray, Vector Quantization and Signal Compression. Kluwer Academic Publishers, 1991.
[15] A. K. Jain and R. C. Dubes, Algorithms for Clustering Data. Prentice Hall, 1988.
[16] Y. Linde, A. Buzo, and R. M. Gray, "An algorithm for vector quantizer design," IEEE Trans. Comm., vol. COM-28, pp. 84-95, Jan. 1980.
[17] J. Rissanen, "Modeling by shortest data description," Automatica, vol. 14, pp. 465-471, 1978.
[18] A. P. Dempster, N. Laird, and D. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," J. Royal Stat. Soc., vol. 39, pp. 1-38, 1977.
[19] K. Popat and R. W. Picard, "A novel cluster-based probability model for texture synthesis, classification, and compression," in Proc. SPIE Visual Communications '93, (Cambridge, Mass.), 1993.
[20] A. Gersho, "Optimal nonlinear interpolative vector quantization," IEEE Trans. Comm., vol. 38, pp. 1285-1287, Sept. 1990.
[21] G. Langdon, A. Gulati, and E. Seiler, "On the JPEG model for lossless image compression," in Proc. IEEE Data Compression Conf., (Utah), 1992.
[22] M. Rabbani and P. W. Jones, Digital Image Compression Techniques. Bellingham, Washington: SPIE Optical Engineering Press, 1991.
[23] J. J. Rissanen and G. G. Langdon, "Arithmetic coding," IBM J. Res. Develop., vol. 23, pp. 149-162, March 1979.
[24] J. Rissanen and G. G. Langdon, "Universal modeling and coding," IEEE Trans. Inform. Theory, vol. IT-27, pp. 12-23, 1981.
[25] G. G. Langdon, "An introduction to arithmetic coding," IBM J. Res. Develop., Mar. 1984.


Figure 9: The training set (first twenty-five images) and test set (last two images, cman and lenna) for experiments involving natural scenes. All are 8-bit monochrome, 512 x 512, except for cman, which is 256 x 256.

Figure 10: Textures from the Brodatz album [37]: D1 (aluminum wire mesh), D15 (straw), D20 (magnified French canvas), D21 (French canvas), D22 (reptile skin), D55 (straw matting), D68 (wood grain), D77 (cotton canvas), D84 (raffia looped to a high pile), D103 (loose burlap). These textures are 256 x 256, cut out of a 512 x 512 original. All images are 8-bit monochrome.

[26] K. Popat, "Scalar quantization with arithmetic coding," Master's thesis, Dept. of Elec. Eng. and Comp. Science, M.I.T., Cambridge, Mass., 1990.
[27] G. Langdon and J. Rissanen, "Compression of black-white images with arithmetic coding," IEEE Trans. Comm., vol. COM-29, pp. 858-867, June 1981.
[28] K. Popat and R. W. Picard, "Exaggerated consensus in lossless image compression," in ICIP-94: 1994 International Conference on Image Processing, (Austin, TX), IEEE, Nov. 1994.
[29] R. W. Picard, "Content access for image/video coding: 'The Fourth Criterion'," Tech. Rep. 295, MIT Media Lab, Perceptual Computing, Cambridge, MA, 1994. MPEG Doc. 127, Lausanne, 1995.
[30] K. Perlmutter, S. Perlmutter, R. Gray, R. Olshen, and K. Oehler, "Bayes risk weighted vector quantization with posterior estimation for image compression and classification," IEEE Transactions on Image Processing, vol. 5, pp. 347-360, February 1996.
[31] J. Hertz, A. Krogh, and R. G. Palmer, Introduction to the Theory of Neural Computation. Addison-Wesley, 1991.
[32] T. Poggio and F. Girosi, "Networks for approximation and learning," Proceedings of the IEEE, vol. 78, pp. 1481-1497, Sept. 1990.
[33] T. Poggio and F. Girosi, "Extension of a theory of networks for approximation and learning: Dimensionality reduction and clustering," A.I. Memo No. 1167, Artificial Intelligence Laboratory, Massachusetts Institute of Technology, 1990.
[34] J. Moody and C. J. Darken, "Fast learning in networks of locally-tuned processing units," Neural Computation, vol. 1, pp. 281-294, 1989.

[35] M. D. Richard and R. P. Lippmann, "Neural network classifiers estimate Bayesian a posteriori probabilities," Neural Computation, vol. 3, pp. 461-483, 1991.
[36] P. Pudil, J. Novovicova, and J. Kittler, "Simultaneous learning of decision rules and important attributes for classification problems in image analysis," Image and Vision Computing, vol. 12, pp. 193-198, April 1994.
[37] P. Brodatz, Textures: A Photographic Album for Artists and Designers. New York: Dover, 1966.
