

IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 17, NO. 6, JUNE 2008 897

Maximum-Entropy Expectation-Maximization Algorithm for Image Reconstruction and Sensor Field Estimation

Hunsop Hong, Student Member, IEEE, and Dan Schonfeld, Senior Member, IEEE

Abstract—In this paper, we propose a maximum-entropy expectation-maximization (MEEM) algorithm. We use the proposed algorithm for density estimation. The maximum-entropy constraint is imposed for smoothness of the estimated density function. The derivation of the MEEM algorithm requires determination of the covariance matrix in the framework of the maximum-entropy likelihood function, which is difficult to solve analytically. We, therefore, derive the MEEM algorithm by optimizing a lower bound of the maximum-entropy likelihood function. We note that the classical expectation-maximization (EM) algorithm has been employed previously for 2-D density estimation. We propose to extend the use of the classical EM algorithm for image recovery from randomly sampled data and sensor field estimation from randomly scattered sensor networks. We further propose to use our approach in density estimation, image recovery, and sensor field estimation. Computer simulation experiments are used to demonstrate the superior performance of the proposed MEEM algorithm in comparison to existing methods.

Index Terms—Expectation-maximization (EM), Gaussian mixture model (GMM), image reconstruction, kernel density estimation, maximum entropy, Parzen density, sensor field estimation.

I. INTRODUCTION

ESTIMATING an unknown probability density function (pdf) given a finite set of observations is an important aspect of many image processing problems. The Parzen windows method [1] is one of the most popular methods which provides a nonparametric approximation of the pdf based on the underlying observations. It can be shown to converge to an arbitrary density function as the number of samples increases. The sample requirement, however, is extremely high and grows dramatically as the complexity of the underlying density function increases. Reducing the computational cost of the Parzen windows density estimation method is an active area of research. Girolami and He [2] present an excellent review of recent developments in the literature. There are three broad categories of methods adopted to reduce the computational cost of the Parzen windows density estimation for large sample sizes: a) approximate kernel decomposition methods [3], b) data reduction methods [4], and c) sparse functional approximation methods.

Manuscript received March 29, 2007; revised January 13, 2008. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Gaurav Sharma.

The authors are with the Multimedia Communications Laboratory, Department of Electrical and Computer Engineering, University of Illinois at Chicago, Chicago, IL 60607-7053 USA (e-mail: [email protected]; [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TIP.2008.921996


Sparse functional approximation methods like support vector machines (SVM) [5] obtain a sparse representation in approximation coefficients and, therefore, reduce computational costs for performance on a test set. Excellent results are obtained using these methods. However, these methods scale poorly with the number of samples, making them computationally expensive. The reduced set density estimator (RSDE) developed by Girolami and He [2] provides a superior sparse functional approximation method which is designed to minimize an integrated squared-error (ISE) cost function. The RSDE formulates a quadratic programming problem and solves it for a reduced set of nonzero coefficients to arrive at an estimate of the pdf. Despite the computational efficiency of the RSDE in density estimation, it can be shown that this method suffers from some important limitations [6]. In particular, not only does the linear term in the ISE measure result in a sparse representation, but its optimization leads to assigning all the weights to zero with the exception of the sample point closest to the mode, as observed in [2] and [6]. As a result, the ISE-based approach to density estimation degenerates to a trivial solution characterized by an impulse coefficient distribution resulting in a single kernel density function as the number of data samples increases.

However, the expectation-maximization (EM) algorithm [7] provides a very effective and popular alternative for estimating model parameters. It provides an iterative solution, which converges to a local maximum of the likelihood function. Although the solution to the EM algorithm provides the maximum likelihood estimate of the kernel model for the density function, the resulting estimate is not guaranteed to be smooth and may still preserve some of the sharpness of the ISE-based density estimation methods. A common method used in regularization theory to ensure smooth estimates is to impose the maximum entropy constraint. There have been some attempts to combine the entropy criterion with the EM algorithm. Byrne [8] proposed an iterative image reconstruction algorithm based on cross-entropy minimization using the Kullback–Leibler (KL) divergence measure [9]. Benavent et al. [10] presented an entropy-based EM algorithm for the Gaussian mixture model in order to determine the optimal number of centers. However, despite the efforts to use maximum entropy to obtain smoother density estimates, thus far, there have been no successful attempts to expand the EM algorithm by incorporating a maximum-entropy penalty-based approach to estimating the optimal weight, mean, and covariance matrix.

In this paper, we introduce several novel methods for smooth kernel density estimation by relying on a maximum-entropy


penalty and use the proposed methods for the solution of important applications in image reconstruction and sensor field estimation. The remainder of the paper is organized as follows. In Section II, we first introduce kernel density estimation and present the integrated squared-error (ISE) cost function. We subsequently introduce the maximum-entropy ISE-based density estimation to ensure that the estimated density function is smooth and does not suffer from the degeneracy of the ISE-based kernel density estimation. Determination of the maximum-entropy ISE-based cost function is a difficult task and generally requires the use of iterative optimization techniques. We propose the hierarchical maximum-entropy kernel density estimation (HMEKDE) method by using a hierarchical tree structure for the decomposition of the density estimation problem under the maximum-entropy constraint at multiple resolutions. We derive a closed-form solution to the hierarchical maximum-entropy kernel density estimate for implementation on binary trees. We also propose an iterative solution to a penalty-based maximum-entropy density estimation by using Newton's method. The methods discussed in this section provide the optimal weights for kernel density estimates which rely on fixed kernels located at a few samples. In Section III, we propose the maximum-entropy expectation-maximization (MEEM) algorithm to provide the optimal estimates of the weight, mean, and covariance for kernel density estimation. We investigate the performance of the proposed MEEM algorithm for 2-D density estimation and provide computer simulation experiments comparing the various methods presented for the solution of maximum-entropy kernel density estimation in Section IV. We propose the application of both the EM and MEEM algorithms for image reconstruction from randomly sampled images and sensor field estimation from randomly scattered sensors in Section V. The basic EM algorithm estimates a complete data set from partial data sets, and, therefore, we propose to use the EM and MEEM algorithms in these image reconstruction and sensor network applications. We present computer simulations of the performance of the various methods for kernel density estimation for these applications and discuss the advantages and disadvantages in various applications. A discussion of the performance of the MEEM algorithm as the number of kernels varies is provided in Section VI. Finally, in Section VII, we provide a brief summary and discussion of our results.

II. KERNEL-BASED DENSITY ESTIMATION

A. Parzen Density Estimation

The Parzen density estimator using the Gaussian kernel is given by Torkkola [11]

$\hat{p}(\mathbf{x}) = \dfrac{1}{N}\sum_{i=1}^{N} K(\mathbf{x}-\mathbf{x}_i,\sigma^{2})$   (1)

where $N$ is the total number of observations and $K$ is the isotropic Gaussian kernel defined by

$K(\mathbf{x}-\mathbf{x}_i,\sigma^{2}) = \dfrac{1}{(2\pi\sigma^{2})^{d/2}}\exp\!\left(-\dfrac{\|\mathbf{x}-\mathbf{x}_i\|^{2}}{2\sigma^{2}}\right)$   (2)

The main limitation of the Parzen windows density estimator is the very high computational cost due to the very large number of kernels required for its representation.
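
As a concrete illustration of (1)-(2), the following minimal NumPy sketch evaluates a Parzen estimate on a grid. The bandwidth value, grid, and synthetic data are illustrative assumptions, not values prescribed by the paper.

```python
import numpy as np

def parzen_density(x_eval, samples, sigma):
    """Parzen window estimate with an isotropic Gaussian kernel, as in (1)-(2).

    x_eval  : (P, d) points at which the density is evaluated
    samples : (N, d) observed data points
    sigma   : kernel width (assumed identical for every sample)
    """
    d = samples.shape[1]
    # Squared distances between every evaluation point and every sample.
    diff = x_eval[:, None, :] - samples[None, :, :]          # (P, N, d)
    sq_dist = np.sum(diff ** 2, axis=-1)                      # (P, N)
    norm = (2.0 * np.pi * sigma ** 2) ** (d / 2.0)
    kernels = np.exp(-sq_dist / (2.0 * sigma ** 2)) / norm    # (P, N)
    return kernels.mean(axis=1)                               # average over the N kernels

# Usage: estimate a 2-D density on a coarse grid from 500 samples.
rng = np.random.default_rng(0)
data = rng.normal(size=(500, 2))
grid = np.stack(np.meshgrid(np.linspace(-3, 3, 50),
                            np.linspace(-3, 3, 50)), axis=-1).reshape(-1, 2)
p_hat = parzen_density(grid, data, sigma=0.2)
```

The cost of this evaluation grows with the number of kernels, which is exactly the limitation noted above.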

B. Kernel Density Estimation

We seek an approximation to the true density of the form

$\hat{f}(\mathbf{x}) = \sum_{m=1}^{M}\alpha_{m}\,K(\mathbf{x}-\mathbf{c}_{m},\sigma^{2})$   (3)

where $M$ is the number of kernel centers $\mathbf{c}_m$ and the function $K$ denotes the Gaussian kernel defined in (2). The weights $\alpha_m$ must be determined such that the overall model remains a pdf, i.e.,

$\sum_{m=1}^{M}\alpha_{m} = 1, \qquad \alpha_{m} \geq 0$   (4)

Later in this paper, we will explore the simultaneous optimization of the mean, variance, and weights of the Gaussian kernels. Here, we focus exclusively on the weights $\alpha_m$. The variances and means of the Gaussian kernels are estimated by using the $k$-means algorithm in order to reduce the computational burden. Specifically, the centers of the kernels in (3) are determined by $k$-means clustering, and the variance of the kernels is set to the mean of the Euclidean distance between centers [12]. We assume that $N$ is significantly greater than $M$ since the Parzen method relies on delta functions at the sample data which are represented by Gaussian functions with very narrow variance. The mixture of Gaussians model, on the other hand, relies on a few Gaussian kernels and the variance of each Gaussian function is designed to capture many sample points.
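
A minimal sketch of this initialization is shown below. It uses SciPy's k-means routine purely for illustration, and the rule of setting the shared variance to the mean inter-center Euclidean distance follows the description above; the exact normalization used in [12] may differ, and the uniform starting weights are an assumption.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def init_kernels(samples, num_centers, seed=0):
    """Initialize the reduced kernel model (3): centers by k-means, and a shared
    kernel width set to the mean Euclidean distance between the centers."""
    centers, labels = kmeans2(samples, num_centers, seed=seed, minit='++')
    # Mean pairwise distance between distinct centers.
    diff = centers[:, None, :] - centers[None, :, :]
    dist = np.sqrt(np.sum(diff ** 2, axis=-1))
    mean_dist = dist[np.triu_indices(num_centers, k=1)].mean()
    weights = np.full(num_centers, 1.0 / num_centers)   # uniform start, satisfies (4)
    return centers, mean_dist, weights
```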

Therefore, only the coefficients $\alpha_m$ are unknown. We rely on minimization of the error between $\hat{f}(\mathbf{x})$ and the true density $p(\mathbf{x})$ using the ISE method. The ISE cost function is given by

$\mathrm{ISE} = \int \bigl(\hat{f}(\mathbf{x}) - p(\mathbf{x})\bigr)^{2}\,d\mathbf{x}$   (5)

Substituting (1) and (3) into (5), the equation can be expanded and the order of integration and summation exchanged. Thus, we can write the cost function of (5) in vector-matrix form

(6)

where

(7)

Our goal is to minimize this function with respect to the weights under the conditions provided by (4). Equation (6) is a quadratic programming problem, which has a unique solution if the matrix

Page 3: IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 17, NO. 6 ...ds/Papers/HoSc08.pdf · cost of the Parzen windows density estimation for large sample sizes: a) approximate kernel decomposition

HONG AND SCHONFELD: MAXIMUM-ENTROPY EXPECTATION-MAXIMIZATION ALGORITHM 899

is positive semi-definite [13]. Therefore, the cost function can be simplified accordingly.
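
The quadratic program in (6) can be handed to a generic constrained solver. The sketch below assumes the quadratic term Q and linear term b of (6)-(7) have already been computed (they are not reproduced in this text) and that the objective has the common form a'Qa - 2b'a; it simply minimizes that form over the probability simplex of (4). SLSQP is an illustrative solver choice, not the one used in [2].

```python
import numpy as np
from scipy.optimize import minimize

def ise_weights(Q, b):
    """Minimize a quadratic ISE-type cost  a^T Q a - 2 b^T a  over the simplex (4).

    Q : (M, M) positive semi-definite matrix (quadratic term of (6))
    b : (M,)   linear term of (6)
    The construction of Q and b from (7) is assumed to be done elsewhere.
    """
    M = Q.shape[0]
    cost = lambda a: a @ Q @ a - 2.0 * b @ a
    grad = lambda a: 2.0 * (Q @ a) - 2.0 * b
    constraints = ({'type': 'eq', 'fun': lambda a: np.sum(a) - 1.0},)
    bounds = [(0.0, 1.0)] * M
    a0 = np.full(M, 1.0 / M)
    res = minimize(cost, a0, jac=grad, bounds=bounds,
                   constraints=constraints, method='SLSQP')
    return res.x
```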

In Appendix A, we prove that the solution of the ISE-based kernel density estimation degenerates as the number of observations increases to a trivial solution that concentrates the estimated probability mass in a single kernel. This degeneracy leads to a sharp peak in the estimated density, which is characterized by the minimum-entropy solution.

C. Maximum-Entropy Kernel Density Estimation

Given observations from an unknown probability distribution, there may exist an infinity of probability distributions consistent with the observations and any given constraints [14]. The maximum entropy principle states that under such circumstances we are required to be maximally uncertain about what we do not know, which corresponds to selecting the density with the highest entropy among all candidate solutions to the problem. In order to avoid degenerate solutions to (6), we maximize the entropy and minimize the divergence between the estimated distribution and the Parzen windows density estimate. Here, we use Renyi's quadratic entropy measure given by [11], which is defined as

$H_{2}(\hat{f}) = -\log \int \hat{f}^{2}(\mathbf{x})\,d\mathbf{x}$   (8)

Substituting (3) into (8), we obtain

By expanding the square and interchanging the order of summation and integration, we obtain the following:

(9)

Since the logarithm is a monotonic function, maximizing the logarithm of a function is equivalent to maximizing the function. Thus, the maximum entropy solution of the entropy can be reached by maximizing the function expressed in vector-matrix form

The optimal maximum entropy solution of the weight vector is

(10)

where the weight vector is subject to the constraints provided by (4).

1) Penalty-Based Approach Using Newton's Method: We

adopt the penalty-based approach by introducing an arbitrary constant to balance between the ISE and entropy cost functions. We, therefore, define a new cost function given by

where is the penalty coefficient. Since the variable is constant with respect to it will be omitted. We now have

(11)

Newton’s method for multiple variables is given in [15]

$\boldsymbol{\beta}^{(t+1)} = \boldsymbol{\beta}^{(t)} - \mathbf{H}^{-1}\bigl(\boldsymbol{\beta}^{(t)}\bigr)\,\nabla J\bigl(\boldsymbol{\beta}^{(t)}\bigr)$   (12)

where $t$ denotes the iteration. We will use the soft-max function for the weight constraint [16]. The weight $\alpha_k$ of the $k$th center can be expressed as

$\alpha_{k} = \dfrac{\exp(\beta_{k})}{\sum_{j=1}^{M}\exp(\beta_{j})}$   (13)

Therefore, the derivative of the weight with respect to $\beta_j$ is given by

$\dfrac{\partial \alpha_{k}}{\partial \beta_{j}} = \begin{cases} \alpha_{k}(1-\alpha_{k}), & k = j \\ -\alpha_{k}\alpha_{j}, & k \neq j \end{cases}$   (14)

For convenience, we define the following variables:

(15)

(16)

(17)

(18)

(19)

We can now express (11) using (15) and (18)

(20)

The element of the gradient of (20) is given by

The derivation of the gradient is provided in Appendix B. From (57), (58), and (62), the element of the Hessian matrix is given by the following.

a)

(21)

b)

(22)

The detailed derivations of the Hessian matrix are also presented in Appendix B. We assume that the Hessian matrix is positive definite. Finally, the gradient and Hessian required for the iteration in (12) can be generated using (21), (22), and (59).
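
Since the gradient and Hessian expressions (15)-(22) are derived in Appendix B, a full Newton implementation is not reproduced here. The sketch below only illustrates the structure of the penalty-based formulation: the weights are reparameterized by the soft-max (13), the objective combines an ISE-type quadratic term with a Renyi-entropy penalty weighted by a constant gamma, and a generic quasi-Newton routine stands in for the damped Newton/Levenberg–Marquardt iteration (12). The particular objective pieces (Q, b, R) and the BFGS choice are assumptions for illustration, not the paper's exact expressions.

```python
import numpy as np
from scipy.optimize import minimize

def softmax(beta):
    """Soft-max weights of (13); subtracting the max keeps the exponentials stable."""
    e = np.exp(beta - beta.max())
    return e / e.sum()

def max_entropy_weights(Q, b, R, gamma):
    """Penalty-based maximum-entropy weight estimation (sketch).

    Q, b  : quadratic and linear terms of the ISE cost (6)   [assumed precomputed]
    R     : matrix with R[i, j] = integral of kernel_i * kernel_j, so that
            Renyi's quadratic entropy (8) of the mixture is -log(a^T R a)
    gamma : penalty coefficient trading the ISE term against the entropy term
    """
    def cost(beta):
        a = softmax(beta)
        ise = a @ Q @ a - 2.0 * b @ a          # data-fit term
        entropy = -np.log(a @ R @ a)           # Renyi quadratic entropy of the mixture
        return ise - gamma * entropy           # maximizing entropy == subtracting it
    beta0 = np.zeros(Q.shape[0])
    res = minimize(cost, beta0, method='BFGS')  # quasi-Newton stand-in for (12)
    return softmax(res.x)
```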

2) Constraint-Based Approach Using a Hierarchical Binary Tree: Our preference is to avoid penalty-based methods and to derive the optimal weights as a constrained optimization problem. Specifically, we seek the maximum entropy weights

Page 4: IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 17, NO. 6 ...ds/Papers/HoSc08.pdf · cost of the Parzen windows density estimation for large sample sizes: a) approximate kernel decomposition

900 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 17, NO. 6, JUNE 2008

Fig. 1. Binary tree structure for hierarchical density estimation.

such that its corresponding ISE cost function does not exceed the optimal ISE cost beyond a prespecified value. We thus define the maximum-entropy coefficients to be given by

(23)

such that the ISE constraint above is satisfied. A closed-form solution to this problem is difficult to obtain in

general. However, we can obtain the closed-form solution when the number of centers is limited to two. Hence, we form an iterative process, where we assume that we only have two centers at each iteration. We represent this iterative process as a hierarchical model, which generates new centers at each iteration. We use a binary tree to illustrate the hierarchical model, where each node in the tree depicts a single kernel. Therefore, in the binary tree, each parent node has two children nodes as seen in Fig. 1. The final density function corresponds to the kernels at the leaves of the tree. We now wish to determine the maximum entropy kernel density estimation at each iteration of the hierarchical binary tree. We, therefore, seek the maximum entropy coefficients. Note that the sum of these coefficients is dictated by the corresponding coefficients of their parent node. This restriction will ensure that the sum of the coefficients of all the leaf nodes (i.e., nodes with no children) is one since we set the coefficient of the root parent node to 1. We simplify the notation by considering $\alpha_1$ and $\alpha_2$ to be the coefficients of the children nodes, where $\alpha_p$ is used to denote the coefficient of their corresponding parent node (i.e., $\alpha_1 + \alpha_2 = \alpha_p$). This implies that it is sufficient to characterize the optimal coefficient $\alpha_1$ such that $0 \le \alpha_1 \le \alpha_p$.

The samples are divided into two groups using the $k$-means

method at each node. Let us adopt the following notation:

(24)

where . From (6) and (7), we observe that

(25)

(26)

where , , and .

The constraint in the maximum entropy problem is defined such that its corresponding ISE cost function does not exceed the optimal ISE cost beyond a prespecified value. From (6), (25), and (26), we can determine the optimal ISE

coefficient by minimization of the cost given by

(27)

such that . It is easy to show that

(28)

Therefore, from (6), (25), (26), and (28), we have

(29)

We assume, without loss of generality, that . Therefore, the constant is equivalent to

From (10) and (25), we observe that the maximum entropy

coefficient is given by

(30)

such that and.

Therefore, from (30), we form the Lagrangian given by

Differentiating with respect to and setting to zero, we have

(31)

We shall now determine the Lagrange multiplier by satisfying the constraint

(32)

From (31) and (32), we observe that

(33)

Therefore, from (33) and (31), we observe that

(34)

Finally, we impose the condition . Therefore, from (34), we have
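
The recursion just described can be summarized in a few lines. The sketch below only shows the tree-building control flow: at each node the samples are split into two groups by 2-means, the parent mass alpha_p is divided between the two children according to the closed-form maximum-entropy rule derived above (represented here by a placeholder split_mass function, since the final expression following (34) is not reproduced in this text), and splitting stops when a node holds less than a prescribed fraction of the samples (5% in the experiments of Section IV).

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def split_mass(alpha_parent, group_a, group_b, epsilon):
    """Placeholder for the closed-form maximum-entropy split of the parent mass
    subject to the ISE constraint (23); the exact formula follows (34) in the paper."""
    # Illustrative fallback: split in proportion to the group sizes.
    frac = len(group_a) / (len(group_a) + len(group_b))
    return alpha_parent * frac, alpha_parent * (1.0 - frac)

def build_tree(samples, alpha_parent=1.0, min_frac=0.05, total=None, epsilon=0.1):
    """Recursively split the samples into a binary tree of kernels (HMEKDE sketch).
    Returns a list of (leaf center, leaf coefficient) pairs."""
    total = len(samples) if total is None else total
    if len(samples) < min_frac * total or len(samples) < 2:
        # Leaf: one kernel at the sample mean carrying the whole parent mass.
        return [(samples.mean(axis=0), alpha_parent)]
    centers, labels = kmeans2(samples, 2, seed=0, minit='++')
    group_a, group_b = samples[labels == 0], samples[labels == 1]
    if len(group_a) == 0 or len(group_b) == 0:
        return [(samples.mean(axis=0), alpha_parent)]
    alpha_a, alpha_b = split_mass(alpha_parent, group_a, group_b, epsilon)
    return (build_tree(group_a, alpha_a, min_frac, total, epsilon) +
            build_tree(group_b, alpha_b, min_frac, total, epsilon))
```

Because the root carries a coefficient of 1 and each split conserves the parent mass, the leaf coefficients automatically satisfy the sum-to-one constraint (4).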

Page 5: IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 17, NO. 6 ...ds/Papers/HoSc08.pdf · cost of the Parzen windows density estimation for large sample sizes: a) approximate kernel decomposition

HONG AND SCHONFELD: MAXIMUM-ENTROPY EXPECTATION-MAXIMIZATION ALGORITHM 901

III. MAXIMUM-ENTROPY EXPECTATION-MAXIMIZATION ALGORITHM

As seen in the previous section, the ISE-based methods enable pdf estimation given a set of observations without information about the underlying density. However, the ISE-based solutions do not fully utilize the sample information as the number of samples increases. Moreover, ISE-based methods are generally used to determine the optimal weights used in the linear combination. Selection of the mean and variance of the kernel functions is accomplished by using the $k$-means algorithm, which can be viewed as a hard-limiting case of the EM algorithm [7]. The EM algorithm offers an approximation of the pdf by an iterative optimization under the maximum likelihood criterion.

A probability density function can be approximated as the sum of Gaussian functions

$p(\mathbf{x}) \approx \sum_{m=1}^{M}\alpha_{m}\,G(\mathbf{x};\mathbf{c}_{m},\boldsymbol{\Sigma}_{m})$   (35)

where $\mathbf{c}_m$ is the center of a Gaussian function, $\boldsymbol{\Sigma}_m$ is the covariance matrix of the $m$th function, and $\alpha_m$ is the weight for each center, which is subject to the conditions in (4). The Gaussian function is given by

$G(\mathbf{x};\mathbf{c}_{m},\boldsymbol{\Sigma}_{m}) = \dfrac{1}{(2\pi)^{d/2}|\boldsymbol{\Sigma}_{m}|^{1/2}}\exp\!\left(-\tfrac{1}{2}(\mathbf{x}-\mathbf{c}_{m})^{T}\boldsymbol{\Sigma}_{m}^{-1}(\mathbf{x}-\mathbf{c}_{m})\right)$   (36)

From (35) and (36), we observe that the logarithm of the likelihood function for the given Gaussian mixture parameters, given $N$ observations, can be written as

$\log L(\Theta) = \sum_{n=1}^{N}\log\sum_{m=1}^{M}\alpha_{m}\,G(\mathbf{x}_{n};\mathbf{c}_{m},\boldsymbol{\Sigma}_{m})$   (37)

where $\mathbf{x}_n$ is the $n$th sample and $\Theta$ is the set of parameters (i.e., the weights, centers, and covariances) to be estimated.

The entropy term is added in order to make the estimated density function smooth and to avoid an impulse distribution. We expand Renyi's quadratic entropy measure [11] to incorporate the covariance matrices and use the measure again. Substituting (35) into (8), expanding the square, and interchanging the order of summation and integration, we obtain the following:

$H_{2} = -\log \sum_{i=1}^{M}\sum_{j=1}^{M}\alpha_{i}\alpha_{j}\,G\bigl(\mathbf{c}_{i};\mathbf{c}_{j},\boldsymbol{\Sigma}_{i}+\boldsymbol{\Sigma}_{j}\bigr)$   (38)

We, therefore, form an augmented likelihood function parameterized by a positive scalar in order to simultaneously maximize the entropy and likelihood using (37) and (38). The augmented likelihood function is given by

$\tilde{L}(\Theta) = \log L(\Theta) + \gamma H_{2}$   (39)

The expectation step of the EM algorithm can be separated into two terms: one is the expectation related to the likelihood and the other is the expectation related to the entropy penalty

(40)

(41)

where denotes that this expectation is from the likelihood function, denotes that this expectation is from the entropy penalty, and denotes the iteration number.

Jensen's inequality is applied to find the new lower bound of the likelihood functions using (40) and (41). Therefore, the lower bound function for the likelihood function can be derived as

(42)

Now, we wish to obtain a lower bound for the entropy. This bound cannot be derived using the method in (42)

since is not a concave function. To derive the lower bound, we, therefore, rely on a monotonically decreasing and concave function such that . The detailed derivation is provided in Appendix C. Notice that maximization of the entropy remains unchanged if we replace the function in (38) by since both are monotonically decreasing functions. We can now use Jensen's inequality to obtain the lower bound for the entropy

The lower bound which combines the two lower bounds is given by

(43)

Since we have the lower bound function, the new estimates of the parameters are easily calculated by setting the derivatives of the lower bound with respect to each parameter to zero.

Page 6: IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 17, NO. 6 ...ds/Papers/HoSc08.pdf · cost of the Parzen windows density estimation for large sample sizes: a) approximate kernel decomposition

902 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 17, NO. 6, JUNE 2008

A. Mean

The new estimates for the mean vectors can be obtained by taking the derivative of (43) with respect to $\mathbf{c}_m$ and setting it to zero. Therefore

(44)

B. Weight

For the weights, we once again use the soft-max function in (13) and (14). Thus, by setting the derivative of the lower bound with respect to $\beta_m$ to zero, the new estimated weight is given by

(45)

C. Covariance

In order to update the EM algorithm, the derivative of (43) with respect to the covariance is required. However, the derivative cannot be solved directly because of the existence of the inverse matrix which appears in the derivative. We, therefore, introduce a new lower bound for the EM algorithm using the Cauchy–Schwartz inequality. The lower bound given by (43) can be rewritten as

(46)

The term in (46) is equal to. Using the Cauchy–Schwartz

inequality and the fact that the Gaussian function is greater than or equal to zero, we obtain

(47)

Using (47) and the symmetry property of the Gaussian, we thus introduce a new lower bound for the covariance given by

Therefore, the new estimated covariance is attained by setting the derivative of the new lower bound with respect to the covariance to zero

(48)

We note that the EM algorithm presented here relies on a simple extension of the lower-bound maximization method in [17]. In particular, we can use this method to prove that our algorithm converges to a local maximum on the bound generated by the Cauchy–Schwartz inequality, which serves as a lower bound on the augmented likelihood function. Moreover, we would have attained a local maximum of the augmented likelihood function had we not used the Cauchy–Schwartz inequality to obtain a lower bound for the sum of the covariances. Note that the Cauchy–Schwartz inequality is met with equality if and only if the covariance matrices of the different kernels are identical. Therefore, if the kernels are restricted to have the same covariance structure, the maximum-entropy expectation-maximization algorithm converges to a local maximum of the augmented likelihood function.
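
To make the iteration structure explicit, the sketch below implements one standard EM step for the Gaussian mixture (35)-(37). It is a baseline only: the MEEM algorithm modifies the M-step with the entropy-penalty terms of (44), (45), and (48), which are not reproduced in this text and are therefore marked by a comment rather than implemented.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, weights, means, covs):
    """One standard EM iteration for the Gaussian mixture (35)-(37).

    The MEEM algorithm adds entropy-penalty corrections to the M-step below
    (see (44), (45), and (48)); those corrections are omitted in this sketch.
    """
    N, M = X.shape[0], len(weights)
    # E-step: responsibilities of every kernel for every sample.
    resp = np.empty((N, M))
    for m in range(M):
        resp[:, m] = weights[m] * multivariate_normal.pdf(X, means[m], covs[m])
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step (maximum-likelihood form).
    Nm = resp.sum(axis=0)                      # effective number of samples per kernel
    new_weights = Nm / N
    new_means = (resp.T @ X) / Nm[:, None]
    new_covs = []
    for m in range(M):
        diff = X - new_means[m]
        new_covs.append((resp[:, m, None] * diff).T @ diff / Nm[m])
    return new_weights, new_means, np.array(new_covs)
```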

IV. TWO-DIMENSIONAL DENSITY ESTIMATION

We apply the MEEM method and other conventional methods to a 2-D density estimation problem. Fig. 2(a) depicts the original 2-D density function and Fig. 2(b) displays a scatter plot of 500 data samples drawn from (49) in the interval [0,1]. The equation used for generating the samples is given by

(49)

where . Given data without knowledge of the underlying density function used to generate the observations, we must estimate the 2-D density function. Here, we use 500, 1000, 1500, and 2000 samples for the experiment. With the exception of the RSDE method, the other approaches cannot be used to determine the optimal number of centers since it will

Page 7: IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 17, NO. 6 ...ds/Papers/HoSc08.pdf · cost of the Parzen windows density estimation for large sample sizes: a) approximate kernel decomposition

HONG AND SCHONFELD: MAXIMUM-ENTROPY EXPECTATION-MAXIMIZATION ALGORITHM 903

Fig. 2. Comparison of 2-D density estimation from 500 samples. (a) Original density function; (b) 500 samples; (c) RSDE; (d) HMEKDE; (e) Newton's method; (f) conventional EM; (g) MEEM.

fluctuate based on variations in the problem (e.g., initial conditions). We determine the number of centers experimentally such that we assign fewer than 100 samples per center for Newton's method, EM, and MEEM. For the HMEKDE method, we terminate the splitting of the hierarchical tree when a leaf has less than 5% of the total number of samples.

The results of RSDE are shown in Fig. 2(c). The RSDE method is a very powerful algorithm in that it requires no parameters for the estimation. However, the choice of the kernel width is very crucial since it suffers from the degeneracy problem when the kernel width is large and the reduction performance is diminished when the kernel width is small. The results of Newton's method and HMEKDE are given in Fig. 2(d) and (e), respectively. The major practical issue in implementing Newton's method is the guarantee of a local minimum, which can be sustained by positive definiteness of the Hessian matrix [15]. Thus, we use the Levenberg–Marquardt algorithm [18], [19]. The value of the parameter in the HMEKDE method is chosen experimentally. The results of the conventional EM algorithm and the MEEM algorithm are shown in Fig. 2(f) and (g), respectively. The penalty variable in the MEEM algorithm is chosen experimentally. The result of MEEM is properly smoothed.

In Fig. 3, the SNR improvement as a function of the iteration and the value of the penalty parameter is displayed using 300 samples. We choose the value as proportional to the number of samples. The parameter values multiplied by the number of samples are shown in Fig. 3 (i.e., 0.05, 0.10, and 0.15). We observe the over-fitting problem of the EM algorithm in Fig. 3. The overall improvements in SNR are given in Table I.
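
For reference, the SNR figures reported in Fig. 3 and Table I compare the estimated density with the known true density. The sketch below uses a common definition of SNR, the ratio of the power of the true values to the power of the estimation error in decibels, evaluated on a regular grid; this convention is an assumption, since the paper's exact formula is not reproduced in this text.

```python
import numpy as np

def snr_db(true_vals, est_vals):
    """SNR (in dB) of a density estimate evaluated on a common grid.

    true_vals : values of the true density on the grid
    est_vals  : values of the estimated density on the same grid
    """
    signal_power = np.sum(true_vals ** 2)
    noise_power = np.sum((true_vals - est_vals) ** 2)
    return 10.0 * np.log10(signal_power / noise_power)
```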

V. IMAGE RECONSTRUCTION AND SENSOR FIELD ESTIMATION

The density estimation problem can easily be expanded into practical problems like image reconstruction from random samples. For the experiment, we use 256 × 256 gray Pepper, Lena, and Barbara images, which are shown in Fig. 4(a)–(c).

We take 50% of the samples of the Pepper image, 60% of the Lena image, and 70% of the Barbara image. We use the density function model in [20], where $L$ is the intensity value and $(x, y)$ is the location of a pixel. We estimate a density function of the given image from the samples. For the reduction of the computational

Fig. 3. SNR improvements according to iteration and the penalty parameter.

TABLE I
SNR COMPARISON OF ALGORITHM FOR 2-D DENSITY ESTIMATION

Fig. 4. Three 256 × 256 gray images used for the experiments. (a) Pepper, (b) Lena, and (c) Barbara, and two sensor fields used for sensor field estimation from randomly scattered sensors: (d) polynomial sensor field and (e) artificial sensor field.

burden, 50% overlapped 16 × 16 blocks are used for the experiment. Since the smoothness differs from block to block, we choose the smoothing parameter for each block experimentally. The initial center locations are equally spaced. We use 3 × 3 centers for the experiment. Using the estimated density function, we can estimate the intensity value at a given location using the expectation operation of the conditional density distribution function. The sampled image and the reconstruction results of Lena are shown in Fig. 5.
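
The reconstruction step can be made concrete as follows: once a mixture over (x, y, L) has been fitted within a block, the missing intensity at a pixel (or sensor) location is taken as the conditional expectation E[L | x, y] under that mixture. The sketch below assumes a Gaussian mixture parameterized by weights, means, and covariances in the joint location–intensity space; it illustrates the conditional-expectation operation described above, not the paper's exact block-processing code.

```python
import numpy as np
from scipy.stats import multivariate_normal

def expected_intensity(loc, weights, means, covs):
    """E[L | x, y] under a Gaussian mixture over (x, y, L).

    loc     : (2,) pixel (or sensor) location
    weights : (M,) mixture weights
    means   : (M, 3) component means, last coordinate is the intensity L
    covs    : (M, 3, 3) component covariances
    """
    M = len(weights)
    post = np.empty(M)
    cond_mean = np.empty(M)
    for m in range(M):
        mu_s, mu_L = means[m][:2], means[m][2]
        S_ss, S_Ls = covs[m][:2, :2], covs[m][2, :2]
        # Responsibility of component m for this location (marginal over L).
        post[m] = weights[m] * multivariate_normal.pdf(loc, mu_s, S_ss)
        # Conditional mean of L given the location for component m.
        cond_mean[m] = mu_L + S_Ls @ np.linalg.solve(S_ss, loc - mu_s)
    post /= post.sum()
    return post @ cond_mean
```

The same operation applies unchanged to the sensor field experiments described next, with the pixel location replaced by the sensor location.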

We can also expand our approach to the estimation of a sensor field from randomly scattered sensors. In this experiment, we generate an arbitrary field using polynomials in Fig. 4(d) and an artificial field in Fig. 4(e). The original sensor field is randomly sampled; 2% of the samples are used for the polynomial field and 30% of the samples are used for the artificial field. We use the density function model where $L$ is the intensity value and $(x, y)$ is the location of a sensor. 50% overlapped 32 × 32 blocks and

16 × 16 blocks are used for the estimation of the polynomial sensor field and the artificial sensor field, respectively, to reduce computational time. We also choose the smoothing parameter for each block experimentally. The initial center locations are equally spaced. We use 3 × 3 centers for each experiment. We estimate a density function of the given field using the sensors. For each algorithm except HMEKDE, we use equally spaced centers for the initial location of the centers. The sampled sensor field and the estimation results of the

Page 8: IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 17, NO. 6 ...ds/Papers/HoSc08.pdf · cost of the Parzen windows density estimation for large sample sizes: a) approximate kernel decomposition

904 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 17, NO. 6, JUNE 2008

TABLE II
SNR COMPARISON OF DENSITY ESTIMATION ALGORITHM FOR IMAGE RECONSTRUCTION AND SENSOR FIELD ESTIMATION

Fig. 5. Comparison of density estimation for image reconstruction from a randomly sampled image. (a) 60% sampled image; (b) RSDE; (c) HMEKDE; (d) Newton's method; (e) conventional EM; (f) MEEM.

Fig. 6. Comparison of density estimation for artificial sensor field estimation from randomly scattered sensors. (a) 30% sampled sensors; (b) RSDE; (c) HMEKDE; (d) Newton's method; (e) conventional EM; (f) MEEM.

artificial field are given in Fig. 6. The signal-to-noise ratio of the results and the computational time are also given in Table II.

VI. DISCUSSION

In this section, we discuss the relationship between the number of centers and minimum/maximum entropy. Our experimental results indicate that, in most cases, the results under the maximum entropy penalty are better than those of the conventional EM algorithm. However, in some limited cases, such as when we use a small number of centers, the results of the minimum entropy penalty are better than the results of the conventional EM algorithm and the maximum entropy penalty. This is due to the characteristics of maximum and minimum entropy, which are well described in [21]. The maximum entropy solution provides us with a smooth solution. In the case that the number of centers is relatively sufficient, each center can represent one Gaussian component piecewise, which means the resulting density function can be described better under the maximum entropy criterion. On the contrary, the minimum entropy solution gives us the least smooth distribution. In the case that the number of centers is insufficient, each center should represent a large number of samples; thus, the resulting distribution described by a center should be the least smooth one, since each center cannot be described in terms of a piecewise Gaussian any more. However, the larger the number of centers used, the better the result.

VII. CONCLUSION

In this paper, we develop a new algorithm for density estimation using the EM algorithm with a maximum-entropy (ME) constraint. The proposed MEEM algorithm provides a recursive method to compute a smooth estimate of the maximum likelihood estimate. The MEEM algorithm is particularly suitable for tasks that require the estimation of a smooth function from limited or partial data, such as image reconstruction and sensor field estimation. We demonstrated the superior performance of the proposed MEEM algorithm in comparison to various methods (including the traditional EM algorithm) in application to 2-D density estimation, image reconstruction from randomly sampled data, and sensor field estimation from scattered sensor networks.

Page 9: IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 17, NO. 6 ...ds/Papers/HoSc08.pdf · cost of the Parzen windows density estimation for large sample sizes: a) approximate kernel decomposition

HONG AND SCHONFELD: MAXIMUM-ENTROPY EXPECTATION-MAXIMIZATION ALGORITHM 905

APPENDIX A
DEGENERACY OF THE KERNEL DENSITY ESTIMATION

This appendix illustrates the degeneracy of kernel density estimation discussed in [6]. We will show that the ISE cost function converges asymptotically to the linear term as the number of data samples increases. Moreover, we show that optimization of the linear term leads to a trivial solution where all of the coefficients are zero except one, which is consistent with the observation in [2]. We will, therefore, establish that the minimal ISE coefficients will converge to an impulse coefficient distribution as the number of data samples increases. In the following proposition, we prove that the ISE cost function

in (6) decays asymptotically to the linear term as the number of data samples increases.

Proposition 1: as .
Proof: The ratio of the quadratic and linear terms in (6) is

given by

where we conclude that the quadratic term decays asymptotically at an exponential rate with an increasing number of data samples and the quadratic programming minimization problem in (6) reduces to a linear programming problem defined by the linear term.

Therefore, we can now determine the minimal ISE coefficients as the number of data samples increases from (6) by minimization of the linear programming problem defined by

; i.e.,

(50)

such that and when . In the following proposition, we show that the linear programming problem corresponding to the minimal ISE cost function as the number of data samples increases degenerates to a trivial distribution of the coefficients which consists of an impulse function. In particular, we assume that the elements in the vector have a unique maximum element with index

. This assumption generally corresponds to the case where the true density function has a distinct maximum leading to a high density region in the data samples. We show that the optimal distribution of the coefficients obtained from the solution of the linear programming problem in (50) is characterized by a spike corresponding to the maximum element and zero for all other coefficients.

Proposition 2: if and only if and.

Proof: We observe that

(51)

If we set and on the left side of (51),

and apply the constraint on the right, the inequality is met as an equality. Or

We now prove the converse. Therefore

Expanding the sum, we obtain

Canceling common terms and grouping terms with like coefficients, we observe that

(52)

Since in (52), this implies , . This result can be easily extended to the case where the

elements in the vector have maximum elements at indexes where . This situation generally arises when the true density function has several nearly equal modes leading to a few high density regions in the data sample. In this case, we can show that , where

if and only if and when

We now observe that the minimal ISE coefficient distribution decays asymptotically to a Kronecker delta function as the number of data samples increases (i.e., when and when ).

Corollary 1: as .

Proof: The proof is obtained directly from Propositions 1 and 2.

Page 10: IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 17, NO. 6 ...ds/Papers/HoSc08.pdf · cost of the Parzen windows density estimation for large sample sizes: a) approximate kernel decomposition

906 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 17, NO. 6, JUNE 2008

This corollary implies that the minimal ISE kernel density estimation leads to the degenerative approximation which consists of a single kernel and is given by

(53)

as the number of samples increases [see (3)]. We will now examine the entropy of the degenerative distribution given by (53), which has the lowest entropy among all possible kernel density estimates.

Proposition 3: .
Proof: We observe that , for all and .

Therefore, it follows that

Therefore, we have

(54)

Taking logarithms on both sides and multiplying by −1, we obtain

(55)

We now compute the entropy of the degenerative distribution . From (2), (9), and (53), we obtain

(56)

We now add to both sides of (55) and, using (9) and (59), we observe that

This completes the proofs.

From the proposition above, we observe that the ISE-based kernel density estimation yields the lowest entropy kernel density estimation. It results in a kernel density estimate that consists of a single kernel. This result presents a clear indication of the limitation of ISE-based cost functions.

APPENDIX B
GRADIENT AND HESSIAN IN NEWTON'S METHOD

In this appendix, we provide the detailed derivation of the gradient and the Hessian matrix of (12). First, we present the gradient of (20), which requires the gradients of and . Thus, from (16) and (17), we can express the gradient of as

(57)

Similarly, from (18) and (19), the gradient of can be expressed as

(58)

Thus, the element of the gradient can be expressed as

(59)

The element of the Hessian matrix can be expressed as

a)

(60)

b)

(61)

Page 11: IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 17, NO. 6 ...ds/Papers/HoSc08.pdf · cost of the Parzen windows density estimation for large sample sizes: a) approximate kernel decomposition

HONG AND SCHONFELD: MAXIMUM-ENTROPY EXPECTATION-MAXIMIZATION ALGORITHM 907

The gradient of is given by

(62)

APPENDIX C
CONCAVE FUNCTION INEQUALITY

Let us consider a monotonically decreasing and concave function . Since is the same monotonically decreasing function as , maximizing is equivalent to maximizing

. Therefore,

Thus, the entropy term can be rewritten as

The argument of the function has a finite range since , for and

for . The function satisfies

(63)

within the range , if and . It can be easily shown that the function

is convex. Therefore

Meanwhile, the function is concave. Therefore

for . Finally, if and , then the conditions required for (63) are satisfied.

REFERENCES

[1] E. Parzen, "On estimation of a probability density function and mode," Ann. Math. Statist., vol. 33, pp. 1065–1076, 1962.

[2] M. Girolami and C. He, "Probability density estimation from optimally condensed data samples," IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, no. 10, pp. 1253–1264, Oct. 2003.

[3] A. Izenmann, "Recent developments in nonparametric density estimation," J. Amer. Statist. Assoc., vol. 86, pp. 205–224, 1991.

[4] D. W. Scott and W. F. Szewczyk, "From kernels to mixtures," Technometrics, vol. 43, pp. 323–335, Aug. 2001.

[5] S. Mukherjee and V. Vapnik, Support Vector Method for Multivariate Density Estimation. Cambridge, MA: MIT Press, 2000.

[6] N. Balakrishnan and D. Schonfeld, "A maximum entropy kernel density estimator with applications to function interpolation and texture segmentation," presented at the SPIE Conf. Computational Imaging IV, San Jose, CA, 2006.

[7] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," J. Roy. Statist. Soc. B, vol. 39, pp. 1–38, 1977.

[8] C. L. Byrne, "Iterative image reconstruction algorithms based on cross-entropy minimization," IEEE Trans. Image Process., vol. 2, no. 1, pp. 96–103, Jan. 1993.

[9] S. Kullback and R. A. Leibler, "On information and sufficiency," Ann. Math. Statist., vol. 22, pp. 79–86, Mar. 1951.

[10] A. P. Benavent, F. E. Ruiz, and J. M. S. Martinez, "EBEM: An entropy-based EM algorithm for Gaussian mixture models," in Proc. 18th Int. Conf. Pattern Recognition, 2006, vol. 2, pp. 451–455.

[11] K. Torkkola, "Feature extraction by non-parametric mutual information maximization," J. Mach. Learn. Res., vol. 3, pp. 1415–1438, 2003.

[12] I. T. Nabney, Netlab: Algorithms for Pattern Recognition. New York: Springer, 2004.

[13] R. J. Vanderbei, Linear Programming: Foundations and Extensions, 2nd ed. Boston, MA: Kluwer, 2001.

[14] J. N. Kapur and H. K. Kesavan, Entropy Optimization With Applications. San Diego, CA: Academic, 1992.

[15] T. K. Moon and W. C. Stirling, Mathematical Methods and Algorithms for Signal Processing. Upper Saddle River, NJ: Prentice-Hall, 1999.

[16] C. Bishop, Neural Networks for Pattern Recognition. Oxford, U.K.: Oxford Univ. Press, 1995.

[17] R. Neal and G. Hinton, "A view of the EM algorithm that justifies incremental, sparse, and other variants," in Learning in Graphical Models, M. I. Jordan, Ed. Norwell, MA: Kluwer, 1998.

[18] K. Levenberg, "A method for the solution of certain non-linear problems in least squares," Quart. Appl. Math., vol. 2, pp. 164–168, Jul. 1944.

[19] D. W. Marquardt, "An algorithm for the least-squares estimation of nonlinear parameters," SIAM J. Appl. Math., vol. 11, pp. 431–441, Jun. 1963.

[20] D. Comaniciu and P. Meer, "Mean shift: A robust approach toward feature space analysis," IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 5, pp. 603–619, May 2002.

[21] Y. Lin and H. K. Kesavan, "Minimum entropy and information measure," IEEE Trans. Syst., Man, Cybern. C, Appl. Rev., vol. 28, no. 5, pp. 488–491, Aug. 1998.

Hunsop Hong (S'08) received the B.S. and M.S. degrees in electronic engineering from Yonsei University, Seoul, Korea, in 2000 and 2002, respectively. He is currently pursuing the Ph.D. degree at the Department of Electrical and Computer Engineering, University of Illinois at Chicago.

He was a Research Engineer at the Electronics and Telecommunications Research Institute (ETRI), Daejeon, Korea, until 2003. His research interests include image processing and density estimation.

Dan Schonfeld (M'90–SM'05) was born in Westchester, PA, in 1964. He received the B.S. degree in electrical engineering and computer science from the University of California, Berkeley, and the M.S. and Ph.D. degrees in electrical and computer engineering from the Johns Hopkins University, Baltimore, MD, in 1986, 1988, and 1990, respectively.

In 1990, he joined the University of Illinois at Chicago, where he is currently an Associate Professor in the Department of Electrical and Computer Engineering. He has authored over 100 technical papers in various journals and conferences. His current research interests are in signal, image, and video processing; video communications; video retrieval; video networks; image analysis and computer vision; pattern recognition; and genomic signal processing.

Dr. Schonfeld was coauthor of a paper that won the Best Student Paper Award in Visual Communication and Image Processing 2006. He was also coauthor of a paper that was a finalist for the Best Student Paper Award in Image and Video Communication and Processing 2005. He has served as an Associate Editor of the IEEE TRANSACTIONS ON IMAGE PROCESSING (Nonlinear Filtering) as well as an Associate Editor of the IEEE TRANSACTIONS ON SIGNAL PROCESSING (Multidimensional Signal Processing and Multimedia Signal Processing).