
August 31, 2005 15:20 WSPC/Trim Size: 9.75in x 6.5in for Review Volume pixapp

CHAPTER 11

NONPARAMETRIC PIXEL APPEARANCE PROBABILITY MODEL USING GRID QUANTIZATION

FOR LOCAL IMAGE INFORMATION REPRESENTATION

Mingzhou Song

Department of Computer Science, New Mexico State University
P.O. Box 30001, Las Cruces, NM 88003, U.S.A.
and
Doctoral Program in Computer Science, Graduate Center, City University of New York
365 Fifth Ave., New York, NY 10016, U.S.A.
E-mail: [email protected]

Robert M. Haralick

Doctoral Program in Computer Science, Graduate Center, City University of New York
365 Fifth Ave., New York, NY 10016, U.S.A.

E-mail: [email protected]

We describe a nonparametric pixel appearance probability model to represent local image information. It allows an optimal image analysis framework that integrates low- and high-level stages to substantially improve the overall accuracy of object reconstruction. In this framework, feature detection would be an overall consequence rather than an intermediate result. The pixel appearance probability model is a probability density function obtained by grid quantization. The grid is found by a genetic algorithm and a local refinement algorithm. The density values are computed by smoothing neighboring cells. We apply the pixel appearance probability model to represent features of echocardiographic images. We illustrate the substantially improved performance on left ventricle surface reconstruction due to the proposed pixel appearance probability model.

1. Introduction

The ultimate goal of medical image analysis is to acquire quantitative representations of objects that are of medical concern from observed images. In echocardiography, one objective is to create a three-dimensional (3-D) left ventricle (LV) surface model, including the epicardium (EPI), the outer surface of the LV, and the endocardium (ENDO), the inner surface of the LV. A standard two-stage approach to achieve this aim comprises feature detection and object reconstruction. Feature detection is a procedure to classify each pixel into categories such as edges and regions. Established techniques based on pixel intensities and their derivatives, known as low-level operators, are able to reliably perform feature detection on images of good quality. Detected features are fed into an algorithm for object reconstruction, known as a high-level method. Such a two-stage framework is computationally efficient and can work well in many applications when image quality is not a major concern. However, noisy imagery is particularly common in ultrasound imaging. Under uncertainty due to a low signal-to-noise ratio, low-level feature detection that only inspects information in the local neighborhood of each pixel is inaccurate. To overcome the unreliable feature detection, we advocate an image analysis framework that integrates low- and high-level stages to substantially improve the overall accuracy of object reconstruction.

Rather than using detected features to represent information in the images, we will adopt a much richer representation of the low-level information using a grid quantization technique. Statistical and computational efficiency considerations lead to this choice. This representation is nonparametric and carries fewer biases than a parametric one, such as an appearance-based model using a multivariate normal distribution. The intention of designing a new nonparametric technique is to alleviate the online computation and storage burden of standard nonparametric ones. The quantization grid is trained either in conjunction with a high-level model or directly from low-level ground truth.

The grid quantization technique is used to obtain the pixel appearance probability model in three steps. In the first step, a globally good grid is found by a genetic algorithm; in the second step, the grid is refined by a fast local algorithm; in the last step, probability density values are obtained for each cell in the grid. This technique is more practical and is subject to fewer biases than other methods.

Training of the pixel appearance probability model usually involves much more than a single run of grid quantization. The reason is that the pixel classification is unavailable or unreliable. We provide a generalized EM algorithm to handle such situations, when only images and object models are available with no edge information. The algorithm completely solves the optimal estimation of the probability density functions in the integrated model for object reconstruction.

We applied the pixel appearance probability model in reconstructing the 3-D left ventricle surface model from 2-D images. We are able to reduce the error by about 2.6 mm when compared with a standard two-stage approach.

In this chapter, we first review work related to pixel appearance probability models in Section 2. We describe the integrated framework in Section 3 and explain the role of a pixel appearance probability model in an integrated framework. In Section 4, we deal with the technicalities of grid quantization. We introduce a pixel appearance probability model for echocardiographic images in Section 5 and a pixel class prediction probability model in Section 6. A method to train the two models when low-level edge information is not available is given in Section 7. We illustrate the performance of 3-D left ventricle surface reconstruction in Section 8. In Section 9, we summarize this chapter.



2. Related Work

Chakraborty et al. integrate gradient and region information when performing pixel classification.1 The region information is modeled by Markov random fields with no shape statistics applied. Cootes et al. model image pixels by the statistical active appearance models,2,3 which are equivalent to parametric multivariate normal probability models. They combine the active appearance models and the active shape models to find the best 2-D contour from images. Pixel classification decisions are still made but might change during the model fitting iterations. This iterative classification scheme allows shape statistics to guide the local feature detection, although it may not necessarily be optimal in the Bayesian sense. As far as we have found in the literature, only Mignotte and Meunier have explored the idea of modeling shapes from images without an explicit feature detection stage.4 Their treatment is more or less intuitive and lacks a systematic account. By contrast, we will use the Bayesian framework to integrate the low-level image information and high-level prior shape knowledge through a pixel class prediction mechanism. We model the low-level image information by a statistically effective and computationally efficient grid quantization technique.

To represent image feature vectors, parametric statistical models are widely used in general and Gaussian distributions in particular. However, in a noisy imaging environment, Gaussian models or other simple parametric models are not effective because of their large modeling biases. Standard nonparametric kernel methods are seldom employed owing to their low efficiency when dealing with millions of feature vectors, which do not really form an especially large sample size for pixel-based applications. An alternative to kernel methods is quantization, which is a function that maps a larger set to a smaller one. Two steps are performed during quantization: discretization and representation. The discretization step determines the size and shape of the cells by partitioning the larger set into smaller subsets. The representation step assigns summary statistics to each cell.

For discretization, equal bin width histograms or their simple extension to multiple dimensions are often adopted, because they are computationally efficient to obtain and apply. However, they are not statistically effective for two reasons. First, the cells are blindly allocated in advance, not adapting to the data. Second, the normalized frequency may be zero for many cells and there is no guarantee for the consistency of the density estimates. The statistically equivalent blocks approach is an early tree-structured partitioning scheme.5 The CART algorithm is a tree-structured classifier.6 Grid-based partition schemes have been studied, e.g., multivariate ASH,7 the STING algorithm,8 the OptiGrid algorithm,9 the STUCCO algorithm10 and Adaptive Grids.11 All the multivariate discretization approaches proposed are sub-optimal algorithms; most of them are greedy. For tree-based algorithms, this is evident since a tree is constructed level by level using a certain greedy method and does not possess a global optimal measure. A grid can be acquired randomly.12,13 The grid lines can be equally spaced.7 A grid is improved by merging adjacent intervals by hypothesis testing.10 The adaptive grids technique merges dense cells.11

Splines or local polynomials have been used to delineate a probability density function over the cells, but they are not computationally efficient in a multi-dimensional space. On the other hand, using the empirical density leads to empty cells. The existence of empty cells is particularly prominent in a high-dimensional space, due to the increased sparsity of data. WARPing14 and averaging shifted histograms15 smooth cell density estimates. Otherwise there are relatively few methods. We will present a closely related smoothing method for grid quantization.

3. A Framework to Integrate Low- and High-Level Stages

Although the pixel appearance probability model to be discussed can be used separately from high-level object reconstruction, it is best understood under a framework that integrates the low- and high-level processes. We explain here the integrated framework. We use a Bayesian framework to formulate the overall object reconstruction problem:

$$\max_{\Theta}\; p(\Theta \mid Z) \qquad \text{(MAP Rule)},$$

where Θ is the object model, Z is the feature vector of a pixel, and p(Θ|Z) is the posterior probability of Θ given Z. This formulation is known as the maximum a posteriori (MAP) rule. By Bayes' theorem, the MAP rule is equivalent to

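$$\max_{\Theta}\; p(\Theta \mid Z) = \max_{\Theta}\; \frac{p(\Theta)\, p(Z \mid \Theta)}{p(Z)} \;\propto\; \max_{\Theta}\; p(\Theta)\, p(Z \mid \Theta), \tag{1}$$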

where p(Θ) is the prior probability of object model Θ and p(Z|Θ) is the conditional probability of the observed feature vector Z given an object model Θ. We also call p(Z|Θ) the object appearance model, capturing the overall imaging system behavior.

The prior probability p(Θ) can be assessed in ways that depend on the application. For a surface object model, it can be the prior probability characterizing smoothness, or the shape of the objects, or several user input points. p(Z) is the probability of observing a particular feature vector Z, the knowledge of which is only necessary when the exact posterior probability is desired. Computing p(Z|Θ) directly is difficult because of the many degrees of freedom of the feature vector Z.

Feature detection precisely avoids finding the functional form of p(Z|Θ). We use Y to denote the class label of a pixel, marking each pixel as either on or off the object. If we can detect the class labels for each pixel, we can search for an object model Θ* that fits the class labels best, instead of fitting to the original images. These two stages, i.e., feature detection and model fitting, form a standard approach of object reconstruction, summarized as Alg. 1. P(Y|Z) is the posterior of a class label Y given the feature vector Z. p(Z|Y) is the likelihood of the class label Y for the feature vector Z. p(Θ|Y) is the posterior of the object model Θ given the class label Y. P(Y|Θ) is the likelihood of the object model Θ for the class label Y. Estimation of p(Z|Θ) is not necessary in this framework. Instead we estimate p(Z|Y) and P(Y|Θ), as well as apply suitable priors P(Y) and p(Θ). If we can detect the class label Y from the feature vector Z with strong confidence, the two-stage approach can work well. However, if we have to detect class labels on fuzzy images, the two-stage framework does not yield an optimal Θ* because the detected class label Y* from the first stage may be unreliable.

Algorithm 1 Two-stage object reconstruction.

Stage 1. Feature detection. Find Y* that solves

$$\max_{Y}\; P(Y \mid Z) \;\propto\; \max_{Y}\; P(Y)\, p(Z \mid Y);$$

Stage 2. Model fitting. Find Θ* that solves

$$\max_{\Theta}\; p(\Theta \mid Y^{*}) \;\propto\; \max_{\Theta}\; p(\Theta)\, P(Y^{*} \mid \Theta).$$

In the integrated approach to be proposed, we will not assign a class label to each pixel, but will profile each pixel by probabilities of having different class labels. We still need the class label Y to avoid direct off-line estimation and online computation of p(Z|Θ). Although it is not directly observable, a class label serves as a hidden bridge between the feature vector Z and the object model Θ. Under the assumption that an object model Θ_1 inferred from both the feature vector Z and the class label Y has the same probability distribution as the object model Θ_2 inferred from only the class label Y, we can obtain the following theorem.16

Theorem 1: (Integrated object inference) The posterior probability of an object model Θ given the observed feature vector Z can be written as

$$p(\Theta \mid Z) = \frac{p(\Theta)}{p(Z)} \sum_{y=1}^{K} P(Y = y \mid \Theta)\, p(Z \mid Y = y), \tag{2}$$

with the assumption p(Θ|Z, Y) = p(Θ|Y).

The integrated object inference theorem leads to the integrated approach of object model optimization

$$\Theta^{*} = \arg\max_{\Theta}\; p(\Theta) \sum_{y} P(Y = y \mid \Theta)\, p(Z \mid Y = y), \tag{3}$$

where Θ* is an optimal object model. The interpretation behind the integrated approach is as follows. Every image pixel is assigned a likelihood profile p(Z|Y) over the different classes. Another profile, the class probability P(Y|Θ), is predicted from an object model. When the likelihood profile p(Z|Y) and the predicted class probability profile P(Y|Θ) match well, the object model that generates the predicted class probability profile is a good explanation of the images. If p(Z) can be calculated, we can get the posterior p(Θ*|Z), which indicates the goodness of the solution Θ*. When P(Y|Z) has a single narrow peak, the maximization of p(Θ|Z) can be approximated by the two-stage approach. However, the two-stage approximation generally is not optimal in the sense of the Bayesian framework and may lead to less consistent results. We call P(Y|Θ) the Pixel Class Prediction (PicPre) probability model, meaning that the class label Y can be predicted probabilistically from a given object model Θ. We call p(Z|Y) the Pixel Appearance (PixApp) probability model, because p(Z|Y) depicts probabilistically the appearance of a pixel with the class label Y.
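To make the contrast between Eq. (3) and Alg. 1 concrete, the following Python sketch scores two candidate object models on a handful of pixels. It is an illustration only: the toy numbers, the model names "A" and "B", the function names, and the assumption that pixels contribute independently (so that per-pixel terms multiply) are ours and are not taken from the chapter. Classes are indexed from 0 in the code.

```python
import math

def integrated_log_score(prior, class_pred, pix_lik):
    """log[ p(theta) * prod_i sum_y P(y | theta, pixel i) * p(z_i | y) ].

    Assumes independent pixels (an assumption of this sketch)."""
    score = math.log(prior)
    for pred, lik in zip(class_pred, pix_lik):
        score += math.log(sum(p * l for p, l in zip(pred, lik)))
    return score

def two_stage_log_score(prior, class_pred, pix_lik, class_prior):
    """Stage 1 picks a hard label per pixel; Stage 2 scores theta on those labels."""
    score = math.log(prior)
    for pred, lik in zip(class_pred, pix_lik):
        y_star = max(range(len(lik)), key=lambda y: class_prior[y] * lik[y])
        score += math.log(pred[y_star])
    return score

# Made-up toy data: three pixels, classes 0 = off object, 1 = on object.
pix_lik = [(0.50, 0.60), (0.55, 0.50), (0.10, 0.90)]      # p(z_i | y)
models = {                                                 # hypothetical object models
    "A": (0.5, [(0.2, 0.8), (0.3, 0.7), (0.1, 0.9)]),      # (p(theta), P(y | theta) per pixel)
    "B": (0.5, [(0.7, 0.3), (0.6, 0.4), (0.2, 0.8)]),
}
best = max(models, key=lambda m: integrated_log_score(*models[m], pix_lik))
```

The integrated score keeps the full likelihood profile of every pixel, whereas the two-stage score discards everything but the winning label of Stage 1, which is where information is lost when the likelihoods are nearly flat.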

4. Grid Quantization

The grid quantization we describe is a nonparametric statistical pattern recognition technique. It partitions the space into hyper-rectangular cells, as shown in Fig. 1, and estimates the probability density for each cell. Nonparametric methods do not necessarily guarantee a minimum variance estimate. However, they do not have the potentially large modeling biases inherent in parametric models that do not fit reality. Nonparametric models usually require a larger sample size than parametric models, since the degrees of freedom of nonparametric models are usually much greater than those of parametric models. However, a larger sample size almost always implies more CPU cycles and memory capacity. The best control variables in a nonparametric model are typically determined during time-consuming off-line training, which might be affordable in many situations, for example, in some medical imaging applications. The online CPU cycles and the memory requirement are evidently high for standard nonparametric models, which might be unacceptable in some applications. For example, a kernel method often estimates the density value for a single point from the sample directly, typically involving the entire data set, with the implication that time and space are at least proportional to the sample size.

Fig. 1. The grid partition pattern.

Grid quantization is intended to alleviate the online computation and storage burden of standard nonparametric techniques. Grid quantization finds the most effective nonparametric representation of the data for given computation resources, in terms of CPU cycles, memory requirement and the targeted performance. It requires intensive off-line training in order to determine the best representation. Grid quantization possesses the scalability to trade more resources for performance. In contrast, it is prohibitive to scale most other nonparametric models. A grid quantization locates the most important regions in the feature space, which are then finely quantized, while unimportant regions are coarsely quantized. What determines importance depends on the pattern recognition task. We will use average log likelihood, entropy, classification accuracy, or a convex combination of them. Kernel methods treat all regions of the space equally, so resources may be wasted on unimportant regions. A concern might arise about the quantization effect, but accomplishing acceptable results does not necessarily entail infinite-resolution quantization. Since a grid is constructed by adapting to the data, the quantization effect can be minimized for a particular pattern recognition task.

We choose the grid quantization pattern because it is a reasonable tradeoff between computational and statistical efficiencies. Other options are less attractive. For an equal spacing grid, retrieval efficiency is linear in the dimension and does not depend on the number of cells. However, an equal spacing grid may not be statistically efficient. Variable spacing grid lines can dramatically improve the statistical efficiency while having low computational complexity. A tree-structured quantization pattern can have very high statistical efficiency and low computational complexity, but optimizing a tree for a global criterion is a hard problem. Almost all tree-structured quantizers use greedy approaches. An irregular quantization pattern can further improve statistical efficiency, but the computational complexity could be much higher, making it more expensive to use than a kernel method. Another consideration in choosing a grid pattern is that smoothing can be carried out efficiently. The purpose of smoothing is to improve the generalization ability of a quantizer. Smoothing of a grid quantizer is analogous to pruning a tree-structured quantizer.

A quantizer is constructed off-line on training data by optimizing a certain performance measure. Training is twofold. The first goal is to achieve as good a performance as possible on the training data. The second is to obtain a consistent density estimate of each cell by smoothing. Our smoothing strategy is a close approximation to the k_N spacing approach. The latter subdivides the space into cells such that each cell contains k_N data points. It has been shown that the k_N density estimate is consistent in both the L1 sense (Ref. 17) and the L2 sense (Refs. 15, 18). A quantizer is used online by table lookup. A cell is located for a given data vector and then the density value of that cell is returned. The density value can finally be used to address a particular pattern recognition task.
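The online table lookup can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions: the grid is stored as one sorted list of interior decision boundaries per dimension, cells are closed on the right, densities live in a dictionary keyed by the integer index vector, and the names `cell_index` and `lookup_density` are hypothetical.

```python
from bisect import bisect_left

def cell_index(x, boundaries):
    """Return the integer index vector of the cell containing point x.

    boundaries[d] is the sorted list of interior decision boundaries of
    dimension d; bin j of that dimension covers (b[j-1], b[j]]."""
    return tuple(bisect_left(b, xd) for xd, b in zip(x, boundaries))

def lookup_density(x, boundaries, density):
    """Table lookup: density maps an index vector to a stored density value.

    Cells missing from the table return 0.0 (a choice of this sketch)."""
    return density.get(cell_index(x, boundaries), 0.0)

# A 3 x 2 grid over the plane with made-up boundaries and one stored density.
boundaries = [[0.0, 1.0], [0.5]]
density = {(1, 1): 0.8}
value = lookup_density((0.3, 0.7), boundaries, density)
```

Each lookup costs one binary search per dimension, so the online cost depends on the number of quantization levels and not on the training sample size.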

4.1. Notations, Definitions and Problem Statement

Let the random vector X ∈ R^D represent an individual datum or pattern. X(d) is the d-th dimension random variable of X. We call the sequence X_N = {x_1, x_2, ..., x_N} a data set or a sample of size N. This set contains i.i.d. data vectors from x_1 to x_N. Let K be the total number of classes and {1, 2, ..., K} be the class label set. Let the random variable Y ∈ {1, 2, ..., K} be the class assignment of X. We call the sequence Y_N = {y_1, y_2, ..., y_N} the class assignment set of X_N, where x_1 has class label y_1, x_2 has class label y_2, and so on. We also consider a more general case where the class assignment is not exclusive, but instead weighted. Let the random vector W ∈ [0, 1]^K be the weighted class assignment vector. W(y) is the weight for class y. We also require Σ_{y=1}^{K} W(y) = 1. We call the sequence W_N = {w_1, w_2, ..., w_N} the weighted class assignments of the data set X_N.

Definition 2: A quantizer Q is a function that maps R^D to I. I, called the quantization index set, is a finite set of integers or integer vectors.

Definition 3: A quantization cell of Q is a set C ⊂ R^D such that for any x_1, x_2 ∈ C, Q(x_1) = Q(x_2).

For each cell, we assign an index q ∈ I to it. We use the function notation q = Q(x) to denote that x is quantized to q, or that x belongs to cell q. We use N(y) to denote the total number of class y data in X_N, that is

$$N(y) = \sum_{n=1}^{N} w_n(y).$$

Let N_q be the total number of data in cell q. Let N_q(y) be the total number of data of class y in cell q, that is

$$N_q(y) = \sum_{n \in \{n \,\mid\, q = Q(x_n)\}} w_n(y).$$

Let L be the number of cells. The index of the first cell is q = 1, and that of the last cell is q = L.
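The weighted counts N(y) and N_q(y) can be accumulated in a single pass over the data. The Python sketch below is illustrative only; the function name `weighted_counts` and the data layout (a list of cell indices and a list of weight vectors) are our assumptions, and classes are indexed from 0 rather than 1.

```python
from collections import defaultdict

def weighted_counts(cells, weights, num_classes):
    """Accumulate N(y) and N_q(y) from weighted class assignments.

    cells[n] is the cell index Q(x_n); weights[n][y] is w_n(y), with each
    weight vector summing to 1. Hard labels are the special case of
    one-hot weight vectors."""
    n_y = [0.0] * num_classes
    n_qy = defaultdict(lambda: [0.0] * num_classes)
    for q, w in zip(cells, weights):
        for y in range(num_classes):
            n_y[y] += w[y]
            n_qy[q][y] += w[y]
    return n_y, dict(n_qy)

# Three data vectors, two cells, two classes (made-up values).
n_y, n_qy = weighted_counts([1, 1, 2], [(1.0, 0.0), (0.5, 0.5), (0.0, 1.0)], 2)
```

Because every weight vector sums to one, the cell total N_q is simply the sum of N_q(y) over the classes.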

Let p_{X|Y}(x|Y = y) be the class y conditional probability density function (p.d.f.). We use p̂_{X|Y}(x|Y = y) to denote the estimated p.d.f. In most cases, we shall drop the subscript X|Y from p_{X|Y}(x|Y = y) and p̂_{X|Y}(x|Y = y) to simplify the notation. The overall grid quantization problem is to estimate the p.d.f.s

$$\hat{p}_{X|Y}(x \mid Y = 1),\; \hat{p}_{X|Y}(x \mid Y = 2),\; \cdots,\; \hat{p}_{X|Y}(x \mid Y = K),$$

such that a certain quantizer performance measure T(·), which we will define later, is maximized for the training data X_N, Y_N or W_N. It is equivalent to solving

$$\max_{Q}\; T(X_N, Y_N) \quad \text{or} \quad T(X_N, W_N). \tag{4}$$

4.2. The Performance Measure of A Quantizer

The quantizer performance measure includes three components: log likelihood, entropy and correct classification probability. We explain each of them as follows.

4.2.1. Log Likelihood

The Kullback-Leibler divergence between the true density p(x) and its estimate p̂(x) is

$$D(p \,\|\, \hat{p}) = \int p(x) \log \frac{p(x)}{\hat{p}(x)}\, dx = E[\log p(X)] - E[\log \hat{p}(X)],$$



which, being non-negative (zero if and only if p̂(x) = p(x)), should be minimized. As p(x) is fixed, maximizing E[log p̂(X)] is equivalent to minimizing D(p‖p̂). Let p̂(q|y) be the density of cell q for class y. Then E[log p̂(X|y)] can be estimated by

$$\frac{1}{N(y)} \log \prod_{q=1}^{L} \left(\hat{p}(q \mid y)\right)^{N_q(y)}.$$

The overall average log likelihood of a quantizer Q is

$$J(Q) = \frac{1}{N} \sum_{y=1}^{K} N(y)\, E[\log \hat{p}(X \mid y)] = \frac{1}{N} \sum_{y=1}^{K} \log \prod_{q=1}^{L} \left(\hat{p}(q \mid y)\right)^{N_q(y)}.$$

When the class number ratio N(1) : N(2) : ... : N(K) is representative of the data population, the overall average log likelihood is preferred, with the log likelihood of popular classes being emphasized.

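The mean class average log likelihood is

$$J(Q) = \frac{1}{K} \sum_{y=1}^{K} E[\log \hat{p}(X \mid y)] = \frac{1}{K} \sum_{y=1}^{K} \frac{1}{N(y)} \log \prod_{q=1}^{L} \left(\hat{p}(q \mid y)\right)^{N_q(y)}.$$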

When the class number count N(y) is randomly decided or every class is considered to have equal importance, the mean class average log likelihood is preferred, with every class contributing equally to the log likelihood of the quantizer.
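Both versions of J(Q) reduce to sums of N_q(y) log p̂(q|y), since the log of the product is the sum of the logs. The Python sketch below computes them from per-cell counts; the function name, the dictionary layout and the toy numbers are our own assumptions, and classes are indexed from 0.

```python
import math

def avg_log_likelihoods(n_qy, density):
    """Return (overall J, mean class J) of a quantizer.

    n_qy[q][y] is N_q(y); density[q][y] is the estimated cell density
    p(q|y), assumed positive wherever N_q(y) > 0. Every class is assumed
    to have at least one data point."""
    num_classes = len(next(iter(n_qy.values())))
    n_y = [sum(c[y] for c in n_qy.values()) for y in range(num_classes)]
    # log prod_q p(q|y)^{N_q(y)} = sum_q N_q(y) log p(q|y)
    class_ll = [
        sum(c[y] * math.log(density[q][y]) for q, c in n_qy.items() if c[y] > 0)
        for y in range(num_classes)
    ]
    overall = sum(class_ll) / sum(n_y)
    mean_class = sum(ll / n for ll, n in zip(class_ll, n_y)) / num_classes
    return overall, mean_class

# Two cells, two balanced classes (made-up counts and densities).
n_qy = {1: [3, 1], 2: [1, 3]}
density = {1: [0.75, 0.25], 2: [0.25, 0.75]}
overall, mean_class = avg_log_likelihoods(n_qy, density)
```

With balanced classes, as in this toy input, the two measures coincide; they differ only when the class counts N(y) are unequal.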

4.2.2. Correct Classification Probability

Let P(y) be the prior probability of class y. Within cell q, the Bayes rule is equivalent to

$$y_q^{*} = \arg\max_{y}\; P(y)\, \frac{N_q(y)}{N(y)}.$$

Let N_c(q) be the number of correct decisions in cell q, i.e., N_c(q) = N_q(y_q^*). We define the correct classification probability in two situations. The overall correct classification probability is

$$P_c(Q) = \frac{\sum_{q=1}^{L} N_c(q)}{N}. \tag{5}$$

The mean class correct classification probability is

$$P_c(Q) = \frac{1}{K} \sum_{y=1}^{K} \sum_{q=1}^{L} I(y = y_q^{*})\, \frac{N_c(q)}{N(y)}. \tag{6}$$

In the above two equations, I is the indicator function. The choice of either should follow the same considerations as the choice of average log likelihood.
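Equations (5) and (6) can be evaluated cell by cell once the Bayes decision of each cell is known. The following Python sketch is an illustration under our own naming and data layout (per-cell count lists, classes indexed from 0, every class non-empty); it is not the authors' implementation.

```python
def correct_classification(n_qy, class_prior):
    """Return (overall P_c, mean class P_c) as in Eqs. (5) and (6).

    n_qy[q][y] is N_q(y); class_prior[y] is P(y)."""
    num_classes = len(class_prior)
    n_y = [sum(c[y] for c in n_qy.values()) for y in range(num_classes)]
    total = sum(n_y)
    overall, mean_class = 0.0, 0.0
    for counts in n_qy.values():
        # Bayes decision within the cell: argmax_y P(y) N_q(y) / N(y)
        y_star = max(range(num_classes),
                     key=lambda y: class_prior[y] * counts[y] / n_y[y])
        n_c = counts[y_star]                  # correct decisions in this cell
        overall += n_c / total
        mean_class += n_c / n_y[y_star] / num_classes
    return overall, mean_class

# Two cells with unbalanced classes (made-up counts), equal priors.
overall_pc, mean_class_pc = correct_classification({1: [8, 1], 2: [2, 1]}, (0.5, 0.5))
```

In this toy input the second cell is assigned to the rarer class even though it holds fewer of its points in absolute terms, because the decision normalizes the counts by N(y).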

4.2.3. Entropy

Similar to the case of average log likelihood, we give two options: overall entropy and mean class entropy. Again, the choice of either should follow the considerations for the choice of average log likelihood and correct classification probability. We define the overall entropy by

$$H(Q) = \sum_{q=1}^{L} \frac{N_q}{N} \log \frac{N}{N_q}. \tag{7}$$

We define the mean class entropy by

$$H(Q) = \frac{1}{K} \sum_{y=1}^{K} \sum_{q=1}^{L} \frac{N_q(y)}{N(y)} \log \frac{N(y)}{N_q(y)}. \tag{8}$$

Entropy has been used as a class impurity measure. Here, however, we use entropy as a measure of the consistency or generalization ability of the quantizer.
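The two entropy measures can be computed from the same per-cell counts. The Python sketch below assumes that the overall entropy sums the cell terms over all L cells and that empty cells contribute zero; the function name and data layout are ours, and classes are indexed from 0.

```python
import math

def entropies(n_qy):
    """Return (overall entropy, mean class entropy) of a quantizer.

    n_qy[q][y] is N_q(y). Empty cells and empty (cell, class) pairs
    contribute zero, following the convention 0 log(1/0) = 0."""
    num_classes = len(next(iter(n_qy.values())))
    n_y = [sum(c[y] for c in n_qy.values()) for y in range(num_classes)]
    total = sum(n_y)
    overall = 0.0
    for counts in n_qy.values():
        n_q = sum(counts)
        if n_q > 0:
            overall += n_q / total * math.log(total / n_q)
    mean_class = 0.0
    for y in range(num_classes):
        for counts in n_qy.values():
            if counts[y] > 0:
                mean_class += counts[y] / n_y[y] * math.log(n_y[y] / counts[y])
    return overall, mean_class / num_classes

# Two equally filled cells: both entropies equal log 2.
h_overall, h_mean = entropies({1: [2, 2], 2: [2, 2]})
```

A grid whose cells hold equal shares of the data maximizes these entropies, which is why a larger value indicates a more evenly used code book.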

4.2.4. The Performance Measure Function

We define the quantizer performance measure function, by linearly combining average log likelihood, entropy and the log of correct classification probability, as follows

$$T(Q) = W_J J(Q) + W_H H(Q) + W_c \log P_c(Q), \tag{9}$$

where W_J, W_H and W_c are given non-negative weights for average log likelihood J(Q), entropy H(Q) and log of correct classification probability log P_c(Q), respectively.

4.3. Preprocessing

The performance of classifiers does not increase monotonically with the number of dimensions when the sample size is fixed, because the sample size required to achieve a similar performance grows exponentially. Dimension reduction may be necessary. Popular techniques include principal component analysis, projection pursuit and independent component analysis. In addition to dimension reduction, for data in a high dimension, we may also want to view the data in a new coordinate system such that the most interesting dimensions come first, where we would like to use more quantization levels. Let z_1, z_2, ..., z_N ∈ R^M denote the feature vectors, each representing a pixel. Normalization may also be necessary to make the dimension reduction sensible. We use matrix B to represent the normalization and matrix W to represent dimension reduction and coordinate change. A feature vector z is projected to x ∈ R^D by x = W^T B z.

4.4. Relative Quantization Levels

The choice of a proper total number of quantization cells L depends on the sample size and the underlying distribution. It is also limited by the available storage resource. When the sample size is large enough, it would be mostly determined by the available storage resource, as a small quantization cell is always preferred for asymptotic optimality, which means at least a large L. Once L is fixed, quantization levels L_1, ..., L_D for each dimension are to be assigned. The information content of a random variable can be measured by its entropy. We assign the number of quantization levels for each dimension based on the marginal entropy in that dimension. We use the scale-invariant portion of the continuous histogram entropy, or the discrete histogram entropy, to guide the assignment of quantization levels. Let H_d(X) be the marginal histogram entropy for the d-th dimension of X. The bit-allocation rule is defined by

$$\frac{H_1(X)}{\log L_1} = \frac{H_2(X)}{\log L_2} = \cdots = \frac{H_D(X)}{\log L_D}, \tag{10}$$

$$\log L_1 + \log L_2 + \cdots + \log L_D = \log L. \tag{11}$$

Solving the above equations for L_d, d = 1, ..., D, we get

$$L_d = L^{\,H_d(X) \,/\, \sum_{m=1}^{D} H_m(X)}, \qquad d = 1, 2, \cdots, D. \tag{12}$$
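Equation (12) is straightforward to evaluate. The Python sketch below returns the real-valued levels exactly as the formula gives them; how to round them to integers is not specified here, so the sketch leaves that step out. The function name and the example entropies are our own.

```python
import math

def allocate_levels(marginal_entropies, total_cells):
    """Real-valued quantization levels L_d = L ** (H_d / sum_m H_m), Eq. (12)."""
    h_sum = sum(marginal_entropies)
    return [total_cells ** (h / h_sum) for h in marginal_entropies]

# Made-up marginal entropies for three dimensions and L = 256 cells.
levels = allocate_levels([2.0, 1.0, 1.0], 256)
```

By construction the product of the returned levels equals L, and a dimension with twice the marginal entropy receives twice as many bits, i.e., the square of the number of levels.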

4.5. Obtaining a Grid Quantizer

Having defined the quantizer performance measure function and applied certain pre-processing of the data, we obtain a grid quantizer in three steps performed by three algorithms. The first algorithm attempts to find a good grid in a global sense using a genetic algorithm. The second algorithm refines a grid locally by adjusting the grid lines one by one. The third algorithm obtains a smoothed density estimate for each grid cell.

4.5.1. Finding a Globally Good Grid

We cast the grid optimization problem into a genetic algorithms model19 in the following way. An individual has a single chromosome, which is a grid. A gene is the sequence of decision boundaries in a particular dimension of the grid. A nucleotide is a single decision boundary in the gene. How well an individual adapts to the environment is measured by the fitness function. The choice of a fitness function depends on how selection is carried out. We use fitness proportionate selection, in which the chance of selecting an individual is in proportion to its fitness. Therefore, we would normally require the fitness function to be non-negative in order to directly relate it to probabilities.

Definition 4: The fitness function of a grid is

$$\phi(G) = \exp(T(G)) = [\exp(J(G))]^{W_J}\, [\exp(H(G))]^{W_H}\, [P_c(G)]^{W_c}. \tag{13}$$

ϕ(G) is non-negative since it is a real exponential function of T(G). exp(J(G)) corresponds to the geometric average of the likelihood, which is a probability. exp(H(G)) is the information content of the grid expressed in terms of the size of a code book. P_c(G) is exactly the correct classification probability. ϕ(G) is a weighted geometric combination of the above components.
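As an illustration of Eq. (13) and of fitness proportionate selection, the following Python sketch computes the fitness from given values of J, H and P_c and draws one individual with probability proportional to its fitness. The function names, default weights and toy inputs are our assumptions; `random.Random.choices` from the standard library performs the weighted draw.

```python
import math
import random

def fitness(j, h, pc, w_j=1.0, w_h=1.0, w_c=1.0):
    """phi(G) = exp(T(G)) with T = W_J J + W_H H + W_c log P_c."""
    return math.exp(w_j * j + w_h * h + w_c * math.log(pc))

def select(population, fitnesses, rng):
    """Fitness proportionate (roulette wheel) selection of one individual."""
    return rng.choices(population, weights=fitnesses, k=1)[0]

rng = random.Random(0)
phi = fitness(j=-1.0, h=0.5, pc=0.8)
parent = select(["grid_a", "grid_b"], [0.0, phi], rng)  # grid_a has zero fitness
```

Working with exp(T) rather than T keeps the selection weights non-negative even though J(G) and log P_c(G) are typically negative.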



Algorithm 2 Optimize-Grid-Genetic outlines the main steps of the grid optimization genetic algorithm. It starts with an initial population of N_p random grids. The population evolves in the selection-reproduction-selection cycle as described in the for loop from line 2 to 16. In each cycle, or generation, N_p children are reproduced in the for loop from line 7 to 15. Every execution of the loop produces two children C_1 and C_2 from parents G_1 and G_2. G_1 and G_2 are randomly selected (lines 8 and 9) from the population, and the chance of their being selected is in proportion to their fitness function values. Selection is illustrated in Fig. 2. A cross-over site d_r is randomly decided for the parent chromosomes or grids G_1 and G_2. The cross-over occurs with a probability of P_r. A cross-over example of a grid is shown in Fig. 3. Once the cross-over is finished, two children C_1 and C_2 are produced (line 11). The decision boundaries of each child grid are randomly changed by mutation (lines 12 to 13), as illustrated in Fig. 4. Then both children are added to the next generation (line 14). The best grid is kept as G*, and is returned after a certain number of generations have evolved.

Fig. 2. Grid selection. In a population of four grids, two (lower left and upper right) are randomly selected as the parents by a chance in proportion to their fitness.

Fig. 3. Grid crossover. The quantization of the vertical axis in the top grid and the quantization of the horizontal axis of the bottom grid are crossed over to form a next generation grid on the right.



Fig. 4. Grid mutation. The thick vertical grid line could be mutated to the dashed lines on its sides. In this example it was mutated to the one on the right.

Algorithm 2 Optimize-Grid-Genetic(X_N, Y_N, N_p, N_g, P_r, P_u)

 1: P_0 ← a population of N_p random grids;
 2: for j ← 0 to N_g − 1 do
 3:   if ϕ(G*) < max_{G ∈ P_j} ϕ(G) then
 4:     G* ← argmax_{G ∈ P_j} ϕ(G);
 5:   end if
 6:   P_{j+1} ← ∅;
 7:   for i ← 0 to N_p − 1 with increment of 2 do
 8:     Select a grid G_1 from P_j with a probability proportional to its fitness value;
 9:     Select a grid G_2 from P_j with a probability proportional to its fitness value;
10:     Randomly decide a dimension d_r as the cross-over site;
11:     C_1 ← G_1, C_2 ← G_2; exchange the decision boundaries of dimensions 1 to d_r between C_1 and C_2 with probability P_r, or do not exchange;
12:     Mutation: adjust each decision boundary of C_1 with probability P_u;
13:     Mutation: adjust each decision boundary of C_2 with probability P_u;
14:     P_{j+1} ← P_{j+1} ∪ {C_1, C_2};
15:   end for
16: end for
17: if ϕ(G*) < max_{G ∈ P_{N_g}} ϕ(G) then
18:   G* ← argmax_{G ∈ P_{N_g}} ϕ(G);
19: end if
20: return G*;
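The cross-over and mutation steps of Algorithm 2 can be sketched as follows in Python. The grid representation (one sorted list of decision boundaries per dimension), the function names, and in particular the mutation rule are our assumptions: the algorithm only says that a boundary is adjusted, and here we redraw it uniformly between its two neighbours so that the boundaries stay ordered.

```python
import random

def crossover(g1, g2, p_r, rng):
    """Copy the parents, then with probability p_r exchange the boundaries
    of dimensions 1..d_r, where d_r is a random cross-over site."""
    c1 = [list(b) for b in g1]
    c2 = [list(b) for b in g2]
    d_r = rng.randint(1, len(g1))          # dimensions are numbered from 1
    if rng.random() < p_r:
        c1[:d_r], c2[:d_r] = c2[:d_r], c1[:d_r]
    return c1, c2

def mutate(grid, p_u, bounds, rng):
    """With probability p_u, move a boundary uniformly between its neighbours
    (assumed rule); bounds[d] = (lo, hi) is the data range of dimension d."""
    for d, b in enumerate(grid):
        lo, hi = bounds[d]
        for i in range(len(b)):
            if rng.random() < p_u:
                left = b[i - 1] if i > 0 else lo
                right = b[i + 1] if i + 1 < len(b) else hi
                b[i] = rng.uniform(left, right)
    return grid

rng = random.Random(1)
parent1 = [[1.0, 2.0], [3.0]]              # made-up 2-D grids
parent2 = [[5.0, 6.0], [7.0]]
child1, child2 = crossover(parent1, parent2, 1.0, rng)
mutated = mutate([[1.0, 2.0, 3.0]], 1.0, [(0.0, 4.0)], rng)
```

Copying the parents before the exchange leaves the current population untouched, so the same grid can be selected again as a parent within the same generation.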

4.5.2. Local Refinement of a Grid

When the current best solution is close to the global optimum, genetic algorithms become less efficient. Here we design a more efficient grid refinement algorithm that adjusts the decision boundaries one by one. An adjusted boundary is accepted only when the performance measure increases. The idea is explained in Fig. 5. By the definitions of J(G), H(G) and P_c(G), they are additive, i.e.,

$$J(G) = \sum_{q=1}^{L} J(q), \qquad H(G) = \sum_{q=1}^{L} H(q), \qquad P_c(G) = \sum_{q=1}^{L} P_c(q).$$



Fig. 5. Grid refinement. We adjust the thick grid line within the shaded area until a local maximum is reached.

Therefore, when one decision boundary of a grid is moved, we need only recalculate the change of the additive measures on those affected cells and data.
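The benefit of additivity can be seen in a one-dimensional Python sketch. It is a simplification under our own assumptions, not the Refine-Grid algorithm itself: only the average log likelihood is used as the measure (W_J = 1, W_H = W_c = 0), cell densities are the unsmoothed empirical ones, candidate positions are evenly spaced between the neighbouring boundaries, and a single pass is made instead of iterating to convergence. All names are hypothetical.

```python
import math

def cell_term(lo, hi, xs):
    """Contribution of the 1-D cell (lo, hi] to the average log likelihood,
    using the empirical density N_q / (N * width)."""
    n_q = sum(1 for x in xs if lo < x <= hi)
    if n_q == 0 or hi <= lo:
        return 0.0
    return n_q / len(xs) * math.log(n_q / (len(xs) * (hi - lo)))

def total_score(boundaries, xs, lo, hi):
    """Sum of the additive cell terms over the whole grid."""
    edges = [lo] + list(boundaries) + [hi]
    return sum(cell_term(a, b, xs) for a, b in zip(edges, edges[1:]))

def refine_once(boundaries, xs, lo, hi, steps=20):
    """One pass over the boundaries; a move is accepted only if it increases
    the measure, which needs only the two cells touching the boundary."""
    edges = [lo] + list(boundaries) + [hi]
    for i in range(1, len(edges) - 1):
        left, right = edges[i - 1], edges[i + 1]
        best_b = edges[i]
        best = cell_term(left, best_b, xs) + cell_term(best_b, right, xs)
        for s in range(1, steps):
            cand = left + (right - left) * s / steps
            val = cell_term(left, cand, xs) + cell_term(cand, right, xs)
            if val > best:
                best_b, best = cand, val
        edges[i] = best_b
    return edges[1:-1]

xs = [0.1, 0.15, 0.2, 0.25, 0.8, 0.9]      # made-up 1-D sample
before = total_score([0.5], xs, 0.0, 1.0)
refined = refine_once([0.5], xs, 0.0, 1.0)
after = total_score(refined, xs, 0.0, 1.0)
```

Because all other cells are unchanged, comparing the two local terms is equivalent to comparing the full measure, which is the saving that the additive form provides. With an unsmoothed density the measure rewards ever narrower cells around the data, so a practical version would limit how small a cell may become.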

Algorithm 3 describes the Refine-Grid algorithm. The input includes a grid G, data sets X_N and Y_N, the number of refinements N_R and the convergence parameter δ. J, H, P_c and T record the performance measures of the current grid. The refinement is done dimension by dimension in the for loop between lines 6 and 20. The decision boundaries of each dimension are refined one by one in the for loop from line 7 to line 19. A sub-grid G_s is formed by all the cells that are affected by the current decision boundary. The data that fall in the sub-grid are kept in sets X_N′ and Y_N′. Other data will not affect the refinement of the current decision boundary. The additive measures J_s, H_s and P_c^s are calculated for the sub-grid G_s. Line 12 finds the optimal decision boundary that maximizes the overall measure T(G_s) and assigns it to the optimal sub-grid G_s^*. The changes in the additive measures of the sub-grid are calculated and the additive measures of the grid G are updated by these changes (lines 13 to 16). The optimal decision boundary replaces the original one in G (line 17). The overall measure T is also updated. The refinement repeats until convergence is achieved, as measured by ∆T/T.
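The benefit of additivity can be illustrated with a one-dimensional sketch in Python. It is a simplified stand-in rather than Algorithm 3 itself: the per-cell term `cell_score` is a hypothetical histogram log likelihood used in place of the chapter's combined measure T, and the search over boundary positions is a plain scan over a few evenly spaced candidates. Only the two cells that share the boundary, and the data falling inside them, are re-evaluated.

```python
import math

def cell_score(lo, hi, xs, n_total):
    """Hypothetical additive term: histogram log likelihood of cell (lo, hi]."""
    n = sum(1 for x in xs if lo < x <= hi)
    if n == 0:
        return 0.0
    return n * math.log(n / (n_total * (hi - lo)))

def refine_boundary(bounds, q, xs, n_candidates=20):
    """Move interior boundary q to its best candidate; return the score gain."""
    lo, hi = bounds[q - 1], bounds[q + 1]
    local = [x for x in xs if lo < x <= hi]  # only these data are affected
    n_total = len(xs)

    def local_score(b):
        return (cell_score(lo, b, local, n_total)
                + cell_score(b, hi, local, n_total))

    start = local_score(bounds[q])
    best_b, best = bounds[q], start
    for i in range(1, n_candidates):
        b = lo + (hi - lo) * i / n_candidates
        s = local_score(b)
        if s > best:  # accept a move only when the measure increases
            best_b, best = b, s
    bounds[q] = best_b
    return best - start  # change to add to the global additive measure
```

The returned gain plays the role of ∆J in lines 13 and 14 of Algorithm 3: the global measure is updated by the local change instead of being recomputed over the whole grid.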

4.5.3. Density Estimation for Each Quantization Cell

The consistency of a quantizer is determined by the bias and variance of the density estimates. To reduce the bias, the sample size N must be large, and N_q/N of each cell must be small. To diminish the variance, the sample size N_q within a cell must be large. Overfitting manifests itself as large variance in the cell density estimates, a result of empty or nearly empty cells. Smoothing can reduce the variance of the density estimates. When smoothing is overdone, the quantizer gives very consistent results on both the training sample and an unseen sample, but carries large biases.

We offer a smoothing algorithm to assign a density estimate to a quantization cell. The algorithm is an approximation to the k nearest neighbor density estimate. Let V(q) be the volume of cell q. Let V_k(q) be the volume of a minimum neighborhood of cell q that contains at least k points. Let k_q be the actual number of points in the neighborhood. A smoothed probability density estimate of cell q is

$$
p(q) = \frac{k_q / V_k(q)}{\sum_{r} \dfrac{k_r}{V_k(r)}\, V(r)}. \tag{14}
$$

Algorithm 3 Refine-Grid(G, X_N, Y_N, N_R, δ)
    1: J ← J(G), H ← H(G), P_c ← P_c(G);
    2: T ← W_J J + W_H H + W_c log P_c;
    3: j ← 0;
    4: repeat
    5:   T⁻ ← T;
    6:   for d ← 1 to D do
    7:     for q ← 1 to L_d − 1 do
    8:       Form a sub-grid G_s by the cells sharing the decision boundary G[d, q];
    9:       G_s^* ← G_s;
    10:      X_N′, Y_N′ ← {(x_n, y_n) | x_n(d) ∈ (G_s[d, 0], G_s[d, 2]]};
    11:      J_s ← J(G_s), H_s ← H(G_s), P_c^s ← P_c(G_s), all on X_N′, Y_N′;
    12:      G_s^*[d, 1] ← argmax_{G_s[d,1]} T(G_s) on X_N′, Y_N′;
    13:      ∆J ← J(G_s^*) − J_s, ∆H ← H(G_s^*) − H_s, ∆P_c ← P_c(G_s^*) − P_c^s, all on X_N′, Y_N′;
    14:      J ← J + ∆J;
    15:      H ← H + ∆H;
    16:      P_c ← P_c + ∆P_c;
    17:      G[d, q] ← G_s^*[d, 1];
    18:      T ← W_J J + W_H H + W_c log P_c;
    19:    end for
    20:  end for
    21:  j ← j + 1;
    22:  ∆T ← T − T⁻;
    23: until j = N_R or |∆T/T| < δ;

Searching for the exact k-th nearest neighbor is computationally expensive. We introduce an approximate, but no less than k, nearest neighbor smoothing algorithm after giving some basic definitions.

Definition 5: We call cells a and b neighbor cells if they share at least a partial decision boundary.

Definition 6: The radius 0 neighborhood of cell q is the set that contains exactly the cell itself. We use 𝒩(q, 0) to denote the radius 0 neighborhood of cell q.

Definition 7: The radius R (R ∈ Z+) neighborhood of cell q, 𝒩(q, R), is the union of the radius R − 1 neighborhood 𝒩(q, R − 1) and the set of all the neighbor cells of the cells in 𝒩(q, R − 1).

Figure 6 shows the neighborhoods of a cell with different radii. In Fig. 6(a), the cell of interest is the cell in gray. The cell itself is also its radius 0 neighborhood. Figures 6(b) and (c) show its radius 1 and 2 neighborhoods.



(a) R = 0 (b) R = 1 (c) R = 2

Fig. 6. Radius R neighborhood of a cell.

Definition 8: The k neighborhood of a cell is the smallest radius R neighborhood of the cell that contains at least k points.

Algorithm 4 Radius-Smoothing is based on the k neighborhood concept. The algorithm searches for a minimum radius R neighborhood with at least k data points for the current cell. The density of the k neighborhood is assigned to the cell as its density estimate. M is the total mass on the density support. M can be considered a data count adjusted by smoothing and is closely related to N. For cells with fewer than k data points, the initial guess of R is the radius of the k neighborhood of a previously processed neighbor cell. To make this initial guess more realistic, we go through the cells in an order such that every pair of consecutively visited cells are neighbor cells. k_q is the actual total number of data points in the current radius R neighborhood. Based on the data count in the current radius R neighborhood, we either increase R until there are at least k data points in the neighborhood, or decrease R until the R − 1 neighborhood contains fewer than k data points.
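A one-dimensional sketch in Python may make the k neighborhood estimate of Eq. (14) concrete. It is illustrative only and makes several simplifying assumptions: cells are adjacent intervals, so the neighbors of a cell are simply the cells to its left and right; the radius is searched from R = 0 for every cell instead of being warm-started from the previously visited neighbor as in Algorithm 4; and the function name `radius_smoothing_1d` is hypothetical.

```python
def radius_smoothing_1d(counts, widths, k):
    """Smoothed density per cell on a 1-D grid of adjacent intervals.

    counts[q] is the number of data points in cell q and widths[q] is its
    volume V(q). Returns p(q) such that sum(p[q] * widths[q]) == 1.
    """
    if sum(counts) == 0:
        raise ValueError("at least one data point is required")
    n_cells = len(counts)
    rho = []
    for q in range(n_cells):
        radius = 0
        while True:
            lo, hi = max(0, q - radius), min(n_cells - 1, q + radius)
            k_q = sum(counts[lo:hi + 1])
            # Stop at the smallest radius holding k points, or at the full grid.
            if k_q >= k or (lo == 0 and hi == n_cells - 1):
                break
            radius += 1
        volume = sum(widths[lo:hi + 1])  # V_k(q), volume of the k neighborhood
        rho.append(k_q / volume)
    mass = sum(r * w for r, w in zip(rho, widths))  # M in Algorithm 4
    return [r / mass for r in rho]
```

Empty cells inherit a positive density from their neighborhood, which is the intended effect of the smoothing: no cell of the support is assigned zero probability.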

The extent of smoothing is usually controlled by a parameter on the size of the local neighborhood. This naturally brings up the question of how to measure the quantizer consistency as a function of the control parameter. We use cross-validation to determine an optimal control parameter k* for smoothing, such that it maximizes the average quantizer performance.

5. A Pixel Appearance Probability Model for Echocardiographic Images

The appearance of a pixel is defined by its local information, such as intensity, contrast, directional derivatives and gradient. It is a result of the imaging process of a point on the object with some structural type corresponding to the pixel location. However, it may not be strictly a function of structural types. As we have discussed earlier, the PixApp probability model is used to capture such appearance uncertainty. We present in this section a particular PixApp probability model that we have designed for echocardiographic image pixels.

Algorithm 4 Radius-Smoothing(Q, k)
    1: M ← 0, R ← −1;
    2: for each cell q do
    3:   if N(q) = k then
    4:     R ← 0, k_q ← N(q);
    5:   else
    6:     if R < 0 then
    7:       R ← 0, k_q ← N(q);
    8:     else
    9:       A ← 𝒩(q⁻, R) ∩ 𝒩(q, R);
    10:      k_q ← k_{q⁻} − |𝒩(q⁻, R) − A| + |𝒩(q, R) − A|;
    11:    end if
    12:    while k_q > k and R > 0 do
    13:      k_q ← k_q − |𝒩(q, R) − 𝒩(q, R − 1)|, R ← R − 1;
    14:    end while
    15:    while k_q < k and R < R_max do
    16:      k_q ← k_q + |𝒩(q, R + 1) − 𝒩(q, R)|, R ← R + 1;
    17:    end while
    18:  end if
    19:  ρ(q) ← k_q / V(𝒩(q, R)), M ← M + ρ(q)V(q), q⁻ ← q
    20: end for
    21: for each cell q do
    22:   p(q) ← ρ(q)/M;
    23: end for

Figures 7(a)(d)(g) show ultrasound images of the left ventricle and Figs. 7(b)(e)(h) show the same images with the visible left ventricle boundary overlaid. Figures 7(c)(f)(i) overlay the complete contour of the underlying left ventricle surface on the original images. It is quite evident that the pixels on the underlying surface contour do not have uniform appearance everywhere: some pixels are bright with high contrast, while others do not differ much from the background. The background pixels also have variable appearance.

During ultrasound imaging, signals arriving at an interface between media with different acoustic impedance produce a strong echo when the angle of incidence is near perpendicular; signals arriving at an interface at near tangential angles produce a very weak echo. Thus the image intensity and its spatial variation are important. The derivatives carry spatial variation information. We fit a cubic facet model [20] to the pixel intensities in a local window centered at each pixel. A facet is a smooth surface patch in 3-D. We analytically derive all first- and second-order derivatives for each pixel by fitting a facet centered at the pixel. Each pixel feature vector contains

(1) the spatially and temporally smoothed pixel intensity value, capturing the absolute strength of echo signals. A 3-D median filter, with a window of 5 pixels × 5 pixels × 5 frames, is used to obtain the smoothed pixel values;

(2) the third-order facet model approximation of the pixel intensity value, giving the absolute strength of echo signals but on a larger scale. A window of 21 pixels × 21 pixels is used to compute each facet;

(3) the directional derivative along the gradient direction, providing a local measure of edginess. A window of 21 pixels × 21 pixels is used to compute each facet;

(4) the minimum second directional derivative, among second derivatives along all directions, indicating the relative strength of echo signals. A window of 21 pixels × 21 pixels is used to compute each facet;

(5) the directional derivative from a point inside the LV, to help distinguish ENDO and EPI surfaces. The inner point is derived from user input points. A 3-D median filter with a window of 17 pixels × 17 pixels × 17 frames is first applied to get the smoothed pixel values. Then a window of 21 pixels × 21 pixels is used to compute each facet.

Fig. 7. An ultrasound image of the left ventricle and surface model contours. [Panels: (a) parasternal long axis view, (b) with visible contours, (c) with complete contours; (d) short axis view, (e) with visible contours, (f) with complete contours; (g) apical 4 chamber view, (h) with visible contours, (i) with complete contours.]

Examples of the feature vector maps are shown in Fig. 8.

Fig. 8. An original parasternal long axis view image and its five feature maps. [Panels: (a) the original; (b) feature 1 map; (c) feature 2 map; (d) feature 3 map; (e) feature 4 map; (f) feature 5 map.]

The PixApp probability model describes the pixel appearance uncertainty under noise for a given structural type. Let classes Y = 1 and Y = 2 correspond to EPI and ENDO pixels, respectively. An additional class Y = 3 labels the background. Thus, the PixApp probability model for echocardiographic images includes three p.d.f.s: p(Z|1), p(Z|2) and p(Z|3). We use grid quantization to represent them. Our technique requires a larger sample size than a multivariate normal model in order to obtain a better representation of the pixel appearance. With a multivariate sample of size on the order of millions, not uncommon for image pixels, we are on reasonably safe ground. The PixApp probability model can be estimated directly using the technique described in Section 4 if the pixel class information Y is available for each pixel. Section 7 introduces a strategy to estimate PixApp probability models without explicit knowledge of pixel classes.

6. A Pixel Prediction Probability Model for Ultrasound Imaging

When pixel classification is not available for observed training images, the PixApp probability model cannot be trained directly. If high-level object models are given for the observed training images, which is more common in practice, a pixel class can be predicted probabilistically from an object model through the PicPre probability model. In this section, we describe a PicPre probability model for ultrasound imaging. In Section 7, we describe how this model is trained simultaneously with the PixApp probability model.

Pixel class prediction associates every pixel on an imaging plane with some physical properties of a given object model and its environment. Pixel class prediction and classification are fundamentally different. Pixel classification assigns class label information to each pixel based on observed images, not from an object model. We denote the deterministic prediction from an object model to pixel class by Y|Θ, by which each pixel has an exclusive class assignment. We represent the probabilistic prediction by the conditional probability P(Y|Θ), which provides a soft pixel class prediction, allowing a more precise relationship to be captured.

Deterministic prediction methods can be considered special cases of the probabilistic prediction methods to be described. To predict the output of a physical system, we need to model both systematic and random behaviors of the system. Well understood systematic behaviors are often described by functional models. Less studied and more complex physical processes, often represented by probabilistic models, account for the random behaviors, as well as random noise from the environment. The overall probabilistic prediction is shown in the diagram in Fig. 9. A surface model Θ is used to produce simulated images via a physical simulation. By the distance transform, the distance d(y) to the closest pixel of surface y and the intensity I(y) of that pixel are calculated. Details of d(y) and I(y) will be explained shortly. Then the pixel class probability profile P(Y|Θ) is obtained by probability modeling.

Fig. 9. Probabilistic pixel class prediction. [Flow diagram: surface model Θ → physical simulation → simulated images → distance transform → d(y), I(y) → probability modeling → pixel class probability profile P(Y|Θ).]

When a 2-D image is scanned from a 3-D object in ultrasound imaging, two phenomena occur:

(1) a 3-D point on the object is transformed to a 2-D pixel on the image by plane intersection. The plane is defined by ultrasound beams in a single B-scan;

(2) the physical properties of the 3-D point yield a 2-D pixel intensity, determined mostly by the reflective properties of the object.

Simulation generates images from an object model by functional modeling of a real imaging system. We have implemented an ultrasound imaging simulation system to synthesize echocardiographic images from 3-D surface models of the left ventricle [21, 22]. The object model of the LV includes two geometric surface models, one for EPI and the other for ENDO. The simulation software is capable of simulating backscattering, attenuation, and reflection, implemented by a ray-tracing algorithm. We only simulate reflections in this study, since our purpose is to predict the systematic image dropout rather than the stochastic behavior of the speckle noise. The dropout is mostly due to weak reflection at interfaces. The randomness is accounted for in the PicPre and PixApp probability models.

The distance from a pixel p to its closest class-y neighbor pixel q on the simulated image is denoted by d(y). The intensity of the neighbor pixel q is denoted by I(y). The values of d(y) and I(y) for every pixel on a simulated image can be found efficiently by the distance transform [20, 23].
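The two maps can be illustrated with a brute-force sketch in Python. It is not an efficient distance transform of the kind cited above; it scans every surface pixel for every image pixel and is meant only to pin down what d(y) and I(y) contain. The function name `nearest_surface_maps` and the list-of-lists image layout are assumptions made for the example, and Euclidean distance in pixel units is assumed.

```python
import math

def nearest_surface_maps(labels, intensity, y):
    """Return (d, near_i) for class y on a small simulated image.

    d[r][c] is the distance from pixel (r, c) to the closest pixel labeled y;
    near_i[r][c] is the simulated intensity of that closest pixel.
    """
    rows, cols = len(labels), len(labels[0])
    surface = [(r, c) for r in range(rows) for c in range(cols)
               if labels[r][c] == y]
    d = [[math.inf] * cols for _ in range(rows)]
    near_i = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            for sr, sc in surface:
                dist = math.hypot(r - sr, c - sc)
                if dist < d[r][c]:
                    d[r][c] = dist
                    near_i[r][c] = intensity[sr][sc]
    return d, near_i
```

For images of realistic size, a linear-time distance transform that also propagates the index of the nearest surface pixel would replace the inner loop.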

During imaging, a point on the surface may be transformed to a pixel looking more like the background; a point not on the surface may be transformed to a pixel as if on the surface. The PicPre probability model allows for such variations rather than simply stating that a pixel coming from a point on surface y must have label y. In addition, the PicPre probability model takes the following observations into account. An off-surface point closer to an on-surface point may appear as a pixel similar in location and intensity to the type of pixel produced by the on-surface point. An off-surface point that has a stronger on-surface neighbor point is more likely to appear as a pixel that is similar to the type of pixel produced by the on-surface neighbor point. To satisfy the above considerations, we design the following generic parametric PicPre probability model P(Y|Θ):

$$
\frac{P(Y = y|\Theta)}{f(I(y), d(y)|\lambda_y)} = \frac{P(Y = K|\Theta)}{\beta}, \qquad y = 1, 2, \cdots, K-1, \tag{15}
$$

with the constraints

$$
\sum_{y=1}^{K} P(Y = y|\Theta) = 1 \quad \text{and} \quad P(Y = y|\Theta) \ge 0, \qquad y = 1, \cdots, K. \tag{16}
$$

In Eq. (15), f(I, d|λ) : [0,∞) × [0,∞) → (0,∞) is a decay function decreasing with d but increasing with I; λ_1, λ_2, · · · , λ_{K−1} are non-negative decay rates of different classes; and β is a non-negative parameter which corresponds to the strength of a pixel being the background. Solving Eqs. (15) and (16), we get

$$
P(Y = y|\Theta) =
\begin{cases}
\dfrac{f(I(y), d(y)|\lambda_y)}{\beta + \sum_{k=1}^{K-1} f(I(k), d(k)|\lambda_k)}, & y = 1, 2, \cdots, K-1, \\[2ex]
\dfrac{\beta}{\beta + \sum_{k=1}^{K-1} f(I(k), d(k)|\lambda_k)}, & y = K.
\end{cases}
\tag{17}
$$
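The step from Eqs. (15) and (16) to Eq. (17) is short and can be written out as follows. The shorthand $f_y = f(I(y), d(y)|\lambda_y)$ is introduced here only for brevity, and $\beta > 0$ is assumed so that the division in Eq. (15) is defined.

```latex
\begin{align*}
P(Y = y|\Theta) &= \frac{f_y}{\beta}\, P(Y = K|\Theta),
  \qquad y = 1, \cdots, K-1
  && \text{(from Eq. (15))} \\
1 &= P(Y = K|\Theta) \left( 1 + \frac{1}{\beta} \sum_{k=1}^{K-1} f_k \right)
  && \text{(summing over } y \text{, Eq. (16))} \\
P(Y = K|\Theta) &= \frac{\beta}{\beta + \sum_{k=1}^{K-1} f_k},
  \qquad
  P(Y = y|\Theta) = \frac{f_y}{\beta + \sum_{k=1}^{K-1} f_k}.
\end{align*}
```

Because $f$ is positive and $\beta$ is non-negative, every term is non-negative, so the inequality constraints in Eq. (16) hold automatically.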



In the above model, the probability of a pixel being class y should be in proportion to some monotonically decreasing function of its distance to the nearest pixel coming directly from surface y; the probability of a pixel being on the background should be in proportion to some function of the smallest distances from this pixel to all other types of non-background pixels. Meanwhile, a pixel is more likely to be from surface y if its neighbor pixel coming directly from surface y has a larger intensity.

In our study, we design the intensity exponential decay model with f(I, d|λ) = I e^{−λd}, that is

$$
P(Y = y|\Theta) =
\begin{cases}
\dfrac{I(y)\, e^{-\lambda_y d(y)}}{\beta + \sum_{k=1}^{K-1} I(k)\, e^{-\lambda_k d(k)}}, & y = 1, 2, \cdots, K-1, \\[2ex]
\dfrac{\beta}{\beta + \sum_{k=1}^{K-1} I(k)\, e^{-\lambda_k d(k)}}, & y = K.
\end{cases}
\tag{18}
$$

We will illustrate this particular PicPre probability model by examples in the next section.
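For a single pixel, the intensity exponential decay model of Eq. (18) amounts to a few lines of Python. The sketch below is illustrative; the function name `picpre_probabilities` and the argument layout (one entry per surface class, background last in the output) are assumptions, and the parameter values in any call are made up rather than estimated.

```python
import math

def picpre_probabilities(intensities, distances, lambdas, beta):
    """Eq. (18) for one pixel: K-1 surface classes followed by the background.

    intensities[y], distances[y] are I(y) and d(y) of the closest class-y
    pixel; lambdas[y] is the decay rate of class y; beta is the background
    strength. Assumes beta + sum of decay terms is positive.
    """
    f = [i * math.exp(-lam * d)
         for i, d, lam in zip(intensities, distances, lambdas)]
    z = beta + sum(f)
    return [f_y / z for f_y in f] + [beta / z]
```

A pixel far from both surfaces is assigned almost entirely to the background, while a pixel lying on a bright surface pixel is assigned mostly to that surface class.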

7. Training PixApp and PicPre Probability Models without Low-Level Edge Groundtruth

In some situations, examples of object models Θ and their images are given, but the class labels Y are not available. In other situations, the class labels Y are too inaccurate to use. To achieve overall optimality, we need to consider simultaneous estimation of the PixApp and PicPre probability models. Our strategy allows each pixel to participate, with a weight between 0 and 1, in the PixApp probability models of the different classes. We solve the problem of joint estimation of the PixApp and the PicPre probability models by a generalized EM algorithm. In the off-line training phase, the underlying goal is to make an accurate and consistent estimate of p(Z|Θ). We use the Kullback-Leibler divergence as the criterion for the density estimation, which is equivalent to the expected log likelihood. A caveat is that the target of estimation is not Θ, but the conditional p.d.f. p(Z|Θ).

It is necessary to inspect how the models are estimated in the two-stage approach, i.e., the feature detection and model fitting approach. In the feature detection stage, the PixApp probability model p(Z|Y) is used. Estimation of p(Z|Y) requires the knowledge of class labels. However, the class labels are not observed data and they are typically produced by human experts. A class label has to be unique for each pixel. In the model fitting stage, the PicPre probability model P(Y|Θ) is used. Using the class labels and known surface models, P(Y|Θ) can be estimated. The only problem with these two estimations is that the uncertainty of class label Y as described by P(Y|Θ) is not taken into account in the estimation of p(Z|Y). The isolation can degrade the performance seriously when the uncertainty of class labels for given surface models is prominent.

In the integrated approach, the conditional probability p(Z|Θ) is given in terms of a summation over Y by

$$
p(Z|\Theta) = \sum_{y} P(Y = y|\Theta)\, p(Z|Y = y).
$$



In this form, we still need to estimate the PicPre probability model P(Y|Θ) and the PixApp probability model p(Z|Y), but we do not have to make a decision on the class label Y of each pixel, because every possibility of Y is considered. Since we have decided that P(Y|Θ) is a parametric model and p(Z|Y) is a nonparametric model, the overall model p(Z|Θ) is a hybrid model. On one hand, maximum likelihood estimation for p(Z|Θ) requires joint estimation of the PixApp and the PicPre probability models. On the other hand, joint estimation of a hybrid model poses a computational challenge.

Although it is typically used for parametric density estimation with missing or hidden variables, the expectation maximization (EM) algorithm is suited to maximum likelihood estimation of any p.d.f. that can be written as an integral or summation. The missing or hidden variable is precisely the integration or summation variable. Whether the targeted p.d.f. is parametric, nonparametric or hybrid affects neither the applicability nor the convergence of the EM algorithm. In the integrated model, the goal is to maximize

E[log p(Z|Θ)] (19)

over all possible p.d.f.s p(Z|Θ) (not over Θ in our case). When p(Z|Θ) is written in the integrated form, Y is the missing or hidden variable. Instead of maximizing Eq. (19), the EM algorithm maximizes an approximation of

E[log p(Y, Z|Θ)] (20)

over p(Y, Z|Θ) in its iterations. We will link p(Y, Z|Θ) to the PixApp and the PicPre probability models soon. Through the EM algorithm, the maximization of E[log p(Y, Z|Θ)] is substantially easier computationally than that of E[log p(Z|Θ)]. We initialize p(Y, Z|Θ) by an initial guess p_0(Y, Z|Θ). It is then followed by iterations of expectation steps (E-steps) and maximization steps (M-steps). In the E-step of iteration m, we first compute the conditional probability π_m(Y|Z, Θ), using p_m(Y, Z|Θ). Then we find the expectation φ_m(p(Y, Z|Θ)) = E_{π_m}[log p(Y, Z|Θ)], using the conditional probability π_m(Y|Z, Θ). In the M-step of iteration m, we find a solution that maximizes the expectation φ_m(p(Y, Z|Θ)), assigning the optimal solution to p_{m+1}(Y, Z|Θ). The E-step and the M-step alternate until convergence is achieved. When the M-step returns a sub-optimal solution that does not decrease φ_m(p(Y, Z|Θ)), the algorithm is known as a generalized EM algorithm. Both the original and the generalized EM algorithms increase the targeted expected log likelihood E[log p(Z|Θ)] monotonically as a function of the iteration number [24].
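The reason that working with $\log p(Y, Z|\Theta)$ still improves $\log p(Z|\Theta)$ can be seen from the standard lower bound used in EM analyses. The derivation below is the textbook argument based on Jensen's inequality, stated for a single observation; it is included as a reminder and is not taken from the chapter.

```latex
\begin{align*}
\log p(Z|\Theta)
  &= \log \sum_{y} \pi_m(y|Z,\Theta)\,
     \frac{p(y, Z|\Theta)}{\pi_m(y|Z,\Theta)} \\
  &\ge \sum_{y} \pi_m(y|Z,\Theta) \log p(y, Z|\Theta)
     \;-\; \sum_{y} \pi_m(y|Z,\Theta) \log \pi_m(y|Z,\Theta).
\end{align*}
```

The second sum does not depend on the density being optimized in the M-step, and the bound is tight when $p = p_m$. Any M-step solution that does not decrease the first sum therefore cannot decrease $\log p(Z|\Theta)$, which is the property the generalized EM algorithm relies on.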

Now we associate p(Y, Z|Θ) with the PixApp probability model p(Z|Y) and the PicPre probability model P(Y|Θ). By the conditional independence assumption p(Θ|Z, Y) = p(Θ|Y), we have

$$
p(Y, Z|\Theta) = P(Y|\Theta)\, p(Z|Y, \Theta) = P(Y|\Theta)\, p(Z|Y).
$$

The above equation implies that p(Y, Z|Θ) is exactly the product of the prediction probability P(Y|Θ) and the appearance probability p(Z|Y). Thus the M-step can be written as

$$
\begin{aligned}
\max_{p(Y,Z|\Theta)} E_{\pi_m}[\log p(Y, Z|\Theta)]
&= \max_{p(Y,Z|\Theta)} \left\{ E_{\pi_m}[\log P(Y|\Theta)] + E_{\pi_m}[\log p(Z|Y)] \right\} \\
&= \max_{P(Y|\Theta)} E_{\pi_m}[\log P(Y|\Theta)] + \max_{p(Z|Y)} E_{\pi_m}[\log p(Z|Y)].
\end{aligned}
\tag{21}
$$

Thus we have broken the M-step into two independent optimization problems. One is the parametric estimation of the PicPre probability model, and the other is the nonparametric estimation of the PixApp probability model. Replacing p(Y, Z|Θ) by P(Y|Θ)p(Z|Y), we give Alg. 5 Estimate-Integrated-Model.

Algorithm 5 Estimate-Integrated-Model

Initialization: P_0(Y|Θ) and p_0(Z|Y)

Iteration:

(1) E-step.

$$
\pi_m(Y|Z,\Theta) = \frac{P_m(Y|\Theta)\, p_m(Z|Y)}{\sum_{k} P_m(Y = k|\Theta)\, p_m(Z|Y = k)} \tag{22}
$$

$$
\phi_m(P(Y|\Theta)) = E_{\pi_m}[\log P(Y|\Theta)] \tag{23}
$$

$$
\psi_m(p(Z|Y)) = E_{\pi_m}[\log p(Z|Y)] \tag{24}
$$

(2) M-step.

$$
P_{m+1}(Y|\Theta) = \arg\max_{P(Y|\Theta)} \phi_m(P(Y|\Theta)) \tag{25}
$$

$$
p_{m+1}(Z|Y) = \arg\max_{p(Z|Y)} \psi_m(p(Z|Y)) \tag{26}
$$

The Estimate-Integrated-Model algorithm differs in Eq. (22) from the two-stage estimation solution. In the two-stage approach, every pixel is assigned a unique class label y, equivalent to setting π_m(Y|Z, Θ) = δ(Y − y). Here, π_m(Y|Z, Θ) signifies the probability profile of class labels for given images and the surface model. In addition, Estimate-Integrated-Model iterates over the two steps, while the two-stage approach does them only once.
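For a single pixel, the E-step of Eq. (22) is a normalized product of the current PicPre and PixApp values. The Python sketch below is illustrative; the function name `e_step` is hypothetical, and falling back to the prior when every class density is zero is an assumption of this example, not something the chapter specifies.

```python
def e_step(prior, likelihood):
    """Eq. (22) for one pixel.

    prior[k] is P_m(Y = k | Theta) from the PicPre model and likelihood[k]
    is p_m(z | Y = k) from the PixApp model. Returns pi_m(Y = k | z, Theta).
    """
    joint = [p * l for p, l in zip(prior, likelihood)]
    total = sum(joint)
    if total == 0.0:
        # Assumed fallback: no appearance evidence, keep the prediction.
        return list(prior)
    return [j / total for j in joint]
```

The two-stage approach corresponds to replacing the returned profile by a one-hot vector, which discards the soft weights that the M-step would otherwise use.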

Here is a summary of the overall training strategy. Training images and the corresponding ground-truth surface models are input data to the estimation. Training images are pre-processed to remove noise and obtain feature vectors. Ground-truth surface models produce simulated images after imaging simulation. Before the first iteration, some initial guesses for the PixApp and PicPre probability models are made. With the PixApp probability model, feature vectors are optimally quantized to calculate the PixApp probability of each pixel for each class. K maps of PixApp probabilities are generated per image. Meanwhile, the PicPre probability profile of each pixel is calculated on the simulated images using the PicPre probability model. A total of K PicPre probability maps are created for each image. With the PixApp and PicPre probability maps, we can calculate a class probability profile for each pixel given both the observed images and the ground-truth surface models. The class profile of each pixel, as a weight vector, participates both in finding a grid quantization for the PixApp probability model and in the parameter estimation of the PicPre probability model. With the weight vectors and the feature vectors, we obtain an updated PixApp probability model. With the weight vectors and the simulated images, we obtain an updated PicPre probability model. Then we start the next iteration with the newly updated models, until the overall log likelihood E_{π_m}[log p(Y, Z|Θ)] converges.

7.1. PixApp Probability Model Estimation

One of the two expectations to be maximized in the M-step is E_π[log p(Z|Y)]. The expectation is with respect to both Y and Z. However, the unknown conditional probability P(Y|Z, Θ) is replaced by an approximation π(Y|Z, Θ). Hence E_π[log p(Z|Y)] can be written as

$$
E_{\pi}[\log p(Z|Y)] = \int p(z|\Theta) \sum_{k=1}^{K} \pi(Y = k|z, \Theta) \log p(z|Y = k)\, dz.
$$

Taking the sample average log likelihood as the expected value, we obtain

$$
L_1 = \frac{1}{N} \sum_{n=1}^{N} \sum_{k=1}^{K} \pi(y_n = k|z_n, \Theta) \log p(z_n|y_n = k).
$$

Since π(Y|Z, Θ) is given in the E-step, we simplify the notation by letting

$$
\pi_{kn} = \pi(y_n = k|z_n, \Theta), \tag{27}
$$

which can be thought of as a normalized class weight. Then L_1 can be written as

$$
L_1 = \frac{1}{N} \sum_{n=1}^{N} \sum_{k=1}^{K} \pi_{kn} \log p(z_n|y_n = k),
$$

which is the weighted log likelihood of the sample. Thus maximization of E_π[log p(Z|Y)] is approximated by that of L_1. As Z is usually a continuous vector, p(Z|Y) is a p.d.f. conditioned on the discrete variable Y. Which assumptions can be made on p(Z|Y) depends directly on how well the imaging process is understood. When there is complex and less studied noise during the imaging process, we would like to make as few assumptions as possible. In the worst case, where the least information is available about the noise, we use the grid quantization technique described in Section 4 to describe the density function p(Z|Y).
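How the class weights enter the density estimate can be shown on a fixed one-dimensional grid. For a piecewise-constant density on given cells, the weighted log likelihood is maximized by giving each cell a mass proportional to the total weight of the points it contains. The Python sketch below illustrates only this step; it omits the grid search and the smoothing of Section 4, and the function name `weighted_cell_densities` is hypothetical.

```python
import bisect

def weighted_cell_densities(zs, weights, bounds):
    """Cell densities maximizing sum_n w_n log p(z_n) on a fixed 1-D grid.

    bounds is the sorted list of decision boundaries; cell q covers
    (bounds[q], bounds[q + 1]]. weights[n] plays the role of pi_kn for one
    class k. Points outside the grid are ignored.
    """
    mass = [0.0] * (len(bounds) - 1)
    for z, w in zip(zs, weights):
        q = bisect.bisect_left(bounds, z) - 1  # index of the cell holding z
        if 0 <= q < len(mass):
            mass[q] += w
    total = sum(mass)
    return [m / (total * (bounds[q + 1] - bounds[q]))
            for q, m in enumerate(mass)]
```

Running the function once per class, with that class's column of weights, yields one weighted histogram per class in place of the hard-labeled histograms of the two-stage approach.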



We reduce the original 5-dimensional pixel feature vector Z to a 3-dimensional vector X and perform grid quantization on X instead of Z. The optimal 3-D quantization grid we obtained for the three classes is displayed in Fig. 10, where 1-D and 2-D combinations of the grid are drawn. We show the 1-D and 2-D marginal densities of the 3-D PixApp probability densities p(X|Y) in Fig. 11 and Fig. 12, respectively. Figure 13 shows estimated PixApp probability maps for the three classes of a given image.

Fig. 10. The 3-D quantization grid. [Panels: 1-D and 2-D projections of the grid onto the axes x1, x2 and x3.]

7.2. PicPre Probability Model Estimation

Fig. 11. One-dimensional marginal densities of estimated PixApp probability model. [Panels: (a) x(1), (b) x(2), (c) x(3); each plots the probability density function p(x(·)) with curves labeled by pixel class (EPI, ENDO, background).]

The other expectation to be maximized in the M-step is E_π[log P(Y|Θ)]. The expectation is on both Z and Y, where Z is implicitly expressed in π(Y|Z, Θ). E_π[log P(Y|Θ)] is an estimate of E[log P(Y|Θ)] with P(Y|Z, Θ) replaced by π(Y|Z, Θ). Therefore we have

$$
E_{\pi}[\log P(Y|\Theta)] = \int p(z|\Theta) \sum_{k=1}^{K} \pi(Y = k|z, \Theta) \log P(Y = k|\Theta)\, dz.
$$

We can further obtain an estimate of E_π[log P(Y|Θ)] by the average log likelihood of the sample, that is

$$
L_2 = \sum_{n=1}^{N} \sum_{k=1}^{K} \pi(y_n = k|z_n, \Theta) \log P(y_n = k|\Theta).
$$

The average factor 1/N is not shown because it does not affect the maximization. Therefore

$$
L_2 = \sum_{n=1}^{N} \sum_{k=1}^{K} \pi_{kn} \log P(y_n = k|\Theta),
$$

which is the weighted log likelihood of the PicPre label assignments. Hence the maximization of E_π[log P(Y|Θ)] reduces to the maximization of the weighted log likelihood L_2. As we have defined P(Y|Θ) by a continuous parametric model previously, L_2 is a function of the PicPre model parameters λ_1, λ_2, · · · , λ_{K−1}, β. Since the parameters are all nonnegative, we can re-parameterize them by

$$
\lambda_k = \tau_k^2, \quad k = 1, \cdots, K-1, \qquad \beta = \alpha^2.
$$

We denote the parameter vector by

$$
u = [\tau_1, \tau_2, \cdots, \tau_{K-1}, \alpha],
$$

and the likelihood L_2 by L_2(u).

Fig. 12. Two-dimensional marginal densities of estimated PixApp probability model. [Panels: (a) Class 1 (EPI pixel), (b) Class 2 (ENDO pixel), (c) Class 3 (Background pixel); each shows the (x1, x2), (x1, x3) and (x2, x3) marginals.]

Fig. 13. PixApp probability maps of apical four chamber view. [Panels: (a) Class 1 Epicardium PixApp (p(Z|1)) map; (b) Class 2 Endocardium PixApp (p(Z|2)) map; (c) Class 3 Background PixApp (p(Z|3)) map.]

We adopt a quasi-Newton method that has been widely applied in unconstrained continuous optimization problems, called the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method. It updates the Hessian matrix with a rank two difference matrix during every iteration and guarantees that the approximated Hessian matrix is positive definite for minimization problems [25]. The major steps include finding a Newton search direction, the line search and the Hessian update. Since we are to maximize L_2(u), the objective function of the minimization is −L_2(u). Figure 14 shows the estimated intensity exponential decay PicPre probability model. Figure 15 illustrates the pixel class prediction process. We obtain a simulated image, shown in Fig. 15(a), through ultrasound imaging simulation. Then we compute the distance transform of the epicardium and endocardium contours, shown as Figs. 15(b) and (c). Figures 15(d) and (e) are the intensity maps of the closest on-surface pixels. We apply the estimated PicPre probability model to the distance and intensity maps and display the PicPre probability maps in Fig. 15(f) to (h).
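The objective handed to the quasi-Newton search can be written down directly from Eq. (18) and the re-parameterization. The Python sketch below is illustrative: the function name `neg_l2` and the per-pixel tuple layout are assumptions, and it presumes positive intensities and a nonzero α so that no logarithm of zero is taken for a class with positive weight.

```python
import math

def neg_l2(u, pixels):
    """-L2(u) for the intensity exponential decay model with K classes.

    u = [tau_1, ..., tau_{K-1}, alpha], with lambda_k = tau_k**2 and
    beta = alpha**2. Each pixel is a tuple (pi, intensities, distances):
    pi has K weights pi_kn; the other two have K-1 entries, I(k) and d(k).
    """
    *taus, alpha = u
    beta = alpha ** 2
    total = 0.0
    for pi, intens, dists in pixels:
        f = [i * math.exp(-(t ** 2) * d)
             for i, d, t in zip(intens, dists, taus)]
        z = beta + sum(f)
        probs = [f_y / z for f_y in f] + [beta / z]
        # Classes with zero weight contribute nothing to the sum.
        total += sum(w * math.log(p) for w, p in zip(pi, probs) if w > 0.0)
    return -total
```

Because u is unconstrained, the function can be passed as it stands to a general-purpose minimizer, for example `scipy.optimize.minimize` with `method="BFGS"`; squaring τ_k and α afterwards recovers nonnegative λ_k and β.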

8. Surface Reconstruction for Left Ventricle Using the Estimated Pixel Appearance Probability Model

Fig. 14. The estimated PicPre probability model. [Panels: EPI, ENDO and background probability as a function of pixel location (mm), from −15 to 15, with the EPI and ENDO surface positions marked.]

In our experiment, we used a total of 45 in vivo clinical studies. There are 16 normal studies and 29 diseased studies. There are six condition groups among the 45 studies. Forty-four sets of image sequences were acquired from ATL ultrasound machines; one set of image sequences was acquired from an HP ultrasound machine. These image sequences were acquired for other studies by three operators over a period of two years, so that they incorporate some amount of operator and system setting variability. The frame rate was 30 per second. The horizontal and vertical resolutions of the images were, respectively, 0.37 to 0.46 mm and 0.37 to 0.41 mm per pixel. For each study, we selected subsequences of images from four or five different views, including three or four long-axis views and one short-axis view. Each view was further cut into an upper sector and a lower sector, divided by an arc passing through an inner point of the LV and centered at the transducer location. We selected 20 studies of good image quality as the training set. The remaining 25 studies formed the test set. We performed the experiment at end diastole. We measured the projection distance between the optimized and the ground-truth surface models. The projection distance from surface A to surface B is defined as the mean vertex projection distance from all the vertices of surface A to surface B. The projection distance between surfaces A and B is the average of the projection distances from A to B and from B to A.
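The symmetric structure of the error measure can be sketched in Python on vertex lists. The sketch is an approximation and is labeled as such: the projection of a vertex onto the other surface is replaced by the distance to the nearest vertex of that surface, which overestimates the true projection distance on coarse meshes. The function names are hypothetical.

```python
import math

def mean_nearest_distance(a, b):
    """Mean over vertices of a of the distance to the nearest vertex of b.

    Nearest-vertex distance stands in for the vertex-to-surface projection
    distance (an approximation made for this sketch).
    """
    return sum(min(math.dist(p, q) for q in b) for p in a) / len(a)

def symmetric_distance(a, b):
    """Average of the two directed distances, from a to b and from b to a."""
    return 0.5 * (mean_nearest_distance(a, b) + mean_nearest_distance(b, a))
```

Averaging the two directions keeps the measure from favoring a reconstructed surface that covers only part of the ground-truth surface.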

In our study of 3-D left ventricle surface reconstruction from 2-D echocardiographic images, we obtained much better results using the pixel appearance probability model with the integrated approach than with the two-stage approach. In the two-stage approach, we used the Canny edge detector to find boundaries and then reconstructed the 3-D surface model from the detected edges. We were only able to obtain meaningful results for the studies with the best quality images. The distance errors between the manually delineated surfaces and the reconstructed ones ranged from 3.1 mm to 6.6 mm for normal cases, which was far from the requirement for practical clinical use. With the integrated approach, we were able to handle images of modest quality. On all the normal studies, we achieved distance errors from 1.1 mm/1.7 mm (endocardium/epicardium) to 3.1 mm/4.0 mm, the average being 1.9 mm/2.4 mm [16].



Fig. 15. A simulated image and its intensity, distance and PicPre probability maps. [Panels: (a) the simulated image; (b) d(1) distance map; (c) d(2) distance map; (d) I(1) intensity map; (e) I(2) intensity map; (f) P(1|Θ) map; (g) P(2|Θ) map; (h) P(3|Θ) map.]

9. Conclusions

In this chapter, we have presented the pixel appearance probability model for representation of local pixel information, and illustrated its usage in echocardiographic image analysis. This technique differs fundamentally from standard low-level image processing techniques based on feature detection, in that it preserves much richer information from the original images. The model is obtained by a grid quantization technique, which is a statistically effective and computationally efficient approach to estimating probability density functions. The pixel appearance model can be used in its own right for purposes such as pixel classification. We have argued that this model can be used much more effectively in an integrated object reconstruction framework than in the traditional approach with separate low- and high-level stages. In our study, the adoption of the pixel appearance probability model has reduced the error of the left ventricle surface reconstruction, measured against the ground-truth surfaces, by about 2.6 mm.

Acknowledgments

The authors wish to thank Dr. Florence H. Sheehan and Dr. Richard K. Johnson for discussions and suggestions related to the work reported in this chapter. The authors also wish to thank the Cardiovascular Research and Training Center at the University of Washington for providing the echocardiographic image sequences and left ventricle surface models from manually delineated features.

References

1. A. Chakraborty, L. H. Staib, and J. S. Duncan. Deformable boundary finding in medical images by integrating gradient and region information. IEEE Transactions on Medical Imaging, 15(6):859–870, December 1996.

2. T. F. Cootes, A. Hill, C. J. Taylor, and J. Haslam. The use of active shape models for locating structures in medical images. Image and Vision Computing, 12(6):355–366, July 1994.

3. T. F. Cootes, G. J. Edwards, and C. J. Taylor. Active appearance models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(6):681–685, June 2001.

4. M. Mignotte and J. Meunier. Deformable template and distribution mixture-based data modeling for the endocardial contour tracking in an echographic sequence. In Proceedings of 1999 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 225–30, 1999.

5. M. P. Gessaman. A consistent nonparametric multivariate density estimator based on statistically equivalent blocks. The Annals of Mathematical Statistics, 41:1344–46, 1970.

6. L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Statistics/Probability Series. Wadsworth & Brooks/Cole, Pacific Grove, California, 1984.

7. D. W. Scott and G. Whittaker. Multivariate applications of the ASH in regression. Communications in Statistics A: Theory and Methods, 25:2521–30, 1996.

8. W. Wang, J. Yang, and R. R. Muntz. STING: A statistical information grid approach to spatial data mining. In Proceedings of the 23rd VLDB Conference, pages 186–195, 1997.

9. A. Hinneburg and D. A. Keim. Optimal grid-clustering: Towards breaking the curse of dimensionality in high-dimensional clustering. In M. P. Atkinson, M. E. Orlowska, P. Valduriez, S. B. Zdonik, and M. L. Brodie, editors, Proceedings of 25th International Conference on Very Large Data Bases, pages 506–517, Edinburgh, Scotland, UK, September 1999. Morgan Kaufmann.

10. S. D. Bay. Multivariate discretization for set mining. Knowledge and Information Systems, 3(4):491–512, 2001.

11. H. Nagesh, S. Goil, and A. Choudhary. Adaptive grids for clustering massive data sets. In V. Kumar and R. Grossman, editors, Proceedings of the First SIAM International Conference on Data Mining, April 5-7, 2001, Chicago, IL USA, pages 506–517. Society for Industrial & Applied Mathematics, 2001.

12. L. B. Hearne and E. J. Wegman. Maximum entropy density estimation using random tessellations. In Computing Science and Statistics, volume 24, pages 483–7, 1992.

13. L. B. Hearne and E. J. Wegman. Fast multidimensional density estimation based on random-width bins. In Computing Science and Statistics, volume 26, pages 150–5, October 1994.

14. W. Hardle and D. W. Scott. Smoothing by weighted averaging of rounded points. Computational Statistics, 7:97–128, 1992.

15. D. W. Scott. Multivariate Density Estimation – Theory, Practice and Visualization. Wiley Series in Probability and Mathematical Statistics. John Wiley & Sons, 1992.

16. M. Song, R. M. Haralick, F. H. Sheehan, and R. K. Johnson. Integrated surface model optimization for freehand 3-D echocardiography. IEEE Transactions on Medical Imaging, 21(9):1077–1090, 2002.

17. G. Lugosi and A. B. Noble. Consistency of data-driven histogram methods for density estimation and classification. Annals of Statistics, 24:687–706, 1996.

18. K. Fukunaga. Introduction to Statistical Pattern Recognition. Computer Science and Scientific Computing. Academic Press, 2nd edition, 1990.

19. D. E. Goldberg. Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley, 1989.

20. R. M. Haralick and L. G. Shapiro. Computer and Robot Vision, volume I. Addison-Wesley, 1992.

21. M. Song. Ultrasound imaging simulation and echocardiographic image synthesis. Master's thesis, Department of Electrical Engineering, University of Washington, Seattle, Washington, June 1999.

22. M. Song, R. M. Haralick, and F. H. Sheehan. Ultrasound imaging simulation and echocardiographic image synthesis. In International Conference on Image Processing 2000, Vancouver, Canada, September 2000.

23. M. J. Seaidoun. A Fast Exact Euclidean Distance Transform With Application To Computer Vision And Digital Image Processing. Ph.D. dissertation, Northeastern University, 1993.

24. A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39(1):1–38, 1977.

25. J. V. Burke. MATH 516 Numerical Optimization lecture notes. Department of Mathematics, University of Washington, 1999.
