KERNEL METHODS FOR STATISTICAL LEARNING IN COMPUTER VISION
AND PATTERN RECOGNITION APPLICATIONS
By
Refaat Mokhtar Mohamed
M.Sc., EE, Assiut University, Egypt, 2001
B.Sc., EE, Assiut University, Egypt, 1995
A Dissertation
Submitted to the Faculty of the
Graduate School of the University of Louisville
in Partial Fulfillment of the Requirements
for the Degree of
Doctor of Philosophy
Department of Electrical and Computer Engineering
University of Louisville
Louisville, Kentucky
December 2005
KERNEL METHODS FOR STATISTICAL LEARNING IN COMPUTER VISION AND PATTERN
RECOGNITION APPLICATIONS
By
Refaat Mokhtar Mohamed
A Dissertation Approved on
by the Following Reading and Examination Committee:
Aly Farag, Ph.D., Dissertation Director
Jon Atli Benediktsson, Ph.D.
Georgy Gimel’farb, Ph.D.
Greg Rempala, Ph.D.
Hichem Frigui, Ph.D.
Xiangqian Liu, Ph.D.
Tamer Inanc, Ph.D.
Ryan Gill, Ph.D.
DEDICATION
To:
The memory of my mother who died on August 29, 1985
My lovely wife Yosra
ACKNOWLEDGMENTS
All deepest and sincere thanks are due to Almighty ALLAH, the merciful, the compassionate, for the uncountable gifts given to me.
I would like to extend my deepest appreciation to Dr. Aly A. Farag for giving me
the opportunity to join the CVIP Lab and for his direction and assistance in developing
this dissertation. I would also like to thank Dr. Moumen Ahmed for helping me in joining
the Lab. Many thanks to Dr. Greg Rempala for his continuous discussions and support. I
would also like to thank Dr. Jon Atli Benediktsson for joining my PhD committee. I would
also like to thank Dr. Hichem Frigui, Dr. Xiangqian Liu, Dr. Tamer Inanc, and Dr. Ryan
Gill for serving on my committee. I would also like to thank my colleagues in the CVIP
Lab for their continuous support and friendship during the past years, with special thanks
to Ayman El-Baz for his collaboration. I would like to thank all my friends in Louisville
for turning my stay here into a pleasant life.
Finally, I would like to thank my family for their unwavering encouragement and
support, without which this thesis and research would not have been possible.
ABSTRACT
KERNEL METHODS FOR STATISTICAL LEARNING IN COMPUTER VISION AND
PATTERN RECOGNITION APPLICATIONS
Refaat Mohamed
December 1, 2005
Statistical learning-based kernel methods are rapidly replacing other empirical learning methods (e.g., neural networks) as a preferred tool for machine learning due to many attractive features: a strong basis in statistical learning theory; no computational penalty in moving from linear to nonlinear models; and a convex optimization problem, guaranteeing a unique global solution and consequently producing systems with excellent generalization performance. This research introduces statistical learning for solving different problems in computer vision and pattern recognition applications.
Probability density function (pdf) estimation is one of the major ingredients in Bayesian pattern recognition and machine learning. Many algorithms have been introduced for solving the probability density function estimation problem, in either a parametric or a nonparametric setup. In the parametric approach, a reasonable functional form for the probability density function is assumed, so the problem reduces to estimating the parameters of that form. For estimating general density functions, nonparametric setups are used, in which no form is assumed for the density function.
The curse of dimensionality is a major difficulty in density function estimation with high-dimensional data spaces. An active area of research in the pattern
analysis community is to develop algorithms which cope with the dimensionality problem.
The purpose of this thesis is to present a kernel-based method for solving the density estimation problem, one of the fundamental problems in machine learning. The proposed method is largely insensitive to the dimensionality of the input space.
The contribution of this thesis is threefold: creating a reliable and efficient learning-based density estimation algorithm which is minimally dependent on the input space dimensionality; investigating efficient learning algorithms for the proposed approach; and investigating the performance of the proposed algorithm in different computer vision and pattern recognition applications.
TABLE OF CONTENTS
DEDICATION . . . iii
ACKNOWLEDGMENTS . . . iv
ABSTRACT . . . v
TABLE OF CONTENTS . . . vii
LIST OF TABLES . . . x
LIST OF FIGURES . . . xi
Nomenclature . . . xiv
CHAPTER
I. INTRODUCTION . . . 1
A. Research Domain of the Thesis . . . 3
1. Phase I: Implementation and Analysis of the MF-Based SVM Density Estimation Framework . . . 4
2. Phase II: Automation and Enhancements for the Learning Algorithm . . . 4
3. Phase III: Applications of the New MF-Based SVM Density Estimation Framework in Real World Pattern Recognition Problems . . . 5
B. Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
II. STATISTICAL LEARNING BASED SUPPORT VECTOR MACHINES . . . 9
A. Support Vector Machines (SVM) Regression . . . 10
B. Mean Field Theory for Learning of SVM Regression . . . 14
C. Summary of the Statistical Learning MF-Based SVM Regression Algorithm . . . 17
D. Remarks on the MF-Based SVM Regression Algorithm . . . 18
E. Summary . . . 19
III. DENSITY ESTIMATION USING MEAN FIELD BASED SUPPORT VECTOR MACHINES . . . 20
A. Density Estimation Problem Formulation . . . 21
B. Obtaining the Probability Density Function Estimate and Choosing of the Kernel Function . . . 23
C. Summary of the Proposed SVM Density Estimation Algorithm . . . 24
D. Consistency of the Proposed Algorithm . . . 24
1. Equivalent Kernel in Gaussian Processes Prediction . . . 25
2. Consistency Argument of the Proposed Algorithm . . . 27
E. Convergence of the Proposed Algorithm . . . 29
F. Estimation of the Learning Parameters . . . 29
1. Kernel Optimization using the EM algorithm . . . 29
2. Cross-Validation for Parameters Estimation . . . 31
G. Experiments for Evaluating the Proposed Density Estimation . . . 33
1. Density estimation for a 1-D Gaussian distribution . . . 34
2. Density estimation for a 1-D Mixture of Gaussian distributions . . . 35
3. Density Estimation for a 1-D Rayleigh Distribution . . . 36
4. Comparison with state of the art methods . . . 37
5. Density estimation for 2-D cases . . . 39
6. Experiments on the automatic selection of the Kernel width using the EM algorithm . . . 41
7. Experiments on the automatic selection of the learning parameters using Cross Validation . . . 42
8. Experiments on the algorithm convergence . . . 42
H. Conclusion . . . 42
IV. STATISTICAL LEARNING IN COMPUTER VISION . . . 49
A. Camera Calibration . . . 50
1. An Overview . . . 50
2. Basic Regression Relations . . . 52
3. Simultaneous Optimization of the Regression Relations . . . 53
4. The Overall Calibration Algorithm . . . 54
B. Discussion of Some Calibration Methods . . . 55
1. Linear Direct Transform Method (LDT) . . . 55
2. Nonlinear Two Stages Method (NL) . . . 56
3. Neural Networks Method (NN) . . . 56
4. Heikkile Method (Heikki) . . . 57
C. Experimental Results and Discussions . . . 58
1. Simulation with Synthetic Data . . . 59
2. Experiments with Real Images . . . 62
D. Conclusion . . . 67
V. APPLICATIONS ON THE PROPOSED DENSITY ESTIMATION APPROACH . . . 68
A. Test-of-agreement (ToA) for the response of two classifiers . . . 68
B. Experiments for Density Estimation Using Real Remote Sensing Multispectral Data . . . 70
1. Experiments for density estimation using a multispectral agricultural area . . . 70
2. Experiments for density estimation using a multispectral urban area . . . 74
C. Experiments for Density Estimation Using Real Remote Sensing Hyperspectral Data . . . 77
1. Experiments for density estimation using a hyperspectral 34-band data set . . . 77
2. Experiments for density estimation using a hyperspectral 58-band data set . . . 79
D. Applications in the Class Prior Probability Estimation . . . 82
1. MRF Model . . . 83
2. MRF Parameters Estimation Using SVM . . . 85
3. Image Segmentation Algorithm . . . 85
4. Experiment on MRF Model Parameters Estimation . . . 87
5. Experiments Using Remote Sensing Data . . . 88
E. Conclusion . . . 89
VI. STATISTICAL LEARNING FOR CHANGE DETECTION . . . 96
A. Problem Statement . . . 96
B. Literature Review . . . 96
C. Proposed Change Detection Approach . . . 99
D. Statistical Shape Modeling . . . 100
E. Change Detection Algorithm . . . 102
F. Discussion of Some Change Detection Methods . . . 102
1. Change Detection using Automatic Analysis of the Difference Image and EM Algorithm (DIEM) . . . 104
2. Change Detection using MRF Modeling (DIMRF) . . . 106
G. Experimental Work . . . 106
1. Experiments on the proposed shape modeling approach . . . 106
2. Experiments on Statistical Shape Modeling Using the MF-based SVM . . . 107
3. Experiments on the Segmentation Algorithm . . . 107
a. Image Segmentation Algorithm . . . 107
b. Results . . . 108
4. Experiments on Different Resolutions Data Sets . . . 113
5. Experiments on the Change Detection Algorithm . . . 115
a. Cairo Data Set . . . 116
b. Louisville Data . . . 117
H. Conclusion . . . 118
VII. Conclusion . . . 121
A. Review and Applications . . . 122
B. Limitations . . . 123
C. Recommendations . . . 123
REFERENCES . . . 125
CURRICULUM VITA . . . 138
LIST OF TABLES
TABLE . . . PAGE
1. Parameters of the 1-D mixture of Gaussians density function . . . 35
2. Results for the mixture of a Gaussian and Exponential density functions . . . 39
3. Ground truth camera parameters versus estimated parameters . . . 61
4. Error in the 3-D reconstructed data . . . 65
5. Classification confusion matrix for the multispectral agricultural area using the MF-based SVM estimator . . . 72
6. Classification accuracy using different density estimators for the multispectral agricultural area . . . 74
7. Classification confusion matrix for the multispectral urban area using the MF-based SVM estimator . . . 76
8. Classification accuracy using different density estimators for the multispectral urban area . . . 77
9. Classification confusion matrix for the hyperspectral 34-band data using the MF-based SVM estimator . . . 79
10. Classification accuracy using different density estimators for the hyperspectral 34-band data . . . 80
11. Classification confusion matrix for the hyperspectral 58-band urban area using the MF-based SVM estimator . . . 81
12. Classification accuracy using different density estimators for the hyperspectral 58-band urban area . . . 82
13. The estimated means for 2nd order MRF cliques . . . 85
14. Estimated parameters for the mixture of Gaussians distribution . . . 87
15. Classification confusion matrix for the hyperspectral urban area after applying the MRF modeling . . . 89
16. Classification accuracy after applying MRF modeling for the 58-band hyperspectral data set . . . 90
17. Classification confusion matrix for the multispectral data set without using shape modeling . . . 112
18. Classification confusion matrix for the multispectral data set using shape modeling . . . 113
19. Comparison of classification accuracies for the 15-meter resolution data set using different algorithms . . . 114
20. Comparison of classification accuracies for the 60-meter resolution data set using different algorithms . . . 115
21. Detection rates of the different change detection approaches . . . 117
LIST OF FIGURES
FIGURE . . . PAGE
1. Splitting data in Cross-Validation setups . . . 32
2. Estimation of the 1-D Gaussian density function with the SVM density estimation which is formulated with the (a) proposed, (b) traditional formulation . . . 35
3. Estimation of a 1-D mix of Gaussian density functions . . . 36
4. Estimation of a Rayleigh density function . . . 37
5. Estimation of a 1-D mixture of a Gaussian and an Exponential density functions, (a) SDC method (quoted from [1]), (b) MF-based Method . . . 39
6. Estimation of a 2-D Gaussian density function, (a) the reference density function and its contour, (b) the estimated density using the traditional formulation-based SVM and its contour, (c) the estimated density using MF-based SVM and its contour . . . 44
7. Comparison between the estimation results of a 2-D mixture of an isotropic Gaussian and two Gaussians with both positive and negative correlation structure, (a) the contour of the estimated density using the Parzen window method, (b) the contour of the estimated density using the Reduced set method, (c) the contour of the estimated density using the MF-based SVM method, and (d) the estimated density using MF-based SVM . . . 45
8. Estimation of the mixture of Gaussians in Fig. 3: (a) with the proposed algorithm for automatic kernel parameters estimation, (b) CDF of the estimated density without automatic kernel optimization, and (c) CDF of the estimated density with the proposed kernel optimization algorithm . . . 46
9. Effect of the regularization constant C on the proposed algorithm performance . . . 47
10. Convergence of the estimation error with the optimization iterations for the Gaussian density estimation example . . . 48
11. Representation of the camera calibration as a mapping problem . . . 52
12. The RMSE for tx as a function of noise σ, computed for the five approaches: linear, nonlinear using the simplex method, neuro-calibration, Heikki, and MF-SVM . . . 62
13. The RMSE for Ry . . . 63
14. The RMSE for uo . . . 64
15. The RMSE for the skewness angle θ. Note: the Heikki method assumes an ideal camera model in the sense of skewness (i.e., θ = π/2), so there is no error indicated . . . 65
16. Calibration setup: A stereo pair of images for a checker-board calibration pattern . . . 66
17. The 2-D projection of the calibration pattern corners (dots), and the detected corners from the image (left one of Fig. 16) of the calibration pattern (circles) . . . 66
18. A multispectral agricultural area: (a) land cover, and classification results using: (b) SVM, and (c) MF-SVM as a density estimator . . . 71
19. A multispectral urban area: (a) RGB snap-shot, and color-coded classification results using: (b) SVM, and (c) MF-SVM as a density estimator . . . 75
20. A hyperspectral 34-band urban area: (a) RGB snap-shot, and color-coded classification results using: (b) SVM, and (c) MF-SVM as a density estimator . . . 78
21. A hyperspectral 58-band urban area: (a) RGB snap-shot, and color-coded classification results using: (b) SVM, and (c) MF-SVM as a density estimator . . . 91
22. Numbering and order coding of neighborhood structure . . . 92
23. Clique shapes of the second order MRF model . . . 92
24. A texture image for the MRF model parameters estimation experiment: (a) original image generated by the Metropolis algorithm, (b) histogram of the MRF model clique shapes of the original image, (c) regenerated image using the parameters estimated with the MF-based SVM algorithm, and (d) the estimated mixture of Gaussians to fit the cliques histogram . . . 93
25. Evolution of the log-likelihood in the hyperspectral 34-band example . . . 94
26. The final segmented image with the proposed segmentation setup for the hyperspectral 34-band area . . . 94
27. Evolution of the log-likelihood . . . 95
28. The final segmented image for the 58-band data set . . . 95
29. The RGB and the reference classified image of the Cairo data set . . . 109
30. Samples of shape modeling for two classes of the Cairo data set: class points, signed distance map, and shape model density function for (a) the Water class, and (b) the Transportation class . . . 110
31. Classification results of the Cairo data set . . . 112
32. Results for the 15-meter resolution data set: (a) Original, (b) Registration results, (c) Classification results, and (d) Classification results after inverse transformation . . . 114
33. Results for the 60-meter resolution data set: (a) Original, (b) Registration results, (c) Classification results, and (d) Classification results after inverse transformation . . . 115
34. Cairo data set for the change detection evaluation: (a) Reference with changes, (b) Reference changes-map . . . 117
35. Results for change detection algorithms: (a) difference-image using CVA, (b) detected changes-map using pixelwise analysis of the difference map, (c) detected changes-map using MRF modeling, and (d) detected changes-map using the proposed algorithm . . . 119
36. Results for the change detection algorithm: (a) Reference with changes, (b) Reference change-map, (c) Ordinary classification, and (d) Detected changes-map using the proposed algorithm . . . 120
Nomenclature
Dr Data sample {(yi, ti)}
ε An error tolerance
K (y,y′) Covariance function
τr Targets sample {ti}
D A data sample (input vectors and their associated targets)
τ Targets vector
b A bias in a linear kernel
C A regularization constant
CS Changes set
F(y) Cumulative distribution function value at y
Fn(y) Empirical cumulative distribution function value at y
I An image
I(−∞, y](u) Indicator function
Ir Reference image
M Point in 3-D space
m Point in 2-D space
P The (3 × 4) projection matrix
t A target value
wi The weight corresponding to training instance i in SVM regression
y An instance of a raw data sample
Kn Covariance matrix
L(t, g(y)) Loss function
σi^2 Variance of a Gaussian distribution
g(y) Estimated output corresponding to the input vector y
CVA Change Vector Analysis
SD-Map Signed distance map
ToA Test of Agreement
CHAPTER I
INTRODUCTION
Learning, in its broadest sense, is defined as building a model from incomplete information in order to predict as accurately as possible some underlying structure of the unknown reality. In practice, for this definition to make sense, the information about the structure, the experience, and the model have to be expressed numerically. Thus, from the statistical point of view, the problem of learning becomes one of function estimation. As such, an estimator of functions is referred to as a learning machine.
The problem of statistical learning is rather broad. Given a statistical framework, one can consider any learning problem in the real world as a problem of estimating some unknown function. Some types of learning problems concern classes of functions with special properties, and as such it is convenient to group learning problems into different domains. The special properties of the types of functions in these domains can admit simpler solutions than the general case. Indeed, by considering solutions to problems in certain domains, one can construct learning machines with properties that enable them to solve tasks which would (at present) be hard to solve in the general case.
In the statistical learning framework, learning means estimating a function from a
given training data set. In this work, learning is used to solve many computer vision
and pattern recognition problems: estimating the probability density function (pdf) which
underlies the distribution of a given random sample, regression of functions, and further
analysis of classical pattern recognition approaches.
The principal part of the current research uses statistical learning for various
types of density estimation. The problem of probability density function estimation is a classical one in statistics. For the pattern recognition community, density estimation is the bottleneck in designing a large class of classifiers which depend on the maximum a posteriori probability principle. In this class, the design of the classifier is carried out under the assumption that the class-conditional densities are known. Typically, there is only some vague, general knowledge about these densities, together with a number of design samples or training data. The problem, then, is to find some way to use this information to estimate the unknown densities.
Depending on the available prior knowledge about the structure of the density function, there are mainly two directions for density estimation: parametric and nonparametric [2]. The parametric approach assumes a functional model controlled by a given set of parameters which have to be fitted to the data. The goal of the learning process is to use the training sample to estimate this set of parameters. Examples of this learning paradigm are Maximum Likelihood Estimation (MLE), Bayesian estimation, and the Expectation Maximization (EM) algorithm. Problems associated with parametric density estimation are:
1. The forms of the distribution functions are assumed to be known, whereas for most practical applications the density form is not known.
2. Most known forms of distributions are unimodal (single peak) while in most practical
applications distributions are multimodal.
3. In most cases individual features are assumed to be independent, but approximating
a multivariate distribution as a product of univariate distributions does not work well
in practice.
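The second limitation can be made concrete with a small numerical sketch (hypothetical data, not from the thesis experiments): fitting a single Gaussian by maximum likelihood to a bimodal sample places the estimated mode in a region where almost no data lie.

```python
import numpy as np

# Hypothetical illustration of problem 2 above: a unimodal parametric
# model cannot represent multimodal data.  Draw a bimodal sample and
# fit a single Gaussian by maximum likelihood (sample mean/variance).
rng = np.random.default_rng(0)
sample = np.concatenate([rng.normal(-3.0, 0.5, 500),
                         rng.normal(3.0, 0.5, 500)])

mu_hat, sigma_hat = sample.mean(), sample.std()  # Gaussian MLE

def gaussian_pdf(y, mu, sigma):
    return np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

# The fitted mode lands near 0, between the two true modes at -3 and +3,
# and the fitted spread is inflated so the density at the fitted mode is
# far below the true peak height at either cluster.
print(mu_hat, sigma_hat)
```

No choice of mean and variance fixes this: the model family itself is wrong for the data.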
Due to the above problems associated with parametric density estimation methods, an extensive research effort has been directed toward nonparametric methods. The main difference with respect to the parametric approach comes from the influence of the data points yi on the estimate at a location y. All the points have the same importance for parametric estimators, whereas nonparametric estimators are asymptotically local, i.e., the influence of the points vanishes as the distance from the location where the density is computed increases. Examples of nonparametric methods are the K-Nearest Neighbor (KNN) and Parzen-window methods [3].
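The Parzen-window method mentioned above can be sketched in a few lines for the 1-D case; the Gaussian kernel and the window width h below are illustrative assumptions, not the settings used later in this thesis.

```python
import numpy as np

# A minimal 1-D Parzen-window density estimator with a Gaussian kernel.
# The estimate at y averages kernels centered on the sample points y_i,
# so distant points contribute negligibly -- the "asymptotically local"
# behavior noted above.
def parzen_estimate(y, samples, h):
    """Estimate p(y) from a 1-D sample array with window width h."""
    u = (y - samples) / h
    kernels = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)
    return kernels.mean() / h

rng = np.random.default_rng(1)
samples = rng.normal(0.0, 1.0, 2000)

# Near the mode of N(0, 1) the estimate should be close to the true
# density value there (about 0.4).
print(parzen_estimate(0.0, samples, h=0.3))
```

Note that no functional form was assumed: the sample itself defines the estimate, at the cost of keeping all the data points around.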
The term curse of dimensionality describes the problem that occurs when searching in, or estimating a pdf on, a high-dimensional space. The complexity grows exponentially with the dimension, rapidly outstripping the computational and memory storage capabilities of computers. The problem of estimating a density function on a high-dimensional space may be seen as determining the density at each cell of a multidimensional grid. Given a fixed number of K grid lines per dimension, the number of independent cells grows as K^p, where p is the dimension. Furthermore, if the density function is to be estimated from a set of high-dimensional samples, the number of samples required for accurate density function estimation also grows as K^p. As stated in [4], with a fixed number of training samples, the dimension for which accurate estimation is possible is severely limited to a small number, usually about 5, depending on the specific problem.
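The exponential growth described above is easy to make concrete; the numbers below simply evaluate K^p for a grid with K = 10 lines per dimension.

```python
# The growth described above: with K grid lines per dimension, the number
# of independent cells -- and, roughly, the number of samples needed for
# accurate estimation -- grows as K**p in the dimension p.
K = 10
cells = {p: K ** p for p in (1, 2, 5, 10)}
for p, n in cells.items():
    print(f"p = {p:2d}: {n:,} cells")
# Even a coarse 10-line grid needs ten billion cells at p = 10, which is
# why accurate nonparametric estimation is limited to low dimensions.
```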
This study aims to develop a learning algorithm for nonparametric density estimation which is independent of the input space dimensionality. This algorithm uses the Support Vector Machines (SVM) algorithm as the main building block of the density estimation procedure.
A. Research Domain of the Thesis
This study addresses the density estimation problem in spaces of various dimensionality using the SVM algorithm. Existing algorithms suffer from the following:
1. The curse of dimensionality problem.
2. Poor learning procedures for the SVM algorithm, in terms of both accuracy and time requirements.
3. The performance of the SVM density estimation algorithm in real world applications
is not well investigated.
The proposed research in this thesis addresses these shortcomings and presents a novel framework for probability density estimation using the SVM. A novel procedure which incorporates Mean Field (MF) theory in the learning of the SVM density estimator is proposed. For this reason, the proposed approach in this thesis is called the Mean Field-based Support Vector Machines (or simply MF-Based SVM) density estimation approach. The proposed work in this thesis is divided into three phases:
1. Phase I: Implementation and Analysis of the MF-Based SVM Density Estimation
Framework
In this phase, the theoretical aspects of the MF-Based SVM density estimation framework are developed. This requires studying the principles of statistical learning and formulating the statistical representation of the proposed framework. Analysis and performance assessment of the proposed framework also need to be accomplished, including the accuracy of the framework with different data types (synthetic or real) and in spaces of different dimensionality, as well as its performance in terms of time.
2. Phase II: Automation and Enhancements for the Learning Algorithm
In this phase, some methods are proposed for enhancing the optimization algorithm in terms of accuracy, speed, and automation. The EM algorithm is proposed to be incorporated in the learning process for estimating one set of the parameters. Also, the Cross-Validation principle is proposed to be used for estimating another set of the learning parameters.
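The cross-validation principle just mentioned can be sketched as follows; `fit` and `loglik` here are hypothetical placeholders for the actual MF-based SVM training and evaluation routines, which are not reproduced in this sketch.

```python
import numpy as np

# A hedged sketch of k-fold cross-validation: split the data into k
# folds, score each candidate parameter value by its average held-out
# score, and keep the best-scoring candidate.
def cross_validate(data, candidates, fit, loglik, k=5, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(data))
    folds = np.array_split(idx, k)          # disjoint validation folds
    scores = {}
    for c in candidates:
        fold_scores = []
        for i in range(k):
            held_out = data[folds[i]]
            train = data[np.concatenate([folds[j] for j in range(k) if j != i])]
            model = fit(train, c)           # train with candidate value c
            fold_scores.append(loglik(model, held_out))
        scores[c] = float(np.mean(fold_scores))
    return max(scores, key=scores.get)      # candidate with best average score
```

Plugging in the real training and held-out log-likelihood routines for `fit` and `loglik` turns this skeleton into a parameter selector for any scalar learning parameter.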
3. Phase III: Applications of the New MF-Based SVM Density Estimation Framework in
Real World Pattern Recognition Problems
In this phase, the proposed framework is used in different applications of the pattern recognition problem. In the following, a brief description of the main applications is presented, although not all of them are addressed in equal depth in the body of this thesis.
1. Explicit Camera Calibration
Camera calibration is an extensively studied topic in different machine intelligence communities. Explicit camera calibration methods develop solutions by analyzing the physical model of camera imaging, so that calibration amounts to identifying a set of modeling parameters with physical meanings [5], whereas implicit calibration methods resort to realizing a nonlinear mapping function that can well describe the input-output relation [6]. Explicit calibration methods can provide the camera's physical parameters, which are important in some applications, such as computer graphics, virtual reality, 3-D reconstruction, etc.
This research presents an explicit approach for solving the camera calibration problem. Principally, the approach considers the problem as a mapping from the 3-D world coordinate system to the 2-D image coordinate system, where the projection matrix is the mapping function, and a statistically based regression algorithm is used to simulate this mapping.
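This mapping view can be written in homogeneous coordinates as m ≃ P [M; 1], with P the (3 × 4) projection matrix from the Nomenclature. The sketch below uses an illustrative, uncalibrated P (identity rotation, unit focal length, zero translation), not parameters estimated in this thesis.

```python
import numpy as np

# Projecting a 3-D world point M to a 2-D image point m through a
# (3 x 4) projection matrix P, followed by the perspective division.
# This P is an illustrative example only, not calibrated data.
P = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0]])

M = np.array([2.0, 4.0, 2.0, 1.0])   # homogeneous world point (X, Y, Z, 1)
m_h = P @ M                          # homogeneous image point
m = m_h[:2] / m_h[2]                 # perspective division -> pixel coords
print(m)                             # -> [1. 2.]
```

Calibration is then the inverse task: recover the entries of P (or the physical parameters behind them) from known 3-D/2-D point correspondences, which is exactly the regression problem the proposed approach addresses.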
2. Remote Sensing
Segmentation of satellite imagery is one of the important research topics in remote sensing data analysis, and it is one of the funded projects in the CVIP Lab. An important aspect associated with remote sensing data is the high dimensionality of the data space. Thus, the proposed framework is applicable and provides a promising direction for estimating the density function in remote sensing spaces, which can be used for image segmentation in further processing.
3. Parameters Estimation for Markov Random Field Models
Markov Random Fields (MRF) provide a practical implementation for region modeling in image segmentation. The problem of parameter estimation for MRF models is known to be a challenging task in the pattern recognition literature. This study presents a formulation of MRF modeling such that the MF-Based SVM algorithm can be applied to estimate the parameters of the MRF models.
4. Change Detection in Images
Change detection in images finds many applications in city planning, monitoring,
and security assessments. This research introduces statistical learning as a tool for
detecting changes in the land cover of scenes using space-borne imagery. It proposes
a new approach and evaluates it with remote sensing data sets.
B. Thesis Outline
This thesis consists of seven chapters. The following remarks summarize the scope
of each chapter.
Chapter 1 introduces the thesis topic and discusses the motivation for the thesis
work. In addition, it discusses the research domain of the thesis and outlines the thesis
manuscript.
Chapter 2 discusses the theoretical formulation of the proposed MF-based SVM
regression framework. The statistical aspects behind the presented formulation, and the
differences between it and previous formulations, are highlighted. The introduction of
Mean Field theory into the learning of the SVM algorithm is discussed.
Chapter 3 applies the proposed MF-based SVM regression framework to a funda-
mental problem of machine learning: probability density function estimation. The theoretical
aspects of this MF-based SVM density estimation approach are discussed in depth. The con-
sistency and convergence of the approach are discussed. Statistical performance measures
are used to illustrate the results of the presented density estimation approach. In addition,
several estimation approaches are presented and illustrated for automating the learning
algorithm: the EM algorithm for estimating the parameters of the kernel, and cross
validation for estimating the regularization parameters.
Chapter 4 presents the application of the proposed regression approach to a well-
known computer vision problem: camera calibration. The motivations behind using
statistical learning approaches for camera calibration are discussed in this chapter. The
formulation of the camera calibration problem in a regression setup is outlined, and the
link between this formulation and the learning of the MF-based SVM regression algorithm is
established. A hybrid learning algorithm combining gradient descent and MF-based SVM
regression is formulated and applied to synthetic as well as real data sets.
Chapter 5 presents the application of the proposed density estimation approach in
segmentation of remote sensing data sets. Application of the framework in the segmen-
tation of real world multispectral and hyperspectral imagery is presented and evaluated
against other algorithms. Estimation of the MRF parameters in image modeling using the
proposed MF-Based SVM framework is presented.
Chapter 6 presents the application of the proposed statistical learning approaches to
solving the change detection problem. The problem considered in this work is the change
in the land cover of a scene observed through remote sensing imagery. The approach depends
on using statistical learning to model the shapes of the classes defined in the reference image.
These shape models are used together with the statistical sensor models (probability
densities) to detect changes in a scene.
Chapter 7 presents the conclusions and the directions to be addressed in future
extensions of the current work.
CHAPTER II
STATISTICAL LEARNING BASED SUPPORT VECTOR MACHINES
Support Vector Machines (SVM) were invented by Vladimir Vapnik and his co-
workers (first introduced in [7]). They are a specific class of algorithms characterized by
the use of kernels, the absence of local minima, the sparseness of the solution, and capacity
control obtained by acting on the margin or on the number of support vectors. Most of
these features were already present in machine learning since the 1960s, but it was not until
1992 that they were all put together to form the maximal margin classifier, the
basic Support Vector Machine, and not until 1995 that the soft margin version was intro-
duced [8]. SVM can be applied not only to classification problems but also to regres-
sion [9]. The regression variant still contains all the main features that characterize the
maximum margin algorithm: a non-linear function is learned by a linear learning machine
mapping into a high-dimensional kernel-induced feature space.
SVM are gaining popularity due to many attractive features and promising empir-
ical performance. For instance, the formulation of SVM density estimation employs the
Structural Risk Minimization (SRM) principle, which has been shown to be superior to
the traditional Empirical Risk Minimization (ERM) principle employed in conventional
learning algorithms (e.g. neural networks) [10]. SRM minimizes an upper bound on the
generalization error as opposed to ERM, which minimizes the error on the training data. It
is this difference which makes SVM more attractive in statistical learning applications.
The traditional formulation of the SVM density estimation problem raises a quadratic
optimization problem of the same size as the training data set. This computationally de-
manding optimization problem prevents the SVM from being the default choice of the
pattern recognition community [11].
Several approaches have been introduced for circumventing the above shortcomings
of SVM learning. These include simpler optimization criteria for SVM design (e.g.,
the kernel ADATRON [12]), specialized QP algorithms like the conjugate gradient method,
decomposition techniques (which break down the large QP problem into a series of smaller
QP sub-problems), the sequential minimal optimization (SMO) algorithm and its various
extensions [13], Nyström approximations [14], greedy Bayesian methods [15], and
the chunking algorithm [16]. Recently, active learning has become a popular paradigm
for reducing the sample complexity of large-scale learning tasks (e.g., [17-19]). In active
learning, instead of learning from "random samples," the learner has the ability to select its
own training data. This is done iteratively, and the output of one step is used to select the
examples for the next step.
In this chapter, an algorithm which uses Mean Field (MF) theory is presented for
the learning of the SVM estimator. The MF methods provide efficient approximations
which are able to cope with the complexity of probabilistic data models [20]. MF methods
replace the intractable task of computing high dimensional sums and integrals by the much
easier problem of solving a system of linear equations. The density estimation problem
is formulated so that the MF method can be used to approximate the learning procedure
in a way that avoids the quadratic programming optimization. This proposed approach is
suitable for high dimensional density estimation problems and it is successfully applied to
various remote sensing data sets.
This chapter outlines the density estimation problem. A supervised density estima-
tion algorithm, which is based on the SVM approach, is presented and a practical learning
procedure is discussed. The practical aspects behind selecting the learning parameters for
proper density function estimation are discussed.
A. Support Vector Machines (SVM) Regression
The above discussion shows how the supervised density estimation problem is re-
duced to a regression problem. In this section, the SVM is presented as a supervised re-
gression tool, and later on it will be shown how it can be used as a density estimator for the
CCP. In the following discussion, the SVM as a regression tool is viewed as maxi-
mum a posteriori prediction with a Gaussian prior, under the Bayesian framework (Bayes'
theorem is used to relate the prior and posterior distributions). The idea is that, instead of
defining prior distributions over the parameters of the learning machine, a Gaussian prior
distribution is assumed over the function space on which the machine computes. In general,
the supervised regression learning problem can be stated as follows:
Given a training data set D = {(y_i, t_i) | i = 1, 2, \ldots, n} of input vectors y_i and associ-
ated targets t_i, the goal is to infer the output t for a new input data point y. Generally, a
loss function which relates the estimated target g(y) to the true target t is defined to char-
acterize the regression problem. In this work, Vapnik's ε-insensitive loss function is used,
which is defined as:

L(t, g(y)) = \begin{cases} 0 & \text{if } |t - g(y)| \le \varepsilon \\ |t - g(y)| - \varepsilon & \text{otherwise} \end{cases}   (1)
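As a small numerical illustration of (1) (a sketch; the vectorized form and the sample values are illustrative, not from the text):

```python
import numpy as np

# Vapnik's epsilon-insensitive loss of Eq. (1): errors inside the epsilon
# tube cost nothing; outside the tube, the cost grows linearly.
def eps_insensitive_loss(t, g, eps=0.25):
    return np.maximum(np.abs(t - g) - eps, 0.0)

t = np.array([1.0, 1.0])
g = np.array([1.2, 1.5])            # one prediction inside the tube, one outside
print(eps_insensitive_loss(t, g))   # inside the tube -> 0; outside -> |t - g| - eps
```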
where ε > 0 is a predefined constant which controls the noise tolerance. To construct a
Bayesian framework under the loss function assumed in (1), an exponential model is em-
ployed. In this model, the likelihood of the true output t at a given point y, given that
the machine output is g(y), is assumed to obey the following relationship:

p(t \,|\, g(y)) = \frac{C}{2(\varepsilon C + 1)} \exp\{ -C\, L(t, g(y)) \}   (2)
Since the elements of the training sample are assumed to be statistically independent ran-
dom vectors, the probabilistic interpretation of the SVM regression can be considered to
have the following likelihood (see [20]):

p(\tau \,|\, g(D)) = \left( \frac{C}{2(\varepsilon C + 1)} \right)^{n} \exp\left\{ -C \sum_{i=1}^{n} L(t_i, g(y_i)) \right\}   (3)
where τ = [t_1, t_2, \ldots, t_n] and g(D) = [g(y_1), g(y_2), \ldots, g(y_n)]. Since the SVM is con-
sidered as a maximum a posteriori probability estimator with a Gaussian prior, the prior
probability distribution of the prediction g(y) is assumed to be a Gaussian Process (GP). Gen-
erally, a GP is a stochastic process which is completely specified by its mean vector and
covariance matrix [21]. Thus, the prior probability for a sample D can be expressed as a
GP with zero mean (for simplicity) and covariance function K(y, y′):

p(g(D)) = \frac{1}{\sqrt{(2\pi)^n \det(K_n)}} \exp\left\{ -\frac{1}{2}\, g(D)\, K_n^{-1}\, g(D)^T \right\}   (4)

where K_n = [K(y_i, y_j)] is the covariance matrix at the points of D (K(·,·) is a kernel function).
This can be parameterized with respect to D by Bayes' theorem:

p(g(D) \,|\, D) = \frac{p(D \,|\, g(D))\; p(g(D))}{p(D)} = M\, \frac{\exp\left\{ -C \sum_{i=1}^{n} L(t_i, g(y_i)) - \frac{1}{2}\, g(D)\, K_n^{-1}\, g(D)^T \right\}}{\sqrt{(2\pi)^n \det(K_n)}\; p(D)}   (5)

where M = \left( \frac{C}{2(\varepsilon C + 1)} \right)^{n}. The estimate of the posterior prediction distribution is the one
that maximizes the numerator of (5). Equivalently, the MAP estimate is obtained from:

\min_{g(D)}\; C \sum_{i=1}^{n} L(t_i, g(y_i)) + \frac{1}{2}\, g(D)\, K_n^{-1}\, g(D)^T   (6)
A direct solution of (6) can be obtained by quadratic programming optimization (e.g., [22]).
The size of the optimization problem is the same as the size of the training sample. Since
Quadratic Programming (QP) optimization routines have high complexity and require huge
memory and computational time for large data applications, solving the QP, especially with
a dense n × n matrix, limits the use of the SVM algorithm for large data sets (e.g., [11]).
One way to avoid raising such a QP problem is to consider an approximate formulation of
the SVM regression [23]. The rest of this section and the following section present one
such method.
Using the posterior prediction distribution p(g(D)|D) defined in (5), the
prediction (expectation) at a new test point y is given by:

\langle g(y) \rangle = \int g(y)\, p(g(y) \,|\, D)\; dg(y) = \int g(y)\, p(g(y), g(D) \,|\, D)\; dg(y)\, dg(D)   (7)

Substituting from (5) into (7), and after some mathematical reduction:

\langle g(y) \rangle = \frac{M}{\sqrt{(2\pi)^n \det(K_n)}} \int g(y)\, A\; dg(y)\, dg(D)   (8)

where:

A = \frac{1}{p(D)} \exp\left\{ -C \sum_{i=1}^{n} L(t_i, g(y_i)) - \frac{1}{2}\, g(D,y)\, K_{n+1}^{-1}\, g(D,y)^T \right\},

K_{n+1} = \begin{bmatrix} K_n & K_n(y)^T \\ K_n(y) & K(y,y) \end{bmatrix}, \quad \text{and}

K_n(y) = [K(y_1, y),\; K(y_2, y),\; \ldots,\; K(y_n, y)].
But:

g(D)\, p(g(D)) = K_n K_n^{-1}\, g(D)\, p(g(D)) = -K_n\, \frac{\partial}{\partial g(D)}\, p(g(D)).

Then, by extending the prior to include the new test point y (with y_{n+1} ≡ y), we get:

g(y) \exp\left\{ -\frac{1}{2}\, g(D,y)\, K_{n+1}^{-1}\, g(D,y)^T \right\} = \sum_{i=1}^{n+1} K(y, y_i)\, \frac{\partial}{\partial g(y_i)} \exp\left\{ -\frac{1}{2}\, g(D,y)\, K_{n+1}^{-1}\, g(D,y)^T \right\}   (9)
Substituting from (9) into (8), and applying integration by parts to shift the differentiation from
the prior to the likelihood:

\langle g(y) \rangle = \frac{M}{p(D)} \sum_{i=1}^{n} K(y, y_i) \int \mathcal{N}(g(D) \,|\, 0, K_n)\, \frac{\partial}{\partial g(y_i)} \exp\left\{ -C \sum_{j=1}^{n} L(t_j, g(y_j)) \right\} dg(D) = \sum_{i=1}^{n} w_i\, K(y, y_i)   (10)

where w_i is a constant defined as:

w_i = \frac{M}{p(D)} \int \mathcal{N}(g(D) \,|\, 0, K_n)\, \frac{\partial}{\partial g(y_i)} \exp\left\{ -C \sum_{j=1}^{n} L(t_j, g(y_j)) \right\} dg(D)   (11)
B. Mean Field Theory for Learning of SVM Regression
The learning process suggests that the weights w_i in (11) should be estimated us-
ing the training sample. As can be seen, however, (11) is highly complicated and computationally
expensive, since it contains many integrations which must be evaluated numerically. In
this work, Mean Field theory is used to obtain an approximate and easily evaluated expression for
the w_i [24, 25]. The basic idea of mean field theory is to approximate the statistics of
a random variable which is correlated with other random variables by assuming that the in-
fluence of the other variables can be compressed into a single effective mean "field" with
a rather simple distribution [20]. While MF theory arose primarily in the field of statis-
tical mechanics [26], it has more recently been applied elsewhere, for example for
inference in graphical models in artificial intelligence [27, 28].

In this work, this approach is used to approximate the so-called cavity distribution.
The cavity distribution is defined as p(g(y_i) | D_{\setminus i}), where g(y_i) is the regressed SVM output
corresponding to an instance y_i which is left out of the training sample, and D_{\setminus i} is the training
sample without the instance y_i.
For the cavity derivation, it is useful to introduce a new predictive posterior for the
output corresponding to the instance y_i:

p(g(y_i) \,|\, D_{\setminus i}) = \frac{\int p(g(D))\, p(\tau_{\setminus i} \,|\, g(D))\; dg(D_{\setminus i})}{\int p(g(D))\, p(\tau_{\setminus i} \,|\, g(D))\; dg(D)}   (12)

where \tau_{\setminus i} is the target vector τ excluding t_i.

For the predictive posterior in (12), an average (expected value) can be defined as:

\langle \mathcal{V} \rangle_i = \int \mathcal{V}\; p(g(y_i) \,|\, D_{\setminus i})\; dg(y_i)   (13)

where ⟨V⟩_i denotes the expected value of V given only the reduced data sample. The
expression for the weight w_i in (11) can then be rewritten as:

w_i = \frac{\left\langle M\, \frac{\partial}{\partial g(y_i)} \exp\{ -C\, L(t_i, g(y_i)) \} \right\rangle_i}{\left\langle M \exp\{ -C\, L(t_i, g(y_i)) \} \right\rangle_i}   (14)
To enable the calculation of the weights from (14), a closed form for the cavity distribution
in (12) is required, and that is where Mean Field theory comes into play. The MF
view is that averages over p(g(y_i) | D_{\setminus i}) can be calculated because it is
a predictive posterior of the field at an input y_i. The MF approximates p(g(y_i) | D_{\setminus i})
by a Gaussian distribution of the form:

p(g(y_i) \,|\, D_{\setminus i}) \approx \frac{1}{\sqrt{2\pi\sigma_i^2}} \exp\left\{ -\frac{\left( g(y_i) - \langle g(y_i) \rangle_i \right)^2}{2\sigma_i^2} \right\}   (15)

where the variance is defined as \sigma_i^2 = \langle g(y_i)^2 \rangle_i - \langle g(y_i) \rangle_i^2.

Inserting (15) into (13) and evaluating (14), the weight coefficients can be obtained
as:

w_i \approx \frac{F\left( \langle g(y_i) \rangle_i, \sigma_i^2 \right)}{G\left( \langle g(y_i) \rangle_i, \sigma_i^2 \right)} = \frac{F_i}{G_i}   (16)
where:

F_i = \frac{C}{2} \exp\left\{ \frac{C}{2}\left( 2\langle g(y_i)\rangle_i - 2t_i + 2\varepsilon + C\sigma_i^2 \right) \right\} \left( 1 - \mathrm{erf}\left( \frac{\langle g(y_i)\rangle_i - t_i + \varepsilon + C\sigma_i^2}{\sqrt{2\sigma_i^2}} \right) \right)
\;-\; \frac{C}{2} \exp\left\{ \frac{C}{2}\left( 2t_i - 2\langle g(y_i)\rangle_i + 2\varepsilon + C\sigma_i^2 \right) \right\} \left( 1 - \mathrm{erf}\left( \frac{t_i - \langle g(y_i)\rangle_i + \varepsilon + C\sigma_i^2}{\sqrt{2\sigma_i^2}} \right) \right)

and

G_i = \frac{1}{2}\, \mathrm{erf}\left( \frac{t_i - \langle g(y_i)\rangle_i + \varepsilon}{\sqrt{2\sigma_i^2}} \right) - \frac{1}{2}\, \mathrm{erf}\left( \frac{t_i - \langle g(y_i)\rangle_i - \varepsilon}{\sqrt{2\sigma_i^2}} \right)
\;+\; \frac{1}{2} \exp\left\{ \frac{C}{2}\left( 2\langle g(y_i)\rangle_i - 2t_i + 2\varepsilon + C\sigma_i^2 \right) \right\} \left( 1 - \mathrm{erf}\left( \frac{\langle g(y_i)\rangle_i - t_i + \varepsilon + C\sigma_i^2}{\sqrt{2\sigma_i^2}} \right) \right)
\;+\; \frac{1}{2} \exp\left\{ \frac{C}{2}\left( 2t_i - 2\langle g(y_i)\rangle_i + 2\varepsilon + C\sigma_i^2 \right) \right\} \left( 1 - \mathrm{erf}\left( \frac{t_i - \langle g(y_i)\rangle_i + \varepsilon + C\sigma_i^2}{\sqrt{2\sigma_i^2}} \right) \right)   (17)
Equations (16) and (17) are called the Mean Field equations corresponding to the
weight coefficient w_i. To evaluate the weight coefficients in (16), it is required to get both
the mean (average) ⟨g(y_i)⟩_i and the variance σ_i² of the assumed Gaussian model for the
local predictive distribution p(g(y_i) | D_{\setminus i}). The detailed derivation of both ⟨g(y_i)⟩_i and
σ_i² based on mean field theory can be found in [20]; only the final results are
summarized here. The posterior average at y_i is given by:

\langle g(y_i) \rangle = \sum_{j=1}^{n} w_j\, K(y_i, y_j)   (18)
From [20], the following results are obtained:

\langle g(y_i) \rangle_i \approx \langle g(y_i) \rangle - \sigma_i^2\, w_i   (19)

and

\sigma_i^2 \approx \frac{1}{\left[ (\Sigma + K_n)^{-1} \right]_{ii}} - \Sigma_i   (20)

where:

\Sigma = \mathrm{diag}(\Sigma_1, \Sigma_2, \ldots, \Sigma_n), \qquad \Sigma_i = -\sigma_i^2 - \left( \frac{\partial w_i}{\partial \langle g(y_i) \rangle_i} \right)^{-1}
The expression for \partial w_i / \partial \langle g(y_i) \rangle_i can be obtained from Equations (16) and (17) as:

\frac{\partial w_i}{\partial \langle g(y_i) \rangle_i} \approx \frac{C^2 - w_i^2 - w_i \langle g(y_i) \rangle_i + \sigma_i^2 C^2 \int_{t_i - \varepsilon}^{t_i + \varepsilon} p(g(y_i) \,|\, D_{\setminus i})\; dg(y_i)}{\sigma_i^2\, G\left( \langle g(y_i) \rangle_i, \sigma_i^2 \right)} \approx \frac{C^2 - w_i^2 - w_i \langle g(y_i) \rangle_i + \sigma_i^2 C^2\, IG_i}{\sigma_i^2\, G\left( \langle g(y_i) \rangle_i, \sigma_i^2 \right)}   (21)

where:

IG_i = \frac{1}{2}\, \mathrm{erf}\left( \frac{t_i - \langle g(y_i) \rangle_i + \varepsilon}{\sqrt{2\sigma_i^2}} \right) - \frac{1}{2}\, \mathrm{erf}\left( \frac{t_i - \langle g(y_i) \rangle_i - \varepsilon}{\sqrt{2\sigma_i^2}} \right)
C. Summary of the Statistical Learning MF-Based SVM Regression Algorithm
The implementation steps of the proposed MF-based SVM regression approach, with
mean field theory applied to the learning process, are presented below:

Step 1. Prepare the training data set D.

Step 2. Set a learning rate η and randomly initialize the w_i.

Step 3. Choose a kernel K(y, y′); accordingly, calculate the covariance matrix K_n and
let σ_i² = [K_n]_{ii}.

Step 4. Iterate Steps 5 and 6 until convergence of the w_i.

Step 5. "Inner loop": for i = 1, 2, \ldots, n do

5.a calculate ⟨g(y_i)⟩ from (18)

5.b calculate ⟨g(y_i)⟩_i from (19)

5.c calculate F_i and G_i from (17)

5.d update w_i by:

w_i \leftarrow w_i + \eta \left( \frac{F_i}{G_i} - w_i \right)

Step 6. "Outer loop": every M iterations over the w_i, update σ_i² from (20).
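The inner-loop iteration of Steps 4-5 can be sketched numerically as follows. This is a minimal 1-D illustration rather than the dissertation's implementation: the kernel width, C, ε, η, and iteration count are illustrative choices; σ_i² is kept fixed at its Step-3 initialization (the outer-loop update of (20) is omitted); and G_i is computed here as the cavity Gaussian average of exp(−C L(t_i, g(y_i))), an assumption of this sketch.

```python
import math
import numpy as np

verf = np.vectorize(math.erf)  # elementwise erf from the standard library

def mf_svm_regression(Y, t, C=10.0, eps=0.05, eta=0.1, iters=300, width=0.5):
    """Inner-loop mean-field updates (Steps 1-5); sigma_i^2 stays at [K_n]_ii."""
    n = len(t)
    K = np.exp(-0.5 * (Y[:, None] - Y[None, :]) ** 2 / width ** 2)  # GRBF K_n
    w = np.zeros(n)                     # Step 2 (zero init, for determinism)
    sig2 = np.diag(K).copy()            # Step 3: sigma_i^2 = [K_n]_ii
    s = np.sqrt(2.0 * sig2)
    for _ in range(iters):
        g = K @ w                       # <g(y_i)>, Eq. (18)
        gi = g - sig2 * w               # cavity mean, Eq. (19)
        ea = np.exp(np.clip(0.5 * C * (2 * gi - 2 * t + 2 * eps + C * sig2), None, 500.0))
        eb = np.exp(np.clip(0.5 * C * (2 * t - 2 * gi + 2 * eps + C * sig2), None, 500.0))
        ra = 1.0 - verf((gi - t + eps + C * sig2) / s)
        rb = 1.0 - verf((t - gi + eps + C * sig2) / s)
        F = 0.5 * C * (ea * ra - eb * rb)                   # F_i of Eq. (17)
        IG = 0.5 * (verf((t - gi + eps) / s) - verf((t - gi - eps) / s))
        G = IG + 0.5 * (ea * ra + eb * rb)                  # cavity average of exp(-C L)
        w = w + eta * (F / np.maximum(G, 1e-12) - w)        # Step 5.d
    return w, K

# usage: fit a noisy 1-D sine and check the training fit
rng = np.random.default_rng(1)
Y = np.linspace(-1.0, 1.0, 40)
t = np.sin(np.pi * Y) + 0.05 * rng.normal(size=40)
w, K = mf_svm_regression(Y, t)
err = float(np.abs(K @ w - t).mean())
print(err)
```

Note how the cavity construction stabilizes the iteration: since σ_i² = [K_n]_{ii}, the self-coupling ∂⟨g(y_i)⟩_i/∂w_i = K_{ii} − σ_i² vanishes, so each weight is driven only by the other points' contributions.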
D. Remarks on the MF-Based SVM Regression Algorithm
1. The most computationally expensive step in the above algorithm is the inversion of
the matrix K_n + Σ in Step 6. It is therefore recommended that Step 6, the "outer loop,"
iterate less frequently than Step 5, the "inner loop." For example, after M = 10
iterations of updating the w_i, there is one update of the σ_i².

2. The optimization needed to obtain the weights is carried out in the feature space, i.e.,
after applying the kernel function to the input samples.

3. Since the optimization is done in the feature space, it does not depend
on the input space dimensionality, and neither does the density estimation procedure.

4. The following chapters introduce methods for obtaining the best values of the learn-
ing parameters (e.g., C).
E. Summary
This chapter presented the foundation of the statistically based formulation of SVM regres-
sion. This formulation allows a fast and efficient learning procedure for the SVM regression
algorithm. Mean Field theory is used to approximate the hard-to-evaluate integra-
tions with a much simpler system of equations, which is solved iteratively to estimate the
weights in the weighted-sum-of-kernels representation. The speed and efficiency of the
algorithm open up its applicability to many pattern recognition and
computer vision problems.
CHAPTER III
DENSITY ESTIMATION USING MEAN FIELD BASED SUPPORT VECTOR MACHINES
Density estimation is a problem of fundamental importance to all aspects of ma-
chine learning and pattern recognition [29, 30]. The probability density function (PDF) of
a continuous distribution is estimated from a representative sample drawn from the under-
lying density. The estimation can be carried out either in a parametric or non-parametric
way. When it is reasonable to assume, a priori, a particular functional form for the PDF,
the problem reduces to the estimation of the required functional parameters; this is the
parametric approach. For estimating arbitrary density functions, finite mixture models [31, 32] are
gaining much attention as powerful approaches and they are routinely employed in many
practical applications. One can consider a finite mixture model as providing a condensed
representation of the data sample in terms of the sufficient statistics of each of the mixture
components and their respective mixing weights.
The kernel density estimator, also commonly referred to as the Parzen window es-
timator [33], can be viewed as the limiting form of a mixture model where the number of
mixture components will equal the number of points in the data sample. Unlike paramet-
ric or finite-mixture approaches to density estimation where only sufficient statistics and
mixing weights are required in estimation, Parzen density estimates employ the full data
sample in defining density estimates for subsequent observations. So, while large sample
sizes ensure reliable density estimates, they bring with them a computational cost for test-
ing which scales directly with the sample size. Herein lies the main practical difficulty with
employing kernel-based Parzen window density estimators.
In this dissertation, a method is proposed for estimating the density function using
the principle of the Parzen estimator. However, it uses SVM principles to choose a subset of the
training data (the Support Vectors), which is then used in the computation of the density
estimate. Usually, the Support Vector subset is much smaller than the training data set,
which reduces the computational cost of the estimation process.
The MF-based SVM regression approach introduced in the previous chapter is used in the
proposed density estimator to make the approach faster and more accurate. Methods
for automating the estimation process and enhancing the results are also provided.
A. Density Estimation Problem Formulation
Given a random vector Y, the relation:

F(y) = P(Y < y)   (22)

defines the cumulative probability distribution function (CDF) of the random vector Y. The
probability density function (PDF) p(y) of the random vector Y at a specific point y is a
nonnegative quantity, and it is related to the CDF by:

F(y) = \int_{-\infty}^{y} p(y')\; dy'   (23)

Hence, in order to estimate the probability density function, it is required to obtain a solu-
tion for the inverse of the integral equation:

\int_{-\infty}^{y} p(y', \alpha)\; dy' = F(y)   (24)

on a given set of densities p(y, α), where the integration is a vector integration and α is
the parameter set which characterizes the density function.

From another point of view, the estimation problem in (24) can be regarded as
solving the linear operator equation:

A[p(y)] = F(y)   (25)
where the operator A[·] is a one-to-one mapping from the elements of the Hilbert space E_1,
where p(y) is defined, into elements of the Hilbert space E_2, where F(y) is defined. But
neither p(y) nor F(y) in (25) is known. However, from the principles of probability
theory [34], given a random sample D = {y_1, y_2, \ldots, y_n} from an unknown distribution,
a practical estimate of F(y) can be obtained by:

F_n(y) = \frac{1}{n} \sum_{k=1}^{n} I_{(-\infty,\, y]}(y_k)   (26)

where n is the size of the sample and I_{(-\infty,\, y]}(u) is the indicator function, defined
as:

I_{(-\infty,\, y]}(u) = \begin{cases} 1 & \text{if } u \le y \\ 0 & \text{otherwise} \end{cases}   (27)

if both y and u are scalars (1-dimensional data). If y and u are vectors of length d, then:

I_{(-\infty,\, y]}(u) = \prod_{i=1}^{d} I_{(-\infty,\, y_i]}(u_i)   (28)
This function F_n(y), which is called the empirical distribution function, converges
with probability 1 to the original distribution function F(y) [35]. Therefore, the pairs
(y_1, F_n(y_1)), (y_2, F_n(y_2)), \ldots, (y_n, F_n(y_n)) are constructed from the sample D to gen-
erate the training data set:

D = \{ (y_i, t_i) \;|\; t_i = F_n(y_i);\; i = 1, 2, \ldots, n \}   (29)

Now, a regression algorithm uses this training data set to solve the density estimation prob-
lem (25) in the image space (right-hand side of (25)) to get a continuous approximation of
the distribution function F(y). This approximation can be used to express the solution in
the pre-image space (left-hand side of (25)) to get an estimate of the density function using
the known operator A. In this work, the SVM is used as a regression algorithm to obtain a
continuous approximation of the distribution function F(y). The motivation behind using
the SVM as a regression tool is that a dense continuous approximation of F(y) is obtained
which should be safely differentiable, so that the density function p(y) can be obtained.
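The construction of the training set in (26)-(29) can be sketched as follows (a minimal sketch; the sample and its size are illustrative):

```python
import numpy as np

# Empirical CDF targets t_i = F_n(y_i) of Eqs. (26)-(29), using the
# product-of-indicators form of Eq. (28) for vector-valued data.
def empirical_cdf_targets(D):
    # leq[i, k] = I_(-inf, y_i](y_k), i.e. y_k <= y_i in every component
    leq = np.all(D[None, :, :] <= D[:, None, :], axis=2)
    return leq.sum(axis=1) / D.shape[0]      # Eq. (26)

# usage: 500 draws from a 1-D standard normal
rng = np.random.default_rng(0)
D = rng.normal(size=(500, 1))
t = empirical_cdf_targets(D)                 # training targets of Eq. (29)
print(float(t.min()), float(t.max()))
```

For a sample with no repeated points, the targets are exactly the normalized ranks 1/n, 2/n, …, 1 of the sample points.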
B. Obtaining the Probability Density Function Estimate and Choosing the Kernel Function
The above discussion shows that the MF-based SVM regression algorithm can be
used for approximating the distribution function F(y) from the training sample D. The
algorithm proposed in the previous chapter is used to get an approximation of F(y),
which will be in the form of a weighted sum of the kernel function acting on the instances
of the training sample (see (10)):

F(y) = \sum_{i=1}^{n} w_i\, K(y, y_i)   (30)

Consequently, the estimate of the density function will simply have the form:

p(y) = \sum_{i=1}^{n} w_i\, K'(y, y_i)   (31)

where K'(y, y_i) is the derivative of K(y, y_i).
There are some conditions on the kernel function K(y, y_i) so that a valid density
function estimate can be obtained from (31); see [22]. These conditions are:

i. K_\gamma = a(\gamma)\, K\left( \frac{y - y_i}{\gamma} \right)

ii. a(\gamma) \int K\left( \frac{y - y_i}{\gamma} \right) dy = 1

iii. K(0) = 1

In the presented algorithm, a Gaussian Radial Basis Function (GRBF) kernel is used, which
satisfies the above conditions (see [22]) and has the form:

K(y, y_i) = \exp\left( -\frac{1}{2}\, (y - y_i)\, \Lambda^{-1}\, (y - y_i)^T \right)   (32)

where Λ is a parameter which is assumed to be predefined.
C. Summary of the Proposed SVM Density Estimation Algorithm
The implementation steps of the proposed approach for density estimation using
SVM, with mean field theory applied to the learning process, are presented below:

Step 1. Generate the training data set D defined in (29).

Step 2. Apply the MF-based SVM regression algorithm (Algorithm II.C) to get an approx-
imation of F(y).

Step 3. Calculate p(y) from (31).

The main goal of Steps 1 and 2 is to get the weights of the SVM regression expan-
sion (30). Only those vectors whose corresponding weights are greater than some threshold
(the Support Vectors) are used in calculating the density estimate at a test point. This makes
the density estimation less computationally expensive than traditional Parzen window esti-
mators.
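A minimal 1-D sketch of Steps 1-3 follows. Since the full MF-based SVM fit of Step 2 is lengthy, it is replaced here by a plain kernel ridge solve as a hypothetical stand-in, used only so the sketch runs standalone; the width h (a scalar stand-in for Λ in (32)), the regularization constant, and the support-vector threshold are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
Y = np.sort(rng.normal(size=200))            # sample drawn from N(0, 1)
t = np.arange(1, 201) / 200.0                # Step 1: empirical CDF targets, Eq. (29)

h = 0.3                                      # scalar stand-in for Lambda in Eq. (32)
K = np.exp(-0.5 * ((Y[:, None] - Y[None, :]) / h) ** 2)
w = np.linalg.solve(K + 1e-3 * np.eye(200), t)   # Step 2 stand-in: F(y) ~ sum_i w_i K(y, y_i)

sv = np.abs(w) > 1e-4                        # keep only the "support vectors"
def density(y):
    # Step 3: p(y) = sum_i w_i K'(y, y_i), the derivative of Eq. (30)
    d = (y - Y[sv]) / h
    return float(np.sum(w[sv] * (-d / h) * np.exp(-0.5 * d * d)))

print(density(0.0))                          # roughly the N(0,1) peak, 1/sqrt(2*pi)
```

The density at a test point uses only the thresholded subset of the sample, which is the computational saving over a plain Parzen estimator that the text describes.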
D. Consistency of the Proposed Algorithm
The core component of the proposed density estimation approach is the regres-
sion algorithm using SVM. The proposed SVM regression approach is formulated using
Gaussian Processes and Mean Field theory. It was shown that SVM regression boils down
to a Gaussian Process prediction scheme. Thus, to test the consistency of the density
estimation approach, it suffices to show the consistency of Gaussian Process prediction
approaches. The following discussion examines the consistency of regression us-
ing Gaussian Processes. The argument uses the concept of the
Equivalent Kernel (EK), which is discussed next.
1. Equivalent Kernel in Gaussian Processes Prediction
As shown in (10), the predicted value at a test point y using Gaussian Process
regression is a weighted sum of the kernel function acting on the input training points.
This can be written as:

g(y) = \sum_{i=1}^{n} w_i\, K(y, y_i) = k(y)\, w^T   (33)

where k(y) is a vector whose i-th element is the value of the kernel function between
the test point y and the training point y_i, K(y, y_i). From the optimization point of view,
the problem is considered in the weight space, where the objective is to estimate the weight
vector w. To make the derivation feasible, a quadratic loss function is used instead of the
ε-insensitive loss function. Assuming the observations have noise variance σ_ν², the objective
function can be written in the form (see (6)):

E = \frac{1}{2\sigma_\nu^2} \sum_{i=1}^{n} (t_i - g(y_i))^2 + \frac{1}{2}\, g(D)\, K_n^{-1}\, g(D)^T   (34)
Using g as a shorthand for g(D), the following results are obtained:

g = \left[ k(y_1) w^T \;\; k(y_2) w^T \;\; \cdots \;\; k(y_n) w^T \right] = w K_n   (35)

\sum_{i=1}^{n} (t_i - g(y_i))^2 = (\tau - g)(\tau - g)^T = (\tau - w K_n)(\tau - w K_n)^T   (36)

Thus, the objective function reduces to:

E = \frac{1}{2\sigma_\nu^2} (\tau - g)(\tau - g)^T + \frac{1}{2}\, g K_n^{-1} g^T
  = \frac{1}{2\sigma_\nu^2}\, \tau\tau^T - \frac{1}{\sigma_\nu^2}\, g\tau^T + \frac{1}{2\sigma_\nu^2}\, g g^T + \frac{1}{2}\, g K_n^{-1} g^T
  = \frac{1}{2\sigma_\nu^2}\, g \left( \sigma_\nu^2 K_n^{-1} + I \right) g^T - \frac{1}{\sigma_\nu^2}\, g\tau^T + \frac{1}{2\sigma_\nu^2}\, \tau\tau^T   (37)
The posterior mean of the machine output vector, g_{PM}, is the one which minimizes
E, i.e., the solution of (using vector differentiation with respect to w):

\left( \sigma_\nu^2 K_n^{-1} + I \right) g_{PM}^T = \tau^T   (38)

Substituting from (35) into (38), the posterior mean of the weight vector, w_{PM},
can be found from:

\left( \sigma_\nu^2 K_n^{-1} + I \right) K_n^T\, w_{PM}^T = \tau^T, or
\left( \sigma_\nu^2 K_n^{-1} + I \right) K_n\, w_{PM}^T = \tau^T since K_n is symmetric. Thus:

\left( \sigma_\nu^2 I + K_n \right) w_{PM}^T = \tau^T   (39)

Letting \Sigma_{eq} = K_n + \sigma_\nu^2 I, then:

w_{PM}^T = \Sigma_{eq}^{-1}\, \tau^T   (40)

The mean prediction for a new input y is:

\mu(y) = k(y)\, w_{PM}^T = k(y)\, \Sigma_{eq}^{-1}\, \tau^T   (41)

From these results, the predictive mean at a test point can be written in the form:

g(y) = h(y)\, \tau^T   (42)

where:

h(y) = k(y)\, \Sigma_{eq}^{-1} = k(y)\, (K_n + \sigma_\nu^2 I)^{-1}   (43)

is known as the weight function or the Equivalent Kernel.
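A small numerical sketch of (40)-(43) follows (the kernel, its width, the noise variance, and the data are illustrative choices): the posterior mean at a test point is the fixed linear combination h(y) of the training targets.

```python
import numpy as np

def rbf(a, b, width=0.2):
    # Gaussian kernel matrix between 1-D point sets a and b
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / width ** 2)

Y = np.linspace(0.0, 1.0, 30)                    # training inputs
tau = np.sin(2 * np.pi * Y)                      # training targets
Kn = rbf(Y, Y)
sigma_nu2 = 0.01                                 # noise variance

y_test = np.array([0.25])                        # near the peak of the sine
k = rbf(y_test, Y)                               # k(y), shape (1, n)
h = k @ np.linalg.inv(Kn + sigma_nu2 * np.eye(30))   # equivalent kernel, Eq. (43)
pred = float(h @ tau)                            # g(y) = h(y) tau^T, Eq. (42)
print(pred)                                      # close to sin(pi/2) = 1
```

Note that h depends only on the kernel, the noise level, and the input locations, not on the targets, which is what makes the equivalent-kernel view useful for the consistency argument that follows.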
2. Consistency Argument of the Proposed Algorithm
Suppose that the regression approach has a loss function L. Under a given Borel
probability measure u(y, t), the risk is defined as:

R_L(g) = \int L(t, g(y))\; du(y, t)   (44)

where the optimization is done over the functional space on which the machine is com-
puting, i.e., the objective is to find the function η(y) that minimizes the risk R_L(g).

Definition 1: A procedure that returns g_D is consistent if:

R_L(g_D) \to R_L(\eta) \quad \text{as} \quad n \to \infty   (45)

As shown in (42), the GP posterior mean is expressed in terms of the equivalent
kernel (EK) h(y). But it is hard to analyze the consistency of the EK directly, since it depends on
the matrix inverse of K_n + σ_ν²I, and K_n depends on the locations of the training inputs; see [36]. To
smooth out the issue of random locations of the input training points over the input space,
it will be assumed that the observations are distributed ideally over the input space. This
means that the observations are "smeared out" across the input space with ρ data points per unit
(length, area, or volume, depending on the input space dimensionality). In this assumed
ideal case, the definition of consistency becomes:

Definition 2: A procedure, under the ideal assumption of smeared-out observations with ρ data points per unit volume of the input
space, is consistent if:

R_L(g_D) \to R_L(\eta) \quad \text{as} \quad \rho \to \infty   (46)
The objective function of the GP regression (see (34)) with a quadratic loss function
can be written in the form:

J[g] = \frac{1}{2\sigma_\nu^2} \sum_{i=1}^{n} (t_i - g(y_i))^2 + \frac{1}{2}\, \|g\|_H^2   (47)

where ‖g‖_H is the Reproducing Kernel Hilbert Space (RKHS) norm corresponding to the
kernel K. Under the idealized smearing out of the observations, a smoothed version of the
objective function can be obtained as:

J_\rho[g] = \frac{\rho}{2\sigma_\nu^2} \int (\eta(y) - g(y))^2\; dy + \frac{1}{2}\, \|g\|_H^2   (48)
Williams [37] uses a Fourier-analysis-based approach to argue the consistency of
Gaussian Process prediction. In the following, a brief outline of that approach is
presented; the details can be found in the mentioned reference.

The basic relation between the function g(y) and its Fourier transform g(s) is:

g(y) = \int g(s)\, e^{2\pi i\, s \cdot y}\; ds   (49)

and similarly for η(y). Under a stationary kernel, i.e., K(y, y') = K(y - y'), the RKHS
norm in (48) can be represented as (see [38] for details):

\|g\|_H^2 = \int \frac{|g(s)|^2}{S_K(s)}\; ds   (50)

where S_K(s) is the power spectrum of the kernel K. Thus,

J_\rho[g] = \frac{1}{2} \int \left( \frac{\rho}{\sigma_\nu^2}\, |\eta(s) - g(s)|^2 + \frac{|g(s)|^2}{S_K(s)} \right) ds   (51)

The minimization of J_ρ[g] can be done using the calculus of variations [39], which re-
sults in:

g(s) = \frac{S_K(s)\, \eta(s)}{\sigma_\nu^2/\rho + S_K(s)}   (52)

In the inverse Fourier domain, (52) can be recognized as the convolution
relation:

g(y) = \int h(y - y')\, \eta(y')\; dy'   (53)

from which the Fourier transform of the EK is:

h(s) = \frac{S_K(s)}{S_K(s) + \sigma_\nu^2/\rho} = \frac{1}{1 + \sigma_\nu^2 / (\rho\, S_K(s))}   (54)
It can easily be noted from (54) that:

h(s) \to 1 \quad \text{as} \quad \rho \to \infty   (55)

Thus g(s) → η(s) as ρ → ∞, which means that:

g(y) \to \eta(y) \quad \text{as} \quad \rho \to \infty   (56)

which establishes the consistency of Gaussian Process based regression.
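The limit in (55) is easy to verify numerically (a sketch; the squared-exponential kernel, noise level, and frequency grid are illustrative choices):

```python
import numpy as np

# Eq. (54): the equivalent-kernel transfer function tends to 1 at every
# frequency as the data density rho grows.
s = np.linspace(0.0, 0.5, 6)                              # frequencies
S_K = np.sqrt(2 * np.pi) * np.exp(-2 * (np.pi * s) ** 2)  # spectrum of K(y) = exp(-y^2/2)
sigma_nu2 = 0.1

for rho in (1.0, 1e3, 1e6):
    h = S_K / (S_K + sigma_nu2 / rho)                     # Eq. (54)
    print(rho, float(h.min()))
```

At low density the high frequencies are strongly attenuated (the fit is smoothed); as ρ grows, every frequency passes through and the prediction recovers η.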
E. Convergence of the Proposed Algorithm
The proposed density estimation approach has an iterative nature. Suppose e(m)
denotes the error vector at iteration m. The convergence of such an approach can be defined
such that:

e(m) \to 0 \quad \text{as} \quad m \to \infty   (57)

or, in the simplest cases, e(m) = 0 for some m = m_0. In this work, the convergence is shown
empirically.
F. Estimation of the Learning Parameters
The procedure proposed above for the MF-based SVM framework contains some
learning parameters (e.g., the regularization constant C, the learning rate η, and the kernel's
shape and parameters). These parameters should be carefully selected for proper perfor-
mance of the approach. This section proposes some methods for automatic selection of
these parameters.
1. Kernel Optimization using the EM algorithm
One of the kernels commonly used in SVM learning is the Gaussian Radial Basis
Function (GRBF) [40-42], which has the form in (32). The following discussion explains
an approach for automatic selection of the covariance of the RBF kernel. This approach in-
corporates the EM algorithm [43-45] into the learning procedure so that the covariance of
the kernel is optimized while the SVM weight coefficients are estimated. The EM algo-
rithm is used to automatically select the covariance matrices of the kernels centered at the
training instances. This automatic optimization of the covariance matrix makes the SVM
learning faster in adaptation and more accurate, which is reflected in a good performance of
the algorithm.
The EM algorithm can be used to estimate the parameters of a mixture of Gaussian distributions [42] based on the maximization of the following likelihood function:
\[
L(w, \Theta) = \sum_{y \in Y} f(y)\, \log p_{w,\Theta}(y) \tag{58}
\]
where \(f(y)\) is the empirical density function.
The maximization of (58) can be found using the iterative block relaxation algorithm. The relative contributions of each data item \(y = 0, \ldots, Y\) to each Gaussian component at step \(m\) are specified by the following respective conditional weights:
\[
\pi^{[m]}(r|y) = \frac{w_r^{[m]}\, \varphi(y|\theta_r^{[m]})}{p_{w,\Theta}^{[m]}(y)} \tag{59}
\]
where \(r = 1, 2, \ldots, n\).
The block relaxation, which converges to a local maximum of the likelihood function in (58), iteratively repeats the following two steps:

1. E-step [m + 1]: find the conditional weights of (59) under the model parameters (in our case, the covariances) fixed at step m, and

2. M-step [m + 1]: find the covariance of each Gaussian component by maximizing L(w, Θ) under the fixed conditional weights,

until the changes of the log-likelihood and all the model parameters become small.
The covariance of each Gaussian is obtained by the unconditional maximization:
\[
\left(\sigma_r^{[m+1]}\right)^2 = \frac{1}{w_r^{[m+1]}} \sum_{y \in Y} \left(y - \mu_r^{[m+1]}\right) \left(y - \mu_r^{[m+1]}\right)' f(y)\, \pi^{[m]}(r|y) \tag{60}
\]
This step is repeated in each step of the optimization of the SVM weight coefficients in (30).
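A minimal 1-D sketch of the updates in (59)-(60), with the kernels centered at the training points and the empirical density taken as uniform sample weights; the data and initial values below are hypothetical.

```python
import numpy as np

def em_variance_update(y, centers, sigma2, w):
    """One pass of (59)-(60) in 1-D: compute the conditional weights
    pi(r|y) for kernels centered at the training points, then update
    each kernel variance as the pi-weighted second moment."""
    diff = y[None, :] - centers[:, None]                  # (n_kernels, n_data)
    phi = (np.exp(-0.5 * diff ** 2 / sigma2[:, None])
           / np.sqrt(2.0 * np.pi * sigma2[:, None]))
    resp = w[:, None] * phi
    resp /= resp.sum(axis=0, keepdims=True)               # pi(r|y), eq. (59)
    new_w = resp.mean(axis=1)                             # updated mixing weights
    new_sigma2 = (resp * diff ** 2).sum(axis=1) / resp.sum(axis=1)  # eq. (60)
    return new_sigma2, new_w

# hypothetical demo: four kernels centered at the first four data points
y = np.random.default_rng(0).normal(size=200)
centers = y[:4].copy()
sigma2, w = em_variance_update(y, centers, np.full(4, y.var()), np.full(4, 0.25))
```

In the full algorithm this update alternates with the optimization of the SVM weight coefficients.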
Initialization of the parameters for the EM algorithm
As stated before, the centers (means) of the Gaussian kernels are chosen to be the input instances themselves, so the proposed approach uses the EM algorithm only for estimating the variances (covariances in multidimensional spaces) of the kernel function. To start the EM algorithm, all these parameters are initialized to the same value, namely the empirical variance (covariance) of the input training instances. In 1-D spaces this initialization becomes:
\[
\sigma_1^2 = \sigma_2^2 = \cdots = \sigma_n^2 = \sigma_{\text{empirical}}^2 \tag{61}
\]
where
\[
\sigma_{\text{empirical}}^2 = \frac{1}{n} \sum_{i=1}^{n} (y_i - m)^2 \quad \text{and} \quad m = \frac{1}{n} \sum_{i=1}^{n} y_i
\]
In multidimensional spaces:
\[
\Sigma_1 = \Sigma_2 = \cdots = \Sigma_n = \Sigma_{\text{empirical}} \tag{62}
\]
where
\[
\Sigma_{\text{empirical}} = \frac{1}{n} \sum_{i=1}^{n} (y_i - m)(y_i - m)^T \quad \text{and} \quad m = \frac{1}{n} \sum_{i=1}^{n} y_i
\]
2. Cross-Validation for Parameters Estimation
Cross-Validation (CV) is probably the simplest and most widely used method for estimating the prediction error [46]. In CV methods, the training sample is split into two parts: one for model fitting and the other for model evaluation. The idea behind CV is to recycle data by switching the roles of the training and test samples.

FIGURE 1 – Splitting data in Cross-Validation setups.

Specifically, suppose there is a J-fold CV problem. The data is split into J roughly equal-sized parts (see Fig. (1) for J = 5). For the jth part, the model is fitted using the other J − 1 parts of the training data and the evaluation is done on the jth part of the data.
The application of the CV principle in parameter estimation goes as follows: suppose the estimation algorithm has a parameter set λ; the steps to get the optimum value for λ are:

1. Split the whole training data set D into J disjoint subsamples D_1, D_2, ..., D_J.

2. For j = 1, 2, ..., J, fit a model to the training sample D \ D_j, and compute the discrepancy e_j(λ) using the test sample D_j.

3. Find the optimal λ* as the minimizer of the overall discrepancy e(λ) = Σ_j e_j(λ).
For illustration purposes, the general linear regression model is considered here, assuming that there is an input vector y which has the corresponding target vector t. In its basic form, the CV method uses the leave-one-out principle, which means that J = n. The ordinary CV (OCV) estimate of the prediction error is:
\[
OCV(\lambda) = \frac{1}{n} \sum_{i=1}^{n} \left( t_i - \langle g_\lambda(y_i) \rangle_i \right)^2 \tag{63}
\]
where \(\langle g_\lambda(y_i)\rangle_i\) denotes the prediction at \(y_i\) of the model fitted with the ith point left out. A CV estimate of λ is the minimizer of (63).
In order to illustrate the concept, CV is used to estimate the parameters C, the regularization constant, and Λ, the kernel covariance; i.e. λ = (C, Λ) in the proposed density estimation approach. The search method used in this section is the "grid-search" [47] on C and Λ. In a grid search, pairs of (C, Λ) are tried and the one with the best cross-validation accuracy is picked. The grid search is straightforward but may seem a naive choice. In fact, there are several advanced methods which can save computational cost by, for example, approximating the cross-validation rate. However, there are two motivations for preferring the simple grid-search approach here. One is that methods which avoid an exhaustive parameter search by approximations or heuristics may not feel safe, especially in illustrative situations like the one in our work. The other reason is that the computational time to find good parameters by grid search is not much more than that of the advanced methods, since there are only two parameters (C, Λ).
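The J-fold grid search described above can be sketched as follows. A simple one-parameter ridge fit stands in here for the MF-based SVM fit, and the data, parameter grid, and helper names are all illustrative assumptions.

```python
import numpy as np

def j_fold_grid_search(y, t, fit, error, grid, J=5, seed=0):
    """J-fold CV grid search: for each candidate parameter, fit on J-1
    folds, score on the held-out fold, sum e_j(lambda), and return the
    minimizer of the overall discrepancy e(lambda)."""
    n = len(y)
    folds = np.array_split(np.random.default_rng(seed).permutation(n), J)
    best, best_err = None, np.inf
    for lam in grid:
        err = 0.0
        for j in range(J):
            test_idx = folds[j]
            train_idx = np.setdiff1d(np.arange(n), test_idx)
            model = fit(y[train_idx], t[train_idx], lam)
            err += error(model, y[test_idx], t[test_idx])
        if err < best_err:
            best, best_err = lam, err
    return best, best_err

# toy demo: 1-D ridge fit of t = w*y, searching the ridge penalty lambda
rng = np.random.default_rng(1)
y = rng.normal(size=40)
t = 2.0 * y + 0.1 * rng.normal(size=40)
fit = lambda ys, ts, lam: (ys * ts).sum() / ((ys * ys).sum() + lam)
error = lambda w, ys, ts: ((ts - w * ys) ** 2).sum()
best_lam, best_err = j_fold_grid_search(y, t, fit, error, grid=[0.0, 0.1, 1.0, 10.0])
```

For the density estimator, `fit` would be the MF-based SVM training step and `grid` a mesh over (C, Λ) pairs.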
G. Experiments for Evaluating the Proposed Density Estimation
In this section, several data examples are used to illustrate the performance of the proposed algorithm for density estimation. The data sets are generated with standard random generators in 1- and 2-D spaces. The performance of the proposed algorithm is evaluated visually and using the Kullback-Leibler Distance (KLD) [48], which is perhaps the most frequently used information-theoretic distance measure between two probability densities. The KLD is one example of the Ali-Silvey class of information-theoretic distance measures [49], which are defined as:
\[
d(p_0, p_1) = f\left( \varepsilon_0[c(\psi(x))] \right) \tag{64}
\]
where \(p_0\) and \(p_1\) are two probability densities, \(\psi(\cdot)\) represents the likelihood ratio \(p_1/p_0\), \(c(\cdot)\) is convex, \(\varepsilon_0[\cdot]\) is the expected value with respect to the distribution \(p_0\), and \(f(\cdot)\) is a non-decreasing function. Suppose that \(c(x) = x \log x\) and \(f(x) = x\); then the KLD is defined as:
\[
KLD(p_1 \| p_0) = \int p_1(x) \log \frac{p_1(x)}{p_0(x)}\, dx \tag{65}
\]
In the practical application of the KLD for evaluating a density approximation \(p_1\) of the reference density \(p_0\), both \(KLD(p_1\|p_0)\) and \(KLD(p_0\|p_1)\) should be calculated. If their values are close with opposite signs, the two densities are close to each other, which means that the density estimator works well.
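Numerically, both directed distances in (65) can be evaluated on a grid. The sketch below compares two unit-variance Gaussians with a small, illustrative mean shift, for which the KLD has the closed form \((\Delta\mu)^2/2\) in each direction.

```python
import numpy as np

def kld(p1, p0, dx):
    """Discretized KLD(p1 || p0) = integral of p1 log(p1/p0) dx, eq. (65)."""
    mask = p1 > 0
    return float(np.sum(p1[mask] * np.log(p1[mask] / p0[mask])) * dx)

# illustrative check: two unit-variance Gaussians with mean shift 0.1
x = np.linspace(-12.0, 12.0, 6001)
dx = x[1] - x[0]
gauss = lambda x, mu: np.exp(-(x - mu) ** 2 / 2.0) / np.sqrt(2.0 * np.pi)
p0, p1 = gauss(x, 0.0), gauss(x, 0.1)
d10, d01 = kld(p1, p0, dx), kld(p0, p1, dx)
# analytically both equal (0.1)**2 / 2 = 0.005 here
```

When one of the densities is itself an estimate evaluated at sample points, the same formula is applied to the discretized estimate, which is how small negative values can appear in practice.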
1. Density estimation for a 1-D Gaussian distribution
This is a simple, standard, but illustrative experiment, and it is used here for comparison purposes. In this experiment, a data sample of 100 instances from a 1-D standard normal distribution is used to illustrate the performance of the proposed MF-based SVM density estimation algorithm. The results are compared to those obtained in a previous work using the traditional formulation of the SVM-based density estimation approach. The comparison is done based on visual evaluation, convergence speed, and the KLD measure.

As shown in Fig. (2), the MF-based SVM closely approximates the reference density function, while there is an apparent error in the approximation produced by the traditionally-formulated SVM. In the case of the proposed MF-based SVM estimator, the distances are: KLD(p1|p0) = 0.12 and KLD(p0|p1) = −0.094. The two KLDs are close enough to each other to show that the estimation is a good one. On the other hand, for the traditional-formulation-based SVM approximation in Fig. (2-b), the distances are: KLD(p1|p0) = −0.85 and KLD(p0|p1) = 1.86, which are not as close to each other as in the proposed algorithm. The computational cost of the proposed algorithm is of order O(N²), while the traditional SVM learning algorithm has order O(N³); see [50]. The current experiment takes 0.015 second for the optimization process with the proposed MF-based SVM, while it takes 0.22 second with the traditional SVM, which emphasizes the faster response of the proposed algorithm.
FIGURE 2: Estimation of the 1-D Gaussian density function with the SVM density estimation formulated with (a) the proposed and (b) the traditional formulation.
2. Density estimation for 1-D Mixture of Gaussian distributions
In this slightly more challenging experiment, a data set of 100 instances is generated from a 1-D mixture of Gaussians. The mixture consists of two components and has the form:
\[
p(x) = \alpha_1 \mathcal{N}(\mu_1, \sigma_1^2) + \alpha_2 \mathcal{N}(\mu_2, \sigma_2^2) \tag{66}
\]
with the parameters shown in Table 1.
TABLE 1
PARAMETERS OF THE 1-D MIXTURE OF GAUSSIANS DENSITY FUNCTION

Parameter   µ1    µ2    σ1²   σ2²   α1    α2
Value       -1     7     9     4    0.4   0.6
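A test sample from (66) with the Table 1 parameters can be generated as follows; the function name and random seed are illustrative.

```python
import numpy as np

def sample_mixture(n, rng=None):
    """Draw n points from 0.4*N(-1, 9) + 0.6*N(7, 4), the Table 1 mixture."""
    rng = np.random.default_rng(0) if rng is None else rng
    first = rng.random(n) < 0.4              # pick a component with prob alpha_1
    mu = np.where(first, -1.0, 7.0)
    sigma = np.where(first, 3.0, 2.0)        # std devs sqrt(9) and sqrt(4)
    return rng.normal(mu, sigma)

x = sample_mixture(100)                      # a 100-instance sample as in the text
```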
The results in Fig. (3) show that the proposed algorithm approximates the density function in (66) well. There are small errors at the tails of the density function components. These tail errors add up at the intersection of the two components, which produces a noticeable error. The distance measure values in this case are: KLD(p1|p0) = 0.26 and KLD(p0|p1) = −0.09, which are affected by the error discussed above. This
FIGURE 3 – Estimation of a 1-D mixture of Gaussian density functions
experiment takes 0.313 second for the optimization process to converge, which means that the algorithm still maintains a considerably fast convergence.
3. Density Estimation for 1-D Rayleigh Distribution
In this experiment, a data sample of 100 instances is drawn from a Rayleigh distribution, which has the form:
\[
p(x) = \frac{x\, e^{-x^2/(2 s^2)}}{s^2} \tag{67}
\]
where the parameter \(s\) is set to 1 in the experiment. The Rayleigh distribution is chosen because it is of special interest in medical imaging applications, and also because it represents a good non-symmetric alternative to the Gaussian. The results shown in Fig. (4) illustrate that the proposed algorithm approximates the density function very well. The distances in this case are: KLD(p1|p0) = 0.3 and KLD(p0|p1) = −0.2. The apparent difference between the two distances is likely due to the small error in the left tail. It is interesting to note here that the peak values
FIGURE 4 – Estimation of a Rayleigh density function.
of the densities (reference and estimated) occur at the same value of x, which is a point in favor of the proposed estimation algorithm. This experiment takes 0.016 second for the optimization process to converge.
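The coincidence of the peaks is expected: for the Rayleigh density in (67) the mode is at x = s, which is 1 here. A quick numerical check of that property (grid resolution is an illustrative choice):

```python
import numpy as np

def rayleigh_pdf(x, s=1.0):
    """p(x) = x * exp(-x^2 / (2 s^2)) / s^2, eq. (67)."""
    return x * np.exp(-x ** 2 / (2.0 * s ** 2)) / s ** 2

x = np.linspace(0.0, 4.0, 4001)
mode = x[np.argmax(rayleigh_pdf(x))]   # the density peaks at x = s = 1
```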
4. Comparison with state-of-the-art methods
In this section, a comparison between the proposed density estimation approach and two state-of-the-art methods is discussed. These methods are the Sparse Density Construction (SDC) method [1] and the Reduced Set Density Estimator (RSDE) method [50].

The SDC method presents an efficient construction algorithm for obtaining sparse kernel density estimates based on a regression approach that directly optimizes model generalization capability. It uses orthogonal forward regression to ensure the computational efficiency of the density construction. The algorithm incrementally minimizes the leave-one-out test score. This method is shown to perform comparably with other algorithms.
The RSDE method is optimal in the \(L_2\) sense in that the integrated squared error between the unknown true density and the RSDE is minimized in devising the estimator. The required optimization turns out to be a straightforward quadratic optimization with simple positivity and equality constraints, and thus suitable forms of Multiplicative Updating [51] or Sequential Minimal Optimization, as introduced in [52], can be employed, which ensures at most quadratic scaling in the original sample size.

The RSDE approach is fundamentally different from the classical Parzen window and SVM density estimators in that the Integrated Squared Error (ISE) between the true (unknown) density and the reduced set estimator is minimized. The sparsity of representation (data condensation) emerges naturally from the direct minimization of the ISE due to the required constraints on the functional form of \(p(x)\), without resorting to additional sparsity-inducing regularization terms or employing \(L_1\) or \(\varepsilon\)-insensitive losses.
For the comparison with the two methods, an example presented in the above-mentioned references is chosen, and our proposed density estimation approach is applied to it. The function used in this example is a 1-D mixture of a Gaussian and an exponential density in the form:
\[
p(x) = 0.5\, \frac{1}{\sqrt{2\pi}} \exp\left( \frac{-(x-2)^2}{2} \right) + 0.5\, \frac{0.7}{2} \exp\left( -0.7\, |x + 2| \right) \tag{68}
\]
The performance measure used is the \(L_1\) test error, which has the form:
\[
L_1 = \frac{1}{N_{\text{test}}} \sum_{k=1}^{N_{\text{test}}} |p(x_k) - \hat{p}(x_k)| \tag{69}
\]
where \(\hat{p}(x)\) is the estimated density at point \(x\), and \(N_{\text{test}}\) is the number of test points.
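The error measure in (69) is a mean absolute deviation over the test points and is straightforward to compute; the function name below is an illustrative choice.

```python
import numpy as np

def l1_test_error(p_true, p_hat):
    """Mean absolute deviation between true and estimated densities
    evaluated at the test points, eq. (69)."""
    p_true = np.asarray(p_true, dtype=float)
    p_hat = np.asarray(p_hat, dtype=float)
    return float(np.abs(p_true - p_hat).mean())
```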
According to the above references, the SDC method is compared with the Parzen window method and the classical SVM method [53]. The following table shows the results presented in those references, in comparison with the proposed MF-based SVM approach. The results show that the MF-based SVM approach outperforms the other approaches in terms of accuracy, with a slightly higher number of kernels used. There is no quantitative evaluation of the computational time of the SDC or RSDE methods, but they use the leave-one-out test
TABLE 2
RESULTS FOR THE MIXTURE OF A GAUSSIAN AND AN EXPONENTIAL DENSITY FUNCTION

Method          Parzen Window   Classical SVM   SDC     RSDE   MF-SVM
L1 × 10⁻²       2.063           2.165           2.177   1.8    0.5
kernel number   100             5               5       5      10
score, which is known to be time consuming, although they use an iterative approach to decrease the computational complexity. The visual results illustrated in Fig. 5 show the better performance of the proposed MF-based SVM density estimation approach.
FIGURE 5: Estimation of a 1-D mixture of a Gaussian and an exponential density function: (a) SDC method (quoted from [1]), (b) MF-based method.
5. Density estimation for 2-D cases
The first experiment is carried out to assess the performance of the proposed algorithm in higher-dimensional spaces. A data set of 100 instances from a 2-D Gaussian distribution is used. Again, this experiment compares the proposed algorithm with the traditionally formulated SVM algorithm. Figure (6) shows the density function and its contour for the reference density function, the density estimated using the traditionally-formulated SVM estimator, and the density estimated using the proposed MF-based SVM estimator.
As can be noted from the figure, there is a significant improvement in the estimation using the MF-based SVM over the traditionally formulated SVM. In the contour plot there is a slight deformation of the contour of the density estimated with the traditional SVM, and there is a shift in the mean vector. The distance measures in the case of the traditionally-formulated SVM estimator are: KLD(p1|p0) = 0.39 and KLD(p0|p1) = −8.4, which shows a large difference due to the shift mentioned above. The significant improvement can be noted from the contour of the density estimated using the MF-based SVM estimator. The distance measures are: KLD(p1|p0) = 4.029 and KLD(p0|p1) = −3.6, showing a close fit. This experiment takes 0.172 second for the optimization process to converge with the proposed MF-based SVM learning, while it takes 0.578 second with the traditional SVM.
Another experiment employs a sample (200 points) of 2-D data generated with equal probability from an isotropic Gaussian and two Gaussians with positive and negative correlation structure. The probability density is estimated using a Parzen window with a Gaussian kernel, where leave-one-out cross-validation selects the kernel bandwidth; using the RSDE with a Gaussian kernel, where the bandwidth is selected by minimizing the cross-entropy between the Parzen window estimate and the RSDE; and using the proposed MF-based SVM. The probability density iso-contours of the resulting estimates are shown in Fig. 7. The results illustrate that the performance of the proposed MF-based density estimation approach is highly comparable to the RSDE method, with the advantage of avoiding the use of Quadratic Programming tools.
6. Experiments on the automatic selection of the Kernel width using EM algorithm
To evaluate the proposed algorithm for automatic kernel parameter selection, the same data set for the mixture of Gaussian density functions is used.

The results in Fig. (8) show that the proposed Mean Field-based SVM density estimation with automatic kernel optimization approximates the density function in (66) well. Comparing the results with Fig. (3) shows that the proposed algorithm for automatic selection of the kernel width enhances the estimation results. For a quantitative evaluation, the Kullback-Leibler distance (KLD) [49] and the Levy distance [34] measures are used. For the proposed MF-based SVM approach with kernel optimization, the KLD is 0.02, which is small enough to show that the proposed approach is a good density estimator. For comparison purposes, the KLD for the MF-based SVM approach without kernel optimization is 0.09, which is further evidence that the proposed approach outperforms the other algorithms.
The Levy distance is used to compare two distribution functions in order to reflect the similarity of their density functions. In this experiment, the Levy distance compares the empirical distribution function of the input random sample with the distribution function produced by the density estimator. The CDFs of the MF-based SVM without kernel optimization and of the proposed MF-based SVM with kernel optimization are shown in Fig. (8). The Levy distance is 0.049 for the proposed approach, while it is 0.079 for the MF-based SVM without kernel optimization, which again illustrates the outstanding performance of the proposed approach.
7. Experiments on the automatic selection of the learning parameters using Cross Validation
One step toward the automation of the proposed approach is to automatically estimate the regularization constant C and the kernel width σ. The following results discuss the application of Cross Validation in estimating these two parameters. Figure (9-a) shows that an improper choice of the regularization constant C (C = 2.1 in this case, while it was 0.1 in the previous case) results in a bad performance of the density estimation algorithm.

The evolution of the estimation error with the value of the kernel width at a constant value of C is shown in Fig. (9-b). The curve is convex and shows that there is an optimal value of σ which provides the minimum error. The error surface over the two parameters C and σ is shown in Fig. (9-c), where the minimum occurs at C = 1.5 and σ = 0.8.
8. Experiments on the algorithm convergence
The convergence of the estimation approach is evaluated by the error convergence during the learning process. The estimation error is calculated at each learning step of the 1-D Gaussian density function example discussed before. As shown in Fig. (10), the error decreases with each learning step, almost linearly. This reflects the convergence of the overall estimation algorithm.
H. Conclusion
This chapter presented the foundations of the density estimation approach based on statistical learning principles. The proposed approach uses the MF-based SVM algorithm presented in Chapter II. The chapter started from the basic principles of statistical theory to represent the density estimation problem as a regression setup, where the MF-based SVM regression approach is used. The consistency of the proposed approach was discussed in terms of the equivalent kernel formulation. Different approaches for estimating the learning parameters were presented, e.g. the EM algorithm for the kernel width and the Cross Validation approach for parameter estimation. Several experiments were presented to illustrate the performance of the approach and its convergence.
FIGURE 6: Estimation of a 2-D Gaussian density function: (a) the reference density function and its contour, (b) the density estimated using the traditional-formulation-based SVM and its contour, (c) the density estimated using the MF-based SVM and its contour.
FIGURE 7: Comparison between the estimation results for a 2-D mixture of an isotropic Gaussian and two Gaussians with positive and negative correlation structure: (a) the contour of the density estimated using the Parzen window method, (b) the contour of the density estimated using the Reduced Set method, (c) the contour of the density estimated using the MF-based SVM method, and (d) the density estimated using the MF-based SVM.
FIGURE 8 – Estimation of the mixture of Gaussians in Fig. 3: (a) with the proposed algorithm for automatic kernel parameter estimation, (b) CDF of the estimated density without automatic kernel optimization, and (c) CDF of the estimated density with the proposed kernel optimization algorithm.
FIGURE 9: Effect of the learning parameters on the proposed algorithm's performance: (a) density estimate with an improper choice of C, (b) RMSE versus the kernel width σ at C = 1.5, (c) RMSE surface over C and σ.
FIGURE 10: Convergence of the estimation error with the optimization iterations for the
Gaussian density estimation example
CHAPTER IV
STATISTICAL LEARNING IN COMPUTER VISION
Statistical learning-based kernel methods are rapidly replacing other empirical learning methods (e.g. neural networks) as a preferred tool for machine learning due to many attractive features: a strong basis in statistical learning theory; no computational penalty in moving from linear to nonlinear models; and a convex optimization problem, guaranteeing a unique global solution and consequently producing systems with excellent generalization performance [11]. This chapter presents a statistical learning-based approach for solving the camera calibration problem. This approach uses the proposed MF-based SVM algorithm to estimate the elements of the perspective projection matrix.
Camera calibration is an extensively studied topic in different machine intelligence communities. Its purpose is to establish a mapping between the camera's 2-D image plane and a 3-D world coordinate system, so that a measurement of a 3-D point position can be inferred from its projections in the cameras' image frames. The existing techniques for solving this problem can be broadly classified into three main categories: linear, nonlinear (or iterative), and two-step methods. Complete reviews of these approaches can be found in [54, 55].
Explicit camera calibration methods develop solutions by analyzing the physical model of camera imaging, so that calibration identifies a set of modeling parameters with physical meanings [5], whereas implicit calibration methods resort to realizing a nonlinear mapping function that can describe the input-output relation well [6]. Explicit calibration methods can provide the camera's physical parameters, which are important in some applications, such as computer graphics, virtual reality, 3-D reconstruction, etc.
This chapter presents an explicit approach for solving the camera calibration problem. In principle, the approach considers the problem as a mapping from the 3-D world coordinate system to the 2-D image coordinate system, where the projection matrix is the mapping function, and the MF-based SVM algorithm is used to simulate this mapping.
An important issue in the SVM algorithm is the choice of the kernel function [56]. The shape of the kernel controls the capacity and performance of the algorithm. To enable explicit estimation of the projection matrix from the SVM regression setup, a linear kernel is used. Although the SVM algorithm with a linear kernel has some limitations in the type of mappings that can be simulated [57], a first-order linear kernel is experimentally shown to be sufficient for the current application.
A. Camera Calibration
This section provides a brief introduction to the camera calibration problem from the regression point of view, as treated in this work. The derivation of the proposed approach as well as the implementation steps are discussed.
1. An Overview
The camera model considered here is the perspective projection based on the pinhole model [58]. If a point M has world coordinates (X, Y, Z) and is projected onto a point m that has image coordinates (x, y), this projection can be described, in homogeneous coordinates, by the equation:
\[
s\, m = P\, M
\quad \text{or} \quad
s \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}
= P \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} \tag{70}
\]
where s is a scaling factor and P (3 × 4) is the projection matrix, which can be decomposed into two matrices, P = A D, where
\[
D = \begin{bmatrix} R & t \\ \mathbf{0}_3^T & 1 \end{bmatrix}
\qquad
A = \begin{bmatrix}
\alpha_x & -\alpha_x \cot\theta & x_0 & 0 \\
0 & \alpha_y / \sin\theta & y_0 & 0 \\
0 & 0 & 1 & 0
\end{bmatrix} \tag{71}
\]
The 4 × 4 matrix D represents the mapping from world coordinates to camera coordinates and accounts for the six extrinsic parameters of the camera: three for the rotation R, normally specified by the three rotation (Euler) angles \(R_x\), \(R_y\), and \(R_z\), and three for the translation \(t = (t_x, t_y, t_z)^T\). \(\mathbf{0}_3\) represents the null vector \((0, 0, 0)^T\). The 3 × 4 matrix A represents the intrinsic parameters of the camera: the scale factors \(\alpha_x\) and \(\alpha_y\), the coordinates \(x_0\) and \(y_0\) of the principal point, and the angle θ between the image axes.
The projection matrix can be represented in a simplified form as:
\[
P = \begin{bmatrix} P_1 \\ P_2 \\ P_3 \end{bmatrix} \tag{72}
\]
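The composition P = A D of (71) can be sketched as follows. The intrinsic and extrinsic values used in the demo (focal scales, principal point, orthogonal axes with θ = π/2, identity rotation) are hypothetical, chosen only so the world origin should project onto the principal point.

```python
import numpy as np

def projection_matrix(alpha_x, alpha_y, x0, y0, theta, R, t):
    """Compose P = A @ D from the 3x4 intrinsic matrix A and the 4x4
    world-to-camera transform D of (71)."""
    A = np.array([[alpha_x, -alpha_x / np.tan(theta), x0, 0.0],
                  [0.0, alpha_y / np.sin(theta), y0, 0.0],
                  [0.0, 0.0, 1.0, 0.0]])
    D = np.eye(4)
    D[:3, :3] = np.asarray(R)
    D[:3, 3] = np.asarray(t)
    return A @ D

# hypothetical camera: square pixels, orthogonal axes, principal point (320, 240)
P = projection_matrix(800.0, 800.0, 320.0, 240.0, np.pi / 2, np.eye(3), [0.0, 0.0, 5.0])
s_m = P @ np.array([0.0, 0.0, 0.0, 1.0])   # project the world origin, eq. (70)
x, y = s_m[:2] / s_m[2]                     # divide out the scale s
```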
Using (72), the relation in (70) can be represented pictorially as in Fig. 11. This figure shows that there is a coupling between the outputs of the three branches (the scaling term is repeated in each output). Thus, it is not possible to optimize a branch independently of the others.

FIGURE 11 – Representation of the camera calibration as a mapping problem
2. Basic Regression Relations
As stated before, the proposed approach considers each branch of Fig. 11 as a regression problem and solves it using the MF-based SVM algorithm. The general SVM regression rule in (10) is used to formulate the output from each branch. As also stated before, a linear kernel is used in the SVM formulation to enable an explicit estimation of the projection matrix. This kernel has the form:
\[
K(M_i, M) = M_i \cdot M + b = M_i^T M + b \tag{73}
\]
where b is a constant. The general regression relation becomes:
\[
f(M) = \sum_{i=1}^{n} w_i\, K(M_i, M) = \sum_{i=1}^{n} w_i \left( M_i^T M + b \right)
= \left( \sum_{i=1}^{n} w_i X_i \right) X
+ \left( \sum_{i=1}^{n} w_i Y_i \right) Y
+ \left( \sum_{i=1}^{n} w_i Z_i \right) Z
+ (b + 1) \left( \sum_{i=1}^{n} w_i \right) \tag{74}
\]
where each \(M_i\) is a point from the training sample.
The specific outputs are obtained from (74) as:
\[
\begin{aligned}
f^x(M) &= \left( \sum_{i=1}^{n} w_i^x X_i \right) X + \left( \sum_{i=1}^{n} w_i^x Y_i \right) Y + \left( \sum_{i=1}^{n} w_i^x Z_i \right) Z + (b+1) \left( \sum_{i=1}^{n} w_i^x \right) \\
f^y(M) &= \left( \sum_{i=1}^{n} w_i^y X_i \right) X + \left( \sum_{i=1}^{n} w_i^y Y_i \right) Y + \left( \sum_{i=1}^{n} w_i^y Z_i \right) Z + (b+1) \left( \sum_{i=1}^{n} w_i^y \right) \\
f^s(M) &= \left( \sum_{i=1}^{n} w_i^s X_i \right) X + \left( \sum_{i=1}^{n} w_i^s Y_i \right) Y + \left( \sum_{i=1}^{n} w_i^s Z_i \right) Z + (b+1) \left( \sum_{i=1}^{n} w_i^s \right)
\end{aligned} \tag{75}
\]
where \(w_i^x\) denotes the ith weight in the regression machine which computes \(f^x(M)\).
3. Simultaneous Optimization of the Regression Relations
As stated before, the relations in (75) are coupled, so the corresponding regression machines cannot be individually optimized. To overcome this problem, a gradient descent step is used to simultaneously optimize the values of the scaling factors while optimizing the Support Vector regression machines. This step minimizes the overall error:
\[
E = \sum_{i=1}^{n} \| P M_i - m_i \|^2
= \sum_{i=1}^{n} \left( f^x(M_i) - s_i x_i \right)^2 + \left( f^y(M_i) - s_i y_i \right)^2 + \left( f^s(M_i) - s_i \right)^2 \tag{76}
\]
The proposed algorithm minimizes the error in (76) with respect to the scaling factors s according to the gradient descent rule:
\[
\Delta s_i = x_i \left( s_i x_i - f^x(M_i) \right) + y_i \left( s_i y_i - f^y(M_i) \right) + \left( s_i - f^s(M_i) \right) \tag{77}
\]
The update of s follows:
\[
s^{\text{new}} = s^{\text{old}} - \eta\, \Delta s \tag{78}
\]
where η is a learning parameter.
4. The Overall Calibration Algorithm
In the following, the implementation steps of the proposed approach are summarized.
Algorithm 1 Statistical Learning based Camera Calibration Algorithm
1. Prepare the training data sets, where the inputs are the 3-D world coordinates and the outputs are the corresponding 2-D coordinates. Preferably, normalize both the inputs and outputs, and prepare the augmented data set.

2. Initialize the values of the scaling factors (to all ones in our implementation).

3. Optimize the regression branches in Fig. 11 using the MF-based SVM algorithm (see [59]).

4. Update s from (78).

5. Iterate from step 3 to minimize the overall error in (76).
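The alternation of steps 3-5 can be sketched as follows. A plain ridge-regularized linear least-squares fit stands in for the MF-based SVM regression step, and the function name, step counts, and synthetic data are all illustrative assumptions, not the dissertation's implementation.

```python
import numpy as np

def calibrate(world, image, iters=100, eta=0.01, ridge=1e-8):
    """Sketch of Algorithm 1: alternate (a) linear fits of f^x, f^y, f^s
    to the scaled targets with (b) the scale update of (77)-(78).
    world: (n, 3) 3-D points; image: (n, 2) 2-D projections."""
    n = len(world)
    Ma = np.hstack([world, np.ones((n, 1))])        # step 1: augmented coordinates
    s = np.ones(n)                                   # step 2: initial scales
    for _ in range(iters):
        # step 3: fit the three branches to the targets (s*x, s*y, s)
        T = np.column_stack([s * image[:, 0], s * image[:, 1], s])
        P = np.linalg.solve(Ma.T @ Ma + ridge * np.eye(4), Ma.T @ T).T  # 3x4
        f = Ma @ P.T                                 # f^x, f^y, f^s per point
        # step 4: scale update, eqs. (77)-(78)
        grad = (image[:, 0] * (s * image[:, 0] - f[:, 0])
                + image[:, 1] * (s * image[:, 1] - f[:, 1])
                + (s - f[:, 2]))
        s = s - eta * grad                           # step 5: iterate
    return P, s

# synthetic demo with a simple perspective map (hypothetical data)
rng = np.random.default_rng(2)
world = rng.uniform(-1.0, 1.0, (20, 3))
depth = world[:, 2] + 4.0
image = np.column_stack([world[:, 0] / depth, world[:, 1] / depth])
P_hat, s_hat = calibrate(world, image)
```

The returned 3 × 4 matrix plays the role of the estimated calibration matrix in (79).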
After the completion of the optimization process, the estimated calibration matrix will have the form:
\[
P = \begin{bmatrix}
\sum_{i=1}^{n} w_i^x X_i & \sum_{i=1}^{n} w_i^x Y_i & \sum_{i=1}^{n} w_i^x Z_i & (b+1) \sum_{i=1}^{n} w_i^x \\
\sum_{i=1}^{n} w_i^y X_i & \sum_{i=1}^{n} w_i^y Y_i & \sum_{i=1}^{n} w_i^y Z_i & (b+1) \sum_{i=1}^{n} w_i^y \\
\sum_{i=1}^{n} w_i^s X_i & \sum_{i=1}^{n} w_i^s Y_i & \sum_{i=1}^{n} w_i^s Z_i & (b+1) \sum_{i=1}^{n} w_i^s
\end{bmatrix} \tag{79}
\]
B. Discussion of Some Calibration Methods
For comparison, the experimental work uses two classical calibration methods, linear [58] and nonlinear (NL) using the simplex method [60], and two state-of-the-art methods, neural networks (NN) [61] and the Heikkilä method (Heikki) [62]. The following discussion briefly explains these methods.
1. Linear Direct Transform Method (LDT)
A direct manipulation of the camera calibration model in (70) results in:
\[
x_i = \frac{P_{11} X_i + P_{12} Y_i + P_{13} Z_i + P_{14}}{P_{31} X_i + P_{32} Y_i + P_{33} Z_i + P_{34}}
\quad \text{and} \quad
y_i = \frac{P_{21} X_i + P_{22} Y_i + P_{23} Z_i + P_{24}}{P_{31} X_i + P_{32} Y_i + P_{33} Z_i + P_{34}} \tag{80}
\]
Given N correspondence points, LDT rearranges the formulas in (80) into 2N linear equations in the entries of P, in the form:
\[
C\, p = 0 \tag{81}
\]
where:
\[
C = \begin{bmatrix}
X_1 & Y_1 & Z_1 & 1 & 0 & 0 & 0 & 0 & -x_1 X_1 & -x_1 Y_1 & -x_1 Z_1 & -x_1 \\
0 & 0 & 0 & 0 & X_1 & Y_1 & Z_1 & 1 & -y_1 X_1 & -y_1 Y_1 & -y_1 Z_1 & -y_1 \\
X_2 & Y_2 & Z_2 & 1 & 0 & 0 & 0 & 0 & -x_2 X_2 & -x_2 Y_2 & -x_2 Z_2 & -x_2 \\
0 & 0 & 0 & 0 & X_2 & Y_2 & Z_2 & 1 & -y_2 X_2 & -y_2 Y_2 & -y_2 Z_2 & -y_2 \\
\vdots & & & & & & & & & & & \vdots \\
X_N & Y_N & Z_N & 1 & 0 & 0 & 0 & 0 & -x_N X_N & -x_N Y_N & -x_N Z_N & -x_N \\
0 & 0 & 0 & 0 & X_N & Y_N & Z_N & 1 & -y_N X_N & -y_N Y_N & -y_N Z_N & -y_N
\end{bmatrix}
\]
and
\[
p = [P_{11}\;\; P_{12}\;\; P_{13}\; \cdots\; P_{33}\;\; P_{34}]^T \tag{82}
\]
The system of linear equations in (81) can be solved using the Singular Value Decomposition (SVD) approach to get the unknown p. The SVD decomposes C as:
\[
C = U S V^T \tag{83}
\]
The solution is the column of V corresponding to the smallest singular value on the main diagonal of S.
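The LDT solution can be sketched with NumPy's SVD, which returns the rows of \(V^T\) ordered by decreasing singular value; the synthetic projection matrix and points in the demo are illustrative.

```python
import numpy as np

def ldt(world, image):
    """Build C of (81), two rows per correspondence, and return the
    right-singular vector of the smallest singular value, reshaped to P."""
    rows = []
    for (X, Y, Z), (x, y) in zip(world, image):
        rows.append([X, Y, Z, 1, 0, 0, 0, 0, -x * X, -x * Y, -x * Z, -x])
        rows.append([0, 0, 0, 0, X, Y, Z, 1, -y * X, -y * Y, -y * Z, -y])
    _, _, Vt = np.linalg.svd(np.array(rows, dtype=float))
    return Vt[-1].reshape(3, 4)      # p of (82), recovered up to scale

# synthetic check against a known (hypothetical) projection matrix
P_true = np.array([[1.0, 0, 0, 0], [0, 1.0, 0, 0], [0, 0, 1.0, 3.0]])
world = np.random.default_rng(3).uniform(0.0, 1.0, (8, 3))
hom = np.hstack([world, np.ones((8, 1))]) @ P_true.T
image = hom[:, :2] / hom[:, 2:3]
P_est = ldt(world, image)
P_est = P_est * (P_true[2, 3] / P_est[2, 3])   # fix the arbitrary scale
```

With noise-free correspondences the null vector is recovered exactly (up to scale), which is why the scale is fixed against one known entry before comparing.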
2. Nonlinear Two Stages Method (NL)
In this approach an iterative minimization algorithm is employed to solve for the camera parameters. The first stage of the approach finds initial values of the camera parameters. This can be done by many methods (see [63]), for example by the DLT method discussed above. The estimated camera parameters are used to reproject the 3-D control points into the 2-D space. Then, the second stage of the algorithm uses a nonlinear optimization algorithm to minimize the following error criterion in the image space:
\[
E = \sum_{i=1}^{N} \| P M_i - m_i \|^2 \tag{84}
\]
Details of (84) can be found in (76). For the implementation of this method in the current work, the well-known simplex method is used to minimize (84).
3. Neural Networks Method (NN)
There are many neural-network based methods for camera calibration. The one
used here employs a neural network not only to learn the mapping from 3-D points to 2-D
pixel points which minimizes the error in (76), but also to extract the projection matrix
and camera parameters. Therefore, the network structure is laid out accordingly. The net
is a two-layer feedforward neural network. The input layer has three neurons plus one
augmented neuron fixed at 1; the three neurons correspond to the three coordinates X, Y, Z
of a 3-D point. The number of output units is three, and the hidden layer consists of four
neurons (three plus one dummy). The hidden and output neurons have unity activation
functions. The weight matrix of the hidden layer is denoted by V, and it is assumed to
correspond to the extrinsic parameters matrix D. The weight matrix of the output layer is
denoted W and corresponds to the intrinsic parameters matrix A.
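With unity (linear) activations, the forward pass of this network is just the matrix product W·V applied to the augmented input, so the composed weights realize the projection matrix. A sketch of that forward pass is below; the exact weight shapes (V as 4x4, W as 3x4) are an illustrative assumption, since the text only fixes the neuron counts.

```python
import numpy as np

def calib_net_forward(V, W, X):
    """Two-layer linear calibration net: hidden weights V play the role
    of the extrinsic matrix D, output weights W the intrinsic matrix A,
    and W @ V acts as the projection matrix."""
    x_aug = np.append(X, 1.0)     # input layer: X, Y, Z plus the fixed 1
    hidden = V @ x_aug            # four hidden neurons, unity activation
    out = W @ hidden              # three outputs: a homogeneous 2-D point
    return out[:2] / out[2]       # dehomogenize to pixel coordinates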
4. Heikkilä Method (Heikki)
The calibration procedure suggested in this method utilizes circular control points
and performs mapping from world coordinates into image coordinates and backward from
image coordinates to lines of sight or 3-D plane coordinates. It introduces bias correction
for circular control points and a non-recursive method for reversing the distortion model.
The motivation for using circular control points is that lines in the object space are mapped
as lines on the image plane, but in general perspective projection is not a shape preserving
transformation. Two- and three-dimensional shapes are deformed if they are not coplanar
with the image plane. This is also true for circular landmarks, which are commonly used
control points in calibration. However, a bias between the observations and the model is
induced if the centers of their projections in the image are treated as projections of the circle
centers. This approach is mainly intended to be used with circular landmarks. However, it
is also suitable for small points without any specific geometry. In that case, the radius is set
to zero.
This approach is iterative and consists of two stages, like the NL approach. The first
stage initializes the camera parameters with a simple method (e.g., the DLT). In the
second stage, the parameters of the forward camera model are estimated by minimizing the
weighted sum of squared differences between the observations and the model. Assume
that there are n circular control points and K images, indexed by i = 1, · · · , n and
k = 1, · · · , K. A vector containing the observed image coordinates of the center of the
ellipse i in the frame k is denoted by eo(i, k), and the corresponding vector produced by
the forward camera model is denoted by em(i, k). The objective function used
can be expressed as:
J(θ) = e^T(θ) Σ^{-1} e(θ) (85)
where e^T(θ) = [(eo(1, 1) − em(1, 1))^T (eo(2, 1) − em(2, 1))^T · · · (eo(n, K) − em(n, K))^T].
The matrix Σ is the covariance matrix of the observation error. The parameters of the
forward camera model are obtained by minimizing J(θ):
θ = arg min_θ J(θ) (86)
Again, this method uses the Simplex optimization method to solve for the optimum θ.
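Evaluating the weighted objective in (85) is straightforward once the residual vector is stacked. The sketch below assumes Σ is symmetric positive definite and solves a linear system rather than forming Σ⁻¹ explicitly, which is the numerically preferred route; the function name is this sketch's own.

```python
import numpy as np

def heikkila_objective(e, Sigma):
    """J(theta) = e^T Sigma^{-1} e from (85): e stacks the residuals
    e_o(i, k) - e_m(i, k) between observed and modelled ellipse centers,
    and Sigma is the covariance matrix of the observation error."""
    e = np.asarray(e, dtype=float).ravel()
    # solve(Sigma, e) computes Sigma^{-1} e without inverting Sigma.
    return float(e @ np.linalg.solve(Sigma, e))
```

With Σ = σ²I this reduces to an ordinary (unweighted) sum of squared residuals scaled by 1/σ², so the NL criterion is the special case of equal, uncorrelated observation noise.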
C. Experimental Results and Discussions
The performance of the proposed statistical learning MFSVM approach for camera
calibration is evaluated in two different ways. The first part uses synthetic data, which
allows a quantitative evaluation of the approach with respect to ground truth. This
synthetic data is generated from a virtual camera with specific internal and external
parameters, and the performance of the proposed approach is investigated using this data.
The robustness of the algorithm is evaluated by adding different levels of noise to the
training data. The proposed approach shows nearly steady values for the estimated
parameters and outperforms the other classical algorithms.
The second part uses a real checker-board object with black/white squares. The
known 3-D coordinates of the black squares’ corners (inputs) with their corresponding
image coordinates (desired outputs) are used to train the proposed algorithm. The proposed
approach is applied in 3-D reconstruction of a real scene, and the approach performance is
reflected in the accuracy of the reconstruction process. The performance of the proposed
technique is compared against some known algorithms: classical linear, nonlinear, and
neural network approaches. The proposed approach shows an outstanding performance.
1. Simulation with Synthetic Data
In this experiment, a virtual CCD camera is modeled (pinhole camera model is
assumed) such that its intrinsic and extrinsic parameters are used as ground truth data. This
allows a fair comparison between the estimated parameters using the proposed approach
and other camera calibration approaches. In addition, this allows us to characterize the
performance of the proposed approach using noisy data at different noise levels. The setup
of this experiment can be described as follows:
• Given the ground truth values of the camera parameters (shown in Table 3), construct
the projection matrix (see (71)).
• Given a set of 3-D reference points (representing points of interest of a calibration
pattern) and the projection matrix, get the projected image points.
• In real setups, the 2-D image points are detected by applying image processing algo-
rithms on the captured images of the calibration pattern (noisy environment, sensor
insensitivity, and inaccuracy of the feature extractors could be different sources of
errors). To simulate these conditions, Gaussian noise with zero mean and different
standard deviations,σ, is added to the 2-D image points.
• Use these noisy 2-D points and their corresponding 3-D points to produce a noisy
version of the projection matrix.
• Use this projection matrix in a backward computation of both camera intrinsic and
extrinsic parameters.
• Compare the estimated parameters and their corresponding ground truth values.
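One trial of the experiment described by the steps above can be sketched as follows, using the SVD-based DLT as the re-estimation step purely for illustration (the chapter's own estimator is the MFSVM); the function name and the normalization by P34 are choices of this sketch.

```python
import numpy as np

def noisy_calibration_trial(P_true, world_pts, sigma, rng):
    """Project the 3-D reference points with the ground-truth matrix,
    corrupt the 2-D points with zero-mean Gaussian noise of std sigma
    (in pixels), and re-estimate the projection matrix via the DLT."""
    Wh = np.c_[world_pts, np.ones(len(world_pts))]
    h = Wh @ P_true.T
    img = h[:, :2] / h[:, 2:3]                       # ideal 2-D points
    img = img + rng.normal(0.0, sigma, img.shape)    # simulated detection noise
    rows = []
    for (X, Y, Z), (x, y) in zip(world_pts, img):
        rows.append([X, Y, Z, 1, 0, 0, 0, 0, -x*X, -x*Y, -x*Z, -x])
        rows.append([0, 0, 0, 0, X, Y, Z, 1, -y*X, -y*Y, -y*Z, -y])
    _, _, Vt = np.linalg.svd(np.asarray(rows))
    P = Vt[-1].reshape(3, 4)
    return P / P[2, 3]            # fix the arbitrary DLT scale
```

Repeating such trials (e.g. 50 times per noise level, as in the text) and decomposing each estimated matrix back into intrinsic/extrinsic parameters yields the RMSE-versus-σ curves reported below.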
Motivated by the basic and realistic assumption that most training data sets are
contaminated by various errors from data acquisition and/or pre-processing steps, statistical
learning is a reasonable approach for robust camera calibration. To show the robustness
of the proposed approach, a series of 11 experiments with different noise levels is
carried out. The noise standard deviation, σ, is selected in the range from 0 to 4 pixels with
a step of 0.4. Each experiment is repeated 50 times to get average results. The camera
parameters are also estimated by four other camera calibration approaches: linear [58],
nonlinear (NL) using the simplex method [60], calibration using neural networks (NN) [61],
and the Heikki method [62], which were briefly explained above.
To explicitly estimate the camera parameters, the proposed approach has to use a
linear kernel. Employing this kernel introduces a trade-off between slightly lower accuracy
at low noise levels and high robustness at high noise levels. In the ideal case, where the
training data set is free of errors/noise, Table 3 shows the estimated values of the camera
parameters for the different approaches against the ground truth values. Although these
results give minor credit to the competing approaches over the MFSVM approach at
low noise levels, the proposed approach gains significant credit at higher noise levels, as
shown in Table 3.
The root mean square errors (RMSE) over the 50 trials between the ground truth
parameters and the estimated parameters are plotted in Figures (12, 13, 14, 15) as a function
of σ for a set of four camera parameters. These figures show that the performance of most
of the traditional calibration methods degrades strongly as the noise level increases, which
is reflected in high RMSE values. This degradation is rapid for the linear approach, and
slower for both the nonlinear (NL) and neuro-calibration (NN) approaches. However, the
proposed approach shows outstanding robustness against noise. The RMSE values are
almost the same at all noise levels for the tx parameter, and increase only slightly for
the other displayed parameters. Robustness against noise is one of the main strengths
of statistical learning approaches in general, and of SVM specifically. This robustness
has attracted increasing interest in applying such approaches in machine learning. As the
results demonstrate, robustness is the feature that motivates the use of the MFSVM approach
for camera calibration, especially when the noise level cannot be ignored.

TABLE 3: Ground truth camera parameters versus estimated parameters

Parameter   tx     ty     tz     Rx     Ry     Rz     αu     αv     uo     vo     θ
           (mm)   (mm)   (mm)  (rad)  (rad)  (rad)  (pix)  (pix)  (pix)  (pix) (rad)

True      -27.0  -28.0  701.0  0.090  0.800 -0.030  556.0  549.0  172.0  121.0  1.57

Estimated parameters using noise-free data (σ = 0 pixels)
Linear    -27.3  -28.1  700.7  0.090  0.799 -0.030  555.7  548.8  171.8  120.9  1.57
NL        -27.2  -28.0  700.8  0.090  0.800 -0.030  555.9  548.8  171.9  121.0  1.57
NN        -27.5  -28.2  699.5  0.089  0.800 -0.030  555.0  548.1  171.7  120.8  1.57
Heikki    -27.0  -27.9  701.0  0.090  0.800 -0.030  556.1  550.0  172.0  120.9  1.57
SVM       -29.7  -29.3  697.8  0.088  0.796 -0.031  553.4  545.4  169.8  120.1  1.57

Estimated parameters using noisy data (σ = 4 pixels)
Linear    -80.4  -59.2  654.3  0.044  0.768 -0.068  531.5  519.0  141.3  100.7  1.56
NL        -49.3  -44.6  676.8  0.059  0.781 -0.054  539.8  532.8  157.9  108.9  1.56
NN        -49.8  -46.0  671.0  0.062  0.782 -0.053  536.4  528.7  156.8  108.3  1.56
Heikki    -43.7  -48.9  884.8  0.172  0.762 -0.093  707.4  707.5  155.3  104.8  1.57
SVM       -34.3  -41.6  682.6  0.069  0.791 -0.045  543.6  536.9  166.3  110.9  1.57
FIGURE 12: The RMSE for tx as a function of noise σ, computed for the five approaches:
linear, nonlinear using the simplex method, neuro-calibration, Heikki, and MFSVM.
2. Experiments with Real Images
In the case of real images, there is no ground truth data (camera parameters) available
for camera calibration. The only available data is the 3-D coordinates of some control
points on a given calibration pattern. Therefore, the accuracy of a camera calibration
approach is measured in terms of the accuracy of reconstructing these 3-D points through
triangulation [54, 55]. To carry out this accuracy measure, calibration is performed for two
CCD cameras working as a stereo pair. Two images of a calibration pattern (see Fig. 16) are
captured using these two cameras. The 3-D points used for the calibration are the vertices
of the checker-board squares of the calibration pattern.

FIGURE 13: The RMSE for Ry as a function of noise σ.
Knowing the 3-D coordinates (Xi, Yi, Zi) of these points, the corresponding image-
point locations are detected accurately using edge-detection and fitting techniques (for
examples of such techniques, see [54, 55]). Given these two sets of points, the two cameras
are calibrated using the four calibration approaches stated before. The accuracy of the
calibration process is measured using the root mean square error (RMSE), defined as:
RMSE = [ (1/n) Σ_{i=1}^{n} ( (Xi − X̂i)² + (Yi − Ŷi)² + (Zi − Ẑi)² ) ]^{1/2} (87)
where (X̂i, Ŷi, Ẑi) are the estimated 3-D coordinates of the i'th point and n is the number
of points.

FIGURE 14: The RMSE for uo as a function of noise σ.
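The RMSE of (87) translates directly into code; the sketch below is a direct transcription (the function name is this sketch's own).

```python
import numpy as np

def reconstruction_rmse(true_pts, est_pts):
    """RMSE of (87): root of the mean squared Euclidean distance between
    the n true 3-D control points and their reconstructed estimates."""
    diff = np.asarray(true_pts, dtype=float) - np.asarray(est_pts, dtype=float)
    d2 = np.sum(diff ** 2, axis=1)     # squared distance per point
    return float(np.sqrt(np.mean(d2)))
```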
Figure 17 shows the 2-D projection of the calibration pattern corners (dots) and
the corners detected by processing the pattern image (circles). To obtain the 2-D
projection, the cameras are calibrated and then the 2-D projections are computed using (70).
It is clear from Fig. 17 that the projected 2-D corners match the detected ones almost
perfectly. Knowing that the corners are detected accurately from the images, the figure
illustrates the outstanding performance of the proposed MFSVM approach.
For quantitative evaluation, the RMSE values of the four camera calibration approaches
are given in Table 4.
FIGURE 15: The RMSE for the skewness angle θ. Note: the Heikki method assumes an
ideal camera model in the sense of skewness (i.e., θ = π/2), so no error is indicated for it.
TABLE 4: Error in the 3-D reconstructed data

             linear   nonlinear   neuro-calibration    SVM
RMSE (mm)    0.4280    0.2991          0.3044         0.2794
It is clear from the table that the proposed approach outperforms the other methods
and gives more accurate results in reconstructing the 3-D coordinates. It is worth noting
that, although the difference between the RMSE of the MFSVM approach and the
nonlinear approach is small, the MFSVM approach relaxes the requirement of a good
initial guess. This requirement is one of the main drawbacks of nonlinear methods, and
without it the solution can diverge from the correct one. These results on real images
reinforce the results obtained for the synthetic data and verify the validity of the proposed
method in real situations.

FIGURE 16: Calibration setup: a stereo pair of images of a checker-board calibration
pattern.

FIGURE 17: The 2-D projection of the calibration pattern corners (dots), and the detected
corners from the image (the left one in Fig. 16) of the calibration pattern (circles).
D. Conclusion
In this chapter, a robust method for camera calibration using Mean field theory-
based Support Vector Machines (MFSVM), as a statistical learning approach, is presented.
The projection matrix is obtained explicitly by using a dot product kernel in the formulation
of SVM algorithm.
The explicit estimation of the camera parameters is evaluated using synthetic data
while a 3-D scene reconstruction problem is used to evaluate the performance in real world
setups. Different noise levels are used to show the robustness of the approach, which is
illustrated by nearly steady values of the estimated parameters as the noise standard
deviation increases. In addition, the approach is compared with other known camera
calibration techniques, namely: linear, nonlinear using the simplex method, neuro-
calibration, and the Heikki method. The experimental results show outstanding
performance of the proposed approach in terms of accuracy and robustness against noise
compared to the competing camera calibration approaches. The RMSE drops from 0.428
with the linear calibration to 0.2794 with the proposed approach.
CHAPTER V
APPLICATIONS ON THE PROPOSED DENSITY ESTIMATION APPROACH
This chapter presents extensive experimental work carried out to evaluate
applications of the proposed density estimation approach. The applications include density
estimation using real remote sensing data in a Bayes classification setup, and parameter
estimation of MRF models in image applications. The density estimation accuracy is
reflected in the classification accuracy of the data sets using only class conditional
probability modeling with pre-specified priors. Since the reference density function is
not known for these real data sets, the earlier evaluation methods (visual inspection and
the KLD measure) for the performance of the proposed density estimation algorithm
cannot be used. Instead, the classification accuracy is used as a practical measure of the
performance of the density estimation algorithm. When carrying out the classification
experiments to compare the different density estimation algorithms, the operating
conditions (in a Bayes classification setup) are the same except for the density estimation
algorithm. Thus, the argument that the classification accuracy is an indication of the
density estimation performance is applicable.
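The Bayes classification setup used throughout the chapter can be sketched as follows: pick the class maximizing the product of a pre-specified prior and a fitted class-conditional density. The callables stand in for any density estimator (in the text, the MF-based SVM estimator); names and signatures are this sketch's own.

```python
import numpy as np

def bayes_classify(densities, priors, y):
    """Bayes decision rule: return the index c maximizing
    prior[c] * p(y | c), where densities[c] is the fitted
    class-conditional density for class c."""
    scores = [prior * p(y) for p, prior in zip(densities, priors)]
    return int(np.argmax(scores))
```

Because only the density factor changes between the compared setups, differences in classification accuracy can be attributed to the density estimators, which is the argument made above.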
A. Test-of-agreement (ToA) for the response of two classifiers
To compare the performance of two classifiers against each other, a rule is proposed,
which is discussed here. Suppose there are two classifiers with decision rules M1(·) and
M2(·), respectively, which are applied to the test data set. Define the statistic:
Sn = Σ_{i=1}^{n} zi (88)
where zi = 1 if M1(yi) = M2(yi) and 0 otherwise. The statistic Sn measures the agreement
between the two classifiers' outputs, reflecting how often the two classifiers respond
identically to the same data point. From the central limit theorem (CLT) [34], with p the
probability of agreement:
(Sn − np) / √(np(1 − p)) ∼ N(0, 1) (89)
The hypothesis:
H0 : the two classifiers agree (90)
has the 95% confidence interval [Ŝn/n − 2A, Ŝn/n + 2A], where A = √(p̂(1 − p̂)/n) and
p̂ = Ŝn/n.
If this confidence interval contains the point 1, then the probability that the
responses of the two classifiers agree is 1. If the confidence interval does not contain 1,
then there is a chance that the two classifiers disagree, and thus the hypothesis is rejected.
With this argument in mind, the rule to compare the performance of two classifiers is as
follows:
1. If there is a difference in performance between the two classifiers and they disagree,
the difference is significant.
2. If there is a difference in performance between the two classifiers but they agree,
the difference is not significant.
Throughout the following experiments, the 95% confidence intervals are calculated
between the Bayes classifier that uses the MF-based SVM density estimator and those
that use the MLE, Parzen-window, KNN, or traditionally formulated SVM density
estimators. If an interval contains the point 1, then the apparent difference in performance
between the two classifiers is not significant.
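The test-of-agreement interval above can be computed directly from the two label vectors; a minimal sketch (the function name is this sketch's own):

```python
import numpy as np

def agreement_interval(labels1, labels2):
    """Test-of-agreement: S_n counts matching decisions of the two
    classifiers on the same test points, p_hat = S_n / n, and the
    approximate 95% interval is p_hat +/- 2*sqrt(p_hat*(1-p_hat)/n).
    The performance difference is deemed significant when the interval
    excludes 1."""
    z = np.asarray(labels1) == np.asarray(labels2)
    n = z.size
    p_hat = z.mean()                                 # S_n / n
    half = 2.0 * np.sqrt(p_hat * (1.0 - p_hat) / n)  # 2A
    return p_hat - half, p_hat + half
```

Note that when the classifiers agree everywhere, p̂ = 1 and A = 0, so the interval degenerates to [1, 1] and trivially contains 1, matching rule 2 above.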
B. Experiments for Density Estimation Using Real Remote Sensing Multispectral Data

The following experiments illustrate the performance of the proposed density
estimation algorithm on real data sets of relatively high-dimensional spaces. In the current
experiments, two 7-band multispectral data sets with 30-meter resolution are used. Each
point in these data sets is represented by a vector of length 7, giving a seven-dimensional
estimation problem.
1. Experiments for density estimation using a multispectral agricultural area
This data set represents an agricultural area in the state of Kentucky, USA. A
169x169 scene is cropped from multispectral Landsat 7-band data collected on Wednesday,
December 5, 2001. The resolution of this data set is 28.5m x 28.5m per pixel. Nine
classes are defined in this data set: Background, Corn, Soybean, Wheat, Oats, Alfalfa,
Clover, Hay/Grass, and Unknown. The ground truth labels are available for the whole data
set. For evaluation purposes, a subset from each class is used for training the density
estimator and the rest of the data is used for testing. Figure (18) shows the reference land
cover of the area and the classification results based on the SVM density estimators.
The confusion matrix for the classification based on the density estimation using
MF-based SVM is shown in Table 5. The average true classification accuracy over the
classes is 78.5%. The largest source of error is the misclassification between the
Background and the other classes: 46% of the Background reference pixels are classified
into other classes and 9.6% of the other classes' pixels are classified as Background. A
specific noticeable example is the confusion between Soybean and Background: 18.6% of
the Background reference points are classified as Soybean and 5% of the Soybean points
are classified as Background. Other large errors can be noted in the Alfalfa and Hay/Grass
FIGURE 18: A multispectral agricultural area: (a) land cover, and classification results
using: (b) SVM, and (c) MF-SVM as a density estimator.
classes. However, the reason for the latter error is the prior probability assumption: the
prior of each class is taken as its share of the reference points in the data set. Since the
Alfalfa and Hay/Grass classes are under-represented in the data set, their priors are small
and a noticeable error is generated. This realization calls for another estimation method
for the prior probabilities; this dissertation presents a method for this modeling using the
MRF, which will be discussed later in this chapter. Another observation from the
classified image in Fig. (18-b) is that most of the regions contain some random
misclassification points (which appear like salt-and-pepper noise in the image) although
they should be smooth and clean. This is mainly because the Bayes classification setup
treats the points in the data set as realizations of independent random variables, regardless
of the contextual interactions. The MRF modeling also overcomes this problem, as will
be illustrated later.

TABLE 5: Classification confusion matrix for the multispectral agricultural area using the
MF-based SVM estimator.

Class        Total  Back-    Corn   Soy-   Wheat  Oats  Alf-  Clo-  Hay/   Unk-    %
            Points  ground          bean               alfa   ver   Grass  nown   True

Background    6790   3661     770   1262    555   358     4   144     15     21  53.92
Corn          9371    475    8787     93      1     7     3     1      3      1  93.77
Soybean       8455   1090     101   6985     74    85     0    92      6     22  82.61
Wheat         1923    199       0     37   1581    74     0    31      1      0  82.22
Oats           800    121       2     22     47   598     0    10      0      0  74.75
Alfalfa         65     19      22      6      0     0    13     0      5      0  20
Clover         619    120       7    103     19    54     0   316      0      0  51.05
Hay/Grass      142     54      17     27      1     6     1     5     29      2  20.42
Unknown        396      8       0      3      6     0     0     0      1    378  95.45
% +ve true           63.7   90.53  95.51  69.22 50.59  61.9 52.75  48.33  89.15
To evaluate the proposed MF-based SVM density estimator, other classical and new
density estimation algorithms are applied to the same data set. Table 6 summarizes the
classification accuracies obtained with the different density estimators. It can be noted
from the table that the MF-based SVM density estimation algorithm outperforms the other
algorithms (this is reflected in the classification accuracy as discussed before). With the
classical algorithms (MLE with Gaussian assumptions, Parzen-window estimation, and
K-Nearest Neighbors "KNN"), some classes disappear completely (e.g., Alfalfa and
Hay/Grass); however, both SVM-based algorithms manage to recover part of these classes.
The proposed MF-based SVM outperforms the traditional SVM in the overall
classification accuracy.
An important note here is that MLE fails to recognize some classes because of the
unimodality assumption for the class conditional probabilities (CCP). One way to boost the
performance of the Gaussian-based ML approach is to use the enhanced-statistics approach
proposed in [64]; another is to use a multimodal form for the CCP. Under the assumption
of a Gaussian kernel in MF-based SVM regression, the density estimator is equivalent to
a mixture of Gaussians; thus the MF-based SVM density estimator assumes a multimodal
form for the CCP. The distinct feature here is that the optimum number of components in
the mixture is obtained automatically. This automatic selection makes the performance of
the proposed approach superior to the traditional mixture-of-Gaussians density estimator
using the EM algorithm [32, 65]; the results are illustrated in Table 6.
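The Gaussian-kernel equivalence just described can be made concrete: the density expansion is a weighted sum of Gaussian kernels centered at the support vectors, i.e. a Gaussian mixture. A hedged sketch of evaluating such an expansion follows; it assumes the weights are already non-negative and sum to one (the normalization the density formulation enforces), and the function name is this sketch's own.

```python
import numpy as np

def kernel_density(x, support_vectors, weights, sigma):
    """Evaluate p(x) = sum_i w_i * N(x; x_i, sigma^2 I): with a Gaussian
    kernel, the SVM density expansion is exactly a mixture of Gaussians
    centered at the support vectors x_i."""
    sv = np.atleast_2d(support_vectors)
    d = sv.shape[1]
    diff = sv - np.asarray(x, dtype=float)
    norm = (2.0 * np.pi * sigma ** 2) ** (d / 2.0)   # Gaussian normalizer
    comps = np.exp(-0.5 * np.sum(diff ** 2, axis=1) / sigma ** 2) / norm
    return float(np.dot(weights, comps))
```

The number of mixture components is simply the number of support vectors retained by the optimization, which is the automatic model-order selection referred to above.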
To justify the argument stated above regarding the performance of the classifier
that uses the MF-based SVM density estimation, the ToA rule is used to analyze the
results in Table 6. The 95% confidence intervals (see section V.A) are: [0.7462 0.7567],
[0.7906 0.8005], [0.8033 0.8130], and [0.883 0.891]. None of these intervals contains the
point 1, which indicates that the apparent difference in performance between the classifier
that uses the MF-based SVM density estimator and the others is significant. Moreover,
the Bayes classifier based on the MF-based SVM density estimator has an apparently
better performance than the others, reflecting the better performance of the density
estimator.
TABLE 6: Classification accuracy using different density estimators for the multispectral
agricultural area.

Class        MLE  Parzen    KNN    Mixture of   Traditional  MF-based
                  Window  (k=15)   Gaussians        SVM        SVM

Background    52      37      46          48         50.4       53.9
Corn          94      97      96          93         91.5       93.77
Soybean       78      92      82          82         77.9       82.61
Wheat         44      31      40          76         84.2       82.22
Oats           7       9       4          31         72.5       74.75
Alfalfa        0       0       0          41         76.9       20
Clover         5       4       4          44         69.8       51.05
Hay/Grass      0       0       1           9         66.2       20.42
Unknown       94      94      94          93         95.7       95.45
Average       71      72      71        74.4         76.1       78.5
2. Experiments for density estimation using a multispectral urban area

This data set represents an urban area around the Golden Gate Bay in the city of San
Francisco, California, USA, and is shown in Fig. (19). A 700x700 scene is cropped from
a 4632x4511 multispectral Landsat data set collected on Tuesday, September 28, 1998,
with a resolution of 5m x 5m per pixel. Five classes are defined in this data set: Trees,
Streets, Water, Buildings, and Earth. The available ground truth set contains 5076 data
points. For evaluation purposes, a subset from each class is used for training and the rest
of the data is used for testing.
FIGURE 19: A multispectral urban area: (a) RGB snapshot, and color-coded classification
results using: (b) SVM, and (c) MF-SVM as a density estimator.
The confusion matrix for the classification using the MF-based SVM density
estimator is shown in Table 7, which indicates that this experiment is easier than the
previous one. The overall average true classification accuracy for the classes is 96.7%.
There is a little confusion between the Streets and Earth classes: 3.8% of the Streets points
are classified as Earth and 1% of the Earth points are classified as Streets. Another
noticeable confusion is between the Buildings and Streets classes: 12.67% of the
Buildings points are classified as Streets, while 0.8% of the Streets points are classified as
Buildings. The misclassification between the Streets, Earth, and Buildings classes is
reasonable due to the similarity between these classes in the real world.

TABLE 7: Classification confusion matrix for the multispectral urban area using the MF-
based SVM estimator.

Class        Total   Trees  Streets  Water  Buildings  Earth   % True
            Points                                            per Class

Trees          212     212        0      0          0      0      100
Streets        521       2      495      0          4     20       95
Water          595       0        0    595          0      0      100
Buildings      292       1       37      0        254      0       87
Earth          410       0        4      0          0    406       99
% +ve true            98.6    92.35    100      98.45  95.31
The proposed MF-based SVM density estimator is evaluated against other density
estimation algorithms by noting the classification rate of the Bayes classification setup
with each density estimator. Table 8 summarizes the results obtained with the different
density estimators. It can be noted from the table that the SVM density estimators
outperform the classical algorithms, although there is only a small improvement of the
MF-based SVM estimator over the traditionally formulated SVM estimator.
The 95% confidence intervals for the results in Table 8 are: [0.93 0.94], [0.93 0.94],
[0.95 0.96], and [0.997 0.999]. None of these intervals contains the point 1, which
emphasizes the better performance of the proposed density estimator.
TABLE 8: Classification accuracy using different density estimators for the multispectral
urban area.

Class       MLE  Parzen    KNN   Traditional  MF-based
                 Window   (k=3)      SVM        SVM

Trees        99      85     89        97.5       100
Streets      91      97     94        95.6        95
Water        97     100    100       100         100
Buildings    90      68     89        91.8        87
Earth        80      82     84        91.3        99
Average      92      89   92.7        95.7        96.7
C. Experiments for Density Estimation Using Real Remote Sensing Hyperspectral Data

The performance of the proposed density estimation algorithm in real high-
dimensional spaces is illustrated in this section. In the current experiments, two
hyperspectral data sets are used, one with 34 bands and the other with 58 bands. These
data sets pose density estimation problems in 34- and 58-dimensional spaces, respectively.
1. Experiments for density estimation using a hyperspectral 34-band data set

This experiment uses a hyperspectral data set of size 200x200 for an urban area
in the state of Indiana, USA. This scene is cropped from a 618x1013 data set of type
"AISA Classic Reflectance" collected using the AISA hyperspectral sensor with 34
channels in 1983, with a 3m x 3m resolution. Seven classes are defined in it: Agricultural,
Coniferous, Herbaceous, Other Impervious, Roads, Soil/Disturbed, and Water. The
ground truth labels are available for the whole data set, and only a subset from each class
is used for training the density estimators. This data set is illustrated in Fig. (20).
FIGURE 20: A hyperspectral 34-band urban area: (a) RGB snapshot, and color-coded
classification results using: (b) SVM, and (c) MF-SVM as a density estimator.
The confusion matrix for the classification based on the density estimation using
MF-based SVM is shown in Table 9. The average true classification accuracy for the
classes is 80.1%. The largest source of error is the misclassification between the different
types of vegetation. In particular, the "Other Impervious" class has the lowest
classification rate because it shares its characteristics with the vegetation classes.

TABLE 9: Classification confusion matrix for the hyperspectral 34-band data using the
MF-based SVM estimator.

Class          Total  Agric.  Conif.  Herb.  O.Imp.  Roads  Soil  Water  % True
              Points

Agricultural    5138    4122     170    497     221     11   102     15   80.23
Coniferous     15182       3   12878   2228      66      2     1      4   84.82
Herbaceous      7481      27     947   5914     291    197    75     30   79.05
Other Imp.       925       4      56    175     595     62    30      3   64.32
Roads            627       0      17    145      44    418     3      0   66.67
Soil            6362     229       4    409     609    124  4922     64   77.37
Water           4285       5     323    282     328      2   133   3212   74.96
% +ve true              93.9    89.5   61.3    27.6   51.2  93.5   96.5
Table 10 summarizes the classification accuracies obtained with the different
density estimators. From the table it can be noted that the MF-based SVM density
estimation algorithm outperforms the other algorithms.
The 95% confidence intervals for the results in Table 10 are: [0.76 0.77], [0.73 0.74],
[0.74 0.75], and [0.856 0.863]. None of these intervals contains the point 1, which reflects
the better performance of the MF-based SVM density estimator in high-dimensional spaces.
TABLE 10: Classification accuracy using different density estimators for the hyperspectral
34-band data.

Class              MLE   Parzen    KNN   Traditional  MF-based
                         Window   (k=3)      SVM        SVM

Agricultural      86.65   85.95   85.66      80.87      80.23
Coniferous        71.63   91.4    62.4       88.33      84.82
Herbaceous        77.8    44.53   70.85      67.73      79.05
Other Impervious  66.7    46.38   60.43      69.73      64.32
Roads             83.41   36.36   76.24      67.3       66.67
Soil              90.77   80.27   82.27      70.7       77.37
Water             80.98   68.59   72.7       75.08      74.96
Average           78.8    75.8    71.4       78.53      80.1

2. Experiments for density estimation using a hyperspectral 58-band data set

Figure (21) shows a hyperspectral data set of an urban area in the state of New
Mexico, USA. This scene is of size 300x600 and was cropped from a 1093x2176 data set
with a 1m x 1m resolution. This data set was collected using the AISA Eagle hyperspectral
sensor in 1983 with 58 channels. Nine classes are defined in it: Unclassified/Shadow,
Water, Trees, Buildings, Asphalt Roads, Scrub Shrub/Herbaceous, Sand/Soil/Gravel,
Riverine Wetland, and Riverine Substrate. The ground truth labels are available for the
whole data set.
The confusion matrix for the classification based on the density estimation using
MF-based SVM is shown in Table 11. The average true classification accuracy for the
classes is 73%. The largest source of error is due to the misclassification between the Trees
and other classes, with 18.8% of the Trees reference pixels are classified to the other classes
and 19.6% from the other classes are classified to Trees. Due to the inherent similarities
of the materialistic structure between the Scrub Shrub/Herbaceous (low vegetation)and the
Trees, there is a significant misclassification between these two classes. There 11.4% from
80
TABLE 11: Classification confusion matrix for the hyperspectral 58-band urban area using
the MF-based SVM estimator.

Class      | Total Points | Shadow | Water | Trees | Buildings | Asphalt | Scrub | Sand  | Wetland | Substrate | % True
Shadow     | 1189         | 657    | 0     | 355   | 8         | 0       | 160   | 9     | 0       | 0         | 55.26
Water      | 7382         | 6      | 6930  | 73    | 6         | 13      | 2     | 194   | 0       | 158       | 93.88
Trees      | 91883        | 190    | 86    | 74607 | 1125      | 436     | 10477 | 4670  | 30      | 262       | 81.20
Buildings  | 11655        | 30     | 17    | 1133  | 5615      | 905     | 1236  | 2573  | 0       | 146       | 48.18
Asphalt    | 9349         | 32     | 2     | 245   | 91        | 7369    | 142   | 1465  | 0       | 3         | 78.82
Scrub      | 30542        | 1      | 0     | 12458 | 725       | 98      | 16569 | 684   | 1       | 6         | 54.25
Sand       | 25741        | 9      | 71    | 2599  | 2431      | 2156    | 413   | 17965 | 1       | 96        | 69.79
Wetland    | 437          | 0      | 0     | 51    | 0         | 0       | 27    | 7     | 350     | 2         | 80.1
Substrate  | 1822         | 0      | 84    | 72    | 4         | 3       | 0     | 106   | 1       | 1552      | 85.18
% +ve true |              | 71     | 96.38 | 82.12 | 56.12     | 67.11   | 57.03 | 64.89 | 51.47   | 69.56     |
the Trees pixels are classified as Scrub Shrub/Herbaceous, and 40.78% of the Scrub
Shrub/Herbaceous pixels are classified as Trees. A similar error can be seen between the
Sand/Soil/Gravel and Riverine Substrate classes. The MF-based SVM density estimation
algorithm outperforms the other algorithms, as is clear from Table 12.
The 95% confidence intervals in this experiment are [0.659, 0.663], [0.725, 0.729],
[0.696, 0.700], and [0.807, 0.811], which emphasizes the better performance of the proposed
density estimator in hyperspectral spaces.
TABLE 12: Classification accuracy (%) using different density estimators for the hyperspectral
58-band urban area.

Class     | MLE   | Parzen Window | KNN (k=3) | Traditional SVM | MF-based SVM
Shadow    | 87    | 69.13         | 97.9      | 80.82           | 55.26
Water     | 96    | 91.5          | 90.4      | 91              | 93.88
Trees     | 69.6  | 86.12         | 63.8      | 65.47           | 81.2
Buildings | 51.3  | 39.76         | 46.7      | 48              | 48.18
Asphalt   | 52.9  | 70.8          | 84.9      | 81.42           | 78.82
Scrub     | 61    | 34.88         | 69.6      | 61.52           | 54.25
Sand      | 60.6  | 65.34         | 68.55     | 65.51           | 69.79
Wetland   | 82.38 | 2.29          | 90.85     | 89.47           | 80.1
Substrate | 71.1  | 52.14         | 80.57     | 93.8            | 85.18
Average   | 66.2  | 70.2          | 67.02     | 65.99           | 73
D. Applications in the Class Prior Probability Estimation
The Bayesian classification setup requires the estimation of the class prior probability
of each class defined in the image [2]. The Markov Random Field (MRF) is a natural
choice for implementing the hidden model for the segmented regions, since MRFs directly
incorporate spatial correlations into the segmentation process. The refinement of the
segmented image using MRF modeling of the regions can be considered a refinement
of the Class Prior Probability (CPP) [66]. The images (raw data) and segmented
regions are specified with a joint Markov model that combines an unconditional model of
interdependent region labels and a conditional model of independent image signals in each
region [67–69]. The initial segmented image is then iteratively refined by using the MRF
model. In principle, the present work follows this conventional scheme, but in contrast to
previous solutions [70], it focuses on the most accurate identification of this region model.
The intra- and inter-region label co-occurrences are specified by an MRF model over the
nearest neighbors of each pixel. Under the assumed symmetric relationships between the
neighboring labels, the model resembles the conventional auto-binomial ones [71].
The present work assumes that the potential function for each clique in the MRF
model is a Gaussian-shaped kernel, which leads to a formulation of the energy
function as a weighted sum of Gaussian kernels. The MF-based SVM, in a regression
perspective, is then used to estimate the parameters of this energy function, rather than
using (empirically) pre-defined values for these parameters [3]. The motivation behind this
formulation is to design a complete classification framework in which the developed MF-based
SVM algorithm is the main building block.
1. MRF Model
Definition 1: A clique C is a subset of S for which every pair of sites is a neighbor. Single
pixels are also considered cliques. The set of all cliques on a grid is called C.
Definition 2: A random field X is an MRF with respect to the neighborhood system η =
{η_s, s ∈ S} if and only if
• p(X = x) > 0 for all x ∈ Ω, where Ω is the set of all possible configurations on the
given grid;
• p(X_s = x_s | X_{S\s} = x_{S\s}) = p(X_s = x_s | X_{∂s} = x_{∂s}), where S\s refers to all N² sites
excluding site s, and ∂s refers to the neighborhood of site s.
Definition 3: X is a Gibbs random field (GRF) with respect to the neighborhood system
η = {η_s : s ∈ S} if and only if

    p(x) = (1/Z) exp(−E(x))                                    (91)
where Z is a normalizing constant called the partition function and E(x) is the energy
function of the form:

    E(x) = Σ_{c ∈ C} V_c(x)                                    (92)

where V_c is called the potential and is a function of the cliques around the site under
consideration. Only cliques of size 2 are involved in a pairwise interaction model. The
energy function for a pairwise interaction model can be written in the form [72]:

    V(x) = Σ_{t=1}^{N²} G(x_t) + Σ_{t=1}^{N²} Σ_{r=1}^{m} H(x_t, x_{t:+r})     (93)

where G is the potential function for single-pixel cliques and H is the potential function
for all cliques of size 2. The parameter m depends on the size of the neighborhood around
each site. For example, m is 2, 4, 6, 10, and 12 for neighborhoods of orders 1, 2, 3, 4, and 5,
respectively. Numbering and order coding of the neighborhood up to order five is shown in
Fig. 22. Also, Fig. 22(a) shows the location of site x_{t:+r} in the neighborhood system.
In this work, the following models are proposed for the G(·) and H(·) potential functions:

    G(x_t) = (w_0 / √(2π)) exp( −(1/2) ((μ_{w_0} − I(x_t)) / σ)² )            (94)

    H(x_t, x_{t:+r}) = (w_r / √(2π)) exp( −(1/2) ((μ_{w_r} − I(x_t, x_{t:+r})) / σ)² )     (95)

where I(a, b) is the indicator function: I(a, b) = 1 if a = b and 0 otherwise; the
single-argument indicator I(a) is always equal to 1.
The estimated mean values μ_{w_r} of the clique shapes of the second-order MRF (shown in
Fig. 23) are listed in Table 13.
TABLE 13: The estimated means for 2nd-order MRF cliques.

Parameter | μ_w0 | μ_w1 | μ_w2 | μ_w3 | μ_w4 | μ_w5  | μ_w6  | μ_w7  | μ_w8  | μ_w9
Value     | 1/21 | 3/21 | 5/21 | 7/21 | 9/21 | 11/21 | 13/21 | 15/21 | 17/21 | 21/21
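To illustrate how the energy (93) with the Gaussian-shaped potentials (94)–(95) can be evaluated, the following Python sketch computes E(x) for a small label image over a second-order neighborhood. The weights w and the value of σ are illustrative assumptions; the means μ are the first five entries of Table 13.

```python
import math

# Pairwise clique shapes for a second-order neighborhood (m = 4), given as
# (row, col) offsets; one offset per distinct pairwise clique shape.
OFFSETS = [(0, 1), (1, 0), (1, 1), (1, -1)]

def energy(labels, w, mu, sigma=0.4):
    """Energy (93) built from the Gaussian-shaped potentials (94)-(95).
    w[0], mu[0] belong to the single-pixel clique G; w[r], mu[r] (r >= 1)
    to the r-th pairwise clique shape H. `labels` is a 2D list of ints;
    the weights w and sigma used here are illustrative values."""
    rows, cols = len(labels), len(labels[0])
    E = 0.0
    for i in range(rows):
        for j in range(cols):
            # Single-pixel potential G: the indicator I(x_t) is always 1.
            E += w[0] / math.sqrt(2 * math.pi) * math.exp(
                -0.5 * ((mu[0] - 1.0) / sigma) ** 2)
            for r, (di, dj) in enumerate(OFFSETS, start=1):
                ni, nj = i + di, j + dj
                if 0 <= ni < rows and 0 <= nj < cols:
                    # Pairwise potential H: I(x_t, x_{t:+r}) = 1 iff labels match.
                    ind = 1.0 if labels[i][j] == labels[ni][nj] else 0.0
                    E += w[r] / math.sqrt(2 * math.pi) * math.exp(
                        -0.5 * ((mu[r] - ind) / sigma) ** 2)
    return E

# Means mu taken from the first five entries of Table 13; w is illustrative.
w = [1.0, 1.0, 1.0, 1.0, 1.0]
mu = [1 / 21, 3 / 21, 5 / 21, 7 / 21, 9 / 21]
```

With these μ values (all below 1/2), configurations whose neighboring labels disagree receive a higher energy and are therefore less probable under the Gibbs distribution (91), which is the smoothing behavior the region model relies on.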
2. MRF Parameters Estimation Using SVM
Comparing the form of the potential function of the MRF model in (93), after substituting
the assumed models (94) and (95), with that of the SVM regression output
in (18) shows that the SVM can be used to estimate the MRF parameters, provided that
the SVM regression algorithm uses a Gaussian radial-basis kernel. To estimate the weights
in the SVM regression representation, which correspond to the strengths of the cliques in
the MRF representation, the joint histogram of all clique shapes in the given image is
calculated. The MF-based SVM regression algorithm is then used to approximate (fit
a regression to) the joint histogram by estimating the weights in (18). The experimental
section includes an example showing how this estimation is done.
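The fitting step can be sketched with scikit-learn's ε-SVR as a stand-in for the MF-based SVM regression algorithm (which is not publicly packaged). With a Gaussian RBF kernel, the fitted dual coefficients play the role of the clique-strength weights; the clique centers below follow Table 13, the histogram values mirror the weights reported in Table 14, and the kernel width is an assumption chosen to match the variance scale (≈0.16) in Table 14.

```python
import numpy as np
from sklearn.svm import SVR

# Clique-shape centers (Table 13) and a joint histogram of clique-shape
# occurrences; the histogram values here mirror the Table 14 weights.
centers = np.array([(2 * k + 1) / 21 for k in range(9)] + [21 / 21])
hist = np.array([0.1098, 0.1102, 0.1102, 0.1107, 0.1777,
                 0.0559, 0.0894, 0.0559, 0.0894, 0.0906])

# RBF-kernel epsilon-SVR stands in for the MF-based SVM regression: the
# fitted model approximates the histogram as a weighted sum of Gaussian
# kernels, so the dual coefficients act as the clique-strength weights w_r.
svr = SVR(kernel="rbf", gamma=1.0 / (2 * 0.16), C=10.0, epsilon=0.005)
svr.fit(centers.reshape(-1, 1), hist)
weights = svr.dual_coef_.ravel()          # estimated clique strengths
approx = svr.predict(centers.reshape(-1, 1))
```

The quality of the fit can be judged by how closely `approx` tracks the histogram at the clique centers.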
3. Image Segmentation Algorithm
Typically in image segmentation, the segmented image obtained after an initial pixel-wise
classification is further refined by optimal statistical estimation of the MRF model of the
segmented regions. The likelihood of the MAP image segmentation algorithm has the form
(see [3]):

    Γ(X, Y) = (1/|S|) (log p(Y | X) + log p(X))                (96)

where S is the lattice representing the image, X is the segmented image (region map),
and Y is the observed image. The first term in (96) is the likelihood for the conditional
distribution of an observed image (the low-level process) given its segmented image (the
high-level process), i.e., the class conditional probability. The second term is the unconditional
distribution of the segmented image, which can be considered a variation of the class
prior probability. As stated before, the SVM density estimation algorithm is used to model
the low-level process. The high-level unconditional region map is modeled using a
simple MRF model.
To make the search for a local maximum of the log-likelihood of (96) computationally
feasible, a conventional iterative process of estimating/re-estimating the conditional
image model is used (i.e., given a current region map, update the map model given
the image). The process terminates when the current and previously estimated model
parameters coincide to within a given accuracy range [67, 68, 70]. Therefore, the whole
iterative segmentation process is summarized in the following algorithm.

Algorithm 2 Image Segmentation Algorithm Outline.

• Initialization: Find an initial map by classical pixel-wise Bayesian classification of
the given image, after an initial estimation of the low-level process Y using the MF-based
SVM density estimator.

• Iterative refinement: Refine the initial map by iterating the following steps:

1. Estimate the MRF parameters using the MF-based SVM algorithm.

2. Refine the segmented image using the ICM algorithm [66].

3. Calculate the log-likelihood from (96) and terminate if there is no significant
change in its value.

Because at each step the approximate log-likelihood is greater than or equal to its previous
value, the proposed algorithm converges to a locally optimal solution. The experimental
section presents the evolution of the log-likelihood values in (96) over the
iterations of the proposed segmentation algorithm.
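The estimate/refine loop of Algorithm 2 can be sketched as follows. The four callables are hypothetical stand-ins for the MF-based SVM classifier, the SVM-based MRF parameter estimator, the ICM refinement step, and the log-likelihood criterion (96); only the control flow is shown.

```python
def iterative_segmentation(image, classify, estimate_mrf, icm_refine,
                           log_likelihood, tol=1e-3, max_iter=20):
    """Skeleton of Algorithm 2. `classify`, `estimate_mrf`, `icm_refine`,
    and `log_likelihood` are hypothetical callables standing in for the
    MF-based SVM classifier, the SVM-based MRF parameter estimator, the
    ICM refinement step, and the criterion in (96)."""
    region_map = classify(image)              # initial pixel-wise Bayes map
    prev = float("-inf")
    for _ in range(max_iter):
        theta = estimate_mrf(region_map)      # step 1: MRF parameters
        region_map = icm_refine(image, region_map, theta)  # step 2: ICM
        cur = log_likelihood(image, region_map, theta)     # step 3
        if abs(cur - prev) < tol:             # terminate on convergence
            break
        prev = cur
    return region_map
```

Because the loop stops as soon as the log-likelihood stabilizes, the number of iterations observed in the experiments (around 6 to 8) corresponds to the point where the curve in the log-likelihood plots saturates.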
4. Experiment on MRF Model Parameters Estimation
To evaluate the proposed algorithm for estimating the parameters of the MRF model
using the MF-based SVM algorithm, a synthetic texture image is generated by the Metropolis
algorithm [71], as shown in Fig. (24-a). Figure (24-b) shows the joint histogram of the ten
clique shapes (shown in Fig. (23)) of the second-order neighborhood system. Figure (24-d)
shows the mixture of Gaussians distribution estimated using the SVM, which shows that
the SVM estimates appropriate values for the clique strengths. Table 14 shows
the estimated parameters for each component. Figure (24-c) shows the image regenerated
using the estimated parameters shown in Table 14.
TABLE 14: Estimated parameters for the mixture of Gaussians distribution.
Component Mean Weight Variance
1 1/21 0.1098 0.1592
2 3/21 0.1102 0.1592
3 5/21 0.1102 0.1592
4 7/21 0.1107 0.1592
5 9/21 0.1777 0.1592
6 11/21 0.0559 0.1592
7 13/21 0.0894 0.1592
8 15/21 0.0559 0.1592
9 17/21 0.0894 0.1592
10 21/21 0.0906 0.1592
5. Experiments Using Remote Sensing Data
The following experiments illustrate the effectiveness of the MRF modeling and
the proposed (iterative) segmentation setup in improving the segmentation results. The
two hyperspectral data sets are used to illustrate the performance. In the proposed
segmentation setup, as shown in Section V.D.3, an initial guess for the segmented image
is obtained by the classical Bayes classifier with the MF-based SVM density estimator.
Then, the MRF parameters are calculated from the segmented image using the MF-based
SVM algorithm, the MAP segmentation is applied, and the log-likelihood from (96) is
calculated. If there is a significant difference between consecutive values of the
log-likelihood, the MRF model parameters are recalculated and the segmentation procedure
is repeated. Otherwise, i.e., if there is no significant change in the log-likelihood values,
the segmentation process ends.
For the 34-band data set, Fig. (25) shows the evolution of the log-likelihood. It
can be noted that the log-likelihood converges and starts to saturate, without major changes,
after 8 iterations in this experiment. The final segmented image in Fig. (26) and the
confusion matrix in Table 15 illustrate the improvement effect of the CPP modeling on the
segmentation results. The average class accuracy rate increases to 83.75% (while it is 80%
without contextual modeling; see Table 10), and both the individual class accuracies and the
confidence in the points assigned to each class increase for most of the classes.
For the 58-band data set, Fig. (27) shows the evolution of the log-likelihood. It is clear that
the log-likelihood converges and starts to saturate without major changes after 6 iterations,
which means that a maximum estimate for the segmented image is obtained. The final
segmented image in Fig. (28) and the classification results in Table (16) illustrate the
improvement effect of MRF modeling on the segmentation results. The average class accuracy
rate increases to 83.38% (while it is 73% without contextual modeling; see Table 12), and
the individual class accuracies for most classes increase as well.
TABLE 15: Classification confusion matrix for the hyperspectral urban area after applying
the MRF modeling.

Class            | Total Points | Agricultural | Coniferous | Herbaceous | O. Impervious | Roads | Soil | Water | % True
Agricultural     | 5138         | 4177         | 209        | 508        | 142           | 10    | 90   | 2     | 81.3
Coniferous       | 15182        | 0            | 13270      | 1883       | 21            | 0     | 0    | 8     | 87.4
Herbaceous       | 7481         | 8            | 939        | 6124       | 196           | 155   | 23   | 36    | 82.86
Other Impervious | 925          | 1            | 64         | 152        | 663           | 41    | 3    | 1     | 71.68
Roads            | 627          | 0            | 12         | 51         | 33            | 529   | 2    | 0     | 84.37
Soil             | 6362         | 92           | 4          | 448        | 309           | 1     | 5450 | 58    | 85.66
Water            | 4285         | 3            | 419        | 265        | 204           | 0     | 107  | 3287  | 76.7
% +ve true       |              | 97.57        | 88.95      | 64.93      | 42.28         | 71.88 | 96.03| 96.9  |
E. Conclusion
This chapter presented several applications of the proposed statistical-learning-based
approaches for regression and density estimation. Remote sensing data sets in multispectral
and hyperspectral spaces were used in these applications. The experiments on density
estimation use the classification accuracy in Bayes setups as an indication of the performance
of the density estimation approach. The class prior probability in Bayes classification is
modeled using MRF models, where the MF-based SVM algorithm is used to estimate the
model parameters.
TABLE 16: Classification accuracy after applying MRF modeling for the 58-band hyper-
spectral data set.
Class % Accuracy
Shadow 55.26
Water 96.95
Trees 91
Buildings 55.73
Asphalt 82.18
Scrub 52
Sand 70.42
Wetland 88.33
Substrate 87.82
Average 83.38
FIGURE 21: A hyperspectral 58-band urban area: (a) RGB snapshot, and color-coded
classification results using (b) SVM and (c) MF-SVM as a density estimator.
FIGURE 22: Numbering and order coding of the neighborhood structure.

FIGURE 23: Clique shapes of the second-order MRF model (labeled γ0 through γ9).
FIGURE 24: A texture image for the MRF model parameters estimation experiment:
(a) original image generated by the Metropolis algorithm, (b) histogram of the MRF model
clique shapes of the original image (average number of occurrences of each clique shape),
(c) regenerated image using the parameters estimated by the MF-based SVM algorithm,
and (d) the estimated mixture of Gaussians fitted to the clique histogram.
FIGURE 25: Evolution of the log-likelihood in the hyperspectral 34-band example.
FIGURE 26: The final segmented image with the proposed segmentation setup for the
hyperspectral 34-band area.
FIGURE 27: Evolution of the log-likelihood for the 58-band data set.
FIGURE 28: The final segmented image for the 58-band data set.
CHAPTER VI
STATISTICAL LEARNING FOR CHANGE DETECTION
Change detection in images finds many applications in city planning, monitoring,
and security assessments. This chapter introduces statistical learning as a tool for detecting
changes in images. It proposes a new approach and evaluates this approach with remote
sensing data sets.
A. Problem Statement
The problem of change detection in images can be stated formally as follows. Given
an image I, find the changes that have occurred in that image with respect to a reference
image Ir. The current research focuses on the labeling, or class-assignment, problem; thus,
“changes” means non-matching labels. In turn, the change detection problem can be stated
as finding the pixel set (the change set CS) from I whose pixel labels differ from the labels
of their counterparts in Ir:

    CS = {(i, j) ∈ I : L(i, j) ≠ L(m, n), where (m, n) ∈ Ir corresponds to (i, j)}     (97)

where L(i, j) denotes the label of the pixel (i, j).
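Given co-registered label maps for I and Ir, the change set (97) reduces to a pixel-wise comparison of labels; a minimal sketch:

```python
import numpy as np

def change_set(labels, labels_ref):
    """Pixel coordinates where the label map of image I differs from the
    co-registered reference Ir, per (97)."""
    diff = labels != labels_ref
    return list(zip(*np.nonzero(diff)))

# Tiny illustrative label maps: only pixel (1, 1) changed class.
L = np.array([[0, 1], [2, 2]])
Lr = np.array([[0, 1], [2, 0]])
```

Everything that is hard about change detection lies in producing reliable label maps in the first place, which is what the rest of this chapter addresses.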
B. Literature Review
Automatic change detection is an active area of research in a number of important
applications, ranging from automatic video surveillance [73, 74] to video coding [75, 76],
tracking of moving objects [77, 78], and motion estimation [79, 80]. The increasing
interest in environmental protection and homeland security has led to the recognition of the
fundamental role played by change-detection techniques in monitoring the Earth’s sur-
face [73, 81–85]. Applications include, among others, damage assessment from natural
hazards (floods, forest fires, hurricanes) or large-scale accidents (e.g., oil spills), and also
keeping watch on unusual and suspicious activities in or around strategic infrastructural
elements (dams, waterways, reservoirs, power stations, nuclear and chemical enterprises,
camping in unusual places, improvised landing strips, etc.). Most of the well-known supervised
and unsupervised methods for detecting changes in remotely sensed images [81, 83–87]
sequentially perform image preprocessing, image comparison to obtain a “difference”
image, and analysis of the difference image. Preprocessing: In the unsupervised change
detection case, the two images are made comparable in both the spatial and spectral domains.
The most critical step is to co-register the images with a sub-pixel accuracy so that corre-
sponding pixels within the images relate to the same ground area. Inaccurate co-registration
makes change detection unreliable [82], so special techniques are used to reduce the
registration errors [81, 88–92]. Regarding the spectral domain, potential error sources, such
as different illumination and atmospheric conditions at the two acquisition times, should be
accounted for to obtain accurate results [93–95]. Depending on the application and
the available data, the problem is typically solved by absolute or relative radiometric image
calibration. One of the radiometric calibration algorithms proposed in [96–98] transforms
gray values in each image into the ground reflectance values, while the other algorithm
modifies histograms of the gray values to make the same gray values in the two images rep-
resent the same, but unknown reflectance [84, 93, 98]. Generally, illumination conditions in
remote sensing applications vary smoothly over each image. Therefore, in many cases an
original scene can be readily divided into Areas of Interest (AOIs) with assumed constant il-
lumination conditions. Separate analysis of each AOI suppresses main effects of varying il-
lumination on the change detection process [79]. Image comparison: The co-registered and
radiometrically corrected images (or linear/nonlinear combinations of their spectral signa-
tures [84]) are compared, pixel by pixel, in order to generate a “difference image” such that
the land-cover changes differ considerably in gray levels from the unchanged areas [84].
For example, the univariate image differencing (UID) [83, 84] performs pixel-wise sub-
traction of a single spectral band from the two images. The choice of the band depends
on the specific type of changes to be detected. The widely used Change Vector Analy-
sis (CVA) [94] forms differences for several spectral bands in each image (i.e., spectral
change vectors), and the difference image contains magnitudes of these vectors. Analysis
of the difference image: Land-cover changes are usually detected by thresholding the
signal histogram of the difference image. The threshold selection is of major importance for
accurate change detection. Although some automatic choices of thresholds have been
proposed [99], remote sensing applications generally use non-automatic heuristic trial-and-
error strategies [81, 84, 100]. The classical choice of the threshold is based on a reasonable,
but not always verified assumption that only few changes have occurred between the two
observation dates. The changes are then represented by outliers of the marginal probability
distribution for the difference signals that mainly describe the unchanged pixels. Under this
assumption, a single-hypothesis-testing-based decision strategy [101] labels the signals that
are significantly different from the mean value as changes. The decision threshold is fixed
at tσ from the mean difference value, where σ is the standard deviation of the signal in the
difference image and t is set by a trial-and-error procedure. The effect of the value of t on the
change detection accuracy is experimentally evaluated in [102]. Two Bayesian techniques
for automatic selection of the decision threshold in [103] minimize the total detection error,
assuming spatially independent pixels or a Markov Random Field (MRF) of the pixels
in the difference image, respectively. The MRF model uses the pixel spatial dependency
to improve the change detection. This approach has been extended in [104] using a semi-
parametric reduced Parzen model of probability distributions associated with changed and
unchanged pixels. In [105], the observed multi-temporal images are modeled by MRFs in
order to search for optimal changes under the maximum a posteriori (MAP) decision
criterion using simulated-annealing-based energy minimization. Bernstein [106] studied
change detection in relation to homeland security applications, using archived Landsat-5
images of the Portsmouth, Ohio Gaseous Diffusion Plant (OGDP) to determine the
capabilities and limitations of long-wavelength IR imagery in monitoring large nuclear
enrichment plants. This type of imagery was helpful in detecting large-scale changes in the
OGDP's operational status (e.g., the shut-down of a single process building could be detected
by comparing the rooftop temperatures of the neighboring operational process buildings).
C. Proposed Change Detection Approach
The proposed approach for change detection starts by creating probabilistic shape
models from the classes defined in the reference image Ir. The shape modeling is done
with a new algorithm which uses distance-based shape descriptors and probability density
estimation with the proposed MF-based SVM approach. A Bayesian statistical analysis
approach uses these shape models in a MAP classification setup to detect pixels in I with
labels different from their counterparts in Ir. The proposed approach differs from many
other familiar ones in that the changes are derived from classification maps for the reference
image, rather than from the image I itself. This allows for using not only the pixel-wise
signatures, but also the prior knowledge of the shapes of the objects being monitored. Also,
the proposed approach differs from other algorithms in that it detects changes within the
classification step itself, rather than carrying out two consecutive steps: classification, then
change detection. The main components of the proposed approach are statistical shape
modeling and statistical analysis. The details of these components are presented in the
following sections.
D. Statistical Shape Modeling
Shape representation is the main task in the analysis of shapes. The selection of
such a representation is very important in several computer vision and medical applications,
such as registration and segmentation. Several approaches to shape representation are
described in [107, 108]. Although some of these approaches are powerful enough to capture
local deformations, they require a large number of parameters to deal with important shape
details, and problems arise when the topology of the shapes changes. In order to obtain
a shape model that realistically describes an object, a statistical shape representation
approach is proposed in this work, outlined in the following algorithm. The approach
assumes that there are multiple data sets (e.g., images) which describe the same shape.
Like most shape modeling approaches, the proposed approach starts by aligning
the different data sets together using a registration algorithm (one such algorithm can
be found in [109]). The edges of each of the shape regions, Vi, in the different data sets
are determined, and the average 2D edge, Vm, for each region is calculated (see [108]).

The contribution of the proposed approach is to introduce the signed distance
concept [110] in constructing a probabilistic map, the signed distance map (SD-Map), for a
data set that contains the object shape. The SD-Map is a representation of the relative
positions of the different points in the data set with respect to the shape points: the points
that belong to the shape boundary in the reference data set (the data set that contains the
reference shape).
In the case of images, the SD-Map is an image where the absolute value at a certain
pixel is the shortest Euclidean distance between the spatial position of that pixel and the
average 2D edges, Vm, of the object shape. By convention, the sign of a pixel is determined
by whether that pixel lies inside (positive) or outside (negative) the boundary of one
of the shape regions. This representation of the signed distance map enables capturing
the object shape with two interesting features: (1) the sign at a pixel determines whether
Algorithm 3 Outline of the Statistical Shape Modeling Approach.

• Align the collected data sets together using a registration approach.

• Calculate the 2D edge, V_i, that describes the boundary of a region from the object
shape in data set i, i = 1, ..., N, of the N data sets in the aligned database for that
shape.

• Calculate the average 2D edge V_m for each region, i.e., (1/N) Σ_{i=1}^{N} V_i.

• Given a 2D shape boundary V (which is a collection of the average shape regions
V_m, i.e., V = ∪_{m=1}^{M} V_m, where the shape is constructed from M objects), the
function S(i, j) which describes the distribution of the signed distance map inside and
outside the shape V is defined as follows:

    S(i, j) =  0               if (i, j) ∈ V
               d((i, j), V)    if (i, j) ∈ R_V                 (98)
              −d((i, j), V)    otherwise

where R_V is the space of the points which lie inside the region and d((i, j), V) is
the minimum Euclidean distance between the data set location (i, j) and the curve V.

Note: The proposed approach assumes that the shape is represented in a 2D Euclidean
space.
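A discrete SD-Map in the spirit of (98) can be sketched with SciPy's Euclidean distance transform; note that, on a pixel grid, the handling of the boundary is approximate (pixels just inside the boundary receive a distance of 1 rather than 0).

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def signed_distance_map(inside):
    """Discrete SD-Map in the spirit of (98): positive distances inside
    the shape, negative outside. `inside` is a boolean mask of the shape
    interior; the zero level of (98) is only approximated on a pixel grid."""
    inside = np.asarray(inside, dtype=bool)
    d_in = distance_transform_edt(inside)    # distance to nearest outside pixel
    d_out = distance_transform_edt(~inside)  # distance to nearest inside pixel
    return np.where(inside, d_in, -d_out)
```

Because the mask may contain several disconnected components, the same call covers object shapes made of multiple disjoint regions, which is one of the features the approach emphasizes.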
that pixel lies inside or outside the object shape, and (2) the absolute value at a pixel, which
varies depending on the relative position of the pixel with respect to the shape boundary,
provides a probabilistic representation of the object shape (see Fig. 30 for a quick illustration
of SD-Maps).

Either the shape internal points (points which lie inside the shape) or the shape
external points (points which lie outside the shape) are enough to construct a model for
that shape. In the proposed approach, the shape internal points (the positive points in the
SD-Map) are used to model the shape. As stated above, these points have a probabilistic
distribution, which calls for probabilistic shape modeling; this is another contribution of
the proposed approach. The MF-based SVM density estimator has proven to be an
accurate density estimation algorithm, and thus it is used to model the shape of each class
in an image.

One of the most powerful features of the proposed approach is that it allows modeling
objects (classes) that consist of multiple regions in the image, i.e., the object shape can have
multiple disjoint regions. Also, the shape representation in this approach is invariant to
translation and rotation. Further, to make the representation invariant to scaling, the
following registration approach is used.
E. Change Detection Algorithm
The proposed algorithm relies on Bayesian theory for the analysis of the densities
of the shape and sensor readings, which are estimated using the MF-based SVM density
estimator. Incorporating the shape information with the sensor data provides strong
evidence for a change condition. If the combined shape and sensor information does
not provide evidence for a change, the approach suspects the shape information and
considers only the sensor information. This is because the shape information sometimes
becomes strong enough to hide the sensor information (because of the relative closeness
of the pixel location to the class shape boundaries). In such a case, the algorithm favors
the stability over time of the sensor data. The steps of the algorithm are as follows:
F. Discussion of Some Change Detection Methods
This section presents brief discussion of two state-of-the art methods that used in
102
Algorithm 4 Outlines of the Change Detection Algorithm.
• Generating shape information: from the reference imageIr, generate the signed dis-
tance map for each class.
• Statistical modeling for the Shape Information: Use the proposed MF-based SVM
algorithm to model the shape information of each class.
• Pixel classification using both shape information and sensor data: for a pixelp:
1. Use the signed-distanced between the pixel location and each of the class aver-
age shape to getps(d | m) for m = 1 · · ·M , whereM is the number of defined
classes.
2. Use the sensor readingy to get the class conditional probabilityp(y | m).
3. Get a primary labeling of the pixel as:
m∗(p) = arg maxm∈M
ps(d | m) p(y | m)
• Change Detection at the pixelp: report a change atp if
1. If m∗(p) 6= mr(p); wheremr(p) is the class of pixelp in the reference image.
2. If m∗(p) = mr(p) still there is a chance for a change according to the following
steps:
– Get the primary class of the pixel using only sensor reading as:m∗(p) =
arg maxm∈M p(y | m).
– There will be a change ifm∗(p) 6= mr(p) and| p(y | m)−pr(y | m) |> T
whereT is a threshold.
the literature for change detection. These methods are used in the experimental work for
comparison.
1. Change Detection using Automatic Analysis of the Difference Image and EM Algo-
rithm (DIEM)
This approach [103, 104] is based on the formulation of the unsupervised change-detection
problem in terms of Bayesian decision theory. In this context, an adaptive
technique for the estimation of the statistical terms associated with the gray levels
of changed/unchanged pixels in a difference image is considered. This approach belongs to
the widely used class of unsupervised techniques that perform change detection through a
direct comparison of the original raw images acquired over the same area at two different
times. The change-detection process performed by such unsupervised techniques is usually
divided into three main sequential steps: 1) preprocessing, 2) image comparison, and
3) analysis of the difference image. These steps are briefly detailed in the following.
• Preprocessing: Unsupervised change-detection algorithms usually take two digi-
tized images as input and return the locations where differences between the two
images can be identified. To accomplish such a task, a preprocessing step is neces-
sary aimed at rendering the two images comparable in both the spatial and spectral
domains. Concerning the spatial domain, the two images should be co-registered so
that pixels with the same coordinates in the images may be associated with the same
area on the ground. This is a very critical step, which, if inaccurately performed, may
render change-detection results unreliable [82].
With regard to the spectral domain, changes in illumination and atmospheric con-
ditions between the two acquisition times may be a potential source of errors and
should be taken into account in order to obtain accurate results [93, 94].
• Image Comparison: The two registered and corrected images (or a linear or non-
linear combination of the spectral bands of such images [84]) are compared, pixel
by pixel, in order to generate a further image (“difference image”). The difference
image is computed in such a way that pixels associated with land-cover changes
present gray level values significantly different from those of pixels associated with
unchanged areas. For example, the widely used Change Vector Analysis (CVA) tech-
nique is used to generate the difference image in remote sensing images. In this case,
several spectral channels are considered at each date (i.e., each pixel of the image
considered is represented by a vector whose components are the gray level values
associated with that pixel in the different spectral channels selected). Then, for each
pair of corresponding pixels, the so-called “spectral change vector” is computed as
the difference in the feature vectors at the two times. At this point, the pixels in the
difference image are associated with the magnitudes of the spectral change vectors;
it follows that unchanged pixels present small gray-level values, whereas changed
pixels present rather large values.
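The CVA computation described above amounts to taking, at every pixel, the Euclidean norm of the difference between the two spectral vectors. A minimal sketch:

```python
import numpy as np

def cva_magnitude(img_t1, img_t2):
    """Change Vector Analysis difference image.

    img_t1, img_t2: co-registered images of shape (rows, cols, bands).
    Returns an array (rows, cols) holding, for each pixel, the magnitude
    of its spectral change vector; unchanged pixels get small values,
    changed pixels rather large ones.
    """
    delta = img_t2.astype(float) - img_t1.astype(float)  # spectral change vectors
    return np.sqrt((delta ** 2).sum(axis=-1))            # Euclidean magnitude
```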
• Analysis of the Difference Image: Land-cover changes can be detected by applying
a decision threshold to the histogram of the difference image. For instance, when the
CVA technique is used (i.e., each pixel in the difference image is associated with the
magnitude of the difference between the corresponding feature vectors in the original
images), changed pixels can be identified on the right side of the histogram as they
are associated with large gray-levelvalues.The selection of the decision threshold
is of major importance as the accuracy of the final change-detection map strongly
depends on this choice.
The approach in [104] is based on the assumption that the histogram of the difference image can be modeled as a mixture density composed of the distributions of two classes, associated with changed and unchanged pixels, respectively. In this context, the approach considered here uses the EM algorithm to estimate the conditional density functions of these classes. The parameters estimated by EM are then used for pixelwise classification of the difference image into the change/no-change classes.
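A sketch of this idea, with a two-component Gaussian mixture fitted to the difference-image values by EM and a pixelwise Bayes decision; the exact estimator of [104] may differ in detail:

```python
import numpy as np

def em_two_gaussians(x, n_iter=100):
    """Fit a 2-component Gaussian mixture to the 1-D samples `x` with EM.

    Returns (weights, means, stds); component 0 is initialized at the low
    (unchanged) end of the data, component 1 at the high (changed) end.
    """
    x = np.asarray(x, dtype=float)
    mu = np.array([x.min(), x.max()])
    sd = np.array([x.std(), x.std()]) + 1e-9
    w = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibility of each component for each sample
        pdf = w / (sd * np.sqrt(2 * np.pi)) * np.exp(
            -0.5 * ((x[:, None] - mu) / sd) ** 2)
        r = pdf / pdf.sum(axis=1, keepdims=True)
        # M-step: update weights, means, and standard deviations
        n = r.sum(axis=0)
        w = n / len(x)
        mu = (r * x[:, None]).sum(axis=0) / n
        sd = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / n) + 1e-9
    return w, mu, sd

def classify_changes(diff_img, w, mu, sd):
    """Pixelwise Bayes decision: True where the 'changed' class is more probable."""
    p = w / (sd * np.sqrt(2 * np.pi)) * np.exp(
        -0.5 * ((diff_img[..., None] - mu) / sd) ** 2)
    return p[..., 1] > p[..., 0]
```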
2. Change Detection using MRF Modeling (DIMRF)
The above described technique for change detection considers only the information contained within a pixel, even though the intensity levels of neighboring pixels in images are known to have significant correlation. Also, changes are more likely to occur in connected regions rather than at disjoint points. Using these facts, a more reliable change detection algorithm can be developed. To accomplish this, a Markov random field (MRF) model for images is employed in the presented approach [103–105] so that the statistical correlation of intensity levels among neighboring pixels can be exploited.

For estimating the MRF parameters, the method in [104] uses a heuristic, which is not reliable, while the method in [105] uses Simulated Annealing, which is known to be slow. This dissertation proposes instead a numerical approach for estimating the MRF parameters (see Chapter 3).
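To make the role of the MRF concrete, here is a sketch of the standard ICM relaxation under an Ising-type neighborhood prior. The smoothing weight `beta` is hand-picked in this sketch, whereas the dissertation estimates the MRF parameters numerically (Chapter 3):

```python
import numpy as np

def icm_change_map(prob_change, beta=1.5, n_iter=5):
    """Refine a pixelwise change map with an Ising-type spatial prior
    using Iterated Conditional Modes (synchronous updates).

    prob_change: per-pixel probability of the 'change' class from a
    pixelwise classifier. beta: neighbor-agreement weight (hand-picked
    here for illustration).
    """
    eps = 1e-9
    # log-likelihood (data) terms for the two labels
    log_p0 = np.log(1.0 - prob_change + eps)
    log_p1 = np.log(prob_change + eps)
    labels = (prob_change > 0.5).astype(int)
    for _ in range(n_iter):
        padded = np.pad(labels, 1, mode='edge')
        # number of 4-neighbors currently labeled 'change'
        n1 = (padded[:-2, 1:-1] + padded[2:, 1:-1] +
              padded[1:-1, :-2] + padded[1:-1, 2:])
        # posterior score of each label: data term + prior agreement term
        score0 = log_p0 + beta * (4 - n1)
        score1 = log_p1 + beta * n1
        labels = (score1 > score0).astype(int)
    return labels
```

The prior term rewards agreement with neighbors, so isolated false alarms are smoothed away while connected changed regions survive.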
G. Experimental Work
This section presents experimental work on the proposed approach for change detection. The experiments are carried out on remote sensing data. They first illustrate the performance of the proposed statistical shape modeling approach and present an application of it to remote sensing imagery segmentation. The change detection approach is evaluated next.
1. Experiments on the proposed shape modeling approach
Figure (29-a) shows an RGB shot of the LANDSAT 7 remote sensing data set which is used in the experiments. This data set was collected during November 2002 over Cairo, Egypt. It contains 15-meter panchromatic data, 30-meter 6-band multispectral data, and 60-meter 2-band thermal data. The size of the scene used is 250x250 (in the 6-band multispectral case), cropped from the 7771x8664 data set. There are 7 classes defined in this data set: Water, Open, Deciduous, Low Density Residential, High Density Residential, Urban, and Transportation. The 30-meter ground cover is available (see Fig (29-b)), so the multispectral data is used as the reference set. Classification of this data set is challenging because of the irregularity and the scattered nature of most of the classes in the scene (see, for example, the Deciduous, Low Density Residential, and Urban classes in the reference classified image in Fig. (29-b)).
2. Experiments on Statistical Shape Modeling Using the MF-based SVM
The MF-based SVM density estimator is used in this section for statistical modeling of the probabilistic distribution of the class shapes in the classified reference image as prior information. The approach outlined in Algorithm-3 is used to collect the data points of interest for modeling the shape of each class defined in the image. A subset of this training sample (50 points in the current implementation) is used to train the MF-based SVM density estimator algorithm.

Figure (30) shows samples of the estimated pdf's of the class shapes and illustrates how accurately the MF-based SVM algorithm manages to capture a shape pdf. The figure shows the empirical density (histogram) of the class shape and the estimated density using the MF-based SVM algorithm.
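The quantity being modeled is the signed distance of pixels to a class region's boundary. As an illustration (not the dissertation's Algorithm-3 itself), the following sketch computes a brute-force signed distance map and uses a plain Parzen-window estimate as a generic stand-in for the MF-based SVM density estimator:

```python
import numpy as np

def signed_distance(mask):
    """Brute-force signed Euclidean distance map: positive inside the
    class region, negative outside. O(n^2) in the number of pixels,
    so it is for small illustrative masks only; `mask` must contain
    both True and False pixels.
    """
    mask = np.asarray(mask, bool)
    ys, xs = np.nonzero(mask)     # region pixels
    yb, xb = np.nonzero(~mask)    # background pixels
    h, w = mask.shape
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            if mask[i, j]:  # distance to the nearest background pixel
                out[i, j] = np.sqrt(((yb - i) ** 2 + (xb - j) ** 2).min())
            else:           # negative distance to the nearest region pixel
                out[i, j] = -np.sqrt(((ys - i) ** 2 + (xs - j) ** 2).min())
    return out

def gaussian_kde_pdf(train, query, h=0.3):
    """Parzen-window density estimate over 1-D signed-distance samples;
    a generic stand-in for the MF-based SVM density estimator."""
    train = np.asarray(train, float)
    q = np.asarray(query, float)[:, None]
    k = np.exp(-0.5 * ((q - train) / h) ** 2) / (h * np.sqrt(2 * np.pi))
    return k.mean(axis=1)
```

The estimated pdf of the signed-distance values then plays the role of the shape model density in the figure above.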
3. Experiments on the Segmentation Algorithm
To illustrate the performance of the proposed statistical shape modeling in a real application, a classification algorithm [111] is used for multispectral data segmentation.
a. Image Segmentation Algorithm
This section summarizes the steps that are used to incorporate the proposed shape modeling approach in image segmentation. The procedure used here is basically a Bayes classification rule based on shape priors. Let x_i represent a class label in the image, where i ∈ [1, K] and K is the total number of classes (objects) in the image. The goal of a pixel-wise image segmentation algorithm is to determine the class to which a feature vector y belongs, i.e., the class index of the pixel which has the feature vector y. Bayes rule formulates this goal as (see [3]):
y ∈ x_i if p(y | x_i) p(x_i) > p(y | x_j) p(x_j), ∀ i ≠ j; i, j ∈ [1, K]    (99)
The term p(y | x_i) is the conditional probability of the pixel value y given that its class label is x_i (the class conditional probability). The second term p(x_i) represents the class prior probability. Since the class prior probability represents the prior belief that the pixel belongs to class x_i, the proposed statistical shape modeling is used to substitute for this belief. Thus the modified Bayes rule becomes:

y ∈ x_i if p(y | x_i) p(S | x_i) > p(y | x_j) p(S | x_j), ∀ i ≠ j; i, j ∈ [1, K].    (100)
where S is the signed distance at the pixel which has the feature vector y (see Algorithm-3). The MF-based SVM density estimator is used to implement the class conditional probability in that rule, since this estimator has proven to be of special interest in high dimensional density estimation problems [112]. It is also used to implement the statistical shape density function term. Therefore, the whole segmentation process is summarized as shown in Algorithm-5:
b. Results
Since the ground truth of the classified image is available, a subset from each class is used
to train the MF-based SVM density estimation algorithm and the rest is used for evaluation.
The RGB images in Fig (31) show the segmentation results.
FIGURE 29 – The RGB and the reference classified image of Cairo data set.
FIGURE 30 – Samples of shape modeling for two classes of the Cairo data set: class points, signed distance map, and shape model density function for (a) the Water class and (b) the Transportation class. Each panel plots the empirical (histogram) and estimated densities against the signed distance.
Algorithm 5 – Image Segmentation Algorithm Outline.
Pre-Segmentation Step:
Register the input and the reference images.
• Training:
For each defined class in the image:
– train the MF-based SVM density estimator using the prior information about
that class to get its statistical shape model.
– train the MF-based SVM density estimator using the class data to get its class
conditional probability model.
• Segmentation:
For each pixel in the image:
– calculate the class conditional probabilities and the corresponding shape prob-
abilities.
– classify the pixel using the above modified Bayes’ rule, (100).
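For concreteness, the decision rule (100) at a single pixel can be sketched as follows, with simple 1-D Gaussians standing in for the MF-based SVM density estimates; the class and shape parameters below are illustrative assumptions, not values from the experiments:

```python
import numpy as np

def gauss(x, mu, sd):
    """1-D Gaussian density."""
    return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

def classify_pixel(y, s, class_models, shape_models):
    """Modified Bayes rule (100): pick the class i maximizing
    p(y | x_i) * p(S | x_i).

    y: pixel feature value; s: signed distance at the pixel.
    class_models / shape_models: per-class (mu, sd) pairs, a Gaussian
    stand-in for the MF-based SVM density estimates.
    """
    scores = [gauss(y, *cm) * gauss(s, *sm)
              for cm, sm in zip(class_models, shape_models)]
    return int(np.argmax(scores))
```

Note that a strong shape prior can override a spectrally ambiguous pixel value, which is exactly the benefit the shape term brings to the rule.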
Table 17 shows the classification confusion matrix of the data set using a classical Bayes classifier with the MF-based SVM density estimator. The class priors are taken as the shares of the class points in the data set. The results illustrate how challenging the data set is. The average class classification accuracy is 54.26%. The Water class has the highest class classification accuracy (76.52%) while the Transportation class has the lowest (14.08%). The trust in the points assigned to a specific class (class reliability) is also small: the Low Density Residential class has the lowest reliability rate, 9.48%, while the Water class has the highest, 84%.
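These figures follow directly from the confusion matrix in Table 17: per-class accuracy is the diagonal entry over its row sum, class reliability is the diagonal entry over its column sum, and the 54.26% average is the overall (pooled) accuracy:

```python
import numpy as np

# Confusion matrix of Table 17 (rows: true class, columns: assigned class)
cm = np.array([
    [ 4295,   113,  927,   48,   53,  158,   19],  # Water
    [  342, 15225, 1473, 3807, 1275, 2736,  455],  # Open
    [  288,   677,  684,  235,  120,  192,   56],  # Deciduous
    [   12,   554,   99,  614,  367,  335,  160],  # L. Density
    [  140,   472,  232, 1725, 9726, 5313, 1332],  # H. Density
    [   29,    77,  144,  319, 1590, 2992,  420],  # Urban
    [    4,   275,   76,  341,  790,  808,  376],  # Transportation
])

accuracy = 100 * cm.diagonal() / cm.sum(axis=1)     # per-class accuracy (%)
reliability = 100 * cm.diagonal() / cm.sum(axis=0)  # class reliability (%)
overall = 100 * cm.diagonal().sum() / cm.sum()      # pooled accuracy (%)

print(round(accuracy[0], 2))  # Water accuracy: 76.52
print(round(overall, 2))      # 54.26
```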
(a) Classified without shape constraints (b) Classified with shape constraints
FIGURE 31 – Classification results of Cairo data set.
TABLE 17
CLASSIFICATION CONFUSION MATRIX FOR THE MULTISPECTRAL DATA SET WITHOUT USING SHAPE MODELING

Class        Total    Water   Open   Decid.  L.Den.  H.Den.  Urban  Transp.  % True
             Points
Water          5613    4295    113     927      48      53    158       19    76.52
Open          25313     342  15225    1473    3807    1275   2736      455    60.15
Deciduous      2252     288    677     684     235     120    192       56    30.37
L. Density     2141      12    554      99     614     367    335      160    28.68
H. Density    18940     140    472     232    1725    9726   5313     1332    51.35
Urban          5571      29     77     144     319    1590   2992      420    53.71
Transport.     2670       4    275      76     341     790    808      376    14.08
% +ve Rate              84    87.5   23.17    9.48   72.49  69.86    13.34    54.26
Table 18 shows the classification confusion matrix using the Bayes classifier with the statistical shape modeling applied. The results illustrate how much improvement can be achieved using the shape modeling. The average class classification accuracy
TABLE 18
CLASSIFICATION CONFUSION MATRIX FOR THE MULTISPECTRAL DATA SET USING SHAPE MODELING

Class        Total    Water   Open   Decid.  L.Den.  H.Den.  Urban  Transp.  % True
             Points
Water          5613    4986    142     330      27      54     68        6    88.83
Open          25313     239  22345     739    1059     592    181      158    88.27
Deciduous      2252      59     83    1990      37      38     36        9    88.37
L. Density     2141       4     59       7    2009      21     29       12    93.83
H. Density    18940     115    400      79     123   17418    359      446    91.96
Urban          5571      22     46      30      42     219   5145       67    92.35
Transport.     2670       5     89       9      56       3     57     2451    91.80
% +ve Rate            91.82  96.46    62.5   59.91   94.94  87.57    77.83    90.15
increases to 90.15%. The Low Density Residential class has the highest class classification accuracy (93.83%) while the Open class has the lowest (88.27%). The trust in the points assigned to a specific class (class reliability) increases too: the Open class has the highest reliability rate, 96.46%, while the Transportation class has the lowest, 77.83%. Figure (31) illustrates the improvement obtained by applying the shape constraint: the segmented image in Fig (31-b) is very close to the reference image in Fig. (29-b).
4. Experiments on Different Resolutions Data Sets
To assess the overall segmentation setup and illustrate the effect of the registration step, the panchromatic 15-meter resolution data and the thermal 60-meter resolution data sets are used. The images are first co-registered to the multispectral (6-band) data set using the proposed MI algorithm; then the registered data is used for classification. Shots of the results are shown in Fig. (32) and Fig. (33), while Tables 19 and 20 show results of applying
(a) (b) (c) (d)
FIGURE 32: Results for the 15-meter resolution data set: (a) Original, (b) Registration
results, (c) Classification results, and (d) Classification results after inverse transformation.
TABLE 19: Comparison of classification accuracies for the 15-meter resolution data set
using different algorithms.
Class            MLE    KNN    MF-SVM  Shape-based   (% Accuracy)
Water            86.2   94.8    72.6    84.22
Open             80.2   16.8    18.4    99.4
Deciduous         0      0      48.6    59.77
L. Density        0      0      17.1    54.32
H. Density       34.6    0      10.8    91.71
Urban             0      0      18.4    96.07
Transportation    0      0      17.5    49.25
% Average        51     15.4    22      90.29
the proposed algorithm in comparison with other algorithms. The results illustrate that the traditional algorithms fail on these data sets, while the proposed algorithm performs very well. The average classification accuracy is about 90% for the panchromatic data and 93% for the thermal data.
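The registration step above maximizes the mutual information between the images being aligned. A histogram-based sketch of the MI similarity measure follows; the full search over transformations performed by the proposed MI algorithm is not shown:

```python
import numpy as np

def mutual_information(a, b, bins=32):
    """Mutual information I(A;B) of two images, estimated from their
    joint gray-level histogram. This is the similarity measure that a
    registration algorithm maximizes over candidate transformations."""
    hist, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    p = hist / hist.sum()              # joint probability
    px = p.sum(axis=1, keepdims=True)  # marginal of A
    py = p.sum(axis=0, keepdims=True)  # marginal of B
    nz = p > 0                         # avoid log(0)
    return (p[nz] * np.log(p[nz] / (px @ py)[nz])).sum()
```

Well-aligned images yield high MI; misaligned or scrambled images yield values near zero, which is why MI works even across sensors with different gray-level characteristics.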
(a) (b) (c) (d)
FIGURE 33: Results for the 60-meter resolution data set: (a) Original, (b) Registration
results, (c) Classification results, and (d) Classification results after inverse transformation.
TABLE 20: Comparison of classification accuracies for the 60-meter resolution data set
using different algorithms.
Class            MLE    KNN     MF-SVM  Shape-based   (% Accuracy)
Water            84.2   94.8     29.2    96.04
Open             58.1    4.62    29.4    97.73
Deciduous         0      5.64    23.8    62.61
L. Density        0     17.8     25.9    57.08
H. Density       54.2   18.5     17.8    99.91
Urban             0      4.13    19.2    87.45
Transport.        0      3.87    20.8    59.40
% Average        48     17.3     24.5    93.03
5. Experiments on the Change Detection Algorithm
The following experiments discuss the performance of the proposed statistical learning based change detection algorithm. Two data sets are used in the experiments: the first is the multispectral Cairo data set presented before; the second is a multispectral data set for a dam in downtown Louisville, Kentucky, USA. The algorithm's performance is compared with the approaches presented above: analysis of the difference image with the EM algorithm (DIEM), and MRF modeling (DIMRF).
a. Cairo Data Set
The description of this data set was presented above. Since samples of this data set at different instants of time are not available to us, we simulate some changes in the data set. This is done by assuming that an Urban area has grown in the Open area and that a new Transportation facility has been established. The growth of the new areas is simulated by randomly sampling pixels from the reference points of the Urban and Transportation classes in the areas where the changes occur. Fig (34-a) shows the reference image with the superimposed changes. A reference changes-map is shown in Fig (34-b), where 0 represents a change.
Figure (35) shows the results of applying the different change detection approaches. The difference-map used by the approach which depends on the analysis of the difference-map is shown in Fig (35-a). It is easily noted that even the unchanged pixels have some small values in the difference-map. The changes detected using the analysis of the difference-map with the EM algorithm are shown in Fig (35-b). It can be noted that some pixels are marked as changes although they are not, because of the pixelwise nature of the algorithm. The effect of applying MRF modeling to the difference-map is illustrated in Fig (35-c), where some enhancement can be seen, especially for the isolated pixels, but some deformations of the changed areas are induced. The results of applying the proposed shape-based change detection algorithm are shown in Fig (35-d), which illustrates that the algorithm successfully detects the changes that happened in the data set. The detection rates of the different approaches are shown in Table 21, which illustrates that the proposed approach outperforms the other algorithms with its detection rate of 85%.
(a) (b)
FIGURE 34: Cairo data set for the change detection evaluation: (a) Reference with
changes, (b) Reference changes-map
TABLE 21: Detection rates of the different change detection approaches.
Algorithm DIEM DIMRF Shape-based
Detection Rate 80% 82% 85%
b. Louisville Data
These are two multispectral Landsat data sets of downtown Louisville, Kentucky, USA. One of the data sets was collected in Summer 1992 and the other in 2001. A scene of size 200x164 pixels is used in the experiments. The reference land covers for both instances are available from the internet website (http://gisdata.usgs.net/website/kentucky/viewer.php). Since the number of classes defined in this scene is large and the data collection spans a decade, there is a large amount of change in this scene. So, for simplicity of the illustration, we consider only the changes in a few classes which surround the McAlpine dam.
Figure (36) illustrates the different images related to the scene. The RGB images are shown in (a) and (b). The reference classifications of the scene considered in this experiment are shown in (c) and (d), and the changes-map is shown in (e). As the legend illustrates, only changes in the Deciduous and Wetland classes are considered for detection in this experiment. The same steps used for the Cairo data set (Algorithm-4) are applied to this data set: shape modeling of the classes, sensor readings modeling, and the statistical Bayes analysis for change detection. The change detection rate is almost the same as for the Cairo data set: 84.5%.
H. Conclusion
In this chapter, statistical learning is used to develop a new method for change detection. The land cover changes in remote sensing data are used as an application. The method starts with learning the shapes of the classes defined in the classified image of the reference data set. The shape learning also uses a statistical learning algorithm, which is discussed in the chapter. The previously presented density function estimation approach is used to model the sensor readings of the data sets. The change detection approach uses the models of the shapes and the sensor readings to detect the changes in the scene.

Experiments are carried out using two data sets: one for Cairo, Egypt, and the other for Louisville, KY, USA. The average change detection rate is 85%, which outperforms previous approaches.

The proposed change detection approach differs from other approaches in several distinctive features. First, it does not use the raw (sensor) data directly, which means that it is possible to use it for detecting changes from data of different sources (sensors), e.g., multispectral and hyperspectral data. Second, it does not classify the data sets and then compare the classification results, which means that there is no accumulation of error. Third, it does not apply specific types of filters to the images, which adds simplicity and speed to the approach.
(a) (b)
(c) (d)
FIGURE 35: Results for change detection algorithms: (a) difference-image using CVA, (b)
detected changes-map using pixelwise analysis of the difference map, (c) detected changes-
map using MRF modeling, and (d) detected changes-map using the proposed algorithm
(a) (b)
(c) (d)
(e) (f)
FIGURE 36: Results for the change detection algorithm: (a) Reference with changes, (b)
Reference change-map, (c) Ordinary classification, and (d) Detected changes-map using
the proposed algorithm
CHAPTER VII
Conclusion
The attractive features of statistical learning methods are rapidly making them a replacement for classical learning methods. Their good generalization capabilities make statistical learning methods applicable to a wide range of practical problems, and the introduction of mathematically based methods for their analysis and learning opens the door to better understanding and to further contributions boosting their performance. The Mean Field-based Support Vector Machines (MF-based SVM) regression approach is the nucleus of the statistical learning methods introduced in this dissertation. The approach utilizes probability and statistical theories to establish a reliable, efficient, and fast regression approach that can cope with a variety of problems in the Computer Vision and Pattern Recognition world. An approach which uses mean field theory is established for the learning of the regression approach. A basic tool in machine learning problems is the estimation of the probability density function; a method which uses the MF-based regression algorithm is presented and illustrated using a variety of both synthetic and real data sets, and the statistical properties of this density estimation approach are discussed and illustrated. Camera calibration, a fundamental issue in computer vision applications, is formulated in a way that enables the use of the MF-based SVM regression approach in solving this fundamental problem. The density estimation approach is used in a variety of pattern recognition problems including classification, shape modeling, and change detection, with applications in remote sensing imagery processing.

This chapter presents a review of the dissertation, outlines the applications of the proposed approaches, and finally describes some natural extensions.
A. Review and Applications
Building statistical learning based frameworks for solving computer vision and pattern recognition problems is the main research focus of this thesis. It starts with building a regression approach which is then used in a variety of frameworks to solve the probability density function estimation problem, the camera calibration problem, image segmentation, and change detection.
Chapter 2 outlined the theoretical principles of the proposed MF-based SVM regression approach. The inclusion of Mean Field theory in learning the SVM algorithm is presented and investigated.

Chapter 3 discussed the use of the proposed MF-based SVM regression framework in a fundamental problem of machine learning: the estimation of the probability density function. The theoretical aspects of this MF-based SVM density estimation approach are discussed in detail, along with its statistical properties, namely the consistency and convergence of the approach. Statistical performance measures are used to illustrate the results of the presented density estimation approach. Also, several estimation approaches are presented and illustrated for automating the learning algorithm: the EM algorithm for estimating the parameters of the kernel, and cross validation for estimating the regularization parameters.
Chapter 4 presented the camera calibration problem in a statistical learning based formulation. The motivations behind using statistical learning approaches for camera calibration are discussed in this chapter. The formulation of the camera calibration problem in a regression setup is outlined, and the link between this formulation and the learning of the MF-based SVM regression algorithm is established. A mixed learning algorithm combining gradient descent and MF-based SVM regression is formulated and applied to synthetic as well as real data sets.
Chapter 5 presented the application of the proposed density estimation approach in the segmentation of remote sensing data sets. The application of the framework to the segmentation of real world multispectral and hyperspectral imagery is presented and evaluated against other algorithms. Estimation of the MRF parameters in image modeling using the proposed MF-based SVM framework is presented.
Chapter 6 presented the application of the proposed statistical learning approaches in solving the change detection problem. The problem considered in this work is the detection of changes in the land cover of a scene using remote sensing imagery. The approach depends on using statistical learning to model the shapes of the classes defined in the reference image. These shape models are used together with the statistical sensor models (probability densities) to detect the changes in a scene.
B. Limitations
While the proposed MF-based SVM regression algorithm is promising in terms of accuracy and speed, it has some limitations which should be further addressed. The major limitation is that it contains some learning parameters which have to be carefully selected. The dissertation provides some automation approaches, but the integration of these approaches into a fully automated approach should be considered.
C. Recommendations
This section discusses a number of recommendations that are suggested as extensions to the dissertation work. These recommendations can be classified broadly into two directions: performance improvement of the principal building block (statistical learning based SVM regression), and real time applications. The principal building block is affected by a number of learning parameters that control its performance, and the choice of the values of these parameters is not an easy task. The performance of the approach would be greatly enhanced by automated procedures for estimating these parameters. The thesis presented a few algorithms to estimate the values of some of these parameters, but more work is needed to integrate these procedures and to establish new ones for the other parameters.

This thesis presented many useful real world applications. These applications would be more valuable if they could be run faster, preferably in real time. For example, the camera calibration approach can be further improved by applying it to active vision applications and real time 3D reconstruction applications.

Also, further investigation of the presented new applications needs to be done. The change detection approach should be applied to different kinds of data sets: different sensors, different resolutions, and different view angles. This requires a sophisticated registration approach that can handle different kinds of imagery.
REFERENCES
[1] S. Chen, X. Hong, and C. Harris. Sparse Kernel Density Construction Using Orthogonal Forward Regression with Leave-One-Out Test Score and Local Regularization. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 34:1708–1717, August 2004.

[2] R. Duda, P. Hart, and D. Stork. Pattern Classification. John Wiley and Sons, 2nd edition, 2001.

[3] Aly Farag, Refaat Mohamed, and Hani Mahdi. Experiments in Image Classification and Data Fusion. In Proceedings of the Fifth International Conference on Information Fusion, IF02, pages I 299–308, Annapolis, MD, July 11-17, 2002.

[4] M. Koeppen. The Curse of Dimensionality. In Proceedings of the 5th Online World Conference on Soft Computing in Industrial Applications (WSC5), held on the internet, September 4-18, 2000.

[5] Z. Zhang. A Flexible New Technique for Camera Calibration. IEEE Trans. on Pattern Analysis and Machine Intelligence, 22(11):1330–1334, 2000.

[6] J. Su, J. Wang, and Y. Xi. Incremental Learning with Balanced Update on Receptive Fields for Multi-Sensor Data Fusion. IEEE Trans. on Systems, Man and Cybernetics, Part B, 34(1):659–665, February 2004.

[7] B. Boser, I. Guyon, and V. Vapnik. A Training Algorithm for Optimal Margin Classifiers. In Proceedings of the Computational Learning Theory (COLT), pages 144–152, Berlin Heidelberg, New York, 1992.
[8] V. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, 1st edition, 1995.

[9] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge, United Kingdom, 1st edition, 2004.

[10] V. Vapnik, S. Golowich, and A. Smola. Support Vector Method for Multivariate Density Estimation. Advances in Neural Information Processing Systems, 12:659–665, April 1999.

[11] B. Scholkopf, C. Burges, and A. Smola. Advances in Kernel Methods – Support Vector Learning. MIT Press, Cambridge, MA, 1999.

[12] T. Friess, N. Cristianini, and C. Campbell. The Kernel ADATRON Algorithm: A Fast and Simple Learning Procedure for Support Vector Machines. In Proceedings of the 15th International Conference on Machine Learning, pages 188–196, Madison, Wisconsin, USA, July 24-27, 1998.

[13] J. Platt. Fast Training of Support Vector Machines Using Sequential Minimal Optimization. In Advances in Kernel Methods. MIT Press, Cambridge, MA, 1999.

[14] C. Williams and M. Seeger. Using the Nystrom Method to Speed Up Kernel Machines. Advances in Neural Information Processing Systems, 14, 2001.

[15] M. Tipping and A. Faul. Fast Marginal Likelihood Maximization for Sparse Bayesian Models. In Proceedings of the International Workshop on AI and Statistics, Key West, FL, Jan 3-6, 2003.

[16] C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, 2(2):1–47, 1998.
[17] P. Mitra, C. Murthy, and S. Pal. A Probabilistic Active Support Vector Learning Algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(3):413–418, March 2004.

[18] D. Cohn, Z. Ghahramani, and M. Jordan. Active Learning with Statistical Models. Journal of AI Research, 4:129–145, 1996.

[19] D. MacKay. Information Based Objective Function for Active Data Selection. Neural Computation, 4(4):590–604, 1992.

[20] M. Opper and O. Winther. Gaussian Processes for Classification: Mean Field Algorithms. Neural Computation, 12:2655–2684, 2000.

[21] A. Papoulis and S. Pillai. Probability, Random Variables and Stochastic Processes. McGraw-Hill, New York, 4th edition, 2001.

[22] V. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, 2nd edition, 2001.

[23] J. Gao, S. Gunn, and C. Harris. Mean Field Method for the Support Vector Machine Regression. Neurocomputing, 50:391–405, November 2003.

[24] M. Opper and D. Saad. Advanced Mean Field Methods: Theory and Practice. MIT Press, Cambridge, MA, 2001.

[25] D. MacKay. Information Theory, Inference, and Learning Algorithms. Cambridge University Press, Cambridge, MA, 2003.

[26] V. Yakhot. Mean-Field Approximation and a Small Parameter in Turbulence Theory. Physical Reviews E, 63:026307, 2001.

[27] L. Saul, T. Jaakkola, and M. Jordan. Mean Field Theory for Sigmoid Belief Networks. Artificial Intelligence Research, 4:61–76, 1996.
[28] H. Kappen and W. Wiegerinck. Mean Field Theory for Graphical Models. In M. Opper and D. Saad, editors, Advanced Mean Field Theory, pages 37–49. MIT, Cambridge, MA, 2001.

[29] B. Silverman. Density Estimation for Statistics and Data Analysis. Chapman and Hall, FL, USA, 1986.

[30] C. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995.

[31] Ayman El-Baz and Aly Farag. Parameter Estimation in Gibbs-Markov Image Models. In Proceedings of the 6th International Conference on Information Fusion, pages 934–942, Queensland, Australia, July 8-11, 2003.

[32] Refaat Mohamed and Aly Farag. A New Unsupervised Approach for the Classification of Multispectral Data. In The Sixth International Conference on Information Fusion, pages 951–958, Queensland, Australia, July 8-11, 2003.

[33] E. Parzen. On Estimation of a Probability Density Function and Mode. Annals of Math. Statistics, 33:1065–1076, 1962.

[34] J. Lamperti. Probability – A Survey of the Mathematical Theory. Wiley Series in Probability and Statistics, Wiley, New York, 1996.

[35] J. Shao. Mathematical Statistics. Springer-Verlag, New York, 1999.

[36] C. Williams. Prediction with Gaussian Processes: Basic Ideas and Theoretical Perspectives. In Proc. Workshop Notions of Complexity: Information-theoretic, Computational and Statistical Approaches, Eindhoven, The Netherlands, October 7-9, 2004.

[37] P. Sollich and C. Williams. Understanding Gaussian Process Regression Using the Equivalent Kernel. In L. Saul, Y. Weiss, and Leon Bottou, editors, Advances in Neural Information Processing Systems 17, pages 1313–1320. MIT Press, Cambridge, MA, 2005.
[38] A. Oppenheim, R. Schafer, and J. Buck. Discrete-Time Signal Processing. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 2nd edition, 1999.

[39] B. Dacorogna. Direct Methods in the Calculus of Variations. Springer-Verlag New York, Inc., New York, NY, USA, 1989.

[40] Z. Ghahramani and M. Jordan. Function Approximation via Density Estimation Using the EM Approach. In J. Cowan, G. Tesauro, and J. Alspector, editors, Advances in Neural Information Processing Systems 6, pages 120–127. Morgan Kaufmann, San Mateo, CA, 1994.

[41] G. McLachlan and D. Peel. Finite Mixture Models. Wiley, New York, 2000.

[42] Aly Farag, Ayman El-Baz, and G. Gimelfarb. Density Estimation Using Modified Expectation Maximization for a Linear Combination of Gaussians. In Proceedings of the IEEE International Conference on Image Processing (ICIP 2004), volume I, pages 194–197, Singapore, October 24-27, 2004.

[43] A. Dempster, N. Laird, and D. Rubin. Maximum-Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, Ser. B, (39), 1977.

[44] R. Redner and H. Walker. Mixture Densities, Maximum Likelihood and the EM Algorithm. SIAM Review, 26(2), 1984.

[45] M. Jordan and R. Jacobs. Hierarchical Mixtures of Experts and the EM Algorithm. Neural Computation, 6:181–214, 1994.

[46] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer, New York, 2001.
[47] R. Hocking. Developments in Linear Regression Methodology. Technometrics, 25:219–249, 1983.
[48] S. Kullback. Information Theory and Statistics. Wiley, New York, 1959.
[49] S. M. Ali and S. D. Silvey. A General Class of Coefficients of Divergence of One Distribution from Another. Journal of the Royal Statistical Society, B28:131–142, 1966.
[50] M. Girolami and C. He. Probability Density Estimation from Optimally Condensed Data Samples. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(10):1253–1264, 2003.
[51] F. Sha, L. Saul, and D. Lee. Multiplicative Updates for Non-Negative Quadratic Programming in Support Vector Machines. Technical Report MS-CIS-02-19, University of Pennsylvania, 2002.
[52] B. Scholkopf, J. Platt, J. Shawe-Taylor, J. Smola, and R. Williamson. Estimating the Support of a High-Dimensional Distribution. Neural Computation, 13:1443–1471, 2001.
[53] S. Mukherjee and V. Vapnik. Support Vector Method for Multivariate Density Esti-
mation. A. I. Memo 1738, MIT AI Lab., 1999.
[54] R. Tsai. A Versatile Camera Calibration Technique for High-Accuracy 3D Machine Vision Metrology Using Off-the-Shelf TV Cameras and Lenses. IEEE Journal of Robotics and Automation, 3(4):323–344, August 1987.
[55] J. Weng, P. Cohen, and M. Herniou. Camera Calibration with Distortion Models and Accuracy Evaluation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(10):965–980, 1992.
[56] J. Gao, C. Harris, and S. Gunn. On a Class of Support Vector Kernels Based on Frames in Function Hilbert Spaces. Neural Computation, 13:1975–1994, 2001.
[57] S. Gunn. Support Vector Machines for Classification and Regression. Technical Report ISIS 1-98, Department of Electronics and Computer Science, University of Southampton, 1998.
[58] O. Faugeras, editor. Three-Dimensional Computer Vision: A Geometric Viewpoint. MIT Press, 1993.
[59] Aly Farag, Refaat Mohamed, and Ayman El-Baz. A Unified Framework for MAP Estimation in Remote Sensing Image Segmentation. IEEE Transactions on Geoscience and Remote Sensing, 43(7):1617–1634, 2005.
[60] E. Trucco and A. Verri. Introductory Techniques for 3-D Computer Vision. Prentice Hall, NJ, USA, 1998.
[61] Moumen Ahmed and Aly Farag. A Neural Network Approach for Solving the Problem of Camera Calibration. Image and Vision Computing, 20(9-10):619–630, 2002.
[62] J. Heikkila. Geometric Camera Calibration Using Circular Control Points. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(10):1066–1077, October 2000.
[63] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, ISBN: 0521540518, second edition, 2004.
[64] B. Shahshahani and D. Landgrebe. The Effect of Unlabeled Samples in Reducing the Small Sample Size Problem and Mitigating the Hughes Phenomenon. IEEE Transactions on Geoscience and Remote Sensing, 32(5):1087–1095, 1994.
[65] S. Tadjudin and D. Landgrebe. Robust Parameter Estimation for Mixture Model. IEEE Transactions on Geoscience and Remote Sensing, 38(1):439–445, 2000.
[66] J. Besag. On the Statistical Analysis of Dirty Pictures. Journal of the Royal Statistical Society, B48(3):259–302, 1986.
[67] C. Bouman and M. Shapiro. A Multiscale Random Field Model for Bayesian Image Segmentation. IEEE Transactions on Image Processing, 3(2):162–177, 1994.
[68] Aly Farag and E. Delp. Image Segmentation Based on Composite Random Field Models. Journal of Optical Engineering, 12:2594–2607, December 1992.
[69] G. Gimel’farb. Image Textures and Gibbs Random Fields. Kluwer Academic, Dordrecht, The Netherlands, 1999.
[70] Ayman El-Baz and Aly Farag. Image Segmentation Using GMRF Models: Parameters Estimation and Applications. In Proceedings of the IEEE International Conference on Image Processing, ICIP 2003, pages II 173–176, Barcelona, Spain, September 14-17, 2003.
[71] A. Jain and R. Dubes. Random Field Models in Image Analysis. Journal of Applied Statistics, 16(2):131–164, 1989.
[72] J. Besag. Spatial Interaction and the Statistical Analysis of Lattice Systems. Journal of the Royal Statistical Society, B36(2):192–225, 1974.
[73] M. Carlotto. Detection and Analysis of Change in Remotely Sensed Imagery with Application to Wide Area Surveillance. IEEE Transactions on Image Processing, 6:189–202, 1997.
[74] X. Yang and R. Yang. Change Detection Based on Remote Sensing Information Model and its Application on Coastal Line of Yellow River Delta. In Asian Conference on Remote Sensing, Hong Kong, China, November 22-25, 1999.
[75] D. Le Gall. MPEG: A Video Compression Standard for Multimedia Applications. Communications of the ACM, 34:47–58, 1991.
[76] S. Mallat. A Theory for Multiresolution Signal Decomposition: The Wavelet Representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11:674–692, 1989.
[77] L. Chen and S. Chang. A Video Tracking System with Adaptive Predictors. Pattern Recognition, 25:1171–1180, 1992.
[78] W. Kan, J. Krogmeier, and P. Doerschuk. Model-Based Vehicle Tracking from Image Sequences with an Application to Road Surveillance. Optical Engineering, 35:1723–1729, 1996.
[79] S. Liu, C. Fu, and S. Chang. Statistical Change Detection with Moments Under Time-Varying Illumination. IEEE Transactions on Image Processing, 7:1258–1268, September 1998.
[80] C. Fu and S. Chang. A Motion Estimation Algorithm Under Time-Varying Illumination Case. Pattern Recognition Letters, 10:195–199, 1989.
[81] L. Bruzzone and S. Serpico. An Iterative Technique for the Detection of Land-Cover Transitions in Multitemporal Remote-Sensing Images. IEEE Transactions on Geoscience and Remote Sensing, 35:858–867, 1997.
[82] J. Townshend, C. Justice, and C. Gurney. The Impact of Misregistration on Change
Detection. IEEE Transactions on Geoscience and Remote Sensing, 30:1054–1060,
1992.
[83] T. Fung. An Assessment of TM Imagery for Land-Cover Change Detection. IEEE Transactions on Geoscience and Remote Sensing, 28:681–684, 1990.
[84] A. Singh. Digital Change Detection Techniques Using Remotely Sensed Data. International Journal of Remote Sensing, 10:989–1003, 1989.
[85] J. Townshend and C. Justice. Spatial Variability of Images and the Monitoring of Changes in the Normalized Difference Vegetation Index. International Journal of Remote Sensing, 16:2187–2195, 1995.
[86] D. Wiemker. An Iterative Spectral-Spatial Bayesian Labeling Approach for Unsupervised Robust Change Detection on Remotely Sensed Multispectral Imagery. In Proc. 7th International Conference on Computer Analysis of Images and Patterns, pages 263–270, Kiel, Germany, September 22-25, 1997.
[87] A. Nielsen, K. Conradsen, and J. Simpson. Multivariate Alteration Detection (MAD) and MAF Processing in Multispectral, Bitemporal Image Data: New Approaches to Change Detection Studies. Remote Sensing of Environment, 64:1–19, 1998.
[88] J. Flusser and T. Suk. A Moment-Based Approach to Registration of Images with Affine Geometric Distortion. IEEE Transactions on Geoscience and Remote Sensing, 32:382–387, 1994.
[89] D. Barnea and H. Silverman. A Class of Algorithms for Fast Digital Image Regis-
tration. IEEE Transactions on Computers, C-21:179–186, 1972.
[90] J. Ton and A. Jain. Registering Landsat Images by Point Matching. IEEE Transactions on Geoscience and Remote Sensing, 27:642–650, September 1989.
[91] T. Knoll and E. Delp. Adaptive Gray Scale Mapping to Reduce Registration Noise in Difference Images. Computer Vision, Graphics, and Image Processing, 33:129–137, 1986.
[92] P. Gong, E. Ledrew, and J. Miller. Registration-Noise Reduction in Difference Images for Change Detection. International Journal of Remote Sensing, 13:773–779, 1992.
[93] P. Chavez. Radiometric Calibration of Landsat Thematic Mapper Multispectral Images. Photogrammetric Engineering and Remote Sensing, 55:1285–1294, 1989.
[94] P. Chavez and D. MacKinnon. Automatic Detection of Vegetation Changes in the Southwestern United States Using Remotely Sensed Images. Photogrammetric Engineering and Remote Sensing, 60:571–583, 1994.
[95] J. Richards. Remote Sensing Digital Image Analysis. Springer, NY, USA, 2nd edition, 1993.
[96] P. Slater. Reflectance and Radiance Based Methods for the In-Flight Absolute Calibration of Multispectral Sensors. Remote Sensing of Environment, 22:11–37, 1987.
[97] P. Teillet et al. Three Methods for the Absolute Calibration of the NOAA AVHRR Sensors in Flight. Remote Sensing of Environment, 31:105–120, 1990.
[98] H. Olsson. Reflectance Calibration of Thematic Mapper for Forest Change Detec-
tion. International Journal of Remote Sensing, 16:81–96, 1995.
[99] P. Rosin. Thresholding for Change Detection. Computer Vision and Image Understanding, 86:79–95, 2002.
[100] T. Fung and E. LeDrew. The Determination of Optimal Threshold Levels for Change Detection Using Various Accuracy Indices. Photogrammetric Engineering and Remote Sensing, 54:1449–1454, 1988.
[101] K. Fukunaga. Introduction to Statistical Pattern Recognition. Academic, NY, USA, 2nd edition, 1990.
[102] R. Nelson. Detecting Forest Canopy Change Due to Insect Activity Using Landsat MSS. Photogrammetric Engineering and Remote Sensing, 49:1303–1314, 1983.
[103] L. Bruzzone and D. Prieto. Automatic Analysis of the Difference Image for Unsupervised Change Detection. IEEE Transactions on Geoscience and Remote Sensing, 38:1171–1182, 2000.
[104] L. Bruzzone and D. Prieto. An Adaptive Semiparametric and Context-Based Approach to Unsupervised Change Detection in Multitemporal Remote Sensing Images. IEEE Transactions on Image Processing, 11:452–466, 2002.
[105] T. Kasetkasem and P. Varshney. An Image Change-Detection Algorithm Based on Markov Random Field Models. IEEE Transactions on Geoscience and Remote Sensing, 40:1815–1823, 2002.
[106] A. Bernstein. Monitoring Large Enrichment Plants Using Thermal Imagery from Commercial Satellites: A Case Study. Science and Global Security, 9:143–163, 2001.
[107] T. Sebastian et al. Recognition of Shapes by Editing Shock Graphs. In International Conference on Computer Vision, pages 755–762, Vancouver, Canada, July 9-12, 2001.
[108] B. van Ginneken et al. Active Shape Model Segmentation with Optimal Features. IEEE Transactions on Medical Imaging, 21:924–933, August 2002.
[109] A. Eldeib, S. Yamany, and Aly Farag. Volume Registration by Surface Point Signature and Mutual Information Maximization with Applications in Intra-Operative MRI Surgeries. In International Conference on Image Processing, pages 200–203, Vancouver, Canada, October 2000.
[110] X. Huang, D. Metaxas, and T. Chen. MetaMorphs: Deformable Shape and Texture Models. In International Conference on Computer Vision and Pattern Recognition, pages 496–503, Washington, D.C., USA, June 27-July 2, 2004.
[111] Ayman El-Baz, Refaat Mohamed, and Aly Farag. Shape Constraints for Accurate Image Segmentation with Applications in Remote Sensing Data. In The Eighth International Conference on Information Fusion, Philadelphia, PA, USA, July 25-29, 2005.
[112] Refaat Mohamed and Aly Farag. Mean Field Theory for Density Estimation Using Support Vector Machines. In The Seventh International Conference on Information Fusion, pages 856–861, Stockholm, Sweden, June 28-July 1, 2004.
CURRICULUM VITA
NAME: Refaat M. Mohamed
ADDRESS: Department of Electrical and Computer Engineering,
University of Louisville,
Louisville, KY 40292.
EDUCATION: * M.Sc. Electrical Engineering,
University of Assiut, Assiut, Egypt, 2001.
M.Sc. THESIS TITLE:
“An Intelligent Trajectory Tracking Controller for Robotics.”
* B.Sc. Electrical Engineering,
Very Good with Honors, first in class,
University of Assiut, Assiut, Egypt, 1995.
PREVIOUS
RESEARCH: Learning Systems, Robotic Control, Electronic Controllers Design.
TEACHING: Pattern Analysis and Machine Intelligence – GTA.
HONORS and AWARDS:
Dean’s Citation, University of Louisville Commencement,
Fall 2005.
Who’s Who Among Students in American Universities, 2005.
Outstanding Graduate Student, ECE Dept., University of
Louisville, 2004.
Second place in the ECE Department, University of Louisville,
Engineer’s Days Exhibit, 2002.
Student Member, IEEE, since 2002.
Member, Eta Kappa Nu (HKN), since 2004.
PUBLICATIONS:
JOURNALS
1. Refaat M. Mohamed, Ayman S. El-Baz, and Aly A. Farag, “A Bayes Analysis Approach for Change Detection in Remote Sensing Images,” Under preparation for the IEEE Transactions on Geoscience and Remote Sensing.
2. Aly A. Farag, Refaat M. Mohamed, and Ayman S. El-Baz, “A Unified Framework for MAP Estimation in Remote Sensing Image Segmentation,” IEEE Transactions on Geoscience and Remote Sensing, Vol. 43, No. 7, July 2005, pp. 1617-1634.
3. Ayman S. El-Baz, Refaat M. Mohamed, and Aly A. Farag, “Advanced Support Vector Machines for Image Modeling Using Gibbs-Markov Random Field,” International Journal of Information Technology, Vol. 1, No. 4, pp. 297-300, 2004.
4. Refaat M. Mohamed, Ayman S. El-Baz, and Aly A. Farag, “Probability Density Estimation Using Advanced Support Vector Machines and the Expectation Maximization Algorithm,” International Journal of Signal Processing, Vol. 1, No. 4, pp. 260-264, 2004.
5. Aly A. Farag, Ayman S. El-Baz, and Refaat M. Mohamed, “Density Estimation using Generalized Linear Model and a Linear Combination of Gaussians,” International Journal of Signal Processing, Vol. 1, No. 4, pp. 265-268, 2004.
CONFERENCES
6. Refaat M. Mohamed, Abdel-Rehim Ahmed, Ahmed Eid, and Aly Farag, “Statistical
Learning for Camera Calibration,” Submitted to the ECCV 2006.
7. Ayman S. El-Baz, Refaat M. Mohamed, Aly A. Farag, and Georgy Gimel’farb, “Unsupervised Segmentation of Multi-Modal Images by a Precise Approximation of Individual Modes with Linear Combinations of Discrete Gaussians,” International Conference on Computer Vision and Pattern Recognition, CVPR-05, Workshop on Learning in Computer Vision and Pattern Recognition, San Diego, California, June 19-25, 2005.
8. Refaat M. Mohamed, Ayman El-Baz, and Aly A. Farag, “Advanced Algorithms for Bayesian Classification in High Dimensional Spaces with Applications in Hyperspectral Image Segmentation,” Accepted, The International Conference on Image Processing, ICIP 2005, Sept. 11-14, Genoa, Italy.
9. Refaat Mohamed, Ayman El-Baz, and Aly Farag, “Remote Sensing Image Segmentation Using SVM with Automatic Selection for the Kernel Parameters,” Accepted, The Eighth International Conference on Information Fusion, Philadelphia, PA, USA, July 25-29, 2005.
10. Ayman S. El-Baz, Refaat M. Mohamed, and Aly A. Farag, “Shape Constraints for Accurate Image Segmentation with Applications in Remote Sensing Data,” Accepted, The Eighth International Conference on Information Fusion, Philadelphia, PA, USA, July 25-29, 2005.
11. Refaat M. Mohamed and Aly A. Farag, “Mean Field Theory for Density Estimation Using Support Vector Machines,” Seventh International Conference on Information Fusion, Stockholm, July 2004, pp. 495-501.
12. Hashem M. Mohamed, Khaled M. Shaaban, and Refaat M. Mohamed, “A Robust Framework for Detection of Human Faces in Clutter Color Images,” Seventh International Conference on Humans and Computers, University of Aizu, Japan, September 1-4, 2004.
13. Refaat M. Mohamed and Aly A. Farag, “Parameter Estimation for Bayesian Classification of Multispectral Data,” Seventh International Conference on Knowledge-Based Intelligent Information and Engineering Systems, University of Oxford, United Kingdom, September 4-5, 2003, pp. 346-355.
14. Refaat M. Mohamed and Aly A. Farag, “A New Unsupervised Approach for the Classification of Multispectral Data,” Sixth International Conference on Information Fusion, Fusion-03, Queensland, Australia, July 8-11, 2003, pp. 951-958.
15. Refaat M. Mohamed and Aly A. Farag, “Two Sequential Stages Classifier for Multispectral Data,” International Conference on Computer Vision and Pattern Recognition, CVPR-03, Workshop on Intelligent Learning, Madison, Wisconsin, June 16-22, 2003, pp. 110-116.
16. Refaat M. Mohamed and Aly A. Farag, “Classification of Multispectral Data Using Support Vector Machines Approach for Density Estimation,” International Conference on Intelligent Engineering Systems, Assiut, Egypt, March 4-6, 2003, pp. 102-109.
17. Aly A. Farag, Refaat M. Mohamed, and Hani Mahdi, “Experiments in Image Classification and Data Fusion,” Proceedings of 5th International Conference on Information Fusion, Annapolis, MD, Vol. 1, pp. 299-308, July 2002.
18. Khaled M. Shaaban and Refaat M. Mohamed, “Autonomous Learning Cerebellum Model Articulation Controller,” 2002 World Congress on Computational Intelligence, WCCI 2002, Hilton Hawaiian Village Hotel, Honolulu, Hawaii, May 12-17, 2002.