BOOTSTRAP TECHNIQUES FOR
STATISTICAL PATTERN RECOGNITION

by

QUN WANG, M.Sc.

A thesis submitted to
the Faculty of Graduate Studies and Research
in partial fulfillment of the requirements
for the degree of Master of Science

Information and System Science
School of Computer Science
Carleton University
Ottawa, Ontario

April 2000

© Copyright
2000, Qun Wang
National Library of Canada
Acquisitions and Bibliographic Services

The author has granted a non-exclusive licence allowing the National Library of Canada to reproduce, loan, distribute or sell copies of this thesis in microform, paper or electronic formats.

The author retains ownership of the copyright in this thesis. Neither the thesis nor substantial extracts from it may be printed or otherwise reproduced without the author's permission.
This thesis studies the application of the bootstrap techniques to three fields of statistical pattern recognition: the estimation of the Bhattacharyya bound, the estimation of the error rate of a classifier, and classifier design. To initiate the study, the bootstrap technique and its concepts, algorithms and properties are explained.

In the application of the bootstrap to estimating the Bhattacharyya bound, it is first argued, both theoretically and experimentally, that the estimate of the Bhattacharyya bound is usually biased. Two types of bootstrap estimators of the Bhattacharyya bound, the direct and indirect bias correction estimators, are suggested. It is shown experimentally that the suggested bootstrap estimators of the Bhattacharyya bound are superior to the general estimator.

In the application of the bootstrap to the error rate estimation, the previous works are reviewed, and the related algorithms are illustrated with simulation results. These algorithms are catalogued as real-sample algorithms, as they use only the original training samples to estimate the error rate. The pseudo-sample algorithms of the error rate estimation are then proposed. These use the bootstrap technique to generate pseudo-samples for estimating the error rate. Comparisons based on the experiments show that the pseudo-sample algorithms perform better than most of the other algorithms.

In the application of the bootstrap to classifier design, the mixed-sample algorithm is introduced. It overcomes the limitation of the training sample size by adding pseudo-samples to the original training samples for classifier design. The experimental results show that this strategy is successful.
First of all, I would like to thank my Professor, Dr. John Oommen, for his continued support, encouragement, and for believing in me. It is his guidance, ideas, and wisdom that led me to explore new scientific domains, and to discover new research horizons. I am deeply grateful to him for all that he has done for me.

I would like to thank Ms. Patricia King-Edwards, who helped me by proofreading and editing most of my thesis. Above all, I would like to thank my wife, Ying Huang, for her help, support, and encouragement. This thesis would never have been done if it were not for her support.
To my parents.
CHAPTER 1: INTRODUCTION .......................................................... 1
1.1 Introduction ................................................................. 1
1.2 General Contributions ........................................................ 4
1.3 Outline of the Thesis ........................................................ 5
1.4 Experiment Data Set .......................................................... 6

CHAPTER 2: THE BOOTSTRAP TECHNIQUE .............................................. 11
2.1 Introduction ................................................................ 11
2.2 The Fundamentals of Bootstrap ............................................... 13
2.3 Bootstrap Estimators ........................................................ 17
2.3.2 Variance .................................................................. 18
2.3.3 Quantile .................................................................. 20
2.3.4 Parametric Bootstrap ...................................................... 21
2.3.5 Density Function .......................................................... 21
2.4 Properties of Bootstrap ..................................................... 25
2.5 Other Resampling Approaches ................................................. 29
2.5.1 Bayesian Bootstrap ........................................................ 30
2.5.2 Random Weighting Method ................................................... 34

CHAPTER 3: THE BHATTACHARYYA BOUND .............................................. 39
3.1 Introduction ................................................................ 39
3.2 Chernoff and Bhattacharyya Bounds ........................................... 42
3.2.1 Chernoff Bound ............................................................ 42
3.2.2 Bhattacharyya Bound ....................................................... 43
3.3 The Estimation of Bhattacharyya Bound ....................................... 44
3.3.1 Simulation Results ........................................................ 47
3.4 The Bootstrap Estimators - Direct Bias Correction ........................... 49
3.4.1 Simulation Results ........................................................ 57
3.5 The Bootstrap Estimators - Distance Bias Adjustment ......................... 59
3.5.1 AFMS (Adjust First and Mean Second) Schemes ............................... 61
3.5.2 MFAS (Mean First and Adjust Second) Schemes ............................... 66
3.6 Summary ..................................................................... 71

4.1 Introduction ................................................................ 73
4.2 Algorithms of the Error Rate Estimation ..................................... 75
4.2.1 Apparent Error ............................................................ 76
4.2.2 Cross-validation (Leave-one-out) .......................................... 78
4.2.3 Basic Bootstrap ........................................................... 79
4.2.4 E0 Estimator .............................................................. 84
4.2.5 Leave-one-out Bootstrap ................................................... 86
4.2.6 0.632 Estimators .......................................................... 89
4.3 Simulation Results .......................................................... 90
4.4 Pseudo-Sample Algorithms .................................................... 92
4.4.1 Pseudo-Testing Algorithm .................................................. 92
4.4.2 Pseudo-Classifier Algorithm ............................................... 99
4.4.3 Leave-one-out Pseudo-Classifier Algorithm ................................ 101
4.5 Summary .................................................................... 105

CHAPTER 5: BOOTSTRAP BASED CLASSIFIER DESIGN ................................... 110
5.1 Introduction ............................................................... 110
5.2 Previous Work .............................................................. 111
5.3 Mixed-sample Classifier Design ............................................. 116
5.4 Simulation Results ......................................................... 118
5.5 Discussions and Conclusions ................................................ 124

CHAPTER 6: SUMMARY AND DISCUSSIONS ............................................. 126
6.1 Introduction ............................................................... 126
6.2 Bias Correction ............................................................ 126
6.3 Cross-validation ........................................................... 129
6.4 Pseudo-pattern Generation .................................................. 131
6.5 Conclusion ................................................................. 133
TABLE 1.1 Expectations of the classes in Data Set I ............................. 8
TABLE 3.1 Theoretical values of Bhattacharyya distance and bound ............... 47
TABLE 3.2 Estimates of Bhattacharyya distances with the General approach ....... 47
TABLE 3.3 Estimates of Bhattacharyya bounds with the General approach (%) ...... 47
TABLE 3.4 Estimates of Bhattacharyya bounds with the Basic Bootstrap (%) ....... 58
TABLE 3.5 Estimates of Bhattacharyya bounds with the Bayesian Bootstrap (%) .... 58
TABLE 3.6 Estimates of Bhattacharyya bounds with the Random Weighting Method (%) 58
TABLE 3.7 Estimates of Bhattacharyya bounds (%), Basic Bootstrap, B-based AFMS . 63
TABLE 3.8 Estimates of Bhattacharyya bounds (%), Bayesian Bootstrap, B-based AFMS 63
TABLE 3.9 Estimates of Bhattacharyya bounds (%), Random Weighting Method, B-based AFMS 64
TABLE 3.10 Estimates of Bhattacharyya bounds (%), Basic Bootstrap, G-based AFMS  65
TABLE 3.11 Estimates of Bhattacharyya bounds (%), Bayesian Bootstrap, G-based AFMS 65
TABLE 3.12 Estimates of Bhattacharyya bounds (%), Random Weighting Method, G-based AFMS 66
TABLE 3.13 Estimates of Bhattacharyya bounds (%), Basic Bootstrap, B-based MFAS  68
TABLE 3.14 Estimates of Bhattacharyya bounds (%), Bayesian Bootstrap, B-based MFAS 69
TABLE 3.15 Estimates of Bhattacharyya bounds (%), Random Weighting Method, B-based MFAS 69
TABLE 3.16 Estimates of Bhattacharyya bounds (%), Basic Bootstrap, G-based MFAS  70
TABLE 3.17 Estimates of Bhattacharyya bounds (%), Bayesian Bootstrap, G-based MFAS 70
TABLE 3.18 Estimates of Bhattacharyya bounds (%), Random Weighting Method, G-based MFAS 70
TABLE 4.1 Error Rate Estimates of 3-NN classifiers (sample size 8) ............. 91
TABLE 4.2 Error Rate Estimates of 3-NN classifiers (sample size 24) ............ 91
TABLE 4.3 Error Rate Estimates of the Pseudo-Testing algorithm ................. 98
TABLE 4.4 Error Rate Estimates of the Pseudo-Classifier algorithm (sample size 8) 100
TABLE 4.5 Error Rate Estimates of the Pseudo-Classifier algorithm (sample size 24) 100
TABLE 4.6 Error Rate Estimates of the Leave-One-Out Pseudo-Classifier algorithm (sample size 8) 100
TABLE 4.7 Error Rate Estimates of the Leave-One-Out Pseudo-Classifier algorithm (sample size 24) 100
TABLE 4.8 Absolute difference between estimates and true errors (sample size 8)  107
TABLE 4.9 Absolute difference between estimates and true errors (sample size 24) 108
TABLE 4.10 Performance ranges according to the absolute difference from the true error (sample size 8) 109
TABLE 4.11 Performance ranges according to the absolute difference from the true error (sample size 24) 109
Figure 1.1 Expectation Distribution of the Classes in Data Set I ................ 8
Figure 5.1 Testing Error of Mixed-Sample Classifier - Class Pair (A, B) ....... 119
Figure 5.2 Testing Error of Mixed-Sample Classifier - Class Pair (A, C) ....... 119
Figure 5.3 Testing Error of Mixed-Sample Classifier - Class Pair (A, D) ....... 120
Figure 5.4 Testing Error of Mixed-Sample Classifier - Class Pair (A, E) ....... 120
Figure 5.5 Testing Error of Mixed-Sample Classifier - Class Pair (A, F) ....... 121
Figure 5.6 Testing Error of Mixed-Sample Classifier - Class Pair (A, G) ....... 121
CHAPTER 1
INTRODUCTION
1.1 Introduction
Since Efron's bootstrap technique was introduced in the late 1970's [Ef79], it has attracted great interest from both the theoretical and applied sides of statistics. Since it uses a simulation strategy to calculate estimates, the bootstrap technique is able to retrieve more information from sample data, and to solve problems that are not easily solved by traditional methods, such as the bias, the variance and the parameters of the distribution of an estimator. Over the past two decades, great effort has been made to develop the theory and application of the bootstrap ([DH97], [Ma92] and [ST95]). In rapid succession, large numbers of problems have proved to be amenable to the new technique, which confirmed that the bootstrap is efficient, flexible and useful.

Efron's paper, "Estimating the error rate of a prediction rule: improvement on Cross-validation" [Ef83], may be the earliest work applying the bootstrap technique to the error rate estimation. More endeavors, [CMN85], [CMN86], [Ef86], [Ha86], [JDC87], [Fu90], [We91], [DH92] and [ET97], have been made in this area since then. The other field in statistical pattern recognition where the bootstrap technique has been applied is classifier design [HUT97].
In this thesis, the bootstrap technique will be applied to three problems in
statistical pattern recognition:
1. Estimation of the Bhattacharyya bound;
2. Estimation of the error rate of a classifier; and
3. Classifier design.
The Bhattacharyya bound is an upper bound on the error rate of a classifier and an exponential function of the Bhattacharyya distance. When the distributions of two classes are normal, the Bhattacharyya distance is a function of the first and second moments of the normal distributions. It has been shown, both theoretically and practically, that an estimate of the Bhattacharyya bound given by any traditional method is seriously biased. Applying the bootstrap technique, two types of new estimators of the Bhattacharyya bound are introduced, which are respectively called the direct and indirect bias correction estimators. The first estimator estimates the bound on the basis of directly correcting the bias of the Bhattacharyya bound, while the second estimates the bound on the basis of correcting the bias of the Bhattacharyya distance. The simulation results show that both of them successfully improve the estimates of the Bhattacharyya bound.
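For normal class-conditional densities, the Bhattacharyya distance and the corresponding error bound take a standard closed form (as given, for example, in Fukunaga [Fu90]). The following pure-Python sketch for the 2-dimensional case is illustrative only; the function names are ours and the thesis's estimators operate on sample moments rather than the true parameters used here.

```python
import math

def mat_add(a, b):                       # 2x2 matrix addition
    return [[a[i][j] + b[i][j] for j in range(2)] for i in range(2)]

def mat_scale(a, s):                     # 2x2 scalar multiple
    return [[a[i][j] * s for j in range(2)] for i in range(2)]

def det2(a):
    return a[0][0] * a[1][1] - a[0][1] * a[1][0]

def inv2(a):
    d = det2(a)
    return [[a[1][1] / d, -a[0][1] / d], [-a[1][0] / d, a[0][0] / d]]

def bhattacharyya_distance(m1, c1, m2, c2):
    """B = (1/8) dm' S^-1 dm + (1/2) ln(|S| / sqrt(|C1||C2|)), S = (C1+C2)/2."""
    s = mat_scale(mat_add(c1, c2), 0.5)
    si = inv2(s)
    dm = [m2[0] - m1[0], m2[1] - m1[1]]
    quad = sum(dm[i] * si[i][j] * dm[j] for i in range(2) for j in range(2))
    return 0.125 * quad + 0.5 * math.log(det2(s) / math.sqrt(det2(c1) * det2(c2)))

def bhattacharyya_bound(p1, p2, b_dist):
    """Upper bound on the Bayes error rate: sqrt(P1 P2) * exp(-B)."""
    return math.sqrt(p1 * p2) * math.exp(-b_dist)

# Classes A and B of Data Set I: identity covariances, means (0,0) and (0.5,0).
I2 = [[1.0, 0.0], [0.0, 1.0]]
b = bhattacharyya_distance([0.0, 0.0], I2, [0.5, 0.0], I2)   # 0.5^2 / 8 = 0.03125
bound = bhattacharyya_bound(0.5, 0.5, b)
```

With equal identity covariances the log term vanishes and the distance reduces to one eighth of the squared Euclidean distance between the means, which is why nearby classes such as A and B have a bound close to the prior 0.5.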
Error rate estimation is, of course, an active topic because it is the crucial measure of a classifier, and a lot of work has been done on it. It is generally agreed that the apparent error estimator is biased so as to underestimate the error, while the traditional leave-one-out estimator produces estimates larger than those of the apparent error estimator. New efforts have tried to improve the error estimator by introducing the bootstrap technique. Three bootstrap estimators were first introduced in Efron's paper [Ef83]. The basic bootstrap estimator uses the basic bootstrap technique to correct the estimation bias of the apparent error estimator. As opposed to it, the E0 estimator uses the training patterns excluded from the bootstrap samples to estimate the error rate. The 0.632 estimator is a combination of the E0 and the apparent error estimators. Although a few other algorithms have since been suggested, such as the MC and the convex bootstrap [CMN85], no essential progress was made until the algorithms of the leave-one-out bootstrap and 0.632+ were invented by Efron and Tibshirani [ET97]. The leave-one-out bootstrap is an algorithm combining the techniques of cross-validation and the bootstrap, while the 0.632+ estimator is a combination of the leave-one-out bootstrap and the apparent error estimators.

As all of the error rate estimation algorithms mentioned above use only the original training samples, they are said to be real-sample algorithms. Motivated by the SIMDAT algorithm [TT92], the Bayesian bootstrap and the random weighting method [ST95], pseudo-sample algorithms of the error rate estimation are introduced in this thesis. The basic concept of the pseudo-sample algorithms is to use the pseudo-samples generated by a bootstrap sampling technique to estimate the error rate. The pseudo-testing algorithms use the pseudo-samples as the testing samples, while the pseudo-classifier algorithms utilize the pseudo-samples to build the classifier, and use the original training samples as the testing samples. The experiments demonstrated that the pseudo-testing algorithm tends to underestimate the error rate, while the pseudo-classifier algorithm performed very well. It can be seen that, combined with the cross-validation technique, the pseudo-classifier algorithm is one of the best algorithms available to estimate the error rate.
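To make the real-sample estimators concrete, here is an illustrative Python sketch (not from the thesis; all names are ours) of the apparent error, the E0 estimator, and the 0.632 combination, using a toy nearest-mean classifier on scalar features:

```python
import random

def nearest_mean_classifier(train):
    """train: list of (feature, label) pairs with scalar features.
    Returns a predict function based on per-class mean features."""
    means = {}
    for label in set(lab for _, lab in train):
        xs = [x for x, lab in train if lab == label]
        means[label] = sum(xs) / len(xs)
    return lambda x: min(means, key=lambda lab: abs(x - means[lab]))

def error_rate(predict, data):
    return sum(predict(x) != lab for x, lab in data) / len(data)

def error_estimators(train, B=200, seed=0):
    """Apparent error; E0 (test each bootstrap classifier on the patterns
    left out of its bootstrap sample); and the 0.632 combination."""
    rng = random.Random(seed)
    n = len(train)
    apparent = error_rate(nearest_mean_classifier(train), train)
    wrong = tested = 0
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]       # bootstrap indices
        chosen = set(idx)
        left_out = [train[i] for i in range(n) if i not in chosen]
        if left_out:
            predict = nearest_mean_classifier([train[i] for i in idx])
            wrong += sum(predict(x) != lab for x, lab in left_out)
            tested += len(left_out)
    e0 = wrong / tested
    return apparent, e0, 0.368 * apparent + 0.632 * e0   # the 0.632 estimator

data_rng = random.Random(1)
train = [(data_rng.gauss(0.0, 1.0), 'A') for _ in range(15)] + \
        [(data_rng.gauss(1.0, 1.0), 'B') for _ in range(15)]
apparent, e0, e632 = error_estimators(train)
```

Since roughly 36.8% of the training patterns are left out of each bootstrap sample, E0 behaves like a pessimistic held-out estimate, and the 0.632 weighting balances it against the optimistic apparent error.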
The problem of classifier design is different from the estimation problems. The purpose of classifier design is to seek a better discriminant function. Little work has been done in this area to incorporate the bootstrap technique. The algorithms for classifier design introduced by Hamamoto and his colleagues [HUT97] focus on creating artificial training samples instead of using the original training samples to construct a classifier. Their main strategy uses the local mean to replace the original training pattern when constructing a classifier. Their experiments showed that this strategy works for the k-NN classifier.

In this thesis, the mixed-sample algorithms for classifier design will be proposed. The main idea of the algorithms is to enlarge the training sample for constructing the classifier by adding pseudo-training patterns into the original training sample set. The simulation results show that this strategy works well when proper bootstrap sampling schemes are used to generate the pseudo-training patterns.
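The actual pseudo-pattern generation schemes are detailed in Chapter 5; as a flavor of the mixed-sample idea, the toy sketch below (illustrative only, in the spirit of SIMDAT-style smoothed resampling; the names and the specific scheme are ours) produces pseudo-patterns as random convex combinations of a training point and its nearest neighbours, and appends them to the training set:

```python
import random

def pseudo_patterns(train, m, k=3, seed=0):
    """Generate m pseudo-patterns from scalar training features: pick a
    pattern at random, then return a random convex combination of it and
    its k nearest neighbours."""
    rng = random.Random(seed)
    out = []
    for _ in range(m):
        x = rng.choice(train)
        neigh = sorted(train, key=lambda y: abs(y - x))[:k + 1]  # x itself + k nearest
        w = [rng.random() for _ in neigh]
        s = sum(w)
        out.append(sum(wi / s * yi for wi, yi in zip(w, neigh)))
    return out

train = [0.1, 0.4, 0.5, 0.9, 1.2]
mixed = train + pseudo_patterns(train, 10)   # enlarged, mixed-sample training set
```

Because each pseudo-pattern is a convex combination of nearby training points, the enlarged set fills in the local neighbourhood of the data without straying outside its support.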
1.2 General Contributions

The general contributions of this thesis are distributed into three areas:

1. The Bhattacharyya bound estimation, which uses:
   a) the direct bias correction algorithm; and
   b) the distance bias adjustment (indirect bias correction) algorithms.

2. The error rate estimation, incorporating:
   a) a pseudo-testing algorithm;
   b) a pseudo-classifier algorithm; and
   c) a leave-one-out pseudo-classifier algorithm.

3. The classifier design, which incorporates a mixed-sample classifier design.
Variations of the above algorithms will also be discussed in this thesis. These include the modification of the resampling schemes (basic bootstrap, Bayesian bootstrap, random weighting method and the SIMDAT algorithm), the pseudo-sample size and the size of the nearest-neighbor set of a training pattern.
1.3 Thesis Outline

This thesis is composed of six chapters. Chapter 2 discusses the bootstrap technique. Section 2.1 is a general introduction to the bootstrap and its application to statistical pattern recognition. Section 2.2 describes the basic concept of Efron's bootstrap as well as a general algorithm of the basic bootstrap, followed by several examples of the bootstrap estimators, which are the content of Section 2.3. Two theorems about the properties of the bootstrap are presented in Section 2.4. Section 2.5 illustrates the other two bootstrap sampling schemes - the Bayesian bootstrap and the random weighting method.

Chapter 3 discusses the estimation problem of the Bhattacharyya bound. The definitions and formulae of the Chernoff and Bhattacharyya bounds are presented in Section 3.2. Subsequently, Section 3.3 describes a traditional algorithm for estimating the Bhattacharyya bound, and provides the results of related experiments. Two kinds of bootstrap estimators are discussed with the simulation results in Sections 3.4 and 3.5 respectively. Section 3.6 is a summary of the bootstrap-based Bhattacharyya bound estimation methods.
The bootstrap-based error estimators of a classifier are the topic of Chapter 4. Section 4.1 is a brief introduction to the prior work done regarding the error estimation of a classifier. The details of the main algorithms of the error estimators are presented in Section 4.2, followed by the results of related experiments, which are illustrated in Section 4.3. Section 4.4 introduces three kinds of pseudo-sample algorithms of error estimation, and discusses the algorithms with the related simulation results. Section 4.5 summarizes all the algorithms discussed in this chapter.

The bootstrap-based classifier design is the content of Chapter 5. Section 5.1 states the problem of classifier design and discusses the problem to be solved. Previous work done on bootstrap-based classifier design is discussed in Section 5.2. The algorithms of mixed-sample classifier design are introduced in Section 5.3, while the related results of simulations are discussed in Section 5.4. Section 5.5 provides the discussions and summarizes the work done in mixed-sample classifier design.

Chapter 6 summarizes the applications of the bootstrap technique to the field of statistical pattern recognition. The three main issues used in the thesis, bias correction, cross-validation and pseudo-pattern generation, are discussed in Sections 6.2, 6.3 and 6.4 respectively. Section 6.5 states the conclusions of our work.
1.4 Experiment Data Set

Throughout this thesis, all the experiments have been carried out with a data set referred to as Data Set I, which contains 2-dimensional training samples of seven classes. All the classes follow a normal distribution with the same covariance matrix, the identity matrix, but they have different expectation vectors generated by Algorithm 1.1.
Algorithm 1.1 Expectation Vector Generating
Input: d, the dimension.
Output: ExpectationList, the expectation vectors.
Method
BEGIN
    For (j = 1 to d)
        EX[j] = 0
    End For
    Put EX in ExpectationList
    delta = 0.5
    Do
        k = random integer in [1, d]
        EX[k] = EX[k] + delta
        delta = delta + 0.1
        Put EX in ExpectationList
    While (ModulusOfVector(EX, d) < 3)
    Return ExpectationList
END Expectation Vector Generating

Procedure ModulusOfVector
Input: (i) v[1], v[2], ..., v[d], a vector;
       (ii) d, the dimension.
Output: The modulus of the vector.
Method
BEGIN
    modulus = 0
    For (j = 1 to d)
        modulus = modulus + v[j] * v[j]
    End For
    modulus = sqrt(modulus)
    Return modulus
END Procedure ModulusOfVector
In Algorithm 1.1, the expectation vector of the first class is set to (0, 0). Then, in the Do...While loop, a delta, with an initial value of 0.5, is randomly added to one of the elements of the previous expectation vector to yield the next expectation vector, and delta is then increased by 0.1. Expectation vectors for the classes are generated in this way until the modulus of the last expectation vector exceeds 3. With Algorithm 1.1, expectations of seven classes of training samples were generated. Their expectations are listed in TABLE 1.1, and Figure 1.1 depicts the distribution of the expectations.
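Algorithm 1.1 translates almost line for line into Python; the rendering below is illustrative (variable names are ours) and uses 0-based coordinate indices:

```python
import math
import random

def expectation_vectors(d, seed=0):
    """Python rendering of Algorithm 1.1: start at the zero vector and
    repeatedly add a growing delta to one randomly chosen coordinate,
    until the modulus of the latest vector reaches 3."""
    rng = random.Random(seed)
    ex = [0.0] * d
    vectors = [list(ex)]          # first class: the zero vector
    delta = 0.5
    while True:
        k = rng.randrange(d)      # random coordinate (0-based here)
        ex[k] += delta
        delta += 0.1
        vectors.append(list(ex))
        if math.sqrt(sum(v * v for v in ex)) >= 3:
            return vectors

vecs = expectation_vectors(2)
```

Every intermediate vector has modulus below 3 by construction, and only the final appended vector reaches or exceeds it, which is what terminates the loop.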
Figure 1.1 Expectation Distribution of the Classes in Data Set I
TABLE 1.1 Expectations of the classes in Data Set I

    CLASS   EXPECTATION       CLASS   EXPECTATION
    A       (0.0, 0.0)        E       (1.1, 1.5)
    B       (0.5, 0.0)        F       (2.0, 1.5)
    C       (1.1, 0.0)        G       (3.0, 1.5)
    D       (1.1, 0.7)

Each sample point is a 2-dimensional vector with a normal distribution. The sample vectors were generated by Algorithm 1.2.
Algorithm 1.2 d-Dimension Normal Vector
Input: (i) d, the dimension;
       (ii) E[1], E[2], ..., E[d], the expectation vector.
Output: A vector that follows a normal distribution.
Method
BEGIN
    For (k = 1 to d)
        MltNormal[k] = StandardNormal
        MltNormal[k] = MltNormal[k] + E[k]
    End For
    Return MltNormal[1], ..., MltNormal[d]
END d-Dimension Normal Vector

Procedure StandardNormal
Input: nothing.
Output: An approximately standard normal random variable.
Method
BEGIN
    RandomNM = 0
    For (j = 1 to 12)
        RandomNM = RandomNM + RandomUniform[0, 1]
    End For
    RandomNM = RandomNM - 6.0
    Return RandomNM
END Procedure StandardNormal
Algorithm 1.2 is a typical algorithm for generating a d-dimensional normally distributed random vector with the identity matrix as the covariance matrix. It calls the procedure StandardNormal d times to obtain d variables that are approximately N(0, 1). The procedure StandardNormal accumulates 12 variables of the uniform distribution U[0, 1] to return a variable which is approximately N(0, 1). Since the expectation of the required random normal vector is generally not 0, a corresponding expectation value is added to each element of the d-dimensional random vector returned by the procedure StandardNormal. In this way, a d-dimensional normally distributed random vector is generated with the given expectation vector and the identity matrix as the covariance matrix.
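The 12-uniform trick in StandardNormal works because the sum of 12 independent U[0, 1] variables has mean 6 and variance exactly 1 (each uniform contributes variance 1/12), so subtracting 6 gives an approximately N(0, 1) variable by the central limit theorem. A quick illustrative check in Python:

```python
import random

def standard_normal_approx(rng):
    """Sum of 12 U[0,1] draws, centered: mean 0, variance exactly 1,
    approximately normal, and bounded in [-6, 6]."""
    return sum(rng.random() for _ in range(12)) - 6.0

rng = random.Random(42)
sample = [standard_normal_approx(rng) for _ in range(20000)]
mean = sum(sample) / len(sample)
var = sum((x - mean) ** 2 for x in sample) / len(sample)
```

One visible difference from a true normal variable is the hard truncation at plus or minus 6; the tails beyond that are simply impossible, which is usually harmless for simulation purposes like these.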
200 separate samples, each of size 50, were generated for each class in Data Set I, and were used for the simulation. Another sample of size 1000 was also generated separately for each class, and was used for testing.
CHAPTER 2
THE BOOTSTRAP TECHNIQUE

2.1 Introduction
The bootstrap technique is one of the most popular statistical approaches today for estimating parameters and their distributions. It was first introduced by Efron in the late 1970's [Ef79]. The basic strategy of the bootstrap is based on resampling and simulation. Over the last two decades, the bootstrap has been applied to a wide range of statistical problems. Its theory has been extensively developed, and a variety of techniques related to the bootstrap have been proposed for various application problems, including bias estimation, standard error estimation, confidence interval estimation, hypothesis testing, estimation of prediction error, sample surveys, linear modeling, kernel density estimation, time series and dependent data (see [ET93], [DH97] and [ST95]).

Efron's work on error estimation [Ef83] can be regarded as the start of applying the bootstrap techniques in the area of statistical pattern recognition. Then Chernick [CMN86], Jain [JDC87] and Weiss [We91] performed simulations to compare bootstrap error estimates with other estimates, mainly with the cross-validation estimation schemes. Efron [Ef86] and [ET97], Davison [DH92] and Fukunaga [Fu92] also discussed theoretical issues of the bootstrap error estimation. Detailed descriptions of the above works will be given in Chapters 3 and 4 respectively. There were also suggestions of using the bootstrap techniques to construct pattern classifiers, by Hamamoto [HUT97], which will be discussed in Chapter 5.
In this chapter, the essential concept and theory of the bootstrap technique will be discussed. Some important bootstrap estimates useful in statistical pattern recognition and the properties of the bootstrap will be given, and other related resampling plans will be introduced as well. In Section 2.2, the concept and basic procedure of the bootstrap are discussed, and examples are given to illustrate the bootstrap procedure. Several bootstrap estimators, including those for the bias, variance, quantile and density function, are discussed in Section 2.3. These estimating procedures are very helpful in understanding Efron's bootstrap theory, and are also useful for estimating the error bound and the error rate of a classifier, which are the topics of the next two chapters. Section 2.4 discusses the theoretical aspects of the bootstrap, which are mainly concerned with the large sample properties, consistency and asymptotic distribution of the bootstrap estimators. The theorems in Section 2.4 concern the theoretical properties of the indices on invoking the bootstrap. In the last section of the chapter, Section 2.5, we will discuss other resampling plans, the Bayesian bootstrap and the Random Weighting method, which were introduced by Rubin [ST95] and Zhen [Zh87] respectively. Both of them generalize Efron's bootstrap approach in some sense, and both will be applied in later chapters to problems of statistical pattern recognition.
2.2 The Fundamentals of Bootstrap

Let X = (X_1, X_2, \ldots, X_n) be an i.i.d. d-dimensional sample from an unknown distribution F. We consider an arbitrary functional of F, \theta = \theta(F), which, for example, could be an expectation, a quantile, a variance, etc. The parameter \theta is estimated by the same functional of the empirical distribution \hat{F}, namely \hat{\theta} = \theta(\hat{F}), where

    \hat{F}: \text{mass } \frac{1}{n} \text{ at } x_1, x_2, \ldots, x_n,

n is the sample size, and \{x_1, x_2, \ldots, x_n\} are the observed values of the sample \{X_1, X_2, \ldots, X_n\}. A host of unanswered questions now arises, such as:

(i.) What is the bias of the estimator \hat{\theta}?
(ii.) What is the variance of \hat{\theta}?
(iii.) What is the distribution of \hat{\theta}?

These questions are generally not easy to answer, especially in the case when F is unknown. For example, the bias is well defined as

    \mathrm{Bias} = E_F[\hat{\theta} - \theta] = E_F[\theta(\hat{F}) - \theta(F)],    (2.1)

where E_F indicates the expectation under the distribution F. Though it is possible to calculate \hat{\theta} = \theta(\hat{F}) having observed X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n, it is not possible to derive the bias directly, because both \theta and F are unknown. The question is thus one of determining how to obtain an estimator of the bias.
From one perspective, the quantity θ̂ can be regarded as a simulation of θ, since
we use the empirical distribution F̂ to mirror the characteristics of the unknown
distribution F via the sampling process. Extending the idea to the bias estimation, we can
use the same strategy to solve the problem. As the empirical distribution F̂ is known, it is
easy to generate an artificial sample from F̂. Assuming the size of the sample generated is
n, the sample is

X* = {X_1*, X_2*, ..., X_n*}.     (2.2)

Thus we have an empirical distribution F̂* of the empirical distribution F̂, where

F̂* : mass 1/n at x_1*, x_2*, ..., x_n*,

and the corresponding θ̂* = θ(F̂*) is the estimate of θ̂. Thus an estimator of the bias is

Bias* = E*[θ̂* - θ̂] = E*[θ(F̂*) - θ(F̂)],     (2.3)

where E* is the conditional expectation given {X_1 = x_1, X_2 = x_2, ..., X_n = x_n}. The
procedure to estimate the bias described above is called the Bootstrap. An example
to clarify issues is given below.
Example 2.1

Consider a simple case, in which θ is the expectation of the unknown distribution
F,

θ = E_F[X].

Then the estimate of θ is

θ̂ = θ(F̂) = E_F̂[X] = x̄,

where x̄ = (1/n) Σ_i x_i is the sample mean. The observed values {x_1*, x_2*, ..., x_n*} of a
bootstrap sample are obtained by drawing n indices into the original sample with a
random generator distributed uniformly on {1, 2, ..., n}. The corresponding estimate is

θ̂* = x̄*,

where x̄* = (1/n) Σ_i x_i* is the mean of the bootstrap sample. The procedure for generating a
bootstrap sample is repeated B times, and there are B values of θ̂* obtained from the B
bootstrap samples {x_1^b*, x_2^b*, ..., x_n^b*}, b = 1, 2, ..., B,

θ̂^b = (1/n) Σ_i x_i^b*.

The bootstrap estimate of θ̂ is the average of these θ̂^b, so,

θ̂** = (1/B) Σ_b θ̂^b.

Thus, the bootstrap estimate of the bias x̄ - θ is

Bias* = θ̂** - x̄.

The procedure presented above is a general bootstrap procedure, called the basic or
simple bootstrap.
In summary, the bootstrap method takes the strategy of replacing F with F̂, and F̂
with F̂*, in estimating a parameter θ(F) with unknown distribution F. A typical
bootstrap procedure, which takes the above steps, is given formally below.

Algorithm 2.1 Basic Bootstrap
Input: (i) n, the size of the training sample;
       (ii) x[1], x[2], ..., x[n], the sample data of the training sample;
       (iii) B, the number of times bootstrap resampling is repeated.
Output: the bootstrap estimator.
Method
BEGIN
  bootstrap-estimator = 0
  For (i = 1 to B)
    For (j = 1 to n)                                                    (2.6)
      m = random integer in [1, n]
      y[j] = x[m]
    End-For
    bootstrap-estimator = bootstrap-estimator + Q(y[1], y[2], ..., y[n])   (2.7)
  End-For
  bootstrap-estimator = bootstrap-estimator / B
  Return bootstrap-estimator
END Basic Bootstrap
Note, the For loop (2.6) generates a bootstrap sample, i.e. a sample from the
empirical distribution F̂, and Q(y[1], y[2], ..., y[n]) in (2.7) calculates a bootstrap
estimator of the parameter Q with the bootstrap sample. The question of how to calculate
Q(y[1], y[2], ..., y[n]) depends on the estimating formula of the parameter Q. For
instance, in Example 2.1, where we tried to estimate the bias of the mean, it is the mean
of the bootstrap sample.

Efron suggested choosing B = 200 as the number of times the bootstrap
resampling is repeated [E83], which is quite adequate for most purposes.
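Algorithm 2.1 translates almost line for line into a modern language. The following is a minimal Python sketch (not part of the thesis), taking the statistic Q to be the sample mean and using Efron's suggested B = 200:

```python
import random

def basic_bootstrap(x, stat, B=200):
    """Average of the statistic `stat` over B bootstrap resamples of x."""
    n = len(x)
    total = 0.0
    for _ in range(B):
        # For-loop (2.6): draw a bootstrap sample from the empirical distribution
        y = [x[random.randrange(n)] for _ in range(n)]
        # Step (2.7): evaluate the statistic on the bootstrap sample
        total += stat(y)
    return total / B

random.seed(0)
x = [1.2, 0.7, 2.5, 1.9, 0.3, 1.1, 2.2, 0.8]
mean = lambda s: sum(s) / len(s)
print(basic_bootstrap(x, mean))   # close to the sample mean of x
```

Any other statistic (median, variance, a classifier's error estimate) can be substituted for `mean` without touching the resampling loop, which is the point of the generic formulation above.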
2.3 Bootstrap Estimators

2.3.1 Bias

In this section we will discuss the bootstrap approaches for estimating the bias,
variance, and the quantile. Although the bootstrap method is regarded as a distribution-
free method, it is also possible to have a parameterized bootstrap. The concept of the
parametric bootstrap will also be introduced in this section, in order to present an
alternative view of the bootstrap techniques. At the end of the section, we will present a
bootstrap procedure, the SIMDAT algorithm, for density function simulation, which
shows an example of a more involved bootstrap procedure. The SIMDAT algorithm will
be applied in subsequent chapters to some related problems of statistical pattern
recognition.
As defined in Section 2.2, the bias of an estimate θ(F̂) is Bias = E_F[θ̂ - θ].
Replacing F with F̂, and F̂ with F̂*, to get the corresponding θ̂*, the bootstrap estimate
of the bias will be

Bias* = E*[θ̂* - θ̂].     (2.8)

Comparing this expression to (2.3), we can substitute (1/B) Σ_b θ̂^b for E*[θ̂*] here, which
is analogous to using x̄ for estimating E_F[X].

In order to apply (2.8), it is not necessary to require θ̂* to be "the same statistic as
θ̂" [E82]. We can take θ̂* to be the sample median while θ̂ can be set to be E_F̂[X]. Hence
the bootstrap technique provides more flexibility to yield an estimator. If θ̂ is an estimate
of θ and a bootstrap estimate of the bias (2.3) is provided, a bootstrap estimate of θ can be
obtained by correcting the bias of the original estimate as:

θ̂_BC = θ̂ - Bias*.     (2.9)

It will be seen, in Chapters 3 and 4, how this strategy can be used to estimate the
error bound and error rate of a classifier.
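The bias estimate (2.8) and the bias-correction strategy described above can be sketched as follows; a hedged Python illustration in which θ̂ is the plug-in variance (1/n) Σ_i (x_i - x̄)², chosen only because its true bias, -σ²/n, is known and therefore provides a check:

```python
import random

def plugin_var(s):
    """The (biased) plug-in variance estimate (1/n) * sum (x_i - xbar)^2."""
    m = sum(s) / len(s)
    return sum((v - m) ** 2 for v in s) / len(s)

def bootstrap_bias_corrected(x, stat, B=1000):
    """Estimate Bias* = E*[theta*] - theta_hat as in (2.8), then return the
    bias-corrected estimate theta_hat - Bias*."""
    n = len(x)
    theta_hat = stat(x)
    boot_mean = sum(
        stat([x[random.randrange(n)] for _ in range(n)]) for _ in range(B)
    ) / B
    bias_star = boot_mean - theta_hat      # bootstrap estimate of the bias
    return theta_hat - bias_star           # bias-corrected estimate

random.seed(1)
x = [random.gauss(0.0, 1.0) for _ in range(30)]
print(plugin_var(x), bootstrap_bias_corrected(x, plugin_var))
```

The corrected value should sit close to (1 + 1/n) times the plug-in estimate, because the bootstrap reproduces the (n - 1)/n deflation of the plug-in variance.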
2.3.2 Variance

In numerous applications, and from a theoretical point of view, there is usually an
interest in estimating the variance of a statistic θ̂. In general, it is not easy to give a
variance estimate for a statistic such as s² = (1/n) Σ_i (x_i - x̄)², while the variance of the
statistic x̄ is shown to be

VAR(x̄) = σ²/n.

Generally it is hard to compute the variance VAR(θ̂). With the bootstrap method,
VAR(θ̂) can be estimated as

VAR*(θ̂) = (1/(B - 1)) Σ_b (θ̂^b - θ̂**)²,     (2.10)

where θ̂** = (1/B) Σ_b θ̂^b, after obtaining each θ̂^b from the bootstrap sample {x_1^b*,
x_2^b*, ..., x_n^b*}.

Example 2.2

For the standard deviation statistic θ̂ = S = ((1/n) Σ_i (x_i - x̄)²)^(1/2), the
corresponding bootstrap statistic is

S^b = (Σ_i w_i^b* (x_i - x̄^b*)²)^(1/2).     (2.11)

The w_i^b* on the right side is the weight of x_i in the bootstrap sample x^b* = (x_1^b*,
x_2^b*, ..., x_n^b*), which equals the number n_i^b* of times x_i appears in x^b* divided by
the bootstrap sample size n:

w_i^b* = n_i^b* / n.     (2.12)

Also, x̄^b* is the mean of the bootstrap sample x^b*,

x̄^b* = Σ_i w_i^b* x_i.     (2.13)

After obtaining all {S^b : b = 1, 2, ..., B}, the bootstrap variance estimate of S is just a
direct computation of (2.10),

VAR*(S) = (1/(B - 1)) Σ_b (S^b - S̄*)²,

where S̄* = (1/B) Σ_b S^b.

The expressions (2.11) - (2.13) imply that the bootstrap method provides a
scheme by which we can assign weights to the sample data, which leads to the other
resampling strategies discussed later in this chapter.
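The bootstrap variance estimate (2.10) requires nothing beyond the replications θ̂^b; a small Python sketch, applied here to the sample median, whose variance has no simple closed form:

```python
import random, statistics

def bootstrap_variance(x, stat, B=500):
    """Bootstrap estimate of VAR(theta_hat): the sample variance of the
    B bootstrap replications theta^b, as in (2.10)."""
    n = len(x)
    thetas = []
    for _ in range(B):
        y = [x[random.randrange(n)] for _ in range(n)]
        thetas.append(stat(y))
    # statistics.variance divides by B - 1, matching (2.10)
    return statistics.variance(thetas)

random.seed(2)
x = [random.gauss(10.0, 2.0) for _ in range(25)]
print(bootstrap_variance(x, statistics.median))
```

Replacing `statistics.median` by the sample mean gives values close to the theoretical σ²/n, which is a quick sanity check for the procedure.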
2.3.3 Quantile

Another quantity of considerable interest in statistics is the confidence interval of
θ̂. Typically, with a given value θ_0, a question of pertinence is the probability

P(θ̂ ≤ θ_0).     (2.14)

Alternatively, with a given level α, the question of interest is the boundary θ_0 such that

P(θ̂ ≤ θ_0) = α.     (2.15)

It is very hard to answer the above questions in most cases if the distribution F is
unknown. With the bootstrap approach, the bootstrap samples can first be used to obtain a
set of estimates {θ̂^b : b = 1, 2, ..., B}, and then these {θ̂^b : b = 1, 2, ..., B} are just
taken as a sample of θ̂. The estimate of (2.14) will thus be

P*(θ̂ ≤ θ_0) = #{b : θ̂^b ≤ θ_0} / B.

By sorting {θ̂^b : b = 1, 2, ..., B} as below:

θ̂^(1) ≤ θ̂^(2) ≤ ... ≤ θ̂^(B),

the estimator of θ_0 in (2.15) will be

θ̂^(⌊αB⌋),

where ⌊·⌋ is the floor function.
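The sorting scheme above can be sketched directly; a minimal Python illustration that reads the boundary of (2.15) off the ordered bootstrap replications (the choice of the sample mean as θ̂ is only for illustration):

```python
import random

def bootstrap_quantile(x, stat, alpha, B=1000):
    """Estimate the boundary theta_0 with P(theta_hat <= theta_0) ~ alpha:
    sort the B replications and take the floor(alpha*B)-th order statistic."""
    n = len(x)
    thetas = sorted(
        stat([x[random.randrange(n)] for _ in range(n)]) for _ in range(B)
    )
    return thetas[max(0, int(alpha * B) - 1)]

random.seed(3)
x = [random.gauss(0.0, 1.0) for _ in range(40)]
mean = lambda s: sum(s) / len(s)
lo = bootstrap_quantile(x, mean, 0.05)
hi = bootstrap_quantile(x, mean, 0.95)
print(lo, hi)   # the two boundaries form a 90% percentile interval for the mean
```

Taking the two boundaries at levels α and 1 - α in this way is exactly the percentile confidence interval built from the bootstrap distribution.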
2.3.4 Parametric Bootstrap

Though it is a distribution-free approach, the bootstrap can also be used in a
parametric way. Consider the case when the sample data X = {X_1, X_2, ..., X_n} follows a
normal distribution F = N(μ, Σ), where μ and Σ are unknown parameters. Then the
parametric maximum likelihood estimator of F is

F̂_NORMAL = N(μ̂, Σ̂).

Now the bootstrap procedure is carried out exactly as described above, except that
F̂_NORMAL takes the place of F̂ in (2.6). Thus the loop (2.6) in Algorithm 2.1 is now
changed to

For (j = 1 to n)
  y[j] = random vector drawn from N(μ̂, Σ̂)
End-For

Therefore the bootstrap sample will be drawn from F̂_NORMAL, which means that the
bootstrap is not merely putting weights on the data X = {X_1, X_2, ..., X_n}.
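The parametric variant changes only how the resample is drawn; a univariate Python sketch, assuming the fitted model is N(μ̂, σ̂²) with maximum likelihood estimates taking the place of the empirical distribution in loop (2.6):

```python
import math, random

def parametric_bootstrap(x, stat, B=500):
    """Resample from the fitted normal N(mu_hat, sigma_hat^2) instead of
    from the empirical distribution (the modified loop (2.6))."""
    n = len(x)
    mu = sum(x) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in x) / n)   # ML estimates
    total = 0.0
    for _ in range(B):
        y = [random.gauss(mu, sigma) for _ in range(n)]    # parametric resample
        total += stat(y)
    return total / B

random.seed(4)
x = [random.gauss(5.0, 1.5) for _ in range(20)]
mean = lambda s: sum(s) / len(s)
print(parametric_bootstrap(x, mean))   # close to mu_hat
```

The only design change from the basic scheme is the sampler; everything downstream (bias, variance, quantile estimation) is unchanged.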
2.3.5 Density Function

The density function is, of course, of great interest in statistics, because numerous
problems in statistics would be solved if the density function were known. Unfortunately,
in almost all statistical problems, the density functions involved are unknown or, at most,
only a model or an explicit form of the density functions is known. In statistical pattern
recognition, density functions play even more important roles, because knowing the density
functions of the classes will help us to build and evaluate the corresponding classifiers.
But the question of how to estimate a density function is still a significant problem,
especially in a small sample size case. It can be seen that the bootstrap provides a useful
tool for estimating density functions.

The algorithm presented here is due to Taylor and Thompson [TT92], and is called the
SIMDAT algorithm. The purpose of the SIMDAT algorithm is to provide a sampling
scheme which can generate a pseudo-data sample very close to one drawn from a
kernel density function estimator (2.18) in which the Σ_i in the kernel K(·) is a locally
estimated covariance matrix. The details of the SIMDAT algorithm are given below.
Algorithm 2.2 SIMDAT
Input: (i) n, the size of the training sample;
       (ii) x[1], x[2], ..., x[n], the sample vectors of the training sample;
       (iii) m, the number of nearest neighbors of each sample datum;
       (iv) N, the number of pseudo-sample vectors to be generated.
Output: z[1], z[2], ..., z[N], a set of pseudo-sample vectors.
Method
BEGIN
  Re-scale the sample data set x[1], x[2], ..., x[n] so that the marginal sample
  variances in each vector component are the same
  For (i = 1 to n)
    find the m nearest neighbors y[i, 1], y[i, 2], ..., y[i, m] of x[i]
    (including x[i] itself)
    y[i, 0] = (1/m) Σ_j y[i, j]
    For (j = 1 to m)
      y[i, j] = y[i, j] - y[i, 0]
    End-For
  End-For
  For (i = 1 to N)
    k = random integer in [1, n]
    z[i] = 0
    For (q = 1 to m)
      u = random number from U(1/m - √(3(m - 1))/m, 1/m + √(3(m - 1))/m)
      z[i] = z[i] + u · y[k, q]
    End-For
    z[i] = z[i] + y[k, 0]
  End-For
  Return z[1], z[2], ..., z[N]
END SIMDAT
There are three things done in the first For loop of the SIMDAT algorithm. They
are:

1. Find the m nearest neighbors of a sample vector, including the
sample vector itself;

2. Calculate the mean of the m nearest neighbors, which is put in y[i, 0];

3. Subtract the mean from each neighbor.

The third step above makes the neighbor sample vectors have a local
distribution with a mean of zero. In generating a pseudo-sample, a randomly weighted
linear combination of the nearest neighbors is formed, and then the local mean y[i, 0] is
added back. Here we see that the local mean y[i, 0] takes the place of the original sample
vector x[i] in the SIMDAT algorithm, which incorporates smoothness into the pseudo-
sample vectors.
Some properties of the pseudo-data sample are not difficult to see. First, it is easy
to calculate the expectations, variances and covariances of the uniform distribution
sample {u_j}, j = 1, ..., m; these quantities are:

E(u_j) = 1/m,  Var(u_j) = (m - 1)/m²,  Cov(u_i, u_j) = 0, for i ≠ j.

If we now treat the m nearest neighbors {y_1, y_2, ..., y_m} of a sample vector as a
sample from a truncated distribution with mean vector μ = (μ_1, μ_2, ..., μ_d)^T and
covariance matrix Σ = (σ_{i,j}), it can be seen that the linear combination

Z = Σ_j u_j y_j

will satisfy

E(Z) = μ

and

Cov(Z) = Σ + ((m - 1)/m) μ μ^T.

It is clear that the linear combination Z will have the same expectation and covariance as
the sample data {y_1, y_2, ..., y_m} if μ = 0. Therefore the pseudo-data sample generated
by the above procedure will be very close to that generated by the density function of
(2.18).
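The generation step of SIMDAT can be sketched compactly; a simplified univariate Python illustration (the re-scaling step is skipped, and the weights are assumed uniform on (1/m - √(3(m-1))/m, 1/m + √(3(m-1))/m), chosen so that E(u_j) = 1/m and Var(u_j) = (m - 1)/m² as listed above):

```python
import math, random

def simdat(x, m, N):
    """Generate N pseudo-data: a randomly weighted combination of the m
    nearest neighbours of a randomly chosen point, centred on the local
    mean and then re-centred (smoothed bootstrap in the style of SIMDAT)."""
    n = len(x)
    half = math.sqrt(3 * (m - 1)) / m      # half-width of the weight interval
    pseudo = []
    for _ in range(N):
        k = random.randrange(n)
        nbrs = sorted(x, key=lambda v: abs(v - x[k]))[:m]  # m nearest, incl. x[k]
        local_mean = sum(nbrs) / m
        centred = [v - local_mean for v in nbrs]           # zero local mean
        u = [random.uniform(1/m - half, 1/m + half) for _ in range(m)]
        pseudo.append(sum(ui * ci for ui, ci in zip(u, centred)) + local_mean)
    return pseudo

random.seed(5)
x = [random.gauss(0.0, 1.0) for _ in range(50)]
z = simdat(x, m=5, N=200)
print(sum(z) / len(z))   # pseudo-sample mean stays close to the sample mean
```

The multivariate version replaces the absolute-value distance by a vector norm and the scalar products by componentwise operations; the weighting logic is unchanged.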
2.4 Theoretical Aspects of the Bootstrap

In the previous section we presented several examples to show applications of the
bootstrap technique to different problems. Though the bootstrap is a very convenient and
appealing tool for data analysis, it is still necessary to confirm theoretically its suitability
for the problems at hand. There are already examples in the literature for which the
bootstrap has failed [Ma92]. Therefore we would like to know "when does the bootstrap
work?". Theorems 2.1 and 2.2 of this section answer the question within a large sample
context. To illustrate the large sample results, we use the following notation:
a) X_n = {X_{n,1}, X_{n,2}, ..., X_{n,n}} is an i.i.d. sample of size n with unknown
distribution F_n;

b) F̂_n is the empirical distribution of the sample X_n;

c) X_n* = {X_{n,1}*, X_{n,2}*, ..., X_{n,n}*} is a bootstrap sample, i.e. an i.i.d. sample
drawn from the empirical distribution F̂_n;

d) F̂_n* is the empirical distribution of the bootstrap sample X_n*;

e) θ_n(F) is a functional of interest. In the theorems given below it is a linear
functional

θ_n(F) = ∫ g_n(x) dF(x).

What the bootstrap procedure tries to do now is to estimate the distribution

P(θ_n(F̂_n) - θ_n(F_n) ≤ t).

The theorem below gives the necessary and sufficient conditions under which the
bootstrap procedure will be successful.
Theorem 2.1

Consider a sequence X_{n,1}, X_{n,2}, ..., X_{n,n} of i.i.d. variables with distribution F_n.
For a function g_n, draw a bootstrap sample X_{n,1}*, X_{n,2}*, ..., X_{n,n}* with empirical
distribution F̂_n*. Denote θ̂_n = θ_n(F̂_n), θ̂_n* = θ_n(F̂_n*), d_K to be the Kolmogorov
distance, and L*(·) to be the conditional law L(· | X_{n,1}, X_{n,2}, ..., X_{n,n}).
Then for every sequence t_n the following assertions are equivalent:

(i) θ̂_n is asymptotically normal: there exist σ_n with

d_K(L(θ̂_n - t_n), N(0, σ_n²)) → 0.

(ii) The normal approximation with estimated variance works:

d_K(L(θ̂_n - t_n), N(0, σ̂_n²)) → 0 (in probability),

where σ̂_n² = (1/n²) Σ_i (g_n(X_{n,i}) - θ̂_n)².

(iii) Bootstrap works:

d_K(L(θ̂_n - t_n), L*(θ̂_n* - θ̂_n)) → 0 (in probability).     (2.21)

If (i), (ii) or (iii) holds, then t_n can be chosen as the mean of the truncated variables,

t_n = μ + E[(g_n(X_{n,i}) - μ) 1(|g_n(X_{n,i}) - μ| ≤ n^(1/2))],

where μ is a median of the distribution of g_n(X_{n,i}).

The meaning of Theorem 2.1 is explicit, i.e. if a sequence t_n satisfies any one of the
conditions (i), (ii) and (iii), then it can be shown to satisfy the other two. The results of
Theorem 2.1 provide the condition under which the bootstrap works for i.i.d. models.
It implies that if θ_n(F̂_n) is a good simulation of θ_n(F_n), then θ_n(F̂_n*) - θ_n(F̂_n) will be a
good simulation of θ_n(F̂_n) - θ_n(F_n), and vice versa. After adding more conditions, the
bootstrap scheme can be shown to work under non-i.i.d. models as well, which is the
result of Theorem 2.2 given below.
Theorem 2.2

Consider a sequence X_{n,1}, X_{n,2}, ..., X_{n,n} of independent variables with
distributions F_{n,i}. For a function g_n, the θ̂_n and θ̂_n* are defined as in Theorem 2.1.
Then for every sequence t_n the following assertions are equivalent:

(i) There exist σ_n such that, for every ε > 0, conditions (2.22) and (2.23) hold.

(ii) The normal approximation with estimated variance works:

d_K(L(θ̂_n - t_n), N(0, σ̂_n²)) → 0 (in probability).

(iii) Bootstrap works:

d_K(L(θ̂_n - t_n), L*(θ̂_n* - θ̂_n)) → 0 (in probability).

Note that in Theorem 2.2, X_{n,1}, X_{n,2}, ..., X_{n,n} are not required to have the same
distribution, although they are still assumed independent. Since we forfeit the i.i.d.
property of the samples, the extra conditions (2.22) and (2.23) are required for ensuring
the consistency of the bootstrap.

The proofs of the above two theorems are omitted here as they are outside the
scope of the thesis. The reader is referred to Mammen's paper [Ma92] for more details.
2.5 Other Resampling Approaches

In Example 2.2 it was demonstrated that the bootstrap scheme, in some sense,
assigns weights to the sample data. This idea can be used to generalize the bootstrap in
the following way.

Let P* = (P_1*, P_2*, ..., P_n*) be any probability vector on the n-dimensional
simplex

S = {P* : P_i* ≥ 0, Σ_i P_i* = 1},     (2.24)

called a resampling vector [Ef82]. For a sample X = {X_1, X_2, ..., X_n}, a re-weighted
empirical probability distribution F̂* is defined with a resampling vector P* as

F̂* : mass P_i* on x_i, i = 1, 2, ..., n,     (2.25)

where {x_1, x_2, ..., x_n} are the observed values of the sample {X_1, X_2, ..., X_n}. Thus, a
resampled value of θ̂, say θ̂*, will be

θ̂* = θ(F̂*) = θ(P*).     (2.26)

As before, after repeatedly generating the resampling vector P* B times, a sample of θ̂
will be obtained, say {θ̂^b : b = 1, 2, ..., B}, which is just what was obtained in step 3 of
the basic bootstrap mentioned above. The basic bootstrap is, from this point of view, a
specific case of (2.25) - (2.26) in which the resampling vector P* takes the form

P_i* = n_i* / n,     (2.27)

where n_i* is the number of times x_i appears in a bootstrap sample. This means that P*
follows a scaled multinomial distribution,

P* ~ (1/n) Mult(n, P^0),

where P^0 = (1/n, 1/n, ..., 1/n) is an n-dimensional vector. In other words, it is possible to
execute a resampling procedure by choosing another P*. These arguments lead to the
Bayesian Bootstrap and the Random Weighting Method, introduced by Rubin [ST95] and
Zhen [Zh87] respectively.
2.5.1 Bayesian Bootstrap

Consider the case where an i.i.d. sample X = {X_1, X_2, ..., X_n} comes from a
distribution F(X | θ) with an unknown parameter θ which is of interest. In frequentist
statistical analysis the θ is regarded as a nonrandom constant. In Bayesian statistical
analysis the θ is taken as a random variable (or vector) with a prior distribution, and the
sample {X_1, X_2, ..., X_n} is considered as coming from the conditional distribution of X
given θ. The Bayesian estimator of θ is given in the form of the conditional expectation
of θ given {X_1, X_2, ..., X_n},

θ̂ = E(θ | X_1, X_2, ..., X_n),

where the conditional distribution F(θ | X_1, X_2, ..., X_n) is known as the posterior
distribution. Usually, posterior distributions are expressible only in terms of complicated
analytical functions, and it is hard to calculate the marginal distributions and moments of
posterior distributions. Commonly, a normal approximation to the posterior distribution is
suggested. Although it is easy to compute, this approximation lacks high-order accuracy.
To address this problem, Rubin introduced a nonparametric method called the Bayesian
Bootstrap under a specific assumption on the prior distribution. Here we discuss the basic
algorithm of the Bayesian bootstrap with a noninformative prior distribution. The
algorithm is formally given below.
Algorithm 2.3 Bayesian Bootstrap
Input: (i) n, the size of the training sample;
       (ii) x[1], x[2], ..., x[n], the sample data of the training sample;
       (iii) B, the number of times the bootstrap resampling is repeated.
Output: the Bayesian bootstrap estimator.
Method
BEGIN
  bootstrap-estimator = 0
  For (i = 1 to B)
    u[n] = 1
    For (j = 1 to n - 1)
      u[j] = random number from U[0, 1]
    End-For
    Sort u[1], u[2], ..., u[n]
    For (j = n to 2)
      P[j] = u[j] - u[j - 1]
    End-For
    P[1] = u[1]
    calculate a Bayesian bootstrap estimating value θ̂* = θ(F̂*)
    bootstrap-estimator = bootstrap-estimator + θ̂*
  End-For
  bootstrap-estimator = bootstrap-estimator / B
  Return bootstrap-estimator
END Bayesian Bootstrap
In the algorithm, an i.i.d. random sample of size n - 1 from the uniform
distribution U[0, 1] is first generated. After obtaining the order statistics

0 ≤ u[1] ≤ u[2] ≤ ... ≤ u[n] = 1

by sorting the uniform random sample, the component P[j] of the resampling vector is
assigned the value of the difference u[j] - u[j - 1] of the uniform order statistics, for j = 2,
..., n, and P[1] = u[1] is the difference u[1] - 0. Thus, the re-weighted empirical
probability F̂* defined in (2.25) is given by the algorithm as

F̂* : mass P[i] on x[i], i = 1, 2, ..., n.

Accordingly, a calculation of θ̂* = θ(F̂*) is carried out.

From the algorithm given above, the relationship between the prior distribution
and the resampling vector P* defined in (2.31) is not obvious. To clarify issues, consider a
very simple case in which the sample {X_1, X_2, ..., X_n} follows a 0-1 distribution,
P(X = 1) = θ and P(X = 0) = 1 - θ. Let m = Σ_i x_i, where {x_1, x_2, ..., x_n} are the
observation values. Under the assumption of a noninformative prior distribution, the
posterior density function will be

c θ^m (1 - θ)^(n-m),

where c is a constant. Therefore the posterior distribution is a Beta distribution. On the
other hand, the difference of the order statistics of the uniform distribution U[0, 1] also
has the form of a Beta distribution (see [PC94]). Thus the resampling vector P* of the
Bayesian bootstrap fits the noninformative prior distribution case.
Example 2.3

Consider the same problem discussed in Example 2.1. We use the Bayesian
bootstrap method to estimate the bias x̄ - θ, where x̄ is the sample mean and θ is the
expectation of the unknown distribution F. Now we draw an i.i.d. sample of size n - 1
from the uniform distribution U[0, 1], say {u_1, u_2, ..., u_{n-1}}, and get the order
statistics by sorting the sample,

0 = u_(0) ≤ u_(1) ≤ u_(2) ≤ ... ≤ u_(n-1) ≤ u_(n) = 1.

After this, a resampling vector P* is made by assigning each of its components the value

P_i* = u_(i) - u_(i-1).     (2.31)

This time the corresponding resampling value of θ̂ will be

θ̂* = Σ_i P_i* x_i.

Repeatedly drawing B samples of size n - 1 from the uniform distribution U[0, 1], we
obtain B resampling vectors P^b* = (P_1^b*, P_2^b*, ..., P_n^b*), b = 1, 2, ..., B, and so we
get B resampling values of θ̂,

θ̂^b = Σ_i P_i^b* x_i.

The Bayesian bootstrap estimator of θ̂ is the average of these θ̂^b,

θ̂** = (1/B) Σ_b θ̂^b.

Therefore the Bayesian bootstrap estimator of the bias x̄ - θ is

Bias* = θ̂** - x̄.

It can be seen that the basic concept of the Bayesian bootstrap algorithm is almost
the same as that of the basic bootstrap algorithm, except that the resampling vector P* is
continuously distributed in the Bayesian bootstrap while it is discretely distributed in the
basic bootstrap case. Hence the Bayesian bootstrap algorithm can be regarded as a
smoothed version of the basic bootstrap scheme.
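Algorithm 2.3 can be sketched in a few lines; a minimal Python illustration for the setting of Example 2.3, where the statistic is the weighted mean Σ_i P_i x_i:

```python
import random

def bayesian_bootstrap_mean(x, B=1000):
    """Bayesian bootstrap estimate of the mean: average of sum_i P_i * x_i
    over B continuous resampling vectors built from uniform order-statistic
    gaps (which sum to 1 by construction)."""
    n = len(x)
    total = 0.0
    for _ in range(B):
        u = sorted(random.random() for _ in range(n - 1)) + [1.0]
        p = [u[0]] + [u[j] - u[j - 1] for j in range(1, n)]   # gaps sum to 1
        total += sum(pi * xi for pi, xi in zip(p, x))
    return total / B

random.seed(6)
x = [2.0, 4.0, 1.0, 3.0, 5.0, 2.5, 3.5, 4.5]
print(bayesian_bootstrap_mean(x))   # close to the sample mean 3.1875
```

Note that no bootstrap sample is ever drawn here: the randomness enters only through the continuous weight vector, which is what makes the scheme a smoothed version of the basic bootstrap.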
2.5.2 Random Weighting Method

It can be argued that the method discussed above does not necessarily have to be
related to prior distributions. With some insight it can be seen that any continuously
distributed resampling vector P* can be used to get a re-weighted empirical
probability distribution F̂* as given in (2.25). In other words, we can randomly generate
the weighting vector P* to get the resampling empirical distribution F̂*. This concept was
introduced by Zhen (1987) and is called the Random Weighting Method. The random
weighting vector P* can be obtained by the following scheme, given formally below.
Algorithm 2.4 Random Weighting Method
Input: (i) n, the size of the training sample;
       (ii) x[1], x[2], ..., x[n], the sample data of the training sample;
       (iii) B, the number of times the resampling is repeated.
Output: the random weighting estimator.
Method
BEGIN
  random-weighting-estimator = 0
  For (i = 1 to B)
    z[0] = 0
    For (j = 1 to n)
      z[j] = random number from a nonnegative distribution P
      z[0] = z[0] + z[j]
    End-For
    For (j = 1 to n)
      P[j] = z[j] / z[0]
    End-For
    calculate a random weighting estimating value θ̂* = θ(F̂*)
    random-weighting-estimator = random-weighting-estimator + θ̂*
  End-For
  random-weighting-estimator = random-weighting-estimator / B
  Return random-weighting-estimator
END Random Weighting Method

In this algorithm, an i.i.d. random sample of size n is first generated from a
nonnegative distribution P. Then a resampling vector P* is obtained by assigning its
component P[j] the normalized value z[j] / z[0], for j = 1, 2, ..., n. Similar to the
Bayesian bootstrap, the re-weighted empirical probability distribution F̂* is given by

F̂* : mass P[i] on x[i], i = 1, 2, ..., n.

Then a calculation of θ̂* = θ(F̂*) is carried out accordingly.

One of the simplest ways is to take an i.i.d. sample {z_1, z_2, ..., z_n} from the
uniform distribution U[0, 1] to construct the random weighting vector P*. Indeed, the
uniform distribution U[0, 1] will be used for the random weighting method simulations
throughout this thesis. The other steps to carry out a random weighting estimator are
similar to those of the basic bootstrap and the Bayesian bootstrap.
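Algorithm 2.4 differs from the Bayesian bootstrap only in how the weights are produced; a minimal Python sketch normalizing an i.i.d. U[0, 1] sample into the weighting vector:

```python
import random

def random_weighting_estimate(x, weighted_stat, B=1000):
    """Random Weighting Method: draw nonnegative z_j ~ U[0, 1], normalise
    to P_j = z_j / sum(z), evaluate the weighted statistic, average over B."""
    n = len(x)
    total = 0.0
    for _ in range(B):
        z = [random.random() for _ in range(n)]
        s = sum(z)
        p = [zj / s for zj in z]          # resampling vector on the simplex
        total += weighted_stat(x, p)
    return total / B

def weighted_mean(x, p):
    return sum(pi * xi for pi, xi in zip(p, x))

random.seed(7)
x = [1.0, 2.0, 3.0, 4.0]
print(random_weighting_estimate(x, weighted_mean))   # close to 2.5
```

Any other nonnegative weight distribution (exponential weights, for instance) could be substituted for `random.random` without changing the rest of the scheme.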
If we now examine expressions (2.11) - (2.13) given in Example 2.2, it can be
seen that the w_i^b* used in the bootstrap can also be regarded as, in some sense, a random
weighting method. Although the Bayesian bootstrap is another random weighting method,
in this thesis the phrase "random weighting method" will only be used for the case of the
random weighting vector constructed with a sample from the uniform distribution
U[0, 1].

A few results related to the theoretical aspects of the Bayesian bootstrap and
random weighting method are available in the literature. Under certain assumptions, both
the Bayesian bootstrap and random weighting method have excellent analytic properties,
such as consistency, and both are approximately normally distributed. As this is outside
the scope of this thesis, these concepts will not be discussed here. The reader is referred
to [ST95] for more details.
In this chapter, we introduced the concept of bootstrap and discussed some details
of the bootstrap technique. The idea of the bootstrap technique is simple and elegant: it
involves "simulating the simulator". What the bootstrap mainly does is to draw a random
sample, called a bootstrap sample, from the empirical distribution. It then combines the
bootstrap and original samples together to obtain the estimates sought for. With this
technique, we are able to extract some information about the relationship between the true
distribution F and the empirical distribution F̂, such as the bias, standard deviation, and
the empirical distribution of a parameter estimate.

While Algorithm 2.1 is a general illustration of the basic bootstrap procedure, the
details of calculating the bootstrap estimates are given individually in the examples of
Section 2.3. Though the bootstrap calculations are usually similar to those of traditional
estimation (as shown in Examples 2.1 and 2.2), schemes more complex than traditional
estimation, such as the SIMDAT algorithm, are possible. The idea of using pseudo-
samples in the SIMDAT algorithm is very interesting; it provides a way by which we are
able to obtain "extra" sample data. In Chapters 4 and 5, we will extend this idea to the
problems of error rate estimation and of constructing classifiers.

The results of Theorems 2.1 and 2.2 provide, at least, some theoretical
foundation for the bootstrap technique, though they only considered the one-
dimensional linear functional cases. Of course, in practice, we would pay more attention
to small sample cases, but the large sample properties are necessary for theoretical
purposes.

The discussion of the Bayesian bootstrap and random weighting method provided
alternative schemes to obtain the re-weighted empirical distribution F̂*. Instead of using
discrete resampling vectors as in the basic bootstrap algorithm, the Bayesian bootstrap
and random weighting method use continuous resampling vectors. Because of this, there
is no bootstrap sample drawn in either of these schemes. We calculate the bootstrap
estimate by assigning a random weight to each corresponding item of the sample data.
Both of the schemes will be used to estimate Bhattacharyya bounds and error rates of
classifiers in the next two chapters. They will also be applied to construct classifiers in
Chapter 5. It will be seen that the performances of the Bayesian bootstrap and random
weighting methods are very similar, although they are different from that of the basic
bootstrap.
CHAPTER 3
THE BHATTACHARYYA BOUND
3.1 Introduction
In a statistical pattern recognition problem, a random vector X, called a pattern, is
composed of a pair

X = (Y, ω),     (3.1)

where Y is a d-dimensional vector called the feature vector, and ω is a variable
representing a class. Generally, a feature vector Y can take a continuous value in a
d-dimensional space and a class variable ω can take one of a finite set of values. In this
thesis, we consider only the two-class case, i.e. ω ∈ {1, 2}. The prior probabilities of the
two classes are assumed known,

P_1 = P(ω = 1), and P_2 = P(ω = 2).

For convenience, in later discussions it is always assumed that the prior probabilities of
the two classes are equal, and so,

P_1 = P_2 = 1/2.     (3.2)

The conditional density of a feature vector given ω is denoted as p_ω(y), ω = 1, 2.
According to Bayes' theorem, the posterior probability of ω given y is

q_ω(y) = P_ω p_ω(y) / p(y),     (3.3)

where p(y) = P_1 p_1(y) + P_2 p_2(y) is the mixture density function and is a constant
independent of ω.
The purpose of pattern recognition is to determine whether a given feature vector
y belongs to class 1 or class 2. In other words, the aim is to predict the value of ω from a
given feature vector y. A decision rule based on probabilities is to maximize the posterior
probability for the given feature vector y. Thus, the classification rule is:

ω = 1  if q_1(y) ≥ q_2(y);
ω = 2  if q_1(y) < q_2(y).     (3.4)

Because p(y) is positive, from (3.3), the above rule can be equally expressed as

ω = 1  if P_1 p_1(y) ≥ P_2 p_2(y);
ω = 2  if P_1 p_1(y) < P_2 p_2(y).     (3.5)

The rule expressed by (3.4) or (3.5) is called the Bayesian Decision Rule. When a
feature vector is classified to class 1 by the Bayesian Decision Rule, it means that there is
a better chance of having ω = 1. This does not mean, however, that ω is exactly equal to
1, as there is always a possibility of making an error in the classification process. The
conditional error probability caused by rule (3.4) or (3.5) is

ε(y) = min(q_1(y), q_2(y)).     (3.6)

The Bayesian error is the average of ε(y),

ε = E[ε(Y)] = ∫ min(P_1 p_1(y), P_2 p_2(y)) dy.     (3.7)
Equation (3.7) is the crucial measure of the performance of the decision rule.
Unfortunately, it is generally very hard to directly calculate a Bayesian error even if
the conditional densities p_1(y) and p_2(y) are known. Therefore, many researchers
have devised upper bounds for (3.7) which are easier to calculate and estimate. Two such
bounds, the Chernoff and Bhattacharyya bounds, are the most well known and will
be discussed in the second section. As the chapter title indicates, the Bhattacharyya bound
is the main topic discussed here. Thus, the third section will consider the general
approach of how to estimate the Bhattacharyya bound and the theoretical properties of the
estimate. Simulation results will be given in order to help us understand the properties of
the estimates. In Sections 3.4 and 3.5, we shall attempt to estimate the Bhattacharyya
bound by utilizing bootstrap techniques. The algorithms of the bootstrap approaches are
illustrated there along with the discussion of the simulation results. Although the problem
of estimating the Chernoff bound is not considered here, it is easy to see that the
approaches to Bhattacharyya bound estimation discussed here could be applied to the
Chernoff bound as well.

The data set used for all the experiments in this chapter is the set referred to as
Data Set 1, which consists of training samples for seven classes. Each class has a
2-dimensional normal distribution, and a training sample size of eight. Only two classes
are involved in each experiment: one is the class A, and the other is selected from the
rest. Hence, there are, in total, six experimental pairs of classes, (A, B), (A, C), (A, D),
(A, E), (A, F), and (A, G). With each class pair, an experiment will repeatedly do 200
trials of simulation for an algorithm. The results of the experiments given in this chapter
are the statistics from the 200 trials.

As only the case of two classes is discussed in this thesis, for simplification, it is
preferred to use the notation Y_ω for a feature vector from class ω. So Y_1 represents a
sample datum from class 1, and Y_2 represents a sample datum from class 2.
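For concreteness, the decision rule (3.5) and the Bayesian error (3.7) introduced above can be sketched numerically; a univariate Python illustration (not part of the thesis's experiments) with two normal class densities, equal priors, and a Monte Carlo estimate of ∫ min(P_1 p_1(y), P_2 p_2(y)) dy obtained by sampling from the mixture:

```python
import math, random

def normal_pdf(y, mu, sigma):
    return math.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def bayes_classify(y, p1, p2, P1=0.5, P2=0.5):
    """Rule (3.5): assign class 1 iff P1*p1(y) >= P2*p2(y)."""
    return 1 if P1 * p1(y) >= P2 * p2(y) else 2

def bayes_error_mc(p1, p2, sampler, trials=20000, P1=0.5, P2=0.5):
    """Monte Carlo estimate of (3.7): average of the conditional error (3.6),
    min(q1(Y), q2(Y)), over draws Y from the mixture density."""
    err = 0.0
    for _ in range(trials):
        y = sampler()
        mix = P1 * p1(y) + P2 * p2(y)
        err += min(P1 * p1(y), P2 * p2(y)) / mix
    return err / trials

# Example classes: N(0, 1) versus N(2, 1); the mixture is sampled by a coin flip
random.seed(8)
p1 = lambda y: normal_pdf(y, 0.0, 1.0)
p2 = lambda y: normal_pdf(y, 2.0, 1.0)
sampler = lambda: random.gauss(0.0, 1.0) if random.random() < 0.5 else random.gauss(2.0, 1.0)
print(bayes_classify(-1.0, p1, p2), bayes_error_mc(p1, p2, sampler))
```

For N(0, 1) versus N(2, 1) with equal priors, the true Bayes error is Φ(-1) ≈ 0.159, so the Monte Carlo value can be checked against it.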
3 3 Chernoff and Bhattacharrya Bounds
33.1 Cbernofï Bound
For aay mal numbers, a, b t O, it is well h o w n that the following inequality
holds:
min (a, b ) ~ d b ' * , O a s s 1. (3.8)
Applying inequality (3.8) to (3 3, we have
The right side of the abve inqualit-
is cailed the Chemoff bound. The optimum s is the value that minimizes the vdw of eu.
When the conditional density functions are normal distributions,

p_j(X) = N_j(M_j, \Sigma_j), \quad j = 1, 2,

the integration part of (3.9) can be expressed as

\int p_1^s(X)\, p_2^{1-s}(X)\, dX = e^{-\mu(s)}, \qquad (3.10)

where

\mu(s) = \frac{s(1-s)}{2} (M_2 - M_1)^T \left[ s\Sigma_1 + (1-s)\Sigma_2 \right]^{-1} (M_2 - M_1) + \frac{1}{2} \ln \frac{\left| s\Sigma_1 + (1-s)\Sigma_2 \right|}{|\Sigma_1|^s |\Sigma_2|^{1-s}}. \qquad (3.11)

The quantity μ(s) is called the Chernoff distance. It is obvious that the optimum s will maximize the value of μ(s), which can be obtained by studying the variation of μ(s) for various values of s for the given M_j and Σ_j (j = 1, 2).
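To make the search for the optimum s concrete, the following sketch evaluates the Chernoff distance μ(s) of (3.11) on a grid. The class parameters are hypothetical (two-dimensional normals with identity covariances and a mean difference of (0.5, 0.5)); they are chosen for illustration only, and the 2x2 linear-algebra helpers are written out so the example needs no external library. With equal covariances the second term of (3.11) vanishes, and the grid search finds the maximum at s = 1/2.

```python
import math

# 2x2 helpers (enough for the two-dimensional classes used in this chapter)
def mat_mix(A, B, wa, wb):
    # wa*A + wb*B for 2x2 matrices
    return [[wa*A[0][0]+wb*B[0][0], wa*A[0][1]+wb*B[0][1]],
            [wa*A[1][0]+wb*B[1][0], wa*A[1][1]+wb*B[1][1]]]

def det2(A):
    return A[0][0]*A[1][1] - A[0][1]*A[1][0]

def inv2(A):
    d = det2(A)
    return [[A[1][1]/d, -A[0][1]/d], [-A[1][0]/d, A[0][0]/d]]

def quad_form(v, A):
    # v^T A v for a 2-vector v
    return (v[0]*(A[0][0]*v[0] + A[0][1]*v[1]) +
            v[1]*(A[1][0]*v[0] + A[1][1]*v[1]))

def chernoff_distance(m1, s1, m2, s2, s):
    # mu(s) of equation (3.11)
    dm = [m2[0]-m1[0], m2[1]-m1[1]]
    mix = mat_mix(s1, s2, s, 1.0 - s)
    term1 = 0.5*s*(1.0 - s)*quad_form(dm, inv2(mix))
    term2 = 0.5*math.log(det2(mix) / (det2(s1)**s * det2(s2)**(1.0 - s)))
    return term1 + term2

# hypothetical class pair: identity covariances, mean difference (0.5, 0.5)
I = [[1.0, 0.0], [0.0, 1.0]]
best_s = max((chernoff_distance([0, 0], I, [0.5, 0.5], I, s/100.0), s/100.0)
             for s in range(1, 100))[1]
print(best_s)   # equal covariances -> the optimum is s = 1/2
```

For unequal covariances the maximizing s moves away from 1/2, which is exactly the case in which the Chernoff bound is tighter than the Bhattacharyya bound discussed next.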
3.2.2 Bhattacharyya Bound

The Bhattacharyya bound results when the Chernoff bound is simplified by selecting the value s = 1/2. In such a case, (3.9) is simplified to

\varepsilon_u = \sqrt{P_1 P_2} \int \sqrt{p_1(X)\, p_2(X)}\, dX = \sqrt{P_1 P_2}\, e^{-\mu(1/2)}. \qquad (3.12)

The upper bound given by (3.12) is called the Bhattacharyya bound. In our case, it is assumed that P_1 = P_2 = 1/2; therefore, the Bhattacharyya bound takes the form:

\varepsilon_u = \frac{1}{2} e^{-\mu(1/2)}. \qquad (3.13)

Correspondingly, the Bhattacharyya distance is:

\mu(1/2) = \frac{1}{8} (M_2 - M_1)^T \left[ \frac{\Sigma_1 + \Sigma_2}{2} \right]^{-1} (M_2 - M_1) + \frac{1}{2} \ln \frac{\left| \frac{\Sigma_1 + \Sigma_2}{2} \right|}{\sqrt{|\Sigma_1|\, |\Sigma_2|}}. \qquad (3.14)
There are two terms on the right side of the equation in (3.14), as well as in (3.11). It is obvious that the first term will be zero if M_1 = M_2, and the second term will be zero if Σ_1 = Σ_2. So the first term reflects the class separability caused by the difference of the means, while the second term reflects the class separability caused by the difference of the covariances. It is not difficult to calculate the Bhattacharyya bound in a normal distribution case when the parameters M_j and Σ_j (j = 1, 2) are given.
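The two terms of (3.14) and the bound (3.13) can be computed directly. The sketch below is written for the two-dimensional case used in this chapter; the class parameters are hypothetical (identity covariances, mean difference (0.5, 0.5)), chosen because this pair reproduces the Bhattacharyya distance 0.0625 listed for (A, B) in TABLE 3.1; the actual means of Data Set 1 are not given in this chapter.

```python
import math

def det2(A):
    return A[0][0]*A[1][1] - A[0][1]*A[1][0]

def inv2(A):
    d = det2(A)
    return [[A[1][1]/d, -A[0][1]/d], [-A[1][0]/d, A[0][0]/d]]

def bhattacharyya_distance(m1, s1, m2, s2):
    # mu(1/2), equation (3.14): mean term plus covariance term
    dm = [m2[0]-m1[0], m2[1]-m1[1]]
    avg = [[(s1[i][j] + s2[i][j]) / 2 for j in range(2)] for i in range(2)]
    ia = inv2(avg)
    mean_term = 0.125*(dm[0]*(ia[0][0]*dm[0] + ia[0][1]*dm[1]) +
                       dm[1]*(ia[1][0]*dm[0] + ia[1][1]*dm[1]))
    cov_term = 0.5*math.log(det2(avg) / math.sqrt(det2(s1)*det2(s2)))
    return mean_term + cov_term

def bhattacharyya_bound(m1, s1, m2, s2):
    # equation (3.13), with P1 = P2 = 1/2
    return 0.5*math.exp(-bhattacharyya_distance(m1, s1, m2, s2))

I = [[1.0, 0.0], [0.0, 1.0]]
mu = bhattacharyya_distance([0, 0], I, [0.5, 0.5], I)
print(round(mu, 6), round(100*bhattacharyya_bound([0, 0], I, [0.5, 0.5], I), 4))
# -> 0.0625 46.9707
```

With identical covariances the covariance term is zero and the distance is one eighth of the squared Euclidean distance between the means, which is the situation of Data Set 1 described below.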
3.3 The Estimation of the Bhattacharyya Bound

In most cases, the means and covariances are unknown, even though the conditional distributions can be assumed to be normal. In such cases, it is necessary to estimate the Bhattacharyya bound from a given training sample. Let X = {x_{1,1}, ..., x_{m,1}, x_{1,2}, ..., x_{n,2}} be the training sample: the first m sample data x_{1,1}, ..., x_{m,1} belong to class 1 with a normal conditional distribution

x_{1,1}, \ldots, x_{m,1} \sim N_1(M_1, \Sigma_1), \qquad (3.15)

and the other n sample data x_{1,2}, ..., x_{n,2} belong to class 2 with a normal conditional distribution

x_{1,2}, \ldots, x_{n,2} \sim N_2(M_2, \Sigma_2). \qquad (3.16)

Then the maximum likelihood estimates of the means M_j and covariances Σ_j (j = 1, 2) are given by

\hat{M}_1 = \frac{1}{m} \sum_{i=1}^{m} x_{i,1}, \quad \hat{\Sigma}_1 = \frac{1}{m} \sum_{i=1}^{m} (x_{i,1} - \hat{M}_1)(x_{i,1} - \hat{M}_1)^T,

\hat{M}_2 = \frac{1}{n} \sum_{i=1}^{n} x_{i,2}, \quad \hat{\Sigma}_2 = \frac{1}{n} \sum_{i=1}^{n} (x_{i,2} - \hat{M}_2)(x_{i,2} - \hat{M}_2)^T. \qquad (3.17)
It is well known that the estimates given by (3.17) have good properties, and are the usual estimates of the parameters M_j and Σ_j (j = 1, 2). Based on (3.17), an estimate of the Bhattacharyya bound can be calculated by the following steps:

1. Replace the parameters in (3.14) with their estimates given by (3.17) to get an estimate of the Bhattacharyya distance, say μ̂(1/2).

2. Replace μ(1/2) in (3.13) with μ̂(1/2) to get an estimate of the Bhattacharyya bound.
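The two steps above can be sketched as follows. The samples below are randomly generated stand-ins (Data Set 1 itself is not reproduced here), and the 2x2 matrix helpers are included only to keep the example self-contained. Since μ̂(1/2) ≥ 0, the resulting bound estimate always lies in (0, 1/2].

```python
import math, random

def det2(A):
    return A[0][0]*A[1][1] - A[0][1]*A[1][0]

def inv2(A):
    d = det2(A)
    return [[A[1][1]/d, -A[0][1]/d], [-A[1][0]/d, A[0][0]/d]]

def ml_estimates(sample):
    # equation (3.17): sample mean and maximum likelihood covariance
    n = len(sample)
    m = [sum(x[0] for x in sample)/n, sum(x[1] for x in sample)/n]
    c = [[0.0, 0.0], [0.0, 0.0]]
    for x in sample:
        d = [x[0]-m[0], x[1]-m[1]]
        for i in range(2):
            for j in range(2):
                c[i][j] += d[i]*d[j]/n
    return m, c

def general_bound(sample1, sample2):
    # steps 1 and 2 of the General (plug-in) approach
    m1, c1 = ml_estimates(sample1)
    m2, c2 = ml_estimates(sample2)
    avg = [[(c1[i][j]+c2[i][j])/2 for j in range(2)] for i in range(2)]
    dm = [m2[0]-m1[0], m2[1]-m1[1]]
    ia = inv2(avg)
    mu = (0.125*(dm[0]*(ia[0][0]*dm[0]+ia[0][1]*dm[1]) +
                 dm[1]*(ia[1][0]*dm[0]+ia[1][1]*dm[1])) +
          0.5*math.log(det2(avg)/math.sqrt(det2(c1)*det2(c2))))
    return 0.5*math.exp(-mu)

random.seed(0)
s1 = [(random.gauss(0.0, 1), random.gauss(0.0, 1)) for _ in range(8)]
s2 = [(random.gauss(0.5, 1), random.gauss(0.5, 1)) for _ in range(8)]
b = general_bound(s1, s2)
print(0.0 < b <= 0.5)   # the plug-in bound estimate always lies in (0, 1/2]
```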
We refer to the estimating method described above as the Direct Estimating approach or the General approach. The unanswered question is one of determining how good the estimate of the General approach will be. The answer to this question is by no means trivial or direct, because the Bhattacharyya distance and bound are fairly complex functions of the parameters M_j and Σ_j. Here is a brief discussion of the properties of the estimate of a function of the parameters. Let Θ = (θ_1, θ_2, ..., θ_q)^T be a parameter vector, its estimate be Θ̂ = (θ̂_1, θ̂_2, ..., θ̂_q)^T, and f = f(Θ) be a function of Θ. Using the General approach, an estimator f̂ = f(Θ̂) can be obtained. Assuming that Θ̂ and Θ are close enough, a Taylor series can be used by expanding f̂ = f(Θ̂) up to the second order terms,

f(\hat{\Theta}) \approx f(\Theta) + \Delta\Theta^T \nabla f(\Theta) + \frac{1}{2} \Delta\Theta^T \nabla^2 f(\Theta)\, \Delta\Theta, \qquad (3.18)

where ΔΘ = Θ̂ − Θ. If Θ̂ is an unbiased estimate of Θ, E(ΔΘ) = 0. Thus,

E[f(\hat{\Theta})] \approx f(\Theta) + \frac{1}{2} E\left[ \Delta\Theta^T \nabla^2 f(\Theta)\, \Delta\Theta \right]. \qquad (3.19)

We, therefore, have f̂, which is generally a biased estimator of f. In our case
\Theta = (M_1^T, M_2^T, \mathrm{svec}(\Sigma_1)^T, \mathrm{svec}(\Sigma_2)^T)^T,

where svec(A) represents a vector consisting of all the distinct components of a symmetric matrix A = (a_{ij})_{n \times n},

\mathrm{svec}(A) = (a_{11}, \ldots, a_{n1}, a_{22}, \ldots, a_{n2}, \ldots, a_{nn})^T,

and its estimator

\hat{\Theta} = (\hat{M}_1^T, \hat{M}_2^T, \mathrm{svec}(\hat{\Sigma}_1)^T, \mathrm{svec}(\hat{\Sigma}_2)^T)^T
given in (3.17) is unbiased (exactly so for the means; the covariance estimates are made unbiased by using the divisors m − 1 and n − 1 rather than m and n). For the Bhattacharyya distance, the second term on the right side in (3.19) is extremely complicated (the details will not be discussed here as they are beyond the scope of this thesis; the reader can refer to Fukunaga's book [Fu92] for more details). It is thus apparent that the estimate of the Bhattacharyya distance deduced from the General approach is biased, and so is that of the Bhattacharyya bound. The simulation results given below support this conclusion, and give a visual illustration of how biased the estimate of the General approach can be.
As we know the parameters of the multi-normal distribution of each class, it is not difficult to calculate the theoretical values of the Bhattacharyya distances and the bounds of each class pair involved in an experiment. These theoretical values are given in TABLE 3.1. Note that the Bhattacharyya bounds are given in terms of a percentage, i.e.

\text{Bhattacharyya bound} = \frac{1}{2} e^{-\mu(1/2)} \times 100\%.

TABLE 3.1 Theoretical values of Bhattacharyya distance and bound

CLASS PAIR   Bhattacharyya distance   Bhattacharyya bound
(A, B)       0.0625                   46.9707
(A, C)       0.1375                   43.5767
(A, D)       0.16298                  42.4804
(A, E)       0.232513                 39.6270
(A, F)       0.3125                   36.5808
(A, G)       0.419263                 32.8766

As all the multi-normal distributions of the classes have the same covariance matrix, the identity matrix, the Bhattacharyya distance of a class pair in TABLE 3.1 contains only the first term on the right side of (3.14). It is seen to be in proportion to the square of the Euclidean distance between the expectations of the class pair. Thus, the results of the experiments reflect, in some sense, the effect of the classifying error rate, because the class pair changes in each experiment, and so does the classifier error rate.

TABLE 3.2 Estimates of Bhattacharyya distances with the General approach

CLASS PAIR   Theoretical Value   Means      Standard Deviations
(A, B)       0.0625              0.236728   0.138292
(A, C)       0.1375              0.407315   0.247854
(A, D)       0.16298             0.458343   0.271062
(A, E)       0.232513            0.758463   0.405374
(A, F)       0.3125              1.122263   0.558753
(A, G)       0.419263            2.000085   1.026716

TABLE 3.2 lists the means and standard deviations of the 200 trials of the Bhattacharyya distance estimates. It can be seen from the table that, on average, the Bhattacharyya distance estimate is biased, i.e. the General approach usually over-estimates the Bhattacharyya distance. The bias increases when the distance of the two classes increases. The standard deviation of the estimate increases along with the increased distance of the two classes as well.
TABLE 3.3 Estimates of Bhattacharyya bounds with the General approach (%)

CLASS PAIR          (A, B)    (A, C)    (A, D)    (A, E)    (A, F)    (A, G)
Theoretical Value   46.9707   43.5767   42.4804   39.627    36.5808   32.8766
[rows for the means, standard deviations, maximum, 90%, 80%, median, 20%, 10% and minimum of the 200 estimates: entries not recoverable from this copy]
TABLE 3.3 provides the statistics of the 200 trials of the Bhattacharyya bound estimates in terms of percentages. Besides the means and standard deviations of the 200 estimates, the maximum, median, minimum, and 90%, 80%, 20% and 10% quantiles of the estimates are also listed in the table. For instance, the mean and median of the 200 estimates of the Bhattacharyya bound for the class pair (A, E) are 25.14584% and 26.226% respectively, and their 90% quantile is 35.69748%. Comparing these to the theoretical value, which is 39.627%, we can see that, for the class pair (A, E), the General approach under-estimates the bound significantly: even the 90% quantile lies below the theoretical value. It is expected that the General approach tends to under-estimate the Bhattacharyya bound, because the estimate of the Bhattacharyya bound is just an exponential function of the negative estimate of the Bhattacharyya distance. Of course, the bias of the estimate increases as the distance of the two classes increases. One interesting fact is that with the class pair (A, B), the rate of the estimates which under-estimate the Bhattacharyya bound is about 90% of the total 200 trials, while for the class pair (A, G), the rate increases to 100%. We infer that, generally speaking, the probability of under-estimating the Bhattacharyya bound usually increases with the distance of the two classes.
In the case of the Bhattacharyya distance, the over-estimation is due to the second term in (3.19), or the third term in (3.18). Observe that the Bhattacharyya bound is a function of the Bhattacharyya distance μ(1/2), and μ̂(1/2), given by the General approach, is a biased estimate of μ(1/2). Therefore, the under-estimation of the bound is due not only to the third term in (3.18), but also to the second term in (3.18). The question before us, then, is one of correcting the bias, and as we shall see, the bootstrap technique provides a way of achieving this. The next sections will illustrate how the bootstrap technique can be applied when estimating the Bhattacharyya bound.
3.4 The Bootstrap Estimators - Direct Bias Correction

When considering the problem of estimating the Bhattacharyya bound, there are two different types of schemes by which a bootstrap estimate of the quantity can be obtained. The difference between the two schemes is a question of how we want to correct the bias. First, the estimate bias may be directly corrected by using (2.9). Schemes of this type are called the Direct Bias Correction schemes. Second, since the estimate bias of the Bhattacharyya bound is caused by the estimate bias of the Bhattacharyya distance, the bias of the Bhattacharyya distance can be estimated first, and then the bias estimate can be used to adjust the estimate of the Bhattacharyya bound. Schemes of this type are called the Distance Bias Adjustment schemes. This section will discuss the Direct Bias Correction schemes with three different resampling approaches: the basic bootstrap, the Bayesian bootstrap and the random weighting method. A discussion of the Distance Bias Adjustment schemes will be the topic of the next section.

The Direct Bias Correction schemes use the bootstrap estimate of the Bhattacharyya bound directly to correct the bias, i.e. they apply (2.9) given in Chapter 2. The algorithms with the different resampling schemes are described below:
Algorithm 3.1 Basic Bootstrap
Input: (i) n, the size of the training sample of a class;
(ii) x[1,1], x[2,1], ..., x[n,1], the training sample data of the first class;
(iii) x[1,2], x[2,2], ..., x[n,2], the training sample data of the second class;
(iv) B, the number of bootstrap resampling repetitions.
Output: The Bhattacharyya bound estimate.
Method
BEGIN
    E0 = CalculateBound(x[1,1], ..., x[n,1], x[1,2], ..., x[n,2])
    E = 0
    For (i = 1 to B)
        For (j = 1 to n)
            m = random integer in [1, n]
            y[j,1] = x[m,1]
        End-For
        For (j = 1 to n)
            m = random integer in [1, n]
            y[j,2] = x[m,2]
        End-For
        E = E + CalculateBound(y[1,1], ..., y[n,1], y[1,2], ..., y[n,2])
    End-For
    E = 2 × E0 − E / B
    Return E
END Basic Bootstrap

Procedure CalculateBound
Input: (i) n, the size of the sample of a class;
(ii) x[1,1], x[2,1], ..., x[n,1], the sample data of the first class;
(iii) x[1,2], x[2,2], ..., x[n,2], the sample data of the second class.
Output: A Bhattacharyya bound estimate.
Method
BEGIN
    μ(1/2) = CalculateDistance(x[1,1], ..., x[n,1], x[1,2], ..., x[n,2])
    E = 0.5 × EXP(−μ(1/2))
    Return E
END Procedure CalculateBound

Procedure CalculateDistance
Input: (i) n, the size of the sample of a class;
(ii) x[1,1], x[2,1], ..., x[n,1], the sample data of the first class;
(iii) x[1,2], x[2,2], ..., x[n,2], the sample data of the second class.
Output: A Bhattacharyya distance estimate.
Method
BEGIN
    Compute the estimates M̂_j and Σ̂_j (j = 1, 2) by (3.17)
    Compute μ(1/2) from these estimates by (3.14)
    Return μ(1/2)
END Procedure CalculateDistance
The first step of this algorithm is to calculate the General approach estimate of the Bhattacharyya bound. The first For loop is used to calculate the B bootstrap estimates of the Bhattacharyya bound. After that, (2.9) is applied to obtain the estimate of the Bhattacharyya bound. The two For loops nested in the first For loop are for generating bootstrap samples, so that a bootstrap sample is drawn separately from each class training sample. The reason for doing this is so that we have a larger probability of ensuring that the estimated covariance matrices Σ̂_1 and Σ̂_2 are non-singular. If a bootstrap sample were drawn from the merged training sample, the bootstrap sample might contain only a few data points from one class sample. Thus, the estimated covariance matrix Σ̂_1 or Σ̂_2 would have a higher probability of being singular, which may happen especially in the case of small training samples, or when the dimension of the feature vectors is "large". Considering that an estimate of a covariance matrix may be singular, a real algorithm should inspect the three determinants |Σ̂_1|, |Σ̂_2| and |(Σ̂_1 + Σ̂_2)/2| after each bootstrap sampling. If any of them is zero or significantly close to zero, a bootstrap sample must be drawn and inspected again until all three determinants are non-zero.
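A minimal executable sketch of Algorithm 3.1, including the determinant inspection just described, might look as follows. The samples and the tolerance 1e-9 for "significantly close to zero" are illustrative assumptions, not values taken from the thesis.

```python
import math, random

def det2(A):
    return A[0][0]*A[1][1] - A[0][1]*A[1][0]

def inv2(A):
    d = det2(A)
    return [[A[1][1]/d, -A[0][1]/d], [-A[1][0]/d, A[0][0]/d]]

def bound(sample1, sample2):
    # General estimate: ML parameters (3.17) plugged into (3.14) and (3.13)
    def est(s):
        n = len(s)
        m = [sum(x[0] for x in s)/n, sum(x[1] for x in s)/n]
        c = [[sum((x[i]-m[i])*(x[j]-m[j]) for x in s)/n for j in range(2)]
             for i in range(2)]
        return m, c
    m1, c1 = est(sample1)
    m2, c2 = est(sample2)
    avg = [[(c1[i][j]+c2[i][j])/2 for j in range(2)] for i in range(2)]
    if min(abs(det2(c1)), abs(det2(c2)), abs(det2(avg))) < 1e-9:
        return None                      # singular covariance: caller must redraw
    dm = [m2[0]-m1[0], m2[1]-m1[1]]
    ia = inv2(avg)
    mu = (0.125*(dm[0]*(ia[0][0]*dm[0]+ia[0][1]*dm[1]) +
                 dm[1]*(ia[1][0]*dm[0]+ia[1][1]*dm[1])) +
          0.5*math.log(det2(avg)/math.sqrt(det2(c1)*det2(c2))))
    return 0.5*math.exp(-mu)

def basic_bootstrap_bound(sample1, sample2, B=200, rng=random):
    e0 = bound(sample1, sample2)
    total = 0.0
    for _ in range(B):
        while True:   # redraw until all three determinants are non-singular
            b1 = [rng.choice(sample1) for _ in sample1]
            b2 = [rng.choice(sample2) for _ in sample2]
            e = bound(b1, b2)
            if e is not None:
                break
        total += e
    return 2*e0 - total/B                # bias correction, equation (2.9)

random.seed(1)
s1 = [(random.gauss(0.0, 1), random.gauss(0.0, 1)) for _ in range(8)]
s2 = [(random.gauss(0.5, 1), random.gauss(0.5, 1)) for _ in range(8)]
print(round(basic_bootstrap_bound(s1, s2), 4))
```

As the simulation results below show, the corrected value can even turn out negative for well-separated classes, so a practical implementation may also want to guard against that.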
By changing the way of drawing the bootstrap samples, the other two estimating algorithms for the Bhattacharyya bound, the Bayesian bootstrap and the random weighting algorithms, can be obtained. The algorithm of the Bayesian bootstrap is presented below.
Algorithm 3.2 Bayesian Bootstrap
Input: (i) n, the size of the training sample of a class;
(ii) x[1,1], x[2,1], ..., x[n,1], the training sample data of the first class;
(iii) x[1,2], x[2,2], ..., x[n,2], the training sample data of the second class;
(iv) B, the number of bootstrap resampling repetitions.
Output: The Bhattacharyya bound estimate.
Method
BEGIN
    E0 = CalculateBound(x[1,1], ..., x[n,1], x[1,2], ..., x[n,2], general)
    E = 0
    For (i = 1 to B)
        E = E + CalculateBound(x[1,1], ..., x[n,1], x[1,2], ..., x[n,2], bootstrap)
    End-For
    E = 2 × E0 − E / B
    Return E
END Bayesian Bootstrap

Procedure CalculateBound
Input: (i) n, the size of the sample of a class;
(ii) x[1,1], x[2,1], ..., x[n,1], the sample data of the first class;
(iii) x[1,2], x[2,2], ..., x[n,2], the sample data of the second class;
(iv) flag, to indicate 'general' or 'bootstrap'.
Output: A Bhattacharyya bound estimate.
Method
BEGIN
    μ(1/2) = CalculateDistance(x[1,1], ..., x[n,1], x[1,2], ..., x[n,2], flag)
    E = 0.5 × EXP(−μ(1/2))
    Return E
END Procedure CalculateBound

Procedure CalculateDistance
Input: (i) n, the size of the sample of a class;
(ii) x[1,1], x[2,1], ..., x[n,1], the sample data of the first class;
(iii) x[1,2], x[2,2], ..., x[n,2], the sample data of the second class;
(iv) flag, to indicate 'general' or 'bootstrap'.
Output: A Bhattacharyya distance estimate.
Method
BEGIN
    If flag = 'bootstrap'
        GetWeights_Bayesian_Bootstrap(n)
    Else
        w[1] = w[2] = ... = w[n] = 1 / n
    End-If
    M̂_1 = Σ_{j=1..n} w[j] x[j,1]
    Σ̂_1 = Σ_{j=1..n} w[j] (x[j,1] − M̂_1)(x[j,1] − M̂_1)^T
    If flag = 'bootstrap'
        GetWeights_Bayesian_Bootstrap(n)
    Else
        w[1] = w[2] = ... = w[n] = 1 / n
    End-If
    M̂_2 = Σ_{j=1..n} w[j] x[j,2]
    Σ̂_2 = Σ_{j=1..n} w[j] (x[j,2] − M̂_2)(x[j,2] − M̂_2)^T
    Compute μ(1/2) from M̂_1, Σ̂_1, M̂_2, Σ̂_2 by (3.14)
    Return μ(1/2)
END Procedure CalculateDistance

Procedure GetWeights_Bayesian_Bootstrap
Input: n, the size of the training sample of a class.
Output: w[1], w[2], ..., w[n], the resampling vector.
Method
BEGIN
    For (j = 1 to n − 1)
        w[j] = random uniform U[0, 1]
    End-For
    w[n] = 1
    Sort(w[1], w[2], ..., w[n])
    For (j = n down to 2)
        w[j] = w[j] − w[j − 1]
    End-For
    Return w[1], w[2], ..., w[n]
END Procedure GetWeights_Bayesian_Bootstrap
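The weight construction just described (n − 1 sorted uniform draws plus the endpoint 1, differenced into spacings) can be sketched as follows; in Algorithm 3.2 it would be invoked once per class for every bootstrap repetition. The resulting vector is a probability vector, equivalently a draw from the flat Dirichlet distribution.

```python
import random

def bayesian_bootstrap_weights(n, rng=random):
    # spacings of n-1 sorted U[0,1] draws plus the endpoint 1, as in the
    # GetWeights_Bayesian_Bootstrap procedure
    w = sorted(rng.random() for _ in range(n - 1)) + [1.0]
    return [w[0]] + [w[j] - w[j - 1] for j in range(1, n)]

random.seed(0)
w = bayesian_bootstrap_weights(8)
print(len(w), round(sum(w), 12), all(x >= 0 for x in w))   # 8 1.0 True
```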
The goal of the procedure GetWeights_Bayesian_Bootstrap above is to set the resampling vector; the Sort procedure is any standard sorting procedure which puts w[1], w[2], ..., w[n] in increasing order. The main difference between Algorithm 3.1 and Algorithm 3.2 is the CalculateDistance procedure. In Algorithm 3.1, the bootstrap sample is passed to the procedure to calculate the estimates of the means and covariances, and so the weights used in the calculation are always 1/n. In Algorithm 3.2, the parameters passed to the CalculateDistance procedure are always the training sample and a flag. If the flag is 'bootstrap', the procedure will invoke GetWeights_Bayesian_Bootstrap to obtain the weights for calculating the corresponding bootstrap estimate. Otherwise, it uses the equal weights 1/n to calculate the estimates, as is done in Algorithm 3.1.

As can be directly seen, Algorithm 3.2 is more general than Algorithm 3.1. Algorithm 3.1 can be obtained from Algorithm 3.2 by just changing the GetWeights_Bayesian_Bootstrap procedure to the GetWeights_Basic_Bootstrap procedure, as shown below:
Algorithm 3.3 GetWeights_Basic_Bootstrap
Procedure GetWeights_Basic_Bootstrap
Input: n, the size of the training sample of a class.
Output: w[1], w[2], ..., w[n], the resampling vector.
Method
BEGIN
    For (j = 1 to n)
        w[j] = 0
    End-For
    For (j = 1 to n)
        m = random integer in [1, n]
        w[m] = w[m] + 1
    End-For
    For (j = 1 to n)
        w[j] = w[j] / n
    End-For
    Return w[1], w[2], ..., w[n]
END Procedure GetWeights_Basic_Bootstrap
From what was explained earlier, the algorithm for the random weighting method can also be obtained from Algorithm 3.2 by changing the GetWeights_Bayesian_Bootstrap procedure. As the other parts of the random weighting method algorithm are exactly the same as those of Algorithm 3.2, only the changed procedure, which is now the GetWeights_Random_Weighting procedure of the random weighting method, is given here.

Procedure GetWeights_Random_Weighting
Input: n, the size of the training sample of a class.
Output: w[1], w[2], ..., w[n], the resampling vector.
Method
BEGIN
    sum = 0
    For (j = 1 to n)
        w[j] = random uniform U[0, 1]
        sum = sum + w[j]
    End-For
    For (j = 1 to n)
        w[j] = w[j] / sum
    End-For
    Return w[1], w[2], ..., w[n]
END Procedure GetWeights_Random_Weighting
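For comparison, the random weighting and basic bootstrap weight procedures can be sketched side by side. Both return a probability vector; the basic bootstrap weights are restricted to multiples of 1/n, while the random weighting (and Bayesian bootstrap) weights vary continuously.

```python
import random

def random_weighting_weights(n, rng=random):
    # n uniform draws normalized to sum to one (GetWeights_Random_Weighting)
    w = [rng.random() for _ in range(n)]
    s = sum(w)
    return [x / s for x in w]

def basic_bootstrap_weights(n, rng=random):
    # multinomial resampling counts divided by n (GetWeights_Basic_Bootstrap)
    counts = [0] * n
    for _ in range(n):
        counts[rng.randrange(n)] += 1
    return [c / n for c in counts]

random.seed(0)
for gen in (random_weighting_weights, basic_bootstrap_weights):
    w = gen(8)
    print(round(sum(w), 12), min(w) >= 0)   # each scheme yields a probability vector
```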
3.4.1 Simulation Results

The simulation experiments were done with the above three algorithms, and the statistics for them are listed in TABLE 3.4 - 3.6. The results are given in terms of percentages. For instance, in TABLE 3.4, the mean and median of the 200 estimates given by the Basic Bootstrap algorithm are 31.83517% and 32.58316% for the class pair (A, E). In TABLE 3.5, for the same class pair, the mean and median of the 200 estimates given by the Bayesian Bootstrap algorithm are 29.67647% and 30.61675%. In TABLE 3.6, the mean and median given by the Random Weighting algorithm are 27.34608% and 28.27401%. Comparing these estimates to the corresponding estimate values given by the General approach, we do have less biased estimates. One noticeable thing is that for the class pair (A, B), the 80% quantiles of all three algorithms are above the theoretical value of the Bhattacharyya bound, while the 90% quantile of the General approach is still under the theoretical value.
TABLE 3.4 Estimates of Bhattacharyya bounds with the Basic Bootstrap (%)

CLASS PAIR          (A, B)     (A, C)     (A, D)     (A, E)     (A, F)     (A, G)
Theoretical Value   46.9707    43.5767    42.4804    39.627     36.5808    32.8766
Mean                48.33761   42.15751   40.52867   31.8351    18.51542   12.6412
STD                 5.942545   8.393123   8.835997   10.20969   14.41235   8.457236
Maximum             58.67531   58.84465   58.44089   55.32598   49.51755   37.1651
90%                 55.50185   52.47521   51.50358   44.83127   37.95566   24.76845
80%                 53.55391   49.2396    48.08842   40.64219   33.08389   19.96274
Median              49.89679   42.7631    41.23105   32.58316   19.41409   11.5166
[the 20%, 10% and minimum rows: entries not recoverable from this copy]

TABLE 3.5 Estimates of Bhattacharyya bounds with the Bayesian Bootstrap (%)

CLASS PAIR          (A, B)    (A, C)    (A, D)    (A, E)    (A, F)    (A, G)
Theoretical Value   46.9707   43.5767   42.4804   39.627    36.5808   32.8766
[rows for the means, standard deviations, maximum, 90%, 80%, median, 20%, 10% and minimum: entries not recoverable from this copy]

TABLE 3.6 Estimates of Bhattacharyya bounds with the Random Weighting Method (%)

CLASS PAIR          (A, B)     (A, C)     (A, D)     (A, E)     (A, F)     (A, G)
Theoretical Value   46.9707    43.5767    42.4804    39.627     36.5808    32.8766
Mean                42.3308    36.65799   35.1367    27.34608   20.44759   10.71386
STD                 5.317276   7.391367   7.875731   8.917867   8.771155   7.196015
Maximum             51.68313   51.24357   51.34823   46.67435   42.29843   32.54132
90%                 48.64045   45.7033    44.87394   38.15151   32.52625   20.85488
80%                 47.11296   43.09073   41.96743   34.85278   27.6164    17.18219
[the median, 20%, 10% and minimum rows: entries not recoverable from this copy]
As can be seen from the tables, the bootstrap techniques do work in this case. On average, the means of the 200 estimates for all six class pairs improve to some degree. As can be seen, the percentage of the estimates with values over the theoretical value also increased. We note, however, that there are a few disadvantages to the three algorithms. The standard deviations of the estimates are larger than those of the General approach. In the experiment of the basic bootstrap algorithm with the class pair (A, F), about 20% of the estimates are even below zero, which is quite unacceptable. Of course, a restriction could be added to the algorithm to discard negative estimate values. A simple way of doing this is to just reject an estimate when a negative estimate value is reported, and to re-draw the bootstrap samples and calculate the estimate again. This procedure would be repeated until a nonnegative estimate is produced. Comparatively, the Bayesian bootstrap and random weighting method algorithms have smaller standard deviations and no negative estimate values.
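The reject-and-redraw restriction described above can be wrapped around any of the estimators. The estimator below is a toy stand-in (a Gaussian that occasionally produces negative values), used only to show the control flow; the max_tries guard is an added safeguard not mentioned in the text.

```python
import random

def nonnegative_estimate(estimate_once, max_tries=100, rng=random):
    # Re-run the whole bootstrap estimate until a nonnegative value is produced,
    # as suggested for the direct bias correction schemes.
    for _ in range(max_tries):
        e = estimate_once(rng)
        if e >= 0.0:
            return e
    raise RuntimeError("no nonnegative estimate within max_tries")

# stand-in estimator: bias-corrected values that are occasionally negative
def toy_estimator(rng):
    return rng.gauss(0.2, 0.15)

random.seed(0)
vals = [nonnegative_estimate(toy_estimator) for _ in range(200)]
print(min(vals) >= 0.0)   # True: negative draws were rejected and redrawn
```

Note that this truncation introduces a small upward bias of its own, which is one reason the distance bias adjustment schemes of the next section are attractive.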
3.5 The Bootstrap Estimators - Distance Bias Adjustment

As the Bhattacharyya bound ε_u is an exponential function of the Bhattacharyya distance μ(1/2), ε_u = (1/2) e^{−μ(1/2)}, it is not difficult to rewrite it in terms of the Taylor series, expanding ε_u as

\varepsilon_u = \frac{1}{2} e^{-\mu(1/2)} = \varepsilon_{u0}\, e^{-\Delta\mu} = \varepsilon_{u0} \left( 1 - \Delta\mu + \frac{\Delta\mu^2}{2} - \cdots \right), \qquad (3.20)

where ε_{u0} = (1/2) e^{−μ_0(1/2)} and Δμ = μ − μ_0. Suppose ε_u is the theoretical value of the Bhattacharyya bound and ε_{u0} is an estimated value; then (3.20) provides a way of representing ε_u in terms of ε_{u0} if the difference Δμ = μ − μ_0 is known. Fortunately, the difference Δμ = μ − μ_0 is just the bias of the Bhattacharyya distance and can be estimated by using the bootstrap technique. Thus the Distance Bias Adjustment schemes for the Bhattacharyya bound estimate can be effectively utilized. After obtaining the estimate of the Bhattacharyya bound ε̂_u and the bias of the Bhattacharyya distance Δμ̂, a new estimate of the bound is obtained by adjusting ε̂_u with Δμ̂ as follows:

\hat{\varepsilon}_u^{\,adj} \approx \hat{\varepsilon}_u \left( 1 - \Delta\hat{\mu} + \frac{\Delta\hat{\mu}^2}{2} \right). \qquad (3.21)

In (3.21), the approximation utilizes only up to the second order terms of the Taylor series. Observe that although it is feasible to use more terms of the Taylor series expansion, as the order of a term increases, the contribution of the term decreases. Therefore, adding more terms to (3.21) will not generally affect the final estimate much.
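The quality of the second-order truncation in (3.21) is easy to check numerically: for a small distance bias the factor 1 − Δμ + Δμ²/2 is already very close to the exact e^{−Δμ}, which is why higher-order terms change the final estimate little.

```python
import math

def adjustment_factor(dmu, order=2):
    # truncated Taylor series of exp(-dmu), as used in (3.21)
    f, term = 1.0, 1.0
    for k in range(1, order + 1):
        term *= -dmu / k
        f += term
    return f

# for a small distance bias the second-order factor is already close to exact
dmu = 0.1
approx = adjustment_factor(dmu)   # 1 - 0.1 + 0.005 = 0.905
exact = math.exp(-dmu)            # about 0.904837
print(round(approx, 6), round(exact, 6), round(abs(approx - exact), 6))
```

The gap shrinks cubically in Δμ, so the truncation only matters when the estimated distance bias is large.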
It is still necessary to work out some details before a formal algorithm can be obtained for the Distance Bias Adjustment schemes. The first question to be answered is: what will be chosen as ε̂_u? There are two possible options, namely, to use either a General approach estimate, or a bootstrap estimate. The scheme will be called a G-based (General approach based) estimate if the General approach estimate is selected, and called a B-based (bootstrap based) estimate if a bootstrap estimate is selected. The second question is when the distance bias adjustment is performed. In a bootstrap procedure, B sets of bootstrap samples are repeatedly drawn, an estimate is calculated for each bootstrap sample, and then these estimates are averaged to obtain the final bootstrap estimate. So the adjustment can be done at each step of the calculation of the bootstrap samples, or after the averaging. The first case is referred to as the AFMS (Adjustment First and Mean Second) approach, and the second case is referred to as the MFAS (Mean First and Adjustment Second) approach. It is obvious that the B-based estimate will definitely be affected by the selection of the AFMS or MFAS strategy. The choice of AFMS or MFAS has an influence on the G-based estimate as well. Although the value of ε̂_u is not affected by the bootstrap samples, as Δμ̂ is subject to the selection of the AFMS or MFAS strategy, the results will be correspondingly different. Given below are the algorithms of the Distance Bias Adjustment schemes with AFMS and MFAS respectively.
3.5.1 AFMS (Adjustment First and Mean Second) Schemes

The main part of a generalized algorithm with the B-based AFMS strategy is listed below.
Algorithm 3.4 Distance Bias Adjustment - B-based AFMS
Input: (i) n, the size of the training sample of a class;
(ii) x[1,1], x[2,1], ..., x[n,1], the training sample data of the first class;
(iii) x[1,2], x[2,2], ..., x[n,2], the training sample data of the second class;
(iv) B, the number of bootstrap resampling repetitions.
Output: The Bhattacharyya bound estimate.
Method
BEGIN
    μ0 = CalculateDistance(x[1,1], ..., x[n,1], x[1,2], ..., x[n,2], general)
    E0 = 0.5 × EXP(−μ0)
    E = 0
    For (i = 1 to B)
        μ = CalculateDistance(x[1,1], ..., x[n,1], x[1,2], ..., x[n,2], bootstrap)
        Δμ = μ − μ0
        E = E + (2 × E0 − 0.5 × EXP(−μ)) × (1 − Δμ + 0.5 × Δμ²)
    End-For
    E = E / B
    Return E
END-MAIN
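A compact executable sketch of the B-based AFMS scheme might look as follows. To stay short it uses one-dimensional normal classes (so (3.14) reduces to a scalar formula) and Bayesian-bootstrap weights; the samples are randomly generated stand-ins rather than Data Set 1.

```python
import math, random

def mu_half(sample1, sample2, w1, w2):
    # weighted ML estimates plugged into the one-dimensional form of (3.14)
    def est(s, w):
        m = sum(wi*x for wi, x in zip(w, s))
        v = sum(wi*(x - m)**2 for wi, x in zip(w, s))
        return m, v
    m1, v1 = est(sample1, w1)
    m2, v2 = est(sample2, w2)
    va = (v1 + v2) / 2
    return 0.125*(m2 - m1)**2/va + 0.5*math.log(va/math.sqrt(v1*v2))

def bayes_weights(n, rng):
    # spacings of sorted uniforms, as in GetWeights_Bayesian_Bootstrap
    u = sorted(rng.random() for _ in range(n - 1)) + [1.0]
    return [u[0]] + [u[j] - u[j-1] for j in range(1, n)]

def afms_b_based(sample1, sample2, B=200, rng=random):
    n1, n2 = len(sample1), len(sample2)
    mu0 = mu_half(sample1, sample2, [1/n1]*n1, [1/n2]*n2)
    e0 = 0.5*math.exp(-mu0)
    total = 0.0
    for _ in range(B):
        mu = mu_half(sample1, sample2,
                     bayes_weights(n1, rng), bayes_weights(n2, rng))
        dmu = mu - mu0
        # per-draw bias-corrected bound times the distance adjustment factor
        total += (2*e0 - 0.5*math.exp(-mu)) * (1 - dmu + 0.5*dmu*dmu)
    return total / B

random.seed(0)
s1 = [random.gauss(0.0, 1.0) for _ in range(8)]
s2 = [random.gauss(0.5, 1.0) for _ in range(8)]
print(round(afms_b_based(s1, s2), 4))
```

Replacing the per-draw factor (2 × E0 − 0.5 × EXP(−μ)) by E0 in the loop turns this into the G-based variant, exactly as described in the text.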
The CalculateDistance procedure is exactly the same as that in Algorithm 3.2. Therefore, the appropriate GetWeights procedure is also required in this algorithm. Of course, we have to use different GetWeights procedures to carry out either the basic bootstrap, Bayesian bootstrap or random weighting method scheme respectively. As each GetWeights procedure for the three bootstrap schemes has already been presented in Section 3.4, we omit the discussion of their details here.

In the algorithm above, the factor (2 × E0 − 0.5 × EXP(−μ)) is the bootstrap Bhattacharyya bound estimate for the currently drawn bootstrap sample, which is multiplied by the distance bias adjustment factor (1 − Δμ + 0.5 × Δμ²). If (2 × E0 − 0.5 × EXP(−μ)) is replaced by E0, and nothing else is changed, it can be seen that the algorithm becomes a G-based algorithm. Hence, we will not give another description separately for the G-based AFMS algorithms. With the same training sample data, experiments have been done for the B-based and G-based AFMS algorithms with all three bootstrap schemes, namely, the basic bootstrap, the Bayesian bootstrap and the random weighting method. The statistics of the experiments are listed in TABLE 3.7 - 3.12.
TABLE 3.7 Estimates of Bhattacharyya bounds (%) Basic Bootstrap, B-based AFMS

CLASS PAIR          (A, B)    (A, C)    (A, D)    (A, E)    (A, F)    (A, G)
Theoretical Value   46.9707   43.5767   42.4804   39.627    36.5808   32.8766
[rows for the mean, STD, maximum, 90%, 80%, median, 20%, 10% and minimum: entries not recoverable from this copy]

TABLE 3.8 Estimates of Bhattacharyya bounds (%) Bayesian Bootstrap, B-based AFMS

CLASS PAIR          (A, B)    (A, C)    (A, D)    (A, E)    (A, F)    (A, G)
Theoretical Value   46.9707   43.5767   42.4804   39.627    36.5808   32.8766
[rows for the mean, STD, maximum, 90%, 80%, median, 20%, 10% and minimum: entries not recoverable from this copy]

TABLE 3.9 Estimates of Bhattacharyya bounds (%) Random Weighting Method, B-based AFMS

CLASS PAIR          (A, B)    (A, C)    (A, D)    (A, E)    (A, F)    (A, G)
Theoretical Value   46.9707   43.5767   42.4804   39.627    36.5808   32.8766
[rows for the mean, STD, maximum, 90%, 80%, median, 20%, 10% and minimum: entries not recoverable from this copy]

TABLE 3.7 - 3.9 list the results of the B-based AFMS schemes. The estimates given by the Basic Bootstrap algorithm with the B-based AFMS scheme are listed in TABLE 3.7. For instance, the mean and median of the 200 estimates given by this algorithm are 26.28996% and 25.24358% for the class pair (A, E). For the same class pair, the mean and median of the 200 estimates given by the Bayesian Bootstrap algorithm are 23.354% and 23.01215%, as listed in TABLE 3.8. In TABLE 3.9, for the same class pair, the mean and median are 25.13843% and 25.15333% respectively, as obtained with the Random Weighting algorithm. These indicate that the performances of the three bootstrap schemes are quite distinct. The basic bootstrap does not perform well with the class pairs (A, B), (A, C) and (A, D), while on average, the estimates of the basic bootstrap are higher than the General approach estimates with the class pairs (A, E), (A, F) and (A, G). It may also be noticed that the standard deviations of the basic bootstrap estimates are quite large. The cause of this is that some estimates are too high (even more than 100%), and so they are not reasonable. It is possible to reduce the standard deviations of the estimates of this scheme by putting a restriction on the maximum estimated value in the algorithm.

With the B-based AFMS strategy, the Bayesian bootstrap and random weighting method schemes do not improve the General approach estimate much, except that they have smaller estimate deviations.
TABLE 3.10 Estimates of Bhattacharyya bounds (%) Basic Bootstrap, G-based AFMS

CLASS PAIR          (A, B)    (A, C)    (A, D)    (A, E)    (A, F)    (A, G)
Theoretical Value   46.9707   43.5767   42.4804   39.627    36.5808   32.8766
[rows for the mean, STD, maximum, 90%, 80%, median, 20%, 10% and minimum: entries not recoverable from this copy]

TABLE 3.11 Estimates of Bhattacharyya bounds (%) Bayesian Bootstrap, G-based AFMS

CLASS PAIR          (A, B)    (A, C)    (A, D)    (A, E)    (A, F)    (A, G)
Theoretical Value   46.9707   43.5767   42.4804   39.627    36.5808   32.8766
[rows for the mean, STD, maximum, 90%, 80%, median, 20%, 10% and minimum: entries not recoverable from this copy]

TABLE 3.12 Estimates of Bhattacharyya bounds (%) Random Weighting Method, G-based AFMS

CLASS PAIR          (A, B)    (A, C)    (A, D)    (A, E)    (A, F)    (A, G)
Theoretical Value   46.9707   43.5767   42.4804   39.627    36.5808   32.8766
[rows for the mean, STD, maximum, 90%, 80%, median, 20%, 10% and minimum: entries not recoverable from this copy]

The results of the G-based AFMS schemes are listed in TABLE 3.10 - 3.12. In TABLE 3.10, for the class pair (A, E), the mean and median of the 200 estimates given by the Basic Bootstrap algorithm are 34.21707% and 31.64667% respectively. For the same class pair, TABLE 3.11 gives the mean 25.69247% and median 25.49169%, as estimated by the Bayesian Bootstrap algorithm. The corresponding mean and median given by the Random Weighting algorithm are 26.15482% and 26.20663%, listed in TABLE 3.12. Compared to the General approach, the estimates tend to improve with all of the bootstrap schemes. The standard deviations of the basic bootstrap scheme are still too large because of some extremely high estimate values, while the standard deviations of the other two bootstrap schemes are smaller than those of the General approach estimates. On average, however, the estimate bias of the Bayesian bootstrap and random weighting method is bigger than that of the basic bootstrap.
3.5.3 MFAS (Mean First and Adjustment Second) Schemes
The main part of the algorithm with the B-based MFAS strategy is presented below.
Algorithm 3.5 Distance Bias Adjustment - B-based MFAS
Input: (i) n, the size of the training sample of a class;
(ii) x[1, 1], x[2, 1], ..., x[n, 1], the training sample data of the first class;
(iii) x[1, 2], x[2, 2], ..., x[n, 2], the training sample data of the second class;
(iv) B, the repeated times of the bootstrap resampling.
Output: The Bhattacharyya bound estimate.
Method
BEGIN
w = CalculateDistance(x[1, 1], ..., x[n, 1], x[1, 2], ..., x[n, 2], general)
E_G = 0.5 x EXP(-w)
E = 0
p = 0
For (i = 1 to B)
  temp = CalculateDistance(x[1, 1], ..., x[n, 1], x[1, 2], ..., x[n, 2], bootstrap)
  p = p + temp
  E = E + 0.5 x EXP(-temp)
End-For
p = p / B
Δp = p - w
E = E / B
E = (2 x E_G - E) x (1 - Δp + 0.5 x Δp^2)
Return E
END-MAIN
Like Algorithm 3.4, the CalculateDistance procedure is the same as that in
Algorithm 3.2. It is also possible to carry out the algorithm with different bootstrap
schemes by selecting the proper GetWeights procedures respectively. The difference
between Algorithm 3.4 and Algorithm 3.5 is that in the latter, the bootstrap estimates of
the Bhattacharyya distance and bound are accumulated separately each time a bootstrap
sample is drawn. The averages of these are then used to obtain the Bhattacharyya bound
estimate (2 x E_G - E) and the estimate of the distance bias adjustment factor
(1 - Δp + 0.5 x Δp^2). The final estimate is given by multiplying the two estimates. To get
a G-based MFAS algorithm, we need only replace (2 x E_G - E) with E_G, and omit
E = E + 0.5 x EXP(-temp) and the related operations in Algorithm 3.5. Due to the obvious
similarities, we feel that it is not necessary to present the formal code for the G-based
MFAS algorithm.
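As a concrete illustration, the B-based MFAS scheme can be sketched in Python for one-dimensional data. The Gaussian plug-in distance below is a hypothetical stand-in for the thesis's CalculateDistance procedure (Algorithm 3.2 covers the general multivariate case), and the function names and univariate setting are assumptions for illustration only.

```python
import math
import random

def bhattacharyya_distance(a, b):
    # Gaussian plug-in Bhattacharyya distance between two univariate
    # samples; a hypothetical stand-in for CalculateDistance.
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / len(a)
    vb = sum((x - mb) ** 2 for x in b) / len(b)
    return (0.25 * (ma - mb) ** 2 / (va + vb)
            + 0.5 * math.log((va + vb) / (2.0 * math.sqrt(va * vb))))

def mfas_bound_estimate(x1, x2, B=200, rng=None):
    # B-based MFAS: average the bootstrap distances and bounds first,
    # then apply the adjustment factor (1 - dp + 0.5 * dp**2).
    rng = rng or random.Random(0)
    w = bhattacharyya_distance(x1, x2)
    e_g = 0.5 * math.exp(-w)               # General-approach bound E_G
    p_sum = e_sum = 0.0
    for _ in range(B):
        b1 = [x1[rng.randrange(len(x1))] for _ in x1]
        b2 = [x2[rng.randrange(len(x2))] for _ in x2]
        d = bhattacharyya_distance(b1, b2)
        p_sum += d
        e_sum += 0.5 * math.exp(-d)
    dp = p_sum / B - w                     # estimated distance bias
    e = e_sum / B
    return (2.0 * e_g - e) * (1.0 - dp + 0.5 * dp ** 2)
```

Note that the adjustment factor (1 - Δp + 0.5 x Δp^2) matches the second-order Taylor expansion of EXP(-Δp) around zero, which is consistent with its role as a correction for the distance bias Δp.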
The experiment statistics of the B-based/G-based MFAS algorithms are given in
TABLE 3.13 - 3.18 for the three bootstrap schemes: the basic bootstrap, the Bayesian
bootstrap and the random weighting method. TABLE 3.13 contains the statistics for the
basic bootstrap algorithm with the B-based MFAS scheme, including the mean and
median of 17.65553% and 17.1264% respectively for the class pair (A, E). In TABLE
3.14, the mean and median for the same class pair are 21.73415% and 21.80673%, which
are the statistics of the Bayesian bootstrap algorithm with the B-based MFAS scheme. In
TABLE 3.15, the random weighting algorithm provided the mean and median of
24.55108% and 24.55135% respectively for the same class pair.
TABLE 3.13 Estimates of Bhattacharyya bounds (%) Basic Bootstrap, B-based MFAS
[Columns: class pair, theoretical value, mean, STD, maximum, 90%, 80%, median, 20%, 10%, minimum; class pairs (A, B) through (A, G).]

TABLE 3.14 Estimates of Bhattacharyya bounds (%) Bayesian Bootstrap, B-based MFAS
[Same columns and class pairs as TABLE 3.13.]

TABLE 3.15 Estimates of Bhattacharyya bounds (%) Random Weighting Method, B-based MFAS
[Same columns and class pairs as TABLE 3.13.]
TABLE 3.16 - 3.18 give the statistics of the G-based MFAS schemes. The
statistics of the basic bootstrap algorithm are listed in TABLE 3.16, which gives, for
instance, a mean and median of 22.18424% and 21.58687% respectively for the class
pair (A, E). The mean and median of the Bayesian bootstrap algorithm for the same class
pair are 25.03786% and 24.84284%, listed in TABLE 3.17. In TABLE 3.18, the mean of
the random weighting algorithm for the same class pair is 26.24318%.
TABLE 3.16 Estimates of Bhattacharyya bounds (%) Basic Bootstrap, G-based MFAS
[Columns: class pair, theoretical value, mean, STD, maximum, 90%, 80%, median, 20%, 10%, minimum; class pairs (A, B) through (A, G).]

TABLE 3.17 Estimates of Bhattacharyya bounds (%) Bayesian Bootstrap, G-based MFAS
[Same columns and class pairs as TABLE 3.16.]

TABLE 3.18 Estimates of Bhattacharyya bounds (%) Random Weighting Method, G-based MFAS
[Same columns and class pairs as TABLE 3.16.]
In conclusion, both the B-based and G-based MFAS algorithms have smaller
estimate deviations, and tend to under-estimate the Bhattacharyya bound. Generally, they
do not improve the estimate bias.

In this chapter, which mainly focused on the problem of estimating the
Bhattacharyya bound for error analysis in classification, we showed both theoretically and
experimentally that the estimate of the Bhattacharyya bound is biased. The experiments
also showed that the larger the distance between two classes, the bigger the bias.

However, bootstrap techniques provide a formal solution by which we can reduce
the estimate bias. Two kinds of bootstrap schemes were introduced to correct the estimate
bias: one is the direct bias correction, and the other, the distance bias adjustment. The
direct bias correction schemes are effective in reducing the estimate bias with all three
bootstrap schemes: the basic bootstrap, the Bayesian bootstrap and the random weighting
method. The problems with these schemes are that the standard deviations of the
estimates are big and some estimates are bad, especially in the case of the basic bootstrap
scheme.
The distance bias adjustment schemes come with different strategies depending on
the choice of the base estimate and the step at which the distance bias adjustment is
performed. The experiment results showed that G-based algorithms are generally better
than B-based ones. Although the MFAS-related algorithms have smaller estimate
deviations, they do not significantly reduce the estimate bias. The AFMS-related
algorithms have the advantage of yielding good bias correction for the more distant class
pairs, especially with the Basic Bootstrap scheme. However, the Basic Bootstrap scheme
with the AFMS algorithm produces very large estimate deviations.
In summary, we conclude that the bootstrap technique is useful for improving the
properties of the estimate of the Bhattacharyya bound. We have proposed a direct bias
correction scheme to estimate the Bhattacharyya bound of a class pair with a smaller
distance. The basic bootstrap scheme with the G-based AFMS strategy has been proposed
to estimate the Bhattacharyya bound of a class pair with a larger distance.
4.1 Introduction
In the previous chapter, the estimation problem of the Bhattacharyya bound was
discussed. This bound is only an upper bound on the error rate of a classifier; it is not even
an optimized bound. As the error rate is the crucial measurement of a classifier's
performance, it is preferable to know this error rate, if it can be computed. However, in
most cases, it is very hard to calculate the error rate, even when the distributions of the
classes are known. Hence, estimating the error is an essential problem in statistical
pattern recognition.
Unfortunately, the approaches for estimating the classifier error rate are limited
because of the limitation of both training and testing samples. The two main traditional
methods are the Apparent Error Estimator and the Cross-validation Estimator (Leave-
one-out). The former apparently under-estimates the error rate [Ef86], while the latter is
much better than the first. New studies done in the last two decades based on Efron's
bootstrap technique suggested that it could be used to improve the Cross-validation
Estimator [Ef83]. It was argued that the bootstrap estimator had a much lower variation,
although its bias is larger than that of the Cross-validation scheme. Some alternatives to
the bootstrap approaches, such as the E0 estimator, the 0.632 estimator, etc., were also
discussed in Efron's paper. It pointed out that the 0.632 estimator was a clear winner of
the sampling experiments.
CHAPTER 4: BOOTSTRAP BASED ERROR ESTIMATES 74
After this, a comparative study on these and other methods, including the
Apparent error, Cross-validation, MC and convex bootstrap estimators, was done by
Chernick, Murthy and Nealy [CMN85], [CMN86]. Their experiments were carried
out with a linear classifier and the results also suggested that the 0.632 estimator was
better than the others. Jain, Dubes, and Chen also compared the bootstrap estimator with
the Cross-validation and Apparent error estimators [JDC87]. Their experiments were carried
out with 1-NN and Quadratic classifiers. Their work again supported the conclusions that
bootstrap estimators have a smaller standard deviation than the Cross-validation has, and
that the 0.632 estimator has a better performance than the other estimators. Davison and
Hall discussed the theoretical aspects of the bootstrap and the Cross-validation estimators
[DH92]. They showed that both the bootstrap and the Cross-validation have a bias of
order n^(-1) if the populations are fixed, and only the bias of the bootstrap will increase to
order n^(-1/2) when the populations converge towards one another at the rate of n^(-1/2). To
improve the bias of the bootstrap estimator, another modified bootstrap estimator, the
Leave-one-out bootstrap, was introduced by Efron and Tibshirani [ET97]; it is a
combination of the Cross-validation and bootstrap estimators. Associated with the 0.632
method, the 0.632+ (Leave-one-out) method performed well in the sampling
experiments.
In this chapter, the aforementioned error estimating methods will first be
discussed, except for the MC and Convex bootstrap. These are, in some ways, similar to
the Leave-one-out bootstrap and other methods that will be introduced in this chapter.
The detailed descriptions of the algorithms for each error rate estimator, including the
Apparent error estimator, Cross-validation or Leave-one-out, basic bootstrap, the E0 error
estimator, and two 0.632 error estimators, will be given in the second section. Results of a
series of simulations of these estimators will be illustrated and discussed in the third
section.

In the latter parts of this chapter, some new algorithms will be introduced to
estimate the error rate. These algorithms are mainly based on the ideas of 'local mean'
and 'pseudo-sample data', which are related to the idea of the SIMDAT algorithm
discussed in Chapter 2. Therefore, such an algorithm is referred to as the Pseudo-sample
method, as opposed to the previous algorithms that only use the original sample data to
do calculations, which are referred to as the Real-sample methods. The pseudo-sample
method is the topic of the fourth section.
4.2 Algorithms for Error Rate Estimation
Using the notation given in Chapter 3, let

X = {(x_1, ω_1), (x_2, ω_2), ..., (x_n, ω_n)}    (4.1)

be a training sample, q(x, X) be a decision rule or classifier based on the training sample
X, and Q[ω, q(x, X)] be an error indicator,

Q[ω, q(x, X)] = 0 if q(x, X) = ω, and 1 otherwise.    (4.2)

Thus, the error rate of classifier q(x, X) can be defined as

Err = E{Q[ω, q(x, X)]},

where (x, ω) is a random vector independent of X. In general, the quantity Err is
unknown, and our task is to obtain an estimate for it.
As mentioned above, there are various algorithms available to estimate the error
rate of a classifier. Below are the detailed descriptions of these algorithms.
4.2.1 Apparent Error
The apparent error rate estimator uses the training samples themselves as the
testing samples to estimate the error rate. With the training sample given in (4.1), the
apparent error estimator (Err_app) is

Err_app = (1/n) Σ_{j=1}^{n} Q[ω_j, q(x_j, X)].

A generalized algorithm for Err_app can be described as follows.
Algorithm 4.1 Apparent Error Estimator
Input: (i) c, the number of classes;
(ii) n_i, the size of the training sample of class i;
(iii) x[1, 1], ..., x[n_1, 1], x[1, 2], ..., x[n_c, c], the training sample data.
Output: The error estimate.
Method
BEGIN
BuildClassifier()
Error = 0
Total = 0
For i = 1 to c do
  Total = Total + n_i
  For j = 1 to n_i do
    Error = Error + ClassifyingError(x[j, i])
  End-For
End-For
Error = Error / Total
Return Error
END Apparent Error Estimator
The algorithm is very simple. The purpose of the BuildClassifier procedure in the
above algorithm is to construct the classifier based on the given training samples. The
algorithm for this procedure depends on the kind of classifier being used. If one wants to
use the k-NN classifier with the Euclidean distance, the BuildClassifier
procedure can be omitted. If one wants to use the Bayesian classifier, the task of the
BuildClassifier procedure is to calculate the means and covariances for each class, and
so on. The ClassifyingError procedure performs the function of evaluating the error
indicator given in (4.2). First, the ClassifyingError procedure uses the classifier
constructed by the BuildClassifier procedure to classify a training pattern x[j, i]. It
returns the value 0 if x[j, i] is classified to class i, and it returns the value 1 if the sample is
misclassified. As the details of the BuildClassifier and ClassifyingError procedures are
straightforward (depending on the discriminant function), the algorithms for these two
procedures are not included here.

It is obvious that the algorithm of the Apparent error estimator only substitutes the
training sample to calculate the error rate. Therefore, the estimated value of the error rate
will always be zero with probability 1 if the 1-NN classifier is used in this algorithm and
the feature vector follows a continuous distribution. It is thus clear that the Apparent
error estimator underestimates the error rate.
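This zero-apparent-error behaviour of the 1-NN classifier is easy to demonstrate. The following sketch mirrors Algorithm 4.1; the 1-D data and a 1-NN rule standing in for BuildClassifier and ClassifyingError are hypothetical illustrations, not the thesis's actual procedures.

```python
def one_nn_classify(train, x):
    # Hypothetical 1-NN classifier over (feature, label) pairs with an
    # absolute-difference (1-D Euclidean) distance.
    return min(train, key=lambda p: abs(p[0] - x))[1]

def apparent_error(train, classify=one_nn_classify):
    # Apparent error (Algorithm 4.1): the classifier is built on the
    # training sample and then tested on that same sample.
    wrong = sum(1 for x, label in train if classify(train, x) != label)
    return wrong / len(train)
```

Because every training pattern is its own nearest neighbour at distance zero, `apparent_error` returns 0.0 for any training set with distinct feature values, illustrating why the apparent error of 1-NN is degenerate.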
4.2.2 Cross-validation (Leave-one-out)

Let X_(i) = {(x_j, ω_j) : j ≠ i} denote the subset of the training sample
which does not contain the pattern (x_i, ω_i). The Cross-validation, or Leave-one-out, error
estimator is then

Err_cv = (1/n) Σ_{i=1}^{n} Q[ω_i, q(x_i, X_(i))].

A generalized algorithm for Err_cv is given below.
Algorithm 4.2 Leave-One-Out Error Estimator
Input: (i) c, the number of classes;
(ii) n_i, the size of the training sample of class i;
(iii) x[1, 1], ..., x[n_1, 1], x[1, 2], ..., x[n_c, c], the training sample data.
Output: The error estimate.
Method
BEGIN
Error = 0
Total = 0
For i = 1 to c do
  Total = Total + n_i
  For j = 1 to n_i do
    BuildClassifier(j, i)
    Error = Error + ClassifyingError(x[j, i])
  End-For
End-For
Error = Error / Total
Return Error
END Leave-One-Out Error Estimator
The algorithm is still simple and only slightly different from the Apparent error
algorithm. Here the BuildClassifier procedure is called n times, while in Algorithm 4.1
it is called only once. Every time the BuildClassifier procedure is invoked, two
parameters (j, i) are passed to it, while there is no parameter passed to it in Algorithm 4.1.
The parameters are used to indicate that the pattern x[j, i] is removed from the training
sample, and the remaining training patterns are used to construct the classifier, which
means that X_(j,i), not X, is used to build the classifier. Thus, the classifier will not have
x[j, i] participating when it is used to classify x[j, i]. By this strategy, the problem caused
by the Apparent error estimator is avoided.
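A minimal sketch of Algorithm 4.2 follows, again with a hypothetical 1-D 1-NN classifier standing in for the generic BuildClassifier and ClassifyingError procedures.

```python
def one_nn_classify(train, x):
    # Hypothetical 1-NN rule over (feature, label) pairs.
    return min(train, key=lambda p: abs(p[0] - x))[1]

def leave_one_out_error(train, classify=one_nn_classify):
    # Leave-one-out: pattern i is classified by a classifier built on the
    # remaining n - 1 patterns, so the test pattern never trains itself.
    wrong = 0
    for i, (x, label) in enumerate(train):
        rest = train[:i] + train[i + 1:]
        if classify(rest, x) != label:
            wrong += 1
    return wrong / len(train)
```

Unlike the apparent error, this estimate can be nonzero for 1-NN, because the tested pattern is excluded from the training set used to classify it.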
4.2.3 Basic Bootstrap
As the Apparent error Err_app under-estimates the true error Err, a positive bias
(PB) of Err_app can be defined as

PB(X, F) = Err - Err_app.

The expectation of this bias can be written as

Bias(F) = E[PB(X, F)] = E[Err - Err_app].    (4.6)

If Bias(F) were known, an optimized estimate of Err could be given as

Err_opt = Err_app + Bias(F).    (4.7)

In most cases the Bias is unknown and should be estimated from the training samples.
The Cross-validation algorithm mentioned above estimates the Bias with

Bias_cv = Err_cv - Err_app.

With the bootstrap technique, the Bias(F) may be estimated in several ways.
Suppose the observed patterns of the training sample X are X_1 = x_1, X_2 = x_2, ..., X_n = x_n.
Similar to what was done in the previous chapters, a bootstrap sample X* = {X*_1,
..., X*_n} may be drawn from the empirical distribution F̂. Thus a bootstrap estimate of
the Bias will be

Bias* = Σ_{j=1}^{n} (1/n - w*_j) Q[ω_j, q(x_j, X*)],

where w*_j is the proportion of the pattern x_j appearing in the bootstrap sample X*.
By repeatedly drawing the bootstrap samples B times, the bootstrap estimate of the Bias is

Bias^(boot) = (1/B) Σ_{b=1}^{B} Bias*_b.

Thus, the bootstrap estimate of the Err is

Err_boot = Err_app + Bias^(boot).    (4.11)
The bootstrap estimator discussed above is called the basic bootstrap estimator.
An algorithm for it is given below:
Algorithm 4.3 Basic Bootstrap Error Estimator
Input: (i) c, the number of classes;
(ii) n_i, the size of the training sample of class i;
(iii) x[1, 1], ..., x[n_1, 1], x[1, 2], ..., x[n_c, c], the training sample data;
(iv) B, the repeated times of drawing bootstrap samples.
Output: The error estimate.
Method
BEGIN
Total = 0
For i = 1 to c do
  Total = Total + n_i
End-For
Error_estimate = 0
For k = 1 to B do
  GetBootstrapSample()
  BuildClassifier(y[1, 1], y[2, 1], ..., y[n_c, c])
  Error = 0
  For i = 1 to c do
    For j = 1 to n_i do
      Error = Error + (1 / Total - w[j, i]) x ClassifyingError(x[j, i])
    End-For
  End-For
  Error_estimate = Error_estimate + Error
End-For
Error_estimate = Error_estimate / B
Error_estimate = Error_estimate + AppError()
Return Error_estimate
END Basic Bootstrap Error Estimator
Procedure GetBootstrapSample
Input: (i) c, the number of classes;
(ii) n_i, the size of the training sample of class i;
(iii) x[1, 1], ..., x[n_1, 1], x[1, 2], ..., x[n_c, c], the training sample data.
Output: (i) bootstrap sample y[1, 1], y[2, 1], ..., y[n_c, c];
(ii) weights w[1, 1], w[2, 1], ..., w[n_c, c].
Method
BEGIN
Total = 0
For i = 1 to c do
  Total = Total + n_i
  For j = 1 to n_i do
    w[j, i] = 0
  End-For
End-For
For i = 1 to c do
  For j = 1 to n_i do
    m = random integer in [1, n_i]
    y[j, i] = x[m, i]
    w[m, i] = w[m, i] + 1
  End-For
End-For
For i = 1 to c do
  For j = 1 to n_i do
    w[j, i] = w[j, i] / Total
  End-For
End-For
Return y[1, 1], y[2, 1], ..., y[n_c, c], w[1, 1], w[2, 1], ..., w[n_c, c]
END Procedure GetBootstrapSample
The parameters passed to the GetBootstrapSample procedure are given in the
descriptive part of the procedure, although they are omitted when the procedure is called
in the main body of the algorithm. The GetBootstrapSample procedure in this algorithm
performs two functions: it draws a bootstrap sample and it sets the weights for each pattern
of the training sample. The bootstrap sample is drawn separately from each class. The
reason for doing this is to avoid the bias of the bootstrap sample data. If all the training
samples from the different classes were put together to draw a bootstrap sample, the
bootstrap sample might contain no data from one or more classes. This would cause an
even more biased estimate, or might even produce a bootstrap sample which could not be
used to construct a classifier.
The weights calculated in the procedure are for the bootstrap estimate calculation.
One may argue that the GetWeights procedures given in Chapter 3 can be used as
substitutes for the GetBootstrapSample procedure, so that one could use the different
bootstrap schemes: the basic bootstrap, the Bayesian bootstrap and the random weighting
method. This is true in some cases. The key issue is the kind of classifier used. If the
Bayesian classifier is used, a GetWeights procedure could take the place of the
GetBootstrapSample procedure in the algorithm. Then the weights w[1, 1], w[2, 1], ...,
w[n_c, c] would be passed to the BuildClassifier procedure to construct the classifier.
However, a real bootstrap sample will be required if a k-NN classifier is used. It is not
difficult to see that the Bayesian bootstrap and the random weighting method cannot be
applied to a k-NN classifier.
The BuildClassifier procedure of this algorithm is the same as that for Algorithm
4.1, except that it uses the bootstrap sample to build the classifier. The AppError
procedure simply uses Algorithm 4.1 to calculate an apparent error estimate. Thus, the
parameters (the number of classes, the training sample sizes of the classes, and the
training sample data) have to be passed to the procedure, even though this has not been
indicated in the algorithm.
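The whole of Algorithm 4.3, including the per-class resampling of GetBootstrapSample and the (1/n - w*_j) weighting, can be sketched as follows. The 1-NN rule and the dictionary-based data layout are illustrative assumptions (the thesis's experiments use a 3-NN classifier precisely because the apparent error of 1-NN is degenerate), and the sketch assumes distinct feature values within each class.

```python
import random

def nn_classify(train, x):
    # Hypothetical 1-NN rule over (feature, label) pairs; it stands in
    # for the thesis's BuildClassifier / ClassifyingError procedures.
    return min(train, key=lambda p: abs(p[0] - x))[1]

def basic_bootstrap_error(train_by_class, B=100, rng=None):
    # Basic bootstrap (Algorithm 4.3): apparent error plus the averaged
    # bootstrap bias estimate, resampling each class separately.
    rng = rng or random.Random(0)
    flat = [(x, c) for c, xs in train_by_class.items() for x in xs]
    n = len(flat)
    app = sum(1 for x, c in flat if nn_classify(flat, x) != c) / n
    bias = 0.0
    for _ in range(B):
        boot, count = [], {}
        for c, xs in train_by_class.items():
            for _ in xs:                      # resample within the class
                pick = xs[rng.randrange(len(xs))]
                boot.append((pick, c))
                count[(pick, c)] = count.get((pick, c), 0) + 1
        diff = 0.0
        for x, c in flat:
            w = count.get((x, c), 0) / n      # proportion w*_j of pattern j
            if nn_classify(boot, x) != c:
                diff += 1.0 / n - w           # (1/n - w*_j) contribution
        bias += diff
    return app + bias / B
```

Resampling each class separately, as in GetBootstrapSample, guarantees that every bootstrap sample contains patterns of every class.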
4.2.4 E0 Estimator
The E0 estimator is another estimator based on the bootstrap technique. The
difference between the basic bootstrap and the E0 estimator is that the E0 estimator tests
only the training patterns not contained in a bootstrap sample, while the basic bootstrap
tests all the patterns of the training sample. To carry out an E0 estimate, after each
bootstrap sample is drawn, the error rate calculation has to be changed. Suppose B
bootstrap samples are drawn, the b-th bootstrap sample is denoted as X*b, and the set of
training patterns not in X*b is denoted as A_b. Then the E0 error rate estimate is

Err_E0 = [ Σ_{b=1}^{B} Σ_{(x_j, ω_j) ∈ A_b} Q[ω_j, q(x_j, X*b)] ] / [ Σ_{b=1}^{B} |A_b| ],

where |A_b| denotes the cardinality of the set A_b. The algorithm of the E0 estimator is
presented below.
Algorithm 4.4 E0 Error Estimator
Input: (i) c, the number of classes;
(ii) n_i, the size of the training sample of class i;
(iii) x[1, 1], ..., x[n_1, 1], x[1, 2], ..., x[n_c, c], the training sample data;
(iv) B, the repeated times of drawing bootstrap samples.
Output: The error estimate.
Method
BEGIN
Total = 0
Error = 0
For k = 1 to B do
  GetBootstrapSample()
  BuildClassifier(y[1, 1], y[2, 1], ..., y[n_c, c])
  For i = 1 to c do
    For j = 1 to n_i do
      If w[j, i] = 0 Then
        Error = Error + ClassifyingError(x[j, i])
        Total = Total + 1
      End-If
    End-For
  End-For
End-For
Error = Error / Total
Return Error
END E0 Error Estimator
The procedures used above are the same as those used in Algorithm 4.3,
and so they are not explained in any greater detail. The key point in this algorithm is to
check the weight of a training pattern before it is classified. Thus, all the training patterns
passed to the ClassifyingError procedure are excluded from the bootstrap sample. What
should also be pointed out is that this algorithm yields an error rate estimate directly,
without adding the AppError to it. In other words, the E0 estimator scheme is not an
algorithm based on bias correction.
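A sketch of the E0 scheme (Algorithm 4.4), under the same illustrative 1-NN assumption; only the patterns left out of each bootstrap sample are tested, and the pooled errors are divided by the total number of patterns tested.

```python
import random

def nn_classify(train, x):
    # Hypothetical 1-NN rule over (feature, label) pairs.
    return min(train, key=lambda p: abs(p[0] - x))[1]

def e0_error(train_by_class, B=100, rng=None):
    # E0 estimator (Algorithm 4.4): pool the errors of all training
    # patterns absent from each bootstrap sample, then divide by the
    # total number of patterns tested.
    rng = rng or random.Random(0)
    errors, tested = 0, 0
    for _ in range(B):
        boot, chosen = [], set()
        for c, xs in train_by_class.items():
            for _ in xs:
                k = rng.randrange(len(xs))
                boot.append((xs[k], c))
                chosen.add((c, k))
        for c, xs in train_by_class.items():
            for k, x in enumerate(xs):
                if (c, k) not in chosen:      # left out of this sample
                    tested += 1
                    if nn_classify(boot, x) != c:
                        errors += 1
    return errors / tested if tested else 0.0
```

Note the single pooled ratio: this is exactly what distinguishes E0 from the leave-one-out bootstrap of the next subsection, which averages per-pattern ratios instead.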
4.2.5 Leave-one-out Bootstrap
Compared to the Cross-validation, the bootstrap provides an error estimate with a
bigger downward bias (see [JDC87], [Ha86] and [ET97]). This is because the basic
bootstrap does not explicitly separate the test pattern from the training patterns which are
used to construct the classifier, as the Cross-validation does. So Efron and Tibshirani
developed a scheme called the Leave-one-out bootstrap to deal with this separation.

Recall that in the Cross-validation algorithm, the training pattern set X_(i) was used
to build the classifier when the test was done for the training pattern x_i. Thus, the testing
pattern was not a part of the pattern set used to construct the classifier. So we define an
empirical distribution F̂_(i) on the training sample set X_(i), which places mass 1/(n - 1)
on each pattern of X_(i). Using this, a bootstrap sample X*_(i) drawn from F̂_(i) does
not contain x_i. For each i, there could be an error estimator given by:

Err*_i = E*_(i){ Q[ω_i, q(x_i, X*_(i))] },    (4.14)

where E*_(i) denotes the expectation under the distribution F̂_(i). The error rate estimate of
the Leave-one-out bootstrap is:

Err_cv-boot = (1/n) Σ_{i=1}^{n} Err*_i.

In the real algorithm, the Err*_i in (4.14) can be obtained as follows: first,
the bootstrap samples are drawn as usual (say B bootstrap samples, X*1, X*2, ..., X*B).
Let N_i^b be the number of times the training pattern x_i appears in the b-th bootstrap
sample. We define

I_i^b = 1 if N_i^b = 0, and I_i^b = 0 otherwise.

Thus Err*_i can be given by:

Err*_i = [ Σ_{b=1}^{B} I_i^b Q[ω_i, q(x_i, X*b)] ] / [ Σ_{b=1}^{B} I_i^b ].

The algorithm of the Leave-one-out bootstrap is listed below.
Algorithm 4.5 Leave-one-out Bootstrap Error Estimator
Input: (i) c, the number of classes;
(ii) n_i, the size of the training sample of class i;
(iii) x[1, 1], ..., x[n_1, 1], x[1, 2], ..., x[n_c, c], the training sample data;
(iv) B, the repeated times of drawing bootstrap samples.
Output: The error estimate.
Method
BEGIN
Total = 0
For i = 1 to c do
  Total = Total + n_i
  For j = 1 to n_i do
    Total[j, i] = 0
    Error[j, i] = 0
  End-For
End-For
For k = 1 to B do
  GetBootstrapSample()
  BuildClassifier(y[1, 1], y[2, 1], ..., y[n_c, c])
  For i = 1 to c do
    For j = 1 to n_i do
      If w[j, i] = 0 Then
        Error[j, i] = Error[j, i] + ClassifyingError(x[j, i])
        Total[j, i] = Total[j, i] + 1
      End-If
    End-For
  End-For
End-For
Error = 0
For i = 1 to c do
  For j = 1 to n_i do
    Error = Error + Error[j, i] / Total[j, i]
  End-For
End-For
Error = Error / Total
Return Error
END Leave-one-out Bootstrap Error Estimator
The procedures used in this algorithm are the same as in Algorithm 4.4. It should
be pointed out that this algorithm uses the same pattern set as that in Algorithm 4.4 to
calculate the error rate. The difference between the two algorithms lies in the calculations
themselves. In Algorithm 4.4, the error rate estimate is counted as the ratio of the total
number of misclassified training patterns to the total number of training patterns
tested. In Algorithm 4.5, however, the misclassification ratio of each training pattern
over all the bootstrap samples is calculated first, so that the error rate estimate turns
out to be the average of the ratios over all the training patterns.
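The per-pattern averaging that distinguishes Algorithm 4.5 from Algorithm 4.4 can be sketched as follows; the 1-NN rule and data layout are again illustrative assumptions rather than the thesis's actual procedures.

```python
import random

def nn_classify(train, x):
    # Hypothetical 1-NN rule over (feature, label) pairs.
    return min(train, key=lambda p: abs(p[0] - x))[1]

def loo_bootstrap_error(train_by_class, B=100, rng=None):
    # Leave-one-out bootstrap (Algorithm 4.5): compute a misclassification
    # ratio per pattern over the bootstrap samples that omit it, then
    # average those per-pattern ratios.
    rng = rng or random.Random(0)
    stats = {(c, k): [0, 0]                   # (errors, times tested)
             for c, xs in train_by_class.items() for k in range(len(xs))}
    for _ in range(B):
        boot, chosen = [], set()
        for c, xs in train_by_class.items():
            for _ in xs:
                k = rng.randrange(len(xs))
                boot.append((xs[k], c))
                chosen.add((c, k))
        for c, xs in train_by_class.items():
            for k, x in enumerate(xs):
                if (c, k) not in chosen:      # pattern omitted: test it
                    stats[(c, k)][1] += 1
                    if nn_classify(boot, x) != c:
                        stats[(c, k)][0] += 1
    ratios = [e / t for e, t in stats.values() if t > 0]
    return sum(ratios) / len(ratios) if ratios else 0.0
```

Patterns that happen to appear in every bootstrap sample are skipped here, a guard the pseudocode leaves implicit for large B.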
4.2.6 0.632 Estimators
The motivation of the 0.632 estimators is not strong. However, simulation
results showed that the 0.632 estimators have good performance (see [Ef83], [CMN85],
[CMN86], and [ET97]). When a bootstrap sample of size m is drawn from a training
sample of size n, the probability of a training pattern being excluded from the bootstrap
sample is (1 - 1/n)^m if the training patterns from all the classes are put together to draw a
bootstrap sample. In general, since the value of m is set to n, as n goes to infinity,
(1 - 1/n)^n tends to e^(-1) ≈ 0.368. So the value 0.632 is considered to be the contribution rate of
the E0 estimator to the bias correction [Ef83]. Thus, the Bias defined in (4.6) can be
estimated by:

Bias^(0.632) = 0.632 x (Err_E0 - Err_app).

This gives the Err estimate

Err^(0.632) = Err_app + Bias^(0.632) = 0.632 x Err_E0 + 0.368 x Err_app.    (4.28)

As the Leave-one-out bootstrap estimator has the same property as the E0 estimator (i.e.
only the training patterns excluded from the bootstrap sample are counted to estimate the
error rate), an alternative is to use the Leave-one-out bootstrap estimator Err_cv-boot
instead of Err_E0, which yields the Err estimate

Err^(0.632+) = 0.632 x Err_cv-boot + 0.368 x Err_app.    (4.29)

It is obvious that the algorithm for Err^(0.632) is a straightforward application of the
algorithms for Err_E0 and Err_app, and formula (4.28). Similarly, the algorithm to yield
Err^(0.632+) is an application of the algorithms for Err_cv-boot and Err_app, and formula (4.29).
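The 0.632 combination itself is a one-line weighted average, shown here together with the exclusion probability (1 - 1/n)^n that motivates the weights; the function names are illustrative assumptions.

```python
import math

def err_632(app_err, other_err):
    # 0.632 rule: other_err is the E0 estimate (for 0.632) or the
    # leave-one-out bootstrap estimate (for 0.632+).
    return 0.632 * other_err + 0.368 * app_err

def exclusion_probability(n):
    # Probability that a fixed pattern is excluded from a bootstrap
    # sample of size n drawn from n patterns; tends to exp(-1) ~ 0.368.
    return (1.0 - 1.0 / n) ** n
```

Because the apparent error is downward-biased and the E0-style estimates are upward-biased, the fixed weights pull the combination toward the true error from both sides.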
4.3 Simulation Results
Experiments were carried out for all the algorithms discussed in the last section.
All of the experiments used Data Set 1 with training sample sizes 8 and 24 for each
class. As in Chapter 3, only two classes were involved in an experiment: one of them was
the class A, and the other was chosen from the rest.

As the 1-NN classifier is meaningless for the Apparent error estimator, the 3-NN
classifier was used in all the experiments. The true errors were obtained by using
independent testing samples of size 1,000 for each class. Therefore, the true errors are the
result of measuring 2,000 independent testing patterns. Each experiment was repeated
200 times. TABLE 4.1 and TABLE 4.2 give the results of the average out of 200 trials
with training sample sizes 8 and 24 for each class respectively.
TABLE 4.1 Error Rate Estimates of 3-NN classifiers (sample size 8)
[Rows: the true error and each estimator of Section 4.2 (Apparent Error, Leave-One-Out, Basic Bootstrap, E0, 0.632, Leave-One-Out Bootstrap, 0.632+); mean and STD over 200 trials for class pairs (A, B) through (A, G).]

TABLE 4.2 Error Rate Estimates of 3-NN classifiers (sample size 24)
[Same rows and class pairs as TABLE 4.1.]
The results of the experiments show that the bootstrap technique works well in this case. Compared to the Leave-one-out estimator, all the bootstrap-related algorithms have relatively smaller standard deviations, although some of them have larger biases than the Leave-one-out. The basic bootstrap algorithm improved on the Apparent error rate estimator, but it still tends to under-estimate the error rates, while the E0 and Leave-one-out bootstrap algorithms generally over-estimate the error rates. The 0.632+ algorithm did enhance the performance of the 0.632 algorithm, which confirms the importance of the Leave-one-out approach. Although the Leave-one-out algorithm appears attractive, always yielding the best performance with all the experiment class pairs, it will be seen later in this chapter that, from some standpoints, the bootstrap-related algorithms perform better than the Leave-one-out. It can also be seen that the results of these experiments are basically consistent with the research works reported in [Ef83], [CMN85], [Ef86], [CMN86], [JDC87], and [ET97].
4.4. Pseudo-Sample Algorithm
As pointed out at the beginning of the chapter, the algorithms previously mentioned use only the training patterns to estimate the error rate of the classifier. This means that the training patterns are used for both the construction of the classifier and for testing the classifier. Two lessons have been learned from the above results:

1. The training patterns used to test the classifier should be separated from those used to build the classifier. This is the key issue that makes the Leave-one-out error estimator perform well. This is also why the 0.632+ estimator performed better than the 0.632 estimator;

2. A larger training sample size would improve the error estimators. This can be seen by comparing the results in TABLE 4.1 and TABLE 4.2.
Separating the training patterns used for testing from those used for building the classifier can be done by merely applying the Cross-validation technique. However, in practice, the training sample size is limited. The separation of the two implies a reduction in the number of patterns used to build the classifier. Although there will generally not be a significant effect if only one training pattern is left out, this is not the case when the sample size is itself small. The examples of the 0.632 estimators have shown that a weak relationship between the patterns used for testing and those used for building the classifier will not make things worse. Therefore, it would be beneficial to devise a way to increase the training sample size and simultaneously retain the patterns used for testing the classifier in question.

This is what the SIMDAT algorithm given in Section 2.3.5 does. The algorithms provided here use a SIMDAT-like approach to generate pseudo-training patterns. Then the original and the pseudo-training patterns are used separately as the testing patterns and as the patterns to construct the classifier. To distinguish them from the algorithms mentioned in the last section, the algorithms discussed here are referred to as pseudo-sample algorithms.

The main idea in the SIMDAT algorithm is that of combining the weighted m-nearest neighbours of a sample point to construct a pseudo-sample point. Here, the same approach is used to construct pseudo-training patterns. There is no restriction on the method to be used for setting the weights of the m-nearest neighbours of a training pattern. A generalized algorithm to construct a pseudo-training pattern is given below:
Algorithm 4.6 Generating A Pseudo-Training Pattern
Procedure GetPseudoPattern I
Input: x[1], x[2], ..., x[m], the m-nearest neighbours of the training pattern x
Output: A pseudo-training pattern.
Method
BEGIN
    GetWeights(m)
    y = w[1] × x[1] + w[2] × x[2] + ... + w[m] × x[m]
    Return y
END Procedure GetPseudoPattern I
Observe that the m-nearest neighbours of the training pattern x include x itself. The GetWeights procedure above can be any one of the GetWeights procedures presented in Chapter 3. Of course, the GetWeights procedures for the Bayesian bootstrap or the random weighting method are preferred, because then the weights w[1], w[2], ..., w[m] are continuously distributed, and the pseudo-training pattern generated would not be identical to any pattern in the training sample set. The main idea is that the pseudo-training pattern generated should oscillate around the mean of the m-nearest neighbours. So another way to generate a pseudo-training pattern is to apply the approach represented in the SIMDAT algorithm. The following algorithm uses this approach:
Algorithm 4.7 Generating A Pseudo-Training Pattern
Procedure GetPseudoPattern II
Input: x[1], x[2], ..., x[m], the m-nearest neighbours of the training pattern x
Output: A pseudo-training pattern.
Method
BEGIN
    Mean = 0
    For i = 1 to m do
        w[i] = RandomVariableGenerator()
        Mean = Mean + x[i]
    End-For
    Mean = Mean / m
    v = w[1] × (x[1] - Mean) + w[2] × (x[2] - Mean) + ... + w[m] × (x[m] - Mean)
    y = v + Mean
    Return y
END Procedure GetPseudoPattern II
The differences between Algorithm 4.7 and Algorithm 4.6 are that the weights in Algorithm 4.7 are not standardized, and that the weights are assigned to the centralized feature vectors (x[i] - Mean). After the linear combination of the centralized feature vectors has been obtained, the mean of the m-nearest neighbours is added back to obtain the pseudo-training pattern. Thus, the RandomVariableGenerator procedure above should generate a random variable with zero as its expectation. It is easy to see that Algorithm 4.7 uses the approach used in the SIMDAT algorithm to obtain pseudo-patterns.
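As an illustration, the two generators can be sketched as follows. This is a hedged sketch, not the thesis's code: NumPy's Dirichlet sampler stands in for a continuously distributed GetWeights procedure, and the zero-mean uniform bounds follow the SIMDAT-style choice quoted later in this section.

```python
import numpy as np

def pseudo_pattern_weighted(neighbours, rng):
    """Algorithm 4.6 style: convex combination of the m neighbours with
    continuously distributed weights that sum to one."""
    x = np.asarray(neighbours, dtype=float)
    w = rng.dirichlet(np.ones(len(x)))   # Bayesian-bootstrap-like weights
    return w @ x

def pseudo_pattern_simdat(neighbours, rng):
    """Algorithm 4.7 style: zero-mean random weights applied to the
    centralized neighbours, then the local mean added back."""
    x = np.asarray(neighbours, dtype=float)
    m = len(x)
    mean = x.mean(axis=0)
    half = np.sqrt(3.0 * (m - 1)) / m    # weights ~ U(-half, +half)
    w = rng.uniform(-half, half, size=m)
    return w @ (x - mean) + mean
```

Both generators oscillate around the local mean of the m neighbours; the first always stays inside their convex hull, while the second need not.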
After the pseudo-training patterns have been generated, they can be used in various ways: either as the testing patterns, or as the patterns used to construct the classifier itself. The scheme is called a Pseudo-Testing algorithm if the pseudo-training samples are used for testing purposes, and a Pseudo-Classifier algorithm if the pseudo-training samples are used to construct the classifier. A Pseudo-Testing algorithm is described below:
Algorithm 4.8 Pseudo-Testing
Input: (i) c, the number of classes;
       (ii) ni, the size of the training sample of class i;
       (iii) x[1, 1], ..., x[n1, 1], x[1, 2], ..., x[nc, c], the training sample data;
       (iv) B, the repeated times of drawing bootstrap samples;
       (v) m, the number of the nearest neighbours of a training pattern.
Output: The error estimate.
Method
BEGIN
    Total = 0
    Error = 0
    For i = 1 to c do
        For j = 1 to ni do
            GetNearestNeighboursOf(x[j, i], m)
        End-For
    End-For
    For k = 1 to B do
        GeneratePseudoSample()
        BuildClassifier(x[1, 1], x[2, 1], ..., x[nc, c])
        For i = 1 to c do
            For j = 1 to ni do
                Error = Error + ClassifyingError(y[j, i])
                Total = Total + 1
            End-For
        End-For
    End-For
    Error = Error / Total
    Return Error
END Pseudo-Testing
Procedure GeneratePseudoSample
Input: (i) c, the number of classes;
       (ii) ni, the size of the training sample of class i;
       (iii) x[1, 1], ..., x[n1, 1], x[1, 2], ..., x[nc, c], the training sample data.
Output: The pseudo-training patterns y[1, 1], ..., y[n1, 1], y[1, 2], ..., y[nc, c].
Method
BEGIN
    For i = 1 to c do
        For j = 1 to ni do
            k = random integer in [1, ni]
            y[j, i] = GetPseudoPattern(m-nearest neighbours of x[k, i])
        End-For
    End-For
END GeneratePseudoSample
In this algorithm, y[1, 1], y[2, 1], ..., y[nc, c] represent the pseudo-training samples generated. The pseudo-training samples have, not only in total but also for each class, the same size as the original training sample. After the classifier has been built with the original training sample, the pseudo-training sample is used to estimate the error rate of the classifier. The GetNearestNeighboursOf(x[j, i], m) procedure is used to calculate the m-nearest neighbours of the training pattern x[j, i]. This function basically calculates the Euclidean distances between x[j, i] and the other patterns in the same class, and selects the m patterns with the minimal distances. The GetPseudoPattern procedure, invoked in the GeneratePseudoSample procedure, can be either of those given in Algorithms 4.6 and 4.7. Unlike the SIMDAT algorithm, the GetPseudoPattern procedure does not generate a pseudo-training pattern for each pattern in the training sample. Instead, it uses the bootstrap technique to randomly select a training pattern first; then it uses the m-nearest neighbours of that pattern to generate a pseudo-training pattern.
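The sampling step just described can be sketched as follows. This is an illustrative Python rendition, not the thesis's code; Dirichlet weights stand in for a continuous GetWeights procedure, and distances are Euclidean as in the text.

```python
import numpy as np

def m_nearest(X, idx, m):
    """Indices of the m nearest neighbours of X[idx] within its own
    class (the pattern itself is included, as in the thesis)."""
    d = np.linalg.norm(X - X[idx], axis=1)
    return np.argsort(d)[:m]

def generate_pseudo_sample(class_X, m, rng):
    """One pseudo-pattern per original pattern and class: bootstrap-select
    a seed pattern, then combine its m nearest neighbours."""
    pseudo = []
    for X in class_X:                    # X holds the patterns of one class
        X = np.asarray(X, dtype=float)
        n = len(X)
        out = np.empty_like(X)
        for j in range(n):
            k = int(rng.integers(n))     # bootstrap selection of the seed
            nb = X[m_nearest(X, k, m)]
            w = rng.dirichlet(np.ones(m))
            out[j] = w @ nb              # GetPseudoPattern step
        pseudo.append(out)
    return pseudo
```

Because each pseudo-pattern is a convex combination of same-class neighbours, the generated sample keeps each class's size and stays inside its convex hull.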
Experiments on Algorithm 4.8 were done with the Data Set 1. Only two classes were involved in an experiment, each of them having a training sample of size eight. All of the experiments were carried out with a 3-NN classifier. Two different GetPseudoPattern procedures were used in the experiments: one of them used Algorithm 4.6 with the GetWeights procedure for the Bayesian bootstrap, and the other used Algorithm 4.7 with a RandomVariableGenerator procedure to generate U(-√(3(m - 1))/m, +√(3(m - 1))/m) distributed random variables. The other parameters used in the experiments were: B = 200, m = 3. TABLE 4.3 lists the statistics from the 200 trials for each experiment.
TABLE 4.3 Error Rate Estimates of the Pseudo-Testing algorithm
[means and standard deviations of the True error and of the SIMDAT-based and Bayesian-bootstrap-based Pseudo-Testing estimates for each class pair; the tabulated values are illegible in the scanned copy]
The results indicate that the estimates provided by the pseudo-testing algorithms are seriously biased downward. The cause of this is that the testing patterns are linear combinations of the m-nearest neighbours. As these linear combinations tend to the local mean of the patterns in a class, their characteristics are more pronounced, and easier to distinguish from each other. So the performance of the Pseudo-Testing algorithm is not satisfactory.
4.4.2 The Pseudo-Classifier Algorithm
The Pseudo-Classifier algorithm exchanges the position of the original training patterns with that of the pseudo-training patterns. The pseudo-training sample is used to build the classifier, and the original training sample is used to test the classifier error rate. The pseudo-classifier algorithm is presented here:
Algorithm 4.9 Pseudo-Classifier
Input: (i) c, the number of classes;
       (ii) ni, the size of the training sample of class i;
       (iii) x[1, 1], ..., x[n1, 1], x[1, 2], ..., x[nc, c], the training sample data;
       (iv) B, the repeated times of drawing bootstrap samples;
       (v) m, the number of the nearest neighbours of a training pattern.
Output: The error estimate.
Method
BEGIN
    Total = 0
    Error = 0
    For i = 1 to c do
        For j = 1 to ni do
            GetNearestNeighboursOf(x[j, i], m)
        End-For
    End-For
    For k = 1 to B do
        GeneratePseudoSample()
        BuildClassifier(y[1, 1], y[2, 1], ..., y[nc, c])
        For i = 1 to c do
            For j = 1 to ni do
                Error = Error + ClassifyingError(x[j, i])
                Total = Total + 1
            End-For
        End-For
    End-For
    Error = Error / Total
    Return Error
END Pseudo-Classifier
The algorithm above is almost the same as Algorithm 4.8, except that the parameters passed to the BuildClassifier and ClassifyingError procedures have changed. Although the change is small, the performance is greatly improved. TABLE 4.4 and TABLE 4.5 list the statistics of the experiments on Algorithm 4.9. The experiments were done with the Data Set 1.

All the experiment parameters are the same as those for the experiments on Algorithm 4.8, except that two training sample sizes, 8 and 24, are used here. Additional experiments were carried out with the GetWeights procedure for the random weighting method.
TABLE 4.4 Error Rate Estimates of the Pseudo-Classifier algorithm (sample size 8)
[means and standard deviations of the True error and of the SIMDAT, Bayesian bootstrap and Random Weighting Pseudo-Classifier estimates for class pairs (A, B) through (A, G); the tabulated values are illegible in the scanned copy]

TABLE 4.5 Error Rate Estimates of the Pseudo-Classifier algorithm (sample size 24)
[same layout as TABLE 4.4; the tabulated values are illegible in the scanned copy]

For convenience, a pseudo-classifier algorithm is referred to as a SIMDAT pseudo-classifier algorithm (SPC) if the SIMDAT-like algorithm was used to generate the pseudo-training sample; as a Bayesian Bootstrap pseudo-classifier algorithm (BBPC) if the Bayesian bootstrap is used to generate the pseudo-training sample; and as a Random Weighting pseudo-classifier algorithm (RWPC) if the random weighting method is used to generate the pseudo-training sample.
Although the SPC algorithm did not perform well enough in the experiments, the BBPC and RWPC performed better than the basic bootstrap. They have smaller biases and standard deviations. The experiments also show that, for the purpose of error rate estimation, the pseudo-training pattern generating algorithm, Algorithm 4.6, is better than Algorithm 4.7.
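The swap of roles described above can be sketched as follows; a 1-NN rule and a generic pseudo-sample generator are illustrative stand-ins for the 3-NN classifier and the generators used in the experiments.

```python
import numpy as np

def nn_classify(train_X, train_y, pattern):
    """Plain 1-NN rule (illustrative stand-in for the 3-NN classifier)."""
    d = np.linalg.norm(train_X - pattern, axis=1)
    return train_y[np.argmin(d)]

def pseudo_classifier_error(X, y, gen_pseudo, B, rng):
    """Pseudo-Classifier scheme: the classifier is built on each of the B
    pseudo-samples, and the original patterns are used for testing."""
    errors, total = 0, 0
    for _ in range(B):
        PX, Py = gen_pseudo(X, y, rng)   # pseudo-sample builds the classifier
        for p, label in zip(X, y):       # original patterns test it
            if nn_classify(PX, Py, p) != label:
                errors += 1
            total += 1
    return errors / total
```

Compared with the Pseudo-Testing sketch, only the roles of `(X, y)` and `(PX, Py)` in the inner loop are exchanged.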
4.4.3 Leave-One-Out Pseudo-Classifier Algorithm
A further improvement on the pseudo-classifier algorithm is to introduce the Leave-one-out approach. As we have already seen that the bootstrap algorithm is sharpened by the Leave-one-out, it is natural to think that the pseudo-classifier algorithm might benefit from it as well. The Leave-one-out pseudo-classifier algorithm (LPC) is given below.
Algorithm 4.10 Leave-One-Out Pseudo-Classifier
Input: (i) c, the number of classes;
       (ii) ni, the size of the training sample of class i;
       (iii) x[1, 1], ..., x[n1, 1], x[1, 2], ..., x[nc, c], the training sample data;
       (iv) B, the repeated times of drawing bootstrap samples;
       (v) m, the number of the nearest neighbours of a training pattern.
Output: The error estimate.
Method
BEGIN
    Total = 0
    Error = 0
    For i = 1 to c do
        For j = 1 to ni do
            GetNearestNeighboursOf(x[j, i], m)
        End-For
    End-For
    For k = 1 to B do
        For i = 1 to c do
            For j = 1 to ni do
                GeneratePseudoSample(x[j, i])
                BuildClassifier(y[1, 1], y[2, 1], ..., y[nc, c])
                Error = Error + ClassifyingError(x[j, i])
                Total = Total + 1
            End-For
        End-For
    End-For
    Error = Error / Total
    Return Error
END Leave-One-Out Pseudo-Classifier
Procedure GeneratePseudoSample
Input: (i) c, the number of classes;
       (ii) ni, the size of the training sample of class i;
       (iii) x[1, 1], ..., x[n1, 1], x[1, 2], ..., x[nc, c], the training sample data;
       (iv) x[q, p], the training pattern to be left out.
Output: The pseudo-training patterns y[1, 1], ..., y[n1, 1], y[1, 2], ..., y[nc, c].
Method
BEGIN
    For i = 1 to c do
        For j = 1 to ni do
            Do
                k = random integer in [1, ni]
            While (x[k, i] = x[q, p])
            y[j, i] = GetPseudoPattern(m-nearest neighbours of x[k, i])
        End-For
    End-For
END GeneratePseudoSample
The reader will observe that there are some changes in Algorithm 4.10 compared to Algorithm 4.9. First, the GeneratePseudoSample and BuildClassifier procedures have moved to the innermost For loop, so different pseudo-training samples are used to test different training patterns. Secondly, when the GeneratePseudoSample procedure is invoked, a pattern x[q, p] is passed to it. When executing the GeneratePseudoSample procedure, there is a While loop to ensure that the patterns drawn by the bootstrap sampling will exclude the pattern x[q, p]. This means that no pseudo-training pattern would be a linear combination of the m-nearest neighbours of x[q, p]. Although it is still possible for x[q, p] to be involved in the pseudo-training sample as one of the m-nearest neighbours of another pattern, the relationship between x[q, p] and the pseudo-training sample is relatively weak.
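The Do/While exclusion can be sketched in isolation; this minimal Python sketch mirrors the loop with zero-based indices.

```python
import numpy as np

def draw_excluding(n, q, rng):
    """Bootstrap index in [0, n) guaranteed to differ from the index q of
    the left-out pattern, as in the While loop of Algorithm 4.10."""
    k = int(rng.integers(n))
    while k == q:                        # redraw until a different pattern
        k = int(rng.integers(n))
    return k
```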
TABLE 4.6 and TABLE 4.7 list the statistics of the experiments using Algorithm 4.10. The experiments were done with the Data Set 1. All the parameters used are the same as those used for the experiments involving Algorithm 4.9.
TABLE 4.6 Error Rate Estimates of the Leave-One-Out Pseudo-Classifier Algorithm (sample size 8)
[means and standard deviations of the True error and of the SIMDAT, Bayesian bootstrap and Random Weighting variants for class pairs (A, B) through (A, G); the tabulated values are illegible in the scanned copy]

TABLE 4.7 Error Rate Estimates of the Leave-One-Out Pseudo-Classifier Algorithm (sample size 24)
[same layout as TABLE 4.6; the tabulated values are illegible in the scanned copy]

As in the previous section, a pseudo-classifier algorithm is referred to as a Leave-One-Out SIMDAT pseudo-classifier (CVSPC), Leave-One-Out Bayesian Bootstrap pseudo-classifier (CVBBPC), or Leave-One-Out Random Weighting pseudo-classifier (CVRWPC) algorithm according to which approach is used for generating the pseudo-training sample. Here, the 'CV' stands for Cross-Validation.

Although the CVSPC algorithm does not perform well here, the CVBBPC and CVRWPC algorithms achieved great progress compared with the BBPC and RWPC
algorithms. Thus, for instance, for class pair (A, E), the means of BBPC and RWPC over 200 trials are 14.96141% and 14.89687%, as given in TABLE 4.4, while those of CVBBPC and CVRWPC are both 19.78188%, as given in TABLE 4.6. As the mean of the true error rate is 23.54025%, it is obvious that CVBBPC and CVRWPC provided much better estimates than BBPC and RWPC did. This is true not only for class pair (A, E) but also for all of the other class pairs. Although they are not considered to be the best among all the algorithms discussed in this chapter, in the next section we will see that the CVBBPC and CVRWPC algorithms are definitely among the top-level performers.
4.5. Summary
In this chapter, two types of algorithms for error rate estimation have been discussed. Algorithms of the first type are referred to as real-sample algorithms because they only use the real training samples to estimate the error rate. As opposed to these, algorithms of the other type are referred to as pseudo-sample algorithms because they generate a set of pseudo-training samples to estimate the error rate.

The conclusions of the experiments on the algorithms are:

1) The apparent error estimator is obviously downward biased, and the bootstrap techniques are helpful for correcting the bias.

2) The Leave-one-out estimator, on average, has a smaller bias, while its standard deviation is larger than those related to the bootstrap techniques. A notable fact is that the Leave-one-out estimator seems robust for all of the class pairs involved in the experiments.
3) The bootstrap-related estimators have smaller standard deviations, and their biases are also smaller. However, it seems that their performance is affected by the change of the class pair.

4) The pseudo-training sample algorithms work well if the pseudo-training patterns are used to construct the classifier. To generate pseudo-training patterns, the Bayesian bootstrap or random weighting-like approaches are better than the SIMDAT-like approach.
According to the statistics on the averages and standard deviations of the experiments, it is hard to say which algorithm performs the best. We explain below an alternate method for comparing these algorithms, which may be of great interest.

In each experiment, an algorithm was repeatedly carried out for 200 trials; correspondingly, there were also 200 true errors obtained by using independent testing samples. We now examine whether the estimate given by an algorithm is close to the true error when a training sample is given. This would be an important measurement of the performance of an algorithm. For each of the algorithms, the absolute difference between its estimate and the true error was calculated for each trial. TABLE 4.8 and TABLE 4.9 give the statistics of the absolute difference over the 200 trials.
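The comparison reduces to a simple per-estimator statistic; a minimal sketch:

```python
import numpy as np

def abs_difference_stats(estimates, true_errors):
    """Mean and standard deviation of the per-trial absolute difference
    between an estimator's output and the independently measured true
    error, as tabulated in TABLE 4.8 and TABLE 4.9."""
    d = np.abs(np.asarray(estimates, dtype=float)
               - np.asarray(true_errors, dtype=float))
    return d.mean(), d.std()
```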
TABLE 4.8 Absolute difference between estimates and true errors (sample size 8)
[means and standard deviations of the absolute difference for the 0.632, 0.632+, Basic Bootstrap, E0, Leave-one-out, Leave-one-out Bootstrap, BBPC, RWPC, CVBBPC and CVRWPC estimators, per class pair; the tabulated values are illegible in the scanned copy]

TABLE 4.9 Absolute difference between estimates and true errors (sample size 24)
[same layout as TABLE 4.8; the tabulated values are illegible in the scanned copy]
Of course, it is understood that the smaller its absolute difference from the true error, the better the performance of an estimator. From this point of view, the Leave-one-out bootstrap algorithm is definitely better than the plain Leave-one-out. The same is almost true for the CVBBPC and CVRWPC algorithms. TABLE 4.10 and TABLE 4.11 rank the performances of the algorithms according to the mean values given in TABLE 4.8 and TABLE 4.9.
TABLE 4.10 Performance ranks according to the absolute difference from the true error (sample size 8)
[rankings of the ten estimators for each class pair; the tabulated values are illegible in the scanned copy]
The statistics above definitely convince us that the bootstrap-related algorithms are generally better than the Leave-one-out. They also suggest that the 0.632, 0.632+, CVBBPC or CVRWPC should be considered first if the training sample size is small, while the 0.632, 0.632+, E0 and Leave-One-Out should be considered if the training sample size is relatively large.
TABLE 4.11 Performance ranks according to the absolute difference from the true error (sample size 24)
[rankings of the ten estimators for each class pair; the tabulated values are illegible in the scanned copy]
CHAPTER 5
BOOTSTRAP-BASED CLASSIFIER DESIGN
5.1 Introduction
Both the Bhattacharyya bound and the error rate of a classifier discussed in the previous chapters involve similar problems: that of estimating an unknown quantity. In these problems, a question of bias correction arose, and the bootstrap techniques were applied to correct it. Although the details are different, the basic strategies are similar when the bootstrap technique is applied to this kind of problem: it involves sampling from the original training set, and using the bias deduced from the bootstrap sample to estimate the true bias. Generally, the bootstrap technique works well in such a case.

As opposed to this, the classifier design problem is a totally different issue, as it is not a problem of bias correction. What is needed here is a technique to sharpen the classifier within a given training sample. Therefore, the way to apply the bootstrap technique to the problem of classifier design is different from what was discussed before. Here, the research will focus on how to generate an artificial training sample set from the original one. The known work was done by Hamamoto, Uchimura and Tomita [HUT97], who proposed bootstrap schemes to achieve this. The details of their work make up the content of Section 5.2. In Section 5.3, some new schemes are introduced, which are mainly based on the pseudo-sample algorithm mentioned in Chapter 4. The experimental results obtained by using these schemes will be discussed in Section 5.4.
CHAPTER 5: BOOTSTRAP-BASED CLASSIFIER DESIGN 111
5.2 Previous Work
As mentioned above, the question adcûessed here is one of generating an artificial
training sample set when applying the bootstmp technique to the problem of classifier
design. Therefore. the schemes provided by Hamamoto CHUT971 are for generating the
synthetic training samples used to build the classifier. Below are the four algorithm
given in Hamamoto's paper.
Algorithm 5.1 Bootstrapping I
Input: (i) c, the number of classes;
       (ii) ni, the size of the training sample of class i;
       (iii) x[1, 1], ..., x[n1, 1], x[1, 2], ..., x[nc, c], the training sample data;
       (iv) m, the number of the nearest neighbours of a training pattern.
Output: y[1, 1], ..., y[n1, 1], y[1, 2], ..., y[nc, c], the artificial training sample data.
Method
BEGIN
    For i = 1 to c do
        For j = 1 to ni do
            GetNearestNeighboursOf(x[j, i], m)
        End-For
    End-For
    For i = 1 to c do
        For j = 1 to ni do
            k = random integer in [1, ni]
            y[j, i] = GenerateArtificialPattern(m-nearest neighbours of x[k, i])
        End-For
    End-For
    Return y[1, 1], ..., y[n1, 1], y[1, 2], ..., y[nc, c]
END Bootstrapping I
Procedure GenerateArtificialPattern
Input: x[1], x[2], ..., x[m], the m-nearest neighbours of the training pattern x
Output: An artificial training pattern.
Method
BEGIN
    GetWeights(m)
    y = w[1] × x[1] + w[2] × x[2] + ... + w[m] × x[m]
    Return y
END Procedure GenerateArtificialPattern
Procedure GetWeights
Input: m, the number of the neighbours of a training pattern.
Output: w[1], w[2], ..., w[m], the weights.
Method
BEGIN
    Sum = 0
    For j = 1 to m do
        w[j] = random uniform U[0, 1]
        Sum = Sum + w[j]
    End-For
    For j = 1 to m do
        w[j] = w[j] / Sum
    End-For
    Return w[1], w[2], ..., w[m]
END Procedure GetWeights
The GetNearestNeighboursOf procedure is used to obtain the m-nearest neighbours of a training pattern. Its task is to calculate the Euclidean distances between the given pattern and the others in the same class, sort the distances calculated, and collect the first m patterns having the shortest distances. Note that a pattern itself is included in its m-nearest neighbours. The main procedure in this algorithm is the GenerateArtificialPattern procedure, which provides the artificial training sample set used to construct the classifier. The key point of the algorithm is to use a weighted average of the m-nearest neighbours of a training pattern as an artificial training pattern. Observe that the
weights given by the GetWeights procedure are just those provided by the random weighting method.
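A compact rendition of Bootstrapping I for a single class can be sketched as follows, assuming Euclidean distances; the random-weighting GetWeights step is reproduced directly, while the helper names are illustrative.

```python
import numpy as np

def get_weights(m, rng):
    """Random weighting method: U[0, 1] variates normalized to sum to 1."""
    w = rng.uniform(0.0, 1.0, size=m)
    return w / w.sum()

def bootstrapping_one(X, m, rng):
    """Bootstrapping I for one class: each artificial pattern is a weighted
    average of the m nearest neighbours of a randomly selected pattern."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    out = np.empty_like(X)
    for j in range(n):
        k = int(rng.integers(n))                 # random original pattern
        d = np.linalg.norm(X - X[k], axis=1)
        nb = X[np.argsort(d)[:m]]                # m-NN, the pattern included
        out[j] = get_weights(m, rng) @ nb        # weighted average
    return out
```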
Algorithm 5.2 Bootstrapping II
Input: (i) c, the number of classes;
       (ii) ni, the size of the training sample of class i;
       (iii) x[1, 1], ..., x[n1, 1], x[1, 2], ..., x[nc, c], the training sample data;
       (iv) m, the number of the nearest neighbours of a training pattern.
Output: y[1, 1], ..., y[n1, 1], y[1, 2], ..., y[nc, c], the artificial training sample data.
Method
BEGIN
    For i = 1 to c do
        For j = 1 to ni do
            GetNearestNeighboursOf(x[j, i], m)
        End-For
    End-For
    For i = 1 to c do
        For j = 1 to ni do
            y[j, i] = GenerateArtificialPattern(m-nearest neighbours of x[j, i])
        End-For
    End-For
    Return y[1, 1], ..., y[n1, 1], y[1, 2], ..., y[nc, c]
END Bootstrapping II
The only change between Algorithm 5.1 and Algorithm 5.2 is that the step

k = random integer in [1, ni]

in Algorithm 5.1 is eliminated. This means that the artificial training sample simply replaces each pattern in the original sample with the weighted average of its m-nearest neighbours.
Algorithm 5.3 Bootstrapping III
Input: (i) c, the number of classes;
       (ii) ni, the size of the training sample of class i;
       (iii) x[1, 1], ..., x[n1, 1], x[1, 2], ..., x[nc, c], the training sample data;
       (iv) m, the number of the nearest neighbours of a training pattern.
Output: y[1, 1], ..., y[n1, 1], y[1, 2], ..., y[nc, c], the artificial training sample data.
Method
BEGIN
    For i = 1 to c do
        For j = 1 to ni do
            GetNearestNeighboursOf(x[j, i], m)
        End-For
    End-For
    For i = 1 to c do
        For j = 1 to ni do
            k = random integer in [1, ni]
            y[j, i] = the average of the m-nearest neighbours of x[k, i]
        End-For
    End-For
    Return y[1, 1], ..., y[n1, 1], y[1, 2], ..., y[nc, c]
END Bootstrapping III
Unlike Algorithm 5.2, this algorithm modifies Algorithm 5.1 in another way. It replaces an original training pattern directly with the arithmetic average of its m-nearest neighbours; meanwhile, the original patterns are randomly selected so as to give the averages of their m-nearest neighbours.
Algorithm 5.4 Bootstrapping IV
Input: (i) c, the number of classes;
       (ii) ni, the size of the training sample of class i;
       (iii) x[1, 1], ..., x[n1, 1], x[1, 2], ..., x[nc, c], the training sample data;
       (iv) m, the number of the nearest neighbours of a training pattern.
Output: y[1, 1], ..., y[n1, 1], y[1, 2], ..., y[nc, c], the artificial training sample data.
Method
BEGIN
    For i = 1 to c do
        For j = 1 to ni do
            GetNearestNeighboursOf(x[j, i], m)
        End-For
    End-For
    For i = 1 to c do
        For j = 1 to ni do
            y[j, i] = the average of the m-nearest neighbours of x[j, i]
        End-For
    End-For
    Return y[1, 1], ..., y[n1, 1], y[1, 2], ..., y[nc, c]
END Bootstrapping IV
Algorithm 5.4 is changed from Algorithm 5.3 in the same way as Algorithm 5.2 was changed from Algorithm 5.1. The artificial training sample consists of the arithmetic averages of the m-nearest neighbours of all the patterns in the original sample.
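Bootstrapping IV reduces to local-mean smoothing; a minimal sketch for one class, assuming Euclidean distances:

```python
import numpy as np

def bootstrapping_four(X, m):
    """Bootstrapping IV for one class: every pattern is replaced by the
    arithmetic mean of its m nearest neighbours (itself included), which
    pulls outliers toward the local mean."""
    X = np.asarray(X, dtype=float)
    out = np.empty_like(X)
    for j in range(len(X)):
        d = np.linalg.norm(X - X[j], axis=1)
        out[j] = X[np.argsort(d)[:m]].mean(axis=0)   # local mean
    return out
```

With m = 1 the sample is unchanged; with m equal to the class size every pattern collapses to the class mean, so the choice of m controls the degree of smoothing.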
The comparison of the above algorithms was done in the paper by Hamamoto [HUT97] on the basis of experiments with k-NN classifiers (k = 1, 3, 5). The experiments were designed to inspect the performances of the above algorithms in various situations, with different dimensions, training sample sizes, sizes of the nearest neighbour sets, and distributions. The main conclusions of their work are:

1) The algorithms outperform the conventional k-NN classifiers as well as the edited 1-NN classifier.

2) The advantage of the algorithms comes from removing the outliers by smoothing the training patterns, e.g. using local means to replace a training pattern.

3) The number of the nearest neighbours chosen has an effect on the result. An algorithm was introduced to optimize the selection of this size.

The details of the experiments and the algorithm for optimizing the size of the nearest neighbour set can be found in [HUT97].
5.3 Mixed-Sample Classifier Design
In Chapter 4, we presented the pseudo-classifier algorithms, which used a pseudo-training sample to test the accuracy of the classifier. It is generally accepted that the larger a training sample, the greater the accuracy of a classifier. Consequently, we believe that using the pseudo-training sample set to enlarge the training sample size is a good idea. The question that we face is one of knowing how to generate suitable pseudo-training patterns and how to blend them together with the original training samples to build a classifier. Such a classifier design scheme is referred to as a Mixed-sample classifier design because it uses the mixed training sample to construct the classifier. Generally, this kind of algorithm will build a classifier in two steps: first, it will generate a pseudo-training sample set; then it will build a classifier with all the training patterns, using both the original and the pseudo-training samples.
The second step is executed based on the classifier selected. This can involve any standard classifier design algorithm based on the given training patterns. We will thus focus on how to carry out the first step. Earlier, in Chapter 4, we presented a GeneratePseudoSample procedure. We shall now marginally modify the existing procedure and make it suitable for the purposes of the classifier design. Below is a typical algorithm for generating a pseudo-training sample for the classifier design.
Algorithm 5.5 Generating Pseudo-training Sample
Input:   (i)   c, the number of classes;
         (ii)  ni, the size of the training sample of class i;
         (iii) x[1,1], ..., x[n1,1], x[1,2], ..., x[nc,c], the training sample data;
         (iv)  B, the number of times bootstrap samples are drawn;
         (v)   m, the number of the nearest neighbours of a training pattern;
         (vi)  ti, the size of the pseudo-training sample of class i.
Output:  y[1,1], ..., y[t1,1], y[1,2], ..., y[tc,c], the pseudo-training sample.
Method
BEGIN
    For i = 1 to c do
        For j = 1 to ni do
            GetNearestNeighboursOf(x[j,i], m)
        End-For
    End-For
    For i = 1 to c do
        For j = 1 to ti do
            k = random integer in [1, ni]
            y[j,i] = GetPseudoPattern(m nearest neighbours of x[k,i])
        End-For
    End-For
    Return y[1,1], ..., y[t1,1], y[1,2], ..., y[tc,c]
END Generating Pseudo-training Sample
The GetPseudoPattern procedure in the above algorithm can be either the GetPseudoPattern I procedure given in Algorithm 4.6 or the GetPseudoPattern II procedure given in Algorithm 4.7. Therefore, it is possible to have different weighting schemes for the GetPseudoPattern procedure. In our experiments, we used three different schemes: the SIMDAT algorithm, the Bayesian bootstrap and the random weighting methods.
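The two latter weighting schemes can be sketched as follows. Both constructions produce non-negative weights summing to one (in fact both vectors are Dirichlet(1, ..., 1) distributed); the function names are ours:

```python
import numpy as np

def bayesian_bootstrap_weights(m, rng):
    """Bayesian bootstrap: gaps between sorted Uniform(0,1) variates."""
    u = np.sort(rng.uniform(size=m - 1))
    return np.diff(np.concatenate(([0.0], u, [1.0])))

def random_weighting_weights(m, rng):
    """Random weighting method: i.i.d. Exp(1) variates, normalised."""
    e = rng.exponential(size=m)
    return e / e.sum()

def get_pseudo_pattern(neighbours, weights):
    """A pseudo-pattern as the weighted combination of the m nearest
    neighbours of a (randomly chosen) training pattern."""
    return weights @ np.asarray(neighbours, dtype=float)
```

Either weight generator can be plugged into `get_pseudo_pattern` to realise the corresponding scheme.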
There are other parameters that have to be decided in this algorithm as well. These are the number of the nearest neighbours and the size of the pseudo-training sample
set of each class. It is preferable to fix the ratio between the pseudo-training sample size and the original training sample size for each class. As expected, the pseudo-training sample size affects the performance of the classifier. In our experiments, several pseudo-training sample sizes were used to ascertain their effect on the performance of a classifier. The size of the nearest-neighbour set can also affect the performance of a classifier. Two different sizes of the nearest-neighbour set were used in our experiments to show how they affect the classifier's performance.
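A minimal sketch of the overall mixed-sample design, assuming a plain Euclidean k-NN classifier and a pseudo-sample generated elsewhere (the function names are ours):

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=3):
    """Plain k-NN majority vote with Euclidean distance."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    preds = []
    for x in np.asarray(X_test, dtype=float):
        nn = np.argsort(((X_train - x) ** 2).sum(axis=1))[:k]
        labels, counts = np.unique(y_train[nn], return_counts=True)
        preds.append(labels[np.argmax(counts)])
    return np.array(preds)

def mixed_sample(X, y, pseudo_X, pseudo_y):
    """Step 2 of the mixed-sample design: pool the original and the
    pseudo-training patterns into one enlarged training set."""
    return np.vstack([X, pseudo_X]), np.concatenate([y, pseudo_y])
```

The classifier is then built by running `knn_predict` on the pooled sample instead of the original one.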
5.4 Simulation Results
Experiments for the Mixed-sample algorithms were done with Data Set 1. Two classes, each with a training sample of size 8, were involved in each experiment. Six different sizes, 4, 8, 12, 16, 20 and 24, were used to generate the pseudo-training sample sets. With the size of the nearest neighbours m = 3, three schemes, namely the SIMDAT algorithm, the Bayesian bootstrap and the random weighting method, were used to generate the pseudo-training samples. To study the effect of the size of the nearest-neighbour set, the experiments were also carried out with m = 8. Note that the training sample size of a class is 8, so m = 8 means that a pseudo-training pattern is a combination of all the patterns in the training sample of a class. The Bayesian bootstrap and the random weighting method were used in the experiments with m = 8. To distinguish between the Bayesian bootstrap and the random weighting method used with the two different sizes of the nearest neighbours, the former (m = 3) are referred to as the Local-Bayesian bootstrap and the Local-random weighting method, while the latter
(m = 8) are simply called the Bayesian bootstrap and the Random weighting method. After constructing a classifier with the mixed sample, an independent testing sample of size 1,000 for each class was used to estimate the error rate of the 3-NN classifier. Figures 5.1 - 5.6 provide the results of the experiments, averaged over 200 trials.
Figure 5.1 Testing Error of Mixed-Sample Classifier-Class Pair (A, B)
Figure 5.2 Testing Error of Mixed-Sample Classifier-Class Pair (A, C)
Figure 5.3 Testing Error of Mixed-Sample Classifier-Class Pair (A, D)
Figure 5.4 Testing Error of Mixed-Sample Classifier-Class Pair (A, E)
Figure 5.5 Testing Error of Mixed-Sample Classifier-Class Pair (A, F)
Figure 5.6 Testing Error of Mixed-Sample Classifier-Class Pair (A, G)
The X-axis in each chart represents the pseudo-training sample size of a class, while the Y-axis represents the error rate of the classifier estimated by the independent testing sample. Each chart gives the experimental results for one class pair. Although different class pairs have different error rates, the experimental results showed consistent tendencies, which are:
1. All three algorithms for pseudo-training sample generation seem to make things worse. These are the SIMDAT algorithm, the Local-Bayesian bootstrap and the Local-random weighting method.

2. The Bayesian bootstrap and the random weighting method apparently improve the performance of the classifier.

3. The effect of the pseudo-training sample increases as its size increases.
For instance, without a pseudo-training sample mixed into the original training sample for the classifier construction, the error rate of the 3-NN classifier is 33.37% for class pair (A, D). When the size of the pseudo-sample increased to 16, twice the training sample size, the classifier's error rates increased to 34.41%, 33.82% and 34.15% for the Local-Bayesian bootstrap, the Local-random weighting method and the SIMDAT algorithm, respectively. However, with the same pseudo-training sample size, the error rates decreased to 31.08% and 31.14% for the Bayesian bootstrap and the random weighting method. This is also true for class pair (A, E). Without the pseudo-training sample, the error rate of the 3-NN classifier is 23.54%. The error rates rose to 24.14%, 23.59% and 25.14% for the Local-Bayesian bootstrap, the Local-random weighting method and the SIMDAT algorithm, respectively. At the same time, the error rates decreased to 21.61% and 21.70% for the Bayesian bootstrap and the random weighting method.
The failure of the Local-Bayesian bootstrap, the Local-random weighting method and the SIMDAT algorithm indicates that the size of the nearest-neighbour set should not be too small. Although a combination of m nearest neighbours smoothes the training patterns to some extent, it might still be an outlier if it is the combination of the m nearest neighbours of an outlier and the size m is small. Thus, the mixed-sample classifier might have more outliers than the usual classifier, which makes things even worse. It is noted that the SIMDAT algorithm is the worst of all the schemes for class pairs (A, E), (A, F) and (A, G). The cause of this is that the SIMDAT algorithm uses the uniform distribution U[1/m - sqrt(3(m-1))/m, 1/m + sqrt(3(m-1))/m] to generate the weighting vectors, which allows a greater chance for an outlier point to be generated.
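Drawing one SIMDAT weighting vector can be sketched as follows. We assume the interval U[1/m - sqrt(3(m-1))/m, 1/m + sqrt(3(m-1))/m], which gives each weight mean 1/m and variance (m-1)/m^2; the function name is ours:

```python
import numpy as np

def simdat_weights(m, rng):
    """m i.i.d. draws from U[1/m - sqrt(3(m-1))/m, 1/m + sqrt(3(m-1))/m]."""
    half_width = np.sqrt(3.0 * (m - 1)) / m
    return rng.uniform(1.0 / m - half_width, 1.0 / m + half_width, size=m)
```

Unlike the Bayesian bootstrap or random weighting vectors, these weights can be negative and need not sum to one, which is exactly what lets a pseudo-pattern fall outside the convex hull of the neighbours.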
The algorithms of the Bayesian bootstrap and the random weighting method performed very well, as they use combinations of the whole set of training patterns to produce the pseudo-training sample. This confirms that the size of the nearest-neighbour set should not be too small. The effect of the pseudo-training sample size is also significant. It can be seen from Figures 5.1 - 5.6 that, for all class pairs, the larger the pseudo-training sample's size, the lower the error rate. The evidence shows that the rate of decrease slows down as the pseudo-training sample's size increases. Generally, it is enough to take the size of the pseudo-training sample as 1.5 times (in our case, 12) that of the original training sample. The results of the experiments also showed that there is no significant performance difference between the Bayesian bootstrap and the random weighting method.
5.5 Discussions and Conclusions
In this chapter, the classifier design problem has been discussed. Previously, Hamamoto and others constructed a classifier by using weighted local means in place of the original training patterns. Their experiments demonstrated that their strategy performed favorably. With the local-means strategy, they also discussed how to find the optimal size of the nearest-neighbour set.
What we have introduced is an alternative approach for constructing a classifier with the mixed sample. The motivation for using the mixed-sample classifier design was to have a larger sample size. This approach works when the pseudo-training patterns are obtained from the weighted averages of all the original training patterns together. However, it fails if the size of the nearest-neighbour set is small.
Comparing the results of the mixed-sample classifier design with the previous results of Hamamoto's work, it may be seen that there are still interesting problems to be discussed.

First, in terms of the mixed-sample classifier design, the classifier designs suggested by Hamamoto can be referred to as pseudo-training sample designs, as they use only the pseudo-training samples - the local means - to construct a classifier. The difference between the mixed-sample and the pseudo-training sample classifier designs is that the latter uses the pseudo-training sample to replace the original one, while the former mixes the two samples together.
Second, the local means function well in the pseudo-training sample classifier design because the local means remove the outliers by smoothing the patterns. However, they perform badly in the mixed-sample classifier design when the size of the nearest-neighbour set is small, as a bad local mean might be an extra outlier. On the other hand, a large nearest-neighbour set, such as the whole training sample of a class, works in the mixed-sample classifier design.
Third, Hamamoto's work proved that a large size for the set of nearest neighbours does not necessarily yield the best results [HUT97]. That is why an algorithm to optimize the size of the nearest-neighbour set was suggested. Hence, it appears that the problem of optimizing the size of the nearest-neighbour set for the mixed-sample classifier design is unanswered. This could be a problem for further study.
Fourth, our experiments have proved that using a pseudo-training sample to enlarge the sample size is a good strategy. As no work has been done to compare the mixed-sample classifier design with the pseudo-training sample classifier design, it is difficult to say which one is better. It is reasonable, though, to use the original training patterns in the construction of a classifier if they are beneficial.
Finally, from all of the above discussions, it is confirmed that the classifier design can be improved by using a pseudo-training sample. The problem is how to avoid the contamination of bad pseudo-patterns. There are two possible solutions to the problem: one is to exclude the outliers from the original training sample set before generating the pseudo-training samples. The other is to replace the outliers in the original training sample with their local means. Although it is believed that these two approaches are able to enhance the classifier design, further studies are needed to prove the assumption.
6.1 Introduction
This thesis has presented an in-depth study of Efron's bootstrap techniques. It also discussed how the technique can be applied to problems in statistical pattern recognition, namely those of estimating the Bhattacharyya error bounds, estimating classifier error rates and designing classifiers. All the work done in the area, including the work presented here, demonstrates that the bootstrap technique is applicable to, and works well for, problems in the field of statistical pattern recognition. Throughout our discussion, we have proposed the use of three main strategies for the application of the bootstrap technique, namely, bias correction, cross-validation and pseudo-pattern generation. We now discuss how these strategies were applicable in statistical pattern recognition.
6.2 Bias Correction
As explained in Chapter 2, Efron's bootstrap scheme was initially proposed to estimate the bias of an estimator. The simulation of the distribution of θ̂* given by the bootstrap sample, which experimentally represents the empirical distribution F̂, provides us with information about the way F̂ simulates the true distribution F. Thus the bias of a parameter θ defined in (2.1) can be estimated as

    Bias = E*[θ̂*] - θ̂.                                    (6.1)

A bootstrap estimate of θ is obtained by using the bias estimate to correct the original estimate,

    θ̂_B = θ̂ - Bias.

Bias correction is obtained by merely applying (6.1).
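A minimal sketch of this bias correction in Python (our own illustration; `estimator` stands for any plug-in statistic, and the bias is approximated over B resamples as in (6.1)):

```python
import numpy as np

def bootstrap_bias_corrected(sample, estimator, B=2000, seed=0):
    """Estimate Bias = mean(theta*) - theta_hat over B bootstrap
    resamples, then return the corrected estimate theta_hat - Bias."""
    rng = np.random.default_rng(seed)
    sample = np.asarray(sample, dtype=float)
    n = len(sample)
    theta_hat = estimator(sample)
    theta_star = [estimator(sample[rng.integers(0, n, size=n)])
                  for _ in range(B)]
    bias = np.mean(theta_star) - theta_hat
    return theta_hat - bias
```

For example, the plug-in variance (dividing by n) is biased downward, and the corrected estimate is pushed upward toward the unbiased value.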
There are two different ways of applying the bias correction strategy: direct or indirect. The direct bias correction applies (6.1) directly to the parameter estimated. This approach works well for estimating the Bhattacharyya bound. As shown in Chapter 3, it improved the estimates for all of the six class pairs. Direct bias correction also works for estimating the error rate of a classifier. To do this, the basic bootstrap, 0.632, and 0.632+ estimators use the direct bias correction, even though their methods of estimating the bias are different. The performance of a direct-bias-correction estimator depends largely on how well the bias estimate does. For instance, in estimating the error rate of a classifier, all of the schemes, namely, the basic bootstrap, 0.632 and 0.632+ algorithms, are based on the bias correction of the apparent error estimate. The performances of the three algorithms vary because they estimate the bias in different ways. Therefore, it is very important to find a good bias estimator. Unfortunately, however, there are no general rules that can be followed so as to yield a superior bias estimator.
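As a concrete instance of the direct bias correction of the apparent error, Efron's 0.632 rule blends the optimistic apparent error with the pessimistic leave-one-out bootstrap error E0; a sketch of the standard formula:

```python
def err_632(apparent_err, e0_err):
    """Efron's 0.632 estimator: 0.368 * apparent error + 0.632 * E0."""
    return 0.368 * apparent_err + 0.632 * e0_err
```

The weight 0.632 is roughly the probability that a given pattern appears in a bootstrap resample.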
When the parameter θ is a function of another parameter ξ, θ = t(ξ), it is possible to correct ξ's bias first, and then to obtain the estimate of θ via the Taylor expansion of the function θ = t(ξ). This approach is called the indirect bias correction. In Chapter 3, this approach was used to estimate the Bhattacharyya bound. Although the FAMS algorithm with the bootstrap scheme improves the estimate significantly, the standard deviation of the estimate is very large. For example, in TABLE 3.7, the mean and the standard deviation for the 200 trials are 20.63062% and 23.34433% for class pair (A, G), while, in TABLE 3.10, they are 33.01038% and 43.78814% for the same class pair, implying that the results are not very satisfactory. On the other hand, the FAMS algorithms with the Bayesian bootstrap and the random weighting method schemes have a relatively small standard deviation, but they do not correct the bias significantly. In TABLE 3.8, with the Bayesian bootstrap scheme, for example, the mean and the standard deviation for class pair (A, G) are 9.294273% and 4.665902%. The fact is that there is a large gap between the basic bootstrap scheme and the Bayesian bootstrap scheme, as well as between the basic bootstrap and the random weighting method. There seems also to be a trade-off between the bias correction and the standard deviation. The question of whether we can devise a scheme that has the advantages of both the basic bootstrap and the Bayesian bootstrap remains open.
Comparing the indirect bias correction with the direct bias correction, it is not easy to state which is superior. The experimental results given in Chapter 3 show that the direct bias correction performs better for class pairs (A, B), (A, C), (A, D) and (A, E), while the indirect bias correction performs better for class pairs (A, F) and (A, G). The cause for this is that the Bhattacharyya bound ε_u is an exponential function of the Bhattacharyya distance μ(1/2): ε_u = (1/2) e^(-μ(1/2)). It must also be noted that the Bhattacharyya distance μ(1/2) is usually over-estimated. Hence the estimate bias of the Bhattacharyya distance will contribute more to the estimate bias of the Bhattacharyya bound when the Bhattacharyya distance is large. Therefore, the larger the Bhattacharyya distance is, the larger its effect on the estimate bias of the Bhattacharyya bound will be.
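The Bhattacharyya distance and bound for the two-Gaussian case can be sketched as follows (standard normal-case expressions, equal priors assumed; the function names are ours):

```python
import numpy as np

def bhattacharyya_distance(m1, S1, m2, S2):
    """mu(1/2) for two Gaussian classes with means m1, m2 and
    covariance matrices S1, S2."""
    m1, m2 = np.asarray(m1, dtype=float), np.asarray(m2, dtype=float)
    S1, S2 = np.asarray(S1, dtype=float), np.asarray(S2, dtype=float)
    S = (S1 + S2) / 2.0
    d = m2 - m1
    term1 = d @ np.linalg.solve(S, d) / 8.0          # mean-difference term
    term2 = 0.5 * np.log(np.linalg.det(S) /
                         np.sqrt(np.linalg.det(S1) * np.linalg.det(S2)))
    return term1 + term2

def bhattacharyya_bound(mu):
    """Error bound for equal priors: eps_u = (1/2) exp(-mu(1/2))."""
    return 0.5 * np.exp(-mu)
```

The exponential form makes visible why an over-estimated distance translates into a strongly biased bound when μ(1/2) is large.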
6.3 Cross-validation

The cross-validation approach is mainly concerned with the error rate estimation of a classifier. The term "cross-validation" as used by us in this thesis does not have the same meaning as in machine learning. The key idea in the Cross-validation algorithm is to separate the testing patterns from the training patterns. It might, therefore, be better to describe the scheme with the phrase "training and testing pattern separating approach" instead of the term "Cross-validation". All of the methods, namely the E0, 0.632, Leave-one-out bootstrap and 0.632+ estimators, CVBBPC and CVRWPC, use the Cross-validation to estimate the error rate. Compared with the other error rate estimation algorithms, which do not separate the training and testing samples, these algorithms obviously come out as the winners.
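As an illustration of the separating idea, the leave-one-out bootstrap (E0) tests each bootstrap classifier only on the patterns left out of that resample; a minimal sketch (names ours; `classify(X_train, y_train, X_test)` stands for any classifier):

```python
import numpy as np

def e0_error(X, y, classify, B=100, seed=0):
    """Average, over B bootstrap resamples, of the error measured on the
    patterns NOT drawn into the resample (the left-out test set)."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    n = len(X)
    errs = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)        # bootstrap training indices
        out = np.setdiff1d(np.arange(n), idx)   # patterns left out
        if out.size == 0:
            continue
        pred = classify(X[idx], y[idx], X[out])
        errs.append(np.mean(pred != y[out]))
    return float(np.mean(errs))
```

Because the tested patterns never appear in the corresponding training set, the estimate avoids the optimism of the apparent error, at the price of a pessimistic bias for small samples.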
In small sample cases, however, an explicit separation is not necessarily the best. TABLE 4.8 and TABLE 4.10 indicate that when the training sample size is 8, the impure separating estimators, the 0.632, 0.632+, CVBBPC and CVRWPC estimators, performed apparently better than the pure separating estimators, the E0 and Leave-one-out bootstrap estimators. On the other hand, TABLE 4.9 and TABLE 4.11 show that the performance of the E0 and the Leave-one-out bootstrap estimators improved significantly. These facts demonstrate that the size of the training sample set is an important factor. When some patterns were excluded from the training sample for testing,
CHAPTER 6: SUMMARY 130
it effectively reduced the size of the training sample set used to construct a classifier, and so the error rate of the classifier increased. This explains why the error rate estimates given by Leave-one-out, E0 and the Leave-one-out bootstrap tend to over-estimate the true error rate.
When the size of the training sample set is large, the result will not be much affected when we merely exclude one pattern each time from the training sample for testing. TABLE 4.9 and TABLE 4.11 clearly demonstrate this fact. However, when the size of a training sample is small, it is hard to design, test and train the classifier with separate patterns. One of the solutions is to use a pseudo-classifier algorithm, such as CVBBPC or CVRWPC. Of course, the Leave-one-out bootstrap performs very well in the small training size case, but a pseudo-classifier algorithm is preferred for the following reasons:
1. A pseudo-classifier algorithm cannot be worse than the Leave-one-out bootstrap. Thus, for example, the performances of CVBBPC and CVRWPC are comparable to the Leave-one-out bootstrap.

2. A pseudo-classifier algorithm is more flexible than the Leave-one-out bootstrap, as it is possible to improve its performance by changing either the method of generating a weighting vector or the number of the nearest neighbours of a training pattern used in designing the algorithm.

The first reason has been supported by our experiments. The details of the second reason will be studied in the next section.
6.4 Pseudo-pattern Generation
Pseudo-classifier algorithms are used for estimating the error rate of a classifier, and the pseudo-training sample can be used in classifier design. In spite of the difference between their purposes, both the pseudo-classifier algorithms and the pseudo-training sample algorithms are based on the same strategy, namely, that of pseudo-pattern generation. Of course, the strategies for generating the pseudo-patterns vary depending on the purpose for which the pseudo-patterns are generated.
Two issues are important when generating pseudo-patterns. The first is the scheme used to produce the weighting vectors. In our experiments, three schemes were tested: the SIMDAT algorithm, the Bayesian bootstrap and the random weighting method. Generally, the latter two perform better, although, in error rate estimation, the SIMDAT algorithm has a better performance for the class pairs (A, B), (A, C) and (A, D) (see TABLES 4.4 - 4.7). It is therefore clear that a suitable scheme must be chosen based on the situation at hand. Unfortunately, at this time, there are no rules available for such a selection. For the error rate estimation, our experiments showed that the SIMDAT algorithm is better when the error rate is higher, while the Bayesian bootstrap and the random weighting method are better with a lower error rate.
The second issue of importance is the number of the nearest neighbours of a pattern used in generating the pseudo-patterns. In error rate estimation, pseudo-patterns are used to simulate the behavior of the training patterns, so it is better to keep the size of the nearest-neighbour set small. In this way, the distribution of the pseudo-patterns can be kept close to that of the original training sample. As opposed to this, in classifier design, the purpose of adding the pseudo-patterns is to decrease the classifier's error, so the size of the set of nearest neighbours of a pattern should be kept larger. In this way, the pseudo-patterns will tend to appear more frequently in the central vicinity of each class, and so bring more smoothness to the training samples.
In the estimation of error rates, it is unlikely that we will be able to design an algorithm that can optimize the size of the set of nearest neighbours, because of the lack of a practical criterion function. The goal of the exercise is to compute an estimate that is as close to the true error rate as possible. However, this problem is complex because the true error rate is unknown in practice. A possible solution for the optimization is to choose the size of the set of nearest neighbours based on the results of simulation experiments.

In the classifier design, however, the Leave-one-out error estimation can be implemented to optimize the size of the nearest-neighbour set. The algorithm is briefly described below:
1. For a fixed size of nearest neighbours, generate pseudo-patterns and construct the classifier with the Mixed-sample classifier design discussed in Chapter 5.

2. Use the Leave-one-out algorithm to estimate the error rate of the classifier built in Step 1.

3. Do Steps 1 and 2 for three different sizes of nearest neighbours, m1 = 2 < m2 = (m1 + m3) / 2 < m3 = the training sample size of a class. Compare the estimated error rates.

4. Let Err_i be the estimated error rate related to size m_i. If Err1 < Err2 < Err3, then select m0 = (m1 + m2) / 2; if Err1 > Err2 > Err3, then select m0 = (m2 + m3) / 2; if Err1 > Err2 and Err2 < Err3, then select m01 = (m1 + m2) / 2 and m02 = (m2 + m3) / 2.

5. Do Steps 1 and 2 for the newly selected size m0, and compare the new estimated error rate Err0 with the two others (Err1 and Err2 if m0 = (m1 + m2) / 2; Err2 and Err3 if m0 = (m2 + m3) / 2). If two new sizes m01 and m02 were selected in Step 4, then two new error rates Err01 and Err02 will be estimated, and the comparison is carried out with Err2, Err01 and Err02.

6. Repeat Steps 3 - 5 until the three compared sizes converge. The m with the minimum Cross-validation error is selected as the size of the nearest neighbours.
Although we cannot guarantee that the best size for the set of the nearest
neighbors will result, this algorithm can, at least, provide a reasonably good estimate for
the cardinality of the set.
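Steps 1 - 6 above amount to a shrinking three-point search; one possible rendering in Python (our sketch; `cv_error(m)` stands for the Leave-one-out error of the mixed-sample classifier built with neighbourhood size m):

```python
def optimise_m(cv_error, n):
    """Shrink the triple (m1, m2, m3) toward the m with the smallest
    cross-validation error, starting from m1 = 2, m3 = n."""
    m1, m3 = 2, n
    m2 = (m1 + m3) // 2
    while m3 - m1 > 2:
        e1, e2, e3 = cv_error(m1), cv_error(m2), cv_error(m3)
        if e1 <= e2 <= e3:                  # error increasing: search left
            m3, m2 = m2, (m1 + m2) // 2
        elif e1 >= e2 >= e3:                # error decreasing: search right
            m1, m2 = m2, (m2 + m3) // 2
        else:                               # minimum bracketed around m2
            m1, m3 = (m1 + m2) // 2, (m2 + m3) // 2
            m2 = (m1 + m3) // 2
    return min((m1, m2, m3), key=cv_error)
```

In practice `cv_error` is noisy, so this finds a reasonably good neighbourhood size rather than a guaranteed optimum, which matches the caveat above.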
6.5 Conclusion
In this thesis we have demonstrated, conclusively, that the bootstrap technique is useful for solving problems in statistical pattern recognition. Generally, algorithms using the bootstrap technique take more space and more time for calculations, which are the main disadvantages of the bootstrap technique. However, space and time are not serious problems in estimating the Bhattacharyya bound, and in estimating the error rate, as the
size of the training sample is not generally large, and the estimates are only calculated once. There are also no problems of space and time in the classifier design. Once the classifier has been constructed with the mixed samples, it is used the same way as a classifier designed by the usual approach.
The benefits of the bootstrap technique are obvious:
1. It corrects the estimate bias of the Bhattacharyya bound and the classifier error rate, and

2. It improves the performance of a classifier.

Therefore, the question now is not "Could the bootstrap technique be applied to statistical pattern recognition?" but rather, "Which is the best bootstrap algorithm applicable?" and "How much can the bootstrap technique achieve?".
Based on our research, the following bootstrap algorithms are suggested for the three problems in statistical pattern recognition that we have researched.
To estimate the Bhattacharyya bound, the bias correction approach with the basic bootstrap scheme is preferred. Either the direct or the indirect bias correction approach can be used, depending on the Bhattacharyya distance associated with the class pair. If the Bhattacharyya distance is small, it is preferred that the direct bias correction be used. Otherwise the indirect bias correction would be better.
To estimate the error rate of a classifier, a combined Cross-validation and bootstrap approach is recommended, such as the 0.632+, the Leave-one-out bootstrap, the CVBBPC or the CVRWPC. For the pseudo-classifier algorithms CVBBPC and
CVRWPC, the size of the nearest neighbors can be selected as m = 3, as demonstrated by
our experiments. As mentioned in the previous section, the pseudo-classifier algorithm
with the SIMDAT scheme can also be a good candidate, if the true error rate is high.
To design a classifier, a mixed-sample algorithm with the Bayesian bootstrap or the random weighting method is preferred. The size of the set of nearest neighbours can be the same as the size of the training sample of a class. Our experiments proved that this size works well for the classifier design.
It is difficult to say just how much advantage a bootstrap technique can give. Although the methods recommended above have achieved high performances, there is still potential for further progress by applying the idea of pseudo-pattern generation and the bootstrap technique. Two kinds of improvements seem possible. The first is to seek a better scheme to generate the weighting vectors for a specific problem. The other is to optimize the size of the set of nearest neighbours of a pattern. Both of these avenues are areas where much research can be done. The question of using parametric bootstrap techniques for the three problems studied remains open.
[BD84] Bunke, O. and Droge, B. (1984), "Bootstrap and cross-validation estimates of the prediction error for linear regression models", Annals of Statistics, 12, pp.1400-1424.

[Bu89] Burman, P. (1989), "A comparative study of ordinary cross-validation, v-fold cross-validation and the repeated learning-testing methods", Biometrika, 76, pp.503-514.

[CMN85] Chernick, M.R., Murthy, V.K. and Nealy, C.D. (1985), "Application of bootstrap and other resampling techniques: evaluation of classifier performance", Pattern Recognition Letters, 3, pp.167-178.

[CMN86] Chernick, M.R., Murthy, V.K. and Nealy, C.D. (1986), "Correction to 'Application of bootstrap and other resampling techniques: evaluation of classifier performance'", Pattern Recognition Letters, 4, pp.133-142.

Davison, A.C. and Hall, P. (1992), "On the bias and variability of bootstrap and cross-validation estimates of error rate in discrimination problems", Biometrika, 79, pp.279-284.

Davison, A.C. and Hinkley, D.V. (1997), Bootstrap Methods and Their Application, Cambridge University Press, Cambridge.

Efron, B. (1979), "Bootstrap Methods: Another Look at the Jackknife", Annals of Statistics, 7, pp.1-26.

Efron, B. (1982), The Jackknife, the Bootstrap and Other Resampling Plans, CBMS-NSF Regional Conference Series in Applied Mathematics, Society for Industrial and Applied Mathematics.

Efron, B. (1983), "Estimating the error rate of a prediction rule: improvement on cross-validation", Journal of the American Statistical Association, 78, pp.316-331.

Efron, B. (1986), "How biased is the apparent error rate of a prediction rule?", Journal of the American Statistical Association, 81, pp.461-470.

Efron, B. and Tibshirani, R.J. (1993), An Introduction to the Bootstrap, Chapman & Hall, Inc.

Efron, B. and Tibshirani, R.J. (1997), "Improvements on cross-validation: the .632+ bootstrap method", Journal of the American Statistical Association, 92, pp.548-560.

Fukunaga, K. (1990), Introduction to Statistical Pattern Recognition, Second Edition, Academic Press, Inc.

Hand, D.J. (1986), "Recent advances in error rate estimation", Pattern Recognition Letters, 4, pp.335-346.

[HUT97] Hamamoto, Y., Uchimura, S. and Tomita, S. (1997), "A bootstrap technique for nearest neighbor classifier design", IEEE Transactions on Pattern Analysis and Machine Intelligence, 19, pp.73-79.

Jain, A.K., Dubes, R.C. and Chen, C. (1987), "Bootstrap techniques for error estimation", IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-9, pp.628-633.

Li, K.-C. (1987), "Asymptotic optimality for Cp, CL, cross-validation and generalized cross-validation: discrete index set", Annals of Statistics, 15, pp.958-975.

Mammen, E. (1992), When Does Bootstrap Work? Asymptotic Results and Simulations, Lecture Notes in Statistics 77, Springer-Verlag, New York, Inc.

Port, Sidney C. (1994), Theoretical Probability for Applications, John Wiley & Sons, Inc.

Shao, J. and Tu, D. (1995), The Jackknife and Bootstrap, Springer-Verlag, New York, Inc.

Taylor, M.S. and Thompson, J.R. (1992), "A nonparametric density estimation based resampling algorithm", Exploring the Limits of Bootstrap, John Wiley & Sons, Inc., pp.397-404.

Wand, M.P. and Jones, M.C. (1995), Kernel Smoothing, Chapman & Hall, London.

Weiss, S.M. (1991), "Small sample error rate estimation for k-NN classifiers", IEEE Transactions on Pattern Analysis and Machine Intelligence, 13, pp.285-289.

[Zh87] Zheng, Z. (1987), "Random weighting methods", Acta Math. Appl. Sinica, 10, pp.247-253.