BOOTSTRAP TECHNIQUES FOR
STATISTICAL PATTERN RECOGNITION

by

QUN WANG, M.Sc.

A thesis submitted to
the Faculty of Graduate Studies and Research
in partial fulfillment of the requirements
for the degree of Master of Science

Information and System Science
School of Computer Science
Carleton University
Ottawa, Ontario

April 2000

© Copyright
2000, Qun Wang
National Library of Canada
Acquisitions and Bibliographic Services

The author has granted a non-exclusive licence allowing the National Library of Canada to reproduce, loan, distribute or sell copies of this thesis in microform, paper or electronic formats.

The author retains ownership of the copyright in this thesis. Neither the thesis nor substantial extracts from it may be printed or otherwise reproduced without the author's permission.
This thesis studies the application of the bootstrap techniques to three fields of statistical pattern recognition: the estimation of the Bhattacharyya bound, the estimation of the error rate of a classifier, and classifier design. To initiate the study, the bootstrap technique and its concepts, algorithms and properties are explained.

In the application of the bootstrap to estimating the Bhattacharyya bound, it is first argued, both theoretically and experimentally, that the estimate of the Bhattacharyya bound is usually biased. Two types of bootstrap estimators of the Bhattacharyya bound, the direct and indirect bias correction estimators, are suggested. It is shown experimentally that the suggested bootstrap estimators of the Bhattacharyya bound are superior to the general estimator.

In the application of the bootstrap to the error rate estimation, the previous works are reviewed, and the related algorithms are illustrated with simulation results. These algorithms are catalogued as real-sample algorithms, as they use only the original training samples to estimate the error rate. The pseudo-sample algorithms of the error rate estimation are then proposed. These use the bootstrap technique to generate pseudo-samples for estimating the error rate. Comparisons based on the experiments show that the pseudo-sample algorithms perform better than most of the other algorithms.

In the application of the bootstrap to classifier design, the mixed-sample algorithm is introduced. It overcomes the limitation of the training sample size by adding pseudo-samples to the original training samples for classifier design. The experimental results show that this strategy is successful.
First of all, I would like to thank my Professor, Dr. John Oommen, for his continued support, encouragement, and for believing in me. It is his guidance, ideas, and wisdom that led me to explore new scientific domains, and to discover new research horizons. I am deeply grateful to him for all that he has done for me.

I would like to thank Ms. Patricia King-Edwards, who helped me by proofreading and editing most of my thesis. Above all, I would like to thank my wife, Ying Huang, for her help, support, and encouragement. This thesis would never have been done if it were not for her support.
To my parents.
CHAPTER 1: INTRODUCTION .......................................................... 1
1.1 Introduction ................................................................. 1
1.2 General Contributions ........................................................ 4
1.3 Outline of the Thesis ........................................................ 5
1.4 Experiment Data Set .......................................................... 6

CHAPTER 2: THE BOOTSTRAP TECHNIQUE .............................................. 11
2.1 Introduction ................................................................ 11
2.2 The Fundamentals of Bootstrap ............................................... 13
2.3 Bootstrap Estimators ........................................................ 17
2.3.2 Variance .................................................................. 18
2.3.3 Quantile .................................................................. 20
2.3.4 Parametric Bootstrap ...................................................... 21
2.3.5 Density Function .......................................................... 21
2.4 Properties of Bootstrap ..................................................... 25
2.5 Other Resampling Approaches ................................................. 29
2.5.1 Bayesian Bootstrap ........................................................ 30
2.5.2 Random Weighting Method ................................................... 34

CHAPTER 3: THE BHATTACHARYYA BOUND .............................................. 39
3.1 Introduction ................................................................ 39
3.2 Chernoff and Bhattacharyya Bounds ........................................... 42
3.2.1 Chernoff Bound ............................................................ 42
3.2.2 Bhattacharyya Bound ....................................................... 43
3.3 The Estimation of Bhattacharyya Bound ....................................... 44
3.3.1 Simulation Results ........................................................ 47
3.4 The Bootstrap Estimators - Direct Bias Correction ........................... 49
3.4.1 Simulation Results ........................................................ 57
3.5 The Bootstrap Estimators - Distance Bias Adjustment ......................... 59
3.5.1 AFMS (Adjust First and Mean Second) Schemes ............................... 61
3.5.2 MFAS (Mean First and Adjust Second) Schemes ............................... 66
3.6 Summary ..................................................................... 71

4.1 Introduction ................................................................ 73
4.2 Algorithms of the Error Rate Estimation ..................................... 75
4.2.1 Apparent Error ............................................................ 76
4.2.2 Cross-validation (Leave-one-out) .......................................... 78
4.2.3 Basic Bootstrap ........................................................... 79
4.2.4 E0 Estimator .............................................................. 84
4.2.5 Leave-one-out Bootstrap ................................................... 86
4.2.6 0.632 Estimators .......................................................... 89
4.3 Simulation Results .......................................................... 90
4.4 Pseudo-Sample Algorithms .................................................... 92
4.4.1 Pseudo-Testing Algorithm .................................................. 92
4.4.2 Pseudo-Classifier Algorithm ............................................... 99
4.4.3 Leave-one-out Pseudo-Classifier Algorithm ................................ 101
4.5 Summary .................................................................... 105

CHAPTER 5: BOOTSTRAP BASED CLASSIFIER DESIGN ................................... 110
5.1 Introduction ............................................................... 110
5.2 Previous Work .............................................................. 111
5.3 Mixed-sample Classifier Design ............................................. 116
5.4 Simulation Results ......................................................... 118
5.5 Discussions and Conclusions ................................................ 124

CHAPTER 6: SUMMARY AND DISCUSSIONS ............................................. 126
6.1 Introduction ............................................................... 126
6.2 Bias Correction ............................................................ 126
6.3 Cross-validation ........................................................... 129
6.4 Pseudo-pattern Generation .................................................. 131
6.5 Conclusion ................................................................. 133
TABLE 1.1 Expectations of the classes in Data Set I ............................. 8
TABLE 3.1 Theoretical values of Bhattacharyya distance and bound ............... 47
TABLE 3.2 Estimates of Bhattacharyya distances with the General approach ....... 47
TABLE 3.3 Estimates of Bhattacharyya bounds with the General approach (%) ...... 47
TABLE 3.4 Estimates of Bhattacharyya bounds with the Basic Bootstrap (%) ....... 58
TABLE 3.5 Estimates of Bhattacharyya bounds with the Bayesian Bootstrap (%) .... 58
TABLE 3.6 Estimates of Bhattacharyya bounds with the Random Weighting Method (%) 58
TABLE 3.7 Estimates of Bhattacharyya bounds (%), Basic Bootstrap, B-based AFMS . 63
TABLE 3.8 Estimates of Bhattacharyya bounds (%), Bayesian Bootstrap, B-based AFMS 63
TABLE 3.9 Estimates of Bhattacharyya bounds (%), Random Weighting Method, B-based AFMS 64
TABLE 3.10 Estimates of Bhattacharyya bounds (%), Basic Bootstrap, G-based AFMS  65
TABLE 3.11 Estimates of Bhattacharyya bounds (%), Bayesian Bootstrap, G-based AFMS 65
TABLE 3.12 Estimates of Bhattacharyya bounds (%), Random Weighting Method, G-based AFMS 66
TABLE 3.13 Estimates of Bhattacharyya bounds (%), Basic Bootstrap, B-based MFAS  68
TABLE 3.14 Estimates of Bhattacharyya bounds (%), Bayesian Bootstrap, B-based MFAS 69
TABLE 3.15 Estimates of Bhattacharyya bounds (%), Random Weighting Method, B-based MFAS 69
TABLE 3.16 Estimates of Bhattacharyya bounds (%), Basic Bootstrap, G-based MFAS  70
TABLE 3.17 Estimates of Bhattacharyya bounds (%), Bayesian Bootstrap, G-based MFAS 70
TABLE 3.18 Estimates of Bhattacharyya bounds (%), Random Weighting Method, G-based MFAS 70
TABLE 4.1 Error Rate Estimates of 3-NN classifiers (sample size 8) ............. 91
TABLE 4.2 Error Rate Estimates of 3-NN classifiers (sample size 24) ............ 91
TABLE 4.3 Error Rate Estimates of the Pseudo-Testing algorithm ................. 98
TABLE 4.4 Error Rate Estimates of the Pseudo-Classifier algorithm (sample size 8) 100
TABLE 4.5 Error Rate Estimates of the Pseudo-Classifier algorithm (sample size 24) 100
TABLE 4.6 Error Rate Estimates of the Leave-One-Out Pseudo-Classifier algorithm (sample size 8) 100
TABLE 4.7 Error Rate Estimates of the Leave-One-Out Pseudo-Classifier algorithm (sample size 24) 100
TABLE 4.8 Absolute difference between estimates and true errors (sample size 8)  107
TABLE 4.9 Absolute difference between estimates and true errors (sample size 24) 108
TABLE 4.10 Performance ranges according to the absolute difference from the true error (sample size 8) 109
TABLE 4.11 Performance ranges according to the absolute difference from the true error (sample size 24) 109
Figure 1.1 Expectation Distribution of the Classes in Data Set I ................ 8
Figure 5.1 Testing Error of Mixed-Sample Classifier - Class Pair (A, B) ....... 119
Figure 5.2 Testing Error of Mixed-Sample Classifier - Class Pair (A, C) ....... 119
Figure 5.3 Testing Error of Mixed-Sample Classifier - Class Pair (A, D) ....... 120
Figure 5.4 Testing Error of Mixed-Sample Classifier - Class Pair (A, E) ....... 120
Figure 5.5 Testing Error of Mixed-Sample Classifier - Class Pair (A, F) ....... 121
Figure 5.6 Testing Error of Mixed-Sample Classifier - Class Pair (A, G) ....... 121
CHAPTER 1
INTRODUCTION
1.1 Introduction
Since Efron's bootstrap technique was introduced in the late 1970's [Ef79], it has attracted great interest from both the theoretical and applied sides of statistics. Since it uses a simulation strategy to calculate estimates, the bootstrap technique is able to retrieve more information from sample data, and to solve problems that are not easily solved by traditional methods, such as the bias, the variance and the parameters of the distribution of an estimator. Over the past two decades, great effort has been made to develop the theory and application of the bootstrap ([DH97], [Ma92] and [ST95]). In rapid succession, large numbers of problems have proved to be amenable to the new technique, which confirmed that the bootstrap is efficient, flexible and useful.

Efron's paper, "Estimating the error rate of a prediction rule: improvement on Cross-validation" [Ef83], may be the earliest work applying the bootstrap technique to the error rate estimation. More endeavors, [CMN85], [CMN86], [Ef86], [Ha86], [JDC87], [Fu90], [We91], [DH92] and [ET97], have been made in this area since then. The other field in statistical pattern recognition where the bootstrap technique has been applied is classifier design [HUT97].
In this thesis, the bootstrap technique will be applied to three problems in
statistical pattern recognition:
1. Estimation of the Bhattacharyya bound;
2. Estimation of the error rate of a classifier; and
3. Classifier design.
The Bhattacharyya bound is an upper bound on the error rate of a classifier and an exponential function of the Bhattacharyya distance. When the distributions of two classes are normal, the Bhattacharyya distance is a function of the first and second moments of the normal distributions. It has been shown, both theoretically and practically, that an estimate of the Bhattacharyya bound given by any traditional method is seriously biased. Applying the bootstrap technique, two types of new estimators of the Bhattacharyya bound are introduced, which are respectively called the direct and indirect bias correction estimators. The first estimator estimates the bound on the basis of directly correcting the bias of the Bhattacharyya bound, while the second estimates the bound on the basis of correcting the bias of the Bhattacharyya distance. The simulation results show that both of them successfully improve the estimates of the Bhattacharyya bound.
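For normal class-conditional densities, the Bhattacharyya distance and the corresponding error bound take a standard closed form (as given, for example, in Fukunaga [Fu90]). The following pure-Python sketch for the 2-dimensional case is illustrative only; the function names are ours and the thesis's estimators operate on sample moments rather than the true parameters used here.

```python
import math

def mat_add(a, b):                       # 2x2 matrix addition
    return [[a[i][j] + b[i][j] for j in range(2)] for i in range(2)]

def mat_scale(a, s):                     # 2x2 scalar multiple
    return [[a[i][j] * s for j in range(2)] for i in range(2)]

def det2(a):
    return a[0][0] * a[1][1] - a[0][1] * a[1][0]

def inv2(a):
    d = det2(a)
    return [[a[1][1] / d, -a[0][1] / d], [-a[1][0] / d, a[0][0] / d]]

def bhattacharyya_distance(m1, c1, m2, c2):
    """B = (1/8) dm' S^-1 dm + (1/2) ln(|S| / sqrt(|C1||C2|)), S = (C1+C2)/2."""
    s = mat_scale(mat_add(c1, c2), 0.5)
    si = inv2(s)
    dm = [m2[0] - m1[0], m2[1] - m1[1]]
    quad = sum(dm[i] * si[i][j] * dm[j] for i in range(2) for j in range(2))
    return 0.125 * quad + 0.5 * math.log(det2(s) / math.sqrt(det2(c1) * det2(c2)))

def bhattacharyya_bound(p1, p2, b_dist):
    """Upper bound on the Bayes error rate: sqrt(P1 P2) * exp(-B)."""
    return math.sqrt(p1 * p2) * math.exp(-b_dist)

# Classes A and B of Data Set I: identity covariances, means (0,0) and (0.5,0).
I2 = [[1.0, 0.0], [0.0, 1.0]]
b = bhattacharyya_distance([0.0, 0.0], I2, [0.5, 0.0], I2)   # 0.5^2 / 8 = 0.03125
bound = bhattacharyya_bound(0.5, 0.5, b)
```

With equal identity covariances the log term vanishes and the distance reduces to one eighth of the squared Euclidean distance between the means, which is why nearby classes such as A and B have a bound close to the prior 0.5.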
Error rate estimation is, of course, an active topic because it is the crucial measure of a classifier, and a lot of work has been done on it. It is generally agreed that the apparent error estimator is biased so as to underestimate the error, while the traditional leave-one-out estimator produces estimates larger than those of the apparent error estimator. New efforts have tried to improve the error estimator by introducing the bootstrap technique. Three bootstrap estimators were first introduced in Efron's paper [Ef83]. The basic bootstrap estimator uses the basic bootstrap technique to correct the estimation bias of the apparent error estimator. As opposed to it, the E0 estimator uses the training patterns excluded from the bootstrap samples to estimate the error rate. The 0.632 estimator is a combination of the E0 and the apparent error estimators. Although a few other algorithms have since been suggested, such as the MC and the convex bootstrap [CMN85], no essential progress was made until the algorithms of the leave-one-out bootstrap and 0.632+ were invented by Efron and Tibshirani [ET97]. The leave-one-out bootstrap is an algorithm combining the techniques of cross-validation and the bootstrap, while the 0.632+ estimator is a combination of the leave-one-out bootstrap and the apparent error estimators.

As all of the error rate estimation algorithms mentioned above use only the original training samples, they are said to be real-sample algorithms. Motivated by the SIMDAT algorithm [TT92], the Bayesian bootstrap and the random weighting method [ST95], pseudo-sample algorithms of the error rate estimation are introduced in this thesis. The basic concept of the pseudo-sample algorithms is to use the pseudo-samples generated by a bootstrap sampling technique to estimate the error rate. The pseudo-testing algorithms use the pseudo-samples as the testing samples, while the pseudo-classifier algorithms utilize the pseudo-samples to build the classifier, and use the original training samples as the testing samples. The experiments demonstrated that the pseudo-testing algorithm tends to underestimate the error rate, while the pseudo-classifier algorithm performed very well. It can be seen that, combined with the cross-validation technique, the pseudo-classifier algorithm is one of the best algorithms available to estimate the error rate.
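To make the real-sample estimators concrete, here is an illustrative Python sketch (not from the thesis; all names are ours) of the apparent error, the E0 estimator, and the 0.632 combination, using a toy nearest-mean classifier on scalar features:

```python
import random

def nearest_mean_classifier(train):
    """train: list of (feature, label) pairs with scalar features.
    Returns a predict function based on per-class mean features."""
    means = {}
    for label in set(lab for _, lab in train):
        xs = [x for x, lab in train if lab == label]
        means[label] = sum(xs) / len(xs)
    return lambda x: min(means, key=lambda lab: abs(x - means[lab]))

def error_rate(predict, data):
    return sum(predict(x) != lab for x, lab in data) / len(data)

def error_estimators(train, B=200, seed=0):
    """Apparent error; E0 (test each bootstrap classifier on the patterns
    left out of its bootstrap sample); and the 0.632 combination."""
    rng = random.Random(seed)
    n = len(train)
    apparent = error_rate(nearest_mean_classifier(train), train)
    wrong = tested = 0
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]       # bootstrap indices
        chosen = set(idx)
        left_out = [train[i] for i in range(n) if i not in chosen]
        if left_out:
            predict = nearest_mean_classifier([train[i] for i in idx])
            wrong += sum(predict(x) != lab for x, lab in left_out)
            tested += len(left_out)
    e0 = wrong / tested
    return apparent, e0, 0.368 * apparent + 0.632 * e0   # the 0.632 estimator

data_rng = random.Random(1)
train = [(data_rng.gauss(0.0, 1.0), 'A') for _ in range(15)] + \
        [(data_rng.gauss(1.0, 1.0), 'B') for _ in range(15)]
apparent, e0, e632 = error_estimators(train)
```

Since roughly 36.8% of the training patterns are left out of each bootstrap sample, E0 behaves like a pessimistic held-out estimate, and the 0.632 weighting balances it against the optimistic apparent error.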
The problem of classifier design is different from the estimation problems. The purpose of classifier design is to seek a better discriminant function. Little work has been done in this area to incorporate the bootstrap technique. The algorithms for classifier design introduced by Hamamoto and his colleagues [HUT97] focus on creating artificial training samples instead of using the original training samples to construct a classifier. Their main strategy uses the local mean to replace the original training pattern when constructing a classifier. Their experiments showed that this strategy works for the k-NN classifier.

In this thesis, the mixed-sample algorithms for classifier design will be proposed. The main idea of the algorithms is to enlarge the training sample for constructing the classifier by adding pseudo-training patterns into the original training sample set. The simulation results show that this strategy works well when proper bootstrap sampling schemes are used to generate the pseudo-training patterns.
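The actual pseudo-pattern generation schemes are detailed in Chapter 5; as a flavor of the mixed-sample idea, the toy sketch below (illustrative only, in the spirit of SIMDAT-style smoothed resampling; the names and the specific scheme are ours) produces pseudo-patterns as random convex combinations of a training point and its nearest neighbours, and appends them to the training set:

```python
import random

def pseudo_patterns(train, m, k=3, seed=0):
    """Generate m pseudo-patterns from scalar training features: pick a
    pattern at random, then return a random convex combination of it and
    its k nearest neighbours."""
    rng = random.Random(seed)
    out = []
    for _ in range(m):
        x = rng.choice(train)
        neigh = sorted(train, key=lambda y: abs(y - x))[:k + 1]  # x itself + k nearest
        w = [rng.random() for _ in neigh]
        s = sum(w)
        out.append(sum(wi / s * yi for wi, yi in zip(w, neigh)))
    return out

train = [0.1, 0.4, 0.5, 0.9, 1.2]
mixed = train + pseudo_patterns(train, 10)   # enlarged, mixed-sample training set
```

Because each pseudo-pattern is a convex combination of nearby training points, the enlarged set fills in the local neighbourhood of the data without straying outside its support.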
1.2 General Contributions

The general contributions of this thesis are distributed into three areas:

1. The Bhattacharyya bound estimation, which uses:
   a) the direct bias correction algorithm; and
   b) the distance bias adjustment (indirect bias correction) algorithms.

2. The error rate estimation, incorporating:
   a) a pseudo-testing algorithm;
   b) a pseudo-classifier algorithm; and
   c) a leave-one-out pseudo-classifier algorithm.

3. The classifier design, which incorporates a mixed-sample classifier design.
Variations of the above algorithms will also be discussed in this thesis. These include the modification of the resampling schemes (basic bootstrap, Bayesian bootstrap, random weighting method and the SIMDAT algorithm), the pseudo-sample size and the size of the nearest-neighbor set of a training pattern.
1.3 Thesis Outline

This thesis is composed of six chapters. Chapter 2 discusses the bootstrap technique. Section 2.1 is a general introduction to the bootstrap and its application to statistical pattern recognition. Section 2.2 describes the basic concept of Efron's bootstrap as well as a general algorithm of the basic bootstrap, followed by several examples of the bootstrap estimators, which are the content of Section 2.3. Two theorems about the properties of the bootstrap are presented in Section 2.4. Section 2.5 illustrates the other two bootstrap sampling schemes - the Bayesian bootstrap and the random weighting method.

Chapter 3 discusses the estimation problem of the Bhattacharyya bound. The definitions and formulae of the Chernoff and Bhattacharyya bounds are presented in Section 3.2. Subsequently, Section 3.3 describes a traditional algorithm for estimating the Bhattacharyya bound, and provides the results of related experiments. Two kinds of bootstrap estimators are discussed with the simulation results in Sections 3.4 and 3.5 respectively. Section 3.6 is a summary of the bootstrap-based Bhattacharyya bound estimation methods.
The bootstrap-based error estimators of a classifier are the topic of Chapter 4. Section 4.1 is a brief introduction to the prior work done regarding the error estimation of a classifier. The details of the main algorithms of the error estimators are presented in Section 4.2, followed by the results of related experiments, which are illustrated in Section 4.3. Section 4.4 introduces three kinds of pseudo-sample algorithms of error estimation, and discusses the algorithms with the related simulation results. Section 4.5 summarizes all the algorithms discussed in this chapter.

The bootstrap-based classifier design is the content of Chapter 5. Section 5.1 states the problem of classifier design and discusses the problem to be solved. Previous work done on bootstrap-based classifier design is discussed in Section 5.2. The algorithms of mixed-sample classifier design are introduced in Section 5.3, while the related results of simulations are discussed in Section 5.4. Section 5.5 provides the discussions and summarizes the work done in mixed-sample classifier design.

Chapter 6 summarizes the applications of the bootstrap technique to the field of statistical pattern recognition. The three main issues used in the thesis, bias correction, cross-validation and pseudo-pattern generation, are discussed in Sections 6.2, 6.3 and 6.4 respectively. Section 6.5 states the conclusions of our work.
1.4 Experiment Data Set

Throughout this thesis, all the experiments have been carried out with a data set referred to as Data Set I, which contains 2-dimensional training samples of seven classes. All the classes follow a normal distribution with the same covariance matrix, the identity matrix, but they have different expectation vectors generated by Algorithm 1.1.
Algorithm 1.1 Expectation Vector Generating
Input: d, the dimension.
Output: ExpectationList, the expectation vectors.
Method
BEGIN
    For (j = 1 to d)
        EX[j] = 0
    End For
    Put EX in ExpectationList
    delta = 0.5
    Do
        k = random integer in [1, d]
        EX[k] = EX[k] + delta
        delta = delta + 0.1
        Put EX in ExpectationList
    While (ModulusOfVector(EX, d) < 3)
    Return ExpectationList
END Expectation Vector Generating

Procedure ModulusOfVector
Input: (i) v[1], v[2], ..., v[d], a vector;
       (ii) d, the dimension.
Output: The modulus of the vector.
Method
BEGIN
    modulus = 0
    For (j = 1 to d)
        modulus = modulus + v[j] * v[j]
    End For
    modulus = sqrt(modulus)
    Return modulus
END Procedure ModulusOfVector
In Algorithm 1.1, the expectation vector of the first class is set to (0, 0). Then, in the Do...While loop, a delta, with an initial value of 0.5, is randomly added to one of the elements of the previous expectation vector to yield the next expectation vector, and delta is then increased by 0.1. Expectation vectors for the classes are generated in this way until the modulus of the last expectation vector exceeds 3. With Algorithm 1.1, expectations of seven classes of training samples were generated. Their expectations are listed in TABLE 1.1, and Figure 1.1 depicts the distribution of the expectations.
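Algorithm 1.1 translates almost line for line into Python; the rendering below is illustrative (variable names are ours) and uses 0-based coordinate indices:

```python
import math
import random

def expectation_vectors(d, seed=0):
    """Python rendering of Algorithm 1.1: start at the zero vector and
    repeatedly add a growing delta to one randomly chosen coordinate,
    until the modulus of the latest vector reaches 3."""
    rng = random.Random(seed)
    ex = [0.0] * d
    vectors = [list(ex)]          # first class: the zero vector
    delta = 0.5
    while True:
        k = rng.randrange(d)      # random coordinate (0-based here)
        ex[k] += delta
        delta += 0.1
        vectors.append(list(ex))
        if math.sqrt(sum(v * v for v in ex)) >= 3:
            return vectors

vecs = expectation_vectors(2)
```

Every intermediate vector has modulus below 3 by construction, and only the final appended vector reaches or exceeds it, which is what terminates the loop.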
Figure 1.1 Expectation Distribution of the Classes in Data Set I
TABLE 1.1 Expectations of the classes in Data Set I

    CLASS   EXPECTATION       CLASS   EXPECTATION
    A       (0.0, 0.0)        E       (1.1, 1.5)
    B       (0.5, 0.0)        F       (2.0, 1.5)
    C       (1.1, 0.0)        G       (3.0, 1.5)
    D       (1.1, 0.7)

Each sample point is a 2-dimensional vector with a normal distribution. The sample vectors were generated by Algorithm 1.2.
Algorithm 1.2 d-Dimension Normal Vector
Input: (i) d, the dimension;
       (ii) E[1], E[2], ..., E[d], the expectation vector.
Output: A vector that follows a normal distribution.
Method
BEGIN
    For (k = 1 to d)
        MltNormal[k] = StandardNormal
        MltNormal[k] = MltNormal[k] + E[k]
    End For
    Return MltNormal[1], ..., MltNormal[d]
END d-Dimension Normal Vector

Procedure StandardNormal
Input: nothing.
Output: An approximately standard normal random variable.
Method
BEGIN
    RandomNM = 0
    For (j = 1 to 12)
        RandomNM = RandomNM + RandomUniform[0, 1]
    End For
    RandomNM = RandomNM - 6.0
    Return RandomNM
END Procedure StandardNormal
Algorithm 1.2 is a typical algorithm for generating a d-dimensional normally distributed random vector with the identity matrix as the covariance matrix. It calls the procedure StandardNormal d times to obtain d variables that are approximately N(0, 1). The procedure StandardNormal accumulates 12 variables of the uniform distribution U[0, 1] to return a variable which is approximately N(0, 1). Since the expectation of the required random normal vector is generally not 0, a corresponding expectation value is added to each element of the d-dimensional random vector returned by the procedure StandardNormal. In this way, a d-dimensional normally distributed random vector is generated with the given expectation vector and the identity matrix as the covariance matrix.
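The 12-uniform trick in StandardNormal works because the sum of 12 independent U[0, 1] variables has mean 6 and variance exactly 1 (each uniform contributes variance 1/12), so subtracting 6 gives an approximately N(0, 1) variable by the central limit theorem. A quick illustrative check in Python:

```python
import random

def standard_normal_approx(rng):
    """Sum of 12 U[0,1] draws, centered: mean 0, variance exactly 1,
    approximately normal, and bounded in [-6, 6]."""
    return sum(rng.random() for _ in range(12)) - 6.0

rng = random.Random(42)
sample = [standard_normal_approx(rng) for _ in range(20000)]
mean = sum(sample) / len(sample)
var = sum((x - mean) ** 2 for x in sample) / len(sample)
```

One visible difference from a true normal variable is the hard truncation at plus or minus 6; the tails beyond that are simply impossible, which is usually harmless for simulation purposes like these.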
200 separate samples, each of size 50, were generated for each class in Data Set I, and were used for the simulation. Another sample of size 1000 was also generated separately for each class, and was used for testing.
CHAPTER 2
THE BOOTSTRAP TECHNIQUE

2.1 Introduction
The bootstrap technique is one of the most popular statistical approaches today for estimating parameters and their distributions. It was first introduced by Efron in the late 1970's [Ef79]. The basic strategy of the bootstrap is based on resampling and simulation. Over the last two decades, the bootstrap has been applied to a wide range of statistical problems. Its theory has been extensively developed, and a variety of techniques related to the bootstrap have been proposed for various application problems, including bias estimation, standard error estimation, confidence interval estimation, hypothesis testing, estimation of prediction error, sample surveys, linear modeling, kernel density estimation, time series and dependent data (see [ET93], [DH97] and [ST95]).

Efron's work on error estimation [Ef83] can be regarded as the start of applying the bootstrap techniques in the area of statistical pattern recognition. Then Chernick [CMN86], Jain [JDC87] and Weiss [We91] performed simulations to compare bootstrap error estimates with other estimates, mainly with the cross-validation estimation schemes. Efron [Ef86] and [ET97], Davison [DH92] and Fukunaga [Fu92] also discussed theoretical issues of the bootstrap error estimation. Detailed descriptions of the above works will be given in Chapters 3 and 4 respectively. There were also suggestions of using the bootstrap techniques to construct pattern classifiers, by Hamamoto [HUT97], which will be discussed in Chapter 5.
In this chapter, the essential concept and theory of the bootstrap technique will be discussed. Some important bootstrap estimates useful in statistical pattern recognition and the properties of the bootstrap will be given, and other related resampling plans will be introduced as well. In Section 2.2, the concept and basic procedure of the bootstrap are discussed, and examples are given to illustrate the bootstrap procedure. Several bootstrap estimators, including those for the bias, variance, quantile and density function, are discussed in Section 2.3. These estimating procedures are very helpful in understanding Efron's bootstrap theory, and are also useful for estimating the error bound and the error rate of a classifier, which are the topics of the next two chapters. Section 2.4 discusses the theoretical aspects of the bootstrap, which are mainly concerned with the large sample properties, consistency and asymptotic distribution of the bootstrap estimators. The theorems in Section 2.4 concern the theoretical properties of the indices on invoking the bootstrap. In the last section of the chapter, Section 2.5, we will discuss other resampling plans, the Bayesian bootstrap and the Random Weighting method, which were introduced by Rubin [ST95] and Zhen [Zh87] respectively. Both of them generalize Efron's bootstrap approach in some sense, and both will be applied in later chapters to problems of statistical pattern recognition.
2.2 The Fundamentals of Bootstrap

Let X = (X_1, X_2, \ldots, X_n) be an i.i.d. d-dimensional sample from an unknown distribution F. We consider an arbitrary functional of F, \theta = \theta(F), which, for example, could be an expectation, a quantile, a variance, etc. The parameter \theta is estimated by the same functional of the empirical distribution \hat{F}, namely \hat{\theta} = \theta(\hat{F}), where

    \hat{F}: \text{mass } \frac{1}{n} \text{ at } x_1, x_2, \ldots, x_n,

n is the sample size, and \{x_1, x_2, \ldots, x_n\} are the observed values of the sample \{X_1, X_2, \ldots, X_n\}. A host of unanswered questions now arises, such as:

(i.) What is the bias of the estimator \hat{\theta}?
(ii.) What is the variance of \hat{\theta}?
(iii.) What is the distribution of \hat{\theta}?

These questions are generally not easy to answer, especially in the case when F is unknown. For example, the bias is well defined as

    \mathrm{Bias} = E_F[\hat{\theta} - \theta] = E_F[\theta(\hat{F}) - \theta(F)],    (2.1)

where E_F indicates the expectation under the distribution F. Though it is possible to calculate \hat{\theta} = \theta(\hat{F}) having observed X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n, it is not possible to derive the bias directly, because both \theta and F are unknown. The question is thus one of determining how to obtain an estimator of the bias.
From one perspective, the quantity θ̂ can be regarded as a simulation of θ, since
we use the empirical distribution F̂ to mirror the characteristics of the unknown
distribution F via the sampling process. Extending the idea to the bias estimation, we can
use the same strategy to solve the problem. As the empirical distribution F̂ is known, it is
easy to generate an artificial sample from F̂. Assuming the size of the sample generated is
n, the sample is

X* = {X_1*, X_2*, ..., X_n*}.     (2.2)

Thus we have an empirical distribution F̂* of the empirical distribution F̂, where

F̂* : mass 1/n at x_1*, x_2*, ..., x_n*,

and the corresponding θ̂* = θ(F̂*) is the estimate of θ̂. Thus an estimator of the bias is

Bias* = E*[θ̂* - θ̂] = E*[θ(F̂*) - θ(F̂)],     (2.3)

where E* is the conditional expectation given {X_1 = x_1, X_2 = x_2, ..., X_n = x_n}. The
procedure to estimate the bias described above is called the Bootstrap. An example
to clarify issues is given below.
Example 2.1

Consider a simple case, in which θ is the expectation of the unknown distribution
F,

θ = E_F[X].

Then the estimate of θ is

θ̂ = θ(F̂) = E_F̂[X] = x̄,

where x̄ = (1/n) Σ_i x_i is the sample mean. The observed values {x_1*, x_2*, ..., x_n*} of a
bootstrap sample are obtained by drawing n indices into the original sample with a
random generator distributed uniformly on {1, 2, ..., n}. The corresponding estimate is

θ̂* = x̄*,

where x̄* = (1/n) Σ_i x_i* is the mean of the bootstrap sample. The procedure for generating a
bootstrap sample is repeated B times, and there are B values of θ̂* obtained from the B
bootstrap samples {x_1^b*, x_2^b*, ..., x_n^b*}, b = 1, 2, ..., B,

θ̂^b = (1/n) Σ_i x_i^b*.

The bootstrap estimate of θ̂ is the average of these θ̂^b, so,

θ̂** = (1/B) Σ_b θ̂^b.

Thus, the bootstrap estimate of the bias x̄ - θ is

Bias* = θ̂** - x̄.

The procedure presented above is a general bootstrap procedure, called the basic or
simple bootstrap.
In summary, the bootstrap method takes the strategy of replacing F with F̂, and F̂
with F̂*, in estimating a parameter θ(F) with unknown distribution F. A typical
bootstrap procedure, which takes the above steps, is given formally below.

Algorithm 2.1 Basic Bootstrap
Input: (i) n, the size of the training sample;
       (ii) x[1], x[2], ..., x[n], the sample data of the training sample;
       (iii) B, the number of times bootstrap resampling is repeated.
Output: the bootstrap estimator.
Method
BEGIN
  bootstrap-estimator = 0
  For (i = 1 to B)
    For (j = 1 to n)                                                    (2.6)
      m = random integer in [1, n]
      y[j] = x[m]
    End-For
    bootstrap-estimator = bootstrap-estimator + Q(y[1], y[2], ..., y[n])   (2.7)
  End-For
  bootstrap-estimator = bootstrap-estimator / B
  Return bootstrap-estimator
END Basic Bootstrap
Note, the For loop (2.6) generates a bootstrap sample, i.e. a sample from the
empirical distribution F̂, and Q(y[1], y[2], ..., y[n]) in (2.7) calculates a bootstrap
estimator of the parameter Q with the bootstrap sample. The question of how to calculate
Q(y[1], y[2], ..., y[n]) depends on the estimating formula of the parameter Q. For
instance, in Example 2.1, where we tried to estimate the bias of the mean, it is the mean
of the bootstrap sample.

Efron suggested choosing B = 200 as the number of times the bootstrap
resampling is repeated [E83], which is quite adequate for most purposes.
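Algorithm 2.1 translates almost line for line into a modern language. The following is a minimal Python sketch (not part of the thesis), taking the statistic Q to be the sample mean and using Efron's suggested B = 200:

```python
import random

def basic_bootstrap(x, stat, B=200):
    """Average of the statistic `stat` over B bootstrap resamples of x."""
    n = len(x)
    total = 0.0
    for _ in range(B):
        # For-loop (2.6): draw a bootstrap sample from the empirical distribution
        y = [x[random.randrange(n)] for _ in range(n)]
        # Step (2.7): evaluate the statistic on the bootstrap sample
        total += stat(y)
    return total / B

random.seed(0)
x = [1.2, 0.7, 2.5, 1.9, 0.3, 1.1, 2.2, 0.8]
mean = lambda s: sum(s) / len(s)
print(basic_bootstrap(x, mean))   # close to the sample mean of x
```

Any other statistic (median, variance, a classifier's error estimate) can be substituted for `mean` without touching the resampling loop, which is the point of the generic formulation above.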
2.3 Bootstrap Estimators

2.3.1 Bias

In this section we will discuss the bootstrap approaches for estimating the bias,
variance, and the quantile. Although the bootstrap method is regarded as a distribution-
free method, it is also possible to have a parameterized bootstrap. The concept of the
parametric bootstrap will also be introduced in this section, in order to present an
alternative view of the bootstrap techniques. At the end of the section, we will present a
bootstrap procedure, the SIMDAT algorithm, for density function simulation, which
shows an example of a more involved bootstrap procedure. The SIMDAT algorithm will
be applied in subsequent chapters to some related problems of statistical pattern
recognition.
As defined in Section 2.2, the bias of an estimate θ(F̂) is Bias = E_F[θ̂ - θ].
Replacing F with F̂, and F̂ with F̂*, to get the corresponding θ̂*, the bootstrap estimate
of the bias will be

Bias* = E*[θ̂* - θ̂].     (2.8)

Comparing this expression to (2.3), we can substitute (1/B) Σ_b θ̂^b for E*[θ̂*] here, which
is analogous to using x̄ for estimating E_F[X].

In order to apply (2.8), it is not necessary to require θ̂* to be "the same statistic as
θ̂" [E82]. We can take θ̂* to be the sample median while θ̂ can be set to be E_F̂[X]. Hence
the bootstrap technique provides more flexibility to yield an estimator. If θ̂ is an estimate
of θ and a bootstrap estimate of the bias (2.3) is provided, a bootstrap estimate of θ can be
obtained by correcting the bias of the original estimate as:

θ̂_BC = θ̂ - Bias*.     (2.9)

It will be seen, in Chapters 3 and 4, how this strategy can be used to estimate the
error bound and error rate of a classifier.
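The bias estimate (2.8) and the bias-correction strategy described above can be sketched as follows; a hedged Python illustration in which θ̂ is the plug-in variance (1/n) Σ_i (x_i - x̄)², chosen only because its true bias, -σ²/n, is known and therefore provides a check:

```python
import random

def plugin_var(s):
    """The (biased) plug-in variance estimate (1/n) * sum (x_i - xbar)^2."""
    m = sum(s) / len(s)
    return sum((v - m) ** 2 for v in s) / len(s)

def bootstrap_bias_corrected(x, stat, B=1000):
    """Estimate Bias* = E*[theta*] - theta_hat as in (2.8), then return the
    bias-corrected estimate theta_hat - Bias*."""
    n = len(x)
    theta_hat = stat(x)
    boot_mean = sum(
        stat([x[random.randrange(n)] for _ in range(n)]) for _ in range(B)
    ) / B
    bias_star = boot_mean - theta_hat      # bootstrap estimate of the bias
    return theta_hat - bias_star           # bias-corrected estimate

random.seed(1)
x = [random.gauss(0.0, 1.0) for _ in range(30)]
print(plugin_var(x), bootstrap_bias_corrected(x, plugin_var))
```

The corrected value should sit close to (1 + 1/n) times the plug-in estimate, because the bootstrap reproduces the (n - 1)/n deflation of the plug-in variance.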
2.3.2 Variance

In numerous applications, and from a theoretical point of view, there is usually an
interest in estimating the variance of a statistic θ̂. In general, it is not easy to give a
variance estimate for a statistic such as s² = (1/n) Σ_i (x_i - x̄)², while the variance of the
statistic x̄ is shown to be

VAR(x̄) = σ²/n.

Generally it is hard to compute the variance VAR(θ̂). With the bootstrap method,
VAR(θ̂) can be estimated as

VAR*(θ̂) = (1/(B - 1)) Σ_b (θ̂^b - θ̂**)²,     (2.10)

where θ̂** = (1/B) Σ_b θ̂^b, after obtaining each θ̂^b from the bootstrap sample {x_1^b*,
x_2^b*, ..., x_n^b*}.

Example 2.2

For the standard deviation statistic θ̂ = S = ((1/n) Σ_i (x_i - x̄)²)^(1/2), the
corresponding bootstrap statistic is

S^b = (Σ_i w_i^b* (x_i - x̄^b*)²)^(1/2).     (2.11)

The w_i^b* on the right side is the weight of x_i in the bootstrap sample x^b* = (x_1^b*,
x_2^b*, ..., x_n^b*), which equals the number n_i^b* of times x_i appears in x^b* divided by
the bootstrap sample size n:

w_i^b* = n_i^b* / n.     (2.12)

Also, x̄^b* is the mean of the bootstrap sample x^b*,

x̄^b* = Σ_i w_i^b* x_i.     (2.13)

After obtaining all {S^b : b = 1, 2, ..., B}, the bootstrap variance estimate of S is just a
direct computation of (2.10),

VAR*(S) = (1/(B - 1)) Σ_b (S^b - S̄*)²,

where S̄* = (1/B) Σ_b S^b.

The expressions (2.11) - (2.13) imply that the bootstrap method provides a
scheme by which we can assign weights to the sample data, which leads to the other
resampling strategies discussed later in this chapter.
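The bootstrap variance estimate (2.10) requires nothing beyond the replications θ̂^b; a small Python sketch, applied here to the sample median, whose variance has no simple closed form:

```python
import random, statistics

def bootstrap_variance(x, stat, B=500):
    """Bootstrap estimate of VAR(theta_hat): the sample variance of the
    B bootstrap replications theta^b, as in (2.10)."""
    n = len(x)
    thetas = []
    for _ in range(B):
        y = [x[random.randrange(n)] for _ in range(n)]
        thetas.append(stat(y))
    # statistics.variance divides by B - 1, matching (2.10)
    return statistics.variance(thetas)

random.seed(2)
x = [random.gauss(10.0, 2.0) for _ in range(25)]
print(bootstrap_variance(x, statistics.median))
```

Replacing `statistics.median` by the sample mean gives values close to the theoretical σ²/n, which is a quick sanity check for the procedure.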
2.3.3 Quantile

Another quantity of considerable interest in statistics is the confidence interval of
θ̂. Typically, with a given value θ_0, a question of pertinence is the probability

P(θ̂ ≤ θ_0).     (2.14)

Alternatively, with a given level α, the question of interest is the boundary θ_0 such that

P(θ̂ ≤ θ_0) = α.     (2.15)

It is very hard to answer the above questions in most cases if the distribution F is
unknown. With the bootstrap approach, the bootstrap samples can first be used to obtain a
set of estimates {θ̂^b : b = 1, 2, ..., B}, and then these {θ̂^b : b = 1, 2, ..., B} are just
taken as a sample of θ̂. The estimate of (2.14) will thus be

P*(θ̂ ≤ θ_0) = #{b : θ̂^b ≤ θ_0} / B.

By sorting {θ̂^b : b = 1, 2, ..., B} as below:

θ̂^(1) ≤ θ̂^(2) ≤ ... ≤ θ̂^(B),

the estimator of θ_0 in (2.15) will be

θ̂^(⌊αB⌋),

where ⌊·⌋ is the floor function.
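The sorting scheme above can be sketched directly; a minimal Python illustration that reads the boundary of (2.15) off the ordered bootstrap replications (the choice of the sample mean as θ̂ is only for illustration):

```python
import random

def bootstrap_quantile(x, stat, alpha, B=1000):
    """Estimate the boundary theta_0 with P(theta_hat <= theta_0) ~ alpha:
    sort the B replications and take the floor(alpha*B)-th order statistic."""
    n = len(x)
    thetas = sorted(
        stat([x[random.randrange(n)] for _ in range(n)]) for _ in range(B)
    )
    return thetas[max(0, int(alpha * B) - 1)]

random.seed(3)
x = [random.gauss(0.0, 1.0) for _ in range(40)]
mean = lambda s: sum(s) / len(s)
lo = bootstrap_quantile(x, mean, 0.05)
hi = bootstrap_quantile(x, mean, 0.95)
print(lo, hi)   # the two boundaries form a 90% percentile interval for the mean
```

Taking the two boundaries at levels α and 1 - α in this way is exactly the percentile confidence interval built from the bootstrap distribution.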
2.3.4 Parametric Bootstrap

Though it is a distribution-free approach, the bootstrap can also be used in a
parametric way. Consider the case when the sample data X = {X_1, X_2, ..., X_n} follows a
normal distribution F = N(μ, Σ), where μ and Σ are unknown parameters. Then the
parametric maximum likelihood estimator of F is

F̂_NORMAL = N(μ̂, Σ̂).

Now the bootstrap procedure is carried out exactly as described above, except that
F̂_NORMAL takes the place of F̂ in (2.6). Thus the loop (2.6) in Algorithm 2.1 is now
changed to

For (j = 1 to n)
  y[j] = random vector drawn from N(μ̂, Σ̂)
End-For

Therefore the bootstrap sample will be drawn from F̂_NORMAL, which means that the
bootstrap is not merely putting weights on the data X = {X_1, X_2, ..., X_n}.
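The parametric variant changes only how the resample is drawn; a univariate Python sketch, assuming the fitted model is N(μ̂, σ̂²) with maximum likelihood estimates taking the place of the empirical distribution in loop (2.6):

```python
import math, random

def parametric_bootstrap(x, stat, B=500):
    """Resample from the fitted normal N(mu_hat, sigma_hat^2) instead of
    from the empirical distribution (the modified loop (2.6))."""
    n = len(x)
    mu = sum(x) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in x) / n)   # ML estimates
    total = 0.0
    for _ in range(B):
        y = [random.gauss(mu, sigma) for _ in range(n)]    # parametric resample
        total += stat(y)
    return total / B

random.seed(4)
x = [random.gauss(5.0, 1.5) for _ in range(20)]
mean = lambda s: sum(s) / len(s)
print(parametric_bootstrap(x, mean))   # close to mu_hat
```

The only design change from the basic scheme is the sampler; everything downstream (bias, variance, quantile estimation) is unchanged.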
2.3.5 Density Function

The density function is, of course, of great interest in statistics, because numerous
problems in statistics would be solved if the density function were known. Unfortunately,
in almost all statistical problems, the density functions involved are unknown or, at most,
only a model or an explicit form of the density functions is known. In statistical pattern
recognition, density functions play even more important roles, because knowing the density
functions of the classes will help us to build and evaluate the corresponding classifiers.
But the question of how to estimate a density function is still a significant problem,
especially in a small sample size case. It can be seen that the bootstrap provides a useful
tool for estimating density functions.

The algorithm presented here is due to Taylor and Thompson [TT92], and is called the
SIMDAT algorithm. The purpose of the SIMDAT algorithm is to provide a sampling
scheme which can generate a pseudo-data sample very close to one drawn from a
kernel density function estimator (2.18) in which the Σ_i in the kernel K(·) is a locally
estimated covariance matrix. The details of the SIMDAT algorithm are given below.
Algorithm 2.2 SIMDAT
Input: (i) n, the size of the training sample;
       (ii) x[1], x[2], ..., x[n], the sample vectors of the training sample;
       (iii) m, the number of nearest neighbors of each sample datum;
       (iv) N, the number of pseudo-sample vectors to be generated.
Output: z[1], z[2], ..., z[N], a set of pseudo-sample vectors.
Method
BEGIN
  Re-scale the sample data set x[1], x[2], ..., x[n] so that the marginal sample
  variances in each vector component are the same
  For (i = 1 to n)
    find the m nearest neighbors y[i, 1], y[i, 2], ..., y[i, m] of x[i]
    (including x[i] itself)
    y[i, 0] = (1/m) Σ_j y[i, j]
    For (j = 1 to m)
      y[i, j] = y[i, j] - y[i, 0]
    End-For
  End-For
  For (i = 1 to N)
    k = random integer in [1, n]
    z[i] = 0
    For (q = 1 to m)
      u = random number from U(1/m - √(3(m - 1))/m, 1/m + √(3(m - 1))/m)
      z[i] = z[i] + u · y[k, q]
    End-For
    z[i] = z[i] + y[k, 0]
  End-For
  Return z[1], z[2], ..., z[N]
END SIMDAT
There are three things done in the first For loop of the SIMDAT algorithm. They
are:

1. Find the m nearest neighbors of a sample vector, including the
sample vector itself;

2. Calculate the mean of the m nearest neighbors, which is put in y[i, 0];

3. Subtract the mean from each neighbor.

The third step above makes the neighbor sample vectors have a local
distribution with a mean of zero. In generating a pseudo-sample, a randomly weighted
linear combination of the nearest neighbors is formed, and then the local mean y[i, 0] is
added back. Here we see that the local mean y[i, 0] takes the place of the original sample
vector x[i] in the SIMDAT algorithm, which incorporates smoothness into the pseudo-
sample vectors.
Some properties of the pseudo-data sample are not difficult to see. First, it is easy
to calculate the expectations, variances and covariances of the uniform distribution
sample {u_j}, j = 1, ..., m; these quantities are:

E(u_j) = 1/m,  Var(u_j) = (m - 1)/m²,  Cov(u_i, u_j) = 0, for i ≠ j.

If we now treat the m nearest neighbors {y_1, y_2, ..., y_m} of a sample vector as a
sample from a truncated distribution with mean vector μ = (μ_1, μ_2, ..., μ_d)^T and
covariance matrix Σ = (σ_{i,j}), it can be seen that the linear combination

Z = Σ_j u_j y_j

will satisfy

E(Z) = μ

and

Cov(Z) = Σ + ((m - 1)/m) μ μ^T.

It is clear that the linear combination Z will have the same expectation and covariance as
the sample data {y_1, y_2, ..., y_m} if μ = 0. Therefore the pseudo-data sample generated
by the above procedure will be very close to that generated by the density function of
(2.18).
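The generation step of SIMDAT can be sketched compactly; a simplified univariate Python illustration (the re-scaling step is skipped, and the weights are assumed uniform on (1/m - √(3(m-1))/m, 1/m + √(3(m-1))/m), chosen so that E(u_j) = 1/m and Var(u_j) = (m - 1)/m² as listed above):

```python
import math, random

def simdat(x, m, N):
    """Generate N pseudo-data: a randomly weighted combination of the m
    nearest neighbours of a randomly chosen point, centred on the local
    mean and then re-centred (smoothed bootstrap in the style of SIMDAT)."""
    n = len(x)
    half = math.sqrt(3 * (m - 1)) / m      # half-width of the weight interval
    pseudo = []
    for _ in range(N):
        k = random.randrange(n)
        nbrs = sorted(x, key=lambda v: abs(v - x[k]))[:m]  # m nearest, incl. x[k]
        local_mean = sum(nbrs) / m
        centred = [v - local_mean for v in nbrs]           # zero local mean
        u = [random.uniform(1/m - half, 1/m + half) for _ in range(m)]
        pseudo.append(sum(ui * ci for ui, ci in zip(u, centred)) + local_mean)
    return pseudo

random.seed(5)
x = [random.gauss(0.0, 1.0) for _ in range(50)]
z = simdat(x, m=5, N=200)
print(sum(z) / len(z))   # pseudo-sample mean stays close to the sample mean
```

The multivariate version replaces the absolute-value distance by a vector norm and the scalar products by componentwise operations; the weighting logic is unchanged.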
2.4 Theoretical Aspects of the Bootstrap

In the previous section we presented several examples to show applications of the
bootstrap technique to different problems. Though the bootstrap is a very convenient and
appealing tool for data analysis, it is still necessary to confirm theoretically its suitability
for the problems at hand. There are already examples in the literature for which the
bootstrap has failed [Ma92]. Therefore we would like to know "when does the bootstrap
work?". Theorems 2.1 and 2.2 of this section answer the question within a large sample
context. To illustrate the large sample results, we use the following notation:
a) X_n = {X_{n,1}, X_{n,2}, ..., X_{n,n}} is an i.i.d. sample of size n with unknown
distribution F_n;

b) F̂_n is the empirical distribution of the sample X_n;

c) X_n* = {X_{n,1}*, X_{n,2}*, ..., X_{n,n}*} is a bootstrap sample, i.e. an i.i.d. sample
drawn from the empirical distribution F̂_n;

d) F̂_n* is the empirical distribution of the bootstrap sample X_n*;

e) θ_n(F) is a functional of interest. In the theorems given below it is a linear
functional

θ_n(F) = ∫ g_n(x) dF(x).

What the bootstrap procedure tries to do now is to estimate the distribution

P(θ_n(F̂_n) - θ_n(F_n) ≤ t).

The theorem below gives the necessary and sufficient conditions under which the
bootstrap procedure will be successful.
Theorem 2.1

Consider a sequence X_{n,1}, X_{n,2}, ..., X_{n,n} of i.i.d. variables with distribution F_n.
For a function g_n, draw a bootstrap sample X_{n,1}*, X_{n,2}*, ..., X_{n,n}* with empirical
distribution F̂_n*. Denote θ̂_n = θ_n(F̂_n), θ̂_n* = θ_n(F̂_n*), d_K to be the Kolmogorov
distance, and L*(·) to be the conditional law L(· | X_{n,1}, X_{n,2}, ..., X_{n,n}).
Then for every sequence t_n the following assertions are equivalent:

(i) θ̂_n is asymptotically normal: there exist σ_n with

d_K(L(θ̂_n - t_n), N(0, σ_n²)) → 0.

(ii) The normal approximation with estimated variance works:

d_K(L(θ̂_n - t_n), N(0, σ̂_n²)) → 0 (in probability),

where σ̂_n² = (1/n²) Σ_i (g_n(X_{n,i}) - θ̂_n)².

(iii) Bootstrap works:

d_K(L(θ̂_n - t_n), L*(θ̂_n* - θ̂_n)) → 0 (in probability).     (2.21)

If (i), (ii) or (iii) holds, then t_n can be chosen as the mean of the truncated variables,

t_n = μ + E[(g_n(X_{n,i}) - μ) 1(|g_n(X_{n,i}) - μ| ≤ n^(1/2))],

where μ is a median of the distribution of g_n(X_{n,i}).

The meaning of Theorem 2.1 is explicit, i.e. if a sequence t_n satisfies any one of the
conditions (i), (ii) and (iii), then it can be shown to satisfy the other two. The results of
Theorem 2.1 provide the condition under which the bootstrap works for i.i.d. models.
It implies that if θ_n(F̂_n) is a good simulation of θ_n(F_n), then θ_n(F̂_n*) - θ_n(F̂_n) will be a
good simulation of θ_n(F̂_n) - θ_n(F_n), and vice versa. After adding more conditions, the
bootstrap scheme can be shown to work under non-i.i.d. models as well, which is the
result of Theorem 2.2 given below.
Theorem 2.2

Consider a sequence X_{n,1}, X_{n,2}, ..., X_{n,n} of independent variables with
distributions F_{n,i}. For a function g_n, the θ̂_n and θ̂_n* are defined as in Theorem 2.1.
Then for every sequence t_n the following assertions are equivalent:

(i) There exist σ_n such that, for every ε > 0, conditions (2.22) and (2.23) hold.

(ii) The normal approximation with estimated variance works:

d_K(L(θ̂_n - t_n), N(0, σ̂_n²)) → 0 (in probability).

(iii) Bootstrap works:

d_K(L(θ̂_n - t_n), L*(θ̂_n* - θ̂_n)) → 0 (in probability).

Note that in Theorem 2.2, X_{n,1}, X_{n,2}, ..., X_{n,n} are not required to have the same
distribution, although they are still assumed independent. Since we forfeit the i.i.d.
property of the samples, the extra conditions (2.22) and (2.23) are required for ensuring
the consistency of the bootstrap.

The proofs of the above two theorems are omitted here as they are outside the
scope of the thesis. The reader is referred to Mammen's paper [Ma92] for more details.
2.5 Other Resampling Approaches

In Example 2.2 it was demonstrated that the bootstrap scheme, in some sense,
assigns weights to the sample data. This idea can be used to generalize the bootstrap in
the following way.

Let P* = (P_1*, P_2*, ..., P_n*) be any probability vector on the n-dimensional
simplex

S = {P* : P_i* ≥ 0, Σ_i P_i* = 1},     (2.24)

called a resampling vector [Ef82]. For a sample X = {X_1, X_2, ..., X_n}, a re-weighted
empirical probability distribution F̂* is defined with a resampling vector P* as

F̂* : mass P_i* on x_i, i = 1, 2, ..., n,     (2.25)

where {x_1, x_2, ..., x_n} are the observed values of the sample {X_1, X_2, ..., X_n}. Thus, a
resampled value of θ̂, say θ̂*, will be

θ̂* = θ(F̂*) = θ(P*).     (2.26)

As before, after repeatedly generating the resampling vector P* B times, a sample of θ̂
will be obtained, say {θ̂^b : b = 1, 2, ..., B}, which is just what was obtained in step 3 of
the basic bootstrap mentioned above. The basic bootstrap is, from this point of view, a
specific case of (2.25) - (2.26) in which the resampling vector P* takes the form

P_i* = n_i* / n,     (2.27)

where n_i* is the number of times x_i appears in a bootstrap sample. This means that P*
follows a scaled multinomial distribution,

P* ~ (1/n) Mult(n, P^0),

where P^0 = (1/n, 1/n, ..., 1/n) is an n-dimensional vector. In other words, it is possible to
execute a resampling procedure by choosing another P*. These arguments lead to the
Bayesian Bootstrap and the Random Weighting Method, introduced by Rubin [ST95] and
Zhen [Zh87] respectively.
2.5.1 Bayesian Bootstrap

Consider the case where an i.i.d. sample X = {X_1, X_2, ..., X_n} comes from a
distribution F(X | θ) with an unknown parameter θ which is of interest. In frequentist
statistical analysis the θ is regarded as a nonrandom constant. In Bayesian statistical
analysis the θ is taken as a random variable (or vector) with a prior distribution, and the
sample {X_1, X_2, ..., X_n} is considered as coming from the conditional distribution of X
given θ. The Bayesian estimator of θ is given in the form of the conditional expectation
of θ given {X_1, X_2, ..., X_n},

θ̂ = E(θ | X_1, X_2, ..., X_n),

where the conditional distribution F(θ | X_1, X_2, ..., X_n) is known as the posterior
distribution. Usually, posterior distributions are expressible only in terms of complicated
analytical functions, and it is hard to calculate the marginal distributions and moments of
posterior distributions. Commonly, a normal approximation to the posterior distribution is
suggested. Although it is easy to compute, this approximation lacks high-order accuracy.
To address this problem, Rubin introduced a nonparametric method called the Bayesian
Bootstrap under a specific assumption on the prior distribution. Here we discuss the basic
algorithm of the Bayesian bootstrap with a noninformative prior distribution. The
algorithm is formally given below.
Algorithm 2.3 Bayesian Bootstrap
Input: (i) n, the size of the training sample;
       (ii) x[1], x[2], ..., x[n], the sample data of the training sample;
       (iii) B, the number of times the bootstrap resampling is repeated.
Output: the Bayesian bootstrap estimator.
Method
BEGIN
  bootstrap-estimator = 0
  For (i = 1 to B)
    u[n] = 1
    For (j = 1 to n - 1)
      u[j] = random number from U[0, 1]
    End-For
    Sort u[1], u[2], ..., u[n]
    For (j = n to 2)
      P[j] = u[j] - u[j - 1]
    End-For
    P[1] = u[1]
    calculate a Bayesian bootstrap estimating value θ̂* = θ(F̂*)
    bootstrap-estimator = bootstrap-estimator + θ̂*
  End-For
  bootstrap-estimator = bootstrap-estimator / B
  Return bootstrap-estimator
END Bayesian Bootstrap
In the algorithm, an i.i.d. random sample of size n - 1 from the uniform
distribution U[0, 1] is first generated. After obtaining the order statistics

0 ≤ u[1] ≤ u[2] ≤ ... ≤ u[n] = 1

by sorting the uniform random sample, the component P[j] of the resampling vector is
assigned the value of the difference u[j] - u[j - 1] of the uniform order statistics, for j = 2,
..., n, and P[1] = u[1] is the difference u[1] - 0. Thus, the re-weighted empirical
probability F̂* defined in (2.25) is given by the algorithm as

F̂* : mass P[i] on x[i], i = 1, 2, ..., n.

Accordingly, a calculation of θ̂* = θ(F̂*) is carried out.

From the algorithm given above, the relationship between the prior distribution
and the resampling vector P* defined in (2.31) is not obvious. To clarify issues, consider a
very simple case in which the sample {X_1, X_2, ..., X_n} follows a 0-1 distribution,
P(X = 1) = θ and P(X = 0) = 1 - θ. Let m = Σ_i x_i, where {x_1, x_2, ..., x_n} are the
observation values. Under the assumption of a noninformative prior distribution, the
posterior density function will be

c θ^m (1 - θ)^(n-m),

where c is a constant. Therefore the posterior distribution is a Beta distribution. On the
other hand, the difference of the order statistics of the uniform distribution U[0, 1] also
has the form of a Beta distribution (see [PC94]). Thus the resampling vector P* of the
Bayesian bootstrap fits the noninformative prior distribution case.
Example 2.3

Consider the same problem discussed in Example 2.1. We use the Bayesian
bootstrap method to estimate the bias x̄ - θ, where x̄ is the sample mean and θ is the
expectation of the unknown distribution F. Now we draw an i.i.d. sample of size n - 1
from the uniform distribution U[0, 1], say {u_1, u_2, ..., u_{n-1}}, and get the order
statistics by sorting the sample,

0 = u_(0) ≤ u_(1) ≤ u_(2) ≤ ... ≤ u_(n-1) ≤ u_(n) = 1.

After this, a resampling vector P* is made by assigning each of its components the value

P_i* = u_(i) - u_(i-1).     (2.31)

This time the corresponding resampling value of θ̂ will be

θ̂* = Σ_i P_i* x_i.

Repeatedly drawing B samples of size n - 1 from the uniform distribution U[0, 1], we
obtain B resampling vectors P^b* = (P_1^b*, P_2^b*, ..., P_n^b*), b = 1, 2, ..., B, and so we
get B resampling values of θ̂,

θ̂^b = Σ_i P_i^b* x_i.

The Bayesian bootstrap estimator of θ̂ is the average of these θ̂^b,

θ̂** = (1/B) Σ_b θ̂^b.

Therefore the Bayesian bootstrap estimator of the bias x̄ - θ is

Bias* = θ̂** - x̄.

It can be seen that the basic concept of the Bayesian bootstrap algorithm is almost
the same as that of the basic bootstrap algorithm, except that the resampling vector P* is
continuously distributed in the Bayesian bootstrap while it is discretely distributed in the
basic bootstrap case. Hence the Bayesian bootstrap algorithm can be regarded as a
smoothed version of the basic bootstrap scheme.
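Algorithm 2.3 can be sketched in a few lines; a minimal Python illustration for the setting of Example 2.3, where the statistic is the weighted mean Σ_i P_i x_i:

```python
import random

def bayesian_bootstrap_mean(x, B=1000):
    """Bayesian bootstrap estimate of the mean: average of sum_i P_i * x_i
    over B continuous resampling vectors built from uniform order-statistic
    gaps (which sum to 1 by construction)."""
    n = len(x)
    total = 0.0
    for _ in range(B):
        u = sorted(random.random() for _ in range(n - 1)) + [1.0]
        p = [u[0]] + [u[j] - u[j - 1] for j in range(1, n)]   # gaps sum to 1
        total += sum(pi * xi for pi, xi in zip(p, x))
    return total / B

random.seed(6)
x = [2.0, 4.0, 1.0, 3.0, 5.0, 2.5, 3.5, 4.5]
print(bayesian_bootstrap_mean(x))   # close to the sample mean 3.1875
```

Note that no bootstrap sample is ever drawn here: the randomness enters only through the continuous weight vector, which is what makes the scheme a smoothed version of the basic bootstrap.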
2.5.2 Random Weighting Method

It can be argued that the method discussed above does not necessarily have to be
related to prior distributions. With some insight it can be seen that any continuously
distributed resampling vector P* can be used to get a re-weighted empirical
probability distribution F̂* as given in (2.25). In other words, we can randomly generate
the weighting vector P* to get the resampling empirical distribution F̂*. This concept was
introduced by Zhen (1987) and is called the Random Weighting Method. The random
weighting vector P* can be obtained by the following scheme, given formally below.
Algorithm 2.4 Random Weighting Method
Input: (i) n, the size of the training sample;
       (ii) x[1], x[2], ..., x[n], the sample data of the training sample;
       (iii) B, the number of times the resampling is repeated.
Output: the random weighting estimator.
Method
BEGIN
  random-weighting-estimator = 0
  For (i = 1 to B)
    z[0] = 0
    For (j = 1 to n)
      z[j] = random number from a nonnegative distribution P
      z[0] = z[0] + z[j]
    End-For
    For (j = 1 to n)
      P[j] = z[j] / z[0]
    End-For
    calculate a random weighting estimating value θ̂* = θ(F̂*)
    random-weighting-estimator = random-weighting-estimator + θ̂*
  End-For
  random-weighting-estimator = random-weighting-estimator / B
  Return random-weighting-estimator
END Random Weighting Method

In this algorithm, an i.i.d. random sample of size n is first generated from a
nonnegative distribution P. Then a resampling vector P* is obtained by assigning its
component P[j] the normalized value z[j] / z[0], for j = 1, 2, ..., n. Similar to the
Bayesian bootstrap, the re-weighted empirical probability distribution F̂* is given by

F̂* : mass P[i] on x[i], i = 1, 2, ..., n.

Then a calculation of θ̂* = θ(F̂*) is carried out accordingly.

One of the simplest ways is to take an i.i.d. sample {z_1, z_2, ..., z_n} from the
uniform distribution U[0, 1] to construct the random weighting vector P*. Indeed, the
uniform distribution U[0, 1] will be used for the random weighting method simulations
throughout this thesis. The other steps to carry out a random weighting estimator are
similar to those of the basic bootstrap and the Bayesian bootstrap.
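Algorithm 2.4 differs from the Bayesian bootstrap only in how the weights are produced; a minimal Python sketch normalizing an i.i.d. U[0, 1] sample into the weighting vector:

```python
import random

def random_weighting_estimate(x, weighted_stat, B=1000):
    """Random Weighting Method: draw nonnegative z_j ~ U[0, 1], normalise
    to P_j = z_j / sum(z), evaluate the weighted statistic, average over B."""
    n = len(x)
    total = 0.0
    for _ in range(B):
        z = [random.random() for _ in range(n)]
        s = sum(z)
        p = [zj / s for zj in z]          # resampling vector on the simplex
        total += weighted_stat(x, p)
    return total / B

def weighted_mean(x, p):
    return sum(pi * xi for pi, xi in zip(p, x))

random.seed(7)
x = [1.0, 2.0, 3.0, 4.0]
print(random_weighting_estimate(x, weighted_mean))   # close to 2.5
```

Any other nonnegative weight distribution (exponential weights, for instance) could be substituted for `random.random` without changing the rest of the scheme.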
If we now examine expressions (2.11) - (2.13) given in Example 2.2, it can be
seen that the w_i^b* used in the bootstrap can also be regarded as, in some sense, a random
weighting method. Although the Bayesian bootstrap is another random weighting method,
in this thesis the phrase "random weighting method" will only be used for the case of the
random weighting vector constructed with a sample from the uniform distribution
U[0, 1].

A few results related to the theoretical aspects of the Bayesian bootstrap and
random weighting method are available in the literature. Under certain assumptions, both
the Bayesian bootstrap and random weighting method have excellent analytic properties,
such as consistency, and both are approximately normally distributed. As this is outside
the scope of this thesis, these concepts will not be discussed here. The reader is referred
to [ST95] for more details.
In this chapter, we introduced the concept of bootstrap and discussed some details
of the bootstrap technique. The idea of the bootstrap technique is simple and elegant: it
involves "simulating the simulator". What the bootstrap mainly does is to draw a random
sample, called a bootstrap sample, from the empirical distribution. It then combines the
bootstrap and original samples together to obtain the estimates sought for. With this
technique, we are able to extract some information about the relationship between the true
distribution F and the empirical distribution F̂, such as the bias, standard deviation, and
the empirical distribution of a parameter estimate.

While Algorithm 2.1 is a general illustration of the basic bootstrap procedure, the
details of calculating the bootstrap estimates are given individually in the examples of
Section 2.3. Though the bootstrap calculations are usually similar to those of traditional
estimation (as shown in Examples 2.1 and 2.2), schemes more complex than traditional
estimation, such as the SIMDAT algorithm, are possible. The idea of using pseudo-
samples in the SIMDAT algorithm is very interesting; it provides a way by which we are
able to obtain "extra" sample data. In Chapters 4 and 5, we will extend this idea to the
problems of error rate estimation and of constructing classifiers.

The results of Theorems 2.1 and 2.2 provide, at least, some theoretical
foundation for the bootstrap technique, though they only considered the one-
dimensional linear functional cases. Of course, in practice, we would pay more attention
to small sample cases, but the large sample properties are necessary for theoretical
purposes.

The discussion of the Bayesian bootstrap and random weighting method provided
alternative schemes to obtain the re-weighted empirical distribution F̂*. Instead of using
discrete resampling vectors as in the basic bootstrap algorithm, the Bayesian bootstrap
and random weighting method use continuous resampling vectors. Because of this, there
is no bootstrap sample drawn in either of these schemes. We calculate the bootstrap
estimate by assigning a random weight to each corresponding item of the sample data.
Both of the schemes will be used to estimate Bhattacharyya bounds and error rates of
classifiers in the next two chapters. They will also be applied to construct classifiers in
Chapter 5. It will be seen that the performances of the Bayesian bootstrap and random
weighting methods are very similar, although they are different from that of the basic
bootstrap.
CHAPTER 3
THE BHATTACHARYYA BOUND
3.1 Introduction
In a statistical pattern recognition problem, a random vector X, called a pattern, is
composed of a pair

X = (Y, ω),     (3.1)

where Y is a d-dimensional vector called the feature vector, and ω is a variable
representing a class. Generally, a feature vector Y can take a continuous value in a
d-dimensional space and a class variable ω can take one of a finite set of values. In this
thesis, we consider only the two-class case, i.e. ω ∈ {1, 2}. The prior probabilities of the
two classes are assumed known,

P_1 = P(ω = 1), and P_2 = P(ω = 2).

For convenience, in later discussions it is always assumed that the prior probabilities of
the two classes are equal, and so,

P_1 = P_2 = 1/2.     (3.2)

The conditional density of a feature vector given ω is denoted as p_ω(y), ω = 1, 2.
According to Bayes' theorem, the posterior probability of ω given y is

q_ω(y) = P_ω p_ω(y) / p(y),     (3.3)

where p(y) = P_1 p_1(y) + P_2 p_2(y) is the mixture density function and is a constant
independent of ω.
The purpose of pattern recognition is to determine whether a given feature vector
y belongs to class 1 or class 2. In other words, the aim is to predict the value of ω from a
given feature vector y. A decision rule based on probabilities is to maximize the posterior
probability for the given feature vector y. Thus, the classification rule is:

ω = 1  if q_1(y) ≥ q_2(y);
ω = 2  if q_1(y) < q_2(y).     (3.4)

Because p(y) is positive, from (3.3), the above rule can be equally expressed as

ω = 1  if P_1 p_1(y) ≥ P_2 p_2(y);
ω = 2  if P_1 p_1(y) < P_2 p_2(y).     (3.5)

The rule expressed by (3.4) or (3.5) is called the Bayesian Decision Rule. When a
feature vector is classified to class 1 by the Bayesian Decision Rule, it means that there is
a better chance of having ω = 1. This does not mean, however, that ω is exactly equal to
1, as there is always a possibility of making an error in the classification process. The
conditional error probability caused by rule (3.4) or (3.5) is

ε(y) = min(q_1(y), q_2(y)).     (3.6)

The Bayesian error is the average of ε(y),

ε = E[ε(Y)] = ∫ min(P_1 p_1(y), P_2 p_2(y)) dy.     (3.7)
Equation (3.7) is the crucial measure of the performance of the decision rule.
Unfortunately, it is generally very hard to directly calculate a Bayesian error even if
the conditional densities p_1(y) and p_2(y) are known. Therefore, many researchers
have devised upper bounds for (3.7) which are easier to calculate and estimate. Two such
bounds, the Chernoff and Bhattacharyya bounds, are the most well known and will
be discussed in the second section. As the chapter title indicates, the Bhattacharyya bound
is the main topic discussed here. Thus, the third section will consider the general
approach of how to estimate the Bhattacharyya bound and the theoretical properties of the
estimate. Simulation results will be given in order to help us understand the properties of
the estimates. In Sections 3.4 and 3.5, we shall attempt to estimate the Bhattacharyya
bound by utilizing bootstrap techniques. The algorithms of the bootstrap approaches are
illustrated there along with the discussion of the simulation results. Although the problem
of estimating the Chernoff bound is not considered here, it is easy to see that the
approaches to Bhattacharyya bound estimation discussed here could be applied to the
Chernoff bound as well.

The data set used for all the experiments in this chapter is the set referred to as
Data Set 1, which consists of training samples for seven classes. Each class has a
2-dimensional normal distribution, and a training sample size of eight. Only two classes
are involved in each experiment: one is the class A, and the other is selected from the
rest. Hence, there are, in total, six experimental pairs of classes, (A, B), (A, C), (A, D),
(A, E), (A, F), and (A, G). With each class pair, an experiment will repeatedly do 200
trials of simulation for an algorithm. The results of the experiments given in this chapter
are the statistics from the 200 trials.

As only the case of two classes is discussed in this thesis, for simplification, it is
preferred to use the notation Y_ω for a feature vector from class ω. So Y_1 represents a
sample datum from class 1, and Y_2 represents a sample datum from class 2.
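For concreteness, the decision rule (3.5) and the Bayesian error (3.7) introduced above can be sketched numerically; a univariate Python illustration (not part of the thesis's experiments) with two normal class densities, equal priors, and a Monte Carlo estimate of ∫ min(P_1 p_1(y), P_2 p_2(y)) dy obtained by sampling from the mixture:

```python
import math, random

def normal_pdf(y, mu, sigma):
    return math.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def bayes_classify(y, p1, p2, P1=0.5, P2=0.5):
    """Rule (3.5): assign class 1 iff P1*p1(y) >= P2*p2(y)."""
    return 1 if P1 * p1(y) >= P2 * p2(y) else 2

def bayes_error_mc(p1, p2, sampler, trials=20000, P1=0.5, P2=0.5):
    """Monte Carlo estimate of (3.7): average of the conditional error (3.6),
    min(q1(Y), q2(Y)), over draws Y from the mixture density."""
    err = 0.0
    for _ in range(trials):
        y = sampler()
        mix = P1 * p1(y) + P2 * p2(y)
        err += min(P1 * p1(y), P2 * p2(y)) / mix
    return err / trials

# Example classes: N(0, 1) versus N(2, 1); the mixture is sampled by a coin flip
random.seed(8)
p1 = lambda y: normal_pdf(y, 0.0, 1.0)
p2 = lambda y: normal_pdf(y, 2.0, 1.0)
sampler = lambda: random.gauss(0.0, 1.0) if random.random() < 0.5 else random.gauss(2.0, 1.0)
print(bayes_classify(-1.0, p1, p2), bayes_error_mc(p1, p2, sampler))
```

For N(0, 1) versus N(2, 1) with equal priors, the true Bayes error is Φ(-1) ≈ 0.159, so the Monte Carlo value can be checked against it.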
3 3 Chernoff and Bhattacharrya Bounds
33.1 Cbernofï Bound
For aay mal numbers, a, b t O, it is well h o w n that the following inequality
holds:
min (a, b ) ~ d b ' * , O a s s 1. (3.8)
Applying inequality (3.8) to (3 3, we have
The right side of the abve inqualit-
is cailed the Chemoff bound. The optimum s is the value that minimizes the vdw of eu.
When the conditional density functions are normal distributions,

p_j(X) = N_j(M_j, \Sigma_j), \quad j = 1, 2,

the integration part of (3.9) can be expressed as

\int p_1^s(X)\, p_2^{1-s}(X)\, dX = e^{-\mu(s)}, \qquad (3.10)

where

\mu(s) = \frac{s(1-s)}{2} (M_2 - M_1)^T \left[ s\Sigma_1 + (1-s)\Sigma_2 \right]^{-1} (M_2 - M_1) + \frac{1}{2} \ln \frac{\left| s\Sigma_1 + (1-s)\Sigma_2 \right|}{|\Sigma_1|^s |\Sigma_2|^{1-s}}. \qquad (3.11)

The quantity μ(s) is called the Chernoff distance. It is obvious that the optimum s will maximize the value of μ(s), which can be obtained by studying the variation of μ(s) for various values of s for the given M_j and Σ_j (j = 1, 2).
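To make the search for the optimum s concrete, the following sketch evaluates the Chernoff distance μ(s) of (3.11) on a grid. The class parameters are hypothetical (two-dimensional normals with identity covariances and a mean difference of (0.5, 0.5)); they are chosen for illustration only, and the 2x2 linear-algebra helpers are written out so the example needs no external library. With equal covariances the second term of (3.11) vanishes, and the grid search finds the maximum at s = 1/2.

```python
import math

# 2x2 helpers (enough for the two-dimensional classes used in this chapter)
def mat_mix(A, B, wa, wb):
    # wa*A + wb*B for 2x2 matrices
    return [[wa*A[0][0]+wb*B[0][0], wa*A[0][1]+wb*B[0][1]],
            [wa*A[1][0]+wb*B[1][0], wa*A[1][1]+wb*B[1][1]]]

def det2(A):
    return A[0][0]*A[1][1] - A[0][1]*A[1][0]

def inv2(A):
    d = det2(A)
    return [[A[1][1]/d, -A[0][1]/d], [-A[1][0]/d, A[0][0]/d]]

def quad_form(v, A):
    # v^T A v for a 2-vector v
    return (v[0]*(A[0][0]*v[0] + A[0][1]*v[1]) +
            v[1]*(A[1][0]*v[0] + A[1][1]*v[1]))

def chernoff_distance(m1, s1, m2, s2, s):
    # mu(s) of equation (3.11)
    dm = [m2[0]-m1[0], m2[1]-m1[1]]
    mix = mat_mix(s1, s2, s, 1.0 - s)
    term1 = 0.5*s*(1.0 - s)*quad_form(dm, inv2(mix))
    term2 = 0.5*math.log(det2(mix) / (det2(s1)**s * det2(s2)**(1.0 - s)))
    return term1 + term2

# hypothetical class pair: identity covariances, mean difference (0.5, 0.5)
I = [[1.0, 0.0], [0.0, 1.0]]
best_s = max((chernoff_distance([0, 0], I, [0.5, 0.5], I, s/100.0), s/100.0)
             for s in range(1, 100))[1]
print(best_s)   # equal covariances -> the optimum is s = 1/2
```

For unequal covariances the maximizing s moves away from 1/2, which is exactly the case in which the Chernoff bound is tighter than the Bhattacharyya bound discussed next.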
3.2.2 Bhattacharyya Bound

The Bhattacharyya bound results when the Chernoff bound is simplified by selecting the value s = 1/2. In such a case, (3.9) is simplified to

\varepsilon_u = \sqrt{P_1 P_2} \int \sqrt{p_1(X)\, p_2(X)}\, dX = \sqrt{P_1 P_2}\, e^{-\mu(1/2)}. \qquad (3.12)

The upper bound given by (3.12) is called the Bhattacharyya bound. In our case, it is assumed that P_1 = P_2 = 1/2; therefore, the Bhattacharyya bound takes the form:

\varepsilon_u = \frac{1}{2} e^{-\mu(1/2)}. \qquad (3.13)

Correspondingly, the Bhattacharyya distance is:

\mu(1/2) = \frac{1}{8} (M_2 - M_1)^T \left[ \frac{\Sigma_1 + \Sigma_2}{2} \right]^{-1} (M_2 - M_1) + \frac{1}{2} \ln \frac{\left| \frac{\Sigma_1 + \Sigma_2}{2} \right|}{\sqrt{|\Sigma_1|\, |\Sigma_2|}}. \qquad (3.14)
There are two terms on the right side of the equation in (3.14), as well as in (3.11). It is obvious that the first term will be zero if M_1 = M_2, and the second term will be zero if Σ_1 = Σ_2. So the first term reflects the class separability caused by the difference of the means, while the second term reflects the class separability caused by the difference of the covariances. It is not difficult to calculate the Bhattacharyya bound in a normal distribution case when the parameters M_j and Σ_j (j = 1, 2) are given.
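The two terms of (3.14) and the bound (3.13) can be computed directly. The sketch below is written for the two-dimensional case used in this chapter; the class parameters are hypothetical (identity covariances, mean difference (0.5, 0.5)), chosen because this pair reproduces the Bhattacharyya distance 0.0625 listed for (A, B) in TABLE 3.1; the actual means of Data Set 1 are not given in this chapter.

```python
import math

def det2(A):
    return A[0][0]*A[1][1] - A[0][1]*A[1][0]

def inv2(A):
    d = det2(A)
    return [[A[1][1]/d, -A[0][1]/d], [-A[1][0]/d, A[0][0]/d]]

def bhattacharyya_distance(m1, s1, m2, s2):
    # mu(1/2), equation (3.14): mean term plus covariance term
    dm = [m2[0]-m1[0], m2[1]-m1[1]]
    avg = [[(s1[i][j] + s2[i][j]) / 2 for j in range(2)] for i in range(2)]
    ia = inv2(avg)
    mean_term = 0.125*(dm[0]*(ia[0][0]*dm[0] + ia[0][1]*dm[1]) +
                       dm[1]*(ia[1][0]*dm[0] + ia[1][1]*dm[1]))
    cov_term = 0.5*math.log(det2(avg) / math.sqrt(det2(s1)*det2(s2)))
    return mean_term + cov_term

def bhattacharyya_bound(m1, s1, m2, s2):
    # equation (3.13), with P1 = P2 = 1/2
    return 0.5*math.exp(-bhattacharyya_distance(m1, s1, m2, s2))

I = [[1.0, 0.0], [0.0, 1.0]]
mu = bhattacharyya_distance([0, 0], I, [0.5, 0.5], I)
print(round(mu, 6), round(100*bhattacharyya_bound([0, 0], I, [0.5, 0.5], I), 4))
# -> 0.0625 46.9707
```

With identical covariances the covariance term is zero and the distance is one eighth of the squared Euclidean distance between the means, which is the situation of Data Set 1 described below.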
3.3 The Estimation of the Bhattacharyya Bound

In most cases, the means and covariances are unknown, even though the conditional distributions can be assumed to be normal. In such cases, it is necessary to estimate the Bhattacharyya bound from a given training sample. Let X = {x_{1,1}, ..., x_{m,1}, x_{1,2}, ..., x_{n,2}} be the training sample: the first m sample data x_{1,1}, ..., x_{m,1} belong to class 1 with a normal conditional distribution

x_{1,1}, \ldots, x_{m,1} \sim N_1(M_1, \Sigma_1), \qquad (3.15)

and the other n sample data x_{1,2}, ..., x_{n,2} belong to class 2 with a normal conditional distribution

x_{1,2}, \ldots, x_{n,2} \sim N_2(M_2, \Sigma_2). \qquad (3.16)

Then the maximum likelihood estimates of the means M_j and covariances Σ_j (j = 1, 2) are given by

\hat{M}_1 = \frac{1}{m} \sum_{i=1}^{m} x_{i,1}, \quad \hat{\Sigma}_1 = \frac{1}{m} \sum_{i=1}^{m} (x_{i,1} - \hat{M}_1)(x_{i,1} - \hat{M}_1)^T,

\hat{M}_2 = \frac{1}{n} \sum_{i=1}^{n} x_{i,2}, \quad \hat{\Sigma}_2 = \frac{1}{n} \sum_{i=1}^{n} (x_{i,2} - \hat{M}_2)(x_{i,2} - \hat{M}_2)^T. \qquad (3.17)
It is well known that the estimates given by (3.17) have good properties, and are the usual estimates of the parameters M_j and Σ_j (j = 1, 2). Based on (3.17), an estimate of the Bhattacharyya bound can be calculated by the following steps:

1. Replace the parameters in (3.14) with their estimates given by (3.17) to get an estimate of the Bhattacharyya distance, say μ̂(1/2).

2. Replace μ(1/2) in (3.13) with μ̂(1/2) to get an estimate of the Bhattacharyya bound.
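The two steps above can be sketched as follows. The samples below are randomly generated stand-ins (Data Set 1 itself is not reproduced here), and the 2x2 matrix helpers are included only to keep the example self-contained. Since μ̂(1/2) ≥ 0, the resulting bound estimate always lies in (0, 1/2].

```python
import math, random

def det2(A):
    return A[0][0]*A[1][1] - A[0][1]*A[1][0]

def inv2(A):
    d = det2(A)
    return [[A[1][1]/d, -A[0][1]/d], [-A[1][0]/d, A[0][0]/d]]

def ml_estimates(sample):
    # equation (3.17): sample mean and maximum likelihood covariance
    n = len(sample)
    m = [sum(x[0] for x in sample)/n, sum(x[1] for x in sample)/n]
    c = [[0.0, 0.0], [0.0, 0.0]]
    for x in sample:
        d = [x[0]-m[0], x[1]-m[1]]
        for i in range(2):
            for j in range(2):
                c[i][j] += d[i]*d[j]/n
    return m, c

def general_bound(sample1, sample2):
    # steps 1 and 2 of the General (plug-in) approach
    m1, c1 = ml_estimates(sample1)
    m2, c2 = ml_estimates(sample2)
    avg = [[(c1[i][j]+c2[i][j])/2 for j in range(2)] for i in range(2)]
    dm = [m2[0]-m1[0], m2[1]-m1[1]]
    ia = inv2(avg)
    mu = (0.125*(dm[0]*(ia[0][0]*dm[0]+ia[0][1]*dm[1]) +
                 dm[1]*(ia[1][0]*dm[0]+ia[1][1]*dm[1])) +
          0.5*math.log(det2(avg)/math.sqrt(det2(c1)*det2(c2))))
    return 0.5*math.exp(-mu)

random.seed(0)
s1 = [(random.gauss(0.0, 1), random.gauss(0.0, 1)) for _ in range(8)]
s2 = [(random.gauss(0.5, 1), random.gauss(0.5, 1)) for _ in range(8)]
b = general_bound(s1, s2)
print(0.0 < b <= 0.5)   # the plug-in bound estimate always lies in (0, 1/2]
```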
We refer to the estimating method described above as the Direct Estimating approach or the General approach. The unanswered question is one of determining how good the estimate of the General approach will be. The answer to this question is by no means trivial or direct, because the Bhattacharyya distance and bound are fairly complex functions of the parameters M_j and Σ_j. Here is a brief discussion of the properties of the estimate of a function of the parameters. Let Θ = (θ_1, θ_2, ..., θ_q)^T be a parameter vector, its estimate be Θ̂ = (θ̂_1, θ̂_2, ..., θ̂_q)^T, and f = f(Θ) be a function of Θ. Using the General approach, an estimator f̂ = f(Θ̂) can be obtained. Assuming that Θ̂ and Θ are close enough, a Taylor series can be used by expanding f̂ = f(Θ̂) up to the second order terms,

f(\hat{\Theta}) \approx f(\Theta) + \Delta\Theta^T \nabla f(\Theta) + \frac{1}{2} \Delta\Theta^T \nabla^2 f(\Theta)\, \Delta\Theta, \qquad (3.18)

where ΔΘ = Θ̂ − Θ. If Θ̂ is an unbiased estimate of Θ, E(ΔΘ) = 0. Thus,

E[f(\hat{\Theta})] \approx f(\Theta) + \frac{1}{2} E\left[ \Delta\Theta^T \nabla^2 f(\Theta)\, \Delta\Theta \right]. \qquad (3.19)

We, therefore, have f̂, which is generally a biased estimator of f. In our case
\Theta = (M_1^T, M_2^T, \mathrm{svec}(\Sigma_1)^T, \mathrm{svec}(\Sigma_2)^T)^T,

where svec(A) represents a vector consisting of all the distinct components of a symmetric matrix A = (a_{ij})_{n \times n},

\mathrm{svec}(A) = (a_{11}, \ldots, a_{n1}, a_{22}, \ldots, a_{n2}, \ldots, a_{nn})^T,

and its estimator

\hat{\Theta} = (\hat{M}_1^T, \hat{M}_2^T, \mathrm{svec}(\hat{\Sigma}_1)^T, \mathrm{svec}(\hat{\Sigma}_2)^T)^T
given in (3.17) is unbiased (exactly so for the means; the covariance estimates are made unbiased by using the divisors m − 1 and n − 1 rather than m and n). For the Bhattacharyya distance, the second term on the right side in (3.19) is extremely complicated (the details will not be discussed here as they are beyond the scope of this thesis; the reader can refer to Fukunaga's book [Fu92] for more details). It is thus apparent that the estimate of the Bhattacharyya distance deduced from the General approach is biased, and so is that of the Bhattacharyya bound. The simulation results given below support this conclusion, and give a visual illustration of how biased the estimate of the General approach can be.
As we know the parameters of the multi-normal distribution of each class, it is not difficult to calculate the theoretical values of the Bhattacharyya distances and the bounds of each class pair involved in an experiment. These theoretical values are given in TABLE 3.1. Note that the Bhattacharyya bounds are given in terms of a percentage, i.e.

\text{Bhattacharyya bound} = \frac{1}{2} e^{-\mu(1/2)} \times 100\%.

TABLE 3.1 Theoretical values of Bhattacharyya distance and bound

CLASS PAIR   Bhattacharyya distance   Bhattacharyya bound
(A, B)       0.0625                   46.9707
(A, C)       0.1375                   43.5767
(A, D)       0.16298                  42.4804
(A, E)       0.232513                 39.6270
(A, F)       0.3125                   36.5808
(A, G)       0.419263                 32.8766

As all the multi-normal distributions of the classes have the same covariance matrix, the identity matrix, the Bhattacharyya distance of a class pair in TABLE 3.1 contains only the first term on the right side of (3.14). It is seen to be in proportion to the square of the Euclidean distance between the expectations of the class pair. Thus, the results of the experiments reflect, in some sense, the effect of the classifying error rate, because the class pair changes in each experiment, and so does the classifier error rate.

TABLE 3.2 Estimates of Bhattacharyya distances with the General approach

CLASS PAIR   Theoretical Value   Means      Standard Deviations
(A, B)       0.0625              0.236728   0.138292
(A, C)       0.1375              0.407315   0.247854
(A, D)       0.16298             0.458343   0.271062
(A, E)       0.232513            0.758463   0.405374
(A, F)       0.3125              1.122263   0.558753
(A, G)       0.419263            2.000085   1.026716

TABLE 3.2 lists the means and standard deviations of the 200 trials of the Bhattacharyya distance estimates. It can be seen from the table that, on average, the Bhattacharyya distance estimate is biased, i.e. the General approach usually over-estimates the Bhattacharyya distance. The bias increases when the distance of the two classes increases. The standard deviation of the estimate increases along with the increased distance of the two classes as well.
TABLE 3.3 Estimates of Bhattacharyya bounds with the General approach (%)

CLASS PAIR          (A, B)    (A, C)    (A, D)    (A, E)    (A, F)    (A, G)
Theoretical Value   46.9707   43.5767   42.4804   39.627    36.5808   32.8766
[rows for the means, standard deviations, maximum, 90%, 80%, median, 20%, 10% and minimum of the 200 estimates: entries not recoverable from this copy]
TABLE 3.3 provides the statistics of the 200 trials of the Bhattacharyya bound estimates in terms of percentages. Besides the means and standard deviations of the 200 estimates, the maximum, median, minimum, and 90%, 80%, 20% and 10% quantiles of the estimates are also listed in the table. For instance, the mean and median of the 200 estimates of the Bhattacharyya bound for the class pair (A, E) are 25.14584% and 26.226% respectively, and their 90% quantile is 35.69748%. Comparing these to the theoretical value, which is 39.627%, we can see that, for the class pair (A, E), the General approach under-estimates the bound significantly: even the 90% quantile lies below the theoretical value. It is expected that the General approach tends to under-estimate the Bhattacharyya bound, because the estimate of the Bhattacharyya bound is just an exponential function of the negative estimate of the Bhattacharyya distance. Of course, the bias of the estimate increases as the distance of the two classes increases. One interesting fact is that with the class pair (A, B), the rate of the estimates which under-estimate the Bhattacharyya bound is about 90% of the total 200 trials, while for the class pair (A, G), the rate increases to 100%. We infer that, generally speaking, the probability of under-estimating the Bhattacharyya bound usually increases with the distance of the two classes.
In the case of the Bhattacharyya distance, the over-estimation is due to the second term in (3.19), or the third term in (3.18). Observe that the Bhattacharyya bound is a function of the Bhattacharyya distance μ(1/2), and μ̂(1/2), given by the General approach, is a biased estimate of μ(1/2). Therefore, the under-estimation of the bound is due not only to the third term in (3.18), but also to the second term in (3.18). The question before us, then, is one of correcting the bias, and as we shall see, the bootstrap technique provides a way of achieving this. The next sections will illustrate how the bootstrap technique can be applied when estimating the Bhattacharyya bound.
3.4 The Bootstrap Estimators - Direct Bias Correction

When considering the problem of estimating the Bhattacharyya bound, there are two different types of schemes by which a bootstrap estimate of the quantity can be obtained. The difference between the two schemes is a question of how we want to correct the bias. First, the estimate bias may be directly corrected by using (2.9). Schemes of this type are called the Direct Bias Correction schemes. Second, since the estimate bias of the Bhattacharyya bound is caused by the estimate bias of the Bhattacharyya distance, the bias of the Bhattacharyya distance can be estimated first, and then the bias estimate can be used to adjust the estimate of the Bhattacharyya bound. Schemes of this type are called the Distance Bias Adjustment schemes. This section will discuss the Direct Bias Correction schemes with three different resampling approaches: the basic bootstrap, the Bayesian bootstrap and the random weighting method. A discussion of the Distance Bias Adjustment schemes will be the topic of the next section.

The Direct Bias Correction schemes use the bootstrap estimate of the Bhattacharyya bound directly to correct the bias, i.e. they apply (2.9) given in Chapter 2. The algorithms with the different resampling schemes are described below:
Algorithm 3.1 Basic Bootstrap
Input: (i) n, the size of the training sample of a class;
(ii) x[1,1], x[2,1], ..., x[n,1], the training sample data of the first class;
(iii) x[1,2], x[2,2], ..., x[n,2], the training sample data of the second class;
(iv) B, the number of bootstrap resampling repetitions.
Output: The Bhattacharyya bound estimate.
Method
BEGIN
    E0 = CalculateBound(x[1,1], ..., x[n,1], x[1,2], ..., x[n,2])
    E = 0
    For (i = 1 to B)
        For (j = 1 to n)
            m = random integer in [1, n]
            y[j,1] = x[m,1]
        End-For
        For (j = 1 to n)
            m = random integer in [1, n]
            y[j,2] = x[m,2]
        End-For
        E = E + CalculateBound(y[1,1], ..., y[n,1], y[1,2], ..., y[n,2])
    End-For
    E = 2 × E0 − E / B
    Return E
END Basic Bootstrap

Procedure CalculateBound
Input: (i) n, the size of the sample of a class;
(ii) x[1,1], x[2,1], ..., x[n,1], the sample data of the first class;
(iii) x[1,2], x[2,2], ..., x[n,2], the sample data of the second class.
Output: A Bhattacharyya bound estimate.
Method
BEGIN
    μ(1/2) = CalculateDistance(x[1,1], ..., x[n,1], x[1,2], ..., x[n,2])
    E = 0.5 × EXP(−μ(1/2))
    Return E
END Procedure CalculateBound

Procedure CalculateDistance
Input: (i) n, the size of the sample of a class;
(ii) x[1,1], x[2,1], ..., x[n,1], the sample data of the first class;
(iii) x[1,2], x[2,2], ..., x[n,2], the sample data of the second class.
Output: A Bhattacharyya distance estimate.
Method
BEGIN
    Compute the estimates M̂_j and Σ̂_j (j = 1, 2) by (3.17)
    Compute μ(1/2) from these estimates by (3.14)
    Return μ(1/2)
END Procedure CalculateDistance
The first step of this algorithm is to calculate the General approach estimate of the Bhattacharyya bound. The first For loop is used to calculate the B bootstrap estimates of the Bhattacharyya bound. After that, (2.9) is applied to obtain the estimate of the Bhattacharyya bound. The two For loops nested in the first For loop are for generating bootstrap samples, so that a bootstrap sample is drawn separately from each class training sample. The reason for doing this is so that we have a larger probability of ensuring that the estimated covariance matrices Σ̂_1 and Σ̂_2 are non-singular. If a bootstrap sample were drawn from the merged training sample, the bootstrap sample might contain only a few data points from one class sample. Thus, the estimated covariance matrix Σ̂_1 or Σ̂_2 would have a higher probability of being singular, which may happen especially in the case of small training samples, or when the dimension of the feature vectors is "large". Considering that an estimate of a covariance matrix may be singular, a real algorithm should inspect the three determinants |Σ̂_1|, |Σ̂_2| and |(Σ̂_1 + Σ̂_2)/2| after each bootstrap sampling. If any of them is zero or significantly close to zero, a bootstrap sample must be drawn and inspected again until all three determinants are non-zero.
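A minimal executable sketch of Algorithm 3.1, including the determinant inspection just described, might look as follows. The samples and the tolerance 1e-9 for "significantly close to zero" are illustrative assumptions, not values taken from the thesis.

```python
import math, random

def det2(A):
    return A[0][0]*A[1][1] - A[0][1]*A[1][0]

def inv2(A):
    d = det2(A)
    return [[A[1][1]/d, -A[0][1]/d], [-A[1][0]/d, A[0][0]/d]]

def bound(sample1, sample2):
    # General estimate: ML parameters (3.17) plugged into (3.14) and (3.13)
    def est(s):
        n = len(s)
        m = [sum(x[0] for x in s)/n, sum(x[1] for x in s)/n]
        c = [[sum((x[i]-m[i])*(x[j]-m[j]) for x in s)/n for j in range(2)]
             for i in range(2)]
        return m, c
    m1, c1 = est(sample1)
    m2, c2 = est(sample2)
    avg = [[(c1[i][j]+c2[i][j])/2 for j in range(2)] for i in range(2)]
    if min(abs(det2(c1)), abs(det2(c2)), abs(det2(avg))) < 1e-9:
        return None                      # singular covariance: caller must redraw
    dm = [m2[0]-m1[0], m2[1]-m1[1]]
    ia = inv2(avg)
    mu = (0.125*(dm[0]*(ia[0][0]*dm[0]+ia[0][1]*dm[1]) +
                 dm[1]*(ia[1][0]*dm[0]+ia[1][1]*dm[1])) +
          0.5*math.log(det2(avg)/math.sqrt(det2(c1)*det2(c2))))
    return 0.5*math.exp(-mu)

def basic_bootstrap_bound(sample1, sample2, B=200, rng=random):
    e0 = bound(sample1, sample2)
    total = 0.0
    for _ in range(B):
        while True:   # redraw until all three determinants are non-singular
            b1 = [rng.choice(sample1) for _ in sample1]
            b2 = [rng.choice(sample2) for _ in sample2]
            e = bound(b1, b2)
            if e is not None:
                break
        total += e
    return 2*e0 - total/B                # bias correction, equation (2.9)

random.seed(1)
s1 = [(random.gauss(0.0, 1), random.gauss(0.0, 1)) for _ in range(8)]
s2 = [(random.gauss(0.5, 1), random.gauss(0.5, 1)) for _ in range(8)]
print(round(basic_bootstrap_bound(s1, s2), 4))
```

As the simulation results below show, the corrected value can even turn out negative for well-separated classes, so a practical implementation may also want to guard against that.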
By changing the way of drawing the bootstrap samples, the other two estimating algorithms for the Bhattacharyya bound, the Bayesian bootstrap and the random weighting algorithms, can be obtained. The algorithm of the Bayesian bootstrap is presented below.
Algorithm 3.2 Bayesian Bootstrap
Input: (i) n, the size of the training sample of a class;
(ii) x[1,1], x[2,1], ..., x[n,1], the training sample data of the first class;
(iii) x[1,2], x[2,2], ..., x[n,2], the training sample data of the second class;
(iv) B, the number of bootstrap resampling repetitions.
Output: The Bhattacharyya bound estimate.
Method
BEGIN
    E0 = CalculateBound(x[1,1], ..., x[n,1], x[1,2], ..., x[n,2], general)
    E = 0
    For (i = 1 to B)
        E = E + CalculateBound(x[1,1], ..., x[n,1], x[1,2], ..., x[n,2], bootstrap)
    End-For
    E = 2 × E0 − E / B
    Return E
END Bayesian Bootstrap

Procedure CalculateBound
Input: (i) n, the size of the sample of a class;
(ii) x[1,1], x[2,1], ..., x[n,1], the sample data of the first class;
(iii) x[1,2], x[2,2], ..., x[n,2], the sample data of the second class;
(iv) flag, to indicate 'general' or 'bootstrap'.
Output: A Bhattacharyya bound estimate.
Method
BEGIN
    μ(1/2) = CalculateDistance(x[1,1], ..., x[n,1], x[1,2], ..., x[n,2], flag)
    E = 0.5 × EXP(−μ(1/2))
    Return E
END Procedure CalculateBound

Procedure CalculateDistance
Input: (i) n, the size of the sample of a class;
(ii) x[1,1], x[2,1], ..., x[n,1], the sample data of the first class;
(iii) x[1,2], x[2,2], ..., x[n,2], the sample data of the second class;
(iv) flag, to indicate 'general' or 'bootstrap'.
Output: A Bhattacharyya distance estimate.
Method
BEGIN
    If flag = 'bootstrap'
        GetWeights_Bayesian_Bootstrap(n)
    Else
        w[1] = w[2] = ... = w[n] = 1 / n
    End-If
    M̂_1 = Σ_{j=1..n} w[j] x[j,1]
    Σ̂_1 = Σ_{j=1..n} w[j] (x[j,1] − M̂_1)(x[j,1] − M̂_1)^T
    If flag = 'bootstrap'
        GetWeights_Bayesian_Bootstrap(n)
    Else
        w[1] = w[2] = ... = w[n] = 1 / n
    End-If
    M̂_2 = Σ_{j=1..n} w[j] x[j,2]
    Σ̂_2 = Σ_{j=1..n} w[j] (x[j,2] − M̂_2)(x[j,2] − M̂_2)^T
    Compute μ(1/2) from M̂_1, Σ̂_1, M̂_2, Σ̂_2 by (3.14)
    Return μ(1/2)
END Procedure CalculateDistance

Procedure GetWeights_Bayesian_Bootstrap
Input: n, the size of the training sample of a class.
Output: w[1], w[2], ..., w[n], the resampling vector.
Method
BEGIN
    For (j = 1 to n − 1)
        w[j] = random uniform U[0, 1]
    End-For
    w[n] = 1
    Sort(w[1], w[2], ..., w[n])
    For (j = n down to 2)
        w[j] = w[j] − w[j − 1]
    End-For
    Return w[1], w[2], ..., w[n]
END Procedure GetWeights_Bayesian_Bootstrap
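The weight construction just described (n − 1 sorted uniform draws plus the endpoint 1, differenced into spacings) can be sketched as follows; in Algorithm 3.2 it would be invoked once per class for every bootstrap repetition. The resulting vector is a probability vector, equivalently a draw from the flat Dirichlet distribution.

```python
import random

def bayesian_bootstrap_weights(n, rng=random):
    # spacings of n-1 sorted U[0,1] draws plus the endpoint 1, as in the
    # GetWeights_Bayesian_Bootstrap procedure
    w = sorted(rng.random() for _ in range(n - 1)) + [1.0]
    return [w[0]] + [w[j] - w[j - 1] for j in range(1, n)]

random.seed(0)
w = bayesian_bootstrap_weights(8)
print(len(w), round(sum(w), 12), all(x >= 0 for x in w))   # 8 1.0 True
```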
The goal of the procedure GetWeights_Bayesian_Bootstrap above is to set the resampling vector; the Sort procedure is any standard sorting procedure which puts w[1], w[2], ..., w[n] in increasing order. The main difference between Algorithm 3.1 and Algorithm 3.2 is the CalculateDistance procedure. In Algorithm 3.1, the bootstrap sample is passed to the procedure to calculate the estimates of the means and covariances, and so the weights used in the calculation are always 1/n. In Algorithm 3.2, the parameters passed to the CalculateDistance procedure are always the training sample and a flag. If the flag is 'bootstrap', the procedure will invoke GetWeights_Bayesian_Bootstrap to obtain the weights for calculating the corresponding bootstrap estimate. Otherwise, it uses the equal weights 1/n to calculate the estimates, as is done in Algorithm 3.1.

As can be directly seen, Algorithm 3.2 is more general than Algorithm 3.1. Algorithm 3.1 can be obtained from Algorithm 3.2 by just changing the GetWeights_Bayesian_Bootstrap procedure to the GetWeights_Basic_Bootstrap procedure, as shown below:
Algorithm 3.3 GetWeights_Basic_Bootstrap
Procedure GetWeights_Basic_Bootstrap
Input: n, the size of the training sample of a class.
Output: w[1], w[2], ..., w[n], the resampling vector.
Method
BEGIN
    For (j = 1 to n)
        w[j] = 0
    End-For
    For (j = 1 to n)
        m = random integer in [1, n]
        w[m] = w[m] + 1
    End-For
    For (j = 1 to n)
        w[j] = w[j] / n
    End-For
    Return w[1], w[2], ..., w[n]
END Procedure GetWeights_Basic_Bootstrap
From what was explained earlier, the algorithm for the random weighting method can also be obtained from Algorithm 3.2 by changing the GetWeights_Bayesian_Bootstrap procedure. As the other parts of the random weighting method algorithm are exactly the same as those of Algorithm 3.2, only the changed procedure, which is now the GetWeights_Random_Weighting procedure of the random weighting method, is given here.

Procedure GetWeights_Random_Weighting
Input: n, the size of the training sample of a class.
Output: w[1], w[2], ..., w[n], the resampling vector.
Method
BEGIN
    sum = 0
    For (j = 1 to n)
        w[j] = random uniform U[0, 1]
        sum = sum + w[j]
    End-For
    For (j = 1 to n)
        w[j] = w[j] / sum
    End-For
    Return w[1], w[2], ..., w[n]
END Procedure GetWeights_Random_Weighting
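For comparison, the random weighting and basic bootstrap weight procedures can be sketched side by side. Both return a probability vector; the basic bootstrap weights are restricted to multiples of 1/n, while the random weighting (and Bayesian bootstrap) weights vary continuously.

```python
import random

def random_weighting_weights(n, rng=random):
    # n uniform draws normalized to sum to one (GetWeights_Random_Weighting)
    w = [rng.random() for _ in range(n)]
    s = sum(w)
    return [x / s for x in w]

def basic_bootstrap_weights(n, rng=random):
    # multinomial resampling counts divided by n (GetWeights_Basic_Bootstrap)
    counts = [0] * n
    for _ in range(n):
        counts[rng.randrange(n)] += 1
    return [c / n for c in counts]

random.seed(0)
for gen in (random_weighting_weights, basic_bootstrap_weights):
    w = gen(8)
    print(round(sum(w), 12), min(w) >= 0)   # each scheme yields a probability vector
```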
3.4.1 Simulation Results

The simulation experiments were done with the above three algorithms, and the statistics for them are listed in TABLE 3.4 - 3.6. The results are given in terms of percentages. For instance, in TABLE 3.4, the mean and median of the 200 estimates given by the Basic Bootstrap algorithm are 31.83517% and 32.58316% for the class pair (A, E). In TABLE 3.5, for the same class pair, the mean and median of the 200 estimates given by the Bayesian Bootstrap algorithm are 29.67647% and 30.61675%. In TABLE 3.6, the mean and median given by the Random Weighting algorithm are 27.34608% and 28.27401%. Comparing these estimates to the corresponding estimate values given by the General approach, we do have less biased estimates. One noticeable thing is that for the class pair (A, B), the 80% quantiles of all three algorithms are above the theoretical value of the Bhattacharyya bound, while the 90% quantile of the General approach is still under the theoretical value.
TABLE 3.4 Estimates of Bhattacharyya bounds with the Basic Bootstrap (%)

CLASS PAIR          (A, B)     (A, C)     (A, D)     (A, E)     (A, F)     (A, G)
Theoretical Value   46.9707    43.5767    42.4804    39.627     36.5808    32.8766
Mean                48.33761   42.15751   40.52867   31.8351    18.51542   12.6412
STD                 5.942545   8.393123   8.835997   10.20969   14.41235   8.457236
Maximum             58.67531   58.84465   58.44089   55.32598   49.51755   37.1651
90%                 55.50185   52.47521   51.50358   44.83127   37.95566   24.76845
80%                 53.55391   49.2396    48.08842   40.64219   33.08389   19.96274
Median              49.89679   42.7631    41.23105   32.58316   19.41409   11.5166
[the 20%, 10% and minimum rows: entries not recoverable from this copy]

TABLE 3.5 Estimates of Bhattacharyya bounds with the Bayesian Bootstrap (%)

CLASS PAIR          (A, B)    (A, C)    (A, D)    (A, E)    (A, F)    (A, G)
Theoretical Value   46.9707   43.5767   42.4804   39.627    36.5808   32.8766
[rows for the means, standard deviations, maximum, 90%, 80%, median, 20%, 10% and minimum: entries not recoverable from this copy]

TABLE 3.6 Estimates of Bhattacharyya bounds with the Random Weighting Method (%)

CLASS PAIR          (A, B)     (A, C)     (A, D)     (A, E)     (A, F)     (A, G)
Theoretical Value   46.9707    43.5767    42.4804    39.627     36.5808    32.8766
Mean                42.3308    36.65799   35.1367    27.34608   20.44759   10.71386
STD                 5.317276   7.391367   7.875731   8.917867   8.771155   7.196015
Maximum             51.68313   51.24357   51.34823   46.67435   42.29843   32.54132
90%                 48.64045   45.7033    44.87394   38.15151   32.52625   20.85488
80%                 47.11296   43.09073   41.96743   34.85278   27.6164    17.18219
[the median, 20%, 10% and minimum rows: entries not recoverable from this copy]
As can be seen from the tables, the bootstrap techniques do work in this case. On average, the means of the 200 estimates for all six class pairs improve to some degree. As can be seen, the percentage of the estimates with values over the theoretical value also increased. We note, however, that there are a few disadvantages to the three algorithms. The standard deviations of the estimates are larger than those of the General approach. In the experiment of the basic bootstrap algorithm with the class pair (A, F), about 20% of the estimates are even below zero, which is quite unacceptable. Of course, a restriction could be added to the algorithm to discard negative estimate values. A simple way of doing this is to just reject an estimate when a negative estimate value is reported, and to re-draw the bootstrap samples and calculate the estimate again. This procedure would be repeated until a nonnegative estimate is produced. Comparatively, the Bayesian bootstrap and random weighting method algorithms have smaller standard deviations and no negative estimate values.
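The reject-and-redraw restriction described above can be wrapped around any of the estimators. The estimator below is a toy stand-in (a Gaussian that occasionally produces negative values), used only to show the control flow; the max_tries guard is an added safeguard not mentioned in the text.

```python
import random

def nonnegative_estimate(estimate_once, max_tries=100, rng=random):
    # Re-run the whole bootstrap estimate until a nonnegative value is produced,
    # as suggested for the direct bias correction schemes.
    for _ in range(max_tries):
        e = estimate_once(rng)
        if e >= 0.0:
            return e
    raise RuntimeError("no nonnegative estimate within max_tries")

# stand-in estimator: bias-corrected values that are occasionally negative
def toy_estimator(rng):
    return rng.gauss(0.2, 0.15)

random.seed(0)
vals = [nonnegative_estimate(toy_estimator) for _ in range(200)]
print(min(vals) >= 0.0)   # True: negative draws were rejected and redrawn
```

Note that this truncation introduces a small upward bias of its own, which is one reason the distance bias adjustment schemes of the next section are attractive.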
3.5 The Bootstrap Estimators - Distance Bias Adjustment

As the Bhattacharyya bound ε_u is an exponential function of the Bhattacharyya distance μ(1/2), ε_u = (1/2) e^{−μ(1/2)}, it is not difficult to rewrite it in terms of the Taylor series, expanding ε_u as

\varepsilon_u = \frac{1}{2} e^{-\mu(1/2)} = \varepsilon_{u0}\, e^{-\Delta\mu} = \varepsilon_{u0} \left( 1 - \Delta\mu + \frac{\Delta\mu^2}{2} - \cdots \right), \qquad (3.20)

where ε_{u0} = (1/2) e^{−μ_0(1/2)} and Δμ = μ − μ_0. Suppose ε_u is the theoretical value of the Bhattacharyya bound and ε_{u0} is an estimated value; then (3.20) provides a way of representing ε_u in terms of ε_{u0} if the difference Δμ = μ − μ_0 is known. Fortunately, the difference Δμ = μ − μ_0 is just the bias of the Bhattacharyya distance and can be estimated by using the bootstrap technique. Thus the Distance Bias Adjustment schemes for the Bhattacharyya bound estimate can be effectively utilized. After obtaining the estimate of the Bhattacharyya bound ε̂_u and the bias of the Bhattacharyya distance Δμ̂, a new estimate of the bound is obtained by adjusting ε̂_u with Δμ̂ as follows:

\hat{\varepsilon}_u^{\,adj} \approx \hat{\varepsilon}_u \left( 1 - \Delta\hat{\mu} + \frac{\Delta\hat{\mu}^2}{2} \right). \qquad (3.21)

In (3.21), the approximation utilizes only up to the second order terms of the Taylor series. Observe that although it is feasible to use more terms of the Taylor series expansion, as the order of a term increases, the contribution of the term decreases. Therefore, adding more terms to (3.21) will not generally affect the final estimate much.
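The quality of the second-order truncation in (3.21) is easy to check numerically: for a small distance bias the factor 1 − Δμ + Δμ²/2 is already very close to the exact e^{−Δμ}, which is why higher-order terms change the final estimate little.

```python
import math

def adjustment_factor(dmu, order=2):
    # truncated Taylor series of exp(-dmu), as used in (3.21)
    f, term = 1.0, 1.0
    for k in range(1, order + 1):
        term *= -dmu / k
        f += term
    return f

# for a small distance bias the second-order factor is already close to exact
dmu = 0.1
approx = adjustment_factor(dmu)   # 1 - 0.1 + 0.005 = 0.905
exact = math.exp(-dmu)            # about 0.904837
print(round(approx, 6), round(exact, 6), round(abs(approx - exact), 6))
```

The gap shrinks cubically in Δμ, so the truncation only matters when the estimated distance bias is large.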
It is still necessary to work out some details before a formal algorithm can be obtained for the Distance Bias Adjustment schemes. The first question to be answered is: what will be chosen as ε̂_u? There are two possible options, namely, to use either a General approach estimate, or a bootstrap estimate. The scheme will be called a G-based (General approach based) estimate if the General approach estimate is selected, and called a B-based (bootstrap based) estimate if a bootstrap estimate is selected. The second question is when the distance bias adjustment is performed. In a bootstrap procedure, B sets of bootstrap samples are repeatedly drawn, an estimate is calculated for each bootstrap sample, and then these estimates are averaged to obtain the final bootstrap estimate. So the adjustment can be done at each step of the calculation of the bootstrap samples, or after the averaging. The first case is referred to as the AFMS (Adjustment First and Mean Second) approach, and the second case is referred to as the MFAS (Mean First and Adjustment Second) approach. It is obvious that the B-based estimate will definitely be affected by the selection of the AFMS or MFAS strategy. The choice of AFMS or MFAS has an influence on the G-based estimate as well. Although the value of ε̂_u is not affected by the bootstrap samples, as Δμ̂ is subject to the selection of the AFMS or MFAS strategy, the results will be correspondingly different. Given below are the algorithms of the Distance Bias Adjustment schemes with AFMS and MFAS respectively.
3.5.1 AFMS (Adjustment First and Mean Second) Schemes

The main part of a generalized algorithm with the B-based AFMS strategy is listed below.
Algorithm 3.4 Distance Bias Adjustment - B-based AFMS
Input: (i) n, the size of the training sample of a class;
(ii) x[1,1], x[2,1], ..., x[n,1], the training sample data of the first class;
(iii) x[1,2], x[2,2], ..., x[n,2], the training sample data of the second class;
(iv) B, the number of bootstrap resampling repetitions.
Output: The Bhattacharyya bound estimate.
Method
BEGIN
    μ0 = CalculateDistance(x[1,1], ..., x[n,1], x[1,2], ..., x[n,2], general)
    E0 = 0.5 × EXP(−μ0)
    E = 0
    For (i = 1 to B)
        μ = CalculateDistance(x[1,1], ..., x[n,1], x[1,2], ..., x[n,2], bootstrap)
        Δμ = μ − μ0
        E = E + (2 × E0 − 0.5 × EXP(−μ)) × (1 − Δμ + 0.5 × Δμ²)
    End-For
    E = E / B
    Return E
END-MAIN
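A compact executable sketch of the B-based AFMS scheme might look as follows. To stay short it uses one-dimensional normal classes (so (3.14) reduces to a scalar formula) and Bayesian-bootstrap weights; the samples are randomly generated stand-ins rather than Data Set 1.

```python
import math, random

def mu_half(sample1, sample2, w1, w2):
    # weighted ML estimates plugged into the one-dimensional form of (3.14)
    def est(s, w):
        m = sum(wi*x for wi, x in zip(w, s))
        v = sum(wi*(x - m)**2 for wi, x in zip(w, s))
        return m, v
    m1, v1 = est(sample1, w1)
    m2, v2 = est(sample2, w2)
    va = (v1 + v2) / 2
    return 0.125*(m2 - m1)**2/va + 0.5*math.log(va/math.sqrt(v1*v2))

def bayes_weights(n, rng):
    # spacings of sorted uniforms, as in GetWeights_Bayesian_Bootstrap
    u = sorted(rng.random() for _ in range(n - 1)) + [1.0]
    return [u[0]] + [u[j] - u[j-1] for j in range(1, n)]

def afms_b_based(sample1, sample2, B=200, rng=random):
    n1, n2 = len(sample1), len(sample2)
    mu0 = mu_half(sample1, sample2, [1/n1]*n1, [1/n2]*n2)
    e0 = 0.5*math.exp(-mu0)
    total = 0.0
    for _ in range(B):
        mu = mu_half(sample1, sample2,
                     bayes_weights(n1, rng), bayes_weights(n2, rng))
        dmu = mu - mu0
        # per-draw bias-corrected bound times the distance adjustment factor
        total += (2*e0 - 0.5*math.exp(-mu)) * (1 - dmu + 0.5*dmu*dmu)
    return total / B

random.seed(0)
s1 = [random.gauss(0.0, 1.0) for _ in range(8)]
s2 = [random.gauss(0.5, 1.0) for _ in range(8)]
print(round(afms_b_based(s1, s2), 4))
```

Replacing the per-draw factor (2 × E0 − 0.5 × EXP(−μ)) by E0 in the loop turns this into the G-based variant, exactly as described in the text.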
The CalculateDistance procedure is exactly the same as that in Algorithm 3.2. Therefore, the appropriate GetWeights procedure is also required in this algorithm. Of course, we have to use different GetWeights procedures to carry out either the basic bootstrap, Bayesian bootstrap or random weighting method scheme respectively. As each GetWeights procedure for the three bootstrap schemes has already been presented in Section 3.4, we omit the discussion of their details here.

In the algorithm above, the factor (2 × E0 − 0.5 × EXP(−μ)) is the bootstrap Bhattacharyya bound estimate for the currently drawn bootstrap sample, which is multiplied by the distance bias adjustment factor (1 − Δμ + 0.5 × Δμ²). If (2 × E0 − 0.5 × EXP(−μ)) is replaced by E0, and nothing else is changed, it can be seen that the algorithm becomes a G-based algorithm. Hence, we will not give another description separately for the G-based AFMS algorithms. With the same training sample data, experiments have been done for the B-based and G-based AFMS algorithms with all three bootstrap schemes, namely, the basic bootstrap, the Bayesian bootstrap and the random weighting method. The statistics of the experiments are listed in TABLE 3.7 - 3.12.
TABLE 3.7 Estimates of Bhattacharyya bounds (%) Basic Bootstrap, B-based AFMS

CLASS PAIR          (A, B)    (A, C)    (A, D)    (A, E)    (A, F)    (A, G)
Theoretical Value   46.9707   43.5767   42.4804   39.627    36.5808   32.8766
[rows for the mean, STD, maximum, 90%, 80%, median, 20%, 10% and minimum: entries not recoverable from this copy]

TABLE 3.8 Estimates of Bhattacharyya bounds (%) Bayesian Bootstrap, B-based AFMS

CLASS PAIR          (A, B)    (A, C)    (A, D)    (A, E)    (A, F)    (A, G)
Theoretical Value   46.9707   43.5767   42.4804   39.627    36.5808   32.8766
[rows for the mean, STD, maximum, 90%, 80%, median, 20%, 10% and minimum: entries not recoverable from this copy]

TABLE 3.9 Estimates of Bhattacharyya bounds (%) Random Weighting Method, B-based AFMS

CLASS PAIR          (A, B)    (A, C)    (A, D)    (A, E)    (A, F)    (A, G)
Theoretical Value   46.9707   43.5767   42.4804   39.627    36.5808   32.8766
[rows for the mean, STD, maximum, 90%, 80%, median, 20%, 10% and minimum: entries not recoverable from this copy]

TABLE 3.7 - 3.9 list the results of the B-based AFMS schemes. The estimates given by the Basic Bootstrap algorithm with the B-based AFMS scheme are listed in TABLE 3.7. For instance, the mean and median of the 200 estimates given by this algorithm are 26.28996% and 25.24358% for the class pair (A, E). For the same class pair, the mean and median of the 200 estimates given by the Bayesian Bootstrap algorithm are 23.354% and 23.01215%, as listed in TABLE 3.8. In TABLE 3.9, for the same class pair, the mean and median are 25.13843% and 25.15333% respectively, as obtained with the Random Weighting algorithm. These indicate that the performances of the three bootstrap schemes are quite distinct. The basic bootstrap does not perform well with the class pairs (A, B), (A, C) and (A, D), while on average, the estimates of the basic bootstrap are higher than the General approach estimates with the class pairs (A, E), (A, F) and (A, G). It may also be noticed that the standard deviations of the basic bootstrap estimates are quite large. The cause of this is that some estimates are too high (even more than 100%), and so they are not reasonable. It is possible to reduce the standard deviations of the estimates of this scheme by putting a restriction on the maximum estimated value in the algorithm.

With the B-based AFMS strategy, the Bayesian bootstrap and random weighting method schemes do not improve the General approach estimate much, except that they have smaller estimate deviations.
TABLE 3.10 Estimates of Bhattacharyya bounds (%) Basic Bootstrap, G-based AFMS

CLASS PAIR          (A, B)    (A, C)    (A, D)    (A, E)    (A, F)    (A, G)
Theoretical Value   46.9707   43.5767   42.4804   39.627    36.5808   32.8766
[rows for the mean, STD, maximum, 90%, 80%, median, 20%, 10% and minimum: entries not recoverable from this copy]

TABLE 3.11 Estimates of Bhattacharyya bounds (%) Bayesian Bootstrap, G-based AFMS

CLASS PAIR          (A, B)    (A, C)    (A, D)    (A, E)    (A, F)    (A, G)
Theoretical Value   46.9707   43.5767   42.4804   39.627    36.5808   32.8766
[rows for the mean, STD, maximum, 90%, 80%, median, 20%, 10% and minimum: entries not recoverable from this copy]

TABLE 3.12 Estimates of Bhattacharyya bounds (%) Random Weighting Method, G-based AFMS

CLASS PAIR          (A, B)    (A, C)    (A, D)    (A, E)    (A, F)    (A, G)
Theoretical Value   46.9707   43.5767   42.4804   39.627    36.5808   32.8766
[rows for the mean, STD, maximum, 90%, 80%, median, 20%, 10% and minimum: entries not recoverable from this copy]

The results of the G-based AFMS schemes are listed in TABLE 3.10 - 3.12. In TABLE 3.10, for the class pair (A, E), the mean and median of the 200 estimates given by the Basic Bootstrap algorithm are 34.21707% and 31.64667% respectively. For the same class pair, TABLE 3.11 gives the mean 25.69247% and median 25.49169%, as estimated by the Bayesian Bootstrap algorithm. The corresponding mean and median given by the Random Weighting algorithm are 26.15482% and 26.20663%, listed in TABLE 3.12. Compared to the General approach, the estimates tend to improve with all of the bootstrap schemes. The standard deviations of the basic bootstrap scheme are still too large because of some extremely high estimate values, while the standard deviations of the other two bootstrap schemes are smaller than those of the General approach estimates. On average, however, the estimate bias of the Bayesian bootstrap and random weighting method is bigger than that of the basic bootstrap.
3.5.3 MFAS (Mean First and Adjustment Second) Schemes
The main part of the algorithm with the B-based MFAS strategy is presented below.
Algorithm 3.5 Distance Bias Adjustment - B-based MFAS
Input: (i) n, the size of the training sample of a class;
(ii) x[1, 1], x[2, 1], ..., x[n, 1], the training sample data of the first class;
(iii) x[1, 2], x[2, 2], ..., x[n, 2], the training sample data of the second class;
(iv) B, the repeated times of the bootstrap resampling.
Output: The Bhattacharyya bound estimate.
Method
BEGIN
w = CalculateDistance(x[1, 1], ..., x[n, 1], x[1, 2], ..., x[n, 2], general)
E_G = 0.5 x EXP(-w)
E = 0
p = 0
For (i = 1 to B)
  temp = CalculateDistance(x[1, 1], ..., x[n, 1], x[1, 2], ..., x[n, 2], bootstrap)
  p = p + temp
  E = E + 0.5 x EXP(-temp)
End-For
p = p / B
Δp = p - w
E = E / B
E = (2 x E_G - E) x (1 - Δp + 0.5 x Δp^2)
Return E
END-MAIN
Like Algorithm 3.4, the CalculateDistance procedure is the same as that in
Algorithm 3.2. It is also possible to carry out the algorithm with different bootstrap
schemes by selecting the proper GetWeights procedures respectively. The difference
between Algorithm 3.4 and Algorithm 3.5 is that in the latter, the bootstrap estimates of
the Bhattacharyya distance and bound are accumulated separately each time a bootstrap
sample is drawn. The averages of these are then used to obtain the Bhattacharyya bound
estimate (2 x E_G - E) and the estimate of the distance bias adjustment factor
(1 - Δp + 0.5 x Δp^2). The final estimate is given by multiplying the two estimates. To get
a G-based MFAS algorithm, we need only replace (2 x E_G - E) with E_G, and omit
E = E + 0.5 x EXP(-temp) and the related operations in Algorithm 3.5. Due to the obvious
similarities, we feel that it is not necessary to present the formal code for the G-based
MFAS algorithm.
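As a concrete illustration, the B-based MFAS scheme can be sketched in Python for one-dimensional data. The Gaussian plug-in distance below is a hypothetical stand-in for the thesis's CalculateDistance procedure (Algorithm 3.2 covers the general multivariate case), and the function names and univariate setting are assumptions for illustration only.

```python
import math
import random

def bhattacharyya_distance(a, b):
    # Gaussian plug-in Bhattacharyya distance between two univariate
    # samples; a hypothetical stand-in for CalculateDistance.
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / len(a)
    vb = sum((x - mb) ** 2 for x in b) / len(b)
    return (0.25 * (ma - mb) ** 2 / (va + vb)
            + 0.5 * math.log((va + vb) / (2.0 * math.sqrt(va * vb))))

def mfas_bound_estimate(x1, x2, B=200, rng=None):
    # B-based MFAS: average the bootstrap distances and bounds first,
    # then apply the adjustment factor (1 - dp + 0.5 * dp**2).
    rng = rng or random.Random(0)
    w = bhattacharyya_distance(x1, x2)
    e_g = 0.5 * math.exp(-w)               # General-approach bound E_G
    p_sum = e_sum = 0.0
    for _ in range(B):
        b1 = [x1[rng.randrange(len(x1))] for _ in x1]
        b2 = [x2[rng.randrange(len(x2))] for _ in x2]
        d = bhattacharyya_distance(b1, b2)
        p_sum += d
        e_sum += 0.5 * math.exp(-d)
    dp = p_sum / B - w                     # estimated distance bias
    e = e_sum / B
    return (2.0 * e_g - e) * (1.0 - dp + 0.5 * dp ** 2)
```

Note that the adjustment factor (1 - Δp + 0.5 x Δp^2) matches the second-order Taylor expansion of EXP(-Δp) around zero, which is consistent with its role as a correction for the distance bias Δp.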
The experiment statistics of the B-based/G-based MFAS algorithms are given in
TABLE 3.13 - 3.18 for the three bootstrap schemes: the basic bootstrap, the Bayesian
bootstrap and the random weighting method. TABLE 3.13 contains the statistics for the
basic bootstrap algorithm with the B-based MFAS scheme, including the mean and
median of 17.65553% and 17.1264% respectively for the class pair (A, E). In TABLE
3.14, the mean and median for the same class pair are 21.73415% and 21.80673%, which
are the statistics of the Bayesian bootstrap algorithm with the B-based MFAS scheme. In
TABLE 3.15, the random weighting algorithm provided the mean and median of
24.55108% and 24.55135% respectively for the same class pair.
TABLE 3.13 Estimates of Bhattacharyya bounds (%) Basic Bootstrap, B-based MFAS
[Columns: class pair, theoretical value, mean, STD, maximum, 90%, 80%, median, 20%, 10%, minimum; class pairs (A, B) through (A, G).]

TABLE 3.14 Estimates of Bhattacharyya bounds (%) Bayesian Bootstrap, B-based MFAS
[Same columns and class pairs as TABLE 3.13.]

TABLE 3.15 Estimates of Bhattacharyya bounds (%) Random Weighting Method, B-based MFAS
[Same columns and class pairs as TABLE 3.13.]
TABLE 3.16 - 3.18 give the statistics of the G-based MFAS schemes. The
statistics of the basic bootstrap algorithm are listed in TABLE 3.16, which gives, for
instance, a mean and median of 22.18424% and 21.58687% respectively for the class
pair (A, E). The mean and median of the Bayesian bootstrap algorithm for the same class
pair are 25.03786% and 24.84284%, listed in TABLE 3.17. In TABLE 3.18, the mean of
the random weighting algorithm for the same class pair is 26.24318%.
TABLE 3.16 Estimates of Bhattacharyya bounds (%) Basic Bootstrap, G-based MFAS
[Columns: class pair, theoretical value, mean, STD, maximum, 90%, 80%, median, 20%, 10%, minimum; class pairs (A, B) through (A, G).]

TABLE 3.17 Estimates of Bhattacharyya bounds (%) Bayesian Bootstrap, G-based MFAS
[Same columns and class pairs as TABLE 3.16.]

TABLE 3.18 Estimates of Bhattacharyya bounds (%) Random Weighting Method, G-based MFAS
[Same columns and class pairs as TABLE 3.16.]
In conclusion, both the B-based and G-based MFAS algorithms have smaller
estimate deviations, and tend to under-estimate the Bhattacharyya bound. Generally, they
do not improve the estimate bias.

In this chapter, which mainly focused on the problem of estimating the
Bhattacharyya bound for error analysis in classification, we showed both theoretically and
experimentally that the estimate of the Bhattacharyya bound is biased. The experiments
also showed that the larger the distance between two classes, the bigger the bias.

However, bootstrap techniques provide a formal solution by which we can reduce
the estimate bias. Two kinds of bootstrap schemes were introduced to correct the estimate
bias: one is the direct bias correction, and the other, the distance bias adjustment. The
direct bias correction schemes are effective in reducing the estimate bias with all three
bootstrap schemes: the basic bootstrap, the Bayesian bootstrap and the random weighting
method. The problems with these schemes are that the standard deviations of the
estimates are big and some estimates are bad, especially in the case of the basic bootstrap
scheme.
The distance bias adjustment schemes come with different strategies depending on
the choice of the base estimate and the step at which the distance bias adjustment is
performed. The experiment results showed that G-based algorithms are generally better
than B-based ones. Although the MFAS-related algorithms have smaller estimate
deviations, they do not significantly reduce the estimate bias. The AFMS-related
algorithms have the advantage of yielding good bias correction for the more distant class
pairs, especially with the Basic Bootstrap scheme. However, the Basic Bootstrap scheme
with the AFMS algorithm produces very large estimate deviations.
In summary, we conclude that the bootstrap technique is useful for improving the
properties of the estimate of the Bhattacharyya bound. We have proposed a direct bias
correction scheme to estimate the Bhattacharyya bound of a class pair with a smaller
distance. The basic bootstrap scheme with the G-based AFMS strategy has been proposed
to estimate the Bhattacharyya bound of a class pair with a larger distance.
4.1 Introduction
In the previous chapter, the estimation problem of the Bhattacharyya bound was
discussed. This bound is only an upper bound on the error rate of a classifier; it is not even
an optimized bound. As the error rate is the crucial measurement of a classifier's
performance, it is preferable to know this error rate, if it can be computed. However, in
most cases, it is very hard to calculate the error rate, even when the distributions of the
classes are known. Hence, estimating the error is an essential problem in statistical
pattern recognition.
Unfortunately, the approaches for estimating the classifier error rate are limited
because of the limitation of both training and testing samples. The two main traditional
methods are the Apparent Error Estimator and the Cross-validation Estimator (Leave-
one-out). The former apparently under-estimates the error rate [Ef86], while the latter is
much better than the first. New studies done in the last two decades based on Efron's
bootstrap technique suggested that it could be used to improve the Cross-validation
Estimator [Ef83]. It was argued that the bootstrap estimator had a much lower variation,
although its bias is larger than that of the Cross-validation scheme. Some alternatives to
the bootstrap approaches, such as the E0 estimator, the 0.632 estimator, etc., were also
discussed in Efron's paper. It pointed out that the 0.632 estimator was a clear winner of
the sampling experiments.
CHAPTER 4: BOOTSTRAP BASED ERROR ESTIMATES 74
After this, a comparative study on these and other methods, including the
Apparent error, Cross-validation, MC and convex bootstrap estimators, was done by
Chernick, Murthy and Nealy [CMN85], [CMN86]. Their experiments were carried
out with a linear classifier and the results also suggested that the 0.632 estimator was
better than the others. Jain, Dubes, and Chen also compared the bootstrap estimator with
the Cross-validation and Apparent error estimators [JDC87]. Their experiments were carried
out with 1-NN and Quadratic classifiers. Their work again supported the conclusions that
bootstrap estimators have a smaller standard deviation than the Cross-validation has, and
that the 0.632 estimator has a better performance than the other estimators. Davison and
Hall discussed the theoretical aspects of the bootstrap and the Cross-validation estimators
[DH92]. They showed that both the bootstrap and the Cross-validation have a bias of
order n^(-1) if the populations are fixed, and only the bias of the bootstrap will increase to
order n^(-1/2) when the populations converge towards one another at the rate of n^(-1/2). To
improve the bias of the bootstrap estimator, another modified bootstrap estimator, the
Leave-one-out bootstrap, was introduced by Efron and Tibshirani [ET97]; it is a
combination of the Cross-validation and bootstrap estimators. Associated with the 0.632
method, the 0.632+ (Leave-one-out) method performed well in the sampling
experiments.
In this chapter, the aforementioned error estimating methods will first be
discussed, except for the MC and Convex bootstrap. These are, in some ways, similar to
the Leave-one-out bootstrap and other methods that will be introduced in this chapter.
The detailed descriptions of the algorithms for each error rate estimator, including the
Apparent error estimator, Cross-validation or Leave-one-out, basic bootstrap, the E0 error
estimator, and two 0.632 error estimators, will be given in the second section. Results of a
series of simulations of these estimators will be illustrated and discussed in the third
section.

In the latter parts of this chapter, some new algorithms will be introduced to
estimate the error rate. These algorithms are mainly based on the ideas of 'local mean'
and 'pseudo-sample data', which are related to the idea of the SIMDAT algorithm
discussed in Chapter 2. Therefore, such an algorithm is referred to as the Pseudo-sample
method, as opposed to the previous algorithms that only use the original sample data to
do calculations, which are referred to as the Real-sample methods. The pseudo-sample
method is the topic of the fourth section.
4.2 Algorithms for Error Rate Estimation
Using the notation given in Chapter 3, let

X = {(x_1, ω_1), (x_2, ω_2), ..., (x_n, ω_n)}    (4.1)

be a training sample, q(x, X) be a decision rule or classifier based on the training sample
X, and Q[ω, q(x, X)] be an error indicator,

Q[ω, q(x, X)] = 0 if q(x, X) = ω, and 1 otherwise.    (4.2)

Thus, the error rate of classifier q(x, X) can be defined as

Err = E{Q[ω, q(x, X)]},

where (x, ω) is a random vector independent of X. In general, the quantity Err is
unknown, and our task is to obtain an estimate for it.
As mentioned above, there are various algorithms available to estimate the error
rate of a classifier. Below are the detailed descriptions of these algorithms.
4.2.1 Apparent Error
The apparent error rate estimator uses the training samples themselves as the
testing samples to estimate the error rate. With the training sample given in (4.1), the
apparent error estimator (Err_app) is

Err_app = (1/n) Σ_{j=1}^{n} Q[ω_j, q(x_j, X)].

A generalized algorithm for Err_app can be described as follows.
Algorithm 4.1 Apparent Error Estimator
Input: (i) c, the number of classes;
(ii) n_i, the size of the training sample of class i;
(iii) x[1, 1], ..., x[n_1, 1], x[1, 2], ..., x[n_c, c], the training sample data.
Output: The error estimate.
Method
BEGIN
BuildClassifier()
Error = 0
Total = 0
For i = 1 to c do
  Total = Total + n_i
  For j = 1 to n_i do
    Error = Error + ClassifyingError(x[j, i])
  End-For
End-For
Error = Error / Total
Return Error
END Apparent Error Estimator
The algorithm is very simple. The purpose of the BuildClassifier procedure in the
above algorithm is to construct the classifier based on the given training samples. The
algorithm for this procedure depends on the kind of classifier being used. If one wants to
use the k-NN classifier with the Euclidean distance, the BuildClassifier
procedure can be omitted. If one wants to use the Bayesian classifier, the task of the
BuildClassifier procedure is to calculate the means and covariances for each class, and
so on. The ClassifyingError procedure performs the function of evaluating the error
indicator given in (4.2). First, the ClassifyingError procedure uses the classifier
constructed by the BuildClassifier procedure to classify a training pattern x[j, i]. It
returns the value 0 if x[j, i] is classified to class i, and it returns the value 1 if the sample is
misclassified. As the details of the BuildClassifier and ClassifyingError procedures are
straightforward (depending on the discriminant function), the algorithms for these two
procedures are not included here.

It is obvious that the algorithm of the Apparent error estimator only substitutes the
training sample to calculate the error rate. Therefore, the estimated value of the error rate
will always be zero with probability 1 if the 1-NN classifier is used in this algorithm and
the feature vector follows a continuous distribution. It is thus clear that the Apparent
error estimator underestimates the error rate.
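This zero-apparent-error behaviour of the 1-NN classifier is easy to demonstrate. The following sketch mirrors Algorithm 4.1; the 1-D data and a 1-NN rule standing in for BuildClassifier and ClassifyingError are hypothetical illustrations, not the thesis's actual procedures.

```python
def one_nn_classify(train, x):
    # Hypothetical 1-NN classifier over (feature, label) pairs with an
    # absolute-difference (1-D Euclidean) distance.
    return min(train, key=lambda p: abs(p[0] - x))[1]

def apparent_error(train, classify=one_nn_classify):
    # Apparent error (Algorithm 4.1): the classifier is built on the
    # training sample and then tested on that same sample.
    wrong = sum(1 for x, label in train if classify(train, x) != label)
    return wrong / len(train)
```

Because every training pattern is its own nearest neighbour at distance zero, `apparent_error` returns 0.0 for any training set with distinct feature values, illustrating why the apparent error of 1-NN is degenerate.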
4.2.2 Cross-validation (Leave-one-out)

Let X_(i) = {(x_j, ω_j) : j ≠ i} denote the subset of the training sample
which does not contain the pattern (x_i, ω_i). The Cross-validation, or Leave-one-out, error
estimator is then

Err_cv = (1/n) Σ_{i=1}^{n} Q[ω_i, q(x_i, X_(i))].

A generalized algorithm for Err_cv is given below.
Algorithm 4.2 Leave-One-Out Error Estimator
Input: (i) c, the number of classes;
(ii) n_i, the size of the training sample of class i;
(iii) x[1, 1], ..., x[n_1, 1], x[1, 2], ..., x[n_c, c], the training sample data.
Output: The error estimate.
Method
BEGIN
Error = 0
Total = 0
For i = 1 to c do
  Total = Total + n_i
  For j = 1 to n_i do
    BuildClassifier(j, i)
    Error = Error + ClassifyingError(x[j, i])
  End-For
End-For
Error = Error / Total
Return Error
END Leave-One-Out Error Estimator
The algorithm is still simple and only slightly different from the Apparent error
algorithm. Here the BuildClassifier procedure is called n times, while in Algorithm 4.1
it is called only once. Every time the BuildClassifier procedure is invoked, two
parameters (j, i) are passed to it, while there is no parameter passed to it in Algorithm 4.1.
The parameters are used to indicate that the pattern x[j, i] is removed from the training
sample, and the remaining training patterns are used to construct the classifier, which
means that X_(j,i), not X, is used to build the classifier. Thus, the classifier will not have
x[j, i] participating when it is used to classify x[j, i]. By this strategy, the problem caused
by the Apparent error estimator is avoided.
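A minimal sketch of Algorithm 4.2 follows, again with a hypothetical 1-D 1-NN classifier standing in for the generic BuildClassifier and ClassifyingError procedures.

```python
def one_nn_classify(train, x):
    # Hypothetical 1-NN rule over (feature, label) pairs.
    return min(train, key=lambda p: abs(p[0] - x))[1]

def leave_one_out_error(train, classify=one_nn_classify):
    # Leave-one-out: pattern i is classified by a classifier built on the
    # remaining n - 1 patterns, so the test pattern never trains itself.
    wrong = 0
    for i, (x, label) in enumerate(train):
        rest = train[:i] + train[i + 1:]
        if classify(rest, x) != label:
            wrong += 1
    return wrong / len(train)
```

Unlike the apparent error, this estimate can be nonzero for 1-NN, because the tested pattern is excluded from the training set used to classify it.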
4.2.3 Basic Bootstrap
As the Apparent error Err_app under-estimates the true error Err, a positive bias
(PB) of Err_app can be defined as

PB(X, F) = Err - Err_app.

The expectation of this bias can be written as

Bias(F) = E[PB(X, F)] = E[Err - Err_app].    (4.6)

If Bias(F) were known, an optimized estimate of Err could be given as

Err_opt = Err_app + Bias(F).    (4.7)

In most cases the Bias is unknown and should be estimated from the training samples.
The Cross-validation algorithm mentioned above estimates the Bias with

Bias_cv = Err_cv - Err_app.

With the bootstrap technique, the Bias(F) may be estimated in several ways.
Suppose the observed patterns of the training sample X are X_1 = x_1, X_2 = x_2, ..., X_n = x_n.
Similar to what was done in the previous chapters, a bootstrap sample X* = {X*_1,
..., X*_n} may be drawn from the empirical distribution F̂. Thus a bootstrap estimate of
the Bias will be

Bias* = Σ_{j=1}^{n} (1/n - w*_j) Q[ω_j, q(x_j, X*)],

where w*_j is the proportion of the pattern x_j appearing in the bootstrap sample X*.
By repeatedly drawing the bootstrap samples B times, the bootstrap estimate of the Bias is

Bias^(boot) = (1/B) Σ_{b=1}^{B} Bias*_b.

Thus, the bootstrap estimate of the Err is

Err_boot = Err_app + Bias^(boot).    (4.11)
The bootstrap estimator discussed above is called the basic bootstrap estimator.
An algorithm for it is given below:
Algorithm 4.3 Basic Bootstrap Error Estimator
Input: (i) c, the number of classes;
(ii) n_i, the size of the training sample of class i;
(iii) x[1, 1], ..., x[n_1, 1], x[1, 2], ..., x[n_c, c], the training sample data;
(iv) B, the repeated times of drawing bootstrap samples.
Output: The error estimate.
Method
BEGIN
Total = 0
For i = 1 to c do
  Total = Total + n_i
End-For
Error_estimate = 0
For k = 1 to B do
  GetBootstrapSample()
  BuildClassifier(y[1, 1], y[2, 1], ..., y[n_c, c])
  Error = 0
  For i = 1 to c do
    For j = 1 to n_i do
      Error = Error + (1 / Total - w[j, i]) x ClassifyingError(x[j, i])
    End-For
  End-For
  Error_estimate = Error_estimate + Error
End-For
Error_estimate = Error_estimate / B
Error_estimate = Error_estimate + AppError()
Return Error_estimate
END Basic Bootstrap Error Estimator
Procedure GetBootstrapSample
Input: (i) c, the number of classes;
(ii) n_i, the size of the training sample of class i;
(iii) x[1, 1], ..., x[n_1, 1], x[1, 2], ..., x[n_c, c], the training sample data.
Output: (i) bootstrap sample y[1, 1], y[2, 1], ..., y[n_c, c];
(ii) weights w[1, 1], w[2, 1], ..., w[n_c, c].
Method
BEGIN
Total = 0
For i = 1 to c do
  Total = Total + n_i
  For j = 1 to n_i do
    w[j, i] = 0
  End-For
End-For
For i = 1 to c do
  For j = 1 to n_i do
    m = random integer in [1, n_i]
    y[j, i] = x[m, i]
    w[m, i] = w[m, i] + 1
  End-For
End-For
For i = 1 to c do
  For j = 1 to n_i do
    w[j, i] = w[j, i] / Total
  End-For
End-For
Return y[1, 1], y[2, 1], ..., y[n_c, c], w[1, 1], w[2, 1], ..., w[n_c, c]
END Procedure GetBootstrapSample
The parameters passed to the GetBootstrapSample procedure are given in the
descriptive part of the procedure, although they are omitted when the procedure is called
in the main body of the algorithm. The GetBootstrapSample procedure in this algorithm
performs two functions: it draws a bootstrap sample and it sets the weights for each pattern
of the training sample. The bootstrap sample is drawn separately from each class. The
reason for doing this is to avoid the bias of the bootstrap sample data. If all the training
samples from the different classes were put together to draw a bootstrap sample, the
bootstrap sample might contain no data from one or more classes. This would cause an
even more biased estimate, or might even produce a bootstrap sample which could not be
used to construct a classifier.
The weights calculated in the procedure are for the bootstrap estimate calculation.
One may argue that the GetWeights procedures given in Chapter 3 can be used as
substitutes for the GetBootstrapSample procedure, so that one could use the different
bootstrap schemes: the basic bootstrap, the Bayesian bootstrap and the random weighting
method. This is true in some cases. The key issue is the kind of classifier used. If the
Bayesian classifier is used, a GetWeights procedure could take the place of the
GetBootstrapSample procedure in the algorithm. Then the weights w[1, 1], w[2, 1], ...,
w[n_c, c] would be passed to the BuildClassifier procedure to construct the classifier.
However, a real bootstrap sample will be required if a k-NN classifier is used. It is not
difficult to see that the Bayesian bootstrap and the random weighting method cannot be
applied to a k-NN classifier.
The BuildClassifier procedure of this algorithm is the same as that for Algorithm
4.1, except that it uses the bootstrap sample to build the classifier. The AppError
procedure simply uses Algorithm 4.1 to calculate an apparent error estimate. Thus, the
parameters (the number of classes, the training sample sizes of the classes, and the
training sample data) have to be passed to the procedure, even though this has not been
indicated in the algorithm.
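The whole of Algorithm 4.3, including the per-class resampling of GetBootstrapSample and the (1/n - w*_j) weighting, can be sketched as follows. The 1-NN rule and the dictionary-based data layout are illustrative assumptions (the thesis's experiments use a 3-NN classifier precisely because the apparent error of 1-NN is degenerate), and the sketch assumes distinct feature values within each class.

```python
import random

def nn_classify(train, x):
    # Hypothetical 1-NN rule over (feature, label) pairs; it stands in
    # for the thesis's BuildClassifier / ClassifyingError procedures.
    return min(train, key=lambda p: abs(p[0] - x))[1]

def basic_bootstrap_error(train_by_class, B=100, rng=None):
    # Basic bootstrap (Algorithm 4.3): apparent error plus the averaged
    # bootstrap bias estimate, resampling each class separately.
    rng = rng or random.Random(0)
    flat = [(x, c) for c, xs in train_by_class.items() for x in xs]
    n = len(flat)
    app = sum(1 for x, c in flat if nn_classify(flat, x) != c) / n
    bias = 0.0
    for _ in range(B):
        boot, count = [], {}
        for c, xs in train_by_class.items():
            for _ in xs:                      # resample within the class
                pick = xs[rng.randrange(len(xs))]
                boot.append((pick, c))
                count[(pick, c)] = count.get((pick, c), 0) + 1
        diff = 0.0
        for x, c in flat:
            w = count.get((x, c), 0) / n      # proportion w*_j of pattern j
            if nn_classify(boot, x) != c:
                diff += 1.0 / n - w           # (1/n - w*_j) contribution
        bias += diff
    return app + bias / B
```

Resampling each class separately, as in GetBootstrapSample, guarantees that every bootstrap sample contains patterns of every class.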
4.2.4 E0 Estimator
The E0 estimator is another estimator based on the bootstrap technique. The
difference between the basic bootstrap and the E0 estimator is that the E0 estimator tests
only the training patterns not contained in a bootstrap sample, while the basic bootstrap
tests all the patterns of the training sample. To carry out an E0 estimate, after each
bootstrap sample is drawn, the error rate calculation has to be changed. Suppose B
bootstrap samples are drawn, the b-th bootstrap sample is denoted as X*b, and the set of
training patterns not in X*b is denoted as A_b. Then the E0 error rate estimate is

Err_E0 = [ Σ_{b=1}^{B} Σ_{(x_j, ω_j) ∈ A_b} Q[ω_j, q(x_j, X*b)] ] / [ Σ_{b=1}^{B} |A_b| ],

where |A_b| denotes the cardinality of the set A_b. The algorithm of the E0 estimator is
presented below.
Algorithm 4.4 E0 Error Estimator
Input: (i) c, the number of classes;
(ii) n_i, the size of the training sample of class i;
(iii) x[1, 1], ..., x[n_1, 1], x[1, 2], ..., x[n_c, c], the training sample data;
(iv) B, the repeated times of drawing bootstrap samples.
Output: The error estimate.
Method
BEGIN
Total = 0
Error = 0
For k = 1 to B do
  GetBootstrapSample()
  BuildClassifier(y[1, 1], y[2, 1], ..., y[n_c, c])
  For i = 1 to c do
    For j = 1 to n_i do
      If w[j, i] = 0 Then
        Error = Error + ClassifyingError(x[j, i])
        Total = Total + 1
      End-If
    End-For
  End-For
End-For
Error = Error / Total
Return Error
END E0 Error Estimator
The procedures used above are the same as those used in Algorithm 4.3,
and so they are not explained in any greater detail. The key point in this algorithm is to
check the weight of a training pattern before it is classified. Thus, all the training patterns
passed to the ClassifyingError procedure are excluded from the bootstrap sample. What
should also be pointed out is that this algorithm yields an error rate estimate directly,
without adding the AppError to it. In other words, the E0 estimator scheme is not an
algorithm based on bias correction.
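A sketch of the E0 scheme (Algorithm 4.4), under the same illustrative 1-NN assumption; only the patterns left out of each bootstrap sample are tested, and the pooled errors are divided by the total number of patterns tested.

```python
import random

def nn_classify(train, x):
    # Hypothetical 1-NN rule over (feature, label) pairs.
    return min(train, key=lambda p: abs(p[0] - x))[1]

def e0_error(train_by_class, B=100, rng=None):
    # E0 estimator (Algorithm 4.4): pool the errors of all training
    # patterns absent from each bootstrap sample, then divide by the
    # total number of patterns tested.
    rng = rng or random.Random(0)
    errors, tested = 0, 0
    for _ in range(B):
        boot, chosen = [], set()
        for c, xs in train_by_class.items():
            for _ in xs:
                k = rng.randrange(len(xs))
                boot.append((xs[k], c))
                chosen.add((c, k))
        for c, xs in train_by_class.items():
            for k, x in enumerate(xs):
                if (c, k) not in chosen:      # left out of this sample
                    tested += 1
                    if nn_classify(boot, x) != c:
                        errors += 1
    return errors / tested if tested else 0.0
```

Note the single pooled ratio: this is exactly what distinguishes E0 from the leave-one-out bootstrap of the next subsection, which averages per-pattern ratios instead.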
4.2.5 Leave-one-out Bootstrap
Compared to the Cross-validation, the bootstrap provides an error estimate with a
bigger downward bias (see [JDC87], [Ha86] and [ET97]). This is because the basic
bootstrap does not explicitly separate the test pattern from the training patterns which are
used to construct the classifier, as the Cross-validation does. So Efron and Tibshirani
developed a scheme called the Leave-one-out bootstrap to deal with this separation.

Recall that in the Cross-validation algorithm, the training pattern set X_(i) was used
to build the classifier when the test was done for the training pattern x_i. Thus, the testing
pattern was not a part of the pattern set used to construct the classifier. So we define an
empirical distribution F̂_(i) on the training sample set X_(i), which places mass 1/(n - 1)
on each pattern of X_(i). Using this, a bootstrap sample X*_(i) drawn from F̂_(i) does
not contain x_i. For each i, there could be an error estimator given by:

Err*_i = E*_(i){ Q[ω_i, q(x_i, X*_(i))] },    (4.14)

where E*_(i) denotes the expectation under the distribution F̂_(i). The error rate estimate of
the Leave-one-out bootstrap is:

Err_cv-boot = (1/n) Σ_{i=1}^{n} Err*_i.

In the real algorithm, the Err*_i in (4.14) can be obtained as follows: first,
the bootstrap samples are drawn as usual (say B bootstrap samples, X*1, X*2, ..., X*B).
Let N_i^b be the number of times the training pattern x_i appears in the b-th bootstrap
sample. We define

I_i^b = 1 if N_i^b = 0, and I_i^b = 0 otherwise.

Thus Err*_i can be given by:

Err*_i = [ Σ_{b=1}^{B} I_i^b Q[ω_i, q(x_i, X*b)] ] / [ Σ_{b=1}^{B} I_i^b ].

The algorithm of the Leave-one-out bootstrap is listed below.
Algorithm 4.5 Leave-one-out Bootstrap Error Estimator
Input: (i) c, the number of classes;
(ii) n_i, the size of the training sample of class i;
(iii) x[1, 1], ..., x[n_1, 1], x[1, 2], ..., x[n_c, c], the training sample data;
(iv) B, the repeated times of drawing bootstrap samples.
Output: The error estimate.
Method
BEGIN
Total = 0
For i = 1 to c do
  Total = Total + n_i
  For j = 1 to n_i do
    Total[j, i] = 0
    Error[j, i] = 0
  End-For
End-For
For k = 1 to B do
  GetBootstrapSample()
  BuildClassifier(y[1, 1], y[2, 1], ..., y[n_c, c])
  For i = 1 to c do
    For j = 1 to n_i do
      If w[j, i] = 0 Then
        Error[j, i] = Error[j, i] + ClassifyingError(x[j, i])
        Total[j, i] = Total[j, i] + 1
      End-If
    End-For
  End-For
End-For
Error = 0
For i = 1 to c do
  For j = 1 to n_i do
    Error = Error + Error[j, i] / Total[j, i]
  End-For
End-For
Error = Error / Total
Return Error
END Leave-one-out Bootstrap Error Estimator
The procedures used in this algorithm are the same as in Algorithm 4.4. It should
be pointed out that this algorithm uses the same pattern set as that in Algorithm 4.4 to
calculate the error rate. The difference between the two algorithms lies in the calculations
themselves. In Algorithm 4.4, the error rate estimate is counted as the ratio of the total
number of misclassified training patterns to the total number of training patterns
tested. In Algorithm 4.5, however, the misclassification ratio of each training pattern
over all the bootstrap samples is calculated first, so that the error rate estimate turns
out to be the average of the ratios over all the training patterns.
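The per-pattern averaging that distinguishes Algorithm 4.5 from Algorithm 4.4 can be sketched as follows; the 1-NN rule and data layout are again illustrative assumptions rather than the thesis's actual procedures.

```python
import random

def nn_classify(train, x):
    # Hypothetical 1-NN rule over (feature, label) pairs.
    return min(train, key=lambda p: abs(p[0] - x))[1]

def loo_bootstrap_error(train_by_class, B=100, rng=None):
    # Leave-one-out bootstrap (Algorithm 4.5): compute a misclassification
    # ratio per pattern over the bootstrap samples that omit it, then
    # average those per-pattern ratios.
    rng = rng or random.Random(0)
    stats = {(c, k): [0, 0]                   # (errors, times tested)
             for c, xs in train_by_class.items() for k in range(len(xs))}
    for _ in range(B):
        boot, chosen = [], set()
        for c, xs in train_by_class.items():
            for _ in xs:
                k = rng.randrange(len(xs))
                boot.append((xs[k], c))
                chosen.add((c, k))
        for c, xs in train_by_class.items():
            for k, x in enumerate(xs):
                if (c, k) not in chosen:      # pattern omitted: test it
                    stats[(c, k)][1] += 1
                    if nn_classify(boot, x) != c:
                        stats[(c, k)][0] += 1
    ratios = [e / t for e, t in stats.values() if t > 0]
    return sum(ratios) / len(ratios) if ratios else 0.0
```

Patterns that happen to appear in every bootstrap sample are skipped here, a guard the pseudocode leaves implicit for large B.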
4.2.6 0.632 Estimators
The motivation of the 0.632 estimators is not strong. However, simulation
results showed that the 0.632 estimators have good performance (see [Ef83], [CMN85],
[CMN86], and [ET97]). When a bootstrap sample of size m is drawn from a training
sample of size n, the probability of a training pattern being excluded from the bootstrap
sample is (1 - 1/n)^m if the training patterns from all the classes are put together to draw a
bootstrap sample. In general, since the value of m is set to n, as n goes to infinity,
(1 - 1/n)^n tends to e^(-1) ≈ 0.368. So the value 0.632 is considered to be the contribution rate of
the E0 estimator to the bias correction [Ef83]. Thus, the Bias defined in (4.6) can be
estimated by:

Bias^(0.632) = 0.632 x (Err_E0 - Err_app).

This gives the Err estimate

Err^(0.632) = Err_app + Bias^(0.632) = 0.632 x Err_E0 + 0.368 x Err_app.    (4.28)

As the Leave-one-out bootstrap estimator has the same property as the E0 estimator (i.e.
only the training patterns excluded from the bootstrap sample are counted to estimate the
error rate), an alternative is to use the Leave-one-out bootstrap estimator Err_cv-boot
instead of Err_E0, which yields the Err estimate

Err^(0.632+) = 0.632 x Err_cv-boot + 0.368 x Err_app.    (4.29)

It is obvious that the algorithm for Err^(0.632) is a straightforward application of the
algorithms for Err_E0 and Err_app, and formula (4.28). Similarly, the algorithm to yield
Err^(0.632+) is an application of the algorithms for Err_cv-boot and Err_app, and formula (4.29).
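The 0.632 combination itself is a one-line weighted average, shown here together with the exclusion probability (1 - 1/n)^n that motivates the weights; the function names are illustrative assumptions.

```python
import math

def err_632(app_err, other_err):
    # 0.632 rule: other_err is the E0 estimate (for 0.632) or the
    # leave-one-out bootstrap estimate (for 0.632+).
    return 0.632 * other_err + 0.368 * app_err

def exclusion_probability(n):
    # Probability that a fixed pattern is excluded from a bootstrap
    # sample of size n drawn from n patterns; tends to exp(-1) ~ 0.368.
    return (1.0 - 1.0 / n) ** n
```

Because the apparent error is downward-biased and the E0-style estimates are upward-biased, the fixed weights pull the combination toward the true error from both sides.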
4.3 Simulation Results
Experiments were carried out for all the algorithms discussed in the last section.
All of the experiments used Data Set 1 with training sample sizes 8 and 24 for each
class. As in Chapter 3, only two classes were involved in an experiment: one of them was
the class A, and the other was chosen from the rest.

As the 1-NN classifier is meaningless for the Apparent error estimator, the 3-NN
classifier was used in all the experiments. The true errors were obtained by using
independent testing samples of size 1,000 for each class. Therefore, the true errors are the
result of measuring 2,000 independent testing patterns. Each experiment was repeated
200 times. TABLE 4.1 and TABLE 4.2 give the results of the average out of 200 trials
with training sample sizes 8 and 24 for each class respectively.
TABLE 4.1 Error Rate Estimates of 3-NN classifiers (sample size 8)
[Rows: the true error and each estimator of Section 4.2 (Apparent Error, Leave-One-Out, Basic Bootstrap, E0, 0.632, Leave-One-Out Bootstrap, 0.632+); mean and STD over 200 trials for class pairs (A, B) through (A, G).]

TABLE 4.2 Error Rate Estimates of 3-NN classifiers (sample size 24)
[Same rows and class pairs as TABLE 4.1.]
The results of the experiments show that the bootstrap technique works well in this case. Compared to the Leave-one-out estimator, all the bootstrap-related algorithms have relatively smaller standard deviations, although some of them have larger biases than the Leave-one-out. The basic bootstrap algorithm improved on the Apparent error rate estimator, but it still tends to under-estimate the error rates, while the E0 and Leave-one-out bootstrap algorithms generally over-estimate the error rates. The 0.632+ algorithm did enhance the performance of the 0.632 algorithm, which confirms the importance of the Leave-one-out approach. Although the Leave-one-out algorithm appears attractive, always yielding the best performance with all the experiment class pairs, it will be seen later in this chapter that, from some standpoints, the bootstrap-related algorithms perform better than the Leave-one-out. It can also be seen that the results of these experiments are basically consistent with the research works reported in [Ef83], [CMN85], [Ef86], [CMN86], [JDC87], and [ET97].
4.4. Pseudo-Sample Algorithm
As pointed out at the beginning of the chapter, the algorithms previously mentioned use only the training patterns to estimate the error rate of the classifier. This means that the training patterns are used for both the construction of the classifier and for testing the classifier. Two lessons have been learned from the above results:

1. The training patterns used to test the classifier should be separated from those used to build the classifier. This is the key issue that makes the Leave-one-out error estimator perform well. This is also why the 0.632+ estimator performed better than the 0.632 estimator;

2. A larger training sample size would improve the error estimators. This can be seen by comparing the results in TABLE 4.1 and TABLE 4.2.
Separating the training patterns used for testing from those used for building the classifier can be done by merely applying the Cross-validation technique. However, in practice, the training sample size is limited. The separation of the two implies a reduction in the number of patterns used to build the classifier. Although there will generally not be a significant effect if only one training pattern is left out, this is not the case when the sample size is itself small. The examples of the 0.632 estimators have shown that a weak relationship between the patterns used for testing and those used for building the classifier will not make things worse. Therefore, it would be beneficial to devise a way to increase the training sample size and simultaneously retain the patterns used for testing the classifier in question.

This is what the SIMDAT algorithm given in Section 2.3.5 does. The algorithms provided here use a SIMDAT-like approach to generate pseudo-training patterns. Then the original and the pseudo-training patterns are used separately as the testing patterns and as the patterns to construct the classifier. To distinguish them from the algorithms mentioned in the last section, the algorithms discussed here are referred to as pseudo-sample algorithms.

The main idea in the SIMDAT algorithm is that of combining the weighted m-nearest neighbours of a sample point to construct a pseudo-sample point. Here, the same approach is used to construct pseudo-training patterns. There is no restriction on the method to be used for setting the weights of the m-nearest neighbours of a training pattern. A generalized algorithm to construct a pseudo-training pattern is given below:
Algorithm 4.6 Generating A Pseudo-Training Pattern
Procedure GetPseudoPattern I
Input: x[1], x[2], ..., x[m], the m-nearest neighbours of the training pattern x
Output: A pseudo-training pattern.
Method
BEGIN
    GetWeights(m)
    y = w[1] × x[1] + w[2] × x[2] + ... + w[m] × x[m]
    Return y
END Procedure GetPseudoPattern I
Observe that the m-nearest neighbours of the training pattern x include x itself. The GetWeights procedure above can be any one of the GetWeights procedures presented in Chapter 3. Of course, the GetWeights procedures for the Bayesian bootstrap or the random weighting method are preferred, because then the weights w[1], w[2], ..., w[m] are continuously distributed, and the pseudo-training pattern generated would not be identical to any pattern in the training sample set. The main idea is that the pseudo-training pattern generated should oscillate around the mean of the m-nearest neighbours. So another way to generate a pseudo-training pattern is to apply the approach represented in the SIMDAT algorithm. The following algorithm uses this approach:
Algorithm 4.7 Generating A Pseudo-Training Pattern
Procedure GetPseudoPattern II
Input: x[1], x[2], ..., x[m], the m-nearest neighbours of the training pattern x
Output: A pseudo-training pattern.
Method
BEGIN
    Mean = 0
    For i = 1 to m do
        w[i] = RandomVariableGenerator()
        Mean = Mean + x[i]
    End-For
    Mean = Mean / m
    v = w[1] × (x[1] - Mean) + w[2] × (x[2] - Mean) + ... + w[m] × (x[m] - Mean)
    y = v + Mean
    Return y
END Procedure GetPseudoPattern II
The differences between Algorithm 4.7 and Algorithm 4.6 are that the weights in Algorithm 4.7 are not standardized, and that the weights are assigned to the centralized feature vectors (x[i] - Mean). After the linear combination of the centralized feature vectors has been obtained, the mean of the m-nearest neighbours is added back to obtain the pseudo-training pattern. Thus, the RandomVariableGenerator procedure above should generate a random variable with zero as its expectation. It is easy to see that Algorithm 4.7 uses the approach used in the SIMDAT algorithm to obtain pseudo-patterns.
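As an illustration, the two generators can be sketched as follows. This is a hedged sketch, not the thesis's code: NumPy's Dirichlet sampler stands in for a continuously distributed GetWeights procedure, and the zero-mean uniform bounds follow the SIMDAT-style choice quoted later in this section.

```python
import numpy as np

def pseudo_pattern_weighted(neighbours, rng):
    """Algorithm 4.6 style: convex combination of the m neighbours with
    continuously distributed weights that sum to one."""
    x = np.asarray(neighbours, dtype=float)
    w = rng.dirichlet(np.ones(len(x)))   # Bayesian-bootstrap-like weights
    return w @ x

def pseudo_pattern_simdat(neighbours, rng):
    """Algorithm 4.7 style: zero-mean random weights applied to the
    centralized neighbours, then the local mean added back."""
    x = np.asarray(neighbours, dtype=float)
    m = len(x)
    mean = x.mean(axis=0)
    half = np.sqrt(3.0 * (m - 1)) / m    # weights ~ U(-half, +half)
    w = rng.uniform(-half, half, size=m)
    return w @ (x - mean) + mean
```

Both generators oscillate around the local mean of the m neighbours; the first always stays inside their convex hull, while the second need not.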
After the pseudo-training patterns have been generated, they can be used in various ways: either as the testing patterns, or as the patterns used to construct the classifier itself. The scheme is called a Pseudo-Testing algorithm if the pseudo-training samples are used for testing purposes, and a Pseudo-Classifier algorithm if the pseudo-training samples are used to construct the classifier. A Pseudo-Testing algorithm is described below:
Algorithm 4.8 Pseudo-Testing
Input: (i) c, the number of classes;
       (ii) ni, the size of the training sample of class i;
       (iii) x[1, 1], ..., x[n1, 1], x[1, 2], ..., x[nc, c], the training sample data;
       (iv) B, the repeated times of drawing bootstrap samples;
       (v) m, the number of the nearest neighbours of a training pattern.
Output: The error estimate.
Method
BEGIN
    Total = 0
    Error = 0
    For i = 1 to c do
        For j = 1 to ni do
            GetNearestNeighboursOf(x[j, i], m)
        End-For
    End-For
    For k = 1 to B do
        GeneratePseudoSample()
        BuildClassifier(x[1, 1], x[2, 1], ..., x[nc, c])
        For i = 1 to c do
            For j = 1 to ni do
                Error = Error + ClassifyingError(y[j, i])
                Total = Total + 1
            End-For
        End-For
    End-For
    Error = Error / Total
    Return Error
END Pseudo-Testing
Procedure GeneratePseudoSample
Input: (i) c, the number of classes;
       (ii) ni, the size of the training sample of class i;
       (iii) x[1, 1], ..., x[n1, 1], x[1, 2], ..., x[nc, c], the training sample data.
Output: The pseudo-training patterns y[1, 1], ..., y[n1, 1], y[1, 2], ..., y[nc, c].
Method
BEGIN
    For i = 1 to c do
        For j = 1 to ni do
            k = random integer in [1, ni]
            y[j, i] = GetPseudoPattern(m-nearest neighbours of x[k, i])
        End-For
    End-For
END GeneratePseudoSample
In this algorithm, y[1, 1], y[2, 1], ..., y[nc, c] represent the pseudo-training samples generated. The pseudo-training samples have, not only in total but also for each class, the same size as the original training sample. After the classifier has been built with the original training sample, the pseudo-training sample is used to estimate the error rate of the classifier. The GetNearestNeighboursOf(x[j, i], m) procedure is used to calculate the m-nearest neighbours of the training pattern x[j, i]. This function basically calculates the Euclidean distances between x[j, i] and the other patterns in the same class, and selects the m patterns with the minimal distances. The GetPseudoPattern procedure, invoked in the GeneratePseudoSample procedure, can be either of those given in Algorithms 4.6 and 4.7. Unlike the SIMDAT algorithm, the GetPseudoPattern procedure does not generate a pseudo-training pattern for each pattern in the training sample. Instead, it uses the bootstrap technique to randomly select a training pattern first; then it uses the m-nearest neighbours of that pattern to generate a pseudo-training pattern.
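The sampling step just described can be sketched as follows. This is an illustrative Python rendition, not the thesis's code; Dirichlet weights stand in for a continuous GetWeights procedure, and distances are Euclidean as in the text.

```python
import numpy as np

def m_nearest(X, idx, m):
    """Indices of the m nearest neighbours of X[idx] within its own
    class (the pattern itself is included, as in the thesis)."""
    d = np.linalg.norm(X - X[idx], axis=1)
    return np.argsort(d)[:m]

def generate_pseudo_sample(class_X, m, rng):
    """One pseudo-pattern per original pattern and class: bootstrap-select
    a seed pattern, then combine its m nearest neighbours."""
    pseudo = []
    for X in class_X:                    # X holds the patterns of one class
        X = np.asarray(X, dtype=float)
        n = len(X)
        out = np.empty_like(X)
        for j in range(n):
            k = int(rng.integers(n))     # bootstrap selection of the seed
            nb = X[m_nearest(X, k, m)]
            w = rng.dirichlet(np.ones(m))
            out[j] = w @ nb              # GetPseudoPattern step
        pseudo.append(out)
    return pseudo
```

Because each pseudo-pattern is a convex combination of same-class neighbours, the generated sample keeps each class's size and stays inside its convex hull.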
Experiments on Algorithm 4.8 were done with the Data Set 1. Only two classes were involved in an experiment, each of them having a training sample of size eight. All of the experiments were carried out with a 3-NN classifier. Two different GetPseudoPattern procedures were used in the experiments: one of them used Algorithm 4.6 with the GetWeights procedure for the Bayesian bootstrap, and the other used Algorithm 4.7 with a RandomVariableGenerator procedure to generate U(-√(3(m - 1))/m, +√(3(m - 1))/m) distributed random variables. The other parameters used in the experiments were: B = 200, m = 3. TABLE 4.3 lists the statistics from the 200 trials for each experiment.
TABLE 4.3 Error Rate Estimates of the Pseudo-Testing algorithm
[means and standard deviations of the True error and of the SIMDAT-based and Bayesian-bootstrap-based Pseudo-Testing estimates for each class pair; the tabulated values are illegible in the scanned copy]
The results indicate that the estimates provided by the pseudo-testing algorithms are seriously biased downward. The cause of this is that the testing patterns are linear combinations of the m-nearest neighbours. As these linear combinations tend to the local mean of the patterns in a class, their characteristics are more pronounced, and easier to distinguish from each other. So the performance of the Pseudo-Testing algorithm is not satisfactory.
4.4.2 The Pseudo-Classifier Algorithm
The Pseudo-Classifier algorithm exchanges the position of the original training patterns with that of the pseudo-training patterns. The pseudo-training sample is used to build the classifier, and the original training sample is used to test the classifier error rate. The pseudo-classifier algorithm is presented here:
Algorithm 4.9 Pseudo-Classifier
Input: (i) c, the number of classes;
       (ii) ni, the size of the training sample of class i;
       (iii) x[1, 1], ..., x[n1, 1], x[1, 2], ..., x[nc, c], the training sample data;
       (iv) B, the repeated times of drawing bootstrap samples;
       (v) m, the number of the nearest neighbours of a training pattern.
Output: The error estimate.
Method
BEGIN
    Total = 0
    Error = 0
    For i = 1 to c do
        For j = 1 to ni do
            GetNearestNeighboursOf(x[j, i], m)
        End-For
    End-For
    For k = 1 to B do
        GeneratePseudoSample()
        BuildClassifier(y[1, 1], y[2, 1], ..., y[nc, c])
        For i = 1 to c do
            For j = 1 to ni do
                Error = Error + ClassifyingError(x[j, i])
                Total = Total + 1
            End-For
        End-For
    End-For
    Error = Error / Total
    Return Error
END Pseudo-Classifier
The algorithm above is almost the same as Algorithm 4.8, except that the parameters passed to the BuildClassifier and ClassifyingError procedures have changed. Although the change is small, the performance is greatly improved. TABLE 4.4 and TABLE 4.5 list the statistics of the experiments on Algorithm 4.9. The experiments were done with the Data Set 1.

All the experiment parameters are the same as those for the experiments on Algorithm 4.8, except that two training sample sizes, 8 and 24, are used here. Additional experiments were carried out with the GetWeights procedure for the random weighting method.
TABLE 4.4 Error Rate Estimates of the Pseudo-Classifier algorithm (sample size 8)
[means and standard deviations of the True error and of the SIMDAT, Bayesian bootstrap and Random Weighting Pseudo-Classifier estimates for class pairs (A, B) through (A, G); the tabulated values are illegible in the scanned copy]

TABLE 4.5 Error Rate Estimates of the Pseudo-Classifier algorithm (sample size 24)
[same layout as TABLE 4.4; the tabulated values are illegible in the scanned copy]

For convenience, a pseudo-classifier algorithm is referred to as a SIMDAT pseudo-classifier algorithm (SPC) if the SIMDAT-like algorithm was used to generate the pseudo-training sample; as a Bayesian Bootstrap pseudo-classifier algorithm (BBPC) if the Bayesian bootstrap is used to generate the pseudo-training sample; and as a Random Weighting pseudo-classifier algorithm (RWPC) if the random weighting method is used to generate the pseudo-training sample.
Although the SPC algorithm did not perform well enough in the experiments, the BBPC and RWPC performed better than the basic bootstrap. They have smaller biases and standard deviations. The experiments also show that, for the purpose of error rate estimation, the pseudo-training pattern generating algorithm, Algorithm 4.6, is better than Algorithm 4.7.
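The swap of roles described above can be sketched as follows; a 1-NN rule and a generic pseudo-sample generator are illustrative stand-ins for the 3-NN classifier and the generators used in the experiments.

```python
import numpy as np

def nn_classify(train_X, train_y, pattern):
    """Plain 1-NN rule (illustrative stand-in for the 3-NN classifier)."""
    d = np.linalg.norm(train_X - pattern, axis=1)
    return train_y[np.argmin(d)]

def pseudo_classifier_error(X, y, gen_pseudo, B, rng):
    """Pseudo-Classifier scheme: the classifier is built on each of the B
    pseudo-samples, and the original patterns are used for testing."""
    errors, total = 0, 0
    for _ in range(B):
        PX, Py = gen_pseudo(X, y, rng)   # pseudo-sample builds the classifier
        for p, label in zip(X, y):       # original patterns test it
            if nn_classify(PX, Py, p) != label:
                errors += 1
            total += 1
    return errors / total
```

Compared with the Pseudo-Testing sketch, only the roles of `(X, y)` and `(PX, Py)` in the inner loop are exchanged.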
4.4.3 Leave-One-Out Pseudo-Classifier Algorithm
A further improvement on the pseudo-classifier algorithm is to introduce the Leave-one-out approach. As we have already seen that the bootstrap algorithm is sharpened by the Leave-one-out, it is natural to think that the pseudo-classifier algorithm might benefit from it as well. The Leave-one-out pseudo-classifier algorithm (LPC) is given below.
Algorithm 4.10 Leave-One-Out Pseudo-Classifier
Input: (i) c, the number of classes;
       (ii) ni, the size of the training sample of class i;
       (iii) x[1, 1], ..., x[n1, 1], x[1, 2], ..., x[nc, c], the training sample data;
       (iv) B, the repeated times of drawing bootstrap samples;
       (v) m, the number of the nearest neighbours of a training pattern.
Output: The error estimate.
Method
BEGIN
    Total = 0
    Error = 0
    For i = 1 to c do
        For j = 1 to ni do
            GetNearestNeighboursOf(x[j, i], m)
        End-For
    End-For
    For k = 1 to B do
        For i = 1 to c do
            For j = 1 to ni do
                GeneratePseudoSample(x[j, i])
                BuildClassifier(y[1, 1], y[2, 1], ..., y[nc, c])
                Error = Error + ClassifyingError(x[j, i])
                Total = Total + 1
            End-For
        End-For
    End-For
    Error = Error / Total
    Return Error
END Leave-One-Out Pseudo-Classifier
Procedure GeneratePseudoSample
Input: (i) c, the number of classes;
       (ii) ni, the size of the training sample of class i;
       (iii) x[1, 1], ..., x[n1, 1], x[1, 2], ..., x[nc, c], the training sample data;
       (iv) x[q, p], the training pattern to be left out.
Output: The pseudo-training patterns y[1, 1], ..., y[n1, 1], y[1, 2], ..., y[nc, c].
Method
BEGIN
    For i = 1 to c do
        For j = 1 to ni do
            Do
                k = random integer in [1, ni]
            While (x[k, i] = x[q, p])
            y[j, i] = GetPseudoPattern(m-nearest neighbours of x[k, i])
        End-For
    End-For
END GeneratePseudoSample
The reader will observe that there are some changes in Algorithm 4.10 compared to Algorithm 4.9. First, the GeneratePseudoSample and BuildClassifier procedures have moved to the innermost For loop, so different pseudo-training samples are used to test different training patterns. Secondly, when the GeneratePseudoSample procedure is invoked, a pattern x[q, p] is passed to it. When executing the GeneratePseudoSample procedure, there is a While loop to ensure that the patterns drawn by the bootstrap sampling will exclude the pattern x[q, p]. This means that no pseudo-training pattern would be a linear combination of the m-nearest neighbours of x[q, p]. Although it is still possible for x[q, p] to be involved in the pseudo-training sample as one of the m-nearest neighbours of another pattern, the relationship between x[q, p] and the pseudo-training sample is relatively weak.
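The Do/While exclusion can be sketched in isolation; this minimal Python sketch mirrors the loop with zero-based indices.

```python
import numpy as np

def draw_excluding(n, q, rng):
    """Bootstrap index in [0, n) guaranteed to differ from the index q of
    the left-out pattern, as in the While loop of Algorithm 4.10."""
    k = int(rng.integers(n))
    while k == q:                        # redraw until a different pattern
        k = int(rng.integers(n))
    return k
```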
TABLE 4.6 and TABLE 4.7 list the statistics of the experiments using Algorithm 4.10. The experiments were done with the Data Set 1. All the parameters used are the same as those used for the experiments involving Algorithm 4.9.
TABLE 4.6 Error Rate Estimates of the Leave-One-Out Pseudo-Classifier Algorithm (sample size 8)
[means and standard deviations of the True error and of the SIMDAT, Bayesian bootstrap and Random Weighting variants for class pairs (A, B) through (A, G); the tabulated values are illegible in the scanned copy]

TABLE 4.7 Error Rate Estimates of the Leave-One-Out Pseudo-Classifier Algorithm (sample size 24)
[same layout as TABLE 4.6; the tabulated values are illegible in the scanned copy]

As in the previous section, a pseudo-classifier algorithm is referred to as a Leave-One-Out SIMDAT pseudo-classifier (CVSPC), Leave-One-Out Bayesian Bootstrap pseudo-classifier (CVBBPC), or Leave-One-Out Random Weighting pseudo-classifier (CVRWPC) algorithm according to which approach is used for generating the pseudo-training sample. Here, the 'CV' stands for Cross-Validation.

Although the CVSPC algorithm does not perform well here, the CVBBPC and CVRWPC algorithms achieved great progress compared with the BBPC and RWPC
algorithms. Thus, for instance, for class pair (A, E), the means of BBPC and RWPC over 200 trials are 14.96141% and 14.89687%, as given in TABLE 4.4, while those of CVBBPC and CVRWPC are both 19.78188%, as given in TABLE 4.6. As the mean of the true error rate is 23.54025%, it is obvious that CVBBPC and CVRWPC provided much better estimates than BBPC and RWPC did. This is true not only for class pair (A, E) but also for all of the other class pairs. Although they are not considered to be the best among all the algorithms discussed in this chapter, in the next section we will see that the CVBBPC and CVRWPC algorithms are definitely among the top-level performers.
4.5. Summary
In this chapter, two types of algorithms for error rate estimation have been discussed. Algorithms of the first type are referred to as real-sample algorithms because they only use the real training samples to estimate the error rate. As opposed to these, algorithms of the other type are referred to as pseudo-sample algorithms because they generate a set of pseudo-training samples to estimate the error rate.

The conclusions of the experiments on the algorithms are:

1) The apparent error estimator is obviously downward biased, and the bootstrap techniques are helpful for correcting the bias.

2) The Leave-one-out estimator, on average, has a smaller bias, while its standard deviation is larger than those related to the bootstrap techniques. A notable fact is that the Leave-one-out estimator seems robust for all of the class pairs involved in the experiments.
3) The bootstrap-related estimators have smaller standard deviations, and their biases are also smaller. However, it seems that their performance is affected by the change of the class pair.

4) The pseudo-training sample algorithms work well if the pseudo-training patterns are used to construct the classifier. To generate pseudo-training patterns, the Bayesian bootstrap or random weighting-like approaches are better than the SIMDAT-like approach.
According to the statistics on the averages and standard deviations of the experiments, it is hard to say which algorithm performs the best. We explain below an alternate method for comparing these algorithms, which may be of great interest.

In each experiment, an algorithm was repeatedly carried out for 200 trials; correspondingly, there were also 200 true errors obtained by using independent testing samples. We now examine whether the estimate given by an algorithm is close to the true error when a training sample is given. This would be an important measurement of the performance of an algorithm. For each of the algorithms, the absolute difference between its estimate and the true error was calculated for each trial. TABLE 4.8 and TABLE 4.9 give the statistics of the absolute difference over the 200 trials.
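The comparison reduces to a simple per-estimator statistic; a minimal sketch:

```python
import numpy as np

def abs_difference_stats(estimates, true_errors):
    """Mean and standard deviation of the per-trial absolute difference
    between an estimator's output and the independently measured true
    error, as tabulated in TABLE 4.8 and TABLE 4.9."""
    d = np.abs(np.asarray(estimates, dtype=float)
               - np.asarray(true_errors, dtype=float))
    return d.mean(), d.std()
```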
TABLE 4.8 Absolute difference between estimates and true errors (sample size 8)
[means and standard deviations of the absolute difference for the 0.632, 0.632+, Basic Bootstrap, E0, Leave-one-out, Leave-one-out Bootstrap, BBPC, RWPC, CVBBPC and CVRWPC estimators, per class pair; the tabulated values are illegible in the scanned copy]

TABLE 4.9 Absolute difference between estimates and true errors (sample size 24)
[same layout as TABLE 4.8; the tabulated values are illegible in the scanned copy]
Of course, it is understood that the smaller its absolute difference from the true error, the better the performance of an estimator. From this point of view, the Leave-one-out bootstrap algorithm is definitely better than the plain Leave-one-out. The same is almost true for the CVBBPC and CVRWPC algorithms. TABLE 4.10 and TABLE 4.11 rank the performances of the algorithms according to the mean values given in TABLE 4.8 and TABLE 4.9.
TABLE 4.10 Performance ranks according to the absolute difference from the true error (sample size 8)
[rankings of the ten estimators for each class pair; the tabulated values are illegible in the scanned copy]
The statistics above definitely convince us that the bootstrap-related algorithms are generally better than the Leave-one-out. They also suggest that the 0.632, 0.632+, CVBBPC or CVRWPC should be considered first if the training sample size is small, while the 0.632, 0.632+, E0 and Leave-One-Out should be considered if the training sample size is relatively large.
TABLE 4.11 Performance ranks according to the absolute difference from the true error (sample size 24)
[rankings of the ten estimators for each class pair; the tabulated values are illegible in the scanned copy]
CHAPTER 5
BOOTSTRAP-BASED CLASSIFIER DESIGN
5.1 Introduction
Both the Bhattacharyya bound and the error rate of a classifier discussed in the previous chapters involve similar problems: that of estimating an unknown quantity. In these problems, a question of bias correction arose, and the bootstrap techniques were applied to correct it. Although the details are different, the basic strategies are similar when the bootstrap technique is applied to this kind of problem: it involves sampling from the original training set, and using the bias deduced from the bootstrap sample to estimate the true bias. Generally, the bootstrap technique works well in such a case.

As opposed to this, the classifier design problem is a totally different issue, as it is not a problem of bias correction. What is needed here is a technique to sharpen the classifier within a given training sample. Therefore, the way to apply the bootstrap technique to the problem of classifier design is different from what was discussed before. Here, the research will focus on how to generate an artificial training sample set from the original one. The known work was done by Hamamoto, Uchimura and Tomita [HUT97], who proposed bootstrap schemes to achieve this. The details of their work make up the content of Section 5.2. In Section 5.3, some new schemes are introduced, which are mainly based on the pseudo-sample algorithm mentioned in Chapter 4. The experimental results obtained by using these schemes will be discussed in Section 5.4.
CHAPTER 5: BOOTSTRAP-BASED CLASSIFIER DESIGN 111
5.2 Previous Work
As mentioned above, the question adcûessed here is one of generating an artificial
training sample set when applying the bootstmp technique to the problem of classifier
design. Therefore. the schemes provided by Hamamoto CHUT971 are for generating the
synthetic training samples used to build the classifier. Below are the four algorithm
given in Hamamoto's paper.
Algorithm 5.1 Bootstrapping I
Input: (i) c, the number of classes;
       (ii) ni, the size of the training sample of class i;
       (iii) x[1, 1], ..., x[n1, 1], x[1, 2], ..., x[nc, c], the training sample data;
       (iv) m, the number of the nearest neighbours of a training pattern.
Output: y[1, 1], ..., y[n1, 1], y[1, 2], ..., y[nc, c], the artificial training sample data.
Method
BEGIN
    For i = 1 to c do
        For j = 1 to ni do
            GetNearestNeighboursOf(x[j, i], m)
        End-For
    End-For
    For i = 1 to c do
        For j = 1 to ni do
            k = random integer in [1, ni]
            y[j, i] = GenerateArtificialPattern(m-nearest neighbours of x[k, i])
        End-For
    End-For
    Return y[1, 1], ..., y[n1, 1], y[1, 2], ..., y[nc, c]
END Bootstrapping I
Procedure GenerateArtificialPattern
Input: x[1], x[2], ..., x[m], the m-nearest neighbours of the training pattern x
Output: An artificial training pattern.
Method
BEGIN
    GetWeights(m)
    y = w[1] × x[1] + w[2] × x[2] + ... + w[m] × x[m]
    Return y
END Procedure GenerateArtificialPattern
Procedure GetWeights
Input: m, the number of the neighbours of a training pattern.
Output: w[1], w[2], ..., w[m], the weights.
Method
BEGIN
    Sum = 0
    For j = 1 to m do
        w[j] = random uniform U[0, 1]
        Sum = Sum + w[j]
    End-For
    For j = 1 to m do
        w[j] = w[j] / Sum
    End-For
    Return w[1], w[2], ..., w[m]
END Procedure GetWeights
The GetNearestNeighboursOf procedure is used to obtain the m-nearest neighbours of a training pattern. Its task is to calculate the Euclidean distances between the given pattern and the others in the same class, sort the distances calculated, and collect the first m patterns having the shortest distances. Note that a pattern itself is included in its m-nearest neighbours. The main procedure in this algorithm is the GenerateArtificialPattern procedure, which provides the artificial training sample set used to construct the classifier. The key point of the algorithm is to use a weighted average of the m-nearest neighbours of a training pattern as an artificial training pattern. Observe that the
weights given by the GetWeights procedure are just those provided by the random weighting method.
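A compact rendition of Bootstrapping I for a single class can be sketched as follows, assuming Euclidean distances; the random-weighting GetWeights step is reproduced directly, while the helper names are illustrative.

```python
import numpy as np

def get_weights(m, rng):
    """Random weighting method: U[0, 1] variates normalized to sum to 1."""
    w = rng.uniform(0.0, 1.0, size=m)
    return w / w.sum()

def bootstrapping_one(X, m, rng):
    """Bootstrapping I for one class: each artificial pattern is a weighted
    average of the m nearest neighbours of a randomly selected pattern."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    out = np.empty_like(X)
    for j in range(n):
        k = int(rng.integers(n))                 # random original pattern
        d = np.linalg.norm(X - X[k], axis=1)
        nb = X[np.argsort(d)[:m]]                # m-NN, the pattern included
        out[j] = get_weights(m, rng) @ nb        # weighted average
    return out
```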
Algorithm 5.2 Bootstrapping II
Input: (i) c, the number of classes;
       (ii) ni, the size of the training sample of class i;
       (iii) x[1, 1], ..., x[n1, 1], x[1, 2], ..., x[nc, c], the training sample data;
       (iv) m, the number of the nearest neighbours of a training pattern.
Output: y[1, 1], ..., y[n1, 1], y[1, 2], ..., y[nc, c], the artificial training sample data.
Method
BEGIN
    For i = 1 to c do
        For j = 1 to ni do
            GetNearestNeighboursOf(x[j, i], m)
        End-For
    End-For
    For i = 1 to c do
        For j = 1 to ni do
            y[j, i] = GenerateArtificialPattern(m-nearest neighbours of x[j, i])
        End-For
    End-For
    Return y[1, 1], ..., y[n1, 1], y[1, 2], ..., y[nc, c]
END Bootstrapping II
The only change between Algorithm 5.1 and Algorithm 5.2 is that the step

k = random integer in [1, ni]

in Algorithm 5.1 is eliminated. This means that the artificial training sample simply replaces each pattern in the original sample with the weighted average of its m-nearest neighbours.
Algorithm 5.3 Bootstrapping III
Input: (i) c, the number of classes;
       (ii) ni, the size of the training sample of class i;
       (iii) x[1, 1], ..., x[n1, 1], x[1, 2], ..., x[nc, c], the training sample data;
       (iv) m, the number of the nearest neighbours of a training pattern.
Output: y[1, 1], ..., y[n1, 1], y[1, 2], ..., y[nc, c], the artificial training sample data.
Method
BEGIN
    For i = 1 to c do
        For j = 1 to ni do
            GetNearestNeighboursOf(x[j, i], m)
        End-For
    End-For
    For i = 1 to c do
        For j = 1 to ni do
            k = random integer in [1, ni]
            y[j, i] = the average of the m-nearest neighbours of x[k, i]
        End-For
    End-For
    Return y[1, 1], ..., y[n1, 1], y[1, 2], ..., y[nc, c]
END Bootstrapping III
Unlike Algorithm 5.2, this algorithm modifies Algorithm 5.1 in another way. It replaces an original training pattern directly with the arithmetic average of its m-nearest neighbours; meanwhile, the original patterns are randomly selected so as to give the averages of their m-nearest neighbours.
Algorithm 5.4 Bootstrapping IV
Input: (i) c, the number of classes;
       (ii) ni, the size of the training sample of class i;
       (iii) x[1, 1], ..., x[n1, 1], x[1, 2], ..., x[nc, c], the training sample data;
       (iv) m, the number of the nearest neighbours of a training pattern.
Output: y[1, 1], ..., y[n1, 1], y[1, 2], ..., y[nc, c], the artificial training sample data.
Method
BEGIN
    For i = 1 to c do
        For j = 1 to ni do
            GetNearestNeighboursOf(x[j, i], m)
        End-For
    End-For
    For i = 1 to c do
        For j = 1 to ni do
            y[j, i] = the average of the m-nearest neighbours of x[j, i]
        End-For
    End-For
    Return y[1, 1], ..., y[n1, 1], y[1, 2], ..., y[nc, c]
END Bootstrapping IV
Algorithm 5.4 is changed from Algorithm 5.3 in the same way as Algorithm 5.2 was changed from Algorithm 5.1. The artificial training sample consists of the arithmetic averages of the m-nearest neighbours of all the patterns in the original sample.
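Bootstrapping IV reduces to local-mean smoothing; a minimal sketch for one class, assuming Euclidean distances:

```python
import numpy as np

def bootstrapping_four(X, m):
    """Bootstrapping IV for one class: every pattern is replaced by the
    arithmetic mean of its m nearest neighbours (itself included), which
    pulls outliers toward the local mean."""
    X = np.asarray(X, dtype=float)
    out = np.empty_like(X)
    for j in range(len(X)):
        d = np.linalg.norm(X - X[j], axis=1)
        out[j] = X[np.argsort(d)[:m]].mean(axis=0)   # local mean
    return out
```

With m = 1 the sample is unchanged; with m equal to the class size every pattern collapses to the class mean, so the choice of m controls the degree of smoothing.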
The comparison of the above algorithms was done in the paper by Hamamoto [HUT97] on the basis of experiments with k-NN classifiers (k = 1, 3, 5). The experiments were designed to inspect the performances of the above algorithms in various situations, with different dimensions, training sample sizes, sizes of the nearest neighbour sets, and distributions. The main conclusions of their work are:

1) The algorithms outperform the conventional k-NN classifiers as well as the edited 1-NN classifier.

2) The advantage of the algorithms comes from removing the outliers by smoothing the training patterns, e.g. using local means to replace a training pattern.

3) The number of the nearest neighbours chosen has an effect on the result. An algorithm was introduced to optimize the selection of this size.

The details of the experiments and the algorithm for optimizing the size of the nearest neighbour set can be found in [HUT97].
5.3 Mixed-Sample Classifier Design
In Chapter 4, we presented the pseudo-classifier algorithms, which used a pseudo-training sample to test the accuracy of the classifier. It is generally accepted that the larger a training sample, the greater the accuracy of a classifier. Consequently, we believe that using the pseudo-training sample set to enlarge the training sample size is a good idea. The question that we face is one of knowing how to generate suitable pseudo-training patterns and how to blend them together with the original training samples to build a classifier. Such a classifier design scheme is referred to as a Mixed-sample classifier design because it uses the mixed training sample to construct the classifier. Generally, this kind of algorithm will build a classifier in two steps: first, it will generate a pseudo-training sample set; then it will build a classifier with all the training patterns, using both the original and the pseudo-training samples.
The second step is executed based on the classifier selected. This can involve any standard classifier design algorithm based on the given training patterns. We will thus focus on how to carry out the first step. Earlier, in Chapter 4, we presented a GeneratePseudoSample procedure. We shall now marginally modify the existing procedure and make it suitable for the purposes of the classifier design. Below is a typical algorithm for generating a pseudo-training sample for the classifier design.
Algorithm 5.5 Generating Pseudo-training Sample
Input:   (i)   c, the number of classes;
         (ii)  ni, the size of the training sample of class i;
         (iii) x[1,1], ..., x[n1,1], x[1,2], ..., x[nc,c], the training sample data;
         (iv)  B, the number of times bootstrap samples are drawn;
         (v)   m, the number of the nearest neighbours of a training pattern;
         (vi)  ti, the size of the pseudo-training sample of class i.
Output:  y[1,1], ..., y[t1,1], y[1,2], ..., y[tc,c], the pseudo-training sample.
Method
BEGIN
    For i = 1 to c do
        For j = 1 to ni do
            GetNearestNeighboursOf(x[j,i], m)
        End-For
    End-For
    For i = 1 to c do
        For j = 1 to ti do
            k = random integer in [1, ni]
            y[j,i] = GetPseudoPattern(m nearest neighbours of x[k,i])
        End-For
    End-For
    Return y[1,1], ..., y[t1,1], y[1,2], ..., y[tc,c]
END Generating Pseudo-training Sample
The GetPseudoPattern procedure in the above algorithm can be either the GetPseudoPattern I procedure given in Algorithm 4.6 or the GetPseudoPattern II procedure given in Algorithm 4.7. Therefore, it is possible to have different weighting schemes for the GetPseudoPattern procedure. In our experiments, we used three different schemes: the SIMDAT algorithm, the Bayesian bootstrap and the random weighting methods.
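The two latter weighting schemes can be sketched as follows. Both constructions produce non-negative weights summing to one (in fact both vectors are Dirichlet(1, ..., 1) distributed); the function names are ours:

```python
import numpy as np

def bayesian_bootstrap_weights(m, rng):
    """Bayesian bootstrap: gaps between sorted Uniform(0,1) variates."""
    u = np.sort(rng.uniform(size=m - 1))
    return np.diff(np.concatenate(([0.0], u, [1.0])))

def random_weighting_weights(m, rng):
    """Random weighting method: i.i.d. Exp(1) variates, normalised."""
    e = rng.exponential(size=m)
    return e / e.sum()

def get_pseudo_pattern(neighbours, weights):
    """A pseudo-pattern as the weighted combination of the m nearest
    neighbours of a (randomly chosen) training pattern."""
    return weights @ np.asarray(neighbours, dtype=float)
```

Either weight generator can be plugged into `get_pseudo_pattern` to realise the corresponding scheme.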
There are other parameters that have to be decided in this algorithm as well. These are the number of the nearest neighbours and the size of the pseudo-training sample
set of each class. It is preferable to fix the ratio between the pseudo-training sample size and the original training sample size for each class. As expected, the pseudo-training sample size affects the performance of the classifier. In our experiments, several pseudo-training sample sizes were used to ascertain their effect on the performance of a classifier. The size of the nearest-neighbour set can also affect the performance of a classifier. Two different sizes of the nearest-neighbour set were used in our experiments to show how they affect the classifier's performance.
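A minimal sketch of the overall mixed-sample design, assuming a plain Euclidean k-NN classifier and a pseudo-sample generated elsewhere (the function names are ours):

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=3):
    """Plain k-NN majority vote with Euclidean distance."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    preds = []
    for x in np.asarray(X_test, dtype=float):
        nn = np.argsort(((X_train - x) ** 2).sum(axis=1))[:k]
        labels, counts = np.unique(y_train[nn], return_counts=True)
        preds.append(labels[np.argmax(counts)])
    return np.array(preds)

def mixed_sample(X, y, pseudo_X, pseudo_y):
    """Step 2 of the mixed-sample design: pool the original and the
    pseudo-training patterns into one enlarged training set."""
    return np.vstack([X, pseudo_X]), np.concatenate([y, pseudo_y])
```

The classifier is then built by running `knn_predict` on the pooled sample instead of the original one.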
5.4 Simulation Results
Experiments for the Mixed-sample algorithms were done with Data Set 1. Two classes, each with a training sample of size 8, were involved in each experiment. Six different sizes, 4, 8, 12, 16, 20 and 24, were used to generate the pseudo-training sample sets. With the size of the nearest neighbours m = 3, three schemes, namely the SIMDAT algorithm, the Bayesian bootstrap and the random weighting method, were used to generate the pseudo-training samples. To study the effect of the size of the nearest-neighbour set, the experiments were also carried out with m = 8. Note that the training sample size of a class is 8, so m = 8 means that a pseudo-training pattern is a combination of all the patterns in the training sample of a class. The Bayesian bootstrap and the random weighting method were used in the experiments with m = 8. To distinguish between the Bayesian bootstrap and the random weighting method used with the two different sizes of the nearest neighbours, the former (m = 3) are referred to as the Local-Bayesian bootstrap and the Local-random weighting method, while the latter
(m = 8) are simply called the Bayesian bootstrap and the Random weighting method. After constructing a classifier with the mixed sample, an independent testing sample of size 1,000 for each class was used to estimate the error rate of the 3-NN classifier. Figures 5.1 - 5.6 provide the results of the experiments, averaged over 200 trials.
Figure 5.1 Testing Error of Mixed-Sample Classifier-Class Pair (A, B)
Figure 5.2 Testing Error of Mixed-Sample Classifier-Class Pair (A, C)
Figure 5.3 Testing Error of Mixed-Sample Classifier-Class Pair (A, D)
Figure 5.4 Testing Error of Mixed-Sample Classifier-Class Pair (A, E)
Figure 5.5 Testing Error of Mixed-Sample Classifier-Class Pair (A, F)
Figure 5.6 Testing Error of Mixed-Sample Classifier-Class Pair (A, G)
The X-axis in each chart represents the pseudo-training sample size of a class, while the Y-axis represents the error rate of the classifier estimated by the independent testing sample. Each chart gives the experimental results for one class pair. Although different class pairs have different error rates, the experimental results showed consistent tendencies, which are:
1. All three algorithms for pseudo-training sample generation seem to make things worse. These are the SIMDAT algorithm, the Local-Bayesian bootstrap and the Local-random weighting method.

2. The Bayesian bootstrap and the random weighting method apparently improve the performance of the classifier.

3. The effect of the pseudo-training sample increases as its size increases.
For instance, without a pseudo-training sample mixed into the original training sample for the classifier construction, the error rate of the 3-NN classifier is 33.37% for class pair (A, D). When the size of the pseudo-sample increased to 16, twice the training sample size, the classifier's error rates increased to 34.41%, 33.82% and 34.15% for the Local-Bayesian bootstrap, the Local-random weighting method and the SIMDAT algorithm, respectively. However, with the same pseudo-training sample size, the error rates decreased to 31.08% and 31.14% for the Bayesian bootstrap and the random weighting method. This is also true for class pair (A, E). Without the pseudo-training sample, the error rate of the 3-NN classifier is 23.54%. The error rates rose to 24.14%, 23.59% and 25.14% for the Local-Bayesian bootstrap, the Local-random weighting method and the SIMDAT algorithm, respectively. At the same time, the error rates decreased to 21.61% and 21.70% for the Bayesian bootstrap and the random weighting method.
The failure of the Local-Bayesian bootstrap, the Local-random weighting method and the SIMDAT algorithm indicates that the size of the nearest-neighbour set should not be too small. Although a combination of m nearest neighbours smoothes the training patterns to some extent, it might still be an outlier if it is the combination of the m nearest neighbours of an outlier and the size m is small. Thus, the mixed-sample classifier might have more outliers than the usual classifier, which makes things even worse. It is noted that the SIMDAT algorithm is the worst of all the schemes for class pairs (A, E), (A, F) and (A, G). The cause of this is that the SIMDAT algorithm uses the uniform distribution U[1/m - sqrt(3(m-1))/m, 1/m + sqrt(3(m-1))/m] to generate the weighting vectors, which allows a greater chance for an outlier point to be generated.
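Drawing one SIMDAT weighting vector can be sketched as follows. We assume the interval U[1/m - sqrt(3(m-1))/m, 1/m + sqrt(3(m-1))/m], which gives each weight mean 1/m and variance (m-1)/m^2; the function name is ours:

```python
import numpy as np

def simdat_weights(m, rng):
    """m i.i.d. draws from U[1/m - sqrt(3(m-1))/m, 1/m + sqrt(3(m-1))/m]."""
    half_width = np.sqrt(3.0 * (m - 1)) / m
    return rng.uniform(1.0 / m - half_width, 1.0 / m + half_width, size=m)
```

Unlike the Bayesian bootstrap or random weighting vectors, these weights can be negative and need not sum to one, which is exactly what lets a pseudo-pattern fall outside the convex hull of the neighbours.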
The algorithms of the Bayesian bootstrap and the random weighting method performed very well, as they use combinations of the whole set of training patterns to produce the pseudo-training sample. This confirms that the size of the nearest-neighbour set should not be too small. The effect of the pseudo-training sample size is also significant. It can be seen from Figures 5.1 - 5.6 that, for all class pairs, the larger the pseudo-training sample's size, the lower the error rate. The evidence shows that the rate of decrease slows down as the pseudo-training sample's size increases. Generally, it is enough to take the size of the pseudo-training sample as 1.5 times (in our case, 12) that of the original training sample. The results of the experiments also showed that there is no significant performance difference between the Bayesian bootstrap and the random weighting method.
5.5 Discussions and Conclusions
In this chapter, the classifier design problem has been discussed. Previously, Hamamoto and others constructed a classifier by using weighted local means in place of the original training patterns. Their experiments demonstrated that their strategy performed favorably. With the local-means strategy, they also discussed how to find the optimal size of the nearest-neighbour set.
What we have introduced is an alternative approach for constructing a classifier with the mixed sample. The motivation for using the mixed-sample classifier design was to have a larger sample size. This approach works when the pseudo-training patterns are obtained from the weighted averages of all the original training patterns together. However, it fails if the size of the nearest-neighbour set is small.
Comparing the results of the mixed-sample classifier design with the previous results of Hamamoto's work, it may be seen that there are still interesting problems to be discussed.

First, in terms of the mixed-sample classifier design, the classifier designs suggested by Hamamoto can be referred to as pseudo-training sample designs, as they use only the pseudo-training samples - the local means - to construct a classifier. The difference between the mixed-sample and the pseudo-training sample classifier designs is that the latter uses the pseudo-training sample to replace the original one, while the former mixes the two samples together.
Second, the local means function well in the pseudo-training sample classifier design because the local means remove the outliers by smoothing the patterns. However, they perform badly in the mixed-sample classifier design when the size of the nearest-neighbour set is small, as a bad local mean might be an extra outlier. On the other hand, a large nearest-neighbour set, such as the whole training sample of a class, works in the mixed-sample classifier design.
Third, Hamamoto's work proved that a large size for the set of nearest neighbours does not necessarily yield the best results [HUT97]. That is why an algorithm to optimize the size of the nearest-neighbour set was suggested. Hence, it appears that the problem of optimizing the size of the nearest-neighbour set for the mixed-sample classifier design is unanswered. This could be a problem for further study.
Fourth, our experiments have proved that using a pseudo-training sample to enlarge the sample size is a good strategy. As no work has been done to compare the mixed-sample classifier design with the pseudo-training sample classifier design, it is difficult to say which one is better. It is reasonable, though, to use the original training patterns in the construction of a classifier if they are beneficial.
Finally, from all of the above discussions, it is confirmed that the classifier design can be improved by using a pseudo-training sample. The problem is how to avoid the contamination of bad pseudo-patterns. There are two possible solutions to the problem: one is to exclude the outliers from the original training sample set before generating the pseudo-training samples. The other is to replace the outliers in the original training sample with their local means. Although it is believed that these two approaches are able to enhance the classifier design, further studies are needed to prove the assumption.
6.1 Introduction
This thesis has presented an in-depth study of Efron's bootstrap techniques. It also discussed how the technique can be applied to problems in statistical pattern recognition, namely those of estimating the Bhattacharyya error bounds, estimating classifier error rates and designing classifiers. All the work done in the area, including the work presented here, demonstrates that the bootstrap technique is applicable to, and works well for, problems in the field of statistical pattern recognition. Throughout our discussion, we have proposed the use of three main strategies for the application of the bootstrap technique, namely, bias correction, cross-validation and pseudo-pattern generation. We now discuss how these strategies were applicable in statistical pattern recognition.
6.2 Bias Correction
As explained in Chapter 2, Efron's bootstrap scheme was initially proposed to estimate the bias of an estimator. The simulation of the distribution of θ̂* given by the bootstrap sample, which experimentally represents the empirical distribution F̂, provides us with information about the way F̂ simulates the true distribution F. Thus the bias of a parameter θ defined in (2.1) can be estimated as

    Bias = E*[θ̂*] - θ̂.                                    (6.1)

A bootstrap estimate of θ is obtained by using the bias estimate to correct the original estimate,

    θ̂_B = θ̂ - Bias.

Bias correction is obtained by merely applying (6.1).
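A minimal sketch of this bias correction in Python (our own illustration; `estimator` stands for any plug-in statistic, and the bias is approximated over B resamples as in (6.1)):

```python
import numpy as np

def bootstrap_bias_corrected(sample, estimator, B=2000, seed=0):
    """Estimate Bias = mean(theta*) - theta_hat over B bootstrap
    resamples, then return the corrected estimate theta_hat - Bias."""
    rng = np.random.default_rng(seed)
    sample = np.asarray(sample, dtype=float)
    n = len(sample)
    theta_hat = estimator(sample)
    theta_star = [estimator(sample[rng.integers(0, n, size=n)])
                  for _ in range(B)]
    bias = np.mean(theta_star) - theta_hat
    return theta_hat - bias
```

For example, the plug-in variance (dividing by n) is biased downward, and the corrected estimate is pushed upward toward the unbiased value.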
There are two different ways of applying the bias correction strategy: direct or indirect. The direct bias correction applies (6.1) directly to the parameter estimated. This approach works well for estimating the Bhattacharyya bound. As shown in Chapter 3, it improved the estimates for all of the six class pairs. Direct bias correction also works for estimating the error rate of a classifier. To do this, the basic bootstrap, 0.632, and 0.632+ estimators use the direct bias correction, even though their methods of estimating the bias are different. The performance of a direct-bias-correction estimator depends largely on how well the bias estimate does. For instance, in estimating the error rate of a classifier, all of the schemes, namely, the basic bootstrap, 0.632 and 0.632+ algorithms, are based on the bias correction of the apparent error estimate. The performances of the three algorithms vary because they estimate the bias in different ways. Therefore, it is very important to find a good bias estimator. Unfortunately, however, there are no general rules that can be followed so as to yield a superior bias estimator.
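As a concrete instance of the direct bias correction of the apparent error, Efron's 0.632 rule blends the optimistic apparent error with the pessimistic leave-one-out bootstrap error E0; a sketch of the standard formula:

```python
def err_632(apparent_err, e0_err):
    """Efron's 0.632 estimator: 0.368 * apparent error + 0.632 * E0."""
    return 0.368 * apparent_err + 0.632 * e0_err
```

The weight 0.632 is roughly the probability that a given pattern appears in a bootstrap resample.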
When the parameter θ is a function of another parameter ξ, θ = t(ξ), it is possible to correct ξ's bias first, and then to obtain the estimate of θ via the Taylor expansion of the function θ = t(ξ). This approach is called the indirect bias correction. In Chapter 3, this approach was used to estimate the Bhattacharyya bound. Although the FAMS algorithm with the bootstrap scheme improves the estimate significantly, the standard deviation of the estimate is very large. For example, in TABLE 3.7, the mean and the standard deviation for the 200 trials are 20.63062% and 23.34433% for class pair (A, G), while, in TABLE 3.10, they are 33.01038% and 43.78814% for the same class pair, implying that the results are not very satisfactory. On the other hand, the FAMS algorithms with the Bayesian bootstrap and the random weighting method schemes have a relatively small standard deviation, but they do not correct the bias significantly. In TABLE 3.8, with the Bayesian bootstrap scheme, for example, the mean and the standard deviation for class pair (A, G) are 9.294273% and 4.665902%. The fact is that there is a large gap between the basic bootstrap scheme and the Bayesian bootstrap scheme, as well as between the basic bootstrap and the random weighting method. There seems also to be a trade-off between the bias correction and the standard deviation. The question of whether we can devise a scheme that has the advantages of both the basic bootstrap and the Bayesian bootstrap remains open.
Comparing the indirect bias correction with the direct bias correction, it is not easy to state which is superior. The experimental results given in Chapter 3 show that the direct bias correction performs better for class pairs (A, B), (A, C), (A, D) and (A, E), while the indirect bias correction performs better for class pairs (A, F) and (A, G). The cause for this is that the Bhattacharyya bound ε_u is an exponential function of the Bhattacharyya distance μ(1/2): ε_u = (1/2) e^(-μ(1/2)). It must also be noted that the Bhattacharyya distance μ(1/2) is usually over-estimated. Hence the estimate bias of the Bhattacharyya distance will contribute more to the estimate bias of the Bhattacharyya bound when the Bhattacharyya distance is large. Therefore, the larger the Bhattacharyya distance is, the larger its effect on the estimate bias of the Bhattacharyya bound will be.
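The Bhattacharyya distance and bound for the two-Gaussian case can be sketched as follows (standard normal-case expressions, equal priors assumed; the function names are ours):

```python
import numpy as np

def bhattacharyya_distance(m1, S1, m2, S2):
    """mu(1/2) for two Gaussian classes with means m1, m2 and
    covariance matrices S1, S2."""
    m1, m2 = np.asarray(m1, dtype=float), np.asarray(m2, dtype=float)
    S1, S2 = np.asarray(S1, dtype=float), np.asarray(S2, dtype=float)
    S = (S1 + S2) / 2.0
    d = m2 - m1
    term1 = d @ np.linalg.solve(S, d) / 8.0          # mean-difference term
    term2 = 0.5 * np.log(np.linalg.det(S) /
                         np.sqrt(np.linalg.det(S1) * np.linalg.det(S2)))
    return term1 + term2

def bhattacharyya_bound(mu):
    """Error bound for equal priors: eps_u = (1/2) exp(-mu(1/2))."""
    return 0.5 * np.exp(-mu)
```

The exponential form makes visible why an over-estimated distance translates into a strongly biased bound when μ(1/2) is large.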
6.3 Cross-validation

The cross-validation approach is mainly concerned with the error rate estimation of a classifier. The term "cross-validation" as used by us in this thesis does not have the same meaning as in machine learning. The key idea in the Cross-validation algorithm is to separate the testing patterns from the training patterns. It might, therefore, be better to describe the scheme with the phrase "training and testing pattern separating approach" instead of the term "Cross-validation". All of the methods, namely the E0, 0.632, Leave-one-out bootstrap and 0.632+ estimators, CVBBPC and CVRWPC, use the Cross-validation to estimate the error rate. Compared with the other error rate estimation algorithms, which do not separate the training and testing samples, these algorithms obviously come out as the winners.
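As an illustration of the separating idea, the leave-one-out bootstrap (E0) tests each bootstrap classifier only on the patterns left out of that resample; a minimal sketch (names ours; `classify(X_train, y_train, X_test)` stands for any classifier):

```python
import numpy as np

def e0_error(X, y, classify, B=100, seed=0):
    """Average, over B bootstrap resamples, of the error measured on the
    patterns NOT drawn into the resample (the left-out test set)."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    n = len(X)
    errs = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)        # bootstrap training indices
        out = np.setdiff1d(np.arange(n), idx)   # patterns left out
        if out.size == 0:
            continue
        pred = classify(X[idx], y[idx], X[out])
        errs.append(np.mean(pred != y[out]))
    return float(np.mean(errs))
```

Because the tested patterns never appear in the corresponding training set, the estimate avoids the optimism of the apparent error, at the price of a pessimistic bias for small samples.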
In small sample cases, however, an explicit separation is not necessarily the best. TABLE 4.8 and TABLE 4.10 indicate that when the training sample size is 8, the impure separating estimators, the 0.632, 0.632+, CVBBPC and CVRWPC estimators, performed apparently better than the pure separating estimators, the E0 and Leave-one-out bootstrap estimators. On the other hand, TABLE 4.9 and TABLE 4.11 show that the performance of the E0 and the Leave-one-out bootstrap estimators improved significantly. These facts demonstrate that the size of the training sample set is an important factor. When some patterns were excluded from the training sample for testing,
CHAPTER 6: SUMMARY 130
it effectively reduced the size of the training sample set used to construct a classifier, and so the error rate of the classifier increased. This explains why the error rate estimates given by Leave-one-out, E0 and the Leave-one-out bootstrap tend to over-estimate the true error rate.
When the size of the training sample set is large, the result will not be much affected when we merely exclude one pattern each time from the training sample for testing. TABLE 4.9 and TABLE 4.11 clearly demonstrate this fact. However, when the size of a training sample is small, it is hard to design, test and train the classifier with separate patterns. One of the solutions is to use a pseudo-classifier algorithm, such as CVBBPC or CVRWPC. Of course, the Leave-one-out bootstrap performs very well in the small training size case, but a pseudo-classifier algorithm is preferred for the following reasons:
1. A pseudo-classifier algorithm cannot be worse than the Leave-one-out bootstrap. Thus, for example, the performances of CVBBPC and CVRWPC are comparable to the Leave-one-out bootstrap.

2. A pseudo-classifier algorithm is more flexible than the Leave-one-out bootstrap, as it is possible to improve its performance by changing either the method of generating a weighting vector or the number of the nearest neighbours of a training pattern used in designing the algorithm.

The first reason has been supported by our experiments. The details of the second reason will be studied in the next section.
6.4 Pseudo-pattern Generation
Pseudo-classifier algorithms are used for estimating the error rate of a classifier, and the pseudo-training sample can be used in classifier design. In spite of the difference between their purposes, both the pseudo-classifier algorithms and the pseudo-training sample algorithms are based on the same strategy, namely, that of pseudo-pattern generation. Of course, the strategies for generating the pseudo-patterns vary depending on the purpose for which the pseudo-patterns are generated.
Two issues are important when generating pseudo-patterns. The first is the scheme used to produce the weighting vectors. In our experiments, three schemes were tested: the SIMDAT algorithm, the Bayesian bootstrap and the random weighting method. Generally, the latter two perform better, although, in error rate estimation, the SIMDAT algorithm has a better performance for the class pairs (A, B), (A, C) and (A, D) (see TABLES 4.4 - 4.7). It is therefore clear that a suitable scheme must be chosen based on the situation at hand. Unfortunately, at this time, there are no rules available for such a selection. For the error rate estimation, our experiments showed that the SIMDAT algorithm is better when the error rate is higher, while the Bayesian bootstrap and the random weighting method are better with a lower error rate.
The second issue of importance is the number of the nearest neighbours of a pattern used in generating the pseudo-patterns. In error rate estimation, pseudo-patterns are used to simulate the behavior of the training patterns, so it is better to keep the size of the nearest-neighbour set small. In this way, the distribution of the pseudo-patterns can be kept close to that of the original training sample. As opposed to this, in classifier design, the purpose of adding the pseudo-patterns is to decrease the classifier's error, so the size of the set of nearest neighbours of a pattern should be kept larger. In this way, the pseudo-patterns will tend to appear more frequently in the central vicinity of each class, and so bring more smoothness to the training samples.
In the estimation of error rates, it is unlikely that we will be able to design an algorithm that can optimize the size of the set of nearest neighbours, because of the lack of a practical criterion function. The goal of the exercise is to compute an estimate that is as close to the true error rate as possible. However, this problem is complex because the true error rate is unknown in practice. A possible solution for the optimization is to choose the size of the set of nearest neighbours based on the results of simulation experiments.

In the classifier design, however, the Leave-one-out error estimation can be implemented to optimize the size of the nearest-neighbour set. The algorithm is briefly described below:
1. For a fixed size of nearest neighbours, generate pseudo-patterns and construct the classifier with the Mixed-sample classifier design discussed in Chapter 5.

2. Use the Leave-one-out algorithm to estimate the error rate of the classifier built in Step 1.

3. Do Steps 1 and 2 for three different sizes of nearest neighbours, m1 = 2 < m2 = (m1 + m3) / 2 < m3 = the training sample size of a class. Compare the estimated error rates.

4. Let Err_i be the estimated error rate related to size m_i. If Err1 < Err2 < Err3, then select m0 = (m1 + m2) / 2; if Err1 > Err2 > Err3, then select m0 = (m2 + m3) / 2; if Err1 > Err2 and Err2 < Err3, then select m01 = (m1 + m2) / 2 and m02 = (m2 + m3) / 2.

5. Do Steps 1 and 2 for the newly selected size m0, and compare the new estimated error rate Err0 with the two others (Err1 and Err2 if m0 = (m1 + m2) / 2; Err2 and Err3 if m0 = (m2 + m3) / 2). If two new sizes m01 and m02 were selected in Step 4, then two new error rates Err01 and Err02 will be estimated, and the comparison is carried out with Err2, Err01 and Err02.

6. Repeat Steps 3 - 5 until the three compared sizes converge. The m with the minimum Cross-validation error is selected as the size of the nearest neighbours.
Although we cannot guarantee that the best size for the set of the nearest
neighbors will result, this algorithm can, at least, provide a reasonably good estimate for
the cardinality of the set.
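Steps 1 - 6 above amount to a shrinking three-point search; one possible rendering in Python (our sketch; `cv_error(m)` stands for the Leave-one-out error of the mixed-sample classifier built with neighbourhood size m):

```python
def optimise_m(cv_error, n):
    """Shrink the triple (m1, m2, m3) toward the m with the smallest
    cross-validation error, starting from m1 = 2, m3 = n."""
    m1, m3 = 2, n
    m2 = (m1 + m3) // 2
    while m3 - m1 > 2:
        e1, e2, e3 = cv_error(m1), cv_error(m2), cv_error(m3)
        if e1 <= e2 <= e3:                  # error increasing: search left
            m3, m2 = m2, (m1 + m2) // 2
        elif e1 >= e2 >= e3:                # error decreasing: search right
            m1, m2 = m2, (m2 + m3) // 2
        else:                               # minimum bracketed around m2
            m1, m3 = (m1 + m2) // 2, (m2 + m3) // 2
            m2 = (m1 + m3) // 2
    return min((m1, m2, m3), key=cv_error)
```

In practice `cv_error` is noisy, so this finds a reasonably good neighbourhood size rather than a guaranteed optimum, which matches the caveat above.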
6.5 Conclusion
In this thesis we have demonstrated, conclusively, that the bootstrap technique is useful for solving problems in statistical pattern recognition. Generally, algorithms using the bootstrap technique take more space and more time for calculations, which are the main disadvantages of the bootstrap technique. However, space and time are not serious problems in estimating the Bhattacharyya bound, and in estimating the error rate, as the
size of the training sample is not generally large, and the estimates are only calculated once. There are also no problems of space and time in the classifier design. Once the classifier has been constructed with the mixed samples, it is used the same way as a classifier designed by the usual approach.
The benefits of the bootstrap technique are obvious:
1. It corrects the estimate bias of the Bhattacharyya bound and the classifier error rate, and

2. It improves the performance of a classifier.

Therefore, the question now is not "Could the bootstrap technique be applied to statistical pattern recognition?" but rather, "Which is the best bootstrap algorithm applicable?" and "How much can the bootstrap technique achieve?".
Based on our research, the following bootstrap algorithms are suggested for the three problems in statistical pattern recognition that we have researched.
To estimate the Bhattacharyya bound, the bias correction approach with the basic bootstrap scheme is preferred. Either the direct or the indirect bias correction approach can be used, depending on the Bhattacharyya distance associated with the class pair. If the Bhattacharyya distance is small, it is preferred that the direct bias correction be used. Otherwise the indirect bias correction would be better.
To estimate the error rate of a classifier, a combined Cross-validation and bootstrap approach is recommended, such as the 0.632+, the Leave-one-out bootstrap, the CVBBPC or the CVRWPC. For the pseudo-classifier algorithms CVBBPC and
CVRWPC, the size of the nearest neighbors can be selected as m = 3, as demonstrated by
our experiments. As mentioned in the previous section, the pseudo-classifier algorithm
with the SIMDAT scheme can also be a good candidate, if the true error rate is high.
To design a classifier, a mixed-sample algorithm with the Bayesian bootstrap or the random weighting method is preferred. The size of the set of nearest neighbours can be the same as the size of the training sample of a class. Our experiments proved that this size works well for the classifier design.
It is difficult to say just how much advantage a bootstrap technique can give. Although the methods recommended above have achieved high performances, there is still potential for further progress by applying the idea of pseudo-pattern generation and the bootstrap technique. Two kinds of improvements seem possible. The first is to seek a better scheme to generate the weighting vectors for a specific problem. The other is to optimize the size of the set of nearest neighbours of a pattern. Both of these avenues are areas where much research can be done. The question of using parametric bootstrap techniques for the three problems studied remains open.
[BD84] Bunke, O. and Droge, B. (1984), "Bootstrap and cross-validation estimates of the prediction error for linear regression models", Annals of Statistics, 12, pp.1400-1424.

[Bu89] Burman, P. (1989), "A comparative study of ordinary cross-validation, v-fold cross-validation and the repeated learning-testing methods", Biometrika, 76, pp.503-514.

[CMN85] Chernick, M.R., Murthy, V.K. and Nealy, C.D. (1985), "Application of bootstrap and other resampling techniques: evaluation of classifier performance", Pattern Recognition Letters, 3, pp.167-178.

[CMN86] Chernick, M.R., Murthy, V.K. and Nealy, C.D. (1986), "Correction to 'Application of bootstrap and other resampling techniques: evaluation of classifier performance'", Pattern Recognition Letters, 4, pp.133-142.

Davison, A.C. and Hall, P. (1992), "On the bias and variability of bootstrap and cross-validation estimates of error rate in discrimination problems", Biometrika, 79, pp.279-284.

Davison, A.C. and Hinkley, D.V. (1997), Bootstrap Methods and Their Application, Cambridge University Press, Cambridge.

Efron, B. (1979), "Bootstrap Methods: Another Look at the Jackknife", Annals of Statistics, 7, pp.1-26.

Efron, B. (1982), The Jackknife, the Bootstrap and Other Resampling Plans, CBMS-NSF Regional Conference Series in Applied Mathematics, Society for Industrial and Applied Mathematics.

Efron, B. (1983), "Estimating the error rate of a prediction rule: improvement on cross-validation", Journal of the American Statistical Association, 78, pp.316-331.

Efron, B. (1986), "How biased is the apparent error rate of a prediction rule?", Journal of the American Statistical Association, 81, pp.461-470.

Efron, B. and Tibshirani, R.J. (1993), An Introduction to the Bootstrap, Chapman & Hall, Inc.

Efron, B. and Tibshirani, R.J. (1997), "Improvements on cross-validation: the .632+ bootstrap method", Journal of the American Statistical Association, 92, pp.548-560.

Fukunaga, K. (1990), Introduction to Statistical Pattern Recognition, Second Edition, Academic Press, Inc.

Hand, D.J. (1986), "Recent advances in error rate estimation", Pattern Recognition Letters, 4, pp.335-346.

[HUT97] Hamamoto, Y., Uchimura, S. and Tomita, S. (1997), "A bootstrap technique for nearest neighbor classifier design", IEEE Transactions on Pattern Analysis and Machine Intelligence, 19, pp.73-79.

Jain, A.K., Dubes, R.C. and Chen, C. (1987), "Bootstrap techniques for error estimation", IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-9, pp.628-633.

Li, K.-C. (1987), "Asymptotic optimality for Cp, CL, cross-validation and generalized cross-validation: discrete index set", Annals of Statistics, 15, pp.958-975.

Mammen, E. (1992), When Does Bootstrap Work? Asymptotic Results and Simulations, Lecture Notes in Statistics 77, Springer-Verlag, New York, Inc.

Port, Sidney C. (1994), Theoretical Probability for Applications, John Wiley & Sons, Inc.

Shao, J. and Tu, D. (1995), The Jackknife and Bootstrap, Springer-Verlag, New York, Inc.

Taylor, M.S. and Thompson, J.R. (1992), "A nonparametric density estimation based resampling algorithm", Exploring the Limits of Bootstrap, John Wiley & Sons, Inc., pp.397-404.

Wand, M.P. and Jones, M.C. (1995), Kernel Smoothing, Chapman & Hall, London.

Weiss, S.M. (1991), "Small sample error rate estimation for k-NN classifiers", IEEE Transactions on Pattern Analysis and Machine Intelligence, 13, pp.285-289.

[Zh87] Zheng, Z. (1987), "Random weighting methods", Acta Math. Appl. Sinica, 10, pp.247-253.