Analytica Chimica Acta 544 (2005) 167–176

Application of boosting to classification problems in chemometrics

M.H. Zhang a, Q.S. Xu b, F. Daeyaert c, P.J. Lewi c, D.L. Massart a

a Department of Pharmaceutical and Biomedical Analysis, Pharmaceutical Institute, Vrije Universiteit Brussel, Laarbeeklaan 103, B-1090 Brussels, Belgium
b School of Mathematical Science and Computing Technology, Central South University, Changsha 410083, PR China
c Center for Molecular Design, Janssen Pharmaceutica N.V., Antwerpsesteenweg 37, B-2350 Vosselaar, Belgium

Received 16 September 2004; received in revised form 3 January 2005; accepted 31 January 2005. Available online 16 March 2005.
doi:10.1016/j.aca.2005.01.075

Abstract

The application of boosting to both two-class and multi-class classification problems is studied. Five real chemical data sets are used. Each data set is randomly divided into two subsets, one for training and the other for prediction. For two-class classification, each data set is separated into a high response level class and a low response level class according to a threshold value. As a result, three data sets, the wheat data, the cream data and the HIV data, show that boosting using classification and regression trees (CART) as a base learner may decrease the misclassification rate in prediction with respect to using a single CART. However, boosting for the green tea data indicates that overfitting may occur when boosting is applied. For the chromatographic retention data, boosting performs worse than a single CART. The cream data and the HIV data are also used for multi-class classification. Both data sets demonstrate that boosting performs better than CART in multi-class classification. Variable importance analysis suggests that the improvement made by boosting may be due to the use of more variables, which give more information on special types of samples in the training data.

© 2005 Elsevier B.V. All rights reserved.

Keywords: Boosting; Classification; AdaBoost; CART (classification and regression trees); Variable importance

1. Introduction

Boosting was initiated by Schapire in 1990 [1]. Since then there has been increasing interest in it due to its ability to improve the accuracy of learning algorithms. Reports show that boosting can reduce both the error rate in classification and the prediction error in regression [2–6].

Boosting is applied to what is called an ensemble of weak learners. In boosting terminology a learner is a decision rule, and a weak learner refers to a learner that performs better than random but does not perform as well as one would like. This means that for a binary classification of two equally sized classes the correct classification rate is larger than 0.5 but relatively far removed from 1. A weak learner is characterised by the fact that its prediction is unstable, i.e. a small change in the training data causes a large change in the prediction for the test set [7]. Decision trees and neural networks are considered to be typical unstable learners. Kearns and Valiant have proved that an ensemble, i.e. a set, of weak learners can form a better learner and result in higher accuracy [8].

The earlier versions of boosting demand prior knowledge of the accuracy of the weak learners. In 1997, Freund and Schapire proposed a method that does not have this requirement [9]. This method is called AdaBoost (Adaptive Boosting). It adjusts adaptively to the errors resulting from the weak learners and generates a weighted majority learner. This learner is an ensemble of weak learners, whose members each have a weight, which is determined by the accuracy of the weak learner. AdaBoost is based on a multiplicative weight-update technique, a generalization of Littlestone and Warmuth's weighted majority algorithm [10]. In AdaBoost the distribution of the samples is sequentially altered according to the performance of the previous weak learners.



The cases or samples that are least well learned are given a higher weight. Samples with higher weights have a higher probability of being selected to generate the next weak learner, thereby making it focus on these difficult samples.

Recently several studies have appeared that contribute to a better understanding and an improvement of the performance of boosting. For instance, Friedman stated that "stochastic gradient boosting" could improve accuracy and increase robustness [11]. Buhlmann and Yu proposed "L2Boost", a boosting method constructed from a gradient descent algorithm with an L2 loss function [12]. Their study not only explained the "resistance of boosting to overfitting" mystery, but also discussed in detail the requirements for successful boosting, such as the choice of the learner.

So far the best results with boosting have been obtained for classification problems. In chemometrics, classification solutions are frequently needed for high-dimensional data. Up to now, boosting has seldom been used in that field. Recently, Varmuza et al. applied boosting to classification problems of mass spectroscopic data [13]. He et al. reported the successful application of boosting for the classification of three kinds of data (clinical data, chemical descriptor data and mass spectral data) [14]. In this study, the application of boosting to classification problems of five real chemical data sets will be presented. CART, a decision tree method, is applied as the base learner.

2. Theory

2.1. Classification and regression trees (CART)

2.1.1. CART procedure

Classification and regression trees (CART), proposed by Breiman et al. in 1984 [15], is a tree-building method that partitions a set of samples into groups. The goal of CART is to explain a response y by selecting some independent variables x from a larger set of x values. The tree is constructed in a recursive binary way, resulting in nodes connected by branches. A node, which can be further divided into two new nodes, is called a parent node. The two new nodes are called child nodes. A terminal node refers to a node having no child nodes. There are classification trees and regression trees. The former is constructed for categorical responses and the latter for numerical ones. Each terminal node is characterised by the class of the majority of the members (in classification) or the mean of the response values of the members (in regression) assigned to that node. A new sample is allocated to a terminal node according to the values of its x variables. The prediction for its response is the characteristic of that terminal node. In this article only classification trees are considered.

The CART procedure consists of three main steps. The first step is a forward stepwise procedure to construct a complete decision tree based on the training set. Starting from the root node, i.e. a node that includes all the training samples, each parent node is divided into two child nodes by a best splitter. A best splitter is an independent variable x_s with a threshold value s. The x_s is selected from all independent variables x and should satisfy the condition that the splitting maximizes the impurity decrease ΔI(x_s, J) on going from the parent node J to its two child nodes J_L and J_R:

$\Delta I(x_s, J) = \Delta i(J) - f_L \Delta i(J_L) - f_R \Delta i(J_R)$  (1)

where $\Delta i(J)$, $\Delta i(J_L)$ and $\Delta i(J_R)$ are the impurities of the parent node J and of the left and right child nodes J_L and J_R, respectively. f_L is the fraction of samples of node J whose x_s value is less than s and that therefore go to node J_L; f_R is the fraction of samples of node J whose x_s value is equal to or larger than s and that go to node J_R. In our study, the impurity of a node J is calculated by the Gini measure:

$\Delta i(J) = 1 - \sum_{k=1}^{K} (P_k(J))^2$  (2)

where K is the total number of classes and P_k(J) is the class probability of the kth class in node J. The splitting continues until all terminal nodes satisfy some criteria, e.g. until all terminal nodes contain only objects of one class or less than a predetermined number of samples when the classification is not complete yet.

The complete decision tree obtained in the first step has a tendency to overfit, i.e. the training data are partitioned well but the prediction for unknown samples is poor. A solution to this problem is to prune the tree, generalising it so that it predicts unknown new data better. This is done in the second and the third step of CART.

The second step consists of pruning the tree. Usually the complete tree obtained in step 1 contains many terminal nodes. The total number of terminal nodes represents the size of a tree. For a given number of terminal nodes, pruning could lead to many different sub-trees with a smaller number of terminal nodes. In order to choose one of these sub-trees as optimal, Breiman proposed "minimal cost-complexity pruning". The cost-complexity measure $M_\alpha(T)$ is defined as

$M_\alpha(T) = M(T) + \alpha S$  (3)

where M(T) is the misclassification cost of the sub-tree obtained after pruning of a certain branch, S the complexity (the number of terminal nodes in the sub-tree) and α the cost-complexity parameter for that branch, i.e. the decrease of the misclassification cost due to the presence of this branch divided by the number of additional terminal nodes. The optimal pruned sub-tree for a given number of terminal nodes is the one that has the lowest cost-complexity value. Pruning starts from the bottom of the tree. In each pruning step one or several branches with the smallest α are pruned away. The second step results in a series of optimal pruned sub-trees of different size. However, no information about the performance of the sub-trees on unknown data is given at this stage, and only one of these optimal sub-trees should be chosen as the best.

In the third step the best pruned sub-tree is selected from these optimal sub-trees based on the quality of prediction for new data. This is determined by cross-validation on the training data or by using a separate test set. In the context of boosting, the final CART tree obtained is called "a learner f(x)". For details on the construction of a CART tree we refer to [15].
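As a concrete illustration of the splitting rule, the following is a minimal numpy sketch of Eqs. (1) and (2); it is not the authors' implementation, and the toy data and names are purely illustrative. It computes the Gini impurity of a node and the impurity decrease obtained for one candidate threshold s of a single variable.

```python
# Minimal sketch of Eqs. (1)-(2), not the authors' code: Gini impurity of a node
# and the impurity decrease for splitting one variable x at a threshold s.
import numpy as np

def gini(y):
    """Gini measure of Eq. (2): 1 - sum_k P_k^2 over the classes present in the node."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def impurity_decrease(x, y, s):
    """Impurity decrease of Eq. (1) for the split x < s vs. x >= s."""
    left = x < s
    f_left = left.mean()                      # fraction of samples going to the left child
    f_right = 1.0 - f_left
    return gini(y) - f_left * gini(y[left]) - f_right * gini(y[~left])

# Illustrative data: pick the threshold that maximizes Eq. (1) for one variable.
x = np.array([0.1, 0.4, 0.5, 0.9, 1.2, 1.5])
y = np.array([1, 1, 1, 2, 2, 2])
candidates = (x[:-1] + x[1:]) / 2             # midpoints between consecutive x values
best_s = max(candidates, key=lambda s: impurity_decrease(x, y, s))
print(best_s, impurity_decrease(x, y, best_s))
```

With this toy data the best threshold falls between 0.5 and 0.9 and removes all impurity, i.e. the impurity decrease equals the Gini impurity of the parent node (0.5).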

In our study, CART is performed with the "treefit" function supplied by the Statistics Toolbox 4.0 in Matlab® 6.5. This function uses minimal cost-complexity pruning as the pruning method. The right size of the tree is selected through 10-fold cross-validation of the training data.
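For readers without the MATLAB toolbox used in the paper, the analogous workflow can be sketched with scikit-learn (an assumption of this illustration, not the software used by the authors): the pruning path supplies candidate values of the cost-complexity parameter of Eq. (3), and 10-fold cross-validation selects the tree size.

```python
# Illustrative alternative to the MATLAB "treefit" workflow described above,
# using scikit-learn. X_train and y_train are assumed to be numpy arrays.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

def fit_pruned_cart(X_train, y_train, random_state=0):
    # Candidate alphas of the minimal cost-complexity pruning path (cf. Eq. (3)).
    path = DecisionTreeClassifier(random_state=random_state).cost_complexity_pruning_path(X_train, y_train)
    cv_scores = [cross_val_score(DecisionTreeClassifier(ccp_alpha=a, random_state=random_state),
                                 X_train, y_train, cv=10).mean()
                 for a in path.ccp_alphas]
    best_alpha = path.ccp_alphas[int(np.argmax(cv_scores))]
    return DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=random_state).fit(X_train, y_train)
```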

2.1.2. Evaluation of the importance of variables in CART

Breiman proposed a variable ranking method for the interpretation of the relative importance of variables [15]. The importance of a variable in a CART learner f(x) is determined by the decrease of impurity across the tree, summed over all non-terminal nodes J that use this variable as a splitter:

$TI(x_i, f(x)) = \sum_{J \in f(x)} \Delta I(x_i, J)$  (4)

The relative importance of a variable in a CART tree is given by:

$RTI(x_i, f(x)) = \frac{TI(x_i, f(x))}{\max(TI(x, f(x)))} \times 100\%$  (5)
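A short sketch of Eqs. (4) and (5), assuming the summed impurity decreases TI(x_i, f(x)) have already been accumulated over the non-terminal nodes of a fitted tree; the numbers below are hypothetical. (scikit-learn's feature_importances_ attribute exposes a closely related impurity-based importance, normalized to sum to one rather than to a 100% maximum.)

```python
# Sketch of Eqs. (4)-(5): re-scale the summed impurity decreases of each variable
# so that the most important variable gets 100%. The TI values are hypothetical.
import numpy as np

def relative_importance(ti):
    ti = np.asarray(ti, dtype=float)
    return 100.0 * ti / ti.max()

ti = [0.08, 0.30, 0.0, 0.15]          # TI(x_i, f(x)) for four variables of one tree
print(relative_importance(ti))        # -> approximately [26.7, 100, 0, 50]
```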

2.2. Boosting

AdaBoost is the most frequently used boosting method. Depending on the purpose and data structure, variants of the AdaBoost algorithm have been developed. Here three such algorithms have been studied, namely Discrete AdaBoost, Real AdaBoost and AdaBoost.MH. The first two are limited to two-class classification while the third is able to deal with multi-class problems.

2.2.1. Discrete AdaBoost algorithm [16]

Consider a training data set with N samples belonging to two classes. The two classes are defined as y ∈ {−1, 1}, i.e. samples in class y = −1 are all given y value −1 and samples in class y = 1 are all given y value 1.

The Discrete AdaBoost algorithm consists of the following steps:

1. Assign initial equal weights to each sample in the original training data set: $w_i^1 = 1/N$, i = 1, 2, ..., N.
2. For iterations t = 1, 2, ..., T:
   (a) Select a data set with N samples from the original training data set using bootstrap resampling, i.e. samples are selected randomly with replacement. The chance for a sample to be selected is related to its weight. A sample with a higher weight has a higher probability to be selected.
   (b) Obtain a learner f^t(x) from the re-sampled data set. In our case, this is done with CART.
   (c) Apply the learner f^t(x) to the original training data set. If a sample is misclassified, its error $err_i^t = 1$, otherwise its error is 0.
   (d) Compute the sum of the weighted errors of all training samples:

       $err^t = \sum_i w_i^t\, err_i^t$  (6)

       The confidence index of the learner f^t(x) is calculated as:

       $c^t = \log((1 - err^t)/err^t)$  (7)

       The lower the weighted error made by the learner f^t(x) on the training samples, the higher the confidence index of the learner f^t(x).
   (e) Update the weights of all original training samples:

       $w_i^{t+1} = w_i^t \exp(c^t\, err_i^t)$, i = 1, 2, ..., N  (8)

       The weights of the samples that are correctly classified are unchanged while the weights of the misclassified samples are increased.
   (f) Re-normalize w so that $\sum_i w_i^{t+1} = 1$.
   (g) t = t + 1. If $err^t \le 0.5$ and t < T, repeat steps (a)–(f); otherwise, stop and T = t − 1. After T iterations in step 2, there are T learners f^t(x), t = 1, ..., T.
3. The performance of Discrete AdaBoost is evaluated by a test set. For a sample j of the test set, the final prediction is the combined prediction obtained from the T learners. Each prediction is multiplied by the confidence index of the corresponding learner f^t(x). The higher the confidence index of a learner f^t(x), the higher its role in the final decision:

   $y_j = \mathrm{sign}\left(\sum_t c^t f^t(x_j)\right)$  (9)

   Here sign is a function that has two possible output labels:

   $\mathrm{sign}(*) = \begin{cases} -1 & \text{if } * < 0 \\ 1 & \text{if } * \ge 0 \end{cases}$

In step 2(d), it sometimes happens that err^t is equal to 0 or 1 and c^t cannot be computed. In our study we set err^t = 0.995 when it was larger than 0.995 and err^t = 0.005 when it was smaller than 0.005. The purpose is to avoid that one c^t would dominate the final prediction over hundreds of others due to its extreme value.
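As a concrete illustration of steps 1–3 above, the following is a compact Python sketch of the Discrete AdaBoost loop (Eqs. (6)–(9)). It is not the authors' MATLAB code: a depth-limited scikit-learn tree stands in for the pruned CART learner, and names such as n_iter are assumptions of this sketch.

```python
# Sketch of Discrete AdaBoost (Section 2.2.1), with a depth-limited tree as the
# weak learner. X is a numpy array and y is coded -1/+1.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def discrete_adaboost(X, y, n_iter=150, random_state=0):
    rng = np.random.default_rng(random_state)
    n = len(y)
    w = np.full(n, 1.0 / n)                                # step 1: equal initial weights
    learners, confidences = [], []
    for _ in range(n_iter):                                # step 2
        idx = rng.choice(n, size=n, replace=True, p=w)     # (a) weight-driven bootstrap resampling
        tree = DecisionTreeClassifier(max_depth=3).fit(X[idx], y[idx])  # (b) weak learner
        err_i = (tree.predict(X) != y).astype(float)       # (c) 0/1 errors on the original data
        err = np.clip(np.sum(w * err_i), 0.005, 0.995)     # (d) weighted error, Eq. (6), clipped
        if err > 0.5:                                      # (g) stop when no better than random
            break
        c = np.log((1.0 - err) / err)                      #     confidence index, Eq. (7)
        w = w * np.exp(c * err_i)                          # (e) raise weights of misclassified samples, Eq. (8)
        w = w / w.sum()                                    # (f) re-normalize
        learners.append(tree)
        confidences.append(c)
    return learners, confidences

def predict_discrete_adaboost(learners, confidences, X):
    """Final prediction, Eq. (9): sign of the confidence-weighted vote."""
    votes = sum(c * t.predict(X) for c, t in zip(confidences, learners))
    return np.where(votes >= 0, 1, -1)
```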

2.2.2. Real AdaBoost algorithm [16]

Discrete AdaBoost results in a set of trees, each weighted by their accuracy. The prediction for a sample is the combination of the weighted predictions of these trees. However, for a given tree, different terminal nodes may have different accuracy. In Real AdaBoost it is considered that a prediction by a terminal node with a higher accuracy should have higher confidence than one obtained by a terminal node with a lower accuracy.

Consider a training data set with N samples and their responses y ∈ {−1, 1}. Steps 1, 2(a) and 2(b) are the same as for Discrete AdaBoost.

(c) For each terminal node h of the learner f^t(x), compute the class probability that the samples allocated to that node belong to class y = 1:

$p_h^t(x) = \frac{\sum_{y_i=1} w_i^t}{\sum_{y_i=1} w_i^t + \sum_{y_i=-1} w_i^t}$  (sample i ∈ node h)  (10)

The confidence index of the node h is calculated as

$c_h^t(x) = \frac{1}{2} \log\left(\frac{p_h^t}{1 - p_h^t}\right)$  (11)

For a sample i allocated to node h, the prediction for that sample is the confidence index of node h, i.e. $c^t(x_i) = c_h^t(x)$. The confidence index has a double function here. The sign of the confidence index decides the classification, i.e. a positive c(x) means that sample i belongs to class y = 1 while a negative c(x) means that it belongs to class y = −1. |c(x)| also measures the confidence of the prediction. The higher |c(x)|, the higher the confidence.

(d) Update the weights of all original training samples:

$w_i^{t+1} = w_i^t \exp(-y_i c^t(x_i))$, i = 1, 2, ..., N  (12)

The weights of correctly classified samples are decreased while those of misclassified samples are increased.

(e) Re-normalize w so that $\sum_i w_i^{t+1} = 1$ and repeat steps (a)–(e). In step 2, after T iterations, there are T learners f^t(x), t = 1, ..., T.

For an unknown sample j, the final prediction is the combination of the predictions made by the T learners:

$y_j = \mathrm{sign}\left(\sum_t c^t(x_j)\right)$  (13)

The prediction does not only give the classification of sample j but also the confidence of that decision.

As for Discrete AdaBoost, p(x) can be zero or one, in which case c(x) can no longer be computed; the same limiting procedure as for Discrete AdaBoost (clipping to the range 0.005–0.995) is adopted in such a case.
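The per-node confidences of Eqs. (10)–(12) can be sketched as follows for one boosting iteration. A fitted scikit-learn tree is assumed as the weak learner (tree.apply() returns the terminal node of each sample), and the class probability is clipped to 0.005–0.995 so that Eq. (11) stays finite, as discussed above; the names are illustrative, not from the original work.

```python
# Sketch of one Real AdaBoost update (Eqs. (10)-(12)); y is coded -1/+1 and w
# holds the current sample weights (summing to 1).
import numpy as np

def real_adaboost_step(tree, X, y, w):
    leaves = tree.apply(X)                                # terminal node h of every sample
    conf = np.zeros(len(y))
    for h in np.unique(leaves):
        in_h = leaves == h
        p = w[in_h & (y == 1)].sum() / w[in_h].sum()      # Eq. (10), weighted class probability
        p = np.clip(p, 0.005, 0.995)
        conf[in_h] = 0.5 * np.log(p / (1.0 - p))          # Eq. (11), node confidence index
    w_new = w * np.exp(-y * conf)                         # Eq. (12), weight update
    return conf, w_new / w_new.sum()                      # step (e): re-normalize
```

The final Real AdaBoost prediction then sums these per-sample confidences over the T learners and takes the sign, as in Eq. (13).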

2.2.3. AdaBoost.MH algorithm [16]

AdaBoost.MH is one of the most popular algorithms for multi-class classification. The basis of this method is to expand the multi-class problem into two-class problems using a "one against all" approach, and then to apply Real AdaBoost. Consider a training data set with N samples coming from K classes.

1. Re-arrange the original data {X, y} into K data sets. Each data set {X, y_ik} considers the samples of one class to have response y = 1, and the samples of the other K − 1 classes are combined into one new class with response y = −1, i.e. samples in the class considered are given y value 1:

   $y_{ik} = \begin{cases} 1 & \text{if } y_i \in k \\ -1 & \text{if } y_i \notin k \end{cases}$  (i = 1, 2, ..., N; k = 1, 2, ..., K)

   This is done for each of the K classes. As a result, each new data set concerns a two-class problem and the classification problem is reduced to deciding whether a sample belongs to the class considered or not.
2. Apply Real AdaBoost to each new data set. For an unknown sample j, obtain $f(x_j, k) = \sum_t c^t(x_j, k)$ (t = 1, ..., T), where T is the number of iterations of Real AdaBoost. Thus, there are K predictions f(x_j, k) for sample j, each predicting whether sample j belongs to the class considered or not and how high the confidence of that prediction is.
3. The final prediction for sample j is given as

   $j \in \arg\max_k f(x_j, k)$

   that is, find the class that has the highest f(x_j, k) value. Then sample j is predicted to belong to that class.
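The one-against-all re-coding of steps 1–3 can be sketched as below. fit_fn and score_fn are hypothetical placeholders for a two-class Real AdaBoost trainer and its summed-confidence scorer (for example built from the earlier sketches); they are not functions defined in the paper.

```python
# One-against-all wrapper in the spirit of AdaBoost.MH. fit_fn(X, y) trains a
# two-class Real AdaBoost model on -1/+1 labels; score_fn(model, X) returns the
# summed confidence f(x_j, k). Both callables are hypothetical placeholders.
import numpy as np

def adaboost_mh_fit(X, y, classes, fit_fn):
    # Steps 1-2: one re-coded two-class problem per class k.
    return {k: fit_fn(X, np.where(y == k, 1, -1)) for k in classes}

def adaboost_mh_predict(models, X, classes, score_fn):
    # Step 3: assign each sample to the class with the largest f(x_j, k).
    scores = np.column_stack([score_fn(models[k], X) for k in classes])
    return np.asarray(classes)[np.argmax(scores, axis=1)]
```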

2.2.4. Evaluation of the importance of variables in boosting

The relative importance of a variable in a CART boosting procedure is the weighted average of its importance in the boosted CART trees:

$RI(x_i) = \frac{1}{T} \sum_{t=1}^{T} \left[ c^t\, RTI(x_i, f^t(x)) \right]$  (14)

where T is the number of iterations in the boosting step, f^t(x) is the tth CART learner and c^t the confidence of f^t(x) in the boosted trees.

The final relative importance of a variable is re-scaled as:

$FRI(x_i) = \frac{RI(x_i)}{\max(RI(x))} \times 100\%$  (15)
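A sketch of Eqs. (14) and (15), assuming rti is a T × (number of variables) array holding the relative importance RTI of every variable in each of the T boosted trees and conf holds the corresponding confidence indices c^t; both names are illustrative.

```python
# Sketch of Eqs. (14)-(15): confidence-weighted average of the per-tree relative
# importances, re-scaled so the most important variable gets 100%.
import numpy as np

def boosting_importance(rti, conf):
    rti = np.asarray(rti, dtype=float)           # shape (T, n_variables)
    conf = np.asarray(conf, dtype=float)         # shape (T,)
    ri = (conf[:, None] * rti).mean(axis=0)      # Eq. (14)
    return 100.0 * ri / ri.max()                 # Eq. (15)
```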

3. Experiment

3.1. Data

Several of the examples are from multivariate calibration instead of classification, and the limit between the two classes considered is therefore somewhat arbitrary. Nevertheless, we consider the examples to be appropriate for a first comparison of the investigated methods.


3.1.1. Wheat data

The wheat data come from [17]. They consist of 100 NIR spectra of wheat samples measured every 2 nm between 1100 and 2500 nm. The response is the moisture content (percent). The spectra are pre-treated by offset correction. Two clusters are observed in the PC1–PC3 plot. The data is randomly split into two subsets: 70 samples in the training set and 30 in the test set.

3.1.2. Cream data

The cream data come from [18]. They consist of 987 NIR spectra of clinical lots of cream samples recorded every 2 nm between 1100 and 2498 nm. The response is the content of the drug substance. The samples belong to five different concentrations (0, 1, 2, 3, 4%), each with four different batches. The NIR spectra are measured on two different instruments at different temperatures and times. The data is randomly split into two subsets, resulting in a calibration set of 450 samples and a test set of 537 samples.

3.1.3. HIV data

The HIV data come from [19]. The data set consists of 208 non-nucleoside reverse transcriptase inhibitors (NNRTIs), of which the biological activity against the wild type and four mutant strains of HIV has been determined. The NNRTIs have been docked into a set of crystal structures of the HIV reverse transcriptase enzyme. The descriptors, 54 in total, are the calculated interaction energies of the NNRTIs with the enzyme, split into Van der Waals and Coulomb interactions of the side chain and backbone parts of the amino acids lining the NNRTI binding site. The response variable is the averaged biological activity expressed as pIC50 against the wild-type HIV virus and four mutant strains (181C, 103N, 100I and 188L). The 208 NNRTIs belong to five chemical classes, called TIBO (2), HEPT-like (92), ITU (1), DATA (65) and DAPY (48), respectively. The data is randomly separated into two subsets: 128 in the calibration set and 80 in the test set.

3.1.4. Green tea data

The green tea data come from [20]. The original data set contains 123 NIR spectra of green tea samples recorded between 1100 and 2500 nm every 2 nm using diffuse reflectance mode. The corresponding y-value is the measured total antioxidant capacity value as described in [20]. Three samples that are considered to be outliers in the previous study were removed, resulting in a set of 120 samples. The data were randomly divided into a training set and a test set, containing 70 samples and 50 samples, respectively.

3.1.5. Chromatographic retention data

The chromatographic retention data come from [21]. The data consist of the logarithm of the retention factor (log kw) obtained on Unisphere PBD at pH 11.7 using isocratic elutions for 83 basic drugs, each characterised by 266 molecular descriptors. The data is randomly divided into two subsets: 53 in the calibration set and 30 in the test set.

3.2. Software

Data analysis was performed in Matlab® for Windows, Version 6.5 (The MathWorks Inc.) with programs developed in our department. CART is from the Statistics Toolbox 4.0 in Matlab® 6.5.

3.3. Data analysis

The predictive ability is evaluated using an independent test set and is defined as

$\text{predictive ability} = \frac{\text{samples correctly classified}}{\text{total samples of the test set}}$

Since the improvement made by boosting is not the same in each run, boosting is repeated 50 times. Unless otherwise stated, the overall predictive ability is reported as the average performance over 50 individual runs. The boosting iteration number is set to T = 150.
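A sketch of this evaluation protocol follows. boost_fit and boost_predict are hypothetical stand-ins for any of the boosting variants sketched in Section 2 (for instance the Discrete AdaBoost functions above), and the argument names are assumptions of this illustration.

```python
# Sketch of Section 3.3: predictive ability on an independent test set, averaged
# over 50 repeated boosting runs of T = 150 iterations each.
import numpy as np

def predictive_ability(y_true, y_pred):
    return np.mean(np.asarray(y_true) == np.asarray(y_pred))

def average_predictive_ability(X_train, y_train, X_test, y_test,
                               boost_fit, boost_predict, n_runs=50, n_iter=150):
    scores = []
    for run in range(n_runs):
        model = boost_fit(X_train, y_train, n_iter=n_iter, random_state=run)
        scores.append(predictive_ability(y_test, boost_predict(model, X_test)))
    return np.mean(scores)
```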

4. Results and discussion

4.1. Wheat data

Fig. 1(a) shows the distribution of the observed y-value of the wheat data, indicating two clusters in y. Two clusters are also observed in the PC1–PC3 plot (Fig. 1(b)). These two PC clusters are consistent with the two clusters in y. Samples in the left cluster of Fig. 1(a) constitute the upper cluster of Fig. 1(b) while those in the right cluster of Fig. 1(a) constitute the lower cluster of Fig. 1(b). No sample has a y-value between 14.42 and 15.07. The object of the classification is to classify the samples as having y < 15 (class 1) or y > 15 (class 2) on the basis of their spectrum. As a result, the training set has 33 samples in class 1 and 37 samples in class 2. The test set has eight samples in class 1 and 22 samples in class 2. The optimal CART tree has an overall predictive ability of 0.867. The overall predictive ability from Discrete AdaBoost with up to 150 iterations is 0.983 and that of Real AdaBoost is 0.998 (Fig. 2, Table 1).

Fig. 1. (a) The distribution of the wheat data according to y value (moisture percent). (b) PC1–PC3 plot of the wheat data; training and test set samples are shown with different markers.

Fig. 2. The overall predictive ability for wheat data using CART (+) and Real AdaBoost (−) for two-class classification.

4.2. Cream data

The cream data are used as an example of both a two-class and a multi-class classification problem. For two-class classification, the samples are separated into a low concentration class (class 1) and a high concentration class (class 2). Samples whose y value is less than 3% are assigned to class 1 and the others to class 2. As a result, the calibration set has 275 samples in class 1 and 175 samples in class 2 while the test set has 318 samples in class 1 and 219 samples in class 2.

Table 1
The overall predictive ability of different methods in two-class classification

Data set                     CART     Discrete AdaBoost    Real AdaBoost
Wheat                        0.867    0.983                0.998
Cream                        0.949    0.981                0.982
HIV                          0.636    0.700                0.712
Green tea                    0.820    0.800                0.800
Chromatographic retention    0.867    0.839                0.817

Fig. 3. The overall predictive ability for cream data using CART (+), Discrete AdaBoost (−) and Real AdaBoost (–·–) for two-class classification.

Fig. 3 and Table 1 indicate that both Discrete AdaBoost and Real AdaBoost perform better than CART.

In multi-class classification, there are five classes, each corresponding to one of the five concentrations. The calibration set has 88, 87, 100, 88 and 87 samples with concentrations 0, 1, 2, 3 and 4%, respectively. The test set has 110, 110, 98, 109 and 110 samples corresponding to the five concentrations. The results demonstrate that the performance of CART (overall predictive ability of 0.760) is much improved when boosting (overall predictive ability of 0.922) is applied (Fig. 4, Table 2). The optimal CART tree uses only 16 variables while in boosting 623 out of the 700 variables in total are used in the 150 CART trees, leading to the decision rule. The poor prediction ability of CART suggests that for the cream data those 16 variables are not enough to explain the concentrations well.

Fig. 4. The overall predictive ability for cream data using CART (+) and AdaBoost.MH (−) for multi-class classification.

Table 2
The overall predictive ability of different methods in multi-class classification

Data set    CART     AdaBoost.MH
Cream       0.760    0.922
HIV         0.837    0.900

4.3. HIV data

In the two-class classification, samples with biological activity below 7.1 are allocated to class 1 and the others are allocated to class 2. The calibration set has 56 samples in class 1 and 72 samples in class 2. The test set has 33 samples in class 1 and 47 samples in class 2. Fig. 5 and Table 1 show that both Discrete AdaBoost and Real AdaBoost enhance the prediction ability of CART and that Real AdaBoost performs a little better than Discrete AdaBoost.

The single optimal CART tree of the training set uses four variables and results in a predictive ability of 0.650 (Fig. 6(a)). These four variables are variables 17 (A229vsc), 3 (A95vsc), 1 (B136vsc) and 51 (A101cbb), with relative importance indexes of 100, 53, 24 and 23%, respectively. These variables correspond to the Van der Waals interaction energies of the NNRTIs with the A229 tryptophane, A95 proline and B136 glutamine side chains, and the Coulomb interaction with the A101 lysine backbone. In Ref. [19] we have performed a detailed study on the CART tree of the complete HIV data set. Though the present CART tree, built using only 128 samples of the 208 NNRTIs, is not identical, it also has A229vsc as the first primary splitter. In Ref. [19] it was pointed out that the Van der Waals interaction with the side chain of the tryptophane A229 residue (A229vsc) is highly related to the activity of the NNRTIs. This can be attributed to the fact that the A229 residue, which is situated at the "roof" of the RT binding site, is highly conserved in HIV-RT. Both trees also consider the Van der Waals interaction with the side chain of the proline A95 residue (A95vsc) as an important factor. A drawback of a single CART is that it is a stepwise procedure, that is, in every split only one variable is selected as optimal (even when other suitable variables may be found as competitors or surrogates). Also, if a splitter is not correctly selected because of the presence of outliers, this error may propagate as the next splits are based upon it. For this reason, CART needs a detailed study of each node split, making the analysis complex.

Such a study is impossible and unnecessary with boosting. Boosting is a combination of many trees, each tree focusing on samples not fitted well by the previous trees. Therefore, if a previous tree does not work well, boosting may correct this in other trees.

A single boosting run uses more variables, as shown in Fig. 6(b), and produces a predictive ability of 0.763. The top 10 important variables are 3 (A95vsc), 43 (A235vbb), 51 (A101cbb), 18 (A234vsc), 5 (A100vsc), 4 (A97vsc), 21 (B138vbb), 46 (B138csc), 49 (A318csc) and 10 (A179vsc). Their relative importances are 100, 90, 65, 61, 58, 56, 54, 52, 52 and 49%, respectively. The most important variable in our single optimal CART tree, A229vsc, does not appear in the top 10 important variable list, while A95vsc and A101cbb appear again, with A95vsc becoming the most important variable. Some of the top 10 important residues are found to be highly conserved and resistant against mutation, such as A318 and A100. Using the same data set, Xu et al. built a multivariate adaptive regression splines (MARS) model to relate the activity of the NNRTIs with the interaction energies with the residues [22]. Of the top 14 important variables they found, six belong to our top 10 list, i.e. A95vsc, A235vbb, A101cbb, A100vsc, B138vbb and B138csc.

Fig. 5. The overall predictive ability for HIV data using CART (+), Discrete AdaBoost (−) and Real AdaBoost (–·–) for two-class classification.

Fig. 6. (a) The overall predictive ability for HIV data from a single CART (+) and a single Discrete AdaBoost (−) run for two-class classification; (b) the relative variable importance of the HIV data in the Discrete AdaBoost run of (a). The total boosting iteration number is 96.

Fig. 7. The overall predictive ability for HIV data using CART (+) and AdaBoost.MH (−) for multi-class classification.

As described above, the 208 samples belong to five chemical classes. Since two of the classes, TIBO and ITU, contain only two samples and one sample, respectively, these three samples are removed and the multi-class classification study only considers the samples belonging to the HEPT-like, DATA and DAPY classes of compounds. The calibration set contains 57 HEPT-like, 38 DATA and 30 DAPY samples. The three classes of samples in the test set contain 35, 27 and 18 samples, respectively. Fig. 7 and Table 2 demonstrate that boosting increases the overall predictive ability of multi-class classification (0.900) with respect to that of CART (0.837).

The single optimal CART tree uses two variables, variable 6 (A101vsc) and variable 20 (A318vsc), to distinguish the three classes. The relative importances of the two variables are 100 and 17%, respectively. The predictive ability of this tree is 0.838. In Ref. [19], A101vsc is found as a competitor of the first primary splitter A100vbb, which separates the HEPT-like NNRTIs from the other classes.

Unlike a single CART tree that only uses two variables, a single boosting run for three-class classification with 150 iterations contains 450 (150 × 3) CART trees, and 53 out of the 54 variables have been used. The predictive ability is improved to 0.900. The overall top 10 important variables are variables 2 (B138vsc), 50 (A100cbb), 20 (A318vsc), 29 (A103vbb), 16 (A227vsc), 10 (A179vsc), 47 (A103csc), 17 (A229vsc), 46 (B138csc) and 49 (A318csc). Their relative importances are 100, 98, 68, 64, 59, 56, 53, 50, 42 and 38%, respectively. The 450 CART trees can be divided into three groups, each group containing the information that specifically separates one class from the other two classes. Table 3 lists the important variables for distinguishing each class from the other two classes. The importance of the A100cbb and B138vsc variables can be traced back to the different binding modes of the HEPT-like compounds on one hand, and the DATA and DAPY compounds on the other hand. The latter form a direct hydrogen bond with the carbonyl and amide groups of lysine A100. In the HEPT-like compounds, this interaction is mediated by a structural water molecule, which also interacts with the acid group of glutamate B138 [23].

Table 3
The top 10 important variables of the HIV data for distinguishing one class from the other two classes

      HEPT-like                                   DATA                                        DAPY
No.   Variable       Relative importance (%)      Variable       Relative importance (%)      Variable       Relative importance (%)
1     2 (B138vsc)    100                          50 (A100cbb)   100                          20 (A318vsc)   100
2     26 (A100vbb)   56                           10 (A179vsc)   82                           17 (A229vsc)   52
3     16 (A227vsc)   43                           29 (A103vbb)   81                           29 (A103vbb)   49
4     50 (A100cbb)   40                           16 (A227vsc)   55                           50 (A100cbb)   47
5     6 (A101vsc)    33                           47 (A103csc)   52                           28 (A102vbb)   47
6     54 (A236cbb)   30                           17 (A229vsc)   49                           2 (B138vsc)    43
7     46 (B138csc)   26                           46 (B138csc)   37                           5 (A100vsc)    40
8     49 (A318csc)   23                           25 (A99vbb)    34                           13 (A188vsc)   38
9     22 (A95vbb)    20                           49 (A318csc)   31                           47 (A103csc)   35
10    4 (A97vsc)     17                           2 (B138vsc)    27                           27 (A101vbb)   33

4.4. Green tea data

Samples are divided into two classes according to their total antioxidant capacities. Samples whose total antioxidant capacity is less than 22 are considered to be members of class 1, while samples whose total antioxidant capacity is equal to or greater than 22 are defined as members of class 2. As a result, the training set has 11 samples in class 1 and 59 samples in class 2. The test set has nine samples in class 1 and 41 samples in class 2.

Fig. 8. The overall predictive ability for green tea data using CART (+), Discrete AdaBoost (−) and Real AdaBoost (–·–) for two-class classification.

A single CART prediction has an overall predictive ability of 0.820. Both Discrete and Real AdaBoost improve the correct rate at the beginning of the iterations but fail as the iteration number increases (Fig. 8 and Table 1). Generally AdaBoost is considered to be resistant with respect to overfitting. However, there are also reports that overfitting may occur in boosting [24]. Results from the green tea data suggest that the total iteration number should be considered during boosting. Since there is no absolute rule for the selection of the total iteration number in boosting, the solution could be internal cross-validation or using an independent test set, as suggested by [12].
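One simple way to implement this suggestion is to track the predictive ability on a held-out set (or by internal cross-validation) after each boosting iteration and keep the best iteration number. The sketch below assumes the learners and confidences returned by a Discrete AdaBoost implementation such as the one sketched in Section 2.2.1 and applies the decision rule of Eq. (9) after 1, 2, ..., T iterations; it is an illustration, not part of the original study.

```python
# Hedged sketch: staged predictive ability on a validation set, used to choose
# the total iteration number T. y_val is coded -1/+1.
import numpy as np

def staged_predictive_ability(learners, confidences, X_val, y_val):
    votes = np.zeros(len(y_val))
    scores = []
    for c, tree in zip(confidences, learners):
        votes += c * tree.predict(X_val)                          # running weighted vote, Eq. (9)
        scores.append(np.mean(np.where(votes >= 0, 1, -1) == y_val))
    return scores

# best_T = int(np.argmax(staged_predictive_ability(learners, confidences, X_val, y_val))) + 1
```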

4.5. Chromatographic retention data

Samples in the training and test set are separated into two classes (low and high retention) according to log kw, with a threshold value of 1.6010. Class 1 contains samples whose log kw is equal to or less than 1.6010 while class 2 contains samples with log kw larger than 1.6010. The calibration set includes 22 samples in class 1 and 31 samples in class 2. The test set has 14 samples in class 1 and 16 samples in class 2. 50 boosting runs are performed, each with 300 iterations. The results from the average of the 50 runs indicate that, though both methods tend to improve the overall predictive ability as the iteration number increases, their performance is not better than that of a single CART (Fig. 9, Table 1).

Fig. 9. The overall predictive ability for chromatographic data using CART (+), Discrete AdaBoost (−) and Real AdaBoost (–·–) for two-class classification.

The failure of boosting may be due to the specific characteristics of this data set. The optimal CART only uses one variable, variable 1 (log P, the logarithm of the octanol/water partition coefficient), and has a predictive ability of 0.867. Since boosting contains many CART trees, each focusing on a certain type of samples, many more variables are used. These variables may in this case contribute noise to the boosting model and lead to a worse prediction.

5. Conclusion

The application of boosting in classification for five real chemical data sets has been studied. The data sets were chosen such that the classification is not simple. The results show that boosting may indeed improve the overall predictive ability for classification, though this is not always the case. In some cases, Real AdaBoost performs slightly better than Discrete AdaBoost, but the difference is not of practical importance.

Variable importance analysis demonstrates that the improved performance is related to the variables selected for constructing a CART learner. In contrast with a single CART, which only uses a few variables, boosting uses more variables. CART can be an unstable method: a small change in the data may cause a big change in a tree. By boosting, many CART trees are produced, each focusing on some special samples. This leads to the use of more variables, and some of these may have special importance for identifying those special samples. This is why boosting can improve the performance of CART and also make the result more robust.

The data sets studied are typical for chemometric studies in the sense that the variables are much more correlated than is the case in applications of CART and boosting outside chemometrics, and in fact also than in the published chemical applications [13,14], which concern mass spectra. While it is shown that both methods also have potential for data sets with highly correlated variables, further study is needed to decide what the advantages and disadvantages of the methodology are compared to other supervised pattern recognition methods.

References

[1] R.E. Schapire, The strength of weak learnability, Mach. Learning 5 (2) (1990) 197–227.
[2] Y. Freund, R.E. Schapire, Experiments with a new boosting algorithm, in: Proceedings of the 13th International Conference on Machine Learning, Morgan Kaufmann, 1996, pp. 148–156.
[3] E. Bauer, R. Kohavi, An empirical comparison of voting classification algorithms: bagging, boosting, and variants, Mach. Learning 36 (1–2) (1999) 105–142.
[4] B.M. Namee, P. Cunningham, S. Byrne, O.I. Corrigan, The problem of bias in training data in regression problems in medical decision support, Artif. Intell. Med. 24 (1) (2002) 51–70.

[5] M. Dettling, P. Buhlmann, Boosting for tumor classification with gene expression data, Bioinformatics (Oxford) 19 (9) (2003) 1061–1069.
[6] A.S. Atukorale, T. Downs, P.N. Suganthan, Boosting the HONG network, Neurocomputing 51 (2003) 75–86.
[7] L. Breiman, Bagging predictors, Mach. Learning 26 (2) (1996) 123–140.
[8] M. Kearns, L. Valiant, Cryptographic limitations on learning boolean formulae and finite automata, J. ACM – Association for Computing Machinery 41 (1) (1994) 67–95.
[9] Y. Freund, R.E. Schapire, A decision-theoretic generalization of on-line learning and an application to boosting, J. Comput. System Sci. 55 (1997) 119–139.
[10] N. Littlestone, M.K. Warmuth, The weighted majority algorithm, Inform. Comput. 108 (1994) 212–261.
[11] J.H. Friedman, Stochastic gradient boosting, Comput. Stat. Data Anal. 38 (2002) 367–378.
[12] P. Buhlmann, B. Yu, Boosting with the L2 loss: regression and classification, J. Am. Stat. Assoc. 98 (462) (2003) 324–339.
[13] K. Varmuza, P. He, K.T. Fang, Boosting applied to classification of mass spectral data, J. Data Sci. 1 (2003) 391–404.
[14] P. He, C.J. Xu, Y.Z. Liang, K.T. Fang, Improving the classification accuracy in chemistry via boosting technique, Chemom. Intell. Lab. Syst. 70 (2004) 39–46.
[15] L. Breiman, J.H. Friedman, R.A. Olshen, C.J. Stone, Classification and Regression Trees, Wadsworth International Group, Monterey, CA, 1984.
[16] J. Friedman, T. Hastie, R. Tibshirani, Additive logistic regression: a statistical view of boosting, Ann. Stat. 28 (2000) 337–407.
[17] J.H. Kalivas, Two data sets of near infrared spectra, Chemom. Intell. Lab. Syst. 37 (1997) 255–259.
[18] J. Luypaert, S. Heuerding, D.L. Massart, Direct orthogonal signal correction as an alternative for standardization in classification of clinical lots by near infrared spectroscopy, Anal. Chim. Acta, submitted for publication.
[19] M. Daszykowski, B. Walczak, Q.S. Xu, F. Daeyaert, M.R. de Jonge, J. Heeres, L.M.H. Koymans, P.J. Lewi, H.M. Vinkers, P.A. Janssen, D.L. Massart, Classification and regression trees – studies of HIV reverse transcriptase inhibitors, J. Chem. Inf. Comput. Sci. 44 (2004) 716–726.
[20] M.H. Zhang, J. Luypaert, J.A. Fernandez Pierna, Q.S. Xu, D.L. Massart, Determination of total antioxidant capacity in green tea by near-infrared spectroscopy and multivariate calibration, Talanta 62 (2004) 25–35.
[21] R. Put, C. Perrin, F. Questier, D. Coomans, D.L. Massart, Y. Vander Heyden, Classification and regression tree analysis for molecular descriptor selection and retention prediction in chromatographic quantitative structure-retention relationship studies, J. Chromatogr. A 998 (2003) 261–276.
[22] Q.S. Xu, M. Daszykowski, B. Walczak, F. Daeyaert, M.R. de Jonge, J. Heeres, L.M.H. Koymans, P.J. Lewi, H.M. Vinkers, P.A. Janssen, D.L. Massart, Multivariate adaptive regression splines (MARS) – studies of HIV reverse transcriptase inhibitors, Chemom. Intell. Lab. Syst. 72 (2004) 27–34.
[23] A.L. Hopkins, J. Ren, R.M. Esnouf, B.E. Willcox, E.Y. Jones, C. Ross, T. Miyasaka, R.T. Walker, H. Tanaka, D.K. Stammers, D.I. Stuart, Complexes of HIV-1 reverse transcriptase with inhibitors of the HEPT series reveal conformational changes relevant to the design of potent non-nucleoside inhibitors, J. Med. Chem. 39 (1996) 1589–1600.
[24] W. Jiang, Does boosting overfit: views from an exact solution, Technical Report 00-04, Department of Statistics, Northwestern University, September 2000.