
A method to Boost Naïve Bayesian Classifiers

DIAO Lili, HU Keyun, LU Yuchang and SHI Chunyi
Computer Science and Technology Dept., Tsinghua University
The State Key Laboratory of Intelligent Technology and System
Beijing, China, 100084
[email protected]

Abstract. In this paper, we introduce a new method to improve the performance of combining boosting and naïve Bayesian learning for text categorization. Instead of combining boosting and naïve Bayesian learning directly, which has been shown to yield little or no improvement, we incorporate different feature extraction methods into the construction of the naïve Bayesian classifiers, and hence generate very different, or unstable, base classifiers for boosting. In addition, because the number of distinct feature extraction methods is very limited, we modify the weight-adjusting step of the boosting algorithm to pursue a specific goal: minimizing the overlapping errors of its constituent classifiers. We conducted a series of experiments which show that the new method not only performs much better than naïve Bayesian classifiers or directly boosted naïve Bayesian classifiers, but also reaches its optimal performance much more quickly than boosting stumps or boosting decision trees incorporated with naïve Bayesian learning.

Keywords: Text categorization, Boosting, Naïve Bayesian Learning, Feature extraction

1. Introduction

Boosting is an iterative machine learning procedure that successively classifies a weighted version of the sample and then re-weights the sample depending on how successful the classification was. Its purpose is to find a highly accurate classification rule by combining many weak or base hypotheses (classifiers), many of which may be only moderately accurate. As boosting progresses, training examples (and their corresponding labels) that are easy to classify get lower weights. The intended effect is to force the base classifier to concentrate on the highly weighted, "difficult to identify" samples, which benefits the overall goal of finding a highly accurate classification rule (Freund, Y. and Schapire, R. 1997).

Text categorization is the problem of classifying text documents into categories or classes by means of machine learning techniques. The naïve Bayesian classifier is an effective classification method, though its performance is strongly influenced by the completeness of the training examples and is therefore usually not optimal. Boosting decision trees has proved very successful for text categorization and other machine learning problems (Schapire, R. and Singer, Y., 2000), but some experiments show that naïve Bayesian classifiers cannot be effectively improved by boosting (Kai Ming Ting and Zijian Zheng, 2000). The most likely reason is that the naïve Bayesian classifier is quite stable with respect to small disturbances of the training data, since such disturbances do not result in important changes to the estimated probabilities. Therefore, boosting naïve Bayesian classifiers may not generate multiple models with sufficient diversity; these models cannot effectively correct each other's errors during classification by majority voting or weighted combination. In such a situation, boosting cannot greatly reduce the error of naïve Bayesian classification. In this sense, the stability of naïve Bayesian classifiers becomes a real problem when employing them as the base classifiers of boosting. Kai Ming Ting and Zijian Zheng proposed to


introduce tree structures into naïve Bayesian classification to increase its instability, in the expectation that this would improve the success of boosting for naïve Bayesian classification (Kai Ming Ting and Zijian Zheng, 2000). This method provides a classifier with lower error rates than the naïve Bayesian one. However, to achieve optimal results those naïve Bayesian trees need to be large; many of them need to have depth over 5. Since the leaves of these trees are all naïve Bayesian classifiers using all training samples, convergence becomes very slow and the computational cost is high, especially for text categorization tasks, which usually have a huge number of training documents of very high dimensionality. In this paper we propose a method, quite different from naïve Bayesian trees, to obtain "unstable" naïve Bayesian classifiers for boosting: using different feature extraction methods to establish different feature sets, from which we obtain a different VSM representation of each training document (Salton, G., Wong, A. and Yang, C., 1995). As we know, different feature extraction methods select different features (though generally most of them coincide), but how the VSM vectors thus generated influence the naïve Bayesian classifiers still needs to be examined. Because the number of different feature extraction methods is rather small, we limit the number of boosting iterations to exactly the number of feature extraction methods combined, one for each naïve Bayesian classifier. Under this condition, we modify the weight-adjusting criteria of boosting to place extra emphasis on the sample documents that are helpful in minimizing the overlapping errors among the constituent classifiers. Besides, we employ an un-weighted majority vote for combining the base naïve Bayesian classifiers, because we expect the classification results not to be influenced by the order of the feature extraction methods.

Sections 2 and 3 introduce the basic ideas of boosting, naïve Bayesian learning and feature extraction for text categorization, respectively. Section 4 illustrates our new algorithm in detail. Section 5 presents experimental results comparing the performance of the new algorithm with that of related methods. Section 6 concludes our work.

2. Naïve Bayesian classifier and Boosting

The Bayesian approach to classification estimates the (posterior) probability that an instance belongs to a class, given the observed attribute values (features) for the instance. When making a categorical rather than probabilistic classification, the class with the highest estimated posterior

probability is selected. The posterior probability $P(C_j \mid V)$ of an instance being of class $C_j$, given the observed attribute values $V = (v_1, v_2, \ldots, v_n)$, can be computed using the prior probability $P(C_j)$ of an instance being of class $C_j$; the probability $P(V \mid C_j)$ of an instance of class $C_j$ having the observed attribute values; and the prior probability $P(V)$ of observing the attribute values. In the naïve Bayesian approach, the observed attribute values are assumed to be mutually independent given each class. With this attribute independence assumption, the probability $P(C_j \mid V)$ can be re-expressed as:

$$P(C_j \mid V) \propto P(C_j)\,P(V \mid C_j) \propto P(C_j)\prod_{i=1}^{|V|} P(v_i \mid C_j).$$
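To make the rule above concrete, here is a minimal sketch (ours, not taken from the paper) of how the log-domain version of this decision rule could be computed from pre-estimated probabilities; the dictionaries `priors` and `likelihoods` are hypothetical stand-ins for $P(C_j)$ and $P(v_i \mid C_j)$.

```python
import math

def naive_bayes_classify(instance, priors, likelihoods):
    """Pick the class maximizing P(C_j) * prod_i P(v_i | C_j).

    instance    : list of observed attribute values v_1..v_n
    priors      : dict class -> P(C_j)
    likelihoods : dict class -> dict value -> P(v_i | C_j)
    Log probabilities are used to avoid floating-point underflow.
    """
    best_class, best_score = None, float("-inf")
    for c, prior in priors.items():
        score = math.log(prior)
        for v in instance:
            # A tiny floor stands in for proper smoothing of unseen values.
            score += math.log(likelihoods[c].get(v, 1e-9))
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```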

The description of boosting follows Freund and Schapire's AdaBoost.M1 (Freund, Y. and Schapire, R. 1997). We assume a set of training cases $i = 1, 2, \ldots, N$. At each repetition or trial $s = 1, 2, \ldots, T$, case $i$ has weight $w_s[i]$, where $w_1[i] = 1/N$ for all $i$. The process at trial $s$ can be summarized as follows:

The base learning system constructs classifier $H_s$ from the training cases using the weights $w_s$.

The error rate $\varepsilon_s$ of this classifier on the training data is determined as the sum of the weights $w_s[i]$ over the misclassified cases $i$.

The weight of each training case is updated for all $i$:

$$w_{s+1}[i] = \begin{cases} w_s[i]\,/\,(2\varepsilon_s) & \text{if } H_s \text{ misclassifies case } i, \\ w_s[i]\,/\,\big(2(1-\varepsilon_s)\big) & \text{otherwise.} \end{cases}$$

The composite classifier $H_B$ is obtained by voting each of the component classifiers $\{H_s\}$: if $H_s$ classifies some case $x$ as belonging to class $k$, the total vote for $k$ is incremented by $\log\big((1-\varepsilon_s)/\varepsilon_s\big)$; $H_B$ then assigns $x$ to the class with the greatest total vote.

Provided the learning system can reliably generate, from weighted cases, a hypothesis that has less than 50% error on those cases, a sequence of "weak" classifiers $\{H_s\}$ can be boosted to a "strong" classifier $H_B$ that is at least as accurate as, and usually more accurate than, the best weak classifier.
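The loop below is a compact sketch of the AdaBoost.M1 procedure just described, assuming a hypothetical `train_base(X, y, w)` routine that returns a classifier supporting `predict`; it is meant only to illustrate the weight update and the weighted vote.

```python
import math

def adaboost_m1(X, y, train_base, T):
    """AdaBoost.M1 sketch: X is a list of cases, y their labels."""
    N = len(X)
    w = [1.0 / N] * N
    ensemble = []                         # list of (classifier, vote_weight)
    for s in range(T):
        h = train_base(X, y, w)
        wrong = {i for i in range(N) if h.predict(X[i]) != y[i]}
        eps = sum(w[i] for i in wrong)
        if eps >= 0.5:                    # weak learner requirement violated
            break
        alpha = math.log((1 - eps) / max(eps, 1e-12))  # guard against eps == 0
        ensemble.append((h, alpha))
        if eps == 0:
            break
        # Re-weight: misclassified cases up, correct cases down; stays normalized.
        for i in range(N):
            w[i] /= 2 * eps if i in wrong else 2 * (1 - eps)
    return ensemble

def predict_boosted(ensemble, x):
    votes = {}
    for h, alpha in ensemble:
        k = h.predict(x)
        votes[k] = votes.get(k, 0.0) + alpha
    return max(votes, key=votes.get)
```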

In quite a number of experiments, boosting cannot increase, and sometimes even decreases, the accuracy of naïve Bayesian classifier learning, which is in sharp contrast to the success of boosting decision trees. One key difference between naïve Bayesian classifiers and decision trees, with respect to boosting, is the stability of the classifiers: naïve Bayesian classifier learning is relatively stable with respect to small changes to the training data, but decision tree learning is not. How to make the naïve Bayesian classifiers unstable (that is, quite different from one another) therefore becomes a pressing problem when incorporating Bayesian ideas into the boosting method. Fortunately, we observed that when the attributes of each training case change, the naïve Bayesian classifier constructed from these training cases becomes quite "unstable", which solves the problem. In the text categorization context there are several feature (attribute) extraction methods, each of which constructs a different feature set, and hence the VSM representation of each training document differs from one method to another. Thus, it becomes an intriguing problem in text categorization to employ different feature extraction methods for constructing unstable naïve Bayesian classifiers as base classifiers for boosting.

3. Feature Extraction algorithms for text categorization

VSM (vector space model) is currently the most popular representational model for text documents (Salton, G., Wong, A. and Yang, C. 1995). Given a set of training text documents $D = \{Doc_1, Doc_2, \ldots, Doc_m\}$, any document $Doc_i \in D$, $i = 1, 2, \ldots, m$, can be represented as a formalized term vector (all words or adjacent words are potential terms)

$$V(Doc_i) = \big(val(t_{i1}), \ldots, val(t_{ik}), \ldots, val(t_{in})\big), \quad k = 1, 2, \ldots, n.$$

Here $n$ denotes the number of all possible terms in the space of the training set, and $t_{ik}$ represents the $k$-th term of $Doc_i$. $val(t_{ik})$ is a numeric value used to measure the importance of $t_{ik}$ in $Doc_i$, with $0 \le val(t_{ik}) \le 1$. By this means, the problem of processing text documents is converted into the problem of processing numerical vectors, which is well suited to being solved by mathematical methods.
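As an illustration (not from the paper), a normalized term-frequency variant of such a vector could be built as follows; the tokenization is deliberately simplistic and the term space `vocab` is assumed to be fixed in advance.

```python
from collections import Counter

def vsm_vector(doc_text, vocab):
    """Map a document to a VSM vector over a fixed term space.

    doc_text : raw document string
    vocab    : list of all terms t_1..t_n in the training term space
    Returns a list of values in [0, 1] (term frequency / max frequency).
    """
    tokens = doc_text.lower().split()          # naive tokenization
    counts = Counter(tokens)
    max_count = max(counts.values()) if counts else 1
    return [counts.get(term, 0) / max_count for term in vocab]

# Example: vsm_vector("the cat sat on the mat", ["cat", "dog", "mat", "the"])
```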

Generally we need to extract "important terms", usually called "features" in text mining, from the whole term space of the training documents, according to certain criteria for measuring importance. These criteria correspond to several feature extraction methods, such as Information Gain, Expected Cross Entropy (Koller, D., Sahami, M., 1997), Mutual Information (Yang, Y., Pedersen, J.O., 1997), the Weight of Evidence for Text (Mladenic, D., 1998) and Word Frequency (Yang, Y., Pedersen, J.O., 1997). The formalized models of these criteria can be represented as follows (here $t$ represents any term in the term space, $\bar{t}$ its absence, and $C_j$ denotes any class label in the space of possible labels):

$$InfGain(t) = P(t)\sum_j P(C_j \mid t)\log\frac{P(C_j \mid t)}{P(C_j)} + P(\bar{t})\sum_j P(C_j \mid \bar{t})\log\frac{P(C_j \mid \bar{t})}{P(C_j)}$$

$$TxtCrossEntropy(t) = P(t)\sum_j P(C_j \mid t)\log\frac{P(C_j \mid t)}{P(C_j)}$$

$$TxtMutualInfo(t) = \sum_j P(C_j)\log\frac{P(t \mid C_j)}{P(t)}$$

$$TxtWeightEvid(t) = P(t)\sum_j P(C_j)\log\frac{P(C_j \mid t)\big(1 - P(C_j)\big)}{P(C_j)\big(1 - P(C_j \mid t)\big)}$$

$$WordFrequency(t) = TF(t)$$
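As a hedged illustration of how such scores could be computed from simple document counts (this helper is ours, not part of the paper's algorithm), the sketch below implements the Expected Cross Entropy criterion; the other criteria follow the same counting pattern.

```python
import math

def expected_cross_entropy(term, docs, labels):
    """Score one term: P(t) * sum_j P(C_j|t) * log(P(C_j|t) / P(C_j)).

    docs   : list of token lists (one per training document)
    labels : list of class labels, aligned with docs
    """
    n = len(docs)
    classes = set(labels)
    with_t = [lab for doc, lab in zip(docs, labels) if term in doc]
    p_t = len(with_t) / n
    if p_t == 0:
        return 0.0
    score = 0.0
    for c in classes:
        p_c = labels.count(c) / n
        p_c_given_t = with_t.count(c) / len(with_t)
        if p_c_given_t > 0:
            score += p_c_given_t * math.log(p_c_given_t / p_c)
    return p_t * score
```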

A feature extraction model computes the corresponding numeric value for each term in the term space. The terms with values larger than a certain threshold are selected as "features". Then each document can be represented by a vector in which only the feature terms have nonzero values, while all other terms remain zero. According to many experiments, the feature extraction methods mentioned above generate different feature sets, and hence produce different VSM vectors for each training document. For instance, we have conducted a series of experiments investigating the feature term sets generated by these feature extraction methods. We found that, on average, only about 70% of the terms are common among them, and hence the performance of the same classification system constructed from them also differs, with the gap between the best and the worst exceeding 30%. Experiments based on different data sets also show that there is no single best feature extraction method that fits all kinds of documents well. Two points are important to take into consideration. The first is that a disturbance to the feature set seriously influences the result of classification, so feature extraction has the potential to become a useful tool for constructing unstable Bayesian classifiers. The second is that, since we cannot yet devise an optimal feature extraction method, it becomes interesting to combine the existing ones to achieve better classification performance.

4. Improved boosting Naïve Bayesian learning

Here we introduce our newly devised method for boosting naïve Bayesian classifiers. Before generating each naïve Bayesian classifier as one of the base learners of boosting, we need to run a certain feature extraction method to establish a feature set, based on which a specific naïve Bayesian classifier is constructed. The feature extraction methods introduced in section 3 can be selected to find different feature sets, one for each naïve Bayesian classifier. Rather than being employed in its original form, the naïve Bayesian classifier needs to be updated to meet the requirements of boosting as its base classifier.

As introduced in section 2, the naïve Bayesian classification rule for text categorization is:

Find the maximum of

$$P(C_j \mid Doc) \propto P(C_j)\prod_{i=1}^{|Doc|} P(t_i \mid C_j),$$

and choose the corresponding $C_j$ as the prediction. $C_j$ represents the $j$-th class label in the space of possible labels, and $|C|$ is the size of this space. $P(C_j)$ can be estimated by

$$P(C_j) = \frac{1 + \sum_{i=1}^{|D|} P(C_j \mid Doc_i)}{|C| + |D|},$$

and $P(t \mid C_j)$ can be estimated by

$$P(t \mid C_j) = \frac{1 + \sum_{i=1}^{|D|} N(t, Doc_i)\,P(C_j \mid Doc_i)}{|V| + \sum_{k=1}^{|V|}\sum_{i=1}^{|D|} N(t_k, Doc_i)\,P(C_j \mid Doc_i)}.$$

Here $V$ represents the set of possible terms and $|V|$ measures its size; $N(x, Doc)$ is a function that computes the frequency with which term $x$ appears in document $Doc$. The terms used for text classification are always confined to the terms belonging to the feature set. $P(C_j \mid Doc_i)$ is defined to be 1 when $Doc_i \in C_j$ and 0 otherwise. Because boosting requires its base classifiers to work with a weight distribution over the training documents, we use $P(C_j \mid Doc_i)\cdot w_s[Doc_i]$ in place of $P(C_j \mid Doc_i)$ in the estimates above, where $w_s[Doc_i]$ is the weight (maintained by boosting) of training document $Doc_i$.

Because we expect the classification results not to be influenced by the order of the feature extraction methods selected for the base classifiers, the uniform weights suggested by Breiman will be used for combining the base classifiers, instead of the weights defined for AdaBoost, which depend on the values of $\varepsilon_s$ (Breiman L., 1998; Quinlan J.R. 1998).

Suppose we set the boosting algorithm to $T$ rounds. For 2-class problems, $T$ should be an odd number to avoid ties. If there are more than two classes, ties are still possible when several classes receive the same number of votes; in this situation, the class predicted by the first classifier among those with the maximum votes is preferred. We know that, with the same feature set, naïve Bayesian classifiers cannot differ much despite the weight adjusting. Therefore, we need to choose a different feature extraction method for each boosting round. Let $H_s$ be the classifier generated in the $s$-th boosting round, $s = 1, 2, \ldots, T$, and $H_B$ the composite classifier. Let $P(e_s)$ and $P(e_B)$ denote their respective probabilities of error under the original (uniform) weight distribution over the training cases. The same notation will be extended to more complex events; for instance, $P(e_1 e_2)$ will denote the probability of simultaneous errors by $H_1$ and $H_2$. It is clear that the boosted classifier $H_B$ will misclassify a document only when at least $T/2$ constituent classifiers are in error, so

$$P(e_B) \le P\big(\text{at least } T/2 \text{ classifiers misclassify}\big) = \sum_{S \subseteq \{1,\ldots,T\},\ |S| \ge T/2} P\Big(\bigwedge_{s \in S} e_s \wedge \bigwedge_{s \notin S} \bar{e}_s\Big).$$

Discussions of ensemble classifiers often focus on independence of the constituents (Freund and Schapire, 1997; Dietterich, 1997); under independence this becomes

$$P(e_B) \le \sum_{S \subseteq \{1,\ldots,T\},\ |S| \ge T/2}\ \prod_{s \in S} P(e_s)\ \prod_{s \notin S} \big(1 - P(e_s)\big).$$

Unless $P(e_1), P(e_2), \ldots, P(e_T)$ are all zero, $P(e_B)$ will remain positive. However, consider the situation in which any $T/2$ of the constituent classifiers have non-overlapping errors: then $H_B$ becomes error-free.
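The quantity on the right-hand side of the independence bound can be computed directly; the sketch below (our illustration, not part of the paper) evaluates the probability that at least half of $T$ independent constituent classifiers err simultaneously, given their individual error rates.

```python
from itertools import combinations
from math import prod

def prob_majority_error(error_rates):
    """P(at least T/2 of T independent classifiers err simultaneously)."""
    T = len(error_rates)
    threshold = (T + 1) // 2            # smallest integer count that is >= T/2
    total = 0.0
    for k in range(threshold, T + 1):
        for wrong in combinations(range(T), k):
            wrong_set = set(wrong)
            total += prod(error_rates[s] if s in wrong_set else 1 - error_rates[s]
                          for s in range(T))
    return total

# Example: five classifiers, each with 20% error, assumed independent.
# prob_majority_error([0.2] * 5) is roughly 0.058, well below 0.2.
```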

The goal of boosting should thus be generating constituent classifiers with minimal overlapping errors, rather than independent classifiers. This objective can be achieved easily when the number of constituent classifiers is small. Since the number of feature extraction methods is also very limited, combining this goal with naïve Bayesian classifiers generated by different feature extraction methods in the boosting algorithm will lead to better results. To achieve this goal, we modify the weight adjusting method for the next boosting round. In a multi-class learning problem such as text categorization, a document may still not end up misclassified even if it has already been misclassified by more than $T/2$ base classifiers; only a document for which some class it does not belong to collects the largest number of votes will finally be misclassified. In other words, at any boosting round $s$, if a document has already been misclassified by $T/2$ classifiers, it still has a chance to be correctly classified if, in the remaining rounds, it receives enough correct votes to defeat the combined votes of any incorrect class. Therefore, whether we can give enough correct votes to these documents in the remaining rounds is crucial for minimizing the overlapping errors of the base classifiers. Accordingly, we increase the weight of such a document, no matter whether it is classified correctly or incorrectly in this round. An adaptive coefficient determines the size of the increase: the closer the document is to being irrecoverably misclassified, the larger the coefficient (that is, the more important the document). As for documents that no longer have any chance of receiving a correct final classification, even if all the remaining $T - s$ base classifiers classify them correctly, their weight can be set to 0 from round $s + 1$, since they cannot help in minimizing the overlapping errors. If, at round $s$, a document has been misclassified by fewer than $T/2$ classifiers, we employ the usual weight adjusting criteria of boosting: if it is misclassified at this round its weight increases, otherwise it decreases. To facilitate the convergence of boosting, we also introduce a coefficient to enlarge the weight increment of a document misclassified by $H_s$. This coefficient is also adaptive: the closer the number of previous classifiers misclassifying this document is to $T/2$, the larger the coefficient. If, at round $s$, a document is correctly classified, the reduction of its weight is unchanged.
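Before stating the formal algorithm, the following sketch (our own illustration; the exact coefficient shapes are assumptions consistent with the description above and with the formulas given in the algorithm below) shows the branching logic of the modified weight update for a single document.

```python
import math

def adjust_weight(w, eps, misclassified, ec, cmev, ccv, s, T):
    """Modified per-document weight update for one boosting round.

    w             : current weight of the document
    eps           : weighted error of classifier H_s
    misclassified : whether H_s misclassified this document
    ec            : how many classifiers so far have misclassified it
    cmev          : max votes received so far by any incorrect class
    ccv           : votes received so far by the correct class
    """
    if ec < T / 2:
        # Standard boosting-style update, with an adaptive booster that grows
        # as the error count approaches T/2.
        if misclassified:
            return w * math.exp(1.0 / (T / 2 - ec)) / (2 * eps)
        return w / (2 * (1 - eps))
    # Already misclassified by at least T/2 classifiers.
    if cmev - ccv < T - s:
        # Still recoverable: boost its weight, more strongly the closer the
        # incorrect lead (cmev - ccv) is to the number of remaining rounds.
        return w * math.exp(1 - ((T - s) - (cmev - ccv))) / eps
    # Hopeless: no sequence of remaining correct votes can save it.
    return 0.0
```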

Thus we develop our new boosting algorithm as follows. Given the training document set

$$D = \big\{\big(Doc_1, Y(Doc_1)\big), \big(Doc_2, Y(Doc_2)\big), \ldots, \big(Doc_N, Y(Doc_N)\big)\big\},$$

where $Doc_i$ is any training document, $Y(Doc_i) \in \{C_1, C_2, \ldots, C_L\}$ denotes the class $Doc_i$ belongs to, $L$ is the number of different classes, $N$ is the number of training documents, and $i = 1, 2, \ldots, N$.

If the configured language environment is Chinese, we need to perform Chinese word segmentation to separate the Chinese characters into independent "terms". If the language is English, this step is not needed.

This algorithm maintains a set of weights $W$ as a distribution over the sample documents and labels (classes); i.e., for each $Doc_i \in D$ and each boosting round $s$ there is an associated real value $w_s[i]$. Boosting also maintains an error counter for each training document; the error counter of $Doc_i$ is denoted by $ec[i]$. In addition, boosting maintains, for each training document, an array $cc$ recording the votes for all possible classes; each class has an element of $cc$, and $cc[i, j]$ denotes how many base classifiers predict that $Doc_i$ belongs to $C_j$.

Step #1: Initialize $w_1[i] = 1/N$ for all $i = 1, 2, \ldots, N$; set $ec[i] = 0$ for all $i = 1, 2, \ldots, N$; set $cc[i, j] = 0$ for all $i = 1, 2, \ldots, N$ and all $j = 1, 2, \ldots, L$.

Step #2: For $s = 1, 2, \ldots, T$ ($T$ equals the number of feature extraction methods employed here):

Select a feature extraction method from the feature extraction algorithm set {Information Gain, Expected Cross Entropy, Mutual Information, the Weight of Evidence for Text, Word Frequency, ...}, in any order. Once an algorithm is selected, it is deleted from this set; in this way, each feature extraction algorithm is selected exactly once.

Apply the selected feature extraction method to the whole term space constituted by the terms of all the training documents. Compute the corresponding real score for each term in the term set, and select terms according to a certain ratio (such as 5%, 10%, etc.); for example, the 5% of terms with the largest scores can be selected as the feature set of boosting round $s$. We use a "ratio" here instead of the "threshold" mentioned above for two reasons: 1) it is very likely that we cannot tell which threshold suits a given feature extraction method on a specific training set before running it once, so selecting features by ratio is an easier and safer way to construct the feature set; 2) boosting also requires the base classifiers to have feature sets of the same size so that they compete fairly with each other via majority (un-weighted) voting, which a ratio guarantees easily and a threshold cannot.

According to the selected feature set, the system represents the training documents as VSM vectors $Doc_i = \big(val_i(t_1), val_i(t_2), \ldots, val_i(t_{|V|})\big)$. Here $V$ denotes the whole term space, and $t_k$ ($k = 1, 2, \ldots, |V|$) is the $k$-th term in $V$. If $t_k$ belongs to $Doc_i$ and to the feature set, then $val_i(t_k)$ in the VSM vector is set to 1; otherwise $val_i(t_k)$ is set to zero. Naturally, all the vectors combined form a large, sparse matrix, which allows simplified storage and computation.
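A minimal sketch of this ratio-based selection and the resulting 0/1 vectors (our illustration, with a hypothetical `score_fn` standing in for whichever criterion was drawn this round):

```python
def select_features_by_ratio(terms, score_fn, ratio=0.05):
    """Keep the top `ratio` fraction of terms by feature-extraction score."""
    scored = sorted(terms, key=score_fn, reverse=True)
    k = max(1, int(len(scored) * ratio))
    return set(scored[:k])

def binary_vsm(doc_tokens, term_space, feature_set):
    """0/1 VSM vector over the whole term space, nonzero only for features."""
    present = set(doc_tokens)
    return [1 if t in present and t in feature_set else 0 for t in term_space]
```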

Now we can construct the weighted naïve Bayesian classifier $H_s$. Its classification rule is:

$$H_s(Doc) = \arg\max_j P(C_j \mid Doc) = \arg\max_j P(C_j)\prod_{k=1}^{|Doc|} P(t_k \mid C_j)$$

$$= \arg\max_j\ \frac{1 + \sum_{i=1}^{|D|} P(C_j \mid Doc_i)\,w_s[i]}{|C| + |D|}\ \prod_{m=1}^{|Doc|} \frac{1 + \sum_{i=1}^{|D|} N(t_m, Doc_i)\,P(C_j \mid Doc_i)\,w_s[i]}{|V| + \sum_{k=1}^{|V|}\sum_{i=1}^{|D|} N(t_k, Doc_i)\,P(C_j \mid Doc_i)\,w_s[i]},$$

where $t_m$ is the $m$-th term of $Doc$, for any document $Doc$ represented by its VSM vector.
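A sketch (ours, not the authors' code) of how these weighted estimates could be realized; `docs` are token lists restricted to the round's feature set, `labels` the class labels and `w` the boosting weights.

```python
import math
from collections import Counter, defaultdict

def train_weighted_nb(docs, labels, w, vocab):
    """Weighted naive Bayes with Laplace smoothing, mirroring the rule above.

    docs   : list of token lists (already restricted to the feature set)
    labels : class label of each document
    w      : boosting weight w_s[i] of each document
    vocab  : set of feature terms for this round
    """
    classes = sorted(set(labels))
    prior_num = defaultdict(float)      # weighted class membership sums
    term_num = defaultdict(Counter)     # weighted term counts per class
    for doc, lab, wi in zip(docs, labels, w):
        prior_num[lab] += wi
        for t, n in Counter(doc).items():
            if t in vocab:
                term_num[lab][t] += n * wi
    log_prior, log_like = {}, {}
    for c in classes:
        log_prior[c] = math.log((1 + prior_num[c]) / (len(classes) + len(docs)))
        denom = len(vocab) + sum(term_num[c].values())
        log_like[c] = {t: math.log((1 + term_num[c][t]) / denom) for t in vocab}
    return log_prior, log_like

def predict_weighted_nb(doc, log_prior, log_like):
    # Terms outside the feature set simply contribute nothing.
    return max(log_prior, key=lambda c: log_prior[c] +
               sum(log_like[c].get(t, 0.0) for t in doc))
```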

Use $H_s$ to classify all the training documents. For all $i = 1, 2, \ldots, N$ and all $j = 1, 2, \ldots, L$: if document $Doc_i$ is predicted by $H_s$ to be of class $C_j$, then $cc[i, j] = cc[i, j] + 1$; if this prediction is incorrect, then $ec[i] = ec[i] + 1$.

Compute the weighted error

$$\varepsilon_s = \sum_{i=1}^{N} w_s[i]\cdot \big[\!\big[ Doc_i \text{ is misclassified by } H_s \big]\!\big],$$

where $[\![\cdot]\!]$ is a function that maps its content to 1 if it is true and to 0 otherwise.

If $\varepsilon_s > 1/2$, break.

For all $i = 1, 2, \ldots, N$, do the following.

If $ec[i] < T/2$, then:

$$w_{s+1}[i] = \begin{cases} w_s[i]\times \exp\!\big(1/(T/2 - ec[i])\big)\,/\,(2\varepsilon_s) & \text{if } H_s \text{ misclassifies } Doc_i, \\ w_s[i]\,/\,\big(2(1-\varepsilon_s)\big) & \text{otherwise;} \end{cases}$$

If $ec[i] \ge T/2$, then compute

$$CMEV = \max_{j:\,C_j \ne Y(Doc_i)} cc[i, j], \qquad CCV = cc[i, j] \text{ for the } j \text{ with } C_j = Y(Doc_i);$$

if $CMEV - CCV < T - s$, then

$$w_{s+1}[i] = w_s[i]\times \exp\!\Big(1 - \big((T - s) - (CMEV - CCV)\big)\Big)\times \frac{1}{\varepsilon_s};$$

else $w_{s+1}[i] = 0$.

Step #3: $H_B(Doc)$ = the class $C_j$ with the maximum number of un-weighted votes from $H_s(Doc)$ over all $s = 1, 2, \ldots, T$.
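For completeness, a small sketch (ours) of the Step #3 combination: each of the $T$ per-round classifiers gets exactly one un-weighted vote, and the earliest-trained classifier breaks ties, as described above.

```python
from collections import Counter

def combine_unweighted(predictions):
    """Step #3: un-weighted majority vote over the per-round predictions.

    predictions : list of class labels; predictions[s] is the vote of the
                  classifier built in round s+1. Ties are broken in favour of
                  the earliest classifier whose prediction is among the tied classes.
    """
    votes = Counter(predictions)
    best = max(votes.values())
    tied = {c for c, v in votes.items() if v == best}
    for p in predictions:               # earliest classifier wins the tie
        if p in tied:
            return p

# Example: combine_unweighted(["trade", "corn", "trade"]) -> "trade"
```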

5. Experimental results

We chose 4 other algorithms to perform the same task, text categorization, and conducted a number of experiments to compare their performance with that of our newly devised algorithm, which we denote BNBFE. They are: 1) naïve Bayesian (denoted NB; McCallum, A. and Nigam, K. 1999), 2) boosted naïve Bayesian (denoted BNB), 3) Boosted Leveled naïve Bayesian Trees (denoted BLNBT), a method of boosting decision trees incorporated with naïve Bayesian learning (Kai Ming Ting and Zijian Zheng, 2000), and 4) AdaBoost.MH (denoted ABS; Schapire, R. and Singer, Y. 2000), a boosting-based text categorization method whose performance is better than that of many traditional text categorization algorithms. For these experiments we used the Reuters-21578 collection, which consists of a set of 12902 news stories. The documents have an average length of 211 words (117 after stop word removal). The number of positive examples per category ranges from a minimum of 1 to a maximum of 3964. The term set is identified after removing punctuation and stop words from the training documents.

We chose "precision" and "recall" as the main measures for assessing and comparing the performance of the different text categorization methods. Table 1 presents the precision and recall of the different methods on the 5 most frequent topics of Reuters-21578.

TABLE 1: The performance of 5 text categorization methods

Algorithm        Measure     Trade    Interest  Ship     Wheat    Corn
BNBFE (T=5)      precision   0.715*   0.900*    0.856    0.821    0.833*
                 recall      0.823*   0.795*    0.771*   0.988*   0.960
NB               precision   0.689    0.765     0.773    0.742    0.705
                 recall      0.722    0.622     0.693    0.830    0.844
BNB (T=30)       precision   0.601    0.594     0.765    0.706    0.566
                 recall      0.723    0.618     0.705    0.828    0.851
BLNBT (T=5)      precision   0.649    0.762     0.755    0.752    0.771
                 recall      0.725    0.602     0.701    0.850    0.883
BLNBT (T=10)     precision   0.683    0.803     0.828    0.798    0.802
                 recall      0.788    0.637     0.735    0.925    0.933
BLNBT (T=30)     precision   0.702    0.835     0.860    0.832*   0.819
                 recall      0.811    0.647     0.765    0.971    0.983*
ABS (T=10)       precision   0.565    0.604     0.633    0.515    0.552
                 recall      0.613    0.525     0.579    0.698    0.632
ABS (T=50)       precision   0.712    0.852     0.878*   0.814    0.826
                 recall      0.801    0.714     0.752    0.947    0.977

In the experiment, BNBFE employs 5 feature selection methods: Information Gain, Expected Cross Entropy, Mutual Information, the Weight of Evidence for Text and Word Frequency. Therefore, the number of boosting rounds is also set to 5. BNB, BLNBT and ABS are boosting-based, so their number of rounds is chosen between 5 and 50.

An asterisk (*) marks the largest precision or recall value for each topic. From Table 1 we find that BNBFE has the best precision on 3 of the 5 topics and the best recall on 4 of the 5 topics. The table also shows that directly boosted naïve Bayesian classifiers (BNB) cannot really improve on naïve Bayesian learning, whereas BLNBT and our BNBFE can. In general, the classification performance of BNBFE with $T = 5$ is much better than that of the other boosting algorithms compared here with the same $T$, and is nearly equal to the performance of the other boosting algorithms with large $T$. Because the other boosting algorithms require many rounds to achieve optimal performance, our method BNBFE provides a cheaper way to reach the same goal. BNBFE also clearly enhances the learning ability of naïve Bayesian learning, as reflected by the gap in precision and recall between BNBFE and NB.

[FIGURE 1: Performance measured by precision versus class number (5, 10, 15, 20) for BNBFE (T=5), NB, BNB (T=30), BLNBT (T=30) and ABS (T=50).]

We also conducted experiments exploring how well these algorithms handle problems with more classes. Figures 1 and 2 show the average performance of these algorithms for different numbers of classes, measured by precision and recall respectively. From these figures we find that the performance of our new algorithm BNBFE is not seriously weakened as the number of classes grows, whereas the performance of the other algorithms, especially NB, BNB and BLNBT, drops quickly as the number of classes increases.

[FIGURE 2: Performance measured by recall versus class number (5, 10, 15, 20) for BNBFE (T=3), NB, BNB (T=30), BLNBT (T=30) and ABS (T=50).]

From the experiments we found that our newly devised algorithm BNBFE converges more quickly, and is more stable with respect to fluctuations in the scale of the text categorization problem, than the other learning algorithms compared here.

6. Conclusion

In this paper, we introduce a new method to improve the performance of combining boosting and naïve Bayesian learning for text categorization. Instead of combining boosting and the naïve Bayesian classifier directly, which has been shown to yield little or no improvement, we incorporate different feature extraction methods into the construction of the naïve Bayesian classifiers, and hence generate very different, or unstable, Bayesian classifiers. In addition, because the number of different feature extraction methods is very limited, we contrive a modification to the boosting algorithm to achieve a specific goal: minimizing the overlapping errors of its constituent classifiers. With these configurations we obtain a new boosting algorithm for text categorization, which not only performs much better than naïve Bayesian classifiers or directly boosted naïve Bayesian classifiers, but also reaches its optimal performance much more quickly than boosting stumps or boosting decision trees incorporated with naïve Bayesian learning. Improved generalization error bounds and the optimal number of combined feature extraction methods should be explored further in our future work.

References

Breiman, L. (1999) Bias, Variance, and Arcing Classifiers. Machine Learning.

Dietterich, T. G. (1997) Machine Learning Research. AI Magazine, 18(4), 97-136.

Freund, Y. and Schapire, R. (1997) A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1), 119-139.

Kai Ming Ting and Zijian Zheng (2000) Improving the performance of boosting for naïve Bayesian classification. School of Computing and Mathematics, Deakin University.

Koller, D. and Sahami, M. (1997) Hierarchically classifying documents using very few words. ICML97, pp. 170-178.

McCallum, A. and Nigam, K. (1999) A Comparison of Event Models for Naive Bayesian Text Classification. Just Research, 4616 Henry Street, Pittsburgh, PA 15213.

Mladenic, D. (1998) Machine Learning on Non-homogeneous, Distributed Text Data. Doctoral dissertation, University of Ljubljana.

Quinlan, J.R. (1998) Mini-Boosting Decision Trees. AI Access Foundation and Morgan Kaufmann Publishers.

Salton, G., Wong, A. and Yang, C. (1995) A Vector Space Model for Automatic Indexing. Communications of the ACM, 18:613-620.

Schapire, R. and Singer, Y. (2000) BoosTexter: A boosting-based system for text categorization. Machine Learning, 39(2/3):135-168.

Yang, Y. and Pedersen, J.O. (1997) A Comparative Study on Feature Selection in Text Categorization. Proc. of the 14th International Conference on Machine Learning (ICML97), pp. 412-420.
