
Robustness of Regularized Linear Classification Methods in Text Categorization

Jian Zhang
School of Computer Science
Carnegie Mellon University
5000 Forbes Avenue
Pittsburgh, PA 15213, U.S.A.

[email protected]

Yiming Yang
School of Computer Science
Carnegie Mellon University
5000 Forbes Avenue
Pittsburgh, PA 15213, U.S.A.

[email protected]

ABSTRACT

Real-world applications often require the classification of documents with a small number of features, mis-labeled documents, and rare positive examples. This paper investigates the robustness of three regularized linear classification methods (SVM, ridge regression and logistic regression) under these situations. We compare the methods in terms of their loss functions and score distributions, and establish the connection between their optimization problems and generalization error bounds. Several sets of controlled experiments on the Reuters-21578 corpus are conducted to investigate the robustness of these methods. Our results show that ridge regression seems to be the most promising candidate for rare class problems.

Categories and Subject Descriptors

H.4.m [Information Systems Applications]: Miscellaneous; I.5.1 [Pattern Recognition]: Models - Statistical; I.5.4 [Pattern Recognition]: Applications - Text processing

General Terms

Algorithms, Performance, Reliability

Keywords

robustness, text categorization, SVM, ridge regression, logistic regression

1. INTRODUCTION

Many supervised learning methods have been applied to text categorization, including nearest neighbor classifiers,

decision trees, Bayesian probabilistic classifiers, neural networks, regression methods, SVM, etc. Many empirical studies in text categorization have been conducted in recent years [9, 6, 15], investigating different aspects of classification methods.

Text categorization problems can be characterized as dealing with high-dimensional and sparse data, usually accompanied by skewed category distributions. These characteristics together make text categorization different from classic pattern classification problems. Real-world applications often require the classification of documents under the following conditions:

1. Restrictions on space and time: Classifiers need less space and can be trained much faster with fewer features. If vectorized files need to be stored and reused later, fewer features also reduce storage and thus test time significantly, since the most expensive part is loading the test data.

2. Mis-labeled training documents: The most crucial resources for classifiers are the training documents, which are labeled by humans. In cases where many documents are mis-labeled, candidate classifiers should be tolerant of labeling errors to some degree.

3. Small number of training documents: There are many important applications where only a small number of positive training examples is available, such as the filtering task in information retrieval. Candidate classifiers should perform reasonably well with rare positive training data.

Hastie et al. [3] gave an insightful analysis of the robustness of classifiers based on their loss functions. However, the usefulness of such an analysis still depends on the intrinsic characteristics of the data. To our knowledge, no such study has been done in text categorization.

Yang & Liu [15] compared empirical results on common and rare categories. However, they only showed the performance degradation as the amount of positive data decreases, without further exploration, and their results are not comparable across categories.

Our work is mainly based on the work by Zhang [17], which studies several regularized linear classification methods and their applications in text categorization. However, we mainly focus on addressing the above issues; that is, we


study three classifiers in text categorization under conditions of a small number of features, noisy settings (mis-labeled data), and rare positive data. We design several sets of controlled experiments to investigate the behaviors of three regularized linear methods (ridge regression, regularized logistic regression and linear kernel SVM) under the above conditions, and establish the connection between their optimization problems and generalization error bounds. Last but not least, we analyze their different behaviors in the case of rare positive data, which reveals another property of their loss functions.

The rest of this paper is organized as follows: the three linear methods together with their regularizations are introduced in Section 2. Section 3 discusses these methods in terms of loss functions and score distributions, generalization error bounds, and implementations. The experimental setup is described in Section 4, and results are reported in Section 5, where we compare the three methods under the above conditions and give our analysis. The last section summarizes the main results of the paper.

2. LINEAR METHODS AND REGULARIZATION

Among popular classification methods, linear classification methods are those with linear decision boundaries between the positive and negative classes, such as linear regression, logistic regression, linear kernel SVM, the naive Bayes classifier, linear discriminant analysis, the perceptron algorithm, etc. Compared with other methods, linear methods are simpler and the trained model is much easier to interpret. Even more important is that in text categorization they have been shown to be very effective, with performance among the top classifiers [17].

We define the binary classification problem for linear methods as follows. Given the training set Γ = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}, where y_i ∈ {-1, +1} and x_i is a training instance vector, a linear classifier tries to find a weight vector w, an intercept b and a threshold θ such that f(w^T x + b) < θ if the label is -1 and f(w^T x + b) ≥ θ if the label is +1, where the function f(·) is specified by the linear classifier. One can augment the input vector x to [1; x] and the weight vector w to [b; w] to absorb the intercept b.

Given a linear model and a task-specific loss function l(f(x), y), our goal is to minimize the expected loss

    \min \; E_D\, l(f(x), y)

where D is the underlying, unknown distribution of the data. For classification problems, the loss function is usually a convex upper bound of the classification error, which avoids the hardness of directly minimizing the classification error.

In order to fit the model, Empirical Risk Minimization (ERM) is usually used, which minimizes the above objective over the empirical data:

    \min \; \frac{1}{n} \sum_{i=1}^{n} l(f(x_i), y_i)

As we can see, ERM replaces the unknown distribution with the uniform distribution over the empirical data. Since the empirical loss is the sole objective of ERM, ERM may overfit the data by favoring complex functions. Regularization [10, 11] is an effective way to prevent overfitting, and it has been

successfully applied to many methods. The regularized version usually looks like:

    \min \; \frac{1}{n} \sum_{i=1}^{n} l(f(x_i), y_i) + \lambda J(f)

where J(f) controls the learning complexity of the hypothesis family, and the coefficient λ controls the balance between model complexity and empirical loss.

2.1 Linear SVM

SVM is based on statistical learning theory [13], which uses the principle of Structural Risk Minimization instead of Empirical Risk Minimization. It is regarded as a high-performance classifier in many domains, including text categorization. We limit our discussion to linear kernel SVM in this paper since it is the most popular kernel in text categorization [4, 1] and is computationally much cheaper than other kernels. Linear SVM tries to find the hyperplane with maximum margin as the decision boundary in the linearly separable case, which is equivalent to minimizing the norm of the weight vector under some linear constraints¹:

    \min_{w,b} \; \frac{1}{2} w^T w
    \text{subject to: } y_i (w^T x_i + b) \ge 1, \;\; \forall i

For the linearly non-separable case, by introducing slack variables, the optimization problem is augmented to:

    \min_{w,b} \; \frac{1}{2} w^T w + C \sum_{i=1}^{n} \xi_i
    \text{subject to: } \xi_i \ge 0, \;\; y_i (w^T x_i + b) \ge 1 - \xi_i, \;\; \forall i

where C is the cost coefficient and the slack variable ξ_i measures how far the corresponding data point (x_i, y_i) falls on the wrong side of the margin.

The dual form of the above optimization problem can be written as

    \max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^T x_j
    \text{subject to: } 0 \le \alpha_i \le C \;\; \forall i, \quad \sum_{i=1}^{n} \alpha_i y_i = 0

which is both theoretically and practically meaningful. From the Karush-Kuhn-Tucker (KKT) conditions [7] we know that only those data points whose Lagrangian coefficients (α_i's) are non-zero contribute to the final decision boundary w = Σ_{i=1}^{n} α_i y_i x_i. These data points are called "support vectors", while the remaining data points are called "non-support vectors". In text categorization it has been found that usually only a small portion of the training data points are support vectors.

Note that SVM itself can be treated as a regularized method, with the loss function rewritten as

    w = \arg\min_{w,b} \left\{ \frac{1}{n} \sum_{i=1}^{n} \max(0,\, 1 - y_i (w^T x_i + b)) + \lambda w^T w \right\}

where max(0, 1 - y_i (w^T x_i + b)) = ξ_i.

¹We do not use the augmented vector for SVM because the intercept is usually not penalized in SVM. Otherwise, the constraints of SVM (in its dual form) would change, and so would its algorithm. Whether or not the intercept is penalized only brings trivial modifications for ridge regression and logistic regression.
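To make this regularized hinge-loss view concrete, here is a minimal Python sketch that minimizes the objective above by subgradient descent on synthetic data. It only illustrates the formulation and is not the SVMlight solver used in our experiments; the learning rate, iteration count, and toy data are arbitrary assumptions.

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Minimize (1/n) * sum_i max(0, 1 - y_i (w^T x_i + b)) + lam * w^T w
    by subgradient descent; the intercept b is not penalized."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        active = margins < 1                     # points with non-zero hinge loss
        grad_w = -(y[active, None] * X[active]).sum(axis=0) / n + 2 * lam * w
        grad_b = -y[active].sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# toy usage on synthetic, roughly linearly separable data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.where(X[:, 0] + 0.1 * rng.normal(size=200) > 0, 1.0, -1.0)
w, b = train_linear_svm(X, y)
print("training accuracy:", np.mean(np.sign(X @ w + b) == y))
```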


2.2 Linear Regression and Regularization

Linear regression tries to find a linear function f(x) = w^T x that fits the training data well. The least squares algorithm is the most popular estimation method for linear regression, and it is equivalent to Maximum Likelihood Estimation when y is influenced by Gaussian noise. Least squares computes a weight vector w by minimizing the squared loss between the model output w^T x and y:

    w = \arg\min_w \frac{1}{n} \sum_{i=1}^{n} (1 - y_i w^T x_i)^2 = \arg\min_w \frac{1}{n} \sum_{i=1}^{n} (y_i - w^T x_i)^2

The solution is given by

    w = \left( \sum_{i=1}^{n} x_i x_i^T \right)^{-1} \sum_{i=1}^{n} x_i y_i

Though the closed-form solution exists, the matrix Σ_{i=1}^{n} x_i x_i^T can be singular, which means there are multiple minimizers of the objective function. In particular, in text categorization the number of features is often larger than the number of training documents, and the matrix is singular in those cases. One solution is to use the pseudo-inverse matrix [14], which is built on top of the Singular Value Decomposition (SVD). Another solution is ridge regression [3, 17], which regularizes the original objective function by adding a penalty term λ w^T w. In this paper we only discuss the latter, whose objective function becomes

    w = \arg\min_w \left\{ \frac{1}{n} \sum_{i=1}^{n} (y_i - w^T x_i)^2 + \lambda w^T w \right\}

and the closed-form solution becomes

    w = \left( \sum_{i=1}^{n} x_i x_i^T + n \lambda I \right)^{-1} \sum_{i=1}^{n} x_i y_i

Note that after this transformation the matrix Σ_{i=1}^{n} x_i x_i^T + nλI is guaranteed to be non-singular provided λ > 0. As a result, ridge regression has a unique solution, and the optimization problem can be solved with a simple algorithm.
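As a small illustration of the closed-form ridge solution above, the following sketch implements w = (Σ_i x_i x_i^T + nλI)^{-1} Σ_i x_i y_i with dense NumPy arrays; our actual experiments use a sparse, iterative Gauss-Seidel variant instead, so treat this purely as a reference computation on toy data.

```python
import numpy as np

def ridge_regression(X, y, lam=0.01):
    """Closed-form ridge solution w = (X^T X + n*lam*I)^{-1} X^T y,
    with labels y in {-1, +1}; non-singular whenever lam > 0."""
    n, d = X.shape
    A = X.T @ X + n * lam * np.eye(d)
    return np.linalg.solve(A, X.T @ y)

# toy usage: more features than documents, yet a unique solution exists
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 200))
y = np.where(X[:, 0] > 0, 1.0, -1.0)
w = ridge_regression(X, y)
print("training accuracy:", np.mean(np.sign(X @ w) == y))
```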

2.3 Logistic Regression and Regularization

Logistic regression has been widely used in statistics for many years, and has recently received extensive study in machine learning due to its close relation to SVM and AdaBoost. In text categorization, however, it has not been as widely used as least squares and SVM. Recently it has been applied to text categorization and compared with other linear classification methods [9, 17], showing performance comparable to that of SVM. Logistic regression models the conditional probability p(y | x), and the model is fitted by maximizing the conditional log-likelihood. For binary classification problems the conditional probability is modeled as

    p(y \mid x; w) = \frac{1}{1 + \exp(-y\, w^T x)}

Maximum Likelihood Estimation is then equivalent to minimizing the following:

    w = \arg\min_w \frac{1}{n} \sum_{i=1}^{n} \log(1 + \exp(-y_i w^T x_i))

The solution of the above problem may be infinite: suppose the training data are linearly separable, and w_0 is one separating weight vector. Then any weight vector βw_0 with β > 1 separates the training data with a smaller objective function value, so the solution is unbounded.

In order to solve this problem, we once again resort to the regularized version:

    w = \arg\min_w \left\{ \frac{1}{n} \sum_{i=1}^{n} \log(1 + \exp(-y_i w^T x_i)) + \lambda w^T w \right\}

The Hessian matrix of the objective function L (w.r.t. w) is

    H = \frac{\partial}{\partial w^T} \frac{\partial L}{\partial w} = \frac{1}{n} \sum_{i=1}^{n} \frac{\exp(-y_i w^T x_i)}{(1 + \exp(-y_i w^T x_i))^2}\, x_i x_i^T + 2 \lambda I

where I is the identity matrix. It is straightforward to show that the Hessian matrix is positive definite (given λ > 0), which is equivalent to the convexity of the objective function [7]. After regularization the Hessian matrix is bounded away from 0, which is a nice property for many numerical algorithms.
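The gradient and Hessian above are enough to fit the regularized logistic model with Newton's method. The sketch below is a minimal dense NumPy version for illustration only (our experiments use the Gauss-Seidel-style iterative algorithm of [17]); the step count and toy data are assumptions.

```python
import numpy as np

def train_logistic(X, y, lam=0.01, iters=20):
    """Minimize (1/n) * sum_i log(1 + exp(-y_i w^T x_i)) + lam * w^T w
    with Newton steps, using the gradient and Hessian from the text."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        m = y * (X @ w)                       # margins y_i w^T x_i
        s = 1.0 / (1.0 + np.exp(m))           # exp(-m) / (1 + exp(-m))
        grad = -(s * y) @ X / n + 2 * lam * w
        r = s * (1.0 - s)                     # exp(-m) / (1 + exp(-m))^2
        H = (X.T * r) @ X / n + 2 * lam * np.eye(d)
        w -= np.linalg.solve(H, grad)
    return w

# toy usage
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = np.where(X[:, 0] > 0, 1.0, -1.0)
w = train_logistic(X, y)
print("training accuracy:", np.mean(np.sign(X @ w) == y))
```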

3. ANALYSIS

3.1 Loss Functions and Distributions

Based on the previous discussion, we can unify the objective functions of all three linear classification algorithms as

    L = \frac{1}{n} \sum_{i=1}^{n} f(y_i w^T x_i) + \lambda w^T w

All three methods use the 2-norm regularization term; the difference is that each method is associated with a particular loss function f(·):

1. Ridge regression: f(y_i w^T x_i) = (1 - y_i w^T x_i)^2

2. Regularized logistic regression: f(y_i w^T x_i) = log(1 + exp(-y_i w^T x_i))

3. Linear SVM: f(y_i (w^T x_i + b)) = max{0, 1 - y_i (w^T x_i + b)}

It is instructive to compare these loss functions against the mis-classification error in Figure 1 [3], which helps us understand the different behaviors of the three methods.
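The three loss functions, together with the 0/1 error, can be tabulated directly as functions of the margin y w^T x, mirroring Figure 1. This is a minimal sketch; the grid of margin values is chosen arbitrarily.

```python
import numpy as np

# losses as functions of the margin m = y * w^T x (cf. Figure 1)
m = np.linspace(-2.0, 2.0, 9)
zero_one = (m < 0).astype(float)            # mis-classification error
squared  = (1.0 - m) ** 2                   # ridge regression
logistic = np.log(1.0 + np.exp(-m))         # regularized logistic regression
hinge    = np.maximum(0.0, 1.0 - m)         # linear SVM

for name, loss in [("0/1", zero_one), ("squared", squared),
                   ("logistic", logistic), ("hinge", hinge)]:
    print(f"{name:8s}", np.round(loss, 3))
```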


[Figure 2: Distributions of w^T x for positive and negative data (over test data, feature size = 3000), with one panel each for logistic regression, SVM, and ridge regression.]

[Figure 1: Loss functions - the mis-classification error and the SVM (hinge), logistic, and ridge regression (squared) losses plotted against y w^T x.]

We can see from the graphs (or simply derive from their formulas) that all three loss functions are convex, and they are also upper bounds of the mis-classification error.

Squared loss is believed to be less robust than the other two loss functions, since it is more influenced by extreme data, as the graph shows. Another disadvantage of squared loss is that it also penalizes correctly classified data points whose output values are larger than 1. On the other hand, it has the good property that both the first and second derivatives of its objective function are well-behaved, so it can be solved with simple optimization techniques.

The SVM loss is linear, and the logistic loss is close to linear. Both are less sensitive to extreme data points than squared loss. The non-differentiability of the SVM loss makes it harder to optimize than the other two methods.

To show how effective these loss functions are, we plot the score distributions (w^T x) of the three methods over test data in Figure 2. Most of the density of the distributions has loss close to zero, which shows that our data is linearly separable to a great degree. Also notice that

the overlap for ridge regression is small compared with the other two methods, which is consistent with its good performance reported in Section 5.

3.2 Generalization Error

We first refer to two theorems of Vapnik [12, 13]:

Theorem 1. Suppose f(x, α) is a set of learning functions for binary classification with adjustable parameters α. Then the following bound holds with probability at least 1 - η:

    R(\alpha) \le R_{emp}(\alpha) + \sqrt{ \frac{ h (\log(2n/h) + 1) - \log(\eta/4) }{ n } }

where R(α) = ∫ L(y, f(x, α)) dF(x, y) is the expected risk, R_emp(α) = (1/n) Σ_{i=1}^{n} L(y_i, f(x_i, α)) is the empirical risk, n is the size of the training data, h is the VC dimension of the learning model, L(·) is the mis-classification error loss, and F(·) is the underlying (unknown) data distribution.

Theorem 2. A subset of separating hyperplanes defined on Γ = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)} (with x_i^T x_i ≤ D² for all i) satisfying y_i (w^T x_i + b) ≥ 1 for all i and w^T w ≤ A² has VC dimension h bounded by

    h \le \min([D^2 A^2], n) + 1

From the above theorems we know that, for linearly separable data, if we shrink the hypothesis space by bounding the 2-norm of the weight vector w of a linear classification method, we can potentially reduce the VC dimension, and thus reduce the generalization error bound.

On the other hand, we know that the optimization problem [8]

    \min f(w) \quad \text{subject to: } w^T w \le A^2

is equivalent to the problem

    \min f(w) + \lambda w^T w

given that the function f(·) and the constraint are convex. So, by appropriately choosing λ, we are actually limiting our hypothesis space through the constraint w^T w ≤ A². This establishes the connection between all three regularized methods and the above generalization error bound, which has been regarded as one major theoretical advantage of SVM.
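For completeness, the equivalence invoked above can be sketched via the Lagrangian; this is only an outline, assuming convexity and a constraint qualification so that strong duality holds:

```latex
\begin{aligned}
\min_{\mathbf{w}:\,\mathbf{w}^{T}\mathbf{w}\le A^{2}} f(\mathbf{w})
  &= \min_{\mathbf{w}} \max_{\lambda \ge 0}
     \Bigl( f(\mathbf{w}) + \lambda\bigl(\mathbf{w}^{T}\mathbf{w} - A^{2}\bigr) \Bigr) \\
  &= \max_{\lambda \ge 0} \Bigl( \min_{\mathbf{w}}
     \bigl( f(\mathbf{w}) + \lambda\,\mathbf{w}^{T}\mathbf{w} \bigr) - \lambda A^{2} \Bigr),
\end{aligned}
```

so, at the optimal multiplier, the penalized problem min_w f(w) + λ w^T w shares its minimizer with the constrained problem.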


3.3 Implementation Issues

For real-world applications, computational efficiency is another important issue. In our experiments, we use SVMlight [5] for the linear kernel SVM model, and iterative algorithms [17] that are variants of Gauss-Seidel [2] for both ridge regression and regularized logistic regression. The algorithm for ridge regression is very simple and more efficient than the other two methods. Another advantage of ridge regression (though we did not exploit it in this paper) is that for collections containing a large number of categories (like Ohsumed), we can first compute the matrix inverse

    M = \left( \sum_{i=1}^{n} x_i x_i^T + n \lambda I \right)^{-1}

which is independent of the category², followed by the computation of M Σ_{i=1}^{n} x_i y_i for each individual category.

²Assuming the features are independent of the categories.
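A dense NumPy sketch of this reuse of the category-independent inverse is given below. It is illustrative only (a real implementation would exploit sparsity), and the toy label matrix Y, with one ±1 column per category, is an assumption.

```python
import numpy as np

def ridge_all_categories(X, Y, lam=0.01):
    """Ridge regression for many binary categories over the same documents:
    compute M = (sum_i x_i x_i^T + n*lam*I)^{-1} once, then reuse it for
    every category's label vector (Y holds one {-1,+1} column per category)."""
    n, d = X.shape
    M = np.linalg.inv(X.T @ X + n * lam * np.eye(d))   # category-independent part
    return M @ (X.T @ Y)                               # one weight vector per column

# toy usage: 3 categories sharing one document-term matrix
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 40))
Y = np.where(rng.normal(size=(100, 3)) > 0, 1.0, -1.0)
W = ridge_all_categories(X, Y, lam=0.1)
print(W.shape)   # (40, 3): one weight vector per category
```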

4. EXPERIMENTAL SETUP

Reuters-21578 (ModApte split), a standard testbed for text categorization, is used as our data collection. Since every document in the collection can have multiple labels, we split the classification problem into multiple binary classification problems, one per category. All numbers and stopwords are removed, and words are converted to lower case without stemming. Infrequent features (occurring fewer than 3 times) are filtered out. Feature selection is done using Information Gain [16] per category, binary term weighting is applied, and words in document titles are treated as different words from those in document bodies.

Precision (p) and recall (r) are used to evaluate the methods in their combined form F1, defined as

    F_1 = \frac{2rp}{r + p}

In order to compare the global performance of different methods, MacroavgF1 and MicroavgF1 are also used. MacroavgF1 gives the same weight to all categories, so for our data collection it is mainly influenced by the performance on rare categories, due to the skewed category distribution of Reuters-21578. On the contrary, MicroavgF1 is dominated by the performance on common categories.

We conduct our experiments under three sets of conditions:

1. Robustness in terms of a small number of features: We compare the classifiers' performance as the number of features varies. In particular, we examine their performance with very few features, namely the top-ranked features by Information Gain.

2. Robustness in terms of "noise" level (mis-labeled data): We randomly pick some portion of the training examples and flip their labels. Performance is measured with respect to the percentage of flipped training examples.

3. Robustness in terms of "rare" positive training data: In order to study the different behaviors of the three classifiers with rare positive training data, we use the top 12 common categories in our data collection and simulate the process by reducing the available percentage of positive training data.

5. RESULTS AND DISCUSSIONS

5.1 Performance vs. Feature Size

In this subsection we show how well these methods perform with a relatively small number of features. Both the thresholds and the regularization coefficient λ (λ = 10^{-k}, k = 0, 1, ..., 5) are chosen by maximizing F1 over the training data with 5-fold cross-validation.

Our results for both MicroavgF1 and MacroavgF1 are shown in Figure 3, and the F1 results for the 12 most common categories (with 3000 features) are listed in Table 1; they are consistent with previously published results [17]. From Figure 3 we can see that even with 30 features, the MicroavgF1 of all three methods remains above 80%, which is an acceptable performance for some applications. On the other hand, the MacroavgF1 scores of the three methods behave differently, with ridge regression the best and logistic regression the worst. The difference in Macro-F1 results (right graph in Figure 3) between ridge regression and SVM (and logistic regression) is significant³.

Since the MacroavgF1 performance is mainly influenced by rare categories, we believe the three methods must behave differently on rare categories. Our later experiments on "rare positive data" further explore these differences.

³We use the Macro t-test [15] with significance level 0.05.

Table 1: F1 performance of SVM, ridge regression (RR) and regularized logistic regression (LR) (Reuters-21578, top 12 categories, feature size = 3000)

Category        # of positive training examples    SVM     RR      LR
earn            2877                               0.978   0.976   0.980
acq             1650                               0.962   0.956   0.960
money-fx        538                                0.762   0.728   0.779
grain           433                                0.912   0.929   0.902
crude           389                                0.886   0.865   0.879
trade           368                                0.741   0.700   0.754
interest        347                                0.764   0.785   0.743
wheat           212                                0.902   0.903   0.895
ship            197                                0.830   0.846   0.834
corn            180                                0.920   0.917   0.889
money-supply    140                                0.820   0.836   0.825
dlr             131                                0.791   0.727   0.791

5.2 Performance vs. Noise Level

In order to conduct experiments under noisy settings, we randomly pick 1%, 3%, 5%, 10%, 15%, 20% and 30% of the training data respectively and flip their labels. The thresholds and the regularization coefficient λ are tuned with 5-fold cross-validation under each condition, to make sure that the term λ w^T w can play an active role for every method in resisting noisy data.
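A minimal sketch of this label-flipping procedure is given below; the noise fraction and random seed are arbitrary, and the function simply returns a copy of a {-1, +1} label vector with the requested portion of labels inverted.

```python
import numpy as np

def flip_labels(y, fraction, seed=0):
    """Return a copy of the {-1,+1} label vector with a random `fraction`
    of the labels flipped, simulating mis-labeled training documents."""
    rng = np.random.default_rng(seed)
    y_noisy = y.copy()
    n_flip = int(round(fraction * len(y)))
    idx = rng.choice(len(y), size=n_flip, replace=False)
    y_noisy[idx] = -y_noisy[idx]
    return y_noisy

# e.g. a 5% noise level
y = np.where(np.random.default_rng(1).normal(size=100) > 0, 1.0, -1.0)
y_noisy = flip_labels(y, 0.05)
print("flipped:", int((y != y_noisy).sum()), "of", len(y))
```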


[Figure 3: Micro-F1 and Macro-F1 over 90 categories vs. feature size, for logistic regression, SVM, and ridge regression.]


[Figure 4: MicroavgF1 vs. noise level (percentage of mis-labeled data), for logistic regression, SVM, and ridge regression.]

We only report MicroavgF1 versus noise level, in Figure 4. The MacroavgF1 results of all three methods drop dramatically (below 0.05) once the noise level exceeds 3%. This can be explained as follows: most of the 90 categories in our data collection are relatively small, and due to the way we manipulate the data, flipping p% of the negative data (which now become "positive") overwhelms the remaining, correctly labeled positive data. It thus becomes very hard to learn the target correctly for small categories, which leads to the poor MacroavgF1 performance.

Our results show that SVM is superior to the other two methods as the noise level initially goes up, as we anticipated⁴. Ridge regression works reasonably well, even though it fails the significance test against SVM, if we consider the large number of decisions it makes (number of documents times number of categories). This tells us that for text collections, squared loss is acceptable even with a certain amount of mis-labeled documents.

⁴For example, the difference between SVM and ridge regression at noise level 3% is significant at level 0.05 with the Micro s-test [15], since SVM wins 211 out of 365 differing decisions.

5.3 Performance vs. Rare Positive Data

There are many cases where only a small number of positive examples is available to our classifiers. Here we examine the different behaviors of the three methods in such cases, which can help us further explain the MacroavgF1 performance differences in Figure 3.

One way to deal with rare-class classification problems is to re-weight the training data (or make the cost function asymmetric) so that the same amount of positive data plays a more important role than the negative data. All three methods can be adapted to this version by slightly modifying the original optimization problem into:

    w = \arg\min_w \left\{ \frac{C_{+/-} \sum_{i \in D_+} f(y_i w^T x_i) + \sum_{i \in D_-} f(y_i w^T x_i)}{|D_+|\, C_{+/-} + |D_-|} + \lambda w^T w \right\}

where D_+ and D_- are the sets of positive and negative examples, |D_+| and |D_-| are their cardinalities, C_{+/-} measures the relative importance of positive versus negative data, and (|D_+| C_{+/-} + |D_-|) is a normalization factor so that λ stays independent of the number of examples. (A sketch of this cost-sensitive objective is given below.)

We did not apply this approach in our experiments, for several reasons. First, all three methods can be adapted in this way, which does not help reveal the different characteristics of the three methods. Second, we want to examine their ability to deal with rare classes in a natural way, while this adaptation changes the data's prior distribution. In addition, the weight ratio C_{+/-} between positive and negative data would need to be chosen empirically per category by cross-validation, which may be unstable for rare classes.
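For reference, the cost-sensitive objective above can be written as a short function. The sketch below is illustrative only (as noted, we do not use this re-weighting in our experiments) and plugs in the hinge loss for f(·); any of the three losses could be substituted.

```python
import numpy as np

def reweighted_objective(w, X, y, lam, c_pos,
                         loss=lambda m: np.maximum(0.0, 1.0 - m)):
    """Cost-sensitive regularized objective: positives weighted by c_pos,
    negatives by 1, normalized by (|D+| * c_pos + |D-|) so that lam stays
    independent of the number of examples. `loss` maps margins to losses."""
    m = y * (X @ w)                               # margins y_i w^T x_i
    weights = np.where(y > 0, c_pos, 1.0)
    norm = (y > 0).sum() * c_pos + (y <= 0).sum()
    return (weights * loss(m)).sum() / norm + lam * w @ w
```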


Instead, we design our experiments to directly examine the different behaviors of the three methods without changing the optimization objective. In order to examine performance under this condition, we choose the 12 most common categories (in terms of the number of positive training examples; see Table 1) in our data collection. Since these categories are relatively common, we can randomly hold out some portion of the positive examples and thus investigate the behaviors as the available amount of positive data changes. In order to make our results stable, results are averaged over 5 to 20 runs (the smaller the percentage of positive data used, the more runs it is averaged over).

Results here are reported in terms of the "best possible" F1 over the test data, which avoids the tuning of thresholds. Figure 5 shows the F1 results for the 12 common categories (as listed in Table 1) as the available percentage of positive data changes (1%, 3%, 5%, 10%, 30%, and 50%).

From the results we can see that, although results vary from category to category, logistic regression is much worse than both SVM and ridge regression. Ridge regression performs slightly better than SVM for small categories, which further confirms our results in Figure 3.

Note that for all three methods, when the amount of negative data is much larger than the positive data, the objective function's value is initially dominated by the negative data. However, since the logistic loss is strictly greater than 0 even for correctly classified data, the optimization process keeps pushing the majority (correctly classified negative data) further down the loss function as long as their role in the objective function is still larger than (or comparable to) that of the rare positive data, thereby sacrificing the performance on positive data, which leads to a poor F1 score. Both the SVM loss and the squared loss have points of exactly zero loss at finite values of y w^T x (in contrast to the logistic loss, which only approaches 0 as y w^T x goes to infinity), and thus their performance on rare classes does not drop as dramatically as that of logistic regression.

6. CONCLUDING REMARKS

In this paper we presented a controlled study of the robustness of three regularized linear classification methods in text categorization. We discussed their loss functions and the related score distributions, and established the connection between their optimization targets and generalization error bounds. In our experiments, we investigated their performance under conditions of a small number of features, noisy settings, and rare positive data, and compared and analyzed their performance differences. Our concluding remarks are:

• Theoretically, all three methods can be treated as shrinking the hypothesis space when performing the optimization, and thus they have similar generalization error bounds. Practically, they all perform very well in MicroavgF1 even with very few selected features.

• Under noisy settings, SVM is better than logistic regression and ridge regression. Ridge regression, as indicated by its loss function, performs the worst.

• Ridge regression is better than SVM when only a small number of positive examples is available. We show that logistic regression performs very badly in this case, and give explanations.

7. ACKNOWLEDGMENTS

We thank the anonymous reviewers for their helpful comments. This research is sponsored in part by the National Science Foundation (NSF) under grants EIA-9873009 and IIS-9982226, and in part by DoD under award 114008-N66001992891808. However, any opinions or conclusions in this paper are the authors' and do not necessarily reflect those of the sponsors.

8. REFERENCES

[1] S. Dumais, J. Platt, D. Heckerman, and M. Sahami. Inductive learning algorithms and representations for text categorization. In CIKM'98, 1998.
[2] G. Golub and C. Van Loan. Matrix Computations (3rd edition). Johns Hopkins University Press, Baltimore, MD, 1996.
[3] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, New York, 2001.
[4] T. Joachims. Text categorization with support vector machines: Learning with many relevant features. LS VIII Report 23, Universität Dortmund, 1997.
[5] T. Joachims. Text categorization with support vector machines: Learning with many relevant features. In European Conference on Machine Learning (ECML), pages 137-142, Berlin, 1998. Springer.
[6] D. Lewis and M. Ringuette. Comparison of two learning algorithms for text categorization. In Proceedings of the Third Annual Symposium on Document Analysis and Information Retrieval (SDAIR'94), Las Vegas, Nevada, 1994. University of Nevada, Las Vegas.
[7] D. Luenberger. Linear and Nonlinear Programming. Addison-Wesley, New York, 1989.
[8] D. Luenberger. Optimization by Vector Space Methods. John Wiley & Sons, New York, 1997.
[9] H. Schütze, D. Hull, and J. Pedersen. A comparison of classifiers and document representations for the routing problem. In 18th Ann Int ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'95), pages 229-237, 1995.
[10] A. N. Tikhonov. Solutions of incorrectly formulated problems and the regularization method. Soviet Math. Dokl., volume 4, pages 1035-1038, 1963.
[11] A. N. Tikhonov and V. Y. Arsenin. Solutions of Ill-posed Problems. W. H. Winston, Washington, D.C., 1977.
[12] V. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, 1995.
[13] V. Vapnik. Statistical Learning Theory. John Wiley and Sons, New York, 1998.
[14] Y. Yang and C. Chute. An example-based mapping method for text categorization and retrieval. ACM Transactions on Information Systems (TOIS), 12(3):252-277, 1994.
[15] Y. Yang and X. Liu. A re-examination of text categorization methods. In The 22nd Ann Int ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'99), pages 42-49, 1999.
[16] Y. Yang and J. Pedersen. A comparative study on feature selection in text categorization. In D. H. Fisher, editor, The Fourteenth International Conference on Machine Learning (ICML'97), pages 412-420. Morgan Kaufmann, 1997.
[17] T. Zhang and F. J. Oles. Text categorization based on regularized linear classification methods. Information Retrieval, volume 4, pages 5-31, 2001.


[Figure 5: F1 vs. available amount of positive data (feature size = 3000); one panel per category (earn, acq, money-fx, grain, crude, trade, interest, wheat, ship, corn, money-supply, dlr), with curves for logistic regression, SVM, and ridge regression.]