
CONTROLLING FALSE ALARMS WITH SUPPORT VECTOR MACHINES

Mark A. Davenport, Richard G. Baraniuk∗

Rice University

Department of Electrical and Computer Engineering

Clayton D. Scott†

Rice University

Department of Statistics

ABSTRACT

We study the problem of designing support vector classifiers with respect to a Neyman-Pearson criterion. Specifically, given a user-specified level α ∈ (0, 1), how can we ensure a false alarm rate no greater than α while minimizing the miss rate? We examine two approaches, one based on shifting the offset of a conventionally trained SVM and the other based on the introduction of class-specific weights. Our contributions include a novel heuristic for improved error estimation and a strategy for efficiently searching the parameter space of the second method. We also provide a characterization of the feasible parameter set of the 2ν-SVM on which the second approach is based. The proposed methods are compared on four benchmark datasets.

1. INTRODUCTION

Most approaches to classification attempt to infer a classifier that minimizes the probability of making an error. In many important applications, however, some kinds of errors are more important than others. In tumor classification, for example, the impact of mistakenly classifying a benign tumor as malignant is much less than that of the opposite mistake.

In the Neyman-Pearson (NP) classification paradigm, the goal is to design a classifier (based on training data) that minimizes the probability of a miss while constraining the probability of a false alarm to be less than some user-specified significance level α. Unlike Neyman-Pearson hypothesis testing in classical detection theory, NP classification relies entirely on training data, placing no parametric assumptions on the data.

The NP framework has two major advantages with respect to conventional classification criteria that seek to minimize the probability of error or, more generally, an expected misclassification cost. First, assigning costs is often less intuitive or reasonable than assigning a false alarm constraint. Second, NP classification does not assume knowledge of the a priori class probabilities. This is extremely important in applications where the class frequencies in the training data do not accurately reflect class probabilities in the larger population. In fact, it could probably be argued that most classification problems of interest fit this description. For a more detailed motivation for the Neyman-Pearson paradigm, see [1].

This paper studies support vector machines (SVMs) for NP classification. In the SVM literature two approaches have been suggested for controlling false alarms. One involves shifting the offset parameter, resulting in an affine shift of the decision boundary, and the other entails introducing an additional parameter to control the relative weight given to each class. It is clear that both approaches affect the desired tradeoff between false alarms and misses. What is not clear, however, is how to implement these ideas to achieve a specific false alarm level α. In other words, the primary challenge is accurate error estimation.

∗ Supported by NSF, AFOSR, ONR, and the Texas Instruments Leadership University Program.

† Supported by an NSF VIGRE postdoctoral training grant. Email: {md,richb,cscott}@rice.edu, Web: dsp.rice.edu

As might be expected, we find that shifting the offset parameter does not perform as well as introducing an additional parameter to control the relative weights. However, optimizing over this additional parameter significantly increases the training time. We propose a method for greatly reducing the complexity of the expanded search with no significant loss in performance. We also suggest a method for decreasing the variance of the error estimates, which are crucial for NP classification. Furthermore, we offer two contributions regarding the 2ν-SVM, the cost-sensitive SVM we employ. First, we present a theorem that precisely characterizes the feasible set for the defining quadratic program. Second, we have made available at www.dsp.rice.edu/software our code, which is based on the LIBSVM package [2].

2. SUPPORT VECTOR MACHINES

Support vector machines (SVMs) are among the most effective methods for classification [3]. Let (xi, yi), i = 1, 2, ..., n denote the training data, where xi ∈ R^d is a d-dimensional feature vector and yi ∈ {+1, −1} indicates the class of xi. Conceptually, the support vector classifier is constructed in a two step process. In the first step, the xi are transformed via a mapping Φ : R^d → H, where H is a high (possibly infinite) dimensional Hilbert space. The intuition is that the two classes should be more easily separated in H than in R^d. For algorithmic reasons, Φ must be chosen so that the kernel operator k(x, x′) = ⟨Φ(x), Φ(x′)⟩_H is positive definite. This allows us to compute inner products in H without explicitly evaluating Φ.

In the second step, a hyperplane is determined in the induced feature space according to the max-margin principle. In the case where the two classes can be separated by a hyperplane, the SVM finds the hyperplane that maximizes the margin – the distance between the decision boundary and the closest point to the boundary. When the classes cannot be separated by a hyperplane, the constraints are relaxed through the introduction of slack variables ξi. If ξi > 0, this means that the corresponding xi lies inside the margin and is called a margin error. If w ∈ H and b ∈ R are the normal vector and affine shift defining the max-margin hyperplane, then the support vector classifier is given by f_{w,b}(x) = sgn(⟨w, Φ(x)⟩_H + b). The offset parameter b is often called the bias.

There are two different formulations of the SVM. The original SVM [4], which we will call the C-SVM, can be formulated as the following quadratic program:

$$
\begin{aligned}
(P_C)\qquad \min_{w,b,\xi}\ \ & \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i \\
\text{s.t.}\ \ & y_i\bigl(\langle w,\Phi(x_i)\rangle_{\mathcal H} + b\bigr) \ge 1-\xi_i \quad \text{for } i=1,2,\dots,n \\
& \xi_i \ge 0 \quad \text{for } i=1,2,\dots,n
\end{aligned}
$$

where C ≥ 0 is a tradeoff parameter that controls overfitting.

For computational reasons, it is often easier to solve (P_C) by solving its dual formulation. This is derived by forming the Lagrangian and then optimizing over the Lagrange multiplier α instead of the primal variables. The primal and the dual are related through w = Σ_{i=1}^n αi yi Φ(xi). For most xi, we will have αi = 0. The xi for which αi ≠ 0 are called support vectors.
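As a concrete illustration of this dual expansion, the following sketch (not the authors' code) evaluates the resulting classifier f_{w,b}(x) = sgn(Σ_i αi yi k(xi, x) + b) with a Gaussian kernel. The arrays X, y, alpha, the offset b, and the 1/(2σ²) bandwidth convention are assumptions made for illustration.

```python
import numpy as np

def rbf_kernel(X, x, sigma):
    # Gaussian (RBF) kernel values k(x_i, x) = exp(-||x_i - x||^2 / (2 sigma^2));
    # this bandwidth convention is an assumption, not taken from the paper.
    return np.exp(-np.sum((X - x) ** 2, axis=1) / (2.0 * sigma ** 2))

def svm_decision(x, X, y, alpha, b, sigma):
    # f_{w,b}(x) = sgn(<w, Phi(x)> + b) with w = sum_i alpha_i y_i Phi(x_i),
    # so only the support vectors (alpha_i != 0) contribute to the sum.
    return np.sign(np.dot(alpha * y, rbf_kernel(X, x, sigma)) + b)
```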

An alternative (but equivalent) formulation of the C-SVM is the ν-SVM [5], which replaces C with a different parameter ν ∈ [0, 1] that serves as an upper bound on the fraction of margin errors and a lower bound on the fraction of support vectors. The ν-SVM has the primal formulation:

$$
\begin{aligned}
(P_\nu)\qquad \min_{w,b,\xi,\rho}\ \ & \frac{1}{2}\|w\|^2 - \nu\rho + \frac{1}{n}\sum_{i=1}^{n}\xi_i \\
\text{s.t.}\ \ & y_i\bigl(\langle w,\Phi(x_i)\rangle_{\mathcal H} + b\bigr) \ge \rho-\xi_i \quad \text{for } i=1,2,\dots,n \\
& \xi_i \ge 0 \quad \text{for } i=1,2,\dots,n \\
& \rho \ge 0.
\end{aligned}
$$

The above formulations implicitly penalize errors in both classes equally. However, as described in the introduction, in many applications there are different costs associated with the two different kinds of errors. To address this issue, cost-sensitive extensions of both the C-SVM and the ν-SVM have been proposed, which we shall denote the 2C-SVM and the 2ν-SVM, respectively.

First we will consider the 2C-SVM proposed in [6]. Let I+ = {i : yi = +1} and I− = {i : yi = −1}. The 2C-SVM has primal:

$$
\begin{aligned}
(P_{2C})\qquad \min_{w,b,\xi}\ \ & \frac{1}{2}\|w\|^2 + C\gamma\sum_{i\in I_+}\xi_i + C(1-\gamma)\sum_{i\in I_-}\xi_i \\
\text{s.t.}\ \ & y_i\bigl(\langle w,\Phi(x_i)\rangle_{\mathcal H} + b\bigr) \ge 1-\xi_i \quad \text{for } i=1,2,\dots,n \\
& \xi_i \ge 0 \quad \text{for } i=1,2,\dots,n
\end{aligned}
$$

where γ ∈ (0, 1) is a parameter for trading off false alarms and misses.

Similarly, [7] proposed the 2ν-SVM as a cost-sensitive extension of the ν-SVM. The 2ν-SVM has primal:

$$
\begin{aligned}
(P_{2\nu})\qquad \min_{w,b,\xi,\rho}\ \ & \frac{1}{2}\|w\|^2 - \nu\rho + \frac{\gamma}{n}\sum_{i\in I_+}\xi_i + \frac{1-\gamma}{n}\sum_{i\in I_-}\xi_i \\
\text{s.t.}\ \ & y_i\bigl(\langle w,\Phi(x_i)\rangle_{\mathcal H} + b\bigr) \ge \rho-\xi_i \quad \text{for } i=1,2,\dots,n \\
& \xi_i \ge 0 \quad \text{for } i=1,2,\dots,n \\
& \rho \ge 0.
\end{aligned}
$$

Above we stated that (P_C) and (P_ν) are equivalent. This notion was made precise in [8]. We have extended this result to show that (P_2C) and (P_2ν) are equivalent in the following sense. The proof is given in a supplemental technical report [9].

Theorem 1. Let (D_2C) and (D_2ν) denote the duals of (P_2C) and (P_2ν) respectively. Fix γ ∈ [0, 1] and let α∗ be any optimal solution of (D_2C). Define

$$
\nu_* = \lim_{C\to\infty} \frac{\sum_{i=1}^{n}\alpha_i^*}{Cn}, \qquad
\nu^* = \lim_{C\to 0} \frac{\sum_{i=1}^{n}\alpha_i^*}{Cn}.
$$

Then

$$
0 \le \nu_* \le \nu^* = \nu_{\max} = \frac{2\min(\gamma n_+,\,(1-\gamma)n_-)}{n} \le \frac{1}{2},
$$

where n+ = |I+| and n− = |I−|.

Thus, for any ν > ν*, (D_2ν) is infeasible. For any ν ∈ (ν_*, ν*] the optimal objective value of (D_2ν) is strictly positive, and thus there exists at least one C > 0 such that any α is an optimal solution of (D_2C) if and only if α/(Cn) is an optimal solution of (D_2ν). For any ν ∈ [0, ν_*], (D_2ν) is feasible with zero optimal objective value (and a trivial solution).
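To illustrate the theorem, the sketch below computes ν_max for a given γ and class sizes and checks whether a candidate ν lies in the feasible range. The function names are hypothetical, and the formula is the one stated above (as reconstructed here), not a separate derivation.

```python
def nu_max(gamma, n_plus, n_minus):
    # nu_max = 2 * min(gamma * n_+, (1 - gamma) * n_-) / n, as in Theorem 1.
    n = n_plus + n_minus
    return 2.0 * min(gamma * n_plus, (1.0 - gamma) * n_minus) / n

def dual_is_feasible(nu, gamma, n_plus, n_minus):
    # (D_2nu) is infeasible for any nu > nu^* = nu_max.
    return 0.0 <= nu <= nu_max(gamma, n_plus, n_minus)
```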

3. CONTROLLING FALSE ALARMS

As mentioned in the Introduction, there are two main strategies for controlling false alarms. The first is to train a C-SVM or ν-SVM and then shift the bias (b) to achieve the desired false alarm rate. The second approach is to use the 2C-SVM or 2ν-SVM and achieve the desired false alarm rate by adjusting γ appropriately. As described in Section 2, (P_2C) and (P_2ν), as well as (P_C) and (P_ν), are closely related and equivalent in the sense that they explore the same set of possible solutions. For the remainder of this paper we restrict our attention to the ν-SVM and the 2ν-SVM, because their parameter spaces are more conveniently discretized. This is important because it makes the coordinate descent and windowing heuristics we describe below more reasonable.

In what follows PF(f) and PM(f) denote the false alarm and miss rates of a classifier f, and P̂F(f) and P̂M(f) denote estimates of these quantities.

3.1. Bias-shifting approach

A potential advantage of the bias-shifting strategy is the ability to separate the training into two stages. First, we search over the parameters of the SVM (ν and any kernel parameters). Using an error estimation method such as cross-validation (CV), we then select the parameters that minimize the estimated probability of error. Second, we shift the bias of that chosen classifier and, again using some form of error estimation, select the bias minimizing P̂M such that P̂F ≤ α. Note that the first error estimate must be nested within the second. In our experiments we use the resubstitution estimate to select the bias. Resubstitution is generally a poor estimate when the set of classifiers is complex; however, once we fix a normal vector w, the set of possible shifted hyperplanes is a class with low complexity, and so resubstitution is in fact a reasonable error estimate.
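The bias-selection step of this second stage can be sketched as follows. Here scores is a hypothetical array of decision values ⟨w, Φ(xi)⟩_H for the training points (the SVM outputs before the offset is added), y holds the labels, and class −1 is treated as the null class so that a false alarm is a −1 point classified as +1. This is an illustration of the rule described above, not the authors' implementation.

```python
import numpy as np

def shift_bias(scores, y, alpha):
    # Resubstitution estimates: try offsets placing the decision threshold at
    # each training score (that point itself is then classified -1), plus one
    # offset that rejects everything, and keep the offset minimizing the miss
    # rate subject to the false alarm rate being at most alpha.
    candidates = np.append(-scores, -(scores.max() + 1.0))
    best_b, best_pm = candidates[-1], 1.0
    for b in candidates:
        pred = np.where(scores + b > 0, 1, -1)
        pf = np.mean(pred[y == -1] == 1)    # resubstitution false alarm rate
        pm = np.mean(pred[y == +1] == -1)   # resubstitution miss rate
        if pf <= alpha and pm < best_pm:
            best_b, best_pm = b, pm
    return best_b
```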

Since some datasets are unbalanced, we can also apply the above strategy using a 2ν-SVM with ν+ = ν− (where ν+ and ν− are as defined in Section 3.2) instead of the ν-SVM. This method is referred to as balanced bias-shifting.

3.2. 2ν approach

The cost-sensitive 2ν-SVM proposed in [7] is parameterized in a different manner than (P_2ν). Specifically, instead of parameters ν and γ, (P_2ν) is formulated using ν+ and ν−, where

$$
\nu = \frac{2\nu_+\nu_- n_+ n_-}{(\nu_+ n_+ + \nu_- n_-)\,n}, \qquad
\gamma = \frac{\nu_- n_-}{\nu_+ n_+ + \nu_- n_-} = \frac{\nu n}{2\nu_+ n_+},
$$

or equivalently

$$
\nu_+ = \frac{\nu n}{2\gamma n_+}, \qquad \nu_- = \frac{\nu n}{2(1-\gamma) n_-}.
$$

This parametrization has the benefit that ν+ and ν− have a more intuitive meaning, similar to the ν-SVM. Specifically, suppose that the optimal solution of (P_2ν) satisfies ρ > 0. Then, for the optimal solution of (P_2ν), ν+ is an upper bound on the fraction of margin errors and a lower bound on the fraction of support vectors from class +1. Similarly, ν− is an upper bound on the fraction of margin errors and a lower bound on the fraction of support vectors from class −1. See [7] for a proof.
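For concreteness, a small helper (with hypothetical names) converts a grid point (ν+, ν−) into the (ν, γ) parameters of (P_2ν) using the formulas above, with n+ and n− the class sizes.

```python
def nu_gamma_from_class_nus(nu_plus, nu_minus, n_plus, n_minus):
    # nu    = 2 nu_+ nu_- n_+ n_- / ((nu_+ n_+ + nu_- n_-) n)
    # gamma = nu_- n_- / (nu_+ n_+ + nu_- n_-)
    # (Assumes nu_+ and nu_- are not both zero.)
    n = n_plus + n_minus
    denom = nu_plus * n_plus + nu_minus * n_minus
    nu = 2.0 * nu_plus * nu_minus * n_plus * n_minus / (denom * n)
    gamma = nu_minus * n_minus / denom
    return nu, gamma
```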

Furthermore, from Theorem 1 it follows that the dual formulation of (P_2ν) is feasible if and only if ν+ ≤ 1 and ν− ≤ 1, with a trivial solution if ν+ ≤ 0 or ν− ≤ 0. Therefore, to search over the parameters of the 2ν-SVM it suffices to conduct a search over a uniform grid of (ν+, ν−) in [0, 1]².

In sum, the full algorithm for NP classification with the 2ν-SVM is to conduct a grid search over the SVM parameters, estimate PF and PM using some error estimation technique, and select the parameter combination minimizing P̂M such that P̂F ≤ α.
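The selection rule itself can be sketched as below; train_and_estimate is a hypothetical routine that trains the 2ν-SVM for one parameter setting and returns the error estimates (P̂F, P̂M), for example from cross-validation.

```python
import numpy as np

def select_best(settings, train_and_estimate, alpha):
    # Among the candidate parameter settings, keep the one with the smallest
    # estimated miss rate whose estimated false alarm rate is at most alpha.
    # Returns None if no setting meets the constraint.
    best, best_pm = None, np.inf
    for s in settings:
        pf_hat, pm_hat = train_and_estimate(s)
        if pf_hat <= alpha and pm_hat < best_pm:
            best, best_pm = s, pm_hat
    return best
```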

3.3. Coordinate descent: Speeding up the 2ν-SVM

The additional parameter in the 2ν-SVM renders a full grid search very time consuming. Fortunately, a simple speed-up is possible. We have observed across a wide range of datasets and kernels that the errors PF and PM vary smoothly when plotted as functions of (ν+, ν−) ∈ [0, 1]². Thus, instead of conducting a full grid search over (ν+, ν−) we propose a kind of coordinate descent search. Several variants are possible, but the one we employ runs as follows: Find the best parameters on grids placed along the lines ν+ = 1/2 and ν− = 1/2. From then on, conduct a line search in the direction orthogonal to the previous line search, at each step selecting the parameters minimizing P̂M such that P̂F ≤ α. Note that this strategy would be more difficult to justify with the 2C-SVM because the choice of endpoints and grid spacing would ultimately be arbitrary and data-dependent.
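One way to realize this search is sketched below, reusing the hypothetical select_best and train_and_estimate routines from the previous sketch; the 50-point grid matches Section 5, while the number of alternating line searches is an assumption.

```python
import numpy as np

def coordinate_descent(train_and_estimate, alpha, grid_size=50, n_line_searches=4):
    grid = np.linspace(0.0, 1.0, grid_size)
    # Initial searches along the lines nu_- = 1/2 and nu_+ = 1/2 (assumes at
    # least one of these settings meets the false alarm constraint).
    initial = [(p, 0.5) for p in grid] + [(0.5, m) for m in grid]
    best = select_best(initial, train_and_estimate, alpha)
    search_nu_plus = True
    for _ in range(n_line_searches):
        # Line search orthogonal to the previous one, keeping the current
        # best point as a fallback candidate.
        if search_nu_plus:
            candidates = [(p, best[1]) for p in grid]
        else:
            candidates = [(best[0], m) for m in grid]
        best = select_best(candidates + [best], train_and_estimate, alpha)
        search_nu_plus = not search_nu_plus
    return best
```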

3.4. Windowing the estimated errors

The observation about the smoothness of PF and PM as functions of (ν+, ν−) leads to another heuristic improvement in all of our methods. For the full grid search over (ν+, ν−), after estimating the error at each point on the grid, we low-pass filter both P̂F and P̂M with a Gaussian window. This effectively reduces the variance of the error estimates. It is especially effective for high variance estimates like cross-validation. Without windowing, some grid points will look much better than they actually are, due to chance variation. For coordinate descent we window along lines in the grid, and for the bias shifting approach, we window the estimates across the ν grid. As with the coordinate descent strategy, the ability to discretize the parameter space of the 2ν-SVM with a uniform grid plays a key role in justifying this heuristic.
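The windowing step can be sketched with a Gaussian low-pass filter over the grids of error estimates; scipy.ndimage.gaussian_filter is used here as one way to realize it, the standard deviation of one grid interval follows Section 5, and the boundary handling is left at the filter's default as an assumption.

```python
from scipy.ndimage import gaussian_filter

def window_error_grids(pf_grid, pm_grid, sigma=1.0):
    # Low-pass filter the grids of estimated false alarm and miss rates with a
    # Gaussian window (sigma in units of grid intervals) to reduce the variance
    # of the cross-validation estimates before applying the selection rule.
    return gaussian_filter(pf_grid, sigma), gaussian_filter(pm_grid, sigma)
```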

4. MEASURING PERFORMANCE

Any experimental comparison of classifiers requires a measure of performance, that is, a scalar criterion that we can evaluate to compare two classifiers head-to-head. In Neyman-Pearson classification, if we only report the observed false alarm and miss probabilities, we will often observe that one classifier has a smaller PF but larger PM than another. In this case we cannot make a definitive statement about which classifier is better.

One option for a scalar performance measure is to estimate PF(f) and PM(f) and assign a "score" of P̂M(f) to the classifier if P̂F(f) ≤ α, and ∞ otherwise, where P̂F and P̂M are based on an independent test sample. This is problematic, however, because the estimated false alarm rate is based on data and hence susceptible to error. Moreover, in practical settings, it is often acceptable to have PF(f) be a small amount greater than α.

We evaluate our classifiers using the measure

$$
E(f) = \frac{1}{\alpha}\max\{P_F(f) - \alpha,\, 0\} + P_M(f). \qquad (1)
$$

As discussed in [10], this measure satisfies the following desirable properties: (i) It is minimized by the classifier f*_α = arg min{PM(f) | PF(f) ≤ α}. (ii) It can be accurately estimated from a test sample using the simple plug-in estimate. (iii) It has the appealing property that as α draws closer to 0, a stiffer penalty is exacted on classifiers that violate the constraint. In other words, it penalizes the relative error (PF(f) − α)/α.
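Computing (1) from test-sample estimates is a one-liner; the sketch below uses plug-in estimates of PF and PM.

```python
def np_score(pf_hat, pm_hat, alpha):
    # E(f) = (1/alpha) * max{PF(f) - alpha, 0} + PM(f), evaluated with the
    # plug-in estimates from an independent test sample.
    return max(pf_hat - alpha, 0.0) / alpha + pm_hat
```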

5. EXPERIMENTS

In our experiments with the ν-SVM we used the LIBSVM package [2]. For the 2ν-SVM we implemented our own version that is available online at www.dsp.rice.edu/software.

We ran our algorithms on the benchmark datasets named "banana", "heart", "thyroid", and "breast." The datasets are available online with documentation.¹ The first dataset is synthetic, while the other three are based on real data collected from various repositories on the web. There are 100 permutations of each dataset into training and test data. The dimensions of the datasets are 2, 13, 5, and 9, respectively, and the training sample sizes are 400, 170, 140, and 200, respectively. The targeted α is 0.1.

In all of our experiments we used a radial basis function (Gaussian) kernel and searched for the bandwidth parameter σ over a logarithmically spaced grid of 50 points from 10⁻⁴ to 10⁴. For the bias-shifting method we searched over a uniform grid of the parameter ν of 50 points. For the 2ν-SVM methods we considered a 50 × 50 regular grid of (ν+, ν−) ∈ [0, 1]².

For each permutation of each dataset we ran our algorithms on the training data and estimated the false alarm and miss rates using the test data. Table 1 reports the average false alarm and miss rates over all 100 permutations, along with standard deviations. For each permutation we also computed the performance measure E, and we show the median values in the table. The table also reports the average training time of each algorithm. The methods compared are bias-shifting with the ν-SVM (BS), balanced bias-shifting using the 2ν-SVM with ν+ = ν− (BBS), the 2ν-SVM with a full grid search over (ν+, ν−) (GS), and the 2ν-SVM with a coordinate descent search over (ν+, ν−) (CD). For each of the last two methods we also applied a smoothing window to the error estimates as described in Section 3.4. These two variants are indicated by a "W".

¹ http://ida.first.fhg.de/projects/bench/


Table 1. Average values of PF and PM (over 100 permutations of each data set), the median NP error scores E, and average running times for the six tested methods. Here α = 0.1. The tested methods are bias-shifting (BS), balanced bias-shifting (BBS), and 2ν-SVM grid search (GS), windowed grid search (WGS), coordinate descent (CD), and windowed coordinate descent (WCD).

Dataset   Method   PF           PM           E       Time (s)
thyroid   BS       .057 ± .07   .455 ± .47   .637    19
          BBS      .089 ± .06   .059 ± .15   .077    32
          GS       .098 ± .09   .064 ± .09   .127    898
          WGS      .087 ± .06   .032 ± .05   .051    898
          CD       .084 ± .06   .039 ± .05   .066    55
          WCD      .093 ± .06   .032 ± .05   .082    55
heart     BS       .086 ± .09   .553 ± .39   1.000   58
          BBS      .113 ± .10   .368 ± .33   .681    66
          GS       .124 ± .06   .219 ± .07   .375    2801
          WGS      .113 ± .05   .231 ± .07   .326    2801
          CD       .106 ± .05   .230 ± .06   .318    169
          WCD      .110 ± .05   .231 ± .06   .330    169
breast    BS       .000 ± .00   1.00 ± .00   1.000   45
          BBS      .078 ± .23   .910 ± .24   1.000   83
          GS       .156 ± .09   .668 ± .10   1.122   2084
          WGS      .112 ± .06   .689 ± .10   .821    2084
          CD       .114 ± .06   .683 ± .10   .871    121
          WCD      .119 ± .06   .678 ± .10   .906    121
banana    BS       .109 ± .07   .334 ± .40   .628    212
          BBS      .142 ± .04   .104 ± .02   .464    221
          GS       .114 ± .03   .120 ± .02   .255    9727
          WGS      .104 ± .02   .124 ± .02   .160    9727
          CD       .104 ± .02   .125 ± .02   .179    541
          WCD      .106 ± .03   .124 ± .02   .198    541

The window size was 3 × 3 for GS and 1 × 3 for CD. The standard deviation of the Gaussian window was set to the length of one grid interval. Different window sizes and widths were tried, but without much change in performance. Windowing for BS and BBS was also tried but led to no improvements. For all methods, the parameters were selected using leave-one-out cross-validation (LOOCV). For the bias-shifting approaches, the bias was selected using the resubstitution estimate.

6. CONCLUSION

We have studied two general approaches to controlling the false alarm rate with SVMs. Using the ν-SVM and 2ν-SVM, respectively, we experimentally evaluated the strategies of adjusting the bias and weighting the margin errors. Since the latter approach introduces an additional parameter and is therefore much slower than the former, we also studied a heuristic speed-up based on a greedy coordinate descent search through the 2ν-SVM parameter space.

Based on the results in Table 1, we can definitively conclude that the 2ν-SVM methods (GS, WGS, CD, and WCD) outperform bias-shifting (BS). While balanced bias-shifting (BBS) is also clearly superior to BS, in general it cannot compete with the 2ν-SVM methods either. In part this is to be expected, since the 2ν-SVM offers a more flexible set of classifiers. Yet a more significant problem was the difficulty in enforcing the false alarm constraint; both BS and BBS frequently result in classifiers for which PF significantly exceeds α, resulting in poor average performance.

The more surprising result is that the full grid search (GS) performed worse than the heuristics (WGS, CD, and WCD). Windowing consistently improved GS, and in general WGS exhibited the best performance. Windowing did not seem to offer any benefit to CD, but CD was remarkably competitive with WGS. Moreover, CD is only 2 to 3 times slower than BS. Given the computational demand of (W)GS, CD seems to be a reliable compromise of computing time and performance.

This paper also contributed a theorem on the feasible parameter set of the 2ν-SVM. This theorem is important for our algorithms because it allows us to discretize the parameter space via a uniform grid in the unit square. This regular grid in turn underlies two heuristics that gave us improved performance: the coordinate descent search and the windowed error estimates.

The performance of classifiers was measured by E(f) = (1/α) max{PF(f) − α, 0} + PM(f). Thus, accurate error estimation is crucial to an algorithm's success. If a learning rule results in an estimated PF ≤ α, it stands a much better chance of being competitive with respect to this measure, since errors in excess of α are penalized heavily. In our algorithms we estimated errors using LOOCV, and selected parameters according to the rule min{P̂M(f) | P̂F(f) ≤ α}. Yet in several cases the final test estimate of PF was in excess of α. We did find some improvements in error estimation by low-pass filtering the error estimates to remove some of the high variance of LOOCV.

Future work on Neyman-Pearson classification should focus on improved methods of error estimation. One possibility is to replace LOOCV with an error estimate with a negative bias, such as the bootstrap zero estimator. Another is to train the classifier as if α is really some number α′ < α. Yet this quickly enters the realm of ad hoc fixes, and it may take some care to develop rules that perform well across a variety of datasets.

7. REFERENCES

[1] C. Scott and R. Nowak, "A Neyman-Pearson approach to statistical learning," IEEE Trans. on Information Theory, November 2005.

[2] C. C. Chang and C. J. Lin, LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

[3] B. Schölkopf and A. J. Smola, Learning with Kernels, MIT Press, Cambridge, MA, 2002.

[4] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, no. 3, pp. 273–297, 1995.

[5] B. Schölkopf, A. J. Smola, R. Williamson, and P. Bartlett, "New support vector algorithms," Neural Computation, vol. 12, pp. 1083–1121, 2000.

[6] E. Osuna, R. Freund, and F. Girosi, "Support vector machines: Training and applications," Tech. Rep. A.I. Memo No. 1602, MIT Artificial Intelligence Laboratory, March 1997.

[7] H. G. Chew, R. E. Bogner, and C. C. Lim, "Dual-ν support vector machine with error rate and training size biasing," in Proc. Int. Conf. on Acoustics, Speech, and Signal Proc. (ICASSP), 2001, pp. 1269–1272.

[8] C. C. Chang and C. J. Lin, "Training ν-support vector classifiers: Theory and algorithms," Neural Computation, vol. 13, pp. 2119–2147, 2001.

[9] M. A. Davenport, "The 2ν-SVM: A cost-sensitive extension of the ν-SVM," Tech. Rep. TREE 0504, Rice University, Dept. of Elec. and Comp. Engineering, October 2005. Available at http://www.ece.rice.edu/~md.

[10] C. Scott, "Performance measures for Neyman-Pearson classification," Tech. Rep., Rice University, Dept. of Statistics, 2005. Available at http://www.stat.rice.edu/~cscott.

Proc. Int. Conf. Acoustics, Speech, Signal Processing -- ICASSP, May 2006.