
Two methods of selecting Gaussian kernel parameters for one-class SVM and their application to fault detection



Knowledge-Based Systems 59 (2014) 75–84


Yingchao Xiao, Huangang Wang *, Lin Zhang, Wenli Xu
Institute of Control Theory and Technology, Department of Automation, Tsinghua University, Beijing 100084, China

* Corresponding author. Tel.: +86 10 62781993; fax: +86 10 62786911.
E-mail addresses: [email protected] (Y. Xiao), [email protected] (H. Wang), [email protected] (L. Zhang), [email protected] (W. Xu).

http://dx.doi.org/10.1016/j.knosys.2014.01.020
0950-7051/© 2014 Elsevier B.V. All rights reserved.


Article history:
Received 30 July 2013
Received in revised form 17 January 2014
Accepted 17 January 2014
Available online 27 January 2014

Keywords:
One-class classification
OCSVM
Gaussian kernel
Parameter selection
Fault detection

Abstract

As one of the methods to solve one-class classification (OCC) problems, one-class support vector machines (OCSVM) have been applied to fault detection in recent years. Among all the kernels available for OCSVM, the Gaussian kernel is the most commonly used one. The selection of Gaussian kernel parameters greatly influences the performance of classifiers, and it remains an open problem. In this paper two methods are proposed to select Gaussian kernel parameters in OCSVM: according to the first one, the parameters are selected using the information of the farthest and the nearest neighbors of each sample; using the second one, the parameters are determined via detecting the "tightness" of the decision boundaries. The two proposed methods are tested on UCI data sets and Tennessee Eastman Process benchmark data sets. The results show that the two proposed methods can be used to select suitable parameters for the Gaussian kernel, enabling the resulting OCSVM models to perform well on fault detection.

© 2014 Elsevier B.V. All rights reserved.

1. Introduction

One-class classification (OCC) is the problem of constructing a description of a data set when only samples belonging to one class are available, and then detecting whether a new sample resembles this data set according to the constructed description. Many methods have been proposed to solve OCC problems in recent years. Kernel Principal Component Analysis (KPCA) is used to find the principal loading vectors along which samples have the maximal variances in the feature space, and to obtain the decision function of OCC by introducing reconstruction errors [1]. The Minimum Spanning Tree methodology is also used to construct a description of the data set of one class, and a new sample is classified based on its distance to the closest edge of this tree [2]. More recently, Gaussian processes have been adopted to describe the latent functions of the samples and several classification criteria have been proposed [3]. However, these methods need to store all the training samples, impeding their application to big data sets. On the other hand, one-class support vector machines (OCSVM) [4] solve OCC by finding a hyperplane in the feature space that separates training samples from the origin with maximum margin, and support vector data description (SVDD) [5] solves it by finding a hypersphere of minimum volume to cover the training samples. Different from the above methods, these last two methods can build sparse OCC models.

For these methods, the adopted kernels affect their performances greatly, because kernels determine the distribution of data mappings in the feature space. Among various kernels, the Gaussian kernel avoids the influence of the norms of samples and creates sample mappings of unit length in the feature space, which is beneficial to getting a closed decision boundary for OCC [1]. Moreover, the Gaussian kernel has only one parameter, i.e. its width parameter s, which makes it easy to tune. Therefore, the Gaussian kernel is commonly used in OCC, although other kernels are also available. In addition, OCSVM and SVDD become equivalent when the Gaussian kernel is adopted [4,5]. Therefore, without loss of generality, in this paper we focus on selecting the Gaussian kernel parameter for OCSVM.

Although there are numerous parameter selection methods for binary-class SVM, they do not apply to OCSVM. For example, various criteria for measuring the separation of binary classes are defined in [6-10], where the optimal kernel parameters are obtained through maximizing them; in [11,12] the optimal kernel parameters are obtained via exploring the distribution of data mappings in the feature space. However, these methods need to use the information from different classes of data, and thus are not suitable for OCSVM. Methods dealing with the parameter selection of the Gaussian kernel in OCC can be classified into two categories: "indirect" ones and "direct" ones. The "indirect" methods are independent of OCSVM models, only utilizing the data distribution of one class.


Evangelista et al. [13] defined a criterion VAR²/(MEAN + ε) to measure the dispersion of the kernel matrix, where VAR² and MEAN are the variance and the mean of the non-diagonal entries, respectively. This method (hereinafter referred to as VM) obtains s by maximizing this criterion. Subsequent experiments will show that the s obtained by VM always leads to overfitting of OCSVM models. Khazai et al. [14] proposed a method to calculate s from the maximum distance between positive samples (this method will be referred to as MD hereinafter). Because MD does not take the local distribution of the training positive samples into consideration, the resulting s is prone to be large and the model cannot describe the data distribution accurately. By contrast, the "direct" methods select the optimal parameter and train the OCSVM models alternately, using feedback from OCSVM models to tune the parameter and afterwards training new models based on the tuned parameter; this process goes on until convergence. Tax and Duin [5] proposed to estimate the leave-one-out error (LOO) of the positive class from the fraction of support vectors (SVs) and used this estimate and the user-defined error fraction to tune the kernel parameter. This method will be referred to as FSV hereinafter. Since it only considers the LOO for the positive class, it does not deal with potential negative classes properly, which leads to its underfitting performance. Deng and Xu [15] proposed a skewness-based method to generate outliers (this method will be referred to as SKEW hereinafter), and then took the proportion of outliers accepted by OCSVM models and that of SVs to all training samples as estimates of the type I error and type II error, respectively. It optimizes s through minimizing the weighted sum of these errors. However, the performance of this method relies heavily on the locations and quantity of the generated outliers, and it will deteriorate once the outliers are generated inappropriately.
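As a concrete point of reference for the comparisons below, the VM criterion can be written down in a few lines. The sketch below is our own illustration, not code from [13]; the grid search over candidate widths and the small regularizer 1e-12 are our assumptions.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def vm_criterion(X, s):
    """VM criterion as described above: variance / (mean + eps) of the
    non-diagonal Gaussian kernel entries (illustrative reconstruction)."""
    sq_dists = squareform(pdist(X, "sqeuclidean"))   # ||xi - xj||^2
    K = np.exp(-sq_dists / s)
    off_diag = K[~np.eye(len(X), dtype=bool)]        # non-diagonal entries only
    return off_diag.var() / (off_diag.mean() + 1e-12)

# VM picks the s that maximizes the criterion, e.g. over a candidate grid:
# s_grid = np.logspace(-3, 3, 50)
# s_vm = s_grid[np.argmax([vm_criterion(X, s) for s in s_grid])]
```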

In this paper, we propose two new methods to select Gaussian kernel parameters in OCSVM. The first one is indirect, optimizing the Gaussian kernel parameter utilizing the information of the farthest and the nearest neighbors of each sample; the second one is direct, and optimizes the parameter via detecting the "tightness" of the decision boundaries.

The remainder of this paper is organized as follows. The second section briefly reviews OCSVM and some properties of the Gaussian kernel. In Sections 3 and 4, we propose our two methods of selecting Gaussian kernel parameters. In Section 5, experiments on UCI data sets and Tennessee Eastman Process benchmark data sets compare our methods with other similar ones on fault detection. Section 6 concludes this paper.

2. Preliminary knowledge

OCC is the problem of constructing models for data with merely samples of one class, and Schölkopf et al. [4] proposed the OCSVM optimization to solve this problem. Assume x_1, x_2, ..., x_n ∈ X are training samples whose labels are all positive, where n is the number of training samples and X is the input space; Φ is the mapping function that maps samples from X to the feature space F. The basic idea of OCSVM is to find a hyperplane in the feature space that separates the training sample mappings from the origin with maximum margin. In this case, OCSVM treats the origin as a representative negative sample, and the maximum margin of OCSVM tries to separate the positive samples from the negative one as far as possible. This hyperplane is denoted by f(x) = (w · Φ(x)) − ρ = 0, where w and ρ are its normal vector and offset, respectively. Therefore, the optimization for OCSVM is formulated as follows [4]:

$$\min_{w \in F,\ \xi \in \mathbb{R}^n,\ \rho \in \mathbb{R}} \ \frac{1}{2}\|w\|^2 + \frac{1}{\nu n}\sum_{i}\xi_i - \rho \quad \text{s.t.} \quad (w \cdot \Phi(x_i)) \ge \rho - \xi_i,\ \ \xi_i \ge 0,\ \ i = 1,\ldots,n \tag{1}$$

where ν ∈ (0,1] and the ξ_i are slack variables. Based on the Lagrangian, the dual optimization problem of (1) is as follows:

$$\max_{\alpha}\ -\frac{1}{2}\alpha^{T} K \alpha \quad \text{s.t.} \quad 0 \le \alpha_i \le \frac{1}{\nu n},\ \ \sum_{i}\alpha_i = 1,\ \ i = 1,\ldots,n \tag{2}$$

where α = [α_1, α_2, ..., α_n]^T are the dual variables and K is the kernel matrix with (K)_{ij} = K_{ij} = k(x_i, x_j) = (Φ(x_i) · Φ(x_j)). In this paper we focus on the Gaussian kernel

$$k(x_i, x_j) = \exp\!\left(-\frac{\|x_i - x_j\|^2}{s}\right) \tag{3}$$

where s is the Gaussian width parameter, and we will discuss how to select it in the following sections. Once the optimal solution α is obtained, the offset ρ can be calculated by (w · Φ(x_s)), where w = Σ_{i=1}^{n} α_i Φ(x_i) and x_s is some sample whose corresponding α_s ∈ (0, 1/(νn)). Then we obtain the OCSVM decision function f(x) = (w · Φ(x)) − ρ = Σ_{i=1}^{n} α_i k(x_i, x) − ρ. If f(x) ≥ 0, x is predicted to be positive; otherwise it is predicted to be negative.

Next, we will review some properties of the Gaussian kernel. It is known that

$$k(x, x) = (\Phi(x) \cdot \Phi(x)) = \|\Phi(x)\|^2 = 1. \tag{4}$$

Thus, in the feature space defined by the Gaussian kernel, the norm of each mapping is 1, i.e. all mappings are located on the surface of a unit hypersphere. Therefore, the distance between any two mappings in the feature space could be calculated as follows:

$$\|\Phi(x_i) - \Phi(x_j)\|^2 = \|\Phi(x_i)\|^2 + \|\Phi(x_j)\|^2 - 2(\Phi(x_i) \cdot \Phi(x_j)) = 2 - 2k(x_i, x_j) = 2 - 2\exp\!\left(-\frac{\|x_i - x_j\|^2}{s}\right) \tag{5}$$

Based on the monotonicity of the exponential function, we know that if ‖x_i − x_j‖² < ‖x_l − x_k‖² in the input space, then it holds that ‖Φ(x_i) − Φ(x_j)‖² < ‖Φ(x_l) − Φ(x_k)‖² in the feature space; that is, the Gaussian kernel preserves the structure of the training samples.

At last, we discuss the effect of the Gaussian kernel parameter on OCSVM models. According to the properties mentioned above, x_1, x_2, ..., x_n are mapped into an orthant of the feature space and their mappings are located on the surface of a hypersphere. As the Gaussian kernel parameter s → 0, k(x_i, x_j) = (Φ(x_i) · Φ(x_j)) → 0, that is, Φ(x_i) and Φ(x_j) tend to be orthogonal, and thus the mappings of the samples tend to locate themselves at the edge of the orthant. Based on the basic idea of OCSVM, this distribution of mappings would increase the number of SVs, and consequently make the model overfit. On the other side, as s → ∞, k(x_i, x_j) = (Φ(x_i) · Φ(x_j)) → 1, and the angle between Φ(x_i) and Φ(x_j) tends to 0, that is, the mappings of all samples converge to one point. The OCSVM model built upon this distribution is unable to reflect local structures of the samples, lacking in classification ability.
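To make this effect of s concrete, the following minimal sketch (our own illustration, not from the paper) trains an OCSVM for several widths with scikit-learn, whose RBF kernel exp(−γ‖x_i − x_j‖²) corresponds to γ = 1/s, and reports the fraction of support vectors; the fraction grows as s shrinks and the model overfits, and collapses as s grows and the model loses resolution.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))          # toy positive samples

for s in (0.01, 1.0, 100.0, 1e4):      # candidate Gaussian widths
    # scikit-learn's RBF kernel is exp(-gamma * ||xi - xj||^2), so gamma = 1/s
    model = OneClassSVM(kernel="rbf", gamma=1.0 / s, nu=0.05).fit(X)
    frac_sv = len(model.support_) / len(X)
    print(f"s = {s:>8}: fraction of SVs = {frac_sv:.2f}")
```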

3. An indirect method: DFN

In this section, we propose an "indirect" method of optimizing the Gaussian kernel parameter s. Our method utilizes the distances from training samples to their farthest neighbors and the distances to their nearest neighbors, so we name it DFN.


3.1. The DFN method

In OCC, models are expected to separate the training samples of the positive class from other, negative classes. Since there are no samples of negative classes, the farthest neighbors of every training sample are used instead, and we call them pseudo-negative samples. Obviously, these pseudo-negative samples are located closely to the training samples. Therefore, if an OCSVM model corresponding to the parameter s could separate the mappings of the training samples from those of the pseudo-negative samples, then, based on the structure preservation of the Gaussian kernel, it would be able to separate the mappings of the training samples from those of real negative samples, because samples from negative classes are usually located farther than pseudo-negative ones from the training samples. On the other hand, we describe the distribution information within the training samples using their nearest neighbors, reflecting their local structures. In order to improve the performance of OCSVM models, the distances from the mappings of training samples to those of pseudo-negative samples are supposed to be larger, while the distances between the mappings of training samples are supposed to be smaller. Therefore, we define the following objective function, which is to be maximized to get the optimal s:

$$f_o(s) = \frac{1}{n}\sum_{i=1}^{n} \max_{j} \|\Phi(x_i) - \Phi(x_j)\|^2 - \frac{1}{n}\sum_{i=1}^{n} \min_{j \ne i} \|\Phi(x_i) - \Phi(x_j)\|^2 \tag{6}$$

The first term of (6) takes the mean of the distances from training sample mappings to their farthest neighbors as a measure of the separation between the positive class and negative classes; the second term takes the mean of the nearest-neighbor distances as a measure of inner-class structure representation quality. Note that the objective function in (6) is similar to, but not the same as, the one in [12], where it is applied to binary classification.

Based on (5), formula (6) could be reformulated as

$$f_o(s) = \frac{1}{n}\sum_{i=1}^{n} \max_{j}\left(2 - 2k(x_i, x_j)\right) - \frac{1}{n}\sum_{i=1}^{n} \min_{j \ne i}\left(2 - 2k(x_i, x_j)\right) = \frac{2}{n}\sum_{i=1}^{n} \max_{j \ne i} k(x_i, x_j) - \frac{2}{n}\sum_{i=1}^{n} \min_{j} k(x_i, x_j) \tag{7}$$

Using the monotonicity of the exponential function, we can further simplify (7) as

$$f_o(s) = \frac{2}{n}\sum_{i=1}^{n} \max_{j \ne i} k(x_i, x_j) - \frac{2}{n}\sum_{i=1}^{n} \min_{j} k(x_i, x_j) = \frac{2}{n}\sum_{i=1}^{n} \exp\!\left(-\frac{\min_{j \ne i}\|x_i - x_j\|^2}{s}\right) - \frac{2}{n}\sum_{i=1}^{n} \exp\!\left(-\frac{\max_{j}\|x_i - x_j\|^2}{s}\right) = \frac{2}{n}\sum_{i=1}^{n} \exp\!\left(-\frac{\mathrm{Near}(x_i)}{s}\right) - \frac{2}{n}\sum_{i=1}^{n} \exp\!\left(-\frac{\mathrm{Far}(x_i)}{s}\right) \tag{8}$$

where Near(x_i) = min_{j≠i} ‖x_i − x_j‖² and Far(x_i) = max_j ‖x_i − x_j‖². As the training samples are given, these two values are constant for any x_i. The constraint s > 0 can be removed by replacing s by σ². In this way, maximizing f_o(σ) becomes an unconstrained optimization problem, which can be solved more conveniently.¹

¹ Because Near(x) is less than Far(x), f_o(σ) is greater than zero. When σ → 0, f_o(σ) → 0. Therefore, σ = 0 cannot maximize f_o(σ) and it will not become the optimal solution. In addition, we maximize f_o(σ) using the Matlab function "fminunc".

In order to obtain f_o(s), we need to calculate the distances between every pair of samples and to find the nearest and farthest neighbors of each sample. The computational complexity of this step is O(n²). After this step, the function f_o(s) is fixed and irrelevant to the quantity of samples. Therefore, the overall computational complexity of DFN is O(n²).
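For readers who prefer code, the DFN procedure just described can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it substitutes s = σ² as above, but replaces Matlab's "fminunc" with a generic quasi-Newton routine from SciPy, and the starting point is our own heuristic.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.spatial.distance import pdist, squareform

def dfn_select_s(X):
    """Sketch of DFN: maximize f_o in (8) over sigma, with s = sigma^2."""
    sq = squareform(pdist(X, "sqeuclidean"))       # pairwise ||xi - xj||^2
    np.fill_diagonal(sq, np.inf)
    near = sq.min(axis=1)                          # Near(xi) = min_{j!=i} ||xi - xj||^2
    np.fill_diagonal(sq, -np.inf)
    far = sq.max(axis=1)                           # Far(xi) = max_j ||xi - xj||^2

    def neg_fo(sigma):                             # minimize -f_o(sigma)
        s = sigma[0] ** 2 + 1e-12
        return -(2.0 / len(X)) * (np.exp(-near / s) - np.exp(-far / s)).sum()

    sigma0 = np.sqrt(np.median(near))              # heuristic starting point (our choice)
    res = minimize(neg_fo, x0=[sigma0], method="BFGS")
    return float(res.x[0] ** 2)                    # the selected width s
```

The BFGS choice and the starting point are illustrative; the paper only states that f_o(σ) is maximized with an unconstrained optimizer.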

3.2. Some illustrative experiments

To illustrate the performance of DFN and compare it with the other two "indirect" methods mentioned in Section 1, MD and VM, we perform experiments on two toy data sets. Samples of the two data sets are distributed in a banana shape and a ring shape, respectively. The banana-shaped data set contains 500 samples, generated by the Matlab toolbox PRtools [16]; the ring-shaped data set also contains 500 samples, distributed uniformly within a ring form whose inner and outer diameters are 0.6 and 1, respectively. These three methods obtain their own optimal s values, and then build respective OCSVM models based on these values. The decision boundaries of these different methods are shown in Figs. 1 and 2, where the points stand for training samples and the curves stand for OCSVM decision boundaries.

For the banana-shaped data set, the decision boundary of MD is apparently loose; the curve in Fig. 1(a) does not reflect the concave part of the banana-shaped region. On the contrary, the decision boundaries of VM are tight; the curves in Fig. 1(b) are shattered and unable to outline the banana-shaped area, which means that the OCSVM model of VM overfits the samples seriously. It is clearly shown in Fig. 1(c) that the DFN method proposed in this section is capable of describing the banana-shaped data set accurately. It neither mistakes the inside of the banana-shaped region as negative classes, nor takes the outside regions as a part of the region.

For the ring-shaped data set, the decision boundary of MD is still loose; the inner diameter of the curve in Fig. 2(a) is too small, making the ring broader than expected. The OCSVM model built by VM overfits the data set. A lot of holes exist inside the ring in Fig. 2(b), which will lead to a high false negative rate. By contrast, DFN (Fig. 2(c)) performs well, and the OCSVM model built can describe the ring-shaped region properly.

The experiments on the above two data sets demonstrate that the decision boundaries of MD are always loose, unable to reflect the local structures of training samples, consistent with the analysis in Section 1. The VM method is inclined to overfit the training samples, and to build complicated classification models, leading to the loss of generalization ability. This may result from the MEAN value in its criterion: when the parameter s is small, the entries in the kernel matrix are close to 0, which makes the value of MEAN also close to 0; this leads to a high value of the VM criterion. Therefore, VM is likely to obtain a small s to achieve a high value of the target function, resulting in an overfitting model. The DFN method makes use of the information about the farthest neighbors and the nearest neighbors. It takes the local structures into consideration while maximizing the differences between the positive class and negative ones; therefore it outperforms both MD and VM.

Next, we test the above three methods on a more complicated toy data set, which possesses a multi-mode Gaussian distribution. It contains 500 samples generated by PRtools [16]. The results of these methods on this data set are shown in Fig. 3.

As shown in Fig. 3, MD underfits the data set while VM overfits it. However, DFN does not obtain satisfying results either: it only excludes a part of the middle region, so it fails to deal with the multi-mode data set. This is because the capability of statistics about the farthest and nearest neighbors is limited, not sufficient to elaborate on this complicated distribution. Specifically, it is impossible to recognize multiple modes of samples just based on statistics about the farthest and nearest neighbors. Nevertheless, it is a difficult problem to estimate the distribution of samples before OCC models are built, and it needs to be studied further.


Fig. 1. Decision boundaries of indirect methods on the banana-shaped data set: (a) MD, (b) VM, (c) DFN.

Fig. 2. Decision boundaries of indirect methods on the ring-shaped data set: (a) MD, (b) VM, (c) DFN.

Fig. 3. Decision boundaries of indirect methods on the multi-mode Gauss data set: (a) MD, (b) VM, (c) DFN.


4. A direct method: DTL

The "indirect" methods of optimizing s are independent of OCSVM models. Due to this independence, they have low computational complexity and are widely used, but they do not make sufficient use of the valuable information of the OCSVM models. On the contrary, the "direct" methods combine optimizing s with utilizing the OCSVM models, using the feedback from OCSVM models to tune s. In the following subsection, we propose a "direct" method of optimizing s, which takes advantage of both the feedback from OCSVM models and the training samples.

4.1. The DTL method

An ideal OCSVM model should construct a suitable decision boundary: it should neither be tight, so as to keep good generalization ability, nor be loose, so as to make sure negative samples are recognized. Therefore, we propose a method of detecting whether OCSVM decision boundaries are tight or loose, which is named DTL. The basic idea of DTL is as follows. If there exist one or several "holes" inside the decision boundaries whose interiors contain no training samples, then the boundaries are judged to be loose; if the boundaries near two neighboring samples are concave or disconnected, then they are judged to be tight.

To embody the above basic ideas, we define the following sets, given thresholds θ₁ and θ₂:

$$\Omega_1 = \left\{ (x_i, x_j)\ \middle|\ \|x_i - x_j\| > \theta_1,\ x_i, x_j \in T,\ i \ne j \right\}$$

$$\Omega_2 = \left\{ (x_i, x_j)\ \middle|\ \|x_i - x_j\| \le \theta_2,\ x_i, x_j \in T,\ i \ne j \right\}$$

where T = {x | f(x) ≥ 0, x ∈ X}. We now specify the tightness detecting rules:

• Loose boundaries: if there exist one or several pairs of training samples (x_i, x_j) ∈ Ω₁ whose midpoint x̄ satisfies f(x̄) > 0, and there are no training samples in the hypersphere {x | ‖x − x̄‖ ≤ Δ/2} (we will discuss how to define Δ subsequently), then the decision boundaries are loose.

• Tight boundaries: if there exist one or several pairs of training samples (x_i, x_j) ∈ Ω₂ whose midpoint x̄ is outside the boundaries, i.e. f(x̄) < 0, then the boundaries are tight.
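The two rules above translate directly into code. The following sketch is our own illustration (not the authors' implementation): it assumes a trained scikit-learn OneClassSVM plays the role of the decision function f, and it takes θ₁ = θ₂ = Δ, as suggested later in this section; Δ itself is passed in as a parameter.

```python
import numpy as np
from itertools import combinations

def detect_tightness(model, X, delta):
    """Return (loose, tight) flags for the decision boundary of `model` on
    training set X, following the two rules above (illustrative sketch)."""
    inside = model.decision_function(X) >= 0          # T = {x | f(x) >= 0}
    T = X[inside]
    loose, tight = False, False
    for i, j in combinations(range(len(T)), 2):
        d = np.linalg.norm(T[i] - T[j])
        mid = 0.5 * (T[i] + T[j])
        if d > delta and model.decision_function([mid])[0] > 0:
            # "hole" test: no training sample within delta/2 of the midpoint
            if np.linalg.norm(X - mid, axis=1).min() > delta / 2:
                loose = True
        elif d <= delta and model.decision_function([mid])[0] < 0:
            tight = True                               # midpoint falls outside the boundary
    return loose, tight
```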


Table 1
The DTL algorithm.

Input: X = {x₁, ..., xₙ}; initial values for the upper and lower bounds of s, i.e. s_u and s_l; the threshold Δs
Process:
1: Compute Δ by (9)
2: While |s_u − s_l| > Δs, do
3:   Set s = (s_u + s_l)/2, train the OCSVM model, obtain the decision function f(x)
4:   Detect whether the boundary f(x) = 0 is tight or loose:
     (a) Loose: set s_u ← s
     (b) Tight: set s_l ← s
     (c) Both loose and tight: set s_l ← s
     (d) Neither loose nor tight: break
5: End while
6: Return s
Output: the Gaussian kernel parameter s


When training samples are distributed unevenly, it is possible for the boundaries to be judged both tight and loose. In this case the requirements of the real application can be taken into consideration. If there is a strict requirement on the positive class accuracy, the boundaries can be judged to be tight so as to tune s towards a larger value; if the requirement on the negative class accuracy is strict, they can be judged to be loose so as to tune s towards a smaller value. In this paper, we are inclined to judge them to be tight, so as to ensure the models' generalization ability. Once the boundaries are judged to be neither loose nor tight, the optimal s is obtained. To illustrate the abovementioned detecting rules, we give an example in Fig. 4, where the solid lines stand for decision boundaries and the blank dots stand for training samples. In Fig. 4(a) the black filled dot represents the midpoint of a pair of training samples (x_i, x_j) ∈ Ω₁. This dot lies inside the decision boundary and no training samples exist inside the circle centered at it, so the boundary is judged to be loose; in Fig. 4(b) the midpoint of a pair of training samples (x_i, x_j) ∈ Ω₂ lies outside the decision boundary, so the boundary is judged to be tight.

Now let us discuss how to define Δ using the distance statistics of the training samples. We compute the distances between any two training samples, and find the distance Δ_i between each sample x_i and its nearest neighbor. Then Δ is computed as follows:

$$\Delta = \max_i \Delta_i \tag{9}$$

If there exist some samples far apart from most samples in the training data set, they would make the Δ computed by (9) larger than expected. In this case, we could sort the Δ_i and compute Δ based on some given percentile. To further reduce the effect of outliers, some outlier detection methods [17] could be used to preprocess the OCC training set, removing outliers before parameter selection. Usually it is assumed that the training samples are representative positive ones, so it is reasonable to compute Δ by (9). To simplify the DTL process, the thresholds θ₁ and θ₂ can also be set to Δ.
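A possible implementation of (9), with the percentile variant mentioned above as an optional safeguard, is sketched below (our own illustration):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def compute_delta(X, percentile=100):
    """Delta in (9): the largest nearest-neighbor distance in the training set.
    percentile < 100 gives the more outlier-robust variant mentioned above."""
    d = squareform(pdist(X))                 # pairwise Euclidean distances
    np.fill_diagonal(d, np.inf)
    nn_dist = d.min(axis=1)                  # Delta_i for each sample
    return np.percentile(nn_dist, percentile)
```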

According to the analysis in Section 2, the effect of s on the tightness of decision boundaries is monotonic, i.e. the smaller s is, the tighter the boundaries are; the larger s is, the looser the boundaries are. Therefore, considering the monotonic effect of s and the abovementioned tightness detecting rules, we can present the DTL algorithm for optimizing s. In this algorithm, we first find the initial upper bound s_u and lower bound s_l. According to our previous work [18], they can be computed as shown in (10) and (11) to get a loose decision boundary and a tight one, respectively:

$$s_u = 0.4\,\Delta_u^2, \qquad \Delta_u = \max_i \max_j \|x_i - x_j\| \tag{10}$$

$$s_l = 0.4\,\Delta_l^2, \qquad \Delta_l = \max_i \min_{j \ne i} \|x_i - x_j\| \tag{11}$$

Fig. 4. Schematic for DTL: (a) a loose boundary; (b) a tight boundary.

Then we get a candidate kernel parameter s by taking s = (s_u + s_l)/2, i.e. applying the bisection method. The decision boundary corresponding to this s is detected to be loose or tight via the abovementioned detecting rules. Based on the detection result, the candidate interval for s can be narrowed down, and the bisection method can be reapplied in this new interval. The above steps are repeated until the candidate interval is smaller than the given threshold Δs or the decision boundary is neither loose nor tight. The details of this algorithm are listed in Table 1.

Before starting the DTL iterations, Δ, s_u and s_l should be calculated, which has quadratic computational complexity. During each iteration, an OCSVM model is built, whose computational complexity is O(n³) [4], and the resulting boundary is detected to be tight or loose, whose computational complexity is O(n²) in the worst case. Assume this algorithm executes q iterations before it stops; then its computational complexity is O(qn³).
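Putting the pieces together, the DTL loop of Table 1 can be sketched as follows. This is a minimal illustration using scikit-learn rather than the authors' code; `compute_delta` and `detect_tightness` are the hypothetical helpers sketched earlier in this paper, gamma = 1/s maps the width parameter onto scikit-learn's RBF kernel, and the default ν = 0.05 follows the UCI experiments below.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.svm import OneClassSVM

def dtl_select_s(X, nu=0.05, ds=1e-3):
    """Sketch of the DTL algorithm in Table 1 (illustrative, not the authors' code)."""
    d = squareform(pdist(X))
    np.fill_diagonal(d, np.inf)
    delta = compute_delta(X)                      # Delta in (9)
    s_u = 0.4 * d[np.isfinite(d)].max() ** 2      # (10): loose initial boundary
    s_l = 0.4 * d.min(axis=1).max() ** 2          # (11): tight initial boundary

    s = 0.5 * (s_u + s_l)                         # first bisection candidate
    while abs(s_u - s_l) > ds:
        model = OneClassSVM(kernel="rbf", gamma=1.0 / s, nu=nu).fit(X)
        loose, tight = detect_tightness(model, X, delta)
        if not loose and not tight:               # suitable boundary found
            break
        if loose and not tight:
            s_u = s                               # boundary too loose: shrink s
        else:                                     # tight, or both tight and loose
            s_l = s                               # boundary too tight: enlarge s
        s = 0.5 * (s_u + s_l)
    return s
```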

4.2. Some illustrative experiments

To compare the DTL method with the other two "direct" methods, FSV and SKEW, we also test them on the three toy data sets used in the last section. Their decision boundaries are shown in Figs. 5-7, respectively.

The FSV method does not result in good performance on these three data sets, since it underfits the training samples. The reason is probably that it only estimates the LOO for the positive class and tries to minimize its difference from the user-defined error fraction. Usually the defined error fraction is small, so FSV will tend to select a large s to make the estimated LOO small. The SKEW method performs well in two out of the three data sets: it builds proper OCSVM models for the ring-shaped and multi-mode Gauss data sets.



Fig. 5. Decision boundaries of direct methods on the banana-shaped data set: (a) FSV, (b) SKEW, (c) DTL.

Fig. 6. Decision boundaries of direct methods on the ring-shaped data set: (a) FSV, (b) SKEW, (c) DTL.

Fig. 7. Decision boundaries of direct methods on the multi-mode Gauss data set: (a) FSV, (b) SKEW, (c) DTL.


For the banana-shaped data set, however, although the curve in Fig. 5(b), i.e. the decision boundary of SKEW, can outline the banana shape, it is not smooth and has a hole inside, showing a tendency to overfitting. Moreover, when it comes to high-dimensional data sets, it is more difficult for SKEW to generate appropriate outliers, and thus it does not perform well, which will be illustrated in the following experiments. By comparison, the DTL method performs well on all three data sets, and is successful in describing them by enclosing the samples properly.

Compared with the results shown in the last section, SKEW and DTL perform better than MD and VM on all three data sets, and outperform DFN on the multi-mode Gauss data set. This illustrates that proper feedback from models provides valuable information about the training samples, benefiting the optimization of the parameter s. With regard to the performances of SKEW and DTL, it turns out that the latter is better than the former. The reason is that the SKEW method depends heavily on the quality of the generated outliers; if the outliers are distributed inappropriately, the SKEW method is unable to find a suitable parameter s. Instead, our DTL method detects the tightness of the OCSVM model by the locations of training samples relative to the decision boundaries. Since it only makes use of the training samples themselves, it does not suffer from the randomness of generating outliers, being more robust than SKEW. Note that these methods need to train OCSVM models repeatedly, and thus their computational complexities are high.

5. Experiments

In this section, we conduct two groups of experiments to compare the performances of these methods on selecting the Gaussian kernel parameter s for OCSVM models. In order to test these methods on high-dimensional data, we perform experiments on 10 UCI data sets, and compare the accuracies of their OCSVM models. At last, we apply our methods to fault detection on the Tennessee Eastman Process, and compare their fault detection rates and false alarm rates with those of the other methods. In the experiments, we solve the optimization of OCSVM using the toolbox LIBSVM [19].

5.1. Experiment 1: the UCI benchmark data sets

In order to compare the performances of these six methods on high-dimensional data sets, we experiment on 10 UCI benchmark data sets, which are commonly used in pattern recognition and are available online (http://archive.ics.uci.edu/ml/datasets.html). These 10 data sets vary widely in the dimension and quantity of samples.


Table 2
Description of 10 UCI data sets.

Data set      Dim   Ntrain (positive)   Ntest (positive)   Ntest (negative)
Banana          2   216                 2708               2912
Diabetes        8   304                  156                144
Flare-solar     9   367                  222                178
German         20   489                  153                147
Heart          13    94                   55                 46
Image          18   746                  518                492
Ringnorm       20   200                 3536               3464
Titanic         3   100                 1090                961
Twonorm        20   202                 3500               3500
Waveform       21   132                 1515               1685

Table 3
Accuracies for MD, VM, DFN.

Data set    MD               VM               DFN
Banana      0.6053 (0.0084)  0.6765 (0.0342)  0.7760 (0.0409)
Diabetis    0.6702 (0.0208)  0.6362 (0.0269)  0.6717 (0.0207)
Flare       0.5095 (0.0269)  0.4696 (0.0587)  0.5144 (0.0229)
German      0.6898 (0.0201)  0.2975 (0.0210)  0.6930 (0.0216)
Heart       0.6650 (0.0309)  0.4470 (0.0379)  0.6610 (0.0306)
Image       0.7082 (0.0100)  0.8096 (0.0133)  0.7084 (0.0101)
Ringnorm    0.8553 (0.0176)  0.7981 (0.0173)  0.9173 (0.0119)
Titanic     0.6176 (0.1135)  0.5573 (0.1732)  0.6178 (0.1003)
Twonorm     0.8075 (0.0119)  0.7738 (0.0132)  0.8110 (0.0166)
Wavenorm    0.8176 (0.0090)  0.8053 (0.0108)  0.8190 (0.0093)
Average     0.6946 (0.0269)  0.6271 (0.0407)  0.7190 (0.0285)
p-Value     0.0352           0.0371           -
h           1                1                -


Every data set consists of different classes of samples, and we select the one whose training set sufficiently represents the distribution of its own class as the positive class, and the others as the negative class. For example, the Titanic data set contains two classes of samples, A and B, and both have their own training and test sets: Train A, Test A, Train B, Test B. Train A and Train B are used to train two respective OCSVM models, where the Gaussian kernel parameter s is deliberately set to a large value (e.g. 1000). Under this condition, it is known that test samples of the same class as the training samples should be correctly classified if the training set sufficiently represents the distribution of its own class. The OCSVM model trained on Train A correctly classifies only 58.56% of the samples from Test A, while the model trained on Train B correctly classifies 81.37% of the samples from Test B. It is clear that Train B represents the distribution of its own class more sufficiently, and thus class B is selected as the positive class.

Note that, to our knowledge, there are no benchmark OCC data sets available online. Therefore we use these UCI data sets, which are meant for binary-class classification, in the OCC way [20]: training sets contain only samples of the positive class, while test sets contain samples of both positive and negative classes.

We normalize the selected training positive samples to zero mean and unit standard deviation before optimizing the Gaussian kernel parameter s. Then the parameters obtained by the different methods are utilized to build the corresponding OCSVM models, where ν is set to 0.05. The test samples are first normalized by the mean and standard deviation of the training samples, and then evaluated by the built models to get their predicted labels. For each of the 10 data sets, we test these methods on 20 train/test permutations, and the samples are selected randomly in every permutation. The average distributions of positive/negative samples over these permutations are listed in Table 2. We calculate the mean and standard deviation of the accuracies for every method. The results are listed in Tables 3 and 4, where standard deviations are given in brackets. Moreover, we analyze the differences between the proposed methods and the others by the Wilcoxon signed-ranks test [21]. The last two rows in the tables list the test results and their p-values, where h = 1 indicates that there is a significant difference between DFN/DTL and the corresponding other method, and h = 0 indicates that there is no significant difference. Additionally, we also present the oracle-style experiment results in the last column of Table 4. In the oracle-style experiment, we use both positive and negative samples to find the s that results in the best performance. Obviously, this is impossible in real life, because the negative samples are not available for selecting s, but the oracle-style results illustrate the gap between the obtained results and the ideal ones.
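The evaluation protocol described above can be summarized in a short sketch (our own illustration; `select_s` stands for any of the six parameter-selection methods, e.g. the DFN or DTL sketches given earlier, and y_test is assumed to be +1 for positive and -1 for negative test samples).

```python
import numpy as np
from sklearn.svm import OneClassSVM

def evaluate_once(X_train_pos, X_test, y_test, select_s, nu=0.05):
    """One train/test permutation: normalize by the training statistics,
    select s, fit OCSVM on positive samples only, and report test accuracy."""
    mu, sd = X_train_pos.mean(axis=0), X_train_pos.std(axis=0) + 1e-12
    Xtr = (X_train_pos - mu) / sd
    Xte = (X_test - mu) / sd                    # test set uses training statistics

    s = select_s(Xtr)                           # DFN, DTL, MD, VM, FSV or SKEW
    model = OneClassSVM(kernel="rbf", gamma=1.0 / s, nu=nu).fit(Xtr)

    y_pred = np.where(model.decision_function(Xte) >= 0, 1, -1)
    return (y_pred == y_test).mean()            # accuracy over positive and negative test samples
```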

From the results in Tables 3 and 4, it is apparent that the methods we proposed (DFN and DTL) are superior to the other methods on most of the data sets. Based on the theories behind these methods, the MD, VM and DFN methods are in the "indirect" category, for they optimize s using only the geometric distribution of sample mappings in the feature space, with no need to train OCSVM models; FSV, SKEW and DTL are in the "direct" category, for they need to train OCSVM models during the optimization of s. Generally speaking, in both categories the proposed methods are markedly superior to the other methods, based on the results of the Wilcoxon signed-ranks test at the significance level α = 0.05. Specifically, among the three methods of the first category, the models built by DFN obtain higher accuracies than the other two methods on the whole. Over all 10 data sets, the average accuracy of DFN is 0.7190, higher than the 0.6946 of MD and the 0.6271 of VM. On the Ringnorm data set, the accuracy of DFN is as high as 0.9173, markedly superior to those of MD and VM, 0.8553 and 0.7981. On the other hand, among the three methods of the second category, our DTL works better than FSV and SKEW. On 8 out of the 10 data sets, the accuracies of DTL are higher than those of FSV and SKEW, respectively. Taking the Twonorm data set as an example, the accuracy of DTL is 0.8264, slightly better than the 0.8153 of FSV and much better than the 0.5481 of SKEW. Furthermore, in terms of the standard deviation (std), the std of SKEW is mostly higher than those of the other methods. This illustrates the instability of SKEW, mainly due to its reliance on the distribution of the generated outliers. By contrast, our DTL method utilizes only the given training samples to detect the tightness of decision boundaries, free from the random influence of generated outliers; therefore the stability of its accuracies is improved. In summary, judging from the experimental results on the 10 UCI data sets, the methods we propose, DFN and DTL, are able to obtain a suitable Gaussian kernel parameter s, providing the corresponding OCSVM models with higher prediction accuracies.

5.2. Experiment 2: the industrial application on the TEP

In many practical applications, faulty samples are very rare or expensive to obtain. As a result, it is not appropriate to build monitoring models by methods of binary classification. However, taking normal samples as positive and faulty ones as negative, this kind of fault detection problem can be solved by the OCSVM method.

In this experiment, we apply our methods (DFN and DTL) to the Tennessee Eastman Process (TEP), so as to verify their good performance at detecting faults in industrial processes.


Table 4
Accuracies for FSV, SKEW, DTL and ORACLE.

Data set    FSV              SKEW             DTL              ORACLE
Banana      0.6043 (0.0080)  0.8468 (0.0168)  0.7831 (0.0931)  0.8502 (0.0160)
Diabetis    0.6705 (0.0226)  0.5592 (0.0783)  0.6722 (0.0209)  0.6862 (0.0175)
Flare       0.5230 (0.0255)  0.4714 (0.0518)  0.5236 (0.0208)  0.5863 (0.0272)
German      0.6958 (0.0218)  0.3988 (0.1800)  0.6911 (0.0225)  0.7002 (0.0195)
Heart       0.6640 (0.0250)  0.5090 (0.1225)  0.6650 (0.0411)  0.7045 (0.0366)
Image       0.7086 (0.0107)  0.7519 (0.0146)  0.7080 (0.0101)  0.8217 (0.0130)
Ringnorm    0.8514 (0.0090)  0.5173 (0.0997)  0.8963 (0.0171)  0.9492 (0.0096)
Titanic     0.5756 (0.1289)  0.6062 (0.1411)  0.6564 (0.1435)  0.7710 (0.0133)
Twonorm     0.8153 (0.0301)  0.5481 (0.1182)  0.8264 (0.0114)  0.8314 (0.0163)
Wavenorm    0.8168 (0.0123)  0.6784 (0.0352)  0.8298 (0.0085)  0.8321 (0.0090)
Average     0.6925 (0.0294)  0.5887 (0.0858)  0.7252 (0.0389)  0.7733 (0.0178)
p-Value     0.0371           0.0195           -                -
h           1                1                -                -

Fig. 8. Flow sheet of TEP.


TEP is a benchmark simulation of a real chemical engineering process proposed by Downs and Vogel [22]. As shown in Fig. 8, the TEP consists of five major operational units: a reactor, a condenser, a compressor, a separator and a stripper; and it comprises eight components, from A to H. The TEP is described by 52 variables: 22 process variables, 11 manipulated variables and 19 composition measurements. The details of the TEP can be found in [23]. In the TEP benchmark data sets, besides 500 normal samples for training and 960 normal ones for validation, there are the 21 faults listed in Table 5. 960 test samples are included for each type of fault: the first 160 samples are normal and the fault is introduced from the 160th sample. The TEP benchmark data set can be downloaded online (http://brahms.scs.uiuc.edu/). This data set is widely used in fault detection to compare different methods [24-26].

In this experiment, we build fault detection models with all 52 variables. First, we normalize the 500 normal samples to zero mean and unit standard deviation, and then implement the different methods to obtain their optimal s, respectively. Based on these optimal parameters, OCSVM models are built, where ν is set to 0.01 according to [24]. The detection thresholds are defined as the 0.99 confidence limit of the monitoring metric on the validation set. For normalized test samples, the OCSVM models predict whether they are faulty or not, and then the fault detection rates and false alarm rates are calculated. Moreover, as PCA [27] is a typical method for fault detection, we list the results of the T² and SPE statistics of PCA as a baseline. The T² and SPE statistics of PCA monitor the space spanned by the principal components and the space spanned by the remaining components, respectively. Additionally, we also present the oracle-style experiment results. The fault detection rates and false alarm rates are shown in Table 6, where the results of the Wilcoxon signed-ranks test are also listed.
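For reference, the two reported figures can be computed as follows once a model and its threshold are fixed. This is our own sketch of the evaluation, with the 0.99 confidence limit read as the empirical 99th percentile of the monitoring statistic on the validation set; that reading, and the use of the negated decision function as the monitoring statistic, are assumptions rather than details given in the paper.

```python
import numpy as np
from sklearn.svm import OneClassSVM

def tep_rates(model, X_val, X_test, fault_start=160):
    """Fault detection rate and false alarm rate for one TEP test set.
    Monitoring statistic is -decision_function, so larger means more abnormal."""
    stat_val = -model.decision_function(X_val)
    threshold = np.percentile(stat_val, 99)          # 0.99 confidence limit (assumption)

    stat_test = -model.decision_function(X_test)
    alarms = stat_test > threshold
    false_alarm_rate = alarms[:fault_start].mean()   # first 160 samples are normal
    detection_rate = alarms[fault_start:].mean()     # remaining samples are faulty
    return detection_rate, false_alarm_rate
```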

In this table, the results of SKEW and VM are not presented. The parameter obtained by SKEW is so inappropriate that the corresponding detection rates and false alarm rates all equal 1, which means this method fails on the TEP benchmark data sets.


Table 5
Types of fault in TEP.

Fault   Description                                                     Type
1       A/C feed ratio, B composition constant (stream 4)               Step
2       B composition, A/C ratio constant (stream 4)                    Step
3       D feed temperature (stream 2)                                   Step
4       Reactor cooling water inlet temperature                         Step
5       Condenser cooling water inlet temperature                       Step
6       A feed loss (stream 1)                                          Step
7       C header pressure loss-reduced availability (stream 4)          Step
8       A, B, C feed composition (stream 4)                             Random
9       D feed temperature (stream 2)                                   Random
10      C feed temperature (stream 4)                                   Random
11      Reactor cooling water inlet temperature                         Random
12      Condenser cooling water inlet temperature                       Random
13      Reaction kinetics                                               Slow drift
14      Reactor cooling water valve                                     Sticking
15      Condenser cooling water valve                                   Sticking
16-20   Unknown                                                         Unknown
21      The valve for stream 4 was fixed at the steady-state position   Constant position


On the other hand, although the VM method obtains an average detection rate as high as 97.09%, its false alarm rate is unacceptably high, up to 73.51%, meaning it does not perform well either. Therefore, the following discussion will not consider these two methods. Based on the results of the Wilcoxon signed-ranks test at the significance level α = 0.05, it is reasonable to say that the methods proposed in this paper (DFN and DTL) are superior to the other methods in fault detection on the TEP. Specifically, the false alarm rates of DFN and DTL are 1.93% and 2.32%, which are acceptable in fault detection. Their average detection rates are 68.52% and 68.95%, higher than those of the other methods and closer to that of the oracle-style results. For 17 out of the 21 fault types, our methods achieve the highest fault detection rates among these methods, not including the oracle-style results. Comparing DFN and DTL, the latter performs better than the former, with higher detection rates for 12 fault types.

Table 6
Fault detection rates and false alarm rates (%).

Fault detection rates:

Fault      T²       SPE      MD       DFN      FSV       DTL      ORACLE
1          99.38    99.75    99.50    99.63    99.50     99.63    99.63
2          98.25    98.63    98.50    98.63    98.38     98.63    98.63
3           1.13     1.13     6.63    12.38     9.25     14.13    16.50
4          40.50    98.88    76.00    87.50    70.13     86.25    88.38
5          25.00    22.25    29.50    33.63    31.87     35.50    36.88
6          99.00   100      100      100      100       100     100
7         100      100      100      100      100       100     100
8          97.25    93.38    98.38    98.50    98.37     98.63    99.13
9           1.75     1.88     5.63    10.63     7.75     12.50    14.13
10         39.88    38.88    56.13    61.88    56.50     62.13    62.38
11         49.25    48.00    62.25    70.25    59.13     71.13    71.63
12         98.50    94.75    99.00    99.63    99.25     99.63    99.63
13         95.13    94.88    95.50    95.50    94.87     95.50    95.63
14        100       94.13   100      100      100       100     100
15          2.13     2.13    14.63    19.00    15.25     20.38    21.63
16         19.38    34.38    46.63    53.88    45.50     54.63    55.00
17         78.50    94.00    89.25    90.75    88.37     91.00    91.13
18         89.25    90.25    89.50    90.13    89.37     90.25    90.50
19          6.25     8.88     5.25    12.00     4.50     12.00    13.00
20         34.50    51.88    56.38    61.88    58.00     62.50    62.75
21         37.38    48.00    41.00    43.13    40.25     43.50    44.25
Average    57.73    62.67    65.22    68.52    65.06     68.95    69.56
p-Value    -        -        2.90e-04 -        1.96e-04  -        -
h          -        -        1        -        1         -        -

False alarm rates:

            1.38     1.63     2.24     1.93     2.80      2.32     2.15

As for the MD method, it does not perform as well as DFN, with its average detection rate being about 3% lower than that of DFN. For example, for fault 11, the fault detection rate for MD is only 62.25%, while that for DFN is 70.25%. Similar situations exist between FSV and DTL. On the other side, the OCSVM models built by DFN and DTL perform apparently better than the baseline method PCA, and their average fault detection rates are higher than those of T² and SPE by about 10% and 6%, respectively.

Based on the above analyses, it is reasonable to draw the conclusion that, by utilizing the proposed DFN and DTL methods, we are able to select suitable Gaussian kernel parameters for OCSVM. According to the Wilcoxon signed-ranks test, the OCSVM models built by DFN and DTL perform better than those built by the other methods for fault detection on the TEP benchmark data sets.

6. Conclusion

In this paper, we study the problem of selecting the Gaussian kernel parameter for one-class support vector machines (OCSVM) and its application to fault detection. According to the two categories of methods in this field, we propose two different methods to optimize s. The first one, entitled DFN, solves this problem by maximizing the difference between the farthest-neighbor and nearest-neighbor distances of sample mappings. However, the DFN method is not able to make full use of the sample distribution information; therefore we propose the second method, DTL, which optimizes s using the feedback of OCSVM models. DTL detects the tightness of the decision boundaries through a set of systematic rules, and tunes s iteratively based on its monotonic effect on the tightness of the decision boundaries.

Then we test the performance of our methods using three groups of experiments. As shown by the experimental results on toy data sets, the OCSVM models built by our methods can enclose the training samples properly and generate suitable decision boundaries. As for high-dimensional data sets, we experiment on 10 UCI data sets, and our methods generally achieve higher accuracies. At last, we apply our methods to fault detection on the Tennessee Eastman Process. The results demonstrate that our methods can detect faults of various types effectively, being superior to the other parameter selection methods.

Considering the similar effect of s upon other kernel OCC classifiers, the potential application of the proposed methods to these classifiers could be explored in future work. In addition, the distribution and other aspects of sample mappings in the feature space, e.g. the pattern and the density, could be utilized to obtain a more suitable s before classification models are built. Furthermore, more efficient algorithms could be designed to detect the tightness of decision boundaries. Finally, since we only study methods of selecting s for the Gaussian kernel, it would also be worthwhile to study methods of selecting parameters for other kernels.

Acknowledgements

This work was supported in part by the National Key Technologies Research and Development Program of China (No. 2011AA060203), and in part by the No. 2 Important National Science and Technology Specific Project of China (No. 2011ZX02504).

References

[1] H. Hoffmann, Kernel PCA for novelty detection, Pattern Recogn. 40 (2007) 863–874.
[2] P. Juszczak, D.M.J. Tax, E. Pekalska, R.P.W. Duin, Minimum spanning tree based one-class classifier, Neurocomputing 72 (2009) 1859–1869.
[3] M. Kemmler, E. Rodner, J. Denzler, One-class classification with Gaussian processes, in: Computer Vision – ACCV 2010, Springer, 2011, pp. 489–500.


[4] B. Schölkopf, J.C. Platt, J. Shawe-Taylor, A.J. Smola, R.C. Williamson, Estimating the support of a high-dimensional distribution, Neural Comput. 13 (2001) 1443–1471.
[5] D.M.J. Tax, R.P.W. Duin, Support vector data description, Mach. Learn. 54 (2004) 45–66.
[6] Y. Baram, Learning by kernel polarization, Neural Comput. 17 (2005) 1264–1275.
[7] L. Wang, P. Xue, K.L. Chan, Two criteria for model selection in multiclass support vector machines, IEEE Trans. Syst. Man Cybernet. Part B – Cybernet. 38 (2008) 1432–1448.
[8] C.H. Nguyen, T.B. Ho, An efficient kernel matrix evaluation measure, Pattern Recogn. 41 (2008) 3366–3372.
[9] K.-P. Wu, S.-D. Wang, Choosing the kernel parameters for support vector machines by the inter-cluster distance in the feature space, Pattern Recogn. 42 (2009) 710–717.
[10] M. Lazaro-Gredilla, V. Gomez-Verdejo, E. Parrado-Hernandez, Low-cost model selection for SVMs using local features, Eng. Appl. Artif. Intell. 25 (2012) 1203–1211.
[11] M. Varewyck, J.-P. Martens, A practical approach to model selection for support vector machines with a Gaussian kernel, IEEE Trans. Syst. Man Cybernet. Part B – Cybernet. 41 (2011) 330–340.
[12] Z. Xu, M. Dai, D. Meng, Fast and efficient strategies for model selection of Gaussian support vector machine, IEEE Trans. Syst. Man Cybernet. Part B – Cybernet. 39 (2009) 1292–1307.
[13] P.F. Evangelista, M.J. Embrechts, B.K. Szymanski, Some properties of the Gaussian kernel for one class learning, in: Artificial Neural Networks – ICANN 2007, Springer, 2007, pp. 269–278.
[14] S. Khazai, S. Homayouni, A. Safari, B. Mojaradi, Anomaly detection in hyperspectral images based on an adaptive support vector method, IEEE Geosci. Remote Sens. Lett. 8 (2011) 646–650.
[15] H. Deng, R. Xu, Model selection for anomaly detection in wireless ad hoc networks, in: CIDM 2007, IEEE Symposium on Computational Intelligence and Data Mining, IEEE, 2007, pp. 540–546.
[16] R. Duin, P. Juszczak, P. Paclik, E. Pekalska, D. De Ridder, D. Tax, S. Verzakov, A Matlab toolbox for pattern recognition, 2007. <http://www.prtools.org/>.
[17] V.J. Hodge, J. Austin, A survey of outlier detection methodologies, Artif. Intell. Rev. 22 (2004) 85–126.
[18] H. Wang, L. Zhang, Y. Xiao, W. Xu, An approach to choosing Gaussian kernel parameter for one-class SVMs via tightness detecting, in: 2012 4th International Conference on Intelligent Human-Machine Systems and Cybernetics (IHMSC), vol. 2, IEEE, 2012, pp. 318–323.
[19] C.-C. Chang, C.-J. Lin, LIBSVM: a library for support vector machines, 2001. <http://www.csie.ntu.edu.tw/~cjlin/libsvm>.
[20] G. Lee, C. Scott, Nested support vector machines, IEEE Trans. Signal Process. 58 (2010) 1648–1660.
[21] J. Demsar, Statistical comparisons of classifiers over multiple data sets, J. Mach. Learn. Res. 7 (2006) 1–30.
[22] J.J. Downs, E.F. Vogel, A plant-wide industrial-process control problem, Comput. Chem. Eng. 17 (1993) 245–255.
[23] L. Chiang, E. Russell, R. Braatz, Fault Detection and Diagnosis in Industrial Systems, Springer, 2001.
[24] S. Mahadevan, S.L. Shah, Fault detection and diagnosis in process data using one-class support vector machines, J. Process Control 19 (2009) 1627–1639.
[25] M. Kano, K. Nagao, S. Hasebe, I. Hashimoto, H. Ohno, R. Strauss, B. Bakshi, Comparison of statistical process monitoring methods: application to the Eastman challenge problem, Comput. Chem. Eng. 24 (2000) 175–181.
[26] J.M. Lee, S.J. Qin, I.B. Lee, Fault detection and diagnosis based on modified independent component analysis, AIChE J. 52 (2006) 3501–3514.
[27] S.J. Qin, Statistical process monitoring: basics and beyond, J. Chemometr. 17 (2003) 480–502.