



Sparse Kernel Learning-Based Feature Selection for Anomaly Detection

ZHIMIN PENG
University of California, Los Angeles
Los Angeles, CA

PRUDHVI GURRAM, Member, IEEE
HEESUNG KWON, Senior Member, IEEE
U.S. Army Research Laboratory
Adelphi, MD

WOTAO YIN
University of California, Los Angeles
Los Angeles, CA

In this paper, a novel framework of sparse kernel learning for support vector data description (SVDD) based anomaly detection is presented. By introducing 0-1 control variables to original features in the input space, sparse feature selection for anomaly detection is modeled as a mixed integer programming problem. Due to the prohibitively high computational complexity, it is relaxed into a quadratically constrained linear programming (QCLP) problem. The QCLP problem can then be practically solved by using an iterative optimization method, in which multiple subsets of features are iteratively found as opposed to a single subset. However, when a nonlinear kernel such as the Gaussian radial basis function kernel, associated with an infinite-dimensional reproducing kernel Hilbert space (RKHS), is used in the QCLP-based iterative optimization, it is impractical to find optimal subsets of features due to the large number of possible combinations of the original features. To tackle this issue, a feature map called the empirical kernel map, which maps data points in the input space into a finite space called the empirical kernel feature space (EKFS), is used in the proposed work. The QCLP-based iterative optimization problem is solved in the EKFS instead of in the input space or the RKHS. This is possible because the geometrical properties of the EKFS and the corresponding RKHS remain the same. Now, an explicit nonlinear exploitation of the data in a finite EKFS is achievable, which results in optimal feature ranking. Comprehensive experimental results on three hyperspectral images and several machine learning datasets show that our proposed method can provide improved performance over the current state-of-the-art techniques.

Manuscript received November 13, 2013; revised June 26, 2014, November 13, 2014; released for publication December 1, 2014.

DOI. No. 10.1109/TAES.2015.130730.

Refereeing of this contribution was handled by T. Robertazzi.

The research of W. Yin and Z. Peng was supported in part by NSF Grant DMS-1317602 and ARO MURI Grant FA9550-10-1-0567.

Authors’ addresses: Z. Peng and W. Yin, Department of Mathematics, University of California, Los Angeles, Los Angeles, CA, 90095; P. Gurram and H. Kwon, U.S. Army Research Laboratory, Adelphi, MD, 20783. E-mail: ([email protected]).

0018-9251/15/$26.00 © 2015 IEEE


I. INTRODUCTION

The key challenge for anomaly detection is to characterize the normalcy data. In general, there are three types of models [1], including the multivariate normal (MVN) model, non-MVN background model, and exploitation of spatial structures. The MVN model assumes the data have a Gaussian distribution [2], and algorithms of this type include different variants of the RX anomaly detection algorithm [3–5] and different variants of the matched filter [6, 7]. However, real datasets are usually non-Gaussian; non-MVN methods such as the near-MVN model [8], finite mixture models [9, 10], and nonparametric kernel-based models [11–13] have been used to explore the non-Gaussian structure of the data. Some of the methods that explore spatial structure information include the multiple window technique [14], the standard deviation matched filter [15], postdetection spatial analysis [16], and spectral-based algorithms [17].

Support vector data description (SVDD) [12], a support vector based learning algorithm for anomaly detection, learns the support or boundary of the given normalcy data by building a minimal enclosing hypersphere containing the data. This is accomplished by minimizing the radius of the hypersphere with a constraint that the hypersphere contains all the background data points and excludes the superfluous data points, such as noisy data and/or outliers. The use of nonlinear kernels allows SVDD to accurately model the nonlinear support/boundary of nontrivial multimodal distributions of high dimensional multivariate data [18, 19]. The kernel-based SVDD first nonlinearly maps the input data to a high dimensional feature space, called reproducing kernel Hilbert space (RKHS) [20], and then finds the enclosing hypersphere. Due to the powerful exploitation of nonlinear correlations among the features by the nonlinear kernel, an optimized hypersphere in the RKHS is in fact equivalent to a robust nonlinear hypersphere in the input space, providing superior performance over conventional anomaly detection techniques. However, a regular SVDD with single kernel learning oftentimes results in a weak normalcy model (i.e., a hypersphere overfitting the data), especially if the normalcy pattern consists of sparse and nonlinear high dimensional data structures.

In order to overcome the problem of overfitting, an ensemble learning approach, called sparse kernel anomaly detection (SKAD) [13], was previously developed by two of the current authors. SKAD aims to learn multiple kernels over the input space by randomly choosing a large number of subsets of a few features, each subset used as an input to a corresponding kernel. The individual kernels are then optimally weighted using the $\ell_1$ norm sparse optimization principle of multiple kernel learning (MKL) [21, 22] to jointly estimate the combined hypersphere.



Due to $\ell_1$-based sparse optimization, only a few kernels with nonzero weights are used to jointly model the hypersphere. It was shown that SKAD can provide more robust performance than the regular SVDD based on single kernel learning due to the powerful joint optimization of multiple hyperspheres. However, the SKAD technique has focused on finding optimal sparse weights of the multiple kernels alone, while the optimal selection of subsets of features for different kernels is not considered.

Feature selection for learning algorithms aims to find a subset of features that can improve the learning performance by discarding features that are not useful or even harmful for the given tasks. In the case of kernel-based anomaly detection, such as SVDD, the feature selection requires the accurate estimation of the contribution of each feature to the objective function, i.e., the radius of a hypersphere in the RKHS. If a linear kernel is used, it is a maximum margin hyperplane in the input space rather than a hypersphere in the RKHS, which separates the normalcy data from the origin. The feature selection problem becomes evident in the linear case since the contribution of individual features to the margin between the origin and the normalcy data can be optimally calculated, as no interactions among the features are considered at all. An optimal subset of features is then obtained by solving a simple unconstrained optimization problem and selecting a certain number of features with high rankings. However, once a nonlinear kernel such as the Gaussian radial basis function (RBF) kernel is used, a large margin hyperplane in the RKHS, which is equivalent to a hypersphere, needs to be estimated, and the feature selection problem now becomes a complicated nonconvex problem due to the highly nonlinear interactions/correlations among the input features caused by a nonlinear kernel transformation of the input data.

In the pattern recognition literature, the feature selection problem is largely divided into two categories: the filter approach and the wrapper approach [23]. In the filter approach, such as the FOCUS [24] and the Relief [25, 26] algorithms, the relevance or weights of individual features are independently estimated within a preprocessing step, completely ignoring the interrelationship between the selected subset of features and the performance of the subject learning algorithms. On the other hand, the wrapper approach interacts with the learning algorithms under consideration and finds a subset of features by using the given learning algorithm as an evaluation function. The forward selection and the backward elimination techniques [27, 28] are two widely used examples of the wrapper approach. However, the wrapper approach is also in general suboptimal since estimating the relevance of all the possible subsets of the entire features is simply not feasible.

There have also been some attempts to address the problem of feature selection directly in the RKHS of infinite dimensionality. For instance, the Relief algorithm was extended to a kernel space, and the weights of individual features were estimated by constructing a basis set and then maximizing margins between data of different classes [29]. In [30], sequential forward selection was used in such a way that a classification criterion, such as class separability based on kernel Fisher discriminant analysis (KFDA), is optimized in a kernel space in selecting a subset of features. In [31], instead of using 0-1 control variables, the continuous weights of features were optimized for the support vector machine (SVM) based on a gradient descent technique by maximizing the radius-margin bound. However, all of the above techniques are not optimal in selecting features given objective functions.

In this paper, a new framework of optimal sparse kernel learning for SVDD-based anomaly detection (OSKLAD) is proposed. The proposed OSKLAD optimally extends the feature selection technique used for the kernel-based learning approaches [32, 33] into SVDD-based anomaly detection by optimizing the feature selection method for nonlinear kernels in a newly defined finite space. Hence, the OSKLAD can be considered as an optimized version of the wrapper approach to the SVDD-based anomaly detection with nonlinear kernels. In OSKLAD, multiple hyperspheres are optimally designed by iteratively finding the corresponding optimal subsets of features¹ and then jointly weighting the hyperspheres to estimate a final combined hypersphere. The jointly estimated hypersphere is used as a final normalcy model for the background data.

The initial objective of the proposed OSKLAD begins with finding a single subset of original features that can be used to build an optimal hypersphere in the RKHS. This objective can be modeled as a mixed integer programming problem. However, this problem is NP-hard, and so we relax it into a quadratically constrained linear programming (QCLP) problem [34] by converting the objective function of the mixed integer programming problem into lower bounded quadratic inequality constraints. This QCLP problem is yet intractable due to the prohibitively large number of inequality constraints. In fact, the number of inequality constraints is the same as the number of all the possible combinations of the original features. To address this issue, a cutting plane method based on the restricted master problem coupled with MKL [21, 22] is iteratively used. The goal is to find only a small subset of the inequality constraints that are actively used to define the feasible region of the parameters of the given QCLP problem.

The active constraints are effectively identified by instead finding the most violating constraints, whose half-planes maximally violate the corresponding inequality constraints. Therefore, the task becomes finding multiple subsets of most violated features associated with the corresponding most violating constraints given the objective function, such as the radius of a hypersphere in the RKHS. However, due to the prohibitively large number of possible combinations (subsets) of the original features when nonlinear kernels are used, finding the most violating constraints also becomes a combinatorial problem. To tackle this issue, the most violated features are found in a newly generated space, called the empirical kernel feature space (EKFS) [20], instead of the input space or the RKHS.

¹ Note that these subsets of features were randomly chosen in SKAD.

The EKFS is a finite space linearly spanned by basis vectors, which are generated by a map called the empirical kernel map that basically evaluates a kernel function of each data point with respect to the training samples, constructing a finite-dimensional space whose dimensionality is the same as the number of training samples. Note that the dimensionality of the EKFS can be further reduced by using principal component analysis (PCA)-based dimensionality reduction techniques without compromising optimization performance. By whitening the EKFS, it is endowed with the canonical dot product, making the empirical kernel map a kernel feature map [20]. It is shown in [35] that the EKFS and the corresponding RKHS constructed by using the same kernel function have the same geometrical property. This means that solutions of any optimization problem obtained from either space are identical. In the proposed OSKLAD, the subsets of the most violated features are optimally found in the EKFS since individual feature ranking in terms of contribution to the radius in the EKFS can be performed optimally based on the property of the canonical dot product and the finite dimensionality of the space.

Unlike current kernel-based QCLP optimization techniques [33], the OSKLAD with nonlinear kernels finds optimal subsets of features in the EKFS instead of the input space. This means the selected subsets of features are no longer in the original form of features in the input space. Each component of the newly mapped features in the EKFS is a kernel function between two data points, which basically represents the nonlinear correlation between the two. Hence, the proposed work mainly focuses on finding the joint normalcy model for the background characterization by optimally designing and weighting multiple hyperspheres in the EKFS without explicitly identifying the corresponding subsets of the original features in the input space. Nevertheless, the OSKLAD can also be used in the input space, in which case the optimal subsets of original features can be identified. However, in this case the OSKLAD in the input space is only optimal up to a linear kernel, significantly limiting the detection performance.

It should also be noted that optimizing normalcy models for anomaly detection in the EKFS has not previously been developed. The proposed work is in fact the first attempt to utilize the very useful properties of the EKFS, such as finite dimensionality, availability of the canonical dot product, rigorous nonlinear exploitation of the given data, etc., in solving optimization problems for anomaly detection. Further studies and developments for anomaly detection as well as classification techniques in the EKFS are expected following the proposed work.

The rest of the paper is organized as follows. Section II describes the concept of SVDD; Section III provides the details of the SVDD based on optimal feature selection; Section IV provides some simulation results, including one class classification datasets and hyperspectral image data; Section V concludes the paper with some remarks about the proposed method.

II. SUPPORT VECTOR DATA DESCRIPTION

SVDD, introduced in [12], is a state-of-the-art learning technique widely used for anomaly detection (one class classification). SVDD characterizes a data set by enclosing the normalcy data and excluding the superfluous space around the normalcy data. The boundary of the data set is defined by the vectors or samples in the normalcy data, which are called support vectors. The samples that lie outside this boundary are detected as outliers or anomalies.

Consider a data set containing samples represented as $\{x_1, x_2, \ldots, x_N\}$, where $x_i \in \mathbb{R}^M$ is an $M$-dimensional feature vector of data sample $i$. Let $\Phi(x)$ be a function that transforms the input feature vector to a high dimensional (possibly infinite) RKHS associated with the kernel function $k(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j) \rangle$. The kernel-based SVDD algorithm tries to find the smallest hypersphere in the RKHS that encloses the given background data set and excludes the superfluous space around the background data set as much as possible. This sphere is defined by its center $a$ and radius $R$, where $R$ is minimized with a constraint that the hypersphere contains all the background data points. If there is a possibility of outliers existing in the background data, then slack variables are used to allow for the outliers and generate a more robust model as follows

$$
\begin{aligned}
\min_{R,\,\xi_i,\,a} \quad & R^2 + C \cdot \sum_{i=1}^{N} \xi_i \\
\text{subject to} \quad & \|\Phi(x_i) - a\|^2 \le R^2 + \xi_i, \\
& \xi_i \ge 0, \quad i = 1, 2, \ldots, N,
\end{aligned}
\tag{1}
$$

where $\xi_i$ are the slack variables and $C$ controls the tradeoff between the volume of the hypersphere and the errors. By applying the Karush-Kuhn-Tucker (KKT) conditions, the dual problem of (1) is

$$
\begin{aligned}
\max_{\alpha} \quad & \sum_{i=1}^{N} \alpha_i k(x_i, x_i) - \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j k(x_i, x_j) \\
\text{s.t.} \quad & \sum_{i=1}^{N} \alpha_i = 1, \\
& 0 \le \alpha_i \le C, \quad i = 1, 2, \ldots, N,
\end{aligned}
\tag{2}
$$

where $\alpha_i$ are Lagrangian parameters. After solving the quadratic problem (2) using existing algorithms, the optimal dual variables $\alpha_i^*$ are obtained. The center of the hypersphere (though it cannot be determined explicitly) is then given by $a = \sum_{i=1}^{N} \alpha_i^* \Phi(x_i)$. In fact, $\alpha_i^*$ exactly characterizes the relative location of $x_i$ with respect to the hypersphere. The vectors with $\alpha_i^* = 0$ lie inside the hypersphere and are considered to be part of the background data; the vectors with $0 < \alpha_i^* < C$ are the support vectors that actually lie on the boundary of the hypersphere; the vectors with $\alpha_i^* = C$ lie outside the hypersphere and are considered as anomalies. The radius of the hypersphere $R$ is given by

$$
R^2 = \frac{1}{N_b} \sum_{k=1}^{N_b} \|\Phi(x_k) - a\|^2
    = \frac{1}{N_b} \sum_{k=1}^{N_b} \Big( k(x_k, x_k) - 2 \sum_i \alpha_i^* k(x_k, x_i) + \sum_{i,j} \alpha_i^* \alpha_j^* k(x_i, x_j) \Big). \tag{3}
$$

Fig. 1. Demonstration of SVDD with two different kernels. The dashed line and the solid line represent the data descriptions obtained by using a linear and a Gaussian RBF kernel, respectively.

Without loss of generality, we assume $\Phi(x_k)$ $(k = 1, 2, \ldots, N_b)$ are the support vectors that lie on the boundary, where $N_b$ is the total number of the support vectors. The test statistic that can be used to determine if the test data sample $x_T$ is an anomaly or not is given by

$$
F_{SVDD}(x_T) = k(x_T, x_T) - 2 \sum_i \alpha_i^* k(x_T, x_i) + \sum_{i,j} \alpha_i^* \alpha_j^* k(x_i, x_j) \ge R^2. \tag{4}
$$

If the distance between the test sample and the center of the enclosing hypersphere is less than or equal to the radius of the enclosing hypersphere, it is considered to be a part of the normalcy data. However, if the distance is greater than the radius of the hypersphere, it is considered to be an anomaly.
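As a concrete illustration of (2)–(4), the following minimal sketch (Python/NumPy, with cvxpy assumed as an off-the-shelf QP solver; the function names are ours, not from the paper) fits an RBF-kernel SVDD to background data, estimates the radius from the boundary support vectors, and scores a test sample.

```python
import numpy as np
import cvxpy as cp

def rbf_kernel(A, B, sigma):
    # Gaussian RBF kernel matrix between the rows of A and the rows of B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma ** 2)

def svdd_fit(X, C=0.1, sigma=1.0):
    # Solve the SVDD dual (2) for the background data X (N x M).
    N = X.shape[0]
    K = rbf_kernel(X, X, sigma) + 1e-9 * np.eye(N)     # small ridge for numerical stability
    a = cp.Variable(N)
    obj = cp.Maximize(cp.sum(cp.multiply(a, np.diag(K))) - cp.quad_form(a, K))
    cp.Problem(obj, [cp.sum(a) == 1, a >= 0, a <= C]).solve()
    alpha = np.clip(a.value, 0.0, C)

    # Boundary support vectors: 0 < alpha_i < C (up to a tolerance); radius from (3).
    tol = 1e-6
    sv = np.where((alpha > tol) & (alpha < C - tol))[0]
    R2 = np.mean([K[k, k] - 2 * alpha @ K[:, k] + alpha @ K @ alpha for k in sv])
    return alpha, R2

def svdd_score(X, alpha, xT, sigma=1.0):
    # Test statistic (4): squared distance of xT to the hypersphere center.
    kTT = rbf_kernel(xT[None, :], xT[None, :], sigma)[0, 0]
    kTX = rbf_kernel(xT[None, :], X, sigma)[0]
    return kTT - 2 * alpha @ kTX + alpha @ rbf_kernel(X, X, sigma) @ alpha

# A test sample xT is declared an anomaly when svdd_score(X, alpha, xT, sigma) > R2.
```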

Fig. 1 provides an illustration describing the modeling of the boundary of a two-dimensional toy dataset, which has a banana shape in the input space. Two different kernels are used to perform this modeling to show the advantages of using a nonlinear kernel to model multimodal distributions. The dashed line corresponds to the SVDD model with a linear kernel, which implies that the enclosing hypersphere is built in the input space. The solid line represents the data description obtained by using a Gaussian RBF kernel that transforms the data into the RKHS, where an enclosing hypersphere is built. The same hypersphere represents a nonlinear boundary around the normalcy data in the input space. One can observe the advantages of using a nonlinear kernel in this example. The Gaussian RBF kernel-based SVDD provides a tighter fit to the normalcy data, removing all the superfluous space around the data, while the linear SVDD includes this space in the normalcy data model.

III. SPARSE KERNEL LEARNING BASED FEATURE SELECTION

As explained in previous sections, even though an SVDD model with a Gaussian RBF kernel provides superior performance over generative models and linear SVDD, the performance of the kernel-based SVDD model depends on the input features used and the kernel parameters (the kernel bandwidth parameter in the case of the Gaussian RBF kernel). The optimization of the kernel parameters is another research topic, and has been previously studied by two of the authors [19]. In this section, the proposed OSKLAD is described by 1) first introducing the previously developed sparse kernel learning technique as a baseline method, 2) showing how OSKLAD is modeled by using a QCLP-based iterative optimization technique, and 3) showing how OSKLAD is modeled in the EKFS instead of the RKHS to make the feature selection optimal.

A. Sparse Kernel Learning for Anomaly Detection

In order to improve the performance of the one class classifier, instead of using a single kernel with all the features, a multiple kernel learning algorithm called SKAD [13] has been developed for SVDD. In this technique, feature subsets were chosen randomly and individual kernels were built from these feature subsets. Then, a sparse kernel learning algorithm was developed to select the best feature subsets from the randomly chosen subsets, i.e., those that give the smallest hypersphere in the joint RKHS and the tightest fit around the normalcy data. Here, we provide a brief overview of this algorithm for clarity of the paper. In SKAD, the final kernel $k(x_i, x_j)$ is considered as a convex combination of basis kernels $k_l(x_i, x_j)$, $l = 1, 2, \ldots, L$. Each of the basis kernels $k_l$ is associated with its corresponding RKHS. Each kernel and RKHS is generated from a randomly selected feature subspace of the input data. Each feature subspace is obtained by random selection of features using a uniform distribution.



Mathematically, it can be written as

$$
k(x_i, x_j) = \sum_{l=1}^{L} \mu_l\, k_l\!\left(P^l x_i,\, P^l x_j\right),
\qquad \text{subject to} \quad \mu_l \ge 0,\; l = 1, 2, \ldots, L, \quad \sum_{l=1}^{L} \mu_l = 1, \tag{5}
$$

where $L$ is the total number of kernels and $\mu_l$ are the weights of the individual kernels. $P^l$ is an $F_l \times M$ projection matrix which defines the input features randomly selected to form the $l$th subspace. It is defined by the elements $P^l_{f,m} = 1$ if the $m$th input feature is selected as the $f$th feature of the $l$th subspace and 0 otherwise, for all $m = 1, 2, \ldots, M$ and $f = 1, 2, \ldots, F_l$. The $\ell_1$ norm constraint is applied on the weights of the basis kernels to promote sparsity among them and selects only the best feature subsets that help in improving the generalization performance of the classifier. The dual problem for SKAD can be formed by substituting the combined kernel (5) into the dual problem of the standard SVDD (2) as follows

$$
\begin{aligned}
\max \quad J(\mu, \alpha) = & \sum_{i=1}^{N} \alpha_i \sum_{l=1}^{L} \mu_l k_l\!\left(P^l x_i, P^l x_i\right)
 - \sum_{i,j=1}^{N} \alpha_i \alpha_j \sum_{l=1}^{L} \mu_l k_l\!\left(P^l x_i, P^l x_j\right) \\
\text{subject to} \quad & 0 \le \alpha_i \le C, \quad \forall i = 1, 2, \ldots, N, \\
& \sum_{i=1}^{N} \alpha_i = 1.
\end{aligned}
\tag{6}
$$
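To make the combined kernel construction in (5) concrete, the sketch below (Python/NumPy; the helper names and the toy data are ours) draws $L$ random feature subsets, builds one Gaussian RBF basis kernel per subset, and forms the weighted combination under the simplex constraint on $\mu$.

```python
import numpy as np

def random_subsets(M, L, F, rng):
    # L random feature subsets of size F, drawn uniformly (the role of the P^l matrices in (5)).
    return [rng.choice(M, size=F, replace=False) for _ in range(L)]

def rbf(A, B, sigma):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma ** 2)

def combined_kernel(X, subsets, mu, sigma):
    # k(x_i, x_j) = sum_l mu_l * k_l(P^l x_i, P^l x_j), with mu on the simplex.
    assert np.all(mu >= 0) and abs(mu.sum() - 1.0) < 1e-8
    K = np.zeros((X.shape[0], X.shape[0]))
    for m, idx in zip(mu, subsets):
        K += m * rbf(X[:, idx], X[:, idx], sigma)
    return K

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 20))          # toy background data, N = 50, M = 20
subsets = random_subsets(20, L=10, F=5, rng=rng)
mu = np.full(10, 1.0 / 10)                 # uniform initial weights, as in SKAD
K = combined_kernel(X, subsets, mu, sigma=2.0)
```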

If the weights of the individual kernels are known, using the combined kernel $k(x_i, x_j) = \sum_{l=1}^{L} \mu_l k_l(P^l x_i, P^l x_j)$, (6) can be solved as a standard SVDD problem. Once the optimal Lagrange multipliers $\alpha_i^*$ are obtained, the objective value of (6) is going to be

$$
J(\mu) = \sum_{i=1}^{N} \alpha_i^* \sum_{l=1}^{L} \mu_l k_l\!\left(P^l x_i, P^l x_i\right)
 - \sum_{i,j=1}^{N} \alpha_i^* \alpha_j^* \sum_{l=1}^{L} \mu_l k_l\!\left(P^l x_i, P^l x_j\right). \tag{7}
$$

This is solved using the gradient descent algorithm, where the gradient of $J(\mu)$ with respect to each weight $\mu_l$ (assuming a Gaussian RBF kernel) is given by

$$
\frac{\partial J}{\partial \mu_l} = - \sum_{i,j} \alpha_i^* \alpha_j^* k_l\!\left(P^l x_i, P^l x_j\right). \tag{8}
$$

Once the gradient of $J(\mu)$ is computed, $\mu$ is updated by using a descent direction calculated via a reduced gradient method described in [22]. The reduced gradient method ensures that the weights $\mu$ are updated in such a way that the equality constraint and the nonnegativity constraints on $\mu$ [shown in (5)] are satisfied. Optimal weights of the subclassifiers $\mu_l^*$ are obtained when the gradient descent algorithm converges, and the optimal Lagrange multipliers $\alpha_i^*$ are obtained for the final combined kernel-based SVDD. Now, the radius of the joint hypersphere is given by

$$
\begin{aligned}
R^2 = \frac{1}{N_b} \sum_{k=1}^{N_b} \Bigg(
& \sum_{l=1}^{L} \mu_l^* k_l\!\left(P^l x_k, P^l x_k\right)
 - 2 \sum_{i=1}^{N} \alpha_i^* \sum_{l=1}^{L} \mu_l^* k_l\!\left(P^l x_k, P^l x_i\right) \\
& + \sum_{i,j=1}^{N} \alpha_i^* \alpha_j^* \sum_{l=1}^{L} \mu_l^* k_l\!\left(P^l x_i, P^l x_j\right) \Bigg),
\end{aligned}
\tag{9}
$$

where $\Phi(x_k)$, $k = 1, 2, \ldots, N_b$ are the support vectors that lie on the boundary of the background data set in the combined RKHS, and $N_b$ is the total number of such boundary support vectors. The algorithm is initiated by selecting random feature subspaces from the input data to form weak classifiers. The weights of all the weak classifiers or kernels are set to the same value, i.e., $1/L$. The final kernel is obtained by combining all the weighted individual kernels. Then, the optimization problem in (6) is solved to obtain the best solution of $\alpha_i^*$. These are plugged into (8) to be used in the gradient descent algorithm for updating the weights of the individual kernels. These two steps are continued until the algorithm convergence criterion is met. The algorithm terminates when the change in the weights of the individual kernels is below a certain threshold. The final optimized sparse weights are used to combine the kernels and obtain the hypersphere in the RKHS associated with the combined kernel as shown in (9). The test statistic that can be used to determine if the test pixel is an anomaly or not using SKAD is given by

$$
\begin{aligned}
F_{SKAD}(x_T) = & \sum_{l} \mu_l^* k_l\!\left(P^l x_T, P^l x_T\right)
 - 2 \sum_{i} \alpha_i^* \sum_{l} \mu_l^* k_l\!\left(P^l x_T, P^l x_i\right) \\
& + \sum_{i,j} \alpha_i^* \alpha_j^* \sum_{l} \mu_l^* k_l\!\left(P^l x_i, P^l x_j\right) \ge R^2.
\end{aligned}
\tag{10}
$$

More details of the SKAD algorithm can be found in [13]. However, in this method, the number of features used for each kernel and the initial number of kernels need to be set beforehand and are not optimized during the implementation. A new algorithm is developed in this paper in order to optimally choose the features used in each individual kernel and the number of kernels finally used in the ensemble.
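Before moving on, the two-step SKAD iteration summarized above can be sketched as follows (Python/NumPy, with cvxpy assumed for the inner SVDD dual solve; a simple projected-gradient step on the simplex is used here as a stand-in for the reduced-gradient update of [22], so this is an illustration rather than the authors' exact implementation).

```python
import numpy as np
import cvxpy as cp

def rbf(A, B, sigma):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma ** 2)

def svdd_dual(K, C):
    # Standard SVDD dual (2) for a fixed kernel matrix K.
    N = K.shape[0]
    a = cp.Variable(N)
    obj = cp.Maximize(cp.sum(cp.multiply(a, np.diag(K))) - cp.quad_form(a, K + 1e-9 * np.eye(N)))
    cp.Problem(obj, [cp.sum(a) == 1, a >= 0, a <= C]).solve()
    return np.clip(a.value, 0.0, C)

def project_simplex(v):
    # Euclidean projection onto {mu >= 0, sum(mu) = 1}; stands in for the reduced-gradient step.
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / (np.arange(len(v)) + 1) > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1)
    return np.maximum(v + theta, 0.0)

def skad(X, subsets, sigma, C, step=0.1, max_iter=50, tol=1e-4):
    # Two-step SKAD loop: solve (6) for alpha with mu fixed, then descend on mu via (8).
    Ks = [rbf(X[:, idx], X[:, idx], sigma) for idx in subsets]
    mu = np.full(len(Ks), 1.0 / len(Ks))           # uniform initial weights 1/L
    for _ in range(max_iter):
        K = sum(m * Kl for m, Kl in zip(mu, Ks))
        alpha = svdd_dual(K, C)
        grad = np.array([-(alpha @ Kl @ alpha) for Kl in Ks])   # gradient (8), RBF case
        mu_new = project_simplex(mu - step * grad)
        if np.linalg.norm(mu_new - mu) < tol:                   # weight-change stopping rule
            return mu_new, alpha
        mu = mu_new
    return mu, alpha
```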

B. Optimal Sparse Kernel Learning

In this section, we present OSKLAD, using SVDD and SKAD as building blocks.



Inspired by the feature selection approach for kernel-based classification [33], OSKLAD addresses the problem of optimal feature selection for SVDD-based anomaly detection. The major drawback of the previous feature selection approach [33] is that it is only optimal up to a linear kernel. OSKLAD tackles the optimality issue associated with nonlinear kernels and is optimal with both linear and nonlinear kernels for a given feature size. OSKLAD follows a bottom-up approach, starting with a single kernel with an optimal subset of features. More kernels (again with optimal feature subsets) are added in successive iterations as required, in order to obtain the final optimal kernel. The kernels are associated with the corresponding RKHS, in which the normalcy data is optimally described by the smallest enclosing hypersphere.

The basic formulation of OSKLAD is again very similar to that of SVDD as shown in (1), except that OSKLAD only uses a subset of features. So, similar to [33], the model is described as a mixed integer programming problem

$$
\begin{aligned}
\min_{d} \; \min_{R,\,\xi,\,a} \quad & R^2 + C \cdot \sum_{i=1}^{N} \xi_i \\
\text{subject to} \quad & \|\Phi(\hat{x}_i) - a\|^2 \le R^2 + \xi_i, \\
& \xi_i \ge 0, \\
& \hat{x}_i = x_i \odot d, \quad i = 1, 2, \ldots, N,
\end{aligned}
\tag{11}
$$

where $d \in D = \{d \mid \sum_{i=1}^{M} d_i = B,\; d_i = 0 \text{ or } 1\}$, and $\odot$ represents the element-wise product. Here $B$ is a threshold that controls the number of selected features. It can be set either to a certain integer for a fixed number of features in each subset, or to a threshold on the percentage of change in the radius of the enclosing hypersphere for variable length feature subsets. OSKLAD is optimal for the predefined threshold $B$. If one assumes that $d$ is fixed in (11), it turns into a continuous constrained optimization problem just like a standard SVDD. By applying the Lagrange multipliers and KKT conditions to it, we can derive the dual problem (similar to the standard SVDD) as

$$
\begin{aligned}
\min_{d} \; \max_{\alpha} \quad & \sum_{i=1}^{N} \alpha_i k(\hat{x}_i, \hat{x}_i) - \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j k(\hat{x}_i, \hat{x}_j) \\
\text{subject to} \quad & \sum_{i=1}^{N} \alpha_i = 1, \\
& 0 \le \alpha_i \le C, \\
& \hat{x}_i = x_i \odot d, \quad i = 1, 2, \ldots, N.
\end{aligned}
\tag{12}
$$

However, one should notice that (12) is still a mixed integer programming problem due to the last constraint, which is computationally expensive to solve. In order to solve this problem, it can be converted into a QCLP as follows. According to the minimax inequality theorem [36], if we interchange the order of $\min_d$ and $\max_\alpha$, then (12) can be lower bounded by

$$
\begin{aligned}
\max_{\alpha} \; \min_{d} \quad & \sum_{i=1}^{N} \alpha_i k(\hat{x}_i, \hat{x}_i) - \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j k(\hat{x}_i, \hat{x}_j) \\
\text{subject to} \quad & \sum_{i=1}^{N} \alpha_i = 1, \\
& 0 \le \alpha_i \le C, \\
& \hat{x}_i = x_i \odot d, \quad i = 1, 2, \ldots, N.
\end{aligned}
\tag{13}
$$

We define $S(\alpha, d) = \sum_{i=1}^{N} \alpha_i k(\hat{x}_i, \hat{x}_i) - \sum_{i,j=1}^{N} \alpha_i \alpha_j k(\hat{x}_i, \hat{x}_j)$, with $\hat{x}_i = x_i \odot d$, and introduce an additional parameter $t$ to obtain the QCLP equivalent of (12) as follows

$$
\begin{aligned}
\max_{\alpha,\, t} \quad & t \\
\text{subject to} \quad & \sum_{i=1}^{N} \alpha_i = 1, \\
& 0 \le \alpha_i \le C, \\
& t \le S(\alpha, d), \quad d \in D.
\end{aligned}
\tag{14}
$$

Though (14) is convex, the large number of inequality constraints [the last condition in (14)] makes it impractical to solve with existing techniques. The number of inequality constraints becomes huge if the features reside in a high dimensional space. Fortunately, note that not all the inequality constraints used in (14) are actively used in defining the feasible region of the optimization problem. In fact, only a small number of the constraints are useful and directly used to solve the optimization problem, which is illustrated in Fig. 2.

Fig. 2. Illustration of active constraints.

The solid lines in the figure represent the inactive constraints, while the dotted lines represent the active constraints, and the gray region is the space where we are searching for the optimal point. Therefore, an iterative algorithm can be used, in which, instead of solving (14) at once, an intermediate solution pair $(t, \alpha)$ is iteratively updated based on a limited subset of previously found active constraints. This optimization problem is called the restricted master problem, which is closely related to the cutting plane algorithm described in [37]. The restricted master problem consists of two steps [32]: 1) $(t, \alpha)$ are optimized based on a previously found restricted subset $I$ of features, which maximally violates the constraints; and 2) a new vector $d$ of the most violated features is obtained based on the newly optimized $(t, \alpha)$ and added to the restricted subset $I = I \cup \{d\}$. These two steps are iterated until convergence. Finding the most violated features is detailed in the next subsection.

The intermediate solution pair $(t, \alpha)$ is now obtained from the following optimization problem

$$
\begin{aligned}
\max_{\alpha,\, t} \quad & t \\
\text{subject to} \quad & \sum_{i=1}^{N} \alpha_i = 1, \\
& 0 \le \alpha_i \le C, \\
& t \le S(\alpha, d_q), \quad d_q \in I.
\end{aligned}
\tag{15}
$$

Let $\mu_q \ge 0$ be the dual variable for each constraint in (15). The Lagrangian of (15) can be written as

$$
L(t, \mu) = t + \sum_{q=1}^{P} \mu_q \big( S(\alpha, d_q) - t \big), \tag{16}
$$

where $P$ is the cardinality of $I$. By setting $\frac{\partial L}{\partial t} = 0$, we have $\sum_{q=1}^{P} \mu_q = 1$. The Lagrangian $L(t, \mu)$, after applying this partial KKT condition, can be rewritten as $L(t, \mu) = \sum_{q=1}^{P} \mu_q S(\alpha, d_q)$, which transforms (14) into the following problem

$$
\begin{aligned}
\max_{\alpha} \; \min_{\mu} \quad & \sum_{q=1}^{P} \mu_q S(\alpha, d_q) \\
\text{subject to} \quad & \sum_{i=1}^{N} \alpha_i = 1, \\
& 0 \le \alpha_i \le C \quad \text{for } i = 1, 2, \ldots, N, \\
& \sum_{q=1}^{P} \mu_q = 1, \\
& \mu_q \ge 0 \quad \text{for } q = 1, 2, \ldots, P.
\end{aligned}
\tag{17}
$$

One can observe that (17), up to the second constraint, is equivalent to the dual of the existing SKAD algorithm shown in (6), and can be solved using a two-step iterative process to obtain the optimal sparse weights of the individual kernels $\mu$ and the optimal Lagrange multipliers $\alpha^*$. However, the objective function is a summation of a huge number of functions $S(\alpha, d_q)$, which requires the evaluation of a huge number of base kernels. So it is impractical to solve (17) by applying the SKAD algorithm. Fortunately, we can apply the cutting plane algorithm, which iteratively generates sparse feature subsets to construct the quadratic inequality constraints in (15). We discuss the generation of the sparse features in the next subsection.

C. Optimal Feature Selection: Finding Maximally Violating Features

For updating $d$, the features that maximally violate the last constraint in (14) need to be determined. Since the goal of (14) is to maximize $t$, and $t$ is upper bounded by $S(\alpha, d)$ according to the constraint, the features that maximally violate this constraint will minimize $S(\alpha, d)$. One has to solve the following optimization problem

$$
\begin{aligned}
\min_{d} \quad & S(\alpha, d) \\
\text{subject to} \quad & \sum_{i=1}^{M} d_i = B, \\
& d_i \in \{0, 1\}, \quad \text{for } i = 1, \ldots, M.
\end{aligned}
\tag{18}
$$

In this section, we describe the method to find these feature vectors for both linear and nonlinear kernels.

1) Linear Kernel: If a linear kernel is used, since $k(x_i, x_j) = \langle x_i, x_j \rangle$, we have $S(\alpha, d) = \sum_{j=1}^{M} d_j c_j$, where $c_j = \sum_{i=1}^{N} \alpha_i x_{ij}^2 - \big(\sum_{i=1}^{N} \alpha_i x_{ij}\big)^2$. $S(\alpha, d)$ is a linear function of $d$. Once we have the optimal support vectors, the global solution of $d$ can be easily obtained by sorting the $c_j$ in ascending order and setting the first $B$ corresponding elements in $d$ to 1 and the rest to 0. Once the optimal feature subset is chosen for a kernel, the optimal $\alpha$ and $\mu$ are updated by solving (17). These two steps are repeated until the algorithm converges. The convergence criterion used is based on the difference of the objective function between the previous iteration and the current iteration. If $\|f_{pre} - f\| < tol$ or if the maximum number of iterations is reached, the algorithm is terminated. The overall algorithm is described in Algorithm 1.

Algorithm 1 OSKLAD with Linear Kernel

1. Initialize $\alpha = \frac{1}{N}\mathbf{1}$; find the maximally violating feature subset $d$, and set $I = \{d\}$;
2. Run SKAD and obtain the optimal $\alpha$ and $\mu$;
3. Find the next maximally violated feature subset $d$ and set $I = I \cup \{d\}$;
4. Repeat steps 2 and 3 until convergence.
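For the linear-kernel case just described, the maximally violating feature subset in (18) can be found in closed form by ranking the per-feature scores $c_j$. A minimal sketch (Python/NumPy; the function name and toy data are ours), which keeps the $B$ features that minimize $S(\alpha, d)$:

```python
import numpy as np

def most_violated_features_linear(X, alpha, B):
    # Per-feature contribution c_j to S(alpha, d) = sum_j d_j * c_j for a linear kernel:
    # c_j = sum_i alpha_i x_ij^2 - (sum_i alpha_i x_ij)^2 (follows from the definition of S).
    c = alpha @ (X ** 2) - (alpha @ X) ** 2
    # Minimizing S(alpha, d) under sum(d) = B keeps the B smallest scores.
    idx = np.argsort(c)[:B]
    d = np.zeros(X.shape[1], dtype=int)
    d[idx] = 1
    return d

# Example: alpha from the current SKAD/SVDD solve, X the background data.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 30))
alpha = np.full(100, 1.0 / 100)
d = most_violated_features_linear(X, alpha, B=10)
```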

2) Nonlinear Kernel: If a Gaussian RBF kernel is used, since $k(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j) \rangle = \exp\!\big(-\frac{\|x_i - x_j\|^2}{\sigma^2}\big)$, $S(\alpha, d)$ is not a linear function of $d$. We cannot solve the problem in (18) optimally because of the large number of combinations of features that have to be considered. But, note that the data can be transformed from the infinite-dimensional RKHS into another space, called the EKFS, with finite dimensionality by using an empirical kernel map. This will allow us to select subsets of features optimally while still preserving the nonlinear correlations among the features. For a given set of training data points $\{x_i\}_{i=1}^{N}$, the map defined by

$$
\Phi_N : \mathbb{R}^d \to \mathbb{R}^N, \quad x \mapsto k(\cdot, x) = \big(k(x_1, x), \ldots, k(x_N, x)\big)^T \tag{19}
$$



is called the empirical kernel map with respect to $\{x_i\}_{i=1}^{N}$ [20]. However, the kernel function $k$ used to build the kernel matrices in the previous subsections cannot be represented using $\Phi_N$, since the mapped vectors do not form an orthonormal system. The dot product to use in the representation of $k$ is not the canonical dot product in the EKFS $\mathbb{R}^N$. In order to turn $\Phi_N$ into a feature map associated with $k$, the EKFS is endowed with a dot product $\langle \cdot, \cdot \rangle_N$ such that $k(x_i, x_j) = \langle \Phi_N(x_i), \Phi_N(x_j) \rangle_N$. After analyzing certain conditions using this equality, as shown in [20], the dot product $\langle \cdot, \cdot \rangle_N$ can be converted to a canonical dot product by merely whitening the EKFS and using the new basis functions as features. It can be represented as

$$
k(x_i, x_j) = \big\langle \Phi_N^w(x_i),\, \Phi_N^w(x_j) \big\rangle, \tag{20}
$$

where the feature map in the whitened EKFS is given by

$$
\Phi_N^w : x \mapsto K^{-\frac{1}{2}} \big(k(x_1, x), \ldots, k(x_N, x)\big)^T, \tag{21}
$$

where $K \in \mathbb{R}^{N \times N}$ is the kernel matrix and $K_{i,j} = k(x_i, x_j)$. The whitening process basically divides the eigenvector basis of $K$ by $\sqrt{\lambda_i}$, where $\lambda_i$ are the eigenvalues of $K$. It can be seen that this map is similar to performing kernel PCA feature extraction on the original RKHS and hence is also called the kernel PCA map. Note that we have thus constructed a data-dependent feature map into an $N$-dimensional space, which is associated with the given kernel. Hence, the feature subset selection problem turns exactly into (18) (the linear version), except for the fact that in this case the features are selected in the EKFS. Similar to OSKLAD with a linear kernel, the overall OSKLAD in the EKFS is described in Algorithm 2.

Algorithm 2 OSKLAD with Nonlinear Kernel

1. Map the data points into the EKFS by using a certain kernel $k$;
2. Initialize $\alpha = \frac{1}{N}\mathbf{1}$; find the maximally violating feature subset $d$, and set $I = \{d\}$;
3. Run SKAD based on the kernel matrices generated by $I$ and optimize for $\alpha$ and $\mu$;
4. Find the next maximally violated feature subset $d$ based on the current $\alpha$ and $\mu$, and set $I = I \cup \{d\}$;
5. Repeat steps 3 and 4 until convergence.
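Step 1 of Algorithm 2 amounts to evaluating the whitened empirical kernel map (20)–(21). A minimal sketch (Python/NumPy; the helper names are ours, and the small-eigenvalue cutoff is an assumption standing in for the PCA-based dimensionality reduction mentioned in Section I):

```python
import numpy as np

def rbf(A, B, sigma):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma ** 2)

def empirical_kernel_map(Xtrain, sigma, eig_cutoff=1e-10):
    # Whitened empirical kernel map (21): Phi_N^w(x) = K^{-1/2} (k(x_1, x), ..., k(x_N, x))^T.
    K = rbf(Xtrain, Xtrain, sigma)
    lam, V = np.linalg.eigh(K)
    keep = lam > eig_cutoff                      # discard near-zero directions (kernel PCA-style truncation)
    W = V[:, keep] / np.sqrt(lam[keep])          # whitening: divide the eigenvector basis by sqrt(lambda_i)

    def phi(X):
        # Map points into the (truncated) EKFS; rows are the EKFS feature vectors.
        return rbf(X, Xtrain, sigma) @ W
    return phi

rng = np.random.default_rng(0)
Xtrain = rng.standard_normal((40, 8))
phi = empirical_kernel_map(Xtrain, sigma=2.0)
Z = phi(Xtrain)   # EKFS representation used by OSKLAD's feature selection
# Sanity check: on the training points the canonical dot product reproduces the kernel,
# i.e., Z @ Z.T is (numerically) equal to rbf(Xtrain, Xtrain, 2.0).
```

On the training points, the canonical dot product in this representation reproduces the original kernel matrix, which is what allows the linear-kernel feature selection of Section III-C1 to be applied directly in the EKFS.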

IV. CONVERGENCE

This section shows that the OSKLAD algorithm can achieve global convergence. Assume $A \times D$ is the feasible set of (13), where $A = \{\alpha : \sum_{i=1}^{N} \alpha_i = 1,\; 0 \le \alpha_i \le C\}$ and $D = \{d : d_j = 0 \text{ or } 1,\; \sum_{j=1}^{M} d_j = B\}$. OSKLAD finds and adds the most violated constraint to the set $I$ in each iteration. In addition, define $I_k$ to be the set of previously calculated $d^j$ $(j = 1, 2, \ldots, k)$; then we have $I_{k+1} = I_k \cup \{d^{k+1}\}$. In the $k$th iteration, a new constraint $d^{k+1}$ is found based on $\alpha^k$, and we have $S(\alpha^k, d^{k+1}) = \min_{d \in D} S(\alpha^k, d)$. Define

$$
\beta_k = \max_{\alpha} \; \min_{1 \le i \le k} S(\alpha, d^i) \tag{22}
$$

and

$$
\phi_k = \max_{1 \le i \le k} S(\alpha^i, d^{i+1}). \tag{23}
$$

Similar to [33], we have the following theorem, which indicates that the OSKLAD algorithm gradually approaches the optimal solution.

THEOREM 1 Let $(t^*, \alpha^*)$ be the global optimal solution of (14), and let $\{\beta_k\}$ and $\{\phi_k\}$ be defined as above. Then we have

$$
\phi_k \le t^* \le \beta_k,
$$

where $\{\beta_k\}$ is a monotonically decreasing sequence and $\{\phi_k\}$ is a monotonically increasing sequence.

PROOF OF THEOREM 1 Since $t^*$ is the optimal solution of (14), we have $t^* = \max_{\alpha \in A} \min_{d \in D} S(\alpha, d)$. On one hand, $\forall \alpha \in A$ we have $\min_{d \in D} S(\alpha, d) \le \min_{d \in I_k} S(\alpha, d)$ for $I_k \subseteq D$, hence $\max_{\alpha \in A} \min_{d \in D} S(\alpha, d) \le \max_{\alpha \in A} \min_{d \in I_k} S(\alpha, d)$, i.e., $t^* \le \beta_k$. On the other hand, $S(\alpha^j, d^{j+1}) = \min_{d \in D} S(\alpha^j, d)$, which means that the point $(S(\alpha^j, d^{j+1}), \alpha^j)$ is a feasible point of (14); thus $t^* \ge S(\alpha^j, d^{j+1})$, $\forall j = 1, \ldots, k$, which implies $t^* \ge \max_{j=1,\ldots,k} S(\alpha^j, d^{j+1}) = \phi_k$. In conclusion, $\phi_k \le t^* \le \beta_k$. As the number of iterations $k$ increases, the subset $I_k$ is also monotonically increasing, so $\beta_k$ is monotonically decreasing and $\phi_k$ is monotonically increasing.

Based on Theorem 1, we have the following convergence result.

THEOREM 2 The OSKLAD algorithm converges to the global solution of (14) after a finite number of steps.

PROOF OF THEOREM 2 Since $|D|$ is finite, $I_k$ will remain the same after a finite number of iterations. Assume that in the $k$th iteration, $I_{k+1} = I_k$. Since there is no update in $I_k$, there will be no update of $\alpha$ based on the OSKLAD algorithm, i.e., we have $\alpha^{k+1} = \alpha^k$. Then we have $S(\alpha^k, d^{k+1}) = \min_{d \in D} S(\alpha^k, d) = \min_{d \in I_k} S(\alpha^k, d) = \max_{\alpha \in A} \min_{d \in I_k} S(\alpha, d) = \beta_k$. In addition, $\phi_k = \max_{1 \le j \le k} S(\alpha^j, d^{j+1}) \ge S(\alpha^k, d^{k+1}) = \beta_k$. Combining $\phi_k \ge \beta_k$ with the results in Theorem 1, we obtain $\beta_k = t^* = \phi_k$, and $(\beta_k, \alpha^k)$ is the global optimal solution pair of (14).

V. EXPERIMENTS

This section evaluates and analyzes the performance of OSKLAD and compares it with SVDD and SKAD [13], using both linear and Gaussian RBF kernels, on three hyperspectral image datasets and on multivariate datasets.

A. Hyperspectral Image Data

The experiments are carried out on three hyperspectral digital imagery collection experiment (HYDICE) images, including the panel image, the forest radiance (FR) image, and the desert radiance (DR) image. A HYDICE imaging sensor generates 210 bands across the whole spectral range (0.4–2.5 μm), but we only use 150 bands by discarding water absorption and low signal-to-noise ratio (SNR) bands. The bands used are the 23rd–101st, 109th–136th, and 152nd–194th. The pixel vectors in each hyperspectral test image are normalized in such a way that the minimum spectral value in an image is 0 and the maximum spectral value is 1. The details of the images are shown in Table I.

Fig. 3 provides the ground truth, where the anomalies are highlighted by the green boxes. It can be observed that the sizes of the targets are different.

TABLE I
HYDICE Images

Dataset                  Size        # Targets
Panel                    74 × 254    30
Desert Radiance (DR)     135 × 372   6
Forest Radiance (FR)     86 × 574    14

Fig. 3. Original data and ground truth of the three datasets.

1) Experimental setup: We chose a small patch (11 pixels × 69 pixels) from the upper left corner of each of the panel, FR, and DR images as the data to model the normalcy class of the respective datasets. Once the normalcy class is modeled using the three different algorithms, we obtain the radius $R$ (for both linear and Gaussian RBF kernels) and the center $a$ (for the linear kernel) of the hypersphere. As described in (4), the distance between each pixel in the image and the center can be easily determined. If the distance is greater than $R$, the pixel is considered an anomaly; otherwise, it is a background pixel.

Fig. 4. Anomaly detection results of the panel image using SVDD, SKAD, and OSKLAD.

As explained earlier, the termination criterion for the main OSKLAD algorithm is based on the difference of the objective function between the previous and current iteration. If $\|f_{pre} - f\| < tol$ or if the maximum number of iterations is reached, the algorithm is terminated. The stopping criterion employed in [13] is used for the termination of the SKAD algorithm (step 2 in Algorithm 1 and step 3 in Algorithm 2) in this work too. The algorithm terminates when the change in the weights of the individual kernels is below a user-defined threshold. The Gaussian RBF kernel parameter $\sigma$ is determined by using the minimax estimation technique described in [38] on single kernel SVDD with all features. Experiments are performed with a 3.60 GHz AMD FX(tm)-4100 Quad-Core Processor running Windows 7 with 8.0 GB main memory.



Fig. 5. Anomaly detection results of the DR image using SVDD, SKAD, and OSKLAD.

2) Performance comparison among SVDD, SKAD, and OSKLAD: For SVDD and SKAD, both linear and Gaussian RBF kernels are used in the input space. For OSKLAD, we use a linear kernel in the input space and a Gaussian RBF kernel in the EKFS. The kernel parameters (bandwidth) are determined based on the min-max technique using 10 randomly selected regions of the image to represent the background, as done in [38]. The same value is used over all the test pixels in the image for all the algorithms.

The number of features used for each kernel of SKAD with both linear and Gaussian RBF kernels and of OSKLAD with a linear kernel is 75, which is half the total number of features. For OSKLAD in the EKFS, the total number of features used for the empirical kernel mapping is 759 (the total number of pixels used for modeling). We use only 7 features for modeling each hypersphere in this comparison. Fig. 4, Fig. 5, and Fig. 6 show the anomaly detection results for the panel, DR, and FR images, respectively.

The value of each pixel in the results is the ratio of the distance between the pixel and the center of the hypersphere to the radius of the hypersphere. For comparison, we normalize the scale in all the resulting images to be between 0 (blue) and 1 (red).
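A small sketch of how such a normalized detection map can be produced from the per-pixel statistics (Python/NumPy; it assumes the per-pixel scores are the squared-distance statistics of (4) or (10) and simply illustrates the scaling used for display):

```python
import numpy as np

def detection_map(scores, R2):
    # scores: per-pixel squared-distance statistics; R2: squared radius of the hypersphere.
    ratio = scores / R2                                           # distance statistic relative to the radius
    return (ratio - ratio.min()) / (ratio.max() - ratio.min())    # rescale to [0, 1] for display
```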



Fig. 6. Anomaly detection results of FR image using SVDD, SKAD, and OSKLAD.

For the panel hyperspectral image, one can observe that all six methods are able to identify the first two rows of anomalies, but only OSKLAD in the EKFS is able to identify the small targets in the third row. Also, OSKLAD in the EKFS is able to suppress the background better than all the rest, resulting in a better probability of detection at a certain false alarm rate. The DR and FR datasets are more complicated hyperspectral images. Except for OSKLAD in the EKFS, the detection results for the other five methods contain a lot of noise; however, OSKLAD in the EKFS is able to detect the anomalies with a clean background.

The corresponding receiver operating characteristic (ROC) curves of the six methods are plotted in Fig. 7.

Fig. 7. ROC curves of the six different methods.

Note that Nonlin. SVDD and Nonlin. SKAD represent the performance of SVDD and SKAD with the Gaussian RBF kernel, and Nonlin. OSKLAD represents the performance of OSKLAD with the Gaussian RBF kernel in the EKFS. The ROC curves also show that the nonlinear kernel methods are better than the linear kernel methods in general. The ROC curves indicate that OSKLAD in the EKFS has the best probability of detection at a relatively low false alarm rate. The outstanding performance of OSKLAD in the EKFS is due to the fact that the first few features in the EKFS contain most of the information about the nonlinear correlations of the input features; hence, a better description of the background model can be obtained.

TABLE II
Training Time (in Seconds)

Method            Panel   DR     FR
Lin. SVDD         1.2     1.3    0.6
Nonlin. SVDD      0.9     0.7    0.8
Lin. SKAD         17.3    23.7   42.1
Nonlin. SKAD      19.8    39.7   29.7
Lin. OSKLAD       124.3   39.4   66.5
Nonlin. OSKLAD    9.6     8.5    6.1

3) Computation time comparison among SVDD, SKAD, and OSKLAD: Table II shows the training time results for the three algorithms with both linear and nonlinear kernels.

As we can see from the table, on one hand, SKAD and OSKLAD need more computational time than SVDD because multiple kernel learning or ensemble learning techniques have to use much more computational resources than regular algorithms to concurrently perform multiple learning as opposed to single learning. On the other hand, OSKLAD in the EKFS is still much faster than the other multiple kernel learning algorithms since it only selects 7 features for each kernel and converges within 10 iterations. However, compared with SVDD, OSKLAD in the EKFS is able to achieve moderate to significant improvements on the HYDICE images at the cost of computational time. It has to be also stressed that since SVDD is in fact a state-of-the-art technique, one of the best in anomaly detection, it is very meaningful to achieve even moderate improvements on these datasets. In fact, we have not found any report in the hyperspectral literature to date that claims better performance than SVDD in hyperspectral anomaly detection.

4) Performance comparison of OSKLAD (linear and nonlinear) with varying numbers of features for the panel image: Since $S(\alpha, d)$ for a linear kernel is a linear combination of the control variables $d_1, d_2, \ldots, d_M$, as explained in Section III-C, for each kernel at each iteration a subset of $B$ features is chosen from the $M$ features by ranking the coefficients $c_j$ and picking the $B$ features that minimize $S(\alpha, d)$. In this experiment, we evaluate the performance of OSKLAD with both linear and Gaussian RBF kernels by varying the number of selected features. For the linear kernel, the numbers are chosen in increments of 10 in the input space; the numbers we actually used are 2, 12, 22, 32, 42, and 52. For the Gaussian RBF kernel, the numbers are chosen in increments of 5 in the EKFS; 2, 7, 12, 17, 22, and 27 have been used. The same numbers of features are used for all the kernels that are optimized by OSKLAD. The ROC curves are provided in Fig. 8.

It should be noted that the performance of OSKLAD with the linear kernel increases as the number of features increases from 2 to 52 over most of the false alarm range. For OSKLAD with the Gaussian RBF kernel, there is a significant improvement in the detection rate when the number of features is increased from 2 to 7. However, the performances are very similar to each other when the number of features is 7 or more. It is also interesting to observe that the worst performer of OSKLAD with the Gaussian kernel already outperforms OSKLAD with the linear kernel with any number of features. This shows how effectively OSKLAD with nonlinear kernels exploits the nonlinear correlations among the features in the input space and optimally chooses the features in the EKFS. As can be seen in Fig. 8, the best performance is obtained by using just 7 features in each subset. This is because the first few features in the EKFS already contain most of the information about the nonlinear correlations of the input features. The results for the DR and FR images are not included since the performance comparison is similar to the panel case.

B. Multivariate Datasets

Fig. 8. ROC curves for OSKLAD with linear and nonlinear kernels using different numbers of selected features.

In this subsection, experiments are performed to compare the performance of the OSKLAD algorithm to SVDD and SKAD, using both linear and Gaussian RBF kernels, on multivariate datasets. The performance comparison among the different algorithms is conducted on all of the datasets in David Tax’s data repository.² Since the three algorithms with a nonlinear kernel can achieve 100% accurate anomaly detection for most of the datasets, here we present four datasets, including the sonar mines, sonar rocks, diabetes absent, and ecoli periplasm datasets, where OSKLAD in the EKFS has better performance than the other algorithms. Detailed information on these datasets, including the dimensionality and the numbers of normalcy and anomaly data points, is given in Table III.

TABLE III
Datasets

Dataset            Features   # Normalcy Samples   # Anomaly Samples
Sonar mines        60         111                  97
Sonar rocks        60         111                  97
Diabetes absent    8          268                  500
Ecoli periplasm    7          52                   284

All the datasets are one-class classification datasets in which part of the normalcy data points are used for modeling the smallest enclosing hypersphere and obtaining a description of the one class. The anomaly points, together with the rest of the normalcy data points, are used for testing how well each algorithm models the one class and keeps the superfluous space around it out of the description.
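As an illustration only, the schematic below shows this evaluation protocol; the 50/50 split of the normalcy data and the array names are assumptions made for the sketch, not the exact split used in the experiments.

```python
import numpy as np

def one_class_split(normal_X, anomaly_X, train_frac=0.5, seed=0):
    """Split a one-class dataset: train on part of the normalcy data,
    test on the held-out normalcy data plus all anomaly points.

    The training part is used to fit the smallest enclosing hypersphere;
    the test part checks how well the description keeps the superfluous
    space around the one class out.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(normal_X))
    n_train = int(train_frac * len(normal_X))
    train_X = normal_X[idx[:n_train]]
    test_X = np.vstack([normal_X[idx[n_train:]], anomaly_X])
    # test labels: 0 = normalcy, 1 = anomaly
    test_y = np.concatenate([np.zeros(len(normal_X) - n_train),
                             np.ones(len(anomaly_X))])
    return train_X, test_X, test_y
```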

1) Experimental Setup: The experimental setup for the multivariate datasets is very similar to that of the hyperspectral image data. For regular SVDD and SKAD, both linear and Gaussian RBF kernels are used in the input space. For the OSKLAD algorithm with the linear kernel, feature selection is performed in the input space. For the OSKLAD algorithm with the Gaussian RBF kernel, the input vector is first mapped into the EKFS and the linear kernel is used in the EKFS. The Gaussian RBF kernel parameter σ is determined by using the minimax estimation technique on single-kernel SVDD with all features.

2 http://homepage.tudelft.nl/n9d04/occ/index.html

The number of features B is set to half of the total number of features.
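For illustration, the sketch below shows one way such an EKFS mapping can be realized, following the standard empirical kernel map construction: the Gaussian RBF kernel matrix of the training (normalcy) data is eigendecomposed and used to whiten the map so that, on the training data, canonical dot products in the EKFS reproduce the kernel values. The function and variable names are assumptions of the sketch.

```python
import numpy as np

def rbf_kernel(X, Y, sigma):
    """Gaussian RBF kernel matrix k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def empirical_kernel_map(train_X, sigma, tol=1e-10):
    """Return a function mapping points into the whitened EKFS.

    The training kernel matrix K is eigendecomposed, and the empirical
    kernel map k(x, x_1), ..., k(x, x_n) is whitened by K^{-1/2} so that,
    on the training data, dot products in the EKFS equal the kernel values,
    mirroring the geometry of the RKHS.
    """
    K = rbf_kernel(train_X, train_X, sigma)
    w, U = np.linalg.eigh(K)
    keep = w > tol                          # drop numerically zero directions
    W = U[:, keep] / np.sqrt(w[keep])       # whitening transform
    def phi(X):
        return rbf_kernel(X, train_X, sigma) @ W
    return phi
```

In this whitened space the plain dot product plays the role of the Gaussian RBF kernel, so the linear-kernel feature selection described earlier can be applied directly to the EKFS coordinates.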

2) Experimental Results: In this section, we evaluate the performance of our OSKLAD algorithm as well as the SVDD and SKAD algorithms on the four real-world datasets. The ROC curves for the four datasets are provided in Fig. 9.

From this figure, one can observe that, for all the datasets, OSKLAD using the Gaussian RBF kernel (feature selection performed in the EKFS) significantly outperforms all the other algorithms in terms of probability of detection at very low false alarm rates. In particular, OSKLAD in the EKFS can accurately discriminate all of the normalcy data and anomalies at zero false alarm rate for the sonar rocks and diabetes absent datasets. The SKAD algorithm with a linear or nonlinear kernel performs worse than or the same as regular SVDD, as its results depend on the random subsets of features selected in one particular run. This disadvantage of the algorithm has been discussed in detail in [13].
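Since these comparisons hinge on the detection probability at very low false alarm rates, the short sketch below shows one way such operating points can be traced from raw anomaly scores; the convention that larger scores mean more anomalous (e.g., distance to the hypersphere center minus the learned radius) is an assumption of the sketch.

```python
import numpy as np

def roc_points(scores, labels):
    """Trace (false alarm rate, detection probability) pairs from scores.

    scores : anomaly scores, larger values treated as more anomalous.
    labels : 1 for anomaly, 0 for normalcy.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)[np.argsort(-scores)]   # threshold high -> low
    pd = np.cumsum(labels) / labels.sum()              # detected anomalies
    pfa = np.cumsum(1 - labels) / (1 - labels).sum()   # false alarms
    return pfa, pd
```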

Fig. 10 indicates the impact of the sparsity, i.e., the percentage of features selected for each kernel, on the detection accuracy. These plots show that OSKLAD in the EKFS performs better than OSKLAD in the input space. One can also observe that, in the EKFS, only a small proportion of the features (around 40% across all the datasets) is needed to obtain a good data description with high detection probability. This confirms that very few features in the EKFS contain most of the information, since the EKFS is very similar to the kernel PCA space, unlike the features in the input space.

3) Comparison of sparsity between SKAD and OSKLAD: In this subsection, we compare the computational requirements of SKAD and OSKLAD in the testing phase, which in turn depend on the total number of kernels, or subsets of features, retained after the optimal sparse weights are obtained in the training stage. The initial number of kernels and the final number of kernels with nonzero weights in the SKAD algorithm for the different datasets are shown in Table IV (linear kernel) and Table V (Gaussian RBF kernel).
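As a small illustration of why the testing-phase cost tracks the final kernel count, the hypothetical helper below discards the kernels whose learned sparse weights are effectively zero; the tolerance and data structures are assumptions, not part of the SKAD or OSKLAD implementations.

```python
import numpy as np

def prune_kernels(kernel_weights, feature_subsets, tol=1e-8):
    """Keep only kernels whose learned weights are effectively nonzero.

    kernel_weights  : weights from the sparse multiple kernel optimization,
                      one per kernel (feature subset).
    feature_subsets : the feature subset associated with each kernel.
    The testing-phase cost grows with the number of surviving kernels.
    """
    kernel_weights = np.asarray(kernel_weights, dtype=float)
    keep = np.flatnonzero(kernel_weights > tol)
    return kernel_weights[keep], [feature_subsets[i] for i in keep]

# Example: only two of four candidate kernels need to be evaluated at test time.
w = [0.0, 0.7, 1e-12, 0.3]
subsets = [[0, 1], [2, 5], [3], [4, 6]]
print(prune_kernels(w, subsets))   # -> (array([0.7, 0.3]), [[2, 5], [4, 6]])
```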


Fig. 9. Performance comparison on four small datasets.

TABLE IV
Comparison of Number of Kernels Used by SKAD and OSKLAD (Linear Kernel)

                        SKAD               OSKLAD
Dataset           Initial    Final     Iters    Final
Sonar mines           100       92         8        6
Sonar rocks           100       50        10        9
Diabetes absent       100       62         6        4
Ecoli periplasm       100       70        10       10

Similarly, the total number of iterations required for the convergence of OSKLAD and the final number of kernels with nonzero weights are also presented in the respective tables. As can be observed from these tables, OSKLAD uses far fewer kernels than SKAD after optimization while still providing superior detection performance. Only for one dataset (ecoli periplasm) is the final number of kernels used by SKAD smaller than that used by OSKLAD when the Gaussian RBF kernel is used. The number of iterations required by OSKLAD to converge is also very small, which keeps the training process fast.

TABLE V
Comparison of Number of Kernels Used by SKAD and OSKLAD (Gaussian RBF/EKFS)

                        SKAD               OSKLAD
Dataset           Initial    Final     Iters    Final
Sonar mines           100       11         6        6
Sonar rocks           100       92         6        6
Diabetes absent       100       91         6        6
Ecoli periplasm       100        4         6        6

VI. CONCLUSIONS

Due to the highly complex nonlinear interactions among the features in the RKHS associated with nonlinear kernels, optimal feature selection for kernel-based learning algorithms has been difficult to achieve. This paper tackles that issue and achieves optimal feature selection with a predefined feature size within the optimization problem of sparse kernel learning for anomaly detection. The sparse kernel learning problem is first modeled as a mixed integer programming problem and then relaxed into a QCLP with a large number of inequality constraints.


Fig. 10. Sparsity in input space for four small datasets obtained by using OSKLAD.

The restricted master problem is used to iteratively solve the QCLP, which is otherwise difficult to solve due to its prohibitively large number of constraints. To achieve optimality in feature selection, the proposed work uses a new finite space, the EKFS, instead of the RKHS, and the QCLP is solved optimally in the EKFS. The normalcy data are first transformed into the EKFS via the empirical kernel map, and the EKFS is endowed with the canonical dot product by whitening the space. Since the EKFS is a finite space with the same geometrical properties as the corresponding RKHS, linear feature subset selection in the EKFS is equivalent to optimal nonlinear feature selection in the input data space. It has been shown that, by optimally selecting features in the EKFS rather than in the original input space, significant improvements in anomaly detection can be achieved.

ACKNOWLEDGMENT

The authors would like to thank David Tax and Bob Duin for sharing their dd_tools code and for providing the one-class classifier data files. The authors would also like to thank Dr. Liyi Dai of the Army Research Office (ARO) for supporting this collaborative research.

REFERENCES

[1] Matteoli, S., Diani, M., and Theiler, J. An overview of background modeling for detection of targets and anomalies in hyperspectral remotely sensed imagery. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 7 (2014), 2317–2336.
[2] Johnson, S. Family of constrained signal detectors for hyperspectral imagery. IEEE Transactions on Aerospace and Electronic Systems, 41, 1 (2005), 34–49.
[3] Reed, I. S., and Yu, X. Adaptive multiple-band CFAR detection of an optical pattern with unknown spectral distribution. IEEE Transactions on Acoustics, Speech and Signal Processing, 38, 10 (1990), 1760–1770.
[4] Ren, H., Du, Q., Wang, J., Chang, C.-I., Jensen, J. O., and Jensen, J. L. Automatic target recognition for hyperspectral imagery using high-order statistics. IEEE Transactions on Aerospace and Electronic Systems, 42, 4 (2006), 1372–1385.


[5] Chen, S.-Y., Wang, Y., Wu, C.-C., Liu, C., and Chang, C.-I. Real-time causal processing of anomaly detection for hyperspectral imagery. IEEE Transactions on Aerospace and Electronic Systems, 50, 2 (2014), 1511–1534.
[6] Kelly, E. J. An adaptive detection algorithm. IEEE Transactions on Aerospace and Electronic Systems, AES-22, 2 (1986), 115–127.
[7] Reed, I. S., Mallett, J. D., and Brennan, L. E. Rapid convergence rate in adaptive arrays. IEEE Transactions on Aerospace and Electronic Systems, AES-10, 6 (1974), 853–863.
[8] Manolakis, D., Rossacci, M., Zhang, D., Cipar, J., Lockwood, R., Cooley, T., and Jacobson, J. Statistical characterization of hyperspectral background clutter in the reflective spectral region. Applied Optics, 47, 28 (2008), F96–F106.
[9] Matteoli, S., Diani, M., and Corsini, G. A tutorial overview of anomaly detection in hyperspectral images. IEEE Aerospace and Electronic Systems Magazine, 25, 7 (2010), 5–28.
[10] Matteoli, S., Veracini, T., Diani, M., and Corsini, G. Models and methods for automated background density estimation in hyperspectral anomaly detection. IEEE Transactions on Geoscience and Remote Sensing, 51, 5 (2013), 2837–2852.
[11] Kwon, H., and Nasrabadi, N. M. Kernel RX-algorithm: a nonlinear anomaly detector for hyperspectral imagery. IEEE Transactions on Geoscience and Remote Sensing, 43, 2 (2005), 388–397.
[12] Tax, D. M. J., and Duin, R. P. W. Support vector data description. Machine Learning, 54 (2004), 45–66.
[13] Gurram, P., Kwon, H., and Han, T. Sparse kernel-based hyperspectral anomaly detection. IEEE Geoscience and Remote Sensing Letters, 9, 5 (Sept. 2012), 943–947.
[14] Liu, W.-M., and Chang, C.-I. Multiple-window anomaly detection for hyperspectral imagery. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 6, 2 (2013), 644–658.
[15] Caefer, C. E., Raviv, O., Rotman, S. R., Stefanou, M. S., Nielsen, E. D., and Rizzuto, A. P. Analysis of false alarm distributions in the development and evaluation of hyperspectral point target detection algorithms. Optical Engineering, 46, 7 (2007).
[16] Romano, J. M., Rosario, D., and Niver, E. Morphological operators for polarimetric anomaly detection. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (2014), 664–677.
[17] Ren, H., and Chang, C.-I. Automatic spectral target recognition in hyperspectral imagery. IEEE Transactions on Aerospace and Electronic Systems, 39, 4 (2003), 1232–1249.
[18] Banerjee, A., Burlina, P., and Diehl, C. A support vector method for anomaly detection in hyperspectral imagery. IEEE Transactions on Geoscience and Remote Sensing, 44, 8 (2006), 2282–2291.
[19] Gurram, P., and Kwon, H. Support-vector-based hyperspectral anomaly detection using optimized kernel parameters. IEEE Geoscience and Remote Sensing Letters, 8, 6 (Nov. 2011), 1060–1064.
[20] Scholkopf, B., and Smola, A. J. Learning with Kernels. Cambridge, MA: MIT Press, 2002.
[21] Sonnenburg, S., Ratsch, G., Schafer, C., and Scholkopf, B. Large scale multiple kernel learning. Journal of Machine Learning Research, 7, 1 (2006), 1531–1565.
[22] Rakotomamonjy, A., Bach, F. R., Canu, S., and Grandvalet, Y. SimpleMKL. Journal of Machine Learning Research, 9 (2008), 2491–2521.
[23] Kohavi, R., and John, G. H. Wrappers for feature subset selection. Artificial Intelligence, 97 (1997), 273–324.
[24] Almuallim, H. Learning Boolean concepts in the presence of many irrelevant features. Artificial Intelligence, 69 (1994), 279–306.
[25] Kira, K., and Rendell, L. A. The feature selection problem: traditional methods and a new algorithm. In Proceedings of AAAI-92, San Jose, CA, 1992, pp. 129–134.
[26] Kononenko, I. Estimating attributes: analysis and extensions of RELIEF. In Machine Learning: ECML-94. New York: Springer, 1994, pp. 171–182.
[27] Devijver, P. A., and Kittler, J. Pattern Recognition: A Statistical Approach. Englewood Cliffs, NJ: Prentice-Hall, 1982.
[28] Miller, A. J. Subset Selection in Regression. London: Chapman and Hall, 1990.
[29] Cao, B., Shen, D., Sun, J., Yang, Q., and Chen, Z. Feature selection in a kernel space. In Proceedings of the International Conference on Machine Learning, Corvallis, OR, 2007, pp. 121–128.
[30] Wang, L. Feature selection with kernel class separability. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30, 9 (2008), 1534–1546.
[31] Chapelle, O., Vapnik, V., Bousquet, O., and Mukherjee, S. Choosing multiple parameters for SVM. Machine Learning, 46 (2002), 131–159.
[32] Chen, J., and Ye, J. Training SVM with indefinite kernels. In Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, June 2008, pp. 136–143.
[33] Tan, M., Wang, L., and Tsang, I. W. Learning sparse SVM for feature selection on very high dimensional datasets. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), Haifa, Israel, June 2010, pp. 1047–1054.
[34] Hettich, R., and Kortanek, K. O. Semi-infinite programming: theory, methods, and applications. SIAM Review, 35, 3 (Sept. 1993), 380–429.
[35] Tsuda, K. Support vector classifier with asymmetric kernel function. In Proceedings of ESANN, Bruges, Belgium, 1999, pp. 183–188.
[36] Kim, S., and Boyd, S. A minimax theorem with applications to machine learning, signal processing, and finance. SIAM Journal on Optimization, 19, 3 (Nov. 2008), 1344–1367.


[37] Kelley, J. E., Jr. The cutting-plane method for solving convex programs. Journal of the Society for Industrial and Applied Mathematics, 8, 4 (1960), 703–712.
[38] Banerjee, A., Burlina, P., and Diehl, C. A support vector method for anomaly detection in hyperspectral imagery. IEEE Transactions on Geoscience and Remote Sensing, 44, 8 (Aug. 2006), 2282–2291.

Zhimin Peng received his B.S. in computational mathematics from Xi'an Jiaotong University, China, in 2011, and his M.A. in applied mathematics from Rice University, Houston, TX, in 2013.

He is a Ph.D. student in the Department of Mathematics at UCLA. His research interests include developing efficient algorithms for anomaly detection and solving large-scale convex optimization problems.

Prudhvi Gurram (S'00–M'10) received the B.E. degree in electronics and communication engineering from the National Institute of Technology Karnataka, Karnataka, India, in 2003, and the M.S. degree in electrical engineering and the Ph.D. degree in imaging science from the Rochester Institute of Technology, Rochester, NY, in 2008 and 2009, respectively.

Since Oct. 2009, he has been a researcher with the Image Processing Branch of the U.S. Army Research Laboratory, Adelphi, MD. At present, he is working on machine learning algorithms for hyperspectral image analysis and perception of autonomous systems. His research interests include hyperspectral signal processing, machine learning, and computer vision.

Dr. Gurram is a member of the IEEE Geoscience and Remote Sensing Society. He is a reviewer for the IEEE Transactions on Image Processing, IEEE Transactions on Geoscience and Remote Sensing, IEEE Geoscience and Remote Sensing Letters, IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, and the SPIE Journal of Electronic Imaging and Optical Engineering.


Heesung Kwon (M'99–SM'05) received the B.Sc. degree in electronic engineering from Sogang University, Seoul, Korea, in 1984, and the M.S. and Ph.D. degrees in electrical engineering from the State University of New York at Buffalo in 1995 and 1999, respectively.

From 1983 to 1993, he was with Samsung Electronics Corp., where he worked as a senior research engineer. He was with the U.S. Army Research Laboratory (ARL), Adelphi, MD, from 1996 to 2006, working on automatic target detection and hyperspectral signal processing applications. From 2006 to 2007, he was with the Johns Hopkins University Applied Physics Laboratory (JHU/APL), working on biological standoff detection problems. He rejoined ARL in Aug. 2007 as a senior electronics engineer, leading hyperspectral research efforts in the Image Processing Branch. His current research interests include image/video analytics, human-autonomy interaction, hyperspectral signal processing, pattern recognition, machine learning, and statistical learning.

Dr. Kwon is currently an Associate Editor of the IEEE Transactions on Aerospace and Electronic Systems. He also served as Lead Guest Editor of the Special Issue on Algorithms for Multispectral and Hyperspectral Image Analysis of the Journal of Electrical and Computer Engineering. He has published over 90 journal papers, book chapters, and conference papers in the areas of his research interests.

Wotao Yin received his B.S. in mathematics from Nanjing University, China, in 2001, and his M.S. and Ph.D. degrees in operations research from Columbia University, New York, NY, in 2003 and 2006, respectively.

He is a professor in the Department of Mathematics at UCLA. His research interests lie in computational optimization and its applications in image processing, machine learning, and other inverse problems. His recent work has been in optimization algorithms for large-scale and distributed signal processing and machine learning problems.

He won the NSF CAREER award in 2008 and the Alfred P. Sloan Research Fellowship in 2009.
