
Research Article

An Improved Semisupervised Outlier Detection Algorithm Based on Adaptive Feature Weighted Clustering

Tingquan Deng (1,2) and Jinhong Yang (2)

1 College of Science, Harbin Engineering University, Harbin 150001, China
2 College of Computer Science and Technology, Harbin Engineering University, Harbin 150001, China

Correspondence should be addressed to Jinhong Yang; yangjinhong66@163.com

Received 13 September 2016; Revised 2 November 2016; Accepted 20 November 2016

Academic Editor: Filippo Ubertini

Copyright © 2016 T. Deng and J. Yang. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Mathematical Problems in Engineering, Volume 2016, Article ID 6394253, 14 pages. http://dx.doi.org/10.1155/2016/6394253

There exist various approaches to outlier detection, among which semisupervised methods achieve encouraging superiority due to the introduction of prior knowledge. In this paper, an adaptive feature weighted clustering-based semisupervised outlier detection strategy is proposed. This method maximizes the membership degree of a labeled normal object to the cluster it belongs to and minimizes the membership degrees of a labeled outlier to all clusters. In consideration of the distinct significance of features or components of a dataset in determining whether an object is an inlier or outlier, each feature is adaptively assigned a different weight according to the deviation degree between this feature of all objects and that of a certain cluster prototype. A series of experiments on a synthetic dataset and several real-world datasets are implemented to verify the effectiveness and efficiency of the proposal.

1. Introduction

Outlier detection is an important topic in the data mining community, which aims at finding patterns that occur infrequently, as opposed to other data mining techniques [1]. An outlier is an observation that deviates significantly from, or is inconsistent with, the main body of a dataset, as if it were generated by a different mechanism [2]. The importance of outlier detection lies in the fact that outliers can provide raw patterns and valuable knowledge about a dataset. Current application areas of outlier detection include crime detection, credit card fraud detection, network intrusion detection, medical diagnosis, fault detection in safety-critical systems, and detecting abnormal regions in image processing [3–9].

Recently, studies on outlier detection have been very active and many approaches have been proposed. In general, existing work on outlier detection can be broadly classified into three modes depending on whether label information is available or can be used to build outlier detection models: unsupervised, supervised, and semisupervised methods.

Supervised outlier detection concerns the situation where the training dataset contains prior information about the class of each instance, that is, normal or abnormal. One-class support vector machine (OCSVM) [10] and support vector data description (SVDD) [11, 12], which consider the case that the training data are all normal instances, construct a hypersphere around the normal data and utilize the constructed hypersphere to decide whether an unknown sample is an inlier or outlier. The supervised outlier detection problem is a difficult case in many real-world applications, since the acquisition of label information for the whole training dataset is often expensive, time consuming, and subjective.

Unsupervised outlier detection, without prior information about the class distribution, is generally classified into distribution-based [3], distance-based [13, 14], density-based [15, 16], and clustering-based [17–20] approaches. The distribution-based approach assumes that all data points are generated by a certain statistical model, while outliers do not obey the model. However, the assumption of an underlying distribution of data points is not always available in many real-life applications. The distance-based approach was first investigated by Knox and Ng [14]: an object o in a dataset X is an outlier if at least p% of the objects in X are farther than the distance D from o. The global parameters p and D are not suitable when the local information of the dataset varies greatly. Representatives of this type of approach include the K-nearest neighbor (KNN) algorithm [13] and its variants [21, 22].



The density-based approach was originally proposed by Breunig et al. [15]. A local outlier factor (LOF) is assigned to each data point based on its local neighborhood density; a data point with a high LOF value is then determined to be an outlier. However, this method is very sensitive to the choice of the neighborhood parameter.

Clustering-based approaches [17–20] partition the dataset into several clusters depending on the similarity of objects and detect outliers by examining the relationship between objects and clusters. In general, clusters containing significantly fewer data points than other clusters, or lying remote from other clusters, are considered as outliers. The cluster structure of data can facilitate the task of outlier detection, and a small amount of related literature has been proposed. A classical clustering method is used to find anomalies in the intrusion detection domain [18]. In the work of [19], clustering techniques iteratively detect outliers for multidimensional data analysis in subspaces. Zhao et al. [20] propose an adaptive fuzzy c-means (AFCM) algorithm by introducing sample weight coefficients into the objective function and apply it to anomaly detection in the energy system of the steel industry. Since clustering-based approaches are unsupervised, requiring no labeled training data, their performance in outlier detection is limited. In addition, most existing clustering-based methods only pursue the optimal clustering but do not incorporate optimal outlier detection into the clustering process.

In many real-world applications, one may encounter cases where a small set of objects is labeled as outliers or as belonging to a certain class, but most of the data are unlabeled. Studies indicate that the introduction of a small amount of prior knowledge can significantly improve the effectiveness of outlier detection [23–25]. Therefore, semisupervised approaches to outlier detection have been developed to tackle such scenarios and have recently been regarded as a popular direction in outlier detection. In order to take advantage of the label information of a target dataset, entropy-based outlier detection based on semisupervised learning from few positive examples (EODSP) is proposed in [23]. That method extracts reliable normal instances from unlabeled objects and regards them as labeled normal samples; an entropy-based outlier detection method is then used to detect the top M outliers. However, when a dataset initially provides labeled normal and abnormal samples, the algorithm in [23] cannot make full use of the given label information. Literature [24] develops a semisupervised outlier detection method based on the assessment of deviation from known labeled objects by punishing poor clustering results and restricting the number of outliers. Xue et al. [25] present a semisupervised outlier detection proposal based on fuzzy rough c-means clustering, which detects outliers by minimizing the sum of squared errors of the clustering results and the deviation from known labeled examples as well as the number of outliers. Unfortunately, some labeled normal objects are ultimately misidentified as outliers due to improper parameter selection in [24, 25].

Most of the previous research treats different features of objects equally in the outlier detection process, which does not conform to the intrinsic characteristics of a dataset. Actually, it is more reasonable that different features have different importance in each cluster, especially for high-dimensional sparse datasets, where the structure of each cluster is often limited to a subset of features rather than the entire feature set. Some works concerning feature weighted clustering have been studied. Huang et al. [26] propose a W-c-means type clustering algorithm that can automatically calculate feature weights; W-c-means adds a new step into the basic c-means algorithm to update the variable weights based on the current partition of the data. Literature [27] develops an approach called simultaneous clustering and attribute discrimination (SCAD); SCAD learns the feature relevance representation of each cluster independently in an unsupervised manner. Zhou et al. [28] publish a maximum-entropy-regularized weighted fuzzy c-means (EWFCM) clustering algorithm for "nonspherical" shaped data; a new objective function is developed in the EWFCM algorithm to achieve the optimal clustering result by minimizing the dispersion within clusters and maximizing the entropy of attribute weights simultaneously. These existing feature weighted clustering methods encourage scholars to study outlier detection based on feature weighted clustering.

To make full use of prior knowledge to facilitate clustering-based outlier detection, we develop a semisupervised outlier detection algorithm based on adaptive feature weighted clustering (SSOD-AFW) in this paper, in which the feature weights are obtained iteratively. The proposed algorithm emphasizes the diversity of different features in each cluster and assigns lower weights to irrelevant features to reduce their negative influence on the outlier decision. Furthermore, based on the convention that outliers usually have a low membership to every cluster, we relax the constraint of fuzzy c-means (FCM) clustering, where the membership degrees of a sample to all clusters must sum to one, and propose an adaptive feature weighted semisupervised possibilistic clustering-based outlier detection algorithm. The interaction problem between optimal clustering and outlier detection is addressed in the proposed method. The label information is introduced into the possibilistic clustering method according to the following principles: (1) maximizing the membership degree of a labeled normal object to the cluster it belongs to; (2) minimizing the membership degrees of a labeled normal object to the clusters it does not belong to; and (3) minimizing the membership degrees of a labeled outlier to all clusters. In addition to the above principles, we simultaneously minimize the dispersion within clusters in the new clustering objective function to achieve a proper cluster structure. Finally, the resulting optimal membership degrees are used to indicate the outlying degree of each sample in the dataset. The proposed algorithm is found promising in improving the performance of outlier detection in comparison with typical outlier detection methods in accuracy and running time, as well as other evaluation metrics.

The remainder of this paper is organized as follows. Section 2 gives a short review of possibilistic clustering algorithms. Section 3 presents the detailed description of the feature weighted semisupervised clustering-based outlier detection algorithm. In Section 4, the experimental results of the proposed method against typical outlier detection algorithms on synthetic and real-world datasets are discussed. Finally, Section 5 presents our conclusions.


2. Possibilistic Clustering Algorithms

Let X = {x_1, x_2, ..., x_n} be a given dataset of n objects, where x_i = (x_{i1}, x_{i2}, ..., x_{iN}) ∈ R^N is the ith object characterized by N features. Suppose that the dataset is divided into c clusters and o_k denotes the kth cluster prototype.

FCM is a well-known clustering algorithm [29], whose objective function is

\[ J_{\mathrm{FCM}} = \sum_{i=1}^{n} \sum_{k=1}^{c} u_{ik}^{m} \left\| x_i - o_k \right\|^2, \tag{1} \]

\[ \text{s.t.} \quad \sum_{k=1}^{c} u_{ik} = 1, \quad u_{ik} \in [0, 1], \tag{2} \]

where u_{ik} is the membership degree of the ith (1 ≤ i ≤ n) object to the kth (1 ≤ k ≤ c) cluster, ‖·‖ represents the l_2-norm of a vector, and m > 0 is the fuzzification coefficient. Note that the constraint condition in (2) indicates that the membership sum of each object over all clusters equals one. Therefore, FCM is sensitive to outliers, due to the intuition that outliers or noises commonly lie far away from all cluster prototypes. For this reason, Krishnapuram and Keller [30] proposed a possibilistic c-means (PCM) clustering algorithm, which relaxes the constraint on the sum of memberships and minimizes the following objective function:

\[ J_{\mathrm{PCM}} = \sum_{i=1}^{n} \sum_{k=1}^{c} u_{ik}^{m} \left\| x_i - o_k \right\|^2 + \sum_{k=1}^{c} \eta_k \sum_{i=1}^{n} \left( 1 - u_{ik} \right)^{m}, \tag{3} \]

\[ \text{s.t.} \quad u_{ik} \in [0, 1], \tag{4} \]

where η_k is a suitable positive number. In PCM, the constraint (4) allows an outlier to hold a low membership to all clusters, so an outlier has a low impact on the objective function (3). The membership information of each sample can naturally be used to interpret its outlying characteristic: for a certain sample, if it has a low membership to all clusters, it is likely to be an outlier.

Afterward, another unsupervised possibilistic clustering algorithm (PCA) was proposed by Yang and Wu [31], and the objective function of PCA is described as

\[ J_{\mathrm{PCA}} = \sum_{i=1}^{n} \sum_{k=1}^{c} u_{ik}^{m} \left\| x_i - o_k \right\|^2 + \frac{\beta}{m^2 \sqrt{c}} \sum_{i=1}^{n} \sum_{k=1}^{c} \left( u_{ik}^{m} \log u_{ik}^{m} - u_{ik}^{m} \right), \quad \text{s.t.} \ u_{ik} \in [0, 1], \tag{5} \]

where the parameter β can be calculated from the sample covariance:

\[ \beta = \frac{\sum_{i=1}^{n} \sum_{j=1}^{N} \left( x_{ij} - \sum_{l=1}^{n} x_{lj} / n \right)^2}{n}. \tag{6} \]
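As a concrete illustration (a minimal NumPy sketch; the function name is our own, not from the paper), β in (6) is simply the summed squared deviation of every entry from its feature-wise mean, divided by the number of objects:

```python
import numpy as np

def compute_beta(X: np.ndarray) -> float:
    """Normalization parameter beta of (6): total squared deviation
    of all entries from their per-feature means, divided by n."""
    n = X.shape[0]
    return float(np.sum((X - X.mean(axis=0)) ** 2) / n)
```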

3. Semisupervised Outlier Detection Framework Based on Feature Weighted Clustering

3.1. Model Formulation. In this section, we introduce prior knowledge into the possibilistic c-means clustering method to improve the performance of outlier detection. First, a small subset of samples in a given dataset X = {x_1, x_2, ..., x_n} is labeled as normal or outlier objects. Each labeled normal object carries the label of the class it belongs to. A semisupervised indicator matrix A = (a_{ik})_{n×c} is constructed to describe the semisupervised information, and its entries are defined as follows (a code sketch of this construction is given after the rules):

(i) If an object x_i is labeled as a normal point and it belongs to the kth cluster, then a_{ik} = −1, and for all s = 1, 2, ..., c, s ≠ k, we let a_{is} = 1.

(ii) If x_i is labeled as an outlier, then for all s = 1, 2, ..., c, we set a_{is} = 1.

(iii) If x_i is unlabeled, then for all s = 1, 2, ..., c, we have a_{is} = 0.
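The following sketch builds A from partial labels; the −1/1/0 encoding follows rules (i)–(iii) above, while the function and argument names are our own illustration:

```python
import numpy as np

def build_indicator_matrix(n: int, c: int,
                           normal_labels: dict,
                           outlier_idx: list) -> np.ndarray:
    """Semisupervised indicator matrix A = (a_ik), shape (n, c).
    normal_labels maps object index -> cluster index (rule i);
    outlier_idx lists objects labeled as outliers (rule ii);
    all remaining rows stay zero (rule iii)."""
    A = np.zeros((n, c))
    for i, k in normal_labels.items():
        A[i, :] = 1.0      # penalize membership to the other clusters
        A[i, k] = -1.0     # reward membership to the labeled cluster
    for i in outlier_idx:
        A[i, :] = 1.0      # penalize membership to every cluster
    return A
```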

Usually, data contain a number of redundant features. The cluster structure in a given dataset is often confined to a subset of features rather than the entire feature set. Irrelevant features can only obscure the discovery of the cluster structure by a clustering algorithm, and an intrinsic outlier is easily neglected due to the resulting vagueness of the cluster structure. Figure 1 presents an example of a three-dimensional dataset. The dataset has two clusters (c_1 and c_2) and 3 features (f_1, f_2, and f_3). In the full feature space (f_1, f_2, f_3), neither of the clusters is discovered (see Figure 1(a)). In the subspace (f_1, f_2), cluster c_1 can be found, but c_2 cannot (see Figure 1(b)). Nevertheless, only cluster c_2 can be clearly shown in (f_2, f_3) (see Figure 1(c)). Therefore, if we assign weights 0.47, 0.45, and 0.08 to features f_1, f_2, and f_3, respectively, cluster c_1 will be recovered by a clustering algorithm; if the weights of features f_1, f_2, and f_3 are assigned as 0.13, 0.46, and 0.41, respectively, cluster c_2 will be recovered. In this consideration, each cluster is relevant to a different subset of features, and the same feature may have different importance in different clusters.

In our research, let v_{jk} be the weight of the jth (1 ≤ j ≤ N) feature with respect to the kth (1 ≤ k ≤ c) cluster, which satisfies Σ_{j=1}^{N} v_{jk} = 1; then the feature weighted distance d_{ik} between the ith object and the kth cluster prototype is defined as

\[ d_{ik} = \sqrt{ \sum_{j=1}^{N} v_{jk}^{q} \left( x_{ij} - o_{kj} \right)^2 }, \tag{7} \]

where the parameter q > 1 is the feature weight index.
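In code (a minimal sketch under the notation above; the function name is ours), (7) evaluates for all objects and clusters at once:

```python
import numpy as np

def weighted_distances(X: np.ndarray, O: np.ndarray,
                       V: np.ndarray, q: float) -> np.ndarray:
    """Feature weighted distances d_ik of (7).
    X: (n, N) data; O: (c, N) prototypes; V: (N, c) feature weights."""
    n, _ = X.shape
    c = O.shape[0]
    D = np.empty((n, c))
    for k in range(c):
        diff2 = (X - O[k]) ** 2            # (n, N) squared deviations
        D[:, k] = np.sqrt(diff2 @ (V[:, k] ** q))
    return D
```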

[Figure 1: Three-dimensional synthetic example. (a) Plot of the space (f_1, f_2, f_3); (b) plot of the subspace (f_1, f_2), where cluster c_1 is visible; (c) plot of the subspace (f_2, f_3), where cluster c_2 is visible.]

The points within clusters are usually strongly correlated, while outliers exhibit only weak correlation. That is, normal points belong to one of the c clusters, and outliers do not belong to any cluster. Therefore, a normal point should have a high membership to the cluster it belongs to, and an outlier should have a low membership to all clusters. Based on this idea, we define a new objective function and minimize it as follows:

\[ J_{\mathrm{SSOD\text{-}AFW}}(U, O, V) = \sum_{i=1}^{n} \sum_{k=1}^{c} u_{ik}^{m} d_{ik}^{2} + \frac{\beta}{m^2 \sqrt{c}} \sum_{i=1}^{n} \sum_{k=1}^{c} \left( u_{ik}^{m} \log u_{ik}^{m} - u_{ik}^{m} \right) + \sum_{i=1}^{n} \sum_{k=1}^{c} \alpha_{ik} a_{ik} u_{ik}^{m}, \tag{8} \]

\[ \text{s.t.} \quad \sum_{j=1}^{N} v_{jk} = 1, \quad u_{ik} \in [0, 1], \quad v_{jk} \in [0, 1], \tag{9} \]

where n, N, and c are the numbers of objects, features, and clusters, respectively; U = (u_{ik})_{n×c}, where u_{ik} is the membership degree of the ith object to the kth cluster; V = (v_{jk})_{N×c}, where v_{jk} denotes the weight of the jth feature with respect to the kth cluster; O = (o_{kj})_{c×N}, where o_{kj} indicates the jth feature value of the kth cluster prototype; d_{ik} denotes the feature weighted distance between object x_i and the kth cluster prototype; a_{ik} ∈ {1, −1, 0} is the corresponding element of the semisupervised indicator matrix A; m > 0 is the fuzzification coefficient; and the parameter β can be fixed as the sample covariance according to (6). The positive coefficient α_{ik} adjusts the significance of the label information of the ith object with respect to the kth cluster in objective function (8): the larger α_{ik} is, the larger the influence of the label knowledge.

The first term in (8) is equivalent to the FCM objective function, which requires the distances of objects from the cluster prototypes to be as small as possible. The second term is constructed to force u_{ik} to be as large as possible. The third term, α_{ik} a_{ik} u_{ik}^m, focuses on minimizing the membership degrees of a labeled outlier to all the clusters and maximizing the membership degree of a labeled normal object to the cluster it belongs to. With a proper choice of α_{ik}, we can balance the label information weight of every object and achieve the optimal fuzzy partition.

The virtue of the semisupervised indicator matrix A in objective function (8) can be elaborated as follows. Recalling the construction of A and objective function (8), note that if we know that x_i belongs to the kth cluster, then a_{ik} = −1 and all the other entries in the ith row equal 1. Thus, minimizing α_{ik} a_{ik} u_{ik}^m in (8) means maximizing the membership of x_i to the kth cluster and simultaneously minimizing the memberships of x_i to the other clusters. If x_i is labeled as an outlier, namely, all the elements in the ith row of A equal 1, then minimizing α_{ik} a_{ik} u_{ik}^m in (8) means minimizing the memberships of x_i to all clusters, for an outlier does not belong to any cluster. If x_i is unlabeled, namely, a_{ik} = 0 for all k = 1, 2, ..., c, then the term α_{ik} a_{ik} u_{ik}^m has no impact on objective function (8).

3.2. Solutions to the Objective Function. In this subsection, an iterative algorithm for minimizing J_SSOD-AFW with respect to u_{ik}, v_{jk}, and o_{kj} is derived, similar to classical FCM.

First, in order to minimize J_SSOD-AFW with respect to V, U and O are fixed, and the parameters α_{ik} (i = 1, 2, ..., n; k = 1, 2, ..., c) are constants. The Lagrange function is constructed as follows:

\[ L_1 = \sum_{i=1}^{n} \sum_{k=1}^{c} u_{ik}^{m} d_{ik}^{2} + \frac{\beta}{m^2 \sqrt{c}} \sum_{i=1}^{n} \sum_{k=1}^{c} \left( u_{ik}^{m} \log u_{ik}^{m} - u_{ik}^{m} \right) + \sum_{i=1}^{n} \sum_{k=1}^{c} \alpha_{ik} a_{ik} u_{ik}^{m} - \sum_{k=1}^{c} \gamma_k \left( \sum_{j=1}^{N} v_{jk} - 1 \right), \tag{10} \]

where γ_k (k = 1, 2, ..., c) are the Lagrange multipliers. By taking the gradient of L_1 with respect to v_{jk} and setting it to zero, we obtain

\[ \frac{\partial L_1}{\partial v_{jk}} = q v_{jk}^{q-1} \sum_{i=1}^{n} u_{ik}^{m} \left( x_{ij} - o_{kj} \right)^2 - \gamma_k = 0. \tag{11} \]

Then

\[ v_{jk} = \left[ \frac{\gamma_k}{q \sum_{i=1}^{n} u_{ik}^{m} \left( x_{ij} - o_{kj} \right)^2} \right]^{1/(q-1)}. \tag{12} \]


Substituting (12) into (9), we have

\[ \sum_{j=1}^{N} v_{jk} = \left[ \frac{\gamma_k}{q} \right]^{1/(q-1)} \sum_{j=1}^{N} \left[ \frac{1}{\sum_{i=1}^{n} u_{ik}^{m} \left( x_{ij} - o_{kj} \right)^2} \right]^{1/(q-1)} = 1. \tag{13} \]

It follows that

\[ \left[ \frac{\gamma_k}{q} \right]^{1/(q-1)} = \frac{1}{\sum_{j=1}^{N} \left[ 1 / \sum_{i=1}^{n} u_{ik}^{m} \left( x_{ij} - o_{kj} \right)^2 \right]^{1/(q-1)}}. \tag{14} \]

The updating criterion for the feature weight v_{jk} (1 ≤ j ≤ N, 1 ≤ k ≤ c) is thus obtained:

\[ v_{jk} = \frac{ \left[ 1 / \sum_{i=1}^{n} u_{ik}^{m} \left( x_{ij} - o_{kj} \right)^2 \right]^{1/(q-1)} }{ \sum_{l=1}^{N} \left[ 1 / \sum_{i=1}^{n} u_{ik}^{m} \left( x_{il} - o_{kl} \right)^2 \right]^{1/(q-1)} }. \tag{15} \]

The updating rule for v_{jk} implies that the larger the deviation of all samples from the kth cluster prototype along the jth feature, the smaller the weight of the jth feature. That is, if the data distribution is compact around the kth cluster prototype in the jth feature dimension, the jth feature plays a significant role in forming the kth cluster; meanwhile, irrelevant features are assigned smaller weights to reduce their negative impact on the clustering process.

To find the optimal cluster prototypes O, we assume U and V are fixed, and the parameters α_{ik} (i = 1, 2, ..., n; k = 1, 2, ..., c) are also constants. We take the gradient of J_SSOD-AFW with respect to o_{kj} and set it to zero:

\[ \frac{\partial J_{\mathrm{SSOD\text{-}AFW}}}{\partial o_{kj}} = \sum_{i=1}^{n} \left( -2 u_{ik}^{m} v_{jk}^{q} \left( x_{ij} - o_{kj} \right) \right) = 0. \tag{16} \]

The updating formula for the cluster prototype o_{kj} is obtained as follows:

\[ o_{kj} = \frac{\sum_{i=1}^{n} u_{ik}^{m} x_{ij}}{\sum_{i=1}^{n} u_{ik}^{m}}. \tag{17} \]

To solve for the optimal fuzzy partition matrix U, we assume O and V are fixed, and the parameters α_{ik} (i = 1, 2, ..., n; k = 1, 2, ..., c) are also constants. We set the gradient of J_SSOD-AFW with respect to u_{ik} to zero:

\[ \frac{\partial J_{\mathrm{SSOD\text{-}AFW}}}{\partial u_{ik}} = m u_{ik}^{m-1} d_{ik}^{2} + \frac{\beta}{m \sqrt{c}} u_{ik}^{m-1} \log u_{ik}^{m} + \alpha_{ik} a_{ik} m u_{ik}^{m-1} = 0. \tag{18} \]

The updating formula for u_{ik} is derived as follows:

\[ u_{ik} = \exp\left( -\frac{m \sqrt{c}}{\beta} \left( d_{ik}^{2} + \alpha_{ik} a_{ik} \right) \right). \tag{19} \]

Formula (19) indicates that a larger weighted distance d_{ik} leads to a smaller value of u_{ik}, for all 1 ≤ i ≤ n, 1 ≤ k ≤ c. It should be noted that the membership degree u_{ik} also depends on the coefficient α_{ik}. The choice of α_{ik} is important to the performance of the SSOD-AFW algorithm, because it determines the importance of the third term relative to the other terms in objective function (8). If α_{ik} is too small, the third term will be neglected and the labels of objects will not work to promote the cluster structure; if α_{ik} is too large, the other terms will be neglected and the negative influence of possible mislabels will be enlarged. The value of α_{ik} should be chosen such that it has the same order of magnitude as the first term in (8). To determine the parameter α_{ik} adaptively, in all experiments described in this paper we choose α_{ik} proportional to d_{ik}² as follows:

\[ \alpha_{ik} = K d_{ik}^{2}, \tag{20} \]

where K ∈ (0, 1) is a constant. Since the weighted distance d_{ik} is dynamically updated, the value of the parameter α_{ik} is adaptively updated in each iteration.
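Concretely, the coupled α/u update of (20) and (19) can be sketched as follows (array names follow the earlier sketches; this is our own illustration, not the authors' code):

```python
import numpy as np

def update_memberships(D2: np.ndarray, A: np.ndarray,
                       beta: float, m: float, K: float) -> np.ndarray:
    """D2: (n, c) squared weighted distances d_ik^2;
    A: (n, c) semisupervised indicator matrix.
    Applies (20) to get alpha, then (19) to get U."""
    c = D2.shape[1]
    alpha = K * D2                                   # eq. (20)
    return np.exp(-(m * np.sqrt(c) / beta) * (D2 + alpha * A))
```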

3.3. Criteria for Outlier Identification. Based on the above analysis, outliers should hold low membership degrees to all clusters. Therefore, the sum of the memberships of an object to all clusters can be used to evaluate its outlying degree. For a certain object x_i, its outlying degree is defined as

\[ \mathrm{OD}(x_i) = \sum_{k=1}^{c} u_{ik}. \tag{21} \]

Thus, a small value of OD(x_i) indicates a high outlying possibility of object x_i. The outlying degree of each sample in a dataset is calculated and sorted in ascending order. The suspicious outliers are then found simply by extracting the top M objects in the sorted outlying degree sequence, where M is the given number of outliers contained in the dataset, or the number of outliers one needs.
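As a sketch (our own helper, assuming a membership matrix U as above), the top-M extraction from (21) is:

```python
import numpy as np

def top_m_outliers(U: np.ndarray, M: int) -> np.ndarray:
    """Return indices of the M objects with the smallest
    outlying degree OD(x_i) = sum_k u_ik, per (21)."""
    od = U.sum(axis=1)          # outlying degree of each object
    return np.argsort(od)[:M]   # smallest OD = most suspicious
```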

In summary, the description of the SSOD-AFW algorithm is given in Algorithm 1.

Algorithm 1 (semisupervised outlier detection based on adaptive feature weighted clustering (SSOD-AFW)).

Input: Dataset X; the label information of some objects; the number of clusters c; parameters m > 1, K ∈ (0, 1), q > 1, ε > 0; and the number of outliers M.

Output: M suspicious outliers.

(1) Calculate the parameter β according to (6); randomly initialize the matrix U^(0) = (u_{ik}^(0))_{n×c} and initialize all elements of V^(0) = (v_{jk}^(0))_{N×c} as 1/N. Set the iteration counter t = 0.
(2) Compute the matrix of cluster prototypes O^(t+1) = (o_{kj}^(t+1))_{c×N} according to (17).
(3) Update the feature weight matrix V^(t+1) = (v_{jk}^(t+1))_{N×c} by (15).
(4) Update the feature weighted distance matrix D^(t+1) = (d_{ik}^(t+1))_{n×c} by (7).
(5) Update the parameters α_{ik}^(t+1) by (20).
(6) Update the membership degree matrix U^(t+1) = (u_{ik}^(t+1))_{n×c} according to (19).
(7) If ‖U^(t+1) − U^(t)‖ < ε, go to step (8); otherwise set t = t + 1 and repeat steps (2) to (7).
(8) Compute the outlying degree of each object and sort the OD values in ascending order. Finally, output the top M outliers with the smallest outlying degrees.

Computational complexity analysis: step (2) needs O(cNn) operations to compute the c cluster prototypes. The computational complexity of computing the weights of the N features in step (3) is O(cNn). Step (4) requires O(cNn) operations to compute the weighted distances of the n objects to the c cluster prototypes. Step (5) needs O(cn) operations to compute the parameters α_{ik} of the n objects with respect to the c cluster prototypes. Moreover, step (6) needs O(cn) operations to calculate the memberships of the n objects to the c clusters. Therefore, the overall computational complexity per iteration is O(cNn), the same as that of the classical FCM algorithm.
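Putting steps (1)–(8) together, the following is a minimal NumPy rendition of Algorithm 1 under the update rules (7), (15), (17), (19), and (20); the function name, random seed, iteration cap, and numerical guard are our own choices, not the authors' reference implementation:

```python
import numpy as np

def ssod_afw(X, A, c, m=2.1, K=0.65, q=1.53, eps=1e-5, max_iter=100):
    """Sketch of Algorithm 1 (SSOD-AFW).
    X: (n, N) data; A: (n, c) semisupervised indicator matrix.
    Returns the membership matrix U; row sums give OD values."""
    n, N = X.shape
    beta = np.sum((X - X.mean(axis=0)) ** 2) / n           # eq. (6)
    rng = np.random.default_rng(0)
    U = rng.random((n, c))                                  # step (1)
    V = np.full((N, c), 1.0 / N)
    for _ in range(max_iter):
        Um = U ** m
        # step (2): cluster prototypes, eq. (17)
        O = (Um.T @ X) / Um.sum(axis=0)[:, None]
        # step (3): feature weights, eq. (15)
        for k in range(c):
            s = Um[:, k] @ (X - O[k]) ** 2                  # (N,)
            inv = (1.0 / np.maximum(s, 1e-12)) ** (1.0 / (q - 1))
            V[:, k] = inv / inv.sum()
        # step (4): squared weighted distances, eq. (7)
        D2 = np.empty((n, c))
        for k in range(c):
            D2[:, k] = (X - O[k]) ** 2 @ (V[:, k] ** q)
        # steps (5)-(6): alpha by eq. (20), memberships by eq. (19)
        alpha = K * D2
        U_new = np.exp(-(m * np.sqrt(c) / beta) * (D2 + alpha * A))
        # step (7): convergence test
        if np.linalg.norm(U_new - U) < eps:
            U = U_new
            break
        U = U_new
    return U
```

Step (8) then amounts to applying the top_m_outliers helper sketched after (21) to the returned U.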

3.4. Proof of Convergence. In this section, we discuss the convergence of the proposed SSOD-AFW algorithm. To prove the convergence of the objective function J_SSOD-AFW in (8) under the iteration of V, O, and U with formulas (15), (17), and (19), it suffices to prove that J_SSOD-AFW is monotonically decreasing and bounded after a finite number of iterations. Next, Lemmas 2, 3, and 4 verify the monotonically decreasing property of J_SSOD-AFW with respect to V, O, and U, respectively, and Lemma 5 establishes the boundedness of J_SSOD-AFW.

Lemma 2. Objective function J_SSOD-AFW in (8) is nonincreasing under updating U = (u_{ik})_{n×c} with formula (19).

Proof. Since V and O are fixed when updating U by (19), the objective function J_SSOD-AFW can be regarded as a function of U alone, denoted J_SSOD-AFW(U). According to the Lagrangian multiplier technique, U* computed via (19) is a stationary point of J_SSOD-AFW(U). On the other hand, if the Hessian matrix ∇²J_SSOD-AFW(U) is proved to be positive definite at U*, then J_SSOD-AFW(U) attains a local minimum at U*. The Hessian matrix ∇²J_SSOD-AFW(U*) is expressed as

\[ \nabla^2 J_{\mathrm{SSOD\text{-}AFW}}(U^*) = \begin{bmatrix} \dfrac{\partial^2 J_{\mathrm{SSOD\text{-}AFW}}(U^*)}{\partial u_{11} \partial u_{11}} & \cdots & \dfrac{\partial^2 J_{\mathrm{SSOD\text{-}AFW}}(U^*)}{\partial u_{11} \partial u_{nc}} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial^2 J_{\mathrm{SSOD\text{-}AFW}}(U^*)}{\partial u_{nc} \partial u_{11}} & \cdots & \dfrac{\partial^2 J_{\mathrm{SSOD\text{-}AFW}}(U^*)}{\partial u_{nc} \partial u_{nc}} \end{bmatrix}. \tag{22} \]

nabla2119869SSOD-AFW(Ulowast) is a diagonal matrix and its diagonal ele-ment is 1205972119869SSOD-AFW (Ulowast)120597119906119894119896120597119906119894119896 = 120573radic119888119906119898minus2119894119896 (23)

Since u_{ik} > 0, the Hessian matrix ∇²J_SSOD-AFW(U*) is positive definite. Having proved that U* is a stationary point of J_SSOD-AFW(U) and that ∇²J_SSOD-AFW(U*) is positive definite, we conclude that U* is a local minimum of J_SSOD-AFW(U). Then we have J_SSOD-AFW(U^(t)) ≥ J_SSOD-AFW(U*) = J_SSOD-AFW(U^(t+1)), where U^(t) is the membership matrix after the tth iteration of (19) and U* = U^(t+1) is the one after the (t+1)th iteration. Therefore, objective function J_SSOD-AFW in (8) is nonincreasing under updating U = (u_{ik})_{n×c} using formula (19).

Lemma 3. Objective function J_SSOD-AFW in (8) is nonincreasing under updating V = (v_{jk})_{N×c} using (15).

Proof. Similar to Lemma 2, when U and O are fixed, we just need to prove that the Hessian matrix of the Lagrangian of J_SSOD-AFW(V) at V* is positive definite, where V* is computed by (15). The Hessian matrix is denoted H(V) = (h_{jk,lm}(V)), whose elements are expressed as follows:

\[ h_{jk,lm}(V) = \frac{\partial^2 L_1(V)}{\partial v_{jk} \partial v_{lm}} = \begin{cases} q (q-1) v_{jk}^{q-2} \sum_{i=1}^{n} u_{ik}^{m} \left( x_{ij} - o_{kj} \right)^2, & \text{if } j = l,\ k = m, \\ 0, & \text{otherwise}. \end{cases} \tag{24} \]

Since q > 1 and v_{jk} > 0, the diagonal entries of this diagonal matrix are evidently positive. Therefore, the Hessian matrix H(V) is positive definite, and J_SSOD-AFW(V) attains a local minimum at V* computed by (15). This completes the proof.

Lemma 4. Objective function J_SSOD-AFW in (8) is nonincreasing when O = (o_{kj})_{c×N} is updated using (17).

The proof of Lemma 4 is similar to that of Lemma 2.

Lemma 5. Objective function J_SSOD-AFW in (8) is bounded; that is, there exists a constant B such that J_SSOD-AFW < B.

Proof. For all i ∈ {1, ..., n} and all k ∈ {1, ..., c}, we have 0 < u_{ik} ≤ 1. Thus u_{ik}^m log u_{ik}^m − u_{ik}^m is monotonically decreasing with respect to u_{ik}, and hence −1 ≤ u_{ik}^m log u_{ik}^m − u_{ik}^m < 0. Apparently, the first term and the third term of the objective function J_SSOD-AFW are also bounded. In consequence, objective function J_SSOD-AFW in (8) is bounded.

Theorem 6. The objective function J_SSOD-AFW converges to a local minimum under updating V, O, and U using formulas (15), (17), and (19), subject to the constraints in (9).

Proof. Lemmas 2, 3, and 4 verify that the objective function in (8) is nonincreasing under iterations according to (15), (17), and (19), and Lemma 5 shows that J_SSOD-AFW has a finite bound. Though the parameter α_{ik} is updated during the iteration process, it is a constant in each subproblem solved with the Lagrangian multiplier technique, so α_{ik} does not affect the convergence of the SSOD-AFW algorithm. Combining the above conclusions, J_SSOD-AFW is sure to converge to a local minimum through the iterations of V, O, and U by (15), (17), and (19).

4. Experiments and Analysis

Comprehensive experiments and analysis on a synthetic dataset and several real-world datasets are conducted to show the effectiveness and superiority of the proposed SSOD-AFW. We compare the proposed algorithm with two state-of-the-art unsupervised outlier detection algorithms, LOF [15] and KNN [13], one supervised method, SVDD [32], and one semisupervised method, EODSP [23].

4.1. Evaluation Metrics. For numerical performance evaluation of outlier detection algorithms, three metrics, namely, accuracy [7], AUC [33], and rank-power [16], are used in this paper.

Let M be the number of true outliers that a dataset X contains, and let S denote the number of true outliers detected by an algorithm. In the experiments, the top M most suspicious instances are detected. Then the accuracy is given by

\[ \text{Accuracy} = \frac{S}{M}. \tag{25} \]

The receiver operating characteristic (ROC) curve represents the trade-off relationship between the detection rate and the false alarm rate. In general, the area under the ROC curve (AUC) is used to measure the performance of an outlier detection method, and the AUC value of an ideal detector is close to one.

For a given outlier detection algorithm, if the true outliers occupy the top positions relative to the nonoutliers among the M suspicious instances, then the rank-power (RP) of the algorithm is said to be high. If S is the number of true outliers found within the top M instances and R_i denotes the rank of the ith true outlier, then the rank-power (RP) is given by

\[ \mathrm{RP} = \frac{S (S + 1)}{2 \sum_{i=1}^{S} R_i}. \tag{26} \]

RP reaches its maximum value of 1 when all S true outliers occupy the top S positions. A larger value of RP implies better performance of an algorithm.
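For instance (a minimal sketch of (25) and (26); the helper name and the convention that ranks start at 1 are our own), accuracy and rank-power can be computed from the ranked candidate list:

```python
def accuracy_and_rank_power(ranked_idx, true_outliers, M):
    """ranked_idx: object indices sorted by ascending outlying degree;
    true_outliers: set of ground-truth outlier indices."""
    top = ranked_idx[:M]
    ranks = [r + 1 for r, i in enumerate(top) if i in true_outliers]
    S = len(ranks)
    acc = S / M                                         # eq. (25)
    rp = S * (S + 1) / (2 * sum(ranks)) if S else 0.0   # eq. (26)
    return acc, rp
```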

4.2. Experiments on a Synthetic Dataset. A two-dimensional synthetic dataset with two cluster patterns is generated from Gaussian distributions to intuitively compare the outlier detection results of the proposed method against the other four algorithms mentioned above. The mean vectors of the two clusters are μ_1 = (7.5, 9)^T and μ_2 = (1, 3)^T, respectively, and their covariance matrices are Σ_1 = Σ_2 = [[1, 0], [0, 2]]. As Figure 2(a) shows, a total of 199 samples are contained in the synthetic dataset, of which 183 are normal samples (within the two clusters) and 16 are outliers (scattered between the two clusters); 13 normal objects are labeled (marked "×") and 5 outliers are labeled (marked "∗"), while the remaining samples are unlabeled (marked "·"). Figures 2(b)–2(f), respectively, illustrate the outlier detection results on the synthetic dataset obtained by LOF, KNN, SVDD, EODSP, and SSOD-AFW, where the red symbols "∘" denote the detected suspicious outliers. Here the value of the parameter k (size of the neighborhood) in LOF and KNN is set to 3. A Gaussian kernel function is chosen in SVDD, and we set the bandwidth σ = 0.3 and the trade-off coefficient C = 0.45. Besides, the Euclidean distance threshold T in EODSP is set to 0.1, and the percentage of the negative set is set to l = 10%. The parameter settings of the proposed algorithm are m = 2.3, K = 0.85, and q = 2.23. Except for SVDD, the top 16 objects with the highest outlying scores are taken as the detection results of the other four algorithms.
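A dataset of this shape can be generated as follows (our own sketch; the per-cluster sample counts and the uniform box for the scattered outliers are assumptions, since the paper only states the totals):

```python
import numpy as np

rng = np.random.default_rng(42)
# two Gaussian clusters: mu1 = (7.5, 9), mu2 = (1, 3), cov = diag(1, 2)
cov = np.array([[1.0, 0.0], [0.0, 2.0]])
cluster1 = rng.multivariate_normal([7.5, 9.0], cov, size=92)
cluster2 = rng.multivariate_normal([1.0, 3.0], cov, size=91)
# 16 outliers scattered between the two clusters (assumed uniform box)
outliers = rng.uniform(low=[-1.0, 0.0], high=[8.0, 11.0], size=(16, 2))
X = np.vstack([cluster1, cluster2, outliers])   # 199 x 2 dataset
```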

In Figure 2, it is noticeable that the unsupervised methods LOF and KNN, as well as the supervised SVDD, fail to completely detect all of the 5 labeled outliers, and some normal points within the clusters are badly misjudged as outliers. In contrast, the semisupervised EODSP algorithm and the proposed SSOD-AFW algorithm successfully detect all of the 5 labeled outliers. However, EODSP does not completely detect all the unlabeled true outliers, and several true normal samples are improperly identified as outliers. It can be concluded from Figure 2 that the proposed algorithm finds all the true outliers in the synthetic dataset and excludes the normal samples, while the other methods do not.

Figure 3 numerically presents the performance evaluation of outlier detection using LOF, KNN, SVDD, EODSP, and SSOD-AFW on the synthetic dataset. From Figure 3, we see that the accuracy, AUC, and RP values of the proposed algorithm all reach 1, outperforming the other methods.

Furthermore, during the experimental process shown in Figure 3, the feature weights of the synthetic dataset learned by formula (15) in our method are v_11 = 0.6727, v_12 = 0.5985, v_21 = 0.3273, and v_22 = 0.4015. To demonstrate the effectiveness of the feature weights in the proposed SSOD-AFW algorithm, a comparative analysis of the weighted and the nonweighted versions is conducted on the synthetic dataset. For the nonweighted variant of the proposed algorithm, the outlier detection result on the synthetic dataset is presented in Figure 4. As can be observed from Figure 4, the nonweighted SSOD-AFW ends up tagging 15 true outlying samples and 1 normal sample as outliers, with one unlabeled true outlier missed.

4.3. Experiments on Real-World Datasets

4.3.1. Introduction of the Datasets. For further verification of the effectiveness of the proposed algorithm, five real-world datasets from the UCI Machine Learning Repository [34] (i.e., Iris, Abalone, Wine, Ecoli, and Breast Cancer Wisconsin (WDBC)) are employed to test the performance of the proposed algorithm against LOF, KNN, SVDD, and EODSP. As mentioned in Aggarwal and Yu [35], one way to test the performance of an outlier detection algorithm is to run it on a dataset and calculate the percentage of detected points belonging to the rare classes. So, for each of the five datasets, a small number of samples from the same class are randomly selected as outlying (target) objects. For instance, the original Iris dataset comprises 150 objects, with 50 objects in each of its three classes.

[Figure 2: Outlier detection results of different algorithms on the two-dimensional synthetic dataset: (a) original synthetic dataset; (b) LOF; (c) KNN; (d) SVDD; (e) EODSP; (f) SSOD-AFW. Legend: unlabeled samples, labeled normal samples, labeled outliers, and detected outliers.]

[Figure 3: Performance comparison of LOF, KNN, SVDD, EODSP, and SSOD-AFW on the synthetic dataset: (a) accuracy; (b) AUC; (c) RP. SSOD-AFW attains 1.0000 on all three metrics; the remaining bar values are 0.8750, 0.8125, 0.6000, and 0.5000 for accuracy; 0.5748, 0.8948, 0.9898, and 0.9952 for AUC; and 0.9130, 0.8835, 0.7627, and 0.4138 for RP.]

We randomly selected 26 objects from the class "Iris-virginica" as target outliers, and all objects in the other two classes are considered normal objects. The other four datasets are preprocessed similarly; a more detailed description of the five real-world datasets is given in Table 1.

4.3.2. Experimental Result Analysis. We compare the outlier detection performance of the proposed algorithm with LOF, KNN, SVDD, and EODSP on the real-world datasets. Each method has its own parameters, and the detailed parameter settings of each algorithm are as follows. The parameters of the proposed algorithm are m = 2.1, K = 0.65, and q = 1.53 for all five datasets.

Table 1: Description of the real-world datasets.

Dataset | Instances | Features | Outlying classes          | Outliers (percent) | Clusters | Prior information
Iris    | 126       | 4        | "Virginica"               | 26 (20.63%)        | 2        | 10 labeled normal samples, 4 labeled outliers
Abalone | 4177      | 8        | "1"-"4", "16"-"27", "29"  | 335 (8.02%)        | 11       | 11 labeled normal samples, 18 labeled outliers
Wine    | 130       | 13       | "3"                       | 11 (8.46%)         | 2        | 9 labeled normal samples, 4 labeled outliers
Ecoli   | 336       | 9        | "omL", "imL", "imS"       | 9 (2.68%)          | 5        | 11 labeled normal samples, 3 labeled outliers
WDBC    | 387       | 30       | "Malignant"               | 30 (7.75%)         | 1        | 10 labeled normal samples, 8 labeled outliers

[Figure 4: Outlier detection result of the nonweighted SSOD-AFW on the synthetic dataset.]

The strategy of parameter selection for SSOD-AFW will be discussed in the later subsection on parameter analysis. For the other algorithms, the parameters are set exactly as mentioned in their references. It is well known that LOF and KNN depend strongly on the neighborhood parameter k. In this paper, we set k = 5 for the Iris and WDBC datasets, k = 3 for the Abalone dataset, and k = 10 for the Wine and Ecoli datasets. For the SVDD method, a Gaussian kernel function is employed, with bandwidth σ = 0.45 and C = 0.5 on all five real-world datasets. In EODSP, the Euclidean distance threshold T is set to 0.1 and the percentage of the negative set l is set to 30% for the Iris and Abalone datasets, while T = 0.5 and l = 30% for the Ecoli, Wine, and WDBC datasets. Since we randomly select outliers from the target classes of each dataset, each experiment is repeated 10 times with the same number of different outliers, and the average accuracy, AUC, and RP are calculated as the performance criteria of the various detection methods.

Figure 5 illustrates the outlier detection results of the SSOD-AFW algorithm against LOF, KNN, SVDD, and EODSP, respectively, on the five real-world datasets. As can be seen from Figure 5, the proposed algorithm can accurately identify outliers according to the cluster structure of a dataset with the guidance of the label knowledge. It shows distinct superiority over the other unsupervised (LOF, KNN), semisupervised (EODSP), and supervised (SVDD) methods. In particular, the outlier detection accuracy of SSOD-AFW in Figure 5(a) is significantly higher than those of the others, especially on the Iris and Wine datasets. One can see from Figure 5(b) that the AUC values of our method are higher than those of the others on all datasets except WDBC. In terms of RP, SSOD-AFW performs better than the other four algorithms on the Iris and Wine datasets, whereas it is slightly poorer than SVDD on Abalone, poorer than LOF on Ecoli, and poorer than KNN on WDBC, as seen in Figure 5(c).

It is worth mentioning that the experiment of the proposed algorithm on WDBC involves a one-class clustering problem. Although a one-class clustering task is generally meaningless, one-class clustering-based outlier detection is especially meaningful and feasible in our proposal, because our approach does not require that the membership degrees sum to 1. This is one of the powerful and important characteristics of the proposed algorithm.

4.3.3. The Influence of the Proportion of Labeled Data on Outlier Detection. In this subsection, we investigate the influence of the proportion of labeled samples on the accuracy of our methodology. Two typical situations are considered and tested. The first is that the proportion of labeled outliers increases while the number of labeled normal objects is fixed at a certain constant; the other is that the percentage of labeled normal samples varies while the quantity of labeled outliers is fixed. So two groups of experiments are designed to compare the accuracies of the proposed algorithm against EODSP under different percentages of labeled outliers and labeled normal samples, respectively.

[Figure 5: Performance comparison ((a) accuracy; (b) AUC; (c) RP) of LOF, KNN, SVDD, EODSP, and SSOD-AFW on the real-world datasets (Iris, Abalone, Wine, Ecoli, WDBC).]

The experiments are conducted on the datasets Iris, Abalone, Wine, Ecoli, and WDBC. In the two experiments, the percentage of labeled outliers or labeled normal samples ranges from 0% to 40%, while the number of the other kind of labeled objects is fixed. We randomly select a certain number of labeled outliers or normal samples from each dataset; each experiment is repeated 10 times, and the average accuracies of SSOD-AFW and EODSP are computed.

Figure 6 shows the results of the first group of experiments, where the percentage of labeled outliers varies from 0% to 40%. One can see from Figure 6 that the accuracies of the two semisupervised algorithms roughly increase as the proportion of labeled outliers becomes larger. This strongly supports the idea that semisupervised outlier detection algorithms can improve the accuracy of outlier detection by using prior information. Furthermore, SSOD-AFW achieves a better accuracy than the EODSP algorithm for the same proportion of labeled outliers on all five datasets; especially for Wine, the accuracy of SSOD-AFW is 40% higher than that of EODSP. EODSP addresses the problem of detecting outliers with only a few labeled outliers as training data; the labeled normal instances are extracted according to the maximum entropy principle, where the entropy is computed only from the distance between each testing sample and all the labeled outliers. That makes EODSP less flexible than our proposed method, due to this information deficiency.

Figure 7 illustrates the accuracy comparison of the proposed algorithm and EODSP when the proportion of labeled normal samples increases from 0% to 40% and the percentage of labeled outliers is fixed. Note that our method obtains a better accuracy than EODSP on all five real-world datasets, and the accuracy of the proposed algorithm becomes larger as the percentage of labeled normal samples increases. As mentioned, EODSP addresses semisupervised outlier detection with only a few labeled outliers in the initial dataset and does not consider any labeled normal objects. Therefore, the accuracy of the EODSP algorithm remains unchanged over the various proportions of labeled normal objects and always equals its accuracy value at 0% labeled normal samples.

4.3.4. Parameter Analysis. The parameters m, K, and q are important in our proposed algorithm and affect the performance of SSOD-AFW. In this subsection, the influence of each parameter on the outlier detection accuracy is studied.

The parameter m is the fuzzification coefficient. Figure 8(a) analyzes the relationship between the outlier detection accuracy of the proposed algorithm and the parameter m, with m varying from 1.5 to 5.0. The results imply that the highest accuracy is achieved when m ranges in [2, 4], so it is reasonable that the value of m in the experiments shown in Figure 5 was set to 2.1. The parameter K ∈ (0, 1) controls the importance of the label information in the outlier detection result. Outlier detection accuracies are tested by varying K from 0.1 to 0.9, as shown in Figure 8(b); the overall tendency is that the accuracies become larger as K increases, and the best results of the proposed algorithm occur and remain stable when K ≥ 0.7. Finally, from Figure 8(c) we conclude that the feature weight index q has a small influence on the accuracy of SSOD-AFW when the other parameters keep the same settings, so the proposed algorithm is not sensitive to the parameter q. In general, the parameter q is suggested to be chosen as a constant from (1, 4].

4.3.5. Execution Time Analysis. Figure 9 shows the average running times of the proposed algorithm and the other algorithms on the five real-world datasets. The experimental environment is a Windows XP system, MATLAB 7.1 platform, 3 GHz CPU, and 2 GB RAM. Because the volume of the Abalone dataset is far greater than those of the other four datasets, the running times on the various datasets differ markedly; to facilitate the display in Figure 9, the horizontal coordinate axis is translated downward a short distance. The results indicate that the proposed algorithm is more time-saving than the other four typical outlier detection algorithms, except for KNN on the Wine dataset.

[Figure 6: Accuracy of EODSP and SSOD-AFW with different percentages of labeled outliers (0%–40%) on the real-world datasets: (a) Iris; (b) Abalone; (c) Wine; (d) Ecoli; (e) WDBC.]

On the whole, the execution time of SSOD-AFW is comparable to that of KNN and less than those of the other algorithms on most of the datasets.

5. Conclusions

In order to detect outliers more precisely, a semisupervised outlier detection algorithm based on adaptive feature weighted clustering, called SSOD-AFW, is proposed in this paper. Distinct weights of each feature with respect to different clusters are considered and obtained by adaptive iteration, so that the negative effects of irrelevant features on outlier detection are weakened. Moreover, the proposed method makes full use of the prior knowledge contained in datasets and detects outliers by virtue of the cluster structure.

Figure 7: Accuracy analysis of algorithms EODSP and SSOD-AFW with different percent of labeled normal samples on the real-world datasets. Panels: (a) Iris, (b) Abalone, (c) Wine, (d) Ecoli, (e) WDBC; each panel plots accuracy (%) against the proportion of labeled normal samples (0–40%).

It is verified by a series of experiments that the proposed SSOD-AFW algorithm is superior to other typical unsupervised, semisupervised, and supervised algorithms in both outlier detection precision and running speed.

In this paper, we present a new semisupervised outlier detection method which utilizes the labels of a small number of objects. However, our method assumes that the labels of objects are reliable and does not consider mislabel punishment in the new objective function. Therefore, a robust version of the proposed method dealing with noisy or imperfect labels of objects deserves further study.

Figure 8: Outlier detection accuracy of the proposed algorithm under various parameters on the real-world datasets (Iris, Abalone, Wine, Ecoli, WDBC). Panels: (a) parameter m (1.5–5.0), (b) parameter K (0.1–0.9), (c) parameter q (1.5–4.0); each panel plots accuracy (%).

Figure 9: Execution time comparison of different algorithms (LOF, KNN, SVDD, EODSP, SSOD-AFW) on the real-world datasets; running time in seconds (0–50) per dataset (Iris, Abalone, Wine, Ecoli, WDBC).

Moreover, since only one typical dissimilarity measure, namely, the Euclidean distance, is discussed in our method, the SSOD-AFW algorithm is limited to outlier detection for numerical data. Future research aims at extending our method to mixed-attribute data in more real-life applications, such as fault diagnosis in industrial processes or network anomaly detection.
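As an illustration of that extension, and not an algorithm from this paper, a Gower-style dissimilarity could replace the feature weighted Euclidean distance of (7): numerical features keep their squared deviation from the prototype, while categorical features, encoded as integer codes, contribute a 0/1 mismatch. How cluster prototypes should be maintained for categorical features (e.g., by modes) is left open. A minimal sketch:

```python
import numpy as np

def mixed_weighted_distance(x_i, o_k, v_k, q, is_cat):
    # Hypothetical mixed-attribute variant of the weighted distance d_ik:
    # squared deviation on numerical features (as in Eq. (7)), 0/1 mismatch
    # on categorical features stored as integer codes. Illustrative only.
    num = (x_i - o_k) ** 2                 # numerical contribution
    cat = (x_i != o_k).astype(float)       # categorical mismatch indicator
    per_feature = np.where(is_cat, cat, num)
    return np.sqrt(np.sum((v_k ** q) * per_feature))

# Example: third feature categorical, the others numerical.
x_i = np.array([0.3, 1.2, 2.0])            # object (category code 2)
o_k = np.array([0.5, 1.0, 3.0])            # prototype (category code 3)
v_k = np.array([0.5, 0.3, 0.2])            # feature weights, summing to 1
print(mixed_weighted_distance(x_i, o_k, v_k, q=1.53,
                              is_cat=np.array([False, False, True])))
```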

Competing Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (Grant no. 11471001).

References

[1] J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, Elsevier, 2011.
[2] D. M. Hawkins, Identification of Outliers, Chapman & Hall, London, UK, 1980.
[3] V. Barnett and T. Lewis, Outliers in Statistical Data, John Wiley & Sons, Chichester, UK, 1994.
[4] V. J. Hodge and J. Austin, "A survey of outlier detection methodologies," Artificial Intelligence Review, vol. 22, no. 2, pp. 85–126, 2004.
[5] B. Sheng, Q. Li, W. Mao, and W. Jin, "Outlier detection in sensor networks," in Proceedings of the 8th ACM International Symposium on Mobile Ad Hoc Networking and Computing (MobiHoc '07), pp. 219–228, Montreal, Canada, September 2007.
[6] A. P. James and S. Dimitrijev, "Inter-image outliers and their application to image classification," Pattern Recognition, vol. 43, no. 12, pp. 4101–4112, 2010.
[7] J. Huang, Q. Zhu, L. Yang, and J. Feng, "A non-parameter outlier detection algorithm based on Natural Neighbor," Knowledge-Based Systems, vol. 92, pp. 71–77, 2016.
[8] J. M. Shepherd and S. J. Burian, "Detection of urban-induced rainfall anomalies in a major coastal city," Earth Interactions, vol. 7, no. 4, pp. 1–17, 2003.
[9] O. Alan and C. Catal, "Thresholds based outlier detection approach for mining class outliers: an empirical case study on software measurement datasets," Expert Systems with Applications, vol. 38, no. 4, pp. 3440–3445, 2011.
[10] B. Schölkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson, "Estimating the support of a high-dimensional distribution," Neural Computation, vol. 13, no. 7, pp. 1443–1471, 2001.
[11] D. Tax, A. Ypma, and R. Duin, "Support vector data description applied to machine vibration analysis," in Proceedings of the 5th Annual Conference of the Advanced School for Computing and Imaging, pp. 398–405, Heijen, The Netherlands, June 1999.
[12] D. M. J. Tax and R. P. W. Duin, "Support vector data description," Machine Learning, vol. 54, no. 1, pp. 45–66, 2004.
[13] S. Ramaswamy, R. Rastogi, and K. Shim, "Efficient algorithms for mining outliers from large data sets," SIGMOD Record (ACM Special Interest Group on Management of Data), vol. 29, no. 2, pp. 427–438, 2000.
[14] E. M. Knox and R. T. Ng, "Algorithms for mining distance-based outliers in large datasets," in Proceedings of the International Conference on Very Large Data Bases, pp. 392–403, New York, NY, USA, 1998.
[15] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander, "LOF: identifying density-based local outliers," ACM SIGMOD Record, vol. 29, no. 2, pp. 93–104, 2000.
[16] J. Ha, S. Seok, and J.-S. Lee, "A precise ranking method for outlier detection," Information Sciences, vol. 324, pp. 88–107, 2015.
[17] R. N. Dave, "Characterization and detection of noise in clustering," Pattern Recognition Letters, vol. 12, no. 11, pp. 657–664, 1991.
[18] R. Smith, A. Bivens, M. Embrechts, C. Palagiri, and B. Szymanski, "Clustering approaches for anomaly based intrusion detection," in Proceedings of Intelligent Engineering Systems Through Artificial Neural Networks, pp. 579–584, 2002.
[19] Y. Shi and L. Zhang, "COID: a cluster-outlier iterative detection approach to multi-dimensional data analysis," Knowledge and Information Systems, vol. 28, no. 3, pp. 709–733, 2011.
[20] J. Zhao, K. Liu, W. Wang, and Y. Liu, "Adaptive fuzzy clustering based anomaly data detection in energy system of steel industry," Information Sciences, vol. 259, pp. 335–345, 2014.
[21] F. Angiulli and C. Pizzuti, "Fast outlier detection in high dimensional spaces," in Proceedings of the 6th European Conference on Principles of Data Mining and Knowledge Discovery (PKDD '02), vol. 2, pp. 15–26, August 2002.
[22] M. Radovanović, A. Nanopoulos, and M. Ivanović, "Reverse nearest neighbors in unsupervised distance-based outlier detection," IEEE Transactions on Knowledge and Data Engineering, vol. 27, no. 5, pp. 1369–1382, 2015.
[23] A. Daneshpazhouh and A. Sami, "Entropy-based outlier detection using semi-supervised approach with few positive examples," Pattern Recognition Letters, vol. 49, pp. 77–84, 2014.
[24] J. Gao, H. Cheng, and P.-N. Tan, "Semi-supervised outlier detection," in Proceedings of the ACM Symposium on Applied Computing, pp. 635–636, ACM, Dijon, France, April 2006.
[25] Z. Xue, Y. Shang, and A. Feng, "Semi-supervised outlier detection based on fuzzy rough C-means clustering," Mathematics and Computers in Simulation, vol. 80, no. 9, pp. 1911–1921, 2010.
[26] J. Z. Huang, M. K. Ng, H. Rong, and Z. Li, "Automated variable weighting in k-means type clustering," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 5, pp. 657–668, 2005.
[27] H. Frigui and O. Nasraoui, "Unsupervised learning of prototypes and attribute weights," Pattern Recognition, vol. 37, no. 3, pp. 567–581, 2004.
[28] J. Zhou, L. Chen, C. P. Chen, Y. Zhang, and H. Li, "Fuzzy clustering with the entropy of attribute weights," Neurocomputing, vol. 198, pp. 125–134, 2016.
[29] M. Hassan, A. Chaudhry, A. Khan, and M. A. Iftikhar, "Robust information gain based fuzzy c-means clustering and classification of carotid artery ultrasound images," Computer Methods and Programs in Biomedicine, vol. 113, no. 2, pp. 593–609, 2014.
[30] R. Krishnapuram and J. M. Keller, "A possibilistic approach to clustering," IEEE Transactions on Fuzzy Systems, vol. 1, no. 2, pp. 98–110, 1993.
[31] M.-S. Yang and K.-L. Wu, "Unsupervised possibilistic clustering," Pattern Recognition, vol. 39, no. 1, pp. 5–21, 2006.
[32] S. M. Guo, L. C. Chen, and J. S. H. Tsai, "A boundary method for outlier detection based on support vector domain description," Pattern Recognition, vol. 42, no. 1, pp. 77–83, 2009.
[33] T. Fawcett, "ROC graphs: notes and practical considerations for researchers," Machine Learning, vol. 31, no. 1, pp. 1–38, 2004.
[34] C. Blake and C. J. Merz, "UCI repository of machine learning databases," 1998.
[35] C. C. Aggarwal and P. S. Yu, "Outlier detection for high dimensional data," in Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD '01), Santa Barbara, Calif, USA, May 2001.




LOFKNNSVDD

EODSPSSOD-AFW

Figure 9 Execution time comparison of different algorithms on thereal-world datasets

distance is discussed in our method the SSOD-AFW algo-rithm is limited to outlier detection for numerical data Thefuture research aims at extending our method to mixed-attribute data in more real-life applications such as faultdiagnosis in industrial process or network anomaly detection

Competing Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

This work was supported by the National Natural ScienceFoundation of China (Grant no 11471001)

References

[1] J Han M Kamber and J Pei Data Mining Concepts andTechniques Elsevier 2011

[2] D M Hawkins Identification of Outliers Chapman amp HallLondon UK 1980

[3] V Bamnett and T LewisOutliers in Statistical Data JohnWileyamp Sons Chichester UK 1994

[4] V J Hodge and J Austin ldquoA survey of outlier detectionmethodologiesrdquo Artificial Intelligence Review vol 22 no 2 pp85ndash126 2004

[5] B Sheng Q Li W Mao and W Jin ldquoOutlier detection insensor networksrdquo in Proceedings of the 8th ACM InternationalSymposium on Mobile Ad Hoc Networking and Computing(MobiHoc rsquo07) pp 219ndash228Montreal Canada September 2007

[6] A P James and S Dimitrijev ldquoInter-image outliers and theirapplication to image classificationrdquo Pattern Recognition vol 43no 12 pp 4101ndash4112 2010

[7] J Huang Q Zhu L Yang and J Feng ldquoA non-parameter outlierdetection algorithm based on Natural Neighborrdquo Knowledge-Based Systems vol 92 pp 71ndash77 2016

[8] J M Shepherd and S J Burian ldquoDetection of urban-inducedrainfall anomalies in a major coastal cityrdquo Earth Interactionsvol 7 no 4 pp 1ndash17 2003

[9] O Alan and C Catal ldquoThresholds based outlier detectionapproach for mining class outliers an empirical case study onsoftware measurement datasetsrdquo Expert Systems with Applica-tions vol 38 no 4 pp 3440ndash3445 2011

[10] B Scholkopf J C Platt J Shawe-Taylor A J Smola and RC Williamson ldquoEstimating the support of a high-dimensionaldistributionrdquo Neural Computation vol 13 no 7 pp 1443ndash14712001

[11] D Tax A Ypma and R Duin ldquoSupport vector data descriptionapplied to machine vibration analysisrdquo in Proceedings of the 5thAnnual Conference of the Advanced School for Computing andImaging pp 398ndash405 Heijen The Netherlands June 1999

[12] DM J Tax andR PWDuin ldquoSupport vector data descriptionrdquoMachine Learning vol 54 no 1 pp 45ndash66 2004

[13] S Ramaswamy R Rastogi and K Shim ldquoEfficient algorithmsformining outliers from large data setsrdquo SIGMODRecord (ACMSpecial Interest Group onManagement of Data) vol 29 no 2 pp427ndash438 2000

14 Mathematical Problems in Engineering

[14] EM Knox and R T Ng ldquoAlgorithms formining distance basedoutliers in large datasetrdquo in Proceedings of the InternationalConference on Very Large Data Bases pp 392ndash403 CiteseerNew York NY USA 1998

[15] M M Breunig H-P Kriegel R T Ng and J Sander ldquoLOFidentifying density-based local outliersrdquoACMSIGMODRecordvol 29 no 2 pp 93ndash104 2000

[16] J Ha S Seok and J-S Lee ldquoA precise ranking method foroutlier detectionrdquo Information Sciences vol 324 pp 88ndash1072015

[17] R N Dave ldquoCharacterization and detection of noise in clus-teringrdquo Pattern Recognition Letters vol 12 no 11 pp 657ndash6641991

[18] R Smith A Bivens M Embrechts C Palagiri and B Szyman-ski ldquoClustering approaches for anomaly based intrusion detec-tionrdquo Proceedings of Intelligent Engineering Systems ThroughArtificial Neural Networks pp 579ndash584 2002

[19] Y Shi and L Zhang ldquoCOID a cluster-outlier iterative detectionapproach to multi-dimensional data analysisrdquo Knowledge andInformation Systems vol 28 no 3 pp 709ndash733 2011

[20] J Zhao K Liu W Wang and Y Liu ldquoAdaptive fuzzy clusteringbased anomaly data detection in energy system of steel indus-tryrdquo Information Sciences vol 259 pp 335ndash345 2014

[21] F Angiulli and C Pizzuti ldquoFast outlier detection in high dimen-sional spacesrdquo in Proceedings of the 6th European Conference onPrinciples of DataMining and Knowledge Discovery (PKDD rsquo02)vol 2 pp 15ndash26 August 2002

[22] M Radovanovic A Nanopoulos and M Ivanovic ldquoReversenearest neighbors in unsupervised distance-based outlier detec-tionrdquo IEEE Transactions on Knowledge and Data Engineeringvol 27 no 5 pp 1369ndash1382 2015

[23] A Daneshpazhouh and A Sami ldquoEntropy-based outlier detec-tion using semi-supervised approach with few positive exam-plesrdquo Pattern Recognition Letters vol 49 pp 77ndash84 2014

[24] J Gao H Cheng and P-N Tan ldquoSemi-supervised outlierdetectionrdquo in Proceedings of the ACM Symposium on AppliedComputing pp 635ndash636 ACM Dijon France April 2006

[25] Z Xue Y Shang and A Feng ldquoSemi-supervised outlier detec-tion based on fuzzy rough C-means clusteringrdquo Mathematicsand Computers in Simulation vol 80 no 9 pp 1911ndash1921 2010

[26] J Z Huang M K Ng H Rong and Z Li ldquoAutomated variableweighting in k-means type clusteringrdquo IEEE Transactions onPattern Analysis andMachine Intelligence vol 27 no 5 pp 657ndash668 2005

[27] H Friguiand and O Nasraou ldquoUnsupervised learning ofprototypes and attribute weightsrdquo Pattern Recognition vol 37no 3 pp 567ndash581 2004

[28] J Zhou L Chen C P Chen Y Zhang and H Li ldquoFuzzy clus-tering with the entropy of attribute weightsrdquo Neurocomputingvol 198 pp 125ndash134 2016

[29] M Hassan A Chaudhry A Khan and M A Iftikhar ldquoRobustinformation gain based fuzzy c-means clustering and classifi-cation of carotid artery ultrasound imagesrdquo Computer Methodsand Programs in Biomedicine vol 113 no 2 pp 593ndash609 2014

[30] R Krishnapuram and J M Keller ldquoA possibilistic approach toclusteringrdquo IEEE Transactions on Fuzzy Systems vol 1 no 2 pp98ndash110 1993

[31] M-S Yang and K-L Wu ldquoUnsupervised possibilistic cluster-ingrdquo Pattern Recognition vol 39 no 1 pp 5ndash21 2006

[32] SM Guo L C Chen and J S H Tsai ldquoA boundarymethod foroutlier detection based on support vector domain descriptionrdquoPattern Recognition vol 42 no 1 pp 77ndash83 2009

[33] T Fawcett ldquoRoc graphs notes and practical considerations forresearchersrdquoMachine Learning vol 31 no 1 pp 1ndash38 2004

[34] C Blake and C J Merz ldquoUCI repository of machine learningdatabasesrdquo 1998

[35] C C Aggarwal and P S Yu ldquoOutlier detection for highdimensional datardquo in Proceedings of the ACM SIGMOD Interna-tional Conference on Management of Data (SIGMOD rsquo01) SantaBarbara Calif USA May 2001

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

2. Possibilistic Clustering Algorithms

Let $X = \{x_1, x_2, \ldots, x_n\}$ be a given dataset of $n$ objects, where $x_i = (x_{i1}, x_{i2}, \ldots, x_{iN}) \in R^N$ is the $i$th object characterized by $N$ features. Suppose that the dataset is divided into $c$ clusters, and let $o_k$ denote the $k$th cluster prototype.

FCM is a well-known clustering algorithm [29], whose objective function is

$$J_{\mathrm{FCM}} = \sum_{i=1}^{n}\sum_{k=1}^{c} u_{ik}^{m}\,\|x_i - o_k\|^2, \tag{1}$$

$$\text{s.t.}\quad \sum_{k=1}^{c} u_{ik} = 1,\quad u_{ik}\in[0,1], \tag{2}$$

where $u_{ik}$ is the membership degree of the $i$th ($1\le i\le n$) object to the $k$th ($1\le k\le c$) cluster, $\|\cdot\|$ represents the $l_2$-norm of a vector, and $m>0$ is the fuzzification coefficient. Note that the constraint in (2) forces the memberships of each object to all clusters to sum to one. Therefore, FCM is sensitive to outliers, because outliers or noise points commonly lie far away from all cluster prototypes. For this reason, Krishnapuram and Keller [30] proposed a possibilistic c-means (PCM) clustering algorithm, which relaxes the constraint on the sum of memberships and minimizes the following objective function:

$$J_{\mathrm{PCM}} = \sum_{i=1}^{n}\sum_{k=1}^{c} u_{ik}^{m}\,\|x_i - o_k\|^2 + \sum_{k=1}^{c}\eta_k \sum_{i=1}^{n} (1-u_{ik})^m, \tag{3}$$

$$\text{s.t.}\quad u_{ik}\in[0,1], \tag{4}$$

where $\eta_k$ is a suitable positive number. In PCM, constraint (4) allows an outlier to hold a low membership to all clusters, so an outlier has a low impact on objective function (3). The membership information of each sample can thus be used naturally to interpret its outlying character: if a sample has a low membership to all clusters, it is likely to be an outlier.

Afterwards, another unsupervised possibilistic clustering algorithm (PCA) was proposed by Yang and Wu [31]; the objective function of PCA is described as

$$J_{\mathrm{PCA}} = \sum_{i=1}^{n}\sum_{k=1}^{c} u_{ik}^{m}\,\|x_i - o_k\|^2 + \frac{\beta}{m^2\sqrt{c}}\sum_{i=1}^{n}\sum_{k=1}^{c}\left(u_{ik}^{m}\log u_{ik}^{m} - u_{ik}^{m}\right),\quad \text{s.t. } u_{ik}\in[0,1], \tag{5}$$

where the parameter $\beta$ can be calculated from the sample covariance:

$$\beta = \frac{\sum_{i=1}^{n}\sum_{j=1}^{N}\left(x_{ij} - \sum_{l=1}^{n} x_{lj}/n\right)^2}{n}. \tag{6}$$
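For concreteness, a minimal NumPy sketch of (6) follows; the function name `sample_beta` and the assumption that the data are stored as an $n\times N$ array are ours, not the paper's.

```python
import numpy as np

def sample_beta(X):
    """Parameter beta of Eq. (6): the total squared deviation of every
    feature from its mean, divided by the number of objects n."""
    n = X.shape[0]
    return np.sum((X - X.mean(axis=0)) ** 2) / n
```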

3. Semisupervised Outlier Detection Framework Based on Feature Weighted Clustering

3.1. Model Formulation. In this section, we introduce prior knowledge into the possibilistic c-means clustering method to improve the performance of outlier detection. First, a small subset of samples in a given dataset $X = \{x_1, x_2, \ldots, x_n\}$ is labeled as normal or outlier objects; each labeled normal object carries the label of the class it belongs to. A semisupervised indicator matrix $\mathbf{A} = (a_{ik})_{n\times c}$ is constructed to describe the semisupervised information, and its entries are defined as follows:

(i) If an object $x_i$ is labeled as a normal point and it belongs to the $k$th cluster, then $a_{ik} = -1$, and for all $s = 1, 2, \ldots, c$, $s \neq k$, we let $a_{is} = 1$.

(ii) If $x_i$ is labeled as an outlier, then for all $s = 1, 2, \ldots, c$ we set $a_{is} = 1$.

(iii) If $x_i$ is unlabeled, then for all $s = 1, 2, \ldots, c$ it has $a_{is} = 0$.
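A small sketch of this construction is shown below; it assumes that the labels are supplied as a dictionary of normal objects (index to cluster) and a list of outlier indices, which is our own encoding rather than anything prescribed by the paper.

```python
import numpy as np

def build_indicator_matrix(n, c, normal_labels=None, outlier_idx=None):
    """Construct the semisupervised indicator matrix A of Section 3.1.

    normal_labels: dict mapping object index i -> cluster index k
                   for objects labeled as normal (rule (i)).
    outlier_idx:   iterable of indices labeled as outliers (rule (ii)).
    Unlabeled objects keep a zero row (rule (iii))."""
    A = np.zeros((n, c))
    for i, k in (normal_labels or {}).items():
        A[i, :] = 1.0   # a_is = 1 for every other cluster s != k
        A[i, k] = -1.0  # a_ik = -1 for the cluster x_i belongs to
    for i in (outlier_idx or []):
        A[i, :] = 1.0   # a_is = 1 for all clusters
    return A
```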

Usually, data contain a number of redundant features. The cluster structure of a given dataset is often confined to a subset of features rather than the entire feature set, and irrelevant features can only obscure the discovery of the cluster structure by a clustering algorithm; an intrinsic outlier is then easily neglected because the cluster structure is blurred. Figure 1 presents an example of a three-dimensional dataset. The dataset has two clusters ($c_1$ and $c_2$) and three features ($f_1$, $f_2$, and $f_3$). In the full feature space $(f_1, f_2, f_3)$, neither of the clusters is discovered (see Figure 1(a)). In the subspace $(f_1, f_2)$, cluster $c_1$ can be found but $c_2$ cannot (see Figure 1(b)). Conversely, only cluster $c_2$ is clearly shown in $(f_2, f_3)$ (see Figure 1(c)). Therefore, if we assign weights 0.47, 0.45, and 0.08 to features $f_1$, $f_2$, and $f_3$, respectively, cluster $c_1$ will be recovered by a clustering algorithm; if the weights are assigned as 0.13, 0.46, and 0.41, cluster $c_2$ will be recovered. In this sense, each cluster is relevant to a different subset of features, and the same feature may have different importance in different clusters.

In our research, let $v_{jk}$ be the weight of the $j$th ($1\le j\le N$) feature with respect to the $k$th ($1\le k\le c$) cluster, satisfying $\sum_{j=1}^{N} v_{jk} = 1$. The feature weighted distance $d_{ik}$ between the $i$th object and the $k$th cluster prototype is then defined as

$$d_{ik} = \sqrt{\sum_{j=1}^{N} v_{jk}^{q}\,(x_{ij} - o_{kj})^2}, \tag{7}$$

where the parameter $q > 1$ is the feature weight index.
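As an illustration, the whole distance matrix of (7) can be computed in vectorized form; this is a sketch under our own array conventions ($X$ of shape $n\times N$, prototypes $O$ of shape $c\times N$, weights $V$ of shape $N\times c$).

```python
import numpy as np

def weighted_distances(X, O, V, q):
    """Feature weighted distances d_ik of Eq. (7):
    d_ik = sqrt(sum_j v_jk^q (x_ij - o_kj)^2)."""
    diff2 = (X[:, None, :] - O[None, :, :]) ** 2   # shape (n, c, N)
    return np.sqrt(np.einsum('ikj,jk->ik', diff2, V ** q))
```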

The points within clusters usually exhibit strong correlation, while outliers are only weakly correlated with other points. That is, normal points belong to one of the $c$ clusters and outliers do not belong to any cluster. Therefore, a normal point should have a high membership to the cluster it belongs to, and an outlier should have a low membership to all clusters.

Figure 1: Three-dimensional synthetic example. (a) Plot of the space $(f_1, f_2, f_3)$; (b) plot of the subspace $(f_1, f_2)$, in which cluster $c_1$ is visible; (c) plot of the subspace $(f_2, f_3)$, in which cluster $c_2$ is visible.

Based on this idea, we define a new objective function and minimize it as follows:

$$J_{\mathrm{SSOD\text{-}AFW}}(\mathbf{U}, \mathbf{O}, \mathbf{V}) = \sum_{i=1}^{n}\sum_{k=1}^{c} u_{ik}^{m} d_{ik}^{2} + \frac{\beta}{m^2\sqrt{c}}\sum_{i=1}^{n}\sum_{k=1}^{c}\left(u_{ik}^{m}\log u_{ik}^{m} - u_{ik}^{m}\right) + \sum_{i=1}^{n}\sum_{k=1}^{c} \alpha_{ik} a_{ik} u_{ik}^{m}, \tag{8}$$

$$\text{s.t.}\quad \sum_{j=1}^{N} v_{jk} = 1,\quad u_{ik}\in[0,1],\quad v_{jk}\in[0,1], \tag{9}$$

where $n$, $N$, and $c$ are the numbers of objects, features, and clusters, respectively. In $\mathbf{U} = (u_{ik})_{n\times c}$, $u_{ik}$ is the membership degree of the $i$th object to the $k$th cluster; in $\mathbf{V} = (v_{jk})_{N\times c}$, $v_{jk}$ denotes the weight of the $j$th feature with respect to the $k$th cluster; and in $\mathbf{O} = (o_{kj})_{c\times N}$, $o_{kj}$ indicates the $j$th feature value of the $k$th cluster prototype. $d_{ik}$ denotes the feature weighted distance between object $x_i$ and the $k$th cluster prototype, $a_{ik}\in\{1, -1, 0\}$ is the corresponding entry of the semisupervised indicator matrix $\mathbf{A}$, $m>0$ is the fuzzification coefficient, and the parameter $\beta$ can be fixed from the sample covariance according to (6). The positive coefficient $\alpha_{ik}$ adjusts the significance of the label information of the $i$th object with respect to the $k$th cluster in objective function (8): the larger $\alpha_{ik}$ is, the larger the influence of the label knowledge.

The first term in (8) is equivalent to the FCM objective function, which requires the distances of objects from the cluster prototypes to be as small as possible. The second term is constructed to force $u_{ik}$ to be as large as possible. The third term, $\alpha_{ik} a_{ik} u_{ik}^{m}$, focuses on minimizing the membership degrees of a labeled outlier to all clusters and maximizing the membership degree of a labeled normal object to the cluster it belongs to. With a proper choice of $\alpha_{ik}$, we can balance the label information weight of every object and achieve the optimal fuzzy partition.

The role of the semisupervised indicator matrix $\mathbf{A}$ in objective function (8) can be elaborated as follows. If we know that $x_i$ belongs to the $k$th cluster, then $a_{ik} = -1$ and all the other entries in the $i$th row equal 1; minimizing $\alpha_{ik} a_{ik} u_{ik}^{m}$ in (8) thus means maximizing the membership of $x_i$ to the $k$th cluster while simultaneously minimizing its memberships to the other clusters. If $x_i$ is labeled as an outlier, namely, all the entries in the $i$th row of $\mathbf{A}$ equal 1, then minimizing $\alpha_{ik} a_{ik} u_{ik}^{m}$ means minimizing the memberships of $x_i$ to all clusters, for an outlier does not belong to any cluster. If $x_i$ is unlabeled, namely, $a_{ik} = 0$ for all $k = 1, 2, \ldots, c$, the term $\alpha_{ik} a_{ik} u_{ik}^{m}$ has no impact on objective function (8).
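The three terms can also be evaluated directly, which is a convenient sanity check that the updates derived below really decrease (8). The following sketch clips the logarithm away from zero for numerical safety, a detail of ours rather than of the paper.

```python
import numpy as np

def objective_ssod_afw(U, D, A, alpha, m, beta, c):
    """Objective function J_SSOD-AFW of Eq. (8)."""
    Um = U ** m
    fidelity = np.sum(Um * D ** 2)                      # clustering term
    entropy = beta / (m ** 2 * np.sqrt(c)) * np.sum(
        Um * np.log(np.clip(Um, 1e-12, None)) - Um)     # possibilistic term
    supervision = np.sum(alpha * A * Um)                # label term
    return fidelity + entropy + supervision
```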

3.2. Solutions to the Objective Function. In this subsection, an iterative algorithm for minimizing $J_{\mathrm{SSOD\text{-}AFW}}$ with respect to $u_{ik}$, $v_{jk}$, and $o_{kj}$ is derived, similar to classical FCM.

First, in order to minimize $J_{\mathrm{SSOD\text{-}AFW}}$ with respect to $\mathbf{V}$, the matrices $\mathbf{U}$ and $\mathbf{O}$ are fixed and the parameters $\alpha_{ik}$ ($i = 1, 2, \ldots, n$; $k = 1, 2, \ldots, c$) are treated as constants. The Lagrange function is constructed as follows:

$$L_1 = \sum_{i=1}^{n}\sum_{k=1}^{c} u_{ik}^{m} d_{ik}^{2} + \frac{\beta}{m^2\sqrt{c}}\sum_{i=1}^{n}\sum_{k=1}^{c}\left(u_{ik}^{m}\log u_{ik}^{m} - u_{ik}^{m}\right) + \sum_{i=1}^{n}\sum_{k=1}^{c}\alpha_{ik} a_{ik} u_{ik}^{m} - \sum_{k=1}^{c}\gamma_k\left(\sum_{j=1}^{N} v_{jk} - 1\right), \tag{10}$$

where $\gamma_k$ ($k = 1, 2, \ldots, c$) are the Lagrange multipliers.

By taking the gradient of $L_1$ with respect to $v_{jk}$ and setting it to zero, we obtain

$$\frac{\partial L_1}{\partial v_{jk}} = q\,v_{jk}^{q-1}\sum_{i=1}^{n} u_{ik}^{m}\,(x_{ij} - o_{kj})^2 - \gamma_k = 0. \tag{11}$$

Then

$$v_{jk} = \left[\frac{\gamma_k}{q\sum_{i=1}^{n} u_{ik}^{m}\,(x_{ij} - o_{kj})^2}\right]^{1/(q-1)}. \tag{12}$$

Substituting (12) into (9), we have

$$\sum_{j=1}^{N} v_{jk} = \left[\frac{\gamma_k}{q}\right]^{1/(q-1)}\sum_{j=1}^{N}\left[\frac{1}{\sum_{i=1}^{n} u_{ik}^{m}\,(x_{ij} - o_{kj})^2}\right]^{1/(q-1)} = 1. \tag{13}$$

It follows that

$$\left[\frac{\gamma_k}{q}\right]^{1/(q-1)} = \frac{1}{\sum_{j=1}^{N}\left[1/\sum_{i=1}^{n} u_{ik}^{m}\,(x_{ij} - o_{kj})^2\right]^{1/(q-1)}}. \tag{14}$$

The updating criterion of the feature weight $v_{jk}$ ($1\le j\le N$, $1\le k\le c$) is thus obtained:

$$v_{jk} = \frac{\left[1/\sum_{i=1}^{n} u_{ik}^{m}\,(x_{ij} - o_{kj})^2\right]^{1/(q-1)}}{\sum_{l=1}^{N}\left[1/\sum_{i=1}^{n} u_{ik}^{m}\,(x_{il} - o_{kl})^2\right]^{1/(q-1)}}. \tag{15}$$

This updating rule implies that the larger the deviations of all samples from the $k$th cluster prototype in the $j$th feature are, the smaller the weight of the $j$th feature is. That is, if the data are compactly distributed around the $k$th cluster prototype along the $j$th feature, the $j$th feature plays a significant role in forming the $k$th cluster, while irrelevant features are assigned smaller weights, reducing their negative impact on the clustering process.
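Equation (15) vectorizes naturally; in the sketch below, the small epsilon guarding against a zero deviation sum is our own safeguard.

```python
import numpy as np

def update_feature_weights(X, O, U, m, q, eps=1e-12):
    """Feature weight update of Eq. (15); columns of the returned
    (N, c) matrix sum to 1."""
    diff2 = (X[:, None, :] - O[None, :, :]) ** 2        # (n, c, N)
    s = np.einsum('ik,ikj->jk', U ** m, diff2) + eps    # s_jk = sum_i u_ik^m (x_ij - o_kj)^2
    w = (1.0 / s) ** (1.0 / (q - 1.0))
    return w / w.sum(axis=0, keepdims=True)
```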

To find the optimal cluster prototypes $\mathbf{O}$, we fix $\mathbf{U}$ and $\mathbf{V}$ and again treat the parameters $\alpha_{ik}$ as constants. Taking the gradient of $J_{\mathrm{SSOD\text{-}AFW}}$ with respect to $o_{kj}$ and setting it to zero gives

$$\frac{\partial J_{\mathrm{SSOD\text{-}AFW}}}{\partial o_{kj}} = \sum_{i=1}^{n}\left(-2\,u_{ik}^{m}\,v_{jk}\,(x_{ij} - o_{kj})\right) = 0. \tag{16}$$

The updating formula of the cluster prototype $o_{kj}$ follows:

$$o_{kj} = \frac{\sum_{i=1}^{n} u_{ik}^{m}\,x_{ij}}{\sum_{i=1}^{n} u_{ik}^{m}}. \tag{17}$$

To solve for the optimal fuzzy partition matrix $\mathbf{U}$, we fix $\mathbf{O}$ and $\mathbf{V}$, treat $\alpha_{ik}$ as constants, and set the gradient of $J_{\mathrm{SSOD\text{-}AFW}}$ with respect to $u_{ik}$ to zero:

$$\frac{\partial J_{\mathrm{SSOD\text{-}AFW}}}{\partial u_{ik}} = m\,u_{ik}^{m-1} d_{ik}^{2} + \frac{\beta}{m\sqrt{c}}\,u_{ik}^{m-1}\log u_{ik}^{m} + \alpha_{ik} a_{ik}\,m\,u_{ik}^{m-1} = 0. \tag{18}$$

The updating formula of $u_{ik}$ is derived as follows:

$$u_{ik} = \exp\left(-\frac{m\sqrt{c}}{\beta}\left(d_{ik}^{2} + \alpha_{ik} a_{ik}\right)\right). \tag{19}$$
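The two closed-form updates (17) and (19) are equally direct to implement; as above, the array shapes are our own convention.

```python
import numpy as np

def update_prototypes(X, U, m):
    """Prototype update of Eq. (17): membership-weighted feature means."""
    Um = U ** m                                    # (n, c)
    return (Um.T @ X) / Um.sum(axis=0)[:, None]    # (c, N)

def update_memberships(D, A, alpha, m, beta, c):
    """Membership update of Eq. (19):
    u_ik = exp(-(m sqrt(c) / beta) (d_ik^2 + alpha_ik a_ik))."""
    return np.exp(-(m * np.sqrt(c) / beta) * (D ** 2 + alpha * A))
```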

Formula (19) indicates that a large weighted distance $d_{ik}$ leads to a small value of $u_{ik}$ for all $1\le i\le n$, $1\le k\le c$. Note that the membership degree $u_{ik}$ also depends on the coefficient $\alpha_{ik}$. The choice of $\alpha_{ik}$ is important to the performance of the SSOD-AFW algorithm, because it weighs the third term against the other terms in objective function (8). If $\alpha_{ik}$ is too small, the third term is neglected and the labels of objects do not help promote the cluster structure; if $\alpha_{ik}$ is too large, the other terms are neglected and the negative influence of possible mislabels is amplified. The value of $\alpha_{ik}$ should therefore be of the same order of magnitude as the first term in (8). To determine $\alpha_{ik}$ adaptively, in all experiments described in this paper we choose $\alpha_{ik}$ proportional to $d_{ik}^{2}$:

$$\alpha_{ik} = K\,d_{ik}^{2}, \tag{20}$$

where $K\in(0,1)$ is a constant. Since the weighted distance $d_{ik}$ is dynamically updated, the value of $\alpha_{ik}$ is adaptively updated in each iteration.

3.3. Criteria for Outlier Identification. Based on the above analysis, outliers should hold low membership degrees to all clusters. Therefore, the sum of the memberships of an object to all clusters can be used to evaluate its outlying degree. For an object $x_i$, its outlying degree is defined as

$$\mathrm{OD}(x_i) = \sum_{k=1}^{c} u_{ik}. \tag{21}$$

A small value of $\mathrm{OD}(x_i)$ indicates a high outlying possibility of object $x_i$. The outlying degree of each sample in a dataset is calculated and sorted in ascending order, and the suspicious outliers are found by extracting the top $M$ objects in the sorted sequence, where $M$ is a given number of outliers contained in the dataset or the number of outliers one needs.
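Ranking by outlying degree then reduces to a single sort, as sketched below.

```python
import numpy as np

def top_m_outliers(U, M):
    """Outlying degrees OD(x_i) = sum_k u_ik of Eq. (21); the M objects
    with the smallest OD are reported as suspicious outliers."""
    od = U.sum(axis=1)
    return np.argsort(od)[:M], od
```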

In summary, the description of the SSOD-AFW algorithm is given in Algorithm 1.

Algorithm 1 (semisupervised outlier detection based on adaptive feature weighted clustering (SSOD-AFW)).

Input: Dataset $X$; the label information of some objects; the number of clusters $c$; parameters $m > 1$, $K \in (0,1)$, $q > 1$, $\varepsilon > 0$; and the number of outliers $M$.

Output: $M$ suspicious outliers.

(1) Calculate the parameter $\beta$ according to (6); randomly initialize the matrix $\mathbf{U}^{(0)} = (u_{ik}^{(0)})_{n\times c}$ and initialize all elements of $\mathbf{V}^{(0)} = (v_{jk}^{(0)})_{N\times c}$ as $1/N$. Set the iteration counter $t = 0$.

(2) Compute the matrix of cluster prototypes $\mathbf{O}^{(t+1)} = (o_{kj}^{(t+1)})_{c\times N}$ according to (17).

(3) Update the feature weight matrix $\mathbf{V}^{(t+1)} = (v_{jk}^{(t+1)})_{N\times c}$ by (15).

(4) Update the feature weighted distances $\mathbf{D}^{(t+1)} = (d_{ik}^{(t+1)})_{n\times c}$ by (7).

(5) Update the parameters $\alpha_{ik}^{(t+1)}$ by (20).

(6) Update the membership degree matrix $\mathbf{U}^{(t+1)} = (u_{ik}^{(t+1)})_{n\times c}$ according to (19).

(7) If $\|\mathbf{U}^{(t+1)} - \mathbf{U}^{(t)}\| < \varepsilon$, go to step (8); otherwise set $t = t + 1$ and repeat steps (2) to (7).

(8) Compute the outlying degree of each object and sort the OD values in ascending order. Finally, output the top $M$ outliers with the smallest outlying degrees.
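Putting the pieces together, the following is a compact sketch of Algorithm 1 built from the helper functions sketched above; the default parameter values mirror the settings used for the real-world datasets in Section 4.3.2, and the fixed random seed is an illustrative choice of ours.

```python
import numpy as np

def ssod_afw(X, A, c, M, m=2.1, K=0.65, q=1.53, eps=1e-5, max_iter=100):
    """Sketch of Algorithm 1 (SSOD-AFW), assuming sample_beta,
    weighted_distances, update_prototypes, update_feature_weights,
    update_memberships, and top_m_outliers as defined earlier."""
    n, N = X.shape
    rng = np.random.default_rng(0)
    beta = sample_beta(X)                                # step (1)
    U = rng.random((n, c))                               # step (1): random U^(0)
    V = np.full((N, c), 1.0 / N)                         # step (1): uniform V^(0)
    for _ in range(max_iter):
        O = update_prototypes(X, U, m)                   # step (2): Eq. (17)
        V = update_feature_weights(X, O, U, m, q)        # step (3): Eq. (15)
        D = weighted_distances(X, O, V, q)               # step (4): Eq. (7)
        alpha = K * D ** 2                               # step (5): Eq. (20)
        U_new = update_memberships(D, A, alpha, m, beta, c)  # step (6): Eq. (19)
        converged = np.linalg.norm(U_new - U) < eps      # step (7)
        U = U_new
        if converged:
            break
    return top_m_outliers(U, M)                          # step (8)
```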

Computational complexity analysis: Step (2) needs $O(cNn)$ operations to compute the $c$ cluster prototypes. The complexity of computing the weights of $N$ features in Step (3) is $O(cNn)$. Step (4) requires $O(cNn)$ operations to compute the weighted distances of $n$ objects to $c$ cluster prototypes. Step (5) needs $O(cn)$ operations to compute the parameters $\alpha_{ik}$, and Step (6) needs $O(cn)$ operations to calculate the memberships of $n$ objects to $c$ clusters. Therefore, the overall computational complexity per iteration is $O(cNn)$, the same as that of the classical FCM algorithm.

3.4. Proof of Convergence. In this section, we discuss the convergence of the proposed SSOD-AFW algorithm. To prove the convergence of the objective function $J_{\mathrm{SSOD\text{-}AFW}}$ in (8) under the iteration of $\mathbf{V}$, $\mathbf{O}$, and $\mathbf{U}$ with formulas (15), (17), and (19), it suffices to prove that $J_{\mathrm{SSOD\text{-}AFW}}$ is monotonically decreasing and bounded after a finite number of iterations. Lemmas 2, 3, and 4 verify the monotonically decreasing property of $J_{\mathrm{SSOD\text{-}AFW}}$ with respect to $\mathbf{V}$, $\mathbf{O}$, and $\mathbf{U}$, respectively, and Lemma 5 establishes the boundedness of $J_{\mathrm{SSOD\text{-}AFW}}$.

Lemma 2. Objective function $J_{\mathrm{SSOD\text{-}AFW}}$ in (8) is nonincreasing when $\mathbf{U} = (u_{ik})_{n\times c}$ is updated with formula (19).

Proof. Since $\mathbf{V}$ and $\mathbf{O}$ are fixed when updating $\mathbf{U}$ by (19), the objective function can be regarded as a function of $\mathbf{U}$ alone, denoted $J_{\mathrm{SSOD\text{-}AFW}}(\mathbf{U})$. By the Lagrangian multiplier technique, $\mathbf{U}^{*}$ computed via (19) is a stationary point of $J_{\mathrm{SSOD\text{-}AFW}}(\mathbf{U})$. If, in addition, the Hessian matrix $\nabla^2 J_{\mathrm{SSOD\text{-}AFW}}(\mathbf{U})$ is positive definite at $\mathbf{U}^{*}$, then $J_{\mathrm{SSOD\text{-}AFW}}(\mathbf{U})$ attains a local minimum at $\mathbf{U}^{*}$. The Hessian

$$\nabla^2 J_{\mathrm{SSOD\text{-}AFW}}(\mathbf{U}^{*}) = \left(\frac{\partial^2 J_{\mathrm{SSOD\text{-}AFW}}(\mathbf{U}^{*})}{\partial u_{ik}\,\partial u_{i'k'}}\right) \tag{22}$$

is a diagonal matrix, and its diagonal elements are

$$\frac{\partial^2 J_{\mathrm{SSOD\text{-}AFW}}(\mathbf{U}^{*})}{\partial u_{ik}\,\partial u_{ik}} = \frac{\beta}{\sqrt{c}}\,u_{ik}^{m-2}. \tag{23}$$

Since $u_{ik} > 0$, the Hessian $\nabla^2 J_{\mathrm{SSOD\text{-}AFW}}(\mathbf{U}^{*})$ is positive definite. Having proved that $\mathbf{U}^{*}$ is a stationary point and that the Hessian is positive definite there, we conclude that $\mathbf{U}^{*}$ is a local minimum of $J_{\mathrm{SSOD\text{-}AFW}}(\mathbf{U})$. Then $J_{\mathrm{SSOD\text{-}AFW}}(\mathbf{U}^{(t)}) \ge J_{\mathrm{SSOD\text{-}AFW}}(\mathbf{U}^{*}) = J_{\mathrm{SSOD\text{-}AFW}}(\mathbf{U}^{(t+1)})$, where $\mathbf{U}^{(t)}$ is the membership matrix after the $t$th iteration of (19) and $\mathbf{U}^{*} = \mathbf{U}^{(t+1)}$ is the one after the $(t+1)$th iteration. Therefore, objective function (8) is nonincreasing when $\mathbf{U}$ is updated using formula (19).

Lemma 3. Objective function $J_{\mathrm{SSOD\text{-}AFW}}$ in (8) is nonincreasing when $\mathbf{V} = (v_{jk})_{N\times c}$ is updated using (15).

Proof. Similar to Lemma 2, with $\mathbf{U}$ and $\mathbf{O}$ fixed, it suffices to prove that the Hessian matrix of the Lagrangian $L_1(\mathbf{V})$ is positive definite at $\mathbf{V}^{*}$, where $\mathbf{V}^{*}$ is computed by (15). The Hessian is denoted $H(\mathbf{V}) = (h_{jk,lm}(\mathbf{V}))$, whose elements are

$$h_{jk,lm}(\mathbf{V}) = \frac{\partial^2 L_1(\mathbf{V})}{\partial v_{jk}\,\partial v_{lm}} = \begin{cases} q\,(q-1)\,v_{jk}^{q-2}\displaystyle\sum_{i=1}^{n} u_{ik}^{m}\,(x_{ij} - o_{kj})^2, & \text{if } j = l,\ k = m,\\ 0, & \text{otherwise.} \end{cases} \tag{24}$$

Since $q > 1$ and $v_{jk} > 0$, the diagonal entries of this diagonal matrix are positive, so $H(\mathbf{V})$ is positive definite and $J_{\mathrm{SSOD\text{-}AFW}}(\mathbf{V})$ attains a local minimum at $\mathbf{V}^{*}$ computed by (15). This completes the proof.

Lemma 4. Objective function $J_{\mathrm{SSOD\text{-}AFW}}$ in (8) is nonincreasing when $\mathbf{O} = (o_{kj})_{c\times N}$ is updated using (17).

The proof of Lemma 4 is similar to that of Lemma 2.

Lemma 5. Objective function $J_{\mathrm{SSOD\text{-}AFW}}$ in (8) is bounded; that is, there exists a constant $B$ such that $J_{\mathrm{SSOD\text{-}AFW}} < B$.

Proof. For all $i \in \{1, \ldots, n\}$ and all $k \in \{1, \ldots, c\}$, we have $0 < u_{ik} \le 1$. Thus $u_{ik}^{m}\log u_{ik}^{m} - u_{ik}^{m}$ is monotonically decreasing with respect to $u_{ik}$, and hence $-1 \le u_{ik}^{m}\log u_{ik}^{m} - u_{ik}^{m} < 0$. Apparently, the first and third terms of the objective function are also bounded. In consequence, objective function (8) is bounded.

Theorem 6. The objective function $J_{\mathrm{SSOD\text{-}AFW}}$ converges to a local minimum under the updates of $\mathbf{V}$, $\mathbf{O}$, and $\mathbf{U}$ by formulas (15), (17), and (19), subject to the constraints in (9).

Proof. Lemmas 2, 3, and 4 verify that objective function (8) is nonincreasing under the iterations of (15), (17), and (19), and Lemma 5 shows that $J_{\mathrm{SSOD\text{-}AFW}}$ is bounded. Although the parameter $\alpha_{ik}$ is updated during the iteration process, it is a constant in each subproblem solved with the Lagrangian multiplier technique, so $\alpha_{ik}$ does not affect the convergence of the SSOD-AFW algorithm. Combining the above conclusions, $J_{\mathrm{SSOD\text{-}AFW}}$ converges to a local minimum through the iterations of $\mathbf{V}$, $\mathbf{O}$, and $\mathbf{U}$ by (15), (17), and (19).

4. Experiments and Analysis

Comprehensive experiments on a synthetic dataset and several real-world datasets are conducted to show the effectiveness and superiority of the proposed SSOD-AFW. We compare the proposed algorithm with two state-of-the-art unsupervised outlier detection algorithms, LOF [15] and KNN [13], one supervised method, SVDD [32], and one semisupervised method, EODSP [23].

4.1. Evaluation Metrics. For numerical performance evaluation of outlier detection algorithms, three metrics, namely, accuracy [7], AUC [33], and rank-power [16], are used in this paper.

Let $M$ be the number of true outliers that a dataset $X$ contains, and let $S$ denote the number of true outliers detected by an algorithm. In the experiments, the top $M$ most suspicious instances are reported, and the accuracy is given by

$$\text{Accuracy} = \frac{S}{M}. \tag{25}$$

The receiver operating characteristic (ROC) curve represents the trade-off between the detection rate and the false alarm rate. In general, the area under the ROC curve (AUC) is used to measure the performance of an outlier detection method; for ideal detection performance, the AUC is close to one.

For a given outlier detection algorithm, if the true outliers occupy the top positions relative to the nonoutliers among the $M$ suspicious instances, the rank-power (RP) of the algorithm is said to be high. If $S$ is the number of true outliers found within the top $M$ instances and $R_i$ denotes the rank of the $i$th true outlier, then the rank-power is given by

$$\mathrm{RP} = \frac{S\,(S+1)}{2\sum_{i=1}^{S} R_i}. \tag{26}$$

RP reaches its maximum value 1 when all $S$ true outliers occupy the top $S$ positions; a larger value of RP implies better performance of an algorithm.
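Both ranking metrics are easy to compute from a suspect list; this sketch assumes the top-$M$ suspects are given in order of decreasing suspicion.

```python
def accuracy_and_rank_power(ranked_idx, true_outliers):
    """Accuracy of Eq. (25) and rank-power of Eq. (26).

    ranked_idx: top-M suspect indices, most suspicious first;
    true_outliers: set of ground-truth outlier indices."""
    ranks = [r + 1 for r, i in enumerate(ranked_idx) if i in true_outliers]
    S = len(ranks)
    accuracy = S / len(ranked_idx)          # Eq. (25), with M = len(ranked_idx)
    rank_power = S * (S + 1) / (2 * sum(ranks)) if S else 0.0
    return accuracy, rank_power
```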

4.2. Experiments on Synthetic Dataset. A two-dimensional synthetic dataset with two cluster patterns is generated from Gaussian distributions to compare the outlier detection results of the proposed method intuitively against the other four algorithms mentioned above. The mean vectors of the two clusters are $\mu_1 = (7.5, 9)^{\mathrm{T}}$ and $\mu_2 = (1, 3)^{\mathrm{T}}$, respectively, and both covariance matrices are $\Sigma_1 = \Sigma_2 = \begin{pmatrix} 1 & 0 \\ 0 & 2 \end{pmatrix}$. As Figure 2(a) shows, a total of 199 samples are contained in the synthetic dataset: 183 normal samples (within the two clusters) and 16 outliers (scattered between the two clusters). Thirteen normal objects are labeled and marked with the symbol "×", and 5 outliers are labeled and marked with the symbol "∗", while the remaining samples are unlabeled and marked with "·". Figures 2(b)–2(f) illustrate the outlier detection results on the synthetic dataset using LOF, KNN, SVDD, EODSP, and SSOD-AFW, respectively, where the red symbols "∘" denote the detected suspicious outliers. Here the neighborhood size $k$ in LOF and KNN is set to 3; the Gaussian kernel is chosen in SVDD with bandwidth $\sigma = 0.3$ and trade-off coefficient $C = 0.45$; and in EODSP the Euclidean distance threshold $T$ is set to 0.1 with the percentage of the negative set $l = 10\%$. The parameter settings of the proposed algorithm are $m = 2.3$, $K = 0.85$, and $q = 2.23$. Except for SVDD, the top 16 objects with the highest outlying scores are taken as the results of the other four algorithms.
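The construction of such a dataset is easy to reproduce. In the sketch below, the 92/91 split of the 183 normal points and the uniform box from which the 16 outliers are drawn are our own assumptions, since the paper only fixes the totals and the Gaussian parameters.

```python
import numpy as np

def make_synthetic(seed=0):
    """Two Gaussian clusters (183 normal points in total) plus 16
    outliers scattered between them, as in Section 4.2."""
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, 0.0], [0.0, 2.0]])
    c1 = rng.multivariate_normal([7.5, 9.0], cov, size=92)   # cluster 1
    c2 = rng.multivariate_normal([1.0, 3.0], cov, size=91)   # cluster 2
    outliers = rng.uniform(low=[2.0, 4.0], high=[6.0, 8.0], size=(16, 2))
    return np.vstack([c1, c2, outliers])
```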

As Figure 2 shows, the unsupervised methods LOF and KNN, as well as the supervised SVDD, fail to detect all of the 5 labeled outliers completely; worse, some normal points in the clusters are misjudged as outliers. In contrast, the semisupervised EODSP algorithm and the proposed SSOD-AFW algorithm successfully detect all of the 5 labeled outliers. However, EODSP does not detect all the unlabeled true outliers, and several true normal samples are improperly identified as outliers. It can be concluded from Figure 2 that the proposed algorithm finds all the true outliers in the synthetic dataset while excluding the normal samples, whereas the other methods do not.

Figure 3 numerically presents the performance of LOF, KNN, SVDD, EODSP, and SSOD-AFW on the synthetic dataset. The accuracy, AUC, and RP of the proposed algorithm all reach 1, outperforming the other methods.

Furthermore, in the experiment shown in Figure 3, the feature weights of the synthetic dataset learned by formula (15) are $v_{11} = 0.6727$, $v_{12} = 0.5985$, $v_{21} = 0.3273$, and $v_{22} = 0.4015$. To demonstrate the effectiveness of the feature weights in the proposed SSOD-AFW algorithm, a comparison between the weighted and nonweighted versions is carried out on the synthetic dataset. The result of the nonweighted variant is presented in Figure 4: the nonweighted SSOD-AFW ends up tagging 15 true outlying samples and 1 normal sample as outliers, with one unlabeled true outlier missed.

4.3. Experiments on Real-World Datasets

4.3.1. Introduction of Datasets. For further verification of the effectiveness of the proposed algorithm, five real-world datasets from the UCI Machine Learning Repository [34] (Iris, Abalone, Wine, Ecoli, and Breast Cancer Wisconsin (WDBC)) are employed to test the performance of the proposed algorithm against LOF, KNN, SVDD, and EODSP. As mentioned by Aggarwal and Yu [35], one way to test the performance of an outlier detection algorithm is to run it on a dataset and calculate the percentage of detected points belonging to the rare classes. Accordingly, for each of the five datasets, a small number of samples from the same class are randomly selected as outlying (target) objects. For instance, the original Iris dataset contains 150 objects, with 50 objects in each of three classes.

Figure 2: Outlier detection results of different algorithms on the two-dimensional synthetic dataset. (a) Original synthetic dataset; (b) LOF; (c) KNN; (d) SVDD; (e) EODSP; (f) SSOD-AFW.

Figure 3: Performance comparison of different algorithms on the synthetic dataset. (a) Accuracy; (b) AUC; (c) RP.

We randomly selected 26 objects from the class "Iris-virginica" as target outliers, and all objects in the other two classes are considered normal. The other four datasets are preprocessed similarly; a more detailed description of the five real-world datasets is given in Table 1.

4.3.2. Experimental Result Analysis. We compare the outlier detection performance of the proposed algorithm with LOF, KNN, SVDD, and EODSP on the real-world datasets. Each method has its own parameters, and the detailed settings are as follows. The parameters of the proposed algorithm are $m = 2.1$, $K = 0.65$, and $q = 1.53$ for all five datasets.

Table 1: Description of the real-world datasets.

| Dataset | Instances | Features | Outlying classes | Outliers (percent) | Clusters | Prior information |
|---|---|---|---|---|---|---|
| Iris | 126 | 4 | "Virginica" | 26 (20.63%) | 2 | 10 labeled normal samples, 4 labeled outliers |
| Abalone | 4177 | 8 | "1"–"4", "16"–"27", "29" | 335 (8.02%) | 11 | 11 labeled normal samples, 18 labeled outliers |
| Wine | 130 | 13 | "3" | 11 (8.46%) | 2 | 9 labeled normal samples, 4 labeled outliers |
| Ecoli | 336 | 9 | "omL", "imL", "imS" | 9 (2.68%) | 5 | 11 labeled normal samples, 3 labeled outliers |
| WDBC | 387 | 30 | "Malignant" | 30 (7.75%) | 1 | 10 labeled normal samples, 8 labeled outliers |

Figure 4: Outlier detection result of the nonweighted SSOD-AFW on the synthetic dataset.

The strategy of parameter selection for SSOD-AFW is discussed in the parameter analysis of Section 4.3.4. For the other algorithms, the parameters are set exactly as described in their references. It is well known that LOF and KNN depend strongly on the neighborhood parameter $k$; here we set $k = 5$ for Iris and WDBC, $k = 3$ for Abalone, and $k = 10$ for Wine and Ecoli. For SVDD, the Gaussian kernel is employed with bandwidth $\sigma = 0.45$ and $C = 0.5$ on all five real-world datasets. In EODSP, the Euclidean distance threshold is set to $T = 0.1$ and the percentage of the negative set to $l = 30\%$ for the Iris and Abalone datasets, and $T = 0.5$, $l = 30\%$ for Ecoli, Wine, and WDBC. Since outliers are randomly selected from the target classes of each dataset, each experiment is repeated 10 times with the same number of different outliers, and the average accuracy, AUC, and RP are reported as the performance criteria of the various detection methods.

Figure 5 illustrates the outlier detection results of the SSOD-AFW algorithm against LOF, KNN, SVDD, and EODSP on the five real-world datasets. As can be seen from Figure 5, the proposed algorithm accurately identifies outliers according to the cluster structure of a dataset under the guidance of the label knowledge, and it shows distinct superiority over the other unsupervised (LOF, KNN), semisupervised (EODSP), and supervised (SVDD) methods. In particular, the outlier detection accuracy of SSOD-AFW in Figure 5(a) is significantly higher than the others, especially for the Iris and Wine datasets. Figure 5(b) shows that the AUC values of our method are higher than those of the others on all datasets except WDBC. In terms of RP, SSOD-AFW performs better than the other four algorithms on Iris and Wine, whereas it is slightly poorer than SVDD on Abalone, poorer than LOF on Ecoli, and poorer than KNN on WDBC, as seen in Figure 5(c).

It is worth mentioning that the experiment on WDBC involves a one-class clustering problem. Although a one-class clustering task is generally meaningless, one-class clustering-based outlier detection is especially meaningful and feasible in our proposal, because our approach does not require the membership degrees to sum to 1. This is one of the powerful and important characteristics of the proposed algorithm.

4.3.3. The Influence of the Proportion of Labeled Data on Outlier Detection. In this subsection, we investigate the influence of the proportion of labeled samples on the accuracy of our method. Two typical situations are considered: in the first, the proportion of labeled outliers increases while the number of labeled normal objects is fixed; in the second, the proportion of labeled normal samples varies while the quantity of labeled outliers is fixed. Accordingly, two groups of experiments are designed to compare the accuracies of the proposed algorithm against EODSP under different proportions of labeled outliers and labeled normal samples, respectively, on the datasets Iris, Abalone, Wine, Ecoli, and WDBC.

Figure 5: Performance comparison of various algorithms on the real-world datasets. (a) Accuracy; (b) AUC; (c) RP.

In the two experiments, the percentage of labeled outliers or labeled normal samples ranges from 0% to 40% while the amount of the other kind of labeled objects is fixed. We randomly select the specified number of labeled outliers or labeled normal samples from each dataset, each experiment is repeated 10 times, and the average accuracies of SSOD-AFW and EODSP are computed.

Figure 6 shows the results of the first group of experiments, where the percentage of labeled outliers varies from 0% to 40%. One can see from Figure 6 that the accuracies of the two semisupervised algorithms roughly increase as the proportion of labeled outliers grows, which strongly supports the idea that semisupervised outlier detection algorithms can improve detection accuracy by using prior information. Furthermore, SSOD-AFW achieves a better accuracy than EODSP for the same proportion of labeled outliers on all five datasets; for Wine in particular, the accuracy of SSOD-AFW is 40% higher than that of EODSP. EODSP addresses the problem of detecting outliers with only a few labeled outliers as training data: the labeled normal instances are extracted according to the maximum entropy principle, where the entropy is computed only from the distances between each testing sample and the labeled outliers. This information deficiency makes EODSP less flexible than our proposed method.

Figure 7 illustrates the accuracy comparison of the proposed algorithm and EODSP when the proportion of labeled normal samples increases from 0% to 40% and the percentage of labeled outliers is fixed. Our method again obtains a better accuracy than EODSP on all five real-world datasets, and the accuracy of the proposed algorithm grows as the percentage of labeled normal samples increases. As mentioned above, EODSP performs semisupervised outlier detection only with a few labeled outliers in the initial dataset and does not consider any labeled normal objects. Therefore, the accuracy of EODSP remains unchanged across the various proportions of labeled normal objects, always equal to its accuracy with 0% labeled normal samples.

4.3.4. Parameter Analysis. The parameters $m$, $K$, and $q$ are important in the proposed algorithm and affect the performance of SSOD-AFW. In this section, the influence of each parameter on outlier detection accuracy is studied.

The parameter $m$ is the fuzzification coefficient. Figure 8(a) analyzes the relationship between the outlier detection accuracy of the proposed algorithm and $m$, with $m$ varying from 1.5 to 5.0. The results imply that the highest accuracy is achieved when $m$ ranges in $[2, 4]$, so it is reasonable that $m$ was set to 2.1 in the experiments shown in Figure 5. The parameter $K \in (0, 1)$ controls the importance of the label information in the outlier detection result. Outlier detection accuracies are tested by varying $K$ from 0.1 to 0.9, as shown in Figure 8(b); the overall tendency is that the accuracies become larger as $K$ increases, and the best results of the proposed algorithm occur and remain stable when $K \ge 0.7$. Finally, from Figure 8(c) we conclude that the feature weight index $q$ has little influence on the accuracy of SSOD-AFW when the other parameters keep the same settings, so the proposed algorithm is not sensitive to $q$; in general, $q$ is suggested to be a constant from $(1, 4]$.

4.3.5. Execution Time Analysis. Figure 9 reports the average running time of the proposed algorithm against the other algorithms on the five real-world datasets. The experimental environment is a Windows XP system, the MATLAB 7.1 platform, a 3 GHz CPU, and 2 GB RAM. Because the volume of the Abalone dataset is far greater than that of the other four datasets, the running times on the various datasets differ markedly; to facilitate the display in Figure 9, the horizontal coordinate axis is translated downward a short distance. The result indicates that the proposed algorithm is more time-saving than the other four typical outlier detection algorithms, except for KNN on the Wine dataset.

Figure 6: Accuracy analysis of algorithms EODSP and SSOD-AFW with different percentages of labeled outliers on the real-world datasets. (a) Iris; (b) Abalone; (c) Wine; (d) Ecoli; (e) WDBC.

On the whole, the execution time of SSOD-AFW is comparable to that of KNN and less than those of the other algorithms on most of the datasets.

5. Conclusions

In order to detect outliers more precisely, a semisupervised outlier detection algorithm based on adaptive feature weighted clustering, called SSOD-AFW, is proposed in this paper. Distinct weights of each feature with respect to different clusters are obtained by adaptive iteration, so that the negative effects of irrelevant features on outlier detection are weakened. Moreover, the proposed method makes full use of the prior knowledge contained in datasets and detects outliers by virtue of the cluster structure. A series of experiments verifies that the proposed SSOD-AFW algorithm is superior to other typical unsupervised, semisupervised, and supervised algorithms in both outlier detection precision and running speed.

Figure 7: Accuracy analysis of algorithms EODSP and SSOD-AFW with different percentages of labeled normal samples on the real-world datasets. (a) Iris; (b) Abalone; (c) Wine; (d) Ecoli; (e) WDBC.

The proposed method utilizes the labels of a small number of objects. However, it assumes that these labels are reliable and does not include a mislabeling penalty in the new objective function; a robust version of the proposed method that handles noisy or imperfect labels therefore deserves further study. Moreover, since only the Euclidean distance is adopted as the dissimilarity measure, the SSOD-AFW algorithm is limited to outlier detection for numerical data. Future research aims at extending our method to mixed-attribute data and to more real-life applications such as fault diagnosis in industrial processes and network anomaly detection.

Figure 8: Outlier detection accuracy of the proposed algorithm under various parameters on the real-world datasets. (a) Parameter $m$; (b) parameter $K$; (c) parameter $q$.

Figure 9: Execution time comparison of different algorithms on the real-world datasets.


Competing Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (Grant no. 11471001).

References

[1] J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, Elsevier, 2011.

[2] D. M. Hawkins, Identification of Outliers, Chapman & Hall, London, UK, 1980.

[3] V. Barnett and T. Lewis, Outliers in Statistical Data, John Wiley & Sons, Chichester, UK, 1994.

[4] V. J. Hodge and J. Austin, "A survey of outlier detection methodologies," Artificial Intelligence Review, vol. 22, no. 2, pp. 85–126, 2004.

[5] B. Sheng, Q. Li, W. Mao, and W. Jin, "Outlier detection in sensor networks," in Proceedings of the 8th ACM International Symposium on Mobile Ad Hoc Networking and Computing (MobiHoc '07), pp. 219–228, Montreal, Canada, September 2007.

[6] A. P. James and S. Dimitrijev, "Inter-image outliers and their application to image classification," Pattern Recognition, vol. 43, no. 12, pp. 4101–4112, 2010.

[7] J. Huang, Q. Zhu, L. Yang, and J. Feng, "A non-parameter outlier detection algorithm based on Natural Neighbor," Knowledge-Based Systems, vol. 92, pp. 71–77, 2016.

[8] J. M. Shepherd and S. J. Burian, "Detection of urban-induced rainfall anomalies in a major coastal city," Earth Interactions, vol. 7, no. 4, pp. 1–17, 2003.

[9] O. Alan and C. Catal, "Thresholds based outlier detection approach for mining class outliers: an empirical case study on software measurement datasets," Expert Systems with Applications, vol. 38, no. 4, pp. 3440–3445, 2011.

[10] B. Schölkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson, "Estimating the support of a high-dimensional distribution," Neural Computation, vol. 13, no. 7, pp. 1443–1471, 2001.

[11] D. Tax, A. Ypma, and R. Duin, "Support vector data description applied to machine vibration analysis," in Proceedings of the 5th Annual Conference of the Advanced School for Computing and Imaging, pp. 398–405, Heijen, The Netherlands, June 1999.

[12] D. M. J. Tax and R. P. W. Duin, "Support vector data description," Machine Learning, vol. 54, no. 1, pp. 45–66, 2004.

[13] S. Ramaswamy, R. Rastogi, and K. Shim, "Efficient algorithms for mining outliers from large data sets," SIGMOD Record (ACM Special Interest Group on Management of Data), vol. 29, no. 2, pp. 427–438, 2000.

[14] E. M. Knox and R. T. Ng, "Algorithms for mining distance-based outliers in large datasets," in Proceedings of the International Conference on Very Large Data Bases, pp. 392–403, New York, NY, USA, 1998.

[15] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander, "LOF: identifying density-based local outliers," ACM SIGMOD Record, vol. 29, no. 2, pp. 93–104, 2000.

[16] J. Ha, S. Seok, and J.-S. Lee, "A precise ranking method for outlier detection," Information Sciences, vol. 324, pp. 88–107, 2015.

[17] R. N. Dave, "Characterization and detection of noise in clustering," Pattern Recognition Letters, vol. 12, no. 11, pp. 657–664, 1991.

[18] R. Smith, A. Bivens, M. Embrechts, C. Palagiri, and B. Szymanski, "Clustering approaches for anomaly based intrusion detection," in Proceedings of Intelligent Engineering Systems Through Artificial Neural Networks, pp. 579–584, 2002.

[19] Y. Shi and L. Zhang, "COID: a cluster-outlier iterative detection approach to multi-dimensional data analysis," Knowledge and Information Systems, vol. 28, no. 3, pp. 709–733, 2011.

[20] J. Zhao, K. Liu, W. Wang, and Y. Liu, "Adaptive fuzzy clustering based anomaly data detection in energy system of steel industry," Information Sciences, vol. 259, pp. 335–345, 2014.

[21] F. Angiulli and C. Pizzuti, "Fast outlier detection in high dimensional spaces," in Proceedings of the 6th European Conference on Principles of Data Mining and Knowledge Discovery (PKDD '02), vol. 2, pp. 15–26, August 2002.

[22] M. Radovanovic, A. Nanopoulos, and M. Ivanovic, "Reverse nearest neighbors in unsupervised distance-based outlier detection," IEEE Transactions on Knowledge and Data Engineering, vol. 27, no. 5, pp. 1369–1382, 2015.

[23] A. Daneshpazhouh and A. Sami, "Entropy-based outlier detection using semi-supervised approach with few positive examples," Pattern Recognition Letters, vol. 49, pp. 77–84, 2014.

[24] J. Gao, H. Cheng, and P.-N. Tan, "Semi-supervised outlier detection," in Proceedings of the ACM Symposium on Applied Computing, pp. 635–636, Dijon, France, April 2006.

[25] Z. Xue, Y. Shang, and A. Feng, "Semi-supervised outlier detection based on fuzzy rough C-means clustering," Mathematics and Computers in Simulation, vol. 80, no. 9, pp. 1911–1921, 2010.

[26] J Z Huang M K Ng H Rong and Z Li ldquoAutomated variableweighting in k-means type clusteringrdquo IEEE Transactions onPattern Analysis andMachine Intelligence vol 27 no 5 pp 657ndash668 2005

[27] H Friguiand and O Nasraou ldquoUnsupervised learning ofprototypes and attribute weightsrdquo Pattern Recognition vol 37no 3 pp 567ndash581 2004

[28] J Zhou L Chen C P Chen Y Zhang and H Li ldquoFuzzy clus-tering with the entropy of attribute weightsrdquo Neurocomputingvol 198 pp 125ndash134 2016

[29] M Hassan A Chaudhry A Khan and M A Iftikhar ldquoRobustinformation gain based fuzzy c-means clustering and classifi-cation of carotid artery ultrasound imagesrdquo Computer Methodsand Programs in Biomedicine vol 113 no 2 pp 593ndash609 2014

[30] R Krishnapuram and J M Keller ldquoA possibilistic approach toclusteringrdquo IEEE Transactions on Fuzzy Systems vol 1 no 2 pp98ndash110 1993

[31] M-S Yang and K-L Wu ldquoUnsupervised possibilistic cluster-ingrdquo Pattern Recognition vol 39 no 1 pp 5ndash21 2006

[32] SM Guo L C Chen and J S H Tsai ldquoA boundarymethod foroutlier detection based on support vector domain descriptionrdquoPattern Recognition vol 42 no 1 pp 77ndash83 2009

[33] T Fawcett ldquoRoc graphs notes and practical considerations forresearchersrdquoMachine Learning vol 31 no 1 pp 1ndash38 2004

[34] C Blake and C J Merz ldquoUCI repository of machine learningdatabasesrdquo 1998

[35] C C Aggarwal and P S Yu ldquoOutlier detection for highdimensional datardquo in Proceedings of the ACM SIGMOD Interna-tional Conference on Management of Data (SIGMOD rsquo01) SantaBarbara Calif USA May 2001

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

[Figure 1: Three-dimensional synthetic example. (a) Plot of the space $(f_1, f_2, f_3)$; (b) plot of the subspace $(f_1, f_2)$, containing cluster $c_1$; (c) plot of the subspace $(f_2, f_3)$, containing cluster $c_2$.]

idea, we define a new objective function and minimize it as follows:

$$J_{\text{SSOD-AFW}}(\mathbf{U},\mathbf{O},\mathbf{V})=\sum_{i=1}^{n}\sum_{k=1}^{c}u_{ik}^{m}d_{ik}^{2}+\frac{\beta}{m^{2}\sqrt{c}}\sum_{i=1}^{n}\sum_{k=1}^{c}\left(u_{ik}^{m}\log u_{ik}^{m}-u_{ik}^{m}\right)+\sum_{i=1}^{n}\sum_{k=1}^{c}\alpha_{ik}a_{ik}u_{ik}^{m}, \tag{8}$$

$$\text{s.t.}\quad \sum_{j=1}^{N}v_{jk}=1,\qquad u_{ik}\in[0,1],\qquad v_{jk}\in[0,1], \tag{9}$$

where $n$, $N$, and $c$ are the numbers of objects, features, and clusters, respectively. $\mathbf{U}=(u_{ik})_{n\times c}$, where $u_{ik}$ is the membership degree of the $i$th object in the $k$th cluster; $\mathbf{V}=(v_{jk})_{N\times c}$, where $v_{jk}$ denotes the weight of the $j$th feature with respect to the $k$th cluster; and $\mathbf{O}=(o_{kj})_{c\times N}$, where $o_{kj}$ indicates the $j$th feature value of the $k$th cluster prototype. $d_{ik}$ denotes the feature-weighted distance between object $x_i$ and the $k$th cluster prototype, $a_{ik}\in\{1,-1,0\}$ is the corresponding element of the semisupervised indicator matrix $\mathbf{A}$, $m>0$ is the fuzzification coefficient, and the parameter $\beta$ can be fixed as the sample covariance according to (6). The positive coefficient $\alpha_{ik}$ adjusts the significance of the label information of the $i$th object with respect to the $k$th cluster in objective function (8): the larger $\alpha_{ik}$ is, the larger the influence of the label knowledge.

The first term in (8) is equivalent to the FCM objective function, which requires the distances of objects from the cluster prototypes to be as small as possible. The second term is constructed to force $u_{ik}$ to be as large as possible. The third term, $\alpha_{ik}a_{ik}u_{ik}^{m}$, focuses on minimizing the membership degrees of a labeled outlier to all clusters and maximizing the membership degree of a labeled normal object to the cluster it belongs to. With a proper choice of $\alpha_{ik}$, we can balance the label information weight of every object and achieve the optimal fuzzy partition.

The virtue of the semisupervised indicator matrix $\mathbf{A}$ in objective function (8) can be elaborated as follows. Recalling the construction of $\mathbf{A}$, if $x_i$ is known to belong to the $k$th cluster, then $a_{ik}=-1$ and all the other entries in the $i$th row equal 1; thus minimizing $\alpha_{ik}a_{ik}u_{ik}^{m}$ in (8) maximizes the membership of $x_i$ to the $k$th cluster and simultaneously minimizes its memberships to the other clusters. If $x_i$ is labeled as an outlier, all elements in the $i$th row of $\mathbf{A}$ equal 1, so minimizing $\alpha_{ik}a_{ik}u_{ik}^{m}$ minimizes the memberships of $x_i$ to all clusters, for an outlier does not belong to any cluster. If $x_i$ is unlabeled, then $a_{ik}=0$ for all $k=1,2,\ldots,c$, and the term $\alpha_{ik}a_{ik}u_{ik}^{m}$ has no impact on objective function (8).
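To make this construction concrete, the following is a minimal NumPy sketch; the helper name and the encoding of its arguments are our own illustration, since the paper only prescribes the resulting entries $a_{ik}\in\{1,-1,0\}$.

```python
import numpy as np

def build_indicator_matrix(n, c, normal_labels=None, outlier_ids=None):
    """Sketch of the semisupervised indicator matrix A for n objects, c clusters.

    normal_labels: dict {object index: cluster index} for labeled normal objects.
    outlier_ids:   iterable of indices of labeled outliers.
    Unlabeled objects keep a_ik = 0 for all k, so their term vanishes in (8).
    """
    A = np.zeros((n, c))
    for i, k in (normal_labels or {}).items():
        A[i, :] = 1.0    # discourage membership in the other clusters
        A[i, k] = -1.0   # encourage membership in the labeled cluster
    for i in (outlier_ids or []):
        A[i, :] = 1.0    # penalize an outlier's memberships to all clusters
    return A
```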

3.2. Solutions to the Objective Function. In this subsection, an iterative algorithm for minimizing $J_{\text{SSOD-AFW}}$ with respect to $u_{ik}$, $v_{jk}$, and $o_{kj}$ is derived, similar to classical FCM.

First, in order to minimize $J_{\text{SSOD-AFW}}$ with respect to $\mathbf{V}$, the matrices $\mathbf{U}$ and $\mathbf{O}$ are fixed and the parameters $\alpha_{ik}$ ($i=1,2,\ldots,n$; $k=1,2,\ldots,c$) are treated as constants. The Lagrange function is constructed as

$$L_{1}=\sum_{i=1}^{n}\sum_{k=1}^{c}u_{ik}^{m}d_{ik}^{2}+\frac{\beta}{m^{2}\sqrt{c}}\sum_{i=1}^{n}\sum_{k=1}^{c}\left(u_{ik}^{m}\log u_{ik}^{m}-u_{ik}^{m}\right)+\sum_{i=1}^{n}\sum_{k=1}^{c}\alpha_{ik}a_{ik}u_{ik}^{m}-\sum_{k=1}^{c}\gamma_{k}\left(\sum_{j=1}^{N}v_{jk}-1\right), \tag{10}$$

where $\gamma_{k}$ ($k=1,2,\ldots,c$) are the Lagrange multipliers.

By taking the gradient of $L_{1}$ with respect to $v_{jk}$ and setting it to zero, we obtain

$$\frac{\partial L_{1}}{\partial v_{jk}}=qv_{jk}^{q-1}\sum_{i=1}^{n}u_{ik}^{m}\left(x_{ij}-o_{kj}\right)^{2}-\gamma_{k}=0. \tag{11}$$

Then

$$v_{jk}=\left[\frac{\gamma_{k}}{q\sum_{i=1}^{n}u_{ik}^{m}\left(x_{ij}-o_{kj}\right)^{2}}\right]^{1/(q-1)}. \tag{12}$$

Substituting (12) into (9), we have

$$\sum_{j=1}^{N}v_{jk}=\left[\frac{\gamma_{k}}{q}\right]^{1/(q-1)}\sum_{j=1}^{N}\left[\frac{1}{\sum_{i=1}^{n}u_{ik}^{m}\left(x_{ij}-o_{kj}\right)^{2}}\right]^{1/(q-1)}=1. \tag{13}$$

It follows that

$$\left[\frac{\gamma_{k}}{q}\right]^{1/(q-1)}=\frac{1}{\sum_{j=1}^{N}\left[1\Big/\sum_{i=1}^{n}u_{ik}^{m}\left(x_{ij}-o_{kj}\right)^{2}\right]^{1/(q-1)}}. \tag{14}$$

The updating criterion for the feature weights $v_{jk}$ ($1\le j\le N$, $1\le k\le c$) is thus obtained:

$$v_{jk}=\frac{\left[1\Big/\sum_{i=1}^{n}u_{ik}^{m}\left(x_{ij}-o_{kj}\right)^{2}\right]^{1/(q-1)}}{\sum_{l=1}^{N}\left[1\Big/\sum_{i=1}^{n}u_{ik}^{m}\left(x_{il}-o_{kl}\right)^{2}\right]^{1/(q-1)}}. \tag{15}$$

This updating rule implies that the larger the deviation of all samples from the $k$th cluster prototype along the $j$th feature, the smaller the weight of the $j$th feature. That is, if the data are compactly distributed around the $k$th cluster prototype in the $j$th feature dimension, the $j$th feature plays a significant role in forming the $k$th cluster, whereas irrelevant features are assigned smaller weights to reduce their negative impact on the clustering process.
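As an illustration, update (15) can be vectorized as in the following sketch (our own naming; the small constant guarding against division by zero is an implementation detail the paper does not discuss):

```python
import numpy as np

def update_feature_weights(X, U, O, m, q, eps=1e-12):
    """Update V = (v_jk) per (15). X: (n, N) data, U: (n, c), O: (c, N)."""
    Um = U ** m                                    # (n, c)
    # s[j, k] = sum_i u_ik^m (x_ij - o_kj)^2 : per-feature deviation to prototype k
    diff2 = (X[:, None, :] - O[None, :, :]) ** 2   # (n, c, N)
    s = np.einsum('ik,ikj->jk', Um, diff2)         # (N, c)
    w = (1.0 / (s + eps)) ** (1.0 / (q - 1.0))
    return w / w.sum(axis=0, keepdims=True)        # each column sums to 1, as in (9)
```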

To find the optimal cluster prototypes $\mathbf{O}$, we fix $\mathbf{U}$ and $\mathbf{V}$ and again treat the parameters $\alpha_{ik}$ ($i=1,2,\ldots,n$; $k=1,2,\ldots,c$) as constants. Taking the gradient of $J_{\text{SSOD-AFW}}$ with respect to $o_{kj}$ and setting it to zero gives

$$\frac{\partial J_{\text{SSOD-AFW}}}{\partial o_{kj}}=\sum_{i=1}^{n}\left(-2u_{ik}^{m}v_{jk}^{q}\left(x_{ij}-o_{kj}\right)\right)=0. \tag{16}$$

The updating formula of the cluster prototype $o_{kj}$ follows:

$$o_{kj}=\frac{\sum_{i=1}^{n}u_{ik}^{m}x_{ij}}{\sum_{i=1}^{n}u_{ik}^{m}}. \tag{17}$$

To solve for the optimal fuzzy partition matrix $\mathbf{U}$, we fix $\mathbf{O}$ and $\mathbf{V}$ and treat the parameters $\alpha_{ik}$ ($i=1,2,\ldots,n$; $k=1,2,\ldots,c$) as constants. Setting the gradient of $J_{\text{SSOD-AFW}}$ with respect to $u_{ik}$ to zero,

$$\frac{\partial J_{\text{SSOD-AFW}}}{\partial u_{ik}}=mu_{ik}^{m-1}d_{ik}^{2}+\frac{\beta}{m\sqrt{c}}u_{ik}^{m-1}\log u_{ik}^{m}+\alpha_{ik}a_{ik}mu_{ik}^{m-1}=0. \tag{18}$$

The updating formula of $u_{ik}$ is derived as

$$u_{ik}=\exp\left(-\frac{m\sqrt{c}}{\beta}\left(d_{ik}^{2}+\alpha_{ik}a_{ik}\right)\right). \tag{19}$$

Formula (19) indicates that a large weighted distance $d_{ik}$ leads to a small value of $u_{ik}$, for all $1\le i\le n$, $1\le k\le c$. Note that the membership degree $u_{ik}$ also depends on the coefficient $\alpha_{ik}$. The choice of $\alpha_{ik}$ is important to the performance of the SSOD-AFW algorithm because it determines the importance of the third term relative to the other terms in objective function (8). If $\alpha_{ik}$ is too small, the third term is neglected and the labels of objects do not help shape the cluster structure; if $\alpha_{ik}$ is too large, the other terms are neglected and the negative influence of possible mislabels is amplified. The value of $\alpha_{ik}$ should therefore have the same order of magnitude as the first term in (8). To determine $\alpha_{ik}$ adaptively, in all experiments described in this paper we choose $\alpha_{ik}$ proportional to $d_{ik}^{2}$:

$$\alpha_{ik}=Kd_{ik}^{2}, \tag{20}$$

where $K\in(0,1)$ is a constant. Since the weighted distance $d_{ik}$ is dynamically updated, the value of the parameter $\alpha_{ik}$ is adaptively updated in each iteration.
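The coupled updates (20) and (19) can be sketched as follows, assuming the weighted distance of (7) has the form $d_{ik}^{2}=\sum_{j=1}^{N}v_{jk}^{q}(x_{ij}-o_{kj})^{2}$, which is consistent with the derivative in (11):

```python
import numpy as np

def weighted_distances2(X, O, V, q):
    """d2[i, k] = sum_j v_jk^q (x_ij - o_kj)^2, the squared weighted distance of (7)."""
    diff2 = (X[:, None, :] - O[None, :, :]) ** 2    # (n, c, N)
    return np.einsum('jk,ikj->ik', V ** q, diff2)   # (n, c)

def update_memberships(d2, A, beta, m, c, K):
    """Update alpha per (20) and then U per (19). A is the indicator matrix."""
    alpha = K * d2                                  # alpha_ik = K * d_ik^2
    return np.exp(-(m * np.sqrt(c) / beta) * (d2 + alpha * A))
```

Note that for a labeled normal object ($a_{ik}=-1$) the exponent becomes $-(m\sqrt{c}/\beta)(1-K)d_{ik}^{2}$, so with $K\in(0,1)$ the memberships stay in $(0,1]$.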

3.3. Criteria for Outlier Identification. Based on the above analysis, outliers should hold low membership degrees to all clusters, so the sum of the memberships of an object to all clusters can be used to evaluate its outlying degree. For an object $x_i$, its outlying degree is defined as

$$\mathrm{OD}\left(x_{i}\right)=\sum_{k=1}^{c}u_{ik}. \tag{21}$$

Thus a small value of $\mathrm{OD}(x_i)$ indicates a high outlying possibility of object $x_i$. The outlying degree of each sample in a dataset is calculated and sorted in ascending order, and the suspicious outliers are found by extracting the top $M$ objects in the sorted sequence, where $M$ is a given number of outliers contained in the dataset or the number of outliers one needs.
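In code, this ranking step reduces to a few lines on the converged membership matrix (our own sketch):

```python
import numpy as np

def top_M_outliers(U, M):
    """Outlying degree OD(x_i) = sum_k u_ik per (21); the M smallest are flagged."""
    od = U.sum(axis=1)
    return np.argsort(od)[:M]   # indices of the M most suspicious objects
```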

In summary, the SSOD-AFW algorithm is described in Algorithm 1.

Algorithm 1 (semisupervised outlier detection based on adaptive feature weighted clustering (SSOD-AFW)).

Input: dataset $X$; the label information of some objects; the number of clusters $c$; parameters $m>1$, $K\in(0,1)$, $q>1$, $\varepsilon>0$; and the number of outliers $M$.

Output: $M$ suspicious outliers.

(1) Calculate the parameter $\beta$ according to (6); randomly initialize the matrix $\mathbf{U}^{(0)}=(u_{ik}^{(0)})_{n\times c}$; initialize all elements of $\mathbf{V}^{(0)}=(v_{jk}^{(0)})_{N\times c}$ as $1/N$; set the iteration counter $t=0$.
(2) Compute the matrix of cluster prototypes $\mathbf{O}^{(t+1)}=(o_{kj}^{(t+1)})_{c\times N}$ according to (17).
(3) Update the feature weight matrix $\mathbf{V}^{(t+1)}=(v_{jk}^{(t+1)})_{N\times c}$ by (15).
(4) Update the feature weighted distances $\mathbf{D}^{(t+1)}=(d_{ik}^{(t+1)})_{n\times c}$ by (7).
(5) Update the parameters $\alpha_{ik}^{(t+1)}$ by (20).
(6) Update the membership degree matrix $\mathbf{U}^{(t+1)}=(u_{ik}^{(t+1)})_{n\times c}$ according to (19).
(7) If $\|\mathbf{U}^{(t+1)}-\mathbf{U}^{(t)}\|<\varepsilon$, go to step (8); otherwise set $t=t+1$ and repeat steps (2)-(7).
(8) Compute the outlying degree of each object and sort the OD values in ascending order; finally, output the top $M$ outliers with the smallest outlying degrees.

Computational complexity analysis: Step (2) needs $O(cNn)$ operations to compute the $c$ cluster prototypes; computing the weights of the $N$ features in Step (3) also costs $O(cNn)$; Step (4) requires $O(cNn)$ to compute the weighted distances of the $n$ objects to the $c$ cluster prototypes; Step (5) needs $O(cn)$ to compute the parameters $\alpha_{ik}$; and Step (6) needs $O(cn)$ operations to calculate the memberships of the $n$ objects to the $c$ clusters. Therefore the overall computational complexity per iteration is $O(cNn)$, the same as that of the classical FCM algorithm.
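A minimal end-to-end sketch of Algorithm 1, reusing the helper sketches given above, might look as follows. The default parameter values mirror those used in the experiments of Section 4, and the stand-in for $\beta$ is an assumption, since formula (6) is not restated in this part of the paper:

```python
import numpy as np

def ssod_afw(X, A, c, m=2.1, K=0.65, q=1.53, eps=1e-5, max_iter=100, seed=0):
    """End-to-end sketch of Algorithm 1 (SSOD-AFW).

    Assumes build_indicator_matrix, update_feature_weights,
    weighted_distances2, and update_memberships from the sketches above.
    beta should come from (6); we use the mean per-feature variance as a
    stand-in for 'the sample covariance', which is our own assumption.
    """
    n, N = X.shape
    rng = np.random.default_rng(seed)
    beta = X.var(axis=0).mean()                  # stand-in for (6)
    U = rng.random((n, c))                       # random initial memberships
    V = np.full((N, c), 1.0 / N)                 # v_jk initialized to 1/N
    for _ in range(max_iter):
        O = (U ** m).T @ X / (U ** m).sum(axis=0)[:, None]  # prototypes, (17)
        V = update_feature_weights(X, U, O, m, q)           # (15)
        d2 = weighted_distances2(X, O, V, q)                # (7)
        U_new = update_memberships(d2, A, beta, m, c, K)    # (19)-(20)
        if np.linalg.norm(U_new - U) < eps:                 # step (7)
            U = U_new
            break
        U = U_new
    return U, V, O
```

The $M$ most suspicious objects are then obtained with `top_M_outliers(U, M)` from the earlier sketch.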

3.4. Proof of Convergence. In this section we discuss the convergence of the proposed SSOD-AFW algorithm. To prove the convergence of objective function $J_{\text{SSOD-AFW}}$ in (8) under the iteration of $\mathbf{V}$, $\mathbf{O}$, and $\mathbf{U}$ with formulas (15), (17), and (19), it suffices to prove that $J_{\text{SSOD-AFW}}$ is monotonically decreasing and bounded after a finite number of iterations. Lemmas 2, 3, and 4 verify the monotonically decreasing property of $J_{\text{SSOD-AFW}}$ with respect to $\mathbf{V}$, $\mathbf{O}$, and $\mathbf{U}$, respectively, and Lemma 5 establishes the boundedness of $J_{\text{SSOD-AFW}}$.

Lemma 2. Objective function $J_{\text{SSOD-AFW}}$ in (8) is nonincreasing under updates of $\mathbf{U}=(u_{ik})_{n\times c}$ with formula (19).

Proof. Since $\mathbf{V}$ and $\mathbf{O}$ are fixed when updating $\mathbf{U}$ by (19), the objective function can be regarded as a function of $\mathbf{U}$ alone, denoted $J_{\text{SSOD-AFW}}(\mathbf{U})$. By the Lagrangian multiplier technique, $\mathbf{U}^{*}$ computed via (19) is a stationary point of $J_{\text{SSOD-AFW}}(\mathbf{U})$. If, in addition, the Hessian matrix $\nabla^{2}J_{\text{SSOD-AFW}}(\mathbf{U})$ is positive definite at $\mathbf{U}^{*}$, then $J_{\text{SSOD-AFW}}(\mathbf{U})$ attains a local minimum at $\mathbf{U}^{*}$. The Hessian matrix is

$$\nabla^{2}J_{\text{SSOD-AFW}}\left(\mathbf{U}^{*}\right)=\begin{bmatrix}\dfrac{\partial^{2}J_{\text{SSOD-AFW}}\left(\mathbf{U}^{*}\right)}{\partial u_{11}\,\partial u_{11}}&\cdots&\dfrac{\partial^{2}J_{\text{SSOD-AFW}}\left(\mathbf{U}^{*}\right)}{\partial u_{11}\,\partial u_{nc}}\\ \vdots&\ddots&\vdots\\ \dfrac{\partial^{2}J_{\text{SSOD-AFW}}\left(\mathbf{U}^{*}\right)}{\partial u_{nc}\,\partial u_{11}}&\cdots&\dfrac{\partial^{2}J_{\text{SSOD-AFW}}\left(\mathbf{U}^{*}\right)}{\partial u_{nc}\,\partial u_{nc}}\end{bmatrix}. \tag{22}$$

$\nabla^{2}J_{\text{SSOD-AFW}}(\mathbf{U}^{*})$ is a diagonal matrix whose diagonal elements are

$$\frac{\partial^{2}J_{\text{SSOD-AFW}}\left(\mathbf{U}^{*}\right)}{\partial u_{ik}\,\partial u_{ik}}=\frac{\beta}{\sqrt{c}}u_{ik}^{m-2}. \tag{23}$$

Since $u_{ik}>0$, the Hessian matrix $\nabla^{2}J_{\text{SSOD-AFW}}(\mathbf{U}^{*})$ is positive definite. Having shown that $\mathbf{U}^{*}$ is a stationary point of $J_{\text{SSOD-AFW}}(\mathbf{U})$ with a positive definite Hessian, we conclude that $\mathbf{U}^{*}$ is a local minimum of $J_{\text{SSOD-AFW}}(\mathbf{U})$. Then $J_{\text{SSOD-AFW}}(\mathbf{U}^{(t)})\ge J_{\text{SSOD-AFW}}(\mathbf{U}^{*})=J_{\text{SSOD-AFW}}(\mathbf{U}^{(t+1)})$, where $\mathbf{U}^{(t)}$ is the membership matrix after the $t$th iteration of (19) and $\mathbf{U}^{*}=\mathbf{U}^{(t+1)}$ is the one after the $(t+1)$th iteration. Therefore, objective function (8) is nonincreasing under updates of $\mathbf{U}$ by formula (19).

Lemma 3. Objective function $J_{\text{SSOD-AFW}}$ in (8) is nonincreasing under updates of $\mathbf{V}=(v_{jk})_{N\times c}$ using (15).

Proof. Similar to Lemma 2, with $\mathbf{U}$ and $\mathbf{O}$ fixed, it suffices to prove that the Hessian matrix of the Lagrangian of $J_{\text{SSOD-AFW}}(\mathbf{V})$ is positive definite at the point $\mathbf{V}^{*}$ computed by (15). The Hessian matrix is denoted $H(\mathbf{V})=(h_{jk,lm}(\mathbf{V}))$, whose elements are

$$h_{jk,lm}\left(\mathbf{V}\right)=\frac{\partial^{2}L_{1}\left(\mathbf{V}\right)}{\partial v_{jk}\,\partial v_{lm}}=\begin{cases}q\left(q-1\right)v_{jk}^{q-2}\displaystyle\sum_{i=1}^{n}u_{ik}^{m}\left(x_{ij}-o_{kj}\right)^{2},&j=l,\;k=m,\\[2mm]0,&\text{otherwise}.\end{cases} \tag{24}$$

Since $q>1$ and $v_{jk}>0$, the diagonal entries of this diagonal matrix are positive. Therefore the Hessian matrix $H(\mathbf{V})$ is positive definite, and $J_{\text{SSOD-AFW}}(\mathbf{V})$ attains a local minimum at $\mathbf{V}^{*}$ computed by (15). This completes the proof.

Lemma 4. Objective function $J_{\text{SSOD-AFW}}$ in (8) is nonincreasing when $\mathbf{O}=(o_{kj})_{c\times N}$ is updated using (17).

The proof of Lemma 4 is similar to that of Lemma 2.

Lemma 5. Objective function $J_{\text{SSOD-AFW}}$ in (8) is bounded; that is, there exists a constant $B$ satisfying $J_{\text{SSOD-AFW}}<B$.

Proof. For all $i\in\{1,\ldots,n\}$ and $k\in\{1,\ldots,c\}$ we have $0<u_{ik}\le 1$. Thus $u_{ik}^{m}\log u_{ik}^{m}-u_{ik}^{m}$ is monotonically decreasing with respect to $u_{ik}$, and hence $-1\le u_{ik}^{m}\log u_{ik}^{m}-u_{ik}^{m}<0$. The first and third terms of the objective function are likewise bounded. In consequence, objective function $J_{\text{SSOD-AFW}}$ in (8) is bounded.

Theorem 6. The objective function $J_{\text{SSOD-AFW}}$ converges to a local minimum by updating $\mathbf{V}$, $\mathbf{O}$, and $\mathbf{U}$ with formulas (15), (17), and (19), subject to the constraints in (9).

Proof. Lemmas 2, 3, and 4 verify that the objective function in (8) is nonincreasing under the iterations (15), (17), and (19), and Lemma 5 shows that $J_{\text{SSOD-AFW}}$ has a finite bound. Although the parameter $\alpha_{ik}$ is updated during the iteration process, it is a constant within each subproblem solved by the Lagrangian multiplier technique, so $\alpha_{ik}$ does not affect the convergence of the SSOD-AFW algorithm. Combining the above conclusions, $J_{\text{SSOD-AFW}}$ converges to a local minimum through the iterations of $\mathbf{V}$, $\mathbf{O}$, and $\mathbf{U}$ by (15), (17), and (19).

4. Experiments and Analysis

Comprehensive experiments on a synthetic dataset and several real-world datasets are conducted to show the effectiveness and superiority of the proposed SSOD-AFW. We compare the proposed algorithm with two state-of-the-art unsupervised outlier detection algorithms, LOF [15] and KNN [13], one supervised method, SVDD [32], and one semisupervised method, EODSP [23].

4.1. Evaluation Metrics. For numerical performance evaluation of outlier detection algorithms, three metrics, namely, accuracy [7], AUC [33], and rank-power [16], are used in this paper.

Let $M$ be the number of true outliers that a dataset $X$ contains, and let $S$ denote the number of true outliers detected by an algorithm. In the experiments, the top $M$ most suspicious instances are reported. Then the accuracy is given by

$$\text{Accuracy}=\frac{S}{M}. \tag{25}$$

The receiver operating characteristic (ROC) curve represents the trade-off between the detection rate and the false alarm rate. In general, the area under the ROC curve (AUC) is used to measure the performance of an outlier detection method; the AUC of an ideal detector is close to one.

For a given outlier detection algorithm, if the true outliers occupy the top positions relative to the nonoutliers among the $M$ suspicious instances, the rank-power (RP) of the algorithm is said to be high. If $S$ is the number of true outliers found within the top $M$ instances and $R_i$ denotes the rank of the $i$th true outlier, then RP is given by

$$\mathrm{RP}=\frac{S\left(S+1\right)}{2\sum_{i=1}^{S}R_{i}}. \tag{26}$$

RP reaches its maximum value of 1 when all $S$ true outliers occupy the top $S$ positions; a larger RP implies better performance of an algorithm.
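For instance, accuracy (25) and RP (26) can be computed from a ranked output as follows (our own sketch; `ranked_ids` is the list of all objects sorted by ascending OD, and `true_outliers` is the ground-truth set):

```python
def accuracy_and_rank_power(ranked_ids, true_outliers, M):
    """Accuracy = S/M per (25); RP = S(S+1) / (2 * sum of ranks) per (26)."""
    top = ranked_ids[:M]
    ranks = [r + 1 for r, i in enumerate(top) if i in true_outliers]  # 1-based
    S = len(ranks)
    acc = S / M
    rp = S * (S + 1) / (2 * sum(ranks)) if S > 0 else 0.0
    return acc, rp
```

If all $S$ detected true outliers sit in the top $S$ positions, the rank sum is $S(S+1)/2$ and RP evaluates to 1, as stated above.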

4.2. Experiments on the Synthetic Dataset. A two-dimensional synthetic dataset with two cluster patterns is generated from Gaussian distributions to intuitively compare the outlier detection results of the proposed method against the other four algorithms mentioned above. The mean vectors of the two clusters are $\mu_1=(7.5,9)^{\mathrm{T}}$ and $\mu_2=(1,3)^{\mathrm{T}}$, respectively, and the covariance matrices of both are $\Sigma_1=\Sigma_2=\begin{pmatrix}1&0\\0&2\end{pmatrix}$. As Figure 2(a) shows, a total of 199 samples are contained in the synthetic dataset: 183 normal samples (within the two clusters) and 16 outliers (scattered between the two clusters). Thirteen normal objects are labeled and marked with the symbol "×", and 5 outliers are labeled and marked with the symbol "∗", while the remaining samples are unlabeled and marked with "·". Figures 2(b)-2(f) illustrate the outlier detection results on the synthetic dataset using LOF, KNN, SVDD, EODSP, and SSOD-AFW, respectively, where the red symbols "∘" denote the detected suspicious outliers. Here the neighborhood-size parameter $k$ in LOF and KNN is set to 3. A Gaussian kernel is chosen for SVDD, with bandwidth $\sigma=0.3$ and trade-off coefficient $C=0.45$. In EODSP, the Euclidean distance threshold is $T=0.1$ and the percentage of the negative set is $l=10\%$. The parameter settings of the proposed algorithm are $m=2.3$, $K=0.85$, and $q=2.23$. For all algorithms other than SVDD, the 16 most suspicious objects are taken as the detected outliers.
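Under the stated distribution parameters, a comparable dataset can be generated as in the sketch below; the exact random draws, the 92/91 split of the 183 normal samples, and the uniform placement of the 16 outliers between the clusters are our own assumptions:

```python
import numpy as np

rng = np.random.default_rng(42)
mu1, mu2 = np.array([7.5, 9.0]), np.array([1.0, 3.0])
cov = np.array([[1.0, 0.0], [0.0, 2.0]])            # shared covariance matrix
cluster1 = rng.multivariate_normal(mu1, cov, 92)     # 183 normal samples in total
cluster2 = rng.multivariate_normal(mu2, cov, 91)
# 16 outliers scattered between the two clusters (placement is our assumption)
outliers = rng.uniform(low=[0.0, 2.0], high=[8.0, 10.0], size=(16, 2))
X = np.vstack([cluster1, cluster2, outliers])        # 199 x 2 data matrix
```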

In Figure 2 it is noticeable that the unsupervised methods LOF and KNN, as well as the supervised SVDD, fail to detect all of the 5 labeled outliers; worse, some normal points inside the clusters are misjudged as outliers. In contrast, the semisupervised EODSP algorithm and the proposed SSOD-AFW successfully detect all 5 labeled outliers. However, EODSP does not detect all the unlabeled true outliers, and several true normal samples are improperly identified as outliers. It can be concluded from Figure 2 that the proposed algorithm finds all the true outliers in the synthetic dataset while excluding the normal samples, whereas the other methods do not.

Figure 3 numerically presents the performance of LOF, KNN, SVDD, EODSP, and SSOD-AFW on the synthetic dataset. The accuracy, AUC, and RP of the proposed algorithm all reach 1, outperforming the other methods.

Furthermore, in the experiment shown in Figure 3, the feature weights of the synthetic dataset learned by formula (15) are $v_{11}=0.6727$, $v_{12}=0.5985$, $v_{21}=0.3273$, and $v_{22}=0.4015$. To demonstrate the effectiveness of the feature weights in the proposed SSOD-AFW algorithm, a comparison of the weighted and nonweighted versions is carried out on the synthetic dataset. The detection result of the nonweighted variant is presented in Figure 4: it tags 15 true outlying samples and 1 normal sample as outliers, missing one unlabeled true outlier.

4.3. Experiments on Real-World Datasets

4.3.1. Introduction of the Datasets. For further verification of the effectiveness of the proposed algorithm, five real-world datasets from the UCI Machine Learning Repository [34] (Iris, Abalone, Wine, Ecoli, and Breast Cancer Wisconsin (WDBC)) are employed to test the performance of the proposed algorithm against LOF, KNN, SVDD, and EODSP. As mentioned by Aggarwal and Yu [35], one way to test an outlier detection algorithm is to run it on a dataset and calculate the percentage of detected points belonging to the rare classes. Accordingly, for each of the five datasets, a small number of samples from the same class are randomly selected as outlying (target) objects. For instance, the original Iris dataset contains 150 objects, 50 in each of its three classes.

[Figure 2: Outlier detection results of different algorithms on the two-dimensional synthetic dataset. (a) Original synthetic dataset; (b) LOF; (c) KNN; (d) SVDD; (e) EODSP; (f) SSOD-AFW. Each panel shows unlabeled samples, labeled normal samples, labeled outliers, and the outliers detected by the respective algorithm.]

[Figure 3: Performance comparison of different algorithms (LOF, KNN, SVDD, EODSP, and SSOD-AFW) on the synthetic dataset. (a) Accuracy; (b) AUC; (c) RP.]

We randomly selected 26 objects from the class "Iris-virginica" as target outliers, and all objects in the other two classes are considered normal objects. The other four datasets are preprocessed similarly, and a more detailed description of the five real-world datasets is given in Table 1.

4.3.2. Experimental Result Analysis. We compare the outlier detection performance of the proposed algorithm with LOF, KNN, SVDD, and EODSP on the real-world datasets. Each method has its own parameters, and the detailed settings are as follows. The parameters of the proposed algorithm are $m=2.1$, $K=0.65$, and $q=1.53$ for all five datasets.

Table 1: Description of the real-world datasets.

Dataset | Instances | Features | Outlying classes | Outliers (percent) | Clusters | Prior information
Iris | 126 | 4 | "Virginica" | 26 (20.63%) | 2 | 10 labeled normal samples, 4 labeled outliers
Abalone | 4177 | 8 | "1"-"4", "16"-"27", "29" | 335 (8.02%) | 11 | 11 labeled normal samples, 18 labeled outliers
Wine | 130 | 13 | "3" | 11 (8.46%) | 2 | 9 labeled normal samples, 4 labeled outliers
Ecoli | 336 | 9 | "omL", "imL", "imS" | 9 (2.68%) | 5 | 11 labeled normal samples, 3 labeled outliers
WDBC | 387 | 30 | "Malignant" | 30 (7.75%) | 1 | 10 labeled normal samples, 8 labeled outliers

[Figure 4: Outlier detection result of the nonweighted SSOD-AFW on the synthetic dataset, showing unlabeled samples, labeled normal samples, labeled outliers, and the outliers detected by the nonweighted SSOD-AFW.]

The strategy of parameter selection for SSOD-AFW is discussed below in the subsection on parameter analysis. For the other algorithms, the parameters are set exactly as in their references. It is well known that LOF and KNN depend strongly on the neighborhood parameter $k$; here we set $k=5$ for Iris and WDBC, $k=3$ for Abalone, and $k=10$ for Wine and Ecoli. For SVDD, a Gaussian kernel is employed with bandwidth $\sigma=0.45$ and $C=0.5$ on all five real-world datasets. In EODSP, the Euclidean distance threshold is $T=0.1$ with negative-set percentage $l=30\%$ for the Iris and Abalone datasets, and $T=0.5$, $l=30\%$ for Ecoli, Wine, and WDBC. Since outliers are randomly selected from the target classes of each dataset, each experiment is repeated 10 times with the same number of different outliers, and the average accuracy, AUC, and RP are taken as the performance criteria of the various detection methods.

Figure 5 illustrates the outlier detection results of the SSOD-AFW algorithm against LOF, KNN, SVDD, and EODSP on the five real-world datasets. As can be seen, the proposed algorithm accurately identifies outliers according to the cluster structure of a dataset under the guidance of label knowledge, showing distinct superiority over the other unsupervised (LOF, KNN), semisupervised (EODSP), and supervised (SVDD) methods. In particular, the outlier detection accuracy of SSOD-AFW in Figure 5(a) is significantly higher than the others, especially on Iris and Wine. Figure 5(b) shows that the AUC of our method is higher than the others on all datasets except WDBC. In terms of RP, SSOD-AFW performs better than the other four algorithms on Iris and Wine, whereas it is slightly poorer than SVDD on Abalone, poorer than LOF on Ecoli, and poorer than KNN on WDBC, as seen in Figure 5(c).

It is worth mentioning that the experiment on WDBC involves a one-class clustering problem. Although one-class clustering is generally meaningless as a clustering task, one-class clustering-based outlier detection is especially meaningful and feasible in our proposal, because our approach does not require the membership degrees to sum to 1. This is one of the powerful and important characteristics of the proposed algorithm.

4.3.3. The Influence of the Proportion of Labeled Data on Outlier Detection. In this subsection we investigate the influence of the proportion of labeled samples on the accuracy of our methodology. Two typical situations are considered and tested. In the first, the proportion of labeled outliers increases while the number of labeled normal objects is fixed; in the second, the percentage of labeled normal samples varies while the number of labeled outliers is fixed. Two groups of experiments are thus designed to compare the accuracies of the proposed algorithm against EODSP under different percentages of labeled outliers and labeled normal samples, respectively, on the datasets Iris, Abalone, Wine, Ecoli, and WDBC.

[Figure 5: Performance comparison of various algorithms (LOF, KNN, SVDD, EODSP, and SSOD-AFW) on the real-world datasets. (a) Accuracy; (b) AUC; (c) RP.]

In the two experiments, the percentage of labeled outliers or labeled normal samples ranges from 0% to 40% while the number of the other kind of labeled objects is fixed. We randomly select the required number of labeled outliers or normal samples from each dataset, repeat each experiment 10 times, and compute the average accuracies of SSOD-AFW and EODSP.

Figure 6 shows the results of the first group of experiments, where the percentage of labeled outliers varies from 0% to 40%. One can see from Figure 6 that the accuracies of the two semisupervised algorithms roughly increase as the proportion of labeled outliers grows, which strongly supports the idea that semisupervised outlier detection algorithms can improve detection accuracy by using prior information. Furthermore, SSOD-AFW achieves better accuracy than EODSP for the same proportion of labeled outliers on all five datasets; on Wine in particular, the accuracy of SSOD-AFW is 40% higher than that of EODSP. EODSP addresses the problem of detecting outliers with only a few labeled outliers as training data: its labeled normal instances are extracted according to the maximum entropy principle, where the entropy is computed only from the distances between each testing sample and the labeled outliers. This makes EODSP less flexible than our proposed method, owing to this information deficiency.

Figure 7 illustrates the accuracy comparison of the proposed algorithm and EODSP when the proportion of labeled normal samples increases from 0% to 40% and the percentage of labeled outliers is fixed. Our method obtains better accuracy than EODSP on all five real-world datasets, and the accuracy of the proposed algorithm grows as the percentage of labeled normal samples increases. As mentioned, EODSP performs semisupervised outlier detection with only a few labeled outliers in the initial dataset, without considering any labeled normal objects. Therefore the accuracy of the EODSP algorithm remains unchanged across the various proportions of labeled normal objects, always equaling its accuracy at 0% labeled normal samples.

4.3.4. Parameter Analysis. The parameters $m$, $K$, and $q$ are important in the proposed algorithm and affect the performance of SSOD-AFW. In this section, the influence of each parameter on outlier detection accuracy is studied.

The parameter $m$ is the fuzzification coefficient. Figure 8(a) analyzes the relationship between the outlier detection accuracy of the proposed algorithm and $m$, with $m$ varying from 1.5 to 5.0. The results imply that the highest accuracy is achieved when $m$ ranges in $[2,4]$, so it is reasonable that $m$ was set to 2.1 in the experiments shown in Figure 5. The parameter $K\in(0,1)$ controls the importance of the label information in the outlier detection result. Detection accuracies are tested by varying $K$ from 0.1 to 0.9, as shown in Figure 8(b); the overall tendency is that accuracy grows as $K$ increases, and the best results occur and remain stable when $K\ge 0.7$. Finally, Figure 8(c) shows that the feature weight index $q$ has little influence on the accuracy of SSOD-AFW when the other parameters are kept fixed, so the proposed algorithm is not sensitive to $q$; in general, $q$ is suggested to be a constant from $(1,4]$.

4.3.5. Execution Time Analysis. Figure 9 reports the average running time of the proposed algorithm against the other algorithms on the five real-world datasets. The experimental environment is Windows XP, MATLAB 7.1, a 3 GHz CPU, and 2 GB of RAM. Because the Abalone dataset is far larger than the other four, the running times differ greatly across datasets; to facilitate the display, the horizontal coordinate axis in Figure 9 is translated downward a short distance. The results indicate that the proposed algorithm is more time-saving than the other four typical outlier detection algorithms, except for KNN on the Wine dataset.

[Figure 6: Accuracy analysis of algorithms EODSP and SSOD-AFW with different percentages of labeled outliers on the real-world datasets. (a) Iris; (b) Abalone; (c) Wine; (d) Ecoli; (e) WDBC.]

On the whole, the execution time of SSOD-AFW is comparable to that of KNN and less than those of the other algorithms on most of the datasets.

5. Conclusions

In order to detect outliers more precisely, a semisupervised outlier detection algorithm based on adaptive feature weighted clustering, called SSOD-AFW, is proposed in this paper. Distinct weights of each feature with respect to different clusters are obtained by adaptive iteration, so that the negative effects of irrelevant features on outlier detection are weakened. Moreover, the proposed method makes full use of the prior knowledge contained in datasets and detects outliers by virtue of the cluster structure. A series of experiments verifies that the proposed SSOD-AFW algorithm is superior to other typical unsupervised, semisupervised, and supervised algorithms in both outlier detection precision and running speed.

[Figure 7: Accuracy analysis of algorithms EODSP and SSOD-AFW with different percentages of labeled normal samples on the real-world datasets. (a) Iris; (b) Abalone; (c) Wine; (d) Ecoli; (e) WDBC.]


In this paper we presented a new semisupervised outlier detection method which utilizes the labels of a small number of objects. However, our method assumes that the labels of objects are reliable and does not consider mislabel punishment in the new objective function; therefore, a robust version of the proposed method that deals with noisy or imperfect labels deserves further study.

[Figure 8: Outlier detection accuracy of the proposed algorithm under various parameters on the real-world datasets (Iris, Abalone, Wine, Ecoli, WDBC). (a) Parameter $m$; (b) parameter $K$; (c) parameter $q$.]

[Figure 9: Execution time comparison (running time in seconds) of the different algorithms (LOF, KNN, SVDD, EODSP, and SSOD-AFW) on the real-world datasets.]

Moreover, since only one typical dissimilarity measure, the Euclidean distance, is discussed in our method, the SSOD-AFW algorithm is limited to outlier detection for numerical data. Future research aims at extending our method to mixed-attribute data in more real-life applications, such as fault diagnosis in industrial processes or network anomaly detection.

Competing Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (Grant no. 11471001).

References

[1] J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, Elsevier, 2011.

[2] D. M. Hawkins, Identification of Outliers, Chapman & Hall, London, UK, 1980.

[3] V. Barnett and T. Lewis, Outliers in Statistical Data, John Wiley & Sons, Chichester, UK, 1994.

[4] V. J. Hodge and J. Austin, "A survey of outlier detection methodologies," Artificial Intelligence Review, vol. 22, no. 2, pp. 85-126, 2004.

[5] B. Sheng, Q. Li, W. Mao, and W. Jin, "Outlier detection in sensor networks," in Proceedings of the 8th ACM International Symposium on Mobile Ad Hoc Networking and Computing (MobiHoc '07), pp. 219-228, Montreal, Canada, September 2007.

[6] A. P. James and S. Dimitrijev, "Inter-image outliers and their application to image classification," Pattern Recognition, vol. 43, no. 12, pp. 4101-4112, 2010.

[7] J. Huang, Q. Zhu, L. Yang, and J. Feng, "A non-parameter outlier detection algorithm based on Natural Neighbor," Knowledge-Based Systems, vol. 92, pp. 71-77, 2016.

[8] J. M. Shepherd and S. J. Burian, "Detection of urban-induced rainfall anomalies in a major coastal city," Earth Interactions, vol. 7, no. 4, pp. 1-17, 2003.

[9] O. Alan and C. Catal, "Thresholds based outlier detection approach for mining class outliers: an empirical case study on software measurement datasets," Expert Systems with Applications, vol. 38, no. 4, pp. 3440-3445, 2011.

[10] B. Schölkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson, "Estimating the support of a high-dimensional distribution," Neural Computation, vol. 13, no. 7, pp. 1443-1471, 2001.

[11] D. Tax, A. Ypma, and R. Duin, "Support vector data description applied to machine vibration analysis," in Proceedings of the 5th Annual Conference of the Advanced School for Computing and Imaging, pp. 398-405, Heijen, The Netherlands, June 1999.

[12] D. M. J. Tax and R. P. W. Duin, "Support vector data description," Machine Learning, vol. 54, no. 1, pp. 45-66, 2004.

[13] S. Ramaswamy, R. Rastogi, and K. Shim, "Efficient algorithms for mining outliers from large data sets," SIGMOD Record (ACM Special Interest Group on Management of Data), vol. 29, no. 2, pp. 427-438, 2000.

[14] E. M. Knox and R. T. Ng, "Algorithms for mining distance based outliers in large datasets," in Proceedings of the International Conference on Very Large Data Bases, pp. 392-403, New York, NY, USA, 1998.

[15] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander, "LOF: identifying density-based local outliers," ACM SIGMOD Record, vol. 29, no. 2, pp. 93-104, 2000.

[16] J. Ha, S. Seok, and J.-S. Lee, "A precise ranking method for outlier detection," Information Sciences, vol. 324, pp. 88-107, 2015.

[17] R. N. Dave, "Characterization and detection of noise in clustering," Pattern Recognition Letters, vol. 12, no. 11, pp. 657-664, 1991.

[18] R. Smith, A. Bivens, M. Embrechts, C. Palagiri, and B. Szymanski, "Clustering approaches for anomaly based intrusion detection," in Proceedings of Intelligent Engineering Systems Through Artificial Neural Networks, pp. 579-584, 2002.

[19] Y. Shi and L. Zhang, "COID: a cluster-outlier iterative detection approach to multi-dimensional data analysis," Knowledge and Information Systems, vol. 28, no. 3, pp. 709-733, 2011.

[20] J. Zhao, K. Liu, W. Wang, and Y. Liu, "Adaptive fuzzy clustering based anomaly data detection in energy system of steel industry," Information Sciences, vol. 259, pp. 335-345, 2014.

[21] F. Angiulli and C. Pizzuti, "Fast outlier detection in high dimensional spaces," in Proceedings of the 6th European Conference on Principles of Data Mining and Knowledge Discovery (PKDD '02), vol. 2, pp. 15-26, August 2002.

[22] M. Radovanovic, A. Nanopoulos, and M. Ivanovic, "Reverse nearest neighbors in unsupervised distance-based outlier detection," IEEE Transactions on Knowledge and Data Engineering, vol. 27, no. 5, pp. 1369-1382, 2015.

[23] A. Daneshpazhouh and A. Sami, "Entropy-based outlier detection using semi-supervised approach with few positive examples," Pattern Recognition Letters, vol. 49, pp. 77-84, 2014.

[24] J. Gao, H. Cheng, and P.-N. Tan, "Semi-supervised outlier detection," in Proceedings of the ACM Symposium on Applied Computing, pp. 635-636, ACM, Dijon, France, April 2006.

[25] Z. Xue, Y. Shang, and A. Feng, "Semi-supervised outlier detection based on fuzzy rough C-means clustering," Mathematics and Computers in Simulation, vol. 80, no. 9, pp. 1911-1921, 2010.

[26] J. Z. Huang, M. K. Ng, H. Rong, and Z. Li, "Automated variable weighting in k-means type clustering," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 5, pp. 657-668, 2005.

[27] H. Frigui and O. Nasraoui, "Unsupervised learning of prototypes and attribute weights," Pattern Recognition, vol. 37, no. 3, pp. 567-581, 2004.

[28] J. Zhou, L. Chen, C. P. Chen, Y. Zhang, and H. Li, "Fuzzy clustering with the entropy of attribute weights," Neurocomputing, vol. 198, pp. 125-134, 2016.

[29] M. Hassan, A. Chaudhry, A. Khan, and M. A. Iftikhar, "Robust information gain based fuzzy c-means clustering and classification of carotid artery ultrasound images," Computer Methods and Programs in Biomedicine, vol. 113, no. 2, pp. 593-609, 2014.

[30] R. Krishnapuram and J. M. Keller, "A possibilistic approach to clustering," IEEE Transactions on Fuzzy Systems, vol. 1, no. 2, pp. 98-110, 1993.

[31] M.-S. Yang and K.-L. Wu, "Unsupervised possibilistic clustering," Pattern Recognition, vol. 39, no. 1, pp. 5-21, 2006.

[32] S. M. Guo, L. C. Chen, and J. S. H. Tsai, "A boundary method for outlier detection based on support vector domain description," Pattern Recognition, vol. 42, no. 1, pp. 77-83, 2009.

[33] T. Fawcett, "ROC graphs: notes and practical considerations for researchers," Machine Learning, vol. 31, no. 1, pp. 1-38, 2004.

[34] C. Blake and C. J. Merz, "UCI repository of machine learning databases," 1998.

[35] C. C. Aggarwal and P. S. Yu, "Outlier detection for high dimensional data," in Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD '01), Santa Barbara, Calif, USA, May 2001.


Mathematical Problems in Engineering 5

Substituting (12) into (9) we have

119873sum119895=1

V119895119896 = [120574119896119902 ]1(119902minus1) 119873sum119895=1

[[1

sum119899119894=1 119906119898119894119896 (119909119894119895 minus 119900119896119895)2]]1(119902minus1)

= 1(13)

It follows that

[120574119896119902 ]1(119902minus1) = 1sum119873119895=1 [1sum119899119894=1 119906119898119894119896 (119909119894119895 minus 119900119896119895)2]1(119902minus1) (14)

The updating criteria of feature weight V119895119896 (1 le 119895 le 1198731 le 119896 le 119888) are obtainedV119895119896 = [1sum119899119894=1 119906119898119894119896 (119909119894119895 minus 119900119896119895)2]1(119902minus1)

sum119873119897=1 [1sum119899119894=1 119906119898119894119896 (119909119894119897 minus 119900119896119897)2]1(119902minus1) (15)

The updating way of V119895119896 implies that the larger the devi-ation degrees from all samples to the 119896th cluster prototyperegarding the 119895th feature are the smaller the weight of the119895th feature is That is if the distribution of all data is compactaround the 119896th cluster prototype in the 119895th feature spacethe 119895th feature plays a significant role in formulating the 119896thcluster Meanwhile irrelevant features thus are assigned asmaller weight to reduce the negative impact of them on theclustering process

To find the optimal cluster prototype O we assume Uand V are fixed and the parameters 120572119894119896 (119894 = 1 2 119899119896 = 1 2 119888) are also constants We take the gradient of119869SSOD-AFW with respect to 119900119896119895 and set it to zero

120597119869SSOD-AFW120597119900119896119895 = 119899sum119894=1

(minus2119906119898119894119896V119895119896 (119909119894119895 minus 119900119896119895)) = 0 (16)

The updating formula of cluster prototype 119900119896119895 is obtainedas follows

119900119896119895 = sum119899119894=1 119906119898119894119896119909119894119895sum119899119894=1 119906119898119894119896 (17)

To solve the optimal fuzzy partition matrixU we assumeO and V are fixed and the parameters 120572119894119896 (119894 = 1 2 119899 119896 =1 2 119888) are also constantsWe set the gradient of 119869SSOD-AFWwith respect to 119906119894119896 to zero120597119869SSOD-AFW120597119906119894119896 = 119898119906119898minus1119894119896 1198892119894119896 + 120573119898radic119888119906119898minus1119894119896 log 119906119898119894119896

+ 120572119894119896119886119894119896119898119906119898minus1119894119896 = 0 (18)

The updating formula of 119906119894119896 is derived as follows

119906119894119896 = exp(minus119898radic119888120573 (1198892119894119896 + 120572119894119896119886119894119896)) (19)

Formula (19) indicates that a large value of weighteddistance 119889119894119896 leads to a smaller value of 119906119894119896 for all 1 le 119894 le 119899

1 le 119896 le 119888 It should be noted that the membership degree 119906119894119896is also dependent on the coefficient 120572119894119896 The choice of 120572119894119896 isimportant to the performance of the SSOD-AFW algorithmbecause it serves in distinguishing the importance of the thirdterm relative to the other terms in objective function (8) If120572119894119896is too small the third term will be neglected and the labels ofobjects will not work to promote the cluster structure If 120572119894119896 istoo large the other terms will be neglected and the negativeinfluence of possiblemislabels of objects will be enlargedThevalue of 120572119894119896 should be chosen such that it has the same orderof magnitude with the first term in (8) To determine theparameter120572119894119896 in an adaptiveway in all experiments describedin this paper we choose 120572119894119896 proportional to 1198892119894119896 as follows

$$\alpha_{ik}=K\,d_{ik}^{2}, \qquad (20)$$

where K ∈ (0, 1) is a constant. Since the weighted distance d_ik is dynamically updated, the value of the parameter α_ik is adaptively updated in each iteration.

3.3. Criteria for Outlier Identification. Based on the above analysis, outliers should hold low membership degrees to all clusters. Therefore, the sum of the memberships of an object to all clusters can be used to evaluate its outlying degree. For an object x_i, its outlying degree is defined as

$$\mathrm{OD}\left(x_{i}\right)=\sum_{k=1}^{c}u_{ik}. \qquad (21)$$

Thus, a small value of OD(x_i) indicates a high outlying possibility for object x_i. The outlying degree of each sample in a dataset is calculated and the values are sorted in ascending order; the suspicious outliers are then found by extracting the top M objects of the sorted sequence, where M is the number of outliers contained in the dataset, or the number of outliers one needs.
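In code, given a converged membership matrix, this criterion takes only a few lines; the following hedged sketch (variable names our own) returns the indices of the M most suspicious objects:

```python
import numpy as np

def top_outliers(U, M):
    """Outlying degree (21): OD(x_i) = sum_k u_ik; the M objects with
    the smallest OD values are reported as suspicious outliers."""
    od = U.sum(axis=1)         # (n,) outlying degrees
    return np.argsort(od)[:M]  # ascending sort -> smallest OD first
```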

In summary, the SSOD-AFW method is described in Algorithm 1.

Algorithm 1 (semisupervised outlier detection based on adaptive feature weighted clustering (SSOD-AFW)).

Input: dataset X; the label information of some objects; the number of clusters c; parameters m > 1, K ∈ (0, 1), q > 1, ε > 0; and the number of outliers M.

Output: M suspicious outliers.

(1) Calculate the parameter β according to (6); randomly initialize the matrix U^(0) = (u_ik^(0))_{n×c}, and initialize all elements of V^(0) = (v_jk^(0))_{N×c} to 1/N. Set the iteration counter t = 0.
(2) Compute the cluster prototype matrix O^(t+1) = (o_kj^(t+1))_{c×N} according to (17).
(3) Update the feature weight matrix V^(t+1) = (v_jk^(t+1))_{N×c} by (15).
(4) Update the feature weighted distance matrix D^(t+1) = (d_ik^(t+1))_{n×c} by (7).
(5) Update the parameters α_ik^(t+1) by (20).
(6) Update the membership degree matrix U^(t+1) = (u_ik^(t+1))_{n×c} according to (19).
(7) If ‖U^(t+1) − U^(t)‖ < ε, go to step (8); otherwise set t = t + 1 and repeat steps (2) to (7).
(8) Compute the outlying degree of each object and sort the OD values in ascending order; finally, output the top M objects with the smallest outlying degrees.

Computational complexity analysis: step (2) needs O(cNn) operations to compute the c cluster prototypes; the complexity of computing the weights of the N features in step (3) is O(cNn); step (4) requires O(cNn) operations to compute the weighted distances of the n objects to the c cluster prototypes; step (5) needs O(cn) operations to compute the parameters α_ik of the n objects with respect to the c cluster prototypes; and step (6) needs O(cn) operations to calculate the memberships of the n objects to the c clusters. Therefore, the overall computational complexity per iteration is O(cNn), the same as that of the classical FCM algorithm.

3.4. Proof of Convergence. In this section we discuss the convergence of the proposed SSOD-AFW algorithm. To prove that the objective function J_SSOD-AFW in (8) converges under the iteration of V, O, and U with formulas (15), (17), and (19), it suffices to show that J_SSOD-AFW is monotonically decreasing and bounded. Lemmas 2, 3, and 4 below verify the nonincreasing property of J_SSOD-AFW with respect to U, V, and O, respectively, and Lemma 5 establishes the boundedness of J_SSOD-AFW.

Lemma 2. Objective function J_SSOD-AFW in (8) is nonincreasing when U = (u_ik)_{n×c} is updated with formula (19).

Proof. Since V and O are fixed when updating U by (19), the objective function J_SSOD-AFW can be regarded as a function of U alone, denoted J_SSOD-AFW(U). According to the Lagrangian multiplier technique, U* computed via (19) is a stationary point of J_SSOD-AFW(U). On the other hand, if the Hessian matrix ∇²J_SSOD-AFW(U) is shown to be positive definite at U*, then J_SSOD-AFW(U) attains a local minimum at U*. The Hessian matrix ∇²J_SSOD-AFW(U*) is expressed as

$$\nabla^{2}J_{\text{SSOD-AFW}}\left(\mathbf{U}^{*}\right)=\begin{bmatrix}\dfrac{\partial^{2}J_{\text{SSOD-AFW}}\left(\mathbf{U}^{*}\right)}{\partial u_{11}\,\partial u_{11}} & \cdots & \dfrac{\partial^{2}J_{\text{SSOD-AFW}}\left(\mathbf{U}^{*}\right)}{\partial u_{11}\,\partial u_{nc}}\\ \vdots & \ddots & \vdots\\ \dfrac{\partial^{2}J_{\text{SSOD-AFW}}\left(\mathbf{U}^{*}\right)}{\partial u_{nc}\,\partial u_{11}} & \cdots & \dfrac{\partial^{2}J_{\text{SSOD-AFW}}\left(\mathbf{U}^{*}\right)}{\partial u_{nc}\,\partial u_{nc}}\end{bmatrix}. \qquad (22)$$

∇²J_SSOD-AFW(U*) is a diagonal matrix, and its diagonal elements are

$$\frac{\partial^{2}J_{\text{SSOD-AFW}}\left(\mathbf{U}^{*}\right)}{\partial u_{ik}\,\partial u_{ik}}=\frac{\beta}{\sqrt{c}}\,u_{ik}^{m-2}. \qquad (23)$$
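For completeness, here is how (23) follows; the step uses (19) in the form log u_ik^m = −(m²√c/β)(d_ik² + α_ik a_ik). Differentiating (18) once more with respect to u_ik, the terms containing d_ik² + α_ik a_ik cancel at the stationary point:

$$\frac{\partial^{2}J_{\text{SSOD-AFW}}}{\partial u_{ik}^{2}}=m(m-1)\,u_{ik}^{m-2}\left(d_{ik}^{2}+\alpha_{ik}a_{ik}\right)+\frac{\beta}{m\sqrt{c}}\left[(m-1)\,u_{ik}^{m-2}\log u_{ik}^{m}+m\,u_{ik}^{m-2}\right]=\frac{\beta}{\sqrt{c}}\,u_{ik}^{m-2}.$$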

Since u_ik > 0, the Hessian matrix ∇²J_SSOD-AFW(U*) is positive definite. Having shown that U* is a stationary point of J_SSOD-AFW(U) and that ∇²J_SSOD-AFW(U*) is positive definite, we conclude that U* is a local minimum of J_SSOD-AFW(U). Then we have J_SSOD-AFW(U^(t)) ≥ J_SSOD-AFW(U*) = J_SSOD-AFW(U^(t+1)), where U^(t) is the membership matrix after the t-th iteration of (19) and U* = U^(t+1) is the one after the (t+1)-th iteration. Therefore, objective function J_SSOD-AFW in (8) is nonincreasing when U = (u_ik)_{n×c} is updated using formula (19).

Lemma 3. Objective function J_SSOD-AFW in (8) is nonincreasing when V = (v_jk)_{N×c} is updated using (15).

Proof. Similar to Lemma 2, when U and O are fixed, we just need to prove that the Hessian matrix of the Lagrangian of J_SSOD-AFW(V) at V* is positive definite, where V* is computed by (15). The Hessian matrix is denoted H(V) = (h_{jk,lm}(V)), whose elements are

$$h_{jk,lm}\left(\mathbf{V}\right)=\frac{\partial^{2}L_{1}\left(\mathbf{V}\right)}{\partial v_{jk}\,\partial v_{lm}}=\begin{cases}q(q-1)\,v_{jk}^{q-2}\displaystyle\sum_{i=1}^{n}u_{ik}^{m}\left(x_{ij}-o_{kj}\right)^{2}, & \text{if } j=l,\ k=m,\\[1mm] 0, & \text{otherwise.}\end{cases} \qquad (24)$$

Since q > 1 and v_jk > 0, the diagonal entries of this diagonal matrix are clearly positive. Therefore, the Hessian matrix H(V) is positive definite, and J_SSOD-AFW(V) attains its local minimum at V* computed by (15). This completes the proof.

Lemma 4. Objective function J_SSOD-AFW in (8) is nonincreasing when O = (o_kj)_{c×N} is updated using (17).

The proof of Lemma 4 is similar to that of Lemma 2.

Lemma 5. Objective function J_SSOD-AFW in (8) is bounded; that is, there exists a constant B such that J_SSOD-AFW < B.

Proof. For all i ∈ {1, …, n} and all k ∈ {1, …, c}, we have 0 < u_ik ≤ 1. Thus u_ik^m log u_ik^m − u_ik^m is monotonically decreasing with respect to u_ik, and hence −1 ≤ u_ik^m log u_ik^m − u_ik^m < 0, so the entropy term of the objective function is bounded. Apparently, the remaining terms of J_SSOD-AFW are also bounded. In consequence, objective function J_SSOD-AFW in (8) is bounded.
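To spell out the bound used in the proof: writing s = u_ik^m ∈ (0, 1] and f(s) = s log s − s, one has

$$f'(s)=\log s<0\ \text{on}\ (0,1),\qquad \lim_{s\to 0^{+}}f(s)=0,\qquad f(1)=-1,$$

so f decreases from 0 toward −1 on (0, 1], which gives −1 ≤ u_ik^m log u_ik^m − u_ik^m < 0 for every i and k.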

Theorem 6. The objective function J_SSOD-AFW converges to a local minimum by updating V, O, and U using formulas (15), (17), and (19), subject to the constraints in (9).

Proof. Lemmas 2, 3, and 4 verify that the objective function in (8) is nonincreasing under the iterations according to (15), (17), and (19), and Lemma 5 shows that J_SSOD-AFW has a finite bound. Although the parameter α_ik is updated during the iteration process, it is treated as a constant in each subproblem solved with the Lagrangian multiplier technique, so α_ik does not affect the convergence of the SSOD-AFW algorithm. Combining the above conclusions, J_SSOD-AFW converges to a local minimum through the iterations of V, O, and U by (15), (17), and (19).

4. Experiments and Analysis

Comprehensive experiments on a synthetic dataset and several real-world datasets are conducted to show the effectiveness and superiority of the proposed SSOD-AFW. We compare the proposed algorithm with two state-of-the-art unsupervised outlier detection algorithms, LOF [15] and KNN [13], one supervised method, SVDD [32], and one semisupervised method, EODSP [23].

4.1. Evaluation Metrics. For numerical performance evaluation of outlier detection algorithms, three metrics, namely, accuracy [7], AUC [33], and rank-power [16], are used in this paper.

Let M be the number of true outliers that a dataset X contains, and let S denote the number of true outliers detected by an algorithm. In the experiments, the top M most suspicious instances are reported. Then the accuracy is given by

$$\text{Accuracy}=\frac{S}{M}. \qquad (25)$$

The receiver operating characteristic (ROC) curve represents the trade-off between the detection rate and the false alarm rate. In general, the area under the ROC curve (AUC) is used to measure the performance of an outlier detection method; the AUC of an ideal detector is close to one.

For a given outlier detection algorithm, if the true outliers occupy the top positions relative to the nonoutliers among the M suspicious instances, the rank-power (RP) of the algorithm is said to be high. If S is the number of true outliers found within the top M instances and R_i denotes the rank of the i-th true outlier, then the rank-power is given by

$$\mathrm{RP}=\frac{S\left(S+1\right)}{2\sum_{i=1}^{S}R_{i}}. \qquad (26)$$

RP reaches its maximum value of 1 when all S true outliers occupy the top S positions; a larger RP implies better performance of an algorithm.
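As a hedged sketch (variable names our own; ties in scores are ignored), the three metrics can be computed from an algorithm's outlying scores as follows; the AUC here uses the standard Mann–Whitney rank formulation, one of several equivalent ways to obtain the area under the ROC curve:

```python
import numpy as np

def evaluate(scores, y_true, M):
    """Accuracy (25), AUC, and rank-power (26) for an outlier ranking.
    scores: (n,) outlying scores (higher = more suspicious);
    y_true: (n,) boolean, True for true outliers; M: number reported."""
    order = np.argsort(-scores)               # most suspicious first
    top = y_true[order][:M]
    S = int(top.sum())
    accuracy = S / M                          # eq. (25)
    ranks = np.nonzero(top)[0] + 1            # 1-based ranks of true outliers
    rp = S * (S + 1) / (2 * ranks.sum()) if S > 0 else 0.0  # eq. (26)
    # Rank-based AUC: P(score of random outlier > score of random inlier)
    n_pos, n_neg = int(y_true.sum()), int((~y_true).sum())
    all_ranks = scores.argsort().argsort() + 1  # 1-based ascending ranks
    auc = (all_ranks[y_true].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
    return accuracy, auc, rp
```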

4.2. Experiments on Synthetic Dataset. A two-dimensional synthetic dataset with two cluster patterns is generated from Gaussian distributions to intuitively compare the outlier detection results of the proposed method with those of the other four algorithms mentioned above. The mean vectors of the two clusters are μ1 = (7.5, 9)^T and μ2 = (1, 3)^T, respectively, and both covariance matrices are Σ1 = Σ2 = diag(1, 2). As Figure 2(a) shows, the synthetic dataset contains a total of 199 samples: 183 normal samples (within the two clusters) and 16 outliers (cluttered between the two clusters). Thirteen normal objects are labeled and marked with the symbol "×", and 5 outliers are labeled and marked with the symbol "∗", while the remaining samples are unlabeled and marked with "·". Figures 2(b)–2(f) illustrate the outlier detection results on the synthetic dataset obtained by LOF, KNN, SVDD, EODSP, and SSOD-AFW, respectively, where the red symbols "∘" denote the detected suspicious outliers. Here the neighborhood size k in LOF and KNN is set to 3; the Gaussian kernel is chosen in SVDD with bandwidth σ = 0.3 and trade-off coefficient C = 0.45; the Euclidean distance threshold in EODSP is T = 0.1 and the percentage of the negative set is l = 10%. The parameter settings of the proposed algorithm are m = 2.3, K = 0.85, and q = 2.23. For all algorithms except SVDD, the top 16 objects with the highest outlying scores are taken as the detected outliers.
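For reproducibility, a hedged NumPy sketch of how such a dataset could be generated is given below; the exact sampling scheme for the 16 interstitial outliers is not specified in the paper, so scattering them uniformly in the box between the two cluster centers is our own assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
mu1, mu2 = np.array([7.5, 9.0]), np.array([1.0, 3.0])
cov = np.array([[1.0, 0.0], [0.0, 2.0]])       # Sigma_1 = Sigma_2 = diag(1, 2)
cluster1 = rng.multivariate_normal(mu1, cov, size=92)
cluster2 = rng.multivariate_normal(mu2, cov, size=91)  # 183 normal samples
# 16 outliers cluttered between the clusters (uniform box is our assumption)
outliers = rng.uniform(low=[1.0, 3.0], high=[7.5, 9.0], size=(16, 2))
X = np.vstack([cluster1, cluster2, outliers])          # 199 samples total
```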

In Figure 2 it is noticeable that the unsupervised methods LOF and KNN, as well as the supervised SVDD, fail to detect all of the 5 labeled outliers; moreover, some normal points within the clusters are badly misjudged as outliers. In contrast, the semisupervised EODSP algorithm and the proposed SSOD-AFW algorithm successfully detect all 5 labeled outliers. However, EODSP does not detect all of the unlabeled true outliers, and several truly normal samples are improperly identified as outliers. It can be concluded from Figure 2 that the proposed algorithm finds all the true outliers in the synthetic dataset while excluding the normal samples, whereas the other methods do not.

Figure 3 numerically presents the performance evaluation of outlier detection using LOF, KNN, SVDD, EODSP, and SSOD-AFW on the synthetic dataset. From Figure 3 we see that the accuracy, AUC, and RP of the proposed algorithm all reach 1, outperforming the other methods.

Furthermore, in the experiment of Figure 3, the feature weights of the synthetic dataset learned by formula (15) in our method are v11 = 0.6727, v12 = 0.5985, v21 = 0.3273, and v22 = 0.4015. To demonstrate the effectiveness of feature weighting in the proposed SSOD-AFW algorithm, a comparative analysis of the weighted and nonweighted versions is conducted on the synthetic dataset. The outlier detection result of the nonweighted variant is presented in Figure 4: it ends up tagging 15 true outlying samples and 1 normal sample as outliers, missing one unlabeled true outlier.

4.3. Experiments on Real-World Datasets

4.3.1. Introduction of Datasets. For further verification of the effectiveness of the proposed algorithm, five real-world datasets from the UCI Machine Learning Repository [34] (i.e., Iris, Abalone, Wine, Ecoli, and Breast Cancer Wisconsin (WDBC)) are employed to test the performance of the proposed algorithm against LOF, KNN, SVDD, and EODSP. As mentioned in Aggarwal and Yu [35], one way to test the performance of an outlier detection algorithm is to run it on a dataset and calculate the percentage of detected points belonging to the rare classes. Therefore, for each of the five datasets, a small number of samples from the same class are randomly selected as outlying (target) objects. For instance, the original Iris dataset contains 150 objects, with 50 objects in each of three classes.

Figure 2: Outlier detection results of different algorithms on the two-dimensional synthetic dataset. Panels: (a) original synthetic dataset; (b) LOF; (c) KNN; (d) SVDD; (e) EODSP; (f) SSOD-AFW. Markers distinguish unlabeled samples, labeled normal samples, labeled outliers, and the outliers detected by each algorithm.

Figure 3: Performance comparison of different algorithms (LOF, KNN, SVDD, EODSP, SSOD-AFW) on the synthetic dataset. Panels: (a) accuracy; (b) AUC; (c) RP. SSOD-AFW attains 1.0000 on all three metrics.

We randomly selected 26 objects from the class "Iris-virginica" as target outliers, and all objects in the other two classes are considered normal objects. The other four datasets are preprocessed similarly; a more detailed description of the five real-world datasets is given in Table 1.

4.3.2. Experimental Result Analysis. We compare the outlier detection performance of the proposed algorithm with LOF, KNN, SVDD, and EODSP on the real-world datasets. Each method has its own parameters, and the detailed parameter settings of each algorithm are as follows. The parameters of the proposed algorithm are m = 2.1, K = 0.65, and q = 1.53 for all five datasets.

Table 1: Description of real-world datasets.

| Dataset | Instances | Features | Outlying classes | Outliers (percent) | Clusters | Prior information |
|---|---|---|---|---|---|---|
| Iris | 126 | 4 | "Virginica" | 26 (20.63%) | 2 | 10 labeled normal samples, 4 labeled outliers |
| Abalone | 4177 | 8 | "1"–"4", "16"–"27", "29" | 335 (8.02%) | 11 | 11 labeled normal samples, 18 labeled outliers |
| Wine | 130 | 13 | "3" | 11 (8.46%) | 2 | 9 labeled normal samples, 4 labeled outliers |
| Ecoli | 336 | 9 | "omL", "imL", "imS" | 9 (2.68%) | 5 | 11 labeled normal samples, 3 labeled outliers |
| WDBC | 387 | 30 | "Malignant" | 30 (7.75%) | 1 | 10 labeled normal samples, 8 labeled outliers |

Figure 4: Outlier detection result of the nonweighted SSOD-AFW on the synthetic dataset. Markers distinguish unlabeled samples, labeled normal samples, labeled outliers, and the outliers detected by the nonweighted SSOD-AFW.

The strategy of parameter selection for SSOD-AFW is discussed in the later subsection on parameter analysis. For the other algorithms, the parameters are set exactly as in their original references. It is well known that LOF and KNN depend strongly on the neighborhood parameter k; here we set k = 5 for Iris and WDBC, k = 3 for Abalone, and k = 10 for Wine and Ecoli. For SVDD, the Gaussian kernel is employed with bandwidth σ = 0.45 and trade-off coefficient C = 0.5 on all five datasets. In EODSP, the Euclidean distance threshold is T = 0.1 and the percentage of the negative set is l = 30% for Iris and Abalone, and T = 0.5, l = 30% for Ecoli, Wine, and WDBC. Since outliers are randomly selected from the target classes of each dataset, each experiment is repeated 10 times with the same number of (different) outliers, and the average accuracy, AUC, and RP are reported as the performance criteria of the various detection methods.

Figure 5 illustrates the outlier detection results of the SSOD-AFW algorithm against LOF, KNN, SVDD, and EODSP on the five real-world datasets. As can be seen from Figure 5, the proposed algorithm accurately identifies outliers according to the cluster structure of a dataset under the guidance of the label knowledge, showing a distinct advantage over the unsupervised (LOF, KNN), semisupervised (EODSP), and supervised (SVDD) methods. In particular, the outlier detection accuracy of SSOD-AFW in Figure 5(a) is significantly higher than that of the others, especially on Iris and Wine. Figure 5(b) shows that the AUC values of our method are higher than those of the others on all datasets except WDBC. In terms of RP, SSOD-AFW performs better than the other four algorithms on Iris and Wine, whereas it is slightly poorer than SVDD on Abalone, poorer than LOF on Ecoli, and poorer than KNN on WDBC, as seen in Figure 5(c).

It is worth mentioning that the experiment on WDBC involves a one-class clustering problem. Although one-class clustering is generally meaningless as a clustering task, one-class clustering-based outlier detection is meaningful and feasible in our proposal, because our approach does not require the membership degrees to sum to 1. This is one of the powerful and important characteristics of the proposed algorithm.

4.3.3. The Influence of the Proportion of Labeled Data on Outlier Detection. In this subsection we investigate the influence of the proportion of labeled samples on the accuracy of our methodology. Two typical situations are considered and tested. In the first, the proportion of labeled outliers increases while the number of labeled normal objects is fixed; in the second, the percentage of labeled normal samples varies while the quantity of labeled outliers is fixed. Accordingly, two groups of experiments are designed to compare the accuracies of the proposed algorithm against EODSP under different percentages of labeled outliers and labeled normal samples, respectively.

Figure 5: Performance comparison of various algorithms (LOF, KNN, SVDD, EODSP, SSOD-AFW) on the real-world datasets Iris, Abalone, Wine, Ecoli, and WDBC. Panels: (a) accuracy; (b) AUC; (c) RP.

The experiments are conducted on the datasets Iris, Abalone, Wine, Ecoli, and WDBC. In the two experiments, the percentage of labeled outliers or labeled normal samples ranges from 0% to 40% while the number of the other kind of labeled objects is fixed. We randomly select the required number of labeled outliers or labeled normal samples from each dataset; each experiment is repeated 10 times, and the average accuracies of SSOD-AFW and EODSP are computed.

Figure 6 shows the results of the first group of experiments, where the percentage of labeled outliers varies from 0% to 40%. One can see from Figure 6 that the accuracies of the two semisupervised algorithms roughly increase as the proportion of labeled outliers grows. This strongly supports the idea that semisupervised outlier detection algorithms can improve detection accuracy by using prior information. Furthermore, SSOD-AFW achieves better accuracy than EODSP for the same proportion of labeled outliers on all five datasets; for Wine in particular, the accuracy of SSOD-AFW is 40% higher than that of EODSP. EODSP addresses the problem of detecting outliers with only a few labeled outliers as training data; its labeled normal instances are extracted according to the maximum entropy principle, where the entropy is computed using only the distances between each test sample and the labeled outliers. This information deficiency makes EODSP less flexible than our proposed method.

Figure 7 illustrates the accuracy comparison of the proposed algorithm and EODSP when the proportion of labeled normal samples increases from 0% to 40% and the percentage of labeled outliers is fixed. Note that our method obtains better accuracy than EODSP on all five real-world datasets, and the accuracy of the proposed algorithm grows as the percentage of labeled normal samples increases. As mentioned above, EODSP performs semisupervised outlier detection using only a few labeled outliers in the initial dataset, without considering any labeled normal objects. Therefore, the accuracy of EODSP remains flat across the various proportions of labeled normal objects and always equals its accuracy at 0% labeled normal samples.

4.3.4. Parameter Analysis. The parameters m, K, and q are important in the proposed algorithm and affect the performance of SSOD-AFW. In this section, the influence of each parameter on outlier detection accuracy is studied.

The parameter m is the fuzzification coefficient. Figure 8(a) shows the relationship between the outlier detection accuracy of the proposed algorithm and m, with m varying from 1.5 to 5.0. The results imply that the highest accuracy is achieved when m lies in [2, 4], so setting m = 2.1 in the experiments of Figure 5 is reasonable. The parameter K ∈ (0, 1) controls the importance of the label information in the outlier detection result. Detection accuracies are tested by varying K from 0.1 to 0.9, as shown in Figure 8(b); the overall tendency is that the accuracy increases with K, and the best results occur and remain stable when K ≥ 0.7. Finally, Figure 8(c) shows that the feature weight index q has little influence on the accuracy of SSOD-AFW when the other parameters are kept fixed, so the proposed algorithm is not sensitive to q; in general, q is suggested to be a constant from (1, 4].

4.3.5. Execution Time Analysis. Figure 9 shows the average running time of the proposed algorithm against the other algorithms on the five real-world datasets. The experimental environment is Windows XP, MATLAB 7.1, a 3 GHz CPU, and 2 GB of RAM. Because the Abalone dataset is far larger than the other four, the running times on the different datasets differ markedly; to facilitate the display, the horizontal coordinate axis in Figure 9 is translated downward by a short distance. The results indicate that the proposed algorithm is more time-saving than the other four typical outlier detection algorithms in most cases.

Figure 6: Accuracy of EODSP and SSOD-AFW with different percentages of labeled outliers (0%–40%) on the real-world datasets. Panels: (a) Iris; (b) Abalone; (c) Wine; (d) Ecoli; (e) WDBC.

The only exception is KNN on the Wine dataset. On the whole, the execution time of SSOD-AFW is comparable to that of KNN and less than those of the other algorithms on most of the datasets.

5. Conclusions

In order to detect outliers more precisely, a semisupervised outlier detection algorithm based on adaptive feature weighted clustering, called SSOD-AFW, is proposed in this paper. Distinct weights of each feature with respect to different clusters are obtained by adaptive iteration, so that the negative effects of irrelevant features on outlier detection are weakened. Moreover, the proposed method makes full use of the prior knowledge contained in datasets and detects outliers by virtue of the cluster structure. A series of experiments verifies that the proposed SSOD-AFW algorithm is superior to other typical unsupervised, semisupervised, and supervised algorithms in both outlier detection precision and running speed.

Figure 7: Accuracy of EODSP and SSOD-AFW with different percentages of labeled normal samples (0%–40%) on the real-world datasets. Panels: (a) Iris; (b) Abalone; (c) Wine; (d) Ecoli; (e) WDBC.

The proposed method utilizes the labels of only a small number of objects. However, it assumes that these labels are reliable and does not consider mislabel punishment in the new objective function; a robust version of the proposed method that deals with noisy or imperfect labels therefore deserves further study. Moreover, since only one typical dissimilarity measure, namely the Euclidean distance, is used, the SSOD-AFW algorithm is limited to outlier detection for numerical data. Future research aims at extending our method to mixed-attribute data in more real-life applications, such as fault diagnosis in industrial processes or network anomaly detection.

Figure 8: Outlier detection accuracy of the proposed algorithm under various parameters on the real-world datasets. Panels: (a) parameter m (1.5–5.0); (b) parameter K (0.1–0.9); (c) parameter q (1.5–4.0).

Figure 9: Execution time comparison of different algorithms (LOF, KNN, SVDD, EODSP, SSOD-AFW) on the real-world datasets.


Competing Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (Grant no. 11471001).

References

[1] J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, Elsevier, 2011.

[2] D. M. Hawkins, Identification of Outliers, Chapman & Hall, London, UK, 1980.

[3] V. Barnett and T. Lewis, Outliers in Statistical Data, John Wiley & Sons, Chichester, UK, 1994.

[4] V. J. Hodge and J. Austin, "A survey of outlier detection methodologies," Artificial Intelligence Review, vol. 22, no. 2, pp. 85–126, 2004.

[5] B. Sheng, Q. Li, W. Mao, and W. Jin, "Outlier detection in sensor networks," in Proceedings of the 8th ACM International Symposium on Mobile Ad Hoc Networking and Computing (MobiHoc '07), pp. 219–228, Montreal, Canada, September 2007.

[6] A. P. James and S. Dimitrijev, "Inter-image outliers and their application to image classification," Pattern Recognition, vol. 43, no. 12, pp. 4101–4112, 2010.

[7] J. Huang, Q. Zhu, L. Yang, and J. Feng, "A non-parameter outlier detection algorithm based on Natural Neighbor," Knowledge-Based Systems, vol. 92, pp. 71–77, 2016.

[8] J. M. Shepherd and S. J. Burian, "Detection of urban-induced rainfall anomalies in a major coastal city," Earth Interactions, vol. 7, no. 4, pp. 1–17, 2003.

[9] O. Alan and C. Catal, "Thresholds based outlier detection approach for mining class outliers: an empirical case study on software measurement datasets," Expert Systems with Applications, vol. 38, no. 4, pp. 3440–3445, 2011.

[10] B. Schölkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson, "Estimating the support of a high-dimensional distribution," Neural Computation, vol. 13, no. 7, pp. 1443–1471, 2001.

[11] D. Tax, A. Ypma, and R. Duin, "Support vector data description applied to machine vibration analysis," in Proceedings of the 5th Annual Conference of the Advanced School for Computing and Imaging, pp. 398–405, Heijen, The Netherlands, June 1999.

[12] D. M. J. Tax and R. P. W. Duin, "Support vector data description," Machine Learning, vol. 54, no. 1, pp. 45–66, 2004.

[13] S. Ramaswamy, R. Rastogi, and K. Shim, "Efficient algorithms for mining outliers from large data sets," SIGMOD Record, vol. 29, no. 2, pp. 427–438, 2000.

[14] E. M. Knox and R. T. Ng, "Algorithms for mining distance-based outliers in large datasets," in Proceedings of the International Conference on Very Large Data Bases, pp. 392–403, New York, NY, USA, 1998.

[15] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander, "LOF: identifying density-based local outliers," ACM SIGMOD Record, vol. 29, no. 2, pp. 93–104, 2000.

[16] J. Ha, S. Seok, and J.-S. Lee, "A precise ranking method for outlier detection," Information Sciences, vol. 324, pp. 88–107, 2015.

[17] R. N. Dave, "Characterization and detection of noise in clustering," Pattern Recognition Letters, vol. 12, no. 11, pp. 657–664, 1991.

[18] R. Smith, A. Bivens, M. Embrechts, C. Palagiri, and B. Szymanski, "Clustering approaches for anomaly based intrusion detection," in Proceedings of Intelligent Engineering Systems Through Artificial Neural Networks, pp. 579–584, 2002.

[19] Y. Shi and L. Zhang, "COID: a cluster-outlier iterative detection approach to multi-dimensional data analysis," Knowledge and Information Systems, vol. 28, no. 3, pp. 709–733, 2011.

[20] J. Zhao, K. Liu, W. Wang, and Y. Liu, "Adaptive fuzzy clustering based anomaly data detection in energy system of steel industry," Information Sciences, vol. 259, pp. 335–345, 2014.

[21] F. Angiulli and C. Pizzuti, "Fast outlier detection in high dimensional spaces," in Proceedings of the 6th European Conference on Principles of Data Mining and Knowledge Discovery (PKDD '02), vol. 2, pp. 15–26, August 2002.

[22] M. Radovanović, A. Nanopoulos, and M. Ivanović, "Reverse nearest neighbors in unsupervised distance-based outlier detection," IEEE Transactions on Knowledge and Data Engineering, vol. 27, no. 5, pp. 1369–1382, 2015.

[23] A. Daneshpazhouh and A. Sami, "Entropy-based outlier detection using semi-supervised approach with few positive examples," Pattern Recognition Letters, vol. 49, pp. 77–84, 2014.

[24] J. Gao, H. Cheng, and P.-N. Tan, "Semi-supervised outlier detection," in Proceedings of the ACM Symposium on Applied Computing, pp. 635–636, ACM, Dijon, France, April 2006.

[25] Z. Xue, Y. Shang, and A. Feng, "Semi-supervised outlier detection based on fuzzy rough C-means clustering," Mathematics and Computers in Simulation, vol. 80, no. 9, pp. 1911–1921, 2010.

[26] J. Z. Huang, M. K. Ng, H. Rong, and Z. Li, "Automated variable weighting in k-means type clustering," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 5, pp. 657–668, 2005.

[27] H. Frigui and O. Nasraoui, "Unsupervised learning of prototypes and attribute weights," Pattern Recognition, vol. 37, no. 3, pp. 567–581, 2004.

[28] J. Zhou, L. Chen, C. P. Chen, Y. Zhang, and H. Li, "Fuzzy clustering with the entropy of attribute weights," Neurocomputing, vol. 198, pp. 125–134, 2016.

[29] M. Hassan, A. Chaudhry, A. Khan, and M. A. Iftikhar, "Robust information gain based fuzzy c-means clustering and classification of carotid artery ultrasound images," Computer Methods and Programs in Biomedicine, vol. 113, no. 2, pp. 593–609, 2014.

[30] R. Krishnapuram and J. M. Keller, "A possibilistic approach to clustering," IEEE Transactions on Fuzzy Systems, vol. 1, no. 2, pp. 98–110, 1993.

[31] M.-S. Yang and K.-L. Wu, "Unsupervised possibilistic clustering," Pattern Recognition, vol. 39, no. 1, pp. 5–21, 2006.

[32] S. M. Guo, L. C. Chen, and J. S. H. Tsai, "A boundary method for outlier detection based on support vector domain description," Pattern Recognition, vol. 42, no. 1, pp. 77–83, 2009.

[33] T. Fawcett, "ROC graphs: notes and practical considerations for researchers," Machine Learning, vol. 31, no. 1, pp. 1–38, 2004.

[34] C. Blake and C. J. Merz, "UCI repository of machine learning databases," 1998.

[35] C. C. Aggarwal and P. S. Yu, "Outlier detection for high dimensional data," in Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD '01), Santa Barbara, Calif, USA, May 2001.



5 Conclusions

In order to detect outliers more precisely a semisuper-vised outlier detection algorithm based on adaptive feature

weighted clustering called SSOD-AFW is proposed in thispaper Distinct weights of each feature with respect to differ-ent clusters are considered and obtained by adaptive iterationso that the negative effects of irrelevant features on outlierdetection are weakened Moreover the proposed methodmakes full use of the prior knowledge contained in datasetsand detects outliers in virtue of the cluster structure It isverified by a series of experiments that the proposed SSOD-AFW algorithm is superior to other typical unsupervised

12 Mathematical Problems in Engineering

0

1009080706050403020100

Accu

racy

Proportion of labeled normal samples5

EODSPSSOD-AFW

10 15 20 25 35 4030

(a) Iris

1009080706050403020100

Accu

racy

EODSPSSOD-AFW

0Proportion of labeled normal samples5 10 15 20 25 35 4030

(b) Abalone

1009080706050403020100

Accu

racy

EODSPSSOD-AFW

0Proportion of labeled normal samples5 10 15 20 25 35 4030

(c) Wine

1009080706050403020100

Accu

racy

EODSPSSOD-AFW

0Proportion of labeled normal samples

5 10 15 20 25 35 4030

(d) Ecoli

1009080706050403020100

Accu

racy

EODSPSSOD-AFW

0Proportion of labeled normal samples5 10 15 20 25 35 4030

(e) WDBC

Figure 7 Accuracy analysis of algorithms EODSP and SSOD-AFW with different percent of labeled normal samples on the real-worlddatasets

semisupervised and supervised algorithms in both outlierdetection precision and running speed

In this paper we present a new semisupervised outlierdetection method which utilizes labels of a small amountof objects However our method assumes that the labels

of objects are reliable and does not consider mislabel pun-ishment in the new objective function Therefore a robustversion of the proposed method dealing with noisy orimperfect labels of objects deserves further studiesMoreoversince only one typical dissimilaritymeasure named Euclidean

Mathematical Problems in Engineering 13

1009080706050403020100

Accu

racy

m

5045403530252015

IrisAbaloneWine

EcoliWDBC

(a) Parameter119898

K

1009080706050403020100

Accu

racy

090807060504030201

IrisAbaloneWine

EcoliWDBC

(b) Parameter119870

q

40

1009080706050403020100

3530252015

Accu

racy

IrisAbaloneWine

EcoliWDBC

(c) Parameter 119902

Figure 8 Outlier detection accuracy of the proposed algorithm under various parameters on the real-world datasets

Iris Abalone Wine Ecoli WDBCDatasets

50

45

40

35

30

25

20

15

10

5

0

Runn

ing

time (

s)

LOFKNNSVDD

EODSPSSOD-AFW

Figure 9 Execution time comparison of different algorithms on thereal-world datasets

distance is discussed in our method the SSOD-AFW algo-rithm is limited to outlier detection for numerical data Thefuture research aims at extending our method to mixed-attribute data in more real-life applications such as faultdiagnosis in industrial process or network anomaly detection

Competing Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

This work was supported by the National Natural ScienceFoundation of China (Grant no 11471001)

References

[1] J Han M Kamber and J Pei Data Mining Concepts andTechniques Elsevier 2011

[2] D M Hawkins Identification of Outliers Chapman amp HallLondon UK 1980

[3] V Bamnett and T LewisOutliers in Statistical Data JohnWileyamp Sons Chichester UK 1994

[4] V J Hodge and J Austin ldquoA survey of outlier detectionmethodologiesrdquo Artificial Intelligence Review vol 22 no 2 pp85ndash126 2004

[5] B Sheng Q Li W Mao and W Jin ldquoOutlier detection insensor networksrdquo in Proceedings of the 8th ACM InternationalSymposium on Mobile Ad Hoc Networking and Computing(MobiHoc rsquo07) pp 219ndash228Montreal Canada September 2007

[6] A P James and S Dimitrijev ldquoInter-image outliers and theirapplication to image classificationrdquo Pattern Recognition vol 43no 12 pp 4101ndash4112 2010

[7] J Huang Q Zhu L Yang and J Feng ldquoA non-parameter outlierdetection algorithm based on Natural Neighborrdquo Knowledge-Based Systems vol 92 pp 71ndash77 2016

[8] J M Shepherd and S J Burian ldquoDetection of urban-inducedrainfall anomalies in a major coastal cityrdquo Earth Interactionsvol 7 no 4 pp 1ndash17 2003

[9] O Alan and C Catal ldquoThresholds based outlier detectionapproach for mining class outliers an empirical case study onsoftware measurement datasetsrdquo Expert Systems with Applica-tions vol 38 no 4 pp 3440ndash3445 2011

[10] B Scholkopf J C Platt J Shawe-Taylor A J Smola and RC Williamson ldquoEstimating the support of a high-dimensionaldistributionrdquo Neural Computation vol 13 no 7 pp 1443ndash14712001

[11] D Tax A Ypma and R Duin ldquoSupport vector data descriptionapplied to machine vibration analysisrdquo in Proceedings of the 5thAnnual Conference of the Advanced School for Computing andImaging pp 398ndash405 Heijen The Netherlands June 1999

[12] DM J Tax andR PWDuin ldquoSupport vector data descriptionrdquoMachine Learning vol 54 no 1 pp 45ndash66 2004

[13] S Ramaswamy R Rastogi and K Shim ldquoEfficient algorithmsformining outliers from large data setsrdquo SIGMODRecord (ACMSpecial Interest Group onManagement of Data) vol 29 no 2 pp427ndash438 2000

14 Mathematical Problems in Engineering

[14] EM Knox and R T Ng ldquoAlgorithms formining distance basedoutliers in large datasetrdquo in Proceedings of the InternationalConference on Very Large Data Bases pp 392ndash403 CiteseerNew York NY USA 1998

[15] M M Breunig H-P Kriegel R T Ng and J Sander ldquoLOFidentifying density-based local outliersrdquoACMSIGMODRecordvol 29 no 2 pp 93ndash104 2000

[16] J Ha S Seok and J-S Lee ldquoA precise ranking method foroutlier detectionrdquo Information Sciences vol 324 pp 88ndash1072015

[17] R N Dave ldquoCharacterization and detection of noise in clus-teringrdquo Pattern Recognition Letters vol 12 no 11 pp 657ndash6641991

[18] R Smith A Bivens M Embrechts C Palagiri and B Szyman-ski ldquoClustering approaches for anomaly based intrusion detec-tionrdquo Proceedings of Intelligent Engineering Systems ThroughArtificial Neural Networks pp 579ndash584 2002

[19] Y Shi and L Zhang ldquoCOID a cluster-outlier iterative detectionapproach to multi-dimensional data analysisrdquo Knowledge andInformation Systems vol 28 no 3 pp 709ndash733 2011

[20] J Zhao K Liu W Wang and Y Liu ldquoAdaptive fuzzy clusteringbased anomaly data detection in energy system of steel indus-tryrdquo Information Sciences vol 259 pp 335ndash345 2014

[21] F Angiulli and C Pizzuti ldquoFast outlier detection in high dimen-sional spacesrdquo in Proceedings of the 6th European Conference onPrinciples of DataMining and Knowledge Discovery (PKDD rsquo02)vol 2 pp 15ndash26 August 2002

[22] M Radovanovic A Nanopoulos and M Ivanovic ldquoReversenearest neighbors in unsupervised distance-based outlier detec-tionrdquo IEEE Transactions on Knowledge and Data Engineeringvol 27 no 5 pp 1369ndash1382 2015

[23] A Daneshpazhouh and A Sami ldquoEntropy-based outlier detec-tion using semi-supervised approach with few positive exam-plesrdquo Pattern Recognition Letters vol 49 pp 77ndash84 2014

[24] J Gao H Cheng and P-N Tan ldquoSemi-supervised outlierdetectionrdquo in Proceedings of the ACM Symposium on AppliedComputing pp 635ndash636 ACM Dijon France April 2006

[25] Z Xue Y Shang and A Feng ldquoSemi-supervised outlier detec-tion based on fuzzy rough C-means clusteringrdquo Mathematicsand Computers in Simulation vol 80 no 9 pp 1911ndash1921 2010

[26] J Z Huang M K Ng H Rong and Z Li ldquoAutomated variableweighting in k-means type clusteringrdquo IEEE Transactions onPattern Analysis andMachine Intelligence vol 27 no 5 pp 657ndash668 2005

[27] H Friguiand and O Nasraou ldquoUnsupervised learning ofprototypes and attribute weightsrdquo Pattern Recognition vol 37no 3 pp 567ndash581 2004

[28] J Zhou L Chen C P Chen Y Zhang and H Li ldquoFuzzy clus-tering with the entropy of attribute weightsrdquo Neurocomputingvol 198 pp 125ndash134 2016

[29] M Hassan A Chaudhry A Khan and M A Iftikhar ldquoRobustinformation gain based fuzzy c-means clustering and classifi-cation of carotid artery ultrasound imagesrdquo Computer Methodsand Programs in Biomedicine vol 113 no 2 pp 593ndash609 2014

[30] R Krishnapuram and J M Keller ldquoA possibilistic approach toclusteringrdquo IEEE Transactions on Fuzzy Systems vol 1 no 2 pp98ndash110 1993

[31] M-S Yang and K-L Wu ldquoUnsupervised possibilistic cluster-ingrdquo Pattern Recognition vol 39 no 1 pp 5ndash21 2006

[32] SM Guo L C Chen and J S H Tsai ldquoA boundarymethod foroutlier detection based on support vector domain descriptionrdquoPattern Recognition vol 42 no 1 pp 77ndash83 2009

[33] T Fawcett ldquoRoc graphs notes and practical considerations forresearchersrdquoMachine Learning vol 31 no 1 pp 1ndash38 2004

[34] C Blake and C J Merz ldquoUCI repository of machine learningdatabasesrdquo 1998

[35] C C Aggarwal and P S Yu ldquoOutlier detection for highdimensional datardquo in Proceedings of the ACM SIGMOD Interna-tional Conference on Management of Data (SIGMOD rsquo01) SantaBarbara Calif USA May 2001

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of


minimum through alternating iterations of V, O, and U by (15), (17), and (19).

4. Experiments and Analysis

Comprehensive experiments and analysis on a synthetic dataset and several real-world datasets are conducted to show the effectiveness and superiority of the proposed SSOD-AFW. We compare the proposed algorithm with two state-of-the-art unsupervised outlier detection algorithms, LOF [15] and KNN [13], one supervised method, SVDD [32], and one semisupervised method, EODSP [23].
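For concreteness, the two unsupervised baselines can be reproduced with standard library routines. The following Python sketch (an illustration, not the authors' code) computes LOF scores via scikit-learn and the KNN outlier score of [13] as the distance to the k-th nearest neighbor:

    import numpy as np
    from sklearn.neighbors import LocalOutlierFactor, NearestNeighbors

    def lof_scores(X, k=3):
        # Higher score = more outlying; sklearn stores negated LOF values.
        lof = LocalOutlierFactor(n_neighbors=k)
        lof.fit(X)
        return -lof.negative_outlier_factor_

    def knn_scores(X, k=3):
        # Distance to the k-th nearest neighbor [13]; k+1 is queried because
        # each training point is its own nearest neighbor at distance 0.
        nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
        dist, _ = nn.kneighbors(X)
        return dist[:, -1]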

4.1. Evaluation Metrics. For numerical performance evaluation of outlier detection algorithms, three metrics, namely, accuracy [7], AUC [33], and rank-power [16], are used in this paper.

Let M be the number of true outliers that a dataset X contains and let S denote the number of true outliers detected by an algorithm. In the experiments, the top M most suspicious instances are reported. Then the accuracy is given by

$$ \text{Accuracy} = \frac{S}{M}. \quad (25) $$

The receiver operating characteristic (ROC) curve represents the trade-off between the detection rate and the false alarm rate. In general, the area under the ROC curve (AUC) is used to measure the performance of an outlier detection method, and the AUC of an ideal detector is close to one.

For a given outlier detection algorithm, if the true outliers occupy the top positions with respect to the nonoutliers among the M suspicious instances, the rank-power (RP) of the algorithm is said to be high. If S is the number of true outliers found within the top M instances and R_i denotes the rank of the i-th true outlier, then the rank-power is given by

$$ \text{RP} = \frac{S(S+1)}{2\sum_{i=1}^{S} R_i}. \quad (26) $$

RP reaches its maximum value 1 when all S true outliers occupy the top S positions. A larger RP implies better performance of an algorithm.
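A compact Python sketch of the three metrics follows (helper name and conventions are ours, assuming higher scores mean more outlying; AUC is delegated to scikit-learn):

    import numpy as np
    from sklearn.metrics import roc_auc_score

    def evaluate(scores, labels, M):
        """scores, labels: numpy arrays; labels holds 1 for true outliers,
        0 for inliers; M is the number of true outliers in the dataset."""
        order = np.argsort(-scores)              # rank by decreasing outlyingness
        top = order[:M]
        S = int(labels[top].sum())               # true outliers among the top M
        accuracy = S / M                         # Eq. (25)
        ranks = [r + 1 for r, i in enumerate(top) if labels[i] == 1]
        rp = S * (S + 1) / (2 * sum(ranks)) if S > 0 else 0.0   # Eq. (26)
        auc = roc_auc_score(labels, scores)
        return accuracy, auc, rp

When all S detected true outliers sit in the top S ranks, sum(ranks) equals S(S+1)/2 and RP evaluates to 1, matching the statement above.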

4.2. Experiments on Synthetic Dataset. A two-dimensional synthetic dataset with two cluster patterns is generated from Gaussian distributions to intuitively compare the outlier detection results of the proposed method against the other four algorithms mentioned above. The mean vectors of the two clusters are mu_1 = (7.5, 9)^T and mu_2 = (1, 3)^T, respectively, and both covariance matrices are Sigma_1 = Sigma_2 = diag(1, 2). As Figure 2(a) shows, a total of 199 samples are contained in the synthetic dataset, of which 183 are normal samples (within the two clusters) and 16 are outliers (scattered between the two clusters); 13 normal objects are labeled and marked with the symbol "x", 5 outliers are labeled and marked with the symbol "*", while the remaining samples are unlabeled and marked with ".".
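A minimal sketch of such a dataset in Python follows. The paper gives only the totals, so the per-cluster split of the 183 normal points and the mechanism scattering the 16 outliers between the clusters are assumptions of ours:

    import numpy as np

    rng = np.random.default_rng(0)
    mu1, mu2 = np.array([7.5, 9.0]), np.array([1.0, 3.0])
    cov = np.array([[1.0, 0.0], [0.0, 2.0]])     # diag(1, 2) for both clusters
    c1 = rng.multivariate_normal(mu1, cov, size=92)   # assumed 92/91 split
    c2 = rng.multivariate_normal(mu2, cov, size=91)
    # 16 outliers scattered between the clusters; a uniform box is an assumed
    # generation mechanism, not stated in the paper.
    out = rng.uniform(low=[-1.0, 0.0], high=[8.0, 10.0], size=(16, 2))
    X = np.vstack([c1, c2, out])
    y = np.r_[np.zeros(183), np.ones(16)]        # 1 marks a true outlier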

Figures 2(b)-2(f), respectively, illustrate the outlier detection results on the synthetic dataset obtained by LOF, KNN, SVDD, EODSP, and SSOD-AFW, where the red symbols "o" denote the detected suspicious outliers. Here the neighborhood size k in LOF and KNN is set to 3. A Gaussian kernel is chosen for SVDD, with bandwidth sigma = 0.3 and trade-off coefficient C = 0.45. Besides, the Euclidean distance threshold T in EODSP is set to 0.1 and the percentage of the negative set to l = 10%. The parameter settings of the proposed algorithm are m = 2.3, K = 0.85, and q = 2.23. Except for SVDD, the top 16 objects with the highest outlying scores are taken as the detection results of the other four algorithms.

In Figure 2 it is noticeable that the unsupervised methods LOF and KNN, as well as the supervised SVDD, fail to detect all of the 5 labeled outliers, while some normal points inside the clusters are badly misjudged as outliers. In contrast, the semisupervised EODSP algorithm and the proposed SSOD-AFW successfully detect all of the 5 labeled outliers. However, EODSP does not detect all the unlabeled true outliers, and several true normal samples are improperly identified as outliers. It can be concluded from Figure 2 that the proposed algorithm finds all the true outliers in the synthetic dataset while excluding the normal samples, whereas the other methods do not.

Figure 3 numerically presents the performance evaluation of outlier detection using LOF, KNN, SVDD, EODSP, and SSOD-AFW on the synthetic dataset. From Figure 3 we see that the accuracy, AUC, and RP of the proposed algorithm all reach 1, outperforming the other methods.

Furthermore, during the experiments shown in Figure 3, the feature weights of the synthetic dataset learned by formula (15) in our method are v11 = 0.6727, v12 = 0.5985, v21 = 0.3273, and v22 = 0.4015. To demonstrate the effectiveness of feature weighting in the proposed SSOD-AFW algorithm, a comparison between the weighted and the nonweighted versions is carried out on the synthetic dataset. For the nonweighted variant of the proposed algorithm, the outlier detection result on the synthetic dataset is presented in Figure 4. As can be observed from Figure 4, the nonweighted SSOD-AFW ends up tagging 15 true outlying samples and 1 normal sample as outliers, with one unlabeled true outlier missed.
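The role of these weights can be made concrete with a small sketch. The form below, a feature-weighted squared distance with weight exponent q, is the common one in feature-weighted clustering [26-28]; the paper's exact dissimilarity in (15) may differ in detail, so treat this as an assumption-laden illustration rather than the authors' formula:

    import numpy as np

    def weighted_sqdist(X, prototype, v, q):
        # Feature-weighted squared Euclidean distance of each row of X to a
        # cluster prototype; v holds that cluster's per-feature weights.
        return ((v ** q) * (X - prototype) ** 2).sum(axis=1)

    # Toy usage with the weights learned for cluster 1 of the synthetic data:
    v1 = np.array([0.6727, 0.5985])
    d2 = weighted_sqdist(np.array([[7.0, 9.5]]), np.array([7.5, 9.0]), v1, q=2.23)

A larger weight amplifies deviations along an informative feature, while irrelevant features contribute little to the distance, which is how the weighting weakens their negative effect.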

4.3. Experiments on Real-World Datasets

4.3.1. Introduction of Datasets. For further verification of the effectiveness of the proposed algorithm, five real-world datasets from the UCI Machine Learning Repository [34] (i.e., Iris, Abalone, Wine, Ecoli, and Breast Cancer Wisconsin (WDBC)) are employed to test the performance of the proposed algorithm against LOF, KNN, SVDD, and EODSP. As mentioned in Aggarwal and Yu [35], one way to test the performance of an outlier detection algorithm is to run it on a dataset and calculate the percentage of detected points belonging to the rare classes. Accordingly, a small number of samples from the same class are randomly selected as outlying (target) objects for each of the five datasets. For instance, the original Iris dataset contains 150 objects, with 50 objects in each of three classes.


Figure 2: Outlier detection results of different algorithms on the two-dimensional synthetic dataset. (a) Original synthetic dataset; (b) LOF; (c) KNN; (d) SVDD; (e) EODSP; (f) SSOD-AFW. Markers: unlabeled samples, labeled normal samples, labeled outliers, and detected outliers.

Figure 3: Performance comparison of different algorithms (LOF, KNN, SVDD, EODSP, SSOD-AFW) on the synthetic dataset. (a) Accuracy; (b) AUC; (c) RP.

We randomly selected 26 objects from the class "Iris-virginica" as target outliers, and all objects in the other two classes are considered normal. The other four datasets are preprocessed similarly; a more detailed description of the five real-world datasets is given in Table 1.

4.3.2. Experimental Result Analysis. We compare the outlier detection performance of the proposed algorithm with LOF, KNN, SVDD, and EODSP on the real-world datasets. Each method has its own parameters, and the detailed parameter settings of each algorithm are as follows. The parameters of the proposed algorithm are m = 2.1, K = 0.65, and q = 1.53 for all five datasets.


Table 1: Description of real-world datasets.

Dataset   Instances   Features   Outlying classes             Outliers (percent)   Clusters   Prior information
Iris      126         4          "Virginica"                  26 (20.63%)          2          10 labeled normal samples, 4 labeled outliers
Abalone   4177        8          "1"-"4", "16"-"27", "29"     335 (8.02%)          11         11 labeled normal samples, 18 labeled outliers
Wine      130         13         "3"                          11 (8.46%)           2          9 labeled normal samples, 4 labeled outliers
Ecoli     336         9          "omL", "imL", "imS"          9 (2.68%)            5          11 labeled normal samples, 3 labeled outliers
WDBC      387         30         "Malignant"                  30 (7.75%)           1          10 labeled normal samples, 8 labeled outliers

Figure 4: Outlier detection result of the nonweighted SSOD-AFW on the synthetic dataset.

The strategy of parameter selection for SSOD-AFW is discussed in the later subsection on parameter analysis. For the other algorithms, the parameters are set exactly as mentioned in their references. It is well known that LOF and KNN depend strongly on the neighborhood parameter k; here we set k = 5 for the datasets Iris and WDBC, k = 3 for Abalone, and k = 10 for Wine and Ecoli. For SVDD, a Gaussian kernel is employed with bandwidth sigma = 0.45 and C = 0.5 on all five real-world datasets. In EODSP, the Euclidean distance threshold T is set to 0.1 and the percentage of the negative set l to 30% for the Iris and Abalone datasets, and T = 0.5, l = 30% for the datasets Ecoli, Wine, and WDBC. Since we randomly select outliers from the target classes of each dataset, each experiment is repeated 10 times with the same number of (different) outliers, and the average accuracy, AUC, and RP are calculated as the performance criteria of the various detection methods.

Figure 5 illustrates the outlier detection results of the SSOD-AFW algorithm against LOF, KNN, SVDD, and EODSP, respectively, on the five real-world datasets. As can be seen from Figure 5, the proposed algorithm can accurately identify outliers according to the cluster structure of a dataset with the guidance of the label knowledge. It shows distinct superiority over the other unsupervised (LOF, KNN), semisupervised (EODSP), and supervised (SVDD) methods. In particular, the outlier detection accuracy of SSOD-AFW in Figure 5(a) is significantly higher than the others, especially for the datasets Iris and Wine. One can see from Figure 5(b) that the AUC values of our method are higher than those of the others for all datasets except WDBC. In terms of RP, SSOD-AFW performs better than the other four algorithms on the datasets Iris and Wine, whereas it is slightly poorer than SVDD on Abalone, poorer than LOF on Ecoli, and poorer than KNN on WDBC, as seen in Figure 5(c).

It is worth mentioning that the experiment of the proposed algorithm on WDBC involves a one-class clustering problem. Although a one-class clustering task is generally meaningless, one-class clustering-based outlier detection is meaningful and feasible in our proposal, because our approach does not require that the membership degrees sum up to 1. This is one of the powerful and important characteristics of the proposed algorithm.
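A minimal sketch of why dropping the sum-to-one constraint enables one-class scoring follows. It uses the classical possibilistic membership of [30], u = 1 / (1 + (d^2/eta)^(1/(m-1))), as a stand-in; the paper's actual update (15) is not reproduced here, so this is an illustration of the idea, not the proposed method:

    import numpy as np

    def pcm_membership(d2, eta, m):
        # Possibilistic (PCM-style) membership [30]: each point gets an
        # absolute typicality in (0, 1] per cluster, with no requirement
        # that memberships sum to 1 across clusters. With a single cluster,
        # a low membership already marks a point as outlying.
        return 1.0 / (1.0 + (d2 / eta) ** (1.0 / (m - 1.0)))

    d2 = np.array([0.2, 0.5, 9.0])          # squared distances to one prototype
    print(pcm_membership(d2, eta=1.0, m=2.1))  # the third point scores lowest

Under a probabilistic (sum-to-one) constraint, a single cluster would force every membership to 1, which is why one-class clustering is meaningless there but workable here.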

4.3.3. The Influence of the Proportion of Labeled Data on Outlier Detection. In this subsection we investigate the influence of the proportion of labeled samples on the accuracy of our methodology. Two typical situations are considered and tested. In the first, the proportion of labeled outliers increases while the number of labeled normal objects is fixed at a certain constant; in the second, the percentage of labeled normal samples varies while the quantity of labeled outliers is fixed. Accordingly, two groups of experiments are designed to compare the accuracy of the proposed algorithm against EODSP under different percentages of labeled outliers and labeled normal samples, respectively, on the datasets Iris, Abalone, Wine, Ecoli, and WDBC.


Figure 5: Performance comparison of various algorithms (LOF, KNN, SVDD, EODSP, SSOD-AFW) on the real-world datasets Iris, Abalone, Wine, Ecoli, and WDBC. (a) Accuracy; (b) AUC; (c) RP.

In the two experiments, the percentage of labeled outliers or labeled normal samples ranges from 0% to 40%, respectively, while the number of the other kind of labeled objects is fixed. We randomly select a certain number of labeled outliers or normal samples from each dataset; each experiment is repeated 10 times, and the average accuracies of SSOD-AFW and EODSP are computed.
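The protocol can be sketched as a short loop. Here sample_labels and run_ssod_afw are hypothetical helpers standing in for the labeling step and the trained detector, and evaluate is the metric sketch from Section 4.1; X and y are the data and ground-truth outlier flags:

    import numpy as np

    for pct in range(0, 45, 5):                  # 0%..40% labeled outliers
        accs = []
        for rep in range(10):                    # 10 random repetitions
            labeled = sample_labels(y, outlier_pct=pct, n_normal=10)  # hypothetical
            scores = run_ssod_afw(X, labeled)                         # hypothetical
            acc, _, _ = evaluate(scores, y, M=int(y.sum()))
            accs.append(acc)
        print(f"{pct}%: mean accuracy = {np.mean(accs):.4f}")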

Figure 6 shows the results of the first group of experiments, where the percentage of labeled outliers varies from 0% to 40%. One can see from Figure 6 that the accuracies of the two semisupervised algorithms roughly increase as the proportion of labeled outliers grows. This strongly supports the idea that semisupervised outlier detection algorithms can improve detection accuracy by using prior information. Furthermore, SSOD-AFW achieves better accuracy than EODSP for the same proportion of labeled outliers on all five datasets; for Wine in particular, the accuracy of SSOD-AFW is 40% higher than that of EODSP. EODSP addresses the problem of detecting outliers with only a few labeled outliers as training data: its labeled normal instances are extracted according to the maximum entropy principle, where the entropy is computed only from the distances between each testing sample and the labeled outliers. This makes EODSP less flexible than our proposed method due to the resulting information deficiency.

Figure 7 illustrates the accuracy comparison of the proposed algorithm and EODSP when the proportion of labeled normal samples increases from 0% to 40% and the percentage of labeled outliers is fixed. Note that our method obtains better accuracy than EODSP on all five real-world datasets, and the accuracy of the proposed algorithm grows as the percentage of labeled normal samples increases. As mentioned, EODSP performs semisupervised outlier detection using only a few labeled outliers in the initial dataset, without considering any labeled normal objects. Therefore the accuracy of the EODSP algorithm remains unchanged under the various proportions of labeled normal objects and always equals its accuracy at 0% labeled normal samples.

4.3.4. Parameter Analysis. The parameters m, K, and q are important in our proposed algorithm and affect the performance of SSOD-AFW. In this section, the influence of each parameter on outlier detection accuracy is studied.

The parameter m is the fuzzification coefficient. Figure 8(a) analyzes the relationship between the outlier detection accuracy of the proposed algorithm and the parameter m, with m varying from 1.5 to 5.0. The results imply that the highest accuracy is achieved when m lies in [2, 4], so it is reasonable that the value of m in the experiments shown in Figure 5 was set to 2.1. The parameter K in (0, 1) controls the importance of the label information in the outlier detection result. Outlier detection accuracies are tested by varying K from 0.1 to 0.9, as shown in Figure 8(b). The overall tendency is that the accuracies become larger as K increases; the best results of the proposed algorithm occur and remain stable when K >= 0.7. Finally, from Figure 8(c) we conclude that the feature weight index q has little influence on the accuracy of SSOD-AFW when the other parameters keep the same settings, so the proposed algorithm is not sensitive to q. In general, q is suggested to be a constant from (1, 4].

4.3.5. Execution Time Analysis. Figure 9 analyzes the average running time of the proposed algorithm against the other algorithms on the five real-world datasets. The experimental environment is a Windows XP system, the MATLAB 7.1 platform, a 3 GHz CPU, and 2 GB RAM. Because the volume of the Abalone dataset is far greater than that of the other four datasets, the running times on the various datasets differ distinctly; to facilitate the display in Figure 9, the horizontal coordinate axis is translated downward a short distance. The results indicate that the proposed algorithm is more time-saving than the other four typical outlier detection algorithms, except for KNN on the dataset Wine. On the whole, the execution time of SSOD-AFW is comparable to that of KNN and less than those of the other algorithms on most of the datasets.
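The sensitivity study of Section 4.3.4 amounts to sweeping one parameter while holding the others at the defaults used in Figure 5. A sketch, again with the hypothetical run_ssod_afw and the evaluate helper from the earlier sketches (the keyword signature is an assumption):

    import numpy as np

    defaults = dict(m=2.1, K=0.65, q=1.53)
    for m in np.arange(1.5, 5.0 + 1e-9, 0.5):        # sweep m, hold K and q fixed
        params = {**defaults, "m": m}
        scores = run_ssod_afw(X, labeled, **params)  # hypothetical signature
        acc, _, _ = evaluate(scores, y, M=int(y.sum()))
        print(f"m={m:.1f}: accuracy={acc:.4f}")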


Figure 6: Accuracy analysis of algorithms EODSP and SSOD-AFW with different percentages of labeled outliers (0%-40%) on the real-world datasets. (a) Iris; (b) Abalone; (c) Wine; (d) Ecoli; (e) WDBC.


5. Conclusions

In order to detect outliers more precisely, a semisupervised outlier detection algorithm based on adaptive feature weighted clustering, called SSOD-AFW, is proposed in this paper. Distinct weights of each feature with respect to different clusters are considered and obtained by adaptive iteration, so that the negative effects of irrelevant features on outlier detection are weakened. Moreover, the proposed method makes full use of the prior knowledge contained in datasets and detects outliers by virtue of the cluster structure. A series of experiments verifies that the proposed SSOD-AFW algorithm is superior to other typical unsupervised, semisupervised, and supervised algorithms in both outlier detection precision and running speed.


Figure 7: Accuracy analysis of algorithms EODSP and SSOD-AFW with different percentages of labeled normal samples (0%-40%) on the real-world datasets. (a) Iris; (b) Abalone; (c) Wine; (d) Ecoli; (e) WDBC.

In this paper we present a new semisupervised outlier detection method which utilizes the labels of a small number of objects. However, our method assumes that the labels are reliable and does not consider mislabel punishment in the new objective function; a robust version of the proposed method that deals with noisy or imperfect labels therefore deserves further study. Moreover, since only one typical dissimilarity measure, the Euclidean distance, is discussed in our method, the SSOD-AFW algorithm is limited to outlier detection on numerical data. Future research aims at extending our method to mixed-attribute data in more real-life applications, such as fault diagnosis in industrial processes or network anomaly detection.


Figure 8: Outlier detection accuracy of the proposed algorithm under various parameters on the real-world datasets. (a) Parameter m (1.5-5.0); (b) parameter K (0.1-0.9); (c) parameter q (1.5-4.0).

Figure 9: Execution time comparison of different algorithms on the real-world datasets.


Competing Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (Grant no. 11471001).

References

[1] J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, Elsevier, 2011.
[2] D. M. Hawkins, Identification of Outliers, Chapman & Hall, London, UK, 1980.
[3] V. Barnett and T. Lewis, Outliers in Statistical Data, John Wiley & Sons, Chichester, UK, 1994.
[4] V. J. Hodge and J. Austin, "A survey of outlier detection methodologies," Artificial Intelligence Review, vol. 22, no. 2, pp. 85-126, 2004.
[5] B. Sheng, Q. Li, W. Mao, and W. Jin, "Outlier detection in sensor networks," in Proceedings of the 8th ACM International Symposium on Mobile Ad Hoc Networking and Computing (MobiHoc '07), pp. 219-228, Montreal, Canada, September 2007.
[6] A. P. James and S. Dimitrijev, "Inter-image outliers and their application to image classification," Pattern Recognition, vol. 43, no. 12, pp. 4101-4112, 2010.
[7] J. Huang, Q. Zhu, L. Yang, and J. Feng, "A non-parameter outlier detection algorithm based on Natural Neighbor," Knowledge-Based Systems, vol. 92, pp. 71-77, 2016.
[8] J. M. Shepherd and S. J. Burian, "Detection of urban-induced rainfall anomalies in a major coastal city," Earth Interactions, vol. 7, no. 4, pp. 1-17, 2003.
[9] O. Alan and C. Catal, "Thresholds based outlier detection approach for mining class outliers: an empirical case study on software measurement datasets," Expert Systems with Applications, vol. 38, no. 4, pp. 3440-3445, 2011.
[10] B. Scholkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson, "Estimating the support of a high-dimensional distribution," Neural Computation, vol. 13, no. 7, pp. 1443-1471, 2001.
[11] D. Tax, A. Ypma, and R. Duin, "Support vector data description applied to machine vibration analysis," in Proceedings of the 5th Annual Conference of the Advanced School for Computing and Imaging, pp. 398-405, Heijen, The Netherlands, June 1999.
[12] D. M. J. Tax and R. P. W. Duin, "Support vector data description," Machine Learning, vol. 54, no. 1, pp. 45-66, 2004.
[13] S. Ramaswamy, R. Rastogi, and K. Shim, "Efficient algorithms for mining outliers from large data sets," SIGMOD Record, vol. 29, no. 2, pp. 427-438, 2000.
[14] E. M. Knox and R. T. Ng, "Algorithms for mining distance-based outliers in large datasets," in Proceedings of the International Conference on Very Large Data Bases, pp. 392-403, New York, NY, USA, 1998.
[15] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander, "LOF: identifying density-based local outliers," ACM SIGMOD Record, vol. 29, no. 2, pp. 93-104, 2000.
[16] J. Ha, S. Seok, and J.-S. Lee, "A precise ranking method for outlier detection," Information Sciences, vol. 324, pp. 88-107, 2015.
[17] R. N. Dave, "Characterization and detection of noise in clustering," Pattern Recognition Letters, vol. 12, no. 11, pp. 657-664, 1991.
[18] R. Smith, A. Bivens, M. Embrechts, C. Palagiri, and B. Szymanski, "Clustering approaches for anomaly based intrusion detection," Proceedings of Intelligent Engineering Systems Through Artificial Neural Networks, pp. 579-584, 2002.
[19] Y. Shi and L. Zhang, "COID: a cluster-outlier iterative detection approach to multi-dimensional data analysis," Knowledge and Information Systems, vol. 28, no. 3, pp. 709-733, 2011.
[20] J. Zhao, K. Liu, W. Wang, and Y. Liu, "Adaptive fuzzy clustering based anomaly data detection in energy system of steel industry," Information Sciences, vol. 259, pp. 335-345, 2014.
[21] F. Angiulli and C. Pizzuti, "Fast outlier detection in high dimensional spaces," in Proceedings of the 6th European Conference on Principles of Data Mining and Knowledge Discovery (PKDD '02), vol. 2, pp. 15-26, August 2002.
[22] M. Radovanovic, A. Nanopoulos, and M. Ivanovic, "Reverse nearest neighbors in unsupervised distance-based outlier detection," IEEE Transactions on Knowledge and Data Engineering, vol. 27, no. 5, pp. 1369-1382, 2015.
[23] A. Daneshpazhouh and A. Sami, "Entropy-based outlier detection using semi-supervised approach with few positive examples," Pattern Recognition Letters, vol. 49, pp. 77-84, 2014.
[24] J. Gao, H. Cheng, and P.-N. Tan, "Semi-supervised outlier detection," in Proceedings of the ACM Symposium on Applied Computing, pp. 635-636, ACM, Dijon, France, April 2006.
[25] Z. Xue, Y. Shang, and A. Feng, "Semi-supervised outlier detection based on fuzzy rough C-means clustering," Mathematics and Computers in Simulation, vol. 80, no. 9, pp. 1911-1921, 2010.
[26] J. Z. Huang, M. K. Ng, H. Rong, and Z. Li, "Automated variable weighting in k-means type clustering," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 5, pp. 657-668, 2005.
[27] H. Frigui and O. Nasraoui, "Unsupervised learning of prototypes and attribute weights," Pattern Recognition, vol. 37, no. 3, pp. 567-581, 2004.
[28] J. Zhou, L. Chen, C. P. Chen, Y. Zhang, and H. Li, "Fuzzy clustering with the entropy of attribute weights," Neurocomputing, vol. 198, pp. 125-134, 2016.
[29] M. Hassan, A. Chaudhry, A. Khan, and M. A. Iftikhar, "Robust information gain based fuzzy c-means clustering and classification of carotid artery ultrasound images," Computer Methods and Programs in Biomedicine, vol. 113, no. 2, pp. 593-609, 2014.
[30] R. Krishnapuram and J. M. Keller, "A possibilistic approach to clustering," IEEE Transactions on Fuzzy Systems, vol. 1, no. 2, pp. 98-110, 1993.
[31] M.-S. Yang and K.-L. Wu, "Unsupervised possibilistic clustering," Pattern Recognition, vol. 39, no. 1, pp. 5-21, 2006.
[32] S. M. Guo, L. C. Chen, and J. S. H. Tsai, "A boundary method for outlier detection based on support vector domain description," Pattern Recognition, vol. 42, no. 1, pp. 77-83, 2009.
[33] T. Fawcett, "ROC graphs: notes and practical considerations for researchers," Machine Learning, vol. 31, no. 1, pp. 1-38, 2004.
[34] C. Blake and C. J. Merz, "UCI repository of machine learning databases," 1998.
[35] C. C. Aggarwal and P. S. Yu, "Outlier detection for high dimensional data," in Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD '01), Santa Barbara, Calif, USA, May 2001.



14 Mathematical Problems in Engineering

[14] EM Knox and R T Ng ldquoAlgorithms formining distance basedoutliers in large datasetrdquo in Proceedings of the InternationalConference on Very Large Data Bases pp 392ndash403 CiteseerNew York NY USA 1998

[15] M M Breunig H-P Kriegel R T Ng and J Sander ldquoLOFidentifying density-based local outliersrdquoACMSIGMODRecordvol 29 no 2 pp 93ndash104 2000

[16] J Ha S Seok and J-S Lee ldquoA precise ranking method foroutlier detectionrdquo Information Sciences vol 324 pp 88ndash1072015

[17] R N Dave ldquoCharacterization and detection of noise in clus-teringrdquo Pattern Recognition Letters vol 12 no 11 pp 657ndash6641991

[18] R Smith A Bivens M Embrechts C Palagiri and B Szyman-ski ldquoClustering approaches for anomaly based intrusion detec-tionrdquo Proceedings of Intelligent Engineering Systems ThroughArtificial Neural Networks pp 579ndash584 2002

[19] Y Shi and L Zhang ldquoCOID a cluster-outlier iterative detectionapproach to multi-dimensional data analysisrdquo Knowledge andInformation Systems vol 28 no 3 pp 709ndash733 2011

[20] J Zhao K Liu W Wang and Y Liu ldquoAdaptive fuzzy clusteringbased anomaly data detection in energy system of steel indus-tryrdquo Information Sciences vol 259 pp 335ndash345 2014

[21] F Angiulli and C Pizzuti ldquoFast outlier detection in high dimen-sional spacesrdquo in Proceedings of the 6th European Conference onPrinciples of DataMining and Knowledge Discovery (PKDD rsquo02)vol 2 pp 15ndash26 August 2002

[22] M Radovanovic A Nanopoulos and M Ivanovic ldquoReversenearest neighbors in unsupervised distance-based outlier detec-tionrdquo IEEE Transactions on Knowledge and Data Engineeringvol 27 no 5 pp 1369ndash1382 2015

[23] A Daneshpazhouh and A Sami ldquoEntropy-based outlier detec-tion using semi-supervised approach with few positive exam-plesrdquo Pattern Recognition Letters vol 49 pp 77ndash84 2014

[24] J Gao H Cheng and P-N Tan ldquoSemi-supervised outlierdetectionrdquo in Proceedings of the ACM Symposium on AppliedComputing pp 635ndash636 ACM Dijon France April 2006

[25] Z Xue Y Shang and A Feng ldquoSemi-supervised outlier detec-tion based on fuzzy rough C-means clusteringrdquo Mathematicsand Computers in Simulation vol 80 no 9 pp 1911ndash1921 2010

[26] J Z Huang M K Ng H Rong and Z Li ldquoAutomated variableweighting in k-means type clusteringrdquo IEEE Transactions onPattern Analysis andMachine Intelligence vol 27 no 5 pp 657ndash668 2005

[27] H Friguiand and O Nasraou ldquoUnsupervised learning ofprototypes and attribute weightsrdquo Pattern Recognition vol 37no 3 pp 567ndash581 2004

[28] J Zhou L Chen C P Chen Y Zhang and H Li ldquoFuzzy clus-tering with the entropy of attribute weightsrdquo Neurocomputingvol 198 pp 125ndash134 2016

[29] M Hassan A Chaudhry A Khan and M A Iftikhar ldquoRobustinformation gain based fuzzy c-means clustering and classifi-cation of carotid artery ultrasound imagesrdquo Computer Methodsand Programs in Biomedicine vol 113 no 2 pp 593ndash609 2014

[30] R Krishnapuram and J M Keller ldquoA possibilistic approach toclusteringrdquo IEEE Transactions on Fuzzy Systems vol 1 no 2 pp98ndash110 1993

[31] M-S Yang and K-L Wu ldquoUnsupervised possibilistic cluster-ingrdquo Pattern Recognition vol 39 no 1 pp 5ndash21 2006

[32] SM Guo L C Chen and J S H Tsai ldquoA boundarymethod foroutlier detection based on support vector domain descriptionrdquoPattern Recognition vol 42 no 1 pp 77ndash83 2009

[33] T Fawcett ldquoRoc graphs notes and practical considerations forresearchersrdquoMachine Learning vol 31 no 1 pp 1ndash38 2004

[34] C Blake and C J Merz ldquoUCI repository of machine learningdatabasesrdquo 1998

[35] C C Aggarwal and P S Yu ldquoOutlier detection for highdimensional datardquo in Proceedings of the ACM SIGMOD Interna-tional Conference on Management of Data (SIGMOD rsquo01) SantaBarbara Calif USA May 2001

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of


Table 1: Description of real-world datasets.

Dataset   Instances  Features  Outlying classes           Outliers (percent)  Clusters  Prior information
Iris      126        4         "Virginica"                26 (20.63%)         2         10 labeled normal samples, 4 labeled outliers
Abalone   4177       8         "1"-"4", "16"-"27", "29"   335 (8.02%)         11        11 labeled normal samples, 18 labeled outliers
Wine      130        13        "3"                        11 (8.46%)          2         9 labeled normal samples, 4 labeled outliers
Ecoli     336        9         "omL", "imL", "imS"        9 (2.68%)           5         11 labeled normal samples, 3 labeled outliers
WDBC      387        30        "Malignant"                30 (7.75%)          1         10 labeled normal samples, 8 labeled outliers
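For concreteness, the construction behind Table 1 can be sketched as follows: the listed minority classes are relabeled as outliers and subsampled to the stated counts. This is our own illustrative reconstruction (shown for Iris via scikit-learn; the other datasets would be fetched from the UCI repository [34]), not the authors' preprocessing code.

import numpy as np
from sklearn.datasets import load_iris

def make_outlier_setup(X, y, outlier_classes, n_outliers, seed=0):
    """Relabel the minority classes as outliers (1) and keep only
    n_outliers of them, as in Table 1; the rest are inliers (0)."""
    rng = np.random.default_rng(seed)
    is_out = np.isin(y, outlier_classes)
    keep_out = rng.choice(np.where(is_out)[0], size=n_outliers, replace=False)
    keep = np.concatenate([np.where(~is_out)[0], keep_out])
    return X[keep], is_out[keep].astype(int)

# Iris: class 2 ("Virginica") is the outlying class; 26 outliers are kept,
# giving the 126 instances listed in Table 1.
data = load_iris()
X, y = make_outlier_setup(data.data, data.target, outlier_classes=[2], n_outliers=26)
print(X.shape, y.sum())   # (126, 4) 26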

Figure 4: Outlier detection result of the nonweighted SSOD-AFW on the synthetic dataset (markers distinguish unlabeled samples, labeled normal samples, labeled outliers, and the outliers detected by the nonweighted SSOD-AFW).

for all the five datasets. The strategy of parameter selection for SSOD-AFW will be discussed in the later subsection on parameter analysis. For the other algorithms, the parameters are set exactly as in their original references. It is well known that LOF and KNN depend strongly on the neighborhood parameter k. In this paper we set k = 5 for the datasets Iris and WDBC, k = 3 for Abalone, k = 10 for Wine, and k = 10 for Ecoli. For the SVDD method, a Gaussian kernel is employed, with bandwidth σ = 0.45 and C = 0.5 on all five real-world datasets. In EODSP, the Euclidean distance threshold T is set to 0.1 and the percentage of the negative set l to 30% for the Iris and Abalone datasets, and T = 0.5, l = 30% for the datasets Ecoli, Wine, and WDBC. Since outliers are randomly selected from the target classes of each dataset, each experiment is repeated 10 times with the same number of different outliers. The average accuracy, AUC, and RP are calculated as the performance criteria of the various detection methods.
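As an illustration of this protocol, the sketch below averages the three criteria over 10 random repetitions. It assumes a hypothetical detector interface returning binary predictions and outlier scores, and a placeholder rp_score implementing the paper's RP criterion; accuracy and AUC use standard scikit-learn metrics.

import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

def average_scores(detector, rp_score, X, y_true, n_repeats=10, seed=0):
    """Average Accuracy, AUC, and RP over repeated runs with different
    random draws of labeled objects (handled inside `detector`)."""
    rng = np.random.default_rng(seed)
    acc, auc, rp = [], [], []
    for _ in range(n_repeats):
        y_pred, scores = detector(X, rng)    # binary labels and outlier scores
        acc.append(accuracy_score(y_true, y_pred))
        auc.append(roc_auc_score(y_true, scores))
        rp.append(rp_score(y_true, scores))  # RP as defined in the paper
    return float(np.mean(acc)), float(np.mean(auc)), float(np.mean(rp))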

Figure 5 illustrates the outlier detection results of the SSOD-AFW algorithm against LOF, KNN, SVDD, and EODSP on the five real-world datasets. As can be seen from Figure 5, the proposed algorithm accurately identifies outliers according to the cluster structure of a dataset under the guidance of the label knowledge. It shows a distinct superiority over the other unsupervised (LOF, KNN), semisupervised (EODSP), and supervised (SVDD) methods. In particular, the outlier detection accuracy of SSOD-AFW in Figure 5(a) is significantly higher than that of the others, especially for the datasets Iris and Wine. Figure 5(b) shows that the AUC values of our method are higher than those of the others on all datasets except WDBC. In terms of RP, SSOD-AFW performs better than the other four algorithms on Iris and Wine, whereas it is slightly poorer than SVDD on Abalone, poorer than LOF on Ecoli, and poorer than KNN on WDBC, as seen in Figure 5(c).

It is worth mentioning that the experiment of the proposed algorithm on WDBC involves a one-class clustering problem. Although a one-class clustering task is generally meaningless, one-class clustering-based outlier detection is especially meaningful and feasible in our proposal, because our approach does not require the membership degrees to sum up to 1. This is one of the powerful and important characteristics of the proposed algorithm.
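To see why memberships that need not sum to 1 make the one-cluster case workable, consider the classical possibilistic membership of [30], sketched below with m = 2.1. This is only an illustration of the property; the actual SSOD-AFW update additionally involves the adaptive feature weights and the label terms, which are omitted here.

import numpy as np

def possibilistic_membership(X, v, eta, m=2.1):
    """Possibilistic-style membership of each point in one cluster with
    prototype v and scale eta; values lie in (0, 1] and are NOT forced
    to sum to 1 across clusters, so a single cluster is meaningful."""
    d2 = np.sum((X - v) ** 2, axis=1)                # squared Euclidean distance
    return 1.0 / (1.0 + (d2 / eta) ** (1.0 / (m - 1.0)))

X = np.vstack([np.random.default_rng(0).normal(0, 1, (50, 2)),
               [[8.0, 8.0]]])                        # one far-away point
u = possibilistic_membership(X, v=X[:50].mean(axis=0), eta=2.0)
print(u[-1] < 0.1)   # the far point gets a low membership -> flagged as outlier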

4.3.3. The Influence of the Proportion of Labeled Data on Outlier Detection. In this subsection we investigate the influence of the proportion of labeled samples on the accuracy of our methodology. Two typical situations are considered and tested. In the first, the proportion of labeled outliers increases while the number of labeled normal objects is fixed at a certain constant. In the other, the percentage of labeled normal samples varies while the quantity of labeled outliers is fixed. Accordingly, two groups of experiments are designed to compare the accuracy of the proposed algorithm against EODSP under different percentages of labeled outliers and labeled normal samples, respectively, on the datasets Iris, Abalone, Wine, Ecoli, and WDBC.


Figure 5: Performance comparison of various algorithms (LOF, KNN, SVDD, EODSP, SSOD-AFW) on the real-world datasets: (a) accuracy, (b) AUC, (c) RP.

In the two experiments, the percentage of labeled outliers or labeled normal samples ranges from 0% to 40%, respectively, while the number of the other kind of labeled objects is fixed. We randomly select a certain number of labeled outliers or normal samples from each dataset, each experiment is repeated 10 times, and the average accuracies of SSOD-AFW and EODSP are computed.
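A sketch of how such label draws can be generated is given below; the function names are our own, and the detectors themselves are omitted.

import numpy as np

def draw_labels(y, pct_labeled_outliers, n_labeled_normals, rng):
    """Randomly mark a percentage of the true outliers and a fixed number
    of normal objects as labeled, as in the two experiment groups."""
    out_idx = np.where(y == 1)[0]
    in_idx = np.where(y == 0)[0]
    n_out = int(round(pct_labeled_outliers / 100.0 * len(out_idx)))
    labeled_out = rng.choice(out_idx, size=n_out, replace=False)
    labeled_in = rng.choice(in_idx, size=n_labeled_normals, replace=False)
    return labeled_out, labeled_in

rng = np.random.default_rng(0)
y = np.array([0] * 100 + [1] * 26)           # e.g., the Iris setup of Table 1
for pct in range(0, 45, 5):                  # 0%, 5%, ..., 40% labeled outliers
    lo, li = draw_labels(y, pct, n_labeled_normals=10, rng=rng)
    # run SSOD-AFW / EODSP with (lo, li) and record the accuracy here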

Figure 6 shows the results of the first group of experiments, where the percentage of labeled outliers varies from 0% to 40%. One can see from Figure 6 that the accuracies of the two semisupervised algorithms roughly increase as the proportion of labeled outliers becomes larger. This strongly supports the idea that semisupervised outlier detection algorithms can improve detection accuracy by using prior information. Furthermore, SSOD-AFW achieves a better accuracy than the EODSP algorithm for the same proportion of labeled outliers on all five datasets; for Wine in particular, the accuracy of SSOD-AFW is 40% higher than that of EODSP. EODSP addresses the problem of detecting outliers with only a few labeled outliers as training data. Its labeled normal instances are extracted according to the maximum entropy principle, where the entropy is computed using only the distance between each testing sample and all the labeled outliers. That makes EODSP less flexible than our proposed method due to information deficiencies.

Figure 7 illustrates the accuracy comparison of the proposed algorithm and EODSP when the proportion of labeled normal samples increases from 0% to 40% and the percentage of labeled outliers is fixed. Note that our method obtains a better accuracy than EODSP on all five real-world datasets, and the accuracy of the proposed algorithm grows as the percentage of labeled normal samples increases. As mentioned, EODSP performs semisupervised outlier detection using only a few labeled outliers in the initial dataset, without considering any labeled normal objects. Therefore, the accuracy of the EODSP algorithm remains unchanged across the various proportions of labeled normal objects and always equals its accuracy at 0% labeled normal samples.

4.3.4. Parameter Analysis. The parameters m, K, and q are important in our proposed algorithm and affect the performance of SSOD-AFW. In this section, the influence of each parameter on the outlier detection accuracy is studied.
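The three panels of Figure 8 correspond to a one-at-a-time sensitivity analysis: each parameter is swept over its range while the other two stay at default values. A generic sketch of this sweep, with accuracy_fn standing in for a full SSOD-AFW run, might look as follows.

import numpy as np

def one_at_a_time_sweep(accuracy_fn, defaults=dict(m=2.1, K=0.7, q=2.0)):
    """Vary one parameter at a time over the Figure 8 ranges, keeping the
    other two fixed at their defaults; returns {param: [(value, acc), ...]}."""
    grids = {"m": np.arange(1.5, 5.01, 0.5),     # fuzzification coefficient
             "K": np.arange(0.1, 0.91, 0.1),     # weight of label information
             "q": np.arange(1.5, 4.01, 0.5)}     # feature weight index
    out = {}
    for name, grid in grids.items():
        rows = []
        for value in grid:
            params = dict(defaults, **{name: float(value)})
            rows.append((float(value), accuracy_fn(**params)))
        out[name] = rows
    return out

# Example with a dummy accuracy function:
print(one_at_a_time_sweep(lambda m, K, q: 0.9)["m"][:2])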

The parameter m is the fuzzification coefficient. Figure 8(a) analyzes the relationship between the outlier detection accuracy of our proposed algorithm and the parameter m, with m varying from 1.5 to 5.0. The results imply that the highest accuracy is achieved when m ranges in [2, 4], so it is reasonable that the value of m in the experiments shown in Figure 5 was set to 2.1. The parameter K ∈ (0, 1) controls the importance of the label information in the outlier detection result. Outlier detection accuracies are tested by varying K from 0.1 to 0.9, as shown in Figure 8(b). The overall tendency is that the accuracies become larger as K increases; the best results of the proposed algorithm occur and remain stable when K ≥ 0.7. Finally, from Figure 8(c) we conclude that the feature weight index q has a small influence on the accuracy of SSOD-AFW when the other parameters keep the same settings, so the proposed algorithm is not sensitive to the parameter q. In general, q is suggested to be a constant from (1, 4].

4.3.5. Execution Time Analysis. Figure 9 analyzes the average running time of the proposed algorithm against the other algorithms on the five real-world datasets. The experimental environment is a Windows XP system, MATLAB 7.1 platform, 3 GHz CPU, and 2 GB RAM. Because the volume of the Abalone dataset is far greater than that of the other four datasets, the running times on the various datasets differ markedly; to facilitate the display in Figure 9, the horizontal coordinate axis is translated downward a short distance. The result indicates that the proposed algorithm is more time-saving than the other four typical outlier detection algorithms, except for KNN on the dataset Wine.


Figure 6: Accuracy analysis of algorithms EODSP and SSOD-AFW with different percentages of labeled outliers on the real-world datasets: (a) Iris, (b) Abalone, (c) Wine, (d) Ecoli, (e) WDBC.

On the whole, the execution time of SSOD-AFW is comparable to that of KNN and less than those of the other algorithms on most of the datasets.
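A comparable timing harness (our own sketch in Python with time.perf_counter, not the authors' MATLAB procedure) simply averages wall-clock time over repeated runs:

import time

def mean_runtime(fn, *args, n_repeats=10):
    """Average wall-clock running time of one detector over n_repeats runs."""
    times = []
    for _ in range(n_repeats):
        t0 = time.perf_counter()
        fn(*args)
        times.append(time.perf_counter() - t0)
    return sum(times) / len(times)

# Usage: mean_runtime(detector.fit_predict, X) for each algorithm and dataset.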

5. Conclusions

In order to detect outliers more precisely, a semisupervised outlier detection algorithm based on adaptive feature weighted clustering, called SSOD-AFW, is proposed in this paper. Distinct weights of each feature with respect to different clusters are considered and obtained by adaptive iteration, so that the negative effects of irrelevant features on outlier detection are weakened. Moreover, the proposed method makes full use of the prior knowledge contained in datasets and detects outliers by virtue of the cluster structure. A series of experiments verifies that the proposed SSOD-AFW algorithm is superior to other typical unsupervised, semisupervised, and supervised algorithms in both outlier detection precision and running speed.


Figure 7: Accuracy analysis of algorithms EODSP and SSOD-AFW with different percentages of labeled normal samples on the real-world datasets: (a) Iris, (b) Abalone, (c) Wine, (d) Ecoli, (e) WDBC.

In this paper we have presented a new semisupervised outlier detection method which utilizes the labels of a small number of objects. However, our method assumes that the labels of objects are reliable and does not consider mislabel punishment in the new objective function; therefore, a robust version of the proposed method dealing with noisy or imperfect labels deserves further study. Moreover, since only one typical dissimilarity measure, the Euclidean distance, is discussed in our method, the SSOD-AFW algorithm is limited to outlier detection for numerical data. Future research aims at extending our method to mixed-attribute data in more real-life applications, such as fault diagnosis in industrial processes or network anomaly detection.
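One plausible ingredient for the mixed-attribute extension mentioned above is a Gower-style dissimilarity that combines range-normalized numeric differences with 0/1 mismatches on categorical attributes; the sketch below is our illustration of that idea, not part of SSOD-AFW.

import numpy as np

def gower_dissimilarity(a, b, numeric_mask, ranges):
    """Gower-style dissimilarity between two mixed-attribute records:
    range-normalized absolute difference on numeric attributes,
    0/1 mismatch on categorical ones, averaged over attributes."""
    a, b = np.asarray(a, dtype=object), np.asarray(b, dtype=object)
    d = np.empty(len(a))
    for j in range(len(a)):
        if numeric_mask[j]:
            d[j] = abs(float(a[j]) - float(b[j])) / ranges[j]
        else:
            d[j] = 0.0 if a[j] == b[j] else 1.0
    return d.mean()

print(gower_dissimilarity([1.0, "red"], [3.0, "blue"],
                          numeric_mask=[True, False], ranges=[4.0, None]))
# -> (0.5 + 1.0) / 2 = 0.75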


Figure 8: Outlier detection accuracy of the proposed algorithm under various parameters on the real-world datasets: (a) parameter m, (b) parameter K, (c) parameter q.

Figure 9: Execution time comparison of different algorithms (LOF, KNN, SVDD, EODSP, SSOD-AFW) on the real-world datasets.


Competing Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (Grant no. 11471001).

References

[1] J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, Elsevier, 2011.
[2] D. M. Hawkins, Identification of Outliers, Chapman & Hall, London, UK, 1980.
[3] V. Barnett and T. Lewis, Outliers in Statistical Data, John Wiley & Sons, Chichester, UK, 1994.
[4] V. J. Hodge and J. Austin, "A survey of outlier detection methodologies," Artificial Intelligence Review, vol. 22, no. 2, pp. 85-126, 2004.
[5] B. Sheng, Q. Li, W. Mao, and W. Jin, "Outlier detection in sensor networks," in Proceedings of the 8th ACM International Symposium on Mobile Ad Hoc Networking and Computing (MobiHoc '07), pp. 219-228, Montreal, Canada, September 2007.
[6] A. P. James and S. Dimitrijev, "Inter-image outliers and their application to image classification," Pattern Recognition, vol. 43, no. 12, pp. 4101-4112, 2010.
[7] J. Huang, Q. Zhu, L. Yang, and J. Feng, "A non-parameter outlier detection algorithm based on Natural Neighbor," Knowledge-Based Systems, vol. 92, pp. 71-77, 2016.
[8] J. M. Shepherd and S. J. Burian, "Detection of urban-induced rainfall anomalies in a major coastal city," Earth Interactions, vol. 7, no. 4, pp. 1-17, 2003.
[9] O. Alan and C. Catal, "Thresholds based outlier detection approach for mining class outliers: an empirical case study on software measurement datasets," Expert Systems with Applications, vol. 38, no. 4, pp. 3440-3445, 2011.
[10] B. Schölkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson, "Estimating the support of a high-dimensional distribution," Neural Computation, vol. 13, no. 7, pp. 1443-1471, 2001.
[11] D. Tax, A. Ypma, and R. Duin, "Support vector data description applied to machine vibration analysis," in Proceedings of the 5th Annual Conference of the Advanced School for Computing and Imaging, pp. 398-405, Heijen, The Netherlands, June 1999.
[12] D. M. J. Tax and R. P. W. Duin, "Support vector data description," Machine Learning, vol. 54, no. 1, pp. 45-66, 2004.
[13] S. Ramaswamy, R. Rastogi, and K. Shim, "Efficient algorithms for mining outliers from large data sets," SIGMOD Record (ACM Special Interest Group on Management of Data), vol. 29, no. 2, pp. 427-438, 2000.
[14] E. M. Knox and R. T. Ng, "Algorithms for mining distance-based outliers in large datasets," in Proceedings of the International Conference on Very Large Data Bases, pp. 392-403, Citeseer, New York, NY, USA, 1998.
[15] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander, "LOF: identifying density-based local outliers," ACM SIGMOD Record, vol. 29, no. 2, pp. 93-104, 2000.
[16] J. Ha, S. Seok, and J.-S. Lee, "A precise ranking method for outlier detection," Information Sciences, vol. 324, pp. 88-107, 2015.
[17] R. N. Dave, "Characterization and detection of noise in clustering," Pattern Recognition Letters, vol. 12, no. 11, pp. 657-664, 1991.
[18] R. Smith, A. Bivens, M. Embrechts, C. Palagiri, and B. Szymanski, "Clustering approaches for anomaly based intrusion detection," in Proceedings of Intelligent Engineering Systems Through Artificial Neural Networks, pp. 579-584, 2002.
[19] Y. Shi and L. Zhang, "COID: a cluster-outlier iterative detection approach to multi-dimensional data analysis," Knowledge and Information Systems, vol. 28, no. 3, pp. 709-733, 2011.
[20] J. Zhao, K. Liu, W. Wang, and Y. Liu, "Adaptive fuzzy clustering based anomaly data detection in energy system of steel industry," Information Sciences, vol. 259, pp. 335-345, 2014.
[21] F. Angiulli and C. Pizzuti, "Fast outlier detection in high dimensional spaces," in Proceedings of the 6th European Conference on Principles of Data Mining and Knowledge Discovery (PKDD '02), vol. 2, pp. 15-26, August 2002.
[22] M. Radovanovic, A. Nanopoulos, and M. Ivanovic, "Reverse nearest neighbors in unsupervised distance-based outlier detection," IEEE Transactions on Knowledge and Data Engineering, vol. 27, no. 5, pp. 1369-1382, 2015.
[23] A. Daneshpazhouh and A. Sami, "Entropy-based outlier detection using semi-supervised approach with few positive examples," Pattern Recognition Letters, vol. 49, pp. 77-84, 2014.
[24] J. Gao, H. Cheng, and P.-N. Tan, "Semi-supervised outlier detection," in Proceedings of the ACM Symposium on Applied Computing, pp. 635-636, ACM, Dijon, France, April 2006.
[25] Z. Xue, Y. Shang, and A. Feng, "Semi-supervised outlier detection based on fuzzy rough C-means clustering," Mathematics and Computers in Simulation, vol. 80, no. 9, pp. 1911-1921, 2010.
[26] J. Z. Huang, M. K. Ng, H. Rong, and Z. Li, "Automated variable weighting in k-means type clustering," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 5, pp. 657-668, 2005.
[27] H. Frigui and O. Nasraoui, "Unsupervised learning of prototypes and attribute weights," Pattern Recognition, vol. 37, no. 3, pp. 567-581, 2004.
[28] J. Zhou, L. Chen, C. P. Chen, Y. Zhang, and H. Li, "Fuzzy clustering with the entropy of attribute weights," Neurocomputing, vol. 198, pp. 125-134, 2016.
[29] M. Hassan, A. Chaudhry, A. Khan, and M. A. Iftikhar, "Robust information gain based fuzzy c-means clustering and classification of carotid artery ultrasound images," Computer Methods and Programs in Biomedicine, vol. 113, no. 2, pp. 593-609, 2014.
[30] R. Krishnapuram and J. M. Keller, "A possibilistic approach to clustering," IEEE Transactions on Fuzzy Systems, vol. 1, no. 2, pp. 98-110, 1993.
[31] M.-S. Yang and K.-L. Wu, "Unsupervised possibilistic clustering," Pattern Recognition, vol. 39, no. 1, pp. 5-21, 2006.
[32] S. M. Guo, L. C. Chen, and J. S. H. Tsai, "A boundary method for outlier detection based on support vector domain description," Pattern Recognition, vol. 42, no. 1, pp. 77-83, 2009.
[33] T. Fawcett, "ROC graphs: notes and practical considerations for researchers," Machine Learning, vol. 31, no. 1, pp. 1-38, 2004.
[34] C. Blake and C. J. Merz, "UCI repository of machine learning databases," 1998.
[35] C. C. Aggarwal and P. S. Yu, "Outlier detection for high dimensional data," in Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD '01), Santa Barbara, Calif, USA, May 2001.


10 Mathematical Problems in Engineering

10

09

08

07

06

05

04

03

02

01

00

Accu

racy

Iris Abalone Wine Ecoli WDBCDatasets

LOFKNNSVDD

EODSPSSOD-AFW

(a) Accuracy

10

09

08

07

06

05

04

03

02

01

00

AUC

Iris Abalone Wine Ecoli WDBCDatasets

LOFKNNSVDD

EODSPSSOD-AFW

(b) AUC

10

09

08

07

06

05

04

03

02

01

00

RP

Iris Abalone Wine Ecoli WDBCDatasets

LOFKNNSVDD

EODSPSSOD-AFW

(c) RP

Figure 5 Performance comparison of various algorithms on the real-world datasets

percent of labeled outliers and normal samples respectivelyon the datasets Iris Abalone Wine Ecoli andWDBC In thetwo experiments the percent of labeled outliers or labelednormal samples ranges from 0 to 40 respectively whenthe number of another kind of labeled objects is fixedWe randomly select a certain number of labeled outliersor normal samples from each dataset each experiment isrepeated 10 times and the average accuracies of SSOD-AFWand EODSP are computed

Figure 6 shows results of the first group of experimentswhere the percent of labeled outliers varies from 0 to 40One can see from Figure 6 that the accuracies of the twosemisupervised algorithms are roughly increased with theproportion of labeled outliers becoming larger This power-fully supports the idea that semisupervised outlier detectionalgorithms can improve the accuracy of outlier detectionby using prior information Furthermore the SSOD-AFWachieves a better accuracy than EODSP algorithm for thesame proportion of labeled outliers on the five datasetsEspecially for Wine the accuracy of SSOD-AFW is 40higher than that of EODSP EODSP addresses the problemof detecting outliers with only few labeled outliers as trainingdata The labeled normal instances are extracted accordingto the maximum entropy principle where the entropy iscomputed only using the distance between each testingsample and all the labeled outliers That makes EODSPnot flexible as our proposed method due to informationdeficiencies

Figure 7 illustrates the accuracy comparison of the pro-posed algorithm and EODSP when the proportion of labelednormal samples increases from 0 to 40 and the percent oflabeled outliers is fixed Note that ourmethod obtains a betteraccuracy than EODSP on all of the five real-world datasetsThe accuracy of the proposed algorithm gets larger when thepercent of labeled normal samples increases As mentionedEODSP emphasizes the semisupervised outlier detectiononly with few labeled outliers in the initial dataset butwithout considering any labeled normal objects Therefore

the accuracy of EODSP algorithm keeps harper with variousproportions of labeled normal objects and always equals theaccuracy value of 0 labeled normal samples as well

434 Parameter Analysis The parameters 119898 119870 and 119902are important in our proposed algorithm which affect theperformance of SSOD-AFW In this section the influence ofeach parameter on outlier detection accuracy is studied

The parameter 119898 is the fuzzification coefficient Fig-ure 8(a) analyzes the relationship between the outlier detec-tion accuracy of our proposed algorithm and parameter 119898with 119898 varying from 15 to 50 The results imply that thehighest accuracy is achieved when 119898 ranges in [2 4] Soit is reasonable that 119898 value in the experiments shown inFigure 5 has been set as 21 The parameter 119870 isin (0 1)controls the importance of the label information in the resultof outlier detection Outlier detection accuracies are testifiedby varying119870 from 01 to 09 which are shown in Figure 8(b)The overall tendency is that the accuracies become larger as119870 increasesThe best results of the proposed algorithm occurand keep stable when 119870 ge 07 Finally from Figure 8(c) weconclude that the feature weight index 119902 has small influenceon the accuracy of SSOD-AFW in the situation that theother parametersmaintain the same settings So the proposedalgorithm is not sensitive to the parameter 119902 In general theparameter 119902 is suggested to select a constant from (1 4]435 Execution Time Analysis Figure 9 analyzes the averagerunning time of the proposed algorithm against the otheralgorithms on five real-world datasets The experimentalenvironment is Windows XP systems MATLAB 71 platform3GHz CPU 2GB RAM Because the volume of Abalonedataset is far greater than the other four datasets the runningtimes of various datasets are distinctly different In order tofacilitate the display in Figure 9 the horizontal coordinate axisis translated downward a short distance The result indicatesthat the proposed algorithm is more time-saving than the

Mathematical Problems in Engineering 11

1009080706050403020100

Accu

racy

EODSPSSOD-AFW

0Proportion of labeled outliers

5 10 15 20 25 35 4030

(a) Iris

1009080706050403020100

Accu

racy

EODSPSSOD-AFW

0Proportion of labeled outliers

5 10 15 20 25 35 4030

(b) Abalone1009080706050403020100

Accu

racy

EODSPSSOD-AFW

0Proportion of labeled outliers

5 10 15 20 25 35 4030

(c) Wine1009080706050403020100

Accu

racy

EODSPSSOD-AFW

0Proportion of labeled outliers

5 10 15 20 25 35 4030

(d) Ecoli

1009080706050403020100

Accu

racy

EODSPSSOD-AFW

0Proportion of labeled outliers

5 10 15 20 25 35 4030

(e) WDBC

Figure 6 Accuracy analysis of algorithms EODSP and SSOD-AFW with different percent of labeled outliers on the real-world datasets

other four typical outlier detection algorithms except for119870NN on dataset Wine In whole the execution time of theSSOD-AFW is comparable to that of119870NNand less than thoseof other algorithms on most of the datasets

5 Conclusions

In order to detect outliers more precisely a semisuper-vised outlier detection algorithm based on adaptive feature

weighted clustering called SSOD-AFW is proposed in thispaper Distinct weights of each feature with respect to differ-ent clusters are considered and obtained by adaptive iterationso that the negative effects of irrelevant features on outlierdetection are weakened Moreover the proposed methodmakes full use of the prior knowledge contained in datasetsand detects outliers in virtue of the cluster structure It isverified by a series of experiments that the proposed SSOD-AFW algorithm is superior to other typical unsupervised

12 Mathematical Problems in Engineering

0

1009080706050403020100

Accu

racy

Proportion of labeled normal samples5

EODSPSSOD-AFW

10 15 20 25 35 4030

(a) Iris

1009080706050403020100

Accu

racy

EODSPSSOD-AFW

0Proportion of labeled normal samples5 10 15 20 25 35 4030

(b) Abalone

1009080706050403020100

Accu

racy

EODSPSSOD-AFW

0Proportion of labeled normal samples5 10 15 20 25 35 4030

(c) Wine

1009080706050403020100

Accu

racy

EODSPSSOD-AFW

0Proportion of labeled normal samples

5 10 15 20 25 35 4030

(d) Ecoli

1009080706050403020100

Accu

racy

EODSPSSOD-AFW

0Proportion of labeled normal samples5 10 15 20 25 35 4030

(e) WDBC

Figure 7 Accuracy analysis of algorithms EODSP and SSOD-AFW with different percent of labeled normal samples on the real-worlddatasets

semisupervised and supervised algorithms in both outlierdetection precision and running speed

In this paper we present a new semisupervised outlierdetection method which utilizes labels of a small amountof objects However our method assumes that the labels

of objects are reliable and does not consider mislabel pun-ishment in the new objective function Therefore a robustversion of the proposed method dealing with noisy orimperfect labels of objects deserves further studiesMoreoversince only one typical dissimilaritymeasure named Euclidean

Mathematical Problems in Engineering 13

1009080706050403020100

Accu

racy

m

5045403530252015

IrisAbaloneWine

EcoliWDBC

(a) Parameter119898

K

1009080706050403020100

Accu

racy

090807060504030201

IrisAbaloneWine

EcoliWDBC

(b) Parameter119870

q

40

1009080706050403020100

3530252015

Accu

racy

IrisAbaloneWine

EcoliWDBC

(c) Parameter 119902

Figure 8 Outlier detection accuracy of the proposed algorithm under various parameters on the real-world datasets

Iris Abalone Wine Ecoli WDBCDatasets

50

45

40

35

30

25

20

15

10

5

0

Runn

ing

time (

s)

LOFKNNSVDD

EODSPSSOD-AFW

Figure 9 Execution time comparison of different algorithms on thereal-world datasets

distance is discussed in our method the SSOD-AFW algo-rithm is limited to outlier detection for numerical data Thefuture research aims at extending our method to mixed-attribute data in more real-life applications such as faultdiagnosis in industrial process or network anomaly detection

Competing Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

This work was supported by the National Natural ScienceFoundation of China (Grant no 11471001)

References

[1] J Han M Kamber and J Pei Data Mining Concepts andTechniques Elsevier 2011

[2] D M Hawkins Identification of Outliers Chapman amp HallLondon UK 1980

[3] V Bamnett and T LewisOutliers in Statistical Data JohnWileyamp Sons Chichester UK 1994

[4] V J Hodge and J Austin ldquoA survey of outlier detectionmethodologiesrdquo Artificial Intelligence Review vol 22 no 2 pp85ndash126 2004

[5] B Sheng Q Li W Mao and W Jin ldquoOutlier detection insensor networksrdquo in Proceedings of the 8th ACM InternationalSymposium on Mobile Ad Hoc Networking and Computing(MobiHoc rsquo07) pp 219ndash228Montreal Canada September 2007

[6] A P James and S Dimitrijev ldquoInter-image outliers and theirapplication to image classificationrdquo Pattern Recognition vol 43no 12 pp 4101ndash4112 2010

[7] J Huang Q Zhu L Yang and J Feng ldquoA non-parameter outlierdetection algorithm based on Natural Neighborrdquo Knowledge-Based Systems vol 92 pp 71ndash77 2016

[8] J M Shepherd and S J Burian ldquoDetection of urban-inducedrainfall anomalies in a major coastal cityrdquo Earth Interactionsvol 7 no 4 pp 1ndash17 2003

[9] O Alan and C Catal ldquoThresholds based outlier detectionapproach for mining class outliers an empirical case study onsoftware measurement datasetsrdquo Expert Systems with Applica-tions vol 38 no 4 pp 3440ndash3445 2011

[10] B Scholkopf J C Platt J Shawe-Taylor A J Smola and RC Williamson ldquoEstimating the support of a high-dimensionaldistributionrdquo Neural Computation vol 13 no 7 pp 1443ndash14712001

[11] D Tax A Ypma and R Duin ldquoSupport vector data descriptionapplied to machine vibration analysisrdquo in Proceedings of the 5thAnnual Conference of the Advanced School for Computing andImaging pp 398ndash405 Heijen The Netherlands June 1999

[12] DM J Tax andR PWDuin ldquoSupport vector data descriptionrdquoMachine Learning vol 54 no 1 pp 45ndash66 2004

[13] S Ramaswamy R Rastogi and K Shim ldquoEfficient algorithmsformining outliers from large data setsrdquo SIGMODRecord (ACMSpecial Interest Group onManagement of Data) vol 29 no 2 pp427ndash438 2000

14 Mathematical Problems in Engineering

[14] EM Knox and R T Ng ldquoAlgorithms formining distance basedoutliers in large datasetrdquo in Proceedings of the InternationalConference on Very Large Data Bases pp 392ndash403 CiteseerNew York NY USA 1998

[15] M M Breunig H-P Kriegel R T Ng and J Sander ldquoLOFidentifying density-based local outliersrdquoACMSIGMODRecordvol 29 no 2 pp 93ndash104 2000

[16] J Ha S Seok and J-S Lee ldquoA precise ranking method foroutlier detectionrdquo Information Sciences vol 324 pp 88ndash1072015

[17] R N Dave ldquoCharacterization and detection of noise in clus-teringrdquo Pattern Recognition Letters vol 12 no 11 pp 657ndash6641991

[18] R Smith A Bivens M Embrechts C Palagiri and B Szyman-ski ldquoClustering approaches for anomaly based intrusion detec-tionrdquo Proceedings of Intelligent Engineering Systems ThroughArtificial Neural Networks pp 579ndash584 2002

[19] Y Shi and L Zhang ldquoCOID a cluster-outlier iterative detectionapproach to multi-dimensional data analysisrdquo Knowledge andInformation Systems vol 28 no 3 pp 709ndash733 2011

[20] J Zhao K Liu W Wang and Y Liu ldquoAdaptive fuzzy clusteringbased anomaly data detection in energy system of steel indus-tryrdquo Information Sciences vol 259 pp 335ndash345 2014

[21] F Angiulli and C Pizzuti ldquoFast outlier detection in high dimen-sional spacesrdquo in Proceedings of the 6th European Conference onPrinciples of DataMining and Knowledge Discovery (PKDD rsquo02)vol 2 pp 15ndash26 August 2002

[22] M Radovanovic A Nanopoulos and M Ivanovic ldquoReversenearest neighbors in unsupervised distance-based outlier detec-tionrdquo IEEE Transactions on Knowledge and Data Engineeringvol 27 no 5 pp 1369ndash1382 2015

[23] A Daneshpazhouh and A Sami ldquoEntropy-based outlier detec-tion using semi-supervised approach with few positive exam-plesrdquo Pattern Recognition Letters vol 49 pp 77ndash84 2014

[24] J Gao H Cheng and P-N Tan ldquoSemi-supervised outlierdetectionrdquo in Proceedings of the ACM Symposium on AppliedComputing pp 635ndash636 ACM Dijon France April 2006

[25] Z Xue Y Shang and A Feng ldquoSemi-supervised outlier detec-tion based on fuzzy rough C-means clusteringrdquo Mathematicsand Computers in Simulation vol 80 no 9 pp 1911ndash1921 2010

[26] J Z Huang M K Ng H Rong and Z Li ldquoAutomated variableweighting in k-means type clusteringrdquo IEEE Transactions onPattern Analysis andMachine Intelligence vol 27 no 5 pp 657ndash668 2005

[27] H Friguiand and O Nasraou ldquoUnsupervised learning ofprototypes and attribute weightsrdquo Pattern Recognition vol 37no 3 pp 567ndash581 2004

[28] J Zhou L Chen C P Chen Y Zhang and H Li ldquoFuzzy clus-tering with the entropy of attribute weightsrdquo Neurocomputingvol 198 pp 125ndash134 2016

[29] M Hassan A Chaudhry A Khan and M A Iftikhar ldquoRobustinformation gain based fuzzy c-means clustering and classifi-cation of carotid artery ultrasound imagesrdquo Computer Methodsand Programs in Biomedicine vol 113 no 2 pp 593ndash609 2014

[30] R Krishnapuram and J M Keller ldquoA possibilistic approach toclusteringrdquo IEEE Transactions on Fuzzy Systems vol 1 no 2 pp98ndash110 1993

[31] M-S Yang and K-L Wu ldquoUnsupervised possibilistic cluster-ingrdquo Pattern Recognition vol 39 no 1 pp 5ndash21 2006

[32] SM Guo L C Chen and J S H Tsai ldquoA boundarymethod foroutlier detection based on support vector domain descriptionrdquoPattern Recognition vol 42 no 1 pp 77ndash83 2009

[33] T Fawcett ldquoRoc graphs notes and practical considerations forresearchersrdquoMachine Learning vol 31 no 1 pp 1ndash38 2004

[34] C Blake and C J Merz ldquoUCI repository of machine learningdatabasesrdquo 1998

[35] C C Aggarwal and P S Yu ldquoOutlier detection for highdimensional datardquo in Proceedings of the ACM SIGMOD Interna-tional Conference on Management of Data (SIGMOD rsquo01) SantaBarbara Calif USA May 2001

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Mathematical Problems in Engineering 11

1009080706050403020100

Accu

racy

EODSPSSOD-AFW

0Proportion of labeled outliers

5 10 15 20 25 35 4030

(a) Iris

1009080706050403020100

Accu

racy

EODSPSSOD-AFW

0Proportion of labeled outliers

5 10 15 20 25 35 4030

(b) Abalone1009080706050403020100

Accu

racy

EODSPSSOD-AFW

0Proportion of labeled outliers

5 10 15 20 25 35 4030

(c) Wine1009080706050403020100

Accu

racy

EODSPSSOD-AFW

0Proportion of labeled outliers

5 10 15 20 25 35 4030

(d) Ecoli

1009080706050403020100

Accu

racy

EODSPSSOD-AFW

0Proportion of labeled outliers

5 10 15 20 25 35 4030

(e) WDBC

Figure 6 Accuracy analysis of algorithms EODSP and SSOD-AFW with different percent of labeled outliers on the real-world datasets

other four typical outlier detection algorithms except for119870NN on dataset Wine In whole the execution time of theSSOD-AFW is comparable to that of119870NNand less than thoseof other algorithms on most of the datasets

5 Conclusions

In order to detect outliers more precisely a semisuper-vised outlier detection algorithm based on adaptive feature

weighted clustering called SSOD-AFW is proposed in thispaper Distinct weights of each feature with respect to differ-ent clusters are considered and obtained by adaptive iterationso that the negative effects of irrelevant features on outlierdetection are weakened Moreover the proposed methodmakes full use of the prior knowledge contained in datasetsand detects outliers in virtue of the cluster structure It isverified by a series of experiments that the proposed SSOD-AFW algorithm is superior to other typical unsupervised

12 Mathematical Problems in Engineering

0

1009080706050403020100

Accu

racy

Proportion of labeled normal samples5

EODSPSSOD-AFW

10 15 20 25 35 4030

(a) Iris

1009080706050403020100

Accu

racy

EODSPSSOD-AFW

0Proportion of labeled normal samples5 10 15 20 25 35 4030

(b) Abalone

1009080706050403020100

Accu

racy

EODSPSSOD-AFW

0Proportion of labeled normal samples5 10 15 20 25 35 4030

(c) Wine

1009080706050403020100

Accu

racy

EODSPSSOD-AFW

0Proportion of labeled normal samples

5 10 15 20 25 35 4030

(d) Ecoli

1009080706050403020100

Accu

racy

EODSPSSOD-AFW

0Proportion of labeled normal samples5 10 15 20 25 35 4030

(e) WDBC

Figure 7 Accuracy analysis of algorithms EODSP and SSOD-AFW with different percent of labeled normal samples on the real-worlddatasets

semisupervised and supervised algorithms in both outlierdetection precision and running speed

In this paper we present a new semisupervised outlierdetection method which utilizes labels of a small amountof objects However our method assumes that the labels

of objects are reliable and does not consider mislabel pun-ishment in the new objective function Therefore a robustversion of the proposed method dealing with noisy orimperfect labels of objects deserves further studiesMoreoversince only one typical dissimilaritymeasure named Euclidean

Mathematical Problems in Engineering 13

1009080706050403020100

Accu

racy

m

5045403530252015

IrisAbaloneWine

EcoliWDBC

(a) Parameter119898

K

1009080706050403020100

Accu

racy

090807060504030201

IrisAbaloneWine

EcoliWDBC

(b) Parameter119870

q

40

1009080706050403020100

3530252015

Accu

racy

IrisAbaloneWine

EcoliWDBC

(c) Parameter 119902

Figure 8 Outlier detection accuracy of the proposed algorithm under various parameters on the real-world datasets

Iris Abalone Wine Ecoli WDBCDatasets

50

45

40

35

30

25

20

15

10

5

0

Runn

ing

time (

s)

LOFKNNSVDD

EODSPSSOD-AFW

Figure 9 Execution time comparison of different algorithms on thereal-world datasets

distance is discussed in our method the SSOD-AFW algo-rithm is limited to outlier detection for numerical data Thefuture research aims at extending our method to mixed-attribute data in more real-life applications such as faultdiagnosis in industrial process or network anomaly detection

Competing Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

This work was supported by the National Natural ScienceFoundation of China (Grant no 11471001)

References

[1] J Han M Kamber and J Pei Data Mining Concepts andTechniques Elsevier 2011

[2] D M Hawkins Identification of Outliers Chapman amp HallLondon UK 1980

[3] V Bamnett and T LewisOutliers in Statistical Data JohnWileyamp Sons Chichester UK 1994

[4] V J Hodge and J Austin ldquoA survey of outlier detectionmethodologiesrdquo Artificial Intelligence Review vol 22 no 2 pp85ndash126 2004

[5] B Sheng Q Li W Mao and W Jin ldquoOutlier detection insensor networksrdquo in Proceedings of the 8th ACM InternationalSymposium on Mobile Ad Hoc Networking and Computing(MobiHoc rsquo07) pp 219ndash228Montreal Canada September 2007

[6] A P James and S Dimitrijev ldquoInter-image outliers and theirapplication to image classificationrdquo Pattern Recognition vol 43no 12 pp 4101ndash4112 2010

[7] J Huang Q Zhu L Yang and J Feng ldquoA non-parameter outlierdetection algorithm based on Natural Neighborrdquo Knowledge-Based Systems vol 92 pp 71ndash77 2016

[8] J M Shepherd and S J Burian ldquoDetection of urban-inducedrainfall anomalies in a major coastal cityrdquo Earth Interactionsvol 7 no 4 pp 1ndash17 2003

[9] O Alan and C Catal ldquoThresholds based outlier detectionapproach for mining class outliers an empirical case study onsoftware measurement datasetsrdquo Expert Systems with Applica-tions vol 38 no 4 pp 3440ndash3445 2011

[10] B Scholkopf J C Platt J Shawe-Taylor A J Smola and RC Williamson ldquoEstimating the support of a high-dimensionaldistributionrdquo Neural Computation vol 13 no 7 pp 1443ndash14712001

[11] D Tax A Ypma and R Duin ldquoSupport vector data descriptionapplied to machine vibration analysisrdquo in Proceedings of the 5thAnnual Conference of the Advanced School for Computing andImaging pp 398ndash405 Heijen The Netherlands June 1999

[12] DM J Tax andR PWDuin ldquoSupport vector data descriptionrdquoMachine Learning vol 54 no 1 pp 45ndash66 2004

[13] S Ramaswamy R Rastogi and K Shim ldquoEfficient algorithmsformining outliers from large data setsrdquo SIGMODRecord (ACMSpecial Interest Group onManagement of Data) vol 29 no 2 pp427ndash438 2000

14 Mathematical Problems in Engineering

[14] EM Knox and R T Ng ldquoAlgorithms formining distance basedoutliers in large datasetrdquo in Proceedings of the InternationalConference on Very Large Data Bases pp 392ndash403 CiteseerNew York NY USA 1998

[15] M M Breunig H-P Kriegel R T Ng and J Sander ldquoLOFidentifying density-based local outliersrdquoACMSIGMODRecordvol 29 no 2 pp 93ndash104 2000

[16] J Ha S Seok and J-S Lee ldquoA precise ranking method foroutlier detectionrdquo Information Sciences vol 324 pp 88ndash1072015

[17] R N Dave ldquoCharacterization and detection of noise in clus-teringrdquo Pattern Recognition Letters vol 12 no 11 pp 657ndash6641991

[18] R Smith A Bivens M Embrechts C Palagiri and B Szyman-ski ldquoClustering approaches for anomaly based intrusion detec-tionrdquo Proceedings of Intelligent Engineering Systems ThroughArtificial Neural Networks pp 579ndash584 2002

[19] Y Shi and L Zhang ldquoCOID a cluster-outlier iterative detectionapproach to multi-dimensional data analysisrdquo Knowledge andInformation Systems vol 28 no 3 pp 709ndash733 2011

[20] J Zhao K Liu W Wang and Y Liu ldquoAdaptive fuzzy clusteringbased anomaly data detection in energy system of steel indus-tryrdquo Information Sciences vol 259 pp 335ndash345 2014

[21] F Angiulli and C Pizzuti ldquoFast outlier detection in high dimen-sional spacesrdquo in Proceedings of the 6th European Conference onPrinciples of DataMining and Knowledge Discovery (PKDD rsquo02)vol 2 pp 15ndash26 August 2002

[22] M Radovanovic A Nanopoulos and M Ivanovic ldquoReversenearest neighbors in unsupervised distance-based outlier detec-tionrdquo IEEE Transactions on Knowledge and Data Engineeringvol 27 no 5 pp 1369ndash1382 2015

[23] A Daneshpazhouh and A Sami ldquoEntropy-based outlier detec-tion using semi-supervised approach with few positive exam-plesrdquo Pattern Recognition Letters vol 49 pp 77ndash84 2014

[24] J Gao H Cheng and P-N Tan ldquoSemi-supervised outlierdetectionrdquo in Proceedings of the ACM Symposium on AppliedComputing pp 635ndash636 ACM Dijon France April 2006

[25] Z Xue Y Shang and A Feng ldquoSemi-supervised outlier detec-tion based on fuzzy rough C-means clusteringrdquo Mathematicsand Computers in Simulation vol 80 no 9 pp 1911ndash1921 2010

[26] J Z Huang M K Ng H Rong and Z Li ldquoAutomated variableweighting in k-means type clusteringrdquo IEEE Transactions onPattern Analysis andMachine Intelligence vol 27 no 5 pp 657ndash668 2005

[27] H Friguiand and O Nasraou ldquoUnsupervised learning ofprototypes and attribute weightsrdquo Pattern Recognition vol 37no 3 pp 567ndash581 2004

[28] J Zhou L Chen C P Chen Y Zhang and H Li ldquoFuzzy clus-tering with the entropy of attribute weightsrdquo Neurocomputingvol 198 pp 125ndash134 2016

[29] M Hassan A Chaudhry A Khan and M A Iftikhar ldquoRobustinformation gain based fuzzy c-means clustering and classifi-cation of carotid artery ultrasound imagesrdquo Computer Methodsand Programs in Biomedicine vol 113 no 2 pp 593ndash609 2014

[30] R Krishnapuram and J M Keller ldquoA possibilistic approach toclusteringrdquo IEEE Transactions on Fuzzy Systems vol 1 no 2 pp98ndash110 1993

[31] M-S Yang and K-L Wu ldquoUnsupervised possibilistic cluster-ingrdquo Pattern Recognition vol 39 no 1 pp 5ndash21 2006

[32] SM Guo L C Chen and J S H Tsai ldquoA boundarymethod foroutlier detection based on support vector domain descriptionrdquoPattern Recognition vol 42 no 1 pp 77ndash83 2009

[33] T Fawcett ldquoRoc graphs notes and practical considerations forresearchersrdquoMachine Learning vol 31 no 1 pp 1ndash38 2004

[34] C Blake and C J Merz ldquoUCI repository of machine learningdatabasesrdquo 1998

[35] C C Aggarwal and P S Yu ldquoOutlier detection for highdimensional datardquo in Proceedings of the ACM SIGMOD Interna-tional Conference on Management of Data (SIGMOD rsquo01) SantaBarbara Calif USA May 2001
