
International Journal of Computational Intelligence Systems, Vol. 7, No. 4 (August 2014), 748-757

Received 4 July 2012

Accepted 11 March 2013

Outlier Detection Based on Local Kernel Regression for Instance Selection

Qinmu Peng 1, Yiu-ming Cheung 1,2,∗

1 Department of Computer Science, Hong Kong Baptist University, Hong Kong SAR, China
2 United International College, Beijing Normal University – Hong Kong Baptist University, Zhuhai, China

E-mail: [email protected], [email protected]

Abstract

In this paper, we propose an outlier detection approach based on local kernel regression for instance selection. It evaluates the reconstruction error of instances by their neighbors to identify the outliers. Experiments are performed on synthetic and real data sets to show the efficacy of the proposed approach in comparison with the existing counterparts.

Keywords: Outlier Detection; Instance Selection; Local Kernel Regression.

1. Introduction

Instance selection, which is to clean noise-corrupted and redundant data, has been widely used in real-world applications such as marketing research 1 and data mining 2. It can circumvent the over-fitting problem and improve a learner's performance, while speeding up the learning process. Instance selection is therefore regarded as an important preprocessing step in data analysis.

Typically, instance selection can be made via outlier removal. In general, an outlier 18 is considered as an observation that deviates too much from other observations, i.e. an outlier is generated by a different mechanism compared to the other observations. The task of outlier detection is defined as follows: “Given a set of N data points or objects and the number n of expected outliers, find the top n objects that are considerably dissimilar, exceptional, or inconsistent with respect to the remaining data” 19,20.

Thus far, various outlier detection techniques for instance selection have been proposed, which can be roughly grouped into five categories: (1) statistical or distribution-based approaches, (2) density-based approaches, (3) distance-based approaches, (4) deviation-based approaches, and (5) cluster-based approaches. Statistical-based approaches 3,4 use a standard distribution model for the data points, through which outliers can be detected based on their relationship with the distribution model. Usually, the performance of such a method relies heavily on the hypothesized model. In density-based methods, outliers are detected from the local density of observations with different density estimation strategies. For instance, Breunig et al. 5 proposed the local outlier factor (LOF) to score each point depending on the local density of its neighborhood. By defining the neighborhood based on the distance to the MinPts-th nearest neighbor, Jin et al. 6 proposed an algorithm to discover the top n outliers efficiently for a particular value of MinPts. Furthermore, Tang et al. 7 developed the connectivity-based outlier factor (COF) algorithm analogous to the LOF approach.


Papadimitriou et al. 8 computed the number of neighbors for each point to evaluate its outlier-ness, in which a point whose neighborhood size varies significantly with respect to the neighborhood sizes of its neighbors is regarded as an outlier. Distance-based methods judge a point based on the distances to its neighbors. That is, normal data points have a dense neighborhood, while outliers are far apart from their neighbors. In the work 2,9,10 by Knorr et al., a data point in a data set P is a distance-based outlier if at least a fraction β of the points in P is farther than r from it. Also, Hautamaki et al. 11 constructed the k-NN graph for a data set, in which a vertex whose indegree is less than a user-defined threshold is an outlier. Deviation-based approaches group points and consider as outliers those points that deviate considerably from the general characteristics of the groups. For example, Kriegel et al. 12 proposed angle-based outlier detection in high-dimensional space. The idea is that a point is an outlier if most of its neighbors are located in similar directions; otherwise, it is an inner point whose neighbors are located in different directions. Besides, some other deviation-based approaches have been proposed in the works 13,14. Cluster-based approaches usually detect outliers as by-products 15. Hence, these algorithms cannot properly detect the outliers in a noisy environment unless the number of clusters is known in advance. Along this line, He et al. 16 proposed FindCBLOF to determine the Cluster-Based Local Outlier Factor (CBLOF) for each data point. Yang et al. 17 introduced a globally optimal exemplar-based GMM to detect the outliers, in which a Gaussian is centered at each point. The outlier factor at each point is calculated by the weighted sum of the mixture proportions, with the weights representing the similarities to the other points.

In this paper, we present an outlier detection approach based on local kernel regression (LKR) along the line of distance-based approaches. Compared to the statistical-based and density-based approaches, it does not require a priori knowledge of the distribution or density function. Also, it can provide more local structure information than the deviation-based approaches, and it avoids the instability of cluster-based approaches.

To be specific, the proposed outlier detection algorithm takes into account the local structure of the data space. We consider not only the distance between a point and its neighbors, but also the distances between the neighbors. Hence, each data point can be better approximated by a combination of its neighbors. Usually, the reconstruction error of an inner point is much smaller than that of a point on the boundary of the data distribution. Under these circumstances, we present an iterative method based on local kernel ridge regression that detects the outliers inward, starting from the utmost outlier, which has the greatest reconstruction error. Experiments have shown its promising results on synthetic and real data sets in comparison with the existing counterparts.

The remainder of this paper is organized as follows. Section 2 briefly overviews kernel ridge regression. The details of the proposed approach, including the time complexity analysis, are given in Section 3. Section 4 shows the experimental results. Finally, we draw a conclusion in Section 5.

2. Overview of Kernel Ridge Regression

Kernel ridge regression 21,24 is an effective approach to model nonlinear regression. Given the training data {(x_i, y_i)}_{i=1}^N, where x_i is an input variable and y_i is the corresponding target value, kernel ridge regression estimates the target value from the input variables. It can be defined as

    g(x) = Σ_{i=1}^N α_i K(x, x_i),    (1)

where K(·,·) is a kernel function, and the α_i's are the coefficients, which can be estimated by minimizing the following objective function:

    min_α ∥Kα − y∥² + γ α^T K α,    (2)

where α = [α_1, ···, α_N]^T, y = [y_1, ···, y_N]^T, γ is a small positive regularization parameter, and K = [k_{i,j}]_{N×N} with k_{i,j} = K(x_i, x_j). In Eq. (2), the first and second terms are called the fitness term and the regularization term, respectively.


The solution of Eq. (2) is

    α = (K + γI)^{-1} y,    (3)

where I is an N×N identity matrix. Therefore, g(x) can be given as

    g(x) = k_x^T (K + γI)^{-1} y,    (4)

where k_x = [K(x, x_1), ···, K(x, x_N)]^T is the vector formed by the kernel function values.
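For concreteness, the closed-form solution in Eqs. (1)-(4) can be sketched in a few lines of Python/NumPy. This is our own illustration rather than code from the paper; the function names and the default Gaussian bandwidth h are assumptions chosen for the example.

    import numpy as np

    def gaussian_kernel(A, B, h=100.0):
        # Pairwise Gaussian kernel exp(-||a - b||^2 / h) between the rows of A and B.
        d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
        return np.exp(-d2 / h)

    def krr_fit(X, y, gamma=0.1, h=100.0):
        # Coefficients alpha = (K + gamma*I)^{-1} y, as in Eq. (3).
        K = gaussian_kernel(X, X, h)
        return np.linalg.solve(K + gamma * np.eye(len(X)), y)

    def krr_predict(X_train, alpha, x_new, h=100.0):
        # g(x) = sum_i alpha_i K(x, x_i), as in Eqs. (1) and (4).
        k_x = gaussian_kernel(x_new[None, :], X_train, h).ravel()
        return k_x @ alpha

For example, krr_predict(X, krr_fit(X, y), x_query) returns the estimate g(x_query) for a query point x_query.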

3. A New Algorithm for Outlier Detection Using Local Kernel Regression

It is found that the reconstruction error of an inner point is usually much smaller than that of one on the boundary. Accordingly, we measure the outlier-ness of points based on this phenomenon. That is, we examine how well the feature value of each point can be estimated by its neighbors. In this paper, we utilize a local kernel ridge regression to perform the estimation.

Given a training datum x_i and {(x_j, l_{rj})}_{x_j∈N_i}, where N_i denotes the neighbors of x_i and l_{rj} is the r-th feature of x_j, we need to train a kernel ridge regression model to estimate the value l_{ri}, i.e. the r-th feature of x_i. Based on Eq. (4), we obtain the local kernel ridge regression model for x_i:

    g_{N_i}(l_{ri}) = k_{N_i}^T (K_{N_i} + γI)^{-1} l_{rN_i},    (5)

where g_{N_i}(·) is the estimate of l_{ri} given {(x_j, l_{rj})}_{x_j∈N_i}, K_{N_i} is the local kernel matrix, i.e. K_{N_i} = [K(x_m, x_n)] with x_m, x_n ∈ N_i, and l_{rN_i} denotes the vector of l_{rj}'s for all x_j ∈ N_i. We let

    β_{N_i}^T = k_{N_i}^T (K_{N_i} + γI)^{-1}.    (6)

Subsequently, Eq. (5) can be rewritten as

    g_{N_i}(l_{ri}) = β_{N_i}^T l_{rN_i}.    (7)

Hence, the reconstruction error for the r-th feature of x_i is estimated as

    Re(l_{ri}) = (g_{N_i}(l_{ri}) − l_{ri})² = (β_{N_i}^T l_{rN_i} − l_{ri})².    (8)

In general, the value of each feature in an instance can be quite different; thus Re(l_{ri}) needs to be normalized for each feature. Let the diagonal matrix D be diag(D_{11}, ···, D_{NN}) with D_{ii} = (1/|N_i|) Σ_{x_j∈N_i} (x_i − x_j)². Obviously, D_{ii} indicates the density around x_i. The weighted variance V_w(r) of each feature is estimated as in the work 22. That is,

    V_w(r) = Σ_{i=1}^N (l_{ri} − μ_r)² D_{ii},    (9)

where μ_r = (Σ_i l_{ri} D_{ii}) / (Σ_i D_{ii}) is the weighted mean of the r-th feature. Suppose each instance x_i contains M features. The normalized total reconstruction error RE_i is given as follows:

    RE_i = Σ_{r=1}^M Re(l_{ri}) / V_w(r) = Σ_{r=1}^M (g_{N_i}(l_{ri}) − l_{ri})² / V_w(r).    (10)

From Eq. (8), the total reconstruction error RE_i can then be written as:

    RE_i = Σ_{r=1}^M (β_{N_i}^T l_{rN_i} − l_{ri})² / V_w(r).    (11)
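To make Eqs. (5)-(11) concrete, the following sketch (again ours, not the authors' implementation) computes the normalized total reconstruction error RE_i for every instance, assuming k-nearest neighbors under the Euclidean distance and the Gaussian kernel used later in the experiments; names such as reconstruction_errors are hypothetical.

    import numpy as np

    def reconstruction_errors(X, k=10, gamma=0.1, h=100.0):
        # Normalized total reconstruction error RE_i of Eq. (11) for each row of X.
        N, M = X.shape
        d2 = np.sum((X[:, None, :] - X[None, :, :])**2, axis=2)   # pairwise squared distances
        nbrs = np.argsort(d2, axis=1)[:, 1:k+1]                   # k nearest neighbors, excluding the point itself

        # Density weights D_ii = (1/|N_i|) * sum_{x_j in N_i} ||x_i - x_j||^2
        D = np.array([d2[i, nbrs[i]].mean() for i in range(N)])

        # Weighted mean and weighted variance V_w(r) of each feature, Eq. (9)
        mu = (D @ X) / D.sum()
        Vw = np.maximum(((X - mu)**2 * D[:, None]).sum(axis=0), 1e-12)  # guard against zero variance

        RE = np.zeros(N)
        for i in range(N):
            Ni = nbrs[i]
            K_Ni = np.exp(-d2[np.ix_(Ni, Ni)] / h)                  # local kernel matrix K_{N_i}
            k_Ni = np.exp(-d2[i, Ni] / h)                           # kernel values between x_i and its neighbors
            beta = np.linalg.solve(K_Ni + gamma * np.eye(k), k_Ni)  # Eq. (6)
            g = beta @ X[Ni]                                        # Eq. (7), all M features at once
            RE[i] = np.sum((g - X[i])**2 / Vw)                      # Eq. (11)
        return RE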

We first select the utmost outliers on the boundary, which have the greatest reconstruction errors. After removing them, the points that were surrounded by the boundary outliers become the new boundary outliers; we therefore remove them in the next iteration. By repeating this procedure, all outliers are eventually removed.

Since the ratio of outliers in a given data set is hard to know in advance, a common approach is to find the top n outliers in the data set, e.g., see the works 19,20. Accordingly, we propose the algorithm LKROut, which receives as input a data set X of N instances, the number n of top outliers to find, and the number k of neighbors to consider; the set OutS consisting of the detected outliers is the output of this algorithm.

The details of algorithm LKROut are summarized as follows:


Algorithm OutS = LKROut(X, n, k)
begin
    Construct the neighbor graph G and the kernel matrix K;
    Initialize the outlier set: OutS = ∅; n' = 0; N' = N;
    while (n' ≤ n) do
    begin
        Estimate g_{N_i}(l_{ri}) for each instance by Eq. (7);
        Calculate RE_i for each instance by Eq. (11);
        Sort all {RE_i} in descending order;
        Choose the top p instances as outliers: OutL;
        Update OutS: OutS ← OutS ∪ OutL;
        Update n' = n' + p, N' = N' − p;
        Remove OutL from the data set;
        Update G and K;
    end;
    return OutS;
end.

In the above algorithm, the parameter p in each iteration can be set according to the reconstruction errors, supposing the elements in {RE_i} are sorted in descending order. The top p instances with the largest reconstruction errors are chosen as outliers, which account for only a small portion of the data set. That is, p is set according to the formula p/N ≤ η, where N is the number of instances in the data set and η is a small ratio. Often, only a few iterations are needed to find the top n outliers among the data instances.
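Building on the reconstruction_errors sketch above, one possible realization of the LKROut loop is given below; the removal fraction eta and the helper name lkr_out are our assumptions, not part of the paper.

    import numpy as np

    def lkr_out(X, n, k=10, eta=0.05, gamma=0.1, h=100.0):
        # Iteratively detect the top-n outliers by largest reconstruction error.
        remaining = np.arange(len(X))            # indices of instances still in the data set
        outliers = []
        while len(outliers) < n and len(remaining) > k:
            RE = reconstruction_errors(X[remaining], k=k, gamma=gamma, h=h)
            p = max(1, min(int(eta * len(remaining)), n - len(outliers)))
            worst = np.argsort(RE)[::-1][:p]     # top-p largest reconstruction errors
            outliers.extend(remaining[worst].tolist())
            remaining = np.delete(remaining, worst)   # remove them and re-estimate in the next iteration
        return outliers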

The complexity analysis of the key steps in the proposed algorithm is given as follows. The complexity of constructing the neighbor graph and the kernel matrix K is O(N²M). The local kernel regression in Eq. (7) costs O(Nk³), and the computation of the reconstruction errors in Eq. (11) costs O(kNM). Usually, the neighborhood size satisfies k ≪ N. Therefore, the overall complexity is about O(tN²M), where t is the number of iterations and M is the feature dimension.

In general, the proposed method is different from the nearest-neighbor methods, e.g., see the papers 2,9,10,11. Essentially, these nearest-neighbor methods consider only the distance between a point x_i and its nearest neighbors, without taking into account the distances between the neighbors. Actually, the distances between the neighbors can provide more relationship information. Accordingly, the proposed method considers not only the distance between a point and its neighbors, but also the distances between the pairs of neighbors, by weighing the influence of the relevant neighborhood points 23. Since such a way better reveals the local relationships in the data, it is expected that the proposed method generally outperforms these nearest-neighbor methods.

4. Experimental Results

The experiments were performed on several synthetic and real-world data sets to show the performance of the proposed algorithm in comparison with the existing counterparts. In general, local outlier detection approaches have shown better performance than global models. Accordingly, we chose two of the most popular local approaches, namely the LOF 5 and angle-based outlier detection (ABOD) 12, as counterparts.

In all experiments, we utilized the Gaussian kernel, i.e. K(x_m, x_n) = exp(−∥x_m − x_n∥²/h), and the kernel parameter h and the ridge parameter γ were set at 100 and 0.1, respectively. To measure the performance of each algorithm, we utilized the detection rate (true positive rate) and the false alarm rate (false positive rate) from the work 17, as well as the ROC curves, which are defined as follows:

    Detection rate = TP / (TP + FN),
    False alarm rate = FP / (FP + TN).    (12)

As shown in Table 1, the detection rate reveals the relative ratio of correctly identified outliers, while the false alarm rate is the percentage of normal data records misclassified as outliers. Furthermore, the ROC curve is defined by the false positive rate and the true positive rate as its x- and y-axes, respectively, and depicts the relative trade-off between detection rate (benefits) and false alarm rate (costs).

Table 1. Performance indicator.

                     Predicted Outliers      Predicted Normal
    Actual Outliers  TP (True positive)      FN (False negative)
    Actual Normal    FP (False positive)     TN (True negative)
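As a small illustration (ours, not the authors' code) of Eq. (12) and Table 1, given boolean arrays marking predicted and actual outliers:

    import numpy as np

    def detection_and_false_alarm(predicted, actual):
        # Detection rate TP/(TP+FN) and false alarm rate FP/(FP+TN), as in Eq. (12).
        predicted, actual = np.asarray(predicted, bool), np.asarray(actual, bool)
        tp = np.sum(predicted & actual)
        fn = np.sum(~predicted & actual)
        fp = np.sum(predicted & ~actual)
        tn = np.sum(~predicted & ~actual)
        return tp / (tp + fn), fp / (fp + tn)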


[Figure 1 appears here: four scatter plots (a)-(d) on the range 0-8 by 0-8.]
Fig. 1. Outlier detection results. (a) Data points: normal points are small dots in black, while the outliers are large dots in blue. (b) LOF method; (c) ABOD method; (d) Proposed method. Note that the large dots in red are the outliers detected by these methods.

4.1. Synthetic data

4.1.1. Synthetic data with several noisy points

The synthetic data were utilized to evaluate the performance of the three approaches. In Fig. 1(a), there are eight outliers (marked with blue dots) and two clusters with different densities: 50 points are in the sparse cluster and 318 points are in the dense cluster. From Fig. 1(b-d), it can be observed that the LOF method based on local neighbors is unable to detect all outliers, while ABOD and the proposed method are able to detect all of them in this environment. A plausible reason is that the LOF considers only the local distance between a point and its neighbors, but ignores the distances between the pairs of neighbors. As a result, the LOF, as well as the other neighbor-based methods, has difficulty detecting the outliers when the densities vary. In contrast, the proposed approach considers both distances, which provide more information about the local domain. As for the ABOD approach, it is an angle-based one that describes the divergence of points with respect to one another. Hence, unlike the distance-based approaches, it can perform well in this case.

4.1.2. Synthetic data with many noisy points

In this experiment, more noisy points were added into the data set to evaluate the capability of the three approaches.

750 normal data points were randomly generated from a homogeneous distribution model, which formed the shape “H”.


Also, 150 extra data points were generated by a random distribution that was independent of the model of the normal data. The synthetic data set is shown in Fig. 3(a). We compared the proposed algorithm with the ABOD and LOF approaches on this 2-D data set. In these methods, the neighborhood size was set at 10. In order to evaluate the capability of each method to identify the most likely outliers, we utilized the true positive rate and the false positive rate, and drew ROC curves. That is, we successively retrieved the most likely outliers until all outliers were retrieved. The results of the proposed approach are shown by the ROC curves in Fig. 2 in comparison with the other two approaches. It can be seen that the proposed approach outperforms the other methods, and the LOF method performs poorly on this data set.

[Figure 2 appears here: ROC curves (true positive rate vs. false positive rate) for LOF, ABOD, and OUR.]
Fig. 2. ROC curves on the synthetic data.

In addition, we show a snapshot of the results obtained by retrieving the top 15% of data points as outliers. From Fig. 3(b-d), it can be seen that most outliers have been successfully detected by the ABOD and the proposed approach. Nevertheless, compared to the proposed method, some obvious outliers (marked with a black box) were not detected by the ABOD method. Also, the LOF approach was unable to deal with them even with different neighborhood sizes.

4.2. Real data

We further demonstrated the effectiveness of the proposed method on two real-world data sets.

4.2.1. Breast Cancer Wisconsin Data

In this experiment, we utilized the Breast Cancer Wisconsin (Original) Data Set †, which is a medical data set that has been used for breast tumor diagnosis. It contains 699 medical diagnosis records with 11 attributes (ID, class, and 9 real-valued features); 683 records are valid, while the remaining 16 records have missing values. The class type is binary, i.e. “Benign” and “Malignant”. We treated the 444 records labeled “Benign” as normal data. Furthermore, a certain number of “Malignant” records were added into the normal data as outliers. In this data set, our task was to identify the outliers, i.e. the abnormal data records, and to evaluate the three different approaches.

We measured the percentage of detected outliers among the top-num potential outliers as the detection rate, i.e. detection rate = n_detected_outliers / num, where num is the number of outliers. Experiments were performed three times with num = 20. Each time, we performed ten independent trials on the data set (normal data with randomly selected outliers) to calculate the average detection rate and its standard deviation. The experimental results are shown in Table 2.

Moreover, we examined the relationship between the neighborhood size and the detection rate. In this experiment, we utilized 20 outliers, which were randomly extracted from the “Malignant” records. Once again, top-20 was utilized here. Fig. 4 shows the outlier detection rate over different neighborhood sizes. It can be seen that the detection rate of the proposed method is more robust against the neighborhood size. In contrast, the LOF approach suffers from the effect of the neighborhood size.

†UCI ML Repository: http://archive.ics.uci.edu/ml/datasets.html


[Figure 3 appears here: four scatter plots (a)-(d) on the range 0-11 by 0-11.]
Fig. 3. Outlier detection results. (a) Data points; (b) LOF method; (c) ABOD method; (d) Proposed method. The red points are outliers detected by these methods.

Table 2. The detection rate of each method.

    Number of            ABOD                LOF                 Our
    Detected Outliers    Detection Rate      Detection Rate      Detection Rate
    20                   0.6750 ± 0.0024     0.0450 ± 0.0014     0.7050 ± 0.0039
    15                   0.7600 ± 0.0320     0.0500 ± 0.0080     0.7800 ± 0.0200
    10                   0.8600 ± 0.0138     0.0620 ± 0.0093     0.8800 ± 0.0173


[Figure 4 appears here: detection rate vs. neighborhood size (5-25) for ABOD, OUR, and LOF.]
Fig. 4. Outlier detection rate over different neighborhood sizes by ten independent runs.

4.2.2. MPEG-7 Shape Data

We also evaluated the proposed method on the widely used MPEG-7 shape data set for the detection of unusual shapes. MPEG-7 contains 70 classes of binary shape images with 20 shapes in each class. In Fig. 5, there are 20 bell shapes from one class and 4 unusual shapes from other classes. Hence, the 4 unusual shapes should be retrieved for the purpose of outlier detection. Each shape image is represented by the inner-distance shape context (IDSC) feature 25, which provides a good shape descriptor for binary images. The results are demonstrated in Fig. 6. The ABOD and LOF approaches retrieve the same results, which contain only three of the true unusual shapes, whereas the proposed approach identifies all of them.

[Figure 5 appears here.]
Fig. 5. The shape database for unusual shape detection.

[Figure 6 appears here: one row of retrieved shapes per method (LOF, OUR, ABOD).]
Fig. 6. Top 4 ranked unusual shapes (outliers) by the three approaches.

5. Conclusion

In this paper, we have proposed a new way to measure the outlier-ness of instances based on the reconstruction error of instances by their neighbors. It is found that the reconstruction error of a normal instance is usually much smaller than that of one on the boundary. Accordingly, we have presented an iterative algorithm to detect the outliers. Experimental results have demonstrated its efficacy in comparison with the existing counterparts.

Acknowledgment

The work described in this paper was supported by the Faculty Research Grant of Hong Kong Baptist University under Project Codes FRG2/11-12/067 and FRG2/12-13/082, and by the NSFC under grant 61272366.

References

1. H. Nare, D. Maposa, and M. Lesaoana, A method for detection and correction of outliers in time series data, African Journal of Business Management, 6 (22), 6631-6639 (2012).

2. E.M. Knorr, R.T. Ng, and V. Tucakov, Distance-based outliers: algorithms and applications, The International Journal on Very Large Data Bases, 8, 237-253 (2000).

3. N. Billor, A. Hadi, and P. Velleman, BACON: blocked adaptive computationally efficient outlier nominators, Computational Statistics and Data Analysis, 34, 279-298 (2000).

4. E. Eskin, Anomaly detection over noisy data using learned probability distributions, Proceedings of the International Conference on Machine Learning, 255-262 (2000).

5. M.M. Breunig, H.P. Kriegel, R.T. Ng, and J. Sander, LOF: identifying density-based local outliers, Proceedings of the ACM SIGMOD International Conference on Management of Data, 93-104 (2000).

6. W. Jin, A. Tung, and J. Han, Mining top-n local outliers in large databases, Proceedings of the Knowledge Discovery and Data Mining Conference, 293-298 (2001).

7. J. Tang, Z. Chen, A. Fu, and D. Cheung, Enhancing effectiveness of outlier detections for low density patterns, Proceedings of the Pacific-Asia Knowledge Discovery and Data Mining Conference, 535-548 (2002).

8. S. Papadimitriou, H. Kitagawa, P.B. Gibbons, and C. Faloutsos, LOCI: fast outlier detection using the local correlation integral, Proceedings of the International Conference on Data Engineering, 315-326 (2003).

9. E.M. Knorr and R.T. Ng, Algorithms for mining distance-based outliers in large datasets, Proceedings of the International Conference on Very Large Data Bases, 392-403 (1998).

10. E.M. Knorr and R.T. Ng, Finding intentional knowledge of distance-based outliers, Proceedings of the International Conference on Very Large Data Bases, 211-222 (1999).

11. V. Hautamaki, I. Karkkainen, and P. Franti, Outlier detection using k-nearest neighbour graph, Proceedings of the IEEE International Conference on Pattern Recognition, 430-433 (2004).

12. H.-P. Kriegel, M. Schubert, and A. Zimek, Angle-based outlier detection in high-dimensional data, Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 444-452 (2008).

13. A. Arning, R. Agrawal, and P. Raghavan, A linear method for deviation detection in large databases, Proceedings of the Knowledge Discovery and Data Mining Conference, 164-169 (1996).

14. S. Sarawagi, R. Agrawal, and N. Megiddo, Discovery-driven exploration of OLAP data cubes, Proceedings of the International Conference on Extending Database Technology - Advances in Database Technology, 168-182 (1998).

15. A.K. Jain, M.N. Murty, and P.J. Flynn, Data clustering: a review, ACM Computing Surveys, 31 (3), 264-323 (1999).

16. Z. He, X. Xu, and S. Deng, Discovering cluster-based local outliers, Pattern Recognition Letters, 24, 1641-1650 (2003).

17. X. Yang, L.J. Latecki, and D. Pokrajac, Outlier detection with globally optimal exemplar-based GMM, Proceedings of the SIAM International Conference on Data Mining, 145-154 (2009).

18. D.M. Hawkins, Identification of outliers, Chapman and Hall, London (1980).

19. J. Han and M. Kamber, Data mining: concepts and techniques, Morgan Kaufmann, San Francisco (2001).

20. F. Angiulli and C. Pizzuti, Outlier mining in large high-dimensional data sets, IEEE Transactions on Knowledge and Data Engineering, 17 (2), 203-215 (2005).

21. Y.M. Cheung and H. Zeng, Local kernel regression score for selecting features of high-dimensional data, IEEE Transactions on Knowledge and Data Engineering, 21 (12), 1798-1802 (2009).

22. X. He, D. Cai, and P. Niyogi, Laplacian score for feature selection, Proceedings of the Advances in Neural Information Processing Systems Conference, 507-514 (2005).

23. Q.M. Peng and Y.M. Cheung, Sample outlier detection based on local kernel regression, Proceedings of the IEEE/WIC/ACM International Conference on Intelligent Agent Technology (IAT'2012), 664-668 (2012).

24. J. Shawe-Taylor and N. Cristianini, Kernel methods for pattern analysis, Cambridge University Press (2004).

25. H.B. Ling and D.W. Jacobs, Shape classification using the inner-distance, IEEE Transactions on Pattern Analysis and Machine Intelligence, 29 (2), 286-299 (2007).
