On the Evaluation of Unsupervised Outlier Detection
Arthur Zimek
Ludwig-Maximilians-Universität München
Talk at TU Vienna, 18.05.2016
On the evaluation of unsupervised outlier detection
Campos, Zimek, Sander, Campello, Micenková, Schubert, Assent, and Houle: On the evaluation of unsupervised outlier detection: Measures, datasets, and an empirical study. Data Mining and Knowledge Discovery, 2016. http://doi.org/10.1007/s10618-015-0444-8
Online repository with complete material (methods, datasets, results, analysis): http://www.dbs.ifi.lmu.de/research/outlier-evaluation/
What is an Outlier?
The intuitive definition of an outlier would be "an observation which deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism".
[Hawkins, 1980]
Simple model: take the kNN distance of a point as its outlier score [Ramaswamy et al., 2000]
(Figure: example points p1, p2, p3 with their kNN distances)
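A minimal sketch of this score (illustrative only; the study itself uses the ELKI implementations): brute-force kNN distances in numpy, with k and the toy data chosen arbitrarily.

import numpy as np

def knn_outlier_scores(X, k):
    """kNN outlier score of each point: distance to its k-th nearest
    neighbor (excluding the point itself), as in Ramaswamy et al. [2000]."""
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))   # pairwise Euclidean distances
    np.fill_diagonal(dist, np.inf)             # a point is not its own neighbor
    dist.sort(axis=1)                          # ascending per row
    return dist[:, k - 1]                      # distance to the k-th neighbor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(50, 2)), [[8.0, 8.0]]])  # one planted outlier
scores = knn_outlier_scores(X, k=5)
print(scores.argmax())  # 50: the planted outlier gets the largest score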
Motivation
- many new outlier detection methods developed every year
- some studies about efficiency [Orair et al., 2010, Kriegel et al., 2016]
- specializations for different areas [Chandola et al., 2009, Zimek et al., 2012, Schubert et al., 2014a, Akoglu et al., 2015]
- evaluation of effectiveness remains notoriously challenging:
  - characterisation of outlierness differs from method to method
  - lack of commonly agreed upon benchmark data
  - measure of success? (most commonly: ROC)
Outline
Outlier Detection Methods
Evaluation Measures
Datasets
Experiments
Conclusions
Selected Methods
- kNN [Ramaswamy et al., 2000], kNN-weight [Angiulli and Pizzuti, 2005]
- ODIN [Hautamäki et al., 2004] (related to low hubness outlierness [Radovanovic et al., 2014])
- LOF [Breunig et al., 2000]
- SimplifiedLOF [Schubert et al., 2014a], COF [Tang et al., 2002], INFLO [Jin et al., 2006], LoOP [Kriegel et al., 2009]
- LDOF [Zhang et al., 2009], LDF [Latecki et al., 2007], KDEOS [Schubert et al., 2014b]
- FastABOD [Kriegel et al., 2008]
Discussion
- all these methods have a common parameter, the neighborhood size k
- this family of kNN-based methods is popular and contains both classic and recent methods
- nevertheless, the parameter k has different interpretations and impact among the selected methods
- the selected methods comprise both 'global' and 'local' methods [Schubert et al., 2014a]
- included variants of LOF vary different components of the typical local outlier model: notion of neighborhood, distance, density estimates, model comparison
- first study to compare all these methods
- all methods are implemented in a common framework, ELKI [Schubert et al., 2015]
Evaluation of Rankings
- methods return a full ranking of database objects
- user interested in the top-ranked objects
Two example outlier score rankings over the same 615 objects:

Rank   Score (ranking 1)   Score (ranking 2)
1      0.996               56.069
2      0.995               47.862
3      0.994               43.726
4      0.994               43.449
5      0.994               41.938
6      0.993               37.283
7      0.993               36.444
8      0.992               32.193
9      0.991               27.095
10     0.989               23.429
11     0.985               16.629
12     0.974               9.339
13     0.891               4.029
14     0.877               3.369
15     0.854               3.208
16     0.803               3.091
17     0.680               3.091
18     0.573               3.020
19     0.568               3.020
20     0.562               2.988
...    ...                 ...
76     ...                 2.268
151    ...                 2.017
285    ...                 1.687
407    ...                 1.251
...    ...                 ...
615    0.000               0.171
examples taken from Campello et al. [2015]
Precision at n
- If the number of outlier candidates n is specified, the simplest measure of performance is the precision at n (P@n), i.e., the proportion of correct results in the top n ranks [Craswell, 2009a].
Precision at n (P@n): Given a database D of size N, outliers O ⊂ D and inliers I ⊆ D (D = O ∪ I), we have

P@n = |{o ∈ O | rank(o) ≤ n}| / n
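A small sketch of P@n, assuming a method's output is a list of object ids sorted by decreasing outlier score (the toy ranking and ground-truth set below are made up):

def precision_at_n(ranking, outliers, n):
    """P@n: fraction of true outliers among the top-n ranked objects.
    `ranking` lists object ids by decreasing outlier score;
    `outliers` is the set of ground-truth outlier ids."""
    return sum(1 for o in ranking[:n] if o in outliers) / n

ranking = [7, 3, 9, 1, 4]                              # highest score first
print(precision_at_n(ranking, outliers={3, 4}, n=3))   # -> 1/3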
Precision at n — Properties
- how to fairly choose the parameter n of P@n?
- n = |O| yields the popular R-Precision measure [Craswell, 2009b]
- n = |O| ≪ N ⇒ typically obtained values of P@n can be deceptively low, and not very informative as such
- n = |O| relatively large (of the same order as N) ⇒ deceptively high values of P@n can be obtained simply due to the relatively small number of inliers available
Precision at n — Adjustment for Chance
- general procedure due to Hubert and Arabie [1985]:

Adjusted Index = (Index − Expected Index) / (Maximum Index − Expected Index)

- maximum P@n: |O|/n if n > |O|, and 1 otherwise
- expected value (random outlier ranking): |O|/N

Precision at n (P@n), adjusted for chance: if n ≤ |O|,

Adjusted P@n = (P@n − |O|/N) / (1 − |O|/N)

For larger n, the maximum |O|/n must be used instead of 1.
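Continuing the sketch above, the chance adjustment is a one-liner; the helper name adjusted_precision_at_n is illustrative, not from the paper:

def adjusted_precision_at_n(p_at_n, n_outliers, N, n):
    """Chance-adjusted P@n via Hubert & Arabie's scheme:
    (index - expected) / (maximum - expected)."""
    expected = n_outliers / N               # random-ranking baseline
    maximum = min(1.0, n_outliers / n)      # |O|/n if n > |O|, else 1
    return (p_at_n - expected) / (maximum - expected)

# toy check: N=1000, |O|=10, a perfect top 10 (P@10 = 1) adjusts to 1.0
print(adjusted_precision_at_n(1.0, n_outliers=10, N=1000, n=10))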
Imbalance
- problem: imbalance between the numbers of inliers and outliers: |I| ≫ |O|, |I| ≈ N
- P@n and Adjusted P@n measures are easily interpreted
- but they are sensitive to the choice of n, particularly when n is small
- example: for a dataset with 10 outliers and 1 million inliers, a result with (quite high) ranks 11–20 for the true outliers yields a P@10 of 0 but a P@20 of 0.5
- the Adjusted P@n measure can be seen to suffer from a similar sensitivity with respect to the choice of n
Average Precision
- even in good results, outliers typically do not form a larger fraction among the top |O| ranks
- use measures that aggregate performance over a wide range of possible choices of n
- for example average precision [Zhang and Zhang, 2009] (popular in information retrieval)

Average Precision (AP):

AP = (1/|O|) · Σ_{o ∈ O} P@rank(o)
Average Precision — Adjustment for Chance
- perfect ranking yields a maximum value of 1
- expected value of a random ranking is |O|/N

Average Precision (AP), adjusted for chance:

Adjusted AP = (AP − |O|/N) / (1 − |O|/N)
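A sketch of AP and its chance adjustment on the same toy ranking as before; function names are illustrative:

def average_precision(ranking, outliers):
    """AP = mean over true outliers o of P@rank(o)."""
    hits, precisions = 0, []
    for rank, obj in enumerate(ranking, start=1):
        if obj in outliers:
            hits += 1
            precisions.append(hits / rank)     # P@rank(o)
    return sum(precisions) / len(outliers)

def adjusted_average_precision(ap, n_outliers, N):
    """Chance-adjusted AP: maximum is 1, expectation is |O|/N."""
    expected = n_outliers / N
    return (ap - expected) / (1.0 - expected)

ranking = [7, 3, 9, 1, 4]
ap = average_precision(ranking, outliers={3, 4})    # (1/2 + 2/5) / 2 = 0.45
print(ap, adjusted_average_precision(ap, n_outliers=2, N=5))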
When Should we Adjust for Chance?
for P@n and AP, adjustment for chance is:
- not necessary, when the performance of two methods on the same dataset (that is, with the same proportion of outliers) is compared in relative terms
- helpful, if the measure is to be interpreted in absolute terms
- strictly necessary, if the performance is to be compared over different datasets with different proportions of outliers
Receiver Operating Characteristic (ROC)
- plots, for all possible choices of n, the true positive rate (the proportion of outliers correctly ranked among the top n) versus the false positive rate (the proportion of inliers ranked among the top n)
(Figure: example ROC curve, true positive rate vs. false positive rate; ROC AUC: 0.92684211)
- based on rates, ROC inherently adjusts for the imbalance of class sizes typical of outlier detection tasks
Area under the ROC curve (ROC AUC)
- A ROC curve can be summarized by a single value known as ROC AUC, defined as the area under the ROC curve (AUC)
- 0 ≤ ROC AUC ≤ 1
- interpretation: average of the recall at n, with n taken over the ranks of all inlier objects
- probabilistic interpretation [Hanley and McNeil, 1982]: probability of a pair (o, i), where o is some true outlier and i is some inlier, being ordered correctly in the evaluated ranking:

ROC AUC := mean over o ∈ O, i ∈ I of:
  1    if score(o) > score(i)
  1/2  if score(o) = score(i)
  0    if score(o) < score(i)
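This probabilistic interpretation translates directly into a quadratic-time sketch (for large datasets one would use an equivalent rank-based formula instead):

import numpy as np

def roc_auc(scores, labels):
    """ROC AUC as the probability that a random outlier/inlier pair is
    ordered correctly by the scores; ties count 1/2."""
    out = scores[labels == 1]
    inl = scores[labels == 0]
    greater = (out[:, None] > inl[None, :]).sum()   # correctly ordered pairs
    ties = (out[:, None] == inl[None, :]).sum()
    return (greater + 0.5 * ties) / (len(out) * len(inl))

scores = np.array([0.9, 0.8, 0.7, 0.7, 0.2])
labels = np.array([1, 0, 1, 0, 0])                  # 1 = outlier, 0 = inlier
print(roc_auc(scores, labels))                      # -> 0.75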
Discussion
- all these evaluation measures are external: they require annotated ground truth (outlier/inlier)
- useful for competitive evaluation of algorithms where the outliers in a dataset are actually known
- for evaluation in practical applications of methods we would need internal evaluation measures
- so far, only one paper in the literature describes an internal evaluation procedure, the work of Marques et al. [2015]
Ground Truth for Outlier Detection?
- no commonly agreed upon and frequently used benchmark data available
- UCI datasets etc.: ground truth by class labels, not readily usable for outlier evaluation
- papers on outlier detection prepare some datasets ad hoc or reuse some datasets that have been prepared ad hoc by others
- preparation involves decisions that are often not sufficiently documented
- we follow the common practice of downsampling some class in a classification dataset to produce an outlier class
Datasets Used in the Literature
Dataset       Preprocessing                                         N      |O|   Attr. (num/cat)  Version used by
ALOI          50000 images, 27 attr.                                50000  1508  27 / -           Kriegel et al. [2011], Schubert et al. [2012]
              24000 images, 27648 attr.                                                           de Vries et al. [2012]
Glass         Class 6 (out.) vs. others (in.)                       214    9     7 / -            Keller et al. [2012]
Ionosphere    Class 'b' (out.) vs. class 'g' (in.)                  351    126   32 / -           Keller et al. [2012]
KDDCup99      U2R (out.) vs. Normal (in.)                           60632  246   38 / 3           Nguyen and Gopalkrishnan [2010], Nguyen et al. [2010], Kriegel et al. [2011], Schubert et al. [2012]
Lymphography  Classes 1 and 4 (out.) vs. others (in.)               148    6     3 / 16           Lazarevic and Kumar [2005], Nguyen et al. [2010], Zimek et al. [2013]
PenDigits     Downsampling class '4' to 20 objects (out.)           9868   20    16 / -           Kriegel et al. [2011], Schubert et al. [2012]
              Downsampling class '0' to 10% (out.)                                                Keller et al. [2012]
Shuttle       Classes 2, 3, 5, 6, 7 (out.) vs. class 1 (in.)                                      Lazarevic and Kumar [2005], Abe et al. [2006], Nguyen et al. [2010]
              Class 2 (out.) vs. downs. others to 1000 obj. (in.)   1013   13    9 / -            Zhang et al. [2009]
              Downs. classes 2, 3, 5, 6, 7 (out.) vs. others (in.)                                Gao and Tan [2006]
Waveform      Downsampling class '0' to 100 objects (out.)          3443   100   21 / -           Zimek et al. [2013]
WBC           'malignant' (out.) vs. 'benign' (in.)                                               Gao and Tan [2006]
              Downs. class 'malignant' to 10 objects (out.)         454    10    9 / -            Kriegel et al. [2011], Schubert et al. [2012], Zimek et al. [2013]
WDBC          Downs. class 'malignant' to 10 objects (out.)         367    10    30 / -           Zhang et al. [2009]
              'malignant' (out.) vs. 'benign' (in.)                                               Keller et al. [2012]
WPBC          Class 'R' (out.) vs. class 'N' (in.)                  198    47    33 / -           Keller et al. [2012]
Semantically Meaningful Outlier Datasets
Dataset           Semantics                                   N     |O|   Attributes (num./binary)
Annthyroid        2 types of hypothyroidism vs. healthy       7200  534   21
Arrhythmia        12 types of cardiac arrhythmia vs. healthy  450   206   259
Cardiotocography  pathologic, suspect vs. healthy             2126  471   21
HeartDisease      heart problems vs. healthy                  270   120   13
Hepatitis         survival vs. fatal                          80    13    19
InternetAds       ads vs. other images                        3264  454   1555
PageBlocks        non-text vs. text                           5473  560   10
Parkinson         healthy vs. Parkinson                       195   147   22
Pima              diabetes vs. healthy                        768   268   8
SpamBase          non-spam vs. spam                           4601  1813  57
Stamps            genuine vs. forged                          340   31    9
Wilt              diseased trees vs. other                    4839  261   5
Dataset Preparation I
downsampling: randomly downsample one class (as outlier class) while retaining all instances from other classes (as inlier class)
- great variation in the nature of outliers produced
- we repeat the downsampling ten times, resulting in ten different datasets
- we adopt different downsample rates where applicable (resulting in datasets with 20%, 10%, 5%, or 2% outliers); see the sketch below

duplicates: duplicates can be problematic (e.g., for density estimates); for datasets containing duplicates, we generate two variants, one with the original duplicates, and one without duplicates
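A possible sketch of the downsampling step (the study's exact sampling procedure may differ; the function name and the proportion arithmetic are assumptions):

import numpy as np

def downsample_outlier_class(X, y, outlier_class, outlier_rate, rng):
    """Keep all objects of the other classes as inliers, and randomly keep
    just enough objects of `outlier_class` that they make up `outlier_rate`
    of the resulting dataset. Returns data and binary labels (1 = outlier)."""
    inl_idx = np.flatnonzero(y != outlier_class)
    out_idx = np.flatnonzero(y == outlier_class)
    n_out = int(round(outlier_rate * len(inl_idx) / (1.0 - outlier_rate)))
    keep = rng.choice(out_idx, size=min(n_out, len(out_idx)), replace=False)
    sel = np.concatenate([inl_idx, keep])
    return X[sel], (y[sel] == outlier_class).astype(int)

# ten repetitions, e.g. at 5% outliers, would loop over ten different seeds:
rng = np.random.default_rng(0)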
Dataset Preparation II
categorical attributes: three variants of handling categorical attributes:
- categorical attributes are removed
- 1-of-n encoding: a categorical attribute with n possible values is mapped into n binary attributes (presence or absence of the corresponding categorical value)
- IDF: a categorical attribute value is encoded as its inverse document frequency

IDF(t) = ln(N / f_t)

(N: total number of instances, f_t: frequency of the attribute value t)
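A minimal sketch of the IDF encoding as defined above:

import math
from collections import Counter

def idf_encode(column):
    """Replace each categorical value t by IDF(t) = ln(N / f_t), where N is
    the number of instances and f_t the frequency of value t in the column."""
    N = len(column)
    freq = Counter(column)
    return [math.log(N / freq[t]) for t in column]

print(idf_encode(["a", "a", "a", "b"]))  # rare value "b" gets the larger weight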
Dataset Preparation III
normalization: Normalization of datasets is expected to have considerable impact on the results, but is rarely discussed in the literature. Two variants for each dataset that does not already have normalized attributes:
- unnormalized
- attribute-wise linear normalization to the range [0, 1]

missing values: If an attribute has fewer than 10% of instances with missing values, those instances are removed. Otherwise, the attribute itself is removed. (See the sketch below.)
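Minimal sketches of the two rules, assuming NaN marks missing entries (numpy-based illustrations, not the study's actual preprocessing scripts):

import numpy as np

def normalize_01(X):
    """Attribute-wise linear normalization to [0, 1]."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)   # guard constant attributes
    return (X - lo) / span

def drop_missing(X):
    """If fewer than 10% of instances miss a value in an attribute, drop
    those instances; otherwise drop the attribute."""
    miss_frac = np.isnan(X).mean(axis=0)
    X = X[:, miss_frac < 0.10]               # drop heavily affected attributes
    return X[~np.isnan(X).any(axis=1)]       # then drop affected instances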
These procedures, applied to most of the 23 datasets (some are taken from the literature unchanged, where they are available), result in about 1000 dataset variants.
Outline
Outlier Detection Methods
Evaluation Measures
Datasets
Experiments
  Evaluation Measures
  Characterization of the Methods
  Characterization of the Datasets
Conclusions
Example: Annthyroid
(Figures: P@n, Adjusted P@n, AP, Adjusted AP, and ROC AUC as functions of the neighborhood size k = 1–100 on Annthyroid_withoutdupl_norm_07, for kNN, kNNW, LOF, SimplifiedLOF, LoOP, LDOF, ODIN, KDEOS, COF, FastABOD, LDF, and INFLO)
Example: HeartDisease
(Figures: P@n, Adjusted P@n, AP, Adjusted AP, and ROC AUC as functions of the neighborhood size k = 1–100 on HeartDisease_withoutdupl_norm_44, for the same 12 methods)
Example: Wilt
(Figures: P@n, Adjusted P@n, AP, Adjusted AP, and ROC AUC as functions of the neighborhood size k = 1–100 on Wilt_withoutdupl_norm_02_v01, for the same 12 methods)
Observations
all results available in the web repository: http://www.dbs.ifi.lmu.de/research/outlier-evaluation/
- performance trends differ across algorithms, datasets, parameters, and evaluation methods
- ROC AUC is less sensitive to the number of true outliers
- ROC AUC scores across the datasets are typically reasonably high
- P@n scores are considerably lower for datasets with smaller proportions of outliers
- AP resembles ROC AUC, assessing the ranks of all outliers, but tends to be lower with stronger imbalance
- P@n can discriminate between methods that perform more or less equally well in terms of ROC AUC [Davis and Goadrich, 2006]
Average ROC AUC per Method
aggregated over all datasets (without duplicates, normalized)

(Figures: mean ROC AUC per method (kNN, kNNW, LOF, SimplifiedLOF, LoOP, LDOF, ODIN, KDEOS, COF, FastABOD, LDF, INFLO), aggregated per dataset over the best k, over the best k ± 5, over the best k ± 10, and over all k)
Average ROC AUC per Method
aggregated over all datasets (without duplicates, normalized, at most 5% outliers)

(Figures: mean ROC AUC per method on this dataset selection, again aggregated over the best k, the best k ± 5, the best k ± 10, and all k per dataset)
Statistical Test
Nemenyi post-hoc test (normalized datasets without duplicates, ALOI and KDDCup99 removed, best achieved quality in terms of ROC AUC chosen for each dataset independently; results for those datasets with multiple subsampled variants were grouped by averaging the best results over all variants for each method): column method is better/worse than row method at the 90% ('+'/'−') and 95% ('++'/'−−') confidence levels.

(Table: 12 × 12 pairwise significance matrix over kNN, kNNW, LOF, SimplifiedLOF, LoOP, LDOF, ODIN, KDEOS, COF, FastABOD, LDF, and INFLO; the column alignment was lost in extraction. The KDEOS row carries eight '++' entries, i.e., eight of the other methods are significantly better than KDEOS at the 95% level, while most remaining pairwise differences are not significant.)
Best Results per Dataset
Average best performance of all methods, per dataset (without duplicates, normalized). Best results chosen by ROC AUC performance.

(Figures: average best P@n, Adjusted P@n, AP, Adjusted AP, and ROC AUC per dataset, over all 23 datasets and their downsampled variants, from WPBC, Wilt, and WDBC through Annthyroid and ALOI)
Difficulty and Dimensionality
(Figure: ROC AUC scores, for each method using the best k, on the datasets with 3 to 5% of outliers, averaged over the different dataset variants where available. The datasets are arranged on the x-axis from left to right in order of increasing dimensionality: Wilt, Glass, Pima, Stamps, WBC, PageBlocks, HeartDisease, Hepatitis, Lymphography, Annthyroid, Cardiotocography, Waveform, Parkinson, ALOI, WDBC, SpamBase, Arrhythmia, InternetAds.)
Diversity of Outlier Ranks
datasets without duplicates, normalized, with 3-5% outliers
(Figures: pairwise comparisons of the outlier rankings of the 12 methods (A: kNN, B: kNNW, C: LOF, D: SimplifiedLOF, E: LoOP, F: LDOF, G: ODIN, H: KDEOS, I: COF, J: FastABOD, K: LDF, L: INFLO), one matrix each for ALOI, Annthyroid, Arrhythmia, Cardiotocography, Glass, HeartDisease, Hepatitis, InternetAds, Lymphography, PageBlocks, Parkinson, Pima, SpamBase, Stamps, Waveform, WBC, WDBC, and Wilt)
Suitability of Ground Truth Outlier Labels
Difficulty for given labels vs. random labels
(Figures: Diversity Score vs. Difficulty Score for all variants of ALOI, Annthyroid, Arrhythmia, Cardiotocography, Glass, HeartDisease, Hepatitis, InternetAds, Lymphography, PageBlocks, Parkinson, Pima, SpamBase, Stamps, Waveform, WBC, WDBC, and Wilt, with reference points for a perfect result, independent random rankers, and identical random rankers; and per-dataset Difficulty Scores for the given labels vs. random labels)
Conclusions
- we discussed evaluation measures for outlier rankings: P@n, AP, and ROC (AUC)
- we proposed an adjustment for chance for P@n and for AP
- we discussed preprocessing issues for the preparation of outlier datasets with annotated ground truth and provide 23 datasets in about 1000 variants
- we tested 12 outlier detection methods on these datasets with a range of choices for the neighborhood parameter k ∈ [1, . . . , 100]
Conclusions
- we aggregate and analyse the resulting more than 1.3 million experiments and
  - summarize the effectiveness of the 12 methods
  - study the suitability of the datasets for the evaluation of outlier detection
- we offer all results and analyses together with source code online: http://www.dbs.ifi.lmu.de/research/outlier-evaluation/
- experiments can be easily repeated and extended for other methods and other datasets
Thank you for your attention!
And many thanks to my collaborators:
- Guilherme O. Campos
- Jörg Sander
- Ricardo J. G. B. Campello
- Barbora Micenková
- Erich Schubert
- Ira Assent
- Mike E. Houle
References I
N. Abe, B. Zadrozny, and J. Langford. Outlier detection by active learning. In Proceedings of the 12th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), Philadelphia, PA, pages 504–509, 2006. doi: 10.1145/1150402.1150459.

L. Akoglu, H. Tong, and D. Koutra. Graph-based anomaly detection and description: A survey. Data Mining and Knowledge Discovery, 29(3):626–688, 2015. doi: 10.1007/s10618-014-0365-y.

F. Angiulli and C. Pizzuti. Outlier mining in large high-dimensional data sets. IEEE Transactions on Knowledge and Data Engineering, 17(2):203–215, 2005. doi: 10.1109/TKDE.2005.31.

M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. LOF: Identifying density-based local outliers. In Proceedings of the ACM International Conference on Management of Data (SIGMOD), Dallas, TX, pages 93–104, 2000. doi: 10.1145/342009.335388.

R. J. G. B. Campello, D. Moulavi, A. Zimek, and J. Sander. Hierarchical density estimates for data clustering, visualization, and outlier detection. ACM Transactions on Knowledge Discovery from Data (TKDD), 10(1):5:1–51, 2015. doi: 10.1145/2733381.
References II
G. O. Campos, A. Zimek, J. Sander, R. J. G. B. Campello, B. Micenková, E. Schubert, I. Assent, and M. E. Houle. On the evaluation of unsupervised outlier detection: Measures, datasets, and an empirical study. Data Mining and Knowledge Discovery, 2016. doi: 10.1007/s10618-015-0444-8.

V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection: A survey. ACM Computing Surveys, 41(3):Article 15, 1–58, 2009. doi: 10.1145/1541880.1541882.

N. Craswell. Precision at n. In L. Liu and M. T. Özsu, editors, Encyclopedia of Database Systems, pages 2127–2128. Springer, 2009a. doi: 10.1007/978-0-387-39940-9_484.

N. Craswell. R-precision. In L. Liu and M. T. Özsu, editors, Encyclopedia of Database Systems, page 2453. Springer, 2009b. doi: 10.1007/978-0-387-39940-9_486.

J. Davis and M. Goadrich. The relationship between Precision-Recall and ROC curves. In Proceedings of the 23rd International Conference on Machine Learning (ICML), Pittsburgh, PA, pages 233–240, 2006.

T. de Vries, S. Chawla, and M. E. Houle. Density-preserving projections for large-scale local anomaly detection. Knowledge and Information Systems (KAIS), 32(1):25–52, 2012. doi: 10.1007/s10115-011-0430-4.
References III
J. Gao and P.-N. Tan. Converting output scores from outlier detection algorithms into probability estimates. In Proceedings of the 6th IEEE International Conference on Data Mining (ICDM), Hong Kong, China, pages 212–221, 2006. doi: 10.1109/ICDM.2006.43.

J. A. Hanley and B. J. McNeil. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143:29–36, 1982.

V. Hautamäki, I. Kärkkäinen, and P. Fränti. Outlier detection using k-nearest neighbor graph. In Proceedings of the 17th International Conference on Pattern Recognition (ICPR), Cambridge, England, UK, pages 430–433, 2004. doi: 10.1109/ICPR.2004.1334558.

D. Hawkins. Identification of Outliers. Chapman and Hall, 1980.

L. Hubert and P. Arabie. Comparing partitions. Journal of Classification, 2(1):193–218, 1985.

W. Jin, A. K. H. Tung, J. Han, and W. Wang. Ranking outliers using symmetric neighborhood relationship. In Proceedings of the 10th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Singapore, pages 577–593, 2006. doi: 10.1007/11731139_68.
References IV
F. Keller, E. Müller, and K. Böhm. HiCS: High contrast subspaces for density-based outlier ranking. In Proceedings of the 28th International Conference on Data Engineering (ICDE), Washington, DC, pages 1037–1048, 2012. doi: 10.1109/ICDE.2012.88.

H.-P. Kriegel, M. Schubert, and A. Zimek. Angle-based outlier detection in high-dimensional data. In Proceedings of the 14th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), Las Vegas, NV, pages 444–452, 2008. doi: 10.1145/1401890.1401946.

H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek. LoOP: Local outlier probabilities. In Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM), Hong Kong, China, pages 1649–1652, 2009. doi: 10.1145/1645953.1646195.

H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek. Interpreting and unifying outlier scores. In Proceedings of the 11th SIAM International Conference on Data Mining (SDM), Mesa, AZ, pages 13–24, 2011. doi: 10.1137/1.9781611972818.2.

H.-P. Kriegel, E. Schubert, and A. Zimek. The (black) art of runtime evaluation: Are we comparing algorithms or implementations? Submitted, 2016.
References V
L. J. Latecki, A. Lazarevic, and D. Pokrajac. Outlier detection with kernel density functions. In Proceedings of the 5th International Conference on Machine Learning and Data Mining in Pattern Recognition (MLDM), Leipzig, Germany, pages 61–75, 2007. doi: 10.1007/978-3-540-73499-4_6.

A. Lazarevic and V. Kumar. Feature bagging for outlier detection. In Proceedings of the 11th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), Chicago, IL, pages 157–166, 2005. doi: 10.1145/1081870.1081891.

H. O. Marques, R. J. G. B. Campello, A. Zimek, and J. Sander. On the internal evaluation of unsupervised outlier detection. In Proceedings of the 27th International Conference on Scientific and Statistical Database Management (SSDBM), San Diego, CA, pages 7:1–12, 2015. doi: 10.1145/2791347.2791352.

H. V. Nguyen and V. Gopalkrishnan. Feature extraction for outlier detection in high-dimensional spaces. Journal of Machine Learning Research, Proceedings Track 10:66–75, 2010.

H. V. Nguyen, H. H. Ang, and V. Gopalkrishnan. Mining outliers with ensemble of heterogeneous detectors on random subspaces. In Proceedings of the 15th International Conference on Database Systems for Advanced Applications (DASFAA), Tsukuba, Japan, pages 368–383, 2010. doi: 10.1007/978-3-642-12026-8_29.
References VI
G. H. Orair, C. Teixeira, Y. Wang, W. Meira Jr., and S. Parthasarathy. Distance-based outlier detection: Consolidation and renewed bearing. Proceedings of the VLDB Endowment, 3(2):1469–1480, 2010.

M. Radovanovic, A. Nanopoulos, and M. Ivanovic. Reverse nearest neighbors in unsupervised distance-based outlier detection. IEEE Transactions on Knowledge and Data Engineering, 2014. doi: 10.1109/TKDE.2014.2365790.

S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. In Proceedings of the ACM International Conference on Management of Data (SIGMOD), Dallas, TX, pages 427–438, 2000. doi: 10.1145/342009.335437.

E. Schubert, R. Wojdanowski, A. Zimek, and H.-P. Kriegel. On evaluation of outlier rankings and outlier scores. In Proceedings of the 12th SIAM International Conference on Data Mining (SDM), Anaheim, CA, pages 1047–1058, 2012. doi: 10.1137/1.9781611972825.90.

E. Schubert, A. Zimek, and H.-P. Kriegel. Local outlier detection reconsidered: A generalized view on locality with applications to spatial, video, and network outlier detection. Data Mining and Knowledge Discovery, 28(1):190–237, 2014a. doi: 10.1007/s10618-012-0300-z.
References VII
E. Schubert, A. Zimek, and H.-P. Kriegel. Generalized outlier detection with flexible kernel density estimates. In Proceedings of the 14th SIAM International Conference on Data Mining (SDM), Philadelphia, PA, pages 542–550, 2014b. doi: 10.1137/1.9781611973440.63.

E. Schubert, A. Koos, T. Emrich, A. Züfle, K. A. Schmid, and A. Zimek. A framework for clustering uncertain data. Proceedings of the VLDB Endowment, 8(12):1976–1979, 2015. doi: 10.14778/2824032.2824115.

J. Tang, Z. Chen, A. W.-C. Fu, and D. W. Cheung. Enhancing effectiveness of outlier detections for low density patterns. In Proceedings of the 6th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Taipei, Taiwan, pages 535–548, 2002. doi: 10.1007/3-540-47887-6_53.

E. Zhang and Y. Zhang. Average precision. In L. Liu and M. T. Özsu, editors, Encyclopedia of Database Systems, pages 192–193. Springer, 2009. doi: 10.1007/978-0-387-39940-9_482.

K. Zhang, M. Hutter, and H. Jin. A new local distance-based outlier detection approach for scattered real-world data. In Proceedings of the 13th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Bangkok, Thailand, pages 813–822, 2009. doi: 10.1007/978-3-642-01307-2_84.
References VIII
A. Zimek, E. Schubert, and H.-P. Kriegel. A survey on unsupervised outlier detection in high-dimensional numerical data. Statistical Analysis and Data Mining, 5(5):363–387, 2012. doi: 10.1002/sam.11161.

A. Zimek, M. Gaudet, R. J. G. B. Campello, and J. Sander. Subsampling for efficient and effective unsupervised outlier detection ensembles. In Proceedings of the 19th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), Chicago, IL, pages 428–436, 2013. doi: 10.1145/2487575.2487676.