CNRS - UPMC, Laboratoire d'Informatique de Paris 6. Outskewer: Using Skewness to Spot Outliers in Samples and Time Series. Sébastien Heymann, Matthieu Latapy, Clémence Magnien. ASONAM 2012

Outskewer: Using Skewness to Spot Outliers in Samples and Time Series


  • 1. CNRS - UPMC, Laboratoire d'Informatique de Paris 6. Outskewer: Using Skewness to Spot Outliers in Samples and Time Series. Sébastien Heymann, Matthieu Latapy, Clémence Magnien. ASONAM 2012
  • 2. Did you know? Outlier detection is an important problem in data mining. Source: https://xkcd.com/539/
  • 3. How to detect outliers? There is no formal definition; it is a subjective concept that depends on the case and on the hypotheses made about the data. Intuitively: identify values which deviate remarkably from the remainder of the values (Grubbs, 1969).
  • 4. Usual approaches in the literature. Hypothesis: the data follow a normal distribution. Outliers are identified from the distance between the data points and the theoretical values of that distribution.
  • 5. Problem statement. Most of the time, we can't make strong assumptions on: the theoretical distribution of values; how the data should evolve over time (time series). Thus we want a method which makes no hypothesis on the data.
  • 6. Our Method
  • 7. Skewness coefficient: $\gamma = \frac{n}{(n-1)(n-2)} \sum_{x \in X} \left( \frac{x - \text{mean}}{\text{standard deviation}} \right)^3$. Figure: example of skewed distributions (density against x).
  • 8. Skewness coefficient (continued). It is sensitive to extremal values (min/max) far from the mean!
  • 9. Skewness signature. Definition: evolution of the skewness coefficient when extremal values are removed one by one from the sample. Algorithm: if $\gamma > 0$ then remove max(X), else remove min(X). Example: X = {-3, -2, -1, -1, 0, 1, 2, 3, 7}; skewness as extremal values are removed: 1.09, 0.22, 0.17, 0, 0.4, 0, 1.73. Figure: skewness against the number of extremal values removed.
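The signature of slide 9 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' R implementation: the function names `skewness` and `skewness_signature` are hypothetical, "standard deviation" is taken to be the sample standard deviation (n - 1 denominator), and the removal stops once fewer than three values remain or all remaining values are equal, since the coefficient is undefined there. The sums are computed with exact fractions so that a perfectly symmetric sample yields exactly 0 and the "else remove min(X)" branch is taken as on the slide.

```python
from fractions import Fraction
from math import sqrt


def skewness(values):
    """Sample skewness: n / ((n - 1)(n - 2)) * sum(((x - mean) / sd) ** 3)."""
    n = len(values)
    if n < 3:
        raise ValueError("skewness needs at least 3 values")
    xs = [Fraction(x) for x in values]  # exact arithmetic for the sums
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs)  # sum of squared deviations
    m3 = sum((x - mean) ** 3 for x in xs)  # sum of cubed deviations
    if m2 == 0:
        raise ValueError("skewness is undefined when all values are equal")
    # Assumption: "standard deviation" means the sample one (n - 1 denominator).
    sd = sqrt(m2 / (n - 1))
    return n / ((n - 1) * (n - 2)) * float(m3) / sd ** 3


def skewness_signature(sample):
    """Skewness after 0, 1, 2, ... extremal values have been removed."""
    remaining = sorted(sample)
    signature = []
    # Stop when the coefficient is no longer defined.
    while len(remaining) >= 3 and remaining[0] != remaining[-1]:
        gamma = skewness(remaining)
        signature.append(gamma)
        if gamma > 0:
            remaining.pop()   # remove max(X)
        else:
            remaining.pop(0)  # remove min(X)
    return signature


X = [-3, -2, -1, -1, 0, 1, 2, 3, 7]
print([round(g, 2) for g in skewness_signature(X)])
# prints [1.09, 0.22, 0.17, 0.0, 0.4, 0.0, 1.73]
```

Under these assumptions the sketch reproduces the values listed on the slide for the example sample.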
  • 10. Our method: Outskewer. Our definition: an outlier is an extremal value which skews a distribution of values. Implication: the removal of these extremal values one by one should reduce the skewness of the distribution. Otherwise, there is no outlier as we define it.
  • 11. Outskewer: non-relevant cases. Distributions where extremal values far from the mean are common, e.g. power-law distributions.
  • 12. Outskewer: p-stability. Is the signature p-stable? p: fraction of extremal values removed. p-stable $\Leftrightarrow |\gamma| \le 0.5 - p'$, for each $p'$ from $p$ to 0.5. Figure: cumulative distribution of x, and |skewness| against p. Example: 0.16-stable but not 0.30-stable.
  • 13. Outskewer: p-stability (continued). If yes: there may be outliers.
  • 14. Outskewer: p-stability (continued). If no for all p: the skewness coefficient is always too large, thus no outlier as we define it can lie in the sample.
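The p-stability test of slides 12 to 14 can be sketched as follows. The sketch assumes that the bound reads |γ| ≤ 0.5 − p' (the comparison symbols are missing from the transcript), that the signature is given as a list whose k-th entry is the skewness after k removals, and that the removed fraction is p' = k / n for a sample of n values. The names `is_p_stable` and `smallest_stable_fraction`, and the signature values in the example, are hypothetical.

```python
def is_p_stable(signature, n, p):
    """True if |gamma| <= 0.5 - p' for every removed fraction p' = k / n in [p, 0.5]."""
    for k, gamma in enumerate(signature):
        frac = k / n
        if frac < p or frac > 0.5:
            continue  # outside the range [p, 0.5]
        if abs(gamma) > 0.5 - frac:
            return False
    return True


def smallest_stable_fraction(signature, n):
    """Smallest removed fraction k / n at which the signature is stable, or None.

    None corresponds to the case "no for all p": the method does not apply.
    """
    for k in range(len(signature)):
        frac = k / n
        if frac > 0.5:
            break
        if is_p_stable(signature, n, frac):
            return frac
    return None


# Hypothetical signature for a sample of n = 20 values (not taken from the slides).
signature = [1.2, 0.6, 0.3, 0.1, 0.05, 0.02, 0.0, 0.01, 0.0, 0.0, 0.0]
print(is_p_stable(signature, 20, 0.05))         # False: 0.6 > 0.5 - 0.05
print(is_p_stable(signature, 20, 0.1))          # True
print(smallest_stable_fraction(signature, 20))  # 0.1
```

Because the bound 0.5 − p' shrinks to 0 as p' approaches 0.5, the test asks the skewness to vanish progressively as extremal values are removed.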
  • 15. Outskewer: outlier detection. Figure: |skewness| against p, divided into an area of outliers, an area of potential outliers and an area with no outlier, next to the cumulative frequency of x with each value marked as not outlier, potential outlier or outlier. $t$: smallest $t$-stable value, $t$ smallest value so that $|\gamma| \le 0.5 - t$. $T$: largest $T$-stable value, $T$ smallest value so that $|\gamma| \le 0.5 - T$. Example: 50 values, including 7 outliers and 5 potential outliers.
  • 16. Outskewer: outcome. Each value of the sample is classified as follows: not outlier, potential outlier, outlier, or unknown when the method is not applicable (skewness signature never p-stable).
  • 17. Extension to time series. On a sliding window of size w, each value of X is classified w times. The final class of a value is the one that appears the most.
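The sliding-window vote of slide 17 can be sketched as below. The per-window classifier is passed in as a function; `toy_classifier` is a deliberately simple stand-in based on the distance to the window median and is not the Outskewer classification. Two further assumptions are made: values near the ends of the series are covered by fewer than w windows and are voted on with the windows available, and ties go to the class seen first. The function names are hypothetical.

```python
from collections import Counter
from statistics import median


def classify_series(series, w, classify_window):
    """Classify each value by majority vote over all windows of size w containing it."""
    votes = [Counter() for _ in series]
    for start in range(len(series) - w + 1):
        labels = classify_window(series[start:start + w])
        for offset, label in enumerate(labels):
            votes[start + offset][label] += 1
    # Assumption: a value covered by no window (series shorter than w) is "unknown".
    return [v.most_common(1)[0][0] if v else "unknown" for v in votes]


def toy_classifier(window):
    """Stand-in classifier (NOT Outskewer): flag values far from the window median."""
    m = median(window)
    return ["outlier" if abs(x - m) > 5 else "not outlier" for x in window]


series = [1, 2, 1, 2, 50, 1, 2, 1, 2]
print(classify_series(series, 3, toy_classifier))
# Only the value 50 is labelled "outlier".
```

Swapping `toy_classifier` for a function that applies the skewness-based classification to one window would give the time-series extension described on the slide.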
  • 18. Experimental Validation: false positive rate; regime change.
  • 19. False positive rate. Normal distribution: 3% for n = 10, 0.01% for n = 100. Pareto distribution: 5% for n = 100, 0.01% for n = 1000.
  • 20. Regime change (video). Figure: six panels of x against t (0 to 200), each value marked as not outlier, potential outlier, outlier or unknown.
  • 21. Experimental Results: French population during the 20th century; logs of a P2P search engine.
  • 22. French population during the 20th century. Figure: number of inhabitants per year (population against year, 1900 to 2000) and difference over years, each value marked as not outlier, potential outlier or outlier.
  • 23. Harry Potter on eDonkey. Figure: number of outliers and potential outliers per day (15 Jul to 1 Dec), annotated with "in theatre", "unknown event" and "pirate release". Data: search logs on the P2P network eDonkey; number of queries containing "half blood prince" per hour, computed every 10 minutes, during 28 weeks, over 205 million queries, for 24.4 million IP addresses.
  • 24. Contributions. Our method: is non-parametric except for the size of the time window; classifies values only when the statistical conditions are met; is naturally generalized to on-line analysis.
  • 25. Conclusion. Motivation: outlier detection with no hypothesis on data. Method based on the skewness of distributions. Excellent experimental results. Relevant on various data sets. Open source code in R at http://outskewer.sebastien.pro
  • 26. Questions? Outskewer: Using Skewness to Spot Outliers in Samples and Time Series
  • 27. Homogeneous / heterogeneous data. Outlier = unexpected extremal value? Extremal values far from the mean? Heterogeneous (Pareto, Zipf...): common. Homogeneous (normal, Laplace...): uncommon. Figure: probability density function of normal and Pareto laws (density against x).
  • 28. Skewness signature. Figure: two panels, Normal and Pareto, showing s(p) against p from 0 to 1 (min, q1, median, q3, max).
  • 29. Local view of the internet topology. Figure: number of nodes against number of rounds (0 to 5000), each value marked as outlier, potential outlier, not outlier or unknown. Source: M. Latapy, C. Magnien and F. Ouédraogo, A Radar for the Internet, in Complex Systems, 20 (1), 23-30, 2011.