34
15-388/688 - Practical Data Science: Anomaly detection and mixture of Gaussians J. Zico Kolter Carnegie Mellon University Spring 2018 1

15-388/688 -Practical Data Science: Anomaly detection and ... · What is an “anomaly” Two views of anomaly detection Supervised view: anomalies are what some user labels as anomalies

  • Upload
    others

  • View
    3

  • Download
    0

Embed Size (px)

Citation preview

Page 1: 15-388/688 -Practical Data Science: Anomaly detection and ... · What is an “anomaly” Two views of anomaly detection Supervised view: anomalies are what some user labels as anomalies

15-388/688 - Practical Data Science:Anomaly detection and mixture of Gaussians

J. Zico KolterCarnegie Mellon University

Spring 2018

1

Page 2: 15-388/688 -Practical Data Science: Anomaly detection and ... · What is an “anomaly” Two views of anomaly detection Supervised view: anomalies are what some user labels as anomalies

OutlineAnomalies and outliers

Multivariate Gaussian

Mixture of Gaussians

2

Page 3: 15-388/688 -Practical Data Science: Anomaly detection and ... · What is an “anomaly” Two views of anomaly detection Supervised view: anomalies are what some user labels as anomalies

OutlineAnomalies and outliers

Multivariate Gaussian

Mixture of Gaussians

3

Page 4: 15-388/688 -Practical Data Science: Anomaly detection and ... · What is an “anomaly” Two views of anomaly detection Supervised view: anomalies are what some user labels as anomalies

What is an “anomaly”Two views of anomaly detection

Supervised view: anomalies are what some user labels as anomalies

Unsupervised view: anomalies are outliers (points of low probability) in the data

In reality, you want a combination of both these viewpoints: not all outliers are anomalies, but all anomalies should be outliers

This lecture is going to focus on the unsupervised view, but this is only part of the full equation

4

Page 5: 15-388/688 -Practical Data Science: Anomaly detection and ... · What is an “anomaly” Two views of anomaly detection Supervised view: anomalies are what some user labels as anomalies

What is an outlier?Outliers are points of low probability

Given a collection of data points 𝑥 1,… , 𝑥

% , describe the points using some distribution, then find points with lowest 𝑝 𝑥

'

Since we are considering points with no labels, this is an unsupervised learning algorithm (could formulate in terms of hypothesis, loss, optimization, but instead for this lecture we’ll be focusing on the probabilistic notation)

5

Page 6: 15-388/688 -Practical Data Science: Anomaly detection and ... · What is an “anomaly” Two views of anomaly detection Supervised view: anomalies are what some user labels as anomalies

OutlineAnomalies and outliers

Multivariate Gaussian

Mixture of Gaussians

6

Page 7: 15-388/688 -Practical Data Science: Anomaly detection and ... · What is an “anomaly” Two views of anomaly detection Supervised view: anomalies are what some user labels as anomalies

Multivariate Gaussian distributionsWe have seen Gaussian distributions previously, but mainly focused on distributions over scalar-valued data 𝑥 '

∈ ℝ

𝑝 𝑥;𝜇, 𝜎2= 2𝜋𝜎

2 −1/2exp −

𝑥− 𝜇2

2𝜎2

Gaussian distributions generalize nicely to distributions over vector-valued random variables 𝑋 taking values in ℝ7

𝑝 𝑥; 𝜇,Σ = 2𝜋Σ−1/2

exp −1

2𝑥 − 𝜇

:Σ−1

𝑥 − 𝜇

≡𝒩 𝑥;𝜇,Σ

with parameters 𝜇 ∈ ℝ7 and Σ ∈ ℝ

7×7, and were ⋅ denotes the determinant of a matrix (also written 𝑋 ∼𝒩 𝜇,Σ ) 7

Page 8: 15-388/688 -Practical Data Science: Anomaly detection and ... · What is an “anomaly” Two views of anomaly detection Supervised view: anomalies are what some user labels as anomalies

Properties of multivariate GaussiansMean and variance

𝐄 𝑋 =∫

ℝB

𝑥𝒩 𝑥;𝜇,Σ 𝑑𝑥 = 𝜇

𝐂𝐨𝐯 𝑋 =∫

ℝB

𝑥 − 𝜇 𝑥− 𝜇:𝒩 𝑥;𝜇,Σ 𝑑𝑥 = Σ

(these are not obvious)

Creation from univariate Gaussians: for 𝑥 ∈ ℝ, if 𝑝 𝑥'= 𝒩 𝑥; 0,1 (i.e., each

element 𝑥'

is an independent univariate Gaussian, then 𝑦 = 𝐴𝑥+ 𝑏 is also normal, with distribution

𝑌 ∼𝒩 𝜇 = 𝑏,Σ = 𝐴𝐴:

8

Page 9: 15-388/688 -Practical Data Science: Anomaly detection and ... · What is an “anomaly” Two views of anomaly detection Supervised view: anomalies are what some user labels as anomalies

Multivariate Gaussians, graphically

9

𝜇 =3

−4

Σ =2.0 0.5

0.5 1.0

Page 10: 15-388/688 -Practical Data Science: Anomaly detection and ... · What is an “anomaly” Two views of anomaly detection Supervised view: anomalies are what some user labels as anomalies

Multivariate Gaussians, graphically

10

𝜇 =3

−4

Σ =2.0 0

0 1.0

Page 11: 15-388/688 -Practical Data Science: Anomaly detection and ... · What is an “anomaly” Two views of anomaly detection Supervised view: anomalies are what some user labels as anomalies

Multivariate Gaussians, graphically

11

𝜇 =3

−4

Σ =2.0 1.0

1.0 1.0

Page 12: 15-388/688 -Practical Data Science: Anomaly detection and ... · What is an “anomaly” Two views of anomaly detection Supervised view: anomalies are what some user labels as anomalies

Multivariate Gaussians, graphically

12

𝜇 =3

−4

Σ =2.0 1.4

1.4 1.0

Page 13: 15-388/688 -Practical Data Science: Anomaly detection and ... · What is an “anomaly” Two views of anomaly detection Supervised view: anomalies are what some user labels as anomalies

Multivariate Gaussians, graphically

13

𝜇 =3

−4

Σ =2.0 −1.0

−1.0 1.0

Page 14: 15-388/688 -Practical Data Science: Anomaly detection and ... · What is an “anomaly” Two views of anomaly detection Supervised view: anomalies are what some user labels as anomalies

Maximum likelihood estimationThe maximum likelihood estimate of 𝜇,Σ are what you would “expect”, but derivation is non-obvious

minimizeU,Σ

ℓ 𝜇,Σ =∑

'=1

%

log 𝑝(𝑥'; 𝜇,Σ)

=∑

'=1

%

−1

2log 2𝜋Σ −

1

2𝑥'− 𝜇

:Σ−1

𝑥'− 𝜇

Taking gradients with respect to 𝜇 and Σ and setting equal to zero give the closed-form solutions

𝜇 =1

𝑚∑

'=1

%

𝑥', Σ =

1

𝑚∑

'=1

%

𝑥'− 𝜇 𝑥

'− 𝜇

:

14

Page 15: 15-388/688 -Practical Data Science: Anomaly detection and ... · What is an “anomaly” Two views of anomaly detection Supervised view: anomalies are what some user labels as anomalies

Fitting Gaussian to MNIST

15

𝜇 = Σ =

Page 16: 15-388/688 -Practical Data Science: Anomaly detection and ... · What is an “anomaly” Two views of anomaly detection Supervised view: anomalies are what some user labels as anomalies

MNIST Outliers

16

Page 17: 15-388/688 -Practical Data Science: Anomaly detection and ... · What is an “anomaly” Two views of anomaly detection Supervised view: anomalies are what some user labels as anomalies

OutlineAnomalies and outliers

Multivariate Gaussian

Mixture of Gaussians

17

Page 18: 15-388/688 -Practical Data Science: Anomaly detection and ... · What is an “anomaly” Two views of anomaly detection Supervised view: anomalies are what some user labels as anomalies

Limits of GaussiansThough useful, multivariate Gaussians are limited in the types of distributions they can represent

18

Page 19: 15-388/688 -Practical Data Science: Anomaly detection and ... · What is an “anomaly” Two views of anomaly detection Supervised view: anomalies are what some user labels as anomalies

Mixture modelsA more powerful model to consider is a mixture of Gaussian distributions, a distribution where we first consider a categorical variable

𝑍 ∼ Categorical 𝜙 , 𝜙 ∈ 0,1g,∑

'

𝜙'= 1

i.e., 𝑧 takes on values 1,… , 𝑘

For each potential value of 𝑍, we consider a separate Gaussian distribution:𝑋|𝑍 = 𝑧 ∼𝒩 𝜇

k,Σ

k, 𝜇

k∈ ℝ

7,Σ

k∈ ℝ

7×7

Can write the distribution of 𝑋 using marginalization

𝑝 𝑋 =∑

k

𝑝 𝑋 𝑍 = 𝑧 𝑝(𝑍 = 𝑧) =∑

k

𝒩 𝑥;𝜇k,Σ

k𝜙k

19

Page 20: 15-388/688 -Practical Data Science: Anomaly detection and ... · What is an “anomaly” Two views of anomaly detection Supervised view: anomalies are what some user labels as anomalies

Learning mixture modelsTo estimate parameters, suppose first that we can observe both 𝑋 and 𝑍, i.e., our data set is of the form 𝑥 '

, 𝑧', 𝑖 = 1,… ,𝑚

In this case, we can maximize the log-likelihood of the parameters:

ℓ 𝜇,Σ, 𝜙 =∑

'=1

%

log 𝑝(𝑥', 𝑧

'; 𝜇,Σ, 𝜙)

Without getting into the full details, it hopefully should not be too surprising that the solutions here are given by:

𝜙k=

∑'=1

%𝟏 𝑧

'= 𝑧

𝑚, 𝜇

k=

∑'=1

%𝟏 𝑧

'= 𝑧 𝑥

'

∑'=1

%𝟏 𝑧 ' = 𝑧

,

Σk=

∑'=1

%𝟏 𝑧

'= 𝑧 (𝑥

'−𝜇

k) 𝑥

'− 𝜇

k :

∑'=1

%𝟏 𝑧 ' = 𝑧 20

Page 21: 15-388/688 -Practical Data Science: Anomaly detection and ... · What is an “anomaly” Two views of anomaly detection Supervised view: anomalies are what some user labels as anomalies

Latent variables and expectation maximization

In the unsupervised setting, 𝑧 ' terms will not be known, these are referred to as hidden or latent random variables

This means that to estimate the parameters, we can’t use the function 1 𝑧

'= 𝑧 anymore

Expectation maximization (EM) algorithm (at a high level): replace indicators 1 𝑧

'= 𝑧 with probability estimates 𝑝 𝑧

'= 𝑧 𝑥

'; 𝜇,Σ, 𝜙

When we re-estimate these parameter, probabilities change, so repeat:E (expectation) step: compute 𝑝 𝑧

'= 𝑧 𝑥

'; 𝜇,Σ, 𝜙 ,∀𝑖, 𝑧

M (maximization) step: re-estimate 𝜇,Σ,𝜙21

Page 22: 15-388/688 -Practical Data Science: Anomaly detection and ... · What is an “anomaly” Two views of anomaly detection Supervised view: anomalies are what some user labels as anomalies

EM for Gaussian mixture modelsE step: using Bayes’ rule, compute probabilities

𝑝k

'= 𝑝 𝑧

'= 𝑧 𝑥

'; 𝜇,Σ, 𝜙 =

𝑝 𝑥'𝑧'= 𝑧; 𝜇,Σ 𝑝 𝑧

'= 𝑧; 𝜙

∑k′𝑝 𝑥

'𝑧'= 𝑧′; 𝜇,Σ 𝑝 𝑧 ' = 𝑧′; 𝜙

=𝒩 𝑥

'; 𝜇

k,Σ

k𝜙k

∑k′𝒩 𝑥 ' ; 𝜇 k′ ,Σ k′ 𝜙

k′

M step: re-estimate parameters using these probabilities

𝜙k←

∑'=1

%𝑝k

'

𝑚, 𝜇

k←

∑'=1

%𝑝k

'𝑥

'

∑'=1

%𝑝',k

, Σk←

∑'=1

%𝑝k

'(𝑥

'−𝜇

k) 𝑥

'− 𝜇

k :

∑'=1

%𝑝k

'

22

Page 23: 15-388/688 -Practical Data Science: Anomaly detection and ... · What is an “anomaly” Two views of anomaly detection Supervised view: anomalies are what some user labels as anomalies

Local optimaLike k-means, EM is effectively optimizating a non-convex problem

Very real possibility of local optima (seemingly moreso than k-means, in practice)

Same heuristics work as for k-means (in fact, common to initialize EM with clusters from k-means)

23

Page 24: 15-388/688 -Practical Data Science: Anomaly detection and ... · What is an “anomaly” Two views of anomaly detection Supervised view: anomalies are what some user labels as anomalies

Illustration of EM algorithm

24

Page 25: 15-388/688 -Practical Data Science: Anomaly detection and ... · What is an “anomaly” Two views of anomaly detection Supervised view: anomalies are what some user labels as anomalies

Illustration of EM algorithm

25

Page 26: 15-388/688 -Practical Data Science: Anomaly detection and ... · What is an “anomaly” Two views of anomaly detection Supervised view: anomalies are what some user labels as anomalies

Illustration of EM algorithm

26

Page 27: 15-388/688 -Practical Data Science: Anomaly detection and ... · What is an “anomaly” Two views of anomaly detection Supervised view: anomalies are what some user labels as anomalies

Illustration of EM algorithm

27

Page 28: 15-388/688 -Practical Data Science: Anomaly detection and ... · What is an “anomaly” Two views of anomaly detection Supervised view: anomalies are what some user labels as anomalies

Illustration of EM algorithm

28

Page 29: 15-388/688 -Practical Data Science: Anomaly detection and ... · What is an “anomaly” Two views of anomaly detection Supervised view: anomalies are what some user labels as anomalies

Possibility of local optima

29

Page 30: 15-388/688 -Practical Data Science: Anomaly detection and ... · What is an “anomaly” Two views of anomaly detection Supervised view: anomalies are what some user labels as anomalies

Possibility of local optima

30

Page 31: 15-388/688 -Practical Data Science: Anomaly detection and ... · What is an “anomaly” Two views of anomaly detection Supervised view: anomalies are what some user labels as anomalies

Possibility of local optima

31

Page 32: 15-388/688 -Practical Data Science: Anomaly detection and ... · What is an “anomaly” Two views of anomaly detection Supervised view: anomalies are what some user labels as anomalies

Possibility of local optima

32

Page 33: 15-388/688 -Practical Data Science: Anomaly detection and ... · What is an “anomaly” Two views of anomaly detection Supervised view: anomalies are what some user labels as anomalies

Poll: outliers in mixture of GaussiansConsider the following cartoon dataset:

If we fit a mixture of two Gaussians to this data via the EM algorithm, which group of points is likely to contain more “outliers” (points with the lowest 𝑝(𝑥))?

1. Left group2. Right group3. Equal chance of each, depending on initialization

33

Page 34: 15-388/688 -Practical Data Science: Anomaly detection and ... · What is an “anomaly” Two views of anomaly detection Supervised view: anomalies are what some user labels as anomalies

EM and k-meansAs you may have noticed, EM for mixture of Gaussians and k-means seem to be doing very similar things

Primary differences: EM is computing “distances” based upon the inverse covariance matrix, allows for “soft” assignments instead of hard assignments

34