
Page 1: Lectures 4 and 5: ML and Bayesian Learning, Density Estimation, Gaussian Mixtures and EM and Variational Bayesian Inference, Performance Assessment of Classifiers

Prof. Krishna R. Pattipati
Dept. of Electrical and Computer Engineering
University of Connecticut
Contact: [email protected], (860) 486-2890

Fall 2018, October 1, 8 & 15, 2018

Copyright © 2001-2018 K.R. Pattipati

Page 2: Reading List

• Duda, Hart and Stork, Sections 3.1-3.6, Chapter 4
• Bishop, Section 2.5, Chapter 9, Sections 10.1, 10.2
• Murphy, Chapter 4, Chapter 11, Section 21.6
• Theodoridis, Chapter 7, Chapter 12

Page 3: Lecture Outline

• Estimating Parameters of Densities from Data
  – Maximum Likelihood Methods
  – Bayesian Learning
• Estimating Probability Densities (Nonparametric)
  – Histogram Methods
  – Parzen Windows
  – Probabilistic Neural Network
  – k-nearest Neighbor Approach
• Mixture Models and EM
• Performance Assessment of Classifiers

Page 4: Recall Bayesian Classifiers

• Need to know {P(z = j), p(x | z = j, θ_j)}. We usually estimate them from data.
• Parametric methods (assume a form for the density).
• Nonparametric methods: estimate the density via Parzen windows, PNN, RCE, NN, k-NN, ...
• Mixture models: mixtures of Gaussians, Multinoullis, Multinomials, Student-t, ...

Graphical model: a hidden categorical variable z, with parameters {μ_k, Σ_k}, generates the observed feature vector x.

Page 5: Estimating Parameters of Densities from Data

Estimating parameters of densities from data: for example, in the Gaussian case,

  P(z = k) = π_k,  Σ_{k=1}^C π_k = 1,  and class densities {p(x | z = k, θ_k)}_{k=1}^C with

  θ = {(μ_k, Σ_k)} (general case), or ({μ_k}, Σ) (hyperellipsoid case), or ({μ_k}, σ² I_p) (hypersphere case)

Data:  x_k^1, x_k^2, x_k^3, ..., x_k^{n_k};  k = 1, 2, 3, ..., C. Let n_k = number of samples from class k, Σ_{k=1}^C n_k = N.

Assuming the samples are independent,

  L(θ) = p(D | θ) = Π_{k=1}^C Π_{j=1}^{n_k} p(x_k^j | z = k, θ_k) P(z = k)

  l(θ) = ln L(θ) = ln p(D | θ) = Σ_{k=1}^C Σ_{j=1}^{n_k} ln p(x_k^j | z = k, θ_k) + Σ_{k=1}^C n_k ln π_k,   π_k ≡ P(z = k)

Page 6: Optimal ML Estimate of π_k

Optimal π_k:

  max_{π_k} Σ_{k=1}^C n_k ln π_k   s.t.  Σ_{k=1}^C π_k = 1

Lagrangian:  max_{π_k} [ Σ_{k=1}^C n_k ln π_k + λ(1 − Σ_{k=1}^C π_k) ]

  ∂/∂π_k:  n_k/π̂_k = λ  ⇒  π̂_k = n_k/λ;   Σ_{k=1}^C π̂_k = 1  ⇒  λ = Σ_k n_k = N

  ⇒  π̂_k = n_k / N     Fraction of samples of class k. Intuitively appealing.

Recall: ln x is concave, −ln x is convex.

What if some classes did not have samples in the training data? Zero count or sparse data or black swan problem. Then use

  π̂_k = (n_k + 1) / (N + C)     Laplace rule of succession or add-one smoothing
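As a concrete illustration of the two estimates above, here is a minimal NumPy sketch; the function name and the `smoothing` parameter are illustrative, not from the slides:

```python
import numpy as np

def class_priors(labels, C, smoothing=0.0):
    """ML estimate of class priors, with optional add-one (Laplace) smoothing.

    labels   : array of integer class labels in {0, ..., C-1}
    C        : number of classes
    smoothing: 0 gives pi_k = n_k / N; 1 gives (n_k + 1) / (N + C)
    """
    n_k = np.bincount(labels, minlength=C).astype(float)   # counts per class
    return (n_k + smoothing) / (labels.size + smoothing * C)

labels = np.array([0, 0, 1, 2, 2, 2])
print(class_priors(labels, C=4))                 # class 3 gets probability 0
print(class_priors(labels, C=4, smoothing=1.0))  # add-one smoothing avoids the zero count
```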

Page 7: Optimal μ — Hyperellipsoid Case

Optimal ML estimate of μ_k (when the covariance matrices of all classes are taken to be equal, i.e., Σ_i = Σ):

  l({μ_k}, Σ) = Σ_{k=1}^C Σ_{j=1}^{n_k} ln p(x_k^j | z = k, μ_k, Σ)
              = −(N/2) ln |Σ| − (1/2) Σ_{k=1}^C Σ_{j=1}^{n_k} (x_k^j − μ_k)^T Σ^{−1} (x_k^j − μ_k) + const.

  ∂l/∂μ_k = Σ_{j=1}^{n_k} Σ^{−1} (x_k^j − μ̂_k) = 0   ⇒   μ̂_k = (1/n_k) Σ_{j=1}^{n_k} x_k^j,   k = 1, 2, ..., C

The ML estimate of the mean is just the sample mean!

Aside:  ∂ tr(AᵀA)/∂A = 2A.

Page 8: Optimal Σ

We know the log-likelihood in terms of the precision Σ^{−1}:

  l({μ̂_k}, Σ) = (N/2) ln |Σ^{−1}| − (1/2) Σ_{k=1}^C Σ_{j=1}^{n_k} (x_k^j − μ̂_k)^T Σ^{−1} (x_k^j − μ̂_k) + const.

  ∂l/∂Σ^{−1} = (N/2) Σ̂ − (1/2) Σ_{k=1}^C Σ_{j=1}^{n_k} (x_k^j − μ̂_k)(x_k^j − μ̂_k)^T = 0

  ⇒   Σ̂ = (1/N) Σ_{k=1}^C Σ_{j=1}^{n_k} (x_k^j − μ̂_k)(x_k^j − μ̂_k)^T

The ML estimate of the covariance matrix is the arithmetic average of (x_k^j − μ̂_k)(x_k^j − μ̂_k)^T over all classes.

We can show that E[Σ̂] = ((N − C)/N) Σ, so use the unbiased estimate

  Σ̂ = (1/(N − C)) Σ_{k=1}^C Σ_{j=1}^{n_k} (x_k^j − μ̂_k)(x_k^j − μ̂_k)^T

Aside: if A is positive definite,  |A| = exp(ln |A|) = exp(tr ln A) = Π_{i=1}^n λ_i;   ln |A| = tr[ln A];   ∂ ln |A| / ∂A = A^{−1}.
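A small NumPy sketch of these estimators (class means and the pooled within-class covariance with the 1/(N − C) correction); the function and variable names are illustrative:

```python
import numpy as np

def ml_gaussian_estimates(X, y, C):
    """Class means and pooled (equal) covariance from labeled data.

    X: (N, p) feature matrix; y: (N,) integer labels in {0, ..., C-1}.
    Returns per-class sample means and the pooled covariance with the
    unbiased 1/(N - C) scaling discussed on this slide.
    """
    N, p = X.shape
    means = np.vstack([X[y == k].mean(axis=0) for k in range(C)])
    S = np.zeros((p, p))
    for k in range(C):
        D = X[y == k] - means[k]        # centered samples of class k
        S += D.T @ D                    # scatter matrix of class k
    return means, S / (N - C)
```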

Page 9: Proof of E[Σ̂] = ((N − C)/N) Σ

  Σ̂ = (1/N) Σ_{k=1}^C Σ_{j=1}^{n_k} (x_k^j − μ̂_k)(x_k^j − μ̂_k)^T,  where  n_k μ̂_k = Σ_{j=1}^{n_k} x_k^j

    = (1/N) Σ_{k=1}^C [ Σ_{j=1}^{n_k} x_k^j x_k^{jT} − n_k μ̂_k μ̂_k^T ]

Using E[x_k^j x_k^{jT}] = Σ + μ_k μ_k^T and E[μ̂_k μ̂_k^T] = Σ/n_k + μ_k μ_k^T,

  E[Σ̂] = (1/N) Σ_{k=1}^C { n_k (Σ + μ_k μ_k^T) − n_k (Σ/n_k + μ_k μ_k^T) } = (1/N) Σ_{k=1}^C (n_k − 1) Σ = ((N − C)/N) Σ

Page 10: Shrinkage Methods — Regularized Discriminant Analysis

• Linear discriminants may outperform quadratic discriminants when the training data set is small.
• Shrink the covariance matrices towards a common value ⇒ possibly biased, but results in a less variable estimator:

  Σ̂_k(α) = [ α n_k Σ̂_k + (1 − α) N Σ̂ ] / [ α n_k + (1 − α) N ]   (convex combination of Σ̂_k and Σ̂)

  Σ̂(γ) = γ Σ̂ + (1 − γ) σ̂² I,  0 ≤ γ ≤ 1,  with σ̂² = tr(Σ̂)/p   (σ̂² I = I if the variables are scaled to have zero mean and unit variance)

  Σ̂_k(α, γ) = (1 − γ) Σ̂_k(α) + γ [ tr(Σ̂_k(α))/p ] I

  Σ̂(γ) = γ Σ̂ + (1 − γ) diag(Σ̂)   — this has a Bayesian (MAP) interpretation

• Select α and γ to minimize the error rate via cross validation.
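A minimal sketch of the shrinkage idea above (convex combination toward the pooled covariance, then toward a scaled identity); the function name is illustrative:

```python
import numpy as np

def shrink_covariance(S_k, n_k, S_pooled, N, alpha, gamma):
    """Regularized class covariance in the spirit of the RDA combinations on this slide.

    S_k, S_pooled : class and pooled covariance estimates (p x p)
    n_k, N        : class and total sample counts
    alpha         : shrink the class covariance toward the pooled one
    gamma         : shrink further toward a scaled identity
    """
    p = S_k.shape[0]
    S_a = (alpha * n_k * S_k + (1 - alpha) * N * S_pooled) / (alpha * n_k + (1 - alpha) * N)
    return (1 - gamma) * S_a + gamma * (np.trace(S_a) / p) * np.eye(p)
```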

Page 11: ML-based Discriminants

  μ̂_k = (1/n_k) Σ_{j=1}^{n_k} x_k^j;   Σ̂_k = (1/n_k) Σ_{j=1}^{n_k} (x_k^j − μ̂_k)(x_k^j − μ̂_k)^T

ML-estimated best linear rule (C + Cp + p(p+1)/2 parameters), equal covariance case:

  ĵ = arg max_{k ∈ {1,2,...,C}} [ x^T Σ̂^{−1} μ̂_k − (1/2) μ̂_k^T Σ̂^{−1} μ̂_k + ln π̂_k ]

ML-estimated quadratic rule (C + Cp + Cp(p+1)/2 parameters), unequal covariance case:

  ĵ = arg max_{k ∈ {1,2,...,C}} [ −(1/2) ln |Σ̂_k| − (1/2)(x − μ̂_k)^T Σ̂_k^{−1} (x − μ̂_k) + ln π̂_k ]

As N → ∞, these approach the performance of the optimal Bayesian classifier. However, for finite N, a linear rule may outperform a quadratic one!

Note that we need n_k ≥ p for each class. If not, use shrinkage methods.
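A short sketch of the quadratic rule above; the helper name and inputs are illustrative:

```python
import numpy as np

def quadratic_discriminant(x, means, covs, priors):
    """ML-estimated quadratic rule: arg max_k of
    -0.5*ln|Sigma_k| - 0.5*(x - mu_k)^T Sigma_k^{-1} (x - mu_k) + ln pi_k."""
    scores = []
    for mu, S, pi in zip(means, covs, priors):
        d = x - mu
        _, logdet = np.linalg.slogdet(S)
        scores.append(-0.5 * logdet - 0.5 * d @ np.linalg.solve(S, d) + np.log(pi))
    return int(np.argmax(scores))
```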

Page 12: Recursive Estimation of Parameters - 1

What if we have "big streaming data"? Recursive estimation of the mean:

  μ̂_k^n = (1/n) Σ_{j=1}^n x_k^j   ⇒   μ̂_k^n = (1 − 1/n) μ̂_k^{n−1} + (1/n) x_k^n = μ̂_k^{n−1} + (1/n)(x_k^n − μ̂_k^{n−1})

• In general, we can use a stochastic approximation (SA) algorithm:

  θ̂_k^n = θ̂_k^{n−1} + η_n ∂ ln p(x_k^n | z = k, θ)/∂θ |_{θ̂_k^{n−1}},

  which converges to a root of  E[ ∂ ln p(x | z = k, θ)/∂θ ] = 0  as n → ∞.

Idea of stochastic approximation: let f(θ) = E[g(θ) | θ]; we seek a root θ* of f(θ) = u, assuming E[(g − f)² | θ] < ∞. Iterate

  θ_{n+1} = θ_n + η_n [ u − g(θ_n) ],   with  lim_{n→∞} η_n = 0,  Σ_{n=1}^∞ η_n = ∞,  Σ_{n=1}^∞ η_n² < ∞.

Step sizes such as η_n = K/n, K/(n+1), or K/n^m with 1/2 < m ≤ 1 satisfy the SA conditions.
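A tiny sketch of the streaming mean recursion above (class name is illustrative):

```python
import numpy as np

class RunningMean:
    """Recursive (streaming) mean: mu_n = mu_{n-1} + (1/n) * (x_n - mu_{n-1})."""
    def __init__(self, p):
        self.n = 0
        self.mu = np.zeros(p)

    def update(self, x):
        self.n += 1
        self.mu += (x - self.mu) / self.n
        return self.mu

rm = RunningMean(2)
for x in np.random.randn(1000, 2) + np.array([1.0, -2.0]):
    rm.update(x)
print(rm.mu)   # close to [1, -2]
```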

Page 13: Recursive Estimation of Parameters - 2

Unequal covariance case (class k), batch estimates:

  μ̂_k = (1/n_k) Σ_{j=1}^{n_k} x_k^j;   Σ̂_k = (1/n_k) Σ_{j=1}^{n_k} (x_k^j − μ̂_k)(x_k^j − μ̂_k)^T

Mean recursion after the n-th sample of class k:

  μ̂_k^n = μ̂_k^{n−1} + (1/n)(x_k^n − μ̂_k^{n−1}) = (1 − 1/n) μ̂_k^{n−1} + (1/n) x_k^n

Covariance recursion for class k (written for the (n−1)-normalized estimate):

  Σ̂_k^n = ((n − 2)/(n − 1)) Σ̂_k^{n−1} + (1/n)(x_k^n − μ̂_k^{n−1})(x_k^n − μ̂_k^{n−1})^T

which follows by expanding Σ_{j=1}^n (x_k^j − μ̂_k^n)(x_k^j − μ̂_k^n)^T using μ̂_k^n = μ̂_k^{n−1} + (1/n)(x_k^n − μ̂_k^{n−1}).
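A compact sketch of a joint streaming mean/covariance update for one class. It maintains the scatter matrix with the standard rank-one update, which is algebraically equivalent to the recursion on this slide; names are illustrative:

```python
import numpy as np

class RunningGaussian:
    """Streaming estimate of mean and covariance for one class.

    Maintains S_n = sum_j (x_j - mu_n)(x_j - mu_n)^T via
    S_n = S_{n-1} + ((n-1)/n) * d d^T with d = x_n - mu_{n-1}.
    """
    def __init__(self, p):
        self.n, self.mu, self.S = 0, np.zeros(p), np.zeros((p, p))

    def update(self, x):
        self.n += 1
        d = x - self.mu
        self.mu += d / self.n
        self.S += (self.n - 1) / self.n * np.outer(d, d)

    def cov(self, unbiased=True):
        return self.S / (self.n - 1 if unbiased else self.n)
```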

Page 14: Recursive Estimation of Parameters - 3

Covariance recursion in the hyperellipsoid (common covariance) case.

Update the means via

  μ̂_i^{n_i} = (1 − 1/n_i) μ̂_i^{n_i − 1} + (1/n_i) x_i^{n_i},   i = 1, 2, ..., C

For the covariance, suppose the n-th sample overall is from class l. With the pooled estimate Σ̂^n = (1/(n − C)) Σ_{k=1}^C Σ_j (x_k^j − μ̂_k)(x_k^j − μ̂_k)^T,

  Σ̂^n = ((n − 1 − C)/(n − C)) Σ̂^{n−1} + (1/(n − C)) ((n_l − 1)/n_l) (x_l^{n_l} − μ̂_l^{n_l − 1})(x_l^{n_l} − μ̂_l^{n_l − 1})^T

Page 15: Bayesian Learning

The posterior probability P(z = i | x) is the key. Have data:

  D_k = {x_k^1, x_k^2, ..., x_k^{n_k}},  k = 1, 2, 3, ..., C;   D = {D_1, D_2, ..., D_C}

We can only get P(z = k | x, D_k):

  P(z = k | x, D) = p(x | z = k, D_k) P(z = k) / Σ_{i=1}^C p(x | z = i, D_i) P(z = i)

Class-conditional density and posterior density of the parameters:

  p(x | z = k, D_k) = ∫ p(x | z = k, θ) p(θ | z = k, D_k) dθ

  p(θ | z = k, D_k) = p(D_k | z = k, θ) p(θ | z = k) / ∫ p(D_k | z = k, θ) p(θ | z = k) dθ
                    = Π_{j=1}^{n_k} p(x_k^j | z = k, θ) p(θ | z = k) / ∫ Π_{j=1}^{n_k} p(x_k^j | z = k, θ) p(θ | z = k) dθ

If p(D_k | z = k, θ) has a sharp peak at θ̂, then so does p(θ | z = k, D_k). These integrals are tough to compute unless we have a reproducing (conjugate) density; otherwise we need simulation.

Page 16: Illustration of Bayesian Learning - 1

Bayesian learning of the mean of a Gaussian distribution in one dimension.

• Strong prior (small variance, e.g., prior variance = 1.00): the posterior mean "shrinks" towards the prior mean.
• Weak prior (large variance, e.g., prior variance = 5.00): the posterior mean is similar to the MLE.

[Figure: prior, likelihood, and posterior curves for prior variance 1.00 and 5.00; gaussInferParamsMean1d from Murphy, Page 121.]

Page 17: Illustration of Bayesian Learning - 2

Bayesian learning of the mean of Gaussian distributions in one and two dimensions. As the number of samples increases, the posterior density peaks at the true value.

Page 18: Sensor Fusion

Bayesian sensor fusion appropriately combines measurements from multiple sensors based on the uncertainty of the sensor measurements. Larger uncertainty ⇒ less weight.

[Figure: sensorFusion2d from Murphy, Page 123. Three panels: (a) equally reliable sensors (R, G); (b) G more reliable than R; (c) R more reliable in y, G more reliable in x. The fused estimate is shown in black in each panel.]

Page 19: Recursive Bayesian Learning

Likelihood recursion: let D_k^{n_k} = {x_k^1, x_k^2, ..., x_k^{n_k−1}, x_k^{n_k}} = {D_k^{n_k−1}, x_k^{n_k}}. Then

  p(D_k^{n_k} | z = k, θ) = p(x_k^{n_k} | z = k, θ) p(D_k^{n_k−1} | z = k, θ)

Recursion for the posterior density:

  p(θ | z = k, D_k^{n_k}) = p(x_k^{n_k} | z = k, θ) p(θ | z = k, D_k^{n_k−1}) / ∫ p(x_k^{n_k} | z = k, θ) p(θ | z = k, D_k^{n_k−1}) dθ,
  where p(θ | z = k, D_k^0) = p(θ | z = k).

Problem: in general we need to store all training samples to calculate p(θ | z = k, D_k^{n_k}). For the exponential family (e.g., Gaussian, exponential, Rayleigh, Gamma, Beta, Poisson, Bernoulli, Binomial, Multinomial) we need only a few parameters, called sufficient statistics, to characterize p(θ | z = k, D_k^{n_k}).

Page 20: Recursive Learning of Mean and Covariance

We want to learn both the mean vector and the covariance. Use the Normal-Inverse-Wishart (NIW) conjugate prior:

  p(μ, Σ) = NIW(μ, Σ | m_0, κ_0, ν_0, S_0) = N(μ; m_0, (1/κ_0) Σ) · IW(Σ; S_0, ν_0)
           (Gaussian × inverse Wishart)

where IW(Σ; S_0, ν_0) ∝ |Σ|^{−(ν_0 + p + 1)/2} exp( −(1/2) tr(S_0 Σ^{−1}) ). The (inverse) Wishart is a multivariate generalization of the Gamma and chi-squared distributions.

Given D^n = {x^1, x^2, ..., x^n}, the posterior is again NIW:  p(μ, Σ | D^n) = NIW(μ, Σ | m_n, κ_n, ν_n, S_n), with

  κ_n = κ_0 + n;   ν_n = ν_0 + n;   m_n = (κ_0 m_0 + n x̄_n)/κ_n;
  S_n = S_0 + Σ_{i=1}^n x^i x^{iT} + κ_0 m_0 m_0^T − κ_n m_n m_n^T,   where x̄_n = (1/n) Σ_{i=1}^n x^i.

(HW: obtain a recursive expression for these updates.)

Page 21: Bayesian Generalization of Laplacian Smoothing

Recall the likelihood of the class labels:  p(D | {π_k}) ∝ Π_{k=1}^C π_k^{n_k},  Σ_{k=1}^C n_k = N.

Conjugate prior: the Dirichlet distribution,

  p(π | α_0) = Dir(π | α_0, ..., α_0) = [ Γ(C α_0) / Γ(α_0)^C ] Π_{k=1}^C π_k^{α_0 − 1}

Posterior:

  p(π | D, α_0) = p(D | π) p(π | α_0) / p(D) = Dir(π | n_1 + α_0, ..., n_C + α_0)

MAP estimate (mode) and conditional mean (MMSE estimate):

  π̂_k^MAP = (n_k + α_0 − 1) / (N + C(α_0 − 1));    π̂_k^MMSE = (n_k + α_0) / (N + C α_0)

Mode and mean are not the same here! (α_0 = 2 in the MAP estimate, or α_0 = 1 in the MMSE estimate, recovers add-one smoothing.)
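A short sketch of the two posterior estimates under a symmetric Dirichlet prior (function name is illustrative):

```python
import numpy as np

def dirichlet_posterior_estimates(counts, alpha0=1.0):
    """MAP (mode) and MMSE (posterior mean) estimates of the class priors
    under a symmetric Dirichlet(alpha0) prior."""
    counts = np.asarray(counts, dtype=float)
    N, C = counts.sum(), counts.size
    pi_map  = (counts + alpha0 - 1.0) / (N + C * (alpha0 - 1.0))
    pi_mmse = (counts + alpha0) / (N + C * alpha0)
    return pi_map, pi_mmse

print(dirichlet_posterior_estimates([3, 0, 1], alpha0=2.0))
```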

Page 22: Operations on Gaussians - 1

• Sum of Gaussian vectors is Gaussian:
  x_1 ~ N(μ_1, Σ_11), x_2 ~ N(μ_2, Σ_22), cov(x_1, x_2) = Σ_12
  ⇒ x_1 + x_2 ~ N(μ_1 + μ_2, Σ_11 + Σ_22 + Σ_12 + Σ_12^T)

• Linear transformations of Gaussians are Gaussian:
  x ~ N(μ, Σ)  ⇒  y = A x ~ N(Aμ, A Σ A^T)

• Marginal and conditional densities: let

  x = [x_1; x_2] ~ N(μ, Σ),  μ = [μ_1; μ_2],  Σ = [Σ_11 Σ_12; Σ_12^T Σ_22],  J = Σ^{−1} = [J_11 J_12; J_12^T J_22]

  Then the marginal and conditional densities are also Gaussian:

  p(x_1) = N(μ_1, Σ_11)
  p(x_1 | x_2) = N( μ_1 + Σ_12 Σ_22^{−1}(x_2 − μ_2), Σ_11 − Σ_12 Σ_22^{−1} Σ_12^T )
               = N( μ_1 − J_11^{−1} J_12 (x_2 − μ_2), J_11^{−1} )

  with J_11^{−1} = Σ_11 − Σ_12 Σ_22^{−1} Σ_12^T and J_11^{−1} J_12 = −Σ_12 Σ_22^{−1}.

This is what happens in Least Squares and Kalman filtering. If x_1 is a scalar (e.g., Gibbs sampling), the information form can save computation.

Page 23: Marginals & Conditionals of a 2D Gaussian

Example:  x = [x_1; x_2] ~ N(μ, Σ)  with  μ = 0  and  Σ = [ 1  0.8 ; 0.8  1 ].

Marginal and conditional densities:

  p(x_1) = N(μ_1, σ_1²) = N(0, 1)
  p(x_1 | x_2) = N( μ_1 + ρ(σ_1/σ_2)(x_2 − μ_2), σ_1²(1 − ρ²) ).   If x_2 = 1, then p(x_1 | x_2) = N(0.8, 0.36).

Information (precision) matrix:

  J = Σ^{−1} = [ 1/(σ_1²(1−ρ²))        −ρ/(σ_1 σ_2 (1−ρ²)) ;
                 −ρ/(σ_1 σ_2 (1−ρ²))   1/(σ_2²(1−ρ²)) ]  =  [ 2.778  −2.222 ; −2.222  2.778 ]

[Figure: contours of p(x_1, x_2), the marginal p(x_1), and the conditional p(x_1 | x_2 = 1); gaussCondition2Ddemo2 from Murphy, Page 112.]

Page 24: Operations on Gaussians - 2

• The covariance matrix captures marginal independencies between variables:
  Σ_ij = 0  ⇔  x_i and x_j are (marginally) independent, x_i ⊥ x_j.

• The information matrix J = Σ^{−1} captures conditional independencies:
  J_ij = 0  ⇔  x_i ⊥ x_j | {x_1, x_2, ..., x_n} \ {x_i, x_j}.
  Non-zero entries in J correspond to edges in the dependency network.

• The product of Gaussian PDFs is proportional to a Gaussian. In information form,

  p_i(x; μ_i, Σ_i) = ( |J_i|^{1/2} / (2π)^{p/2} ) exp{ −(1/2)(x − μ_i)^T J_i (x − μ_i) } ∝ exp( η_i^T x − (1/2) x^T J_i x ),
  with J_i = Σ_i^{−1} and η_i = J_i μ_i,

  so  Π_{i=1}^n p_i(x; μ_i, Σ_i) ∝ p(x; μ, Σ)  with  J = Σ_{i=1}^n J_i  and  η = J μ = Σ_{i=1}^n η_i.

  This is valid for division also (subtract the information terms).

Page 25: Sampling From a Multivariate Gaussian

  x ~ N(μ, Σ) = N(μ, J^{−1}),   J = Σ^{−1} (precision or information matrix)

• Method 1: eigendecomposition Σ = Q Λ Q^T; draw z ~ N(0, I); set x = μ + Q Λ^{1/2} z.
• Method 2: Cholesky decomposition Σ = L L^T; draw z ~ N(0, I); set x = μ + L z.
• Method 3: given the Cholesky decomposition of the precision matrix, J = B B^T; draw z ~ N(0, I); solve B^T y = z; set x = μ + y.
• Method 4: Gibbs sampler using the covariance matrix or the precision matrix. Partition x = [x_i; x_{−i}]; then

  p(x_i | x_{−i}) = N( μ_i + Σ_{i,−i} Σ_{−i,−i}^{−1}(x_{−i} − μ_{−i}),  Σ_{ii} − Σ_{i,−i} Σ_{−i,−i}^{−1} Σ_{−i,i} )
                  = N( μ_i − (1/J_ii) J_{i,−i}(x_{−i} − μ_{−i}),  1/J_ii )

Methods 1 and 2 require O(n³) computation. Method 3 can exploit sparsity of J. The information form of the Gibbs sampler does not require matrix inversion! Relation to Gauss-Seidel, successive over-relaxation, etc.: https://arxiv.org/pdf/1505.03512.pdf
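A minimal sketch of Method 2 (Cholesky sampling); names are illustrative:

```python
import numpy as np

def sample_mvn_cholesky(mu, Sigma, n_samples, rng=None):
    """Method 2 from the slide: x = mu + L z with Sigma = L L^T and z ~ N(0, I)."""
    rng = np.random.default_rng() if rng is None else rng
    L = np.linalg.cholesky(Sigma)
    z = rng.standard_normal((n_samples, len(mu)))
    return mu + z @ L.T

X = sample_mvn_cholesky(np.array([0.0, 0.0]),
                        np.array([[1.0, 0.8], [0.8, 1.0]]), 100000)
print(np.cov(X, rowvar=False))   # close to the specified covariance
```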

Page 26: ML versus Bayesian Learning

ML:
• Computationally simpler (calculus, optimization)
• Single best model; easier to interpret:  p(x | z = k, D_k) ≈ p(x | z = k, θ̂_k^ML)
• Does not use a priori information on θ

Bayesian:
• Complex numerical integration (unless a reproducing density is used)
• Weighted average of models:  p(x | z = k, D_k) = ∫ p(x | z = k, θ) p(θ | z = k, D_k) dθ
• Uses a priori information on θ

Page 27: Density Estimation: Histograms

• Assume x has already been standardized. To construct a histogram, divide the range into a set of B evenly spread bins. Then

  h(b) = K_b / N,  b = 1, 2, ..., B,  where K_b = number of points which fall inside bin b.

• The choice of B is crucial:
  – B large ⇒ the estimated density is spiky ("noisy")
  – B small ⇒ an over-smoothed density
  – There is an optimum choice for B.
• We can construct the histogram sequentially, considering the data one point at a time.

[Figure: histogram estimates with B = 3, 7, and 12 bins.]

Page 28: Major Problems with the Histogram Algorithm

• The estimated density is not smooth (it has discontinuities at the boundaries of the bins).
• In high dimensions, we need B^p bins (curse of dimensionality: a huge number of data points is required to estimate the density).
• In high dimensions:
  – The density is concentrated in a small part of the space.
  – Most of the bins will be empty ⇒ estimated density = 0 there.
  – As the number of dimensions grows, a thin shell of constant thickness on the interior of a sphere ends up containing almost all of its volume ⇒ most of the volume is near the surface!

Page 29: Kernel and K-Nearest Neighbor Methods - 1

General idea: suppose we want to find p(x), and suppose we draw N points from p(x).

Probability that a point falls in a region R:

  P = Prob(x ∈ R) = ∫_R p(x) dx,  0 ≤ P ≤ 1

Probability that exactly k of the N points fall in R (binomial):

  Prob(k of N fall in R) = C(N, k) P^k (1 − P)^{N−k} = B(k; N, P)

Expected number falling in region R:

  E[k] = Σ_{k=0}^N k C(N, k) P^k (1 − P)^{N−k} = N P Σ_{k=1}^N C(N−1, k−1) P^{k−1}(1 − P)^{N−k} = N P

Page 30: Kernel and K-Nearest Neighbor Methods - 2

Expected fraction of points falling in region R:

  E[k/N] = P   ......(1)

Variance of k/N:

  E[(k/N − P)²] = P(1 − P)/N

As N → ∞, the variance → 0, so k/N is a good estimate of P.

Also, if R is small enough that p(x) is roughly constant over it,

  P = ∫_R p(x) dx ≈ p(x) V   ......(2)    (V = volume of R)

So,  p(x) V ≈ k/N,  or  p(x) ≈ k/(N V)   ......(3)

Note that R should be large enough for (1) to give a reliable estimate; however, R should be small for (2) to hold ⇒ there is an optimal choice for R.

Page 31: Effect of N on Probability Estimates

We are trying to estimate P via k/N (and hence p(x) ≈ k/(N V)). As N increases, the estimate peaks at the true value of P (the distribution of k/N becomes a delta function).

Page 32: Kernel and K-Nearest Neighbor Methods - 3

There are basically two approaches to using Eq. (3) for density estimation:

• Fix V and determine k from the data ⇒ kernel-based estimation.
• Fix k and determine the corresponding volume V from the data ⇒ k-nearest neighbor approach.

Major disadvantage: both need all the data.

Both of these methods converge to the true density as N → ∞, provided
  – V → 0 and N V → ∞ as N → ∞ (kernel-based methods);
  – k → ∞ and k/N → 0 as N → ∞ (k-nearest neighbor methods).

Typically select V ∝ 1/√N for kernel-based methods and k ∝ √N for k-nearest neighbor methods.

Page 33: Selection of V and k

[Figure]

Page 34: Kernel Estimators - 1

• Suppose we take the region R to be a hypercube with sides of length h centered on the point x. Its volume is V = h^p.
• We can find an expression for k, the number of points which fall within this region, by defining a kernel function H(u), also known as the Parzen window:

  H(u) = 1  if |u_i| ≤ 1/2 for i = 1, 2, ..., p;   0 otherwise.

H(u) is a unit hypercube centered at the origin. Gaussian windows could be used as well.

For each data point x^j, the quantity H((x − x^j)/h) equals unity if x^j falls inside the hypercube of side h centered at x, and 0 otherwise. The total number of points falling inside the hypercube is

  k = Σ_{j=1}^N H((x − x^j)/h)

so the density estimate is

  p̂(x) = k/(N V) = (1/(N h^p)) Σ_{j=1}^N H((x − x^j)/h)

Page 35: Gaussian Parzen Windows

  H(x; h) = (1 / ((2π)^{p/2} h^p)) exp( −||x||² / (2h²) )

[Figure: two-dimensional, circularly symmetric normal Parzen windows for three different values of h.]

Page 36: Kernel Estimators - 2

  p̂(x) = k/(N V) = (1/(N h^p)) Σ_{j=1}^N H((x − x^j)/h)

With the hypercube kernel, p̂(x) is a superposition of cubes of side h, each cube centered on one of the data points. We can smooth out this estimate by choosing different forms for the kernel function H(u); example: Gaussian Parzen windows.

Expected value of the estimate:

  E[p̂(x)] = (1/h^p) E[ H((x − v)/h) ] = ∫ (1/h^p) H((x − v)/h) p(v) dv

This is the convolution of H and p: a blurred version of p(x) as seen through the window.

• Large N ⇒ good estimate as h → 0.
• Small N ⇒ need to select h properly; for small N, a small h gives a noisy estimate.

Page 37: Illustration of Density Estimation: Effect of h

[Figure: three Parzen-window density estimates based on the same set of five samples, using Gaussian Parzen window functions with different widths h.]

Page 38: Density Estimation: Effects of h and N

[Figure: Parzen-window estimates of a univariate normal density using different window widths and numbers of samples.]

Page 39: Bivariate Density Estimation

[Figure: Parzen-window estimates of a bivariate normal density using different window widths and numbers of samples.]

Page 40: Bimodal Density Estimation

[Figure: Parzen-window estimates of a bimodal distribution using different window widths and numbers of samples. Note that as n → ∞, the estimates coincide and match the true distribution.]

Page 41: Gaussian Windows: PNN

Volume of a hypersphere of radius σ in p dimensions:

  V_p(σ) = 2 σ^p π^{p/2} / (p Γ(p/2)),   where Γ(a) = ∫_0^∞ u^{a−1} e^{−u} du

Density estimate = sum of multivariate Gaussian distributions centered at each training sample:

  p̂(x) = (1/N) Σ_{j=1}^N (1/((2π)^{p/2} σ^p)) exp( −(x − x^j)^T(x − x^j) / (2σ²) )

(can also be done per class: p̂(x | z = k), with N → n_k and the sum over the class samples x_k^j)

Also called the "Probabilistic Neural Network (PNN)".

• Small σ ⇒ the density estimate will have discontinuities (spiky).
• Larger σ ⇒ a greater degree of interpolation (smoothness).
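A minimal sketch of the PNN class-conditional density and the resulting classifier (names are illustrative):

```python
import numpy as np

def pnn_class_density(x, class_samples, sigma):
    """Gaussian-window (PNN) estimate of p(x | class): an average of
    isotropic Gaussians of width sigma centered at the class samples."""
    n_k, p = class_samples.shape
    d2 = np.sum((class_samples - x) ** 2, axis=1)            # squared distances
    norm = (2.0 * np.pi) ** (p / 2) * sigma ** p
    return np.mean(np.exp(-d2 / (2.0 * sigma ** 2))) / norm

def pnn_classify(x, samples_by_class, priors, sigma):
    """Pick the class maximizing prior * estimated class-conditional density."""
    scores = [pi * pnn_class_density(x, S, sigma) for S, pi in zip(samples_by_class, priors)]
    return int(np.argmax(scores))
```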

Page 42: Probabilistic Neural Network - 1

• Note that each component of x has the same σ ⇒ must scale the x_i to have the same range.
• One way to compute the scale:

  σ̂_k² = (1/p) tr(Σ̂_k)  for each class k;    σ² = (1/N) Σ_{k=1}^C n_k σ̂_k²

• The PNN is fairly insensitive to σ over a wide range. Sometimes it is easier to experiment with σ than to compute it this way.

Page 43: Probabilistic Neural Network - 2

PNN structure (four layers):

• Input units: the p feature values x_1, x_2, ..., x_p.
• Pattern units: one unit per training sample (n_1, ..., n_C units for classes 1, ..., C).
• Summing units: one per class; each sums its class's pattern-unit outputs, scales by 1/((2π)^{p/2} σ^p n_k), and weights by the prior P(z = k).
• Output unit: pick the maximum over the classes P(z = 1), ..., P(z = C).

• Speed up the PNN by using cluster centers as representative patterns; cluster using K-means, LVQ, ...

Page 44: Probabilistic Neural Network - 2 (Pattern Units, Form 1)

One form of the pattern unit: the unit for training sample x_k^j (j = sample number, k = class, p = dimension of the pattern) receives x_1, ..., x_p and −(1/2)||x||², and computes

  z_k^j = x^T x_k^j − (1/2)||x||² − (1/2)||x_k^j||² = −(1/2)||x − x_k^j||² ≤ 0,

then outputs exp( z_k^j / σ² ).

Page 45: Probabilistic Neural Network - 3 (Pattern Units, Form 2)

Second form of the pattern unit: compute the componentwise differences x_i − x_{ik}^j, square and sum them,

  z_k^j = ||x − x_k^j||²,

then output exp( −z_k^j / (2σ²) ).   (j = sample number, k = class, p = dimension of the pattern.)

Page 46: Alternative Forms of Kernels

Alternate forms of the kernel H(u) for the product estimator p̂(x) = (1/N) Σ_{j=1}^N Π_{i=1}^p (1/h) H((x_i − x_i^j)/h):

• Exponential kernel based on the Manhattan distance (1-norm):  H(u) ∝ exp(−|u|)
• Cauchy distribution:  H(u) ∝ 1/(1 + u²)
• Sinc kernel:  H(u) ∝ sinc²(u/2),  where sinc(x) = sin(x)/x
• Tri-cube kernel:  H(u) = (70/81)(1 − |u|³)³,  |u| ≤ 1
• Epanechnikov kernel:  H(u) = (3/4)(1 − u²),  |u| ≤ 1

Page 47: K-Means (K-Centers) Algorithm

• Decide on K, the number of clusters, in advance.
• Suppose there are N data points D = {x^1, x^2, ..., x^N}. We want to find K representative vectors μ_1, μ_2, ..., μ_K ("cluster centers", "codebook vectors").
• The K-means algorithm seeks to partition the data into K disjoint subsets {C_j}, containing {n_j} points, so as to minimize the sum-of-squares clustering function

  J = Σ_{j=1}^K Σ_{i ∈ C_j} ||x^i − μ_j||²

• Clearly, if the {C_j} are known, then the optimal μ_j is the mean of cluster j:  μ_j = (1/n_j) Σ_{i ∈ C_j} x^i.

Page 48: Batch Version of the K-Means Algorithm (Unsupervised Learning)

1) Start with any K random feature vectors as the K centers. Alternately, assign the N points to K sets randomly and compute their means; these means serve as the initial centers.
2) For i = 1, 2, ..., N: assign pattern i to cluster C_j if  j = arg min_{1≤k≤K} ||x^i − μ_k||².
3) Recompute the means via  μ_j = (1/n_j) Σ_{i ∈ C_j} x^i.
4) If the centers have changed, go to step 2; else stop.

Covariance of each cluster (if needed):  Σ̂_j = (1/(n_j − 1)) Σ_{i ∈ C_j} (x^i − μ_j)(x^i − μ_j)^T, etc.
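A compact NumPy sketch of the batch loop above (random-sample initialization; names are illustrative):

```python
import numpy as np

def kmeans(X, K, n_iter=100, rng=None):
    """Batch K-means: alternate assignment (step 2) and mean update (step 3)."""
    rng = np.random.default_rng() if rng is None else rng
    centers = X[rng.choice(len(X), size=K, replace=False)]    # step 1
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)                            # step 2
        new_centers = np.vstack([X[labels == j].mean(axis=0) if np.any(labels == j)
                                 else centers[j] for j in range(K)])
        if np.allclose(new_centers, centers):                 # step 4
            break
        centers = new_centers                                 # step 3
    return centers, labels
```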

Page 49: Initialization (k-means++)

Initialization (recall David Arthur and Sergei Vassilvitskii, "k-means++: The Advantages of Careful Seeding", Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms):

a. Choose the initial center at random: let x^{n_1} be that data point, μ_1 = x^{n_1}.
b. For k = 2, ..., K:
     For n = 1, 2, ..., N with n ≠ n_i, i = 1, 2, ..., k−1:
       D(x^n) = min_{1≤i≤k−1} ||x^n − μ_i||²   (a variant uses exp(−min_{1≤i≤k−1} ||x^n − μ_i||²))
     End
     Select the next center x^{n_k} probabilistically, with probability  p(x^n) = D(x^n) / Σ_{n'} D(x^{n'})
     Store n_k; set μ_k = x^{n_k}
   End
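A minimal sketch of the k-means++ seeding step above (names are illustrative):

```python
import numpy as np

def kmeans_pp_init(X, K, rng=None):
    """k-means++ seeding: each new center is drawn with probability
    proportional to the squared distance to the nearest chosen center."""
    rng = np.random.default_rng() if rng is None else rng
    centers = [X[rng.integers(len(X))]]                       # step a
    for _ in range(1, K):                                     # step b
        d2 = np.min(((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(axis=2), axis=1)
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)
```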

Page 50: Selection of K

• Pick K to minimize the sum of the training error and a model complexity term; the sum is called the prediction error:

  PE(K) = (1/N) Σ_{j=1}^K Σ_{n ∈ C_j} ||x^n − μ_j||² + 2 σ² K p / N,

  where σ² is the variance of the noise in the data.

• Kurtosis-based measure:
  – Kurtosis for a normalized Gaussian: E[x⁴] = 3σ⁴, so the excess kurtosis E[x⁴]/σ⁴ − 3 = 0.
  – Find the excess kurtosis for each cluster and coordinate:  K_{ji} = (1/n_j) Σ_{n ∈ C_j} (x_i^n − μ_{ji})⁴ / σ_{ji}⁴ − 3,  i = 1, 2, ..., p
  – Aggregate:  K_j = (1/p) Σ_{i=1}^p |K_{ji}|;   K_T = (1/K) Σ_{j=1}^K K_j = (1/(Kp)) Σ_{j=1}^K Σ_{i=1}^p |K_{ji}|
  – Plot K_T vs. K and pick the K that gives the minimum K_T.
  – Can also use this idea in a dynamic cluster-splitting scheme (see Vlassis and Likas, IEEE Trans. SMC-A, July 1999, pp. 393-399).

Page 51: Online Version of K-Means

• Start with K randomly chosen centers {μ_j}_{j=1}^K.
• For each data point x^i, update the nearest center μ_j via

  μ_j^new = μ_j^old + η (x^i − μ_j^old)

  and leave all the others the same.
• Can be used for dynamic (streaming) data. For static data, we need to go through the data multiple times!
• This is vector quantization via stochastic approximation.

Page 52: Supervised Algorithm (Learning Vector Quantization)

Supervised algorithm (Learning Vector Quantization — we know the class labels!):

• Start with K codebook vectors.
• For each data point x^i, find the closest codebook vector (center) μ_j.

  μ_j^new = μ_j^old + η (x^i − μ_j^old)   if x^i is classified correctly;
  μ_j^new = μ_j^old − η (x^i − μ_j^old)   if x^i is classified incorrectly.

• It is a version of reinforcement learning. We will take up variants of the algorithm in Lecture 8.

Page 53: k-Nearest Neighbor Approach

• Fix k and allow the volume V to vary:

  p̂(x) = k / (N V),  where V is the volume of a hypersphere centered at point x that contains exactly k points.

  Again, small k ⇒ noisy estimate; large k ⇒ smooth estimate.

• The major use of the k-nearest neighbor technique is not in probability estimation but in classification. Assign a point to class i if class i has the largest number of points among the k nearest neighbors ⇒ MAP rule:

  p̂(x | z = i) = k_i / (n_i V);   P̂(z = i) = n_i / N;   p̂(x) = k / (N V)

  P̂(z = i | x) = p̂(x | z = i) P̂(z = i) / p̂(x) = k_i / k
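A minimal sketch of the k-NN MAP rule above (names are illustrative):

```python
import numpy as np
from collections import Counter

def knn_classify(x, X_train, y_train, k):
    """k-NN / MAP rule from the slide: among the k nearest training points,
    pick the class with the largest count k_i (P(z=i|x) ~ k_i/k)."""
    d2 = np.sum((X_train - x) ** 2, axis=1)
    nearest = np.argsort(d2)[:k]
    return Counter(y_train[nearest]).most_common(1)[0][0]

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 3.0])
y = np.array([0] * 50 + [1] * 50)
print(knn_classify(np.array([2.5, 2.5]), X, y, k=5))
```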

Page 54: 1-NN Classifier

The 1-nearest neighbor classifier has an interesting geometric interpretation:

• Data points become corner points of triangles spanned by each reference point and two of its neighbors. The network of triangles is called the Delaunay triangulation (or Delaunay net).
• Perpendicular bisectors of the triangles' edges form the borders of the polygonal patches delimiting the local neighborhood of each point. The resulting network of polygons is called the Voronoi diagram (or Voronoi net).
• The 1-NN decision boundary follows sections of the Voronoi net.
• 1-NN and k-NN adapt to any data: high variability and low bias.

[Figure: 1-NN decision boundary in the (x_1, x_2) plane following the Voronoi cells of the training points.]

Page 55: Why Gaussian Mixtures? (Gaussian Mixtures, Maximum Likelihood and the EM Algorithm - 1)

• Parametric ⇒ fast but limited.
• Nonparametric ⇒ general but slow (requires a lot of data).
• Mixture models:

  p(x) = Σ_{j=1}^M p(x | j) P_j,   with P_j ≥ 0, Σ_{j=1}^M P_j = 1, and ∫ p(x | j) dx = 1

• A similar technique applies to each class-conditional density p(x | z = k), k = 1, 2, ..., C.
• Related: conditional density estimation (function approximation), RBF networks, mixture-of-experts models.

Page 56: Gaussian Mixtures, Maximum Likelihood and the EM Algorithm - 2

Mixture of Gaussians (mixture network: component densities p(x|1), ..., p(x|M) weighted by P_1, ..., P_M):

  p(x | j) = N(μ_j, Σ_j),  typically N(μ_j, σ_j² I):   p(x | j) = (1/(2πσ_j²)^{p/2}) exp( −||x − μ_j||² / (2σ_j²) )

Problem: given data D = {x^1, x^2, ..., x^N}, find the ML estimates of {P_j, μ_j, σ_j}_{j=1}^M. Let θ = {P_j, μ_j, σ_j}. Then

  max_θ L = p(D | θ)   ⟺   max_θ l = ln p(D | θ)   ⟺   min_θ J = −ln p(D | θ)

Page 57: Gaussian Mixtures, Maximum Likelihood and the EM Algorithm - 3

  J = −Σ_{i=1}^N ln p(x^i | θ) = −Σ_{i=1}^N ln [ Σ_{j=1}^M p(x^i | j) P_j ]

Gradient with respect to μ_j:

  ∂J/∂μ_j = −Σ_{i=1}^N [ P_j ∂p(x^i | j)/∂μ_j ] / [ Σ_{k=1}^M p(x^i | k) P_k ]

With p(x^i | j) = (1/(2πσ_j²)^{p/2}) exp(−||x^i − μ_j||²/(2σ_j²)), we have ∂p(x^i | j)/∂μ_j = p(x^i | j)(x^i − μ_j)/σ_j², so

  ∂J/∂μ_j = −Σ_{i=1}^N P(j | x^i) (x^i − μ_j)/σ_j²   ......(1)

where the posterior ("responsibility") is

  P(j | x^i) = p(x^i | j) P_j / Σ_{k=1}^M p(x^i | k) P_k

Note the simplicity of the gradient. For the mixing weights, use a Lagrangian:

  min_θ J   s.t.  Σ_{j=1}^M P_j = 1,  0 ≤ P_j ≤ 1;    Lagrangian:  l = J + λ(Σ_{j=1}^M P_j − 1)

Page 58: Gaussian Mixtures, Maximum Likelihood and the EM Algorithm - 4

Gradient with respect to σ_j (p = dimension of the feature vector):

  ∂J/∂σ_j = −Σ_{i=1}^N [ P_j ∂p(x^i | j)/∂σ_j ] / [ Σ_{k=1}^M p(x^i | k) P_k ]
          = −Σ_{i=1}^N P(j | x^i) [ ||x^i − μ_j||²/σ_j³ − p/σ_j ]   ......(2)

Gradient of the Lagrangian with respect to P_j:

  ∂l/∂P_j = −Σ_{i=1}^N p(x^i | j) / [ Σ_{k=1}^M p(x^i | k) P_k ] + λ = −(1/P_j) Σ_{i=1}^N P(j | x^i) + λ = 0

  ⇒  Σ_{i=1}^N P(j | x^i) = λ P_j;  summing over j gives λ = N,  so  P_j = (1/N) Σ_{i=1}^N P(j | x^i)   ......(3)

Page 59: Gaussian Mixtures, Maximum Likelihood and the EM Algorithm - 5

Necessary conditions of optimality: set the gradients equal to zero.

From (1):   μ̂_j = Σ_{i=1}^N P(j | x^i) x^i / Σ_{i=1}^N P(j | x^i)

From (2):   σ̂_j² = (1/p) Σ_{i=1}^N P(j | x^i) ||x^i − μ̂_j||² / Σ_{i=1}^N P(j | x^i)

From (3), noting that Σ_{j=1}^M P(j | x^i) = 1 and that there are N data points:

  P̂_j = (1/N) Σ_{i=1}^N P(j | x^i)     (P(j | x^i) is the "responsibility" of component j for x^i)

These are coupled nonlinear equations. General (full covariance) case:

  Σ̂_j = Σ_{i=1}^N P(j | x^i)(x^i − μ̂_j)(x^i − μ̂_j)^T / Σ_{i=1}^N P(j | x^i)

Page 60: Methods of Solution: Nonlinear Programming (NLP) Techniques

General iteration:   θ^{k+1} = θ^k − η_k H_k ∇J(θ^k),   0 < η_k ≤ 1

• H = I ⇒ steepest descent (SD) or gradient method
• H = [∇²J]^{−1} ⇒ Newton's method
• H = [∇²J + λI]^{−1} ⇒ Levenberg-Marquardt method
• H = [J_r^T J_r + λI]^{−1} (J_r = Jacobian of the residuals) ⇒ Levenberg-Marquardt version of the Gauss-Newton method
• Various versions of quasi-Newton methods
• Various versions of the conjugate gradient method

It is often best to compute the Hessian using a finite-difference method.

Page 61: EM Algorithm

EM algorithm for the Gaussian mixture (how we get these equations, and why, comes later):

M-step (by setting the gradients to zero):

  μ_j^new = Σ_{i=1}^N P̂^old(j | x^i) x^i / Σ_{i=1}^N P̂^old(j | x^i)

  (σ_j^new)² = (1/p) Σ_{i=1}^N P̂^old(j | x^i) ||x^i − μ_j^new||² / Σ_{i=1}^N P̂^old(j | x^i)

  P_j^new = (1/N) Σ_{i=1}^N P̂^old(j | x^i)

E-step (evaluating the posterior probabilities / responsibilities):

  P̂^new(j | x^i) = p(x^i | j, θ̂^new) P_j^new / Σ_{m=1}^M p(x^i | m, θ̂^new) P_m^new

This alternation is a Gauss-Seidel view of EM.

Page 62: Sequential Estimation - 1 (Sequential Estimation ⇒ Stochastic Approximation)

  μ̂_j^{n+1} = Σ_{i=1}^{n+1} P(j | x^i) x^i / Σ_{i=1}^{n+1} P(j | x^i)
            = [ Σ_{i=1}^n P(j | x^i) μ̂_j^n + P(j | x^{n+1}) x^{n+1} ] / Σ_{i=1}^{n+1} P(j | x^i)

so the mean can be updated recursively:

  μ̂_j^{n+1} = μ̂_j^n + [ P(j | x^{n+1}) / Σ_{i=1}^{n+1} P(j | x^i) ] (x^{n+1} − μ̂_j^n)

Note:

  Σ_{i=1}^{n+1} P(j | x^i) = Σ_{i=1}^n P(j | x^i) [ 1 + P(j | x^{n+1}) / Σ_{i=1}^n P(j | x^i) ] = (n+1) P̂_j^{n+1}

Page 63: Sequential Estimation ⇒ Stochastic Approximation - 2

• Sometimes we replace Σ_{i=1}^{n+1} P(j | x^i) by (n+1) P̂_j^{n+1}, i.e.,

  μ̂_j^{n+1} = μ̂_j^n + [ P(j | x^{n+1}) / ((n+1) P̂_j^{n+1}) ] (x^{n+1} − μ̂_j^n)

• Similarly, for the variance, with

  σ̂_j^{2,n} = (1/p) Σ_{i=1}^n P(j | x^i) ||x^i − μ̂_j^n||² / Σ_{i=1}^n P(j | x^i),

  expand ||x^i − μ̂_j^{n+1}||² = ||x^i − μ̂_j^n||² − 2(x^i − μ̂_j^n)^T(μ̂_j^{n+1} − μ̂_j^n) + ||μ̂_j^{n+1} − μ̂_j^n||² and use Σ_{i=1}^n P(j | x^i)(x^i − μ̂_j^n) = 0 to obtain the recursion

  σ̂_j^{2,n+1} = [ Σ_{i=1}^n P(j | x^i) / Σ_{i=1}^{n+1} P(j | x^i) ] σ̂_j^{2,n}
              + [ P(j | x^{n+1}) / Σ_{i=1}^{n+1} P(j | x^i) ] (1/p) ||x^{n+1} − μ̂_j^n||²
              − (1/p) ||μ̂_j^{n+1} − μ̂_j^n||²

Page 64: Sequential Estimation ⇒ Stochastic Approximation - 3

• Similarly, writing Σ_{i=1}^{n+1} P(j | x^i) = (n+1) P̂_j^{n+1}, the variance recursion becomes

  σ̂_j^{2,n+1} = σ̂_j^{2,n} + [ P(j | x^{n+1}) / ((n+1) P̂_j^{n+1}) ] [ (1/p) ||x^{n+1} − μ̂_j^n||² − σ̂_j^{2,n} ] − (1/p) ||μ̂_j^{n+1} − μ̂_j^n||²

  Recall that μ̂_j^{n+1} − μ̂_j^n = [ P(j | x^{n+1}) / ((n+1) P̂_j^{n+1}) ] (x^{n+1} − μ̂_j^n).

• The mixing weights are updated via

  P̂_j^{n+1} = P̂_j^n + (1/(n+1)) [ P(j | x^{n+1}) − P̂_j^n ]

Page 65: EM Algorithm for the Gaussian Mixture Problem - 1

Key ideas of EM as applied to the Gaussian mixture problem. With

  J(θ) = −Σ_{i=1}^N ln p(x^i | θ) = −Σ_{i=1}^N ln [ Σ_{j=1}^M p(x^i | j) P_j ],

  J^new − J^old = −Σ_{i=1}^N ln [ p^new(x^i) / p^old(x^i) ]
               = −Σ_{i=1}^N ln [ Σ_{j=1}^M P^old(j | x^i) · ( P_j^new p^new(x^i | j) / ( P^old(j | x^i) p^old(x^i) ) ) ]

Idea (bound decomposition): with x the data, z the hidden variables (mixture labels), θ the parameters, and q(z) any arbitrary distribution, use ln p(x, z | θ) = ln p(z | x, θ) + ln p(x | θ) to write

  ln p(x | θ) = ln L(q, θ) + KL(q(z) || p(z | x, θ)),
  ln L(q, θ) = E_q[ ln ( p(x, z | θ) / q(z) ) ],   KL(q || p) = −E_q[ ln ( p(z | x, θ) / q(z) ) ] ≥ 0

so  J = −ln p(x | θ) = −ln L(q, θ) − KL(q(z) || p(z | x, θ))  ≤  −ln L(q, θ).

• E-step: set q(z) = p(z | x, θ^old)  ⇒  KL = 0 and the bound −ln L(q, θ^old) touches J(θ^old).
• M-step: minimize the bound,  min_θ [ −ln L(q, θ) ] = min_θ E_q[ −ln p(x, z | θ) ] = min_θ Q(θ, θ^old).

Note:  −ln L(q, θ) = Q(θ, θ^old) − H(q(z)),  where Q(θ, θ^old) = −E_q[ln p(x, z | θ)] and H is the entropy of q.

[Figure: decomposition of J = −ln p(x | θ) into −ln L(q, θ) (= Q(θ, θ^old) up to the entropy term) and KL(q || p).]

Page 66: EM Algorithm for the Gaussian Mixture Problem - 2

For the convex function −ln(·), Jensen's inequality gives  −ln( Σ_i λ_i x_i ) ≤ −Σ_i λ_i ln x_i  where Σ_i λ_i = 1, λ_i ≥ 0. Applying it with λ_j = P^old(j | x^i):

  J^new − J^old ≤ −Σ_{i=1}^N Σ_{j=1}^M P^old(j | x^i) ln [ P_j^new p^new(x^i | j) / ( P^old(j | x^i) p^old(x^i) ) ]

so, dropping terms that do not depend on θ^new,

  J^new ≤ Q(θ^new, θ^old) + const,   Q(θ^new, θ^old) = −Σ_{i=1}^N Σ_{j=1}^M P^old(j | x^i) ln [ P_j^new p^new(x^i | j) ]   (= −ln L(q, θ^new) up to the entropy of q)

Minimizing Q(θ, θ^old) therefore leads to a decrease in J(θ).

Note: at θ = θ^old the bound is tight, J(θ^old) = −ln L(q, θ^old) (KL is forced to 0); J(θ^new) ≤ −ln L(q, θ^new) since KL ≥ 0; and the bound and J(θ) have the same gradient at θ^old.

[Figure: J(θ) and the auxiliary upper bounds Q(θ, θ^old), Q(θ, θ^new) touching J at θ^old and θ^new.]

Page 67: EM Algorithm for the Gaussian Mixture Problem - 3

• Optimization problem (M-step):

  min  Q(θ^new, θ^old) = −Σ_{i=1}^N Σ_{j=1}^M P^old(j | x^i) ln [ P_j^new p^new(x^i | j) ]

  s.t.  Σ_{j=1}^M P_j^new = 1;   0 ≤ P_j^new ≤ 1,  j = 1, 2, ..., M

• For Gaussian conditional probability density functions, dropping terms that depend only on the old parameters, we get

  Q = −Σ_{i=1}^N Σ_{j=1}^M P^old(j | x^i) [ ln P_j^new − p ln σ_j^new − ||x^i − μ_j^new||² / (2 (σ_j^new)²) ]

Page 68: EM Algorithm for the Gaussian Mixture Problem - 4

Minimizing Q gives the M-step updates:

  μ_j^new = Σ_{i=1}^N P^old(j | x^i) x^i / Σ_{i=1}^N P^old(j | x^i)

  (σ_j^new)² = (1/p) Σ_{i=1}^N P^old(j | x^i) ||x^i − μ_j^new||² / Σ_{i=1}^N P^old(j | x^i)

  P_j^new = (1/N) Σ_{i=1}^N P^old(j | x^i)

General (full covariance) case:

  Σ_j^new = Σ_{i=1}^N P^old(j | x^i)(x^i − μ_j^new)(x^i − μ_j^new)^T / Σ_{i=1}^N P^old(j | x^i)

Page 69: Graphical Illustration of the E and M Steps

E-step: set q(z) = p(z | x, θ^old). Then KL(q(z) || p(z | x, θ^old)) = 0, so the bound is tight:
  −ln L(q, θ^old) = −ln p(x | θ^old) = J(θ^old).

M-step: θ^new = arg min_θ [ −ln L(q, θ) ]. Why does this decrease J? Because KL( q(z) = p(z | x, θ^old) || p(z | x, θ^new) ) ≥ 0,
  J(θ^new) = −ln p(x | θ^new) ≤ −ln L(q, θ^new) ≤ −ln L(q, θ^old) = J(θ^old),
so the negative log likelihood never increases.

[Figure: the decomposition J = −ln L(q, θ) − KL(q || p) before and after each step; KL(q || p) = 0 after the E-step.]

Note: EM is a maximum likelihood algorithm. Is there a Bayesian version? Yes: if you assume priors on ({μ_j, σ_j², P_j}) you get Variational Bayesian Inference.

Page 70: An Alternate View of EM for Gaussian Mixtures - 1

Latent-variable formulation (graphical model: hidden/latent z → observed x).

z is an M-dimensional binary random vector such that z_j ∈ {0, 1} and Σ_{j=1}^M z_j = 1; the only possible vectors are {z = e_i : i = 1, 2, ..., M}, with e_i the i-th unit vector.

  P(z_j = 1) = P_j   ⇒   P(z) = Π_{j=1}^M P_j^{z_j}

x is a p-dimensional random vector such that

  p(x | z) = Π_{j=1}^M [ N(x; μ_j, Σ_j) ]^{z_j}

  p(x) = Σ_z p(x, z) = Σ_z P(z) p(x | z) = Σ_z Π_{j=1}^M [ P_j N(x; μ_j, Σ_j) ]^{z_j} = Σ_{j=1}^M P_j N(x; μ_j, Σ_j)

so the pdf of x is a Gaussian mixture.

Page 71: An Alternate View of EM for Gaussian Mixtures - 2

Problem: given incomplete (partial) data D = {x^1, x^2, ..., x^N}, find the ML estimates of {P_j, μ_j, Σ_j}_{j=1}^M. Let θ = {P_j, μ_j, Σ_j}_{j=1}^M and minimize J = −ln p(D | θ).

If we have several observations {x^n : n = 1, 2, ..., N}, each data point has a corresponding latent vector z^n. Note the generality of this view.

Complete data:  D_c = {(x^1, z^1), (x^2, z^2), ..., (x^N, z^N)}

  ln p(D_c | θ) = Σ_{n=1}^N Σ_{j=1}^M z_j^n { ln P_j − (p/2) ln 2π − (1/2) ln |Σ_j| − (1/2)(x^n − μ_j)^T Σ_j^{−1}(x^n − μ_j) }

Page 72: An Alternate View of EM for Gaussian Mixtures - 3

If we had the complete data, estimation would be trivial: it is similar to the single-Gaussian case, except that we estimate with the subsets of data assigned to each mixture component.

In EM, during the E-step we replace each latent variable by its expectation with respect to the posterior density:

  ⟨z_j^n⟩ = E[z_j^n | x^n, θ] = P(z_j^n = 1 | x^n, θ) = P_j N(x^n; μ_j, Σ_j) / Σ_{k=1}^M P_k N(x^n; μ_k, Σ_k)   (responsibilities)

During the M-step we minimize the expected value of the negative complete-data log likelihood:

  E_Z[ −ln p(D_c | θ) ] = −Σ_{n=1}^N Σ_{j=1}^M ⟨z_j^n⟩ { ln P_j − (p/2) ln 2π − (1/2) ln |Σ_j| − (1/2)(x^n − μ_j)^T Σ_j^{−1}(x^n − μ_j) } = Q(θ, θ^old)

Page 73: EM Algorithm for Gaussian Mixtures - 4

1. Initialize the means {μ_j}_{j=1}^M, covariances {Σ_j}_{j=1}^M, and mixing coefficients {P_j}_{j=1}^M. Evaluate

   J = −ln p(x | θ) = −Σ_{n=1}^N ln { Σ_{j=1}^M P_j N(x^n; μ_j, Σ_j) }

2. E-step: evaluate the responsibilities using the current parameter values,

   γ_j^n = P_j N(x^n; μ_j, Σ_j) / Σ_{k=1}^M P_k N(x^n; μ_k, Σ_k),  j = 1, 2, ..., M;  n = 1, 2, ..., N;
   N_j = Σ_{n=1}^N γ_j^n,  j = 1, 2, ..., M

3. M-step: re-estimate the parameters using the current responsibilities,

   μ_j^new = (1/N_j) Σ_{n=1}^N γ_j^n x^n
   Σ_j^new = (1/N_j) Σ_{n=1}^N γ_j^n (x^n − μ_j^new)(x^n − μ_j^new)^T
   P_j^new = N_j / N

4. Evaluate the negative log likelihood and check for convergence of the parameters or the likelihood. If not converged, go to step 2.

For an unbiased estimate of the covariance, divide by N_j − (1/N_j) Σ_{n=1}^N (γ_j^n)² instead of N_j (this goes to N_j − 1 in the hard, 0-1 assignment case).
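A minimal NumPy/SciPy sketch of steps 1-4 above (initialization choices and names are illustrative, and no numerical safeguards are included):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, M, n_iter=100, seed=0):
    """EM for a Gaussian mixture, following steps 1-4 on this slide."""
    rng = np.random.default_rng(seed)
    N, p = X.shape
    mu = X[rng.choice(N, M, replace=False)]                    # step 1: initialize
    Sigma = np.array([np.cov(X, rowvar=False) for _ in range(M)])
    P = np.full(M, 1.0 / M)
    for _ in range(n_iter):
        # E-step: responsibilities gamma[n, j]
        dens = np.column_stack([P[j] * multivariate_normal.pdf(X, mu[j], Sigma[j])
                                for j in range(M)])
        gamma = dens / dens.sum(axis=1, keepdims=True)
        Nj = gamma.sum(axis=0)
        # M-step: re-estimate parameters
        mu = (gamma.T @ X) / Nj[:, None]
        for j in range(M):
            D = X - mu[j]
            Sigma[j] = (gamma[:, j, None] * D).T @ D / Nj[j]
        P = Nj / N
    return P, mu, Sigma
```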

Page 74: Illustration of the EM Algorithm for Gaussian Mixtures

[Figure: EM on the Old Faithful data (Murphy, Page 353, mixGaussDemoFaithful). Panels show the component fits at iterations 0, 1, 2, 7, and 8, with log likelihoods −Inf, −4929.3761, −441.9557, −389.7686, and −389.7602, plus the average log likelihood vs. iteration.]

Page 75: Relation of Gaussian Mixtures to K-Means

Suppose Σ_j = ε I for j = 1, 2, ..., M. Then

  γ_j^n = P_j N(x^n; μ_j, εI) / Σ_{k=1}^M P_k N(x^n; μ_k, εI) = P_j e^{−||x^n − μ_j||²/(2ε)} / Σ_{k=1}^M P_k e^{−||x^n − μ_k||²/(2ε)}

As ε → 0, γ_j^n → I_j^n, where I_j^n = 1 if j = arg min_k ||x^n − μ_k||²; the rest go to zero, as long as none of the P_k is zero. The expected value of the negative complete-data log likelihood becomes

  E_Z[ −ln p(D_c | θ) ] → (1/(2ε)) Σ_{n=1}^N Σ_{j=1}^M I_j^n ||x^n − μ_j||² + constant

So K-means minimizes  (1/2) Σ_{n=1}^N Σ_{j=1}^M I_j^n ||x^n − μ_j||².

Page 76: Variational Bayesian Inference - 1

Graphical model: latent vector w → observed p-dimensional random vector x. Here w is a latent vector (continuous or discrete):
  – the mixture indicator vector (discrete) z, and/or
  – the parameters ({μ_j, Λ_j^{−1}, P_j}).

Recall the decomposition

  J = −ln p(x) = −ln L(q(w)) − KL(q(w) || p(w | x)),   KL ≥ 0

  ln L(q(w)) = ∫ q(w) ln [ p(x, w) / q(w) ] dw = E_q[ ln p(x, w) ] + H(q)
  KL(q(w) || p(w | x)) = −∫ q(w) ln [ p(w | x) / q(w) ] dw = −E_q[ ln p(w | x) ] − H(q)

Variational inference typically assumes q(w) to be factorized:

  q(w) = Π_{j=1}^K q_j(w_j),   where the {w_j} are disjoint groups of variables.
  Example:  q(w) = q(z) q({μ_j, Λ_j, P_j})

Page 77: Variational Bayesian Inference - 2

Minimize the upper bound −ln L(q(w)) with respect to q_j(w_j) while keeping {q_i(w_i) : i ≠ j} fixed (à la Gauss-Seidel) ⇒ an iterative algorithm for finding the factors {q_j(w_j)}.

  −ln L(q(w)) = −∫ Π_i q_i(w_i) [ ln p(x, w) − Σ_i ln q_i(w_i) ] dw
              = −∫ q_j(w_j) E_{i≠j}[ ln p(x, w) ] dw_j − H(q_j) + terms not involving q_j

Setting the functional derivative with respect to q_j to zero (with ∫ q_j dw_j = 1):

  ln q_j*(w_j) = E_{i≠j}[ ln p(x, w) ] + const   ⇒   q_j*(w_j) = exp( E_{i≠j}[ ln p(x, w) ] ) / ∫ exp( E_{i≠j}[ ln p(x, w) ] ) dw_j

The log of the optimal q_j is the expectation of the log of the joint distribution with respect to all of the other factors {q_i(w_i) : i ≠ j}. This idea is also used in loopy belief propagation and expectation propagation.

Page 78: Application to Gaussian Mixtures - 1

Here w involves the mixture variables and the component parameters:

  q(w) = q(z) q({P_j, μ_j, Λ_j}_{j=1}^M),   where z (per data point) is a binary random vector of dimension M.

Model assumptions:

• Mixture distribution:  p({z^n}_{n=1}^N | {P_j}_{j=1}^M) = Π_{n=1}^N Π_{j=1}^M P_j^{z_j^n}

• Data likelihood given the latent variables:  p({x^n}_{n=1}^N | {z^n}_{n=1}^N, {μ_j, Λ_j}_{j=1}^M) = Π_{n=1}^N Π_{j=1}^M [ N(x^n; μ_j, Λ_j^{−1}) ]^{z_j^n}

• We also assume priors on {P_j, μ_j, Λ_j} ⇒ Bayesian approach. The conjugate prior to the multinomial is the Dirichlet:

  p(P) = Dir(P | α_0) = [ Γ(M α_0) / Γ(α_0)^M ] Π_{j=1}^M P_j^{α_0 − 1}

  (Γ(α) = ∫_0^∞ t^{α−1} e^{−t} dt;   Γ(n + 1) = n Γ(n);   Γ(n) = (n − 1)! for integers.)

Page 79: Application to Gaussian Mixtures - 2

Model assumptions (continued). Gaussian-Wishart prior on the component parameters:

  p({μ_j, Λ_j}_{j=1}^M) = Π_{j=1}^M p(μ_j | Λ_j) p(Λ_j) = Π_{j=1}^M N(μ_j; m_0, (β_0 Λ_j)^{−1}) · Wishart(Λ_j; W_0, ν_0)

(In one dimension the Wishart reduces to a Gamma; see Bishop's book.)

The joint distribution of all the random variables decomposes as:

  p({x^n}, {z^n}, {P_j, μ_j, Λ_j}) = p({x^n} | {z^n}, {μ_j, Λ_j}) · p({z^n} | {P_j}) · p(P) · p({μ_j} | {Λ_j}) · p({Λ_j})
  = Π_{n=1}^N Π_{j=1}^M [ N(x^n; μ_j, Λ_j^{−1}) ]^{z_j^n} · Π_{n=1}^N Π_{j=1}^M P_j^{z_j^n}
    · [ Γ(M α_0)/Γ(α_0)^M ] Π_{j=1}^M P_j^{α_0 − 1} · Π_{j=1}^M N(μ_j; m_0, (β_0 Λ_j)^{−1}) Wishart(Λ_j; W_0, ν_0)


Application to Gaussian Mixtures - 3

Variational Bayes M-step (VBM-step)…. It is easier to see M-step first

    $\ln q^*(\{P, \mu_j, \Lambda_j\}_{j=1}^{M}) = E_{q(\{z^n\})}\big[\ln p(\{x^n\}, \{z^n\}, \{P, \mu_j, \Lambda_j\})\big] + \text{const}$

    $= E_{q(\{z^n\})}\Big[\ln \prod_{n=1}^{N}\prod_{j=1}^{M}\big[P_j\,N(x^n;\mu_j,\Lambda_j^{-1})\big]^{z_j^n}
    \cdot \dfrac{\Gamma(M\alpha_0)}{\Gamma(\alpha_0)^M}\prod_{j=1}^{M} P_j^{\,\alpha_0-1}
    \cdot \prod_{j=1}^{M} N\big(\mu_j; m_0, (\beta_0\Lambda_j)^{-1}\big)\,Wishart(\Lambda_j; W_0, \nu_0)\Big]$

    $= \sum_{j=1}^{M}\Big[(\alpha_0 - 1) + \sum_{n=1}^{N} E[z_j^n]\Big]\ln P_j
    + \sum_{j=1}^{M}\Big\{\ln N\big(\mu_j; m_0, (\beta_0\Lambda_j)^{-1}\big)\,Wishart(\Lambda_j; W_0, \nu_0)
    + \sum_{n=1}^{N} E[z_j^n]\,\ln N(x^n; \mu_j, \Lambda_j^{-1})\Big\} + \text{constant}$

    $\Rightarrow\; q(\{P, \mu_j, \Lambda_j\}_{j=1}^{M}) = q(P)\prod_{j=1}^{M} q(\mu_j, \Lambda_j)$ with
    $q(P) = Dirichlet(P; \{\alpha_j\}_{j=1}^{M})$ and $q(\mu_j, \Lambda_j)$ Gaussian-Wishart


Application to Gaussian Mixtures - 4

Updated factorized distribution after M-step

    $q^*(\{P, \mu_j, \Lambda_j\}_{j=1}^{M}) = q^*(P)\prod_{j=1}^{M} q^*(\mu_j, \Lambda_j)$

    $q^*(P) = Dirichlet(P; \{\alpha_j\}_{j=1}^{M})$;  $\alpha_j = \alpha_0 + N_j$;  $E[P_j] = \dfrac{\alpha_0 + N_j}{M\alpha_0 + N}$

    $q^*(\mu_j, \Lambda_j) = N\big(\mu_j; m_j, (\beta_j\Lambda_j)^{-1}\big)\,Wishart(\Lambda_j; W_j, \nu_j)$

    $\beta_j = \beta_0 + N_j$;  $m_j = \dfrac{1}{\beta_j}\big(\beta_0 m_0 + N_j\bar{x}_j\big)$;  $\nu_j = \nu_0 + N_j$

    $W_j^{-1} = W_0^{-1} + N_j S_j + \dfrac{\beta_0 N_j}{\beta_0 + N_j}(\bar{x}_j - m_0)(\bar{x}_j - m_0)^T$

    where  $N_j = \sum_{n=1}^{N}\gamma_j^n$,  $\bar{x}_j = \dfrac{1}{N_j}\sum_{n=1}^{N}\gamma_j^n x^n$,
    $S_j = \dfrac{1}{N_j}\sum_{n=1}^{N}\gamma_j^n (x^n - \bar{x}_j)(x^n - \bar{x}_j)^T$

    Updates for $\{N_j, \bar{x}_j, S_j\}$ are similar to ML

Sequential VBEM?
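As a concrete illustration of the VBM-step updates above, here is a minimal NumPy sketch (not the deck's own code); the argument names alpha0, beta0, m0, W0, nu0 and the responsibilities array gamma are assumptions chosen to mirror the slide's notation.

```python
import numpy as np

def vbm_step(X, gamma, alpha0, beta0, m0, W0, nu0):
    """One VBM-step for a Bayesian Gaussian mixture (Dirichlet + Gaussian-Wishart priors).

    X     : (N, p) data matrix
    gamma : (N, M) responsibilities from the VBE-step
    Returns updated hyperparameters (alpha, beta, m, W, nu).
    """
    N, p = X.shape
    Nj = gamma.sum(axis=0) + 1e-10                     # N_j = sum_n gamma_j^n
    xbar = (gamma.T @ X) / Nj[:, None]                 # xbar_j = (1/N_j) sum_n gamma_j^n x^n

    alpha = alpha0 + Nj                                # Dirichlet counts
    beta = beta0 + Nj
    m = (beta0 * m0 + Nj[:, None] * xbar) / beta[:, None]
    nu = nu0 + Nj

    W = np.empty((len(Nj), p, p))
    for j in range(len(Nj)):
        diff = X - xbar[j]
        Sj = (gamma[:, j, None] * diff).T @ diff / Nj[j]          # weighted scatter S_j
        dm = (xbar[j] - m0)[:, None]
        Winv = np.linalg.inv(W0) + Nj[j] * Sj \
               + (beta0 * Nj[j] / (beta0 + Nj[j])) * (dm @ dm.T)  # W_j^{-1} update
        W[j] = np.linalg.inv(Winv)
    return alpha, beta, m, W, nu
```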


Application to Gaussian Mixtures - 5

Variational Bayes E-step (VBE-step)

    $\ln q^*(\{z^n\}_{n=1}^{N}) = E_{q(\{P,\mu_j,\Lambda_j\})}\big[\ln p(\{x^n\},\{z^n\},\{P,\mu_j,\Lambda_j\})\big] + \text{const}$

    $= E_{q(P)}\big[\ln p(\{z^n\} \mid \{P_j\})\big] + E_{q(\{\mu_j,\Lambda_j\})}\big[\ln p(\{x^n\} \mid \{z^n\},\{\mu_j,\Lambda_j\})\big] + \text{const}$

    $= \sum_{n=1}^{N}\sum_{j=1}^{M} z_j^n \ln\rho_j^n + \text{const}$

    where  $\ln\rho_j^n = E[\ln P_j] + \dfrac{1}{2}E\big[\ln|\Lambda_j|\big] - \dfrac{p}{2}\ln 2\pi
    - \dfrac{1}{2}E_{\mu_j,\Lambda_j}\big[\|x^n - \mu_j\|^2_{\Lambda_j}\big]$

    $\Rightarrow\; q^*(\{z^n\}_{n=1}^{N}) = \prod_{n=1}^{N}\prod_{j=1}^{M}\big[\gamma_j^n\big]^{z_j^n}$,
    where  $\gamma_j^n = \dfrac{\rho_j^n}{\sum_{k=1}^{M}\rho_k^n} = E[z_j^n]$ .... responsibilities


Application to Gaussian Mixtures - 6

Variational Bayes E-step (VBE-step) … continued

− Evaluation of responsibilities

− Recall

    $\ln\rho_j^n = E[\ln P_j] + \dfrac{1}{2}E\big[\ln|\Lambda_j|\big] - \dfrac{p}{2}\ln 2\pi
    - \dfrac{1}{2}E_{\mu_j,\Lambda_j}\big[\|x^n - \mu_j\|^2_{\Lambda_j}\big]$

    $E[\ln P_j] = \psi(\alpha_j) - \psi\Big(\sum_{k=1}^{M}\alpha_k\Big)$;  $\psi(\alpha) = \dfrac{d}{d\alpha}\ln\Gamma(\alpha)$ .... digamma function

    $E\big[\ln|\Lambda_j|\big] = \sum_{i=1}^{p}\psi\Big(\dfrac{\nu_j + 1 - i}{2}\Big) + p\ln 2 + \ln|W_j|$

    $E_{\mu_j,\Lambda_j}\big[\|x^n - \mu_j\|^2_{\Lambda_j}\big] = \dfrac{p}{\beta_j} + \nu_j(x^n - m_j)^T W_j (x^n - m_j)$

    Since $\gamma_j^n \propto \rho_j^n$:

    $\gamma_j^n \propto \widetilde{P}_j\,\widetilde{\Lambda}_j^{1/2}\exp\Big(-\dfrac{p}{2\beta_j} - \dfrac{\nu_j}{2}(x^n - m_j)^T W_j (x^n - m_j)\Big)$,
    with $\ln\widetilde{P}_j = E[\ln P_j]$ and $\ln\widetilde{\Lambda}_j = E\big[\ln|\Lambda_j|\big]$

    See Bishop, Chapter 10
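To make the responsibility computation concrete, here is a minimal NumPy/SciPy sketch, assuming the hyperparameters alpha, beta, m, W, nu produced by a VBM-step (illustrative names, not a library API); scipy.special.digamma supplies ψ(·).

```python
import numpy as np
from scipy.special import digamma

def vbe_step(X, alpha, beta, m, W, nu):
    """Compute responsibilities gamma_j^n for a Bayesian Gaussian mixture."""
    N, p = X.shape
    M = len(alpha)
    E_lnP = digamma(alpha) - digamma(alpha.sum())          # E[ln P_j]
    log_rho = np.empty((N, M))
    for j in range(M):
        # E[ln |Lambda_j|] = sum_i psi((nu_j + 1 - i)/2) + p ln 2 + ln |W_j|
        E_lnLam = digamma(0.5 * (nu[j] + 1 - np.arange(1, p + 1))).sum() \
                  + p * np.log(2.0) + np.linalg.slogdet(W[j])[1]
        diff = X - m[j]
        # E[||x - mu_j||^2_{Lambda_j}] = p/beta_j + nu_j (x - m_j)^T W_j (x - m_j)
        maha = p / beta[j] + nu[j] * np.einsum('ni,ij,nj->n', diff, W[j], diff)
        log_rho[:, j] = E_lnP[j] + 0.5 * E_lnLam - 0.5 * p * np.log(2 * np.pi) - 0.5 * maha
    # gamma_j^n = rho_j^n / sum_k rho_k^n  (normalize in the log domain for stability)
    log_rho -= log_rho.max(axis=1, keepdims=True)
    gamma = np.exp(log_rho)
    gamma /= gamma.sum(axis=1, keepdims=True)
    return gamma
```

Alternating this VBE-step with the VBM-step sketch above (starting from a large M and a small α0) reproduces the cluster-pruning behavior illustrated on the next slide.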


VB Approximation of Gaussian Gamma

In VBEM, start with large M and a very small $\alpha_0 \ll 1$ (e.g., 0.001)

It automatically prunes clusters with very few members (“rich get

richer”)

In this example, we start with 6 clusters, but only 2 remain at the end

[Figure: VBEM on the Old Faithful data at iteration 94 — scatter plot of the clustered data and a bar chart of the expected number of points assigned to each of the 6 initial components; only 2 components retain appreciable mass]

mixGaussVbDemoFaithful from Murphy, Page 755


Bayesian Model Selection -1

Bayesian Model Selection (maximize the probability of a model given the data)

    $P(m \mid D) = \dfrac{p(D \mid m)\,p(m)}{\sum_{l \in \mathcal{M}} p(D \mid l)\,p(l)}$

    $\mathcal{M}$ = set of models (e.g., linear, quadratic discriminants); model $m$ is specified by parameter vector $\theta_m$, $m \in \mathcal{M}$

Bayes Factors for Comparing Models m and l

    $BF(m, l) = \dfrac{p(D \mid m)}{p(D \mid l)} = \dfrac{P(m \mid D)}{P(l \mid D)}\cdot\dfrac{p(l)}{p(m)} \approx \dfrac{e^{-\frac{1}{2}BIC(m)}}{e^{-\frac{1}{2}BIC(l)}}$

    $BF(m, l) > 1 \;\Rightarrow\;$ model m is preferred over model l

    BIC = Bayesian Information Criterion (using the negative log-likelihood)


Bayesian Model Selection -2

Bayesian Information Criterion (BIC)… minimize BIC

Akaike Information Criterion (AIC) …. Minimize AIC

    $BIC(m) = -2\ln p(D \mid \hat{\theta}_m) + dof(\hat{\theta}_m)\ln N$ .... Schwarz criterion

    Valid for regression, classification, and density estimation.

    For linear and quadratic classifiers (note: only $(C-1)$ free class probabilities):

       $dof(\hat{\theta}) = Cp + C$;  equal and spherical covariance
       $= Cp + p + C - 1$;  equal and diagonal (feature-dependent) covariance
       $= Cp + p(p+1)/2 + C - 1$;  equal and general covariance
       $= Cp + Cp(p+1)/2 + C - 1$;  unequal and general covariance

    BIC is closely related to Minimum Description Length (MDL)

    Adjusted BIC: replace $\ln N$ by $\ln[(N+2)/24]$

    $AIC(m) = -2\ln p(D \mid \hat{\theta}_m) + 2\,dof(\hat{\theta}_m)$

    Small N & Gaussian:  $AIC_c = AIC + \dfrac{2\,dof(\hat{\theta}_m)\big(dof(\hat{\theta}_m) + 1\big)}{N - dof(\hat{\theta}_m) - 1}$
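A small sketch of how BIC/AIC are typically computed from a fitted model's maximized log-likelihood; the helper names and the covariance-case labels are illustrative and follow the dof cases above.

```python
import numpy as np

def bic(loglik, dof, N):
    """BIC(m) = -2 ln p(D | theta_hat) + dof * ln N  (smaller is better)."""
    return -2.0 * loglik + dof * np.log(N)

def aic(loglik, dof, N=None, corrected=False):
    """AIC(m) = -2 ln p(D | theta_hat) + 2 dof; optional small-sample (AICc) correction."""
    a = -2.0 * loglik + 2.0 * dof
    if corrected:
        a += 2.0 * dof * (dof + 1) / (N - dof - 1)   # AICc for small N, Gaussian models
    return a

def dof_gaussian_classifier(C, p, covariance='unequal-general'):
    """Degrees of freedom for C-class, p-feature Gaussian classifiers (see slide)."""
    return {'equal-spherical':  C * p + C,
            'equal-diagonal':   C * p + p + C - 1,
            'equal-general':    C * p + p * (p + 1) // 2 + C - 1,
            'unequal-general':  C * p + C * p * (p + 1) // 2 + C - 1}[covariance]
```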


Binary Classification, BIC & AIC - 1

Class z = 0: N(0,1); zero mean …. Null hypothesis, H0

Class z = 1: N(µ,1); µ ≠ 0 …. Alternative hypothesis, H1

N scalar data points: $D = \{x_1, x_2, \ldots, x_N\}$

Under the null hypothesis, the sample mean $\bar{x} = \frac{1}{N}\sum_i x_i \sim N(0, \frac{1}{N})$, so $\sqrt{N}\,\bar{x} \sim N(0,1)$.

Classical test: if $P\big(\sqrt{N}\,|\bar{x}| > c_\alpha\big) = \alpha$ for a specified $\alpha$ (e.g., $c_\alpha \approx 2$ for $\alpha = 0.05$),
we decide H1 (µ ≠ 0) with confidence $1-\alpha$; $\alpha$ is the probability of falsely rejecting H0.

So the test statistic is:  decide H1 if $|\bar{x}| > \dfrac{c_\alpha}{\sqrt{N}}$


Binary Classification, BIC & AIC - 2

BIC

    $BIC(m) = -2\ln p(D \mid \hat{\theta}_m) + dof(\hat{\theta}_m)\ln N$

    $BIC(H_0) = \sum_{i=1}^{N} x_i^2$

    $BIC(H_1) = \sum_{i=1}^{N}(x_i - \bar{x})^2 + \ln N = \sum_{i=1}^{N} x_i^2 - N\bar{x}^2 + \ln N$

    So, $BIC(H_1) < BIC(H_0)$ if $|\bar{x}| > \sqrt{\dfrac{\ln N}{N}}$

AIC

    $AIC(m) = -2\ln p(D \mid \hat{\theta}_m) + 2\,dof(\hat{\theta}_m)$

    $AIC(H_0) = \sum_{i=1}^{N} x_i^2$

    $AIC(H_1) = \sum_{i=1}^{N}(x_i - \bar{x})^2 + 2 = \sum_{i=1}^{N} x_i^2 - N\bar{x}^2 + 2$

    So, $AIC(H_1) < AIC(H_0)$ if $|\bar{x}| > \sqrt{\dfrac{2}{N}}$

Similar to classical hypothesis testing with threshold $c_\alpha/\sqrt{N}$, but with a sample-number-dependent threshold (for BIC).
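A tiny numerical sketch of the two decision rules just derived, assuming the N scalar samples are in a NumPy array; it simply compares $|\bar{x}|$ with the BIC threshold $\sqrt{\ln N/N}$ and the AIC threshold $\sqrt{2/N}$.

```python
import numpy as np

def choose_hypothesis(x):
    """Pick between H0 (mu = 0) and H1 (mu != 0) for N(mu,1) data via BIC and AIC."""
    N = len(x)
    xbar = abs(x.mean())
    bic_prefers_H1 = xbar > np.sqrt(np.log(N) / N)   # BIC: |xbar| > sqrt(ln N / N)
    aic_prefers_H1 = xbar > np.sqrt(2.0 / N)         # AIC: |xbar| > sqrt(2 / N)
    return bic_prefers_H1, aic_prefers_H1

rng = np.random.default_rng(0)
print(choose_hypothesis(rng.normal(0.0, 1.0, size=1000)))  # typically (False, False) under H0
print(choose_hypothesis(rng.normal(0.3, 1.0, size=1000)))  # typically (True, True) under H1
```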


Holdout Method of Cross Validation

• Probability of error = misclassification rate

• Holdout Method: split the N samples into N − K training and K validation samples (typically K ≈ N/5)

    Error count = R out of K  ⇒  $\widehat{PE} = R/K$

    From the binomial distribution:  $\sigma_{\widehat{PE}} = \sqrt{\dfrac{PE(1-PE)}{K}}$

    To obtain an estimate of PE within 1%:  $0.01 \ge 2\sqrt{\dfrac{PE(1-PE)}{K}} \;\Rightarrow\; K \ge \dfrac{4\,PE(1-PE)}{10^{-4}} = 40000\,PE(1-PE)$

    If PE ≈ 0.05, K ≥ 40000(0.05)(0.95) ≈ 1900

    When PE ≈ ½, we need lots of samples…. 10,000

• We can also estimate the class-conditional error rate PE(z=k). Then  $PE = \sum_{k=1}^{C} P(z=k)\,PE(z=k)$

Are there better bounds?
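The sample-size numbers above follow directly from the binomial standard error; a minimal sketch (the 2-sigma tolerance of 1% is the slide's assumption):

```python
import numpy as np

def holdout_size(pe, tol=0.01, n_sigma=2):
    """Smallest K with n_sigma * sqrt(pe(1-pe)/K) <= tol."""
    return int(np.ceil(n_sigma**2 * pe * (1 - pe) / tol**2))

print(holdout_size(0.05))   # ~1900 validation samples when PE ~ 0.05
print(holdout_size(0.5))    # ~10000 validation samples when PE ~ 0.5
```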


• Suppose we have a nonnegative random variable x (discrete or continuous)

• Assume continuous WLOG

• Markov inequality

• Chebyshev inequality

Markov & Chebyshev Inequalities

    Markov:  $E(x) = \int_0^{\infty} x f(x)\,dx \ge \int_{\epsilon}^{\infty} x f(x)\,dx \ge \epsilon\int_{\epsilon}^{\infty} f(x)\,dx = \epsilon\,P(x \ge \epsilon)
    \;\Rightarrow\; P(x \ge \epsilon) \le \dfrac{E(x)}{\epsilon}$

    Chebyshev:  $P\big(|x - E(x)| \ge \epsilon\big) = P\big((x - E(x))^2 \ge \epsilon^2\big) \le \dfrac{E\big[(x - E(x))^2\big]}{\epsilon^2}$,
    so  $P\big(|x - E(x)| \ge \epsilon\big) \le \dfrac{\sigma^2}{\epsilon^2}$

    Example: $\bar{x}$ = sample mean of n numbers with mean m and variance $\sigma^2$:
    $P\big(|\bar{x} - m| \ge \epsilon\big) \le \dfrac{\sigma^2}{n\epsilon^2}$

    For a binary classifier with unknown probability of error P and sample error rate $S_n$:
    $P\big(|S_n - P| \ge \epsilon\big) \le \dfrac{P(1-P)}{n\epsilon^2} \le \dfrac{1}{4n\epsilon^2}$;
    with $n = 100$, $\epsilon = 0.2$, the bound is 0.0625 .... bound is loose
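A quick simulation sketch comparing the Chebyshev bound $1/(4n\epsilon^2)$ with the actual deviation probability of a sample error rate for the n = 100, ε = 0.2 case above; the true error rate of 0.3 is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, eps, p_err = 100, 0.2, 0.3                      # p_err is an illustrative true error rate
S = rng.binomial(n, p_err, size=100_000) / n       # sample error rates over many trials
empirical = np.mean(np.abs(S - p_err) >= eps)      # actual deviation probability
chebyshev = 1.0 / (4 * n * eps**2)                 # worst-case bound, independent of p_err
print(empirical, chebyshev)                        # ~1e-5 vs 0.0625 -- the bound is loose
```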


Hoeffding’s Inequality

    Let $Y = \sum_{i=1}^{n} X_i$;  $X_i \in [a_i, b_i]$;  $R_i = b_i - a_i$. Then

    $P\big\{|Y - E(Y)| \ge n\epsilon\big\} \le 2\,e^{-2n^2\epsilon^2/\sum_{i=1}^{n} R_i^2}$;  when $R_i = 1$,
    $P\big\{|Y - E(Y)| \ge n\epsilon\big\} \le 2\,e^{-2n\epsilon^2}$

    Proof sketch:

    Let $Z_i = X_i - E(X_i)$, so $E(Z_i) = 0$ and $Z_i \in [-R_i/2, R_i/2]$.

    $P\big\{|Y - E(Y)| \ge n\epsilon\big\} = P\big\{\big|\sum_i Z_i\big| \ge n\epsilon\big\}
    = P\big\{\sum_i Z_i \ge n\epsilon\big\} + P\big\{\sum_i Z_i \le -n\epsilon\big\}$; consider one tail with $t > 0$:

    $P\big\{t\sum_i Z_i \ge tn\epsilon\big\} = P\big\{e^{t\sum_i Z_i} \ge e^{tn\epsilon}\big\}
    \le e^{-tn\epsilon}\,E\big[e^{t\sum_i Z_i}\big] = e^{-tn\epsilon}\prod_{i=1}^{n} E\big[e^{tZ_i}\big]$  (Markov inequality)

    By convexity of $e^{tz}$ on $[-R_i/2, R_i/2]$ (Hoeffding's lemma, via Jensen's inequality):
    $E\big[e^{tZ_i}\big] \le \tfrac{1}{2}\big(e^{-tR_i/2} + e^{tR_i/2}\big) \le e^{t^2 R_i^2/8}$

    So  $P\big\{\sum_i Z_i \ge n\epsilon\big\} \le e^{-tn\epsilon + t^2\sum_i R_i^2/8}$

    The RHS is minimized when $t = 4n\epsilon/\sum_i R_i^2$, giving
    $P\big\{\sum_i Z_i \ge n\epsilon\big\} \le e^{-2n^2\epsilon^2/\sum_i R_i^2} = e^{-2n\epsilon^2}$ when $R_i = 1$;
    the other tail is identical, giving the factor 2.

    Hoeffding's lemma: https://en.wikipedia.org/wiki/Hoeffding%27s_lemma
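For the same n = 100, ε = 0.2 setting, Hoeffding's bound $2e^{-2n\epsilon^2}$ is far tighter than Chebyshev's; a one-line comparison sketch:

```python
import numpy as np

n, eps = 100, 0.2
print(1.0 / (4 * n * eps**2))          # Chebyshev bound: 0.0625
print(2 * np.exp(-2 * n * eps**2))     # Hoeffding bound: ~0.00067
```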


• Suppose we have classifiers/models M1, M2,…., MC

• Training Data, D & Validation Data, V

• Samples are assumed to be i.i.d.

Rationale behind Cross Validation

    Average validation log-likelihood of model m:  $\hat{l}_m = \dfrac{1}{|V|}\sum_{x_i \in V}\ln p(x_i \mid \hat{\theta}_m)$

    Since $\hat{\theta}_m$ does not depend on V,  $E_V[\hat{l}_m] = l_m$ (the true expected log-likelihood)

    By Hoeffding's inequality (union bound over the C models):

    $P\big\{\max_m |\hat{l}_m - l_m| > \epsilon\big\} = P\big\{\cup_m\, |\hat{l}_m - l_m| > \epsilon\big\}
    \le \sum_m P\big\{|\hat{l}_m - l_m| > \epsilon\big\} \le 2C\,e^{-2|V|\epsilon^2}$

    So if $\delta = 2C\,e^{-2|V|\epsilon^2}$, then  $\epsilon = \sqrt{\dfrac{\ln(2C/\delta)}{2|V|}}$

    "Confidence is cheap but accuracy is more expensive": $\delta$ and $C$ enter only through $\ln(2C/\delta)$, while accuracy improves only as $1/\sqrt{|V|}$

    $\max_m \hat{l}_m \ge \max_m l_m - 2\epsilon = \max_m l_m - O\Big(\sqrt{\dfrac{\ln C}{|V|}}\Big)$

    So, with probability of at least $(1-\delta)$, one chooses (nearly) the best model.
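The ε-δ trade-off above can be inverted to give a required validation-set size $|V| \ge \ln(2C/\delta)/(2\epsilon^2)$; a minimal sketch (assuming the per-sample log-likelihoods have range 1):

```python
import numpy as np

def validation_size(C, eps, delta):
    """|V| needed so that max_m |lhat_m - l_m| <= eps with probability >= 1 - delta."""
    return int(np.ceil(np.log(2 * C / delta) / (2 * eps**2)))

print(validation_size(C=10, eps=0.05, delta=0.05))    # ~1199 samples
print(validation_size(C=1000, eps=0.05, delta=0.05))  # ~2120 -- confidence over C models is cheap (log)
```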


• S-fold Cross validation:

S-fold Cross Validation

    [Figure: the N samples are split into S folds D1, D2, …, DS of size N/S; in run s the fold Ds is held out for validation and the remaining S − 1 folds are used for training]

    $PE = \dfrac{1}{N}\sum_{s=1}^{S}\sum_{i \in D_s} I\big(\hat{y}_s(x_i) \ne z_i\big)$;  $I(\cdot)$ = indicator function,  $\hat{y}_s$ = classifier trained without fold $D_s$

• S=N N-fold cross validation or Leave one-out CV (LOOCV) method

    $PE = \dfrac{1}{N}\sum_{i=1}^{N} I\big(\hat{y}_{-i}(x_i) \ne z_i\big)$;  $\hat{y}_{-i}(x) = f\big(x, D \setminus \{(x_i, z_i)\}\big)$

• Practical Scheme: 5x2 Cross-Validation. Variation: 5 repetitions of 2-

fold cross-validation on a randomized dataset

S is typically 5-10
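A library-free sketch of the S-fold estimate above; fit and predict are placeholders for any classifier's training and prediction routines (assumptions, not a specific API).

```python
import numpy as np

def s_fold_error(X, z, fit, predict, S=5, seed=0):
    """PE = (1/N) sum_s sum_{i in D_s} I(yhat_s(x_i) != z_i)."""
    N = len(z)
    idx = np.random.default_rng(seed).permutation(N)
    folds = np.array_split(idx, S)
    errors = 0
    for s in range(S):
        val = folds[s]                                    # held-out fold D_s
        trn = np.concatenate([folds[r] for r in range(S) if r != s])
        model = fit(X[trn], z[trn])                       # train on the other S-1 folds
        errors += np.sum(predict(model, X[val]) != z[val])
    return errors / N
```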


Summary of Validation and Model Selection Methods

[Figure panels: simple split; leave-one-out CV (S = N in S-fold CV); model selection; regularization and model selection]

Any combination of CV, model selection and regularization (hyper-parameter selection) is possible. See AMLbook.com.


Bootstrap Method of Validation

• Bootstrap Method:

    Sample from $D = \{x_1, x_2, \ldots, x_N\}$ with uniform probability (1/N); each $x_i$ is drawn independently with replacement

    Let b = bootstrap index, b = 1,2,…,B;  B = number of bootstrap samples (typically 50-200)

    Do b = 1,2,…,B

       Bootstrap sample $D_b = \{x^{(1)}, x^{(2)}, \ldots, x^{(N)}\}$  →  train  →  $PE_{training}(b)$

       Validation data: $A_b = D \setminus D_b$ (samples not in the bootstrap sample)  →  $PE_{val}(b)$

    End

    $\overline{PE} = 0.632\cdot\overline{PE}_{val}(b) + 0.368\cdot\overline{PE}_{training}(b)$ .... Efron estimator (averages taken over b)
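A sketch of the 0.632 estimator in the same style (fit/predict again placeholders); each bootstrap replicate trains on a resample of D and validates on the out-of-bag set A_b.

```python
import numpy as np

def bootstrap_632_error(X, z, fit, predict, B=100, seed=0):
    """PE = 0.632 * mean_b PE_val(b) + 0.368 * mean_b PE_train(b) (Efron estimator)."""
    rng = np.random.default_rng(seed)
    N = len(z)
    pe_train, pe_val = [], []
    for _ in range(B):
        boot = rng.integers(0, N, size=N)                  # sample with replacement
        oob = np.setdiff1d(np.arange(N), boot)             # A_b = D \ D_b
        model = fit(X[boot], z[boot])
        pe_train.append(np.mean(predict(model, X[boot]) != z[boot]))
        if len(oob):
            pe_val.append(np.mean(predict(model, X[oob]) != z[oob]))
    return 0.632 * np.mean(pe_val) + 0.368 * np.mean(pe_train)
```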


• Bootstrap Method (cont’d):

    $\overline{PE} = 0.632\cdot\overline{PE}_{val}(b) + 0.368\cdot\overline{PE}_{training}(b)$

    Why?  $P\{\text{observation appears in a bootstrap sample}\} = 1 - \big(1 - \tfrac{1}{N}\big)^N \to 1 - e^{-1} \approx 0.632$ as $N \to \infty$

    $P\{\text{observation} \in \text{validation samples}\} \approx 0.368$;  $P\{\text{observation} \in \text{training samples}\} \approx 0.632$

Confusion Matrix

• Confusion Matrix:

    $P = [P_{ij}]$;  $P_{ij} = P\{\text{decision } i \mid z = j\} \approx \dfrac{N_{ij}}{N_j}$;  $i = 0, 1, 2, \ldots, C$;  $j = 1, 2, \ldots, C$

    In words, just count errors from the validation set, bootstrap, etc. for class j and divide by the number of samples from class j


• Contingency table showing the differences between the true and predicted classes for a set of labeled examples

• The following metrics can be derived from the confusion matrix:

– PD (Sensitivity, Recall):  TP/(TP+FN)

– PF (1 − Specificity, False Positive Rate):  FP/(TN+FP)

– Positive Prediction Power (Precision):  TP/(TP+FP)

– Negative Prediction Power:  TN/(TN+FN)

– Correct Classification Rate (CCR):  (TP+TN)/N

– Misclassification Rate:  (FP+FN)/N

– Odds Ratio:  (TP·TN)/(FP·FN)

 (Rows: predicted outcome; columns: true class)

                        TRUE: Fault                      TRUE: No-Fault                       Total
 Positive Detection     Number of detected faults (TP)   Number of false alarms (FP)          Total number of positive detections
 Negative Detection     Number of missed faults (FN)     Number of correct rejections (TN)    Total number of negative detections
 Total                  Total number of faulty samples   Total number of fault-free samples   Total number of samples (N)

– Kappa

    $\kappa = \dfrac{CCR - \sum_{i=1}^{2} P_{row(i)}\,P_{col(i)}}{1 - \sum_{i=1}^{2} P_{row(i)}\,P_{col(i)}}$

    CCR = Correct Classification Rate;  $P_{row(i)}$ = fraction of entries in row i;  $P_{col(i)}$ = fraction of entries in column i

    Poor: κ < 0.4    Good: 0.4 < κ < 0.75    Excellent: κ > 0.75

Performance Metrics for Binary Classification - 1

No reject option here. When the entries are expressed as proportions (for computing CCR and κ), all four entries should sum to 1.


 Aircraft Engine Data (counts and column-normalized rates):

                        TRUE: Fault    TRUE: No-Fault
 Positive Detection     3023 (TP)      1518 (FP)
 Negative Detection     1977 (FN)      3482 (TN)
 Total                  5000           5000            (N = 10000)

 Normalized             0.605          0.304
                        0.395          0.696

 Metrics

    PD = 0.605        False Negative Rate = 0.395

    PF = 0.304 (False Positive Rate)

    Correct Classification Rate = 0.65      Misclassification Rate = 0.35

    Odds Ratio = 3.51        Kappa = 0.301 (Poor)

    Positive Prediction Power = 0.666       Negative Prediction Power = 0.638

    Prevalence = 0.5 (Priors)

    For binary classification with priors $P_1$ (fault) and $P_0$ (no-fault):

    $\kappa = \dfrac{2 P_1 P_0 (P_D - P_F)}{P_0\big(P_1 P_D + P_0 P_F\big) + P_1\big(P_1(1-P_D) + P_0(1-P_F)\big)} = (P_D - P_F)$  if $P_1 = P_0$

Performance Metrics for Binary Classification - 2

    Recall:  $R = P_D = 0.605$

    Precision:  $P = \dfrac{P_1 R}{P_1 R + (1 - P_1) P_F} = 0.666$  (with $P_F = 0.304$)

    F score:  $\dfrac{2PR}{P + R} = 0.634$
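All of the metrics on this slide can be reproduced directly from the four counts; a short sketch using the aircraft-engine numbers above:

```python
TP, FP, FN, TN = 3023, 1518, 1977, 3482
N = TP + FP + FN + TN

PD   = TP / (TP + FN)                    # sensitivity / recall = 0.605
PF   = FP / (FP + TN)                    # false-positive rate  = 0.304
PPP  = TP / (TP + FP)                    # precision            = 0.666
NPP  = TN / (TN + FN)                    # negative prediction power = 0.638
CCR  = (TP + TN) / N                     # correct classification rate = 0.65
F1   = 2 * PPP * PD / (PPP + PD)         # F score = 0.634
odds = (TP * TN) / (FP * FN)             # odds ratio = 3.51

# Cohen's kappa from proportions (rows = predicted, columns = true)
p_pred_pos, p_pred_neg = (TP + FP) / N, (FN + TN) / N
p_true_pos, p_true_neg = (TP + FN) / N, (FP + TN) / N
Pe = p_pred_pos * p_true_pos + p_pred_neg * p_true_neg
kappa = (CCR - Pe) / (1 - Pe)            # = 0.301 (poor)
print(PD, PF, PPP, NPP, CCR, F1, odds, kappa)
```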


• One-versus-One

– Generates C(C-1)/2

confusion matrices

• C1 vs. C2

• C1 vs. C3

• C2 vs. C3

• One-versus-All

– Generates C

confusion matrices

• C1 vs. C2 & C3

• C2 vs. C1 & C3

• C3 vs. C1 & C2

[Figure: the one-versus-one confusion matrices and the one-versus-all confusion matrices, summed to create a single 2x2 matrix]

Confusion Matrices for Multiple Classes - 1

Confusion Matrix for C Classes


• Some-versus-Rest

– Generates 2^(C−1) − 1 confusion matrices

– Both true and false classifications may be sums

– Here is a four class example

C1 vs. C2 & C3 & C4

C2 vs. C1 & C3 & C4

C1 & C2 vs. C3 & C4

C3 vs. C1 & C2 & C4

C1 & C3 vs. C2 & C4

C2 & C3 vs. C1 & C4

C1 & C2 & C3 vs. C4


Confusion Matrices for Multiple Classes - 2

Can form the basis for code book based classifiers


• Cobweb

– Illustrates the probability that

each class will be predicted

incorrectly (the off diagonal cells

of the confusion matrix)

– Shows relative performance

between classes for each

classifier

– High performance classifiers

have poor visibility in cobweb

– May be difficult to interpret for

high numbers of classes

• c(c-1)/2 rays

Cobweb


Other Metrics of Performance Assessment

• Fawcett's Extension

    – Sums the areas under the ROC curves for each class versus the rest, multiplied by the probability of that class:
      $AUC_F = \sum_{i=1}^{C} P(z=i)\,AUC(i, \text{rest})$;  AUC = area under the ROC curve

• Hand and Till Function

    – Averages the areas under the curves for each pair of classes:
      $HT = \dfrac{2}{C(C-1)}\sum_{i=1}^{C}\sum_{j<i} AUC(i, j)$

• Macro-Average Modified

    – Uses both the geometric mean and the average correct classification rate:
      $MAVG_{MOD} = 0.75\Big(\prod_{i=1}^{C} P_{ii}\Big)^{1/C} + 0.25\Big(\dfrac{1}{C}\sum_{i=1}^{C} P_{ii}\Big)$


• Comparing Classifiers

Suppose we have Classifiers A and B

nA= number of errors made by A but not by B

nB= number of errors made by B but not by A

McNemar’s Test: Check if

Comparing Classifiers

    $\dfrac{|n_A - n_B| - 1}{\sqrt{n_A + n_B}} \sim N(0,1)$  (approximately, under the null hypothesis that A and B have the same error rate)

Need | nA- nB|>5 for a significant difference

To detect 1% difference in error rates, need at least 500 samples
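A short sketch of the test above; n_A and n_B are the discordant error counts, and the statistic is compared with a standard-normal quantile (1.96 for a two-sided 5% test).

```python
import numpy as np
from scipy.stats import norm

def mcnemar(n_A, n_B, alpha=0.05):
    """Return the McNemar statistic and whether the difference is significant."""
    z = (abs(n_A - n_B) - 1) / np.sqrt(n_A + n_B)     # ~ N(0,1) under H0
    return z, z > norm.ppf(1 - alpha / 2)

print(mcnemar(30, 10))    # z ~ 3.0 -> significant difference (|n_A - n_B| = 20 > 5)
```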


References on Performance Assessment

1. Alpaydin, Ethem. Introduction to Machine Learning. MIT Press, Cambridge, 2004.

2. Kuncheva, Ludmila I. Combining Pattern Classifiers: Methods and Algorithms. John Wiley and Sons: Hoboken, NJ, 2004.

3. Ferri, C. "Volume Under the ROC Surface for Multi-class Problems." Univ. Politecnica de Valencia, Spain, 2003.

4. D. Mossman. "Three-way ROCs," Medical Decision Making, 19(1):78-89, 1999.

5. A. Patel and M. K. Markey, "Comparison of three-class classification performance metrics: a case study in breast cancer CAD," Medical Imaging 2005: Image Processing.

6. Ferri, C., Hernandez, J., and Salido, M. A., "Volume Under the ROC Surface for Multiclass Problems. Exact Computation and Evaluation of Approximations," Technical Report, Univ. Politec. Valencia, 2003. http://www.dsic.upv.es/users/elp/cferri/VUS.pdf.

7. Mooney, C.Z. and R.D. Duval, 1993, Bootstrapping: A Non-Parametric Approach to Statistical Inference. Newbury Park, CA: Sage Publications.

8. J. K. Martin and D. S. Hirschberg. "Small sample statistics for classification error rates, I: error rate measurements." Technical Report 96-21, Dept. of Information & Computer Science, University of California, Irvine, 1996.


Estimating Parameters of Densities From Data

Maximum Likelihood Methods

Bayesian Learning

Probability Density Estimation

Histogram Methods

Parzen Windows

Probabilistic Neural Network

k-nearest Neighbor Approach

Mixture Models

Estimate parameters via EM and VBEM algorithm

Various interpretations of EM

Performance Assessment of Classifiers

Summary