
Page 1: Lectures 4 and 5: ML and Bayesian Learning, Density Estimation, Gaussian Mixtures and EM and Variational Bayesian Inference, Performance Assessment of Classifiers

Prof. Krishna R. Pattipati
Dept. of Electrical and Computer Engineering
University of Connecticut
Contact: [email protected], (860) 486-2890

Fall 2018, October 1, 8 & 15, 2018

Copyright © 2001-2018 K.R. Pattipati

Page 2: Reading List

• Duda, Hart and Stork, Sections 3.1-3.6, Chapter 4
• Bishop, Section 2.5, Chapter 9, Sections 10.1, 10.2
• Murphy, Chapter 4, Chapter 11, Section 21.6
• Theodoridis, Chapter 7, Chapter 12

Page 3: Lecture Outline

• Estimating Parameters of Densities from Data
  – Maximum Likelihood Methods
  – Bayesian Learning
• Estimating Probability Densities (Nonparametric)
  – Histogram Methods
  – Parzen Windows
  – Probabilistic Neural Network
  – k-nearest Neighbor Approach
• Mixture Models and EM
• Performance Assessment of Classifiers

Page 4: Recall Bayesian Classifiers

• Need to know {P(z = j), p(x | z = j, θ_j)}. We usually estimate them from data.
• Parametric methods (assume a form for the density).
• Nonparametric methods: estimate the density via Parzen windows, PNN, RCE, NN, k-NN, ...
• Mixture models: mixtures of Gaussians, Multinoullis, Multinomials, Student-t, ...

Graphical model: a hidden categorical variable z, with parameters {μ_k, Σ_k}, generates the observed feature vector x.

Page 5: Estimating Parameters of Densities from Data

Estimating parameters of densities from data: for example, in the Gaussian case,

  P(z = k) = π_k,  Σ_{k=1}^C π_k = 1,  and class densities {p(x | z = k, θ_k)}_{k=1}^C with

  θ = {(μ_k, Σ_k)} (general case), or ({μ_k}, Σ) (hyperellipsoid case), or ({μ_k}, σ² I_p) (hypersphere case)

Data:  x_k^1, x_k^2, x_k^3, ..., x_k^{n_k};  k = 1, 2, 3, ..., C. Let n_k = number of samples from class k, Σ_{k=1}^C n_k = N.

Assuming the samples are independent,

  L(θ) = p(D | θ) = Π_{k=1}^C Π_{j=1}^{n_k} p(x_k^j | z = k, θ_k) P(z = k)

  l(θ) = ln L(θ) = ln p(D | θ) = Σ_{k=1}^C Σ_{j=1}^{n_k} ln p(x_k^j | z = k, θ_k) + Σ_{k=1}^C n_k ln π_k,   π_k ≡ P(z = k)

Page 6: Optimal ML Estimate of π_k

Optimal π_k:

  max_{π_k} Σ_{k=1}^C n_k ln π_k   s.t.  Σ_{k=1}^C π_k = 1

Lagrangian:  max_{π_k} [ Σ_{k=1}^C n_k ln π_k + λ(1 − Σ_{k=1}^C π_k) ]

  ∂/∂π_k:  n_k/π̂_k = λ  ⇒  π̂_k = n_k/λ;   Σ_{k=1}^C π̂_k = 1  ⇒  λ = Σ_k n_k = N

  ⇒  π̂_k = n_k / N     Fraction of samples of class k. Intuitively appealing.

Recall: ln x is concave, −ln x is convex.

What if some classes did not have samples in the training data? Zero count or sparse data or black swan problem. Then use

  π̂_k = (n_k + 1) / (N + C)     Laplace rule of succession or add-one smoothing
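As a concrete illustration of the two estimates above, here is a minimal NumPy sketch; the function name and the `smoothing` parameter are illustrative, not from the slides:

```python
import numpy as np

def class_priors(labels, C, smoothing=0.0):
    """ML estimate of class priors, with optional add-one (Laplace) smoothing.

    labels   : array of integer class labels in {0, ..., C-1}
    C        : number of classes
    smoothing: 0 gives pi_k = n_k / N; 1 gives (n_k + 1) / (N + C)
    """
    n_k = np.bincount(labels, minlength=C).astype(float)   # counts per class
    return (n_k + smoothing) / (labels.size + smoothing * C)

labels = np.array([0, 0, 1, 2, 2, 2])
print(class_priors(labels, C=4))                 # class 3 gets probability 0
print(class_priors(labels, C=4, smoothing=1.0))  # add-one smoothing avoids the zero count
```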

Page 7: Optimal μ — Hyperellipsoid Case

Optimal ML estimate of μ_k (when the covariance matrices of all classes are taken to be equal, i.e., Σ_i = Σ):

  l({μ_k}, Σ) = Σ_{k=1}^C Σ_{j=1}^{n_k} ln p(x_k^j | z = k, μ_k, Σ)
              = −(N/2) ln |Σ| − (1/2) Σ_{k=1}^C Σ_{j=1}^{n_k} (x_k^j − μ_k)^T Σ^{−1} (x_k^j − μ_k) + const.

  ∂l/∂μ_k = Σ_{j=1}^{n_k} Σ^{−1} (x_k^j − μ̂_k) = 0   ⇒   μ̂_k = (1/n_k) Σ_{j=1}^{n_k} x_k^j,   k = 1, 2, ..., C

The ML estimate of the mean is just the sample mean!

Aside:  ∂ tr(AᵀA)/∂A = 2A.

Page 8: Optimal Σ

We know the log-likelihood in terms of the precision Σ^{−1}:

  l({μ̂_k}, Σ) = (N/2) ln |Σ^{−1}| − (1/2) Σ_{k=1}^C Σ_{j=1}^{n_k} (x_k^j − μ̂_k)^T Σ^{−1} (x_k^j − μ̂_k) + const.

  ∂l/∂Σ^{−1} = (N/2) Σ̂ − (1/2) Σ_{k=1}^C Σ_{j=1}^{n_k} (x_k^j − μ̂_k)(x_k^j − μ̂_k)^T = 0

  ⇒   Σ̂ = (1/N) Σ_{k=1}^C Σ_{j=1}^{n_k} (x_k^j − μ̂_k)(x_k^j − μ̂_k)^T

The ML estimate of the covariance matrix is the arithmetic average of (x_k^j − μ̂_k)(x_k^j − μ̂_k)^T over all classes.

We can show that E[Σ̂] = ((N − C)/N) Σ, so use the unbiased estimate

  Σ̂ = (1/(N − C)) Σ_{k=1}^C Σ_{j=1}^{n_k} (x_k^j − μ̂_k)(x_k^j − μ̂_k)^T

Aside: if A is positive definite,  |A| = exp(ln |A|) = exp(tr ln A) = Π_{i=1}^n λ_i;   ln |A| = tr[ln A];   ∂ ln |A| / ∂A = A^{−1}.
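A small NumPy sketch of these estimators (class means and the pooled within-class covariance with the 1/(N − C) correction); the function and variable names are illustrative:

```python
import numpy as np

def ml_gaussian_estimates(X, y, C):
    """Class means and pooled (equal) covariance from labeled data.

    X: (N, p) feature matrix; y: (N,) integer labels in {0, ..., C-1}.
    Returns per-class sample means and the pooled covariance with the
    unbiased 1/(N - C) scaling discussed on this slide.
    """
    N, p = X.shape
    means = np.vstack([X[y == k].mean(axis=0) for k in range(C)])
    S = np.zeros((p, p))
    for k in range(C):
        D = X[y == k] - means[k]        # centered samples of class k
        S += D.T @ D                    # scatter matrix of class k
    return means, S / (N - C)
```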

Page 9: Proof of E[Σ̂] = ((N − C)/N) Σ

  Σ̂ = (1/N) Σ_{k=1}^C Σ_{j=1}^{n_k} (x_k^j − μ̂_k)(x_k^j − μ̂_k)^T,  where  n_k μ̂_k = Σ_{j=1}^{n_k} x_k^j

    = (1/N) Σ_{k=1}^C [ Σ_{j=1}^{n_k} x_k^j x_k^{jT} − n_k μ̂_k μ̂_k^T ]

Using E[x_k^j x_k^{jT}] = Σ + μ_k μ_k^T and E[μ̂_k μ̂_k^T] = Σ/n_k + μ_k μ_k^T,

  E[Σ̂] = (1/N) Σ_{k=1}^C { n_k (Σ + μ_k μ_k^T) − n_k (Σ/n_k + μ_k μ_k^T) } = (1/N) Σ_{k=1}^C (n_k − 1) Σ = ((N − C)/N) Σ

Page 10: Shrinkage Methods — Regularized Discriminant Analysis

• Linear discriminants may outperform quadratic discriminants when the training data set is small.
• Shrink the covariance matrices towards a common value ⇒ possibly biased, but results in a less variable estimator:

  Σ̂_k(α) = [ α n_k Σ̂_k + (1 − α) N Σ̂ ] / [ α n_k + (1 − α) N ]   (convex combination of Σ̂_k and Σ̂)

  Σ̂(γ) = γ Σ̂ + (1 − γ) σ̂² I,  0 ≤ γ ≤ 1,  with σ̂² = tr(Σ̂)/p   (σ̂² I = I if the variables are scaled to have zero mean and unit variance)

  Σ̂_k(α, γ) = (1 − γ) Σ̂_k(α) + γ [ tr(Σ̂_k(α))/p ] I

  Σ̂(γ) = γ Σ̂ + (1 − γ) diag(Σ̂)   — this has a Bayesian (MAP) interpretation

• Select α and γ to minimize the error rate via cross validation.
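A minimal sketch of the shrinkage idea above (convex combination toward the pooled covariance, then toward a scaled identity); the function name is illustrative:

```python
import numpy as np

def shrink_covariance(S_k, n_k, S_pooled, N, alpha, gamma):
    """Regularized class covariance in the spirit of the RDA combinations on this slide.

    S_k, S_pooled : class and pooled covariance estimates (p x p)
    n_k, N        : class and total sample counts
    alpha         : shrink the class covariance toward the pooled one
    gamma         : shrink further toward a scaled identity
    """
    p = S_k.shape[0]
    S_a = (alpha * n_k * S_k + (1 - alpha) * N * S_pooled) / (alpha * n_k + (1 - alpha) * N)
    return (1 - gamma) * S_a + gamma * (np.trace(S_a) / p) * np.eye(p)
```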

Page 11: ML-based Discriminants

  μ̂_k = (1/n_k) Σ_{j=1}^{n_k} x_k^j;   Σ̂_k = (1/n_k) Σ_{j=1}^{n_k} (x_k^j − μ̂_k)(x_k^j − μ̂_k)^T

ML-estimated best linear rule (C + Cp + p(p+1)/2 parameters), equal covariance case:

  ĵ = arg max_{k ∈ {1,2,...,C}} [ x^T Σ̂^{−1} μ̂_k − (1/2) μ̂_k^T Σ̂^{−1} μ̂_k + ln π̂_k ]

ML-estimated quadratic rule (C + Cp + Cp(p+1)/2 parameters), unequal covariance case:

  ĵ = arg max_{k ∈ {1,2,...,C}} [ −(1/2) ln |Σ̂_k| − (1/2)(x − μ̂_k)^T Σ̂_k^{−1} (x − μ̂_k) + ln π̂_k ]

As N → ∞, these approach the performance of the optimal Bayesian classifier. However, for finite N, a linear rule may outperform a quadratic one!

Note that we need n_k ≥ p for each class. If not, use shrinkage methods.
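A short sketch of the quadratic rule above; the helper name and inputs are illustrative:

```python
import numpy as np

def quadratic_discriminant(x, means, covs, priors):
    """ML-estimated quadratic rule: arg max_k of
    -0.5*ln|Sigma_k| - 0.5*(x - mu_k)^T Sigma_k^{-1} (x - mu_k) + ln pi_k."""
    scores = []
    for mu, S, pi in zip(means, covs, priors):
        d = x - mu
        _, logdet = np.linalg.slogdet(S)
        scores.append(-0.5 * logdet - 0.5 * d @ np.linalg.solve(S, d) + np.log(pi))
    return int(np.argmax(scores))
```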

Page 12: Recursive Estimation of Parameters - 1

What if we have "big streaming data"? Recursive estimation of the mean:

  μ̂_k^n = (1/n) Σ_{j=1}^n x_k^j   ⇒   μ̂_k^n = (1 − 1/n) μ̂_k^{n−1} + (1/n) x_k^n = μ̂_k^{n−1} + (1/n)(x_k^n − μ̂_k^{n−1})

• In general, we can use a stochastic approximation (SA) algorithm:

  θ̂_k^n = θ̂_k^{n−1} + η_n ∂ ln p(x_k^n | z = k, θ)/∂θ |_{θ̂_k^{n−1}},

  which converges to a root of  E[ ∂ ln p(x | z = k, θ)/∂θ ] = 0  as n → ∞.

Idea of stochastic approximation: let f(θ) = E[g(θ) | θ]; we seek a root θ* of f(θ) = u, assuming E[(g − f)² | θ] < ∞. Iterate

  θ_{n+1} = θ_n + η_n [ u − g(θ_n) ],   with  lim_{n→∞} η_n = 0,  Σ_{n=1}^∞ η_n = ∞,  Σ_{n=1}^∞ η_n² < ∞.

Step sizes such as η_n = K/n, K/(n+1), or K/n^m with 1/2 < m ≤ 1 satisfy the SA conditions.
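A tiny sketch of the streaming mean recursion above (class name is illustrative):

```python
import numpy as np

class RunningMean:
    """Recursive (streaming) mean: mu_n = mu_{n-1} + (1/n) * (x_n - mu_{n-1})."""
    def __init__(self, p):
        self.n = 0
        self.mu = np.zeros(p)

    def update(self, x):
        self.n += 1
        self.mu += (x - self.mu) / self.n
        return self.mu

rm = RunningMean(2)
for x in np.random.randn(1000, 2) + np.array([1.0, -2.0]):
    rm.update(x)
print(rm.mu)   # close to [1, -2]
```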

Page 13: Recursive Estimation of Parameters - 2

Unequal covariance case (class k), batch estimates:

  μ̂_k = (1/n_k) Σ_{j=1}^{n_k} x_k^j;   Σ̂_k = (1/n_k) Σ_{j=1}^{n_k} (x_k^j − μ̂_k)(x_k^j − μ̂_k)^T

Mean recursion after the n-th sample of class k:

  μ̂_k^n = μ̂_k^{n−1} + (1/n)(x_k^n − μ̂_k^{n−1}) = (1 − 1/n) μ̂_k^{n−1} + (1/n) x_k^n

Covariance recursion for class k (written for the (n−1)-normalized estimate):

  Σ̂_k^n = ((n − 2)/(n − 1)) Σ̂_k^{n−1} + (1/n)(x_k^n − μ̂_k^{n−1})(x_k^n − μ̂_k^{n−1})^T

which follows by expanding Σ_{j=1}^n (x_k^j − μ̂_k^n)(x_k^j − μ̂_k^n)^T using μ̂_k^n = μ̂_k^{n−1} + (1/n)(x_k^n − μ̂_k^{n−1}).
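A compact sketch of a joint streaming mean/covariance update for one class. It maintains the scatter matrix with the standard rank-one update, which is algebraically equivalent to the recursion on this slide; names are illustrative:

```python
import numpy as np

class RunningGaussian:
    """Streaming estimate of mean and covariance for one class.

    Maintains S_n = sum_j (x_j - mu_n)(x_j - mu_n)^T via
    S_n = S_{n-1} + ((n-1)/n) * d d^T with d = x_n - mu_{n-1}.
    """
    def __init__(self, p):
        self.n, self.mu, self.S = 0, np.zeros(p), np.zeros((p, p))

    def update(self, x):
        self.n += 1
        d = x - self.mu
        self.mu += d / self.n
        self.S += (self.n - 1) / self.n * np.outer(d, d)

    def cov(self, unbiased=True):
        return self.S / (self.n - 1 if unbiased else self.n)
```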

Page 14: Recursive Estimation of Parameters - 3

Covariance recursion in the hyperellipsoid (common covariance) case.

Update the means via

  μ̂_i^{n_i} = (1 − 1/n_i) μ̂_i^{n_i − 1} + (1/n_i) x_i^{n_i},   i = 1, 2, ..., C

For the covariance, suppose the n-th sample overall is from class l. With the pooled estimate Σ̂^n = (1/(n − C)) Σ_{k=1}^C Σ_j (x_k^j − μ̂_k)(x_k^j − μ̂_k)^T,

  Σ̂^n = ((n − 1 − C)/(n − C)) Σ̂^{n−1} + (1/(n − C)) ((n_l − 1)/n_l) (x_l^{n_l} − μ̂_l^{n_l − 1})(x_l^{n_l} − μ̂_l^{n_l − 1})^T

Page 15: Bayesian Learning

The posterior probability P(z = i | x) is the key. Have data:

  D_k = {x_k^1, x_k^2, ..., x_k^{n_k}},  k = 1, 2, 3, ..., C;   D = {D_1, D_2, ..., D_C}

We can only get P(z = k | x, D_k):

  P(z = k | x, D) = p(x | z = k, D_k) P(z = k) / Σ_{i=1}^C p(x | z = i, D_i) P(z = i)

Class-conditional density and posterior density of the parameters:

  p(x | z = k, D_k) = ∫ p(x | z = k, θ) p(θ | z = k, D_k) dθ

  p(θ | z = k, D_k) = p(D_k | z = k, θ) p(θ | z = k) / ∫ p(D_k | z = k, θ) p(θ | z = k) dθ
                    = Π_{j=1}^{n_k} p(x_k^j | z = k, θ) p(θ | z = k) / ∫ Π_{j=1}^{n_k} p(x_k^j | z = k, θ) p(θ | z = k) dθ

If p(D_k | z = k, θ) has a sharp peak at θ̂, then so does p(θ | z = k, D_k). These integrals are tough to compute unless we have a reproducing (conjugate) density; otherwise we need simulation.

Page 16: Illustration of Bayesian Learning - 1

Bayesian learning of the mean of a Gaussian distribution in one dimension.

• Strong prior (small variance, e.g., prior variance = 1.00): the posterior mean "shrinks" towards the prior mean.
• Weak prior (large variance, e.g., prior variance = 5.00): the posterior mean is similar to the MLE.

[Figure: prior, likelihood, and posterior curves for prior variance 1.00 and 5.00; gaussInferParamsMean1d from Murphy, Page 121.]

Page 17: Illustration of Bayesian Learning - 2

Bayesian learning of the mean of Gaussian distributions in one and two dimensions. As the number of samples increases, the posterior density peaks at the true value.

Page 18: Sensor Fusion

Bayesian sensor fusion appropriately combines measurements from multiple sensors based on the uncertainty of the sensor measurements. Larger uncertainty ⇒ less weight.

[Figure: sensorFusion2d from Murphy, Page 123. Three panels: (a) equally reliable sensors (R, G); (b) G more reliable than R; (c) R more reliable in y, G more reliable in x. The fused estimate is shown in black in each panel.]

Page 19: Recursive Bayesian Learning

Likelihood recursion: let D_k^{n_k} = {x_k^1, x_k^2, ..., x_k^{n_k−1}, x_k^{n_k}} = {D_k^{n_k−1}, x_k^{n_k}}. Then

  p(D_k^{n_k} | z = k, θ) = p(x_k^{n_k} | z = k, θ) p(D_k^{n_k−1} | z = k, θ)

Recursion for the posterior density:

  p(θ | z = k, D_k^{n_k}) = p(x_k^{n_k} | z = k, θ) p(θ | z = k, D_k^{n_k−1}) / ∫ p(x_k^{n_k} | z = k, θ) p(θ | z = k, D_k^{n_k−1}) dθ,
  where p(θ | z = k, D_k^0) = p(θ | z = k).

Problem: in general we need to store all training samples to calculate p(θ | z = k, D_k^{n_k}). For the exponential family (e.g., Gaussian, exponential, Rayleigh, Gamma, Beta, Poisson, Bernoulli, Binomial, Multinomial) we need only a few parameters, called sufficient statistics, to characterize p(θ | z = k, D_k^{n_k}).

Page 20: Recursive Learning of Mean and Covariance

We want to learn both the mean vector and the covariance. Use the Normal-Inverse-Wishart (NIW) conjugate prior:

  p(μ, Σ) = NIW(μ, Σ | m_0, κ_0, ν_0, S_0) = N(μ; m_0, (1/κ_0) Σ) · IW(Σ; S_0, ν_0)
           (Gaussian × inverse Wishart)

where IW(Σ; S_0, ν_0) ∝ |Σ|^{−(ν_0 + p + 1)/2} exp( −(1/2) tr(S_0 Σ^{−1}) ). The (inverse) Wishart is a multivariate generalization of the Gamma and chi-squared distributions.

Given D^n = {x^1, x^2, ..., x^n}, the posterior is again NIW:  p(μ, Σ | D^n) = NIW(μ, Σ | m_n, κ_n, ν_n, S_n), with

  κ_n = κ_0 + n;   ν_n = ν_0 + n;   m_n = (κ_0 m_0 + n x̄_n)/κ_n;
  S_n = S_0 + Σ_{i=1}^n x^i x^{iT} + κ_0 m_0 m_0^T − κ_n m_n m_n^T,   where x̄_n = (1/n) Σ_{i=1}^n x^i.

(HW: obtain a recursive expression for these updates.)

Page 21: Bayesian Generalization of Laplacian Smoothing

Recall the likelihood of the class labels:  p(D | {π_k}) ∝ Π_{k=1}^C π_k^{n_k},  Σ_{k=1}^C n_k = N.

Conjugate prior: the Dirichlet distribution,

  p(π | α_0) = Dir(π | α_0, ..., α_0) = [ Γ(C α_0) / Γ(α_0)^C ] Π_{k=1}^C π_k^{α_0 − 1}

Posterior:

  p(π | D, α_0) = p(D | π) p(π | α_0) / p(D) = Dir(π | n_1 + α_0, ..., n_C + α_0)

MAP estimate (mode) and conditional mean (MMSE estimate):

  π̂_k^MAP = (n_k + α_0 − 1) / (N + C(α_0 − 1));    π̂_k^MMSE = (n_k + α_0) / (N + C α_0)

Mode and mean are not the same here! (α_0 = 2 in the MAP estimate, or α_0 = 1 in the MMSE estimate, recovers add-one smoothing.)
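A short sketch of the two posterior estimates under a symmetric Dirichlet prior (function name is illustrative):

```python
import numpy as np

def dirichlet_posterior_estimates(counts, alpha0=1.0):
    """MAP (mode) and MMSE (posterior mean) estimates of the class priors
    under a symmetric Dirichlet(alpha0) prior."""
    counts = np.asarray(counts, dtype=float)
    N, C = counts.sum(), counts.size
    pi_map  = (counts + alpha0 - 1.0) / (N + C * (alpha0 - 1.0))
    pi_mmse = (counts + alpha0) / (N + C * alpha0)
    return pi_map, pi_mmse

print(dirichlet_posterior_estimates([3, 0, 1], alpha0=2.0))
```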

Page 22: Operations on Gaussians - 1

• Sum of Gaussian vectors is Gaussian:
  x_1 ~ N(μ_1, Σ_11), x_2 ~ N(μ_2, Σ_22), cov(x_1, x_2) = Σ_12
  ⇒ x_1 + x_2 ~ N(μ_1 + μ_2, Σ_11 + Σ_22 + Σ_12 + Σ_12^T)

• Linear transformations of Gaussians are Gaussian:
  x ~ N(μ, Σ)  ⇒  y = A x ~ N(Aμ, A Σ A^T)

• Marginal and conditional densities: let

  x = [x_1; x_2] ~ N(μ, Σ),  μ = [μ_1; μ_2],  Σ = [Σ_11 Σ_12; Σ_12^T Σ_22],  J = Σ^{−1} = [J_11 J_12; J_12^T J_22]

  Then the marginal and conditional densities are also Gaussian:

  p(x_1) = N(μ_1, Σ_11)
  p(x_1 | x_2) = N( μ_1 + Σ_12 Σ_22^{−1}(x_2 − μ_2), Σ_11 − Σ_12 Σ_22^{−1} Σ_12^T )
               = N( μ_1 − J_11^{−1} J_12 (x_2 − μ_2), J_11^{−1} )

  with J_11^{−1} = Σ_11 − Σ_12 Σ_22^{−1} Σ_12^T and J_11^{−1} J_12 = −Σ_12 Σ_22^{−1}.

This is what happens in Least Squares and Kalman filtering. If x_1 is a scalar (e.g., Gibbs sampling), the information form can save computation.

Page 23: Marginals & Conditionals of a 2D Gaussian

Example:  x = [x_1; x_2] ~ N(μ, Σ)  with  μ = 0  and  Σ = [ 1  0.8 ; 0.8  1 ].

Marginal and conditional densities:

  p(x_1) = N(μ_1, σ_1²) = N(0, 1)
  p(x_1 | x_2) = N( μ_1 + ρ(σ_1/σ_2)(x_2 − μ_2), σ_1²(1 − ρ²) ).   If x_2 = 1, then p(x_1 | x_2) = N(0.8, 0.36).

Information (precision) matrix:

  J = Σ^{−1} = [ 1/(σ_1²(1−ρ²))        −ρ/(σ_1 σ_2 (1−ρ²)) ;
                 −ρ/(σ_1 σ_2 (1−ρ²))   1/(σ_2²(1−ρ²)) ]  =  [ 2.778  −2.222 ; −2.222  2.778 ]

[Figure: contours of p(x_1, x_2), the marginal p(x_1), and the conditional p(x_1 | x_2 = 1); gaussCondition2Ddemo2 from Murphy, Page 112.]

Page 24: Operations on Gaussians - 2

• The covariance matrix captures marginal independencies between variables:
  Σ_ij = 0  ⇔  x_i and x_j are (marginally) independent, x_i ⊥ x_j.

• The information matrix J = Σ^{−1} captures conditional independencies:
  J_ij = 0  ⇔  x_i ⊥ x_j | {x_1, x_2, ..., x_n} \ {x_i, x_j}.
  Non-zero entries in J correspond to edges in the dependency network.

• The product of Gaussian PDFs is proportional to a Gaussian. In information form,

  p_i(x; μ_i, Σ_i) = ( |J_i|^{1/2} / (2π)^{p/2} ) exp{ −(1/2)(x − μ_i)^T J_i (x − μ_i) } ∝ exp( η_i^T x − (1/2) x^T J_i x ),
  with J_i = Σ_i^{−1} and η_i = J_i μ_i,

  so  Π_{i=1}^n p_i(x; μ_i, Σ_i) ∝ p(x; μ, Σ)  with  J = Σ_{i=1}^n J_i  and  η = J μ = Σ_{i=1}^n η_i.

  This is valid for division also (subtract the information terms).

Page 25: Sampling From a Multivariate Gaussian

  x ~ N(μ, Σ) = N(μ, J^{−1}),   J = Σ^{−1} (precision or information matrix)

• Method 1: eigendecomposition Σ = Q Λ Q^T; draw z ~ N(0, I); set x = μ + Q Λ^{1/2} z.
• Method 2: Cholesky decomposition Σ = L L^T; draw z ~ N(0, I); set x = μ + L z.
• Method 3: given the Cholesky decomposition of the precision matrix, J = B B^T; draw z ~ N(0, I); solve B^T y = z; set x = μ + y.
• Method 4: Gibbs sampler using the covariance matrix or the precision matrix. Partition x = [x_i; x_{−i}]; then

  p(x_i | x_{−i}) = N( μ_i + Σ_{i,−i} Σ_{−i,−i}^{−1}(x_{−i} − μ_{−i}),  Σ_{ii} − Σ_{i,−i} Σ_{−i,−i}^{−1} Σ_{−i,i} )
                  = N( μ_i − (1/J_ii) J_{i,−i}(x_{−i} − μ_{−i}),  1/J_ii )

Methods 1 and 2 require O(n³) computation. Method 3 can exploit sparsity of J. The information form of the Gibbs sampler does not require matrix inversion! Relation to Gauss-Seidel, successive over-relaxation, etc.: https://arxiv.org/pdf/1505.03512.pdf
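A minimal sketch of Method 2 (Cholesky sampling); names are illustrative:

```python
import numpy as np

def sample_mvn_cholesky(mu, Sigma, n_samples, rng=None):
    """Method 2 from the slide: x = mu + L z with Sigma = L L^T and z ~ N(0, I)."""
    rng = np.random.default_rng() if rng is None else rng
    L = np.linalg.cholesky(Sigma)
    z = rng.standard_normal((n_samples, len(mu)))
    return mu + z @ L.T

X = sample_mvn_cholesky(np.array([0.0, 0.0]),
                        np.array([[1.0, 0.8], [0.8, 1.0]]), 100000)
print(np.cov(X, rowvar=False))   # close to the specified covariance
```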

Page 26: ML versus Bayesian Learning

ML:
• Computationally simpler (calculus, optimization)
• Single best model; easier to interpret:  p(x | z = k, D_k) ≈ p(x | z = k, θ̂_k^ML)
• Does not use a priori information on θ

Bayesian:
• Complex numerical integration (unless a reproducing density is used)
• Weighted average of models:  p(x | z = k, D_k) = ∫ p(x | z = k, θ) p(θ | z = k, D_k) dθ
• Uses a priori information on θ

Page 27: Density Estimation: Histograms

• Assume x has already been standardized. To construct a histogram, divide the range into a set of B evenly spread bins. Then

  h(b) = K_b / N,  b = 1, 2, ..., B,  where K_b = number of points which fall inside bin b.

• The choice of B is crucial:
  – B large ⇒ the estimated density is spiky ("noisy")
  – B small ⇒ an over-smoothed density
  – There is an optimum choice for B.
• We can construct the histogram sequentially, considering the data one point at a time.

[Figure: histogram estimates with B = 3, 7, and 12 bins.]

Page 28: Major Problems with the Histogram Algorithm

• The estimated density is not smooth (it has discontinuities at the boundaries of the bins).
• In high dimensions, we need B^p bins (curse of dimensionality: a huge number of data points is required to estimate the density).
• In high dimensions:
  – The density is concentrated in a small part of the space.
  – Most of the bins will be empty ⇒ estimated density = 0 there.
  – As the number of dimensions grows, a thin shell of constant thickness on the interior of a sphere ends up containing almost all of its volume ⇒ most of the volume is near the surface!

Page 29: Kernel and K-Nearest Neighbor Methods - 1

General idea: suppose we want to find p(x), and suppose we draw N points from p(x).

Probability that a point falls in a region R:

  P = Prob(x ∈ R) = ∫_R p(x) dx,  0 ≤ P ≤ 1

Probability that exactly k of the N points fall in R (binomial):

  Prob(k of N fall in R) = C(N, k) P^k (1 − P)^{N−k} = B(k; N, P)

Expected number falling in region R:

  E[k] = Σ_{k=0}^N k C(N, k) P^k (1 − P)^{N−k} = N P Σ_{k=1}^N C(N−1, k−1) P^{k−1}(1 − P)^{N−k} = N P

Page 30: Kernel and K-Nearest Neighbor Methods - 2

Expected fraction of points falling in region R:

  E[k/N] = P   ......(1)

Variance of k/N:

  E[(k/N − P)²] = P(1 − P)/N

As N → ∞, the variance → 0, so k/N is a good estimate of P.

Also, if R is small enough that p(x) is roughly constant over it,

  P = ∫_R p(x) dx ≈ p(x) V   ......(2)    (V = volume of R)

So,  p(x) V ≈ k/N,  or  p(x) ≈ k/(N V)   ......(3)

Note that R should be large enough for (1) to give a reliable estimate; however, R should be small for (2) to hold ⇒ there is an optimal choice for R.

Page 31: Effect of N on Probability Estimates

We are trying to estimate P via k/N (and hence p(x) ≈ k/(N V)). As N increases, the estimate peaks at the true value of P (the distribution of k/N becomes a delta function).

Page 32: Kernel and K-Nearest Neighbor Methods - 3

There are basically two approaches to using Eq. (3) for density estimation:

• Fix V and determine k from the data ⇒ kernel-based estimation.
• Fix k and determine the corresponding volume V from the data ⇒ k-nearest neighbor approach.

Major disadvantage: both need all the data.

Both of these methods converge to the true density as N → ∞, provided
  – V → 0 and N V → ∞ as N → ∞ (kernel-based methods);
  – k → ∞ and k/N → 0 as N → ∞ (k-nearest neighbor methods).

Typically select V ∝ 1/√N for kernel-based methods and k ∝ √N for k-nearest neighbor methods.

Page 33: Selection of V and k

[Figure]

Page 34: Kernel Estimators - 1

• Suppose we take the region R to be a hypercube with sides of length h centered on the point x. Its volume is V = h^p.
• We can find an expression for k, the number of points which fall within this region, by defining a kernel function H(u), also known as the Parzen window:

  H(u) = 1  if |u_i| ≤ 1/2 for i = 1, 2, ..., p;   0 otherwise.

H(u) is a unit hypercube centered at the origin. Gaussian windows could be used as well.

For each data point x^j, the quantity H((x − x^j)/h) equals unity if x^j falls inside the hypercube of side h centered at x, and 0 otherwise. The total number of points falling inside the hypercube is

  k = Σ_{j=1}^N H((x − x^j)/h)

so the density estimate is

  p̂(x) = k/(N V) = (1/(N h^p)) Σ_{j=1}^N H((x − x^j)/h)

Page 35: Gaussian Parzen Windows

  H(x; h) = (1 / ((2π)^{p/2} h^p)) exp( −||x||² / (2h²) )

[Figure: two-dimensional, circularly symmetric normal Parzen windows for three different values of h.]

Page 36: Kernel Estimators - 2

  p̂(x) = k/(N V) = (1/(N h^p)) Σ_{j=1}^N H((x − x^j)/h)

With the hypercube kernel, p̂(x) is a superposition of cubes of side h, each cube centered on one of the data points. We can smooth out this estimate by choosing different forms for the kernel function H(u); example: Gaussian Parzen windows.

Expected value of the estimate:

  E[p̂(x)] = (1/h^p) E[ H((x − v)/h) ] = ∫ (1/h^p) H((x − v)/h) p(v) dv

This is the convolution of H and p: a blurred version of p(x) as seen through the window.

• Large N ⇒ good estimate as h → 0.
• Small N ⇒ need to select h properly; for small N, a small h gives a noisy estimate.

Page 37: Illustration of Density Estimation: Effect of h

[Figure: three Parzen-window density estimates based on the same set of five samples, using Gaussian Parzen window functions with different widths h.]

Page 38: Density Estimation: Effects of h and N

[Figure: Parzen-window estimates of a univariate normal density using different window widths and numbers of samples.]

Page 39: Bivariate Density Estimation

[Figure: Parzen-window estimates of a bivariate normal density using different window widths and numbers of samples.]

Page 40: Bimodal Density Estimation

[Figure: Parzen-window estimates of a bimodal distribution using different window widths and numbers of samples. Note that as n → ∞, the estimates coincide and match the true distribution.]

Page 41: Gaussian Windows: PNN

Volume of a hypersphere of radius σ in p dimensions:

  V_p(σ) = 2 σ^p π^{p/2} / (p Γ(p/2)),   where Γ(a) = ∫_0^∞ u^{a−1} e^{−u} du

Density estimate = sum of multivariate Gaussian distributions centered at each training sample:

  p̂(x) = (1/N) Σ_{j=1}^N (1/((2π)^{p/2} σ^p)) exp( −(x − x^j)^T(x − x^j) / (2σ²) )

(can also be done per class: p̂(x | z = k), with N → n_k and the sum over the class samples x_k^j)

Also called the "Probabilistic Neural Network (PNN)".

• Small σ ⇒ the density estimate will have discontinuities (spiky).
• Larger σ ⇒ a greater degree of interpolation (smoothness).
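A minimal sketch of the PNN class-conditional density and the resulting classifier (names are illustrative):

```python
import numpy as np

def pnn_class_density(x, class_samples, sigma):
    """Gaussian-window (PNN) estimate of p(x | class): an average of
    isotropic Gaussians of width sigma centered at the class samples."""
    n_k, p = class_samples.shape
    d2 = np.sum((class_samples - x) ** 2, axis=1)            # squared distances
    norm = (2.0 * np.pi) ** (p / 2) * sigma ** p
    return np.mean(np.exp(-d2 / (2.0 * sigma ** 2))) / norm

def pnn_classify(x, samples_by_class, priors, sigma):
    """Pick the class maximizing prior * estimated class-conditional density."""
    scores = [pi * pnn_class_density(x, S, sigma) for S, pi in zip(samples_by_class, priors)]
    return int(np.argmax(scores))
```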

Page 42: Probabilistic Neural Network - 1

• Note that each component of x has the same σ ⇒ must scale the x_i to have the same range.
• One way to compute the scale:

  σ̂_k² = (1/p) tr(Σ̂_k)  for each class k;    σ² = (1/N) Σ_{k=1}^C n_k σ̂_k²

• The PNN is fairly insensitive to σ over a wide range. Sometimes it is easier to experiment with σ than to compute it this way.

Page 43: Probabilistic Neural Network - 2

PNN structure (four layers):

• Input units: the p feature values x_1, x_2, ..., x_p.
• Pattern units: one unit per training sample (n_1, ..., n_C units for classes 1, ..., C).
• Summing units: one per class; each sums its class's pattern-unit outputs, scales by 1/((2π)^{p/2} σ^p n_k), and weights by the prior P(z = k).
• Output unit: pick the maximum over the classes P(z = 1), ..., P(z = C).

• Speed up the PNN by using cluster centers as representative patterns; cluster using K-means, LVQ, ...

Page 44: Probabilistic Neural Network - 2 (Pattern Units, Form 1)

One form of the pattern unit: the unit for training sample x_k^j (j = sample number, k = class, p = dimension of the pattern) receives x_1, ..., x_p and −(1/2)||x||², and computes

  z_k^j = x^T x_k^j − (1/2)||x||² − (1/2)||x_k^j||² = −(1/2)||x − x_k^j||² ≤ 0,

then outputs exp( z_k^j / σ² ).

Page 45: Probabilistic Neural Network - 3 (Pattern Units, Form 2)

Second form of the pattern unit: compute the componentwise differences x_i − x_{ik}^j, square and sum them,

  z_k^j = ||x − x_k^j||²,

then output exp( −z_k^j / (2σ²) ).   (j = sample number, k = class, p = dimension of the pattern.)

Page 46: Alternative Forms of Kernels

Alternate forms of the kernel H(u) for the product estimator p̂(x) = (1/N) Σ_{j=1}^N Π_{i=1}^p (1/h) H((x_i − x_i^j)/h):

• Exponential kernel based on the Manhattan distance (1-norm):  H(u) ∝ exp(−|u|)
• Cauchy distribution:  H(u) ∝ 1/(1 + u²)
• Sinc kernel:  H(u) ∝ sinc²(u/2),  where sinc(x) = sin(x)/x
• Tri-cube kernel:  H(u) = (70/81)(1 − |u|³)³,  |u| ≤ 1
• Epanechnikov kernel:  H(u) = (3/4)(1 − u²),  |u| ≤ 1

Page 47: K-Means (K-Centers) Algorithm

• Decide on K, the number of clusters, in advance.
• Suppose there are N data points D = {x^1, x^2, ..., x^N}. We want to find K representative vectors μ_1, μ_2, ..., μ_K ("cluster centers", "codebook vectors").
• The K-means algorithm seeks to partition the data into K disjoint subsets {C_j}, containing {n_j} points, so as to minimize the sum-of-squares clustering function

  J = Σ_{j=1}^K Σ_{i ∈ C_j} ||x^i − μ_j||²

• Clearly, if the {C_j} are known, then the optimal μ_j is the mean of cluster j:  μ_j = (1/n_j) Σ_{i ∈ C_j} x^i.

Page 48: Batch Version of the K-Means Algorithm (Unsupervised Learning)

1) Start with any K random feature vectors as the K centers. Alternately, assign the N points to K sets randomly and compute their means; these means serve as the initial centers.
2) For i = 1, 2, ..., N: assign pattern i to cluster C_j if  j = arg min_{1≤k≤K} ||x^i − μ_k||².
3) Recompute the means via  μ_j = (1/n_j) Σ_{i ∈ C_j} x^i.
4) If the centers have changed, go to step 2; else stop.

Covariance of each cluster (if needed):  Σ̂_j = (1/(n_j − 1)) Σ_{i ∈ C_j} (x^i − μ_j)(x^i − μ_j)^T, etc.
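A compact NumPy sketch of the batch loop above (random-sample initialization; names are illustrative):

```python
import numpy as np

def kmeans(X, K, n_iter=100, rng=None):
    """Batch K-means: alternate assignment (step 2) and mean update (step 3)."""
    rng = np.random.default_rng() if rng is None else rng
    centers = X[rng.choice(len(X), size=K, replace=False)]    # step 1
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)                            # step 2
        new_centers = np.vstack([X[labels == j].mean(axis=0) if np.any(labels == j)
                                 else centers[j] for j in range(K)])
        if np.allclose(new_centers, centers):                 # step 4
            break
        centers = new_centers                                 # step 3
    return centers, labels
```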

Page 49: Initialization (k-means++)

Initialization (recall David Arthur and Sergei Vassilvitskii, "k-means++: The Advantages of Careful Seeding", Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms):

a. Choose the initial center at random: let x^{n_1} be that data point, μ_1 = x^{n_1}.
b. For k = 2, ..., K:
     For n = 1, 2, ..., N with n ≠ n_i, i = 1, 2, ..., k−1:
       D(x^n) = min_{1≤i≤k−1} ||x^n − μ_i||²   (a variant uses exp(−min_{1≤i≤k−1} ||x^n − μ_i||²))
     End
     Select the next center x^{n_k} probabilistically, with probability  p(x^n) = D(x^n) / Σ_{n'} D(x^{n'})
     Store n_k; set μ_k = x^{n_k}
   End
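A minimal sketch of the k-means++ seeding step above (names are illustrative):

```python
import numpy as np

def kmeans_pp_init(X, K, rng=None):
    """k-means++ seeding: each new center is drawn with probability
    proportional to the squared distance to the nearest chosen center."""
    rng = np.random.default_rng() if rng is None else rng
    centers = [X[rng.integers(len(X))]]                       # step a
    for _ in range(1, K):                                     # step b
        d2 = np.min(((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(axis=2), axis=1)
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)
```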

Page 50: Selection of K

• Pick K to minimize the sum of the training error and a model complexity term; the sum is called the prediction error:

  PE(K) = (1/N) Σ_{j=1}^K Σ_{n ∈ C_j} ||x^n − μ_j||² + 2 σ² K p / N,

  where σ² is the variance of the noise in the data.

• Kurtosis-based measure:
  – Kurtosis for a normalized Gaussian: E[x⁴] = 3σ⁴, so the excess kurtosis E[x⁴]/σ⁴ − 3 = 0.
  – Find the excess kurtosis for each cluster and coordinate:  K_{ji} = (1/n_j) Σ_{n ∈ C_j} (x_i^n − μ_{ji})⁴ / σ_{ji}⁴ − 3,  i = 1, 2, ..., p
  – Aggregate:  K_j = (1/p) Σ_{i=1}^p |K_{ji}|;   K_T = (1/K) Σ_{j=1}^K K_j = (1/(Kp)) Σ_{j=1}^K Σ_{i=1}^p |K_{ji}|
  – Plot K_T vs. K and pick the K that gives the minimum K_T.
  – Can also use this idea in a dynamic cluster-splitting scheme (see Vlassis and Likas, IEEE Trans. SMC-A, July 1999, pp. 393-399).

Page 51: Online Version of K-Means

• Start with K randomly chosen centers {μ_j}_{j=1}^K.
• For each data point x^i, update the nearest center μ_j via

  μ_j^new = μ_j^old + η (x^i − μ_j^old)

  and leave all the others the same.
• Can be used for dynamic (streaming) data. For static data, we need to go through the data multiple times!
• This is vector quantization via stochastic approximation.

Page 52: Supervised Algorithm (Learning Vector Quantization)

Supervised algorithm (Learning Vector Quantization — we know the class labels!):

• Start with K codebook vectors.
• For each data point x^i, find the closest codebook vector (center) μ_j.

  μ_j^new = μ_j^old + η (x^i − μ_j^old)   if x^i is classified correctly;
  μ_j^new = μ_j^old − η (x^i − μ_j^old)   if x^i is classified incorrectly.

• It is a version of reinforcement learning. We will take up variants of the algorithm in Lecture 8.

Page 53: k-Nearest Neighbor Approach

• Fix k and allow the volume V to vary:

  p̂(x) = k / (N V),  where V is the volume of a hypersphere centered at point x that contains exactly k points.

  Again, small k ⇒ noisy estimate; large k ⇒ smooth estimate.

• The major use of the k-nearest neighbor technique is not in probability estimation but in classification. Assign a point to class i if class i has the largest number of points among the k nearest neighbors ⇒ MAP rule:

  p̂(x | z = i) = k_i / (n_i V);   P̂(z = i) = n_i / N;   p̂(x) = k / (N V)

  P̂(z = i | x) = p̂(x | z = i) P̂(z = i) / p̂(x) = k_i / k
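A minimal sketch of the k-NN MAP rule above (names are illustrative):

```python
import numpy as np
from collections import Counter

def knn_classify(x, X_train, y_train, k):
    """k-NN / MAP rule from the slide: among the k nearest training points,
    pick the class with the largest count k_i (P(z=i|x) ~ k_i/k)."""
    d2 = np.sum((X_train - x) ** 2, axis=1)
    nearest = np.argsort(d2)[:k]
    return Counter(y_train[nearest]).most_common(1)[0][0]

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 3.0])
y = np.array([0] * 50 + [1] * 50)
print(knn_classify(np.array([2.5, 2.5]), X, y, k=5))
```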

Page 54: 1-NN Classifier

The 1-nearest neighbor classifier has an interesting geometric interpretation:

• Data points become corner points of triangles spanned by each reference point and two of its neighbors. The network of triangles is called the Delaunay triangulation (or Delaunay net).
• Perpendicular bisectors of the triangles' edges form the borders of the polygonal patches delimiting the local neighborhood of each point. The resulting network of polygons is called the Voronoi diagram (or Voronoi net).
• The 1-NN decision boundary follows sections of the Voronoi net.
• 1-NN and k-NN adapt to any data: high variability and low bias.

[Figure: 1-NN decision boundary in the (x_1, x_2) plane following the Voronoi cells of the training points.]

Page 55: Why Gaussian Mixtures? (Gaussian Mixtures, Maximum Likelihood and the EM Algorithm - 1)

• Parametric ⇒ fast but limited.
• Nonparametric ⇒ general but slow (requires a lot of data).
• Mixture models:

  p(x) = Σ_{j=1}^M p(x | j) P_j,   with P_j ≥ 0, Σ_{j=1}^M P_j = 1, and ∫ p(x | j) dx = 1

• A similar technique applies to each class-conditional density p(x | z = k), k = 1, 2, ..., C.
• Related: conditional density estimation (function approximation), RBF networks, mixture-of-experts models.

Page 56: Gaussian Mixtures, Maximum Likelihood and the EM Algorithm - 2

Mixture of Gaussians (mixture network: component densities p(x|1), ..., p(x|M) weighted by P_1, ..., P_M):

  p(x | j) = N(μ_j, Σ_j),  typically N(μ_j, σ_j² I):   p(x | j) = (1/(2πσ_j²)^{p/2}) exp( −||x − μ_j||² / (2σ_j²) )

Problem: given data D = {x^1, x^2, ..., x^N}, find the ML estimates of {P_j, μ_j, σ_j}_{j=1}^M. Let θ = {P_j, μ_j, σ_j}. Then

  max_θ L = p(D | θ)   ⟺   max_θ l = ln p(D | θ)   ⟺   min_θ J = −ln p(D | θ)

Page 57: Gaussian Mixtures, Maximum Likelihood and the EM Algorithm - 3

  J = −Σ_{i=1}^N ln p(x^i | θ) = −Σ_{i=1}^N ln [ Σ_{j=1}^M p(x^i | j) P_j ]

Gradient with respect to μ_j:

  ∂J/∂μ_j = −Σ_{i=1}^N [ P_j ∂p(x^i | j)/∂μ_j ] / [ Σ_{k=1}^M p(x^i | k) P_k ]

With p(x^i | j) = (1/(2πσ_j²)^{p/2}) exp(−||x^i − μ_j||²/(2σ_j²)), we have ∂p(x^i | j)/∂μ_j = p(x^i | j)(x^i − μ_j)/σ_j², so

  ∂J/∂μ_j = −Σ_{i=1}^N P(j | x^i) (x^i − μ_j)/σ_j²   ......(1)

where the posterior ("responsibility") is

  P(j | x^i) = p(x^i | j) P_j / Σ_{k=1}^M p(x^i | k) P_k

Note the simplicity of the gradient. For the mixing weights, use a Lagrangian:

  min_θ J   s.t.  Σ_{j=1}^M P_j = 1,  0 ≤ P_j ≤ 1;    Lagrangian:  l = J + λ(Σ_{j=1}^M P_j − 1)

Page 58: Gaussian Mixtures, Maximum Likelihood and the EM Algorithm - 4

Gradient with respect to σ_j (p = dimension of the feature vector):

  ∂J/∂σ_j = −Σ_{i=1}^N [ P_j ∂p(x^i | j)/∂σ_j ] / [ Σ_{k=1}^M p(x^i | k) P_k ]
          = −Σ_{i=1}^N P(j | x^i) [ ||x^i − μ_j||²/σ_j³ − p/σ_j ]   ......(2)

Gradient of the Lagrangian with respect to P_j:

  ∂l/∂P_j = −Σ_{i=1}^N p(x^i | j) / [ Σ_{k=1}^M p(x^i | k) P_k ] + λ = −(1/P_j) Σ_{i=1}^N P(j | x^i) + λ = 0

  ⇒  Σ_{i=1}^N P(j | x^i) = λ P_j;  summing over j gives λ = N,  so  P_j = (1/N) Σ_{i=1}^N P(j | x^i)   ......(3)

Page 59: Gaussian Mixtures, Maximum Likelihood and the EM Algorithm - 5

Necessary conditions of optimality: set the gradients equal to zero.

From (1):   μ̂_j = Σ_{i=1}^N P(j | x^i) x^i / Σ_{i=1}^N P(j | x^i)

From (2):   σ̂_j² = (1/p) Σ_{i=1}^N P(j | x^i) ||x^i − μ̂_j||² / Σ_{i=1}^N P(j | x^i)

From (3), noting that Σ_{j=1}^M P(j | x^i) = 1 and that there are N data points:

  P̂_j = (1/N) Σ_{i=1}^N P(j | x^i)     (P(j | x^i) is the "responsibility" of component j for x^i)

These are coupled nonlinear equations. General (full covariance) case:

  Σ̂_j = Σ_{i=1}^N P(j | x^i)(x^i − μ̂_j)(x^i − μ̂_j)^T / Σ_{i=1}^N P(j | x^i)

Page 60: Methods of Solution: Nonlinear Programming (NLP) Techniques

General iteration:   θ^{k+1} = θ^k − η_k H_k ∇J(θ^k),   0 < η_k ≤ 1

• H = I ⇒ steepest descent (SD) or gradient method
• H = [∇²J]^{−1} ⇒ Newton's method
• H = [∇²J + λI]^{−1} ⇒ Levenberg-Marquardt method
• H = [J_r^T J_r + λI]^{−1} (J_r = Jacobian of the residuals) ⇒ Levenberg-Marquardt version of the Gauss-Newton method
• Various versions of quasi-Newton methods
• Various versions of the conjugate gradient method

It is often best to compute the Hessian using a finite-difference method.

Page 61: EM Algorithm

EM algorithm for the Gaussian mixture (how we get these equations, and why, comes later):

M-step (by setting the gradients to zero):

  μ_j^new = Σ_{i=1}^N P̂^old(j | x^i) x^i / Σ_{i=1}^N P̂^old(j | x^i)

  (σ_j^new)² = (1/p) Σ_{i=1}^N P̂^old(j | x^i) ||x^i − μ_j^new||² / Σ_{i=1}^N P̂^old(j | x^i)

  P_j^new = (1/N) Σ_{i=1}^N P̂^old(j | x^i)

E-step (evaluating the posterior probabilities / responsibilities):

  P̂^new(j | x^i) = p(x^i | j, θ̂^new) P_j^new / Σ_{m=1}^M p(x^i | m, θ̂^new) P_m^new

This alternation is a Gauss-Seidel view of EM.

Page 62: Sequential Estimation - 1 (Sequential Estimation ⇒ Stochastic Approximation)

  μ̂_j^{n+1} = Σ_{i=1}^{n+1} P(j | x^i) x^i / Σ_{i=1}^{n+1} P(j | x^i)
            = [ Σ_{i=1}^n P(j | x^i) μ̂_j^n + P(j | x^{n+1}) x^{n+1} ] / Σ_{i=1}^{n+1} P(j | x^i)

so the mean can be updated recursively:

  μ̂_j^{n+1} = μ̂_j^n + [ P(j | x^{n+1}) / Σ_{i=1}^{n+1} P(j | x^i) ] (x^{n+1} − μ̂_j^n)

Note:

  Σ_{i=1}^{n+1} P(j | x^i) = Σ_{i=1}^n P(j | x^i) [ 1 + P(j | x^{n+1}) / Σ_{i=1}^n P(j | x^i) ] = (n+1) P̂_j^{n+1}

Page 63: Sequential Estimation ⇒ Stochastic Approximation - 2

• Sometimes we replace Σ_{i=1}^{n+1} P(j | x^i) by (n+1) P̂_j^{n+1}, i.e.,

  μ̂_j^{n+1} = μ̂_j^n + [ P(j | x^{n+1}) / ((n+1) P̂_j^{n+1}) ] (x^{n+1} − μ̂_j^n)

• Similarly, for the variance, with

  σ̂_j^{2,n} = (1/p) Σ_{i=1}^n P(j | x^i) ||x^i − μ̂_j^n||² / Σ_{i=1}^n P(j | x^i),

  expand ||x^i − μ̂_j^{n+1}||² = ||x^i − μ̂_j^n||² − 2(x^i − μ̂_j^n)^T(μ̂_j^{n+1} − μ̂_j^n) + ||μ̂_j^{n+1} − μ̂_j^n||² and use Σ_{i=1}^n P(j | x^i)(x^i − μ̂_j^n) = 0 to obtain the recursion

  σ̂_j^{2,n+1} = [ Σ_{i=1}^n P(j | x^i) / Σ_{i=1}^{n+1} P(j | x^i) ] σ̂_j^{2,n}
              + [ P(j | x^{n+1}) / Σ_{i=1}^{n+1} P(j | x^i) ] (1/p) ||x^{n+1} − μ̂_j^n||²
              − (1/p) ||μ̂_j^{n+1} − μ̂_j^n||²

Page 64: Sequential Estimation ⇒ Stochastic Approximation - 3

• Similarly, writing Σ_{i=1}^{n+1} P(j | x^i) = (n+1) P̂_j^{n+1}, the variance recursion becomes

  σ̂_j^{2,n+1} = σ̂_j^{2,n} + [ P(j | x^{n+1}) / ((n+1) P̂_j^{n+1}) ] [ (1/p) ||x^{n+1} − μ̂_j^n||² − σ̂_j^{2,n} ] − (1/p) ||μ̂_j^{n+1} − μ̂_j^n||²

  Recall that μ̂_j^{n+1} − μ̂_j^n = [ P(j | x^{n+1}) / ((n+1) P̂_j^{n+1}) ] (x^{n+1} − μ̂_j^n).

• The mixing weights are updated via

  P̂_j^{n+1} = P̂_j^n + (1/(n+1)) [ P(j | x^{n+1}) − P̂_j^n ]

Page 65: EM Algorithm for the Gaussian Mixture Problem - 1

Key ideas of EM as applied to the Gaussian mixture problem. With

  J(θ) = −Σ_{i=1}^N ln p(x^i | θ) = −Σ_{i=1}^N ln [ Σ_{j=1}^M p(x^i | j) P_j ],

  J^new − J^old = −Σ_{i=1}^N ln [ p^new(x^i) / p^old(x^i) ]
               = −Σ_{i=1}^N ln [ Σ_{j=1}^M P^old(j | x^i) · ( P_j^new p^new(x^i | j) / ( P^old(j | x^i) p^old(x^i) ) ) ]

Idea (bound decomposition): with x the data, z the hidden variables (mixture labels), θ the parameters, and q(z) any arbitrary distribution, use ln p(x, z | θ) = ln p(z | x, θ) + ln p(x | θ) to write

  ln p(x | θ) = ln L(q, θ) + KL(q(z) || p(z | x, θ)),
  ln L(q, θ) = E_q[ ln ( p(x, z | θ) / q(z) ) ],   KL(q || p) = −E_q[ ln ( p(z | x, θ) / q(z) ) ] ≥ 0

so  J = −ln p(x | θ) = −ln L(q, θ) − KL(q(z) || p(z | x, θ))  ≤  −ln L(q, θ).

• E-step: set q(z) = p(z | x, θ^old)  ⇒  KL = 0 and the bound −ln L(q, θ^old) touches J(θ^old).
• M-step: minimize the bound,  min_θ [ −ln L(q, θ) ] = min_θ E_q[ −ln p(x, z | θ) ] = min_θ Q(θ, θ^old).

Note:  −ln L(q, θ) = Q(θ, θ^old) − H(q(z)),  where Q(θ, θ^old) = −E_q[ln p(x, z | θ)] and H is the entropy of q.

[Figure: decomposition of J = −ln p(x | θ) into −ln L(q, θ) (= Q(θ, θ^old) up to the entropy term) and KL(q || p).]

Page 66: EM Algorithm for the Gaussian Mixture Problem - 2

For the convex function −ln(·), Jensen's inequality gives  −ln( Σ_i λ_i x_i ) ≤ −Σ_i λ_i ln x_i  where Σ_i λ_i = 1, λ_i ≥ 0. Applying it with λ_j = P^old(j | x^i):

  J^new − J^old ≤ −Σ_{i=1}^N Σ_{j=1}^M P^old(j | x^i) ln [ P_j^new p^new(x^i | j) / ( P^old(j | x^i) p^old(x^i) ) ]

so, dropping terms that do not depend on θ^new,

  J^new ≤ Q(θ^new, θ^old) + const,   Q(θ^new, θ^old) = −Σ_{i=1}^N Σ_{j=1}^M P^old(j | x^i) ln [ P_j^new p^new(x^i | j) ]   (= −ln L(q, θ^new) up to the entropy of q)

Minimizing Q(θ, θ^old) therefore leads to a decrease in J(θ).

Note: at θ = θ^old the bound is tight, J(θ^old) = −ln L(q, θ^old) (KL is forced to 0); J(θ^new) ≤ −ln L(q, θ^new) since KL ≥ 0; and the bound and J(θ) have the same gradient at θ^old.

[Figure: J(θ) and the auxiliary upper bounds Q(θ, θ^old), Q(θ, θ^new) touching J at θ^old and θ^new.]

Page 67: EM Algorithm for the Gaussian Mixture Problem - 3

• Optimization problem (M-step):

  min  Q(θ^new, θ^old) = −Σ_{i=1}^N Σ_{j=1}^M P^old(j | x^i) ln [ P_j^new p^new(x^i | j) ]

  s.t.  Σ_{j=1}^M P_j^new = 1;   0 ≤ P_j^new ≤ 1,  j = 1, 2, ..., M

• For Gaussian conditional probability density functions, dropping terms that depend only on the old parameters, we get

  Q = −Σ_{i=1}^N Σ_{j=1}^M P^old(j | x^i) [ ln P_j^new − p ln σ_j^new − ||x^i − μ_j^new||² / (2 (σ_j^new)²) ]

Page 68: EM Algorithm for the Gaussian Mixture Problem - 4

Minimizing Q gives the M-step updates:

  μ_j^new = Σ_{i=1}^N P^old(j | x^i) x^i / Σ_{i=1}^N P^old(j | x^i)

  (σ_j^new)² = (1/p) Σ_{i=1}^N P^old(j | x^i) ||x^i − μ_j^new||² / Σ_{i=1}^N P^old(j | x^i)

  P_j^new = (1/N) Σ_{i=1}^N P^old(j | x^i)

General (full covariance) case:

  Σ_j^new = Σ_{i=1}^N P^old(j | x^i)(x^i − μ_j^new)(x^i − μ_j^new)^T / Σ_{i=1}^N P^old(j | x^i)

Page 69: Graphical Illustration of the E and M Steps

E-step: set q(z) = p(z | x, θ^old). Then KL(q(z) || p(z | x, θ^old)) = 0, so the bound is tight:
  −ln L(q, θ^old) = −ln p(x | θ^old) = J(θ^old).

M-step: θ^new = arg min_θ [ −ln L(q, θ) ]. Why does this decrease J? Because KL( q(z) = p(z | x, θ^old) || p(z | x, θ^new) ) ≥ 0,
  J(θ^new) = −ln p(x | θ^new) ≤ −ln L(q, θ^new) ≤ −ln L(q, θ^old) = J(θ^old),
so the negative log likelihood never increases.

[Figure: the decomposition J = −ln L(q, θ) − KL(q || p) before and after each step; KL(q || p) = 0 after the E-step.]

Note: EM is a maximum likelihood algorithm. Is there a Bayesian version? Yes: if you assume priors on ({μ_j, σ_j², P_j}) you get Variational Bayesian Inference.

Page 70: An Alternate View of EM for Gaussian Mixtures - 1

Latent-variable formulation (graphical model: hidden/latent z → observed x).

z is an M-dimensional binary random vector such that z_j ∈ {0, 1} and Σ_{j=1}^M z_j = 1; the only possible vectors are {z = e_i : i = 1, 2, ..., M}, with e_i the i-th unit vector.

  P(z_j = 1) = P_j   ⇒   P(z) = Π_{j=1}^M P_j^{z_j}

x is a p-dimensional random vector such that

  p(x | z) = Π_{j=1}^M [ N(x; μ_j, Σ_j) ]^{z_j}

  p(x) = Σ_z p(x, z) = Σ_z P(z) p(x | z) = Σ_z Π_{j=1}^M [ P_j N(x; μ_j, Σ_j) ]^{z_j} = Σ_{j=1}^M P_j N(x; μ_j, Σ_j)

so the pdf of x is a Gaussian mixture.

Page 71: An Alternate View of EM for Gaussian Mixtures - 2

Problem: given incomplete (partial) data D = {x^1, x^2, ..., x^N}, find the ML estimates of {P_j, μ_j, Σ_j}_{j=1}^M. Let θ = {P_j, μ_j, Σ_j}_{j=1}^M and minimize J = −ln p(D | θ).

If we have several observations {x^n : n = 1, 2, ..., N}, each data point has a corresponding latent vector z^n. Note the generality of this view.

Complete data:  D_c = {(x^1, z^1), (x^2, z^2), ..., (x^N, z^N)}

  ln p(D_c | θ) = Σ_{n=1}^N Σ_{j=1}^M z_j^n { ln P_j − (p/2) ln 2π − (1/2) ln |Σ_j| − (1/2)(x^n − μ_j)^T Σ_j^{−1}(x^n − μ_j) }

Page 72: An Alternate View of EM for Gaussian Mixtures - 3

If we had the complete data, estimation would be trivial: it is similar to the single-Gaussian case, except that we estimate with the subsets of data assigned to each mixture component.

In EM, during the E-step we replace each latent variable by its expectation with respect to the posterior density:

  ⟨z_j^n⟩ = E[z_j^n | x^n, θ] = P(z_j^n = 1 | x^n, θ) = P_j N(x^n; μ_j, Σ_j) / Σ_{k=1}^M P_k N(x^n; μ_k, Σ_k)   (responsibilities)

During the M-step we minimize the expected value of the negative complete-data log likelihood:

  E_Z[ −ln p(D_c | θ) ] = −Σ_{n=1}^N Σ_{j=1}^M ⟨z_j^n⟩ { ln P_j − (p/2) ln 2π − (1/2) ln |Σ_j| − (1/2)(x^n − μ_j)^T Σ_j^{−1}(x^n − μ_j) } = Q(θ, θ^old)

Page 73: EM Algorithm for Gaussian Mixtures - 4

1. Initialize the means {μ_j}_{j=1}^M, covariances {Σ_j}_{j=1}^M, and mixing coefficients {P_j}_{j=1}^M. Evaluate

   J = −ln p(x | θ) = −Σ_{n=1}^N ln { Σ_{j=1}^M P_j N(x^n; μ_j, Σ_j) }

2. E-step: evaluate the responsibilities using the current parameter values,

   γ_j^n = P_j N(x^n; μ_j, Σ_j) / Σ_{k=1}^M P_k N(x^n; μ_k, Σ_k),  j = 1, 2, ..., M;  n = 1, 2, ..., N;
   N_j = Σ_{n=1}^N γ_j^n,  j = 1, 2, ..., M

3. M-step: re-estimate the parameters using the current responsibilities,

   μ_j^new = (1/N_j) Σ_{n=1}^N γ_j^n x^n
   Σ_j^new = (1/N_j) Σ_{n=1}^N γ_j^n (x^n − μ_j^new)(x^n − μ_j^new)^T
   P_j^new = N_j / N

4. Evaluate the negative log likelihood and check for convergence of the parameters or the likelihood. If not converged, go to step 2.

For an unbiased estimate of the covariance, divide by N_j − (1/N_j) Σ_{n=1}^N (γ_j^n)² instead of N_j (this goes to N_j − 1 in the hard, 0-1 assignment case).
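A minimal NumPy/SciPy sketch of steps 1-4 above (initialization choices and names are illustrative, and no numerical safeguards are included):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, M, n_iter=100, seed=0):
    """EM for a Gaussian mixture, following steps 1-4 on this slide."""
    rng = np.random.default_rng(seed)
    N, p = X.shape
    mu = X[rng.choice(N, M, replace=False)]                    # step 1: initialize
    Sigma = np.array([np.cov(X, rowvar=False) for _ in range(M)])
    P = np.full(M, 1.0 / M)
    for _ in range(n_iter):
        # E-step: responsibilities gamma[n, j]
        dens = np.column_stack([P[j] * multivariate_normal.pdf(X, mu[j], Sigma[j])
                                for j in range(M)])
        gamma = dens / dens.sum(axis=1, keepdims=True)
        Nj = gamma.sum(axis=0)
        # M-step: re-estimate parameters
        mu = (gamma.T @ X) / Nj[:, None]
        for j in range(M):
            D = X - mu[j]
            Sigma[j] = (gamma[:, j, None] * D).T @ D / Nj[j]
        P = Nj / N
    return P, mu, Sigma
```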

Page 74: Illustration of the EM Algorithm for Gaussian Mixtures

[Figure: EM on the Old Faithful data (Murphy, Page 353, mixGaussDemoFaithful). Panels show the component fits at iterations 0, 1, 2, 7, and 8, with log likelihoods −Inf, −4929.3761, −441.9557, −389.7686, and −389.7602, plus the average log likelihood vs. iteration.]

Page 75: Relation of Gaussian Mixtures to K-Means

Suppose Σ_j = ε I for j = 1, 2, ..., M. Then

  γ_j^n = P_j N(x^n; μ_j, εI) / Σ_{k=1}^M P_k N(x^n; μ_k, εI) = P_j e^{−||x^n − μ_j||²/(2ε)} / Σ_{k=1}^M P_k e^{−||x^n − μ_k||²/(2ε)}

As ε → 0, γ_j^n → I_j^n, where I_j^n = 1 if j = arg min_k ||x^n − μ_k||²; the rest go to zero, as long as none of the P_k is zero. The expected value of the negative complete-data log likelihood becomes

  E_Z[ −ln p(D_c | θ) ] → (1/(2ε)) Σ_{n=1}^N Σ_{j=1}^M I_j^n ||x^n − μ_j||² + constant

So K-means minimizes  (1/2) Σ_{n=1}^N Σ_{j=1}^M I_j^n ||x^n − μ_j||².

Page 76: Variational Bayesian Inference - 1

Graphical model: latent vector w → observed p-dimensional random vector x. Here w is a latent vector (continuous or discrete):
  – the mixture indicator vector (discrete) z, and/or
  – the parameters ({μ_j, Λ_j^{−1}, P_j}).

Recall the decomposition

  J = −ln p(x) = −ln L(q(w)) − KL(q(w) || p(w | x)),   KL ≥ 0

  ln L(q(w)) = ∫ q(w) ln [ p(x, w) / q(w) ] dw = E_q[ ln p(x, w) ] + H(q)
  KL(q(w) || p(w | x)) = −∫ q(w) ln [ p(w | x) / q(w) ] dw = −E_q[ ln p(w | x) ] − H(q)

Variational inference typically assumes q(w) to be factorized:

  q(w) = Π_{j=1}^K q_j(w_j),   where the {w_j} are disjoint groups of variables.
  Example:  q(w) = q(z) q({μ_j, Λ_j, P_j})

Page 77: Variational Bayesian Inference - 2

Minimize the upper bound −ln L(q(w)) with respect to q_j(w_j) while keeping {q_i(w_i) : i ≠ j} fixed (à la Gauss-Seidel) ⇒ an iterative algorithm for finding the factors {q_j(w_j)}.

  −ln L(q(w)) = −∫ Π_i q_i(w_i) [ ln p(x, w) − Σ_i ln q_i(w_i) ] dw
              = −∫ q_j(w_j) E_{i≠j}[ ln p(x, w) ] dw_j − H(q_j) + terms not involving q_j

Setting the functional derivative with respect to q_j to zero (with ∫ q_j dw_j = 1):

  ln q_j*(w_j) = E_{i≠j}[ ln p(x, w) ] + const   ⇒   q_j*(w_j) = exp( E_{i≠j}[ ln p(x, w) ] ) / ∫ exp( E_{i≠j}[ ln p(x, w) ] ) dw_j

The log of the optimal q_j is the expectation of the log of the joint distribution with respect to all of the other factors {q_i(w_i) : i ≠ j}. This idea is also used in loopy belief propagation and expectation propagation.

Page 78: Application to Gaussian Mixtures - 1

Here w involves the mixture variables and the component parameters:

  q(w) = q(z) q({P_j, μ_j, Λ_j}_{j=1}^M),   where z (per data point) is a binary random vector of dimension M.

Model assumptions:

• Mixture distribution:  p({z^n}_{n=1}^N | {P_j}_{j=1}^M) = Π_{n=1}^N Π_{j=1}^M P_j^{z_j^n}

• Data likelihood given the latent variables:  p({x^n}_{n=1}^N | {z^n}_{n=1}^N, {μ_j, Λ_j}_{j=1}^M) = Π_{n=1}^N Π_{j=1}^M [ N(x^n; μ_j, Λ_j^{−1}) ]^{z_j^n}

• We also assume priors on {P_j, μ_j, Λ_j} ⇒ Bayesian approach. The conjugate prior to the multinomial is the Dirichlet:

  p(P) = Dir(P | α_0) = [ Γ(M α_0) / Γ(α_0)^M ] Π_{j=1}^M P_j^{α_0 − 1}

  (Γ(α) = ∫_0^∞ t^{α−1} e^{−t} dt;   Γ(n + 1) = n Γ(n);   Γ(n) = (n − 1)! for integers.)

Page 79: Application to Gaussian Mixtures - 2

Model assumptions (continued). Gaussian-Wishart prior on the component parameters:

  p({μ_j, Λ_j}_{j=1}^M) = Π_{j=1}^M p(μ_j | Λ_j) p(Λ_j) = Π_{j=1}^M N(μ_j; m_0, (β_0 Λ_j)^{−1}) · Wishart(Λ_j; W_0, ν_0)

(In one dimension the Wishart reduces to a Gamma; see Bishop's book.)

The joint distribution of all the random variables decomposes as:

  p({x^n}, {z^n}, {P_j, μ_j, Λ_j}) = p({x^n} | {z^n}, {μ_j, Λ_j}) · p({z^n} | {P_j}) · p(P) · p({μ_j} | {Λ_j}) · p({Λ_j})
  = Π_{n=1}^N Π_{j=1}^M [ N(x^n; μ_j, Λ_j^{−1}) ]^{z_j^n} · Π_{n=1}^N Π_{j=1}^M P_j^{z_j^n}
    · [ Γ(M α_0)/Γ(α_0)^M ] Π_{j=1}^M P_j^{α_0 − 1} · Π_{j=1}^M N(μ_j; m_0, (β_0 Λ_j)^{−1}) Wishart(Λ_j; W_0, ν_0)


Application to Gaussian Mixtures - 3

Variational Bayes M-step (VBM-step)…. It is easier to see M-step first

    $\ln q^*(\{P, \mu_j, \Lambda_j\}_{j=1}^{M}) = E_{q(\{z^n\})}\big[\ln p(\{x^n\}, \{z^n\}, \{P, \mu_j, \Lambda_j\})\big] + \text{const}$

    $= E_{q(\{z^n\})}\Big[\ln \prod_{n=1}^{N}\prod_{j=1}^{M}\big[P_j\,N(x^n;\mu_j,\Lambda_j^{-1})\big]^{z_j^n}
    \cdot \dfrac{\Gamma(M\alpha_0)}{\Gamma(\alpha_0)^M}\prod_{j=1}^{M} P_j^{\,\alpha_0-1}
    \cdot \prod_{j=1}^{M} N\big(\mu_j; m_0, (\beta_0\Lambda_j)^{-1}\big)\,Wishart(\Lambda_j; W_0, \nu_0)\Big]$

    $= \sum_{j=1}^{M}\Big[(\alpha_0 - 1) + \sum_{n=1}^{N} E[z_j^n]\Big]\ln P_j
    + \sum_{j=1}^{M}\Big\{\ln N\big(\mu_j; m_0, (\beta_0\Lambda_j)^{-1}\big)\,Wishart(\Lambda_j; W_0, \nu_0)
    + \sum_{n=1}^{N} E[z_j^n]\,\ln N(x^n; \mu_j, \Lambda_j^{-1})\Big\} + \text{constant}$

    $\Rightarrow\; q(\{P, \mu_j, \Lambda_j\}_{j=1}^{M}) = q(P)\prod_{j=1}^{M} q(\mu_j, \Lambda_j)$ with
    $q(P) = Dirichlet(P; \{\alpha_j\}_{j=1}^{M})$ and $q(\mu_j, \Lambda_j)$ Gaussian-Wishart


Application to Gaussian Mixtures - 4

Updated factorized distribution after M-step

    $q^*(\{P, \mu_j, \Lambda_j\}_{j=1}^{M}) = q^*(P)\prod_{j=1}^{M} q^*(\mu_j, \Lambda_j)$

    $q^*(P) = Dirichlet(P; \{\alpha_j\}_{j=1}^{M})$;  $\alpha_j = \alpha_0 + N_j$;  $E[P_j] = \dfrac{\alpha_0 + N_j}{M\alpha_0 + N}$

    $q^*(\mu_j, \Lambda_j) = N\big(\mu_j; m_j, (\beta_j\Lambda_j)^{-1}\big)\,Wishart(\Lambda_j; W_j, \nu_j)$

    $\beta_j = \beta_0 + N_j$;  $m_j = \dfrac{1}{\beta_j}\big(\beta_0 m_0 + N_j\bar{x}_j\big)$;  $\nu_j = \nu_0 + N_j$

    $W_j^{-1} = W_0^{-1} + N_j S_j + \dfrac{\beta_0 N_j}{\beta_0 + N_j}(\bar{x}_j - m_0)(\bar{x}_j - m_0)^T$

    where  $N_j = \sum_{n=1}^{N}\gamma_j^n$,  $\bar{x}_j = \dfrac{1}{N_j}\sum_{n=1}^{N}\gamma_j^n x^n$,
    $S_j = \dfrac{1}{N_j}\sum_{n=1}^{N}\gamma_j^n (x^n - \bar{x}_j)(x^n - \bar{x}_j)^T$

    Updates for $\{N_j, \bar{x}_j, S_j\}$ are similar to ML

Sequential VBEM?
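As a concrete illustration of the VBM-step updates above, here is a minimal NumPy sketch (not the deck's own code); the argument names alpha0, beta0, m0, W0, nu0 and the responsibilities array gamma are assumptions chosen to mirror the slide's notation.

```python
import numpy as np

def vbm_step(X, gamma, alpha0, beta0, m0, W0, nu0):
    """One VBM-step for a Bayesian Gaussian mixture (Dirichlet + Gaussian-Wishart priors).

    X     : (N, p) data matrix
    gamma : (N, M) responsibilities from the VBE-step
    Returns updated hyperparameters (alpha, beta, m, W, nu).
    """
    N, p = X.shape
    Nj = gamma.sum(axis=0) + 1e-10                     # N_j = sum_n gamma_j^n
    xbar = (gamma.T @ X) / Nj[:, None]                 # xbar_j = (1/N_j) sum_n gamma_j^n x^n

    alpha = alpha0 + Nj                                # Dirichlet counts
    beta = beta0 + Nj
    m = (beta0 * m0 + Nj[:, None] * xbar) / beta[:, None]
    nu = nu0 + Nj

    W = np.empty((len(Nj), p, p))
    for j in range(len(Nj)):
        diff = X - xbar[j]
        Sj = (gamma[:, j, None] * diff).T @ diff / Nj[j]          # weighted scatter S_j
        dm = (xbar[j] - m0)[:, None]
        Winv = np.linalg.inv(W0) + Nj[j] * Sj \
               + (beta0 * Nj[j] / (beta0 + Nj[j])) * (dm @ dm.T)  # W_j^{-1} update
        W[j] = np.linalg.inv(Winv)
    return alpha, beta, m, W, nu
```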


Application to Gaussian Mixtures - 5

Variational Bayes E-step (VBE-step)

    $\ln q^*(\{z^n\}_{n=1}^{N}) = E_{q(\{P,\mu_j,\Lambda_j\})}\big[\ln p(\{x^n\},\{z^n\},\{P,\mu_j,\Lambda_j\})\big] + \text{const}$

    $= E_{q(P)}\big[\ln p(\{z^n\} \mid \{P_j\})\big] + E_{q(\{\mu_j,\Lambda_j\})}\big[\ln p(\{x^n\} \mid \{z^n\},\{\mu_j,\Lambda_j\})\big] + \text{const}$

    $= \sum_{n=1}^{N}\sum_{j=1}^{M} z_j^n \ln\rho_j^n + \text{const}$

    where  $\ln\rho_j^n = E[\ln P_j] + \dfrac{1}{2}E\big[\ln|\Lambda_j|\big] - \dfrac{p}{2}\ln 2\pi
    - \dfrac{1}{2}E_{\mu_j,\Lambda_j}\big[\|x^n - \mu_j\|^2_{\Lambda_j}\big]$

    $\Rightarrow\; q^*(\{z^n\}_{n=1}^{N}) = \prod_{n=1}^{N}\prod_{j=1}^{M}\big[\gamma_j^n\big]^{z_j^n}$,
    where  $\gamma_j^n = \dfrac{\rho_j^n}{\sum_{k=1}^{M}\rho_k^n} = E[z_j^n]$ .... responsibilities


Application to Gaussian Mixtures - 6

Variational Bayes E-step (VBE-step) … continued

− Evaluation of responsibilities

− Recall

    $\ln\rho_j^n = E[\ln P_j] + \dfrac{1}{2}E\big[\ln|\Lambda_j|\big] - \dfrac{p}{2}\ln 2\pi
    - \dfrac{1}{2}E_{\mu_j,\Lambda_j}\big[\|x^n - \mu_j\|^2_{\Lambda_j}\big]$

    $E[\ln P_j] = \psi(\alpha_j) - \psi\Big(\sum_{k=1}^{M}\alpha_k\Big)$;  $\psi(\alpha) = \dfrac{d}{d\alpha}\ln\Gamma(\alpha)$ .... digamma function

    $E\big[\ln|\Lambda_j|\big] = \sum_{i=1}^{p}\psi\Big(\dfrac{\nu_j + 1 - i}{2}\Big) + p\ln 2 + \ln|W_j|$

    $E_{\mu_j,\Lambda_j}\big[\|x^n - \mu_j\|^2_{\Lambda_j}\big] = \dfrac{p}{\beta_j} + \nu_j(x^n - m_j)^T W_j (x^n - m_j)$

    Since $\gamma_j^n \propto \rho_j^n$:

    $\gamma_j^n \propto \widetilde{P}_j\,\widetilde{\Lambda}_j^{1/2}\exp\Big(-\dfrac{p}{2\beta_j} - \dfrac{\nu_j}{2}(x^n - m_j)^T W_j (x^n - m_j)\Big)$,
    with $\ln\widetilde{P}_j = E[\ln P_j]$ and $\ln\widetilde{\Lambda}_j = E\big[\ln|\Lambda_j|\big]$

    See Bishop, Chapter 10
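To make the responsibility computation concrete, here is a minimal NumPy/SciPy sketch, assuming the hyperparameters alpha, beta, m, W, nu produced by a VBM-step (illustrative names, not a library API); scipy.special.digamma supplies ψ(·).

```python
import numpy as np
from scipy.special import digamma

def vbe_step(X, alpha, beta, m, W, nu):
    """Compute responsibilities gamma_j^n for a Bayesian Gaussian mixture."""
    N, p = X.shape
    M = len(alpha)
    E_lnP = digamma(alpha) - digamma(alpha.sum())          # E[ln P_j]
    log_rho = np.empty((N, M))
    for j in range(M):
        # E[ln |Lambda_j|] = sum_i psi((nu_j + 1 - i)/2) + p ln 2 + ln |W_j|
        E_lnLam = digamma(0.5 * (nu[j] + 1 - np.arange(1, p + 1))).sum() \
                  + p * np.log(2.0) + np.linalg.slogdet(W[j])[1]
        diff = X - m[j]
        # E[||x - mu_j||^2_{Lambda_j}] = p/beta_j + nu_j (x - m_j)^T W_j (x - m_j)
        maha = p / beta[j] + nu[j] * np.einsum('ni,ij,nj->n', diff, W[j], diff)
        log_rho[:, j] = E_lnP[j] + 0.5 * E_lnLam - 0.5 * p * np.log(2 * np.pi) - 0.5 * maha
    # gamma_j^n = rho_j^n / sum_k rho_k^n  (normalize in the log domain for stability)
    log_rho -= log_rho.max(axis=1, keepdims=True)
    gamma = np.exp(log_rho)
    gamma /= gamma.sum(axis=1, keepdims=True)
    return gamma
```

Alternating this VBE-step with the VBM-step sketch above (starting from a large M and a small α0) reproduces the cluster-pruning behavior illustrated on the next slide.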


VB Approximation of Gaussian Gamma

In VBEM, start with large M and a very small $\alpha_0 \ll 1$ (e.g., 0.001)

It automatically prunes clusters with very few members (“rich get

richer”)

In this example, we start with 6 clusters, but only 2 remain at the end

[Figure: VBEM on the Old Faithful data at iteration 94 — scatter plot of the clustered data and a bar chart of the expected number of points assigned to each of the 6 initial components; only 2 components retain appreciable mass]

mixGaussVbDemoFaithful from Murphy, Page 755


Bayesian Model Selection -1

Bayesian Model Selection (maximize the probability of a model given the data)

    $P(m \mid D) = \dfrac{p(D \mid m)\,p(m)}{\sum_{l \in \mathcal{M}} p(D \mid l)\,p(l)}$

    $\mathcal{M}$ = set of models (e.g., linear, quadratic discriminants); model $m$ is specified by parameter vector $\theta_m$, $m \in \mathcal{M}$

Bayes Factors for Comparing Models m and l

    $BF(m, l) = \dfrac{p(D \mid m)}{p(D \mid l)} = \dfrac{P(m \mid D)}{P(l \mid D)}\cdot\dfrac{p(l)}{p(m)} \approx \dfrac{e^{-\frac{1}{2}BIC(m)}}{e^{-\frac{1}{2}BIC(l)}}$

    $BF(m, l) > 1 \;\Rightarrow\;$ model m is preferred over model l

    BIC = Bayesian Information Criterion (using the negative log-likelihood)


Bayesian Model Selection -2

Bayesian Information Criterion (BIC)… minimize BIC

Akaike Information Criterion (AIC) …. Minimize AIC

    $BIC(m) = -2\ln p(D \mid \hat{\theta}_m) + dof(\hat{\theta}_m)\ln N$ .... Schwarz criterion

    Valid for regression, classification, and density estimation.

    For linear and quadratic classifiers (note: only $(C-1)$ free class probabilities):

       $dof(\hat{\theta}) = Cp + C$;  equal and spherical covariance
       $= Cp + p + C - 1$;  equal and diagonal (feature-dependent) covariance
       $= Cp + p(p+1)/2 + C - 1$;  equal and general covariance
       $= Cp + Cp(p+1)/2 + C - 1$;  unequal and general covariance

    BIC is closely related to Minimum Description Length (MDL)

    Adjusted BIC: replace $\ln N$ by $\ln[(N+2)/24]$

    $AIC(m) = -2\ln p(D \mid \hat{\theta}_m) + 2\,dof(\hat{\theta}_m)$

    Small N & Gaussian:  $AIC_c = AIC + \dfrac{2\,dof(\hat{\theta}_m)\big(dof(\hat{\theta}_m) + 1\big)}{N - dof(\hat{\theta}_m) - 1}$
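A small sketch of how BIC/AIC are typically computed from a fitted model's maximized log-likelihood; the helper names and the covariance-case labels are illustrative and follow the dof cases above.

```python
import numpy as np

def bic(loglik, dof, N):
    """BIC(m) = -2 ln p(D | theta_hat) + dof * ln N  (smaller is better)."""
    return -2.0 * loglik + dof * np.log(N)

def aic(loglik, dof, N=None, corrected=False):
    """AIC(m) = -2 ln p(D | theta_hat) + 2 dof; optional small-sample (AICc) correction."""
    a = -2.0 * loglik + 2.0 * dof
    if corrected:
        a += 2.0 * dof * (dof + 1) / (N - dof - 1)   # AICc for small N, Gaussian models
    return a

def dof_gaussian_classifier(C, p, covariance='unequal-general'):
    """Degrees of freedom for C-class, p-feature Gaussian classifiers (see slide)."""
    return {'equal-spherical':  C * p + C,
            'equal-diagonal':   C * p + p + C - 1,
            'equal-general':    C * p + p * (p + 1) // 2 + C - 1,
            'unequal-general':  C * p + C * p * (p + 1) // 2 + C - 1}[covariance]
```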


Binary Classification, BIC & AIC - 1

Class z = 0: N(0,1); zero mean …. Null hypothesis, H0

Class z = 1: N(µ,1); µ ≠ 0 …. Alternative hypothesis, H1

N scalar data points: $D = \{x_1, x_2, \ldots, x_N\}$

Under the null hypothesis, the sample mean $\bar{x} = \frac{1}{N}\sum_i x_i \sim N(0, \frac{1}{N})$, so $\sqrt{N}\,\bar{x} \sim N(0,1)$.

Classical test: if $P\big(\sqrt{N}\,|\bar{x}| > c_\alpha\big) = \alpha$ for a specified $\alpha$ (e.g., $c_\alpha \approx 2$ for $\alpha = 0.05$),
we decide H1 (µ ≠ 0) with confidence $1-\alpha$; $\alpha$ is the probability of falsely rejecting H0.

So the test statistic is:  decide H1 if $|\bar{x}| > \dfrac{c_\alpha}{\sqrt{N}}$


Binary Classification, BIC & AIC - 2

BIC

    $BIC(m) = -2\ln p(D \mid \hat{\theta}_m) + dof(\hat{\theta}_m)\ln N$

    $BIC(H_0) = \sum_{i=1}^{N} x_i^2$

    $BIC(H_1) = \sum_{i=1}^{N}(x_i - \bar{x})^2 + \ln N = \sum_{i=1}^{N} x_i^2 - N\bar{x}^2 + \ln N$

    So, $BIC(H_1) < BIC(H_0)$ if $|\bar{x}| > \sqrt{\dfrac{\ln N}{N}}$

AIC

    $AIC(m) = -2\ln p(D \mid \hat{\theta}_m) + 2\,dof(\hat{\theta}_m)$

    $AIC(H_0) = \sum_{i=1}^{N} x_i^2$

    $AIC(H_1) = \sum_{i=1}^{N}(x_i - \bar{x})^2 + 2 = \sum_{i=1}^{N} x_i^2 - N\bar{x}^2 + 2$

    So, $AIC(H_1) < AIC(H_0)$ if $|\bar{x}| > \sqrt{\dfrac{2}{N}}$

Similar to classical hypothesis testing with threshold $c_\alpha/\sqrt{N}$, but with a sample-number-dependent threshold (for BIC).
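A tiny numerical sketch of the two decision rules just derived, assuming the N scalar samples are in a NumPy array; it simply compares $|\bar{x}|$ with the BIC threshold $\sqrt{\ln N/N}$ and the AIC threshold $\sqrt{2/N}$.

```python
import numpy as np

def choose_hypothesis(x):
    """Pick between H0 (mu = 0) and H1 (mu != 0) for N(mu,1) data via BIC and AIC."""
    N = len(x)
    xbar = abs(x.mean())
    bic_prefers_H1 = xbar > np.sqrt(np.log(N) / N)   # BIC: |xbar| > sqrt(ln N / N)
    aic_prefers_H1 = xbar > np.sqrt(2.0 / N)         # AIC: |xbar| > sqrt(2 / N)
    return bic_prefers_H1, aic_prefers_H1

rng = np.random.default_rng(0)
print(choose_hypothesis(rng.normal(0.0, 1.0, size=1000)))  # typically (False, False) under H0
print(choose_hypothesis(rng.normal(0.3, 1.0, size=1000)))  # typically (True, True) under H1
```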


Holdout Method of Cross Validation

• Probability of error = misclassification rate

• Holdout Method: split the N samples into N − K training and K validation samples (typically K ≈ N/5)

    Error count = R out of K  ⇒  $\widehat{PE} = R/K$

    From the binomial distribution:  $\sigma_{\widehat{PE}} = \sqrt{\dfrac{PE(1-PE)}{K}}$

    To obtain an estimate of PE within 1%:  $0.01 \ge 2\sqrt{\dfrac{PE(1-PE)}{K}} \;\Rightarrow\; K \ge \dfrac{4\,PE(1-PE)}{10^{-4}} = 40000\,PE(1-PE)$

    If PE ≈ 0.05, K ≥ 40000(0.05)(0.95) ≈ 1900

    When PE ≈ ½, we need lots of samples…. 10,000

• We can also estimate the class-conditional error rate PE(z=k). Then  $PE = \sum_{k=1}^{C} P(z=k)\,PE(z=k)$

Are there better bounds?
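The sample-size numbers above follow directly from the binomial standard error; a minimal sketch (the 2-sigma tolerance of 1% is the slide's assumption):

```python
import numpy as np

def holdout_size(pe, tol=0.01, n_sigma=2):
    """Smallest K with n_sigma * sqrt(pe(1-pe)/K) <= tol."""
    return int(np.ceil(n_sigma**2 * pe * (1 - pe) / tol**2))

print(holdout_size(0.05))   # ~1900 validation samples when PE ~ 0.05
print(holdout_size(0.5))    # ~10000 validation samples when PE ~ 0.5
```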


• Suppose we have a nonnegative random variable x (discrete or continuous)

• Assume continuous WLOG

• Markov inequality

• Chebyshev inequality

Markov & Chebyshev Inequalities

    Markov:  $E(x) = \int_0^{\infty} x f(x)\,dx \ge \int_{\epsilon}^{\infty} x f(x)\,dx \ge \epsilon\int_{\epsilon}^{\infty} f(x)\,dx = \epsilon\,P(x \ge \epsilon)
    \;\Rightarrow\; P(x \ge \epsilon) \le \dfrac{E(x)}{\epsilon}$

    Chebyshev:  $P\big(|x - E(x)| \ge \epsilon\big) = P\big((x - E(x))^2 \ge \epsilon^2\big) \le \dfrac{E\big[(x - E(x))^2\big]}{\epsilon^2}$,
    so  $P\big(|x - E(x)| \ge \epsilon\big) \le \dfrac{\sigma^2}{\epsilon^2}$

    Example: $\bar{x}$ = sample mean of n numbers with mean m and variance $\sigma^2$:
    $P\big(|\bar{x} - m| \ge \epsilon\big) \le \dfrac{\sigma^2}{n\epsilon^2}$

    For a binary classifier with unknown probability of error P and sample error rate $S_n$:
    $P\big(|S_n - P| \ge \epsilon\big) \le \dfrac{P(1-P)}{n\epsilon^2} \le \dfrac{1}{4n\epsilon^2}$;
    with $n = 100$, $\epsilon = 0.2$, the bound is 0.0625 .... bound is loose
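A quick simulation sketch comparing the Chebyshev bound $1/(4n\epsilon^2)$ with the actual deviation probability of a sample error rate for the n = 100, ε = 0.2 case above; the true error rate of 0.3 is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, eps, p_err = 100, 0.2, 0.3                      # p_err is an illustrative true error rate
S = rng.binomial(n, p_err, size=100_000) / n       # sample error rates over many trials
empirical = np.mean(np.abs(S - p_err) >= eps)      # actual deviation probability
chebyshev = 1.0 / (4 * n * eps**2)                 # worst-case bound, independent of p_err
print(empirical, chebyshev)                        # ~1e-5 vs 0.0625 -- the bound is loose
```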


Hoeffding’s Inequality

    Let $Y = \sum_{i=1}^{n} X_i$;  $X_i \in [a_i, b_i]$;  $R_i = b_i - a_i$. Then

    $P\big\{|Y - E(Y)| \ge n\epsilon\big\} \le 2\,e^{-2n^2\epsilon^2/\sum_{i=1}^{n} R_i^2}$;  when $R_i = 1$,
    $P\big\{|Y - E(Y)| \ge n\epsilon\big\} \le 2\,e^{-2n\epsilon^2}$

    Proof sketch:

    Let $Z_i = X_i - E(X_i)$, so $E(Z_i) = 0$ and $Z_i \in [-R_i/2, R_i/2]$.

    $P\big\{|Y - E(Y)| \ge n\epsilon\big\} = P\big\{\big|\sum_i Z_i\big| \ge n\epsilon\big\}
    = P\big\{\sum_i Z_i \ge n\epsilon\big\} + P\big\{\sum_i Z_i \le -n\epsilon\big\}$; consider one tail with $t > 0$:

    $P\big\{t\sum_i Z_i \ge tn\epsilon\big\} = P\big\{e^{t\sum_i Z_i} \ge e^{tn\epsilon}\big\}
    \le e^{-tn\epsilon}\,E\big[e^{t\sum_i Z_i}\big] = e^{-tn\epsilon}\prod_{i=1}^{n} E\big[e^{tZ_i}\big]$  (Markov inequality)

    By convexity of $e^{tz}$ on $[-R_i/2, R_i/2]$ (Hoeffding's lemma, via Jensen's inequality):
    $E\big[e^{tZ_i}\big] \le \tfrac{1}{2}\big(e^{-tR_i/2} + e^{tR_i/2}\big) \le e^{t^2 R_i^2/8}$

    So  $P\big\{\sum_i Z_i \ge n\epsilon\big\} \le e^{-tn\epsilon + t^2\sum_i R_i^2/8}$

    The RHS is minimized when $t = 4n\epsilon/\sum_i R_i^2$, giving
    $P\big\{\sum_i Z_i \ge n\epsilon\big\} \le e^{-2n^2\epsilon^2/\sum_i R_i^2} = e^{-2n\epsilon^2}$ when $R_i = 1$;
    the other tail is identical, giving the factor 2.

    Hoeffding's lemma: https://en.wikipedia.org/wiki/Hoeffding%27s_lemma
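For the same n = 100, ε = 0.2 setting, Hoeffding's bound $2e^{-2n\epsilon^2}$ is far tighter than Chebyshev's; a one-line comparison sketch:

```python
import numpy as np

n, eps = 100, 0.2
print(1.0 / (4 * n * eps**2))          # Chebyshev bound: 0.0625
print(2 * np.exp(-2 * n * eps**2))     # Hoeffding bound: ~0.00067
```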


• Suppose we have classifiers/models M1, M2,…., MC

• Training Data, D & Validation Data, V

• Samples are assumed to be i.i.d.

Rationale behind Cross Validation

    Average validation log-likelihood of model m:  $\hat{l}_m = \dfrac{1}{|V|}\sum_{x_i \in V}\ln p(x_i \mid \hat{\theta}_m)$

    Since $\hat{\theta}_m$ does not depend on V,  $E_V[\hat{l}_m] = l_m$ (the true expected log-likelihood)

    By Hoeffding's inequality (union bound over the C models):

    $P\big\{\max_m |\hat{l}_m - l_m| > \epsilon\big\} = P\big\{\cup_m\, |\hat{l}_m - l_m| > \epsilon\big\}
    \le \sum_m P\big\{|\hat{l}_m - l_m| > \epsilon\big\} \le 2C\,e^{-2|V|\epsilon^2}$

    So if $\delta = 2C\,e^{-2|V|\epsilon^2}$, then  $\epsilon = \sqrt{\dfrac{\ln(2C/\delta)}{2|V|}}$

    "Confidence is cheap but accuracy is more expensive": $\delta$ and $C$ enter only through $\ln(2C/\delta)$, while accuracy improves only as $1/\sqrt{|V|}$

    $\max_m \hat{l}_m \ge \max_m l_m - 2\epsilon = \max_m l_m - O\Big(\sqrt{\dfrac{\ln C}{|V|}}\Big)$

    So, with probability of at least $(1-\delta)$, one chooses (nearly) the best model.
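The ε-δ trade-off above can be inverted to give a required validation-set size $|V| \ge \ln(2C/\delta)/(2\epsilon^2)$; a minimal sketch (assuming the per-sample log-likelihoods have range 1):

```python
import numpy as np

def validation_size(C, eps, delta):
    """|V| needed so that max_m |lhat_m - l_m| <= eps with probability >= 1 - delta."""
    return int(np.ceil(np.log(2 * C / delta) / (2 * eps**2)))

print(validation_size(C=10, eps=0.05, delta=0.05))    # ~1199 samples
print(validation_size(C=1000, eps=0.05, delta=0.05))  # ~2120 -- confidence over C models is cheap (log)
```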


• S-fold Cross validation:

S-fold Cross Validation

    [Figure: the N samples are split into S folds D1, D2, …, DS of size N/S; in run s the fold Ds is held out for validation and the remaining S − 1 folds are used for training]

    $PE = \dfrac{1}{N}\sum_{s=1}^{S}\sum_{i \in D_s} I\big(\hat{y}_s(x_i) \ne z_i\big)$;  $I(\cdot)$ = indicator function,  $\hat{y}_s$ = classifier trained without fold $D_s$

• S=N N-fold cross validation or Leave one-out CV (LOOCV) method

    $PE = \dfrac{1}{N}\sum_{i=1}^{N} I\big(\hat{y}_{-i}(x_i) \ne z_i\big)$;  $\hat{y}_{-i}(x) = f\big(x, D \setminus \{(x_i, z_i)\}\big)$

• Practical Scheme: 5x2 Cross-Validation. Variation: 5 repetitions of 2-

fold cross-validation on a randomized dataset

S is typically 5-10
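A library-free sketch of the S-fold estimate above; fit and predict are placeholders for any classifier's training and prediction routines (assumptions, not a specific API).

```python
import numpy as np

def s_fold_error(X, z, fit, predict, S=5, seed=0):
    """PE = (1/N) sum_s sum_{i in D_s} I(yhat_s(x_i) != z_i)."""
    N = len(z)
    idx = np.random.default_rng(seed).permutation(N)
    folds = np.array_split(idx, S)
    errors = 0
    for s in range(S):
        val = folds[s]                                    # held-out fold D_s
        trn = np.concatenate([folds[r] for r in range(S) if r != s])
        model = fit(X[trn], z[trn])                       # train on the other S-1 folds
        errors += np.sum(predict(model, X[val]) != z[val])
    return errors / N
```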


Summary of Validation and Model Selection Methods

[Figure panels: simple split; leave-one-out CV (S = N in S-fold CV); model selection; regularization and model selection]

Any combination of CV, model selection and regularization (hyper-parameter selection) is possible. See AMLbook.com.


Bootstrap Method of Validation

• Bootstrap Method:

    Sample from $D = \{x_1, x_2, \ldots, x_N\}$ with uniform probability (1/N); each $x_i$ is drawn independently with replacement

    Let b = bootstrap index, b = 1,2,…,B;  B = number of bootstrap samples (typically 50-200)

    Do b = 1,2,…,B

       Bootstrap sample $D_b = \{x^{(1)}, x^{(2)}, \ldots, x^{(N)}\}$  →  train  →  $PE_{training}(b)$

       Validation data: $A_b = D \setminus D_b$ (samples not in the bootstrap sample)  →  $PE_{val}(b)$

    End

    $\overline{PE} = 0.632\cdot\overline{PE}_{val}(b) + 0.368\cdot\overline{PE}_{training}(b)$ .... Efron estimator (averages taken over b)
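A sketch of the 0.632 estimator in the same style (fit/predict again placeholders); each bootstrap replicate trains on a resample of D and validates on the out-of-bag set A_b.

```python
import numpy as np

def bootstrap_632_error(X, z, fit, predict, B=100, seed=0):
    """PE = 0.632 * mean_b PE_val(b) + 0.368 * mean_b PE_train(b) (Efron estimator)."""
    rng = np.random.default_rng(seed)
    N = len(z)
    pe_train, pe_val = [], []
    for _ in range(B):
        boot = rng.integers(0, N, size=N)                  # sample with replacement
        oob = np.setdiff1d(np.arange(N), boot)             # A_b = D \ D_b
        model = fit(X[boot], z[boot])
        pe_train.append(np.mean(predict(model, X[boot]) != z[boot]))
        if len(oob):
            pe_val.append(np.mean(predict(model, X[oob]) != z[oob]))
    return 0.632 * np.mean(pe_val) + 0.368 * np.mean(pe_train)
```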


• Bootstrap Method (cont’d):

    $\overline{PE} = 0.632\cdot\overline{PE}_{val}(b) + 0.368\cdot\overline{PE}_{training}(b)$

    Why?  $P\{\text{observation appears in a bootstrap sample}\} = 1 - \big(1 - \tfrac{1}{N}\big)^N \to 1 - e^{-1} \approx 0.632$ as $N \to \infty$

    $P\{\text{observation} \in \text{validation samples}\} \approx 0.368$;  $P\{\text{observation} \in \text{training samples}\} \approx 0.632$

Confusion Matrix

• Confusion Matrix:

    $P = [P_{ij}]$;  $P_{ij} = P\{\text{decision } i \mid z = j\} \approx \dfrac{N_{ij}}{N_j}$;  $i = 0, 1, 2, \ldots, C$;  $j = 1, 2, \ldots, C$

    In words, just count errors from the validation set, bootstrap, etc. for class j and divide by the number of samples from class j


• Contingency table showing the differences between the true and predicted classes for a set of labeled examples

• The following metrics can be derived from the confusion matrix:

– PD (Sensitivity, Recall):  TP/(TP+FN)

– PF (1 − Specificity, False Positive Rate):  FP/(TN+FP)

– Positive Prediction Power (Precision):  TP/(TP+FP)

– Negative Prediction Power:  TN/(TN+FN)

– Correct Classification Rate (CCR):  (TP+TN)/N

– Misclassification Rate:  (FP+FN)/N

– Odds Ratio:  (TP·TN)/(FP·FN)

 (Rows: predicted outcome; columns: true class)

                        TRUE: Fault                      TRUE: No-Fault                       Total
 Positive Detection     Number of detected faults (TP)   Number of false alarms (FP)          Total number of positive detections
 Negative Detection     Number of missed faults (FN)     Number of correct rejections (TN)    Total number of negative detections
 Total                  Total number of faulty samples   Total number of fault-free samples   Total number of samples (N)

– Kappa

    $\kappa = \dfrac{CCR - \sum_{i=1}^{2} P_{row(i)}\,P_{col(i)}}{1 - \sum_{i=1}^{2} P_{row(i)}\,P_{col(i)}}$

    CCR = Correct Classification Rate;  $P_{row(i)}$ = fraction of entries in row i;  $P_{col(i)}$ = fraction of entries in column i

    Poor: κ < 0.4    Good: 0.4 < κ < 0.75    Excellent: κ > 0.75

Performance Metrics for Binary Classification - 1

No reject option here. When the entries are expressed as proportions (for computing CCR and κ), all four entries should sum to 1.


 Aircraft Engine Data (counts and column-normalized rates):

                        TRUE: Fault    TRUE: No-Fault
 Positive Detection     3023 (TP)      1518 (FP)
 Negative Detection     1977 (FN)      3482 (TN)
 Total                  5000           5000            (N = 10000)

 Normalized             0.605          0.304
                        0.395          0.696

 Metrics

    PD = 0.605        False Negative Rate = 0.395

    PF = 0.304 (False Positive Rate)

    Correct Classification Rate = 0.65      Misclassification Rate = 0.35

    Odds Ratio = 3.51        Kappa = 0.301 (Poor)

    Positive Prediction Power = 0.666       Negative Prediction Power = 0.638

    Prevalence = 0.5 (Priors)

    For binary classification with priors $P_1$ (fault) and $P_0$ (no-fault):

    $\kappa = \dfrac{2 P_1 P_0 (P_D - P_F)}{P_0\big(P_1 P_D + P_0 P_F\big) + P_1\big(P_1(1-P_D) + P_0(1-P_F)\big)} = (P_D - P_F)$  if $P_1 = P_0$

Performance Metrics for Binary Classification - 2

    Recall:  $R = P_D = 0.605$

    Precision:  $P = \dfrac{P_1 R}{P_1 R + (1 - P_1) P_F} = 0.666$  (with $P_F = 0.304$)

    F score:  $\dfrac{2PR}{P + R} = 0.634$
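All of the metrics on this slide can be reproduced directly from the four counts; a short sketch using the aircraft-engine numbers above:

```python
TP, FP, FN, TN = 3023, 1518, 1977, 3482
N = TP + FP + FN + TN

PD   = TP / (TP + FN)                    # sensitivity / recall = 0.605
PF   = FP / (FP + TN)                    # false-positive rate  = 0.304
PPP  = TP / (TP + FP)                    # precision            = 0.666
NPP  = TN / (TN + FN)                    # negative prediction power = 0.638
CCR  = (TP + TN) / N                     # correct classification rate = 0.65
F1   = 2 * PPP * PD / (PPP + PD)         # F score = 0.634
odds = (TP * TN) / (FP * FN)             # odds ratio = 3.51

# Cohen's kappa from proportions (rows = predicted, columns = true)
p_pred_pos, p_pred_neg = (TP + FP) / N, (FN + TN) / N
p_true_pos, p_true_neg = (TP + FN) / N, (FP + TN) / N
Pe = p_pred_pos * p_true_pos + p_pred_neg * p_true_neg
kappa = (CCR - Pe) / (1 - Pe)            # = 0.301 (poor)
print(PD, PF, PPP, NPP, CCR, F1, odds, kappa)
```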


• One-versus-One

– Generates C(C-1)/2

confusion matrices

• C1 vs. C2

• C1 vs. C3

• C2 vs. C3

• One-versus-All

– Generates C

confusion matrices

• C1 vs. C2 & C3

• C2 vs. C1 & C3

• C3 vs. C1 & C2

[Figure: the one-versus-one confusion matrices and the one-versus-all confusion matrices, summed to create a single 2x2 matrix]

Confusion Matrices for Multiple Classes - 1

Confusion Matrix for C Classes


• Some-versus-Rest

– Generates 2^(C−1) − 1 confusion matrices

– Both true and false classifications may be sums

– Here is a four class example

C1 vs. C2 & C3 & C4

C2 vs. C1 & C3 & C4

C1 & C2 vs. C3 & C4

C3 vs. C1 & C2 & C4

C1 & C3 vs. C2 & C4

C2 & C3 vs. C1 & C4

C1 & C2 & C3 vs. C4


Confusion Matrices for Multiple Classes - 2

Can form the basis for code book based classifiers


• Cobweb

– Illustrates the probability that

each class will be predicted

incorrectly (the off diagonal cells

of the confusion matrix)

– Shows relative performance

between classes for each

classifier

– High performance classifiers

have poor visibility in cobweb

– May be difficult to interpret for

high numbers of classes

• c(c-1)/2 rays

Cobweb


Other Metrics of Performance Assessment

• Fawcett's Extension

    – Sums the areas under the ROC curves for each class versus the rest, multiplied by the probability of that class:
      $AUC_F = \sum_{i=1}^{C} P(z=i)\,AUC(i, \text{rest})$;  AUC = area under the ROC curve

• Hand and Till Function

    – Averages the areas under the curves for each pair of classes:
      $HT = \dfrac{2}{C(C-1)}\sum_{i=1}^{C}\sum_{j<i} AUC(i, j)$

• Macro-Average Modified

    – Uses both the geometric mean and the average correct classification rate:
      $MAVG_{MOD} = 0.75\Big(\prod_{i=1}^{C} P_{ii}\Big)^{1/C} + 0.25\Big(\dfrac{1}{C}\sum_{i=1}^{C} P_{ii}\Big)$


• Comparing Classifiers

Suppose we have Classifiers A and B

nA= number of errors made by A but not by B

nB= number of errors made by B but not by A

McNemar’s Test: Check if

Comparing Classifiers

    $\dfrac{|n_A - n_B| - 1}{\sqrt{n_A + n_B}} \sim N(0,1)$  (approximately, under the null hypothesis that A and B have the same error rate)

Need | nA- nB|>5 for a significant difference

To detect 1% difference in error rates, need at least 500 samples
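A short sketch of the test above; n_A and n_B are the discordant error counts, and the statistic is compared with a standard-normal quantile (1.96 for a two-sided 5% test).

```python
import numpy as np
from scipy.stats import norm

def mcnemar(n_A, n_B, alpha=0.05):
    """Return the McNemar statistic and whether the difference is significant."""
    z = (abs(n_A - n_B) - 1) / np.sqrt(n_A + n_B)     # ~ N(0,1) under H0
    return z, z > norm.ppf(1 - alpha / 2)

print(mcnemar(30, 10))    # z ~ 3.0 -> significant difference (|n_A - n_B| = 20 > 5)
```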


References on Performance Assessment

1. Alpaydin, Ethem. Introduction to Machine Learning. MIT Press, Cambridge, 2004.

2. Kuncheva, Ludmila I. Combining Pattern Classifiers: Methods and Algorithms. John Wiley and Sons: Hoboken, NJ, 2004.

3. Ferri, C. "Volume Under the ROC Surface for Multi-class Problems." Univ. Politecnica de Valencia, Spain, 2003.

4. D. Mossman. "Three-way ROCs," Medical Decision Making, 19(1):78-89, 1999.

5. A. Patel and M. K. Markey, "Comparison of three-class classification performance metrics: a case study in breast cancer CAD," Medical Imaging 2005: Image Processing.

6. Ferri, C., Hernandez, J., and Salido, M. A., "Volume Under the ROC Surface for Multiclass Problems. Exact Computation and Evaluation of Approximations," Technical Report, Univ. Politec. Valencia, 2003. http://www.dsic.upv.es/users/elp/cferri/VUS.pdf.

7. Mooney, C.Z. and R.D. Duval, 1993, Bootstrapping: A Non-Parametric Approach to Statistical Inference. Newbury Park, CA: Sage Publications.

8. J. K. Martin and D. S. Hirschberg. "Small sample statistics for classification error rates, I: error rate measurements." Technical Report 96-21, Dept. of Information & Computer Science, University of California, Irvine, 1996.


Estimating Parameters of Densities From Data

Maximum Likelihood Methods

Bayesian Learning

Probability Density Estimation

Histogram Methods

Parzen Windows

Probabilistic Neural Network

k-nearest Neighbor Approach

Mixture Models

Estimate parameters via EM and VBEM algorithm

Various interpretations of EM

Performance Assessment of Classifiers

Summary