
Copyright © 2001-2018 K.R. Pattipati

Fall 2018

October 1, 8 & 15, 2018

Prof. Krishna R. Pattipati

Dept. of Electrical and Computer Engineering

University of Connecticut Contact: krishna@engr.uconn.edu (860) 486-2890

Lectures 4 and 5: ML and Bayesian Learning, Density Estimation, Gaussian Mixtures and EM
and Variational Bayesian Inference, Performance Assessment of Classifiers

Copyright © 2001-2018 K.R. Pattipati

• Duda, Hart and Stork, Sections 3.1-3.6, Chapter 4

• Bishop, Section 2.5, Chapter 9, Sections 10.1, 10.2

• Murphy, Chapter 4, Chapter 11, Section 21.6

• Theodoridis, Chapter 7, Chapter 12

Reading List

Copyright © 2001-2018 K.R. Pattipati

Estimating Parameters of Densities From Data

Maximum Likelihood Methods

Bayesian Learning

Estimating Probability Densities (Nonparametric)

Histogram Methods

Parzen Windows

Probabilistic Neural Network

k-nearest Neighbor Approach

Mixture Models and EM

Performance Assessment of Classifiers

Lecture Outline

Copyright © 2001-2018 K.R. Pattipati

Recall Bayesian Classifiers

Need to know $\{P(z=j),\; p(x \mid z=j, \theta_j)\}$. We usually estimate them from data.

Parametric Methods (assume a form for density)

Nonparametric Methods: estimate density via Parzen windows, PNN, RCE, NN, k-NN, ...

Mixture Models: mixtures of Gaussians, Multinoullis, Multinomials, Student-t, ...

[Graphical model: hidden categorical variable $z$ generates the observed feature vector $x$, with parameters $\{\mu_k, \Sigma_k\}$.]

Copyright © 2001-2018 K.R. Pattipati

Estimating Parameters of Densities from Data

Estimating parameters of densities from data: $\{p(x \mid z=k, \theta_k)\}$ and $\{P(z=k) = \pi_k\}$, $k = 1, 2, \ldots, C$.

For example, in the Gaussian case
$\theta_k = \{(\mu_k, \Sigma_k)\}$ (general case), or $(\{\mu_k\}, \Sigma)$ (hyperellipsoid case), or $(\{\mu_k\}, \sigma^2 I_p)$ (hypersphere case).

Data: $D_k = \{x_k^1, x_k^2, x_k^3, \ldots, x_k^{n_k}\}$, $k = 1, 2, 3, \ldots, C$. Let $n_k$ = number of samples from class $k$, and $N = \sum_{k=1}^{C} n_k$.

Assuming samples are independent,

$L(\theta) = p(D \mid \theta) = \prod_{k=1}^{C}\prod_{j=1}^{n_k} p(x_k^j \mid z=k, \theta_k)\, P(z=k)$

$l(\theta) = \ln L(\theta) = \ln p(D \mid \theta) = \sum_{k=1}^{C}\sum_{j=1}^{n_k}\ln p(x_k^j \mid z=k, \theta_k) + \sum_{k=1}^{C} n_k \ln \pi_k$

Copyright © 2001-2018 K.R. Pattipati

Optimal ML Estimate of $\pi_k$

Optimal $\pi_k$: $\max_{\{\pi_k\}}\; \sum_{k=1}^{C} n_k \ln \pi_k$ s.t. $\sum_{k=1}^{C}\pi_k = 1$

Lagrangian: $\max_{\{\pi_k\}}\left[\sum_{k=1}^{C} n_k\ln\pi_k + \lambda\left(1 - \sum_{k=1}^{C}\pi_k\right)\right]
\;\Rightarrow\; \frac{n_k}{\hat{\pi}_k} = \lambda
\;\Rightarrow\; \hat{\pi}_k = \frac{n_k}{\sum_{k=1}^{C} n_k} = \frac{n_k}{N}$

Fraction of samples of class k. Intuitively appealing.

What if some classes did not have samples in the training data?
Zero count or sparse data or black swan problem.

Laplace rule of succession or add-one smoothing: $\hat{\pi}_k = \dfrac{n_k + 1}{N + C}$

Recall: $\ln x$ is concave; $-\ln x$ is convex.
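As a quick illustration of the two estimators above, here is a minimal Python sketch (NumPy only; the label array and class count are hypothetical toy inputs, not from the lecture):

```python
import numpy as np

def class_priors(labels, C, smoothing=False):
    """ML estimate of class priors; optionally Laplace add-one smoothing."""
    N = len(labels)
    counts = np.array([(labels == k).sum() for k in range(C)])
    if smoothing:
        return (counts + 1) / (N + C)    # (n_k + 1) / (N + C)
    return counts / N                    # n_k / N

labels = np.array([0, 0, 1, 2, 2, 2])    # toy data; class 3 is unseen
print(class_priors(labels, C=4))                   # [0.333 0.167 0.5   0.  ]
print(class_priors(labels, C=4, smoothing=True))   # [0.3   0.2   0.4   0.1 ]
```

With smoothing, the unseen class gets a small nonzero prior instead of zero, which is exactly the black-swan fix described above.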

Copyright © 2001-2018 K.R. Pattipati

Optimal $\mu_k$: Hyperellipsoid Case

Optimal ML estimate of $\mu_k$ (when the covariance matrices for all classes are taken to be equal, i.e., $\Sigma_i = \Sigma$):

$l(\{\mu_k\}, \Sigma) = \sum_{k=1}^{C}\sum_{j=1}^{n_k}\ln p(x_k^j \mid z=k, \mu_k, \Sigma)
= -\frac{N}{2}\ln|\Sigma| - \frac{1}{2}\sum_{k=1}^{C}\sum_{j=1}^{n_k}(x_k^j - \mu_k)^T\Sigma^{-1}(x_k^j - \mu_k) + \text{const}$

$\frac{\partial l}{\partial \mu_k} = \sum_{j=1}^{n_k}\hat{\Sigma}^{-1}(x_k^j - \hat{\mu}_k) = 0
\;\Rightarrow\; \hat{\mu}_k = \frac{1}{n_k}\sum_{j=1}^{n_k}x_k^j, \quad k = 1, 2, \ldots, C$

ML estimate for the mean is just the sample mean!

(Matrix-calculus aside used here: $\partial\,\mathrm{tr}(A^T A)/\partial A = 2A$.)

Copyright © 2001-2018 K.R. Pattipati

Optimal $\Sigma$

We know $\dfrac{\partial \ln|\Sigma^{-1}|}{\partial \Sigma^{-1}} = \Sigma$, so

$\frac{\partial l}{\partial \Sigma^{-1}} = \frac{N}{2}\hat{\Sigma} - \frac{1}{2}\sum_{k=1}^{C}\sum_{j=1}^{n_k}(x_k^j - \hat{\mu}_k)(x_k^j - \hat{\mu}_k)^T = 0
\;\Rightarrow\; \hat{\Sigma} = \frac{1}{N}\sum_{k=1}^{C}\sum_{j=1}^{n_k}(x_k^j - \hat{\mu}_k)(x_k^j - \hat{\mu}_k)^T$

The ML estimate of the covariance matrix is the arithmetic average of $(x_k^j - \hat{\mu}_k)(x_k^j - \hat{\mu}_k)^T$ over all classes.

We can show that $E[\hat{\Sigma}] = \dfrac{N-C}{N}\Sigma$, so use
$\hat{\Sigma} = \dfrac{1}{N-C}\sum_{k=1}^{C}\sum_{j=1}^{n_k}(x_k^j - \hat{\mu}_k)(x_k^j - \hat{\mu}_k)^T$

Aside ($A$ is PD): $|A| = \exp(\ln|A|) = \exp(\mathrm{tr}(\ln A)) = \prod_{i=1}^{n}\lambda_i$;
$\ln|A| = \mathrm{tr}(\ln A)$; $\dfrac{\partial\ln|A|}{\partial A} = A^{-1}$.

Copyright © 2001-2018 K.R. Pattipati

Proof of $E[\hat{\Sigma}] = \frac{N-C}{N}\Sigma$

$\hat{\Sigma} = \frac{1}{N}\sum_{k=1}^{C}\sum_{j=1}^{n_k}(x_k^j - \hat{\mu}_k)(x_k^j - \hat{\mu}_k)^T$, and we know $\hat{\mu}_k = \frac{1}{n_k}\sum_{j=1}^{n_k}x_k^j$, so

$\hat{\Sigma} = \frac{1}{N}\sum_{k=1}^{C}\left[\sum_{j=1}^{n_k}x_k^j x_k^{jT} - n_k\hat{\mu}_k\hat{\mu}_k^T\right]$

Taking expectations, with $E[x_k^j x_k^{jT}] = \Sigma + \mu_k\mu_k^T$ and $E[\hat{\mu}_k\hat{\mu}_k^T] = \frac{1}{n_k}\Sigma + \mu_k\mu_k^T$:

$E[\hat{\Sigma}] = \frac{1}{N}\sum_{k=1}^{C}\left[n_k(\Sigma + \mu_k\mu_k^T) - n_k\left(\frac{1}{n_k}\Sigma + \mu_k\mu_k^T\right)\right]
= \frac{1}{N}\left(N\Sigma - C\Sigma\right) = \frac{N-C}{N}\Sigma$

Copyright © 2001-2018 K.R. Pattipati

Shrinkage Methods (Regularized Discriminant Analysis)

Linear discriminants may outperform quadratic discriminants when training data is small.

Shrink the covariance matrices towards a common value: possibly biased, but results in a less variable estimator (convex combination of $\hat{\Sigma}_k$ and $\hat{\Sigma}$):

$\hat{\Sigma}_k(\alpha) = \dfrac{\alpha\, n_k\hat{\Sigma}_k + (1-\alpha)\,N\hat{\Sigma}}{\alpha\, n_k + (1-\alpha)\,N}$

$\hat{\Sigma}(\gamma) = \gamma\hat{\Sigma} + (1-\gamma)\,\hat{\sigma}^2 I; \quad 0 \le \gamma \le 1, \quad \hat{\sigma}^2 = \frac{1}{p}\mathrm{tr}(\hat{\Sigma})$

$\hat{\Sigma}_k(\alpha,\gamma) = (1-\gamma)\hat{\Sigma}_k(\alpha) + \gamma\,\hat{\sigma}_k^2(\alpha)\, I$

If variables are scaled to have zero mean and unit variance: $\hat{\Sigma}(\gamma) = \gamma\hat{\Sigma} + (1-\gamma)\,\mathrm{diag}(\hat{\Sigma})$. This has a Bayesian (MAP) interpretation.

Select $\alpha$ and $\gamma$ to minimize the error rate via cross validation.

Copyright © 2001-2018 K.R. Pattipati

ML-based Discriminants (Unequal Covariance Case)

$\hat{\mu}_k = \frac{1}{n_k}\sum_{j=1}^{n_k}x_k^j; \qquad
\hat{\Sigma}_k = \frac{1}{n_k}\sum_{j=1}^{n_k}(x_k^j - \hat{\mu}_k)(x_k^j - \hat{\mu}_k)^T$

As $N \to \infty$, they approach the performance of the optimal Bayesian classifier. However, for finite $N$, a linear rule may outperform a quadratic one!

Note that we need $n_k \gg p$ for each class. If not, use shrinkage methods.

ML-estimated quadratic rule ($C + Cp + Cp(p+1)/2$ parameters):
$\hat{z} = \arg\max_{k\in\{1,2,\ldots,C\}}\left[\ln\hat{\pi}_k - \tfrac{1}{2}\ln|\hat{\Sigma}_k| - \tfrac{1}{2}(x - \hat{\mu}_k)^T\hat{\Sigma}_k^{-1}(x - \hat{\mu}_k)\right]$

ML-estimated best linear rule ($C + Cp + p(p+1)/2$ parameters):
$\hat{z} = \arg\max_{k\in\{1,2,\ldots,C\}}\left[\ln\hat{\pi}_k + \hat{\mu}_k^T\hat{\Sigma}^{-1}x - \tfrac{1}{2}\hat{\mu}_k^T\hat{\Sigma}^{-1}\hat{\mu}_k\right]$
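A minimal NumPy sketch of the two decision rules above, assuming the class means, covariances, and priors have already been estimated (the names below are illustrative):

```python
import numpy as np

def quadratic_score(x, mu_k, Sigma_k, pi_k):
    """ln pi_k - 0.5 ln|Sigma_k| - 0.5 (x-mu_k)^T Sigma_k^{-1} (x-mu_k)."""
    d = x - mu_k
    sign, logdet = np.linalg.slogdet(Sigma_k)
    return np.log(pi_k) - 0.5 * logdet - 0.5 * d @ np.linalg.solve(Sigma_k, d)

def linear_score(x, mu_k, Sigma, pi_k):
    """Linear discriminant with a shared covariance Sigma."""
    w = np.linalg.solve(Sigma, mu_k)          # Sigma^{-1} mu_k
    return np.log(pi_k) + w @ x - 0.5 * w @ mu_k

# classify by picking the class with the largest score, e.g.
# z_hat = np.argmax([quadratic_score(x, mu[k], Sig[k], pi[k]) for k in range(C)])
```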

Copyright © 2001-2018 K.R. Pattipati

Recursive Estimation of Parameters-1

What if "big streaming data"? Recursive estimation of the mean:
$\hat{\mu}_k^n = \left(1 - \frac{1}{n}\right)\hat{\mu}_k^{n-1} + \frac{1}{n}x_k^n$

• In general, we can use the SA (stochastic approximation) algorithm. Since
$\frac{1}{n_k}\sum_{j=1}^{n_k}\nabla_\theta\ln p(x_k^j \mid z=k, \theta) = 0$ and $E\left[\nabla_\theta\ln p(x \mid z=k, \theta)\right] = 0$,

$\hat{\theta}_k^n = \hat{\theta}_k^{n-1} + \eta_n\,\nabla_\theta\ln p(x_k^n \mid z=k, \theta)\big|_{\hat{\theta}_k^{n-1}}$

Idea of Stochastic Approximation: find the roots of $f(\theta) = E[g(\theta) \mid \theta] = u$; assume $E[(g - f)^2 \mid \theta] \le \sigma^2$:

$\theta^{n+1} = \theta^n + \eta_n\left[u - g(\theta^n)\right]$, with
$\lim_{n\to\infty}\eta_n = 0, \qquad \sum_{n=1}^{\infty}\eta_n = \infty, \qquad \sum_{n=1}^{\infty}\eta_n^2 < \infty$

Gains such as $\eta_n = \frac{1}{n}$ or $\eta_n = \frac{K}{(n+1)^m}$ with $\frac{1}{2} < m \le 1$ satisfy the SA conditions.
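A minimal sketch of the running-mean recursion above (NumPy only; the streaming loop and names are illustrative):

```python
import numpy as np

def running_mean(stream):
    """mu_n = (1 - 1/n) mu_{n-1} + (1/n) x_n, computed in one pass."""
    mu = None
    for n, x in enumerate(stream, start=1):
        x = np.asarray(x, dtype=float)
        mu = x.copy() if mu is None else (1 - 1 / n) * mu + x / n
        yield mu

data = np.random.default_rng(0).normal(size=(1000, 2))
for mu in running_mean(data):
    pass
print(mu, data.mean(axis=0))   # the recursive and batch means agree
```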

Copyright © 2001-2018 K.R. Pattipati

Recursive Estimation of Parameters-2 (Unequal Covariance Case)

Batch estimates for class $k$:
$\hat{\mu}_k = \frac{1}{n_k}\sum_{j=1}^{n_k}x_k^j; \qquad
\hat{\Sigma}_k = \frac{1}{n_k}\sum_{j=1}^{n_k}(x_k^j - \hat{\mu}_k)(x_k^j - \hat{\mu}_k)^T$

Mean recursion ($n$ = number of samples of class $k$ seen so far):
$\hat{\mu}_k^n = \left(1 - \frac{1}{n}\right)\hat{\mu}_k^{n-1} + \frac{1}{n}x_k^n
= \hat{\mu}_k^{n-1} + \frac{1}{n}\left(x_k^n - \hat{\mu}_k^{n-1}\right)$

Covariance recursion for class $k$ (expanding the $\frac{1}{n-1}$-normalized estimate $\hat{\Sigma}_k^n = \frac{1}{n-1}\sum_{j=1}^{n}(x_k^j - \hat{\mu}_k^n)(x_k^j - \hat{\mu}_k^n)^T$ and substituting the mean recursion):

$\hat{\Sigma}_k^n = \frac{n-2}{n-1}\,\hat{\Sigma}_k^{n-1} + \frac{1}{n}\left(x_k^n - \hat{\mu}_k^{n-1}\right)\left(x_k^n - \hat{\mu}_k^{n-1}\right)^T$

Copyright © 2001-2018 K.R. Pattipati

Recursive Estimation of Parameters-3: Covariance Recursion in the Hyperellipsoid Case

Update means via
$\hat{\mu}_i^{n_i} = \left(1 - \frac{1}{n_i}\right)\hat{\mu}_i^{n_i-1} + \frac{1}{n_i}x_i^{n_i}; \quad i = 1, 2, \ldots, C$

For the pooled covariance $\hat{\Sigma} = \frac{1}{n-C}\sum_{k=1}^{C}\sum_{j=1}^{n_k}(x_k^j - \hat{\mu}_k)(x_k^j - \hat{\mu}_k)^T$ (with $n$ the total number of samples), suppose the $n$th sample is from class $l$. Then

$\hat{\Sigma}^{n} = \frac{n-1-C}{n-C}\,\hat{\Sigma}^{n-1} + \frac{1}{n-C}\cdot\frac{n_l-1}{n_l}\left(x_l^{n_l} - \hat{\mu}_l^{n_l-1}\right)\left(x_l^{n_l} - \hat{\mu}_l^{n_l-1}\right)^T$

Copyright © 2001-2018 K.R. Pattipati

Bayesian Learning

Posterior probability $P(z=i \mid x)$ is the key.

Have data: $D_k = \{x_k^1, x_k^2, \ldots, x_k^{n_k}\}$, $k = 1, 2, 3, \ldots, C$; $D = \{D_1, D_2, \ldots, D_C\}$. We can only get $P(z=k \mid x, D_k)$:

$P(z=k \mid x, D) = \dfrac{p(x \mid z=k, D_k)\,P(z=k)}{\sum_{i=1}^{C}p(x \mid z=i, D_i)\,P(z=i)}$

Class conditional density and posterior density of the parameters:

$p(x \mid z=k, D_k) = \int p(x \mid z=k, \theta)\,p(\theta \mid z=k, D_k)\,d\theta$

$p(\theta \mid z=k, D_k) = \dfrac{p(D_k \mid z=k, \theta)\,p(\theta \mid z=k)}{\int p(D_k \mid z=k, \theta)\,p(\theta \mid z=k)\,d\theta}
= \dfrac{\prod_{j=1}^{n_k}p(x_k^j \mid z=k, \theta)\,p(\theta \mid z=k)}{\int\prod_{j=1}^{n_k}p(x_k^j \mid z=k, \theta)\,p(\theta \mid z=k)\,d\theta}$

If $p(D_k \mid z=k, \theta)$ has a sharp peak at $\hat{\theta}$, then so does $p(\theta \mid z=k, D_k)$.
Tough to compute unless a reproducing density. Need simulation.

Copyright © 2001-2018 K.R. Pattipati

Illustration of Bayesian Learning - 1

Bayesian learning of the mean of Gaussian distributions in one dimension.

Strong prior (small variance): the posterior mean "shrinks" towards the prior mean.
Weak prior (large variance): the posterior mean is similar to the MLE.

[Figure: prior, likelihood, and posterior curves for prior variance 1.00 (left) and 5.00 (right).]

gaussInferParamsMean1d from Murphy, Page 121

Copyright © 2001-2018 K.R. Pattipati

Illustration of Bayesian Learning - 2

Bayesian learning of the mean of Gaussian distributions in one and two dimensions.

As the number of samples increases, the posterior density peaks at the true value.

Copyright © 2001-2018 K.R. Pattipati

Sensor Fusion

Bayesian sensor fusion appropriately combines measurements from multiple sensors

based on the uncertainty of the sensor measurements. Larger uncertainty ⇒ less weight.

sensorFusion2d from Murphy, Page 123

[Figure, three panels: (a) equally reliable sensors (R, G); (b) G is more reliable than R; (c) R is more reliable in y, G is more reliable in x. The fused estimate is shown in black in each panel.]

Copyright © 2001-2018 K.R. Pattipati

Recursive Bayesian Learning

Likelihood recursion: let $D_k^{n_k} = \{x_k^1, x_k^2, \ldots, x_k^{n_k-1}, x_k^{n_k}\} = \{D_k^{n_k-1}, x_k^{n_k}\}$. Then
$p(D_k^{n_k} \mid z=k, \theta) = p(x_k^{n_k} \mid z=k, \theta)\,p(D_k^{n_k-1} \mid z=k, \theta)$

Recursion for the posterior density:
$p(\theta \mid z=k, D_k^{n_k}) = \dfrac{p(x_k^{n_k} \mid z=k, \theta)\,p(\theta \mid z=k, D_k^{n_k-1})}{\int p(x_k^{n_k} \mid z=k, \theta)\,p(\theta \mid z=k, D_k^{n_k-1})\,d\theta},
\qquad p(\theta \mid z=k, D_k^{0}) = p(\theta \mid z=k)$

Problem: do we need to store all training samples to calculate $p(\theta \mid z=k, D_k^{n_k-1})$?

For the exponential family (e.g., Gaussian, exponential, Rayleigh, Gamma, Beta, Poisson, Bernoulli, Binomial, Multinomial) we need only a few parameters to characterize $p(\theta \mid z=k, D_k^{n_k-1})$. They are called sufficient statistics.

Copyright © 2001-2018 K.R. Pattipati

Recursive Learning of Mean and Covariance

Want to learn both the mean vector $\mu$ and the covariance $\Sigma$. Use the conjugate NIW (Normal-Inverse-Wishart) prior:

$p(x \mid \mu, \Sigma) = N(\mu, \Sigma)$ and
$p(\mu, \Sigma) = NIW(\mu, \Sigma \mid m_0, k_0, v_0, S_0)
= \underbrace{N\!\left(\mu;\, m_0, \tfrac{1}{k_0}\Sigma\right)}_{\text{Gaussian}}\cdot\underbrace{IW(\Sigma;\, S_0, v_0)}_{\text{inverse Wishart}}$

$IW(\Sigma; S_0, v_0) \propto |\Sigma|^{-(v_0+p+1)/2}\exp\!\left(-\tfrac{1}{2}\mathrm{tr}(S_0\Sigma^{-1})\right)$
(The Wishart is a generalization of the Gamma and chi-squared distributions.)

Given $D^n = \{x^1, x^2, \ldots, x^n\}$, the posterior stays in the same family:
$p(\mu, \Sigma \mid D^n) = NIW(\mu, \Sigma \mid m_n, k_n, v_n, S_n)$ with
$k_n = k_0 + n; \qquad v_n = v_0 + n; \qquad
m_n = \frac{k_0 m_0 + n\bar{x}}{k_0 + n} = m_{n-1} + \frac{1}{k_n}\left(x^n - m_{n-1}\right)$

$S_n = S_0 + \sum_{i=1}^{n}(x^i - \bar{x})(x^i - \bar{x})^T + \frac{k_0 n}{k_0 + n}(\bar{x} - m_0)(\bar{x} - m_0)^T$ (HW: get a recursive expression)

Copyright © 2001-2018 K.R. Pattipati

Bayesian Generalization of Laplacian Smoothing

Recall the likelihood: $L(\{\pi_k\}) = \prod_{k=1}^{C}\pi_k^{n_k}$, i.e., $p(D \mid \pi) \propto \prod_{k=1}^{C}\pi_k^{n_k}$; $N = \sum_{k=1}^{C}n_k$.

Conjugate prior: Dirichlet distribution
$p(\pi \mid \alpha) = Dir(\pi \mid \alpha_1, \ldots, \alpha_C) = \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_1)\cdots\Gamma(\alpha_C)}\prod_{k=1}^{C}\pi_k^{\alpha_k - 1}; \qquad \alpha_0 = \sum_{k=1}^{C}\alpha_k$

Posterior:
$p(\pi \mid D, \alpha) = \frac{p(D \mid \pi)\,p(\pi \mid \alpha)}{p(D \mid \alpha)} = Dir(\pi \mid \alpha_1 + n_1, \ldots, \alpha_C + n_C)$

MAP estimate: $\hat{\pi}_k^{MAP} = \dfrac{n_k + \alpha_k - 1}{N + \alpha_0 - C}$;
Conditional mean (MMSE estimate): $\hat{\pi}_k^{MMSE} = \dfrac{n_k + \alpha_k}{N + \alpha_0}$
(With a symmetric prior $\alpha_k = 1$, the MMSE estimate reduces to the add-one rule $(n_k+1)/(N+C)$.)

Mode and mean are not the same here!
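A small sketch of the two point estimates above for a symmetric Dirichlet prior (NumPy only; the counts and alpha below are illustrative):

```python
import numpy as np

def dirichlet_estimates(counts, alpha=1.0):
    """MAP and posterior-mean (MMSE) estimates under a symmetric Dirichlet(alpha) prior."""
    counts = np.asarray(counts, dtype=float)
    N, C = counts.sum(), len(counts)
    mmse = (counts + alpha) / (N + C * alpha)
    map_ = (counts + alpha - 1) / (N + C * alpha - C)   # requires alpha >= 1
    return map_, mmse

map_, mmse = dirichlet_estimates([5, 0, 3], alpha=1.0)
print(map_)    # [0.625 0.    0.375]  (alpha = 1: MAP coincides with the ML estimate)
print(mmse)    # [0.545 0.091 0.364]  (add-one smoothing)
```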

Copyright © 2001-2018 K.R. Pattipati

Operations on Gaussians - 1

Sum of Gaussian vectors is Gaussian:
$x_1 \sim N(\mu_1, \Sigma_{11}),\; x_2 \sim N(\mu_2, \Sigma_{22}),\; \mathrm{cov}(x_1, x_2) = \Sigma_{12}
\;\Rightarrow\; x_1 + x_2 \sim N(\mu_1 + \mu_2,\; \Sigma_{11} + \Sigma_{22} + \Sigma_{12} + \Sigma_{12}^T)$

Linear transformations of Gaussians are Gaussian:
$x \sim N(\mu, \Sigma),\; y = Ax \;\Rightarrow\; y \sim N(A\mu,\; A\Sigma A^T)$

Marginal and conditional densities: let
$x = \begin{bmatrix}x_1\\ x_2\end{bmatrix} \sim N(\mu, \Sigma), \quad
\mu = \begin{bmatrix}\mu_1\\ \mu_2\end{bmatrix}, \quad
\Sigma = \begin{bmatrix}\Sigma_{11} & \Sigma_{12}\\ \Sigma_{12}^T & \Sigma_{22}\end{bmatrix}, \quad
J = \Sigma^{-1} = \begin{bmatrix}J_{11} & J_{12}\\ J_{12}^T & J_{22}\end{bmatrix}$

Then the marginal and conditional densities are also Gaussian:
$p(x_2) = N(\mu_2, \Sigma_{22})$
$p(x_1 \mid x_2) = N\!\left(\mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2),\; \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{12}^T\right)
= N\!\left(\mu_1 - J_{11}^{-1}J_{12}(x_2 - \mu_2),\; J_{11}^{-1}\right)$

This is what happens in Least Squares & Kalman filtering. If $x_1$ is scalar (e.g., Gibbs sampling), the information form could save computation.

Block-inverse identities:
$J_{11} = \left(\Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{12}^T\right)^{-1}; \quad
J_{22} = \left(\Sigma_{22} - \Sigma_{12}^T\Sigma_{11}^{-1}\Sigma_{12}\right)^{-1}; \quad
J_{12} = -J_{11}\Sigma_{12}\Sigma_{22}^{-1} = -\Sigma_{11}^{-1}\Sigma_{12}J_{22}$
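A minimal NumPy check of the conditioning formula above, using the same 2-D example ($\rho = 0.8$) that appears on the next slide (function name and indexing are illustrative):

```python
import numpy as np

def gauss_condition(mu, Sigma, idx1, idx2, x2):
    """p(x1 | x2) for a jointly Gaussian vector: returns (mean, covariance)."""
    mu1, mu2 = mu[idx1], mu[idx2]
    S11 = Sigma[np.ix_(idx1, idx1)]
    S12 = Sigma[np.ix_(idx1, idx2)]
    S22 = Sigma[np.ix_(idx2, idx2)]
    K = S12 @ np.linalg.inv(S22)            # gain Sigma_12 Sigma_22^{-1}
    return mu1 + K @ (x2 - mu2), S11 - K @ S12.T

mu = np.zeros(2)
Sigma = np.array([[1.0, 0.8], [0.8, 1.0]])
m, v = gauss_condition(mu, Sigma, [0], [1], np.array([1.0]))
print(m, v)   # [0.8], [[0.36]]  ->  N(0.8, 0.36), as on the next slide
```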

Copyright © 2001-2018 K.R. Pattipati

Marginals & Conditionals of a 2D Gaussian

$x = \begin{bmatrix}x_1\\ x_2\end{bmatrix} \sim N(\mu, \Sigma); \quad \mu = 0; \quad \Sigma = \begin{bmatrix}1 & 0.8\\ 0.8 & 1\end{bmatrix}$

$p(x_1) = N(\mu_1, \sigma_1^2) = N(0, 1)$
$p(x_1 \mid x_2) = N\!\left(\mu_1 + \frac{\sigma_{12}}{\sigma_2^2}(x_2 - \mu_2),\; \sigma_1^2(1-\rho^2)\right)$. If $x_2 = 1$, $p(x_1 \mid x_2 = 1) = N(0.8, 0.36)$.

[Figure: joint $p(x_1, x_2)$, marginal $p(x_1)$, and conditional $p(x_1 \mid x_2 = 1)$; gaussCondition2Ddemo2 from Murphy, Page 112.]

$J = \Sigma^{-1} = \frac{1}{\sigma_1^2\sigma_2^2(1-\rho^2)}\begin{bmatrix}\sigma_2^2 & -\rho\sigma_1\sigma_2\\ -\rho\sigma_1\sigma_2 & \sigma_1^2\end{bmatrix}
= \begin{bmatrix}2.778 & -2.222\\ -2.222 & 2.778\end{bmatrix}$

Copyright © 2001-2018 K.R. Pattipati

Operations on Gaussians - 2

Covariance matrix captures marginal independencies between variables:
$\Sigma_{ij} = 0 \iff x_i$ and $x_j$ are independent ($x_i \perp x_j$)

Information matrix captures conditional independencies; non-zero entries in $J$ correspond to edges in the dependency network:
$J_{ij} = 0 \iff x_i \perp x_j \mid \{x_1, x_2, \ldots, x_n\}\setminus\{x_i, x_j\}$

Product of Gaussian PDFs is proportional to a Gaussian (valid for division also). In information (canonical) form,
$p_i(x) = N(x; \mu_i, \Sigma_i) = \frac{|J_i|^{1/2}}{(2\pi)^{p/2}}\exp\!\left\{-\tfrac{1}{2}(x - \mu_i)^T J_i (x - \mu_i)\right\}
= \exp\!\left(\eta_i^T x - \tfrac{1}{2}x^T J_i x + A_i\right)$
with $J_i = \Sigma_i^{-1}$, $\eta_i = \Sigma_i^{-1}\mu_i$, $A_i = -\tfrac{1}{2}\left(\mu_i^T J_i\mu_i + p\ln 2\pi - \ln|J_i|\right)$.

$\prod_{i=1}^{n} N(x; \mu_i, \Sigma_i) \propto N(x; \mu, \Sigma)$ where $J = \Sigma^{-1} = \sum_{i=1}^{n}J_i$ and $J\mu = \sum_{i=1}^{n}J_i\mu_i$.

Copyright © 2001-2018 K.R. Pattipati

Sampling From a Multivariate Gaussian

$x \sim N(\mu, \Sigma) = N^{-1}(\eta, J)$; $J = \Sigma^{-1}$ (precision or information matrix)

Method 1: eigendecomposition $\Sigma = Q\Lambda Q^T$; $z \sim N(0, I)$; $x = \mu + Q\Lambda^{1/2}z$
Method 2: Cholesky decomposition $\Sigma = LL^T$; $z \sim N(0, I)$; $x = \mu + Lz$
Method 3: given the Cholesky decomposition of the precision matrix, $J = BB^T$: $z \sim N(0, I)$; solve $B^T y = z$; $x = \mu + y$
Method 4: Gibbs sampler using the covariance matrix / precision matrix. Let $x \sim N(\mu, \Sigma) = N^{-1}(\eta, J)$; then
$p(x_i \mid x_{-i}) \sim N\!\left(\mu_i - \frac{1}{J_{ii}}J_{i,-i}(x_{-i} - \mu_{-i}),\; \frac{1}{J_{ii}}\right)$

Methods 1 and 2 require $O(n^3)$ computation. Method 3 can exploit sparsity of $J$.
The information form of the Gibbs sampler does not require matrix inversion! Relation to Gauss-Seidel, successive over-relaxation, etc.: https://arxiv.org/pdf/1505.03512.pdf
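A minimal sketch of Methods 2 and 3 above (NumPy only; the toy $\mu$ and $\Sigma$ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6], [0.6, 1.0]])

# Method 2: x = mu + L z, with Sigma = L L^T
L = np.linalg.cholesky(Sigma)
z = rng.standard_normal((100000, 2))
x2 = mu + z @ L.T

# Method 3: J = B B^T, solve B^T y = z, x = mu + y
J = np.linalg.inv(Sigma)
B = np.linalg.cholesky(J)
y = np.linalg.solve(B.T, rng.standard_normal((2, 100000))).T
x3 = mu + y

print(np.cov(x2, rowvar=False))   # both sample covariances are close to Sigma
print(np.cov(x3, rowvar=False))
```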

Copyright © 2001-2018 K.R. Pattipati

ML versus Bayesian Learning

ML:
- Computationally simpler (calculus, optimization)
- Single best model; easier to interpret
- Does not use a priori information on $\theta$
- $p(x \mid z=k, D_k) \approx p(x \mid z=k, \hat{\theta}_k^{ML})$

Bayesian:
- Complex numerical integration (unless a reproducing density)
- Weighted average of models
- Uses a priori information on $\theta$
- $p(x \mid z=k, D_k) = \int p(x \mid z=k, \theta)\,p(\theta \mid z=k, D_k)\,d\theta$

Copyright © 2001-2018 K.R. Pattipati

Density Estimation: Histograms

• Assume $x$ was already standardized; then, to construct a histogram, divide the range into a set of $B$ evenly spread bins. Then
$h(b) = \frac{K_b}{N}, \quad b = 1, 2, \ldots, B$,
where $K_b$ = number of points which fall inside bin $b$ (divide by the bin width to obtain a density estimate).

$B$ is crucial:
  $B$ large → estimated density is spiky ("noisy")
  $B$ small → smoothed density
  There is an optimum choice for $B$.

We can construct the histogram sequentially, considering data one point at a time.

[Figure: histogram estimates with B = 3, 7, and 12 bins.]

Copyright © 2001-2018 K.R. Pattipati

• Estimated density is not smooth (has discontinuities at the boundaries of the bins)

• In high dimensions, we need $B^p$ bins (curse of dimensionality: requires a huge number of data points to estimate the density)

• In high dimensions

– Density is concentrated in a small part of the space

– Most of the bins will be empty ⇒ estimated density = 0

– As the number of dimensions grows, a thin shell of constant thickness on the interior of the sphere ends up containing almost all of its volume ⇒ most of the volume is near the surface!

Major Problems with Histogram Algorithm

Copyright © 2001-2018 K.R. Pattipati

Kernel and k-Nearest Neighbor Methods-1

General idea: suppose we want to find $p(x)$, and suppose we draw $N$ points from $p(x)$.

$\mathrm{Prob}(x \in R) = P = \int_R p(x')\,dx'$

$\mathrm{Prob}(k \text{ of these fall in } R) = \binom{N}{k}P^k(1-P)^{N-k} = B(k; N, P)$

Expected number falling in region $R$:
$E(k) = \sum_{k=0}^{N}k\binom{N}{k}P^k(1-P)^{N-k}
= NP\sum_{k=1}^{N}\binom{N-1}{k-1}P^{k-1}(1-P)^{N-k} = NP$

So, $E[k] = NP$.

Copyright © 2001-2018 K.R. Pattipati

Kernel and k-Nearest Neighbor Methods-2

Expected fraction of points falling in region $R$: $E\!\left[\frac{k}{N}\right] = P$

Variance of $\frac{k}{N}$: $E\!\left[\left(\frac{k}{N} - P\right)^2\right] = \frac{P(1-P)}{N}$

As $N \to \infty$, the variance $\to 0$, so $\frac{k}{N}$ is a good estimate of $P$ . . . . . . (1)

Also, $P = \int_R p(x')\,dx' \approx p(x)V$ . . . . . . (2)

So, $p(x)V \approx \frac{k}{N}$, or $p(x) \approx \frac{k}{NV}$ . . . . . . (3)

Note that $R$ should be large for (1) to hold. However, $R$ should be small for (2) to hold ⇒ an optimal choice for $R$.

Copyright © 2001-2018 K.R. Pattipati

Effect of N on Probability Estimates

We are trying to estimate $P$ via $\frac{k}{N}$. As $N$ increases, the estimate peaks at the true value (it becomes a delta function), and $p(x) \approx \frac{k}{NV}$.

Copyright © 2001-2018 K.R. Pattipati

Kernel and k-Nearest Neighbor Methods-3

There are basically two approaches to using Eq. (3) for density estimation:
  Fix $V$ and determine $k$ from the data ⇒ kernel-based estimation
  Fix $k$ and determine the corresponding volume from the data ⇒ k-nearest neighbor approach

Major disadvantage: needs all the data.

Both of these methods converge to the true densities as $N \to \infty$, provided
  $V \to 0$ as $N \to \infty$ and $NV \to \infty$ as $N \to \infty$
  $k \to \infty$ as $N \to \infty$ and $k/N \to 0$

Typically select $V \propto \frac{1}{\sqrt{N}}$ for kernel-based methods and $k \propto \sqrt{N}$ for k-nearest neighbor methods.

Copyright © 2001-2018 K.R. Pattipati

Selection of V and k

Copyright © 2001-2018 K.R. Pattipati

Kernel Estimators-1

• Suppose we take the region R to be a hypercube with sides of length $h$ centered on the point $x$. Its volume is $V = h^p$.

• We can find an expression for $k$, the number of points which fall within this region, by defining a Kernel Function $H(u)$, also known as the Parzen Window:
$H(u) = \begin{cases}1 & \text{if } |u_i| \le \tfrac{1}{2}, \; i = 1, 2, \ldots, p\\ 0 & \text{otherwise}\end{cases}$

$H(u)$ is a unit hypercube centered at the origin. Gaussian windows could be used as well.

For all data points $x^j$, the quantity $H\!\left(\frac{x - x^j}{h}\right)$ is equal to unity (1) if the point $x^j$ falls inside the hypercube of side $h$ centered at $x$, and 0 otherwise.

Total number of points falling inside the hypercube: $k = \sum_{j=1}^{N}H\!\left(\frac{x - x^j}{h}\right)$

$\hat{p}(x) = \frac{k}{Nh^p} = \frac{1}{Nh^p}\sum_{j=1}^{N}H\!\left(\frac{x - x^j}{h}\right)$
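A minimal NumPy sketch of the Parzen-window estimate above, with both the hypercube window and the Gaussian window of the next slide (the data and bandwidth are illustrative):

```python
import numpy as np

def parzen_density(x, data, h, kernel="cube"):
    """Parzen-window estimate: p_hat(x) = (1/(N h^p)) sum_j H((x - x_j)/h)."""
    data = np.atleast_2d(data)
    N, p = data.shape
    u = (x - data) / h
    if kernel == "cube":                     # unit hypercube window
        H = np.all(np.abs(u) <= 0.5, axis=1).astype(float)
    else:                                    # Gaussian window
        H = np.exp(-0.5 * np.sum(u**2, axis=1)) / (2 * np.pi) ** (p / 2)
    return H.sum() / (N * h**p)

rng = np.random.default_rng(1)
data = rng.normal(size=(500, 1))
print(parzen_density(np.array([0.0]), data, h=0.5, kernel="gauss"))  # ~0.40 near the N(0,1) peak
```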

Copyright © 2001-2018 K.R. Pattipati

Gaussian Parzen Windows

Example of two-dimensional circularly symmetric normal Parzen windows for three different values of h:

$H\!\left(\frac{x}{h}\right) = \frac{1}{(2\pi)^{p/2}h^p}\exp\!\left[-\frac{1}{2}\frac{\|x\|^2}{h^2}\right]$

Copyright © 2001-2018 K.R. Pattipati

Kernel Estimators-2

$\hat{p}(x) = \frac{k}{NV} = \frac{1}{Nh^p}\sum_{j=1}^{N}H\!\left(\frac{x - x^j}{h}\right)$

$\hat{p}(x)$ = superposition of cubes of side $h$, with each cube centered on one of the data points.

We can smooth out this estimate by choosing different forms for the kernel function $H(u)$. Example: Gaussian Parzen windows.

$E[\hat{p}(x)] = \frac{1}{Nh^p}\sum_{j=1}^{N}E\!\left[H\!\left(\frac{x - x^j}{h}\right)\right]
= \frac{1}{h^p}E\!\left[H\!\left(\frac{x - v}{h}\right)\right]
= \int\frac{1}{h^p}H\!\left(\frac{x - v}{h}\right)p(v)\,dv$

Convolution of $H$ and $p$: a blurred version of $p(x)$ as seen through the window.

Large $N$ → good estimate for $h \to 0$
Small $N$ → need to select $h$ properly
For small $N$, small $h$ → noisy estimate

Copyright © 2001-2018 K.R. Pattipati

Illustration of Density Estimation: Effect of h

Three Parzen-window density estimates based on

the same set of five samples, using the Gaussian

Parzen window functions

Copyright © 2001-2018 K.R. Pattipati

Density Estimation: Effects of h and N

Parzen-window estimates of a univariate normal density using

different window widths and numbers of samples

Copyright © 2001-2018 K.R. Pattipati

Bivariate Density Estimation

Parzen-window estimates of a bivariate normal density using

different window widths and numbers of samples

Copyright © 2001-2018 K.R. Pattipati

Bimodal Density Estimation

Parzen-window estimates of a bimodal distribution using different window widths and numbers of samples. Note that as $n \to \infty$, the estimates are the same and match the true distribution.

Copyright © 2001-2018 K.R. Pattipati

Gaussian Windows: PNN

Volume of a hypersphere in $p$ dimensions of radius $h$: $V_p = \dfrac{2\pi^{p/2}h^p}{p\,\Gamma(p/2)}$, where $\Gamma(a) = \int_0^\infty u^{a-1}e^{-u}\,du$.

$\hat{p}(x) = \frac{1}{N}\sum_{j=1}^{N}\frac{1}{(2\pi)^{p/2}h^p}\exp\!\left[-\frac{(x - x^j)^T(x - x^j)}{2h^2}\right]$
= sum of multivariate Gaussian distributions centered at each training sample.

(Can also do this for each class: $\hat{p}(x \mid z=k)$ with $N \to n_k$ and $x^j \to x_k^j$.)

Also called a "Probabilistic Neural Network (PNN)".
Small $\sigma$ ($= h$) → density estimate will have discontinuities.
Larger $\sigma$ → causes a greater degree of interpolation (smoothness).

Copyright © 2001-2018 K.R. Pattipati

Probabilistic Neural Network-1

• Note that each component of $x$ has the same $\sigma$ ⇒ must scale $x_i$ to have the same range.

Compute $\sigma_k^2 = \frac{1}{p}\mathrm{tr}(\hat{\Sigma}_k)$ and $\sigma^2 = \frac{1}{N}\sum_{k=1}^{C}n_k\sigma_k^2$.

PNN is fairly insensitive to $\sigma$ over a wide range. Sometimes, it is easier to experiment with $\sigma$ than to compute it this way.

Copyright © 2001-2018 K.R. Pattipati

Probabilistic Neural Network-2

PNN structure: input units ($x_1, x_2, \ldots, x_p$) feed pattern units (one per training sample of each class, $n_1, \ldots, n_C$ units); each class has a summing unit that scales by $\frac{1}{(2\pi)^{p/2}h^p n_k}$ to produce $\hat{p}(x \mid z=k)P(z=k)$; the output unit picks the maximum over the classes $P(z=1), \ldots, P(z=C)$.

• Speed up the PNN by using cluster centers as representative patterns; cluster using K-means, LVQ, ...

Copyright © 2001-2018 K.R. Pattipati

Probabilistic Neural Network-2 (pattern units)

One form of the pattern unit structure: the unit for sample $j$ of class $k$ takes $x_1, x_2, \ldots, x_p$, $\|x\|^2$, and a bias, computes
$z_k^j = x^T x_k^j - \tfrac{1}{2}\|x_k^j\|^2 - \tfrac{1}{2}\|x\|^2 \;\; (z_k^j \le 0)$,
and outputs $e^{z_k^j/\sigma^2}$.
($j$ = sample number, $k$ = class, $p$ = dimension of the pattern.)

Copyright © 2001-2018 K.R. Pattipati

Probabilistic Neural Network-3

Second form of the pattern unit structure: the unit squares each difference $x_i - x_{ik}^j$, sums them to get
$z_k^j = \|x - x_k^j\|^2$,
and outputs $e^{-z_k^j/2\sigma^2}$.
($j$ = sample number, $k$ = class, $p$ = dimension of the pattern.)

Copyright © 2001-2018 K.R. Pattipati

Alternative Forms of Kernels

Manhattan distance (1-norm):
$\hat{p}(x) = \frac{1}{N(2h)^p}\sum_{j=1}^{N}\exp\!\left[-\sum_{i=1}^{p}\frac{|x_i - x_i^j|}{h}\right]$

Cauchy distribution:
$\hat{p}(x) = \frac{1}{N(\pi h)^p}\sum_{j=1}^{N}\prod_{i=1}^{p}\frac{1}{1 + \left(\frac{x_i - x_i^j}{h}\right)^2}$

Sinc-squared kernel, where $\mathrm{sinc}(x) = \frac{\sin x}{x}$:
$\hat{p}(x) = \frac{1}{N}\sum_{j=1}^{N}\prod_{i=1}^{p}\frac{1}{2\pi}\,\mathrm{sinc}^2\!\left(\frac{x_i - x_i^j}{2}\right)$

Tri-cube kernel:
$\hat{p}(x) = \frac{1}{N}\sum_{j=1}^{N}\prod_{i=1}^{p}\frac{70}{81}\left(1 - |x_i - x_i^j|^3\right)^3; \quad |x_i - x_i^j| \le 1$

Epanechnikov kernel:
$\hat{p}(x) = \frac{1}{N}\sum_{j=1}^{N}\prod_{i=1}^{p}\frac{3}{4}\left(1 - (x_i - x_i^j)^2\right); \quad |x_i - x_i^j| \le 1$

Copyright © 2001-2018 K.R. Pattipati

K-Means Algorithm or K-Centers Algorithm

Decide on K, the number of clusters, in advance.
Suppose there are N data points, $D = \{x^1, x^2, \ldots, x^N\}$.
Want to find K representative vectors $\mu_1, \mu_2, \ldots, \mu_K$ ("cluster centers", "codebook vectors").

The K-means algorithm seeks to partition the data into K disjoint subsets $\{C_j\}$ containing $\{n_j\}$ points, in such a way as to minimize the sum-of-squares clustering function
$J = \sum_{j=1}^{K}\sum_{i \in C_j}\|x^i - \mu_j\|^2$

Clearly, if the $\{C_j\}$ are known, then $\mu_j = \frac{1}{n_j}\sum_{i \in C_j}x^i$, the means of the clusters.

Copyright © 2001-2018 K.R. Pattipati

Batch Version of the K-Means Algorithm (Unsupervised Learning)

1) Start with any K random feature vectors as the K centers. Alternately, assign the N points into K sets randomly and compute their means; these means serve as the initial centers.

2) For i = 1, 2, . . . , N: assign pattern i to cluster $C_j$ if $j = \arg\min_{1\le k\le K}\|x^i - \mu_k\|^2$.

3) Recompute means via $\mu_j = \frac{1}{n_j}\sum_{i \in C_j}x^i$.

4) If the centers have changed, go to step 2. Else, stop.

Covariance of each cluster: $\hat{\Sigma}_j = \frac{1}{n_j - 1}\sum_{i \in C_j}(x^i - \mu_j)(x^i - \mu_j)^T$, etc.

(A compact implementation sketch follows.)
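A minimal NumPy sketch of the batch K-means loop above, with random initialization (the toy data below is illustrative):

```python
import numpy as np

def kmeans(X, K, iters=100, seed=0):
    """Batch K-means: alternate assignment and mean-update steps."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]            # step 1
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        d = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)      # squared distances
        labels = d.argmin(axis=1)                                # step 2
        new_mu = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                           else mu[j] for j in range(K)])        # step 3
        if np.allclose(new_mu, mu):                              # step 4
            break
        mu = new_mu
    return mu, labels

X = np.vstack([np.random.default_rng(1).normal(c, 0.3, size=(100, 2)) for c in (0, 3)])
centers, labels = kmeans(X, K=2)
print(centers)
```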

Copyright © 2001-2018 K.R. Pattipati

Initialization (recall David Arthur and Sergei Vassilvitskii's "k-means++: The Advantages of Careful Seeding", Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms)

a. Choose the initial center $\mu_1$ at random. Let $x^{n_1}$ be that data point.
b. For k = 2, ..., K:
     For n = 1, 2, ..., N and $n \ne n_i$, $i = 1, 2, \ldots, k-1$:
        $D(x^n) = \min_{1\le i\le k-1}\|x^n - \mu_i\|^2$   (or $\exp\!\left(-\min_{1\le i\le k-1}\|x^n - \mu_i\|^2\right)$)
     End
     Select $x^{n_k}$ probabilistically with $p(x^n) = \dfrac{D(x^n)}{\sum_{n}D(x^n)}$; set $\mu_k = x^{n_k}$ and store $n_k$.
   End
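A small sketch of the seeding procedure above (NumPy only), which can replace the random initialization in the earlier K-means sketch:

```python
import numpy as np

def kmeans_pp_init(X, K, seed=0):
    """k-means++ seeding: pick each new center with probability proportional to D(x)."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]                  # step a: first center at random
    for _ in range(1, K):
        D = np.min(((X[:, None, :] - np.array(centers)[None]) ** 2).sum(-1), axis=1)
        probs = D / D.sum()                              # step b: D(x) / sum_n D(x)
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)
```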

Copyright © 2001-2018 K.R. Pattipati

Selection of K

• Pick K to minimize the sum of errors (training error) + a model complexity term. The sum is called the prediction error:
$PE(K) = \frac{1}{N}\sum_{j=1}^{K}\sum_{n \in C_j}\|x^n - \mu_j\|^2 + \frac{2\sigma^2 K p}{N}$
where $\sigma^2$ is the variance of the noise in the data.

• Kurtosis-based measure
  – Kurtosis for a normalized Gaussian: $E\!\left[\left(\frac{x-\mu}{\sigma}\right)^4\right] = 3$
  – Find the excess kurtosis for each cluster:
    $K_{ji} = \frac{1}{n_j}\sum_{n \in C_j}\left(\frac{x_i^n - \mu_{ji}}{\sigma_{ji}}\right)^4 - 3, \quad i = 1, 2, \ldots, p; \qquad
    \bar{K}_j = \frac{1}{p}\sum_{i=1}^{p}|K_{ji}|; \qquad K_T = \frac{1}{K}\sum_{j=1}^{K}\bar{K}_j$
  – Plot $K_T$ vs. $K$ and pick the K that gives the minimum $K_T$.
  – Can also use this idea in a dynamic cluster splitting scheme (see Vlassis and Likas, IEEE Trans. SMC-A, July 1999, pp. 393-399).

Copyright © 2001-2018 K.R. Pattipati

Online Version of K-Means (Vector Quantization via Stochastic Approximation)

• Start with K randomly chosen centers $\{\mu_j\}_{j=1}^{K}$.
• For each data point $x^i$, update the nearest $\mu_j$ via
$\mu_j^{new} = \mu_j^{old} + \eta\,(x^i - \mu_j^{old})$.
Leave all the others the same.

Can use it for dynamic data. For static data, need to go through the data multiple times!

Copyright © 2001-2018 K.R. Pattipati

Supervised Algorithm (Learning Vector Quantization: we know the class labels!)

Start with K codebook vectors. For each data point $x^i$, find the closest codebook vector (or center) $\mu_j$:
$\mu_j^{new} = \mu_j^{old} + \eta\,(x^i - \mu_j^{old})$ if $x^i$ is classified correctly;
$\mu_j^{new} = \mu_j^{old} - \eta\,(x^i - \mu_j^{old})$ if $x^i$ is classified incorrectly.

It is a version of reinforcement learning. We will take up the variants of the algorithm in Lecture 8.

Copyright © 2001-2018 K.R. Pattipati

k-Nearest Neighbor Approach

• Fix $k$ and allow the volume $V$ to vary: $\hat{p}(x) = \frac{k}{NV}$, where $V$ is the volume of a hypersphere centered at the point $x$ that contains exactly $k$ points.

Again: small $k$ → noisy; large $k$ → smooth.

The major use of the k-nearest neighbor technique is not in probability estimation, but in classification.

• Assign a point to class $i$ if class $i$ has the largest number of points among the k nearest neighbors ⇒ MAP rule:
$\hat{p}(x \mid z=i) = \frac{k_i}{n_i V}; \qquad \hat{P}(z=i) = \frac{n_i}{N}; \qquad \hat{p}(x) = \frac{k}{NV}$
$\Rightarrow \hat{P}(z=i \mid x) = \frac{\hat{p}(x \mid z=i)\,\hat{P}(z=i)}{\hat{p}(x)} = \frac{k_i}{k}$
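A minimal NumPy sketch of the k-NN MAP rule above (brute-force distances; the toy data is illustrative):

```python
import numpy as np

def knn_classify(x, X_train, y_train, k=5):
    """Assign x to the class with the most votes among its k nearest neighbors."""
    d2 = ((X_train - x) ** 2).sum(axis=1)      # squared Euclidean distances
    nearest = np.argsort(d2)[:k]               # indices of the k closest points
    votes = np.bincount(y_train[nearest])      # k_i for each class
    return votes.argmax()                      # arg max_i k_i / k

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(knn_classify(np.array([2.5, 2.5]), X, y, k=7))   # most likely class 1
```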

Copyright © 2001-2018 K.R. Pattipati

1-NN Classifier

The 1-NN classifier has an interesting geometric interpretation.

[Figure: 1-nearest-neighbor classifier in the (x1, x2) plane and its decision boundary.]

Data points become corner points of triangles spanned by each reference point and two of its neighbors. The network of triangles is called a Delaunay triangulation (or Delaunay net).

Perpendiculars to the triangles' edges run along the borders of polygonal patches delimiting the local neighborhood of each point. The resulting network of polygons is called a Voronoi diagram (or Voronoi net).

The decision boundary follows sections of the Voronoi net.

1-NN and k-NN adapt to any data: high variability and low bias.

Copyright © 2001-2018 K.R. Pattipati

Gaussian Mixtures, Maximum Likelihood and Expectation Maximization Algorithm - 1

Why Gaussian Mixtures?
• Parametric → fast but limited
• Nonparametric → general but slow (requires a lot of data)
• Mixture Models:
$p(x) = \sum_{j=1}^{M}p(x \mid j)P_j; \qquad P_j \ge 0, \quad \sum_{j=1}^{M}P_j = 1; \qquad \int_x p(x \mid j)\,dx = 1$

A similar technique applies to conditional density estimation (function approximation), $p(x \mid k)$, $k = 1, 2, \ldots, C$: RBFs, mixture-of-experts models.

Copyright © 2001-2018 K.R. Pattipati

Gaussian Mixtures, Maximum Likelihood and Expectation Maximization Algorithm - 2

[Figure: mixture network with component densities $p(x \mid 1), \ldots, p(x \mid M)$ weighted by $P_1, \ldots, P_M$ and summed to give $p(x)$; inputs $x_1, x_2, \ldots, x_p$.]

$p(x \mid j) = N(\mu_j, \Sigma_j)$, typically $N(\mu_j, \sigma_j^2 I) = \frac{1}{(2\pi\sigma_j^2)^{p/2}}e^{-\|x - \mu_j\|^2/(2\sigma_j^2)}$

Problem: given data $D = \{x^1, x^2, \ldots, x^N\}$, find the ML estimates of $\{P_j, \mu_j, \sigma_j\}_{j=1}^{M}$.
Let $\theta = \{P_j, \mu_j, \sigma_j\}$:
$\max_\theta L = p(D \mid \theta) \iff \max_\theta \ln p(D \mid \theta) \iff \min_\theta\,[-\ln p(D \mid \theta)] = \min_\theta J$

Copyright © 2001-2018 K.R. Pattipati

Gaussian Mixtures, Maximum Likelihood and Expectation Maximization Algorithm - 3

$J = -\sum_{i=1}^{N}\ln p(x^i \mid \theta) = -\sum_{i=1}^{N}\ln\sum_{j=1}^{M}p(x^i \mid j)P_j$

$\frac{\partial J}{\partial \mu_j} = -\sum_{i=1}^{N}\frac{P_j\,p(x^i \mid j)}{\sum_{k=1}^{M}p(x^i \mid k)P_k}\,\frac{\partial\ln p(x^i \mid j)}{\partial\mu_j}$

With $p(x^i \mid j) = \frac{1}{(2\pi\sigma_j^2)^{p/2}}e^{-\|x^i - \mu_j\|^2/(2\sigma_j^2)}$, we have $\frac{\partial\ln p(x^i \mid j)}{\partial\mu_j} = \frac{x^i - \mu_j}{\sigma_j^2}$, so

$\frac{\partial J}{\partial\mu_j} = -\sum_{i=1}^{N}P(j \mid x^i)\,\frac{x^i - \mu_j}{\sigma_j^2} \;\;\ldots\ldots (1)$

where $P(j \mid x^i) = \dfrac{P_j\,p(x^i \mid j)}{\sum_{k=1}^{M}p(x^i \mid k)P_k}$ is the posterior (the responsibility). Note the simplicity of the gradient.

For the $P_j$: $\min J$ s.t. $\sum_{j=1}^{M}P_j = 1$, $0 \le P_j \le 1$. Lagrangian: $l = J + \lambda\left(\sum_{j=1}^{M}P_j - 1\right)$.

Copyright © 2001-2018 K.R. Pattipati

Gaussian Mixtures, Maximum Likelihood and Expectation Maximization Algorithm - 4

$\frac{\partial J}{\partial\sigma_j} = -\sum_{i=1}^{N}\frac{P_j\,p(x^i \mid j)}{\sum_{k=1}^{M}p(x^i \mid k)P_k}\left[\frac{\|x^i - \mu_j\|^2}{\sigma_j^3} - \frac{p}{\sigma_j}\right]
= -\sum_{i=1}^{N}P(j \mid x^i)\left[\frac{\|x^i - \mu_j\|^2}{\sigma_j^3} - \frac{p}{\sigma_j}\right] \;\;\ldots\ldots (2)$
($p$ = dimension of the feature vector.)

$\frac{\partial l}{\partial P_j} = -\sum_{i=1}^{N}\frac{p(x^i \mid j)}{\sum_{k=1}^{M}p(x^i \mid k)P_k} + \lambda
= -\frac{1}{P_j}\sum_{i=1}^{N}P(j \mid x^i) + \lambda = 0
\;\Rightarrow\; \sum_{i=1}^{N}P(j \mid x^i) = \lambda P_j
\;\Rightarrow\; \lambda = N, \quad P_j = \frac{1}{N}\sum_{i=1}^{N}P(j \mid x^i) \;\;\ldots\ldots (3)$

Copyright © 2001-2018 K.R. Pattipati

Gaussian Mixtures, Maximum Likelihood and Expectation Maximization Algorithm - 5

Necessary conditions of optimality: set the gradients equal to zero. From (1)-(3):

$\hat{\mu}_j = \dfrac{\sum_{i=1}^{N}P(j \mid x^i)\,x^i}{\sum_{i=1}^{N}P(j \mid x^i)}; \qquad
\hat{\sigma}_j^2 = \dfrac{1}{p}\,\dfrac{\sum_{i=1}^{N}P(j \mid x^i)\,\|x^i - \hat{\mu}_j\|^2}{\sum_{i=1}^{N}P(j \mid x^i)}$

$\hat{P}_j = \dfrac{1}{N}\sum_{i=1}^{N}P(j \mid x^i)$, noting that $\sum_{j=1}^{M}P(j \mid x^i) = 1$ and $\sum_{j=1}^{M}\hat{P}_j = 1$, so $\lambda = N$.

$P(j \mid x^i)$ is the responsibility. These are coupled non-linear equations.

General case:
$\hat{\Sigma}_j = \dfrac{\sum_{i=1}^{N}P(j \mid x^i)\,(x^i - \hat{\mu}_j)(x^i - \hat{\mu}_j)^T}{\sum_{i=1}^{N}P(j \mid x^i)}$

Copyright © 2001-2018 K.R. Pattipati

Methods of Solution: Nonlinear Programming (NLP) Techniques

$\theta^{k+1} = \theta^k - \eta_k\,H\,\nabla_\theta l \;\;\ldots\ldots (*)$

H = $I$ → steepest descent (gradient) method
H = $[\nabla^2 J]^{-1}$ → Newton's method
H = $[\nabla^2 J + \lambda I]^{-1}$ → Levenberg-Marquardt method
H = $[J^T J + \lambda I]^{-1}$ → Levenberg-Marquardt version of the Gauss-Newton method
Various versions of the quasi-Newton method
Various versions of the conjugate gradient method

Best to compute the Hessian using a finite difference method.

Copyright © 2001-2018 K.R. Pattipati

EM Algorithm

M-step (by setting the gradients to zero):
$\mu_j^{new} = \dfrac{\sum_{i=1}^{N}\hat{P}^{old}(j \mid x^i)\,x^i}{\sum_{i=1}^{N}\hat{P}^{old}(j \mid x^i)}; \qquad
(\sigma_j^{new})^2 = \dfrac{1}{p}\,\dfrac{\sum_{i=1}^{N}\hat{P}^{old}(j \mid x^i)\,\|x^i - \mu_j^{new}\|^2}{\sum_{i=1}^{N}\hat{P}^{old}(j \mid x^i)}; \qquad
\hat{P}_j^{new} = \dfrac{1}{N}\sum_{i=1}^{N}\hat{P}^{old}(j \mid x^i)$

E-step (evaluating the posterior probabilities / responsibilities):
$\hat{P}^{new}(j \mid x^i) = \dfrac{p(x^i \mid j, \hat{\theta}^{new})\,\hat{P}_j^{new}}{\sum_{m=1}^{M}p(x^i \mid m, \hat{\theta}^{new})\,\hat{P}_m^{new}}$

Gauss-Seidel view of EM. How did we get these equations, and why? .... Later.

Copyright © 2001-2018 K.R. Pattipati

Sequential Estimation ≡ Stochastic Approximation - 1

$\hat{\mu}_j^{n+1} = \dfrac{\sum_{i=1}^{n+1}P(j \mid x^i)\,x^i}{\sum_{i=1}^{n+1}P(j \mid x^i)}
= \hat{\mu}_j^{n} + \dfrac{P(j \mid x^{n+1})}{\sum_{i=1}^{n+1}P(j \mid x^i)}\left(x^{n+1} - \hat{\mu}_j^{n}\right)
= \hat{\mu}_j^{n} + \eta_j^{n+1}\left(x^{n+1} - \hat{\mu}_j^{n}\right)$

Note:
$\dfrac{1}{\eta_j^{n+1}} = \dfrac{\sum_{i=1}^{n+1}P(j \mid x^i)}{P(j \mid x^{n+1})}
= 1 + \dfrac{\sum_{i=1}^{n}P(j \mid x^i)}{P(j \mid x^{n+1})}
= 1 + \dfrac{1}{\eta_j^{n}}\cdot\dfrac{P(j \mid x^{n})}{P(j \mid x^{n+1})}$,
so the gain itself can be computed recursively.

Copyright © 2001-2018 K.R. Pattipati

Sequential Estimation ≡ Stochastic Approximation - 2

• Sometimes replace the gain $\eta_j^{n+1} = \dfrac{P(j \mid x^{n+1})}{\sum_{i=1}^{n+1}P(j \mid x^i)}$ by $\dfrac{P(j \mid x^{n+1})}{(n+1)\hat{P}_j^{n}}$, or simply by $\dfrac{1}{n+1}$.

Similarly, for the variance,
$(\hat{\sigma}_j^{n+1})^2 = \dfrac{1}{p}\,\dfrac{\sum_{i=1}^{n+1}P(j \mid x^i)\,\|x^i - \hat{\mu}_j^{n+1}\|^2}{\sum_{i=1}^{n+1}P(j \mid x^i)}$.

Writing $\|x^i - \hat{\mu}_j^{n+1}\|^2 = \|x^i - \hat{\mu}_j^{n}\|^2 - 2(x^i - \hat{\mu}_j^{n})^T(\hat{\mu}_j^{n+1} - \hat{\mu}_j^{n}) + \|\hat{\mu}_j^{n+1} - \hat{\mu}_j^{n}\|^2$ and collecting terms gives the recursion

$(\hat{\sigma}_j^{n+1})^2 = (\hat{\sigma}_j^{n})^2 + \eta_j^{n+1}\left[\dfrac{1}{p}\|x^{n+1} - \hat{\mu}_j^{n}\|^2 - (\hat{\sigma}_j^{n})^2\right] - \dfrac{1}{p}\|\hat{\mu}_j^{n+1} - \hat{\mu}_j^{n}\|^2$

Copyright © 2001-2018 K.R. Pattipati

Sequential Estimation ≡ Stochastic Approximation - 3

• Similarly, with the simplified gain $\eta_j^{n+1} = \dfrac{1}{n+1}$ (recall $\hat{\mu}_j^{n+1} = \hat{\mu}_j^{n} + \eta_j^{n+1}(x^{n+1} - \hat{\mu}_j^{n})$):

$(\hat{\sigma}_j^{n+1})^2 = (\hat{\sigma}_j^{n})^2 + \eta_j^{n+1}\left[\dfrac{1}{p}\|x^{n+1} - \hat{\mu}_j^{n}\|^2 - (\hat{\sigma}_j^{n})^2\right] - \dfrac{1}{p}\|\hat{\mu}_j^{n+1} - \hat{\mu}_j^{n}\|^2$

$\hat{P}_j^{n+1} = \hat{P}_j^{n} + \dfrac{1}{n+1}\left[P(j \mid x^{n+1}) - \hat{P}_j^{n}\right]$

Copyright © 2001-2018 K.R. Pattipati

EM Algorithm for the Gaussian Mixture Problem-1

Key ideas of EM as applied to the Gaussian mixture problem:

$J = -\sum_{i=1}^{N}\ln p(x^i \mid \theta) = -\sum_{i=1}^{N}\ln\sum_{j=1}^{M}p(x^i \mid j)P_j$

$J^{new} - J^{old} = -\sum_{i=1}^{N}\ln\dfrac{p(x^i \mid \theta^{new})}{p(x^i \mid \theta^{old})}
= -\sum_{i=1}^{N}\ln\sum_{j=1}^{M}P^{old}(j \mid x^i)\,\dfrac{P_j^{new}\,p(x^i \mid j, \theta^{new})}{P^{old}(j \mid x^i)\,p(x^i \mid \theta^{old})}$

Idea (variational decomposition): let $x$ be the data, $z$ the hidden variables (mixture labels), $\theta$ the parameters, and $q(z)$ any arbitrary distribution. Since $\ln p(x, z \mid \theta) = \ln p(z \mid x, \theta) + \ln p(x \mid \theta)$,

$\ln p(x \mid \theta) = \underbrace{E_q\!\left[\ln\frac{p(x, z \mid \theta)}{q(z)}\right]}_{\ln L(q,\theta)} + \underbrace{E_q\!\left[\ln\frac{q(z)}{p(z \mid x, \theta)}\right]}_{KL(q(z)\,\|\,p(z \mid x, \theta))}$

$J = -\ln p(x \mid \theta) = -\ln L(q, \theta) - KL(q(z)\,\|\,p(z \mid x, \theta)); \qquad KL(q(z)\,\|\,p(z \mid x, \theta)) \ge 0$

E-step: $q(z) = p(z \mid x, \theta^{old})$ (makes the KL term zero).
M-step: $\theta^{new} = \arg\min_\theta\,[-\ln L(q, \theta)] = \arg\min_\theta\,\left(-E_q[\ln p(x, z \mid \theta)]\right) = \arg\min_\theta Q(\theta, \theta^{old})$

Note: $-\ln L(q, \theta) = Q(\theta, \theta^{old}) - H(z)$, where $Q(\theta, \theta^{old}) = -E_{p(z \mid x, \theta^{old})}[\ln p(x, z \mid \theta)]$ and $H(z)$ is the entropy of $q$.

Copyright © 2001-2018 K.R. Pattipati

EM Algorithm for the Gaussian Mixture Problem-2

• By Jensen's inequality, $\ln\sum_i\lambda_i x_i \ge \sum_i\lambda_i\ln x_i$ where $\sum_i\lambda_i = 1$, $\lambda_i \ge 0$. Hence

$J^{new} - J^{old}
= -\sum_{i=1}^{N}\ln\sum_{j=1}^{M}P^{old}(j \mid x^i)\,\dfrac{P_j^{new}\,p(x^i \mid j, \theta^{new})}{P^{old}(j \mid x^i)\,p(x^i \mid \theta^{old})}
\le -\sum_{i=1}^{N}\sum_{j=1}^{M}P^{old}(j \mid x^i)\ln\dfrac{P_j^{new}\,p(x^i \mid j, \theta^{new})}{P^{old}(j \mid x^i)\,p(x^i \mid \theta^{old})}$

so $J^{new} \le Q(\theta^{new}, \theta^{old}) + \text{const} = -\ln L(q, \theta^{new})$. Minimizing this upper bound will lead to a decrease in $J(\theta)$.

Note: at $\theta = \theta^{old}$, $J(\theta^{old}) = Q(\theta^{old}, \theta^{old})$ (force $KL = 0$); $J(\theta^{new}) \le Q(\theta^{new}, \theta^{old})$; and $Q(\theta, \theta^{old})$ and $J(\theta)$ have the same gradient at $\theta^{old}$.

[Figure: $J(\theta)$ and the bound $Q(\theta, \theta^{old})$ touching at $\theta^{old}$; minimizing the bound gives $\theta^{new}$.]

Copyright © 2001-2018 K.R. Pattipati

EM Algorithm for the Gaussian Mixture Problem-3

• Optimization problem (dropping terms that depend only on the old parameters):

min $Q = -\sum_{i=1}^{N}\sum_{j=1}^{M}P^{old}(j \mid x^i)\left[\ln P_j^{new} + \ln p(x^i \mid j, \theta^{new})\right]$

For Gaussian conditional probability density functions:

$Q = -\sum_{i=1}^{N}\sum_{j=1}^{M}P^{old}(j \mid x^i)\left[\ln P_j^{new} - p\ln\sigma_j^{new} - \dfrac{\|x^i - \mu_j^{new}\|^2}{2(\sigma_j^{new})^2}\right] + \text{const}$

s.t. $\sum_{j=1}^{M}P_j^{new} = 1; \quad 0 \le P_j^{new} \le 1, \; j = 1, 2, \ldots, M$.

Copyright © 2001-2018 K.R. Pattipati

EM Algorithm for the Gaussian Mixture Problem-4

Minimizing Q yields the M-step updates:

$\mu_j^{new} = \dfrac{\sum_{i=1}^{N}P^{old}(j \mid x^i)\,x^i}{\sum_{i=1}^{N}P^{old}(j \mid x^i)}; \qquad
(\sigma_j^{new})^2 = \dfrac{1}{p}\,\dfrac{\sum_{i=1}^{N}P^{old}(j \mid x^i)\,\|x^i - \mu_j^{new}\|^2}{\sum_{i=1}^{N}P^{old}(j \mid x^i)}; \qquad
P_j^{new} = \dfrac{1}{N}\sum_{i=1}^{N}P^{old}(j \mid x^i)$

General case:
$\Sigma_j^{new} = \dfrac{\sum_{i=1}^{N}P^{old}(j \mid x^i)\,(x^i - \mu_j^{new})(x^i - \mu_j^{new})^T}{\sum_{i=1}^{N}P^{old}(j \mid x^i)}$

Copyright © 2001-2018 K.R. Pattipati

Graphical Illustration of the E and M Steps

$J = -\ln p(x \mid \theta) = -\ln L(q, \theta) - KL(q(z)\,\|\,p(z \mid x, \theta)); \qquad KL \ge 0$

E-step: $q(z) = p(z \mid x, \theta^{old})$. Why does $-\ln L(q, \theta^{old}) = -\ln p(x \mid \theta^{old})$? Because $KL(q(z)\,\|\,p(z \mid x, \theta^{old})) = 0$.

M-step: $\theta^{new} = \arg\min_\theta\,[-\ln L(q, \theta)]$. Why is $-\ln p(x \mid \theta^{new}) \le -\ln L(q, \theta^{new})$? Because $KL(q(z) = p(z \mid x, \theta^{old})\,\|\,p(z \mid x, \theta^{new})) \ge 0$.

[Figure: decomposition of $J$ into $-\ln L(q, \theta)$ and $KL(q\,\|\,p)$ before and after each step.]

Note: EM is a Maximum Likelihood algorithm. Is there a Bayesian version? Yes: if you assume priors on $\{\mu_j, \sigma_j^2, P_j\}$, it is called Variational Bayesian Inference.

Copyright © 2001-2018 K.R. Pattipati

An Alternate View of EM for Gaussian Mixtures - 1

[Graphical model: hidden (latent) variable $z$ → observation $x$.]

$z$ is an M-dimensional binary random vector such that $z_j \in \{0, 1\}$ and $\sum_{j=1}^{M}z_j = 1$; the only possible vectors are $\{z = e_i : i = 1, 2, \ldots, M\}$, where $e_i$ is the $i$th unit vector.

$P(z_j = 1) = P_j \;\Rightarrow\; P(z) = \prod_{j=1}^{M}P_j^{z_j}$

$x$ is a p-dimensional random vector such that
$p(x \mid z) = \prod_{j=1}^{M}\left[N(x; \mu_j, \Sigma_j)\right]^{z_j}$

$p(x) = \sum_z p(x, z) = \sum_z P(z)\,p(x \mid z)
= \sum_z\prod_{j=1}^{M}\left[P_j N(x; \mu_j, \Sigma_j)\right]^{z_j}
= \sum_{j=1}^{M}P_j N(x; \mu_j, \Sigma_j)$

The pdf of $x$ is a Gaussian mixture.

Copyright © 2001-2018 K.R. Pattipati

An Alternate View of EM for Gaussian Mixtures - 2

Problem: given incomplete (partial) data $D = \{x^1, x^2, \ldots, x^N\}$, find the ML estimates of $\{P_j, \mu_j, \Sigma_j\}_{j=1}^{M}$.
Let $\theta = \{P_j, \mu_j, \Sigma_j\}_{j=1}^{M}$: $\min_\theta J$ where $J = -\ln p(D \mid \theta)$.

If we have several observations $\{x^n : n = 1, 2, \ldots, N\}$, each data point will have a corresponding latent vector $z^n$. Note the generality.

Complete data: $D_c = \{(x^1, z^1), (x^2, z^2), \ldots, (x^N, z^N)\}$

$\ln p(D_c \mid \theta) = \sum_{n=1}^{N}\sum_{j=1}^{M}z_j^n\left\{\ln P_j - \frac{p}{2}\ln 2\pi - \frac{1}{2}\ln|\Sigma_j| - \frac{1}{2}(x^n - \mu_j)^T\Sigma_j^{-1}(x^n - \mu_j)\right\}$

Copyright © 2001-2018 K.R. Pattipati

An Alternate View of EM for Gaussian Mixtures - 3

If we had complete data, estimation would be trivial: similar to the Gaussian case, except that we estimate with the subsets of data assigned to each mixture component.

In EM, replace each latent variable by its expectation with respect to the posterior density during the E-step (the responsibilities):
$\langle z_j^n\rangle = E[z_j^n \mid x^n, \theta] = P(z_j^n = 1 \mid x^n, \theta) = \dfrac{P_j N(x^n; \mu_j, \Sigma_j)}{\sum_{k=1}^{M}P_k N(x^n; \mu_k, \Sigma_k)} = \gamma_j^n$

In EM, minimize the expected value of the negative complete-data log likelihood during the M-step:
$E_Z[-\ln p(D_c \mid \theta)] = -\sum_{n=1}^{N}\sum_{j=1}^{M}\gamma_j^n\left\{\ln P_j - \frac{p}{2}\ln 2\pi - \frac{1}{2}\ln|\Sigma_j| - \frac{1}{2}(x^n - \mu_j)^T\Sigma_j^{-1}(x^n - \mu_j)\right\} = Q(\theta, \theta^{old})$

Copyright © 2001-2018 K.R. Pattipati

EM Algorithm for Gaussian Mixtures - 4

1. Initialize the means $\{\mu_j\}_{j=1}^{M}$, covariances $\{\Sigma_j\}_{j=1}^{M}$, and mixing coefficients $\{P_j\}_{j=1}^{M}$. Evaluate
$J = -\ln p(x \mid \theta) = -\sum_{n=1}^{N}\ln\left\{\sum_{j=1}^{M}P_j N(x^n; \mu_j, \Sigma_j)\right\}$

2. E-step: evaluate the responsibilities using the current parameter values:
$\gamma_j^n = \dfrac{P_j N(x^n; \mu_j, \Sigma_j)}{\sum_{k=1}^{M}P_k N(x^n; \mu_k, \Sigma_k)}; \quad j = 1, 2, \ldots, M; \; n = 1, 2, \ldots, N; \qquad
N_j = \sum_{n=1}^{N}\gamma_j^n, \; j = 1, 2, \ldots, M$

3. M-step: re-estimate the parameters using the current responsibilities:
$\mu_j^{new} = \dfrac{1}{N_j}\sum_{n=1}^{N}\gamma_j^n x^n; \qquad
\Sigma_j^{new} = \dfrac{1}{N_j}\sum_{n=1}^{N}\gamma_j^n(x^n - \mu_j^{new})(x^n - \mu_j^{new})^T; \qquad
P_j^{new} = \dfrac{N_j}{N}$

4. Evaluate the negative log likelihood and check for convergence of the parameters or the likelihood. If not converged, go to step 2.

For an unbiased estimate of the covariance, divide by $N_j - \dfrac{1}{N_j}\sum_{n=1}^{N}(\gamma_j^n)^2$ instead of $N_j$; this goes to $N_j - 1$ in the (0-1) responsibility case.

(A compact implementation sketch follows.)
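A compact NumPy sketch of steps 1-4 above, specialized to spherical covariances $\sigma_j^2 I$ to keep the code short (that restriction, and the toy data, are assumptions of this sketch):

```python
import numpy as np

def em_gmm_spherical(X, M, iters=100, seed=0):
    """EM for a Gaussian mixture with spherical covariances sigma_j^2 I."""
    rng = np.random.default_rng(seed)
    N, p = X.shape
    mu = X[rng.choice(N, M, replace=False)]
    sig2 = np.full(M, X.var())
    P = np.full(M, 1.0 / M)
    for _ in range(iters):
        # E-step: responsibilities gamma_j^n (log domain for numerical stability)
        d2 = ((X[:, None, :] - mu[None]) ** 2).sum(-1)             # (N, M)
        logw = np.log(P) - 0.5 * p * np.log(2 * np.pi * sig2) - d2 / (2 * sig2)
        logw -= logw.max(axis=1, keepdims=True)
        gamma = np.exp(logw)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step
        Nj = gamma.sum(axis=0)
        mu = (gamma.T @ X) / Nj[:, None]
        d2 = ((X[:, None, :] - mu[None]) ** 2).sum(-1)
        sig2 = (gamma * d2).sum(axis=0) / (p * Nj)
        P = Nj / N
    return P, mu, sig2

X = np.vstack([np.random.default_rng(2).normal(c, 0.5, (200, 2)) for c in (0, 4)])
print(em_gmm_spherical(X, M=2)[1])   # component means near (0,0) and (4,4)
```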

Copyright © 2001-2018 K.R. Pattipati

Illustration of the EM Algorithm for Gaussian Mixtures

[Figure: EM iterations on the Old Faithful data; the log likelihood improves from $-\infty$ (iteration 0) through $-4929.4$ and $-442.0$ to about $-389.8$ by iterations 7-8, together with a plot of the average log likelihood versus iteration. Murphy, Page 353, mixGaussDemoFaithful.]

Copyright © 2001-2018 K.R. Pattipati

Relation of Gaussian Mixtures to K-means

Suppose $\Sigma_j = \epsilon I$ for $j = 1, 2, \ldots, M$. Then
$\gamma_j^n = \dfrac{P_j N(x^n; \mu_j, \epsilon I)}{\sum_{k=1}^{M}P_k N(x^n; \mu_k, \epsilon I)}
= \dfrac{P_j\,e^{-\|x^n - \mu_j\|^2/2\epsilon}}{\sum_{k=1}^{M}P_k\,e^{-\|x^n - \mu_k\|^2/2\epsilon}}; \quad j = 1, 2, \ldots, M; \; n = 1, 2, \ldots, N$

As $\epsilon \to 0$, $\gamma_j^n \to I_j^n$, where $I_j^n = 1$ if $j = \arg\min_k\|x^n - \mu_k\|^2$; the rest go to zero as long as none of the $P_j$ is zero.

The expected value of the negative log likelihood of the complete data is then
$E_Z[-\ln p(D_c \mid \theta)] \to \dfrac{1}{2\epsilon}\sum_{n=1}^{N}\sum_{j=1}^{M}I_j^n\,\|x^n - \mu_j\|^2 + \text{constant}$

So, K-means minimizes $\dfrac{1}{2}\sum_{n=1}^{N}\sum_{j=1}^{M}I_j^n\,\|x^n - \mu_j\|^2$.

Copyright © 2001-2018 K.R. Pattipati

Variational Bayesian Inference - 1

[Graphical model: hidden (latent) variables $w$ → observation $x$; for the mixture case, $P \to z$ and $(\mu_j, \Lambda_j = \Sigma_j^{-1}) \to x$.]

$w$ is a latent vector (continuous or discrete):
  − mixture vector (discrete), $z$
  − parameters $\{\mu_j, \Lambda_j, P_j\}$
$x$ is a p-dimensional random vector.

Recall:
$J = -\ln p(x) = -\ln L(q(w)) - KL(q(w)\,\|\,p(w \mid x)); \qquad KL(q(w)\,\|\,p(w \mid x)) \ge 0$

$\ln L(q(w)) = \int q(w)\ln\frac{p(x, w)}{q(w)}\,dw = E_q[\ln p(x, w)] + H(w)$

$KL(q(w)\,\|\,p(w \mid x)) = -\int q(w)\ln\frac{p(w \mid x)}{q(w)}\,dw = -E_q[\ln p(w \mid x)] - H(w)$

Variational inference typically assumes $q(w)$ to be factorized:
$q(w) = \prod_{j=1}^{K}q_j(w_j)$, where the $\{w_j\}$ are disjoint groups.
Example: $q(w) = q(z)\,q(\{\mu_j, \Lambda_j, P_j\})$.

Copyright © 2001-2018 K.R. Pattipati

Variational Bayesian Inference - 2

Minimize the upper bound $-\ln L(q(w))$ with respect to $q_j(w_j)$ while keeping $\{q_i(w_i) : i \ne j\}$ constant (a la Gauss-Seidel). This gives an iterative algorithm for finding the factors $\{q_j(w_j)\}$:

$\ln L(q(w)) = \int\prod_{i=1}^{K}q_i(w_i)\ln\frac{p(x, w)}{\prod_{i=1}^{K}q_i(w_i)}\,dw
= \int q_j(w_j)\,E_{i\ne j}[\ln p(x, w)]\,dw_j + H(w_j) + \sum_{i\ne j}H(w_i)$

Setting the functional derivative with respect to $q_j(w_j)$ to zero (with the normalization constraint),
$E_{i\ne j}[\ln p(x, w)] - 1 - \ln q_j(w_j) = 0$ up to a constant, i.e.,

$\ln q_j^*(w_j) = E_{i\ne j}[\ln p(x, w)] + \text{const}
\;\Longleftrightarrow\;
q_j^*(w_j) = \dfrac{e^{E_{i\ne j}[\ln p(x, w)]}}{\int e^{E_{i\ne j}[\ln p(x, w)]}\,dw_j}$

The log of the optimal $q_j$ is the expectation of the log of the joint distribution with respect to all of the other factors $\{q_i(w_i) : i \ne j\}$. This idea is used in loopy belief propagation and expectation propagation also.

Copyright © 2001-2018 K.R. Pattipati

Application to Gaussian Mixtures - 1

Here $w$ involves the mixture variables and the component parameters:
$q(w) = q(z)\,q(\{\mu_j, \Lambda_j, P_j\}_{j=1}^{M})$; $z$ is a binary random vector of dimension $M$.

Model assumptions:

Mixture distribution: $p(\{z^n\}_{n=1}^{N} \mid \{P_j\}_{j=1}^{M}) = \prod_{n=1}^{N}\prod_{j=1}^{M}P_j^{z_j^n}$

Data likelihood given the latent variables:
$p(\{x^n\}_{n=1}^{N} \mid \{z^n\}_{n=1}^{N}, \{\mu_j, \Lambda_j\}_{j=1}^{M}) = \prod_{n=1}^{N}\prod_{j=1}^{M}\left[N(x^n; \mu_j, \Lambda_j^{-1})\right]^{z_j^n}$

We also assume priors on $\{P_j, \mu_j, \Lambda_j\}$ (Bayesian approach):
$p(P) = Dirichlet(P \mid \alpha_0) = \frac{\Gamma(M\alpha_0)}{\Gamma(\alpha_0)^M}\prod_{j=1}^{M}P_j^{\alpha_0 - 1}$; the conjugate prior to the multinomial.

$\Gamma(x) = \int_0^\infty t^{x-1}e^{-t}\,dt; \quad \Gamma(n+1) = n\Gamma(n); \quad \Gamma(n) = (n-1)!$ for integers.

Copyright © 2001-2018 K.R. Pattipati

Application to Gaussian Mixtures - 2

Model assumptions (continued):
$p(\{\mu_j, \Lambda_j\}_{j=1}^{M}) = \prod_{j=1}^{M}p(\mu_j \mid \Lambda_j)\,p(\Lambda_j)
= \prod_{j=1}^{M}N\!\left(\mu_j;\, m_0, (\beta_0\Lambda_j)^{-1}\right)\cdot Wishart(\Lambda_j;\, W_0, v_0)$
(In one dimension, the Wishart reduces to a Gamma. See Bishop's book.)

Joint distribution of all random variables decomposes as follows:
$p(\{x^n\}, \{z^n\}, \{P_j, \mu_j, \Lambda_j\})
= p(\{x^n\} \mid \{z^n\}, \{\mu_j, \Lambda_j\})\cdot p(\{z^n\} \mid \{P_j\})\cdot p(P)\cdot p(\{\mu_j\} \mid \{\Lambda_j\})\cdot p(\{\Lambda_j\})$
$= \prod_{n=1}^{N}\prod_{j=1}^{M}\left[N(x^n; \mu_j, \Lambda_j^{-1})\right]^{z_j^n}\cdot
\prod_{n=1}^{N}\prod_{j=1}^{M}P_j^{z_j^n}\cdot
\frac{\Gamma(M\alpha_0)}{\Gamma(\alpha_0)^M}\prod_{j=1}^{M}P_j^{\alpha_0 - 1}\cdot
\prod_{j=1}^{M}N(\mu_j; m_0, (\beta_0\Lambda_j)^{-1})\,Wishart(\Lambda_j; W_0, v_0)$

Copyright © 2001-2018 K.R. Pattipati

Application to Gaussian Mixtures - 3

Variational Bayes M-step (VBM-step) .... it is easier to see the M-step first:

$\ln q^*(\{P_j, \mu_j, \Lambda_j\}) = E_{q(\{z^n\})}\!\left[\ln p(\{x^n\}, \{z^n\}, \{P_j, \mu_j, \Lambda_j\})\right] + \text{const}$

$= \sum_{n=1}^{N}\sum_{j=1}^{M}E[z_j^n]\left\{\ln P_j + \ln N(x^n; \mu_j, \Lambda_j^{-1})\right\}
+ \ln Dirichlet(P \mid \alpha_0)
+ \sum_{j=1}^{M}\ln\left[N(\mu_j; m_0, (\beta_0\Lambda_j)^{-1})\,Wishart(\Lambda_j; W_0, v_0)\right] + \text{const}$

This factorizes as
$q(\{P_j, \mu_j, \Lambda_j\}) = q(P)\prod_{j=1}^{M}q(\mu_j, \Lambda_j)$, with
$q(P) = Dirichlet(P; \{\alpha_j\})$ and each $q(\mu_j, \Lambda_j)$ a Gaussian-Wishart distribution.

Copyright © 2001-2018 K.R. Pattipati

Application to Gaussian Mixtures - 4

Updated factorized distribution after the M-step:
$q(\{P_j, \mu_j, \Lambda_j\}) = q(P)\prod_{j=1}^{M}q(\mu_j, \Lambda_j)$

$q(P) = Dirichlet(P; \{\alpha_j\})$ with $\alpha_j = \alpha_0 + N_j$; $E[P_j] = \dfrac{\alpha_j}{\sum_k\alpha_k} = \dfrac{\alpha_0 + N_j}{M\alpha_0 + N}$

$q(\mu_j, \Lambda_j) = N\!\left(\mu_j;\, m_j, (\beta_j\Lambda_j)^{-1}\right)\cdot Wishart(\Lambda_j;\, W_j, v_j)$ with
$\beta_j = \beta_0 + N_j; \qquad m_j = \dfrac{\beta_0 m_0 + N_j\bar{x}_j}{\beta_j}; \qquad v_j = v_0 + N_j$

$W_j^{-1} = W_0^{-1} + N_j S_j + \dfrac{\beta_0 N_j}{\beta_0 + N_j}(\bar{x}_j - m_0)(\bar{x}_j - m_0)^T$

where $N_j = \sum_{n=1}^{N}\gamma_j^n$, $\bar{x}_j = \dfrac{1}{N_j}\sum_{n=1}^{N}\gamma_j^n x^n$, and $S_j = \dfrac{1}{N_j}\sum_{n=1}^{N}\gamma_j^n(x^n - \bar{x}_j)(x^n - \bar{x}_j)^T$.

The updates for $\{N_j, \bar{x}_j, S_j\}$ are similar to ML. Sequential VBEM?

Copyright © 2001-2018 K.R. Pattipati

Application to Gaussian Mixtures - 5

Variational Bayes E-step (VBE-step):

$\ln q^*(\{z^n\}) = E_{q(\{P_j, \mu_j, \Lambda_j\})}\!\left[\ln p(\{x^n\}, \{z^n\}, \{P_j, \mu_j, \Lambda_j\})\right] + \text{const}
= \sum_{n=1}^{N}\sum_{j=1}^{M}z_j^n\ln\rho_j^n + \text{const}$

where
$\ln\rho_j^n = E[\ln P_j] + \tfrac{1}{2}E[\ln|\Lambda_j|] - \tfrac{p}{2}\ln 2\pi - \tfrac{1}{2}E_{\mu_j,\Lambda_j}\!\left[(x^n - \mu_j)^T\Lambda_j(x^n - \mu_j)\right]$

$\Rightarrow\; q(\{z^n\}) = \prod_{n=1}^{N}\prod_{j=1}^{M}(\gamma_j^n)^{z_j^n}$ where $\gamma_j^n = \dfrac{\rho_j^n}{\sum_{k=1}^{M}\rho_k^n}$ .... the responsibilities.

Copyright © 2001-2018 K.R. Pattipati

Application to Gaussian Mixtures - 6

Variational Bayes E-step (VBE-step) ... continued: evaluation of the responsibilities.

Recall $\ln\rho_j^n = E[\ln P_j] + \tfrac{1}{2}E[\ln|\Lambda_j|] - \tfrac{p}{2}\ln 2\pi - \tfrac{1}{2}E_{\mu_j,\Lambda_j}[(x^n - \mu_j)^T\Lambda_j(x^n - \mu_j)]$, with

$E[\ln P_j] = \psi(\alpha_j) - \psi\!\left(\sum_{k=1}^{M}\alpha_k\right)$; $\psi(\cdot) = \dfrac{d}{d(\cdot)}\ln\Gamma(\cdot)$ .... the digamma function

$E[\ln|\Lambda_j|] = \sum_{i=1}^{p}\psi\!\left(\frac{v_j + 1 - i}{2}\right) + p\ln 2 + \ln|W_j|$

$E_{\mu_j,\Lambda_j}\!\left[(x^n - \mu_j)^T\Lambda_j(x^n - \mu_j)\right] = \dfrac{p}{\beta_j} + v_j(x^n - m_j)^T W_j(x^n - m_j)$

Since $\gamma_j^n \propto \rho_j^n$,
$\gamma_j^n \propto \tilde{\pi}_j\,\tilde{\Lambda}_j^{1/2}\exp\!\left\{-\dfrac{p}{2\beta_j} - \dfrac{v_j}{2}(x^n - m_j)^T W_j(x^n - m_j)\right\}$,
where $\ln\tilde{\pi}_j = E[\ln P_j]$ and $\ln\tilde{\Lambda}_j = E[\ln|\Lambda_j|]$. See Bishop, Chapter 10.

Copyright © 2001-2018 K.R. Pattipati

VB Approximation of Gaussian Gamma

In VBEM, start with a large M and a very small $\alpha_0 \ll 1$ (e.g., 0.001).
It automatically prunes clusters with very few members ("rich get richer").
In this example, we start with 6 clusters, but only 2 remain at the end.

[Figure: clusters on the Old Faithful data and the corresponding cluster-membership counts at iteration 94; mixGaussVbDemoFaithful from Murphy, Page 755.]

Copyright © 2001-2018 K.R. Pattipati

Bayesian Model Selection - 1

Set of models $m = 1, 2, \ldots, M$ (e.g., linear, quadratic discriminants). Model $m$ is specified by a parameter vector $\theta_m$.

Bayesian model selection (maximize the probability of the model given the data):
$P(m \mid D) = \dfrac{p(D \mid m)\,p(m)}{\sum_{l}p(D \mid l)\,p(l)}$

Bayes factors for comparing models m and l:
$BF(m, l) = \dfrac{p(D \mid m)}{p(D \mid l)} = \dfrac{p(m \mid D)/p(l \mid D)}{p(m)/p(l)} \approx \dfrac{e^{-\frac{1}{2}BIC(m)}}{e^{-\frac{1}{2}BIC(l)}}$

$BF(m, l) > 1 \Rightarrow$ model $m$ is preferred over model $l$.
BIC = Bayesian Information Criterion (using the negative log likelihood).

Copyright © 2001-2018 K.R. Pattipati

Bayesian Model Selection - 2

Bayesian Information Criterion (BIC) ... minimize BIC (Schwarz criterion):
$BIC(m) = -2\ln p(D \mid \hat{\theta}_m) + dof(\hat{\theta}_m)\ln N$
Valid for regression, classification, and density estimation.

For linear and quadratic classifiers (note: only $C - 1$ free probabilities):
$dof(\hat{\theta}_m) = (C-1) + Cp + 1$: equal and spherical covariance
$dof(\hat{\theta}_m) = (C-1) + Cp + p$: equal and spherical feature-dependent (diagonal) covariance
$dof(\hat{\theta}_m) = (C-1) + Cp + p(p+1)/2$: equal and general covariance
$dof(\hat{\theta}_m) = (C-1) + Cp + Cp(p+1)/2$: unequal and general covariance

BIC is closely related to Minimum Description Length (MDL).
Adjusted BIC: replace $\ln N$ by $\ln[(N+2)/24]$.

Akaike Information Criterion (AIC) .... minimize AIC:
$AIC(m) = -2\ln p(D \mid \hat{\theta}_m) + 2\,dof(\hat{\theta}_m)$
Small-N (& Gaussian) correction: $AIC_c = AIC + \dfrac{2\,dof(\hat{\theta}_m)\,(dof(\hat{\theta}_m)+1)}{N - dof(\hat{\theta}_m) - 1}$

Copyright © 2001-2018 K.R. Pattipati

Binary Classification, BIC & AIC - 1

$N$ scalar data points: $D = \{x_1, x_2, \ldots, x_N\}$
Class $z = 0$: $N(0, 1)$; zero mean ... null hypothesis, $H_0$
Class $z = 1$: $N(\mu, 1)$; $\mu \ne 0$ .... alternative hypothesis, $H_1$

Under the null hypothesis, the sample mean $\bar{x} \sim N\!\left(0, \frac{1}{N}\right) \;\Rightarrow\; \sqrt{N}\,\bar{x} \sim N(0, 1)$.

If $P(\sqrt{N}\,|\bar{x}| > c) = \alpha$ for a specified $c$ (e.g., $c = 2$ for $\alpha \approx 0.05$), we are confident that $\mu \ne 0$ with probability $1 - \alpha$.
$\alpha$ is the probability of falsely rejecting $H_0$.

So the test statistic is: decide $H_1$ if $|\bar{x}| > \dfrac{c}{\sqrt{N}}$.

Copyright © 2001-2018 K.R. Pattipati

Binary Classification, BIC & AIC - 2

BIC ($BIC(m) = -2\ln p(D \mid \hat{\theta}_m) + dof\cdot\ln N$; dropping constants common to both hypotheses):
$BIC(H_0) = \sum_{i=1}^{N}x_i^2$
$BIC(H_1) = \sum_{i=1}^{N}(x_i - \bar{x})^2 + \ln N = \sum_{i=1}^{N}x_i^2 - N\bar{x}^2 + \ln N$
So, $BIC(H_1) < BIC(H_0)$ if $N\bar{x}^2 > \ln N$, i.e., $|\bar{x}| > \sqrt{\dfrac{\ln N}{N}}$.

AIC ($AIC(m) = -2\ln p(D \mid \hat{\theta}_m) + 2\,dof$):
$AIC(H_0) = \sum_{i=1}^{N}x_i^2$
$AIC(H_1) = \sum_{i=1}^{N}(x_i - \bar{x})^2 + 2 = \sum_{i=1}^{N}x_i^2 - N\bar{x}^2 + 2$
So, $AIC(H_1) < AIC(H_0)$ if $|\bar{x}| > \sqrt{\dfrac{2}{N}}$; i.e., $c = \sqrt{2}$.

Similar to classical hypothesis testing, but with a sample-number-dependent threshold (for BIC).

Copyright © 2001-2018 K.R. Pattipati

Holdout Method of Cross Validation

PE = probability of error (misclassification rate).

• Holdout Method: split the $N$ samples into $N - K$ for training and $K$ for validation (typically $K \approx N/5$).
Error count = $R$ out of $K$ ⇒ $\widehat{PE} = R/K$.

From the binomial distribution: $\sigma_{\widehat{PE}} = \sqrt{\dfrac{PE(1-PE)}{K}}$

To obtain an estimate of PE within 1%: $2\sqrt{\dfrac{PE(1-PE)}{K}} \le 0.01$.
If $PE \approx 0.05$: $K \ge \dfrac{4\,PE(1-PE)}{10^{-4}} = 40000\,(PE(1-PE)) \approx 1900$.
When $PE \approx \tfrac{1}{2}$, we need lots of samples .... about 10,000.

We can also estimate the class-conditional error rate $PE(z=k)$. Then $PE = \sum_{k=1}^{C}P(z=k)\,PE(z=k)$.

Are there better bounds?

Copyright © 2001-2018 K.R. Pattipati

Markov & Chebyshev Inequalities

• Suppose we have a nonnegative rv $x$ (discrete or continuous); assume it is continuous WLOG.

• Markov inequality:
$E(x) = \int_0^\infty x f(x)\,dx \ge \int_\epsilon^\infty x f(x)\,dx \ge \epsilon\int_\epsilon^\infty f(x)\,dx = \epsilon\,P(x \ge \epsilon)
\;\Rightarrow\; P(x \ge \epsilon) \le \dfrac{E(x)}{\epsilon}$

• Chebyshev inequality:
$P\!\left((x-\mu)^2 \ge \epsilon^2\right) \le \dfrac{E(x-\mu)^2}{\epsilon^2} = \dfrac{\sigma^2}{\epsilon^2},
\;\text{ so }\; P(|x-\mu| \ge \epsilon) \le \dfrac{\sigma^2}{\epsilon^2}$

Example: $\bar{x}$ = sample mean of $n$ numbers with mean $m$ and variance $\sigma^2$:
$P(|\bar{x} - m| \ge \epsilon) \le \dfrac{\sigma^2}{n\epsilon^2}$

For a binary classifier with unknown probability of error $P$ and sample error $S_n$:
$P(|S_n - P| \ge \epsilon) \le \dfrac{P(1-P)}{n\epsilon^2} \le \dfrac{1}{4n\epsilon^2}$; for $n = 100$, $\epsilon = 0.2$: bound $= 0.0625$ .... loose.

Copyright © 2001-2018 K.R. Pattipati

Hoeffding's Inequality

Let $Y = \sum_{i=1}^{n}X_i$; $X_i \in [a_i, b_i]$; $R_i = b_i - a_i$. Then
$P\{|Y - E(Y)| \ge n\epsilon\} \le 2e^{-2n^2\epsilon^2/\sum_{i=1}^{n}R_i^2}$; when $R_i = b_i - a_i = 1$: $P\{|Y - E(Y)| \ge n\epsilon\} \le 2e^{-2n\epsilon^2}$

Proof sketch. Let $Z_i = X_i - E(X_i)$, so $E(Z_i) = 0$ and $Z_i \in [-\tfrac{R_i}{2}, \tfrac{R_i}{2}]$, and let $Z = \sum_i Z_i$. For any $t > 0$,
$P\{Z \ge n\epsilon\} = P\{e^{tZ} \ge e^{tn\epsilon}\} \le e^{-tn\epsilon}E[e^{tZ}] = e^{-tn\epsilon}\prod_{i=1}^{n}E[e^{tZ_i}]$ (Markov inequality and independence).
By Hoeffding's lemma (proved via Jensen's inequality; https://en.wikipedia.org/wiki/Hoeffding%27s_lemma),
$E[e^{tZ_i}] \le e^{t^2R_i^2/8}$, so $P\{Z \ge n\epsilon\} \le e^{-tn\epsilon + \frac{t^2}{8}\sum_i R_i^2}$.
The RHS is minimized when $t = \dfrac{4n\epsilon}{\sum_i R_i^2}$, giving
$P\{Z \ge n\epsilon\} \le e^{-2n^2\epsilon^2/\sum_i R_i^2}$.
Repeating the argument for $-Z$ and adding the two one-sided bounds gives the factor of 2.

Copyright © 2001-2018 K.R. Pattipati

Rationale behind Cross Validation

• Suppose we have classifiers/models $M_1, M_2, \ldots, M_C$, training data $D$, and validation data $V$; samples are assumed to be i.i.d.

Average validation log likelihood: $\hat{l}_m = \dfrac{1}{|V|}\sum_{x^i \in V}\ln p(x^i \mid \hat{\theta}_m)$

Since $\hat{\theta}_m$ does not depend on $V$, $E[\hat{l}_m] = l_m$.

By Hoeffding's inequality and the union bound over the $C$ models:
$P\{\max_m|\hat{l}_m - l_m| > \epsilon\} \le \sum_m P\{|\hat{l}_m - l_m| > \epsilon\} \le 2Ce^{-2|V|\epsilon^2}$

So, if $\delta = 2Ce^{-2|V|\epsilon^2}$, then $\epsilon = \sqrt{\dfrac{\ln(2C/\delta)}{2|V|}}$.
"Confidence is cheap but accuracy is more expensive" ($\delta$ enters only inside the log, while $\epsilon \sim 1/\sqrt{|V|}$).

$\max_m\hat{l}_m \ge \max_m l_m - 2\epsilon = \max_m l_m - O\!\left(\sqrt{\dfrac{\ln C}{|V|}}\right)$

So, with probability of at least $(1-\delta)$, one chooses the best model to within $O(\sqrt{\ln C/|V|})$.

Copyright © 2001-2018 K.R. Pattipati

S-fold Cross Validation

• S-fold cross validation: split the N samples into S parts $D_1, D_2, \ldots, D_S$ of size $N/S$ each. In run $s$ ($s = 1, 2, \ldots, S$), $D_s$ is used for validation and the remaining $N(1 - 1/S)$ samples are used for training. S is typically 5-10.

$PE = \dfrac{1}{N}\sum_{s=1}^{S}\sum_{i \in D_s}I(\hat{z}(x^i) \ne z^i)$; $I(\cdot)$ = indicator function.

• S = N → N-fold cross validation or the leave-one-out CV (LOOCV) method:
$PE = \dfrac{1}{N}\sum_{i=1}^{N}I(\hat{z}(x^i) \ne z^i)$; $\hat{z}(x^i) = f(x^i, D^{-i})$.

• Practical scheme: 5x2 cross-validation. Variation: 5 repetitions of 2-fold cross-validation on a randomized dataset.

Copyright © 2001-2018 K.R. Pattipati

Summary of Validation and Model Selection Methods

  Simple split
  Leave-one-out CV (N = S for S-fold CV)
  Model Selection
  Regularization and Model Selection

Any combination of CV, model selection, and regularization (hyper-parameter selection) is possible.
See AMLbook.com.

Copyright © 2001-2018 K.R. Pattipati

Bootstrap Method of Validation

• Bootstrap Method: $D = \{x^1, x^2, \ldots, x^N\}$. Sample from D with uniform probability (1/N); each $x^i$ is drawn independently with replacement.

Let b = bootstrap index, b = 1, 2, ..., B; B = number of bootstrap samples (typically 50-200).

Do b = 1, 2, ..., B:
  Bootstrap sample $D_b = \{x^{(1)}, x^{(2)}, \ldots, x^{(N)}\}$ → train and compute $PE_{training}(b)$
  Validation data $A_b = D\setminus D_b$ (samples not in the bootstrap) → $PE_{val}(b)$
End

$PE = 0.632\,\overline{PE_{val}(b)} + 0.368\,\overline{PE_{training}(b)}$ (Efron estimator)
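A minimal sketch of the .632 bootstrap loop above (NumPy only; the nearest-mean rule below is a trivial placeholder classifier, not the one discussed in the lecture):

```python
import numpy as np

def nearest_mean_error(Xtr, ytr, Xte, yte):
    """Placeholder classifier: assign each test point to the class with the nearest mean."""
    means = np.array([Xtr[ytr == c].mean(axis=0) for c in np.unique(ytr)])
    pred = ((Xte[:, None, :] - means[None]) ** 2).sum(-1).argmin(axis=1)
    return (pred != yte).mean()

def bootstrap_632(X, y, B=100, seed=0):
    rng = np.random.default_rng(seed)
    N = len(X)
    pe_tr, pe_val = [], []
    for _ in range(B):
        idx = rng.integers(0, N, size=N)           # sample with replacement
        oob = np.setdiff1d(np.arange(N), idx)      # samples not in the bootstrap
        pe_tr.append(nearest_mean_error(X[idx], y[idx], X[idx], y[idx]))
        if len(oob):
            pe_val.append(nearest_mean_error(X[idx], y[idx], X[oob], y[oob]))
    return 0.632 * np.mean(pe_val) + 0.368 * np.mean(pe_tr)

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (60, 2)), rng.normal(2, 1, (60, 2))])
y = np.array([0] * 60 + [1] * 60)
print(bootstrap_632(X, y))
```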

Copyright © 2001-2018 K.R. Pattipati

Confusion Matrix

• Bootstrap Method (cont'd). Why the 0.632 / 0.368 weights?
$P\{\text{observation} \in \text{bootstrap sample}\} = 1 - \left(1 - \frac{1}{N}\right)^N \approx 1 - e^{-1} = 0.632$
$P\{\text{observation} \in \text{validation samples}\} \approx 0.368$;
$P\{\text{observation} \in \text{training samples}\} \approx 0.632$

• Confusion Matrix: $P = [P_{ij}]$, where
$P_{ij} = P(\text{decision } i \mid z=j) = \dfrac{N_{ij}}{N_j}; \quad i = 0, 1, 2, \ldots, C; \; j = 1, 2, \ldots, C$

In words: just count the errors (from the validation set, bootstrap, etc.) for class j and divide by the number of samples from class j.

Copyright © 2001-2018 K.R. Pattipati

Performance Metrics for Binary Classification - 1

• Contingency table showing the differences between the true and predicted classes for a set of labeled examples (no reject option; when normalized, all four entries should sum to 1):

                      TRUE: Fault              TRUE: No-Fault              Total
PREDICTED Positive    detected faults (TP)     false alarms (FP)           total positive detections
PREDICTED Negative    missed faults (FN)       correct rejections (TN)     total negative detections
Total                 total faulty samples     total fault-free samples    total samples (N)

• The following metrics can be derived from the confusion matrix:
– PD (Sensitivity, Recall): TP/(TP+FN)
– PF (1-Selectivity): FP/(TN+FP)
– Positive Prediction Power (Precision): TP/(TP+FP)
– Negative Prediction Power: TN/(TN+FN)
– Correct Classification Rate (CCR): (TP+TN)/N
– Misclassification Rate: (FP+FN)/N
– Odds-ratio: (TP*TN)/(FP*FN)
– Kappa:
$\kappa = \dfrac{CCR - \sum_{i=1}^{2}P_{row(i)}P_{col(i)}}{1 - \sum_{i=1}^{2}P_{row(i)}P_{col(i)}}$
where $P_{row(i)}$ = % of entries in row i, $P_{col(i)}$ = % of entries in column i, and CCR = correct classification rate.
Poor: $\kappa < 0.4$; Good: $0.4 < \kappa < 0.75$; Excellent: $\kappa > 0.75$.
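A small sketch computing the metrics above from the four counts (NumPy only); the counts used below are the aircraft-engine numbers on the next slide:

```python
import numpy as np

def binary_metrics(TP, FP, FN, TN):
    N = TP + FP + FN + TN
    ccr = (TP + TN) / N
    # chance agreement from row/column marginals (used for kappa)
    p_e = ((TP + FP) * (TP + FN) + (FN + TN) * (FP + TN)) / N**2
    return {
        "PD (recall)": TP / (TP + FN),
        "PF": FP / (TN + FP),
        "precision": TP / (TP + FP),
        "neg. prediction power": TN / (TN + FN),
        "CCR": ccr,
        "misclassification": (FP + FN) / N,
        "odds ratio": (TP * TN) / (FP * FN),
        "kappa": (ccr - p_e) / (1 - p_e),
    }

for k, v in binary_metrics(3023, 1518, 1977, 3482).items():
    print(f"{k}: {v:.3f}")   # reproduces 0.605, 0.304, 0.666, 0.638, 0.65, 0.35, 3.51, 0.301
```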

Copyright © 2001-2018 K.R. Pattipati

Performance Metrics for Binary Classification - 2 (Aircraft Engine Data)

                      Fault              No-Fault             Total
Positive Detection    3023 (TP)          1518 (FP)            total positive detections
Negative Detection    1977 (FN)          3482 (TN)            total negative detections
Total                 faulty samples     fault-free samples   total samples

Normalized (by column):
  0.605   0.304
  0.395   0.696

Metrics:
  PD = 0.605; False Negative Rate = 0.395
  PF = 0.304 (False Positive Rate)
  Correct Classification Rate = 0.65; Misclassification Rate = 0.35
  Odds Ratio = 3.51
  Kappa = 0.301 → Poor
  Positive Prediction Power = 0.666; Negative Prediction Power = 0.638
  Prevalence = 0.5 (priors)

Kappa in terms of $P_D$, $P_F$ and the priors $P_1$ (fault), $P_0$ (no-fault):
$\kappa = \dfrac{p_o - p_e}{1 - p_e}$, with $p_o = P_1 P_D + P_0(1 - P_F)$ and
$p_e = P_1\left(P_1 P_D + P_0 P_F\right) + P_0\left(P_1(1 - P_D) + P_0(1 - P_F)\right)$,
so that $p_o - p_e = 2P_1P_0(P_D - P_F)$; for equal priors, $\kappa = P_D - P_F = 0.301$.

Recall $R = P_D = 0.605$; Precision $P = 0.666$; $F\text{-score} = \dfrac{2PR}{P + R} = 0.634$

Copyright © 2001-2018 K.R. Pattipati

• One-versus-One

– Generates C(C-1)/2

confusion matrices

• C1 vs. C2

• C1 vs. C3

• C2 vs. C3

• One-versus-All

– Generates C

confusion matrices

• C1 vs. C2 & C3

• C2 vs. C1 & C3

• C3 vs. C1 & C2

One-versus-One Confusion Matrices

One-versus-All Confusion Matrix

Summed,

creating a

2x2 matrix

Confusion Matrices for Multiple Classes - 1

Confusion Matrix for C Classes

Copyright © 2001-2018 K.R. Pattipati

• Some-versus-Rest

– Generates 2(C-1) - 1 confusion matrices

– Both true and false classifications may be sums

– Here is a four class example

C1 vs. C2 & C3 & C4

C2 vs. C1 & C3 & C4

C1 & C2 vs. C3 & C4

C3 vs. C1 & C2 & C4

C1 & C3 vs. C2 & C4

C2 & C3 vs. C1 & C4

C1 & C2 & C3 vs. C4

C1 vs. C2 & C3 & C4

C1 & C2 vs. C3 & C4

Confusion Matrices for Multiple Classes - 2

Can form the basis for code book based classifiers

Copyright © 2001-2018 K.R. Pattipati

• Cobweb

– Illustrates the probability that

each class will be predicted

incorrectly (the off diagonal cells

of the confusion matrix)

– Shows relative performance

between classes for each

classifier

– High performance classifiers

have poor visibility in cobweb

– May be difficult to interpret for

high numbers of classes

• c(c-1)/2 rays

Cobweb

Cobweb

Copyright © 2001-2018 K.R. Pattipati

Other Metrics of Performance Assessment

• Fawcett's Extension: sums the areas under the curves for each class versus the rest, weighted by the probability of that class:
$AUC_F = \sum_{i=1}^{C}P(z=i)\,AUC(i, \text{rest})$; AUC = area under the ROC curve.

• Hand and Till Function: averages the areas under the curves for each pair of classes:
$HT = \dfrac{2}{C(C-1)}\sum_{i=1}^{C}\sum_{j<i}AUC(i, j)$

• Macro-Average Modified: uses both the geometric mean and the average correct classification rate:
$MAVG_{MOD} = 0.75\left(\prod_{i=1}^{C}P_{ii}\right)^{1/C} + 0.25\left(\dfrac{1}{C}\sum_{i=1}^{C}P_{ii}\right)$

Copyright © 2001-2018 K.R. Pattipati

Comparing Classifiers

• Suppose we have classifiers A and B:
$n_A$ = number of errors made by A but not by B
$n_B$ = number of errors made by B but not by A

McNemar's Test: check whether $\dfrac{|n_A - n_B| - 1}{\sqrt{n_A + n_B}}$, which is approximately $N(0, 1)$ under the null hypothesis, is large.

Need $|n_A - n_B| > 5$ for a significant difference.
To detect a 1% difference in error rates, we need at least 500 samples.
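A tiny sketch of the McNemar check above (standard library only; the counts n_A and n_B are illustrative):

```python
import math

def mcnemar(n_A, n_B):
    """Continuity-corrected McNemar statistic; compare to N(0,1) quantiles."""
    if n_A + n_B == 0:
        return 0.0
    return (abs(n_A - n_B) - 1) / math.sqrt(n_A + n_B)

z = mcnemar(n_A=40, n_B=22)
print(z, "significant at ~5% level" if z > 1.96 else "not significant")
```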

Copyright © 2001-2018 K.R. Pattipati

1. Alpaydin, Ethem. Introduction to Machine Learning. MIT Press, Cambridge. 2004.

2. Kuncheva, Ludmila I. Combining Pattern Classifiers; Methods and Algorithms. John Wiley and Sons: Hoboken, NJ. 2004.

3. Ferri, C. “Volume Under the ROC Surface for Multi-class Problems.” Univ. Politecnica de Valencia, Spain. 2003.

4. D. Mossman. “Three-way ROCs”, Medical Decision Making, 19(1):78–89,1999.

5. A. Patel, M. K. Markey, "Comparison of three-class classification performance metrics: a case study in breast cancer CAD", Medical Imaging 2005: Image Processing.

6. Ferri C., Hernandez J., Salido M. A., "Volume Under the ROC Surface for Multiclass Problems. Exact Computation and Evaluation of Approximations", Technical Report DSOC, Univ. Politec. Valencia, 2003. http://www.dsic.upv.es/users/elp/cferri/VUS.pdf.

7. Mooney, C.Z. and R.D. Duval, 1993, Bootstrapping: A Non-Parametric Approach to Statistical Inference. Newbury Park, CA: Sage Publications.

8. J. K. Martin and D. S. Hirschberg. “Small sample statistics for classification error rates, I: error rate measurements.” Technical Report 96-21, Dept. of Information & Computer Science, University of California, Irvine, 1996.

References on Performance Assessment

Copyright © 2001-2018 K.R. Pattipati

Estimating Parameters of Densities From Data

Maximum Likelihood Methods

Bayesian Learning

Probability Density Estimation

Histogram Methods

Parzen Windows

Probabilistic Neural Network

k-nearest Neighbor Approach

Mixture Models

Estimate parameters via EM and VBEM algorithm

Various interpretations of EM

Performance Assessment of Classifiers

Summary

Recommended