




1

PCA, LDA, HLDA and HDA

Reference:
1. E. Alpaydin, “Introduction to Machine Learning,” The MIT Press, 2004.
2. S. R. Searle, “Matrix Algebra Useful for Statistics,” Wiley Series in Probability and Mathematical Statistics, New York, 1982.
3. Berlin Chen’s slides.
4. N. Kumar and A. G. Andreou, “Heteroscedastic Discriminant Analysis and Reduced-Rank HMMs for Improved Speech Recognition,” Speech Communication, 26:283-297, 1998.
5. G. Saon, M. Padmanabhan, R. Gopinath and S. Chen, “Maximum Likelihood Discriminant Feature Spaces,” ICASSP, 2000.
6. X. Liu, “Linear Projection Schemes for Automatic Speech Recognition,” M.Phil. thesis, University of Cambridge, 2001.
7. 張志豪 (Chih-Hao Chang), “A Study of Robust and Discriminative Speech Feature Extraction Techniques for Large Vocabulary Continuous Speech Recognition,” Master’s thesis, 2005.


2

PCA: Introduction

• PCA (Principal Component Analysis) is a one-group, unsupervised projection method for reducing data dimensionality (feature extraction).

• We use PCA to find a mapping from the inputs in the original d-dimensional space to a new k-dimensional space (k < d), with:

– minimum loss of information,

– maximum amount of information preserved, measured in terms of variability.

• That is, we find the new variables, i.e., the linear transformations (major axes, principal components), that achieve this goal.

• The projection of x onto the direction of w is z = w^T x.

[Diagram: the linear transform W maps the input space X (d-dimensional) to the feature space Z (k-dimensional).]


3

PCA: Methodology (General)

• To maximize the amount of information, PCA centers the sample and then rotates the axes to line up with the directions of highest variance.

• That is, the criterion is to find w such that Var(z) is maximized.

• Var(z) = Var(w^T x) = E[(w^T x − w^T μ)^2]
  = E[(w^T x − w^T μ)(w^T x − w^T μ)^T]
  = E[w^T (x − μ)(x − μ)^T w]
  = w^T E[(x − μ)(x − μ)^T] w = w^T Σ w

where Var(x) = E[(x − μ)(x − μ)^T] = Σ.
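
A quick numerical check of this identity (a sketch of my own using numpy; the data and the direction w are arbitrary choices, not from the slides): the sample variance of the projection z = w^T x matches w^T Σ w.

    import numpy as np

    rng = np.random.default_rng(0)
    # Draw samples x_i from a 3-dimensional Gaussian with a known covariance.
    X = rng.multivariate_normal(mean=[1.0, -2.0, 0.5],
                                cov=[[4.0, 1.0, 0.0],
                                     [1.0, 2.0, 0.3],
                                     [0.0, 0.3, 1.0]],
                                size=100_000)
    Sigma = np.cov(X, rowvar=False)          # sample covariance of x
    w = np.array([0.6, 0.8, 0.0])            # a unit-norm projection direction
    z = X @ w                                # projections z_i = w^T x_i
    print(np.var(z, ddof=1), w @ Sigma @ w)  # the two values agree up to sampling error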


4

PCA: Methodology (General) (cont.)

• Maximize Var(z1) = w1^T Σ w1 subject to ||w1|| = 1.

– For a unique solution, and to make the direction the important factor, we require ||w1|| = 1, i.e., w1^T w1 = 1.

– The stationary points satisfy Σ w1 = λ w1: w1 is an eigenvector of Σ, and λ is the eigenvalue associated with w1.

– Var(z1) = Var(w1^T x) = w1^T Σ w1 = w1^T λ w1 = λ w1^T w1 = λ.

– max Var(z1) = max λ, so choose the eigenvector with the largest eigenvalue for Var(z1) to be maximal.

Lagrangian: L(w1, λ) = w1^T Σ w1 − λ (w1^T w1 − 1), where λ is a Lagrange multiplier.

max_{w1} L(w1, λ):  ∂L/∂w1 = 2 Σ w1 − 2 λ w1 = 0  ⟹  Σ w1 = λ w1.
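
As a concrete illustration (a numpy sketch of my own, not part of the slides), the principal directions are the leading eigenvectors of the sample covariance, and the variance of each projected component equals the corresponding eigenvalue:

    import numpy as np

    def pca(X, k):
        """Return the top-k principal directions (d x k) and their eigenvalues."""
        mu = X.mean(axis=0)
        Sigma = np.cov(X - mu, rowvar=False)        # sample covariance (d x d)
        eigvals, eigvecs = np.linalg.eigh(Sigma)    # ascending eigenvalues
        order = np.argsort(eigvals)[::-1][:k]       # indices of the k largest
        return eigvecs[:, order], eigvals[order]

    rng = np.random.default_rng(1)
    X = rng.multivariate_normal([0.0, 0.0, 0.0],
                                [[5.0, 2.0, 0.0],
                                 [2.0, 3.0, 0.0],
                                 [0.0, 0.0, 0.5]],
                                size=50_000)
    W, lam = pca(X, k=2)
    Z = (X - X.mean(axis=0)) @ W                    # z = W^T (x - mu)
    print(np.var(Z, axis=0, ddof=1))                # approximately equal to lam
    print(lam)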


5

PCA: Methodology (General) (cont.)

• Second principal component: maximize Var(z2) subject to ||w2|| = 1 and w2 orthogonal to w1.

– That is, w2 is the eigenvector of Σ with the second-largest eigenvalue λ2, and Var(z2) = λ2.

• Conclusions:

– wi is the eigenvector of Σ associated with the ith largest eigenvalue, and Var(zi) = λi. (This can be proved by mathematical induction.)

– The wi's are uncorrelated (orthogonal).

– w1 explains as much as possible of the original variance in the data set.

– w2 explains as much as possible of the remaining variance, and so on.

Lagrangian with two multipliers: L(w2, λ, β) = w2^T Σ w2 − λ (w2^T w2 − 1) − β (w2^T w1 − 0).

max_{w2} L(w2, λ, β):  ∂L/∂w2 = 2 Σ w2 − 2 λ w2 − β w1 = 0; pre-multiplying by w1^T shows β = 0, so Σ w2 = λ w2.


6

PCA: Some Discussions

• About dimensions:

– If the dimensions are highly correlated, there will be a small number of eigenvectors with large eigenvalues; k will be much smaller than d, and a large reduction in dimensionality can be attained.

– If the dimensions are not correlated, k will be as large as d and there is no gain through PCA.


7

PCA: Some Discussions (cont.)

• About Σ:

– For two different eigenvalues, the corresponding eigenvectors are orthogonal.

– If Σ is positive definite (x^T Σ x > 0 for all x ≠ 0), all eigenvalues are positive.

– If Σ is singular, then its rank, the effective dimension, is k < d, and λi = 0 for i > k.

• About scaling:

– Different variables may have completely different scales.

– The eigenvalues of the matrix are scale dependent.

– If the scale of the data is unknown, it is better to use the correlation matrix instead of the covariance matrix.

– The interpretation of the principal components derived by these two methods can be completely different.


8

PCA: Methodology (Spectral Decomposition)

• z = W^T x, or z = W^T (x − m): center the data on the origin.

• We want to find a matrix W such that, with z = W^T x, Cov(z) = D is a diagonal matrix.

• That is, we would like the components zi to be uncorrelated.

• Let C = [c1, c2, …, cd] be the matrix of normalized eigenvectors of S. Then

– C^T C = I

– S = S C C^T
     = S (c1, c2, …, cd) C^T
     = (S c1, S c2, …, S cd) C^T
     = (λ1 c1, λ2 c2, …, λd cd) C^T
     = λ1 c1 c1^T + λ2 c2 c2^T + … + λd cd cd^T
     = C D C^T

– D = C^T S C, which plays the role of W^T Σ W: D(k×k) = C^T(k×d) S(d×d) C(d×k), where d is the input-space dimension and k the feature-space dimension.

Spectral decomposition is the factorization of a positive definite matrix S into S = C D C^T, where D is a diagonal matrix of eigenvalues and the columns of C are the eigenvectors.
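
A small numerical sketch of the spectral decomposition (my own, assuming numpy and a symmetric positive definite S), checking S = C D C^T, D = C^T S C and C^T C = I:

    import numpy as np

    S = np.array([[4.0, 1.0, 0.5],
                  [1.0, 3.0, 0.2],
                  [0.5, 0.2, 2.0]])            # symmetric positive definite
    eigvals, C = np.linalg.eigh(S)             # columns of C are eigenvectors
    D = np.diag(eigvals)
    print(np.allclose(S, C @ D @ C.T))         # S = C D C^T
    print(np.allclose(D, C.T @ S @ C))         # D = C^T S C is diagonal
    print(np.allclose(C.T @ C, np.eye(3)))     # C^T C = I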


9

Appendix A

• Another criterion for PCA is the MMSE (minimum mean-squared error) criterion, which reaches the same solution as the two methods above, though there are interesting differences among them.

• Some important properties of symmetric matrices:

– The eigenvalues are all real.

– Symmetric matrices are diagonalizable.

– The eigenvectors are orthogonal:

• Eigenvectors corresponding to different eigenvalues are orthogonal.

• The m_k linearly independent eigenvectors corresponding to an eigenvalue λ_k of multiplicity m_k can be chosen to be mutually orthogonal.

– The rank equals the number of nonzero eigenvalues.
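
A quick numerical check of these properties (a numpy sketch, not from the slides; the symmetric matrix below is an arbitrary example):

    import numpy as np

    rng = np.random.default_rng(2)
    A = rng.standard_normal((4, 4))
    S = A @ A.T                                # a symmetric (PSD) matrix
    eigvals, V = np.linalg.eigh(S)
    print(eigvals.dtype)                       # real eigenvalues
    print(np.allclose(V.T @ V, np.eye(4)))     # orthonormal eigenvectors
    print(np.allclose(V @ np.diag(eigvals) @ V.T, S))   # diagonalizable: S = V D V^T
    print(np.linalg.matrix_rank(S),
          np.sum(~np.isclose(eigvals, 0.0)))   # rank = number of nonzero eigenvalues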


10

LDA: Introduction

• LDA (Linear Discriminant Analysis) (Fisher, 1936; Rao, 1935) is a supervised dimension-reduction method for classification problems.

• The use of LDA to obtain features suitable for speech-sound classification was proposed by (Hunt, 1979).

– Brown showed that the LDA transform is superior to the PCA transform, using a DHMM classifier and incorporating context information (Brown, 1987).

– Later researchers applied LDA to DHMM and CHMM speech recognition systems and reported improved performance on small-vocabulary tasks, but mixed results on large-vocabulary phoneme-based systems.


11

LDA: Assumptions

• LDA is related to the MLE (maximum likelihood estimation) of the parameters of a Gaussian model, with two a priori assumptions (Campbell, 1984):

– First, all the class-discrimination information resides in a p-dimensional subspace of the n-dimensional feature space.

– Second, the within-class variances are equal for all classes.

• Another notable assumption is that each class distribution is a mixture of Gaussians (Hastie & Tibshirani, 1994). (Why not a single Gaussian?)

– This means LDA is optimal if the classes are normally distributed, but we can still use LDA for classification even when they are not.


12

LDA: Methodology

• Criterion: Given a set of sample vectors with labeled (class) information, find a linear transform W such that the ratio of the average between-class variation to the average within-class variation is maximal.

– After projection, for all classes to be well separated, we would like the class means to be as far apart as possible and the examples of each class to be scattered in as small a region as possible.


13

LDA: Methodology (cont.)

• Let x be an n-dimensional feature vector. We seek a linear transformation R^n → R^p (p < n) of the form y_p = θ_p^T x, where θ_p is an n×p matrix. Let θ be a nonsingular n×n matrix used to define the linear transformation y = θ^T x, and partition it as θ = [θ_p | θ_{n−p}].

• First, we apply a nonsingular linear transformation to x to obtain y = θ^T x. Second, we retain only the first p rows of y to give y_p.


14

LDA: Methodology (cont.)

• Let there be a total of J classes, and let g(i) ∈ {1, …, J} indicate the class associated with x_i. Let {x_i} be the set of training examples available.

– The sample mean:  X̄ = (1/N) Σ_{i=1}^{N} x_i,   X̄^p = θ_p^T X̄

– The class sample means:  X̄_j = (1/N_j) Σ_{g(i)=j} x_i,   X̄_j^p = θ_p^T X̄_j

– The class sample covariances:  W_j = (1/N_j) Σ_{g(i)=j} (x_i − X̄_j)(x_i − X̄_j)^T,   W_j^p = θ_p^T W_j θ_p


15

LDA: Methodology (cont.)

– The average within-class variation:  W = (1/N) Σ_{j=1}^{J} N_j W_j,   W^p = θ_p^T W θ_p

– The average between-class variation:  B = (1/N) Σ_{j=1}^{J} N_j (X̄_j − X̄)(X̄_j − X̄)^T,   B^p = θ_p^T B θ_p

– The total sample covariance:  T = (1/N) Σ_{i=1}^{N} (x_i − X̄)(x_i − X̄)^T,   T^p = θ_p^T T θ_p   (note that T = W + B)


16

LDA: Methodology (cont.)

• To get a p-dimensional transformation, we maximize the ratio

  θ̂_p = argmax_{θ_p} |θ_p^T B θ_p| / |θ_p^T W θ_p| = argmax_{θ_p} |θ_p^T T θ_p| / |θ_p^T W θ_p|

• To obtain θ̂_p, we choose those eigenvectors of W^{−1}T that correspond to the largest p eigenvalues, and let θ̂_p be an n×p matrix of these eigenvectors. The p-dimensional features obtained by y = θ̂_p^T x are then uncorrelated.
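
A compact numerical sketch of this LDA recipe (my own numpy rendering; the function name and the toy data are illustrative): build W and T from labeled data, take the leading eigenvectors of W^{-1}T, and project.

    import numpy as np

    def lda(X, labels, p):
        """Return an n x p LDA transform from data X (N x n) and class labels."""
        N, n = X.shape
        mean = X.mean(axis=0)
        W = np.zeros((n, n))                   # average within-class variation
        for j in np.unique(labels):
            Xj = X[labels == j]
            D = Xj - Xj.mean(axis=0)
            W += D.T @ D                       # N_j * W_j
        W /= N
        T = (X - mean).T @ (X - mean) / N      # total sample covariance (= W + B)
        # Eigenvectors of W^{-1} T with the largest p eigenvalues.
        eigvals, eigvecs = np.linalg.eig(np.linalg.solve(W, T))
        order = np.argsort(eigvals.real)[::-1][:p]
        return eigvecs[:, order].real

    # Toy example: three 4-dimensional classes projected down to p = 2.
    rng = np.random.default_rng(3)
    X = np.vstack([rng.normal(loc=m, scale=1.0, size=(200, 4))
                   for m in ([0, 0, 0, 0], [3, 0, 1, 0], [0, 3, 0, 1])])
    labels = np.repeat([0, 1, 2], 200)
    theta_p = lda(X, labels, p=2)
    Y = X @ theta_p                            # projected features y = theta_p^T x
    print(theta_p.shape, Y.shape)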


17

HLDA: ML framework

• For LDA, since the final objective is classification, the implicit assumption is that the rejected subspace does not carry any classification information.

• For Gaussian models, the assumption of lack of classification information is equivalent to the assumption that the means and the variances of the class distributions are the same for all classes in the rejected (n-p)-dimensional subspace.

• Now, viewed in an alternative way, let the full-rank linear transformation θ be such that the first p columns of θ span the p-dimensional subspace in which the class means, and possibly the class variances, differ.

– When rank(θ_{n×n}) = n, θ is said to have full rank (to be of full rank): its rank equals its order, it is nonsingular, and its inverse exists.

– Can the θ obtained by LDA be full rank?

– Since the data variables x are Gaussian, their linear transformation y is also Gaussian.


18

HLDA: ML framework (cont.)

• The goal of HLDA (Heteroscedastic Linear Discriminant Analysis) is to generalize LDA under the ML (Maximum Likelihood) framework.

• For notational convenience, we define

  μ_j = ( μ_j^p )            Σ_j = ( Σ_j^p   0       )
        ( μ_0^{n−p} )               ( 0       Σ^{n−p} )

where μ_j represents the class means and Σ_j the class covariances after transformation: the p-dimensional block μ_j^p and the p×p block Σ_j^p are class-specific, while μ_0^{n−p} and the (n−p)×(n−p) block Σ^{n−p} are common to all classes.


19

HLDA: ML framework (cont.)

• The probability density of x_i under the preceding model is given as

  P(x_i) = |θ| / ( (2π)^{n/2} |Σ_{g(i)}|^{1/2} ) · exp( −(1/2) (θ^T x_i − μ_{g(i)})^T Σ_{g(i)}^{−1} (θ^T x_i − μ_{g(i)}) )

where x_i belongs to group g(i). Note that although the Gaussian distribution is defined on the transformed variable y_i, we are interested in maximizing the likelihood of the original data x_i.

– The term |θ| comes from the Jacobian of the linear transformation y = θ^T x.


20

HLDA: ML framework (cont.)


21

HLDA: Full rank

• The log-likelihood of the data under the linear transformation θ and under the constrained Gaussian model assumption for each class is

  L_F(μ_j, Σ_j, θ ; x_i) = Σ_{i=1}^{N} log P(x_i)
    = Σ_{i=1}^{N} [ −(1/2) (θ^T x_i − μ_{g(i)})^T Σ_{g(i)}^{−1} (θ^T x_i − μ_{g(i)}) − (n/2) log 2π − (1/2) log |Σ_{g(i)}| ] + N log |θ|

– Doing a straightforward maximization with respect to the various parameters is computationally intensive. (Why?)


22

HLDA: Full rank (cont.)

• We simplify it considerably by first calculating the values of the mean and variance parameters that maximize the likelihood for a fixed linear transformation θ.

• We get

  μ̂_j^p = θ_p^T X̄_j,  j = 1, …, J
  μ̂_0^{n−p} = θ_{n−p}^T X̄
  Σ̂_j^p = θ_p^T W_j θ_p,  j = 1, …, J
  Σ̂^{n−p} = θ_{n−p}^T T θ_{n−p}

• Transformations vs. ML estimators?


23

HLDA: Full rank (cont.)

• By replacing the two sets of parameters by their ML estimates in terms of θ, the log-likelihood becomes

  L_F(θ ; x_i) = −(Nn/2) log 2π − (N/2) log |θ_{n−p}^T T θ_{n−p}| − Σ_{j=1}^{J} (N_j/2) log |θ_p^T W_j θ_p|
      − (1/2) Σ_{j=1}^{J} Σ_{g(i)=j} (x_i − X̄_j)^T θ_p (θ_p^T W_j θ_p)^{−1} θ_p^T (x_i − X̄_j)
      − (1/2) Σ_{i=1}^{N} (x_i − X̄)^T θ_{n−p} (θ_{n−p}^T T θ_{n−p})^{−1} θ_{n−p}^T (x_i − X̄)
      + N log |θ|


24

HLDA: Full rank (cont.)

• We can simplify the above log-likelihood to get θ:

– Proposition 1: Let F be any full-rank n×n matrix, and let t be any n×p matrix of rank p (p < n). Then Trace( t (t^T F t)^{−1} t^T F ) = p.

– Proposition 2: using Proposition 1, the quadratic terms reduce to constants, giving the simplified log-likelihood and its maximizer (sketched numerically below):

  L_F(θ ; x_i) = −(N/2) log |θ_{n−p}^T T θ_{n−p}| − Σ_{j=1}^{J} (N_j/2) log |θ_p^T W_j θ_p| + N log |θ| − (Nn/2)(1 + log 2π)

  θ̂_F = argmax_θ { −(N/2) log |θ_{n−p}^T T θ_{n−p}| − Σ_{j=1}^{J} (N_j/2) log |θ_p^T W_j θ_p| + N log |θ| }
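
A sketch of this full-rank objective as a plain function of θ (my own numpy rendering; the argument names are illustrative), suitable for handing to a numerical optimizer:

    import numpy as np

    def hlda_objective(theta, p, W_list, N_list, T):
        """-(N/2) log|theta_{n-p}^T T theta_{n-p}|
           - sum_j (N_j/2) log|theta_p^T W_j theta_p| + N log|theta|."""
        N = sum(N_list)
        theta_p, theta_np = theta[:, :p], theta[:, p:]
        val = -0.5 * N * np.linalg.slogdet(theta_np.T @ T @ theta_np)[1]
        for Wj, Nj in zip(W_list, N_list):
            val -= 0.5 * Nj * np.linalg.slogdet(theta_p.T @ Wj @ theta_p)[1]
        val += N * np.linalg.slogdet(theta)[1]   # N log|det(theta)|, the Jacobian term
        return val

In practice θ would be initialized with the LDA solution and the negative of this value minimized with a general-purpose optimizer (e.g., scipy.optimize.minimize over the n·n entries of θ).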


25

HLDA: Full rank (cont.)

• Since there is no closed-form solution for maximizing the likelihood with respect to θ, the maximization has to be performed numerically.

– An initial guess of θ: the LDA solution.

– Quadratic programming algorithms in the MATLAB™ toolbox can be used.

• After optimization, we use only the first p columns of θ, i.e., θ_p, to obtain the dimension-reduction transformation.


26

HLDA: Diagonal

• In speech recognition, we often assume that the within-class variances are diagonal.

• The log-likelihood of the data can be written as

  L_D(μ_j, Σ_j, θ ; x_i) = −(Nn/2) log 2π − Σ_{j=1}^{J} (N_j/2) Σ_{k=1}^{p} log σ_{j,k}² − (N/2) Σ_{k=p+1}^{n} log σ_{0,k}²
      − (1/2) Σ_{j=1}^{J} Σ_{g(i)=j} Σ_{k=1}^{p} (θ_k^T x_i − μ_{j,k})² / σ_{j,k}²
      − (1/2) Σ_{j=1}^{J} Σ_{g(i)=j} Σ_{k=p+1}^{n} (θ_k^T x_i − μ_{0,k})² / σ_{0,k}²
      + N log |θ|

where θ_k denotes the kth column of θ.


27

HLDA: Diagonal (cont.)

• Using the same method as before, and maximizing the likelihood with respect to the means and variances, we get

  μ̂_j^p = θ_p^T X̄_j,  j = 1, …, J
  μ̂_0^{n−p} = θ_{n−p}^T X̄
  Σ̂_j^p = Diag(θ_p^T W_j θ_p),  j = 1, …, J
  Σ̂^{n−p} = Diag(θ_{n−p}^T T θ_{n−p})


28

HLDA: Diagonal (cont.)

• Substituting the values of the maximizing mean and variance parameters gives the maximized likelihood of the data in terms of θ:

  L_D(θ ; x_i) = −(Nn/2) log 2π − (N/2) log |Diag(θ_{n−p}^T T θ_{n−p})| − Σ_{j=1}^{J} (N_j/2) log |Diag(θ_p^T W_j θ_p)|
      − (1/2) Σ_{j=1}^{J} Σ_{g(i)=j} (x_i − X̄_j)^T θ_p Diag(θ_p^T W_j θ_p)^{−1} θ_p^T (x_i − X̄_j)
      − (1/2) Σ_{i=1}^{N} (x_i − X̄)^T θ_{n−p} Diag(θ_{n−p}^T T θ_{n−p})^{−1} θ_{n−p}^T (x_i − X̄)
      + N log |θ|


29

HLDA: Diagonal (cont.)

• We can simplify this maximization to the following (a numerical sketch follows):

  θ̂_D = argmax_θ { −(N/2) log |Diag(θ_{n−p}^T T θ_{n−p})| − Σ_{j=1}^{J} (N_j/2) log |Diag(θ_p^T W_j θ_p)| + N log |θ| }
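
A sketch of the diagonal-covariance objective (my own numpy rendering; it mirrors the full-covariance version above but applies Diag(·) to the projected covariances):

    import numpy as np

    def hlda_diag_objective(theta, p, W_list, N_list, T):
        """Diagonal HLDA objective for a candidate n x n transform theta."""
        N = sum(N_list)
        theta_p, theta_np = theta[:, :p], theta[:, p:]
        # log|Diag(M)| is the sum of the logs of the diagonal entries of M.
        val = -0.5 * N * np.sum(np.log(np.diag(theta_np.T @ T @ theta_np)))
        for Wj, Nj in zip(W_list, N_list):
            val -= 0.5 * Nj * np.sum(np.log(np.diag(theta_p.T @ Wj @ theta_p)))
        val += N * np.linalg.slogdet(theta)[1]   # N log|det(theta)|
        return val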


30

HLDA: with equal parameters

• We finally consider the case where every class has an equal covariance matrix. Then the maximum-likelihood parameter estimates can be written as follows:

  μ̂_j^p = θ_p^T X̄_j,  j = 1, …, J
  μ̂_0^{n−p} = θ_{n−p}^T X̄
  Σ̂_j^p = Diag(θ_p^T W θ_p),  j = 1, …, J
  Σ̂^{n−p} = Diag(θ_{n−p}^T T θ_{n−p})


31

HLDA: with equal parameters (cont.)

• The objective then becomes

  θ̂ = argmax_θ { −(N/2) log |Diag(θ_{n−p}^T T θ_{n−p})| − Σ_{j=1}^{J} (N_j/2) log |Diag(θ_p^T W θ_p)| + N log |θ| }

• The solution obtained by taking the eigenvectors of W^{−1}T corresponding to the largest p eigenvalues also maximizes the expression above, thus asserting the claim that LDA is the maximum-likelihood parameter estimate of a constrained model.


32

HDA: Introduction

• As with HLDA, the essence of HDA (Heteroscedastic Discriminant Analysis) is to remove the equal within-class covariance constraint.

• HDA defines an objective function similar to LDA's, which maximizes the class discrimination in the projected subspace while ignoring the rejected dimensions.

• The assumptions of HDA:

– Being the intuitive heteroscedastic extension of LDA, HDA shares the same assumptions as LDA (Chang, 2005). But why?

– First, all of the classification information lies in the first p-dimensional feature subspace.

– Second, every class distribution is normal.


33

HDA: Derivation

• With the uniform class-specific variance assumption removed for HDA, we try to maximize:

  Π_{j=1}^{J} ( |θ^T B θ|^{N_j} / |θ^T W_j θ|^{N_j} ) = |θ^T B θ|^{N} / Π_{j=1}^{J} |θ^T W_j θ|^{N_j}

• By taking the log and rearranging terms, we get (a numerical sketch follows):

  H(θ) = Σ_{j=1}^{J} −N_j log |θ^T W_j θ| + N log |θ^T B θ|
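
A numpy sketch of H(θ) (my own; W_list, N_list and B stand for the per-class covariances W_j, the class counts N_j and the between-class scatter B):

    import numpy as np

    def hda_objective(theta, W_list, N_list, B):
        """H(theta) = sum_j -N_j log|theta^T W_j theta| + N log|theta^T B theta|.
        theta is n x p; B must have rank >= p for the last term to be finite."""
        N = sum(N_list)
        val = N * np.linalg.slogdet(theta.T @ B @ theta)[1]
        for Wj, Nj in zip(W_list, N_list):
            val -= Nj * np.linalg.slogdet(theta.T @ Wj @ theta)[1]
        return val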


34

HDA: Derivation (cont.)

• H has useful invariance properties:

– For every nonsingular p×p matrix φ, H(θφ) = H(θ). This means that subsequent feature-space transformations of the range of θ will not affect the value of the objective function. So, like LDA, the HDA solution is invariant to linear transformations of the data in the original space.

– No special provisions have to be made for θ during the optimization of H except that |θ^T θ| ≠ 0.

– The objective function is invariant to row or column scalings of θ or eigenvalue scalings of θ^T θ.

• Using matrix differentiation, the derivative of H is given by (sketched numerically below):

  ∂H/∂θ = Σ_{j=1}^{J} −2 N_j W_j θ (θ^T W_j θ)^{−1} + 2 N B θ (θ^T B θ)^{−1}

– There is no closed-form solution for H′(θ) = 0.

– Instead, a quasi-Newton conjugate gradient routine from the NAG Fortran library was used for the optimization of H.
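
A matching numpy sketch of this gradient (my own; it pairs with the hda_objective sketch above for gradient-based optimization):

    import numpy as np

    def hda_gradient(theta, W_list, N_list, B):
        """dH/dtheta = sum_j -2 N_j W_j theta (theta^T W_j theta)^{-1}
                       + 2 N B theta (theta^T B theta)^{-1}."""
        N = sum(N_list)
        grad = 2.0 * N * (B @ theta @ np.linalg.inv(theta.T @ B @ theta))
        for Wj, Nj in zip(W_list, N_list):
            grad -= 2.0 * Nj * (Wj @ theta @ np.linalg.inv(theta.T @ Wj @ theta))
        return grad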


35

HDA: Derivation (cont.)


36

HDA: Likelihood interpretation

• Assume a single full-covariance Gaussian model for each class; the log-likelihood of the samples under the induced ML model is

  Σ_{j=1}^{J} −(N_j/2) log |θ^T Ŵ_j θ| − (Np/2) log 2π + C,  where C is a constant.

– It may be seen that the summation in H is related to the log-likelihood of the projected samples. Thus, θ can be interpreted as a constrained ML projection, the constraint being given by the maximization of the projected between-class scatter volume.


37

HDA: diagonal variance

• Consider the case when diagonal-variance modeling constraints are present in the final feature space.

– MLLT (Maximum Likelihood Linear Transform) is introduced when the dimensions of the original and the projected space are the same.

– MLLT aims at minimizing the loss in likelihood between full- and diagonal-covariance Gaussian models.

– The objective is to find a transformation φ that maximizes the log-likelihood difference of the data (a numerical sketch follows):

  φ̂ = argmax_φ Σ_{j=1}^{J} −(N_j/2) [ log |Diag(φ^T Ŵ_j φ)| − log |φ^T Ŵ_j φ| ]
     = argmax_φ −Σ_{j=1}^{J} (N_j/2) log |Diag(φ^T Ŵ_j φ)| + N log |φ|

where Ŵ_j is the covariance of class j in the projected p-dimensional space.
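
A numpy sketch of this MLLT objective (my own; phi is a p×p matrix, class_covs are the projected class covariances Ŵ_j and class_counts the N_j):

    import numpy as np

    def mllt_objective(phi, class_covs, class_counts):
        """sum_j -(N_j/2) [ log|Diag(phi^T W_j phi)| - log|phi^T W_j phi| ]."""
        total = 0.0
        for Wj, Nj in zip(class_covs, class_counts):
            P = phi.T @ Wj @ phi
            logdet_diag = np.sum(np.log(np.diag(P)))
            logdet_full = np.linalg.slogdet(P)[1]
            total += -0.5 * Nj * (logdet_diag - logdet_full)
        return total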


38

HDA: HDA vs. HLDA

• Consider the diagonal constraint in the projected feature space:

– For HDA:

  H_D(θ) = Σ_{j=1}^{J} −N_j log |Diag(θ^T W_j θ)| + N log |θ^T B θ|

– For HLDA:

  K_D(θ) = Σ_{j=1}^{J} −N_j log |Diag(θ_p^T W_j θ_p)| − N log |Diag(θ_{n−p}^T T θ_{n−p})| + 2N log |θ|

In each objective, the negatively signed terms (the diagonalized within-class terms, and for HLDA also the rejected-subspace term) are to be minimized, while the positively signed terms (the between-class scatter in H_D, the Jacobian term N log |θ| in K_D) are to be maximized.