




1

PCA, LDA, HLDA and HDA

Reference:
1. E. Alpaydin, “Introduction to Machine Learning,” The MIT Press, 2004.
2. S. R. Searle, “Matrix Algebra Useful for Statistics,” Wiley Series in Probability and Mathematical Statistics, New York, 1982.
3. Berlin Chen’s slides.
4. N. Kumar and A. G. Andreou, “Heteroscedastic Discriminant Analysis and Reduced-Rank HMMs for Improved Speech Recognition,” Speech Communication, 26:283-297, 1998.
5. G. Saon, M. Padmanabhan, R. Gopinath and S. Chen, “Maximum Likelihood Discriminant Feature Spaces,” ICASSP, 2000.
6. X. Liu, “Linear Projection Schemes for Automatic Speech Recognition,” M.Phil. thesis, University of Cambridge, 2001.
7. 張志豪 (Chih-Hao Chang), “A Study of Robust and Discriminative Speech Feature Extraction Techniques for Large Vocabulary Continuous Speech Recognition,” Master’s thesis, 2005.


2

PCA: Introduction

• PCA (Principal Component Analysis) is a one-group, unsupervised projection method for reducing data dimensionality (feature extraction).

• We use PCA to find a mapping from the inputs in the original d-dimensional space to a new k-dimensional space (k < d), with:

– minimum loss of information,

– maximum amount of information preserved, measured in terms of variability.

• That is, we find the new variables, i.e., the linear transformations (major axes, principal components), that achieve this goal.

• The projection of x onto the direction of w is z = w^T x.

[Diagram: the linear transform W maps the input space X (d-dimensional) to the feature space Z (k-dimensional).]


3

PCA: Methodology (General)

• To maximize the amount of information, PCA centers the sample and then rotates the axes to line up with the directions of highest variance.

• That is, the criterion is to find w such that Var(z) is maximized.

• Var(z) = Var(w^T x) = E[(w^T x − w^T μ)^2]
  = E[(w^T x − w^T μ)(w^T x − w^T μ)^T]
  = E[w^T (x − μ)(x − μ)^T w]
  = w^T E[(x − μ)(x − μ)^T] w = w^T Σ w

where Var(x) = E[(x − μ)(x − μ)^T] = Σ.
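
A quick numerical check of this identity (a sketch of my own using numpy; the data and the direction w are arbitrary choices, not from the slides): the sample variance of the projection z = w^T x matches w^T Σ w.

    import numpy as np

    rng = np.random.default_rng(0)
    # Draw samples x_i from a 3-dimensional Gaussian with a known covariance.
    X = rng.multivariate_normal(mean=[1.0, -2.0, 0.5],
                                cov=[[4.0, 1.0, 0.0],
                                     [1.0, 2.0, 0.3],
                                     [0.0, 0.3, 1.0]],
                                size=100_000)
    Sigma = np.cov(X, rowvar=False)          # sample covariance of x
    w = np.array([0.6, 0.8, 0.0])            # a unit-norm projection direction
    z = X @ w                                # projections z_i = w^T x_i
    print(np.var(z, ddof=1), w @ Sigma @ w)  # the two values agree up to sampling error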


4

PCA: Methodology (General) (cont.)

• Maximize Var(z1) = w1^T Σ w1 subject to ||w1|| = 1.

– For a unique solution, and to make the direction the important factor, we require ||w1|| = 1, i.e., w1^T w1 = 1.

– The stationary points satisfy Σ w1 = λ w1: w1 is an eigenvector of Σ, and λ is the eigenvalue associated with w1.

– Var(z1) = Var(w1^T x) = w1^T Σ w1 = w1^T λ w1 = λ w1^T w1 = λ.

– max Var(z1) = max λ, so choose the eigenvector with the largest eigenvalue for Var(z1) to be maximal.

Lagrangian: L(w1, λ) = w1^T Σ w1 − λ (w1^T w1 − 1), where λ is a Lagrange multiplier.

max_{w1} L(w1, λ):  ∂L/∂w1 = 2 Σ w1 − 2 λ w1 = 0  ⟹  Σ w1 = λ w1.
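
As a concrete illustration (a numpy sketch of my own, not part of the slides), the principal directions are the leading eigenvectors of the sample covariance, and the variance of each projected component equals the corresponding eigenvalue:

    import numpy as np

    def pca(X, k):
        """Return the top-k principal directions (d x k) and their eigenvalues."""
        mu = X.mean(axis=0)
        Sigma = np.cov(X - mu, rowvar=False)        # sample covariance (d x d)
        eigvals, eigvecs = np.linalg.eigh(Sigma)    # ascending eigenvalues
        order = np.argsort(eigvals)[::-1][:k]       # indices of the k largest
        return eigvecs[:, order], eigvals[order]

    rng = np.random.default_rng(1)
    X = rng.multivariate_normal([0.0, 0.0, 0.0],
                                [[5.0, 2.0, 0.0],
                                 [2.0, 3.0, 0.0],
                                 [0.0, 0.0, 0.5]],
                                size=50_000)
    W, lam = pca(X, k=2)
    Z = (X - X.mean(axis=0)) @ W                    # z = W^T (x - mu)
    print(np.var(Z, axis=0, ddof=1))                # approximately equal to lam
    print(lam)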


5

PCA: Methodology (General) (cont.)

• Second principal component: maximize Var(z2) subject to ||w2|| = 1 and w2 orthogonal to w1.

– That is, w2 is the eigenvector of Σ with the second-largest eigenvalue λ2, and Var(z2) = λ2.

• Conclusions:

– wi is the eigenvector of Σ associated with the ith largest eigenvalue, and Var(zi) = λi. (This can be proved by mathematical induction.)

– The wi's are uncorrelated (orthogonal).

– w1 explains as much as possible of the original variance in the data set.

– w2 explains as much as possible of the remaining variance, and so on.

Lagrangian with two multipliers: L(w2, λ, β) = w2^T Σ w2 − λ (w2^T w2 − 1) − β (w2^T w1 − 0).

max_{w2} L(w2, λ, β):  ∂L/∂w2 = 2 Σ w2 − 2 λ w2 − β w1 = 0; pre-multiplying by w1^T shows β = 0, so Σ w2 = λ w2.


6

PCA: Some Discussions

• About dimensions:

– If the dimensions are highly correlated, there will be a small number of eigenvectors with large eigenvalues; k will be much smaller than d, and a large reduction in dimensionality can be attained.

– If the dimensions are not correlated, k will be as large as d and there is no gain through PCA.


7

PCA: Some Discussions (cont.)

• About Σ:

– For two different eigenvalues, the corresponding eigenvectors are orthogonal.

– If Σ is positive definite (x^T Σ x > 0 for all x ≠ 0), all eigenvalues are positive.

– If Σ is singular, then its rank, the effective dimension, is k < d, and λi = 0 for i > k.

• About scaling:

– Different variables may have completely different scales.

– The eigenvalues of the matrix are scale dependent.

– If the scale of the data is unknown, it is better to use the correlation matrix instead of the covariance matrix.

– The interpretation of the principal components derived by these two methods can be completely different.


8

PCA: Methodology (Spectral Decomposition)

• z = W^T x, or z = W^T (x − m): center the data on the origin.

• We want to find a matrix W such that, with z = W^T x, Cov(z) = D is a diagonal matrix.

• That is, we would like the components zi to be uncorrelated.

• Let C = [c1, c2, …, cd] be the matrix of normalized eigenvectors of S. Then

– C^T C = I

– S = S C C^T
     = S (c1, c2, …, cd) C^T
     = (S c1, S c2, …, S cd) C^T
     = (λ1 c1, λ2 c2, …, λd cd) C^T
     = λ1 c1 c1^T + λ2 c2 c2^T + … + λd cd cd^T
     = C D C^T

– D = C^T S C, which plays the role of W^T Σ W: D(k×k) = C^T(k×d) S(d×d) C(d×k), where d is the input-space dimension and k the feature-space dimension.

Spectral decomposition is the factorization of a positive definite matrix S into S = C D C^T, where D is a diagonal matrix of eigenvalues and the columns of C are the eigenvectors.
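
A small numerical sketch of the spectral decomposition (my own, assuming numpy and a symmetric positive definite S), checking S = C D C^T, D = C^T S C and C^T C = I:

    import numpy as np

    S = np.array([[4.0, 1.0, 0.5],
                  [1.0, 3.0, 0.2],
                  [0.5, 0.2, 2.0]])            # symmetric positive definite
    eigvals, C = np.linalg.eigh(S)             # columns of C are eigenvectors
    D = np.diag(eigvals)
    print(np.allclose(S, C @ D @ C.T))         # S = C D C^T
    print(np.allclose(D, C.T @ S @ C))         # D = C^T S C is diagonal
    print(np.allclose(C.T @ C, np.eye(3)))     # C^T C = I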


9

Appendix A

• Another criterion for PCA is the MMSE (minimum mean-squared error) criterion, which reaches the same solution as the two methods above, though there are interesting differences among them.

• Some important properties of symmetric matrices:

– The eigenvalues are all real.

– Symmetric matrices are diagonalizable.

– The eigenvectors are orthogonal:

• Eigenvectors corresponding to different eigenvalues are orthogonal.

• The m_k linearly independent eigenvectors corresponding to an eigenvalue λ_k of multiplicity m_k can be chosen to be mutually orthogonal.

– The rank equals the number of nonzero eigenvalues.
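
A quick numerical check of these properties (a numpy sketch, not from the slides; the symmetric matrix below is an arbitrary example):

    import numpy as np

    rng = np.random.default_rng(2)
    A = rng.standard_normal((4, 4))
    S = A @ A.T                                # a symmetric (PSD) matrix
    eigvals, V = np.linalg.eigh(S)
    print(eigvals.dtype)                       # real eigenvalues
    print(np.allclose(V.T @ V, np.eye(4)))     # orthonormal eigenvectors
    print(np.allclose(V @ np.diag(eigvals) @ V.T, S))   # diagonalizable: S = V D V^T
    print(np.linalg.matrix_rank(S),
          np.sum(~np.isclose(eigvals, 0.0)))   # rank = number of nonzero eigenvalues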


10

LDA: Introduction

• LDA (Linear Discriminant Analysis) (Fisher, 1936; Rao, 1935) is a supervised dimension-reduction method for classification problems.

• The use of LDA to obtain features suitable for speech-sound classification was proposed by (Hunt, 1979).

– Brown showed that the LDA transform is superior to the PCA transform, using a DHMM classifier and incorporating context information (Brown, 1987).

– Later researchers applied LDA to DHMM and CHMM speech recognition systems and reported improved performance on small-vocabulary tasks, but mixed results on large-vocabulary phoneme-based systems.


11

LDA: Assumptions

• LDA is related to the MLE (maximum likelihood estimation) of the parameters of a Gaussian model, with two a priori assumptions (Campbell, 1984):

– First, all the class-discrimination information resides in a p-dimensional subspace of the n-dimensional feature space.

– Second, the within-class variances are equal for all classes.

• Another notable assumption is that each class distribution is a mixture of Gaussians (Hastie & Tibshirani, 1994). (Why not a single Gaussian?)

– This means LDA is optimal if the classes are normally distributed, but we can still use LDA for classification even when they are not.


12

LDA: Methodology

• Criterion: Given a set of sample vectors with labeled (class) information, find a linear transform W such that the ratio of the average between-class variation to the average within-class variation is maximal.

– After projection, for all classes to be well separated, we would like the class means to be as far apart as possible and the examples of each class to be scattered in as small a region as possible.


13

LDA: Methodology (cont.)

• Let x be an n-dimensional feature vector. We seek a linear transformation R^n → R^p (p < n) of the form y_p = θ_p^T x, where θ_p is an n×p matrix. Let θ be a nonsingular n×n matrix used to define the linear transformation y = θ^T x, and partition it as θ = [θ_p | θ_{n−p}].

• First, we apply a nonsingular linear transformation to x to obtain y = θ^T x. Second, we retain only the first p rows of y to give y_p.


14

LDA: Methodology (cont.)

• Let there be a total of J classes, and let g(i) ∈ {1, …, J} indicate the class associated with x_i. Let {x_i} be the set of training examples available.

– The sample mean:  X̄ = (1/N) Σ_{i=1}^{N} x_i,   X̄^p = θ_p^T X̄

– The class sample means:  X̄_j = (1/N_j) Σ_{g(i)=j} x_i,   X̄_j^p = θ_p^T X̄_j

– The class sample covariances:  W_j = (1/N_j) Σ_{g(i)=j} (x_i − X̄_j)(x_i − X̄_j)^T,   W_j^p = θ_p^T W_j θ_p


15

LDA: Methodology (cont.)

– The average within-class variation:  W = (1/N) Σ_{j=1}^{J} N_j W_j,   W^p = θ_p^T W θ_p

– The average between-class variation:  B = (1/N) Σ_{j=1}^{J} N_j (X̄_j − X̄)(X̄_j − X̄)^T,   B^p = θ_p^T B θ_p

– The total sample covariance:  T = (1/N) Σ_{i=1}^{N} (x_i − X̄)(x_i − X̄)^T,   T^p = θ_p^T T θ_p   (note that T = W + B)


16

LDA: Methodology (cont.)

• To get a p-dimensional transformation, we maximize the ratio

  θ̂_p = argmax_{θ_p} |θ_p^T B θ_p| / |θ_p^T W θ_p| = argmax_{θ_p} |θ_p^T T θ_p| / |θ_p^T W θ_p|

• To obtain θ̂_p, we choose those eigenvectors of W^{−1}T that correspond to the largest p eigenvalues, and let θ̂_p be an n×p matrix of these eigenvectors. The p-dimensional features obtained by y = θ̂_p^T x are then uncorrelated.
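
A compact numerical sketch of this LDA recipe (my own numpy rendering; the function name and the toy data are illustrative): build W and T from labeled data, take the leading eigenvectors of W^{-1}T, and project.

    import numpy as np

    def lda(X, labels, p):
        """Return an n x p LDA transform from data X (N x n) and class labels."""
        N, n = X.shape
        mean = X.mean(axis=0)
        W = np.zeros((n, n))                   # average within-class variation
        for j in np.unique(labels):
            Xj = X[labels == j]
            D = Xj - Xj.mean(axis=0)
            W += D.T @ D                       # N_j * W_j
        W /= N
        T = (X - mean).T @ (X - mean) / N      # total sample covariance (= W + B)
        # Eigenvectors of W^{-1} T with the largest p eigenvalues.
        eigvals, eigvecs = np.linalg.eig(np.linalg.solve(W, T))
        order = np.argsort(eigvals.real)[::-1][:p]
        return eigvecs[:, order].real

    # Toy example: three 4-dimensional classes projected down to p = 2.
    rng = np.random.default_rng(3)
    X = np.vstack([rng.normal(loc=m, scale=1.0, size=(200, 4))
                   for m in ([0, 0, 0, 0], [3, 0, 1, 0], [0, 3, 0, 1])])
    labels = np.repeat([0, 1, 2], 200)
    theta_p = lda(X, labels, p=2)
    Y = X @ theta_p                            # projected features y = theta_p^T x
    print(theta_p.shape, Y.shape)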


17

HLDA: ML framework

• For LDA, since the final objective is classification, the implicit assumption is that the rejected subspace does not carry any classification information.

• For Gaussian models, the assumption of lack of classification information is equivalent to the assumption that the means and the variances of the class distributions are the same for all classes in the rejected (n-p)-dimensional subspace.

• Now, viewed in an alternative way, let the full-rank linear transformation θ be such that the first p columns of θ span the p-dimensional subspace in which the class means, and possibly the class variances, differ.

– When rank(θ_{n×n}) = n, θ is said to have full rank (to be of full rank): its rank equals its order, it is nonsingular, and its inverse exists.

– Can the θ obtained by LDA be full rank?

– Since the data variables x are Gaussian, their linear transformation y is also Gaussian.


18

HLDA: ML framework (cont.)

• The goal of HLDA (Heteroscedastic Linear Discriminant Analysis) is to generalize LDA under the ML (Maximum Likelihood) framework.

• For notational convenience, we define

  μ_j = ( μ_j^p )            Σ_j = ( Σ_j^p   0       )
        ( μ_0^{n−p} )               ( 0       Σ^{n−p} )

where μ_j represents the class means and Σ_j the class covariances after transformation: the p-dimensional block μ_j^p and the p×p block Σ_j^p are class-specific, while μ_0^{n−p} and the (n−p)×(n−p) block Σ^{n−p} are common to all classes.


19

HLDA: ML framework (cont.)

• The probability density of x_i under the preceding model is given as

  P(x_i) = |θ| / ( (2π)^{n/2} |Σ_{g(i)}|^{1/2} ) · exp( −(1/2) (θ^T x_i − μ_{g(i)})^T Σ_{g(i)}^{−1} (θ^T x_i − μ_{g(i)}) )

where x_i belongs to group g(i). Note that although the Gaussian distribution is defined on the transformed variable y_i, we are interested in maximizing the likelihood of the original data x_i.

– The term |θ| comes from the Jacobian of the linear transformation y = θ^T x.


20

HLDA: ML framework (cont.)


21

HLDA: Full rank

• The log-likelihood of the data under the linear transformation θ and under the constrained Gaussian model assumption for each class is

  L_F(μ_j, Σ_j, θ ; x_i) = Σ_{i=1}^{N} log P(x_i)
    = Σ_{i=1}^{N} [ −(1/2) (θ^T x_i − μ_{g(i)})^T Σ_{g(i)}^{−1} (θ^T x_i − μ_{g(i)}) − (n/2) log 2π − (1/2) log |Σ_{g(i)}| ] + N log |θ|

– Doing a straightforward maximization with respect to the various parameters is computationally intensive. (Why?)


22

HLDA: Full rank (cont.)

• We simplify it considerably by first calculating the values of the mean and variance parameters that maximize the likelihood for a fixed linear transformation θ.

• We get

  μ̂_j^p = θ_p^T X̄_j,  j = 1, …, J
  μ̂_0^{n−p} = θ_{n−p}^T X̄
  Σ̂_j^p = θ_p^T W_j θ_p,  j = 1, …, J
  Σ̂^{n−p} = θ_{n−p}^T T θ_{n−p}

• Transformations vs. ML estimators?


23

HLDA: Full rank (cont.)

• By replacing the two sets of parameters by their ML estimates in terms of θ, the log-likelihood becomes

  L_F(θ ; x_i) = −(Nn/2) log 2π − (N/2) log |θ_{n−p}^T T θ_{n−p}| − Σ_{j=1}^{J} (N_j/2) log |θ_p^T W_j θ_p|
      − (1/2) Σ_{j=1}^{J} Σ_{g(i)=j} (x_i − X̄_j)^T θ_p (θ_p^T W_j θ_p)^{−1} θ_p^T (x_i − X̄_j)
      − (1/2) Σ_{i=1}^{N} (x_i − X̄)^T θ_{n−p} (θ_{n−p}^T T θ_{n−p})^{−1} θ_{n−p}^T (x_i − X̄)
      + N log |θ|


24

HLDA: Full rank (cont.)

• We can simplify the above log-likelihood to get θ:

– Proposition 1: Let F be any full-rank n×n matrix, and let t be any n×p matrix of rank p (p < n). Then Trace( t (t^T F t)^{−1} t^T F ) = p.

– Proposition 2: using Proposition 1, the quadratic terms reduce to constants, giving the simplified log-likelihood and its maximizer (sketched numerically below):

  L_F(θ ; x_i) = −(N/2) log |θ_{n−p}^T T θ_{n−p}| − Σ_{j=1}^{J} (N_j/2) log |θ_p^T W_j θ_p| + N log |θ| − (Nn/2)(1 + log 2π)

  θ̂_F = argmax_θ { −(N/2) log |θ_{n−p}^T T θ_{n−p}| − Σ_{j=1}^{J} (N_j/2) log |θ_p^T W_j θ_p| + N log |θ| }
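
A sketch of this full-rank objective as a plain function of θ (my own numpy rendering; the argument names are illustrative), suitable for handing to a numerical optimizer:

    import numpy as np

    def hlda_objective(theta, p, W_list, N_list, T):
        """-(N/2) log|theta_{n-p}^T T theta_{n-p}|
           - sum_j (N_j/2) log|theta_p^T W_j theta_p| + N log|theta|."""
        N = sum(N_list)
        theta_p, theta_np = theta[:, :p], theta[:, p:]
        val = -0.5 * N * np.linalg.slogdet(theta_np.T @ T @ theta_np)[1]
        for Wj, Nj in zip(W_list, N_list):
            val -= 0.5 * Nj * np.linalg.slogdet(theta_p.T @ Wj @ theta_p)[1]
        val += N * np.linalg.slogdet(theta)[1]   # N log|det(theta)|, the Jacobian term
        return val

In practice θ would be initialized with the LDA solution and the negative of this value minimized with a general-purpose optimizer (e.g., scipy.optimize.minimize over the n·n entries of θ).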


25

HLDA: Full rank (cont.)

• Since there is no closed-form solution for maximizing the likelihood with respect to θ, the maximization has to be performed numerically.

– An initial guess of θ: the LDA solution.

– Quadratic programming algorithms in the MATLAB™ toolbox can be used.

• After optimization, we use only the first p columns of θ, i.e., θ_p, to obtain the dimension-reduction transformation.


26

HLDA: Diagonal

• In speech recognition, we often assume that the within-class variances are diagonal.

• The log-likelihood of the data can be written as

  L_D(μ_j, Σ_j, θ ; x_i) = −(Nn/2) log 2π − Σ_{j=1}^{J} (N_j/2) Σ_{k=1}^{p} log σ_{j,k}² − (N/2) Σ_{k=p+1}^{n} log σ_{0,k}²
      − (1/2) Σ_{j=1}^{J} Σ_{g(i)=j} Σ_{k=1}^{p} (θ_k^T x_i − μ_{j,k})² / σ_{j,k}²
      − (1/2) Σ_{j=1}^{J} Σ_{g(i)=j} Σ_{k=p+1}^{n} (θ_k^T x_i − μ_{0,k})² / σ_{0,k}²
      + N log |θ|

where θ_k denotes the kth column of θ.


27

HLDA: Diagonal (cont.)

• Using the same method as before, and maximizing the likelihood with respect to the means and variances, we get

  μ̂_j^p = θ_p^T X̄_j,  j = 1, …, J
  μ̂_0^{n−p} = θ_{n−p}^T X̄
  Σ̂_j^p = Diag(θ_p^T W_j θ_p),  j = 1, …, J
  Σ̂^{n−p} = Diag(θ_{n−p}^T T θ_{n−p})


28

HLDA: Diagonal (cont.)

• Substituting the values of the maximizing mean and variance parameters gives the maximized likelihood of the data in terms of θ:

  L_D(θ ; x_i) = −(Nn/2) log 2π − (N/2) log |Diag(θ_{n−p}^T T θ_{n−p})| − Σ_{j=1}^{J} (N_j/2) log |Diag(θ_p^T W_j θ_p)|
      − (1/2) Σ_{j=1}^{J} Σ_{g(i)=j} (x_i − X̄_j)^T θ_p Diag(θ_p^T W_j θ_p)^{−1} θ_p^T (x_i − X̄_j)
      − (1/2) Σ_{i=1}^{N} (x_i − X̄)^T θ_{n−p} Diag(θ_{n−p}^T T θ_{n−p})^{−1} θ_{n−p}^T (x_i − X̄)
      + N log |θ|


29

HLDA: Diagonal (cont.)

• We can simplify this maximization to the following (a numerical sketch follows):

  θ̂_D = argmax_θ { −(N/2) log |Diag(θ_{n−p}^T T θ_{n−p})| − Σ_{j=1}^{J} (N_j/2) log |Diag(θ_p^T W_j θ_p)| + N log |θ| }
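
A sketch of the diagonal-covariance objective (my own numpy rendering; it mirrors the full-covariance version above but applies Diag(·) to the projected covariances):

    import numpy as np

    def hlda_diag_objective(theta, p, W_list, N_list, T):
        """Diagonal HLDA objective for a candidate n x n transform theta."""
        N = sum(N_list)
        theta_p, theta_np = theta[:, :p], theta[:, p:]
        # log|Diag(M)| is the sum of the logs of the diagonal entries of M.
        val = -0.5 * N * np.sum(np.log(np.diag(theta_np.T @ T @ theta_np)))
        for Wj, Nj in zip(W_list, N_list):
            val -= 0.5 * Nj * np.sum(np.log(np.diag(theta_p.T @ Wj @ theta_p)))
        val += N * np.linalg.slogdet(theta)[1]   # N log|det(theta)|
        return val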


30

HLDA: with equal parameters

• We finally consider the case where every class has an equal covariance matrix. Then the maximum-likelihood parameter estimates can be written as follows:

  μ̂_j^p = θ_p^T X̄_j,  j = 1, …, J
  μ̂_0^{n−p} = θ_{n−p}^T X̄
  Σ̂_j^p = Diag(θ_p^T W θ_p),  j = 1, …, J
  Σ̂^{n−p} = Diag(θ_{n−p}^T T θ_{n−p})


31

HLDA: with equal parameters (cont.)

• The objective then becomes

  θ̂ = argmax_θ { −(N/2) log |Diag(θ_{n−p}^T T θ_{n−p})| − Σ_{j=1}^{J} (N_j/2) log |Diag(θ_p^T W θ_p)| + N log |θ| }

• The solution obtained by taking the eigenvectors of W^{−1}T corresponding to the largest p eigenvalues also maximizes the expression above, thus asserting the claim that LDA is the maximum-likelihood parameter estimate of a constrained model.


32

HDA: Introduction

• As with HLDA, the essence of HDA (Heteroscedastic Discriminant Analysis) is to remove the equal within-class covariance constraint.

• HDA defines an objective function similar to LDA's, which maximizes the class discrimination in the projected subspace while ignoring the rejected dimensions.

• The assumptions of HDA:

– Being the intuitive heteroscedastic extension of LDA, HDA shares the same assumptions as LDA (Chang, 2005). But why?

– First, all of the classification information lies in the first p-dimensional feature subspace.

– Second, every class distribution is normal.


33

HDA: Derivation

• With the uniform class-specific variance assumption removed for HDA, we try to maximize:

  Π_{j=1}^{J} ( |θ^T B θ|^{N_j} / |θ^T W_j θ|^{N_j} ) = |θ^T B θ|^{N} / Π_{j=1}^{J} |θ^T W_j θ|^{N_j}

• By taking the log and rearranging terms, we get (a numerical sketch follows):

  H(θ) = Σ_{j=1}^{J} −N_j log |θ^T W_j θ| + N log |θ^T B θ|
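
A numpy sketch of H(θ) (my own; W_list, N_list and B stand for the per-class covariances W_j, the class counts N_j and the between-class scatter B):

    import numpy as np

    def hda_objective(theta, W_list, N_list, B):
        """H(theta) = sum_j -N_j log|theta^T W_j theta| + N log|theta^T B theta|.
        theta is n x p; B must have rank >= p for the last term to be finite."""
        N = sum(N_list)
        val = N * np.linalg.slogdet(theta.T @ B @ theta)[1]
        for Wj, Nj in zip(W_list, N_list):
            val -= Nj * np.linalg.slogdet(theta.T @ Wj @ theta)[1]
        return val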


34

HDA: Derivation (cont.)

• H has useful invariance properties:

– For every nonsingular p×p matrix φ, H(θφ) = H(θ). This means that subsequent feature-space transformations of the range of θ will not affect the value of the objective function. So, like LDA, the HDA solution is invariant to linear transformations of the data in the original space.

– No special provisions have to be made for θ during the optimization of H except that |θ^T θ| ≠ 0.

– The objective function is invariant to row or column scalings of θ or eigenvalue scalings of θ^T θ.

• Using matrix differentiation, the derivative of H is given by (sketched numerically below):

  ∂H/∂θ = Σ_{j=1}^{J} −2 N_j W_j θ (θ^T W_j θ)^{−1} + 2 N B θ (θ^T B θ)^{−1}

– There is no closed-form solution for H′(θ) = 0.

– Instead, a quasi-Newton conjugate gradient routine from the NAG Fortran library was used for the optimization of H.
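
A matching numpy sketch of this gradient (my own; it pairs with the hda_objective sketch above for gradient-based optimization):

    import numpy as np

    def hda_gradient(theta, W_list, N_list, B):
        """dH/dtheta = sum_j -2 N_j W_j theta (theta^T W_j theta)^{-1}
                       + 2 N B theta (theta^T B theta)^{-1}."""
        N = sum(N_list)
        grad = 2.0 * N * (B @ theta @ np.linalg.inv(theta.T @ B @ theta))
        for Wj, Nj in zip(W_list, N_list):
            grad -= 2.0 * Nj * (Wj @ theta @ np.linalg.inv(theta.T @ Wj @ theta))
        return grad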


35

HDA: Derivation (cont.)


36

HDA: Likelihood interpretation

• Assume a single full-covariance Gaussian model for each class; the log-likelihood of the samples under the induced ML model is

  Σ_{j=1}^{J} −(N_j/2) log |θ^T Ŵ_j θ| − (Np/2) log 2π + C,  where C is a constant.

– It may be seen that the summation in H is related to the log-likelihood of the projected samples. Thus, θ can be interpreted as a constrained ML projection, the constraint being given by the maximization of the projected between-class scatter volume.


37

HDA: diagonal variance

• Consider the case when diagonal-variance modeling constraints are present in the final feature space.

– MLLT (Maximum Likelihood Linear Transform) is introduced when the dimensions of the original and the projected space are the same.

– MLLT aims at minimizing the loss in likelihood between full- and diagonal-covariance Gaussian models.

– The objective is to find a transformation φ that maximizes the log-likelihood difference of the data (a numerical sketch follows):

  φ̂ = argmax_φ Σ_{j=1}^{J} −(N_j/2) [ log |Diag(φ^T Ŵ_j φ)| − log |φ^T Ŵ_j φ| ]
     = argmax_φ −Σ_{j=1}^{J} (N_j/2) log |Diag(φ^T Ŵ_j φ)| + N log |φ|

where Ŵ_j is the covariance of class j in the projected p-dimensional space.
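
A numpy sketch of this MLLT objective (my own; phi is a p×p matrix, class_covs are the projected class covariances Ŵ_j and class_counts the N_j):

    import numpy as np

    def mllt_objective(phi, class_covs, class_counts):
        """sum_j -(N_j/2) [ log|Diag(phi^T W_j phi)| - log|phi^T W_j phi| ]."""
        total = 0.0
        for Wj, Nj in zip(class_covs, class_counts):
            P = phi.T @ Wj @ phi
            logdet_diag = np.sum(np.log(np.diag(P)))
            logdet_full = np.linalg.slogdet(P)[1]
            total += -0.5 * Nj * (logdet_diag - logdet_full)
        return total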


38

HDA: HDA vs. HLDA

• Consider the diagonal constraint in the projected feature space:

– For HDA:

  H_D(θ) = Σ_{j=1}^{J} −N_j log |Diag(θ^T W_j θ)| + N log |θ^T B θ|

– For HLDA:

  K_D(θ) = Σ_{j=1}^{J} −N_j log |Diag(θ_p^T W_j θ_p)| − N log |Diag(θ_{n−p}^T T θ_{n−p})| + 2N log |θ|

In each objective, the negatively signed terms (the diagonalized within-class terms, and for HLDA also the rejected-subspace term) are to be minimized, while the positively signed terms (the between-class scatter in H_D, the Jacobian term N log |θ| in K_D) are to be maximized.