PCA Regularized PCA Results Discussion
Regularized PCA to denoise and visualise data
Marie Verbanck · Julie Josse · François Husson
Laboratoire de statistique, Agrocampus Ouest, Rennes, France
Statlearn, Bordeaux, 9 April 2013
Outline
1 PCA
2 Regularized PCA
MSE point of view
Bayesian point of view
3 Results
4 Discussion
Geometrical point of view
Maximize the variance of the projected points = minimize the reconstruction error
⇒ approximation of X of low rank (S < p): minimize ‖X_{n×p} − X̂_{n×p}‖²
SVD: X̂^PCA = U_{n×S} Λ^{1/2}_{S×S} V′_{p×S} = F_{n×S} V′_{p×S}
F = UΛ^{1/2}: principal components; V: principal axes
Alternating least squares (ALS) algorithm on ‖X − FV′‖²:
F = XV(V′V)⁻¹
V = X′F(F′F)⁻¹
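The equivalence of the two views (truncated SVD and the ALS updates) can be checked in a few lines of numpy; the data, rank S and variable names below are illustrative, not from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 8))
X = X - X.mean(axis=0)                   # centred data, as PCA assumes

S = 2                                    # number of retained dimensions
U, sval, Vt = np.linalg.svd(X, full_matrices=False)
F = U[:, :S] * sval[:S]                  # principal components F = U Lambda^(1/2)
V = Vt[:S].T                             # principal axes
X_pca = F @ V.T                          # best rank-S approximation of X

# One ALS sweep leaves the SVD solution unchanged (it is a fixed point):
F_new = X @ V @ np.linalg.inv(V.T @ V)
V_new = X.T @ F_new @ np.linalg.inv(F_new.T @ F_new)
assert np.allclose(F_new @ V_new.T, X_pca)
```

The ALS view matters below because regularized variants replace these two plain regressions by ridge regressions.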
Model in PCA
x_ij = signal + noise
• Random effect model: Factor Analysis, Probabilistic PCA (Lawley, 1953; Roweis, 1998; Tipping & Bishop, 1999)
⇒ individuals are exchangeable
⇒ models the relationships between variables
• Fixed effect model (Lawley, 1941; Caussinus, 1986)
⇒ individuals are not i.i.d.
⇒ studies both the individuals and the variables
Examples: sensory analysis (products × sensory descriptors), plant breeding (genotypes × environments)
Fixed effect model
X_{n×p} = X̃_{n×p} + ε_{n×p}
x_ij = Σ_{s=1}^S √d_s q_is r_js + ε_ij,  ε_ij ~ N(0, σ²)
• Randomness is due to the noise only
• Individuals have different expectations: individual parameters
• Inferential framework: when n ↗, the number of parameters ↗ ⇒ asymptotics taken as σ → 0
• Maximum likelihood estimates (θ̂) = least squares estimates: X̂^PCA = UΛ^{1/2}V′, and the EM algorithm = the ALS algorithm
⇒ Not necessarily the best estimates in terms of MSE
⇒ Shrinkage estimates: minimize E(φθ̂ − θ)²
Minimizing the MSE
x_ij = x̃_ij + ε_ij = Σ_{s=1}^S x̃_ij^(s) + ε_ij
MSE = E[ Σ_{i,j} ( Σ_{s=1}^{min(n−1,p)} φ_s x̂_ij^(s) − x̃_ij )² ]  with x̂_ij^(s) = √λ_s u_is v_js
MSE = E[ Σ_{i,j} { Σ_{s=1}^S ( φ_s x̂_ij^(s) − x̃_ij^(s) )² + Σ_{s=S+1}^{min(n−1,p)} ( φ_s x̂_ij^(s) − x̃_ij^(s) )² } ]
For all s ≥ S+1, x̃_ij^(s) = 0, therefore φ_{S+1} = ... = φ_{min(n−1,p)} = 0.
Minimizing the MSE
MSE = E[ Σ_{i,j} ( Σ_{s=1}^S φ_s x̂_ij^(s) − x̃_ij )² ]  with x̂_ij^(s) = √λ_s u_is v_js
φ_s = Σ_{i,j} E(x̂_ij^(s)) x̃_ij / Σ_{i,j} E(x̂_ij^(s)²)
    = Σ_{i,j} E(x̂_ij^(s)) x̃_ij / Σ_{i,j} [ V(x̂_ij^(s)) + (E(x̂_ij^(s)))² ]
When σ → 0: E(x̂_ij^(s)) = x̃_ij^(s) (unbiased); V(x̂_ij^(s)) = σ²/min(n−1,p)  (Pazman & Denis, 1996; Denis & Gower, 1999)
φ_s = Σ_{i,j} x̃_ij^(s)² / ( Σ_{i,j} σ²/min(n−1,p) + Σ_{i,j} x̃_ij^(s)² )
Minimizing the MSE
x_ij = x̃_ij + ε_ij = Σ_{s=1}^S √d_s q_is r_js + ε_ij,  ε_ij ~ N(0, σ²)
MSE = E[ Σ_{i,j} ( Σ_{s=1}^S φ_s x̂_ij^(s) − x̃_ij )² ]  with x̂_ij^(s) = √λ_s u_is v_js
φ_s = Σ_{i,j} x̃_ij^(s)² / ( Σ_{i,j} σ²/min(n−1,p) + Σ_{i,j} x̃_ij^(s)² )
    = d_s / ( npσ²/min(n−1,p) + d_s ) = signal variance / total variance
φ_s = (λ_s − σ²)/λ_s   for s = 1, ..., S
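Plugging made-up numbers into the signal-over-total form of φ_s shows how much a dimension gets shrunk (all values below are hypothetical, chosen only for round arithmetic):

```python
# phi_s = d_s / (d_s + n p sigma^2 / min(n-1, p)): signal variance over total variance
n, p, sigma2 = 100, 20, 1.0
d_s = 400.0                                    # hypothetical signal variance on dimension s
noise_part = n * p * sigma2 / min(n - 1, p)    # noise share of the dimension: 100.0
phi_s = d_s / (d_s + noise_part)               # 0.8: the dimension is shrunk by 20%
assert phi_s == 0.8
```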
Regularized PCA
x̂_ij^PCA = Σ_{s=1}^S √λ_s u_is v_js
x̂_ij^rPCA = Σ_{s=1}^S ((λ_s − σ²)/λ_s) √λ_s u_is v_js = Σ_{s=1}^S (√λ_s − σ²/√λ_s) u_is v_js
Compromise between hard and soft thresholding:
σ² small → regularized PCA ≈ PCA
σ² large → individuals shrunk toward the centre of gravity
σ̂² = RSS/ddl = n Σ_{s=S+1}^{min(n−1,p)} λ_s / (np − p − nS − pS + S² + S)
(ddl = number of cells of X_{n×p} minus the number of parameters in the means, U_{n×S} and V_{p×S}; only asymptotically unbiased)
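A numpy sketch of the resulting estimator on simulated data. The eigenvalue scaling (λ_s taken as sval²/n) and the use of the known σ² instead of its estimate are simplifying assumptions of this toy example, not the talk's exact conventions:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, S, sigma = 100, 20, 2, 2.0
truth = rng.standard_normal((n, S)) @ rng.standard_normal((S, p))   # low-rank signal
X = truth + sigma * rng.standard_normal((n, p))                     # noisy observation

U, sval, Vt = np.linalg.svd(X, full_matrices=False)
lam = sval[:S] ** 2 / n                  # eigenvalues on a per-observation scale (assumed)
phi = (lam - sigma**2) / lam             # shrinkage (lambda_s - sigma^2) / lambda_s

X_pca = (U[:, :S] * sval[:S]) @ Vt[:S]             # hard-thresholded SVD
X_rpca = (U[:, :S] * (phi * sval[:S])) @ Vt[:S]    # regularized PCA: shrunken SVD

mse_pca = np.mean((X_pca - truth) ** 2)
mse_rpca = np.mean((X_rpca - truth) ** 2)   # typically below mse_pca at this noise level
```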
Illustration of the regularization
X^sim_{4×10} = X̃_{4×10} + ε^sim_{4×10},  sim = 1, ..., 1000
[Figure: PCA vs rPCA configurations over the 1000 simulations]
⇒ rPCA is more biased but less variable
Bayesian point of view: ridge regression reminder
Y = Xβ + ε,  ε ~ N(0, σ²)
β ~ N(0, τ²)
β̂^MAP = (X′X + κI)⁻¹ X′Y,  with κ = σ²/τ²
Probabilistic PCA (Roweis, 1998; Tipping & Bishop, 1999)
⇒ A specific case of an exploratory factor analysis model
x_ij = (Z_{n×S} B′_{p×S})_ij + ε_ij,  z_i ~ N(0, I_S),  ε_i ~ N(0, σ²I_p)
• Conditional independence: x_i | z_i ~ N(Bz_i, σ²I_p)
• Distribution of the observations: x_i ~ N(0, Σ) with Σ = BB′ + σ²I_p
• Explicit solution:
  σ̂² = 1/(p−S) Σ_{s=S+1}^p λ_s
  B̂ = V(Λ − σ²I_S)^{1/2}
Probabilistic PCA
x_ij = (Z_{n×S}B′_{p×S})_ij + ε_ij,  z_i ~ N(0, I_S),  ε_i ~ N(0, σ²I_p)
⇒ Random effect model: no individual parameters
⇒ BLUP E(z_i|x_i):  Ẑ = XB̂(B̂′B̂ + σ²I_S)⁻¹
X̂^pPCA = ẐB̂′
Using B̂ = V(Λ − σ²I_S)^{1/2} and the SVD X = UΛ^{1/2}V′ (so XV = UΛ^{1/2}):
X̂^pPCA = U(Λ − σ²I_S)Λ^{−1/2}V′,  i.e.  x̂_ij^pPCA = Σ_{s=1}^S (√λ_s − σ²/√λ_s) u_is v_js
Bayesian treatment of the fixed effect model with a prior on z_i ⇒ Regularized PCA
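The matrix form and the shrunken-singular-value form are the same estimator; a quick numeric check on arbitrary data (the rank and σ² below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((15, 6))
S, sigma2 = 2, 0.3                       # illustrative rank and noise variance

U, sval, Vt = np.linalg.svd(X, full_matrices=False)
lam = sval[:S] ** 2

# pPCA fitted values U (Lambda - sigma^2 I) Lambda^(-1/2) V' ...
X_ppca = (U[:, :S] * ((lam - sigma2) / np.sqrt(lam))) @ Vt[:S]

# ... coincide with the shrunken singular values (sqrt(lambda_s) - sigma^2/sqrt(lambda_s))
X_shrunk = (U[:, :S] * (np.sqrt(lam) - sigma2 / np.sqrt(lam))) @ Vt[:S]
assert np.allclose(X_ppca, X_shrunk)
```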
Probabilistic PCA
⇒ EM algorithm in pPCA: two ridge regressions
E step: Ẑ = XB̂(B̂′B̂ + σ²I_S)⁻¹
M step: B̂ = X′Ẑ(Ẑ′Ẑ + σ²Λ⁻¹)⁻¹
⇒ EM algorithm in PCA: two ordinary regressions on ‖X − FV′‖²
F = XV(V′V)⁻¹
V = X′F(F′F)⁻¹
An empirical Bayesian approach
Model: x_ij = Σ_{s=1}^S x̃_ij^(s) + ε_ij,  ε_ij ~ N(0, σ²)
Prior: x̃_ij^(s) ~ N(0, τ_s²)
Posterior: E(x̃_ij^(s) | x̂_ij^(s)) = Φ_s x̂_ij^(s)  with  Φ_s = τ_s² / ( τ_s² + σ²/min(n−1,p) )
Maximizing the likelihood of x̂_ij^(s) ~ N(0, τ_s² + σ²/min(n−1,p)) as a function of τ_s² gives
τ̂_s² = λ_s/np − σ²/min(n−1,p)
Φ_s = ( λ_s − (np/min(n−1,p)) σ² ) / λ_s
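The two expressions for Φ_s (posterior shrinkage in terms of τ_s², and the closed form in terms of λ_s) agree once τ̂_s² is plugged in; a check with made-up values of n, p, σ² and λ_s:

```python
# Both forms of the empirical-Bayes shrinkage Phi_s, with hypothetical values
n, p, sigma2 = 50, 10, 2.0
m = min(n - 1, p)                        # = 10
lam = 500.0                              # hypothetical eigenvalue lambda_s

tau2 = lam / (n * p) - sigma2 / m        # ML estimate of the prior variance tau_s^2
phi = tau2 / (tau2 + sigma2 / m)         # posterior shrinkage Phi_s
assert abs(phi - (lam - n * p * sigma2 / m) / lam) < 1e-12
```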
Other Bayesian points of view
X = X̃ + ε = UDV′ + ε
⇒ Bayesian SVD models: overparametrization issue
• Hoff (2009): uniform priors for U and V, special cases of von Mises–Fisher distributions (Stiefel manifold); (d_s)_{s=1...S} ~ N(0, s_λ²)
• Josse & Denis (2013): disregarding the orthonormality constraints
  u_is ~ N(0, 1),  v_js ~ N(0, 1),  (d_s)_{s=1...S} ~ N(0, s_λ²)  ⇒  E(x_ij) = 0,  V(x_ij) = S s_λ²
Simulation design
The simulated data: X = X̃ + ε
Several parameters are varied to construct X̃:
• n/p: 100/20 = 5, 50/50 = 1 and 20/100 = 0.2
• number of underlying dimensions S: 2, 4, 10
• the ratio d1/d2: 4, 1
ε_ij is drawn from a Gaussian distribution with a standard deviation chosen to reach a signal-to-noise ratio (SNR) of 4, 1 or 0.8
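One draw from such a design can be sketched as follows; the exact construction (orthonormal factors, unit singular values except d1) is an assumption of this sketch, and SNR is taken as the ratio of signal to noise standard deviations:

```python
import numpy as np

def simulate(n, p, S, snr, ratio, rng):
    """Rank-S signal with d1/d2 = ratio, plus Gaussian noise scaled to the target SNR."""
    d = np.ones(S)
    d[0] = ratio                                       # singular-value profile, d1/d2 = ratio
    U, _ = np.linalg.qr(rng.standard_normal((n, S)))   # random orthonormal left vectors
    V, _ = np.linalg.qr(rng.standard_normal((p, S)))   # random orthonormal right vectors
    signal = (U * d) @ V.T
    sigma = signal.std() / snr                         # noise sd chosen to hit the SNR
    return signal + sigma * rng.standard_normal((n, p)), signal

rng = np.random.default_rng(6)
X, truth = simulate(n=100, p=20, S=4, snr=1.0, ratio=4, rng=rng)
```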
Soft thresholding and SURE (Candès et al., 2012)
x̂_ij^Soft = Σ_{s=1}^{min(n,p)} (√λ_s − λ)₊ u_is v_js   minimizes ‖X − X̂‖² + λ‖X̂‖₊*
How to choose the tuning parameter λ?  MSE(λ) = E‖X̃ − SVT_λ(X)‖²
⇒ Stein's Unbiased Risk Estimate (SURE):
SURE(SVT_λ)(X) = −npσ² + Σ_{i=1}^{min(n,p)} min(λ², λ_i) + 2σ² div(SVT_λ)(X)
⇒ λ is obtained by minimizing SURE. Remark: σ² is required.
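Singular value soft thresholding itself is a one-liner on top of the SVD; a minimal sketch (the data and the threshold value are illustrative):

```python
import numpy as np

def svt(X, thresh):
    """Singular value soft thresholding: replace each singular value s by (s - thresh)+."""
    U, sval, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * np.maximum(sval - thresh, 0.0)) @ Vt

rng = np.random.default_rng(3)
X = rng.standard_normal((20, 10))
X_soft = svt(X, thresh=2.0)

# soft thresholding lowers both the rank and the nuclear norm
assert np.linalg.matrix_rank(X_soft) <= np.linalg.matrix_rank(X)
assert np.linalg.svd(X_soft, compute_uv=False).sum() < np.linalg.svd(X, compute_uv=False).sum()
```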
Simulation study

n/p     S  SNR  d1/d2  MSE(X̂^PCA, X̃)  MSE(X̂^rPCA, X̃)  MSE(X̂^SURE, X̃)
100/20  4  4    4      8.25E-04        8.24E-04         1.42E-03
100/20  4  4    1      8.26E-04        8.25E-04         1.43E-03
100/20  4  1    4      2.60E-01        1.96E-01         2.43E-01
100/20  4  1    1      2.47E-01        2.04E-01         2.62E-01
100/20  4  0.8  4      7.41E-01        4.27E-01         4.36E-01
100/20  4  0.8  1      6.68E-01        4.40E-01         4.83E-01
50/50   4  4    4      5.48E-04        5.48E-04         1.04E-03
50/50   4  4    1      5.46E-04        5.46E-04         1.04E-03
50/50   4  1    4      1.75E-01        1.53E-01         2.00E-01
50/50   4  1    1      1.68E-01        1.52E-01         2.09E-01
50/50   4  0.8  4      5.07E-01        3.87E-01         3.85E-01
50/50   4  0.8  1      4.67E-01        3.76E-01         4.13E-01
20/100  4  4    4      8.28E-04        8.27E-04         1.41E-03
20/100  4  4    1      8.29E-04        8.28E-04         1.42E-03
20/100  4  1    4      2.55E-01        1.97E-01         2.45E-01
20/100  4  1    1      2.48E-01        2.04E-01         2.60E-01
20/100  4  0.8  4      7.13E-01        4.15E-01         4.37E-01
20/100  4  0.8  1      6.66E-01        4.34E-01         4.78E-01

MSE in favor of rPCA
rPCA ≈ PCA when the SNR is high; rPCA clearly improves on PCA when the SNR is small
The gain of rPCA over PCA is larger when d1 > d2
Errors are of the same order for n/p = 0.2 and n/p = 5, and smaller for n/p = 1!
SURE is good when the data are noisy and S is high
⇒ the SNR alone is not a good measure of the level of noise
SURE can be bad: the soft threshold (√λ_s − λ)₊ selects too many components
Candès et al. (2012)'s simulation
• Same simulation model
• n = 200, p = 500; SNR ∈ {4, 2, 1, 0.5}; S ∈ {10, 100}

S    SNR  MSE(X̂^PCA, X̃)  MSE(X̂^rPCA, X̃)  MSE(X̂^SURE, X̃)
10   4    4.31E-03        4.29E-03         8.74E-03
10   2    1.74E-02        1.71E-02         3.29E-02
10   1    7.16E-02        6.75E-02         1.16E-01
10   0.5  3.19E-01        2.57E-01         3.53E-01
100  4    3.79E-02        3.69E-02         4.50E-02
100  2    1.58E-01        1.41E-01         1.56E-01
100  1    7.29E-01        4.91E-01         4.48E-01
100  0.5  3.16E+00        1.48E+00         8.52E-01

⇒ Good behavior of rPCA
⇒ SURE is good when the data are noisy and S is high
Note: the MSE of rPCA with 100 dimensions is 1.48, whereas the data are so noisy that the MSE with 0 dimensions is 1
Individuals representation
100 individuals, 20 variables, 2 dimensions, SNR = 0.8, d1/d2 = 4
[Figure: true configuration vs PCA]
[Figure: PCA vs rPCA]
[Figure: PCA vs SURE]
[Figure: true configuration vs rPCA]
⇒ rPCA improves the recovery of the inner-product matrix X̃X̃′
Variables representation
[Figure: true configuration vs PCA]
cor(X7, X9) = 1 (true) vs 0.68 (PCA)
[Figure: PCA vs rPCA]   cor(X7, X9): 0.68 (PCA) vs 0.81 (rPCA)
[Figure: PCA vs SURE]   cor(X7, X9): 0.68 (PCA) vs 0.82 (SURE)
[Figure: true configuration vs rPCA]   cor(X7, X9): 1 (true) vs 0.81 (rPCA)
⇒ rPCA improves the recovery of the covariance matrix X̃′X̃
Real data set
• 27 chickens submitted to 4 nutritional statuses
• 12664 gene expressions
[Figure: PCA vs rPCA]
[Figure: SURE vs rPCA]
[Figure: sparse PCA (Witten, 2009) vs rPCA]
[Figures: bi-clusterings of the PCA, SURE and sparse PCA reconstructions vs rPCA]
Bi-clustering on the rPCA reconstruction recovers the 4 nutritional statuses better (e.g. F16R16 clusters with F16R5 and with N in rPCA)!
Conclusion
⇒ Recover a low-rank matrix from noisy observations
• singular value thresholding: a popular estimation strategy
• a regularization term derived a priori, (√λ_s − σ²/√λ_s): minimizes the MSE (asymptotic results) and is a Bayesian treatment of the fixed effect model
⇒ Good results: recovery of the structure, of the distances between individuals, of the covariance matrix; visualization
⇒ Tuning parameter: the number of dimensions!
• too few: information is lost (noise variance overestimated)
• too many: all the relevant information is kept but the noise variance is underestimated; is that better? Not always!
Noisier settings → fewer dimensions, to regularize more
Selecting the number of dimensions (Josse & Husson, 2011)
[Figure: scree plot — where to cut?]
⇒ EM-CV (Bro et al., 2008): MSEP_S = (1/np) Σ_{i=1}^n Σ_{j=1}^p (x_ij − x̂_ij^{−ij})²
⇒ Computationally costly
In regression, ŷ = Py (Craven & Wahba, 1979):  y_i − ŷ_i^{−i} = (y_i − ŷ_i)/(1 − P_ii)
In PCA, vec(X̂^(S)) = P^(S) vec(X) with X̂ = FV′ and
P^(S)_{np×np} = (P_V ⊗ I_n) + (I_p ⊗ P_F) − (P_V ⊗ P_F)
ACV_S = (1/np) Σ_{i,j} ( (x_ij − x̂_ij)/(1 − P_{ij,ij}) )²
GCV_S = (1/np) Σ_{i,j} (x_ij − x̂_ij)² / (1 − tr(P^(S))/np)²
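Since tr(P_V ⊗ I_n) = nS, tr(I_p ⊗ P_F) = pS and tr(P_V ⊗ P_F) = S², GCV needs only the singular values; a sketch of the selection rule on simulated data (not the authors' code):

```python
import numpy as np

def gcv_scores(X, max_dim):
    """GCV_S for S = 1..max_dim, using tr(P^(S)) = nS + pS - S^2."""
    n, p = X.shape
    sval2 = np.linalg.svd(X, compute_uv=False) ** 2
    scores = {}
    for S in range(1, max_dim + 1):
        rss = sval2[S:].sum()                    # residual sum of squares of rank-S PCA
        trace = n * S + p * S - S ** 2           # trace of the hat matrix P^(S)
        scores[S] = (rss / (n * p)) / (1.0 - trace / (n * p)) ** 2
    return scores

rng = np.random.default_rng(4)
truth = rng.standard_normal((60, 3)) @ rng.standard_normal((3, 12))   # rank-3 signal
X = truth + 0.3 * rng.standard_normal((60, 12))
scores = gcv_scores(X - X.mean(axis=0), max_dim=6)
S_hat = min(scores, key=scores.get)              # dimension with the smallest GCV
```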
Best graphics for X = X̃ + ε
Regularized PCA: biased but closer to X̃
PCA (unbiased) with confidence areas? ⇒ Residual bootstrap:
1. bootstrap the residuals (ε̂ = X − X̂), or draw from N(0, σ̂²): ε^b
2. X^b = X̂ + ε^b
3. PCA on X^b: U^b, Λ^b, V^b ⇒ (U¹, Λ¹, V¹), ..., (U^B, Λ^B, V^B)
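The three steps can be sketched in numpy as follows; the data, rank and number of bootstrap replicates are illustrative, and residual cells are resampled with replacement (one of the two options above):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, S, B = 40, 8, 2, 200
truth = rng.standard_normal((n, S)) @ rng.standard_normal((S, p))
X = truth + 0.5 * rng.standard_normal((n, p))

def rank_s_fit(M, S):
    U, sval, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :S] * sval[:S]) @ Vt[:S]

X_hat = rank_s_fit(X, S)
resid = X - X_hat                                    # step 1: residuals

boot = np.empty((B, n, p))
for b in range(B):
    eps_b = rng.choice(resid.ravel(), size=(n, p))   # step 1: resampled residual cells
    boot[b] = rank_s_fit(X_hat + eps_b, S)           # steps 2-3: refit PCA on X^b

se = boot.std(axis=0)       # cell-wise bootstrap variability of the fitted values
```

The spread of the B fitted configurations is what the confidence areas on the factor maps summarize.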
[Figures: individuals factor map (10 Loire wines, e.g. S Michaud, V Font Brûlés) and variables factor map (sensory descriptors, e.g. O.Av.Intensité, Sucrée, Acide), Dim 1 (52.64%), Dim 2 (25.59%), with bootstrap confidence areas]
Analysis of variance framework
[Figure: two-way data layout — rows G1, ..., GI (genotypes) by columns E1, ..., EJ (environments) with responses Y11, ..., YIJ]
⇒ Related to AMMI (Additive Main effects and Multiplicative Interaction) models in ANOVA:
y_ij = μ + α_i + β_j + Σ_{s=1}^S λ_s u_is v_js + ε_ij
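An AMMI fit reduces to main-effect means followed by an SVD of the doubly centred residuals; a minimal sketch on made-up data (the table sizes, rank and noise level are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
I, J, S = 12, 6, 1
# two-way table: additive main effects + rank-1 interaction + noise (hypothetical data)
Y = (rng.standard_normal((I, 1)) + rng.standard_normal((1, J))
     + rng.standard_normal((I, 1)) @ rng.standard_normal((1, J))
     + 0.3 * rng.standard_normal((I, J)))

mu = Y.mean()
alpha = Y.mean(axis=1) - mu               # genotype main effects
beta = Y.mean(axis=0) - mu                # environment main effects
resid = Y - mu - alpha[:, None] - beta[None, :]

# multiplicative interaction: rank-S SVD of the doubly centred residuals
U, sval, Vt = np.linalg.svd(resid, full_matrices=False)
fitted = mu + alpha[:, None] + beta[None, :] + (U[:, :S] * sval[:S]) @ Vt[:S]
```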