
Foundations of Machine Learning
African Masters in Machine Intelligence

Principal Component Analysis
Marc Deisenroth

Quantum Leap Africa, African Institute for Mathematical Sciences, Rwanda
Department of Computing, Imperial College London

@mpd37, [email protected]

October 4, 2018


References

§ Bishop: Pattern Recognition and Machine Learning, Chapter 12

§ Deisenroth et al.: Mathematics for Machine Learning, Chapter 10 (https://mml-book.com)


Overview

Introduction

Setting

Maximum Variance Perspective

Projection Perspective

PCA Algorithm

PCA in High Dimensions

Probabilistic PCA

Related Models


High-Dimensional Data

§ Real-world data is often high-dimensional

§ Challenges:

§ Difficult to analyze
§ Difficult to visualize
§ Difficult to interpret


Properties of High-dimensional Data

[Figure: two scatter plots of a two-dimensional dataset (axes $x_1$ and $x_2$).]

§ Many dimensions are unnecessary

§ Data often lives on a low-dimensional manifold

Dimensionality reduction finds the relevant dimensions.

Background: Coordinate Representations

Consider $\mathbb{R}^2$ with the canonical basis $e_1 = [1, 0]^\top$, $e_2 = [0, 1]^\top$.

$$x = \begin{bmatrix} 5 \\ 3 \end{bmatrix} = 5 e_1 + 3 e_2 \qquad \text{(linear combination of basis vectors)}$$

§ Coordinates of $x$ w.r.t. $(e_1, e_2)$: $[5, 3]$

Consider the vectors of the form

$$\tilde{x} = \begin{bmatrix} 0 \\ z \end{bmatrix} \in \mathbb{R}^2\,, \quad z \in \mathbb{R}\,,$$

and write them as $0 e_1 + z e_2$.

§ Only remember/store the coordinate/code $z$ of the $e_2$ vector → Compression

§ The set of $\tilde{x}$ vectors forms a vector subspace $U \subseteq \mathbb{R}^2$ with $\dim(U) = 1$ because $U = \operatorname{span}[e_2]$.

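A minimal numerical sketch of this store-only-the-code idea (an illustration assuming NumPy; not part of the original slides):

```python
import numpy as np

e1 = np.array([1.0, 0.0])
e2 = np.array([0.0, 1.0])

x_tilde = np.array([0.0, 2.7])               # a vector of the form [0, z]^T in U = span[e2]
z = x_tilde @ e2                             # code: the coordinate w.r.t. e2 (here 2.7)
reconstruction = 0.0 * e1 + z * e2           # decode: 0*e1 + z*e2

assert np.allclose(reconstruction, x_tilde)  # lossless for vectors that lie in U
```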



PCA Setting

[Diagram: an encoder maps the original $x \in \mathbb{R}^D$ to the code $z \in \mathbb{R}^M$; a decoder maps $z$ back to the compressed $\tilde{x} \in \mathbb{R}^D$.]

§ Dataset $\mathcal{X} := \{x_1, \ldots, x_N\}$, $x_n \in \mathbb{R}^D$

§ Data matrix $X := [x_1, \ldots, x_N] \in \mathbb{R}^{D \times N}$ → often defined as an $N \times D$ matrix instead

§ Without loss of generality, $\mathbb{E}[\mathcal{X}] = 0$ (centered data) → data covariance matrix $S = \frac{1}{N} X X^\top \in \mathbb{R}^{D \times D}$

§ Linear relationships between the latent code $z$ and the data $x$:
$$z = B^\top x\,, \qquad \tilde{x} = B z$$

§ $B = [b_1, \ldots, b_M] \in \mathbb{R}^{D \times M}$ is a matrix with orthonormal columns

§ Columns of $B$ are an ONB of an $M$-dimensional subspace of $\mathbb{R}^D$ (see the sketch below)

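A minimal NumPy sketch of this encoder/decoder pair (a hypothetical illustration; `encode` and `decode` are made-up names, and $B$ is assumed to have orthonormal columns):

```python
import numpy as np

def encode(B, x):
    """Code z = B^T x for a basis matrix B in R^{D x M} with orthonormal columns."""
    return B.T @ x

def decode(B, z):
    """Reconstruction x_tilde = B z, a point in the M-dimensional subspace."""
    return B @ z

D, M = 5, 2
rng = np.random.default_rng(0)
B, _ = np.linalg.qr(rng.standard_normal((D, M)))   # a matrix with orthonormal columns
x = rng.standard_normal(D)
x_tilde = decode(B, encode(B, x))                  # orthogonal projection of x onto span(B)
```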


Low-Dimensional Embedding

[Figure: the encoder/decoder diagram from the previous slide, and a scatter plot (axes $x_1$, $x_2$) of two-dimensional data together with a one-dimensional subspace $U$ spanned by a vector $b$.]

§ Find an $M$-dimensional subspace $U \subset \mathbb{R}^D$ onto which we project the data

§ $\tilde{x} = \pi_U(x)$ is the projection of $x$ onto $U$

§ Find projections $\tilde{x}$ that are as similar to $x$ as possible → find basis vectors $b_1, \ldots, b_M$

§ A compression loss is incurred if $M \ll D$


PCA Idea: Maximum Variance

§ Project $D$-dimensional data $x$ onto an $M$-dimensional subspace that retains as much information as possible

→ Data compression

§ Informally: information = diversity = variance
→ Maximize the variance in the projected space (Hotelling, 1933)


PCA Objective: Maximum Variance

§ Linear relationships:
$$z = B^\top x\,, \qquad \tilde{x} = B z$$

§ $B = [b_1, \ldots, b_M] \in \mathbb{R}^{D \times M}$ is a matrix with orthonormal columns

§ Columns of $B$ are an ONB of an $M$-dimensional subspace of $\mathbb{R}^D$

§ Find $B = [b_1, \ldots, b_M]$ so that the variance in the projected space is maximized:

$$\max_{b_1, \ldots, b_M} \mathbb{V}[z] = \max_{b_1, \ldots, b_M} \mathbb{V}[B^\top x] \quad \text{s.t.} \quad \|b_1\| = \cdots = \|b_M\| = 1$$

→ Constrained optimization problem


Direction with Maximal Variance (1)

§ Maximize the variance of the first coordinate of $z \in \mathbb{R}^M$:

$$V_1 := \mathbb{V}[z_1] = \frac{1}{N} \sum_{n=1}^{N} z_{n1}^2$$

→ Empirical variance of the training dataset

§ The first coordinate of $z_n$ is

$$z_{n1} = b_1^\top x_n\,,$$

the coordinate of the orthogonal projection of $x_n$ onto $\operatorname{span}[b_1]$ (the one-dimensional subspace spanned by $b_1$). Then

$$\mathbb{V}[z_1] = \frac{1}{N} \sum_{n=1}^{N} (b_1^\top x_n)^2 = \frac{1}{N} \sum_{n=1}^{N} b_1^\top x_n x_n^\top b_1 = b_1^\top \Big( \frac{1}{N} \sum_{n=1}^{N} x_n x_n^\top \Big) b_1 = b_1^\top S b_1$$


Direction with Maximal Variance (2)

§ Maximize the variance:

$$\max_{b_1, \|b_1\|^2 = 1} \mathbb{V}[z_1] = \max_{b_1, \|b_1\|^2 = 1} b_1^\top S b_1$$

§ Lagrangian:

$$L(b_1, \lambda_1) = b_1^\top S b_1 + \lambda_1 (1 - b_1^\top b_1)$$

→ Discuss with your neighbors and find $\lambda_1$ and $b_1$

§ Setting the gradients w.r.t. $b_1$ and $\lambda_1$ to $0$ yields

$$S b_1 = \lambda_1 b_1\,, \qquad b_1^\top b_1 = 1$$

§ $b_1$ is an eigenvector of the data covariance matrix $S$
§ $\lambda_1$ is the corresponding eigenvalue

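A small numerical sanity check of this result (a NumPy sketch, not from the slides): among unit vectors, the eigenvector of $S$ with the largest eigenvalue maximizes $b^\top S b$.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((3, 200)) * np.array([[3.0], [1.0], [0.3]])  # 3-dim data, anisotropic
X -= X.mean(axis=1, keepdims=True)             # center the data
S = X @ X.T / X.shape[1]                       # data covariance matrix

eigvals, eigvecs = np.linalg.eigh(S)           # ascending eigenvalues, orthonormal eigenvectors
b1 = eigvecs[:, -1]                            # eigenvector with the largest eigenvalue

dirs = rng.standard_normal((3, 1000))          # random directions for comparison
dirs /= np.linalg.norm(dirs, axis=0)           # normalize to unit length
variances = np.sum(dirs * (S @ dirs), axis=0)  # b^T S b for each candidate direction b
assert variances.max() <= b1 @ S @ b1 + 1e-9   # none retains more variance than b1
```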


Direction with Maximal Variance (3)

§ With $S b_1 = \lambda_1 b_1$:

$$\mathbb{V}[z_1] = b_1^\top S b_1 = \lambda_1 b_1^\top b_1 = \lambda_1$$

→ The variance retained by the first coordinate corresponds to the eigenvalue $\lambda_1$
→ Choose the eigenvector $b_1$ associated with the largest eigenvalue

§ Projection: $\tilde{x}_n = b_1 b_1^\top x_n$

§ Coordinate: $z_{n1} = b_1^\top x_n$

Direction with Maximal Variance
Maximizing the variance means to choose the direction $b_1$ as the eigenvector of the data covariance matrix $S$ that is associated with the largest eigenvalue $\lambda_1$ of $S$.


M-dimensional Subspace with Maximum Variance

General Result
The $M$-dimensional subspace of $\mathbb{R}^D$ that retains the most variance is spanned by the $M$ eigenvectors of the data covariance matrix $S$ that are associated with the $M$ largest eigenvalues of $S$ (e.g., Bishop, 2006).

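A NumPy sketch of this general result (the helper name `principal_subspace` is made up; the data matrix is assumed to be $D \times N$ and centered):

```python
import numpy as np

def principal_subspace(X, M):
    """Columns of B are the M eigenvectors of S = X X^T / N with the largest eigenvalues."""
    S = X @ X.T / X.shape[1]                   # data covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)       # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:M]      # indices of the M largest eigenvalues
    return eigvecs[:, order], eigvals[order]

# Usage: B, lam = principal_subspace(X, M); Z = B.T @ X are the low-dimensional codes.
```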


Example: MNIST Embedding (Training Set)

§ Embedding of handwritten '0' and '1' digits ($28 \times 28$ pixels) into a two-dimensional subspace, spanned by the first two principal components.


Example: MNIST Reconstruction (Test Set)

[Figure: rows of test digits labeled "Original", "PCs: 1", "PCs: 10", "PCs: 100", "PCs: 500".]

§ Reconstructions of original digits as the number of principal components increases


Refresher: Orthogonal Projection onto Subspaces

§ Basis $b_1, \ldots, b_M$ of a subspace $U \subset \mathbb{R}^D$

§ Define $B = [b_1, \ldots, b_M] \in \mathbb{R}^{D \times M}$

§ Project $x \in \mathbb{R}^D$ onto the subspace $U$:

$$\pi_U(x) = \tilde{x} = B (B^\top B)^{-1} B^\top x$$

§ If $b_1, \ldots, b_M$ form an orthonormal basis ($b_i^\top b_j = \delta_{ij}$), then the projection simplifies to

$$\tilde{x} = B B^\top x$$

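A quick NumPy sketch of these two projection formulas (illustration only; the function name `project` is made up):

```python
import numpy as np

def project(B, x):
    """Orthogonal projection of x onto the column space of B for a general basis B."""
    return B @ np.linalg.solve(B.T @ B, B.T @ x)

rng = np.random.default_rng(2)
B = rng.standard_normal((5, 2))                   # arbitrary (non-orthonormal) basis of a 2D subspace
x = rng.standard_normal(5)

Q, _ = np.linalg.qr(B)                            # orthonormal basis of the same subspace
assert np.allclose(project(B, x), Q @ Q.T @ x)    # simplified formula holds for an ONB
```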


PCA Objective: Minimize Reconstruction Error

§ Objective: Find the orthogonal projection that minimizes the average squared projection/reconstruction error

$$J = \frac{1}{N} \sum_{n=1}^{N} \|x_n - \tilde{x}_n\|^2\,,$$

where $\tilde{x}_n = \pi_U(x_n)$ is the projection of $x_n$ onto $U$


Derivation (1)

§ Assume an orthonormal basis of $\mathbb{R}^D = \operatorname{span}[b_1, \ldots, b_D]$, such that $b_i^\top b_j = \delta_{ij}$

§ Every data point $x$ can be written as a linear combination of the basis vectors:

$$x = \sum_{d=1}^{D} \eta_d b_d = B \eta\,, \qquad B = [b_1, \ldots, b_D]$$

→ Rotation of the standard coordinates to a new coordinate system defined by the basis $(b_1, \ldots, b_D)$
→ The original coordinates $x_d$ are replaced by $\eta_d$, $d = 1, \ldots, D$

§ Obtain $\eta_d = x^\top b_d$, such that

$$x = \sum_{d=1}^{D} (x^\top b_d)\, b_d$$


Derivation (2)

Objective
Approximate

$$x = \sum_{d=1}^{D} \eta_d b_d \qquad \text{with} \qquad \tilde{x} = \sum_{m=1}^{M} z_m b_m$$

using only $M \ll D$ basis vectors
→ Projection onto a lower-dimensional subspace


Derivation (3): Objective

$$\tilde{x}_n = \underbrace{\sum_{m=1}^{M} z_{mn} b_m}_{\text{lower-dim. subspace}}$$

§ Choose the coordinates $z_{mn}$ and the basis vectors $b_1, \ldots, b_D$ such that the average squared reconstruction error

$$J_M = \frac{1}{N} \sum_{n=1}^{N} \|x_n - \tilde{x}_n\|^2$$

is minimized
→ Compute the gradients of $J_M$ w.r.t. all variables, set them to $0$, and solve


Derivation (4): Optimal Coordinates

Necessary condition for an optimum:

$$\frac{\partial J_M}{\partial z_{mn}} = 0 \implies z_{mn} = x_n^\top b_m\,, \quad m = 1, \ldots, M$$

§ The optimal projection is the orthogonal projection
§ The optimal coordinate $z_{mn}$ is the coordinate of the orthogonal projection of $x_n$ onto the one-dimensional subspace spanned by $b_m$

§ Since $(b_1, \ldots, b_D)$ is an ONB, $\operatorname{span}[b_{M+1}, \ldots, b_D]$ is the orthogonal complement of the principal subspace $\operatorname{span}[b_1, \ldots, b_M]$

§ If

$$x_n = \sum_{d=1}^{D} \eta_{dn} b_d \qquad \text{and} \qquad \tilde{x}_n = \sum_{m=1}^{M} z_{mn} b_m\,,$$

then $\eta_{mn} = z_{mn}$ for $m = 1, \ldots, M$
→ The minimum error is given by the orthogonal projection of $x_n$ onto the principal subspace spanned by $b_1, \ldots, b_M$


Derivation (5): Displacement Vector

[Figure: two-dimensional data (axes $x_1$, $x_2$) with the principal subspace $U$ and its orthogonal complement $U^\perp$.]

The approximation error only plays a role in the dimensions $M+1, \ldots, D$:

$$x_n - \tilde{x}_n = \sum_{j=M+1}^{D} (x_n^\top b_j)\, b_j$$

→ The displacement vector $x_n - \tilde{x}_n$ lies in the orthogonal complement $U^\perp$ of the principal subspace $U$ (it is a linear combination of the $b_j$ for $j = M+1, \ldots, D$)


Derivation (5)

From the previous slide:

$$x_n - \tilde{x}_n = \sum_{j=M+1}^{D} (x_n^\top b_j)\, b_j$$

Let's compute our reconstruction error:

$$J_M = \frac{1}{N} \sum_{n=1}^{N} \|x_n - \tilde{x}_n\|^2 = \frac{1}{N} \sum_{n=1}^{N} (x_n - \tilde{x}_n)^\top (x_n - \tilde{x}_n) = \frac{1}{N} \sum_{n=1}^{N} \sum_{j=M+1}^{D} (x_n^\top b_j)^2 = \sum_{j=M+1}^{D} b_j^\top S b_j\,,$$

where $S = \frac{1}{N} \sum_{n=1}^{N} x_n x_n^\top$ is the data covariance matrix


Derivation (6)

§ What remains: minimize $J_M$ w.r.t. the $b_j$ under the constraint that the $b_j$ form an orthonormal basis.

§ Similar setting to the maximum variance perspective: instead of maximizing the variance in the principal subspace, we minimize the variance in the orthogonal complement of the principal subspace

§ We end up with the eigenvalue problem

$$S b_j = \lambda_j b_j\,, \quad j = M+1, \ldots, D$$


Derivation (7)

§ Find the eigenvectors $b_j$ of the data covariance matrix $S$

§ The corresponding value of the squared reconstruction error is

$$J_M = \sum_{j=M+1}^{D} \lambda_j\,,$$

i.e., the sum of the eigenvalues associated with the eigenvectors not in the principal subspace

§ Minimizing $J_M$ requires us to choose as the principal subspace the $M$ eigenvectors that are associated with the $M$ largest eigenvalues.

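A short numerical check of this identity (a NumPy sketch, not part of the slides): the average squared reconstruction error equals the sum of the discarded eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(3)
D, N, M = 6, 500, 2
X = rng.standard_normal((D, N)) * np.linspace(3.0, 0.2, D)[:, None]
X -= X.mean(axis=1, keepdims=True)            # centered data

S = X @ X.T / N
eigvals, eigvecs = np.linalg.eigh(S)          # ascending eigenvalues
B = eigvecs[:, -M:]                           # principal subspace: top-M eigenvectors

X_tilde = B @ (B.T @ X)                       # reconstructions x_tilde_n
J_M = np.mean(np.sum((X - X_tilde) ** 2, axis=0))
assert np.isclose(J_M, eigvals[:-M].sum())    # J_M equals the sum of the discarded eigenvalues
```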


Geometric Interpretation

§ Objective: project $x$ onto an affine subspace $\mu + \operatorname{span}[b_1]$.
§ Shift the scenario to the origin (affine subspace → vector subspace).
§ Shift $x$ as well (onto $x - \mu$).
§ Orthogonally project $x - \mu$ onto the subspace spanned by $b_1$.
§ Move the projected point $\pi_{U_1}(x)$ back into the original (affine) setting.


Key Steps of PCA

1. Compute the empirical mean $\mu$ of the data

2. Mean subtraction: replace all data points $x_i$ with $\bar{x}_i = x_i - \mu$.

3. Standardization: divide the data by its standard deviation in each dimension: $\hat{X}^{(d)} = \bar{X}^{(d)} / \sigma(X^{(d)})$ for $d = 1, \ldots, D$.

4. Eigendecomposition of the data covariance matrix: compute the (orthonormal) eigenvectors and the eigenvalues of the data covariance matrix $S$

5. Orthogonal projection: choose the eigenvectors associated with the $M$ largest eigenvalues to be the basis of the principal subspace, and project to obtain $\tilde{X}$

6. Moving back to the original data space: $\tilde{X}^{(d)} \leftarrow \tilde{X}^{(d)} \sigma(X^{(d)}) + \mu_d$ (see the code sketch below)

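A compact NumPy sketch of these six steps (a hypothetical `pca` helper under the setting above: data matrix of shape $D \times N$, subspace dimension $M$; not the official implementation):

```python
import numpy as np

def pca(X, M):
    """PCA following the six steps above. X has shape (D, N); returns reconstructions in the original space."""
    mu = X.mean(axis=1, keepdims=True)             # 1. empirical mean
    Xc = X - mu                                    # 2. mean subtraction
    sigma = Xc.std(axis=1, keepdims=True)          # 3. per-dimension standard deviation
    Xs = Xc / sigma                                #    standardization
    S = Xs @ Xs.T / X.shape[1]                     # 4. data covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)           #    eigendecomposition (ascending eigenvalues)
    B = eigvecs[:, np.argsort(eigvals)[::-1][:M]]  # 5. top-M eigenvectors span the principal subspace
    X_proj = B @ (B.T @ Xs)                        #    orthogonal projection onto the principal subspace
    return X_proj * sigma + mu                     # 6. undo standardization and mean subtraction

# Usage: X_tilde = pca(X, M=2)
```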

PCA Algorithm

[Figure sequence: scatter plots of a two-dimensional dataset (axes $x_1$, $x_2$) after each step.]

§ Dataset
§ Step 1: Mean subtraction
§ Step 2: Standardization (variance 1 in each direction)
§ Step 3: Eigendecomposition of the data covariance matrix
§ Step 4: Orthogonal projection onto the principal subspace
§ Step 5: Moving back to the original data space


PCA for High-Dimensional Data

§ Fewer data points than dimensions, i.e., $N < D$

§ At least $D - N + 1$ eigenvalues are $0$

§ Computation time for the eigenvalues of the data covariance matrix $S$: $O(D^3)$

→ Rephrase PCA


Reformulating PCA

§ Define $X$ to be the $D \times N$ centered data matrix whose $n$th column is $x_n - \mathbb{E}[x]$ → mean normalization

§ Corresponding covariance matrix: $S = \frac{1}{N} X X^\top$

§ Corresponding eigenvector equation:

$$S b_i = \lambda_i b_i \iff \frac{1}{N} X X^\top b_i = \lambda_i b_i$$

§ Transformation (left-multiply by $X^\top$):

$$\frac{1}{N} X X^\top b_i = \lambda_i b_i \iff \frac{1}{N} X^\top X \underbrace{X^\top b_i}_{=: v_i} = \lambda_i \underbrace{X^\top b_i}_{=: v_i}$$

→ $v_i$ is an eigenvector of the $N \times N$ matrix $\frac{1}{N} X^\top X$, which has the same non-zero eigenvalues as the original covariance matrix
→ Get the eigenvalues in $O(N^3)$ instead of $O(D^3)$.


Recovering the Original Eigenvectors

§ The new eigenvalue/eigenvector equation is

$$\frac{1}{N} X^\top X v_i = \lambda_i v_i\,, \qquad \text{where } v_i = X^\top b_i$$

§ We want to recover the original eigenvectors $b_i$ of the data covariance matrix $S = \frac{1}{N} X X^\top$

§ Left-multiplying the eigenvector equation by $X$ yields

$$\underbrace{\frac{1}{N} X X^\top}_{= S} X v_i = \lambda_i X v_i\,,$$

and we recover $X v_i$ as an eigenvector of $S$ associated with the eigenvalue $\lambda_i$

§ Make sure to normalize $X v_i$ so that $\|X v_i\| = 1$

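A NumPy sketch of this reformulation (an illustration under the setting above: a centered $D \times N$ data matrix with $N < D$; the helper name `pca_high_dim` is made up):

```python
import numpy as np

def pca_high_dim(X, M):
    """Top-M eigenpairs of S = X X^T / N via the N x N matrix X^T X / N (useful when N < D)."""
    N = X.shape[1]
    K = X.T @ X / N                              # N x N matrix with the same non-zero eigenvalues as S
    eigvals, V = np.linalg.eigh(K)               # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:M]
    eigvals, V = eigvals[order], V[:, order]
    B = X @ V                                    # map eigenvectors v_i back to eigenvectors X v_i of S
    B /= np.linalg.norm(B, axis=0)               # normalize so that ||X v_i|| = 1
    return eigvals, B

rng = np.random.default_rng(4)
X = rng.standard_normal((100, 20))               # D = 100 dimensions, N = 20 data points
X -= X.mean(axis=1, keepdims=True)
vals, B = pca_high_dim(X, M=3)
S = X @ X.T / X.shape[1]
assert np.allclose(S @ B, B * vals)              # columns of B are eigenvectors of S
```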


Latent Variable Perspective

[Graphical model: latent $z_n$ generates observation $x_n$; parameters $B$, $\mu$, $\sigma$; plate over $n = 1, \ldots, N$.]

§ Model:
$$x = B z + \mu + \epsilon\,, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2 I)$$

§ Generative process:
$$z \sim \mathcal{N}(0, I)\,, \qquad x \mid z \sim \mathcal{N}(x \mid B z + \mu, \sigma^2 I)$$


Why is this useful?

§ “Standard” PCA as a special case

§ Comes with a likelihood function, and we can explicitly deal with noisy observations

§ Allows for Bayesian model comparison via the marginal likelihood

§ PCA as a generative model, which allows us to simulate new data

§ Straightforward connections to related algorithms and models (e.g., ICA)

§ Deal with data dimensions that are missing at random by applying Bayes' theorem


Likelihood

§ Model:
$$x = B z + \mu + \epsilon\,, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2 I)$$

§ PPCA likelihood (integrate out the latent variables):

$$p(x \mid B, \mu, \sigma^2) = \int p(x \mid z, \mu, \sigma^2)\, p(z)\, dz = \int \mathcal{N}(x \mid B z + \mu, \sigma^2 I)\, \mathcal{N}(z \mid 0, I)\, dz$$

→ This is Gaussian with mean and covariance

$$\mathbb{E}[x] = \mathbb{E}_z[B z + \mu + \epsilon] = \mu\,, \qquad \mathbb{V}[x] = \mathbb{V}_z[B z + \mu + \epsilon] = B B^\top + \sigma^2 I$$


Joint Distribution and Posterior

§ Joint distribution of the observed and latent variables:

$$p(x, z \mid B, \mu, \sigma^2) = \mathcal{N}\left( \begin{bmatrix} x \\ z \end{bmatrix} \,\middle|\, \begin{bmatrix} \mu \\ 0 \end{bmatrix}, \begin{bmatrix} B B^\top + \sigma^2 I & B \\ B^\top & I \end{bmatrix} \right)$$

§ Posterior via Gaussian conditioning:

$$p(z \mid x, B, \mu, \sigma^2) = \mathcal{N}(z \mid m, C)$$
$$m = B^\top (B B^\top + \sigma^2 I)^{-1} (x - \mu)$$
$$C = I - B^\top (B B^\top + \sigma^2 I)^{-1} B$$

→ For a new observation $x_*$, compute the posterior $p(z_* \mid x_*, \mathcal{X})$ and examine it (e.g., its variance).

§ Generate new (plausible) data from this posterior

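A NumPy sketch of this Gaussian conditioning step (illustration only; `ppca_posterior` and the parameter values are made up):

```python
import numpy as np

def ppca_posterior(x, B, mu, sigma2):
    """Posterior mean m and covariance C of p(z | x) in the PPCA model."""
    D = B.shape[0]
    K = B @ B.T + sigma2 * np.eye(D)              # marginal covariance of x
    m = B.T @ np.linalg.solve(K, x - mu)          # B^T K^{-1} (x - mu)
    C = np.eye(B.shape[1]) - B.T @ np.linalg.solve(K, B)
    return m, C

rng = np.random.default_rng(5)
B = rng.standard_normal((4, 2))
mu, sigma2 = rng.standard_normal(4), 0.1
x = B @ rng.standard_normal(2) + mu + np.sqrt(sigma2) * rng.standard_normal(4)
m, C = ppca_posterior(x, B, mu, sigma2)
```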


Maximum Likelihood Estimation

§ In PPCA, we can determine the parameters $\mu$, $B$, $\sigma^2$ via maximum likelihood estimation → PPCA likelihood: $p(\mathcal{X} \mid \mu, B, \sigma^2)$

§ Result (e.g., Tipping & Bishop, 1999):

$$\mu_{\text{ML}} = \frac{1}{N} \sum_{n=1}^{N} x_n \qquad \text{(sample mean)}$$

$$B_{\text{ML}} = T (\Lambda - \sigma^2 I)^{\frac{1}{2}} R\,,$$

where the columns of $T$ are $M$ eigenvectors of the data covariance matrix, $\Lambda$ is the diagonal matrix of the associated eigenvalues, and $R$ is an arbitrary orthogonal matrix

$$\sigma^2_{\text{ML}} = \frac{1}{D - M} \sum_{j=M+1}^{D} \lambda_j \qquad \text{(average variance in the orthogonal complement)}$$

§ For $\sigma \to 0$, the maximum likelihood solution gives the same result as PCA (see mml-book.com)

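A NumPy sketch of these maximum likelihood estimates (a hedged illustration of the Tipping & Bishop solution with the rotation $R$ set to the identity; `ppca_mle` is a made-up name):

```python
import numpy as np

def ppca_mle(X, M):
    """Maximum likelihood estimates (mu, B, sigma^2) for PPCA; X has shape (D, N)."""
    D, N = X.shape
    mu = X.mean(axis=1, keepdims=True)                 # mu_ML: sample mean
    S = (X - mu) @ (X - mu).T / N                      # data covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1] # descending order
    sigma2 = eigvals[M:].mean()                        # average variance in the orthogonal complement
    T, Lam = eigvecs[:, :M], np.diag(eigvals[:M])
    B = T @ np.sqrt(Lam - sigma2 * np.eye(M))          # B_ML with R = I
    return mu, B, sigma2
```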


Related Models

§ Factor analysis: axis-aligned noise (instead of isotropic)

§ Independent component analysis: non-Gaussian prior $p(z) = \prod_m p_m(z_m)$

§ Kernel PCA

§ Bayesian PCA: priors on the parameters $B$, $\mu$, $\sigma^2$ → approximate inference

§ Gaussian process latent variable model (GP-LVM): replace the linear mapping in Bayesian PCA with a Gaussian process → point estimate of $z$

§ Bayesian GP-LVM: maintains a distribution on $z$ → approximate inference


Summary

[Figures: the two-dimensional PCA example (axes $x_1$, $x_2$) and the PPCA graphical model (latent $z_n$, observation $x_n$, parameters $B$, $\mu$, $\sigma$, plate over $n = 1, \ldots, N$).]

§ PCA: Algorithm for linear dimensionality reduction
§ Orthogonal projection of data onto a lower-dimensional subspace
§ Maximizes the variance of the projection
§ Minimizes the average squared projection/reconstruction error
§ High-dimensional data
§ Probabilistic PCA
