
Page 1:

Machine Learning for Data Science (CS4786) Lecture 4

Canonical Correlation Analysis (CCA)

Course Webpage: http://www.cs.cornell.edu/Courses/cs4786/2016fa/

Page 2:

Announcement

• We are grading HW0 and you will be added to CMS by Monday

• HW1 will be posted tonight on the webpage (homework tab)

• HW1 on CCA and PCA (due in a week)

Page 3:

QUIZ

[Figure: data matrix $X \in \mathbb{R}^{n \times d}$ with rows $x_1^\top, \ldots, x_n^\top$]

Assume points are centered. Which of the following are equal to the covariance matrix?

A. $\Sigma = \frac{1}{n}\sum_{t=1}^{n} x_t^\top x_t$

B. $\Sigma = \frac{1}{n}\sum_{t=1}^{n} x_t x_t^\top$

C. $\Sigma = X^\top X$

D. $\Sigma = X X^\top$
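
As a quick numerical sanity check (a small NumPy sketch of our own, not part of the slides), option B can be verified by comparing the sum-of-outer-products form against the matrix form on centered data:

```python
import numpy as np

# Our own illustration: points are stored as rows of X and are centered,
# so (1/n) * sum_t x_t x_t^T equals (1/n) * X^T X.
rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.normal(size=(n, d))
X = X - X.mean(axis=0)                          # center the points

sigma_sum = sum(np.outer(x, x) for x in X) / n  # option B, written as a sum
sigma_mat = (X.T @ X) / n                       # the same quantity in matrix form
print(np.allclose(sigma_sum, sigma_mat))        # True
```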

Page 4:

Example: Students in a classroom

[Figure: students' positions in the classroom along the x, y, z axes]

Page 5:

Maximize Spread

Minimize Reconstruction Error

Page 6:

PRINCIPAL COMPONENT ANALYSIS

Eigenvectors of the covariance matrix are the principal components

Top K principal components are the eigenvectors with the K largest eigenvalues

Projection = Data × Top K eigenvectors

Reconstruction = Projection × Transpose of top K eigenvectors

Independently discovered by Pearson in 1901 and Hotelling in 1933.

1. $\Sigma = \mathrm{cov}(X)$

2. $W = \mathrm{eigs}(\Sigma, K)$

3. $Y = (X - \mu)\, W$

Page 7:

RECONSTRUCTION

4. $\hat{X} = Y\, W^\top$
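
The four steps above translate almost directly into NumPy. The sketch below is our own illustration (points stored as rows of X, the top-K eigenvectors as columns of W); it is not the course's reference code:

```python
import numpy as np

# Minimal PCA sketch following the steps on the last two slides.
def pca(X, K):
    n = X.shape[0]
    mu = X.mean(axis=0)
    Xc = X - mu
    Sigma = Xc.T @ Xc / n                          # 1. covariance matrix of the centered data
    eigvals, eigvecs = np.linalg.eigh(Sigma)       # 2. eigendecomposition (ascending eigenvalues)
    W = eigvecs[:, np.argsort(eigvals)[::-1][:K]]  #    keep the top-K eigenvectors as columns
    Y = Xc @ W                                     # 3. projection
    X_hat = Y @ W.T + mu                           # 4. reconstruction (mean added back)
    return Y, X_hat, W
```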

Page 8:

WHEN d >> n

If $d \gg n$ then $\Sigma$ is large, but we only need the top K eigenvectors. Idea: use the SVD.

$X - \mu = U D V^\top$

Then note that $\Sigma = (X - \mu)^\top (X - \mu) = V D^2 V^\top$.

Hence, the matrix $V$ is the same as the matrix $W$ obtained from the eigendecomposition of $\Sigma$, and the eigenvalues are the diagonal elements of $D^2$.

Alternative algorithm:

$[U, D, V] = \mathrm{SVD}(X - \mu, K)$, $\;\; W = V$

(where $U^\top U = I$ and $V^\top V = I$)
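
A rough NumPy version of this SVD route is sketched below (our own code; the slide's $\mathrm{SVD}(X - \mu, K)$ call is MATLAB-style, whereas `np.linalg.svd` computes the thin SVD and we then keep the top K right singular vectors):

```python
import numpy as np

# SVD route for d >> n: the right singular vectors of the centered data are
# the eigenvectors of Sigma, so the d x d covariance matrix is never formed.
def pca_svd(X, K):
    mu = X.mean(axis=0)
    U, s, Vt = np.linalg.svd(X - mu, full_matrices=False)  # X - mu = U diag(s) V^T
    W = Vt[:K].T              # top-K right singular vectors = top-K eigenvectors
    eigvals = s[:K] ** 2      # eigenvalues of (X - mu)^T (X - mu) are the s^2
    Y = (X - mu) @ W          # same projection as the eigendecomposition route
    return Y, W, eigvals
```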

Page 9:

WHEN TO USE PCA?

When data naturally lies in a low dimensional linear subspace

To minimize reconstruction error

Find directions where data is maximally spread

Page 10:

Canonical Correlation Analysis

[Figure: students' (x, y, z) positions, now paired with additional features Age, Gender, Angle]

Page 12:

TWO VIEW DIMENSIONALITY REDUCTION

Data comes in pairs $(x_1, x'_1), \ldots, (x_n, x'_n)$, where the $x_t$'s are $d$-dimensional and the $x'_t$'s are $d'$-dimensional.

Goal: Compress, say, view one into $y_1, \ldots, y_n$ that are $K$-dimensional vectors.

Retain information redundant between the two views.

Eliminate “noise” specific to only one of the views.
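
To make the two-view setup concrete, here is one way to generate toy paired data (an entirely synthetic example of our own, not from the lecture): a shared low-dimensional signal drives both views, and each view adds its own view-specific noise. CCA applied to such data should pick out directions aligned with the shared signal while ignoring the independent noise.

```python
import numpy as np

# Toy paired data: both views share a low-dimensional signal Z, and each view
# adds its own noise that the other view knows nothing about.
rng = np.random.default_rng(0)
n, k_shared, d1, d2 = 500, 2, 10, 8

Z = rng.normal(size=(n, k_shared))            # shared signal (e.g., the speech)
A = rng.normal(size=(k_shared, d1))           # how the signal appears in view 1
B = rng.normal(size=(k_shared, d2))           # how the signal appears in view 2

X1 = Z @ A + 0.5 * rng.normal(size=(n, d1))   # view 1: signal + view-specific noise
X2 = Z @ B + 0.5 * rng.normal(size=(n, d2))   # view 2: signal + view-specific noise
```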

Page 13:

EXAMPLE I: SPEECH RECOGNITION

Audio might have background sounds uncorrelated with video

Video might have lighting changes uncorrelated with audio

Redundant information between two views: the speech


Page 14:

EXAMPLE II: COMBINING FEATURE EXTRACTIONS

Method A and Method B are both equally good feature extraction techniques

Concatenating the two features blindly yields a large-dimensional feature vector with redundancy

Applying techniques like CCA extracts the key information shared between the two methods

Removes extra unwanted information

Page 15:

How do we get the right direction? (say K = 1)

[Figure: classroom example with additional features Age, Gender, Angle]

Page 16:

WHICH DIRECTION TO PICK?

[Figure: data shown in View I and View II]

Page 17:

WHICH DIRECTION TO PICK?

[Figure: the PCA direction in each view]

Page 18:

WHICH DIRECTION TO PICK?

Direction has large covariance

Page 19:

How do we pick the right direction to project to?

Page 20:

MAXIMIZING CORRELATION COEFFICIENT

Say $w_1$ and $v_1$ are the directions we choose to project onto in views 1 and 2 respectively; we want these directions to maximize

$$\frac{1}{n}\sum_{t=1}^{n}\left(y_t[1] - \frac{1}{n}\sum_{t=1}^{n} y_t[1]\right)\cdot\left(y'_t[1] - \frac{1}{n}\sum_{t=1}^{n} y'_t[1]\right)$$

subject to

$$\frac{1}{n}\sum_{t=1}^{n}\left(y_t[1] - \frac{1}{n}\sum_{t=1}^{n} y_t[1]\right)^2 = \frac{1}{n}\sum_{t=1}^{n}\left(y'_t[1] - \frac{1}{n}\sum_{t=1}^{n} y'_t[1]\right)^2 = 1$$

where $y_t[1] = w_1^\top x_t$ and $y'_t[1] = v_1^\top x'_t$

Page 22:

What is the problem with the above?

Page 23:

WHY NOT MAXIMIZE COVARIANCE?

Relevant information

Say $\frac{1}{n}\sum_{t=1}^{n} x_t[2]\cdot x'_t[2] > 0$

By scaling up this coordinate, we can blow up the covariance arbitrarily

Page 24:

MAXIMIZING CORRELATION COEFFICIENT

Say $w_1$ and $v_1$ are the directions we choose to project onto in views 1 and 2 respectively; we want these directions to maximize

$$\frac{\frac{1}{n}\sum_{t=1}^{n}\left(y_t[1] - \frac{1}{n}\sum_{t=1}^{n} y_t[1]\right)\cdot\left(y'_t[1] - \frac{1}{n}\sum_{t=1}^{n} y'_t[1]\right)}{\sqrt{\frac{1}{n}\sum_{t=1}^{n}\left(y_t[1] - \frac{1}{n}\sum_{t=1}^{n} y_t[1]\right)^2}\,\sqrt{\frac{1}{n}\sum_{t=1}^{n}\left(y'_t[1] - \frac{1}{n}\sum_{t=1}^{n} y'_t[1]\right)^2}}$$

Page 25:

BASIC IDEA OF CCA

Normalize variance in chosen direction to be constant (say 1)

Then maximize covariance

This is the same as maximizing the “correlation coefficient” (recall from last class).

Page 26:

COVARIANCE VS CORRELATION

$\mathrm{Covariance}(A,B) = \mathbb{E}\!\left[(A - \mathbb{E}[A]) \cdot (B - \mathbb{E}[B])\right]$

Depends on the scale of A and B: if B is rescaled, the covariance is rescaled with it.

$\mathrm{Correlation}(A,B) = \dfrac{\mathbb{E}\!\left[(A - \mathbb{E}[A]) \cdot (B - \mathbb{E}[B])\right]}{\sqrt{\mathrm{Var}(A)}\,\sqrt{\mathrm{Var}(B)}}$

Scale-free.
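
A tiny demo of the scale issue (our own example, not from the slides): rescaling B by a factor of 100 rescales the covariance by the same factor but leaves the correlation coefficient untouched.

```python
import numpy as np

# Empirical covariance and correlation on a toy pair of variables.
rng = np.random.default_rng(1)
A = rng.normal(size=1000)
B = 0.5 * A + rng.normal(size=1000)

def cov(a, b):
    return np.mean((a - a.mean()) * (b - b.mean()))

def corr(a, b):
    return cov(a, b) / np.sqrt(cov(a, a) * cov(b, b))

print(cov(A, B), corr(A, B))               # baseline values
print(cov(A, 100 * B), corr(A, 100 * B))   # covariance ~100x larger, correlation identical
```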

Page 27:

MAXIMIZING CORRELATION COEFFICIENT

Say $w_1$ and $v_1$ are the directions we choose to project onto in views 1 and 2 respectively; we want these directions to maximize

$$\frac{1}{n}\sum_{t=1}^{n}\left(y_t[1] - \frac{1}{n}\sum_{t=1}^{n} y_t[1]\right)\cdot\left(y'_t[1] - \frac{1}{n}\sum_{t=1}^{n} y'_t[1]\right)$$

subject to

$$\frac{1}{n}\sum_{t=1}^{n}\left(y_t[1] - \frac{1}{n}\sum_{t=1}^{n} y_t[1]\right)^2 = \frac{1}{n}\sum_{t=1}^{n}\left(y'_t[1] - \frac{1}{n}\sum_{t=1}^{n} y'_t[1]\right)^2 = 1$$

where $y_t[1] = w_1^\top x_t$ and $y'_t[1] = v_1^\top x'_t$

Page 28:

CANONICAL CORRELATION ANALYSIS

Hence we want to solve for projection vectors $w_1$ and $v_1$ that maximize

$$\frac{1}{n}\sum_{t=1}^{n} w_1^\top(x_t - \mu)\cdot v_1^\top(x'_t - \mu')$$

subject to

$$\frac{1}{n}\sum_{t=1}^{n}\left(w_1^\top(x_t - \mu)\right)^2 = \frac{1}{n}\sum_{t=1}^{n}\left(v_1^\top(x'_t - \mu')\right)^2 = 1$$

where $\mu = \frac{1}{n}\sum_{t=1}^{n} x_t$ and $\mu' = \frac{1}{n}\sum_{t=1}^{n} x'_t$

Page 30:

CANONICAL CORRELATION ANALYSIS

Hence we want to solve for projection vectors $w_1$ and $v_1$ that maximize

$$w_1^\top \Sigma_{1,2}\, v_1$$

subject to

$$w_1^\top \Sigma_{1,1}\, w_1 = v_1^\top \Sigma_{2,2}\, v_1 = 1$$

Writing the Lagrangian, taking the derivative and equating it to 0, we get

$$\Sigma_{1,2}\Sigma_{2,2}^{-1}\Sigma_{2,1}\, w_1 = \lambda^2\, \Sigma_{1,1}\, w_1 \quad\text{and}\quad \Sigma_{2,1}\Sigma_{1,1}^{-1}\Sigma_{1,2}\, v_1 = \lambda^2\, \Sigma_{2,2}\, v_1$$

or equivalently

$$\left(\Sigma_{1,1}^{-1}\Sigma_{1,2}\Sigma_{2,2}^{-1}\Sigma_{2,1}\right) w_1 = \lambda^2\, w_1 \quad\text{and}\quad \left(\Sigma_{2,2}^{-1}\Sigma_{2,1}\Sigma_{1,1}^{-1}\Sigma_{1,2}\right) v_1 = \lambda^2\, v_1$$

where $\Sigma = \mathrm{cov}\!\left([X, X']\right) = \begin{pmatrix} \Sigma_{1,1} & \Sigma_{1,2} \\ \Sigma_{2,1} & \Sigma_{2,2} \end{pmatrix}$

Page 31:

SOLUTION

$W_1 = \mathrm{eigs}\!\left(\Sigma_{11}^{-1}\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21},\, K\right)$

$W_2 = \mathrm{eigs}\!\left(\Sigma_{22}^{-1}\Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12},\, K\right)$

Page 32:

CCA ALGORITHM

Write $\tilde{x}_t = \begin{pmatrix} x_t \\ x'_t \end{pmatrix}$, the $(d + d')$-dimensional concatenated vectors.

Calculate the covariance matrix of the joint data points:

$$\Sigma = \begin{pmatrix} \Sigma_{1,1} & \Sigma_{1,2} \\ \Sigma_{2,1} & \Sigma_{2,2} \end{pmatrix}$$

Calculate $\Sigma_{1,1}^{-1}\Sigma_{1,2}\Sigma_{2,2}^{-1}\Sigma_{2,1}$. The top K eigenvectors of this matrix give us the projection matrix for view I.

Calculate $\Sigma_{2,2}^{-1}\Sigma_{2,1}\Sigma_{1,1}^{-1}\Sigma_{1,2}$. The top K eigenvectors of this matrix give us the projection matrix for view II.

In matrix form, with $X = [X_1\; X_2]$ the $n \times (d_1 + d_2)$ joint data matrix:

1. $X = [X_1\; X_2]$

2. $\Sigma = \mathrm{cov}(X) = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}$

3. $W_1 = \mathrm{eigs}\!\left(\Sigma_{11}^{-1}\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21},\, K\right)$

4. $Y_1 = (X_1 - \mu_1)\, W_1$
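
A sketch of this algorithm in NumPy follows (our own implementation, not the course's; a small ridge term `reg` is added so the block covariance matrices are safely invertible, which the slides do not discuss). Running it on the toy two-view data generated earlier, e.g. `Y1, Y2, W1, W2 = cca(X1, X2, K=2)`, should yield projections whose coordinates are highly correlated across the two views.

```python
import numpy as np

# CCA via the eigendecompositions on this slide; points are rows of X1 and X2.
def cca(X1, X2, K, reg=1e-8):
    n = X1.shape[0]
    X1c = X1 - X1.mean(axis=0)
    X2c = X2 - X2.mean(axis=0)
    S11 = X1c.T @ X1c / n + reg * np.eye(X1.shape[1])   # Sigma_11 (regularized)
    S22 = X2c.T @ X2c / n + reg * np.eye(X2.shape[1])   # Sigma_22 (regularized)
    S12 = X1c.T @ X2c / n                               # Sigma_12 (= Sigma_21^T)

    M1 = np.linalg.solve(S11, S12) @ np.linalg.solve(S22, S12.T)  # Sigma_11^-1 Sigma_12 Sigma_22^-1 Sigma_21
    M2 = np.linalg.solve(S22, S12.T) @ np.linalg.solve(S11, S12)  # Sigma_22^-1 Sigma_21 Sigma_11^-1 Sigma_12

    vals1, vecs1 = np.linalg.eig(M1)
    vals2, vecs2 = np.linalg.eig(M2)
    W1 = np.real(vecs1[:, np.argsort(-np.real(vals1))[:K]])       # top-K eigenvectors, view I
    W2 = np.real(vecs2[:, np.argsort(-np.real(vals2))[:K]])       # top-K eigenvectors, view II

    return X1c @ W1, X2c @ W2, W1, W2   # projections Y1, Y2 and projection matrices
```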