
Page 1:

Machine Learning for Data Science (CS4786) Lecture 4

Canonical Correlation Analysis (CCA)

Course Webpage: http://www.cs.cornell.edu/Courses/cs4786/2016fa/

Page 2:

Announcement

• We are grading HW0 and you will be added to CMS by Monday

• HW1 will be posted tonight on the webpage (homework tab)

• HW1 on CCA and PCA (due in a week)

Page 3:

QUIZ

[Figure: data matrix $X \in \mathbb{R}^{n \times d}$ with rows $x_1^\top, \ldots, x_n^\top$]

Assume points are centered. Which of the following are equal to the covariance matrix?

A. $\Sigma = \frac{1}{n}\sum_{t=1}^{n} x_t^\top x_t$

B. $\Sigma = \frac{1}{n}\sum_{t=1}^{n} x_t x_t^\top$

C. $\Sigma = X^\top X$

D. $\Sigma = X X^\top$
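
As a quick numerical sanity check (a small NumPy sketch of our own, not part of the slides), option B can be verified by comparing the sum-of-outer-products form against the matrix form on centered data:

```python
import numpy as np

# Our own illustration: points are stored as rows of X and are centered,
# so (1/n) * sum_t x_t x_t^T equals (1/n) * X^T X.
rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.normal(size=(n, d))
X = X - X.mean(axis=0)                          # center the points

sigma_sum = sum(np.outer(x, x) for x in X) / n  # option B, written as a sum
sigma_mat = (X.T @ X) / n                       # the same quantity in matrix form
print(np.allclose(sigma_sum, sigma_mat))        # True
```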

Page 4:

Example: Students in a classroom

[Figure: students' positions in the classroom along the x, y, z axes]

Page 5:

Maximize Spread

Minimize Reconstruction Error

Page 6:

PRINCIPAL COMPONENT ANALYSIS

Eigenvectors of the covariance matrix are the principal components

Top K principal components are the eigenvectors with the K largest eigenvalues

Projection = Data × Top K eigenvectors

Reconstruction = Projection × Transpose of top K eigenvectors

Independently discovered by Pearson in 1901 and Hotelling in 1933.

1. $\Sigma = \mathrm{cov}(X)$

2. $W = \mathrm{eigs}(\Sigma, K)$

3. $Y = (X - \mu)\, W$

Page 7:

RECONSTRUCTION

4. $\hat{X} = Y\, W^\top$
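
The four steps above translate almost directly into NumPy. The sketch below is our own illustration (points stored as rows of X, the top-K eigenvectors as columns of W); it is not the course's reference code:

```python
import numpy as np

# Minimal PCA sketch following the steps on the last two slides.
def pca(X, K):
    n = X.shape[0]
    mu = X.mean(axis=0)
    Xc = X - mu
    Sigma = Xc.T @ Xc / n                          # 1. covariance matrix of the centered data
    eigvals, eigvecs = np.linalg.eigh(Sigma)       # 2. eigendecomposition (ascending eigenvalues)
    W = eigvecs[:, np.argsort(eigvals)[::-1][:K]]  #    keep the top-K eigenvectors as columns
    Y = Xc @ W                                     # 3. projection
    X_hat = Y @ W.T + mu                           # 4. reconstruction (mean added back)
    return Y, X_hat, W
```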

Page 8:

WHEN d >> n

If $d \gg n$ then $\Sigma$ is large, but we only need the top K eigenvectors. Idea: use the SVD.

$X - \mu = U D V^\top$

Then note that $\Sigma = (X - \mu)^\top (X - \mu) = V D^2 V^\top$.

Hence, the matrix $V$ is the same as the matrix $W$ obtained from the eigendecomposition of $\Sigma$, and the eigenvalues are the diagonal elements of $D^2$.

Alternative algorithm:

$[U, D, V] = \mathrm{SVD}(X - \mu, K)$, $\;\; W = V$

(where $U^\top U = I$ and $V^\top V = I$)
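
A rough NumPy version of this SVD route is sketched below (our own code; the slide's $\mathrm{SVD}(X - \mu, K)$ call is MATLAB-style, whereas `np.linalg.svd` computes the thin SVD and we then keep the top K right singular vectors):

```python
import numpy as np

# SVD route for d >> n: the right singular vectors of the centered data are
# the eigenvectors of Sigma, so the d x d covariance matrix is never formed.
def pca_svd(X, K):
    mu = X.mean(axis=0)
    U, s, Vt = np.linalg.svd(X - mu, full_matrices=False)  # X - mu = U diag(s) V^T
    W = Vt[:K].T              # top-K right singular vectors = top-K eigenvectors
    eigvals = s[:K] ** 2      # eigenvalues of (X - mu)^T (X - mu) are the s^2
    Y = (X - mu) @ W          # same projection as the eigendecomposition route
    return Y, W, eigvals
```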

Page 9:

WHEN TO USE PCA?

When data naturally lies in a low dimensional linear subspace

To minimize reconstruction error

Find directions where data is maximally spread

Page 10:

Canonical Correlation Analysis

[Figure: students' (x, y, z) positions, now paired with additional features Age, Gender, Angle]

Page 12:

TWO VIEW DIMENSIONALITY REDUCTION

Data comes in pairs $(x_1, x'_1), \ldots, (x_n, x'_n)$, where the $x_t$'s are $d$-dimensional and the $x'_t$'s are $d'$-dimensional.

Goal: Compress, say, view one into $y_1, \ldots, y_n$ that are $K$-dimensional vectors.

Retain information redundant between the two views.

Eliminate “noise” specific to only one of the views.
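
To make the two-view setup concrete, here is one way to generate toy paired data (an entirely synthetic example of our own, not from the lecture): a shared low-dimensional signal drives both views, and each view adds its own view-specific noise. CCA applied to such data should pick out directions aligned with the shared signal while ignoring the independent noise.

```python
import numpy as np

# Toy paired data: both views share a low-dimensional signal Z, and each view
# adds its own noise that the other view knows nothing about.
rng = np.random.default_rng(0)
n, k_shared, d1, d2 = 500, 2, 10, 8

Z = rng.normal(size=(n, k_shared))            # shared signal (e.g., the speech)
A = rng.normal(size=(k_shared, d1))           # how the signal appears in view 1
B = rng.normal(size=(k_shared, d2))           # how the signal appears in view 2

X1 = Z @ A + 0.5 * rng.normal(size=(n, d1))   # view 1: signal + view-specific noise
X2 = Z @ B + 0.5 * rng.normal(size=(n, d2))   # view 2: signal + view-specific noise
```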

Page 13:

EXAMPLE I: SPEECH RECOGNITION

Audio might have background sounds uncorrelated with video

Video might have lighting changes uncorrelated with audio

Redundant information between two views: the speech


Page 14:

EXAMPLE II: COMBINING FEATURE EXTRACTIONS

Method A and Method B are both equally good feature extraction techniques

Concatenating the two features blindly yields a large-dimensional feature vector with redundancy

Applying techniques like CCA extracts the key information shared between the two methods

Removes extra unwanted information

Page 15:

How do we get the right direction? (say K = 1)

[Figure: classroom example with additional features Age, Gender, Angle]

Page 16:

WHICH DIRECTION TO PICK?

[Figure: data shown in View I and View II]

Page 17:

WHICH DIRECTION TO PICK?

[Figure: the PCA direction in each view]

Page 18:

WHICH DIRECTION TO PICK?

Direction has large covariance

Page 19:

How do we pick the right direction to project to?

Page 20:

MAXIMIZING CORRELATION COEFFICIENT

Say $w_1$ and $v_1$ are the directions we choose to project onto in views 1 and 2 respectively; we want these directions to maximize

$$\frac{1}{n}\sum_{t=1}^{n}\left(y_t[1] - \frac{1}{n}\sum_{t=1}^{n} y_t[1]\right)\cdot\left(y'_t[1] - \frac{1}{n}\sum_{t=1}^{n} y'_t[1]\right)$$

subject to

$$\frac{1}{n}\sum_{t=1}^{n}\left(y_t[1] - \frac{1}{n}\sum_{t=1}^{n} y_t[1]\right)^2 = \frac{1}{n}\sum_{t=1}^{n}\left(y'_t[1] - \frac{1}{n}\sum_{t=1}^{n} y'_t[1]\right)^2 = 1$$

where $y_t[1] = w_1^\top x_t$ and $y'_t[1] = v_1^\top x'_t$

Page 22:

What is the problem with the above?

Page 23:

WHY NOT MAXIMIZE COVARIANCE?

Relevant information

Say $\frac{1}{n}\sum_{t=1}^{n} x_t[2]\cdot x'_t[2] > 0$

By scaling up this coordinate, we can blow up the covariance arbitrarily

Page 24:

MAXIMIZING CORRELATION COEFFICIENT

Say $w_1$ and $v_1$ are the directions we choose to project onto in views 1 and 2 respectively; we want these directions to maximize

$$\frac{\frac{1}{n}\sum_{t=1}^{n}\left(y_t[1] - \frac{1}{n}\sum_{t=1}^{n} y_t[1]\right)\cdot\left(y'_t[1] - \frac{1}{n}\sum_{t=1}^{n} y'_t[1]\right)}{\sqrt{\frac{1}{n}\sum_{t=1}^{n}\left(y_t[1] - \frac{1}{n}\sum_{t=1}^{n} y_t[1]\right)^2}\,\sqrt{\frac{1}{n}\sum_{t=1}^{n}\left(y'_t[1] - \frac{1}{n}\sum_{t=1}^{n} y'_t[1]\right)^2}}$$

Page 25:

BASIC IDEA OF CCA

Normalize variance in chosen direction to be constant (say 1)

Then maximize covariance

This is the same as maximizing the “correlation coefficient” (recall from last class).

Page 26:

COVARIANCE VS CORRELATION

$\mathrm{Covariance}(A,B) = \mathbb{E}\!\left[(A - \mathbb{E}[A]) \cdot (B - \mathbb{E}[B])\right]$

Depends on the scale of A and B: if B is rescaled, the covariance is rescaled with it.

$\mathrm{Correlation}(A,B) = \dfrac{\mathbb{E}\!\left[(A - \mathbb{E}[A]) \cdot (B - \mathbb{E}[B])\right]}{\sqrt{\mathrm{Var}(A)}\,\sqrt{\mathrm{Var}(B)}}$

Scale-free.
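
A tiny demo of the scale issue (our own example, not from the slides): rescaling B by a factor of 100 rescales the covariance by the same factor but leaves the correlation coefficient untouched.

```python
import numpy as np

# Empirical covariance and correlation on a toy pair of variables.
rng = np.random.default_rng(1)
A = rng.normal(size=1000)
B = 0.5 * A + rng.normal(size=1000)

def cov(a, b):
    return np.mean((a - a.mean()) * (b - b.mean()))

def corr(a, b):
    return cov(a, b) / np.sqrt(cov(a, a) * cov(b, b))

print(cov(A, B), corr(A, B))               # baseline values
print(cov(A, 100 * B), corr(A, 100 * B))   # covariance ~100x larger, correlation identical
```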

Page 27:

MAXIMIZING CORRELATION COEFFICIENT

Say $w_1$ and $v_1$ are the directions we choose to project onto in views 1 and 2 respectively; we want these directions to maximize

$$\frac{1}{n}\sum_{t=1}^{n}\left(y_t[1] - \frac{1}{n}\sum_{t=1}^{n} y_t[1]\right)\cdot\left(y'_t[1] - \frac{1}{n}\sum_{t=1}^{n} y'_t[1]\right)$$

subject to

$$\frac{1}{n}\sum_{t=1}^{n}\left(y_t[1] - \frac{1}{n}\sum_{t=1}^{n} y_t[1]\right)^2 = \frac{1}{n}\sum_{t=1}^{n}\left(y'_t[1] - \frac{1}{n}\sum_{t=1}^{n} y'_t[1]\right)^2 = 1$$

where $y_t[1] = w_1^\top x_t$ and $y'_t[1] = v_1^\top x'_t$

Page 28:

CANONICAL CORRELATION ANALYSIS

Hence we want to solve for projection vectors $w_1$ and $v_1$ that maximize

$$\frac{1}{n}\sum_{t=1}^{n} w_1^\top(x_t - \mu)\cdot v_1^\top(x'_t - \mu')$$

subject to

$$\frac{1}{n}\sum_{t=1}^{n}\left(w_1^\top(x_t - \mu)\right)^2 = \frac{1}{n}\sum_{t=1}^{n}\left(v_1^\top(x'_t - \mu')\right)^2 = 1$$

where $\mu = \frac{1}{n}\sum_{t=1}^{n} x_t$ and $\mu' = \frac{1}{n}\sum_{t=1}^{n} x'_t$

Page 30:

CANONICAL CORRELATION ANALYSIS

Hence we want to solve for projection vectors $w_1$ and $v_1$ that maximize

$$w_1^\top \Sigma_{1,2}\, v_1$$

subject to

$$w_1^\top \Sigma_{1,1}\, w_1 = v_1^\top \Sigma_{2,2}\, v_1 = 1$$

Writing the Lagrangian, taking the derivative and equating it to 0, we get

$$\Sigma_{1,2}\Sigma_{2,2}^{-1}\Sigma_{2,1}\, w_1 = \lambda^2\, \Sigma_{1,1}\, w_1 \quad\text{and}\quad \Sigma_{2,1}\Sigma_{1,1}^{-1}\Sigma_{1,2}\, v_1 = \lambda^2\, \Sigma_{2,2}\, v_1$$

or equivalently

$$\left(\Sigma_{1,1}^{-1}\Sigma_{1,2}\Sigma_{2,2}^{-1}\Sigma_{2,1}\right) w_1 = \lambda^2\, w_1 \quad\text{and}\quad \left(\Sigma_{2,2}^{-1}\Sigma_{2,1}\Sigma_{1,1}^{-1}\Sigma_{1,2}\right) v_1 = \lambda^2\, v_1$$

where $\Sigma = \mathrm{cov}\!\left([X, X']\right) = \begin{pmatrix} \Sigma_{1,1} & \Sigma_{1,2} \\ \Sigma_{2,1} & \Sigma_{2,2} \end{pmatrix}$

Page 31:

SOLUTION

$W_1 = \mathrm{eigs}\!\left(\Sigma_{11}^{-1}\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21},\, K\right)$

$W_2 = \mathrm{eigs}\!\left(\Sigma_{22}^{-1}\Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12},\, K\right)$

Page 32:

CCA ALGORITHM

Write $\tilde{x}_t = \begin{pmatrix} x_t \\ x'_t \end{pmatrix}$, the $(d + d')$-dimensional concatenated vectors.

Calculate the covariance matrix of the joint data points:

$$\Sigma = \begin{pmatrix} \Sigma_{1,1} & \Sigma_{1,2} \\ \Sigma_{2,1} & \Sigma_{2,2} \end{pmatrix}$$

Calculate $\Sigma_{1,1}^{-1}\Sigma_{1,2}\Sigma_{2,2}^{-1}\Sigma_{2,1}$. The top K eigenvectors of this matrix give us the projection matrix for view I.

Calculate $\Sigma_{2,2}^{-1}\Sigma_{2,1}\Sigma_{1,1}^{-1}\Sigma_{1,2}$. The top K eigenvectors of this matrix give us the projection matrix for view II.

In matrix form, with $X = [X_1\; X_2]$ the $n \times (d_1 + d_2)$ joint data matrix:

1. $X = [X_1\; X_2]$

2. $\Sigma = \mathrm{cov}(X) = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}$

3. $W_1 = \mathrm{eigs}\!\left(\Sigma_{11}^{-1}\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21},\, K\right)$

4. $Y_1 = (X_1 - \mu_1)\, W_1$
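
A sketch of this algorithm in NumPy follows (our own implementation, not the course's; a small ridge term `reg` is added so the block covariance matrices are safely invertible, which the slides do not discuss). Running it on the toy two-view data generated earlier, e.g. `Y1, Y2, W1, W2 = cca(X1, X2, K=2)`, should yield projections whose coordinates are highly correlated across the two views.

```python
import numpy as np

# CCA via the eigendecompositions on this slide; points are rows of X1 and X2.
def cca(X1, X2, K, reg=1e-8):
    n = X1.shape[0]
    X1c = X1 - X1.mean(axis=0)
    X2c = X2 - X2.mean(axis=0)
    S11 = X1c.T @ X1c / n + reg * np.eye(X1.shape[1])   # Sigma_11 (regularized)
    S22 = X2c.T @ X2c / n + reg * np.eye(X2.shape[1])   # Sigma_22 (regularized)
    S12 = X1c.T @ X2c / n                               # Sigma_12 (= Sigma_21^T)

    M1 = np.linalg.solve(S11, S12) @ np.linalg.solve(S22, S12.T)  # Sigma_11^-1 Sigma_12 Sigma_22^-1 Sigma_21
    M2 = np.linalg.solve(S22, S12.T) @ np.linalg.solve(S11, S12)  # Sigma_22^-1 Sigma_21 Sigma_11^-1 Sigma_12

    vals1, vecs1 = np.linalg.eig(M1)
    vals2, vecs2 = np.linalg.eig(M2)
    W1 = np.real(vecs1[:, np.argsort(-np.real(vals1))[:K]])       # top-K eigenvectors, view I
    W2 = np.real(vecs2[:, np.argsort(-np.real(vals2))[:K]])       # top-K eigenvectors, view II

    return X1c @ W1, X2c @ W2, W1, W2   # projections Y1, Y2 and projection matrices
```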