So far…
Supervised machine learning
Linear models
Least squares regression, SVR
Fisher’s discriminant, Perceptron, Logistic model, SVM
Non-linear models
Neural networks, Decision trees, Association rules
Unsupervised machine learning
Clustering/EM, PCA
Generic scaffolding
Probabilistic modeling, ML/MAP estimation
Performance evaluation, Statistical learning theory
Linear algebra, Optimization methods
Coming up next…
Supervised machine learning
Linear models
Least squares regression, SVR
Fisher’s discriminant, Perceptron, Logistic model, SVM
Non-linear models
Neural networks, Decision trees, Association rules
Kernel-XXX
Unsupervised machine learning
Clustering/EM, PCA, Kernel-XXX
Generic scaffolding
Probabilistic modeling, ML/MAP estimation
Performance evaluation, Statistical learning theory
Linear algebra, Optimization methods
Kernels
Too much linear
Logistic regression, Perceptron, Max. margin, Fisher’s discriminant, Linear regression, Ridge regression, LASSO, …:
$f(\boldsymbol{x}) = \boldsymbol{w}^T\boldsymbol{x} + b$
PCA, LDA, ICA, …:
$f(\boldsymbol{x}) = \boldsymbol{A}\boldsymbol{x}$
K-means:
$\boldsymbol{c}_i = \frac{1}{m}\boldsymbol{X}_i\boldsymbol{1}$
CCA, GLM, …
Linear is not enough
Limited generalization ability
Linear is not enough
Limited applicability
Text?
Ordinal/Nominal data?
Graphs/Trees/Networks?
Shapes?
Graph nodes?
Solutions
Feature space
Nonlinear feature spaces
Kernels
The Kernel Trick
Dual representation
Important idea #1
Important idea #2
Important idea #3
$f(x) = wx$
Nonlinear feature space
$x \to x' := \phi(x) := (x, x^2, x^3)$
$f(x') = w_1 x + w_2 x^2 + w_3 x^3$
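To make the idea concrete, here is a minimal NumPy sketch (not from the slides): map a scalar input through $\phi(x) = (x, x^2, x^3)$ and fit an ordinary linear model on the mapped features. The toy data and function names are illustrative choices.

```python
import numpy as np

def phi(x):
    """Explicit nonlinear feature map: x -> (x, x^2, x^3)."""
    return np.stack([x, x**2, x**3], axis=1)

rng = np.random.default_rng(0)
x = np.linspace(-2, 2, 50)
y = x**3 - x + 0.1 * rng.standard_normal(50)      # nonlinear target

# Linear model in the feature space: f(x') = w1*x + w2*x^2 + w3*x^3 + b
Phi = np.column_stack([phi(x), np.ones_like(x)])  # append bias column
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print("weights (w1, w2, w3, b):", np.round(w, 2))
```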
$x \to \phi(x) = (x, x^3 - x)$
Nonlinear feature space
$f(\boldsymbol{x}) = \boldsymbol{w}^T \phi(\boldsymbol{x})$
+ Support for arbitrary data types
$\phi(\text{text})$ = word counts
$\phi(\text{graph})$ = node degrees
$\phi(\text{tree})$ = path lengths
…
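As a toy illustration of a feature map over non-vector data, the sketch below builds $\phi(\text{text})$ = word counts over a small fixed vocabulary; the vocabulary and documents are made up, and a real system would use a richer representation.

```python
import numpy as np

def phi_text(doc, vocab):
    """phi(text) = word counts over a fixed vocabulary."""
    words = doc.lower().split()
    return np.array([words.count(v) for v in vocab], dtype=float)

vocab = ["kernel", "linear", "feature", "trick"]
docs = ["the kernel trick", "linear models use linear features", "feature space"]
X = np.stack([phi_text(d, vocab) for d in docs])
print(X)  # each row is a fixed-length vector that a linear model can consume
```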
What if the dimensionality is high?
$(x_1, x_2, \ldots, x_m) \to (x_1 x_1, x_1 x_2, \ldots, x_m x_m)$: $O(m^2)$ elements
For all $k$-wise products: $O(m^k)$
The Kernel Trick
Let $\phi(\boldsymbol{x}) = (x_1 x_1, x_1 x_2, \ldots, x_m x_m)$. Consider
$\langle \phi(\boldsymbol{x}), \phi(\boldsymbol{y}) \rangle = \sum_{ij} \phi(\boldsymbol{x})_{ij}\, \phi(\boldsymbol{y})_{ij} = \sum_{ij} x_i x_j y_i y_j = \sum_{ij} x_i y_i x_j y_j = \Big(\sum_i x_i y_i\Big) \Big(\sum_j x_j y_j\Big) = \Big(\sum_i x_i y_i\Big)^2 = \langle \boldsymbol{x}, \boldsymbol{y} \rangle^2$
Polynomial kernel: $K(\boldsymbol{x}, \boldsymbol{y}) = (\langle \boldsymbol{x}, \boldsymbol{y} \rangle + R)^d$
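A quick numerical check of the identity above, assuming $\phi(\boldsymbol{x})$ is the vector of all pairwise products $x_i x_j$; the helper names are mine.

```python
import numpy as np

def phi(x):
    """All pairwise products (x_i * x_j), flattened: m^2 components."""
    return np.outer(x, x).ravel()

rng = np.random.default_rng(1)
x, y = rng.standard_normal(5), rng.standard_normal(5)

lhs = phi(x) @ phi(y)        # explicit feature space: O(m^2) work
rhs = (x @ y) ** 2           # kernel trick: O(m) work
print(np.isclose(lhs, rhs))  # True
```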
The Kernel Trick
What about $K(\boldsymbol{x}, \boldsymbol{y}) = \langle \boldsymbol{x}, \boldsymbol{y} \rangle + 0.5\, \langle \boldsymbol{x}, \boldsymbol{y} \rangle^2$?
$K(\boldsymbol{x}, \boldsymbol{y}) = \sum_i x_i y_i + 0.5 \sum_{ij} \phi_{ij}(\boldsymbol{x})\, \phi_{ij}(\boldsymbol{y})$
$= \big\langle (x_1, \ldots, x_m, \sqrt{0.5}\, x_1 x_1, \ldots, \sqrt{0.5}\, x_m x_m),\; (y_1, \ldots, y_m, \sqrt{0.5}\, y_1 y_1, \ldots, \sqrt{0.5}\, y_m y_m) \big\rangle$
The Kernel Trick
What about $K(\boldsymbol{x}, \boldsymbol{y}) = 1 + \langle \boldsymbol{x}, \boldsymbol{y} \rangle + \frac{1}{2}\langle \boldsymbol{x}, \boldsymbol{y} \rangle^2 + \frac{1}{6}\langle \boldsymbol{x}, \boldsymbol{y} \rangle^3 + \frac{1}{24}\langle \boldsymbol{x}, \boldsymbol{y} \rangle^4$?
What about $K(\boldsymbol{x}, \boldsymbol{y}) = \sum_{i=0}^{\infty} \frac{\langle \boldsymbol{x}, \boldsymbol{y} \rangle^i}{i!} = \exp\langle \boldsymbol{x}, \boldsymbol{y} \rangle$?
Infinite-dimensional feature space!
Gaussian kernel: $K(\boldsymbol{x}, \boldsymbol{y}) = \exp(-\gamma \|\boldsymbol{x} - \boldsymbol{y}\|^2) = \exp\Big(-\frac{\|\boldsymbol{x} - \boldsymbol{y}\|^2}{2\sigma^2}\Big)$
Exponential kernel: $K(\boldsymbol{x}, \boldsymbol{y}) = \exp\Big(-\frac{\|\boldsymbol{x} - \boldsymbol{y}\|}{2\sigma^2}\Big)$
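A small sketch of computing a Gaussian kernel matrix in NumPy, using $\gamma = 1/(2\sigma^2)$ as above; the function name and parameter values are illustrative.

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma=1.0):
    """K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)), i.e. gamma = 1/(2 sigma^2)."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2 * sigma**2))

rng = np.random.default_rng(2)
X = rng.standard_normal((6, 3))
K = gaussian_kernel_matrix(X, sigma=0.8)
print(K.shape, np.allclose(np.diag(K), 1.0))  # (6, 6) True
```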
Kernels
http://crsouza.blogspot.com/2010/03/kernel-functions-for-machine-learning.html
Structured data kernels
String kernels
P-spectrum kernels
All-subsequences kernels
Gap-weighted subsequences kernels
…
Graph & tree kernels
Co-rooted subtrees
All subtrees
Random walks
…
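As one concrete instance of the string kernels listed above, here is a minimal sketch of a p-spectrum kernel: it counts how many length-p substrings two strings share, with multiplicity. The function name and the choice p = 2 are my own.

```python
from collections import Counter

def p_spectrum_kernel(s, t, p=2):
    """K(s, t) = sum over all length-p substrings of (#occurrences in s) * (#occurrences in t)."""
    cs = Counter(s[i:i + p] for i in range(len(s) - p + 1))
    ct = Counter(t[i:i + p] for i in range(len(t) - p + 1))
    return sum(cs[sub] * ct[sub] for sub in cs)

print(p_spectrum_kernel("statistics", "computation", p=2))
```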
Kernel
A function $K(\boldsymbol{x}, \boldsymbol{y})$ is a kernel if
$K(\boldsymbol{x}, \boldsymbol{y}) = \langle \phi(\boldsymbol{x}), \phi(\boldsymbol{y}) \rangle$
for some feature map $\phi$.
Kernel matrix
For a given kernel function $K$ and a finite dataset $(\boldsymbol{x}_1, \boldsymbol{x}_2, \ldots, \boldsymbol{x}_n)$, the $n \times n$ matrix
$\boldsymbol{K}_{ij} := K(\boldsymbol{x}_i, \boldsymbol{x}_j)$
is called the kernel matrix.
Kernel matrix
Let $\boldsymbol{X}$ be the data matrix. Then
$\boldsymbol{K} = \boldsymbol{X}\boldsymbol{X}^T$
is the kernel matrix for the linear kernel $K(\boldsymbol{x}, \boldsymbol{y}) = \boldsymbol{x}^T \boldsymbol{y}$.
Let $\phi$ be a feature mapping, applied to the rows of $\boldsymbol{X}$. Then
$\boldsymbol{K} = \phi(\boldsymbol{X})\, \phi(\boldsymbol{X})^T$
is the kernel matrix for the corresponding kernel $K(\boldsymbol{x}, \boldsymbol{y}) = \langle \phi(\boldsymbol{x}), \phi(\boldsymbol{y}) \rangle$.
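A short sketch checking both statements numerically: $\boldsymbol{X}\boldsymbol{X}^T$ gives the linear-kernel matrix, and $\phi(\boldsymbol{X})\phi(\boldsymbol{X})^T$ for an explicit quadratic feature map matches the kernel-trick formula $\langle \boldsymbol{x}, \boldsymbol{y} \rangle^2$. Names and toy data are mine.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((5, 3))          # data matrix, one example per row

# Linear kernel: K = X X^T
K_lin = X @ X.T

# Explicit feature map phi applied row-wise: K = phi(X) phi(X)^T
def phi(x):
    return np.outer(x, x).ravel()        # quadratic features

Phi = np.apply_along_axis(phi, 1, X)
K_quad = Phi @ Phi.T

# Same matrix via the kernel trick, K(x, y) = <x, y>^2
K_trick = (X @ X.T) ** 2
print(np.allclose(K_quad, K_trick))      # True
```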
Kernel theorem
Not every function $K$ is a kernel!
e.g. $K(x, y) = -1$ is not.
Not every $n \times n$ matrix is a kernel matrix!
Kernel theorem
Theorem:
$K$ is a kernel function $\Leftrightarrow$ $K$ is symmetric and positive semidefinite.
A function is positive semidefinite iff for any finite dataset $\{\boldsymbol{x}_1, \boldsymbol{x}_2, \ldots, \boldsymbol{x}_n\}$ the corresponding kernel matrix is positive semidefinite.
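A sketch of how the theorem can be checked on a finite dataset: form the kernel matrix and test symmetry and the sign of its eigenvalues. The function $K(x, y) = -1$ from the previous slide indeed fails, while the linear kernel passes. Helper names are mine.

```python
import numpy as np

def is_psd_kernel_matrix(K, tol=1e-9):
    """Symmetric and all eigenvalues >= -tol."""
    return np.allclose(K, K.T) and np.linalg.eigvalsh(K).min() >= -tol

rng = np.random.default_rng(4)
X = rng.standard_normal((4, 2))

K_linear = X @ X.T                        # a valid kernel
K_minus1 = -np.ones((4, 4))               # K(x, y) = -1: not a kernel
print(is_psd_kernel_matrix(K_linear))     # True
print(is_psd_kernel_matrix(K_minus1))     # False
```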
Kernel closure
$K_1 + K_2$ is a kernel (feature space concatenation)
$c\,K$ is a kernel for $c \ge 0$ (feature space scaling)
$K_1 \cdot K_2$ is a kernel (feature space tensor product)
$K(\psi(\boldsymbol{x}), \psi(\boldsymbol{y}))$ is a kernel for any map $\psi$ (feature map composition)
Kernel normalization
Let $\phi'(\boldsymbol{x}) = \frac{\phi(\boldsymbol{x})}{\|\phi(\boldsymbol{x})\|}$. Then
$K'(\boldsymbol{x}, \boldsymbol{y}) = \langle \phi'(\boldsymbol{x}), \phi'(\boldsymbol{y}) \rangle = \Big\langle \frac{\phi(\boldsymbol{x})}{\|\phi(\boldsymbol{x})\|}, \frac{\phi(\boldsymbol{y})}{\|\phi(\boldsymbol{y})\|} \Big\rangle = \frac{\langle \phi(\boldsymbol{x}), \phi(\boldsymbol{y}) \rangle}{\sqrt{\|\phi(\boldsymbol{x})\|^2\, \|\phi(\boldsymbol{y})\|^2}} = \frac{K(\boldsymbol{x}, \boldsymbol{y})}{\sqrt{K(\boldsymbol{x}, \boldsymbol{x})\, K(\boldsymbol{y}, \boldsymbol{y})}}$
Kernel matrix normalization
The same rule, applied entrywise to the kernel matrix:
$\boldsymbol{K}'_{ij} := \frac{\boldsymbol{K}_{ij}}{\sqrt{\boldsymbol{K}_{ii}\, \boldsymbol{K}_{jj}}}$
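A minimal sketch of the matrix-form normalization above, illustrated on a linear kernel matrix (whose diagonal is not 1 to begin with); names are mine.

```python
import numpy as np

def normalize_kernel_matrix(K):
    """K'_ij = K_ij / sqrt(K_ii * K_jj)."""
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)

rng = np.random.default_rng(5)
X = rng.standard_normal((5, 3))
K = X @ X.T                              # linear kernel matrix
Kn = normalize_kernel_matrix(K)
print(np.allclose(np.diag(Kn), 1.0))     # True: normalized features have unit norm
```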
Kernel matrix centering
$\boldsymbol{x}_i \to \boldsymbol{x}_i - \frac{1}{n}\sum_k \boldsymbol{x}_k$
$\boldsymbol{X} \to \boldsymbol{X} - \frac{1}{n}\boldsymbol{1}_n\boldsymbol{1}_n^T\boldsymbol{X}$
$\boldsymbol{X}\boldsymbol{X}^T \to \Big(\boldsymbol{X} - \frac{1}{n}\boldsymbol{1}\boldsymbol{1}^T\boldsymbol{X}\Big)\Big(\boldsymbol{X} - \frac{1}{n}\boldsymbol{1}\boldsymbol{1}^T\boldsymbol{X}\Big)^T = \boldsymbol{X}\boldsymbol{X}^T - \frac{1}{n}\boldsymbol{1}\boldsymbol{1}^T\boldsymbol{X}\boldsymbol{X}^T - \frac{1}{n}\boldsymbol{X}\boldsymbol{X}^T\boldsymbol{1}\boldsymbol{1}^T + \frac{1}{n^2}\boldsymbol{1}\boldsymbol{1}^T\boldsymbol{X}\boldsymbol{X}^T\boldsymbol{1}\boldsymbol{1}^T$
$\boldsymbol{K}_{\text{cent}} := \boldsymbol{K} - \frac{1}{n}\boldsymbol{1}\boldsymbol{1}^T\boldsymbol{K} - \frac{1}{n}\boldsymbol{K}\boldsymbol{1}\boldsymbol{1}^T + \frac{1}{n^2}\boldsymbol{1}\boldsymbol{1}^T\boldsymbol{K}\boldsymbol{1}\boldsymbol{1}^T$
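A sketch verifying the centering formula for the linear kernel: centering the kernel matrix should match explicitly centering the data and recomputing $\boldsymbol{X}\boldsymbol{X}^T$. Names are mine.

```python
import numpy as np

def center_kernel_matrix(K):
    """K_cent = K - (1/n) 1 1^T K - (1/n) K 1 1^T + (1/n^2) 1 1^T K 1 1^T."""
    n = K.shape[0]
    one = np.ones((n, n)) / n
    return K - one @ K - K @ one + one @ K @ one

rng = np.random.default_rng(6)
X = rng.standard_normal((6, 3))
K = X @ X.T

Xc = X - X.mean(axis=0)                                  # center in feature space explicitly
print(np.allclose(center_kernel_matrix(K), Xc @ Xc.T))   # True
```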
The Dual Representation
Let $A$ be the input space, and let $B$ be the higher-dimensional feature space.
Let $\phi: A \to B$ be the feature map.
Fix a dataset $\{\boldsymbol{x}_1, \boldsymbol{x}_2, \ldots, \boldsymbol{x}_n\} \subset A$.
Let $\boldsymbol{w} = \sum_i \alpha_i\, \phi(\boldsymbol{x}_i) \in B$.
We say that the $\alpha_i$ are the dual coordinates for $\boldsymbol{w}$.
Dual coordinates
$\boldsymbol{w} = \sum_i \alpha_i\, \phi(\boldsymbol{x}_i) = \phi(\boldsymbol{X})^T \boldsymbol{\alpha} = \boldsymbol{\Xi}^T \boldsymbol{\alpha}$
Note that $\boldsymbol{\Xi}\boldsymbol{\Xi}^T = \phi(\boldsymbol{X})\, \phi(\boldsymbol{X})^T = \boldsymbol{K}$.
Now we can do all of the useful stuff using dual coordinates only.
Dual coordinates
Let $\boldsymbol{w} = \boldsymbol{\Xi}^T \boldsymbol{\alpha}$ and $\boldsymbol{u} = \boldsymbol{\Xi}^T \boldsymbol{\beta}$. Then
$2\boldsymbol{w} = \boldsymbol{\Xi}^T (2\boldsymbol{\alpha})$
$\boldsymbol{w} + \boldsymbol{u} = \boldsymbol{\Xi}^T (\boldsymbol{\alpha} + \boldsymbol{\beta})$
$\langle \boldsymbol{w}, \boldsymbol{u} \rangle = \boldsymbol{w}^T \boldsymbol{u} = \boldsymbol{\alpha}^T \boldsymbol{\Xi}\boldsymbol{\Xi}^T \boldsymbol{\beta} = \boldsymbol{\alpha}^T \boldsymbol{K} \boldsymbol{\beta}$
$\|\boldsymbol{w} - \boldsymbol{u}\|^2 = \boldsymbol{w}^T\boldsymbol{w} + \boldsymbol{u}^T\boldsymbol{u} - 2\boldsymbol{w}^T\boldsymbol{u} = \boldsymbol{\alpha}^T \boldsymbol{K} \boldsymbol{\alpha} + \boldsymbol{\beta}^T \boldsymbol{K} \boldsymbol{\beta} - 2\,\boldsymbol{\alpha}^T \boldsymbol{K} \boldsymbol{\beta}$
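A small numerical check of the dual-coordinate identities, assuming $\boldsymbol{\Xi} = \phi(\boldsymbol{X})$ with an explicit quadratic feature map; names and data are mine.

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.standard_normal((5, 3))
Xi = np.apply_along_axis(lambda x: np.outer(x, x).ravel(), 1, X)  # Xi = phi(X)
K = Xi @ Xi.T

alpha, beta = rng.standard_normal(5), rng.standard_normal(5)
w, u = Xi.T @ alpha, Xi.T @ beta          # primal vectors from dual coordinates

print(np.isclose(w @ u, alpha @ K @ beta))                 # <w, u> = alpha^T K beta
print(np.isclose(np.sum((w - u)**2),
                 alpha @ K @ alpha + beta @ K @ beta
                 - 2 * alpha @ K @ beta))                   # ||w - u||^2 in dual form
```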
Kernelization
Recall the Perceptron:
Initialize $\boldsymbol{w} := \boldsymbol{0}$
Find a misclassified example $(\boldsymbol{x}_i, y_i)$: $\mathrm{sign}(\boldsymbol{w}^T \boldsymbol{x}_i + b) \ne y_i$
Update weights: $\boldsymbol{w} := \boldsymbol{w} + \mu y_i \boldsymbol{x}_i$, $b := b + \mu y_i$

With $\boldsymbol{w} = \sum_j \alpha_j \boldsymbol{x}_j$, each step has a dual counterpart:
$\boldsymbol{w} := \boldsymbol{0} \;\Leftrightarrow\; \boldsymbol{\alpha} := \boldsymbol{0}$
$\boldsymbol{w}^T \boldsymbol{x}_i + b = \sum_j \alpha_j \boldsymbol{x}_j^T \boldsymbol{x}_i + b = \boldsymbol{K}_i \boldsymbol{\alpha} + b$
$\boldsymbol{w} := \boldsymbol{w} + \mu y_i \boldsymbol{x}_i \;\Leftrightarrow\; \alpha_i := \alpha_i + \mu y_i$

Kernelized Perceptron:
Initialize $\boldsymbol{\alpha} := \boldsymbol{0}$
Find a misclassified example $(\boldsymbol{x}_i, y_i)$: $\mathrm{sign}(\boldsymbol{K}_i \boldsymbol{\alpha} + b) \ne y_i$
Update weights: $\alpha_i := \alpha_i + \mu y_i$, $b := b + \mu y_i$
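A compact sketch of the kernelized Perceptron above, written against a precomputed kernel matrix; the learning rate (playing the role of $\mu$), stopping rule, Gaussian kernel choice, and toy data are my own.

```python
import numpy as np

def kernel_perceptron(K, y, lr=1.0, epochs=100):
    """Dual Perceptron: keep one alpha per training point instead of a weight vector."""
    n = K.shape[0]
    alpha, b = np.zeros(n), 0.0
    for _ in range(epochs):
        errors = 0
        for i in range(n):
            if np.sign(K[i] @ alpha + b) != y[i]:   # misclassified: sign(K_i alpha + b) != y_i
                alpha[i] += lr * y[i]               # alpha_i := alpha_i + mu * y_i
                b += lr * y[i]                      # b := b + mu * y_i
                errors += 1
        if errors == 0:
            break
    return alpha, b

# Toy data that is not linearly separable, handled via a Gaussian kernel
rng = np.random.default_rng(8)
X = rng.standard_normal((40, 2))
y = np.where(np.sum(X**2, axis=1) < 1.0, 1, -1)          # circular decision boundary
sq = np.sum((X[:, None] - X[None, :])**2, axis=-1)
K = np.exp(-sq / 2.0)                                     # Gaussian kernel, sigma = 1

alpha, b = kernel_perceptron(K, y)
pred = np.sign(K @ alpha + b)
print("training accuracy:", np.mean(pred == y))
```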
Quiz
Today we heard three important ideas
Important idea #1: __________
Important idea #2: __________
Important idea #3: __________
Function/matrix 𝐾 is a kernel function/matrix
iff it is __________
Dual representation: ___ = ___ __
Quiz
These algorithms have kernelized versions:
___________________________ …