
Kernel Methods

Konstantin Tretyakov (kt@ut.ee)

MTAT.03.227 Machine Learning

May 26, 2013


So far…

Supervised machine learning
  Linear models: Least squares regression, SVR; Fisher’s discriminant, Perceptron, Logistic model, SVM
  Non-linear models: Neural networks, Decision trees, Association rules
Unsupervised machine learning
  Clustering/EM, PCA
Generic scaffolding
  Probabilistic modeling, ML/MAP estimation
  Performance evaluation, Statistical learning theory
  Linear algebra, Optimization methods

Coming up next…

Supervised machine learning
  Linear models: Least squares regression, SVR; Fisher’s discriminant, Perceptron, Logistic model, SVM
  Non-linear models: Neural networks, Decision trees, Association rules, Kernel-XXX
Unsupervised machine learning
  Clustering/EM, PCA, Kernel-XXX
Generic scaffolding
  Probabilistic modeling, ML/MAP estimation
  Performance evaluation, Statistical learning theory
  Linear algebra, Optimization methods

Kernels


Too much linear

Logistic regression, Perceptron, Max. margin, Fisher’s discriminant, Linear regression, Ridge regression, LASSO, …:

$f(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + b$

PCA, LDA, ICA, …:

$f(\mathbf{x}) = \mathbf{A}\mathbf{x}$

K-means:

$\mathbf{c}_i = \frac{1}{m}\,\mathbf{X}_i\mathbf{1}$

CCA, GLM, …

Linear is not enough

Limited generalization ability


Linear is not enough

Limited applicability

Text?

Ordinal/Nominal data?

Graphs/Trees/Networks?

Shapes?

Graph nodes?



Solutions

Feature space
  Nonlinear feature spaces (important idea #1)
Kernels
  The Kernel Trick (important idea #2)
  Dual representation (important idea #3)


Nonlinear feature space

A linear model in one dimension: $f(x) = wx$

Map the input into a feature space: $x \to x' := \phi(x) := (x, x^2, x^3)$

A linear model in the feature space is nonlinear in the input:

$f(x') = w_1 x + w_2 x^2 + w_3 x^3$

Another example: $x \to \phi(x) = (x, x^3 - x)$

Nonlinear feature space

$f(\mathbf{x}) = \mathbf{w}^T\phi(\mathbf{x})$

+ Support for arbitrary data types:

$\phi(\text{text}) = \text{word counts}$, $\phi(\text{graph}) = \text{node degrees}$, $\phi(\text{tree}) = \text{path lengths}$

What if the dimensionality is high?

$(x_1, x_2, \ldots, x_m) \to (x_1 x_1, x_1 x_2, \ldots, x_m x_m)$: $O(m^2)$ elements

For all $k$-wise products: $O(m^k)$

The Kernel Trick

Let $\phi(\mathbf{x}) = (x_1 x_1, x_1 x_2, \ldots, x_m x_m)$. Consider

$\langle \phi(\mathbf{x}), \phi(\mathbf{y}) \rangle = \sum_{ij} \phi(\mathbf{x})_{ij}\,\phi(\mathbf{y})_{ij} = \sum_{ij} x_i x_j y_i y_j = \sum_{ij} x_i y_i x_j y_j$

$= \Big(\sum_i x_i y_i\Big)\Big(\sum_j x_j y_j\Big) = \Big(\sum_i x_i y_i\Big)^2 = \langle \mathbf{x}, \mathbf{y} \rangle^2$

Polynomial kernel

$K(\mathbf{x}, \mathbf{y}) = (\langle \mathbf{x}, \mathbf{y} \rangle + R)^d$
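The trick is easy to verify numerically. The following sketch (an illustration of mine, not from the slides) checks that the explicit $O(m^2)$ feature map and the $O(m)$ kernel evaluation $\langle \mathbf{x}, \mathbf{y} \rangle^2$ agree:

```python
import numpy as np

# Kernel trick check: for phi(x) = (x_i * x_j for all i, j),
# <phi(x), phi(y)> should equal <x, y>^2.

rng = np.random.default_rng(1)
x, y = rng.standard_normal(5), rng.standard_normal(5)

phi = lambda v: np.outer(v, v).ravel()   # explicit O(m^2) feature map
lhs = phi(x) @ phi(y)                    # inner product in feature space
rhs = (x @ y) ** 2                       # kernel evaluation in input space

print(np.isclose(lhs, rhs))              # True: same value, O(m) work
```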

The Kernel Trick

What about $K(\mathbf{x}, \mathbf{y}) = \langle \mathbf{x}, \mathbf{y} \rangle + 0.5\,\langle \mathbf{x}, \mathbf{y} \rangle^2$?

$K(\mathbf{x}, \mathbf{y}) = \sum_i x_i y_i + 0.5 \sum_{ij} \phi_{ij}(\mathbf{x})\,\phi_{ij}(\mathbf{y})$

$= \big\langle (x_1, \ldots, x_m, \sqrt{0.5}\,x_1 x_1, \ldots, \sqrt{0.5}\,x_m x_m),\ (y_1, \ldots, y_m, \sqrt{0.5}\,y_1 y_1, \ldots, \sqrt{0.5}\,y_m y_m) \big\rangle$

So this, too, is an inner product in a (larger) feature space.

The Kernel Trick

What about $K(x, y) = 1 + \langle x, y \rangle + \frac{1}{2}\langle x, y \rangle^2 + \frac{1}{6}\langle x, y \rangle^3 + \frac{1}{24}\langle x, y \rangle^4$?

The Kernel Trick

What about

$K(x, y) = \sum_{i=0}^{\infty} \frac{\langle x, y \rangle^i}{i!} = \exp\langle x, y \rangle$?

Infinite-dimensional feature space!

Gaussian kernel

$K(\mathbf{x}, \mathbf{y}) = \exp(-\gamma\,\|\mathbf{x} - \mathbf{y}\|^2) = \exp\!\left(-\frac{\|\mathbf{x} - \mathbf{y}\|^2}{2\sigma^2}\right)$

Exponential kernel

$K(\mathbf{x}, \mathbf{y}) = \exp\!\left(-\frac{\|\mathbf{x} - \mathbf{y}\|}{2\sigma^2}\right)$

Kernels

http://crsouza.blogspot.com/2010/03/kernel-functions-for-machine-learning.html
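As a concrete illustration (mine, not the lecture's), the Gaussian kernel above can be evaluated on a whole dataset at once, producing the kernel matrix discussed next:

```python
import numpy as np

# Evaluate the Gaussian kernel K(x, y) = exp(-gamma * ||x - y||^2)
# on all pairs of rows of X at once, yielding the kernel matrix.

def gaussian_kernel_matrix(X, gamma=1.0):
    """K_ij = exp(-gamma * ||x_i - x_j||^2) for rows x_i of X."""
    sq_norms = (X ** 2).sum(axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2 * X @ X.T
    return np.exp(-gamma * np.clip(sq_dists, 0, None))  # clip guards tiny negatives

X = np.random.default_rng(2).standard_normal((4, 3))
K = gaussian_kernel_matrix(X, gamma=0.5)
print(K.shape, np.allclose(K, K.T))  # (4, 4) True: symmetric, as a kernel matrix must be
```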

Structured data kernels

String kernels
  P-spectrum kernels (sketched below)
  All-subsequences kernels
  Gap-weighted subsequences kernels
Graph & tree kernels
  Co-rooted subtrees
  All subtrees
  Random walks
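As one concrete structured-data example, a minimal p-spectrum kernel can be sketched as follows (my illustration, assuming the standard definition: it counts the length-$p$ substrings two strings share):

```python
from collections import Counter

# p-spectrum kernel: K(s, t) = sum over all length-p substrings u
# of count_s(u) * count_t(u).

def p_spectrum_kernel(s, t, p=2):
    cs = Counter(s[i:i + p] for i in range(len(s) - p + 1))
    ct = Counter(t[i:i + p] for i in range(len(t) - p + 1))
    return sum(cs[u] * ct[u] for u in cs if u in ct)

# Shared bigrams "ta", "at" once each and "ti" twice in the first string: 1 + 1 + 2 = 4
print(p_spectrum_kernel("statistics", "computation", p=2))  # 4
```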

Kernel

A function $K(\mathbf{x}, \mathbf{y})$ is a kernel if

$K(\mathbf{x}, \mathbf{y}) = \langle \phi(\mathbf{x}), \phi(\mathbf{y}) \rangle$

for some feature map $\phi$.

Kernel matrix

For a given kernel function $K$ and a finite dataset $(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n)$, the $n \times n$ matrix

$\mathbf{K}_{ij} := K(\mathbf{x}_i, \mathbf{x}_j)$

is called the kernel matrix.


Kernel matrix

Let $\mathbf{X}$ be the data matrix. Then

$\mathbf{K} = \mathbf{X}\mathbf{X}^T$

is the kernel matrix for the linear kernel $K(\mathbf{x}, \mathbf{y}) = \mathbf{x}^T\mathbf{y}$.

Let $\phi$ be a feature mapping. Then*

$\mathbf{K} = \phi(\mathbf{X})\,\phi(\mathbf{X})^T$

is the kernel matrix for the corresponding kernel $K(\mathbf{x}, \mathbf{y}) = \langle \phi(\mathbf{x}), \phi(\mathbf{y}) \rangle$.

(* with $\phi$ applied to each row of $\mathbf{X}$)

Kernel theorem

Not every function $K$ is a kernel!
  e.g. $K(x, y) = -1$ is not
Not every $n \times n$ matrix is a kernel matrix!

Kernel theorem

Theorem: $K$ is a kernel function ⇔ $K$ is symmetric positive semidefinite.

A function is positive semidefinite iff for any finite dataset $\{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n\}$ the corresponding kernel matrix is positive semidefinite.
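The theorem can be sanity-checked empirically. A small sketch (the kernel and data are my assumptions, not from the slides):

```python
import numpy as np

# Empirical check of the kernel theorem on one dataset: the Gaussian
# kernel matrix should be symmetric positive semidefinite, i.e. all
# of its eigenvalues should be >= 0.

rng = np.random.default_rng(3)
X = rng.standard_normal((10, 4))

sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
K = np.exp(-0.5 * sq)                                # Gaussian kernel matrix

eigvals = np.linalg.eigvalsh(K)                      # eigvalsh: for symmetric matrices
print(eigvals.min() >= -1e-10)                       # True up to floating-point noise
```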

Kernel closure

New kernels can be built from old ones; each construction below corresponds to an operation on feature spaces (each is verified numerically in the sketch that follows):

Feature space concatenation: $K_1 + K_2$
Feature space scaling: $c \cdot K$ for $c \ge 0$
Feature space tensor product: $K_1 \cdot K_2$
Feature map composition: $K(\psi(x), \psi(y))$
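These closure rules are easy to confirm numerically on kernel matrices; a quick sketch (my illustration, with two kernels of my choosing):

```python
import numpy as np

# Sums, nonnegative scalings, and elementwise products of kernel
# matrices remain positive semidefinite, matching the closure rules.

rng = np.random.default_rng(6)
X = rng.standard_normal((8, 3))
K1 = X @ X.T                           # linear kernel matrix
K2 = (X @ X.T + 1) ** 2                # polynomial kernel matrix

for K in (K1 + K2, 3 * K1, K1 * K2):   # concatenation, scaling, tensor product
    print(np.linalg.eigvalsh(K).min() >= -1e-8)   # True, True, True
```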

Kernel normalization

Let $\phi'(x) = \dfrac{\phi(x)}{\|\phi(x)\|}$. Then

$K'(x, y) = \langle \phi'(x), \phi'(y) \rangle = \left\langle \frac{\phi(x)}{\|\phi(x)\|}, \frac{\phi(y)}{\|\phi(y)\|} \right\rangle = \frac{\langle \phi(x), \phi(y) \rangle}{\sqrt{\|\phi(x)\|^2\,\|\phi(y)\|^2}} = \frac{K(x, y)}{\sqrt{K(x, x)\,K(y, y)}}$

Kernel matrix normalization

$K'_{ij} := \dfrac{K_{ij}}{\sqrt{K_{ii} K_{jj}}}$

Kernel matrix centering

Center the data in feature space:

$\mathbf{x}_i \to \mathbf{x}_i - \frac{1}{n}\sum_k \mathbf{x}_k$

In matrix form:

$\mathbf{X} \to \mathbf{X} - \frac{1}{n}\mathbf{1}_n\mathbf{1}_n^T\mathbf{X}$

$\mathbf{X}\mathbf{X}^T \to \left(\mathbf{X} - \frac{1}{n}\mathbf{1}\mathbf{1}^T\mathbf{X}\right)\left(\mathbf{X} - \frac{1}{n}\mathbf{1}\mathbf{1}^T\mathbf{X}\right)^T = \mathbf{X}\mathbf{X}^T - \frac{1}{n}\mathbf{1}\mathbf{1}^T\mathbf{X}\mathbf{X}^T - \frac{1}{n}\mathbf{X}\mathbf{X}^T\mathbf{1}\mathbf{1}^T + \frac{1}{n^2}\mathbf{1}\mathbf{1}^T\mathbf{X}\mathbf{X}^T\mathbf{1}\mathbf{1}^T$

Hence

$\mathbf{K}_{\mathrm{cent}} := \mathbf{K} - \frac{1}{n}\mathbf{1}\mathbf{1}^T\mathbf{K} - \frac{1}{n}\mathbf{K}\mathbf{1}\mathbf{1}^T + \frac{1}{n^2}\mathbf{1}\mathbf{1}^T\mathbf{K}\mathbf{1}\mathbf{1}^T$
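The centering formula translates directly to code; the sketch below (my illustration) also cross-checks it against explicit centering for the linear kernel, where the feature space is the input space itself:

```python
import numpy as np

# K_cent equals the kernel matrix of the data after subtracting the
# feature-space mean, computed without ever touching the (possibly
# infinite-dimensional) features.

def center_kernel(K):
    n = K.shape[0]
    one = np.ones((n, n)) / n
    return K - one @ K - K @ one + one @ K @ one

# Sanity check with a linear kernel, where we *can* center explicitly:
X = np.random.default_rng(4).standard_normal((6, 3))
Xc = X - X.mean(axis=0)
print(np.allclose(center_kernel(X @ X.T), Xc @ Xc.T))  # True
```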

The Dual Representation

Let $A$ be the input space and let $B$ be the higher-dimensional feature space. Let $\phi: A \to B$ be the feature map. Fix a dataset $\{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n\} \subset A$.

Let $\mathbf{w} = \sum_i \alpha_i \phi(\mathbf{x}_i) \in B$.

We say that the $\alpha_i$ are the dual coordinates of $\mathbf{w}$.

Dual coordinates

$\mathbf{w} = \sum_i \alpha_i \phi(\mathbf{x}_i) = \phi(\mathbf{X})^T\boldsymbol{\alpha} = \boldsymbol{\Xi}^T\boldsymbol{\alpha}$

Note that $\boldsymbol{\Xi}\boldsymbol{\Xi}^T = \phi(\mathbf{X})\,\phi(\mathbf{X})^T = \mathbf{K}$.

Now we can do all of the useful stuff using dual coordinates only.

Dual coordinates

Let $\mathbf{w} = \boldsymbol{\Xi}^T\boldsymbol{\alpha}$ and $\mathbf{u} = \boldsymbol{\Xi}^T\boldsymbol{\beta}$. Then

$2\mathbf{w} = \boldsymbol{\Xi}^T(2\boldsymbol{\alpha})$

$\mathbf{w} + \mathbf{u} = \boldsymbol{\Xi}^T(\boldsymbol{\alpha} + \boldsymbol{\beta})$

$\langle \mathbf{w}, \mathbf{u} \rangle = \mathbf{w}^T\mathbf{u} = \boldsymbol{\alpha}^T\boldsymbol{\Xi}\boldsymbol{\Xi}^T\boldsymbol{\beta} = \boldsymbol{\alpha}^T\mathbf{K}\boldsymbol{\beta}$

$\|\mathbf{w} - \mathbf{u}\|^2 = \mathbf{w}^T\mathbf{w} + \mathbf{u}^T\mathbf{u} - 2\mathbf{w}^T\mathbf{u} = \boldsymbol{\alpha}^T\mathbf{K}\boldsymbol{\alpha} + \boldsymbol{\beta}^T\mathbf{K}\boldsymbol{\beta} - 2\boldsymbol{\alpha}^T\mathbf{K}\boldsymbol{\beta}$
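A quick numeric check of these identities with an explicit feature map (my illustration; the feature map is an arbitrary choice):

```python
import numpy as np

# With an explicit feature map phi, <w, u> computed as alpha^T K beta
# must match the direct feature-space inner product w^T u.

rng = np.random.default_rng(5)
X = rng.standard_normal((5, 3))
phi = lambda x: np.concatenate([x, x**2])   # some explicit feature map
Xi = np.array([phi(x) for x in X])          # Xi = phi(X), rows phi(x_i)
K = Xi @ Xi.T                               # kernel matrix

alpha, beta = rng.standard_normal(5), rng.standard_normal(5)
w, u = Xi.T @ alpha, Xi.T @ beta            # primal vectors

print(np.isclose(w @ u, alpha @ K @ beta))  # True
```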

Kernelization

Recall the Perceptron:

Initialize $\mathbf{w} := \mathbf{0}$ ⇔ $\boldsymbol{\alpha} := \mathbf{0}$

Find a misclassified example $(\mathbf{x}_i, y_i)$:

$\mathrm{sign}(\mathbf{w}^T\mathbf{x}_i + b) \ne y_i$ ⇔ $\mathrm{sign}\big(\sum_j \alpha_j \mathbf{x}_j^T\mathbf{x}_i + b\big) \ne y_i$ ⇔ $\mathrm{sign}(\mathbf{K}_i\boldsymbol{\alpha} + b) \ne y_i$

Update weights:

$\mathbf{w} := \mathbf{w} + \mu y_i \mathbf{x}_i$, $b := b + \mu y_i$ ⇔ $\alpha_i := \alpha_i + \mu y_i$, $b := b + \mu y_i$
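Putting it together, a compact kernel Perceptron might look as follows (a sketch: the kernel choice, learning rate, and epoch budget are my assumptions; the dual update rule is the slide's):

```python
import numpy as np

# Kernel Perceptron: keep dual coordinates alpha instead of w, and replace
# every inner product with a kernel evaluation.

def kernel_perceptron(X, y, kernel, mu=1.0, epochs=100):
    n = len(y)
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])  # kernel matrix
    alpha, b = np.zeros(n), 0.0
    for _ in range(epochs):
        updated = False
        for i in range(n):
            if np.sign(K[i] @ alpha + b) != y[i]:             # misclassified?
                alpha[i] += mu * y[i]                         # dual update
                b += mu * y[i]
                updated = True
        if not updated:                                       # converged
            break
    return alpha, b

# XOR-like data becomes separable with a polynomial kernel:
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([-1, 1, 1, -1])
k = lambda u, v: (u @ v + 1) ** 2
alpha, b = kernel_perceptron(X, y, kernel=k)
print(np.sign(np.array([[k(xj, z) for xj in X] for z in X]) @ alpha + b))  # [-1. 1. 1. -1.]
```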

Quiz

Today we heard three important ideas:

Important idea #1: __________
Important idea #2: __________
Important idea #3: __________

A function/matrix $K$ is a kernel function/matrix iff it is __________

Dual representation: ___ = ___ ___

These algorithms have kernelized versions: ___________________________ …
