
Machine Learning, Fall 2017

Professor Liang Huang

Kernels

(Kernels, Kernelized Perceptron and SVM)

(Chap. 12 of CIML)

Nonlinear Features

• Concatenated (combined) features
  • XOR: x = (x1, x2, x1x2)
  • income: add “degree + major”
• Perceptron
  • Map data into feature space: x → φ(x)
  • Solution lies in the span of the φ(xi)

[Figure: XOR data, x1: +1, x2: -1, x3: +1, x4: -1]

Quadratic Features

• Separating surfaces are circles, hyperbolae, parabolae

Kernels as Dot Products

Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page 40

Problem: Extracting features can sometimes be very costly. Example: second-order features in 1000 dimensions already give about 5 · 10⁵ numbers; for higher-order polynomial features it is much worse.

Solution: Don’t compute the features; try to compute the dot products implicitly. For some features this works . . .

Definition: A kernel function k : 𝒳 × 𝒳 → ℝ is a symmetric function of its arguments for which the following property holds:

  k(x, x') = ⟨φ(x), φ(x')⟩ for some feature map φ.

If k(x, x') is much cheaper to compute than φ(x) . . .


Quadratic Kernel

Polynomial Features

Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page 39

Quadratic Features in ℝ²

φ(x) := (x1², √2·x1x2, x2²)

Dot Product:
  ⟨φ(x), φ(x')⟩ = ⟨(x1², √2·x1x2, x2²), (x1'², √2·x1'x2', x2'²)⟩ = ⟨x, x'⟩² = k(x, x')

Insight: the trick works for any polynomial of order d via ⟨x, x'⟩ᵈ.

[Figure: XOR data, x1: +1, x2: -1, x3: +1, x4: -1]

For x ∈ ℝⁿ and the quadratic φ: naive φ(x) is O(n²) and φ(x)·φ(x') is O(n²); the kernel k(x, x') is O(n).
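As a quick numeric check (not from the slides) that the explicit quadratic map and the kernel agree:

```python
# Check that phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2) satisfies <phi(x), phi(x')> = (x . x')^2.
import numpy as np

def phi(x):
    """Explicit quadratic feature map for x in R^2."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def k_quad(x, xp):
    """Quadratic kernel: O(n) instead of O(n^2) explicit features."""
    return np.dot(x, xp) ** 2

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(np.dot(phi(x), phi(xp)))  # 1.0
print(k_quad(x, xp))            # 1.0
```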

The Perceptron on features

• Nothing happens if classified correctly
• Weight vector is a linear combination of the φ(xi)
• Classifier is (implicitly) a linear combination of inner products

Perceptron on Features

Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page 37

argument: X := {x1, . . . , xm} ⊂ 𝒳 (data), Y := {y1, . . . , ym} ⊂ {±1} (labels)
function (w, b) = Perceptron(X, Y, η)
  initialize w = 0, b = 0
  repeat
    Pick (xi, yi) from data
    if yi (w · φ(xi) + b) ≤ 0 then
      w ← w + yi φ(xi)
      b ← b + yi
  until yi (w · φ(xi) + b) > 0 for all i
end

Important detail: w = Σj yj φ(xj), and hence f(x) = Σj yj (φ(xj) · φ(x)) + b.

In terms of the mistake set I:  w = Σ_{i∈I} αi φ(xi)  and  f(x) = Σ_{i∈I} αi ⟨φ(xi), φ(x)⟩

Kernelized Perceptron

• Instead of updating w, now update the αi
• Weight vector is a linear combination of the φ(xi)
• Classifier is a linear combination of inner products

Kernel Perceptron

Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page 42

argument: X := {x1, . . . , xm} ⊂ 𝒳 (data), Y := {y1, . . . , ym} ⊂ {±1} (labels)
function f = Perceptron(X, Y, η)
  initialize f = 0
  repeat
    Pick (xi, yi) from data
    if yi f(xi) ≤ 0 then
      f(·) ← f(·) + yi k(xi, ·) + yi
  until yi f(xi) > 0 for all i
end

Important detail: w = Σj yj φ(xj), and hence f(x) = Σj yj k(xj, x) + b.

In dual form:  w = Σ_{i∈I} αi φ(xi)  and  f(x) = Σ_{i∈I} αi ⟨φ(xi), φ(x)⟩ = Σ_{i∈I} αi k(xi, x)
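Below is a minimal runnable sketch of this dual perceptron (function and variable names are my own; the bias update mirrors the trailing +yi in the pseudocode):

```python
# A minimal sketch of the kernelized perceptron above.
# alpha[i] plays the role of the vote that example i accumulates.
import numpy as np

def kernel_perceptron(X, Y, kernel, max_epochs=100):
    """X: (m, n) data, Y: (m,) labels in {+1, -1}, kernel: k(x, x')."""
    m = len(X)
    alpha = np.zeros(m)
    b = 0.0
    f = lambda x: sum(alpha[j] * kernel(X[j], x) for j in range(m)) + b
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(m):
            if Y[i] * f(X[i]) <= 0:      # misclassified (or on the boundary)
                alpha[i] += Y[i]         # dual update: alpha_i <- alpha_i + y_i
                b += Y[i]                # bias update, as in the primal version
                mistakes += 1
        if mistakes == 0:                # "until y_i f(x_i) > 0 for all i"
            break
    return alpha, b
```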

Functional Form

αi ← αi + yi   (increase its vote by 1)

Kernelized Perceptron

• Nothing happens if classified correctly
• Weight vector is a linear combination of the φ(xi)
• Classifier is a linear combination of inner products

Primal Form (update weights):
  update:   w ← w + yi φ(xi)
  classify: f(x) = w · φ(x)

Dual Form (update linear coefficients):
  update:   αi ← αi + yi
  implicitly equivalent to: w = Σ_{i∈I} αi φ(xi)
  classify: f(x) = Σ_{i∈I} αi ⟨φ(xi), φ(x)⟩ = Σ_{i∈I} αi k(xi, x)

Kernelized Perceptron: Dual vs. Primal

Primal Form (update weights):
  update:   w ← w + yi φ(xi)
  classify: f(x) = w · φ(x)

Dual Form (update linear coefficients):
  update:   αi ← αi + yi
  implicitly equivalent to: w = Σ_{i∈I} αi φ(xi)
  classify: f(x) = w · φ(x) = [Σ_{i∈I} αi φ(xi)] · φ(x)
                 = Σ_{i∈I} αi ⟨φ(xi), φ(x)⟩   (slow: O(d²) with explicit quadratic features)
                 = Σ_{i∈I} αi k(xi, x)        (fast: O(d) per kernel evaluation)

Kernelized Perceptron

initialize αi = 0 for all i
repeat
  Pick (xi, yi) from data
  if yi f(xi) ≤ 0 then
    αi ← αi + yi
until yi f(xi) > 0 for all i

Dual Form (update linear coefficients):
  update:   αi ← αi + yi
  implicitly: w = Σ_{i∈I} αi φ(xi)
  classify: f(x) = w · φ(x) = [Σ_{i∈I} αi φ(xi)] · φ(x)
                 = Σ_{i∈I} αi ⟨φ(xi), φ(x)⟩   (slow: O(d²))
                 = Σ_{i∈I} αi k(xi, x)        (fast: O(d))

If #features >> #examples, the dual is easier; otherwise the primal is easier.

Kernelized Perceptron: Dual vs. Primal

Primal Perceptron (update weights):
  update:   w ← w + yi φ(xi)
  classify: f(x) = w · φ(x)

Dual Perceptron (update linear coefficients):
  update:   αi ← αi + yi
  implicitly: w = Σ_{i∈I} αi φ(xi)

If #features >> #examples, the dual is easier; otherwise the primal is easier.

Q: when is #features >> #examples?
A: higher-order polynomial kernels or exponential kernels (inf. dim.)


Pros/Cons of Kernels in the Dual

• pros:
  • no need to compute φ(x) (time)
  • no need to store φ(x) and w (memory)
• cons:
  • classifying a test example sums over all misclassified training examples
  • need to store all misclassified training examples (memory)
  • this set is called the “support vector set”
  • SVM will minimize this set!


Kernelized Perceptron: Dual vs. Primal, worked example

Data (linear kernel, i.e. identity feature map):
  x1 = (0, 1):  -1
  x2 = (2, 1):  +1
  x3 = (0, -1): +1

Primal Perceptron (update w on each mistake):
  mistake on x1 (-1): w = (0, -1)
  mistake on x2 (+1): w = (2, 0)
  mistake on x3 (+1): w = (2, -1)

Dual Perceptron (update α; w implicit):
  mistake on x1 (-1): α = (-1, 0, 0)   w = -x1
  mistake on x2 (+1): α = (-1, 1, 0)   w = -x1 + x2
  mistake on x3 (+1): α = (-1, 1, 1)   w = -x1 + x2 + x3

Final implicit w = (2, -1), the same as the primal.

Geometric interpretation of dual classification: the sum of dot-products with x2 & x3 is bigger than the dot-product with x1 (agreement with the positives > agreement with the negative).
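A quick check of this worked example (not in the slides): summing αi·xi with the final votes recovers the primal weight vector.

```python
# Re-running the worked example with a linear kernel: one mistake on each point
# gives alpha = (-1, 1, 1) and an implicit w = -x1 + x2 + x3 = (2, -1).
import numpy as np

X = np.array([[0.0, 1.0], [2.0, 1.0], [0.0, -1.0]])
Y = np.array([-1.0, 1.0, 1.0])

alpha = np.array([-1.0, 1.0, 1.0])                 # votes after the three mistakes
w_implicit = sum(a * x for a, x in zip(alpha, X))  # sum_i alpha_i * x_i
print(w_implicit)                                  # [ 2. -1.]
```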

XOR Example: Dual Perceptron

Data: x1: +1, x2: -1, x3: +1, x4: -1 (XOR configuration)

Updates (α; w implicit):
  mistake on x1 (+1): α = (+1, 0, 0, 0)    w = φ(x1)
  mistake on x2 (-1): α = (+1, -1, 0, 0)   w = φ(x1) - φ(x2)

Classification rule in the dual, geometrically:
  (x · x1)² > (x · x2)²  ⟹  cos²θ1 > cos²θ2  ⟹  |cos θ1| > |cos θ2|

In the dual, algebraically (writing x = (x1, x2) for the coordinates of the test point):
  (x · x1)² > (x · x2)²  ⟹  (x1 + x2)² > (x1 - x2)²  ⟹  x1x2 > 0

Also verify in the primal:
  k(x, x') = (x · x')²,  φ(x) = (x1², x2², √2·x1x2),  w = (0, 0, 2√2)
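A numeric check of the XOR example; the exact coordinates are not shown in this extract, so the standard ±1 corners are assumed (x1 = (1,1), x2 = (1,-1), x3 = (-1,-1), x4 = (-1,1)).

```python
# Checking the XOR example (coordinates assumed: the standard +/-1 corners).
# With alpha = (+1, -1, 0, 0) and k(x, x') = (x . x')^2, the sign of
# f(x) = k(x1, x) - k(x2, x) = 4 * x[0] * x[1] matches the XOR labels.
import numpy as np

pts    = np.array([[1, 1], [1, -1], [-1, -1], [-1, 1]], dtype=float)  # x1..x4
labels = np.array([+1, -1, +1, -1])
alpha  = np.array([+1, -1, 0, 0], dtype=float)

k = lambda a, b: np.dot(a, b) ** 2
f = lambda x: sum(a * k(p, x) for a, p in zip(alpha, pts))
print([int(np.sign(f(x))) for x in pts])  # [1, -1, 1, -1] -- matches the labels
```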

Circle Example?? Dual Perceptron

Updates (α; w implicit):
  mistake on x1 (+1): α = (+1, 0, 0, 0)    w = φ(x1)
  mistake on x2 (-1): α = (+1, -1, 0, 0)   w = φ(x1) - φ(x2)

k(x, x') = (x · x')²,  φ(x) = (x1², x2², √2·x1x2)

Polynomial Kernels in ℝⁿ

Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page 41

Idea: We want to extend k(x, x') = ⟨x, x'⟩² to k(x, x') = (⟨x, x'⟩ + c)ᵈ where c > 0 and d ∈ ℕ.
Prove that such a kernel corresponds to a dot product.

Proof strategy: Simple and straightforward: compute the explicit sum given by the kernel, i.e.

  k(x, x') = (⟨x, x'⟩ + c)ᵈ = Σ_{i=0}^{d} (d choose i) ⟨x, x'⟩ⁱ c^{d-i}

The individual terms ⟨x, x'⟩ⁱ are dot products for some φ_i(x); the +c is just augmenting the space.

Simpler proof: augment each x with an extra coordinate x0 = √c, so that ⟨(x, √c), (x', √c)⟩ = ⟨x, x'⟩ + c.
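A quick numeric check of the simpler proof (my own illustration): appending x0 = √c turns the inhomogeneous kernel into a plain degree-d polynomial kernel on the augmented vectors.

```python
# Appending x0 = sqrt(c) to each vector makes (<x, x'> + c)^d equal to the
# homogeneous degree-d kernel on the augmented inputs.
import numpy as np

c, d = 2.0, 3
x, xp = np.array([1.0, -2.0, 0.5]), np.array([0.3, 1.0, 4.0])

aug = lambda v: np.append(v, np.sqrt(c))
print((np.dot(x, xp) + c) ** d)       # inhomogeneous kernel
print(np.dot(aug(x), aug(xp)) ** d)   # homogeneous kernel on augmented vectors
```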

Circle Example: Dual Perceptron

Data (x augmented):
  x (augmented)    y
  (2, 0, 1)        +1
  (-1, 2, 1)       +1
  (0, -1.5, 1)     -1

Updates (α; w implicit):
  x1: +1  α = (+1, 0, 0, 0, 0)    w = φ(x1)
  x2: -1  α = (+1, -1, 0, 0, 0)   w = φ(x1) - φ(x2)
  x3: -1  α = (+1, -1, -1, 0, 0)

k(x, x') = (x · x')²,      φ(x) = (x1², x2², √2·x1x2)
k(x, x') = (x · x' + 1)²,  φ(x) = ?

[Figure: circle example with points x1 . . . x5]
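One explicit feature map that answers the “φ(x) = ?” question for x ∈ ℝ² can be read off by expanding the square; the particular ordering and scaling below is my own choice, verified numerically.

```python
# One explicit feature map for k(x, x') = (x . x' + 1)^2 with x in R^2:
# phi(x) = (x1^2, x2^2, sqrt(2)*x1*x2, sqrt(2)*x1, sqrt(2)*x2, 1)
import numpy as np

def phi_inhomog(x):
    x1, x2 = x
    s = np.sqrt(2)
    return np.array([x1**2, x2**2, s * x1 * x2, s * x1, s * x2, 1.0])

x, xp = np.array([2.0, 0.0]), np.array([-1.0, 2.0])
print(np.dot(phi_inhomog(x), phi_inhomog(xp)))   # 1.0
print((np.dot(x, xp) + 1.0) ** 2)                # 1.0
```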

Examples: Some Good Kernels

Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page 48

Examples of kernels k(x, x'):

  Linear             ⟨x, x'⟩
  Laplacian RBF      exp(-λ‖x - x'‖)
  Gaussian RBF       exp(-λ‖x - x'‖²)
  Polynomial         (⟨x, x'⟩ + c)ᵈ,  c ≥ 0, d ∈ ℕ
  B-Spline           B_{2n+1}(x - x')
  Cond. Expectation  E_c[p(x|c) p(x'|c)]

Simple trick for checking Mercer's condition: compute the Fourier transform of the kernel and check that it is nonnegative.

You only need to know the polynomial and Gaussian kernels. (The RBF kernels distort distance; the polynomial kernels distort angle.)
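A small illustration (not from the slides) of the two kernels you need to know, evaluated directly; gamma here plays the role of λ in the table above.

```python
# Polynomial and Gaussian RBF kernels evaluated on a pair of vectors.
import numpy as np

def poly_kernel(x, xp, c=1.0, d=2):
    return (np.dot(x, xp) + c) ** d

def gaussian_rbf(x, xp, gamma=0.5):
    return np.exp(-gamma * np.linalg.norm(x - xp) ** 2)

x, xp = np.array([1.0, 2.0]), np.array([2.0, 0.0])
print(poly_kernel(x, xp))   # (2 + 1)^2 = 9.0
print(gaussian_rbf(x, xp))  # exp(-0.5 * 5) ~= 0.0821
```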

Kernel Summary

• For a feature map φ, find a magic function k, s.t.:
  • the dot-product φ(x)·φ(x') = k(x, x')
  • this k(x, x') should be much faster to compute than φ(x)
  • k(x, x') should be computable in O(n) if x ∈ ℝⁿ
  • φ(x) is much slower: O(nᵈ) for a polynomial of degree d, and more for Gaussian
• But for any such function k, is there a φ s.t. φ(x)·φ(x') = k(x, x')?


Mercer’s Theorem

Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page 44

The Theorem: For any symmetric function k : 𝒳 × 𝒳 → ℝ which is square integrable in 𝒳 × 𝒳 and which satisfies

  ∫_{𝒳×𝒳} k(x, x') f(x) f(x') dx dx' ≥ 0   for all f ∈ L₂(𝒳),

there exist functions φi : 𝒳 → ℝ and numbers λi ≥ 0 where

  k(x, x') = Σi λi φi(x) φi(x')   for all x, x' ∈ 𝒳.

Interpretation: The double integral is the continuous version of a vector-matrix-vector multiplication. For positive semidefinite matrices we have

  Σi Σj k(xi, xj) αi αj ≥ 0.
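A finite-sample illustration of Mercer's condition (my own): the kernel matrix of a valid kernel is positive semidefinite, so its eigenvalues are nonnegative up to numerical error.

```python
# Build a Gaussian RBF kernel matrix on random data and check that its
# smallest eigenvalue is (numerically) nonnegative.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))

gamma = 0.5
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-gamma * sq_dists)       # Gaussian RBF kernel matrix

print(np.linalg.eigvalsh(K).min())  # >= -1e-10, i.e. PSD in practice
```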


Properties of the Kernel

Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page 45

Distance in Feature Space: Distance between points in feature space via

  d(x, x')² := ‖φ(x) - φ(x')‖²
             = ⟨φ(x), φ(x)⟩ - 2⟨φ(x), φ(x')⟩ + ⟨φ(x'), φ(x')⟩
             = k(x, x) + k(x', x') - 2 k(x, x')

Kernel Matrix: To compare observations we compute dot products, so we study the matrix K given by

  K_ij = ⟨φ(xi), φ(xj)⟩ = k(xi, xj)

where the xi are the training patterns.

Similarity Measure: The entries K_ij tell us the overlap between φ(xi) and φ(xj), so k(xi, xj) is a similarity measure.
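The distance identity above can be used without ever computing φ; a minimal sketch:

```python
# Feature-space distance computed purely from kernel evaluations:
# d(x, x')^2 = k(x, x) + k(x', x') - 2 k(x, x').
import numpy as np

def feature_distance(x, xp, k):
    return np.sqrt(k(x, x) + k(xp, xp) - 2.0 * k(x, xp))

quad = lambda a, b: np.dot(a, b) ** 2
x, xp = np.array([1.0, 0.0]), np.array([0.0, 1.0])
# Explicit check: phi(x) = (1, 0, 0) and phi(xp) = (0, 0, 1) under the
# quadratic map, so the feature-space distance is sqrt(2).
print(feature_distance(x, xp, quad))  # 1.414...
```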

Kernelized Pegasos for SVM

For HW2, you don't need to randomly choose training examples. Just go over all training examples in the original order, and call that an epoch (same as HW1).
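The pseudocode itself is not reproduced in this extract, so the following is only a sketch of the standard kernelized Pegasos update (after Shalev-Shwartz et al.), adapted to the ordered-sweep convention above; the regularization constant lam, the epoch count, and all names are assumptions.

```python
# A minimal sketch of kernelized Pegasos, sweeping examples in their
# original order (one sweep = one epoch, per the HW2 note above).
import numpy as np

def kernelized_pegasos(X, Y, kernel, lam=0.1, epochs=10):
    m = len(X)
    alpha = np.zeros(m)   # alpha[j] counts how often example j triggered an update
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])  # precomputed kernel matrix
    t = 0
    for _ in range(epochs):
        for i in range(m):            # one ordered sweep over the data
            t += 1
            margin = Y[i] * (1.0 / (lam * t)) * np.dot(alpha * Y, K[:, i])
            if margin < 1:
                alpha[i] += 1
    # Decision function afterwards: f(x) = (1/(lam*t)) * sum_j alpha[j]*Y[j]*kernel(X[j], x)
    return alpha
```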

[Figure: Gaussian RBF SVM decision function, σ = 1.0, C = ∞; contours f(x) = +1, 0, -1]

  f(x) = Σ_{i=1}^{N} αi yi exp(-‖x - xi‖²/2σ²) + b

Gaussian RBF kernel (default in sklearn)

[Figure panels: the same decision function f(x) = Σ_{i=1}^{N} αi yi exp(-‖x - xi‖²/2σ²) + b with varying σ and C]

  σ = 1.0,  C = ∞
  σ = 1.0,  C = 100
  σ = 1.0,  C = 10     (decreasing C gives a wider, soft margin)
  σ = 0.25, C = ∞
  σ = 0.1,  C = ∞      (decreasing σ moves towards a nearest-neighbour classifier)
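A hedged sklearn sketch (synthetic data, not the figure's) that reproduces these qualitative effects; note sklearn's gamma corresponds to 1/(2σ²), and a very large C stands in for C = ∞.

```python
# Effect of sigma and C on an RBF-kernel SVM in sklearn.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1, -1)   # circular boundary

for sigma, C in [(1.0, 1e6), (1.0, 10.0), (0.25, 1e6)]:  # C = 1e6 approximates C = infinity
    clf = SVC(kernel='rbf', C=C, gamma=1.0 / (2 * sigma ** 2)).fit(X, y)
    print(sigma, C, clf.n_support_.sum(), clf.score(X, y))
```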

Polynomial Kernels

This is in contrast with C: smaller C ⇒ wider margin (underfitting); larger C ⇒ narrower margin (overfitting).

Overfitting vs. Underfitting

From SVM to Nearest Neighbor

• for each test example x, decide its label by the training example closest to x
• decision boundary highly non-linear (Voronoi)
• k-nearest neighbor (k-NN): smoother boundaries

[Figure: k-NN decision boundaries on training data (left) and testing data (right)]

  K = 1:  training error = 0.0,     testing error = 0.15
  K = 3:  training error = 0.0760,  testing error = 0.1340
  K = 7:  training error = 0.1320,  testing error = 0.1110
  K = 21: training error = 0.1120,  testing error = 0.0920


small k: overfitting

large k: underfitting

what about k=N?
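A small sklearn sketch (synthetic data, my own) illustrating the trend, including the k = N case asked about above, which simply predicts the majority class everywhere.

```python
# Train/test error of k-NN as k grows; k = len(X_train) votes over the whole set.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = np.where(X[:, 0] + 0.3 * rng.normal(size=300) > 0, 1, -1)   # noisy linear labels
X_train, y_train, X_test, y_test = X[:200], y[:200], X[200:], y[200:]

for k in (1, 3, 7, 21, len(X_train)):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(k, 1 - knn.score(X_train, y_train), 1 - knn.score(X_test, y_test))
```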

SVM vs. Nearest Neighbor

  support vectors: few (SVM) vs. all (nearest neighbor)

[Figure: comparison of decision boundaries, with points labeled a–f]
