Machine Learning: Lecture 9
Support Vector Machines

Chenhao Tan
University of Colorado Boulder

Slides adapted from Jordan Boyd-Graber

Recap

• Supervised learning
• Previously: KNN, naïve Bayes, logistic regression, and neural networks
• This time: SVMs
  ◦ Motivated by geometric properties
  ◦ (Another) example of a linear classifier
  ◦ Good theoretical guarantees

Take a step back

• Problem
• Model
• Algorithm/optimization
• Evaluation

https://blog.twitter.com/official/en_us/topics/product/2017/Giving-you-more-characters-to-express-yourself.html

Overview

Intuition
Linear Classifiers
Support Vector Machines
Formulation
Theoretical Guarantees
Recap

Intuition

Thinking Geometrically

• Suppose you have two classes: vacations and sports
• Suppose you have four documents

Sports
  Doc1: {ball, ball, ball, travel}
  Doc2: {ball, ball}

Vacations
  Doc3: {travel, ball, travel}
  Doc4: {travel}

• What does this look like in vector space?

Put the documents in vector space

Sports
  Doc1: {ball, ball, ball, travel}
  Doc2: {ball, ball}

Vacations
  Doc3: {travel, ball, travel}
  Doc4: {travel}

[Figure: the four documents plotted in a 2D vector space, with the "Ball" count on one axis and the "Travel" count on the other]

Vector space representation of documents

• Each document is a vector, one component for each term.
• Terms are axes.
• High dimensionality: 10,000s of dimensions and more.
• How can we do classification in this space? (A small sketch of the representation follows.)
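To make the representation concrete, here is a minimal sketch in Python/NumPy (the vocabulary, variable names, and helper are mine, chosen for illustration) that turns the four toy documents into term-count vectors:

    import numpy as np

    # Toy vocabulary: component 0 counts "ball", component 1 counts "travel"
    vocab = {"ball": 0, "travel": 1}

    docs = [
        ["ball", "ball", "ball", "travel"],  # Doc1 (sports)
        ["ball", "ball"],                    # Doc2 (sports)
        ["travel", "ball", "travel"],        # Doc3 (vacations)
        ["travel"],                          # Doc4 (vacations)
    ]

    def to_vector(doc):
        """Map a document to a term-count vector, one component per term."""
        v = np.zeros(len(vocab))
        for token in doc:
            v[vocab[token]] += 1
        return v

    X = np.array([to_vector(d) for d in docs])
    print(X)  # rows are documents; columns are the "ball" and "travel" axes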

Vector space classification

• As before, the training set is a set of documents, each labeled with its class.
• In vector space classification, this set corresponds to a labeled set of points or vectors in the vector space.
• Premise 1: Documents in the same class form a contiguous region.
• Premise 2: Documents from different classes don't overlap.
• We define lines, surfaces, hypersurfaces to divide regions.

Classes in the vector space

[Figure: documents from three classes (China, Kenya, UK) plotted in the vector space, with an unlabeled document marked "?"]

• Should the document ? be assigned to China, UK, or Kenya?
• Find separators between the classes.
• Based on these separators: ? should be assigned to China.
• How do we find separators that do a good job at classifying new documents like ? That is the main topic of today.

Linear Classifiers

Linear classifiers

• Definition:
  ◦ A linear classifier computes a linear combination or weighted sum ∑_j β_j x_j of the feature values.
  ◦ Classification decision: ∑_j β_j x_j > β_0? (β_0 is our bias)
  ◦ ...where β_0 (the threshold) is a parameter.
• We call this the separator or decision boundary.
• We find the separator based on the training set.
• Methods for finding the separator: logistic regression, naïve Bayes, linear SVM
• Assumption: The classes are linearly separable.
• Before, we just talked about equations. What's the geometric intuition?
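As a minimal sketch of this decision rule (the function name and the example weights are illustrative assumptions, not from the slides):

    import numpy as np

    def linear_classify(x, beta, beta0):
        """Decide class c if the weighted sum of features exceeds the threshold beta0."""
        return np.dot(beta, x) > beta0

    # With the toy "ball"/"travel" vectors from earlier, weighting "ball" positively
    # and "travel" negatively separates sports from vacations:
    beta, beta0 = np.array([1.0, -1.0]), 0.0
    print(linear_classify(np.array([3.0, 1.0]), beta, beta0))  # Doc1 -> True  (sports)
    print(linear_classify(np.array([1.0, 2.0]), beta, beta0))  # Doc3 -> False (vacations)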

A linear classifier in 1D

• A linear classifier in 1D is a point x described by the equation β_1 x_1 = β_0
• x = β_0 / β_1
• Points (x_1) with β_1 x_1 ≥ β_0 are in the class c.
• Points (x_1) with β_1 x_1 < β_0 are in the complement class c̄.

A linear classifier in 2D

• A linear classifier in 2D is a line described by the equation β_1 x_1 + β_2 x_2 = β_0
• Example for a 2D linear classifier
• Points (x_1, x_2) with β_1 x_1 + β_2 x_2 ≥ β_0 are in the class c.
• Points (x_1, x_2) with β_1 x_1 + β_2 x_2 < β_0 are in the complement class c̄.

A linear classifier in 3D

• A linear classifier in 3D is a plane described by the equation β_1 x_1 + β_2 x_2 + β_3 x_3 = β_0
• Example for a 3D linear classifier
• Points (x_1, x_2, x_3) with β_1 x_1 + β_2 x_2 + β_3 x_3 ≥ β_0 are in the class c.
• Points (x_1, x_2, x_3) with β_1 x_1 + β_2 x_2 + β_3 x_3 < β_0 are in the complement class c̄.

Naïve Bayes and logistic regression as linear classifiers

Takeaway: Naïve Bayes, logistic regression, and SVMs are all linear methods. They choose their hyperplanes based on different objectives: joint likelihood (NB), conditional likelihood (LR), and the margin (SVM).

Which hyperplane?

• For linearly separable training sets: there are infinitely many separating hyperplanes.
• They all separate the training set perfectly...
• ...but they behave differently on test data.
• Error rates on new data are low for some, high for others.
• How do we find a low-error separator?

Support Vector Machines

Support vector machines

• Machine-learning research in the last two decades has improved classifier effectiveness.
• New generation of state-of-the-art classifiers: support vector machines (SVMs), boosted decision trees, regularized logistic regression, neural networks, and random forests
• Applications to IR problems, particularly text classification

SVMs: A kind of large-margin classifier
A vector-space-based machine learning method that aims to find a decision boundary between two classes that is maximally far from any point in the training data (possibly discounting some points as outliers or noise).

Support Vector Machines

[Figure: 2-class training data with the maximum-margin decision hyperplane; the margin is maximized, and the support vectors lie on its boundary]

• 2-class training data
• Decision boundary → linear separator
• Criterion: being maximally far away from any data point → determines the classifier margin
• Linear separator position defined by support vectors

Why maximize the margin?

• Points near the decision surface → uncertain classification decisions (50% either way)
• A classifier with a large margin makes no low-certainty classification decisions
• Gives a classification safety margin with respect to slight errors in measurement or document variation

Why maximize the margin?

• SVM classifier: large margin around decision boundary
• Compare to decision hyperplane: place a fat separator between classes
  ◦ Unique solution
• Decreased memory capacity
• Increased ability to correctly generalize to test data

Formulation

Equation

• Equation of a hyperplane:

  w · x + b = 0   (1)

• Distance of a point to the hyperplane:

  |w · x + b| / ||w||   (2)

• The margin ρ is given by

  ρ ≡ min_{(x,y)∈S} |w · x + b| / ||w|| ≡ 1 / ||w||   (3)

This is because for any point on the marginal hyperplane, w · x + b = ±1, and we would like to maximize the margin ρ.
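Equations (2) and (3) translate directly into code; a small sketch (the helper names are mine):

    import numpy as np

    def distance_to_hyperplane(x, w, b):
        """Equation (2): |w . x + b| / ||w||."""
        return abs(np.dot(w, x) + b) / np.linalg.norm(w)

    def margin(X, w, b):
        """Equation (3): the minimum distance over all training points."""
        return min(distance_to_hyperplane(x, w, b) for x in X)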

Optimization Problem

We want to find a weight vector w and bias b that optimize

  min_{w,b} (1/2) ||w||²   (4)

subject to y_i (w · x_i + b) ≥ 1, ∀ i ∈ [1, m].

Next week: algorithm
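Problem (4) is a quadratic program, so off-the-shelf solvers or SVM libraries can handle it even before we see the dedicated algorithm. Purely as an illustrative sketch (the library choice is mine, not the course's), scikit-learn's SVC with a linear kernel and a very large C approximates this hard-margin problem on two toy points that reappear in the walkthrough below:

    import numpy as np
    from sklearn.svm import SVC  # assumes scikit-learn is available

    # Two linearly separable toy points, one per class
    X = np.array([[1.0, 1.0], [2.0, 3.0]])
    y = np.array([-1, +1])

    # A very large C makes the soft-margin solver approximate hard-margin problem (4)
    clf = SVC(kernel="linear", C=1e6).fit(X, y)
    print(clf.coef_, clf.intercept_)  # close to w = (2/5, 4/5), b = -11/5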

Find the maximum margin hyperplane

[Figure: toy 2D training set plotted on a 3×3 grid]

Which are the support vectors?

Walkthrough example: building an SVM over the data shown

Working geometrically:
• If you got 0 = 0.5x + y − 2.75, close!
• Set up a system of equations:

  w_1 + w_2 + b = −1        (5)
  (3/2) w_1 + 2 w_2 + b = 0   (6)
  2 w_1 + 3 w_2 + b = +1     (7)

The SVM decision boundary is:

  0 = (2/5) x + (4/5) y − 11/5

[Figure: the toy 2D training set with the maximum-margin separator drawn]
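The geometric reasoning can be checked mechanically. Since w is perpendicular to the separator, and here there is one support vector per class, w is parallel to the vector between the two support vectors, (2, 3) − (1, 1) = (1, 2). A quick sketch:

    import numpy as np

    # Substituting w = a * (1, 2) into the canonical constraints
    #   w . (1, 1) + b = -1   and   w . (2, 3) + b = +1
    # gives the 2x2 system  3a + b = -1,  8a + b = +1.
    A = np.array([[3.0, 1.0],
                  [8.0, 1.0]])
    a, b = np.linalg.solve(A, np.array([-1.0, 1.0]))
    w = a * np.array([1.0, 2.0])
    print(w, b)  # [0.4 0.8] -2.2, i.e. 0 = (2/5) x + (4/5) y - 11/5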

Canonical Form

The walkthrough solution w_1 x_1 + w_2 x_2 + b in canonical form:

  0.4 x_1 + 0.8 x_2 − 2.2

• 0.4 · 1 + 0.8 · 1 − 2.2 = −1
• 0.4 · (3/2) + 0.8 · 2 − 2.2 = 0
• 0.4 · 2 + 0.8 · 3 − 2.2 = +1

What's the margin?

• Distance to the closest point:

  √((3/2 − 1)² + (2 − 1)²) = √5 / 2   (8)

• Weight vector:

  1 / ||w|| = 1 / √((2/5)² + (4/5)²) = 1 / √(20/25) = 5 / (√5 · √4) = √5 / 2   (9)
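A quick numeric check of (8) and (9), assuming the boundary from the walkthrough:

    import numpy as np

    w = np.array([0.4, 0.8])   # (2/5, 4/5)
    b = -2.2                   # -11/5

    # (8): distance from the closest point (1, 1) to the hyperplane
    dist = abs(np.dot(w, np.array([1.0, 1.0])) + b) / np.linalg.norm(w)

    # (9): margin from the weight vector
    rho = 1.0 / np.linalg.norm(w)

    print(dist, rho, np.sqrt(5) / 2)  # all three agree: ~1.118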

Theoretical Guarantees

Three Proofs that Suggest SVMs will Work

• Leave-one-out error
• Margin analysis
• VC dimension (omitted for now)

Leave One Out Error (sketch)

Leave-one-out error is the error from using one point as your test set, averaged over all such points:

  R_LOO = (1/m) ∑_{i=1}^{m} 1[h_{S−{x_i}}(x_i) ≠ y_i]   (10)

This serves as an unbiased estimate of the generalization error for samples of size m − 1.

Leave-one-out error is bounded by the expected fraction of support vectors:

  E_{S∼D^{m−1}}[R(h_S)] ≤ E_{S∼D^m}[N_SV(S) / m]   (11)

Consider the held-out error for x_i:
• If x_i was not a support vector, the answer doesn't change.
• If x_i was a support vector, it could change the answer; this is when we can have an error.
There are N_SV support vectors and thus N_SV possible errors.
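A sketch of the leave-one-out estimate (10) next to the support-vector fraction from (11); the synthetic data and scikit-learn usage are my choices for illustration:

    import numpy as np
    from sklearn.svm import SVC                     # assumes scikit-learn
    from sklearn.model_selection import LeaveOneOut

    # Two well-separated synthetic clusters (linearly separable with high probability)
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(-2, 1, (20, 2)), rng.normal(2, 1, (20, 2))])
    y = np.array([-1] * 20 + [+1] * 20)

    # Equation (10): retrain with each point held out and average the errors
    errors = 0
    for train, test in LeaveOneOut().split(X):
        h = SVC(kernel="linear", C=1e6).fit(X[train], y[train])
        errors += int(h.predict(X[test])[0] != y[test][0])
    r_loo = errors / len(X)

    # Equation (11): compare against the fraction of support vectors of the full sample
    n_sv = len(SVC(kernel="linear", C=1e6).fit(X, y).support_)
    print(r_loo, "<=", n_sv / len(X))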

Pictorial proof

[Figure: the earlier 2-class training data with the maximum-margin separator; only the support vectors pin down the boundary, so holding out any other point leaves it unchanged]

Margin Theory

Margin Loss Function

To see where SVMs really shine, consider the margin loss Φ_ρ:

  Φ_ρ(x) = 0          if ρ ≤ x
           1 − x/ρ    if 0 ≤ x ≤ ρ
           1          if x ≤ 0     (12)

The empirical margin loss of a hypothesis h is

  R_ρ(h) = (1/m) ∑_{i=1}^{m} Φ_ρ(y_i h(x_i)) ≤ (1/m) ∑_{i=1}^{m} 1[y_i h(x_i) ≤ ρ]   (13)

This is the fraction of the points in the training sample S that have been misclassified or classified with confidence less than ρ.
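A direct transcription of (12) and (13) as a sketch (the function names are mine):

    import numpy as np

    def margin_loss(x, rho):
        """Phi_rho from (12): 0 if x >= rho, 1 - x/rho if 0 <= x <= rho, 1 if x <= 0."""
        if x >= rho:
            return 0.0
        if x <= 0:
            return 1.0
        return 1.0 - x / rho

    def empirical_margin_loss(h, X, y, rho):
        """R_rho(h) from (13): average margin loss of the confidences y_i * h(x_i)."""
        return float(np.mean([margin_loss(yi * h(xi), rho) for xi, yi in zip(X, y)]))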

Generalization

For linear classifiers H = {x ↦ w · x : ||w|| ≤ Λ} and data X ∈ {x : ||x|| ≤ r}, fix ρ > 0; then with probability at least 1 − δ, for any h ∈ H,

  R(h) ≤ R_ρ(h) + 2 √( r²Λ² / (ρ²m) ) + √( log(1/δ) / (2m) )   (14)

• Data-dependent: the data must be separable with a margin
• Fortunately, many data do have good margin properties
• SVMs can find good classifiers in those instances
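To see how the bound (14) scales, here is a sketch that evaluates its two complexity terms for made-up values of r, Λ, ρ, and δ (the numbers are illustrative only):

    import numpy as np

    def margin_bound_slack(r, lam, rho, m, delta):
        """The two complexity terms of (14) that get added to the empirical margin loss."""
        return 2 * np.sqrt(r**2 * lam**2 / (rho**2 * m)) + np.sqrt(np.log(1 / delta) / (2 * m))

    # The slack shrinks as the sample grows and as the margin rho grows
    for m in (100, 1000, 10000):
        print(m, margin_bound_slack(r=1.0, lam=1.0, rho=0.5, m=m, delta=0.05))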

Recap

Choosing what kind of classifier to use

When building a text classifier, the first question is: how much training data is currently available?

Before neural networks:
• None? Hand-write rules or collect more data (possibly with active learning)
• Very little? Naïve Bayes
• A fair amount? SVM
• A huge amount? Doesn't matter, use whatever works

With neural networks:
• None? Hand-write rules or collect more data (possibly with active learning)
• Very little? Naïve Bayes; consider using a pre-trained representation/model with neural networks
• A fair amount? SVM; consider using a pre-trained representation/model with neural networks
• A huge amount? Probably neural networks

Trade-off between interpretability and model performance

SVM extensions: What’s next

• Finding solutions
• Slack variables: not a perfect line
• Kernels: different geometries