Machine Learning: Chenhao Tan
University of Colorado Boulder
Lecture 9 (2017-10-03): Support Vector Machines

Slides adapted from Jordan Boyd-Graber


Recap

• Supervised learning
• Previously: KNN, naïve Bayes, logistic regression, and neural networks
• This time: SVMs
  ◦ Motivated by geometric properties
  ◦ (Another) example of a linear classifier
  ◦ Good theoretical guarantees

Take a step back

• Problem
• Model
• Algorithm/optimization
• Evaluation

https://blog.twitter.com/official/en_us/topics/product/2017/Giving-you-more-characters-to-express-yourself.html

Overview

Intuition

Linear Classifiers

Support Vector Machines

Formulation

Theoretical Guarantees

Recap


Intuition

Thinking Geometrically

• Suppose you have two classes: vacations and sports
• Suppose you have four documents

Sports: Doc1: {ball, ball, ball, travel}, Doc2: {ball, ball}
Vacations: Doc3: {travel, ball, travel}, Doc4: {travel}

• What does this look like in vector space?


Put the documents in vector space

Sports: Doc1: {ball, ball, ball, travel}, Doc2: {ball, ball}
Vacations: Doc3: {travel, ball, travel}, Doc4: {travel}

[Figure: the four documents plotted as term-count vectors in 2D, x-axis: ball (0–3), y-axis: travel (0–3)]
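To make the representation concrete, here is a minimal sketch (not part of the original slides) that builds the count vectors for the four toy documents:

```python
from collections import Counter

# Term-count vectors for the four toy documents (axes: ball, travel).
docs = {
    "Doc1": ["ball", "ball", "ball", "travel"],  # Sports
    "Doc2": ["ball", "ball"],                    # Sports
    "Doc3": ["travel", "ball", "travel"],        # Vacations
    "Doc4": ["travel"],                          # Vacations
}
vocab = ["ball", "travel"]

for name, words in docs.items():
    counts = Counter(words)
    print(name, [counts[term] for term in vocab])  # e.g. Doc1 [3, 1]
```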


Vector space representation of documents

• Each document is a vector, one component for each term.
• Terms are axes.
• High dimensionality: 10,000s of dimensions and more
• How can we do classification in this space?


Vector space classification

• As before, the training set is a set of documents, each labeled with its class.
• In vector space classification, this set corresponds to a labeled set of points or vectors in the vector space.
• Premise 1: Documents in the same class form a contiguous region.
• Premise 2: Documents from different classes don't overlap.
• We define lines, surfaces, hypersurfaces to divide regions.


Classes in the vector space

[Figure (from Hinrich Schütze's SVM slides): documents from three classes, China, UK, and Kenya, as points in a 2D vector space, plus an unlabeled test document]

• Should the test document be assigned to China, UK, or Kenya?
• Find separators between the classes.
• Based on these separators, the test document should be assigned to China.
• How do we find separators that do a good job at classifying new documents like this one? That is the main topic of today.

Linear Classifiers

Linear classifiers

• Definition:
  ◦ A linear classifier computes a linear combination or weighted sum ∑j βjxj of the feature values.
  ◦ Classification decision: ∑j βjxj > β0?
  ◦ ...where β0 (the threshold, or bias) is a parameter.
• We call this the separator or decision boundary.
• We find the separator based on the training set.
• Methods for finding the separator: logistic regression, naïve Bayes, linear SVM
• Assumption: The classes are linearly separable.
• So far we have only talked about equations. What's the geometric intuition?
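As a minimal sketch of the decision rule above (the weights β and threshold β0 are made-up values for the ball/travel example, purely for illustration):

```python
import numpy as np

def linear_classify(x, beta, beta0):
    """Decision rule: x is in class c iff sum_j beta_j * x_j > beta_0."""
    return float(np.dot(beta, x)) > beta0

# Made-up weights for the ball/travel example: high 'ball' count -> Sports.
beta = np.array([1.0, -1.0])  # one weight per term (ball, travel); assumed
beta0 = 0.0                   # threshold/bias; assumed
print(linear_classify(np.array([3, 1]), beta, beta0))  # Doc1 -> True (Sports)
print(linear_classify(np.array([1, 2]), beta, beta0))  # Doc3 -> False (Vacations)
```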

A linear classifier in 1D

• A linear classifier in 1D is a point x described by the equation β1x1 = β0
• x = β0/β1
• Points (x1) with β1x1 ≥ β0 are in the class c.
• Points (x1) with β1x1 < β0 are in the complement class c̄.

A linear classifier in 2D

• A linear classifier in 2D is a line described by the equation β1x1 + β2x2 = β0
• Example for a 2D linear classifier
• Points (x1, x2) with β1x1 + β2x2 ≥ β0 are in the class c.
• Points (x1, x2) with β1x1 + β2x2 < β0 are in the complement class c̄.

A linear classifier in 3D

• A linear classifier in 3D is a plane described by the equation β1x1 + β2x2 + β3x3 = β0
• Example for a 3D linear classifier
• Points (x1, x2, x3) with β1x1 + β2x2 + β3x3 ≥ β0 are in the class c.
• Points (x1, x2, x3) with β1x1 + β2x2 + β3x3 < β0 are in the complement class c̄.

Naïve Bayes and Logistic Regression as linear classifiers

Takeaway: Naïve Bayes, logistic regression, and SVMs are all linear methods. They choose their hyperplanes based on different objectives: joint likelihood (NB), conditional likelihood (LR), and the margin (SVM).

Which hyperplane?

• For linearly separable training sets: there are infinitely many separating hyperplanes.
• They all separate the training set perfectly...
• ...but they behave differently on test data.
• Error rates on new data are low for some, high for others.
• How do we find a low-error separator?

Support Vector Machines

Support vector machines

• Machine-learning research in the last two decades has improved classifier effectiveness.
• New generation of state-of-the-art classifiers: support vector machines (SVMs), boosted decision trees, regularized logistic regression, neural networks, and random forests
• Applications to IR problems, particularly text classification

SVMs: a kind of large-margin classifier. A vector-space-based machine-learning method aiming to find a decision boundary between two classes that is maximally far from any point in the training data (possibly discounting some points as outliers or noise).

Support Vector Machines

• 2-class training data
• Decision boundary → linear separator
• Criterion: being maximally far away from any data point → determines the classifier margin
• Linear separator position defined by support vectors

[Figure: 2-class training data, the maximum margin decision hyperplane, the maximized margin, and the support vectors]

Why maximize the margin?

• Points near the decision surface → uncertain classification decisions (50% either way)
• A classifier with a large margin makes no low-certainty classification decisions
• Gives a classification safety margin with respect to slight errors in measurement or document variation

[Figure: the maximum margin decision hyperplane, its maximized margin, and the support vectors]

Why maximize the margin?

• SVM classifier: large margin around the decision boundary
• Compare to decision hyperplane: place a fat separator between classes
  ◦ unique solution
• Decreased memory capacity
• Increased ability to correctly generalize to test data

Formulation

Equation

• Equation of a hyperplane:

  w · x + b = 0    (1)

• Distance of a point to the hyperplane:

  |w · x + b| / ||w||    (2)

• The margin ρ is given by

  ρ ≡ min_{(x,y)∈S} |w · x + b| / ||w|| = 1 / ||w||    (3)

  This is because for any point on the marginal hyperplane, w · x + b = ±1, and we would like to maximize the margin ρ.
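A quick numeric check of equations (2) and (3) (a sketch; the hyperplane w = (2/5, 4/5), b = −11/5 is the one derived in the walkthrough below):

```python
import numpy as np

def distance_to_hyperplane(x, w, b):
    """Distance of point x to the hyperplane w.x + b = 0 (equation (2))."""
    return abs(np.dot(w, x) + b) / np.linalg.norm(w)

w = np.array([2 / 5, 4 / 5])  # hyperplane from the walkthrough below
b = -11 / 5
print(distance_to_hyperplane(np.array([1.0, 1.0]), w, b))  # support vector: ~1.118
print(1 / np.linalg.norm(w))                               # margin 1/||w|| = sqrt(5)/2
```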

Optimization Problem

We want to find a weight vector w and bias b that optimize

min_{w,b} (1/2) ||w||²    (4)

subject to yi(w · xi + b) ≥ 1, ∀i ∈ [1, m].

Next week: algorithm
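Off-the-shelf solvers handle this quadratic program; as a sketch, scikit-learn's linear SVC with a very large slack penalty C approximates the hard-margin objective (the two toy points are assumed here and reappear in the walkthrough below):

```python
import numpy as np
from sklearn.svm import SVC

# Two toy points; they are the support vectors of the walkthrough below.
X = np.array([[1.0, 1.0], [2.0, 3.0]])
y = np.array([-1, 1])

# A very large C approximates the hard-margin objective min (1/2)||w||^2
# subject to y_i (w . x_i + b) >= 1.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print(w, b)                   # ~ [0.4, 0.8], -2.2
print(1 / np.linalg.norm(w))  # margin ~ sqrt(5)/2
```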

Find the maximum margin hyperplane

[Figure: training points on a 2D grid (axes 0–3)]

Which are the support vectors?

Walkthrough example: building an SVM over the data shown

Working geometrically:

• If you got 0 = 0.5x + y − 2.75, close!
• Set up a system of equations:

  w1 + w2 + b = −1    (5)
  (3/2)w1 + 2w2 + b = 0    (6)
  2w1 + 3w2 + b = +1    (7)

The SVM decision boundary is:

0 = (2/5)x + (4/5)y − 11/5

[Figure: the training points on a 2D grid (axes 0–3) with the resulting separator]
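One way to recover the stated solution in code (a sketch): the 3×3 system (5)–(7) is rank-deficient, so among its infinitely many solutions the SVM objective picks the w of minimum norm:

```python
import numpy as np

# Equations (5)-(7) as A [w1, w2, b]^T = rhs.
A = np.array([[1.0, 1.0, 1.0],
              [1.5, 2.0, 1.0],
              [2.0, 3.0, 1.0]])
rhs = np.array([-1.0, 0.0, 1.0])
print(np.linalg.matrix_rank(A))  # -> 2: the equations alone do not pin down w

# Eliminating b (subtract eq. (5) from eq. (6)) leaves one constraint on w:
# 0.5*w1 + 1.0*w2 = 1. The minimum-norm w satisfying a.w = 1 is a/(a.a).
a = np.array([0.5, 1.0])
w = a / np.dot(a, a)   # -> [0.4, 0.8] = (2/5, 4/5)
b = -1.0 - w.sum()     # back-substitute into eq. (5) -> -2.2 = -11/5
print(w, b)
```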

Canonical Form

[Figure: the training points on a 2D grid (axes 0–3) with the maximum margin hyperplane]

In canonical form, w1x1 + w2x2 + b = 0.4x1 + 0.8x2 − 2.2. Checking against the three points:

• 0.4 · 1 + 0.8 · 1 − 2.2 = −1
• 0.4 · (3/2) + 0.8 · 2 − 2.2 = 0
• 0.4 · 2 + 0.8 · 3 − 2.2 = +1

What's the margin?

• Distance to the closest point:

  √((3/2 − 1)² + (2 − 1)²) = √5 / 2    (8)

• Weight vector:

  1 / ||w|| = 1 / √((2/5)² + (4/5)²) = 1 / √(20/25) = 5 / (√5 · √4) = √5 / 2    (9)
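Both quantities agree, as a quick numeric check (a sketch using the numbers above):

```python
import numpy as np

# Equation (8): distance from support vector (1, 1) to the midpoint (3/2, 2).
print(np.hypot(3 / 2 - 1, 2 - 1))  # -> 1.118... = sqrt(5)/2

# Equation (9): 1/||w|| for w = (2/5, 4/5).
print(1 / np.hypot(2 / 5, 4 / 5))  # -> 1.118... = sqrt(5)/2
```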

Theoretical Guarantees

Three Proofs that Suggest SVMs will Work

• Leave-one-out error
• Margin analysis
• VC dimension (omitted for now)


Leave One Out Error (sketch)

Leave-one-out error is the error from using one point as your test set (averaged over all such points):

RLOO = (1/m) ∑_{i=1}^{m} 1[hS−{xi}(xi) ≠ yi]    (10)

This serves as an unbiased estimate of the generalization error for samples of size m − 1.
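A minimal sketch of equation (10), using scikit-learn's linear SVC as the learner hS−{xi} and made-up toy data:

```python
import numpy as np
from sklearn.svm import SVC

def loo_error(X, y):
    """Equation (10): train on all points but one, test on the held-out
    point, and average the mistakes over all m choices."""
    m = len(y)
    mistakes = 0
    for i in range(m):
        mask = np.arange(m) != i
        h = SVC(kernel="linear", C=1e6).fit(X[mask], y[mask])
        mistakes += int(h.predict(X[i:i + 1])[0] != y[i])
    return mistakes / m

# Made-up, linearly separable toy data.
X = np.array([[1, 1], [2, 0], [0, 2], [2, 3], [3, 3], [1, 3]], dtype=float)
y = np.array([-1, -1, -1, 1, 1, 1])
print(loo_error(X, y))
```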

Leave One Out Error (sketch)

Leave-one-out error is bounded by the number of support vectors:

E_{S∼D^{m−1}}[R(hS)] ≤ E_{S∼D^m}[NSV(S)/m]    (11)

Consider the held-out error for xi:
• If xi was not a support vector, the answer doesn't change.
• If xi was a support vector, it could change the answer; this is when we can have an error.

There are NSV support vectors and thus NSV possible errors.

Pictorial proof

[Figure: the 2-class training data and maximum margin separator from earlier; the separator's position is defined only by the support vectors]

Margin Theory

Margin Loss Function

To see where SVMs really shine, consider the margin loss with margin ρ:

Φρ(x) =
  0          if ρ ≤ x
  1 − x/ρ    if 0 ≤ x ≤ ρ
  1          if x ≤ 0    (12)

The empirical margin loss of a hypothesis h is

Rρ(h) = (1/m) ∑_{i=1}^{m} Φρ(yi h(xi)) ≤ (1/m) ∑_{i=1}^{m} 1[yi h(xi) ≤ ρ]    (13)

This is the fraction of the points in the training sample S that have been misclassified or classified with confidence less than ρ.
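A minimal sketch of equations (12) and (13) (the scores and labels are made up for illustration):

```python
import numpy as np

def margin_loss(x, rho):
    """Phi_rho from equation (12)."""
    if x >= rho:
        return 0.0
    if x <= 0:
        return 1.0
    return 1.0 - x / rho

def empirical_margin_loss(scores, labels, rho):
    """Equation (13): average margin loss over the sample,
    with scores[i] = h(x_i)."""
    return float(np.mean([margin_loss(y * s, rho)
                          for s, y in zip(scores, labels)]))

# Made-up scores and labels, for illustration only.
scores = np.array([2.0, 0.5, -0.3, 1.5])
labels = np.array([1, 1, -1, 1])
print(empirical_margin_loss(scores, labels, rho=1.0))  # (0 + 0.5 + 0.7 + 0)/4 = 0.3
```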

Generalization

For linear classifiers H = {x ↦ w · x : ||w|| ≤ Λ} and data X ∈ {x : ||x|| ≤ r}, fix ρ > 0; then with probability at least 1 − δ, for any h ∈ H,

R(h) ≤ Rρ(h) + 2 √(r²Λ² / (ρ²m)) + √(log(1/δ) / (2m))    (14)

• Data-dependent: the data must be separable with a margin
• Fortunately, many datasets do have good margin properties
• SVMs can find good classifiers in those instances
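As a sketch, plugging assumed numbers into the right-hand side of (14) shows how the bound tightens with more samples and a larger margin:

```python
import numpy as np

def margin_bound(emp_margin_loss, r, lam, rho, m, delta):
    """Right-hand side of the generalization bound (14)."""
    return (emp_margin_loss
            + 2 * np.sqrt(r**2 * lam**2 / (rho**2 * m))
            + np.sqrt(np.log(1 / delta) / (2 * m)))

# Assumed numbers, for illustration only.
print(margin_bound(0.05, r=1.0, lam=10.0, rho=1.0, m=10_000, delta=0.05))
# ~ 0.05 + 0.20 + 0.01
```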

Recap

Choosing what kind of classifier to use

When building a text classifier, the first question: how much training data is currently available?

Before neural networks:
• None? Hand-write rules or collect more data (possibly with active learning)
• Very little? Naïve Bayes
• A fair amount? SVM
• A huge amount? Doesn't matter, use whatever works

With neural networks:
• None? Hand-write rules or collect more data (possibly with active learning)
• Very little? Naïve Bayes; consider using a pre-trained representation/model with neural networks
• A fair amount? SVM; consider using a pre-trained representation/model with neural networks
• A huge amount? Probably neural networks

There is a trade-off between interpretability and model performance.

SVM extensions: What's next

• Finding solutions
• Slack variables: not a perfect line
• Kernels: different geometries