Machine Learning, Lecture 9
Chenhao Tan, University of Colorado Boulder
Slides adapted from Jordan Boyd-Graber
Machine Learning: Chenhao Tan | Boulder | 1 of 40
Recap

• Supervised learning
• Previously: KNN, naïve Bayes, logistic regression and neural networks
• This time: SVMs
  ◦ Motivated by geometric properties
  ◦ (Another) example of a linear classifier
  ◦ Good theoretical guarantees
Take a step back

• Problem
• Model
• Algorithm/optimization
• Evaluation

https://blog.twitter.com/official/en_us/topics/product/2017/Giving-you-more-characters-to-express-yourself.html
Overview
Intuition
Linear Classifiers
Support Vector Machines
Formulation
Theoretical Guarantees
Recap
Intuition
Thinking Geometrically

• Suppose you have two classes: vacations and sports
• Suppose you have four documents:
  Sports: Doc1 {ball, ball, ball, travel}, Doc2 {ball, ball}
  Vacations: Doc3 {travel, ball, travel}, Doc4 {travel}
• What does this look like in vector space?
Put the documents in vector space

Sports: Doc1 {ball, ball, ball, travel}, Doc2 {ball, ball}
Vacations: Doc3 {travel, ball, travel}, Doc4 {travel}

[Figure: the four documents plotted in a 2D vector space, "Ball" counts on one axis and "Travel" counts on the other]
Vector space representation of documents

• Each document is a vector, with one component for each term.
• Terms are axes.
• High dimensionality: 10,000s of dimensions and more
• How can we do classification in this space?
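The four documents above can be turned into term-count vectors in a few lines of Python (a minimal sketch; the two-term vocabulary and its ordering are choices made here for illustration):

```python
from collections import Counter

docs = {
    "Doc1": ["ball", "ball", "ball", "travel"],  # Sports
    "Doc2": ["ball", "ball"],                    # Sports
    "Doc3": ["travel", "ball", "travel"],        # Vacations
    "Doc4": ["travel"],                          # Vacations
}

# Fixed vocabulary: each term becomes one axis of the vector space.
vocab = ["ball", "travel"]

def to_vector(tokens):
    """Map a token list to a count vector over the vocabulary."""
    counts = Counter(tokens)
    return [counts[term] for term in vocab]

vectors = {name: to_vector(tokens) for name, tokens in docs.items()}
print(vectors)  # Doc1 -> [3, 1], Doc2 -> [2, 0], Doc3 -> [1, 2], Doc4 -> [0, 1]
```

In a real text collection the vocabulary would be tens of thousands of terms, giving the high-dimensional space described above.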
Vector space classification

• As before, the training set is a set of documents, each labeled with its class.
• In vector space classification, this set corresponds to a labeled set of points or vectors in the vector space.
• Premise 1: Documents in the same class form a contiguous region.
• Premise 2: Documents from different classes don't overlap.
• We define lines, surfaces, hypersurfaces to divide regions.
Classes in the vector space

[Figure: documents from three classes, China, Kenya and UK, plotted in the vector space, together with an unlabeled test document "?". Figure from Hinrich Schütze's slides on support vector machines.]

• Should the document "?" be assigned to China, UK or Kenya?
• Find separators between the classes.
• Based on these separators: "?" should be assigned to China.
• How do we find separators that do a good job at classifying new documents like "?"?
  This is the main topic of today.
Linear Classifiers
Linear classifiers

• Definition:
  ◦ A linear classifier computes a linear combination or weighted sum ∑_j βj xj of the feature values.
  ◦ Classification decision: ∑_j βj xj > β0?
  ◦ ... where β0 (the threshold, or bias) is a parameter.
• We call this the separator or decision boundary.
• We find the separator based on the training set.
• Methods for finding the separator: logistic regression, naïve Bayes, linear SVM
• Assumption: the classes are linearly separable.
• Before, we just talked about equations. What is the geometric intuition?
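The decision rule above fits in a couple of lines; a minimal sketch in Python (the weights and threshold here are made-up illustrative values, not learned parameters):

```python
def linear_classify(x, beta, beta0):
    """Return True (class c) if the weighted sum of features exceeds beta0."""
    score = sum(b * xj for b, xj in zip(beta, x))
    return score > beta0

# Hypothetical 2D classifier: 0.4*x1 + 0.8*x2 > 2.2 means class c.
beta, beta0 = [0.4, 0.8], 2.2
print(linear_classify([3, 1], beta, beta0))  # 0.4*3 + 0.8*1 = 2.0, below threshold
print(linear_classify([1, 3], beta, beta0))  # 0.4*1 + 0.8*3 = 2.8, above threshold
```

Logistic regression, naïve Bayes, and linear SVMs all produce a rule of exactly this shape; they differ only in how β and β0 are chosen.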
A linear classifier in 1D

[Figure: a threshold point on a line, with class c on one side and its complement on the other]

• A linear classifier in 1D is a point x described by the equation β1x1 = β0
• That point is x = β0/β1
• Points (x1) with β1x1 ≥ β0 are in the class c.
• Points (x1) with β1x1 < β0 are in the complement class c̄.
A linear classifier in 2D

[Figure: a line separating two classes of points in the plane]

• A linear classifier in 2D is a line described by the equation β1x1 + β2x2 = β0
• Example for a 2D linear classifier
• Points (x1, x2) with β1x1 + β2x2 ≥ β0 are in the class c.
• Points (x1, x2) with β1x1 + β2x2 < β0 are in the complement class c̄.
A linear classifier in 3D

[Figure: a plane separating two classes of points in 3D space]

• A linear classifier in 3D is a plane described by the equation β1x1 + β2x2 + β3x3 = β0
• Example for a 3D linear classifier
• Points (x1, x2, x3) with β1x1 + β2x2 + β3x3 ≥ β0 are in the class c.
• Points (x1, x2, x3) with β1x1 + β2x2 + β3x3 < β0 are in the complement class c̄.
Naïve Bayes and Logistic Regression as linear classifiers

Takeaway: Naïve Bayes, logistic regression and SVMs are all linear methods. They choose their hyperplanes based on different objectives: joint likelihood (NB), conditional likelihood (LR), and the margin (SVM).
Which hyperplane?

• For linearly separable training sets there are infinitely many separating hyperplanes.
• They all separate the training set perfectly...
• ... but they behave differently on test data.
• Error rates on new data are low for some, high for others.
• How do we find a low-error separator?
Support Vector Machines
Support vector machines

• Machine-learning research in the last two decades has improved classifier effectiveness.
• New generation of state-of-the-art classifiers: support vector machines (SVMs), boosted decision trees, regularized logistic regression, neural networks, and random forests
• Applications to IR problems, particularly text classification

SVMs: a kind of large-margin classifier. A vector-space-based machine-learning method that aims to find a decision boundary between two classes that is maximally far from any point in the training data (possibly discounting some points as outliers or noise).
Support Vector Machines

[Figure: 2-class training data with the maximum-margin decision hyperplane, the maximized margin, and the support vectors marked]

• 2-class training data
• Decision boundary → linear separator
• Criterion: being maximally far away from any data point → determines the classifier margin
• The linear separator's position is defined by the support vectors
Why maximize the margin?

• Points near the decision surface → uncertain classification decisions (50% either way)
• A classifier with a large margin makes no low-certainty classification decisions
• Gives a classification safety margin with respect to slight errors in measurement or document variation
Why maximize the margin?

• SVM classifier: large margin around the decision boundary
• Compared to picking any decision hyperplane: place a fat separator between the classes
  ◦ unique solution
• Decreased memory capacity
• Increased ability to correctly generalize to test data
Formulation
Equation

• Equation of a hyperplane:
  w · x + b = 0    (1)
• Distance of a point to the hyperplane:
  |w · x + b| / ||w||    (2)
• The margin ρ is given by
  ρ ≡ min_{(x,y)∈S} |w · x + b| / ||w|| = 1/||w||    (3)
  This is because for any point on the marginal hyperplane, w · x + b = ±1, and we would like to maximize the margin ρ.
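Equations (2) and (3) are easy to check numerically. A minimal sketch in Python; the hyperplane w = (0.4, 0.8), b = −2.2 used here is the one from the walkthrough later in the lecture:

```python
import math

def distance_to_hyperplane(x, w, b):
    """Distance |w·x + b| / ||w|| of point x to the hyperplane w·x + b = 0."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(dot + b) / norm

w, b = [0.4, 0.8], -2.2
# The point (1, 1) lies on the marginal hyperplane (w·x + b = -1),
# so its distance to the boundary equals the margin 1/||w||.
print(distance_to_hyperplane([1, 1], w, b))
print(1 / math.sqrt(w[0] ** 2 + w[1] ** 2))  # same value, 1/||w||
```

Both prints agree (≈ 1.118, i.e. √5/2), illustrating why the margin simplifies to 1/||w|| for canonically scaled w and b.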
Optimization Problem

We want to find a weight vector w and bias b that optimize

  min_{w,b} (1/2) ||w||²    (4)

subject to yi(w · xi + b) ≥ 1, ∀i ∈ [1, m].

Next week: algorithm
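Algorithms for this problem are next week's topic. As a rough preview only, one common approach minimizes a closely related soft-margin (hinge-loss) objective by subgradient descent; this sketch is not the hard-margin formulation above, and all values here (λ, learning rate, toy data) are made up:

```python
def svm_subgradient_train(X, y, lam=0.01, lr=0.1, epochs=200):
    """Subgradient descent on a soft-margin objective:
    (lam/2)||w||^2 + (1/m) sum_i max(0, 1 - y_i (w·x_i + b)).
    A preview sketch; the lecture's hard-margin QP is solved differently."""
    m, n = len(X), len(X[0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        gw, gb = [lam * wi for wi in w], 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # hinge term active: add its subgradient
                for j in range(n):
                    gw[j] -= yi * xi[j] / m
                gb -= yi / m
        w = [wj - lr * gj for wj, gj in zip(w, gw)]
        b -= lr * gb
    return w, b

# Two toy points; the learned separator should put them on opposite sides.
w, b = svm_subgradient_train([[-1.0, -1.0], [1.0, 1.0]], [-1, 1])
print(w, b)
```

With small λ the soft-margin solution approaches the hard-margin one on separable data, which is why this serves as a reasonable preview of equation (4).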
Find the maximum margin hyperplane

[Figure: training points plotted on a grid from 0 to 3 on each axis]

Which are the support vectors?
Walkthrough example: building an SVM over the data shown

Working geometrically:

• If you got 0 = 0.5x + y − 2.75, close!
• Set up a system of equations:
  w1 + w2 + b = −1    (5)
  (3/2)w1 + 2w2 + b = 0    (6)
  2w1 + 3w2 + b = +1    (7)
• The SVM decision boundary is:
  0 = (2/5)x + (4/5)y − 11/5

[Figure: the training points on the grid with the resulting separator]
Canonical Form

w1x1 + w2x2 + b = 0.4x1 + 0.8x2 − 2.2

Check against the support vectors:
• 0.4 · 1 + 0.8 · 1 − 2.2 = −1
• 0.4 · (3/2) + 0.8 · 2 − 2.2 = 0
• 0.4 · 2 + 0.8 · 3 − 2.2 = +1
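These three checks, which are exactly the system of equations from the walkthrough, can be verified in a couple of lines:

```python
# Decision function in canonical form: f(x1, x2) = 0.4*x1 + 0.8*x2 - 2.2
def f(x1, x2):
    return 0.4 * x1 + 0.8 * x2 - 2.2

# The three conditions that pinned down w and b (rounded to absorb
# floating-point noise):
print(round(f(1, 1), 10))    # negative-class support vector, on the margin
print(round(f(1.5, 2), 10))  # a point exactly on the decision boundary
print(round(f(2, 3), 10))    # positive-class support vector, on the margin
```

The three values come out as −1, 0, and +1, confirming the canonical scaling.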
What's the margin?

• Distance to the closest point:
  √((3/2 − 1)² + (2 − 1)²) = √5/2    (8)
• Via the weight vector:
  1/||w|| = 1/√((2/5)² + (4/5)²) = 1/√(20/25) = 5/(√4 √5) = √5/2    (9)
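Both routes to the margin, equations (8) and (9), can be checked numerically:

```python
import math

# Route 1: distance from the support vector (1, 1) to the closest
# boundary point (3/2, 2).
d1 = math.sqrt((1.5 - 1) ** 2 + (2 - 1) ** 2)

# Route 2: 1/||w|| for the canonical weight vector w = (2/5, 4/5).
d2 = 1 / math.sqrt((2 / 5) ** 2 + (4 / 5) ** 2)

print(d1, d2, math.sqrt(5) / 2)  # all three agree, ≈ 1.1180
```

Agreement between the two routes is exactly what equation (3) promised: for a canonically scaled hyperplane, the geometric margin equals 1/||w||.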
Theoretical Guarantees
Three Proofs that Suggest SVMs will Work

• Leave-one-out error
• Margin analysis
• VC dimension (omitted for now)
Leave One Out Error (sketch)

The leave-one-out error is the error obtained by using one point as the test set, averaged over all such points:

  R_LOO = (1/m) ∑_{i=1}^{m} 1[h_{S−{xi}}(xi) ≠ yi]    (10)

This serves as an unbiased estimate of the generalization error for samples of size m − 1.
Leave One Out Error (sketch)

Leave-one-out error is bounded by the number of support vectors:

  E_{S∼D^{m−1}}[R(hS)] ≤ E_{S∼D^m}[N_SV(S)/m]    (11)

Consider the held-out error for xi:
• If xi was not a support vector, the answer doesn't change.
• If xi was a support vector, it could change the answer; this is when we can have an error.
There are N_SV support vectors and thus at most N_SV possible errors.
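Equation (10) is straightforward to compute for any trainable classifier. A minimal sketch, using a nearest-centroid learner as a stand-in (the classifier choice and the toy data are illustrative assumptions, not part of the lecture):

```python
import math

def nearest_centroid_fit(X, y):
    """Toy learner: compute class centroids; predict the closer centroid's label."""
    cents = {}
    for label in set(y):
        pts = [x for x, yi in zip(X, y) if yi == label]
        cents[label] = [sum(coord) / len(pts) for coord in zip(*pts)]
    def predict(x):
        return min(cents, key=lambda lab: math.dist(x, cents[lab]))
    return predict

def loo_error(X, y, fit):
    """R_LOO = (1/m) sum_i 1[h_{S - {x_i}}(x_i) != y_i], equation (10)."""
    m = len(X)
    errors = 0
    for i in range(m):
        h = fit(X[:i] + X[i + 1:], y[:i] + y[i + 1:])  # train without point i
        errors += h(X[i]) != y[i]                      # test on point i
    return errors / m

X = [[1, 1], [2, 1], [1, 2], [4, 4], [5, 4], [4, 5]]
y = [-1, -1, -1, 1, 1, 1]
print(loo_error(X, y, nearest_centroid_fit))  # 0.0 on this well-separated toy set
```

For an SVM, the bound in equation (11) says this loop can only err on the support vectors, which is why a small N_SV suggests good generalization.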
Pictorial proof

[Figure: the 2-class training data with the maximum-margin separator and its margin; the separator is determined entirely by the support vectors]
Margin Theory
Margin Loss Function

To see where SVMs really shine, consider the margin loss with margin ρ:

  Φρ(x) = 0         if ρ ≤ x
          1 − x/ρ   if 0 ≤ x ≤ ρ
          1         if x ≤ 0    (12)

The empirical margin loss of a hypothesis h is

  Rρ(h) = (1/m) ∑_{i=1}^{m} Φρ(yih(xi)) ≤ (1/m) ∑_{i=1}^{m} 1[yih(xi) ≤ ρ]    (13)

This is the fraction of the points in the training sample S that have been misclassified or classified with confidence less than ρ.
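Equations (12) and (13) transcribe directly into code (the scores and labels below are made-up illustrative values):

```python
def margin_loss(x, rho):
    """Phi_rho from equation (12): 0 above the margin, linear ramp inside, 1 below."""
    if x >= rho:
        return 0.0
    if x <= 0:
        return 1.0
    return 1.0 - x / rho

def empirical_margin_loss(scores, labels, rho):
    """R_rho(h) = (1/m) sum_i Phi_rho(y_i * h(x_i)), equation (13)."""
    return sum(margin_loss(y * s, rho) for s, y in zip(scores, labels)) / len(scores)

# Hypothetical scores h(x_i) and labels y_i:
scores = [2.0, 0.5, -1.0, 1.5]
labels = [1, 1, 1, -1]
print(empirical_margin_loss(scores, labels, rho=1.0))  # 0.625
```

The last two points are misclassified (loss 1 each) and the second is correct but inside the margin (loss 0.5), so the empirical margin loss is 2.5/4 = 0.625.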
Generalization

For linear classifiers H = {x ↦ w · x : ||w|| ≤ Λ} and data X ∈ {x : ||x|| ≤ r}, fix ρ > 0. Then with probability at least 1 − δ, for any h ∈ H,

  R(h) ≤ Rρ(h) + 2√(r²Λ²/(ρ²m)) + √(log(1/δ)/(2m))    (14)

• Data-dependent: the data must be separable with a margin
• Fortunately, many datasets do have good margin properties
• SVMs can find good classifiers in those instances
Recap
Choosing what kind of classifier to use

When building a text classifier, the first question: how much training data is currently available?

Before neural networks:
• None? Hand-write rules or collect more data (possibly with active learning)
• Very little? Naïve Bayes
• A fair amount? SVM
• A huge amount? Doesn't matter, use whatever works
With neural networks:
• None? Hand-write rules or collect more data (possibly with active learning)
• Very little? Naïve Bayes; consider using a pre-trained representation/model with neural networks
• A fair amount? SVM; consider using a pre-trained representation/model with neural networks
• A huge amount? Probably neural networks

There is a trade-off between interpretability and model performance.
SVM extensions: What's next

• Finding solutions
• Slack variables: when no line separates the classes perfectly
• Kernels: different geometries