Machine Learning: Chenhao Tan
University of Colorado Boulder
Lecture 9 (2017-10-03): Support Vector Machines

Slides adapted from Jordan Boyd-Graber


Recap

• Supervised learning
• Previously: KNN, naïve Bayes, logistic regression, and neural networks
• This time: SVMs
  ◦ Motivated by geometric properties
  ◦ (Another) example of a linear classifier
  ◦ Good theoretical guarantees

Take a step back

• Problem
• Model
• Algorithm/optimization
• Evaluation

https://blog.twitter.com/official/en_us/topics/product/2017/Giving-you-more-characters-to-express-yourself.html

Overview

Intuition

Linear Classifiers

Support Vector Machines

Formulation

Theoretical Guarantees

Recap


Intuition

Thinking Geometrically

• Suppose you have two classes: vacations and sports
• Suppose you have four documents

Sports: Doc1: {ball, ball, ball, travel}, Doc2: {ball, ball}
Vacations: Doc3: {travel, ball, travel}, Doc4: {travel}

• What does this look like in vector space?


Put the documents in vector space

Sports: Doc1: {ball, ball, ball, travel}, Doc2: {ball, ball}
Vacations: Doc3: {travel, ball, travel}, Doc4: {travel}

[Figure: the four documents plotted as term-count vectors in 2D, x-axis: ball (0–3), y-axis: travel (0–3)]
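To make the representation concrete, here is a minimal sketch (not part of the original slides) that builds the count vectors for the four toy documents:

```python
from collections import Counter

# Term-count vectors for the four toy documents (axes: ball, travel).
docs = {
    "Doc1": ["ball", "ball", "ball", "travel"],  # Sports
    "Doc2": ["ball", "ball"],                    # Sports
    "Doc3": ["travel", "ball", "travel"],        # Vacations
    "Doc4": ["travel"],                          # Vacations
}
vocab = ["ball", "travel"]

for name, words in docs.items():
    counts = Counter(words)
    print(name, [counts[term] for term in vocab])  # e.g. Doc1 [3, 1]
```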


Vector space representation of documents

• Each document is a vector, one component for each term.
• Terms are axes.
• High dimensionality: 10,000s of dimensions and more
• How can we do classification in this space?


Vector space classification

• As before, the training set is a set of documents, each labeled with its class.
• In vector space classification, this set corresponds to a labeled set of points or vectors in the vector space.
• Premise 1: Documents in the same class form a contiguous region.
• Premise 2: Documents from different classes don't overlap.
• We define lines, surfaces, hypersurfaces to divide regions.


Classes in the vector space

[Figure (from Hinrich Schütze's SVM slides): documents from three classes, China, UK, and Kenya, as points in a 2D vector space, plus an unlabeled test document]

• Should the test document be assigned to China, UK, or Kenya?
• Find separators between the classes.
• Based on these separators, the test document should be assigned to China.
• How do we find separators that do a good job at classifying new documents like this one? That is the main topic of today.

Linear Classifiers

Linear classifiers

• Definition:
  ◦ A linear classifier computes a linear combination or weighted sum ∑j βjxj of the feature values.
  ◦ Classification decision: ∑j βjxj > β0?
  ◦ ...where β0 (the threshold, or bias) is a parameter.
• We call this the separator or decision boundary.
• We find the separator based on the training set.
• Methods for finding the separator: logistic regression, naïve Bayes, linear SVM
• Assumption: The classes are linearly separable.
• So far we have only talked about equations. What's the geometric intuition?
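As a minimal sketch of the decision rule above (the weights β and threshold β0 are made-up values for the ball/travel example, purely for illustration):

```python
import numpy as np

def linear_classify(x, beta, beta0):
    """Decision rule: x is in class c iff sum_j beta_j * x_j > beta_0."""
    return float(np.dot(beta, x)) > beta0

# Made-up weights for the ball/travel example: high 'ball' count -> Sports.
beta = np.array([1.0, -1.0])  # one weight per term (ball, travel); assumed
beta0 = 0.0                   # threshold/bias; assumed
print(linear_classify(np.array([3, 1]), beta, beta0))  # Doc1 -> True (Sports)
print(linear_classify(np.array([1, 2]), beta, beta0))  # Doc3 -> False (Vacations)
```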

A linear classifier in 1D

• A linear classifier in 1D is a point x described by the equation β1x1 = β0
• x = β0/β1
• Points (x1) with β1x1 ≥ β0 are in the class c.
• Points (x1) with β1x1 < β0 are in the complement class c̄.

A linear classifier in 2D

• A linear classifier in 2D is a line described by the equation β1x1 + β2x2 = β0
• Example for a 2D linear classifier
• Points (x1, x2) with β1x1 + β2x2 ≥ β0 are in the class c.
• Points (x1, x2) with β1x1 + β2x2 < β0 are in the complement class c̄.

A linear classifier in 3D

• A linear classifier in 3D is a plane described by the equation β1x1 + β2x2 + β3x3 = β0
• Example for a 3D linear classifier
• Points (x1, x2, x3) with β1x1 + β2x2 + β3x3 ≥ β0 are in the class c.
• Points (x1, x2, x3) with β1x1 + β2x2 + β3x3 < β0 are in the complement class c̄.

Naïve Bayes and Logistic Regression as linear classifiers

Takeaway: Naïve Bayes, logistic regression, and SVMs are all linear methods. They choose their hyperplanes based on different objectives: joint likelihood (NB), conditional likelihood (LR), and the margin (SVM).

Which hyperplane?

• For linearly separable training sets: there are infinitely many separating hyperplanes.
• They all separate the training set perfectly...
• ...but they behave differently on test data.
• Error rates on new data are low for some, high for others.
• How do we find a low-error separator?

Support Vector Machines

Support vector machines

• Machine-learning research in the last two decades has improved classifier effectiveness.
• New generation of state-of-the-art classifiers: support vector machines (SVMs), boosted decision trees, regularized logistic regression, neural networks, and random forests
• Applications to IR problems, particularly text classification

SVMs: a kind of large-margin classifier. A vector-space-based machine-learning method aiming to find a decision boundary between two classes that is maximally far from any point in the training data (possibly discounting some points as outliers or noise).

Support Vector Machines

• 2-class training data
• Decision boundary → linear separator
• Criterion: being maximally far away from any data point → determines the classifier margin
• Linear separator position defined by support vectors

[Figure: 2-class training data, the maximum margin decision hyperplane, the maximized margin, and the support vectors]

Why maximize the margin?

• Points near the decision surface → uncertain classification decisions (50% either way)
• A classifier with a large margin makes no low-certainty classification decisions
• Gives a classification safety margin with respect to slight errors in measurement or document variation

[Figure: the maximum margin decision hyperplane, its maximized margin, and the support vectors]

Why maximize the margin?

• SVM classifier: large margin around the decision boundary
• Compare to decision hyperplane: place a fat separator between classes
  ◦ unique solution
• Decreased memory capacity
• Increased ability to correctly generalize to test data

Formulation

Equation

• Equation of a hyperplane:

  w · x + b = 0    (1)

• Distance of a point to the hyperplane:

  |w · x + b| / ||w||    (2)

• The margin ρ is given by

  ρ ≡ min_{(x,y)∈S} |w · x + b| / ||w|| = 1 / ||w||    (3)

  This is because for any point on the marginal hyperplane, w · x + b = ±1, and we would like to maximize the margin ρ.
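A quick numeric check of equations (2) and (3) (a sketch; the hyperplane w = (2/5, 4/5), b = −11/5 is the one derived in the walkthrough below):

```python
import numpy as np

def distance_to_hyperplane(x, w, b):
    """Distance of point x to the hyperplane w.x + b = 0 (equation (2))."""
    return abs(np.dot(w, x) + b) / np.linalg.norm(w)

w = np.array([2 / 5, 4 / 5])  # hyperplane from the walkthrough below
b = -11 / 5
print(distance_to_hyperplane(np.array([1.0, 1.0]), w, b))  # support vector: ~1.118
print(1 / np.linalg.norm(w))                               # margin 1/||w|| = sqrt(5)/2
```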

Optimization Problem

We want to find a weight vector w and bias b that optimize

min_{w,b} (1/2) ||w||²    (4)

subject to yi(w · xi + b) ≥ 1, ∀i ∈ [1, m].

Next week: algorithm
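Off-the-shelf solvers handle this quadratic program; as a sketch, scikit-learn's linear SVC with a very large slack penalty C approximates the hard-margin objective (the two toy points are assumed here and reappear in the walkthrough below):

```python
import numpy as np
from sklearn.svm import SVC

# Two toy points; they are the support vectors of the walkthrough below.
X = np.array([[1.0, 1.0], [2.0, 3.0]])
y = np.array([-1, 1])

# A very large C approximates the hard-margin objective min (1/2)||w||^2
# subject to y_i (w . x_i + b) >= 1.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print(w, b)                   # ~ [0.4, 0.8], -2.2
print(1 / np.linalg.norm(w))  # margin ~ sqrt(5)/2
```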

Find the maximum margin hyperplane

[Figure: training points on a 2D grid (axes 0–3)]

Which are the support vectors?

Walkthrough example: building an SVM over the data shown

Working geometrically:

• If you got 0 = 0.5x + y − 2.75, close!
• Set up a system of equations:

  w1 + w2 + b = −1    (5)
  (3/2)w1 + 2w2 + b = 0    (6)
  2w1 + 3w2 + b = +1    (7)

The SVM decision boundary is:

0 = (2/5)x + (4/5)y − 11/5

[Figure: the training points on a 2D grid (axes 0–3) with the resulting separator]
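One way to recover the stated solution in code (a sketch): the 3×3 system (5)–(7) is rank-deficient, so among its infinitely many solutions the SVM objective picks the w of minimum norm:

```python
import numpy as np

# Equations (5)-(7) as A [w1, w2, b]^T = rhs.
A = np.array([[1.0, 1.0, 1.0],
              [1.5, 2.0, 1.0],
              [2.0, 3.0, 1.0]])
rhs = np.array([-1.0, 0.0, 1.0])
print(np.linalg.matrix_rank(A))  # -> 2: the equations alone do not pin down w

# Eliminating b (subtract eq. (5) from eq. (6)) leaves one constraint on w:
# 0.5*w1 + 1.0*w2 = 1. The minimum-norm w satisfying a.w = 1 is a/(a.a).
a = np.array([0.5, 1.0])
w = a / np.dot(a, a)   # -> [0.4, 0.8] = (2/5, 4/5)
b = -1.0 - w.sum()     # back-substitute into eq. (5) -> -2.2 = -11/5
print(w, b)
```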

Canonical Form

[Figure: the training points on a 2D grid (axes 0–3) with the maximum margin hyperplane]

In canonical form, w1x1 + w2x2 + b = 0.4x1 + 0.8x2 − 2.2. Checking against the three points:

• 0.4 · 1 + 0.8 · 1 − 2.2 = −1
• 0.4 · (3/2) + 0.8 · 2 − 2.2 = 0
• 0.4 · 2 + 0.8 · 3 − 2.2 = +1

What's the margin?

• Distance to the closest point:

  √((3/2 − 1)² + (2 − 1)²) = √5 / 2    (8)

• Weight vector:

  1 / ||w|| = 1 / √((2/5)² + (4/5)²) = 1 / √(20/25) = 5 / (√5 · √4) = √5 / 2    (9)
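Both quantities agree, as a quick numeric check (a sketch using the numbers above):

```python
import numpy as np

# Equation (8): distance from support vector (1, 1) to the midpoint (3/2, 2).
print(np.hypot(3 / 2 - 1, 2 - 1))  # -> 1.118... = sqrt(5)/2

# Equation (9): 1/||w|| for w = (2/5, 4/5).
print(1 / np.hypot(2 / 5, 4 / 5))  # -> 1.118... = sqrt(5)/2
```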

Theoretical Guarantees

Three Proofs that Suggest SVMs will Work

• Leave-one-out error
• Margin analysis
• VC dimension (omitted for now)


Leave One Out Error (sketch)

Leave-one-out error is the error from using one point as your test set (averaged over all such points):

RLOO = (1/m) ∑_{i=1}^{m} 1[hS−{xi}(xi) ≠ yi]    (10)

This serves as an unbiased estimate of the generalization error for samples of size m − 1.
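A minimal sketch of equation (10), using scikit-learn's linear SVC as the learner hS−{xi} and made-up toy data:

```python
import numpy as np
from sklearn.svm import SVC

def loo_error(X, y):
    """Equation (10): train on all points but one, test on the held-out
    point, and average the mistakes over all m choices."""
    m = len(y)
    mistakes = 0
    for i in range(m):
        mask = np.arange(m) != i
        h = SVC(kernel="linear", C=1e6).fit(X[mask], y[mask])
        mistakes += int(h.predict(X[i:i + 1])[0] != y[i])
    return mistakes / m

# Made-up, linearly separable toy data.
X = np.array([[1, 1], [2, 0], [0, 2], [2, 3], [3, 3], [1, 3]], dtype=float)
y = np.array([-1, -1, -1, 1, 1, 1])
print(loo_error(X, y))
```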

Leave One Out Error (sketch)

Leave-one-out error is bounded by the number of support vectors:

E_{S∼D^{m−1}}[R(hS)] ≤ E_{S∼D^m}[NSV(S)/m]    (11)

Consider the held-out error for xi:
• If xi was not a support vector, the answer doesn't change.
• If xi was a support vector, it could change the answer; this is when we can have an error.

There are NSV support vectors and thus NSV possible errors.

Pictorial proof

[Figure: the 2-class training data and maximum margin separator from earlier; the separator's position is defined only by the support vectors]

Margin Theory

Margin Loss Function

To see where SVMs really shine, consider the margin loss with margin ρ:

Φρ(x) =
  0          if ρ ≤ x
  1 − x/ρ    if 0 ≤ x ≤ ρ
  1          if x ≤ 0    (12)

The empirical margin loss of a hypothesis h is

Rρ(h) = (1/m) ∑_{i=1}^{m} Φρ(yi h(xi)) ≤ (1/m) ∑_{i=1}^{m} 1[yi h(xi) ≤ ρ]    (13)

This is the fraction of the points in the training sample S that have been misclassified or classified with confidence less than ρ.
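A minimal sketch of equations (12) and (13) (the scores and labels are made up for illustration):

```python
import numpy as np

def margin_loss(x, rho):
    """Phi_rho from equation (12)."""
    if x >= rho:
        return 0.0
    if x <= 0:
        return 1.0
    return 1.0 - x / rho

def empirical_margin_loss(scores, labels, rho):
    """Equation (13): average margin loss over the sample,
    with scores[i] = h(x_i)."""
    return float(np.mean([margin_loss(y * s, rho)
                          for s, y in zip(scores, labels)]))

# Made-up scores and labels, for illustration only.
scores = np.array([2.0, 0.5, -0.3, 1.5])
labels = np.array([1, 1, -1, 1])
print(empirical_margin_loss(scores, labels, rho=1.0))  # (0 + 0.5 + 0.7 + 0)/4 = 0.3
```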

Generalization

For linear classifiers H = {x ↦ w · x : ||w|| ≤ Λ} and data X ∈ {x : ||x|| ≤ r}, fix ρ > 0; then with probability at least 1 − δ, for any h ∈ H,

R(h) ≤ Rρ(h) + 2 √(r²Λ² / (ρ²m)) + √(log(1/δ) / (2m))    (14)

• Data-dependent: the data must be separable with a margin
• Fortunately, many datasets do have good margin properties
• SVMs can find good classifiers in those instances
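As a sketch, plugging assumed numbers into the right-hand side of (14) shows how the bound tightens with more samples and a larger margin:

```python
import numpy as np

def margin_bound(emp_margin_loss, r, lam, rho, m, delta):
    """Right-hand side of the generalization bound (14)."""
    return (emp_margin_loss
            + 2 * np.sqrt(r**2 * lam**2 / (rho**2 * m))
            + np.sqrt(np.log(1 / delta) / (2 * m)))

# Assumed numbers, for illustration only.
print(margin_bound(0.05, r=1.0, lam=10.0, rho=1.0, m=10_000, delta=0.05))
# ~ 0.05 + 0.20 + 0.01
```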

Recap

Choosing what kind of classifier to use

When building a text classifier, the first question: how much training data is currently available?

Before neural networks:
• None? Hand-write rules or collect more data (possibly with active learning)
• Very little? Naïve Bayes
• A fair amount? SVM
• A huge amount? Doesn't matter, use whatever works

With neural networks:
• None? Hand-write rules or collect more data (possibly with active learning)
• Very little? Naïve Bayes; consider using a pre-trained representation/model with neural networks
• A fair amount? SVM; consider using a pre-trained representation/model with neural networks
• A huge amount? Probably neural networks

There is a trade-off between interpretability and model performance.

SVM extensions: What's next

• Finding solutions
• Slack variables: not a perfect line
• Kernels: different geometries