Classification (Slides from Heng Ji)


Page 1:

Classification

Slides from Heng Ji

Page 2:

Classification

Define classes/categories
Label text
Extract features
Choose a classifier
  Naive Bayes Classifier, Decision Trees, Maximum Entropy, …
Train it
Use it to classify new examples
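As an illustration (not part of the original slides), here is a minimal sketch of the steps above using scikit-learn; the toy texts, labels, and the choice of a Naive Bayes classifier are assumptions for the example.

    # Hedged sketch of the pipeline: label text, extract features, choose a
    # classifier, train it, and use it to classify new examples.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    texts = ["the election results were announced", "a new poem was published"]
    labels = ["politics", "literature"]          # the defined classes/categories

    vectorizer = CountVectorizer()               # extract bag-of-words features
    X = vectorizer.fit_transform(texts)

    clf = MultinomialNB()                        # choose a classifier (Naive Bayes)
    clf.fit(X, labels)                           # train it

    # classify a new example
    print(clf.predict(vectorizer.transform(["the president won the election"])))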

Page 3:

Naïve Bayes

More powerful than Decision Trees

[Figure: Decision Trees vs. Naïve Bayes]

Every feature gets a say in determining which label should be assigned to a given input value.

Page 4:

Naïve Bayes: Strengths

Very simple model
  Easy to understand
  Very easy to implement
Can scale easily to millions of training examples (just need counts!)
Very efficient, fast training and classification
Modest space storage
Widely used because it works really well for text categorization
Linear, but not axis-parallel, decision boundaries
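To make the "just need counts" point concrete, here is a hedged sketch of count-based Naive Bayes training with add-one smoothing (plain Python; the helper names are invented for this example).

    from collections import Counter, defaultdict
    import math

    def train_nb(docs, labels):
        # training only accumulates counts: documents per class and words per class
        class_counts = Counter(labels)
        word_counts = defaultdict(Counter)
        vocab = set()
        for doc, y in zip(docs, labels):
            for w in doc.split():
                word_counts[y][w] += 1
                vocab.add(w)
        return class_counts, word_counts, vocab

    def log_score(doc, y, class_counts, word_counts, vocab):
        # log prior + smoothed log likelihood of each word given the class
        score = math.log(class_counts[y] / sum(class_counts.values()))
        total = sum(word_counts[y].values())
        for w in doc.split():
            score += math.log((word_counts[y][w] + 1) / (total + len(vocab)))
        return score

    cc, wc, vocab = train_nb(["the election results", "a new poem"],
                             ["politics", "literature"])
    print(log_score("the election", "politics", cc, wc, vocab))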

Page 5:

Naïve Bayes: Weaknesses

The Naïve Bayes independence assumption has two consequences:
  The linear ordering of words is ignored (bag-of-words model)
  The words are treated as independent of each other given the class, even though, for example, "president" is more likely to occur in a context that contains "election" than in one that contains "poet"
The Naïve Bayes assumption is inappropriate if there are strong conditional dependencies between the variables
Nonetheless, Naïve Bayes models do well in a surprisingly large number of cases, because often we are interested in classification accuracy and not in accurate probability estimates
Does not optimize prediction accuracy

Page 6:

The naivete of independence

The Naïve Bayes assumption is inappropriate if there are strong conditional dependencies between the variables
The classifier may end up "double-counting" the effect of highly correlated features, pushing the classifier closer to a given label than is justified
Consider a name gender classifier:
  The features ends-with(a) and ends-with(vowel) are dependent on one another, because if an input value has the first feature, then it must also have the second feature
  For features like these, the duplicated information may be given more weight than is justified by the training set

Page 7:

Decision Trees: Strengths

Capable of generating understandable rules
Perform classification without requiring much computation
Capable of handling both continuous and categorical variables
Provide a clear indication of which features are most important for prediction or classification

Page 8:

Decision Trees: Weaknesses

Prone to errors in classification problems with many classes and a relatively small number of training examples
  Since each branch in the decision tree splits the training data, the amount of training data available to train nodes lower in the tree can become quite small
Can be computationally expensive to train
  Need to compare all possible splits
  Pruning is also expensive

Page 9:

Decision Trees: Weaknesses

Typically examine one field at a time
  Leads to rectangular classification boxes that may not correspond well with the actual distribution of records in the decision space
  Such ordering limits their ability to exploit features that are relatively independent of one another

Naive Bayes overcomes this limitation by allowing all features to act "in parallel"

Page 10:

Linearly separable data

[Figure: two classes (Class 1, Class 2) separated by a linear decision boundary]

Page 11:

Non-linearly separable data

[Figure: two classes (Class 1, Class 2) that cannot be separated by a straight line]

Page 12:

Non-linearly separable data

[Figure: the same two classes separated by a non-linear classifier]

Page 13:

Linear versus non-linear algorithms

Linearly or non-linearly separable data?
  We can find out only empirically
Linear algorithms (algorithms that find a linear decision boundary)
  When we think the data is linearly separable
  Advantages
    • Simpler, fewer parameters
  Disadvantages
    • High-dimensional data (as in NLP) is usually not linearly separable
  Examples: Perceptron, Winnow, large margin
  Note: we can also use linear algorithms for non-linear problems (see kernel methods)

Page 14:

Linear versus non-linear algorithms

Non-linear algorithms
  When the data is not linearly separable
  Advantages
    • More accurate
  Disadvantages
    • More complicated, more parameters
  Example: kernel methods
  Note: the distinction between linear and non-linear also applies to multi-class classification (we'll see this later)

Page 15:

Simple linear algorithms

Perceptron algorithm
  Linear
  Binary classification
  Online (processes data sequentially, one data point at a time)
  Mistake-driven
  Simple single-layer neural network

Page 16:

Linear algebra

[Figure: a hyperplane with normal vector w and offset b; the region with wx + b > 0 lies on one side, the region with wx + b < 0 on the other]

Page 17:

Linear binary classification

Data: {(x_i, y_i)}, i = 1…n
  x in R^d (a feature vector in d-dimensional space)
  y in {-1, +1} (the label: class, category)

Question: design a linear decision boundary wx + b = 0 (the equation of a hyperplane) such that the classification rule associated with it has minimal probability of error

Classification rule: y = sign(wx + b), which means:
  if wx + b > 0 then y = +1
  if wx + b < 0 then y = -1

Gert Lanckriet, Statistical Learning Theory Tutorial
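A one-line sketch of this classification rule in NumPy (an illustration, not from the slides):

    import numpy as np

    def classify(w, b, x):
        # y = sign(w·x + b): +1 on one side of the hyperplane, -1 on the other
        return 1 if np.dot(w, x) + b > 0 else -1

    w, b = np.array([1.0, -2.0]), 0.5
    print(classify(w, b, np.array([3.0, 1.0])))   # w·x + b = 1.5 > 0, so +1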

Page 18:

Linear binary classification

Find a good hyperplane (w, b) in R^(d+1) that correctly classifies data points as much as possible

In online fashion: one data point at a time, update weights as necessary

wx + b = 0

Classification Rule: y = sign(wx + b)

From Gert Lanckriet, Statistical Learning Theory Tutorial

Page 19:

Perceptron

Page 20:

Perceptron Learning Rule

Assuming the problem is linearly separable, there is a learning rule that converges in finite time

Motivation
  A new (unseen) input pattern that is similar to an old (seen) input pattern is likely to be classified correctly

Page 21:

Learning Rule, Ctd.

Basic idea – go over all existing data patterns, whose labeling is known, and check their classification with the current weight vector
  If correct, continue
  If not, add to the weights a quantity that is proportional to the product of the input pattern with the desired output Z (+1 or –1)

Page 22:

Weight Update Rule

W_{j+1} = W_j + η Z_j X_j,   j = 0, …, n

η = learning rate
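A minimal sketch of this mistake-driven update in NumPy (the function name and the explicit mistake check are assumptions; the rule itself is the one above):

    import numpy as np

    def perceptron_update(w, x, z, eta):
        # apply W_{j+1} = W_j + eta * Z_j * X_j only when the current weights
        # misclassify the pattern (z is the desired output, +1 or -1)
        if np.sign(np.dot(w, x)) != z:
            w = w + eta * z * x
        return w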

Page 23:

Hebb Rule

In 1949, Hebb postulated that the changes in a synapse are proportional to the correlation between the firing of the neurons that are connected through the synapse (the pre- and post-synaptic neurons)

"Neurons that fire together, wire together"

Page 24:

Example: a simple problem

4 points, linearly separable

[Figure: the four points in the plane]
  Z = +1: (1/2, 1), (1, 1/2)
  Z = –1: (–1, 1/2), (–1, 1)

Page 25:

Initial Weights

[Figure: the four points with the initial weight vector W0]

W0 = (0, 1)

Page 26:

Updating Weights

The upper-left point is wrongly classified

η = 1/3, W0 = (0, 1)

W1 ← W0 + η Z X1
W1 = (0, 1) + 1/3 · (–1) · (–1, 1/2) = (1/3, 5/6)

Page 27:

First Correction

[Figure: the four points with the corrected weight vector]

W1 = (1/3, 5/6)

Page 28:

Updating Weights, Ctd.

The upper-left point is still wrongly classified

W2 ← W1 + η Z X1
W2 = (1/3, 5/6) + 1/3 · (–1) · (–1, 1/2) = (2/3, 2/3)

Page 29:

Second Correction

[Figure: the four points with the final weight vector]

W2 = (2/3, 2/3)

Page 30:

Example, Ctd.

All 4 points are now classified correctly
Toy problem – only 2 updates required
Correction of the weights was simply a rotation of the separating hyperplane
  Rotation can be applied in the right direction, but may require many updates
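The two corrections of the example can be reproduced with a few lines of NumPy (a sketch; only the misclassified point (–1, 1/2) with Z = –1, η = 1/3 and W0 = (0, 1) are taken from the slides):

    import numpy as np

    eta = 1.0 / 3.0
    w = np.array([0.0, 1.0])                 # W0
    x, z = np.array([-1.0, 0.5]), -1         # the wrongly classified point

    for step in (1, 2):
        if np.sign(np.dot(w, x)) != z:       # still on the wrong side
            w = w + eta * z * x              # W_{j+1} = W_j + eta * Z * X
        print(step, w)                       # -> (1/3, 5/6), then (2/3, 2/3)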

Page 31:

Support Vector Machines

Page 32:

Large margin classifier

Another family of linear algorithms

Intuition (Vapnik, 1965): if the classes are linearly separable
  Separate the data
  Place the hyperplane "far" from the data: large margin
  Statistical results guarantee good generalization

[Figure: a separating hyperplane close to the data – labeled BAD]

Gert Lanckriet, Statistical Learning Theory Tutorial

Page 33:

Large margin classifier

Intuition (Vapnik, 1965): if linearly separable
  Separate the data
  Place the hyperplane "far" from the data: large margin
  Statistical results guarantee good generalization

[Figure: a maximal margin classifier, with the hyperplane far from both classes – labeled GOOD]

Gert Lanckriet, Statistical Learning Theory Tutorial

Page 34:

Large margin classifier

If not linearly separable
  Allow some errors
  Still, try to place the hyperplane "far" from each class

Gert Lanckriet, Statistical Learning Theory Tutorial

Page 35:

Large Margin Classifiers

Advantages
  Theoretically better (better error bounds)
Limitations
  Computationally more expensive: a large quadratic programming problem

Page 36:

Non-linear problem

Page 37:

Non-linear problem

Page 38:

Non-linear problem

Kernel methods
  A family of non-linear algorithms
  Transform the non-linear problem into a linear one (in a different feature space)
  Use linear algorithms to solve the linear problem in the new space

Gert Lanckriet, Statistical Learning Theory Tutorial

Page 39:

Basic principle of kernel methods

X = [x z]

Φ: R^d → R^D   (D >> d)

Φ(X) = [x^2  z^2  xz]

f(x) = sign(w1·x^2 + w2·z^2 + w3·xz + b)

w^T Φ(x) + b = 0

Gert Lanckriet, Statistical Learning Theory Tutorial
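A hedged sketch of this particular feature map in NumPy (the function names are invented; the map Φ(x, z) = [x², z², xz] and the decision rule are the ones on the slide):

    import numpy as np

    def phi(X):
        # X has columns [x, z]; map each row to [x^2, z^2, x*z]
        x, z = X[:, 0], X[:, 1]
        return np.column_stack([x**2, z**2, x * z])

    def f(X, w, b):
        # f(x) = sign(w1*x^2 + w2*z^2 + w3*x*z + b) = sign(w^T phi(X) + b)
        return np.sign(phi(X) @ w + b)

    X = np.array([[1.0, 2.0], [-0.5, 0.3]])
    print(f(X, w=np.array([1.0, 1.0, -2.0]), b=-0.5))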

Page 40:

Basic principle of kernel methods

Linear separability: more likely in high dimensions
Mapping: Φ maps the input into a high-dimensional feature space
Classifier: construct a linear classifier in the high-dimensional feature space
Motivation: an appropriate choice of Φ leads to linear separability
We can do this efficiently!

Gert Lanckriet, Statistical Learning Theory Tutorial

Page 41:

Basic principle of kernel methods

We can use the linear algorithms seen before (for example, the perceptron) for classification in the higher-dimensional space

Page 42:

Multi-class classification

Given: some data items that belong to one of M possible classes
Task: train the classifier and predict the class for a new data item
Geometrically: a harder problem, no more simple geometry

Page 43:

Multi-class classification

Page 44:

Linear Classifiers

[Diagram: x → f(x, w, b) → y_est, where f(x, w, b) = sign(w·x + b); the two point symbols denote labels +1 and –1]

How would you classify this data?

[Figure: one candidate separating line w·x + b = 0, with w·x + b > 0 on one side and w·x + b < 0 on the other]

Page 45:

Linear Classifiers

f(x, w, b) = sign(w·x + b)

How would you classify this data?

[Figure: another candidate separating line for the same data]

Page 46:

Linear Classifiers

f(x, w, b) = sign(w·x + b)

How would you classify this data?

[Figure: yet another candidate separating line]

Page 47:

Linear Classifiers

f(x, w, b) = sign(w·x + b)

Any of these would be fine…
…but which is best?

[Figure: several candidate separating lines]

Page 48:

Linear Classifiers

f(x, w, b) = sign(w·x + b)

How would you classify this data?

[Figure: a poorly placed line; one point is misclassified into the +1 class]

Page 49:

Classifier Margin

f(x, w, b) = sign(w·x + b)

Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.

[Figure: the margin band around a separating line]

Page 50:

Maximum Margin

f(x, w, b) = sign(w·x + b)

The maximum margin linear classifier is the linear classifier with the, um, maximum margin.
This is the simplest kind of SVM (called a Linear SVM, or LSVM).

Support vectors are those datapoints that the margin pushes up against.

1. Maximizing the margin is good according to intuition and PAC theory.
2. It implies that only the support vectors are important; the other training examples are ignorable.
3. Empirically it works very, very well.

Page 51:

Let me digress to… what is PAC theory?

Two important aspects of complexity in machine learning:

First, sample complexity: in many learning problems, training data is expensive and we should hope not to need too much of it.

Second, computational complexity: a neural network, for example, which takes an hour to train may be of no practical use in complex financial prediction problems.

It is important that both the amount of training data required for a prescribed level of performance and the running time of the learning algorithm in learning from this data do not increase too dramatically as the "difficulty" of the learning problem increases.

Page 52:

Let me digress to… what is PAC theory?

Such issues have been formalised and investigated over the past decade within the field of "computational learning theory".

One popular framework for discussing such problems is the probabilistic framework which has become known as the "probably approximately correct", or PAC, model of learning.

Page 53:

Linear SVM Mathematically

What we know:
  w · x+ + b = +1
  w · x– + b = –1
  w · (x+ – x–) = 2

[Figure: the planes wx + b = +1, wx + b = 0, and wx + b = –1, with the "Predict Class = +1" zone on one side and the "Predict Class = –1" zone on the other; x+ and x– lie on the two margin planes]

M = Margin Width = (x+ – x–) · w / ||w|| = 2 / ||w||

Page 54:

Linear SVM Mathematically

Goal:

1) Correctly classify all training data:
     w·xi + b ≥ +1   if yi = +1
     w·xi + b ≤ –1   if yi = –1
   which is the same as
     yi (w·xi + b) ≥ 1   for all i

2) Maximize the margin M = 2 / ||w||
   same as minimize ½ wᵀw

We can formulate a Quadratic Optimization Problem and solve for w and b:

   Minimize  Φ(w) = ½ wᵀw
   subject to  yi (w·xi + b) ≥ 1   for all i
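In practice this quadratic program is rarely solved by hand; a hedged sketch using scikit-learn (an assumption, not part of the slides) approximates the hard-margin problem with a linear SVM and a very large C, and recovers the margin width 2/||w||:

    import numpy as np
    from sklearn.svm import SVC

    # the four linearly separable points from the perceptron example
    X = np.array([[0.5, 1.0], [1.0, 0.5], [-1.0, 0.5], [-1.0, 1.0]])
    y = np.array([1, 1, -1, -1])

    clf = SVC(kernel="linear", C=1e6)     # very large C approximates the hard margin
    clf.fit(X, y)
    w, b = clf.coef_[0], clf.intercept_[0]
    print(w, b, 2.0 / np.linalg.norm(w))  # weight vector, bias, margin width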

Page 55:

Solving the Optimization Problem

We need to optimize a quadratic function subject to linear constraints.

Quadratic optimization problems are a well-known class of mathematical programming problems, and many (rather intricate) algorithms exist for solving them.

The solution involves constructing a dual problem in which a Lagrange multiplier αi is associated with every constraint in the primal problem:

Primal: find w and b such that
  Φ(w) = ½ wᵀw is minimized,
  and for all {(xi, yi)}: yi (wᵀxi + b) ≥ 1

Dual: find α1…αN such that
  Q(α) = Σαi – ½ ΣΣ αiαj yiyj xiᵀxj is maximized, and
  (1) Σ αiyi = 0
  (2) αi ≥ 0 for all αi

Page 56:

A digression… Lagrange Multipliers

In mathematical optimization, the method of Lagrange multipliers provides a strategy for finding the maxima and minima of a function subject to constraints.

For instance, consider the optimization problem: maximize f(x, y) subject to g(x, y) = c.

We introduce a new variable (λ), called a Lagrange multiplier, and study the Lagrange function defined by
  Λ(x, y, λ) = f(x, y) + λ · (g(x, y) – c)
(the λ term may be either added or subtracted).

If (x, y) is a maximum for the original constrained problem, then there exists a λ such that (x, y, λ) is a stationary point for the Lagrange function (stationary points are those points where the partial derivatives of Λ are zero).

Page 57:

The Optimization Problem Solution

The solution has the form:
  w = Σ αi yi xi
  b = yk – wᵀxk   for any xk such that αk ≠ 0

Each non-zero αi indicates that the corresponding xi is a support vector.

Then the classifying function will have the form:
  f(x) = Σ αi yi xiᵀx + b

Notice that it relies on an inner product between the test point x and the support vectors xi.
Also keep in mind that solving the optimization problem involved computing the inner products xiᵀxj between all pairs of training points.

Page 58:

Dataset with noise

Hard margin: so far we require all data points to be classified correctly
  – No training error
What if the training set is noisy?
  – Solution 1: use very powerful kernels

[Figure: the noisy +1/–1 data separated perfectly by a very complex boundary – OVERFITTING!]

Page 59:

Soft Margin Classification

Slack variables ξi can be added to allow misclassification of difficult or noisy examples.

[Figure: the planes wx + b = +1, 0, –1, with a few points inside or beyond the margin, each marked with its slack ξ]

What should our quadratic optimization criterion be?

Minimize  ½ w·w + C Σ_{k=1..R} ξk

Page 60:

Hard Margin vs. Soft Margin

The old formulation:
  Find w and b such that
  Φ(w) = ½ wᵀw is minimized, and for all {(xi, yi)}: yi (wᵀxi + b) ≥ 1

The new formulation, incorporating slack variables:
  Find w and b such that
  Φ(w) = ½ wᵀw + C Σξi is minimized, and for all {(xi, yi)}: yi (wᵀxi + b) ≥ 1 – ξi and ξi ≥ 0 for all i

The parameter C can be viewed as a way to control overfitting.
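A hedged sketch of how C behaves in practice (scikit-learn usage and the synthetic noisy data are assumptions): small C tolerates more slack and keeps more support vectors, large C moves toward the hard margin.

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
    y[:10] = -y[:10]                            # flip a few labels to simulate noise

    for C in (0.01, 1.0, 100.0):
        clf = SVC(kernel="linear", C=C).fit(X, y)
        print(C, clf.n_support_.sum(), clf.score(X, y))  # support vectors, train accuracy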

Page 61:

Linear SVMs: Overview

The classifier is a separating hyperplane.
The most "important" training points are the support vectors; they define the hyperplane.
Quadratic optimization algorithms can identify which training points xi are support vectors, i.e. those with non-zero Lagrange multipliers αi.
Both in the dual formulation of the problem and in the solution, training points appear only inside dot products:

  Find α1…αN such that
  Q(α) = Σαi – ½ ΣΣ αiαj yiyj xiᵀxj is maximized, and
  (1) Σ αiyi = 0
  (2) 0 ≤ αi ≤ C for all αi

  f(x) = Σ αi yi xiᵀx + b

Page 62:

Non-linear SVMs

Datasets that are linearly separable with some noise work out great.
But what are we going to do if the dataset is just too hard?
How about… mapping the data to a higher-dimensional space?

[Figure: one-dimensional examples on the x axis that cannot be split by a threshold become separable after mapping x → (x, x²)]

Page 63:

Non-linear SVMs: Feature spaces

General idea: the original input space can always be mapped to some higher-dimensional feature space where the training set is separable:

Φ: x → φ(x)

Page 64:

The "Kernel Trick"

The linear classifier relies on a dot product between vectors: K(xi, xj) = xiᵀxj.
If every data point is mapped into a high-dimensional space via some transformation Φ: x → φ(x), the dot product becomes:
  K(xi, xj) = φ(xi)ᵀφ(xj)
A kernel function is a function that corresponds to an inner product in some expanded feature space.

Example:
  2-dimensional vectors x = [x1 x2]; let K(xi, xj) = (1 + xiᵀxj)²

  Need to show that K(xi, xj) = φ(xi)ᵀφ(xj):

  K(xi, xj) = (1 + xiᵀxj)²
            = 1 + xi1²xj1² + 2 xi1xj1 xi2xj2 + xi2²xj2² + 2 xi1xj1 + 2 xi2xj2
            = [1  xi1²  √2 xi1xi2  xi2²  √2 xi1  √2 xi2]ᵀ [1  xj1²  √2 xj1xj2  xj2²  √2 xj1  √2 xj2]
            = φ(xi)ᵀφ(xj),  where φ(x) = [1  x1²  √2 x1x2  x2²  √2 x1  √2 x2]
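The algebra above can be checked numerically; a small NumPy sketch (the test vectors are arbitrary):

    import numpy as np

    def phi(x):
        x1, x2 = x
        # phi(x) = [1, x1^2, sqrt(2) x1 x2, x2^2, sqrt(2) x1, sqrt(2) x2]
        return np.array([1, x1**2, np.sqrt(2) * x1 * x2, x2**2,
                         np.sqrt(2) * x1, np.sqrt(2) * x2])

    xi, xj = np.array([0.7, -1.2]), np.array([2.0, 0.5])
    print(np.isclose((1 + xi @ xj) ** 2, phi(xi) @ phi(xj)))   # True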

Page 65:

What Functions are Kernels?

For some functions K(xi, xj), checking that K(xi, xj) = φ(xi)ᵀφ(xj) can be cumbersome.

Mercer's theorem:
  Every positive semi-definite symmetric function is a kernel

Page 66:

Examples of Kernel Functions

Linear: K(xi, xj) = xiᵀxj

Polynomial of power p: K(xi, xj) = (1 + xiᵀxj)^p

Gaussian (radial-basis function network): K(xi, xj) = exp(−||xi − xj||² / (2σ²))

Sigmoid: K(xi, xj) = tanh(β0 xiᵀxj + β1)
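The same kernels written out as plain NumPy functions (a sketch; sigma, p, beta0, beta1 are free parameters):

    import numpy as np

    def linear_kernel(xi, xj):
        return xi @ xj

    def polynomial_kernel(xi, xj, p=2):
        return (1 + xi @ xj) ** p

    def gaussian_kernel(xi, xj, sigma=1.0):
        return np.exp(-np.linalg.norm(xi - xj) ** 2 / (2 * sigma ** 2))

    def sigmoid_kernel(xi, xj, beta0=1.0, beta1=0.0):
        return np.tanh(beta0 * (xi @ xj) + beta1)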

Page 67:

Non-linear SVMs Mathematically

Dual problem formulation:
  Find α1…αN such that
  Q(α) = Σαi – ½ ΣΣ αiαj yiyj K(xi, xj) is maximized, and
  (1) Σ αiyi = 0
  (2) αi ≥ 0 for all αi

The solution is:
  f(x) = Σ αi yi K(xi, x) + b

Optimization techniques for finding the αi's remain the same!

Page 68:

Nonlinear SVM – Overview

SVM locates a separating hyperplane in the feature space and classifies points in that space.
It does not need to represent the space explicitly; it simply defines a kernel function.
The kernel function plays the role of the dot product in the feature space.

Page 69:

Properties of SVM

Flexibility in choosing a similarity function
Sparseness of solution when dealing with large data sets
  Only support vectors are used to specify the separating hyperplane
Ability to handle large feature spaces
  Complexity does not depend on the dimensionality of the feature space
Overfitting can be controlled by the soft margin approach
Nice math property: a simple convex optimization problem which is guaranteed to converge to a single global solution
Feature selection

Page 70:

SVM Applications

SVM has been used successfully in many real-world problems:
  – text (and hypertext) categorization
  – image classification – different types of sub-problems
  – bioinformatics (protein classification, cancer classification)
  – hand-written character recognition

Page 71:

Weakness of SVM

It is sensitive to noise
  A relatively small number of mislabeled examples can dramatically decrease the performance

It only considers two classes
  How to do multi-class classification with SVM? Answer:
  1. With output arity m, learn m SVMs:
       SVM 1 learns "Output == 1" vs. "Output != 1"
       SVM 2 learns "Output == 2" vs. "Output != 2"
       …
       SVM m learns "Output == m" vs. "Output != m"
  2. To predict the output for a new input, just predict with each SVM and find out which one puts the prediction the furthest into the positive region.
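A hedged sketch of this one-vs-rest scheme (scikit-learn usage and the iris data are assumptions): one binary SVM per class, then the class whose SVM pushes the prediction furthest into the positive region wins.

    import numpy as np
    from sklearn.svm import LinearSVC
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.datasets import load_iris

    X, y = load_iris(return_X_y=True)                        # 3 classes
    ovr = OneVsRestClassifier(LinearSVC(max_iter=10000)).fit(X, y)

    scores = ovr.decision_function(X[:5])                    # one score per class
    print(np.argmax(scores, axis=1))                         # furthest positive wins
    print(ovr.predict(X[:5]))                                # same answer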

Page 72:

Application: Text Categorization

Task: the classification of natural-text (or hypertext) documents into a fixed number of predefined categories based on their content
  – email filtering, web searching, sorting documents by topic, etc.

A document can be assigned to more than one category, so this can be viewed as a series of binary classification problems, one for each category.

Page 73:

Application: Face Expression Recognition

Construct the feature space, by use of eigenvectors or other means
A multiple-class problem, with several expressions
Use a multi-class SVM

Page 74:

Some Issues

Choice of kernel
  – A Gaussian or polynomial kernel is the default
  – If ineffective, more elaborate kernels are needed

Choice of kernel parameters
  – e.g. σ in the Gaussian kernel
  – σ is the distance between the closest points with different classifications
  – In the absence of reliable criteria, applications rely on the use of a validation set or cross-validation to set such parameters

Optimization criterion – hard margin vs. soft margin
  – A lengthy series of experiments in which various parameters are tested
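A hedged sketch of setting such parameters by cross-validation (scikit-learn usage and the iris data are assumptions; gamma plays the role of 1/(2σ²) in the Gaussian kernel):

    from sklearn.svm import SVC
    from sklearn.model_selection import GridSearchCV
    from sklearn.datasets import load_iris

    X, y = load_iris(return_X_y=True)
    param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}
    search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5).fit(X, y)
    print(search.best_params_)                 # cross-validated choice of C and gamma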

Page 75:

Additional Resources

LibSVM

An excellent tutorial on VC-dimension and Support Vector Machines:
  C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121–167, 1998.

The VC/SRM/SVM bible:
  Statistical Learning Theory by Vladimir Vapnik, Wiley-Interscience, 1998

http://www.kernel-machines.org/

Page 76:

References

Support Vector Machine Classification of Microarray Gene Expression Data. Michael P. S. Brown, William Noble Grundy, David Lin, Nello Cristianini, Charles Sugnet, Manuel Ares, Jr., David Haussler.

www.cs.utexas.edu/users/mooney/cs391L/svm.ppt

Text Categorization with Support Vector Machines: Learning with Many Relevant Features. T. Joachims, ECML 1998.