
Page 1: Support Vector Machines Alessandro Moschitti

MACHINE LEARNING

Alessandro Moschitti

Department of Information and Communication Technology, University of Trento

Email: [email protected]

Support Vector Machines

Page 2: Support Vector Machines Alessandro Moschitti
Page 3: Support Vector Machines Alessandro Moschitti

Summary

  Support Vector Machines

  Hard-margin SVMs

  Soft-margin SVMs

Page 4: Support Vector Machines Alessandro Moschitti

Communications

  No lecture tomorrow (nor on Dec. 8)

  ML Exams:

  12 January 2011 at 9:00

  26 January 2011 at 9:00

  Exercise in Lab A201 (Polo scientifico e tecnologico)

  Wednesday 15 and 22 December, 2011

  Time: 8.30-10.30

Page 5: Support Vector Machines Alessandro Moschitti

Which hyperplane should we choose?

Page 6: Support Vector Machines Alessandro Moschitti

Classifier with a Maximum Margin

[Figure: a separating hyperplane in the Var1-Var2 plane, with the margin shown on both sides]

IDEA 1: Select the hyperplane with maximum margin

Page 7: Support Vector Machines Alessandro Moschitti

Support Vector

[Figure: the maximum-margin hyperplane in the Var1-Var2 plane; the points lying on the margin are the Support Vectors]

Page 8: Support Vector Machines Alessandro Moschitti

Support Vector Machine Classifiers

[Figure: the hyperplane w ⋅ x + b = 0 in the Var1-Var2 plane, with the two margin hyperplanes w ⋅ x + b = +k and w ⋅ x + b = −k]

The margin is equal to 2k/||w||

Page 9: Support Vector Machines Alessandro Moschitti

Support Vector Machines

[Figure: the hyperplane w ⋅ x + b = 0 with the margin hyperplanes w ⋅ x + b = ±k in the Var1-Var2 plane]

The margin is equal to 2k/||w||

We need to solve:

  max 2k/||w||

  w ⋅ x + b ≥ +k, if x is positive

  w ⋅ x + b ≤ −k, if x is negative

Page 10: Support Vector Machines Alessandro Moschitti

Support Vector Machines

[Figure: the hyperplane w ⋅ x + b = 0 with the margin hyperplanes w ⋅ x + b = ±1 in the Var1-Var2 plane]

There is a scale for which k = 1. The problem becomes:

  max 2/||w||

  w ⋅ x + b ≥ +1, if x is positive

  w ⋅ x + b ≤ −1, if x is negative

Page 11: Support Vector Machines Alessandro Moschitti

Final Formulation

  max 2/||w||
  w ⋅ xi + b ≥ +1, if yi = +1
  w ⋅ xi + b ≤ −1, if yi = −1

  ⇒ max 2/||w||
  yi(w ⋅ xi + b) ≥ 1

  ⇒ min ||w||/2
  yi(w ⋅ xi + b) ≥ 1

  ⇒ min ½ ||w||²
  yi(w ⋅ xi + b) ≥ 1

Page 12: Support Vector Machines Alessandro Moschitti

Optimization Problem

  Optimal Hyperplane:

  Minimize τ(w) = ½ ||w||²

  Subject to yi(w ⋅ xi + b) ≥ 1,  i = 1, ..., m

  The dual problem is simpler
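The following is a minimal numerical sketch of this optimization problem (not part of the original slides): it solves the hard-margin primal on a hypothetical, linearly separable toy dataset using scipy's general-purpose solver; all variable names and data values are illustrative.

# Hard-margin primal: minimize (1/2)||w||^2 subject to y_i (w . x_i + b) >= 1
import numpy as np
from scipy.optimize import minimize

# toy, linearly separable 2D data (illustrative values)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def objective(p):
    w = p[:2]
    return 0.5 * np.dot(w, w)                 # tau(w) = 1/2 ||w||^2

def margin_constraints(p):
    w, b = p[:2], p[2]
    return y * (X @ w + b) - 1.0              # every entry must be >= 0

res = minimize(objective, x0=np.zeros(3), method="SLSQP",
               constraints=[{"type": "ineq", "fun": margin_constraints}])
w, b = res.x[:2], res.x[2]
print("w =", w, "b =", b, "margin =", 2.0 / np.linalg.norm(w))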

Page 13: Support Vector Machines Alessandro Moschitti

Lagrangian Definition

Page 14: Support Vector Machines Alessandro Moschitti

Dual Optimization Problem

Page 15: Support Vector Machines Alessandro Moschitti

Dual Transformation

  To solve the dual problem we need to evaluate:

  Given the Lagrangian associated with our problem,

  let us set the derivatives to 0 with respect to w

Page 16: Support Vector Machines Alessandro Moschitti

Dual Transformation (cont’d)

  and with respect to b

  Then we substitute them into the Lagrange function
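For reference, since the resulting formula is not readable in this extraction: substituting w = Σi αi yi xi and Σi αi yi = 0 back into the Lagrangian gives the standard dual objective

  W(α) = Σi=1..m αi − ½ Σi,j=1..m αi αj yi yj (xi ⋅ xj)

to be maximized subject to αi ≥ 0 and Σi=1..m αi yi = 0.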

Page 17: Support Vector Machines Alessandro Moschitti

Final Dual Problem

Page 18: Support Vector Machines Alessandro Moschitti

Kuhn-Tucker Theorem

  Necessary and sufficient conditions for optimality

Page 19: Support Vector Machines Alessandro Moschitti

Properties coming from constraints

  Lagrange constraints:

    Σi=1..m αi yi = 0        w = Σi=1..m αi yi xi

  Karush-Kuhn-Tucker constraints:

    αi [yi (xi ⋅ w + b) − 1] = 0,  i = 1, ..., m

  Support Vectors are the examples with non-null αi

  To evaluate b, we can apply the KKT equation above to any support vector xi (for which αi ≠ 0): it gives yi (xi ⋅ w + b) = 1, hence b = yi − xi ⋅ w
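A minimal sketch (not from the slides) of how these properties are used in practice to recover the primal solution, assuming alphas, X, y hold a hypothetical dual solution and its training data:

# Recover w and b from a dual solution using the properties above.
import numpy as np

def primal_from_dual(alphas, X, y, tol=1e-8):
    # w = sum_i alpha_i y_i x_i
    w = (alphas * y) @ X
    # support vectors are the examples with non-null alpha_i
    sv = alphas > tol
    # for a support vector, y_i (x_i . w + b) = 1, hence b = y_i - x_i . w
    # (averaged over all support vectors for numerical stability)
    b = np.mean(y[sv] - X[sv] @ w)
    return w, b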

Page 20: Support Vector Machines Alessandro Moschitti

Warning!

  On the graphical examples, we always consider normalized hyperplanes (hyperplanes with a normalized gradient)

  In this case b is exactly the distance of the hyperplane from the origin

  So if the equation is not normalized, we may have, for example,

    x ⋅ w′ + b = 0  with  x = (x, y)  and  w′ = (1, 1)

  and b is not the distance

Page 21: Support Vector Machines Alessandro Moschitti

Warning!

  Let us consider a normalized gradient w = (1/√2, 1/√2):

    (x, y) ⋅ (1/√2, 1/√2) + b = 0  ⇒  x/√2 + y/√2 = −b  ⇒  y = −x − b√2

  Now we see that −b is exactly the distance.

  For x = 0, we have the intersection with the y-axis at −b√2. This distance projected on w is −b.
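A quick numeric check of this remark (illustrative, not from the slides): with a unit-norm gradient, |b| is indeed the distance of the hyperplane from the origin.

# With a normalized gradient, |b| equals the distance of the hyperplane w.x + b = 0 from the origin.
import numpy as np

w = np.array([1.0, 1.0]) / np.sqrt(2.0)   # normalized gradient (1/sqrt(2), 1/sqrt(2))
b = -3.0

# distance(origin, hyperplane) = |w . 0 + b| / ||w|| = |b| / ||w||
distance = abs(b) / np.linalg.norm(w)
print(distance, abs(b))                    # both equal 3.0 because ||w|| = 1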

Page 22: Support Vector Machines Alessandro Moschitti

Soft Margin SVMs

[Figure: the hyperplane w ⋅ x + b = 0 with margin hyperplanes w ⋅ x + b = ±1 in the Var1-Var2 plane, and slack variables for examples violating the margin]

Slack variables ξi are added

Some errors are allowed, but they penalize the objective function

Page 23: Support Vector Machines Alessandro Moschitti

Soft Margin SVMs

[Figure: the hyperplane w ⋅ x + b = 0 with margin hyperplanes w ⋅ x + b = ±1 and slack variables in the Var1-Var2 plane]

The new constraints are:

  yi(w ⋅ xi + b) ≥ 1 − ξi,  ∀xi, where ξi ≥ 0

The objective function penalizes the incorrectly classified examples:

  min ½ ||w||² + C Σi ξi

C is the trade-off between the margin and the error

Page 24: Support Vector Machines Alessandro Moschitti

Dual formulation

  By deriving with respect to w, ξ and b

Page 25: Support Vector Machines Alessandro Moschitti

Partial Derivatives

Page 26: Support Vector Machines Alessandro Moschitti

Substitution in the objective function

  Definition of the Kronecker delta δij

Page 27: Support Vector Machines Alessandro Moschitti

Final dual optimization problem

Page 28: Support Vector Machines Alessandro Moschitti

Soft Margin Support Vector Machines

  The algorithm tries to keep ξi low and maximize the margin:

    min ½ ||w||² + C Σi ξi

    yi(w ⋅ xi + b) ≥ 1 − ξi,  ∀xi

    ξi ≥ 0

  NB: the number of errors is not directly minimized (that would be an NP-complete problem); the distances from the hyperplane are minimized instead

  If C→∞, the solution tends to the one of the hard-margin algorithm

  Attention!!!: if C = 0 we get ||w|| = 0, since the slacks alone can satisfy the constraints: yi ⋅ b ≥ 1 − ξi, ∀xi

  If C increases, the number of errors decreases. When C tends to infinity, the number of errors must be 0, i.e. we recover the hard-margin formulation
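A small sketch of this behaviour (not from the slides), using scikit-learn's linear SVC as a stand-in for the soft-margin formulation above; the data and the C values are illustrative.

# Effect of C on a linear soft-margin SVM: larger C -> fewer training errors, smaller margin.
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2) + [2, 2], rng.randn(50, 2) - [2, 2]])
y = np.array([1] * 50 + [-1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    errors = np.sum(clf.predict(X) != y)
    margin = 2.0 / np.linalg.norm(clf.coef_)
    print(f"C={C}: training errors={errors}, margin={margin:.3f}")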

Page 29: Support Vector Machines Alessandro Moschitti

Robustness of Soft vs. Hard Margin SVMs

[Figure: two Var1-Var2 plots of the hyperplane w ⋅ x + b = 0, one for the Soft Margin SVM, where an outlier is absorbed by a slack ξi, and one for the Hard Margin SVM]

Page 30: Support Vector Machines Alessandro Moschitti

Soft vs Hard Margin SVMs

  Soft-Margin always has a solution

  Soft-Margin is more robust to odd (outlier) examples

  Hard-Margin does not require parameters

Page 31: Support Vector Machines Alessandro Moschitti

Parameters

  C: trade-off parameter

  J: cost factor

  min ½ ||w||² + C Σi ξi  =  min ½ ||w||² + C⁺ Σi∈positives ξi + C⁻ Σi∈negatives ξi

                          =  min ½ ||w||² + C (J Σi∈positives ξi + Σi∈negatives ξi)
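As an illustration of the cost-factor idea (not from the slides): in scikit-learn a comparable effect is obtained with class_weight, which rescales C per class; the value of J below is illustrative.

# Cost factor J: errors on positive examples weigh J times more than errors on negatives.
from sklearn.svm import SVC

J = 5.0
# class_weight multiplies C per class: effectively C+ = J * C for positives, C- = C for negatives
clf = SVC(kernel="linear", C=1.0, class_weight={1: J, -1: 1.0})
# clf.fit(X, y) would then penalize errors on positives J times more than errors on negatives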

Page 32: Support Vector Machines Alessandro Moschitti

Theoretical Justification

Page 33: Support Vector Machines Alessandro Moschitti

Definition of Training Set error

  Training Data:

    (x1, y1), ..., (xm, ym) ∈ Rᴺ × {±1},  learning a function f: Rᴺ → {±1}

  Empirical Risk (error):

    Remp[f] = (1/m) Σi=1..m ½ |f(xi) − yi|

  Risk (error):

    R[f] = ∫ ½ |f(x) − y| dP(x, y)
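A direct transcription of the empirical risk into code (an illustrative sketch, assuming a classifier f that returns ±1):

# Empirical risk: R_emp[f] = (1/m) * sum_i (1/2) |f(x_i) - y_i|
# With labels in {-1, +1} each term is 1 for an error and 0 otherwise, so this is the training error rate.
import numpy as np

def empirical_risk(f, X, y):
    predictions = np.array([f(x) for x in X])
    return np.mean(0.5 * np.abs(predictions - y))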

Page 34: Support Vector Machines Alessandro Moschitti

Error Characterization (part 1)

  From PAC-learning Theory (Vapnik):

  R(α) ≤ Remp(α) + ϕ(d/m, log(δ)/m)

  ϕ(d/m, log(δ)/m) = √( [d (log(2m/d) + 1) − log(δ/4)] / m )

where d is the VC-dimension, m is the number of examples, δ is the probability that the bound does not hold (the bound holds with probability at least 1 − δ), and α denotes the parameters of the classifier.
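A small numeric illustration of the confidence term (not from the slides), assuming the standard square-root form of Vapnik's bound given above; the values of d, m and δ are arbitrary.

# Confidence term of the VC bound: phi = sqrt( (d*(log(2m/d) + 1) - log(delta/4)) / m )
import numpy as np

def vc_confidence(d, m, delta):
    return np.sqrt((d * (np.log(2.0 * m / d) + 1.0) - np.log(delta / 4.0)) / m)

print(vc_confidence(d=10, m=1000, delta=0.05))   # grows with d, shrinks as m grows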

Page 35: Support Vector Machines Alessandro Moschitti

There are many versions of this bound for different settings

Page 36: Support Vector Machines Alessandro Moschitti

Error Characterization (part 2)

Page 37: Support Vector Machines Alessandro Moschitti

Ranking, Regression and Multiclassification

Page 38: Support Vector Machines Alessandro Moschitti

The Ranking SVM [Herbrich et al. 1999, 2000; Joachims et al. 2002]

  The aim is to classify instance pairs as correctly ranked or incorrectly ranked

  This turns an ordinal regression problem back into a binary classification problem

  We want a ranking function f such that

xi > xj iff f(xi) > f(xj)

  … or at least one that tries to do this with minimal error

  Suppose that f is a linear function

f(xi) = w ⋅ xi

• Sec.15.4.2

Page 39: Support Vector Machines Alessandro Moschitti

The Ranking SVM

  Ranking Model: f(xi)

• Sec.15.4.2

Page 40: Support Vector Machines Alessandro Moschitti

The Ranking SVM

  Then (combining the two equations on the last slide):

xi > xj iff w ⋅ xi − w ⋅ xj > 0

xi > xj iff w ⋅ (xi − xj) > 0

  Let us then create a new instance space from such pairs:

  zk = xi − xj

  yk = +1, −1 according to whether xi ≥ xj or xi < xj

• Sec.15.4.2
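A minimal sketch (not from the slides) of this pairwise transformation, assuming x_items is a hypothetical array of feature vectors with associated relevance scores:

# Build the pairwise training set z_k = x_i - x_j with label +1 / -1 for the Ranking SVM.
import numpy as np

def ranking_pairs(x_items, relevance):
    # x_items: (n, d) feature vectors; relevance: (n,) scores (higher = should be ranked higher)
    Z, labels = [], []
    for i in range(len(x_items)):
        for j in range(i + 1, len(x_items)):
            if relevance[i] == relevance[j]:
                continue                                   # no preference, no pair
            Z.append(x_items[i] - x_items[j])
            labels.append(1 if relevance[i] > relevance[j] else -1)
    return np.array(Z), np.array(labels)

The pairs (Z, labels) can then be fed to any binary SVM learner.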

Page 41: Support Vector Machines Alessandro Moschitti

Support Vector Ranking

  Given two examples xi and xj, we build one example (xi, xj)

Page 42: Support Vector Machines Alessandro Moschitti

Support Vector Regression (SVR)

[Figure: regression function f(x) with the ε-insensitive tube between +ε and −ε around it; the constraints and the solution are given on the slide]

Page 43: Support Vector Machines Alessandro Moschitti

Support Vector Regression (SVR)

[Figure: regression function f(x) with the ±ε tube; points outside the tube incur slacks ξ and ξ*]

Minimise:

  ½ wᵀw + C Σi=1..N (ξi + ξi*)

Constraints: expressed in terms of the slacks ξi and ξi*

Page 44: Support Vector Machines Alessandro Moschitti

Support Vector Regression

  yi is no longer −1 or +1; now it is a real value

  ε is the tolerance on our function value
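An illustrative sketch (not from the slides) using scikit-learn's SVR, whose epsilon parameter plays the role of the tolerance ε above; the data are synthetic.

# epsilon-insensitive Support Vector Regression on a synthetic 1D problem.
# Deviations |f(x_i) - y_i| smaller than epsilon cost nothing; larger ones pay through the slacks.
import numpy as np
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 5, 80)).reshape(-1, 1)
y = np.sin(X).ravel() + 0.1 * rng.randn(80)

model = SVR(kernel="rbf", C=1.0, epsilon=0.1).fit(X, y)
print(model.predict([[2.5]]))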

Page 45: Support Vector Machines Alessandro Moschitti

From Binary to Multiclass classifiers

  Three different approaches:

  ONE-vs-ALL (OVA)

  Given the example sets {E1, E2, E3, …} for the categories {C1, C2, C3, …}, the binary classifiers {b1, b2, b3, …} are built.

  For b1, E1 is the set of positives and E2∪E3 ∪… is the set of negatives, and so on

  For testing: given a classification instance x, the category is the one associated with the maximum margin among all binary classifiers
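A minimal sketch of the OVA scheme just described (illustrative, not from the slides), assuming binary classifiers that expose their margin through a decision function, as scikit-learn's SVC does:

# One-vs-All: one binary classifier per category; predict the category with the maximum margin.
import numpy as np
from sklearn.svm import SVC

def train_ova(X, y, categories):
    # for category c, its examples are the positives and all the other examples are the negatives
    return {c: SVC(kernel="linear").fit(X, np.where(y == c, 1, -1)) for c in categories}

def predict_ova(classifiers, x):
    margins = {c: clf.decision_function([x])[0] for c, clf in classifiers.items()}
    return max(margins, key=margins.get)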

Page 46: Support Vector Machines Alessandro Moschitti

From Binary to Multiclass classifiers

  ALL-vs-ALL (AVA)

  Given the examples {E1, E2, E3, …} for the categories {C1, C2, C3, …},

  build the binary classifiers:

{b1_2, b1_3,…, b1_n, b2_3, b2_4,…, b2_n,…,bn-1_n}

  by learning on E1 (positives) and E2 (negatives), on E1 (positives) and E3 (negatives) and so on…

  For testing: given an example x,

  all the votes of all classifiers are collected

  where bE1E2 = 1 means a vote for C1 and bE1E2 = -1 is a vote for C2

  Select the category that gets the most votes
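A minimal sketch of the AVA voting scheme (illustrative, not from the slides):

# All-vs-All: one classifier per pair of categories; each prediction is a vote; most votes wins.
from collections import Counter
from itertools import combinations
import numpy as np
from sklearn.svm import SVC

def train_ava(X, y, categories):
    classifiers = {}
    for c1, c2 in combinations(categories, 2):
        mask = (y == c1) | (y == c2)
        # c1 examples are the positives, c2 examples are the negatives
        classifiers[(c1, c2)] = SVC(kernel="linear").fit(X[mask], np.where(y[mask] == c1, 1, -1))
    return classifiers

def predict_ava(classifiers, x):
    votes = Counter()
    for (c1, c2), clf in classifiers.items():
        votes[c1 if clf.predict([x])[0] == 1 else c2] += 1
    return votes.most_common(1)[0][0]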

Page 47: Support Vector Machines Alessandro Moschitti

From Binary to Multiclass classifiers

  Error Correcting Output Codes (ECOC)

  The training set is partitioned according to binary sequences (codes) associated with category sets.

  For example, 10101 indicates that the sets of examples of C1, C3 and C5 are used to train the C10101 classifier.

  The data of the other categories, i.e. C2 and C4, will be negative examples

  In testing: the code-classifiers are used to decode one of the original classes, e.g. C10101 = 1 and C11010 = 1 indicates that the instance belongs to C1, i.e. the only class consistent with the codes
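A minimal sketch of the ECOC decoding step (illustrative, not from the slides); the code matrix and the classifier outputs below are hypothetical:

# ECOC decoding: choose the category whose code is most consistent (closest in Hamming distance)
# with the outputs of the code-classifiers.
import numpy as np

def ecoc_decode(code_matrix, classifier_outputs):
    # code_matrix: (n_categories, n_classifiers) of 0/1; classifier_outputs: (n_classifiers,) of 0/1
    distances = np.sum(code_matrix != classifier_outputs, axis=1)
    return int(np.argmin(distances))

codes = np.array([[1, 0, 1, 0, 1],    # C1
                  [0, 1, 0, 1, 0],    # C2
                  [1, 1, 0, 0, 1]])   # C3
print(ecoc_decode(codes, np.array([1, 0, 1, 0, 1])))   # prints 0, i.e. C1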

Page 48: Support Vector Machines Alessandro Moschitti

SVM-light: an implementation of SVMs

  Implements soft margin

  Contains the procedures for solving optimization problems

  Binary classifier

  Examples and descriptions on the web site:

http://www.joachims.org/

(http://svmlight.joachims.org/)
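For orientation (not part of the slides), a typical SVM-light run driven from Python might look as follows; the binary names, argument order and data format are recalled from the SVM-light documentation, so check the site above for the exact details.

# Train and classify with SVM-light, assuming svm_learn and svm_classify are on the PATH.
# train.dat / test.dat use SVM-light's sparse format: "<target> <feature>:<value> ...",
# e.g. "+1 1:0.43 3:0.12 9284:0.2".
import subprocess

subprocess.run(["svm_learn", "train.dat", "model.dat"], check=True)
subprocess.run(["svm_classify", "test.dat", "model.dat", "predictions.dat"], check=True)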

Page 49: Support Vector Machines Alessandro Moschitti

References

  A Tutorial on Support Vector Machines for Pattern Recognition, Chris Burges (downloadable article)

  The Vapnik-Chervonenkis Dimension and the Learning Capability of Neural Nets (downloadable presentation)

  Computational Learning Theory, Sally A. Goldman, Washington University, St. Louis, Missouri (downloadable article)

  An Introduction to Support Vector Machines (and Other Kernel-Based Learning Methods), N. Cristianini and J. Shawe-Taylor, Cambridge University Press (check our library)

  The Nature of Statistical Learning Theory, Vladimir N. Vapnik, Springer Verlag, 1999 (check our library)

Page 50: Support Vector Machines Alessandro Moschitti

Exercise

  1. The equations of SVMs for Classification, Ranking and Regression (you can get them from my slides).

  2. The perceptron algorithm for Classification, Ranking and Regression (you have to derive the last two by looking at what you wrote in point (1)).

  3. The same as point (2), using kernels (write the kernel definition as an introduction to this section).