Page 1: Computational Intelligence:  Methods and Applications

Computational Intelligence: Methods and Applications

Lecture 24

SVM in the non-linear case

Włodzisław Duch

Dept. of Informatics, UMK

Google: W Duch

Page 2: Computational Intelligence:  Methods and Applications

Non-separable picture

Unfortunately, for non-separable data not all conditions can be fulfilled: some data points do not lie outside the two hyperplanes, so new "slack" scalar variables ξi are introduced into the separability conditions.

The margin between the two distributions of points is still defined as the distance between the two hyperplanes parallel to the decision border; this remains valid for most data, but there are now two points on the wrong side of the decision hyperplane and one point inside the margin.

Page 3: Computational Intelligence:  Methods and Applications

Non-separable case

The problem becomes slightly more difficult: the optimization is still a quadratic program, but the Lagrangian now has saddle points (it is minimized over W and maximized over α).

Conditions, with slack variables ξi ≥ 0:

$$g(\mathbf{X}^{(i)}) = \mathbf{W}^T\mathbf{X}^{(i)} + W_0 \ge +1 - \xi_i \quad \text{for } Y^{(i)} = +1$$
$$g(\mathbf{X}^{(i)}) = \mathbf{W}^T\mathbf{X}^{(i)} + W_0 \le -1 + \xi_i \quad \text{for } Y^{(i)} = -1, \; \xi_i \ge 0$$

If ξi > 1 then the point is on the wrong side of the g(X)=0 plane and is misclassified. Dilemma: reduce the number of misclassified points, or keep a large classification margin, hoping for better generalization on future data despite some errors made now. This is expressed by minimizing:

$$\frac{1}{2}\|\mathbf{W}\|^2 + C\sum_{i=1}^{n}\xi_i$$

adding a user-defined parameter C and leading to the same solution as before, with a bound on the αi:

0 ≤ αi ≤ C; smaller C gives larger margins (see Webb, Chap. 4.2.5).
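As an illustration (not part of the original slide), a minimal sketch of this soft-margin objective in Python, assuming NumPy arrays X (one row per vector) and Y with labels ±1; the names slack and soft_margin_objective are hypothetical:

import numpy as np

def slack(W, W0, X, Y):
    # Slack variables xi_i = max(0, 1 - Y_i (W^T X_i + W0)); xi_i > 1 means X_i is misclassified.
    return np.maximum(0.0, 1.0 - Y * (X @ W + W0))

def soft_margin_objective(W, W0, X, Y, C=1.0):
    # (1/2)||W||^2 + C * sum_i xi_i -- the quantity minimized in the non-separable case.
    return 0.5 * W @ W + C * slack(W, W0, X, Y).sum()

Smaller C puts less weight on the slack term, so larger margins (and more margin violations) are tolerated.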

Page 4: Computational Intelligence:  Methods and Applications

SVM: non-separable

Non-separable case conditions, using slack variables:

$$g(\mathbf{X}^{(i)}) = \mathbf{W}^T\mathbf{X}^{(i)} + W_0 \ge +1 - \xi_i \quad \text{for } Y^{(i)} = +1$$
$$g(\mathbf{X}^{(i)}) = \mathbf{W}^T\mathbf{X}^{(i)} + W_0 \le -1 + \xi_i \quad \text{for } Y^{(i)} = -1, \; \xi_i \ge 0$$

Lagrangian with a penalty for errors scaled by the C coefficient:

$$L(\mathbf{W}, W_0, \boldsymbol{\alpha}) = \frac{1}{2}\|\mathbf{W}\|^2 + C\sum_{i=1}^{n}\xi_i - \sum_{i=1}^{n}\alpha_i\left[Y^{(i)}\left(\mathbf{W}^T\mathbf{X}^{(i)} + W_0\right) - 1 + \xi_i\right], \qquad 0 \le \alpha_i \le C$$

Min over W, max over α. Discriminant function with regularization conditions:

$$g(\mathbf{X}) = \mathbf{W}^T\mathbf{X} + W_0 = \sum_{i=1}^{n}\alpha_i Y^{(i)}\,\mathbf{X}^{(i)T}\mathbf{X} + W_0$$

Coefficients αi are obtained from the quadratic programming problem, and W0 from the support vectors, for which Y(i)g(X(i))=1:

$$W_0 = Y^{(i)} - \mathbf{W}^T\mathbf{X}^{(i)}$$
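A hedged sketch of the discriminant function in dual form and of recovering W0 from an unbound support vector, assuming NumPy arrays alpha, Y, X from a solved quadratic program; the function names are hypothetical:

import numpy as np

def discriminant(x, alpha, Y, X, W0):
    # g(x) = sum_i alpha_i Y_i (X_i . x) + W0, the linear SVM discriminant in dual form.
    return np.sum(alpha * Y * (X @ x)) + W0

def bias_from_support_vector(alpha, Y, X, C, tol=1e-8):
    # W0 = Y_i - W^T X_i for any unbound support vector (0 < alpha_i < C).
    W = (alpha * Y) @ X                                      # W = sum_i alpha_i Y_i X_i
    sv = np.where((alpha > tol) & (alpha < C - tol))[0][0]   # index of one unbound support vector
    return Y[sv] - W @ X[sv]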

Page 5: Computational Intelligence:  Methods and Applications

Support Vectors (SV)

The term known as the KKT sum (from Karush-Kuhn-Tucker, who used it in optimization theory):

$$L_c(\mathbf{W}, \boldsymbol{\alpha}) = -\sum_{i=1}^{n}\alpha_i\left[Y^{(i)}\mathbf{W}^T\mathbf{X}^{(i)} - 1\right], \qquad \alpha_i \ge 0$$

is large and positive for misclassified vectors, and therefore vectors near the border, with Y(i)g(X(i)) ≈ 1, should have non-zero αi to influence W.

This term is negative for correctly classified vectors far from the Hi hyperplanes; selecting αi = 0 will maximize the Lagrangian L(W,α).

Some α have to be non-zero, otherwise the classification conditions Y(i)g(X(i)) − 1 > 0 would not be enforced and the discriminating function would reduce to W0.

The dual form, with W expressed through α, is easier to use; it is maximized with one additional equality constraint:

$$L(\boldsymbol{\alpha}) = \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j Y^{(i)}Y^{(j)}\mathbf{X}^{(i)T}\mathbf{X}^{(j)}, \qquad \sum_{i=1}^{n}\alpha_i Y^{(i)} = 0, \quad 0 \le \alpha_i \le C, \quad i = 1..n$$
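As a small illustration (assumed NumPy arrays, hypothetical function names), the dual Lagrangian and its constraints can be evaluated directly:

import numpy as np

def dual_objective(alpha, Y, X):
    # L(alpha) = sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j Y_i Y_j X_i.X_j (to be maximized).
    K = X @ X.T                                   # Gram matrix of dot products
    return alpha.sum() - 0.5 * (alpha * Y) @ K @ (alpha * Y)

def dual_feasible(alpha, Y, C, tol=1e-6):
    # Constraints: sum_i alpha_i Y_i = 0 and 0 <= alpha_i <= C.
    return abs(alpha @ Y) < tol and bool(np.all(alpha >= -tol) and np.all(alpha <= C + tol))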

Page 6: Computational Intelligence:  Methods and Applications

Mechanical analogy

Mechanical analogy: imagine the g(X)=0 hyperplane as a membrane, with each support vector X(i) exerting a force on it in the Y(i)W direction. Stability requires the forces to sum to zero, leading to:

$$\mathbf{F}_i = \alpha_i Y^{(i)}\frac{\mathbf{W}}{\|\mathbf{W}\|}, \quad i = 1..n_{sv}, \qquad \sum_{i=1}^{n_{sv}}\mathbf{F}_i = \frac{\mathbf{W}}{\|\mathbf{W}\|}\sum_{i=1}^{n_{sv}}\alpha_i Y^{(i)} = 0$$

This is the same as the auxiliary SVM condition.

All torques should also sum to zero:

$$\sum_{i=1}^{n_{sv}}\mathbf{X}^{(i)}\times\mathbf{F}_i = \sum_{i=1}^{n_{sv}}\alpha_i Y^{(i)}\,\mathbf{X}^{(i)}\times\frac{\mathbf{W}}{\|\mathbf{W}\|} = \mathbf{W}\times\frac{\mathbf{W}}{\|\mathbf{W}\|} = 0$$

The sum is zero if the SVM expression for W is used.
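A tiny numerical check of the force and torque balance, for 3-dimensional data so that the cross product is defined; the alpha values below are hypothetical, chosen only to satisfy sum_i alpha_i Y_i = 0:

import numpy as np

X = np.array([[1.0, 0.0, 0.2], [0.0, 1.0, -0.1], [-0.5, -0.5, 0.0]])   # three support vectors
Y = np.array([1.0, 1.0, -1.0])
alpha = np.array([0.5, 0.5, 1.0])

W = (alpha * Y) @ X                                        # SVM expression for W
forces = (alpha * Y)[:, None] * W / np.linalg.norm(W)      # F_i = alpha_i Y_i W/||W||
print(forces.sum(axis=0))                                  # force balance: ~0, since sum_i alpha_i Y_i = 0
print(np.cross(X, forces).sum(axis=0))                     # torque balance: ~0, since W x W = 0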

Page 7: Computational Intelligence:  Methods and Applications

Sequential Minimal Optimization

SMO: solve the smallest possible optimization problem at each step (J. Platt, Microsoft).

Idea similar to the Jacobi rotation method with 2x2 rotations, but here applied to the quadratic optimization.

A valid solution for min L(α) is obtained when all conditions are fulfilled:

$$\alpha_i = 0 \;\Rightarrow\; Y^{(i)}g(\mathbf{X}^{(i)}) \ge 1$$
$$0 < \alpha_i < C \;\Rightarrow\; Y^{(i)}g(\mathbf{X}^{(i)}) = 1$$
$$\alpha_i = C \;\Rightarrow\; Y^{(i)}g(\mathbf{X}^{(i)}) \le 1$$

ε is the accuracy to which these conditions should be fulfilled (typically ε = 0.001).

SMO: find all examples X(i) that violate these conditions; select those for which αi is neither 0 nor C (non-bound cases).

Take a pair αi, αj and find analytically the values that minimize their contribution to the Lagrangian L(α).

Complexity: problem size ~ n², solution complexity ~ nsv².
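A minimal sketch of the violation test SMO relies on (not Platt's full algorithm), assuming a vector Yg with Yg[i] = Y(i)g(X(i)); the function names are hypothetical:

import numpy as np

def kkt_violations(alpha, Yg, C, eps=1e-3):
    # Indices violating the conditions above, to the accuracy eps.
    at_zero = (alpha <= eps) & (Yg < 1.0 - eps)                             # alpha_i = 0 but Y_i g(X_i) < 1
    at_C = (alpha >= C - eps) & (Yg > 1.0 + eps)                            # alpha_i = C but Y_i g(X_i) > 1
    unbound = (alpha > eps) & (alpha < C - eps) & (np.abs(Yg - 1.0) > eps)  # 0 < alpha_i < C but Y_i g(X_i) != 1
    return np.where(at_zero | at_C | unbound)[0]

def non_bound(alpha, C, eps=1e-3):
    # Non-bound cases (alpha_i neither 0 nor C), preferred when selecting the pair to optimize.
    return np.where((alpha > eps) & (alpha < C - eps))[0]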

Page 8: Computational Intelligence:  Methods and Applications

Examples of linear SVM

SVM SMO, Sequential Minimal Optimization, is implemented in WEKA with linear and polynomial kernels.

The only user-adjustable parameter for the linear version is C; for the non-linear version the polynomial degree may also be set.

In GhostMiner 1.5 the optimal value of C may be found automatically by crossvalidation training (see the sketch below).

For the non-linear version the type of kernel function and the kernel parameters may be adjusted (GM).

Many free software packages/papers are at: www.kernel-machines.org

Example 1: Gaussian data clusters

Example 2: Cleveland Heart data
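As a hedged illustration of finding C by crossvalidation with freely available tools (scikit-learn here, not the WEKA/GhostMiner packages named above; the data are synthetic):

from sklearn.datasets import make_blobs
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.0, random_state=0)   # two overlapping Gaussian clusters
search = GridSearchCV(SVC(kernel="linear"), {"C": [0.01, 0.1, 1, 10, 100, 1000]}, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)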

Page 9: Computational Intelligence:  Methods and Applications

Examples of linear SVM

Examples of mixture data with overlapping classes; the Bayesian non-linear decision borders and the linear SVM borders with margins are shown.

With C=10000 the margin is small; with C=0.01 the margin is larger. The errors seem to be reversed here! Large C fits the region around the decision plane better, but is not better overall (the model is too simple), so it should have a lower training error but a higher test error; for small C the margin is large, the training error is slightly larger but the test error is lower.

Fig. 12.2 from Hastie et al. (2001); data from a mixture of Gaussians.
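A short sketch reproducing this qualitative effect of C on synthetic mixture data (not the exact data of Fig. 12.2), assuming scikit-learn is available:

from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_blobs(n_samples=400, centers=2, cluster_std=3.0, random_state=1)   # overlapping classes
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=1)
for C in (10000.0, 0.01):
    svm = SVC(kernel="linear", C=C).fit(X_tr, y_tr)
    print(f"C={C}: train error {1 - svm.score(X_tr, y_tr):.3f}, test error {1 - svm.score(X_te, y_te):.3f}")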

Page 10: Computational Intelligence:  Methods and Applications

Letter recognition

Categorization of text samples. Set different rejection rates and calculate:
Recall = P+|+ = P++/P+ = TP/(TP+FN) and Precision = P++/(P++ + P−+) = TP/(TP+FP).
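A minimal sketch of these definitions, with hypothetical confusion-matrix counts:

def recall_precision(tp, fp, fn):
    # Recall = TP/(TP+FN) = P(+|+); Precision = TP/(TP+FP), as defined above.
    return tp / (tp + fn), tp / (tp + fp)

recall, precision = recall_precision(tp=90, fp=15, fn=10)   # hypothetical counts at one rejection rate
print(f"recall={recall:.2f}, precision={precision:.2f}")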