
Page 1:

Mathematical Programming in Support Vector Machines

Olvi L. Mangasarian

University of Wisconsin - Madison

High Performance Computation for Engineering Systems Seminar

MIT October 4, 2000

Page 2:

What is a Support Vector Machine?

An optimally defined surface
Typically nonlinear in the input space
Linear in a higher dimensional space
Implicitly defined by a kernel function

Page 3:

What are Support Vector Machines Used For?

Classification
Regression & Data Fitting
Supervised & Unsupervised Learning

(Will concentrate on classification)

Page 4:

Example of Nonlinear Classifier: Checkerboard Classifier

Page 5:

Outline of Talk

Generalized support vector machines (SVMs)
  Completely general kernel allows complex classification (No Mercer condition!)
Smooth support vector machines (SSVM)
  Smooth & solve SVM by a fast Newton method
Lagrangian support vector machines (LSVM)
  Very fast simple iterative scheme; one matrix inversion: No LP. No QP.
Reduced support vector machines (RSVM)
  Handle large datasets with nonlinear kernels

Page 6:

Generalized Support Vector Machines
2-Category Linearly Separable Case

[Figure: point sets A+ and A− separated by the bounding planes x'w = γ + 1 and x'w = γ − 1, with normal vector w]

Page 7:

Generalized Support Vector Machines
Algebra of the 2-Category Linearly Separable Case

Given m points in n-dimensional space, represented by an m-by-n matrix A. Membership of each point A_i in class +1 or −1 is specified by an m-by-m diagonal matrix D with +1 and −1 entries.

Separate by two bounding planes x'w = γ ± 1:

A_i w ≥ γ + 1, for D_ii = +1;
A_i w ≤ γ − 1, for D_ii = −1.

More succinctly:

D(Aw − eγ) ≥ e,

where e is a vector of ones.
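The succinct matrix form is easy to sanity-check numerically. A minimal NumPy sketch, where the four data points, the normal w, and γ are hand-picked toy values (not from the talk):

```python
import numpy as np

# Four separable points in 2-D: first two in class +1, last two in class -1
A = np.array([[2.0, 2.0],
              [3.0, 1.0],
              [-2.0, -1.0],
              [-3.0, -2.0]])
d = np.array([1.0, 1.0, -1.0, -1.0])   # diagonal of D
w = np.array([1.0, 1.0])               # a separating normal, chosen by hand
gamma = 0.0

# Succinct separability condition: D(Aw - e*gamma) >= e
lhs = d * (A @ w - gamma)
print(lhs)                  # [4. 4. 3. 5.]
print(np.all(lhs >= 1))    # True: both bounding planes are satisfied
```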

Page 8:

Generalized Support Vector Machines
Maximizing the Margin between Bounding Planes

[Figure: bounding planes x'w = γ + 1 and x'w = γ − 1 separating A+ and A−; the margin between them is 2/||w||_2]

Page 9:

Generalized Support Vector Machines
The Linear Support Vector Machine Formulation

Solve the following mathematical program for some ν > 0:

min_{w,γ,y} ν e'y + (1/2)||w||_2^2
s.t. D(Aw − eγ) + y ≥ e, y ≥ 0.

The nonnegative slack variable y is zero if and only if:

  The convex hulls of A+ and A− do not intersect
  ν is sufficiently large

in which case D(Aw − eγ) ≥ e.

Page 10:

Breast Cancer Diagnosis Application
97% Tenfold Cross Validation Correctness
780 Samples: 494 Benign, 286 Malignant

Page 11:

Another Application: Disputed Federalist Papers
Bosch & Smith 1998

56 Hamilton, 50 Madison, 12 Disputed

Page 12:

Generalized Support Vector Machine Motivation

(Nonlinear Kernel Without Mercer Condition)

Linear SVM, with linear separating surface x'w = γ:

min ν e'y + ||w||_1
s.t. D(Aw − eγ) + y ≥ e, y ≥ 0.

Set w = A'Du. Resulting linear surface: x'A'Du = γ:

min ν e'y + ||u||_1
s.t. D(AA'Du − eγ) + y ≥ e, y ≥ 0.

Replace AA' by an arbitrary nonlinear kernel K(A, A'). Resulting nonlinear surface: K(x', A')Du = γ:

min ν e'y + ||u||_1
s.t. D(K(A, A')Du − eγ) + y ≥ e, y ≥ 0.

Page 13:

SSVM: Smooth Support Vector Machine
(SVM as Unconstrained Minimization Problem)

Changing to the 2-norm and measuring the margin in (w, γ) space turns the SVM into an unconstrained minimization problem.

Page 14:

Smoothing the Plus Function: Integrate the Sigmoid Function

Page 15:

SSVM: The Smooth Support Vector Machine
Smoothing the Plus Function

Integrating the sigmoid approximation to the step function:

s(x, α) = 1 / (1 + exp(−αx)),

gives a smooth, excellent approximation to the plus function x_+ = max(x, 0):

p(x, α) = x + (1/α) log(1 + exp(−αx)), α > 0.

Replacing the plus function in the nonsmooth SVM by the smooth approximation gives our SSVM:

min_{w,γ} Φ_α(w, γ) := (ν/2) ||p(e − D(Aw − eγ), α)||_2^2 + (1/2) ||(w, γ)||_2^2
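The quality of this smoothing is easy to verify numerically. A minimal NumPy sketch (the evaluation grid and the α values below are illustrative choices, not from the talk):

```python
import numpy as np

def plus(x):
    """The plus function x_+ = max(x, 0)."""
    return np.maximum(x, 0.0)

def p_smooth(x, alpha):
    """Smooth approximation p(x, alpha) = x + (1/alpha) log(1 + exp(-alpha x))."""
    # log1p(exp(-alpha x)) is the softplus of -alpha x
    return x + np.log1p(np.exp(-alpha * x)) / alpha

x = np.linspace(-2.0, 2.0, 9)
for alpha in (1.0, 5.0, 25.0):
    err = np.max(np.abs(p_smooth(x, alpha) - plus(x)))
    print(alpha, err)   # the maximum gap is log(2)/alpha, attained at x = 0
```

The printed errors shrink as α grows, which is why a large α makes the smoothed problem a faithful stand-in for the nonsmooth one.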

Page 16:

Newton: Minimize a sequence of quadratic approximationsto the strongly convex objective function, i.e. solve a sequenceof linear equations in n+1 variables. (Small dimensional inputspace.)

Armijo: Shorten distance between successive iterates so as to generate sufficient decrease in objective function. (In computational reality, not needed!)

Global Quadratic Convergence: Starting from any point, the iterates are guaranteed to converge to the unique solution at a quadratic rate, i.e. errors get squared. (Typically 6 to 8 iterations, without an Armijo step.)
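The Newton-Armijo scheme described above can be sketched generically as follows. Note that the objective used here is a stand-in strongly convex function chosen for illustration, not the actual SSVM objective Φ_α:

```python
import numpy as np

def newton_armijo(f, grad, hess, x0, tol=1e-10, itmax=50):
    """Damped Newton: solve one linear system per step, backtrack for sufficient decrease."""
    x = np.asarray(x0, dtype=float)
    for it in range(itmax):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            return x, it
        d = np.linalg.solve(hess(x), -g)         # Newton direction
        t = 1.0
        # Armijo backtracking (rarely triggered in practice, per the slide)
        while f(x + t * d) > f(x) + 1e-4 * t * (g @ d):
            t *= 0.5
        x = x + t * d
    return x, itmax

# Stand-in strongly convex problem: f(x) = sum(exp(x)) + 0.5||x||^2 - b.x
b = np.array([1.0, 2.0, 3.0])
f = lambda x: np.exp(x).sum() + 0.5 * (x @ x) - b @ x
grad = lambda x: np.exp(x) + x - b
hess = lambda x: np.diag(np.exp(x)) + np.eye(len(x))

x_star, iters = newton_armijo(f, grad, hess, np.zeros(3))
print(iters, np.linalg.norm(grad(x_star)))   # a handful of iterations, tiny gradient
```

Each iteration costs one linear solve in the (small) number of variables, which is exactly why SSVM is fast when the input space is low-dimensional.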

Page 17:

SSVM with a Nonlinear Kernel Nonlinear Separating Surface in Input Space

Page 18:

Examples of Kernels
Generate Nonlinear Separating Surfaces in Input Space

A ∈ R^{m×n}, a ∈ R^m, μ ∈ R, d an integer

Polynomial Kernel:
(AA' + μaa')^d, with the dth power taken componentwise

Neural Network Kernel:
σ(AA' + μaa'), with σ applied componentwise, where σ: R → {0, 1} is the step function

Gaussian (Radial Basis) Kernel:
exp(−μ ||A_i − A_j||^2), i, j = 1, ..., m
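These kernels are straightforward to form with NumPy. A sketch (the μ and d values, and the choice of a as a vector of ones, are illustrative assumptions):

```python
import numpy as np

def polynomial_kernel(A, B, a, b, mu, d):
    # (AB' + mu * a b')^d, with the power taken componentwise
    return (A @ B.T + mu * np.outer(a, b)) ** d

def gaussian_kernel(A, B, mu):
    # K_ij = exp(-mu * ||A_i - B_j||^2)
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-mu * sq)

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))

K = gaussian_kernel(A, A, mu=0.5)
print(K.shape)                          # (5, 5)
print(np.allclose(np.diag(K), 1.0))     # True: ||A_i - A_i|| = 0

P = polynomial_kernel(A, A, np.ones(5), np.ones(5), mu=1.0, d=2)
print(P.shape)                          # (5, 5)
```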

Pages 19 to 25: [figure slides; no text content recovered]

LSVM: Lagrangian Support Vector Machine
Dual of SVM

Taking the dual of the SVM formulation:

min_{w,γ,y} (ν/2) y'y + (1/2)||(w, γ)||_2^2  s.t. D(Aw − eγ) + y ≥ e,

gives the following simple dual problem:

min_{0 ≤ u ∈ R^m} (1/2) u'(I/ν + D(AA' + ee')D)u − e'u.

The variables (w, γ, y) of SSVM are related to u by:

w = A'Du, y = u/ν, γ = −e'Du.

Page 26:

LSVM: Lagrangian Support Vector Machine
Dual SVM as Symmetric Linear Complementarity Problem

Defining the two matrices:

H = D[A −e], Q = I/ν + HH'

reduces the dual SVM to:

min_{0 ≤ u ∈ R^m} f(u) := (1/2) u'Qu − e'u.

The optimality condition for this dual SVM is the LCP:

0 ≤ u ⊥ Qu − e ≥ 0,

which, by the Implicit Lagrangian theory, is equivalent to:

Qu − e = ((Qu − e) − αu)_+, α > 0.

Page 27:

LSVM Algorithm
Simple & Linearly Convergent – One Small Matrix Inversion

u^{i+1} = Q^{−1}(e + ((Qu^i − e) − αu^i)_+), i = 0, 1, ...,

where 0 < α < 2/ν.

Key Idea: The Sherman-Morrison-Woodbury formula allows the inversion of an extremely large m-by-m matrix Q by merely inverting a much smaller (n+1)-by-(n+1) matrix, as follows:

(I/ν + HH')^{−1} = ν(I − H(I/ν + H'H)^{−1}H').
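The Sherman-Morrison-Woodbury identity above is easy to check numerically. A quick NumPy sketch (the dimensions, ν, and random H are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n1, nu = 200, 6, 0.5              # m >> n+1, as in the SVM setting
H = rng.standard_normal((m, n1))     # stands in for H = D[A -e]

# Direct inverse of the large m-by-m matrix
Q_inv = np.linalg.inv(np.eye(m) / nu + H @ H.T)

# SMW: only a small (n+1)-by-(n+1) inverse is needed
small = np.linalg.inv(np.eye(n1) / nu + H.T @ H)
Q_inv_smw = nu * (np.eye(m) - H @ small @ H.T)

print(np.allclose(Q_inv, Q_inv_smw))   # True
```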

Page 28:

LSVM Algorithm – Linear Kernel
11 Lines of MATLAB Code

function [it, opt, w, gamma] = svml(A,D,nu,itmax,tol)
% lsvm with SMW for min 1/2*u'*Q*u-e'*u s.t. u=>0,
% Q=I/nu+H*H', H=D[A -e]
% Input: A, D, nu, itmax, tol; Output: it, opt, w, gamma
% [it, opt, w, gamma] = svml(A,D,nu,itmax,tol);
[m,n]=size(A); alpha=1.9/nu; e=ones(m,1); H=D*[A -e]; it=0;
S=H*inv((speye(n+1)/nu+H'*H));
u=nu*(1-S*(H'*e)); oldu=u+1;
while it<itmax & norm(oldu-u)>tol
  z=(1+pl(((u/nu+H*(H'*u))-alpha*u)-1));
  oldu=u;
  u=nu*(z-S*(H'*z));
  it=it+1;
end;
opt=norm(u-oldu); w=A'*D*u; gamma=-e'*D*u;

function pl = pl(x); pl = (abs(x)+x)/2;
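For readers without MATLAB, here is a line-for-line NumPy translation of the routine above (a sketch: the label vector d stands in for the diagonal of D, and the toy two-cluster data are an illustrative assumption):

```python
import numpy as np

def lsvm(A, d, nu=1.0, itmax=100, tol=1e-5):
    """LSVM with SMW: min 1/2 u'Qu - e'u s.t. u >= 0, Q = I/nu + HH', H = D[A -e]."""
    m, n = A.shape
    alpha = 1.9 / nu
    e = np.ones(m)
    H = d[:, None] * np.hstack([A, -e[:, None]])          # H = D[A -e]
    S = H @ np.linalg.inv(np.eye(n + 1) / nu + H.T @ H)   # SMW helper
    u = nu * (e - S @ (H.T @ e))
    oldu, it = u + 1, 0
    while it < itmax and np.linalg.norm(oldu - u) > tol:
        # z = e + ((Qu - alpha*u) - e)_+ with Qu = u/nu + H(H'u)
        z = e + np.maximum(((u / nu + H @ (H.T @ u)) - alpha * u) - e, 0.0)
        oldu, u = u, nu * (z - S @ (H.T @ z))
        it += 1
    w = A.T @ (d * u)
    gamma = -e @ (d * u)
    return w, gamma, it

# Toy usage: two well-separated Gaussian clouds
rng = np.random.default_rng(0)
A = np.vstack([rng.standard_normal((50, 2)) + 3, rng.standard_normal((50, 2)) - 3])
d = np.concatenate([np.ones(50), -np.ones(50)])
w, gamma, it = lsvm(A, d, nu=1.0)
pred = np.sign(A @ w - gamma)
print(np.mean(pred == d))   # close to 1.0 on separable data
```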

Page 29:

LSVM Algorithm – Linear Kernel
Computational Results

2 million random points in 10-dimensional space: classified in 6.7 minutes, in 6 iterations, to 1e-5 accuracy, on a 250 MHz UltraSPARC II with 2 gigabytes of memory. CPLEX ran out of memory.

32562 points in 123-dimensional space (UCI Adult dataset): classified in 141 seconds & 55 iterations to 85% correctness, on a 400 MHz Pentium II with 2 gigabytes of memory. SVMlight classified the same data in 178 seconds & 4497 iterations.

Page 30:

LSVM – Nonlinear Kernel
Formulation

For the nonlinear kernel:

K(A, B): R^{m×n} × R^{n×l} → R^{m×l},

the separating nonlinear surface is given by:

K([x' −1], G')Du = 0,

where u is the solution of the dual problem:

min_{0 ≤ u ∈ R^m} f(u) := (1/2) u'Qu − e'u,

with Q redefined as:

G = [A −e], Q = I/ν + DK(G, G')D.

Page 31:

LSVM Algorithm – Nonlinear Kernel Application
100 Iterations, 58 Seconds on Pentium II, 95.9% Accuracy

Page 32:

Reduced Support Vector Machines (RSVM)

Large Nonlinear Kernel Classification Problems

Key idea: Use a rectangular kernel K(A, Ā'), where Ā' is a small random sample of A'.

Typically Ā has 1% to 10% of the rows of A.

Two important consequences:

  RSVM can solve very large problems
  The nonlinear separator depends only on Ā

Solve:

min_{ū,γ,y} (ν/2) y'y + (1/2)(ū'ū + γ^2)
s.t. D(K(A, Ā')D̄ū − eγ) + y ≥ e, y ≥ 0.

Separating surface: K(x', Ā')D̄ū = γ.

(Solving with the small square kernel K(Ā, Ā') instead gives lousy results.)
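Forming the rectangular reduced kernel can be sketched as follows (the Gaussian kernel choice, the data, and the 10% sampling rate are illustrative assumptions):

```python
import numpy as np

def gaussian_kernel(A, B, mu):
    # K_ij = exp(-mu * ||A_i - B_j||^2)
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-mu * sq)

rng = np.random.default_rng(0)
m, n = 1000, 2
A = rng.standard_normal((m, n))

# Abar: a small random sample (here 10%) of the rows of A
mbar = m // 10
idx = rng.choice(m, size=mbar, replace=False)
Abar = A[idx]

K = gaussian_kernel(A, Abar, mu=1.0)   # rectangular m-by-mbar kernel
print(K.shape)                          # (1000, 100)
```

The reduced problem then has only mbar kernel columns (and mbar dual variables ū) instead of m, which is what lets RSVM scale to very large datasets.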

Page 33:

Conventional SVM Result on Checkerboard Using 50 Random Points Out of 1000

Page 34:

RSVM Result on Checkerboard Using SAME 50 Random Points Out of 1000

Page 35:

RSVM on Large Classification Problems
Standard Error over 50 Runs = 0.001 to 0.002
RSVM Time = 1.24 * (Random Points Time)

Page 36:

Conclusion

Mathematical Programming plays an essential role in SVMs

Theory
  New formulations: Generalized SVMs
  New algorithm-generating concepts: Smoothing (SSVM), Implicit Lagrangian (LSVM)
Algorithms
  Fast: SSVM
  Massive: LSVM, RSVM

Page 37:

Future Research

Theory
  Concave minimization
  Concurrent feature & data selection
  Multiple-instance problems
  SVMs as complementarity problems

Algorithms
  Multicategory classification algorithms
  Kernel methods in nonlinear programming
  Chunking for massive classification: 10^8

Page 38:

Talk & Papers Available on Web

www.cs.wisc.edu/~olvi