
New Horizon in Machine Learning —

Support Vector Machine for non-Parametric Learning

Zhao Lu, Ph.D.

Associate Professor

Department of Electrical Engineering, Tuskegee University


Introduction

As an innovative non-parametric learning strategy, the Support Vector Machine (SVM) gained increasing popularity in the late 1990s. It is currently among the best performers for a variety of tasks, such as pattern recognition, regression, and signal processing.

Support vector learning algorithms:

Support vector classification for nonlinear pattern recognition;

Support vector regression for highly nonlinear function approximation.


Part I. Support Vector Learning for Classification


Overfitting in linearly separable classification


What is a good Decision Boundary?

Consider a two-class, linearly separable classification problem. Construct the hyperplane

w^T x + b = 0,   x ∈ R^n

with decision function f(x) = sign(w^T x + b), to make

w^T x_i + b > 0   for y_i = +1
w^T x_i + b < 0   for y_i = −1

Many decision boundaries satisfy these conditions! Are all decision boundaries equally good?

(Figure: points of Class 1 and Class 2 with several candidate decision boundaries.)


Examples of Bad Decision Boundaries

(Figure: two decision boundaries f(x) = sign(w^T x + b) that separate Class 1 and Class 2 but pass very close to the training points of one class.)

For linearly separable classes, new data from the same class are expected to lie close to the training data, so a decision boundary that hugs the training points of either class is a bad choice.


Optimal separating hyperplane

The optimal separating hyperplane (OSH) is defined as the hyperplane that maximizes the distance to the closest training points:

(w*, b*) = argmax_{w,b} min { ||x − x_i|| : x ∈ R^n, w^T x + b = 0, i = 1, 2, …, n }

It can be proved that the OSH is unique and located halfway between the margin hyperplanes |w^T x + b| = 1.

(Figure: Class 1 and Class 2 separated by w^T x + b = 0, with margin m up to the hyperplane w^T x + b = 1.)


Canonical separating hyperplane

A hyperplane w^T x + b = 0 is in canonical form with respect to all training data x_i ∈ X if:

min_{x_i ∈ X} |w^T x_i + b| = 1

Margin hyperplanes:

H_1: w^T x + b = +1
H_2: w^T x + b = −1

A canonical hyperplane having a maximal margin

m = min { ||x − x_i|| : x ∈ R^n, w^T x + b = 0, i = 1, 2, …, n }

is the ultimate learning goal, i.e. the optimal separating hyperplane.


Margin in terms of the norm of w

According to statistical learning theory, a large-margin decision boundary has excellent generalization capability.

For the canonical hyperplane, it can be proved that the margin is

m = 1 / ||w||

so the distance between the two margin hyperplanes H_1 and H_2 is 2/||w||.

Hence, maximizing the margin is equivalent to minimizing the squared norm ||w||^2.


Finding the optimal decision boundary

Let {x_1, …, x_n} be our data set and let y_i ∈ {+1, −1} be the class label of x_i.

The optimal decision boundary should classify all points correctly. It can be found by solving the following constrained optimization problem:

minimize    (1/2) ||w||^2
subject to  y_i (w^T x_i + b) ≥ 1,  ∀i

This is a quadratic optimization problem with linear inequality constraints.


Generalized Lagrangian Function

Consider the general (primal) optimization problem

minimize    f(w)
subject to  g_i(w) ≤ 0,  i = 1, …, k
            h_j(w) = 0,  j = 1, …, m

where the functions f, g_i (i = 1, …, k) and h_j (j = 1, …, m) are defined on a domain Ω. The generalized Lagrangian is defined as

L(w, α, β) = f(w) + Σ_{i=1}^{k} α_i g_i(w) + Σ_{j=1}^{m} β_j h_j(w) = f(w) + α^T g(w) + β^T h(w)


Dual Problem and Strong Duality Theorem

Given the primal optimization problem, its dual problem is defined as

maximize    θ(α, β) = inf_{w ∈ Ω} L(w, α, β)
subject to  α ≥ 0

Strong Duality Theorem: Given the primal optimization problem, where the domain Ω is convex and the constraints g_i and h_i are affine functions, the optimum of the primal problem occurs at the same value as the optimum of the dual problem.


Karush-Kuhn-Tucker Conditions

Given the primal optimization problem with the objective function f ∈ C^1 convex and g_i, h_i affine, necessary and sufficient conditions for w* to be an optimum are the existence of α*, β* such that

∂L(w*, α*, β*)/∂w = 0
∂L(w*, α*, β*)/∂β = 0
α_i* g_i(w*) = 0,   i = 1, …, k    (KKT complementarity condition)
g_i(w*) ≤ 0,        i = 1, …, k
α_i* ≥ 0,           i = 1, …, k


Lagrangian of the optimization problem

For the primal problem

minimize    (1/2) ||w||^2
subject to  1 − y_i (w^T x_i + b) ≤ 0,  ∀i

the Lagrangian is

L(w, b, α) = (1/2) w^T w + Σ_{i=1}^{n} α_i (1 − y_i (w^T x_i + b))

Setting the gradient of L w.r.t. w and b to zero, we have

∂L/∂w = 0  ⇒  w = Σ_{i=1}^{n} α_i y_i x_i    (parametric → non-parametric)
∂L/∂b = 0  ⇒  Σ_{i=1}^{n} α_i y_i = 0


The Dual Problem

If we substitute w = Σ_{i=1}^{n} α_i y_i x_i into the Lagrangian L(w, b, α), we have

L = (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} α_i α_j y_i y_j x_i^T x_j + Σ_{i=1}^{n} α_i (1 − y_i (Σ_{j=1}^{n} α_j y_j x_j^T x_i + b))
  = Σ_{i=1}^{n} α_i − (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} α_i α_j y_i y_j x_i^T x_j − b Σ_{i=1}^{n} α_i y_i
  = Σ_{i=1}^{n} α_i − (1/2) Σ_{i,j=1}^{n} α_i α_j y_i y_j x_i^T x_j

Note that Σ_{i=1}^{n} α_i y_i = 0, and the data points appear only in terms of their inner products x_i^T x_j; this is a quadratic function of the α_i only.


The Dual Problem

The new objective function is in terms of the α_i only.

The original problem is known as the primal problem; the objective function of the dual problem needs to be maximized.

The dual problem is therefore:

maximize    W(α) = Σ_{i=1}^{n} α_i − (1/2) Σ_{i,j=1}^{n} α_i α_j y_i y_j x_i^T x_j
subject to  α_i ≥ 0,  Σ_{i=1}^{n} α_i y_i = 0

The constraint α_i ≥ 0 comes from the properties of the Lagrange multipliers; Σ_{i=1}^{n} α_i y_i = 0 is the result of differentiating the original Lagrangian w.r.t. b.


The Dual Problem

Equivalently, as a minimization:

minimize    W(α) = (1/2) Σ_{i,j=1}^{n} α_i α_j y_i y_j x_i^T x_j − Σ_{i=1}^{n} α_i
subject to  α_i ≥ 0,  Σ_{i=1}^{n} α_i y_i = 0

This is a quadratic programming (QP) problem, and therefore a global minimum over the α_i can always be found.

w can be recovered by w = Σ_{i=1}^{n} α_i y_i x_i, so the decision function can be written in the following non-parametric form:

f(x) = sign( Σ_{i=1}^{n} α_i y_i x_i^T x + b )
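As an illustration only (not part of the original slides), here is a minimal sketch of solving this dual QP numerically, assuming the cvxopt package is available; the function and variable names are illustrative.

```python
import numpy as np
from cvxopt import matrix, solvers

def svm_dual_hard_margin(X, y):
    """Solve  min (1/2) a^T P a - 1^T a   s.t.  a_i >= 0,  y^T a = 0."""
    n = X.shape[0]
    P = matrix((np.outer(y, y) * (X @ X.T)).astype(float))  # P_ij = y_i y_j x_i^T x_j
    q = matrix(-np.ones(n))
    G = matrix(-np.eye(n))                                   # -a_i <= 0, i.e. a_i >= 0
    h = matrix(np.zeros(n))
    A = matrix(y.reshape(1, -1).astype(float))               # equality constraint y^T a = 0
    b = matrix(0.0)
    sol = solvers.qp(P, q, G, h, A, b)
    return np.ravel(sol['x'])                                # the dual variables alpha_i
```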


Conception of Support Vectors (SVs)

According to the Karush-Kuhn-Tucker (KKT) complementarity condition, the solution must satisfy

α_i [ y_i (w^T x_i + b) − 1 ] = 0,  ∀i

Thus, α_i ≠ 0 only for those points x_i that are closest to the classifying hyperplane, i.e. those with w^T x_i + b = y_i. These points are called support vectors.

From the KKT complementarity condition, the bias term b can be calculated by using the support vectors:

b = y_k − Σ_{i=1}^{n} α_i y_i x_i^T x_k    for any k with α_k ≠ 0
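Continuing the sketch above (again illustrative, not from the slides), the primal quantities and the support vectors can be recovered from the dual solution like this:

```python
import numpy as np

def recover_primal(X, y, alphas, tol=1e-6):
    """Recover w, b and the support-vector mask from the dual variables."""
    sv = alphas > tol                      # support vectors have alpha_i > 0
    w = (alphas[sv] * y[sv]) @ X[sv]       # w = sum_i alpha_i y_i x_i
    b = np.mean(y[sv] - X[sv] @ w)         # b = y_k - w^T x_k, averaged over the SVs
    return w, b, sv
```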


Sparseness of the solution

(Figure: ten training points from Class 1 and Class 2 with the separating hyperplane w^T x + b = 0 and the margin hyperplanes w^T x + b = ±1. Only the support vectors receive nonzero multipliers, e.g. α_1 = 0.8, α_6 = 1.4, α_8 = 0.6, while α_2 = α_3 = α_4 = α_5 = α_7 = α_9 = α_10 = 0.)


The use of slack variables

We allow “errors” ξ_i in classification for noisy data.

(Figure: Class 1 and Class 2 with the hyperplanes w^T x + b = 0 and w^T x + b = ±1; the points x_i and x_j lie on the wrong side of their margin hyperplane, with slacks ξ_i and ξ_j.)


Soft Margin Hyperplane

The use of slack variables ξ_i enables the soft margin classifier:

w^T x_i + b ≥ +1 − ξ_i   for y_i = +1
w^T x_i + b ≤ −1 + ξ_i   for y_i = −1
ξ_i ≥ 0

The ξ_i are “slack variables” in the optimization; note that ξ_i = 0 if there is no error for x_i.

The objective function becomes

(1/2) ||w||^2 + C Σ_{i=1}^{n} ξ_i

where C is the tradeoff parameter between error and margin.

The primal optimization problem becomes

minimize    (1/2) ||w||^2 + C Σ_{i=1}^{n} ξ_i
subject to  y_i (w^T x_i + b) ≥ 1 − ξ_i,  ξ_i ≥ 0


Dual Soft-Margin Optimization Problem

The dual of this new constrained optimization problem is

maximize    W(α) = Σ_{i=1}^{n} α_i − (1/2) Σ_{i,j=1}^{n} α_i α_j y_i y_j x_i^T x_j
subject to  0 ≤ α_i ≤ C,  Σ_{i=1}^{n} α_i y_i = 0

w can be recovered as w = Σ_{i=1}^{n} α_i y_i x_i.

This is very similar to the optimization problem in the hard-margin case, except that there is now an upper bound C on the α_i.

Once again, a QP solver can be used to find the α_i.
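In the QP sketch given earlier (illustrative, not the author's code), the soft margin only changes the inequality constraints: the box constraint α_i ≤ C is stacked on top of α_i ≥ 0.

```python
import numpy as np
from cvxopt import matrix, solvers

def svm_dual_soft_margin(X, y, C=1.0):
    """Soft-margin dual: same QP as before plus the upper bound alpha_i <= C."""
    n = X.shape[0]
    P = matrix((np.outer(y, y) * (X @ X.T)).astype(float))
    q = matrix(-np.ones(n))
    G = matrix(np.vstack([-np.eye(n), np.eye(n)]))           # a_i >= 0 and a_i <= C
    h = matrix(np.hstack([np.zeros(n), C * np.ones(n)]))
    A = matrix(y.reshape(1, -1).astype(float))
    b = matrix(0.0)
    return np.ravel(solvers.qp(P, q, G, h, A, b)['x'])
```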


Nonlinearly separable problems


Extension to Non-linear Decision Boundary

How can the linear large-margin classifier be extended to the nonlinear case?

Cover’s theorem

Consider a space made up of nonlinearly separable patterns.

Cover’s theorem states that such a multi-dimensional space can be transformed into a new feature space where the patterns are linearly separable with a high probability, provided two conditions are satisfied:

(1) The transform is nonlinear;

(2) The dimensionality of the feature space is high enough;


Non-linear SVMs: Feature spaces

General idea: by using a nonlinear transformation, the data in the original input space can always be mapped into some higher-dimensional feature space where the training data become linearly separable:

Φ: x → φ(x)

kernel visualization: http://www.youtube.com/watch?v=9NrALgHFwTo


Transforming the data

Key idea: transform x_i to a higher-dimensional space by using a nonlinear transformation φ.

Input space: the space where the points x_i are located. Feature space: the space of the φ(x_i) after the transformation.

Curse of dimensionality: computation in the feature space can be very costly because it is high-dimensional; indeed, the feature space is typically infinite-dimensional!

This ‘curse of dimensionality’ can be surmounted on the strength of the kernel function, because the inner product in feature space is just a scalar; this is the most appealing characteristic of SVM.


Kernel trick

Recall the SVM dual optimization problem:

maximize    W(α) = Σ_{i=1}^{n} α_i − (1/2) Σ_{i,j=1}^{n} α_i α_j y_i y_j x_i^T x_j
subject to  0 ≤ α_i ≤ C,  Σ_{i=1}^{n} α_i y_i = 0

The data points only appear as inner products x_i^T x_j.

With the aid of the inner-product representation in the feature space, the nonlinear mapping φ(x_i) can be used implicitly by defining the kernel function K by

K(x_i, x_j) = φ(x_i)^T φ(x_j)


What functions can be used as kernels?

Mercer’s theorem in operator theory: every positive semi-definite symmetric function is a kernel.

Positive semi-definite symmetric functions correspond to a positive semi-definite symmetric Gram matrix on the data points:

K = | K(x_1,x_1)  K(x_1,x_2)  K(x_1,x_3)  …  K(x_1,x_n) |
    | K(x_2,x_1)  K(x_2,x_2)  K(x_2,x_3)  …  K(x_2,x_n) |
    | …           …           …           …  …          |
    | K(x_n,x_1)  K(x_n,x_2)  K(x_n,x_3)  …  K(x_n,x_n) |


An Example for φ(·) and K(·,·)

Suppose the nonlinear mapping φ(·): R^2 → R^6 is as follows:

φ(x) = (1, √2 x_1, √2 x_2, x_1^2, x_2^2, √2 x_1 x_2)^T,   x = (x_1, x_2)^T

An inner product in the feature space is

⟨φ(x), φ(y)⟩ = (1 + x_1 y_1 + x_2 y_2)^2

So, if we define the kernel function as follows, there is no need to carry out φ(·) explicitly:

K(x, y) = (1 + x_1 y_1 + x_2 y_2)^2

This use of a kernel function to avoid carrying out φ(·) explicitly is known as the kernel trick.
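A quick numerical check of this identity (illustrative, not part of the original slides):

```python
import numpy as np

def phi(x):
    """Explicit feature map R^2 -> R^6 from the example above."""
    x1, x2 = x
    return np.array([1.0, np.sqrt(2)*x1, np.sqrt(2)*x2, x1**2, x2**2, np.sqrt(2)*x1*x2])

def K(x, y):
    """Degree-2 polynomial kernel (1 + x^T y)^2."""
    return (1.0 + x @ y) ** 2

x, y = np.array([0.3, -1.2]), np.array([2.0, 0.5])
assert np.isclose(phi(x) @ phi(y), K(x, y))   # same inner product, no explicit mapping needed
```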


Kernel functions

In practical use of SVM, the user specifies the kernel function; the transformation φ(·) is not explicitly stated.

Given a kernel function K(x_i, x_j), the transformation φ(·) is given by its eigenfunctions (a concept in functional analysis):

∫_X k(x, z) φ(z) dz = λ φ(x)

Eigenfunctions can be difficult to construct explicitly.

This is why people usually only specify the kernel function without worrying about the exact transformation.


Examples of kernel functions

Polynomial kernel with degree d:

K(x, y) = (x^T y + 1)^d

Radial basis function (RBF) kernel with width σ:

K(x, y) = exp(−||x − y||^2 / (2σ^2))

Closely related to radial basis function neural networks; the induced feature space is infinite-dimensional.

Sigmoid kernel with parameters κ and θ:

K(x, y) = tanh(κ x^T y + θ)

It does not satisfy the Mercer condition for all κ and θ; closely related to feedforward neural networks.
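As a small illustration (not in the original slides), the three kernels written as Python functions, with d, σ, κ, θ given example default values:

```python
import numpy as np

def poly_kernel(x, y, d=3):
    """Polynomial kernel (x^T y + 1)^d."""
    return (x @ y + 1.0) ** d

def rbf_kernel(x, y, sigma=1.0):
    """Gaussian RBF kernel exp(-||x - y||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

def sigmoid_kernel(x, y, kappa=1.0, theta=0.0):
    """Sigmoid kernel tanh(kappa x^T y + theta); not a Mercer kernel for all parameters."""
    return np.tanh(kappa * (x @ y) + theta)
```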


Kernel: Bridge from linear to nonlinear

Change all inner products to kernel functions.

For training, the optimization problem is

linear:
maximize    W(α) = Σ_{i=1}^{n} α_i − (1/2) Σ_{i,j=1}^{n} α_i α_j y_i y_j x_i^T x_j
subject to  0 ≤ α_i ≤ C,  Σ_{i=1}^{n} α_i y_i = 0

nonlinear:
maximize    W(α) = Σ_{i=1}^{n} α_i − (1/2) Σ_{i,j=1}^{n} α_i α_j y_i y_j K(x_i, x_j)
subject to  0 ≤ α_i ≤ C,  Σ_{i=1}^{n} α_i y_i = 0
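In the QP sketches above this amounts to building the Gram matrix from a kernel instead of from plain inner products; a hypothetical helper (reusing, for example, rbf_kernel from the previous sketch):

```python
import numpy as np

def gram_matrix(X, kernel):
    """Kernel Gram matrix G[i, j] = kernel(x_i, x_j)."""
    n = X.shape[0]
    G = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            G[i, j] = kernel(X[i], X[j])
    return G

# e.g. P = matrix(np.outer(y, y) * gram_matrix(X, rbf_kernel)) in the earlier QP sketch
```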


Kernel expansion for decision function

For classifying new data z: it belongs to class 1 if f(z) ≥ 0, and to class 2 if f(z) < 0.

linear:
w = Σ_{i=1}^{n} α_i y_i x_i
f(x) = w^T x + b = Σ_{i=1}^{n} α_i y_i x_i^T x + b

nonlinear:
w = Σ_{i=1}^{n} α_i y_i φ(x_i)
f(x) = w^T φ(x) + b = Σ_{i=1}^{n} α_i y_i K(x_i, x) + b
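A minimal sketch of this kernel-expansion decision function (illustrative names; any kernel such as rbf_kernel above could be passed in):

```python
def decision_function(z, X, y, alphas, b, kernel):
    """Non-parametric SVM output f(z) = sum_i alpha_i y_i K(x_i, z) + b."""
    return sum(a * yi * kernel(xi, z) for a, yi, xi in zip(alphas, y, X)) + b

def classify(z, X, y, alphas, b, kernel):
    """Class +1 if f(z) >= 0, class -1 otherwise."""
    return 1 if decision_function(z, X, y, alphas, b, kernel) >= 0 else -1
```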


Compared to neural networks

SVMs are explicitly based on a theoretical model of learning rather than on loose analogies with natural learning systems or other heuristics.

Modularity: Any kernel-based learning algorithm is composed of two modules:

A general purpose learning machine

A problem specific kernel function

SVMs are not affected by the problem of local minima because their training amounts to convex optimization.


Key features in SV classifier

All of these features were already present and had been used in machine learning since the 1960s:

Maximum (large) margin; kernel method; duality in nonlinear programming; sparseness of the solution; slack variables.

However, it was not until 1995 that all of these features were combined, and it is striking how naturally and elegantly they fit together and complement each other in the SVM.


SVM classification for 2D data

Figure. Visualization of SVM classification.
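The original slide contains only a figure. As an illustrative substitute (not the author's demo), a short scikit-learn script that produces a comparable 2D visualization, with the support vectors circled:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=60, centers=2, random_state=0)
clf = SVC(kernel='linear', C=10.0).fit(X, y)

# Evaluate the decision function f(x) on a grid and draw f = -1, 0, +1
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 200),
                     np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 200))
Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.scatter(X[:, 0], X[:, 1], c=y)
plt.contour(xx, yy, Z, levels=[-1, 0, 1], linestyles=['--', '-', '--'], colors='k')
plt.scatter(*clf.support_vectors_.T, s=120, facecolors='none', edgecolors='k')
plt.show()
```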


Part II. Support Vector Learning for Regression


Overfitting in nonlinear regression


The linear regression problem

y = f(x) = ⟨w, x⟩ + b


Linear regression

The problem of linear regression is much older than the classification one. Least squares linear interpolation was first used by Gauss in the 18th century for astronomical problems.

Given a training set S with x_i ∈ X ⊆ R^n and y_i ∈ Y ⊆ R, the problem of linear regression is to find a linear function f that models the data:

y = f(x) = ⟨w, x⟩ + b


Least squares

The least squares approach prescribes choosing the parameters (w, b) to minimize the sum of the squared deviations of the data,

J(w, b) = Σ_{i=1}^{ℓ} (y_i − ⟨w, x_i⟩ − b)^2

Setting ŵ = (w^T, b)^T and x̂_i = (x_i^T, 1)^T, where

X̂ = [x̂_1, x̂_2, …, x̂_ℓ]^T   and   ŷ = (y_1, y_2, …, y_ℓ)^T


Least squares

The square loss function can be written as

J(ŵ) = (ŷ − X̂ŵ)^T (ŷ − X̂ŵ)

Taking derivatives of the loss and setting them equal to zero,

∂J/∂ŵ = −2 X̂^T ŷ + 2 X̂^T X̂ ŵ = 0

yields the well-known ‘normal equations’

X̂^T X̂ ŵ = X̂^T ŷ

and, if the inverse of X̂^T X̂ exists, the solution is:

ŵ = (X̂^T X̂)^{-1} X̂^T ŷ
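A minimal numerical sketch (not from the slides); np.linalg.lstsq is used instead of forming the inverse explicitly, which is the numerically preferable way to solve the normal equations:

```python
import numpy as np

def least_squares_fit(X, y):
    """Fit y ~ <w, x> + b via least squares on the augmented data matrix."""
    X_hat = np.hstack([X, np.ones((X.shape[0], 1))])   # rows are (x_i^T, 1)
    w_hat, *_ = np.linalg.lstsq(X_hat, y, rcond=None)  # minimizes ||y - X_hat w||^2
    return w_hat[:-1], w_hat[-1]                       # w and b
```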


Ridge regression

If the matrix X̂^T X̂ in the least squares problem is not of full rank, or in other situations where numerical stability problems occur, one can use the following solution,

ŵ = (X̂^T X̂ + λ I_n)^{-1} X̂^T ŷ

where I_n is the (n+1) × (n+1) identity matrix with the (n+1, n+1) entry set to zero. This solution is called ridge regression.

Ridge regression minimizes the penalized loss function

J(w, b) = λ ⟨w, w⟩ + Σ_{i=1}^{ℓ} (⟨w, x_i⟩ + b − y_i)^2

where the term λ ⟨w, w⟩ is the regularizer.
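A corresponding sketch (illustrative; the value of λ is a hypothetical choice) that leaves the bias entry unpenalized, as described above:

```python
import numpy as np

def ridge_fit(X, y, lam=1e-2):
    """Ridge regression on augmented data; the bias entry is not penalized."""
    n = X.shape[1]
    X_hat = np.hstack([X, np.ones((X.shape[0], 1))])
    I = np.eye(n + 1)
    I[n, n] = 0.0                                  # (n+1, n+1) entry set to zero
    w_hat = np.linalg.solve(X_hat.T @ X_hat + lam * I, X_hat.T @ y)
    return w_hat[:-1], w_hat[-1]
```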


ε-insensitive loss function

Instead of using the square loss function, the ε-insensitive loss function is used in SV regression,

|ξ|_ε = 0          if |ξ| ≤ ε
|ξ|_ε = |ξ| − ε    otherwise

which leads to sparsity of the solution.

(Figure: penalty versus deviation of f from the target value, for the square loss and for the ε-insensitive loss.)
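In code, the ε-insensitive loss is a one-liner (illustrative sketch):

```python
import numpy as np

def eps_insensitive_loss(residual, eps=0.1):
    """|r|_eps = max(0, |r| - eps): zero inside the eps-tube, linear outside."""
    return np.maximum(0.0, np.abs(residual) - eps)
```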


The linear regression problem

y = f(x) = ⟨w, x⟩ + b

|y_i − f(x_i)|_ε = 0                               if |y_i − (⟨w, x_i⟩ + b)| ≤ ε
                 = |y_i − (⟨w, x_i⟩ + b)| − ε      otherwise
                 = max(0, |y_i − (⟨w, x_i⟩ + b)| − ε)


Primal problem in SV regression (SVR)

Given a data set x_1, x_2, …, x_ℓ with target values y_1, y_2, …, y_ℓ, the ε-SVR is formulated as the following (primal) convex optimization problem:

minimize    (1/2) ||w||^2 + C Σ_{i=1}^{ℓ} (ξ_i + ξ_i*)
subject to  y_i − ⟨w, x_i⟩ − b ≤ ε + ξ_i
            ⟨w, x_i⟩ + b − y_i ≤ ε + ξ_i*
            ξ_i, ξ_i* ≥ 0

The constant C > 0 determines the trade-off between the flatness of f and the amount up to which deviations larger than ε are tolerated.


Lagrangian

Construct the Lagrange function from the objective function and the corresponding constraints:

L = (1/2) ||w||^2 + C Σ_{i=1}^{ℓ} (ξ_i + ξ_i*) − Σ_{i=1}^{ℓ} (η_i ξ_i + η_i* ξ_i*)
    − Σ_{i=1}^{ℓ} α_i (ε + ξ_i − y_i + ⟨w, x_i⟩ + b)
    − Σ_{i=1}^{ℓ} α_i* (ε + ξ_i* + y_i − ⟨w, x_i⟩ − b)

where the Lagrange multipliers satisfy the positivity constraints

α_i, α_i*, η_i, η_i* ≥ 0


Karush-Kuhn-Tucker Conditions

It follows from the saddle point condition that the partial derivatives of L with respect to the primal variables (w, b, ξ_i, ξ_i*) have to vanish for optimality:

∂L/∂b = Σ_{i=1}^{ℓ} (α_i* − α_i) = 0
∂L/∂w = w − Σ_{i=1}^{ℓ} (α_i − α_i*) x_i = 0      (parametric → non-parametric)
∂L/∂ξ_i = C − α_i − η_i = 0
∂L/∂ξ_i* = C − α_i* − η_i* = 0

The 2nd equation indicates that w can be written as a linear combination of the training patterns x_i.


Dual problem

Substituting the equations above into the Lagrangian yields the following dual problem:

maximize    −(1/2) Σ_{i,j=1}^{ℓ} (α_i − α_i*)(α_j − α_j*) ⟨x_i, x_j⟩ − ε Σ_{i=1}^{ℓ} (α_i + α_i*) + Σ_{i=1}^{ℓ} y_i (α_i − α_i*)
subject to  Σ_{i=1}^{ℓ} (α_i − α_i*) = 0   and   α_i, α_i* ∈ [0, C]

The function f can be written in a non-parametric form by substituting w = Σ_{i=1}^{ℓ} (α_i − α_i*) x_i into f(x) = ⟨w, x⟩ + b:

f(x) = Σ_{i=1}^{ℓ} (α_i − α_i*) ⟨x_i, x⟩ + b


KKT complementarity conditions

At the optimal solution the following Karush-Kuhn-Tucker complementarity conditions must be fulfilled:

α_i (ε + ξ_i − y_i + ⟨w, x_i⟩ + b) = 0
α_i* (ε + ξ_i* + y_i − ⟨w, x_i⟩ − b) = 0
(C − α_i) ξ_i = 0
(C − α_i*) ξ_i* = 0

Obviously, for 0 < α_i < C we have ξ_i = 0, and similarly for 0 < α_i* < C we have ξ_i* = 0.


Unbounded support vectors

Hence, for 0 < α_i, α_i* < C, it follows that

ε − y_i + ⟨w, x_i⟩ + b = 0
ε + y_i − ⟨w, x_i⟩ − b = 0

Thus, for all the data points fulfilling y − f(x) = +ε the dual variables satisfy 0 < α_i < C, and for the ones satisfying y − f(x) = −ε the dual variables satisfy 0 < α_i* < C. These data points are called the unbounded support vectors.


Computing bias term b

Unbounded support vectors allow computing the value of the bias term b as given below:

b = y_i − ⟨w, x_i⟩ − ε   for 0 < α_i < C
b = y_i − ⟨w, x_i⟩ + ε   for 0 < α_i* < C

The calculation of the bias term b is numerically very sensitive, and it is better to compute b by averaging over all the unbounded support vector data points.


Bounded support vectors

The bounded support vectors (those with α_i = C or α_i* = C) lie outside the ε-tube.


Sparsity of SV expansion

For all data points inside the ε-tube, i.e. those with |y_i − (⟨w, x_i⟩ + b)| < ε, the second factors in the KKT complementarity conditions

α_i (ε + ξ_i − y_i + ⟨w, x_i⟩ + b) = 0
α_i* (ε + ξ_i* + y_i − ⟨w, x_i⟩ − b) = 0

are nonzero, so α_i and α_i* have to be zero. Therefore, we have a sparse expansion of w in terms of the x_i:

w = Σ_{i=1}^{ℓ} (α_i − α_i*) x_i = Σ_{i ∈ SV} (α_i − α_i*) x_i


SV Nonlinear Regression

Key idea: map data to a higher dimensional space by using a nonlinear transformation, and perform linear regression in feature (embedded) space

Input space: the space where the points x_i are located. Feature space: the space of the φ(x_i) after the transformation.

Computation in the feature space can be costly because it is high-dimensional; indeed, the feature space is typically infinite-dimensional!

This ‘curse of dimensionality’ can be surmounted by resorting to the kernel trick, which is the most appealing characteristic of SVM.


Kernel trick

Recall the SVR dual optimization problem:

maximize    −(1/2) Σ_{i,j=1}^{ℓ} (α_i − α_i*)(α_j − α_j*) ⟨x_i, x_j⟩ − ε Σ_{i=1}^{ℓ} (α_i + α_i*) + Σ_{i=1}^{ℓ} y_i (α_i − α_i*)
subject to  Σ_{i=1}^{ℓ} (α_i − α_i*) = 0   and   α_i, α_i* ∈ [0, C]

The data points only appear as inner products ⟨x_i, x_j⟩.

As long as we can calculate the inner product in the feature space, we do not need to know the nonlinear mapping explicitly.

Define the kernel function K by

K(x_i, x_j) = φ(x_i)^T φ(x_j)


Kernel: Bridge from linear to nonlinear

Change all inner products to kernel functions.

For training:

Original:
maximize    −(1/2) Σ_{i,j=1}^{ℓ} (α_i − α_i*)(α_j − α_j*) ⟨x_i, x_j⟩ − ε Σ_{i=1}^{ℓ} (α_i + α_i*) + Σ_{i=1}^{ℓ} y_i (α_i − α_i*)
subject to  Σ_{i=1}^{ℓ} (α_i − α_i*) = 0   and   α_i, α_i* ∈ [0, C]

With kernel function:
maximize    −(1/2) Σ_{i,j=1}^{ℓ} (α_i − α_i*)(α_j − α_j*) K(x_i, x_j) − ε Σ_{i=1}^{ℓ} (α_i + α_i*) + Σ_{i=1}^{ℓ} y_i (α_i − α_i*)
subject to  Σ_{i=1}^{ℓ} (α_i − α_i*) = 0   and   α_i, α_i* ∈ [0, C]


Kernel expansion representation

For testing:

Original:
w = Σ_{i=1}^{ℓ} (α_i − α_i*) x_i
f(x) = ⟨w, x⟩ + b = Σ_{i=1}^{ℓ} (α_i − α_i*) ⟨x_i, x⟩ + b

With kernel function:
w = Σ_{i=1}^{ℓ} (α_i − α_i*) φ(x_i)
f(x) = ⟨w, φ(x)⟩ + b = Σ_{i=1}^{ℓ} (α_i − α_i*) K(x_i, x) + b


Compared to SV classification

The size of the SVR problem, with respect to the size of an SV classifier design task, is now doubled: there are 2ℓ unknown dual variables (the ℓ α_i's and the ℓ α_i*'s) for a support vector regression.

Besides the penalty parameter C and the shape parameters of the kernel function (such as the variance of a Gaussian kernel or the order of a polynomial), the insensitivity zone ε also needs to be set beforehand when constructing SV machines for regression.
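For orientation only (not part of the original slides, which demonstrate a MATLAB toolbox later), these three user-chosen quantities map directly onto the parameters of scikit-learn's SVR:

```python
import numpy as np
from sklearn.svm import SVR

# Toy 1-D regression problem: a noisy sinc function
rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 120).reshape(-1, 1)
y = np.sinc(X).ravel() + 0.05 * rng.standard_normal(120)

# C (penalty), gamma (Gaussian kernel shape) and epsilon (insensitivity zone)
# all have to be chosen by the user before training.
model = SVR(kernel='rbf', C=10.0, gamma=0.5, epsilon=0.05).fit(X, y)
print("support vectors:", model.support_.size, "of", len(X), "points")
```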


Influence of insensitivity zone


Linear programming SV regression

In an attempt to improve computational efficiency and model sparsity, linear programming SV regression was formulated as

minimize    (1/2) ||β||_1 + C Σ_{i=1}^{ℓ} (ξ_i + ξ_i*)
subject to  y_i − Σ_{j=1}^{ℓ} β_j k(x_j, x_i) ≤ ε + ξ_i
            Σ_{j=1}^{ℓ} β_j k(x_j, x_i) − y_i ≤ ε + ξ_i*
            ξ_i, ξ_i* ≥ 0

where β = [β_1, β_2, …, β_ℓ]^T and the ℓ_1 norm ||β||_1 replaces the quadratic term ||w||^2 of the standard SVR.


Linear programming SV regression

The optimization problem can be converted into a linear programming problem as follows:

minimize    c^T λ
subject to  [ K  −K  −I ] λ ≤ ε1 + y
            [−K   K  −I ] λ ≤ ε1 − y
            λ ≥ 0

where λ = [(β^+)^T, (β^−)^T, ξ^T]^T, c = [1, …, 1, 1, …, 1, 2C, …, 2C]^T, β_i = β_i^+ − β_i^−, ξ = (ξ_1, ξ_2, …, ξ_ℓ)^T, and K_ij = k(x_i, x_j).
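Under the reconstruction above (so treat this as an assumption-laden sketch rather than the original formulation), the LP can be set up with scipy.optimize.linprog like this:

```python
import numpy as np
from scipy.optimize import linprog

def lp_svr(K, y, C=1.0, eps=0.1):
    """LP support vector regression; K is the kernel Gram matrix K[i, j] = k(x_i, x_j)."""
    l = len(y)
    c = np.concatenate([np.ones(l), np.ones(l), 2 * C * np.ones(l)])   # costs on [beta+, beta-, xi]
    A_ub = np.block([[ K, -K, -np.eye(l)],
                     [-K,  K, -np.eye(l)]])
    b_ub = np.concatenate([eps + y, eps - y])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub)       # all variables are >= 0 by default
    return res.x[:l] - res.x[l:2 * l]            # beta = beta+ - beta-
```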


MATLAB Demo for SV regression

The MATLAB Support Vector Machines Toolbox developed by Steve R. Gunn at the University of Southampton, UK.

The software can be downloaded from

http://www.isis.ecs.soton.ac.uk/resources/svminfo/


Questions?