
Difficulties with Nonlinear SVM for Large Problems

- The nonlinear kernel K(A, A′) ∈ R^{m×m} is fully dense
- Computational complexity depends on m
- The separating surface depends on almost the entire dataset
- Complexity of nonlinear SSVM: ≈ O((m+1)³)
- Runs out of memory while storing the kernel matrix
- Long CPU time to compute the dense kernel matrix: O(m²) entries need to be generated and stored
- Need to store the entire dataset even after solving the problem

Reduced Support Vector Machine

(i) Choose a random subset matrix Ā ∈ R^{m̄×n} of the entire data matrix A ∈ R^{m×n}, with m̄ ≪ m.

(ii) Solve the following problem by the Newton method, with the corresponding D̄ ⊂ D:

$$\min_{(\bar u, \gamma) \in \mathbb{R}^{\bar m + 1}} \ \frac{\nu}{2}\left\| p\!\left(e - D\left(K(A, \bar A')\bar D \bar u - e\gamma\right),\ \alpha\right) \right\|_2^2 + \frac{1}{2}\left\|(\bar u, \gamma)\right\|_2^2$$

(iii) The nonlinear classifier is defined by the optimal solution (ū, γ) of step (ii):

Nonlinear classifier: K(x′, Ā′)D̄ū = γ

Using K(Ā, Ā′) gives lousy results!
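To make steps (i)-(iii) concrete, here is a minimal Python sketch, assuming a Gaussian kernel and substituting an off-the-shelf quasi-Newton solver (scipy's BFGS) for the dedicated Newton method of step (ii); the function names and default parameter values (`m_bar`, `nu`, `mu`, `alpha`) are illustrative choices, not values from the slides.

```python
import numpy as np
from scipy.optimize import minimize

def gaussian_kernel(A, B, mu):
    """K(A, B')_ij = exp(-mu * ||A_i - B_j||_2^2)."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-mu * sq)

def rsvm_train(A, d, m_bar=50, nu=50.0, mu=1.0, alpha=5.0, seed=0):
    """RSVM sketch: random reduced set, rectangular kernel, smoothed objective.
    d is the vector of +1/-1 labels (the diagonal of D)."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(A), size=m_bar, replace=False)   # step (i)
    A_bar, d_bar = A[idx], d[idx]
    K = gaussian_kernel(A, A_bar, mu)                     # K(A, A_bar') is m x m_bar

    def p(x):  # smooth plus-function p(x, alpha)
        return x + np.logaddexp(0.0, -alpha * x) / alpha

    def objective(z):                                     # step (ii)
        u_bar, gamma = z[:-1], z[-1]
        r = p(1.0 - d * (K @ (d_bar * u_bar) - gamma))
        return 0.5 * nu * (r @ r) + 0.5 * (u_bar @ u_bar + gamma ** 2)

    z = minimize(objective, np.zeros(m_bar + 1), method="BFGS").x
    return A_bar, d_bar, z[:-1], z[-1]

def rsvm_predict(X, A_bar, d_bar, u_bar, gamma, mu=1.0):
    """Step (iii): classify by sign(K(x', A_bar') D_bar u_bar - gamma)."""
    K = gaussian_kernel(np.atleast_2d(X), A_bar, mu)
    return np.sign(K @ (d_bar * u_bar) - gamma)
```

Note how only the rectangular m × m̄ kernel is ever formed, which is the whole point of RSVM.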

A Nonlinear Kernel Application: Checkerboard Training Set (1000 Points in R²); Separate 486 Asterisks from 514 Dots

Conventional SVM Result on Checkerboard, Using 50 Randomly Selected Points Out of 1000: K(Ā, Ā′) ∈ R^{50×50}

RSVM Result on Checkerboard, Using the SAME 50 Random Points Out of 1000: K(A, Ā′) ∈ R^{1000×50}
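For flavor, a hypothetical rerun of this comparison on synthetic checkerboard data, reusing `rsvm_train`/`rsvm_predict` from the sketch above; the 4×4 board and the cell-parity labeling are assumptions, not the original 1000-point set.

```python
# Synthetic checkerboard: label each point by the parity of the cell it falls in.
rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(1000, 2))
d = np.where((np.floor(2 * X[:, 0]) + np.floor(2 * X[:, 1])) % 2 == 0, 1.0, -1.0)

A_bar, d_bar, u_bar, gamma = rsvm_train(X, d, m_bar=50, mu=10.0)
pred = rsvm_predict(X, A_bar, d_bar, u_bar, gamma, mu=10.0)
print("training accuracy:", (pred == d).mean())
```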

RSVM on Moderate Sized Problems (Best Test Set Correctness %, CPU Seconds in Parentheses)

Dataset (m x n, m̄)               K(A,Ā′) ∈ R^{m×m̄}   K(A,A′) ∈ R^{m×m}   K(Ā,Ā′) ∈ R^{m̄×m̄}
Cleveland Heart  297 x 13, 30     86.47 (3.04)         85.92 (32.42)       76.88 (1.58)
BUPA Liver       345 x 6,  35     74.86 (2.68)         73.62 (32.61)       68.95 (2.04)
Ionosphere       351 x 34, 35     95.19 (5.02)         94.35 (59.88)       88.70 (2.13)
Pima Indians     768 x 8,  50     78.64 (5.72)         76.59 (328.3)       57.32 (4.64)
Tic-Tac-Toe      958 x 9,  96     98.75 (14.56)        98.43 (1033.5)      88.24 (8.87)
Mushroom         8124 x 22, 215   89.04 (466.20)       N/A (N/A)           83.90 (221.50)

RSVM on Large UCI Adult Dataset (A ∈ R^{m×123})

Average correctness % and standard deviation over 50 runs; the standard deviation of the K(A,Ā′) results is 0.001 throughout.

(Train, Test)      K(A,Ā′) Test %   Std. Dev.   K(Ā,Ā′) Test %   Std. Dev.   m̄     m̄/m
(6414, 26148)      84.47            0.001       77.03            0.014       210    3.2%
(11221, 21341)     84.71            0.001       75.96            0.016       225    2.0%
(16101, 16461)     84.90            0.001       75.45            0.017       242    1.5%
(22697, 9865)      85.31            0.001       76.73            0.018       284    1.2%
(32562, 16282)     85.07            0.001       76.95            0.013       326    1.0%

[Figure: CPU time (sec.) vs. training set size for RSVM, SMO, and PCGC.]

Support Vector Regression (Linear Case: f(x) = x′w + b)

- Given the training set S = {(x_i, y_i) | x_i ∈ R^n, y_i ∈ R, i = 1, …, l}
- Find a linear function f(x) = x′w + b, where (w, b) is determined by solving a minimization problem that guarantees the smallest overall empirical error made by f(x) = x′w + b
- Motivated by SVM: ‖w‖₂ should be as small as possible
- Some tiny errors should be discarded

ε-Insensitive Loss Function

The ε-insensitive loss function:

$$|y_i - f(x_i)|_\varepsilon = \max\{0,\ |y_i - f(x_i)| - \varepsilon\}$$

The loss made by the estimation function f at the data point (x_i, y_i) is

$$|\xi|_\varepsilon = \max\{0,\ |\xi| - \varepsilon\} = \begin{cases} 0 & \text{if } |\xi| \le \varepsilon \\ |\xi| - \varepsilon & \text{otherwise} \end{cases}$$

If ξ ∈ R^n, then |ξ|_ε ∈ R^n is defined componentwise:

$$(|\xi|_\varepsilon)_i = |\xi_i|_\varepsilon, \quad i = 1, \ldots, n$$
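The componentwise definition translates directly to code; a one-line sketch:

```python
import numpy as np

def eps_insensitive(xi, eps):
    """Componentwise epsilon-insensitive loss: |xi|_eps = max(0, |xi| - eps)."""
    return np.maximum(0.0, np.abs(xi) - eps)

# Residuals inside the eps-tube cost nothing:
print(eps_insensitive(np.array([-0.3, 0.05, 0.5]), eps=0.1))  # [0.2 0.  0.4]
```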

[Figure: ε-insensitive linear regression with f(x) = x′w + b. Training points lie in a tube of half-width ε around the line; a point above the tube incurs error y_j − f(x_j) − ε, and a point below incurs f(x_k) − y_k − ε.]

Find (w, b) with the smallest overall error.

ε-Insensitive Support Vector Regression Model

Motivated by SVM: ‖w‖₂ should be as small as possible, and some tiny errors should be discarded.

$$\min_{(w,b,\xi) \in \mathbb{R}^{n+1+m}} \ \frac{1}{2}\|w\|_2^2 + C e'|\xi|_\varepsilon$$

where |ξ|_ε ∈ R^m and (|ξ|_ε)_i = max(0, |A_i w + b − y_i| − ε).

Reformulated ε-SVR as a Constrained Minimization Problem

$$\min_{(w,b,\xi,\xi^*) \in \mathbb{R}^{n+1+2m}} \ \frac{1}{2}w'w + C e'(\xi + \xi^*)$$

subject to

$$y - Aw - eb \le e\varepsilon + \xi$$
$$Aw + eb - y \le e\varepsilon + \xi^*$$
$$\xi,\ \xi^* \ge 0$$

This is a minimization problem with n+1+2m variables and 2m constraints, which enlarges the problem size and the computational complexity of solving it.
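The constrained formulation can be handed to a generic convex solver as-is. A sketch assuming the cvxpy library (not mentioned in the slides); `eps_svr_qp` is an illustrative name:

```python
import numpy as np
import cvxpy as cp

def eps_svr_qp(A, y, C=1.0, eps=0.1):
    """Solve the constrained eps-SVR problem directly (sketch)."""
    m, n = A.shape
    w, b = cp.Variable(n), cp.Variable()
    xi = cp.Variable(m, nonneg=True)       # slack above the tube
    xi_star = cp.Variable(m, nonneg=True)  # slack below the tube
    objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi + xi_star))
    constraints = [y - A @ w - b <= eps + xi,
                   A @ w + b - y <= eps + xi_star]
    cp.Problem(objective, constraints).solve()
    return w.value, b.value
```

The n+1+2m variables and 2m constraints are visible directly in the code, which is exactly the growth the slide warns about.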

SV Regression by Minimizing the Quadratic ε-Insensitive Loss

We minimize ‖(w, b)‖₂² at the same time (Occam's razor: the simplest is the best). We have the following (nonsmooth) problem:

$$\min_{(w,b,\xi) \in \mathbb{R}^{n+1+l}} \ \frac{1}{2}\left(\|w\|_2^2 + b^2\right) + \frac{C}{2}\,\big\||\xi|_\varepsilon\big\|_2^2$$

where (|ξ|_ε)_i = |y_i − (w′x_i + b)|_ε. This formulation has the advantage that the problem is strongly convex.

ε-Insensitive Loss Function

$$|x|_\varepsilon = (x - \varepsilon)_+ + (-x - \varepsilon)_+$$

Quadratic ε-Insensitive Loss Function

$$|x|_\varepsilon^2 = \left((x-\varepsilon)_+ + (-x-\varepsilon)_+\right)^2 = (x-\varepsilon)_+^2 + (-x-\varepsilon)_+^2$$

since (x − ε)₊ · (−x − ε)₊ = 0.
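That the cross term vanishes is easy to confirm numerically; a quick check:

```python
import numpy as np

plus = lambda x: np.maximum(x, 0.0)

x, eps = np.linspace(-3.0, 3.0, 601), 1.0
lhs = (plus(x - eps) + plus(-x - eps)) ** 2
rhs = plus(x - eps) ** 2 + plus(-x - eps) ** 2
print(np.allclose(lhs, rhs))   # True: (x-eps)_+ * (-x-eps)_+ == 0 pointwise
```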

Use the p²_ε-Function to Replace the Quadratic ε-Insensitive Function

$$p_\varepsilon^2(x, \beta) = \left(p(x - \varepsilon, \beta)\right)^2 + \left(p(-x - \varepsilon, \beta)\right)^2$$

where the p-function is defined by

$$p(x, \beta) = x + \frac{1}{\beta}\log\left(1 + e^{-\beta x}\right)$$

[Figure: the p-function with β = 10, p(x, 10), for x ∈ [−3, 3]; and |x|²_ε compared with p²_ε(x, β) for ε = 1, β = 5.]
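A sketch of this smoothing pair, written in a numerically stable form; the gap to the exact loss shrinks as β grows:

```python
import numpy as np

def p(x, beta):
    """p(x, beta) = x + (1/beta) * log(1 + exp(-beta * x))."""
    return x + np.logaddexp(0.0, -beta * x) / beta   # stable log(1 + e^{-beta x})

def p2_eps(x, beta, eps):
    """Smooth surrogate for the quadratic eps-insensitive loss |x|_eps^2."""
    return p(x - eps, beta) ** 2 + p(-x - eps, beta) ** 2

x = np.linspace(-3.0, 3.0, 601)
exact = (np.maximum(x - 1.0, 0.0) + np.maximum(-x - 1.0, 0.0)) ** 2
print(np.abs(p2_eps(x, beta=5.0, eps=1.0) - exact).max())  # small; -> 0 as beta grows
```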

ε-Insensitive Smooth Support Vector Regression

$$\min_{(w,b) \in \mathbb{R}^{n+1}} \ \Phi_{\varepsilon,\beta}(w,b) := \frac{1}{2}\left(w'w + b^2\right) + \frac{C}{2}\sum_{i=1}^{m} p_\varepsilon^2\left(A_i w + b - y_i,\ \beta\right)$$

which smoothly approximates

$$\frac{1}{2}\left(w'w + b^2\right) + \frac{C}{2}\sum_{i=1}^{m} |A_i w + b - y_i|_\varepsilon^2$$

This is a strongly convex minimization problem without any constraints. The objective function is twice differentiable, so we can use a fast Newton-Armijo method to solve it.
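A minimal solver sketch for this objective with its analytic gradient, again substituting scipy's BFGS for the Newton-Armijo method of the slides; it uses the fact that p′(x, β) is the sigmoid of βx. Names and defaults are illustrative.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit                     # sigmoid = derivative of p

def ssvr_linear(A, y, C=50.0, eps=0.1, beta=5.0):
    """Linear eps-SSVR sketch: minimize Phi_{eps,beta}(w, b) with BFGS."""
    m, n = A.shape

    def p(x):
        return x + np.logaddexp(0.0, -beta * x) / beta

    def phi_and_grad(z):
        w, b = z[:n], z[n]
        r = A @ w + b - y                           # residuals A_i w + b - y_i
        pp, pm = p(r - eps), p(-r - eps)
        phi = 0.5 * (w @ w + b * b) + 0.5 * C * (pp @ pp + pm @ pm)
        # d/dr of p_eps^2(r, beta), via the chain rule with p'(x) = sigmoid(beta x):
        g = 2.0 * (pp * expit(beta * (r - eps)) - pm * expit(-beta * (r + eps)))
        grad = np.concatenate([w + 0.5 * C * (A.T @ g), [b + 0.5 * C * g.sum()]])
        return phi, grad

    z = minimize(phi_and_grad, np.zeros(n + 1), jac=True, method="BFGS").x
    return z[:n], z[n]
```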

Nonlinear ε-SVR

Based on the duality theorem and the KKT optimality conditions:

$$w = A'\alpha, \quad \alpha \in \mathbb{R}^m$$

In the nonlinear case:

$$y \approx Aw + eb \ \Longrightarrow\ y \approx AA'\alpha + eb \ \Longrightarrow\ y \approx K(A, A')\alpha + eb$$

Nonlinear SVR:

$$\min_{(\alpha,b) \in \mathbb{R}^{m+1}} \ \frac{1}{2}\|\alpha\|_2^2 + C \sum_{i=1}^{m} \left|K(A_i, A')\alpha + b - y_i\right|_\varepsilon$$

Let A ∈ R^{m×n} and B ∈ R^{n×l}; then K(A, B): R^{m×n} × R^{n×l} ⟹ R^{m×l} and K(A_i, A′) ∈ R^{1×m}.

Nonlinear regression function: f(x) = K(x, A′)α + b

Nonlinear Smooth Support Vector ε-Insensitive Regression

$$\min_{(\alpha,b) \in \mathbb{R}^{m+1}} \ \frac{1}{2}\left(\alpha'\alpha + b^2\right) + \frac{C}{2}\sum_{i=1}^{m} p_\varepsilon^2\left(K(A_i, A')\alpha + b - y_i,\ \beta\right)$$

which smoothly approximates

$$\frac{1}{2}\left(\alpha'\alpha + b^2\right) + \frac{C}{2}\sum_{i=1}^{m} \left|K(A_i, A')\alpha + b - y_i\right|_\varepsilon^2$$
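Since the nonlinear objective has exactly the same shape as the linear one, with K(A, A′) in place of A and α in place of w, the linear solver can be reused on the kernel matrix. A sketch building on `gaussian_kernel` and `ssvr_linear` from above:

```python
def ssvr_nonlinear(A, y, mu=1.0, C=50.0, eps=0.1, beta=5.0):
    """Nonlinear eps-SSVR sketch: learn f(x) = K(x, A') alpha + b."""
    K = gaussian_kernel(A, A, mu)    # substitute K(A, A_bar') here for the reduced kernel
    alpha, b = ssvr_linear(K, y, C=C, eps=eps, beta=beta)
    def f(X):
        return gaussian_kernel(np.atleast_2d(X), A, mu) @ alpha + b
    return f
```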

Numerical Results

- Training set and testing set split by the slice method
- A Gaussian kernel is used to generate the nonlinear ε-SVR in all experiments
- The reduced kernel technique is utilized when the training dataset is bigger than 1000 points
- Error measure: 2-norm relative error

$$\frac{\|y - \hat{y}\|_2}{\|y\|_2}, \quad y:\ \text{observations}, \quad \hat{y}:\ \text{predicted values}$$
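The error measure in code, as a small helper reused below:

```python
import numpy as np

def rel_error(y, y_hat):
    """2-norm relative error: ||y - y_hat||_2 / ||y||_2."""
    return np.linalg.norm(y - y_hat) / np.linalg.norm(y)
```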

First Artificial Dataset: 101 Data Points in R × R

- f(x) = 0.5 sinc(10πx) + noise, x ∈ [−1, 1], 101 points
- Noise: mean = 0, σ = 0.04
- Nonlinear SSVR with Gaussian kernel exp(−μ‖x_i − x_j‖₂²)
- Parameters: ν = 50, μ = 5, ε = 0.02
- Training time: 0.3 sec.

Comparison on f(x) = 0.5 sinc(30πx) + random noise (mean = 0, standard deviation 0.04):

          Training time   Error
ε-SSVR    0.016 sec.      0.059
LIBSVM    0.015 sec.      0.068

[Figure: original function with the ε-SSVR and LIBSVM fits.]
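An illustrative rerun of the first dataset under the stated settings, reusing `ssvr_nonlinear` and `rel_error` from above; the random seed and solver defaults are assumptions, the slide's ν = 50 plays the role of C here, and `np.sinc(t)` is the normalized sin(πt)/(πt):

```python
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 101).reshape(-1, 1)
y = 0.5 * np.sinc(10.0 * x.ravel()) + rng.normal(0.0, 0.04, size=101)  # 0.5 sinc(10 pi x) + noise

f = ssvr_nonlinear(x, y, mu=5.0, C=50.0, eps=0.02, beta=5.0)
print("relative error:", rel_error(y, f(x)))
```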

481 Data Points in R² × R

- Noise: mean = 0, σ = 0.4
- Parameters: ν = 50, μ = 1, ε = 0.5
- Training time: 9.61 sec.
- Mean Absolute Error (MAE) over the 49 × 49 mesh points: 0.1761

[Figure: original function and estimated function.]

Using the Reduced Kernel: K(A, Ā′) ∈ R^{28900×300}

- Noise: mean = 0, σ = 0.4
- Parameters: C = 10000, μ = 1, ε = 0.2
- Training time: 22.58 sec.
- MAE over the 49 × 49 mesh points: 0.0513

[Figure: original function and estimated function.]

Real Datasets

- Linear ε-SSVR: tenfold numerical results
- Nonlinear ε-SSVR: tenfold numerical results (1/2)
- Nonlinear ε-SSVR: tenfold numerical results (2/2)

