Online Learning of Maximum Margin Classifiers (p-Norm with Bias)
Kohei Hatano, Kyushu University
(Joint work with K. Ishibashi and M. Takeda)
COLT 2008
Plan of this talk
1. Introduction
2. Preliminaries
   – ROMMA
3. Our results
   – Our new algorithm PUMMA
   – Our implicit reduction
4. Experiments
Maximum Margin Classification

[figure: a hyperplane separating two classes, with the margin marked]

• SVMs [Boser et al. 92] – 2-norm margin
• Boosting [Freund & Schapire 97] – ∞-norm margin (approximately)
• Why maximum (or large) margin?
  – Good generalization [Schapire et al. 98] [Shawe-Taylor et al. 98]
  – Formulated as convex optimization problems (QP, LP)
Scaling up Max. Margin Classification
1. Decomposition methods (for SVMs)
   – Break the original QP into smaller QPs
   – SMO [Platt 99], SVMlight [Joachims 99], LIBSVM [Chang & Lin 01]
   – state-of-the-art implementations
2. Online learning (our approach)
Online Learning

Advantages of online learning:
• Simple & easy to implement
• Uses less memory
• Adaptive to changing concepts

Online Learning Algorithm
For t = 1 to T:
1. Receive an instance x_t in R^n
2. Guess a label ŷ_t = sign(w_t·x_t + b_t)
3. Receive the label y_t in {-1, +1}
4. Update (w_{t+1}, b_{t+1}) = UPDATE_RULE(w_t, b_t, x_t, y_t)
end

[figure: the learner predicts +1 or -1 on x_t and updates (w_t, b_t) to (w_{t+1}, b_{t+1})]
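The protocol above is easy to express in code. Here is a minimal sketch in Python/NumPy; `update_rule` is a hypothetical placeholder for any of the update rules discussed in this talk (ROMMA, PUMMA, ...), and the mistake counter is only bookkeeping.

```python
import numpy as np

def online_learn(stream, update_rule, n):
    """Run the generic online learning protocol over (x_t, y_t) pairs."""
    w, b = np.zeros(n), 0.0
    mistakes = 0
    for x_t, y_t in stream:                 # 1. receive an instance x_t
        y_hat = np.sign(w @ x_t + b)        # 2. guess a label
        if y_hat != y_t:                    # 3. receive the true label y_t
            mistakes += 1
        w, b = update_rule(w, b, x_t, y_t)  # 4. update the hypothesis
    return w, b, mistakes
```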
Online Learning Algorithms for maximum margin classification
• Max Margin Perceptron [Kowalczyk 00]
• ROMMA [Li & Long 02]
• ALMA [Gentile 01]
• LASVM [Bordes et al. 05]
• MICRA [Tsampouka & Shawe-Taylor 07]
• Pegasos [Shalev-Shwartz et al. 07]
• etc.
Most online algorithms cannot learn a hyperplane with bias!

[figure: a hyperplane with bias vs. a hyperplane without bias (through the origin)]
Typical Reduction to deal with bias [Cf. Cristianini & Shawe-Taylor 00]

Add an extra dimension corresponding to the bias.

Original space                                   Augmented space
instance:   x_j in R^n                      ↔    x̃_j = (x_j, R) in R^{n+1}
hyperplane: (u, b), ||u|| = 1               ↔    ũ = (u, b/R)
            R := max_j ||x_j||                   R̃ := max_j ||x̃_j||
margin (over normalized instances):
            γ = min_j y_j(u·x_j + b) / R    ↔    γ̃ = min_j y_j(ũ·x̃_j) / (||ũ|| R̃)

NOTE: ũ·x̃_j = u·x_j + b, so ũ is equivalent to (u, b).
This reduction weakens the margin guarantee: γ̃ can be as small as γ/2.
→ It might cause a significant difference in generalization!
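As a concrete illustration, here is a minimal sketch of the augmentation in Python/NumPy (the function names are ours, not from the talk):

```python
import numpy as np

def augment(X):
    """Append a constant coordinate R = max_j ||x_j|| to every instance."""
    R = np.max(np.linalg.norm(X, axis=1))
    return np.hstack([X, np.full((len(X), 1), R)]), R

def augment_hyperplane(u, b, R):
    """The hyperplane (u, b) becomes the homogeneous hyperplane (u, b/R)."""
    return np.append(u, b / R)
```

Then augment_hyperplane(u, b, R) @ x̃_j equals u·x_j + b, so an algorithm for homogeneous hyperplanes simulates the biased one, at the cost of the weakened margin noted above.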
Our New Online Learning Algorithm

PUMMA (P-norm Utilizing Maximum Margin Algorithm)
• PUMMA can learn maximum margin classifiers with bias directly (without using the typical reduction!).
• Margin is defined by the p-norm (p ≥ 2).
  – For p = 2, similar to the Perceptron.
  – For p = O(ln n) [Gentile '03], similar to Winnow [Littlestone '88]: fast when the target is sparse.
• Extended to the linearly inseparable case (omitted).
  – Soft margin with 2-norm slack variables.
Problem of finding the p-norm maximum margin hyperplane [Cf. Mangasarian 99]

Given: (linearly separable) S = ((x_1, y_1), …, (x_T, y_T)),
Goal: Find an approximate solution of

  (w*, b*) = argmin_{w,b} (1/2)||w||_q^2
  sub. to:  y_j(w·x_j + b) ≥ 1  (j = 1, …, T),

where q is the dual norm (1/p + 1/q = 1), e.g. p = 2, q = 2; p = ∞, q = 1.

We want an online algorithm solving the problem with a small # of updates.
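To make the norms concrete, here is a small Python/NumPy helper (our own illustration, not part of the talk) computing the dual exponent and the p-norm margin of a hyperplane on a sample:

```python
import numpy as np

def dual_exponent(p):
    """Return q with 1/p + 1/q = 1 (q = 1 for p = infinity)."""
    return 1.0 if np.isinf(p) else p / (p - 1.0)

def p_norm_margin(w, b, X, y, p=2.0):
    """min_j y_j(w.x_j + b) / ||w||_q: the weight is measured in the dual q-norm."""
    q = dual_exponent(p)
    return np.min(y * (X @ w + b)) / np.linalg.norm(w, ord=q)
```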
ROMMA (Relaxed Online Maximum Margin Algorithm) [Li & Long '02]

Given: S = ((x_1, y_1), …, (x_{t-1}, y_{t-1})), x_t,
1. Predict ŷ_t = sign(w_t·x_t), and receive y_t.
2. If y_t(w_t·x_t) < 1 - δ (margin is "insufficient"), update:

   w_{t+1} = argmin_w (1/2)||w||_2^2
   sub. to:  y_t(w·x_t) ≥ 1        (constraint over the last example which causes an update)
             w·w_t ≥ ||w_t||_2^2   (constraint over the last hyperplane)

   → 2 constraints only!

3. Otherwise, w_{t+1} = w_t.

NOTE: the bias is fixed to 0.
ROMMA [Li & Long '02]

[figure: weight space, showing example constraints 1–4, the iterates w_1, w_2, w_3, the feasible region of SVM, and w_SVM]

SVM (without bias):
  min (1/2)||w||_2^2
  sub. to:  y_j(w·x_j) ≥ 1  (j = 1, …, 4).

ROMMA:
  min (1/2)||w||_2^2
  sub. to:  y_t(w·x_t) ≥ 1,  w·w_t ≥ ||w_t||_2^2.
Solution of ROMMA

The solution of ROMMA is an additive update:

(i) If x_t is parallel to w_t, then w_{t+1} = α_t w_t, where α_t = 1/(y_t(w_t·x_t)).

(ii) Otherwise, w_{t+1} = α_t w_t + β_t y_t x_t, where

  α_t = (||x_t||_2^2 ||w_t||_2^2 - y_t(w_t·x_t)) / (||x_t||_2^2 ||w_t||_2^2 - (w_t·x_t)^2),
  β_t = (||w_t||_2^2 (1 - y_t(w_t·x_t))) / (||x_t||_2^2 ||w_t||_2^2 - (w_t·x_t)^2).
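A sketch of this closed-form update in Python/NumPy, our own rendering of the formulas above; it assumes w ≠ 0 (e.g., w initialized from the first mistaken example):

```python
import numpy as np

def romma_update(w, x, y):
    """Closed-form ROMMA update; call only when y * (w @ x) < 1 - delta."""
    ww, xx, wx = w @ w, x @ x, w @ x
    denom = xx * ww - wx ** 2       # zero only if x is parallel to w (Cauchy-Schwarz)
    if denom < 1e-12:               # case (i): pure scaling (assumes y * wx > 0 here)
        return w / (y * wx)
    alpha = (xx * ww - y * wx) / denom
    beta = ww * (1.0 - y * wx) / denom
    return alpha * w + beta * y * x
```

One can check that the result satisfies both constraints with equality: y(w_{t+1}·x_t) = 1 and w_{t+1}·w_t = ||w_t||_2^2.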
PUMMA

Given: S = ((x_1, y_1), …, (x_{t-1}, y_{t-1})), x_t,
1. Predict ŷ_t = sign(w_t·x_t + b_t), and receive y_t.
2. If y_t(w_t·x_t + b_t) < 1 - δ, update:

   (w_{t+1}, b_{t+1}) = argmin_{w,b} (1/2)||w||_q^2
   sub. to:  w·x_t^pos + b ≥ 1,
             -(w·x_t^neg + b) ≥ 1,
             f_q(w)·w_t ≥ ||w_t||_q^2,

   where the bias b is optimized, q is the dual norm (1/p + 1/q = 1),
   x_t^pos, x_t^neg are the last positive and negative examples which incur updates, and
   f_q is the link function [Grove et al. 97]:

     f_q(w)_i = sign(w_i) |w_i|^{q-1} / ||w||_q^{q-2}.

3. Otherwise, (w_{t+1}, b_{t+1}) = (w_t, b_t).

[figure: feasible regions of ROMMA (w·w_t ≥ ||w_t||_2^2 and margin ≥ 1) and PUMMA (f_q(w)·w_t ≥ ||w_t||_q^2 and margins ≥ 1 on x_t^pos, x_t^neg)]
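The link function and its inverse are easy to state in code. A sketch in Python/NumPy (our own illustration): f_q maps w so that ||f_q(w)||_p = ||w||_q and f_q(w)·w = ||w||_q^2, and f_p is its inverse since (p-1)(q-1) = 1.

```python
import numpy as np

def link(w, q):
    """f_q(w)_i = sign(w_i) |w_i|^(q-1) / ||w||_q^(q-2)."""
    norm_q = np.linalg.norm(w, ord=q)
    return np.sign(w) * np.abs(w) ** (q - 1) / norm_q ** (q - 2)

# Sanity check: f_p inverts f_q when 1/p + 1/q = 1.
w = np.array([0.5, -2.0, 1.5])
p, q = 4.0, 4.0 / 3.0
assert np.allclose(link(link(w, q), p), w)
```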
Solution of PUMMA

Observation: For p = 2, the solution is the same as that of ROMMA applied to z_t = x_t^pos - x_t^neg
(x_t^pos, x_t^neg: last positive and negative examples which incur updates).

In general, the solution of PUMMA is found numerically. Let z_t = x_t^pos - x_t^neg.

(i) If z_t is parallel to w_t, then w_{t+1} = α_t w_t, where α_t = 2/(w_t·z_t).

(ii) Otherwise, w_{t+1} = f_p(α_t f_q(w_t) + β_t z_t), where
  (α_t, β_t) = argmin_{α,β} (1/2)||α f_q(w_t) + β z_t||_p^2 (subject to the constraints),
  which is solved by the Newton method.

In either case, b_{t+1} = -(w_{t+1}·x_t^pos + w_{t+1}·x_t^neg)/2.
Our (implicit) reduction which preserves the margin

Hyperplane with bias:

  (w*, b*) = argmin_{w,b} (1/2)||w||_q^2
  sub. to:  w·x_i^pos + b ≥ 1      (i = 1, …, P),
            -(w·x_j^neg + b) ≥ 1   (j = 1, …, N).

Hyperplane without bias, over pairs of positive and negative instances:

  w̃ = argmin_w (1/2)||w||_q^2
  sub. to:  w·(x_i^pos - x_j^neg) ≥ 2  (i = 1, …, P, j = 1, …, N).

Thm. w* = w̃.
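The theorem is easy to check numerically for p = q = 2. The sketch below (our own illustration, using SciPy's general-purpose SLSQP solver rather than a dedicated QP solver) solves both problems on a small separable sample and compares the minimizers:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
pos = rng.normal(loc=+2.0, size=(5, 3))  # positive instances
neg = rng.normal(loc=-2.0, size=(5, 3))  # negative instances

# With bias: min (1/2)||w||^2  s.t.  w.x + b >= 1 (pos), -(w.x + b) >= 1 (neg).
cons_bias = ([{'type': 'ineq', 'fun': lambda v, x=x: v[:3] @ x + v[3] - 1} for x in pos]
           + [{'type': 'ineq', 'fun': lambda v, x=x: -(v[:3] @ x + v[3]) - 1} for x in neg])
w_star = minimize(lambda v: 0.5 * v[:3] @ v[:3], np.zeros(4),
                  constraints=cons_bias, method='SLSQP').x[:3]

# Without bias, over all pos/neg pairs: min (1/2)||w||^2  s.t.  w.(x_p - x_n) >= 2.
cons_pairs = [{'type': 'ineq', 'fun': lambda w, z=xp - xn: w @ z - 2}
              for xp in pos for xn in neg]
w_tilde = minimize(lambda w: 0.5 * w @ w, np.zeros(3),
                   constraints=cons_pairs, method='SLSQP').x

print(np.allclose(w_star, w_tilde, atol=1e-4))  # should print True (up to solver tolerance)
```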
Main Result

Thm. Suppose that, given S = ((x_1, y_1), …, (x_T, y_T)), there exists a linear classifier (u, b) s.t. y_t(u·x_t + b) ≥ 1 for t = 1, …, T. Then:
• (# of updates of PUMMA_p(δ)) ≤ (p-1)||u||_q^2 R^2 / δ^2, and
• after (p-1)||u||_q^2 R^2 / δ^2 updates, PUMMA_p(δ) outputs a hypothesis with p-norm margin ≥ (1-δ)γ (γ: margin of (u, b)),
where R = max_{t=1,…,T} ||x_t||_p.

These bounds are similar to those of previous algorithms.
Experiment over artificial data

• example (x, y):
  – x: n(=100)-dimensional {-1,+1}-valued vector
  – y = f(x), where f(x) = sign(x_1 + x_2 + … + x_16 + b)
• generate 1000 examples randomly
• 3 datasets: b = 1 (small), 9 (medium), 15 (large)
• Compare with ROMMA (p = 2) and ALMA (p = 2 ln n).
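A sketch of this generator in Python/NumPy; the target f is our reconstruction from the slide, so the sign convention on b is an assumption:

```python
import numpy as np

def make_data(n=100, m=1000, k=16, b=1, seed=0):
    """m random {-1,+1}^n instances labeled by f(x) = sign(x_1 + ... + x_k + b)."""
    rng = np.random.default_rng(seed)
    X = rng.choice([-1.0, 1.0], size=(m, n))
    y = np.sign(X[:, :k].sum(axis=1) + b)  # never 0 here: even sum plus odd b
    return X, y
```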
Results over Artificial Data

NOTE 1: the margin is defined over the original space (w/o reduction).
NOTE 2: We omit the results for b = 9 for clarity.

[figure: margin vs. # of updates. Left panel (p = 2, margin annotation 0.0225): PUMMA(15), PUMMA(1), ROMMA(15), ROMMA(1). Right panel (p = 2 ln N, margin annotation 0.0461): ALMA(15), ALMA(9), ALMA(1), PUMMA(15), PUMMA(9), PUMMA(1).]
Computation Time

For p = 2, PUMMA is faster than ROMMA. For p = 2 ln n, PUMMA is faster than ALMA even though PUMMA uses the Newton method.

[figure: computation time (sec.) vs. bias (large ← bias → small). Left panel (p = 2): PUMMA vs. ROMMA. Right panel (p = 2 ln n): PUMMA vs. ALMA.]
Results over UCI Adult data

adult: # of data = 32561

  algorithm     sec.    margin rate (%)
  SVMlight      5893    100
  ROMMA (99%)   71296   99.03
  PUMMA (99%)   44480   99.14

• Fix p = 2.
• 2-norm soft margin formulation for linearly inseparable data.
• Run ROMMA and PUMMA until they achieve 99% of the maximum margin.
Results over MNIST data

MNIST
# of data

  algorithm     sec.      margin rate (%)
  SVMlight      401.36    100
  ROMMA (99%)   1715.57   93.5
  PUMMA (99%)   1971.30   99.2

• Fix p = 2.
• Use polynomial kernels.
• 2-norm soft margin formulation for linearly inseparable data.
• Run ROMMA and PUMMA until they achieve 99% of the maximum margin.
Summary
• PUMMA can learn p-norm maximum margin classifiers with bias directly.
  – # of updates is similar to those of previous algorithms.
  – achieves (1-δ) times the maximum p-norm margin.
• PUMMA outperforms other online algorithms when the underlying hyperplane has a large bias.
Future work
• Maximizing the ∞-norm margin directly.
• Tighter bounds on the # of updates:
  – In our experiments, PUMMA is faster especially when the bias is large (like Winnow).
  – Our current bound does not reflect this fact.