Online Learning of Maximum Margin Classifiers (p-Norm with Bias)
Kohei Hatano, Kyushu University
(Joint work with K. Ishibashi and M. Takeda)
COLT 2008
Plan of this talk
1. Introduction
2. Preliminaries
   – ROMMA
3. Our results
   – Our new algorithm PUMMA
   – Our implicit reduction
4. Experiments
Maximum Margin Classification

[figure: a hyperplane separating two classes, with the margin marked]

• SVMs [Boser et al. 92] – 2-norm margin
• Boosting [Freund & Schapire 97] – ∞-norm margin (approximately)
• Why maximum (or large) margin?
  – Good generalization [Schapire et al. 98] [Shawe-Taylor et al. 98]
  – Formulated as convex optimization problems (QP, LP)
Scaling up Max. Margin Classification
1. Decomposition methods (for SVMs)
   – Break the original QP into smaller QPs
   – SMO [Platt 99], SVMlight [Joachims 99], LIBSVM [Chang & Lin 01]
   – state-of-the-art implementations
2. Online learning (our approach)
Online Learning

Advantages of online learning:
• Simple & easy to implement
• Uses less memory
• Adaptive to changing concepts

Online Learning Algorithm
For t = 1 to T:
1. Receive an instance x_t in R^n
2. Guess a label ŷ_t = sign(w_t·x_t + b_t)
3. Receive the label y_t in {-1, +1}
4. Update (w_{t+1}, b_{t+1}) = UPDATE_RULE(w_t, b_t, x_t, y_t)
end

[figure: the learner predicts +1 or -1 on x_t and updates (w_t, b_t) to (w_{t+1}, b_{t+1})]
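The protocol above is easy to express in code. Here is a minimal sketch in Python/NumPy; `update_rule` is a hypothetical placeholder for any of the update rules discussed in this talk (ROMMA, PUMMA, ...), and the mistake counter is only bookkeeping.

```python
import numpy as np

def online_learn(stream, update_rule, n):
    """Run the generic online learning protocol over (x_t, y_t) pairs."""
    w, b = np.zeros(n), 0.0
    mistakes = 0
    for x_t, y_t in stream:                 # 1. receive an instance x_t
        y_hat = np.sign(w @ x_t + b)        # 2. guess a label
        if y_hat != y_t:                    # 3. receive the true label y_t
            mistakes += 1
        w, b = update_rule(w, b, x_t, y_t)  # 4. update the hypothesis
    return w, b, mistakes
```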
Online Learning Algorithms for maximum margin classification
• Max Margin Perceptron [Kowalczyk 00]
• ROMMA [Li & Long 02]
• ALMA [Gentile 01]
• LASVM [Bordes et al. 05]
• MICRA [Tsampouka & Shawe-Taylor 07]
• Pegasos [Shalev-Shwartz et al. 07]
• etc.
Most online algorithms cannot learn a hyperplane with bias!

[figure: a hyperplane with bias vs. a hyperplane without bias (through the origin)]
Typical Reduction to deal with bias [Cf. Cristianini & Shawe-Taylor 00]

Add an extra dimension corresponding to the bias.

Original space                                   Augmented space
instance:   x_j in R^n                      ↔    x̃_j = (x_j, R) in R^{n+1}
hyperplane: (u, b), ||u|| = 1               ↔    ũ = (u, b/R)
            R := max_j ||x_j||                   R̃ := max_j ||x̃_j||
margin (over normalized instances):
            γ = min_j y_j(u·x_j + b) / R    ↔    γ̃ = min_j y_j(ũ·x̃_j) / (||ũ|| R̃)

NOTE: ũ·x̃_j = u·x_j + b, so ũ is equivalent to (u, b).
This reduction weakens the margin guarantee: γ̃ can be as small as γ/2.
→ It might cause a significant difference in generalization!
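As a concrete illustration, here is a minimal sketch of the augmentation in Python/NumPy (the function names are ours, not from the talk):

```python
import numpy as np

def augment(X):
    """Append a constant coordinate R = max_j ||x_j|| to every instance."""
    R = np.max(np.linalg.norm(X, axis=1))
    return np.hstack([X, np.full((len(X), 1), R)]), R

def augment_hyperplane(u, b, R):
    """The hyperplane (u, b) becomes the homogeneous hyperplane (u, b/R)."""
    return np.append(u, b / R)
```

Then augment_hyperplane(u, b, R) @ x̃_j equals u·x_j + b, so an algorithm for homogeneous hyperplanes simulates the biased one, at the cost of the weakened margin noted above.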
Our New Online Learning Algorithm

PUMMA (P-norm Utilizing Maximum Margin Algorithm)
• PUMMA can learn maximum margin classifiers with bias directly (without using the typical reduction!).
• Margin is defined by the p-norm (p ≥ 2).
  – For p = 2, similar to the Perceptron.
  – For p = O(ln n) [Gentile '03], similar to Winnow [Littlestone '88]: fast when the target is sparse.
• Extended to the linearly inseparable case (omitted).
  – Soft margin with 2-norm slack variables.
Problem of finding the p-norm maximum margin hyperplane [Cf. Mangasarian 99]

Given: (linearly separable) S = ((x_1, y_1), …, (x_T, y_T)),
Goal: Find an approximate solution of

  (w*, b*) = argmin_{w,b} (1/2)||w||_q^2
  sub. to:  y_j(w·x_j + b) ≥ 1  (j = 1, …, T),

where q is the dual norm (1/p + 1/q = 1), e.g. p = 2, q = 2; p = ∞, q = 1.

We want an online algorithm solving the problem with a small # of updates.
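To make the norms concrete, here is a small Python/NumPy helper (our own illustration, not part of the talk) computing the dual exponent and the p-norm margin of a hyperplane on a sample:

```python
import numpy as np

def dual_exponent(p):
    """Return q with 1/p + 1/q = 1 (q = 1 for p = infinity)."""
    return 1.0 if np.isinf(p) else p / (p - 1.0)

def p_norm_margin(w, b, X, y, p=2.0):
    """min_j y_j(w.x_j + b) / ||w||_q: the weight is measured in the dual q-norm."""
    q = dual_exponent(p)
    return np.min(y * (X @ w + b)) / np.linalg.norm(w, ord=q)
```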
ROMMA (Relaxed Online Maximum Margin Algorithm) [Li & Long '02]

Given: S = ((x_1, y_1), …, (x_{t-1}, y_{t-1})), x_t,
1. Predict ŷ_t = sign(w_t·x_t), and receive y_t.
2. If y_t(w_t·x_t) < 1 - δ (margin is "insufficient"), update:

   w_{t+1} = argmin_w (1/2)||w||_2^2
   sub. to:  y_t(w·x_t) ≥ 1        (constraint over the last example which causes an update)
             w·w_t ≥ ||w_t||_2^2   (constraint over the last hyperplane)

   → 2 constraints only!

3. Otherwise, w_{t+1} = w_t.

NOTE: the bias is fixed to 0.
ROMMA [Li & Long '02]

[figure: weight space, showing example constraints 1–4, the iterates w_1, w_2, w_3, the feasible region of SVM, and w_SVM]

SVM (without bias):
  min (1/2)||w||_2^2
  sub. to:  y_j(w·x_j) ≥ 1  (j = 1, …, 4).

ROMMA:
  min (1/2)||w||_2^2
  sub. to:  y_t(w·x_t) ≥ 1,  w·w_t ≥ ||w_t||_2^2.
Solution of ROMMA

The solution of ROMMA is an additive update:

(i) If x_t is parallel to w_t, then w_{t+1} = α_t w_t, where α_t = 1/(y_t(w_t·x_t)).

(ii) Otherwise, w_{t+1} = α_t w_t + β_t y_t x_t, where

  α_t = (||x_t||_2^2 ||w_t||_2^2 - y_t(w_t·x_t)) / (||x_t||_2^2 ||w_t||_2^2 - (w_t·x_t)^2),
  β_t = (||w_t||_2^2 (1 - y_t(w_t·x_t))) / (||x_t||_2^2 ||w_t||_2^2 - (w_t·x_t)^2).
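A sketch of this closed-form update in Python/NumPy, our own rendering of the formulas above; it assumes w ≠ 0 (e.g., w initialized from the first mistaken example):

```python
import numpy as np

def romma_update(w, x, y):
    """Closed-form ROMMA update; call only when y * (w @ x) < 1 - delta."""
    ww, xx, wx = w @ w, x @ x, w @ x
    denom = xx * ww - wx ** 2       # zero only if x is parallel to w (Cauchy-Schwarz)
    if denom < 1e-12:               # case (i): pure scaling (assumes y * wx > 0 here)
        return w / (y * wx)
    alpha = (xx * ww - y * wx) / denom
    beta = ww * (1.0 - y * wx) / denom
    return alpha * w + beta * y * x
```

One can check that the result satisfies both constraints with equality: y(w_{t+1}·x_t) = 1 and w_{t+1}·w_t = ||w_t||_2^2.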
PUMMA

Given: S = ((x_1, y_1), …, (x_{t-1}, y_{t-1})), x_t,
1. Predict ŷ_t = sign(w_t·x_t + b_t), and receive y_t.
2. If y_t(w_t·x_t + b_t) < 1 - δ, update:

   (w_{t+1}, b_{t+1}) = argmin_{w,b} (1/2)||w||_q^2
   sub. to:  w·x_t^pos + b ≥ 1,
             -(w·x_t^neg + b) ≥ 1,
             f_q(w)·w_t ≥ ||w_t||_q^2,

   where the bias b is optimized, q is the dual norm (1/p + 1/q = 1),
   x_t^pos, x_t^neg are the last positive and negative examples which incur updates, and
   f_q is the link function [Grove et al. 97]:

     f_q(w)_i = sign(w_i) |w_i|^{q-1} / ||w||_q^{q-2}.

3. Otherwise, (w_{t+1}, b_{t+1}) = (w_t, b_t).

[figure: feasible regions of ROMMA (w·w_t ≥ ||w_t||_2^2 and margin ≥ 1) and PUMMA (f_q(w)·w_t ≥ ||w_t||_q^2 and margins ≥ 1 on x_t^pos, x_t^neg)]
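The link function and its inverse are easy to state in code. A sketch in Python/NumPy (our own illustration): f_q maps w so that ||f_q(w)||_p = ||w||_q and f_q(w)·w = ||w||_q^2, and f_p is its inverse since (p-1)(q-1) = 1.

```python
import numpy as np

def link(w, q):
    """f_q(w)_i = sign(w_i) |w_i|^(q-1) / ||w||_q^(q-2)."""
    norm_q = np.linalg.norm(w, ord=q)
    return np.sign(w) * np.abs(w) ** (q - 1) / norm_q ** (q - 2)

# Sanity check: f_p inverts f_q when 1/p + 1/q = 1.
w = np.array([0.5, -2.0, 1.5])
p, q = 4.0, 4.0 / 3.0
assert np.allclose(link(link(w, q), p), w)
```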
Solution of PUMMA

Observation: For p = 2, the solution is the same as that of ROMMA applied to z_t = x_t^pos - x_t^neg
(x_t^pos, x_t^neg: last positive and negative examples which incur updates).

In general, the solution of PUMMA is found numerically. Let z_t = x_t^pos - x_t^neg.

(i) If z_t is parallel to w_t, then w_{t+1} = α_t w_t, where α_t = 2/(w_t·z_t).

(ii) Otherwise, w_{t+1} = f_p(α_t f_q(w_t) + β_t z_t), where
  (α_t, β_t) = argmin_{α,β} (1/2)||α f_q(w_t) + β z_t||_p^2 (subject to the constraints),
  which is solved by the Newton method.

In either case, b_{t+1} = -(w_{t+1}·x_t^pos + w_{t+1}·x_t^neg)/2.
Our (implicit) reduction which preserves the margin

Hyperplane with bias:

  (w*, b*) = argmin_{w,b} (1/2)||w||_q^2
  sub. to:  w·x_i^pos + b ≥ 1      (i = 1, …, P),
            -(w·x_j^neg + b) ≥ 1   (j = 1, …, N).

Hyperplane without bias, over pairs of positive and negative instances:

  w̃ = argmin_w (1/2)||w||_q^2
  sub. to:  w·(x_i^pos - x_j^neg) ≥ 2  (i = 1, …, P, j = 1, …, N).

Thm. w* = w̃.
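The theorem is easy to check numerically for p = q = 2. The sketch below (our own illustration, using SciPy's general-purpose SLSQP solver rather than a dedicated QP solver) solves both problems on a small separable sample and compares the minimizers:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
pos = rng.normal(loc=+2.0, size=(5, 3))  # positive instances
neg = rng.normal(loc=-2.0, size=(5, 3))  # negative instances

# With bias: min (1/2)||w||^2  s.t.  w.x + b >= 1 (pos), -(w.x + b) >= 1 (neg).
cons_bias = ([{'type': 'ineq', 'fun': lambda v, x=x: v[:3] @ x + v[3] - 1} for x in pos]
           + [{'type': 'ineq', 'fun': lambda v, x=x: -(v[:3] @ x + v[3]) - 1} for x in neg])
w_star = minimize(lambda v: 0.5 * v[:3] @ v[:3], np.zeros(4),
                  constraints=cons_bias, method='SLSQP').x[:3]

# Without bias, over all pos/neg pairs: min (1/2)||w||^2  s.t.  w.(x_p - x_n) >= 2.
cons_pairs = [{'type': 'ineq', 'fun': lambda w, z=xp - xn: w @ z - 2}
              for xp in pos for xn in neg]
w_tilde = minimize(lambda w: 0.5 * w @ w, np.zeros(3),
                   constraints=cons_pairs, method='SLSQP').x

print(np.allclose(w_star, w_tilde, atol=1e-4))  # should print True (up to solver tolerance)
```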
Main Result

Thm. Suppose that, given S = ((x_1, y_1), …, (x_T, y_T)), there exists a linear classifier (u, b) s.t. y_t(u·x_t + b) ≥ 1 for t = 1, …, T. Then:
• (# of updates of PUMMA_p(δ)) ≤ (p-1)||u||_q^2 R^2 / δ^2, and
• after (p-1)||u||_q^2 R^2 / δ^2 updates, PUMMA_p(δ) outputs a hypothesis with p-norm margin ≥ (1-δ)γ (γ: margin of (u, b)),
where R = max_{t=1,…,T} ||x_t||_p.

These bounds are similar to those of previous algorithms.
Experiment over artificial data

• example (x, y):
  – x: n(=100)-dimensional {-1,+1}-valued vector
  – y = f(x), where f(x) = sign(x_1 + x_2 + … + x_16 + b)
• generate 1000 examples randomly
• 3 datasets: b = 1 (small), 9 (medium), 15 (large)
• Compare with ROMMA (p = 2) and ALMA (p = 2 ln n).
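A sketch of this generator in Python/NumPy; the target f is our reconstruction from the slide, so the sign convention on b is an assumption:

```python
import numpy as np

def make_data(n=100, m=1000, k=16, b=1, seed=0):
    """m random {-1,+1}^n instances labeled by f(x) = sign(x_1 + ... + x_k + b)."""
    rng = np.random.default_rng(seed)
    X = rng.choice([-1.0, 1.0], size=(m, n))
    y = np.sign(X[:, :k].sum(axis=1) + b)  # never 0 here: even sum plus odd b
    return X, y
```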
Results over Artificial Data

NOTE 1: the margin is defined over the original space (w/o reduction).
NOTE 2: We omit the results for b = 9 for clarity.

[figure: margin vs. # of updates. Left panel (p = 2, margin annotation 0.0225): PUMMA(15), PUMMA(1), ROMMA(15), ROMMA(1). Right panel (p = 2 ln N, margin annotation 0.0461): ALMA(15), ALMA(9), ALMA(1), PUMMA(15), PUMMA(9), PUMMA(1).]
Computation Time

For p = 2, PUMMA is faster than ROMMA. For p = 2 ln n, PUMMA is faster than ALMA even though PUMMA uses the Newton method.

[figure: computation time (sec.) vs. bias (large ← bias → small). Left panel (p = 2): PUMMA vs. ROMMA. Right panel (p = 2 ln n): PUMMA vs. ALMA.]
Results over UCI Adult data

adult: # of data = 32561

  algorithm     sec.    margin rate (%)
  SVMlight      5893    100
  ROMMA (99%)   71296   99.03
  PUMMA (99%)   44480   99.14

• Fix p = 2.
• 2-norm soft margin formulation for linearly inseparable data.
• Run ROMMA and PUMMA until they achieve 99% of the maximum margin.
Results over MNIST data

MNIST
# of data

  algorithm     sec.      margin rate (%)
  SVMlight      401.36    100
  ROMMA (99%)   1715.57   93.5
  PUMMA (99%)   1971.30   99.2

• Fix p = 2.
• Use polynomial kernels.
• 2-norm soft margin formulation for linearly inseparable data.
• Run ROMMA and PUMMA until they achieve 99% of the maximum margin.
Summary
• PUMMA can learn p-norm maximum margin classifiers with bias directly.
  – # of updates is similar to those of previous algorithms.
  – achieves (1-δ) times the maximum p-norm margin.
• PUMMA outperforms other online algorithms when the underlying hyperplane has a large bias.
Future work
• Maximizing the ∞-norm margin directly.
• Tighter bounds on the # of updates:
  – In our experiments, PUMMA is faster especially when the bias is large (like Winnow).
  – Our current bound does not reflect this fact.