Forgetron Slide 1
Online Learning with a Memory Harness using
the Forgetron
Shai Shalev-Shwartz, joint work with
Ofer Dekel and Yoram Singer
Large Scale Kernel Machines workshop, NIPS’05, Whistler
The Hebrew University, Jerusalem, Israel
Forgetron Slide 2
Overview
• Online learning with kernels
• Goal: strict limit on the number of
“support vectors”
• The Forgetron algorithm
• Analysis
• Experiments
Forgetron Slide 3
current classifier: f_t(x) = Σ_{i∈I} y_i K(x_i, x)
Kernel-based Perceptron for Online Learning
[Diagram: the online learning protocol]
• The learner maintains an active set over rounds 1 2 3 4 5 6 7 …, e.g. I = {1,3}
• #Mistakes = M
• On round t: receive x_t, predict sign(f_t(x_t)), then observe the true label y_t
• On a mistake, add t to the active set: I = {1,3} becomes I = {1,3,4}
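The protocol above can be sketched in code; this is a minimal illustrative implementation, where the Gaussian kernel and the toy example stream are my own assumptions, not part of the talk:

```python
# Minimal sketch of the kernel-based Perceptron for online learning.
# The kernel choice and the data stream are illustrative assumptions.
import math

def gaussian_kernel(a, b, gamma=1.0):
    """Gaussian (RBF) kernel; note K(x, x) = 1, as assumed on the mistake-bound slide."""
    return math.exp(-gamma * sum((u - v) ** 2 for u, v in zip(a, b)))

def kernel_perceptron(stream, kernel):
    """Predict sign(f_t(x)) each round; on a mistake, add the example to the active set."""
    active = []      # (x_i, y_i) pairs -- the "support vectors" indexed by I
    mistakes = 0     # the mistake counter M
    for x, y in stream:
        f = sum(yi * kernel(xi, x) for xi, yi in active)
        y_hat = 1 if f >= 0 else -1
        if y_hat != y:
            mistakes += 1
            active.append((x, y))   # I <- I ∪ {t}
    return active, mistakes
```

Note that the size of the active set equals the number of mistakes M, which is exactly why it can grow without bound.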
Forgetron Slide 5
Learning on a Budget
• |I| = number of mistakes until round t
• Memory and time inefficient
• |I| might grow unboundedly
• Goal: Construct a kernel-based online algorithm for which:
• |I| ≤ B for each t
• Still performs “well”: comes with a performance guarantee
Forgetron Slide 6
Mistake Bound for Perceptron
• {(x_1,y_1), …, (x_T,y_T)}: a sequence of examples
• A kernel K s.t. K(x_t, x_t) ≤ 1
• g: a fixed competitor classifier in the RKHS
• Define ℓ_t(g) = max(0, 1 − y_t g(x_t))
• Then the number of mistakes M satisfies M ≤ ||g||² + 2 Σ_{t=1}^T ℓ_t(g)
Forgetron Slide 7
Previous Work
• Crammer, Kandola, Singer (2003)
• Kivinen, Smola, Williamson (2004)
• Weston, Bordes, Bottou (2005)
Previous online budget algorithms do not provide a mistake bound
Is our goal attainable?
Forgetron Slide 8
Mission Impossible
• Input space: {e_1, …, e_{B+1}}
• Linear kernel: K(e_i, e_j) = e_i · e_j = δ_{i,j}
• Budget constraint: |I| ≤ B. Therefore, there exists j s.t. Σ_{i∈I} α_i K(e_i, e_j) = 0
• We might always err
• But, the competitor g = Σ_i e_i never errs!
• Perceptron makes B+1 mistakes
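The counterexample can be checked numerically; a small sketch (all names and the choice B = 4 are illustrative):

```python
# Illustrative check of the counterexample: with inputs e_1..e_{B+1} and a linear
# kernel, any active set of size <= B leaves some e_j on which a budget classifier
# scores 0, while the competitor g = sum_i e_i scores 1 on every e_j.
B = 4
dim = B + 1

def e(i, dim):
    """Standard basis vector e_i in R^dim."""
    v = [0.0] * dim
    v[i] = 1.0
    return v

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

active = list(range(B))  # an arbitrary budget-B active set: indices of e_0..e_{B-1}
# find the index j the adversary can exploit: every active example is orthogonal to e_j
j = next(k for k in range(dim)
         if all(dot(e(i, dim), e(k, dim)) == 0.0 for i in active))
f_at_ej = sum(dot(e(i, dim), e(j, dim)) for i in active)       # budget classifier: 0
g_at_ej = sum(dot(e(i, dim), e(j, dim)) for i in range(dim))   # competitor g: 1
```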
Forgetron Slide 9
Redefine the Goal
• We must restrict the competitor g
somehow. One way: restrict ||g||
• The counterexample implies that we
cannot compete with ||g|| ≥ (B+1)^{1/2}
• Main result: The Forgetron algorithm
can compete with any classifier g s.t. ||g|| = O((B / log B)^{1/2})
Forgetron Slide 10
The Forgetron
Current hypothesis (over the timeline 1 2 3 … t−1 t): f_t(x) = Σ_{i∈I} σ_i y_i K(x_i, x)
Step (1) – Perceptron: on a mistake, insert the new example: I′ = I ∪ {t}
Step (2) – Shrinking: scale the coefficients: σ_i ← φ_t σ_i
Step (3) – Remove Oldest: if |I′| > B, discard the oldest active example, r = min I′
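A mistake round of the Forgetron can be sketched as follows. The real algorithm tunes the shrinking factor φ_t on every round to control the deviation; the fixed `phi` below is only an illustrative placeholder, and all names are mine:

```python
# Hedged sketch of one Forgetron mistake-round update, following the three
# steps on this slide. `phi` stands in for the per-round shrinking factor.
def forgetron_update(active, x, y, budget, phi=0.9):
    """active: ordered list of [x_i, y_i, sigma_i]; call only when a mistake occurs."""
    # Step (1) - Perceptron: insert the new example with coefficient 1
    active.append([x, y, 1.0])
    # Step (2) - Shrinking: scale every coefficient, sigma_i <- phi * sigma_i
    for entry in active:
        entry[2] *= phi
    # Step (3) - Remove oldest: if over budget, discard the oldest index r = min I'
    if len(active) > budget:
        active.pop(0)
    return active
```

Because every surviving coefficient has been shrunk on each round since its insertion, the oldest example carries the smallest weight by the time it is removed, which is what keeps the damage from the removal step small.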
Forgetron Slide 12
Quantifying Deviation
• “Progress” measure: Δ_t = ||f_t − g||² − ||f_{t+1} − g||²
• “Progress” for each update step
• “Deviation” is measured by negative progress
• Δ_t decomposes into the three steps: Δ_t =
(||f_t − g||² − ||f′ − g||²)   [after Perceptron]
+ (||f′ − g||² − ||f″ − g||²)   [after shrinking]
+ (||f″ − g||² − ||f_{t+1} − g||²)   [after removal]
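The decomposition is a telescoping identity, which a quick numeric check confirms; the vectors below are arbitrary stand-ins for the intermediate hypotheses f′ and f″, not values from the talk:

```python
# Numeric check of the telescoping decomposition of the per-round progress.
def sqdist(a, b):
    """Squared Euclidean distance, standing in for ||f - g||^2 in the RKHS."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

g   = [1.0, 0.0]   # the competitor
ft  = [0.0, 0.0]   # hypothesis at the start of the round
fp  = [0.5, 0.5]   # f'  : after the Perceptron step
fpp = [0.4, 0.4]   # f'' : after shrinking
ft1 = [0.4, 0.0]   # f_{t+1}: after removal

total = sqdist(ft, g) - sqdist(ft1, g)
parts = (
    (sqdist(ft, g) - sqdist(fp, g))      # progress from the Perceptron step
    + (sqdist(fp, g) - sqdist(fpp, g))   # progress (or damage) from shrinking
    + (sqdist(fpp, g) - sqdist(ft1, g))  # progress (or damage) from removal
)
```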
Forgetron Slide 13
Quantifying Deviation
The Forgetron sets:
• Gain from Perceptron step:
• Damage from shrinking:
• Damage from removal:
Forgetron Slide 14
Resulting Mistake Bound
For any g s.t. ||g|| = O((B / log B)^{1/2}),
the number of prediction mistakes the
Forgetron makes is at most 2(||g||² + 2 Σ_{t=1}^T ℓ_t(g)),
i.e. within a constant factor of the Perceptron’s mistake bound
Forgetron Slide 22
Experiment I: MNIST dataset
[Plot: average error vs. budget size B (1000–3000) for the Forgetron and the CKS algorithm]
Forgetron Slide 23
Experiment II: Census-income (adult)
[Plot: average error vs. budget size B (1000–6000) for the Forgetron and the CKS algorithm]
… (Perceptron makes 16,000 mistakes)
Forgetron Slide 24
Experiment III: Synthetic Data with Label Noise
[Plot: average error vs. budget size B (0–2000) for the Forgetron and the CKS algorithm]