
Lecture 8: Online and Incremental Learning

Pavel Laskov and Blaine Nelson

Cognitive Systems Group
Wilhelm Schickard Institute for Computer Science
Universität Tübingen, Germany

Advanced Topics in Machine Learning, June 19, 2012


Why Learn Online?

Non-stationarity of real data

⇒ Update of models is needed when new data is available
⇒ Evaluation of the relevance of past data

Learning on a tight budget

⇒ Some data may be irrelevant
⇒ Cost-sensitive learning


Applications of Online Learning

Time series analysis

Security event monitoring

[Figure: EW-CUSUM detection score, geometric STD normalization]

Anomaly and change point detection

Object tracking


Computational Considerations

How much do we pay for training?

PCA: O(n³) (eigenvalue decomposition)

Ridge regression: O(n³) (matrix inversion)

SVM: O(n² log n) (theoretical bound on feasible direction decomposition)

How much are we willing to pay?

An order of magnitude less: to match batch learning

Constant or linear time: if we are really greedy!


Incremental SVM: Problem Setup

Initial state: An SVM has been trained on a data set X = {(x_1, y_1), ..., (x_k, y_k)}, and the initial solution α⁰ is available.

Goal: Given a new data point x_c, find a solution α* which is optimal for the extended training set X ∪ {x_c}.

Key idea: Ensure that the optimality conditions are enforced at every step of updating the solution from α⁰ to α*.

Notation: All training data is grouped into three sets according to the values of α:

set S: 0 < α_i < C
set O: α_i = 0
set E: α_i = C


KKT Re-visited (once again)

Consider the saddle-point formulation of the SVM training problem:

$$\max_{\mu}\; \min_{0 \le \alpha \le C}\; W = -\mathbf{1}^{\top}\alpha + \tfrac{1}{2}\,\alpha^{\top} H \alpha + \mu\,(y^{\top}\alpha)$$

The KKT conditions for this problem can be expressed as follows:

$$g_i = \frac{\partial W}{\partial \alpha_i} = \sum_j H_{ij}\alpha_j + y_i\mu - 1
\;\begin{cases} \ge 0, & i \in O \\ = 0, & i \in S \\ \le 0, & i \in E \end{cases}$$

$$\frac{\partial W}{\partial \mu} = \sum_j y_j\alpha_j = 0$$

These conditions must be met for both α⁰ and α*, i.e., before and after the inclusion of x_c, and at every single intermediate step!
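To make these conditions concrete, here is a minimal NumPy sketch (my own, not from the lecture; it assumes the usual H_ij = y_i y_j k(x_i, x_j)) that evaluates the gradients g and checks the KKT conditions for a given α and µ:

```python
import numpy as np

def kkt_gradients(H, y, alpha, mu):
    # g_i = sum_j H_ij alpha_j + y_i * mu - 1
    return H @ alpha + y * mu - 1.0

def kkt_satisfied(H, y, alpha, mu, C, tol=1e-8):
    g = kkt_gradients(H, y, alpha, mu)
    S = (alpha > tol) & (alpha < C - tol)   # margin support vectors
    O = alpha <= tol                        # non-support vectors
    E = alpha >= C - tol                    # bounded ("error") support vectors
    return (np.all(np.abs(g[S]) <= tol)     # g_i = 0 on S
            and np.all(g[O] >= -tol)        # g_i >= 0 on O
            and np.all(g[E] <= tol)         # g_i <= 0 on E
            and abs(y @ alpha) <= tol)      # equality constraint sum_j y_j alpha_j = 0
```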


Incremental SVM: Initialization

Which weight α_c should be assigned to x_c at the beginning?

Let's try α_c = 0. What happens?

Constraints for all g_i, i ≠ c, remain intact.

Equality constraint remains intact.

A new constraint gets added for g_c:

$$g_c = \sum_j H_{cj}\alpha_j + y_c\mu - 1 \ge 0, \quad \text{since } \alpha_c = 0$$

If this last constraint is satisfied for α_c = 0, then nothing needs to be done, and α* = α⁰.

Otherwise we need to increase α_c and update all other α's so as to keep all conditions except the one for g_c satisfied.
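As a small illustration, the initialization decision boils down to evaluating g_c once at α_c = 0. A hedged sketch (my own; the helper name and K_c are assumptions, with H_cj = y_c y_j k(x_c, x_j)):

```python
import numpy as np

def needs_update(K_c, y, alpha, mu, y_c):
    """K_c[j] = k(x_c, x_j) for the current training points."""
    H_c = y_c * y * K_c                  # row H_cj = y_c y_j k(x_c, x_j)
    g_c = H_c @ alpha + y_c * mu - 1.0   # gradient at alpha_c = 0
    return g_c < 0                       # g_c >= 0  =>  alpha* = alpha^0, nothing to do
```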


KKT Conditions in Differential Form

Consider the difference between the optimality conditions in two different SVM states:

$$\Delta g_i = H_{ic}\,\Delta\alpha_c + \sum_{j \in S} H_{ij}\,\Delta\alpha_j + y_i\,\Delta\mu, \qquad \forall i \in D \cup \{c\}$$

$$0 = y_c\,\Delta\alpha_c + \sum_{j \in S} y_j\,\Delta\alpha_j$$

For i ∈ S, g_i = 0; hence, these conditions can be written in matrix form:

$$\underbrace{\begin{bmatrix} 0 & y_{s_1} & \cdots & y_{s_k} \\ y_{s_1} & H_{s_1 s_1} & \cdots & H_{s_1 s_k} \\ \vdots & \vdots & \ddots & \vdots \\ y_{s_k} & H_{s_k s_1} & \cdots & H_{s_k s_k} \end{bmatrix}}_{Q}\;
\underbrace{\begin{bmatrix} \Delta\mu \\ \Delta\alpha_{s_1} \\ \vdots \\ \Delta\alpha_{s_k} \end{bmatrix}}_{\Delta\alpha_S}
= -\underbrace{\begin{bmatrix} y_c \\ H_{s_1 c} \\ \vdots \\ H_{s_k c} \end{bmatrix}}_{\eta_c}\,\Delta\alpha_c$$


Sensitivity Conditions

The conditions for g_S imply that

$$\Delta\alpha_S = \underbrace{-\,Q^{-1}\eta_c}_{\beta}\,\Delta\alpha_c$$

⇒ the vector β measures the sensitivity of α with respect to Δα_c.

Substituting the sensitivity of α_S back into the conditions for g, we obtain:

$$\Delta g_i = \gamma_i\,\Delta\alpha_c, \quad \forall i \in D \cup \{c\}, \qquad \text{where } \gamma_i = H_{ic} + \sum_{j \in S} H_{ij}\beta_j + y_i\beta_0$$

⇒ the vector γ measures the sensitivity of g_i, for i ∈ O ∪ E, with respect to Δα_c.
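For concreteness, a minimal NumPy sketch of these two sensitivity computations (my own naming, not from the lecture; S and rest are assumed to be integer index arrays for S and O ∪ E, and Q_inv is the maintained inverse of Q):

```python
import numpy as np

def sensitivities(Q_inv, H, y, S, rest, c):
    # eta_c = [y_c, H_{s_1 c}, ..., H_{s_k c}], the right-hand-side vector above
    eta_c = np.concatenate(([y[c]], H[S, c]))
    beta = -Q_inv @ eta_c                  # beta[0]: mu-sensitivity, beta[1:]: alpha_S-sensitivity
    # gamma_i = H_ic + sum_{j in S} H_ij beta_j + y_i beta_0, for i in O ∪ E ∪ {c}
    gamma = np.zeros(len(y))
    idx = np.concatenate((rest, [c])).astype(int)
    gamma[idx] = H[idx, c] + H[np.ix_(idx, S)] @ beta[1:] + y[idx] * beta[0]
    return beta, gamma
```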


“Bookkeeping”

Knowing the increment Δα_c, we directly forecast:

the change Δα_S of the weights for points in S,
the change Δg_i of the gradient for points in O ∪ E.

Unfortunately, these forecasts hold only as long as no structural changes occur between S, O and E.

Potential structural changes:

for i ∈ S, α_i → C: i joins E.
for i ∈ S, α_i → 0: i joins O.
for i ∈ O, g_i > 0 → 0: i joins S.
for i ∈ E, g_i < 0 → 0: i joins S.

Key idea: we can predict the smallest Δα_c such that, for some i, the first structural change occurs.


Predicting Structural Changes

1. i ∈ S, β_i > 0: α_i grows up to C.
   Find $\Delta\alpha_c^{1} = \min_i \frac{C - \alpha_i}{\beta_i}$.

2. i ∈ S, β_i < 0: α_i falls down to 0.
   Find $\Delta\alpha_c^{2} = \min_i \frac{-\alpha_i}{\beta_i}$.

3. i ∈ E, γ_i > 0 or i ∈ O, γ_i < 0: g_i grows up / falls down to 0.
   Find $\Delta\alpha_c^{3} = \min_i \frac{-g_i}{\gamma_i}$.

4. g_c becomes 0 (termination condition 1):
   Compute $\Delta\alpha_c^{4} = \frac{-g_c}{\gamma_c}$.

5. α_c reaches C (termination condition 2):
   Compute $\Delta\alpha_c^{5} = C - \alpha_c$.

The smallest increment Δα_c among the five possible cases yields the first structural change.
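A minimal sketch of this case analysis (my own naming; β and the full-length γ are assumed to come from the previous sketch, and S, E, O are integer index arrays):

```python
import numpy as np

def max_increment(alpha, g, beta, gamma, C, S, E, O, c, eps=1e-12):
    """Smallest Delta alpha_c at which one of cases 1-5 above occurs."""
    beta_S = beta[1:]                               # sensitivities of alpha_S
    cands = [C - alpha[c]]                          # case 5: alpha_c reaches C
    grow = beta_S > eps                             # case 1: some alpha_i in S grows to C
    if grow.any():
        cands.append(np.min((C - alpha[S][grow]) / beta_S[grow]))
    shrink = beta_S < -eps                          # case 2: some alpha_i in S shrinks to 0
    if shrink.any():
        cands.append(np.min(-alpha[S][shrink] / beta_S[shrink]))
    hit_e = gamma[E] > eps                          # case 3: g_i grows to 0 for i in E
    if hit_e.any():
        cands.append(np.min(-g[E][hit_e] / gamma[E][hit_e]))
    hit_o = gamma[O] < -eps                         # case 3: g_i falls to 0 for i in O
    if hit_o.any():
        cands.append(np.min(-g[O][hit_o] / gamma[O][hit_o]))
    if gamma[c] > eps:                              # case 4: g_c reaches 0
        cands.append(-g[c] / gamma[c])
    return min(cands)
```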


Structural Change: Post-processing

The main operation to be performed after the structural change is the update of the inverse matrix Q⁻¹.

Example: k joins S:

⇒ add a row and a column with zero entries to Q⁻¹
⇒ perform a recursive update (next slide)

Example: k leaves S:

⇒ re-arrange Q⁻¹ to put row and column k "at the end"
⇒ update the values (follows)


Recursive Expansion of Q⁻¹

Consider the specific structure of the expanded matrix whose inverse we need:

$$\begin{bmatrix} 0 & y_S^{\top} & y_k \\ y_S & H_{SS} & H_{Sk} \\ y_k & H_{kS} & H_{kk} \end{bmatrix}^{-1}
= \begin{bmatrix} Q & \eta_k \\ \eta_k^{\top} & H_{kk} \end{bmatrix}^{-1},
\qquad \text{where } \eta_k^{\top} = \begin{bmatrix} y_k & H_{kS} \end{bmatrix}$$

We will make use of the Sherman-Morrison-Woodbury (block matrix inversion) formula:

$$\begin{bmatrix} A & B \\ C & D \end{bmatrix}^{-1}
= \begin{bmatrix} A^{-1} + A^{-1} B\,Z\,C A^{-1} & -A^{-1} B\,Z \\ -Z\,C A^{-1} & Z \end{bmatrix}$$

where $Z = (D - C A^{-1} B)^{-1} = (H_{kk} - \eta_k^{\top} Q^{-1} \eta_k)^{-1}$ (a scalar!)


Recursive Expansion of Q⁻¹ (ctd.)

Putting everything together and using the fact that β_k = −Q⁻¹η_k (cf. the sensitivity conditions above), we obtain:

$$\begin{bmatrix} 0 & y_S^{\top} & y_k \\ y_S & H_{SS} & H_{Sk} \\ y_k & H_{kS} & H_{kk} \end{bmatrix}^{-1}
= \begin{bmatrix} Q^{-1} + Z\,(Q^{-1}\eta_k)(Q^{-1}\eta_k)^{\top} & -Z\,(Q^{-1}\eta_k) \\ -Z\,(Q^{-1}\eta_k)^{\top} & Z \end{bmatrix}
= \begin{bmatrix} Q^{-1} & 0 \\ 0 & 0 \end{bmatrix} + Z \begin{bmatrix} \beta_k \\ 1 \end{bmatrix} \begin{bmatrix} \beta_k^{\top} & 1 \end{bmatrix}$$

Hence, expansion of Q⁻¹ amounts to:

extending Q⁻¹ with a zero row and column (O(s))

computing Z (O(s²))

computing and adding the outer product of $[\beta_k^{\top}\;\, 1]$ (O(s²))
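A minimal NumPy sketch of this expansion step (my own naming; the caller is assumed to supply η_k = [y_k, H_kS] and the scalar H_kk):

```python
import numpy as np

def expand_Q_inv(Q_inv, eta_k, H_kk):
    """Add a row/column for example k joining S, in O(s^2)."""
    beta_k = -Q_inv @ eta_k                      # beta_k = -Q^{-1} eta_k
    Z = 1.0 / (H_kk - eta_k @ Q_inv @ eta_k)     # scalar inverse Schur complement
    v = np.append(beta_k, 1.0)                   # [beta_k; 1]
    s = Q_inv.shape[0]
    out = np.zeros((s + 1, s + 1))
    out[:s, :s] = Q_inv                          # extend with a zero row and column
    return out + Z * np.outer(v, v)              # add the outer-product correction
```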


Recursive Contraction of Q⁻¹

Before the contraction, the current (larger) inverse has the form

$$\begin{bmatrix} q_{11} & q_{12} \\ q_{21} & q_{22} \end{bmatrix}
= \begin{bmatrix} Q^{-1} + Z\,\beta_k\beta_k^{\top} & Z\beta_k \\ Z\beta_k^{\top} & Z \end{bmatrix}$$

where Q⁻¹ now denotes the contracted inverse we want to recover.

Writing this relation component-wise, we obtain:

$$Q^{-1} = q_{11} - Z\,\beta_k\beta_k^{\top}, \qquad q_{21}^{\top} = q_{12} = Z\beta_k, \qquad q_{22} = Z$$

Combining the three relations above yields:

$$Q^{-1} = q_{11} - \frac{q_{12}\,q_{21}}{q_{22}}$$
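And a matching sketch of the contraction step, which deletes row and column k of the current inverse using the formula above (indexing convention is my own assumption):

```python
import numpy as np

def contract_Q_inv(Q_inv, k):
    """Remove row/column k (an example leaving S) from Q^{-1} in O(s^2)."""
    keep = [i for i in range(Q_inv.shape[0]) if i != k]
    q11 = Q_inv[np.ix_(keep, keep)]
    q12 = Q_inv[keep, k]
    q21 = Q_inv[k, keep]
    q22 = Q_inv[k, k]
    return q11 - np.outer(q12, q21) / q22        # Q^{-1} = q11 - q12 q21 / q22
```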


Incremental SVM: High-Level Algorithm

 1: Read example x_c, compute g_c.
 2: while g_c < 0 and α_c < C do
 3:     Compute β and γ.
 4:     Compute Δα_c^1, ..., Δα_c^5.
 5:     Δα_c^max ← min{Δα_c^1, ..., Δα_c^5}
 6:     α_c ← α_c + Δα_c^max
 7:     α_S ← α_S + β Δα_c^max
 8:     g_{c,E,O} ← g_{c,E,O} + γ Δα_c^max
 9:     Let k be the index of the example yielding the minimum in step 5.
10:     if k ∈ S then
11:         Move k from S to either E or O.
12:     else if k ∈ E ∪ O then
13:         Move k from either E or O to S.
14:     else
15:         {k = c: do nothing, the algorithm terminates.}
16:     end if
17:     Update Q⁻¹ recursively.
18: end while


Incremental SVM: Summary

The only general method to exactly update an existing SVM solution

Can be reversed for incremental “unlearning”

Can be extended for regression and anomaly detection

Moderate performance:

O(s²) time and space (s being the size of the set S)
suitable for working sets of up to 50,000 examples

Further usage:

Computation of the leave-one-out error
Poisoning attacks against SVM


Online Learning

An online learning algorithm looks at every example exactly once.

+ Easy update of existing solutions

+ Low storage requirements

+ (Sometimes) faster learning of batch datasets

- Inexact: may converge to a different solution

- Potentially slow convergence rate


LASVM: a Simple Online Extension of SMO

Function PROCESS(k):

1: α_k ← 0, compute g_k, add k to S.
2: j ← argmin_{s∈S} y_s g_s
3: Bail out if (j, k) is not a τ-violating pair.
4: Perform an SMO update step on (j, k).

Function REPROCESS:

1: i ← argmax_{s∈S} g_s with α_s < C
2: j ← argmin_{s∈S} g_s with α_s > 0
3: Bail out if (i, j) is not a τ-violating pair.
4: Perform an SMO update step on (i, j).


LASVM: Main Algorithm and Analysis

Initialization

Seed S with a few examples of each class and compute the initial gradient g

Online iterations: repeat a predefined number of times:

Pick an example k_t
Run PROCESS(k_t)
Run REPROCESS once

Finishing:

Repeat REPROCESS until δ ≤ τ

+ The online part brings us close enough to the optimal solution at low linear cost.

- Still no hard guarantees on the solution quality: the final stage has an uncontrollable number of iterations.


Stochastic Gradient Descent (SGD): The Best Pill for an Easy Case

Consider a linear SVM in the d-dimensional case. A large number of simple online algorithms implement the following strategy:

Pick a random example x_t

Update the weights as

$$w_{t+1} \leftarrow w_t - \gamma_t\, B_t\, \nabla_w Q(x_t, w_t)$$

where γ_t is a learning rate and B_t is a scaling matrix.

Under mild regularity conditions and with appropriate learning rates (decreasing gains satisfying $\sum_t \gamma_t^2 < \infty$), stochastic gradient descent converges to an optimal solution.
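As a concrete illustration, a minimal sketch (my own, not from the lecture) of this strategy for a linear SVM with hinge loss and L2 regularization, using the first-order gain 1/(λ(t + t₀)) discussed on the next slide:

```python
import numpy as np

def sgd_linear_svm(X, y, lam=1e-4, t0=10, epochs=1):
    """First-order SGD sketch; X has shape (n, d), labels y are in {-1, +1}."""
    n, d = X.shape
    w = np.zeros(d)
    t = 0
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        for i in rng.permutation(n):             # pick examples in random order
            t += 1
            gamma_t = 1.0 / (lam * (t + t0))     # decreasing gain
            grad = lam * w                       # gradient of the L2 regularizer
            if y[i] * (X[i] @ w) < 1:            # hinge loss is active
                grad -= y[i] * X[i]
            w -= gamma_t * grad                  # w_{t+1} = w_t - gamma_t * grad
    return w
```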


SGD Variants and their Convergence Rates

First order, B = λ⁻¹I:

$$w_{t+1} \leftarrow w_t - \frac{1}{\lambda(t + t_0)}\, g_t(w_t)$$

Iterations to reach accuracy ρ: $\frac{\nu\kappa^2}{\rho} + O\!\left(\frac{1}{\rho}\right)$; iteration cost: O(d)

Second order, B = H⁻¹:

$$w_{t+1} \leftarrow w_t - \frac{1}{t + t_0}\, H^{-1} g_t(w_t)$$

Iterations to reach accuracy ρ: $\frac{\nu}{\rho} + O\!\left(\frac{1}{\rho}\right)$; iteration cost: O(d²)


Summary

In contrast to incremental learning, online learning attempts to achieve a fixed cost per example, potentially at the expense of learning accuracy.

Theoretical analysis shows that high optimization accuracy is not always necessary for learning accuracy.

Efficient algorithms for online learning exist mainly for linear learning problems.

For non-linear methods, online learning currently cannot guarantee perfect convergence to the optimal solution.

