
Lecture 8: Online and Incremental Learning

Pavel Laskov and Blaine Nelson

Cognitive Systems Group
Wilhelm Schickard Institute for Computer Science
Universität Tübingen, Germany

Advanced Topics in Machine Learning, June 19, 2012


Why Learn Online?

Non-stationarity of real data

⇒ Update of models is needed when new data is available
⇒ Evaluation of the relevance of past data

Learning on a tight budget

⇒ Some data may be irrelevant
⇒ Cost-sensitive learning


Applications of Online Learning

Time series analysis

Security event monitoring

[Figure: EW-CUSUM detection score, geometric STD normalization]

Anomaly and change point detection

Object tracking


Computational Considerations

How much do we pay for training?

PCA: O(n³) (eigenvalue decomposition)

Ridge regression: O(n³) (matrix inversion)

SVM: O(n² log n) (theoretical bound on feasible direction decomposition)

How much are we willing to pay?

An order of magnitude less: to match batch learning

Constant or linear time: if we are really greedy!


Incremental SVM: Problem Setup

Initial state: An SVM has been trained on a data set X = {(x_1, y_1), ..., (x_k, y_k)}, and the initial solution α⁰ is available.

Goal: Given a new data point x_c, find a solution α* which is optimal for the extended training set X ∪ {x_c}.

Key idea: Ensure that the optimality conditions are enforced at every step of updating the solution from α⁰ to α*.

Notation: All training data is grouped into three sets according to the values of α:

set S: 0 < α_i < C
set O: α_i = 0
set E: α_i = C


KKT Re-visited (once again)

Consider the saddle-point formulation of the SVM training problem:

$$\max_{\mu}\; \min_{0 \le \alpha \le C}\; W = -\mathbf{1}^{\top}\alpha + \tfrac{1}{2}\,\alpha^{\top} H \alpha + \mu\,(y^{\top}\alpha)$$

The KKT conditions for this problem can be expressed as follows:

$$g_i = \frac{\partial W}{\partial \alpha_i} = \sum_j H_{ij}\alpha_j + y_i\mu - 1
\;\begin{cases} \ge 0, & i \in O \\ = 0, & i \in S \\ \le 0, & i \in E \end{cases}$$

$$\frac{\partial W}{\partial \mu} = \sum_j y_j\alpha_j = 0$$

These conditions must be met for both α⁰ and α*, i.e., before and after the inclusion of x_c, and at every single intermediate step!
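To make these conditions concrete, here is a minimal NumPy sketch (my own, not from the lecture; it assumes the usual H_ij = y_i y_j k(x_i, x_j)) that evaluates the gradients g and checks the KKT conditions for a given α and µ:

```python
import numpy as np

def kkt_gradients(H, y, alpha, mu):
    # g_i = sum_j H_ij alpha_j + y_i * mu - 1
    return H @ alpha + y * mu - 1.0

def kkt_satisfied(H, y, alpha, mu, C, tol=1e-8):
    g = kkt_gradients(H, y, alpha, mu)
    S = (alpha > tol) & (alpha < C - tol)   # margin support vectors
    O = alpha <= tol                        # non-support vectors
    E = alpha >= C - tol                    # bounded ("error") support vectors
    return (np.all(np.abs(g[S]) <= tol)     # g_i = 0 on S
            and np.all(g[O] >= -tol)        # g_i >= 0 on O
            and np.all(g[E] <= tol)         # g_i <= 0 on E
            and abs(y @ alpha) <= tol)      # equality constraint sum_j y_j alpha_j = 0
```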


Incremental SVM: Initialization

Which weight α_c should be assigned to x_c at the beginning?

Let's try α_c = 0. What happens?

Constraints for all g_i, i ≠ c, remain intact.

Equality constraint remains intact.

A new constraint gets added for g_c:

$$g_c = \sum_j H_{cj}\alpha_j + y_c\mu - 1 \ge 0, \quad \text{since } \alpha_c = 0$$

If this last constraint is satisfied for α_c = 0, then nothing needs to be done, and α* = α⁰.

Otherwise we need to increase α_c and update all other α's so as to keep all conditions except the one for g_c satisfied.
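As a small illustration, the initialization decision boils down to evaluating g_c once at α_c = 0. A hedged sketch (my own; the helper name and K_c are assumptions, with H_cj = y_c y_j k(x_c, x_j)):

```python
import numpy as np

def needs_update(K_c, y, alpha, mu, y_c):
    """K_c[j] = k(x_c, x_j) for the current training points."""
    H_c = y_c * y * K_c                  # row H_cj = y_c y_j k(x_c, x_j)
    g_c = H_c @ alpha + y_c * mu - 1.0   # gradient at alpha_c = 0
    return g_c < 0                       # g_c >= 0  =>  alpha* = alpha^0, nothing to do
```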


KKT Conditions in Differential Form

Consider the difference between the optimality conditions in two different SVM states:

$$\Delta g_i = H_{ic}\,\Delta\alpha_c + \sum_{j \in S} H_{ij}\,\Delta\alpha_j + y_i\,\Delta\mu, \qquad \forall i \in D \cup \{c\}$$

$$0 = y_c\,\Delta\alpha_c + \sum_{j \in S} y_j\,\Delta\alpha_j$$

For i ∈ S, g_i = 0; hence, these conditions can be written in matrix form:

$$\underbrace{\begin{bmatrix} 0 & y_{s_1} & \cdots & y_{s_k} \\ y_{s_1} & H_{s_1 s_1} & \cdots & H_{s_1 s_k} \\ \vdots & \vdots & \ddots & \vdots \\ y_{s_k} & H_{s_k s_1} & \cdots & H_{s_k s_k} \end{bmatrix}}_{Q}\;
\underbrace{\begin{bmatrix} \Delta\mu \\ \Delta\alpha_{s_1} \\ \vdots \\ \Delta\alpha_{s_k} \end{bmatrix}}_{\Delta\alpha_S}
= -\underbrace{\begin{bmatrix} y_c \\ H_{s_1 c} \\ \vdots \\ H_{s_k c} \end{bmatrix}}_{\eta_c}\,\Delta\alpha_c$$


Sensitivity Conditions

The conditions for g_S imply that

$$\Delta\alpha_S = \underbrace{-\,Q^{-1}\eta_c}_{\beta}\,\Delta\alpha_c$$

⇒ the vector β measures the sensitivity of α with respect to Δα_c.

Substituting the sensitivity of α_S back into the conditions for g, we obtain:

$$\Delta g_i = \gamma_i\,\Delta\alpha_c, \quad \forall i \in D \cup \{c\}, \qquad \text{where } \gamma_i = H_{ic} + \sum_{j \in S} H_{ij}\beta_j + y_i\beta_0$$

⇒ the vector γ measures the sensitivity of g_i, for i ∈ O ∪ E, with respect to Δα_c.
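For concreteness, a minimal NumPy sketch of these two sensitivity computations (my own naming, not from the lecture; S and rest are assumed to be integer index arrays for S and O ∪ E, and Q_inv is the maintained inverse of Q):

```python
import numpy as np

def sensitivities(Q_inv, H, y, S, rest, c):
    # eta_c = [y_c, H_{s_1 c}, ..., H_{s_k c}], the right-hand-side vector above
    eta_c = np.concatenate(([y[c]], H[S, c]))
    beta = -Q_inv @ eta_c                  # beta[0]: mu-sensitivity, beta[1:]: alpha_S-sensitivity
    # gamma_i = H_ic + sum_{j in S} H_ij beta_j + y_i beta_0, for i in O ∪ E ∪ {c}
    gamma = np.zeros(len(y))
    idx = np.concatenate((rest, [c])).astype(int)
    gamma[idx] = H[idx, c] + H[np.ix_(idx, S)] @ beta[1:] + y[idx] * beta[0]
    return beta, gamma
```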


“Bookkeeping”

Knowing the increment Δα_c, we directly forecast:

the change Δα_S of the weights for points in S,
the change Δg_i of the gradient for points in O ∪ E.

Unfortunately, these forecasts hold only as long as no structural changes occur between S, O and E.

Potential structural changes:

for i ∈ S, α_i → C: i joins E.
for i ∈ S, α_i → 0: i joins O.
for i ∈ O, g_i > 0 → 0: i joins S.
for i ∈ E, g_i < 0 → 0: i joins S.

Key idea: we can predict the smallest Δα_c such that, for some i, the first structural change occurs.


Predicting Structural Changes

1. i ∈ S, β_i > 0: α_i grows up to C.
   Find $\Delta\alpha_c^{1} = \min_i \frac{C - \alpha_i}{\beta_i}$.

2. i ∈ S, β_i < 0: α_i falls down to 0.
   Find $\Delta\alpha_c^{2} = \min_i \frac{-\alpha_i}{\beta_i}$.

3. i ∈ E, γ_i > 0 or i ∈ O, γ_i < 0: g_i grows up / falls down to 0.
   Find $\Delta\alpha_c^{3} = \min_i \frac{-g_i}{\gamma_i}$.

4. g_c becomes 0 (termination condition 1):
   Compute $\Delta\alpha_c^{4} = \frac{-g_c}{\gamma_c}$.

5. α_c reaches C (termination condition 2):
   Compute $\Delta\alpha_c^{5} = C - \alpha_c$.

The smallest increment Δα_c among the five possible cases yields the first structural change.
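A minimal sketch of this case analysis (my own naming; β and the full-length γ are assumed to come from the previous sketch, and S, E, O are integer index arrays):

```python
import numpy as np

def max_increment(alpha, g, beta, gamma, C, S, E, O, c, eps=1e-12):
    """Smallest Delta alpha_c at which one of cases 1-5 above occurs."""
    beta_S = beta[1:]                               # sensitivities of alpha_S
    cands = [C - alpha[c]]                          # case 5: alpha_c reaches C
    grow = beta_S > eps                             # case 1: some alpha_i in S grows to C
    if grow.any():
        cands.append(np.min((C - alpha[S][grow]) / beta_S[grow]))
    shrink = beta_S < -eps                          # case 2: some alpha_i in S shrinks to 0
    if shrink.any():
        cands.append(np.min(-alpha[S][shrink] / beta_S[shrink]))
    hit_e = gamma[E] > eps                          # case 3: g_i grows to 0 for i in E
    if hit_e.any():
        cands.append(np.min(-g[E][hit_e] / gamma[E][hit_e]))
    hit_o = gamma[O] < -eps                         # case 3: g_i falls to 0 for i in O
    if hit_o.any():
        cands.append(np.min(-g[O][hit_o] / gamma[O][hit_o]))
    if gamma[c] > eps:                              # case 4: g_c reaches 0
        cands.append(-g[c] / gamma[c])
    return min(cands)
```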


Structural Change: Post-processing

The main operation to be performed after the structural change is the update of the inverse matrix Q⁻¹.

Example: k joins S:

⇒ add a row and a column with zero entries to Q⁻¹
⇒ perform a recursive update (next slide)

Example: k leaves S:

⇒ re-arrange Q⁻¹ to put row and column k "at the end"
⇒ update the values (follows)


Recursive Expansion of Q⁻¹

Consider the specific structure of the expanded matrix whose inverse we need:

$$\begin{bmatrix} 0 & y_S^{\top} & y_k \\ y_S & H_{SS} & H_{Sk} \\ y_k & H_{kS} & H_{kk} \end{bmatrix}^{-1}
= \begin{bmatrix} Q & \eta_k \\ \eta_k^{\top} & H_{kk} \end{bmatrix}^{-1},
\qquad \text{where } \eta_k^{\top} = \begin{bmatrix} y_k & H_{kS} \end{bmatrix}$$

We will make use of the Sherman-Morrison-Woodbury (block matrix inversion) formula:

$$\begin{bmatrix} A & B \\ C & D \end{bmatrix}^{-1}
= \begin{bmatrix} A^{-1} + A^{-1} B\,Z\,C A^{-1} & -A^{-1} B\,Z \\ -Z\,C A^{-1} & Z \end{bmatrix}$$

where $Z = (D - C A^{-1} B)^{-1} = (H_{kk} - \eta_k^{\top} Q^{-1} \eta_k)^{-1}$ (a scalar!)


Recursive Expansion of Q⁻¹ (ctd.)

Putting everything together and using the fact that β_k = −Q⁻¹η_k (cf. the sensitivity conditions above), we obtain:

$$\begin{bmatrix} 0 & y_S^{\top} & y_k \\ y_S & H_{SS} & H_{Sk} \\ y_k & H_{kS} & H_{kk} \end{bmatrix}^{-1}
= \begin{bmatrix} Q^{-1} + Z\,(Q^{-1}\eta_k)(Q^{-1}\eta_k)^{\top} & -Z\,(Q^{-1}\eta_k) \\ -Z\,(Q^{-1}\eta_k)^{\top} & Z \end{bmatrix}
= \begin{bmatrix} Q^{-1} & 0 \\ 0 & 0 \end{bmatrix} + Z \begin{bmatrix} \beta_k \\ 1 \end{bmatrix} \begin{bmatrix} \beta_k^{\top} & 1 \end{bmatrix}$$

Hence, expansion of Q⁻¹ amounts to:

extending Q⁻¹ with a zero row and column (O(s))

computing Z (O(s²))

computing and adding the outer product of $[\beta_k^{\top}\;\, 1]$ (O(s²))
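A minimal NumPy sketch of this expansion step (my own naming; the caller is assumed to supply η_k = [y_k, H_kS] and the scalar H_kk):

```python
import numpy as np

def expand_Q_inv(Q_inv, eta_k, H_kk):
    """Add a row/column for example k joining S, in O(s^2)."""
    beta_k = -Q_inv @ eta_k                      # beta_k = -Q^{-1} eta_k
    Z = 1.0 / (H_kk - eta_k @ Q_inv @ eta_k)     # scalar inverse Schur complement
    v = np.append(beta_k, 1.0)                   # [beta_k; 1]
    s = Q_inv.shape[0]
    out = np.zeros((s + 1, s + 1))
    out[:s, :s] = Q_inv                          # extend with a zero row and column
    return out + Z * np.outer(v, v)              # add the outer-product correction
```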


Recursive Contraction of Q⁻¹

Before the contraction, the current (larger) inverse has the form

$$\begin{bmatrix} q_{11} & q_{12} \\ q_{21} & q_{22} \end{bmatrix}
= \begin{bmatrix} Q^{-1} + Z\,\beta_k\beta_k^{\top} & Z\beta_k \\ Z\beta_k^{\top} & Z \end{bmatrix}$$

where Q⁻¹ now denotes the contracted inverse we want to recover.

Writing this relation component-wise, we obtain:

$$Q^{-1} = q_{11} - Z\,\beta_k\beta_k^{\top}, \qquad q_{21}^{\top} = q_{12} = Z\beta_k, \qquad q_{22} = Z$$

Combining the three relations above yields:

$$Q^{-1} = q_{11} - \frac{q_{12}\,q_{21}}{q_{22}}$$
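And a matching sketch of the contraction step, which deletes row and column k of the current inverse using the formula above (indexing convention is my own assumption):

```python
import numpy as np

def contract_Q_inv(Q_inv, k):
    """Remove row/column k (an example leaving S) from Q^{-1} in O(s^2)."""
    keep = [i for i in range(Q_inv.shape[0]) if i != k]
    q11 = Q_inv[np.ix_(keep, keep)]
    q12 = Q_inv[keep, k]
    q21 = Q_inv[k, keep]
    q22 = Q_inv[k, k]
    return q11 - np.outer(q12, q21) / q22        # Q^{-1} = q11 - q12 q21 / q22
```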


Incremental SVM: High-Level Algorithm

 1: Read example x_c, compute g_c.
 2: while g_c < 0 and α_c < C do
 3:     Compute β and γ.
 4:     Compute Δα_c^1, ..., Δα_c^5.
 5:     Δα_c^max ← min{Δα_c^1, ..., Δα_c^5}
 6:     α_c ← α_c + Δα_c^max
 7:     α_S ← α_S + β Δα_c^max
 8:     g_{c,E,O} ← g_{c,E,O} + γ Δα_c^max
 9:     Let k be the index of the example yielding the minimum in step 5.
10:     if k ∈ S then
11:         Move k from S to either E or O.
12:     else if k ∈ E ∪ O then
13:         Move k from either E or O to S.
14:     else
15:         {k = c: do nothing, the algorithm terminates.}
16:     end if
17:     Update Q⁻¹ recursively.
18: end while


Incremental SVM: Summary

The only general method to exactly update an existing SVM solution

Can be reversed for incremental “unlearning”

Can be extended for regression and anomaly detection

Moderate performance:

O(s²) time and space (s being the size of the set S)
suitable for working sets of up to 50,000 examples

Further usage:

Computation of the leave-one-out error
Poisoning attacks against SVM


Online Learning

An online learning algorithm looks at every example exactly once.

+ Easy update of existing solutions

+ Low storage requirements

+ (Sometimes) faster learning of batch datasets

- Inexact: may converge to a different solution

- Potentially slow convergence rate


LASVM: a Simple Online Extension of SMO

Function PROCESS(k):

1: α_k ← 0, compute g_k, add k to S.
2: j ← argmin_{s∈S} y_s g_s
3: Bail out if (j, k) is not a τ-violating pair.
4: Perform an SMO update step on (j, k).

Function REPROCESS:

1: i ← argmax_{s∈S} g_s with α_s < C
2: j ← argmin_{s∈S} g_s with α_s > 0
3: Bail out if (i, j) is not a τ-violating pair.
4: Perform an SMO update step on (i, j).


LASVM: Main Algorithm and Analysis

Initialization

Seed S with a few examples of each class and compute the initial gradient g

Online iterations: repeat a predefined number of times:

Pick an example k_t
Run PROCESS(k_t)
Run REPROCESS once

Finishing:

Repeat REPROCESS until δ ≤ τ

+ The online part brings us close enough to the optimal solution at low linear cost.

- Still no hard guarantees on the solution quality: the final stage has an uncontrollable number of iterations.


Stochastic Gradient Descent (SGD): The Best Pill for an Easy Case

Consider a linear SVM in the d-dimensional case. A large number of simple online algorithms implement the following strategy:

Pick a random example x_t

Update the weights as

$$w_{t+1} \leftarrow w_t - \gamma_t\, B_t\, \nabla_w Q(x_t, w_t)$$

where γ_t is a learning rate and B_t is a scaling matrix.

Under mild regularity conditions and with appropriate learning rates (decreasing gains satisfying $\sum_t \gamma_t^2 < \infty$), stochastic gradient descent converges to an optimal solution.
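As a concrete illustration, a minimal sketch (my own, not from the lecture) of this strategy for a linear SVM with hinge loss and L2 regularization, using the first-order gain 1/(λ(t + t₀)) discussed on the next slide:

```python
import numpy as np

def sgd_linear_svm(X, y, lam=1e-4, t0=10, epochs=1):
    """First-order SGD sketch; X has shape (n, d), labels y are in {-1, +1}."""
    n, d = X.shape
    w = np.zeros(d)
    t = 0
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        for i in rng.permutation(n):             # pick examples in random order
            t += 1
            gamma_t = 1.0 / (lam * (t + t0))     # decreasing gain
            grad = lam * w                       # gradient of the L2 regularizer
            if y[i] * (X[i] @ w) < 1:            # hinge loss is active
                grad -= y[i] * X[i]
            w -= gamma_t * grad                  # w_{t+1} = w_t - gamma_t * grad
    return w
```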


SGD Variants and their Convergence Rates

First order, B = λ⁻¹I:

$$w_{t+1} \leftarrow w_t - \frac{1}{\lambda(t + t_0)}\, g_t(w_t)$$

Iterations to reach accuracy ρ: $\frac{\nu\kappa^2}{\rho} + O\!\left(\frac{1}{\rho}\right)$; iteration cost: O(d)

Second order, B = H⁻¹:

$$w_{t+1} \leftarrow w_t - \frac{1}{t + t_0}\, H^{-1} g_t(w_t)$$

Iterations to reach accuracy ρ: $\frac{\nu}{\rho} + O\!\left(\frac{1}{\rho}\right)$; iteration cost: O(d²)


Summary

In contrast to incremental learning, online learning attempts to achieve a fixed cost per example, potentially at the expense of learning accuracy.

Theoretical analysis shows that high optimization accuracy is not always necessary for learning accuracy.

Efficient algorithms for online learning exist mainly for linear learning problems.

For non-linear methods, online learning currently cannot guarantee perfect convergence to the optimal solution.

