Incremental Classifier and Representation Learning
Sylvestre-Alvise Rebuffi†,∗, Alexander Kolesnikov∗, Georg Sperl∗, Christoph H. Lampert∗
∗ IST Austria † University of Oxford
Abstract
iCaRL can learn classifiers and a feature representation incrementally over a long period of time, where other methods quickly fail. Catastrophic forgetting is avoided thanks to a combination of a nearest-mean-of-exemplars classifier, herding for adaptive exemplar selection, and distillation for representation learning.
1) Motivation
Situation: class-incremental learning
Problem:
- classes appear sequentially
- train on new classes without forgetting the old ones
- no retraining from scratch: too costly
- cannot keep all the data around
2) Class-Incremental Learning
We want to / we can:
- for any number of observed classes, t, learn a multi-class classifier
- store a limited number, K, of images (a few hundred or thousand)

We do not want to / we cannot:
- add more and more resources: fixed-size network
- store all training examples (could be millions)

The dilemma:
- fixing the data representation: suboptimal results on new classes
- continuously improving the representation: classifiers for earlier classes deteriorate over time ("catastrophic forgetting/interference") [McCloskey, Cohen. 1989]
3) Existing Approaches
Fixed data representation:
- represent classes by mean feature vectors [Mensink et al., 2012], [Ristin et al., 2014]

Learning the data representation:
- grow the neural network incrementally, fixing the parts responsible for earlier class decisions [Mandziuk, Shastri. 1998], . . . , [Rusu et al., 2016]
- multi-task setting: preserve network activations by distillation [Li, Hoiem. 2016]
4) iCaRL
iCaRL component 1: exemplar-based classification.
- nearest-mean-of-exemplars classifier
- automatic adjustment to representation change
- more robust than network outputs
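The nearest-mean-of-exemplars rule can be sketched as follows: classify a sample by the class whose mean exemplar feature is closest in (L2-normalised) feature space. A minimal NumPy illustration; the names `nme_classify` and `feature_fn` are hypothetical, not from the paper's code:

```python
import numpy as np

def nme_classify(x, exemplar_sets, feature_fn):
    """Nearest-mean-of-exemplars classification (minimal sketch).

    x:             a single input sample
    exemplar_sets: list of arrays; exemplar_sets[y] holds the exemplars of class y
    feature_fn:    maps a batch of inputs to feature vectors (e.g. a CNN)
    """
    phi_x = feature_fn(x[None])[0]
    phi_x = phi_x / np.linalg.norm(phi_x)           # normalise the query feature
    best_y, best_dist = None, np.inf
    for y, P_y in enumerate(exemplar_sets):
        mu_y = feature_fn(P_y).mean(axis=0)          # class prototype = mean exemplar feature
        mu_y = mu_y / np.linalg.norm(mu_y)
        d = np.linalg.norm(phi_x - mu_y)
        if d < best_dist:
            best_y, best_dist = y, d
    return best_y
```

Because the prototypes are recomputed from the current feature extractor, the classifier adapts automatically whenever the representation changes.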
iCaRL component 2: representation learning.
- add a distillation term to the loss function
- stabilizes outputs
- limited overhead: only one extra copy of the network is needed
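One way to picture the distillation term: treat each class output as an independent binary classification, with ground-truth targets for the new classes and the frozen old network's sigmoid outputs as soft targets for the old classes. A minimal NumPy sketch under that reading (all names hypothetical):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_with_distillation(logits, old_logits, target, n_old):
    """Per-class binary cross-entropy with a distillation term (minimal sketch).

    logits:     current network outputs, shape (n_classes,)
    old_logits: outputs of the frozen copy of the network taken before the update
    target:     ground-truth one-hot vector (only new-class entries are used)
    n_old:      number of previously seen classes
    """
    eps = 1e-12
    p = sigmoid(logits)
    q = sigmoid(old_logits[:n_old])          # soft targets from the old network
    # classification term on the new classes
    cls = -np.sum(target[n_old:] * np.log(p[n_old:] + eps)
                  + (1 - target[n_old:]) * np.log(1 - p[n_old:] + eps))
    # distillation term: keep the old-class outputs close to the old network's
    dist = -np.sum(q * np.log(p[:n_old] + eps)
                   + (1 - q) * np.log(1 - p[:n_old] + eps))
    return cls + dist
```

The "limited overhead" point is visible here: only `old_logits` from a single stored copy of the network is needed, not the old training data.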
iCaRL component 3: exemplar selection.
- greedy selection procedure by herding
- the number of exemplars per class decreases as the number of classes increases
- exemplars can be removed on the fly thanks to the ranking of exemplars
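The greedy herding step can be sketched as: repeatedly add the image whose inclusion keeps the running mean of the selected features closest to the class mean. Because the exemplars come out ranked by selection order, shrinking a class's budget later just drops the tail. A minimal NumPy sketch (function name hypothetical):

```python
import numpy as np

def herding_selection(features, m):
    """Greedy herding: pick m exemplars whose feature mean tracks the class mean.

    features: array of shape (n, d), one feature vector per image of the class
    m:        number of exemplars to keep (m <= n)
    Returns indices of the chosen exemplars, ordered by selection rank.
    """
    mu = features.mean(axis=0)                      # class mean in feature space
    chosen, running_sum = [], np.zeros_like(mu)
    for k in range(1, m + 1):
        # distance of each candidate running mean to the class mean
        dists = np.linalg.norm(mu - (running_sum + features) / k, axis=1)
        dists[chosen] = np.inf                       # never pick an index twice
        i = int(np.argmin(dists))
        chosen.append(i)
        running_sum += features[i]
    return chosen
```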
5) Experiments
CIFAR-100:
- 100 classes, in batches of 10
- 32-layer ResNet [He et al., 2015]
- evaluated by top-1 accuracy
- number of exemplars: 2000

ImageNet ILSVRC 2012:
- 1000 classes, in batches of 10
- 18-layer ResNet [He et al., 2015]
- evaluated by top-5 accuracy
- number of exemplars: 20000
Baselines:
- fixed representation: freeze the representation after the first batch of classes
- finetuning: ordinary NN learning, no freezing
- LwF: "Learning without Forgetting" [Li, Hoiem. 2016], uses the network itself to classify
6) Results
[Figure: CIFAR-100 results (2, 5 and 10 classes per batch)]
[Figure: ILSVRC 2012 results (small and full protocols)]
Discussion:
- as expected, fixed representation and finetuning do not work well
- iCaRL keeps good classification accuracy over many iterations
- "Learning without Forgetting" starts to forget earlier
[Figure: confusion matrices for CIFAR-100]
Discussion:
- iCaRL: predictions spread homogeneously across classes
- LwF.MC: prefers recently seen classes → long-term memory loss
- fixed representation: prefers the first batch of classes → lack of neural plasticity
- finetuning: predicts only classes from the last batch → catastrophic forgetting