Computer Vision Group Prof. Daniel Cremers
16. Online Learning
PD Dr. Rudolph Triebel, Computer Vision Group — Online Learning
Motivation
• So far, we have covered offline learning methods, but very often data is observed in an ongoing process
• Therefore, it would be good to adapt the learned models with newly arriving data without having to consider old data
• This is called online learning
• Major benefits:
• algorithms are adaptive to new observations
• learning is usually faster
Clarification of Notions
There are (at least) three notions:
• incremental learning:
• model is updated with new samples, but old samples might be considered again
• online learning:
• after considering a data sample, it can be disregarded (no need to store it)
• learning at frame rate:
• learning is fast enough such that it can be performed in one perception cycle
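The difference between incremental and strictly online updates is easiest to see in code. Below is a minimal sketch (not from the lecture) of an online estimator in the strict sense defined above: each sample updates the model and is then discarded.

```python
class OnlineMean:
    """Running mean via the incremental update mean_n = mean_{n-1} + (x - mean_{n-1}) / n."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def update(self, x):
        self.n += 1
        self.mean += (x - self.mean) / self.n  # old samples are never stored or revisited
        return self.mean

est = OnlineMean()
for x in [2.0, 4.0, 6.0]:
    est.update(x)
print(est.mean)  # 4.0
```

The same pattern (state plus a constant-time update) underlies most online learners, e.g. stochastic gradient descent.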
Example: Pedestrian Detection
Offline Approach:
1. Collect and annotate a large data set from different sensor modalities, here: camera and 2D laser
[Diagram: an image-data training set and a laser-data training set, from which a codebook is built]
[Spinello, Triebel, and Siegwart, IJRR 2010]
Example: Pedestrian Detection
Offline Approach:
1. Collect and annotate a large data set from different sensor modalities, here: camera and 2D laser
2. Train a classifier for each sensor modality
[Diagram: the training sets are used to learn per-modality classifiers: an Extended ISM (with codebook) and a Boosted CRF]
[Spinello, Triebel, and Siegwart, IJRR 2010]
Example: Pedestrian Detection
Offline Approach:
1. Collect and annotate a large data set from different sensor modalities, here: camera and 2D laser
2. Train a classifier for each sensor modality
3. Apply the classifiers and fuse data
[Diagram: the learned classifiers (Extended ISM with codebook, Boosted CRF) are applied to new data and their outputs are fused]
[Spinello, Triebel, and Siegwart, IJRR 2010]
Example: Pedestrian Detection
Testing Phase:
[Videos: fused pedestrian detection and tracking; fused detection and tracking of cars and pedestrians]
[Spinello, Triebel, and Siegwart, IJRR 2010]
But: non-adaptive; large training set required
Major Problem of this Approach
Offline learning:
• cannot adapt to new environments
• cannot learn new classes
• cannot consider new instances of a given class
• is often very slow
• requires a large amount of manually annotated training data
Ideas:
• use online learning for adaptivity
• use active learning to reduce (human) work load
Active Learning
• To reduce the required training data, the learner can select the data it needs to learn from.
• This is called active learning
• Major advantages:
• only those samples where classification is hard are used for re-training
• humans only need to give ground truth labels for a smaller set of samples
• In principle, active learning can be combined with both offline and online learning, but it is most useful in the online setting
Active Learning
[Diagram: a function f(x) is optimized on training data (X, Y) with labels such as "Laptop" and "Cup"]
Active Learning
[Diagram: optimization yields f(x); in the prediction step, f(x) is applied to new data and outputs labels (e.g. "Laptop") together with uncertainties]
Active Learning
[Diagram: prediction on the new data now yields a sample the classifier is uncertain about ("?")]
Active Learning
[Diagram: for the uncertain sample, a label query is sent to a supervisor, who returns the ground-truth label (e.g. "Present")]
Active Learning
[Diagram: the newly labeled sample extends the training data, producing a new training set for the next optimization round]
Active Learning
[Diagram: the complete active learning cycle: optimization, prediction, label query to a supervisor, extend training data]
Major Benefits:
• Adapts to new situations
• Requires fewer training samples
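The cycle sketched on these slides can be written down compactly. The following is a hypothetical toy example: the nearest-centroid classifier, the 1-D data, and the `oracle` dictionary (standing in for the human supervisor) are all made up for illustration; only the query loop itself follows the slides.

```python
import math

def train(data):
    """Fit one centroid per class label."""
    cents = {}
    for x, y in data:
        cents.setdefault(y, []).append(x)
    return {y: sum(v) / len(v) for y, v in cents.items()}

def predict(cents, x):
    """Softmax over negative distances: returns label and class probabilities."""
    scores = {y: -abs(x - c) for y, c in cents.items()}
    z = sum(math.exp(s) for s in scores.values())
    probs = {y: math.exp(s) / z for y, s in scores.items()}
    return max(probs, key=probs.get), probs

def entropy(probs):
    return -sum(p * math.log(p) for p in probs.values() if p > 0)

training_data = [(0.0, "cup"), (1.0, "cup"), (9.0, "laptop"), (10.0, "laptop")]
new_data = [0.4, 5.1, 9.7]                           # unlabeled samples
oracle = {0.4: "cup", 5.1: "laptop", 9.7: "laptop"}  # stands in for the human

model = train(training_data)
# select the new sample the model is most uncertain about ...
query = max(new_data, key=lambda x: entropy(predict(model, x)[1]))
# ... ask the supervisor for its label, extend the training data, retrain
training_data.append((query, oracle[query]))
model = train(training_data)
print(query)  # 5.1 lies between the two classes, so it has the highest entropy
```

Only the hard sample (the one between the class centroids) triggers a label query; the confidently classified ones never reach the human.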
Uncertainty Estimates
• A key step in active learning is the selection step
• It requires a good estimate of uncertainties
• Examples for classification algorithms to compute uncertainties:
• Support Vector Machine (“Platt scaling”)
• Gaussian Process Classifier (“predictive variance”)
• Tree-based classifiers (entropy in leaf nodes)
But: what matters is how well the uncertainty correlates with incorrect classifications!
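A simple, widely used uncertainty measure (and the one used in the comparison that follows) is the normalized entropy of the predicted class distribution, H(p)/log C, which lies in [0, 1] for any number of classes C. A minimal sketch:

```python
import math

def normalized_entropy(probs):
    """Normalized entropy H(p) / log(C) of a class distribution, in [0, 1]."""
    c = len(probs)
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(c)

print(normalized_entropy([0.5, 0.5]))              # 1.0: maximally uncertain
print(normalized_entropy([0.99, 0.01]))            # close to 0: confident
print(round(normalized_entropy([1 / 3] * 3), 6))   # 1.0 for uniform, for any C
```

The normalization makes the measure comparable across classifiers with different numbers of classes, which is why the histograms below use it.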
Uncertainty Estimation: An Example
[Figure: ground truth vs. GP classification of a scene with classes building, tree, ground, hedge, car, and background, colored by uncertainty (normalized entropy, 0 to 1); histograms of the normalized entropy of the label distribution for the GP classifier and the SVM classifier]
The GP is less overconfident than the SVM!
[Paul, Triebel, Rus, Newman, IROS 2012]
Application: Traffic Sign Classification
[Fig. 4 from the paper: normalized entropy histograms (frequency vs. NE) of the marginal probabilities for BDT, IVM, linear and non-linear GPC, linear and non-linear SVM, and LogitBoost, trained on the road sign classes stop and lorries prohibited and tested on 200 instances of the unseen classes roadworks ahead and keep left. Higher normalized entropy implies more uncertainty in the classifier output, so we expect the classifiers to be certain (low NE) on the trained classes and uncertain (high NE) on the unseen classes.]
The GP is less overconfident than the SVM!
[Grimmett, Paul, Triebel, Posner ICRA 2013]
Over- and Underconfidence
Definition 1: The overconfidence of a classifier is the mean of all confidence values that correspond to incorrect classifications.
Definition 2: The underconfidence of a classifier is the mean of all uncertainty values that correspond to correct classifications.
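Both definitions translate directly into code. In this sketch, confidence is taken as the probability of the predicted class and uncertainty as one minus that value; the slides leave the exact measures open, so treat this as one possible instantiation:

```python
def over_and_underconfidence(confidences, predicted, true):
    """Definition 1 and 2: mean confidence on errors, mean uncertainty on correct predictions.
    Uncertainty is assumed to be 1 - confidence here."""
    wrong = [c for c, p, t in zip(confidences, predicted, true) if p != t]
    right = [1 - c for c, p, t in zip(confidences, predicted, true) if p == t]
    over = sum(wrong) / len(wrong) if wrong else 0.0
    under = sum(right) / len(right) if right else 0.0
    return over, under

conf = [0.9, 0.6, 0.95, 0.7]
pred = ["car", "car", "tree", "tree"]
true = ["car", "tree", "tree", "car"]
over, under = over_and_underconfidence(conf, pred, true)
print(round(over, 4))   # mean confidence on the two errors: (0.6 + 0.7) / 2
print(round(under, 4))  # mean uncertainty on the two correct: ((1-0.9) + (1-0.95)) / 2
```

A well-calibrated classifier keeps both values low: it is unsure when wrong and sure when right.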
The 3 Criteria of Learning Algorithms
• Efficiency: memory, run time
• Accuracy: precision, recall
[Figures: precision vs. recall curves for tracking and fusion on the pedestrian class (camera, laser, fusion), and a run time vs. memory diagram]
The 3 Criteria of Learning Algorithms
• Efficiency: memory, run time
• Accuracy: precision, recall
• Confidence: overconfidence, underconfidence
Active Learning for Semantic Mapping
Active Learning is well suited for semantic mapping because:
• it copes with the large amounts of data that mapping produces: only the informative samples need labels.
• the data is not independent: consecutive observations are highly correlated, so many samples are redundant.
• the task is essentially an online learning problem.
Active Learning for Semantic Mapping
[Diagram: the active learning cycle as before: optimization, prediction, label query, extend training data]
Concrete Approach:
• Use IVM (a sparse online GP) for classification
• Sort classified data by uncertainty
• Request labels for the most uncertain samples
• Remove uninformative samples (“forgetting”)
[Diagram: train the IVM (hyperparameters, active set) on a new batch; predict and sort by uncertainty; query labels for the most uncertain samples; extend and forget training data. Features are obtained by template matching (template image correlated with image windows) and stacked into a vector (n_1, n_2, n_3, ..., n_N)]
[Triebel, Grimmett, Paul, Posner ISRR 2013]
PD Dr. Rudolph TriebelComputer Vision GroupOnline Learning
The Informative Vector Machine
Main differences to standard GP classifier:
• it uses a subset (“active set”) of training points
• the (inverse) posterior covariance matrix is computed incrementally
The decision to include a point in the active set is based on an information-theoretic criterion.
Slight caveat: training of the hyper-parameters needs to be done iteratively.
From: http://staffwww.dcs.shef.ac.uk/people/N.Lawrence/ivm/
Semantic Mapping: Results
➡ GP “overtakes” and stays better than SVM.
➡ Active learning better than passive learning.
➡ Random selection is not better.
[Figure: detection maps (low to high number of detections) for the passive and the active detector at epochs 0, 2, and 9, marking false positives and true positives]
[Triebel, Grimmett, Paul, Posner ISRR 2013]
[Plot: F1 measure vs. epoch (0 to 10) for SVM and IVM, each with active, passive, and random sample selection]
Which Classifier Should We Choose?
[Chart: GPC, SVM, and Boosting placed along axes of accuracy (low to high) and efficiency (low to high)]
Which Classifier Should We Choose?
[Chart: the same plot with a third axis added, the quality of the confidence estimate (bad to good)]
• SVMs are accurate, but (more) overconfident
• GPCs are less overconfident, but very inefficient
• Boosting is very efficient, but also overconfident
Can We Reduce Over-(Under-)Confidence?
• Start with a very efficient multi-class boosting classifier
• Modify it so that it reduces overconfidence
• Also use the confidence to weight the training samples
Result (“Confidence Boosting”):
• samples that are classified incorrectly but with high confidence receive a higher weight
• overconfidence is reduced in every learning round
• the overall classifier is less overconfident ⇒ better for Active Learning
Note: this is the complementary approach to making a non-overconfident classifier (such as the GPC) efficient; here, an efficient classifier is made less overconfident instead!
Online Gradient Boosting
Online: outer loop over data points, inner loop over weak classifiers
• For each point, all weak classifiers are updated
• An agreement between prediction and ground truth is computed
• A sample receives a low weight if the agreement is high
Note: the agreement does not use the confidence!
Algorithm 1: Online Multi-class GradientBoost [Saffari et al. 2010]
Data: training data (X, y) with C classes
Input: number of weak learners M, loss function ℓ, agreement function a
Output: weak learners f_1, ..., f_M

 1: Initialize(f_1, ..., f_M)
 2: for n = 1, ..., N do
 3:     w_n ← 1
 4:     g_n ← 0
 5:     for m = 1, ..., M do
 6:         f_m ← UpdateWeakLearner(f_m, x_n, y_n, w_n)
 7:         p_nm ← f_m(x_n)
 8:         α_nm ← a(p_nm, y_n)
 9:         g_n ← g_n + α_nm
10:         w_n ← −∇ℓ(g_n)
11:     end for
12: end for
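The algorithm above can be made concrete. In the sketch below, the weak learner (a weighted nearest-centroid model), the agreement a(p, y) = ±1, and the exponential loss ℓ(g) = exp(−g) (whose negative gradient is again exp(−g)) are illustrative choices, not those of Saffari et al.:

```python
import math

class CentroidLearner:
    """Weak learner: one weighted centroid per class, updated online."""

    def __init__(self):
        self.sums, self.weights = {}, {}

    def update(self, x, y, w):
        self.sums[y] = self.sums.get(y, 0.0) + w * x
        self.weights[y] = self.weights.get(y, 0.0) + w

    def predict(self, x):
        return min(self.sums,
                   key=lambda y: abs(x - self.sums[y] / self.weights[y]))

def agreement(pred, y):
    """Simple agreement without confidence: +1 if correct, -1 if wrong."""
    return 1.0 if pred == y else -1.0

def online_gradient_boost(stream, M=3):
    learners = [CentroidLearner() for _ in range(M)]
    for x, y in stream:                      # outer loop over data points
        w, g = 1.0, 0.0
        for f in learners:                   # inner loop over weak learners
            f.update(x, y, w)
            g += agreement(f.predict(x), y)
            w = math.exp(-g)                 # w_n <- -grad l(g_n) for l(g) = exp(-g)
        # (x, y) can now be discarded: this is online learning
    return learners

def predict_ensemble(learners, x):
    votes = {}
    for f in learners:
        p = f.predict(x)
        votes[p] = votes.get(p, 0) + 1
    return max(votes, key=votes.get)

stream = [(0.0, "a"), (0.2, "a"), (5.0, "b"), (5.3, "b"), (0.1, "a")]
ensemble = online_gradient_boost(stream)
print(predict_ensemble(ensemble, 0.05), predict_ensemble(ensemble, 5.1))  # a b
```

Samples that the growing ensemble handles poorly (cumulative agreement g low) keep a large weight for the later weak learners, exactly as in lines 9 and 10 of the listing.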
Confidence Boosting
Main idea:
Also use the confidence to compute the agreement.
Intuitively:
• if the classification is correct ⇒ use the positive confidence
• if the classification is wrong ⇒ use the negative confidence
Result:
• samples that are classified incorrectly but with high confidence receive a higher weight
• overconfidence is reduced in every learning round
• the overall classifier is less overconfident ⇒ better for Active Learning
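One way to realize this signed-confidence agreement (an interpretation of the slide, not the exact formula from the paper):

```python
import math

def confidence_agreement(probs, y_true):
    """Signed-confidence agreement: +confidence if the prediction is correct,
    -confidence if it is wrong. probs maps class -> predicted probability."""
    pred = max(probs, key=probs.get)
    conf = probs[pred]
    return conf if pred == y_true else -conf

a_right = confidence_agreement({"cup": 0.9, "laptop": 0.1}, "cup")     # 0.9
a_wrong = confidence_agreement({"cup": 0.9, "laptop": 0.1}, "laptop")  # -0.9

# Under the exponential loss from before, w = exp(-agreement):
print(round(math.exp(-a_right), 3))  # small weight when confidently correct
print(round(math.exp(-a_wrong), 3))  # large weight when confidently wrong
```

A confidently wrong prediction thus receives a much larger weight than a hesitantly wrong one, which is what drives the overconfidence down round by round.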
Example: AL for 3D Object Recognition
Qualitative Results:
[Figure: predicted class distributions over the RGB-D object classes (kleenex, lemon, lightbulb, lime, marker, apple, ball, banana, bell pepper, binder, bowl, calculator, camera, cap, cell phone, cereal box, coffee mug) at epoch 10 for three test objects:
• true label lime: Gradient Boost predicts lemon, Confidence Boost predicts lime
• true label binder: Gradient Boost predicts calculator, Confidence Boost predicts binder
• true label bowl: Gradient Boost predicts coffeeMug, Confidence Boost predicts bowl]
Example: AL for 3D Object Recognition
Confidence Boosting
• classifies better than GradientBoost.
• generates fewer label queries.
• is less overconfident.
• is orders of magnitude faster than GPC.
[Plots: number of label queries vs. epoch (1 to 11) for Gradient Boost (GB) and Confidence Boosting (CB) on the Pendigits, USPS, and Letter data sets; average classification error on USPS, Letter, Pendigits, DNA, Begbroke, and RGBD for Saffari et al., active GradientBoost, active Confidence Boosting, and Lai et al.]
Pool-based and Stream-based AL
Problem: Most often, data occurs in streams (online) and not in pools (offline)
But: Online Random Forests
• require a balanced class distribution
• depend on the order of the data
Idea: use Mondrian Forests [1]
Main differences:
• MFs also store the range of the data in each dimension
• MFs are independent of the class labels
[1] B. Lakshminarayanan, D. M. Roy, and Y. W. Teh, “Mondrian Forests: Efficient Online Random Forests,” in Advances in Neural Information Processing Systems (NIPS), 2014, pp. 3140–3148.
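In the stream setting, ranking a whole pool by uncertainty is impossible, since samples arrive one at a time. A common substitute, sketched here with an assumed fixed threshold, is to query the supervisor whenever the normalized entropy of the current prediction is too high:

```python
import math

THRESHOLD = 0.5  # hypothetical value; in practice tuned or adapted online

def should_query(probs):
    """Stream-based query rule: ask for a label if the normalized entropy
    of the predicted class distribution exceeds the threshold."""
    c = len(probs)
    ne = -sum(p * math.log(p) for p in probs if p > 0) / math.log(c)
    return ne > THRESHOLD

print(should_query([0.55, 0.45]))  # True: near-uniform, ask the human
print(should_query([0.98, 0.02]))  # False: confident, keep the prediction
```

Each sample is judged on its own as it streams past, which fits the Mondrian Forest setting described next.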
Mondrian Forests
• start with two points
• insert a split
Mondrian Forests
• start with two points
• insert a split
• add a third point
Mondrian Forests
• start with two points
• insert a split
• add a third point
• sample a “time” from an exponential distr.
• insert a node above the given node
• insert a random split
Mondrian Forests
• start with two points
• insert a split
• add a third point
• sample a “time” from an exponential distr.
• insert a node above the given node
• insert a random split
• add a new node inside the outer box
Mondrian Forests
• start with two points
• insert a split
• add a third point
• sample a “time” from an exponential distr.
• insert a node above the given node
• insert a random split
• add a new node inside the outer box
• sample again and insert new node at the end
• this requires extending a split
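The split-sampling step in this construction can be sketched as follows (simplified from Lakshminarayanan et al. [1]): the split "time" is exponential with rate equal to the sum of the box's side lengths, the split dimension is chosen with probability proportional to side length, and the split location is uniform along that side.

```python
import random

def sample_split(lower, upper, rng=random):
    """Sample one Mondrian split for the axis-aligned box [lower, upper]."""
    extents = [u - l for l, u in zip(lower, upper)]
    rate = sum(extents)
    time = rng.expovariate(rate)        # larger boxes tend to split sooner
    # split dimension proportional to side length, location uniform within it
    dim = rng.choices(range(len(extents)), weights=extents)[0]
    loc = rng.uniform(lower[dim], upper[dim])
    return time, dim, loc

random.seed(0)
t, d, x = sample_split(lower=[0.0, 0.0], upper=[4.0, 1.0])
# for this box, dimension 0 is four times as likely to be split as dimension 1
```

Because the rate depends only on the stored data ranges and not on the labels, new points can be inserted online by comparing split times, which is exactly the label independence noted above.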
Why the Name “Mondrian” Forests?
Why the Name “Mondrian” Forests?
Piet Mondrian: “Composition with Red, Yellow, Blue, and Black” 1926
Gemeentemuseum, Den Haag
Application to Real Data
KITTI benchmark data set:
• 3D point clouds
• cars, pedestrians, bikes, trucks, etc.
• segmentation given (“tracklets”)
Goal:
• online classification
• use active learning
Major problems:
• unbalanced classes
• order of appearance
The Data Set
The real data
The data uniformly resampled
Results (no Active Learning)
The real data The data uniformly resampled
A. Narr, R. Triebel, D. Cremers “Stream-based Active Learning for Efficient and Adaptive Classification of 3D Objects” in: ICRA 2016
Results Using Active Learning
• Mondrian Forests require only 5% of the data to reach 90% accuracy
• Standard Random Forests cannot reach this accuracy
A. Narr, R. Triebel, D. Cremers “Stream-based Active Learning for Efficient and Adaptive Classification of 3D Objects” in: ICRA 2016
Summary
• Online learning has important advantages over offline learning: efficiency and adaptivity
• In active learning the algorithm selects the data to learn from (human in the loop)
• This requires good confidence estimates
• GP classifiers tend to be less overconfident, but are inefficient
• A faster alternative is Confidence Boosting: it is very efficient and only mildly overconfident
• For stream-based active learning, Mondrian Forests are better suited