Towards Lifelong Machine Learning: Multi-Task and Lifelong Learning with Unlabeled Tasks
Christoph Lampert
HSE Computer Science Colloquium, September 6, 2016

Page 1

Towards Lifelong Machine Learning: Multi-Task and Lifelong Learning with Unlabeled Tasks

Christoph Lampert

HSE Computer Science Colloquium, September 6, 2016

Page 2

IST Austria (Institute of Science and Technology Austria)

http://www.ist.ac.at

- institute for basic research
- located on the outskirts of Vienna
- English-speaking

Research at IST Austria
- focus on interdisciplinarity
- currently 41 research groups in Computer Science, Mathematics, Physics, Biology, and Neuroscience

Open Positions
- Faculty (Asst. Prof / Prof)
- Postdocs
- PhD (4-year Graduate School)
- Internships

Page 3

My research group at IST Austria:

Alex Kolesnikov Georg Martius Asya Pentina Amélie Royer Alex Zimin

Funding Sources:

Page 4

Research Topics

Theory (Statistical Machine Learning)
- Multi-task learning
- Domain adaptation
- Lifelong learning (learning to learn)
- Learning with dependent data

Models/Algorithms
- Zero-shot learning
- Classifier adaptation
- Weakly-supervised learning
- Probabilistic graphical models

Applications (in Computer Vision)
- Object recognition
- Object localization
- Image segmentation
- Semantic image representations

Today's colloquium: Mon/Wed/Fri, 16:40, HSE; Thu, 12:00, Skoltech; Thu, 19:00, Yandex; Fri, 14:00, MSU

Page 10

Overview

- Introduction
- Learning Multiple Tasks
- Multi-Task Learning with Unlabeled Tasks

Page 11

Machine Learning: designing and analyzing automatic systems that draw conclusions from empirical data.

Page 12

What is "Machine Learning"?

[Figure: application areas, including Robotics, Time Series Prediction, Social Networks, Language Processing, Healthcare, and the Natural Sciences]
Images: NASA, Kevin Hutchinson (CC BY-SA 2.0), Amazon, Microsoft, Ragesoss (CC BY-SA 3.0), R. King

Page 13

What is "Machine Learning"?

Image: "The CSIRO GPU cluster at the data centre" by CSIRO. Licensed under CC BY 3.0 via Wikimedia Commons

Page 14

What is "Machine Learning"?

Page 15

Introduction/Notation

Page 16

Introduction/Notation

Learning Tasks
A learning task, $T$, consists of
- an input set, $\mathcal{X}$ (e.g. email text, or camera images),
- an output set, $\mathcal{Y}$ (e.g. {spam, not spam} or {left, right, straight}),
- a loss function, $\ell(\hat{y}, y)$ ("how bad is it to predict $\hat{y}$ if $y$ is correct?"),
- a probability distribution, $D(x, y)$, over $\mathcal{X} \times \mathcal{Y}$.

We assume that $\mathcal{X}$, $\mathcal{Y}$ and $\ell$ are known, but $D(x, y)$ is unknown.

Learning
The goal of learning is to find a function $h : \mathcal{X} \to \mathcal{Y}$ with minimal risk (expected loss)
$$\mathrm{er}(h) := \mathbb{E}_{(x,y) \sim D}\, \ell(y, h(x))$$
(also called the "generalization error")

Page 18

Introduction/Notation

Supervised Learning
Given a training set
$$S = \{(x_1, y_1), \ldots, (x_n, y_n)\}$$
consisting of independent samples from $D$, i.e. $S \stackrel{i.i.d.}{\sim} D$, define the empirical risk
$$\widehat{\mathrm{er}}(h) := \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, h(x_i))$$
(also called the "training error")
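For concreteness, a minimal sketch of the empirical risk under the 0/1 loss; the names `empirical_risk`, `h`, `X`, `y` are illustrative, not from the slides:

```python
import numpy as np

def empirical_risk(h, X, y):
    """Empirical risk of hypothesis h under the 0/1 loss:
    the fraction of training samples (x_i, y_i) with h(x_i) != y_i."""
    predictions = np.array([h(x) for x in X])
    return float(np.mean(predictions != np.asarray(y)))

# usage: a simple threshold classifier on 1-d inputs
X = np.array([0.2, 0.7, 0.4, 0.9])
y = np.array([0, 1, 0, 1])
print(empirical_risk(lambda x: int(x > 0.5), X, y))  # 0.0
```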

Page 19

Introduction/Notation

Empirical Risk Minimization
Let $\mathcal{H} \subset \{h : \mathcal{X} \to \mathcal{Y}\}$ be a hypothesis set. Finding a hypothesis $h \in \mathcal{H}$ by solving
$$\min_{h \in \mathcal{H}} \widehat{\mathrm{er}}(h)$$
is called empirical risk minimization (ERM).

Fundamental Theorem of Statistical Learning Theory [Alexey Chervonenkis and Vladimir Vapnik, 1970s]
"ERM is a good learning strategy if and only if $\mathcal{H}$ has finite VC-dimension."

Page 21

Introduction/Notation

Vapnik-Chervonenkis generalization bound
Let $\mathcal{H}$ be a set of binary-valued hypotheses. For any $\delta > 0$, the following inequality holds with probability at least $1 - \delta$ over the draw of a training set:
$$|\mathrm{er}(h) - \widehat{\mathrm{er}}(h)| \le \sqrt{\frac{\mathrm{VC}(\mathcal{H})\left(\ln\frac{2n}{\mathrm{VC}(\mathcal{H})} + 1\right) + \log(4/\delta)}{n}}$$
simultaneously for all $h \in \mathcal{H}$.

Observations (assuming our training set is not atypical):
- the right-hand side of the bound contains only computable quantities → we can compute the hypothesis that minimizes it
- the bound holds uniformly for all $h \in \mathcal{H}$ → it also holds for the minimizer

Choosing the hypothesis that minimizes $\widehat{\mathrm{er}}(h)$ is a promising strategy → ERM

Page 22

Introduction/Notation

Vapnik-Chervonenkis generalization bound
Let $\mathcal{H}$ be a set of binary-valued hypotheses. For any $\delta > 0$, the following inequality holds with probability at least $1 - \delta$ over the draw of a training set:
$$\mathrm{er}(h) \le \widehat{\mathrm{er}}(h) + \sqrt{\frac{\mathrm{VC}(\mathcal{H})\left(\ln\frac{2n}{\mathrm{VC}(\mathcal{H})} + 1\right) + \log(4/\delta)}{n}}$$
simultaneously for all $h \in \mathcal{H}$.

Observations (assuming our training set is not atypical):
- the right-hand side of the bound contains only computable quantities → we can compute the hypothesis that minimizes it
- the bound holds uniformly for all $h \in \mathcal{H}$ → it also holds for the minimizer

Choosing the hypothesis that minimizes $\widehat{\mathrm{er}}(h)$ is a promising strategy → ERM
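To get a feeling for the scale of the complexity term, a small sketch that evaluates the square-root term of the bound above (the function name and the example numbers are illustrative):

```python
import math

def vc_confidence_term(vc_dim: int, n: int, delta: float) -> float:
    """Square-root term of the VC bound above:
    sqrt((VC(H) * (ln(2n / VC(H)) + 1) + log(4 / delta)) / n)."""
    return math.sqrt(
        (vc_dim * (math.log(2 * n / vc_dim) + 1) + math.log(4 / delta)) / n
    )

# e.g. linear classifiers in R^10 (VC-dimension 11), n = 100000, delta = 0.05:
print(vc_confidence_term(11, 100_000, 0.05))  # ~ 0.035
```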

Page 23

Introduction/Notation

Summary

- For some learning algorithms, it is possible to give (probabilistic) guarantees of their performance on future data.
- A principled way to construct such algorithms:
  1) prove a generalization bound that
     - consists only of computable quantities on the right-hand side
     - holds uniformly over all hypotheses
  2) the algorithm ≡ minimizing the r.h.s. of the bound with respect to the hypotheses
- Even when this is not practical, it is still a good way to get inspiration for new algorithms.

Page 24

Learning multiple tasks

Page 25

Learning a Single Task

Page 26

Learning Multiple Tasks

[Diagram: each of Task 1, Task 2, Task 3, ... has its own data, its own learner, and its own predictions]

Tabula Rasa Learning

Page 28

Multi-Task Learning

[Diagram: one multi-task learner consumes the data of Task 1, Task 2, and Task 3 and produces predictions for each]

Page 29

Future challenge: continuously improving open-ended learning

[Diagram: one lifelong learner handles a growing sequence of tasks (Task 1, Task 2, Task 3, ...), receiving data and producing predictions for each]

Lifelong Learning

Page 30

Active Task Selection for Multi-Task Learning

Asya Pentina

A. Pentina, CHL, "Active Task Selection for Multi-Task Learning", arXiv:1602.06518 [stat.ML]

Page 31

Multi-Task Learning

[Diagram: one multi-task learner consumes the data of Task 1, Task 2, and Task 3 and produces predictions for each]

Page 32

Example: Personalized Speech Recognition

each user speaks differently: learn an individual classifier

Page 33

Example: Personalized Spam Filters

each user has different preferences: learn individual filter rules

Page 34

Multi-Task Learning: state of the art

Sharing Parameters
- assumption: all tasks are minor variations of each other
- e.g. introduce a regularizer that encourages similar weight vectors (see the objective below)

Sharing Representations
- assumption: one set of features works well for all tasks
- e.g. a neural network with its early layers shared between all tasks

Sharing Samples
- form larger training sets from the sample sets of different tasks
- extreme case: merge all data and learn a single classifier

All have one thing in common: they need labeled training sets for all tasks.
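To make the parameter-sharing idea concrete, one standard form of such a joint objective (a generic textbook formulation with a mean-regularizer, not a formulation taken from these slides) is
$$\min_{w_1, \ldots, w_T} \; \sum_{t=1}^{T} \widehat{\mathrm{er}}_t(w_t) + \lambda \sum_{t=1}^{T} \left\| w_t - \bar{w} \right\|^2, \qquad \bar{w} = \frac{1}{T} \sum_{t=1}^{T} w_t,$$
which pulls all per-task weight vectors $w_t$ towards their common mean.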

Page 35

New Setting
- Given: unlabeled sample sets for each of T tasks
- the algorithm chooses a subset of k tasks and asks for labeled samples
- Output: classifiers for all tasks (labeled as well as unlabeled)

Example: speech recognition, T = millions
- collecting some unlabeled data (= speech) from each user is easy
- asking users for annotation is what annoys them: avoid it as much as possible
- strategy:
  - select a subset of speakers
  - ask only those for training annotation
  - use the data to build classifiers for all users
- is this going to work?
  - clearly not, if all speakers are completely different
  - clearly yes, if all speakers are the same

Can we quantify this effect, give guarantees, and derive a principled algorithm?

Page 37

Reminder: Learning Tasks
A learning task, $T$, consists of
- an input space, $\mathcal{X}$ (e.g. audio samples)
- an output space, $\mathcal{Y}$ (e.g. phonemes or natural-language text)
- a loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$
- a joint probability distribution, $D(x, y)$, over $\mathcal{X} \times \mathcal{Y}$

For simplicity, assume
- $\mathcal{Y} = \{0, 1\}$ (binary classification tasks)
- $\ell(\hat{y}, y) = [\![\hat{y} \neq y]\!]$ (0/1 loss)
- a "realizable" situation:
  - $x$ distributed according to a (marginal) distribution $D(x)$
  - a deterministic labeling function $f : \mathcal{X} \to \mathcal{Y}$

Page 38

Learning Classifiers for All Tasks

Learning Classifiers for Labeled Tasks
For labeled tasks we have a training set $S_j = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, so we can compute a classifier, e.g. by minimizing the empirical risk
$$\widehat{\mathrm{er}}(h) := \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, h(x_i)).$$
Other algorithms can be used instead of ERM: support vector machines, deep learning, ...

Learning Classifiers for Unlabeled Tasks
For unlabeled tasks we have only data $S_j = \{x_1, \ldots, x_n\}$ without annotation, so we cannot compute the empirical risk to learn or evaluate a classifier.

Main idea: let's re-use a classifier from a 'similar' labeled task.

Page 39

Given two tasks, $\langle D_1, f_1 \rangle$ and $\langle D_2, f_2 \rangle$, how do we determine how similar they are?

A similarity measure between tasks (first attempt)
Compare $D_1(x, y)$ and $D_2(x, y)$, e.g. by their Kullback-Leibler divergence
$$\mathrm{KL}(D_1 \,\|\, D_2) = \mathbb{E}_{D_1}\left[ \log \frac{D_1}{D_2} \right]$$
Problem: too strict, e.g. $\mathrm{KL}(D_1 \,\|\, D_2) = \infty$ whenever $f_1 \neq f_2$.

A similarity measure between tasks (second attempt)
Compare only the marginal distributions, e.g. by their Kullback-Leibler divergence, and treat the labeling functions separately.
Problem: still very strict, and cannot be estimated well from sample sets.

Page 41

A more suitable similarity measure between tasks:

Discrepancy Distance [Mansour et al., 2009]
For a hypothesis set $\mathcal{H}$ and $\mathrm{er}_D(h, h') = \mathbb{E}_{x \sim D} [\![ h(x) \neq h'(x) ]\!]$, define
$$\mathrm{disc}(D_1, D_2) = \max_{h, h' \in \mathcal{H}} \left| \mathrm{er}_{D_1}(h, h') - \mathrm{er}_{D_2}(h, h') \right|.$$
- maximal classifier disagreement between the distributions $D_1, D_2$
- only as strong as we need when dealing with a hypothesis class $\mathcal{H}$
- can be estimated from (unlabeled) sample sets $S_1, S_2$:
  $$\mathrm{disc}(S_1, S_2) = 2(1 - E),$$
  where $E$ is the empirical error of the best classifier in $\mathcal{H}$ on the virtual binary classification problem with
  - class 1: all samples of task 1
  - class 2: all samples of task 2

Page 43

Discrepancy illustration

Good separating linear classifier → large discrepancy (w.r.t. the linear hypothesis class)

Page 45

Discrepancy illustration

No good separating linear classifier → small discrepancy (w.r.t. the linear hypothesis class)

Page 47

Discrepancy illustration

No good separating linear classifier → small discrepancy (w.r.t. the linear hypothesis class), but it could be large for non-linear hypotheses, e.g. quadratic ones

Page 50

Learning classifiers for unlabeled tasks

Interpret this as domain adaptation:
- a classifier is trained from samples of one distribution $\langle D_s, f_s \rangle$
- the classifier is used for predicting with respect to another distribution $\langle D_t, f_t \rangle$

The discrepancy allows relating the performance of a hypothesis on one task to its performance on a different task:

Domain Adaptation Bound [S. Ben-David et al., 2010]
For any two tasks $\langle D_s, f_s \rangle$ and $\langle D_t, f_t \rangle$ and any hypothesis $h \in \mathcal{H}$, the following holds:
$$\mathrm{er}_t(h) \le \mathrm{er}_s(h) + \mathrm{disc}(D_s, D_t) + \lambda_{st}, \qquad \text{where } \lambda_{st} = \min_{h \in \mathcal{H}} \left( \mathrm{er}_s(h) + \mathrm{er}_t(h) \right).$$

Note: $\lambda_{st}$ depends on the labeling functions, $f_s, f_t$, and is not observable.
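For scale, a worked instance of the bound with purely illustrative numbers (not from the slides):
$$\mathrm{er}_s(h) = 0.05, \quad \mathrm{disc}(D_s, D_t) = 0.10, \quad \lambda_{st} = 0.02 \;\;\Rightarrow\;\; \mathrm{er}_t(h) \le 0.05 + 0.10 + 0.02 = 0.17.$$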

Page 51

Active Task Selection for Multi-Task Learning with Unlabeled Tasks

Situation:
- $\langle D_1, f_1 \rangle, \ldots, \langle D_T, f_T \rangle$: tasks
- unlabeled $S_1, \ldots, S_T$: sample sets for the T tasks, $|S_t| = n$
- for a subset of k tasks we can ask for labeled data, $S_t$ with $|S_t| = m$
- Goal:
  - identify a subset of k tasks to be labeled, $t_1, \ldots, t_k$
  - decide between which tasks the classifiers should be shared
  - ask for the labeled training sets
  - jointly learn classifiers for all tasks

Parameterization:
- assignment vector, $C = (c_1, \ldots, c_T) \in \{1, \ldots, T\}^T$, with at most k different values occurring as entries
- $c_s = t$ indicates that we solve task s using the classifier from task t

Page 52

Theorem [Pentina, CHL, 2016]
Let $\delta > 0$. With probability at least $1 - \delta$, the following inequality holds uniformly for all possible assignments $C = (c_1, \ldots, c_T) \in \{1, \ldots, T\}^T$ and all $h \in \mathcal{H}$:
$$\frac{1}{T} \sum_{t=1}^{T} \mathrm{er}_t(h_{c_t}) \le \frac{1}{T} \sum_{t=1}^{T} \widehat{\mathrm{er}}_{c_t}(h_{c_t}) + \frac{1}{T} \sum_{t=1}^{T} \mathrm{disc}(S_t, S_{c_t}) + \frac{1}{T} \sum_{t=1}^{T} \lambda_{t c_t} + 2\sqrt{\frac{2d \log(2n) + 2 \log T + \log(4/\delta)}{n}} + \sqrt{\frac{2d \log(em/d)}{m}} + \sqrt{\frac{\log T + \log(2/\delta)}{2m}},$$
where
- $\mathrm{er}_t(h)$ = generalization error of $h$ on task $t$,
- $\widehat{\mathrm{er}}_t(h)$ = training error of $h$ on task $t$,
- $\lambda_{st} = \min_{h \in \mathcal{H}} \left( \mathrm{er}_{D_s}(f_s, h) + \mathrm{er}_{D_t}(f_t, h) \right)$.

Page 53

Theorem – Executive Summary
$$\text{generalization error over all tasks} \;\le\; \text{training error on labeled tasks} + \frac{1}{T} \sum_{t=1}^{T} \mathrm{disc}(S_t, S_{c_t}) + \frac{1}{T} \sum_{t=1}^{T} \lambda_{t c_t} + \ldots$$

Strategy: minimize the (computable terms of the) right-hand side.
Algorithm (a sketch follows below):
- estimate $\mathrm{disc}(S_i, S_j)$ between all given tasks
- find the assignment $C$ that minimizes $\sum_{t=1}^{T} \mathrm{disc}(S_t, S_{c_t})$ → k-medoids clustering
- request labels for the cluster-center tasks and learn classifiers $h_{c_t}$
- for all other tasks, use the classifier $h_{c_t}$ of the nearest center

What about $\lambda_{t c_t}$?
- small under the assumption that "similar tasks have similar labeling functions"
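A minimal sketch of the clustering step, assuming a precomputed $T \times T$ matrix `D` of pairwise discrepancy estimates; `k_medoids` and its interface are illustrative, not taken from the paper:

```python
import numpy as np

def k_medoids(D, k, n_iter=100, seed=0):
    """Basic k-medoids on a precomputed T x T dissimilarity matrix D.
    Returns the medoid indices (the tasks to request labels for) and,
    for every task t, its assigned medoid (the assignment vector c_t)."""
    rng = np.random.default_rng(seed)
    T = D.shape[0]
    medoids = rng.choice(T, size=k, replace=False)
    for _ in range(n_iter):
        # assign every task to its nearest medoid
        assign = medoids[np.argmin(D[:, medoids], axis=1)]
        # within each cluster, pick the point minimizing total dissimilarity
        new_medoids = []
        for m in medoids:
            members = np.flatnonzero(assign == m)
            new_medoids.append(members[np.argmin(D[np.ix_(members, members)].sum(axis=0))])
        new_medoids = np.array(new_medoids)
        if set(new_medoids) == set(medoids):
            break
        medoids = new_medoids
    assign = medoids[np.argmin(D[:, medoids], axis=1)]
    return medoids, assign
```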

Page 55

Active Task Selection for Multi-Task Learning: single-source transfer

Given: tasks (circles) with only unlabeled data

Page 56

Active Task Selection for Multi-Task Learning: single-source transfer

Step 1: compute pairwise discrepancies

Page 57

Active Task Selection for Multi-Task Learning: single-source transfer

Step 2: cluster tasks by k-medoids

Page 58

Active Task Selection for Multi-Task Learning: single-source transfer

Step 3: request labels for tasks corresponding to cluster centroids

Page 59

Active Task Selection for Multi-Task Learning: single-source transfer

Step 4: learn classifier for centroid tasks

Page 60

Active Task Selection for Multi-Task Learning: single-source transfer

Step 5: transfer classifiers to other tasks in cluster

Page 61

Experiments

[Figure: test error in % vs. percentage of labeled tasks. Left: Synthetic, 1000 tasks (Gaussians in R²). Right: ImageNet (ILSVRC2010), 999 tasks ("1 class vs. others"). Curves: Fully Labeled, Partially Labeled, ATS-SS, RTS-SS.]

Methods:
- ATS-SS: Active Task Selection + discrepancy-based transfer
- RTS-SS: Random Task Selection + discrepancy-based transfer

Baselines:
- Fully Labeled: all T tasks labeled, 100 labeled examples each
- Partially Labeled: k tasks labeled, 100k/T labeled examples each

Page 62

Active Task Selection for Multi-Task Learning: multi-source transfer

Extension: can we transfer from more than one task?

New design choice:
- Q: what classifier should we use for the unlabeled tasks?
- A: train on a weighted combination of the training sets of "similar" labeled tasks

Parameterization: weight matrix $\alpha \in \mathbb{R}^{T \times T}$
- $\alpha_{st}$ indicates how much task t relies on samples from task s
- the columns of $\alpha$ sum to 1
- only k non-zero rows of $\alpha$ (these indicate which tasks to label)

Note: transfer is now also possible between labeled tasks.

Page 63

Theorem [Pentina, CHL, 2016]
Let $\delta > 0$. Then, provided that the choice of labeled tasks $I = \{i_1, \ldots, i_k\}$ and the weights $\alpha_1, \ldots, \alpha_T \in \Lambda_I$ are determined by the unlabeled data only, the following inequality holds with probability at least $1 - \delta$ for all possible choices of $h_1, \ldots, h_T \in \mathcal{H}$:
$$\frac{1}{T} \sum_{t=1}^{T} \mathrm{er}_t(h_t) \le \frac{1}{T} \sum_{t=1}^{T} \widehat{\mathrm{er}}_{\alpha_t}(h_t) + \frac{1}{T} \sum_{t=1}^{T} \sum_{i \in I} \alpha_{ti}\, \mathrm{disc}(S_t, S_i) + \frac{A}{T} \|\alpha\|_{2,1} + \frac{B}{T} \|\alpha\|_{1,2} + \frac{1}{T} \sum_{t=1}^{T} \sum_{i \in I} \alpha_{ti} \lambda_{ti} + \sqrt{\frac{8 (\log T + d \log(enT/d))}{n}} + \sqrt{\frac{2}{n} \log \frac{4}{\delta}} + 2\sqrt{\frac{2d \log(2n) + \log(4T^2/\delta)}{n}},$$
where
$$\|\alpha\|_{2,1} = \sum_{t=1}^{T} \sqrt{\sum_{i \in I} (\alpha_{ti})^2}, \qquad \|\alpha\|_{1,2} = \sqrt{\sum_{i \in I} \Big( \sum_{t=1}^{T} \alpha_{ti} \Big)^2}, \qquad A = \sqrt{\frac{2d \log(ekm/d)}{m}}, \qquad B = \sqrt{\frac{\log(4/\delta)}{2m}}.$$

Page 64

Theorem – Executive Summary
The generalization error over all T tasks can be bounded, uniformly in the weights $\alpha$ and in $h_1, \ldots, h_T$, by a weighted combination of
- the training errors on the selected tasks
- the discrepancies between tasks
- $\|\alpha\|_{2,1}$ and $\|\alpha\|_{1,2}$
plus the $\lambda_{st}$-terms (which we assume to be small), plus complexity terms (constants).

Strategy: minimize the (computable terms of the) right-hand side.
Algorithm (a sketch of the last step follows below):
- estimate $\mathrm{disc}(S_i, S_j)$ between all tasks
- minimize the $\alpha$-weighted discrepancies $+ \|\alpha\|_{2,1} + \|\alpha\|_{1,2}$ with respect to $\alpha$
- request labels for the tasks corresponding to the non-zero rows of $\alpha$
- learn classifiers $h_t$ for each task from the $\alpha$-weighted sample sets
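A minimal sketch of that last step, training one target task's classifier from the $\alpha$-weighted union of labeled-task samples; the data layout and the use of scikit-learn's LogisticRegression with sample weights are illustrative assumptions, not from the slides:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_from_weighted_sources(task_data, alpha_t, labeled_ids):
    """Train the classifier for one target task t from the alpha-weighted
    union of the labeled tasks' samples (multi-source transfer sketch).
    task_data: dict task_id -> (X, y); alpha_t: dict task_id -> weight alpha_ti."""
    X = np.vstack([task_data[i][0] for i in labeled_ids])
    y = np.concatenate([task_data[i][1] for i in labeled_ids])
    w = np.concatenate([
        np.full(len(task_data[i][1]), alpha_t[i]) for i in labeled_ids
    ])
    return LogisticRegression().fit(X, y, sample_weight=w)
```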

Page 66

Active Task Selection for Multi-Task Learning: multi-source transfer

Given: tasks (circles) with only unlabeled data

Page 67

Active Task Selection for Multi-Task Learning: multi-source transfer

Step 1: select prototypes and compute amount of sharing

Page 68

Active Task Selection for Multi-Task Learning: multi-source transfer

Step 2: request labels for prototypes

Page 69

Active Task Selection for Multi-Task Learning: multi-source transfer

Step 3: learn classifier for each task using shared samples

Page 70

Experiments

[Figure: test error in % vs. percentage of labeled tasks. Left: Synthetic, 1000 tasks (Gaussians in 2D). Right: ImageNet (ILSVRC2010), 999 tasks ("1 class vs. others"). Curves: Fully Labeled, Partially Labeled, ATS-MS, RTS-MS.]

Methods:
- ATS-MS: Active Task Selection + multi-source transfer
- RTS-MS: Random Task Selection + multi-source transfer

Baselines:
- Fully Labeled: all T tasks labeled, 100 labeled examples each
- Partially Labeled: k tasks labeled, 100k/T labeled examples each

Page 71

[Diagram: one lifelong learner handles a growing sequence of tasks (Task 1, Task 2, Task 3, ...), receiving data and producing predictions for each]

From Multi-Task to Lifelong Learning...

Page 72

How to handle a new task?

Page 80

Summary

Multi-Task Learning with Unlabeled Tasks
- many (similar) tasks to solve
- how can we avoid having to collect annotation for all of them?

Active Task Selection
- identify the most informative tasks and have only those labeled
- transfer information from labeled to unlabeled tasks
- principled algorithms, derived from generalization bounds

Lifelong Learning
- as new tasks come in, use their similarity to previous tasks to either
  - solve them using a previous classifier, or
  - ask for new training annotation

Page 81

Thank you...