Properties of Good Meta-Learning Algorithms (And How to Achieve Them)

Chelsea Finn, UC Berkeley & Google Brain → Stanford

Source: ai.stanford.edu/~cbfinn/_files/icml2018_automl_35min.pdf


Page 1:

Chelsea Finn, UC Berkeley & Google Brain → Stanford

Properties of Good Meta-Learning Algorithms (And How to Achieve Them)

Page 2:

Why Learn to Learn?

- effectively reuse data on other tasks
- replace manual engineering of architecture, hyperparameters, etc.
- learn to quickly adapt to unexpected scenarios (inevitable failures, long tail)
- learn how to learn with weak supervision

Page 3:

Problem Domains:
- few-shot classification & generation
- hyperparameter optimization
- architecture search
- faster reinforcement learning
- domain generalization
- learning structure
- …

Approaches:
- recurrent networks
- learning optimizers or update rules
- learning initial parameters & architecture
- acquiring metric spaces
- Bayesian models
- …

What is the meta-learning problem statement?

Page 4:

The Meta-Learning Problem

Supervised Learning:       Inputs: x           Outputs: y    Data: D = {(x_k, y_k)}
Meta-Supervised Learning:  Inputs: D_train, x  Outputs: y    Data: {D_i}

That is, learn a function f(D_train, x) → y.

Why is this view useful? It reduces the problem to the design & optimization of f.

Finn, Levine. Meta-learning and Universality: Deep Representation… ICLR 2018
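Read as code, this reduction says a meta-learner is just a function that takes a small dataset as an extra input. A schematic sketch in Python (my notation, not from the slides; the type names are illustrative):

    # Supervised learning:       f(x) -> y,           trained on data D = {(x_k, y_k)}
    # Meta-supervised learning:  f(D_train, x) -> y,  trained on a set of datasets {D_i}
    from typing import Any, Callable, List, Tuple

    Example = Tuple[Any, Any]                    # an (input x, label y) pair
    Dataset = List[Example]                      # a small training set D_train
    Learner = Callable[[Any], Any]               # ordinary supervised model: x -> y
    MetaLearner = Callable[[Dataset, Any], Any]  # f(D_train, x) -> y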

Page 5:

Example: Few-Shot Classification

diagram adapted from Ravi & Larochelle ‘17

Given 1 example of each of 5 classes: classify new examples.

[Diagram: meta-training tasks drawn from the training classes, each with its own training data and test set]
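To make the episodic setup concrete, here is a minimal sketch of how 1-shot, 5-way episodes are typically sampled during meta-training (my illustration, not code from the talk; the dataset dict mapping each class to its examples is a hypothetical format):

    import random

    def sample_episode(dataset, n_way=5, k_shot=1, n_query=5):
        """Sample one few-shot task: a tiny training set plus a held-out test set."""
        classes = random.sample(sorted(dataset), n_way)  # dataset: class -> examples
        train_set, test_set = [], []
        for label, cls in enumerate(classes):
            examples = random.sample(dataset[cls], k_shot + n_query)
            # k_shot examples to adapt on; the rest evaluate the adapted learner
            train_set += [(x, label) for x in examples[:k_shot]]
            test_set += [(x, label) for x in examples[k_shot:]]
        return train_set, test_set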

Page 6:

What do we want from our meta-learning algorithms?

- Expressive power: the ability of f to represent a range of learning procedures
- Consistency: the learned learning procedure will solve the task with enough data; result: reasonable performance on out-of-distribution tasks
- Uncertainty awareness: the ability to reason about ambiguity during learning
- No need to differentiate through learning: better scalability to long learning processes



Page 9:

Design of f?

Recurrent network (LSTM, NTM, Conv): Santoro et al. ’16, Duan et al. ’17, Wang et al. ’17, Munkhdalai & Yu ’17, Mishra et al. ’17, …

Learned optimizer (often uses recurrence): Schmidhuber et al. ’87, Bengio et al. ’90, Hochreiter et al. ’01, Li & Malik ’16, Andrychowicz et al. ’16, Ha et al. ’17, Ravi & Larochelle ’17, …

These black box approaches have expressive power, but consistency is not guaranteed.

Can we incorporate structure into the learning procedure?

Page 10:

Approaches that incorporate learning structure

Nearest Neighbors (in a learned metric space): Koch ’15, Vinyals et al. ’16, Snell et al. ’17, Reed et al. ’17, Li et al. ’17, …

[Figures from Koch ’15 and Vinyals et al. ’16]

This buys consistency, but at some cost in expressive power. Can we get both?
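As a sketch of the metric-space family (in the spirit of Snell et al.’s prototypical networks; my illustration, with an assumed embedding function embed(params, x)): embed the labeled examples, average each class into a prototype, and classify queries by distance.

    import jax.numpy as jnp

    def metric_space_predict(embed, params, support_x, support_y, query_x, n_way):
        """Nearest-prototype classification in a learned metric space."""
        z_support = embed(params, support_x)   # [N*K, d] embedded labeled examples
        z_query = embed(params, query_x)       # [Q, d] embedded queries
        # One prototype per class: the mean embedding of its support examples.
        prototypes = jnp.stack(
            [z_support[support_y == c].mean(axis=0) for c in range(n_way)])
        # Negative squared Euclidean distance plays the role of class logits.
        dists = ((z_query[:, None, :] - prototypes[None, :, :]) ** 2).sum(-1)
        return jnp.argmax(-dists, axis=-1)     # predicted class for each query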

Page 11:

Approaches that incorporate learning structure

Nearest Neighbors (in a learned metric space): Koch ’15, Vinyals et al. ’16, Snell et al. ’17, Reed et al. ’17, Li et al. ’17, …

Gradient Descent (from a learned initialization): Finn et al. ’17, Grant et al. ’17, Reed et al. ’17, Li et al. ’17, …

Model-Agnostic Meta-Learning (MAML). Key idea: train over many tasks to learn a parameter vector θ that transfers via fine-tuning.

Fine-tuning [test-time], starting from pretrained parameters θ on the training data for the test task:

    φ ← θ − α ∇_θ L(θ, D^tr)

MAML:

    min_θ Σ_i L(θ − α ∇_θ L(θ, D^tr_i), D^ts_i)

Finn, Abbeel, Levine. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. ICML 2017
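A compact sketch of the two equations above in JAX (my illustration, not the authors’ released code; loss(params, data) is an assumed per-task loss):

    import jax

    def fine_tune(loss, theta, d_train, alpha=0.01):
        """One fine-tuning step: phi = theta - alpha * grad_theta L(theta, D_tr)."""
        grads = jax.grad(loss)(theta, d_train)
        return jax.tree_util.tree_map(lambda p, g: p - alpha * g, theta, grads)

    def maml_objective(theta, tasks, loss, alpha=0.01):
        """MAML: sum over tasks of the test loss after adaptation."""
        return sum(loss(fine_tune(loss, theta, d_tr, alpha), d_ts)
                   for d_tr, d_ts in tasks)

    # The meta-gradient differentiates through the inner update (second-order):
    # meta_grads = jax.grad(maml_objective)(theta, tasks, loss)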

Page 12:

Approaches that incorporate learning structure (as above: nearest neighbors in a learned metric space; gradient descent from a learned initialization).

The gradient-descent learning procedure is consistent, but is it expressive? Does consistency come at a cost?

Page 13:

Universal Function Approximation Theorem (Hornik et al. ’89, Cybenko ’89, Funahashi ’89): a neural network with one hidden layer of finite width can approximate any continuous function.

How can we define a notion of expressive power for meta-learning? A recurrent network or learned optimizer that is a “universal function approximator” over (D^tr, x) is a “universal learning procedure approximator”.

Finn, Abbeel Levine. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. ICML 2017

Page 14:

For a sufficiently deep f, the MAML function can approximate any function of (D^tr, x) (Finn & Levine, ICLR 2018).

Why is this interesting? MAML has the benefit of consistency without losing the expressive power of a recurrent network.

Assumptions:
- nonzero learning rate α
- the loss function gradient does not lose information about the label
- the datapoints in D^tr are unique

Finn, Levine. Meta-learning and Universality: Deep Representation… ICLR 2018

Page 15:

Black box approaches: expressive power ✓, consistency ✗. MAML: expressive power ✓, consistency ✓.

Empirically, what does consistency get you?

Page 16:

How well can methods generalize to similar, but extrapolated tasks? The world is non-stationary.

[Plot: performance vs. task variability on Omniglot image classification, comparing MAML with TCML and MetaNetworks]

Finn, Levine. Meta-learning and Universality: Deep Representation… ICLR 2018

Page 17:

How well can methods generalize to similar, but extrapolated tasks? The world is non-stationary.

[Plot: error vs. task variability on sinusoid curve regression, comparing MAML with TCML]

Takeaway: strategies learned with MAML consistently generalize better to out-of-distribution tasks.

Finn, Levine. Meta-learning and Universality: Deep Representation… ICLR 2018

Page 18:

What do we want from our meta-learning algorithms? (Recap: expressive power, consistency, uncertainty awareness, no need to differentiate through learning.)

Next: no need to differentiate through learning, for better scalability to long learning processes.


Page 20:

Model-Agnostic Meta-Learning (MAML):

    min_θ Σ_i L(θ − α ∇_θ L(θ, D^tr_i), D^ts_i)

What if we stop the gradient through the inner-gradient term ∇_θ L(θ, D^tr_i)? This first-order approximation avoids differentiating through learning, and on the benchmarks shown costs little.

Note: In some tasks, we have observed a more substantial drop.

Finn, Abbeel Levine. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. ICML 2017
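In code, the stop-gradient question amounts to wrapping the inner gradient in jax.lax.stop_gradient, so the meta-gradient no longer flows through the inner update (a sketch continuing the earlier MAML snippet, with the same assumed loss):

    import jax

    def fine_tune_first_order(loss, theta, d_train, alpha=0.01):
        """First-order variant: treat the inner gradient as a constant."""
        grads = jax.grad(loss)(theta, d_train)
        grads = jax.lax.stop_gradient(grads)  # drops the second-order terms
        return jax.tree_util.tree_map(lambda p, g: p - alpha * g, theta, grads)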


Page 22:

What do we want from our meta-learning algorithms? (Recap: expressive power, consistency, uncertainty awareness, no need to differentiate through learning.)

There is no uncertainty awareness in the original MAML.

Page 23:

A Probabilistic Interpretation of MAML (with Erin Grant)

That’s nice, but is it… Bayesian? Kind of… but it can be made more Bayesian. (Santos, 1996)

Grant, Finn, Levine, Darrell, Griffiths. Recasting gradient-based meta-learning as hierarchical Bayes. ICLR ‘18

Page 24:

A Probabilistic Interpretation of MAML (with Erin Grant)

MAML’s gradient-based adaptation corresponds to MAP inference in this hierarchical model.

Can we do better than MAP? We can use a Laplace estimate, estimating the Hessian for neural nets with K-FAC.

Grant, Finn, Levine, Darrell, Griffiths. Recasting gradient-based meta-learning as hierarchical Bayes. ICLR ‘18
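For intuition, a toy sketch of the Laplace estimate (my illustration: exact Hessian on flat parameters, up to prior and constant terms; K-FAC replaces the exact Hessian at neural-net scale):

    import jax
    import jax.numpy as jnp

    def laplace_log_evidence(loss, phi_map, data):
        """Laplace approximation to log p(data): value at the MAP solution,
        penalized by the curvature (log-determinant of the Hessian) there."""
        nll = loss(phi_map, data)                  # negative log-likelihood at MAP
        hess = jax.hessian(loss)(phi_map, data)    # [d, d] for flat params phi_map
        _, logdet = jnp.linalg.slogdet(hess)       # assumes positive definite at MAP
        return -nll - 0.5 * logdet                 # up to additive constants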

Page 25:

Modeling ambiguity

Can we sample classifiers?

Intuition: we want to learn a prior where a random kick can put us in different modes

[Example ambiguous face-attribute tasks: “smiling, hat” vs. “smiling, young”]

Page 26:

Meta-learning with ambiguity

Finn*, Xu*, Levine. Probabilistic Model-Agnostic Meta-Learning

Page 27:

Sampling parameter vectors

Approximate with MAP: this is extremely crude, but extremely convenient!

Finn*, Xu*, Levine. Probabilistic Model-Agnostic Meta-Learning
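A minimal sketch of the sampling idea (my reading, with an assumed Gaussian over the initialization and flat parameters for simplicity): perturb the learned prior with learned per-parameter noise, then run ordinary gradient-descent adaptation, so different samples can land in different modes.

    import jax
    import jax.numpy as jnp

    def sample_then_adapt(loss, mu, log_sigma, d_train, key, alpha=0.01, steps=5):
        """Sample theta ~ N(mu, sigma^2), then fine-tune it on D_tr."""
        eps = jax.random.normal(key, mu.shape)
        theta = mu + jnp.exp(log_sigma) * eps  # the random "kick" on the prior
        for _ in range(steps):
            theta = theta - alpha * jax.grad(loss)(theta, d_train)
        return theta  # one sampled classifier; different keys, different modes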

Page 28:

Sampling parameter vectors: what does ancestral sampling look like?

[Samples for ambiguous tasks such as “smiling, hat” vs. “smiling, young”]

Finn*, Xu*, Levine. Probabilistic Model-Agnostic Meta-Learning

Page 29:

PLATIPUS: Probabilistic LATent model for Incorporating Priors and Uncertainty in few-Shot learning

Ambiguous regression: [qualitative results]

Ambiguous classification: [qualitative results]

Finn*, Xu*, Levine. Probabilistic Model-Agnostic Meta-Learning

Page 30:

PLATIPUS: Probabilistic LATent model for Incorporating Priors and Uncertainty in few-Shot learning

Finn*, Xu*, Levine. Probabilistic Model-Agnostic Meta-Learning

Page 31:

Related concurrent work

Kim et al. “Bayesian Model-Agnostic Meta-Learning”: uses Stein variational gradient descent for sampling parameters

Gordon et al. “Decision-Theoretic Meta-Learning”: unifies a number of meta-learning algorithms under a variational inference framework

Page 32:

What do we want from our meta-learning algorithms? (Recap: expressive power, consistency, uncertainty awareness, no need to differentiate through learning.)

Page 33:

We can build meta-learning algorithms with many nice properties: expressive power, consistency, a first-order option (no need to differentiate through learning), and uncertainty awareness.

Page 34:

Collaborators: Sergey Levine, Pieter Abbeel, Tianhe Yu, Trevor Darrell, Tom Griffiths, Erin Grant, Tianhao Zhang, Josh Abbott, Josh Peterson, Annie Xie, Sudeep Dasari, Kelvin Xu

Questions?

Blog post, code, and papers: eecs.berkeley.edu/~cbfinn
