
Chelsea Finn

The Big Problem with Meta-Learning and How Bayesians Can Fix It

Stanford

By Braque or Cezanne?

(figure: a few paintings labeled Cezanne or Braque as training data; one unlabeled test datapoint)

How did you accomplish this?

Through previous experience.

How might you get a machine to accomplish this task?

Fine-tuning from ImageNet features

SIFT features, HOG features + SVM

Modeling image formation

Geometry

???

Can we explicitly learn priors from previous experience that lead to efficient downstream learning?

Domain adaptation from other painters

Fewer human priors, more data-driven priors

Greater success.

Can we learn to learn?

Outline

1. Brief overview of meta-learning

2. The problem: peculiar, lesser-known, yet ubiquitous

3. Steps towards a solution

How does meta-learning work? An example.

Given 1 example of each of 5 classes: classify new examples (training data → test set).

Meta-training: training classes T_1, …, T_n
Meta-testing: T_test

How does meta-learning work?

One approach: parameterize learner by neural network


(Hochreiter et al. ’91, Santoro et al. ’16, many others)

y^ts = f(𝒟^tr, x^ts; θ)
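A minimal sketch of this black-box view (not from the slides; the architecture and names are illustrative, in the spirit of Santoro et al. '16): the learner is a single network that conditions on the support set 𝒟^tr and the query x^ts.

import torch
import torch.nn as nn

class BlackBoxMetaLearner(nn.Module):
    """y^ts = f(D^tr, x^ts; theta): one network that reads the whole support set.
    Hypothetical architecture; real systems use RNNs/attention."""
    def __init__(self, x_dim, n_classes, hidden=128):
        super().__init__()
        # Encode each (x, y) support pair, then pool into a task embedding.
        self.pair_enc = nn.Sequential(nn.Linear(x_dim + n_classes, hidden), nn.ReLU())
        self.predictor = nn.Sequential(
            nn.Linear(hidden + x_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_classes)
        )

    def forward(self, x_tr, y_tr_onehot, x_ts):
        # x_tr: (k, x_dim), y_tr_onehot: (k, n_classes), x_ts: (m, x_dim)
        task_emb = self.pair_enc(torch.cat([x_tr, y_tr_onehot], dim=-1)).mean(dim=0)
        task_emb = task_emb.expand(x_ts.shape[0], -1)
        return self.predictor(torch.cat([task_emb, x_ts], dim=-1))  # logits for y^ts

The point of the black-box view: adaptation is just a forward pass, so any "learning" from 𝒟^tr has to happen inside f.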

How does meta-learning work?

Another approach: embed optimization inside the learning process


(Maclaurin et al. ’15, Finn et al. ’17, many others)

y^ts = f(𝒟^tr, x^ts; θ), where f internally takes a gradient step ∇_θ ℒ on 𝒟^tr
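A hedged sketch of the optimization-based variant, in the spirit of MAML (Finn et al. '17); the single inner step, inner learning rate, and the use of torch.func (PyTorch ≥ 2.0) are illustrative assumptions.

import torch
import torch.nn.functional as F

def maml_style_predict(model, x_tr, y_tr, x_ts, inner_lr=0.01):
    """y^ts = f(D^tr, x^ts; theta): f embeds one inner gradient step on D^tr."""
    params = {n: p for n, p in model.named_parameters()}
    # Inner loop: adapt theta on the support set.
    inner_loss = F.cross_entropy(torch.func.functional_call(model, params, (x_tr,)), y_tr)
    grads = torch.autograd.grad(inner_loss, list(params.values()), create_graph=True)
    adapted = {n: p - inner_lr * g for (n, p), g in zip(params.items(), grads)}
    # Predict on the query with the adapted parameters.
    return torch.func.functional_call(model, adapted, (x_ts,))  # logits for y^ts

Because the inner step is built with create_graph=True, θ can be meta-trained by backpropagating through the adaptation itself.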

The Bayesian perspective

(Grant et al. ’18, Gordon et al. ’18, many others)

meta-learning ↔ learning priors from data: p(ϕ | θ)
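Concretely, the hierarchical-Bayes reading (a generic sketch; the notation is not taken from any one of the cited papers): θ acts as a learned prior over task-specific parameters ϕ.

% Hierarchical Bayes view: theta is the learned prior, phi_i are per-task parameters.
\begin{align*}
  p(\mathcal{D}_{1:N} \mid \theta) &= \prod_{i=1}^{N} \int p(\mathcal{D}_i \mid \phi_i)\, p(\phi_i \mid \theta)\, d\phi_i
    && \text{(meta-training: fit } \theta\text{)} \\
  p(\phi \mid \mathcal{D}^{tr}, \theta) &\propto p(\mathcal{D}^{tr} \mid \phi)\, p(\phi \mid \theta)
    && \text{(adaptation: posterior over } \phi\text{)} \\
  p(y^{ts} \mid x^{ts}, \mathcal{D}^{tr}, \theta) &= \int p(y^{ts} \mid x^{ts}, \phi)\, p(\phi \mid \mathcal{D}^{tr}, \theta)\, d\phi
    && \text{(prediction)}
\end{align*}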

Outline

1. Brief overview of meta-learning

2. The problem: peculiar, lesser-known, yet ubiquitous

3. First steps towards a solution

How we construct tasks for meta-learning.


Randomly assign class labels to image classes for each task

Algorithms must use training data to infer label ordering.


—> Tasks are mutually exclusive.
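A sketch of this construction in plain NumPy (hypothetical helper; the data layout is assumed): with shuffle_labels=True each task gets its own random label permutation, making tasks mutually exclusive; with shuffle_labels=False the label order is consistent, which is the case discussed next.

import numpy as np

def make_classification_task(images_by_class, n_way=5, k_shot=1, shuffle_labels=True, rng=None):
    """Build one few-shot task. images_by_class: dict {class_name: array of images}."""
    rng = rng or np.random.default_rng()
    classes = rng.choice(list(images_by_class), size=n_way, replace=False)
    # Mutually exclusive tasks: fresh random label permutation per task.
    # Non-mutually exclusive tasks: labels fixed by class identity.
    labels = rng.permutation(n_way) if shuffle_labels else np.arange(n_way)
    x_tr, y_tr = [], []
    for cls, lab in zip(classes, labels):
        idx = rng.choice(len(images_by_class[cls]), size=k_shot, replace=False)
        x_tr.append(images_by_class[cls][idx])
        y_tr += [lab] * k_shot
    return np.concatenate(x_tr), np.array(y_tr), classes, labels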

What if label order is consistent?

The network can simply learn to classify inputs, irrespective of 𝒟tr


Tasks are non-mutually exclusive: a single function can solve all tasks.


T_test (new image classes): can't make predictions without 𝒟^tr.

Is this a problem?

- No: for image classification, we can just shuffle labels*
- No, if we see the same image classes as in training (& don't need to adapt at meta-test time)
- But yes, if we want to be able to adapt with data for new tasks.

Another example

If you tell the robot the task goal, the robot can ignore the trials.

(figure: meta-training tasks T_1 … T_50 — "close drawer", "hammer", "stack", … — and meta-test task T_test: "close box")

T Yu, D Quillen, Z He, R Julian, K Hausman, C Finn, S Levine. Meta-World. CoRL ‘19

Another example

Model can memorize the canonical orientations of the training objects.

Yin, Tucker, Yuan, Levine, Finn. Meta-Learning without Memorization. ‘19

Can we do something about it?

If tasks are mutually exclusive (e.g. due to label shuffling, hiding information): a single function cannot solve all tasks.

If tasks are non-mutually exclusive: a single function can solve all tasks → multiple solutions to the meta-learning problem.

y^ts = f_θ(𝒟^tr_i, x^ts)

One solution: memorize the canonical pose info in θ & ignore 𝒟^tr_i.
Another solution: carry no info about canonical pose in θ; acquire it from 𝒟^tr_i.

Suggests a potential approach: control information flow.
An entire spectrum of solutions based on how information flows.

Yin, Tucker, Yuan, Levine, Finn. Meta-Learning without Memorization. ‘19


Meta-regularization

minimize: meta-training loss + information in θ
ℒ(θ, 𝒟_meta-train) + β · D_KL( q(θ; θ_μ, θ_σ) ∥ p(θ) )

Places precedence on using information from 𝒟^tr over storing info in θ. Can combine with your favorite meta-learning algorithm.
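A minimal sketch of the weight-space version of this objective, assuming a diagonal-Gaussian q(θ; θ_μ, θ_σ) and prior p(θ) = 𝒩(0, I); the β value and function names are illustrative, not taken from the paper.

import torch

def meta_regularized_loss(meta_train_loss, theta_mu, theta_log_sigma, beta=1e-4):
    """loss = L(theta, D_meta-train) + beta * KL( q(theta; mu, sigma) || N(0, I) )."""
    var = torch.exp(2 * theta_log_sigma)
    # Closed-form KL between a diagonal Gaussian and a standard normal.
    kl = 0.5 * torch.sum(theta_mu**2 + var - 2 * theta_log_sigma - 1)
    return meta_train_loss + beta * kl

def sample_theta(theta_mu, theta_log_sigma):
    # Reparameterized sample of theta to feed your favorite meta-learner.
    return theta_mu + torch.exp(theta_log_sigma) * torch.randn_like(theta_mu)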

Yin, Tucker, Yuan, Levine, Finn. Meta-Learning without Memorization. ‘19

one option: max I(y^ts; 𝒟^tr | x^ts)

Yin, Tucker, Yuan, Levine, Finn. Meta-Learning without Memorization. ‘19

(and it’s not just as simple as standard regularization)

On the pose prediction task:

Omniglot without label shuffling: “non-mutually-exclusive” Omniglot

TAML: Jamal & Qi. Task-Agnostic Meta-Learning for Few-Shot Learning. CVPR ‘19

Yin, Tucker, Yuan, Levine, Finn. Meta-Learning without Memorization. ‘19

Does meta-regularization lead to better generalization?

Let P(θ) be an arbitrary distribution over θ that doesn't depend on the meta-training data (e.g. P(θ) = 𝒩(θ; 0, I)).

For MAML, with probability at least 1 − δ, for all θ_μ, θ_σ:
generalization error ≤ error on the meta-training set + a meta-regularization (KL) term.

With a Taylor expansion of the RHS + a particular value of β —> recover the MR-MAML objective.

Proof: draws heavily on Amit & Meir ‘18
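For reference, the generic McAllester-style PAC-Bayes shape this result instantiates (the exact constants and form in Yin et al. '19 differ), with n meta-training tasks:

% Generic PAC-Bayes bound shape; not the exact statement from the paper.
\text{w.p. } \ge 1-\delta,\ \forall\, \theta_\mu, \theta_\sigma:\qquad
\underbrace{\mathrm{er}(q)}_{\text{generalization error}}
\;\le\;
\underbrace{\widehat{\mathrm{er}}(q)}_{\text{meta-training error}}
\;+\;
\underbrace{\sqrt{\frac{D_{\mathrm{KL}}\big(q(\theta;\theta_\mu,\theta_\sigma)\,\|\,P(\theta)\big) + \log\frac{n}{\delta}}{2(n-1)}}}_{\text{meta-regularization}}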

Yin, Tucker, Yuan, Levine, Finn. Meta-Learning without Memorization. ‘19
T Yu, D Quillen, Z He, R Julian, K Hausman, C Finn, S Levine. Meta-World. CoRL ‘19

Want to Learn More? CS330: Deep Multi-Task & Meta-Learning. Lecture videos coming out soon!

Collaborators

Working on Meta-RL?

Try out the Meta-World benchmark
