Properties of Good Meta-Learning Algorithms (And How to Achieve Them)
Chelsea Finn, UC Berkeley & Google Brain → Stanford
Why Learn to Learn?
- effectively reuse data on other tasks
- replace manual engineering of architecture, hyperparameters, etc.
- learn to quickly adapt to unexpected scenarios (inevitable failures, long tail)
- learn how to learn with weak supervision
Problem Domains:
- few-shot classification & generation
- hyperparameter optimization
- architecture search
- faster reinforcement learning
- domain generalization
- learning structure
- …

Approaches:
- recurrent networks
- learning optimizers or update rules
- learning initial parameters & architecture
- acquiring metric spaces
- Bayesian models
- …
What is the meta-learning problem statement?
The Meta-Learning Problem
Supervised Learning:
  Data: {(x_i, y_i)}    Inputs: x    Outputs: y    Learn f(x) → y
Meta-Supervised Learning:
  Data: {D_i}    Inputs: D_train, x    Outputs: y    Learn f(D_train, x) → y
Why is this view useful? Reduces the problem to the design & optimization of f.
Finn, Levine. Meta-learning and Universality: Deep Representation… ICLR 2018
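Concretely, one hedged way to write this reduction down (notation assumed here, not taken from the slides): meta-supervised learning optimizes a single function f across tasks.

```latex
% Meta-supervised learning as optimization of one function f that maps a
% task's training set plus a new input to a prediction (notation assumed).
\min_{f} \sum_{i} \sum_{(x, y) \in \mathcal{D}^{\text{test}}_i}
    \mathcal{L}\big( f(\mathcal{D}^{\text{train}}_i, x),\, y \big)
```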
Example: Few-Shot Classification
Given 1 example of each of 5 classes, classify new examples.
[Diagram: meta-training episodes drawn from training classes, each with its own training data and test set; adapted from Ravi & Larochelle '17]
What do we want from our meta-learning algorithms?
- Expressive power: the ability for f to represent a range of learning procedures
- Consistency: the learned learning procedure will solve the task with enough data; result: reasonable performance on out-of-distribution tasks
- Uncertainty awareness: the ability to reason about ambiguity during learning
- No need to differentiate through learning: better scalability to long learning processes
Design of f?
- Recurrent network (LSTM, NTM, Conv): Santoro et al. '16, Duan et al. '17, Wang et al. '17, Munkhdalai & Yu '17, Mishra et al. '17, …
- Learned optimizer (often uses recurrence): Schmidhuber et al. '87, Bengio et al. '90, Hochreiter et al. '01, Li & Malik '16, Andrychowicz et al. '16, Ha et al. '17, Ravi & Larochelle '17, …
These are black box approaches: expressive power ✓, but consistency is not guaranteed.
Can we incorporate structure into the learning procedure?
Nearest neighbors (in a learned metric space): Koch '15, Vinyals et al. '16, Snell et al. '17, Reed et al. '17, Li et al. '17, …
[Diagram: Matching Networks, Vinyals et al. '16]
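As a rough illustration of this family of approaches (a minimal sketch, not any paper's exact model; `encoder` is a hypothetical embedding network):

```python
# One-shot classification by nearest neighbors in a learned metric space,
# in the spirit of Koch '15 / Vinyals et al. '16. `encoder` is assumed.
import torch
import torch.nn.functional as F

def nn_classify(encoder, support_x, support_y, query_x):
    """Label each query with the class of its nearest support embedding."""
    support_emb = F.normalize(encoder(support_x), dim=-1)  # [N, D]
    query_emb = F.normalize(encoder(query_x), dim=-1)      # [M, D]
    sims = query_emb @ support_emb.t()                     # cosine similarity [M, N]
    return support_y[sims.argmax(dim=-1)]                  # predicted labels [M]
```

During meta-training, the encoder itself is trained end-to-end so that this nearest-neighbor rule classifies held-out queries correctly.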
Approaches that incorporate learning structure gain consistency. Can we get both consistency and expressive power?
Gradient descent (from a learned initialization): Finn et al. '17, Grant et al. '17, Reed et al. '17, Li et al. '17, …
Key idea: train over many tasks to learn a parameter vector θ that transfers via fine-tuning.
Fine-tuning [test-time]: starting from pretrained parameters θ, take a gradient step on the training data for the test task: φ = θ - α ∇_θ L(θ, D^tr).
Model-Agnostic Meta-Learning (MAML): meta-learn θ so that this fine-tuning succeeds across tasks: min_θ Σ_i L(θ - α ∇_θ L(θ, D^tr_i), D^test_i).
Finn, Abbeel, Levine. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. ICML 2017
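A minimal sketch of the resulting inner/outer loop (assuming a hypothetical `model_loss(params, batch)`; this is not the authors' released implementation):

```python
# MAML: adapt to each task with one inner gradient step, then meta-optimize
# the initialization theta by differentiating through that adaptation.
import torch

def maml_step(params, tasks, inner_lr=0.01, outer_lr=0.001):
    meta_loss = 0.0
    for train_batch, test_batch in tasks:
        train_loss = model_loss(params, train_batch)              # inner objective
        grads = torch.autograd.grad(train_loss, params, create_graph=True)
        adapted = [p - inner_lr * g for p, g in zip(params, grads)]
        meta_loss = meta_loss + model_loss(adapted, test_batch)   # outer objective
    meta_grads = torch.autograd.grad(meta_loss, params)           # second-order
    return [(p - outer_lr * g).detach().requires_grad_()
            for p, g in zip(params, meta_grads)]
```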
Approaches that incorporate learning structure:
- Nearest neighbors (in a learned metric space): Koch '15, Vinyals et al. '16, Snell et al. '17, Reed et al. '17, Li et al. '17, …
- Gradient descent (from a learned initialization): Finn et al. '17, Grant et al. '17, Reed et al. '17, Li et al. '17, …
Learning procedure: consistency ✓, expressive power ?
Does consistency come at a cost?
Universal Function Approximation Theorem (Hornik et al. '89, Cybenko '89, Funahashi '89): a neural network with a single hidden layer of sufficient finite width can approximate any continuous function (on a compact domain) to arbitrary precision.
How can we define a notion of expressive power for meta-learning? By analogy: a standard network is a "universal function approximator"; a black box meta-learner (recurrent network, learned optimizer) is a "universal learning procedure approximator," since it is itself just a deep network applied to the training data and a test input.
Is the same true of MAML? Result: for a sufficiently deep f, the MAML function can approximate any function of the training dataset D^tr and test input x. (Finn & Levine, ICLR 2018)
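In symbols (a hedged restatement; notation assumed to match the MAML update above):

```latex
% The MAML predictor: f evaluated at parameters adapted by one gradient step.
% Universality: for deep enough f, suitable theta and alpha make this
% approximate any desired map from (D^tr, x) to predictions.
f_{\mathrm{MAML}}(\mathcal{D}^{\mathrm{tr}}, x)
  = f\big(x;\; \theta - \alpha \nabla_{\theta} \mathcal{L}(\theta, \mathcal{D}^{\mathrm{tr}})\big)
```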
Why is this interesting? MAML has the benefit of consistency without losing expressive power.
[Diagram: recurrent network vs. MAML]
Assumptions: nonzero learning rate α; the loss function gradient does not lose information about the label; datapoints in D^tr are unique.
Finn, Levine. Meta-learning and Universality: Deep Representation… ICLR 2018
Black box approaches: expressive power ✓, consistency ✗
MAML: expressive power ✓, consistency ✓
Empirically, what does consistency get you?
How well can methods generalize to similar, but extrapolated tasks? The world is non-stationary.
[Plot: Omniglot image classification, performance vs. task variability, comparing MAML, TCML, and MetaNetworks]
[Plot: Sinusoid curve regression, error vs. task variability, comparing MAML and TCML]
Takeaway: strategies learned with MAML consistently generalize better to out-of-distribution tasks.
Finn, Levine. Meta-learning and Universality: Deep Representation… ICLR 2018
Recap: what do we want from our meta-learning algorithms? So far: expressive power and consistency. Next: no need to differentiate through learning, for better scalability to long learning processes.
Model-Agnostic Meta-Learning (MAML): θ ← θ - β ∇_θ Σ_i L(θ - α ∇_θ L(θ, D^tr_i), D^test_i)
What if we stop the gradient through the inner-gradient term ∇_θ L(θ, D^tr_i), i.e., never differentiate through learning? Empirically, few-shot performance drops only slightly. Note: in some tasks, we have observed a more substantial drop.
Finn, Abbeel, Levine. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. ICML 2017
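A sketch of the stop-gradient (first-order) variant, under the same assumed `model_loss` as before: detaching the inner gradient means the outer update never differentiates through the learning process.

```python
# First-order MAML: the inner gradient is treated as a constant, so no
# second derivatives are needed (create_graph is left False).
import torch

def fomaml_step(params, tasks, inner_lr=0.01, outer_lr=0.001):
    meta_loss = 0.0
    for train_batch, test_batch in tasks:
        train_loss = model_loss(params, train_batch)
        grads = torch.autograd.grad(train_loss, params)  # gradient stopped here
        adapted = [p - inner_lr * g.detach() for p, g in zip(params, grads)]
        meta_loss = meta_loss + model_loss(adapted, test_batch)
    meta_grads = torch.autograd.grad(meta_loss, params)
    return [(p - outer_lr * g).detach().requires_grad_()
            for p, g in zip(params, meta_grads)]
```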
Recap: what do we want from our meta-learning algorithms? Expressive power, consistency, no need to differentiate through learning, and, finally, uncertainty awareness: the ability to reason about ambiguity during learning.
No uncertainty awareness in original MAML
A Probabilistic Interpretation of MAML (with Erin Grant)
That's nice, but is it… Bayesian? Kind of… but it can be made more Bayesian.
MAML's adaptation corresponds to MAP inference in a hierarchical Bayesian model, via the connection between early stopping of gradient descent and MAP estimation (Santos, 1996).
Can we do better than MAP? We can use a Laplace estimate, approximating each task's posterior with a Gaussian around the MAP solution and estimating the Hessian for neural nets with K-FAC.
Grant, Finn, Levine, Darrell, Griffiths. Recasting gradient-based meta-learning as hierarchical Bayes. ICLR '18
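For reference, a hedged sketch of the Laplace idea in symbols (the standard Laplace approximation, not necessarily the paper's exact objective; φ*_i denotes task i's adapted MAP parameters, H_i the Hessian of the negative log-posterior there):

```latex
% Laplace approximation to the task marginal likelihood around the MAP
% solution; for neural nets the Hessian H_i is estimated with K-FAC.
\log p(\mathcal{D}_i \mid \theta)
  \approx \log p(\mathcal{D}_i \mid \phi^{*}_i)
        + \log p(\phi^{*}_i \mid \theta)
        - \tfrac{1}{2} \log \det H_i + \text{const}
```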
Modeling ambiguity
Can we sample classifiers?
Intuition: we want to learn a prior where a random kick can put us in different modes
[Example: a few-shot task whose training images are consistent with multiple concepts, e.g. "smiling, hat" vs. "smiling, young"]
Meta-learning with ambiguity: sampling parameter vectors.
Approximate the inference with MAP: this is extremely crude, but extremely convenient!
Finn*, Xu*, Levine. Probabilistic Model-Agnostic Meta-Learning
What does ancestral sampling look like?
[Examples: classifiers sampled for the ambiguous "smiling, hat" vs. "smiling, young" task]
Finn*, Xu*, Levine. Probabilistic Model-Agnostic Meta-Learning
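A minimal sketch of this test-time procedure (assuming the same hypothetical `model_loss`; `mu` and `log_sigma` stand in for a learned Gaussian over initializations, not the paper's exact parameterization):

```python
# Ancestral sampling: draw an initialization from the learned prior, then
# adapt it to the task with a few gradient steps (the crude MAP approximation).
import torch

def sample_classifier(mu, log_sigma, train_batch, inner_lr=0.01, steps=5):
    theta = [(m + ls.exp() * torch.randn_like(m)).detach().requires_grad_()
             for m, ls in zip(mu, log_sigma)]          # sample from the prior
    for _ in range(steps):
        loss = model_loss(theta, train_batch)
        grads = torch.autograd.grad(loss, theta)
        theta = [t - inner_lr * g for t, g in zip(theta, grads)]
    return theta  # one sampled classifier; repeat for diverse hypotheses
```

Calling this repeatedly on an ambiguous task yields different classifiers, e.g. one keying on "hat" and another on "young".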
Sampling parameter vectors: PLATIPUS (Probabilistic LATent model for Incorporating Priors and Uncertainty in few-Shot learning)
[Results: ambiguous regression; ambiguous classification]
Finn*, Xu*, Levine. Probabilistic Model-Agnostic Meta-Learning
Related concurrent work:
- Kim et al., "Bayesian Model-Agnostic Meta-Learning": uses Stein variational gradient descent for sampling parameters.
- Gordon et al., "Decision-Theoretic Meta-Learning": unifies a number of meta-learning algorithms under a variational inference framework.
Recap: what do we want from our meta-learning algorithms? Expressive power, consistency, uncertainty awareness, and no need to differentiate through learning.
We can build meta-learning algorithms with many nice properties (expressive power, consistency, first-order, uncertainty awareness)
Collaborators: Sergey Levine, Pieter Abbeel, Tianhe Yu, Trevor Darrell, Tom Griffiths, Erin Grant, Tianhao Zhang, Josh Abbott, Josh Peterson, Annie Xie, Sudeep Dasari, Kelvin Xu

Questions?

Blog post, code, and papers: eecs.berkeley.edu/~cbfinn