The Meta-Learning Problem & Black-Box Meta-Learning


CS 330


Logistics

Homework 1 to be posted today, due Wednesday, October 6

One more CA!

Optional Homework 0 due tonight.

Azure guide to be posted today

Plan for Today

Transfer Learning
- Problem formulation
- Fine-tuning

Meta-Learning
- Problem formulation
- General recipe of meta-learning algorithms
- Black-box adaptation approaches
- Case study of GPT-3 (time-permitting)

Goals for the end of lecture:
- Differences between multi-task learning, transfer learning, and meta-learning problems
- Basics of transfer learning via fine-tuning
- Training set-up for few-shot meta-learning algorithms
- How to implement black-box meta-learning techniques

(The last two are the topic of Homework 1!)

Multi-Task Learning vs. Transfer Learning


Multi-Task Learning

Solve multiple tasks 𝒯_1, ⋯, 𝒯_T at once:

min_θ ∑_{i=1}^{T} ℒ_i(θ, 𝒟_i)

Transfer Learning

Solve target task 𝒯_b after solving source task 𝒯_a, by transferring knowledge learned from 𝒯_a

Key assumption: cannot access data 𝒟_a during transfer.

Transfer learning is a valid solution to multi-task learning (but not vice versa).

Question: What are some problems/applications where transfer learning might make sense?

- when 𝒟_a is very large (don't want to retain & retrain on 𝒟_a)
- when you don't care about solving 𝒯_a and 𝒯_b simultaneously

Fine-tuning: start from parameters θ pre-trained on 𝒟_a, then take gradient steps on training data 𝒟^tr for the new task 𝒯_b (typically for many gradient steps):

φ ← θ − α ∇_θ ℒ(θ, 𝒟^tr)

Where do you get the pre-trained parameters?
- ImageNet classification
- Models trained on large language corpora (BERT, LMs)
- Other unsupervised learning techniques
- Whatever large, diverse dataset you might have


Some common practices (a code sketch of these follows below):
- Fine-tune with a smaller learning rate
- Smaller learning rate for earlier layers
- Freeze earlier layers, gradually unfreeze
- Reinitialize last layer
- Search over hyperparameters via cross-validation
- Architecture choices matter (e.g., ResNets)

Pre-trained models often available online.

What makes ImageNet good for transfer learning? Huh, Agrawal, Efros. '16
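As a rough sketch of how several of these practices combine in code, here is a minimal PyTorch example; the layer groupings, learning rates, and 10-class head are illustrative assumptions, not recommendations:

```python
import torch
import torchvision

# Parameters pre-trained on a large, diverse dataset (here: ImageNet).
model = torchvision.models.resnet18(weights="IMAGENET1K_V1")

# Reinitialize the last layer for the new task (assume 10 target classes).
model.fc = torch.nn.Linear(model.fc.in_features, 10)

# Freeze the earliest layers (could also gradually unfreeze during training).
for p in list(model.conv1.parameters()) + list(model.layer1.parameters()):
    p.requires_grad = False

# Smaller learning rates for earlier layers, largest for the new head.
optimizer = torch.optim.SGD([
    {"params": model.layer2.parameters(), "lr": 1e-4},
    {"params": model.layer3.parameters(), "lr": 1e-3},
    {"params": model.layer4.parameters(), "lr": 1e-3},
    {"params": model.fc.parameters(), "lr": 1e-2},
], momentum=0.9)

def finetune(loader, epochs=5):
    """Standard supervised fine-tuning, typically for many gradient steps."""
    loss_fn = torch.nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for x, y in loader:
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()
            optimizer.step()
```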

Transfer learning via fine-tuning


Universal Language Model Fine-Tuning for Text Classification. Howard, Ruder. '18

Fine-tuning doesn’t work well with very small target task datasets

This is where meta-learning can help.

Plan for Today

Transfer Learning
- Problem formulation
- Fine-tuning

Meta-Learning
- Problem formulation
- General recipe of meta-learning algorithms
- Black-box adaptation approaches
- Case study of GPT-3 (time-permitting)

The Meta-Learning Problem Statement (that we will consider in this class)

Two ways to view meta-learning algorithms

Mechanistic view
➢ Deep network that can read in an entire dataset and make predictions for new datapoints
➢ Training this network uses a meta-dataset, which itself consists of many datasets, each for a different task

Probabilistic view
➢ Extract prior knowledge from a set of tasks that allows efficient learning of new tasks
➢ Learning a new task uses this prior and a (small) training set to infer the most likely posterior parameters

For now: focus primarily on the mechanistic view.

(Bayes will come back later)

How does meta-learning work? An example.

Given 1 example of each of 5 classes (the training data): classify new examples (the test set).

[Figure: meta-training builds many such few-shot tasks from the training classes, each with its own training data and test set; meta-testing evaluates on a held-out task 𝒯_test.]

Can replace image classification with: regression, language generation, skill learning, any ML problem.

The Meta-Learning Problem

Given data from tasks 𝒯_1, …, 𝒯_n, quickly solve new task 𝒯_test

Key assumption: meta-training tasks and the meta-test task are drawn i.i.d. from the same task distribution, i.e. 𝒯_1, …, 𝒯_n ∼ p(𝒯), 𝒯_j ∼ p(𝒯)

What do the tasks correspond to?
- recognizing handwritten digits from different languages (see Homework 1!)
- giving feedback to students on different exams
- classifying species in different regions of the world
- a robot performing different tasks

How many tasks do you need? The more the better (analogous to more data in ML).

Like before, tasks must share structure.

Some terminology

𝒟_i^tr: task training set (the "support set")
𝒟_i^test: task test dataset (the "query set")

k-shot learning: learning with k examples per class (or k examples total, for regression). N-way classification: choosing between N classes.

Question: What are k and N for the above example? (Here: k = 1, N = 5.)
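To make the set-up concrete, here is a minimal sketch of sampling a single N-way, k-shot task; the data format (a dict mapping each class to a list of examples) and the helper name sample_episode are assumptions for illustration:

```python
import random

def sample_episode(data_by_class, n_way=5, k_shot=1, query_size=5):
    """Sample one N-way, k-shot task: a support set and a disjoint query set.

    Assumes each class has at least k_shot + query_size examples.
    """
    classes = random.sample(list(data_by_class), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        examples = random.sample(data_by_class[cls], k_shot + query_size)
        support += [(x, label) for x in examples[:k_shot]]  # D_i^tr
        query += [(x, label) for x in examples[k_shot:]]    # D_i^test
    return support, query
```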

Problem Settings Recap

Multi-Task Learning

Solve multiple tasks 𝒯_1, ⋯, 𝒯_T at once:

min_θ ∑_{i=1}^{T} ℒ_i(θ, 𝒟_i)

Transfer Learning

Solve target task 𝒯_b after solving source task 𝒯_a, by transferring knowledge learned from 𝒯_a

The Meta-Learning Problem

Given data from tasks 𝒯_1, …, 𝒯_n, quickly solve new task 𝒯_test

In all settings: tasks must share structure.

In transfer learning and meta-learning: generally impractical to access prior tasks


Plan for Today

Transfer Learning
- Problem formulation
- Fine-tuning

Meta-Learning
- Problem formulation
- General recipe of meta-learning algorithms
- Black-box adaptation approaches
- Case study of GPT-3 (time-permitting)

One View on the Meta-Learning Problem

Supervised Learning:
Inputs: x. Outputs: y. Data: 𝒟 = {(x, y)_j}

Meta Supervised Learning:
Inputs: 𝒟^tr, x^ts. Outputs: y^ts. Data: {𝒟_i}, where 𝒟_i = {(x, y)_j}

Why is this view useful? Reduces the meta-learning problem to the design & optimization of f.

Finn. Learning to Learn with Gradients. PhD Thesis. 2018


General recipe: how to design a meta-learning algorithm

1. Choose a form of p(φ_i | 𝒟_i^tr, θ)
2. Choose how to optimize θ w.r.t. the max-likelihood objective using the meta-training data

(θ: the meta-parameters)

Plan for Today

Transfer Learning
- Problem formulation
- Fine-tuning

Meta-Learning
- Problem formulation
- General recipe of meta-learning algorithms
- Black-box adaptation approaches
- Case study of GPT-3 (time-permitting)

Black-Box Adaptation

Key idea: Train a neural network f_θ to represent φ_i = f_θ(𝒟_i^tr).

[Figure: 𝒟_i^tr → f_θ → φ_i, the "learner"; then g_{φ_i} maps test inputs x^ts to predictions y^ts, evaluated with ℒ(φ_i, 𝒟_i^test).]

Predict test points with y^ts = g_{φ_i}(x^ts).

Train with standard supervised learning!

max_θ ∑_{𝒯_i} ℒ(f_θ(𝒟_i^tr), 𝒟_i^test) = max_θ ∑_{𝒯_i} ∑_{(x,y)∼𝒟_i^test} log g_{φ_i}(y | x), where φ_i = f_θ(𝒟_i^tr)

Black-Box Adaptation: meta-training

Key idea: Train a neural network to represent φ_i = f_θ(𝒟_i^tr).

1. Sample task 𝒯_i (or a mini-batch of tasks)
2. Sample disjoint datasets 𝒟_i^tr, 𝒟_i^test from 𝒟_i
3. Compute φ_i ← f_θ(𝒟_i^tr)
4. Update θ using ∇_θ ℒ(φ_i, 𝒟_i^test)
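A minimal sketch of this four-step loop, reusing the hypothetical sample_episode and BlackBoxLearner pieces from earlier (hyperparameters are illustrative):

```python
import torch
import torch.nn.functional as F

def meta_train(model, data_by_class, n_way=5, k_shot=1, steps=10000, lr=1e-3):
    """Meta-train a black-box learner; assumes examples are feature tensors."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        # Steps 1-2: sample a task and disjoint support/query sets from it.
        support, query = sample_episode(data_by_class, n_way, k_shot)
        sx = torch.stack([x for x, _ in support])
        sy = F.one_hot(torch.tensor([y for _, y in support]), n_way).float()
        qx = torch.stack([x for x, _ in query])
        qy = torch.tensor([y for _, y in query])
        # Step 3: phi_i = f_theta(D_i^tr), then predict on the query set.
        logits = model(sx, sy, qx)
        # Step 4: update theta with the gradient of L(phi_i, D_i^test),
        # i.e. the negative log-likelihood -sum log g_phi(y | x).
        loss = F.cross_entropy(logits, qy)
        opt.zero_grad()
        loss.backward()
        opt.step()
```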

Black-Box Adaptation

Challenge: Outputting all neural net parameters φ_i does not seem scalable.

Idea: There is no need to output all parameters of the neural net, only sufficient statistics: a low-dimensional vector h_i that represents contextual task information, so that φ_i = {h_i, θ_g}. (Santoro et al. MANN, Mishra et al. SNAIL)

General form: y^ts = f_θ(𝒟_i^tr, x^ts)

(Recall the key idea: train a neural network to represent φ_i = f_θ(𝒟_i^tr).)

What architecture should we use for f_θ?

Black-Box Adaptation: Architectures

- LSTMs or a Neural Turing Machine (NTM): Meta-Learning with Memory-Augmented Neural Networks. Santoro, Bartunov, Botvinick, Wierstra, Lillicrap. ICML '16
- Feedforward + average: Conditional Neural Processes. Garnelo, Rosenbaum, Maddison, Ramalho, Saxton, Shanahan, Teh, Rezende, Eslami. ICML '18
- Other external memory mechanisms: Meta Networks. Munkhdalai, Yu. ICML '17
- Convolutions & attention: A Simple Neural Attentive Meta-Learner. Mishra, Rohaninejad, Chen, Abbeel. ICLR '18

Question: Why might feedforward+average be better than a recurrent model?
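One answer: averaging is permutation-invariant, which matches the fact that the support set is a set rather than a sequence. A minimal sketch in the spirit of the feedforward + average approach, with hypothetical dimensions:

```python
import torch
import torch.nn as nn

class AveragingEncoder(nn.Module):
    """Encode each support (x, y) pair independently, then average.

    The mean is permutation-invariant: the task vector h_i does not depend on
    the (arbitrary) ordering of support examples, unlike a recurrent encoder.
    """

    def __init__(self, x_dim, n_way, hidden=128):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Linear(x_dim + n_way, hidden), nn.ReLU(), nn.Linear(hidden, hidden)
        )

    def forward(self, support_x, support_y):
        pairs = torch.cat([support_x, support_y], dim=-1)  # (k*N, x_dim + N)
        return self.enc(pairs).mean(dim=0)                 # h_i: task vector
```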

HW 1:
- implement data processing
- implement a simple black-box meta-learner
- train a few-shot Omniglot classifier

Let's run through an example

Omniglot dataset (Lake et al., Science 2015): the "transpose" of MNIST

Many classes, few examples: 1623 characters from 50 different alphabets, with 20 instances of each character; statistics more reflective of the real world.

More few-shot image recognition datasets: tieredImageNet, CIFAR, CUB, CelebA, ORBIT, others

More benchmarks: molecular property prediction (Nguyen et al. '20), object pose prediction (Yin et al. ICLR '20), channel coding (Li et al. '21)

*whiteboard*

Black-Box Adaptation

Key idea: Train a neural network to represent φ_i = f_θ(𝒟_i^tr).

+ expressive
+ easy to combine with a variety of learning problems (e.g., SL, RL)

- complex model with a complex task: challenging optimization problem
- often data-inefficient

How else can we represent φ_i = f_θ(𝒟_i^tr)?

Next time (Wednesday): What if we treat it as an optimization procedure?

Plan for Today

Transfer Learning
- Problem formulation
- Fine-tuning

Meta-Learning
- Problem formulation
- General recipe of meta-learning algorithms
- Black-box adaptation approaches
- Case study of GPT-3 (time-permitting)

Case Study: GPT-3

May 2020

What is GPT-3?

a language model: a black-box meta-learner trained on language generation tasks

architecture: a giant "Transformer" network with 175 billion parameters, 96 layers, and a 3.2M batch size

𝒟_i^tr: a sequence of characters; 𝒟_i^ts: the following sequence of characters

meta-training dataset: crawled data from the internet, English-language Wikipedia, and two books corpora

What do different tasks correspond to?
- spelling correction
- simple math problems
- translating between languages
- a variety of other tasks

How can those tasks all be solved by a single architecture? Put them all in the form of text! Spelling correction, simple math problems, and translation between languages all become text-in, text-out.

Why is that a good idea? Very easy to get a lot of meta-training data.

Some Results

One-shot learning from dictionary definitions:

Few-shot language editing:

Non-few-shot learning tasks:

General Notes & Takeaways

The results are extremely impressive.

The model fails in unintuitive ways.

Source: https://lacker.io/ai/2020/07/06/giving-gpt-3-a-turing-test.html

Source: https://github.com/shreyashankar/gpt3-sandbox/blob/master/docs/priming.md

The choice of 𝒟_i^tr at test time is important ("priming").
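Mechanically, priming amounts to choosing which support examples are concatenated into the prompt ahead of the query; the model's continuation then plays the role of the prediction. A toy sketch (the arrow formatting and examples are illustrative, not GPT-3's actual interface):

```python
def build_prompt(support_pairs, query):
    """Concatenate support examples (the 'priming' D_i^tr) ahead of the query."""
    lines = [f"{x} -> {y}" for x, y in support_pairs]
    lines.append(f"{query} ->")  # the model's continuation plays the role of y^ts
    return "\n".join(lines)

# Few-shot spelling correction, formatted as plain text:
prompt = build_prompt([("teh", "the"), ("recieve", "receive")], "definately")
```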

The model is far from perfect.

Plan for Today

Transfer Learning
- Problem formulation
- Fine-tuning

Meta-Learning
- Problem formulation
- General recipe of meta-learning algorithms
- Black-box adaptation approaches
- Case study of GPT-3 (time-permitting)

Goals for the end of lecture:
- Differences between multi-task learning, transfer learning, and meta-learning problems
- Basics of transfer learning via fine-tuning
- Training set-up for few-shot meta-learning algorithms
- How to implement black-box meta-learning techniques

(The last two are the topic of Homework 1!)

Reminders

Next time: Optimization-based meta-learning

Homework 1 to be posted today, due Wednesday, October 6

Optional Homework 0 due tonight.

Azure guide to be posted today
