CS 330 The Meta-Learning Problem & Black-Box Meta-Learning


Page 1: The Meta-Learning Problem & Black-Box Meta-Learning

CS 330

The Meta-Learning Problem & Black-Box Meta-Learning

Page 2: The Meta-Learning Problem & Black-Box Meta-Learning

Logistics

Homework 1 to be posted today, due Wednesday, October 6

One more CA!

Optional Homework 0 due tonight.

Azure guide to be posted today

Page 3: The Meta-Learning Problem & Black-Box Meta-Learning

Plan for Today

Transfer Learning
- Problem formulation
- Fine-tuning

Meta-Learning
- Problem formulation
- General recipe of meta-learning algorithms
- Black-box adaptation approaches
- Case study of GPT-3 (time-permitting)

Goals for the end of lecture:
- Differences between multi-task learning, transfer learning, and meta-learning problems
- Basics of transfer learning via fine-tuning
- Training set-up for few-shot meta-learning algorithms (topic of Homework 1!)
- How to implement black-box meta-learning techniques (topic of Homework 1!)

Page 4: The Meta-Learning Problem & Black-Box Meta-Learning

Multi-Task Learning vs. Transfer Learning

min_θ ∑_{i=1}^{T} ℒ_i(θ, 𝒟_i)

Multi-Task Learning

Solve multiple tasks 𝒯_1, ⋯, 𝒯_T at once.

Transfer Learning

Solve target task 𝒯_b after solving source task 𝒯_a, by transferring knowledge learned from 𝒯_a

Key assumption: cannot access data 𝒟_a during transfer.

Transfer learning is a valid solution to multi-task learning. (but not vice versa)

Question: What are some problems/applications where transfer learning might make sense?

- when 𝒟_a is very large (don't want to retain & retrain on 𝒟_a)
- when you don't care about solving 𝒯_a & 𝒯_b simultaneously

Page 5: The Meta-Learning Problem & Black-Box Meta-Learning

training data for new task 𝒯b

Parameters θ pre-trained on 𝒟_a

φ ← θ − α ∇_θ ℒ(θ, 𝒟^tr)


Where do you get the pre-trained parameters?
- ImageNet classification
- Models trained on large language corpora (BERT, LMs)
- Other unsupervised learning techniques
- Whatever large, diverse dataset you might have

(typically for many gradient steps)

Some common practices
- Fine-tune with a smaller learning rate
- Smaller learning rate for earlier layers
- Freeze earlier layers, gradually unfreeze
- Reinitialize last layer
- Search over hyperparameters via cross-validation
- Architecture choices matter (e.g. ResNets)

Pre-trained models often available online.

What makes ImageNet good for transfer learning? Huh, Agrawal, Efros. '16

Transfer learning via fine-tuning
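The common practices above can be sketched in code. Below is a minimal NumPy stand-in: the tiny two-layer model, shapes, and learning rates are illustrative assumptions (real practice would fine-tune e.g. an ImageNet-pre-trained ResNet in a deep learning framework).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a pre-trained model: a 2-layer net (hypothetical shapes).
W1 = rng.normal(size=(4, 8))              # earlier layer: keep pre-trained weights
W2 = rng.normal(scale=0.01, size=(8, 3))  # last layer: reinitialized for the new task

# Smaller learning rate for earlier layers; setting it to 0 would freeze them.
lr = {"W1": 1e-3, "W2": 1e-1}

def loss_and_grads(W1, W2, x, y):
    """Cross-entropy loss and gradients for one example."""
    h = np.tanh(x @ W1)
    logits = h @ W2
    p = np.exp(logits - logits.max())
    p /= p.sum()
    loss = -np.log(p[y])
    dlogits = p.copy()
    dlogits[y] -= 1.0
    dW2 = np.outer(h, dlogits)
    dpre = (W2 @ dlogits) * (1.0 - h**2)  # backprop through tanh
    dW1 = np.outer(x, dpre)
    return loss, dW1, dW2

# Fine-tune on (a tiny amount of) target-task data.
x, y = rng.normal(size=4), 1
loss0, _, _ = loss_and_grads(W1, W2, x, y)
for _ in range(100):
    _, dW1, dW2 = loss_and_grads(W1, W2, x, y)
    W1 -= lr["W1"] * dW1                  # earlier layer barely moves
    W2 -= lr["W2"] * dW2
```

The per-layer learning-rate dictionary is the moral equivalent of optimizer parameter groups in a real framework.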

Page 6: The Meta-Learning Problem & Black-Box Meta-Learning


Universal Language Model Fine-Tuning for Text Classification. Howard, Ruder. '18

Fine-tuning doesn’t work well with very small target task datasets

This is where meta-learning can help.

Page 7: The Meta-Learning Problem & Black-Box Meta-Learning

Plan for Today

Transfer Learning
- Problem formulation
- Fine-tuning

Meta-Learning
- Problem formulation
- General recipe of meta-learning algorithms
- Black-box adaptation approaches
- Case study of GPT-3 (time-permitting)

Page 8: The Meta-Learning Problem & Black-Box Meta-Learning

The Meta-Learning Problem Statement (that we will consider in this class)

Page 9: The Meta-Learning Problem & Black-Box Meta-Learning

Two ways to view meta-learning algorithms

Mechanistic view

➢ Deep network that can read in an entire dataset and make predictions for new datapoints

➢ Training this network uses a meta-dataset, which itself consists of many datasets, each for a different task

Probabilistic view

➢ Extract prior knowledge from a set of tasks that allows efficient learning of new tasks

➢ Learning a new task uses this prior and a (small) training set to infer the most likely posterior parameters

For now: Focus primarily on the mechanistic view.

(Bayes will come back later)

Page 10: The Meta-Learning Problem & Black-Box Meta-Learning

How does meta-learning work? An example.

Given 1 example of 5 classes: Classify new examples

training data / test set

Page 11: The Meta-Learning Problem & Black-Box Meta-Learning

How does meta-learning work? An example.

meta-training: training classes
meta-testing: 𝒯_test

Given 1 example of 5 classes: Classify new examples

training data / test set

Can replace image classification with: regression, language generation, skill learning, any ML problem

Page 12: The Meta-Learning Problem & Black-Box Meta-Learning

The Meta-Learning Problem

Given data from 𝒯_1, …, 𝒯_n, quickly solve new task 𝒯_test

Key assumption: meta-training tasks 𝒯_1, …, 𝒯_n ∼ p(𝒯) and meta-test task 𝒯_test ∼ p(𝒯) drawn i.i.d. from the same task distribution

What do the tasks correspond to?
- recognizing handwritten digits from different languages (see homework 1!)

- giving feedback to students on different exams

- classifying species in different regions of the world

- a robot performing different tasks

How many tasks do you need? The more the better. (analogous to more data in ML)

Like before, tasks must share structure.

Page 13: The Meta-Learning Problem & Black-Box Meta-Learning

Some terminology

𝒟_i^tr: task training set ("support set")
𝒟_i^test: task test dataset ("query set")

k-shot learning: learning with k examples per class (or k examples total for regression)
N-way classification: choosing between N classes

Question: What are k and N for the above example?
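The terminology above maps directly onto how episodes are sampled during meta-training. A minimal sketch, assuming a toy in-memory dataset (the class count, feature dimension, and helper name are hypothetical; Homework 1 builds the real version for Omniglot):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "dataset": 20 classes with 10 examples each of 16-dim features
# (a hypothetical stand-in for, e.g., Omniglot characters).
data = {c: rng.normal(size=(10, 16)) for c in range(20)}

def sample_episode(data, n_way=5, k_shot=1, n_query=2):
    """Sample one N-way, k-shot task: a support set D_i^tr and query set D_i^test."""
    classes = rng.choice(sorted(data), size=n_way, replace=False)
    support, query = [], []
    for label, c in enumerate(classes):      # relabel classes 0..N-1 per task
        idx = rng.choice(len(data[c]), size=k_shot + n_query, replace=False)
        examples = data[c][idx]
        support += [(x, label) for x in examples[:k_shot]]
        query += [(x, label) for x in examples[k_shot:]]
    return support, query

support, query = sample_episode(data)        # 5-way, 1-shot episode
```

Note that labels are reassigned 0..N−1 within each episode, so the model cannot memorize global class identities and must use the support set.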

Page 14: The Meta-Learning Problem & Black-Box Meta-Learning

min_θ ∑_{i=1}^{T} ℒ_i(θ, 𝒟_i)

Multi-Task Learning

Solve multiple tasks 𝒯_1, ⋯, 𝒯_T at once.

Transfer Learning

Solve target task 𝒯_b after solving source task 𝒯_a, by transferring knowledge learned from 𝒯_a

The Meta-Learning Problem

Given data from 𝒯_1, …, 𝒯_n, quickly solve new task 𝒯_test

In all settings: tasks must share structure.

In transfer learning and meta-learning: generally impractical to access prior tasks

Problem Settings Recap

Page 15: The Meta-Learning Problem & Black-Box Meta-Learning

Plan for TodayTransfer Learning - Problem formulation - Fine-tuning

Meta-Learning - Problem formulation - General recipe of meta-learning algorithms - Black-box adaptation approaches - Case study of GPT-3 (time-permitting)

Page 16: The Meta-Learning Problem & Black-Box Meta-Learning

One View on the Meta-Learning Problem

Supervised Learning:
Inputs: x   Outputs: y   Data: 𝒟 = {(x, y)_j}

Meta Supervised Learning:
Inputs: 𝒟^tr, x   Outputs: y   Data: {𝒟_i}

Why is this view useful? Reduces the meta-learning problem to the design & optimization of f.


Finn. Learning to Learn with Gradients. PhD Thesis. 2018


Page 17: The Meta-Learning Problem & Black-Box Meta-Learning

General recipe: How to design a meta-learning algorithm

1. Choose a form of f_θ (θ: meta-parameters)

2. Choose how to optimize θ w.r.t. the max-likelihood objective using meta-training data

Page 18: The Meta-Learning Problem & Black-Box Meta-Learning

Plan for TodayTransfer Learning - Problem formulation - Fine-tuning

Meta-Learning - Problem formulation - General recipe of meta-learning algorithms - Black-box adaptation approaches - Case study of GPT-3 (time-permitting)

Page 19: The Meta-Learning Problem & Black-Box Meta-Learning

Black-Box Adaptation

Key idea: Train a neural network to represent φ_i = f_θ(𝒟_i^tr).

Predict test points with y^ts = g_{φ_i}(x^ts)   (g_{φ_i}: the "learner")

Train with standard supervised learning!

max_θ ∑_{𝒯_i} ℒ(f_θ(𝒟_i^tr), 𝒟_i^test) = max_θ ∑_{𝒯_i} ∑_{(x,y)∼𝒟_i^test} log g_{φ_i}(y | x)

Page 20: The Meta-Learning Problem & Black-Box Meta-Learning

Black-Box Adaptation

1. Sample task 𝒯_i (or minibatch of tasks)
2. Sample disjoint datasets 𝒟_i^tr, 𝒟_i^test from 𝒟_i

Key idea: Train a neural network to represent φ_i = f_θ(𝒟_i^tr).

Page 21: The Meta-Learning Problem & Black-Box Meta-Learning

Black-Box Adaptation

1. Sample task 𝒯_i (or minibatch of tasks)
2. Sample disjoint datasets 𝒟_i^tr, 𝒟_i^test from 𝒟_i
3. Compute φ_i ← f_θ(𝒟_i^tr)
4. Update θ using ∇_θ ℒ(φ_i, 𝒟_i^test)

Key idea: Train a neural network to represent φ_i = f_θ(𝒟_i^tr).
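Steps 1–4 can be run end-to-end on a deliberately tiny instance of black-box meta-learning (all choices here are illustrative assumptions): tasks are 1-D regressions y = a_i·x, f_θ reads the support set and emits a scalar φ_i, g_φ predicts φ·x^ts, and the gradient of the query-set loss w.r.t. θ is written out in closed form instead of using autodiff.

```python
import numpy as np

rng = np.random.default_rng(0)

theta = 0.0    # meta-parameters of f_theta (a single scalar in this sketch)
alpha = 0.02   # meta learning rate

def f(theta, xs, ys):
    """f_theta reads in the whole support set and outputs phi_i."""
    return theta * np.mean(xs * ys)

for step in range(1000):
    a = rng.uniform(-2, 2)                                   # 1. sample task T_i (y = a*x)
    xs_tr, xs_ts = rng.normal(size=10), rng.normal(size=10)  # 2. disjoint D_tr, D_test
    ys_tr, ys_ts = a * xs_tr, a * xs_ts
    phi = f(theta, xs_tr, ys_tr)                             # 3. phi_i = f_theta(D_tr)
    err = phi * xs_ts - ys_ts                                # g_phi(x_ts) = phi * x_ts
    # 4. update theta with the gradient of the query-set MSE w.r.t. theta
    grad = np.mean(2.0 * err * xs_ts) * np.mean(xs_tr * ys_tr)
    theta -= alpha * grad
```

After meta-training, adapting to a new task is just a forward pass of f on its support set: no per-task gradient steps.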

Page 22: The Meta-Learning Problem & Black-Box Meta-Learning

Black-Box Adaptation

Challenge: Outputting all neural net parameters does not seem scalable.

Idea: Do not need to output all parameters of the neural net, only sufficient statistics (Santoro et al. MANN, Mishra et al. SNAIL)

A low-dimensional vector h_i represents contextual task information: φ_i = {h_i, θ_g}

General form: y^ts = f_θ(𝒟_i^tr, x^ts)

recall — Key idea: Train a neural network to represent φ_i = f_θ(𝒟_i^tr).

What architecture should we use for f_θ?

Page 23: The Meta-Learning Problem & Black-Box Meta-Learning

Black-Box Adaptation Architectures

LSTMs or Neural Turing machine (NTM): Meta-Learning with Memory-Augmented Neural Networks. Santoro, Bartunov, Botvinick, Wierstra, Lillicrap. ICML '16

Other external memory mechanisms: Meta Networks. Munkhdalai, Yu. ICML '17

Feedforward + average: Conditional Neural Processes. Garnelo, Rosenbaum, Maddison, Ramalho, Saxton, Shanahan, Teh, Rezende, Eslami. ICML '18

Convolutions & attention: A Simple Neural Attentive Meta-Learner. Mishra, Rohaninejad, Chen, Abbeel. ICLR '18

Question: Why might feedforward+average be better than a recurrent model?

HW 1:
- implement data processing
- implement a simple black-box meta-learner
- train a few-shot Omniglot classifier
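To make the feedforward + average option concrete, here is a minimal untrained sketch (random stand-in weights, hypothetical shapes): each support pair is embedded independently and the embeddings are averaged into h_i, so the task representation is permutation-invariant over the support set, unlike a recurrent encoder that consumes it in some order.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-in embedding weights (hypothetical; a real model trains these).
W = rng.normal(size=(3, 8))

def encode(support):
    """Embed each (x, y) support pair independently, then average into h_i.
    support: list of (2-dim x, scalar y) pairs -> 8-dim task vector h_i."""
    feats = np.array([np.concatenate([x, [y]]) for x, y in support])
    return np.tanh(feats @ W).mean(axis=0)

support = [(np.array([0.0, 1.0]), 1.0), (np.array([2.0, -1.0]), 0.0)]
h1 = encode(support)
h2 = encode(support[::-1])   # same support set, reversed order
# h1 == h2: averaging makes the encoding order-independent
```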

Page 24: The Meta-Learning Problem & Black-Box Meta-Learning

Let's run through an example

Omniglot dataset. Lake et al. Science 2015 — the "transpose" of MNIST
- many classes, few examples: 1623 characters from 50 different alphabets
- 20 instances of each character
- statistics more reflective of the real world

More few-shot image recognition datasets: tieredImageNet, CIFAR, CUB, CelebA, ORBIT, others

More benchmarks: molecular property prediction (Nguyen et al. '20), object pose prediction (Yin et al. ICLR '20), channel coding (Li et al. '21)

*whiteboard*

Page 25: The Meta-Learning Problem & Black-Box Meta-Learning

Black-Box Adaptation

Key idea: Train a neural network to represent φ_i = f_θ(𝒟_i^tr).

+ expressive
+ easy to combine with a variety of learning problems (e.g. SL, RL)
- complex model w/ complex task: challenging optimization problem
- often data-inefficient

How else can we represent φ_i = f_θ(𝒟_i^tr)?

Next time (Wednesday): What if we treat it as an optimization procedure?

Page 26: The Meta-Learning Problem & Black-Box Meta-Learning

Plan for Today

Transfer Learning
- Problem formulation
- Fine-tuning

Meta-Learning
- Problem formulation
- General recipe of meta-learning algorithms
- Black-box adaptation approaches
- Case study of GPT-3 (time-permitting)

Page 27: The Meta-Learning Problem & Black-Box Meta-Learning

Case Study: GPT-3

May 2020

Page 28: The Meta-Learning Problem & Black-Box Meta-Learning

What is GPT-3?

black-box meta-learner trained on language generation tasks

architecture: giant "Transformer" network (175 billion parameters, 96 layers, 3.2M batch size)

𝒟_i^tr: a sequence of characters; 𝒟_i^ts: the following sequence of characters

[meta-training] dataset: crawled data from the internet, English-language Wikipedia, two books corpora

a language model

What do different tasks correspond to?
- spelling correction
- simple math problems
- translating between languages
- a variety of other tasks

How can those tasks all be solved by a single architecture?

Page 29: The Meta-Learning Problem & Black-Box Meta-Learning

How can those tasks all be solved by a single architecture? Put them all in the form of text!

spelling correction, simple math problems, translating between languages

Why is that a good idea? Very easy to get a lot of meta-training data.
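What "put them all in the form of text" looks like concretely: the support set 𝒟_i^tr is serialized as a few-shot prompt, and answering the query is just continuing the text. The task and format below are illustrative assumptions, not GPT-3's actual training data:

```python
def make_prompt(support, query):
    """Turn support (x, y) pairs plus a query x into a few-shot text prompt."""
    lines = [f"{x} -> {y}" for x, y in support]
    lines.append(f"{query} ->")   # the model completes this line
    return "\n".join(lines)

support = [("recieve", "receive"), ("definately", "definitely")]  # spelling correction
prompt = make_prompt(support, "seperate")
```

The same serialization works for arithmetic, translation, and so on, which is why one language model can cover them all.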

Page 30: The Meta-Learning Problem & Black-Box Meta-Learning

Some Results

One-shot learning from dictionary definitions:

Few-shot language editing:

Non-few-shot learning tasks:

Page 31: The Meta-Learning Problem & Black-Box Meta-Learning

General Notes & Takeaways

The results are extremely impressive.

The model fails in unintuitive ways.

Source: https://lacker.io/ai/2020/07/06/giving-gpt-3-a-turing-test.html

Source: https://github.com/shreyashankar/gpt3-sandbox/blob/master/docs/priming.md

The choice of 𝒟_i^tr at test time is important. ("priming")

The model is far from perfect.

Page 32: The Meta-Learning Problem & Black-Box Meta-Learning

Plan for Today

Transfer Learning
- Problem formulation
- Fine-tuning

Meta-Learning
- Problem formulation
- General recipe of meta-learning algorithms
- Black-box adaptation approaches
- Case study of GPT-3 (time-permitting)

Goals for the end of lecture:
- Differences between multi-task learning, transfer learning, and meta-learning problems
- Basics of transfer learning via fine-tuning
- Training set-up for few-shot meta-learning algorithms (topic of Homework 1!)
- How to implement black-box meta-learning techniques (topic of Homework 1!)

Page 33: The Meta-Learning Problem & Black-Box Meta-Learning

Reminders

Next time: Optimization-based meta-learning

Homework 1 to be posted today, due Wednesday, October 6

Optional Homework 0 due tonight.

Azure guide to be posted today