
A Survey on Meta-Learning

Xiang Li

Nanyang Technological University

September 10, 2019


What’s Not Included in This Survey?

- Meta-Learning for Reinforcement Learning

- Bayesian-Based Meta-Learning


Overview

Problem Definition
  Few-shot Learning

Approaches
  Non-Parametric Methods (Metric Learning)
  Model-Based Methods (Black-Box Adaptation)
  Optimization-Based Methods
  Other Methods

Summary


Problem Definition

- Over a task distribution p(T):

  θ* ← argmin_θ Σ_{t∼p(T)} L_t(θ)

  (a toy sketch of this objective follows below)

- Common tasks:
  - Few-shot Learning
  - Regression
  - Reinforcement Learning
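As a toy illustration of the objective above (the quadratic per-task loss and all constants are invented for this sketch, not taken from the slides), one can sample tasks and minimize the summed loss directly:

```python
# Toy illustration of  theta* <- argmin_theta sum_{t ~ p(T)} L_t(theta).
# Each "task" t is a 1-D quadratic loss L_t(theta) = (theta - a_t)^2
# with a_t drawn from p(T); these choices are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
theta = 0.0
lr = 0.05

for step in range(500):
    tasks = rng.normal(loc=2.0, scale=1.0, size=8)   # sample a batch of tasks a_t ~ p(T)
    grad = np.sum(2.0 * (theta - tasks))             # d/dtheta of sum_t (theta - a_t)^2
    theta -= lr * grad

print(theta)  # converges near E[a_t] = 2.0, the minimizer of the expected loss
```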


Few-shot Learning: Definition

- Training set: D_meta-train = {(x_1, y_1), …, (x_k, y_k)}
- Test set: D_meta-test = {D_1, …, D_n}, where D_i = {(x_1^i, y_1^i), …, (x_m^i, y_m^i)}

- N-way, K-shot problem (usually 5-way 5-shot or 5-way 1-shot); a sampling sketch follows below

[Figure (Ravi et al. ’17): each episode is split into a support set (labels given) and a query set (labels to predict)]
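A minimal sketch of how an N-way, K-shot episode could be sampled from a labelled pool (the pool, N, K, and query size are arbitrary choices for illustration):

```python
# Sample one N-way, K-shot episode: a support set (labels given) and a
# query set (labels to predict). Pool contents are dummy feature vectors.
import numpy as np

rng = np.random.default_rng(0)
pool = {c: rng.normal(size=(20, 64)) for c in range(100)}   # 100 classes, 20 samples each

def sample_episode(n_way=5, k_shot=1, n_query=5):
    classes = rng.choice(len(pool), size=n_way, replace=False)
    support, query = [], []
    for label, c in enumerate(classes):                      # relabel classes 0..N-1 per episode
        idx = rng.permutation(len(pool[c]))[: k_shot + n_query]
        support += [(pool[c][i], label) for i in idx[:k_shot]]
        query += [(pool[c][i], label) for i in idx[k_shot:]]
    return support, query

support, query = sample_episode()
print(len(support), len(query))   # 5, 25
```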


Few-shot Learning: Dataset

- Omniglot dataset (Lake et al. ’15)
  - 1,623 characters, 20 instances of each character
  - the “transpose” of MNIST: many classes with few examples each

- Mini-ImageNet
  - a subset of ImageNet

- Other datasets: CIFAR, CUB, CelebA, and others


Non-Parametric Methods (Metric Learning)

- Idea: simply compare query images to the images in the support set in a feature space

- Challenges:
  - In what feature space should the comparison be done?
  - What distance metric should be used?


Siamese Network (Koch et al. ’15)


- Can we make the training condition match the test condition?

- Can the feature space be conditioned on the specific task?


Matching Networks for One Shot Learning (Vinyals et al. ’16)

- Sampling strategy: the test and training conditions must match
  - sample “episodes” as training batches

- Attention kernel (see the sketch after this slide)
  - y = Σ_{i=1}^{k} a(x, x_i) y_i
  - a(x, x_i) = exp(c(f(x), g(x_i))) / Σ_{j=1}^{k} exp(c(f(x), g(x_j))), with c the cosine similarity

- Full Context Embeddings (FCE)
  - g_θ is a bidirectional LSTM that encodes x_i in the context of the support set S
  - f_θ is an LSTM that encodes x in the context of S
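A hedged sketch of the attention-kernel prediction above, taking f and g to be identity embeddings (an assumption for brevity; the paper uses CNN encoders, optionally with full context embeddings):

```python
# Matching Networks' prediction: the query label distribution is a
# cosine-softmax-weighted sum of one-hot support labels.
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

def matching_predict(x, support_x, support_y, n_classes):
    sims = np.array([cosine(x, xi) for xi in support_x])      # c(f(x), g(x_i))
    a = np.exp(sims) / np.exp(sims).sum()                     # softmax attention a(x, x_i)
    onehot = np.eye(n_classes)[support_y]                     # y_i as one-hot vectors
    return a @ onehot                                         # p(y | x) = sum_i a(x, x_i) y_i

rng = np.random.default_rng(0)
support_x = rng.normal(size=(25, 64))       # 5-way, 5-shot support embeddings (dummy data)
support_y = np.repeat(np.arange(5), 5)
query = support_x[3] + 0.1 * rng.normal(size=64)
print(matching_predict(query, support_x, support_y, 5))   # peaked on class 0
```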


Other Metric-Based Works

- Prototypical Networks for Few-shot Learning (Snell et al. ’17) (see the sketch after this list)
  - c_k = (1 / |S_k|) Σ_{(x_i, y_i) ∈ S_k} f_φ(x_i)
  - p_φ(y = k | x) = exp(−d(f_φ(x), c_k)) / Σ_{k'} exp(−d(f_φ(x), c_{k'}))

- Learning to Compare: Relation Network for Few-Shot Learning (Sung et al. ’18)
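A minimal sketch of the two Prototypical Networks formulas above, with f_φ taken as the identity embedding and d as squared Euclidean distance (both simplifying assumptions for illustration):

```python
# Prototypical Networks' classification rule: class prototypes are mean
# support embeddings; predictions are a softmax over negative distances.
import numpy as np

def proto_predict(query, support_x, support_y, n_classes):
    # c_k = mean of the support embeddings of class k
    protos = np.stack([support_x[support_y == k].mean(axis=0) for k in range(n_classes)])
    d = ((query[None, :] - protos) ** 2).sum(axis=1)          # d(f(x), c_k)
    logits = -d
    p = np.exp(logits - logits.max())
    return p / p.sum()                                        # softmax over -distances

rng = np.random.default_rng(0)
support_x = rng.normal(size=(25, 64))                         # dummy 5-way, 5-shot embeddings
support_y = np.repeat(np.arange(5), 5)
query = support_x[:5].mean(axis=0)                            # near the class-0 prototype
print(proto_predict(query, support_x, support_y, 5))
```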


Performance

[Tables: few-shot classification accuracy on Omniglot and on Mini-ImageNet]


Metric-Based Methods: Summary

- Meta Knowledge: feature space & distance metric

- Task Knowledge: none

- Advantages:
  - simple and computationally fast
  - entirely feed-forward

- Disadvantages:
  - hard to scale to very large K
  - limited to classification problems


Problem with Metric-Based Methods

How can we adapt to each specific task?

- Model-Based Methods

- Optimization-Based Methods


Model-Based Methods (Black-Box Adaptation)

- Idea:
  - a task-specific adaptation φ_i for each task i
  - train a network (with meta parameters θ) to represent p(φ_i | D_i^support, θ)


Model-Based Methods (Black-Box Adaptation)

How should samples in the support set be aggregated (i.e., what form should f_θ take)?

- Average (Prototypical Networks)

- LSTM (Matching Networks)

- Memory-Augmented Neural Networks


Neural Turing Machines (NTMs) (Graves et al. ’14)

- Attention-based

- Memory matrix M

- Read (see the sketch below):
  r ← Σ_i w(i) M(i), with Σ_i w(i) = 1

- Write (erase, then add):
  M̃_t(i) ← M_{t−1}(i) [1 − w(i) e]
  M_t(i) ← M̃_t(i) + w(i) a
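A hedged NumPy sketch of the read and erase/add write operations above, with the attention weights w supplied directly (real NTMs produce w via content- and location-based addressing, which is omitted here):

```python
# NTM read and (erase, add) write over a small memory matrix.
import numpy as np

rng = np.random.default_rng(0)
N, D = 8, 4                      # memory slots x slot width
M = rng.normal(size=(N, D))

w = rng.random(N); w /= w.sum()  # attention weights, sum_i w(i) = 1
r = w @ M                        # read:  r <- sum_i w(i) M(i)

e = rng.random(D)                # erase vector, entries in [0, 1]
a = rng.normal(size=D)           # add vector
M = M * (1 - np.outer(w, e))     # erase: M(i) <- M(i) * [1 - w(i) e]
M = M + np.outer(w, a)           # add:   M(i) <- M(i) + w(i) a
print(r.shape, M.shape)
```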


Neural Turing Machines (NTMs) (Graves et al. ’14)

Advantages:

- Large memory

- Addressable


Target: p(φ_i | D_i^support, θ)

What forms can φ_i take? It should:

- contain task-specific information

- contain only useful information

Solution 1: feature representations of the samples in D^support


Meta-Learning with Memory Augmented Neural Networks (Santoro et al. ’16)

- Read memory (see the sketch below):
  r_t ← Σ_{i=1}^{N} w_t^r(i) M_t(i),
  where w_t^r(i) = softmax( (k_t · M_t(i)) / (‖k_t‖ ‖M_t(i)‖) )

- Write memory with Least Recently Used Access (LRUA)

- Training strategy: labels are presented with a one-time-step offset, so the network must store each sample in memory and bind it to its label when the label arrives
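A hedged sketch of the cosine-similarity content addressing used for the read weights above; the key k_t is random here, whereas in the model it is emitted by the controller:

```python
# MANN-style content-based read: attention weights are a softmax over
# cosine similarities between a key k_t and each memory row.
import numpy as np

rng = np.random.default_rng(0)
N, D = 8, 4
M = rng.normal(size=(N, D))
k = rng.normal(size=D)

cos = (M @ k) / (np.linalg.norm(M, axis=1) * np.linalg.norm(k) + 1e-8)
w_r = np.exp(cos) / np.exp(cos).sum()     # w_t^r(i) = softmax(cosine(k_t, M_t(i)))
r = w_r @ M                               # r_t = sum_i w_t^r(i) M_t(i)
print(w_r.round(3), r.shape)
```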


Dynamic Few-Shot Visual Learning without Forgetting (Gidaris et al. ’18)

- Use cosine similarity in the classification layer instead of the dot product (see the sketch after this list)

- Generate the prototypical (classification-weight) vector of each novel class using:
  - the average of its support-set features (as in Prototypical Networks), and
  - attention-based inference from a memory of base-class features
    - the base-class entries act like a dictionary: a memory matrix M and a key matrix K
    - w'_att = (1/N') Σ_{i=1}^{N'} Σ_{b=1}^{K_base} Att(φ_q z'_i, k_b) · w_b
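A hedged sketch of a cosine-similarity classification layer, the first point above; the scale value and tensor sizes are illustrative choices, not the paper's settings:

```python
# Cosine-similarity classifier: features and class weight vectors are
# L2-normalized, so each logit is a scaled cosine similarity rather than
# a raw dot product.
import numpy as np

rng = np.random.default_rng(0)
features = rng.normal(size=(16, 64))       # a batch of query features (dummy data)
weights = rng.normal(size=(5, 64))         # one (generated) weight vector per class
scale = 10.0

f = features / np.linalg.norm(features, axis=1, keepdims=True)
w = weights / np.linalg.norm(weights, axis=1, keepdims=True)
logits = scale * f @ w.T                   # (16, 5) cosine-similarity logits
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
print(probs.shape)
```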


Target: p(φ_i | D_i^support, θ)

What forms can φ_i take? It should:

- contain task-specific information

- contain only useful information

Solution 2: network weights


Meta Networks (Munkhdalai et al. ’17)

- Slow weights + fast weights
- Components: a meta feature extractor f_m (key embedding) with slow weights M, a base feature extractor f_b with slow weights B, a meta weight generator g_m, and a base weight generator g_b

- Meta-information: gradients


Meta Networks (Munkhdalai et al. ’17)

- Generate fast weights for the meta feature extractor:
  - sample T examples (x'_i, y'_i) from the support set
  - for each i: ∇_i ← ∇_M L_embed(f_m(M, x'_i), y'_i)
  - M* ← g_m({∇_i})  (the meta feature extractor f_m then uses slow weights M plus fast weights M*)

- Store per-example base fast weights in an NTM-style memory:
  - for every example (x'_i, y'_i) in the support set:
    - ∇_i ← ∇_B L_task(f_b(B, x'_i), y'_i)
    - B*_i ← g_b(∇_i)  (fast weights for the base extractor)
    - store B*_i in the value memory U(i)
    - r'_i ← f_m(M, M*, x'_i) and store it in the key matrix K(i)

- Use fast weights read from memory to process query samples:
  - for every sample (x_i, y_i) in the query set:
    - r_i ← f_m(M, M*, x_i)  (key)
    - access the memory with r_i (attention over K) to obtain fast weights B*_i for f_b
    - extract features f_b(B, B*_i, x_i) and accumulate the task loss

- Use the total query loss to update the slow weights and the weight generators


Meta Networks (Munkhdalai et al. ’17)

- Remaining question: can we find better meta-information? (in this work, the meta-information is gradients)


Model-Based Methods: Summary

- Meta Knowledge: slow weights

- Task Knowledge: generated weights & support-set features

- Advantages:
  - applicable to many learning problems

- Disadvantages:
  - complicated models (model and architecture are intertwined)
  - difficult to optimize


Target: p(φ_i | D_i^support, θ)

Since generating weights φ_i is computationally costly, why not update the existing weights θ instead?

Fine-tune!


Optimization-Based Methods

- Idea:
  - fine-tune to each specific task
  - at test time, given a task t: θ_t ← g(θ, D_t^support); then predict y = f(θ_t, x)

- What can we do during fine-tuning?


A Closer Look at the Fine-Tuning Process (Gradient Descent)

θ' = θ − α ∇_θ L

- Compute more effective “gradients”?

- Start from better initial parameters?

- Update only part of the weights?


Optimization as a Model for Few-Shot Learning (Ravi et al. ’17)

- An observation:
  - gradient descent: θ_t = θ_{t−1} − α_t ∇_{θ_{t−1}} L_t
  - cell update in an LSTM: c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t
  - if f_t = 1, i_t = α_t and c̃_t = −∇_{θ_{t−1}} L_t, then c_t can be treated as θ_t

- Dynamic i_t and f_t (see the sketch below):
  - i_t = σ(W_I · [∇_{θ_{t−1}} L_t, L_t, θ_{t−1}, i_{t−1}] + b_I)
  - f_t = σ(W_F · [∇_{θ_{t−1}} L_t, L_t, θ_{t−1}, f_{t−1}] + b_F)

- Weight sharing: the meta-learner’s parameters are shared across all coordinates of θ
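A hedged sketch of one coordinate-wise gated update step from the formulas above. The meta-learner parameters are random here (in the paper they are trained across many episodes), and the paper's input preprocessing is omitted:

```python
# One step of the LSTM-style learner update, applied per coordinate:
# theta_t = f_t * theta_{t-1} - i_t * grad_t.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W_I, b_I = rng.normal(size=4) * 0.1, -4.0   # bias i_t toward a small learning rate
W_F, b_F = rng.normal(size=4) * 0.1, 4.0    # bias f_t toward keeping theta_{t-1}

def meta_update(theta, grad, loss, i_prev, f_prev):
    # inputs per coordinate: [gradient, loss, theta_{t-1}, previous gate]
    i_t = sigmoid(W_I @ np.array([grad, loss, theta, i_prev]) + b_I)
    f_t = sigmoid(W_F @ np.array([grad, loss, theta, f_prev]) + b_F)
    return f_t * theta - i_t * grad, i_t, f_t

theta, i_prev, f_prev = 1.0, 0.0, 1.0
loss, grad = (theta - 3.0) ** 2, 2 * (theta - 3.0)   # toy quadratic task loss
theta, i_prev, f_prev = meta_update(theta, grad, loss, i_prev, f_prev)
print(theta)
```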


Optimization as a Model for Few-Shot Learning (Ravi et al. ’17)


Optimization as a Model for Few-Shot Learning (Ravi et al. ’17)

The loss (L_t), the gradient (∇_{θ_{t−1}} L_t), the network weights (θ_{t−1}), and the previous state (i_{t−1}, f_{t−1}) are used as meta-information.


A Closer Look at the Fine-Tuning Process (Gradient Descent)

θ' = θ − α ∇_θ L

Next question: can we start from better initial parameters?


Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks (Finn et al. ’17)

- Learn initial parameters that are easy to adapt to any task in the task distribution
  - initial parameters: prior knowledge (meta knowledge)
  - “easy”: one or a few gradient steps


Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks (Finn et al. ’17)

[Figure: the MAML algorithm — an inner loop for task-specific adaptation and an outer loop for the meta-knowledge update]
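A minimal MAML sketch under simplifying assumptions: a toy sine-regression task sampler (`sample_task`), a tiny MLP, and a single inner gradient step — none of these choices come from the slides or the authors' code:

```python
# MAML: inner loop adapts on the support set, outer loop updates the
# initialization using the query loss of the adapted parameters.
import torch
import torch.nn as nn

def sample_task(batch=10):
    """Hypothetical task sampler: y = a*sin(x + b) with random a, b."""
    a, b = torch.rand(1) * 4 + 0.1, torch.rand(1) * 3.14
    xs, xq = torch.rand(batch, 1) * 10 - 5, torch.rand(batch, 1) * 10 - 5
    return (xs, a * torch.sin(xs + b)), (xq, a * torch.sin(xq + b))

model = nn.Sequential(nn.Linear(1, 40), nn.ReLU(), nn.Linear(40, 1))
meta_opt = torch.optim.Adam(model.parameters(), lr=1e-3)
inner_lr, loss_fn = 0.01, nn.MSELoss()

for step in range(300):
    meta_loss = 0.0
    for _ in range(4):  # tasks per meta-batch
        (xs, ys), (xq, yq) = sample_task()
        params = list(model.parameters())
        # Inner loop: one gradient step on the support set (task-specific adaptation).
        grads = torch.autograd.grad(loss_fn(model(xs), ys), params, create_graph=True)
        fast = [p - inner_lr * g for p, g in zip(params, grads)]
        # Evaluate the adapted parameters on the query set (functional forward pass).
        h = torch.relu(xq @ fast[0].t() + fast[1])
        pred = h @ fast[2].t() + fast[3]
        meta_loss = meta_loss + loss_fn(pred, yq)
    meta_opt.zero_grad()
    meta_loss.backward()   # gradients flow through the inner step (second-order)
    meta_opt.step()
```

The `create_graph=True` flag is what lets the outer gradient flow through the inner update, i.e. the second-order term noted in the summary slide; first-order MAML and Reptile avoid it.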


Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks (Finn et al. ’17)

Does such an optimal point in parameter space exist?


A Closer Look at the Fine-Tuning Process (Gradient Descent)

θ' = θ − α ∇_θ L

Next question: can we update only part of the weights?


Fast Context Adaptation via Meta-Learning (Zintgraf et al. ’19)

- Task-specific context parameters (updated in the inner loop) and meta-trained network parameters (updated in the outer loop)

- Divide the parameters into meta and task-specific parts, similar in spirit to the LSTM meta-learner (Ravi et al. ’17) and Meta Networks (Munkhdalai et al. ’17)


Performance


Optimization-Based Methods: Summary

- Meta Knowledge: outer-loop optimization

- Task Knowledge: inner-loop optimization

- Advantages:
  - applicable to many kinds of tasks
  - model-agnostic

- Disadvantages:
  - second-order optimization (first-order approximations: first-order MAML & Reptile)


Comparison

|                    | Metric-Based                    | Model-Based                  | Optimization-Based           |
|--------------------|---------------------------------|------------------------------|------------------------------|
| Applicable Tasks   | Classification or verification  | All                          | All                          |
| Applicable Models  | All feature extractors          | Designed model               | All BP-based models          |
| Computational Cost | Low                             | High                         | High                         |
| Optimization       | Easy                            | Hard                         | Hard                         |
| Meta Information   | Feature space & distance metric | Slow weights                 | Outer-loop optimized weights |
| Task Information   | None                            | Generated features & weights | Inner-loop optimized weights |


MetaGAN: An Adversarial Approach to Few-Shot Learning (Zhang et al. ’18)

L_D = E_{x∼Q_T} log p_D(y ≤ N | x) + E_{x∼p_G^T} log p_D(N + 1 | x)

Explanation (a sketch of this loss follows below):

- Enrich the task samples with generated outliers, which form an extra (N+1)-th class

- cat & car > cat & dog → a better decision boundary
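A hedged sketch of an (N+1)-class discriminator loss in the spirit of the L_D above (real query samples should fall in classes 1..N, generated samples in the extra class N+1). The tensors are toy placeholders, not the paper's code:

```python
import torch
import torch.nn.functional as F

N = 5
logits_real = torch.randn(8, N + 1)   # discriminator logits for query samples
logits_fake = torch.randn(8, N + 1)   # discriminator logits for generated samples

log_probs_real = F.log_softmax(logits_real, dim=1)
log_probs_fake = F.log_softmax(logits_fake, dim=1)

# log p_D(y <= N | x): log of the total probability mass on the N real classes
log_p_real_classes = torch.logsumexp(log_probs_real[:, :N], dim=1)
# log p_D(N+1 | x): log-probability of the "fake" class for generated samples
log_p_fake_class = log_probs_fake[:, N]

# The discriminator maximizes L_D, i.e. minimizes its negative
loss_D = -(log_p_real_classes.mean() + log_p_fake_class.mean())
print(loss_D)
```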


A Simple Neural Attentive Meta-Learner (Mishra et al. ’18)

Temporal (autoregressive) generation:

p(x) = Π_{t=1}^{T} p(x_t | x_1, …, x_{t−1})

Causal temporal convolutional layers:

Dilated causal convolutional layers (van den Oord et al. ’16) — see the sketch below:
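A minimal sketch of a dilated causal 1-D convolution (WaveNet-style, van den Oord et al. ’16) of the kind used in SNAIL's temporal blocks; the layer names, channel sizes, and stack depth are illustrative assumptions, not the paper's architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=2, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation  # pad only on the left => causal
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)

    def forward(self, x):            # x: (batch, channels, time)
        x = F.pad(x, (self.pad, 0))  # left-pad so the output at t sees only inputs <= t
        return self.conv(x)

# Stack with exponentially growing dilation to cover a long context cheaply.
tcn = nn.Sequential(
    CausalConv1d(16, 32, dilation=1), nn.ReLU(),
    CausalConv1d(32, 32, dilation=2), nn.ReLU(),
    CausalConv1d(32, 32, dilation=4), nn.ReLU(),
)
out = tcn(torch.randn(1, 16, 20))    # (1, 32, 20): same length, causal receptive field
print(out.shape)
```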


A Simple Neural Attentive Meta-Learner (Mishra et al. ’18)

p(y^test | x^test, X^support, Y^support)

- Dilated convolutions give only coarse access to the previous inputs (the support set)


A Simple Neural Attentive Meta-Learner (Mishra et al. ’18)

Attention is all you need!

[Figure: SNAIL interleaves dilated (temporal convolution) layers with attention layers]


Summary: Common Ideas

- Knowledge design
  - meta-knowledge
  - task-specific knowledge (from the support set)

- Network design
  - separate meta-knowledge from task-specific knowledge
  - combine meta-knowledge and task-specific knowledge

- Training strategy
  - joint training, or mimicking the test condition (episodic training)?
  - an outer loop and an inner loop for updating the different kinds of knowledge


Summary: Challenges

- Overfitting
  - sample-level overfitting
  - meta-overfitting (to the training task distribution)

- Optimization

- Ambiguity


Thanks for listening!
