Learning to Warm-Start Bayesian Hyperparameter Optimization
and Task-Adaptive Ensemble of Meta-Learners for Few-Shot Classification
Jungtaek Kim ([email protected])
Machine Learning Group, Department of Computer Science and Engineering, POSTECH,
77 Cheongam-ro, Nam-gu, Pohang 37673, Gyeongsangbuk-do, Republic of Korea
September 11, 2018
Table of Contents
Learning to Warm-Start Bayesian Hyperparameter Optimization
- Motivation
- Main Architecture
- Experiments

Task-Adaptive Ensemble of Meta-Learners for Few-Shot Classification
- Motivation
- Main Architecture
- Experiments
Learning to Warm-Start Bayesian Hyperparameter Optimization
Motivation
- Bayesian hyperparameter optimization usually starts from random initial points.
- Better initializations might speed up Bayesian hyperparameter optimization.
- A mapping from hyperparameters to validation error can be learned.
- We attempt to transfer prior knowledge about initializations to a new task (see the sketch below).
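
A minimal sketch of the warm-starting idea in Python: the Bayesian optimization loop below is seeded with transferred initial points rather than random ones. The `init_points` argument stands in for the configurations proposed by the learned meta-model; the GP surrogate, EI acquisition, and random candidate search are standard choices and assumptions here, not necessarily the paper's exact setup.

    # Warm-started Bayesian optimization (sketch). `init_points` stands in for
    # the hyperparameter configurations proposed by the learned meta-model;
    # everything else is a standard GP + EI minimization loop.
    import numpy as np
    from scipy.stats import norm
    from sklearn.gaussian_process import GaussianProcessRegressor

    def expected_improvement(gp, X_cand, y_best):
        """EI for minimization, from the GP posterior mean and std."""
        mu, sigma = gp.predict(X_cand, return_std=True)
        sigma = np.maximum(sigma, 1e-9)
        z = (y_best - mu) / sigma
        return (y_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

    def bayes_opt(validation_error, bounds, init_points, n_iter=20, seed=0):
        """bounds: (d, 2) array of [low, high] per hyperparameter."""
        rng = np.random.default_rng(seed)
        X = np.asarray(init_points, dtype=float)  # warm start: transferred, not random
        y = np.array([validation_error(x) for x in X])
        for _ in range(n_iter):
            gp = GaussianProcessRegressor(normalize_y=True).fit(X, y)
            cand = rng.uniform(bounds[:, 0], bounds[:, 1], size=(2048, len(bounds)))
            x_next = cand[np.argmax(expected_improvement(gp, cand, y.min()))]
            X = np.vstack([X, x_next])
            y = np.append(y, validation_error(x_next))
        return X[np.argmin(y)], y.min()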
Main Architecture
[Siamese diagram: two datasets each pass through a deep feature extractor and a meta-feature extractor (fc layers); all weights are shared between the two branches, and the output is the meta-feature distance between the datasets.]
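
A minimal sketch of this shared-weight (Siamese) structure in PyTorch: both datasets go through the same deep feature extractor and meta-feature extractor, and the output is the distance between their meta-features. Layer sizes and the mean-pooling over instances are illustrative assumptions (the legend on the following slides suggests set encoders such as ADF or a Bi-LSTM were used rather than plain pooling).

    # Siamese meta-feature network (sketch): identical weights on both branches.
    import torch
    import torch.nn as nn

    class MetaFeatureNet(nn.Module):
        def __init__(self, in_dim=512, meta_dim=64):
            super().__init__()
            # deep feature extractor, applied to every instance of a dataset
            self.deep = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU())
            # meta-feature extractor: fc layers on the pooled representation
            self.meta = nn.Sequential(nn.Linear(256, 128), nn.ReLU(),
                                      nn.Linear(128, meta_dim))

        def embed(self, dataset):  # dataset: (n_instances, in_dim)
            # mean-pooling over instances is an assumed stand-in for a set encoder
            return self.meta(self.deep(dataset).mean(dim=0))

        def forward(self, dataset_a, dataset_b):
            # same modules (hence same weights) on both branches; the output is
            # the meta-feature distance between the two datasets
            return torch.norm(self.embed(dataset_a) - self.embed(dataset_b), p=2)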
Experiments (EI)
[Eight panels, one per dataset: (a) AwA2, (b) Caltech-101, (c) Caltech-256, (d) CIFAR-10, (e) CIFAR-100, (f) CUB200-2011, (g) MNIST, (h) VOC2012. Each plots minimum validation error against iteration (0-20), comparing Random init. (Uniform), Random init. (Latin), Random init. (Halton), Nearest best init. (ADF), and Nearest best init. (Bi-LSTM).]
Experiments (UCB)
[Eight panels for the same datasets as above, (j) AwA2 through (q) VOC2012, plotting minimum validation error against iteration (0-20) for the same five initialization methods, with UCB as the acquisition function.]
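
For these results, UCB replaces EI as the acquisition function. For a minimization objective this is commonly implemented as a GP lower confidence bound; below is a minimal sketch compatible with the loop shown earlier, where `kappa` is an assumed exploration weight, not a value taken from the slides.

    import numpy as np

    def lower_confidence_bound(gp, X_cand, kappa=2.0):
        """GP-LCB for minimization: favor low predicted mean and high uncertainty."""
        mu, sigma = gp.predict(X_cand, return_std=True)
        return -(mu - kappa * sigma)  # negated so argmax picks the next point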
Task-Adaptive Ensemble of Meta-Learners for Few-Shot Classification
Motivation
- Few-shot classification needs to generalize from training episodes and perform well on test episodes.
- A meta-learner for few-shot classification typically assumes that the domain distribution does not change.
- In practice, the domain distribution can vary.
- We build an ensemble of several meta-learners, each trained on episodes from a single dataset (see the sketch below).
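
A minimal sketch of the task-adaptive ensemble idea: each meta-learner is trained on episodes from one source dataset, and their predictions are combined with weights that depend on the current task. The softmax over embedding similarities used here is an illustrative assumption, not necessarily the paper's weighting scheme.

    import numpy as np

    def ensemble_predict(meta_learners, learner_embeddings, task_embedding, query):
        """meta_learners: callables mapping a query to per-class logits."""
        # task-adaptive weights: softmax over the similarity between the task
        # embedding and each meta-learner's (source-dataset) embedding
        scores = np.array([emb @ task_embedding for emb in learner_embeddings])
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        logits = np.stack([m(query) for m in meta_learners])
        return (weights[:, None] * logits).sum(axis=0)  # weighted combination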
Main Architecture

Experiments