Learning to Learn Image Classifiers with Visual Analogy
Linjun Zhou1 Peng Cui1 Shiqiang Yang1 Wenwu Zhu1 Qi Tian2
1Tsinghua University 2Huawei Noah’s Ark Lab
{cuip, yangshq, wwzhu}@mail.tsinghua.edu.cn, [email protected]
Abstract
Compared with machines, humans are far better learners: they can
learn a new concept very quickly from only a few samples. A
plausible explanation for the difference lies in
two fundamental learning mechanisms: learning to learn
and learning by analogy. In this paper, we attempt to in-
vestigate a new human-like learning method by organically
combining these two mechanisms. In particular, we study
how to generalize the classification parameters from previously
learned concepts to a new concept. We first propose
a novel Visual Analogy Graph Embedded Regression
(VAGER) model to jointly learn a low-dimensional embed-
ding space and a linear mapping function from the embed-
ding space to classification parameters for base classes. We
then propose an out-of-sample embedding method to learn
the embedding of a new class represented by a few sam-
ples through its visual analogy with base classes and derive
the classification parameters for the new class. We conduct
extensive experiments on the ImageNet dataset, and the results
show that our method consistently and significantly
outperforms state-of-the-art baselines.
1. Introduction
The emergence of deep learning has advanced image
classification performance to an unprecedented level. The
error rate on ImageNet has been halved and halved again
[11, 21, 9], even approaching human-level performance.
Despite the success, the state-of-the-art models are notoriously
data hungry, requiring large numbers of samples for parameter
learning. In real cases, however, visual phenomena
follow a long-tail distribution [31] where only a few
sub-categories are data-rich and the rest have limited training
samples. How to learn a classifier from as few samples as
possible is critical for real applications and fundamental for
exploring new learning mechanisms.
Compared with machines, people are far better learners
as they are capable of learning models from very limited
samples of a new category and make accurate prediction
and judgment accordingly. An intuitive example is that a
baby learner can learn to recognize a wolf with only a few
sample images provided that he/she has been able to suc-
cessfully recognize a dog. The key mystery making the dif-
ference is that people have strong prior knowledge to gen-
eralize across different categories [13]. It means that people
do not need to learn a new classifier (e.g. wolf) from scratch
as most machine learning methods do, but generalize and adapt
the previously learned classifiers (e.g. dog) towards the new
category. A major way to acquire the prior knowledge is
through learning to learn from previous experience. In the
image classification scenario, learning to learn refers to the
mechanism that learning to recognize a new concept can be
accelerated by previously learned other related concepts.
A typical image classifier is constituted by representa-
tion and classification steps, leading to two fundamental
problems in learning to learn image classifiers: (1) how to
generalize the representations from previous concepts to a
new concept, and (2) how to generalize the classification
parameters of previous concepts to a new concept. In the
literature, transfer learning and domain adaptation methods [14]
are proposed with a similar notion, mainly focusing on the
problem of representation generalization across different
domains and tasks. With the development of CNN-based
image classification models, the high-level representations
learned from very large scale labeled dataset are demon-
strated to have good transferability across different concepts
or even different datasets [26], which significantly alleviates
the representation generalization problem. However, how
to generalize the classification parameters in deep models
(e.g. the fc7 layer in AlexNet) from well-trained concepts
to a new concept (with only a few samples) is largely ig-
nored by previous studies.
Learning by analogy has been shown to be a fundamental
building block in the human learning process [7]. A plausible
explanation for the fast learning of a novel class is that
a human learner selects some similar classes from the base
classes by visual analogy, then transfers and combines their
classification parameters for the novel class. In this sense,
visual analogy provides an effective and informative clue for
generalizing image classifiers in a way of human-like learn-
ing. But the limited number of samples in the new class
would cause inaccurate and unstable measurements on vi-
sual analogy in high-dimensional representation space, and
how to transfer the classification parameters from selected
base classes to a new class is also highly non-trivial for the
generalization efficacy.
To address the above problems, we first propose a novel
Visual Analogy Graph Embedded Regression (VAGER)
model to jointly learn a low-dimension embedding space
and a linear mapping function from the embedding space
to classification parameters for base classes. In particular,
we learn a low-dimension embedding for each base class
so that embedding similarity between two base classes can
reflect their visual analogy in the original representation
space. Meanwhile, we learn a linear mapping function from
the embedding of a base class to its previously learned clas-
sification parameters (i.e. the logistic regression parame-
ters). The VAGER model enables the transformation from
the original representation space to embedding space and
further into classification parameters. We then propose an
out-of-sample embedding method to learn the embedding
of a new class represented by a few samples through its vi-
sual analogy with base classes. By inputting the learned
embedding into VAGER, we can derive the classification
parameters for the new class. Note that these classifica-
tion parameters are purely generated from base classes (i.e.
transferred classification parameters), while the samples in
the new class, although only a few, can also be exploited to
generate a set of classification parameters (i.e. model clas-
sification parameters). Therefore, we further investigate the
fusion strategy of the two kinds of parameters so that the
prior knowledge and data knowledge can be fully leveraged.
The framework of the proposed method is illustrated in Fig-
ure 1.
The technical contributions of this paper are threefold.
(1) We introduce the mechanism of visual analogy into im-
age classification, which provides a new way of transfer-
ring classification parameters from previous concepts to a
new concept. (2) We propose a novel VAGER model to
realize the transformation from original representation to
classification parameters for any new class. (3) We inten-
sively evaluate the proposed method and the results show
that our method consistently and significantly outperforms
other baselines.
2. Related Work
One/Few-shot Learning. One/Few-shot learning mainly
focuses on how to train models from just one, or a hand-
ful of images instead of the large-scale training dataset. [5]
first proposed this concept as well as a transfer method via
a Bayesian approach on the low-level visual features. Af-
terward researchers have been working on hand-crafted vi-
sual features. [30, 15] propose transfer mechanism based
on Adaboost-SVM method. They both construct a set of
weak classifiers through the data from the base classes and
learn a new classifier by linearly combining the weak classi-
fiers. Furthermore, [25] proposes an adaptive Least-Square
SVM method. These methods require huge supervised in-
formation to learn the weight of the combined model and
the insufficient representative ability of low-level features
limits their performance.
After deep learning is introduced to the large-scale im-
age classification, benefited from its strong representative
ability, the performance of the few-shot learning is im-
proved gradually. [10] introduces a two-way Siamese
Neural Network to learn the similarity of two input im-
ages as the evaluation metric, which is an early work of
few-shot learning combined with deep learning. After-
wards, meta-learning provides a new training mechanism
and shows great performance on small datasets like Omniglot [12]
and MiniImageNet [27]. MANN [20], Matching Network [27],
MAML [6], Prototypical Network [22], and Relation Network [23]
are some representative works. Their
methods introduce a new training mechanism to completely
simulate evaluation circumstance on m-way k-shot classi-
fication, where training data is split into support sets and
training process is based on the support set, not a single
image. However, they perform not so well on large-scale
datasets like ImageNet. For large-scale datasets, [8] pro-
poses a Squared Gradient Magnitude Loss considering both
the multi-class logistic loss and small dataset training loss,
[29] proposes a Model Regression Network for intra-class
transfer which learns a nonlinear mapping from the model
parameter trained by small-samples to the model parame-
ter trained by large-samples. More recently, a few works
exploit generative models to create more data for training.
[18] takes advantage of the deep generative models to give
a method to produce similar images from given images.
[28] adds a deep hallucinator structure to the original meta-
learning methods and trains the hallucinator and the classi-
fier at the same time.
Learning to Learn Image Classifiers. The problem fo-
cuses on how to learn classifier parameters for a novel class
and the methods are widely used in zero-shot learning and
few-shot learning. [4] and [2] use purely textual description
of categories to learn the parameter of the classifier in zero-
shot image classification. [4] uses a kernel method to learn
from the textual feature to the parameter, while [2] uses a
neural network. Further, [3] learns base classifiers and
constructs classifiers of novel classes utilizing attribute
similarities between classes. Recently, [17] and [16] investigate
how to utilize visual features to generate classifier parame-
ters for novel class and show good performance on few-shot
learning. Different from these previous works, our work
concentrates more on how to generate classification parameters
with visual analogy at category level.
[Figure 1: diagram with two panels, "Training base classes with VAGER" and "Generalization to a new class". Labels: CNN, h(x), fc7_1 ... fc7_n, w^B_1 ... w^B_n, Visual Analogy Graph, Graph Embedding, Embedded Regression, w^N_trans, w^N_model, Late Fusion, w^N; example classes Dog, Cat, Wolf (Similar).]
Figure 1. The framework of learning to learn image classifiers. Training Base Classes with VAGER: By training base classes with VAGER,
we derive the embeddings of each base class and the common mapping function from embeddings to classification parameters.
Generalization to a New Class: Given a new class with only a few samples, we can infer its embedding through out-of-sample inference, and then
transform the embedding into transferred classification parameters by the mapping function learned by VAGER. After training the classifier
with new class samples and getting the model classification parameters, we fuse the two kinds of parameters to form the final classifier.
Graph Embedding. Graph Embedding (Network Em-
bedding) is used to extract the formalized representation
of each node in a large-scale graph or network. The low-
dimension hidden embeddings could capture both the rela-
tionship between nodes and the features of each node itself.
Graph Embedding is widely used in the social network area
to solve the node clustering or link prediction problems etc.
There are many classical algorithms in graph embedding;
we list some of them but not all. For example, [1] uses a
matrix factorization technique which is optimized by SGD
and [24] proposes LINE method which preserves both the
first-order and second-order proximities of each node and
improves the quality of the embeddings etc. Graph embed-
ding is proved to be an effective method in the graph analy-
sis area.
3. Methodology
3.1. Notations and Problem Formulation
Suppose that we have an image set $I$, divided into a
base-class set $I^B = I^B_1 \cup I^B_2 \cup \cdots \cup I^B_n$, in which
each class has sufficient training samples, and a novel-class set
$I^N = I^N_1 \cup I^N_2 \cup \cdots \cup I^N_m$, in which each class has
only a few training samples. We train an AlexNet [11] on $I^B$ as
our base CNN model and extract its fc7 layer as the high-level
features of images. The feature space is denoted as
$\mathcal{X} \subset \mathbb{R}^d$. For each image in $I^B$, we obtain its fc7 layer
feature $\mathbf{x}^B_{ij} \in \mathcal{X}$, where $i = 1, 2, \cdots, n$ represents its class
and $j = 1, 2, \cdots, |I^B_i|$ represents its index in class $i$. We
use the same CNN model to derive high-level representations
for images in novel classes, denoted by $\mathbf{x}^N_{ij}$.
A typical binary classifier can be represented as
$f(\cdot\,; \mathbf{w} \mid \mathcal{X})$, a mapping function $f: \mathbb{R}^d \to \mathbb{R}$
parametrized by $\mathbf{w}$. The input is a $d$-dimensional image feature
vector and the output is the probability that the image
belongs to the class. We use $\mathbf{w}^B_i$ to denote the parameters
for base class $i$ and $\mathbf{w}^N_i$ for novel class $i$. Based on the
above notations, our problem is defined as follows.
Problem 1 (Learning to learn image classifiers) Given
the image features of base classes $X^B$, the well-trained
base classifier parameters $W^B$, and the image features $X^N_i$ of
a novel class $i$ with only a few positive samples, learn
the classification parameters $\mathbf{w}^N_i$ for the novel class, so
that the learned classifier $f(\cdot\,; \mathbf{w}^N_i \mid X^B, W^B, X^N_i)$ can
precisely predict labels for the $i$th novel class.
Note that the problem of learning to learn image classifiers
differs from traditional image classification problems in that
the learning of a classifier for a novel class depends on the
previously learned base-class classifiers and the image rep-
resentations in base classes besides the image samples in
the novel class.
3.2. The VAGER Model
We define a graph G = (V,E) where V is the vertex
set of the graph, with each vertex representing a base class
and |V | = n. E is the edge set of the graph, each edge
represents visual analogy relationship between two classes
with the edge weight depicting the similarity degree. We
use $\mathbf{A}$ to represent the adjacency matrix of the network, and
$A_{ij}$ is the edge weight between vertex $i$ and vertex $j$. $\mathbf{A}_{i,:}$
and $\mathbf{A}_{:,j}$ stand for the $i$-th row and the $j$-th column of $\mathbf{A}$
respectively. In our classification problem, we construct the
visual analogy network as an undirected complete graph, and the
edge weight (i.e. degree of visual analogy) between two
classes is calculated by:
$$A_{ij} = \frac{\bar{\mathbf{x}}^B_i \cdot \bar{\mathbf{x}}^B_j}{\|\bar{\mathbf{x}}^B_i\|_2 \cdot \|\bar{\mathbf{x}}^B_j\|_2}. \tag{1}$$
Here $\bar{\mathbf{x}}^B_i$ is the average feature vector of class $i$, so
this equation is the cosine similarity between two base
classes. Note that our graph is undirected, so
the adjacency matrix $\mathbf{A}$ is symmetric.
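As an illustration of Equation 1, the following is a minimal NumPy sketch that builds the adjacency matrix from per-class mean features. The function name `visual_analogy_graph` and the toy data are assumptions made for this example and do not come from the paper; real inputs would be fc7 features of base-class images.

```python
import numpy as np

def visual_analogy_graph(features, labels, n_classes):
    """Return (class_means, A), where A[i, j] is the cosine similarity
    between the mean feature vectors of classes i and j (Equation 1)."""
    means = np.stack([features[labels == c].mean(axis=0) for c in range(n_classes)])
    unit = means / np.linalg.norm(means, axis=1, keepdims=True)
    return means, unit @ unit.T

# Toy stand-in for fc7 features: 3 hypothetical base classes, 20 images each.
rng = np.random.default_rng(0)
toy_features = rng.random((60, 8))
toy_labels = np.repeat(np.arange(3), 20)
class_means, A = visual_analogy_graph(toy_features, toy_labels, n_classes=3)
```

Because the rows are normalized before the matrix product, the result is symmetric with a unit diagonal, matching the undirected complete graph described above.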
To make the visual analogy measurement robust in
sparse scenarios, we need to reduce the representation space
dimensions. Our basic hypothesis in generalizing classifi-
cation parameters is that if two classes are visually simi-
lar, they should share similar classification parameters. By
imposing a linear mapping function from the embedding
space to classification parameter space, similar embeddings
will result in similar classification parameters. Motivated
by this, we propose a Visual Analogy Graph Embedded Re-
gression model.
Let $\mathbf{V} \in \mathbb{R}^{n \times q}$ be the embeddings for all nodes in the
graph, where each row of $\mathbf{V}$, of dimension $q$, is the embedding
of one vertex. Let $\mathbf{W} \in \mathbb{R}^{n \times p}$ represent all parameters
of the base classifiers. There is also a common linear
transformation matrix $\mathbf{T} \in \mathbb{R}^{q \times p}$, shared by all base classes, that
converts the embedding space to the classification parameter
space for all base classifiers. Then the loss function is
defined as:
$$\mathcal{L}(\mathbf{V}, \mathbf{T}) = \|\mathbf{V}\mathbf{T} - \mathbf{W}\|_F^2 + \beta \|\mathbf{A} - \mathbf{V}\mathbf{V}^\top\|_F^2, \tag{2}$$
where $\|\cdot\|_F$ is the Frobenius norm of the matrix.
The first term enforces the embeddings to be able to con-
vert into the classification parameter through a linear trans-
formation. The second term constrains the embeddings to
preserve the structure of the visual analogy graph. Our goal
is to find the matrix V and T to minimize this loss function.
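This is a common unconstrained optimization problem in two
variables, and we use alternating coordinate descent
to find the best solution for $\mathbf{V}$ and $\mathbf{T}$, where the
gradients are calculated by:
$$\begin{aligned}
\frac{\partial \mathcal{L}(\mathbf{V}, \mathbf{T})}{\partial \mathbf{V}} &= 2(\mathbf{V}\mathbf{T} - \mathbf{W})\mathbf{T}^\top + \beta(-4\mathbf{A}\mathbf{V} + 4\mathbf{V}\mathbf{V}^\top\mathbf{V}), \\
\frac{\partial \mathcal{L}(\mathbf{V}, \mathbf{T})}{\partial \mathbf{T}} &= 2\mathbf{V}^\top(\mathbf{V}\mathbf{T} - \mathbf{W}).
\end{aligned} \tag{3}$$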
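The loss of Equation 2 and the gradients of Equation 3 can be written down directly. The sketch below is one possible reading of the alternating scheme, assuming plain gradient steps with a fixed learning rate; the step size, iteration count, and initialization are illustrative choices, not settings reported in the paper.

```python
import numpy as np

def vager_loss(V, T, W, A, beta):
    """Equation 2: regression term plus graph-reconstruction term."""
    return np.sum((V @ T - W) ** 2) + beta * np.sum((A - V @ V.T) ** 2)

def vager_grads(V, T, W, A, beta):
    """Equation 3 (valid for symmetric A)."""
    R = V @ T - W
    grad_V = 2 * R @ T.T + beta * (-4 * A @ V + 4 * V @ (V.T @ V))
    grad_T = 2 * V.T @ R
    return grad_V, grad_T

def train_vager(W, A, q, beta=1.0, lr=1e-3, n_iters=200, seed=0):
    """Alternate one gradient step on V and one on T; return the loss history too."""
    rng = np.random.default_rng(seed)
    n, p = W.shape
    V = 0.1 * rng.standard_normal((n, q))
    T = 0.1 * rng.standard_normal((q, p))
    losses = [vager_loss(V, T, W, A, beta)]
    for _ in range(n_iters):
        grad_V, _ = vager_grads(V, T, W, A, beta)
        V = V - lr * grad_V                      # update V with T fixed
        _, grad_T = vager_grads(V, T, W, A, beta)
        T = T - lr * grad_T                      # update T with V fixed
        losses.append(vager_loss(V, T, W, A, beta))
    return V, T, losses
```

A finite-difference comparison against `vager_loss` is a cheap way to confirm the gradient formulas before running on real classifier parameters.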
3.3. Embedding Inference for Novel Classes
By training VAGER model in base classes, we can obtain
the embeddings for each base class and the mapping func-
tion from embeddings to classification parameters. Given
a new class with only a few samples, we need to infer its
embedding. Suppose the embedding for the novel class is
$\mathbf{v}_{new} \in \mathbb{R}^q$. We calculate the similarity of the novel class with
all base classes by Equation 1, and we denote this similarity
vector by $\mathbf{a}_{new} \in \mathbb{R}^n$.
Then we define the objective function for the novel class
embedding inference and our goal is to minimize the fol-
lowing function:
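$$\mathcal{L}(\mathbf{v}_{new}) = \left\| \begin{bmatrix} \mathbf{A} & \mathbf{a}_{new}^\top \\ \mathbf{a}_{new} & 1 \end{bmatrix} - \begin{bmatrix} \mathbf{V} \\ \mathbf{v}_{new} \end{bmatrix} \begin{bmatrix} \mathbf{V}^\top & \mathbf{v}_{new}^\top \end{bmatrix} \right\|_F^2. \tag{4}$$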
Equation 4 is in fact the extension of the second term in
Equation 2. As we have little information about the classifi-
cation parameters of the novel class, we omit the first term
in Equation 2.
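After dropping the terms that are independent of $\mathbf{v}_{new}$, the final
minimization problem for us to solve is:
$$\min_{\mathbf{v}_{new}} \mathcal{L}(\mathbf{v}_{new}) = 2\|\mathbf{a}_{new} - \mathbf{v}_{new}\mathbf{V}^\top\|_2^2 + (\mathbf{v}_{new}\mathbf{v}_{new}^\top - 1)^2. \tag{5}$$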
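The step from Equation 4 to Equation 5 can be checked by expanding the block matrix, treating $\mathbf{v}_{new}$ and $\mathbf{a}_{new}$ as row vectors (the convention used in Equation 6). The squared Frobenius norm is the sum of the squared norms of the four blocks:

$$\mathcal{L}(\mathbf{v}_{new}) = \|\mathbf{A} - \mathbf{V}\mathbf{V}^\top\|_F^2 + \|\mathbf{a}_{new}^\top - \mathbf{V}\mathbf{v}_{new}^\top\|_2^2 + \|\mathbf{a}_{new} - \mathbf{v}_{new}\mathbf{V}^\top\|_2^2 + (1 - \mathbf{v}_{new}\mathbf{v}_{new}^\top)^2.$$

The first term does not involve $\mathbf{v}_{new}$, and the two off-diagonal blocks are transposes of each other, so they contribute the same value. This leaves $2\|\mathbf{a}_{new} - \mathbf{v}_{new}\mathbf{V}^\top\|_2^2$ plus the scalar term from the bottom-right block.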
In fact, the second term of Equation 5 is a regularization
term. We omit the second term and thus the first term is in
the form of a linear regression loss. Then we can get the
explicit solution for vnew without using gradient descent.
The solution is represented as:
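$$\mathbf{v}_{new} = \mathbf{a}_{new}(\mathbf{V}^\top)^+, \tag{6}$$
where $\mathbf{M}^+$ is the Moore-Penrose pseudo-inverse of matrix
$\mathbf{M}$, which equals $(\mathbf{M}^\top\mathbf{M})^{-1}\mathbf{M}^\top$ when $\mathbf{M}$ has full column rank.
For $\mathbf{V}^\top$, which has full row rank when the columns of $\mathbf{V}$ are linearly
independent, it equals $\mathbf{V}(\mathbf{V}^\top\mathbf{V})^{-1}$. Note that we could speed
up the algorithm by pre-computing the pseudo-inverse of
$\mathbf{V}^\top$.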
After deriving the embedding for the new class, we
can easily obtain its transferred classification parameters by
multiplying transformation matrix T:
$$\mathbf{w}^N_{new} = \mathbf{v}_{new}\mathbf{T}. \tag{7}$$
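Equations 6 and 7 amount to two matrix products once the pseudo-inverse is cached. The sketch below assumes row-vector conventions as in the text; the helper names `cosine_to_bases` and `infer_novel_classifier` are hypothetical and chosen for this example.

```python
import numpy as np

def cosine_to_bases(novel_feats, base_means):
    """Similarity vector a_new: cosine (Equation 1) between the mean of the
    few novel-class features and each base-class mean feature."""
    m = novel_feats.mean(axis=0)
    return (base_means @ m) / (np.linalg.norm(base_means, axis=1) * np.linalg.norm(m))

def infer_novel_classifier(a_new, V_T_pinv, T):
    """Equation 6 then Equation 7. V_T_pinv is np.linalg.pinv(V.T), shape (n, q),
    computed once and reused for every novel class."""
    v_new = a_new @ V_T_pinv      # (n,) @ (n, q) -> (q,)
    w_new = v_new @ T             # (q,) @ (q, p) -> (p,)
    return v_new, w_new
```

Typical use would compute `V_T_pinv = np.linalg.pinv(V.T)` once after training and then call `infer_novel_classifier` per novel class.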
3.4. Parameter Refinement
As mentioned above, we can also learn the classifica-
tion parameters of a new class from its samples (although
only a few), and we call them model classification parame-
ters. Then we need to fuse the transferred classification pa-
rameters and model classification parameters into the final
classifier. Here we present three strategies for refinement:
Initializing, Tuning, and Voting.
Let $f(\cdot, \mathbf{w}^N): \mathbb{R}^d \to [0, 1]$ be the binary classifier for a
new class. $X_T$ is the mixture set of positive and negative
samples, and $y$ is the label, with $y = 1$ indicating a positive
sample and $y = 0$ indicating a negative sample.
Initializing We use the transferred classification parame-
ters as an initialization and then re-learn the parameters of
new classifier by the new class samples. The training loss
function is defined as the common loss function for classi-
fication. That is:
$$\mathcal{L}(\mathbf{w}^N) = \Big\{ \sum_{\mathbf{x} \in X_T} L(f(\mathbf{x}, \mathbf{w}^N), y) \Big\} + \lambda \cdot R(\mathbf{w}^N), \tag{8}$$
where L(·, ·) is the prediction error and we use cross-
entropy loss in our experiment. R(·) is a regularization
term and we use L2-norm in our experiment. For learning
$\mathbf{w}^N$, we use batched Stochastic Gradient Descent
(SGD), with $\mathbf{w}^N$ initialized to the transferred
classification parameters $\mathbf{w}^N_{trans}$.
Tuning We train the model classification parameters with
new class samples, while adding a loss term to constrain the
similarity of the transferred classification parameters and
the final parameter:
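$$\mathcal{L}(\mathbf{w}^N) = \Big\{ \sum_{\mathbf{x} \in X_T} L(f(\mathbf{x}, \mathbf{w}^N), y) \Big\} + \lambda \cdot \|\mathbf{w}^N - \mathbf{w}^N_{trans}\|_F^2. \tag{9}$$
Here, $\mathbf{w}^N_{trans}$ is the transferred parameter we obtain from
the previous steps (i.e. $\mathbf{w}^N_{new}$ in Equation 7). We still use
the batched SGD method with a random initialization to
solve for $\mathbf{w}^N$.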
Voting This method is a weighted combination of the transferred
classification parameters and the learned model classification
parameters. First, we learn $\mathbf{w}^N_{model}$ using
Equation 8 with random initialization. Then we get the final
parameter by:
$$\mathbf{w}^N = \mathbf{w}^N_{trans} + \lambda \cdot \mathbf{w}^N_{model}. \tag{10}$$
The hyper-parameter λ serves as a voting weight.
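The three refinement strategies differ only in the starting point, the regularization anchor, and the final combination. The sketch below illustrates this with a small logistic-regression trainer. It makes several simplifying assumptions that are not from the paper: full-batch gradient descent stands in for batched SGD, the data term is averaged rather than summed (which only rescales λ), there is no bias term, and all function names and default values are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logreg(X, y, w_init, lam, w_anchor=None, lr=0.1, n_iters=500):
    """Minimize mean cross-entropy + lam * ||w - w_anchor||^2 by gradient descent.
    w_anchor=None anchors at zero (plain L2 regularization, as in Equation 8);
    w_anchor=w_trans gives the Tuning objective of Equation 9."""
    w = np.array(w_init, dtype=float)
    anchor = np.zeros_like(w) if w_anchor is None else np.asarray(w_anchor, dtype=float)
    for _ in range(n_iters):
        grad = X.T @ (sigmoid(X @ w) - y) / len(y) + 2.0 * lam * (w - anchor)
        w -= lr * grad
    return w

def random_init(dim, seed=0):
    return 0.01 * np.random.default_rng(seed).standard_normal(dim)

def initializing(X, y, w_trans, lam=0.01):
    # Start from the transferred parameters, then re-learn on the new samples.
    return fit_logreg(X, y, w_init=w_trans, lam=lam)

def tuning(X, y, w_trans, lam=0.1):
    # Random start, with a penalty pulling the solution towards w_trans.
    return fit_logreg(X, y, w_init=random_init(len(w_trans)), lam=lam, w_anchor=w_trans)

def voting(X, y, w_trans, vote_weight=0.5, lam=0.01):
    # Equation 10: w_trans plus a weighted, independently trained w_model.
    w_model = fit_logreg(X, y, w_init=random_init(len(w_trans)), lam=lam)
    return w_trans + vote_weight * w_model
```

In this sketch `vote_weight` plays the role of λ in Equation 10 and is kept separate from the regularization strength `lam` to avoid overloading one name.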
3.5. Complexity Analysis
During the training process of our VAGER model, the
main cost is to calculate the gradient of the loss function
$\mathcal{L}(\mathbf{V}, \mathbf{T})$. For the first derivative of $\mathcal{L}$
with respect to $\mathbf{V}$, the complexity per iteration is
$O(nq \cdot \max(p, n))$. For the first derivative of $\mathcal{L}$ with respect to
$\mathbf{T}$, the complexity per iteration is $O(nq \cdot \max(p, q))$. When
predicting a novel class, if we use Equation 6 for acceleration,
we can pre-compute $(\mathbf{V}^\top)^+$ in $O(nq^2)$,
and for each novel class the complexity of the prediction
process is $O(q \cdot \max(p, n))$.
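To give a sense of scale, the bounds can be evaluated for the configuration used in the experiments, $n = 800$ base classes and $q = 600$ embedding dimensions, under the assumption that $p = 4096$ (the fc7 width of AlexNet, ignoring any bias term):

$$\begin{aligned}
nq \cdot \max(p, n) &= 800 \times 600 \times 4096 \approx 1.97 \times 10^{9}, \\
nq \cdot \max(p, q) &= 800 \times 600 \times 4096 \approx 1.97 \times 10^{9}, \\
nq^2 &= 800 \times 600^2 = 2.88 \times 10^{8}, \\
q \cdot \max(p, n) &= 600 \times 4096 \approx 2.46 \times 10^{6}.
\end{aligned}$$

Under these assumptions, inferring the parameters for one novel class costs roughly three orders of magnitude fewer operations than a single training iteration.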
4. Experiments
4.1. Data and Experimental Settings
In our experiments, we mainly use the ImageNet dataset
[19], whose training set contains over 1.2 million images in
1,000 categories. We randomly divide the ImageNet train-
ing dataset into 800 base classes and 200 novel classes. 10
of the novel classes are used for validation to confirm the
hyper-parameters and the other 190 novel classes are used
for testing. We retrain the AlexNet on the 800 base classes
as our base CNN model, where the training setting is the
same as [11]. After training, we use the fc7 layer of AlexNet
as the high-level representations for images and the param-
eters from fc7 to fc8 as the base classifiers’ parameters (i.e.
matrix W in Equation 2). As our algorithm does not depend
on the base model structure, we simply choose AlexNet as
our base model in this paper. Moreover, when implementing
our algorithm, we use a 600-dimensional embedding space
and the training hyper-parameter β is set to 1.0.
We evaluate the performance of our algorithm from two
aspects: Section 4.2 and Section 4.3 show a binary classifi-
cation problem, where the new classifier is learned to clas-
sify the novel class (as positive samples) and all the base
classes (as negative samples). This setting eliminates the re-
lationship between novel classes and is convenient for us to
validate each novel class independently, which is helpful to
find the applicability of our algorithm, as Section 4.3 illus-
trates. In the training phase, we randomly select k images
as the training set for each novel class to simulate k-shot
learning scenario. In the testing phase, given a novel class,
we randomly select 500 images (no overlap with the train-
ing set) from it as the positive examples and randomly select
5 images from each base class of the ImageNet validation
set as negative samples. To eliminate randomness, for any
k-shot setting, we run 50 times and report the average re-
sult in the following experiments. Section 4.4 shows an m-
way k-shot classification problem, where the new classifier
is learned to classify among the m novel classes, which is
consistent with the classical setting in few-shot learning. In
the training phase, we randomly select m novel classes and
select k images from each of these classes as the training
dataset. In the testing phase, we randomly select 5 images
per novel class from the rest images as the testing dataset.
The experiment is repeated 500 times under each m-way
k-shot setting.
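The m-way k-shot protocol described above can be sketched as a sampling routine. This is a minimal illustration, assuming a hypothetical mapping `images_by_class` from novel-class name to a list of image identifiers; neither the name nor the data layout comes from the paper.

```python
import numpy as np

def sample_episode(images_by_class, m, k, n_test=5, rng=None):
    """Pick m novel classes, then k training and n_test disjoint test images per class."""
    rng = np.random.default_rng() if rng is None else rng
    names = sorted(images_by_class)
    chosen = rng.choice(len(names), size=m, replace=False)
    train, test = {}, {}
    for i in chosen:
        c = names[i]
        ids = rng.permutation(images_by_class[c])
        train[c] = ids[:k].tolist()               # k-shot training images
        test[c] = ids[k:k + n_test].tolist()      # test images drawn from the rest
    return train, test
```

Repeating this call with fresh random draws and averaging the resulting accuracies mirrors the repeated-trial evaluation described in the text.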
The evaluation metrics in our experiments are the Area Under
the Curve (AUC) of the Receiver Operating Characteristic
(ROC) and the F1-score, which are widely used in binary
classification. We report the average AUC and F1-score
across all test classes. As to the m-way k-shot classifica-
tion, we use average top-1 accuracy across m novel classes.
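For reference, both binary metrics can be computed in a few lines of NumPy. The sketch below uses the pairwise-ranking definition of AUC (the probability that a random positive scores above a random negative, counting ties as one half) and the standard F1 formula; the function names are illustrative, and a library implementation could be substituted.

```python
import numpy as np

def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / (len(pos) * len(neg))

def f1_score(pred, labels):
    """F1 = 2TP / (2TP + FP + FN) for binary 0/1 predictions."""
    tp = np.sum((pred == 1) & (labels == 1))
    fp = np.sum((pred == 1) & (labels == 0))
    fn = np.sum((pred == 0) & (labels == 1))
    return 2 * tp / (2 * tp + fp + fn) if tp > 0 else 0.0
```

AUC depends only on the ranking of the scores, whereas F1 requires choosing a decision threshold first.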
We compare our method with the baselines below. The
complete version of our method is VAGER+Voting.
Logistic Regression (LR) Common logistic regression
model on novel classes. In the setting of multi-class
classification, it becomes Softmax Regression. Note that LR is
also equivalent to fine-tuning the last layer of AlexNet.
Weighted Logistic Regression (Weighted-LR) Here we
use the weighted average of the base classifiers’ parameters
as the classification parameters for a new class. The weights
are the L2-normalized cosine similarities between
the features of the novel class and the 10 most similar
base classes. This method can also be regarded as a visual
analogy approach, but the transferring process is heuristic.
VAGER This is the VAGER algorithm without parameter
refinement step.
VAGER(-Mapping) We directly learn the embedding by
Equation 2 without the first regression term. Then we use
the above weighted-LR method in the embedding space in-
stead of the original feature space. This method is used to
evaluate the effectiveness of the mapping function.
VAGER(-Embedding) We directly train a regression
model from the original feature space to the classification
parameter space without the visual analogy graph embed-
ding. This method is used to demonstrate the effectiveness
of class node embedding over the visual analogy network.
Besides, we also consider some state-of-the-art algo-
rithms as our baselines in multi-class classification set-
ting, such as Model Regression Network (MRN)[29],
Matching Network (MatchingNet)[27], Prototypical Net-
work (ProtoNet)[22] and the method proposed in [17] (Ac-
tivationNet). Note that for MatchingNet and ProtoNet, we
use a two-layer fully-connected neural network as the em-
bedding architecture, which is consistent with [28].
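The Weighted-LR baseline listed above admits a compact sketch. This is one reading of its description, assuming the weights are the cosine similarities to the 10 most similar base classes, divided by their L2 norm; the function name and argument layout are hypothetical.

```python
import numpy as np

def weighted_lr(a_new, W_base, top=10):
    """a_new: (n,) cosine similarities to base classes; W_base: (n, p) base parameters.
    Returns a similarity-weighted combination of the `top` most similar base classifiers."""
    idx = np.argsort(a_new)[-top:]                       # indices of the most similar classes
    weights = a_new[idx] / np.linalg.norm(a_new[idx])    # L2-normalized similarities
    return weights @ W_base[idx]
```

Because the similarities are computed from only a few novel samples and the rule has no trainable component, this sketch has no mechanism for improving as more shots become available.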
4.2. Binary Classification
In this section, we evaluate how well the classifiers
learned by our method and other baselines can perform in
novel classes on binary classification setting.
The results are shown in Table 1. In all low-shot settings,
our method VAGER+Voting consistently performs the best
in both AUC and F1 metrics. In contrast, LR performs the
worst in 1-shot setting, which demonstrates the importance
of generalization from base classes when the new class has
very few samples. MRN does not work well in most set-
tings, demonstrating that its basic hypothesis that the classi-
fication parameters trained by large samples and small sam-
ples respectively are correlated does not necessarily hold
in real data. By comparing VAGER+Voting with the other
five variant versions of our method, we can safely draw
the conclusion that the major ingredients in our method,
including network embedding for low dimensional repre-
sentations, mapping function for transforming embedding
space to classification parameter space, as well as the re-
finement strategy are necessary and effective and the results
support that the Voting strategy performs the best in our sce-
nario.
Furthermore, we compare the performances of these
methods in different low-shot settings, and the results are
shown in Figure 2. Our method consistently performs the
best in all settings, and the advantage of our method is more
obvious when the novel classes have fewer training samples.
Especially, by comparing our method and LR, we can see
that LR needs about 20 shots to reach AUC 0.9, while we
only need 2 shots, indicating that we can save 90% training
data. An interesting phenomenon is that the performance
of Weighted-LR does not change as the shot number in-
creases. The main reason is that the heuristic rule is not flex-
ible enough to incorporate new information, which demon-
strates the importance of learning to learn, rather than rule-
based learning.
[Figure 2: line plot of AUC (y-axis, 0.75 to 1.00) against number of shots (x-axis: 1, 2, 3, 4, 5, 10, 20, 50) for VAGER+Voting, VAGER, LR, Weighted LR, and MRN.]
Figure 2. The change of performance as the number of shots increases in binary classification.
4.3. Insightful Analysis
Although our method performs the best in different set-
tings, the failure cases are easy to find. We are interested
in the following questions: (1) What are the typical fail-
ure cases? (2) What is the driving factor that controls the
success of generalization? (3) Is the generalization
process explainable?
In order to answer the above questions, we further con-
duct an insightful analysis. We randomly select 10 novel
classes, and list the performance of our method compared
with LR in one-shot setting on these classes, as shown in
Table 2. It’s obvious that the effect of generalization is no-
table in 9 of them, but in the bubble class, the generalization
plays a negative role.
To discover the driving factor controlling success or fail-
ure of the generalization, we define and calculate the simi-
larity ratio (SR) of a novel class with the base classes by:
$$SR = \frac{\text{Average Top-}K\text{ Similarity with Base Classes}}{\text{Average Similarity with Base Classes}} \tag{11}$$
Table 1. Performance of different algorithms for k-shot binary classification problem

| Algorithm | 1-shot AUC | 1-shot F1 | 5-shot AUC | 5-shot F1 | 10-shot AUC | 10-shot F1 | 20-shot AUC | 20-shot F1 |
|---|---|---|---|---|---|---|---|---|
| VAGER | 0.8556 | 0.5292 | 0.9271 | 0.6491 | 0.9379 | 0.6721 | 0.9432 | 0.6850 |
| VAGER+Initializing | 0.7662 | 0.3941 | 0.9030 | 0.6185 | 0.9338 | 0.6887 | 0.9461 | 0.7237 |
| VAGER+Tuning | 0.7923 | 0.4244 | 0.9098 | 0.6307 | 0.9365 | 0.7012 | 0.9466 | 0.7268 |
| VAGER+Voting | 0.8718 | 0.5671 | 0.9425 | 0.7039 | 0.9543 | 0.7343 | 0.9607 | 0.7510 |
| VAGER(-Mapping) | 0.8261 | 0.4551 | 0.8526 | 0.4807 | 0.8726 | 0.5179 | 0.8897 | 0.5394 |
| VAGER(-Embedding) | 0.7922 | 0.4335 | 0.9032 | 0.6015 | 0.9183 | 0.6347 | 0.9393 | 0.6788 |
| LR | 0.7705 | 0.3994 | 0.8885 | 0.5882 | 0.9134 | 0.6421 | 0.9341 | 0.6877 |
| Weighted-LR | 0.8440 | 0.4775 | 0.8458 | 0.4813 | 0.8509 | 0.4835 | 0.8468 | 0.4801 |
| MRN | 0.8083 | 0.4511 | 0.9175 | 0.6653 | 0.9361 | 0.7133 | 0.9474 | 0.7388 |
Here the similarity of two classes is calculated by Equation
1. Intuitively, if a new class is similar to the top-K base
classes, while dissimilar to the remaining base classes, its
Similarity Ratio will be high, meaning that this new class
can benefit more from the base classes.
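The Similarity Ratio of Equation 11 is straightforward to compute from the similarity vector of a novel class. The following is a minimal sketch in which `a_new` holds the cosine similarities to all base classes; the function name is an assumption for this example.

```python
import numpy as np

def similarity_ratio(a_new, K=10):
    """Equation 11: mean of the K largest similarities divided by the mean of all."""
    top_k = np.sort(a_new)[-K:]
    return top_k.mean() / a_new.mean()
```

A class whose similarities are all equal has a ratio of exactly 1, and the ratio grows as a few base classes stand out from the rest.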
For each new class, we calculate the relative improvement
in AUC of our method over the non-transfer method LR
in the 1-shot setting, and perform linear regression of this
improvement on the class's Similarity Ratio with K = 10. The
dependent variable indicates the degree of success of generalization.
We plot the similarity ratio and relative improvement of all
novel classes in Figure 3. We can see that the relative
improvement in a new class is positively correlated with the
similarity ratio of the new class, with a 95% confidence
interval for the regression coefficient between 0.124 and
0.169 and $R^2 = 0.45$, showing that the similarity ratio could
explain 45% of the variance of the dependent variable.
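The regression analysis described here can be reproduced with an ordinary least-squares line fit. The sketch below shows how the slope and $R^2$ would be obtained from per-class similarity ratios `x` and AUC improvements `y`. The function name is illustrative, and any inputs must be supplied by the reader, since the per-class values are not listed in the paper.

```python
import numpy as np

def fit_line(x, y):
    """Least-squares fit y ~ slope * x + intercept, plus the coefficient of determination."""
    slope, intercept = np.polyfit(x, y, deg=1)
    resid = y - (slope * x + intercept)
    r2 = 1.0 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)
    return slope, intercept, r2
```

For a single predictor, $R^2$ equals the square of the Pearson correlation between `x` and `y`.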
The results indicate that our method is consistent
with the notion of human-like learning. First, we can
learn a new concept faster if it is more similar to some
previously learned concepts (i.e. increasing the
numerator of the Similarity Ratio). Second, we can learn a
new concept faster if we have learned more diversified
concepts (i.e. decreasing the denominator of the
Similarity Ratio). This principle can also be used to guide
the generalization process and help to determine whether a
new class is fit for generalization.
Finally, we validate whether the generalization process is
explainable. Here we randomly select 5 novel classes, and
for each novel class, we visualize the top-3 base classes that
are most similar to the novel class in the visual analogy
graph, as shown in Figure 4. In our method, these base
classes have a large impact on the formation of the new
classifier. We can see that the top-3 base classes are visu-
ally correlated with the novel classes, and the generalization
process can be very intuitive and explainable.
Table 2. Comparison of VAGER and LR over novel classes with 1-shot binary classification setting

| Category | LR (No Transfer) | VAGER (Transfer) |
|---|---|---|
| Jeep | 0.8034 | 0.9469 |
| Zebra | 0.8472 | 0.9393 |
| Hen | 0.7763 | 0.8398 |
| Lemon | 0.6854 | 0.9583 |
| Bubble | 0.7455 | 0.7041 |
| Pineapple | 0.7364 | 0.8623 |
| Lion | 0.8305 | 0.9372 |
| Screen | 0.7801 | 0.9056 |
| Drum | 0.6510 | 0.6995 |
| Restaurant | 0.7806 | 0.8787 |
[Figure 3: plot of AUC increase (y-axis, labeled "AUC Increasing") against Similarity Ratio (x-axis, 1.4 to 3.0).]
Figure 3. Linear regression of AUC improvement on Similarity Ratio for all novel classes
4.4. Multiclass Classification
In this section, we evaluate our algorithm on multi-class
classification and show that it performs well from three
aspects. All baselines in Section 4.2 are extended to their
multi-class versions for these experiments.
The first experiment is to validate the robustness of our
algorithm. We randomly select 10 categories from the novel
Table 3. Top-1 Accuracy for m classes 1-shot problem

Algorithm       10 cls/G1  10 cls/G2  10 cls/G3  10 cls/G4  10 cls/G5  30 cls/G1  50 cls/G1  100 cls/G1
VAGER+Voting    67.59%     63.96%     58.02%     51.27%     56.24%     40.73%     38.69%     28.38%
LR              61.97%     59.72%     52.97%     47.51%     52.01%     37.32%     34.75%     23.94%
Weighted-LR     63.13%     60.09%     50.32%     46.13%     49.81%     36.77%     34.64%     23.60%
MRN             64.55%     61.82%     54.74%     48.85%     54.54%     39.43%     37.78%     27.16%
MatchingNet     65.69%     61.74%     57.13%     48.56%     54.34%     39.04%     37.05%     27.21%
ProtoNet        47.98%     47.18%     40.20%     35.86%     41.55%     30.15%     28.12%     21.28%
ActivationNet   65.04%     62.42%     55.62%     48.61%     53.85%     40.15%     37.41%     27.68%
[Figure 4 image grid: novel classes and their top-3 similar base classes]
Novel Class   Top-3 Similar Base Classes
Jeep          Pickup, Beach_wagon, Tow_truck
Lemon         Orange, Acorn, Granny_Smith
Lion          Cougar, Lynx, Dingo
Screen        Monitor, Laptop, Television
Restaurant    Bakery, Shoe_shop, Marimba
Figure 4. Top-3 most similar base classes to novel class on
embedding layer in 5-shot setting.
[Figure 5 plot: x-axis: Shots (1, 2, 3, 4, 5, 10, 20, 50); y-axis: Top-1 Accuracy; curves: VAGER+Voting, LR, Weighted LR, MRN, MatchingNet, ProtoNet, ActivationNet]
Figure 5. Change of performance as shot number increases in 10
classes 1-shot multi-class classification problem.
test categories and learn to distinguish these 10 categories
in the 1-shot setting. We repeat the random selection five
times, and the results are shown in the first 5 columns of
Table 3. Our VAGER+Voting performs the best in all 5 groups,
with an improvement of around 2% in average top-1 accuracy,
which demonstrates that our method is robust to the choice of
novel classes.
The second experiment evaluates our method on different
numbers of novel classes. We design a 10/30/50/100-way
1-shot setting. The results are shown in the last four
columns of Table 3. Our algorithm consistently achieves the
best performance.
The third experiment evaluates our method with different
numbers of shots. We fix the number of novel classes and
vary the number of shots used for learning the novel
classifiers. We randomly choose 10 novel classes and test
our algorithm and the baselines on 1/2/3/4/5/10/20/50 shots.
The result is shown in Figure 5. In all scenarios, our
algorithm performs the best. Although MatchingNet and
ProtoNet perform well on small datasets such as Omniglot [12]
and MiniImageNet [27], their performance on a large-scale
dataset is not satisfactory. One plausible reason is that the
effectiveness of their meta-learning mechanism is limited
when the embedding architecture is already representative
enough. MRN and ActivationNet also adopt the learning to
learn mechanism. We attribute the advantage of our method
over these two baselines to the learning by analogy
mechanism, which is inspired by human learning.
5. Conclusions
In this paper, we investigate the problem of learning to
learn image classifiers and explore a new human-like
learning mechanism that fully leverages previously learned
concepts to assist new concept learning. In particular, we
organically combine the ideas of learning to learn and
learning by analogy and propose a novel VAGER model to
carry out the generalization process from base classes to
novel classes. Extensive experiments show that the proposed
method is consistent with human-like learning and provides
an insightful and intuitive generalization process.
Acknowledgement: This work was supported in part by
National Program on Key Basic Research Project (No.
2015CB352300), National Natural Science Foundation of
China (No. 61772304, No. 61521002, No. 61531006,
No. 61702296), National Natural Science Foundation of
China Major Project (No. U1611461), the research fund of
Tsinghua-Tencent Joint Laboratory for Internet Innovation
Technology, and the Young Elite Scientist Sponsorship Pro-
gram by CAST.
References
[1] Amr Ahmed, Nino Shervashidze, Shravan Narayanamurthy,
Vanja Josifovski, and Alexander J Smola. Distributed large-
scale natural graph factorization. In Proceedings of the 22nd
international conference on World Wide Web, pages 37–48.
ACM, 2013. 3
[2] Jimmy Lei Ba, Kevin Swersky, Sanja Fidler, and Ruslan
Salakhutdinov. Predicting deep zero-shot convolutional neu-
ral networks using textual descriptions. In IEEE Interna-
tional Conference on Computer Vision, pages 4247–4255,
2015. 2
[3] Soravit Changpinyo, Wei Lun Chao, Boqing Gong, and Fei
Sha. Synthesized classifiers for zero-shot learning. In IEEE
Conference on Computer Vision and Pattern Recognition,
pages 5327–5336, 2016. 2
[4] Mohamed Elhoseiny, Babak Saleh, and Ahmed Elgammal.
Write a classifier: Zero-shot learning using purely textual
descriptions. In IEEE International Conference on Computer
Vision, pages 2584–2591, 2014. 2
[5] Li Fei-Fei, Rob Fergus, and Pietro Perona. One-shot learning
of object categories. IEEE transactions on pattern analysis
and machine intelligence, 28(4):594–611, 2006. 2
[6] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-
agnostic meta-learning for fast adaptation of deep networks.
In Proceedings of the 34th International Conference on Ma-
chine Learning, pages 1126–1135, 2017. 2
[7] Dedre Gentner and Keith J Holyoak. Reasoning and learning
by analogy: Introduction. American Psychologist, 52(1):32,
1997. 1
[8] Bharath Hariharan and Ross Girshick. Low-shot visual
recognition by shrinking and hallucinating features. In Pro-
ceedings of the IEEE International Conference on Computer
Vision, pages 3018–3027, 2017. 2
[9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.
Deep residual learning for image recognition. In Proceed-
ings of the IEEE Conference on Computer Vision and Pattern
Recognition, pages 770–778, 2016. 1
[10] Gregory Koch, Richard Zemel, and Ruslan Salakhutdinov.
Siamese neural networks for one-shot image recognition. In
ICML Deep Learning Workshop, volume 2, 2015. 2
[11] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton.
Imagenet classification with deep convolutional neural net-
works. In Advances in neural information processing sys-
tems, pages 1097–1105, 2012. 1, 3, 5
[12] Brenden M. Lake, Ruslan Salakhutdinov, and Joshua B.
Tenenbaum. One-shot learning by inverting a compositional
causal process. In International Conference on Neural In-
formation Processing Systems, pages 2526–2534, 2013. 2,
8
[13] Brenden M Lake, Tomer D Ullman, Joshua B Tenenbaum,
and Samuel J Gershman. Building machines that learn and
think like people. Behavioral and Brain Sciences, 40, 2017.
1
[14] Novi Patricia and Barbara Caputo. Learning to learn, from
transfer learning to domain adaptation: A unifying perspec-
tive. In The IEEE Conference on Computer Vision and Pat-
tern Recognition (CVPR), June 2014. 1
[15] Guo-Jun Qi, Charu Aggarwal, Yong Rui, Qi Tian, Shiyu
Chang, and Thomas Huang. Towards cross-category knowl-
edge propagation for learning visual concepts. In Computer
Vision and Pattern Recognition (CVPR), 2011 IEEE Confer-
ence on, pages 897–904. IEEE, 2011. 2
[16] Hang Qi, Matthew Brown, and David G Lowe. Low-shot
learning with imprinted weights. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition,
pages 5822–5830, 2018. 2
[17] Siyuan Qiao, Chenxi Liu, Wei Shen, and Alan L Yuille. Few-
shot image recognition by predicting parameters from activa-
tions. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pages 7229–7238, 2018. 2,
6
[18] Danilo Jimenez Rezende, Shakir Mohamed, Ivo Danihelka,
Karol Gregor, and Daan Wierstra. One-shot generalization in
deep generative models. arXiv preprint arXiv:1603.05106,
2016. 2
[19] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, San-
jeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy,
Aditya Khosla, Michael Bernstein, Alexander C. Berg, and
Li Fei-Fei. ImageNet Large Scale Visual Recognition Chal-
lenge. International Journal of Computer Vision (IJCV),
115(3):211–252, 2015. 5
[20] Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan
Wierstra, and Timothy Lillicrap. One-shot learning
with memory-augmented neural networks. arXiv preprint
arXiv:1605.06065, 2016. 2
[21] Karen Simonyan and Andrew Zisserman. Very deep convo-
lutional networks for large-scale image recognition. arXiv
preprint arXiv:1409.1556, 2014. 1
[22] Jake Snell, Kevin Swersky, and Richard Zemel. Prototypi-
cal networks for few-shot learning. In Advances in Neural
Information Processing Systems, pages 4077–4087, 2017. 2,
6
[23] Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip HS
Torr, and Timothy M Hospedales. Learning to compare: Re-
lation network for few-shot learning. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recogni-
tion, 2018. 2
[24] Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan,
and Qiaozhu Mei. Line: Large-scale information network
embedding. In Proceedings of the 24th International Con-
ference on World Wide Web, pages 1067–1077. ACM, 2015.
3
[25] Tatiana Tommasi, Francesco Orabona, and Barbara Caputo.
Learning categories from few examples with multi model
knowledge transfer. IEEE transactions on pattern analysis
and machine intelligence, 36(5):928–941, 2014. 2
[26] Eric Tzeng, Judy Hoffman, Trevor Darrell, and Kate Saenko.
Simultaneous deep transfer across domains and tasks. In
Proceedings of the IEEE International Conference on Com-
puter Vision, pages 4068–4076, 2015. 1
[27] Oriol Vinyals, Charles Blundell, Tim Lillicrap, Daan Wier-
stra, et al. Matching networks for one shot learning. In
Advances in Neural Information Processing Systems, pages
3630–3638, 2016. 2, 6, 8
[28] Yu-Xiong Wang, Ross Girshick, Martial Hebert, and
Bharath Hariharan. Low-shot learning from imaginary data.
In Computer Vision and Pattern Recognition (CVPR), 2018.
2, 6
[29] Yu-Xiong Wang and Martial Hebert. Learning to learn:
Model regression networks for easy small sample learning.
In European Conference on Computer Vision, pages 616–
634. Springer, 2016. 2, 6
[30] Yi Yao and Gianfranco Doretto. Boosting for transfer learn-
ing with multiple sources. In Computer vision and pattern
recognition (CVPR), 2010 IEEE conference on, pages 1855–
1862. IEEE, 2010. 2
[31] Xiangxin Zhu, Dragomir Anguelov, and Deva Ramanan.
Capturing long-tail distributions of object subcategories.
In The IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), June 2014. 1