7
Fine-Grained Classification via Mixture of Deep Convolutional Neural Networks ZongYuan Ge †‡* , Alex Bewley ,Christopher McCool ‡* , Ben Upcroft †‡ ,Peter Corke †‡ , Conrad Sanderson * Australian Centre for Robotic Vision, Brisbane, Australia Queensland University of Technology (QUT), Brisbane, Australia * University of Queensland, Brisbane, Australia NICTA, Australia Abstract We present a novel deep convolutional neural network (DCNN) system for fine-grained image classification, called a mixture of DCNNs (MixDCNN). The fine-grained im- age classification problem is characterised by large intra- class variations and small inter-class variations. To over- come these problems our proposed MixDCNN system par- titions images into K subsets of similar images and learns an expert DCNN for each subset. The output from each of the K DCNNs is combined to form a single classifica- tion decision. In contrast to previous techniques, we pro- vide a formulation to perform joint end-to-end training of the K DCNNs simultaneously. Extensive experiments, on three datasets using two network structures (AlexNet and GoogLeNet), show that the proposed MixDCNN sys- tem consistently outperforms other methods. It provides a relative improvement of 12.7% and achieves state-of-the-art results on two datasets. 1. Introduction Fine-grained image classification consists of discrimi- nating between classes in a sub-category of objects, for in- stance the particular species of bird or dog [2, 5, 8, 9, 23]. This is a very challenging problem due to large intra-class variations (due to pose and appearance changes), as well as small inter-class variation (due to only subtle differences in the overall appearance between classes). See Fig. 1 for examples. 1 To cope with the above problems, many fine-grained classification methods have performed parts detection [2, 5, 20, 24] in order to decrease the intra-class variation. Re- cently, an alternative approach was introduced by Ge et al. [13] where the images were first partitioned into K non- overlapping sets and K expert systems were learned. By grouping similar images, the input space is being parti- 1* Authors contributed equally. Figure 1. Example images from the Birdsnap dataset [3] which exhibits large intra-class variations and low inter-class variations. Each column represents a unique class. tioned so that an expert network can better learn the subtle differences between similar samples. Expert selection was performed by training a dedicated gating network which as- signs samples to the most appropriate expert network. This approach has two downsides. Firstly, a separate gating net- work (subset selector) needs to be trained. Secondly, the expert networks are trained only to extract features, leaving the final classification to be performed by a linear support vector machine (SVM). We propose a novel system based on a mixture of deep convolutional neural networks (DCNNs) that pro- vides state-of-the-art performance along with several im- portant properties. Similar to Ge et al. [13], we partition the data into K non-overlapping sets to learn K expert DCNNs. However, unlike [13], the classification decision from the each expert is weighted proportional to the confi- dence of its decision. This allows us to define a single net- work (MixDCNN), comprised of K sub-networks (expert DCNNs), that can be trained to perform classification. This is in contrast to [13], where each expert is used just for fea- ture extraction. Our system has similarities to the gated net- work approach proposed by Jacobs et al. [16], which utilises a separately trained network to select the most appropriate expert network. arXiv:1511.09209v1 [cs.CV] 30 Nov 2015

NICTA, Australia arXiv:1511.09209v1 [cs.CV] 30 Nov 2015 · work approach proposed by Jacobs et al. [16], which utilises a separately trained network to select the most appropriate

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: NICTA, Australia arXiv:1511.09209v1 [cs.CV] 30 Nov 2015 · work approach proposed by Jacobs et al. [16], which utilises a separately trained network to select the most appropriate

Fine-Grained Classification via Mixture of Deep Convolutional Neural Networks

ZongYuan Ge †‡∗, Alex Bewley ‡,Christopher McCool ‡∗, Ben Upcroft †‡,Peter Corke †‡, Conrad Sanderson∗�† Australian Centre for Robotic Vision, Brisbane, Australia

‡ Queensland University of Technology (QUT), Brisbane, Australia∗ University of Queensland, Brisbane, Australia

� NICTA, Australia

Abstract

We present a novel deep convolutional neural network(DCNN) system for fine-grained image classification, calleda mixture of DCNNs (MixDCNN). The fine-grained im-age classification problem is characterised by large intra-class variations and small inter-class variations. To over-come these problems our proposed MixDCNN system par-titions images into K subsets of similar images and learnsan expert DCNN for each subset. The output from eachof the K DCNNs is combined to form a single classifica-tion decision. In contrast to previous techniques, we pro-vide a formulation to perform joint end-to-end trainingof the K DCNNs simultaneously. Extensive experiments,on three datasets using two network structures (AlexNetand GoogLeNet), show that the proposed MixDCNN sys-tem consistently outperforms other methods. It provides arelative improvement of 12.7% and achieves state-of-the-artresults on two datasets.

1. Introduction

Fine-grained image classification consists of discrimi-nating between classes in a sub-category of objects, for in-stance the particular species of bird or dog [2, 5, 8, 9, 23].This is a very challenging problem due to large intra-classvariations (due to pose and appearance changes), as wellas small inter-class variation (due to only subtle differencesin the overall appearance between classes). See Fig. 1 forexamples.

1

To cope with the above problems, many fine-grainedclassification methods have performed parts detection [2, 5,20, 24] in order to decrease the intra-class variation. Re-cently, an alternative approach was introduced by Ge etal. [13] where the images were first partitioned into K non-overlapping sets and K expert systems were learned. Bygrouping similar images, the input space is being parti-

1∗ Authors contributed equally.

(a)

(b) (c)

Figure 1. Example images from the Birdsnap dataset [3] whichexhibits large intra-class variations and low inter-class variations.Each column represents a unique class.

tioned so that an expert network can better learn the subtledifferences between similar samples. Expert selection wasperformed by training a dedicated gating network which as-signs samples to the most appropriate expert network. Thisapproach has two downsides. Firstly, a separate gating net-work (subset selector) needs to be trained. Secondly, theexpert networks are trained only to extract features, leavingthe final classification to be performed by a linear supportvector machine (SVM).

We propose a novel system based on a mixture ofdeep convolutional neural networks (DCNNs) that pro-vides state-of-the-art performance along with several im-portant properties. Similar to Ge et al. [13], we partitionthe data into K non-overlapping sets to learn K expertDCNNs. However, unlike [13], the classification decisionfrom the each expert is weighted proportional to the confi-dence of its decision. This allows us to define a single net-work (MixDCNN), comprised of K sub-networks (expertDCNNs), that can be trained to perform classification. Thisis in contrast to [13], where each expert is used just for fea-ture extraction. Our system has similarities to the gated net-work approach proposed by Jacobs et al. [16], which utilisesa separately trained network to select the most appropriateexpert network.

arX

iv:1

511.

0920

9v1

[cs

.CV

] 3

0 N

ov 2

015

Page 2: NICTA, Australia arXiv:1511.09209v1 [cs.CV] 30 Nov 2015 · work approach proposed by Jacobs et al. [16], which utilises a separately trained network to select the most appropriate

The proposed MixDCNN system allows us to jointlytrain the network, which has two advantages: (i) it obvi-ates the need for a separate gating network, and (ii) samplescan be re-assigned to the most appropriate expert networkduring the training process. Empirical evaluations show thatthis approach outperforms related approaches such as sub-set feature learning [13], a gated DCNN approach similarto [16], and an ensemble of classifiers.

The paper is continued as follows. In Section 2 webriefly review recent advances in fine-grained classificationand overview approaches to learn multiple expert classi-fiers, particularly within the field of neural networks. InSection 3 we present our proposed MixDCNN approach indetail. Section 4 is devoted to a comparative evaluationagainst several recent methods on the task of fine-grainedclassification. Conclusions and possible future avenues ofresearch are given in Section 5.

2. Prior WorkPrior work for fine-grained image classification has con-

centrated on performing parts detection [2, 5, 20, 24] inorder to decrease the intra-class variation. The part-basedone-vs-one feature system [2] is an example of this, whereparts-based features are progressively selected to improveclassification. An alternative is the deformable parts-modelwhich obtains a combined feature from a set of pre-definedparts [24]. Chai et al. [6] proposed a symbiotic model wherepart localisation is helped by segmentation and, conversely,the segmentation is helped by parts detection. Zhang etal. [24] extract pose-normalised features based on weaksemantic annotations to learn cross-component correspon-dences of various parts.

Recent work has shown the effectiveness of DCNNs forfine-grained image classification, but again, predominantlyto perform parts detection. Region proposal methods com-bined with a DCNN were shown to more accurately localiseobject parts [23]. Lin et al. [19] showed that a DCNN canbe trained to perform both parts localisation and visibil-ity prediction, achieving state-of-the-art results on the CUBdataset [22]. Although the above parts-based approachesare fully automatic at test time, they require a large num-ber of images to be manually annotated in order to train themodel.

To remove the need for time-consuming manual anno-tations, recent work has explored ways to perform fine-grained classification without using part annotations. Zhanget al. [23] and Ge et al. [12] showed that, even without partannotations, DCNNs can provide impressive performancefor fine-grained classification tasks. Of particular interest isthe approach of Ge et al. [10] which showed that the datacan be partitioned into K non-overlapping sets and an ex-pert feature extraction algorithm, utilising DCNNs, can betrained for each of the K sets.

Learning algorithms which construct a set of K classi-fiers and make decisions by taking a weighted or average oftheir predictions are often referred to as ensemble methods.A simple ensemble approach called bagging has been usedto improve the overall performance of a system [4]. Bag-ging manipulates the training examples to generate multi-ple hypotheses. In this case, a set of K classifiers is learnedusing a randomly selected subset of the training data. Weuse this bagging approach on a set of DCNNs for a baselinemethod and refer to it as an Ensemble approach (Section 4).

Ensemble approaches, or learning K expert classifiers,has been explored by several researchers within the contextof neural networks. In 1991 Jacobs et al. [16] described agated network structure to learn K expert neural networksand applied it to multi-speaker vowel recognition. The un-derlying idea is to only allocate a small region of the in-put space to a particular expert system. This was achievedby having K expert systems (neural networks) which wereallocated samples selected by a separate gating network.In [16], the gating network determines the probability that asample is associated to one of the K expert systems.

More recently, Ge et al. [13] outlined a subset featurelearning (Subset FL) approach usingK expert DCNNs. Thedata is partitioned into K non-overlapping sets and for eachset an expert DCNN is learned to extract set-specific fea-tures. A gating network is then used to extract only the mostrelevant features from these K DCNNs. Classification isthen performed by training an SVM on these features, yield-ing impressive performance for fine-grained bird and plantclassification [11]. An issue with this work is the relianceof an independent gating network G and the fact that fea-ture extraction and classification are treated as independentsteps.

3. Proposed ApproachWe propose a novel mixture of DCNNs (MixDCNN)

to improve fine-grained image classification by partitioningthe data into K non-overlapping sets and learning an ex-pert classifier for each set. This approach has similaritiesto the gated neural network proposed by Jacobs et al. [16],which has never been applied to DCNNs nor to the fine-grained classification problem. As such, we also outline agated DCNN (GatedDCNN). An overview of these two ap-proaches is given in Figure 2.

The main idea behind the MixDCNN and GatedDCNNapproaches is to learn K expert networks, [S1, . . . ,SK ],which make decisions about a subset of the data. This sim-plifies the space that is being modelled by each component.Key to both approaches is being able to assign a sample tothe appropriate network.

A GatedDCNN assigns samples by learning a separategating neural network which produces the probability, αk,that the sample belongs to the k-th network. Learning this

2

Page 3: NICTA, Australia arXiv:1511.09209v1 [cs.CV] 30 Nov 2015 · work approach proposed by Jacobs et al. [16], which utilises a separately trained network to select the most appropriate

fc

sum 1…K

Component 1

Component K

fc

fc

Gating network

soft-max

soft-max

sum 1…K

prediction

prediction+Occ.Prob.

.

.

.

(b) Mixture of Networks

(a) Gated Network

Component 1

Component K fc

+ O

cc.Prob. Eq (5)soft-max

.

.

.

.

Figure 2. GatedDCNN structure (top) and MixDCNN structure (bottom). The term Occ. Prob. refers to occupation probability (re-sponsibility) α. In GatedDCNN, the gating network uses the image, the same input as each component (subset networks), to estimate α.In contrast, MixDCNN estimates α without the need for an external network.

gating neural network requires ground truth labels aboutwhich sample should be assigned to a particular network,which for our work is an open question. In contrast, aMixDCNN assigns samples based on the confidence of theprediction from each network, which leads us to considerαk to be the occupation probability of the sample for thek-th network.

Before we describe these two approaches in more detailwe define some notation. The output of a DCNN, trained forclassification, is anN -dimensional vector z of class predic-tions, where N indicates the number of classes. These pre-dictions then are normalised by a softmax [18, 21] to givethe probability that the sample belongs to the n-th class:

cn =exp{zn}∑Nj=1 exp{zj}

(1)

In the approaches described below, we are most interestedin the vector of predictions z prior to applying the softmax.

3.1. GatedDCNN

Inspired by [15, 16], we define a GatedDCNN that con-sists of K components (DCNNs) and an additional gatingnetwork. The overall structure of this network is shown inFig. 2a. In this arrangement, the k-th DCNN Sk is given

greater responsibility for learning to discriminate subtle dif-ferences of the k-th subset of images, while the gating net-work G is responsible for associating the image I with themost appropriate component. The gating network G is afine-tuned DCNN that is learned using the cross-entropyloss to produce a K-dimensional vector of probabilities α.The k-th value denotes the probability that the input imageI is associated with the k-th component. We refer to this asan occupation probability.

A fundamental difficulty with training the GatedDCNNis how to provide the T training labels y. This label vectoris a K-dimensional label vector which indicates which ofthe K subsets the sample belongs to. To deal with this issuewe consider two ways of estimating these labels. The firstapproach is to initialise the labels y using the partitioningof the training images into K subsets. The gated network Gis then trained using these labels and the K DCNNs (com-ponents) are then trained independently so that Sk is trainedexclusively with data from the k-th subset. The second ap-proach is to use the above gated network (and K compo-nents) as an initialisation and to iteratively retrain by:

1. Fixing G, and then updating [S1, . . . ,SK ] using the as-signments from G.

3

Page 4: NICTA, Australia arXiv:1511.09209v1 [cs.CV] 30 Nov 2015 · work approach proposed by Jacobs et al. [16], which utilises a separately trained network to select the most appropriate

2. Fixing the K components [S1, . . . ,SK ] and usingthese to estimate new labels y. The network G is thenupdated using these new labels.

The labels y estimated in step 2 are obtained by takingthe network which is most confident about its decision. For-mally, yt for the t-th training sample is given by:

yt = argmaxk=1...K

Ck,t (2)

where Ck,t is the best classification result for Sk using thet-th sample:

Ck,t = maxn=1...N

zk,n,t (3)

Classification with the GatedDCNN is performed usinga weighted summation of the classification results from theK components:

cn =∑K

k=1ck,nαk (4)

where ck,n is the probability of the sample belonging to then-th class for the k-th component, and αk is the probabilitythat the sample is assigned to the k-th component Sk.

An issue with the GatedDCNN system is that a separategating network has to be trained to assign a sample to aparticular component Sk. This provides the further compli-cation of having to estimate the labels y in order to train thegating network G. In this paper the first GatedDCNN train-ing approaches provides marginally better performance. Inthe experiment section, we will report results based on thefirst approach.

3.2. Mixture of DCNNs (MixDCNN)

We propose a mixture of DCNNs approach where the oc-cupation probabilitiesα are based on the classification con-fidence from each component. An advantage of this struc-ture is that we can jointly train theK DCNNs (components)without having to estimate a separate label vector y or traina separate gating network G.

For MixDCNN, the occupation probability for the k-thcomponent is:

αk =exp{Ck}∑Kc=1 exp{Cc}

(5)

where Ck is given by Eq. (3). This occupation probabilitygives higher weight to components that are confident abouttheir prediction. The overall structure of this network isshown in Fig. 2b.

Classification is performed by multiplying the output ofthe final layer from each component by the occupation prob-ability and then summing over the K components:

zn =∑K

k=1zk,nαk (6)

This mixes the network outputs together and the probabil-ity for each class is then produced by applying the softmax

function in Eq. (1). As a consequence our MixDCNN isoptimised using the cross-entropy loss2.

3.3. Differences Between MixDCNN and Ensembles

The aim of the MixDCNN approach is that each compo-nent takes greater responsibility for a portion of the dataallowing each component to concentrate on samples (orclasses) that are more difficult to differentiate. This willallow the MixDCNN to learn subtle differences for similarclasses. This is in contrast to an Ensemble approach whichrandomly excludes a portion of the training data for eachDCNN. Therefore, the key difference between the proposedMixDCNN approach and an ensemble of DCNNs (Ensem-ble) is the use of the occupation probability. For training,this means the MixDCNN approach does not randomly se-lect the data. Instead, each sample is weighted proportionalto its relevance to each DCNN S1,...,K . For testing, theMixDCNN approach is able to adaptively calculate the oc-cupation probability for each sample, whereas an Ensembleapproach will use pre-defined weights or, more commonly,equal weights.

4. Experiments4.1. Datasets

We present results on three fine-grained image classi-fication datasets using two network structures. The threedatasets are the Caltech-UCSD-2011 (CUB200-2011) [22],Birdsnap [3], and PlantCLEF 2015 [14]. Example imagesare shown in Figures 1 and 3.

CUB200-2011 is a fine-grained bird classification taskwith 11,788 images from 200 bird species in North Amer-ica. This dataset has become a de facto standard for thebird classification task. Each species has approximately30 images for training and 30 for testing. Birdsnap is a

2Optimised in a mini-batch Stochastic Gradient Descent framework.

CUB200-2011

PlantCLEF Flower

Figure 3. Examples from CUB-200-2011 and PlantCLEF Flower.

4

Page 5: NICTA, Australia arXiv:1511.09209v1 [cs.CV] 30 Nov 2015 · work approach proposed by Jacobs et al. [16], which utilises a separately trained network to select the most appropriate

much larger bird dataset consisting of 49,829 images from500 bird species with 47,386 images used for training and2,443 images used for testing. PlantCLEF 2015 is a largeplant classification dataset that has seven content types. Todemonstrate the capabilities of the proposed MixDCNN ap-proach for the task of fine-grained classification, we analyseits effectiveness on one content type, Flower. This portionof the dataset consists of 28,705 images from 967 species.We split this data into training and test sets. The training setconsists of 25,025 images from 967 species, while the testset has 3,200 images from 801 species.

Both CUB200-2011 and Birdsnap have bounding boxannotations around the object of interest. We use this in-formation to extract just the object of interest from the im-age. PlantCLEF 2015 does not come with bounding boxinformation making it a more challenging dataset.

Prior work [13, 23] has shown the importance of transferlearning for the fine-grained image classification problem.Results have shown that training a DCNN from scratch foreither the fine-grained CUB200-2011 or Birdsnap datasetleads to overfitting on the training samples. As such, forall the of our experiments we use pre-trained networksfrom ImageNet [7] to provide a good initialisation for eachDCNN and then perform transfer learning. We consider thisto be our baseline and refer to it as DCNN-tl. All of ournetworks are trained using Caffe [17] and partitioning wasperformed using the Bob toolkit [1].

4.2. Comparative Evaluation

We compare the proposed MixDCNN approach againstfour other related methods: (1) the baseline DCNN-tl, (2) anensemble of K DCNNs, (3) an implementation of Gated-DCNN, and (4) Subset FL [13]. Two network structuresconsidered are the well known AlexNet [18] and the LargeScale Visual Recognition Challenge (ILSVRC) 2014 win-ner GoogLeNet [21]. AlexNet is a deep network consistingof 8 layers, while ILSVRC has 22 layers3. We follow thesame procedure as Ge et. al [13] to cluster the data. Forthe AlexNet structure we use the output of the first fullyconnected layer as features for clustering. For GoogLeNetwe use the output of the last layer, prior to classification, asfeatures. In both cases linear discriminant analysis (LDA)is applied to reduce the dimensionality to D = 128. Inour initial experiments, we varied D and results showed noimpact of that.

The results in Table 1 show that the proposed MixDCNNapproach provides consistent improvement regardless ofnetwork structure or dataset. MixDCNN provides the bestperformance for all of the network and dataset combina-tions, with the exception of the MixDCNN model using theGoogLeNet structure on CUB. It provides an average rela-

3To prevent GoogLeNet from over-fitting we use a higher dropout rateequal to 0.5 for the final loss layer, as opposed to the original setting of 0.4.

tive performance improvement of 12.7% over the baselineDCNN-tl approach, excluding CUB.

For the CUB dataset, using multiple expert networksprovides limited performance improvement. This is truefor all of the methods examined. We attribute this to thefact that CUB200-2011 is a small dataset consisting ofjust 5,994 training images. This is an order of magnitudefewer samples than other datasets such as Birdsnap. Fur-thermore, applying transfer learning to GoogLeNet alreadyprovides exceptional performance and so minimises the im-provement introduced by the MixDCNN framework, or anymulti-expert approach.

The proposed MixDCNN method achieves state-of-the-art results on the challenging Birdsnap and PlantCLEF-Flower datasets. For Birdsnap the previous state-of-the-artperformance was 48.8% [3]. Applying transfer learning toGoogLeNet already outperforms this prior art with an ac-curacy of 67.4%. MixDCNN provides a further relativeperformance improvement of 9.9%. For the PlantCLEF-Flower dataset the baseline performance of DCNN-tl (usingGoogLeNet) is 48.7%. MixDCNN provides state-of-the-artperformance with a relative performance improvement of7.0%.

The MixDCNN approach consistently outperforms theEnsemble, GatedDCNN and Subset FL approaches. Inter-estingly, it provides a considerable improvement over theclosely related GatedDCNN approach, with an average rel-ative performance improvement of 9.1%. We attribute thisto the ability of the MixDCNN approach to adaptively re-assign samples to the most appropriate expert network, inspite of the original partitioning.

In our experiments, component sizes greater than K = 6were not considered as we could not store these in mem-ory on a single GPU4. This highlights one of the limitationswith this technique as it currently requires all of the net-works to be stored on a single GPU; future work should con-sider how to extend the architecture across multiple GPUs.

5. Conclusion

We have proposed a novel mixture of deep neural net-works, termed MixDCNN, which achieves state-of-the-artperformance for fine-grained classification. It provides anaverage relative performance improvement of 12.7% andhas been shown to consistently outperform several relatedmethods: subset feature learning, GatedDCNN, and an en-semble of classifiers.

The key advantage of our proposed approach is the useof an occupation probability that weights each sample pro-portional to its relevance to each DCNN S1,...,K . This ap-proach obviates the need for a separate gating function and

4The GPU used in all our experiments was an Nvidia K40 Tesla with12 Gb of memory.

5

Page 6: NICTA, Australia arXiv:1511.09209v1 [cs.CV] 30 Nov 2015 · work approach proposed by Jacobs et al. [16], which utilises a separately trained network to select the most appropriate

Table 1. Comparison of the proposed MixDCNN approach against DCNN-tl, Ensemble, GatedDCNN and Subset FL on three datasets:CUB, BirdSnap and PlantCLEF-Flower. Two network structures are used: AlexNet and GoogLeNet.

DCNN-tl Ensemble GatedDCNN Subset FL MixDCNN

AlexNetCUB 68.3% 71.2% 69.2% 72.0% 73.4%

BirdSnap 55.7% 57.2% 57.4% 59.3% 63.2%PlantCLEF-Flower 29.1% 30.2% 30.2% 31.1% 35.0%

GoogLeNetCUB 80.0% 80.9% 81.0% 81.2% 81.1%

BirdSnap 67.4% 71.4% 70.1% 72.8% 74.1%PlantCLEF-Flower 48.7% 50.2% 49.7% 51.7% 52.1%

highlights the importance of being able to adaptively weightsamples based on their relevance to a component (DCNN).

Future work will explore alternative methods for initial-ising the clustering and its impact upon performance. Forinstance, the impact of grouping images together in termsof their pose rather than similar visual appearance. Further-more, we will examine the role of the occupation probabil-ity in two ways: (i) whether the responsibility for a sampleis shared between components, and (ii) deeper analysis ofhow this occupation probability changes during the train-ing process. Additionally, we intend on exploring differentmethods for computing the occupational probability via al-ternative aggregation techniques.

AcknowledgementsThe Australian Centre for Robotic Vision is supported by the

Australian Research Council via the Centre of Excellence pro-gram. NICTA is funded by the Australian Government throughthe Department of Communications, as well as the Australian Re-search Council through the ICT Centre of Excellence program.Agricultural Robotics Program at QUT has been supported by theDepartment of Agriculture

References[1] A. Anjos, L. E. Shafey, R. Wallace, M. Gunther, C. McCool,

and S. Marcel. Bob: a free signal processing and machinelearning toolbox for researchers. In Proceedings of the ACMInternational Conference on Multimedia, pages 1449–1452,2012.

[2] T. Berg and P. N. Belhumeur. POOF: Part-based one-vs.-onefeatures for fine-grained categorization, face verification, andattribute estimation. In CVPR, 2013.

[3] T. Berg, J. Liu, S. W. Lee, M. L. Alexander, D. W. Jacobs,and P. N. Belhumeur. Birdsnap: Large-scale fine-grainedvisual categorization of birds. In CVPR, pages 2019–2026,2014.

[4] L. Breiman. Bagging predictors. Machine Learning,24(2):123–140, 1996.

[5] Y. Chai, V. Lempitsky, and A. Zisserman. Symbiotic seg-mentation and part localization for fine-grained categoriza-tion. In ICCV, 2013.

[6] Y. Chai, V. Lempitsky, and A. Zisserman. Symbiotic seg-mentation and part localization for fine-grained categoriza-tion. In ICCV, 2013.

[7] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. InCVPR, 2009.

[8] R. Farrell, O. Oza, N. Zhang, V. I. Morariu, T. Darrell, andL. S. Davis. Birdlets: Subordinate categorization using volu-metric primitives and pose-normalized appearance. In ICCV,2011.

[9] E. Gavves, B. Fernando, C. G. Snoek, A. W. Smeulders, andT. Tuytelaars. Fine-grained categorization by alignments. InICCV, 2013.

[10] Z. Ge, C. McCool, C. Sanderson, A. Bewley, Z. Chen, andP. Corke. Fine-Grained bird species recognition via hierar-chical subset learning. In ICIP, sep 2015.

[11] Z. Ge, C. McCool, C. Sanderson, and P. Corke. Contentspecific feature learning for fine-grained plant classification.In Working Notes of the CLEF 2015 Conference, 2015.

[12] Z. Ge, C. McCool, C. Sanderson, and P. Corke. Modellinglocal deep convolutional neural network features to improvefine-grained image classification. ICIP, 2015.

[13] Z. Ge, C. McCool, C. Sanderson, and P. Corke. Subset fea-ture learning for fine-grained classification. In CVPR Deep-Vision Workshop, 2015.

[14] H. Goeau, A. Joly, P. Bonnet, S. Selmi, J.-F. Molino,D. Barthelemy, and N. Boujemaa. Lifeclef plant identifica-tion task 2014. In Working Notes for CLEF 2014 Conference,pages 598–615. CEUR-WS, 2014.

[15] J. B. Hampshire II and A. Waibel. The meta-pi net-work: Building distributed knowledge representations for ro-bust multisource pattern recognition. PAMI, 14(7):751–769,1992.

[16] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hin-ton. Adaptive mixtures of local experts. Neural Computa-tion, 3(1):79–87, 1991.

[17] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Gir-shick, S. Guadarrama, and T. Darrell. Caffe: Convolutionalarchitecture for fast feature embedding. In Proceedings ofthe ACM International Conference on Multimedia, pages675–678, 2014.

[18] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenetclassification with deep convolutional neural networks. InNIPS, pages 1097–1105, 2012.

[19] D. Lin, X. Shen, C. Lu, and J. Jia. Deep lac: Deep local-ization, alignment and classification for fine-grained recog-nition. In CVPR, pages 1666–1674, 2015.

6

Page 7: NICTA, Australia arXiv:1511.09209v1 [cs.CV] 30 Nov 2015 · work approach proposed by Jacobs et al. [16], which utilises a separately trained network to select the most appropriate

[20] J. Liu, A. Kanazawa, D. Jacobs, and P. Belhumeur. Dogbreed classification using part localization. In ECCV. 2012.

[21] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed,D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich.Going deeper with convolutions. CVPR, 2014.

[22] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie.The Caltech-UCSD Birds-200-2011 Dataset. In Computa-tion & Neural Systems Technical Report, California Institute

of Technology, number CNS-TR-2011-001, 2011.[23] N. Zhang, J. Donahue, R. Girshick, and T. Darrell. Part-

based R-CNNs for fine-grained category detection. InECCV, pages 834–849. 2014.

[24] N. Zhang, R. Farrell, F. Iandola, and T. Darrell. Deformablepart descriptors for fine-grained recognition and attributeprediction. In ICCV, 2013.

7