
22nd Computer Vision Winter Workshop
Nicole M. Artner, Ines Janusch, Walter G. Kropatsch (eds.)
Retz, Austria, February 6–8, 2017

Unsupervised Learning for Image Category Detection

Philipp Seeböck 1,2, René Donner 1, Thomas Schlegl 1,2, Georg Langs 1

1 Computational Imaging Research Lab, Medical University of Vienna
2 Christian Doppler Laboratory for Ophthalmic Image Analysis, Medical University of Vienna
{philipp.seeboeck, rene.donner, thomas.schlegl, georg.langs}@meduniwien.ac.at

Abstract. We propose an unsupervised object category learning approach, where the output representations improve classification performance too. The contribution is threefold: we integrate a 'Network in Network' (NIN) approach in unsupervised learning, improving the representational power of the network for local patches by replacing linear filters with micro convolutional networks. We learn receptive fields to connect layers of the network sparsely, and we propose a new encoding function that introduces sparsity in a natural way and avoids the necessity of parameter tuning. The learned model generates a feature representation of images, used for unsupervised category learning. Results demonstrate that the obtained image categories reflect true object categories well. In addition, experimental results on classification tasks show the superiority of the proposed approach in comparison with an unsupervised state-of-the-art learning architecture.

1. Introduction

The unsupervised identification of categories is a crucial aspect of machine learning when ground truth is either unknown or costly to obtain. In medical imaging both issues arise frequently, and the discovery of potential markers occurring across large image sets is of central interest. In this paper we present an unsupervised approach to learn a model for category detection. Images are mapped to a new feature representation using an unsupervisedly learned Convolutional Neural Network (CNN), and are subsequently clustered in this representation space. We demonstrate the capability of the learned model to identify meaningful categories on natural imaging data in a completely unsupervised way, and show that it improves performance in comparison with a state-of-the-art learning architecture. The key contribution of this paper is to understand whether the NIN approach improves an unsupervised learning architecture in comparison with conventional unsupervised feature learning approaches. We use the term 'category detection' to stress the difference between our approach and clustering methods. In contrast to the classical approach of feeding hand-crafted features [17, 25, 23] to a clustering or classification [33, 2] algorithm, we aim at learning high-level features from the training data which are both highly discriminative with respect to the unknown categorical classes and robust to low-level visual variation.

Related work  The idea of learning categories in a set of unlabeled or weakly labeled data is not new. Several approaches tackling the problem of object detection have been proposed [15, 13, 32, 28, 3, 11]. While Grauman and Darrell group images based on a dissimilarity measure between sets of unordered features [15], Faktor and Irani cluster images based on the idea that images which can be composed from each other are similar [13]. These methods can be seen as non-deep-learning approaches for learning image categories without using any labels. In contrast, Wu et al. [32] use a weakly supervised multiple instance learning approach to train a deep CNN for image classification and annotation. Weakly supervised learning is also used by Oquab et al. [28] to train a deep CNN to detect and localize objects in images. In that work, labels are given at image level, but no object location annotations are used for training. Schlegl et al. use weakly supervised learning to link image information to semantic descriptions of image content [29]. Instead of assigning labels to images, Cho et al. [3] find object locations based on hand-crafted features (HOG [10]) in an unsupervised way. Doersch et al. [11] use spatial context as a supervisory signal to train a CNN, where predicting the relative position of a randomly extracted pair of patches is used as the training objective. As opposed to the work of Doersch et al. [11], we want to tackle the problem of identifying categories in cases where no informative context is available. Other works address the challenge of unsupervised learning beyond unsupervised object detection in a more general way [7, 5, 6, 4, 12]. Coates et al. [7] compare various unsupervised learning methods in a single-layer network and show that the best performance is achieved with K-means clustering as the feature learning algorithm. Dosovitskiy et al. [12] train a network unsupervisedly with surrogate classes. These are created by applying hand-crafted transformations which are assumed not to change the identity of the image content. In contrast, our approach does not use hand-engineered prior knowledge about the data.

An important issue in building deep neural networks is how to choose the connections between layers (i.e., how to connect the features of the current layer with the features of the lower layer), as pointed out and investigated in [9]. In line with the terminology of previous literature, we denote the grouping of filters in order to achieve sparse connections between layers as receptive field learning. Coates and Ng introduced an algorithm to learn receptive fields automatically and reduce the feature dimension for simple unsupervised training algorithms in higher layers [5]. Apart from learning connections between layers, there have also been attempts to learn spatial-pooling regions using labels [20].

Besides the choice of connections between layers, the basic network architecture plays an important role, too. Lin et al. [24] introduced the idea of replacing convolution filters with a "micro network", called the NIN approach. The basic idea is to increase the representational power of neural networks, or rather of their individual parts, by using a "micro network" as part of the overall network. This concept was picked up and taken one step further by Szegedy et al. [31], who describe the "Inception Module" in their work. The "Inception Module" consists of "micro networks", while the module itself is also part of a final, even bigger network. Since both papers apply the concept successfully to purely supervised learning problems, a transfer to unsupervised learning seems promising, and is evaluated in this paper.

Contribution  In this paper, we use recent insights [7, 4, 24, 31] to develop a novel unsupervised architecture. Specifically, we (1) specify an algorithm that groups similar filters in order to create receptive fields, (2) introduce the basic idea of the NIN approach into unsupervised learning, and (3) propose the new mean-sparse encoding function. First, we use these ingredients to learn image categories from the datasets (CIFAR-10, STL-10) in a purely unsupervised way. Second, we evaluate the learned feature representation on a standard supervised classification task in direct comparison with unsupervised state-of-the-art approaches [7, 5].

Receptive field learning: High input dimensionality can have a negative impact on the success of unsupervised learning, specifically on the learned features of K-means clustering [6]. Furthermore, results in [5] indicate that grouping similar features is essential in order to enable unsupervised learning algorithms to detect useful higher-layer features. The grouping of similar features is also motivated by the Hebbian principle, 'neurons that fire together, wire together' [26]. In our experiments, the proposed receptive field learning algorithm yields better classification accuracy than the one introduced in [5].

NIN architecture: We transfer the core idea of NIN [24, 31] to unsupervised learning and replace simple filters with "micro-conv-nets" to improve the representational power of individual network components. In the unsupervised category detection experiment we used three external evaluation criteria to determine the performance. The proposed approach outperforms the state-of-the-art single-layer architecture (Section 4.2).

Mean-sparse encoding: On the one hand, mean-sparse encoding is motivated by the performance of the triangle function described in [7]. That encoding scheme also yields sparse outputs by subtracting a mean value across features. On the other hand, we want to avoid tuning parameters where possible. In contrast to the widely used soft-threshold encoding, the mean-sparse function selects the threshold value automatically. Mean-sparse encoding achieves comparable results on the category-learning task on both datasets as well as on the classification task on the STL-10 dataset.

2. Method

Our method consists of a CNN model that maps the input images to a new representation, and a subsequent clustering or classification stage. We first present an approach to train the CNN in a completely unsupervised way. Then we describe the subsequent clustering and classification stages.


Figure 1. Illustration of the proposed architecture, revealing the interaction of the techniques described in Section 2. In general, the input images are mapped to a new feature representation X^4 by applying two convolutional layers and one pooling layer; this representation can then be used for category detection or classification tasks. The procedure to learn the dictionary is depicted for layer 2: (1) the receptive field learning algorithm is used to create R^2 receptive fields, followed by (2) learning sub-dictionaries C^2 from the training sets V^2 and concatenating them in D^2. (3) X^3 is computed using D^2 and mean-sparse encoding. The transition from X^1 to X^2 and from X^2 to X^3 conceptually forms an NIN approach, illustrated in Figure 2.

We work with image patches forming the input of the first layer, denoted as X^1 of size k^1 × k^1 × f^1 × X^1, where X^1 is the number of patches, each of size k^1 × k^1 with f^1 channels (feature maps), and 1 being the layer index. A single 32-by-32 RGB input image, for instance, is then described as 32 × 32 × 3 × 1.

For clarity, we represent convolution and pooling as separate layers. Considering a convolutional layer l, the output of the unsupervised feature learning stage is a "dictionary" D^l of size m^l × m^l × f^l × D^l, where D^l is the number of filters, each of size m^l × m^l with f^l feature maps. In the following we describe the procedure to learn this dictionary (illustrated in Figure 1):

1. Create R^l receptive fields (Section 2.1), containing S^l feature maps each, where S^l ≤ f^l. R^l denotes the number of receptive fields, S^l the number of feature maps in one receptive field, and f^l the total number of feature maps in layer l.

2. For each receptive field, unsupervised learning is performed separately:

   (a) Extract V^l random patches within the receptive field to create a training set V^l of size m^l × m^l × S^l × V^l, where m^l · m^l · S^l denotes the feature dimensionality.

   (b) Apply the unsupervised learning algorithm (Section 2.2) to V^l in order to generate a so-called sub-dictionary C^l of size m^l × m^l × S^l × C^l, with C^l denoting the number of learned filters.

3. We summarize all R^l sub-dictionaries (one for each receptive field) in one final dictionary: each C^l contains filters which are only connected to the feature maps contained in the corresponding receptive field. We "blow up" the filters in C^l from size m^l × m^l × S^l to m^l × m^l × f^l by inserting zero values at all other positions (a minimal sketch of this zero-padding follows after this list). In terms of feature extraction, this makes no difference to applying the filters C^l to the corresponding receptive field. Care has to be taken that the spatial information is preserved by linking the non-zero values to the correct feature maps.

4. All "blown up" filters build the dictionary D^l.

If two layers are fully connected, R^l = 1 and S^l = f^l. Note that D^l = R^l · C^l must hold. Based on the dictionary D^l and a given input X^l, we then calculate the output X^{l+1} by using a specific encoding, where the concrete computation steps are described in Section 2.4.
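To make step 3 concrete, the following minimal numpy sketch zero-pads one sub-dictionary into the full feature dimension. The shapes and the helper name blow_up are illustrative, not taken from the paper:

```python
import numpy as np

def blow_up(sub_dict, field_indices, f):
    """Embed a sub-dictionary of shape (C, m, m, S) into the full feature
    dimension f: its weights are written only at the feature-map positions
    listed in field_indices, with zeros everywhere else (step 3 above)."""
    C, m, _, S = sub_dict.shape
    assert len(field_indices) == S
    full = np.zeros((C, m, m, f))
    for s, fmap in enumerate(field_indices):
        # link each non-zero slice to the correct feature map
        full[:, :, :, fmap] = sub_dict[:, :, :, s]
    return full

# Illustrative second-layer numbers: m = 3, S = 11, f = 100, 25 filters.
rng = np.random.default_rng(0)
C_sub = rng.standard_normal((25, 3, 3, 11))      # one sub-dictionary C^l
field = rng.choice(100, size=11, replace=False)  # its receptive field
print(blow_up(C_sub, field, f=100).shape)        # (25, 3, 3, 100)
```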

Regarding the pooling layer, no explicit learning stage is applied. We use simple average pooling to aggregate features within feature maps. Average pooling partitions a feature map by shifting a rectangle of size p^l × p^l with a stride w^l over the map and outputs the average value for each region.
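A minimal sketch of this pooling scheme; the loop-based implementation is chosen for clarity, not speed:

```python
import numpy as np

def average_pool(fmap, p, w):
    """Average-pool a single (H, W) feature map with window p and stride w."""
    H, W = fmap.shape
    rows = range(0, H - p + 1, w)
    cols = range(0, W - p + 1, w)
    out = np.empty((len(rows), len(cols)))
    for i, r in enumerate(rows):
        for j, c in enumerate(cols):
            out[i, j] = fmap[r:r + p, c:c + p].mean()
    return out

# With the settings used later (27x27 maps, p=14, w=13) this yields 2x2 outputs.
print(average_pool(np.ones((27, 27)), p=14, w=13).shape)  # (2, 2)
```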

2.1. Learning Receptive Fields

As dissimilarity g_{i,j} between two filters d_i and d_j we use a metric defined in Coates et al. [4]¹:

$$g_{i,j} = \lVert d_i - d_j \rVert_2 = \sqrt{2 - 2\, d_i^\top d_j} \tag{1}$$

¹ In contrast to our work, Coates et al. [4] use this metric to do max-pooling over similar units. Additionally, their algorithm for grouping similar filters differs from ours.


This dissimilarity is used to implement the concept of grouping related features together in the same receptive field, so that their relationship can be learned more finely by the unsupervised learning algorithm of the subsequent layer. It is important to note that the filters are all normalized to unit length.

To construct a single receptive field in layer l, we randomly select one filter from D^{l−1} as seed. We then compute the distance to all other filters of this dictionary and add the S^l most similar filters to the current group. A filter can only be used once as seed. In this way we get R^l equally sized, potentially overlapping receptive fields that determine the connections between layer l and l + 1, as illustrated in Figure 1. Both R^l and S^l are hyper-parameters.
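The grouping procedure can be sketched as follows. Whether the seed counts toward the S^l members is our reading of Figure 3 ("all 11 filter members"); the function name and sizes are illustrative:

```python
import numpy as np

def learn_receptive_fields(D_prev, R, S, rng):
    """Greedy grouping sketch: pick a random unused filter as seed, then
    keep the S filters most similar to it (seed included) under Eqn. (1).
    Filters are rows of D_prev, assumed normalized to unit length."""
    n = D_prev.shape[0]
    # g_ij = ||d_i - d_j||_2 = sqrt(2 - 2 d_i^T d_j) for unit-length filters
    gram = D_prev @ D_prev.T
    dist = np.sqrt(np.maximum(0.0, 2.0 - 2.0 * gram))
    unused = list(range(n))       # a filter can only be used once as seed
    fields = []
    for _ in range(R):
        seed = unused.pop(rng.integers(len(unused)))
        members = np.argsort(dist[seed])[:S]  # S most similar, incl. seed
        fields.append(members)
    return fields

rng = np.random.default_rng(0)
D1 = rng.standard_normal((100, 48))
D1 /= np.linalg.norm(D1, axis=1, keepdims=True)
fields = learn_receptive_fields(D1, R=32, S=11, rng=rng)
```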

2.2. Learning a Dictionary

For each sample in the training set V^l we subtract the mean of intensities, divide by the standard deviation, and apply ZCA-whitening. Then we use spherical K-means according to Coates et al. [6] as the unsupervised learning algorithm, which produces filters similar to those of sparse autoencoders or sparse RBMs [7]. C^l centroids c^l are randomly initialized from a normal distribution and normalized to unit length. Damped updates are used to compute new centroids in every iteration in order to minimize the following objective:

$$\min_{s,\, c^l} \sum_i \lVert c^l s_i - v^l_i \rVert_2^2 \tag{2}$$

where s_i is a vector with only one non-zero entry, corresponding to the closest centroid to sample v^l_i. This results in a dictionary (Figure 1). A detailed explanation of the algorithm can be found in [6].
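A compact sketch of the preprocessing and the damped spherical K-means update, following our reading of [6]; the whitening constant eps, the iteration count, and the sizes are illustrative:

```python
import numpy as np

def zca_whiten(V, eps=0.01):
    """Per-sample standardization followed by ZCA-whitening (rows = samples)."""
    V = (V - V.mean(axis=1, keepdims=True)) / (V.std(axis=1, keepdims=True) + 1e-8)
    cov = np.cov(V, rowvar=False)
    eigval, eigvec = np.linalg.eigh(cov)
    W = eigvec @ np.diag(1.0 / np.sqrt(eigval + eps)) @ eigvec.T
    return V @ W

def spherical_kmeans(V, k, iters=10, rng=None):
    """Damped-update spherical K-means sketch after Coates & Ng [6]."""
    rng = rng or np.random.default_rng(0)
    C = rng.standard_normal((k, V.shape[1]))         # random normal init
    C /= np.linalg.norm(C, axis=1, keepdims=True)    # unit-length centroids
    for _ in range(iters):
        # code s_i: one non-zero entry at the centroid with largest |c^T v|
        sims = V @ C.T
        idx = np.argmax(np.abs(sims), axis=1)
        S = np.zeros((V.shape[0], k))
        S[np.arange(V.shape[0]), idx] = sims[np.arange(V.shape[0]), idx]
        C = S.T @ V + C                              # damped update
        C /= np.linalg.norm(C, axis=1, keepdims=True) + 1e-8
    return C

V = zca_whiten(np.random.default_rng(1).standard_normal((2000, 48)))
dictionary = spherical_kmeans(V, k=100)
```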

2.3. Unsupervised NIN Approach

We increase the representational power of individual parts of the neural network by using a "micro network" as part of the overall network. Figure 2 illustrates the increased representational power of the proposed NIN approach. For the purpose of clarity, we explain this approach using a 6 × 6 × 1 patch as input and choose concrete parameters in accordance with the network architecture explained in Section 3.1.

A conventional convolution is shown in Figure 2 (a), where one filter with m^1 = 6 is convolved with the input patch to produce a single activation value. Following the spirit of supervised NIN learning [24], we replace the single convolutional layer with two convolutional layers, where m^1 = 4, w^1 = 1, D^1 = 11, m^2 = 3 and D^2 = 1. As illustrated in Figure 2 (b), this leads to a deeper architecture and therefore to an increased abstraction capability, and can be seen as applying a "micro-conv-net" instead of a single filter.

Figure 2. Illustration of two example architectures, where the filters (orange) are applied to the patch representations (black). (a) Conventional convolution with 6-by-6 filters ("6conv"), and (b) the proposed architecture with two convolutional layers (m^1 = 4 and m^2 = 3, "4/3conv").
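The shape arithmetic of this example can be checked with a few lines of numpy/scipy. Note that the encoding applied between the two micro-layers (Section 2.4) is omitted here for brevity, and the random filters are placeholders:

```python
import numpy as np
from scipy.signal import correlate

# Sizes from the example: a 6x6x1 patch, first micro-layer m1=4 with
# D1=11 filters, second layer m2=3 with one filter over all 11 maps.
patch = np.random.default_rng(0).standard_normal((6, 6))
W1 = np.random.default_rng(1).standard_normal((11, 4, 4))
W2 = np.random.default_rng(2).standard_normal((11, 3, 3))

# layer 1: valid convolution gives 3x3 maps (6 - 4 + 1 = 3)
maps = np.stack([correlate(patch, w, mode="valid") for w in W1])  # (11, 3, 3)
# layer 2: a 3x3 filter across all 11 maps collapses them to one activation,
# covering exactly the footprint a single 6x6 filter would cover
activation = float((maps * W2).sum())
print(maps.shape, activation)
```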

2.4. Mean-sparse Encoding

The mean-sparse function is defined as follows:

$$\hat{x}^{l+1} = \max(0,\, D^l * x^l) \tag{3}$$

$$x^{l+1} = \max(0,\, \hat{x}^{l+1} - \mu(\hat{x}^{l+1})) \tag{4}$$

where * denotes the spatial convolution of all filters in D^l with sample x^l, \hat{x}^{l+1} is an intermediate representation with D^l feature maps, and x^{l+1} is the output representation of the sample. First, the rectified linear function is applied (Eqn. 3). Then the mean activation at each position across all feature maps, µ(\hat{x}^{l+1}), is calculated. This leads to a "mean feature map" of size k^{l+1}-by-k^{l+1}, which is then subtracted from each feature map of the intermediate representation \hat{x}^{l+1} (Eqn. 4). With mean-sparse encoding, we select the threshold value automatically for each spatial position in the feature map and at the same time introduce sparsity in the activations. This can be seen as a simple form of competition between features. The use of the rectified linear function is necessary, since the negative and positive values would otherwise cancel each other out.
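A minimal numpy sketch of the two steps, assuming pre-activations arranged as (feature maps, height, width); the sizes are illustrative:

```python
import numpy as np

def mean_sparse(pre_activations):
    """Mean-sparse encoding (Eqns. 3-4): rectify, then subtract the mean
    activation across feature maps at every spatial position and rectify
    again. Input shape: (feature_maps, H, W)."""
    h = np.maximum(0.0, pre_activations)       # Eqn. (3)
    mean_map = h.mean(axis=0, keepdims=True)   # the k x k "mean feature map"
    return np.maximum(0.0, h - mean_map)       # Eqn. (4)

x = np.random.default_rng(0).standard_normal((8, 5, 5))
out = mean_sparse(x)
# at each position, maps below the local mean are zeroed -> sparse output
print((out == 0).mean())
```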

In contrast to previous work [5, 7], we decided not to preprocess the "input sub-patches" at the stage of convolution during feature extraction. Though this may cause a slight loss in performance, it enables the use of the fast convolution modules of Torch7 [8] for our experiments. This speed advantage is a necessity in practice when working with large-scale image data.

In our experiments, we compare the mean-sparse encoding function in the convolutional layers with the widely used soft-threshold function:

$$x^{l+1} = \max(0,\, D^l * x^l - \alpha) \tag{5}$$

where α is a tunable constant.
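For contrast, the soft-threshold baseline is a one-liner whose constant α must be found by grid search (Section 3.2 sweeps it over six values), whereas mean-sparse has no such parameter:

```python
import numpy as np

def soft_threshold(pre_activations, alpha):
    """Soft-threshold encoding (Eqn. 5) with a hand-tuned constant alpha."""
    return np.maximum(0.0, pre_activations - alpha)
```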

2.5. Category Learning

Categories are identified by mapping each image x^1 to a new feature representation x^L (denoted as x^4 in Figure 1) using the unsupervisedly learned CNN:

$$x^L = \mathrm{CNN}(x^1) \tag{6}$$

and by subsequently performing spherical K-means clustering² [18] with cosine distance. This leads to cluster centers t_j, where each cluster represents a separate category. To categorize unseen images, they are mapped to x^L and assigned to the nearest centroid, t = minInd(x^L, t_j). minInd(·) returns the index of the centroid t_j with the minimum cosine dissimilarity to sample x^L, and t therefore represents the label of the assigned category.
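The assignment rule can be sketched as follows; minimizing cosine dissimilarity over unit-length vectors is equivalent to maximizing cosine similarity, and the helper name is illustrative:

```python
import numpy as np

def assign_category(xL, T):
    """minInd(.): index of the centroid with minimum cosine dissimilarity.
    xL: feature vector of an image; T: (k, d) matrix of cluster centers t_j."""
    xn = xL / (np.linalg.norm(xL) + 1e-8)
    Tn = T / (np.linalg.norm(T, axis=1, keepdims=True) + 1e-8)
    return int(np.argmax(Tn @ xn))  # max cosine similarity = min dissimilarity
```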

2.6. Classification

The feature representation x^L can also be used to train a classifier (shown in Figure 1) if labels are available. Here, we train a linear Support Vector Machine (L2-SVM) in order to evaluate the applicability of the feature representation for classification.
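A sketch of this stage using scikit-learn's LinearSVC, which wraps LIBLINEAR and uses the squared-hinge (L2-SVM) loss by default; the feature matrix, labels, and C grid are placeholders, and the 20% validation split mirrors Section 3.4:

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for the learned representations x^L and labels.
X = np.random.default_rng(0).standard_normal((1000, 2 * 2 * 100))
y = np.random.default_rng(1).integers(0, 10, size=1000)

# 20% validation split to pick the regularization parameter.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
best = max(
    (LinearSVC(C=C).fit(X_tr, y_tr).score(X_val, y_val), C)
    for C in [0.01, 0.1, 1.0, 10.0]
)
print("best validation accuracy %.3f at C=%s" % best)
```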

3. Experimental Setup

Data  We conducted experiments on the CIFAR-10 and the STL-10 datasets. The CIFAR-10 dataset comprises 50,000 training and 10,000 test images [21]. The 32 × 32 RGB images can be divided into 10 different categories, where each class consists of 5,000 training and 1,000 test samples. No preprocessing is applied to the images. The test set is only used for evaluation purposes and is not involved in any training procedure.

The STL-10 dataset also has 10 classes, but comprises only 100 labeled images per class for each training fold (10 pre-defined folds), 800 test images per class, and 100,000 additional unlabeled images [7]. The unlabeled images come from a similar but broader distribution and are used for the unsupervised training of the CNN architecture and the clustering stage. We down-sample the 96 × 96 × 3 images to 32 × 32 × 3, which enables us to use the same architecture for both CIFAR-10 and STL-10.

² The centroids are updated according to the procedure described in [6], where all examples are normalized to unit length in advance.

Figure 3. Examples of learned receptive fields in the 4/3conv model on CIFAR-10. Each row shows one receptive field with all 11 filter members of the first layer that are similar in terms of (a) orientation, (b) color, or (c) both.

3.1. Compared Network Architectures

Our experiments are based on two network architectures: the proposed method, and a state-of-the-art reference architecture that follows [7]. This allows us both to compare receptive field learning algorithms [5] and to compare the underlying network architectures. Following the experiments reported in [7, 5], we use as reference method the approach that achieves the best reported performance among all unsupervised single-layer convolutional networks on both the CIFAR-10 and the STL-10 dataset. For all convolutional layers in the experiments, we use stride w = 1.

The reference architecture (6conv) is composed of a convolutional layer, which is fully connected to the input layer, and a subsequent average pooling layer. Following the experiments of Coates et al. [7], the convolutional layer uses a filter size of 6 × 6 × 3, which leads to a 27 × 27 × D^1 representation after the first layer. Then average pooling with p^2 = 14 and w^2 = 13 is applied to get the final 2 × 2 × D^1 feature representation of each image.

The proposed architecture (4/3conv) replaces the convolutional layer with "micro-conv-nets", as described in Section 2.3, while the average pooling layer remains the same. For better understanding, this architecture can also be described as consisting of two convolutional layers with m^1 = 4 and m^2 = 3. While the first convolutional layer is fully connected to the input layer (R^1 = 1 and S^1 = f^1), the connections between the first and second convolutional layers are sparse, since we chose R^2 = 32, and S^2 = 11 if D^1 ≤ 400 and S^2 = 22 if D^1 > 400.

The dictionary of the first layer D^1 has a size of 4 × 4 × 3 × D^1, and the sub-dictionaries C^2 a size of 3 × 3 × S^2 × D^2/32, where D^1 = D^2. For each receptive field we learn D^2/32 filters, which leads to D^2 feature maps in D^2. Therefore, each receptive field in the second layer has a feature dimension of 99 = 3 · 3 · 11 if S^2 = 11 and 198 = 3 · 3 · 22 if S^2 = 22, which serves as input for the unsupervised feature learning algorithm in the second layer.


Considering the case where D^1 = D^2 = 800, we learn 800 different "micro-conv-nets" instead of 800 different simple filters. Since we learn 32 receptive fields, within each field 25 "micro-conv-nets" share the same first layer and differ only in the second layer. This makes it possible to learn various relationships between the first-layer filters in the second layer.
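The dimensions quoted in this section can be verified with simple integer arithmetic; a minimal sketch:

```python
def conv_out(size, m, w=1):
    """Spatial output size of a valid convolution with filter m and stride w."""
    return (size - m) // w + 1

# 6conv: 32 -> 27 with 6x6 filters; 4/3conv: 32 -> 29 -> 27 with 4x4 then 3x3.
assert conv_out(32, 6) == 27
assert conv_out(conv_out(32, 4), 3) == 27

# Filters per receptive field when D2 = 800 and R2 = 32: 800 / 32 = 25
# "micro-conv-nets" sharing the same first-layer filters within a field.
assert 800 // 32 == 25
```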

3.2. Evaluating Category Learning

We evaluate whether the unsupervised learning captures meaningful categories by comparing the learned categories with ground truth classes.

In this experiment we compare the 4/3conv with a 6conv model (Section 3.1). While the reference architecture (6conv) uses only soft-threshold encoding, we apply both soft-threshold and mean-sparse encoding in the proposed architecture (4/3conv). For all experiments, we tune the hyperparameter of the soft-threshold function over the following values: α = {0.1, 0.2, 0.25, 0.3, 0.4, 0.5}.

Regarding CNN parameters, we only varied the number of learned filters (100, 200, 400, 800). Both feature-wise normalization and no normalization of x^L are evaluated in the experiments. For spherical K-means clustering, we used 10 clusters in all experiments and varied the initialization of the centroids (from a normal distribution or with random examples as seeds) as well as the re-initialization procedure in case of empty clusters (re-initialization from a normal distribution or with random examples). To ensure a fair comparison, the parameters were the same for both architectures.

For the purpose of selecting the final model, the clustering results are evaluated on the training set using three external evaluation criteria: the Adjusted Rand Index (ARI) [19], Normalized Mutual Information (NMI) [30], and Purity [34]. While −1 ≤ ARI ≤ 1 (random labels lead to values close to zero, perfect labels to 1), both NMI and Purity range between 0 (random labels) and 1 (perfect labels). For the selected category models with the best ARI values on the training set, the external measures are also calculated on the test set in order to evaluate the generalization performance.
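The three criteria can be computed as follows; ARI and NMI are available in scikit-learn, while Purity is implemented directly. The toy labels are illustrative:

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

def purity(y_true, y_pred):
    """Purity: fraction of samples in the majority true class of their cluster."""
    total = 0
    for c in np.unique(y_pred):
        _, counts = np.unique(y_true[y_pred == c], return_counts=True)
        total += counts.max()
    return total / len(y_true)

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2])
print(adjusted_rand_score(y_true, y_pred),
      normalized_mutual_info_score(y_true, y_pred),
      purity(y_true, y_pred))
```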

3.3. Evaluating Unsupervised Selection of Category Models

Besides supervised selection of the model, we also evaluate whether we can choose the category model in an unsupervised way. Instead of using the external evaluation criteria, an internal value is used to select the final model, namely the Davies-Bouldin (DB) index [16, 1], calculated on the training data. A small value indicates compact and well separated clusters, hence we selected the model with the smallest DB index. The final models are evaluated with the external evaluation criteria on the test set in order to assess the quality of the chosen model.

3.4. Evaluating Classification

We evaluate how the learned features perform on a standard supervised classification task. In particular, we train an L2-SVM using the LIBLINEAR library [14] on the feature representation x^L. Both for CIFAR-10 and STL-10, 20% of the training images are used as a validation set to determine the regularization parameter of the classifier. While we obtain one final L2-SVM on the CIFAR-10 dataset, we obtain ten L2-SVMs for the STL-10 dataset (one for each training fold).

Again, we compare 4/3conv with 6conv using a varying number of filters. Additionally, in order to compare the proposed receptive field learning algorithm with a state-of-the-art method, we also train the 4/3conv architecture with the receptive field learning method introduced in [5]. For a fair comparison, we train the same number of receptive fields and select the same feature dimension for both receptive field learning algorithms.

4. Results

We report results illustrating the receptive fields, and quantitative results comparing the proposed approach with state-of-the-art methodology for unsupervised category learning as well as for classification. Furthermore, we evaluate to which extent the models can be selected based on an internal criterion.

4.1. Receptive Fields

Figure 3 shows three typical examples of the learned receptive fields (4/3conv model on CIFAR-10). It can be seen that the algorithm groups filters with similar orientation but varying color (a), similar orientation and color (b), and similar color but varying orientation (c).

4.2. Category Learning

Table 1 shows that the proposed method 4/3conv outperforms the conventional approach 6conv on all evaluation measures on both datasets. As expected, both outperform clustering applied directly to the input images and random clustering.


Method            | CIFAR-10          | STL-10
                  | ARI   NMI   Pur.  | ARI   NMI   Pur.
Random            | 0     0     0.1   | 0     0     0.1
SKM               | 0.06  0.10  0.24  | 0.06  0.12  0.26
6conv [7] + SKM   | 0.09  0.16  0.27  | 0.10  0.18  0.28
4/3conv + SKM     | 0.10  0.18  0.30  | 0.12  0.20  0.29

Table 1. ARI, NMI and Purity are calculated for each model on the training data in order to select the final model. This table summarizes the values of the final models (both for CIFAR-10 and STL-10), calculated on the test set. Results for random clustering and spherical K-means (SKM) applied directly to the images using 10 clusters are shown, too.

In terms of hyper-parameter settings, experiments showed that neither the type of initialization, the type of re-initialization, nor the number of filters in the CNN has a strong influence on the external measures. Therefore, a lower number of filters seems preferable to reduce computation time. Furthermore, results indicate that feature-wise normalization helps to improve performance. For example, the mean NMI of the 4/3conv model is 0.146 (±0.012) without and 0.170 (±0.006) with normalization on the CIFAR-10 dataset. Furthermore, both encoding functions achieve comparable performance on both datasets (e.g., on STL-10 the mean NMI is 0.194 (±0.012) for mean-sparse and 0.190 (±0.008) for soft-threshold encoding).

In-Depth Evaluation of Learned Categories  Results demonstrate that the 4/3conv model with the best external evaluation values categorizes images in a way that reflects true object categories well, as can be seen in Figure 4. The confusion matrix of the ground truth class labels and the predicted category labels on the CIFAR-10 dataset is plotted on the left-hand side of Figure 4. Each row of the matrix corresponds to one learned category, while every column corresponds to one ground truth class. The confusion matrix is rearranged according to the Hungarian method [22] (a sketch of this rearrangement follows below). On the right-hand side of Figure 4 the categories are visualized: for each centroid, the ten nearest test images are plotted to illustrate the characteristics of each cluster.
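The rearrangement can be reproduced with the Hungarian solver in scipy; the helper below is illustrative and assumes cluster and class labels share the same index range:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import confusion_matrix

def rearrange_confusion(y_true, y_cluster):
    """Reorder cluster rows with the Hungarian method [22] so that the
    matched (cluster, class) pairs fall on the diagonal."""
    cm = confusion_matrix(y_cluster, y_true)   # rows: clusters, cols: classes
    row_ind, col_ind = linear_sum_assignment(-cm)  # maximize matched counts
    order = row_ind[np.argsort(col_ind)]       # cluster matched to class j first
    return cm[order]

y_true = np.array([0, 0, 1, 1, 2, 2])
y_clus = np.array([2, 2, 0, 0, 1, 1])
print(rearrange_confusion(y_true, y_clus))  # diagonal-dominant matrix
```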

When looking at the visualization of the learned categories, it is important to bear in mind that the given ground truth categorization is not the only reasonable one. The confusion matrix in Figure 4 shows that the category model distinguishes quite clearly between animals and inanimate objects. While the clusters in rows 1, 2, 3, 9, and 10 correspond to inanimate objects, animals are categorized by the other centroids. As can be seen both in the confusion matrix and the nearest images, cluster "2" mainly groups white automobiles. Centroid "3" recognizes red automobiles and trucks, whereas cluster "4" mainly groups animals with a white background. A big part of the deer and bird images is contained in category "5". Centroid "6" is also a reasonable category, since cats and dogs have a similar appearance. While the major part of the frogs is contained in cluster "7", a large part of category "8" is made up of horse images. Category "9" mainly contains ships and airplanes, where all visualized images have a clearly visible horizontal edge in the middle of the image. The major part of category "10" consists of automobiles and trucks with a bright background.

Method            | CIFAR-10          | STL-10
                  | ARI   NMI   Pur.  | ARI   NMI   Pur.
6conv [7] + SKM   | 0.08  0.15  0.27  | 0.09  0.17  0.28
4/3conv + SKM     | 0.10  0.18  0.30  | 0.12  0.20  0.29

Table 2. ARI, NMI and Purity (calculated on the test set) for the unsupervisedly selected models (via the DB index, calculated on the training set).

4.3. Unsupervised Model Selection

On both datasets, unsupervised model selection based on the DB index chooses the 4/3conv models that also have the best external evaluation criteria when an additional test set is used (Table 2). For the 6conv architecture, comparable models are selected for both datasets, as can be seen by comparing Table 1 and Table 2. The Pearson correlation coefficients [27] between the DB index and the external criteria ARI (−0.55), NMI (−0.37), and Purity (−0.52) indicate that this is not a selection by chance.

4.4. Classification Results

The classification results, obtained on the test set, are illustrated in Figure 5 (a) for CIFAR-10. Figure 5 (b) contains the results for STL-10, where the error bars denote the standard deviation of the test accuracy, since 10 training folds are evaluated.

The proposed 4/3conv architecture clearly outperforms the conventional 6conv architecture on both datasets. While mean-sparse encoding leads to a slight performance decrease on CIFAR-10, it shows comparable results on STL-10 in comparison with the soft-threshold function.

Figure 4. On the left: the confusion matrix between clusters learned with a 4/3conv model (rows, clusters 1–10) and the ground-truth classes of CIFAR-10 (columns: plane, automobile, bird, cat, deer, dog, frog, horse, ship, truck). On the right: for each cluster center, the nearest images in the test set.

Figure 5. Test accuracy: classification performance (test accuracy in %) of the different approaches for a varying number of filters (100, 200, 400, 800) on (a) the CIFAR-10 and (b) the STL-10 dataset, comparing 6conv + soft, 4/3conv + RF-single + soft, 4/3conv + RF-dict + soft, and 4/3conv + RF-dict + sparse. The proposed receptive field learning algorithm is denoted as RF-dict, while RF-single refers to the state-of-the-art receptive field learning method introduced in [5].

Figure 5 also provides results for the 4/3conv architecture using a state-of-the-art receptive field learning algorithm (RF-single [5]) in order to verify that the performance gains we achieve are not only the result of using a deeper architecture. As can be seen in Figure 5, the connections learned by our algorithm lead to a higher performance than the connections learned by the RF-single approach on both datasets.

5. Conclusion

In this paper we introduce a new receptive field learning algorithm, transfer the concept of the NIN approach to unsupervised learning, and propose a new encoding function. We evaluate how this contributes to improving unsupervised visual category learning as well as classification in comparison to unsupervised state-of-the-art algorithms. Results demonstrate the superiority of the proposed method on both category learning and classification tasks. Finally, we demonstrate unsupervised category-model selection, leading to a fully unsupervised category detection method which does not incur a performance decrease in comparison with model selection based on external criteria.

Acknowledgments
This work was funded by the Austrian Federal Ministry of Science, Research and Economy, and the FWF (I2714-B31).

References
[1] N. Bolshakova and F. Azuaje. Cluster validation techniques for genome expression data. Signal Processing, 83(4):825–833, 2003.
[2] Y.-L. Boureau, F. Bach, Y. LeCun, and J. Ponce. Learning mid-level features for recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 2010.
[3] M. Cho, S. Kwak, C. Schmid, and J. Ponce. Unsupervised object discovery and localization in the wild: Part-based matching with bottom-up region proposals. In IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[4] A. Coates, A. Karpathy, and A. Y. Ng. Emergence of object-selective features in unsupervised feature learning. In Advances in Neural Information Processing Systems, volume 25, 2012.
[5] A. Coates and A. Y. Ng. Selecting receptive fields in deep networks. In Advances in Neural Information Processing Systems, 2011.
[6] A. Coates and A. Y. Ng. Learning feature representations with k-means. In Neural Networks: Tricks of the Trade. Springer, 2012.
[7] A. Coates, A. Y. Ng, and H. Lee. An analysis of single-layer networks in unsupervised feature learning. In International Conference on Artificial Intelligence and Statistics, 2011.
[8] R. Collobert, C. Farabet, and K. Kavukcuoglu. Torch7: A Matlab-like environment for machine learning. In BigLearn, NIPS Workshop, 2011.
[9] E. Culurciello, J. Jin, A. Dundar, and J. Bates. An analysis of the connections between layers of deep neural networks. CoRR, abs/1306.0152, 2013.
[10] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In IEEE Conference on Computer Vision and Pattern Recognition, 2005.
[11] C. Doersch, A. Gupta, and A. A. Efros. Unsupervised visual representation learning by context prediction. In IEEE International Conference on Computer Vision, pages 1422–1430, 2015.
[12] A. Dosovitskiy, J. T. Springenberg, M. Riedmiller, and T. Brox. Discriminative unsupervised feature learning with convolutional neural networks. In Advances in Neural Information Processing Systems, pages 766–774, 2014.
[13] A. Faktor and M. Irani. "Clustering by composition" - unsupervised discovery of image categories. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(6):1092–1106, 2014.
[14] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871–1874, 2008.
[15] K. Grauman and T. Darrell. Unsupervised learning of categories from sets of partially matching image features. In IEEE Conference on Computer Vision and Pattern Recognition, 2006.
[16] M. Halkidi, Y. Batistakis, and M. Vazirgiannis. On clustering validation techniques. Journal of Intelligent Information Systems, 17(2):107–145, 2001.
[17] R. M. Haralick, K. Shanmugam, and I. H. Dinstein. Textural features for image classification. IEEE Transactions on Systems, Man and Cybernetics, (6):610–621, 1973.
[18] K. Hornik, I. Feinerer, M. Kober, and C. Buchta. Spherical k-means clustering. Journal of Statistical Software, 50(10):1–22, 2012.
[19] L. Hubert and P. Arabie. Comparing partitions. Journal of Classification, 2(1):193–218, 1985.
[20] Y. Jia, C. Huang, and T. Darrell. Beyond spatial pyramids: Receptive field learning for pooled image features. In IEEE Conference on Computer Vision and Pattern Recognition, 2012.
[21] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images, 2009.
[22] H. W. Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1-2):83–97, 1955.
[23] R. Lienhart and J. Maydt. An extended set of Haar-like features for rapid object detection. In International Conference on Image Processing, volume 1. IEEE, 2002.
[24] M. Lin, Q. Chen, and S. Yan. Network in network. CoRR, abs/1312.4400, 2013.
[25] D. G. Lowe. Object recognition from local scale-invariant features. In Seventh IEEE International Conference on Computer Vision, volume 2, pages 1150–1157, 1999.
[26] K. D. Miller. Synaptic economics: competition and cooperation in synaptic plasticity. Neuron, 17(3):371–374, 1996.
[27] A. J. Onwuegbuzie, L. Daniel, and N. L. Leech. Pearson product-moment correlation coefficient. Encyclopedia of Measurement and Statistics, 2007.
[28] M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Is object localization for free? - Weakly-supervised learning with convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[29] T. Schlegl, S. M. Waldstein, W.-D. Vogl, U. Schmidt-Erfurth, and G. Langs. Predicting semantic descriptions from medical images with convolutional neural networks. In Information Processing in Medical Imaging. Springer, 2015.
[30] A. Strehl and J. Ghosh. Cluster ensembles - a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research, 3:583–617, 2003.
[31] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[32] J. Wu, Y. Yu, C. Huang, and K. Yu. Deep multiple instance learning for image classification and auto-annotation. In IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[33] J. Yang, K. Yu, Y. Gong, and T. Huang. Linear spatial pyramid matching using sparse coding for image classification. In IEEE Conference on Computer Vision and Pattern Recognition, 2009.
[34] Y. Zhao and G. Karypis. Criterion functions for document clustering: Experiments and analysis. Technical report, Citeseer, 2001.