
Learning with Hierarchical-Deep Models

Ruslan Salakhutdinov, Joshua B. Tenenbaum, and Antonio Torralba, Member, IEEE

Abstract—We introduce HD (or "Hierarchical-Deep") models, a new compositional learning architecture that integrates deep learning models with structured hierarchical Bayesian (HB) models. Specifically, we show how we can learn a hierarchical Dirichlet process (HDP) prior over the activities of the top-level features in a deep Boltzmann machine (DBM). This compound HDP-DBM model learns to learn novel concepts from very few training examples by learning low-level generic features, high-level features that capture correlations among low-level features, and a category hierarchy for sharing priors over the high-level features that are typical of different kinds of concepts. We present efficient learning and inference algorithms for the HDP-DBM model and show that it is able to learn new concepts from very few examples on CIFAR-100 object recognition, handwritten character recognition, and human motion capture datasets.

Index Terms—Deep networks, deep Boltzmann machines, hierarchical Bayesian models, one-shot learning


1 INTRODUCTION

The ability to learn abstract representations that support transfer to novel but related tasks lies at the core of many problems in computer vision, natural language processing, cognitive science, and machine learning. In typical applications of machine classification algorithms today, learning a new concept requires tens, hundreds, or thousands of training examples. For human learners, however, just one or a few examples are often sufficient to grasp a new category and make meaningful generalizations to novel instances [15], [25], [31], [44]. Clearly, this requires very strong but also appropriately tuned inductive biases. The architecture we describe here takes a step toward this ability by learning several forms of abstract knowledge at different levels of abstraction that support transfer of useful inductive biases from previously learned concepts to novel ones.

We call our architectures compound HD models, where "HD" stands for "Hierarchical-Deep," because they are derived by composing hierarchical nonparametric Bayesian models with deep networks, two influential approaches from the recent unsupervised learning literature with complementary strengths. Recently introduced deep learning models, including deep belief networks (DBNs) [12], deep Boltzmann machines (DBMs) [29], deep autoencoders [19], and many others [9], [10], [21], [22], [26], [32], [34], [43], have been shown to learn useful distributed feature representations for many high-dimensional datasets. The ability to automatically learn in multiple layers allows deep models to construct sophisticated domain-specific features without the need to rely on precise human-crafted input representations, which is increasingly important with the proliferation of datasets and application domains.

While the features learned by deep models can enable more rapid and accurate classification learning, deep networks themselves are not well suited to learning novel classes from few examples. All units and parameters at all levels of the network are engaged in representing any given input ("distributed representations"), and are adjusted together during learning. In contrast, we argue that learning new classes from a handful of training examples will be easier in architectures that can explicitly identify only a small number of degrees of freedom (latent variables and parameters) that are relevant to the new concept being learned, and thereby achieve more appropriate and flexible transfer of learned representations to new tasks. This ability is the hallmark of hierarchical Bayesian (HB) models, recently proposed in computer vision, statistics, and cognitive science [8], [11], [15], [28], [44] for learning from few examples. Unlike deep networks, these HB models explicitly represent category hierarchies that admit sharing the appropriate abstract knowledge about the new class's parameters via a prior abstracted from related classes. HB approaches, however, have complementary weaknesses relative to deep networks. They typically rely on domain-specific hand-crafted features [2], [11] (e.g., GIST and SIFT features in computer vision, MFCC features in speech perception domains). Committing to a-priori defined feature representations, instead of learning them from data, can be detrimental. This is especially important when learning complex tasks, as it is often difficult to hand-craft high-level features explicitly in terms of raw sensory input. Moreover, many HB approaches assume a fixed hierarchy for sharing parameters [6], [33] instead of discovering how parameters are shared among classes in an unsupervised fashion.

In this paper, we propose compound HD architectures that integrate these deep models with structured HB models.


In particular, we show how we can learn a hierarchical Dirichlet process (HDP) prior over the activities of the top-level features in a DBM, coming to represent both a layered hierarchy of increasingly abstract features and a tree-structured hierarchy of classes. Our model depends minimally on domain-specific representations and achieves state-of-the-art performance by unsupervised discovery of three components: 1) low-level features that abstract from the raw high-dimensional sensory input (e.g., pixels or three-dimensional joint angles) and provide a useful first representation for all concepts in a given domain; 2) high-level part-like features that express the distinctive perceptual structure of a specific class, in terms of class-specific correlations over low-level features; and 3) a hierarchy of superclasses for sharing abstract knowledge among related classes via a prior on which higher level features are likely to be distinctive for classes of a certain kind and are thus likely to support learning new concepts of that kind.

We evaluate the compound HDP-DBM model on three different perceptual domains. We also illustrate the advantages of having a full generative model, extending from highly abstract concepts all the way down to sensory inputs: We can not only generalize class labels but also synthesize new examples in novel classes that look reasonably natural, and we can significantly improve classification performance by learning parameters at all levels jointly by maximizing a joint log-probability score.

There have also been several approaches in the computer vision community addressing the problem of learning with few examples. Torralba et al. [42] proposed using several boosted detectors in a multitask setting, where features are shared between several categories. Bart and Ullman [3] further proposed a cross-generalization framework for learning with few examples. Their key assumption is that new features for a novel category are selected from the pool of features that were useful for previously learned classification tasks. In contrast to our work, the above approaches are discriminative by nature and do not attempt to identify similar or relevant categories. Babenko et al. [1] used a boosting approach that simultaneously groups categories into several supercategories, sharing a similarity metric within these classes. They, however, did not attempt to address the transfer learning problem, and primarily focused on large-scale image retrieval tasks. Finally, Fei-Fei et al. [11] used an HB approach, with a prior on the parameters of new categories that was induced from other categories. However, their approach was not ideal as a generic approach to transfer learning with few examples: They learned only a single prior shared across all categories, and the prior was learned from only three categories, chosen by hand. Compared to our work, they used a more elaborate visual object model, based on multiple parts with separate appearance and shape components.

2 DEEP BOLTZMANN MACHINES

A DBM is a network of symmetrically coupled stochastic binary units. It contains a set of visible units $\mathbf{v} \in \{0,1\}^D$ and a sequence of layers of hidden units $\mathbf{h}^{(1)} \in \{0,1\}^{F_1}$, $\mathbf{h}^{(2)} \in \{0,1\}^{F_2}, \ldots, \mathbf{h}^{(L)} \in \{0,1\}^{F_L}$. There are connections only between hidden units in adjacent layers, as well as between visible and hidden units in the first hidden layer. Consider a DBM with three hidden layers (i.e., $L = 3$); extensions to models with more than three layers are straightforward. The energy of the joint configuration $\{\mathbf{v}, \mathbf{h}\}$ is defined as

$$E(\mathbf{v}, \mathbf{h}; \theta) = -\sum_{ij} W^{(1)}_{ij} v_i h^{(1)}_j - \sum_{jl} W^{(2)}_{jl} h^{(1)}_j h^{(2)}_l - \sum_{lk} W^{(3)}_{lk} h^{(2)}_l h^{(3)}_k,$$

where $\mathbf{h} = \{\mathbf{h}^{(1)}, \mathbf{h}^{(2)}, \mathbf{h}^{(3)}\}$ is the set of hidden units and $\theta = \{\mathbf{W}^{(1)}, \mathbf{W}^{(2)}, \mathbf{W}^{(3)}\}$ are the model parameters, representing visible-to-hidden and hidden-to-hidden symmetric interaction terms. (We omit bias terms for clarity of presentation; biases are equivalent to weights on a connection to a unit whose state is fixed at 1.)

The probability that the model assigns to a visible vector $\mathbf{v}$ is given by the Boltzmann distribution:

$$P(\mathbf{v}; \theta) = \frac{1}{\mathcal{Z}(\theta)} \sum_{\mathbf{h}} \exp\left(-E\left(\mathbf{v}, \mathbf{h}^{(1)}, \mathbf{h}^{(2)}, \mathbf{h}^{(3)}; \theta\right)\right). \quad (1)$$

Observe that setting both $\mathbf{W}^{(2)} = 0$ and $\mathbf{W}^{(3)} = 0$ recovers the simpler restricted Boltzmann machine (RBM) model. The conditional distributions over the visible units and the three sets of hidden units are given by

$$p\left(h^{(1)}_j = 1 \mid \mathbf{v}, \mathbf{h}^{(2)}\right) = g\left(\sum_{i=1}^{D} W^{(1)}_{ij} v_i + \sum_{l=1}^{F_2} W^{(2)}_{jl} h^{(2)}_l\right),$$
$$p\left(h^{(2)}_l = 1 \mid \mathbf{h}^{(1)}, \mathbf{h}^{(3)}\right) = g\left(\sum_{j=1}^{F_1} W^{(2)}_{jl} h^{(1)}_j + \sum_{k=1}^{F_3} W^{(3)}_{lk} h^{(3)}_k\right),$$
$$p\left(h^{(3)}_k = 1 \mid \mathbf{h}^{(2)}\right) = g\left(\sum_{l=1}^{F_2} W^{(3)}_{lk} h^{(2)}_l\right),$$
$$p\left(v_i = 1 \mid \mathbf{h}^{(1)}\right) = g\left(\sum_{j=1}^{F_1} W^{(1)}_{ij} h^{(1)}_j\right), \quad (2)$$

where $g(x) = 1/(1 + \exp(-x))$ is the logistic function.

The derivative of the log-likelihood with respect to the model parameters can be obtained from (1):

$$\frac{\partial \log P(\mathbf{v}; \theta)}{\partial \mathbf{W}^{(1)}} = \mathbb{E}_{P_{\text{data}}}\left[\mathbf{v} \mathbf{h}^{(1)\top}\right] - \mathbb{E}_{P_{\text{model}}}\left[\mathbf{v} \mathbf{h}^{(1)\top}\right],$$
$$\frac{\partial \log P(\mathbf{v}; \theta)}{\partial \mathbf{W}^{(2)}} = \mathbb{E}_{P_{\text{data}}}\left[\mathbf{h}^{(1)} \mathbf{h}^{(2)\top}\right] - \mathbb{E}_{P_{\text{model}}}\left[\mathbf{h}^{(1)} \mathbf{h}^{(2)\top}\right],$$
$$\frac{\partial \log P(\mathbf{v}; \theta)}{\partial \mathbf{W}^{(3)}} = \mathbb{E}_{P_{\text{data}}}\left[\mathbf{h}^{(2)} \mathbf{h}^{(3)\top}\right] - \mathbb{E}_{P_{\text{model}}}\left[\mathbf{h}^{(2)} \mathbf{h}^{(3)\top}\right], \quad (3)$$

where $\mathbb{E}_{P_{\text{data}}}[\cdot]$ denotes an expectation with respect to the completed data distribution

$$P_{\text{data}}(\mathbf{h}, \mathbf{v}; \theta) = P(\mathbf{h} \mid \mathbf{v}; \theta) P_{\text{data}}(\mathbf{v}),$$

with $P_{\text{data}}(\mathbf{v}) = \frac{1}{N}\sum_n \delta_{\mathbf{v}_n}(\mathbf{v})$ representing the empirical distribution, and $\mathbb{E}_{P_{\text{model}}}[\cdot]$ is an expectation with respect to the distribution defined by the model (1). We will sometimes refer to $\mathbb{E}_{P_{\text{data}}}[\cdot]$ as the data-dependent expectation and to $\mathbb{E}_{P_{\text{model}}}[\cdot]$ as the model's expectation.

Exact maximum likelihood learning in this model is intractable. The exact computation of the data-dependent expectation takes time that is exponential in the number of hidden units, whereas the exact computation of the model's expectation takes time that is exponential in the number of hidden and visible units.
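To make the conditional distributions in (2) concrete, the sketch below implements one Gibbs sweep through a three-layer binary DBM in NumPy. This is only an illustration of the equations above, not the authors' code; the layer sizes, the small random weights, and the sequential update order within a sweep are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_sweep(v, h1, h2, h3, W1, W2, W3):
    """One Gibbs sweep through a three-layer binary DBM using the conditionals in (2).

    Illustrative shapes: v (D,), h1 (F1,), h2 (F2,), h3 (F3,),
    W1 (D, F1), W2 (F1, F2), W3 (F2, F3). Biases are omitted, as in the text.
    """
    # p(h1_j = 1 | v, h2): input from the layer below and the layer above.
    h1 = (rng.random(W1.shape[1]) < sigmoid(v @ W1 + W2 @ h2)).astype(float)
    # p(h2_l = 1 | h1, h3)
    h2 = (rng.random(W2.shape[1]) < sigmoid(h1 @ W2 + W3 @ h3)).astype(float)
    # p(h3_k = 1 | h2): the top layer sees only the layer below.
    h3 = (rng.random(W3.shape[1]) < sigmoid(h2 @ W3)).astype(float)
    # p(v_i = 1 | h1)
    v = (rng.random(W1.shape[0]) < sigmoid(W1 @ h1)).astype(float)
    return v, h1, h2, h3

# Tiny, made-up example.
D, F1, F2, F3 = 6, 5, 4, 3
W1, W2, W3 = (0.01 * rng.standard_normal(s) for s in [(D, F1), (F1, F2), (F2, F3)])
v = rng.integers(0, 2, D).astype(float)
h1, h2, h3 = np.zeros(F1), np.zeros(F2), np.zeros(F3)
v, h1, h2, h3 = gibbs_sweep(v, h1, h2, h3, W1, W2, W3)
```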

2.1 Approximate Learning

The original learning algorithm for Boltzmann machines used randomly initialized Markov chains to approximate both expectations in order to estimate gradients of the likelihood function [14]. However, this learning procedure is too slow to be practical. Recently, Salakhutdinov and Hinton [29] proposed a variational approach, where mean-field inference is used to estimate data-dependent expectations and an MCMC-based stochastic approximation procedure is used to approximate the model's expected sufficient statistics.

2.1.1 A Variational Approach to Estimating the Data-Dependent Statistics

Consider any approximating distribution $Q(\mathbf{h} \mid \mathbf{v}; \boldsymbol{\mu})$, parameterized by a vector of parameters $\boldsymbol{\mu}$, for the posterior $P(\mathbf{h} \mid \mathbf{v}; \theta)$. Then the log-likelihood of the DBM model has the following variational lower bound:

$$\log P(\mathbf{v}; \theta) \ge \sum_{\mathbf{h}} Q(\mathbf{h} \mid \mathbf{v}; \boldsymbol{\mu}) \log P(\mathbf{v}, \mathbf{h}; \theta) + \mathcal{H}(Q) = \log P(\mathbf{v}; \theta) - \mathrm{KL}\left(Q(\mathbf{h} \mid \mathbf{v}; \boldsymbol{\mu}) \,\|\, P(\mathbf{h} \mid \mathbf{v}; \theta)\right), \quad (4)$$

where $\mathcal{H}(\cdot)$ is the entropy functional and $\mathrm{KL}(Q \| P)$ denotes the Kullback-Leibler divergence between the two distributions. The bound becomes tight if and only if $Q(\mathbf{h} \mid \mathbf{v}; \boldsymbol{\mu}) = P(\mathbf{h} \mid \mathbf{v}; \theta)$.

Variational learning has the nice property that, in addition to maximizing the log-likelihood of the data, it also attempts to find parameters that minimize the Kullback-Leibler divergence between the approximating and true posteriors.

For simplicity and speed, we approximate the true posterior $P(\mathbf{h} \mid \mathbf{v}; \theta)$ with a fully factorized approximating distribution over the three sets of hidden units, which corresponds to the so-called mean-field approximation:

$$Q^{MF}(\mathbf{h} \mid \mathbf{v}; \boldsymbol{\mu}) = \prod_{j=1}^{F_1} \prod_{l=1}^{F_2} \prod_{k=1}^{F_3} q\left(h^{(1)}_j \mid \mathbf{v}\right) q\left(h^{(2)}_l \mid \mathbf{v}\right) q\left(h^{(3)}_k \mid \mathbf{v}\right), \quad (5)$$

where $\boldsymbol{\mu} = \{\boldsymbol{\mu}^{(1)}, \boldsymbol{\mu}^{(2)}, \boldsymbol{\mu}^{(3)}\}$ are the mean-field parameters, with $q(h^{(l)}_i = 1) = \mu^{(l)}_i$ for $l = 1, 2, 3$. In this case, the variational lower bound on the log-probability of the data takes a particularly simple form:

$$\log P(\mathbf{v}; \theta) \ge \sum_{\mathbf{h}} Q^{MF}(\mathbf{h} \mid \mathbf{v}; \boldsymbol{\mu}) \log P(\mathbf{v}, \mathbf{h}; \theta) + \mathcal{H}\left(Q^{MF}\right) = \mathbf{v}^\top \mathbf{W}^{(1)} \boldsymbol{\mu}^{(1)} + \boldsymbol{\mu}^{(1)\top} \mathbf{W}^{(2)} \boldsymbol{\mu}^{(2)} + \boldsymbol{\mu}^{(2)\top} \mathbf{W}^{(3)} \boldsymbol{\mu}^{(3)} - \log \mathcal{Z}(\theta) + \mathcal{H}\left(Q^{MF}\right). \quad (6)$$

Learning proceeds as follows: For each training example, we maximize this lower bound with respect to the variational parameters $\boldsymbol{\mu}$ for fixed parameters $\theta$, which results in the mean-field fixed-point equations:

$$\mu^{(1)}_j \leftarrow g\left(\sum_{i=1}^{D} W^{(1)}_{ij} v_i + \sum_{l=1}^{F_2} W^{(2)}_{jl} \mu^{(2)}_l\right), \quad (7)$$

$$\mu^{(2)}_l \leftarrow g\left(\sum_{j=1}^{F_1} W^{(2)}_{jl} \mu^{(1)}_j + \sum_{k=1}^{F_3} W^{(3)}_{lk} \mu^{(3)}_k\right), \quad (8)$$

$$\mu^{(3)}_k \leftarrow g\left(\sum_{l=1}^{F_2} W^{(3)}_{lk} \mu^{(2)}_l\right), \quad (9)$$

where $g(x) = 1/(1 + \exp(-x))$ is the logistic function. To solve these fixed-point equations, we simply cycle through layers, updating the mean-field parameters within a single layer. Note the close connection between the form of the mean-field fixed-point updates and the form of the conditional distributions defined by (2); implementing the mean-field updates requires no extra work beyond implementing the Gibbs sampler.

Algorithm 1. Learning Procedure for a Deep Boltzmann Machine with Three Hidden Layers.

1: Given: a training set of $N$ binary data vectors $\{\mathbf{v}_n\}_{n=1}^{N}$, and $M$, the number of persistent Markov chains (i.e., particles).
2: Randomly initialize the parameter vector $\theta_0$ and $M$ samples $\{\tilde{\mathbf{v}}_{0,1}, \tilde{\mathbf{h}}_{0,1}\}, \ldots, \{\tilde{\mathbf{v}}_{0,M}, \tilde{\mathbf{h}}_{0,M}\}$, where $\tilde{\mathbf{h}} = \{\tilde{\mathbf{h}}^{(1)}, \tilde{\mathbf{h}}^{(2)}, \tilde{\mathbf{h}}^{(3)}\}$.
3: for $t = 0$ to $T$ (number of iterations) do
4:   // Variational Inference:
5:   for each training example $\mathbf{v}_n$, $n = 1$ to $N$ do
6:     Randomly initialize $\boldsymbol{\mu} = \{\boldsymbol{\mu}^{(1)}, \boldsymbol{\mu}^{(2)}, \boldsymbol{\mu}^{(3)}\}$ and run mean-field updates until convergence, using (7), (8), (9).
7:     Set $\boldsymbol{\mu}_n = \boldsymbol{\mu}$.
8:   end for
9:   // Stochastic Approximation:
10:  for each sample $m = 1$ to $M$ (number of persistent Markov chains) do
11:    Sample $(\tilde{\mathbf{v}}_{t+1,m}, \tilde{\mathbf{h}}_{t+1,m})$ given $(\tilde{\mathbf{v}}_{t,m}, \tilde{\mathbf{h}}_{t,m})$ by running a Gibbs sampler for one step (2).
12:  end for
13:  // Parameter Update:
14:  $\mathbf{W}^{(1)}_{t+1} = \mathbf{W}^{(1)}_{t} + \alpha_t \left(\frac{1}{N}\sum_{n=1}^{N} \mathbf{v}_n \left(\boldsymbol{\mu}^{(1)}_n\right)^\top - \frac{1}{M}\sum_{m=1}^{M} \tilde{\mathbf{v}}_{t+1,m} \left(\tilde{\mathbf{h}}^{(1)}_{t+1,m}\right)^\top\right)$.
15:  $\mathbf{W}^{(2)}_{t+1} = \mathbf{W}^{(2)}_{t} + \alpha_t \left(\frac{1}{N}\sum_{n=1}^{N} \boldsymbol{\mu}^{(1)}_n \left(\boldsymbol{\mu}^{(2)}_n\right)^\top - \frac{1}{M}\sum_{m=1}^{M} \tilde{\mathbf{h}}^{(1)}_{t+1,m} \left(\tilde{\mathbf{h}}^{(2)}_{t+1,m}\right)^\top\right)$.
16:  $\mathbf{W}^{(3)}_{t+1} = \mathbf{W}^{(3)}_{t} + \alpha_t \left(\frac{1}{N}\sum_{n=1}^{N} \boldsymbol{\mu}^{(2)}_n \left(\boldsymbol{\mu}^{(3)}_n\right)^\top - \frac{1}{M}\sum_{m=1}^{M} \tilde{\mathbf{h}}^{(2)}_{t+1,m} \left(\tilde{\mathbf{h}}^{(3)}_{t+1,m}\right)^\top\right)$.
17:  Decrease $\alpha_t$.
18: end for
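The following NumPy sketch mirrors Algorithm 1 under simplifying assumptions: binary data, no bias terms, a fixed number of mean-field iterations instead of running to convergence, and illustrative layer sizes and learning-rate constants. It is a reading aid for the procedure above, not a faithful reimplementation of the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def mean_field(V, W1, W2, W3, n_iters=10):
    """Cycle through the fixed-point updates (7)-(9) for a batch V of shape (N, D)."""
    N = V.shape[0]
    mu1, mu2, mu3 = (rng.random((N, W.shape[1])) for W in (W1, W2, W3))
    for _ in range(n_iters):
        mu1 = sigmoid(V @ W1 + mu2 @ W2.T)    # eq. (7)
        mu2 = sigmoid(mu1 @ W2 + mu3 @ W3.T)  # eq. (8)
        mu3 = sigmoid(mu2 @ W3)               # eq. (9)
    return mu1, mu2, mu3

def gibbs_step(Vp, H1, H2, H3, W1, W2, W3):
    """One Gibbs step (2) applied to the persistent chains."""
    H1 = (rng.random(H1.shape) < sigmoid(Vp @ W1 + H2 @ W2.T)).astype(float)
    H2 = (rng.random(H2.shape) < sigmoid(H1 @ W2 + H3 @ W3.T)).astype(float)
    H3 = (rng.random(H3.shape) < sigmoid(H2 @ W3)).astype(float)
    Vp = (rng.random(Vp.shape) < sigmoid(H1 @ W1.T)).astype(float)
    return Vp, H1, H2, H3

def train_dbm(V, sizes=(20, 15, 10), M=10, T=100, a=0.05, b=100.0):
    """Algorithm 1: mean-field inference + persistent chains + gradient updates."""
    N, D = V.shape
    F1, F2, F3 = sizes
    W1, W2, W3 = (0.01 * rng.standard_normal(s) for s in [(D, F1), (F1, F2), (F2, F3)])
    Vp = rng.integers(0, 2, (M, D)).astype(float)          # M persistent particles
    H1, H2, H3 = (np.zeros((M, f)) for f in (F1, F2, F3))
    for t in range(T):
        mu1, mu2, mu3 = mean_field(V, W1, W2, W3)                # data-dependent statistics
        Vp, H1, H2, H3 = gibbs_step(Vp, H1, H2, H3, W1, W2, W3)  # model statistics
        alpha = a / (b + t)                                      # decreasing learning rate
        W1 += alpha * (V.T @ mu1 / N - Vp.T @ H1 / M)
        W2 += alpha * (mu1.T @ mu2 / N - H1.T @ H2 / M)
        W3 += alpha * (mu2.T @ mu3 / N - H2.T @ H3 / M)
    return W1, W2, W3

# Toy usage on random binary "data".
V = rng.integers(0, 2, (50, 30)).astype(float)
W1, W2, W3 = train_dbm(V)
```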

2.1.2 A Stochastic Approximation Approach for Estimating the Data-Independent Statistics

Given the variational parameters $\boldsymbol{\mu}$, the model parameters $\theta$ are then updated to maximize the variational bound using an MCMC-based stochastic approximation [29], [39], [46].



Learning with stochastic approximation is straightforward. Let $\theta_t$ and $\mathbf{x}_t = \{\mathbf{v}_t, \mathbf{h}^{(1)}_t, \mathbf{h}^{(2)}_t, \mathbf{h}^{(3)}_t\}$ be the current parameters and the state. Then $\mathbf{x}_t$ and $\theta_t$ are updated sequentially as follows:

. Given $\mathbf{x}_t$, sample a new state $\mathbf{x}_{t+1}$ from the transition operator $T_{\theta_t}(\mathbf{x}_{t+1} \leftarrow \mathbf{x}_t)$ that leaves $P(\cdot; \theta_t)$ invariant. This can be accomplished by using Gibbs sampling (see (2)).
. A new parameter $\theta_{t+1}$ is then obtained by making a gradient step, where the intractable model's expectation $\mathbb{E}_{P_{\text{model}}}[\cdot]$ in the gradient is replaced by a point estimate at sample $\mathbf{x}_{t+1}$.

In practice, we typically maintain a set of $M$ "persistent" sample particles $X_t = \{\mathbf{x}_{t,1}, \ldots, \mathbf{x}_{t,M}\}$, and use an average over those particles. The overall learning procedure for DBMs is summarized in Algorithm 1.

Stochastic approximation provides asymptotic convergence guarantees and belongs to the general class of Robbins-Monro approximation algorithms [27], [46]. Precise sufficient conditions that ensure almost sure convergence to an asymptotically stable point are given in [45], [46], and [47]. One necessary condition requires the learning rate to decrease with time so that $\sum_{t=0}^{\infty} \alpha_t = \infty$ and $\sum_{t=0}^{\infty} \alpha_t^2 < \infty$. This condition can, for example, be satisfied simply by setting $\alpha_t = a/(b + t)$ for positive constants $a > 0$, $b > 0$. Other conditions ensure that the speed of convergence of the Markov chain, governed by the transition operator $T_\theta$, does not decrease too fast as $\theta$ tends to infinity. Typically, in practice, the sequence $|\theta_t|$ is bounded, and the Markov chain, governed by the transition kernel $T_\theta$, is ergodic. Together with the condition on the learning rate, this ensures almost sure convergence of the stochastic approximation algorithm to an asymptotically stable point.
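To see that this schedule indeed satisfies the Robbins-Monro conditions, note that for $\alpha_t = a/(b+t)$ with constants $a, b > 0$,

$$\sum_{t=0}^{\infty} \frac{a}{b+t} = \infty \quad \text{(a shifted harmonic series)}, \qquad \sum_{t=0}^{\infty} \frac{a^2}{(b+t)^2} \le \frac{a^2}{b^2} + a^2 \sum_{t=1}^{\infty} \frac{1}{t^2} < \infty,$$

so $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$, as required.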

2.1.3 Greedy Layerwise Pretraining of DBMs

The learning procedure for DBMs described above can be used by starting with randomly initialized weights, but it works much better if the weights are initialized sensibly. We therefore use a greedy layerwise pretraining strategy by learning a stack of modified RBMs (for details see [29]).

This pretraining procedure is quite similar to the pretraining procedure for DBNs [12], and it allows us to perform approximate inference by a single bottom-up pass. This fast approximate inference is then used to initialize the mean-field, which then converges much faster than mean-field with random initialization. (The code for pretraining and generative learning of the DBM model is available at http://www.utstat.toronto.edu/~rsalakhu/DBM.html.)

2.2 Gaussian-Bernoulli DBMs

We now briefly describe a Gaussian-Bernoulli DBM model, which we will use to model real-valued data, such as images of natural scenes and motion capture data. Gaussian-Bernoulli DBMs represent a generalization of a simpler class of models, called Gaussian-Bernoulli RBMs, which have been successfully applied to various tasks, including image classification, video action recognition, and speech recognition [17], [20], [23], [35].

In particular, consider modeling visible real-valued units $\mathbf{v} \in \mathbb{R}^D$, and let $\mathbf{h}^{(1)} \in \{0,1\}^{F_1}$, $\mathbf{h}^{(2)} \in \{0,1\}^{F_2}$, and $\mathbf{h}^{(3)} \in \{0,1\}^{F_3}$ be binary stochastic hidden units. The energy of the joint configuration $\{\mathbf{v}, \mathbf{h}^{(1)}, \mathbf{h}^{(2)}, \mathbf{h}^{(3)}\}$ of the three-hidden-layer Gaussian-Bernoulli DBM is defined as follows:

$$E(\mathbf{v}, \mathbf{h}; \theta) = \frac{1}{2} \sum_i \frac{v_i^2}{\sigma_i^2} - \sum_{ij} W^{(1)}_{ij} h^{(1)}_j \frac{v_i}{\sigma_i} - \sum_{jl} W^{(2)}_{jl} h^{(1)}_j h^{(2)}_l - \sum_{lk} W^{(3)}_{lk} h^{(2)}_l h^{(3)}_k, \quad (10)$$

where $\mathbf{h} = \{\mathbf{h}^{(1)}, \mathbf{h}^{(2)}, \mathbf{h}^{(3)}\}$ is the set of hidden units, $\theta = \{\mathbf{W}^{(1)}, \mathbf{W}^{(2)}, \mathbf{W}^{(3)}, \boldsymbol{\sigma}^2\}$ are the model parameters, and $\sigma_i^2$ is the variance of input $i$. The marginal distribution over the visible vector $\mathbf{v}$ takes the form

$$P(\mathbf{v}; \theta) = \sum_{\mathbf{h}} \frac{\exp\left(-E(\mathbf{v}, \mathbf{h}; \theta)\right)}{\int_{\mathbf{v}'} \sum_{\mathbf{h}} \exp\left(-E(\mathbf{v}', \mathbf{h}; \theta)\right) d\mathbf{v}'}. \quad (11)$$

From (10), it is straightforward to derive the following conditional distributions:

$$p\left(v_i = x \mid \mathbf{h}^{(1)}\right) = \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\left(-\frac{\left(x - \sigma_i \sum_j h^{(1)}_j W^{(1)}_{ij}\right)^2}{2\sigma_i^2}\right),$$
$$p\left(h^{(1)}_j = 1 \mid \mathbf{v}\right) = g\left(\sum_i W^{(1)}_{ij} \frac{v_i}{\sigma_i}\right), \quad (12)$$

where $g(x) = 1/(1 + \exp(-x))$ is the logistic function. The conditional distributions over $\mathbf{h}^{(2)}$ and $\mathbf{h}^{(3)}$ remain the same as in the standard DBM model (see (2)).

Observe that, conditioned on the states of the hidden units (12), each visible unit is modeled by a Gaussian distribution whose mean is shifted by the weighted combination of the hidden unit activations. The derivative of the log-likelihood with respect to $\mathbf{W}^{(1)}$ takes the form

$$\frac{\partial \log P(\mathbf{v}; \theta)}{\partial W^{(1)}_{ij}} = \mathbb{E}_{P_{\text{data}}}\left[\frac{1}{\sigma_i} v_i h^{(1)}_j\right] - \mathbb{E}_{P_{\text{model}}}\left[\frac{1}{\sigma_i} v_i h^{(1)}_j\right].$$

The derivatives with respect to the parameters $\mathbf{W}^{(2)}$ and $\mathbf{W}^{(3)}$ remain the same as in (3).

As described in the previous section, learning of the model parameters, including the variances $\boldsymbol{\sigma}^2$, can be carried out using variational learning together with the stochastic approximation procedure. In practice, however, instead of learning $\boldsymbol{\sigma}^2$, one would typically use a fixed, predetermined value for $\boldsymbol{\sigma}^2$ [13], [24].
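As a small illustration of the Gaussian-Bernoulli conditionals in (12), the sketch below samples the real-valued visible units given $\mathbf{h}^{(1)}$ and vice versa. The sizes, the random weights, and the unit variances are illustrative assumptions (the variances are held fixed, as discussed above).

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

D, F1 = 8, 5
W1 = 0.1 * rng.standard_normal((D, F1))
sigma = np.ones(D)                      # fixed, predetermined standard deviations
h1 = rng.integers(0, 2, F1).astype(float)

# p(v_i = x | h1): Gaussian with mean sigma_i * sum_j h1_j W1_ij and variance sigma_i^2.
mean_v = sigma * (W1 @ h1)
v = rng.normal(mean_v, sigma)

# p(h1_j = 1 | v): logistic of the variance-scaled input sum_i W1_ij v_i / sigma_i.
p_h1 = sigmoid((v / sigma) @ W1)
h1_new = (rng.random(F1) < p_h1).astype(float)
```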

2.3 Multinomial DBMs

To allow DBMs to express more information and introduce more structured hierarchical priors, we will use a conditional multinomial distribution to model the activities of the top-level units $\mathbf{h}^{(3)}$. Specifically, we will use $M$ softmax units, each with "1-of-$K$" encoding, so that each unit contains a set of $K$ weights. We represent the $k$th discrete value of a hidden unit by a vector containing 1 at the $k$th location and zeros elsewhere. The conditional probability of a softmax top-level unit is


$$P\left(h^{(3)}_k \mid \mathbf{h}^{(2)}\right) = \frac{\exp\left(\sum_l W^{(3)}_{lk} h^{(2)}_l\right)}{\sum_{s=1}^{K} \exp\left(\sum_l W^{(3)}_{ls} h^{(2)}_l\right)}. \quad (13)$$

In our formulation, all $M$ separate softmax units share the same set of weights, connecting them to the binary hidden units at the lower level (see Fig. 1). The energy of the state $\{\mathbf{v}, \mathbf{h}\}$ is then defined as follows:

$$E(\mathbf{v}, \mathbf{h}; \theta) = -\sum_{ij} W^{(1)}_{ij} v_i h^{(1)}_j - \sum_{jl} W^{(2)}_{jl} h^{(1)}_j h^{(2)}_l - \sum_{lk} W^{(3)}_{lk} h^{(2)}_l h^{(3)}_k,$$

where $\mathbf{h}^{(1)} \in \{0,1\}^{F_1}$ and $\mathbf{h}^{(2)} \in \{0,1\}^{F_2}$ represent stochastic binary units. The top layer is represented by the $M$ softmax units $\mathbf{h}^{(3,m)}$, $m = 1, \ldots, M$, with $h^{(3)}_k = \sum_{m=1}^{M} h^{(3,m)}_k$ denoting the count for the $k$th discrete value of a hidden unit.

A key observation is that $M$ separate copies of softmax units that all share the same set of weights can be viewed as a single multinomial unit that is sampled $M$ times from the conditional distribution of (13). This gives us a familiar "bag-of-words" representation [30], [36]. A pleasing property of using softmax units is that the mathematics underlying the learning algorithm for binary-binary DBMs remains the same.
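The sketch below illustrates the observation above: $M$ softmax units with tied weights are equivalent to a single multinomial unit sampled $M$ times from (13), yielding a count ("bag-of-words") vector over the $K$ values. All sizes and weights are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

F2, K, M = 6, 10, 20                  # hidden size, number of top-level "words", number of draws
W3 = 0.1 * rng.standard_normal((F2, K))
h2 = rng.integers(0, 2, F2).astype(float)

# Conditional probability of each of the K values for one softmax unit, eq. (13).
logits = h2 @ W3
p = np.exp(logits - logits.max())
p /= p.sum()

# M tied softmax units = one multinomial unit sampled M times; h3 is a count vector over K values.
h3_counts = rng.multinomial(M, p)
assert h3_counts.sum() == M
```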

3 COMPOUND HDP-DBM MODEL

After a DBM model has been learned, we have an undirected model that defines the joint distribution $P(\mathbf{v}, \mathbf{h}^{(1)}, \mathbf{h}^{(2)}, \mathbf{h}^{(3)})$. One way to express what has been learned is the conditional model $P(\mathbf{v}, \mathbf{h}^{(1)}, \mathbf{h}^{(2)} \mid \mathbf{h}^{(3)})$ and a complicated prior term $P(\mathbf{h}^{(3)})$, defined by the DBM model. We can therefore rewrite the variational bound as

$$\log P(\mathbf{v}) \ge \sum_{\mathbf{h}^{(1)}, \mathbf{h}^{(2)}, \mathbf{h}^{(3)}} Q(\mathbf{h} \mid \mathbf{v}; \boldsymbol{\mu}) \log P\left(\mathbf{v}, \mathbf{h}^{(1)}, \mathbf{h}^{(2)} \mid \mathbf{h}^{(3)}\right) + \mathcal{H}(Q) + \sum_{\mathbf{h}^{(3)}} Q\left(\mathbf{h}^{(3)} \mid \mathbf{v}; \boldsymbol{\mu}\right) \log P\left(\mathbf{h}^{(3)}\right). \quad (14)$$

This particular decomposition lies at the core of the greedy recursive pretraining algorithm: We keep the learned conditional model $P(\mathbf{v}, \mathbf{h}^{(1)}, \mathbf{h}^{(2)} \mid \mathbf{h}^{(3)})$, but maximize the variational lower bound of (14) with respect to the last term [12]. This maximization amounts to replacing $P(\mathbf{h}^{(3)})$ by a prior that is closer to the average, over all the data vectors, of the approximate conditional posterior $Q(\mathbf{h}^{(3)} \mid \mathbf{v})$.

Instead of adding an additional undirected layer (e.g., an RBM) to model $P(\mathbf{h}^{(3)})$, we can place an HDP prior over $\mathbf{h}^{(3)}$ that will allow us to learn category hierarchies and, more importantly, useful representations of classes that contain few training examples.

The part we keep, $P(\mathbf{v}, \mathbf{h}^{(1)}, \mathbf{h}^{(2)} \mid \mathbf{h}^{(3)})$, represents a conditional DBM model (our experiments reveal that using DBNs instead of DBMs decreased model performance):

$$P\left(\mathbf{v}, \mathbf{h}^{(1)}, \mathbf{h}^{(2)} \mid \mathbf{h}^{(3)}\right) = \frac{1}{\mathcal{Z}\left(\theta, \mathbf{h}^{(3)}\right)} \exp\left(\sum_{ij} W^{(1)}_{ij} v_i h^{(1)}_j + \sum_{jl} W^{(2)}_{jl} h^{(1)}_j h^{(2)}_l + \sum_{lk} W^{(3)}_{lk} h^{(2)}_l h^{(3)}_k\right), \quad (15)$$

which can be viewed as a two-layer DBM, but with bias terms given by the states of $\mathbf{h}^{(3)}$.

3.1 A Hierarchical Bayesian Prior

In a typical hierarchical topic model, we observe a set of $N$ documents, each of which is modeled as a mixture over topics that are shared among documents. Let there be $K$ words in the vocabulary. A topic $t$ is a discrete distribution over the $K$ words with probability vector $\boldsymbol{\phi}_t$. Each document $n$ has its own distribution over topics given by probabilities $\boldsymbol{\pi}_n$.

In our compound HDP-DBM model, we will use a hierarchical topic model as a prior over the activities of the DBM's top-level features. Specifically, the term "document" will refer to the top-level multinomial unit $\mathbf{h}^{(3)}$, and the $M$ "words" in the document will represent the $M$ samples, or active DBM top-level features, generated by this multinomial unit. Words in each document are drawn by choosing a topic $t$ with probability $\pi_{nt}$, and then choosing a word $w$ with probability $\phi_{tw}$. We will often refer to topics as our learned higher level features, each of which defines a topic-specific distribution over the DBM's $\mathbf{h}^{(3)}$ features. Let $h^{(3)}_{in}$ be the $i$th word in document $n$, and $x_{in}$ be its topic. We can specify the following prior over $\mathbf{h}^{(3)}$:

$$\boldsymbol{\pi}_n \mid \boldsymbol{\pi}_g \sim \text{Dir}\left(\alpha \boldsymbol{\pi}_g\right), \quad \text{for each document } n = 1, \ldots, N,$$
$$\boldsymbol{\phi}_t \mid \eta \sim \text{Dir}(\eta), \quad \text{for each topic } t = 1, \ldots, T,$$
$$x_{in} \mid \boldsymbol{\pi}_n \sim \text{Mult}\left(1, \boldsymbol{\pi}_n\right), \quad \text{for each word } i = 1, \ldots, M,$$
$$h^{(3)}_{in} \mid x_{in}, \boldsymbol{\phi}_{x_{in}} \sim \text{Mult}\left(1, \boldsymbol{\phi}_{x_{in}}\right),$$

where $\boldsymbol{\pi}_g$ is the global distribution over topics, $\text{Dir}(\eta)$ is the global prior over distributions on the $K$ words, and $\alpha$ and $\eta$ are concentration parameters.

Fig. 1. Left: Multinomial DBM model: The top layer represents $M$ softmax hidden units $\mathbf{h}^{(3)}$, which share the same set of weights. Right: A different interpretation: The $M$ softmax units are replaced by a single multinomial unit that is sampled $M$ times.

Let us further assume that our model is presented with a fixed two-level category hierarchy. In particular, suppose that the $N$ documents, or objects, are partitioned into $C$ basic-level categories (e.g., cow, sheep, car). We represent such a partition by a vector $\mathbf{z}^b$ of length $N$, each entry of which is $z^b_n \in \{1, \ldots, C\}$. We also assume that our $C$ basic-level categories are partitioned into $S$ supercategories (e.g., animal, vehicle), represented by a vector $\mathbf{z}^s$ of length $C$, with $z^s_c \in \{1, \ldots, S\}$. These partitions define a fixed two-level tree hierarchy (see Fig. 2). We will relax this assumption later by placing a nonparametric prior over the category assignments.

The hierarchical topic model can be readily extended to modeling the above hierarchy. For each document $n$ that belongs to the basic category $c$, we place a common Dirichlet prior over $\boldsymbol{\pi}_n$ with parameters $\boldsymbol{\pi}^{(1)}_c$. The Dirichlet parameters $\boldsymbol{\pi}^{(1)}$ are themselves drawn from a Dirichlet prior with level-2 parameters $\boldsymbol{\pi}^{(2)}$, common to all basic-level categories that belong to the same supercategory, and so on. Specifically, we define the following hierarchical prior over $\mathbf{h}^{(3)}$:

$$\boldsymbol{\pi}^{(2)}_s \mid \boldsymbol{\pi}^{(3)}_g \sim \text{Dir}\left(\alpha^{(3)} \boldsymbol{\pi}^{(3)}_g\right), \quad \text{for each superclass } s = 1, \ldots, S,$$
$$\boldsymbol{\pi}^{(1)}_c \mid \boldsymbol{\pi}^{(2)}_{z^s_c} \sim \text{Dir}\left(\alpha^{(2)} \boldsymbol{\pi}^{(2)}_{z^s_c}\right), \quad \text{for each basic class } c = 1, \ldots, C,$$
$$\boldsymbol{\pi}_n \mid \boldsymbol{\pi}^{(1)}_{z^b_n} \sim \text{Dir}\left(\alpha^{(1)} \boldsymbol{\pi}^{(1)}_{z^b_n}\right), \quad \text{for each document } n = 1, \ldots, N,$$
$$x_{in} \mid \boldsymbol{\pi}_n \sim \text{Mult}\left(1, \boldsymbol{\pi}_n\right), \quad \text{for each word } i = 1, \ldots, M,$$
$$\boldsymbol{\phi}_t \mid \eta \sim \text{Dir}(\eta),$$
$$h^{(3)}_{in} \mid x_{in}, \boldsymbol{\phi}_{x_{in}} \sim \text{Mult}\left(1, \boldsymbol{\phi}_{x_{in}}\right), \quad (16)$$

where $\boldsymbol{\pi}^{(3)}_g$ is the global distribution over topics, $\boldsymbol{\pi}^{(2)}_s$ is the supercategory-specific distribution, and $\boldsymbol{\pi}^{(1)}_c$ is the class-specific distribution over topics, or higher level features. These high-level features, in turn, define topic-specific distributions over $\mathbf{h}^{(3)}$ features, or "words," in our DBM model. Finally, $\alpha^{(1)}$, $\alpha^{(2)}$, and $\alpha^{(3)}$ represent concentration parameters describing how close the $\boldsymbol{\pi}$'s are to their respective prior means within the hierarchy.

For a fixed number of topics $T$, the above model represents a hierarchical extension of the latent Dirichlet allocation (LDA) model [4]. However, we typically do not know the number of topics a-priori. It is therefore natural to consider a nonparametric extension based on the HDP model [38], which allows for a countably infinite number of topics. In the standard HDP notation, we have the following:

$$G^{(3)}_g \mid \gamma, \eta \sim \text{DP}(\gamma, \text{Dir}(\eta)),$$
$$G^{(2)}_s \mid \alpha^{(3)}, G^{(3)}_g \sim \text{DP}\left(\alpha^{(3)}, G^{(3)}_g\right),$$
$$G^{(1)}_c \mid \alpha^{(2)}, G^{(2)}_{z^s_c} \sim \text{DP}\left(\alpha^{(2)}, G^{(2)}_{z^s_c}\right),$$
$$G_n \mid \alpha^{(1)}, G^{(1)}_{z^b_n} \sim \text{DP}\left(\alpha^{(1)}, G^{(1)}_{z^b_n}\right),$$
$$\bar{\boldsymbol{\phi}}_{in} \mid G_n \sim G_n,$$
$$h^{(3)}_{in} \mid \bar{\boldsymbol{\phi}}_{in} \sim \text{Mult}\left(1, \bar{\boldsymbol{\phi}}_{in}\right), \quad (17)$$

where $\text{Dir}(\eta)$ is the base distribution and each $\bar{\boldsymbol{\phi}}_{in}$ is a factor associated with a single observation $h^{(3)}_{in}$. Making use of the topic index variables $x_{in}$, we denote $\bar{\boldsymbol{\phi}}_{in} = \boldsymbol{\phi}_{x_{in}}$ (see (16)). Using a stick-breaking representation, we can write

$$G^{(3)}_g(\cdot) = \sum_{t=1}^{\infty} \pi^{(3)}_{gt}\, \delta_{\boldsymbol{\phi}_t}, \qquad G^{(2)}_s(\cdot) = \sum_{t=1}^{\infty} \pi^{(2)}_{st}\, \delta_{\boldsymbol{\phi}_t},$$
$$G^{(1)}_c(\cdot) = \sum_{t=1}^{\infty} \pi^{(1)}_{ct}\, \delta_{\boldsymbol{\phi}_t}, \qquad G_n(\cdot) = \sum_{t=1}^{\infty} \pi_{nt}\, \delta_{\boldsymbol{\phi}_t}, \quad (18)$$

which represent sums of point masses. We also place Gamma priors over the concentration parameters, as in [38].

The overall generative model is shown in Fig. 2. To generate a sample, we first draw $M$ words, or activations of the top-level features, from the HDP prior over $\mathbf{h}^{(3)}$ given by (17). Conditioned on $\mathbf{h}^{(3)}$, we sample the states of $\mathbf{v}$ from the conditional DBM model given by (15).
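To make this generative process concrete, the sketch below draws the $M$ "words" of one document from the finite (fixed-$T$) hierarchical Dirichlet prior of (16), rather than the full HDP of (17). The numbers of topics, words, and classes and the concentration values are illustrative assumptions; sampling $\mathbf{v}$ from the conditional DBM of (15) is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

T, K, M = 15, 50, 30                  # topics, top-level DBM features ("words"), words per document
S, C = 2, 4                           # supercategories and basic-level classes
a3, a2, a1, eta = 5.0, 5.0, 5.0, 0.5  # concentration parameters (illustrative values)

z_s = rng.integers(0, S, C)           # superclass assignment of each basic-level class

# Topic-specific distributions over the K top-level features: phi_t | eta ~ Dir(eta).
phi = rng.dirichlet(eta * np.ones(K), size=T)

# Hierarchy of distributions over topics: global -> supercategory -> basic class, as in (16).
pi_g = rng.dirichlet(np.ones(T))
pi_s = np.stack([rng.dirichlet(a3 * pi_g) for _ in range(S)])
pi_c = np.stack([rng.dirichlet(a2 * pi_s[z_s[c]]) for c in range(C)])

def sample_document(c):
    """Draw the M 'words' (active top-level features) for one document of basic class c."""
    pi_n = rng.dirichlet(a1 * pi_c[c])                # document-specific distribution over topics
    x = rng.choice(T, size=M, p=pi_n)                 # topic assignment for each word
    words = np.array([rng.choice(K, p=phi[t]) for t in x])
    return np.bincount(words, minlength=K)            # count vector h^(3) over top-level features

h3_counts = sample_document(c=0)
```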

3.2 Modeling the Number of Supercategories

So far we have assumed that our model is presented with a two-level partition $\mathbf{z} = \{\mathbf{z}^s, \mathbf{z}^b\}$ that defines a fixed two-level tree hierarchy. We note that this model corresponds to a standard HDP model that assumes a fixed hierarchy for sharing parameters. If, however, we are not given any level-1 or level-2 category labels, we need to infer the distribution over the possible category structures. We place a nonparametric two-level nested Chinese restaurant prior (CRP) [5] over $\mathbf{z}$, which defines a prior over tree structures and is flexible enough to learn arbitrary hierarchies. The main building block of the nested CRP is the Chinese restaurant process, a distribution on partitions of integers. Imagine a process by which customers enter a restaurant with an unbounded number of tables, where the $n$th customer occupies a table $k$ drawn from

$$P\left(z_n = k \mid z_1, \ldots, z_{n-1}\right) = \begin{cases} \dfrac{n_k}{n - 1 + \gamma}, & n_k > 0, \\[2mm] \dfrac{\gamma}{n - 1 + \gamma}, & k \text{ is new}, \end{cases} \quad (19)$$

where $n_k$ is the number of previous customers at table $k$ and $\gamma$ is the concentration parameter.

The nested CRP, $\text{nCRP}(\gamma)$, extends the CRP to a nested sequence of partitions, one for each level of the tree. In this case, each observation $n$ is first assigned to the supercategory $z^s_n$ using (19). Its assignment to the basic-level category $z^b_n$, which is placed under the supercategory $z^s_n$, is again recursively drawn from (19). We also place a Gamma prior $\Gamma(1, 1)$ over $\gamma$. The proposed model thus allows for both a nonparametric prior over a potentially unbounded number of global topics, or higher level features, and a nonparametric prior that allows learning an arbitrary tree taxonomy.
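A minimal sketch of sequentially seating customers according to the CRP of (19); the number of customers and the concentration value are arbitrary illustrations.

```python
import numpy as np

rng = np.random.default_rng(0)

def crp_assignments(n_customers, gamma=1.0):
    """Sequentially assign customers to tables according to eq. (19)."""
    z = []          # table index chosen by each customer
    counts = []     # n_k: number of previous customers at each table
    for n in range(n_customers):
        # Probabilities proportional to n_k for existing tables and gamma for a new table;
        # the shared normalizer is (number of previous customers) + gamma.
        probs = np.array(counts + [gamma], dtype=float)
        probs /= probs.sum()
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):
            counts.append(1)          # open a new table
        else:
            counts[k] += 1
        z.append(k)
    return z, counts

z, counts = crp_assignments(20, gamma=1.5)
```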


Fig. 2. HDP prior over the states of the DBM's top-level features $\mathbf{h}^{(3)}$.


Unlike in many conventional HB models, here we infer both the model parameters and the hierarchy for sharing those parameters. As we show in the experimental results section, both sharing higher level features and forming coherent hierarchies play a crucial role in the ability of the model to generalize well from one or a few examples of a novel category. Our model can be readily used in unsupervised or semi-supervised modes, with varying amounts of label information at different levels of the hierarchy.

4 INFERENCE

Inferences about model parameters at all levels of the hierarchy can be performed by MCMC. When the tree structure $\mathbf{z}$ of the model is not given, the inference process alternates between fixing $\mathbf{z}$ while sampling the space of model parameters, and vice versa.

Sampling HDP parameters. Given the category assignment vector $\mathbf{z}$ and the states of the top-level DBM features $\mathbf{h}^{(3)}$, we use the posterior representation sampler of [37]. In particular, the HDP sampler maintains the stick-breaking weights $\{\boldsymbol{\pi}_n\}_{n=1}^{N}$ and $\{\boldsymbol{\pi}^{(1)}_c, \boldsymbol{\pi}^{(2)}_s, \boldsymbol{\pi}^{(3)}_g\}$, as well as the topic indicator variables $\mathbf{x}$ (the parameters $\boldsymbol{\phi}$ can be integrated out). The sampler alternates between: 1) sampling cluster indices $x_{in}$ using Gibbs updates in the Chinese restaurant franchise (CRF) representation of the HDP; and 2) sampling the weights at all three levels conditioned on $\mathbf{x}$ using the usual posterior of a DP.

Conditioned on the draw of the superclass DP $G^{(2)}_s$ and the state of the CRF, the posteriors over $G^{(1)}_c$ become independent. We can easily speed up inference by sampling from these conditionals in parallel. The speedup could be substantial, particularly as the number of basic-level categories becomes large.

Sampling category assignments $\mathbf{z}$. Given the current instantiation of the stick-breaking weights, for each input $n$ we have

$$\left(\pi_{1,n}, \ldots, \pi_{T,n}, \pi_{\text{new},n}\right) \sim \text{Dir}\left(\alpha^{(1)} \pi^{(1)}_{z_n,1}, \ldots, \alpha^{(1)} \pi^{(1)}_{z_n,T}, \alpha^{(1)} \pi^{(1)}_{z_n,\text{new}}\right). \quad (20)$$

Combining the above likelihood term with the CRP prior (19), the posterior over the category assignment can be calculated as follows:

$$p\left(z_n \mid \boldsymbol{\pi}_n, \mathbf{z}_{-n}, \boldsymbol{\pi}^{(1)}\right) \propto p\left(\boldsymbol{\pi}_n \mid \boldsymbol{\pi}^{(1)}, z_n\right) p\left(z_n \mid \mathbf{z}_{-n}\right), \quad (21)$$

where $\mathbf{z}_{-n}$ denotes the variables $\mathbf{z}$ for all observations other than $n$. When computing the probability of placing $\boldsymbol{\pi}_n$ under a newly created category, its parameters are sampled from the prior.
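Putting (20), (19), and (21) together, the sketch below scores candidate category assignments for a single document by combining the Dirichlet likelihood of its topic proportions with the CRP weights over existing and new categories. It collapses the nested two-level structure to a single level for brevity, and all sizes, counts, and parameter values are illustrative assumptions.

```python
import numpy as np
from scipy.stats import dirichlet

rng = np.random.default_rng(0)

T = 10                                    # number of instantiated topics
alpha1, gamma = 5.0, 1.0                  # class-level and CRP concentration parameters
counts = np.array([12, 7, 4])             # documents already assigned to each existing category
C = len(counts)
pi_c = rng.dirichlet(np.ones(T), size=C)  # class-specific topic distributions pi^(1)_c
pi_n = rng.dirichlet(np.ones(T))          # topic proportions of the document to be assigned

log_post = []
for c in range(C):
    # Likelihood term (20): pi_n | z_n = c ~ Dir(alpha1 * pi^(1)_c), weighted by the CRP count n_c.
    log_post.append(dirichlet.logpdf(pi_n, alpha1 * pi_c[c]) + np.log(counts[c]))

# New category: its parameters are sampled from the prior (here a flat Dirichlet), with weight gamma.
pi_new = rng.dirichlet(np.ones(T))
log_post.append(dirichlet.logpdf(pi_n, alpha1 * pi_new) + np.log(gamma))

log_post = np.array(log_post)
post = np.exp(log_post - log_post.max())
post /= post.sum()                        # approximate posterior over z_n, eq. (21)
```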

Sampling DBM's hidden units. Given the states of the DBM's top-level multinomial unit $\mathbf{h}^{(3)}_n$, conditional samples from $P(\mathbf{h}^{(1)}_n, \mathbf{h}^{(2)}_n \mid \mathbf{h}^{(3)}_n, \mathbf{v}_n)$ can be obtained by running a Gibbs sampler that alternates between sampling the states of $\mathbf{h}^{(1)}_n$ independently given $\mathbf{h}^{(2)}_n$, and vice versa. Conditioned on the topic assignments $x_{in}$ and $\mathbf{h}^{(2)}_n$, the states of the multinomial unit $\mathbf{h}^{(3)}_n$ for each input $n$ are sampled using the Gibbs conditionals:

$$P\left(h^{(3)}_{in} \mid \mathbf{h}^{(2)}_n, \mathbf{h}^{(3)}_{-in}, \mathbf{x}_n\right) \propto P\left(\mathbf{h}^{(2)}_n \mid \mathbf{h}^{(3)}_n\right) P\left(h^{(3)}_{in} \mid x_{in}\right), \quad (22)$$

where the first term is given by the product of logistic functions (see (15)):

$$P\left(\mathbf{h}^{(2)}_n \mid \mathbf{h}^{(3)}_n\right) = \prod_l P\left(h^{(2)}_{ln} \mid \mathbf{h}^{(3)}_n\right), \quad \text{with} \quad P\left(h^{(2)}_l = 1 \mid \mathbf{h}^{(3)}\right) = \frac{1}{1 + \exp\left(-\sum_k W^{(3)}_{lk} h^{(3)}_k\right)}, \quad (23)$$

and the second term $P(h^{(3)}_{in} \mid x_{in})$ is given by the multinomial $\text{Mult}(1, \boldsymbol{\phi}_{x_{in}})$ (see (17)). In our conjugate setting, the parameters $\boldsymbol{\phi}$ can be further integrated out.

Fine-tuning DBM. Finally, conditioned on the states of $\mathbf{h}^{(3)}$, we can further fine-tune the low-level DBM parameters $\theta = \{\mathbf{W}^{(1)}, \mathbf{W}^{(2)}, \mathbf{W}^{(3)}\}$ by applying approximate maximum likelihood learning (see Section 2) to the conditional DBM model of (15). For the stochastic approximation algorithm, since the partition function depends on the states of $\mathbf{h}^{(3)}$, we maintain one "persistent" Markov chain per data point (for details see [29], [39]). As we show in our experimental results section, fine-tuning the low-level DBM features can significantly improve model performance.

4.1 Making Predictions

Given a test input $\mathbf{v}_t$, we can quickly infer the approximate posterior over $\mathbf{h}^{(3)}_t$ using the mean-field of (6), followed by running the full Gibbs sampler to get approximate samples from the posterior over the category assignments. In practice, for faster inference, we fix the learned topics $\boldsymbol{\phi}_t$ and approximate the marginal likelihood that $\mathbf{h}^{(3)}_t$ belongs to category $z_t$ by assuming that the document-specific DP can be well approximated by the class-specific DP, $G_t \approx G^{(1)}_{z_t}$ (see Fig. 2); note that $G^{(1)}_{z_t} = \mathbb{E}[G_t \mid G^{(1)}_{z_t}]$. Hence, instead of integrating out the document-specific DP $G_t$, we approximate

$$P\left(\mathbf{h}^{(3)}_t \mid z_t, G^{(1)}, \boldsymbol{\phi}\right) = \int_{G_t} P\left(\mathbf{h}^{(3)}_t \mid \boldsymbol{\phi}, G_t\right) P\left(G_t \mid G^{(1)}_{z_t}\right) dG_t \approx P\left(\mathbf{h}^{(3)}_t \mid \boldsymbol{\phi}, G^{(1)}_{z_t}\right), \quad (24)$$

which can be computed analytically by integrating out the topic assignments $x_{in}$ (17). Combining this likelihood term with the nCRP prior $P(z_t \mid \mathbf{z}_{-t})$ of (19) allows us to efficiently infer an approximate posterior over category assignments. In all of our experimental results, computing this approximate posterior takes a fraction of a second, which is crucial for applications such as object recognition or information retrieval.

5 EXPERIMENTS

We present experimental results on the CIFAR-100 [17], handwritten character [18], and human motion capture recognition datasets. For all datasets, we first pretrain a DBM model in unsupervised fashion on raw sensory input (e.g., pixels or three-dimensional joint angles), followed by fitting an HDP prior, which is run for 200 Gibbs sweeps. We further run 200 additional Gibbs steps to fine-tune the parameters of the entire compound HDP-DBM model. This was sufficient to obtain good performance. Across all datasets, we also assume that the basic-level category labels are given, but no supercategory labels are available. We must infer how to cluster basic categories into supercategories at the same time as we infer parameter values at all levels of the hierarchy. The training set includes many examples of familiar categories but only a few examples of a novel class. Our goal is to generalize well on the novel class.

In all experiments, we compare the performance of HDP-DBM to the following alternative models. The first two models, stand-alone DBMs and DBNs [12], used three layers of hidden variables and were pretrained using a stack of RBMs. To evaluate the classification performance of DBNs and DBMs, both models were converted into multilayer neural networks and were discriminatively fine-tuned using the backpropagation algorithm (see [29] for details). Our third model, "Flat HDP-DBM," always used a single supercategory. The Flat HDP-DBM approach, similar in spirit to the one-shot learning model of [11], could potentially identify a set of useful high-level features common to all categories. Our fourth model used a version of SVM that implements cost-sensitive learning (we used the LIBSVM software package of [7]). The basic idea is to assign a larger penalty value for misclassifying examples that arise from the underrepresented class. In our setting, this model performs slightly better than a standard SVM classifier. Our last model used a simple k-nearest neighbors (k-NN) classifier. Finally, using HDPs on top of the raw sensory input (i.e., pixels, or even image-specific GIST features) performs far worse than our HDP-DBM model.

5.1 CIFAR-100 Data Set

The CIFAR-100 image dataset [17] contains 50,000 training and 10,000 test images (100 per class) of 100 object categories, with 32 × 32 × 3 RGB pixels. Extreme variability in scale, viewpoint, illumination, and cluttered background makes the object recognition task for this dataset quite difficult. Similarly to [17], to learn good generic low-level features, we first train a two-layer DBM in completely unsupervised fashion using 4 million tiny images [40], a collection of random natural-scene images downloaded from the web. We use a conditional Gaussian distribution to model the observed pixel values [13], [17]. The first DBM layer contained 10,000 binary hidden units, and the second layer contained M = 1,000 softmax units. (Generative training of the DBM model on the 4 million images takes about a week on an Intel Xeon 3.00 GHz machine, and fitting an HDP prior to the DBM's top-level features on the CIFAR dataset takes about 12 hours; at test time, however, using variational inference and the approximation of (24), it takes a fraction of a second to classify a test example into its corresponding category.) We then fit an HDP prior over $\mathbf{h}^{(2)}$ to the 100 object classes. We also experimented with a three-layer DBM model, as well as various softmax parameters: M = 500 and M = 2,000. The difference in performance was not significant.

Fig. 3 displays a random subset of the training data, first and second layer DBM features, as well as higher level class-sensitive features, or topics, learned by the HDP model. Second layer features were visualized as a weighted linear combination of the first layer features, as in [21]. To visualize a particular higher level feature, we first sample $M$ words from a fixed topic $\boldsymbol{\phi}_t$, followed by sampling RGB pixel values from the conditional DBM model. While the DBM features capture mostly low-level structure, including edges and corners, the HDP features tend to capture higher level structure, including contours, shapes, color components, and surface boundaries in the images. More importantly, features at all levels of the hierarchy evolve without incorporating any image-specific priors. Fig. 4 shows a typical partition over the 100 classes that our model discovers, with many supercategories containing semantically similar classes.

Table 1 quantifies performance using the area under the ROC curve (AUROC) for classifying 10,000 test images as belonging to the novel versus all of the other 99 classes. We report 2·AUROC - 1, so zero corresponds to a classifier that makes random predictions. The results are averaged over 100 classes using the "leave-one-out" test format. Based on a single example, the HDP-DBM model achieves an AUROC of 0.36, significantly outperforming DBMs, DBNs, SVMs, and 1-NN using standard image-specific GIST features (GIST descriptors have previously been used for this dataset [41]), which achieve AUROCs of 0.26, 0.25, 0.20, and 0.27, respectively. Table 1 also shows that fine-tuning the parameters of all layers jointly, as well as learning the supercategory hierarchy, significantly improves model performance. As the number of training examples increases, the HDP-DBM model still outperforms alternative methods.


Fig. 3. A random subset of the training images along with the first and second layer DBM features and higher level class-sensitive HDP features/topics. To visualize higher level features, we first sample $M$ words from a fixed topic $\boldsymbol{\phi}_t$, followed by sampling RGB pixel values from the conditional DBM model.



With 50 training examples, however, all models perform about the same. This is to be expected: With more training examples, the effect of the hierarchical prior decreases.

We next illustrate the ability of the HDP-DBM to generalize from a single training example of a "pear" class. We trained the model on 99 classes containing 500 training images each, but only one training example of the "pear" class. Fig. 5 shows the kind of transfer our model is performing, where we display training examples along with the eight most probable topics $\boldsymbol{\phi}_t$, ordered by hand. The model discovers that pears are like apples and oranges, and not like other classes of images, such as dolphins, that reside in very different parts of the hierarchy. Hence the novel category can inherit the prior distribution over similar high-level shape and color features, allowing the HDP-DBM to generalize considerably better to new instances of the "pear" class.

We next examined the generative performance of the HDP-DBM model. Fig. 6 shows samples generated by the HDP-DBM model for four classes: "Apple," "Willow Tree," "Elephant," and "Castle." Despite extreme variability in scale, viewpoint, and cluttered background, the model is able to capture the overall structure of each class. Fig. 7 shows conditional samples when learning with only three training examples of a novel class. For example, based on only three training examples of the "Apple" class, the HDP-DBM model is able to generate a rich variety of new apples. Fig. 8 further quantifies the performance of the HDP-DBM, DBM, and SVM models for all object categories when learning with only three examples. Observe that over 40 classes benefit in various degrees from both learning a hierarchy and learning low- and high-level features.

5.2 Handwritten Characters

The handwritten characters dataset [18] can be viewed as the "transpose" of the standard MNIST dataset. Instead of containing 60,000 images of 10 digit classes, the dataset contains 30,000 images of 1,500 characters (20 examples each) with 28 × 28 pixels. These characters are from 50 alphabets from around the world, including Bengali, Cyrillic, Arabic, Sanskrit, and Tagalog (see Fig. 9). We split the dataset into 15,000 training and 15,000 test images (10 examples of each class). Similarly to the CIFAR dataset, we pretrain a two-layer DBM model, with the first layer containing 1,000 hidden units and the second layer containing M = 100 softmax units. The HDP prior over $\mathbf{h}^{(2)}$ was fit to all 1,500 character classes.


TABLE 1. Classification Performance on the Test Set Using 2·AUROC - 1. The results in bold correspond to ROCs that are statistically indistinguishable from the best (the difference is not statistically significant).

Fig. 4. A typical partition of the 100 basic-level categories. Many of the discovered supercategories contain semantically coherent classes.

Fig. 5. Learning to learn: Training examples along with the eight most probable topics $\boldsymbol{\phi}_t$, ordered by hand.


Fig. 9 displays a random subset of training images, along with the first and second layer DBM features, as well as higher level class-sensitive HDP features. The first layer features capture low-level structure, such as edges and corners, while the HDP features tend to capture higher level parts, many of which resemble pen "strokes," which is believed to be a promising way to represent characters [18]. The model discovers approximately 50 supercategories, and Fig. 10 shows a typical partition of some of the classes into supercategories that share the same prior distribution over "strokes." Similarly to the CIFAR dataset, many of the supercategories contain meaningful groups of characters.

Table 1 further shows results for classifying 15,000 test images as belonging to the novel versus all of the other 1,499 character classes. The results are averaged over 200 characters chosen at random, using the "leave-one-out" test format. The HDP-DBM model significantly outperforms the other methods, particularly when learning characters with few training examples. This result demonstrates that the HDP-DBM model is able to successfully transfer an appropriate prior over higher level "strokes" from previously learned categories.

Fig. 6. Class-conditional samples generated from the HDP-DBM model. Observe that, despite extreme variability, the model is able to capture the coherent structure of each class. Best viewed in color.

Fig. 7. Conditional samples generated by the HDP-DBM model when learning with only three training examples of a novel class. Top: Three training examples. Bottom: 49 conditional samples. Best viewed in color.

Fig. 8. Performance of HDP-DBM, DBM, and SVMs for all object classes when learning with three examples. Object categories are sorted by their performance.

We next tested the generative aspect of the HDP-DBM model. Fig. 11 displays learned superclasses along with examples of entirely novel characters that have been generated by the model for the same superclass. In particular, the left panels show training characters in one supercategory, with each row displaying a different observed character and each column displaying a drawing produced by a different subject. The right panels show examples of novel synthesized characters in the corresponding supercategory, where each row displays a different synthesized character and each column shows a different example generated at random by the HDP-DBM model. Note that many samples look realistic, containing coherent, long-range structure, while at the same time being different from existing training images.

Fig. 12 further shows conditional samples when learning with only three training examples of a novel character. Each panel shows three figures: 1) three training examples of a novel character class, 2) 12 synthesized examples of that class, and 3) samples of the training characters in the same supercategory that the novel character has been grouped under. Many of the novel characters are grouped together with related classes, allowing each character to inherit the prior distribution over similar high-level "strokes," and hence generalizing better to new instances of the corresponding class (see the supplemental materials, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TPAMI.2012.269, for a much richer class of generated samples). Using DBNs instead of DBMs produced far inferior generative samples, both when generating new characters and when learning from three examples.

5.3 Motion Capture

Results on the CIFAR and character datasets show that the HDP-DBM model can significantly outperform many other models on object and character recognition tasks. Features at all levels of the hierarchy were learned without assuming any image-specific priors, and the proposed model can be applied in a wide variety of application domains. In this section, we show that the HDP-DBM model can also be applied to modeling human motion capture data.

The human motion capture dataset consists of sequences of 3D joint angles plus body orientation and translation, as shown in Fig. 13, and was preprocessed to be invariant to isometries [34]. The dataset contains 10 walking styles, including normal, drunk, graceful, gangly, sexy, dinosaur, chicken, old person, cat, and strong. There are 2,500 frames of each style at 60 fps, where each time step was represented by a vector of 58 real-valued numbers. The dataset was split at random into 1,500 training and 1,000 test frames of each style. We further preprocessed the data by treating each window of 10 consecutive frames as a single 58 × 10 = 580-d data vector.
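As a rough sketch of the preprocessing described above (the exact windowing scheme is an assumption; this is not the authors' code), each 580-d data vector can be formed by flattening a window of 10 consecutive 58-d frames:

import numpy as np

def make_windows(frames, window=10):
    """frames: array of shape (T, 58) holding one sequence of joint-angle
    vectors. Returns an array of shape (T - window + 1, 58 * window), one
    flattened 580-d vector per (overlapping) window of consecutive frames."""
    T, d = frames.shape
    return np.stack([frames[t:t + window].reshape(-1)
                     for t in range(T - window + 1)])

# Example with stand-in data: 2,500 frames of one walking style.
style_frames = np.random.randn(2500, 58)   # placeholder for real joint angles
windows = make_windows(style_frames)       # shape (2491, 580)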


Fig. 10. Some of the learned supercategories that share the same prior distribution over "strokes." Many of the discovered supercategories contain meaningful groupings of characters.

Fig. 9. A random subset of the training images along with the first and second layer DBM features, as well as higher level class-sensitive HDP features/topics. To visualize higher level features, we first sample M words from a fixed topic θ_t, followed by sampling pixel values from the conditional DBM model.


For the two-layer DBM model, the first layer contained 500 hidden units, with the second layer containing M = 50 softmax units. The HDP prior over the second-layer features was fit to the various walking styles. Using the "leave-one-out" test format, Table 1 shows that the HDP-DBM model performs much better than other models when discriminating between the nine existing walking styles and a novel walking style. The difference is particularly large in the regime where we observe only a handful of training examples of the novel walking style.
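The layer sizes quoted above can be recorded in a small configuration object; this is purely illustrative (the names are ours, not the authors') and simply restates the architecture in code form:

from dataclasses import dataclass

@dataclass
class MotionDBMConfig:
    visible_dim: int = 58 * 10   # one 10-frame window of 58 joint-angle values
    n_hidden1: int = 500         # first-layer hidden units of the DBM
    n_hidden2: int = 50          # M second-layer (softmax) units modeled by the HDP prior

cfg = MotionDBMConfig()
assert cfg.visible_dim == 580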

6 CONCLUSIONS

We developed a compositional architecture that learns an HDP prior over the activities of the top-level features of the DBM model. The resulting compound HDP-DBM model is able to learn low-level features from raw, high-dimensional sensory input, high-level features, as well as a category hierarchy for parameter sharing. Our experimental results show that the proposed model can acquire new concepts from very few examples in a diverse set of application domains.

The compositional model considered in this paper was directly inspired by the architecture of the DBM and HDP, but it need not be. Indeed, any other deep learning module, including DBNs and sparse autoencoders, or any other HB model, can be adapted. This perspective opens a space of compositional models that may be more suitable for capturing the human-like ability to learn from few examples.


Fig. 12. Each panel shows three figures from left to right: 1) three training examples of a novel character class, 2) 12 synthesized examples of that class, and 3) training characters in the same supercategory that the novel character has been assigned to.

Fig. 11. Within each panel: Left: Examples of training characters in one supercategory; each row is a different training character and each column is a drawing produced by a different subject. Right: Examples of novel sampled characters in the corresponding supercategory; each row is a different sampled character, and each column is a different example generated at random by the model.



ACKNOWLEDGMENTS

This research was supported by NSERC, ONR (MURI Grant 1015GNA126), ONR N00014-07-1-0937, ARO W911NF-08-1-0242, and Qualcomm.

REFERENCES

[1] B. Babenko, S. Branson, and S.J. Belongie, "Similarity Functions for Categorization: From Monolithic to Category Specific," Proc. IEEE Int'l Conf. Computer Vision, 2009.

[2] E. Bart, I. Porteous, P. Perona, and M. Welling, "Unsupervised Learning of Visual Taxonomies," Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 1-8, 2008.

[3] E. Bart and S. Ullman, "Cross-Generalization: Learning Novel Classes from a Single Example by Feature Replacement," Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 672-679, 2005.

[4] D.M. Blei, A.Y. Ng, and M.I. Jordan, "Latent Dirichlet Allocation," J. Machine Learning Research, vol. 3, pp. 993-1022, 2003.

[5] D.M. Blei, T.L. Griffiths, and M.I. Jordan, "The Nested Chinese Restaurant Process and Bayesian Nonparametric Inference of Topic Hierarchies," J. ACM, vol. 57, no. 2, 2010.

[6] K.R. Canini and T.L. Griffiths, "Modeling Human Transfer Learning with the Hierarchical Dirichlet Process," Proc. NIPS Workshop Nonparametric Bayes, 2009.

[7] C.-C. Chang and C.-J. Lin, "LIBSVM: A Library for Support Vector Machines," ACM Trans. Intelligent Systems and Technology, vol. 2, pp. 27:1-27:27, 2011.

[8] B. Chen, G. Polatkan, G. Sapiro, D.B. Dunson, and L. Carin, "The Hierarchical Beta Process for Convolutional Factor Analysis and Deep Learning," Proc. 28th Int'l Conf. Machine Learning, pp. 361-368, 2011.

[9] A. Coates, B. Carpenter, C. Case, S. Satheesh, B. Suresh, T. Wang, and A.Y. Ng, "Text Detection and Character Recognition in Scene Images with Unsupervised Feature Learning," Proc. 11th Int'l Conf. Document Analysis and Recognition, 2011.

[10] A. Courville, J. Bergstra, and Y. Bengio, "Unsupervised Models of Images by Spike-and-Slab RBMs," Proc. 28th Int'l Conf. Machine Learning, pp. 1145-1152, June 2011.

[11] L. Fei-Fei, R. Fergus, and P. Perona, "One-Shot Learning of Object Categories," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 28, no. 4, pp. 594-611, Apr. 2006.

[12] G.E. Hinton, S. Osindero, and Y.W. Teh, "A Fast Learning Algorithm for Deep Belief Nets," Neural Computation, vol. 18, no. 7, pp. 1527-1554, 2006.

[13] G.E. Hinton and R.R. Salakhutdinov, "Reducing the Dimensionality of Data with Neural Networks," Science, vol. 313, no. 5786, pp. 504-507, 2006.

[14] G.E. Hinton and T. Sejnowski, "Optimal Perceptual Inference," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 1983.

[15] C. Kemp, A. Perfors, and J. Tenenbaum, "Learning Overhypotheses with Hierarchical Bayesian Models," Developmental Science, vol. 10, no. 3, pp. 307-321, 2006.

[16] A. Krizhevsky, "Learning Multiple Layers of Features from Tiny Images," 2009.

[17] A. Krizhevsky, "Learning Multiple Layers of Features from Tiny Images," technical report, Dept. of Computer Science, Univ. of Toronto, 2009.

[18] B. Lake, R. Salakhutdinov, J. Gross, and J. Tenenbaum, "One-Shot Learning of Simple Visual Concepts," Proc. 33rd Ann. Conf. Cognitive Science Soc., 2011.

[19] H. Larochelle, Y. Bengio, J. Louradour, and P. Lamblin, "Exploring Strategies for Training Deep Neural Networks," J. Machine Learning Research, vol. 10, pp. 1-40, 2009.

[20] H. Lee, R. Grosse, R. Ranganath, and A.Y. Ng, "Convolutional Deep Belief Networks for Scalable Unsupervised Learning of Hierarchical Representations," Proc. Int'l Conf. Machine Learning, pp. 609-616, 2009.

[21] H. Lee, R. Grosse, R. Ranganath, and A.Y. Ng, "Convolutional Deep Belief Networks for Scalable Unsupervised Learning of Hierarchical Representations," Proc. 26th Int'l Conf. Machine Learning, pp. 609-616, 2009.

[22] Y. Lin, T. Zhang, S. Zhu, and K. Yu, "Deep Coding Networks," Proc. Advances in Neural Information Processing Systems Conf., vol. 23, 2011.

[23] A. Mohamed, G. Dahl, and G. Hinton, "Acoustic Modeling Using Deep Belief Networks," IEEE Trans. Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 14-22, Jan. 2012.

[24] V. Nair and G.E. Hinton, "Implicit Mixtures of Restricted Boltzmann Machines," Proc. Advances in Neural Information Processing Systems Conf., vol. 21, 2009.

[25] A. Perfors and J.B. Tenenbaum, "Learning to Learn Categories," Proc. 31st Ann. Conf. Cognitive Science Soc., pp. 136-141, 2009.

[26] M.A. Ranzato, Y. Boureau, and Y. LeCun, "Sparse Feature Learning for Deep Belief Networks," Proc. Advances in Neural Information Processing Systems Conf., 2008.

[27] H. Robbins and S. Monro, "A Stochastic Approximation Method," Annals Math. Statistics, vol. 22, pp. 400-407, 1951.

[28] A. Rodriguez, D. Dunson, and A. Gelfand, "The Nested Dirichlet Process," J. Am. Statistical Assoc., vol. 103, pp. 1131-1144, 2008.

[29] R.R. Salakhutdinov and G.E. Hinton, "Deep Boltzmann Machines," Proc. Int'l Conf. Artificial Intelligence and Statistics, vol. 12, 2009.

[30] R.R. Salakhutdinov and G.E. Hinton, "Replicated Softmax: An Undirected Topic Model," Proc. Advances in Neural Information Processing Systems Conf., vol. 22, 2010.

[31] L.B. Smith, S.S. Jones, B. Landau, L. Gershkoff-Stowe, and L. Samuelson, "Object Name Learning Provides On-the-Job Training for Attention," Psychological Science, vol. 13, pp. 13-19, 2002.

[32] R. Socher, C. Lin, A.Y. Ng, and C. Manning, "Parsing Natural Scenes and Natural Language with Recursive Neural Networks," Proc. 28th Int'l Conf. Machine Learning, 2011.

[33] E.B. Sudderth, A. Torralba, W.T. Freeman, and A.S. Willsky, "Describing Visual Scenes Using Transformed Objects and Parts," Int'l J. Computer Vision, vol. 77, nos. 1-3, pp. 291-330, 2008.

[34] G. Taylor, G.E. Hinton, and S.T. Roweis, "Modeling Human Motion Using Binary Latent Variables," Proc. Advances in Neural Information Processing Systems Conf., 2006.

[35] G.W. Taylor, R. Fergus, Y. LeCun, and C. Bregler, "Convolutional Learning of Spatio-Temporal Features," Proc. 11th European Conf. Computer Vision, 2010.

[36] Y.W. Teh and G.E. Hinton, "Rate-Coded Restricted Boltzmann Machines for Face Recognition," Proc. Advances in Neural Information Processing Systems Conf., vol. 13, 2001.

[37] Y.W. Teh and M.I. Jordan, "Hierarchical Bayesian Nonparametric Models with Applications," Bayesian Nonparametrics: Principles and Practice, Cambridge Univ. Press, 2010.

[38] Y.W. Teh, M.I. Jordan, M.J. Beal, and D.M. Blei, "Hierarchical Dirichlet Processes," J. Am. Statistical Assoc., vol. 101, no. 476, pp. 1566-1581, 2006.


Fig. 13. Human motion capture data that corresponds to the "normal" walking style.


[39] T. Tieleman, "Training Restricted Boltzmann Machines Using Approximations to the Likelihood Gradient," Proc. 25th Int'l Conf. Machine Learning, 2008.

[40] A. Torralba, R. Fergus, and W.T. Freeman, "80 Million Tiny Images: A Large Data Set for Non-Parametric Object and Scene Recognition," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 30, no. 11, pp. 1958-1970, Nov. 2008.

[41] A. Torralba, R. Fergus, and Y. Weiss, "Small Codes and Large Image Databases for Recognition," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2008.

[42] A.B. Torralba, K.P. Murphy, and W.T. Freeman, "Shared Features for Multiclass Object Detection," Toward Category-Level Object Recognition, pp. 345-361, 2006.

[43] P. Vincent, H. Larochelle, Y. Bengio, and P. Manzagol, "Extracting and Composing Robust Features with Denoising Autoencoders," Proc. 25th Int'l Conf. Machine Learning, vol. 307, pp. 1096-1103, 2008.

[44] F. Xu and J.B. Tenenbaum, "Word Learning as Bayesian Inference," Psychological Rev., vol. 114, no. 2, pp. 245-272, 2007.

[45] L. Younes, "Parameter Inference for Imperfectly Observed Gibbsian Fields," Probability Theory Related Fields, vol. 82, pp. 625-645, 1989.

[46] L. Younes, "On the Convergence of Markovian Stochastic Algorithms with Rapidly Decreasing Ergodicity Rates," Mar. 2000.

[47] A.L. Yuille, "The Convergence of Contrastive Divergences," Proc. Advances in Neural Information Processing Systems Conf., 2004.

Ruslan Salakhutdinov received the PhD degree in machine learning (computer science) from the University of Toronto, Ontario, Canada, in 2009. After spending two postdoctoral years at the Massachusetts Institute of Technology Artificial Intelligence Lab, he joined the University of Toronto as an assistant professor in the Departments of Statistics and Computer Science. His primary interests lie in artificial intelligence, machine learning, deep learning, and large-scale optimization. He is an Alfred P. Sloan Research Fellow and a Microsoft Research New Faculty Fellow, a recipient of the Early Researcher Award and the Connaught New Researcher Award, and a Scholar of the Canadian Institute for Advanced Research.

Joshua B. Tenenbaum received the PhD degree from the Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, in 1993, where he is currently a professor of computational cognitive science as well as a principal investigator in the Computer Science and Artificial Intelligence Laboratory. He studies learning, reasoning, and perception in humans and machines, with the twin goals of understanding human intelligence in computational terms and bringing computers closer to human capacities. He and his collaborators have pioneered accounts of human cognition based on sophisticated probabilistic models and developed several novel machine learning algorithms inspired by human learning, most notably Isomap, an approach to unsupervised learning of nonlinear manifolds in high-dimensional data. His current work focuses on understanding how people come to be able to learn new concepts from very sparse data—how we "learn to learn"—and on characterizing the nature and origins of people's intuitive theories about the physical and social worlds. His papers have received awards at the IEEE Computer Vision and Pattern Recognition (CVPR), NIPS, Cognitive Science, UAI, and IJCAI conferences. He is the recipient of early career awards from the Society for Mathematical Psychology, the Society of Experimental Psychologists, and the American Psychological Association, along with the Troland Research Award from the National Academy of Sciences.

Antonio Torralba received the degree in telecommunications engineering from the Universidad Politecnica de Cataluna, Barcelona, Spain, and the PhD degree in signal, image, and speech processing from the Institut National Polytechnique de Grenoble, France. Thereafter, he completed postdoctoral training at the Brain and Cognitive Science Department and the Computer Science and Artificial Intelligence Laboratory at the Massachusetts Institute of Technology (MIT), Cambridge. He is an associate professor of electrical engineering and computer science in the Computer Science and Artificial Intelligence Laboratory at MIT. He is a member of the IEEE.

