
Learning the Structure of Deep Sparse Graphical Models

Ryan Prescott Adams, Hanna M. Wallach, Zoubin Ghahramani
University of Toronto, University of Massachusetts Amherst, University of Cambridge

Abstract

Deep belief networks are a powerful way to model complex probability distributions. However, it is difficult to learn the structure of a belief network, particularly one with hidden units. The Indian buffet process has been used as a nonparametric Bayesian prior on the structure of a directed belief network with a single infinitely wide hidden layer. Here, we introduce the cascading Indian buffet process (CIBP), which provides a prior on the structure of a layered, directed belief network that is unbounded in both depth and width, yet allows tractable inference. We use the CIBP prior with the nonlinear Gaussian belief network framework to allow each unit to vary its behavior between discrete and continuous representations. We use Markov chain Monte Carlo for inference in this model and explore the structures learned on image data.

1 Introduction

The belief network or directed probabilistic graphical model (Pearl, 1988) is a popular and useful way to represent complex probability distributions. Methods for learning the parameters of such networks are well-established. Learning network structure, however, is more difficult, particularly when the network includes unobserved hidden units. Then, not only must the structure (edges) be determined, but the number of hidden units must also be inferred. This paper contributes a novel nonparametric Bayesian perspective on the general problem of learning graphical models with hidden variables. Nonparametric Bayesian approaches to this problem are appealing because they can avoid the difficult computations required for selecting the appropriate a posteriori dimensionality of the model. Instead, they introduce an infinite number of parameters into the model a priori and the inference procedure determines the subset of these parameters that actually contributed to the observations. The Indian buffet process (IBP) (Griffiths and Ghahramani, 2006) is one example of a nonparametric Bayesian prior. It has previously been used to introduce an infinite number of hidden units into a belief network with a single hidden layer (Wood et al., 2006) or with a pre-specified number of layers (Courville et al., 2009).

Appearing in Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS) 2010, Chia Laguna Resort, Sardinia, Italy. Volume 9 of JMLR: W&CP 9. Copyright 2010 by the authors.

This paper unites two important areas of research: nonparametric Bayesian methods and deep belief networks. Specifically, we develop a nonparametric Bayesian framework to perform structure learning in deep networks, a problem that has not been addressed to date. We first propose a novel extension to the Indian buffet process—the cascading Indian buffet process (CIBP)—and use the Foster-Lyapunov criterion to prove convergence properties that make it tractable with finite computation. We then use the CIBP to generalize the single-layered, IBP-based, directed belief network to construct multi-layered networks that are both infinitely wide and infinitely deep. We discuss useful properties of such networks including expected in-degree and out-degree for individual units. Finally, we combine this approach with the continuous sigmoidal belief network framework of Frey (1997). This framework allows us to infer the type (i.e., discrete or continuous) of individual hidden units—an important property that is not widely discussed in previous work. In summary, we present a flexible, nonparametric framework for directed deep belief networks that permits inference of the number of hidden units, the directed edge structure between units, the depth of the network, and the most appropriate type for each unit.

2 Finite Belief Networks

We consider belief networks that are layered directed acyclic graphs with both visible and hidden units. Hidden units are random variables that appear in the joint distribution described by the belief network but are not observed. We index layers by m, increasing with depth up to M, and allow visible units (i.e., observed variables) only in layer m=0. We require that units in layer m have parents only in layer m+1. Within layer m, we denote the number of units as K^(m) and index the units with k, so that the kth unit in layer m is denoted u_k^(m). We use the notation u^(m) to refer to the vector of all K^(m) units for layer m together. A binary K^(m−1)×K^(m) matrix Z^(m) specifies the edges from layer m to layer m−1, so that element Z^(m)_{k,k'} = 1 iff there is an edge from unit u_{k'}^(m) to unit u_k^(m−1).

A unit's activation is determined by a weighted sum of its parents. The weights for layer m are denoted by a K^(m−1)×K^(m) real-valued matrix W^(m), so that the activations for the units in layer m can be written as y^(m) = (W^(m+1) ⊙ Z^(m+1)) u^(m+1) + γ^(m), where γ^(m) is a K^(m)-dimensional vector of bias weights and the binary operator ⊙ indicates the Hadamard product.

Figure 1: Three operation modes for a NLGBN unit. The black solid line shows the zero-mean distribution (i.e., y = 0), the red dashed line shows a pre-sigmoid mean of +1 and the blue dash-dot line shows a pre-sigmoid mean of −1. (a) ν = 1/2: binary behavior resulting from small precision. (b) ν = 5: roughly Gaussian behavior resulting from medium precision. (c) ν = 1000: deterministic behavior from large precision.

To achieve a wide range of possible behaviors, we use the nonlinear Gaussian belief network (NLGBN) (Frey, 1997; Frey and Hinton, 1999) framework. In the NLGBN, the distribution on u_k^(m) arises from adding zero-mean Gaussian noise with precision ν_k^(m) to the activation sum y_k^(m). This noisy sum is transformed with a sigmoid function σ(·) to obtain the value of the unit. We modify the NLGBN slightly so that the sigmoid function is from the real line to (−1, 1), i.e., σ : R → (−1, 1), via σ(x) = 2/(1 + exp{−x}) − 1. The distribution of u_k^(m) ∈ (−1, 1) given its parents is then

\[
p(u_k^{(m)} \mid y_k^{(m)}, \nu_k^{(m)}) = \frac{\exp\left\{ -\frac{\nu_k^{(m)}}{2} \left( \sigma^{-1}(u_k^{(m)}) - y_k^{(m)} \right)^2 \right\}}{\sigma'\!\left(\sigma^{-1}(u_k^{(m)})\right) \sqrt{2\pi / \nu_k^{(m)}}},
\]

where σ'(x) = dσ(x)/dx. As discussed by Frey (1997) and as shown in Fig 1, different choices of ν_k^(m) yield different belief unit behaviors, ranging from effectively discrete binary units to nonlinear continuous units.
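The generative step for a single layer is compact. The following sketch (our own illustration, not the authors' code; all function and variable names are ours) draws the units of one layer given the layer above, using the activation rule from Sec 2 and the NLGBN noise model just described:

import numpy as np

def sigma(x):
    # Modified sigmoid mapping the real line onto (-1, 1).
    return 2.0 / (1.0 + np.exp(-x)) - 1.0

def sample_layer(u_above, Z, W, gamma, nu, rng):
    # u_above: values u^(m+1) of the layer above, shape (K_above,)
    # Z, W:    binary edge mask and real weights, shape (K, K_above)
    # gamma:   biases gamma^(m), shape (K,); nu: per-unit precisions, shape (K,)
    y = (W * Z) @ u_above + gamma                # activations y^(m)
    noise = rng.normal(0.0, 1.0 / np.sqrt(nu))   # zero-mean noise with precision nu
    return sigma(y + noise)                      # unit values u^(m) in (-1, 1)

rng = np.random.default_rng(0)
u_above = rng.uniform(-1.0, 1.0, size=4)
Z = rng.integers(0, 2, size=(3, 4))
W = rng.normal(size=(3, 4))
print(sample_layer(u_above, Z, W, np.zeros(3), np.full(3, 0.5), rng))     # small precision: near-binary
print(sample_layer(u_above, Z, W, np.zeros(3), np.full(3, 1000.0), rng))  # large precision: near-deterministic

With ν small the sampled values pile up near ±1, and with ν large they concentrate at σ(y), matching the regimes in Fig 1.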

3 Infinite Belief Networks

Conditioned on the number of layers M, the layer widths K^(m), and the network structures Z^(m), inference in belief networks can be straightforwardly implemented using Markov chain Monte Carlo (Neal, 1992). Learning the depth, width, and structure, however, presents significant computational challenges. In this section, we present a novel nonparametric prior, the cascading Indian buffet process, for multi-layered belief networks that are both infinitely wide and infinitely deep. By using an infinite prior we avoid the need for the complex dimensionality-altering proposals that would otherwise be required during inference.

3.1 The Indian buffet process

Sec 2 used the binary matrix Z^(m) as a convenient way to represent the edges connecting layer m to layer m−1. We stated that Z^(m) was a finite K^(m−1)×K^(m) matrix. We can use the Indian buffet process (IBP) (Griffiths and Ghahramani, 2006) to allow this matrix to have an infinite number of columns. We assume the two-parameter IBP (Ghahramani et al., 2007), and use Z^(m) ∼ IBP(α, β) to indicate that the matrix Z^(m) ∈ {0,1}^{K^(m−1)×∞} is drawn from an IBP with parameters α, β > 0. The eponymous metaphor for the IBP is a restaurant with an infinite number of dishes available. Each customer chooses a finite set of dishes to taste. The rows of the binary matrix correspond to customers, while the columns correspond to dishes. If the jth customer tastes the kth dish, then Z_{j,k} = 1. Otherwise, Z_{j,k} = 0. The first customer to enter the restaurant samples a number of dishes that is Poisson-distributed with parameter α. After that, when the jth customer enters the restaurant, she selects dish k with probability η_k / (j + β − 1), where η_k is the number of previous customers that have tried the kth dish. She then chooses some number of additional dishes to taste that is Poisson-distributed with parameter αβ / (j + β − 1). Even though each customer chooses dishes based on their popularity among the previous customers, the rows and columns of the resulting matrix Z^(m) are infinitely exchangeable.
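For concreteness, here is a minimal sampler for the two-parameter IBP restaurant process just described (an illustrative sketch of our own, not the authors' code):

import numpy as np

def sample_ibp(num_customers, alpha, beta, rng):
    # Returns a binary matrix with one row per customer and one column per sampled dish.
    dish_counts = []      # eta_k: number of customers who have tried dish k
    rows = []
    for j in range(1, num_customers + 1):
        row = [int(rng.random() < eta / (j + beta - 1.0)) for eta in dish_counts]
        for k, taken in enumerate(row):
            dish_counts[k] += taken
        n_new = rng.poisson(alpha * beta / (j + beta - 1.0))   # brand-new dishes for customer j
        row.extend([1] * n_new)
        dish_counts.extend([1] * n_new)
        rows.append(row)
    Z = np.zeros((num_customers, len(dish_counts)), dtype=int)
    for j, row in enumerate(rows):
        Z[j, :len(row)] = row
    return Z

rng = np.random.default_rng(1)
print(sample_ibp(num_customers=5, alpha=1.0, beta=1.0, rng=rng))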

As with the model of Wood et al. (2006), if the model in Sec 2 had only a single hidden layer, i.e., M=1, then the IBP could be used to make that layer infinitely wide. Without intra-layer connections, however, the hidden units are independent a priori. This "shallowness" is a strong assumption that weakens the model in practice. The explosion of recent literature on deep belief networks (see, e.g., Hinton et al. (2006); Hinton and Salakhutdinov (2006)) speaks to the empirical success of networks with more hidden structure.

3.2 The cascading Indian buffet process

To build a prior on belief networks that are unbounded in both width and depth, we use an IBP-like construction that results in an infinite sequence of binary matrices Z^(0), Z^(1), Z^(2), ···. The matrices in this sequence must inherit the useful sparsity properties of the IBP, with the constraint that the columns from Z^(m−1) correspond to the rows in Z^(m). We interpret each matrix Z^(m) as specifying the directed edge structure from the units in layer m to those in layer m−1, where both layers have a potentially-unbounded width.

We propose the cascading Indian buffet process (CIBP), which provides a prior with these properties. The CIBP extends the IBP as follows: each dish in the restaurant is also a customer in another Indian buffet process—i.e., the columns in one binary matrix correspond to the rows in another. The CIBP is infinitely exchangeable in the rows of matrix Z^(0). Each of the matrices in the recursion is exchangeable in its rows and columns—propagating a permutation through the matrices does not change the probability of the data.

Surprisingly, if there are K^(0) customers in the first restaurant, then for finite K^(0), α, and β, the CIBP recursion terminates with probability one. In other words, at some point the customers do not taste any dishes, and deeper restaurants have neither dishes nor customers. Here we sketch the intuition behind this result. (See the supplementary materials for a proof.)
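The cascade itself is a small extension of the previous sketch: the dishes (columns) of one restaurant become the customers (rows) of the next, and sampling stops when a restaurant serves no dishes. The recursion below (again our own illustrative code, reusing sample_ibp and rng from the previous sketch; the depth cap is only a safety guard, not part of the model) terminates quickly in practice, in line with the result sketched here:

def sample_cibp(num_visible, alpha, beta, rng, max_depth=1000):
    # Returns the list of edge matrices, starting with the one that connects
    # the first hidden layer to the visible units.
    matrices, K = [], num_visible
    for _ in range(max_depth):
        Z = sample_ibp(K, alpha, beta, rng)
        if Z.shape[1] == 0:        # no dishes tasted: deeper layers are empty
            break
        matrices.append(Z)
        K = Z.shape[1]             # this restaurant's dishes are the next one's customers
    return matrices

layer_widths = [Z.shape[1] for Z in sample_cibp(5, alpha=1.0, beta=1.0, rng=rng)]
print(layer_widths)                # widths of successive hidden layers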

The CIBP constructs matrices in a sequence, starting with m=0. The number of nonzero columns in matrix Z^(m+1), K^(m+1), is determined by K^(m), the number of active nonzero columns in Z^(m). We require that for some matrix Z^(m), all columns are zero. We can therefore disregard the fact that the CIBP is a matrix-valued stochastic process and instead consider the Markov chain on the number of nonzero columns. Fig 2a shows three traces of such a Markov chain on K^(m). If we define

\[
\Lambda(K; \alpha, \beta) = \alpha \sum_{k'=1}^{K} \frac{\beta}{\beta + k' - 1},
\]

then the Markov chain has the transition distribution

\[
p(K^{(m+1)} = k \mid K^{(m)}, \alpha, \beta) = \frac{1}{k!} \exp\left\{ -\Lambda(K^{(m)}; \alpha, \beta) \right\} \Lambda(K^{(m)}; \alpha, \beta)^{k}, \qquad (1)
\]

which is a Poisson distribution with mean Λ(K^(m); α, β). To show that the chain reaches the absorbing state K^(m) = 0 with probability one, we must show that K^(m) does not blow up to infinity.

Figure 2: Properties of the Markov chain on layer width for the CIBP, with α = 3, β = 1. Note that these values are illustrative and are not necessarily appropriate for a network structure. a) Example traces of the Markov chain on layer width, indexed by depth m, with K^(0) = 50. b) Expected K^(m+1) as a function of K^(m), shown in blue; the Lyapunov function L(·) is shown in green. c) The drift as a function of the current width K^(m). This corresponds to the difference between the two lines in (b). Note that the drift becomes negative when the layer width is greater than eight.

In such a Markov chain, this requirement is equivalent to the chain having an equilibrium distribution when conditioned on nonabsorption, i.e., a quasi-stationary distribution (Seneta and Vere-Jones, 1966). For countably-infinite state spaces, a Markov chain has a (quasi-) stationary distribution if it is positive-recurrent, i.e., there is a finite expected time between consecutive visits to any state. Positive recurrence can be shown via the Foster–Lyapunov stability criterion (FLSC) (Fayolle et al., 2008). Satisfying the FLSC for the Markov chain with transitions given by Eqn 1 demonstrates that eventually the CIBP will reach a restaurant in which the customers try no new dishes. We do this by showing that if K^(m) is large enough, then the expected K^(m+1) is smaller than K^(m). We use a Lyapunov function L(k) : N → R ≥ 0, with L(0) = 0, with which we define the drift function as follows:

\[
E_{k \mid K^{(m)}}\!\left[ L(k) - L(K^{(m)}) \right] = \sum_{k=1}^{\infty} p(K^{(m+1)} = k \mid K^{(m)}) \left( L(k) - L(K^{(m)}) \right).
\]

The drift is the expected change in L(k) as a function of K^(m). If there is a K^(m) above which all drifts are negative, then the Markov chain satisfies the FLSC and is positive-recurrent. In the CIBP, this is satisfied for L(k) = k. That the drift eventually will become negative can be seen from the fact that E_{k|K^(m)}[L(k)] = Λ(K^(m); α, β) is O(ln K^(m)) while E_{k|K^(m)}[L(K^(m))] = K^(m) is O(K^(m)). Figs 2b and 2c together provide a schematic illustration of this idea.
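The argument above is easy to check numerically. The small sketch below (our own code, not the authors') simulates the width chain of Eqn 1 and evaluates the drift for L(k) = k, with the illustrative values α = 3, β = 1 from Fig 2:

import numpy as np

def Lambda(K, alpha, beta):
    # Poisson mean of the next layer's width (Eqn 1).
    return alpha * sum(beta / (beta + kp - 1.0) for kp in range(1, K + 1))

def simulate_widths(K0, alpha, beta, rng):
    # Run the Markov chain on layer widths until it is absorbed at zero.
    widths, K = [K0], K0
    while K > 0:
        K = int(rng.poisson(Lambda(K, alpha, beta)))
        widths.append(K)
    return widths

rng = np.random.default_rng(2)
print(simulate_widths(50, alpha=3.0, beta=1.0, rng=rng))
for K in (2, 8, 9, 20):
    print(K, Lambda(K, 3.0, 1.0) - K)   # drift Lambda(K) - K turns negative beyond K = 8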

3.3 Unbounded priors on network structure

The CIBP can be used as a prior on the sequence Z^(0), Z^(1), Z^(2), ··· from Sec 2, to allow an infinite sequence of infinitely-wide hidden layers. As before, there are K^(0) visible units. The edges between the first hidden layer and the visible layer are drawn according to the restaurant metaphor. This yields a finite number of units in the first hidden layer, denoted K^(1) as before. These units become the visible units in another IBP-based network. While this recurses infinitely deep, only a finite number of units are ancestors of the visible units. If a unit has no ancestors, its activation is determined only by its bias. Fig 3 shows several samples from the prior for different parameterizations. Only connected units are shown.

Figure 3: Samples from the CIBP prior over network structures with five visible units, for four settings of (α, β): (a) α = 1, β = 1; (b) α = 1/2, β = 1; (c) α = 1, β = 2; (d) α = 3/2, β = 1.

A significant aspect of using an IBP-based prior is that it introduces a "rich-get-richer" behavior in the structure. The probability of a hidden unit acquiring a new outgoing edge increases with the current number of outgoing edges. While this is not desirable for all models, we feel it is a reasonable property of a belief network. Sharing of hidden variables in the belief network is what induces structure on the outputs. It seems appropriate that a hidden variable which is already important will likely become more important.

The parameters α and β govern the expected width and sparsity of the network at each level. The expected in-degree of each unit (number of parents) is α and the expected out-degree (number of children) is K / Σ_{k=1}^{K} β/(β + k − 1), for K units in the layer below. These expressions arise directly from the properties of the IBP described by Ghahramani et al. (2007). For clarity, we have presented the CIBP results with α and β fixed at all depths; however, this may be overly restrictive. For example, in an image recognition problem we would not expect the sparsity of edges mapping low-level features to pixels to be the same as that for high-level features to low-level features. To address this, we allow α and β to vary with depth, writing α^(m) and β^(m). The CIBP terminates with probability one as long as there exists some finite upper bound on α^(m) and β^(m) for all m. To ensure this, we place top-hat priors on α^(m) and β^(m), which are vague but bounded.

3.4 Priors on other parameters

For other parameters in the model, we use priors that tie parameters together according to layer. We assume that the weights in layer m are drawn independently from Gaussian distributions with mean μ_w^(m) and precision ρ_w^(m). We assume a similar layer-wise prior for biases γ^(m) with parameters μ_γ^(m) and ρ_γ^(m). We use layer-wise gamma priors on the ν_k^(m), with parameters a^(m) and b^(m). We tie these prior parameters together with global normal-gamma hyperpriors for the weight and bias parameters, and gamma hyperpriors for the unit-activation precision parameters ν^(m).
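As a concrete picture of these layer-wise priors, the sketch below draws one layer's weights, biases, and unit precisions. It is our own illustration only: the hyperparameter values are placeholders, we assume the gamma prior is parameterized by shape a^(m) and rate b^(m), and the global hyperpriors that tie layers together are omitted.

import numpy as np

def sample_layer_parameters(K_layer, K_parent, hypers, rng):
    # Weights of the edges into this layer, from the layer-wise Gaussian prior.
    W = rng.normal(hypers["mu_w"], 1.0 / np.sqrt(hypers["rho_w"]), size=(K_layer, K_parent))
    # Per-unit biases, from the layer-wise Gaussian prior on biases.
    gamma = rng.normal(hypers["mu_gamma"], 1.0 / np.sqrt(hypers["rho_gamma"]), size=K_layer)
    # Per-unit activation precisions, from the layer-wise gamma prior (shape a, rate b).
    nu = rng.gamma(shape=hypers["a_nu"], scale=1.0 / hypers["b_nu"], size=K_layer)
    return W, gamma, nu

rng = np.random.default_rng(3)
hypers = dict(mu_w=0.0, rho_w=1.0, mu_gamma=0.0, rho_gamma=1.0, a_nu=2.0, b_nu=2.0)  # placeholder values
W, gamma, nu = sample_layer_parameters(K_layer=3, K_parent=4, hypers=hypers, rng=rng)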

4 Inference

We have so far described a prior on belief network structures and parameters, along with likelihood functions for unit activation. For inference, however, we must find the posterior distribution over the structure and the parameters of the network, having seen a set of data given by N D-dimensional vectors. We will assume that these data have been scaled to be in the range (−1, 1). Our inference procedure assumes that these data have been generated by a belief network of some unknown width, depth, and structure. The first layer always has a width equal to the dimensionality of the data, i.e., K^(0) = D, and we treat the data as activations of the visible units. This corresponds to pre-specifying the values of the visible units to be the data in N identically-structured networks. We denote this set of N visible-layer units as {u_n^(0)}_{n=1}^{N}.

We then wish to find the posterior distribution over: 1) the depth of the network; 2) the widths of the hidden layers; 3) the edge structure between layers; 4) the weights associated with the edges; 5) the biases of the units; 6) the activations of the hidden units that led to the data; 7) the values of the various hyperparameters. The posterior distribution over these unknowns is not analytically tractable, so we use Markov chain Monte Carlo (MCMC) to draw samples from it. To use MCMC, we instantiate these unknowns to particular values and then define a transition operator on this state that leaves the posterior distribution invariant. Under easily-satisfied conditions, the distribution over the current state of the Markov chain will evolve so as to be closer and closer to the distribution of interest.

From a technical standpoint, the trick with the CIBP (and other similar nonparametric Bayesian models) is that it does not actually define a prior on the width and depth of the network. Rather, it constructs a network with an infinite number of layers that each have an infinite number of units, in such a way that only a finite number of units actually contribute to the observed data. In general, one would not expect that a distribution on infinite networks would yield tractable inference. However, in our construction, given the sequence Z^(1), Z^(2), ···, almost all of the infinite number of units are conditionally independent, unconnected to the data and therefore irrelevant to the posterior distribution. Due to this independence, the activations of these unconnected units trivially marginalize out of the model's joint distribution and we can restrict inference only to those units that are ancestors of the visible units. Of course, since this trivial marginalization only arises from the Z^(m) matrices, we must also have a distribution on infinite binary matrices that allows exact marginalization of all the uninstantiated edges. The row-wise and column-wise exchangeability properties of the IBP are what allow the use of infinite matrices. The bottom-up conditional structure of the CIBP allows an infinite number of these matrices.

To simplify notation, we will use Θ for the aggregated state of the model variables. It is this aggregated state on which the Markov chain is defined. Given the hyperparameters, we write the joint distribution as

\[
p(\Theta) = \left[ p(\gamma^{(0)})\, p(\nu^{(0)}) \prod_{k=1}^{K^{(0)}} \prod_{n=1}^{N} p(x_{k,n} \mid y_{k,n}^{(0)}, \nu_k^{(0)}) \right] \left[ \prod_{m=1}^{\infty} p(W^{(m)})\, p(\gamma^{(m)})\, p(\nu^{(m)}) \prod_{k=1}^{K^{(m)}} \prod_{n=1}^{N} p(u_{k,n}^{(m)} \mid y_{k,n}^{(m)}, \nu_k^{(m)}) \right]. \qquad (2)
\]

Although this distribution involves several infinite sets, the aforementioned marginalization makes it possible to sample the relevant parts via MCMC. We use a sequence of transition operators that update subsets of the state, conditioned on the remainder, in such a way as to leave Eqn 2 invariant. We specifically note that, conditioned on the binary matrices {Z^(m)}_{m=1}^{∞}, which define the structure of the network, inference becomes exactly as it would be in a finite belief network.

4.1 Updating hidden unit activations

Since we cannot easily integrate out the activations of the hidden units, we have included them as part of the MCMC state. Conditioned on the network structure, it is only necessary to sample the activations of the units that are ancestors of the visible units. Frey (1997) used slice sampling for the hidden unit states, but we have had greater success with a specialized independence-chain variant of multiple-try Metropolis–Hastings (Liu et al., 2000). Our method proposes several (≈ 5) possible new unit activations from the prior imposed by its parents and selects among them (or rejects them all) according to the likelihood imposed by its children. As this operation can be executed in a vectorized manner with modern math libraries, we have seen significantly better mixing performance by wall-clock time than with slice sampling.
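A sketch of one way to realize such an update (our own simplified, scalar version, not the authors' implementation, which is vectorized over units and data cases): draw several candidates from the prior the unit's parents impose, weight each by the likelihood of the unit's children, and accept via the standard multiple-try Metropolis–Hastings rule for independence proposals, under which the importance weights reduce to the child likelihoods.

import numpy as np

def mtm_update(u_current, sample_parent_prior, child_likelihood, rng, num_tries=5):
    # sample_parent_prior() draws a unit value from the prior its parents impose;
    # child_likelihood(u) returns the likelihood of the unit's children given u.
    candidates = np.array([sample_parent_prior() for _ in range(num_tries)])
    w = np.array([child_likelihood(u) for u in candidates])
    if w.sum() <= 0.0:
        return u_current                          # reject all candidates
    s = rng.choice(num_tries, p=w / w.sum())      # select a candidate by its weight
    reference = candidates.copy()
    reference[s] = u_current                      # reference set for the reverse move
    w_ref = np.array([child_likelihood(u) for u in reference])
    accept_prob = min(1.0, w.sum() / w_ref.sum()) if w_ref.sum() > 0 else 1.0
    return candidates[s] if rng.random() < accept_prob else u_current

In practice the likelihoods would be computed in log space and over all N data cases at once.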

4.2 Updating weights and biases

Given that a directed edge exists in the structure, we sample the posterior distribution over its weight. Conditioned on the rest of the model, the NLGBN provides a convenient Gaussian form for the distribution over weights, so that we can Gibbs sample them from a conditional posterior Gaussian with parameters

\[
\mu^{w\text{-}post}_{m,k,k'} = \frac{\rho_w^{(m)} \mu_w^{(m)} + \nu_k^{(m-1)} \sum_n u_{n,k'}^{(m)} \left( \sigma^{-1}(u_{n,k}^{(m-1)}) - \xi_{n,k,k'}^{(m)} \right)}{\rho_w^{(m)} + \nu_k^{(m-1)} \sum_n \left(u_{n,k'}^{(m)}\right)^2}, \qquad
\rho^{w\text{-}post}_{m,k,k'} = \rho_w^{(m)} + \nu_k^{(m-1)} \sum_n \left(u_{n,k'}^{(m)}\right)^2,
\]

where

\[
\xi_{n,k,k'}^{(m)} = \gamma_k^{(m-1)} + \sum_{k'' \neq k'} Z_{k,k''}^{(m)} W_{k,k''}^{(m)} u_{n,k''}^{(m)}. \qquad (3)
\]

The bias γ_k^(m) can also be similarly sampled from a Gaussian conditional posterior distribution.
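In code, the Gibbs step for a single weight is a one-liner once the child's residual activation is in hand (a sketch under the notation above; the function and variable names are our own):

import numpy as np

def gibbs_sample_weight(u_parent, residual, nu_child, mu_w, rho_w, rng):
    # u_parent: (N,) values of the parent unit across data cases
    # residual: (N,) sigma^{-1}(child value) minus the bias and the other parents' contributions
    # nu_child: precision of the child unit; mu_w, rho_w: layer-wise prior mean and precision
    rho_post = rho_w + nu_child * np.sum(u_parent ** 2)
    mu_post = (rho_w * mu_w + nu_child * np.sum(u_parent * residual)) / rho_post
    return rng.normal(mu_post, 1.0 / np.sqrt(rho_post))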

4.3 Updating activation variances

We use the NLGBN model to gain the ability to vary the mode of unit behaviors between discrete and continuous representations. This corresponds to sampling from the posterior distributions over the ν_k^(m). With a conjugate prior, the new value can be sampled from a gamma distribution with the following parameters:

\[
a^{\nu\text{-}post}_{m,k} = a^{(m)}_{\nu} + N/2, \qquad (4)
\]
\[
b^{\nu\text{-}post}_{m,k} = b^{(m)}_{\nu} + \frac{1}{2} \sum_{n=1}^{N} \left( \sigma^{-1}(u_{n,k}^{(m)}) - y_{n,k}^{(m)} \right)^2. \qquad (5)
\]
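A corresponding sketch of the precision update (our own code; it assumes the gamma prior uses shape a and rate b, and uses the inverse of the modified sigmoid from Sec 2):

import numpy as np

def sigma_inverse(u):
    # Inverse of sigma(x) = 2 / (1 + exp(-x)) - 1.
    return np.log((1.0 + u) / (1.0 - u))

def gibbs_sample_precision(u_unit, y_unit, a_prior, b_prior, rng):
    # u_unit: (N,) sampled values of the unit; y_unit: (N,) its activations.
    resid = sigma_inverse(u_unit) - y_unit
    a_post = a_prior + 0.5 * len(u_unit)                 # Eqn 4
    b_post = b_prior + 0.5 * np.sum(resid ** 2)          # Eqn 5
    return rng.gamma(shape=a_post, scale=1.0 / b_post)   # rate b corresponds to scale 1/b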

4.4 Updating structure

To sample from the structure of the network—the sequence Z^(1), Z^(2), ··· in our construction—we must define an MCMC operator that adds and removes edges while leaving the posterior in Eqn 2 invariant. The procedure we use is similar to that proposed by Fox et al. (2009). When adding a layer, we must sample additional layer-wise model components. When introducing an edge, we must also sample its weight from the posterior distribution. If a new edge introduces a previously-unseen hidden unit, we must draw a bias for it and also draw its deeper-cascading connections from the posterior. Finally, we draw a top-down sample of the N new hidden unit activations from any unit we introduce. Effectively, we make a joint proposal for the edge and all relevant state that we previously did not have to store. We generate these proposals from the prior but define a Metropolis–Hastings rule that accepts them according to the posterior distribution.

Figure 4: Olivetti faces. a) Test images on the left, with reconstructed bottom halves on the right. b) Sixty features learned in the bottom layer, where black shows absence of an edge. Note the learning of sparse features corresponding to specific facial structures such as mouth shapes, noses and eyebrows. c) Raw predictive fantasies. d) Feature activations from individual units in the second hidden layer.

We iterate over each layer that connects to the visible units. Within each layer m ≥ 0, we iterate over the connected units. Sampling the edges incident to the kth unit in layer m has two phases. First, we iterate over each connected unit in layer m+1, indexed by k'. We calculate η^(m)_{−k,k'}, the number of nonzero entries in the k'th column of Z^(m+1), excluding any entry in the kth row. If η^(m)_{−k,k'} is zero, we call the unit k' a singleton parent, to be dealt with in the second phase. If η^(m)_{−k,k'} is nonzero, we introduce (or keep) the edge from unit u_{k'}^(m+1) to u_k^(m) with Bernoulli probability

\[
p(Z_{k,k'}^{(m+1)} = 1 \mid \Theta \backslash Z_{k,k'}^{(m+1)}) = \frac{1}{\mathcal{Z}} \left[ \frac{\eta_{-k,k'}^{(m)}}{K^{(m)} + \beta^{(m)} - 1} \right] \prod_{n=1}^{N} p(u_{n,k}^{(m)} \mid Z_{k,k'}^{(m+1)} = 1, \Theta \backslash Z_{k,k'}^{(m+1)}),
\]
\[
p(Z_{k,k'}^{(m+1)} = 0 \mid \Theta \backslash Z_{k,k'}^{(m+1)}) = \frac{1}{\mathcal{Z}} \left[ 1 - \frac{\eta_{-k,k'}^{(m)}}{K^{(m)} + \beta^{(m)} - 1} \right] \prod_{n=1}^{N} p(u_{n,k}^{(m)} \mid Z_{k,k'}^{(m+1)} = 0, \Theta \backslash Z_{k,k'}^{(m+1)}),
\]

where 𝒵 is the appropriate normalization constant.

In the second phase, we consider deleting connections to singleton parents of unit k, or adding new singleton parents.

We use Metropolis–Hastings with a birth/death process. If there are currently K* singleton parents, then with probability 1/2 we propose adding a new one by drawing it recursively from deeper layers, as above. We accept the proposal to insert a connection to this new parent unit with M–H ratio

\[
\frac{\beta^{(m)} + K^{(m)} - 1}{\alpha^{(m)} \beta^{(m)} (K_* + 1)^2} \prod_{n=1}^{N} \frac{p(u_{n,k}^{(m)} \mid Z_{k,j}^{(m+1)} = 1, \Theta \backslash Z_{k,j}^{(m+1)})}{p(u_{n,k}^{(m)} \mid Z_{k,j}^{(m+1)} = 0, \Theta \backslash Z_{k,j}^{(m+1)})}.
\]

If we do not propose to insert a unit and K* ≠ 0, then with probability 1/2 we select uniformly from among the singleton parents of unit k and propose removing the connection to it. We accept the proposal to remove the jth one with M–H acceptance ratio given by

\[
\frac{\alpha^{(m)} \beta^{(m)} K_*^2}{\beta^{(m)} + K^{(m)} - 1} \prod_{n=1}^{N} \frac{p(u_{n,k}^{(m)} \mid Z_{k,j}^{(m+1)} = 0, \Theta \backslash Z_{k,j}^{(m+1)})}{p(u_{n,k}^{(m)} \mid Z_{k,j}^{(m+1)} = 1, \Theta \backslash Z_{k,j}^{(m+1)})}.
\]

After these phases, chains of units that are not ancestors of the visible units can be discarded. Notably, this birth/death operator samples from the IBP posterior with a nontruncated equilibrium distribution, even without conjugacy. Unlike the stick-breaking approach of Teh et al. (2007), it allows use of the two-parameter IBP, which is important to this model.
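The first phase of this operator reduces to a weighted coin flip per candidate edge. The sketch below (our own code, not the authors'; it works in log space for numerical stability and takes the summed child log-likelihoods as inputs) resamples one non-singleton edge according to the Bernoulli probabilities above:

import numpy as np

def resample_edge(eta_minus, K_m, beta_m, loglik_with, loglik_without, rng):
    # eta_minus: nonzero entries in the parent's column, excluding this child's row
    # loglik_with / loglik_without: child's log-likelihood summed over data cases,
    # with the edge present or absent
    prior_on = eta_minus / (K_m + beta_m - 1.0)
    log_p1 = np.log(prior_on) + loglik_with
    log_p0 = np.log(1.0 - prior_on) + loglik_without
    c = max(log_p1, log_p0)
    p1 = np.exp(log_p1 - c) / (np.exp(log_p1 - c) + np.exp(log_p0 - c))
    return int(rng.random() < p1)        # 1: keep/introduce the edge, 0: remove it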

5 Reconstructing Images

We applied our approach to three image data sets—the Olivetti faces, the MNIST digits and the Frey faces—and analyzed the structures that arose in the model posteriors. To assess the model, we constructed a missing-data problem using held-out images from each set. We removed the bottom halves of the test images and used the model to reconstruct the missing data, conditioned on the top half. Prediction itself was done by integrating out the parameters and structure via MCMC and averaging over predictive samples.

Olivetti Faces  The Olivetti faces data (Samaria and Harter, 1994) are 400 64×64 grayscale images of the faces of 40 distinct subjects, which we divided randomly into 350 training and 50 test data. Fig 4a shows six bottom-half test set reconstructions on the right, compared to the ground truth on the left. Fig 4b shows a subset of sixty weight patterns from a posterior sample of the structure, with black indicating that no edge is present from that hidden unit to the visible unit (pixel). The algorithm is clearly assigning hidden units to specific and interpretable features, such as mouth shapes, facial hair, and skin tone. Fig 4c shows ten pure fantasies from the model, easily generated in a directed acyclic belief network. Fig 4d shows the result of activating individual units in the second hidden layer, while keeping the rest unactivated, and propagating the activations down to the visible pixels. This provides an idea of the image space spanned by the principal components of these deeper units. A typical posterior network structure had three hidden layers, with approximately seventy units in each layer.

Figure 5: MNIST Digits. a) Eight pairs of test image reconstructions, with the bottom half of each digit missing. The truth is the left image in each pair. b) 120 features learned in the bottom layer, where black indicates that no edge exists. c) Activations in pixel space resulting from activating individual units in the deepest layer. d) Samples from the posterior of Z^(0), Z^(1) and Z^(2) (transposed).

MNIST Digits  We used a subset of the MNIST handwritten digit data (LeCun et al., 1998) for training: 50 28×28 examples of each of the ten digits, with ten more examples of each digit held out for testing. The inferred lower-level features are extremely sparse, as shown in Fig 5b, and the deeper units are simply activating sets of blobs at the pixel level. This is shown also by activating individual units at the deepest layer, as shown in Fig 5c. Test reconstructions are in Fig 5a. A typical network had three hidden layers, with roughly 120, 100, and 70 units in each one. Fig 5d shows typical binary matrices Z^(0), Z^(1), and Z^(2).

Frey Faces  The Frey faces data¹ are 1965 20×28 grayscale video frames of a single face with different expressions. We randomly selected 1865 training images and 100 test images. While posterior samples of the network again typically used three hidden layers, the networks for these data tended to be much wider and more densely connected. In the bottom layer, as shown in Fig 6b, a typical hidden unit connects to many pixels. We attribute this to global correlation effects, since all images come from a single person. Typical layer widths were around 260, 120, and 35 units.

¹ http://www.cs.toronto.edu/~roweis/data.html

Figure 6: Frey faces. a) Eight pairs of test reconstructions, with the bottom half of each face missing. The truth is the left image in each pair. b) 260 features learned in the bottom layer, where black indicates that no edge exists.

In the experiments, our MCMC sampler appeared to mix well and began to find reasonable reconstructions after a few hours of CPU time. Note that the learned sparse connection patterns in Z^(0) varied from local (MNIST), through intermediate (Olivetti), to global (Frey), despite identical IBP hyperpriors. This suggests that flexible priors on structures are needed to adequately capture the statistics of different data sets.

6 Discussion

This paper unites two areas of research—nonparametric Bayesian methods and deep belief networks—to provide a novel nonparametric perspective on the general problem of learning the structure of directed deep belief networks with hidden units.

We addressed three issues surrounding deep belief networks. First, we inferred appropriate local representations with units varying from discrete binary to nonlinear continuous behavior. Second, we provided a way for a deep belief network to contain an arbitrary number of hidden units arranged in an arbitrary number of layers. Third, we presented a method for inferring the graph structure of a directed deep belief network. To achieve this, we introduced a cascading extension to the Indian buffet process and proved convergence properties that make it useful as a Bayesian prior distribution for a sequence of infinite binary matrices.

This work can be viewed as an infinite multilayer generalization of the density network (MacKay, 1995), and also as part of a more general literature on learning structure in probabilistic networks. With a few exceptions (e.g., Ramachandran and Mooney (1998); Friedman (1998); Elidan et al. (2000); Beal and Ghahramani (2006)), most previous work on learning the structure of belief networks has focused on the case where all units are observed (Buntine, 1991; Heckerman et al., 1995; Friedman and Koller, 2003; Koivisto and Sood, 2004). Our framework not only allows for an unbounded number of hidden units, but couples the model for the number of units with a nonparametric model for the structure of the network. Rather than comparing structures by evaluating marginal likelihoods, we do inference in a single model with an unbounded number of units and layers, thereby learning effective model complexity. This approach is more appealing both computationally and philosophically.

There are a variety of future research directions arising from the model we have presented here. As we have presented it, we do not expect that our unsupervised, MCMC-based inference scheme will be competitive on supervised tasks with extensively-tuned discriminative models based on variants of maximum-likelihood learning. However, we believe that our model can inform choices for the network depth, layer size, and edge structure in such networks, and will inspire further research into flexible, nonparametric network models.

Acknowledgements

The authors thank Brendan Frey, David MacKay, Iain Murray and Radford Neal for valuable discussions. RPA is funded by the Canadian Institute for Advanced Research. This work was supported in part by the Center for Intelligent Information Retrieval, in part by CIA, NSA and NSF under NSF grant #IIS-0326249, and in part by subcontract #B582467 from Lawrence Livermore National Security, LLC, prime contractor to DOE/NNSA contract #DE-AC52-07NA27344. Any opinions, findings and conclusions or recommendations expressed in this material are the authors' and do not necessarily reflect those of the sponsor.

References

M. J. Beal and Z. Ghahramani. Variational Bayesian learning of directed graphical models with hidden variables. Bayesian Analysis, 1(4):793–832, 2006.

W. Buntine. Theory refinement on Bayesian networks. In 7th Conference on Uncertainty in Artificial Intelligence, 1991.

A. C. Courville, D. Eck, and Y. Bengio. An infinite factor model hierarchy via a noisy-or mechanism. In Advances in Neural Information Processing Systems 22, 2009.

G. Elidan, N. Lotner, N. Friedman, and D. Koller. Discovering hidden variables. In Advances in Neural Information Processing Systems 13, 2000.

G. Fayolle, V. A. Malyshev, and M. V. Menshikov. Topics in the Constructive Theory of Countable Markov Chains. Cambridge University Press, Cambridge, UK, 2008.

E. B. Fox, E. B. Sudderth, M. I. Jordan, and A. S. Willsky. Sharing features among dynamical systems with beta processes. In Advances in Neural Information Processing Systems 22, 2009.

B. J. Frey. Continuous sigmoidal belief networks trained using slice sampling. In Advances in Neural Information Processing Systems 9, 1997.

B. J. Frey and G. E. Hinton. Variational learning in nonlinear Gaussian belief networks. Neural Computation, 11(1):193–213, 1999.

N. Friedman. The Bayesian structural EM algorithm. In 14th Conference on Uncertainty in Artificial Intelligence, 1998.

N. Friedman and D. Koller. Being Bayesian about network structure. Machine Learning, 50(1-2):95–125, 2003.

Z. Ghahramani, T. L. Griffiths, and P. Sollich. Bayesian nonparametric latent feature models. In Bayesian Statistics 8, pages 201–226, 2007.

T. L. Griffiths and Z. Ghahramani. Infinite latent feature models and the Indian buffet process. In Advances in Neural Information Processing Systems 18, 2006.

D. Heckerman, D. Geiger, and D. M. Chickering. Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20(3):197–243, 1995.

G. E. Hinton and R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.

G. E. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, July 2006.

M. Koivisto and K. Sood. Exact Bayesian structure discovery in Bayesian networks. Journal of Machine Learning Research, 5:549–573, 2004.

Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

J. S. Liu, F. Liang, and W. H. Wong. The use of multiple-try method and local optimization in Metropolis sampling. Journal of the American Statistical Association, 95(449):121–134, 2000.

D. J. C. MacKay. Bayesian neural networks and density networks. Nuclear Instruments and Methods in Physics Research, Section A, 354(1):73–80, 1995.

R. M. Neal. Connectionist learning of belief networks. Artificial Intelligence, 56:71–113, July 1992.

J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Mateo, CA, 1988.

S. Ramachandran and R. J. Mooney. Theory refinement of Bayesian networks with hidden variables. In 15th International Conference on Machine Learning, 1998.

F. S. Samaria and A. C. Harter. Parameterisation of a stochastic model for human face identification. In 2nd IEEE Workshop on Applications of Computer Vision, 1994.

E. Seneta and D. Vere-Jones. On quasi-stationary distributions in discrete-time Markov chains with a denumerable infinity of states. Journal of Applied Probability, 3(2):403–434, December 1966.

Y.-W. Teh, D. Gorur, and Z. Ghahramani. Stick-breaking construction for the Indian buffet process. In 11th International Conference on Artificial Intelligence and Statistics, 2007.

F. Wood, T. L. Griffiths, and Z. Ghahramani. A nonparametric Bayesian method for inferring hidden causes. In 22nd Conference on Uncertainty in Artificial Intelligence, 2006.