
arXiv:1002.4802v2 [cs.LG] 12 Mar 2010



Gaussian Process Structural Equation Models with Latent Variables

Ricardo Silva
Department of Statistical Science
University College London
[email protected]

Robert B. Gramacy
Statistical Laboratory
University of Cambridge
[email protected]

Abstract

In a variety of disciplines such as social sciences, psychology, medicine and economics, the recorded data are considered to be noisy measurements of latent variables connected by some causal structure. This corresponds to a family of graphical models known as the structural equation model with latent variables. While linear non-Gaussian variants have been well-studied, inference in nonparametric structural equation models is still underdeveloped. We introduce a sparse Gaussian process parameterization that defines a non-linear structure connecting latent variables, unlike common formulations of Gaussian process latent variable models. The sparse parameterization is given a full Bayesian treatment without compromising Markov chain Monte Carlo efficiency. We compare the stability of the sampling procedure and the predictive ability of the model against the current practice.

1 CONTRIBUTION

A cornerstone principle of many disciplines is that observations are noisy measurements of hidden variables of interest. This is particularly prominent in fields such as social sciences, psychology, marketing and medicine. For instance, data can come in the form of social and economical indicators, answers to questionnaires in a medical exam or marketing survey, and instrument readings such as fMRI scans. Such indicators are treated as measures of latent factors such as the latent ability levels of a subject in a psychological study, or the abstract level of democratization of a country. The literature on structural equation models (SEMs) (Bartholomew et al., 2008; Bollen, 1989) approaches such problems with directed graphical models, where each node in the graph is a noisy function of its parents. The goals of the analysis include typical applications of latent variable models, such as projecting points in a latent space (with confidence regions) for ranking, clustering and visualization; density estimation; missing data imputation; and causal inference (Pearl, 2000; Spirtes et al., 2000).

This paper introduces a nonparametric formulation of SEMs with hidden nodes, where functions connecting latent variables are given a Gaussian process prior. An efficient but flexible sparse formulation is adopted. To the best of our knowledge, our contribution is the first full Gaussian process treatment of SEMs with latent variables.

We assume that the model graphical structure is given. Structural model selection with latent variables is a complex topic which we will not pursue here: a detailed discussion of model selection is left as future work. Asparouhov and Muthén (2009) and Silva et al. (2006) discuss relevant issues. Our goal is to be able to generate posterior distributions over parameters and latent variables with scalable sampling procedures with good mixing properties, while being competitive against non-sparse Gaussian process models.

In Section 2, we specify the likelihood function for our structural equation models and its implications. In Section 3, we elaborate on priors, Bayesian learning, and a sparse variation of the basic model which is able to handle larger datasets. Section 4 describes a Markov chain Monte Carlo (MCMC) procedure. Section 5 evaluates the usefulness of the model and the stability of the sampler in a set of real-world SEM applications with comparisons to modern alternatives. Finally, in Section 6 we discuss related work.

2 THE MODEL: LIKELIHOOD

Let G be a given directed acyclic graph (DAG). For simplicity, in this paper we assume that no observed variable is a parent in G of any latent variable. Many SEM applications are of this type (Bollen, 1989; Silva et al., 2006), and this will simplify our presentation. Likewise, we will treat models for continuous variables only. Although cyclic SEMs are also well-defined for the linear case (Bollen,


Figure 1: (a) An example adapted from Palomo et al. (2007): latent variable IL corresponds to a scalar labeled as the industrialization level of a country. PDL is the corresponding political democratization level. Variables Y1, Y2, Y3 are indicators of industrialization (e.g., gross national product) while Y4, . . . , Y7 are indicators of democratization (e.g., expert assessments of freedom of press). Each variable is a function of its parents with a corresponding additive error term: ǫ_i for each Y_i, and ζ for democratization levels. For instance, PDL = f(IL) + ζ for some function f(·). (b) Dependence among latent variables is essential to obtain sparsity in the measurement structure. Here we depict how the graphical dependence structure would look if we regressed the observed variables on the independent latent variables of (a).

1989), non-linear cyclic models are not trivial to define and as such we will exclude them from this paper.

Let X be our set of latent variables and X_i ∈ X be a particular latent variable. Let X_Pi be the set of parents of X_i in G. The latent structure in our SEM is given by the following generative model: if the parent set of X_i is not empty,

X_i = f_i(X_Pi) + ζ_i, where ζ_i ∼ N(0, v_ζi)    (1)

N(m, v) is the Gaussian distribution with mean m and variance v. If X_i has no parents (i.e., it is an exogenous latent variable, in SEM terminology), it is given a mixture of Gaussians marginal¹.

The measurement model, i.e., the model that describes the distribution of observations Y given latent variables X, is as follows. For each Y_j ∈ Y with parent set X_Pj, we have

Y_j = λ_j0 + X_Pj^T Λ_j + ǫ_j, where ǫ_j ∼ N(0, v_ǫj)    (2)

Error terms {ǫ_j} are assumed to be mutually independent and independent of all latent variables in X. Moreover, Λ_j is a vector of linear coefficients Λ_j = [λ_j1 . . . λ_j|X_Pj|]^T. Following SEM terminology, we say that Y_j is an indicator of the latent variables in X_Pj.
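As a concrete reading of Equations (1) and (2), the following is a minimal sketch of sampling from such a generative model, assuming one exogenous latent variable, one endogenous latent variable with a hand-picked quadratic f(·), and three indicators each; all names and settings are illustrative, not taken from the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 150

# Structural model (Equation 1): X2 = f(X1) + zeta, X1 exogenous.
x1 = rng.normal(size=N)                     # exogenous latent (here a single Gaussian)
f = lambda x: 4.0 * x**2                    # an arbitrary nonlinear structural function
x2 = f(x1) + rng.normal(scale=1.0, size=N)  # zeta ~ N(0, v_zeta)

# Measurement model (Equation 2): each indicator is linear in its latent parent.
def indicators(x, lambdas0, lambdas1, noise_sd, rng):
    """Y_j = lambda_j0 + lambda_j1 * x + eps_j, one column per indicator."""
    eps = rng.normal(scale=noise_sd, size=(len(x), len(lambdas1)))
    return lambdas0 + np.outer(x, lambdas1) + eps

Y1_block = indicators(x1, np.zeros(3), np.array([1.0, 0.8, 1.2]), 1.0, rng)  # indicators of X1
Y2_block = indicators(x2, np.zeros(3), np.array([1.0, 0.7, 1.1]), 1.0, rng)  # indicators of X2
Y = np.hstack([Y1_block, Y2_block])          # observed data matrix, N x 6
```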

An example is shown in Figure 1(a). Following the notation of Bollen (1989), squares represent observed variables and circles, latent variables. SEMs are graphical models with an emphasis on sparse models where: 1. latent variables are dependent according to a directed graph model; 2. observed variables measure (i.e., are children of) very few latent variables. Although sparse latent variable models have been the object of study in machine learning and statistics (e.g., Wood et al. (2006); Zou et al. (2006)), not much has been done on exploring nonparametric models with dependent latent structure (a loosely related exception being dynamic systems, where filtering is the typical application). Figure 1(b) illustrates how modeling can be affected by discarding the structure among latents².

¹For simplicity of presentation, in this paper we adopt a finite mixture of Gaussians marginal for the exogenous variables. However, introducing a Dirichlet process mixture of Gaussians marginal is conceptually straightforward.

2.1 Identifiability Conditions

Latent variable models might be unidentifiable. In the context of Bayesian inference, this is less of a theoretical issue than a computational one: unidentifiable models might lead to poor mixing in MCMC, as discussed in Section 5. Moreover, in many applications, the latent embedding of the data points is of interest in itself, or the latent regression functions are relevant for causal inference purposes. In such applications, an unidentifiable model is of limited interest. In this Section, we show how to derive sufficient conditions for identifiability.

Consider the case where a latent variable X_i has at least three unique indicators Y_i ≡ {Y_iα, Y_iβ, Y_iγ}, in the sense that no element in Y_i has any other parent in G but X_i. It is known that in this case (Bollen, 1989) the parameters of the structural equations for each element of Y_i are identifiable (i.e., the linear coefficients and the error term variance) up to a scale and sign of the latent variable. This can be resolved by setting the linear structural equation of (say) Y_iα to Y_iα = X_i + ǫ_iα. The distribution of the error terms is then identifiable. The distribution of X_i follows from a deconvolution between the observed distribution of an element of Y_i and the identified distribution of the error term.

²Another consequence of modeling latent dependencies is reducing the number of parameters of the model: a SEM with a linear measurement model can be seen as a type of module network (Segal et al., 2005) where the observed children of a particular latent X_i share the same nonlinearities propagated from X_Pi: in the context of Figure 1, each indicator Y_i ∈ {Y_4, . . . , Y_7} has a conditional expected value of λ_i0 + λ_i1 f_2(X_1) for a given X_1: function f_2(·) is shared among the indicators of X_2.

Identifiability of the joint of X can be resolved by multivariate deconvolution under extra assumptions. For instance, Masry (2003) describes the problem in the context of kernel density estimation (with known joint distribution of error terms, but unknown joint of Y).

Assumptions for the identifiability of functions f_i(·), given the identifiability of the joint of X, have been discussed in the literature of error-in-variables regression (Fan and Truong, 1993; Carroll et al., 2004). Error-in-variables regression is a special case of our problem, where X_i is observed but X_Pi is not. However, since we have Y_iα = X_i + ǫ_iα, this is equivalent to an error-in-variables regression Y_iα = f_i(X_Pi) + ǫ_iα + ζ_i, where the compound error term ǫ_iα + ζ_i is still independent of X_Pi.

It can be shown that such identifiability conditions can be exploited in order to identify causal directionality among latent variables under additional assumptions, as discussed by Hoyer et al. (2008a) for the fully observed case³. A brief discussion is presented in the Appendix. In our context, we focus on the implications of identifiability on MCMC (Section 5).

3 THE MODEL: PRIORS

Each f_i(·) can be given a Gaussian process prior (Rasmussen and Williams, 2006). In this case, we call this class of models the GPSEM-LV family, standing for Gaussian Process Structural Equation Model with Latent Variables. Models without latent variables and measurement models have been discussed by Friedman and Nachman (2000)⁴.

3.1 Gaussian Process Prior and Notation

Let X_i be an arbitrary latent variable in the graph, with latent parents X_Pi. We will use X^(d) to represent the dth X sampled from the distribution of random vector X, and X_i^(d) indexes its ith component. For instance, X_Pi^(d) is the dth sample of the parents of X_i. A training set of size N is represented as {Z^(1), . . . , Z^(N)}, where Z is the set of all variables. Lower case x represents fixed values of latent variables, and x^1:N represents a whole set {x^(1), . . . , x^(N)}.

³Notice that if the distribution of the error terms is non-Gaussian, identification is easier: we only need two unique indicators Y_iα and Y_iβ: since ǫ_iα, ǫ_iβ and X_i are mutually independent, identification follows from known results derived in the literature of overcomplete independent component analysis (Hoyer et al., 2008b).

⁴To see how the Gaussian process networks of Friedman and Nachman (2000) are a special case of GPSEM-LV, imagine a model where each latent variable is measured without error. That is, each X_i has at least one observed child Y_i such that Y_i = X_i. The measurement model is still linear, but each structural equation among latent variables can be equivalently written in terms of the observed variables: i.e., X_i = f_i(X_Pi) + ζ_i is equivalent to Y_i = f_i(Y_Pi) + ζ_i, as in Friedman and Nachman.

For each x_Pi, the corresponding Gaussian process prior for function values f_i^1:N ≡ {f_i^(1), . . . , f_i^(N)} is

f_i^1:N | x_Pi^1:N ∼ N(0, K_i)

where K_i is an N × N kernel matrix (Rasmussen and Williams, 2006), as determined by x_Pi^1:N. Each corresponding x_i^(d) is given by f_i^(d) + ζ_i^(d), as in Equation (1).
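A small sketch of what the prior above amounts to computationally: build the N × N kernel matrix from the sampled parent values, draw the function values, then add the structural noise. The squared exponential kernel and all hyperparameter values below are assumptions made for illustration.

```python
import numpy as np

def sq_exp_kernel(XA, XB, a=1.0, b=1.0, jitter=1e-4):
    """Squared exponential kernel matrix k(x, x') = a * exp(-|x - x'|^2 / (2b))."""
    d2 = ((XA[:, None, :] - XB[None, :, :]) ** 2).sum(-1)
    K = a * np.exp(-0.5 * d2 / b)
    if XA is XB:
        K = K + jitter * np.eye(len(XA))     # nugget for numerical stability
    return K

rng = np.random.default_rng(1)
N = 100
x_parents = rng.normal(size=(N, 1))          # x_Pi^{1:N}, a single parent here

K = sq_exp_kernel(x_parents, x_parents)      # K_i, an N x N matrix
f = rng.multivariate_normal(np.zeros(N), K)  # f_i^{1:N} | x_Pi^{1:N} ~ N(0, K_i)
v_zeta = 0.5
x_i = f + rng.normal(scale=np.sqrt(v_zeta), size=N)  # x_i^{(d)} = f_i^{(d)} + zeta_i^{(d)}
```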

MCMC can be used to sample from the posterior distribution over latent variables and functions. However, each sampling step in this model costs O(N^3), making sampling very slow when N is at the order of hundreds, and essentially undoable when N is in the thousands. As an alternative, we introduce a multilayered representation adapted from the pseudo-inputs model of Snelson and Ghahramani (2006). The goal is to reduce the sampling cost down to O(M^2 N), M < N. M can be chosen according to the available computational resources.

3.2 Pseudo-inputs Review

We briefly review the pseudo-inputs model (Snelson and Ghahramani, 2006) in our notation. As before, let X^(d) represent the dth data point for some X. For a set X_i^1:N ≡ {X_i^(1), . . . , X_i^(N)} with corresponding parent set X_Pi^1:N ≡ {X_Pi^(1), . . . , X_Pi^(N)} and corresponding latent function values f_i^1:N, we define a pseudo-input set X̄_i^1:M ≡ {X̄_i^(1), . . . , X̄_i^(M)} such that

f_i^1:N | x_Pi^1:N, f̄_i, x̄_i^1:M ∼ N(K_i;NM K_i;M^-1 f̄_i, V_i)
f̄_i | x̄_i^1:M ∼ N(0, K_i;M)    (3)

where K_i;NM is an N × M matrix with each (j, k) element given by the kernel function k_i(x_Pi^(j), x̄_i^(k)). Similarly, K_i;M is an M × M matrix where element (j, k) is k_i(x̄_i^(j), x̄_i^(k)). It is important to notice that each pseudo-input X̄_i^(d), d = 1, . . . , M, has the same dimensionality as X_Pi. The motivation for this is that X̄_i works as an alternative training set, with the original prior predictive means and variances being recovered if M = N and X̄_i = X_Pi.

Let k_i;dM be the dth row of K_i;NM. Matrix V_i is a diagonal matrix with entry v_i;dd given by v_i;dd = k_i(x_Pi^(d), x_Pi^(d)) − k_i;dM^T K_i;M^-1 k_i;dM. This implies that all latent function values {f_i^(1), . . . , f_i^(N)} are conditionally independent.
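The matrices in Equation (3) and the diagonal V_i can be spelled out as below; this is an illustrative sketch reusing an assumed squared exponential kernel and arbitrary pseudo-input locations, not the paper's implementation.

```python
import numpy as np

def sq_exp(XA, XB, a=1.0, b=1.0):
    d2 = ((XA[:, None, :] - XB[None, :, :]) ** 2).sum(-1)
    return a * np.exp(-0.5 * d2 / b)

rng = np.random.default_rng(2)
N, M = 200, 20
x_parents = rng.normal(size=(N, 1))                  # x_Pi^{1:N}
x_pseudo = np.linspace(-2, 2, M).reshape(M, 1)       # pseudo-inputs, same dimension as X_Pi

K_M = sq_exp(x_pseudo, x_pseudo) + 1e-6 * np.eye(M)  # K_{i;M}
K_NM = sq_exp(x_parents, x_pseudo)                   # K_{i;NM}

f_bar = rng.multivariate_normal(np.zeros(M), K_M)    # pseudo-functions ~ N(0, K_{i;M})

# Conditional of f_i^{1:N}: mean K_NM K_M^{-1} f_bar, diagonal covariance V_i.
A = np.linalg.solve(K_M, K_NM.T).T                   # K_NM K_M^{-1}
mean_f = A @ f_bar
v_diag = sq_exp(x_parents, x_parents).diagonal() - np.einsum('nm,nm->n', A, K_NM)
f = mean_f + np.sqrt(np.maximum(v_diag, 0.0)) * rng.normal(size=N)
```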


Figure 2: (a) This figure depicts a model for the latent structure X_1 → X_2 with N = 3 (edges into latent functions are lighter for visualization purposes only) using a standard Gaussian process prior. Dashed arrows represent that function values {f_2^(1), f_2^(2), f_2^(3)} are mutually dependent even after conditioning on {X_1^(1), X_1^(2), X_1^(3)}. In (b), we have the graphical depiction of the respective Bayesian pseudo-inputs model with M = 2. Although the model is seemingly more complex, it scales much better: mutual dependencies are confined to the clique of pseudo-functions, which scales with M instead.

3.3 Pseudo-inputs: A Fully Bayesian Formulation

The density function implied by (3) replaces the standard Gaussian process prior. In the context of Snelson and Ghahramani (2006), input and output variables are observed, and as such Snelson and Ghahramani optimize x̄_i^1:M by maximizing the marginal likelihood of the model. This is practical but sometimes prone to overfitting, since pseudo-inputs are in fact free parameters, and the pseudo-inputs model is best seen as a variation of the Gaussian process prior rather than an approximation to it (Titsias, 2009).

In our setup, there is limited motivation to optimize the pseudo-inputs since the inputs themselves are random variables. For instance, we show in the next section that the cost of sampling pseudo-inputs is no greater than the cost of sampling latent variables, while avoiding cumbersome optimization techniques to choose pseudo-input values. Instead we put a prior on the pseudo-inputs and extend the sampling procedure. By conditioning on the data, a good placement for the pseudo-inputs can be learned, since X_Pi and X̄_i^(d) are dependent in the posterior. This is illustrated by Figure 2. Moreover, it naturally provides a protection against overfitting.

A simple choice of priors for pseudo-inputs is as follows: each pseudo-input X̄_i^(d), d = 1, . . . , M, is given a N(µ_i^d, Σ_i^d) prior, independent of all other random variables. A partially informative (empirical) prior can be easily defined in the case where, for each X_k, we have the freedom of choosing a particular indicator Y_q with fixed structural equation Y_q = X_k + ǫ_q (see Section 2.1), implying E[X_k] = E[Y_q]. This means that if X_k is a parent of X_i, we set the respective entry in µ_i^d (recall µ_i^d is a vector with an entry for every parent of X_i) to the empirical mean of Y_q. Each prior covariance matrix Σ_i^d is set to be diagonal with a common variance.

Alternatively, we would like to spread the pseudo-inputs a priori: other things being equal, pseudo-inputs that are too close to each other can be wasteful given their limited number. One prior, inspired by space-filling designs from the experimental design literature (Santner et al., 2003), is

p(x̄_i^1:M) ∝ det(D_i)

the determinant of a kernel matrix D_i. We use a squared exponential covariance function with characteristic length scale of 0.1 (Rasmussen and Williams, 2006), and a "nugget" constant that adds 10^-4 to each diagonal term. This prior has support over a [−L, L]^|X_Pi| hypercube. We set L to be three times the largest standard deviation of observed variables in the training data. This is the pseudo-input prior we adopt in our experiments, where we center all observed variables at their empirical means.
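A minimal sketch of evaluating this space-filling prior up to its normalizing constant, with the 0.1 length scale, the 10^-4 nugget and the [−L, L] support mentioned above; the helper name and the toy comparison are illustrative.

```python
import numpy as np

def log_space_filling_prior(x_pseudo, L, length_scale=0.1, nugget=1e-4):
    """log p(x^{1:M}) = log det(D) + const on [-L, L]^dim; -inf outside the support."""
    if np.any(np.abs(x_pseudo) > L):
        return -np.inf
    d2 = ((x_pseudo[:, None, :] - x_pseudo[None, :, :]) ** 2).sum(-1)
    D = np.exp(-0.5 * d2 / length_scale**2) + nugget * np.eye(len(x_pseudo))
    sign, logdet = np.linalg.slogdet(D)
    return logdet                              # D is positive definite by construction

# Pseudo-inputs that are spread out get a higher prior value than clustered ones.
spread = np.linspace(-1, 1, 5).reshape(-1, 1)
clumped = np.full((5, 1), 0.1) + 0.01 * np.arange(5).reshape(-1, 1)
print(log_space_filling_prior(spread, L=3.0) > log_space_filling_prior(clumped, L=3.0))  # True
```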

3.4 Other Priors

We adopt standard priors for the parametric components of this model: independent Gaussians for each coefficient λ_ij, inverse gamma priors for the variances of the error terms, and a Dirichlet prior for the distribution of the mixture indicators of the exogenous variables.

4 INFERENCE

We use a Metropolis-Hastings scheme to sample from our space of latent variables and parameters. Similarly to Gibbs sampling, we sample blocks of random variables while conditioning on the remaining variables. When the corresponding conditional distributions are canonical, we sample directly from them. Otherwise, we use mostly standard random walk proposals.

Conditioned on the latent variables, sampling the parameters of the measurement model is identical to the case of classical Bayesian linear regression. The same can be said of the sampling scheme for the posterior variances of each ζ_i. Sampling the mixture distribution parameters for the exogenous variables is also identical to the standard Bayesian case of Gaussian mixture models. Details are described in the Appendix.

We describe the central stages of the sampler for the sparse model. The sampler for the model with full Gaussian process priors is simpler and analogous, and also described in the Appendix.

4.1 Sampling Latent Functions

In principle, one can analytically marginalize the pseudo-functions f̄_i^1:M. However, keeping an explicit sample of the pseudo-functions is advantageous when sampling latent variables X_i^(d), d = 1, . . . , N: for each child X_c of X_i, only the corresponding factor for the conditional density of f_c^(d) needs to be computed (at an O(M) cost), since function values are independent given latent parents and pseudo-functions. This issue does not arise in the fully-observed case of Snelson and Ghahramani (2006), who do marginalize the pseudo-functions.

Pseudo-functions and functions {f̄_i^1:M, f_i^1:N} are jointly Gaussian given all other random variables and data. The conditional distribution of f̄_i^1:M given everything, except itself and {f_i^(1), . . . , f_i^(N)}, is Gaussian with covariance matrix

S_i ≡ (K_i;M^-1 + K_i;M^-1 K_i;NM^T (V_i^-1 + I/υ_ζi) K_i;NM K_i;M^-1)^-1

where V_i is defined in Section 3.2 and I is an M × M identity matrix. The total cost of computing this matrix is O(NM^2 + M^3) = O(NM^2). The corresponding mean is

S_i × K_i;M^-1 K_i;NM^T (V_i^-1 + I/υ_ζi) x_i^1:N

where x_i^1:N is a column vector of length N.

Given that f̄_i^1:M is sampled according to this multivariate Gaussian, we can now sample {f_i^(1), . . . , f_i^(N)} in parallel, since this becomes a mutually independent set with univariate Gaussian marginals. The conditional variance of f_i^(d) is v'_i ≡ 1/(1/v_i;dd + 1/υ_ζi), where v_i;dd is defined in Section 3.2. The corresponding mean is v'_i (f_µ^(d)/v_i;dd + x_i^(d)/υ_ζi), where f_µ^(d) = k_i;dM K_i;M^-1 f̄_i.
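The two conditionals above reduce to standard Gaussian linear algebra. The sketch below is an illustrative, unoptimized implementation assuming K_i;M, K_i;NM, the diagonal of V_i, the current latent values x_i^1:N and υ_ζi are available from the sampler state; the function name is not the paper's.

```python
import numpy as np

def sample_pseudo_and_function_values(K_M, K_NM, v_diag, x_i, v_zeta, rng):
    """Draw (f_bar, f) from the two conditionals of Section 4.1."""
    M = K_M.shape[0]
    K_M_inv = np.linalg.inv(K_M)
    A = K_NM @ K_M_inv                              # K_NM K_M^{-1}, N x M
    W = 1.0 / v_diag + 1.0 / v_zeta                 # diagonal of V^{-1} + I/v_zeta

    # Covariance S_i and mean of the pseudo-function conditional.
    S_inv = K_M_inv + A.T @ (W[:, None] * A)
    S = np.linalg.inv(S_inv)
    mean_fbar = S @ (A.T @ (W * x_i))
    f_bar = rng.multivariate_normal(mean_fbar, S)

    # Function values are now conditionally independent (univariate Gaussians).
    f_mu = A @ f_bar                                # k_{i;dM} K_M^{-1} f_bar for each d
    v_prime = 1.0 / (1.0 / v_diag + 1.0 / v_zeta)
    mean_f = v_prime * (f_mu / v_diag + x_i / v_zeta)
    f = mean_f + np.sqrt(v_prime) * rng.normal(size=len(x_i))
    return f_bar, f
```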

In Section 5, we also sample from the posterior distribution of the hyperparameters Θ_i of the kernel function used by K_i;M and K_i;NM. Plain Metropolis-Hastings is used to sample these hyperparameters, using a uniform proposal in [αΘ_i, (1/α)Θ_i] for 0 < α < 1.

4.2 Sampling Pseudo-inputs and Latent Variables

We sample each pseudo-input x̄_i^(d) one at a time, d = 1, 2, . . . , M. Recall that x̄_i^(d) is a vector, with as many entries as the number of parents of X_i. In our implementation, we propose all entries of the new x̄_i^(d)′ simultaneously, using a Gaussian random walk proposal centered at the current value x̄_i^(d), with the same variance in each dimension and no correlation structure. For problems where the number of parents of X_i is larger than in our examples (i.e., four or more parents), other proposals might be justified.

Let π_i^(\d)(x̄_i^(d)) be the conditional prior for x̄_i^(d) given x̄_i^(\d), where (\d) ≡ {1, 2, . . . , d−1, d+1, . . . , M}. Given a proposed x̄_i^(d)′, we accept the new value with probability min{1, l_i(x̄_i^(d)′)/l_i(x̄_i^(d))}, where

l_i(x̄_i^(d)) = π_i^(\d)(x̄_i^(d)) × p(f̄_i^(d) | f̄_i^(\d), x̄_i) × ∏_{d=1}^N v_i;dd^-1/2 exp(−(f_i^(d) − k_i;dM K_i;M^-1 f̄_i)^2 / (2 v_i;dd))

and p(f̄_i^(d) | f̄_i^(\d), x̄_i) is the conditional density that follows from Equation (3). Row vector k_i;dM is the dth row of matrix K_i;NM. Fast submatrix updates of K_i;M^-1 and K_i;NM K_i;M^-1 are required in order to calculate l_i(·) at an O(NM) cost, which can be done by standard Cholesky updates (Seeger, 2004). The total cost is therefore O(NM^2) for a full sweep over all pseudo-inputs.
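For concreteness, a sketch of evaluating log l_i for one pseudo-input in a naive way (recomputing the kernel matrices rather than using the O(NM) Cholesky submatrix updates cited above); the kernel and conditional-prior callables, and all names, are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

def log_l(x_pseudo, f_bar, d, x_parents, f, kernel, log_cond_prior):
    """log l_i: conditional prior x p(f_bar^(d) | f_bar^(\d), x_pseudo) x likelihood of f."""
    M = len(f_bar)
    K_M = kernel(x_pseudo, x_pseudo) + 1e-6 * np.eye(M)
    K_NM = kernel(x_parents, x_pseudo)
    A = K_NM @ np.linalg.inv(K_M)                       # K_NM K_M^{-1}
    v_diag = kernel(x_parents, x_parents).diagonal() - np.einsum('nm,nm->n', A, K_NM)
    v_diag = np.maximum(v_diag, 1e-10)

    # Univariate Gaussian conditional of f_bar[d] given the other pseudo-functions.
    o = [j for j in range(M) if j != d]
    w = np.linalg.solve(K_M[np.ix_(o, o)], K_M[o, d])
    cond_mean = w @ f_bar[o]
    cond_var = max(K_M[d, d] - w @ K_M[o, d], 1e-10)

    log_like = norm.logpdf(f, loc=A @ f_bar, scale=np.sqrt(v_diag)).sum()
    return (log_cond_prior(x_pseudo[d])
            + norm.logpdf(f_bar[d], cond_mean, np.sqrt(cond_var))
            + log_like)

# A random-walk proposal for x_pseudo[d] is then accepted with probability
# min(1, exp(log_l(proposed state) - log_l(current state))).
```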

The conditional density p(f̄_i^(d) | f̄_i^(\d), x̄_i) is known to be sharply peaked for moderate sizes of M (at the order of hundreds) (Titsias et al., 2009), which may cause mixing problems for the Markov chain. One way to mitigate this effect is to also propose a value f̄_i^(d)′ jointly with x̄_i^(d)′, which is possible at no additional cost. We propose the pseudo-function using the conditional p(f̄_i^(d) | f̄_i^(\d), x̄_i). The Metropolis-Hastings acceptance probability for this variation is then simplified to min{1, l_i^0(x̄_i^(d)′)/l_i^0(x̄_i^(d))}, where

l_i^0(x̄_i^(d)) = π_i^(\d)(x̄_i^(d)) × ∏_{d=1}^N v_i;dd^-1/2 exp(−(f_i^(d) − k_i;dM K_i;M^-1 f̄_i)^2 / (2 v_i;dd))

Finally, consider the proposal for latent variables X_i^(d). For each latent variable X_i, the set of latent variable instantiations {X_i^(1), X_i^(2), . . . , X_i^(N)} is mutually independent given the remaining variables. We propose each new latent variable value x_i^(d)′ in parallel, and accept or reject it based on a Gaussian random walk proposal centered at the current value x_i^(d). We accept the move with probability min{1, h_Xi(x_i^(d)′)/h_Xi(x_i^(d))} where, if X_i is not an exogenous variable in the graph,

h_Xi(x_i^(d)) = exp(−(x_i^(d) − f_i^(d))^2 / (2υ_ζi)) × ∏_{Xc ∈ X_Ci} p(f_c^(d) | f̄_c, x̄_c, x_i^(d)) × ∏_{Yc ∈ Y_Ci} p(y_c^(d) | x_Pc^(d))

where X_Ci is the set of latent children of X_i in the graph, and Y_Ci is the corresponding set of observed children.


The conditional p(f_c^(d) | f̄_c, x̄_c, x_i^(d)), which follows from (3), is a non-linear function of x_i^(d), but crucially does not depend on any x_i^(·) variable except point d. The evaluation of this factor costs O(M^2). As such, sampling all latent values for X_i takes O(NM^2).
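A schematic of the parallel random-walk update implied by h_Xi: each data point is accepted or rejected independently. The child likelihood terms are abstracted into a single callable, so this is a sketch of the control flow rather than the paper's implementation; all names are illustrative.

```python
import numpy as np

def update_latent_values(x_i, f_i, v_zeta, log_child_factors, step, rng):
    """One sweep of Metropolis updates for {x_i^{(1)}, ..., x_i^{(N)}} (Section 4.2).

    log_child_factors(x_vec) must return, per data point, the summed log-densities
    of the latent and observed children of X_i evaluated at the candidate values."""
    proposal = x_i + step * rng.normal(size=x_i.shape)
    log_h_cur = -0.5 * (x_i - f_i) ** 2 / v_zeta + log_child_factors(x_i)
    log_h_new = -0.5 * (proposal - f_i) ** 2 / v_zeta + log_child_factors(proposal)
    accept = np.log(rng.uniform(size=x_i.shape)) < (log_h_new - log_h_cur)
    return np.where(accept, proposal, x_i), accept.mean()
```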

The case where X_i is an exogenous variable is analogous, given that we also sample the mixture component indicators of such variables.

5 EXPERIMENTS

In this evaluation section⁵, we briefly illustrate the algorithm in a synthetic study, followed by an empirical evaluation on how identifiability matters in order to obtain an interpretable distribution of latent variables. We end this section with a study comparing the performance of our model in predictive tasks against common alternatives⁶.

5.1 An Illustrative Synthetic Study

We generated data from a model of two latent variables (X_1, X_2) where X_2 = 4X_1^2 + ζ_2, Y_i = X_1 + ǫ_i for i = 1, 2, 3 and Y_i = X_2 + ǫ_i for i = 4, 5, 6. X_1 and all error terms follow standard Gaussians. Given a sample of 150 points from this model, we set the structural equations for Y_1 and Y_4 to have a zero intercept and unit slope for identifiability purposes. Observed data for Y_1 against Y_4 is shown in Figure 3(a), which suggests a noisy quadratic relationship (plotted in 3(b), but unknown to the model). We run a GPSEM-LV model with 50 pseudo-inputs. The expected posterior value of each latent pair {X_1^(d), X_2^(d)} for d = 1, . . . , 150 is plotted in Figure 3(c). It is clear that we were able to reproduce the original non-linear functional relationship given noisy data using a pseudo-inputs model.

⁵MATLAB code to run all of our experiments is available at http://www.homepages.ucl.ac.uk/∼ucgtrbd/code/gpsem.zip

⁶Some implementation details: we used the squared exponential kernel function k(x_p, x_q) = a exp(−|x_p − x_q|^2/(2b)) + 10^-4 δ_pq, where δ_pq = 1 if p = q and 0 otherwise. The hyperprior for a is a mixture of a gamma (1, 20) and a gamma (10, 10) with equal probability each. The same (independent) prior is given to b. Variance parameters were given inverse gamma (2, 1) priors, and the linear coefficients were given Gaussian priors with a common large variance of 5. Exogenous latent variables were modeled as a mixture of five Gaussians where the mixture distribution is given a Dirichlet prior with parameter 10. Finally, for each latent variable X_i we choose one of its indicators Y_j and fix the corresponding edge coefficient to 1 and intercept to 0 to make the model identifiable. We perform 20,000 MCMC iterations with a burn-in period of 2,000 (only 6,000 iterations with 1,000 of burn-in for the non-sparse GPSEM-LV due to its high computational cost). Small variations in the priors for coefficients (using a variance of 10) and variance parameters (using an inverse gamma (2, 2)), and a mixture of 3 Gaussians instead of 5, were attempted with no significant differences between models.

For comparison, the output of the Gaussian process latent variable model (GPLVM, Lawrence, 2005) with two hidden variables is shown in Figure 3(d). GPLVM here assumes that the marginal distribution of each latent variable is a standard Gaussian, but the measurement model is nonparametric. In theory, GPLVM is as flexible as GPSEM-LV in terms of representing observed joints. However, it does not learn functional relationships among latent variables, which is often of central interest in SEM applications (Bollen, 1989). Moreover, since no marginal dependence among latent variables is allowed, the model adapts itself to find (unidentifiable) functional relationships between the exogenous latent variables of the true model and the observables, analogous to the case illustrated by Figure 1(b). As a result, despite GPLVM being able to depict, as expected, some quadratic relationship (up to a rotation), it is noisier than the one given by GPSEM-LV.

5.2 MCMC and Identifiability

We now explore the effect of enforcing identifiability constraints on the MCMC procedure. We consider the Consumer dataset, a study⁷ with 333 university students in Greece (Bartholomew et al., 2008). The aim of the study was to identify the factors that affect willingness to pay more to consume environmentally friendly products. We selected 16 indicators of environmental beliefs and attitudes, measuring a total of 4 hidden variables. For simplicity, we will call these variables X_1, . . . , X_4. The structure among latents is X_1 → X_2, X_1 → X_3, X_2 → X_3, X_2 → X_4. Full details are given by Bartholomew et al. (2008).

All observed variables have a single latent parent in the corresponding DAG. As discussed in Section 2.1, the corresponding measurement model is identifiable by fixing the structural equation for one indicator of each variable to have a zero intercept and unit slope (Bartholomew et al., 2008). If the assumptions described in the references of Section 2.1 hold, then the latent functions are also identifiable. We normalized the dataset before running the MCMC inference algorithm.

An evaluation of the MCMC procedure is done by running and comparing 5 independent chains, each starting from a different point. Following Lee (2007), we evaluate convergence using the EPSR statistic (Gelman and Rubin, 1992), which compares the variability of a given marginal posterior within each chain and between chains. We calculate this statistic for all latent variables {X_1, X_2, X_3, X_4} across all 333 data points.
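For reference, the EPSR (potential scale reduction) statistic can be computed from per-chain draws roughly as below, following the standard Gelman and Rubin (1992) formula; this is not taken from the paper's code, and the variable names are illustrative.

```python
import numpy as np

def epsr(draws):
    """draws: array of shape (n_chains, n_iterations) for one scalar quantity."""
    m, n = draws.shape
    chain_means = draws.mean(axis=1)
    B = n * chain_means.var(ddof=1)          # between-chain variance
    W = draws.var(axis=1, ddof=1).mean()     # within-chain variance
    var_hat = (n - 1) / n * W + B / n        # pooled posterior variance estimate
    return np.sqrt(var_hat / W)              # values near 1 indicate convergence

# Example shape: 5 chains for latent variable X2 at one data point.
# chains = np.stack([...])                   # shape (5, n_iterations)
# print(epsr(chains))
```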

A comparison is done against a variant of the model where the measurement model is not sparse: instead, each observed variable has all latent variables as parents, and no coefficients are fixed.

⁷There was one latent variable marginally independent of everything else. We eliminated it and its two indicators, as well as the REC latent variable that had only 1 indicator.


[Figure 3 appears here: four scatterplots, (a) "Observed data (normalized)" (Y4 vs. Y1), (b) "Latent data (normalized)" (X2 vs. X1), (c) "GPSEM-LV" (E[X2] vs. E[X1]), (d) "GPLVM" (mode(X2) vs. mode(X1)); see caption below.]

Figure 3: (a) Plot of observed variables Y_1 and Y_4 generated by adding standard Gaussian noise to two latent variables X_1 and X_2, where X_2 = 4X_1^2 + ζ_2, ζ_2 also a standard Gaussian. 150 data points were generated. (b) Plot of the corresponding latent variables, which are not recorded in the data. (c) The posterior expected values of the 150 latent variable pairs according to GPSEM-LV. (d) The posterior modes of the 150 pairs according to GPLVM.

[Figure 4 appears here: four MCMC trace plots against iteration, alternating "5 independent chains, sparse model" and "5 chains, unidentifiable model", for X2 at data point 10 and X4 at data point 200; see caption below.]

Figure 4: An illustration of the behavior of independent chains for X_2^(10) and X_4^(200) using two models for the Consumer data: the original (sparse) model (Bartholomew et al., 2008); an (unidentifiable) alternative where each observed variable is an indicator of all latent variables. In the unidentifiable model, there is no clear pattern across the independent chains. Our model is robust to initialization, while the alternative unidentifiable approach cannot be easily interpreted.

The differences are noticeable and illustrated in Figure 4. Box-plots of EPSR for the 4 latent variables are shown in Figure 5. It is difficult to interpret or trust an embedding that is strongly dependent on the initialization procedure, as is the case for the unidentifiable model. As discussed by Palomo et al. (2007), identifiability might not be a fundamental issue for Bayesian inference, but it is an important practical issue in SEMs.

5.3 Predictive Verification of the Sparse Model

We evaluate how well the sparse GPSEM-LV model performs compared against two parametric SEMs and GPLVM. The linear structural equation model is the SEM where each latent variable is given by a linear combination of its parents with additive Gaussian noise. Latent variables without parents are given the same mixture of Gaussians model as our GPSEM-LV implementation. The quadratic model includes all quadratic and linear terms, plus first-order interactions, among the parents of any given latent variable. This is perhaps the most common non-linear SEM used in practice (Bollen and Paxton, 1998; Lee, 2007). GPLVM is fit with 50 active points and the rbf kernel with automatic relevance determination (Lawrence, 2005). Each sparse GPSEM model uses 50 pseudo-points.

We performed a 5-fold cross-validation study where the average predictive log-likelihood on the respective test sets is reported. Three datasets are used. The first is the Consumer dataset, described in the previous section.

The second is the Abalone data (Asuncion and Newman, 2007), where we postulate two latent variables, "Size" and "Weight." Size has as indicators the length, diameter and height of each abalone specimen, while Weight has as indicators the four weight variables. We direct the relationship among latent variables as Size → Weight.

The third is the Housing dataset (Asuncion and Newman, 2007; Harrison and Rubinfeld, 1978), which includes indicators about features of suburbs in Boston that are relevant for the housing market. Following the original study (Harrison and Rubinfeld, 1978, Table IV), we postulate three latent variables: "Structural," corresponding to the structure of each residence; "Neighborhood," corresponding to an index of neighborhood attractiveness; and "Accessibility," corresponding to an index of accessibility within Boston⁸.

⁸The analysis by (Harrison and Rubinfeld, 1978, Table IV) also included a fourth latent concept of "Air pollution," which we removed due to the absence of one of its indicators in the electronic data file that is available.


Figure 5: Boxplots for the EPSR distribution across each of the 333 datapoints of each latent variable. Boxes represent the distribution for the non-sparse model. A value less than 1.1 is considered acceptable evidence of convergence (Lee, 2007), but this essentially never happens. For the sparse model, all EPSR statistics were under 1.03.

The corresponding 11 non-binary observed variables that are associated with the given latent concepts are used as indicators. The "Neighborhood" concept was refined into two, "Neighborhood I" and "Neighborhood II," due to the fact that three of its original indicators have very similar (and highly skewed) marginal distributions, which were very dissimilar from the others⁹. The structure among latent variables is given by a fully connected network directed according to the order {Accessibility, Structural, Neighborhood II, Neighborhood I}. Harrison and Rubinfeld (1978) provide full details on the meaning of the indicators. We note that it is well known that the Housing dataset poses stability problems to density estimation due to discontinuities in the variable RAD, one of the indicators of accessibility (Friedman and Nachman, 2000). In order to get more stable results, we use a subset of the data (374 points) where RAD < 24.

The need for non-linear SEMs is well-illustrated by Figure 6, where fantasy samples of latent variables are generated from the predictive distributions of two models.

We also evaluate how the non-sparse GPSEM-LV behaves compared to the sparse alternative. Notice that while Consumer and Housing have each approximately 300 training points in each cross-validation fold, Abalone has over 3000 points. For the non-sparse GPSEM, we subsampled all of the Abalone training folds down to 300 samples.

⁹The final set of indicators, using the nomenclature of the UCI repository documentation file, is as follows: "Structural" has as indicators RM and AGE; "Neighborhood I" has as indicators CRIM, ZN and B; "Neighborhood II" has as indicators INDUS, TAX, PTRATIO and LSTAT; "Accessibility" has as indicators DIS and RAD. See (Asuncion and Newman, 2007) for detailed information about these indicators. Following Harrison and Rubinfeld, we log-transformed some of the variables: INDUS, DIS, RAD and TAX.

[Figure 6 appears here: two scatterplots of predictive latent samples, "Abalone: Predictive Latent Variables" (Weight vs. Size) and "Housing: Predictive Latent Variables" (Neighborhood I vs. Accessibility); see caption below.]

Figure 6: Scatterplots of 2000 fantasy samples taken from the predictive distributions of sparse GPSEM-LV models. In contrast, GPLVM would generate spherical Gaussians.


Results are presented in Table 1. Each dataset was chosen to represent a particular type of problem. The data in Consumer is highly linear. In particular, it is important to point out that the GPSEM-LV model is able to behave as a standard structural equation model if necessary, while the quadratic polynomial model shows some overfitting. The Abalone study is known for having clear functional relationships among variables, as also discussed by Friedman and Nachman (2000). In this case, there is a substantial difference between the non-linear models and the linear one, although GPLVM seems suboptimal in this scenario where observed variables can be easily clustered into groups. Finally, functional relationships among variables in Housing are not as clear (Friedman and Nachman, 2000), with multimodal residuals. GPSEM still shows an advantage, but all SEMs are suboptimal compared to GPLVM. One explanation is that the DAG on which the models rely is not adequate. Structure learning might be necessary to make the most out of nonparametric SEMs.

Although results suggest that the sparse model behaved better than the non-sparse one (which was true of some cases found by Snelson and Ghahramani, 2006, due to heteroscedasticity effects), such results should be interpreted with care. Abalone had to be subsampled in the non-sparse case. Mixing is harder in the non-sparse model since all datapoints {X_i^(1), . . . , X_i^(N)} are dependent. While we believe that with larger sample sizes and denser latent structures the non-sparse model should be the best, large sample sizes are too expensive to process and, in many SEM applications, latent variables have very few parents.

It is also important to emphasize that the wallclock sampling time for the non-sparse model was an order of magnitude larger than the sparse case with M = 50, even considering that 3000 training points were used by the sparse model in the Abalone experiment, against 300 points by the non-sparse alternative.


                    Consumer                                 Abalone                                  Housing
          GPS     GP      LIN     QDR     GPL      GPS    GP     LIN    QDR    GPL      GPS     GP      LIN     QDR     GPL
Fold 1   -20.66  -21.17  -20.67  -21.20  -22.11   -1.96  -2.08  -2.75  -2.00  -3.04   -13.92  -14.10  -14.46  -14.11  -11.94
Fold 2   -21.03  -21.15  -21.06  -21.08  -22.22   -1.90  -2.97  -2.52  -1.92  -3.41   -15.07  -17.70  -16.20  -15.12  -12.98
Fold 3   -20.86  -20.88  -20.84  -20.90  -22.33   -1.91  -5.50  -2.54  -1.93  -3.65   -13.66  -15.75  -14.86  -14.69  -12.58
Fold 4   -20.79  -21.09  -20.78  -20.93  -22.03   -1.77  -2.96  -2.30  -1.80  -3.40   -13.30  -15.98  -14.05  -13.90  -12.84
Fold 5   -21.26  -21.76  -21.27  -21.75  -22.72   -3.85  -4.56  -4.67  -3.84  -4.80   -13.80  -14.46  -14.67  -13.71  -11.87

Table 1: Average predictive log-likelihood in a 5-fold cross-validation setup. The five methods are the GPSEM-LV model with 50 pseudo-inputs (GPS), GPSEM-LV with standard Gaussian process priors (GP), the linear and quadratic structural equation models (LIN and QDR) and the Gaussian process latent variable model (GPL) of Lawrence (2005), a nonparametric factor analysis model. For Abalone, GP uses a subsample of the training data. The p-values given by a paired Wilcoxon signed-rank test, measuring the significance of positive differences between sparse GPSEM-LV and the quadratic model, are 0.03 (for Consumer), 0.34 (Abalone) and 0.09 (Housing).

6 RELATED WORK

Non-linear factor analysis has been studied for decades in the psychometrics literature¹⁰. A review is provided by Yalcin and Amemiya (2001). However, most of the classic work is based on simple parametric models. A modern approach based on Gaussian processes is the Gaussian process latent variable model of Lawrence (2005). By construction, factor analysis cannot be used in applications where one is interested in learning functions relating latent variables, such as in causal inference. For embedding, factor analysis is easier to use and more robust to model misspecification than SEM analysis. Conversely, it does not benefit from well-specified structures and might be harder to interpret. Bollen (1989) discusses the interplay between factor analysis and SEM. Practical non-linear structural equation models are discussed by Lee (2007), but none of such approaches rely on nonparametric methods. Gaussian process latent structures appear mostly in the context of dynamical systems (e.g., Ko and Fox (2009)). However, the connection is typically among data points only, not among variables within a data point, where online filtering is the target application.

7 CONCLUSION

The goal of graphical modeling is to exploit the structure of real-world problems, but the latent structure is often ignored. We introduced a new nonparametric approach for SEMs by extending a sparse Gaussian process prior as a fully Bayesian procedure. Although a standard MCMC algorithm worked reasonably well, it is possible as future work to study ways of improving mixing times. This can be particularly relevant in extensions to ordinal variables, where the sampling of thresholds will likely make mixing more difficult. Since the bottleneck of the procedure is the

¹⁰Another instance of the "whatever you do, somebody in psychometrics already did it long before" law: http://www.stat.columbia.edu/∼cook/movabletype/archives/2009/01/alongstanding.html

sampling of the pseudo-inputs, one might consider a hybrid approach where a subset of the pseudo-inputs is fixed and determined prior to sampling using a cheap heuristic. New ways of deciding pseudo-input locations based on a given measurement model will be required. Evaluation with larger datasets (at least a few hundred variables) remains an open problem. Finally, finding ways of determining the graphical structure is also a promising area of research.

Acknowledgements

We thank Patrick Hoyer for several relevant discussions concerning the results of Section 2.1, and Irini Moustaki for the consumer data.

References

T. Asparouhov and B. Muthén. Exploratory structural equation modeling. Structural Equation Modeling, 16, 2009.

A. Asuncion and D.J. Newman. UCI machine learning repository, 2007.

D. Bartholomew, F. Steele, I. Moustaki, and J. Galbraith. Analysis of Multivariate Social Science Data. Chapman & Hall, 2008.

K. Bollen. Structural Equations with Latent Variables. John Wiley & Sons, 1989.

K. Bollen and P. Paxton. Interactions of latent variables in structural equation models. Structural Equation Modeling, 5:267–293, 1998.

R. Carroll, D. Ruppert, C. Crainiceanu, T. Tosteson, and M. Karagas. Nonlinear and nonparametric regression and instrumental variables. JASA, 99, 2004.

J. Fan and Y. Truong. Nonparametric regression with errors-in-variables. Annals of Statistics, 21:1900–1925, 1993.

N. Friedman and I. Nachman. Gaussian process networks. Uncertainty in Artificial Intelligence, 2000.

A. Gelman and D. Rubin. Inference from iterative simulation using multiple sequences. Statistical Science, 7:457–472, 1992.

A. Gretton, K. Fukumizu, C. Teo, L. Song, B. Schölkopf, and A. Smola. A kernel statistical test of independence. Neural Information Processing Systems, 2007.

D. Harrison and D. Rubinfeld. Hedonic prices and the demand for clean air. Journal of Environmental Economics & Management, 5:81–102, 1978.

P. Hoyer, D. Janzing, J. Mooij, J. Peters, and B. Schölkopf. Nonlinear causal discovery with additive noise models. Neural Information Processing Systems, 2008a.

P. Hoyer, S. Shimizu, A. Kerminen, and M. Palviainen. Estimation of causal effects using linear non-Gaussian causal models with hidden variables. IJAR, 49, 2008b.

J. Ko and D. Fox. GP-BayesFilters: Bayesian filtering using Gaussian process prediction and observation models. Autonomous Robots, 2009.

N. D. Lawrence. Probabilistic non-linear principal component analysis with Gaussian process latent variable models. Journal of Machine Learning Research, 6:1783–1816, 2005.

S.-Y. Lee. Structural Equation Modeling: a Bayesian Approach. Wiley, 2007.

E. Masry. Deconvolving multivariate kernel density estimates from contaminated associated observations. IEEE Transactions on Information Theory, 49:2941–2952, 2003.

J. Palomo, D. Dunson, and K. Bollen. Bayesian structural equation modeling. In Sik-Yum Lee (ed.), Handbook of Latent Variable and Related Models, pages 163–188, 2007.

J. Pearl. Causality: Models, Reasoning and Inference. Cambridge University Press, 2000.

C. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.

T. Santner, B. Williams, and W. Notz. The Design and Analysis of Computer Experiments. Springer-Verlag, 2003.

M. Seeger. Low rank updates for the Cholesky decomposition. Technical Report, 2004.

E. Segal, D. Pe'er, A. Regev, D. Koller, and N. Friedman. Learning module networks. JMLR, 6, 2005.

R. Silva, R. Scheines, C. Glymour, and P. Spirtes. Learning the structure of linear latent variable models. JMLR, 7, 2006.

E. Snelson and Z. Ghahramani. Sparse Gaussian processes using pseudo-inputs. NIPS, 18, 2006.

P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction and Search. Cambridge University Press, 2000.

M. Titsias. Variational learning of inducing variables in sparse Gaussian processes. AISTATS, 2009.

M. Titsias, N. Lawrence, and M. Rattray. Efficient sampling for Gaussian process inference using control variables. Neural Information Processing Systems, 2009.

F. Wood, T. Griffiths, and Z. Ghahramani. A non-parametric Bayesian method for inferring hidden causes. UAI, 2006.

I. Yalcin and Y. Amemiya. Nonlinear factor analysis as a statistical method. Statistical Science, 16:275–294, 2001.

H. Zou, T. Hastie, and R. Tibshirani. Sparse principal component analysis. J. of Comp. and Graph. Stats., pages 265–286, 2006.

APPENDIX A: FURTHER MCMC DETAILS

We use an MCMC sampler to draw all variables of interest from the posterior distribution of a GPSEM model. Let M denote the number of pseudo-inputs per latent function f_i(·), N be the sample size, V the number of latent variables and K the common number of Gaussian mixture components for each exogenous latent variable.

The sampler is a standard Metropolis-Hastings procedure with block sampling: random variables are divided into blocks, where we sample each block conditioning on the current values of the remaining blocks.

We consider both the non-sparse and sparse variations of GPSEM. The blocks are as follows for the non-sparse GPSEM:

• the linear coefficients for the structural equation of each observed variable Y_j: {λ_j0} ∪ Λ_j

• the conditional variance for the structural equation of each observed variable Y_j: υ_ǫj

• the d-th instantiation of each latent variable X_i, x_i^(d)

• the set of latent function values {f_i^(1), . . . , f_i^(N)} for each particular endogenous latent variable X_i

• the conditional variance for the structural equation of each latent variable X_i: υ_ζi

• the set of latent mixture component indicators {z_i^(1), . . . , z_i^(N)} for each particular exogenous latent variable X_i

• the set of means {µ_i1, . . . , µ_iK} for the mixture components of each particular exogenous latent variable X_i

• the set of variances {v_i1, . . . , v_iK} for the mixture components of each particular exogenous latent variable X_i

• the mixture distribution π_i corresponding to the probability over mixture components for exogenous latent variable X_i

The blocks for the sparse model are similar, except that

• all instantiations of a given latent variable x_i^(d), for d = 1, 2, . . . , N, are mutually independent conditioned on the functions, pseudo-inputs and pseudo-functions. As such, they can be treated as a single block of size N, where all elements are sampled in parallel;

• the d-th instantiation of each pseudo-input x̄_i^(d) for d = 1, 2, . . . , M;

• all instantiations of latent functions and pseudo-latent functions {f_i^(1), . . . , f_i^(N), f̄_i^(1), . . . , f̄_i^(M)} for any particular X_i are conditionally multivariate Gaussian and can be sampled together.

We adopt the convention that, for any particular step described in the following procedure, any random variable that is not explicitly mentioned should be considered fixed at the current sampled value. Moreover, any density function that depends on such implicit variables uses the respective implicit values.

Our implementation uses code for submatrix Cholesky updates from the library provided by Seeger (2004).


The measurement model

The measurement model can be integrated out in principle, if we adopt a conjugate normal-inverse gamma prior for the linear regression of observed variables Y on X. However, we opted for a non-conjugate prior in order to evaluate the convergence of the sampler when this marginalization cannot be done (as in alternative models with non-Gaussian error terms).

Given the latent variables, the corresponding conditional distributions for the measurement model parameters boil down to standard Bayesian linear regression posteriors. In our Metropolis-Hastings scheme, our proposals correspond to such conditionals, as in Gibbs sampling (and therefore have an acceptance probability of 1).

Let X_Pj be the parents of observed variable Y_j in the graph and let the d-th instantiation of the corresponding regression input be x_Pj^(d) ≡ [x_Pj^T 1]^T. Let each coefficient λ_jk have an independent Gaussian prior with mean zero and variance u. Conditioned on the error variance υ_ǫj, the posterior distribution of the vector [λ_j1, . . . , λ_j|X_Pj|, λ_j0]^T is multivariate Gaussian with covariance S_j ≡ (Σ_{d=1}^N x_Pj^(d) x_Pj^(d)T + I/u)^-1 and mean S_j Σ_{d=1}^N x_Pj^(d) y_j^(d), where I is a (|X_Pj| + 1) × (|X_Pj| + 1) identity matrix.

The derivation for the case where some coefficients λ_jk are fixed to constants is analogous.

For a fixed set of linear coefficients {λ_j0} ∪ Λ_j, we now sample the conditional variance υ_ǫj. Let this variance have an inverse gamma prior (a, b). Its conditional distribution is an inverse gamma (a', b'), where a' = a + N/2, b' = b + Σ_{d=1}^N (e_j^(d))^2/2, and e_j^(d) ≡ y_j^(d) − λ_j0 − Λ_j^T x_Pj^(d).
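These two updates are the usual conjugate steps of Bayesian linear regression. The sketch below is illustrative (names are not the paper's); note that it carries the error variance explicitly through the coefficient update, which the expression above leaves implicit.

```python
import numpy as np

def sample_measurement_params(x_parents, y, v_eps, u, a, b, rng):
    """One conditional update for observed variable Y_j (Appendix A):
    Gaussian draw for [lambda_j1, ..., lambda_jp, lambda_j0], then an
    inverse gamma draw for the error variance."""
    X = np.hstack([x_parents, np.ones((len(y), 1))])   # regression input [x_Pj^T, 1]^T
    p = X.shape[1]
    S = np.linalg.inv(X.T @ X / v_eps + np.eye(p) / u) # posterior covariance
    mean = S @ (X.T @ y) / v_eps
    coef = rng.multivariate_normal(mean, S)

    resid = y - X @ coef
    a_post = a + len(y) / 2.0
    b_post = b + 0.5 * resid @ resid
    v_eps_new = 1.0 / rng.gamma(a_post, 1.0 / b_post)  # inverse gamma draw
    return coef, v_eps_new
```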

The structural model: non-sparse GPSEM

For all i = 1, 2, . . . , V and d = 1, 2, . . . , N, we propose each new latent variable value x_i^(d)′ individually, and accept or reject it based on a Gaussian random walk proposal centered at the current value x_i^(d). We accept the move with probability

min{1, g_Xi(x_i^(d)′)/g_Xi(x_i^(d))}

where, if X_i is not an exogenous variable in the graph,

g_Xi(x_i^(d)) = exp(−(x_i^(d) − f_i^(d))^2 / (2υ_ζi)) × ∏_{Xc ∈ X_Ci} p(f_c^(d) | f_c^(\d)) × ∏_{Yc ∈ X_Yi} p(y_c^(d) | x_Pc^(d))    (4)

Recall that f_i(·) is a function of the parents X_Pi of X_i in the graph. The d-th instantiation of such parents assumes the value x_Pi^(d). We use f_i^(d) as a shorthand notation for f_i(x_Pi^(d)). Moreover, let X_Ci denote the latent children of X_i in the graph. The symbol f_c^(\d) refers to the respective function values taken by f_c in data points {1, 2, . . . , d−1, d+1, . . . , N}. Function p(f_c^(d) | f_c^(\d)) is the conditional density of f_c^(d) given f_c^(\d), according to the Gaussian process prior. The evaluation of this factor costs O(N^2) using standard submatrix Cholesky updates (Seeger, 2004). As such, sampling all latent values for X_i takes O(N^3).

Finally, X_Yi denotes the observed children of X_i, and function p(y_c^(d) | x_Pc^(d)) is the corresponding density of observed child Y_c evaluated at y_c^(d), given its parents (which include X_i^(d)) and (implicit) measurement model parameters. This factor can be dropped if y_c^(d) is missing.

If variable X_i is an exogenous variable, then the factor exp(−(x_i^(d) − f_i^(d))^2 / (2υ_ζi)) gets substituted by

exp(−(x_i^(d) − µ_{i z_i^(d)})^2 / (2 v_{i z_i^(d)}))

where z_i^(d) is the latent mixture indicator for the marginal mixture of Gaussians model for X_i, with means {µ_i1, . . . , µ_iK} and variances {v_i1, . . . , v_iK}.

Given all latent variables, latent function values {f_i^(1), . . . , f_i^(N)} are multivariate Gaussian with covariance matrix

S_fi ≡ (K_i^-1 + I/υ_ζi)^-1

where K_i is the corresponding kernel matrix and I is an N × N identity matrix. The respective mean is given by S_fi x_i^(1:N)/υ_ζi, where x_i^(1:N) ≡ [x_i^(1) . . . x_i^(N)]^T. This operation costs O(N^3). We sample from this conditional as in a standard Gibbs update.
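The update above is a single multivariate Gaussian draw; a minimal illustrative sketch follows (the O(N^3) cost comes from the matrix inversions, and the function name is not the paper's).

```python
import numpy as np

def sample_latent_functions(K_i, x_i, v_zeta, rng):
    """Gibbs update for {f_i^{(1)}, ..., f_i^{(N)}} in the non-sparse model (Appendix A)."""
    N = len(x_i)
    S_f = np.linalg.inv(np.linalg.inv(K_i) + np.eye(N) / v_zeta)  # (K_i^{-1} + I/v_zeta)^{-1}
    mean = S_f @ x_i / v_zeta
    return rng.multivariate_normal(mean, S_f)
```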

Sampling each latent conditional variance υ_ζi can also be done by sampling from its conditional. Let υ_ζi have an inverse gamma prior (a_ζ, b_ζ). The conditional distribution for this variance given all other random variables is inverse gamma (a'_ζ, b'_ζ), where a'_ζ = a_ζ + N/2 and b'_ζ = b_ζ + Σ_{d=1}^N (x_i^(d) − f_i^(d))^2/2.

We are left with sampling the mixture model parameters that correspond to the marginal distributions of the exogenous latent variables. Once we condition on the latent variables, this is completely standard. If each mixture mean parameter µ_ij is given an independent Gaussian prior with mean zero and variance v_π, its conditional given the remaining variables is also Gaussian with variance v'_π ≡ 1/(1/v_π + |Z_ij|/v_ij), where Z_ij is the subset of 1, 2, . . . , N such that d ∈ Z_ij if and only if z_i^(d) = j. The corresponding mean is given by v'_π Σ_{d∈Z_ij} x_i^(d)/v_ij. If each mixture variance parameter v_ij is given an inverse gamma prior (a_π, b_π), its conditional is an inverse gamma (a'_π, b'_π), where a'_π = a_π + |Z_ij|/2, and b'_π = b_π + Σ_{d∈Z_ij} (x_i^(d) − µ_ij)^2/2. The conditional probability P(z_i^(d) = j | everything else) is proportional to Σ_{t=1}^N v_ij^-1/2 exp(−(x_i^(t) − µ_ij)^2/(2v_ij)). Finally, given a Dirichlet prior distribution (α_1, . . . , α_K) for each π_i, its conditional is also Dirichlet with parameter vector (α_1 + |Z_i1|, . . . , α_K + |Z_iK|).
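A sketch of the conditional updates for the exogenous mixture parameters described above (means, variances and mixing proportions); the indicator update is omitted, and all names are illustrative rather than taken from the paper's code.

```python
import numpy as np

def sample_mixture_block(x_i, z_i, K, v_prior, a_pi, b_pi, alpha, mu, v, rng):
    """Conditional updates for the exogenous mixture of X_i (Appendix A),
    given the current mixture indicators z_i."""
    for j in range(K):
        members = x_i[z_i == j]
        n_j = len(members)
        # Gaussian update for the component mean.
        v_post = 1.0 / (1.0 / v_prior + n_j / v[j])
        m_post = v_post * members.sum() / v[j]
        mu[j] = rng.normal(m_post, np.sqrt(v_post))
        # Inverse gamma update for the component variance.
        a_post = a_pi + n_j / 2.0
        b_post = b_pi + 0.5 * ((members - mu[j]) ** 2).sum()
        v[j] = 1.0 / rng.gamma(a_post, 1.0 / b_post)
    counts = np.bincount(z_i, minlength=K)
    pi = rng.dirichlet(alpha + counts)            # Dirichlet update for mixing weights
    return mu, v, pi
```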

APPENDIX B: A NOTE ON DIRECTIONALITY DETECTION

The assumption of linearity of the measurement model is not only a matter of convenience. In SEM applications, observed variables are carefully chosen to represent different aspects of latent concepts of interest and often have a single latent parent. As such, it is plausible that children of a particular latent variable are different noisy linear transformations of the target latent variable. This differs from other applications of latent variable Gaussian process models such as those introduced by Lawrence (2005), where measurements are not designed to explicitly account for target latent variables of interest. Moreover, this linearity condition has important implications on distinguishing among candidate models.


Implications for Model Selection

We assumed that the DAG G is given. A detailed discussion of model selection is left as future work. Instead, we discuss some theoretical aspects of a very particular but important structural feature that will serve as a building block to more general model selection procedures, in the spirit of Hoyer et al. (2008a): determining sufficient conditions for the subproblem of detecting edge directionality from the data. Given a measurement model for two latent variables X_1 and X_2, we need to establish conditions in which we can test whether the only correct latent structure is X_1 → X_2, X_2 → X_1, the disconnected structure, or either directionality. The results of Hoyer et al. (2008a) can be extended to the latent variable case by exploiting the conditions of identifiability discussed in Section 2.1 as follows.

Our sufficient conditions are a weaker set of assumptions than that of Silva et al. (2006). We assume that X_1 has at least two observable children which are not children of X_2 and vice-versa. Call these sets {Y_1, Y'_1} and {Y_2, Y'_2}, respectively. Assume all error terms ({ǫ_i} ∪ {ζ_i}) are non-Gaussian¹¹. The variance of all error terms is assumed to be nonzero. As in Hoyer et al. (2008a), we also assume X_1 and X_2 are unconfounded.

Testing the model where X_1 and X_2 are independent becomes easy in this case: the independence model entails that (say) Y_1 and Y_2 are marginally independent. This can be tested using the nonparametric marginal independence test of Gretton et al. (2007).

For the nontrivial case where latent variables are dependent, the results of Section 2.1 imply that the measurement model of {X_1 → Y_1, X_1 → Y'_1} is identifiable up to the scale and sign of the latent variables, including the marginal distributions of ǫ_1 and ǫ_1'. An analogous result applies to {X_2, Y_2, Y'_2}.

Since the measurement model {Y_1, Y'_1, Y_2, Y'_2} of {X_1, X_2} is identifiable, assume without loss of generality that the linear coefficients corresponding to X_1 → Y_1 and X_2 → Y_2 are fixed to 1, i.e., Y_1 = X_1 + ǫ_1 and Y_2 = X_2 + ǫ_2. Also from Section 2.1, it follows that the distribution of {X_1, X_2} can be identified under very general conditions. The main result of Hoyer et al. (2008a) can then be directly applied. That is, data generated by a model X_2 = f(X_1) + η_2, with η_2 being non-Gaussian and independent of X_1, cannot be represented by an analogous generative model X_1 = g(X_2) + η_1 except in some particular cases that are ruled out as implausible.

Practical Testing

The test for comparing X_1 → X_2 against X_2 → X_1 in Hoyer et al. (2008a) can be modified to our context as follows: we cannot regress X_2 on X_1 and estimate the residuals ζ_2 since X_1 and X_2 are latent. However, we can do an error-in-variables regression of Y_2 on Y_1 using Y'_1 and Y'_2 as instrumental variables (Carroll et al., 2004): this means we find a function h(·) such that Y_2 = h(X) + r and Y_1 = X + w, for non-Gaussian latent variables r, w and X. We then calculate the estimated residuals r of this regression, and test whether such residuals are independent of Y_1 (Gretton et al., 2007). If this is true, then we have no evidence to discard the hypothesis X_1 → X_2.
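The final independence check can be approximated with a kernel statistic in the spirit of Gretton et al. (2007). The sketch below only computes a biased HSIC estimate between the estimated residuals and Y_1 (no null distribution or p-value), and it assumes the error-in-variables regression step producing the residuals has already been carried out; the bandwidth and names are illustrative assumptions.

```python
import numpy as np

def hsic(u, v, sigma=1.0):
    """Biased HSIC estimate between two 1-D samples with Gaussian kernels."""
    n = len(u)
    K = np.exp(-0.5 * (u[:, None] - u[None, :]) ** 2 / sigma**2)
    L = np.exp(-0.5 * (v[:, None] - v[None, :]) ** 2 / sigma**2)
    H = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

# residuals = y2 - h_hat(...)   # from the error-in-variables regression step (not shown)
# A small hsic(residuals, y1) is consistent with X1 -> X2; a permutation test on the
# statistic would give a significance level.
```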

The justification for this process is that, if the true model is indeed X_2 = f(X_1) + η_2, then h(·) = f(·) and r = ǫ_2 + η_2 in the limit of infinite data, since the error-in-variables regression model is identifiable in our case (Carroll et al., 2004), with X = X_1 being a consequence of deconvolving Y_1 and ǫ_1. By this result, r will be independent of Y_1. However, if the opposite holds (X_1 ← X_2) then, as in (Hoyer et al., 2008a), the residual is not in general independent of Y_1: given X_1 (= X), there is a d-connecting path Y_2 ← X_2 → X_1 ← η_1 (Pearl, 2000), and r will be a function of η_1, which is dependent on Y_1. This is analogous to (Hoyer et al., 2008a), but using a different family of regression techniques.

¹¹Variations where ǫ_i and latent error terms are allowed to be Gaussian, as in our original model description, are also possible and will be treated in the future.

Error-in-variables regression is a special case of the Gaussian process SEM. The main practical difficulty of using GPSEM with the pseudo-inputs approximation in this case is that such a pseudo-inputs formulation implies a heteroscedastic regression model (Snelson and Ghahramani, 2006). One has either to use the GPSEM formulation without pseudo-inputs, or a model linear in the parameters but with an explicit, finite, basis dictionary on the input space.