
Two-Manifold Problems

Byron Boots
Machine Learning Department
Carnegie Mellon University
Pittsburgh, PA 15213
[email protected]

Geoffrey J. Gordon
Machine Learning Department
Carnegie Mellon University
Pittsburgh, PA 15213
[email protected]

Abstract

Recently, there has been much interest in spectral approaches to learning manifolds—so-called kernel eigenmap methods. These methods have had some successes, but their applicability is limited because they are not robust to noise. To address this limitation, we look at two-manifold problems, in which we simultaneously reconstruct two related manifolds, each representing a different view of the same data. By solving these interconnected learning problems together and allowing information to flow between them, two-manifold algorithms are able to succeed where a non-integrated approach would fail: each view allows us to suppress noise in the other, reducing bias in the same way that an instrumental variable allows us to remove bias in a linear dimensionality reduction problem. We propose a class of algorithms for two-manifold problems, based on spectral decomposition of cross-covariance operators in Hilbert space. Finally, we discuss situations where two-manifold problems are useful, and demonstrate that solving a two-manifold problem can aid in learning a nonlinear dynamical system from limited data.

1 Introduction

Manifold learning algorithms are non-linear methods for embedding a set of data points into a low-dimensional space while preserving local geometry. Recently, there has been a great deal of interest in spectral approaches to learning manifolds. These kernel eigenmap methods include Isomap [1], Locally Linear Embedding (LLE) [2], Laplacian Eigenmaps (LE) [3], Maximum Variance Unfolding (MVU) [4], and Maximum Entropy Unfolding (MEU) [5]. These approaches can be viewed as kernel principal component analysis [6] with specific choices of manifold kernels [7]: they seek a small set of latent variables that, through a nonlinear mapping, explains the observed high-dimensional data.

Despite the popularity of kernel eigenmap methods, they are limited in one important respect: they generally only perform well when there is little or no noise. Several authors have attacked this problem using methods including neighborhood smoothing [8] and robust principal components analysis [9, 10], with some success under limited noise. Unfortunately, the problem is fundamentally ill posed without some sort of side information about the true underlying signal: by design, manifold methods will recover extra latent dimensions which “explain” the noise.

We take a different approach to the problem of learning manifolds from noisy observations. We assume access to a set of instrumental variables: variables that are correlated with the true latent variables, but uncorrelated with the noise in observations. Such instrumental variables can be used to separate signal from noise, as described in Section 3. Instrumental variables have been used to allow consistent estimation of model parameters in many statistical learning problems, including linear regression [11], principal component analysis [12], and temporal difference learning [13]. Here we extend the scope of this technique to manifold learning. We will pay particular attention to the case of observations from two manifolds, each of which can serve as instruments for the other. We call such problems two-manifold problems.

In one-manifold problems, when the noise variance is significant compared to the manifold’s geometry, kernel eigenmap methods are biased: they will fail to recover the true manifold, even in the limit of infinite data. Two-manifold methods, by contrast, can succeed in this case: we propose algorithms based on spectral decompositions related to cross-covariance operators in a tensor product space of two different manifold kernels, and show that the instrumental variable idea suppresses noise in practice. We also examine some theoretical properties of two-manifold methods: we argue consistency for a special case in Sec. 3.2.3, but we leave a full theoretical analysis of these algorithms to future work.

The benefits of the spectral approach to two-manifold problems are significant: first, the spectral decomposition naturally solves the manifold alignment problem by finding low-dimensional manifolds (defined by the left and right eigenvectors) that explain both sets of observations. Second, the connection between manifolds and covariance operators opens the door to solving more-sophisticated machine learning problems with manifolds, including nonlinear analogs of canonical correlations analysis (CCA) and reduced-rank regression (RRR).

As an example of this last point, subspace identification approaches to learning non-linear dynamical systems depend critically on instrumental variables and the spectral decomposition of (potentially infinite-dimensional) covariance operators [14, 15, 16, 17, 18]. Two-manifold problems are a natural fit: by relating the spectral decomposition to our two-manifold method, subspace identification techniques can be forced to identify a manifold state space, and consequently, to learn a dynamical system that is both accurate and interpretable, outperforming the current state of the art.

2 Preliminaries

We begin by looking at two well-known classes of non-linear dimensionality reduction: kernel principal component analysis (kernel PCA) and manifold learning.

2.1 Kernel PCA

Kernel PCA [6] is a generalization of principal component analysis [12]: we first map our d-dimensional inputs x1, . . . , xn ∈ ℝ^d to a higher-dimensional feature space F using a feature mapping φ : ℝ^d → F, and then find the principal components in this new space. If the features are sufficiently expressive, kernel PCA can find structure that regular PCA misses. However, if F is high- or infinite-dimensional, the straightforward approach to PCA, via an eigendecomposition of a covariance matrix, is in general intractable. Kernel PCA overcomes this problem by assuming that F is a reproducing-kernel Hilbert space (RKHS), and that the feature mapping φ is implicitly defined via an efficiently-computable kernel function K(x, x′) = 〈φ(x), φ(x′)〉_F. Popular kernels include the linear kernel K(x, x′) = x · x′ (which identifies the feature space with the input space) and the RBF kernel K(x, x′) = exp(−γ‖x − x′‖²/2).

Conceptually, if we define an infinitely-tall “matrix” with columns φ(xi), Φ = (φ(x1), . . . , φ(xn)), our goal is to recover the eigenvalues and eigenvectors of the centered covariance operator ΣXX = (1/n) ΦHΦᵀ, where H is the centering matrix H = I_n − (1/n) 1 1ᵀ. To avoid working with the high- or infinite-dimensional covariance operator, we instead define the Gram matrix G = (1/n) ΦᵀΦ. The nonzero eigenvalues of the centered covariance ΣXX are the same as those of the centered Gram matrix HGH. And, the corresponding unit-length eigenvectors of ΣXX are given by ΦHvi λi^{−1/2}, where λi and vi are the eigenvalues and eigenvectors of HGH [6]. If data weights P (a diagonal matrix) are present, we can instead decompose the weighted centered Gram matrix PHGHP. (Note that, perhaps confusingly, we center first and multiply by the weights only after centering; the reason for this order will become clear below, in Section 2.2.)
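To make the Gram-matrix route concrete, here is a minimal numpy sketch of kernel PCA as just described; the RBF kernel, the helper names, and the default parameter values are illustrative choices and not part of the paper.

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    """Gram matrix G with G[i, j] = exp(-gamma * ||x_i - x_j||^2 / 2); rows of X are data points."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq / 2)

def kernel_pca(X, k, gamma=1.0, weights=None):
    """Top-k kernel PCA embedding via the (optionally weighted) centered Gram matrix."""
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n            # centering matrix H = I - (1/n) 1 1^T
    C = H @ rbf_kernel(X, gamma) @ H               # centered Gram matrix HGH
    if weights is not None:                        # optional diagonal data weights P: decompose PHGHP
        C = np.diag(weights) @ C @ np.diag(weights)
    lam, V = np.linalg.eigh(C)                     # eigenvalues in ascending order
    lam, V = lam[::-1][:k], V[:, ::-1][:, :k]      # keep the top k
    return V * np.sqrt(np.maximum(lam, 0))         # embedding coordinates of the n points
```

Calling kernel_pca(X, k) returns the n × k embedding of the training points.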

2.2 Manifold Learning

Spectral algorithms for manifold learning, sometimes called kernel eigenmap methods, include Isomap [1], Locally Linear Embedding (LLE) [2], Laplacian Eigenmaps (LE) [3], and Maximum Variance Unfolding (MVU) [4]. These methods seek a nonlinear function that maps a high-dimensional set of data points to a lower-dimensional space while preserving the manifold on which the data lies. The main insight behind these methods is that large distances in input space are often meaningless due to the large-scale curvature of the manifold; so, ignoring these distances can lead to a significant improvement in dimensionality reduction by “unfolding” the manifold.

Interestingly, these algorithms can be viewed as special cases of kernel PCA where the Gram matrix G is constructed over the finite domain of the training data in a particular way [7]. Specifically, kernel eigenmap methods first induce a neighborhood structure on the data, capturing a notion of local geometry via a graph, where nodes are data points and edges are neighborhood relations. Then they solve an eigenvalue or related problem based on the graph to embed the data into a lower dimensional space while preserving the local relationships. Here we just describe LE; the other methods have a similar intuition, although the details and performance characteristics differ.

In LE, neighborhoods are summarized by an adjacency matrix W, computed by nearest neighbors: element wi,j is nonzero whenever the ith point is one of the nearest neighbors of the jth point, or vice versa. Non-zero weights are typically either set to 1 or computed according to a Gaussian RBF kernel: wi,j = exp(−γ‖xi − xj‖²/2). Now let Si,i = Σ_j wi,j, and set C = S^{−1/2}(W − S)S^{−1/2}. Finally, eigendecompose C to get a low-dimensional embedding of the data points, discarding the top eigenvector (which is trivial): if C = VΛVᵀ, the embedding is S^{1/2} V_{2:k+1}. To relate LE to kernel PCA, note that G = W − S is already centered, i.e., G = HGH. So, we can view LE as performing weighted kernel PCA, with Gram matrix G and data weights P = S^{−1/2}.¹,²,³

The weighted Gram matrix C can be related to a random walk on the graph defined by W: if we add the identity and make a similarity transform (both operations that preserve the eigensystem), we get S^{−1/2}(C + I)S^{1/2} = S^{−1}W, a stochastic transition matrix for the walk. So, Laplacian eigenmaps can be viewed as trying to keep points close together when they are connected by many short paths in our graph.
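The sketch below builds the LE matrices just described (W, S, and C = S^{−1/2}(W − S)S^{−1/2}) and returns the embedding S^{1/2} V_{2:k+1}; the function names and the dense numpy implementation are my own choices rather than the paper's.

```python
import numpy as np

def le_gram(X, n_neighbors=5, gamma=1.0):
    """LE's weighted centered Gram matrix C = S^(-1/2)(W - S)S^(-1/2) and the degrees s = diag(S)."""
    n = X.shape[0]
    sqdist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    A = np.zeros((n, n))
    nn = np.argsort(sqdist, axis=1)[:, 1:n_neighbors + 1]   # nearest neighbors (skip self at index 0)
    A[np.arange(n)[:, None], nn] = 1.0
    A = np.maximum(A, A.T)                                  # i ~ j if either is a neighbor of the other
    W = A * np.exp(-gamma * sqdist / 2)                     # Gaussian RBF edge weights
    s = W.sum(axis=1)
    S_inv_sqrt = np.diag(1.0 / np.sqrt(s))
    return S_inv_sqrt @ (W - np.diag(s)) @ S_inv_sqrt, s

def laplacian_eigenmaps(X, k, n_neighbors=5, gamma=1.0):
    """k-dimensional LE embedding S^(1/2) V_{2:k+1} of the rows of X."""
    C, s = le_gram(X, n_neighbors, gamma)
    lam, V = np.linalg.eigh(C)                              # ascending eigenvalues
    V = V[:, ::-1]                                          # descending; first column is the trivial one
    return np.diag(np.sqrt(s)) @ V[:, 1:k + 1]              # discard the trivial top eigenvector
```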

3 Bias and Instrumental Variables

Kernel eigenmap methods are very good at dimensionality reduction when the original data points sample a high-dimensional manifold relatively densely, and when the noise in each sample is small compared to the local curvature of the manifold (or when we don’t care about recovering curvature on a scale smaller than the noise). In practice, however, observations are frequently noisy. Depending on the nature of the noise, manifold-learning algorithms applied to these datasets produce biased embeddings. See Figures 1–2, the “noisy swiss rolls,” for an example of how noise can bias manifold learning algorithms.

To see why, we examine PCA, a special case of manifold learning methods, and look at why it produces biased embeddings in the presence of noise. We first show how to overcome this problem in the linear case, and then use these same ideas to fix kernel PCA, a nonlinear algorithm. Finally, in Sec. 4, we extend these ideas to fully general kernel eigenmap methods.

1 The original algorithm actually looks for the smallest eigenvalues of −C, which is equivalent. The matrix −G = S − W is the graph Laplacian for our neighborhood graph, and −C is the weighted graph Laplacian—hence the algorithm name “Laplacian eigenmaps.”

2 The centering step is vestigial: we can include it, C = S^{−1/2}H(W − S)HS^{−1/2}, but it has no effect. The discarded eigenvector (which corresponds to eigenvalue 0 and to a constant coordinate in the embedding) similarly is vestigial: we can discard it or not, and it doesn’t affect pairwise distances among embedded points. Interestingly, previous papers on the connection between Laplacian eigenmaps and the other algorithms mentioned here seem to contain an imprecision: they typically connect dropping the top (trivial) eigenvector with the centering step—e.g., see [5].

3 A minor difference is that LE does not scale its embedding using the eigenvalues of G; the related diffusion maps algorithm [19] does.

3.1 Bias in Finite-Dimensional Linear Models

Suppose that xi is a noisy view of some underlying low-dimensional latent variable zi: xi = Mzi + εi for a linear transformation M and i.i.d. zero-mean noise term εi. Without loss of generality, we assume that xi and zi are centered, and that Cov[zi] and M both have full column rank: any component of zi in the nullspace of M doesn’t affect xi.

In this case, PCA on X will generally fail to recover Z: the expectation of ΣXX = (1/n) XXᵀ is M Cov[zi] Mᵀ + Cov[εi], while we need M Cov[zi] Mᵀ to be able to recover a transformation of M or Z. The unwanted term Cov[εi] will, in general, affect all eigenvalues and eigenvectors of ΣXX, causing us to recover a biased answer even in the limit of infinite data.
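As a quick illustration of this claim, the following numpy sketch (with a synthetic M, latent Z, and a strongly anisotropic noise covariance of my own choosing) shows that the top-k PCA subspace stays a bounded distance away from the range of M even at a very large sample size:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 100_000, 5, 2

M = rng.normal(size=(d, k))                          # true loading matrix (full column rank)
Z = rng.normal(size=(k, n))                          # latent variables z_i
noise_scale = np.array([3.0, 0.1, 0.1, 0.1, 0.1])    # strongly anisotropic noise
E = noise_scale[:, None] * rng.normal(size=(d, n))
X = M @ Z + E                                        # noisy observations x_i = M z_i + eps_i

def subspace_gap(A, B):
    """Largest principal angle (radians) between the column spans of A and B."""
    Qa, _ = np.linalg.qr(A)
    Qb, _ = np.linalg.qr(B)
    s = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    return np.arccos(np.clip(s.min(), -1.0, 1.0))

Sxx = (X @ X.T) / n                                  # empirical covariance (data centered in expectation)
eigvals, eigvecs = np.linalg.eigh(Sxx)
U_pca = eigvecs[:, -k:]                              # top-k eigenvectors

print("gap between PCA subspace and range(M):", subspace_gap(U_pca, M))
```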

3.1.1 Instrumental Variables

We can fix this problem for linear embeddings: instead of plain PCA, we can use what might be called two-subspace PCA. This method finds a statistically consistent solution through the use of an instrumental variable [11, 12], an observation yi that is correlated with the true latent variables, but uncorrelated with the noise in xi.

Importantly, picking an instrumental variable is not merely a statistical aid, but rather a value judgement about the nature of the latent variable and the noise in the observations. In particular, we are defining the noise to be that part of the variability which is uncorrelated with the instrumental variable, and the signal to be that part which is correlated.

In our example above, a good instrumental variable yi is a different (noisy) view of the same underlying low-dimensional latent variable: yi = Nzi + ζi for some full-column-rank linear transformation N and i.i.d. zero-mean noise term ζi. The expectation of the empirical cross covariance ΣXY = (1/n) XYᵀ is then M Cov(zi) Nᵀ: the noise terms, being independent and zero-mean, cancel out. (And the variance of each element of ΣXY goes to 0 as n → ∞.)

In this case, we can identify the embedding by computing the singular value decomposition (SVD) of the covariance: let UDVᵀ = ΣXY, where U and V are orthonormal and D is diagonal. To reduce noise, we can keep just the columns of U, D, and V which correspond to the top k largest singular values (diagonal entries of D): 〈U, D, V〉 = SVD(ΣXY, k). If we set k to be the true dimension of z, then as n → ∞, U will converge to an orthonormal basis for the range of M, and V will converge to an orthonormal basis for the range of N. The corresponding embeddings are then given by UᵀX and VᵀY.

Interestingly, we can equally well view xi as an instrumental variable for yi: we simultaneously find consistent embeddings of both xi and yi, using each to unbias the other.
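Continuing the synthetic example, here is a minimal sketch of two-subspace PCA: the truncated SVD of the empirical cross-covariance recovers the embeddings despite heavy noise in both views (the variable names and noise levels below are my own choices, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d1, d2, k = 100_000, 5, 6, 2

M = rng.normal(size=(d1, k))                 # loading matrix for the x-view
N = rng.normal(size=(d2, k))                 # loading matrix for the y-view (the instrument)
Z = rng.normal(size=(k, n))                  # shared latent variables
X = M @ Z + 3.0 * rng.normal(size=(d1, n))   # noisy view 1
Y = N @ Z + 3.0 * rng.normal(size=(d2, n))   # noisy view 2, noise independent of view 1

Sxy = (X @ Y.T) / n                          # empirical cross-covariance; noise cancels in expectation
U, D, Vt = np.linalg.svd(Sxy)
U_k, V_k = U[:, :k], Vt[:k, :].T             # top-k left/right singular vectors

X_embed = U_k.T @ X                          # k-dimensional embedding of the x-view
Y_embed = V_k.T @ Y                          # k-dimensional embedding of the y-view
```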

3.1.2 Whitening: Reduced-Rank Regression and Canonical Correlation Analysis

Going beyond two-subspace PCA, there are a number of interesting spectral decompositions of cross-covariance matrices that involve transforming the variables xi and yi before applying a singular value decomposition [12]. For example, in reduced-rank regression [20, 12], we want to estimate E[xi | yi]. Define ΣYY = (1/n) YYᵀ and ΣXY = (1/n) XYᵀ. Then the ordinary (regularized) regression of X on Y predicts ΣXY(ΣYY + ηI)^{−1}Y, where the regularization term ηI (η > 0) ensures that the matrix inverse is well-defined. For a reduced-rank regression, we instead project onto the set of rank-k matrices: 〈U, D, V〉 = SVD(ΣXY(ΣYY + ηI)^{−1}Y, k). In our example above, so long as we let η → 0 as n → ∞, U will again converge to an orthonormal basis for the range of M.

To connect back to two-subspace PCA, we can show that RRR is the same as two-subspace PCA if we first whiten Y. That is, we can equivalently define U for RRR to be the top k left singular vectors of the covariance between X and the whitened instruments Yw = (ΣYY + ηI)^{−1/2}Y. (Here, A^{1/2} stands for the symmetric square root of A, which is guaranteed to exist and be invertible if A is symmetric and positive definite.) Whitening means transforming a covariance matrix toward the identity: if η = 0 and ΣYY has full rank, then E[YwYwᵀ] = ΣYY^{−1/2} ΣYY ΣYY^{−1/2} = I, while if η is near 0, then E[YwYwᵀ] ≈ I.

For symmetry, we can whiten both X and Y before computing their covariance. An SVD of the resulting doubly-whitened cross-covariance matrix (ΣXX + ηI)^{−1/2} ΣXY (ΣYY + ηI)^{−1/2} is called canonical correlation analysis [21], and the resulting singular values are called canonical correlations.
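A compact sketch of both constructions using the whitening described above; the helper names, the default η, and the assumption that X and Y are centered data matrices with one column per point are mine:

```python
import numpy as np

def whiten(S, eta):
    """Inverse symmetric square root (S + eta*I)^(-1/2) of a covariance matrix S."""
    vals, vecs = np.linalg.eigh(S + eta * np.eye(S.shape[0]))
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

def rrr_and_cca(X, Y, k, eta=1e-4):
    """Rank-k left bases from RRR and CCA of centered X (d1 x n) and Y (d2 x n)."""
    n = X.shape[1]
    Sxx, Syy, Sxy = X @ X.T / n, Y @ Y.T / n, X @ Y.T / n
    # RRR: covariance between X and the whitened instruments Y_w, then a truncated SVD.
    U_rrr, _, _ = np.linalg.svd(Sxy @ whiten(Syy, eta))
    # CCA: whiten both views before the SVD; the singular values approximate canonical correlations.
    U_cca, corr, _ = np.linalg.svd(whiten(Sxx, eta) @ Sxy @ whiten(Syy, eta))
    return U_rrr[:, :k], U_cca[:, :k], corr[:k]
```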

3.2 Bias in Learning Nonlinear Models

We now extend the analysis of Section 3.1 to nonlinear models. We assume noisy observations xi = f(zi) + εi, where zi is the desired low-dimensional latent variable, εi is an i.i.d. noise term, and f is a smooth function with smooth inverse (so that f(zi) lies on a manifold). Our goal is to recover f and zi up to identifiability.

Kernel PCA (Sec. 2.1) is a common approach to this problem. In the realizable case, kernel PCA gets the right answer: that is, suppose that zi has dimension k, and that we have at least k independent samples. And, suppose that φ(f(z)) is a linear function of z. Then, the Gram matrix or the covariance “matrix” will have rank k, and we can reconstruct a basis for the range of φ ◦ f from the top k eigenvectors of the Gram matrix. (Similarly, if φ ◦ f is near linear and the variance of εi is small, we can expect kernel PCA to work well, if not perfectly.)

However, just as PCA recovers a biased answer in the finite dimensional case when the variance of εi is nonzero, kernel PCA will also recover a biased answer in this case, even in the limit of infinite data. The bias of kernel PCA follows immediately from the example at the beginning of Section 3: if we use a linear kernel, kernel PCA will simply reproduce the bias of ordinary PCA.

3.2.1 Instrumental Variables

By analogy to two-subspace PCA, a natural generalization of kernel PCA is two-subspace kernel PCA, which we can accomplish via a kernelized SVD of a cross-covariance operator in Hilbert space. Given a joint distribution P[X, Y] over two variables X on 𝒳 and Y on 𝒴, with feature maps φ and υ (corresponding to kernels Kx and Ky), the cross-covariance operator ΣXY is E[φ(x) ⊗ υ(y)]. The cross-covariance operator reduces to an ordinary cross-covariance matrix in the finite-dimensional case; in the infinite-dimensional case, it can be viewed as a kernel mean map descriptor [22] for the joint distribution P[X, Y].

The concept of a cross-covariance operator is helpful because it allows us to extend the methods of instrumental variables to infinite dimensional RKHSs. In our example above, a good instrumental variable yi is a different (noisy) view of the same underlying low-dimensional latent variable: yi = g(zi) + ζi for some smoothly invertible function g and i.i.d. zero-mean noise term ζi.

We proceed now to derive the kernel SVD for a cross-covariance operator.⁴ We show below (Sec. 3.2.3) that the kernel SVD of the cross-covariance operator leads to a consistent estimate of the shared latent variables in two-manifold problems which satisfy appropriate assumptions.

4 The kernel SVD algorithm previously appeared as an intermediate step in [17, 23]; here we generalize the algorithm to weighted cross covariance operators in Hilbert space and give a more complete description, both because the method is interesting in its own right, and because this generalized SVD will serve as a step in our two-manifold algorithms.


Conceptually, our inputs are “matrices” Φ and Υ whose columns are respectively φ(xi) and υ(yi), along with data weights PX and PY. The centered empirical covariance operator is then Σ̂XY = (1/n)(ΦH)(ΥH)ᵀ. The goal of the kernel SVD is then to factor Σ̂XY (or possibly the weighted variant Σ̂ᴾXY = (1/n)(ΦH)PXPY(ΥH)ᵀ) so that we can recover the desired bases for φ(xi) and υ(yi).

However, this conceptual algorithm is impractical, since Σ̂XY can be high- or infinite-dimensional. Instead, we can perform an SVD on the covariance operator in Hilbert space via a trick analogous to kernel PCA. We first show how to perform an SVD on a covariance matrix using Gram matrices in finite-dimensional space, and then we extend the method to infinite dimensional spaces in Section 3.2.3.

3.2.2 SVD via Gram matrices

We start by looking at a Gram matrix formulation of finite-dimensional SVD. In standard SVD, the singular values of ΣXY = (1/n)(XH)(YH)ᵀ are the square roots of the eigenvalues of ΣXY ΣYX (where ΣYX = ΣXYᵀ), and the left singular vectors are defined to be the corresponding eigenvectors. We can find identical eigenvectors and eigenvalues through centered Gram matrices BX = (1/n)(XH)ᵀ(XH) and BY = (1/n)(YH)ᵀ(YH). Let vi be a right eigenvector of BY BX, so that BY BX vi = λi vi. Premultiplying by (XH) yields

(1/n²)(XH)(YH)ᵀ(YH)(XH)ᵀ(XH)vi = λi(XH)vi

and regrouping terms gives us ΣXY ΣYX wi = λi wi, where wi = (XH)vi. So, λi is an eigenvalue of ΣXY ΣYX, √λi is a singular value of ΣXY, and (XH)vi λi^{−1/2} is the corresponding unit-length left singular vector. An analogous argument shows that, if w′i is a unit-length right singular vector of ΣXY, then w′i = (YH)v′i λi^{−1/2}, where v′i is a unit-length left eigenvector of BY BX. Furthermore, we can easily incorporate weights into the SVD by using weighted matrices CX = PXBXPX and CY = PYBYPY in place of BX and BY.
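A finite-dimensional numpy sketch of this Gram-matrix route to the SVD; the function name and the explicit normalization of the recovered singular vectors are my own choices:

```python
import numpy as np

def gram_svd(X, Y, k, px=None, py=None):
    """Top-k SVD of the centered cross-covariance of X (d1 x n) and Y (d2 x n) via Gram matrices.

    px, py are optional per-point weights (the diagonals of P_X and P_Y).
    """
    n = X.shape[1]
    H = np.eye(n) - np.ones((n, n)) / n
    BX = (X @ H).T @ (X @ H) / n                  # centered Gram matrix B_X
    BY = (Y @ H).T @ (Y @ H) / n                  # centered Gram matrix B_Y
    if px is not None:
        BX = np.diag(px) @ BX @ np.diag(px)       # C_X = P_X B_X P_X
    if py is not None:
        BY = np.diag(py) @ BY @ np.diag(py)       # C_Y = P_Y B_Y P_Y
    lam, V = np.linalg.eig(BY @ BX)               # right eigenvectors of B_Y B_X
    order = np.argsort(-lam.real)[:k]
    lam, V = lam.real[order], V[:, order].real
    left = (X @ H) @ V                            # (XH) v_i are eigenvectors of Sigma_XY Sigma_YX
    left /= np.linalg.norm(left, axis=0)          # normalize to unit-length left singular vectors
    return np.sqrt(np.maximum(lam, 0)), left      # singular values of Sigma_XY and left vectors
```

The right singular vectors are obtained analogously from eigenvectors of BX @ BY premultiplied by (YH).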

3.2.3 Two-subspace PCA in RKHSs

The machinery developed in Section 3.2.2 allows us to solve the two-subspace kernel PCA problem by computing the singular values of the weighted empirical covariance operator Σ̂ᴾXY. We define GX and GY to be the Gram matrices whose elements are Kx(xi, xj) and Ky(yi, yj) respectively, and then compute the eigendecomposition of CY CX = (PY HGY HPY)(PX HGX HPX). This method avoids any computations in infinite-dimensional spaces; and, it gives us compact representations of the left and right singular vectors. E.g., if vi is a right eigenvector of CY CX, then the corresponding left singular vector is wi = Σ_j vi,j φ(xj − x̄) px_j, where px_j is the weight assigned to data point xj.

Under appropriate assumptions, we can show that the SVD of the empirical cross-covariance operator Σ̂XY = (1/n)ΦHΥᵀ converges to the desired value. Suppose that E[φ(xi) | zi] is a linear function of zi, and similarly, that E[υ(yi) | zi] is a linear function of zi.⁵ The noise terms φ(xi) − E[φ(xi) | zi] and υ(yi) − E[υ(yi) | zi] are by definition zero-mean; and they are independent of each other, since the first depends only on εi and the second only on ζi. So, the noise terms cancel out, and the expectation of Σ̂XY is the true covariance ΣXY. If we additionally assume that the noise terms have finite variance, the product-RKHS norm of the error Σ̂XY − ΣXY vanishes as n → ∞.

The remainder of the proof follows from the proof of Theorem 1 in [17] (the convergence of the empirical estimator of the kernel covariance operator). In particular, the top k left singular vectors of Σ̂XY converge to a basis for the range of E[φ(xi) | zi] (considered as a function of zi); similarly, the top right singular vectors of Σ̂XY converge to a basis for the range of E[υ(yi) | zi].

3.2.4 Whitening and Kernel SVD

Just as one can use kernel SVD to solve the two-subspace kernel PCA problem for high- or infinite-dimensional feature space, one can also compute the kernel versions of CCA and RRR. (Kernel CCA is a well-known algorithm, though our formulation here is different than in [23], and kernel RRR is to our knowledge novel.) Again, these problems can be solved by pre-whitening the feature space before applying kernel SVD.

To compute a RRR from centered covariates HΥ in Hilbert space to centered responses HΦ in Hilbert space, we find the kernel SVD of the HΥ-whitened covariance matrix: first we define BX = HGXH and BY = HGYH; next we compute BXBY(BY² + ηI)^{−1}BY; and, finally, we perform kernel SVD by finding the singular value decomposition of BXBY(BY² + ηI)^{−1}BY.

5 The assumption of linearity is restrictive, but appears necessary: in order to learn a representation of a manifold using factorization-based methods, we need to pick a kernel which flattens out the manifold into a subspace. This is why kernel eigenmap methods are generally more successful than plain kernel PCA: by learning an appropriate kernel, they are able to adapt their nonlinearity to the shape of the target manifold.


Similarly, for CCA in Hilbert space, we find the kernel SVD of the HΥ- and HΦ-whitened covariance matrix: BX(BX² + ηI)^{−1}BXBY(BY² + ηI)^{−1}BY.

If data weights are present, we use instead weighted centered Gram matrices CX = PXBXPX and CY = PYBYPY. Then, to generalize RRR and CCA to RKHSs, we can compute the eigendecompositions of Equations 1a–b, respectively:

CXCY(CY² + ηI)^{−1}CY   (1a)

CX(CX² + ηI)^{−1}CXCY(CY² + ηI)^{−1}CY   (1b)

The consistency of both approaches follows directly from the consistency of kernel SVD, so long as we let η → 0 as n → ∞ [24].
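A sketch of Equations 1a–b in numpy, starting from raw Gram matrices; the function name, the dense matrix inverses, and the optional-weight handling are illustrative choices:

```python
import numpy as np

def kernel_rrr_cca_operators(GX, GY, eta, px=None, py=None):
    """Build the matrices of Eqs. (1a)-(1b) from Gram matrices GX, GY (both n x n).

    Eigendecomposing the returned matrices gives the kernel RRR / kernel CCA solutions.
    px, py are optional data weights (the diagonals of P_X and P_Y).
    """
    n = GX.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    CX, CY = H @ GX @ H, H @ GY @ H                 # centered Gram matrices B_X, B_Y
    if px is not None:
        CX = np.diag(px) @ CX @ np.diag(px)         # C_X = P_X B_X P_X
    if py is not None:
        CY = np.diag(py) @ CY @ np.diag(py)         # C_Y = P_Y B_Y P_Y
    I = np.eye(n)
    rrr = CX @ CY @ np.linalg.inv(CY @ CY + eta * I) @ CY        # Eq. (1a)
    cca = (CX @ np.linalg.inv(CX @ CX + eta * I) @ CX
           @ CY @ np.linalg.inv(CY @ CY + eta * I) @ CY)         # Eq. (1b)
    return rrr, cca
```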

4 Two-Manifold Problems

Now that we have extended the instrumental variable idea to RKHSs, we can also expand the scope of manifold learning to two-manifold problems, where we want to simultaneously learn two manifolds for two covarying lists of observations, each corrupted by uncorrelated noise.⁶ The idea is simple: we view manifold learners as constructing Gram matrices, then apply the RKHS instrumental variable idea of Sec. 3. As we will see below, this procedure allows us to regain good performance when observations are noisy.

Suppose we are given two sets of observations residing on (or near) two different manifolds: x1, . . . , xn ∈ ℝ^{d1} on a manifold MX and y1, . . . , yn ∈ ℝ^{d2} on a manifold MY. Further suppose that both xi and yi are noisy functions of a latent variable zi, itself residing on a latent k-dimensional manifold MZ: xi = f(zi) + εi and yi = g(zi) + ζi. We assume that the functions f and g are smooth, so that f(zi) and g(zi) trace out submanifolds f(MZ) ⊆ MX and g(MZ) ⊆ MY. We further assume that the noise terms εi and ζi move xi and yi within their respective manifolds MX and MY: this assumption is without loss of generality, since we can always increase the dimension of the manifolds MX and MY to allow an arbitrary noise term. See Figure 1 for an example.

If the variance of the noise terms εi and ζi is too high, or if MX and MY are higher-dimensional than the latent MZ manifold (i.e., if the noise terms move xi and yi away from f(MZ) and g(MZ)), then it may be difficult to reconstruct f(MZ) or g(MZ) separately from xi or yi. Our goal, therefore, is to use xi and yi together to reconstruct both manifolds simultaneously: the extra information from the correspondence between xi and yi will make up for noise, allowing success in the two-manifold problem where the individual one-manifold problems are intractable.

6 The uncorrelated noise assumption is extremely mild: if some latent variable causes correlated changes in our measurements on the two manifolds, then we are making the definition that it is part of the desired signal to be recovered. No other definition seems reasonable: if there is no difference in statistical behavior between signal and noise, then it is impossible to use a statistical method to separate signal from noise.

Figure 1: The Noisy Swiss Rolls. We are given two sets of 3-d observations residing on two different manifolds MX and MY. The latent signal MZ is 2-d, but MX and MY are each corrupted by 3-d noise. (A) 5000 data points sampled from MZ. (B) The functions f(MZ) and g(MZ) “roll” the manifold in 3-dimensional space, two different ways, generating two different manifolds in observation space. (C) Each set of observations is then perturbed by 3-d noise (showing 2 dimensions only), resulting in 3-d manifolds MX and MY. The black lines indicate the location of the submanifolds from (B).

Given samples of n i.i.d. pairs {xi, yi}_{i=1}^n from two manifolds, we propose a two-step spectral learning algorithm for two-manifold problems: first, use either a given kernel or an ordinary one-manifold algorithm such as LE or LLE to compute weighted centered Gram matrices CX and CY from xi and yi separately. Second, use one of the cross-covariance methods from Section 3.2.4, such as kernel SVD, to recover the embedding of points in MZ. The procedure, called instrumental eigenmaps, is summarized in Algorithm 1.

For example, if we are using LE (Section 2.2), then CX = SX^{−1/2}(WX − SX)SX^{−1/2} and CY = SY^{−1/2}(WY − SY)SY^{−1/2}, where WX and SX are computed from xi, and WY and SY are computed from yi. We then perform a singular value decomposition and truncate to the top k singular values:

〈U, Λ, Vᵀ〉 = SVD(CXCY, k)


Algorithm 1 Instrumental Eigenmaps

In: n i.i.d. pairs of observations {xi, yi}_{i=1}^n
Out: embeddings EX and EY

1: Compute centered Gram matrices BX and BY and data weights PX and PY from x1:n and y1:n
2: Compute weighted, centered Gram matrices: CX = PXBXPX and CY = PYBYPY
3: Perform a singular value decomposition and truncate to the top k singular values: 〈U, Λ, Vᵀ〉 = SVD(CXCY, k)
4: Find the embeddings from the singular values: EX = PX^{1/2} U_{2:k+1} Λ_{2:k+1}^{1/2} and EY = PY^{1/2} V_{2:k+1} Λ_{2:k+1}^{1/2}

The embeddings for xi and yi are then given by SX^{1/2} U_{2:k+1} Λ_{2:k+1}^{1/2} and SY^{1/2} V_{2:k+1} Λ_{2:k+1}^{1/2}.

Computing eigenvalues of CXCY instead of CX or CY alone will alter the eigensystem: it will promote directions within each individual learned manifold that are useful for predicting coordinates on the other learned manifold, and demote directions that are not useful. As shown in Figure 2, this effect strengthens our ability to recover relevant dimensions in the face of noise.
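Below is a minimal sketch of Algorithm 1 with LE-style Gram matrices, following the formulas above and reusing le_gram from the Sec. 2.2 sketch; instrumental_eigenmaps is a name I chose, and the dense O(n²) construction is for clarity rather than efficiency.

```python
import numpy as np

def instrumental_eigenmaps(X, Y, k, n_neighbors=5):
    """Algorithm 1 sketch for paired observations X (n x d1), Y (n x d2), using le_gram from Sec. 2.2."""
    CX, sX = le_gram(X, n_neighbors)                 # weighted centered Gram matrix and degrees, x-view
    CY, sY = le_gram(Y, n_neighbors)                 # same for the y-view
    U, lam, Vt = np.linalg.svd(CX @ CY)              # truncated SVD of C_X C_Y
    U_k, V_k = U[:, 1:k + 1], Vt[1:k + 1, :].T       # drop the trivial top component (as in LE)
    scale = np.diag(np.sqrt(lam[1:k + 1]))
    EX = np.diag(np.sqrt(sX)) @ U_k @ scale          # E_X = S_X^(1/2) U_{2:k+1} Lambda^(1/2)
    EY = np.diag(np.sqrt(sY)) @ V_k @ scale          # E_Y = S_Y^(1/2) V_{2:k+1} Lambda^(1/2)
    return EX, EY
```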

5 Two-Manifold Detailed Example: Learning Dynamical Systems

A longstanding goal in machine learning and robotics has been to learn accurate, economical models of dynamical systems directly from observations. This task requires two related subtasks: 1) learning a low-dimensional state space, which is often known to lie on a manifold; and 2) learning the system dynamics. We propose tackling this problem by combining spectral learning algorithms for non-linear dynamical systems [14, 15, 16, 18, 25, 17] with two-manifold methods.

Any of the above-cited spectral learning algorithms can be combined with two-manifold methods. Here, we focus on one specific example: we show how to combine HSE-HMMs [17], a powerful nonparametric approach to modeling dynamical systems, with manifold learning. Specifically, we look at one important step in the HSE-HMM learning algorithm: the original approach uses kernel SVD to discover a low-dimensional state space, and we propose replacing the kernel SVD with a two-manifold method. We demonstrate that the resulting manifold HSE-HMM can outperform standard HSE-HMMs (and many other well-known methods for learning dynamical systems) on a difficult real-world example: the manifold HSE-HMM accurately discovers a curved low-dimensional manifold which contains the state space, while other methods discover only a (potentially much higher-dimensional) subspace which contains this manifold.

Figure 2: Solving the Noisy Swiss Roll two-manifold problem (see Fig. 1 for setup). Top graphs show embeddings of xi, bottom graphs show embeddings of yi. (A) 2-dimensional embeddings found by LE. The best results were obtained by setting the number of nearest neighbors to 5. Due to the large amounts of noise, the separately learned embeddings do not accurately reflect the latent 2-dimensional manifold. (B) The embeddings learned from the left and right eigenvectors of CXCY closely match the original data sampled from the true manifold. By treating the learning problem as a two-manifold problem, noise disappears in expectation and the latent manifold is recovered accurately.

5.1 Hilbert Space Embeddings of HMMs

The key idea behind spectral learning of dynamical systems is that a good latent state is one that lets us predict the future. HSE-HMMs implement this idea by finding a low-dimensional embedding of the conditional probability distribution of sequences of future observations, and using the embedding coordinates as state. Song et al. [17] suggest finding this low-dimensional state space as a subspace of an infinite dimensional RKHS.

Intuitively, we might think that we could find the best state space by performing PCA or kernel PCA of sequences of future observations. That is, we would sample n sequences of future observations x1, . . . , xn ∈ ℝ^{d1} from a dynamical system. (If our training data consists of a single long sequence of observations, we can collect our sample by picking n random time steps t1, . . . , tn. Each sample xi is then a sequence of observations starting at time ti.) We would then construct a Gram matrix GX, whose (i, j) element is Kx(xi, xj). Finally, we would find the eigendecomposition of the centered Gram matrix BX = HGXH as in Section 2.1. The resulting embedding coordinates would be tuned to predict future observations well, and so could be viewed as a good state space.

However, the state space found by kernel PCA is biased: it typically includes noise, information that cannot be predicted from past observations. We would like instead to find a low dimensional state space that embeds the probability distribution of possible futures conditioned on the past. In particular, we want to find a state space that is uncorrelated with the noise in the future observations. From a two-manifold perspective, we view features of the past as instrumental variables to unbias the future.

Therefore, in addition to sampling sequences of future observations, we sample corresponding sequences of past observations y1, . . . , yn ∈ ℝ^{d2}: sequence yi ends at time ti − 1. We then construct a Gram matrix GY, whose (i, j) element is Ky(yi, yj). From GY we construct the centered Gram matrix BY = HGYH. Finally, we identify the state space using a kernel SVD as in Section 3.2.3: 〈U, Λ, Vᵀ〉 = SVD(BXBY, k) (there are no data weights for HSE-HMMs). The left singular “vectors” (reconstructed from U as in Section 3.2.3) now identify a subspace in which the system evolves. From this subspace, we can proceed to identify the parameters of the dynamical system as in Song et al. [17].⁷
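A rough sketch of this state-space step, with futures and pasts built from a single observation sequence; the window length, the plain RBF kernel, and the helper names are my own choices, and the sketch omits the median-trick bandwidth, regularization, and the remaining HSE-HMM parameter-learning steps of [17].

```python
import numpy as np

def rbf_gram(A, gamma):
    """Gaussian RBF Gram matrix between rows of A (each row is a stacked window of observations)."""
    sq = ((A[:, None, :] - A[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq / 2)

def hse_hmm_state_space(obs, window, k, gamma=1.0):
    """Sketch of the state-space step: futures vs. pasts of one observation sequence obs (T x d)."""
    T = obs.shape[0]
    t = np.arange(window, T - window)                              # usable time steps t_i
    futures = np.stack([obs[i:i + window].ravel() for i in t])     # x_i: window starting at t_i
    pasts = np.stack([obs[i - window:i].ravel() for i in t])       # y_i: window ending at t_i - 1
    n = len(t)
    H = np.eye(n) - np.ones((n, n)) / n
    BX = H @ rbf_gram(futures, gamma) @ H                          # centered Gram matrix of futures
    BY = H @ rbf_gram(pasts, gamma) @ H                            # centered Gram matrix of pasts
    U, lam, _ = np.linalg.svd(BX @ BY)                             # kernel SVD; pasts act as instruments
    return U[:, :k], lam[:k]
```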

5.2 Manifold HSE-HMMs

In contrast with HSE-HMMs, we are interested in modeling a dynamical system whose state space lies on a low-dimensional manifold, even if this manifold is curved to occupy a higher-dimensional subspace (an example is given in Section 5.3, below). We want to use this additional knowledge to constrain the learning algorithm and produce a more accurate model for a given amount of training data.

To do so, we replace the kernel SVD by a two-manifold method. That is, we learn weighted centered Gram matrices CX and CY for the future and past observations, using a manifold method like LE or LLE (see Section 2.2). Then we apply an SVD to CXCY in order to recover the latent state space.

7 Other approaches such as CCA and RRR have also been successfully used for finite-dimensional spectral learning algorithms [18], suggesting that kernel SVD could be replaced by kernel CCA or kernel RRR (Section 3.2.4).
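For the replacement described in Section 5.2, a minimal sketch is to swap the RBF Gram matrices of the previous snippet for LE-style weighted centered Gram matrices and decompose CXCY instead of BXBY; this continues the previous snippet (reusing its futures, pasts, and k) and the le_gram helper from the Section 2.2 sketch, with n_neighbors=50 mirroring the experiment in Section 5.3.

```python
# Manifold HSE-HMM state-space step (sketch): LE Gram matrices in place of RBF ones.
CX, _ = le_gram(futures, n_neighbors=50)   # weighted centered Gram matrix of futures
CY, _ = le_gram(pasts, n_neighbors=50)     # weighted centered Gram matrix of pasts
U, lam, _ = np.linalg.svd(CX @ CY)         # SVD of C_X C_Y rather than B_X B_Y
state_basis = U[:, :k]                     # coordinates defining the manifold state space
```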

5.3 Slotcar: A Real-World Dynamical System

Here we look at the problem of tracking and predicting the position of a slotcar with an attached inertial measurement unit (IMU) racing around a track. The setup consisted of a track and a miniature car (1:32 scale model) guided by a slot cut into the track. Figure 3(A) shows the setup. At a rate of 10Hz, we extracted the estimated 3-D acceleration and angular velocity of the car. An overhead camera also supplied estimates of the 2-dimensional location of the car.

We collected 3000 successive observations while the slot car circled the track controlled by a constant policy (with varying speeds). The goal was to learn a dynamical model of the noisy IMU data, and, after filtering, to predict current and future 2-dimensional locations. We used the first 2000 data points as training data, and held out the last 500 data points for testing the learned models. We trained four models, and evaluated these models based on prediction accuracy, and, where appropriate, the learned latent state.

First, we trained a 20-dimensional embedded HMM with the spectral algorithm of Song et al. [17], using sequences of 150 consecutive observations. Following Song et al., we used Gaussian RBF kernels, setting the bandwidth parameter with the “median trick”, and using regularization parameter λ = 10⁻⁴. Second, we trained a similar 20-dimensional embedded HMM with LE kernels. The number of nearest neighbors was selected to be 50, and the other parameters were set to be identical to the first model. (So, the only difference is that the first model performs a kernel SVD to find the subspace on which the dynamical system evolves, while the second model solves a two-manifold problem.) Third, for comparison’s sake, we trained a 20-dimensional Kalman filter using the N4SID algorithm [26] with Hankel matrices of 150 time steps; and finally, we learned a 20-state discrete HMM (with 400 levels of discretization for observations) using the EM algorithm.

Before investigating the predictive accuracy of the learned dynamical systems, we looked at the learned state space of the first three models. These models differ mainly in their kernel: Gaussian RBF, learned manifold from Laplacian Eigenmaps, or linear. As a test, we tried to reconstruct the 2-dimensional locations of the car from each of the three latent state spaces: the more accurate the learned state space, the better we expect to be able to reconstruct the locations. Results are shown in Figure 3(B); the learned manifold is the clear winner.

Finally we examined the prediction accuracy of each model. We performed filtering for different extents t1 = 100, . . . , 350, then predicted the car location for a further t2 steps in the future, for t2 = 1, . . . , 100. The root mean squared error of this prediction in the 2-dimensional location space is plotted in Figure 3(C). The manifold HMM learned by the method detailed in Section 5.2 consistently yields lower prediction error for the duration of the prediction horizon. This is important: not only does the manifold HSE-HMM provide a model that better visualizes the data that we want to predict, but it is significantly more accurate when filtering and predicting than classic and state-of-the-art alternatives.

Figure 3: Slot car with inertial measurement unit (IMU). (A) The slot car platform: the car and IMU (top) and racetrack (bottom). (B) A comparison of training data embedded into the state space of three different learned models. Red line indicates true 2-d position of the car over time, blue lines indicate the prediction from state space. The top graph shows the Kalman filter state space (linear kernel), the middle graph shows the HSE-HMM state space (Gaussian RBF kernel), and the bottom graph shows the manifold HSE-HMM state space (LE kernel). The LE kernel finds the best representation of the true manifold. (C) Root mean squared error for prediction (averaged over 250 trials) with different estimated models. The manifold HSE-HMM significantly outperforms the other learned models by taking advantage of the fact that the data we want to predict lies on a manifold.

6 Related Work

While preparing this manuscript, we learned of the simultaneous and independent work of Mahadevan et al. [27]. This paper defines one particular two-manifold algorithm, maximum covariance unfolding (MCU), by extending maximum variance unfolding; but, it does not discuss how to extend other one-manifold methods. It also does not discuss any asymptotic properties of the MCU method, such as consistency.

A similar problem to the two-manifold problem is manifold alignment [28, 29], which builds connections between two or more data sets by aligning their underlying manifolds. Generally, manifold alignment algorithms either first learn the manifolds separately and then attempt to align them based on their low-dimensional geometric properties, or they take the union of several manifolds and attempt to learn a latent space that preserves the geometry of all of them [29]. Our aim is different: we assume paired data, where manifold alignments do not; and, we focus on learning algorithms that simultaneously discover manifold structure (as kernel eigenmap methods do) and connections between manifolds (as provided by, e.g., a top-level learning problem defined between two manifolds).

This type of interconnected learning problem has been explored before in a different context via reduced-rank regression (RRR) [30, 31, 20] and sufficient dimension reduction (SDR) [32, 33, 34]. RRR is a linear regression method that attempts to estimate a set of coefficients β to predict response vectors yi from covariate vectors xi under the constraint that β is low rank (and can therefore be factored). In SDR, the goal is to find a linear subspace of covariates xi that makes response vectors yi conditionally independent of the xis. The formulation is in terms of conditional independence, and, unlike in RRR, no assumption is made on the form of the regression from xi to yi. Unfortunately, SDR pays a price for the relaxation of the linear assumption: the solution to SDR problems usually requires a difficult non-linear non-convex optimization.

Manifold kernel dimension reduction [35] finds an embedding of covariates xi using a kernel eigenmap method, and then attempts to find a linear transformation of some of the dimensions of the embedded points to predict response variables yi. The response variables are constrained to be linear in the manifold, so the problem is quite different from a two-manifold problem.

There has also been some work on finding a mapping between manifolds [36] and learning a dynamical system on a manifold [37]; however, in both of these cases it was assumed that the manifold was known.

Finally, some authors have focused on the problem of combining manifold learning algorithms with system identification algorithms. For example, Lewandowski et al. [38] introduces Laplacian eigenmaps that accommodate time series data by picking neighbors based on temporal ordering. Lin et al. [39] and Li et al. [40] propose learning a piecewise linear model that approximates a non-linear manifold and then attempt to learn the dynamics in the low-dimensional space. Although these methods do a qualitatively reasonable job of constructing a manifold, none of these papers compares the predictive accuracy of its model to state-of-the-art dynamical system identification algorithms.

7 Conclusion

In this paper we propose a class of problems called two-manifold problems, where two sets of corresponding data points, generated from a single latent manifold and corrupted by noise, lie on or near two different higher dimensional manifolds. We design algorithms by relating two-manifold problems to cross-covariance operators in RKHSs, and show that these algorithms result in a significant improvement over standard manifold learning approaches in the presence of noise. This is an appealing result: manifold learning algorithms typically assume that observations are (close to) noiseless, an assumption that is rarely satisfied in practice.

Furthermore, we demonstrate the utility of two-manifold problems by extending a recent dynamical system identification algorithm to learn a system with a state space that lies on a manifold. The resulting algorithm learns a model that outperforms the current state-of-the-art in terms of predictive accuracy. To our knowledge this is the first combination of system identification and manifold learning that accurately identifies a latent time series manifold and is competitive with the best system identification algorithms at learning accurate predictive models.

Acknowledgements

Byron Boots and Geoffrey J. Gordon were supported by ONR MURI grant number N00014-09-1-1052. Byron Boots was supported by the NSF under grant number EEEC-0540865.

References

[1] Joshua B. Tenenbaum, Vin De Silva, and John Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290:2319–2323, 2000.

[2] Sam T. Roweis and Lawrence K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, December 2000.

[3] Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15:1373–1396, 2002.

[4] Kilian Q. Weinberger, Fei Sha, and Lawrence K. Saul. Learning a kernel matrix for nonlinear dimensionality reduction. In Proceedings of the 21st International Conference on Machine Learning, pages 839–846. ACM Press, 2004.

[5] Neil D. Lawrence. Spectral dimensionality reduction via maximum entropy. In Proc. AISTATS, 2011.

[6] Bernhard Schölkopf, Alex J. Smola, and Klaus-Robert Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5):1299–1319, 1998.

[7] Jihun Ham, Daniel D. Lee, Sebastian Mika, and Bernhard Schölkopf. A kernel view of the dimensionality reduction of manifolds, 2003.

[8] Guisheng Chen, Junsong Yin, and Deyi Li. Neighborhood smoothing embedding for noisy manifold learning. In GrC, pages 136–141, 2008.

[9] Yubin Zhan and Jianping Yin. Robust local tangent space alignment. In ICONIP (1), pages 293–301, 2009.

[10] Yubin Zhan and Jianping Yin. Robust local tangent space alignment via iterative weighted PCA. Neurocomputing, 74(11):1985–1993, 2011.

[11] Judea Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, 2000.

[12] I. T. Jolliffe. Principal Component Analysis. Springer, New York, 2002.

[13] Steven J. Bradtke and Andrew G. Barto. Linear least-squares algorithms for temporal difference learning. In Machine Learning, pages 22–33, 1996.


[14] Daniel Hsu, Sham Kakade, and Tong Zhang. A spectral algorithm for learning hidden Markov models. In COLT, 2009.

[15] Sajid Siddiqi, Byron Boots, and Geoffrey J. Gordon. Reduced-rank hidden Markov models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS), 2010.

[16] Byron Boots, Sajid M. Siddiqi, and Geoffrey J. Gordon. Closing the learning-planning loop with predictive state representations. In Proceedings of Robotics: Science and Systems VI, 2010.

[17] L. Song, B. Boots, S. M. Siddiqi, G. J. Gordon, and A. J. Smola. Hilbert space embeddings of hidden Markov models. In Proc. 27th Intl. Conf. on Machine Learning (ICML), 2010.

[18] Byron Boots and Geoff Gordon. Predictive state temporal difference learning. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 271–279, 2010.

[19] Boaz Nadler, Stéphane Lafon, Ronald R. Coifman, and Ioannis G. Kevrekidis. Diffusion maps, spectral clustering and eigenfunctions of Fokker-Planck operators. In NIPS, 2005.

[20] Gregory C. Reinsel and Rajabather Palani Velu. Multivariate Reduced-Rank Regression: Theory and Applications. Springer, 1998.

[21] Harold Hotelling. The most predictable criterion. Journal of Educational Psychology, 26:139–142, 1935.

[22] A. J. Smola, A. Gretton, L. Song, and B. Schölkopf. A Hilbert space embedding for distributions. In E. Takimoto, editor, Algorithmic Learning Theory, Lecture Notes in Computer Science. Springer, 2007.

[23] K. Fukumizu, F. Bach, and A. Gretton. Consistency of kernel canonical correlation analysis. Technical Report 942, Institute of Statistical Mathematics, Tokyo, Japan, 2005.

[24] L. Song, J. Huang, A. Smola, and K. Fukumizu. Hilbert space embeddings of conditional distributions. In Proceedings of the International Conference on Machine Learning (ICML), 2009.

[25] Byron Boots, Sajid Siddiqi, and Geoffrey Gordon. An online spectral learning algorithm for partially observable nonlinear dynamical systems. In Proceedings of the 25th National Conference on Artificial Intelligence (AAAI), 2011.

[26] P. Van Overschee and B. De Moor. Subspace Identification for Linear Systems: Theory, Implementation, Applications. Kluwer, 1996.

[27] Vijay Mahadevan, Chi Wah Wong, Jose Costa Pereira, Tom Liu, Nuno Vasconcelos, and Lawrence Saul. Maximum covariance unfolding: Manifold learning for bimodal data. In Advances in Neural Information Processing Systems 24, 2011.

[28] Jihun Ham, Daniel Lee, and Lawrence Saul. Semisupervised alignment of manifolds. In Robert G. Cowell and Zoubin Ghahramani, editors, 10th International Workshop on Artificial Intelligence and Statistics, pages 120–127, 2005.

[29] Chang Wang and Sridhar Mahadevan. A general framework for manifold alignment. In Proc. AAAI, 2009.

[30] T. W. Anderson. Estimating linear restrictions on regression coefficients for multivariate normal distributions. The Annals of Mathematical Statistics, 22(3):327–351, 1951.

[31] Alan Julian Izenman. Reduced-rank regression for the multivariate linear model. Journal of Multivariate Analysis, 5(2):248–264, 1975.

[32] Ker-Chau Li. Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 86(414):316–327, 1991.

[33] R. Dennis Cook and Xiangrong Yin. Dimension reduction and visualization in discriminant analysis (with discussion). Australian and New Zealand Journal of Statistics, 43(2):147–199, 2001.

[34] Kenji Fukumizu, Francis R. Bach, Michael I. Jordan, and Chris Williams. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. Journal of Machine Learning Research, 5, 2004.

[35] Jens Nilsson, Fei Sha, and Michael I. Jordan. Regression on manifolds using kernel dimension reduction. In ICML, pages 697–704, 2007.

[36] Florian Steinke, Matthias Hein, and Bernhard Schölkopf. Nonparametric regression between general Riemannian manifolds. SIAM J. Imaging Sciences, 3(3):527–563, 2010.

[37] Ambrish Tyagi and James W. Davis. A recursive filter for linear systems on Riemannian manifolds. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 1–8, 2008.


[38] Michal Lewandowski, Jesús Martínez del Rincón, Dimitrios Makris, and Jean-Christophe Nebel. Temporal extension of Laplacian eigenmaps for unsupervised dimensionality reduction of time series. In ICPR, pages 161–164, 2010.

[39] Ruei-Sung Lin, Che-Bin Liu, Ming-Hsuan Yang, Narendra Ahuja, and Stephen Levinson. Learning nonlinear manifolds from time series. In Proc. ECCV, pages 239–250, 2006.

[40] Rui Li, Stan Sclaroff, Margrit Betke, and David J. Fleet. Simultaneous learning of nonlinear manifold and dynamical models for high-dimensional time series. In Proc. ICCV, 2007.