
Data Min Knowl Disc (2011) 22:372–392
DOI 10.1007/s10618-010-0183-9

Matrix-variate and higher-order probabilistic projections

Shipeng Yu · Jinbo Bi · Jieping Ye

Received: 2 May 2009 / Accepted: 18 June 2010 / Published online: 10 July 2010
© The Author(s) 2010

Abstract Feature extraction from two-dimensional or higher-order data, such as face images and surveillance videos, has recently been an active research area. There have been several 2D or higher-order PCA-style dimensionality reduction algorithms, but they mostly lack probabilistic interpretations and are difficult to apply to, e.g., incomplete data. It is also hard to extend these algorithms to applications where a certain region of the data point needs special focus in the dimensionality reduction process (e.g., the facial region in a face image). In this paper we propose a probabilistic dimensionality reduction framework for 2D and higher-order data. It specifies a particular generative process for this type of data, and leads to a better understanding of some 2D and higher-order PCA-style algorithms. In particular, we show that it takes several existing algorithms as its (non-probabilistic) special cases. We develop efficient iterative learning algorithms within this framework and study the theoretical properties of the stationary points. The model can be easily extended to handle special regions in the high-order data. Empirical studies on several benchmark data sets and real-world cardiac ultrasound images demonstrate the strength of this framework.

Responsible editor: Tao Li.

S. Yu (B)
Siemens Medical Solutions USA, Malvern, PA, USA
e-mail: [email protected]

J. Bi
Department of Computer Science and Engineering,
University of Connecticut, Storrs, CT 06269, USA
e-mail: [email protected]

J. Ye
Arizona State University, Tempe, AZ, USA
e-mail: [email protected]


Keywords Dimensionality reduction · Higher-order principal component analysis · Low-rank matrix factorization · Probabilistic projection

1 Introduction

Recent technological innovations have unleashed a torrent of data with large numbers of dimensions (features or measurements). One of the key issues in such data analysis is the curse of dimensionality, i.e., an enormous number of samples is required to perform accurate prediction on problems with large numbers of features. Dimensionality reduction techniques for these vector-valued data, such as principal component analysis (PCA), have been widely applied to overcome this problem.

In recent years many applications have emerged that generate matrix data, where each datum is a two-dimensional (2D) matrix, or higher-order data, where each datum is a tensor with higher (≥3) dimensions. The latter is also called a multiway array in some applications. Hereafter we will use matrix-variate data to denote both 2D and higher-order data. Formally, let an order-O datum be represented as X ∈ R^{I_1×I_2×···×I_O}, where I_j, j = 1, …, O, is the dimensionality of the j-th order of the datum. In this definition, an order-2 datum can be conveniently represented as a matrix X ∈ R^{I_1×I_2}. Examples of matrix-variate data include image data (order 2) in face recognition, where each datum is a face image, and videos (order 3) from surveillance cameras, where each datum is a three-dimensional data cube (width × height × time).

Dimensionality reduction for this type of data is an active research area and has strong connections to low-rank tensor approximations. One can certainly convert every datum X_i into a (column) vector by concatenating columns (or rows) and then apply traditional PCA, but such tensor-to-vector conversion may lose the spatial locality information inherent in the data, and it also leads to a very high dimensional representation of the data which is not feasible for PCA. Several 2D and higher-order algorithms have been proposed, such as PCA-style unsupervised methods (e.g., Yang et al. 2004; Ye 2005; Lu et al. 2008) and multiway data analysis (see, e.g., Kolda and Bader 2007 for a survey). However, these algorithms so far lack probabilistic interpretations, and thus it is difficult to apply them incrementally (when new data are available), locally (when some sub-regions are of special interest), robustly (when there are some outlier points which might bias the projected dimensions), and when there are missing entries in some data points.

A probabilistic dimensionality reduction framework for matrix-variate data has many applications. One application in image analysis is to distinguish the region of interest (ROI) from the background when learning the dimensionality reduction mapping. This has great potential in, e.g., face recognition and medical imaging, since ideally the reconstruction should focus more on the ROI (i.e., the face or the organ) instead of the background pixels. Another application is to learn a mixture of mappings which reveal the projection from different perspectives and better reconstruct the datum. We will discuss more about these applications in the experiment section.

In this paper we introduce a family of probabilistic models, the probabilistic higher-order PCA (PHOPCA in short), for matrix-variate data. Interestingly, we will show that they recover the optimal solutions of several matrix-variate PCA-style algorithms under mild conditions (Sect. 3). These models for the first time explicitly specify the generative process of matrix-variate objects. The well-known probabilistic PCA model (Tipping and Bishop 1999b; Roweis and Ghahramani 1999) is shown to be the one-dimensional special case of PHOPCA. Efficient expectation-maximization (EM) algorithms are derived for learning the projection mapping, with lower time complexity than the non-probabilistic counterparts (Sect. 4). Several extensions of the PHOPCA family are also discussed which show the additional benefits of the proposed probabilistic framework (Sect. 5). Empirical studies are shown using face images, USPS handwritten digits and a real application in cardiac view recognition of echocardiograms (Sect. 6).

2 Related work

PCA is a well-known dimensionality reduction method for one-dimensional data and has been extensively applied in machine learning and data mining (see, e.g., Jolliffe 2002). Let {x_1, …, x_N} denote a set of N (column) vectors of input data. PCA computes the eigen-decomposition of the sample covariance matrix of the data, (1/N) Σ_i (x_i − x̄)(x_i − x̄)ᵀ (with x̄ being the mean vector), and outputs an orthogonal transformation which contains the eigenvector(s) corresponding to the largest eigenvalue(s). Here (·)ᵀ denotes matrix transpose. It is known that PCA captures the largest variance direction(s) of the data, and achieves the minimum reconstruction error (in vector 2-norm) among all projections with the same reduced dimensionality. PCA can be seen as a matrix factorization method (as it factorizes the sample covariance matrix), but it is significantly different from other special matrix factorization methods (such as non-negative matrix factorization; Lee and Seung 1999).

In the following we review various non-probabilistic PCA-style algorithms for matrix-variate data (including both 2D and higher-order data), and probabilistic PCA, which targets one-dimensional data.

2.1 2D/higher-order PCA-style algorithms

Order-2 data like images can be conveniently represented as matrices and are therefore easier to analyze compared to higher-order data. Yang et al. (2004) and Ding and Ye (2005) proposed 2DPCA (or 2DSVD), in which a one-sided linear transformation, i.e., X_i R, is applied to each image X_i. Let each image X_i be of size m × n. By maximizing the data variance in the transformed space, 2DPCA leads to an analytical solution for R which contains the leading eigenvector(s) of the right one-sided sample covariance matrix, (1/N) Σ_{i=1}^N (X_i − X̄)ᵀ(X_i − X̄), where X̄ is the mean image (1/N) Σ_{i=1}^N X_i.
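As a concrete illustration, here is a minimal NumPy sketch of the 2DPCA computation just described, assuming `images` is a collection of m × n arrays; the function and variable names are ours, not from the original papers.

```python
import numpy as np

def twod_pca(images, c):
    """2DPCA sketch: leading c eigenvectors of the right one-sided covariance."""
    X = np.stack(images).astype(float)            # N x m x n
    Xc = X - X.mean(axis=0)                       # subtract the mean image X_bar
    # right one-sided sample covariance: (1/N) sum_i (X_i - X_bar)^T (X_i - X_bar)
    G = np.einsum('imk,iml->kl', Xc, Xc) / X.shape[0]
    w, V = np.linalg.eigh(G)                      # eigenvalues in ascending order
    R = V[:, ::-1][:, :c]                         # keep the c leading eigenvectors
    return R, X @ R                               # mapping R and projected images X_i R
```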

Later Ye (2005) introduced GLRAM, which considers a two-sided transformation, LᵀX_i R, to project each image X_i from the space R^{m×n} to R^{r×c}, where the left-sided mapping matrix L (m × r) and the right-sided mapping matrix R (n × c) are both orthonormal (r < m, c < n). GLRAM minimizes the average matrix reconstruction error over all images, (1/N) Σ_{i=1}^N ‖X_i − L Z_i Rᵀ‖²_F, with respect to both the reduced images Z_i and the mapping matrices L and R. Here ‖ · ‖_F denotes the matrix Frobenius norm. It is shown in Ye (2005) that GLRAM solves this optimization problem by choosing an initial L and R and iteratively computing the leading eigenvectors of the left and right one-sided sample covariance matrices as follows until convergence:

  R ← top c eigenvectors of Σ_i X_iᵀ L Lᵀ X_i   with L fixed;

  L ← top r eigenvectors of Σ_i X_i R Rᵀ X_iᵀ   with R fixed.

It’s shown that GLRAM obtains smaller reconstruction error than 2DPCA if maintain-ing the same compression ratio. There exist other variants of 2D PCA-style algorithmssuch as Inoue and Urahama (2006) and Shashua and Levin (2001).

GLRAM is further extended to higher-order multilinear PCA in Lu et al. (2008), where one considers the tensor product X_i ×_1 U_1ᵀ × ··· ×_O U_Oᵀ to map each datum from the space R^{I_1×···×I_O} to R^{P_1×···×P_O}. Here X_i ×_j U_jᵀ is the j-th mode product of the tensor X_i by the (transpose of the) orthonormal mapping matrix U_j ∈ R^{I_j×P_j} (P_j < I_j). Minimizing the reconstruction error leads to a similar eigen-decomposition algorithm which iteratively finds U_j with the other mapping matrices fixed (j = 1, …, O). Other related work includes Kolda (2001), Lathauwer et al. (2000), Lu et al. (2006), Sun et al. (2006), Vasilescu and Terzopoulos (2002), and Wang et al. (2005).

These 2D and higher-order methods are able to capture the spatial locality and are in general more efficient and memory-cheaper than standard PCA (after tensor-to-vector unfolding). They also yield lower reconstruction error when the same effective projection dimension is considered. Several other variants of PCA-style algorithms include Ding and Ye (2005) and Inoue and Urahama (2006). But up to now, no probabilistic explanation exists for this type of algorithm, and therefore there is no principled solution to handle missing entries in the higher-order data.

These methods are also strongly related to multiway data analysis (Kolda and Bader 2007), where the matrix singular value decomposition (SVD) is extended to higher-order tensors using, e.g., Tucker and PARAFAC models. The main difference is that the PCA-style algorithms consider i.i.d. tensor samples, whereas multiway data analysis factorizes one big tensor. Our proposed probabilistic models only interpret the former, while a recently published paper (Chu and Ghahramani 2009) addresses the latter.

2.2 Probabilistic PCA

While PCA originates from the analysis of data variances, probabilistic PCA (PPCA) emerges from the statistics community and brings probabilistic explanations to PCA (Tipping and Bishop 1999b; Roweis and Ghahramani 1999). For input data x ∈ Rᵈ, PPCA defines a latent variable model (of dimensionality k < d) as follows:

  x = Wz + μ + ε,   (1)

with z ∈ Rᵏ the latent variables, W (d × k) the factor loadings, μ ∈ Rᵈ the mean vector, and ε a noise process which follows a normal distribution with zero mean and isotropic variance, i.e., ε ∼ N(0, σ²I_d). We also follow the convention and assume z ∼ N(0, I_k) for the latent variables. Here I_k denotes the identity matrix of size k × k.

Conditioned on the latent variable z, x follows a normal distribution, i.e., x|z ∼ N(Wz + μ, σ²I). With z integrated out, x is also marginally normally distributed, x ∼ N(μ, WWᵀ + σ²I). By Bayes' rule, the a posteriori distribution of z given x is also normal:

  z|x ∼ N( (WᵀW + σ²I)⁻¹Wᵀ(x − μ),  σ²(WᵀW + σ²I)⁻¹ ).

This gives the projection model for any input data x under PPCA, for any fixed σ². When σ² → 0, this normal distribution collapses to a single mass at the mean (WᵀW)⁻¹Wᵀ(x − μ), which can be shown to be equivalent to PCA up to a scaling and rotation factor (Tipping and Bishop 1999b). This indicates that the principal subspace obtained from PPCA is the same as that in PCA.
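For concreteness, a small NumPy sketch of this PPCA posterior (the function name and interface are our own):

```python
import numpy as np

def ppca_posterior(x, W, mu, sigma2):
    """Return the mean and covariance of z | x under PPCA, as given above."""
    k = W.shape[1]
    M = W.T @ W + sigma2 * np.eye(k)              # W^T W + sigma^2 I
    mean = np.linalg.solve(M, W.T @ (x - mu))
    cov = sigma2 * np.linalg.inv(M)
    return mean, cov                              # as sigma2 -> 0, cov -> 0 (PCA limit)
```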

The generative model of PPCA is similar to factor analysis (Bartholomew and Knott 1999). The only difference is the noise process. In factor analysis the noise levels for different dimensions can be different, leading to a noise process ε ∼ N(0, Ψ) with Ψ = diag(σ₁², …, σ_d²). Both models assume that in the noise process every two dimensions are independent. The different noise models lead to very different behavior of the projection model: factor analysis is covariant under component-wise rescaling of the data variables, whilst PPCA (and PCA) is covariant under rotation of the coordinate system. For a detailed comparison of these two models please refer to Tipping and Bishop (1999b).

With a set of observations {x_i}_{i=1}^N, the maximum likelihood (ML) estimate of W can be obtained by eigen-decomposing the sample covariance matrix. There also exists an expectation-maximization (EM) algorithm for W, which is more efficient, cheaper in memory, and allows us to handle missing data and PPCA mixtures in a principled way (see Tipping and Bishop 1999b for more details).

3 Probabilistic higher-order PCA

As will be seen in Sect. 3.3, it is straightforward to extend the probabilistic model from second-order data to higher-order (O ≥ 3) data. So for simplicity we mainly focus on second-order data (which we call "images" hereafter).

Given a set of N 2D images, our goal is to find a low dimensional feature representation for each image X_i, which is an m × n matrix. The canonical solution is to vectorize each image by concatenating all the columns (or all the rows), and then apply standard PCA or PPCA. But this is arguably not the best solution because: 1) concatenating columns (rows) removes the local correlation structure between columns (rows) in the image, and 2) vectorization leads to a high dimensional representation of images, causing problems for PCA and PPCA in terms of both time and space complexity. We propose a matrix-variate probabilistic model in this section, and build its connections to PPCA and other PCA-style algorithms.

Let us start with some matrix notation. For any matrices A (m × n) = (a_ij) and B (p × q) = (b_kl), we define vec(A) = (a_11, a_21, …, a_m1, a_12, …, a_m2, …, a_mn)ᵀ ∈ R^{mn}, the vectorization of A, and A ⊗ B = (a_ij B) ∈ R^{mp×nq}, the Kronecker product of A and B. Recall that vec(ABC) = (Cᵀ ⊗ A) vec(B) holds for any matrices A, B and C with proper dimensions. Finally we write Σ ≻ 0 if the square matrix Σ is positive definite.
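The identity vec(ABC) = (Cᵀ ⊗ A) vec(B), used repeatedly below, can be checked numerically; this is a small illustrative script of ours, with vec implemented as column-stacking per the definition above.

```python
import numpy as np

def vec(A):
    return A.reshape(-1, order='F')               # stack the columns of A

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 4))
B = rng.normal(size=(4, 5))
C = rng.normal(size=(5, 2))
assert np.allclose(vec(A @ B @ C), np.kron(C.T, A) @ vec(B))
```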

3.1 Preliminaries

Matrix-variate distributions, such as the matrix-variate normal and the Wishart, are widely used in statistics (Gupta and Nagar 1999). They are in general 2D extensions of multi-variate distributions and show interesting characteristics. Among them the matrix-variate normal is the most basic one.

Definition 1 (Gupta and Nagar 1999) A random matrix X (m × n) is said to follow a matrix-variate normal distribution with mean matrix M (m × n) and covariance matrices Σ (m × m) ≻ 0 and Φ (n × n) ≻ 0, if vec(Xᵀ) ∼ N(vec(Mᵀ), Σ ⊗ Φ). This is denoted as X ∼ N(M, Σ, Φ). (For simplicity we overload the symbol N to denote both the multi-variate normal distribution, with 2 parameters, and the matrix-variate normal distribution, with 3 parameters.)

The matrix-variate normal is defined through a normal distribution on the vectorized form of the matrix, with a special Kronecker covariance structure. It is not hard to see that the p.d.f. of X ∼ N(M, Σ, Φ) is

  (2π)^{−mn/2} |Σ|^{−n/2} |Φ|^{−m/2} etr[ −(1/2) Σ⁻¹(X − M) Φ⁻¹(X − M)ᵀ ],

with etr(·) = exp(tr(·)) and tr(·) the matrix trace. Simple properties of the matrix-variate normal include: 1) Xᵀ, the transpose of X, follows N(Mᵀ, Φ, Σ); 2) the rows and/or columns of X are independent if Σ and/or Φ are diagonal; 3) with m = 1 or n = 1, the matrix-variate normal reduces to a multi-variate normal.

Following this definition we can also define a similar normal distribution for higher-order (O > 2) tensors, with a special Kronecker covariance Σ_1 ⊗ ··· ⊗ Σ_O for the "vectorized" or "unfolded" form of the tensor X. Note that the vector unfolding should happen from order O back to order 1.
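One way to make Definition 1 concrete is the following sampling sketch (ours, not from the paper): if Σ = AAᵀ and Φ = BBᵀ, then X = M + AEBᵀ with E an m × n matrix of i.i.d. standard normals satisfies vec(Xᵀ) ∼ N(vec(Mᵀ), Σ ⊗ Φ).

```python
import numpy as np

def sample_matrix_normal(M, Sigma, Phi, rng=None):
    """Draw one sample X ~ N(M, Sigma, Phi) via X = M + A E B^T."""
    rng = rng or np.random.default_rng()
    A = np.linalg.cholesky(Sigma)                 # Sigma = A A^T (m x m)
    B = np.linalg.cholesky(Phi)                   # Phi   = B B^T (n x n)
    E = rng.standard_normal(M.shape)              # i.i.d. N(0, 1) entries
    return M + A @ E @ B.T
```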

3.2 Probabilistic second-order PCA

Let each image X be an m × n matrix. To directly model the spatial locality of the data, we propose probabilistic second-order PCA, or second-order PHOPCA, based on the matrix-variate normal assumption, and assume the following two-sided latent variable model:



  X = LZRᵀ + M + ϒ,   (2)

where L (m × r) and R (n × c) are the row and column loading matrices, and Z (r × c) is the latent variable core of X, with r ≤ m, c ≤ n the row and column PCA dimensions, respectively. M (m × n) is the mean matrix, and ϒ is a matrix-variate noise process. In this probabilistic framework, we assume a matrix-variate normal for both Z and ϒ: Z ∼ N(0, I_r, I_c), and ϒ ∼ N(0, σI_m, σI_n) with noise level σ > 0. This noise model will be further discussed in Sect. 5.

From (2) we see that PHOPCA directly models the 2D image X via a two-sided mapping, and Z is the low-dimensional representation. The local structure is kept in L and R. In other words, the image X is generated by sampling a (smaller) latent variable core Z, applying the row and column loadings, and adding a mean and some noise to every entry. With fixed dimensions (r, c), the parameters of second-order PHOPCA are {L, R, M, σ}. When m = 1 or n = 1, L or R is a (positive) scalar, and second-order PHOPCA reduces to standard PPCA. Therefore, second-order PHOPCA is a 2D extension of PPCA. If r = m or c = n, projection happens only on one side of the image, and we call it one-mode second-order PHOPCA. In the general case when r < m, c < n, we call it two-mode second-order PHOPCA. In Sect. 4 we will show that one-mode second-order PHOPCA recovers 2DPCA (Yang et al. 2004), and two-mode second-order PHOPCA recovers the GLRAM solution (Ye 2005).

The definition (2) indicates that the image X follows a matrix-variate normal distribution conditioned on the core Z. But unlike in PPCA, if we integrate Z out, X in general does not follow a matrix-variate normal. The a posteriori distribution of Z given X is also in general not matrix-variate normal, but for one-mode second-order PHOPCA it is. For instance, if L = I_m, the a posteriori distribution of the core Z given the image X is N(B, I_m, S), with B = (X − M)R(RᵀR + σ²I)⁻¹ and S = σ²(RᵀR + σ²I)⁻¹.

To see the connection between second-order PHOPCA and PPCA, it is easy to verify that in the vectorized form, Eq. 2 is a special PPCA model vec(Xᵀ) = Wz + μ + ε, with z = vec(Zᵀ), μ = vec(Mᵀ), W = L ⊗ R, and ε = vec(ϒᵀ) (this can be proved by applying the formula vec(ABC) = (Cᵀ ⊗ A) vec(B) to both sides of Eq. 2). This means second-order PHOPCA enforces a Kronecker structure on the factor loadings W, and this is the key reason why PHOPCA is able to take into account the row-wise and column-wise correlations in the images. Second-order PHOPCA also has far fewer free parameters than a standard PPCA model on vec(Xᵀ). Learning under the PHOPCA model is discussed in Sect. 4.

3.3 Probabilistic higher-order PCA

We can easily extend second-order PHOPCA to higher order by realizing that the matrix product LZRᵀ in (2) is simply the first- and second-mode product of Z with L and R in tensor form, i.e., Z ×_1 L ×_2 R. For an order-O tensor X ∈ R^{I_1×···×I_O}, we propose PHOPCA which assumes an order-O latent variable model as:



  X = Z ×_1 U_1 ×_2 U_2 × ··· ×_O U_O + M + ϒ,

with U_j ∈ R^{I_j×P_j} the j-th-mode factor loadings (P_j ≤ I_j), and Z ∈ R^{P_1×···×P_O} the latent variable core of X. M is the mean tensor, and ϒ is an order-O noise process. A tensor extension of the matrix-variate normal distribution can be assigned to Z and ϒ, similar to that in second-order PHOPCA. The connection to PPCA also holds for general PHOPCA models with a proper tensor-to-vector unfolding and the special Kronecker factor loadings W = U_1 ⊗ ··· ⊗ U_O. When projection happens only on one mode of the tensor (i.e., P_j < I_j, and P_k = I_k for k ≠ j), we call it one-mode PHOPCA.

4 Learning in PHOPCA models

In this section we discuss how to optimize the parameters of the proposed PHOPCA models. For simplicity we again start with second-order PHOPCA, and then extend to general PHOPCA models. Given N images {X_i}_{i=1}^N which we assume are samples from the second-order PHOPCA model (2), we focus on learning the loading matrices L and R (with fixed PCA dimensions), mainly using EM-type algorithms. We will show that there is in general no analytical ML solution for PHOPCA projections, but for one-mode PHOPCA there is a globally optimal solution (up to a scaling and rotation factor). These learning algorithms provide insights and probabilistic explanations for existing PCA-style algorithms, and can be easily extended to handle mixture models and other noise models.

By definition (2) we see that M is the mean of the matrix-variate normal, and an easy derivation shows that its ML estimate is M = (1/N) Σ_i X_i. Therefore for simplicity we drop M in the following derivations.

4.1 Learning in one-mode second-order PHOPCA

Without loss of generality, we consider the right one-mode second-order PHOPCA model,

  X_i = Z_i Rᵀ + ϒ,

in which Z_i (m × c) has the same number of rows as X_i (m × n). As shown in Sect. 3.2, the a posteriori distribution of Z_i given X_i and all the model parameters is a matrix-variate normal N(B_i, I_m, S), with

  B_i = X_i R(RᵀR + σ²I)⁻¹,   S = σ²(RᵀR + σ²I)⁻¹.   (3)

Treating Z_i as the latent variable, we can derive a standard EM algorithm to learn R and σ iteratively. In the E-step we calculate the a posteriori distribution of Z_i, which gives the sufficient statistics (3), and in the M-step we maximize the expected log-likelihood of the images with respect to R and σ, which is:

  −(Nmn/2) log σ² − (1/(2σ²)) Σ_i E( ‖X_i − Z_i Rᵀ‖²_F ) − (1/2) Σ_i E( ‖Z_i‖²_F ).

Here all the expectations E(·) are with respect to the matrix-variate normal distribution Z_i ∼ N(B_i, I_m, S). A page of mathematics leads to the following update equations:

  R = [ (1/N) Σ_i X_iᵀ B_i ] [ (1/N) Σ_i B_iᵀ B_i + m S ]⁻¹,   (4)

  σ² = (1/(mn)) ( (1/N) Σ_i ‖X_i − B_i Rᵀ‖²_F + m tr(RᵀR S) ).   (5)

Finally we iterate (3), (4) and (5) until convergence. For a test image X∗, (the distribution of) its PHOPCA projection is calculated as in (3). This yields both the mean of the projection and its covariance.
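A compact NumPy sketch of this EM loop, Eqs. (3)–(5), assuming the images have been centered by the sample mean (naming and initialization are ours):

```python
import numpy as np

def one_mode_phopca(images, c, iters=50, sigma2=1.0, seed=0):
    """EM for right one-mode second-order PHOPCA (Eqs. (3)-(5))."""
    rng = np.random.default_rng(seed)
    X = np.stack(images).astype(float)
    X = X - X.mean(axis=0)                        # ML mean is the sample mean
    N, m, n = X.shape
    R = rng.standard_normal((n, c))
    for _ in range(iters):
        # E-step, Eq. (3)
        Minv = np.linalg.inv(R.T @ R + sigma2 * np.eye(c))
        B = X @ R @ Minv                          # posterior means B_i (N x m x c)
        S = sigma2 * Minv                         # shared posterior covariance
        # M-step, Eqs. (4) and (5)
        XB = np.einsum('iab,iac->bc', X, B) / N   # (1/N) sum_i X_i^T B_i
        BB = np.einsum('iab,iac->bc', B, B) / N   # (1/N) sum_i B_i^T B_i
        R = XB @ np.linalg.inv(BB + m * S)
        resid = np.mean(np.sum((X - B @ R.T) ** 2, axis=(1, 2)))
        sigma2 = (resid + m * np.trace(R.T @ R @ S)) / (m * n)
    return R, sigma2
```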

An important fact about this EM algorithm is that it leads to the globally optimal projection subspace for one-mode second-order PHOPCA, as summarized in the following theorem.

Theorem 1 Let G = (1/N) Σ_i X_iᵀ X_i, and let λ_1 ≥ ··· ≥ λ_n be its eigenvalues with eigenvectors u_1, …, u_n. The EM algorithm for one-mode second-order PHOPCA leads to the following ML solutions for R and σ²:

  R = U_c ( (1/m) Λ_c − σ²I )^{1/2} V,   σ² = (1/(m(n − c))) Σ_{j=c+1}^n λ_j,

where Λ_c = diag(λ_1, …, λ_c), U_c = [u_1, …, u_c], and V is an arbitrary c × c orthogonal matrix.

Proof We give a sketch here. We plug (3) into (4) to find the stationary point of the EM updates. At convergence we have R = σ²GRS(SRᵀGRS + σ⁴mS)⁻¹. Let R = UDV be its SVD; then S = σ²Vᵀ(D² + σ²I)⁻¹V. After some mathematics we obtain UᵀGU = m(D² + σ²I), which means U contains eigenvectors of G. Let G = UΛUᵀ be its eigen-decomposition; then D = ((1/m)Λ − σ²I)^{1/2}. Finally a study similar to that in Tipping and Bishop (1999b) shows that D corresponds to the largest c eigenvalues. This gives R. Plugging this into (5) leads to the solution for σ². ∎

This theorem generalizes the optimal solution of PPCA (Tipping and Bishop 1999b), for which m = 1. It shows that the ML estimate of the noise level σ² is the average of the remaining n − c eigenvalues divided by the number of rows m. A similar result exists for L if we run the left one-mode second-order PHOPCA with R = I_n. If the arbitrary rotation matrix V needs to be identified, one can apply an SVD to RᵀR to recover it.
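Theorem 1 also gives a direct, non-iterative way to compute the one-mode solution; the following sketch takes V = I and is ours, under the same centering assumption as before:

```python
import numpy as np

def one_mode_phopca_ml(images, c):
    """Closed-form ML solution of Theorem 1 (rotation V set to the identity)."""
    X = np.stack(images).astype(float)
    X = X - X.mean(axis=0)
    N, m, n = X.shape
    G = np.einsum('iab,iac->bc', X, X) / N        # G = (1/N) sum_i X_i^T X_i
    lam, U = np.linalg.eigh(G)
    lam, U = lam[::-1], U[:, ::-1]                # descending eigenvalues
    sigma2 = lam[c:].sum() / (m * (n - c))
    R = U[:, :c] * np.sqrt(np.maximum(lam[:c] / m - sigma2, 0.0))
    return R, sigma2
```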

For a test image X∗, its PHOPCA projection converges to the point mass B∗ = X∗ U_c √m Λ_c^{−1/2} V when σ → 0. Note that the matrix G in Theorem 1 is precisely the right one-sided sample covariance used in Yang et al. (2004). Therefore, we have the following corollary:


Corollary 2 The right one-mode second-order PHOPCA recovers the 2DPCA algorithm proposed in Yang et al. (2004) when σ → 0, up to a scaling and rotation factor.

This indicates that one-mode second-order PHOPCA provides a probabilistic explanation of the 2DPCA algorithm (Yang et al. 2004). The EM algorithm provides another way of calculating those projection directions.

4.2 Learning in general second-order PHOPCA

In the general two-mode second-order PHOPCA model we encounter some difficulty, since the a posteriori distribution of Z_i given X_i, P(Z_i|X_i, L, R) ∝ P(X_i|Z_i, L, R)P(Z_i), is in general not matrix-variate normal. In this subsection we solve this optimization problem using variational EM (Jordan et al. 1999), in which we maximize a lower bound of the data log-likelihood with respect to some variational parameters in the E-step, and with respect to L and R in the M-step. Please refer to Jordan et al. (1999) for more details on this type of algorithm.

In variational EM we need to choose a variational distribution Q(Z_i) to approximate the true posterior, which here is P(Z_i|X_i, L, R). In the following we use a matrix-variate normal, Q = N(B_i, T, S), with mean B_i (r × c) and covariances T (r × r) ≻ 0, S (c × c) ≻ 0 being variational parameters. The lower bound to be maximized is then Σ_i ∫ Q log(P/Q) dZ_i, i.e., the negative KL-divergence between Q and P summed over the images i. We omit the derivations and summarize the results in the following.

In the variational E-step, the lower bound is maximized with respect to the variational parameters B_i, T and S. It turns out that:

  T = c σ² [ tr(RᵀR S) LᵀL + σ² tr(S) I_r ]⁻¹,   (6)

  S = r σ² [ tr(LᵀL T) RᵀR + σ² tr(T) I_c ]⁻¹,   (7)

and each B_i needs to satisfy LᵀL B_i RᵀR + σ²B_i = LᵀX_i R. To solve this we vectorize both sides and solve a (big) linear system

  [ (RᵀR) ⊗ (LᵀL) + σI_c ⊗ σI_r ] vec(B_i) = vec(LᵀX_i R)   (8)

with respect to vec(B_i), and then reshape the solution back to get B_i. When σ is small (e.g., σ < 0.001), however, the σ term in the equation corresponds to a small addition (σ²) to the diagonal entries of the matrix (RᵀR) ⊗ (LᵀL). In this case we may safely ignore the σ term and use

  B_i = (LᵀL)⁻¹LᵀX_i R(RᵀR)⁻¹ = L⁺X_i R⁺ᵀ   (9)

to remove the computational burden. The entry-wise difference is at most O(σ²). Here L⁺ (R⁺) denotes the pseudo-inverse of L (R).


In the variational M-step, we maximize the lower bound with respect to the factor loadings L and R to get:

  L = [ (1/N) Σ_i X_i R B_iᵀ ] [ (1/N) Σ_i B_i RᵀR B_iᵀ + tr(RᵀR S) T ]⁻¹,   (10)

  R = [ (1/N) Σ_i X_iᵀ L B_i ] [ (1/N) Σ_i B_iᵀ LᵀL B_i + tr(LᵀL T) S ]⁻¹.   (11)

Finally we iterate (6)–(11) until convergence. Note that the updates for T and S are coupled (as are those for L and R). σ can also be optimized if desired. It is easy to check that these equations reduce to the one-mode second-order PHOPCA updates when we fix L = I_m (or R = I_n).
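Putting Eqs. (6)–(11) together, a sketch of the variational EM loop might look as follows. This is our own code, using the paper's column-stacking vec convention; initialization is arbitrary, σ is held fixed, and the small-σ shortcut (9) is not used.

```python
import numpy as np

def general_phopca(images, r, c, sigma2=1e-3, iters=30, seed=0):
    """Variational EM sketch for two-mode second-order PHOPCA, Eqs. (6)-(11)."""
    rng = np.random.default_rng(seed)
    X = np.stack(images).astype(float)
    X = X - X.mean(axis=0)
    N, m, n = X.shape
    L = rng.standard_normal((m, r))
    R = rng.standard_normal((n, c))
    T, S = np.eye(r), np.eye(c)
    for _ in range(iters):
        # E-step: variational covariances, Eqs. (6) and (7)
        T = c * sigma2 * np.linalg.inv(np.trace(R.T @ R @ S) * (L.T @ L)
                                       + sigma2 * np.trace(S) * np.eye(r))
        S = r * sigma2 * np.linalg.inv(np.trace(L.T @ L @ T) * (R.T @ R)
                                       + sigma2 * np.trace(T) * np.eye(c))
        # E-step: variational means via the linear system (8)
        K = np.kron(R.T @ R, L.T @ L) + sigma2 * np.eye(r * c)
        B = np.empty((N, r, c))
        for i in range(N):
            rhs = (L.T @ X[i] @ R).reshape(-1, order='F')     # vec(L^T X_i R)
            B[i] = np.linalg.solve(K, rhs).reshape((r, c), order='F')
        # M-step: loading updates, Eqs. (10) and (11)
        XR = X @ R
        XRB = np.einsum('iac,irc->ar', XR, B) / N             # (1/N) sum_i X_i R B_i^T
        BRRB = np.einsum('iac,ibc->ab', B @ (R.T @ R), B) / N # (1/N) sum_i B_i R^T R B_i^T
        L = XRB @ np.linalg.inv(BRRB + np.trace(R.T @ R @ S) * T)
        LX = np.einsum('ar,iab->irb', L, X)
        XLB = np.einsum('irb,irc->bc', LX, B) / N             # (1/N) sum_i X_i^T L B_i
        BLLB = np.einsum('irc,rs,isd->cd', B, L.T @ L, B) / N # (1/N) sum_i B_i^T L^T L B_i
        R = XLB @ np.linalg.inv(BLLB + np.trace(L.T @ L @ T) * S)
    return L, R
```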

Unlike the one-mode model, the general second-order PHOPCA does not have an analytical global solution. When σ → 0, the variational parameters T and S tend to zero matrices, and the variational posterior of Z_i tends to collapse to a single mass (9) (this is also how the PHOPCA projection is calculated for a test image X∗). In this case the iterative updates (9), (10) and (11) lead to the following important result:

Theorem 3 Let G(L) = (1/N) Σ_i X_iᵀ L L⁺ X_i and H(R) = (1/N) Σ_i X_i R R⁺ X_iᵀ be two matrix-valued functions with input matrices L (m × r) and R (n × c). Let U_c(L) and V_r(R) contain the eigenvectors of G(L) and H(R) with the leading c and r eigenvalues, respectively. Then the stationary point of the zero-noise, general second-order PHOPCA algorithm (9)–(11) satisfies:

  RR⁺ = U_c(L)U_c(L)ᵀ,   LL⁺ = V_r(R)V_r(R)ᵀ.   (12)

Proof We give a sketch here. When σ = 0, S and T are zero matrices. With L fixed, plugging (9) into (11) yields R = GR[RᵀGR]⁻¹RᵀR at the stationary point. Let R = EDFᵀ be its SVD; then EEᵀGE = GE. To get E we eigen-decompose EᵀGE = PΛPᵀ and obtain EPΛ = GEP. This indicates that EP contains eigenvectors of G, and the only stable stationary solution is EP = U_c. Then we have RR⁺ = EEᵀ = U_cU_cᵀ. Similarly we can obtain LL⁺ with R fixed. ∎

Theorem 3 builds an important connection between general second-order PHOPCA models and GLRAM (Ye 2005). If we write

  G(L) = (1/N) Σ_i X_iᵀ V_r(R)V_r(R)ᵀ X_i,   H(R) = (1/N) Σ_i X_i U_c(L)U_c(L)ᵀ X_iᵀ

by plugging (12) into the stationary point of second-order PHOPCA, this is exactly the stationary point of repeatedly calculating the SVD of G(L) and H(R) in GLRAM (cf. Sect. 2.1; see also Ye 2005). Therefore, we have actually proved the following:

Corollary 4 Zero-noise second-order PHOPCA models and GLRAM (Ye 2005) have the same stationary point.


Based on the connection between PHOPCA and PPCA, we also see that GLRAM (in its vectorized form) is indeed a PCA model. Actually, when both L and R are constrained to be column orthonormal (as in GLRAM), the two-sided factorization X = LZRᵀ has the vectorized form vec(Xᵀ) = (L ⊗ R) vec(Zᵀ), which is indeed a PCA model since L ⊗ R is also orthonormal. Thus GLRAM defines a PCA-style factorization of the vectorized images, and constrains the orthonormal mapping matrix in PCA to be a Kronecker product of two smaller orthonormal matrices. This explains why: 1) GLRAM has a lower space requirement than PCA in the vectorized space, and 2) GLRAM achieves better reconstruction than its one-sided counterpart (Yang et al. 2004). Ye (2005) also suggests applying PCA to vec(Zᵀ) after GLRAM, i.e., GLRAM + PCA, and it is clear from here that GLRAM + PCA is still a PCA model.

4.3 Learning in general PHOPCA

The learning algorithms for second-order PHOPCA can be extended to higher-order PHOPCA. For one-mode PHOPCA, where projection only happens at the j-th mode, we can "unfold" the other modes to yield a big matrix X_i^{(j)} ∈ R^{I_j×(I_{j+1}I_{j+2}···I_O I_1 I_2···I_{j−1})} for each tensor X_i, and apply the algorithm in Sect. 4.1. A theorem similar to Theorem 1 exists, and after convergence we find the globally optimal projection subspace in mode j. For general PHOPCA, where we need to project in at least two modes, no global optimum exists and we can turn to variational EM algorithms similar to those described in Sect. 4.2. In the most interesting case where the noise level σ → 0, the E-step (9) is extended to obtain the core B_i as the all-mode product:

  B_i = X_i ×_1 U_1^{+ᵀ} ×_2 U_2^{+ᵀ} × ··· ×_O U_O^{+ᵀ},   (13)

and the M-step to update the factor loadings is now

  U_j = [ Σ_i X_i^{(j)} E_i^{(j)ᵀ} ] [ Σ_i E_i^{(j)} E_i^{(j)ᵀ} ]⁻¹,

where E_i = B_i ×_1 U_1 × ··· ×_{j−1} U_{j−1} ×_{j+1} U_{j+1} × ··· ×_O U_O ∈ R^{I_1×···×I_{j−1}×P_j×I_{j+1}×···×I_O} is the tensor product of B_i with the factor loadings of all modes except U_j. A result similar to Theorem 3 can also be proved for general PHOPCA models, which indicates that this iterative algorithm yields the same stationary point as higher-order multilinear PCA (Lu et al. 2008). A corollary is that, after tensor-to-vector unfolding, multilinear PCA is still a PCA model with a Kronecker-type factor loading matrix. For a test tensor X∗, the PHOPCA projection can be calculated using (13).
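The tensor operations used in this subsection can be written compactly in NumPy. The sketch below is ours: it implements a mode-j product, a mode-j unfolding, and the all-mode projection of a centered tensor in the spirit of Eq. (13), where applying the pseudo-inverse U_j⁺ along mode j reduces to Eq. (9) in the matrix case. The exact column ordering of the unfolding may differ from the paper's convention.

```python
import numpy as np

def mode_product(X, A, j):
    """Mode-j product: multiply the mode-j fibers of tensor X by the matrix A (J x I_j)."""
    return np.moveaxis(np.tensordot(A, X, axes=(1, j)), 0, j)

def unfold(X, j):
    """Mode-j unfolding of X into an I_j x (product of the remaining dimensions) matrix."""
    return np.moveaxis(X, j, 0).reshape(X.shape[j], -1)

def phopca_core(X, Us):
    """All-mode projection of a centered tensor X (cf. Eq. (13))."""
    B = X
    for j, U in enumerate(Us):
        B = mode_product(B, np.linalg.pinv(U), j)   # apply U_j^+ along mode j
    return B
```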

5 Discussions and extensions

The proposed PHOPCA framework extends PPCA to model matrix-variate objects, and is shown to take several 2D and higher-order PCA-style algorithms as (deterministic) special cases. The probabilistic interpretations provide additional insights into these algorithms (e.g., they show that these are special PCA models after unfolding).


PHOPCA also enjoys lower time complexity than its deterministic counterparts, and lower space complexity than PPCA (after unfolding). The general second-order PHOPCA has time complexity O(tNmn max(r, c)) and space complexity O(mn), where r and c are the row-wise and column-wise projected dimensions. The higher-order PHOPCA has time complexity O(tN ∏_j I_j max_j{P_j}). Here t is the number of EM iterations.

PHOPCA fundamentally assumes a matrix-variate normal distribution in the latent variable model. In general, one can define various projection models with other matrix-variate distributions, such as the matrix-variate t distribution and the Wishart distribution (Gupta and Nagar 1999). A formulation similar to (2) can be defined with these alternative distributions, but learning and inference might be harder, as they do not have the nice conjugacy properties of the matrix-variate normal.

As with the relation of PPCA to PCA, PHOPCA provides additional benefits and possible extensions to the matrix-variate PCA-style algorithms:

– Incremental learning with newly obtained data is easy via the EM algorithm (which is crucial for real applications of matrix-variate PCA-style algorithms).

– Missing data can now be handled elegantly with an additional E-step (to estimate the missing values).

– Other noise models can be introduced for or against certain projection dimensions (or factors).

– Mixtures of PHOPCA models can be easily derived (similar to Tipping and Bishop 1999a) for matrix-variate object clustering and (local) projections.

– Robust matrix-variate PCA projections can be introduced (via, e.g., a Student-t model instead of the normal) when there are "outlier" matrix-variate objects.

In this paper we briefly discuss two of these extensions, leaving the others for future work. Some empirical results are shown in the next section.

5.1 Matrix-variate factor analysis

In PHOPCA the noise level σ is fixed for all input dimensions. If we allow the noise levels to differ, we move toward the family of factor analysis (FA) models for matrix-variate data. We take second-order PHOPCA as the example in this subsection, and first give some motivation for this extension.

In some applications we want to focus on specific sub-region(s) of the images; e.g., for face images we might only care about the facial region (eyes, nose, mouth, etc.) and ignore the background. Or, on the contrary, we may know that some sub-regions contain higher noise; for instance, a medical scanner might produce high noise in a certain region of the scanned image due to orientation and lighting angles. In terms of projection or feature extraction, we want the regions in focus to have lower reconstruction errors. The standard PHOPCA model cannot distinguish sub-regions (neither can 2DPCA or GLRAM), but we can easily extend PHOPCA to solve this problem if each such region is of rectangular shape. This is not a limitation since we can always find a rectangular coverage of the actual shape.

In second-order PHOPCA the noise model for ϒ is ϒ ∼ N(0, σI_m, σI_n), where the noise level in both covariance matrices is the same σ. This means the noise level for every entry in the image is σ². If we change the noise model to ϒ ∼ N(0, Σ_0, Φ_0), such that both Σ_0 and Φ_0 are diagonal matrices with Σ_0(k, k) = σ_kk > 0 and Φ_0(ℓ, ℓ) = φ_ℓℓ > 0, then the noise level at a specific entry (k, ℓ) of the image is effectively σ_kk φ_ℓℓ. Therefore, by choosing (or adapting) different (σ_kk) and (φ_ℓℓ), we can bias PHOPCA toward or against certain regions of the images.

The EM algorithm (for one-mode PHOPCA) and the variational EM algorithm (for general PHOPCA) can be easily derived for this new model (with fixed noise levels). For simplicity we only give the variational EM updates below; we iterate these equations until convergence. The derivation is straightforward.

  T = c [ tr(RᵀΦ_0⁻¹R S) LᵀΣ_0⁻¹L + tr(S) I_r ]⁻¹,

  S = r [ tr(LᵀΣ_0⁻¹L T) RᵀΦ_0⁻¹R + tr(T) I_c ]⁻¹,

  Solve B_i from LᵀΣ_0⁻¹L B_i RᵀΦ_0⁻¹R + B_i = LᵀΣ_0⁻¹X_i Φ_0⁻¹R,

  R = [ Σ_i X_iᵀΣ_0⁻¹L B_i ] [ Σ_i B_iᵀLᵀΣ_0⁻¹L B_i + N tr(LᵀΣ_0⁻¹L T) S ]⁻¹,

  L = [ Σ_i X_i Φ_0⁻¹R B_iᵀ ] [ Σ_i B_i RᵀΦ_0⁻¹R B_iᵀ + N tr(RᵀΦ_0⁻¹R S) T ]⁻¹.

This model is a generalization of factor analysis to matrix-variate data. We recommend fixing Σ_0 and Φ_0 before training, or using cross-validation to learn these noise levels.
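Two pieces of this extension lend themselves to a short sketch (ours, not from the paper): building the diagonal noise covariances that emphasize a rectangular region of interest, and solving the Sylvester-like system for the variational mean B_i via the vec/Kronecker identity. The default noise values 10⁻⁴/10⁻³ mirror those used in Sect. 6.1 and are purely illustrative.

```python
import numpy as np

def roi_noise(m, n, rows, cols, low=1e-4, high=1e-3):
    """Diagonal Sigma0 (m x m) and Phi0 (n x n) with low noise inside the ROI.

    The effective noise at entry (k, l) is Sigma0[k, k] * Phi0[l, l].
    """
    s = np.full(m, high); s[rows[0]:rows[1]] = low
    p = np.full(n, high); p[cols[0]:cols[1]] = low
    return np.diag(s), np.diag(p)

def fa_core(Xi, L, R, Sigma0, Phi0):
    """Solve L^T Sigma0^-1 L B R^T Phi0^-1 R + B = L^T Sigma0^-1 X_i Phi0^-1 R for B_i."""
    Si, Pi = np.linalg.inv(Sigma0), np.linalg.inv(Phi0)
    A = L.T @ Si @ L                               # r x r
    C = R.T @ Pi @ R                               # c x c
    r, c = A.shape[0], C.shape[0]
    K = np.kron(C, A) + np.eye(r * c)              # column-stacked vec convention
    rhs = (L.T @ Si @ Xi @ Pi @ R).reshape(-1, order='F')
    return np.linalg.solve(K, rhs).reshape((r, c), order='F')
```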

5.2 Mixture of matrix-variate projections

PHOPCA also allows us to consider MPHOPCA, a mixture of matrix-variate projections, where for each datum we sample one PHOPCA model from a pool of, say, K candidate models, and then sample the datum from that specific PHOPCA model. This leads to a clustering structure of the data, and following the same discussion as in Tipping and Bishop (1999a) we can show that it is actually a Gaussian mixture model (after tensor-to-vector unfolding) with a specific covariance structure. Learning in MPHOPCA is basically (weighted) PHOPCA learning with an additional E-step estimating the (soft) weights with which each datum belongs to the component models. The details are omitted here.

6 Empirical study

In this section we validate the proposed PHOPCA models on some benchmark data and present a real-world application in dimensionality reduction for automatic cardiac view recognition. We evaluate one-mode PHOPCA, general two-mode PHOPCA, the matrix-variate factor analysis and the mixture of PHOPCA models on these problems. Due to the lack of higher-order (O ≥ 3) benchmark data, we mainly present results on 2D matrix-variate data.

6.1 Benchmark data

We first test the proposed PHOPCA models on the face image benchmark data sets ORL and AR. ORL (http://www.uk.research.att.com/facedatabase.html) is a well-known data set for face recognition. It contains the face images of 40 persons, for a total of 400 images. The image size is 92 × 112. The major challenge here is the variation of the face pose. We use the whole image as an instance (i.e., the dimension is 92 × 112 = 10304). AR (http://rvl1.ecn.purdue.edu/~aleix/aleix_face_DB.html) is a large face image data set. The instance of each face may contain large areas of occlusion, due to the presence of sun glasses and scarves. We use a subset of AR which contains 1638 face images of 126 persons. Its image size is 768 × 576. We first crop the image from row 100 to 500 and column 200 to 550, and then subsample the cropped images down to a size of 101 × 88 = 8888.

We first show some reconstructed images from ORL in Fig. 1. As expected, general PHOPCA is better than (right) one-mode PHOPCA, and both the FA-type noise model (row 4) and the mixture of PHOPCAs (row 5) yield even better reconstructed images. For the FA-type noise model we suppose the focus is on the facial region (rows [12, 80] and columns [40, 100]), and we assign a low noise level (10⁻⁴) to the diagonal entries of the focused rows/columns and a high noise level (10⁻³) to the other diagonal entries. For the mixture model, 5 components are used with random image-to-component initialization. For all the tests we run 20 EM iterations. A typical learning curve is shown in Fig. 2.

A thorough comparison of reconstruction errors is shown in Fig. 3 for ORL and AR, respectively. The projection dimensions are (r, c) for general PHOPCA (and its variants), ⌊rc/m⌋ for right one-mode PHOPCA, and ⌊(Nrc + nr + mc)/(N + mn)⌋ for PCA (with N the number of images in the data set). Note that we choose the projection dimensions for PCA and one-mode PHOPCA such that they have the same overall compression ratio as general PHOPCA. The performance is measured by the root mean square error (RMSE), the square root of the mean reconstruction error. As suggested in Ye (2005), we only show results with r = c. As expected, general PHOPCA outperforms one-mode PHOPCA and PCA in almost all dimensions, showing its great benefit for image compression. The mixture model (with 5 components) yields even better performance, but mainly in the region of smaller projection dimensions.
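The matched projection dimensions can be computed directly; a tiny helper of ours:

```python
def matched_dims(N, m, n, r, c):
    """Dimensions giving (roughly) the same compression ratio as an r x c PHOPCA core."""
    one_mode = (r * c) // m                            # right one-mode PHOPCA
    pca = (N * r * c + n * r + m * c) // (N + m * n)   # PCA on vectorized images
    return one_mode, pca

# e.g. matched_dims(N=400, m=112, n=92, r=15, c=15) for an ORL-like setting (illustrative)
```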

We confirmed experimentally that one-mode second-order PHOPCA yields exactly the same results as the 2DPCA model (Yang et al. 2004), and that the general second-order PHOPCA model yields the same results as the GLRAM algorithm (Ye 2005). This is why we do not plot the results of these two algorithms in Fig. 3. We also tested the efficiency of the proposed EM algorithms. General PHOPCA takes only about half the time of the GLRAM algorithm: on the ORL data set, for example, general PHOPCA takes on average 0.56 s to converge for r = c = 5, whilst GLRAM takes 0.90 s.



Fig. 1 Reconstructions of some ORL images (92 × 112). From top to bottom: (1) original images; (2) using general PHOPCA (equivalent to GLRAM; Ye 2005); (3) using (right) one-mode PHOPCA (equivalent to 2DPCA; Yang et al. 2004); (4) using PHOPCA with the FA noise model; (5) mixture of PHOPCA with 5 components. Projection dimensions are r = c = 15

Fig. 2 A typical learning curve for PHOPCA. The root mean square reconstruction error converges to the optimal value after around 10 iterations



Fig. 3 Reconstruction performance comparison on the two image data sets ORL (left) and AR (right). The performance is measured by the root mean square error (RMSE) of the reconstruction. The GLRAM algorithm (Ye 2005) yields exactly the same results as the general PHOPCA model, as shown in Theorem 3

Fig. 4 Reconstruction performance comparison on the ORL data set with focus on a sub-region. The performance is measured by the root mean square error (RMSE) of the reconstruction on rows [12, 80] and columns [40, 100]



In Fig. 4 we compare the different methods on the specified sub-region of the ORL data set. For the factor analysis model the focus is on the facial region (rows [12, 80] and columns [40, 100]), and the RMSE is computed only in this sub-region. It can be seen that the PHOPCA model with the FA-type noise model yields the best performance.

6.2 Automatic cardiac view recognition

Ultrasound images of the heart are usually taken as 2D slices of the 3D heart from 15 standardized angles. Diagnostic analysis of these images requires, as a first step, recognizing the pose of the heart so that spatial cardiac structures can be identified. Cardiac views are imaged from 4 windows: the parasternal, apical, subcostal and suprasternal windows, which leads to up to 15 basic views. Automatic view recognition is the problem of automatically classifying cardiac ultrasound images with respect to their views.


Fig. 5 Four views of the heart: top-left, apical 2 chamber (a2c); top-right, apical 4 chamber (a4c); bottom-left, parasternal long axis (plax); bottom-right, parasternal short axis (psax)

We collected patient data with varying image quality from St. Francis Hospital in New York, containing largely these 4 views. Figure 5 shows an example of each of them.

Ultrasound videos of 100 patients were collected, where some patients had multiple images for one view or certain views missing, resulting in 72 a2c, 87 a4c, 80 plax and 83 psax clips of 480 × 640 image frames. Due to the high number of raw features, various dimensionality reduction methods can be employed to extract useful features for discrimination from the raw images or evaluated image features. In Fig. 6 we show one original image of each view and the reconstructed images from different methods. The projection dimension of second-order PHOPCA was (r, c) = (10, 10), equivalent to 100 features. For PCA and one-mode PHOPCA, we determined the dimensions in the same way as in the previous experiments.

For the problem of automatic view recognition, we illustrate the potential of PHOPCA by distinguishing psax clips from a4c clips. These images were randomly split into a training set (44 a4c vs. 42 psax) and a test set (43 a4c vs. 41 psax). One-mode PHOPCA, general PHOPCA and the mixture of PHOPCA were applied to the psax clips to produce the projection matrices L and R. Features for each clip were then generated by projecting the first frame of the clip along the projection matrices. Moreover, PHOPCA with a focus area ([150, 350] × [250, 450]), where the noise level is 10⁻⁶ in contrast to 10⁻⁴ in the remaining area, was also used to emphasize the area that is most discriminative for the psax view. The projection dimension of PHOPCA was (r, c) = (10, 10), which generated acceptable accuracy as shown in Fig. 7. For PCA and one-mode PHOPCA, we determined the dimensions in the same way as in the previous experiments.



Fig. 6 One original image of each view and the reconstructed images from PCA, one-mode PHOPCA (equivalent to 2DPCA; Yang et al. 2004) and general PHOPCA (equivalent to GLRAM; Ye 2005). The views from top to bottom: a2c, a4c, plax, psax

Any suitable classification method can then be used to construct a classifier using these features. In our experiments, we used a least squares support vector machine (LSSVM) with threefold cross-validation. As shown in Fig. 7, the mixture of PHOPCA and PHOPCA with a focus area outperformed the other approaches. General PHOPCA is better than one-mode PHOPCA, and both are significantly better than PCA.



Fig. 7 The ROC curves of least-squares support vector machines using different dimensionality reduction methods for feature extraction


7 Conclusion and future work

In this paper we introduced a family of probabilistic dimensionality reduction models called PHOPCA for matrix-variate data, including both 2D and higher-order data. These models for the first time explicitly specify the generative process of matrix-variate objects, and the proposed optimization framework recovers the optimal solutions of several matrix-variate PCA-style algorithms under mild conditions. PHOPCA also takes the well-known probabilistic PCA model (for vectorized data) as a special case. Efficient EM algorithms are derived for learning the projection mapping, with lower time complexity than the non-probabilistic counterparts. Several extensions of the PHOPCA family, including a factor analysis model for matrix-variate data and a mixture of PHOPCA models, are also discussed, which show the additional benefits of the proposed probabilistic framework.

For future work we plan to apply the proposed models to data from other applications, including surveillance video and gene arrays. The other extensions mentioned in this paper will also be explored with specific applications in medical imaging. For instance, for computer-aided detection in lung CT images we often see some patients with only partial lung CT scans. These images should be treated as "outlier" images for the data sets and should be down-weighted in the feature extraction process. An interesting direction for future work is to design a robust matrix-variate projection model to tackle this problem.


References

Bartholomew D, Knott M (1999) Latent variable models and factor analysis. Oxford University Press, New York
Chu W, Ghahramani Z (2009) Probabilistic models for incomplete multi-dimensional arrays. In: Proceedings of the international conference on artificial intelligence and statistics (AISTATS-12)
Ding C, Ye J (2005) 2-Dimensional singular value decomposition for 2D maps and images. In: SDM
Gupta AK, Nagar DK (1999) Matrix variate distributions. Chapman and Hall/CRC, Boca Raton, FL
Inoue K, Urahama K (2006) Equivalence of non-iterative algorithms for simultaneous low rank approximations of matrices. In: CVPR, pp 154–159
Jolliffe IT (2002) Principal component analysis. Springer Verlag, New York
Jordan MI, Ghahramani Z, Jaakkola T, Saul LK (1999) An introduction to variational methods for graphical models. Mach Learn 37(2):183–233
Kolda TG (2001) Orthogonal tensor decompositions. SIAM J Matrix Anal Appl 23(1):243–255
Kolda TG, Bader BW (2007) Tensor decompositions and applications. Technical Report SAND2007-6702. Sandia National Laboratories
Lathauwer LD, Moor BD, Vandewalle J (2000) On the best rank-1 and rank-(r1,r2,...,rn) approximation of higher-order tensors. SIAM J Matrix Anal Appl 21(4):1324–1342
Lee DD, Seung HS (1999) Learning the parts of objects with nonnegative matrix factorization. Nature 401:788–791
Lu H, Plataniotis KN, Venetsanopoulos AN (2006) Multilinear principal component analysis of tensor objects for recognition. In: Proceedings of the 18th international conference on pattern recognition, pp 776–779
Lu H, Plataniotis KN, Venetsanopoulos AN (2008) MPCA: multilinear principal component analysis of tensor objects. IEEE Trans Neural Netw 19(1):18–39
Roweis S, Ghahramani Z (1999) A unifying review of linear Gaussian models. Neural Comput 11:305–345
Shashua A, Levin A (2001) Linear image coding for regression and classification using the tensor-rank principle. In: Proceedings of the international conference on computer vision and pattern recognition, pp 42–49
Sun J, Tao D, Faloutsos C (2006) Beyond streams and graphs: dynamic tensor analysis. In: KDD, pp 374–383
Tipping ME, Bishop CM (1999a) Mixtures of probabilistic principal component analysers. Neural Comput 11(2):443–482
Tipping ME, Bishop CM (1999b) Probabilistic principal component analysis. J R Stat Soc B 61:611–622
Vasilescu MAO, Terzopoulos D (2002) Multilinear analysis of image ensembles: TensorFaces. In: Proceedings of the 7th European conference on computer vision, Part I, pp 447–460
Wang H, Wu Q, Shi L, Yu Y, Ahuja N (2005) Out-of-core tensor approximation of multi-dimensional matrices of visual data. In: SIGGRAPH
Yang J, Zhang D, Frangi AF, Yang J-Y (2004) Two-dimensional PCA: a new approach to appearance-based face representation and recognition. IEEE Trans Pattern Anal Mach Intell 26(1):131–137
Ye J (2005) Generalized low rank approximation of matrices. Mach Learn 61:167–191
