

Probabilistic Tensor Canonical Polyadic Decomposition With Orthogonal Factors

    Lei Cheng, Yik-Chung Wu and H. Vincent Poor

Abstract—Tensor canonical polyadic decomposition (CPD), which recovers the latent factor matrices from multidimensional data, is an important tool in signal processing. In many applications, some of the factor matrices are known to have orthogonality structure, and this information can be exploited to improve the accuracy of latent factor recovery. However, existing methods for CPD with orthogonal factors all require knowledge of the tensor rank, which is difficult to acquire, and have no mechanism to handle outliers in measurements. To overcome these disadvantages, in this paper, a novel tensor CPD algorithm based on the probabilistic inference framework is devised. In particular, the problem of tensor CPD with orthogonal factors is interpreted using a probabilistic model, based on which an inference algorithm is proposed that alternately estimates the factor matrices, recovers the tensor rank and mitigates the outliers. Simulation results using synthetic data and real-world applications are presented to illustrate the excellent performance of the proposed algorithm in terms of accuracy and robustness.

Index Terms—Tensor Canonical Polyadic Decomposition, Orthogonal Constraints, Robust Estimation, Multidimensional Signal Processing

    I. INTRODUCTION

Many problems in signal processing, such as independent component analysis (ICA) with matrix-based models [1]–[4], blind signal estimation in wireless communications [5]–[9], localization in array signal processing [10], [11], and linear image coding [12], [13], eventually reduce to the issue of finding a set of factor matrices $\{\mathbf{A}^{(n)} \in \mathbb{C}^{I_n \times R}\}_{n=1}^{N}$ from a complex-valued tensor $\mathcal{X} \in \mathbb{C}^{I_1 \times I_2 \times \cdots \times I_N}$ that satisfy

$$\mathcal{X} = \sum_{r=1}^{R} \mathbf{A}^{(1)}_{:,r} \circ \mathbf{A}^{(2)}_{:,r} \circ \cdots \circ \mathbf{A}^{(N)}_{:,r} \triangleq [\![\mathbf{A}^{(1)}, \mathbf{A}^{(2)}, \cdots, \mathbf{A}^{(N)}]\!], \qquad (1)$$

where $\mathbf{A}^{(n)}_{:,r} \in \mathbb{C}^{I_n \times 1}$ is the $r$th column of the factor matrix $\mathbf{A}^{(n)}$, and $\circ$ denotes the vector outer product. This decomposition is called canonical polyadic decomposition (CPD), and the number of rank-1 components $R$ is defined as the tensor rank [14].

Copyright (c) 2015 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to [email protected].

This research was supported in part by the U.S. Army Research Office under MURI Grant W911NF-11-1-0036, and the U.S. National Science Foundation under Grant CCF-1420575 and Grant CCF-1435778.

Lei Cheng and Yik-Chung Wu are with the Department of Electrical and Electronic Engineering, The University of Hong Kong, Hong Kong (e-mail: [email protected], [email protected]).

H. Vincent Poor is with the Department of Electrical Engineering, Princeton University, Princeton, NJ 08544 USA (e-mail: [email protected]).

Under rather mild conditions, the CPD is unique up to a trivial scalar and permutation ambiguity [14], and this fact has underlain its importance in signal processing [1]–[13].
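To make the notation of (1) concrete, here is a minimal NumPy sketch (added for this text, not code from the paper) that reconstructs a complex-valued tensor from a set of factor matrices by summing the $R$ outer products of their columns; the sizes and rank used are arbitrary.

```python
import numpy as np

def cpd_reconstruct(factors):
    """Build the tensor [[A^(1), ..., A^(N)]] of (1): the sum of R outer
    products of the r-th columns of all factor matrices."""
    R = factors[0].shape[1]
    shape = tuple(A.shape[0] for A in factors)
    X = np.zeros(shape, dtype=complex)
    for r in range(R):
        rank1 = factors[0][:, r]
        for A in factors[1:]:
            # accumulate the outer product one mode at a time
            rank1 = np.multiply.outer(rank1, A[:, r])
        X += rank1
    return X

# Example: a 4 x 5 x 6 tensor of rank R = 3 (arbitrary sizes).
rng = np.random.default_rng(0)
factors = [rng.standard_normal((I, 3)) + 1j * rng.standard_normal((I, 3))
           for I in (4, 5, 6)]
X = cpd_reconstruct(factors)
print(X.shape)  # (4, 5, 6)
```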

To find the factor matrices in CPD, a common approach is to solve $\min_{\{\mathbf{A}^{(n)}\}_{n=1}^{N}} \| \mathcal{X} - [\![\mathbf{A}^{(1)}, \mathbf{A}^{(2)}, \cdots, \mathbf{A}^{(N)}]\!] \|_F^2$. Unfortunately, it can be seen from (1) that all the factor matrices are nonlinearly coupled, and thus a closed-form solution does not exist. Consequently, the most popular solution is the alternating least squares (ALS) method, which iteratively optimizes one factor matrix at a time while holding the other factor matrices fixed [14], [15]. However, the ALS method does not take into account the potential orthogonality structure in the factor matrices, which can be found in a variety of applications. For example, the zero-mean uncorrelated signals in wireless communications [5]–[9], the prewhitening procedure in ICA [1], [4], and the basis matrices in linear image coding [12], [13], all give rise to orthogonal factors in the tensor model. Interestingly, the uniqueness of tensor CPD incorporating orthogonal factors is guaranteed under an even milder condition than the case without orthogonal factors. Pioneering work [17] formally established this fact, and extended the conventional methods to account for the orthogonality structure, among which the orthogonality constrained ALS (OALS) algorithm¹ shows remarkable efficiency in terms of accuracy and complexity.

However, there are at least two major challenges that the algorithms in [17] (including the OALS) face in practical applications. Firstly, these algorithms are least-squares based, and thus lack robustness to outliers in the measurements, such as the ubiquitous impulsive noise in sensor arrays or networks [18], [19], and salt-and-pepper noise in images [20]. Secondly, knowledge of the tensor rank is a prerequisite for implementing these algorithms. Unfortunately, tensor rank acquisition from tensor data is known to be NP-hard [14]. Even for applications in wireless communications, where the tensor rank can be assumed known since it is related to the number of users or sensors, existing decomposition algorithms are still susceptible to degradation caused by network dynamics, e.g., users joining and leaving the network, sudden sensor failures, etc.

To overcome the disadvantages of existing methods, we devise a novel algorithm for complex-valued tensor CPD with orthogonal factors based on the probabilistic inference framework. Probabilistic inference is well known for providing an alternative formulation of principal component analysis (PCA) [22].

¹It was called the "first kind of ALS algorithm for tensor CPD with orthogonal factors (ALS1-CPO)" in [17]. For brevity of discussion, we just call it the OALS algorithm in this paper.


With the inception of probabilistic PCA, not only is the conventional singular value decomposition (SVD) linked to statistical inference over a probabilistic model, but advances in Bayesian statistics and machine learning can also be incorporated to achieve automatic relevance determination [23] and outlier removal [24]. Although the probabilistic approach is well established in matrix decomposition, its extension to the tensor counterpart faces unique challenges, since all the factor matrices are nonlinearly coupled via multiple Khatri-Rao products [14].

In this paper, we propose a probabilistic CPD algorithm for complex-valued tensors with some of the factors being orthogonal, under unknown tensor rank and in the presence of outliers in the observations. In particular, the tensor CPD problem is reformulated as an inference problem over a probabilistic model, wherein the uniform distribution over the Stiefel manifold is leveraged to encode the orthogonality structure. Since the complicatedly coupled factor matrices in the probabilistic model lead to analytically intractable integrations in exact Bayesian inference, variational inference is exploited to give an alternative solution. This results in an efficient algorithm that alternately estimates the factor matrices, recovers the tensor rank and mitigates the outliers. Interestingly, the OALS in [17] can be interpreted as a special case of the proposed algorithm.

The remainder of this paper is organized as follows. Section II presents the motivating examples and the problem formulation. In Section III, the CPD problem is interpreted using probability density functions, and the corresponding probabilistic model is established. In Section IV, based on the variational inference framework, a robust algorithm for tensor CPD with orthogonal factors is derived, and its relationship to the OALS algorithm is revealed. Simulation results using synthetic data and real-world applications are reported in Section V. Finally, conclusions are drawn in Section VI.

Notation: Boldface lowercase and uppercase letters will be used for vectors and matrices, respectively. Tensors are written as calligraphic letters. $\mathrm{E}[\cdot]$ denotes the expectation of its argument and $j \triangleq \sqrt{-1}$. Superscripts $T$, $*$ and $H$ denote transpose, conjugate and Hermitian transpose, respectively. $\delta(\cdot)$ denotes the Dirac delta function. The operator $\mathrm{Tr}(\mathbf{A})$ denotes the trace of a matrix $\mathbf{A}$ and $\|\cdot\|_F$ represents the Frobenius norm of its argument. The symbol $\propto$ represents a linear scalar relationship between two real-valued functions. $\mathcal{CN}(\mathbf{x}\,|\,\mathbf{u},\mathbf{R})$ stands for the probability density function of a circularly-symmetric complex Gaussian vector $\mathbf{x}$ with mean $\mathbf{u}$ and covariance matrix $\mathbf{R}$. $\mathcal{CMN}(\mathbf{X}\,|\,\mathbf{M},\boldsymbol{\Sigma}_r,\boldsymbol{\Sigma}_c)$ denotes the complex-valued matrix normal probability density function $p(\mathbf{X}) \propto \exp\{-\mathrm{Tr}(\boldsymbol{\Sigma}_c^{-1}(\mathbf{X}-\mathbf{M})^H\boldsymbol{\Sigma}_r^{-1}(\mathbf{X}-\mathbf{M}))\}$, and $\mathrm{VMF}(\mathbf{X}\,|\,\mathbf{F})$ stands for the complex-valued von Mises-Fisher matrix probability density function $p(\mathbf{X}) \propto \exp\{\mathrm{Tr}(\mathbf{F}\mathbf{X}^H+\mathbf{X}\mathbf{F}^H)\}$. The $N\times N$ diagonal matrix with diagonal components $a_1$ through $a_N$ is represented as $\mathrm{diag}\{a_1, a_2, \ldots, a_N\}$, while $\mathbf{I}_M$ represents the $M\times M$ identity matrix. The $(i,j)$th element and the $j$th column of a matrix $\mathbf{A}$ are represented by $\mathbf{A}_{i,j}$ and $\mathbf{A}_{:,j}$, respectively.

II. MOTIVATING EXAMPLES AND PROBLEM FORMULATION

Tensor CPD with orthogonal factors has been widely exploited in various signal processing applications [1]–[13]. In this section, we briefly mention two motivating examples, and then we give the general problem formulation.

    A. Motivating Example 1: Blind Receiver Design for DS-CDMA Systems

In a direct-sequence code division multiple access (DS-CDMA) system, the transmitted signal $s_r(k)$ from the $r$th user at the $k$th symbol period is multiplied by a spreading sequence $[c_{1r}, c_{2r}, \cdots, c_{Zr}]$, where $c_{zr}$ is the $z$th chip of the applied spreading code. Assuming $R$ users transmit their signals simultaneously to a base station (BS) equipped with $M$ receive antennas, the received data is given by

$$y_{mz}(k) = \sum_{r=1}^{R} h_{mr} c_{zr} s_r(k) + w_{mz}(k), \quad 1 \le m \le M,\ 1 \le z \le Z, \qquad (2)$$

where $h_{mr}$ denotes the flat fading channel between the $r$th user and the $m$th receive antenna at the base station, and $w_{mz}(k)$ denotes white Gaussian noise. By introducing $\mathbf{H} \in \mathbb{C}^{M\times R}$ with its $(m,r)$th element being $h_{mr}$, and $\mathbf{C} \in \mathbb{C}^{Z\times R}$ with its $(z,r)$th element being $c_{zr}$, the model (2) can be written in matrix form as $\mathbf{Y}(k) = \sum_{r=1}^{R} \mathbf{H}_{:,r} \circ \mathbf{C}_{:,r}\, s_r(k) + \mathbf{W}(k)$, where $\mathbf{Y}(k), \mathbf{W}(k) \in \mathbb{C}^{M\times Z}$ are matrices with their $(m,z)$th elements being $y_{mz}(k)$ and $w_{mz}(k)$, respectively. After collecting $T$ samples along the time dimension and defining $\mathbf{S} \in \mathbb{C}^{T\times R}$ with its $(k,r)$th element being $s_r(k)$, the system model can be further written in tensor form as [5]

$$\mathcal{Y} = \sum_{r=1}^{R} \mathbf{H}_{:,r} \circ \mathbf{C}_{:,r} \circ \mathbf{S}_{:,r} + \mathcal{W} = [\![\mathbf{H}, \mathbf{C}, \mathbf{S}]\!] + \mathcal{W}, \qquad (3)$$

where $\mathcal{Y} \in \mathbb{C}^{M\times Z\times T}$ and $\mathcal{W} \in \mathbb{C}^{M\times Z\times T}$ are third-order tensors, which take $y_{mz}(k)$ and $w_{mz}(k)$ as their $(m,z,k)$th elements, respectively.

It is shown in [5] that under certain mild conditions, the CPD of the tensor $\mathcal{Y}$, which solves $\min_{\mathbf{H},\mathbf{C},\mathbf{S}} \| \mathcal{Y} - [\![\mathbf{H},\mathbf{C},\mathbf{S}]\!] \|_F^2$, can blindly recover the transmitted signals $\mathbf{S}$. Furthermore, since the transmitted signals are usually uncorrelated and with zero mean, the orthogonality structure² of $\mathbf{S}$ can further be taken into account to give better performance for blind signal recovery [9]. Similar models can also be found in blind data detection for cooperative communication systems [6], [7], and in topology learning for wireless sensor networks (WSNs) [8].

²Strictly speaking, $\mathbf{S}$ is only approximately orthogonal. But the approximation gets better and better when the observation length $T$ increases.
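As a small illustration of the model (2)-(3), the following NumPy sketch (added for this text) generates a synthetic received tensor $\mathcal{Y} = [\![\mathbf{H},\mathbf{C},\mathbf{S}]\!] + \mathcal{W}$; the dimensions, BPSK alphabet and noise level are illustrative assumptions rather than values from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
M, Z, T, R = 8, 6, 100, 5        # antennas, spreading length, symbols, users (illustrative)

H = (rng.standard_normal((M, R)) + 1j * rng.standard_normal((M, R))) / np.sqrt(2)  # flat fading
C = (rng.standard_normal((Z, R)) + 1j * rng.standard_normal((Z, R))) / np.sqrt(2)  # spreading codes
S = rng.choice([-1.0, 1.0], size=(T, R)).astype(complex)                            # BPSK symbols

# Noiseless tensor [[H, C, S]]: y_{mzk} = sum_r h_{mr} c_{zr} s_r(k), cf. (2)-(3)
Y0 = np.einsum('mr,zr,kr->mzk', H, C, S)

sigma2 = 0.01                    # illustrative noise power
W = np.sqrt(sigma2 / 2) * (rng.standard_normal(Y0.shape) + 1j * rng.standard_normal(Y0.shape))
Y = Y0 + W                       # received data tensor of size M x Z x T
print(Y.shape)  # (8, 6, 100)
```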

B. Motivating Example 2: Linear Image Coding for a Collection of Images

Given a collection of images representing a class of objects, linear image coding extracts the commonalities of these images, which is important in image compression and recognition [12], [13]. The $k$th image of size $M\times Z$ naturally corresponds to a matrix $\mathbf{B}(k)$ with its $(m,z)$th element being the image's intensity at that position. Linear image coding seeks the orthogonal basis matrices $\mathbf{U} \in \mathbb{C}^{M\times R}$ and $\mathbf{V} \in \mathbb{C}^{Z\times R}$ that capture the directions of the largest $R$ variances in the image data, and this problem can be written as [12], [13]

$$\min_{\mathbf{U},\mathbf{V},\{d_r(k)\}_{r=1}^{R}} \sum_{k=1}^{K} \|\mathbf{B}(k)-\mathbf{U}\,\mathrm{diag}\{d_1(k),\ldots,d_R(k)\}\,\mathbf{V}^T\|_F^2 \quad \mathrm{s.t.}\ \mathbf{U}^H\mathbf{U} = \mathbf{I}_R,\ \mathbf{V}^H\mathbf{V} = \mathbf{I}_R. \qquad (4)$$

Obviously, if there is only one image (i.e., $K = 1$), problem (4) is equivalent to the well-studied SVD problem. Notice that the expression inside the Frobenius norm in (4) can be written as $\mathbf{B}(k)-\sum_{r=1}^{R}\mathbf{U}_{:,r}\circ\mathbf{V}_{:,r}\,d_r(k)$. Further introducing the matrix $\mathbf{D}$ with its $(k,r)$th element being $d_r(k)$, it is easy to see that problem (4) can be rewritten in tensor form as

$$\min_{\mathbf{U},\mathbf{V},\mathbf{D}} \Big\|\, \mathcal{B} - \underbrace{\sum_{r=1}^{R}\mathbf{U}_{:,r}\circ\mathbf{V}_{:,r}\circ\mathbf{D}_{:,r}}_{=\,[\![\mathbf{U},\mathbf{V},\mathbf{D}]\!]} \,\Big\|_F^2 \quad \mathrm{s.t.}\ \mathbf{U}^H\mathbf{U} = \mathbf{I}_R,\ \mathbf{V}^H\mathbf{V} = \mathbf{I}_R, \qquad (5)$$

where $\mathcal{B} \in \mathbb{C}^{M\times Z\times K}$ is a third-order tensor with $\mathbf{B}(k)$ as its $k$th slice. Therefore, linear image coding for a collection of images is equivalent to solving a tensor CPD with two orthonormal factor matrices.
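For the single-image case $K = 1$ mentioned above, problem (4) is the classical best rank-$R$ approximation, which the truncated SVD solves; a quick NumPy check (added for this text, with an arbitrary matrix size and $R$) is given below.

```python
import numpy as np

rng = np.random.default_rng(2)
M, Z, R = 60, 80, 5                      # illustrative image size and number of basis vectors
B = rng.standard_normal((M, Z))          # a single "image" (K = 1)

# Truncated SVD gives orthonormal U, V and weights d minimising
# || B - U diag(d) V^T ||_F, i.e. problem (4) with K = 1.
U_full, s, Vt = np.linalg.svd(B, full_matrices=False)
U, V, d = U_full[:, :R], Vt[:R].T, s[:R]

approx = U @ np.diag(d) @ V.T
print(np.allclose(U.T @ U, np.eye(R)), np.allclose(V.T @ V, np.eye(R)))
print(np.linalg.norm(B - approx))        # residual of the best rank-R approximation
```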

    C. Problem Formulation

From the two motivating examples above, we take a step further and consider a generalized problem in which the observed data tensor $\mathcal{Y} \in \mathbb{C}^{I_1\times I_2\times\cdots\times I_N}$ obeys the following model:

$$\mathcal{Y} = [\![\mathbf{A}^{(1)}, \mathbf{A}^{(2)}, \cdots, \mathbf{A}^{(N)}]\!] + \mathcal{W} + \mathcal{E}, \qquad (6)$$

where $\mathcal{W}$ represents an additive noise tensor with each element $w_{i_1,i_2,\ldots,i_N} \sim \mathcal{CN}(w_{i_1,i_2,\ldots,i_N}\,|\,0,\beta^{-1})$ and with correlation $\mathrm{E}(w^*_{i_1,i_2,\ldots,i_N} w_{\tau_1,\tau_2,\ldots,\tau_N}) = \beta^{-1}\prod_{n=1}^{N}\delta(\tau_n - i_n)$; $\mathcal{E}$ denotes potential outliers in the measurements, with each element $e_{i_1,i_2,\ldots,i_N}$ taking an unknown value if an outlier emerges, and otherwise taking the value zero. Since the number of orthogonal factor matrices could be known a priori in a specific application, it is assumed that $\{\mathbf{A}^{(n)}\}_{n=1}^{P}$ are known to be orthogonal, where $P < N$, while the remaining factor matrices are unconstrained.

Due to the orthogonality structure of the first $P$ factor matrices $\{\mathbf{A}^{(n)}\}_{n=1}^{P}$, they can be written as $\mathbf{A}^{(n)} = \mathbf{U}^{(n)}\boldsymbol{\Lambda}^{(n)}$, where $\mathbf{U}^{(n)}$ is an orthonormal matrix and $\boldsymbol{\Lambda}^{(n)}$ is a diagonal matrix. Putting $\mathbf{A}^{(n)} = \mathbf{U}^{(n)}\boldsymbol{\Lambda}^{(n)}$ for $1 \le n \le P$ into the definition of the tensor CPD in (1), it is easy to show that

$$[\![\mathbf{A}^{(1)}, \mathbf{A}^{(2)}, \cdots, \mathbf{A}^{(N)}]\!] = [\![\boldsymbol{\Xi}^{(1)}, \boldsymbol{\Xi}^{(2)}, \cdots, \boldsymbol{\Xi}^{(N)}]\!] \qquad (7)$$

with $\boldsymbol{\Xi}^{(n)} = \mathbf{U}^{(n)}\boldsymbol{\Pi}$ for $1 \le n \le P$, $\boldsymbol{\Xi}^{(n)} = \mathbf{A}^{(n)}\boldsymbol{\Pi}$ for $P+1 \le n \le N-1$, and $\boldsymbol{\Xi}^{(N)} = \mathbf{A}^{(N)}\boldsymbol{\Lambda}^{(1)}\boldsymbol{\Lambda}^{(2)}\cdots\boldsymbol{\Lambda}^{(P)}\boldsymbol{\Pi}$, where $\boldsymbol{\Pi} \in \mathbb{C}^{R\times R}$ is a permutation matrix. From (7), it can be seen that, up to the scaling and permutation indeterminacy, the tensor CPD under orthogonal constraints is equivalent to that under orthonormal constraints. In general, the scaling and permutation ambiguity can be easily resolved using side information [5]. On the other hand, for those applications that seek the subspaces spanned by the factor matrices, such as linear image coding described in Section II.B, the scaling and permutation ambiguity can be ignored. Thus, without loss of generality, our goal is to estimate an $N$-tuplet of factor matrices $(\boldsymbol{\Xi}^{(1)},\boldsymbol{\Xi}^{(2)},\ldots,\boldsymbol{\Xi}^{(N)})$, with the first $P$ (where $P < N$) of them being orthonormal, based on the observation $\mathcal{Y}$ and in the absence of knowledge of the noise power $\beta^{-1}$, the outlier statistics and the tensor rank $R$. In particular, since we do not know the exact value of $R$, it is assumed that there are $L$ columns in each factor matrix $\boldsymbol{\Xi}^{(n)}$, where $L$ is the maximum possible value of the tensor rank $R$. Thus, the problem to be solved can be stated as

$$\min_{\{\boldsymbol{\Xi}^{(n)}\}_{n=1}^{N},\,\mathcal{E}} \ \beta\,\|\mathcal{Y} - [\![\boldsymbol{\Xi}^{(1)}, \boldsymbol{\Xi}^{(2)}, \cdots, \boldsymbol{\Xi}^{(N)}]\!] - \mathcal{E}\|_F^2 + \sum_{l=1}^{L}\gamma_l\Big(\sum_{n=1}^{N}\boldsymbol{\Xi}^{(n)H}_{:,l}\boldsymbol{\Xi}^{(n)}_{:,l}\Big) \quad \mathrm{s.t.}\ \boldsymbol{\Xi}^{(n)H}\boldsymbol{\Xi}^{(n)} = \mathbf{I}_L,\ n = 1,2,\cdots,P, \qquad (8)$$

where the regularization term $\sum_{l=1}^{L}\gamma_l\big(\sum_{n=1}^{N}\boldsymbol{\Xi}^{(n)H}_{:,l}\boldsymbol{\Xi}^{(n)}_{:,l}\big)$ is added to control the complexity of the model and avoid overfitting of noise [21], since more columns (and thus more degrees of freedom) are introduced in $\boldsymbol{\Xi}^{(n)}$ than in the true model, and $\{\gamma_l\}_{l=1}^{L}$ are regularization parameters trading off the relative importance of the squared error term and the regularization term.

Existing algorithms [17] for tensor CPD with orthonormal factors cannot be used to solve problem (8), since they have no mechanism to handle the outliers $\mathcal{E}$. Furthermore, the choice of the regularization parameters plays an important role, since setting $\gamma_l$ too large results in excessive residual squared error, while setting $\gamma_l$ too small risks overfitting of noise. In general, determining the optimal regularization parameters (e.g., using cross-validation [27], or the L-curve [28]) requires exhaustive search, and thus is computationally demanding. To overcome these problems, we propose a novel algorithm based on the framework of probabilistic inference, which effectively mitigates the outliers $\mathcal{E}$ and automatically learns the regularization parameters.

III. PROBABILISTIC MODEL FOR TENSOR CPD WITH ORTHOGONAL FACTORS

Before solving problem (8), we interpret different terms in (8) as probability density functions, based on which a probabilistic model that encodes our knowledge of the observation and the unknowns can be established.

Firstly, since the elements of the additive noise $\mathcal{W}$ are white, zero-mean and circularly-symmetric complex Gaussian, the squared error term in problem (8) can be interpreted as the negative log of the likelihood given by [21]:

$$p\big(\mathcal{Y}\,|\,\boldsymbol{\Xi}^{(1)},\boldsymbol{\Xi}^{(2)},\ldots,\boldsymbol{\Xi}^{(N)},\mathcal{E},\beta\big) \propto \exp\big(-\beta\,\|\mathcal{Y} - [\![\boldsymbol{\Xi}^{(1)},\boldsymbol{\Xi}^{(2)},\ldots,\boldsymbol{\Xi}^{(N)}]\!] - \mathcal{E}\|_F^2\big). \qquad (9)$$


Secondly, the regularization term in problem (8) can be interpreted as arising from a circularly-symmetric complex Gaussian prior distribution over the columns of the factor matrices, i.e., $\prod_{n=1}^{N}\prod_{l=1}^{L}\mathcal{CN}\big(\boldsymbol{\Xi}^{(n)}_{:,l}\,|\,\mathbf{0}_{I_n\times 1},\,\gamma_l^{-1}\mathbf{I}_{I_n}\big)$ [21]. Note that the columns of the factor matrices are independent of each other, and the $l$th columns in all factor matrices $\{\boldsymbol{\Xi}^{(n)}\}_{n=1}^{N}$ share the same variance $\gamma_l^{-1}$. This has the physical interpretation that if $\gamma_l$ is large, the $l$th columns in all $\boldsymbol{\Xi}^{(n)}$'s will be effectively "switched off". On the other hand, for the first $P$ factor matrices $\{\boldsymbol{\Xi}^{(n)}\}_{n=1}^{P}$, there are additional hard constraints in problem (8), which correspond to the Stiefel manifold [33] $\mathcal{V}_L(\mathbb{C}^{I_n}) = \{\mathbf{A}\in\mathbb{C}^{I_n\times L}: \mathbf{A}^H\mathbf{A} = \mathbf{I}_L\}$ for $1 \le n \le P$. Since the orthonormal constraints result in $\boldsymbol{\Xi}^{(n)H}_{:,l}\boldsymbol{\Xi}^{(n)}_{:,l} = 1$, the hard constraints dominate the Gaussian distribution of the columns in $\{\boldsymbol{\Xi}^{(n)}\}_{n=1}^{P}$. Therefore, $\boldsymbol{\Xi}^{(n)}$ can be interpreted as being uniformly distributed over the Stiefel manifold $\mathcal{V}_L(\mathbb{C}^{I_n})$ for $1 \le n \le P$, and Gaussian distributed for $P+1 \le n \le N$:

$$p(\boldsymbol{\Xi}^{(1)},\boldsymbol{\Xi}^{(2)},\cdots,\boldsymbol{\Xi}^{(P)}) \propto \prod_{n=1}^{P}\mathbb{I}_{\mathcal{V}_L(\mathbb{C}^{I_n})}(\boldsymbol{\Xi}^{(n)}), \qquad p(\boldsymbol{\Xi}^{(P+1)},\boldsymbol{\Xi}^{(P+2)},\cdots,\boldsymbol{\Xi}^{(N)}) = \prod_{n=P+1}^{N}\prod_{l=1}^{L}\mathcal{CN}\big(\boldsymbol{\Xi}^{(n)}_{:,l}\,|\,\mathbf{0}_{I_n\times 1},\,\gamma_l^{-1}\mathbf{I}_{I_n}\big), \qquad (10)$$

where $\mathbb{I}_{\mathcal{V}_L(\mathbb{C}^{I_n})}(\boldsymbol{\Xi}^{(n)})$ is an indicator function with $\mathbb{I}_{\mathcal{V}_L(\mathbb{C}^{I_n})}(\boldsymbol{\Xi}^{(n)}) = 1$ when $\boldsymbol{\Xi}^{(n)} \in \mathcal{V}_L(\mathbb{C}^{I_n})$, and $\mathbb{I}_{\mathcal{V}_L(\mathbb{C}^{I_n})}(\boldsymbol{\Xi}^{(n)}) = 0$ otherwise. For the parameters $\beta$ and $\{\gamma_l\}_{l=1}^{L}$, which correspond to the inverse noise power and the variances of the columns in the factor matrices, since we have no information about their distributions, a non-informative Jeffreys prior [27] is imposed on them, i.e., $p(\beta) \propto \beta^{-1}$ and $p(\gamma_l) \propto \gamma_l^{-1}$ for $l = 1,\cdots,L$.

Finally, although the generative model for the outliers $\mathcal{E}_{i_1,\cdots,i_N}$ is unknown, the rare occurrence of outliers motivates us to employ Student's t distribution as their prior [27], i.e., $p(\mathcal{E}_{i_1,\cdots,i_N}) = \mathcal{T}(\mathcal{E}_{i_1,\cdots,i_N}\,|\,0, c_{i_1,\cdots,i_N}, d_{i_1,\cdots,i_N})$. To facilitate the Bayesian inference procedure, Student's t distribution can be equivalently represented as a Gaussian scale mixture as follows [34]:

$$\mathcal{T}(\mathcal{E}_{i_1,\cdots,i_N}\,|\,0, c_{i_1,\cdots,i_N}, d_{i_1,\cdots,i_N}) = \int \mathcal{CN}\big(\mathcal{E}_{i_1,\cdots,i_N}\,|\,0, \zeta_{i_1,\cdots,i_N}^{-1}\big)\,\mathrm{Gamma}\big(\zeta_{i_1,\cdots,i_N}\,|\,c_{i_1,\cdots,i_N}, d_{i_1,\cdots,i_N}\big)\,d\zeta_{i_1,\cdots,i_N}. \qquad (11)$$

This means that Student's t distribution can be obtained by mixing an infinite number of zero-mean circularly-symmetric complex Gaussian distributions, where the mixing distribution on the precision $\zeta_{i_1,\cdots,i_N}$ is the gamma distribution with parameters $c_{i_1,\cdots,i_N}$ and $d_{i_1,\cdots,i_N}$. In addition, since the statistics of outliers such as means and correlations are generally unavailable in practice, we set the hyper-parameters $c_{i_1,\cdots,i_N}$ and $d_{i_1,\cdots,i_N}$ to $10^{-6}$ to produce a non-informative prior on $\mathcal{E}_{i_1,\cdots,i_N}$, and assume the outliers are independent of each other:

$$p(\mathcal{E}) = \prod_{i_1=1}^{I_1}\cdots\prod_{i_N=1}^{I_N}\mathcal{T}\big(\mathcal{E}_{i_1,\cdots,i_N}\,|\,0,\ c_{i_1,\cdots,i_N}=10^{-6},\ d_{i_1,\cdots,i_N}=10^{-6}\big). \qquad (12)$$

[Figure 1: Probabilistic model for tensor CPD with orthogonal factors.]
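The Gaussian scale mixture in (11) can be illustrated numerically: drawing the precision from a gamma distribution and then drawing a zero-mean complex Gaussian produces heavy-tailed samples. The sketch below (added for this text) uses moderate, illustrative hyper-parameters rather than the non-informative value $10^{-6}$ used in the paper, so that the moments stay finite.

```python
import numpy as np

rng = np.random.default_rng(3)
c, d = 3.0, 3.0      # illustrative gamma shape and rate (the paper sets both to 1e-6)
n = 100000

# Gaussian scale mixture of (11): zeta ~ Gamma(c, rate d), then E ~ CN(0, 1/zeta)
zeta = rng.gamma(shape=c, scale=1.0 / d, size=n)
e = np.sqrt(1.0 / (2.0 * zeta)) * (rng.standard_normal(n) + 1j * rng.standard_normal(n))

# Compare tail behaviour with a complex Gaussian of the same average power
g = np.sqrt(np.mean(np.abs(e) ** 2) / 2.0) * (rng.standard_normal(n) + 1j * rng.standard_normal(n))
print(np.mean(np.abs(e) ** 4) / np.mean(np.abs(e) ** 2) ** 2)   # kurtosis-like ratio, > 2
print(np.mean(np.abs(g) ** 4) / np.mean(np.abs(g) ** 2) ** 2)   # approx. 2 for a complex Gaussian
```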

The complete probabilistic model is shown in Figure 1. Notice that the proposed probabilistic model in this paper is different from those of existing works on tensor decompositions [30]–[32]. In particular, existing tensor probabilistic models do not take orthogonality structure into account. Furthermore, existing tensor decompositions [30]–[32], [43]–[45] are designed for real-valued tensors only, and thus cannot process the complex-valued data arising in applications such as wireless communications [5]–[9] and functional magnetic resonance imaging [3].

IV. VARIATIONAL INFERENCE FOR TENSOR FACTORIZATION

Let $\boldsymbol{\Theta}$ be a set containing the factor matrices $\{\boldsymbol{\Xi}^{(n)}\}_{n=1}^{N}$, and the other variables $\mathcal{E}$, $\{\gamma_l\}_{l=1}^{L}$, $\{\zeta_{i_1,\cdots,i_N}\}_{i_1=1,\cdots,i_N=1}^{I_1,\cdots,I_N}$, $\beta$. From the probabilistic model established above, the marginal probability density functions of the unknown factor matrices $\{\boldsymbol{\Xi}^{(n)}\}_{n=1}^{N}$ are given by

$$p(\boldsymbol{\Xi}^{(n)}\,|\,\mathcal{Y}) = \int \frac{p(\mathcal{Y},\boldsymbol{\Theta})}{p(\mathcal{Y})}\,d\,\boldsymbol{\Theta}\backslash\boldsymbol{\Xi}^{(n)}, \quad n = 1,2,\cdots,N, \qquad (13)$$

where

$$p(\mathcal{Y},\boldsymbol{\Theta}) \propto \prod_{n=1}^{P}\mathbb{I}_{\mathcal{V}_L(\mathbb{C}^{I_n})}\big(\boldsymbol{\Xi}^{(n)}\big)\exp\Bigg\{\Big(\prod_{n=1}^{N}I_n - 1\Big)\ln\beta + \Big(\sum_{n=P+1}^{N}I_n + 1\Big)\sum_{l=1}^{L}\ln\gamma_l - \mathrm{Tr}\Big(\boldsymbol{\Gamma}\sum_{n=P+1}^{N}\boldsymbol{\Xi}^{(n)H}\boldsymbol{\Xi}^{(n)}\Big) + \sum_{i_1=1}^{I_1}\cdots\sum_{i_N=1}^{I_N}\big[(c_{i_1,\cdots,i_N}-1)\ln\zeta_{i_1,\cdots,i_N} - d_{i_1,\cdots,i_N}\zeta_{i_1,\cdots,i_N}\big] + \sum_{i_1=1}^{I_1}\cdots\sum_{i_N=1}^{I_N}\big(\ln\zeta_{i_1,\cdots,i_N} - \zeta_{i_1,\cdots,i_N}\mathcal{E}^{*}_{i_1,\cdots,i_N}\mathcal{E}_{i_1,\cdots,i_N}\big) - \beta\,\|\mathcal{Y} - [\![\boldsymbol{\Xi}^{(1)},\boldsymbol{\Xi}^{(2)},\ldots,\boldsymbol{\Xi}^{(N)}]\!] - \mathcal{E}\|_F^2\Bigg\} \qquad (14)$$

with $\boldsymbol{\Gamma} = \mathrm{diag}\{\gamma_1,\cdots,\gamma_L\}$.

Since the factor matrices and other variables are nonlinearly coupled in (14), the multiple integrations in (13) are analytically intractable, which prohibits exact Bayesian inference. To handle this problem, Monte Carlo statistical methods [25], [26], in which a large number of random samples are generated from the joint distribution and marginalization is approximated by operations on the samples, can be explored. These Monte Carlo based approximations can approach the exact multiple integrations when the number of samples approaches infinity, which however is computationally demanding [27]. More recently, variational inference, in which another distribution that is close to the true posterior distribution in the Kullback-Leibler (KL) divergence sense is sought, has been exploited to give deterministic approximations to the intractable multiple integrations [29].

More specifically, in variational inference, a variational distribution with probability density function $Q(\boldsymbol{\Theta})$ that is the closest among a given set of distributions to the true posterior distribution $p(\boldsymbol{\Theta}\,|\,\mathcal{Y}) = p(\boldsymbol{\Theta},\mathcal{Y})/p(\mathcal{Y})$ in the KL divergence sense is sought [29]:

$$\mathrm{KL}\big(Q(\boldsymbol{\Theta})\,\|\,p(\boldsymbol{\Theta}\,|\,\mathcal{Y})\big) \triangleq -\mathrm{E}_{Q(\boldsymbol{\Theta})}\Big\{\ln\frac{p(\boldsymbol{\Theta}\,|\,\mathcal{Y})}{Q(\boldsymbol{\Theta})}\Big\}. \qquad (15)$$

The KL divergence vanishes when $Q(\boldsymbol{\Theta}) = p(\boldsymbol{\Theta}\,|\,\mathcal{Y})$ if no constraint is imposed on $Q(\boldsymbol{\Theta})$, which however leads us back to the original intractable posterior distribution. A common approach is to apply the mean-field approximation, which assumes that the variational probability density takes a fully factorized form $Q(\boldsymbol{\Theta}) = \prod_{k}Q(\Theta_k)$, $\Theta_k \in \boldsymbol{\Theta}$. Furthermore, to facilitate the manipulation of the hard constraints on the first $P$ factor matrices, their variational densities are assumed to take a Dirac delta functional form $Q(\boldsymbol{\Xi}^{(k)}) = \delta(\boldsymbol{\Xi}^{(k)} - \hat{\boldsymbol{\Xi}}^{(k)})$ for $k = 1,2,\cdots,P$, where $\hat{\boldsymbol{\Xi}}^{(k)}$ is a parameter to be derived.

Under these approximations, the probability density functions $Q(\Theta_k)$ of the variational distribution can be analytically obtained via [29]

$$Q(\boldsymbol{\Xi}^{(k)}) = \delta\Big(\boldsymbol{\Xi}^{(k)} - \underbrace{\arg\max_{\boldsymbol{\Xi}^{(k)}}\,\mathrm{E}_{\prod_{\Theta_j\neq\boldsymbol{\Xi}^{(k)}}Q(\Theta_j)}\big[\ln p(\mathcal{Y},\boldsymbol{\Theta})\big]}_{\triangleq\,\hat{\boldsymbol{\Xi}}^{(k)}}\Big), \quad k = 1,2,\cdots,P, \qquad (16)$$

and

$$Q(\Theta_k) \propto \exp\Big\{\mathrm{E}_{\prod_{j\neq k}Q(\Theta_j)}\big[\ln p(\mathcal{Y},\boldsymbol{\Theta})\big]\Big\}, \quad \Theta_k\in\boldsymbol{\Theta}\backslash\{\boldsymbol{\Xi}^{(k)}\}_{k=1}^{P}. \qquad (17)$$

Obviously, these variational distributions are coupled, in the sense that the computation of the variational distribution of one parameter requires knowledge of the variational distributions of the other parameters. Therefore, these variational distributions should be updated iteratively. In the following, the explicit expression for each $Q(\cdot)$ is derived.

[Figure 2: Unfolding operation for a third-order tensor $\mathcal{A}\in\mathbb{C}^{I_1\times I_2\times I_3}$, illustrating $\mathcal{U}^{(1)}[\mathcal{A}] = \sum_{i_1,i_2,i_3} a_{i_1,i_2,i_3}\,\mathbf{e}^{I_1}_{i_1}\big[\mathbf{e}^{I_3}_{i_3}\diamond\mathbf{e}^{I_2}_{i_2}\big]^T$, $\mathcal{U}^{(2)}[\mathcal{A}] = \sum_{i_1,i_2,i_3} a_{i_1,i_2,i_3}\,\mathbf{e}^{I_2}_{i_2}\big[\mathbf{e}^{I_3}_{i_3}\diamond\mathbf{e}^{I_1}_{i_1}\big]^T$, and $\mathcal{U}^{(3)}[\mathcal{A}] = \sum_{i_1,i_2,i_3} a_{i_1,i_2,i_3}\,\mathbf{e}^{I_3}_{i_3}\big[\mathbf{e}^{I_2}_{i_2}\diamond\mathbf{e}^{I_1}_{i_1}\big]^T$.]

A. Derivation for $Q(\boldsymbol{\Xi}^{(k)})$, $1 \le k \le P$

By substituting (14) into (16) and only keeping the terms relevant to $\boldsymbol{\Xi}^{(k)}$ ($1 \le k \le P$), we directly have

$$\hat{\boldsymbol{\Xi}}^{(k)} = \arg\max_{\boldsymbol{\Xi}^{(k)}\in\mathcal{V}_L(\mathbb{C}^{I_k})}\ \mathrm{E}_{\prod_{\Theta_j\neq\boldsymbol{\Xi}^{(k)}}Q(\Theta_j)}\Big[-\beta\,\|\mathcal{Y} - [\![\boldsymbol{\Xi}^{(1)},\cdots,\boldsymbol{\Xi}^{(N)}]\!] - \mathcal{E}\|_F^2\Big]. \qquad (18)$$

To expand the square of the Frobenius norm inside the expectation in (18), we use the result that $\|\mathcal{A}\|_F^2 = \|\mathcal{U}^{(k)}[\mathcal{A}]\|_F^2 = \mathrm{Tr}\big(\mathcal{U}^{(k)}[\mathcal{A}]\,(\mathcal{U}^{(k)}[\mathcal{A}])^{H}\big)$ [14], where the unfolding operation $\{\mathcal{U}^{(k)}[\mathcal{A}]\}_{k=1,2,\cdots,N}$ of an $N$th-order tensor $\mathcal{A}\in\mathbb{C}^{I_1\times\cdots\times I_N}$ along its $k$th mode is specified as $\mathcal{U}^{(k)}[\mathcal{A}] = \sum_{i_1=1}^{I_1}\cdots\sum_{i_N=1}^{I_N} a_{i_1,\cdots,i_N}\,\mathbf{e}^{I_k}_{i_k}\big[\diamond_{n=1,n\neq k}^{N}\,\mathbf{e}^{I_n}_{i_n}\big]^T$. In this expression, the elementary vector $\mathbf{e}^{I_n}_{i_n}\in\mathbb{R}^{I_n\times 1}$ is all zeros except for a 1 at the $i_n$th location, and the multiple Khatri-Rao products $\diamond_{n=1,n\neq k}^{N}\,\mathbf{A}^{(n)} = \mathbf{A}^{(N)}\diamond\cdots\diamond\mathbf{A}^{(k+1)}\diamond\mathbf{A}^{(k-1)}\diamond\cdots\diamond\mathbf{A}^{(1)}$. For example, the unfolding operation for a third-order tensor is illustrated in Figure 2. After expanding the square of the Frobenius norm and taking expectations, the parameter $\hat{\boldsymbol{\Xi}}^{(k)}$ for each variational density in $\{Q(\boldsymbol{\Xi}^{(k)})\}_{k=1}^{P}$ can be obtained from problem (19) below.
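The following NumPy sketch (added for this text) implements one concrete realization of the mode-$k$ unfolding and the multiple Khatri-Rao product defined above, and numerically checks the identity $\mathcal{U}^{(k)}[\![\mathbf{A}^{(1)},\ldots,\mathbf{A}^{(N)}]\!] = \mathbf{A}^{(k)}\big(\mathbf{A}^{(N)}\diamond\cdots\diamond\mathbf{A}^{(k+1)}\diamond\mathbf{A}^{(k-1)}\diamond\cdots\diamond\mathbf{A}^{(1)}\big)^{T}$ on a random example; the index-ordering convention (first remaining index varying fastest) is an assumption chosen to be consistent with the definition above.

```python
import numpy as np

def unfold(T, k):
    """Mode-k unfolding U^(k)[T]: mode k becomes the rows; the remaining
    indices are stacked with the first remaining mode varying fastest."""
    return np.reshape(np.moveaxis(T, k, 0), (T.shape[k], -1), order='F')

def khatri_rao(mats):
    """Column-wise Kronecker product A ⋄ B ⋄ ... (combined left to right)."""
    out = mats[0]
    for A in mats[1:]:
        out = np.einsum('ir,jr->ijr', out, A).reshape(-1, A.shape[1])
    return out

# Random CPD tensor [[A1, A2, A3]] and a numerical check of the unfolding identity
rng = np.random.default_rng(4)
shape, R = (3, 4, 5), 2
A = [rng.standard_normal((I, R)) + 1j * rng.standard_normal((I, R)) for I in shape]
X = np.einsum('ir,jr,kr->ijk', *A)

for k in range(3):
    others = [A[n] for n in reversed(range(3)) if n != k]   # A^(N), ..., excluding mode k
    assert np.allclose(unfold(X, k), A[k] @ khatri_rao(others).T)
print("unfolding identity verified")
```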

$$\hat{\boldsymbol{\Xi}}^{(k)} = \arg\max_{\boldsymbol{\Xi}^{(k)}\in\mathcal{V}_L(\mathbb{C}^{I_k})}\ \mathrm{Tr}\Bigg(\underbrace{\mathrm{E}_{Q(\beta)}[\beta]\,\mathcal{U}^{(k)}\big[\mathcal{Y}-\mathrm{E}_{Q(\mathcal{E})}[\mathcal{E}]\big]\Big(\diamond_{n=1,n\neq k}^{N}\,\mathrm{E}_{Q(\boldsymbol{\Xi}^{(n)})}\big[\boldsymbol{\Xi}^{(n)}\big]\Big)^{*}}_{\triangleq\,\mathbf{F}^{(k)}}\,\boldsymbol{\Xi}^{(k)H} + \boldsymbol{\Xi}^{(k)}\mathbf{F}^{(k)H}\Bigg) - \mathrm{Tr}\Bigg(\boldsymbol{\Xi}^{(k)H}\boldsymbol{\Xi}^{(k)}\underbrace{\bigg[\mathrm{E}_{Q(\beta)}[\beta]\,\mathrm{E}_{\prod_{n=1,n\neq k}^{N}Q(\boldsymbol{\Xi}^{(n)})}\Big[\Big(\diamond_{n=1,n\neq k}^{N}\,\boldsymbol{\Xi}^{(n)}\Big)^{T}\Big(\diamond_{n=1,n\neq k}^{N}\,\boldsymbol{\Xi}^{(n)}\Big)^{*}\Big] + \mathrm{E}_{\prod_{l=1}^{L}Q(\gamma_l)}[\boldsymbol{\Gamma}]\bigg]}_{\triangleq\,\mathbf{G}^{(k)}}\Bigg) \qquad (19)$$

Using the fact that the feasible set for the parameter $\boldsymbol{\Xi}^{(k)}$ is the Stiefel manifold $\mathcal{V}_L(\mathbb{C}^{I_k})$, i.e., $\boldsymbol{\Xi}^{(k)H}\boldsymbol{\Xi}^{(k)} = \mathbf{I}_L$, the term $\mathbf{G}^{(k)}$ is irrelevant to the factor matrix of interest $\boldsymbol{\Xi}^{(k)}$. Consequently, problem (19) is equivalent to

$$\hat{\boldsymbol{\Xi}}^{(k)} = \arg\max_{\boldsymbol{\Xi}^{(k)}\in\mathcal{V}_L(\mathbb{C}^{I_k})}\ \mathrm{Tr}\big(\mathbf{F}^{(k)}\boldsymbol{\Xi}^{(k)H} + \boldsymbol{\Xi}^{(k)}\mathbf{F}^{(k)H}\big), \qquad (20)$$

where $\mathbf{F}^{(k)}$ was defined in the first line of (19). Problem (20) is a non-convex optimization problem, as its feasible set $\mathcal{V}_L(\mathbb{C}^{I_k})$ is non-convex [37]. While in general (20) can be solved by numerical iterative algorithms based on a geometric approach or the alternating direction method of multipliers [37], a closed-form optimal solution can be obtained by noticing that the objective function in (20) has the same functional form as the log of the von Mises-Fisher matrix distribution with parameter matrix $\mathbf{F}^{(k)}$, and the feasible set in (20) also coincides with the support of this von Mises-Fisher matrix distribution [35]. As a result, we have

$$\hat{\boldsymbol{\Xi}}^{(k)} = \arg\max_{\boldsymbol{\Xi}^{(k)}}\ \ln\mathrm{VMF}\big(\boldsymbol{\Xi}^{(k)}\,|\,\mathbf{F}^{(k)}\big). \qquad (21)$$

Then, the closed-form solution for problem (21) can be acquired using Property 1 below, which has been proved in [33].

Property 1. Suppose the matrix $\mathbf{A}\in\mathbb{C}^{\kappa_1\times\kappa_2}$ follows a von Mises-Fisher matrix distribution with parameter matrix $\mathbf{F}\in\mathbb{C}^{\kappa_1\times\kappa_2}$. If $\mathbf{F} = \mathbf{U}\boldsymbol{\Phi}\mathbf{V}^{H}$ is the SVD of the matrix $\mathbf{F}$, then the unique mode of $\mathrm{VMF}(\mathbf{A}\,|\,\mathbf{F})$ is $\mathbf{U}\mathbf{V}^{H}$.

From Property 1, it is easy to conclude that $\hat{\boldsymbol{\Xi}}^{(k)} = \boldsymbol{\Upsilon}^{(k)}\boldsymbol{\Pi}^{(k)H}$, where $\boldsymbol{\Upsilon}^{(k)}$ and $\boldsymbol{\Pi}^{(k)}$ are the left-orthonormal matrix and right-orthonormal matrix from the SVD of $\mathbf{F}^{(k)}$, respectively.
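A small NumPy sketch (added for this text) of the closed-form update implied by Property 1: given $\mathbf{F}^{(k)}$, the estimate $\hat{\boldsymbol{\Xi}}^{(k)} = \mathbf{U}\mathbf{V}^{H}$ is taken from the SVD of $\mathbf{F}^{(k)}$; the comparison against randomly drawn points on the Stiefel manifold is only an illustrative sanity check.

```python
import numpy as np

rng = np.random.default_rng(5)
I, L = 8, 3                                    # illustrative sizes, I >= L

F = rng.standard_normal((I, L)) + 1j * rng.standard_normal((I, L))
U, _, Vh = np.linalg.svd(F, full_matrices=False)
Xi_hat = U @ Vh                                # mode of VMF(. | F), cf. Property 1

def objective(Xi):
    # objective of (20): Tr(F Xi^H + Xi F^H) = 2 Re Tr(F Xi^H)
    return 2.0 * np.real(np.trace(F @ Xi.conj().T))

# Xi_hat is feasible (orthonormal columns) and beats random Stiefel points
assert np.allclose(Xi_hat.conj().T @ Xi_hat, np.eye(L))
best_random = max(objective(np.linalg.qr(rng.standard_normal((I, L))
                                         + 1j * rng.standard_normal((I, L)))[0])
                  for _ in range(1000))
print(objective(Xi_hat) >= best_random)        # True
```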

B. Derivation for $Q(\boldsymbol{\Xi}^{(k)})$, $P+1 \le k \le N$

Using (14) and (17), the variational density $Q(\boldsymbol{\Xi}^{(k)})$ ($P+1 \le k \le N$) is derived in Appendix A to be a circularly-symmetric complex matrix normal distribution [35] as

$$Q\big(\boldsymbol{\Xi}^{(k)}\big) = \mathcal{CMN}\big(\boldsymbol{\Xi}^{(k)}\,|\,\mathbf{M}^{(k)},\,\mathbf{I}_{I_k},\,\boldsymbol{\Sigma}^{(k)}\big), \qquad (22)$$

where

$$\boldsymbol{\Sigma}^{(k)} = \bigg(\mathrm{E}_{Q(\beta)}[\beta]\,\mathrm{E}_{\prod_{n=1,n\neq k}^{N}Q(\boldsymbol{\Xi}^{(n)})}\Big[\Big(\diamond_{n=1,n\neq k}^{N}\,\boldsymbol{\Xi}^{(n)}\Big)^{T}\Big(\diamond_{n=1,n\neq k}^{N}\,\boldsymbol{\Xi}^{(n)}\Big)^{*}\Big] + \mathrm{E}_{\prod_{l=1}^{L}Q(\gamma_l)}[\boldsymbol{\Gamma}]\bigg)^{-1} \qquad (23)$$

$$\mathbf{M}^{(k)} = \mathrm{E}_{Q(\beta)}[\beta]\,\mathcal{U}^{(k)}\big[\mathcal{Y}-\mathrm{E}_{Q(\mathcal{E})}[\mathcal{E}]\big]\Big(\diamond_{n=1,n\neq k}^{N}\,\mathrm{E}_{Q(\boldsymbol{\Xi}^{(n)})}\big[\boldsymbol{\Xi}^{(n)}\big]\Big)^{*}\boldsymbol{\Sigma}^{(k)}. \qquad (24)$$

Due to the fact that $Q(\boldsymbol{\Xi}^{(k)})$ is Gaussian, the parameter $\mathbf{M}^{(k)}$ is both the expectation and the mode of the variational density $Q(\boldsymbol{\Xi}^{(k)})$.

To calculate $\mathbf{M}^{(k)}$, some expectation computations are required, as shown in (23) and (24). For those with the form $\mathrm{E}_{Q(\Theta_k)}[\Theta_k]$ where $\Theta_k\in\boldsymbol{\Theta}$, the value can be easily obtained if the corresponding $Q(\Theta_k)$ is available. The remaining challenge stems from the expectation $\mathrm{E}_{\prod_{n=1,n\neq k}^{N}Q(\boldsymbol{\Xi}^{(n)})}\big[\big(\diamond_{n=1,n\neq k}^{N}\,\boldsymbol{\Xi}^{(n)}\big)^{T}\big(\diamond_{n=1,n\neq k}^{N}\,\boldsymbol{\Xi}^{(n)}\big)^{*}\big]$ in (23). But its calculation becomes straightforward after exploiting the orthonormal structure of $\{\hat{\boldsymbol{\Xi}}^{(k)}\}_{k=1}^{P}$ and the property of multiple Khatri-Rao products, as presented in the following property.

Property 2. Suppose the matrix $\mathbf{A}^{(n)}\in\mathbb{C}^{\kappa_n\times\rho} \sim \delta(\mathbf{A}^{(n)} - \hat{\mathbf{A}}^{(n)})$ for $1 \le n \le P$, where $\hat{\mathbf{A}}^{(n)}\in\mathcal{V}_{\rho}(\mathbb{C}^{\kappa_n})$ and $P < N$, and the matrix $\mathbf{A}^{(n)}\in\mathbb{C}^{\kappa_n\times\rho} \sim \mathcal{CMN}(\mathbf{A}^{(n)}\,|\,\mathbf{M}^{(n)},\,\mathbf{I}_{\kappa_n},\,\boldsymbol{\Sigma}^{(n)})$ for $P+1 \le n \le N$. Then,

$$\mathrm{E}_{\prod_{n=1,n\neq k}^{N}p(\mathbf{A}^{(n)})}\Big[\Big(\diamond_{n=1,n\neq k}^{N}\,\mathbf{A}^{(n)}\Big)^{T}\Big(\diamond_{n=1,n\neq k}^{N}\,\mathbf{A}^{(n)}\Big)^{*}\Big] = \mathcal{D}\Big[\circledast_{n=P+1,n\neq k}^{N}\big(\mathbf{M}^{(n)H}\mathbf{M}^{(n)} + \kappa_n\boldsymbol{\Sigma}^{(n)}\big)^{*}\Big], \qquad (25)$$

where $\mathcal{D}[\mathbf{A}]$ is a diagonal matrix taking the diagonal elements from $\mathbf{A}$, and the multiple Hadamard products $\circledast_{n=1,n\neq k}^{N}\,\mathbf{A}^{(n)} = \mathbf{A}^{(N)}\circledast\cdots\circledast\mathbf{A}^{(k+1)}\circledast\mathbf{A}^{(k-1)}\circledast\cdots\circledast\mathbf{A}^{(1)}$.

Proof: See Appendix B.

C. Derivation for $Q(\mathcal{E})$

The variational density $Q(\mathcal{E})$ can be obtained by taking only the terms relevant to $\mathcal{E}$ after substituting (14) into (17), and can be expressed as

$$Q(\mathcal{E}) \propto \prod_{i_1=1}^{I_1}\cdots\prod_{i_N=1}^{I_N}\exp\bigg\{\mathrm{E}_{\prod_{\Theta_j\neq\mathcal{E}}Q(\Theta_j)}\Big[-\zeta_{i_1,\cdots,i_N}\big|\mathcal{E}_{i_1,\cdots,i_N}\big|^{2} - \beta\Big|\mathcal{Y}_{i_1,\cdots,i_N} - \sum_{l=1}^{L}\prod_{n=1}^{N}\boldsymbol{\Xi}^{(n)}_{i_n,l} - \mathcal{E}_{i_1,\cdots,i_N}\Big|^{2}\Big]\bigg\}. \qquad (26)$$

After taking expectations, the term inside the exponent of (26) is

$$-\mathcal{E}^{*}_{i_1,\cdots,i_N}\big(\underbrace{\mathrm{E}_{Q(\beta)}[\beta] + \mathrm{E}_{Q(\zeta_{i_1,\cdots,i_N})}[\zeta_{i_1,\cdots,i_N}]}_{\triangleq\,p_{i_1,\cdots,i_N}}\big)\mathcal{E}_{i_1,\cdots,i_N} + 2\mathrm{Re}\bigg[\mathcal{E}^{*}_{i_1,\cdots,i_N}\,p_{i_1,\cdots,i_N}\,\underbrace{\mathrm{E}_{Q(\beta)}[\beta]\,p^{-1}_{i_1,\cdots,i_N}\Big(\mathcal{Y}_{i_1,\cdots,i_N} - \sum_{l=1}^{L}\prod_{n=1}^{N}\mathrm{E}_{Q(\boldsymbol{\Xi}^{(n)})}\big[\boldsymbol{\Xi}^{(n)}_{i_n,l}\big]\Big)}_{\triangleq\,m_{i_1,\cdots,i_N}}\bigg]. \qquad (27)$$

Since (27) is a quadratic function with respect to $\mathcal{E}_{i_1,\cdots,i_N}$, it is easy to show that

$$Q(\mathcal{E}) = \prod_{i_1=1}^{I_1}\cdots\prod_{i_N=1}^{I_N}\mathcal{CN}\big(\mathcal{E}_{i_1,\cdots,i_N}\,|\,m_{i_1,\cdots,i_N},\,p^{-1}_{i_1,\cdots,i_N}\big). \qquad (28)$$


Notice from (27) that the computation of the outlier mean $m_{i_1,\cdots,i_N}$ can be rewritten as $m_{i_1,\cdots,i_N} = n_1 n_2$, where

$$n_1 = \frac{\big(\mathrm{E}_{Q(\zeta_{i_1,\cdots,i_N})}[\zeta_{i_1,\cdots,i_N}]\big)^{-1}}{\big(\mathrm{E}_{Q(\zeta_{i_1,\cdots,i_N})}[\zeta_{i_1,\cdots,i_N}]\big)^{-1} + \big(\mathrm{E}_{Q(\beta)}[\beta]\big)^{-1}}$$

and $n_2 = \mathcal{Y}_{i_1,\cdots,i_N} - \sum_{l=1}^{L}\big(\prod_{n=1}^{N}\mathrm{E}_{Q(\boldsymbol{\Xi}^{(n)})}[\boldsymbol{\Xi}^{(n)}_{i_n,l}]\big)$. From the general data model in (6), it can be seen that $n_2$ consists of the estimated outlier plus noise. On the other hand, since $\big(\mathrm{E}_{Q(\zeta_{i_1,\cdots,i_N})}[\zeta_{i_1,\cdots,i_N}]\big)^{-1}$ and $\big(\mathrm{E}_{Q(\beta)}[\beta]\big)^{-1}$ can be interpreted as the estimated power of the outliers and of the noise, respectively, $n_1$ represents the fraction of the outlier power in the estimated outlier-plus-noise term. Therefore, if the estimated power of the outliers $\big(\mathrm{E}_{Q(\zeta_{i_1,\cdots,i_N})}[\zeta_{i_1,\cdots,i_N}]\big)^{-1}$ goes to zero, the outlier mean $m_{i_1,\cdots,i_N}$ becomes zero accordingly.
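A tiny element-wise illustration (added for this text) of the posterior outlier statistics $p$ and $m$ from (27)-(28), written in the $n_1 n_2$ form discussed above; all numerical values are made up for illustration.

```python
import numpy as np

E_beta = 10.0                  # E_Q[beta]: estimated inverse noise power (illustrative)
E_zeta = np.array([1e6, 1.0])  # E_Q[zeta] for two entries: no outlier vs. strong outlier
residual = np.array([0.3 + 0.1j, 5.0 - 2.0j])   # Y - sum_l prod_n E[Xi^(n)_{i_n,l}] (illustrative)

p = E_beta + E_zeta            # posterior precision of E_{i1,...,iN}, cf. (27)
m = (E_beta / p) * residual    # posterior mean of the outlier

# Equivalent n1 * n2 form: n1 = (1/E_zeta) / (1/E_zeta + 1/E_beta)
n1 = (1.0 / E_zeta) / (1.0 / E_zeta + 1.0 / E_beta)
print(np.allclose(m, n1 * residual))   # True
print(m)   # first entry ~ 0 (outlier power ~ 0), second entry close to the residual
```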

D. Derivations for $Q(\gamma_l)$, $Q(\zeta_{i_1,\cdots,i_N})$, and $Q(\beta)$

Using (14) and (17) again, the variational density $Q(\gamma_l)$ can be expressed as

$$Q(\gamma_l) \propto \exp\bigg\{\Big(\underbrace{\sum_{n=P+1}^{N}I_n}_{\triangleq\,\tilde{a}_l} - 1\Big)\ln\gamma_l - \gamma_l\underbrace{\Big[\sum_{n=P+1}^{N}\mathrm{E}_{Q(\boldsymbol{\Xi}^{(n)})}\big[\boldsymbol{\Xi}^{(n)H}_{:,l}\boldsymbol{\Xi}^{(n)}_{:,l}\big]\Big]}_{\triangleq\,\tilde{b}_l}\bigg\}, \qquad (29)$$

which has the same functional form as the probability density function of the gamma distribution, i.e., $Q(\gamma_l) = \mathrm{gamma}(\gamma_l\,|\,\tilde{a}_l,\tilde{b}_l)$. Since $\mathrm{E}_{Q(\gamma_l)}[\gamma_l] = \tilde{a}_l/\tilde{b}_l$ is required for updating the variational distributions of other variables in $\boldsymbol{\Theta}$, we need to compute $\tilde{a}_l$ and $\tilde{b}_l$. While the computation of $\tilde{a}_l$ is straightforward, the computation of $\tilde{b}_l$ can be facilitated by using the correlation property of the matrix normal distribution $\mathrm{E}_{Q(\boldsymbol{\Xi}^{(n)})}\big[\boldsymbol{\Xi}^{(n)H}_{:,l}\boldsymbol{\Xi}^{(n)}_{:,l}\big] = \mathbf{M}^{(n)H}_{:,l}\mathbf{M}^{(n)}_{:,l} + I_n\boldsymbol{\Sigma}^{(n)}_{l,l}$ [35] for $P+1 \le n \le N$.

Similarly, using (14) and (17), the variational densities $Q(\zeta_{i_1,\cdots,i_N})$ and $Q(\beta)$ can be found to be gamma distributions as

$$Q(\zeta_{i_1,\cdots,i_N}) = \mathrm{gamma}\big(\zeta_{i_1,\cdots,i_N}\,|\,\tilde{c}_{i_1,\cdots,i_N},\tilde{d}_{i_1,\cdots,i_N}\big) \qquad (30)$$

$$Q(\beta) = \mathrm{gamma}\big(\beta\,|\,\tilde{e},\tilde{f}\big) \qquad (31)$$

with parameters $\tilde{c}_{i_1,\cdots,i_N} = c_{i_1,\cdots,i_N} + 1$, $\tilde{d}_{i_1,\cdots,i_N} = d_{i_1,\cdots,i_N} + (m_{i_1,\cdots,i_N})^{*}m_{i_1,\cdots,i_N} + p^{-1}_{i_1,\cdots,i_N}$, $\tilde{e} = \prod_{n=1}^{N}I_n$, and $\tilde{f} = \mathrm{E}_{\prod_{n=1}^{N}Q(\boldsymbol{\Xi}^{(n)})Q(\mathcal{E})}\big[\|\mathcal{Y} - [\![\boldsymbol{\Xi}^{(1)},\cdots,\boldsymbol{\Xi}^{(N)}]\!] - \mathcal{E}\|_F^2\big]$. For $\tilde{c}_{i_1,\cdots,i_N}$, $\tilde{d}_{i_1,\cdots,i_N}$ and $\tilde{e}$, the computations are straightforward. $\tilde{f}$ is derived in Appendix C to be

$$\tilde{f} = \|\mathcal{Y}-\mathcal{M}\|_F^2 + \sum_{i_1=1}^{I_1}\cdots\sum_{i_N=1}^{I_N}p^{-1}_{i_1,\cdots,i_N} + \mathrm{Tr}\Big(\mathcal{D}\Big[\circledast_{n=P+1}^{N}\big(\mathbf{M}^{(n)H}\mathbf{M}^{(n)}+I_n\boldsymbol{\Sigma}^{(n)}\big)^{*}\Big]\Big) - 2\mathrm{Re}\Big(\mathrm{Tr}\Big(\mathcal{U}^{(1)}\big[\mathcal{Y}-\mathcal{M}\big]\Big[\Big(\diamond_{n=P+1}^{N}\,\mathbf{M}^{(n)}\Big)\diamond\Big(\diamond_{n=2}^{P}\,\hat{\boldsymbol{\Xi}}^{(n)}\Big)\Big]^{*}\hat{\boldsymbol{\Xi}}^{(1)H}\Big)\Big), \qquad (32)$$

where $\mathcal{M}$ is a tensor with its $(i_1,\cdots,i_N)$th element being $m_{i_1,\cdots,i_N}$, and $\mathrm{Re}(\cdot)$ denotes the real part of its argument. Although equation (32) for computing $\tilde{f}$ is complicated, its meaning is clear when we refer to its definition below (31), from which it can be seen that $\tilde{f}$ represents the estimate of the overall noise power.

E. Summary of the Iterative Algorithm

From the expressions for $Q(\Theta_k)$ evaluated above, it is seen that the calculation of a particular $Q(\Theta_k)$ relies on the statistics of the other variables in $\boldsymbol{\Theta}$. As a result, the variational distribution for each variable in $\boldsymbol{\Theta}$ should be iteratively updated. The iterative algorithm is summarized as follows.

Initializations: Choose $L > R$ and initial values $\{\hat{\boldsymbol{\Xi}}^{(n,0)}\}_{n=1}^{P}$, $\{\mathbf{M}^{(n,0)},\boldsymbol{\Sigma}^{(n,0)}\}_{n=P+1}^{N}$, $\tilde{b}^{0}_{l}$, $\{\tilde{c}^{0}_{i_1,\cdots,i_N},\tilde{d}^{0}_{i_1,\cdots,i_N}\}$ and $\tilde{f}^{0}$ for all $l$ and $i_1,\cdots,i_N$. Let $\tilde{a}_l = \sum_{n=P+1}^{N}I_n$ and $\tilde{e} = \prod_{n=1}^{N}I_n$.

Iterations: For the $t$th iteration ($t \ge 1$),

Update the statistics of the outliers $\{p_{i_1,\cdots,i_N}, m_{i_1,\cdots,i_N}\}_{i_1=1,\cdots,i_N=1}^{I_1,\cdots,I_N}$:

$$p^{t}_{i_1,\cdots,i_N} = \frac{\tilde{e}}{\tilde{f}^{t-1}} + \frac{\tilde{c}^{t-1}_{i_1,\cdots,i_N}}{\tilde{d}^{t-1}_{i_1,\cdots,i_N}} \qquad (33)$$

$$m^{t}_{i_1,\cdots,i_N} = \frac{\tilde{e}}{\tilde{f}^{t-1}p^{t}_{i_1,\cdots,i_N}}\bigg(\mathcal{Y}_{i_1,\cdots,i_N} - \sum_{l=1}^{L}\Big[\Big(\prod_{n=1}^{P}\hat{\boldsymbol{\Xi}}^{(n,t-1)}_{i_n,l}\Big)\Big(\prod_{n=P+1}^{N}\mathbf{M}^{(n,t-1)}_{i_n,l}\Big)\Big]\bigg) \qquad (34)$$

Update the statistics of the factor matrices $\{\mathbf{M}^{(k)},\boldsymbol{\Sigma}^{(k)}\}_{k=P+1}^{N}$:

$$\boldsymbol{\Sigma}^{(k,t)} = \bigg(\frac{\tilde{e}}{\tilde{f}^{t-1}}\,\mathcal{D}\Big[\circledast_{n=P+1,n\neq k}^{N}\big(\mathbf{M}^{(n,t-1)H}\mathbf{M}^{(n,t-1)}+I_n\boldsymbol{\Sigma}^{(n,t-1)}\big)^{*}\Big] + \mathrm{diag}\Big\{\frac{\tilde{a}_1}{\tilde{b}^{t-1}_{1}},\ldots,\frac{\tilde{a}_L}{\tilde{b}^{t-1}_{L}}\Big\}\bigg)^{-1} \qquad (35)$$

$$\mathbf{M}^{(k,t)} = \frac{\tilde{e}}{\tilde{f}^{t-1}}\Big(\mathcal{U}^{(k)}\big[\mathcal{Y}-\mathcal{M}^{t}\big]\Big)\Big[\Big(\diamond_{n=P+1,n\neq k}^{N}\,\mathbf{M}^{(n,t-1)}\Big)\diamond\Big(\diamond_{n=1}^{P}\,\hat{\boldsymbol{\Xi}}^{(n,t-1)}\Big)\Big]^{*}\boldsymbol{\Sigma}^{(k,t)} \qquad (36)$$

Update the orthonormal factor matrices $\{\hat{\boldsymbol{\Xi}}^{(k)}\}_{k=1}^{P}$:

$$\big[\boldsymbol{\Upsilon}^{(k,t)},\boldsymbol{\Pi}^{(k,t)}\big] = \mathrm{SVD}\bigg[\frac{\tilde{e}}{\tilde{f}^{t-1}}\,\mathcal{U}^{(k)}\big[\mathcal{Y}-\mathcal{M}^{t}\big]\Big[\Big(\diamond_{n=P+1}^{N}\,\mathbf{M}^{(n,t)}\Big)\diamond\Big(\diamond_{n=1,n\neq k}^{P}\,\hat{\boldsymbol{\Xi}}^{(n,t-1)}\Big)\Big]^{*}\bigg], \quad \hat{\boldsymbol{\Xi}}^{(k,t)} = \boldsymbol{\Upsilon}^{(k,t)}\boldsymbol{\Pi}^{(k,t)H} \qquad (37)$$

Update $\{\tilde{b}_l\}_{l=1}^{L}$, $\{\tilde{d}_{i_1,\cdots,i_N}\}_{i_1=1,\cdots,i_N=1}^{I_1,\cdots,I_N}$ and $\tilde{f}$:

$$\tilde{b}^{t}_{l} = \sum_{n=P+1}^{N}\Big[\mathbf{M}^{(n,t)H}_{:,l}\mathbf{M}^{(n,t)}_{:,l} + I_n\boldsymbol{\Sigma}^{(n,t)}_{l,l}\Big] \qquad (38)$$

$$\tilde{c}^{t}_{i_1,\cdots,i_N} = \tilde{c}^{0}_{i_1,\cdots,i_N} + 1 \qquad (39)$$

$$\tilde{d}^{t}_{i_1,\cdots,i_N} = \tilde{d}^{0}_{i_1,\cdots,i_N} + \big(m^{t}_{i_1,\cdots,i_N}\big)^{*}m^{t}_{i_1,\cdots,i_N} + 1/p^{t}_{i_1,\cdots,i_N} \qquad (40)$$

$$\tilde{f}^{t} = \|\mathcal{Y}-\mathcal{M}^{t}\|_F^2 + \sum_{i_1=1}^{I_1}\cdots\sum_{i_N=1}^{I_N}\big(p^{t}_{i_1,\cdots,i_N}\big)^{-1} + \mathrm{Tr}\Big(\mathcal{D}\Big[\circledast_{n=P+1}^{N}\big(\mathbf{M}^{(n,t)H}\mathbf{M}^{(n,t)}+I_n\boldsymbol{\Sigma}^{(n,t)}\big)^{*}\Big]\Big) - 2\mathrm{Re}\Big(\mathrm{Tr}\Big(\mathcal{U}^{(1)}\big[\mathcal{Y}-\mathcal{M}^{t}\big]\Big[\Big(\diamond_{n=P+1}^{N}\,\mathbf{M}^{(n,t)}\Big)\diamond\Big(\diamond_{n=2}^{P}\,\hat{\boldsymbol{\Xi}}^{(n,t)}\Big)\Big]^{*}\hat{\boldsymbol{\Xi}}^{(1,t)H}\Big)\Big) \qquad (41)$$

Until Convergence
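To show the structure of the updates, the following simplified NumPy sketch (added for this text) implements only the factor-matrix updates (35)-(37) for one sweep, using the unfolding and Khatri-Rao conventions of Section IV-A; the outlier updates (33)-(34) and the hyper-parameter updates (38)-(41) are omitted, and $\mathrm{E}_Q[\beta]$, $\mathrm{E}_Q[\gamma_l]$ and the outlier mean tensor are treated as fixed, illustrative inputs. It is a sketch of the update structure, not the authors' reference implementation.

```python
import numpy as np

def unfold(T, k):
    # mode-k unfolding: first remaining index varies fastest (Fortran order)
    return np.reshape(np.moveaxis(T, k, 0), (T.shape[k], -1), order='F')

def khatri_rao(mats):
    # column-wise Kronecker product, combined left to right
    out = mats[0]
    for A in mats[1:]:
        out = np.einsum('ir,jr->ijr', out, A).reshape(-1, A.shape[1])
    return out

def factor_updates(Y, Mo, means, covs, E_beta, E_gamma, P):
    """One sweep of (35)-(37) for an N-way tensor.
    means[n]: current mean of factor n (orthonormal point estimate for n < P);
    covs[n]: current column covariance for n >= P (ignored for n < P);
    Mo: current outlier mean tensor; E_beta, E_gamma: fixed expectations."""
    N = Y.ndim
    L = means[0].shape[1]
    for k in range(P, N):                                   # Gaussian factors, (35)-(36)
        G = np.ones((L, L), dtype=complex)
        for n in range(P, N):
            if n != k:
                G *= np.conj(means[n].conj().T @ means[n] + Y.shape[n] * covs[n])
        Sigma = np.linalg.inv(E_beta * np.diag(np.diag(G)) + np.diag(E_gamma))
        kr = khatri_rao([means[n] for n in reversed(range(N)) if n != k])
        means[k] = E_beta * unfold(Y - Mo, k) @ np.conj(kr) @ Sigma
        covs[k] = Sigma
    for k in range(P):                                      # orthonormal factors, (37)
        kr = khatri_rao([means[n] for n in reversed(range(N)) if n != k])
        F = E_beta * unfold(Y - Mo, k) @ np.conj(kr)
        U, _, Vh = np.linalg.svd(F, full_matrices=False)
        means[k] = U @ Vh
    return means, covs

# Illustrative call: 12 x 12 x 12 tensor, over-estimated rank L = 8, P = 1
rng = np.random.default_rng(7)
shape, L, P = (12, 12, 12), 8, 1
Y = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
means = [np.linalg.qr(rng.standard_normal((I, L)) + 1j * rng.standard_normal((I, L)))[0]
         for I in shape]
covs = [np.eye(L, dtype=complex) for _ in shape]
means, covs = factor_updates(Y, np.zeros(shape, complex), means, covs,
                             E_beta=1.0, E_gamma=np.ones(L), P=P)
print([m.shape for m in means])
```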

F. Further Discussions

To gain more insights from the proposed CPD algorithm above, discussions on its convergence property, automatic rank determination, relationship to the OALS algorithm and computational complexity are presented in the following.

1) Convergence Property: Although the functional minimization of the KL divergence in (15) is non-convex over the mean-field family $Q(\boldsymbol{\Theta}) = \prod_{k}Q(\Theta_k)$, it is convex with respect to a single variational density $Q(\Theta_k)$ when the others $\{Q(\Theta_j)\,|\,j\neq k\}$ are fixed [29]. Therefore, the proposed algorithm, which iteratively updates the optimal solution for each $\Theta_k$, is essentially a coordinate-descent algorithm in the functional space of variational distributions, with each update solving a convex problem. This guarantees a monotonic decrease of the KL divergence in (15), and the proposed algorithm is guaranteed to converge to at least a stationary point [Theorem 2.1, 39].

2) Automatic Rank Determination: The automatic rank determination for the tensor CPD uses an idea from Bayesian model selection (or the Bayesian Occam's razor) [pp. 157, 27]. More specifically, the parameters $\{\gamma_l\}_{l=1}^{L}$ control the model complexity, and their optimal variational densities are obtained together with those of the other parameters by minimizing the KL divergence. After convergence, if some $\mathrm{E}[\gamma_l]$ are very large, e.g., $10^6$, this indicates that their corresponding columns in $\{\mathbf{M}^{(n)}\}_{n=P+1}^{N}$ can be "switched off", as they play no role in explaining the data. Furthermore, according to the definition of the tensor CPD in (1), the corresponding columns in $\{\hat{\boldsymbol{\Xi}}^{(n)}\}_{n=1}^{P}$ should also be pruned accordingly. Finally, the learned tensor rank $R$ is the number of remaining columns in each estimated factor matrix $\hat{\boldsymbol{\Xi}}^{(n)}$.

3) Relationship to the OALS: If the tensor rank $R$ is known, the regularization term in (8) is not needed, and consequently there are no parameters $\{\tilde{a}_l,\tilde{b}_l\}_{l=1}^{L}$. Further restricting $Q(\boldsymbol{\Xi}^{(k)})$ to be $\delta(\boldsymbol{\Xi}^{(k)}-\mathbf{M}^{(k)})$ for $P+1 \le k \le N$, it can be shown that all the equations in the above algorithm still hold, except that the term $I_n\boldsymbol{\Sigma}^{(n,t-1)}$ in (35) and the term $I_n\boldsymbol{\Sigma}^{(n,t)}$ in (41) are removed. Then, the proposed algorithm is a robust version of the OALS, even covering the case of $P = N$. If we further have the knowledge that outliers do not exist, only (35)-(37) remain. Interestingly, this resulting algorithm is exactly the OALS algorithm in [17]. In this regard, the proposed algorithm not only provides a probabilistic interpretation of the OALS algorithm, but also has additional properties in automatic rank determination, outlier removal and learning of the noise power.

4) Computational Complexity: For each iteration, the complexity is dominated by updating each factor matrix, costing $\mathcal{O}\big(\sum_{n=1}^{N}I_nL^2 + N\prod_{n=1}^{N}I_nL\big)$. Thus, the overall complexity is about $\mathcal{O}\big(q(\sum_{n=1}^{N}I_nL^2 + N\prod_{n=1}^{N}I_nL)\big)$, where $q$ is the number of iterations needed for convergence. On the other hand, for the OALS algorithm with the exact tensor rank $R$, the complexity is $\mathcal{O}\big(m(\sum_{n=1}^{N}I_nR^2 + N\prod_{n=1}^{N}I_nR)\big)$, where $m$ is the number of iterations needed for convergence. Therefore, for each iteration, the complexity of the proposed algorithm is comparable to that of the OALS algorithm.

V. SIMULATION RESULTS AND DISCUSSIONS

In this section, numerical simulations are presented to assess the performance of the proposed algorithm (labeled as VB) using synthetic data and two applications, in comparison with various state-of-the-art tensor CPD algorithms. The algorithms being compared include the ALS [15], the simultaneous diagonalization method for coupled tensor CPD (labeled as SD) [40], the direct algorithm for CPD followed by enhanced ALS (labeled as DIAG-A) [41], the Bayesian tensor CPD (labeled as BCPD) [32], the robust iteratively reweighted ALS (labeled as IRALS) [42], and the OALS algorithm (labeled as OALS) [17]. In all experiments, three outlier models are considered, and they are listed in Table I. For all the simulated algorithms, the initial factor matrix $\hat{\boldsymbol{\Xi}}^{(n,0)}$ is set as the matrix consisting of the $L$ leading left singular vectors of $\mathcal{U}^{(n)}[\mathcal{Y}]$, where $L = \max\{I_1,I_2,\cdots,I_N\}$ for the proposed algorithm and the BCPD, and $L = R$ for the other algorithms. The initial parameters of the proposed algorithm $\{\tilde{c}^{0}_{i_1,\cdots,i_N},\tilde{d}^{0}_{i_1,\cdots,i_N}\}$ are set to $10^{-6}$ for all $i_1,\cdots,i_N$, $\tilde{b}^{0}_{l} = \sum_{n=P+1}^{N}I_n$ for all $l$, $\tilde{f}^{0} = \prod_{n=1}^{N}I_n$, and $\{\boldsymbol{\Sigma}^{(n,0)}\}_{n=P+1}^{N}$ are all set to $\mathbf{I}_L$. All the algorithms terminate at the $t$th iteration when $\|[\![\mathbf{A}^{(1,t)},\mathbf{A}^{(2,t)},\cdots,\mathbf{A}^{(N,t)}]\!] - [\![\mathbf{A}^{(1,t-1)},\mathbf{A}^{(2,t-1)},\cdots,\mathbf{A}^{(N,t-1)}]\!]\|_F^2 < 10^{-6}$ or the iteration number exceeds 2000.
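A short NumPy sketch (added for this text) of the initialization described above: each initial factor is taken as the leading left singular vectors of the corresponding mode-$n$ unfolding of the observed tensor; the tensor below is random and purely illustrative.

```python
import numpy as np

def unfold(T, k):
    return np.reshape(np.moveaxis(T, k, 0), (T.shape[k], -1), order='F')

rng = np.random.default_rng(8)
Y = rng.standard_normal((12, 12, 12)) + 1j * rng.standard_normal((12, 12, 12))
L = max(Y.shape)                    # over-estimated rank used by the proposed algorithm / BCPD

# Xi^(n,0): the L leading left singular vectors of the mode-n unfolding of Y
init = []
for n in range(Y.ndim):
    U, _, _ = np.linalg.svd(unfold(Y, n), full_matrices=False)
    init.append(U[:, :L])
print([A.shape for A in init])      # [(12, 12), (12, 12), (12, 12)]
```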

    A. Validation on Synthetic Data

Synthetic tensors are used in this subsection to assess the performance of the proposed algorithm on convergence, rank learning ability and factor matrix recovery under different outlier models. A complex-valued third-order tensor $[\![\mathbf{A}^{(1)},\mathbf{A}^{(2)},\mathbf{A}^{(3)}]\!]\in\mathbb{C}^{12\times 12\times 12}$ with rank $R = 5$ is considered, where the orthogonal factor matrix $\mathbf{A}^{(1)}$ is constructed from the $R$ leading left singular vectors of a matrix drawn from $\mathcal{CMN}(\mathbf{A}\,|\,\mathbf{0}_{12\times 5},\mathbf{I}_{12\times 12},\mathbf{I}_{5\times 5})$, and the factor matrices $\{\mathbf{A}^{(n)}\}_{n=2}^{3}$ are drawn from $\mathcal{CMN}(\mathbf{A}\,|\,\mathbf{0}_{12\times 5},\mathbf{I}_{12\times 12},\mathbf{I}_{5\times 5})$. Parameters for the outlier models are set as: $\pi = 0.05$, $\sigma_e^2 = 100$, $H = 10\max_{i_1,\cdots,i_N}|[\![\mathbf{A}^{(1)},\mathbf{A}^{(2)},\mathbf{A}^{(3)}]\!]_{i_1,\cdots,i_N}|$, $\mu = 3$, $\lambda = 1/50$ and $\nu = 10$. The signal-to-noise ratio (SNR) is defined as $10\log_{10}\big(\|[\![\mathbf{A}^{(1)},\mathbf{A}^{(2)},\mathbf{A}^{(3)}]\!]\|_F^2\,/\,\|\mathcal{W}\|_F^2\big)$ [5], [17]. Each result in this subsection is obtained by averaging 500 Monte-Carlo runs.

Table I: Three Different Outlier Models
Bernoulli-Gaussian: $\mathcal{E}_{i_1,\cdots,i_N} \sim \mathcal{CN}(0,\sigma_e^2)$ with probability $\pi$
Bernoulli-Uniform: $\mathcal{E}_{i_1,\cdots,i_N} \sim \mathcal{U}(-H,H)$ with probability $\pi$
Bernoulli-Student's t: $\mathcal{E}_{i_1,\cdots,i_N} \sim \mathcal{T}(\mu,\lambda,\nu)$ with probability $\pi$

[Figure 3: Convergence of the proposed algorithm under different outlier models (MSE versus iteration at SNR = 10 dB and 20 dB).]

Figure 3 presents the convergence performance of the proposed algorithm under different outlier models, where the mean-square-error (MSE) $\|[\![\hat{\boldsymbol{\Xi}}^{(1)},\hat{\boldsymbol{\Xi}}^{(2)},\hat{\boldsymbol{\Xi}}^{(3)}]\!] - [\![\mathbf{A}^{(1)},\mathbf{A}^{(2)},\mathbf{A}^{(3)}]\!]\|_F^2$ is chosen as the assessment criterion. From Figure 3, it can be seen that the MSEs decrease significantly in the first few iterations and converge to stable values quickly, demonstrating the rapid convergence property. Furthermore, by comparing the simulation results with outliers to those without outliers, it is clear that the proposed algorithm is effective in mitigating outliers.

For tensor rank learning, the simulation results of the proposed algorithm are shown in Figure 4 (a), while those of the Bayesian tensor CPD algorithm are shown in Figure 4 (b). Each vertical bar in the figures shows the mean and standard deviation of the rank estimates, with the red horizontal dotted lines indicating the true tensor rank. The percentages of correct estimates are also shown on top of the figures. From Figure 4 (a), it is seen that the proposed method can recover the true tensor rank with 100% accuracy when SNR ≥ 5 dB, both with and without outliers. This shows the accuracy and robustness of the proposed algorithm when the noise power is moderate. Even though the performance at low SNRs is not as impressive as that at high SNRs, it can be observed that the proposed algorithm still gives estimates close to the true tensor rank, with the true rank lying mostly within one standard deviation from the mean estimate. On the other hand, in Figure 4 (b), it is observed that while the Bayesian tensor CPD algorithm performs nearly the same as the proposed algorithm without outliers, it gives tensor rank estimates very far away from the true value when outliers are present.

[Figure 4: Rank determination using (a) the proposed method and (b) the Bayesian tensor CPD [32] (estimated rank versus SNR under the different outlier models).]

Figure 5 compares the proposed algorithm to other state-of-the-art CPD algorithms in terms of the recovery accuracy of the orthogonal factor matrix $\mathbf{A}^{(1)}$ under different outlier models. The criterion is the best congruence ratio, defined as $\min_{\boldsymbol{\Delta},\mathbf{P}}\|\mathbf{A}^{(1)} - \hat{\boldsymbol{\Xi}}^{(1)}\mathbf{P}\boldsymbol{\Delta}\|_F\,/\,\|\mathbf{A}^{(1)}\|_F$, where the diagonal matrix $\boldsymbol{\Delta}$ and the permutation matrix $\mathbf{P}$ are found via the greedy least-squares column matching algorithm [5]. From Figure 5 (a), it is seen that both the proposed algorithm and the OALS perform better than the other algorithms when outliers are absent. This shows the importance of incorporating the orthogonality information of the factor matrix. On the other hand, while the OALS offers the same performance as the proposed algorithm when there is no outlier, its performance is significantly different in the presence of outliers, as presented in Figure 5 (b)-(d). Furthermore, since all the algorithms except VB and IRALS do not take the outliers into account, their performance degrades significantly, as shown in Figure 5 (b)-(d). Even though the IRALS uses robust $l_p$ ($0 < p \le 1$) norm optimization to alleviate the effects of outliers, it cannot learn the statistical information of the outliers, leading to worse performance in outlier mitigation than that of the proposed algorithm.
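For reference, the sketch below (added for this text) shows one plausible greedy least-squares column matching for computing the best congruence ratio; it is an assumption about the procedure of [5], not necessarily the exact algorithm used there.

```python
import numpy as np

def congruence_ratio(A_true, A_est):
    """Greedy least-squares column matching: pair each true column with an
    estimated column and the scale that minimise the residual, then return
    ||A_true - A_est P Delta||_F / ||A_true||_F."""
    R = A_true.shape[1]
    matched = np.zeros_like(A_true)
    free = list(range(A_est.shape[1]))
    for r in range(R):
        best = None
        for l in free:
            x = A_est[:, l]
            scale = np.vdot(x, A_true[:, r]) / np.vdot(x, x)   # LS-optimal complex scale
            err = np.linalg.norm(A_true[:, r] - scale * x)
            if best is None or err < best[0]:
                best = (err, l, scale)
        _, l_star, scale = best
        matched[:, r] = scale * A_est[:, l_star]
        free.remove(l_star)
    return np.linalg.norm(A_true - matched) / np.linalg.norm(A_true)

# Illustrative check: a column-permuted, rescaled copy of A should give a ratio near zero
rng = np.random.default_rng(9)
A = rng.standard_normal((12, 5)) + 1j * rng.standard_normal((12, 5))
perm = rng.permutation(5)
est = A[:, perm] * (rng.standard_normal(5) + 1j * rng.standard_normal(5))
print(congruence_ratio(A, est))     # ~ 1e-16
```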


    [Figure 5: best congruence ratio (log scale) versus SNR (dB) for ALS, SD, IRALS, BCPD, DIAG-A, OALS and the proposed VB algorithm, in four panels: (a) no outlier, (b) Bernoulli-Gaussian, (c) Bernoulli-Uniform, (d) Bernoulli-Student's t.]

    Figure 5: Performance of factor matrix recovery versus SNR under different outlier models

    B. Blind Data Detection for DS-CDMA Systems

    In this subsection, we consider an uplink DS-CDMA system in which R = 5 users communicate with a BS equipped with M = 8 antennas over flat fading channels $h_{mr} \sim \mathcal{CN}(h_{mr}\,|\,0, 1)$. The transmitted data $s_r(k)$ are random binary phase-shift keying (BPSK) symbols. The spreading code is of length Z = 6, with each code element $c_{zr} \sim \mathcal{CN}(c_{zr}\,|\,0, 1)$. After observing the received tensor $\mathcal{Y} \in \mathbb{C}^{8 \times 6 \times 100}$, the proposed algorithm and other state-of-the-art tensor CPD algorithms, combined with ambiguity removal and constellation mapping [5], [9], are executed to blindly detect the transmitted data. Their performance is measured in terms of bit error rate (BER).
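    For concreteness, the following sketch generates a synthetic received tensor of this form, $\mathcal{Y} = [\![ H, C, S ]\!] + \text{noise}$, under the stated dimensions; the variable names and the additive-noise level are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
M, Z, K, R = 8, 6, 100, 5  # antennas, spreading length, symbols, users

# Flat-fading channels and spreading codes, both circularly-symmetric
# complex Gaussian with unit variance.
H = (rng.standard_normal((M, R)) + 1j * rng.standard_normal((M, R))) / np.sqrt(2)
C = (rng.standard_normal((Z, R)) + 1j * rng.standard_normal((Z, R))) / np.sqrt(2)

# BPSK symbols for each user.
S = rng.choice([-1.0, 1.0], size=(K, R))

# Noiseless received tensor: rank-R CPD with factors H, C, S.
Y_clean = np.einsum('mr,zr,kr->mzk', H, C, S)

# Additive complex Gaussian noise at an illustrative SNR of 15 dB.
snr_db = 15
noise_var = np.mean(np.abs(Y_clean) ** 2) / (10 ** (snr_db / 10))
noise = np.sqrt(noise_var / 2) * (rng.standard_normal(Y_clean.shape)
                                  + 1j * rng.standard_normal(Y_clean.shape))
Y = Y_clean + noise
print(Y.shape)  # (8, 6, 100)
```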

    The BERs versus SNR under different outlier models are presented in Figure 6, averaged over 10000 independent trials. The parameter settings for the different outlier models are the same as those in the last subsection. It is seen from Figure 6 (a) that when there is no outlier, the proposed algorithm and OALS behave the same, and both outperform the other CPD algorithms. However, when outliers exist, it is seen from Figure 6 (b)-(d) that the proposed algorithm performs significantly better than the other algorithms.

    C. Linear Image Coding for Face Images

    In this subsection, we conduct experiments on 165 face images from the Yale Face Database³ [38], representing different facial expressions (with or without sunglasses) of 15 people (11 images per person). In each classification experiment, we randomly pick two people's images. Among these 22 images, 12 (6 from each person) are used for training. Each image is of size 240 × 320, and the training data can be naturally represented by a third-order tensor $\mathcal{Y} \in \mathbb{R}^{240 \times 320 \times 12}$. Various state-of-the-art tensor CPD algorithms and the proposed algorithm are run to learn the two orthogonal basis matrices (see (4)). Then, the feature vectors of these 12 training images, obtained by projecting them onto the multilinear subspaces spanned by the two orthogonal basis matrices, are used to train a support vector machine (SVM) classifier. For the 10 testing images, their feature vectors are fed into the SVM classifier to determine which person is in each image. The parameters of the various outlier models are: π = 0.05, σe = 100, H = 100, µ = 1, λ = 1/1000 and ν = 20.
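    A minimal sketch of this feature-extraction-plus-SVM pipeline is given below, assuming each image $X_k$ is projected onto the two learned columnwise-orthonormal basis matrices as $B_1^T X_k B_2$ and the result is vectorized as the feature vector. The basis matrices are taken as given (e.g., from any CPD routine), and scikit-learn is used only for the SVM; the random stand-ins are illustrative, not the actual dataset.

```python
import numpy as np
from sklearn.svm import SVC

def image_features(images, B1, B2):
    """Project each 240x320 image onto the multilinear subspace spanned by
    the columnwise-orthonormal bases B1 (240 x R) and B2 (320 x R), then
    vectorize the R x R coefficient matrix as the feature vector."""
    return np.array([(B1.T @ X @ B2).ravel() for X in images])

# Illustrative stand-ins: random orthonormal bases and random "images".
rng = np.random.default_rng(3)
R = 5
B1, _ = np.linalg.qr(rng.standard_normal((240, R)))
B2, _ = np.linalg.qr(rng.standard_normal((320, R)))
train_imgs = rng.standard_normal((12, 240, 320))
train_labels = np.array([0] * 6 + [1] * 6)   # two people, 6 images each
test_imgs = rng.standard_normal((10, 240, 320))

clf = SVC(kernel='linear')
clf.fit(image_features(train_imgs, B1, B2), train_labels)
pred = clf.predict(image_features(test_imgs, B1, B2))
print(pred)
```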

    Since the tensor rank is not known for the image data, it must be carefully chosen; the rank-selection procedure is described below, after Table II.

    ³ http://vision.ucsd.edu/content/yale-face-database


    [Figure 6: BER (log scale) versus SNR (dB) for ALS, SD, IRALS, BCPD, DIAG-A, OALS and the proposed VB algorithm, in four panels: (a) no outlier, (b) Bernoulli-Gaussian, (c) Bernoulli-Uniform, (d) Bernoulli-Student's t.]

    Figure 6: BER versus SNR under different outlier models

    Table II: Classification Error and CPD Computation Time in Face Recognition

    (Error = classification error; Time = average CPD computation time in seconds)

    Algorithm     | No Outlier       | Bernoulli-Gaussian | Bernoulli-Uniform | Bernoulli-Student's t
                  | Error   Time (s) | Error   Time (s)   | Error   Time (s)  | Error   Time (s)
    ALS [15]      | 9%      1.9635   | 51%     46.3041    | 47%     63.6765   | 48%     58.1047
    SD [40]       | 12%     0.5736   | 49%     0.6287     | 44%     0.6181    | 49%     0.6594
    IRALS [42]    | 9%      4.3527   | 43%     15.8361    | 27%     16.6151   | 36%     17.5181
    BCPD [32]     | 2%      4.1546   | 53%     20.7338    | 35%     17.9896   | 48%     17.1151
    DIAG-A [41]   | 11%     3.7384   | 50%     28.9961    | 41%     30.6317   | 51%     22.5127
    OALS [17]     | 11%     1.0174   | 58%     40.2731    | 34%     38.1754   | 47%     24.0162
    VB            | 2%      2.4827   | 10%     2.7806     | 6%      2.4912    | 7%      2.8895

    For the algorithms (ALS, SD, IRALS, DIAG-A and OALS) that cannot automatically determine the rank, it is obtained by first running these algorithms with the tensor rank ranging from 1 to 12, and then finding the knee point of the reconstruction error decrement [27]. When there is no outlier, this procedure is able to find the appropriate tensor rank. However, when outliers exist, the knee point cannot be found, and we set the rank to the upper bound of 12. For the BCPD, although it learns the appropriate rank when there are no outliers, it learns the rank as 12 when outliers exist. On the other hand, whether or not there are outliers, the proposed algorithm automatically learns the appropriate tensor rank without exhaustive search, and thus saves considerable computation.
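    A simple way to realize this knee-point rule is sketched below: the candidate ranks 1 to 12 are swept, the reconstruction error is recorded for each, and the knee is taken as the rank at which the error decrement shrinks most sharply. The `cpd_reconstruction_error` routine is a hypothetical stand-in for whichever fixed-rank CPD algorithm is being tuned, and this is only one possible reading of the rule in [27].

```python
import numpy as np

def knee_point_rank(errors):
    """Given reconstruction errors for ranks 1..len(errors), return the rank
    at which the error decrement falls most sharply (one possible reading of
    the knee-point rule)."""
    errors = np.asarray(errors, dtype=float)
    decrement = -np.diff(errors)           # improvement from rank r to r+1
    drop = decrement[:-1] - decrement[1:]  # how much the improvement shrinks
    return int(np.argmax(drop)) + 2        # +2: ranks are 1-indexed, diff taken twice

def cpd_reconstruction_error(Y, rank):
    """Hypothetical placeholder: run a fixed-rank CPD (e.g., ALS or OALS) on
    the tensor Y and return the relative reconstruction error."""
    raise NotImplementedError

# Usage pattern, with a synthetic error curve that flattens after rank 5:
errors = [1.0, 0.7, 0.45, 0.25, 0.10, 0.09, 0.085, 0.082, 0.08, 0.079, 0.078, 0.077]
print(knee_point_rank(errors))  # -> 5
```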

    The average classification errors over 10 independent experiments and the corresponding average CPD computation times (benchmarked in Matlab on a personal computer with an i7 CPU) are shown in Table II. It can be seen that the proposed algorithm provides the smallest classification error under all considered scenarios.

    VI. CONCLUSIONS

    In this paper, a probabilistic CPD algorithm with orthogonal factors has been proposed for complex-valued tensors, under unknown tensor rank and in the presence of outliers. It has been shown that, without knowledge of the noise power or the outlier statistics, the proposed algorithm alternately estimates the factor matrices, recovers the tensor rank and mitigates the outliers. Interestingly, the popular OALS in [17] has been shown to be a special case of the proposed algorithm. Simulation results using synthetic data and real-world applications have demonstrated the excellent performance of the proposed algorithm in terms of accuracy and robustness.


    APPENDIX A
    DERIVATIONS OF Q(Ξ^(k)), P + 1 ≤ k ≤ N

    By substituting (14) into (17) and keeping only the terms relevant to $\Xi^{(k)}$, we directly have
    $$
    Q(\Xi^{(k)}) \propto \exp\Big\{ \mathbb{E}_{\prod_{\Theta_j \neq \Xi^{(k)}} Q(\Theta_j)}\Big[ -\beta \big\| \mathcal{Y} - [\![\Xi^{(1)}, \cdots, \Xi^{(N)}]\!] - \mathcal{E} \big\|_F^2 - \mathrm{Tr}\big( \Gamma\, \Xi^{(k)H} \Xi^{(k)} \big) \Big] \Big\}. \tag{42}
    $$
    Using the fact that $\| \mathcal{A} \|_F^2 = \| U^{(n)}[\mathcal{A}] \|_F^2 = \mathrm{Tr}\big( U^{(n)}[\mathcal{A}]\, (U^{(n)}[\mathcal{A}])^H \big)$, the squared Frobenius norm inside the expectation in (42) can be expressed as
    $$
    \begin{aligned}
    \mathrm{Tr}\Big(\, & \Xi^{(k)} \Big( \bigodot_{n=1, n\neq k}^{N} \Xi^{(n)} \Big)^{T} \Big( \bigodot_{n=1, n\neq k}^{N} \Xi^{(n)} \Big)^{*} \Xi^{(k)H} - \Xi^{(k)} \Big( \bigodot_{n=1, n\neq k}^{N} \Xi^{(n)} \Big)^{T} \big( U^{(k)}[\mathcal{Y} - \mathcal{E}] \big)^{H} \\
    & - U^{(k)}[\mathcal{Y} - \mathcal{E}] \Big( \bigodot_{n=1, n\neq k}^{N} \Xi^{(n)} \Big)^{*} \Xi^{(k)H} + U^{(k)}[\mathcal{Y} - \mathcal{E}] \big( U^{(k)}[\mathcal{Y} - \mathcal{E}] \big)^{H} \Big).
    \end{aligned} \tag{43}
    $$
    By putting (43) into (42) and distributing the expectations into the various terms, (42) becomes
    $$
    Q(\Xi^{(k)}) \propto \exp\Big\{ -\mathrm{Tr}\Big( \Xi^{(k)} \big[\Sigma^{(k)}\big]^{-1} \Xi^{(k)H} - \Xi^{(k)} \big[\Sigma^{(k)}\big]^{-1} M^{(k)H} - M^{(k)} \big[\Sigma^{(k)}\big]^{-1} \Xi^{(k)H} \Big) \Big\}, \tag{44}
    $$
    where
    $$
    \big[\Sigma^{(k)}\big]^{-1} \triangleq \mathbb{E}_{Q(\beta)}[\beta]\, \mathbb{E}_{\prod_{n=1,n\neq k}^{N} Q(\Xi^{(n)})}\Big[ \Big( \bigodot_{n=1,n\neq k}^{N} \Xi^{(n)} \Big)^{T} \Big( \bigodot_{n=1,n\neq k}^{N} \Xi^{(n)} \Big)^{*} \Big] + \mathbb{E}_{\prod_{l=1}^{L} Q(\gamma_l)}[\Gamma],
    $$
    $$
    M^{(k)} \triangleq \mathbb{E}_{Q(\beta)}[\beta]\, U^{(k)}\big[ \mathcal{Y} - \mathbb{E}_{Q(\mathcal{E})}[\mathcal{E}] \big] \Big( \bigodot_{n=1,n\neq k}^{N} \mathbb{E}_{Q(\Xi^{(n)})}\big[\Xi^{(n)}\big] \Big)^{*} \Sigma^{(k)}.
    $$
    After completing the square over $\Xi^{(k)}$, it can be seen that (44) corresponds to the functional form of a circularly-symmetric complex matrix normal distribution [35]. In particular, with $\mathcal{CMN}(X\,|\,M, \Sigma_r, \Sigma_c)$ denoting the distribution $p(X) \propto \exp\{ -\mathrm{Tr}\big( \Sigma_c^{-1} (X - M)^H \Sigma_r^{-1} (X - M) \big) \}$, we have $Q(\Xi^{(k)}) = \mathcal{CMN}(\Xi^{(k)}\,|\,M^{(k)}, I_{I_k}, \Sigma^{(k)})$.
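    To make the structure of this update concrete, the sketch below computes $\Sigma^{(k)}$ and $M^{(k)}$ for one non-orthogonal factor from plugged-in expected quantities, assuming a real-valued toy tensor (so conjugates are omitted), the mode-k unfolding paired with the Khatri-Rao product of the remaining factors in reverse mode order (consistency is checked by an assertion), point estimates for the factor expectations (ignoring their covariance contributions), and simple stand-ins for E[β], E[Γ] and E[E]. It illustrates the algebra in (44), not the authors' implementation.

```python
import numpy as np
from functools import reduce

def khatri_rao(mats):
    """Column-wise Kronecker (Khatri-Rao) product of a list of matrices."""
    def kr(a, b):
        return np.einsum('ir,jr->ijr', a, b).reshape(a.shape[0] * b.shape[0], -1)
    return reduce(kr, mats)

def unfold(T, mode):
    """Mode-k unfolding consistent with the Khatri-Rao product of the
    remaining factors taken in reverse mode order."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1, order='F')

rng = np.random.default_rng(4)
dims, R, k = (6, 5, 4), 3, 2            # toy sizes; update factor k = 2
E_factors = [rng.standard_normal((d, R)) for d in dims]   # stand-ins for E[Xi^(n)]
Y = np.einsum('ir,jr,kr->ijk', *E_factors) + 0.01 * rng.standard_normal(dims)
E_mean_E = np.zeros(dims)               # stand-in for E[E] (posterior mean of outliers)
E_beta, E_Gamma = 4.0, 0.1 * np.eye(R)  # stand-ins for E[beta] and E[Gamma]

others = [E_factors[n] for n in reversed(range(len(dims))) if n != k]
KR = khatri_rao(others)                 # Khatri-Rao of the remaining factors
# Sanity check that the unfolding and Khatri-Rao ordering are consistent.
assert np.allclose(unfold(np.einsum('ir,jr,kr->ijk', *E_factors), k),
                   E_factors[k] @ KR.T)

# Covariance and mean of Q(Xi^(k)) following the structure of (44).
Sigma_k = np.linalg.inv(E_beta * (KR.T @ KR) + E_Gamma)
M_k = E_beta * unfold(Y - E_mean_E, k) @ KR @ Sigma_k
print(M_k.shape, Sigma_k.shape)   # (4, 3) (3, 3)
```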

    APPENDIX B
    PROOF OF PROPERTY 2

    To prove Property 2, we first introduce the following lemma.

    Lemma 1. Suppose the matrices $A^{(n)} \in \mathbb{C}^{\kappa_n \times \rho}$ ($1 \le n \le N$). Then, if $N \ge 2$, $\big( \bigodot_{n=1}^{N} A^{(n)} \big)^{T} \big( \bigodot_{n=1}^{N} A^{(n)} \big)^{*}$ can be computed as
    $$
    \Big( \bigodot_{n=1}^{N} A^{(n)} \Big)^{T} \Big( \bigodot_{n=1}^{N} A^{(n)} \Big)^{*} = \circledast_{n=1}^{N}\, A^{(n)T} A^{(n)*}, \tag{45}
    $$
    where $\bigodot$ and $\circledast$ denote the Khatri-Rao and Hadamard products, respectively.

    Proof: Mathematical induction is used to prove this lemma. For $N = 2$, it is proved in [36] that
    $$
    \big( A^{(2)} \odot A^{(1)} \big)^{T} \big( A^{(2)} \odot A^{(1)} \big)^{*} = \big( A^{(2)T} A^{(2)*} \big) \circledast \big( A^{(1)T} A^{(1)*} \big). \tag{46}
    $$
    Without loss of generality, assume that (45) holds for some $K \ge 2$, i.e.,
    $$
    \Big( \bigodot_{n=1}^{K} A^{(n)} \Big)^{T} \Big( \bigodot_{n=1}^{K} A^{(n)} \Big)^{*} = \circledast_{n=1}^{K}\, A^{(n)T} A^{(n)*}. \tag{47}
    $$
    Now consider $\big[ A^{(K+1)} \odot \big( \bigodot_{n=1}^{K} A^{(n)} \big) \big]^{T} \big[ A^{(K+1)} \odot \big( \bigodot_{n=1}^{K} A^{(n)} \big) \big]^{*}$. Treating $\bigodot_{n=1}^{K} A^{(n)}$ as a single matrix and using (46), we have
    $$
    \Big[ A^{(K+1)} \odot \Big( \bigodot_{n=1}^{K} A^{(n)} \Big) \Big]^{T} \Big[ A^{(K+1)} \odot \Big( \bigodot_{n=1}^{K} A^{(n)} \Big) \Big]^{*} = \big( A^{(K+1)T} A^{(K+1)*} \big) \circledast \Big[ \Big( \bigodot_{n=1}^{K} A^{(n)} \Big)^{T} \Big( \bigodot_{n=1}^{K} A^{(n)} \Big)^{*} \Big].
    $$
    Further using (47), we have
    $$
    \Big( \bigodot_{n=1}^{K+1} A^{(n)} \Big)^{T} \Big( \bigodot_{n=1}^{K+1} A^{(n)} \Big)^{*} = \big( A^{(K+1)T} A^{(K+1)*} \big) \circledast \Big[ \circledast_{n=1}^{K}\, A^{(n)T} A^{(n)*} \Big] = \circledast_{n=1}^{K+1}\, A^{(n)T} A^{(n)*}. \tag{48}
    $$
    Then, (45) is shown to hold for $N = K + 1$. Thus, by mathematical induction, (45) holds for any $N \ge 2$.

    Using Lemma 1 and taking expectations, it is easy to show that
    $$
    \mathbb{E}_{\prod_{n=1,n\neq k}^{N} p(A^{(n)})}\Big[ \Big( \bigodot_{n=1,n\neq k}^{N} A^{(n)} \Big)^{T} \Big( \bigodot_{n=1,n\neq k}^{N} A^{(n)} \Big)^{*} \Big] = \Big( \circledast_{n=1,n\neq k}^{P}\, \mathbb{E}_{p(A^{(n)})}\big[ A^{(n)T} A^{(n)*} \big] \Big) \circledast \Big( \circledast_{n=P+1,n\neq k}^{N}\, \mathbb{E}_{p(A^{(n)})}\big[ A^{(n)T} A^{(n)*} \big] \Big). \tag{49}
    $$
    Since $A^{(n)} \sim \delta(A^{(n)} - \hat{A}^{(n)})$ for $1 \le n \le P$, where $\hat{A}^{(n)} \in \mathcal{V}_{\rho}(\mathbb{C}^{\kappa_n})$, and $A^{(n)} \sim \mathcal{CMN}(A^{(n)}\,|\,M^{(n)}, I_{\kappa_n}, \Sigma^{(n)})$ for $P + 1 \le n \le N$, we have $\mathbb{E}_{p(A^{(n)})}\big[ A^{(n)T} A^{(n)*} \big] = \hat{A}^{(n)T} \hat{A}^{(n)*} = I_{\rho}$ for $1 \le n \le P$, and $\mathbb{E}_{p(A^{(n)})}\big[ A^{(n)T} A^{(n)*} \big] = M^{(n)T} M^{(n)*} + \kappa_n \Sigma^{(n)*}$ for $P + 1 \le n \le N$. Then, (49) becomes
    $$
    \mathbb{E}_{\prod_{n=1,n\neq k}^{N} p(A^{(n)})}\Big[ \Big( \bigodot_{n=1,n\neq k}^{N} A^{(n)} \Big)^{T} \Big( \bigodot_{n=1,n\neq k}^{N} A^{(n)} \Big)^{*} \Big] = I_{\rho} \circledast \Big[ \circledast_{n=P+1,n\neq k}^{N} \big( M^{(n)H} M^{(n)} + \kappa_n \Sigma^{(n)} \big)^{*} \Big] = \mathrm{D}\Big[ \circledast_{n=P+1,n\neq k}^{N} \big( M^{(n)H} M^{(n)} + \kappa_n \Sigma^{(n)} \big)^{*} \Big]. \tag{50}
    $$
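    As a quick numerical sanity check of Lemma 1 (which underlies the expectations above), the snippet below builds random complex matrices, forms their Khatri-Rao product, and verifies that its Gram-type matrix equals the Hadamard product of the individual Gram-type matrices; the helper function is illustrative.

```python
import numpy as np
from functools import reduce

def khatri_rao(mats):
    """Column-wise Kronecker (Khatri-Rao) product of a list of matrices."""
    def kr(a, b):
        return np.einsum('ir,jr->ijr', a, b).reshape(a.shape[0] * b.shape[0], -1)
    return reduce(kr, mats)

rng = np.random.default_rng(5)
rho, sizes = 4, (3, 5, 2)
A = [rng.standard_normal((k, rho)) + 1j * rng.standard_normal((k, rho)) for k in sizes]

lhs = khatri_rao(A).T @ np.conj(khatri_rao(A))
rhs = reduce(lambda X, Y: X * Y, [a.T @ np.conj(a) for a in A])  # Hadamard product
print(np.allclose(lhs, rhs))  # True, as stated by (45)
```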

    APPENDIX C
    DERIVATION OF f̃

    Recall that $\| \mathcal{Y} - [\![\Xi^{(1)}, \cdots, \Xi^{(N)}]\!] - \mathcal{E} \|_F^2$ is expressed in (43). Using this result and taking expectations with respect to $Q(\Xi^{(n)})$ for $1 \le n \le N$ and $Q(\mathcal{E})$, we obtain
    $$
    \begin{aligned}
    \tilde{f} = \mathrm{Tr}\Big(\, & w_1 + \mathbb{E}_{Q(\Xi^{(1)})}\big[\Xi^{(1)}\big]\, w_2\, \mathbb{E}_{Q(\Xi^{(1)})}\big[\Xi^{(1)}\big]^{H} \\
    & - \mathbb{E}_{Q(\Xi^{(1)})}\big[\Xi^{(1)}\big] \Big( \bigodot_{n=2}^{N} \mathbb{E}_{Q(\Xi^{(n)})}\big[\Xi^{(n)}\big] \Big)^{T} \Big( U^{(1)}\big[ \mathcal{Y} - \mathbb{E}_{Q(\mathcal{E})}[\mathcal{E}] \big] \Big)^{H} \\
    & - U^{(1)}\big[ \mathcal{Y} - \mathbb{E}_{Q(\mathcal{E})}[\mathcal{E}] \big] \Big( \bigodot_{n=2}^{N} \mathbb{E}_{Q(\Xi^{(n)})}\big[\Xi^{(n)}\big] \Big)^{*} \mathbb{E}_{Q(\Xi^{(1)})}\big[\Xi^{(1)}\big]^{H} \Big),
    \end{aligned} \tag{51}
    $$
    where
    $$
    w_1 \triangleq \mathbb{E}_{Q(\mathcal{E})}\Big[ \big( U^{(1)}[\mathcal{Y} - \mathcal{E}] \big)^{H} U^{(1)}[\mathcal{Y} - \mathcal{E}] \Big], \qquad
    w_2 \triangleq \mathbb{E}_{\prod_{n=2}^{N} Q(\Xi^{(n)})}\Big[ \Big( \bigodot_{n=2}^{N} \Xi^{(n)} \Big)^{T} \Big( \bigodot_{n=2}^{N} \Xi^{(n)} \Big)^{*} \Big].
    $$
    Since $Q(\mathcal{E}_{i_1,\cdots,i_N}) = \mathcal{CN}(\mathcal{E}_{i_1,\cdots,i_N}\,|\,m_{i_1,\cdots,i_N}, p_{i_1,\cdots,i_N}^{-1})$, it is easy to show that $\mathrm{Tr}(w_1) = \mathrm{Tr}\big( ( U^{(1)}[\mathcal{Y} - \mathcal{M}] )^{H} U^{(1)}[\mathcal{Y} - \mathcal{M}] \big) + \sum_{i_1,\cdots,i_N}^{I_1,\cdots,I_N} p_{i_1,\cdots,i_N}^{-1} = \| \mathcal{Y} - \mathcal{M} \|_F^2 + \sum_{i_1,\cdots,i_N}^{I_1,\cdots,I_N} p_{i_1,\cdots,i_N}^{-1}$, where $\mathcal{M}$ is a tensor with its $(i_1,\cdots,i_N)$th element being $m_{i_1,\cdots,i_N}$. On the other hand, using Property 2, we have $w_2 = \mathrm{D}\big[ \circledast_{n=P+1}^{N} ( M^{(n)H} M^{(n)} + I_n \Sigma^{(n)} )^{*} \big]$. Substituting these two results into (51), we obtain equation (32).

    REFERENCES

    [1] L. De Lathauwer, "A short introduction to tensor-based methods for factor analysis and blind source separation," in Proc. IEEE Int. Symp. on Image and Signal Processing and Analysis (ISPA 2011), Dubrovnik, Croatia, Sep. 2011, pp. 558-563.

    [2] J. F. Cardoso, "High-order contrasts for independent component analysis," Neural Computation, vol. 11, no. 1, pp. 157-192, 1999.

    [3] C. F. Beckmann and S. M. Smith, "Tensorial extensions of independent component analysis for multisubject FMRI analysis," NeuroImage, vol. 25, no. 1, pp. 294-31, 2005.

    [4] L. De Lathauwer, "Algebraic methods after prewhitening," in Handbook of Blind Source Separation, Independent Component Analysis and Applications, Academic Press, New York, 2010, pp. 155-177.

    [5] N. D. Sidiropoulos, G. B. Giannakis, and R. Bro, "Blind PARAFAC receivers for DS-CDMA systems," IEEE Trans. Signal Process., vol. 48, no. 3, pp. 810-823, 2000.

    [6] C. A. R. Fernandes, A. L. F. de Almeida, and D. B. da Costa, "Unified tensor modeling for blind receivers in multiuser uplink cooperative systems," IEEE Signal Process. Lett., vol. 19, no. 5, pp. 247-250, May 2012.

    [7] A. Y. Kibangou and A. De Almeida, "Distributed PARAFAC based DS-CDMA blind receiver for wireless sensor networks," in Proc. IEEE Int. Conference on Signal Processing Advances in Wireless Communications (SPAWC 2010), Marrakech, Jun. 2010, pp. 1-5.

    [8] A. L. F. de Almeida, A. Y. Kibangou, S. Miron, and D. C. Araujo, "Joint data and connection topology recovery in collaborative wireless sensor networks," in Proc. IEEE Int. Conference on Acoustics, Speech and Signal Processing (ICASSP 2013), Vancouver, BC, May 2013, pp. 5303-5307.

    [9] M. Sorensen, L. De Lathauwer, and L. Deneire, "PARAFAC with orthogonality in one mode and applications in DS-CDMA systems," in Proc. IEEE Int. Conference on Acoustics, Speech and Signal Processing (ICASSP 2010), Dallas, TX, Mar. 2010, pp. 4142-4145.

    [10] D. Nion and N. D. Sidiropoulos, "Tensor algebra and multidimensional harmonic retrieval in signal processing for MIMO radar," IEEE Trans. Signal Process., vol. 58, no. 11, pp. 5693-5705, 2010.

    [11] W. Sun, H. C. So, F. K. W. Chan, and L. Huang, "Tensor approach for eigenvector-based multi-dimensional harmonic retrieval," IEEE Trans. Signal Process., vol. 61, no. 13, pp. 3378-3388, 2013.

    [12] B. Pesquet-Popescu, J.-C. Pesquet, and A. P. Petropulu, "Joint singular value decomposition - a new tool for separable representation of images," in Proc. IEEE Int. Conference on Image Processing (ICIP 2001), Thessaloniki, Greece, Oct. 2001, pp. 569-572.

    [13] A. Shashua and A. Levin, "Linear image coding for regression and classification using the tensor-rank principle," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001), Kauai, HI, Dec. 2001, pp. 42-49.

    [14] T. G. Kolda and B. W. Bader, "Tensor decompositions and applications," SIAM Review, vol. 51, pp. 455-500, 2009.

    [15] J. D. Carroll and J. J. Chang, "Analysis of individual differences in multidimensional scaling via an N-way generalization of "Eckart-Young" decomposition," Psychometrika, vol. 35, pp. 283-319, 1970.

    [16] V. de Silva and L. H. Lim, "Tensor rank and the ill-posedness of the best low-rank approximation problem," SIAM J. Matrix Anal. Appl., vol. 30, pp. 1084-1127, 2008.

    [17] M. Sorensen, L. De Lathauwer, P. Comon, S. Icart, and L. Deneire, "Canonical polyadic decomposition with a columnwise orthonormal factor matrix," SIAM J. Matrix Anal. Appl., vol. 33, no. 4, pp. 1190-1213, 2012.

    [18] M. Muma, Y. Cheng, F. Roemer, M. Haardt, and A. M. Zoubir, "Robust source number enumeration for R-dimensional arrays in case of brief sensor failures," in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP 2012), Kyoto, Mar. 2012, pp. 3709-3712.

    [19] Z. Lu, Y. Chakhchoukh, and A. M. Zoubir, "Source number estimation in impulsive noise environments using bootstrap techniques and robust statistics," in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP 2011), Prague, Czech Republic, May 2011, pp. 2712-2715.

    [20] R. H. Chan, C. W. Ho, and M. Nikolova, "Salt-and-pepper noise removal by median-type noise detectors and detail-preserving regularization," IEEE Trans. Image Process., vol. 14, no. 10, pp. 1479-1485, 2005.

    [21] D. J. C. MacKay, "Probable networks and plausible predictions - a review of practical Bayesian methods for supervised neural networks," Network: Computation in Neural Systems, vol. 6, no. 3, pp. 469-505, 1995.

    [22] M. E. Tipping and C. M. Bishop, "Probabilistic principal component analysis," Journal of the Royal Statistical Society: Series B, vol. 61, no. 3, pp. 611-622, 1999.

    [23] C. M. Bishop, "Bayesian principal component analysis," in Advances in Neural Information Processing Systems (NIPS), pp. 382-388, 1999.

    [24] X. Ding, L. He, and L. Carin, "Bayesian robust principal component analysis," IEEE Trans. Image Process., vol. 29, no. 12, pp. 3419-3430, 2011.


    [25] L. Xiong, X. Chen, T. K. Huang, J. Schneider, and J. G. Carbonell, "Temporal collaborative filtering with Bayesian probabilistic tensor factorization," in Proc. SIAM Int. Conference on Data Mining, Columbus, May 2010, pp. 211-222.

    [26] I. Porteous, E. Bart, and M. Welling, "Multi-HDP: A non-parametric Bayesian model for tensor factorization," in Proc. AAAI Conference on Artificial Intelligence (AAAI 2008), Chicago, Jul. 2008, pp. 1487-1490.

    [27] K. P. Murphy, Machine Learning: A Probabilistic Perspective. MIT Press, 2012.

    [28] P. C. Hansen, Rank-Deficient and Discrete Ill-Posed Problems: Numerical Aspects of Linear Inversion. Philadelphia, PA: SIAM, 1998.

    [29] M. J. Wainwright and M. I. Jordan, "Graphical models, exponential families, and variational inference," Foundations and Trends in Machine Learning, vol. 1, no. 1-2, pp. 1-305, Jan. 2008.

    [30] B. Ermis and A. T. Cemgil, "A Bayesian tensor factorization model via variational inference for link prediction," arXiv preprint, Sep. 2014. [Online]. Available: http://arxiv.org/pdf/1409.8276v1.pdf

    [31] Z. Xu, F. Yan, and Y. Qi, "Bayesian nonparametric models for multiway data analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 2, pp. 475-487, Nov. 2013.

    [32] Q. Zhao, L. Zhang, and A. Cichocki, "Bayesian CP factorization of incomplete tensors with automatic rank determination," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 37, no. 9, pp. 1751-1753, Sep. 2015.

    [33] C. G. Khatri and K. V. Mardia, "The von Mises-Fisher distribution in orientation statistics," Journal of the Royal Statistical Society, vol. 39, pp. 95-106, 1977.

    [34] M. West, "On scale mixtures of normal distributions," Biometrika, vol. 74, no. 3, pp. 646-648, 1987.

    [35] A. K. Gupta and D. K. Nagar, Matrix Variate Distributions. CRC Press, 1999.

    [36] S. Liu and G. Trenkler, "Hadamard, Khatri-Rao, Kronecker and other matrix products," Int. J. Inform. Syst. Sci., vol. 4, no. 1, pp. 160-177, 2008.

    [37] T. Kanamori and A. Takeda, "Non-convex optimization on Stiefel manifold and applications to machine learning," Neural Information Processing, pp. 109-116, 2012.

    [38] A. Georghiades, D. Kriegman, and P. Belhumeur, "From few to many: Generative models for recognition under variable pose and illumination," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 40, pp. 643-660, 2001.

    [39] M. J. Beal, "Variational algorithms for approximate Bayesian inference," Ph.D. thesis, Gatsby Computational Neuroscience Unit, University College London, 2003.

    [40] M. Sorensen, I. Domanov, and L. De Lathauwer, "Coupled canonical polyadic decompositions and (coupled) decompositions in multilinear rank-(Lr,n, Lr,n, 1) terms—Part II: Algorithms," SIAM Journal on Matrix Analysis and Applications, vol. 36, no. 3, pp. 1015-1045, Mar. 2015.

    [41] X. Luciani and L. Albera, "Canonical polyadic decomposition based on joint eigenvalue decomposition," Chemometrics and Intelligent Laboratory Systems, vol. 132, pp. 152-167, Mar. 2014.

    [42] X. Fu, K. Huang, W.-K. Ma, N. D. Sidiropoulos, and R. Bro, "Joint tensor factorization and outlying slab suppression with applications," IEEE Trans. Signal Process., vol. 63, no. 23, pp. 6315-6328, Dec. 2015.

    [43] H. Huang and C. H. Q. Ding, "Robust tensor factorization using R1 norm," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., Anchorage, AK, 2008, pp. 1-8.

    [44] D. Goldfarb and Z. Qin, "Robust low-rank tensor recovery: Models and algorithms," SIAM J. Matrix Anal. Appl., vol. 35, no. 1, pp. 225-253, 2014.

    [45] Q. Zhao, G. Zhou, L. Zhang, A. Cichocki, and S. I. Amari, "Bayesian robust tensor factorization for incomplete multiway data," IEEE Trans. Neural Netw. Learn. Syst., vol. 27, no. 4, pp. 736-748, Apr. 2016.

    Lei Cheng received the B.Eng. degree from Zhejiang University (ZJU) in 2013. He is currently working towards the Ph.D. degree at The University of Hong Kong (HKU). His research interests are in the general areas of signal processing and machine learning, and in particular statistical inference for multidimensional data.

    Yik-Chung Wu received the B.Eng. (EEE) degree in 1998 and the M.Phil. degree in 2001 from The University of Hong Kong (HKU). He received the Croucher Foundation scholarship in 2002 to study for the Ph.D. degree at Texas A&M University, College Station, and graduated in 2005. From August 2005 to August 2006, he was with Thomson Corporate Research, Princeton, NJ, as a Member of Technical Staff. Since September 2006, he has been with HKU, currently as an Associate Professor. He was a visiting scholar at Princeton University in the summers of 2011 and 2015. His research interests are in the general areas of signal processing, machine learning and communication systems, and in particular distributed signal processing and robust optimization theories with applications to communication systems and smart grid. Dr. Wu served as an Editor for IEEE Communications Letters, and is currently an Editor for IEEE Transactions on Communications and the Journal of Communications and Networks.

    H. Vincent Poor (S'72, M'77, SM'82, F'87) received the Ph.D. degree in EECS from Princeton University in 1977. From 1977 until 1990, he was on the faculty of the University of Illinois at Urbana-Champaign. Since 1990 he has been on the faculty at Princeton, where he is the Michael Henry Strater University Professor of Electrical Engineering. From 2006 till 2016, he served as Dean of Princeton's School of Engineering and Applied Science. Dr. Poor's research interests are in the areas of statistical signal processing, stochastic analysis and information theory, and their applications in wireless networks and related fields. Among his publications in these areas is the recent book Mechanisms and Games for Dynamic Spectrum Allocation (Cambridge University Press, 2014).

    Dr. Poor is a member of the National Academy of Engineering and the National Academy of Sciences, and a foreign member of the Royal Society. He is also a Fellow of the American Academy of Arts and Sciences and the National Academy of Inventors, and of other national and international academies. He received the Technical Achievement and Society Awards of the IEEE Signal Processing Society in 2007 and 2011, respectively. Recent recognition of his work includes the 2014 URSI Booker Gold Medal, the 2015 EURASIP Athanasios Papoulis Award, the 2016 John Fritz Medal, and honorary doctorates from Aalborg University, Aalto University, HKUST and the University of Edinburgh.

    Dr. Poor is a member of the National Academy of Engineering and theNational Academy of Sciences, and a foreign member of the Royal Society.He is also a Fellow of the American Academy of Arts and Sciences andthe National Academy of Inventors, and of other national and internationalacademies. He received the Technical Achievement and Society Awards ofthe IEEE Signal Processing Society in 2007 and 2011, respectively. Recentrecognition of his work includes the 2014 URSI Booker Gold Medal, the2015 EURASIP Athanasios Papoulis Award, the 2016 John Fritz Medal, andhonorary doctorates from Aalborg University, Aalto University, HKUST andthe University of Edinburgh.