ON TENSOR TUCKER DECOMPOSITION: THE CASE FOR AN ADJUSTABLE CORE SIZE

BILIAN CHEN ∗, ZHENING LI † , AND SHUZHONG ZHANG ‡

Abstract. This paper is concerned with the problem of finding a Tucker decomposition for tensors. Traditionally, solution methods for Tucker decomposition presume that the size of the core tensor is specified in advance, which may not be a realistic assumption in some applications. In this paper we propose a new computational model where the configuration and the size of the core become a part of the decisions to be optimized. Our approach is based on the so-called maximum block improvement (MBI) method for non-convex block optimization. We consider a number of real applications in this paper, showing the promising numerical performances of the proposed algorithms.

Key words. multiway array, Tucker decomposition, low-rank approximation, maximum block improvement

AMS subject classifications. 15A69, 90C26, 90C59, 49M27, 65F99

1. Introduction. The models and solution methods for high order tensor decompositions were initially motivated and developed in psychometrics (see, e.g. [41, 42, 43]) and chemometrics (see, e.g. [4]), where the need for analysis of multiway data is imperative. Since then, tensor decomposition has become an active field of research in mathematics, due partly to the intricate algebraic structures of tensors. In addition, high order (tensor) statistics and independent component analysis (ICA) have been developed by engineers and statisticians, primarily because of their wide practical applications. There are two major tensor decompositions that can be considered as higher order extensions of the matrix singular value decomposition (SVD): one is known as the CANDECOMP/PARAFAC (CP) decomposition, which attempts to present the tensor as a sum of rank-one tensors and has various applications in chemometrics [2, 4], neuroscience [1], and data mining [6]; the other is known as the Tucker decomposition, which is a higher order generalization of principal component analysis (PCA) and has found wide applications in signal processing [18], data mining [37], computer vision [45, 46], and chemical analysis [22, 23]. There are extensive studies in the literature on finding the CP decomposition and the Tucker decomposition of tensors; see, e.g. [16, 15, 27, 30, 31, 35, 17, 47, 50, 40]. On the computational side, the existing popular algorithms are based on the alternating least squares (ALS) method, proposed originally by Carroll and Chang [12] and Harshman [20]. The ALS method is not guaranteed to converge to a global optimum or even a stationary point, but only to a solution where the objective function ceases to decrease. On the other hand, if the ALS method converges under a certain non-degeneracy assumption, then it actually has a local linear convergence rate [44]. Recently, Chen et al. [14] put forward a block coordinate descent type search method known as maximum block improvement (MBI) for solving a class of nonlinear optimization problems with separable structure (including tensor decompositions as special cases), and proved that the sequence produced by the MBI method converges to a stationary point in general.

∗Department of Automation, Xiamen University, Xiamen 361005, China ([email protected]).
†Department of Mathematics, Shanghai University, Baoshan District, Shanghai 200444, China ([email protected]). The research of this author was supported in part by Natural Science Foundation of Shanghai (Grant 12ZR1410100) and Ph.D. Programs Foundation of Chinese Ministry of Education (Grant 20123108120002).
‡Department of Industrial and Systems Engineering, University of Minnesota, Minneapolis, MN 55455, USA ([email protected]). The research of this author was supported in part by National Science Foundation of USA (Grant CMMI-1161242).

The MBI method performs very well numerically. It has been applied successfully to the problem of finding the best rank-one approximation of tensors [14], and to problems arising from bioinformatics [49, 48] and radar systems [5]. Local linear convergence of the MBI method for certain types of problems was established by Li et al. [33]. A recent treatise on this topic can be found in the Ph.D. thesis of Chen [13]. In this paper, we study the Tucker decomposition problem with the characteristic that the size of its core is unspecified. We shall demonstrate how the MBI method can be applied to solve such problems.

Tucker decomposition can be viewed as a generalization of the CP decomposition, the latter being a Tucker model with an equal number of components in each mode. It was first introduced in 1963 by Tucker [41], and later redefined by Levin [32] and Tucker [42, 43]. The goal of Tucker decomposition is to decompose a tensor into a core tensor multiplied by a matrix along each mode. It is related to the best rank-(r1, r2, …, rd) approximation of d-th order tensors (cf. [16]). Typically, a rank-(r1, r2, …, rd) Tucker decomposition (sometimes called the best multilinear rank-(r1, r2, …, rd) approximation) means that the size of the core tensor is r1 × r2 × ⋯ × rd. In terms of decomposition methods, Tucker [43] proposed three methods to find the Tucker decomposition of three-way tensors in 1966, among which the first is referred to as the Tucker1 method, now better known as the higher order singular value decomposition (HOSVD) (see [15]). In [15], De Lathauwer et al. not only showed that the HOSVD of a tensor is a convincing generalization of the SVD for the matrix case, but also proposed a truncated HOSVD that gives a suboptimal rank-(r1, r2, …, rd) approximation of tensors. Analogous to the ALS method developed for computing the CP decomposition, researchers also derived similar methods for computing the Tucker decomposition. Kroonenberg and De Leeuw [30] proposed TUCKALS3 for computing the Tucker decomposition of three-way arrays. Later, Kapteyn et al. [25] extended TUCKALS3 to decompose general d-way arrays for d > 3. Moreover, De Lathauwer et al. [16] derived a more efficient technique called the higher order orthogonal iteration (HOOI) method for computing the rank-(r1, r2, …, rd) Tucker decomposition. Possible improvements of the HOOI algorithm were considered by Andersson and Bro [3]. However, as mentioned earlier, the ALS method has no convergence guarantee in general. Alternatively, there are methods for Tucker decomposition with a convergence guarantee. Elden and Savas [19], and Ishteva et al. [24], simultaneously proposed Newton-based methods for computing the best multilinear rank-(r1, r2, r3) approximation of tensors; the former is known as the Newton-Grassmann method, and the latter as the differential-geometric Newton method. Both methods have the quadratic local convergence property. (Note that computing the Hessian is usually a very demanding task.) For more discussions on the Tucker decomposition and its variants, one is referred to the survey paper by Kolda and Bader [29].

In Tucker decomposition, the rank-(r1, r2, …, rd) (namely the size of the core) is considered a set of predetermined parameters. This requirement, however, is restrictive in some applications. For example, the problem of how to choose a good number of clusters for co-clustering of gene expression data comes up in bioinformatics; in fact, the number of suitable co-clusters is not known in advance (see Zhang et al. [48]). The question of how to choose the rank of the Tucker model (cf. [39, 26, 21]) is challenging, since the problem is already NP-hard even when the Tucker rank (r1, r2, …, rd) is known. A similar issue of selecting an appropriate rank has also been considered for the CP decomposition, e.g. the consistency diagnostic named CORCONDIA designed in [11]. Timmerman and Kiers [39] proposed the DIFFIT procedure, based on optimal fit, to choose the numbers of components in Tucker decomposition of three-way data. The procedure, however, is rather time-consuming. Kiers and Der Kinderen [26] revised the procedure and computed DIFFIT on approximate fit to save computational time.

In this paper, we propose a new scheme to choose good ranks in Tucker decomposition, provided that the summation of the ranks Σ_{i=1}^d r_i is fixed. The remainder of the paper is organized as follows. In Section 2, we introduce the notation and the tensor operations frequently used throughout the paper. We then apply the MBI method to solve the rank-(r1, r2, …, rd) Tucker decomposition in Section 3. In Section 4, a new model for Tucker decomposition with unspecified core size is proposed and solved based on the MBI method and the penalty method; a heuristic approach to compute the new model is also developed in that section. Finally, numerical results testing the new model and algorithms are presented in Section 5.

2. Notations. Throughout this paper, we uniformly use non-bold lowercase letters, boldface lowercase letters, capital letters, and calligraphic letters to denote scalars, vectors, matrices, and tensors, respectively; e.g. scalar i, vector y, matrix A, and tensor F. We use subscripts to denote the components of a vector, a matrix, or a tensor; e.g. y_i is the i-th entry of the vector y, A_ij the (i, j)-th entry of the matrix A, and F_ijk the (i, j, k)-th entry of the tensor F. Let us first introduce some important tensor operations that appear frequently in this paper, largely in line with [29, 15]. For an overview of tensor operations and properties, we refer to the survey paper [29].

A tensor is a multidimensional array, and the order of a tensor is its dimension, also known as the ways or the modes of a tensor. In particular, a vector is a tensor of order one, and a matrix is a tensor of order two. Consider A = (A_{i1 i2 … id}) ∈ R^{n1×n2×⋯×nd}, a standard tensor of order d, where d ≥ 3. A usual way to handle a tensor is to reorder its elements into a matrix; this process is called matricization, also known as unfolding or flattening. A ∈ R^{n1×n2×⋯×nd} has d modes, namely mode-1, mode-2, …, mode-d. Denote the mode-k matricization of tensor A by A_(k); then the (i1, i2, …, id)-th entry of tensor A is mapped to the (ik, j)-th entry of the matrix A_(k) ∈ R^{nk × Π_{ℓ≠k} nℓ}, where

    j = 1 + Σ_{1≤ℓ≤d, ℓ≠k} (iℓ − 1) Π_{1≤t≤ℓ−1, t≠k} nt.
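To make the index map above concrete, here is a minimal NumPy sketch of the mode-k matricization and its inverse (a sketch for illustration only; the paper's experiments use the MATLAB Tensor Toolbox, and the function names here are ours):

```python
import numpy as np

def unfold(A, k):
    """Mode-k matricization A_(k): move mode k to the front, then flatten
    the remaining modes in column-major order, matching the index map above."""
    return np.moveaxis(A, k, 0).reshape(A.shape[k], -1, order="F")

def fold(M, k, shape):
    """Inverse of unfold: rebuild the tensor of the given shape from M."""
    moved = [shape[k]] + [n for i, n in enumerate(shape) if i != k]
    return np.moveaxis(M.reshape(moved, order="F"), 0, k)
```

For instance, with A of shape (3, 4, 5), unfold(A, 1) has shape (4, 15), and fold(unfold(A, 1), 1, A.shape) recovers A.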

The k-rank of tensor A, denoted by rank_k(A), is the column rank of the mode-k unfolding A_(k), i.e., rank_k(A) = rank(A_(k)). A d-th order tensor with rank_k(A) = r_k for k = 1, 2, …, d is briefly called a rank-(r1, r2, …, rd) tensor.
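In code, the k-rank is just the matrix rank of the mode-k unfolding; a tiny sketch (our own names, repeating the unfolding above):

```python
import numpy as np

def k_rank(A, k):
    # rank_k(A) = rank(A_(k)), computed numerically
    M = np.moveaxis(A, k, 0).reshape(A.shape[k], -1, order="F")
    return np.linalg.matrix_rank(M)
```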

Analogous to the Frobenius norm of a matrix, the Frobenius norm of tensor A is the usual 2-norm, defined by

    ‖A‖ := ( Σ_{i1=1}^{n1} Σ_{i2=1}^{n2} ⋯ Σ_{id=1}^{nd} A_{i1 i2 … id}² )^{1/2}.

In this paper we uniformly denote the 2-norm for vectors, and the Frobenius norm for matrices and tensors, all by the notation ‖·‖. The inner product of two same-sized tensors A, B is given by

    ⟨A, B⟩ = Σ_{i1=1}^{n1} Σ_{i2=1}^{n2} ⋯ Σ_{id=1}^{nd} A_{i1 i2 … id} B_{i1 i2 … id}.

Hence, it is clear that ⟨A, A⟩ = ‖A‖².

One important tensor operation is the multiplication of a tensor by a matrix. The k-mode product of tensor A by a matrix U ∈ R^{m×nk}, denoted by A ×_k U, is a tensor in R^{n1×n2×⋯×n_{k−1}×m×n_{k+1}×⋯×nd} whose (i1, i2, …, i_{k−1}, ℓ, i_{k+1}, …, id)-th entry is defined by

    (A ×_k U)_{i1 i2 … i_{k−1} ℓ i_{k+1} … id} = Σ_{ik=1}^{nk} A_{i1 i2 … i_{k−1} ik i_{k+1} … id} U_{ℓ ik}.

The equation can also be written in terms of tensor unfolding, i.e.,

    Y = A ×_k U  ⟺  Y_(k) = U A_(k).

This multiplication in fact changes the dimension of tensor A in mode k. In particular, if U is a vector in R^{nk}, the order of the tensor A ×_k U is reduced to d − 1, and its size is n1 × n2 × ⋯ × n_{k−1} × n_{k+1} × ⋯ × nd. The k-mode products can also be expressed by the matrix Kronecker product as follows:

    Y = A ×_1 U^(1) ×_2 U^(2) ⋯ ×_d U^(d)
    ⟺  Y_(k) = U^(k) A_(k) (U^(d) ⊗ ⋯ ⊗ U^(k+1) ⊗ U^(k−1) ⊗ ⋯ ⊗ U^(1))^T,

for any k = 1, 2, …, d, with U^(k) ∈ R^{mk×nk} for k = 1, 2, …, d. The proof of this property can be found in [28].
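A short sketch of the k-mode product through the unfolding identity Y_(k) = U A_(k) (again NumPy for illustration; the helper names are ours, repeated from the matricization sketch above):

```python
import numpy as np

def unfold(A, k):
    return np.moveaxis(A, k, 0).reshape(A.shape[k], -1, order="F")

def fold(M, k, shape):
    moved = [shape[k]] + [n for i, n in enumerate(shape) if i != k]
    return np.moveaxis(M.reshape(moved, order="F"), 0, k)

def mode_product(A, U, k):
    """A x_k U: contract mode k of A against the columns of U."""
    shape = list(A.shape)
    shape[k] = U.shape[0]
    return fold(U @ unfold(A, k), k, shape)
```

The same operation can be written in one line as np.moveaxis(np.tensordot(U, A, axes=(1, k)), 0, k), which is the form used in the later sketches.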

Throughout this paper we uniformly use a subscript in parentheses to denote the matricization of a tensor (e.g. A_(1) is the mode-1 matricization of tensor A), and a superscript in parentheses to denote a matrix in the mode product of a tensor (e.g. U^(1), of the appropriate size, in the mode-1 product of a tensor).

3. Convergence of the traditional Tucker decomposition. Traditionally, Tucker decomposition attempts to find the best approximation of a large-sized tensor by a small-sized tensor with pre-specified dimensions (called the core tensor) multiplied by a matrix on each mode. The problem can be formulated as follows: given a real tensor F ∈ R^{n1×n2×⋯×nd}, find a core tensor C ∈ R^{r1×r2×⋯×rd}, with pre-specified integers ri, 1 ≤ ri ≤ ni for i = 1, 2, …, d, that optimizes

    (T_min)  min  ‖F − C ×_1 A^(1) ×_2 A^(2) ⋯ ×_d A^(d)‖
             s.t. C ∈ R^{r1×r2×⋯×rd},
                  A^(i) ∈ R^{ni×ri}, i = 1, 2, …, d.

Here, the matrices A^(i) are the factor matrices. Without loss of generality, these matrices are assumed to be columnwise orthogonal. The problem can be considered as a generalization of the best rank-one approximation problem, namely the case of r1 = r2 = ⋯ = rd = 1.

For any fixed matrices A^(i), if we optimize the objective function of (T_min) over C, it is not hard to verify that (T_min) is equivalent to the following maximization model (see the discussion in [29, 16, 3]):

    (T_max)  max  ‖F ×_1 (A^(1))^T ×_2 (A^(2))^T ⋯ ×_d (A^(d))^T‖
             s.t. A^(i) ∈ R^{ni×ri}, (A^(i))^T A^(i) = I, i = 1, 2, …, d.
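The equivalence rests on the fact that, for fixed columnwise orthogonal factors, the optimal core is C = F ×_1 (A^(1))^T ⋯ ×_d (A^(d))^T and ‖F − C ×_1 A^(1) ⋯ ×_d A^(d)‖² = ‖F‖² − ‖C‖². A small numerical check of this identity (a sketch under our own helper names, not the paper's code):

```python
import numpy as np

def mode_product(A, U, k):
    return np.moveaxis(np.tensordot(U, A, axes=(1, k)), 0, k)

rng = np.random.default_rng(0)
F = rng.standard_normal((6, 5, 4))
# random columnwise orthogonal factors via QR (illustrative only)
As = [np.linalg.qr(rng.standard_normal((n, r)))[0]
      for n, r in zip(F.shape, (3, 2, 2))]

C = F
for k, A in enumerate(As):
    C = mode_product(C, A.T, k)          # core: F x_k (A^(k))^T
G = C
for k, A in enumerate(As):
    G = mode_product(G, A, k)            # reconstruction: C x_k A^(k)
lhs = np.linalg.norm(F - G) ** 2
rhs = np.linalg.norm(F) ** 2 - np.linalg.norm(C) ** 2
assert np.isclose(lhs, rhs)              # minimizing (Tmin) = maximizing ||C||
```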

The workhorse for solving (T_max) or (T_min) has traditionally been the ALS method. However, the convergence of the ALS method is not guaranteed in general. Chen et al. [14] proposed the MBI method, a greedy-type search algorithm for optimization models with the general separable structure

    (G)  max  f(x1, x2, …, xd)
         s.t. xi ∈ Si ⊆ R^{ni}, i = 1, 2, …, d,

where f: R^{n1+n2+⋯+nd} → R is a general continuous function. Assuming that, for any fixed d − 1 blocks of variables xi, optimization over the remaining block can be done, the MBI method is guaranteed to converge to a stationary point under the mild condition that Si is compact for i = 1, 2, …, d. We notice that (T_max) is a particular instance of (G). Moreover, by fixing any d − 1 blocks A^(i) in (T_max), the optimization subproblem in the MBI method can be easily solved by the singular value decomposition (SVD).

To be specific, the subproblem of (T_max) when applying the MBI method is

    (T^i_max)  max  ‖F ×_1 (A^(1))^T ×_2 (A^(2))^T ⋯ ×_d (A^(d))^T‖
               s.t. A^(i) ∈ R^{ni×ri}, (A^(i))^T A^(i) = I,

where the matrices A^(1), A^(2), …, A^(i−1), A^(i+1), …, A^(d) are given. The objective function of (T^i_max) can be written in matrix form as

    ‖(A^(i))^T F_(i) (A^(d) ⊗ ⋯ ⊗ A^(i+1) ⊗ A^(i−1) ⊗ ⋯ ⊗ A^(1))‖.

Therefore, the optimal solution of (T^i_max) consists of the ri leading left singular vectors of the matrix F_(i) (A^(d) ⊗ ⋯ ⊗ A^(i+1) ⊗ A^(i−1) ⊗ ⋯ ⊗ A^(1)). Let us now present the algorithm for solving (T_max) by the MBI approach.
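A hedged sketch of this SVD subproblem (helper names are ours; unfold as in Section 2):

```python
import numpy as np
from functools import reduce

def unfold(A, k):
    return np.moveaxis(A, k, 0).reshape(A.shape[k], -1, order="F")

def solve_block(F, As, i, r):
    """Best factor for block i given all other factors: the r leading left
    singular vectors of W = F_(i) (A^(d) kron ... kron A^(1)), skipping i."""
    others = [As[j] for j in range(len(As)) if j != i]
    K = reduce(np.kron, reversed(others))     # A^(d) x ... x A^(1), i omitted
    U, s, _ = np.linalg.svd(unfold(F, i) @ K, full_matrices=False)
    return U[:, :r], np.linalg.norm(s[:r])    # new A^(i) and objective value
```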

Algorithm 1. The MBI method for Tucker decomposition

Input: Tensor F ∈ R^{n1×n2×⋯×nd} and scalars r1, r2, …, rd.
Output: Core tensor C ∈ R^{r1×r2×⋯×rd} and matrices A^(i) ∈ R^{ni×ri} for i = 1, 2, …, d.

0  Choose an initial feasible solution (A^(1)_0, A^(2)_0, …, A^(d)_0) and compute the initial objective value v_0 := ‖F ×_1 (A^(1)_0)^T ×_2 (A^(2)_0)^T ⋯ ×_d (A^(d)_0)^T‖. Set k := 0.
1  For each i = 1, 2, …, d, compute B^(i)_{k+1}, consisting of the ri leading left singular vectors of F_(i) (A^(d)_k ⊗ ⋯ ⊗ A^(i+1)_k ⊗ A^(i−1)_k ⊗ ⋯ ⊗ A^(1)_k), and let
       w^i_{k+1} := ‖F ×_1 (A^(1)_k)^T ⋯ ×_{i−1} (A^(i−1)_k)^T ×_i (B^(i)_{k+1})^T ×_{i+1} (A^(i+1)_k)^T ⋯ ×_d (A^(d)_k)^T‖.
2  Let v_{k+1} := max_{1≤i≤d} w^i_{k+1}, choose one i* = arg max_{1≤i≤d} w^i_{k+1}, and further denote
       A^(i)_{k+1} := A^(i)_k for i ∈ {1, 2, …, d}\{i*},  A^(i*)_{k+1} := B^(i*)_{k+1}.
3  If |v_{k+1} − v_k| ≤ ε, stop and output the core tensor
       C := F ×_1 (A^(1)_{k+1})^T ×_2 (A^(2)_{k+1})^T ⋯ ×_d (A^(d)_{k+1})^T
   and matrices A^(i) = A^(i)_{k+1} for i = 1, 2, …, d; otherwise, set k := k + 1 and go to Step 1.
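A compact Python sketch of Algorithm 1 (our own names throughout; it assumes ri does not exceed the number of available singular vectors, and it uses random orthonormal starting factors):

```python
import numpy as np
from functools import reduce

def unfold(A, k):
    return np.moveaxis(A, k, 0).reshape(A.shape[k], -1, order="F")

def mode_product(A, U, k):
    return np.moveaxis(np.tensordot(U, A, axes=(1, k)), 0, k)

def core_of(F, As):
    # F x1 (A^(1))^T ... xd (A^(d))^T
    for k, A in enumerate(As):
        F = mode_product(F, A.T, k)
    return F

def mbi_tucker(F, ranks, eps=1e-4, max_iter=500, seed=0):
    rng = np.random.default_rng(seed)
    As = [np.linalg.qr(rng.standard_normal((n, r)))[0]
          for n, r in zip(F.shape, ranks)]
    v = np.linalg.norm(core_of(F, As))
    for _ in range(max_iter):
        best = None
        for i in range(len(As)):                       # try every block
            others = [As[j] for j in range(len(As)) if j != i]
            W = unfold(F, i) @ reduce(np.kron, reversed(others))
            U, s, _ = np.linalg.svd(W, full_matrices=False)
            w = np.linalg.norm(s[:ranks[i]])
            if best is None or w > best[0]:
                best = (w, i, U[:, :ranks[i]])
        w, i_star, B = best
        As[i_star] = B                                 # update only the best block
        if abs(w - v) <= eps:                          # Step 3 stopping rule
            break
        v = w
    return core_of(F, As), As
```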

According to the convergence result of the MBI method established in [14], Algorithm 1 is guaranteed to converge to a stationary point of the Tucker decomposition problem, which is not the case for many other algorithms, e.g. the HOSVD method and the HOOI method.

4. Tucker decomposition with unspecified size of the core. In this section, we propose a new model for Tucker decomposition that does not pre-specify the size of the core tensor. The dimension of each mode of the core is no longer a constant; rather, it is a variable that also needs to be optimized, which is a key ingredient of our model. Several algorithms are proposed to solve the new model as well.

4.1. The formulation. Given a tensor F ∈ R^{n1×n2×⋯×nd}, the goal is to find a small-sized (low rank) core tensor C and d slim matrices A^(i) that express F as closely as possible. We are interested in determining the rank of each mode of C as well as the best approximation of F. Let the i-rank of C be ri for i = 1, 2, …, d. Clearly, we have 1 ≤ ri ≤ ni for i = 1, 2, …, d. Unlike in the general Tucker decomposition, the ri are now decision variables which need to be determined. Denote by c a given constant for the summation of all the i-rank variables, i.e., Σ_{i=1}^d ri = c, which in general prevents the ri from being too large. To determine the allocation of the total number c among the ri, the new model is

    (NT_min)  min  ‖F − C ×_1 A^(1) ×_2 A^(2) ⋯ ×_d A^(d)‖
              s.t. C ∈ R^{r1×r2×⋯×rd},
                   A^(i) ∈ R^{ni×ri}, i = 1, 2, …, d,
                   ri ∈ Z, 1 ≤ ri ≤ ni, i = 1, 2, …, d,
                   Σ_{i=1}^d ri = c.

It is difficult to solve (NT_min) directly, since the first two constraints of (NT_min) combine the block variables A^(i) and the i-rank variables ri together. A straightforward method is to separate these variables: we introduce d more block variables Y^(i) ∈ R^{mi×mi}, where mi := min{ni, c} for i = 1, 2, …, d and

    Y^(i) = diag(y^(i)),  y^(i) ∈ {0, 1}^{mi},  Σ_{j=1}^{mi} y^(i)_j = ri,  for i = 1, 2, …, d.

By the equivalence between the Tucker decomposition models (T_min) and (T_max), we may reformulate (NT_min) as follows:

    (NT_max)  max  ‖F ×_1 (A^(1) Y^(1))^T ×_2 (A^(2) Y^(2))^T ⋯ ×_d (A^(d) Y^(d))^T‖
              s.t. A^(i) ∈ R^{ni×mi}, (A^(i))^T A^(i) = I, i = 1, 2, …, d,
                   y^(i) ∈ {0, 1}^{mi}, i = 1, 2, …, d,
                   Σ_{j=1}^{mi} y^(i)_j ≥ 1, i = 1, 2, …, d,
                   Σ_{i=1}^d Σ_{j=1}^{mi} y^(i)_j = c.

Throughout this paper, Y^(i) always represents diag(y^(i)); we refrain from putting Y^(i) = diag(y^(i)) in the constraints in order to keep the models concise. Note that ri in (NT_min) has already been replaced by the number of nonzero entries of y^(i) in (NT_max).

Let X = F ×_1 (A^(1) Y^(1))^T ×_2 (A^(2) Y^(2))^T ⋯ ×_d (A^(d) Y^(d))^T ∈ R^{m1×m2×⋯×md}. If the feasible block variables Y^(i) satisfy

    Y^(i) = diag(1, 1, …, 1, 0, 0, …, 0) with ri leading ones and mi − ri trailing zeros, i = 1, 2, …, d,    (4.1)

then the size of the core tensor X can be reduced to r1 × r2 × ⋯ × rd by deleting the tails of zero entries in all modes, and the rank of X in each mode is equal to r1, r2, …, rd, respectively. This observation establishes the equivalence between (NT_min) and (NT_max). As we shall see in the next subsection, we may without loss of generality assume Y^(i) to be of the form (4.1). This allows us to easily construct a core tensor of rank-(r1, r2, …, rd), which is exactly the size of the core tensor C to be optimized in (NT_min). Besides, we remark that the third constraint Σ_{j=1}^{mi} y^(i)_j ≥ 1 for i = 1, 2, …, d in (NT_max) is actually redundant: if Σ_{j=1}^{mi} y^(i)_j = 0 for some i, then the objective is clearly zero, which can never be optimal. However, we still keep it in (NT_max) for consistency in implementing the algorithms discussed in the following subsections.

4.2. The penalty method. Let us focus on the model for Tucker decomposition with unspecified size of the core in the maximization form (NT_max). Recall that the separable structure of (G) is required to implement the MBI method. Therefore, we move the nonseparable constraint Σ_{i=1}^d Σ_{j=1}^{mi} y^(i)_j = c to the objective function, i.e., we form the penalty function

    p(λ, A^(1), A^(2), …, A^(d), Y^(1), Y^(2), …, Y^(d))
        := ‖F ×_1 (A^(1) Y^(1))^T ×_2 (A^(2) Y^(2))^T ⋯ ×_d (A^(d) Y^(d))^T‖² − λ (Σ_{i=1}^d Σ_{j=1}^{mi} y^(i)_j − c)²,

where λ > 0 is a penalty parameter. The penalty model for (NT_max) is then

    (PT)  max  p(λ, A^(1), A^(2), …, A^(d), Y^(1), Y^(2), …, Y^(d))
          s.t. A^(i) ∈ R^{ni×mi}, (A^(i))^T A^(i) = I, i = 1, 2, …, d,
               y^(i) ∈ {0, 1}^{mi}, i = 1, 2, …, d,
               Σ_{j=1}^{mi} y^(i)_j ≥ 1, i = 1, 2, …, d.

We are now ready to apply the MBI method to solve (PT), since the block constraints are separable. Before presenting the formal algorithm, let us first discuss the subproblems arising in the MBI method, which have to be solved globally as a requirement to guarantee convergence. Without loss of generality, we wish to optimize (A^(1), Y^(1)) while all other block variables (A^(i), Y^(i)) for i = 2, 3, …, d are fixed, i.e.,

    (PT^1)  max  ‖(A^(1) Y^(1))^T W^(1)‖² − λ (Σ_{j=1}^{m1} y^(1)_j + c1)²
            s.t. A^(1) ∈ R^{n1×m1}, (A^(1))^T A^(1) = I,
                 y^(1) ∈ {0, 1}^{m1}, Σ_{j=1}^{m1} y^(1)_j ≥ 1,

where W^(1) := F_(1) (A^(d) Y^(d) ⊗ ⋯ ⊗ A^(2) Y^(2)) and c1 := Σ_{i=2}^d Σ_{j=1}^{mi} y^(i)_j − c.

This subproblem can indeed be solved easily. First, as the optimization of A^(1) is irrelevant to the penalty term, the optimal A^(1) consists of the m1 leading left singular vectors of the matrix W^(1), obtained by applying the SVD. Next, we search for the optimal Y^(1) given the optimal A^(1). Denote by vj the j-th row vector of the matrix (A^(1))^T W^(1) for j = 1, 2, …, m1; then we have

    ‖(A^(1) Y^(1))^T W^(1)‖² = Σ_{j=1}^{m1} y^(1)_j ‖vj‖².

Therefore, the optimization of Y^(1) reduces to the following problem:

    max  −λ (Σ_{j=1}^{m1} y^(1)_j)² + Σ_{j=1}^{m1} (‖vj‖² − 2λc1) y^(1)_j − λ c1²
    s.t. y^(1) ∈ {0, 1}^{m1}, Σ_{j=1}^{m1} y^(1)_j ≥ 1.

Although the above model appears to be combinatorial, it is solvable in polynomial time. This is because the possible values of Σ_{j=1}^{m1} y^(1)_j are 1, 2, …, m1, and for any fixed Σ_{j=1}^{m1} y^(1)_j the objective function is linear, so the optimal allocation of y^(1) can be assigned greedily. Thus we only need to try m1 different values of Σ_{j=1}^{m1} y^(1)_j and pick the best solution.
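The polynomial-time solve just described can be sketched as follows (our own names; it assumes the ‖vj‖² are already in non-increasing order, as is the case when A^(1) comes from the SVD of W^(1)):

```python
import numpy as np

def solve_y_block(row_norms_sq, lam, c1):
    """Pick the best support size s and the greedy 0-1 vector y^(1)."""
    m1 = len(row_norms_sq)
    prefix = np.cumsum(row_norms_sq)       # sum of the s largest ||v_j||^2
    best_s, best_val = 1, -np.inf
    for s in range(1, m1 + 1):
        val = prefix[s - 1] - lam * (s + c1) ** 2
        if val > best_val:
            best_s, best_val = s, val
    y = np.zeros(m1, dtype=int)
    y[:best_s] = 1                         # automatically of the form (4.1)
    return y, best_val
```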

From the above discussion, we know that the optimal Y^(1) automatically satisfies the form (4.1); this is because A^(1) consists of the m1 leading left singular vectors of the matrix W^(1), so that the ‖vj‖ are in non-increasing order. Therefore we can update A^(1) and Y^(1) simultaneously when solving the subproblem of the MBI method for a given penalty parameter λ. To summarize, the whole procedure for solving (NT_max) using the MBI method and the penalty function method (cf. [8, 38]) is as follows.

Algorithm 2. The MBI method for the penalty model

Input: Tensor F ∈ R^{n1×n2×⋯×nd} and integer c ≥ d.
Output: Core tensor C ∈ R^{m1×m2×⋯×md} and matrices A^(i) ∈ R^{ni×mi} for i = 1, 2, …, d.

0  Choose penalty parameters λ0 > 0, σ > 1 and an initial feasible solution (A^(1)_0, …, A^(d)_0, Y^(1)_0, …, Y^(d)_0), and compute the initial objective value v_0 := p(λ0, A^(1)_0, …, A^(d)_0, Y^(1)_0, …, Y^(d)_0). Set k := 0, ℓ := 0 and λ := λ0.
1  For each i = 1, 2, …, d, solve (PT^i) and get its optimal solution (B^(i)_{k+1}, Z^(i)_{k+1}) with optimal value w^i_{k+1}, where (PT^i) is defined similarly to (PT^1) with block 1 replaced by block i.
2  Let v_{k+1} := max_{1≤i≤d} w^i_{k+1}, choose one i* = arg max_{1≤i≤d} w^i_{k+1}, and further denote
       (A^(i)_{k+1}, Y^(i)_{k+1}) := (A^(i)_k, Y^(i)_k) for i ∈ {1, 2, …, d}\{i*},  (A^(i*)_{k+1}, Y^(i*)_{k+1}) := (B^(i*)_{k+1}, Z^(i*)_{k+1}).
3  If |v_{k+1} − v_k| ≤ ε, go to Step 4; otherwise, set k := k + 1 and go to Step 1.
4  If λ (Σ_{i=1}^d Σ_{j=1}^{mi} (y^(i)_{k+1})_j − c)² ≤ ε0, stop and output the core tensor
       C := F ×_1 (A^(1)_{k+1} Y^(1)_{k+1})^T ×_2 (A^(2)_{k+1} Y^(2)_{k+1})^T ⋯ ×_d (A^(d)_{k+1} Y^(d)_{k+1})^T
   and matrices A^(i) = A^(i)_{k+1} Y^(i)_{k+1} for i = 1, 2, …, d; otherwise, set ℓ := ℓ + 1, λ := λ0 σ^ℓ and k := 0, and go to Step 1.
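One inner block update of Algorithm 2 (an SVD followed by the greedy support-size search) can be sketched as follows (names are ours; it assumes mi does not exceed the number of singular values of W^(i)):

```python
import numpy as np
from functools import reduce

def unfold(A, k):
    return np.moveaxis(A, k, 0).reshape(A.shape[k], -1, order="F")

def update_block(F, As, ys, i, lam, c):
    """Solve (PT^i): return the new A^(i), y^(i) and the attained value."""
    d = len(As)
    # A^(j) Y^(j): multiplying by diag(y) zeroes out the unselected columns
    others = [As[j] * ys[j] for j in range(d) if j != i]
    W = unfold(F, i) @ reduce(np.kron, reversed(others))      # W^(i)
    U, s, _ = np.linalg.svd(W, full_matrices=False)
    m_i = As[i].shape[1]
    A_new = U[:, :m_i]                    # m_i leading left singular vectors
    c_i = sum(y.sum() for j, y in enumerate(ys) if j != i) - c
    prefix = np.cumsum(s[:m_i] ** 2)      # ||v_j||^2 are the squared singular values
    vals = [prefix[t] - lam * (t + 1 + c_i) ** 2 for t in range(m_i)]
    best = int(np.argmax(vals))
    y_new = np.zeros(m_i, dtype=int)
    y_new[:best + 1] = 1                  # of the form (4.1)
    return A_new, y_new, vals[best]
```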

We remark that when Algorithm 2 stops, we can shrink the size of the core tensor to r1 × r2 × ⋯ × rd by deleting the tails of zero entries in all modes, where ri is the number of nonzero entries of y^(i). As an MBI method, Algorithm 2 is guaranteed to converge to a stationary point of (PT), as claimed in [14].
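Since each y^(i) has the form (4.1), the shrink step amounts to keeping the leading ri slices in every mode; a tiny sketch (our own names):

```python
def shrink_core(X, ranks):
    # drop the zero tails: keep the first r_i entries in mode i
    return X[tuple(slice(0, r) for r in ranks)]
```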

4.3. A heuristic method. To further save computational effort, in this subsection we present a heuristic approach to solve the Tucker decomposition model with unspecified size of the core, i.e., (NT_max). We know that if each i-rank (i = 1, 2, …, d) of the core tensor C is given, then the problem becomes the traditional Tucker decomposition discussed in Section 3, which can be solved by the MBI method (Algorithm 1). From the discussion in Section 4, we notice that the key issue of the model is to allocate the constant number c among the i-ranks of the core tensor. Therefore, the optimal value of (T_max) can be considered as a function of (r1, r2, …, rd), denoted by g(r1, r2, …, rd). Our heuristic approach tries to find a distribution of the constant c by adapting the idea of the MBI method. Specifically, we may start with low i-ranks, e.g. r^0_1 = r^0_2 = ⋯ = r^0_d = 1, and compute the Tucker decomposition using Algorithm 1; we then increase one of the i-ranks (1 ≤ i ≤ d) by one, choosing i to yield the best Tucker decomposition among the d possible increments of the i-ranks, i.e.,

    (r^{k+1}_1, r^{k+1}_2, …, r^{k+1}_d) = arg max_{1≤i≤d} g(r^k_1, r^k_2, …, r^k_{i−1}, r^k_i + 1, r^k_{i+1}, …, r^k_d).

This procedure is continued until Σ_{i=1}^d r^k_i = c for some k (= c − d).
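A sketch of this rank-increasing procedure (our own names; tucker_solver stands for any routine returning the attained objective value of (T_max) for given ranks, e.g. a wrapper around the mbi_tucker sketch; it assumes d ≤ c ≤ Σ_i ni):

```python
import numpy as np

def rank_increasing(F, c, tucker_solver):
    d = F.ndim
    ranks = [1] * d
    while sum(ranks) < c:
        best_i, best_g = None, -np.inf
        for i in range(d):                    # d candidate increments
            if ranks[i] >= F.shape[i]:
                continue                      # cannot exceed n_i
            trial = ranks.copy()
            trial[i] += 1
            g = tucker_solver(F, tuple(trial))
            if g > best_g:
                best_i, best_g = i, g
        ranks[best_i] += 1
    return ranks
```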

If the function g(r1, r2, …, rd) for Tucker decomposition could be globally evaluated for any given (r1, r2, …, rd) (though this is NP-hard in general), then the above approach would be a dynamic program, and the optimality of (NT_max) would be guaranteed when the method stops. Although in general Algorithm 1 can only find a stationary solution of g(r1, r2, …, rd), this heuristic approach is in fact very effective; see the numerical tests in Section 5. Apart from the rank-increasing approach, a rank-decreasing strategy is the other alternative, i.e., starting from large ri and decreasing them until Σ_{i=1}^d ri = c. We conclude this subsection by presenting the heuristic algorithm for the rank-decreasing method below, which is actually very easy to implement.

Algorithm 3. The rank decreasing method

Input: Tensor F ∈ R^{n1×n2×⋯×nd} and integer c ≥ d.
Output: Core tensor C ∈ R^{r1×r2×⋯×rd} and matrices A^(i) ∈ R^{ni×ri} for i = 1, 2, …, d.

0  Choose initial ranks (r1, r2, …, rd) with Σ_{i=1}^d ri > c.
1  Apply Algorithm 1 to solve (T_max) with input (r1, r2, …, rd), and output matrices A^(i) for i = 1, 2, …, d.
2  For each i = 1, 2, …, d, let B^(i) ∈ R^{ni×(ri−1)} be the matrix obtained by deleting the ri-th column of A^(i), and compute
       i* := arg max_{1≤i≤d} ‖F ×_1 (A^(1))^T ⋯ ×_{i−1} (A^(i−1))^T ×_i (B^(i))^T ×_{i+1} (A^(i+1))^T ⋯ ×_d (A^(d))^T‖.
   Update r_{i*} := r_{i*} − 1 and A^(i*) := B^(i*). Repeat if necessary until Σ_{i=1}^d ri = c.
3  Apply Algorithm 1 to solve (T_max) with input (r1, r2, …, rd), and output the core tensor C and matrices A^(i) for i = 1, 2, …, d.
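Step 2 of Algorithm 3 can be sketched as follows (our own names; mode_product as in Section 2):

```python
import numpy as np

def mode_product(A, U, k):
    return np.moveaxis(np.tensordot(U, A, axes=(1, k)), 0, k)

def drop_one_rank(F, As):
    """Delete the trailing column of the factor whose removal keeps ||core|| largest."""
    best_i, best_val = None, -np.inf
    for i, A in enumerate(As):
        if A.shape[1] <= 1:
            continue                          # keep every i-rank at least 1
        trial = As[:i] + [A[:, :-1]] + As[i + 1:]
        C = F
        for k, M in enumerate(trial):
            C = mode_product(C, M.T, k)
        val = np.linalg.norm(C)
        if val > best_val:
            best_i, best_val = i, val
    As[best_i] = As[best_i][:, :-1]
    return As
```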

5. Numerical experiments. In this section, we apply the algorithms proposed in the previous sections to solve Tucker decomposition without specification of the core size. Four different types of data are tested to demonstrate the effectiveness and efficiency of our algorithms. All the numerical computations are conducted on an Intel Xeon 3.40GHz computer with 8GB RAM, with MATLAB 7.12.0 (R2011a) as the platform. We use the MATLAB Tensor Toolbox Version 2.5 [7] whenever tensor operations are called, and also apply its embedded algorithm (the ALS method) to solve Tucker decomposition with given rank-(r1, r2, …, rd). The termination precision for the ALS method is set to 10⁻⁴.

In implementing Algorithm 2, the penalty increment parameter is set to σ := 2. The starting matrices A^(i)_0 are randomly generated and then orthonormalized. The starting diagonal matrices Y^(i)_0 are all set to

    Y^(i)_0 := diag(1, 0, 0, …, 0) with mi − 1 trailing zeros, for i = 1, 2, …, d.

The termination precision for Algorithm 2 is also set to 10⁻⁴. For Algorithm 3, the initial ranks are set to ri := min(ni, c) for i = 1, 2, …, d.

When applying Algorithm 1 in Step 1, the starting matrices A^(i)_0 for solving (T_max) are randomly generated, and the termination precision is set to 10⁻². When applying Algorithm 1 in Step 3, the starting matrices A^(i)_0 for solving (T_max) are obtained from Step 2, and the termination precision is set to 10⁻⁴.

For a given data tensor F in the following tests and the rank-(r1, r2, …, rd) approximation tensor F̂ computed by the algorithms, the relative error of the approximation is defined as ‖F − F̂‖/‖F‖. To measure how close F̂ is to the original tensor F, the fit is defined as

    fit := 1 − ‖F − F̂‖ / ‖F‖.

The larger the fit, the better the performance.

5.1. Noisy tensor decomposition. First we present some preliminary test results on two synthetic data sets. We use the Tensor Toolbox to randomly generate a noise-free tensor Y ∈ R^{50×50×20} with rank-(4, 4, 2), where the entries of the core tensor follow standard normal distributions and the entries of the orthogonal factor matrices follow uniform distributions. A noise tensor N ∈ R^{50×50×20} of the same size is randomly generated, whose entries follow standard normal distributions. The noisy tensor Z to be tested is then constructed as

    Z = Y + η (‖Y‖ / ‖N‖) N,

where η is the noise parameter, more formally, the perturbed ratio. Our task is to find the real i-ranks of the core tensor from the corrupted tensor Z.
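A hedged sketch of this test setup and of the fit measure (our own names; mode_product as in Section 2):

```python
import numpy as np

def mode_product(A, U, k):
    return np.moveaxis(np.tensordot(U, A, axes=(1, k)), 0, k)

rng = np.random.default_rng(1)
core = rng.standard_normal((4, 4, 2))
# orthogonal factors obtained by QR of uniform random matrices (illustrative)
Us = [np.linalg.qr(rng.uniform(size=(n, r)))[0]
      for n, r in zip((50, 50, 20), (4, 4, 2))]
Y = core
for k, U in enumerate(Us):
    Y = mode_product(Y, U, k)             # noise-free rank-(4,4,2) tensor

eta = 0.1                                  # perturbed ratio
N = rng.standard_normal(Y.shape)
Z = Y + eta * (np.linalg.norm(Y) / np.linalg.norm(N)) * N

def fit(F, F_hat):
    return 1 - np.linalg.norm(F - F_hat) / np.linalg.norm(F)
```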

In this set of tests, the initial penalty parameter for Algorithm 2 is set as λ0 = 0.005‖Z‖, and the summation of the ranks is set as c = 10. The results of this experiment are summarized in Figure 5.1, showing the fit and the computational time of Algorithms 2 and 3 for different perturbed ratios. For comparison, we also call the ALS method to solve the traditional Tucker decomposition with given (r1, r2, r3), randomly generated to satisfy r1 + r2 + r3 = 10. Also, 10 random rank samples are generated for the ALS method, and its maximum, average and minimum fit are presented on the left-hand side of Figure 5.1, corresponding to the three pentagon dots from top to bottom in each line segment, respectively. The computational time for the ALS method is then the summation over these 10 rank samples. We observe that Algorithms 2 and 3 can easily and quickly find the exact i-ranks (4, 4, 2) of the core tensor for different perturbed ratios, and can still approximate the original tensor Y accurately even for large noise. Figure 5.1 shows that Algorithms 2 and 3 outperform the ALS method significantly, both in fit and in computational time.

Fig. 5.1. Approximating a noisy tensor of size 50 × 50 × 30 with original rank-(4,4,2).

Figure 5.2 shows another synthetic experiment whose data are generated similarly to those in Figure 5.1. In this set of data, Y ∈ R^{100×100×50} and the i-ranks are (5, 5, 4). The summation of the ranks for testing Algorithms 2 and 3 is set as c = 14. Figure 5.2 again shows that our algorithms perform quite well against noise and are far superior to the ALS method. Furthermore, both Algorithms 2 and 3 can find the exact i-ranks (5, 5, 4) of the core tensor for various perturbed ratios.

Fig. 5.2. Approximating a noisy tensor of size 100 × 100 × 50 with original rank-(5,5,4).

5.2. Amino acid fluorescence data. This data set was originally generated and measured by Claus Andersson and was later published and tested by Bro [9, 10]. It consists of five laboratory-made samples. Each sample contains different amounts of tyrosine, tryptophan and phenylalanine dissolved in phosphate buffered water. The size of the array A to be decomposed is 5 × 201 × 61, which corresponds to samples, emission wavelength (250–450 nm) and excitation wavelength (240–300 nm), respectively. Ideally the array should be describable by three PARAFAC components, for which the fit obtained by the Tensor Toolbox is 97.44%. This data set can also be approximated well in the sense of Tucker decomposition. In implementing Algorithm 2, the initial penalty parameter is set as λ0 = 0.01‖A‖.

The numerical results for the three methods (ALS, Algorithms 2 and 3) are presented in Figure 5.3. Each pentagon dot in Figure 5.3 denotes the average fit over 10 runs of the ALS method with randomly generated rank-(r1, r2, r3) satisfying r1 + r2 + r3 = c, and the CPU time is the summation over the 10 random samples. The performances of Algorithms 2 and 3 are much better than that of the ALS method, especially Algorithm 2, which attains a convincing fit even for small c. Furthermore, Algorithm 2 decomposes this data set as well as PARAFAC does without knowing the exact i-ranks; e.g. the fit reaches 97.55% when c = 9, with rank-(3, 3, 3) found at termination.

Fig. 5.3. Approximating the amino acid fluorescence data for different c = r1 + r2 + r3.

To further justify the importance of Tucker decomposition with unspecified size of the core, as well as of Algorithm 2, we here investigate the ALS method with all possible pre-specified i-ranks on this data set. For the given summation of i-ranks c = 9, there are in total 25 possible combinations of (r1, r2, r3), and their corresponding fits are listed in Figure 5.4. It shows that the Tucker decomposition of rank-(3, 3, 3) outperforms all other combinations; this is exactly the set of i-ranks found by Algorithm 2.

5.3. Gene expression data. We further test a real three-way tensor F ∈ R^{2395×6×9}, which comes from 3D Arabidopsis gene expression data¹. Essentially, this is a set of gene expression data with 2395 genes, measured at 6 time points under 9 different conditions.

We again apply the three methods to this data set. The initial penalty parameter for Algorithm 2 is set as λ0 = 0.01‖F‖. The results of our experiments are summarized in Table 5.1. For the fits in the last three columns, by the ALS method, fit-2 and fit-3 denote the fit obtained by the ALS method run with the ranks produced by Algorithms 2 and 3, respectively, while the fit in the last column is the average fit over 10 runs of the ALS method with randomly generated ranks satisfying r1 + r2 + r3 = c.

1We thank Professor Xiuzhen Huang for providing this set of data.

Fig. 5.4. The ALS method on the amino acid fluorescence data for c = 9.

  c  |       Algorithm 2         |       Algorithm 3         |            ALS
     | (r1, r2, r3)  CPU  Fit(%) | (r1, r2, r3)  CPU  Fit(%) | Fit-2(%)  Fit-3(%)  Fit(%)
  3  | (1, 1, 1)    0.45  85.85  | (1, 1, 1)    1.62  85.96  |  85.85     85.85    85.85
  5  | (2, 1, 2)    7.04  86.66  | (2, 1, 2)    2.89  87.73  |  86.75     86.77    86.01
 10  | (4, 2, 4)    6.16  89.79  | (1, 3, 6)    2.33  93.89  |  89.82     85.96    86.61
 15  | (7, 3, 5)    9.21  91.96  | (5, 6, 4)    1.62  91.44  |  91.97     90.99    89.15
 20  | (10, 4, 6)   5.97  93.59  | (5, 6, 9)    1.55  91.44  |  93.60     91.44    92.71

Table 5.1 Approximating the gene expression data

The following observations are clearly visible:

• The excellent performances of Algorithms 2 and 3 are validated.
• Computing the i-rank information is important, as the ranks output by Algorithms 2 and 3 are more suitable for the ALS method, in terms of fit, than randomly generated ranks.
• In combination with the ALS method, Algorithm 2 works better than Algorithm 3 in terms of fit, while its CPU time is more costly than that of Algorithm 3.

5.4. Tensor compression of image data. Our final tests apply the models and algorithms in this paper to compress tensors arising from image data. Two sets of faces are presented, each in one subsection.

5.4.1. The ORL database of faces. The first set of images is from the ORL database of faces [36] of AT&T Laboratories Cambridge. In this database, there are 10 different images for each of 40 distinct subjects, and the size of each image is 92 × 112 pixels. Here, we draw one single distinct subject and construct a tensor T ∈ R^{92×112×10}. When applying Algorithm 2, the initial penalty parameter is set as λ0 = 0.005‖T‖.

Figure 5.5 displays the images compressed and recovered by the three methods when c = 50 and c = 70, respectively.

  c  | Method       | (r1, r2, r3)  |   CR   | CP(%) |  RMSE   | Fit(%) |  CPU
 50  | Algorithm 3  | (25, 22, 3)   |  62.4  | 98.4  | 143.98  | 83.99  |  0.71
     | ALS (best)   | (16, 25, 9)   |  28.6  | 96.5  |  75.25  | 87.64  |  1.15
     | ALS (worst)  | (43, 6, 1)    | 399.4  | 99.8  | 556.85  | 75.52  |
     | Algorithm 2  | (21, 19, 10)  |  25.8  | 96.1  |  67.89  | 88.26  |  5.80
 70  | Algorithm 3  | (35, 31, 4)   |  23.7  | 95.8  |  72.92  | 86.85  |  1.36
     | ALS (best)   | (34, 29, 7)   |  14.9  | 93.3  |  45.78  | 89.59  |  1.20
     | ALS (worst)  | (11, 58, 1)   | 161.5  | 99.4  | 345.26  | 76.10  |
     | Algorithm 2  | (30, 30, 10)  |  11.4  | 91.3  |  34.13  | 91.14  | 12.35

Table 5.2 Tensor compression for the ORL database of faces (the CPU time for ALS covers all 10 random rank samples)

The detailed numerical values for the two cases are listed in Table 5.2. Some standard abbreviations from image science are adopted, namely CP for compression, CR for compression ratio, and RMSE for root mean squared error. For the ALS method, the CPU time is the summation over 10 randomly generated sets of i-ranks satisfying r1 + r2 + r3 = c, and its best and worst compressions in terms of fit are reported for comparison.

Fig. 5.5. Recovered images for the ORL database of faces (rows from the top to the bottom: 1. Original, 2. Algorithm 3, 3. ALS (the best), 4. ALS (the worst), 5. Algorithm 2)

Observation of Figure 5.5 indicates that only the compressed images by the ALS method (the best one) and Algorithm 2 keep all the facial expressions. Moreover, Algorithm 2 indeed outperforms the best ALS run in terms of fit and RMSE, as shown in Table 5.2. It is worth mentioning that most of the computational effort of Algorithm 2 goes into finding a suitable combination of ranks (r1, r2, r3), whereas this issue is pre-specified for the ALS method. This computational effort in selecting better ranks is worthwhile, as we noticed that the ALS method for a large c may be worse than Algorithm 2 for a smaller c; e.g. the 4th row on the right of Figure 5.5 is not clearer than the 2nd row on the left of Figure 5.5, which is also confirmed by Table 5.2.

  c  | Method       | (r1, r2, r3)  |   CR    | CP(%) |  RMSE    | Fit(%) |   CPU
 50  | Algorithm 3  | (40, 39, 1)   |  294.1  | 99.7  |  480.15  | 80.11  |   2.50
     | ALS (best)   | (39, 38, 3)   |  103.2  | 99.0  |  183.30  | 87.18  |   2.81
     | ALS (worst)  | (75, 2, 3)    | 1.0e+03 | 99.9  | 1350.50  | 69.95  |
     | Algorithm 2  | (39, 34, 7)   |   49.4  | 98.0  |   75.17  | 92.40  |  35.67
 70  | Algorithm 3  | (53, 55, 2)   |   78.7  | 98.7  |  185.49  | 85.14  |   9.00
     | ALS (best)   | (54, 52, 4)   |   40.8  | 97.6  |   94.95  | 89.44  |   2.50
     | ALS (worst)  | (1, 105, 4)   | 1.1e+03 | 99.9  | 1958.61  | 57.89  |
     | Algorithm 2  | (53, 50, 7)   |   24.7  | 96.0  |   43.35  | 93.81  | 169.27

Table 5.3 Tensor compression for the JAFFE database

Figure 5.6 presents the performance of Algorithm 2 for various values of c. Clearly, the larger c is, the larger the fit and the lower the RMSE. The figure also suggests a guideline for choosing a suitable c, depending on the quality of the compressed images required by the user.

Fig. 5.6. Performance of Algorithm 2 for the ORL database of faces

5.4.2. The Japanese female facial expression (JAFFE) database. Our last tests are on the facial database [34] from the Psychology Department of Kyushu University, which contains 213 images of 7 different emotional facial expressions (sadness, happiness, surprise, anger, disgust, fear and neutral) posed by 10 Japanese female models. The size of each image is 256 × 256 pixels. Here we draw the 7 different facial expressions of one female model and construct a tensor T of size 256 × 256 × 7. A similar set of tests as in Section 5.4.1 is conducted. The results are presented in Figures 5.7 and 5.8, Table 5.3, and Figure 5.9, which again confirm the observations from the results for the ORL database of faces. In particular, comparing the two best sets of images (the best ALS run and Algorithm 2) in Figures 5.7 and 5.8, Algorithm 2 is clearly better in details, e.g. the eyebrows.

Fig. 5.7. Recovered images for the JAFFE database when c = 80 (rows from the top to the bottom: 1. Original, 2. Algorithm 3, 3. ALS (the best), 4. ALS (the worst), 5. Algorithm 2)

Fig. 5.8. Recovered images for the JAFFE database when c = 110 (rows from the top to the bottom: 1. Original, 2. Algorithm 3, 3. ALS (the best), 4. ALS (the worst), 5. Algorithm 2)

Fig. 5.9. Performance of Algorithm 2 for the JAFFE database

6. Conclusion. In this paper we study the problem of finding a proper Tucker decomposition for a general tensor without a pre-specified size of the core tensor, which addresses an important practical issue in real applications. The size of the core tensor is assumed to be given in almost all existing methods for tensor Tucker decomposition. The approach that we propose is based on the so-called maximum block improvement (MBI) method. Our numerical experiments on a variety of instances taken from real applications suggest that the proposed methods perform robustly and effectively.

REFERENCES

[1] A. H. Andersen and W. S. Rayens, Structure-seeking multilinear methods for the analysis of fMRI data, NeuroImage, 22 (2004), pp. 728–739.
[2] C. M. Andersen and R. Bro, Practical aspects of PARAFAC modeling of fluorescence excitation-emission data, Journal of Chemometrics, 17 (2003), pp. 200–215.
[3] C. A. Andersson and R. Bro, Improving the speed of multi-way algorithms: Part I. Tucker3, Chemometrics and Intelligent Laboratory Systems, 42 (1998), pp. 93–103.
[4] C. J. Appellof and E. R. Davidson, Strategies for analyzing data from video fluorometric monitoring of liquid chromatographic effluents, Analytical Chemistry, 53 (1981), pp. 2053–2056.
[5] A. Aubry, A. De Maio, B. Jiang, and S. Zhang, A cognitive approach for ambiguity function shaping, Proceedings of the 7th IEEE Sensor Array and Multichannel Signal Processing Workshop, 2012, pp. 41–44.
[6] B. W. Bader, M. W. Berry, and M. Browne, Discussion tracking in Enron email using PARAFAC, in M. W. Berry and M. Castellanos (eds.), Survey of Text Mining: Clustering, Classification, and Retrieval (2nd edition), Springer, New York, 2007, pp. 147–162.
[7] B. W. Bader and T. G. Kolda, MATLAB Tensor Toolbox, Version 2.5, 2012. http://csmr.ca.sandia.gov/~tgkolda/TensorToolbox

[8] D. P. Bertsekas, Nonlinear Programming (2nd edition), Athena Scientific, Belmont, MA, 1999.
[9] R. Bro, PARAFAC: Tutorial and applications, Chemometrics and Intelligent Laboratory Systems, 38 (1997), pp. 149–171.
[10] R. Bro, Multi-way Analysis in the Food Industry: Models, Algorithms, and Applications, Ph.D. Thesis, University of Amsterdam, Netherlands, and Royal Veterinary and Agricultural University, Denmark, 1998.
[11] R. Bro and H. A. L. Kiers, A new efficient method for determining the number of components in PARAFAC models, Journal of Chemometrics, 17 (2003), pp. 274–286.
[12] J. D. Carroll and J. J. Chang, Analysis of individual differences in multidimensional scaling via an N-way generalization of "Eckart-Young" decomposition, Psychometrika, 35 (1970), pp. 283–319.

[13] B. Chen, Optimization with Block Variables: Theory and Applications, Ph.D. Thesis, The Chinese University of Hong Kong, Hong Kong, 2012.
[14] B. Chen, S. He, Z. Li, and S. Zhang, Maximum block improvement and polynomial optimization, SIAM Journal on Optimization, 22 (2012), pp. 87–107.
[15] L. De Lathauwer, B. De Moor, and J. Vandewalle, A multilinear singular value decomposition, SIAM Journal on Matrix Analysis and Applications, 21 (2000), pp. 1253–1278.
[16] L. De Lathauwer, B. De Moor, and J. Vandewalle, On the best rank-1 and rank-(R1, R2, …, RN) approximation of higher-order tensors, SIAM Journal on Matrix Analysis and Applications, 21 (2000), pp. 1324–1342.
[17] L. De Lathauwer and D. Nion, Decompositions of a higher-order tensor in block terms—Part III: Alternating least squares algorithms, SIAM Journal on Matrix Analysis and Applications, 30 (2008), pp. 1067–1083.
[18] L. De Lathauwer and J. Vandewalle, Dimensionality reduction in higher-order signal processing and rank-(R1, R2, …, RN) reduction in multilinear algebra, Linear Algebra and its Applications, 391 (2004), pp. 31–55.
[19] L. Elden and B. Savas, A Newton-Grassmann method for computing the best multi-linear rank-(r1, r2, r3) approximation of a tensor, SIAM Journal on Matrix Analysis and Applications, 31 (2009), pp. 248–271.
[20] R. A. Harshman, Foundations of the PARAFAC procedure: Models and conditions for an "explanatory" multi-modal factor analysis, UCLA Working Papers in Phonetics, 16 (1970), pp. 1–84. Also available online at http://publish.uwo.ca/~harshman/wpppfac0.pdf
[21] Z. He, A. Cichocki, and S. Xie, Efficient method for Tucker3 model selection, Electronics Letters, 45 (2009), pp. 805–806.
[22] R. Henrion, Simultaneous simplification of loading and core matrices in N-way PCA: Application to chemometric data arrays, Fresenius' Journal of Analytical Chemistry, 361 (1998), pp. 15–22.
[23] R. Henrion, On global, local and stationary solutions in three-way data analysis, Journal of Chemometrics, 14 (2000), pp. 261–274.
[24] M. Ishteva, L. De Lathauwer, P. A. Absil, and S. Van Huffel, Differential-geometric Newton method for the best rank-(R1, R2, R3) approximation of tensors, Numerical Algorithms, 51 (2009), pp. 179–194.
[25] A. Kapteyn, H. Neudecker, and T. Wansbeek, An approach to n-mode components analysis, Psychometrika, 51 (1986), pp. 269–275.
[26] H. A. L. Kiers and A. Der Kinderen, A fast method for choosing the numbers of components in Tucker3 analysis, British Journal of Mathematical and Statistical Psychology, 56 (2003), pp. 119–125.
[27] E. Kofidis and P. A. Regalia, On the best rank-1 approximation of higher order supersymmetric tensors, SIAM Journal on Matrix Analysis and Applications, 23 (2002), pp. 863–884.
[28] T. G. Kolda, Multilinear operators for higher-order decompositions, Technical Report SAND2006-2081, Sandia National Laboratories, Albuquerque, NM, 2006.
[29] T. G. Kolda and B. W. Bader, Tensor decompositions and applications, SIAM Review, 51 (2009), pp. 455–500.
[30] P. M. Kroonenberg and J. De Leeuw, Principal component analysis of three-mode data by means of alternating least squares algorithms, Psychometrika, 45 (1980), pp. 69–97.
[31] D. Leibovici and R. Sabatier, A singular value decomposition of a k-way array for a principal component analysis of multiway data, PTA-k, Linear Algebra and its Applications, 269 (1998), pp. 307–329.
[32] J. Levin, Three-mode Factor Analysis, Ph.D. Thesis, University of Illinois, Urbana, IL, 1963.
[33] Z. Li, A. Uschmajew, and S. Zhang, Linear convergence analysis of the maximum block improvement method for spherically constrained optimization, Technical Report, 2013.
[34] M. J. Lyons, S. Akamatsu, M. Kamachi, and J. Gyoba, Coding facial expressions with Gabor wavelets, Proceedings of the 3rd IEEE International Conference on Automatic Face and Gesture Recognition, 1998, pp. 200–205.
[35] L. Qi, The best rank-one approximation ratio of a tensor space, SIAM Journal on Matrix Analysis and Applications, 32 (2011), pp. 430–442.
[36] F. Samaria and A. Harter, Parameterisation of a stochastic model for human face identification, Proceedings of the 2nd IEEE Workshop on Applications of Computer Vision, 1994, pp. 138–142.
[37] B. Savas and L. Elden, Handwritten digit classification using higher order singular value decomposition, Pattern Recognition, 40 (2007), pp. 993–1003.

[38] W. Sun and Y. Yuan, Optimization Theory and Methods: Nonlinear Programming, Springer Optimization and Its Applications, Volume 1, Springer, New York, 2006.
[39] M. E. Timmerman and H. A. L. Kiers, Three-mode principal components analysis: Choosing the numbers of components and sensitivity to local optima, British Journal of Mathematical and Statistical Psychology, 53 (2000), pp. 1–16.
[40] R. Tomioka, T. Suzuki, K. Hayashi, and H. Kashima, Statistical performance of convex tensor decomposition, Proceedings of the 25th Annual Conference on Neural Information Processing Systems, 2011, pp. 972–980.
[41] L. R. Tucker, Implications of factor analysis of three-way matrices for measurement of change, in C. W. Harris (ed.), Problems in Measuring Change, University of Wisconsin Press, 1963, pp. 122–137.
[42] L. R. Tucker, The extension of factor analysis to three-dimensional matrices, in H. Gulliksen and N. Frederiksen (eds.), Contributions to Mathematical Psychology, Holt, Rinehart & Winston, New York, NY, 1964.
[43] L. R. Tucker, Some mathematical notes on three-mode factor analysis, Psychometrika, 31 (1966), pp. 279–311.
[44] A. Uschmajew, Local convergence of the alternating least squares algorithm for canonical tensor approximation, SIAM Journal on Matrix Analysis and Applications, 33 (2012), pp. 639–652.
[45] M. A. O. Vasilescu and D. Terzopoulos, Multilinear analysis of image ensembles: TensorFaces, Proceedings of the 7th European Conference on Computer Vision, Lecture Notes in Computer Science, 2350 (2002), Springer, New York, pp. 447–460.
[46] H. Wang, Q. Wu, L. Shi, Y. Yu, and N. Ahuja, Out-of-core tensor approximation of multi-dimensional matrices of visual data, ACM Transactions on Graphics (Special Issue for SIGGRAPH 2005), 24 (2005), pp. 527–535.
[47] Y. Wang and L. Qi, On the successive supersymmetric rank-1 decomposition of higher-order supersymmetric tensors, Numerical Linear Algebra with Applications, 14 (2007), pp. 503–519.
[48] S. Zhang, K. Wang, C. Ashby, B. Chen, and X. Huang, A unified adaptive co-identification framework for high-D expression data, in T. Shibuya et al. (eds.), Proceedings of the 7th IAPR International Conference on Pattern Recognition in Bioinformatics, Lecture Notes in Computer Science, 7632 (2012), Springer, New York, pp. 59–70.
[49] S. Zhang, K. Wang, B. Chen, and X. Huang, A new framework for co-clustering of gene expression data, in M. Loog et al. (eds.), Proceedings of the 6th IAPR International Conference on Pattern Recognition in Bioinformatics, Lecture Notes in Computer Science, 7036 (2011), Springer, New York, pp. 1–12.
[50] T. Zhang and G. H. Golub, Rank-one approximation to high order tensors, SIAM Journal on Matrix Analysis and Applications, 23 (2001), pp. 534–550.