Part 2: Communication Costs of Tensor Decompositions
Grey Ballard
CS 294/Math 270: Communication-Avoiding Algorithms, UC Berkeley
March 28, 2016
Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.
Plan
Introduction to tensor decompositions [KB09]: nomenclature and notation; popular decompositions: CP and Tucker

We're seeking comm-optimal sequential and parallel algorithms: few known lower bounds; few standard libraries or HPC implementations

Many flavors of problems: dense or sparse; sequential or parallel; CP or Tucker (or alternatives like tensor train); choices of mathematical algorithm

We'll do two case studies of parallel algorithms: computing the CP decomposition of a sparse tensor [KU15]; computing the Tucker decomposition of a dense tensor [ABK15]
Grey Ballard CA Algorithms 1
Outline
1 Tensor Notation
2 Tensor Decompositions
3 Computing CP via Alternating Least Squares: Mathematical Background; Parallel Algorithm for Sparse Tensors
4 Computing Tucker via Sequentially Truncated Higher-Order SVD: Mathematical Background; Parallel Algorithm for Dense Tensors
5 Open Problems
Tensors
Vector N = 1
Matrix N = 2
3rd-Order Tensor N = 3
4th-Order Tensor N = 4
5th-Order Tensor N = 5
An Nth-order tensor has N modes. Notation convention: vector v, matrix M, tensor T
Grey Ballard CA Algorithms 2
Fibers
Mode-1 Fibers Mode-2 Fibers Mode-3 Fibers
A tensor can be decomposed into the fibers of each mode (fix all indices but one)
Grey Ballard CA Algorithms 3
Slices
[Figures from Kolda & Bader, SIAM Review 2009: Fig. 2.1, fibers of a 3rd-order tensor (mode-1/column fibers x_{:jk}, mode-2/row fibers x_{i:k}, mode-3/tube fibers x_{ij:}); Fig. 2.2, slices of a 3rd-order tensor (horizontal slices X_{i::}, lateral slices X_{:j:}, frontal slices X_{::k}, also written X_k).]
A tensor can also be decomposed into the slices of each mode (fix one index)
Grey Ballard CA Algorithms 4
Unfoldings
A tensor can be reshaped into matrices, called unfoldings or matricizations, for different modes (fibers form columns, slices form rows)
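To make the unfolding concrete, here is a minimal NumPy sketch (illustrative, not from the slides); the exact column ordering of an unfolding is a convention that varies between papers and libraries, so this uses the simple "move the mode to the front and reshape" variant:

import numpy as np

def unfold(X, n):
    # Mode-n unfolding: the mode-n fibers of X become the columns of the result.
    return np.moveaxis(X, n, 0).reshape(X.shape[n], -1)

X = np.arange(24.0).reshape(2, 3, 4)   # a small 2 x 3 x 4 tensor
print(unfold(X, 0).shape)              # (2, 12): mode-1 unfolding, I x (J*K)
print(unfold(X, 1).shape)              # (3, 8):  mode-2 unfolding, J x (I*K)
print(unfold(X, 2).shape)              # (4, 6):  mode-3 unfolding, K x (I*J)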
Grey Ballard CA Algorithms 5
Outline
1 Tensor Notation
2 Tensor Decompositions
3 Computing CP via Alternating Least Squares: Mathematical Background; Parallel Algorithm for Sparse Tensors
4 Computing Tucker via Sequentially Truncated Higher-Order SVD: Mathematical Background; Parallel Algorithm for Dense Tensors
5 Open Problems
Low-rank approximations of tensors
Tensor “decompositions” are usually low-rank approximations
They generalize matrix approximations from two viewpoints: sum of outer products (think PCA); product of two rectangular matrices (think high-variance subspaces)
Some applications seek true decompositions, but that is less common
Grey Ballard CA Algorithms 6
Sum of outer products
Matrix:
Tensor:
This is known as the CANDECOMP/PARAFAC (CP) decomposition
Grey Ballard CA Algorithms 7
CP Notation
$\mathcal{T} \approx \mathbf{u}_1 \circ \mathbf{v}_1 \circ \mathbf{w}_1 + \cdots + \mathbf{u}_R \circ \mathbf{v}_R \circ \mathbf{w}_R, \qquad \mathcal{T} \in \mathbb{R}^{I \times J \times K}$

$\mathcal{T} \approx [\![\mathbf{U}, \mathbf{V}, \mathbf{W}]\!]$, where $\mathbf{U} \in \mathbb{R}^{I \times R}$, $\mathbf{V} \in \mathbb{R}^{J \times R}$, $\mathbf{W} \in \mathbb{R}^{K \times R}$ are factor matrices

$t_{ijk} \approx \sum_{r=1}^{R} u_{ir} v_{jr} w_{kr}, \qquad 1 \le i \le I,\ 1 \le j \le J,\ 1 \le k \le K$
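As a quick illustration of the elementwise formula above (my own sketch, not part of the talk), the CP model ⟦U, V, W⟧ can be assembled in NumPy with einsum:

import numpy as np

I, J, K, R = 4, 5, 6, 3
U, V, W = np.random.rand(I, R), np.random.rand(J, R), np.random.rand(K, R)

# t_{ijk} = sum_r u_{ir} v_{jr} w_{kr}, i.e., the CP model [[U, V, W]]
T = np.einsum('ir,jr,kr->ijk', U, V, W)

# spot-check one entry against the explicit sum over r
i, j, k = 1, 2, 3
assert np.isclose(T[i, j, k], sum(U[i, r] * V[j, r] * W[k, r] for r in range(R)))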
Grey Ballard CA Algorithms 8
Notation convention: scalar dimension N, index n with 1 ≤ n ≤ N
Applications of CP
CP often used like PCA for multi-dimensional data: interpretable components separated from noise

Sample applications:
chemometrics [AB03]: data is excitation wavelengths × emission wavelengths × time; components correspond to chemical species' signatures
neuroscience [AABB+07]: data is electrode × frequency × time; components help to describe the origin of a seizure
text analysis [BBB08]: data is term × author × time; components discover conversations
Grey Ballard CA Algorithms 9
High-variance subspaces
Matrix:
Tensor:
This is known as the Tucker decomposition
Grey Ballard CA Algorithms 10
Tucker Notation
$\mathcal{T} \approx \mathcal{G} \times_1 \mathbf{U} \times_2 \mathbf{V} \times_3 \mathbf{W}$, where $\mathcal{T} \in \mathbb{R}^{I \times J \times K}$ and $\mathcal{G} \in \mathbb{R}^{P \times Q \times R}$ is the core tensor

$\mathcal{T} \approx [\![\mathcal{G}; \mathbf{U}, \mathbf{V}, \mathbf{W}]\!]$, where $\mathbf{U} \in \mathbb{R}^{I \times P}$, $\mathbf{V} \in \mathbb{R}^{J \times Q}$, $\mathbf{W} \in \mathbb{R}^{K \times R}$ are factor matrices

$t_{ijk} \approx \sum_{p=1}^{P} \sum_{q=1}^{Q} \sum_{r=1}^{R} g_{pqr}\, u_{ip} v_{jq} w_{kr}, \qquad 1 \le i \le I,\ 1 \le j \le J,\ 1 \le k \le K$
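Analogously, a small NumPy sketch (illustrative only) of assembling the Tucker model ⟦G; U, V, W⟧ from a core tensor and factor matrices:

import numpy as np

I, J, K = 6, 7, 8          # tensor dimensions
P, Q, R = 2, 3, 4          # core (reduced) dimensions
G = np.random.rand(P, Q, R)
U, V, W = np.random.rand(I, P), np.random.rand(J, Q), np.random.rand(K, R)

# t_{ijk} = sum_{p,q,r} g_{pqr} u_{ip} v_{jq} w_{kr}, i.e., [[G; U, V, W]]
T = np.einsum('pqr,ip,jq,kr->ijk', G, U, V, W)
print(T.shape)             # (6, 7, 8)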
Grey Ballard CA Algorithms 11
Tensor-Times-Matrix (TTM)
Tensor version: $\mathcal{Y} = \mathcal{X} \times_2 \mathbf{M}$, with $\mathcal{Y} \in \mathbb{R}^{I \times Q \times K}$, $\mathcal{X} \in \mathbb{R}^{I \times J \times K}$, $\mathbf{M} \in \mathbb{R}^{Q \times J}$

Matrix version: $\mathbf{Y}_{(2)} = \mathbf{M} \mathbf{X}_{(2)}$, with $\mathbf{Y}_{(2)} \in \mathbb{R}^{Q \times IK}$, $\mathbf{X}_{(2)} \in \mathbb{R}^{J \times IK}$

Element version: $y_{iqk} = \sum_{j=1}^{J} m_{qj} x_{ijk}$

TTM is matrix multiplication with a particular unfolding
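The equivalence of the three versions can be checked with a small NumPy sketch (illustrative; the unfolding ordering used here is one common convention):

import numpy as np

I, J, K, Q = 3, 4, 5, 2
X = np.random.rand(I, J, K)
M = np.random.rand(Q, J)

# Element version: y_{iqk} = sum_j m_{qj} x_{ijk}
Y = np.einsum('qj,ijk->iqk', M, X)

# Matrix version: multiply M against the mode-2 unfolding, then fold back
X2 = np.moveaxis(X, 1, 0).reshape(J, -1)           # mode-2 unfolding, J x (I*K)
Y2 = M @ X2                                        # Q x (I*K)
Y_folded = np.moveaxis(Y2.reshape(Q, I, K), 0, 1)  # back to an I x Q x K tensor
assert np.allclose(Y, Y_folded)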
Grey Ballard CA Algorithms 12
Applications of Tucker
Tucker can be viewed as a richer form of CP, so it's also used like PCA; a diagonal core tensor corresponds to a CP decomposition

Sample application: computer vision, TensorFaces [VT02], a facial recognition system benefiting from varying lighting, expression, viewpoint

Tucker is typically more efficient than CP for compression

Sample application: visual data compression [BRP15] of image, video, and 3D volume data
Grey Ballard CA Algorithms 13
Ambiguities
There are several ambiguities that have to be handled carefully
CP scaling ambiguity:

$\mathcal{T} \approx \sum_{r=1}^{R} \mathbf{u}_r \circ \mathbf{v}_r \circ \mathbf{w}_r \;\longrightarrow\; \mathcal{T} \approx \sum_{r=1}^{R} \lambda_r \cdot \mathbf{u}_r \circ \mathbf{v}_r \circ \mathbf{w}_r$, where $\|\mathbf{u}_r\|_2 = \|\mathbf{v}_r\|_2 = \|\mathbf{w}_r\|_2 = 1$

Tucker basis ambiguity:

$\mathcal{T} \approx \mathcal{G} \times_1 \mathbf{U} \times_2 \mathbf{V} \times_3 \mathbf{W}$, where $\mathbf{U}^T\mathbf{U} = \mathbf{I}_P$, $\mathbf{V}^T\mathbf{V} = \mathbf{I}_Q$, $\mathbf{W}^T\mathbf{W} = \mathbf{I}_R$
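The CP scaling ambiguity is typically resolved by normalizing the columns of the factor matrices and absorbing the scalings into a weight vector λ (as in step 6 of CP-ALS later in the talk). A minimal sketch of that normalization, assuming NumPy factor matrices:

import numpy as np

R = 3
U, V, W = np.random.rand(4, R), np.random.rand(5, R), np.random.rand(6, R)

lam = np.ones(R)
normalized = []
for A in (U, V, W):
    norms = np.linalg.norm(A, axis=0)   # 2-norm of each column
    lam *= norms                        # absorb the scaling into lambda
    normalized.append(A / norms)        # columns now have unit norm
U, V, W = normalized

# The model is unchanged: T ~ sum_r lam_r * u_r o v_r o w_r
T = np.einsum('r,ir,jr,kr->ijk', lam, U, V, W)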
Grey Ballard CA Algorithms 14
Survey Paper
Notation can be a huge obstacle to working with tensors; standardization can help
I recommend following the conventions of the following paper:
Tensor Decompositions and Applications
Tammy Kolda and Brett Bader
SIAM Review 2009
http://epubs.siam.org/doi/abs/10.1137/07070111X
Grey Ballard CA Algorithms 15
Outline
1 Tensor Notation
2 Tensor Decompositions
3 Computing CP via Alternating Least Squares: Mathematical Background; Parallel Algorithm for Sparse Tensors
4 Computing Tucker via Sequentially Truncated Higher-Order SVD: Mathematical Background; Parallel Algorithm for Dense Tensors
5 Open Problems
CP Optimization Problem
For fixed rank R, we want to solve
$\min_{\mathbf{U},\mathbf{V},\mathbf{W}} \left\| \mathcal{X} - \sum_{r=1}^{R} \mathbf{u}_r \circ \mathbf{v}_r \circ \mathbf{w}_r \right\|$
which is a nonlinear, nonconvex optimization problem
in the matrix case, the SVD gives us the optimal solution
in the tensor case, uniqueness/convergence to optimum not guaranteed
Grey Ballard CA Algorithms 16
Alternating Least Squares (ALS)
Fixing all but one factor matrix, we have a linear least squares problem:
$\min_{\mathbf{V}} \left\| \mathcal{X} - \sum_{r=1}^{R} \mathbf{u}_r \circ \mathbf{v}_r \circ \mathbf{w}_r \right\|$

or equivalently

$\min_{\mathbf{V}} \left\| \mathbf{X}_{(2)} - \mathbf{V} (\mathbf{W} \odot \mathbf{U})^T \right\|_F$

where $\odot$ is the Khatri-Rao product, a column-wise Kronecker product

ALS works by alternating over factor matrices, updating one at a time by solving the corresponding linear least squares problem
Grey Ballard CA Algorithms 17
CP-ALS
Repeat:
1 Solve $\mathbf{U} (\mathbf{V}^T\mathbf{V} \ast \mathbf{W}^T\mathbf{W}) = \mathbf{X}_{(1)} (\mathbf{W} \odot \mathbf{V})$ for $\mathbf{U}$
2 Normalize columns of $\mathbf{U}$
3 Solve $\mathbf{V} (\mathbf{U}^T\mathbf{U} \ast \mathbf{W}^T\mathbf{W}) = \mathbf{X}_{(2)} (\mathbf{W} \odot \mathbf{U})$ for $\mathbf{V}$
4 Normalize columns of $\mathbf{V}$
5 Solve $\mathbf{W} (\mathbf{U}^T\mathbf{U} \ast \mathbf{V}^T\mathbf{V}) = \mathbf{X}_{(3)} (\mathbf{V} \odot \mathbf{U})$ for $\mathbf{W}$
6 Normalize columns of $\mathbf{W}$ and store norms in $\lambda$

Linear least squares problems solved via normal equations, using the identity $(\mathbf{A} \odot \mathbf{B})^T (\mathbf{A} \odot \mathbf{B}) = \mathbf{A}^T\mathbf{A} \ast \mathbf{B}^T\mathbf{B}$, where $\ast$ is the element-wise product
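A minimal NumPy sketch of one CP-ALS sweep using these normal equations (illustrative only: the column normalization steps are omitted, and the unfolding/Khatri-Rao orderings below are chosen to be mutually consistent rather than to match any particular library):

import numpy as np

def khatri_rao(A, B):
    # Column-wise Kronecker product: column r is kron(A[:, r], B[:, r]).
    return np.einsum('ir,jr->ijr', A, B).reshape(A.shape[0] * B.shape[0], -1)

def unfold(X, n):
    # Mode-n unfolding with a column ordering consistent with khatri_rao above.
    return np.reshape(np.moveaxis(X, n, 0), (X.shape[n], -1), order='F')

def cp_als_sweep(X, U, V, W):
    # Solve the three normal-equation systems, e.g. U (V'V * W'W) = X_(1) (W ⊙ V).
    G = (V.T @ V) * (W.T @ W)
    U = np.linalg.solve(G, (unfold(X, 0) @ khatri_rao(W, V)).T).T
    G = (U.T @ U) * (W.T @ W)
    V = np.linalg.solve(G, (unfold(X, 1) @ khatri_rao(W, U)).T).T
    G = (U.T @ U) * (V.T @ V)
    W = np.linalg.solve(G, (unfold(X, 2) @ khatri_rao(V, U)).T).T
    return U, V, W

I, J, K, R = 5, 6, 7, 3
X = np.random.rand(I, J, K)
U, V, W = np.random.rand(I, R), np.random.rand(J, R), np.random.rand(K, R)
for _ in range(20):
    U, V, W = cp_als_sweep(X, U, V, W)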
Grey Ballard CA Algorithms 18
Outline
1 Tensor Notation
2 Tensor Decompositions
3 Computing CP via Alternating Least Squares: Mathematical Background; Parallel Algorithm for Sparse Tensors
4 Computing Tucker via Sequentially Truncated Higher-Order SVD: Mathematical Background; Parallel Algorithm for Dense Tensors
5 Open Problems
Matricized Tensor Times Khatri-Rao Product
CP-ALS spends most of its time in MTTKRP (dense or sparse), which corresponds to setting up the right-hand side of the normal equations, e.g., $\mathbf{M}^{(V)} = \mathbf{X}_{(2)} (\mathbf{W} \odot \mathbf{U})$

In the dense case, it usually makes sense to (1) form the Khatri-Rao product explicitly and (2) call dense matrix multiplication

In the sparse case, it usually makes sense [BK07] to use the
element-wise formula $m^{(V)}_{jr} = \sum_{i=1}^{I} \sum_{k=1}^{K} x_{ijk} u_{ir} w_{kr}$
or the row-wise formula $\mathbf{m}^{(V)}_{j,:} = \sum_{i=1}^{I} \sum_{k=1}^{K} x_{ijk} \, (\mathbf{u}_{i,:} \ast \mathbf{w}_{k,:})$
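A minimal sketch of the sparse row-wise formula above, assuming the tensor nonzeros are stored in coordinate (COO) form (illustrative; real implementations such as [BK07] are considerably more careful about memory traffic):

import numpy as np

def mttkrp_mode2_sparse(i_idx, j_idx, k_idx, vals, U, W, J):
    # Computes M^(V) = X_(2) (W ⊙ U) one nonzero at a time:
    # row j of M accumulates x_ijk * (U[i, :] * W[k, :]).
    M = np.zeros((J, U.shape[1]))
    for i, j, k, x in zip(i_idx, j_idx, k_idx, vals):
        M[j, :] += x * (U[i, :] * W[k, :])
    return M

# toy tensor with three nonzeros, I = K = 3, J = 2, rank R = 2
i_idx, j_idx, k_idx = [0, 1, 2], [0, 0, 1], [1, 2, 0]
vals = [1.0, 2.0, 3.0]
U, W = np.random.rand(3, 2), np.random.rand(3, 2)
M = mttkrp_mode2_sparse(i_idx, j_idx, k_idx, vals, U, W, J=2)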
Grey Ballard CA Algorithms 19
Coarse-Grain Distribution for CP-ALS [KU15]
[Figure: the factor matrices U (I×R), V (J×R), and W (K×R) each distributed by rows across processors P1–P4, alongside the sparse tensor X.]
Rows of each factor matrix are distributed across processors
Each tensor nonzero is copied to each process that will need it
Grey Ballard CA Algorithms 20
Coarse-Grain Parallelization for MTTKRP [KU15]
To update $\mathbf{V}$, need to compute $\mathbf{M}^{(V)} = \mathbf{X}_{(2)} (\mathbf{W} \odot \mathbf{U})$

Let $I_p$, $J_p$, $K_p$ be the subsets of rows of $\mathbf{U}$, $\mathbf{V}$, $\mathbf{W}$ owned by processor $p$

Main loop:
for each $j \in J_p$
  for each nonzero $x_{ijk}$ in slice $j$
    $\mathbf{m}^{(V)}_{j,:} \leftarrow \mathbf{m}^{(V)}_{j,:} + x_{ijk} \cdot (\mathbf{u}_{i,:} \ast \mathbf{w}_{k,:})$

In the inner loop, $\mathbf{u}_{i,:}$ or $\mathbf{w}_{k,:}$ requires communication if $i \notin I_p$ or $k \notin K_p$
Grey Ballard CA Algorithms 21
Fine-Grain Distribution of CP-ALS [KU15]
[Figure: the factor matrices U (I×R), V (J×R), and W (K×R) distributed by rows across processors P1–P4, with the nonzeros of the sparse tensor X also distributed across processors.]
Rows of each factor matrix are distributed across processors
Tensor nonzeros are distributed across processors
Grey Ballard CA Algorithms 22
Fine-Grain Parallelization for MTTKRP [KU15]
To update $\mathbf{V}$, need to compute $\mathbf{M}^{(V)} = \mathbf{X}_{(2)} (\mathbf{W} \odot \mathbf{U})$

Let $X_p$ be the subset of nonzeros of $\mathcal{X}$ owned by processor $p$
Let $I_p$, $J_p$, $K_p$ be the subsets of rows of $\mathbf{U}$, $\mathbf{V}$, $\mathbf{W}$ owned by processor $p$

Main loop:
for each $x_{ijk} \in X_p$
  $\mathbf{m}^{(V)}_{j,:} \leftarrow \mathbf{m}^{(V)}_{j,:} + x_{ijk} \cdot (\mathbf{u}_{i,:} \ast \mathbf{w}_{k,:})$

In the inner loop, $\mathbf{u}_{i,:}$ or $\mathbf{w}_{k,:}$ requires communication if $i \notin I_p$ or $k \notin K_p$
After the loop, $\mathbf{m}^{(V)}_{j,:}$ for $j \notin J_p$ needs to be sent to the owner processor
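To see why the distribution drives the communication cost, here is an illustrative sketch (my own, not the [KU15] implementation) that counts, for a given fine-grain assignment of nonzeros and factor-matrix rows, how many remote rows of U and W each process would need to receive before its local MTTKRP loop; the analogous count for rows of M^(V) sent back to their owners gives the output volume.

def mttkrp_input_comm_volume(coords, nnz_owner, row_owner_U, row_owner_W, P):
    # coords: list of (i, j, k) nonzero coordinates
    # nnz_owner: process owning each nonzero; row_owner_U/W: process owning each row
    needed_U = [set() for _ in range(P)]
    needed_W = [set() for _ in range(P)]
    for (i, j, k), p in zip(coords, nnz_owner):
        if row_owner_U[i] != p:
            needed_U[p].add(i)        # row u_{i,:} must be received by process p
        if row_owner_W[k] != p:
            needed_W[p].add(k)        # row w_{k,:} must be received by process p
    return [len(needed_U[p]) + len(needed_W[p]) for p in range(P)]

# Example: 4 nonzeros split over 2 processes, rows owned in contiguous blocks
coords = [(0, 0, 1), (1, 0, 2), (2, 1, 0), (3, 1, 1)]
print(mttkrp_input_comm_volume(coords, nnz_owner=[0, 0, 1, 1],
                               row_owner_U=[0, 0, 1, 1],
                               row_owner_W=[0, 0, 1], P=2))   # [1, 2]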
Grey Ballard CA Algorithms 23
Minimizing Communication
Algorithms defined for any distributions of factor matrices / tensor
Distributions determine computational load balance and communication costs

Finding optimal distribution for each algorithm is a hypergraph partitioning problem (subject to load balance constraint)

Even if hypergraph is optimally partitioned, no guarantees that either algorithm is communication optimal
Grey Ballard CA Algorithms 24
Coarse-Grain vs Fine-Grain
Coarse-grain:
Owner computes: communicates only inputs within MTTKRP
Requires replication of X
Generalizes row-wise algorithm for SpMV (for multiple vectors)

Fine-grain:
Communicates inputs and outputs within MTTKRP
No replication of X
Generalizes fine-grain algorithm for SpMV
Grey Ballard CA Algorithms 25
Performance Comparison [KU15]
[Figure from [KU15], four panels comparing ht-finegrain-hp, ht-finegrain-random, ht-coarsegrain-hp, ht-coarsegrain-block, and DFacTo on 1 to 1024 MPI processes: (a) average time per CP-ALS iteration on Netflix, (b) average time per CP-ALS iteration on NELL-B, (c) speedup over sequential execution on Flickr, (d) speedup over sequential execution on Delicious.]
Figure 2: Time for parallel CP-ALS iteration on Netflix and NELL-B, and speedups on Flickr and Delicious.
Table 2: Statistics for the computation and communication requirements in one CP-ALS iteration for 512-way partitionings of the Netflix tensor.

Method                 Mode   Comp. load (Max / Avg)   Comm. volume (Max / Avg)   Num. msg. (Max / Avg)
ht-finegrain-hp        1      196672 / 196251          21079 / 6367               734 / 316
ht-finegrain-hp        2      196672 / 196251          18028 / 5899               1022 / 1016
ht-finegrain-hp        3      196672 / 196251          3545 / 2492                1022 / 1018
ht-finegrain-random    1      197507 / 196251          272326 / 252118            1022 / 1022
ht-finegrain-random    2      197507 / 196251          29282 / 22715              1022 / 1022
ht-finegrain-random    3      197507 / 196251          7766 / 4300                1013 / 1003
ht-coarsegrain-hp      1      364181 / 196251          302001 / 136741            511 / 511
ht-coarsegrain-hp      2      349123 / 196251          59523 / 12228              511 / 511
ht-coarsegrain-hp      3      737570 / 196251          23524 / 2000               511 / 507
ht-coarsegrain-block   1      198602 / 196251          239337 / 142006            448 / 447
ht-coarsegrain-block   2      367966 / 196251          33889 / 12458              511 / 445
ht-coarsegrain-block   3      737570 / 196251          24659 / 2049               511 / 394
Grey Ballard CA Algorithms 26
Outline
1 Tensor Notation
2 Tensor Decompositions
3 Computing CP via Alternating Least Squares: Mathematical Background; Parallel Algorithm for Sparse Tensors
4 Computing Tucker via Sequentially Truncated Higher-Order SVD: Mathematical Background; Parallel Algorithm for Dense Tensors
5 Open Problems
Tucker Optimization Problem
For fixed ranks P,Q,R, we want to solve
$\min_{\tilde{\mathcal{X}}} \left\| \mathcal{X} - \tilde{\mathcal{X}} \right\|^2 = \sum_{i=1}^{I} \sum_{j=1}^{J} \sum_{k=1}^{K} \left( x_{ijk} - \tilde{x}_{ijk} \right)^2 \quad \text{subject to} \quad \tilde{\mathcal{X}} = [\![\mathcal{G}; \mathbf{U}, \mathbf{V}, \mathbf{W}]\!]$

which turns out to be equivalent to

$\max_{\mathbf{U},\mathbf{V},\mathbf{W}} \|\mathcal{G}\| \quad \text{subject to} \quad \mathcal{G} = \mathcal{X} \times_1 \mathbf{U}^T \times_2 \mathbf{V}^T \times_3 \mathbf{W}^T$
which is a nonlinear, nonconvex optimization problem
Grey Ballard CA Algorithms 27
Higher-Order Orthogonal Iteration (HOOI)
Fixing all but one factor matrix, we have a matrix problem:
$\max_{\mathbf{V}} \left\| \mathcal{X} \times_1 \mathbf{U}^T \times_2 \mathbf{V}^T \times_3 \mathbf{W}^T \right\|$

or equivalently

$\max_{\mathbf{V}} \left\| \mathbf{V}^T \mathbf{Y}_{(2)} \right\|_F \quad \text{where} \quad \mathcal{Y} = \mathcal{X} \times_1 \mathbf{U}^T \times_3 \mathbf{W}^T$

HOOI works by alternating over factor matrices, updating one at a time by computing leading left singular vectors
Grey Ballard CA Algorithms 28
Sequentially Truncated Higher-Order SVD
HOOI is very sensitive to initialization
Truncated Higher-Order SVD (T-HOSVD) typically used
ST-HOSVD [VVM12] is more efficient than T-HOSVD; it works by
initializing with identity matrices $\mathbf{U} = \mathbf{I}_I$, $\mathbf{V} = \mathbf{I}_J$, $\mathbf{W} = \mathbf{I}_K$
applying one iteration of HOOI
where the ranks $P, Q, R$ can be chosen based on an error tolerance
Grey Ballard CA Algorithms 29
ST-HOSVD Algorithm
1 $\mathbf{S}^{(1)} \leftarrow \mathbf{X}_{(1)} \mathbf{X}_{(1)}^T$
2 $\mathbf{U}$ = leading eigenvectors of $\mathbf{S}^{(1)}$
3 $\mathcal{Y} = \mathcal{X} \times_1 \mathbf{U}^T$
4 $\mathbf{S}^{(2)} \leftarrow \mathbf{Y}_{(2)} \mathbf{Y}_{(2)}^T$
5 $\mathbf{V}$ = leading eigenvectors of $\mathbf{S}^{(2)}$
6 $\mathcal{Z} = \mathcal{Y} \times_2 \mathbf{V}^T$
7 $\mathbf{S}^{(3)} \leftarrow \mathbf{Z}_{(3)} \mathbf{Z}_{(3)}^T$
8 $\mathbf{W}$ = leading eigenvectors of $\mathbf{S}^{(3)}$
9 $\mathcal{G} = \mathcal{Z} \times_3 \mathbf{W}^T$

Left singular vectors of $\mathbf{A}$ are computed as eigenvectors of the Gram matrix $\mathbf{A}\mathbf{A}^T$
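A minimal NumPy sketch of these steps for a dense tensor with fixed target ranks (illustrative only; the distributed-memory algorithm of [ABK15] and rank selection from an error tolerance are much more involved):

import numpy as np

def st_hosvd(X, ranks):
    # Sequentially truncated HOSVD: for each mode in turn, form the Gram matrix
    # of the unfolding, take its leading eigenvectors, and shrink that mode.
    G = X
    factors = []
    for n, r in enumerate(ranks):
        Gn = np.moveaxis(G, n, 0).reshape(G.shape[n], -1)   # mode-n unfolding
        S = Gn @ Gn.T                                       # Gram matrix
        w, Q = np.linalg.eigh(S)                            # ascending eigenvalues
        U = Q[:, ::-1][:, :r]                               # top-r eigenvectors
        factors.append(U)
        # TTM with U^T shrinks mode n from G.shape[n] down to r
        G = np.moveaxis(np.tensordot(U.T, G, axes=(1, n)), 0, n)
    return G, factors

X = np.random.rand(20, 25, 30)
core, (U, V, W) = st_hosvd(X, ranks=(5, 6, 7))
print(core.shape)    # (5, 6, 7)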
Grey Ballard CA Algorithms 30
Outline
1 Tensor Notation
2 Tensor Decompositions
3 Computing CP via Alternating Least Squares: Mathematical Background; Parallel Algorithm for Sparse Tensors
4 Computing Tucker via Sequentially Truncated Higher-Order SVD: Mathematical Background; Parallel Algorithm for Dense Tensors
5 Open Problems
Parallel Block Tensor Distribution
For an N-mode tensor, use a logical N-mode processor grid
Proc. grid: $P_I \times P_J \times P_K = 3 \times 5 \times 2$

[Figure: the I × J × K tensor partitioned into a 3 × 5 × 2 grid of blocks.]

Local tensors have dimensions $\frac{I}{P_I} \times \frac{J}{P_J} \times \frac{K}{P_K}$
Grey Ballard CA Algorithms 31
Unfolded Tensor Distribution
Key idea: each unfolded matrix is 2D block distributed
Proc. grid: $P_I \times P_J \times P_K = 3 \times 5 \times 2$

[Figure: the unfolding $\mathbf{X}_{(2)}$, of dimensions $J \times IK$, shown with its 2D block distribution.]

Logical mode-2 2D processor grid: $P_J \times P_I P_K$
Local unfolded matrices have dimensions $\frac{J}{P_J} \times \frac{IK}{P_I P_K}$
Grey Ballard CA Algorithms 32
Kernel Matrix Computations
Key computations in ST-HOSVD are:
Gram: computing $\mathbf{X}_{(2)} \mathbf{X}_{(2)}^T$
TTM: computing $\mathbf{Y}_{(2)} = \mathbf{V}^T \mathbf{X}_{(2)}$

These are just matrix computations, done for each mode in sequence; we can determine lower bounds and optimal algorithms for the individual computations, but how do we minimize communication across all of the computations?
Grey Ballard CA Algorithms 33
Parameter Tuning
[Left plot: per-iteration time broken down into TTM, Evecs, and Gram while varying the processor grid, for a tensor of size 384×384×384×384 with reduced size 96×96×96×96. Right plot: the same breakdown while varying the mode order, for a tensor of size 25×250×250×250 with reduced size 10×10×100×100.]
Grey Ballard CA Algorithms 34
Parallel Scaling
[Left plot: weak scaling of ST-HOSVD (GFLOPS per core vs. number of nodes) for a 200k×200k×200k×200k tensor with reduced size 20k×20k×20k×20k, using $k^4$ nodes for $1 \le k \le 6$. Right plot: strong scaling of ST-HOSVD (time in seconds vs. number of nodes) for a 200×200×200×200 tensor with reduced size 20×20×20×20, using $2^k$ nodes for $0 \le k \le 9$.]
Grey Ballard CA Algorithms 35
Application: Compression of Scientific Simulation Data
We applied ST-HOSVD to compress multidimensional data from numerical simulations of combustion, including the following data sets:

HCCI: dimensions 672 × 672 × 33 × 627 (672 × 672 spatial grid, 33 variables over 627 time steps); total size 70 GB
TJLR: dimensions 460 × 700 × 360 × 35 × 16 (460 × 700 × 360 spatial grid, 35 variables over 16 time steps); total size 520 GB
SP: dimensions 500 × 500 × 500 × 11 × 50 (500 × 500 × 500 spatial grid, 11 variables over 50 time steps); total size 550 GB
Grey Ballard CA Algorithms 36
Application: Compression of Scientific Simulation Data
[Plot: compression ratio (log scale, roughly $10^0$ to $10^4$) versus relative normwise error ($10^{-6}$ to $10^{-2}$) for the HCCI, TJLR, and SP data sets.]
Compression ratio: $\dfrac{IJK}{PQR + IP + JQ + KR}$    Relative normwise error: $\dfrac{\|\mathcal{X} - \tilde{\mathcal{X}}\|}{\|\mathcal{X}\|}$
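For instance, the compression ratio formula is easy to evaluate for a hypothetical 500 × 500 × 500 tensor truncated to a 50 × 50 × 50 core (these ranks are made up for illustration, not the ones used for the data sets above):

def tucker_compression_ratio(I, J, K, P, Q, R):
    # original entries divided by entries stored for the core plus factor matrices
    return (I * J * K) / (P * Q * R + I * P + J * Q + K * R)

print(tucker_compression_ratio(500, 500, 500, 50, 50, 50))   # 625.0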
Grey Ballard CA Algorithms 37
Outline
1 Tensor Notation
2 Tensor Decompositions
3 Computing CP via Alternating Least Squares: Mathematical Background; Parallel Algorithm for Sparse Tensors
4 Computing Tucker via Sequentially Truncated Higher-Order SVD: Mathematical Background; Parallel Algorithm for Dense Tensors
5 Open Problems
Numerical Questions
CP-ALS solves least squares problems using the normal equations
ST-HOSVD computes singular vectors using the Gram matrix

Are there applications that require better numerical stability?
Can more numerically stable methods be implemented efficiently?
Grey Ballard CA Algorithms 38
CA Questions
What are the communication lower bounds for MTTKRP?
The computation can be expressed as nested loops
Is there a tradeoff between computation and communication?

What are the communication lower bounds for ST-HOSVD?
We've already improved the communication costs of the published algorithm
Can the parameter tuning problems be solved analytically?
Grey Ballard CA Algorithms 39
For more details:
Scalable Sparse Tensor Decompositions in Distributed Memory Systems
Oguz Kaya and Bora Uçar
International Conference for High Performance Computing, Networking, Storage and Analysis 2015
http://doi.acm.org/10.1145/2807591.2807624

Parallel Tensor Compression for Large-Scale Scientific Data
Woody Austin, Grey Ballard, and Tamara G. Kolda
International Parallel and Distributed Processing Symposium 2016
http://arxiv.org/abs/1510.06689
Grey Ballard CA Algorithms 40
References I
Evrim Acar, Canan Aykut-Bingol, Haluk Bingol, Rasmus Bro, and Bülent Yener. Multiway analysis of epilepsy tensors. Bioinformatics, 23(13):i10–i18, 2007.

C. M. Andersen and R. Bro. Practical aspects of PARAFAC modeling of fluorescence excitation-emission data. Journal of Chemometrics, 17(4):200–215, 2003.

Woody Austin, Grey Ballard, and Tamara G. Kolda. Parallel tensor compression for large-scale scientific data. Technical Report 1510.06689, arXiv, 2015. To appear in IPDPS.
Grey Ballard CA Algorithms 41
References II
Brett W. Bader, Michael W. Berry, and Murray Browne. Survey of Text Mining II: Clustering, Classification, and Retrieval, chapter Discussion Tracking in Enron Email Using PARAFAC, pages 147–163. Springer London, London, 2008.

Brett W. Bader and Tamara G. Kolda. Efficient MATLAB computations with sparse and factored tensors. SIAM Journal on Scientific Computing, 30(1):205–231, December 2007.

Rafael Ballester-Ripoll and Renato Pajarola. Lossy volume compression using Tucker truncation and thresholding. The Visual Computer, pages 1–14, 2015.

T. G. Kolda and B. W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, September 2009.
Grey Ballard CA Algorithms 42
References III
Oguz Kaya and Bora Uçar. Scalable sparse tensor decompositions in distributed memory systems. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '15, pages 77:1–77:11, New York, NY, USA, 2015. ACM.

M. Alex O. Vasilescu and Demetri Terzopoulos. Computer Vision — ECCV 2002: 7th European Conference on Computer Vision, Copenhagen, Denmark, May 28–31, 2002, Proceedings, Part I, chapter Multilinear Analysis of Image Ensembles: TensorFaces, pages 447–460. Springer Berlin Heidelberg, Berlin, Heidelberg, 2002.

Nick Vannieuwenhoven, Raf Vandebril, and Karl Meerbergen. A new truncation strategy for the higher-order singular value decomposition. SIAM Journal on Scientific Computing, 34(2):A1027–A1052, 2012.
Grey Ballard CA Algorithms 43