
Part 2: Communication Costs of Tensor Decompositions

Grey Ballard

CS 294/Math 270: Communication-Avoiding Algorithms
UC Berkeley

March 28, 2016

Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.


Plan

Introduction to tensor decompositions [KB09]
  nomenclature and notation
  popular decompositions: CP and Tucker

We’re seeking comm-optimal sequential and parallel algorithms
  few known lower bounds
  few standard libraries or HPC implementations

Many flavors of problems
  dense or sparse
  sequential or parallel
  CP or Tucker (or alternatives like tensor train)
  choices of mathematical algorithm

We’ll do two case studies of parallel algorithms
  computing the CP decomposition of a sparse tensor [KU15]
  computing the Tucker decomposition of a dense tensor [ABK15]

Grey Ballard CA Algorithms 1


Outline

1 Tensor Notation

2 Tensor Decompositions

3 Computing CP via Alternating Least Squares
  Mathematical Background
  Parallel Algorithm for Sparse Tensors

4 Computing Tucker via Sequentially Truncated Higher-Order SVD
  Mathematical Background
  Parallel Algorithm for Dense Tensors

5 Open Problems


Tensors

Vector N = 1

Matrix N = 2

3rd-Order Tensor N = 3

4th-Order Tensor N = 4

5th-Order Tensor N = 5

An Nth-order tensor has N modes
Notation convention: vector v, matrix M, tensor T

Grey Ballard CA Algorithms 2


Fibers

Mode-1 Fibers Mode-2 Fibers Mode-3 Fibers

A tensor can be decomposed into the fibers of each mode (fix all indices but one)

Grey Ballard CA Algorithms 3


Slices

[Slide shows page 458 of Kolda and Bader, with Figs. 2.1 and 2.2:]

Fig. 2.1 Fibers of a 3rd-order tensor: (a) mode-1 (column) fibers x:jk, (b) mode-2 (row) fibers xi:k, (c) mode-3 (tube) fibers xij:

Fig. 2.2 Slices of a 3rd-order tensor: (a) horizontal slices Xi::, (b) lateral slices X:j:, (c) frontal slices X::k (or Xk)

A tensor can also be decomposed into the slices of each mode (fix one index)

Grey Ballard CA Algorithms 4


Unfoldings

A tensor can be reshaped into matrices, called unfoldings or matricizations, one for each mode

(fibers form columns, slices form rows)

Grey Ballard CA Algorithms 5
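The reshaping above can be sketched in NumPy. This is an illustrative helper, not code from the slides; note that the column-ordering convention for unfoldings varies across papers, and this sketch uses NumPy's row-major order:

```python
import numpy as np

def unfold(X, n):
    # Mode-n unfolding: mode-n fibers become the columns of an
    # (I_n) x (product of the other dimensions) matrix.
    return np.moveaxis(X, n, 0).reshape(X.shape[n], -1)

X = np.arange(24).reshape(2, 3, 4)  # a 2 x 3 x 4 tensor
X2 = unfold(X, 1)                   # 3 x 8 matrix; columns are mode-2 fibers
assert X2.shape == (3, 8)
assert list(X2[:, 0]) == [0, 4, 8]  # first column is the fiber X[0, :, 0]
```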


Outline

1 Tensor Notation

2 Tensor Decompositions

3 Computing CP via Alternating Least Squares
  Mathematical Background
  Parallel Algorithm for Sparse Tensors

4 Computing Tucker via Sequentially Truncated Higher-Order SVD
  Mathematical Background
  Parallel Algorithm for Dense Tensors

5 Open Problems


Low-rank approximations of tensors

Tensor “decompositions” are usually low-rank approximations

They generalize matrix approximations from two viewpoints
  sum of outer products (think PCA)
  product of two rectangular matrices (think high-variance subspaces)

Some applications seek true decompositions, but this is less common

Grey Ballard CA Algorithms 6


Sum of outer products

Matrix:

Tensor:

This is known as the CANDECOMP/PARAFAC (CP) decomposition

Grey Ballard CA Algorithms 7


CP Notation

T ≈ u_1 ∘ v_1 ∘ w_1 + · · · + u_R ∘ v_R ∘ w_R,  T ∈ R^(I×J×K)

T ≈ ⟦U, V, W⟧,  where U ∈ R^(I×R), V ∈ R^(J×R), W ∈ R^(K×R) are the factor matrices

t_{ijk} ≈ Σ_{r=1}^{R} u_{ir} v_{jr} w_{kr},  1 ≤ i ≤ I, 1 ≤ j ≤ J, 1 ≤ k ≤ K

Notation convention: scalar dimension N, index n with 1 ≤ n ≤ N

Grey Ballard CA Algorithms 8
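The elementwise formula above can be checked numerically. This is an illustrative NumPy sketch (the dimensions and variable names are made up, not from the slides):

```python
import numpy as np

I, J, K, R = 4, 5, 6, 3
rng = np.random.default_rng(0)
U = rng.standard_normal((I, R))
V = rng.standard_normal((J, R))
W = rng.standard_normal((K, R))

# T = [[U, V, W]]: a sum of R outer products, t_ijk = sum_r u_ir v_jr w_kr
T = np.einsum('ir,jr,kr->ijk', U, V, W)

assert T.shape == (I, J, K)
assert np.isclose(T[1, 2, 3], np.sum(U[1] * V[2] * W[3]))
```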


Applications of CP

CP is often used like PCA for multi-dimensional data
  interpretable components separated from noise

Sample applications
  chemometrics [AB03]
    data is excitation wavelengths × emission wavelengths × time
    components correspond to chemical species’ signatures
  neuroscience [AABB+07]
    data is electrode × frequency × time
    components help to describe the origin of a seizure
  text analysis [BBB08]
    data is term × author × time
    components discover conversations

Grey Ballard CA Algorithms 9


High-variance subspaces

Matrix:

Tensor:

This is known as the Tucker decomposition

Grey Ballard CA Algorithms 10


Tucker Notation

T ≈ G ×1 U ×2 V ×3 W,  T ∈ R^(I×J×K), where G ∈ R^(P×Q×R) is the core tensor

T ≈ ⟦G; U, V, W⟧,  where U ∈ R^(I×P), V ∈ R^(J×Q), W ∈ R^(K×R) are the factor matrices

t_{ijk} ≈ Σ_{p=1}^{P} Σ_{q=1}^{Q} Σ_{r=1}^{R} g_{pqr} u_{ip} v_{jq} w_{kr},  1 ≤ i ≤ I, 1 ≤ j ≤ J, 1 ≤ k ≤ K

Grey Ballard CA Algorithms 11


Tensor-Times-Matrix (TTM)

Tensor version: Y = X ×2 M,  Y ∈ R^(I×Q×K), X ∈ R^(I×J×K), M ∈ R^(Q×J)

Matrix version: Y_(2) = M X_(2),  Y_(2) ∈ R^(Q×IK), X_(2) ∈ R^(J×IK)

Element version: y_{iqk} = Σ_{j=1}^{J} m_{qj} x_{ijk}

TTM is matrix multiplication with a particular unfolding

Grey Ballard CA Algorithms 12
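The equivalence of the tensor and matrix versions can be sketched in NumPy (illustrative shapes and names; the unfolding below uses NumPy's row-major column ordering):

```python
import numpy as np

I, J, K, Q = 3, 4, 5, 2
rng = np.random.default_rng(1)
X = rng.standard_normal((I, J, K))
M = rng.standard_normal((Q, J))

# Tensor version: Y = X x_2 M, i.e. y_iqk = sum_j m_qj x_ijk
Y = np.einsum('qj,ijk->iqk', M, X)

# Matrix version: Y_(2) = M X_(2), with mode-2 fibers as columns
X2 = np.moveaxis(X, 1, 0).reshape(J, I * K)
Y2 = M @ X2

assert np.allclose(Y2, np.moveaxis(Y, 1, 0).reshape(Q, I * K))
```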


Applications of Tucker

Tucker can be viewed as a richer form of CP, so it’s also used like PCA
  a diagonal core tensor corresponds to a CP decomposition

Sample application
  computer vision: TensorFaces [VT02]
    a facial recognition system benefiting from varying lighting, expression, viewpoint

Tucker is typically more efficient than CP for compression

Sample application
  visual data compression [BRP15]
    image, video, and 3D volume data

Grey Ballard CA Algorithms 13


Ambiguities

There are several ambiguities that have to be handled carefully

CP scaling ambiguity:

T ≈ Σ_{r=1}^{R} u_r ∘ v_r ∘ w_r  →  T ≈ Σ_{r=1}^{R} λ_r · u_r ∘ v_r ∘ w_r,  where ‖u_r‖_2 = ‖v_r‖_2 = ‖w_r‖_2 = 1

Tucker basis ambiguity:

T ≈ G ×1 U ×2 V ×3 W,  where UᵀU = I_P, VᵀV = I_Q, WᵀW = I_R

Grey Ballard CA Algorithms 14
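The scaling convention above is implemented by normalizing each factor column and folding the norms into the weights λ_r. A minimal sketch with a hypothetical factor matrix (not from the slides):

```python
import numpy as np

U = np.array([[3.0, 0.0],
              [4.0, 1.0]])
lam = np.linalg.norm(U, axis=0)  # column norms become the weights lambda_r
U_hat = U / lam                  # columns now have unit 2-norm

assert np.allclose(lam, [5.0, 1.0])
assert np.allclose(np.linalg.norm(U_hat, axis=0), 1.0)
```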


Survey Paper

Notation can be a huge obstacle to working with tensors; standardization can help

I recommend following the conventions of this paper:

Tensor Decompositions and Applications
Tammy Kolda and Brett Bader
SIAM Review, 2009
http://epubs.siam.org/doi/abs/10.1137/07070111X

Grey Ballard CA Algorithms 15


Outline

1 Tensor Notation

2 Tensor Decompositions

3 Computing CP via Alternating Least Squares
  Mathematical Background
  Parallel Algorithm for Sparse Tensors

4 Computing Tucker via Sequentially Truncated Higher-Order SVD
  Mathematical Background
  Parallel Algorithm for Dense Tensors

5 Open Problems


CP Optimization Problem

For fixed rank R, we want to solve

min_{U,V,W} ‖X − Σ_{r=1}^{R} u_r ∘ v_r ∘ w_r‖

which is a nonlinear, nonconvex optimization problem

in the matrix case, the SVD gives us the optimal solution

in the tensor case, uniqueness/convergence to an optimum is not guaranteed

Grey Ballard CA Algorithms 16


Alternating Least Squares (ALS)

Fixing all but one factor matrix, we have a linear least squares problem:

min_V ‖X − Σ_{r=1}^{R} u_r ∘ v_r ∘ w_r‖

or equivalently

min_V ‖X_(2) − V(W ⊙ U)ᵀ‖_F

where ⊙ is the Khatri-Rao product, a column-wise Kronecker product

ALS works by alternating over the factor matrices, updating one at a time by solving the corresponding linear least squares problem

Grey Ballard CA Algorithms 17
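The structure of the Khatri-Rao product can be checked numerically. The `khatri_rao` helper below is an assumed implementation, not code from the slides:

```python
import numpy as np

def khatri_rao(A, B):
    # column-wise Kronecker product: column r is kron(A[:, r], B[:, r])
    R = A.shape[1]
    return np.einsum('ir,jr->ijr', A, B).reshape(-1, R)

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 3))
B = rng.standard_normal((5, 3))
KR = khatri_rao(A, B)

# (A kr B)^T (A kr B) = A^T A * B^T B, with * the element-wise product
assert KR.shape == (20, 3)
assert np.allclose(KR.T @ KR, (A.T @ A) * (B.T @ B))
```

This identity is what makes the normal-equations approach on the next slide cheap: the R × R Gram matrix of the tall Khatri-Rao product never has to be formed explicitly.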


CP-ALS

Repeat:
1 Solve U(VᵀV ∗ WᵀW) = X_(1)(W ⊙ V) for U
2 Normalize columns of U
3 Solve V(UᵀU ∗ WᵀW) = X_(2)(W ⊙ U) for V
4 Normalize columns of V
5 Solve W(UᵀU ∗ VᵀV) = X_(3)(V ⊙ U) for W
6 Normalize columns of W and store norms in λ

Linear least squares problems are solved via the normal equations, using the identity (A ⊙ B)ᵀ(A ⊙ B) = AᵀA ∗ BᵀB, where ∗ is the element-wise product

Grey Ballard CA Algorithms 18
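One sweep of the iteration above can be sketched for a dense 3-way array (assumed function names, not the slides' code; the normalization steps are omitted, and the MTTKRP right-hand sides are formed with einsum):

```python
import numpy as np

def cp_als_sweep(X, U, V, W):
    # Update each factor by solving its normal equations, e.g.
    # V (U^T U * W^T W) = X_(2) (W kr U); the Gram matrices are only R x R.
    G = (V.T @ V) * (W.T @ W)
    U = np.linalg.solve(G, np.einsum('ijk,jr,kr->ir', X, V, W).T).T
    G = (U.T @ U) * (W.T @ W)
    V = np.linalg.solve(G, np.einsum('ijk,ir,kr->jr', X, U, W).T).T
    G = (U.T @ U) * (V.T @ V)
    W = np.linalg.solve(G, np.einsum('ijk,ir,jr->kr', X, U, V).T).T
    return U, V, W

rng = np.random.default_rng(3)
X = rng.standard_normal((4, 5, 6))
U, V, W = (rng.standard_normal((n, 2)) for n in (4, 5, 6))

def residual(U, V, W):
    return np.linalg.norm(X - np.einsum('ir,jr,kr->ijk', U, V, W))

r0 = residual(U, V, W)
U, V, W = cp_als_sweep(X, U, V, W)
assert residual(U, V, W) <= r0 + 1e-9
```

Because each update exactly minimizes the objective over one factor with the others fixed, the residual is monotonically non-increasing across sweeps, as the final assertion checks.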


Outline

1 Tensor Notation

2 Tensor Decompositions

3 Computing CP via Alternating Least Squares
  Mathematical Background
  Parallel Algorithm for Sparse Tensors

4 Computing Tucker via Sequentially Truncated Higher-Order SVD
  Mathematical Background
  Parallel Algorithm for Dense Tensors

5 Open Problems


Matricized Tensor Times Khatri-Rao Product

CP-ALS spends most of its time in MTTKRP (dense or sparse)
  corresponds to setting up the right-hand side of the normal equations
  M^(V) = X_(2)(W ⊙ U), for example

In the dense case, it usually makes sense to
  1 form the Khatri-Rao product explicitly
  2 call dense matrix multiplication

In the sparse case, it usually makes sense [BK07] to use the

element-wise formula: m^(V)_{jr} = Σ_{i=1}^{I} Σ_{k=1}^{K} x_{ijk} u_{ir} w_{kr}

row-wise formula: m^(V)_{j,:} = Σ_{i=1}^{I} Σ_{k=1}^{K} x_{ijk} (u_{i,:} ∗ w_{k,:})

Grey Ballard CA Algorithms 19
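The row-wise formula maps directly onto a loop over nonzeros. This is a minimal sketch assuming a coordinate-list (COO) layout, not the slides' or [BK07]'s code:

```python
import numpy as np

def mttkrp_sparse(ii, jj, kk, vals, U, W, J):
    # m^(V)_{j,:} += x_ijk * (u_{i,:} * w_{k,:}) for each nonzero x_ijk
    M = np.zeros((J, U.shape[1]))
    for i, j, k, x in zip(ii, jj, kk, vals):
        M[j, :] += x * (U[i, :] * W[k, :])
    return M

rng = np.random.default_rng(4)
X = rng.standard_normal((3, 4, 5)) * (rng.random((3, 4, 5)) < 0.3)  # sparse-ish tensor
U, W = rng.standard_normal((3, 2)), rng.standard_normal((5, 2))
ii, jj, kk = np.nonzero(X)

M = mttkrp_sparse(ii, jj, kk, X[ii, jj, kk], U, W, J=4)
assert np.allclose(M, np.einsum('ijk,ir,kr->jr', X, U, W))
```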


Coarse-Grain Distribution for CP-ALS [KU15]

[Figure: block row distributions of the factor matrices U (I × R), V (J × R), and W (K × R) across processors P1–P4, shown alongside the tensor X]

Rows of each factor matrix are distributed across processors

Each tensor nonzero is copied to each process that will need it

Grey Ballard CA Algorithms 20


Coarse-Grain Parallelization for MTTKRP [KU15]

To update V, we need to compute M^(V) = X_(2)(W ⊙ U)

Let I_p, J_p, K_p be the subsets of rows of U, V, W owned by processor p

Main loop:
for each j ∈ J_p
    for each nonzero x_{ijk} in slice j
        m^(V)_{j,:} ← m^(V)_{j,:} + x_{ijk} · (u_{i,:} ∗ w_{k,:})

In the inner loop, u_{i,:} or w_{k,:} requires communication if i ∉ I_p or k ∉ K_p

Grey Ballard CA Algorithms 21


Fine-Grain Distribution of CP-ALS [KU15]

[Figure: block row distributions of the factor matrices U, V, and W across processors P1–P4; the nonzeros of X are also distributed]

Rows of each factor matrix are distributed across processors
Tensor nonzeros are distributed across processors

Grey Ballard CA Algorithms 22


Fine-Grain Parallelization for MTTKRP [KU15]

To update V, we need to compute M^(V) = X_(2)(W ⊙ U)

Let X_p be the subset of nonzeros of X owned by processor p
Let I_p, J_p, K_p be the subsets of rows of U, V, W owned by processor p

Main loop:
for each x_{ijk} ∈ X_p
    m^(V)_{j,:} ← m^(V)_{j,:} + x_{ijk} · (u_{i,:} ∗ w_{k,:})

In the loop, u_{i,:} or w_{k,:} requires communication if i ∉ I_p or k ∉ K_p

After the loop, m^(V)_{j,:} for j ∉ J_p needs to be sent to its owner processor

Grey Ballard CA Algorithms 23
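The rows communicated in the loop above are exactly those touched by a processor's local nonzeros but owned elsewhere; a toy sketch with hypothetical names:

```python
def ghost_rows(touched, owned):
    # rows that must be received from other processors: indexed by
    # local nonzeros, but owned elsewhere
    return set(touched) - set(owned)

# e.g. processor p owns rows {0, 1} of U, but its nonzeros have i in {0, 2, 3}
assert ghost_rows([0, 2, 3], [0, 1]) == {2, 3}
```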


Minimizing Communication

Algorithms defined for any distributions of factor matrices / tensor

Distributions determine computational load balance andcommunication costs

Finding optimal distribution for each algorithm is a hypergraphpartitioning problem (subject to load balance constraint)

Even if hypergraph is optimally partitioned, no guarantees thateither algorithm is communication optimal

Grey Ballard CA Algorithms 24


Coarse-Grain vs Fine-Grain

Coarse-Grain
  Owner computes: communicates only inputs within MTTKRP
  Requires replication of X
  Generalizes the row-wise algorithm for SpMV (for multiple vectors)

Fine-Grain
  Communicates inputs and outputs within MTTKRP
  No replication of X
  Generalizes the fine-grain algorithm for SpMV

Grey Ballard CA Algorithms 25


Performance Comparison [KU15]

[Figure 2 from [KU15]: average time per CP-ALS iteration and speedup vs. number of MPI processes (1–1024) for ht-finegrain-hp, ht-finegrain-random, ht-coarsegrain-hp, ht-coarsegrain-block, and DFacTo. (a) Time on Netflix. (b) Time on NELL-B. (c) Speedup on Flickr. (d) Speedup on Delicious.]

Table 2 (from [KU15]): Statistics for the computation and communication requirements in one CP-ALS iteration for 512-way partitionings of the Netflix tensor.

Method                Mode   Comp. load (Max / Avg)   Comm. volume (Max / Avg)   Num. msg. (Max / Avg)
ht-finegrain-hp        1       196672 / 196251            21079 / 6367               734 / 316
                       2       196672 / 196251            18028 / 5899              1022 / 1016
                       3       196672 / 196251             3545 / 2492              1022 / 1018
ht-finegrain-random    1       197507 / 196251           272326 / 252118            1022 / 1022
                       2       197507 / 196251            29282 / 22715             1022 / 1022
                       3       197507 / 196251             7766 / 4300              1013 / 1003
ht-coarsegrain-hp      1       364181 / 196251           302001 / 136741             511 / 511
                       2       349123 / 196251            59523 / 12228              511 / 511
                       3       737570 / 196251            23524 / 2000               511 / 507
ht-coarsegrain-block   1       198602 / 196251           239337 / 142006             448 / 447
                       2       367966 / 196251            33889 / 12458              511 / 445
                       3       737570 / 196251            24659 / 2049               511 / 394

[The slide excerpts the conclusions of [KU15]:] Experiments on up to 1024 cores showed that the proposed fine-grain MTTKRP achieves the best performance among the alternatives when given a good partitioning, reaching up to 194x speedup on 512 cores. Communication latency was identified as the dominant hindrance to further scalability, and the hypergraphs built for partitioning can be too large for existing tools, motivating methods that partition huge hypergraphs efficiently and effectively.

Grey Ballard CA Algorithms 26


Outline

1 Tensor Notation

2 Tensor Decompositions

3 Computing CP via Alternating Least Squares
  Mathematical Background
  Parallel Algorithm for Sparse Tensors

4 Computing Tucker via Sequentially Truncated Higher-Order SVD
  Mathematical Background
  Parallel Algorithm for Dense Tensors

5 Open Problems


Tucker Optimization Problem

For fixed ranks P, Q, R, we want to solve

min_{X̂} ‖X − X̂‖² = Σ_{i=1}^{I} Σ_{j=1}^{J} Σ_{k=1}^{K} (x_{ijk} − x̂_{ijk})²  subject to X̂ = ⟦G; U, V, W⟧

which turns out to be equivalent to

max_{U,V,W} ‖G‖  subject to G = X ×1 Uᵀ ×2 Vᵀ ×3 Wᵀ

which is a nonlinear, nonconvex optimization problem

Grey Ballard CA Algorithms 27


Higher-Order Orthogonal Iteration (HOOI)

Fixing all but one factor matrix, we have a matrix problem:

max_V ‖X ×1 Uᵀ ×2 Vᵀ ×3 Wᵀ‖

or equivalently

max_V ‖Vᵀ Y_(2)‖_F  where Y = X ×1 Uᵀ ×3 Wᵀ

HOOI works by alternating over the factor matrices, updating one at a time by computing leading left singular vectors

Grey Ballard CA Algorithms 28


Sequentially Truncated Higher-Order SVD

HOOI is very sensitive to initialization

The Truncated Higher-Order SVD (T-HOSVD) is typically used to initialize it

ST-HOSVD [VVM12] is more efficient than T-HOSVD; it works by
  initializing with identity matrices U = I_I, V = I_J, W = I_K
  applying one iteration of HOOI
  where the ranks P, Q, R can be chosen based on an error tolerance

Grey Ballard CA Algorithms 29


ST-HOSVD Algorithm

1 S^(1) ← X_(1) X_(1)ᵀ
2 U = leading eigenvectors of S^(1)
3 Y = X ×1 Uᵀ
4 S^(2) ← Y_(2) Y_(2)ᵀ
5 V = leading eigenvectors of S^(2)
6 Z = Y ×2 Vᵀ
7 S^(3) ← Z_(3) Z_(3)ᵀ
8 W = leading eigenvectors of S^(3)
9 G = Z ×3 Wᵀ

Left singular vectors of A are computed as eigenvectors of AAᵀ

Grey Ballard CA Algorithms 30
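The nine steps above can be sketched for a dense array in NumPy (assumed helper names, not the slides' code; eigendecompositions of the small Gram matrices stand in for the SVDs):

```python
import numpy as np

def ttm(G, M, n):
    # mode-n tensor-times-matrix: contract mode n of G with M of
    # shape (q, G.shape[n]), placing the new axis back at position n
    return np.moveaxis(np.tensordot(M, G, axes=(1, n)), 0, n)

def st_hosvd(X, ranks):
    G, factors = X, []
    for n, r in enumerate(ranks):
        Gn = np.moveaxis(G, n, 0).reshape(G.shape[n], -1)  # mode-n unfolding
        _, Q = np.linalg.eigh(Gn @ Gn.T)                   # S^(n) = G_(n) G_(n)^T
        U = Q[:, ::-1][:, :r]                              # r leading eigenvectors
        factors.append(U)
        G = ttm(G, U.T, n)                                 # truncate mode n before moving on
    return G, factors

# an input of exact multilinear rank is recovered exactly (up to rounding)
rng = np.random.default_rng(5)
core = rng.standard_normal((2, 3, 2))
A, B, C = (np.linalg.qr(rng.standard_normal((m, r)))[0]
           for m, r in ((5, 2), (6, 3), (7, 2)))
X = ttm(ttm(ttm(core, A, 0), B, 1), C, 2)

G, (U, V, W) = st_hosvd(X, (2, 3, 2))
Xhat = ttm(ttm(ttm(G, U, 0), V, 1), W, 2)
assert np.allclose(Xhat, X)
```

Note that each mode's Gram matrix is formed from the already-truncated tensor, which is what makes ST-HOSVD cheaper than T-HOSVD.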


Outline

1 Tensor Notation

2 Tensor Decompositions

3 Computing CP via Alternating Least Squares
  Mathematical Background
  Parallel Algorithm for Sparse Tensors

4 Computing Tucker via Sequentially Truncated Higher-Order SVD
  Mathematical Background
  Parallel Algorithm for Dense Tensors

5 Open Problems


Parallel Block Tensor Distribution

For an N-mode tensor, use a logical N-mode processor grid
Proc. grid: P_I × P_J × P_K = 3 × 5 × 2

[Figure: I × J × K tensor partitioned into blocks across the 3 × 5 × 2 processor grid]

Local tensors have dimensions (I/P_I) × (J/P_J) × (K/P_K)
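The block ownership just described can be written down directly. A minimal sketch, assuming each dimension divides evenly by its grid dimension (the sizes below are made up):

```python
import numpy as np

# Each processor (pi, pj, pk) in a PI x PJ x PK grid owns one contiguous
# block of an I x J x K tensor, of shape (I/PI) x (J/PJ) x (K/PK).
I, J, K = 12, 20, 8
PI, PJ, PK = 3, 5, 2

def local_block(X, p, grid):
    """Return the sub-tensor owned by processor coordinates p = (pi, pj, pk)."""
    slices = tuple(slice(c * (n // g), (c + 1) * (n // g))
                   for c, n, g in zip(p, X.shape, grid))
    return X[slices]

X = np.arange(I * J * K, dtype=float).reshape(I, J, K)
blk = local_block(X, (1, 3, 0), (PI, PJ, PK))   # shape (4, 4, 4)
```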


Unfolded Tensor Distribution

Key idea: each unfolded matrix is 2D block distributed
Proc. grid: P_I × P_J × P_K = 3 × 5 × 2

[Figure: mode-2 unfolding X_(2), a J × IK matrix, distributed over a 2D grid]

Logical mode-2 2D processor grid: P_J × (P_I P_K)
Local unfolded matrices have dimensions (J/P_J) × (IK/(P_I P_K))


Kernel Matrix Computations

Key computations in ST-HOSVD are
    Gram: computing X_(2) X_(2)^T
    TTM: computing Y_(2) = V^T X_(2)

These are just matrix computations, done for each mode in sequence
    can determine lower bound/optimal algorithm for individual computations
    how to minimize communication across all computations?
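Both kernels act on the same unfolding. A small illustration for mode 2 of an I × J × K tensor (sizes and the orthonormal factor V below are made-up placeholders):

```python
import numpy as np

I, J, K, Q = 4, 5, 6, 2
X = np.random.rand(I, J, K)

# Mode-2 unfolding: a J x (I*K) matrix whose rows are indexed by mode 2
X2 = np.moveaxis(X, 1, 0).reshape(J, I * K)

gram = X2 @ X2.T                             # Gram kernel: J x J, symmetric
V = np.linalg.qr(np.random.rand(J, Q))[0]    # some orthonormal factor, J x Q
Y2 = V.T @ X2                                # TTM kernel: unfolding of Y = X x_2 V^T
```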


Parameter Tuning

[Figure: two bar charts of ST-HOSVD runtime, broken down into TTM, Evecs, and Gram time]

Left: varying the processor grid for a tensor of size 384×384×384×384 with reduced size 96×96×96×96.

Right: varying the mode order for a tensor of size 25×250×250×250 with reduced size 10×10×100×100.


Parallel Scaling

[Figure: weak-scaling (GFLOPS per core) and strong-scaling (time in seconds) plots for ST-HOSVD]

Weak scaling for a 200k×200k×200k×200k tensor with reduced size 20k×20k×20k×20k, using k^4 nodes for 1 ≤ k ≤ 6.

Strong scaling for a 200×200×200×200 tensor with reduced size 20×20×20×20, using 2^k nodes for 0 ≤ k ≤ 9.


Application: Compression of Scientific Simulation Data

We applied ST-HOSVD to compress multidimensional data from numerical simulations of combustion, including the following data sets:

HCCI:
    Dimensions: 672 × 672 × 33 × 627
    672 × 672 spatial grid, 33 variables over 627 time steps
    Total size: 70 GB

TJLR:
    Dimensions: 460 × 700 × 360 × 35 × 16
    460 × 700 × 360 spatial grid, 35 variables over 16 time steps
    Total size: 520 GB

SP:
    Dimensions: 500 × 500 × 500 × 11 × 50
    500 × 500 × 500 spatial grid, 11 variables over 50 time steps
    Total size: 550 GB


Application: Compression of Scientific Simulation Data

[Figure: compression ratio vs. relative normwise error (log-log) for the HCCI, TJLR, and SP data sets]

Compression ratio: IJK / (PQR + IP + JQ + KR)

Relative normwise error: ‖X − X̂‖ / ‖X‖
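Both metrics are cheap to compute. A small check with made-up sizes and a stand-in approximation X̂ (the perturbation here is arbitrary, just to have something to measure):

```python
import numpy as np

# Compression ratio for an I x J x K tensor stored as a P x Q x R Tucker
# core plus three factor matrices of sizes I x P, J x Q, K x R.
I, J, K = 40, 50, 60
P, Q, R = 4, 5, 6
ratio = (I * J * K) / (P * Q * R + I * P + J * Q + K * R)

# Relative normwise error between X and an approximation Xhat
X = np.random.rand(I, J, K)
Xhat = X + 1e-3 * np.random.rand(I, J, K)   # stand-in approximation
rel_err = np.linalg.norm(X - Xhat) / np.linalg.norm(X)
```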


Outline

1 Tensor Notation

2 Tensor Decompositions

3 Computing CP via Alternating Least Squares
      Mathematical Background
      Parallel Algorithm for Sparse Tensors

4 Computing Tucker via Sequentially Truncated Higher-Order SVD
      Mathematical Background
      Parallel Algorithm for Dense Tensors

5 Open Problems


Numerical Questions

CP-ALS solves least squares problems using the normal equations
ST-HOSVD computes singular vectors using the Gram matrix

Are there applications that require better numerical stability?
Can more numerically stable methods be implemented efficiently?


CA Questions

What are the communication lower bounds for MTTKRP?
    the computation can be expressed as nested loops
    is there a tradeoff between computation and communication?

What are the communication lower bounds for ST-HOSVD?
    we have already improved the communication costs of the published algorithm
    can the parameter tuning problems be solved analytically?


For more details:

Scalable Sparse Tensor Decompositions in Distributed Memory Systems
Oguz Kaya and Bora Uçar
International Conference for High Performance Computing, Networking, Storage and Analysis 2015
http://doi.acm.org/10.1145/2807591.2807624

Parallel Tensor Compression for Large-Scale Scientific Data
Woody Austin, Grey Ballard, and Tamara G. Kolda
International Parallel and Distributed Processing Symposium 2016
http://arxiv.org/abs/1510.06689


References I

Evrim Acar, Canan Aykut-Bingol, Haluk Bingol, Rasmus Bro, and Bülent Yener. Multiway analysis of epilepsy tensors. Bioinformatics, 23(13):i10–i18, 2007.

C. M. Andersen and R. Bro. Practical aspects of PARAFAC modeling of fluorescence excitation-emission data. Journal of Chemometrics, 17(4):200–215, 2003.

Woody Austin, Grey Ballard, and Tamara G. Kolda. Parallel tensor compression for large-scale scientific data. Technical Report 1510.06689, arXiv, 2015. To appear in IPDPS.


References II

Brett W. Bader, Michael W. Berry, and Murray Browne. Survey of Text Mining II: Clustering, Classification, and Retrieval, chapter Discussion Tracking in Enron Email Using PARAFAC, pages 147–163. Springer London, London, 2008.

Brett W. Bader and Tamara G. Kolda. Efficient MATLAB computations with sparse and factored tensors. SIAM Journal on Scientific Computing, 30(1):205–231, December 2007.

Rafael Ballester-Ripoll and Renato Pajarola. Lossy volume compression using Tucker truncation and thresholding. The Visual Computer, pages 1–14, 2015.

T. G. Kolda and B. W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, September 2009.


References III

Oguz Kaya and Bora Uçar. Scalable sparse tensor decompositions in distributed memory systems. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '15, pages 77:1–77:11, New York, NY, USA, 2015. ACM.

M. Alex O. Vasilescu and Demetri Terzopoulos. Multilinear analysis of image ensembles: TensorFaces. In Computer Vision — ECCV 2002: 7th European Conference on Computer Vision, Copenhagen, Denmark, May 28–31, 2002, Proceedings, Part I, pages 447–460. Springer Berlin Heidelberg, Berlin, Heidelberg, 2002.

Nick Vannieuwenhoven, Raf Vandebril, and Karl Meerbergen. A new truncation strategy for the higher-order singular value decomposition. SIAM Journal on Scientific Computing, 34(2):A1027–A1052, 2012.
