Part 2: Communication Costs of Tensor Decompositions
Grey Ballard
CS 294/Math 270: Communication-Avoiding Algorithms, UC Berkeley
March 28, 2016
Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.
Plan
Introduction to tensor decompositions [KB09]: nomenclature and notation; popular decompositions: CP and Tucker

We're seeking comm-optimal sequential and parallel algorithms: few known lower bounds; few standard libraries or HPC implementations

Many flavors of problems: dense or sparse; sequential or parallel; CP or Tucker (or alternatives like tensor train); choices of mathematical algorithm

We'll do two case studies of parallel algorithms: computing the CP decomposition of a sparse tensor [KU15]; computing the Tucker decomposition of a dense tensor [ABK15]
Grey Ballard CA Algorithms 1
Outline
1 Tensor Notation
2 Tensor Decompositions
3 Computing CP via Alternating Least Squares: Mathematical Background; Parallel Algorithm for Sparse Tensors
4 Computing Tucker via Sequentially Truncated Higher-Order SVD: Mathematical Background; Parallel Algorithm for Dense Tensors
5 Open Problems
Tensors
Vector N = 1
Matrix N = 2
3rd-Order Tensor N = 3
4th-Order Tensor N = 4
5th-Order Tensor N = 5
An Nth-order tensor has N modes. Notation convention: vector v, matrix M, tensor T
Grey Ballard CA Algorithms 2
Fibers
Mode-1 Fibers Mode-2 Fibers Mode-3 Fibers
A tensor can be decomposed into the fibers of each mode (fix all indices but one)
Grey Ballard CA Algorithms 3
Slices
[Figures from Kolda & Bader, SIAM Review 2009: Fig. 2.1, fibers of a 3rd-order tensor (mode-1/column fibers x_{:jk}, mode-2/row fibers x_{i:k}, mode-3/tube fibers x_{ij:}); Fig. 2.2, slices of a 3rd-order tensor (horizontal slices X_{i::}, lateral slices X_{:j:}, frontal slices X_{::k}, also written X_k).]
A tensor can also be decomposed into the slices of each mode (fix one index)
Grey Ballard CA Algorithms 4
Unfoldings
A tensor can be reshaped into matrices, called unfoldings or matricizations, for different modes (fibers form columns, slices form rows)
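To make the unfolding concrete, here is a minimal NumPy sketch (illustrative, not from the slides); the exact column ordering of an unfolding is a convention that varies between papers and libraries, so this uses the simple "move the mode to the front and reshape" variant:

import numpy as np

def unfold(X, n):
    # Mode-n unfolding: the mode-n fibers of X become the columns of the result.
    return np.moveaxis(X, n, 0).reshape(X.shape[n], -1)

X = np.arange(24.0).reshape(2, 3, 4)   # a small 2 x 3 x 4 tensor
print(unfold(X, 0).shape)              # (2, 12): mode-1 unfolding, I x (J*K)
print(unfold(X, 1).shape)              # (3, 8):  mode-2 unfolding, J x (I*K)
print(unfold(X, 2).shape)              # (4, 6):  mode-3 unfolding, K x (I*J)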
Grey Ballard CA Algorithms 5
Outline
1 Tensor Notation
2 Tensor Decompositions
3 Computing CP via Alternating Least Squares: Mathematical Background; Parallel Algorithm for Sparse Tensors
4 Computing Tucker via Sequentially Truncated Higher-Order SVD: Mathematical Background; Parallel Algorithm for Dense Tensors
5 Open Problems
Low-rank approximations of tensors
Tensor “decompositions” are usually low-rank approximations
They generalize matrix approximations from two viewpoints: sum of outer products (think PCA); product of two rectangular matrices (think high-variance subspaces)
Some applications seek true decompositions, but that is less common
Grey Ballard CA Algorithms 6
Sum of outer products
Matrix:
Tensor:
This is known as the CANDECOMP/PARAFAC (CP) decomposition
Grey Ballard CA Algorithms 7
CP Notation
$\mathcal{T} \approx \mathbf{u}_1 \circ \mathbf{v}_1 \circ \mathbf{w}_1 + \cdots + \mathbf{u}_R \circ \mathbf{v}_R \circ \mathbf{w}_R, \qquad \mathcal{T} \in \mathbb{R}^{I \times J \times K}$

$\mathcal{T} \approx [\![\mathbf{U}, \mathbf{V}, \mathbf{W}]\!]$, where $\mathbf{U} \in \mathbb{R}^{I \times R}$, $\mathbf{V} \in \mathbb{R}^{J \times R}$, $\mathbf{W} \in \mathbb{R}^{K \times R}$ are factor matrices

$t_{ijk} \approx \sum_{r=1}^{R} u_{ir} v_{jr} w_{kr}, \qquad 1 \le i \le I,\ 1 \le j \le J,\ 1 \le k \le K$
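As a quick illustration of the elementwise formula above (my own sketch, not part of the talk), the CP model ⟦U, V, W⟧ can be assembled in NumPy with einsum:

import numpy as np

I, J, K, R = 4, 5, 6, 3
U, V, W = np.random.rand(I, R), np.random.rand(J, R), np.random.rand(K, R)

# t_{ijk} = sum_r u_{ir} v_{jr} w_{kr}, i.e., the CP model [[U, V, W]]
T = np.einsum('ir,jr,kr->ijk', U, V, W)

# spot-check one entry against the explicit sum over r
i, j, k = 1, 2, 3
assert np.isclose(T[i, j, k], sum(U[i, r] * V[j, r] * W[k, r] for r in range(R)))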
Grey Ballard CA Algorithms 8
Notation convention: scalar dimension N, index n with 1 ≤ n ≤ N
Applications of CP
CP often used like PCA for multi-dimensional data: interpretable components separated from noise

Sample applications:
chemometrics [AB03]: data is excitation wavelengths × emission wavelengths × time; components correspond to chemical species' signatures
neuroscience [AABB+07]: data is electrode × frequency × time; components help to describe the origin of a seizure
text analysis [BBB08]: data is term × author × time; components discover conversations
Grey Ballard CA Algorithms 9
High-variance subspaces
Matrix:
Tensor:
This is known as the Tucker decomposition
Grey Ballard CA Algorithms 10
Tucker Notation
$\mathcal{T} \approx \mathcal{G} \times_1 \mathbf{U} \times_2 \mathbf{V} \times_3 \mathbf{W}$, where $\mathcal{T} \in \mathbb{R}^{I \times J \times K}$ and $\mathcal{G} \in \mathbb{R}^{P \times Q \times R}$ is the core tensor

$\mathcal{T} \approx [\![\mathcal{G}; \mathbf{U}, \mathbf{V}, \mathbf{W}]\!]$, where $\mathbf{U} \in \mathbb{R}^{I \times P}$, $\mathbf{V} \in \mathbb{R}^{J \times Q}$, $\mathbf{W} \in \mathbb{R}^{K \times R}$ are factor matrices

$t_{ijk} \approx \sum_{p=1}^{P} \sum_{q=1}^{Q} \sum_{r=1}^{R} g_{pqr}\, u_{ip} v_{jq} w_{kr}, \qquad 1 \le i \le I,\ 1 \le j \le J,\ 1 \le k \le K$
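Analogously, a small NumPy sketch (illustrative only) of assembling the Tucker model ⟦G; U, V, W⟧ from a core tensor and factor matrices:

import numpy as np

I, J, K = 6, 7, 8          # tensor dimensions
P, Q, R = 2, 3, 4          # core (reduced) dimensions
G = np.random.rand(P, Q, R)
U, V, W = np.random.rand(I, P), np.random.rand(J, Q), np.random.rand(K, R)

# t_{ijk} = sum_{p,q,r} g_{pqr} u_{ip} v_{jq} w_{kr}, i.e., [[G; U, V, W]]
T = np.einsum('pqr,ip,jq,kr->ijk', G, U, V, W)
print(T.shape)             # (6, 7, 8)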
Grey Ballard CA Algorithms 11
Tensor-Times-Matrix (TTM)
Tensor version: $\mathcal{Y} = \mathcal{X} \times_2 \mathbf{M}$, with $\mathcal{Y} \in \mathbb{R}^{I \times Q \times K}$, $\mathcal{X} \in \mathbb{R}^{I \times J \times K}$, $\mathbf{M} \in \mathbb{R}^{Q \times J}$

Matrix version: $\mathbf{Y}_{(2)} = \mathbf{M} \mathbf{X}_{(2)}$, with $\mathbf{Y}_{(2)} \in \mathbb{R}^{Q \times IK}$, $\mathbf{X}_{(2)} \in \mathbb{R}^{J \times IK}$

Element version: $y_{iqk} = \sum_{j=1}^{J} m_{qj} x_{ijk}$

TTM is matrix multiplication with a particular unfolding
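The equivalence of the three versions can be checked with a small NumPy sketch (illustrative; the unfolding ordering used here is one common convention):

import numpy as np

I, J, K, Q = 3, 4, 5, 2
X = np.random.rand(I, J, K)
M = np.random.rand(Q, J)

# Element version: y_{iqk} = sum_j m_{qj} x_{ijk}
Y = np.einsum('qj,ijk->iqk', M, X)

# Matrix version: multiply M against the mode-2 unfolding, then fold back
X2 = np.moveaxis(X, 1, 0).reshape(J, -1)           # mode-2 unfolding, J x (I*K)
Y2 = M @ X2                                        # Q x (I*K)
Y_folded = np.moveaxis(Y2.reshape(Q, I, K), 0, 1)  # back to an I x Q x K tensor
assert np.allclose(Y, Y_folded)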
Grey Ballard CA Algorithms 12
Applications of Tucker
Tucker can be viewed as a richer form of CP, so it's also used like PCA; a diagonal core tensor corresponds to a CP decomposition

Sample application: computer vision, TensorFaces [VT02], a facial recognition system benefiting from varying lighting, expression, viewpoint

Tucker is typically more efficient than CP for compression

Sample application: visual data compression [BRP15] of image, video, and 3D volume data
Grey Ballard CA Algorithms 13
Ambiguities
There are several ambiguities that have to be handled carefully
CP scaling ambiguity:

$\mathcal{T} \approx \sum_{r=1}^{R} \mathbf{u}_r \circ \mathbf{v}_r \circ \mathbf{w}_r \;\longrightarrow\; \mathcal{T} \approx \sum_{r=1}^{R} \lambda_r \cdot \mathbf{u}_r \circ \mathbf{v}_r \circ \mathbf{w}_r$, where $\|\mathbf{u}_r\|_2 = \|\mathbf{v}_r\|_2 = \|\mathbf{w}_r\|_2 = 1$

Tucker basis ambiguity:

$\mathcal{T} \approx \mathcal{G} \times_1 \mathbf{U} \times_2 \mathbf{V} \times_3 \mathbf{W}$, where $\mathbf{U}^T\mathbf{U} = \mathbf{I}_P$, $\mathbf{V}^T\mathbf{V} = \mathbf{I}_Q$, $\mathbf{W}^T\mathbf{W} = \mathbf{I}_R$
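The CP scaling ambiguity is typically resolved by normalizing the columns of the factor matrices and absorbing the scalings into a weight vector λ (as in step 6 of CP-ALS later in the talk). A minimal sketch of that normalization, assuming NumPy factor matrices:

import numpy as np

R = 3
U, V, W = np.random.rand(4, R), np.random.rand(5, R), np.random.rand(6, R)

lam = np.ones(R)
normalized = []
for A in (U, V, W):
    norms = np.linalg.norm(A, axis=0)   # 2-norm of each column
    lam *= norms                        # absorb the scaling into lambda
    normalized.append(A / norms)        # columns now have unit norm
U, V, W = normalized

# The model is unchanged: T ~ sum_r lam_r * u_r o v_r o w_r
T = np.einsum('r,ir,jr,kr->ijk', lam, U, V, W)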
Grey Ballard CA Algorithms 14
Survey Paper
Notation can be a huge obstacle to working with tensors; standardization can help
I recommend following the conventions of the following paper:
Tensor Decompositions and Applications
Tammy Kolda and Brett Bader
SIAM Review 2009
http://epubs.siam.org/doi/abs/10.1137/07070111X
Grey Ballard CA Algorithms 15
Outline
1 Tensor Notation
2 Tensor Decompositions
3 Computing CP via Alternating Least Squares: Mathematical Background; Parallel Algorithm for Sparse Tensors
4 Computing Tucker via Sequentially Truncated Higher-Order SVD: Mathematical Background; Parallel Algorithm for Dense Tensors
5 Open Problems
CP Optimization Problem
For fixed rank R, we want to solve
$\min_{\mathbf{U},\mathbf{V},\mathbf{W}} \left\| \mathcal{X} - \sum_{r=1}^{R} \mathbf{u}_r \circ \mathbf{v}_r \circ \mathbf{w}_r \right\|$
which is a nonlinear, nonconvex optimization problem
in the matrix case, the SVD gives us the optimal solution
in the tensor case, uniqueness/convergence to optimum not guaranteed
Grey Ballard CA Algorithms 16
Alternating Least Squares (ALS)
Fixing all but one factor matrix, we have a linear least squares problem:
$\min_{\mathbf{V}} \left\| \mathcal{X} - \sum_{r=1}^{R} \mathbf{u}_r \circ \mathbf{v}_r \circ \mathbf{w}_r \right\|$

or equivalently

$\min_{\mathbf{V}} \left\| \mathbf{X}_{(2)} - \mathbf{V} (\mathbf{W} \odot \mathbf{U})^T \right\|_F$

where $\odot$ is the Khatri-Rao product, a column-wise Kronecker product

ALS works by alternating over factor matrices, updating one at a time by solving the corresponding linear least squares problem
Grey Ballard CA Algorithms 17
CP-ALS
Repeat:
1 Solve $\mathbf{U} (\mathbf{V}^T\mathbf{V} \ast \mathbf{W}^T\mathbf{W}) = \mathbf{X}_{(1)} (\mathbf{W} \odot \mathbf{V})$ for $\mathbf{U}$
2 Normalize columns of $\mathbf{U}$
3 Solve $\mathbf{V} (\mathbf{U}^T\mathbf{U} \ast \mathbf{W}^T\mathbf{W}) = \mathbf{X}_{(2)} (\mathbf{W} \odot \mathbf{U})$ for $\mathbf{V}$
4 Normalize columns of $\mathbf{V}$
5 Solve $\mathbf{W} (\mathbf{U}^T\mathbf{U} \ast \mathbf{V}^T\mathbf{V}) = \mathbf{X}_{(3)} (\mathbf{V} \odot \mathbf{U})$ for $\mathbf{W}$
6 Normalize columns of $\mathbf{W}$ and store norms in $\lambda$

Linear least squares problems solved via normal equations, using the identity $(\mathbf{A} \odot \mathbf{B})^T (\mathbf{A} \odot \mathbf{B}) = \mathbf{A}^T\mathbf{A} \ast \mathbf{B}^T\mathbf{B}$, where $\ast$ is the element-wise product
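A minimal NumPy sketch of one CP-ALS sweep using these normal equations (illustrative only: the column normalization steps are omitted, and the unfolding/Khatri-Rao orderings below are chosen to be mutually consistent rather than to match any particular library):

import numpy as np

def khatri_rao(A, B):
    # Column-wise Kronecker product: column r is kron(A[:, r], B[:, r]).
    return np.einsum('ir,jr->ijr', A, B).reshape(A.shape[0] * B.shape[0], -1)

def unfold(X, n):
    # Mode-n unfolding with a column ordering consistent with khatri_rao above.
    return np.reshape(np.moveaxis(X, n, 0), (X.shape[n], -1), order='F')

def cp_als_sweep(X, U, V, W):
    # Solve the three normal-equation systems, e.g. U (V'V * W'W) = X_(1) (W ⊙ V).
    G = (V.T @ V) * (W.T @ W)
    U = np.linalg.solve(G, (unfold(X, 0) @ khatri_rao(W, V)).T).T
    G = (U.T @ U) * (W.T @ W)
    V = np.linalg.solve(G, (unfold(X, 1) @ khatri_rao(W, U)).T).T
    G = (U.T @ U) * (V.T @ V)
    W = np.linalg.solve(G, (unfold(X, 2) @ khatri_rao(V, U)).T).T
    return U, V, W

I, J, K, R = 5, 6, 7, 3
X = np.random.rand(I, J, K)
U, V, W = np.random.rand(I, R), np.random.rand(J, R), np.random.rand(K, R)
for _ in range(20):
    U, V, W = cp_als_sweep(X, U, V, W)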
Grey Ballard CA Algorithms 18
Outline
1 Tensor Notation
2 Tensor Decompositions
3 Computing CP via Alternating Least Squares: Mathematical Background; Parallel Algorithm for Sparse Tensors
4 Computing Tucker via Sequentially Truncated Higher-Order SVD: Mathematical Background; Parallel Algorithm for Dense Tensors
5 Open Problems
Matricized Tensor Times Khatri-Rao Product
CP-ALS spends most of its time in MTTKRP (dense or sparse), which corresponds to setting up the right-hand side of the normal equations, e.g., $\mathbf{M}^{(V)} = \mathbf{X}_{(2)} (\mathbf{W} \odot \mathbf{U})$

In the dense case, it usually makes sense to (1) form the Khatri-Rao product explicitly and (2) call dense matrix multiplication

In the sparse case, it usually makes sense [BK07] to use the
element-wise formula $m^{(V)}_{jr} = \sum_{i=1}^{I} \sum_{k=1}^{K} x_{ijk} u_{ir} w_{kr}$
or the row-wise formula $\mathbf{m}^{(V)}_{j,:} = \sum_{i=1}^{I} \sum_{k=1}^{K} x_{ijk} \, (\mathbf{u}_{i,:} \ast \mathbf{w}_{k,:})$
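A minimal sketch of the sparse row-wise formula above, assuming the tensor nonzeros are stored in coordinate (COO) form (illustrative; real implementations such as [BK07] are considerably more careful about memory traffic):

import numpy as np

def mttkrp_mode2_sparse(i_idx, j_idx, k_idx, vals, U, W, J):
    # Computes M^(V) = X_(2) (W ⊙ U) one nonzero at a time:
    # row j of M accumulates x_ijk * (U[i, :] * W[k, :]).
    M = np.zeros((J, U.shape[1]))
    for i, j, k, x in zip(i_idx, j_idx, k_idx, vals):
        M[j, :] += x * (U[i, :] * W[k, :])
    return M

# toy tensor with three nonzeros, I = K = 3, J = 2, rank R = 2
i_idx, j_idx, k_idx = [0, 1, 2], [0, 0, 1], [1, 2, 0]
vals = [1.0, 2.0, 3.0]
U, W = np.random.rand(3, 2), np.random.rand(3, 2)
M = mttkrp_mode2_sparse(i_idx, j_idx, k_idx, vals, U, W, J=2)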
Grey Ballard CA Algorithms 19
Coarse-Grain Distribution for CP-ALS [KU15]
[Figure: the factor matrices U (I×R), V (J×R), and W (K×R) each distributed by rows across processors P1–P4, alongside the sparse tensor X.]
Rows of each factor matrix are distributed across processors
Each tensor nonzero is copied to each process that will need it
Grey Ballard CA Algorithms 20
Coarse-Grain Parallelization for MTTKRP [KU15]
To update $\mathbf{V}$, need to compute $\mathbf{M}^{(V)} = \mathbf{X}_{(2)} (\mathbf{W} \odot \mathbf{U})$

Let $I_p$, $J_p$, $K_p$ be the subsets of rows of $\mathbf{U}$, $\mathbf{V}$, $\mathbf{W}$ owned by processor $p$

Main loop:
for each $j \in J_p$
  for each nonzero $x_{ijk}$ in slice $j$
    $\mathbf{m}^{(V)}_{j,:} \leftarrow \mathbf{m}^{(V)}_{j,:} + x_{ijk} \cdot (\mathbf{u}_{i,:} \ast \mathbf{w}_{k,:})$

In the inner loop, $\mathbf{u}_{i,:}$ or $\mathbf{w}_{k,:}$ requires communication if $i \notin I_p$ or $k \notin K_p$
Grey Ballard CA Algorithms 21
Fine-Grain Distribution of CP-ALS [KU15]
[Figure: the factor matrices U (I×R), V (J×R), and W (K×R) distributed by rows across processors P1–P4, with the nonzeros of the sparse tensor X also distributed across processors.]
Rows of each factor matrix are distributed across processors
Tensor nonzeros are distributed across processors
Grey Ballard CA Algorithms 22
Fine-Grain Parallelization for MTTKRP [KU15]
To update $\mathbf{V}$, need to compute $\mathbf{M}^{(V)} = \mathbf{X}_{(2)} (\mathbf{W} \odot \mathbf{U})$

Let $X_p$ be the subset of nonzeros of $\mathcal{X}$ owned by processor $p$
Let $I_p$, $J_p$, $K_p$ be the subsets of rows of $\mathbf{U}$, $\mathbf{V}$, $\mathbf{W}$ owned by processor $p$

Main loop:
for each $x_{ijk} \in X_p$
  $\mathbf{m}^{(V)}_{j,:} \leftarrow \mathbf{m}^{(V)}_{j,:} + x_{ijk} \cdot (\mathbf{u}_{i,:} \ast \mathbf{w}_{k,:})$

In the inner loop, $\mathbf{u}_{i,:}$ or $\mathbf{w}_{k,:}$ requires communication if $i \notin I_p$ or $k \notin K_p$
After the loop, $\mathbf{m}^{(V)}_{j,:}$ for $j \notin J_p$ needs to be sent to the owner processor
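To see why the distribution drives the communication cost, here is an illustrative sketch (my own, not the [KU15] implementation) that counts, for a given fine-grain assignment of nonzeros and factor-matrix rows, how many remote rows of U and W each process would need to receive before its local MTTKRP loop; the analogous count for rows of M^(V) sent back to their owners gives the output volume.

def mttkrp_input_comm_volume(coords, nnz_owner, row_owner_U, row_owner_W, P):
    # coords: list of (i, j, k) nonzero coordinates
    # nnz_owner: process owning each nonzero; row_owner_U/W: process owning each row
    needed_U = [set() for _ in range(P)]
    needed_W = [set() for _ in range(P)]
    for (i, j, k), p in zip(coords, nnz_owner):
        if row_owner_U[i] != p:
            needed_U[p].add(i)        # row u_{i,:} must be received by process p
        if row_owner_W[k] != p:
            needed_W[p].add(k)        # row w_{k,:} must be received by process p
    return [len(needed_U[p]) + len(needed_W[p]) for p in range(P)]

# Example: 4 nonzeros split over 2 processes, rows owned in contiguous blocks
coords = [(0, 0, 1), (1, 0, 2), (2, 1, 0), (3, 1, 1)]
print(mttkrp_input_comm_volume(coords, nnz_owner=[0, 0, 1, 1],
                               row_owner_U=[0, 0, 1, 1],
                               row_owner_W=[0, 0, 1], P=2))   # [1, 2]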
Grey Ballard CA Algorithms 23
Minimizing Communication
Algorithms defined for any distributions of factor matrices / tensor
Distributions determine computational load balance and communication costs

Finding optimal distribution for each algorithm is a hypergraph partitioning problem (subject to load balance constraint)

Even if hypergraph is optimally partitioned, no guarantees that either algorithm is communication optimal
Grey Ballard CA Algorithms 24
Coarse-Grain vs Fine-Grain
Coarse-grain:
Owner computes: communicates only inputs within MTTKRP
Requires replication of X
Generalizes row-wise algorithm for SpMV (for multiple vectors)

Fine-grain:
Communicates inputs and outputs within MTTKRP
No replication of X
Generalizes fine-grain algorithm for SpMV
Grey Ballard CA Algorithms 25
Performance Comparison [KU15]
[Figure from [KU15], four panels comparing ht-finegrain-hp, ht-finegrain-random, ht-coarsegrain-hp, ht-coarsegrain-block, and DFacTo on 1 to 1024 MPI processes: (a) average time per CP-ALS iteration on Netflix, (b) average time per CP-ALS iteration on NELL-B, (c) speedup over sequential execution on Flickr, (d) speedup over sequential execution on Delicious.]
Figure 2: Time for parallel CP-ALS iteration on Netflix and NELL-B, and speedups on Flickr and Delicious.
Table 2: Statistics for the computation and communication requirements in one CP-ALS iteration for 512-way partitionings of the Netflix tensor.

Method                 Mode   Comp. load (Max / Avg)   Comm. volume (Max / Avg)   Num. msg. (Max / Avg)
ht-finegrain-hp        1      196672 / 196251          21079 / 6367               734 / 316
ht-finegrain-hp        2      196672 / 196251          18028 / 5899               1022 / 1016
ht-finegrain-hp        3      196672 / 196251          3545 / 2492                1022 / 1018
ht-finegrain-random    1      197507 / 196251          272326 / 252118            1022 / 1022
ht-finegrain-random    2      197507 / 196251          29282 / 22715              1022 / 1022
ht-finegrain-random    3      197507 / 196251          7766 / 4300                1013 / 1003
ht-coarsegrain-hp      1      364181 / 196251          302001 / 136741            511 / 511
ht-coarsegrain-hp      2      349123 / 196251          59523 / 12228              511 / 511
ht-coarsegrain-hp      3      737570 / 196251          23524 / 2000               511 / 507
ht-coarsegrain-block   1      198602 / 196251          239337 / 142006            448 / 447
ht-coarsegrain-block   2      367966 / 196251          33889 / 12458              511 / 445
ht-coarsegrain-block   3      737570 / 196251          24659 / 2049               511 / 394
Grey Ballard CA Algorithms 26
Outline
1 Tensor Notation
2 Tensor Decompositions
3 Computing CP via Alternating Least Squares: Mathematical Background; Parallel Algorithm for Sparse Tensors
4 Computing Tucker via Sequentially Truncated Higher-Order SVD: Mathematical Background; Parallel Algorithm for Dense Tensors
5 Open Problems
Tucker Optimization Problem
For fixed ranks P,Q,R, we want to solve
$\min_{\tilde{\mathcal{X}}} \left\| \mathcal{X} - \tilde{\mathcal{X}} \right\|^2 = \sum_{i=1}^{I} \sum_{j=1}^{J} \sum_{k=1}^{K} \left( x_{ijk} - \tilde{x}_{ijk} \right)^2 \quad \text{subject to} \quad \tilde{\mathcal{X}} = [\![\mathcal{G}; \mathbf{U}, \mathbf{V}, \mathbf{W}]\!]$

which turns out to be equivalent to

$\max_{\mathbf{U},\mathbf{V},\mathbf{W}} \|\mathcal{G}\| \quad \text{subject to} \quad \mathcal{G} = \mathcal{X} \times_1 \mathbf{U}^T \times_2 \mathbf{V}^T \times_3 \mathbf{W}^T$
which is a nonlinear, nonconvex optimization problem
Grey Ballard CA Algorithms 27
Higher-Order Orthogonal Iteration (HOOI)
Fixing all but one factor matrix, we have a matrix problem:
$\max_{\mathbf{V}} \left\| \mathcal{X} \times_1 \mathbf{U}^T \times_2 \mathbf{V}^T \times_3 \mathbf{W}^T \right\|$

or equivalently

$\max_{\mathbf{V}} \left\| \mathbf{V}^T \mathbf{Y}_{(2)} \right\|_F \quad \text{where} \quad \mathcal{Y} = \mathcal{X} \times_1 \mathbf{U}^T \times_3 \mathbf{W}^T$

HOOI works by alternating over factor matrices, updating one at a time by computing leading left singular vectors
Grey Ballard CA Algorithms 28
Sequentially Truncated Higher-Order SVD
HOOI is very sensitive to initialization
Truncated Higher-Order SVD (T-HOSVD) typically used
ST-HOSVD [VVM12] is more efficient than T-HOSVD; it works by
initializing with identity matrices $\mathbf{U} = \mathbf{I}_I$, $\mathbf{V} = \mathbf{I}_J$, $\mathbf{W} = \mathbf{I}_K$
applying one iteration of HOOI
where the ranks $P, Q, R$ can be chosen based on an error tolerance
Grey Ballard CA Algorithms 29
ST-HOSVD Algorithm
1 $\mathbf{S}^{(1)} \leftarrow \mathbf{X}_{(1)} \mathbf{X}_{(1)}^T$
2 $\mathbf{U}$ = leading eigenvectors of $\mathbf{S}^{(1)}$
3 $\mathcal{Y} = \mathcal{X} \times_1 \mathbf{U}^T$
4 $\mathbf{S}^{(2)} \leftarrow \mathbf{Y}_{(2)} \mathbf{Y}_{(2)}^T$
5 $\mathbf{V}$ = leading eigenvectors of $\mathbf{S}^{(2)}$
6 $\mathcal{Z} = \mathcal{Y} \times_2 \mathbf{V}^T$
7 $\mathbf{S}^{(3)} \leftarrow \mathbf{Z}_{(3)} \mathbf{Z}_{(3)}^T$
8 $\mathbf{W}$ = leading eigenvectors of $\mathbf{S}^{(3)}$
9 $\mathcal{G} = \mathcal{Z} \times_3 \mathbf{W}^T$

Left singular vectors of $\mathbf{A}$ are computed as eigenvectors of the Gram matrix $\mathbf{A}\mathbf{A}^T$
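A minimal NumPy sketch of these steps for a dense tensor with fixed target ranks (illustrative only; the distributed-memory algorithm of [ABK15] and rank selection from an error tolerance are much more involved):

import numpy as np

def st_hosvd(X, ranks):
    # Sequentially truncated HOSVD: for each mode in turn, form the Gram matrix
    # of the unfolding, take its leading eigenvectors, and shrink that mode.
    G = X
    factors = []
    for n, r in enumerate(ranks):
        Gn = np.moveaxis(G, n, 0).reshape(G.shape[n], -1)   # mode-n unfolding
        S = Gn @ Gn.T                                       # Gram matrix
        w, Q = np.linalg.eigh(S)                            # ascending eigenvalues
        U = Q[:, ::-1][:, :r]                               # top-r eigenvectors
        factors.append(U)
        # TTM with U^T shrinks mode n from G.shape[n] down to r
        G = np.moveaxis(np.tensordot(U.T, G, axes=(1, n)), 0, n)
    return G, factors

X = np.random.rand(20, 25, 30)
core, (U, V, W) = st_hosvd(X, ranks=(5, 6, 7))
print(core.shape)    # (5, 6, 7)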
Grey Ballard CA Algorithms 30
Outline
1 Tensor Notation
2 Tensor Decompositions
3 Computing CP via Alternating Least Squares: Mathematical Background; Parallel Algorithm for Sparse Tensors
4 Computing Tucker via Sequentially Truncated Higher-Order SVD: Mathematical Background; Parallel Algorithm for Dense Tensors
5 Open Problems
Parallel Block Tensor Distribution
For an N-mode tensor, use a logical N-mode processor grid
Proc. grid: $P_I \times P_J \times P_K = 3 \times 5 \times 2$

[Figure: the I × J × K tensor partitioned into a 3 × 5 × 2 grid of blocks.]

Local tensors have dimensions $\frac{I}{P_I} \times \frac{J}{P_J} \times \frac{K}{P_K}$
Grey Ballard CA Algorithms 31
Unfolded Tensor Distribution
Key idea: each unfolded matrix is 2D block distributed
Proc. grid: $P_I \times P_J \times P_K = 3 \times 5 \times 2$

[Figure: the unfolding $\mathbf{X}_{(2)}$, of dimensions $J \times IK$, shown with its 2D block distribution.]

Logical mode-2 2D processor grid: $P_J \times P_I P_K$
Local unfolded matrices have dimensions $\frac{J}{P_J} \times \frac{IK}{P_I P_K}$
Grey Ballard CA Algorithms 32
Kernel Matrix Computations
Key computations in ST-HOSVD are:
Gram: computing $\mathbf{X}_{(2)} \mathbf{X}_{(2)}^T$
TTM: computing $\mathbf{Y}_{(2)} = \mathbf{V}^T \mathbf{X}_{(2)}$

These are just matrix computations, done for each mode in sequence; we can determine lower bounds and optimal algorithms for the individual computations, but how do we minimize communication across all of the computations?
Grey Ballard CA Algorithms 33
Parameter Tuning
[Left plot: per-iteration time broken down into TTM, Evecs, and Gram while varying the processor grid, for a tensor of size 384×384×384×384 with reduced size 96×96×96×96. Right plot: the same breakdown while varying the mode order, for a tensor of size 25×250×250×250 with reduced size 10×10×100×100.]
Grey Ballard CA Algorithms 34
Parallel Scaling
[Left plot: weak scaling of ST-HOSVD (GFLOPS per core vs. number of nodes) for a 200k×200k×200k×200k tensor with reduced size 20k×20k×20k×20k, using $k^4$ nodes for $1 \le k \le 6$. Right plot: strong scaling of ST-HOSVD (time in seconds vs. number of nodes) for a 200×200×200×200 tensor with reduced size 20×20×20×20, using $2^k$ nodes for $0 \le k \le 9$.]
Grey Ballard CA Algorithms 35
Application: Compression of Scientific Simulation Data
We applied ST-HOSVD to compress multidimensional data from numerical simulations of combustion, including the following data sets:

HCCI: dimensions 672 × 672 × 33 × 627 (672 × 672 spatial grid, 33 variables over 627 time steps); total size 70 GB
TJLR: dimensions 460 × 700 × 360 × 35 × 16 (460 × 700 × 360 spatial grid, 35 variables over 16 time steps); total size 520 GB
SP: dimensions 500 × 500 × 500 × 11 × 50 (500 × 500 × 500 spatial grid, 11 variables over 50 time steps); total size 550 GB
Grey Ballard CA Algorithms 36
Application: Compression of Scientific Simulation Data
[Plot: compression ratio (log scale, roughly $10^0$ to $10^4$) versus relative normwise error ($10^{-6}$ to $10^{-2}$) for the HCCI, TJLR, and SP data sets.]
Compression ratio: $\dfrac{IJK}{PQR + IP + JQ + KR}$    Relative normwise error: $\dfrac{\|\mathcal{X} - \tilde{\mathcal{X}}\|}{\|\mathcal{X}\|}$
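For instance, the compression ratio formula is easy to evaluate for a hypothetical 500 × 500 × 500 tensor truncated to a 50 × 50 × 50 core (these ranks are made up for illustration, not the ones used for the data sets above):

def tucker_compression_ratio(I, J, K, P, Q, R):
    # original entries divided by entries stored for the core plus factor matrices
    return (I * J * K) / (P * Q * R + I * P + J * Q + K * R)

print(tucker_compression_ratio(500, 500, 500, 50, 50, 50))   # 625.0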
Grey Ballard CA Algorithms 37
Outline
1 Tensor Notation
2 Tensor Decompositions
3 Computing CP via Alternating Least Squares: Mathematical Background; Parallel Algorithm for Sparse Tensors
4 Computing Tucker via Sequentially Truncated Higher-Order SVD: Mathematical Background; Parallel Algorithm for Dense Tensors
5 Open Problems
Numerical Questions
CP-ALS solves least squares problems using the normal equations
ST-HOSVD computes singular vectors using the Gram matrix

Are there applications that require better numerical stability?
Can more numerically stable methods be implemented efficiently?
Grey Ballard CA Algorithms 38
CA Questions
What are the communication lower bounds for MTTKRP?
The computation can be expressed as nested loops
Is there a tradeoff between computation and communication?

What are the communication lower bounds for ST-HOSVD?
We've already improved the communication costs of the published algorithm
Can the parameter tuning problems be solved analytically?
Grey Ballard CA Algorithms 39
For more details:
Scalable Sparse Tensor Decompositions in Distributed Memory Systems
Oguz Kaya and Bora Uçar
International Conference for High Performance Computing, Networking, Storage and Analysis 2015
http://doi.acm.org/10.1145/2807591.2807624

Parallel Tensor Compression for Large-Scale Scientific Data
Woody Austin, Grey Ballard, and Tamara G. Kolda
International Parallel and Distributed Processing Symposium 2016
http://arxiv.org/abs/1510.06689
Grey Ballard CA Algorithms 40
References I
Evrim Acar, Canan Aykut-Bingol, Haluk Bingol, Rasmus Bro, and Bülent Yener. Multiway analysis of epilepsy tensors. Bioinformatics, 23(13):i10–i18, 2007.

C. M. Andersen and R. Bro. Practical aspects of PARAFAC modeling of fluorescence excitation-emission data. Journal of Chemometrics, 17(4):200–215, 2003.

Woody Austin, Grey Ballard, and Tamara G. Kolda. Parallel tensor compression for large-scale scientific data. Technical Report 1510.06689, arXiv, 2015. To appear in IPDPS.
Grey Ballard CA Algorithms 41
References II
Brett W. Bader, Michael W. Berry, and Murray Browne. Survey of Text Mining II: Clustering, Classification, and Retrieval, chapter Discussion Tracking in Enron Email Using PARAFAC, pages 147–163. Springer London, London, 2008.

Brett W. Bader and Tamara G. Kolda. Efficient MATLAB computations with sparse and factored tensors. SIAM Journal on Scientific Computing, 30(1):205–231, December 2007.

Rafael Ballester-Ripoll and Renato Pajarola. Lossy volume compression using Tucker truncation and thresholding. The Visual Computer, pages 1–14, 2015.

T. G. Kolda and B. W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, September 2009.
Grey Ballard CA Algorithms 42
References III
Oguz Kaya and Bora Uçar. Scalable sparse tensor decompositions in distributed memory systems. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '15, pages 77:1–77:11, New York, NY, USA, 2015. ACM.

M. Alex O. Vasilescu and Demetri Terzopoulos. Computer Vision — ECCV 2002: 7th European Conference on Computer Vision, Copenhagen, Denmark, May 28–31, 2002, Proceedings, Part I, chapter Multilinear Analysis of Image Ensembles: TensorFaces, pages 447–460. Springer Berlin Heidelberg, Berlin, Heidelberg, 2002.

Nick Vannieuwenhoven, Raf Vandebril, and Karl Meerbergen. A new truncation strategy for the higher-order singular value decomposition. SIAM Journal on Scientific Computing, 34(2):A1027–A1052, 2012.
Grey Ballard CA Algorithms 43