Parallel Sparse Matrix-Matrix Multiplication and Its Use in Triangle Counting and Enumeration
Ariful Azad, Lawrence Berkeley National Laboratory
SIAM ALA 2015, Atlanta
In collaboration with Grey Ballard (Sandia), Aydin Buluc (LBNL), James Demmel (UC Berkeley), John Gilbert (UCSB), Laura Grigori (INRIA), Oded Schwartz (Hebrew University), Sivan Toledo (Tel Aviv University), and Samuel Williams (LBNL)
Sparse Matrix-Matrix Multiplication (SpGEMM)
A × B = C
• A, B, and C are sparse.
• Why sparse matrix-matrix multiplication?
  – Algebraic multigrid (AMG)
  – Graph clustering
  – Betweenness centrality
  – Graph contraction
  – Subgraph extraction
  – Quantum chemistry
  – Triangle counting/enumeration
Outline of the talk
• 3D SpGEMM [A. Azad et al. Exploiting multiple levels of parallelism in sparse matrix-matrix multiplication. arXiv preprint arXiv:1510.00844]
  – Scalable shared-memory SpGEMM
  – Distributed-memory 3D SpGEMM
• Parallel triangle counting and enumeration using SpGEMM [A. Azad, A. Buluc, J. Gilbert. Parallel Triangle Counting and Enumeration using Matrix Algebra. IPDPS Workshops, 2015]
  – Masked SpGEMM to reduce communication
  – 1D parallel implementation
Shared-Memory SpGEMM
• Heap-based column-by-column algorithm [Buluc and Gilbert, 2008]
• Easy to parallelize in shared memory via multithreading
• Memory efficient and scalable (i.e., temporary per-thread storage is asymptotically negligible)
(Figure: the i-th column of B selects and scales columns of A, which are merged to form the i-th column of the result C.)
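To make the merge structure concrete, here is a minimal Python sketch of a heap-based column-by-column SpGEMM (not the CombBLAS implementation; the dict-of-columns layout and names are illustrative). Column i of C is a k-way heap merge of the columns of A selected and scaled by the nonzeros of B(:,i), so different output columns can go to different threads with only a small per-thread heap.

```python
import heapq

def heap_spgemm_column(A_cols, B_col):
    """One column of C = A*B via a k-way heap merge.

    A_cols: dict column index -> sorted list of (row, value) pairs
    B_col:  list of (k, value) pairs, the nonzeros of one column of B
    Returns the output column as a sorted list of (row, value) pairs.
    """
    heap = []  # one cursor per column of A selected by a nonzero of B_col
    for cur, (k, bval) in enumerate(B_col):
        col = A_cols.get(k, [])
        if col:
            row, aval = col[0]
            heapq.heappush(heap, (row, cur, 0, aval * bval))
    result = []
    while heap:
        row, cur, pos, val = heapq.heappop(heap)
        if result and result[-1][0] == row:
            result[-1] = (row, result[-1][1] + val)   # accumulate same row
        else:
            result.append((row, val))
        k, bval = B_col[cur]
        col = A_cols[k]
        if pos + 1 < len(col):                        # advance this cursor
            nrow, aval = col[pos + 1]
            heapq.heappush(heap, (nrow, cur, pos + 1, aval * bval))
    return result

def heap_spgemm(A_cols, B_cols):
    """Each C(:, i) is independent, which is what multithreading exploits."""
    return {i: heap_spgemm_column(A_cols, col) for i, col in B_cols.items()}
```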
Performance of Shared-Memory SpGEMM

(Figure: time (sec) vs. number of threads (1-16), comparing MKL and HeapSpGEMM; (a) cage12 x cage12, (b) Scale 16 G500 x G500.)
Faster than Intel's MKL (mkl_csrmultcsr) when the output is kept sorted by indices.
A. Azad, G. Ballard, A. Buluc, J. Demmel, L. Grigori, O. Schwartz, S. Toledo, S. Williams. Exploiting multiple levels of parallelism in sparse matrix-matrix multiplication. arXiv preprint arXiv:1510.00844.
Variants of Distributed-Memory SpGEMM
Matrix multiplication: ∀(i,j) ∈ n × n, C(i,j) = Σ_k A(i,k) B(k,j)

The computation (discrete) cube:
• A face for each (input/output) matrix
• A grid point for each multiplication
• Categorized by the way work is partitioned
1D algorithms: communicate A or B
2D algorithms: communicate A and B
3D algorithms: communicate A, B, and C
Sparsity independent algorithms: assigning grid-points to processors is independent of sparsity structure. [Ballard et al., SPAA 2013]
2D Algorithm: Sparse SUMMA
(Figure: C_ij += LocalSpGEMM(A_recv, B_recv) on a √p × √p processor grid; at each SUMMA stage, blocks of A are broadcast along processor rows and blocks of B along processor columns; example: 100K x 100K matrices with 25K x 25K blocks.)
• 2D Sparse SUMMA (Scalable Universal Matrix Multiplication Algorithm) was the previous state of the art. It becomes communication bound at high concurrency. [Buluc & Gilbert, 2012]
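A small serial simulation of Sparse SUMMA on a √p × √p block grid, written with scipy.sparse, may help fix ideas; the block partitioning, grid size, and names are illustrative, and a real implementation broadcasts the A and B blocks with MPI instead of slicing a global matrix.

```python
import scipy.sparse as sp

def sparse_summa_serial(A, B, grid_dim):
    """Serial simulation of 2D Sparse SUMMA on a grid_dim x grid_dim grid.
    At stage k, block column k of A is 'broadcast' along processor rows and
    block row k of B along processor columns; every processor (i, j)
    accumulates C_ij += A_ik * B_kj."""
    n = A.shape[0]
    bs = n // grid_dim                      # block owned by one processor
    blk = lambda M, i, j: M[i*bs:(i+1)*bs, j*bs:(j+1)*bs]
    C = [[sp.csr_matrix((bs, bs)) for _ in range(grid_dim)]
         for _ in range(grid_dim)]
    for k in range(grid_dim):               # SUMMA stages
        for i in range(grid_dim):
            for j in range(grid_dim):
                C[i][j] = C[i][j] + blk(A, i, k) @ blk(B, k, j)
    return sp.vstack([sp.hstack(row) for row in C]).tocsr()

if __name__ == "__main__":
    A = sp.random(120, 120, density=0.05, format="csr")
    B = sp.random(120, 120, density=0.05, format="csr")
    assert abs(sparse_summa_serial(A, B, grid_dim=4) - A @ B).sum() < 1e-9
```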
3D Parallel SpGEMM in a Nutshell
(Figure: processes arranged in a √(p/c) × √(p/c) × c grid (c = 3 shown); A and B are split across the layers A_::1, A_::2, A_::3, each layer computes an intermediate product C_intermediate, and an AllToAll along the third dimension merges the layers into C_final.)

C^int_ijk = Σ_{l=1..√(p/c)} A_ilk · B_ljk

Input storage does not increase with the number of layers.
Processor grid: √(p/c) × √(p/c) × c

A. Azad, G. Ballard, A. Buluc, J. Demmel, L. Grigori, O. Schwartz, S. Toledo, S. Williams. Exploiting multiple levels of parallelism in sparse matrix-matrix multiplication. arXiv preprint arXiv:1510.00844.
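The layered decomposition can also be sketched serially: split the inner dimension across c layers, let each layer form an intermediate product, then merge the layers, which is the role of the AllToAll step. This is a minimal illustration, not the distributed algorithm; sizes and names are assumptions.

```python
import numpy as np
import scipy.sparse as sp

def spgemm_3d_serial(A, B, c):
    """Split the inner dimension across c layers; each layer computes an
    intermediate product; merging the intermediates stands in for the
    AllToAll + multiway merge across layers."""
    n = A.shape[1]
    bounds = np.linspace(0, n, c + 1, dtype=int)
    intermediates = []
    for l in range(c):
        lo, hi = bounds[l], bounds[l + 1]
        # Layer l only touches columns lo:hi of A and rows lo:hi of B.
        intermediates.append(A[:, lo:hi] @ B[lo:hi, :])
    C = intermediates[0]
    for Ci in intermediates[1:]:
        C = C + Ci
    return C

if __name__ == "__main__":
    A = sp.random(200, 200, density=0.02, format="csr")
    assert abs(spgemm_3d_serial(A, A, c=4) - A @ A).sum() < 1e-9
```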
3D Parallel SpGEMM in a Nutshell

(Figure: the same decomposition, annotated with where time is spent.)
Communication: broadcast (within a processor row/column), AllToAll (across layers).
Computation (multithreaded): local multiply, multiway merge within a layer, multiway merge across layers.
Communication Cost

(Figure: the broadcast of A and B blocks within each processor row/column is the most expensive communication step; processor grid √(p/c) × √(p/c) × c with t threads per process.)

– #broadcasts targeted to each process (synchronization points / SUMMA stages): √(p/c)
– #processes participating in each broadcast (communicator size): √(p/c)
– Total data received in all broadcasts (per process): nnz/√(pc)

Threading decreases network card contention.
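A small helper makes these quantities concrete; it just evaluates the three expressions above, assuming the √ factors implied by the √(p/c) × √(p/c) × c grid, so its output is only as good as that cost model.

```python
import math

def broadcast_cost_model(p, c, nnz):
    """Per-process broadcast quantities from the cost model above.
    p: MPI processes, c: layers (p/c must be a perfect square),
    nnz: nonzeros of the input matrix."""
    grid = math.isqrt(p // c)               # one side of the per-layer grid
    assert grid * grid * c == p, "p/c must be a perfect square"
    return {
        "summa_stages": grid,               # broadcasts targeted to each process
        "communicator_size": grid,          # processes per broadcast
        "data_received": nnz / math.sqrt(p * c),
    }

# Example: more layers -> fewer, smaller broadcasts per process.
for c in (1, 4, 16):
    print(c, broadcast_cost_model(p=4096, c=c, nnz=10**9))
```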
How do 3D algorithms gain performance?

(Figure: stacked-bar runtime breakdown, time (sec) vs. number of layers c = 1, 2, 4, 8, 16, each with t = 1, 2, 4, 8 threads; components: Broadcast, AlltoAll, Local Multiply, Merge Layer, Merge Fiber, Other. Measured on 8,192 cores of Titan when multiplying two scale-26 G500 matrices.)

For a fixed number of layers (c), increasing the number of threads (t) reduces runtime.
How do 3D algorithms gain performance?

(Figure: the same runtime breakdown on 8,192 cores of Titan, multiplying two scale-26 G500 matrices.)

For a fixed number of threads (t), increasing the number of layers (c) reduces runtime.
3D SpGEMM performance (matrix squaring)
(Figure: nlpkkt160 x nlpkkt160 on Edison; time (sec) vs. number of cores (64-16384) for 2D with t = 1, 3, 6 threads and 3D with c = 4, 8, 16 layers and t = 1, 3, 6 threads.)
2D (non-threaded) is the previous state of the art.
3D (threaded), first presented here, beats it by 8x at large concurrencies.
Squaring nlpkkt160 on Edison produces 1.2 billion nonzeros in A².
3D SpGEMM performance (matrix squaring)

(Figure: time (sec) vs. number of cores (512-32768) for squaring ldoor, delaunay_n24, cage15, dielFilterV3real, mouse_gene, HV15R, and it-2004 with 3D SpGEMM, c = 16, t = 6, on Edison; annotated speedups: 18x and 14x.)

it-2004 (web crawl): nnz(A) = 1.1 billion, nnz(A²) = 14 billion.
3D SpGEMM performance (AMG coarsening)
• Galerkin triple product in Algebraic Multigrid (AMG)
• A_coarse = Rᵀ A_fine R, where R is the restriction matrix
• R is constructed via a distance-2 maximal independent set (MIS)
Only showing the first product (RᵀA).
(Figure: RᵀA with NaluR3 on Edison; time (sec) vs. number of cores (512-32768) for 2D (t = 1, 6) and 3D (c = 8, t = 1, 6; c = 16, t = 6).)
3D is 16x faster than 2D
3D SpGEMM performance (AMG coarsening)

Comparing performance with the EpetraExt package of Trilinos.

(Figure: AR computation with nlpkkt160 on Edison; time (sec) vs. number of cores (4-4096) for EpetraExt, 2D (t = 1), and 3D (c = 8, t = 6); the 3D variant is about 2x faster at scale.)

Notes:
• EpetraExt runs up to 3x faster when computing AR; 3D is less sensitive to this choice of product.
• The 1D decomposition used by EpetraExt performs better on matrices with good separators.
Application of SpGEMM: Triangle Counting/Enumeration
[A. Azad, A. Buluc, J. Gilbert. Parallel Triangle Counting and Enumeration using Matrix Algebra. IPDPS Workshops, 2015]
Counting Triangles
(Figure: example graph A on six vertices.)
Clustering coefficient:
• Pr (wedge i-j-k makes a triangle with edge i-k)
• 3 * # triangles / # wedges
• 3 * 2 / 13 = 0.46 in example
• may want to compute for each vertex j
Wedge: a path of length two
A triangle has three wedges
(Figure: the wedge 3-5-4, and the triangle it forms when edge 3-4 is present.)
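As a quick worked example, the snippet below computes wedges, triangles, and the global clustering coefficient from an adjacency matrix. The 6-vertex edge list is hypothetical (the slide's drawing is not reproduced here) but is chosen to match the counts quoted above: 2 triangles, 13 wedges, clustering coefficient 3·2/13 ≈ 0.46.

```python
import numpy as np

# Hypothetical 6-vertex edge list matching the slide's counts.
edges = [(3, 4), (3, 5), (4, 5), (3, 6), (4, 6), (1, 3), (2, 5)]
n = 6
A = np.zeros((n, n), dtype=int)
for u, v in edges:
    A[u - 1, v - 1] = A[v - 1, u - 1] = 1

deg = A.sum(axis=1)
wedges = int((deg * (deg - 1) // 2).sum())                 # paths of length two
triangles = int(np.trace(np.linalg.matrix_power(A, 3)) // 6)
print(wedges, triangles, 3 * triangles / wedges)           # 13 2 0.4615...
```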
Triangle Counting in Matrix Algebra
(Figure: the example graph and its adjacency matrix A.)

Step 1: Count wedges between a pair of vertices (u, v) by finding paths of length 2. Avoid repetition by counting only wedges whose middle vertex has a lower index than both endpoints.

A = L + U (hi->lo + lo->hi)
L × U = B (wedges)

In the example, B(5,6) = 2: there are 2 wedges between vertices 5 and 6.
Triangle Counting in Matrix Algebra
(Figure: the example graph.)

Step 2: Count a wedge as a triangle if there is an edge between the selected pair of vertices.

L × U = B (wedges)
A ∧ B = C (closed wedges: element-wise masking of B by A)
#triangles = sum(C)/2

Goal: Can we design a parallel algorithm that communicates less data by exploiting this observation?
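The two steps translate almost line for line into scipy.sparse; a compact sketch, assuming a symmetric 0/1 adjacency matrix with an empty diagonal (it reuses the hypothetical 6-vertex example from above):

```python
import numpy as np
import scipy.sparse as sp

def count_triangles(A):
    """Triangle counting following the slides: A = L + U, B = L*U counts
    wedges, and A .* B keeps only the closed wedges."""
    A = sp.csr_matrix(A)
    L = sp.tril(A, k=-1, format="csr")     # hi -> lo edges
    U = sp.triu(A, k=1, format="csr")      # lo -> hi edges
    B = L @ U                              # B(i,k) = #wedges i-j-k with j < i, k
    C = A.multiply(B)                      # keep wedges closed by edge (i, k)
    return int(C.sum()) // 2               # each triangle is counted twice

if __name__ == "__main__":
    edges = [(3, 4), (3, 5), (4, 5), (3, 6), (4, 6), (1, 3), (2, 5)]
    A = np.zeros((6, 6), dtype=int)
    for u, v in edges:
        A[u - 1, v - 1] = A[v - 1, u - 1] = 1
    print(count_triangles(A))              # -> 2
```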
Masked SpGEMM (1D algorithm)
In the example below, the nonzeros of L(5,:) are not needed by P1 because A1(5,:) does not contain any nonzeros. So we can mask rows of L by A to avoid communication.
Special SpGEMM: The nonzero structure of the product is contained in the original adjacency matrix A
(Figure: A ⊇ L × U, with the columns of A, L, and U partitioned into blocks owned by processors P1-P4 (1D column decomposition); row 5 of L, not needed by P1, is highlighted.)
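Below is a serial sketch of the 1D masked algorithm: each "processor" owns a block of columns of A (the mask) and of U, determines from its mask which rows of L it actually needs, and computes its block of C = A .* (L × U). The partitioning and helper names are illustrative, not the paper's implementation.

```python
import numpy as np
import scipy.sparse as sp

def masked_spgemm_1d(A, L, U, nprocs):
    """Serial sketch of 1D masked SpGEMM: 'processor' q owns a block of
    columns of A (the mask) and U, receives only the rows of L that its
    mask can use, and computes its block of C = A .* (L x U)."""
    n = A.shape[0]
    A, L, U = sp.csr_matrix(A), sp.csr_matrix(L), sp.csr_matrix(U)
    bounds = np.linspace(0, n, nprocs + 1, dtype=int)
    blocks = []
    for q in range(nprocs):
        cols = slice(int(bounds[q]), int(bounds[q + 1]))
        A_q, U_q = A[:, cols], U[:, cols]
        # Only rows where the mask A_q has a nonzero can survive A_q .* (L @ U_q),
        # so those are the only rows of L processor q must receive ("Comm-L").
        needed = np.flatnonzero(A_q.getnnz(axis=1) > 0)
        wedges_q = sp.csr_matrix((n, A_q.shape[1]))
        if needed.size:
            scatter = sp.csr_matrix(
                (np.ones(needed.size), (needed, np.arange(needed.size))),
                shape=(n, needed.size))
            wedges_q = scatter @ (L[needed, :] @ U_q)   # partial L, full rows
        blocks.append(A_q.multiply(wedges_q))           # apply the mask
    return sp.hstack(blocks).tocsr()

if __name__ == "__main__":
    n = 120
    dense = (np.random.default_rng(0).random((n, n)) < 0.05).astype(float)
    dense = np.triu(dense, 1)
    A = sp.csr_matrix(dense + dense.T)                  # symmetric adjacency
    L, U = sp.tril(A, -1, format="csr"), sp.triu(A, 1, format="csr")
    C = masked_spgemm_1d(A, L, U, nprocs=4)
    assert abs(C - A.multiply(L @ U)).sum() < 1e-9
```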
Benefits and Overheads of Masked SpGEMM

• Benefit
  – Reduces communication by a factor of p/d, where d is the average degree (good for large p)
• Overheads
  – Increased index traffic for requesting a subset of rows/columns
    • Partial remedy: compress index requests using Bloom filters (see the sketch below)
  – Indexing and packaging the requested rows (SpRef)
    • Remedy (ongoing work): data structures/algorithms for faster SpRef
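For the Bloom-filter remedy, a toy filter is enough to show the idea: the requesting processor packs the row indices it needs into a fixed-size bit array, and the owner sends every local row that tests positive. The sizes and hash construction below are illustrative; false positives only cause a few extra rows to be sent, never a missed row.

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter for compressing a set of requested row indices."""
    def __init__(self, m_bits=1 << 16, k_hashes=4):
        self.m, self.k = m_bits, k_hashes
        self.bits = bytearray(m_bits // 8)

    def _positions(self, item):
        for i in range(self.k):
            h = hashlib.blake2b(f"{i}:{item}".encode(), digest_size=8).digest()
            yield int.from_bytes(h, "little") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

# Requesting processor packs the row indices it needs into the filter ...
needed_rows = {5, 42, 1007, 31337}
request = BloomFilter()
for r in needed_rows:
    request.add(r)

# ... and the owning processor sends every local row that tests positive.
local_rows = range(2000)
rows_to_send = [r for r in local_rows if r in request]
assert needed_rows & set(local_rows) <= set(rows_to_send)
```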
Where do algorithms spend time with increased concurrency?
(Figure: fraction of runtime (0-100%) vs. number of cores (1-512) for (a) Triangle-basic and (b) Triangle-masked-bloom, broken down into Comp-Indices, Comm-Indices, SpRef, Comm-L, and SpGEMM.)
coPapersDBLP on NERSC/Edison.
As expected, masked SpGEMM reduces the cost of communicating L.
Expensive indexing undermines the advantage: more computation in exchange for less communication. [Easy to fix]
Strong Scaling of Triangle Counting
1"
2"
4"
8"
16"
32"
64"
128"
256"
1" 4" 16" 64" 256"
Speedu
p"
Number"of"Cores"
(a)$Triangle,basic$$
coPapersDBLP"mouse<gene"cage15"europe_osm"
1"
2"
4"
8"
16"
32"
64"
128"
1" 4" 16" 64" 256"Speedu
p"
Number"of"Cores"
(b)$Triangle,masked,bloom$
coPapersDBLP"
mouse<gene"
cage15"
europe_osm"
• Decent scaling to a modest number of cores (p = 512 in this case)
• Further scaling requires 2D/3D SpGEMM (ongoing/future work)
What about triangle enumeration?
Data access patterns stay intact; only the scalar operations change!

(Figure: the example graph and the matrix B whose entries are now sets of middle vertices, e.g. {5}, {3}, {3,4}.)

B(i,k) now captures the set of wedges (indexed solely by their middle vertex, because i and k are already known implicitly), as opposed to just the count of wedges.
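A dictionary-of-sets sketch shows how little changes: "multiply" emits the middle vertex of a wedge, "add" is set union, and the final mask by A turns each surviving set into explicit triangles. The data layout and names are illustrative.

```python
from collections import defaultdict

def enumerate_triangles(edges, n):
    """Triangle enumeration with the same data access pattern as counting,
    but set-valued 'scalars': B[(i, k)] = set of middle vertices j."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    # L(i, j) nonzero iff j < i; U(j, k) nonzero iff j < k (A = L + U).
    B = defaultdict(set)
    for j in range(1, n + 1):
        hi = [x for x in adj[j] if x > j]       # neighbors with higher index
        for i in hi:
            for k in hi:
                if i != k:
                    B[(i, k)].add(j)            # "multiply" emits middle vertex

    triangles = set()
    for (i, k), middles in B.items():           # mask by A: keep closed wedges
        if k in adj[i]:
            for j in middles:
                triangles.add(tuple(sorted((i, j, k))))
    return sorted(triangles)

edges = [(3, 4), (3, 5), (4, 5), (3, 6), (4, 6), (1, 3), (2, 5)]
print(enumerate_triangles(edges, 6))   # -> [(3, 4, 5), (3, 4, 6)]
```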
Future Work
• Develop data structures/algorithms for faster SpRef
• Reduce the cost of communicating indices by a factor of t (the number of threads) using in-node multithreading
• Use 3D decomposition/algorithms: the requested index vectors would be of shorter length
• Implement triangle enumeration via good serialization support; larger performance gains are expected using masked SpGEMM
Acknowledgements
Work is funded by
Grey Ballard (Sandia), Aydin Buluc (LBNL), James Demmel (UC Berkeley), John Gilbert (UCSB), Laura Grigori (INRIA), Oded Schwartz (Hebrew University), Sivan Toledo (Tel Aviv University), and Samuel Williams (LBNL)
Thanks for your attention
Supporting slides
• How do dense and sparse GEMM compare?
  – Dense: lower bounds match algorithms; allows extensive data reuse.
  – Sparse: significant gap between lower bounds and algorithms; inherently poor reuse?
Communication Lower Bound
Lower bound for Erdős-Rényi(n,d) [Ballard et al., SPAA 2013]:
(Under some technical assumptions)

Ω( min{ dn/√P, d²n/P } )
• Few algorithms achieve this bound
• 1D algorithms do not scale well at high concurrency
• 2D Sparse SUMMA (Scalable Universal Matrix Multiplication Algorithm) was the previous state of the art, but it still becomes communication bound on several thousand processors [Buluc & Gilbert, 2013]
• 3D algorithms avoid communication
What dominates the runtime of the 3D algorithm?

(Figure: runtime breakdown for squaring nlpkkt160 on Edison; time (sec) vs. number of cores (384-31104); components: Broadcast, AlltoAll, Local Multiply, Merge Layer, Merge Fiber, Other.)

Broadcast dominates at all concurrencies.
How do 3D algorithms gain performance?

(Figure: the same runtime breakdown on 8,192 cores of Titan, multiplying two scale-26 G500 matrices, contrasting a 90 x 90 grid with 1 thread against an 8x8x16 grid with 8 threads.)
3D SpGEMM performance (AMG coarsening)
• Galerkin triple product in Algebraic Multigrid (AMG)
• A_coarse = Rᵀ A_fine R, where R is the restriction matrix
• R is constructed via a distance-2 maximal independent set (MIS)
Only showing the first product (RᵀA).

(Figure: RᵀA with NaluR3 on Edison; time (sec) vs. number of cores (512-32768) for 2D (t = 1, 6) and 3D (c = 8, t = 1, 6; c = 16, t = 6).)

3D is 16x faster than 2D, but the scalability of 3D in computing RᵀA is limited.
NaluR3: nnz(A) = 474 million, nnz(A²) = 2.1 billion, nnz(RᵀA) = 77 million.
Not enough work at high concurrency.
Communication Reduction in Masked SpGEMM
• Communication volume per processor for receiving L
  – Assume Erdős-Rényi(n,d) graphs and d < p
  – Number of rows of L needed by a processor: nd/p
  – Number of columns of L needed by a processor: nd/p
  – Data received per processor: O(nd³/p²)
• If no output masking is used ("improved 1D"): O(nd²/p) [Ballard, Buluc, et al. SPAA 2013]
• Reduction by a factor of p/d over "improved 1D" (good for large p); a quick check of the arithmetic follows below.
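One way to see the O(nd³/p²) figure and the p/d factor (assuming the nonzeros are spread uniformly, as in the Erdős-Rényi model above): a processor needs nd/p rows of L, each with about d nonzeros, and only the fraction d/p of those entries that land in its nd/p needed columns survives, giving (nd/p) · d · (d/p) = O(nd³/p²). Dividing the unmasked volume by the masked one gives (nd²/p) / (nd³/p²) = p/d.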
Experimental Evaluation
Higher nnz(B) / nnz(A.*B) ratio => more potential communication reduction due to masked SpGEMM for triangle counting. The ratio ranges between 1.7 and 500.
Table I. Test problems for evaluating the triangle counting algorithms.
Graph | #Vertices | #Edges | Clustering Coeff. | Description
mouse_gene | 28,967,291 | 28,922,190 | 0.4207 | Mouse gene regulatory network
coPaperDBLP | 540,486 | 30,491,458 | 0.6562 | Citation network in DBLP
soc-LiveJournal1 | 4,847,571 | 85,702,474 | 0.1179 | LiveJournal online social network
wb-edu | 9,845,725 | 92,472,210 | 0.0626 | Web crawl on .edu domain
cage15 | 5,154,859 | 94,044,692 | 0.1209 | DNA electrophoresis, 15 monomers in polymer
europe_osm | 50,912,018 | 108,109,320 | 0.0028 | Europe street network
hollywood-2009 | 1,139,905 | 112,751,422 | 0.3096 | Hollywood movie actor network
Table II. Statistics of the triangle counting algorithm for different input graphs.
Graph | nnz(A) | nnz(B = L·U) | nnz(A.*B) | Triads | Triangles
mouse_gene | 28,922,190 | 138,568,561 | 26,959,674 | 25,808,000,000 | 3,619,100,000
coPaperDBLP | 30,491,458 | 48,778,658 | 28,719,580 | 2,030,500,000 | 444,095,058
soc-LiveJournal1 | 85,702,474 | 480,215,359 | 40,229,140 | 7,269,500,000 | 285,730,264
wb-edu | 92,472,210 | 57,151,007 | 32,441,494 | 12,204,000,000 | 254,718,147
cage15 | 94,044,692 | 405,169,608 | 48,315,646 | 895,671,340 | 36,106,416
europe_osm | 108,109,320 | 57,983,978 | 121,816 | 66,584,124 | 61,710
hollywood-2009 | 112,751,422 | 1,010,100,000 | 104,151,320 | 47,645,000,000 | 4,916,400,000
0"
10"
20"
30"
40"
50"
60"
mouse_gene" coPapersDBLP" soc9LiveJournal1" wb9edu" cage15" europe_osm"
Time"(sec)"
Triangle9basic"
Triangle9masked9bloom"
Triangle9masked9list"
(a)$$$p=64$
0"
2"
4"
6"
8"
10"
12"
14"
16"
mouse_gene" coPapersDBLP" soc9LiveJournal1" wb9edu" cage15" europe_osm"
Time"(sec)"
Triangle9basic"
Triangle9masked9bloom"
Triangle9masked9list"
(b)$$$p=256$
Fig. 4. Runtimes of three triangle counting algorithms for six input graphs on (a) 64 and (b) 256 cores of Edison.
… algorithm, a processor sends the indices of the necessary rows in a Bloom filter. By contrast, a processor sends an array of necessary row indices in the latter algorithm.
Fig. 4 shows runtimes of three triangle counting algorithms for six input graphs on (a) 64 and (b) 256 cores of Edison. In both subfigures, Triangle-masked-list is the slowest algorithm for all problem instances. On average, over all problem instances, Triangle-masked-bloom runs 2.3x (stdev=0.86) faster on 64 cores and 3.4x (stdev=1.9) faster on 256 cores than Triangle-masked-list. This is due to the fact that sending row indices by Bloom filters saves communication, and SpRef runs faster with the Bloom filter. Since Triangle-masked-list is always slower than Triangle-masked-bloom, we will not discuss the former algorithm in the rest of the paper.
For all graphs except wb-edu, Triangle-basic runs faster than the Triangle-masked-bloom algorithm. On average, over all problem instances, Triangle-basic runs 1.6x (stdev=0.7) faster on 64 cores and 1.5x (stdev=0.6) faster on 256 cores than Triangle-masked-bloom. We noticed that the performance difference between these two algorithms is reduced as we …
Higher Triads / Triangles ratio => more potential communication reduction due to masked SpGEMM for triangle enumeration. The ratio ranges between 5 and 1000.
Serial Complexity
Model: Erdős-Rényi(n,d) graphs, a.k.a. G(n, p = d/n).
Observation: The multiplication B = L·U costs O(d²n) operations, but the final matrix C has at most 2dn nonzeros.
Goal: Can we design a parallel algorithm that communicates less data by exploiting this observation?
(Figure: in the example, L × U = B (wedges), A ∧ B = C (closed wedges), and sum(C)/2 = 2 triangles.)
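A quick empirical check of this observation on a random G(n, p = d/n) graph (the sizes below are illustrative): count the scalar multiplications performed by B = L·U and compare with the nonzeros that survive the mask A .* B.

```python
import numpy as np
import scipy.sparse as sp

n, d = 20000, 8                                   # illustrative sizes
R = sp.random(n, n, density=d / n, random_state=1, format="csr")
A = sp.triu((R > 0).astype(np.float64), k=1)
A = (A + A.T).tocsr()                             # symmetric, empty diagonal

L, U = sp.tril(A, -1, format="csr"), sp.triu(A, 1, format="csr")
# Scalar multiplies in L @ U: for each nonzero L(i, j), one multiply per
# nonzero in row j of U.
mults = int(L.multiply(U.getnnz(axis=1)[None, :]).sum())
kept = A.multiply(L @ U).nnz
print(f"multiplies in L*U: {mults:,} (O(d^2 n));",
      f"nonzeros kept by the mask: {kept:,} (<= nnz(A) = {A.nnz:,} = O(dn))")
```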
Where do algorithms spend time?
(Figure: fraction of runtime (0-100%) for (a) Triangle-basic and (b) Triangle-masked-bloom on mouse_gene, coPapersDBLP, soc-LiveJournal1, wb-edu, cage15, europe_osm, and hollywood, broken down into Comp-Indices, Comm-Indices, SpRef, Comm-L, and SpGEMM.)
Triangle-basic: about 80% of the time is spent communicating L and in local multiplication (increased communication).
Triangle-masked-bloom: about 70% of the time is spent communicating index requests and in SpRef (expensive indexing).
On 512 cores of NERSC/Edison.