
Page 1

Parallel Sparse Matrix-Matrix Multiplication and Its Use in Triangle Counting and Enumeration

Ariful Azad Lawrence Berkeley National Laboratory

SIAM ALA 2015, Atlanta

In collaboration with Grey Ballard (Sandia), Aydin Buluc (LBNL), James Demmel (UC Berkeley), John Gilbert (UCSB), Laura Grigori (INRIA), Oded Schwartz (Hebrew University), Sivan Toledo (Tel Aviv University), Samuel Williams (LBNL)

Page 2

Sparse Matrix-Matrix Multiplication (SpGEMM)

[Figure: C = A × B, with sparse A, B, and C]

• A, B, and C are sparse.
• Why sparse matrix-matrix multiplication?

– Algebraic multigrid (AMG)
– Graph clustering
– Betweenness centrality
– Graph contraction
– Subgraph extraction
– Quantum chemistry
– Triangle counting/enumeration

Page 3

Outline of the talk

• 3D SpGEMM [A. Azad et al. Exploiting multiple levels of parallelism in sparse matrix-matrix multiplication. arXiv preprint arXiv:1510.00844]
  – Scalable shared-memory SpGEMM
  – Distributed-memory 3D SpGEMM

• Parallel triangle counting and enumeration using SpGEMM [A. Azad, A. Buluc, J. Gilbert. Parallel Triangle Counting and Enumeration using Matrix Algebra. IPDPS Workshops, 2015]
  – Masked SpGEMM to reduce communication
  – 1D parallel implementation

Page 4

Shared-Memory SpGEMM

• Heap-based column-by-column algorithm [Buluc and Gilbert, 2008]
• Easy to parallelize in shared memory via multithreading
• Memory efficient and scalable (i.e., temporary per-thread storage is asymptotically negligible)

[Figure: the i-th column of the result is formed by merging the columns of A selected by the nonzeros in the i-th column of B]
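To make the column-by-column idea concrete, here is a minimal serial sketch in Python (the function name and dict-of-columns format are illustrative assumptions, not the actual implementation): column j of C is formed by merging the columns of A selected by the nonzeros of B(:, j), using a heap keyed by row index.

```python
import heapq
from collections import defaultdict

def heap_spgemm(A_cols, B_cols):
    """Column-by-column SpGEMM sketch. A_cols and B_cols map a column index
    to a sorted list of (row, value) pairs; returns C in the same format."""
    C_cols = defaultdict(list)
    for j, b_col in B_cols.items():
        heap = []                              # one entry per contributing column of A
        for k, b_kj in b_col:
            a_col = A_cols.get(k, [])
            if a_col:
                heapq.heappush(heap, (a_col[0][0], 0, k, b_kj))
        cur_row, cur_val = None, 0.0
        while heap:
            row, pos, k, scale = heapq.heappop(heap)
            val = A_cols[k][pos][1] * scale
            if row == cur_row:
                cur_val += val                 # accumulate contributions to the same row
            else:
                if cur_row is not None:
                    C_cols[j].append((cur_row, cur_val))
                cur_row, cur_val = row, val
            if pos + 1 < len(A_cols[k]):       # advance within column k of A
                heapq.heappush(heap, (A_cols[k][pos + 1][0], pos + 1, k, scale))
        if cur_row is not None:
            C_cols[j].append((cur_row, cur_val))
    return dict(C_cols)

# Tiny usage example with 2x2 matrices in dict-of-columns form.
A = {0: [(0, 1.0), (1, 2.0)], 1: [(1, 3.0)]}
B = {0: [(0, 1.0)], 1: [(0, 4.0), (1, 1.0)]}
print(heap_spgemm(A, B))   # column 1 of C merges scaled columns 0 and 1 of A
```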

Page 5

Performance of Shared-Memory SpGEMM

[Figure: runtime (sec, log scale) vs. number of threads (1–16) for MKL and HeapSpGEMM on (a) cage12 × cage12 and (b) scale-16 G500 × G500]

Faster than Intel's MKL (mkl_csrmultcsr) when the output is kept sorted by indices.

A. Azad, G. Ballard, A. Buluc, J. Demmel, L. Grigori, O. Schwartz, S. Toledo, S. Williams. Exploiting multiple levels of parallelism in sparse matrix-matrix multiplication. arXiv preprint arXiv:1510.00844.

Page 6

Variants of Distributed-Memory SpGEMM

Matrix multiplication: for all (i, j) ∈ [n] × [n], C(i, j) = Σ_k A(i, k) B(k, j)

The computation (discrete) cube:
• A face for each (input/output) matrix A, B, C
• A grid point for each multiplication
• Algorithms are categorized by the way work is partitioned:
  – 1D algorithms: communicate A or B
  – 2D algorithms: communicate A and B
  – 3D algorithms: communicate A, B, and C

Sparsity-independent algorithms: the assignment of grid points to processors is independent of the sparsity structure. [Ballard et al., SPAA 2013]

Page 7

2D Algorithm: Sparse SUMMA

[Figure: C = A × B blocked on a √p × √p processor grid; a 100K × 100K matrix is partitioned into 25K × 25K blocks C_ij]

C_ij += LocalSpGEMM(A_recv, B_recv), where A_recv is received via a broadcast along the processor row and B_recv via a broadcast along the processor column of the √p × √p grid.

• 2D Sparse SUMMA (Scalable Universal Matrix Multiplication Algorithm) was the previous state of the art. It becomes communication bound at high concurrency [Buluc & Gilbert, 2012].
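A serial simulation of the Sparse SUMMA schedule may help fix ideas; this is a sketch under the assumption of a square √p × √p block grid, and sparse_summa_2d is a made-up helper name, not the actual MPI implementation. At stage k, block A[i][k] would be broadcast along processor row i and B[k][j] along processor column j.

```python
import scipy.sparse as sp

def sparse_summa_2d(A, B, grid_dim):
    """Serially simulate 2D Sparse SUMMA on a grid_dim x grid_dim block grid.
    A and B are scipy.sparse CSR matrices whose dimension is divisible by grid_dim."""
    n = A.shape[0]
    bs = n // grid_dim                        # block size
    blk = lambda M, i, j: M[i * bs:(i + 1) * bs, j * bs:(j + 1) * bs]
    C = [[sp.csr_matrix((bs, bs)) for _ in range(grid_dim)] for _ in range(grid_dim)]
    for k in range(grid_dim):                 # SUMMA stages
        for i in range(grid_dim):
            for j in range(grid_dim):
                # A[i][k] would be broadcast along processor row i,
                # B[k][j] along processor column j; here we simply index the blocks.
                C[i][j] = C[i][j] + blk(A, i, k) @ blk(B, k, j)
    return sp.bmat(C, format="csr")

# Usage: square a small random sparse matrix and check against direct SpGEMM.
A = sp.random(8, 8, density=0.3, format="csr", random_state=0)
C = sparse_summa_2d(A, A, grid_dim=4)
assert abs(C - A @ A).sum() < 1e-12
```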

Page 8

3D Parallel SpGEMM in a Nutshell

[Figure: A, B, the intermediate C, and the final C split into c = 3 layers (A_::1, A_::2, A_::3); each layer computes an intermediate product, which is then combined into the final C via an AlltoAll along the third processor-grid dimension]

C_int(i, j, k) = Σ_{l = 1 .. √(p/c)} A(i, l, k) B(l, j, k)

• Processor grid: √(p/c) × √(p/c) × c (the third dimension has c = 3 layers in the figure)
• Input storage does not increase

A. Azad, G. Ballard, A. Buluc, J. Demmel, L. Grigori, O. Schwartz, S. Toledo, S. Williams. Exploiting multiple levels of parallelism in sparse matrix-matrix multiplication. arXiv preprint arXiv:1510.00844.

Page 9

3D Parallel SpGEMM in a Nutshell

[Figure: the same layered multiplication, annotated with its communication and computation steps]

Communication:
• Broadcast (along processor row/column)
• AlltoAll (within a layer / along the fiber)

Computation (multithreaded):
• Local multiply
• Multiway merge (layer)
• Multiway merge (fiber)
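The layered computation can likewise be simulated serially; the sketch below is illustrative only (the splitting helper and the even layer boundaries are assumptions). The inner dimension is split into c pieces, each layer forms an intermediate product, and the AlltoAll-plus-merge step corresponds to summing the intermediates.

```python
import scipy.sparse as sp

def spgemm_3d_layers(A, B, c):
    """Serial sketch of layered (3D) SpGEMM: split the inner dimension into c layers,
    multiply each layer independently (the per-layer SUMMA), then merge the
    intermediate results (the AlltoAll + multiway-merge step)."""
    n = A.shape[1]
    cuts = [round(l * n / c) for l in range(c + 1)]
    intermediates = []
    for l in range(c):                        # one intermediate product per layer
        A_layer = A[:, cuts[l]:cuts[l + 1]]   # block of columns of A
        B_layer = B[cuts[l]:cuts[l + 1], :]   # matching block of rows of B
        intermediates.append(A_layer @ B_layer)
    C = intermediates[0]                      # final C: merge of per-layer intermediates
    for M in intermediates[1:]:
        C = C + M
    return C

A = sp.random(12, 12, density=0.25, format="csr", random_state=1)
C = spgemm_3d_layers(A, A, c=3)
assert abs(C - A @ A).sum() < 1e-12
```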

Page 10

Communication Cost

[Figure: the layered multiplication on a √(p/c) × √(p/c) × c processor grid, with c layers and t threads per process]

Broadcast (along processor row/column) is the most expensive communication step:
• #broadcasts targeted to each process (synchronization points / SUMMA stages): √(p/c)
• #processes participating in each broadcast (communicator size): √(p/c)
• Total data received per process over all broadcasts: nnz/√(pc)

Threading decreases network-card contention.

Page 11

How do 3D algorithms gain performance?

[Figure: runtime breakdown (Broadcast, AlltoAll, Local Multiply, Merge Layer, Merge Fiber, Other) for c = 1, 2, 4, 8, 16 layers and t = 1, 2, 4, 8 threads]

On 8,192 cores of Titan, multiplying two scale-26 G500 matrices.

For a fixed number of layers (c), increasing the number of threads (t) reduces runtime.

Page 12

How do 3D algorithms gain performance?

[Figure: runtime breakdown (Broadcast, AlltoAll, Local Multiply, Merge Layer, Merge Fiber, Other) for c = 1, 2, 4, 8, 16 layers and t = 1, 2, 4, 8 threads]

On 8,192 cores of Titan, multiplying two scale-26 G500 matrices.

For a fixed number of threads (t), increasing the number of layers (c) reduces runtime.

Page 13

3D SpGEMM performance (matrix squaring)

[Figure: runtime (sec) vs. number of cores (64–16,384) for nlpkkt160 × nlpkkt160 on Edison; curves for 2D (t = 1, 3, 6) and 3D (c = 4, t = 1, 3; c = 8, t = 1, 6; c = 16, t = 6); 2D improves as #threads increases, 3D improves as #layers and #threads increase]

• 2D (non-threaded) is the previous state of the art.
• 3D (threaded), first presented here, beats it by 8x at large concurrencies.
• Squaring nlpkkt160 on Edison: 1.2 billion nonzeros in A².

Page 14

3D SpGEMM performance (matrix squaring)

[Figure: runtime (sec) vs. number of cores (512–32,768) for squaring ldoor, delaunay_n24, cage15, dielFilterV3real, mouse_gene, HV15R, and it-2004 with 3D SpGEMM (c = 16, t = 6) on Edison; annotated speedups of 18x and 14x]

it-2004 (web crawl): nnz(A) = 1.1 billion, nnz(A²) = 14 billion.

Page 15

3D SpGEMM performance (AMG coarsening)

• Galerkin triple product in Algebraic Multigrid (AMG)
• A_coarse = R^T A_fine R, where R is the restriction matrix
• R constructed via a distance-2 maximal independent set (MIS)

Only showing the first product (R^T A).

[Figure: runtime (sec) vs. number of cores (512–32,768) for R^T A with NaluR3 on Edison; curves for 2D (t = 1, 6) and 3D (c = 8, t = 1, 6; c = 16, t = 6)]

3D is 16x faster than 2D.
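The triple product itself is just two SpGEMM calls. Below is a minimal scipy sketch; the aggregation-style R is a made-up toy restriction operator, not the distance-2 MIS construction used in the talk.

```python
import numpy as np
import scipy.sparse as sp

def galerkin_coarsen(A_fine, R):
    """A_coarse = R^T * A_fine * R, computed as two SpGEMM calls.
    The first product (R^T A) is the one profiled in the talk."""
    RtA = R.T @ A_fine        # first product (R^T A)
    return RtA @ R            # second product

# Toy example: 1D Poisson matrix and a simple aggregation-style R
# (every two fine points map to one coarse point) -- illustrative only.
n = 8
A_fine = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format="csr")
rows = np.arange(n)
cols = rows // 2
R = sp.csr_matrix((np.ones(n), (rows, cols)), shape=(n, n // 2))
A_coarse = galerkin_coarsen(A_fine, R)
print(A_coarse.toarray())
```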

Page 16

3D SpGEMM performance (AMG coarsening)

[Figure: runtime (sec) vs. number of cores (4–4,096) for the AR computation with nlpkkt160 on Edison; curves for EpetraExt, 2D (t = 1), and 3D (c = 8, t = 6); 3D is about 2x faster]

Comparing performance with the EpetraExt package of Trilinos.

Notes:
• EpetraExt runs up to 3x faster when computing AR; 3D is less sensitive.
• The 1D decomposition used by EpetraExt performs better on matrices with good separators.

Page 17

Application of SpGEMM: Triangle Counting/Enumeration

[A. Azad, A. Buluc, J. Gilbert. Parallel Triangle Counting and Enumeration using Matrix Algebra. IPDPS Workshops, 2015]

Page 18

Counting Triangles

[Figure: example graph on vertices 1–6 with adjacency matrix A; it contains 2 triangles and 13 wedges]

Clustering coefficient:

•  Pr (wedge i-j-k makes a triangle with edge i-k)

•  3 * # triangles / # wedges

•  3 * 2 / 13 = 0.46 in example

•  may want to compute for each vertex j

Wedge: a path of length two. A triangle has three wedges.

[Figure: example wedges among vertices 3, 4, and 5]
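These quantities can also be read off the adjacency matrix: wedges are Σ_v deg(v)(deg(v)−1)/2, triangles are trace(A³)/6, and the global coefficient is 3·#triangles/#wedges. A small sketch on an assumed example graph (not necessarily the one in the figure):

```python
import numpy as np

def global_clustering_coefficient(A):
    """A: dense symmetric 0/1 adjacency matrix with no self loops.
    Returns (#wedges, #triangles, 3 * #triangles / #wedges)."""
    deg = A.sum(axis=1)
    wedges = int((deg * (deg - 1) // 2).sum())      # paths of length two
    triangles = int(np.trace(A @ A @ A)) // 6       # each triangle counted 6 times in the trace
    return wedges, triangles, 3 * triangles / wedges

# Example graph on 6 vertices (assumed, for illustration).
edges = [(0, 1), (1, 2), (2, 3), (2, 4), (3, 4), (4, 5), (3, 5)]
A = np.zeros((6, 6), dtype=int)
for u, v in edges:
    A[u, v] = A[v, u] = 1
print(global_clustering_coefficient(A))
```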

Page 19

Triangle Counting in Matrix Algebra

[Figure: example graph on vertices 1–6 and its adjacency matrix A]

Step 1: Count wedges between a pair of vertices (u, v) by finding paths of length 2. Avoid double counting by only counting wedges whose middle vertex has the lowest index.

A = L + U (split the adjacency matrix into its strictly lower part L, edges from higher to lower index, and strictly upper part U, edges from lower to higher index)
L × U = B (B(u, v) counts the wedges between u and v)

Example: 2 wedges between vertices 5 and 6.

Page 20

Triangle Counting in Matrix Algebra

[Figure: example graph on vertices 1–6]

Step 2: Count a wedge as a triangle if there is an edge between the selected pair of vertices.

L × U = B (wedges)
A ∧ B = C (closed wedges, i.e., B masked element-wise by A)
#triangles = sum(C)/2

Goal: Can we design a parallel algorithm that communicates less data by exploiting this observation?
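Before turning to that parallel question, the two steps can be combined into a minimal serial sketch with scipy; the variable names and the example graph below are mine, not the paper's code.

```python
import scipy.sparse as sp

def count_triangles(A):
    """Triangle counting via masked SpGEMM: split A into strictly lower (L)
    and strictly upper (U) parts, B = L*U counts wedges, A .* B keeps only
    closed wedges, and each triangle ends up counted twice."""
    L = sp.tril(A, k=-1, format="csr")
    U = sp.triu(A, k=1, format="csr")
    B = L @ U                      # B(u, v) = number of wedges between u and v
    C = A.multiply(B)              # elementwise mask: keep wedges closed by an edge
    return int(C.sum()) // 2

# 6-vertex example with two triangles sharing an edge (assumed graph, for illustration).
edges = [(0, 1), (1, 2), (2, 3), (2, 4), (3, 4), (4, 5), (3, 5)]
n = 6
A = sp.lil_matrix((n, n))
for u, v in edges:
    A[u, v] = A[v, u] = 1
A = A.tocsr()
print(count_triangles(A))   # -> 2 (triangles 2-3-4 and 3-4-5)
```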

Page 21

Masked SpGEMM (1D algorithm)

In this example, the nonzeros of L(5,:) (marked in red in the figure) are not needed by P1 because A1(5,:), the corresponding row of P1's block of A, contains no nonzeros. So we can mask rows of L by A to avoid communication.

Special SpGEMM: The nonzero structure of the product is contained in the original adjacency matrix A

[Figure: A ∧ (L × U) with A, L, and U distributed across processors P1–P4; vertices 1–7 label the rows/columns]
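A serial sketch of this masking decision, assuming the 1D column-block distribution of the mask A suggested by the figure (the helper name and partition are made up): each processor requests only the rows of L that have a matching nonzero row in its local block of A.

```python
import scipy.sparse as sp

def rows_of_L_needed(A, col_ranges):
    """For a 1D column distribution of the mask A (P1..P4 each own a block of
    columns), return, per processor, the rows of L it must fetch: only rows
    with at least one nonzero in its local block of A."""
    needed = []
    A = A.tocsc()
    for lo, hi in col_ranges:
        local = A[:, lo:hi].tocoo()
        needed.append(sorted(set(local.row.tolist())))
    return needed

# 7-vertex example split over 4 "processors" (assumed partition, for illustration).
n = 7
A = sp.random(n, n, density=0.3, format="csr", random_state=3)
A = ((A + A.T) > 0).astype(int)          # symmetrize to look like an adjacency matrix
ranges = [(0, 2), (2, 4), (4, 6), (6, 7)]
for p, rows in enumerate(rows_of_L_needed(A, ranges), start=1):
    print(f"P{p} requests rows of L: {rows}")
```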

Page 22

Benefits and Overheads of Masked SpGEMM

• Benefit
  – Reduces communication by a factor of p/d, where d is the average degree (good for large p).
• Overheads
  – Increased index traffic for requesting subsets of rows/columns.
    • Partial remedy: compress index requests using Bloom filters (see the sketch after this list).
  – Indexing and packaging the requested rows (SpRef).
    • Remedy (ongoing work): data structures/algorithms for faster SpRef.
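To illustrate the Bloom-filter remedy above, here is a tiny, non-optimized Bloom filter over row indices; the sizes and hash construction are illustrative and not those of the actual implementation. The requesting process sends only the bit array instead of an explicit index list; false positives merely cause a few extra rows to be sent, so correctness is preserved.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter over integer row indices."""
    def __init__(self, m_bits=1024, k_hashes=3):
        self.m, self.k = m_bits, k_hashes
        self.bits = bytearray(m_bits // 8)

    def _positions(self, item):
        # k hash positions derived from SHA-256 of a salted key (illustrative choice)
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "little") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

# A process requests rows {3, 17, 42}; the owning process scans its own rows.
req = BloomFilter()
for r in (3, 17, 42):
    req.add(r)
rows_to_send = [r for r in range(64) if r in req]   # a superset of {3, 17, 42}
print(rows_to_send)
```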

Page 23

Where do algorithms spend time with increased concurrency?

[Figure: fraction of time (0–100%) spent in Comp-Indices, Comm-Indices, SpRef, Comm-L, and SpGEMM for (a) Triangle-basic and (b) Triangle-masked-bloom, from 1 to 512 cores]

coPapersDBLP on NERSC/Edison.

• As expected, masked SpGEMM reduces the cost of communicating L.
• Expensive indexing undermines the advantage: more computation in exchange for less communication. [Easy to fix]

Page 24

Strong Scaling of Triangle Counting

[Figure: speedup vs. number of cores (up to 512) for (a) Triangle-basic and (b) Triangle-masked-bloom on coPapersDBLP, mouse_gene, cage15, and europe_osm]

• Decent scaling to a modest number of cores (p = 512 in this case).
• Further scaling requires 2D/3D SpGEMM (ongoing/future work).

Page 25

What about triangle enumeration?

Data access patterns stay intact; only the scalar operations change.

[Figure: example graph on vertices 1–6; the entries of B now hold sets of middle vertices, e.g., {5}, {3}, {3, 4}]

B(i,k) now captures the set of wedges (indexed solely by their middle vertex because i and k are already known implicitly), as opposed to just the count of wedges

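This change of scalar can be sketched with a set-valued "semiring": multiplication produces the singleton set containing the middle vertex, and addition is set union. Below is a serial, dictionary-based sketch on an assumed example graph; it is not the paper's implementation.

```python
from collections import defaultdict

def enumerate_triangles(adj):
    """adj: dict vertex -> set of neighbors (undirected graph).
    Build B(u, v) = set of middle vertices w with w < u and w < v,
    then mask by the edges of the graph to list the triangles."""
    B = defaultdict(set)
    for w in adj:                          # w plays the role of the middle vertex
        higher = [u for u in adj[w] if u > w]
        for u in higher:
            for v in higher:
                if u != v:
                    B[(u, v)].add(w)       # "multiply": record the wedge's middle vertex
    triangles = set()
    for (u, v), middles in B.items():      # "mask": keep wedges closed by edge (u, v)
        if v in adj[u]:
            for w in middles:
                triangles.add(tuple(sorted((u, v, w))))
    return sorted(triangles)

# Assumed example graph (edges 1-2, 2-3, 2-4, 3-4, 3-5, 4-5, 5-6).
adj = {1: {2}, 2: {1, 3, 4}, 3: {2, 4, 5}, 4: {2, 3, 5}, 5: {3, 4, 6}, 6: {5}}
print(enumerate_triangles(adj))   # -> [(2, 3, 4), (3, 4, 5)]
```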

Page 26

Future Work

• Develop data structures/algorithms for faster SpRef.
• Reduce the cost of communicating indices by a factor of t (the number of threads) using in-node multithreading.
• Use 3D decomposition/algorithms: the requested index vectors would be shorter.
• Implement triangle enumeration with good serialization support; larger performance gains are expected from masked SpGEMM.

Page 27

Acknowledgements

Work is funded by [funding agency logos].

Grey Ballard (Sandia), Aydin Buluc (LBNL), James Demmel (UC Berkeley), John Gilbert (UCSB), Laura Grigori (INRIA), Oded Schwartz (Hebrew University), Sivan Toledo (Tel Aviv University), Samuel Williams (LBNL)

Page 28

Thanks for your attention. Questions?

Page 29

Supporting slides

Page 30

• How do dense and sparse GEMM compare?
  – Dense: lower bounds match algorithms; allows extensive data reuse.
  – Sparse: significant gap between lower bounds and algorithms; inherently poor reuse?

Page 31

Communication Lower Bound

Lower bound for Erdős-Rényi(n,d) [Ballard et al., SPAA 2013]:

(Under some technical assumptions) the expected communication cost is Ω( min{ dn/√P, d²n/P } ) words per processor.

• Few algorithms achieve this bound.
• 1D algorithms do not scale well at high concurrency.
• 2D Sparse SUMMA (Scalable Universal Matrix Multiplication Algorithm) was the previous state of the art, but it still becomes communication bound on several thousands of processors. [Buluc & Gilbert, 2013]
• 3D algorithms avoid communication.

Page 32

What dominates the runtime of the 3D algorithm?

[Figure: runtime breakdown (Broadcast, AlltoAll, Local Multiply, Merge Layer, Merge Fiber, Other) vs. number of cores (384–31,104) when squaring nlpkkt160 on Edison]

Broadcast dominates at all concurrencies.

Page 33

How do 3D algorithms gain performance?

[Figure: runtime breakdown (Broadcast, AlltoAll, Local Multiply, Merge Layer, Merge Fiber, Other) for c = 1, 2, 4, 8, 16 layers and t = 1, 2, 4, 8 threads, ranging from a 90 × 90 grid with 1 thread to an 8 × 8 × 16 grid with 8 threads]

On 8,192 cores of Titan, multiplying two scale-26 G500 matrices.

Page 34

3D SpGEMM performance (AMG coarsening)

• Galerkin triple product in Algebraic Multigrid (AMG)
• A_coarse = R^T A_fine R, where R is the restriction matrix
• R constructed via a distance-2 maximal independent set (MIS)

Only showing the first product (R^T A).

[Figure: runtime (sec) vs. number of cores (512–32,768) for R^T A with NaluR3 on Edison; curves for 2D (t = 1, 6) and 3D (c = 8, t = 1, 6; c = 16, t = 6)]

• 3D is 16x faster than 2D.
• Limited scalability of 3D in computing R^T A: not enough work at high concurrency.

NaluR3: nnz(A) = 474 million, nnz(A²) = 2.1 billion, nnz(R^T A) = 77 million.

Page 35

Communication Reduction in Masked SpGEMM

• Communication volume per processor for receiving L (assume Erdős-Rényi(n, d) graphs and d < p):
  – Number of rows of L needed by a processor: nd/p
  – Number of columns of L needed by a processor: nd/p
  – Data received per processor: O(nd³/p²)
• If no output masking is used ("improved 1D"): O(nd²/p) [Ballard, Buluc, et al., SPAA 2013]
• Masking gives a reduction by a factor of p/d over "improved 1D" (good for large p).
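A back-of-the-envelope comparison of the two volumes, with illustrative parameter values that are assumed here rather than taken from the talk:

```python
# Per-processor communication volume for receiving L:
#   improved 1D (no output masking): O(n * d^2 / p) words
#   masked SpGEMM:                   O(n * d^3 / p^2) words
# The ratio is p / d, so masking helps most when p >> d.
n, d, p = 10**7, 16, 4096          # illustrative values only

improved_1d = n * d**2 / p
masked = n * d**3 / p**2
print(f"improved 1D : {improved_1d:,.0f} words per processor")
print(f"masked      : {masked:,.0f} words per processor")
print(f"reduction   : {improved_1d / masked:.0f}x  (= p/d = {p / d:.0f})")
```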

Page 36

Experimental Evaluation

Higher nnz(B) / nnz(A .* B) ratio => more potential communication reduction due to masked SpGEMM for triangle counting. Ranges between 1.7 and 500.

TABLE I: Test problems for evaluating the triangle counting algorithms.

Graph            | #Vertices  | #Edges      | Clustering Coeff. | Description
mouse_gene       | 28,967,291 | 28,922,190  | 0.4207            | Mouse gene regulatory network
coPaperDBLP      | 540,486    | 30,491,458  | 0.6562            | Citation networks in DBLP
soc-LiveJournal1 | 4,847,571  | 85,702,474  | 0.1179            | LiveJournal online social network
wb-edu           | 9,845,725  | 92,472,210  | 0.0626            | Web crawl on .edu domain
cage15           | 5,154,859  | 94,044,692  | 0.1209            | DNA electrophoresis, 15 monomers in polymer
europe_osm       | 50,912,018 | 108,109,320 | 0.0028            | Europe street networks
hollywood-2009   | 1,139,905  | 112,751,422 | 0.3096            | Hollywood movie actor network

TABLE II: Statistics of the triangle counting algorithm for different input graphs.

Graph            | nnz(A)      | nnz(B = L·U)  | nnz(A .* B) | Triads         | Triangles
mouse_gene       | 28,922,190  | 138,568,561   | 26,959,674  | 25,808,000,000 | 3,619,100,000
coPaperDBLP      | 30,491,458  | 48,778,658    | 28,719,580  | 2,030,500,000  | 444,095,058
soc-LiveJournal1 | 85,702,474  | 480,215,359   | 40,229,140  | 7,269,500,000  | 285,730,264
wb-edu           | 92,472,210  | 57,151,007    | 32,441,494  | 12,204,000,000 | 254,718,147
cage15           | 94,044,692  | 405,169,608   | 48,315,646  | 895,671,340    | 36,106,416
europe_osm       | 108,109,320 | 57,983,978    | 121,816     | 66,584,124     | 61,710
hollywood-2009   | 112,751,422 | 1,010,100,000 | 104,151,320 | 47,645,000,000 | 4,916,400,000

[Fig. 4: Runtimes of the three triangle counting algorithms (Triangle-basic, Triangle-masked-bloom, Triangle-masked-list) for six input graphs on (a) 64 and (b) 256 cores of Edison]

... algorithm, a processor sends the indices of necessary rows in a Bloom filter. By contrast, a processor sends an array of necessary row indices in the latter algorithm.

Fig. 4 shows runtimes of three triangle counting algorithms for six input graphs on (a) 64 and (b) 256 cores of Edison. In both subfigures, Triangle-masked-list is the slowest algorithm for all problem instances. On average, over all problem instances, Triangle-masked-bloom runs 2.3x (stdev = 0.86) faster on 64 cores and 3.4x (stdev = 1.9) faster on 256 cores than Triangle-masked-list. This is due to the fact that sending row indices by Bloom filters saves communication, and SpRef runs faster with the Bloom filter. Since Triangle-masked-list is always slower than Triangle-masked-bloom, we will not discuss the former algorithm in the rest of the paper.

For all graphs except wb-edu, Triangle-basic runs faster than the Triangle-masked-bloom algorithm. On average, over all problem instances, Triangle-basic runs 1.6x (stdev = 0.7) faster on 64 cores and 1.5x (stdev = 0.6) faster on 256 cores than Triangle-masked-bloom. We noticed that the performance difference between these two algorithms is reduced as we ...

Higher Triads / Triangles ratio => more potential communication reduction due to masked SpGEMM for triangle enumeration. Ranges between 5 and 1000.

Page 37

Serial Complexity

Model: Erdős-Rényi(n, d) graphs, a.k.a. G(n, p = d/n).

Observation: The multiplication B = L·U costs O(d²n) operations. However, the final matrix C has at most 2dn nonzeros.

Goal: Can we design a parallel algorithm that communicates less data by exploiting this observation?

[Figure: small example]

L × U = B (wedges)
A ∧ B = C (closed wedges)
sum(C)/2 = 2 triangles

Page 38

Where do algorithms spend time?

[Figure: fraction of time (0–100%) spent in Comp-Indices, Comm-Indices, SpRef, Comm-L, and SpGEMM for (a) Triangle-basic and (b) Triangle-masked-bloom on mouse_gene, coPapersDBLP, soc-LiveJournal1, wb-edu, cage15, europe_osm, and hollywood]

On 512 cores of NERSC/Edison.

• Triangle-basic: about 80% of the time is spent communicating L and in local multiplication (increased communication).
• Triangle-masked-bloom: about 70% of the time is spent communicating index requests and in SpRef (expensive indexing).