
Page 1

Reducing Communication in Parallel Graph Computations

Aydın Buluç, Berkeley Lab. August 8, 2013. ASCR Applied Math PI Meeting, Albuquerque, NM

Page 2

Acknowledgments (past & present)

•  Krste Asanovic (UC Berkeley)
•  Grey Ballard (UC Berkeley)
•  Scott Beamer (UC Berkeley)
•  Jim Demmel (UC Berkeley)
•  Erika Duriakova (UC Dublin)
•  Armando Fox (UC Berkeley)
•  John Gilbert (UCSB)
•  Laura Grigori (INRIA)
•  Shoaib Kamil (MIT)
•  Ben Lipshitz (UC Berkeley)
•  Adam Lugowski (UCSB)
•  Lenny Oliker (Berkeley Lab)
•  Dave Patterson (UC Berkeley)
•  Steve Reinhardt (YarcData)
•  Oded Schwartz (UC Berkeley)
•  Edgar Solomonik (UC Berkeley)
•  Sivan Toledo (Tel Aviv Univ)
•  Sam Williams (Berkeley Lab)

This work is funded by: [funder logos]

Page 3

Informal Outline

•  Large-scale graphs
•  Communication costs of parallel graph computations
•  Communication takes energy
•  Linear algebraic primitives enable scalable graph computation
•  Semiring abstraction glues graphs to linear algebra
•  Direction-optimizing BFS does >6X less communication
•  New communication-optimal sparse matrix-matrix multiply
•  New communication-optimal dense all-pairs shortest-paths

Page 4

Graph abstractions in Computer Science

Compiler optimization: Control flow graphs: graph dominators. Register allocation: graph coloring.

Scientific computing: Preconditioning: support graphs, spanning trees. Sparse direct solvers: chordal graphs. Parallel computing: graph separators.

[Figure: the filled graph G+(A) on vertices 1-10, which is chordal]

Computer networks: Routing: shortest path algorithms. Web crawling: graph traversal. Interconnect design: Cayley graphs.

Page 5

Large graphs are everywhere

WWW snapshot (courtesy Y. Hyun). Yeast protein interaction network (courtesy H. Jeong). Internet structure. Social interactions.

Scientific datasets: biological, chemical, cosmological, ecological, …

Page 6

Two kinds of costs: Arithmetic (FLOPs) and Communication (moving data).

Running time = γ ⋅ #FLOPs + β ⋅ #Words + (α ⋅ #Messages)
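To make the model concrete, here is a tiny illustrative calculation; the parameter values below are hypothetical machine rates chosen for illustration, not numbers from the talk.

# Illustrative only: hypothetical values for gamma, beta, alpha (not from the talk).
gamma = 1e-11   # seconds per FLOP
beta  = 1e-9    # seconds per word moved
alpha = 1e-6    # seconds per message

def running_time(flops, words, messages):
    # Running time = gamma*#FLOPs + beta*#Words + alpha*#Messages
    return gamma * flops + beta * words + alpha * messages

# Even when 10x fewer words than FLOPs are involved, the beta (bandwidth)
# term dominates under these rates:
print(running_time(flops=1e9, words=1e8, messages=1e3))   # ~0.111 s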

[Diagram: the distributed-memory model (P processors, each a CPU with local memory M, connected by a network) and the sequential model (one CPU with fast/local memory of size M and slow RAM).]

[Plots: historical annual improvement rates; floating-point performance improves ~59%/year while communication (bandwidth, latency) improves only ~23-26%/year.]

2004: trend transition into multi-core, further increasing relative communication costs.

Develop faster algorithms: minimize communication (to the lower bound if possible).

Model and Motivation

Page 7

Communication-avoiding algorithms: Save time

Communication-avoiding algorithms: Save energy

Image courtesy of John Shalf (LBNL)

Communication → Energy

Page 8

- Often no surface-to-volume ratio to exploit
- Very little data reuse in existing algorithmic formulations
- Already heavily communication bound

Communication crucial for graphs

[Chart: percentage of time spent (0-100%) in computation vs. communication as the number of cores grows from 1 to 4096.]

2D sparse matrix-matrix multiply emulating:
- Graph contraction
- AMG restriction operations

Scale 23 R-MAT (scale-free graph) times an order-4 restriction operator. Cray XT4 (Franklin), NERSC.

Buluç and Gilbert. Parallel sparse matrix-matrix multiplication and indexing: Implementation and experiments. SIAM Journal on Scientific Computing, 2012.

Page 9

Linear-algebraic primitives

•  Sparse matrix-sparse matrix multiplication (Sparse GEMM)
•  Sparse matrix-sparse vector multiplication (SpMV)
•  Element-wise operations
•  Sparse matrix indexing

The Combinatorial BLAS implements these, and more, on arbitrary semirings, e.g. (×, +), (and, or), (+, min).
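To make the semiring abstraction concrete, here is a minimal sketch of a matrix-vector product parameterized by user-supplied add and multiply operations; this is illustrative Python of my own, not the Combinatorial BLAS API, and the data layout (a dict of columns) is chosen only for brevity.

# Minimal sketch of a semiring sparse matrix-vector product (illustrative only;
# this is not the Combinatorial BLAS API).
def semiring_spmv(A, x, add, multiply):
    # A: dict mapping column j -> list of (row i, value a_ij); x: dict of nonzeros.
    y = {}
    for j, xj in x.items():                    # only columns matching nonzeros of x
        for i, aij in A.get(j, []):
            val = multiply(aij, xj)
            y[i] = val if i not in y else add(y[i], val)
    return y

# (min, +) semiring: one relaxation step of single-source shortest paths.
A = {0: [(1, 5.0), (2, 9.0)], 1: [(2, 1.0)]}   # A[j] = edges j -> i with weights
x = {0: 0.0}                                   # distance 0 at the source vertex
print(semiring_spmv(A, x, add=min, multiply=lambda a, b: a + b))   # {1: 5.0, 2: 9.0}

Swapping in a different (add, multiply) pair changes the graph computation without changing the traversal code, which is the point of the semiring abstraction.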

Page 10

The case for sparse matrices

Many irregular applications contain coarse-grained parallelism that can be exploited by abstractions at the proper level.

Traditional graph computations:
- Data driven, unpredictable communication
- Irregular and unstructured, poor locality of reference
- Fine grained data accesses, dominated by latency

Page 11

The case for sparse matrices

Many irregular applications contain coarse-grained parallelism that can be exploited by abstractions at the proper level.

Traditional graph computations → Graphs in the language of linear algebra:
- Data driven, unpredictable communication → Fixed communication patterns
- Irregular and unstructured, poor locality of reference → Operations on matrix blocks exploit memory hierarchy
- Fine grained data accesses, dominated by latency → Coarse grained parallelism, bandwidth limited

Page 12

Breadth-first search in the language of linear algebra

[Figure: a 7-vertex directed example graph and its transposed adjacency matrix AT (rows indexed by "to", columns by "from").]

Page 13

[Figure: one BFS step as the sparse matrix-sparse vector product ATX: the frontier vector X is multiplied by AT, and the nonzeros of the result are the newly discovered vertices, each recording its parent.]

Particular semiring operations: Multiply: select; Add: minimum.

Page 14

[Figure: second BFS step; the product ATX expands the new frontier, and newly reached vertices record parents from the frontier (here 2 and 4).]

Multiply traverses outgoing edges; Add chooses among incoming edges.

Page 15

[Figure: third BFS step; when several frontier vertices reach the same undiscovered vertex, the Add operation picks the minimum.]

Select the vertex with the minimum label as the parent.

Page 16

[Figure: final BFS step; the last remaining vertex is discovered.]

Result: Deterministic breadth-first search.
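Putting the preceding steps together, here is a minimal sequential sketch of level-synchronous BFS with parent selection by minimum label, using scipy.sparse only for the adjacency structure; this is my own illustrative code, not the Combinatorial BLAS or Graph500 implementation.

import numpy as np
import scipy.sparse as sp

def bfs_parents(AT, source):
    # AT: sparse boolean matrix with AT[i, j] != 0 iff there is an edge j -> i.
    AT = AT.tocsc()                      # column j then lists the out-neighbors of j
    parents = {source: source}
    frontier = [source]
    while frontier:
        discovered = {}                  # child -> parent ("Add" = minimum label)
        for j in frontier:               # "Multiply" = traverse outgoing edges of j
            for i in map(int, AT.indices[AT.indptr[j]:AT.indptr[j + 1]]):
                if i not in parents:
                    discovered[i] = min(discovered.get(i, j), j)
        parents.update(discovered)
        frontier = list(discovered)
    return parents

# Tiny example (edges of my own, not the graph on the slide): 0->1, 0->2, 1->3, 2->3.
edges = [(0, 1), (0, 2), (1, 3), (2, 3)]
rows = [t for _, t in edges]             # "to"
cols = [f for f, _ in edges]             # "from"
AT = sp.csr_matrix((np.ones(len(edges)), (rows, cols)), shape=(4, 4))
print(bfs_parents(AT, 0))                # {0: 0, 1: 0, 2: 0, 3: 1}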

Page 17

Performance observations of level-synchronous BFS

[Charts: per-phase breakdown (phases 1-15) of the number of vertices in the frontier array vs. execution time, as a percentage of the total, for a YouTube social network and a synthetic network.]

When the frontier is at its peak, almost all edge examinations "fail" to claim a child.

Page 18

Bottom-up BFS algorithm [Single-node: Beamer et al. 2012]

The classical (top-down) algorithm is optimal in the worst case, but pessimistic for low-diameter graphs (previous slide).

Direction-optimizing approach: switch from top-down to bottom-up search when the majority of the vertices are discovered (a sketch follows below).
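A minimal illustrative sketch of the direction-optimizing idea (my own simplification, not the distributed implementation from the paper): run top-down steps while the frontier is small and switch to bottom-up steps once it is large; the switching threshold alpha below is a placeholder, not the tuned heuristic of Beamer et al.

def direction_optimizing_bfs(out_adj, in_adj, source, alpha=0.05):
    # out_adj[v]/in_adj[v]: lists of out-/in-neighbors of v (illustrative sketch).
    n = len(out_adj)
    parents = {source: source}
    frontier = {source}
    while frontier:
        nxt = set()
        if len(frontier) < alpha * n:
            # Top-down step: scan outgoing edges of the frontier.
            for u in frontier:
                for v in out_adj[u]:
                    if v not in parents:
                        parents[v] = u
                        nxt.add(v)
        else:
            # Bottom-up step: every undiscovered vertex looks for ANY parent in the
            # frontier, then stops; this skips the many "failed" edge examinations
            # of the top-down step when the frontier is at its peak.
            for v in range(n):
                if v not in parents:
                    for u in in_adj[v]:
                        if u in frontier:
                            parents[v] = u
                            nxt.add(v)
                            break
        frontier = nxt
    return parents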

Page 19

Direction-optimizing BFS on distributed memory

•  Kronecker (Graph500): 16 billion vertices and 256 billion edges.
•  Implemented on top of Combinatorial BLAS (i.e., uses 2D decomposition).

Beamer, Buluç, Asanović, and Patterson. "Distributed Memory Breadth-First Search Revisited: Enabling Bottom-Up Search." MTAAP'13, in conjunction with IPDPS.

Page 20

Direction-optimizing BFS reduces communication

k: average vertex degree of the graph

Communication volume reduction based on the analytical model derived in the paper, assuming 4 bottom-up steps (typical for power-law graphs).

Page 21

Direction-optimizing BFS saves energy

For a fixed-size real input, the direction-optimizing algorithm completes in the same time using 1/16th of the processors (and energy).

Page 22

Betweenness Centrality using Sparse GEMM

•  Parallel breadth-first search is implemented with sparse GEMM
•  Work-efficient, highly parallel implementation of Brandes' algorithm

[Figure: the sparse matrix-matrix product AT · B on a 7-vertex example graph.]

Betweenness Centrality is a measure of influence. CB(v): among all the shortest paths, what fraction passes through v?
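For reference, CB(v) can be written with the standard (Freeman/Brandes) formula, where sigma_st is the number of shortest s-t paths and sigma_st(v) is the number of those passing through v:

C_B(v) \;=\; \sum_{s \neq v \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}}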

Page 23

Sparse Matrix-Matrix Multiplication

Why sparse matrix-matrix multiplication? Used for algebraic multigrid, graph clustering, betweenness centrality, graph contraction, subgraph extraction, cycle detection, quantum chemistry, …

How do dense and sparse GEMM compare?
Dense: lower bounds match algorithms; allows extensive data reuse.
Sparse: significant gap between bounds and algorithms; inherent poor reuse?

What do we obtain here?
- Improved (higher) lower bound
- New optimal algorithms
- Only for random matrices

Erdős-Rényi(n,d) graphs aka G(n, p=d/n)
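As a concrete illustration of this input class (my own snippet, not code from the talk), an Erdős-Rényi(n, d) style sparse matrix can be sampled with scipy; note that scipy.sparse.random fixes the total number of nonzeros at density·n² rather than flipping independent coins, which is close enough for experimentation:

import scipy.sparse as sp

n, d = 1 << 12, 8                     # 4096 vertices, average degree ~8
A = sp.random(n, n, density=d / n, format="csr")
print(A.nnz / n)                      # roughly d nonzeros per row on average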

Page 24

The computation cube and sparsity-independent algorithms

Matrix multiplication: ∀ (i,j) ∈ n × n,  C(i,j) = Σk A(i,k) B(k,j)

The computation (discrete) cube:
•  A face for each (input/output) matrix (A, B, C)
•  A grid point for each multiplication

[Figure: the n × n × n cube partitioned by 1D, 2D, and 3D algorithms.]

How about sparse algorithms?

Page 25

The computation cube and sparsity-independent algorithms

Matrix multiplication: ∀ (i,j) ∈ n × n,  C(i,j) = Σk A(i,k) B(k,j)

The computation (discrete) cube:
•  A face for each (input/output) matrix (A, B, C)
•  A grid point for each multiplication

[Figure: the n × n × n cube partitioned by 1D, 2D, and 3D algorithms.]

How about sparse algorithms?

Sparsity-independent algorithms: the assignment of grid points to processors is independent of the sparsity structure.
- In particular: if Cij is nonzero, who holds it?
- All standard algorithms are sparsity independent.

Assumptions:
- Sparsity-independent algorithms
- Input (and output) are sparse
- The algorithm is load balanced
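To illustrate what "assigning grid points to processors independently of sparsity" means, here are hypothetical owner functions (my own sketch, not from the talk) that map each multiplication (i, j, k) of the n × n × n cube to a processor under 1D, 2D, and 3D decompositions; none of them look at the matrices themselves:

# Hypothetical owner maps for the n x n x n multiplication cube. Each returns the
# processor responsible for grid point (i, j, k), i.e. the product A(i,k)*B(k,j).
# None of these depend on the sparsity structure, so they are sparsity independent.
def owner_1d(i, j, k, n, p):
    return i * p // n                              # p slabs along the i (rows of C) axis

def owner_2d(i, j, k, n, p):
    q = int(round(p ** 0.5))                       # assumes p is a perfect square
    return (i * q // n) * q + (j * q // n)         # owner of output block C(i, j)

def owner_3d(i, j, k, n, p):
    q = int(round(p ** (1 / 3)))                   # assumes p is a perfect cube
    return ((i * q // n) * q + (j * q // n)) * q + (k * q // n)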

Page 26

Algorithms attaining lower bounds

Previous sparse classical lower bound [Ballard et al., SIMAX'11]:

W = Ω( #FLOPs / (P √M) ) = Ω( d²n / (P √M) )

✗ No algorithm attains this bound!

New lower bound for Erdős-Rényi(n,d) [this work], expected, under some technical assumptions:

W = Ω( min{ dn / √P , d²n / P } )

No previous algorithm attains these bounds.

✓ Two new algorithms achieve the bounds (up to a logarithmic factor):
i.  Recursive 3D, based on [Ballard et al., SPAA'12]
ii. Iterative 3D, based on [Solomonik & Demmel, EuroPar'11]

Ballard, Buluç, Demmel, Grigori, Lipshitz, Schwartz, and Toledo. Communication optimal parallel multiplication of sparse random matrices. In SPAA 2013.

Page 27

3D Iterative Sparse GEMM [Dense case: Solomonik and Demmel, 2011]

Optimal number of replicas (theoretically): c = Θ( p / d² )

3D algorithms: using extra memory (c replicas) reduces communication volume by a factor of c^(1/2) compared to 2D.
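A small back-of-the-envelope sketch (my own illustration) of the statement above: starting from a 2D-style per-processor volume of about d·n/√p words and dividing by √c for c replicas, the modeled volume at c = Θ(p/d²) drops to about the per-processor output size d²·n/p.

import math

def modeled_words_per_processor(n, d, p, c):
    # Illustrative model only: ~ d*n/sqrt(p) words per processor in 2D,
    # reduced by a factor of sqrt(c) when c replicas are used.
    return d * n / math.sqrt(p * c)

n, d, p = 1 << 22, 8, 4096
for c in (1, 4, 16, p // d**2):          # the last value is the Theta(p/d^2) choice
    print(c, modeled_words_per_processor(n, d, p, c))
# At c = p/d^2 = 64, the modeled volume equals the expected output size d^2*n/p.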

Page 28

Performance results (Erdős-Rényi graphs)

Page 29

Iterative 2D-3D performance results (R-MAT graphs)

[Chart: time in seconds vs. number of grid layers c ∈ {1, 4, 16}, for p=4096 (scale 20), p=16384 (scale 22), and p=65536 (scale 24); time is broken down into Imbalance, Merge, HypersparseGEMM, Comm_Reduce, and Comm_Bcast.]

2X faster; 5X faster if load balanced.

Page 30

•  Input: Directed graph with "costs" on edges
•  Find least-cost paths between all reachable vertex pairs
•  Classical algorithm: Floyd-Warshall
•  It turns out a previously overlooked recursive version is more parallelizable than the triple nested loop

for k = 1:n        // the induction sequence
    for i = 1:n
        for j = 1:n
            if (w(i→k) + w(k→j) < w(i→j))
                w(i→j) := w(i→k) + w(k→j)

[Figure: a 5-vertex example graph illustrating the k = 1 case.]

All-pairs shortest-paths problem

Page 31

[Figure: a 6-vertex weighted directed graph whose vertices are split into two sets V1 and V2, inducing a 2×2 block partition [A B; C D] of the distance matrix.]

Distance matrix (∞ = no edge):

 0   5   9   ∞   ∞   4
 ∞   0   1  −2   ∞   ∞
 3   ∞   0   4   ∞   3
 ∞   ∞   5   0   4   ∞
 ∞   ∞  −1   ∞   0   ∞
−3   ∞   ∞   ∞   7   0

A = A*;                  % recursive call
B = AB;  C = CA;  D = D + CB;
D = D*;                  % recursive call
B = BD;  C = DC;  A = A + BC;

+ is "min", × is "add"
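A minimal sequential sketch of this block recursion over the (min, +) semiring, written in numpy; this is my own illustration of the recursion above (distances only, no parent matrix), not the distributed 2.5D implementation from the paper.

import numpy as np

def minplus(X, Y):
    # (min, +) "matrix multiply": result[i, j] = min_k ( X[i, k] + Y[k, j] )
    return (X[:, :, None] + Y[None, :, :]).min(axis=1)

def apsp(W):
    # W: cost matrix with np.inf for missing edges; returns shortest-path distances.
    n = W.shape[0]
    if n == 1:
        return np.minimum(W, 0.0)                    # closure of a single vertex
    m = n // 2
    A, B = W[:m, :m], W[:m, m:]
    C, D = W[m:, :m], W[m:, m:]
    A = apsp(A)                                      # A = A*   (recursive call)
    B = minplus(A, B); C = minplus(C, A); D = np.minimum(D, minplus(C, B))
    D = apsp(D)                                      # D = D*   (recursive call)
    B = minplus(B, D); C = minplus(D, C); A = np.minimum(A, minplus(B, C))
    return np.block([[A, B], [C, D]])

# Small example of my own (not the graph on the slide):
inf = np.inf
W = np.array([[0, 5, 9, inf], [inf, 0, 1, inf], [inf, inf, 0, 4], [2, inf, inf, 0]], float)
print(apsp(W))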

Page 32

[Figure: the same graph; entry (3,2) of the distance matrix is updated to 8, the cost of the 3-1-2 path, and the parent matrix is updated accordingly.]

Block product (min-plus) shown on the slide: [∞; 3] · [5 9] = [∞ ∞; 8 12]; the 8 is the cost of the 3-1-2 path.

a(3,2) = a(3,1) + a(1,2), therefore Π(3,2) = Π(1,2)

Distances:
 0   5   9   ∞   ∞   4
 ∞   0   1  −2   ∞   ∞
 3   8   0   4   ∞   3
 ∞   ∞   5   0   4   ∞
 ∞   ∞  −1   ∞   0   ∞
−3   ∞   ∞   ∞   7   0

Parents (Π):
1 1 1 1 1 1
2 2 2 2 2 2
3 1 3 3 3 3
4 4 4 4 4 4
5 5 5 5 5 5
6 6 6 6 6 6

Page 33

[Figure: the same graph; entry (1,3) of the distance matrix is updated to 6, the cost of the 1-2-3 path, and the parent matrix is updated accordingly.]

D = D*: no change

Block product (min-plus) shown on the slide: [5 9] · [0 1; 8 0] = [5 6]; path: 1-2-3.

a(1,3) = a(1,2) + a(2,3), therefore Π(1,3) = Π(2,3)

Distances:
 0   5   6   ∞   ∞   4
 ∞   0   1  −2   ∞   ∞
 3   8   0   4   ∞   3
 ∞   ∞   5   0   4   ∞
 ∞   ∞  −1   ∞   0   ∞
−3   ∞   ∞   ∞   7   0

Parents (Π):
1 1 2 1 1 1
2 2 2 2 2 2
3 1 3 3 3 3
4 4 4 4 4 4
5 5 5 5 5 5
6 6 6 6 6 6

Page 34

Communication-avoiding APSP on distributed memory

Bandwidth:  W_bc-2.5D(n, p) = O( n² / √(c·p) )
Latency:    S_bc-2.5D(p)    = O( √(c·p) · log²(p) )

c: number of replicas. Optimal for any memory size!

Page 35

[Chart: GFlops vs. number of compute nodes (1 to 1024) for n = 4096 and n = 8192, at replication factors c = 1, 4, and 16.]

Communication-avoiding APSP on distributed memory

Solomonik, Buluç, and Demmel. "Minimizing communication in all-pairs shortest paths." In Proceedings of IPDPS, 2013.

Page 36

Conclusions