Parallel breadth-first search on distributed memory systems

Aydın Buluç, Lawrence Berkeley National Laboratory
Kamesh Madduri, The Pennsylvania State University

Supercomputing, Seattle, November 17, 2011


Page 1

Parallel breadth-first search on distributed memory systems

Aydın Buluç, Lawrence Berkeley National Laboratory
Kamesh Madduri, The Pennsylvania State University

Supercomputing, Seattle, November 17, 2011

Page 2

Large graphs are everywhere

Internet structure, social interactions

Scientific datasets: biological, chemical, cosmological, ecological, …

[Figure: WWW snapshot, courtesy Y. Hyun; yeast protein interaction network, courtesy H. Jeong]

Page 3

Breadth-first search (BFS)

• Level-by-level graph traversal
• Serial complexity: Θ(m + n)

Graph: G(E, V)
E: set of edges (size m)
V: set of vertices (size n)

[Figure: 4x4 grid graph with vertices numbered 1–16]
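For reference, a minimal serial BFS over a graph in CSR form is sketched below (illustrative code, not taken from the talk's implementation). Each vertex and each edge is touched a constant number of times, which is where the Θ(m + n) bound comes from.

```cpp
// Serial level-by-level BFS; the graph is given in CSR form:
// rowptr has n+1 entries, colind has m entries.
#include <cstdint>
#include <queue>
#include <vector>

std::vector<int64_t> bfs_serial(const std::vector<int64_t>& rowptr,
                                const std::vector<int64_t>& colind,
                                int64_t source) {
  const int64_t n = static_cast<int64_t>(rowptr.size()) - 1;
  std::vector<int64_t> level(n, -1);        // -1 means "not yet visited"
  std::queue<int64_t> frontier;
  level[source] = 0;
  frontier.push(source);
  while (!frontier.empty()) {
    int64_t u = frontier.front(); frontier.pop();
    for (int64_t k = rowptr[u]; k < rowptr[u + 1]; ++k) {
      int64_t v = colind[k];                // neighbor of u
      if (level[v] == -1) {                 // discovered for the first time
        level[v] = level[u] + 1;
        frontier.push(v);
      }
    }
  }
  return level;
}
```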

Page 4

Breadth-first search (BFS)

• Level-by-level graph traversal
• Serial complexity: Θ(m + n)

[Figure: the 4x4 grid graph; current frontier (level 1) = {1}, shown in red]

Page 5

Breadth-first search (BFS)

• Level-by-level graph traversal
• Serial complexity: Θ(m + n)

[Figure: the 4x4 grid graph; current frontier (level 2) = {2, 5}, shown in red]

Page 6

Breadth-first search (BFS)

• Level-by-level graph traversal
• Serial complexity: Θ(m + n)

[Figure: the 4x4 grid graph; current frontier (level 3) = {3, 6, 9}, shown in red]

Page 7

Breadth-first search (BFS)

• Level-by-level graph traversal
• Serial complexity: Θ(m + n)

[Figure: the 4x4 grid graph; current frontier (level 4) = {4, 7, 10, 13}, shown in red]

Page 8

Breadth-first search (BFS)

• Level-by-level graph traversal
• Serial complexity: Θ(m + n)

[Figure: the 4x4 grid graph; current frontier (level 5) = {8, 11, 14}, shown in red]

Page 9

Breadth-first search (BFS)

• Level-by-level graph traversal
• Serial complexity: Θ(m + n)

[Figure: the 4x4 grid graph; current frontier (level 6) = {12, 15}, shown in red]

Page 10

BFS as a graph building block

• BFS is representative of communication-intensive graph computations in distributed memory
• BFS is a subroutine for many algorithms:
  – Betweenness centrality
  – Maximum flows
  – Connected components
  – Spanning forests
  – Testing for bipartiteness
  – Reverse Cuthill-McKee ordering

Page 11

Graph500 Benchmark #1

Breadth-first search on a low-diameter graph with skewed degree distribution

• Short span (critical path)
• High parallelism (work/span)

[Figure: number of vertices in the fringe vs. BFS tree level (levels 1–10, peaking near 300,000 vertices), plus a small example graph on vertices 1–7]

Performance metric: TEPS, "Traversed Edges Per Second"
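As a quick restatement of the metric (following the Graph 500 convention of normalizing execution time by the number of edges visited):

$$\mathrm{TEPS} = \frac{\text{number of edges traversed by the BFS}}{\text{BFS execution time (seconds)}},$$

so 1 GTEPS corresponds to $10^9$ traversed edges per second.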

Page 12

Graph500: implementation challenge

[Figure: Giga TEPS (log scale, 4 to 256) from 11/2010 to 11/2011 for this work versus the Graph500 #1 entry on NERSC/Franklin, with the SC'11 submission deadline marked]

8.2X improvement on the same hardware, in one year.

Performance numbers are just a snapshot in time, but the insights and conclusions are still valid today.

Page 13

Outline

• BFS overview and applications
• BFS as sparse matrix-sparse vector multiply
• Parallel BFS: 1D and 2D approaches
• Experimental results and insight
• Conclusions / Contributions
• Future Directions

Page 14

Breadth-first search: a matrix view

• Adjacency matrix: sparse array with nonzeros for graph edges
• Multiply by the adjacency matrix → step to neighbor vertices

[Figure: example graph on vertices 1–7, its transposed adjacency matrix A^T, and a frontier vector x; loops are not drawn]

Page 15

Breadth-first search: a matrix view

• Adjacency matrix: sparse array with nonzeros for graph edges
• Multiply by the adjacency matrix → step to neighbor vertices

[Figure: A^T x → the neighbors of the vertices in x; loops are not drawn]

Page 16

Breadth-first search: a matrix view

• Adjacency matrix: sparse array with nonzeros for graph edges
• Multiply by the adjacency matrix → step to neighbor vertices

[Figure: repeated multiplication, A^T x and then (A^T)^2 x, steps out one more level each time; loops are not drawn]

Page 17

BFS as sparse matrix-sparse vector multiply
Semiring: (select, min)

[Figure: example graph on vertices 1–7 and its matrix A^T, indexed "from" (columns) and "to" (rows)]

Page 18

[Figure: first SpMSV step from the root: x contains vertex 1, A^T x discovers its neighbors, and the parents vector records 1 for the newly reached vertices]

Page 19

[Figure: second SpMSV step: the new frontier x = {2, 4}, A^T x reaches the next level, and the parents vector gains entries with values 2 and 4]

Page 20

[Figure: third SpMSV step: the frontier advances one more level and the parents vector is extended for the newly discovered vertices]

Page 21

[Figure: final SpMSV step: A^T x discovers vertex 6, completing the traversal]
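The frontier expansion animated on pages 17–21 is one sparse matrix-sparse vector multiply (SpMSV) per level. The sketch below is a simplification (not the CombBLAS code behind the talk): it keeps only the indices of the sparse vectors, lets "multiply" select the discovering vertex u as a candidate parent, lets "add" keep the minimum-labeled candidate, and masks out vertices that already have a parent.

```cpp
// One BFS level as SpMSV over a (select, min) semiring, dense-accumulator version.
// adjT[u] lists the rows v with A^T(v, u) != 0, i.e. the neighbors reached from u.
#include <cstdint>
#include <vector>

using Vertex   = int64_t;
using Frontier = std::vector<Vertex>;   // nonzero indices of the sparse vector x

Frontier bfs_step(const std::vector<std::vector<Vertex>>& adjT,
                  const Frontier& x,
                  std::vector<Vertex>& parent) {          // -1 = unvisited
  std::vector<Vertex> cand(parent.size(), -1);            // candidate parents
  for (Vertex u : x)
    for (Vertex v : adjT[u])
      if (parent[v] == -1 && (cand[v] == -1 || u < cand[v]))
        cand[v] = u;             // "multiply" selects u, "add" keeps the minimum
  Frontier y;                    // next frontier: newly discovered vertices
  for (Vertex v = 0; v < static_cast<Vertex>(cand.size()); ++v)
    if (cand[v] != -1) { parent[v] = cand[v]; y.push_back(v); }
  return y;
}
```

Seeding with parent[s] = s and x = {s}, then iterating until the returned frontier is empty, builds up the parents vector the way pages 18–21 do.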

Page 22

Outline

• BFS overview and applications
• BFS as sparse matrix-sparse vector multiply
• Parallel BFS: 1D and 2D approaches
• Experimental results and insight
• Conclusions / Contributions
• Future Directions

Page 23

Parallel BFS strategies

1. Expand the current frontier (level-synchronous approach, suited for low-diameter graphs)
   • O(D) parallel steps
   • Adjacencies of all vertices in the current frontier are visited in parallel

2. Stitch multiple concurrent traversals (Ullman-Yannakakis, for high-diameter graphs)
   • Path-limited searches from "supervertices"
   • APSP between "supervertices"

[Figure: two small example graphs on vertices 0–9, each with a marked source vertex, illustrating the two strategies]

Page 24

1D parallel BFS algorithm

ALGORITHM:
1. Find owners of the current frontier's adjacency. [computation]
2. Exchange adjacencies via all-to-all. [communication]
3. Update distances/parents for unvisited vertices. [computation]

[Figure: the example graph, A^T partitioned by block rows across processes, and the frontier vector x]
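A hedged sketch of one level of the 1D algorithm follows, with the three steps above marked in comments. It assumes vertices are block-distributed (process r owns a contiguous range of n_per_p vertices starting at first_owned, with n = p · n_per_p), MPI is already initialized, and all names are illustrative; this is not the authors' code.

```cpp
#include <mpi.h>
#include <algorithm>
#include <cstdint>
#include <vector>

using Vertex = int64_t;

struct LocalCSR {              // adjacency of the locally owned vertices
  std::vector<Vertex> rowptr;  // n_local + 1 entries
  std::vector<Vertex> colind;  // global indices of neighbors
};

// frontier: locally owned vertices discovered in the previous level
// parent:   parents of locally owned vertices (-1 = unvisited)
std::vector<Vertex> bfs_level_1d(const LocalCSR& A, const std::vector<Vertex>& frontier,
                                 std::vector<Vertex>& parent, Vertex n_per_p,
                                 Vertex first_owned, MPI_Comm comm) {
  int p; MPI_Comm_size(comm, &p);

  // Step 1 [computation]: bucket (neighbor, proposed parent) pairs by owner.
  std::vector<std::vector<Vertex>> out(p);
  for (Vertex u : frontier)
    for (Vertex k = A.rowptr[u - first_owned]; k < A.rowptr[u - first_owned + 1]; ++k) {
      Vertex v = A.colind[k];
      int owner = static_cast<int>(v / n_per_p);
      out[owner].push_back(v);
      out[owner].push_back(u);
    }

  // Step 2 [communication]: exchange adjacencies via all-to-all.
  std::vector<int> scnt(p), rcnt(p), sdsp(p), rdsp(p);
  for (int i = 0; i < p; ++i) scnt[i] = static_cast<int>(out[i].size());
  MPI_Alltoall(scnt.data(), 1, MPI_INT, rcnt.data(), 1, MPI_INT, comm);
  int stot = 0, rtot = 0;
  for (int i = 0; i < p; ++i) {
    sdsp[i] = stot; stot += scnt[i];
    rdsp[i] = rtot; rtot += rcnt[i];
  }
  std::vector<Vertex> sendbuf(stot), recvbuf(rtot);
  for (int i = 0; i < p; ++i)
    std::copy(out[i].begin(), out[i].end(), sendbuf.begin() + sdsp[i]);
  MPI_Alltoallv(sendbuf.data(), scnt.data(), sdsp.data(), MPI_INT64_T,
                recvbuf.data(), rcnt.data(), rdsp.data(), MPI_INT64_T, comm);

  // Step 3 [computation]: update parents of unvisited owned vertices;
  // they form the next frontier.
  std::vector<Vertex> next;
  for (int i = 0; i < rtot; i += 2) {
    Vertex v = recvbuf[i], u = recvbuf[i + 1];
    if (parent[v - first_owned] == -1) {
      parent[v - first_owned] = u;
      next.push_back(v);
    }
  }
  return next;
}
```

The Alltoallv in step 2 is what the $\beta_{N,a2a}(p)\,\frac{m}{p}$ term on the next page models.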

Page 25

Communication in the 1D algorithm

ALGORITHM:
1. Find owners of the current frontier's adjacency. [computation]
2. Exchange adjacencies via all-to-all. [communication]
3. Update distances/parents for unvisited vertices. [computation]

Local memory references:
$\beta_L \frac{m}{p} + \alpha_{L,n/p} \frac{n+m}{p}$
($\beta_L$: inverse local RAM bandwidth; $\alpha_{L,n/p}$: local latency on a working set of size n/p)

Remote communication:
$\beta_{N,a2a}(p) \frac{m}{p} + \alpha_N\, p$
($\beta_{N,a2a}(p)$: all-to-all remote bandwidth with p participating processors)

Page 26

2D parallel BFS algorithm

ALGORITHM:
1. Gather vertices in processor column. [communication]
2. Find owners of the current frontier's adjacency. [computation]
3. Exchange adjacencies in processor row. [communication]
4. Update distances/parents for unvisited vertices. [computation]

[Figure: A^T partitioned on a pr x pc processor grid, with blocks of size n/pr x n/pc, multiplied by the frontier vector x]
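To make the two communication phases concrete, here is a minimal sketch (our illustration, not the authors' code) of setting up the pr x pc process grid with MPI: the column gather in step 1 becomes an Allgatherv over the column communicator, and the row exchange in step 3 becomes an Alltoallv over the row communicator.

```cpp
#include <mpi.h>

// Row and column communicators for a pr x pc process grid.
struct Grid2D { MPI_Comm row, col; int pr, pc; };

Grid2D make_grid(MPI_Comm world, int pr, int pc) {
  int rank; MPI_Comm_rank(world, &rank);
  int my_row = rank / pc;   // which block row of A^T this rank owns
  int my_col = rank % pc;   // which block column
  Grid2D g{MPI_COMM_NULL, MPI_COMM_NULL, pr, pc};
  MPI_Comm_split(world, my_row, my_col, &g.row);  // size pc: step 3, Alltoallv
  MPI_Comm_split(world, my_col, my_row, &g.col);  // size pr: step 1, Allgatherv
  return g;
}
```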

Page 27

2D algorithm: local computation

[Figure: local block A (n/pr x n/pc), the gathered frontier x, and the output y, formed by a gather followed by a scatter/accumulate through a sparse accumulator (SPA)]

Local memory references:
$\beta_L \frac{m}{p} + \alpha_{L,n/p_c} \frac{n}{p} + \alpha_{L,n/p_r} \frac{m}{p}$

Page 28

Submatrix storage

• Submatrices are "hypersparse" (i.e., nnz << n)
• With an average of c nonzeros per column in the full matrix, a submatrix on a $\sqrt{p} \times \sqrt{p}$ grid has $nnz' = c/\sqrt{p} \rightarrow 0$ nonzeros per column as p grows
• Total storage over the $\sqrt{p} \times \sqrt{p}$ blocks: $O(n + nnz) \rightarrow O(n\sqrt{p} + nnz)$
• A data structure or algorithm that depends on the matrix dimension n (e.g., CSR or CSC) is asymptotically too wasteful for submatrices
• Use doubly-compressed (DCSC) data structures instead
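As a quick illustration (the specific numbers are ours, not from the slide): with the Graph500 edge factor c = 16 and a 200 x 200 process grid (p = 40,000), the expected nonzero count per submatrix column is

$$nnz' = \frac{c}{\sqrt{p}} = \frac{16}{200} = 0.08,$$

so a CSC column-pointer array of length $n/\sqrt{p}$ per submatrix would consist almost entirely of empty columns; DCSC stores pointers only for the nonempty ones.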

Page 29

2D hybrid parallelism

• Explicitly split submatrices into t (number of threads) pieces along the rows.
• The local working set is smaller by a factor of $\sqrt{t}$ (not a full factor of t, because $p_r$ is itself a factor of $\sqrt{t}$ smaller than in the flat case).

[Figure: the local block split row-wise between threads 1..t, each owning an n/(pr·t) slice of the output y]

Page 30

Outline

• BFS overview and applications
• BFS as sparse matrix-sparse vector multiply
• Parallel BFS: 1D and 2D approaches
• Experimental results and insight
• Conclusions / Contributions
• Future Directions

Page 31

Hopper strong scaling

• NERSC Hopper (Cray XE6, Gemini interconnect, AMD Magny-Cours)
• Hybrid: in-node 6-way OpenMP multithreading
• Graph500: scale 32, R-MAT with edge factor 16

[Figure: communication time (seconds) and GTEPS vs. number of cores (5040, 10008, 20000, 40000) for 1D Flat MPI, 2D Flat MPI, 1D Hybrid, and 2D Hybrid; the GTEPS plot is annotated with 92%, 66%, 67%, and 58%]

Page 32

Franklin strong scaling

• NERSC Franklin (Cray XT4, SeaStar interconnect, AMD Budapest)
• Hybrid: in-node 4-way OpenMP multithreading
• Graph500: scale 29, R-MAT with edge factor 16

[Figure: GTEPS and communication time (seconds) vs. number of cores (512, 1024, 2048, 4096) for 1D Flat MPI, 2D Flat MPI, 1D Hybrid, and 2D Hybrid]

Page 33

Bandwidth as a function of p

• The network parameter $\beta_{N,a2a}$ is a function of the number of participating processors.
• A micro-benchmark imitates the 2D algorithm's communication pattern.

1D: $\beta_{N,a2a}(p) \frac{m}{p}$
2D: $\beta_{N,a2a}(p_c) \frac{m}{p} + \beta_{N,ag}(p_r) \frac{n}{p_c}$

(*latency costs not listed)

[Figure: sustained bandwidth (GB/s) and bandwidth per core (MB/s) vs. number of cores, from 5 to 320]

Page 34

Communication breakdown (2D)

$\beta_{N,a2a}(p_c) \frac{m}{p} + \beta_{N,ag}(p_r) \frac{n}{p_c}$

i. Allgather becomes the bottleneck as concurrency increases.
ii. Allgather is sensitive to sparsity.

[Excerpt from the paper: Table 1, decomposition of communication times for the flat (MPI-only) 2D algorithm on Franklin, using R-MAT graphs. Allgatherv takes place during the expand phase and Alltoallv during the fold phase; edge counts are kept constant.]

Core count  Problem scale  Edge factor  BFS time (s)  Allgatherv (%)  Alltoallv (%)
1024        27             64           2.67           7.0             6.8
1024        29             16           4.39          11.5             7.7
1024        31              4           7.18          16.6             9.1
2025        27             64           1.56          10.4             7.6
2025        29             16           2.87          19.4             9.2
2025        31              4           4.41          24.3             9.0
4096        27             64           1.31          13.1             7.8
4096        29             16           2.23          20.8             9.0
4096        31              4           3.15          30.9             7.7

Page 35

A higher diameter graph

• Union of multiple 2005 crawls of the .uk domain (uk-union)
• Vertices: 133M, edges: 5.5B, diameter ≥ 140; run on Hopper
• Speedup: 4X when going from 500 to 4000 cores
• Hybrid is slower, since remote communication is not the bottleneck

[Figure 6 from the paper, lower is better: mean search time (seconds) of 2D Flat and 2D Hybrid on 500, 1000, 2000, and 4000 cores, split into computation and communication; the running times translate into a maximum of 3 GTEPS]

Page 36

Conclusions / Contributions

• In-depth analysis and evaluation of all 4 combinations of parallel BFS on distributed memory:
  1D Flat MPI, 2D Flat MPI, 1D Hybrid MPI+OpenMP, 2D Hybrid MPI+OpenMP
• A performance model that captures both local and non-local memory accesses, as well as collectives
• Novel 2D hybrid approach: up to 3.5X reduction in communication time compared to 1D
• Scaling to 40K cores on "scale-free" graphs

Page 37

Some future work

1. Optimal processor grid dimensions (p = pr x pc) depend on graph size, graph density, desired concurrency, and target architecture. A great opportunity for autotuning.
2. Performance depends heavily on collectives performance.
   • Non-torus partitions → unpredictable performance
   • Topology-aware collectives (Edgar Solomonik's talk at 4:30 today)
3. Graph/hypergraph partitioning for reducing communication
4. Prospect of using PGAS languages

Page 38

Extra slides

Page 39

Graph characteristics

• Skewed degree distribution, with some very high degree vertices
  • Edge cut is O(m)
  • Remote communication is the bottleneck
• Load balance is initially an issue
  • Partially solved by random vertex relabeling

[Figure: spy plots of A = rmat(15) and A(r,r) for a random permutation r]

Page 40

Step 1. Explore adjacencies of current frontier vertices

[Figure: process 0, with t threads and its local graph (edges/vertices), filling p-1 outgoing stacks; aggregate size: the edge cut, O(m_local)]

Page 41

Step 1. Explore adjacencies of current frontier vertices

[Figure: processes 0, 1, …, p-1, each with t threads, exploring their local adjacencies in parallel]

Page 42

Step 1. Explore adjacencies of current frontier vertices

[Figure: continuation of the previous animation frame]

Page 43

Step 2. All-to-all communication to exchange visited vertices

[Figure: processes 0, 1, …, p-1 exchanging their stacks through the 'Alltoallv' collective, driven by the master thread on each process]

Page 44

Step 3. Update local frontier for next iteration

[Figure: processes 0, 1, …, p-1, each with t threads, updating their frontier vertices]

Page 45

2D algorithm, step-by-step

Initial 2D matrix and vector distributions, interleaved on each other.
A = sparse adjacency matrix of the graph

[Figure: A partitioned into a 3x3 grid of blocks A_{1,1} … A_{3,3}, each of size n/pr x n/pc, with the vector x partitioned into matching sub-vectors x_{1,1} … x_{3,3}]

Page 46

After the vector transpose phase
• For r = c = sqrt(p), pairwise communication only
• Takes < 0.25% of wall-clock time

Function: SendRecv; size: O(n/p)

[Figure: the same 3x3 block distribution after the sub-vectors have been transposed across the pr x pc grid]

Page 47

After Allgatherv on processor columns
• The (n/c)-sized sub-vector gets replicated
• Most expensive communication step

Function: Allgatherv; size: O(n/pc)

[Figure: the 3x3 block distribution with the frontier sub-vector replicated down each processor column]

Page 48

[Figure: local computation, as on page 27: gather the frontier x, then scatter/accumulate into y through the SPA]

Local memory references:
$\frac{m}{p}\beta_L + \frac{n}{p}\alpha_{L,n/p_c} + \frac{m}{p}\alpha_{L,n/p_r}$

Page 49

After the local sparse-matrix x sparse-vector multiply, Alltoallv sends the pieces of y to their owners.

Function: Alltoallv; size: O(m/p)

[Figure: the output vector y partitioned into sub-vectors y_{1,1} … y_{3,3} across the 3x3 processor grid]

Page 50

Franklin weak scaling

• NERSC Franklin (Cray XT4, SeaStar interconnect, AMD Budapest)
• Graph500: scales 29, 30, 31, and 32, respectively
• Lower is better for both plots

[Figure: execution time (seconds) and communication time (seconds) vs. number of cores (512, 1024, 2048, 4096) for 1D Flat MPI, 2D Flat MPI, 1D Hybrid, and 2D Hybrid]

Page 51

Recent improvements (July 2011)

Two bottlenecks for the 2D algorithm:
1) Local computation is expensive due to the O(n/pr) working set
2) Communication involves 64-bit indices

Remedies:
1) Avoid the dense vector (SPA) in the local SpMSV:
   - Select the vertex with the minimum label as parent (a bit vector suffices); see the sketch below
2) Local indices are 32-bit addressable in 2 out of 3 cases:
   - The receiver applies the correct sender "offset"

[Figure: communication time (seconds) and computation time (seconds) vs. number of cores (1224, 2500, 5040, 10008), with the May 2011 version shown for comparison]
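One plausible reading of remedy (1), sketched below with illustrative names (not the authors' code): if the frontier is processed in increasing label order, the first local discoverer of a vertex is automatically the minimum-labeled one, so a visited bitmap can replace a dense accumulator of candidate parents. If the bitmap persists across levels, it also eliminates duplicates of vertices found in earlier iterations, as in the reimplementation on the next page.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

using Vertex = int64_t;
using SpVec  = std::vector<std::pair<Vertex, Vertex>>;  // (vertex, parent)

// Local SpMSV with a bitmap instead of a SPA. `seen` has one bit per local
// output row (size ceil(n_local / 64)); `frontier` must be sorted ascending.
SpVec local_spmsv_bitmap(const std::vector<std::vector<Vertex>>& adj,
                         const std::vector<Vertex>& frontier,
                         std::vector<uint64_t>& seen) {
  SpVec discovered;
  for (Vertex u : frontier)            // ascending u => min-label parent wins
    for (Vertex v : adj[u]) {
      uint64_t& word = seen[v >> 6];
      const uint64_t bit = 1ULL << (v & 63);
      if (!(word & bit)) {             // first (hence minimum-labeled) discoverer
        word |= bit;
        discovered.push_back({v, u});
      }
    }
  return discovered;
}
```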

Page 52

Most recently (November 2011)

A reimplementation of the 2D algorithm:
1. Transposed processor grid (feedback from the micro-benchmarks): Alltoallv along columns, Allgatherv along rows.
2. BFS-persistent local duplicate elimination: keep locally discovered vertices until the BFS completes.