Parallel breadth-first search on distributed memory systems
Aydın Buluç, Lawrence Berkeley National Laboratory
Kamesh Madduri, The Pennsylvania State University
Supercomputing, Seattle, November 17, 2011
Large graphs are everywhere
[Figures: WWW snapshot, courtesy Y. Hyun; yeast protein interaction network, courtesy H. Jeong]
• Internet structure
• Social interactions
• Scientific datasets: biological, chemical, cosmological, ecological, …
Breadth-first search (BFS)
• Level-by-level graph traversal
• Serial complexity: Θ(m + n)
Graph: G(E, V); E: set of edges (size m); V: set of vertices (size n)
[Figure: animation of BFS on a 4×4 grid graph (vertices 1-16) starting from vertex 1, with the current 'frontier' shown in red at each step. Level 1: {1}; Level 2: {2, 5}; Level 3: {9, 6, 3}; Level 4: {13, 10, 7, 4}; Level 5: {14, 11, 8}; Level 6: {12, 15}]
BFS as a graph building block
• BFS is representative of communication-intensive graph computations in distributed memory
• BFS is a subroutine for many algorithms:
  – Betweenness centrality
  – Maximum flows
  – Connected components
  – Spanning forests
  – Testing for bipartiteness
  – Reverse Cuthill-McKee ordering
• Graph500 Benchmark #1
Breadth-first search on a low-diameter graph with skewed degree distribution
• Short span (critical path)
• High parallelism (work/span)
[Figure: number of vertices in the fringe vs. BFS tree level (levels 1-10, y-axis 0-300000)]
Graph500: implementation challenge
Performance metric: TEPS, 'Traversed Edges Per Second'
[Figure: Graph500 #1 performance in Giga TEPS (log scale, 4-256) from 11/2010 to 11/2011 on NERSC/Franklin, with this work and the SC'11 deadline marked]
8.2× improvement on the same hardware, in one year.
Performance numbers are just a snapshot in time, but the insights and conclusions are still valid today.
Outline
• BFS overview and applications
• BFS as sparse matrix-sparse vector multiply
• Parallel BFS: 1D and 2D approaches
• Experimental results and insight
• Conclusions / Contributions
• Future Directions
Breadth-first search: a matrix view
• Adjacency matrix: sparse array with nonzeros for graph edges
• Multiply by adjacency matrix → step to neighbor vertices
[Figure: example 7-vertex graph and its transposed adjacency matrix A^T (self-loops are not drawn); successive products x → A^T x → (A^T)² x move the frontier to the neighbors, then to the neighbors' neighbors]
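This step is easy to reproduce with SciPy (a sketch; the edge list below is illustrative rather than the slide's exact 7-vertex graph): multiplying the transposed adjacency matrix by a sparse indicator vector x marks exactly the neighbors of the current frontier.

```python
import numpy as np
import scipy.sparse as sp

edges = [(0, 1), (0, 3), (1, 2), (3, 6), (6, 5), (2, 4), (4, 5)]  # illustrative
rows, cols = zip(*edges)
n = 7
A = sp.csr_matrix((np.ones(len(edges)), (rows, cols)), shape=(n, n))

x = np.zeros(n); x[0] = 1       # frontier indicator: start at vertex 0
y = A.T @ x                     # one step: nonzeros of y mark the neighbors
print(np.nonzero(y)[0])         # -> [1 3]
print(np.nonzero(A.T @ y)[0])   # two steps, i.e. (A^T)^2 x -> [2 6]
```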
BFS as sparse matrix-sparse vector multiply
Semiring: (select, min)
[Figure: animation on the 7-vertex example; at each iteration the sparse frontier vector x is multiplied by A^T over the (select, min) semiring, and the parents array gains an entry for every newly discovered vertex, until all vertices have parents]
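In plain Python, one BFS iteration over such a semiring can be sketched as below (illustrative code, not the authors' implementation): "multiply" selects the parent's id instead of doing arithmetic, and "add" resolves duplicates by taking the minimum candidate parent, so the product of A^T with the sparse frontier vector directly yields parent assignments for newly reached vertices.

```python
def spmspv_select_min(adj, frontier):
    """y = A^T x over the (select, min) semiring.
    frontier: {vertex: value}; returns {neighbor: min candidate parent}."""
    y = {}
    for u in frontier:
        for v in adj[u]:
            y[v] = min(y.get(v, u), u)   # "add" = min over candidate parents
    return y

def bfs_semiring(adj, source):
    parents = {source: source}
    frontier = {source: source}
    while frontier:
        y = spmspv_select_min(adj, frontier)
        # mask out already-visited vertices (elementwise filtering)
        frontier = {v: p for v, p in y.items() if v not in parents}
        parents.update(frontier)
    return parents
```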
Outline
• BFS overview and applications
• BFS as sparse matrix-sparse vector multiply
• Parallel BFS: 1D and 2D approaches
• Experimental results and insight
• Conclusions / Contributions
• Future Directions
Parallel BFS strategies
1. Expand the current frontier (level-synchronous approach, suited for low-diameter graphs)
   • O(D) parallel steps
   • Adjacencies of all vertices in the current frontier are visited in parallel
2. Stitch multiple concurrent traversals (Ullman-Yannakakis, for high-diameter graphs)
   • Path-limited searches from "supervertices"
   • APSP between "supervertices"
[Figures: a 10-vertex example graph traversed from a source vertex under each strategy]
1D parallel BFS algorithm
ALGORITHM:
1. Find owners of the current frontier's adjacency [computation]
2. Exchange adjacencies via all-to-all [communication]
3. Update distances/parents for unvisited vertices [computation]
[Figure: frontier expansion as A^T x, with the matrix and vectors distributed by rows across processors]
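A hedged sketch of one such iteration using mpi4py (assuming a block distribution of vertices so that `owner(v)` returns the rank holding v; all names here are illustrative, not the authors' code):

```python
from mpi4py import MPI

def bfs_1d_level(frontier, local_adj, owner, parents, comm):
    """One level of 1D BFS: frontier holds locally-owned frontier vertices."""
    p = comm.Get_size()
    # Step 1 [computation]: bucket each discovered neighbor by owner process
    outgoing = [[] for _ in range(p)]
    for u in frontier:
        for v in local_adj.get(u, ()):
            outgoing[owner(v)].append((v, u))       # (child, parent) pairs
    # Step 2 [communication]: exchange adjacencies via all-to-all
    incoming = comm.alltoall(outgoing)
    # Step 3 [computation]: update parents of unvisited local vertices
    next_frontier = []
    for bucket in incoming:
        for v, u in bucket:
            if v not in parents:
                parents[v] = u
                next_frontier.append(v)
    return next_frontier
```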
Communication in 1D algorithm
ALGORITHM:
1. Find owners of the current frontier's adjacency [computation]
2. Exchange adjacencies via all-to-all [communication]
3. Update distances/parents for unvisited vertices [computation]
Local memory references:
  β_L · m/p + α_{L,n/p} · (n + m)/p
  (β_L: inverse local RAM bandwidth; α_{L,n/p}: local latency on a working set of size n/p)
Remote communication:
  β_{N,a2a}(p) · m/p + α_N · p
  (β_{N,a2a}(p): all-to-all remote bandwidth with p participating processors)
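To make the model concrete, here is a small calculator for the two terms above (a sketch; every constant below is a hypothetical placeholder, not a measured machine parameter):

```python
def bfs_1d_cost(n, m, p, beta_L, alpha_L, beta_N, alpha_N):
    """Evaluate the slide's 1D cost model for one full BFS."""
    local = beta_L * m / p + alpha_L * (n + m) / p   # local memory references
    remote = beta_N(p) * m / p + alpha_N * p         # all-to-all exchange
    return local, remote

# Hypothetical constants; beta_N degrades like p**(1/3), as on a 3D torus.
n = 2**32                 # vertices
m = 16 * n                # edges (edge factor 16)
local, remote = bfs_1d_cost(n, m, 10_000,
                            beta_L=1e-9, alpha_L=1e-7,
                            beta_N=lambda p: 5e-9 * p ** (1 / 3),
                            alpha_N=1e-6)
print(f"local: {local:.2f} s, remote: {remote:.2f} s")
```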
2D parallel BFS algorithm
ALGORITHM:
1. Gather vertices in processor column [communication]
2. Find owners of the current frontier's adjacency [computation]
3. Exchange adjacencies in processor row [communication]
4. Update distances/parents for unvisited vertices [computation]
[Figure: frontier expansion as A^T x on a pr × pc processor grid; each processor owns an (n/pr) × (n/pc) submatrix]
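Again as a hedged mpi4py sketch (illustrative, not the authors' code): the pr × pc grid is realized with two sub-communicators, with an allgather along the processor column for the expand phase and an all-to-all along the processor row for the fold phase. `owner_col` and `local_spmspv` are assumed helpers.

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
pr, pc = 2, 2                              # illustrative grid; needs pr*pc ranks
row = comm.Split(color=comm.rank // pc)    # pc ranks per processor row
col = comm.Split(color=comm.rank % pc)     # pr ranks per processor column

def owner_col(v):
    return v % pc                          # hypothetical vertex -> row-rank map

def bfs_2d_level(local_frontier, local_spmspv, parents):
    # 1. Gather frontier pieces along the processor column [communication]
    frontier = [v for part in col.allgather(local_frontier) for v in part]
    # 2. Local SpMSpV on the owned submatrix [computation]
    discovered = local_spmspv(frontier)        # yields (vertex, parent) pairs
    # 3. Exchange discovered vertices along the processor row [communication]
    outgoing = [[] for _ in range(pc)]
    for v, u in discovered:
        outgoing[owner_col(v)].append((v, u))
    incoming = row.alltoall(outgoing)
    # 4. Update parents of unvisited, locally-owned vertices [computation]
    new_frontier = []
    for bucket in incoming:
        for v, u in bucket:
            if v not in parents:
                parents[v] = u
                new_frontier.append(v)
    return new_frontier
```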
2D algorithm: local computation
[Figure: local SpMSpV y = A · x(frontier) using a sparse accumulator (SPA); gather from the (n/pc)-long input vector, scatter/accumulate into the (n/pr)-long output vector]
Local memory references:
  β_L · m/p + α_{L,n/pc} · n/p + α_{L,n/pr} · m/p
Submatrix storage
• Submatrices are "hypersparse" (i.e. nnz << n)
• With an average of c nonzeros per column overall, after splitting into √p × √p blocks each submatrix has nnz′ = c/√p → 0 nonzeros per column on average
• Total storage: O(n + nnz) → O(n·√p + nnz)
• A data structure or algorithm that depends on the matrix dimension n (e.g. CSR or CSC) is asymptotically too wasteful for submatrices
• Use doubly-compressed (DCSC) data structures instead
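A minimal sketch of the doubly-compressed idea (illustrative and much simplified relative to the real DCSC structure): store only the nonempty columns, so memory is proportional to the number of nonempty columns plus nnz, with no O(n) term.

```python
import bisect
from collections import defaultdict

class DCSCLite:
    """Doubly-compressed sparse columns, simplified: empty columns
    cost nothing, so storage is O(nzc + nnz) rather than O(n + nnz)."""
    def __init__(self, triples):                # triples: (row, col, val)
        by_col = defaultdict(list)
        for r, c, v in triples:
            by_col[c].append((r, v))
        self.jc = sorted(by_col)                # ids of nonempty columns only
        self.cp = [0]                           # prefix offsets into ir/vals
        self.ir, self.vals = [], []
        for c in self.jc:
            for r, v in by_col[c]:
                self.ir.append(r)
                self.vals.append(v)
            self.cp.append(len(self.ir))

    def column(self, c):
        """Return the (row, val) pairs of column c, possibly empty."""
        i = bisect.bisect_left(self.jc, c)
        if i == len(self.jc) or self.jc[i] != c:
            return []                           # empty column, stored implicitly
        return list(zip(self.ir[self.cp[i]:self.cp[i + 1]],
                        self.vals[self.cp[i]:self.cp[i + 1]]))
```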
2D hybrid parallelism
• Explicitly split submatrices into t (#threads) pieces along the rows.
• The per-thread working set is smaller by a factor of √t (not a factor of t, because pr is now a factor of √t smaller as well).
[Figure: each thread owns an n/(pr·t)-row slab of the local submatrix and the matching slice of the output vector y; the frontier x is shared]
Outline
• BFS overview and applications
• BFS as sparse matrix-sparse vector multiply
• Parallel BFS: 1D and 2D approaches
• Experimental results and insight
• Conclusions / Contributions
• Future Directions
Hopper strong scaling
• NERSC Hopper (Cray XE6, Gemini interconnect, AMD Magny-Cours)
• Hybrid: in-node 6-way OpenMP multithreading
• Graph500: Scale 32, R-MAT with edge factor = 16
[Figures: GTEPS (4-20) and communication time (2-12 seconds) vs. number of cores (5040, 10008, 20000, 40000) for 1D Flat MPI, 2D Flat MPI, 1D Hybrid, and 2D Hybrid; annotated efficiencies: 92%, 66%, 67%, 58%]
Franklin strong scaling
• NERSC Franklin (Cray XT4, SeaStar interconnect, AMD Budapest)
• Hybrid: in-node 4-way OpenMP multithreading
• Graph500: Scale 29, R-MAT with edge factor = 16
[Figures: GTEPS and communication time (seconds) vs. number of cores (512-4096) for 1D Flat MPI, 2D Flat MPI, 1D Hybrid, and 2D Hybrid]
Bandwidth as a function of p
• The network parameter β_{N,a2a} is a function of the number of participating processors.
• A micro-benchmark imitates the 2D algorithm's communication pattern.
Communication volume terms (latency costs not listed):
  1D: β_{N,a2a}(p) · m/p
  2D: β_{N,a2a}(pc) · m/p + β_{N,ag}(pr) · n/pc
[Figure: sustained bandwidth (GB/s) and bandwidth per core (MB/s) vs. number of cores, 5-320]
Communication breakdown (2D)
…every other process roughly m/p² words. This value is typically large enough that the bandwidth component dominates over the latency term. Since we randomly shuffle the vertex identifiers prior to execution of BFS, these communication bounds hold true in the case of the synthetic random networks we experimented with in this paper. Thus, the per-node communication cost is given by p·α_N + (m/p)·β_{N,a2a}(p).
β_{N,a2a}(p) is a function of the processor count, and several factors, including the interconnection topology, node injection bandwidth, the routing protocol, network contention, etc., determine the sustained per-node bandwidth. For instance, if nodes are connected in a 3D torus, it can be shown that bisection bandwidth scales as p^{2/3}. Assuming all-to-all communication scales similarly, the communication cost can be revised to p·α_N + (m/p)·p^{1/3}·β_{N,a2a}. If processors are connected via a ring, then p·α_N + (m/p)·p·β_{N,a2a} would be an estimate for the all-to-all communication cost, essentially meaning no parallel speedup.
5.2 Analysis of the 2D Algorithm
Consider the general 2D case with a processor grid of pr × pc. The size of the local adjacency matrix is n/pr × n/pc. The number of memory references is the same as in the 1D case, cumulatively over all processors. However, the cache working set is bigger, because the sizes of the local input (frontier) and output vectors are n/pr and n/pc, respectively. The local memory reference cost is given by (m/p)·β_L + (n/p)·α_{L,n/pc} + (m/p)·α_{L,n/pr}. The higher number of cache misses associated with larger working sets is perhaps the primary reason for the relatively higher computation costs of the 2D algorithm.
Most of the cost due to remote memory accesses is concentrated in two operations. The expand phase is characterized by an Allgatherv operation over the processor column (of size pr), and the fold phase is characterized by an Alltoallv operation over the processor row (of size pc).
The aggregate input to the Allgatherv step is O(n) over all iterations. However, each processor receives only a 1/pc portion of it, meaning that the frontier subvector gets replicated along the processor column. Hence, the per-node communication cost is pr·α_N + (n/pc)·β_{N,ag}(pr). This replication can be partially avoided by performing an extra round of communication where each processor individually examines its columns and broadcasts the indices of its nonempty columns. However, this extra step does not decrease the asymptotic complexity for R-MAT graphs, nor did it give any performance increase in our experiments.
The aggregate input to the Alltoallv step can be as high as O(m), although the number is lower in practice due to the in-node aggregation of newly discovered vertices that takes place before the communication. Since each processor receives only a 1/p portion of this data, the remote costs due to this step are at most pc·α_N + (m/p)·β_{N,a2a}(pc).
We see that for large p, the expand phase is likely to be more expensive than the fold phase. In fact, Table 1 experimentally confirms the findings of our analysis. We see that Allgatherv always consumes a higher percentage of the BFS time than the Alltoallv operation, with the gap widening as the matrix gets sparser. This is because, for a fixed number of edges, increased sparsity only affects the Allgatherv performance, to a first-degree approximation. In practice, it slightly slows down the Alltoallv performance as well, because the in-node aggregation is less effective for sparser graphs.
Table 1: Decomposition of communication times for the flat (MPI-only) 2D algorithm on Franklin, using the R-MAT graphs. Allgatherv takes place during the expand phase and Alltoallv takes place during the fold phase. The edge counts are kept constant.

  Core count | Problem scale | Edge factor | BFS time (secs) | Allgatherv | Alltoallv
  1024       | 27            | 64          | 2.67            | 7.0%       | 6.8%
  1024       | 29            | 16          | 4.39            | 11.5%      | 7.7%
  1024       | 31            | 4           | 7.18            | 16.6%      | 9.1%
  2025       | 27            | 64          | 1.56            | 10.4%      | 7.6%
  2025       | 29            | 16          | 2.87            | 19.4%      | 9.2%
  2025       | 31            | 4           | 4.41            | 24.3%      | 9.0%
  4096       | 27            | 64          | 1.31            | 13.1%      | 7.8%
  4096       | 29            | 16          | 2.23            | 20.8%      | 9.0%
  4096       | 31            | 4           | 3.15            | 30.9%      | 7.7%
Our analysis successfully captures the relatively lower communication costs of the 2D algorithm by representing β_{N,x} as a function of the processor count.

6. EXPERIMENTAL STUDIES
We provide an apples-to-apples comparison of our four different BFS implementations. For both 1D and 2D distributions, we experimented with flat MPI as well as hybrid MPI+Threading versions. We use synthetic graphs based on the R-MAT random graph model [9], as well as a big real-world graph that represents a web crawl of the UK domain [6] (uk-union). The R-MAT generator creates networks with skewed degree distributions and a very low graph diameter. We set the R-MAT parameters a, b, c, and d to 0.59, 0.19, 0.19, 0.05, respectively. These parameters are identical to the ones used for generating synthetic instances in the Graph 500 BFS benchmark. R-MAT graphs make for interesting test instances: traversal load-balancing is non-trivial due to the skewed degree distribution, the graphs lack good separators, and common vertex relabeling strategies are also expected to have a minimal effect on cache performance. The diameter of the uk-union graph is significantly higher (≈ 140) than R-MAT's (less than 10), allowing us to assess the sensitivity of our algorithms with respect to the number of synchronizations. We use undirected graphs for all our experiments, but the BFS approaches can work with directed graphs as well.
To compare performance across multiple systems using a rate analogous to the commonly used floating point operations/second, we normalize the serial and parallel execution times by the number of edges visited in a BFS traversal and present a 'Traversed Edges Per Second' (TEPS) rate. For a graph with a single connected component (or one strongly connected component in the case of directed networks), the baseline BFS algorithm would visit every edge twice (once in the case of directed graphs). We only consider traversal execution times from vertices that appear in the large component, compute the average time using at least 16 randomly chosen source vertices for each benchmark graph, and normalize the time by the cumulative number of edges visited to get the TEPS rate. As suggested by the Graph 500 benchmark, we first symmetrize the input to model undirected…
2D communication cost: β_{N,a2a}(pc) · m/p + β_{N,ag}(pr) · n/pc
i. Allgather becomes the bottleneck as concurrency increases.
ii. Allgather is sensitive to sparsity.
A higher-diameter graph
[Figure 6: Running times of the 2D algorithms on the uk-union data set on Hopper (lower is better); mean search time (secs), split into computation and communication, for 2D Flat and 2D Hybrid at 500, 1000, 2000, and 4000 cores. The running times translate into a maximum of 3 GTEPS performance, achieving a 4× speedup when going from 500 to 4000 cores.]
…and the 2D algorithms increases in favor of the 1D algorithm as the graph gets sparser. The empirical data supports our analysis in Section 5, which stated that the 2D algorithm's performance was limited by the local memory accesses to its relatively larger vectors. For a fixed number of edges, the matrix dimensions (hence the length of the intermediate vectors) shrink as the graph gets denser, partially nullifying the cost of local cache misses.
We show the performance of our 2D algorithms on the real uk-union data set in Figure 6. We see that communication takes a very small fraction of the overall execution time, even on 4K cores. This is a notable result because the uk-union dataset has a relatively high diameter and the BFS takes approximately 140 iterations to complete. Since communication is not the most important factor, the hybrid algorithm is slower than flat MPI, as it has more intra-node parallelization overheads.
To compare our approaches with prior and reference distributed-memory implementations, we experimented with the Parallel Boost Graph Library's (PBGL) BFS implementation [21] (Version 1.45 of the Boost library) and the reference MPI implementation (Version 2.1) of the Graph 500 benchmark [20]. On Franklin, our Flat 1D code is 2.72×, 3.43×, and 4.13× faster than the non-replicated reference MPI code on 512, 1024, and 2048 cores, respectively.
Since PBGL failed to compile on the Cray machines, we ran comparison tests on Carver, an IBM iDataPlex system with 400 compute nodes, each node having two quad-core Intel Nehalem processors. PBGL suffers from severe memory bottlenecks in graph construction that hinder scalable creation of large graphs. Our results are summarized in Table 1 for graphs that ran to completion. We are up to 16× faster than PBGL even on these small problem instances.
Extensive experimentation reveals that our single-node multithreaded BFS version (i.e., without the inter-node communication steps in Algorithm 2) is also extremely fast. We compare the Nehalem-EP performance results reported in the work by Agarwal et al. [1] with the performance on a single node of the Carver system (also Nehalem-EP), and notice that for R-MAT graphs with average degree 16 and 32 million vertices, our approach is nearly 1.30× faster. Our approach is also faster than the BFS results reported by Leiserson et al. [25] on the KKt_power, Freescale1, and Cage14 test instances, by up to 1.47×.

Table 1: Performance comparison with PBGL on Carver. The reported numbers are in MTEPS for R-MAT graphs with the same parameters as before. In all cases, the graphs are undirected and edges are permuted for load balance.

  Core count | Code    | Scale 22 | Scale 24
  128        | PBGL    | 25.9     | 39.4
  128        | Flat 2D | 266.5    | 567.4
  256        | PBGL    | 22.4     | 37.5
  256        | Flat 2D | 349.8    | 603.6
7. CONCLUSIONS AND FUTURE WORK
In this paper, we present a design-space exploration of distributed-memory parallel BFS, discussing two fast "hybrid-parallel" approaches for large-scale graphs. Our experimental study encompasses performance analysis on several large-scale synthetic random graphs that are also used in the recently announced Graph 500 benchmark. The absolute performance numbers we achieve on the large-scale parallel systems Hopper and Franklin at NERSC are significantly higher than prior work. The performance results, coupled with our analysis of the communication and memory access costs of the two algorithms, challenge the conventional wisdom that fine-grained communication is inherent in parallel graph algorithms and necessary for achieving high performance [26].
We list below optimizations that we intend to explore in future work, and some open questions related to the design of distributed-memory graph algorithms.
Exploiting symmetry in undirected graphs. If the graph is undirected, then one can save 50% space by storing only the upper (or lower) triangle of the sparse adjacency matrix, effectively doubling the size of the maximum problem that can be solved in-memory on a particular system. The algorithmic modifications needed to save a comparable amount in communication costs for BFS iterations are not well studied.
Exploring alternate programming models. Partitioned global address space (PGAS) languages can potentially simplify the expression of graph algorithms, as inter-processor communication is implicit. In future work, we will investigate whether our two new BFS approaches are amenable to expression using PGAS languages, and whether they can deliver comparable performance.
Reducing inter-processor communication volume with graph partitioning. An alternative to randomization of vertex identifiers is to use hypergraph partitioning software to reduce communication. Although hypergraphs are capable of accurately modeling the communication costs of sparse matrix-dense vector multiplication, the SpMSV case has not been studied, and it is potentially harder as the sparsity pattern of the frontier vector changes over the BFS iterations.
Inter-processor collective communication optimization. We conclude that, even after alleviating the communication costs, the performance of distributed-memory parallel BFS is heavily dependent on the inter-processor collective communication routines All-to-all and Allgather. Understanding the bottlenecks in these routines at high process concurrencies, and designing network topology-aware collective…
• Union of multiple 2005 crawls of the .uk domain
• Vertices: 133M; edges: 5.5B; diameter ≥ 140; run on Hopper
• Speedup: 4× when going from 500 to 4000 cores
• Hybrid is slower since remote communication is not the bottleneck
(Lower is better in the figure above)
Conclusions / Contributions
• In-depth analysis and evaluation of all 4 combinations of parallel BFS on distributed memory:
  1D Flat MPI          | 2D Flat MPI
  1D Hybrid MPI+OpenMP | 2D Hybrid MPI+OpenMP
• A performance model that captures both local and non-local memory accesses, as well as collectives
• Novel 2D hybrid approach: up to 3.5× reduction in communication time compared to 1D
• Scaling to 40K cores on "scale-free" graphs
Some future work
1. Optimal processor grid dimensions (p = pr × pc) depend on:
   • Graph size
   • Graph density
   • Desired concurrency
   • Target architecture
   A great opportunity for autotuning.
2. Performance depends heavily on collectives performance.
   • Non-torus partitions -> unpredictable performance
   • Topology-aware collectives (Edgar Solomonik's talk at 4:30 today)
3. Graph/hypergraph partitioning for reducing communication
4. Prospect of using PGAS languages
Extra slides
Graph characteristics
• Skewed degree distribution: some very high-degree vertices
• Edge cut is O(m)
• Remote communication is the bottleneck
• Load balance initially an issue; partially solved by random vertex relabeling
[Figure: spy plots of A = rmat(15) and A(r,r) for a random permutation r]
1D hybrid BFS, step-by-step
Step 1. Explore adjacencies of current frontier vertices. Each of the p processes holds its local graph (edges/vertices), and its t threads append discovered vertices to p-1 per-destination stacks; their aggregate size is the edge cut, O(m_local).
Step 2. All-to-all communication to exchange visited vertices ('Alltoallv' collective, executed by the master thread).
Step 3. Update the local frontier for the next iteration (t threads per process).
[Figure: processes 0 … p-1, each with t threads, repeating steps 1-3]
2D algorithm, step-by-step
[Figure: initial 2D matrix and vector distributions, interleaved on each other; A (the sparse adjacency matrix of the graph) is partitioned into 3 × 3 blocks A_{i,j} of size n/pr × n/pc, and the vector x into matching sub-blocks x_{i,j}]
After the vector transpose phase:
• For r = c = sqrt(p), pairwise communication only
• Takes < 0.25% of wall-clock time
Function: SendRecv. Size: O(n/p).
After Allgatherv on processor columns:
• The (n/c)-sized sub-vector gets replicated along the column
• Most expensive communication step
Function: Allgatherv. Size: O(n/c).
Local computation: SpMSpV y = A · x(frontier) with a sparse accumulator (SPA); gather from the (n/pc)-long input vector, scatter/accumulate into the (n/pr)-long output vector.
Local memory references:
  β_L · m/p + α_{L,n/pc} · n/p + α_{L,n/pr} · m/p
After the local sparse-matrix × sparse-vector multiply, Alltoallv sends the pieces of y to their owners.
Function: Alltoallv. Size: O(m/p).
Franklin weak scaling
• NERSC Franklin (Cray XT4, SeaStar interconnect, AMD Budapest)
• Graph500: Scales 29, 30, 31, 32 at 512, 1024, 2048, 4096 cores, respectively
• Lower is better for both graphs
[Figures: execution time and communication time (seconds) vs. number of cores (512-4096) for 1D Flat MPI, 2D Flat MPI, 1D Hybrid, and 2D Hybrid]
Recent improvements (July 2011)
Two bottlenecks for the 2D algorithm:
1) Local computation is expensive due to the O(n/pr) working set
2) Communication involves 64-bit indices
Remedies:
1) Avoid the dense vector (SPA) in the local SpMSV:
   - Select the vertex with the minimum label as parent (a bit vector suffices)
2) Local indices are 32-bit addressable in 2 out of 3 cases:
   - The receiver applies the correct sender "offset"
[Figures: communication and computation time (sec) vs. number of cores (1224-10008), May-11 baseline vs. improved]
Most recently (November 2011)
A reimplementation of the 2D algorithm:
1. Transposed processor grid (feedback from micro-benchmarks): Alltoallv along columns, Allgatherv along rows.
2. BFS-persistent local duplicate elimination: keep locally discovered vertices until the BFS completes.