View
2
Download
0
Category
Preview:
Citation preview
Reservoir Labs
N.Vasilache, B. Meister, M. Baskaran, R.Lethin
BFS preconditioning for high locality, data parallel BFS algorithm
Reservoir Labs
• Streaming Graph Challenge Characteristics:• Large scale• Highly dynamic• Scale-free• Massive parallelism, data movement and synchronization is
key• Completely unpredictable• In this talk, focusing on breadth-first search
Problem
Reservoir Labs
• Dynamic exploration algorithm:• Computes a single source shortest path• O (V + E) complexity -> important• First graph500 problem• Comes in a variety of base implementation:
– sequential list– sequential csr, omp csr– mpi– MTGL
• Best 2010 implementation is IBM’s MPI
• We propose a solution to optimize the run of a single BFS (graph500 requirement). Rely on a test run, BFS_0 to precondition placement, locality and parallelism.
Breadth-First Search
Reservoir Labs
• Assume the result of a first BFS run is available (BFS_0)• In the form provided by graph500 (list of fathers or -1 if root)• BFS_0 can be viewed as an ordering (a traversal) of connected vertices
consistent with an ordering of the edges.
• Construct a representation that exploits BFS_0• Parallel distributed construction• Data parallel programming idioms
• Reuse the representation for subsequent BFS runs• Improve parallelism and locality• Must be profitable including the overhead of the representation• Must bring improvement on a single BFS run (graph500 requirement)• Very preliminary work
High-Level Ideas
Reservoir Labs
• In BFS_0 order: siblings are contiguous, children are localized (recursively), parent is not too far, potential neighbors not too far
• Given a graph and a potential BFS:• Full edges are actually used in BFS, dashed edges are potential
edges, red dashed edges are illegal edges
BFS_0 interesting properties
Reservoir Labs
• Additional “structural information” on the graph carried by BFS_0• Potential neighbors of “f” are in the grey region• Distance between “f” and “d” is 1 or 2 at most • Clear separation of potential vertices in 3 classes depending on the
depth in BFS_0 relative to the depth of “f”
BFS_0 interesting properties (continued)
Reservoir Labs
• We want to reuse as much information from BFS_0 as possible• Given 1 visited node “N”, 3 classes are really data independent
regions (-1 depth, same depth, +1 depth)• Additionally, we distinguish Children of N from other Nephew nodes
Sketch of proposed algorithm
Reservoir Labs
• We give highest importance to NChildren relation:– position of first child > position of any node in {PlusOne – Children} simple criterion for parallel processing
– Children are contiguous, Nephews are not– visiting Children in the BFS_0 order must be stable under recursion
• suppose I want a shortest path from i to g:– ibg is impossible– idg is ok but lacks
structure– ifg is much better
(recursively contiguous and data parallel)
– Children are important, we hope there many
Sketch of proposed algorithm
Reservoir Labs
• “Discover-and-Merge-and-Mark” algorithm
• Given a single starting node we can explore 4 regions in parallel (C,P,S,M)
• Order of commit is M,S,P,C – for a node at distance 2 from N and at same depth as N in BFS_0, a
transition MC must be favored over a transition SS– This order guarantees recursive consistency of children relation– In general, nodes should be marked in the BFS_0 order
• Order of commit is relevant for nodes discovered at the same distance and same depth in BFS_0 wrt the starting node
• 3 parameters to order traversal: distance, depth, list of transitions lattice of transitions and synchronizations
Sketch of proposed algorithm
Reservoir Labs
• Let height the height of BFS_0, the maximal distance is 2*height• Maximal depth difference is [-height, height] (can be refined)
• Arrows represent producer/consumer dependences:• 2-D and uniform dependences pipelined parallelism (for free !)• Transitions and edge direction are related:
– Bottom-Left edge is an M transition ([D,d] [D+1, d-1])– Vertical edge is an S transition ([D,d] [D+1, d])– Bottom-Right edge is a (C | P) transition ([D,d] [D+1, d+1])
Lattice of Transitions
D=0
D=1
D=2
D=3
Start Node
d=-1 d=0 d=1
d=-1 d=0 d=1d=-2 d=2
d=3d=-3 d=-1 d=0 d=1d=-2 d=2
Reservoir Labs
• Little less constrained than real pipelined parallelism • Some tasks have only 1 or 2 predecessors, relaxed ordering
• CPSM transitions allow inter-region parallelism and gives a third dimension of parallelism/synchronizations:
• Unable to exploit it yet (need dynamic dependences otherwise too many empty tasks are created)
Available Parallelism
D=0
D=1
D=2
D=3
Start Node
d=-1 d=0 d=1
d=-1 d=0 d=1d=-2 d=2
d=3d=-3 d=-1 d=0 d=1d=-2 d=2
Reservoir Labs
• Graph500 output:• bfs_0_list (for each vertex his father)• xadj (list of edges in compact array)• xoff (for each node, offset of first and last edge in xadj)
• Overhead representation we propose:• bfs_order (for each vertex id, the order it was discovered) slight
extension of original seq-csr• bfs_0_list (for each position in BFS_0, get the vertex id) sort
(used for finalization, maybe not needed)• bfs_0_list_of_positions (for each vertex id, get the list of positions)• num_children(tmp, doall), ordered_num_children(tmp, doall),
pps_num_children(PPS, implemented sequentially), pps_depths (PPS)• xoff + xadj wrt BFS_0 (doall + PPS), categorized in v2 (doall)• Takes as much as 1 sequential run at the moment
Overhead Representation From BFS_0
Reservoir Labs
• CnC and C++ implementation:• Pointers to helper data structures• All discovered nodes are copied using data collections
• CnC task granularity is a pair (D,d):• Generate exactly height * (2*height + 1) / 2 tasks• Synchronization is easy to write:
– Each tasks gets input from its predecessors at (D-1, d-1) and/or (D-1,d) and/or (D-1,d+1)
– Each tasks puts data at (D,d)
• Within a CnC task:• Get input from (D-1, d-1), discover/mark C transition, discover/mark
P transition. Get input from (D-1, d), discover/mark S transition. Get input from (D-1, d+1), discover/mark M transition.
Implementation
Reservoir Labs
• Within a CnC task, everything is sequentialized:• Ability to spawn asynchronous tasks would be useful• Very coarse-grained parallelism (1 task is 4 regions, each region may
touch many elements in parallel)
• 2 implementations:• “intvec” uses the list of edges in the original graph• “intvec.cat” categorizes the edges by region for faster region
traversal• A lot of untuned overhead• Slowdown … but still valuable information
Implementation (continued)
Reservoir Labs
• Biggest example I ran ("size 22" in graph500 terminology):• Scale free graph, 4M vertices and 88M edges• The height of the BFS tree is only 7 small world property• The total number of CnC tasks created is only 14*7 / 2 = 49 • Of these 49 tasks, only a fraction actually perform work, maybe 10 • The work performed is extremely unbalanced:
– one task can discover up to 1M new nodes – others discover only 1.
• ~70% of discovery and marks happen by visiting C transitions– children are all contiguous in BFS_0 which gives great locality– children have good synchronization properties: they can all be
processed in parallel
• Need to spawn subtasks
Statistical Analysis
Reservoir Labs
• There is literally almost no parallelism exploited, but a lot is available• Spawn async tasks• Reduce+prefix to deal efficiently with large C regions
• Tried another implementation:• CnC task is now (D, d, last_transition)• For each (D,d), fan-out 4 “discover” tasks CP // S // M• For each (D,d), reduce 1 “merge-and-mark” task dependent on these 4
tasks• Additionally, each discover task can be broken down into a static
number of pieces to try and process in parallel• VERY crude way of representing async and prefix-reduction• Huge overhead (between 2 and 4x over the previous CnC version)
Another Implementation
Reservoir Labs
• Examine overhead (memory leaks, spurious copies, inefficient hashing, too many tasks created, no dependence specified etc)
• Hierarchical parallelism
• Distributed implementation
• Complement with DFS preconditioning (all recursive children become contiguous)
Future Work
Recommended