Graph Partitioning for High-Performance
Scientific Simulations
Advanced Topics Spring 2009 Prof. Robert van Engelen
HPC II Spring 2009 2 1/20/09
Overview
Challenges for irregular meshes
Modeling mesh-based computations as graphs
Static graph-partitioning techniques
  Geometric techniques
  Combinatorial techniques
  Spectral methods
  Multilevel schemes
Load balancing of adaptive computations
  Scratch-remap repartitioners
  Diffusion-based repartitioners
Multiconstraint graph partitioning
Publicly available software packages
A Partitioned 2D Irregular Mesh of an Airfoil
Challenges for Irregular Meshes
How to generate irregular meshes in the first place?
  Triangulation, 2D/3D finite-element mesh
How to partition them?
  Optimal graph partitioning is an NP-complete problem
  Heuristic edge-cut methods
  Sequential or parallel partitioning methods?
  Static or dynamic (adaptive) partitioning methods?
How to design iterative solvers?
  PETSc, Prometheus, …
How to design direct solvers?
  Reordering the adjacency matrix, e.g. reverse Cuthill-McKee
  SuperLU, parallel Gaussian elimination, …
Node Graph and Dual
Graphs model the structure of the computation
Node graph (mesh nodes compute):
  Vertices = mesh nodes (nodes compute)
  Edges = communication between nodes
Dual graph (mesh elements compute):
  Vertices = mesh elements
  Edges = communication between adjacent mesh elements
  Exchanges take place for every face between adjacent elements

[Figure: a 2D irregular mesh with its node graph and dual graph]
Matrix - Graph Relationship
[Figure: an 8×8 symmetric adjacency matrix (rows/columns 1–8, with 1s marking edges) shown side by side with the corresponding graph on vertices 1–8]
Can reorder rows and columns of the matrix
Non-zeros outside of the diagonal blocks require communication
An optimal partition of the graph for parallel computation:
  Equal number of vertices in subdomains
  Lowest number of edges between subdomains
Subdomains are mapped to the processors
Finite-Element Mesh Example
Image source: MATLAB 7.5 NASA airfoil demo
NASA airfoil finite-element mesh, 4253 grid points
Adjacency Matrix and Reverse Cuthill-McKee Reordering
Image source: MATLAB 7.5 NASA airfoil demo
Adjacency Matrix Reverse Cuthill-McKee
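The reordering above is straightforward to sketch; below is a minimal pure-Python version of reverse Cuthill-McKee, assuming the graph is given as an adjacency dict (for real work one would use e.g. MATLAB's symrcm or SciPy's reverse_cuthill_mckee):

```python
from collections import deque

def reverse_cuthill_mckee(adj):
    """Reverse Cuthill-McKee ordering for a symmetric adjacency structure.
    adj: dict mapping vertex -> iterable of neighbors.
    Returns a permutation (list of vertices) that tends to reduce bandwidth."""
    order, visited = [], set()
    # Process each connected component, starting from a minimum-degree vertex
    for start in sorted(adj, key=lambda v: len(adj[v])):
        if start in visited:
            continue
        visited.add(start)
        queue = deque([start])
        while queue:
            v = queue.popleft()
            order.append(v)
            # Visit unvisited neighbors in order of increasing degree
            for w in sorted(adj[v], key=lambda u: len(adj[u])):
                if w not in visited:
                    visited.add(w)
                    queue.append(w)
    order.reverse()  # the "reverse" in RCM
    return order

# Path graph 0-1-2-3: RCM returns a relabeling that keeps the band narrow
print(reverse_cuthill_mckee({0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}))
```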
Goals of Partitioning
The balance constraint:
  Balance computational load such that each processor has the same execution time
  Balance storage such that each processor has the same storage demands
Minimum edge cut:
  Minimize communication volume between subdomains, along the edges of the mesh
5-cut 4-cut
Example 3-way partition with edge-cut = 9
Partitioning Problem
Let G = (V, E) be a weighted undirected graph with weight functions wV: V → R+ and wE: E → R+
The k-way graph partitioning problem is to split V into k disjoint subsets (subdomains) Sj, j = 1, …, k, such that:
  Balance constraint: ∑v ∈ Sj wV(v) is roughly equal for all j = 1, …, k
  Min-cut (minimum edge-cut): ∑ wE((u,v)) over edges (u,v) ∈ E with u ∈ Si, v ∈ Sj, i ≠ j is minimal
The weight functions are defined such that:
  wV models CPU loads
  wE models communication volumes (+ fixed startup latency)
Can add subdomain weights to improve mapping to heterogeneous clusters of workstations
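Both objectives are easy to evaluate for a candidate partition; a small sketch (graph given as weight dicts; the names are illustrative):

```python
def evaluate_partition(wv, we, parts):
    """Evaluate a k-way partition against the two objectives above.
    wv: dict vertex -> weight (models CPU load)
    we: dict frozenset({u, v}) -> weight (models communication volume)
    parts: dict vertex -> subdomain index
    Returns (per-subdomain load, weighted edge-cut)."""
    loads = {}
    for v, w in wv.items():
        loads[parts[v]] = loads.get(parts[v], 0) + w
    # An edge is cut when its endpoints lie in different subdomains
    cut = sum(w for e, w in we.items()
              if len({parts[v] for v in e}) > 1)
    return loads, cut

# 4-cycle 0-1-2-3-0 with unit weights, split {0,1} vs {2,3}
wv = {v: 1 for v in range(4)}
we = {frozenset(e): 1 for e in [(0, 1), (1, 2), (2, 3), (3, 0)]}
loads, cut = evaluate_partition(wv, we, {0: 0, 1: 0, 2: 1, 3: 1})
print(loads, cut)  # balanced loads, edge-cut 2
```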
Edge Cuts Versus Communication Steps
Caveat: min edge cut ≠ optimal communication
The communication pattern does not follow the edges exactly
Edge-cut: 7
Communication steps: 9
A sends 1 to B A sends 3 to B A sends 4 to C B sends 5 to A B sends 7 to A B sends 7 to C B sends 8 to C C sends 9 to B C sends 10 to A
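The send list above can be modeled as one message per (vertex, remote subdomain) pair: a vertex with several cut edges into the same subdomain sends once, but both endpoints of a cut edge send. A sketch on a tiny hypothetical graph:

```python
def edge_cut_and_messages(edges, parts):
    """Count cut edges versus point-to-point communication steps.
    A message is needed once per (vertex, remote subdomain) pair,
    so the two counts generally differ."""
    cut_edges = [(u, v) for u, v in edges if parts[u] != parts[v]]
    msgs = ({(u, parts[v]) for u, v in cut_edges}
            | {(v, parts[u]) for u, v in cut_edges})
    return len(cut_edges), len(msgs)

# Two vertices of subdomain A both border the same vertex of B
edges = [("a1", "b1"), ("a2", "b1")]
parts = {"a1": "A", "a2": "A", "b1": "B"}
print(edge_cut_and_messages(edges, parts))  # (2, 3): 2 cut edges, 3 sends
```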
Static Graph Partitioning
Geometric techniques
  Coordinate Nested Dissection (CND)
  Recursive Inertial Bisection (RIB)
  Space-filling curve techniques
  Sphere-cutting approach
Combinatorial techniques
  Levelized Nested Dissection (LND)
  Kernighan-Lin/Fiduccia-Mattheyses (KL/FM)
Spectral methods
Multilevel schemes
  Multilevel recursive bisection
  Multilevel k-way partitioning
Geometric Techniques
Partition solely based on coordinate information
Applicable when a coordinate system exists or can be constructed
Goal is to group together vertices that are spatially near each other
Recursively bisects the mesh into increasingly smaller subdomains
Geometric techniques are typically fast
Have no concept of edge cut, so no communication optimization
May produce disconnected subdomains for meshes with complex geometries
CND
Coordinate Nested Dissection (CND) is a geometric method:
  Compute centers of mass of mesh elements
  Project onto the axis of the longest dimension
  Bisect the list of centers
  Repeat recursively
Good: fast and easy to parallelize
Bad: low quality partitioning
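The four CND steps above can be sketched as follows, operating directly on element centers (a sketch; assumes k is a power of two):

```python
def cnd(points, k):
    """Coordinate Nested Dissection sketch: recursively bisect a list of
    element centers normal to the axis of longest extent."""
    if k == 1:
        return [points]
    # Pick the axis with the largest coordinate spread
    dims = len(points[0])
    axis = max(range(dims),
               key=lambda d: max(p[d] for p in points) - min(p[d] for p in points))
    # Bisect the list of centers sorted along that axis
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return cnd(pts[:mid], k // 2) + cnd(pts[mid:], k // 2)

# Four centers in a 2:1 box; the first cut is normal to the x-axis
parts = cnd([(0, 0), (1, 0), (2, 1), (3, 1)], 2)
print(parts)
```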
CND (cont'd)
Mesh is bisected normal to the axis of the longest dimension
  Results in smaller subdomain boundaries
  Approximates smaller data volume exchanges

[Figure: the mesh bisected normal to the x-axis versus normal to the y-axis]
CND (cont'd)
For complicated geometries CND may produce disconnected subdomains, lowering the quality of the computational solution
Variations of CND have been proposed to mitigate this disadvantage
RIB
Recursive Inertial Bisection (RIB) orients the bisection to minimize the subdomain boundary:
  Mesh elements are converted into point masses
  Compute the principal inertial axis of the mass distribution
  Bisect orthogonal to the principal inertial axis
  Repeat recursively
Fast, and better quality than CND
CND RIB
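In 2D the principal axis follows from the second moments of the point cloud, so one RIB bisection step fits in a few lines (a sketch with unit masses; the helper name is illustrative):

```python
import math

def rib_bisect(points):
    """One Recursive Inertial Bisection step (2D sketch): treat points as
    unit masses, find the principal inertial axis, bisect orthogonal to it."""
    n = len(points)
    cx = sum(p[0] for p in points) / n
    cy = sum(p[1] for p in points) / n
    # Second moments of the centered point cloud
    sxx = sum((p[0] - cx) ** 2 for p in points)
    syy = sum((p[1] - cy) ** 2 for p in points)
    sxy = sum((p[0] - cx) * (p[1] - cy) for p in points)
    # Direction of the principal axis of a 2x2 symmetric moment matrix
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    ax, ay = math.cos(theta), math.sin(theta)
    # Project onto the principal axis and split at the median
    pts = sorted(points, key=lambda p: (p[0] - cx) * ax + (p[1] - cy) * ay)
    mid = n // 2
    return pts[:mid], pts[mid:]

# Points along the line y = x: the principal axis is the diagonal,
# so the cut is orthogonal to it
a, b = rib_bisect([(0, 0), (1, 1), (2, 2), (3, 3)])
print(a, b)
```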
Space-Filling Curve Techniques
CND and RIB consider only a single dimension at a time
Space-filling curve techniques follow the centers of mass of mesh elements using locality-preserving curves, e.g. Peano-Hilbert curves
  Curve segments fill higher-dimensional subspaces
  Split the curve into k parts, resulting in k subdomains
Good: fast and generally better quality than CND and RIB
Bad: works better for certain types of problems, e.g. n-body simulations
8-way partition using a Peano-Hilbert space-filling curve
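A sketch of the idea, using the simpler Z-order (Morton) curve in place of a Peano-Hilbert curve (the function names are illustrative):

```python
def morton_key(x, y, bits=16):
    """Interleave the bits of integer coordinates (Z-order / Morton code),
    a simpler locality-preserving curve than Peano-Hilbert."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i) | ((y >> i) & 1) << (2 * i + 1)
    return key

def sfc_partition(points, k):
    """Sort element centers along the curve and split into k equal runs."""
    pts = sorted(points, key=lambda p: morton_key(*p))
    n = len(pts)
    return [pts[i * n // k:(i + 1) * n // k] for i in range(k)]

# Four unit cells split 2 ways: the curve keeps each half spatially compact
print(sfc_partition([(0, 0), (1, 0), (0, 1), (1, 1)], 2))
```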
The Sphere-Cutting Approach
Approach separates the mesh by vertices in an overlap graph
Create an (α,k)-overlap graph from a k-ply neighborhood system:
  No point is contained in more than k d-dimensional spheres
  Nodes are connected when the spheres, scaled up by α, intersect
Overlap graphs have O(n^((d-1)/d)) vertex separators
Project the overlap graph onto a (d+1)-dimensional sphere, which is then cut

[Figure: mesh nodes, a 3-ply neighborhood system, and the resulting (1,3)-overlap graph]
Combinatorial Techniques
The problem with geometric techniques is that they group vertices that are spatially close using a coordinate system, whether the vertices are connected or not
Combinatorial partitioning techniques instead use adjacency information to group together vertices that are highly connected
  Smaller edge-cuts
  Reasonably fast
  But not easily parallelizable
LND
Levelized Nested Dissection (LND):
  Select a vertex v0, preferably a (pseudo) peripheral vertex
  For each vertex, compute its distance to v0 using a breadth-first search starting from v0
  When half of the vertices have been assigned, split the graph in two (assigned, unassigned)
Can be repeated with several trial choices of v0 to improve the edge-cut
LND cut Min-cut
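A minimal LND bisection via breadth-first search (a sketch; ties are broken by adjacency order, and the balance is exact only for even vertex counts):

```python
from collections import deque

def lnd_bisect(adj, v0):
    """Levelized Nested Dissection: grow a region from v0 in BFS order
    and split once half the vertices have been assigned."""
    half = len(adj) // 2
    assigned = {v0}
    queue = deque([v0])
    while queue and len(assigned) < half:
        v = queue.popleft()
        for w in adj[v]:
            if w not in assigned and len(assigned) < half:
                assigned.add(w)
                queue.append(w)
    return assigned, set(adj) - assigned

# 6-vertex path; v0 = 0 is a peripheral vertex, so the cut is a single edge
adj = {i: [j for j in (i - 1, i + 1) if 0 <= j < 6] for i in range(6)}
print(lnd_bisect(adj, 0))
```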
KL Partition Refinement
Kernighan-Lin (KL) partition refinement
Given a partition of the vertices V into disjoint sets A and B
General idea: find X ⊆ A and Y ⊆ B such that swapping X to B and Y to A yields the greatest reduction in edge-cut
  Finding optimal X and Y is intractable
  KL performs repeated passes over V, each pass taking O(|V|^2) time
  Each KL pass swaps two vertices, one from A with one from B

[Figure: edge-cut 6 before the KL pass, edge-cut 3 after the KL pass]
KL/FM Partition Refinement

Kernighan-Lin/Fiduccia-Mattheyses (KL/FM) partition refinement
  FM considers moving one vertex at a time
  Works by annotating each vertex along the partition with gain/loss values
  Move the vertex that gives a gain
  May move a vertex even when the gain is negative, to avoid a local minimum

[Figure sequence: the refinement gets stuck in a local minimum, explores a move that increases the edge cut while meeting the balance constraint, then continues with moves that reduce the edge cut while meeting the balance constraint]
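A compact sketch of one FM-style pass: it moves the best-gain vertex even when the gain is negative and keeps the best partition seen. The balance constraint is omitted here for brevity, and vertex labels are assumed orderable:

```python
def fm_pass(adj, side):
    """One Fiduccia-Mattheyses-style pass (sketch). side maps vertex -> 0/1.
    Moves every vertex once, best gain first, and returns the best
    (partition, edge-cut) encountered along the way."""
    def cut(s):
        # Count each cut edge once (labels assumed orderable)
        return sum(1 for u in adj for v in adj[u] if u < v and s[u] != s[v])

    side = dict(side)
    best, best_cut = dict(side), cut(side)
    locked = set()
    while len(locked) < len(adj):
        # Gain of moving v = external edges minus internal edges
        gains = {v: sum(1 if side[w] != side[v] else -1 for w in adj[v])
                 for v in adj if v not in locked}
        v = max(gains, key=gains.get)
        side[v] = 1 - side[v]   # move v, even if the gain is negative
        locked.add(v)
        c = cut(side)
        if c < best_cut:
            best, best_cut = dict(side), c
    return best, best_cut

# Two triangles joined by one edge, starting from a poor split (cut 5):
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
part, c = fm_pass(adj, {0: 0, 1: 0, 2: 1, 3: 0, 4: 1, 5: 1})
print(part, c)  # finds the natural split with edge-cut 1
```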
Spectral Methods
Given a graph G with adjacency matrix A
The discrete Laplacian matrix is LG = D − A, with D the diagonal matrix such that Dii = degree of vertex i
Smallest eigenvalue of LG is 0
The second-smallest eigenvalue (the algebraic connectivity) measures graph connectivity
The Fiedler vector (the eigenvector corresponding to the second-smallest eigenvalue) measures distance between vertices
Sort the Fiedler vector and split the graph according to the sorted list of vertices
Recursive spectral bisection computes a k-way partition recursively
Good: high quality subdomain splitting
Bad: computing the Fiedler vector is expensive
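A sketch of one spectral bisection step with NumPy, using the convention L = D − A, under which the Fiedler vector corresponds to the second-smallest eigenvalue:

```python
import numpy as np

def spectral_bisect(adj):
    """Spectral bisection sketch: sort vertices by their Fiedler vector
    entry and split the sorted list in half.
    adj: dict with integer vertex labels 0..n-1."""
    n = len(adj)
    A = np.zeros((n, n))
    for u, nbrs in adj.items():
        for v in nbrs:
            A[u, v] = 1.0
    L = np.diag(A.sum(axis=1)) - A
    # eigh returns eigenvalues in ascending order, so column 1 of the
    # eigenvector matrix is the Fiedler vector
    _, vecs = np.linalg.eigh(L)
    order = np.argsort(vecs[:, 1])
    return set(order[:n // 2].tolist()), set(order[n // 2:].tolist())

# 6-vertex path: the Fiedler vector orders vertices along the path,
# so the bisection cuts exactly one edge
adj = {i: [j for j in (i - 1, i + 1) if 0 <= j < 6] for i in range(6)}
print(spectral_bisect(adj))
```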
Spectral Methods (cont'd)
Multilevel Schemes
Multilevel schemes consist of recursively applying:
  1. Graph coarsening
  2. Initial partitioning
  3. Partition refinement
Multilevel Schemes (cont’d)
Multilevel recursive bisection
  Coarsening collapses vertices in pairs
    Pick random pairs
    Or pick heavy-edge pairs
  Partitioning is performed on the coarsest (smallest) graph using recursive bisection methods
  Refinement with KL/FM on the uncoarsened graph
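Heavy-edge coarsening can be sketched as a greedy matching; the visit order and tie-breaking below are arbitrary choices of this sketch:

```python
def heavy_edge_matching(adj, wts):
    """Coarsening by heavy-edge matching (sketch): visit vertices and match
    each with its unmatched neighbor of heaviest edge weight. Matched pairs
    collapse into one coarse vertex; wts maps frozenset({u, v}) -> weight."""
    matched = {}
    for v in adj:
        if v in matched:
            continue
        cands = [w for w in adj[v] if w not in matched]
        if cands:
            w = max(cands, key=lambda u: wts[frozenset((v, u))])
            matched[v] = w
            matched[w] = v
        else:
            matched[v] = v  # no unmatched neighbor: carried over as-is
    return matched

# 4-cycle with two heavy edges: the heavy edges get collapsed
adj = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
wts = {frozenset(e): w for e, w in [((0, 1), 5), ((1, 2), 1),
                                    ((2, 3), 5), ((3, 0), 1)]}
print(heavy_edge_matching(adj, wts))  # 0 pairs with 1, 2 pairs with 3
```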
Multilevel Schemes (cont’d)
Has no local minimum Has local minimum
Partitioning Methods Compared
From: [SRC] Ch. 18, K. Schloegel, G. Karypis, and V. Kumar
Load Balancing of Adaptive Computations
Adaptive graph partitioning methods
  Goal is to balance the load by redistributing data when the mesh changes dynamically
  Redistributing data is expensive, so it should be minimized

[Figure: a helicopter blade rotating through a mesh]
Load Balancing of Adaptive Computations (cont’d)
To measure the redistribution cost we define:
  TotalV is the sum of the sizes of vertices that change subdomain as a result of repartitioning (measures total data volume)
  MaxV is the maximum over subdomains of the sum of sizes of vertices that moved in/out of that subdomain (measures maximum send/recv time)
Repartitioners aim to minimize TotalV or MaxV
  Minimizing TotalV is easier
  Minimizing TotalV tends to also minimize MaxV
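TotalV and MaxV can be computed directly from the old and new assignments; a sketch (MaxV here sums data moving in or out of each subdomain):

```python
def redistribution_cost(size, old, new):
    """TotalV and MaxV for a repartitioning.
    size: vertex -> data size; old/new: vertex -> subdomain label."""
    moved = [v for v in size if old[v] != new[v]]
    totalv = sum(size[v] for v in moved)
    flow = {}
    for v in moved:
        flow[old[v]] = flow.get(old[v], 0) + size[v]  # leaves old subdomain
        flow[new[v]] = flow.get(new[v], 0) + size[v]  # enters new subdomain
    maxv = max(flow.values(), default=0)
    return totalv, maxv

# Vertices 0 and 3 swap subdomains; both A and B see 6 units in/out
size = {0: 2, 1: 3, 2: 1, 3: 4}
old = {0: "A", 1: "A", 2: "B", 3: "B"}
new = {0: "B", 1: "A", 2: "B", 3: "A"}
print(redistribution_cost(size, old, new))  # TotalV 6, MaxV 6
```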
Adaptive Graph Partitioning
Two repartitioning approaches:
  Scratch (and remap) repartition methods: partition from scratch
  Incremental repartition methods
    Cut-and-paste
    Diffusion-based methods
Repartitioning Approaches
[Figure: an imbalanced partitioning (vertex weights = 1) repaired by repartitioning from scratch, cut-and-paste repartitioning, and diffusive repartitioning]
Scratch-Remap Partitioning
Improves repartitioning from scratch
After repartitioning from scratch, remap vertices to match up with the old subdomains
  Lowers TotalV (and thus MaxV)
Similarity matrix elements Sqr = sum of the sizes of vertices in subdomain q of the old partition and subdomain r of the new partition

[Figure: the similarity matrix and the scratch-remap partitioned graph]
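A sketch of the remap step: build the similarity matrix and greedily keep the heaviest overlaps in place (an optimal label assignment would instead use e.g. the Hungarian method):

```python
def scratch_remap(size, old, new):
    """Relabel new subdomains onto old labels to reduce data movement.
    S[(q, r)] = total size of vertices in old subdomain q and new
    subdomain r; a greedy pass keeps the heaviest overlaps in place."""
    S = {}
    for v in size:
        S[(old[v], new[v])] = S.get((old[v], new[v]), 0) + size[v]
    remap, used_q, used_r = {}, set(), set()
    # Take (q, r) pairs by decreasing overlap and relabel r back to q
    for (q, r), _ in sorted(S.items(), key=lambda kv: -kv[1]):
        if q not in used_q and r not in used_r:
            remap[r] = q
            used_q.add(q)
            used_r.add(r)
    return {v: remap.get(new[v], new[v]) for v in size}

# The new partition is identical to the old one up to label names:
# remapping recovers the old labels, so no data moves at all
size = {v: 1 for v in range(4)}
old = {0: "A", 1: "A", 2: "B", 3: "B"}
new = {0: "Y", 1: "Y", 2: "X", 3: "X"}
print(scratch_remap(size, old, new))
```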
Incremental Versus Scratch-Remap
Scratch-remap results in higher redistribution costs (TotalV) compared to incremental methods that use local perturbations
Incremental partitioning with cut-and-paste
  Moves few vertices between subdomains to restore balance
Incremental diffusion-based methods

[Figure: an imbalanced partitioning repaired by incremental versus scratch-remap partitioning]
Diffusion-Based Repartitioners
Address two questions:
  How much work should be transferred between processors?
  Which tasks should be transferred?
Attempt to minimize the difference between the original and final partitioning by making incremental changes
Local diffusion schemes only consider workloads of localized groups of processors
Global diffusion schemes consider the entire graph
  Recursive bisection diffusion partitioners
  Adaptive space-filling curve partitioners
  Repartitioning based on flow solutions
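The simplest global scheme is first-order diffusion on the processor graph; a sketch (the step size alpha and iteration count are illustrative choices):

```python
def diffuse(load, adj, alpha=0.25, steps=50):
    """First-order diffusion sketch: each processor repeatedly shifts a
    fraction alpha of its load difference to each neighbor, converging
    toward the balanced average on a connected processor graph."""
    load = dict(load)
    for _ in range(steps):
        flow = {p: 0.0 for p in load}
        for p in load:
            for q in adj[p]:
                flow[p] += alpha * (load[q] - load[p])
        for p in load:
            load[p] += flow[p]
    return load

# 4 processors in a ring, all load initially on processor 0
load = diffuse({0: 8.0, 1: 0.0, 2: 0.0, 3: 0.0},
               {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]})
print(load)  # each value approaches the balanced average of 2.0
```

Load is conserved at every step (each pairwise flow appears with opposite signs), which is why the iteration settles at the average.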
Tradeoff Between Edge-Cut and Data Redistribution Costs
The objective of minimizing data redistribution cost is at odds with the objective of minimizing the edge-cut
Minimizing data redistribution cost is preferred over minimizing the edge-cut if the mesh is adapted frequently or the amount of per-element state data is high
Unified repartitioning is a diffusive repartitioning algorithm that gives the user control of the tradeoff:
  Reduce the sum of interprocessor communication overhead
  Reduce the data redistribution cost required to balance the load
Multiconstraint Graph Partitioning
There are sophisticated classes of simulations for which the traditional graph partitioning formulation is inadequate
Multiple constraints must be satisfied:
  Nonuniform memory requirements
  Nonuniform CPU requirements
Multiphase Simulations
Multiphase computations consist of m distinct computational phases, separated by synchronization
Each phase should be load balanced
  Compute a single partitioning that attempts to balance the loads of all m phases
  Or compute m partitionings, which requires data redistribution between phases
Example particle-in-cell computation:
  phase 1: mesh-based computation
  phase 2: particle-based computation

[Figure: a particle-in-cell mesh]
Multimesh Computations
An important class of emerging numerical methods
Combines:
  Structured grids to discretize PDEs
  Unstructured meshes to model complex geometries of objects
Requires interpolation to map data to finer meshes
How to partition so that all grids/meshes are balanced and interprocessor communication is minimized?
Domain Decomposition-Based Preconditioners
Use a preconditioner to minimize the number of nonzero elements off the block diagonal in the adjacency matrix
  Every nonzero element off the block diagonal corresponds to interprocessor communication between subdomains
Also need to minimize the magnitude of the nonzero elements off the block diagonal when edge weights are not 1
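Counting cut edges versus summing their weights gives two competing objectives; a small sketch on a hypothetical weighted graph where partitions with the same cut count differ sharply in magnitude:

```python
def cut_count_and_magnitude(wts, parts):
    """Number and total weight (magnitude) of cut edges for a weighted
    graph; wts maps frozenset({u, v}) -> edge weight."""
    cut = [w for e, w in wts.items()
           if len({parts[v] for v in e}) > 1]
    return len(cut), sum(cut)

# Two heavy edges (weight 10) and three light ones (weight 1)
wts = {frozenset(e): w for e, w in [((0, 1), 10), ((1, 2), 10),
                                    ((0, 3), 1), ((1, 3), 1), ((2, 3), 1)]}
# Cutting off vertex 3 severs three light edges ...
print(cut_count_and_magnitude(wts, {0: 0, 1: 0, 2: 0, 3: 1}))  # (3, 3)
# ... while cutting off vertex 1 severs three edges of total weight 21
print(cut_count_and_magnitude(wts, {0: 0, 1: 1, 2: 0, 3: 0}))  # (3, 21)
```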
[Figure: a weighted adjacency matrix partitioned with minimum edge-cut (edge-cut: 12, magnitude: 66) versus minimum magnitude (edge-cut: 23, magnitude: 36)]

Partition with minimum edge-cut: minimum number of nonzero off-block diagonal elements
Partition with minimum magnitude: minimum sum of magnitudes of nonzero off-block diagonal elements
Partitioning that minimizes both the number and the magnitude of the edges cut (edge-cut: 15, magnitude: 45)
[Figure: the combined partitioning of the weighted matrix; panels (a) and (b) annotate vertices with multiconstraint weight vectors, e.g. (1,0,0)/(0,1,0)/(0,0,1) and (1,0)/(0,1)]
Software Packages
From: [SRC] Ch. 18, K. Schloegel, G. Karypis, and V. Kumar
Further Reading
[SRC] Chapter 18