Graph Partitioning for High-Performance
Scientific Simulations
Advanced Topics Spring 2009 Prof. Robert van Engelen
HPC II Spring 2009 2 1/20/09
Overview
Challenges for irregular meshes
Modeling mesh-based computations as graphs
Static graph-partitioning techniques
  Geometric techniques
  Combinatorial techniques
  Spectral methods
  Multilevel schemes
Load balancing of adaptive computations
  Scratch-remap repartitioners
  Diffusion-based repartitioners
Multiconstraint graph partitioning
Publicly available software packages
A Partitioned 2D Irregular Mesh of an Airfoil
Challenges for Irregular Meshes
How to generate irregular meshes in the first place?
  Triangulation, 2D/3D finite-element mesh
How to partition them?
  Optimal graph partitioning is an NP-complete problem
  Heuristic edge-cut methods
  Sequential or parallel partitioning methods?
  Static or dynamic (adaptive) partitioning methods?
How to design iterative solvers?
  PETSc, Prometheus, …
How to design direct solvers?
  Reordering the adjacency matrix, e.g. reverse Cuthill-McKee
  SuperLU, parallel Gaussian elimination, …
Node Graph and Dual
Graphs model the structure of the computation
Node graph (mesh nodes compute):
  Vertices = mesh nodes (nodes compute)
  Edges = communication between nodes
Dual graph (mesh elements compute):
  Vertices = mesh elements
  Edges = communication between adjacent mesh elements
  Exchanges take place for every face between adjacent elements

[Figure: a 2D irregular mesh with its node graph and dual graph]
Matrix - Graph Relationship
[Figure: an 8×8 symmetric adjacency matrix (rows/columns 1–8, with 1s marking edges) shown side by side with the corresponding graph on vertices 1–8]
Can reorder rows and columns of the matrix
Non-zeros outside of the diagonal blocks require communication
An optimal partition of the graph for parallel computation:
  Equal number of vertices in subdomains
  Lowest number of edges between subdomains
Subdomains are mapped to the processors
Finite-Element Mesh Example
Image source: MATLAB 7.5 NASA airfoil demo
NASA airfoil finite-element mesh, 4253 grid points
Adjacency Matrix and Reverse Cuthill-McKee Reordering
Image source: MATLAB 7.5 NASA airfoil demo
Adjacency Matrix Reverse Cuthill-McKee
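The reordering above is straightforward to sketch; below is a minimal pure-Python version of reverse Cuthill-McKee, assuming the graph is given as an adjacency dict (for real work one would use e.g. MATLAB's symrcm or SciPy's reverse_cuthill_mckee):

```python
from collections import deque

def reverse_cuthill_mckee(adj):
    """Reverse Cuthill-McKee ordering for a symmetric adjacency structure.
    adj: dict mapping vertex -> iterable of neighbors.
    Returns a permutation (list of vertices) that tends to reduce bandwidth."""
    order, visited = [], set()
    # Process each connected component, starting from a minimum-degree vertex
    for start in sorted(adj, key=lambda v: len(adj[v])):
        if start in visited:
            continue
        visited.add(start)
        queue = deque([start])
        while queue:
            v = queue.popleft()
            order.append(v)
            # Visit unvisited neighbors in order of increasing degree
            for w in sorted(adj[v], key=lambda u: len(adj[u])):
                if w not in visited:
                    visited.add(w)
                    queue.append(w)
    order.reverse()  # the "reverse" in RCM
    return order

# Path graph 0-1-2-3: RCM returns a relabeling that keeps the band narrow
print(reverse_cuthill_mckee({0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}))
```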
Goals of Partitioning
The balance constraint:
  Balance computational load such that each processor has the same execution time
  Balance storage such that each processor has the same storage demands
Minimum edge cut:
  Minimize communication volume between subdomains, along the edges of the mesh
5-cut 4-cut
Example 3-way partition with edge-cut = 9
Partitioning Problem
Let G = (V, E) be a weighted undirected graph with weight functions wV: V → R+ and wE: E → R+
The k-way graph partitioning problem is to split V into k disjoint subsets (subdomains) Sj, j = 1, …, k, such that:
  Balance constraint: ∑v ∈ Sj wV(v) is roughly equal for all j = 1, …, k
  Min-cut (minimum edge-cut): ∑ wE((u,v)) over edges (u,v) ∈ E with u ∈ Si, v ∈ Sj, i ≠ j is minimal
The weight functions are defined such that:
  wV models CPU loads
  wE models communication volumes (+ fixed startup latency)
Can add subdomain weights to improve mapping to heterogeneous clusters of workstations
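Both objectives are easy to evaluate for a candidate partition; a small sketch (graph given as weight dicts; the names are illustrative):

```python
def evaluate_partition(wv, we, parts):
    """Evaluate a k-way partition against the two objectives above.
    wv: dict vertex -> weight (models CPU load)
    we: dict frozenset({u, v}) -> weight (models communication volume)
    parts: dict vertex -> subdomain index
    Returns (per-subdomain load, weighted edge-cut)."""
    loads = {}
    for v, w in wv.items():
        loads[parts[v]] = loads.get(parts[v], 0) + w
    # An edge is cut when its endpoints lie in different subdomains
    cut = sum(w for e, w in we.items()
              if len({parts[v] for v in e}) > 1)
    return loads, cut

# 4-cycle 0-1-2-3-0 with unit weights, split {0,1} vs {2,3}
wv = {v: 1 for v in range(4)}
we = {frozenset(e): 1 for e in [(0, 1), (1, 2), (2, 3), (3, 0)]}
loads, cut = evaluate_partition(wv, we, {0: 0, 1: 0, 2: 1, 3: 1})
print(loads, cut)  # balanced loads, edge-cut 2
```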
Edge Cuts Versus Communication Steps
Caveat: min edge cut ≠ optimal communication
The communication pattern does not follow the edges exactly
Edge-cut: 7
Communication steps: 9
A sends 1 to B A sends 3 to B A sends 4 to C B sends 5 to A B sends 7 to A B sends 7 to C B sends 8 to C C sends 9 to B C sends 10 to A
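The send list above can be modeled as one message per (vertex, remote subdomain) pair: a vertex with several cut edges into the same subdomain sends once, but both endpoints of a cut edge send. A sketch on a tiny hypothetical graph:

```python
def edge_cut_and_messages(edges, parts):
    """Count cut edges versus point-to-point communication steps.
    A message is needed once per (vertex, remote subdomain) pair,
    so the two counts generally differ."""
    cut_edges = [(u, v) for u, v in edges if parts[u] != parts[v]]
    msgs = ({(u, parts[v]) for u, v in cut_edges}
            | {(v, parts[u]) for u, v in cut_edges})
    return len(cut_edges), len(msgs)

# Two vertices of subdomain A both border the same vertex of B
edges = [("a1", "b1"), ("a2", "b1")]
parts = {"a1": "A", "a2": "A", "b1": "B"}
print(edge_cut_and_messages(edges, parts))  # (2, 3): 2 cut edges, 3 sends
```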
Static Graph Partitioning
Geometric techniques
  Coordinate Nested Dissection (CND)
  Recursive Inertial Bisection (RIB)
  Space-filling curve techniques
  Sphere-cutting approach
Combinatorial techniques
  Levelized Nested Dissection (LND)
  Kernighan-Lin/Fiduccia-Mattheyses (KL/FM)
Spectral methods
Multilevel schemes
  Multilevel recursive bisection
  Multilevel k-way partitioning
Geometric Techniques
Partition solely based on coordinate information
Applicable when a coordinate system exists or can be constructed
Goal is to group together vertices that are spatially near each other
Recursively bisects the mesh into increasingly smaller subdomains
Geometric techniques are typically fast
Have no concept of edge cut, so no communication optimization
May produce disconnected subdomains for meshes with complex geometries
CND
Coordinate Nested Dissection (CND) is a geometric method:
  Compute centers of mass of mesh elements
  Project onto the axis of the longest dimension
  Bisect the list of centers
  Repeat recursively
Good: fast and easy to parallelize
Bad: low quality partitioning
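The four CND steps above can be sketched as follows, operating directly on element centers (a sketch; assumes k is a power of two):

```python
def cnd(points, k):
    """Coordinate Nested Dissection sketch: recursively bisect a list of
    element centers normal to the axis of longest extent."""
    if k == 1:
        return [points]
    # Pick the axis with the largest coordinate spread
    dims = len(points[0])
    axis = max(range(dims),
               key=lambda d: max(p[d] for p in points) - min(p[d] for p in points))
    # Bisect the list of centers sorted along that axis
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return cnd(pts[:mid], k // 2) + cnd(pts[mid:], k // 2)

# Four centers in a 2:1 box; the first cut is normal to the x-axis
parts = cnd([(0, 0), (1, 0), (2, 1), (3, 1)], 2)
print(parts)
```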
CND (cont'd)
Mesh is bisected normal to the axis of the longest dimension
  Results in smaller subdomain boundaries
  Approximates smaller data volume exchanges

[Figure: the mesh bisected normal to the x-axis versus normal to the y-axis]
CND (cont'd)
For complicated geometries CND may produce disconnected subdomains, lowering the quality of the computational solution
Variations of CND have been proposed to mitigate this disadvantage
RIB
Recursive Inertial Bisection (RIB) orients the bisection to minimize the subdomain boundary:
  Mesh elements are converted into point masses
  Compute the principal inertial axis of the mass distribution
  Bisect orthogonal to the principal inertial axis
  Repeat recursively
Fast, and better quality than CND
CND RIB
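In 2D the principal axis follows from the second moments of the point cloud, so one RIB bisection step fits in a few lines (a sketch with unit masses; the helper name is illustrative):

```python
import math

def rib_bisect(points):
    """One Recursive Inertial Bisection step (2D sketch): treat points as
    unit masses, find the principal inertial axis, bisect orthogonal to it."""
    n = len(points)
    cx = sum(p[0] for p in points) / n
    cy = sum(p[1] for p in points) / n
    # Second moments of the centered point cloud
    sxx = sum((p[0] - cx) ** 2 for p in points)
    syy = sum((p[1] - cy) ** 2 for p in points)
    sxy = sum((p[0] - cx) * (p[1] - cy) for p in points)
    # Direction of the principal axis of a 2x2 symmetric moment matrix
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    ax, ay = math.cos(theta), math.sin(theta)
    # Project onto the principal axis and split at the median
    pts = sorted(points, key=lambda p: (p[0] - cx) * ax + (p[1] - cy) * ay)
    mid = n // 2
    return pts[:mid], pts[mid:]

# Points along the line y = x: the principal axis is the diagonal,
# so the cut is orthogonal to it
a, b = rib_bisect([(0, 0), (1, 1), (2, 2), (3, 3)])
print(a, b)
```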
Space-Filling Curve Techniques
CND and RIB consider only a single dimension at a time
Space-filling curve techniques follow the centers of mass of mesh elements using locality-preserving curves, e.g. Peano-Hilbert curves
  Curve segments fill higher-dimensional subspaces
  Split the curve into k parts, resulting in k subdomains
Good: fast and generally better quality than CND and RIB
Bad: works better for certain types of problems, e.g. n-body simulations
8-way partition using a Peano-Hilbert space-filling curve
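A sketch of the idea, using the simpler Z-order (Morton) curve in place of a Peano-Hilbert curve (the function names are illustrative):

```python
def morton_key(x, y, bits=16):
    """Interleave the bits of integer coordinates (Z-order / Morton code),
    a simpler locality-preserving curve than Peano-Hilbert."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i) | ((y >> i) & 1) << (2 * i + 1)
    return key

def sfc_partition(points, k):
    """Sort element centers along the curve and split into k equal runs."""
    pts = sorted(points, key=lambda p: morton_key(*p))
    n = len(pts)
    return [pts[i * n // k:(i + 1) * n // k] for i in range(k)]

# Four unit cells split 2 ways: the curve keeps each half spatially compact
print(sfc_partition([(0, 0), (1, 0), (0, 1), (1, 1)], 2))
```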
The Sphere-Cutting Approach
Approach separates the mesh by vertices in an overlap graph
Create an (α,k)-overlap graph from a k-ply neighborhood system:
  No point is contained in more than k d-dimensional spheres
  Nodes are connected when the spheres, scaled up by α, intersect
Overlap graphs have O(n^((d-1)/d)) vertex separators
Project the overlap graph onto a (d+1)-dimensional sphere, which is then cut

[Figure: mesh nodes, a 3-ply neighborhood system, and the resulting (1,3)-overlap graph]
Combinatorial Techniques
The problem with geometric techniques is that they group vertices that are spatially close using a coordinate system, whether the vertices are connected or not
Combinatorial partitioning techniques instead use adjacency information to group together vertices that are highly connected
  Smaller edge-cuts
  Reasonably fast
  But not easily parallelizable
LND
Levelized Nested Dissection (LND):
  Select a vertex v0, preferably a (pseudo) peripheral vertex
  For each vertex, compute its distance to v0 using a breadth-first search starting from v0
  When half of the vertices have been assigned, split the graph in two (assigned, unassigned)
Can be repeated with several trial choices of v0 to improve the edge-cut
LND cut Min-cut
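A minimal LND bisection via breadth-first search (a sketch; ties are broken by adjacency order, and the balance is exact only for even vertex counts):

```python
from collections import deque

def lnd_bisect(adj, v0):
    """Levelized Nested Dissection: grow a region from v0 in BFS order
    and split once half the vertices have been assigned."""
    half = len(adj) // 2
    assigned = {v0}
    queue = deque([v0])
    while queue and len(assigned) < half:
        v = queue.popleft()
        for w in adj[v]:
            if w not in assigned and len(assigned) < half:
                assigned.add(w)
                queue.append(w)
    return assigned, set(adj) - assigned

# 6-vertex path; v0 = 0 is a peripheral vertex, so the cut is a single edge
adj = {i: [j for j in (i - 1, i + 1) if 0 <= j < 6] for i in range(6)}
print(lnd_bisect(adj, 0))
```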
KL Partition Refinement
Kernighan-Lin (KL) partition refinement
Given a partition of the vertices V into disjoint sets A and B
General idea: find X ⊆ A and Y ⊆ B such that swapping X to B and Y to A yields the greatest reduction in edge-cut
  Finding optimal X and Y is intractable
  KL performs repeated passes over V, each pass taking O(|V|^2) time
  Each KL pass swaps two vertices, one from A with one from B

[Figure: edge-cut 6 before the KL pass, edge-cut 3 after the KL pass]
KL/FM Partition Refinement

Kernighan-Lin/Fiduccia-Mattheyses (KL/FM) partition refinement
  FM considers moving one vertex at a time
  Works by annotating each vertex along the partition with gain/loss values
  Move the vertex that gives a gain
  May move a vertex even when the gain is negative, to avoid a local minimum

[Figure sequence: the refinement gets stuck in a local minimum, explores a move that increases the edge cut while meeting the balance constraint, then continues with moves that reduce the edge cut while meeting the balance constraint]
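A compact sketch of one FM-style pass: it moves the best-gain vertex even when the gain is negative and keeps the best partition seen. The balance constraint is omitted here for brevity, and vertex labels are assumed orderable:

```python
def fm_pass(adj, side):
    """One Fiduccia-Mattheyses-style pass (sketch). side maps vertex -> 0/1.
    Moves every vertex once, best gain first, and returns the best
    (partition, edge-cut) encountered along the way."""
    def cut(s):
        # Count each cut edge once (labels assumed orderable)
        return sum(1 for u in adj for v in adj[u] if u < v and s[u] != s[v])

    side = dict(side)
    best, best_cut = dict(side), cut(side)
    locked = set()
    while len(locked) < len(adj):
        # Gain of moving v = external edges minus internal edges
        gains = {v: sum(1 if side[w] != side[v] else -1 for w in adj[v])
                 for v in adj if v not in locked}
        v = max(gains, key=gains.get)
        side[v] = 1 - side[v]   # move v, even if the gain is negative
        locked.add(v)
        c = cut(side)
        if c < best_cut:
            best, best_cut = dict(side), c
    return best, best_cut

# Two triangles joined by one edge, starting from a poor split (cut 5):
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
part, c = fm_pass(adj, {0: 0, 1: 0, 2: 1, 3: 0, 4: 1, 5: 1})
print(part, c)  # finds the natural split with edge-cut 1
```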
Spectral Methods
Given a graph G with adjacency matrix A
The discrete Laplacian matrix is LG = D − A, with D the diagonal matrix such that Dii = degree of vertex i
Smallest eigenvalue of LG is 0
The second-smallest eigenvalue (the algebraic connectivity) measures graph connectivity
The Fiedler vector (the eigenvector corresponding to the second-smallest eigenvalue) measures distance between vertices
Sort the Fiedler vector and split the graph according to the sorted list of vertices
Recursive spectral bisection computes a k-way partition recursively
Good: high quality subdomain splitting
Bad: computing the Fiedler vector is expensive
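A sketch of one spectral bisection step with NumPy, using the convention L = D − A, under which the Fiedler vector corresponds to the second-smallest eigenvalue:

```python
import numpy as np

def spectral_bisect(adj):
    """Spectral bisection sketch: sort vertices by their Fiedler vector
    entry and split the sorted list in half.
    adj: dict with integer vertex labels 0..n-1."""
    n = len(adj)
    A = np.zeros((n, n))
    for u, nbrs in adj.items():
        for v in nbrs:
            A[u, v] = 1.0
    L = np.diag(A.sum(axis=1)) - A
    # eigh returns eigenvalues in ascending order, so column 1 of the
    # eigenvector matrix is the Fiedler vector
    _, vecs = np.linalg.eigh(L)
    order = np.argsort(vecs[:, 1])
    return set(order[:n // 2].tolist()), set(order[n // 2:].tolist())

# 6-vertex path: the Fiedler vector orders vertices along the path,
# so the bisection cuts exactly one edge
adj = {i: [j for j in (i - 1, i + 1) if 0 <= j < 6] for i in range(6)}
print(spectral_bisect(adj))
```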
Spectral Methods (cont'd)
Multilevel Schemes
Multilevel schemes consist of recursively applying:
  1. Graph coarsening
  2. Initial partitioning
  3. Partition refinement
Multilevel Schemes (cont’d)
Multilevel recursive bisection
  Coarsening collapses vertices in pairs
    Pick random pairs
    Or pick heavy-edge pairs
  Partitioning is performed on the coarsest (smallest) graph using recursive bisection methods
  Refinement with KL/FM on the uncoarsened graph
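Heavy-edge coarsening can be sketched as a greedy matching; the visit order and tie-breaking below are arbitrary choices of this sketch:

```python
def heavy_edge_matching(adj, wts):
    """Coarsening by heavy-edge matching (sketch): visit vertices and match
    each with its unmatched neighbor of heaviest edge weight. Matched pairs
    collapse into one coarse vertex; wts maps frozenset({u, v}) -> weight."""
    matched = {}
    for v in adj:
        if v in matched:
            continue
        cands = [w for w in adj[v] if w not in matched]
        if cands:
            w = max(cands, key=lambda u: wts[frozenset((v, u))])
            matched[v] = w
            matched[w] = v
        else:
            matched[v] = v  # no unmatched neighbor: carried over as-is
    return matched

# 4-cycle with two heavy edges: the heavy edges get collapsed
adj = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
wts = {frozenset(e): w for e, w in [((0, 1), 5), ((1, 2), 1),
                                    ((2, 3), 5), ((3, 0), 1)]}
print(heavy_edge_matching(adj, wts))  # 0 pairs with 1, 2 pairs with 3
```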
Multilevel Schemes (cont’d)
Has no local minimum Has local minimum
Partitioning Methods Compared
From: [SRC] Ch. 18, K. Schloegel, G. Karypis, and V. Kumar
Load Balancing of Adaptive Computations
Adaptive graph partitioning methods
  Goal is to balance the load by redistributing data when the mesh changes dynamically
  Redistributing data is expensive, so it should be minimized

[Figure: a helicopter blade rotating through a mesh]
Load Balancing of Adaptive Computations (cont’d)
To measure the redistribution cost we define:
  TotalV is the sum of the sizes of vertices that change subdomain as a result of repartitioning (measures total data volume)
  MaxV is the maximum over subdomains of the sum of sizes of vertices that moved in/out of that subdomain (measures maximum send/recv time)
Repartitioners aim to minimize TotalV or MaxV
  Minimizing TotalV is easier
  Minimizing TotalV tends to also minimize MaxV
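TotalV and MaxV can be computed directly from the old and new assignments; a sketch (MaxV here sums data moving in or out of each subdomain):

```python
def redistribution_cost(size, old, new):
    """TotalV and MaxV for a repartitioning.
    size: vertex -> data size; old/new: vertex -> subdomain label."""
    moved = [v for v in size if old[v] != new[v]]
    totalv = sum(size[v] for v in moved)
    flow = {}
    for v in moved:
        flow[old[v]] = flow.get(old[v], 0) + size[v]  # leaves old subdomain
        flow[new[v]] = flow.get(new[v], 0) + size[v]  # enters new subdomain
    maxv = max(flow.values(), default=0)
    return totalv, maxv

# Vertices 0 and 3 swap subdomains; both A and B see 6 units in/out
size = {0: 2, 1: 3, 2: 1, 3: 4}
old = {0: "A", 1: "A", 2: "B", 3: "B"}
new = {0: "B", 1: "A", 2: "B", 3: "A"}
print(redistribution_cost(size, old, new))  # TotalV 6, MaxV 6
```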
Adaptive Graph Partitioning
Two repartitioning approaches:
  Scratch (and remap) repartition methods: partition from scratch
  Incremental repartition methods
    Cut-and-paste
    Diffusion-based methods
Repartitioning Approaches
[Figure: an imbalanced partitioning (vertex weights = 1) repaired by repartitioning from scratch, cut-and-paste repartitioning, and diffusive repartitioning]
Scratch-Remap Partitioning
Improves repartitioning from scratch
After repartitioning from scratch, remap vertices to match up with the old subdomains
  Lowers TotalV (and thus MaxV)
Similarity matrix elements Sqr = sum of the sizes of vertices in subdomain q of the old partition and subdomain r of the new partition

[Figure: the similarity matrix and the scratch-remap partitioned graph]
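A sketch of the remap step: build the similarity matrix and greedily keep the heaviest overlaps in place (an optimal label assignment would instead use e.g. the Hungarian method):

```python
def scratch_remap(size, old, new):
    """Relabel new subdomains onto old labels to reduce data movement.
    S[(q, r)] = total size of vertices in old subdomain q and new
    subdomain r; a greedy pass keeps the heaviest overlaps in place."""
    S = {}
    for v in size:
        S[(old[v], new[v])] = S.get((old[v], new[v]), 0) + size[v]
    remap, used_q, used_r = {}, set(), set()
    # Take (q, r) pairs by decreasing overlap and relabel r back to q
    for (q, r), _ in sorted(S.items(), key=lambda kv: -kv[1]):
        if q not in used_q and r not in used_r:
            remap[r] = q
            used_q.add(q)
            used_r.add(r)
    return {v: remap.get(new[v], new[v]) for v in size}

# The new partition is identical to the old one up to label names:
# remapping recovers the old labels, so no data moves at all
size = {v: 1 for v in range(4)}
old = {0: "A", 1: "A", 2: "B", 3: "B"}
new = {0: "Y", 1: "Y", 2: "X", 3: "X"}
print(scratch_remap(size, old, new))
```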
Incremental Versus Scratch-Remap
Scratch-remap results in higher redistribution costs (TotalV) compared to incremental methods that use local perturbations
Incremental partitioning with cut-and-paste
  Moves few vertices between subdomains to restore balance
Incremental diffusion-based methods

[Figure: an imbalanced partitioning repaired by incremental versus scratch-remap partitioning]
Diffusion-Based Repartitioners
Address two questions:
  How much work should be transferred between processors?
  Which tasks should be transferred?
Attempt to minimize the difference between the original and final partitioning by making incremental changes
Local diffusion schemes only consider workloads of localized groups of processors
Global diffusion schemes consider the entire graph
  Recursive bisection diffusion partitioners
  Adaptive space-filling curve partitioners
  Repartitioning based on flow solutions
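The simplest global scheme is first-order diffusion on the processor graph; a sketch (the step size alpha and iteration count are illustrative choices):

```python
def diffuse(load, adj, alpha=0.25, steps=50):
    """First-order diffusion sketch: each processor repeatedly shifts a
    fraction alpha of its load difference to each neighbor, converging
    toward the balanced average on a connected processor graph."""
    load = dict(load)
    for _ in range(steps):
        flow = {p: 0.0 for p in load}
        for p in load:
            for q in adj[p]:
                flow[p] += alpha * (load[q] - load[p])
        for p in load:
            load[p] += flow[p]
    return load

# 4 processors in a ring, all load initially on processor 0
load = diffuse({0: 8.0, 1: 0.0, 2: 0.0, 3: 0.0},
               {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]})
print(load)  # each value approaches the balanced average of 2.0
```

Load is conserved at every step (each pairwise flow appears with opposite signs), which is why the iteration settles at the average.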
Tradeoff Between Edge-Cut and Data Redistribution Costs
The objective of minimizing data redistribution cost is at odds with the objective of minimizing the edge-cut
Minimizing data redistribution cost is preferred over minimizing the edge-cut if the mesh is adapted frequently or the amount of per-element state data is high
Unified repartitioning is a diffusive repartitioning algorithm that gives the user control of the tradeoff:
  Reduce the sum of interprocessor communication overhead
  Reduce the data redistribution cost required to balance the load
Multiconstraint Graph Partitioning
There are sophisticated classes of simulations for which the traditional graph partitioning formulation is inadequate
Multiple constraints must be satisfied:
  Nonuniform memory requirements
  Nonuniform CPU requirements
Multiphase Simulations
Multiphase computations consist of m distinct computational phases, separated by synchronization
Each phase should be load balanced
  Compute a single partitioning that attempts to balance the loads of all m phases
  Or compute m partitionings, which requires data redistribution between phases
Example particle-in-cell computation:
  phase 1: mesh-based computation
  phase 2: particle-based computation

[Figure: a particle-in-cell mesh]
Multimesh Computations
An important class of emerging numerical methods
Combines:
  Structured grids to discretize PDEs
  Unstructured meshes to model complex geometries of objects
Requires interpolation to map data to finer meshes
How to partition so that all grids/meshes are balanced and interprocessor communication is minimized?
Domain Decomposition-Based Preconditioners
Use a preconditioner to minimize the number of nonzero elements off the block diagonal in the adjacency matrix
  Every nonzero element off the block diagonal corresponds to interprocessor communication between subdomains
Also need to minimize the magnitude of the nonzero elements off the block diagonal when edge weights are not 1
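Counting cut edges versus summing their weights gives two competing objectives; a small sketch on a hypothetical weighted graph where partitions with the same cut count differ sharply in magnitude:

```python
def cut_count_and_magnitude(wts, parts):
    """Number and total weight (magnitude) of cut edges for a weighted
    graph; wts maps frozenset({u, v}) -> edge weight."""
    cut = [w for e, w in wts.items()
           if len({parts[v] for v in e}) > 1]
    return len(cut), sum(cut)

# Two heavy edges (weight 10) and three light ones (weight 1)
wts = {frozenset(e): w for e, w in [((0, 1), 10), ((1, 2), 10),
                                    ((0, 3), 1), ((1, 3), 1), ((2, 3), 1)]}
# Cutting off vertex 3 severs three light edges ...
print(cut_count_and_magnitude(wts, {0: 0, 1: 0, 2: 0, 3: 1}))  # (3, 3)
# ... while cutting off vertex 1 severs three edges of total weight 21
print(cut_count_and_magnitude(wts, {0: 0, 1: 1, 2: 0, 3: 0}))  # (3, 21)
```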
[Figure: a weighted adjacency matrix partitioned with minimum edge-cut (edge-cut: 12, magnitude: 66) versus minimum magnitude (edge-cut: 23, magnitude: 36)]

Partition with minimum edge-cut: minimum number of nonzero off-block diagonal elements
Partition with minimum magnitude: minimum sum of magnitudes of nonzero off-block diagonal elements
Partitioning that minimizes both the number and the magnitude of the edges cut (edge-cut: 15, magnitude: 45)
[Figure: the combined partitioning of the weighted matrix; panels (a) and (b) annotate vertices with multiconstraint weight vectors, e.g. (1,0,0)/(0,1,0)/(0,0,1) and (1,0)/(0,1)]
Software Packages
From: [SRC] Ch. 18, K. Schloegel, G. Karypis, and V. Kumar
Further Reading
[SRC] Chapter 18