Today's lecture
• Particle methods
• Sparse matrices
• Irregular meshes
3 ©2012 Scott B. Baden /CSE 260/ Fall 2012
Classifying application domains – Colella’s “7 Dwarfs”
– Classify applications according to their pattern of data access and operations on the data: important in HPC
– Structured grids: image processing, simulations
– Dense linear algebra: Gaussian elimination, SVD, many others
– N-body methods
– Sparse linear algebra
– Unstructured grids
– Spectral methods (FFT)
– Monte Carlo
The N-body problem
• Compute the trajectories of a system of N bodies, often called particles, moving under mutual influence
  – The Greek word for particle: somatidion = "little body"
  – No general analytic (exact) solution when N > 2
  – Numerical simulations are required
  – N can range from thousands to millions
• A force law governs the way the particles interact
  – We may not need to perform all O(N²) force computations
  – Introduces non-uniformity due to uneven particle distributions
Discretization
• Particles move continuously through space and time according to a force, a continuous function of position and time: F(x,t)
• Can't solve the problem analytically; we must solve it numerically
• We approximate continuous values using a discrete representation
• Evaluate the force field at discrete points in time, called timesteps: Δt, 2Δt, 3Δt, … (Δt, the discrete timestep, is a parameter)
• "Push" the bodies according to Newton's second law: F = ma = m dv/dt
while (current time < end time)
    forall bodies i ∈ 1:N
        compute force Fi induced by all bodies j ∈ 1:N
    update position xi by Fi Δt, ∀ i
    current time += Δt
end
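As a sketch, the loop above might look like the following in Python. This is an illustrative direct-sum implementation, not the course's code; the softening term `eps`, the simple Euler-style push, and the function name `step` are assumptions.

```python
import math

def step(pos, vel, mass, dt, G=1.0, eps=1e-9):
    """One timestep of the direct O(N^2) method in 2D.

    pos, vel: lists of [x, y]; mass: list of scalars.
    eps softens the force at tiny separations to avoid division by zero.
    """
    n = len(pos)
    forces = [[0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:                  # no self-induced force
                continue
            dx = pos[j][0] - pos[i][0]
            dy = pos[j][1] - pos[i][1]
            r = math.sqrt(dx * dx + dy * dy) + eps
            f = G * mass[i] * mass[j] / (r * r)   # gravitational force law
            forces[i][0] += f * dx / r            # direction: toward body j
            forces[i][1] += f * dy / r
    for i in range(n):                  # "push" phase: F = ma
        for d in range(2):
            vel[i][d] += forces[i][d] / mass[i] * dt
            pos[i][d] += vel[i][d] * dt
    return pos, vel
```

Two equal masses attract each other symmetrically, so the velocity changes cancel and total momentum is conserved.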
Timestep selection
• We approximate the velocity of a particle by the tangent to the particle's trajectory
• Since we compute velocities at discrete points in space and time, we approximate the true trajectory by a straight line
• So long as Δt is small enough, the resultant error is reasonable
• If not, we might "jump" to another trajectory: this is an error
[Figure: Particle A's trajectory and Particle B's trajectory]
Some details
• There is no self-induced force
• How should we choose the timestep Δt?
• We apply numerical criteria to select Δt, based on an analysis of the evolving solution
• Selection may be ongoing, i.e. dynamic
Computing the force
• The running time of the computation is dominated by the force computation, so we ignore the push phase
• The simplest approach is to use the direct method, with a running time of O(N²):
  Force on particle i = Σj=0:N-1 F(xi, xj)
• F( ) is the force law; one example is the gravitational force law:
  F(xi, xj) = G mi mj / r²ij, where rij = dist(xi, xj) and G is the gravitational constant
Significance of locality
• Many scientific problems exhibit both spatial and temporal locality, due to the underlying physics
  – The values of the solution at nearby points in space and time are closer than values at distant points
  – Gravitational interactions decay as 1/r²
  – Waves live in a localized region of space at any one time and will appear in a nearby region of space at a nearby point in time
• Numerically speaking, timesteps are "small": the solution changes gradually
• Let's consider a more highly localized force
Jim Demmel, U. C. Berkeley
Localized Van Der Waals Force
F(r) = C (2 − 1/(30r))   if r < δ
F(r) = C / (30r)⁵        if r ≥ δ,   with C = 0.1

[Plot: F(r) for 0 ≤ r ≤ 0.2; F ranges from about −0.1 to 0.15]
Implementation
• We don't need to compute all O(N²) interactions
• To speed up the search for nearby particles, sort them into a chaining mesh (Hockney & Eastwood, 1981)
• Compute forces one mesh box at a time
• Only consider particles in the box itself and the 8 surrounding cells
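A minimal sketch of the chaining-mesh idea in Python. The helper names are hypothetical, and the cell side is assumed to be at least the force cutoff δ, so only the box itself and the 8 surrounding cells need be searched for interaction partners.

```python
from collections import defaultdict

def build_chaining_mesh(pos, cell):
    """Bin 2D particles into square cells of side `cell` (>= force cutoff)."""
    mesh = defaultdict(list)
    for idx, (x, y) in enumerate(pos):
        mesh[(int(x // cell), int(y // cell))].append(idx)
    return mesh

def neighbor_pairs(pos, cell):
    """Candidate interacting pairs: same box or one of the 8 surrounding cells."""
    mesh = build_chaining_mesh(pos, cell)
    pairs = set()
    for (cx, cy), members in mesh.items():
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for i in members:
                    for j in mesh.get((cx + dx, cy + dy), ()):
                        if i < j:        # count each pair once
                            pairs.add((i, j))
    return pairs
```

Particles farther apart than one cell are never even considered, which is where the savings over the O(N²) direct method come from.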
Jim Demmel, U. C. Berkeley
Parallel implementation
• We split the mesh as in a stencil method
• We use ghost cells to store copies of nearby particles
• The ghost region also manages particles that have moved outside the subdomain and must be repatriated to their new owner
• This requires packing and unpacking of particle lists
Load Balancing
• Boxes carry varying amounts of work depending on local particle density
• A uniformly blocked workload assignment will result in a load imbalance
• Particles can redistribute themselves over processors
• Two alternatives:
  – Static: cyclic mapping
  – Dynamic: non-uniform partitioning
Static decomposition
• Block cyclic decomposition
• All processors should get about the same amount of work if we choose a reasonable chunk size
• What are the tradeoffs?

  0 1 0 1 0 1 0 1
  2 3 2 3 2 3 2 3
  0 1 0 1 0 1 0 1
  2 3 2 3 2 3 2 3
  0 1 0 1 0 1 0 1
  2 3 2 3 2 3 2 3
  0 1 0 1 0 1 0 1
  2 3 2 3 2 3 2 3
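The cyclic mapping in the figure (a 2×2 processor grid swept cyclically over the boxes) can be written as a one-line owner function. A minimal sketch; `cyclic_owner` is a hypothetical name, and processors are assumed numbered row-major on a prow × pcol grid.

```python
def cyclic_owner(i, j, b, prow, pcol):
    """Owner of mesh box (i, j) under a block-cyclic map with b-by-b blocks
    on a prow-by-pcol processor grid (processors numbered row-major)."""
    return ((i // b) % prow) * pcol + ((j // b) % pcol)
```

With b = 1 and a 2×2 grid this reproduces the alternating 0 1 / 2 3 pattern shown above.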
Communication overhead
• With b=2, each processor must obtain all the simulation data
• With b=4, this factor decreases to ½
• If the dependence distance increases, so must the granularity
  0 1 0 1 0 1
  2 3 2 3 2 3
  0 1 0 1 0 1
  2 3 2 3 2 3
  0 1 0 1 0 1
  2 3 2 3 2 3
Jim Demmel, U. C. Berkeley
  0 1 2 3 0 1 2 3
  4 5 6 7 4 5 6 7
  8 9 10 11 8 9 10 11
  12 13 14 15 12 13 14 15
  0 1 2 3 0 1 2 3
  4 5 6 7 4 5 6 7
  8 9 10 11 8 9 10 11
  12 13 14 15 12 13 14 15
Non-uniform partitioning
• Cyclic partitioning is a static strategy
  – Partitions space uniformly and maps work irregularly
  – A statistical sampling method: it does not attempt to measure load imbalance
  – May incur high communication overheads due to the fine-grain data decomposition
• Another approach is to partition the work non-uniformly according to the spatial workload distribution
  – Each non-uniform region carries an equal amount of work
  – Particles move: the partitioning changes over time
• Avoids the high surface-to-volume ratio of cyclic decomposition
• In many numerical problems the solution changes gradually due to timestep constraints, so we don't have to repartition every timestep
The workload density distribution
• Irregular partitioning algorithms require that we estimate the workload in space and time
• Construct a workload density mapping ρ(x,t) giving the workload at each point x = (x,y,z) of the mesh
• In static problems the workload density mapping depends only on position
• In many applications we can come up with a good estimate of the mapping
• Since timesteps are "small," particle motion is localized in space and in time
• This condition holds for continuum methods such as classical partial differential equations: we can adjust the partitioning frequency accordingly
Non-uniform blocked decomposition
• A well-known partitioning technique is recursive coordinate bisection (RCB)
An example of RCB in one dimension
• Consider the following loop:
    for i = 0 : N-1 do
        if ( G(i) ) then
            y[i] = f1( … )
        else
            y[i] = f2( … )
    end for
• Assume f1( ) takes twice as long to compute as f2( )
Partitioning procedure
• Let W[i] = if (G(i)) then 2 else 1:
  1 1 2 1 2 2 2 1 1 1 1 1 1 2 2 1
• Compute the running sum (i.e. the scan) of W:
  1 2 4 5 7 9 11 12 13 14 15 16 17 19 21 22
• Split into 2 equal parts at half the total work, 22/2 = 11:
  1 1 2 1 2 2 2 | 1 1 1 1 1 1 2 2 1
Recursive coordinate bisection
• Recurse until the desired number of partitions has been rendered:
  1 1 2 1 2 2 2 1 1 1 1 1 1 2 2 1
• May be applied in multiple dimensions
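The procedure on this and the previous slide can be sketched in Python. The function name is hypothetical, and the number of partitions is assumed to be a power of two.

```python
from itertools import accumulate

def rcb_1d(weights, parts):
    """1D recursive coordinate bisection (a sketch): split `weights` into
    `parts` contiguous pieces of nearly equal total work.
    Returns a list of (start, end) index ranges."""
    def bisect(lo, hi):
        scan = list(accumulate(weights[lo:hi]))   # running sum (the scan)
        half = scan[-1] / 2
        # cut after the first element whose running sum reaches half the work
        return lo + next(k for k, s in enumerate(scan) if s >= half) + 1

    pieces = [(0, len(weights))]
    while len(pieces) < parts:                    # recurse level by level
        split = []
        for lo, hi in pieces:
            cut = bisect(lo, hi)
            split.extend([(lo, cut), (cut, hi)])
        pieces = split
    return pieces
```

On the slide's weight vector, a 2-way split cuts after the seventh element, giving two pieces of work 11 each.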
Load Balancing with space filling curves
• Multidimensional RCB suffers from granularity problems
• We can reduce the granularity with many partitions per processor
• This introduces higher surface-to-volume effects
• Another approach: space-filling curves
Partitioning with space-filling curves
• Maps higher-dimensional physical space onto the line
  – Load balancing in one dimension
  – Many types; Hilbert curves shown here
• Irregular communication surfaces
• Useful for managing locality: memory, databases
mathforum.org/advanced/robertd/lsys3d.html
http://www.math.ohio-state.edu/~fiedorow/math655/Peano.html
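A space-filling order is easy to sketch with the Morton (Z-order) curve, a simpler relative of the Hilbert curves shown in the figures; this is an illustration, not the Hilbert construction itself. Sorting cells by interleaved coordinate bits keeps nearby cells nearby on the line, which can then be cut into equal-work pieces.

```python
def morton_key(x, y, bits=16):
    """Interleave the bits of integer cell coordinates (x, y):
    the resulting 1D key traces a Z-order space-filling curve."""
    key = 0
    for b in range(bits):
        key |= ((x >> b) & 1) << (2 * b)       # x bits in even positions
        key |= ((y >> b) & 1) << (2 * b + 1)   # y bits in odd positions
    return key

# Sort 2D cells along the curve; contiguous runs of the sorted list
# then form spatially compact partitions.
cells = [(x, y) for x in range(4) for y in range(4)]
curve = sorted(cells, key=lambda c: morton_key(*c))
```

The first four cells along the curve form one 2×2 quadrant, which is exactly the locality property that makes these curves useful for partitioning.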
Load balancing efficiency
• If we ignore serial sections and other overheads, then we may express load imbalance in terms of a load balancing efficiency metric
• Let each processor i complete its assigned work in time Ti
• Thus, the running time TP = MAXi ( Ti )
• Define the total work T = Σi Ti
• We define the load balancing efficiency η = T / (P · TP)
• Ideally η = 1.0
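The metric is a one-liner; a minimal sketch with a hypothetical function name:

```python
def load_balance_efficiency(times):
    """eta = (sum of per-processor times) / (P * max time).
    1.0 means perfectly balanced; smaller means some processors idle."""
    P = len(times)
    return sum(times) / (P * max(times))
```

For example, three processors finishing in 2, 1, and 1 time units give η = 4/6 ≈ 0.67.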
Classifying application domains – Colella’s “7 Dwarfs”
– Classify applications according to their pattern of data access and operations on the data: important in HPC
– Structured grids: image processing, simulations
– Dense linear algebra: Gaussian elimination, SVD, many others
– N-body methods
– Sparse linear algebra
– Unstructured grids
– Spectral methods (FFT)
– Monte Carlo
Sparse Matrices
• A matrix where we employ knowledge about the location of the non-zeroes
• Consider Jacobi’s method with a 5-point stencil
u'[i,j] = (u[i-1,j] + u[i+1,j] + u[i,j-1] + u[i,j+1] − h²f[i,j]) / 4
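A minimal Python sketch of one Jacobi sweep of the 5-point stencil; `jacobi_sweep` is a hypothetical name, and fixed (Dirichlet) boundary values are assumed.

```python
def jacobi_sweep(u, f, h):
    """One Jacobi sweep of the 5-point stencil on an n-by-n grid
    (lists of lists); boundary values are held fixed."""
    n = len(u)
    new = [row[:] for row in u]          # copy so boundaries are preserved
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            new[i][j] = (u[i-1][j] + u[i+1][j] +
                         u[i][j-1] + u[i][j+1] - h * h * f[i][j]) / 4.0
    return new
```

With f = 0 and a constant field, the update is a fixed point, which is a quick sanity check.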
• 1M × 1M submatrix of the web connectivity graph, constructed from an archive at the Stanford WebBase; ~3 non-zeroes/row
• Dense: 2²⁰ × 2²⁰ = 2⁴⁰ words = 1024 Gwords
• Sparse: (3/2²⁰) × 2⁴⁰ = 3 Mwords
• The sparse representation saves nearly a factor of a million (2²⁰/3 ≈ 350,000×) in storage
Web connectivity Matrix: 1M x 1M
Jim Demmel
Circuit Simulation
www.cise.ufl.edu/research/sparse/matrices/Hamm/scircuit.html
Motorola circuit: 170,998 × 170,998 matrix
958,936 nonzeroes (0.003% of entries, 5.6 nonzeroes/row)
Cholesky
• In some cases the matrix is both
  – Symmetric: A = Aᵀ
  – Positive definite
• Then A = L · Lᵀ, where the lower-triangular factor L acts as a "square root" of A
• No need for pivoting
• We can use Cholesky factorization
• The flop count is cut in half
• What complications does this avoid?
• Result: many matrices can be transformed into this form
• SuperLU: Demmel and Li
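A minimal dense Cholesky sketch (note this is not SuperLU, which is a sparse LU code; this block just illustrates the factorization A = L·Lᵀ on a small dense matrix):

```python
import math

def cholesky(A):
    """Cholesky factorization A = L * L^T for a symmetric positive-definite
    matrix (dense, lists of lists); returns the lower-triangular factor L."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][i] = math.sqrt(A[i][i] - s)    # diagonal entry
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]   # below-diagonal entry
    return L
```

No pivoting is needed, and only the lower triangle is computed, which is where the halved flop count comes from.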
Savings with Sparse Matrices
• 7 × 7 grid: 49 × 49 matrix; use Cholesky's method
• Nonzeroes: 349, 30% of the space
• Flops: 2643, 6.7%
• Fill-in: a zero entry becomes nonzero
Managing fill-in
• Fill-in and computation are reduced with an improved ordering
• The advantage increases with N: for n = 63 (N = n² = 3969)
  – 250,109 non-zeroes → 85,416
  – 16 million flops → 3.6 million
• Natural order: nonzeros = 233; minimum-degree order: nonzeros = 207
Sparse Matrix Vector Multiplication
• An important kernel used in sparse linear algebra: y[i] += A[i,j] × x[j]
• Many formats, common format for CPUs is Compressed Sparse Row (CSR)
Jim Demmel
Sparse matrix vector multiply kernel:

    // y[i] += A[i,j] × x[j]
    #pragma omp parallel for schedule(dynamic, chunk)
    for i = 0 : N-1              // rows
        i0 = ptr[i]
        i1 = ptr[i+1] - 1
        for j = i0 : i1          // nonzeroes of row i
            y[i] += val[j] * x[ind[j]]
        end j
    end i
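The same kernel, as a runnable Python sketch of the CSR format (hypothetical function name; `ptr`, `ind`, `val` follow the slide's naming):

```python
def csr_matvec(ptr, ind, val, x):
    """y = A*x with A in Compressed Sparse Row form:
    row i's nonzeroes are val[ptr[i]:ptr[i+1]], in columns ind[ptr[i]:ptr[i+1]]."""
    y = [0.0] * (len(ptr) - 1)
    for i in range(len(ptr) - 1):          # rows
        for j in range(ptr[i], ptr[i + 1]):  # nonzeroes of row i
            y[i] += val[j] * x[ind[j]]
    return y
```

Only the stored nonzeroes are touched, so the work is O(nnz) rather than O(N²).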
[Figure: matrix A (rows i ↓, columns j →) multiplying vector x]
Irregular mesh: NASA Airfoil in 2D
Jim Demmel
• In the Jacobi method, computational effort is applied uniformly
• Irregular "unstructured" meshes don't have a uniform structure, and they apply computational effort non-uniformly
• We can apply mesh updates directly, but it is more common to generate the equivalent sparse matrix
A Structured Adaptive mesh
Courtesy Phil Colella, Lawrence Berkeley National Laboratory
Turbulent flow over a step
Applied Mathematics Group Center for Applied Scientific Computing, Lawrence Livermore National Laboratory
http://www.llnl.gov/CASC/groups/casc-amg.html
Irregular Problems
• In the Jacobi application, computational effort is applied uniformly
• Irregular applications apply computational effort non-uniformly
• A load balancing problem arises when partitioning the data
• There may not be a mesh
Courtesy of Randy Bank
Achieving Asynchronicity on Multicore (Source: Jack Dongarra; system: PLASMA)
• The matrix factorization can be represented as a DAG:
  – nodes: tasks that operate on "tiles"
  – edges: dependencies among tasks
• Tasks can be scheduled asynchronously and in any order as long as dependencies are not violated
02/14/2006 CS267 Lecture 9 44
Row and Column Block Cyclic Layout
• processors and matrix blocks are distributed in a 2d array
• prow-by-pcol array of processors • brow-by-bcol matrix blocks
• pcol-fold parallelism in any column, and calls to the BLAS2 and BLAS3 on matrices of size brow-by-bcol
• serial bottleneck is eased
• prow ≠ pcol and brow ≠ bcol possible, even desirable
0 1 0 1 0 1 0 1
2 3 2 3 2 3 2 3
0 1 0 1 0 1 0 1
2 3 2 3 2 3 2 3
0 1 0 1 0 1 0 1
2 3 2 3 2 3 2 3
0 1 0 1 0 1 0 1
2 3 2 3 2 3 2 3
(block dimensions: brow × bcol)
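The owner computation for this layout can be sketched as follows; the function name is hypothetical, and it allows brow ≠ bcol and prow ≠ pcol, as the slide notes.

```python
def block_cyclic_owner(I, J, brow, bcol, prow, pcol):
    """Process coordinates (pr, pc) owning global matrix entry (I, J) under a
    2D block-cyclic layout with brow-by-bcol blocks on a prow-by-pcol grid."""
    return ((I // brow) % prow, (J // bcol) % pcol)
```

With 1×1 blocks on a 2×2 grid this reproduces the alternating 0 1 / 2 3 pattern of the figure (process (pr, pc) numbered pr·pcol + pc).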
Distributed GE with a 2D Block Cyclic Layout
• The block size b in the algorithm and the block sizes brow and bcol in the layout satisfy b = bcol
• Shaded regions indicate processors busy with computation or communication
• It is unnecessary to have a barrier between each step of the algorithm; e.g., steps 9, 10, and 11 can be pipelined