Today's lecture
• Particle methods
• Sparse matrices
• Irregular meshes
3 ©2012 Scott B. Baden /CSE 260/ Fall 2012
Classifying application domains – Colella’s “7 Dwarfs”
– Classify applications according to their pattern of data access and operations on the data: important in HPC
– Structured grids: image processing, simulations
– Dense linear algebra: Gaussian elimination, SVD, many others
– N-body methods
– Sparse linear algebra
– Unstructured grids
– Spectral methods (FFT)
– Monte Carlo
The N-body problem
• Compute the trajectories of a system of N bodies, often called particles, moving under mutual influence
  – The Greek word for particle: somatidion = "little body"
  – No general analytic (exact) solution when N > 2
  – Numerical simulations are required
  – N can range from thousands to millions
• A force law governs the way the particles interact
  – We may not need to perform all O(N²) force computations
  – Introduces non-uniformity due to uneven particle distributions
Discretization
• Particles move continuously through space and time according to a force, a continuous function of position and time: F(x,t)
• Can't solve the problem analytically; we must solve it numerically
• We approximate continuous values using a discrete representation
• Evaluate the force field at discrete points in time, called timesteps: Δt, 2Δt, 3Δt, … (Δt, the discrete timestep, is a parameter)
• "Push" the bodies according to Newton's second law: F = ma = m dv/dt
while (current time < end time)
    forall bodies i ∈ 1:N
        compute force Fi induced by all bodies j ∈ 1:N
    update position xi by Fi Δt, ∀ i
    current time += Δt
end
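As a sketch, the loop above might look like the following in Python. This is an illustrative direct-sum implementation, not the course's code; the softening term `eps`, the simple Euler-style push, and the function name `step` are assumptions.

```python
import math

def step(pos, vel, mass, dt, G=1.0, eps=1e-9):
    """One timestep of the direct O(N^2) method in 2D.

    pos, vel: lists of [x, y]; mass: list of scalars.
    eps softens the force at tiny separations to avoid division by zero.
    """
    n = len(pos)
    forces = [[0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:                  # no self-induced force
                continue
            dx = pos[j][0] - pos[i][0]
            dy = pos[j][1] - pos[i][1]
            r = math.sqrt(dx * dx + dy * dy) + eps
            f = G * mass[i] * mass[j] / (r * r)   # gravitational force law
            forces[i][0] += f * dx / r            # direction: toward body j
            forces[i][1] += f * dy / r
    for i in range(n):                  # "push" phase: F = ma
        for d in range(2):
            vel[i][d] += forces[i][d] / mass[i] * dt
            pos[i][d] += vel[i][d] * dt
    return pos, vel
```

Two equal masses attract each other symmetrically, so the velocity changes cancel and total momentum is conserved.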
Timestep selection
• We approximate the velocity of a particle by the tangent to the particle's trajectory
• Since we compute velocities at discrete points in space and time, we approximate the true trajectory by a straight line
• So long as Δt is small enough, the resultant error is reasonable
• If not, we might "jump" to another trajectory: this is an error
[Figure: Particle A's trajectory and Particle B's trajectory]
Some details
• There is no self-induced force
• How should we choose the timestep Δt?
• We apply numerical criteria to select Δt, based on an analysis of the evolving solution
• Selection may be ongoing, i.e. dynamic
Computing the force
• The running time of the computation is dominated by the force computation, so we ignore the push phase
• The simplest approach is to use the direct method, with a running time of O(N²):
  Force on particle i = Σj=0:N-1 F(xi, xj)
• F( ) is the force law; one example is the gravitational force law:
  F(xi, xj) = G mi mj / r²ij, where rij = dist(xi, xj) and G is the gravitational constant
Significance of locality
• Many scientific problems exhibit both spatial and temporal locality, due to the underlying physics
  – The values of the solution at nearby points in space and time are closer than values at distant points
  – Gravitational interactions decay as 1/r²
  – Waves live in a localized region of space at any one time and will appear in a nearby region of space at a nearby point in time
• Numerically speaking, timesteps are "small": the solution changes gradually
• Let's consider a more highly localized force
Jim Demmel, U. C. Berkeley
Localized Van Der Waals Force
F(r) = C (2 − 1/(30r))   if r < δ
F(r) = C / (30r)⁵        if r ≥ δ,   with C = 0.1

[Plot: F(r) for 0 ≤ r ≤ 0.2; F ranges from about −0.1 to 0.15]
Implementation
• We don't need to compute all O(N²) interactions
• To speed up the search for nearby particles, sort them into a chaining mesh (Hockney & Eastwood, 1981)
• Compute forces one mesh box at a time
• Only consider particles in the box itself and the 8 surrounding cells
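A minimal sketch of the chaining-mesh idea in Python. The helper names are hypothetical, and the cell side is assumed to be at least the force cutoff δ, so only the box itself and the 8 surrounding cells need be searched for interaction partners.

```python
from collections import defaultdict

def build_chaining_mesh(pos, cell):
    """Bin 2D particles into square cells of side `cell` (>= force cutoff)."""
    mesh = defaultdict(list)
    for idx, (x, y) in enumerate(pos):
        mesh[(int(x // cell), int(y // cell))].append(idx)
    return mesh

def neighbor_pairs(pos, cell):
    """Candidate interacting pairs: same box or one of the 8 surrounding cells."""
    mesh = build_chaining_mesh(pos, cell)
    pairs = set()
    for (cx, cy), members in mesh.items():
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for i in members:
                    for j in mesh.get((cx + dx, cy + dy), ()):
                        if i < j:        # count each pair once
                            pairs.add((i, j))
    return pairs
```

Particles farther apart than one cell are never even considered, which is where the savings over the O(N²) direct method come from.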
Jim Demmel, U. C. Berkeley
Parallel implementation
• We split the mesh as in a stencil method
• We use ghost cells to store copies of nearby particles
• The ghost region also manages particles that have moved outside the subdomain and must be repatriated to their new owner
• This requires packing and unpacking of particle lists
Load Balancing
• Boxes carry varying amounts of work depending on local particle density
• A uniformly blocked workload assignment will result in a load imbalance
• Particles can redistribute themselves over processors
• Two alternatives:
  – Static: cyclic mapping
  – Dynamic: non-uniform partitioning
Static decomposition
• Block cyclic decomposition
• All processors should get about the same amount of work if we choose a reasonable chunk size
• What are the tradeoffs?

  0 1 0 1 0 1 0 1
  2 3 2 3 2 3 2 3
  0 1 0 1 0 1 0 1
  2 3 2 3 2 3 2 3
  0 1 0 1 0 1 0 1
  2 3 2 3 2 3 2 3
  0 1 0 1 0 1 0 1
  2 3 2 3 2 3 2 3
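The cyclic mapping in the figure (a 2×2 processor grid swept cyclically over the boxes) can be written as a one-line owner function. A minimal sketch; `cyclic_owner` is a hypothetical name, and processors are assumed numbered row-major on a prow × pcol grid.

```python
def cyclic_owner(i, j, b, prow, pcol):
    """Owner of mesh box (i, j) under a block-cyclic map with b-by-b blocks
    on a prow-by-pcol processor grid (processors numbered row-major)."""
    return ((i // b) % prow) * pcol + ((j // b) % pcol)
```

With b = 1 and a 2×2 grid this reproduces the alternating 0 1 / 2 3 pattern shown above.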
Communication overhead
• With b=2, each processor must obtain all the simulation data
• With b=4, this factor decreases to ½
• If the dependence distance increases, so must the granularity
  0 1 0 1 0 1
  2 3 2 3 2 3
  0 1 0 1 0 1
  2 3 2 3 2 3
  0 1 0 1 0 1
  2 3 2 3 2 3
Jim Demmel, U. C. Berkeley
  0 1 2 3 0 1 2 3
  4 5 6 7 4 5 6 7
  8 9 10 11 8 9 10 11
  12 13 14 15 12 13 14 15
  0 1 2 3 0 1 2 3
  4 5 6 7 4 5 6 7
  8 9 10 11 8 9 10 11
  12 13 14 15 12 13 14 15
Non-uniform partitioning
• Cyclic partitioning is a static strategy
  – Partitions space uniformly and maps work irregularly
  – A statistical sampling method: it does not attempt to measure load imbalance
  – May incur high communication overheads due to the fine-grain data decomposition
• Another approach is to partition the work non-uniformly according to the spatial workload distribution
  – Each non-uniform region carries an equal amount of work
  – Particles move: the partitioning changes over time
• Avoids the high surface-to-volume ratio of cyclic decomposition
• In many numerical problems the solution changes gradually due to timestep constraints, so we don't have to repartition every timestep
The workload density distribution
• Irregular partitioning algorithms require that we estimate the workload in space and time
• Construct a workload density mapping ρ(x,t) giving the workload at each point x = (x,y,z) of the mesh
• In static problems the workload density mapping depends only on position
• In many applications we can come up with a good estimate of the mapping
• Since timesteps are "small," particle motion is localized in space and in time
• This condition holds for continuum methods such as classical partial differential equations: we can adjust the partitioning frequency accordingly
Non-uniform blocked decomposition
• A well-known partitioning technique is recursive coordinate bisection (RCB)
An example of RCB in one dimension
• Consider the following loop:
    for i = 0 : N-1 do
        if ( G(i) ) then
            y[i] = f1( … )
        else
            y[i] = f2( … )
    end for
• Assume f1( ) takes twice as long to compute as f2( )
Partitioning procedure
• Let W[i] = if (G(i)) then 2 else 1:
  1 1 2 1 2 2 2 1 1 1 1 1 1 2 2 1
• Compute the running sum (i.e. the scan) of W:
  1 2 4 5 7 9 11 12 13 14 15 16 17 19 21 22
• Split into 2 equal parts at half the total work, 22/2 = 11:
  1 1 2 1 2 2 2 | 1 1 1 1 1 1 2 2 1
Recursive coordinate bisection
• Recurse until the desired number of partitions has been rendered:
  1 1 2 1 2 2 2 1 1 1 1 1 1 2 2 1
• May be applied in multiple dimensions
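The procedure on this and the previous slide can be sketched in Python. The function name is hypothetical, and the number of partitions is assumed to be a power of two.

```python
from itertools import accumulate

def rcb_1d(weights, parts):
    """1D recursive coordinate bisection (a sketch): split `weights` into
    `parts` contiguous pieces of nearly equal total work.
    Returns a list of (start, end) index ranges."""
    def bisect(lo, hi):
        scan = list(accumulate(weights[lo:hi]))   # running sum (the scan)
        half = scan[-1] / 2
        # cut after the first element whose running sum reaches half the work
        return lo + next(k for k, s in enumerate(scan) if s >= half) + 1

    pieces = [(0, len(weights))]
    while len(pieces) < parts:                    # recurse level by level
        split = []
        for lo, hi in pieces:
            cut = bisect(lo, hi)
            split.extend([(lo, cut), (cut, hi)])
        pieces = split
    return pieces
```

On the slide's weight vector, a 2-way split cuts after the seventh element, giving two pieces of work 11 each.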
Load Balancing with space filling curves
• Multidimensional RCB suffers from granularity problems
• We can reduce the granularity with many partitions per processor
• This introduces higher surface-to-volume effects
• Another approach: space-filling curves
Partitioning with space-filling curves
• Maps higher-dimensional physical space onto the line
  – Load balancing in one dimension
  – Many types; Hilbert curves shown here
• Irregular communication surfaces
• Useful for managing locality: memory, databases
mathforum.org/advanced/robertd/lsys3d.html
http://www.math.ohio-state.edu/~fiedorow/math655/Peano.html
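A space-filling order is easy to sketch with the Morton (Z-order) curve, a simpler relative of the Hilbert curves shown in the figures; this is an illustration, not the Hilbert construction itself. Sorting cells by interleaved coordinate bits keeps nearby cells nearby on the line, which can then be cut into equal-work pieces.

```python
def morton_key(x, y, bits=16):
    """Interleave the bits of integer cell coordinates (x, y):
    the resulting 1D key traces a Z-order space-filling curve."""
    key = 0
    for b in range(bits):
        key |= ((x >> b) & 1) << (2 * b)       # x bits in even positions
        key |= ((y >> b) & 1) << (2 * b + 1)   # y bits in odd positions
    return key

# Sort 2D cells along the curve; contiguous runs of the sorted list
# then form spatially compact partitions.
cells = [(x, y) for x in range(4) for y in range(4)]
curve = sorted(cells, key=lambda c: morton_key(*c))
```

The first four cells along the curve form one 2×2 quadrant, which is exactly the locality property that makes these curves useful for partitioning.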
Load balancing efficiency
• If we ignore serial sections and other overheads, then we may express load imbalance in terms of a load balancing efficiency metric
• Let each processor i complete its assigned work in time Ti
• Thus, the running time TP = MAXi ( Ti )
• Define the total work T = Σi Ti
• We define the load balancing efficiency η = T / (P · TP)
• Ideally η = 1.0
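The metric is a one-liner; a minimal sketch with a hypothetical function name:

```python
def load_balance_efficiency(times):
    """eta = (sum of per-processor times) / (P * max time).
    1.0 means perfectly balanced; smaller means some processors idle."""
    P = len(times)
    return sum(times) / (P * max(times))
```

For example, three processors finishing in 2, 1, and 1 time units give η = 4/6 ≈ 0.67.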
Classifying application domains – Colella’s “7 Dwarfs”
– Classify applications according to their pattern of data access and operations on the data: important in HPC
– Structured grids: image processing, simulations
– Dense linear algebra: Gaussian elimination, SVD, many others
– N-body methods
– Sparse linear algebra
– Unstructured grids
– Spectral methods (FFT)
– Monte Carlo
Sparse Matrices
• A matrix where we employ knowledge about the location of the non-zeroes
• Consider Jacobi’s method with a 5-point stencil
u'[i,j] = (u[i-1,j] + u[i+1,j] + u[i,j-1] + u[i,j+1] − h²f[i,j]) / 4
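A minimal Python sketch of one Jacobi sweep of the 5-point stencil; `jacobi_sweep` is a hypothetical name, and fixed (Dirichlet) boundary values are assumed.

```python
def jacobi_sweep(u, f, h):
    """One Jacobi sweep of the 5-point stencil on an n-by-n grid
    (lists of lists); boundary values are held fixed."""
    n = len(u)
    new = [row[:] for row in u]          # copy so boundaries are preserved
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            new[i][j] = (u[i-1][j] + u[i+1][j] +
                         u[i][j-1] + u[i][j+1] - h * h * f[i][j]) / 4.0
    return new
```

With f = 0 and a constant field, the update is a fixed point, which is a quick sanity check.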
• 1M × 1M submatrix of the web connectivity graph, constructed from an archive at the Stanford WebBase; ~3 non-zeroes/row
• Dense: 2²⁰ × 2²⁰ = 2⁴⁰ words = 1024 Gwords
• Sparse: (3/2²⁰) × 2⁴⁰ = 3 Mwords
• The sparse representation saves nearly a factor of a million (2²⁰/3 ≈ 350,000×) in storage
Web connectivity Matrix: 1M x 1M
Jim Demmel
Circuit Simulation
www.cise.ufl.edu/research/sparse/matrices/Hamm/scircuit.html
Motorola circuit: 170,998 × 170,998 matrix
958,936 nonzeroes (0.003% of entries, 5.6 nonzeroes/row)
Cholesky
• In some cases the matrix is both
  – Symmetric: A = Aᵀ
  – Positive definite
• Then A = L · Lᵀ, where the lower-triangular factor L acts as a "square root" of A
• No need for pivoting
• We can use Cholesky factorization
• The flop count is cut in half
• What complications does this avoid?
• Result: many matrices can be transformed into this form
• SuperLU: Demmel and Li
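A minimal dense Cholesky sketch (note this is not SuperLU, which is a sparse LU code; this block just illustrates the factorization A = L·Lᵀ on a small dense matrix):

```python
import math

def cholesky(A):
    """Cholesky factorization A = L * L^T for a symmetric positive-definite
    matrix (dense, lists of lists); returns the lower-triangular factor L."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][i] = math.sqrt(A[i][i] - s)    # diagonal entry
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]   # below-diagonal entry
    return L
```

No pivoting is needed, and only the lower triangle is computed, which is where the halved flop count comes from.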
Savings with Sparse Matrices
• 7 × 7 grid: 49 × 49 matrix; use Cholesky's method
• Nonzeroes: 349, 30% of the space
• Flops: 2643, 6.7%
• Fill-in: a zero entry becomes nonzero
Managing fill-in
• Fill-in and computation are reduced with an improved ordering
• The advantage increases with N: for n = 63 (N = n² = 3969)
  – 250,109 non-zeroes → 85,416
  – 16 million flops → 3.6 million
• Natural order: nonzeros = 233; minimum-degree order: nonzeros = 207
Sparse Matrix Vector Multiplication
• An important kernel used in sparse linear algebra: y[i] += A[i,j] × x[j]
• Many formats, common format for CPUs is Compressed Sparse Row (CSR)
Jim Demmel
Sparse matrix vector multiply kernel:

    // y[i] += A[i,j] × x[j]
    #pragma omp parallel for schedule(dynamic, chunk)
    for i = 0 : N-1              // rows
        i0 = ptr[i]
        i1 = ptr[i+1] - 1
        for j = i0 : i1          // nonzeroes of row i
            y[i] += val[j] * x[ind[j]]
        end j
    end i
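The same kernel, as a runnable Python sketch of the CSR format (hypothetical function name; `ptr`, `ind`, `val` follow the slide's naming):

```python
def csr_matvec(ptr, ind, val, x):
    """y = A*x with A in Compressed Sparse Row form:
    row i's nonzeroes are val[ptr[i]:ptr[i+1]], in columns ind[ptr[i]:ptr[i+1]]."""
    y = [0.0] * (len(ptr) - 1)
    for i in range(len(ptr) - 1):          # rows
        for j in range(ptr[i], ptr[i + 1]):  # nonzeroes of row i
            y[i] += val[j] * x[ind[j]]
    return y
```

Only the stored nonzeroes are touched, so the work is O(nnz) rather than O(N²).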
[Figure: matrix A (rows i ↓, columns j →) multiplying vector x]
Irregular mesh: NASA Airfoil in 2D
Jim Demmel
• In the Jacobi method, computational effort is applied uniformly
• Irregular "unstructured" meshes don't have a uniform structure, and they apply computational effort non-uniformly
• We can apply mesh updates directly, but it is more common to generate the equivalent sparse matrix
A Structured Adaptive mesh
Courtesy Phil Colella, Lawrence Berkeley National Laboratory
Turbulent flow over a step
Applied Mathematics Group Center for Applied Scientific Computing, Lawrence Livermore National Laboratory
http://www.llnl.gov/CASC/groups/casc-amg.html
Irregular Problems
• In the Jacobi application, computational effort is applied uniformly
• Irregular applications apply computational effort non-uniformly
• A load balancing problem arises when partitioning the data
• There may not be a mesh
Courtesy of Randy Bank
Achieving Asynchronicity on Multicore (Source: Jack Dongarra; system: PLASMA)
• The matrix factorization can be represented as a DAG:
  – nodes: tasks that operate on "tiles"
  – edges: dependencies among tasks
• Tasks can be scheduled asynchronously and in any order as long as dependencies are not violated
02/14/2006 CS267 Lecture 9 44
Row and Column Block Cyclic Layout
• processors and matrix blocks are distributed in a 2d array
• prow-by-pcol array of processors • brow-by-bcol matrix blocks
• pcol-fold parallelism in any column, and calls to the BLAS2 and BLAS3 on matrices of size brow-by-bcol
• serial bottleneck is eased
• prow ≠ pcol and brow ≠ bcol possible, even desirable
0 1 0 1 0 1 0 1
2 3 2 3 2 3 2 3
0 1 0 1 0 1 0 1
2 3 2 3 2 3 2 3
0 1 0 1 0 1 0 1
2 3 2 3 2 3 2 3
0 1 0 1 0 1 0 1
2 3 2 3 2 3 2 3
(block dimensions: brow × bcol)
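The owner computation for this layout can be sketched as follows; the function name is hypothetical, and it allows brow ≠ bcol and prow ≠ pcol, as the slide notes.

```python
def block_cyclic_owner(I, J, brow, bcol, prow, pcol):
    """Process coordinates (pr, pc) owning global matrix entry (I, J) under a
    2D block-cyclic layout with brow-by-bcol blocks on a prow-by-pcol grid."""
    return ((I // brow) % prow, (J // bcol) % pcol)
```

With 1×1 blocks on a 2×2 grid this reproduces the alternating 0 1 / 2 3 pattern of the figure (process (pr, pc) numbered pr·pcol + pc).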
Distributed GE with a 2D Block Cyclic Layout
• The block size b in the algorithm and the block sizes brow and bcol in the layout satisfy b = bcol
• Shaded regions indicate processors busy with computation or communication
• It is unnecessary to have a barrier between each step of the algorithm; e.g., steps 9, 10, and 11 can be pipelined