Parallel Numerical Algorithms
Chapter 2 – Parallel Thinking
Section 2.1 – Parallel Algorithm Design
Michael T. Heath and Edgar Solomonik
Department of Computer Science
University of Illinois at Urbana-Champaign
CS 554 / CSE 512
Outline
1 Computational Model
2 Design Methodology
    Partitioning
    Communication
    Agglomeration
    Mapping
3 Example
Computational Model
Task: a subset of the overall program, with a set of inputs and outputs

Parallel computation: a program that executes two or more tasks concurrently

Communication channel: connection between two tasks over which information is passed (messages are sent and received) periodically

For now we work with the following messaging semantics:
    send is nonblocking: sending task resumes execution immediately
    receive is blocking: receiving task blocks execution until requested message is available
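To make these semantics concrete, here is a minimal sketch in C using MPI, whose MPI_Isend / MPI_Recv pair matches the nonblocking-send, blocking-receive model above (the two-process setup and message contents are our illustrative assumptions, not part of the slides):

    /* Two processes exchange one value each:
       nonblocking send, blocking receive. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* assumes exactly 2 processes */
        int other = 1 - rank;
        double msg = 3.14, buf;
        MPI_Request req;
        /* nonblocking send: sending task resumes execution immediately */
        MPI_Isend(&msg, 1, MPI_DOUBLE, other, 0, MPI_COMM_WORLD, &req);
        /* blocking receive: task blocks until requested message is available */
        MPI_Recv(&buf, 1, MPI_DOUBLE, other, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Wait(&req, MPI_STATUS_IGNORE);     /* complete send before reusing msg */
        printf("rank %d received %g\n", rank, buf);
        MPI_Finalize();
        return 0;
    }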
Example: Laplace Equation in 1-D
Consider Laplace equation in 1-D
u′′(t) = 0
on interval a < t < b with BC
u(a) = α, u(b) = β
Seek approximate solution vector u such that
u_i ≈ u(t_i) at mesh points t_i = a + ih, i = 0, …, n+1,

where h = (b − a)/(n + 1)
Example: Laplace Equation in 1-D
Finite difference approximation
u′′(t_i) ≈ (u_{i+1} − 2u_i + u_{i−1}) / h²
yields tridiagonal system of algebraic equations
(u_{i+1} − 2u_i + u_{i−1}) / h² = 0,  i = 1, …, n,

for u_i, i = 1, …, n, where u_0 = α and u_{n+1} = β
Starting from initial guess u^(0), compute Jacobi iterates

u_i^(k+1) = (u_{i−1}^(k) + u_{i+1}^(k)) / 2,  i = 1, …, n,

for k = 0, 1, … until convergence
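As a point of reference before parallelizing, the iteration is a few lines of serial C (a sketch we add; the array layout, with u[0] = α and u[n+1] = β, and the convergence test are our illustrative choices):

    /* Serial Jacobi iteration for the 1-D Laplace system above.
       u and unew have length n+2; u[0] and u[n+1] hold the BC. */
    #include <math.h>
    #include <string.h>

    void jacobi_1d(double *u, double *unew, int n, double tol) {
        double diff;
        do {
            diff = 0.0;
            for (int i = 1; i <= n; i++) {
                unew[i] = 0.5 * (u[i-1] + u[i+1]);        /* Jacobi update */
                diff = fmax(diff, fabs(unew[i] - u[i]));
            }
            memcpy(&u[1], &unew[1], n * sizeof(double));  /* u = unew */
        } while (diff >= tol);                            /* until convergence */
    }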
Example: Laplace Equation in 1-D
Define n tasks, one for each u_i, i = 1, …, n

Task i stores initial value of u_i and updates it at each iteration until convergence

To update u_i, necessary values of u_{i−1} and u_{i+1} obtained from neighboring tasks i−1 and i+1
[Diagram: chain of n tasks holding u_1, u_2, u_3, …, u_n, with communication channels between adjacent tasks]
Tasks 1 and n determine u_0 and u_{n+1} from BC
Example: Laplace Equation in 1-D
initialize u_i
for k = 1, …
    if i > 1, send u_i to task i−1          { send to left neighbor }
    if i < n, send u_i to task i+1          { send to right neighbor }
    if i < n, recv u_{i+1} from task i+1    { receive from right neighbor }
    if i > 1, recv u_{i−1} from task i−1    { receive from left neighbor }
    wait for sends to complete
    u_i = (u_{i−1} + u_{i+1})/2             { update my value }
end
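A direct rendering of this fine-grain program in C with MPI, one mesh point per process (a sketch under our own conventions, not the slides' code: rank r plays task r+1, and a fixed step count stands in for the convergence test):

    /* Fine-grain task: one mesh point per MPI process.
       Boundary ranks substitute alpha/beta for missing neighbors. */
    #include <mpi.h>

    double jacobi_point(double u, double alpha, double beta,
                        int nsteps, MPI_Comm comm) {
        int rank, p;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &p);              /* p == n: one point per task */
        for (int k = 0; k < nsteps; k++) {
            double left = alpha, right = beta;
            MPI_Request reqs[2];
            int nreq = 0;
            if (rank > 0)                     /* send to left neighbor */
                MPI_Isend(&u, 1, MPI_DOUBLE, rank - 1, 0, comm, &reqs[nreq++]);
            if (rank < p - 1)                 /* send to right neighbor */
                MPI_Isend(&u, 1, MPI_DOUBLE, rank + 1, 0, comm, &reqs[nreq++]);
            if (rank < p - 1)                 /* receive from right neighbor */
                MPI_Recv(&right, 1, MPI_DOUBLE, rank + 1, 0, comm, MPI_STATUS_IGNORE);
            if (rank > 0)                     /* receive from left neighbor */
                MPI_Recv(&left, 1, MPI_DOUBLE, rank - 1, 0, comm, MPI_STATUS_IGNORE);
            MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);  /* sends complete */
            u = 0.5 * (left + right);         /* update my value */
        }
        return u;
    }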
Mapping Tasks to Processors
Tasks must be assigned to physical processors for execution

Tasks can be mapped to processors in various ways, including multiple tasks per processor

Semantics of program should not depend on number of processors or particular mapping of tasks to processors

Performance usually sensitive to assignment of tasks to processors due to concurrency, workload balance, communication patterns, etc.

Computational model maps naturally onto distributed-memory multicomputer using message passing
Four-Step Design Methodology
Partition: Decompose problem into fine-grain tasks, maximizing number of tasks that can execute concurrently

Communicate: Determine communication pattern among fine-grain tasks, yielding task graph with fine-grain tasks as nodes and communication channels as edges

Agglomerate: Combine groups of fine-grain tasks to form fewer but larger coarse-grain tasks, thereby reducing communication requirements

Map: Assign coarse-grain tasks to processors, subject to tradeoffs between communication costs and concurrency
Four-Step Design Methodology
[Diagram: Problem → Partition → Communicate → Agglomerate → Map]
Graph Embeddings
Target network may be virtual network topology, with nodes usually called processors or processes

Overall design methodology is composed of sequence of graph embeddings:
    fine-grain task graph to coarse-grain task graph
    coarse-grain task graph to virtual network graph
    virtual network graph to physical network graph

Depending on circumstances, one or more of these embeddings may be skipped

An alternative methodology is to map tasks and communication onto a graph of the network topology laid out in time, similar to the way we defined butterfly protocols
Partitioning Strategies
Domain partitioning: subdivide geometric domain into subdomains

Functional decomposition: subdivide algorithm into multiple logical components

Independent tasks: subdivide computation into tasks that do not depend on each other (embarrassingly parallel)

Array parallelism: subdivide data stored in vectors, matrices, or other arrays

Divide-and-conquer: subdivide problem recursively into tree-like hierarchy of subproblems

Pipelining: subdivide sequences of tasks performed by the algorithm on each piece of data
Desirable Properties of Partitioning
Maximum possible concurrency in executing resulting tasks, ideally enough to keep all processors busy

Number of tasks, rather than size of each task, grows as overall problem size increases
Tasks reasonably uniform in size
Redundant computation or storage avoided
Example: Domain Decomposition
[Figure: 3-D domain partitioned along one (left), two (center), or all three (right) of its dimensions]

With 1-D or 2-D partitioning, minimum task size grows with problem size, but not with 3-D partitioning
Communication Patterns
Communication pattern determined by data dependences among tasks: because storage is local to each task, any data stored or produced by one task and needed by another must be communicated between them

Communication pattern may be
    local or global
    structured or random
    persistent or dynamically changing
    synchronous or sporadic
Desirable Properties of Communication
Frequency and volume minimized
Highly localized (between neighboring tasks)
Reasonably uniform across channels
Network resources used concurrently
Does not inhibit concurrency of tasks
Overlapped with computation as much as possible
Agglomeration
Increasing task sizes can reduce communication but may also reduce concurrency

Subtasks that can’t be executed concurrently anyway are obvious candidates for combining into single task

Maintaining balanced workload still important

Replicating computation can eliminate communication and is advantageous if result is cheaper to compute than to communicate
Example: Laplace Equation in 1-D
Combine groups of consecutive mesh points t_i and corresponding solution values u_i into coarse-grain tasks, yielding p tasks, each with n/p of the u_i values

[Diagram: one coarse-grain task holding consecutive values u_l, …, u_r, with neighboring values u_{l−1} and u_{r+1} obtained from the adjacent tasks]

Communication is greatly reduced, but u_i values within each coarse-grain task must be updated sequentially
Example: Laplace Equation in 1-D
initialize u_l, …, u_r
for k = 1, …
    if j > 1, send u_l to task j−1          { send to left neighbor }
    if j < p, send u_r to task j+1          { send to right neighbor }
    if j < p, recv u_{r+1} from task j+1    { receive from right neighbor }
    if j > 1, recv u_{l−1} from task j−1    { receive from left neighbor }
    for i = l to r
        ū_i = (u_{i−1} + u_{i+1})/2         { update local values }
    end
    wait for sends to complete
    u = ū
end
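In C with MPI this coarse-grain task might look as follows (our own sketch, not the slides' code: rank j owns m = n/p points in u[1..m], ghost cells u[0] and u[m+1] hold neighbor values, boundary ranks initialize them to α and β, and a fixed step count stands in for the convergence test):

    /* Coarse-grain Jacobi task: block of m points per MPI process. */
    #include <mpi.h>
    #include <string.h>

    void jacobi_block(double *u, double *unew, int m, int nsteps, MPI_Comm comm) {
        int j, p;
        MPI_Comm_rank(comm, &j);
        MPI_Comm_size(comm, &p);
        for (int k = 0; k < nsteps; k++) {
            MPI_Request reqs[2];
            int nreq = 0;
            if (j > 0)          /* send u_l to left neighbor */
                MPI_Isend(&u[1], 1, MPI_DOUBLE, j - 1, 0, comm, &reqs[nreq++]);
            if (j < p - 1)      /* send u_r to right neighbor */
                MPI_Isend(&u[m], 1, MPI_DOUBLE, j + 1, 0, comm, &reqs[nreq++]);
            if (j < p - 1)      /* receive right ghost cell u_{r+1} */
                MPI_Recv(&u[m+1], 1, MPI_DOUBLE, j + 1, 0, comm, MPI_STATUS_IGNORE);
            if (j > 0)          /* receive left ghost cell u_{l-1} */
                MPI_Recv(&u[0], 1, MPI_DOUBLE, j - 1, 0, comm, MPI_STATUS_IGNORE);
            for (int i = 1; i <= m; i++)                  /* update local values */
                unew[i] = 0.5 * (u[i-1] + u[i+1]);
            MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE); /* sends complete */
            memcpy(&u[1], &unew[1], m * sizeof(double));  /* u = ū */
        }
    }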
Overlapping Communication and Computation
Updating of solution values u_i is done only after all communication has been completed, but only two of those values actually depend on awaited data

Since communication is often much slower than computation, initiate communication by sending all messages first, then update all “interior” values while awaiting values from neighboring tasks

Much (possibly all) of updating can be done while task would otherwise be idle awaiting messages

Performance can often be enhanced by overlapping communication and computation in this manner
Example: Laplace Equation in 1-D
initialize u_l, …, u_r
for k = 1, …
    if j > 1, send u_l to task j−1          { send to left neighbor }
    if j < p, send u_r to task j+1          { send to right neighbor }
    for i = l+1 to r−1
        ū_i = (u_{i−1} + u_{i+1})/2         { update local values }
    end
    if j < p, recv u_{r+1} from task j+1    { receive from right neighbor }
    ū_r = (u_{r−1} + u_{r+1})/2             { update local value }
    if j > 1, recv u_{l−1} from task j−1    { receive from left neighbor }
    ū_l = (u_{l−1} + u_{l+1})/2             { update local value }
    wait for sends to complete
    u = ū
end
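The coarse-grain MPI sketch from before, reorganized the same way (assuming m ≥ 2): both receives become nonblocking too, interior points needing no remote data are updated while messages are in flight, and the two edge points are updated once the ghost cells arrive:

    /* Overlapped coarse-grain Jacobi task (same conventions as above). */
    #include <mpi.h>
    #include <string.h>

    void jacobi_block_overlap(double *u, double *unew, int m,
                              int nsteps, MPI_Comm comm) {
        int j, p;
        MPI_Comm_rank(comm, &j);
        MPI_Comm_size(comm, &p);
        for (int k = 0; k < nsteps; k++) {
            MPI_Request sreq[2], rreq[2];
            int ns = 0, nr = 0;
            if (j > 0)          /* send u_l to left neighbor */
                MPI_Isend(&u[1], 1, MPI_DOUBLE, j - 1, 0, comm, &sreq[ns++]);
            if (j < p - 1)      /* send u_r to right neighbor */
                MPI_Isend(&u[m], 1, MPI_DOUBLE, j + 1, 0, comm, &sreq[ns++]);
            if (j < p - 1)      /* post receive for right ghost cell */
                MPI_Irecv(&u[m+1], 1, MPI_DOUBLE, j + 1, 0, comm, &rreq[nr++]);
            if (j > 0)          /* post receive for left ghost cell */
                MPI_Irecv(&u[0], 1, MPI_DOUBLE, j - 1, 0, comm, &rreq[nr++]);
            for (int i = 2; i <= m - 1; i++)    /* interior: no remote data */
                unew[i] = 0.5 * (u[i-1] + u[i+1]);
            MPI_Waitall(nr, rreq, MPI_STATUSES_IGNORE);   /* ghosts arrived */
            unew[1] = 0.5 * (u[0] + u[2]);      /* update leftmost value */
            unew[m] = 0.5 * (u[m-1] + u[m+1]);  /* update rightmost value */
            MPI_Waitall(ns, sreq, MPI_STATUSES_IGNORE);   /* sends complete */
            memcpy(&u[1], &unew[1], m * sizeof(double));  /* u = ū */
        }
    }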
Surface-to-Volume Effect
For domain decomposition,
    computation is proportional to volume of subdomain
    communication is (roughly) proportional to surface area of subdomain

Higher-dimensional decompositions have more favorable surface-to-volume ratio

Partitioning across more dimensions yields more neighboring subdomains but smaller total volume of communication than partitioning across fewer dimensions
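A worked example (ours, not from the slides) makes the comparison concrete. For an n × n × n mesh on p processors, the number of boundary points an interior processor communicates per step is

    V_1D = 2n²                              (two n × n faces per slab)
    V_3D = 6(n/p^{1/3})² = 6n²/p^{2/3}      (six faces per subcube)
    V_1D / V_3D = p^{2/3} / 3

so the 3-D decomposition communicates less per processor whenever p > 3^{3/2} ≈ 5.2, despite each subcube having six neighbors rather than two.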
Mapping
As with agglomeration, mapping of coarse-grain tasks to processors should maximize concurrency, minimize communication, maintain good workload balance, etc.

But connectivity of coarse-grain task graph is inherited from that of fine-grain task graph, whereas connectivity of target interconnection network is independent of problem

Communication channels between tasks may or may not correspond to physical connections in underlying interconnection network between processors
Mapping
Two communicating tasks can be assigned to
    one processor, avoiding interprocessor communication but sacrificing concurrency,
    two adjacent processors, so communication between the tasks is directly supported, or
    two nonadjacent processors, so message routing is required

In general, finding optimal solution to these tradeoffs is NP-complete, so heuristics are used to find effective compromise
Mapping
For many problems, task graph has regular structure that can make mapping easier

If communication is mainly global, then communication performance may not be sensitive to placement of tasks on processors, so random mapping may be as good as any

Random mappings sometimes used deliberately to avoid communication hot spots, where some communication links are oversubscribed with message traffic
Mapping Strategies
With n tasks and p processors consecutively numbered in some ordering,
    block mapping: blocks of n/p consecutive tasks are assigned to successive processors
    cyclic mapping: task i is assigned to processor i mod p
    reflection mapping: like cyclic mapping except tasks are assigned in reverse order on alternate passes
    block-cyclic mapping and block-reflection mapping: blocks of tasks assigned to processors as in cyclic or reflection

For higher-dimensional grid, these mappings can be applied in each dimension
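These mappings reduce to short index formulas; a sketch in C (function names are ours; tasks and processors numbered from 0):

    /* Owner processor of task i under each static mapping. */

    int block_owner(int i, int n, int p) {
        int b = (n + p - 1) / p;          /* block size: ceil(n/p) */
        return i / b;
    }

    int cyclic_owner(int i, int p) {
        return i % p;                     /* task i -> processor i mod p */
    }

    int reflection_owner(int i, int p) {
        int pass = i / p, offset = i % p;
        return (pass % 2 == 0) ? offset : p - 1 - offset;  /* reverse on odd passes */
    }

    int block_cyclic_owner(int i, int b, int p) {
        return (i / b) % p;               /* blocks of b tasks dealt cyclically */
    }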
Examples of Mappings
[Figure: 16 tasks (numbered 0–15) assigned to 4 processors under block, cyclic, reflection, and block-cyclic mappings]
Dynamic Mapping
If task sizes vary during computation or can’t be predicted in advance, tasks may need to be reassigned to processors dynamically to maintain reasonable workload balance throughout computation

To be beneficial, gain in load balance must more than offset cost of communication required to move tasks and their data between processors

Dynamic load balancing usually based on local exchanges of workload information (and tasks, if necessary), so work diffuses over time to be reasonably uniform across processors
Task Scheduling
With multiple tasks per processor, execution of those tasks must be scheduled over time

For shared-memory, any idle processor can simply select next ready task from common pool of tasks, or use work stealing, taking a task from the task queue of another processor

For distributed-memory, the manager/worker paradigm can implement a common task pool, with manager dispatching tasks to workers

In distributed-memory, better scalability is achieved by dynamic load balancing strategies that perform periodic global or hierarchical rebalancing
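A minimal manager/worker loop in C with MPI (a sketch under our own conventions, not the slides' code: rank 0 is the manager, tasks are integer ids, and a negative id tells a worker to stop):

    /* Manager/worker task pool: rank 0 dispatches integer task ids;
       each worker requests the next task until it receives -1. */
    #include <mpi.h>

    void manager(int ntasks, MPI_Comm comm) {
        int p, next = 0;
        MPI_Comm_size(comm, &p);
        for (int done = 0; done < p - 1; ) {
            int dummy;
            MPI_Status st;
            /* wait for any idle worker to ask for work */
            MPI_Recv(&dummy, 1, MPI_INT, MPI_ANY_SOURCE, 0, comm, &st);
            int task = (next < ntasks) ? next++ : -1;   /* -1 = stop */
            if (task < 0) done++;
            MPI_Send(&task, 1, MPI_INT, st.MPI_SOURCE, 0, comm);
        }
    }

    void worker(void (*do_task)(int), MPI_Comm comm) {
        for (;;) {
            int dummy = 0, task;
            MPI_Send(&dummy, 1, MPI_INT, 0, 0, comm);   /* request work */
            MPI_Recv(&task, 1, MPI_INT, 0, 0, comm, MPI_STATUS_IGNORE);
            if (task < 0) break;                        /* pool exhausted */
            do_task(task);
        }
    }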
Task Scheduling
For completely decentralized scheme, it can be difficult to determine when overall computation has been completed, so termination detection scheme is required

With multithreading, task scheduling can conveniently be driven by availability of data: whenever executing task becomes idle awaiting data, another task is executed

For problems with regular structure, it is often possible to determine mapping in advance that yields reasonable load balance and natural order of execution
Example: Atmospheric Flow Model
Fluid dynamics of atmosphere modeled by system of partial differential equations

3-D problem domain discretized by n_x × n_y × n_z mesh of points

Vertical dimension (altitude) z much smaller than horizontal dimensions (latitude and longitude) x and y, so n_z ≪ n_x, n_y

Derivatives in PDEs approximated by finite differences

Simulation proceeds through successive discrete steps in time
Example: Atmospheric Flow Model
Partition:
    Each fine-grain task computes and stores data values (pressure, temperature, etc.) for one mesh point
    Large-scale problems may require millions or billions of mesh points / tasks

Communicate:
    Finite difference computations at each mesh point use 9-point horizontal stencil and 3-point vertical stencil
    Solar radiation computations require communication throughout each vertical column of mesh points
    Global communication to compute total mass of air over domain
Example: Atmospheric Flow Model
Agglomerate:
    Combine horizontal mesh points in b × b blocks into coarse-grain tasks to reduce communication for finite differences to exchanges between adjacent nodes
    Combine each vertical column of mesh points into single task to eliminate communication for solar computations
    Yields n_x/b × n_y/b coarse-grain tasks

Map:
    Cyclic or random mapping reduces load imbalance due to solar computations
Example: Atmospheric Flow Model
[Figure: Horizontal finite difference stencil for typical point (shaded black) in mesh for atmospheric flow model before (left) and after (right) agglomeration with b = 2]