Parallel Numerical Algorithms
Chapter 2 – Parallel Thinking
Section 2.1 – Parallel Algorithm Design
Michael T. Heath and Edgar Solomonik
Department of Computer Science
University of Illinois at Urbana-Champaign
CS 554 / CSE 512
Outline
1 Computational Model
2 Design Methodology
    Partitioning
    Communication
    Agglomeration
    Mapping
3 Example
Computational Model
Task: a subset of the overall program, with a set of inputs and outputs

Parallel computation: a program that executes two or more tasks concurrently

Communication channel: connection between two tasks over which information is passed (messages are sent and received) periodically

For now we work with the following messaging semantics:
    send is nonblocking: sending task resumes execution immediately
    receive is blocking: receiving task blocks execution until requested message is available
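To make these semantics concrete, here is a minimal sketch in C using MPI, whose MPI_Isend / MPI_Recv pair matches the nonblocking-send, blocking-receive model above (the two-process setup and message contents are our illustrative assumptions, not part of the slides):

    /* Two processes exchange one value each:
       nonblocking send, blocking receive. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* assumes exactly 2 processes */
        int other = 1 - rank;
        double msg = 3.14, buf;
        MPI_Request req;
        /* nonblocking send: sending task resumes execution immediately */
        MPI_Isend(&msg, 1, MPI_DOUBLE, other, 0, MPI_COMM_WORLD, &req);
        /* blocking receive: task blocks until requested message is available */
        MPI_Recv(&buf, 1, MPI_DOUBLE, other, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Wait(&req, MPI_STATUS_IGNORE);     /* complete send before reusing msg */
        printf("rank %d received %g\n", rank, buf);
        MPI_Finalize();
        return 0;
    }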
Example: Laplace Equation in 1-D
Consider Laplace equation in 1-D
u′′(t) = 0
on interval a < t < b with BC
u(a) = α, u(b) = β
Seek approximate solution vector u such that
u_i ≈ u(t_i) at mesh points t_i = a + ih, i = 0, …, n+1,

where h = (b − a)/(n + 1)
Example: Laplace Equation in 1-D
Finite difference approximation
u′′(t_i) ≈ (u_{i+1} − 2u_i + u_{i−1}) / h²
yields tridiagonal system of algebraic equations
(u_{i+1} − 2u_i + u_{i−1}) / h² = 0,  i = 1, …, n,

for u_i, i = 1, …, n, where u_0 = α and u_{n+1} = β
Starting from initial guess u^(0), compute Jacobi iterates

u_i^(k+1) = (u_{i−1}^(k) + u_{i+1}^(k)) / 2,  i = 1, …, n,

for k = 0, 1, … until convergence
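As a point of reference before parallelizing, the iteration is a few lines of serial C (a sketch we add; the array layout, with u[0] = α and u[n+1] = β, and the convergence test are our illustrative choices):

    /* Serial Jacobi iteration for the 1-D Laplace system above.
       u and unew have length n+2; u[0] and u[n+1] hold the BC. */
    #include <math.h>
    #include <string.h>

    void jacobi_1d(double *u, double *unew, int n, double tol) {
        double diff;
        do {
            diff = 0.0;
            for (int i = 1; i <= n; i++) {
                unew[i] = 0.5 * (u[i-1] + u[i+1]);        /* Jacobi update */
                diff = fmax(diff, fabs(unew[i] - u[i]));
            }
            memcpy(&u[1], &unew[1], n * sizeof(double));  /* u = unew */
        } while (diff >= tol);                            /* until convergence */
    }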
Example: Laplace Equation in 1-D
Define n tasks, one for each u_i, i = 1, …, n

Task i stores initial value of u_i and updates it at each iteration until convergence

To update u_i, necessary values of u_{i−1} and u_{i+1} obtained from neighboring tasks i−1 and i+1
[Diagram: chain of n tasks holding u_1, u_2, u_3, …, u_n, with communication channels between adjacent tasks]
Tasks 1 and n determine u_0 and u_{n+1} from BC
Example: Laplace Equation in 1-D
initialize u_i
for k = 1, …
    if i > 1, send u_i to task i−1          { send to left neighbor }
    if i < n, send u_i to task i+1          { send to right neighbor }
    if i < n, recv u_{i+1} from task i+1    { receive from right neighbor }
    if i > 1, recv u_{i−1} from task i−1    { receive from left neighbor }
    wait for sends to complete
    u_i = (u_{i−1} + u_{i+1})/2             { update my value }
end
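A direct rendering of this fine-grain program in C with MPI, one mesh point per process (a sketch under our own conventions, not the slides' code: rank r plays task r+1, and a fixed step count stands in for the convergence test):

    /* Fine-grain task: one mesh point per MPI process.
       Boundary ranks substitute alpha/beta for missing neighbors. */
    #include <mpi.h>

    double jacobi_point(double u, double alpha, double beta,
                        int nsteps, MPI_Comm comm) {
        int rank, p;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &p);              /* p == n: one point per task */
        for (int k = 0; k < nsteps; k++) {
            double left = alpha, right = beta;
            MPI_Request reqs[2];
            int nreq = 0;
            if (rank > 0)                     /* send to left neighbor */
                MPI_Isend(&u, 1, MPI_DOUBLE, rank - 1, 0, comm, &reqs[nreq++]);
            if (rank < p - 1)                 /* send to right neighbor */
                MPI_Isend(&u, 1, MPI_DOUBLE, rank + 1, 0, comm, &reqs[nreq++]);
            if (rank < p - 1)                 /* receive from right neighbor */
                MPI_Recv(&right, 1, MPI_DOUBLE, rank + 1, 0, comm, MPI_STATUS_IGNORE);
            if (rank > 0)                     /* receive from left neighbor */
                MPI_Recv(&left, 1, MPI_DOUBLE, rank - 1, 0, comm, MPI_STATUS_IGNORE);
            MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);  /* sends complete */
            u = 0.5 * (left + right);         /* update my value */
        }
        return u;
    }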
Mapping Tasks to Processors
Tasks must be assigned to physical processors for execution

Tasks can be mapped to processors in various ways, including multiple tasks per processor

Semantics of program should not depend on number of processors or particular mapping of tasks to processors

Performance usually sensitive to assignment of tasks to processors due to concurrency, workload balance, communication patterns, etc.

Computational model maps naturally onto distributed-memory multicomputer using message passing
Four-Step Design Methodology
Partition: Decompose problem into fine-grain tasks, maximizing number of tasks that can execute concurrently

Communicate: Determine communication pattern among fine-grain tasks, yielding task graph with fine-grain tasks as nodes and communication channels as edges

Agglomerate: Combine groups of fine-grain tasks to form fewer but larger coarse-grain tasks, thereby reducing communication requirements

Map: Assign coarse-grain tasks to processors, subject to tradeoffs between communication costs and concurrency
Four-Step Design Methodology
[Diagram: Problem → Partition → Communicate → Agglomerate → Map]
Graph Embeddings
Target network may be virtual network topology, with nodes usually called processors or processes

Overall design methodology is composed of sequence of graph embeddings:
    fine-grain task graph to coarse-grain task graph
    coarse-grain task graph to virtual network graph
    virtual network graph to physical network graph

Depending on circumstances, one or more of these embeddings may be skipped

An alternative methodology is to map tasks and communication onto a graph of the network topology laid out in time, similar to the way we defined butterfly protocols
Partitioning Strategies
Domain partitioning: subdivide geometric domain into subdomains

Functional decomposition: subdivide algorithm into multiple logical components

Independent tasks: subdivide computation into tasks that do not depend on each other (embarrassingly parallel)

Array parallelism: subdivide data stored in vectors, matrices, or other arrays

Divide-and-conquer: subdivide problem recursively into tree-like hierarchy of subproblems

Pipelining: subdivide sequences of tasks performed by the algorithm on each piece of data
Desirable Properties of Partitioning
Maximum possible concurrency in executing resulting tasks, ideally enough to keep all processors busy

Number of tasks, rather than size of each task, grows as overall problem size increases
Tasks reasonably uniform in size
Redundant computation or storage avoided
Example: Domain Decomposition
[Figure: 3-D domain partitioned along one (left), two (center), or all three (right) of its dimensions]

With 1-D or 2-D partitioning, minimum task size grows with problem size, but not with 3-D partitioning
Communication Patterns
Communication pattern determined by data dependences among tasks: because storage is local to each task, any data stored or produced by one task and needed by another must be communicated between them

Communication pattern may be
    local or global
    structured or random
    persistent or dynamically changing
    synchronous or sporadic
Desirable Properties of Communication
Frequency and volume minimized
Highly localized (between neighboring tasks)
Reasonably uniform across channels
Network resources used concurrently
Does not inhibit concurrency of tasks
Overlapped with computation as much as possible
Agglomeration
Increasing task sizes can reduce communication but may also reduce concurrency

Subtasks that can’t be executed concurrently anyway are obvious candidates for combining into single task

Maintaining balanced workload still important

Replicating computation can eliminate communication and is advantageous if result is cheaper to compute than to communicate
Example: Laplace Equation in 1-D
Combine groups of consecutive mesh points t_i and corresponding solution values u_i into coarse-grain tasks, yielding p tasks, each with n/p of the u_i values

[Diagram: one coarse-grain task holding consecutive values u_l, …, u_r, with neighboring values u_{l−1} and u_{r+1} obtained from the adjacent tasks]

Communication is greatly reduced, but u_i values within each coarse-grain task must be updated sequentially
Example: Laplace Equation in 1-D
initialize u_l, …, u_r
for k = 1, …
    if j > 1, send u_l to task j−1          { send to left neighbor }
    if j < p, send u_r to task j+1          { send to right neighbor }
    if j < p, recv u_{r+1} from task j+1    { receive from right neighbor }
    if j > 1, recv u_{l−1} from task j−1    { receive from left neighbor }
    for i = l to r
        ū_i = (u_{i−1} + u_{i+1})/2         { update local values }
    end
    wait for sends to complete
    u = ū
end
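In C with MPI this coarse-grain task might look as follows (our own sketch, not the slides' code: rank j owns m = n/p points in u[1..m], ghost cells u[0] and u[m+1] hold neighbor values, boundary ranks initialize them to α and β, and a fixed step count stands in for the convergence test):

    /* Coarse-grain Jacobi task: block of m points per MPI process. */
    #include <mpi.h>
    #include <string.h>

    void jacobi_block(double *u, double *unew, int m, int nsteps, MPI_Comm comm) {
        int j, p;
        MPI_Comm_rank(comm, &j);
        MPI_Comm_size(comm, &p);
        for (int k = 0; k < nsteps; k++) {
            MPI_Request reqs[2];
            int nreq = 0;
            if (j > 0)          /* send u_l to left neighbor */
                MPI_Isend(&u[1], 1, MPI_DOUBLE, j - 1, 0, comm, &reqs[nreq++]);
            if (j < p - 1)      /* send u_r to right neighbor */
                MPI_Isend(&u[m], 1, MPI_DOUBLE, j + 1, 0, comm, &reqs[nreq++]);
            if (j < p - 1)      /* receive right ghost cell u_{r+1} */
                MPI_Recv(&u[m+1], 1, MPI_DOUBLE, j + 1, 0, comm, MPI_STATUS_IGNORE);
            if (j > 0)          /* receive left ghost cell u_{l-1} */
                MPI_Recv(&u[0], 1, MPI_DOUBLE, j - 1, 0, comm, MPI_STATUS_IGNORE);
            for (int i = 1; i <= m; i++)                  /* update local values */
                unew[i] = 0.5 * (u[i-1] + u[i+1]);
            MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE); /* sends complete */
            memcpy(&u[1], &unew[1], m * sizeof(double));  /* u = ū */
        }
    }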
Overlapping Communication and Computation
Updating of solution values u_i is done only after all communication has been completed, but only two of those values actually depend on awaited data

Since communication is often much slower than computation, initiate communication by sending all messages first, then update all “interior” values while awaiting values from neighboring tasks

Much (possibly all) of updating can be done while task would otherwise be idle awaiting messages

Performance can often be enhanced by overlapping communication and computation in this manner
Example: Laplace Equation in 1-D
initialize u_l, …, u_r
for k = 1, …
    if j > 1, send u_l to task j−1          { send to left neighbor }
    if j < p, send u_r to task j+1          { send to right neighbor }
    for i = l+1 to r−1
        ū_i = (u_{i−1} + u_{i+1})/2         { update local values }
    end
    if j < p, recv u_{r+1} from task j+1    { receive from right neighbor }
    ū_r = (u_{r−1} + u_{r+1})/2             { update local value }
    if j > 1, recv u_{l−1} from task j−1    { receive from left neighbor }
    ū_l = (u_{l−1} + u_{l+1})/2             { update local value }
    wait for sends to complete
    u = ū
end
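The coarse-grain MPI sketch from before, reorganized the same way (assuming m ≥ 2): both receives become nonblocking too, interior points needing no remote data are updated while messages are in flight, and the two edge points are updated once the ghost cells arrive:

    /* Overlapped coarse-grain Jacobi task (same conventions as above). */
    #include <mpi.h>
    #include <string.h>

    void jacobi_block_overlap(double *u, double *unew, int m,
                              int nsteps, MPI_Comm comm) {
        int j, p;
        MPI_Comm_rank(comm, &j);
        MPI_Comm_size(comm, &p);
        for (int k = 0; k < nsteps; k++) {
            MPI_Request sreq[2], rreq[2];
            int ns = 0, nr = 0;
            if (j > 0)          /* send u_l to left neighbor */
                MPI_Isend(&u[1], 1, MPI_DOUBLE, j - 1, 0, comm, &sreq[ns++]);
            if (j < p - 1)      /* send u_r to right neighbor */
                MPI_Isend(&u[m], 1, MPI_DOUBLE, j + 1, 0, comm, &sreq[ns++]);
            if (j < p - 1)      /* post receive for right ghost cell */
                MPI_Irecv(&u[m+1], 1, MPI_DOUBLE, j + 1, 0, comm, &rreq[nr++]);
            if (j > 0)          /* post receive for left ghost cell */
                MPI_Irecv(&u[0], 1, MPI_DOUBLE, j - 1, 0, comm, &rreq[nr++]);
            for (int i = 2; i <= m - 1; i++)    /* interior: no remote data */
                unew[i] = 0.5 * (u[i-1] + u[i+1]);
            MPI_Waitall(nr, rreq, MPI_STATUSES_IGNORE);   /* ghosts arrived */
            unew[1] = 0.5 * (u[0] + u[2]);      /* update leftmost value */
            unew[m] = 0.5 * (u[m-1] + u[m+1]);  /* update rightmost value */
            MPI_Waitall(ns, sreq, MPI_STATUSES_IGNORE);   /* sends complete */
            memcpy(&u[1], &unew[1], m * sizeof(double));  /* u = ū */
        }
    }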
Surface-to-Volume Effect
For domain decomposition,
    computation is proportional to volume of subdomain
    communication is (roughly) proportional to surface area of subdomain

Higher-dimensional decompositions have more favorable surface-to-volume ratio

Partitioning across more dimensions yields more neighboring subdomains but smaller total volume of communication than partitioning across fewer dimensions
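A worked example (ours, not from the slides) makes the comparison concrete. For an n × n × n mesh on p processors, the number of boundary points an interior processor communicates per step is

    V_1D = 2n²                              (two n × n faces per slab)
    V_3D = 6(n/p^{1/3})² = 6n²/p^{2/3}      (six faces per subcube)
    V_1D / V_3D = p^{2/3} / 3

so the 3-D decomposition communicates less per processor whenever p > 3^{3/2} ≈ 5.2, despite each subcube having six neighbors rather than two.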
Mapping
As with agglomeration, mapping of coarse-grain tasks to processors should maximize concurrency, minimize communication, maintain good workload balance, etc.

But connectivity of coarse-grain task graph is inherited from that of fine-grain task graph, whereas connectivity of target interconnection network is independent of problem

Communication channels between tasks may or may not correspond to physical connections in underlying interconnection network between processors
Mapping
Two communicating tasks can be assigned to
    one processor, avoiding interprocessor communication but sacrificing concurrency,
    two adjacent processors, so communication between the tasks is directly supported, or
    two nonadjacent processors, so message routing is required

In general, finding optimal solution to these tradeoffs is NP-complete, so heuristics are used to find effective compromise
Mapping
For many problems, task graph has regular structure that can make mapping easier

If communication is mainly global, then communication performance may not be sensitive to placement of tasks on processors, so random mapping may be as good as any

Random mappings sometimes used deliberately to avoid communication hot spots, where some communication links are oversubscribed with message traffic
Mapping Strategies
With n tasks and p processors consecutively numbered in some ordering,
    block mapping: blocks of n/p consecutive tasks are assigned to successive processors
    cyclic mapping: task i is assigned to processor i mod p
    reflection mapping: like cyclic mapping except tasks are assigned in reverse order on alternate passes
    block-cyclic mapping and block-reflection mapping: blocks of tasks assigned to processors as in cyclic or reflection

For higher-dimensional grid, these mappings can be applied in each dimension
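These mappings reduce to short index formulas; a sketch in C (function names are ours; tasks and processors numbered from 0):

    /* Owner processor of task i under each static mapping. */

    int block_owner(int i, int n, int p) {
        int b = (n + p - 1) / p;          /* block size: ceil(n/p) */
        return i / b;
    }

    int cyclic_owner(int i, int p) {
        return i % p;                     /* task i -> processor i mod p */
    }

    int reflection_owner(int i, int p) {
        int pass = i / p, offset = i % p;
        return (pass % 2 == 0) ? offset : p - 1 - offset;  /* reverse on odd passes */
    }

    int block_cyclic_owner(int i, int b, int p) {
        return (i / b) % p;               /* blocks of b tasks dealt cyclically */
    }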
Examples of Mappings
[Figure: 16 tasks (numbered 0–15) assigned to 4 processors under block, cyclic, reflection, and block-cyclic mappings]
Dynamic Mapping
If task sizes vary during computation or can’t be predicted in advance, tasks may need to be reassigned to processors dynamically to maintain reasonable workload balance throughout computation

To be beneficial, gain in load balance must more than offset cost of communication required to move tasks and their data between processors

Dynamic load balancing usually based on local exchanges of workload information (and tasks, if necessary), so work diffuses over time to be reasonably uniform across processors
Task Scheduling
With multiple tasks per processor, execution of those tasks must be scheduled over time

For shared-memory, any idle processor can simply select next ready task from common pool of tasks, or use work stealing, taking a task from the task queue of another processor

For distributed-memory, the manager/worker paradigm can implement a common task pool, with manager dispatching tasks to workers

In distributed-memory, better scalability is achieved by dynamic load balancing strategies that perform periodic global or hierarchical rebalancing
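A minimal manager/worker loop in C with MPI (a sketch under our own conventions, not the slides' code: rank 0 is the manager, tasks are integer ids, and a negative id tells a worker to stop):

    /* Manager/worker task pool: rank 0 dispatches integer task ids;
       each worker requests the next task until it receives -1. */
    #include <mpi.h>

    void manager(int ntasks, MPI_Comm comm) {
        int p, next = 0;
        MPI_Comm_size(comm, &p);
        for (int done = 0; done < p - 1; ) {
            int dummy;
            MPI_Status st;
            /* wait for any idle worker to ask for work */
            MPI_Recv(&dummy, 1, MPI_INT, MPI_ANY_SOURCE, 0, comm, &st);
            int task = (next < ntasks) ? next++ : -1;   /* -1 = stop */
            if (task < 0) done++;
            MPI_Send(&task, 1, MPI_INT, st.MPI_SOURCE, 0, comm);
        }
    }

    void worker(void (*do_task)(int), MPI_Comm comm) {
        for (;;) {
            int dummy = 0, task;
            MPI_Send(&dummy, 1, MPI_INT, 0, 0, comm);   /* request work */
            MPI_Recv(&task, 1, MPI_INT, 0, 0, comm, MPI_STATUS_IGNORE);
            if (task < 0) break;                        /* pool exhausted */
            do_task(task);
        }
    }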
Task Scheduling
For completely decentralized scheme, it can be difficult to determine when overall computation has been completed, so termination detection scheme is required

With multithreading, task scheduling can conveniently be driven by availability of data: whenever executing task becomes idle awaiting data, another task is executed

For problems with regular structure, it is often possible to determine mapping in advance that yields reasonable load balance and natural order of execution
Example: Atmospheric Flow Model
Fluid dynamics of atmosphere modeled by system of partial differential equations

3-D problem domain discretized by n_x × n_y × n_z mesh of points

Vertical dimension (altitude) z much smaller than horizontal dimensions (latitude and longitude) x and y, so n_z ≪ n_x, n_y

Derivatives in PDEs approximated by finite differences

Simulation proceeds through successive discrete steps in time
Example: Atmospheric Flow Model
Partition:
    Each fine-grain task computes and stores data values (pressure, temperature, etc.) for one mesh point
    Large-scale problems may require millions or billions of mesh points / tasks

Communicate:
    Finite difference computations at each mesh point use 9-point horizontal stencil and 3-point vertical stencil
    Solar radiation computations require communication throughout each vertical column of mesh points
    Global communication to compute total mass of air over domain
Example: Atmospheric Flow Model
Agglomerate:
    Combine horizontal mesh points in b × b blocks into coarse-grain tasks to reduce communication for finite differences to exchanges between adjacent nodes
    Combine each vertical column of mesh points into single task to eliminate communication for solar computations
    Yields n_x/b × n_y/b coarse-grain tasks

Map:
    Cyclic or random mapping reduces load imbalance due to solar computations
Example: Atmospheric Flow Model
[Figure: Horizontal finite difference stencil for typical point (shaded black) in mesh for atmospheric flow model before (left) and after (right) agglomeration with b = 2]