7/29/2019 Basic of Parallel Algorithm Design
This lecture uses some of the slides for Chapters 3 and 5 accompanying Introduction to Parallel Computing, Addison Wesley, 2003.
http://www-users.cs.umn.edu/~karypis/parbook
Preliminaries: Decomposition, Tasks, and Dependency Graphs

A parallel algorithm should decompose the problem into tasks that can be executed concurrently.
A decomposition can be illustrated as a directed graph (task dependency graph): nodes = tasks, edges = dependencies, i.e., the result of one task is required for processing the next.
The degree of concurrency determines the number of tasks that can actually run in parallel.
The critical path is the longest path in the dependency graph; its length is a lower bound on the program's runtime.
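The definitions above can be made concrete with a small sketch (not from the slides; the graph and task costs are illustrative): computing the longest path over a topological order of a task dependency graph.

```python
# Sketch: length of the critical path in a small task dependency graph.
# Edge (u, v) means task v needs the result of task u; every task is
# assumed to cost 1 unit, so path length is counted in tasks.
from collections import defaultdict

def critical_path_length(edges, num_tasks):
    """Longest path in the DAG -- a lower bound on parallel runtime."""
    succ = defaultdict(list)
    indeg = [0] * num_tasks
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    # Kahn's algorithm: a task becomes ready once all dependencies finish.
    ready = [t for t in range(num_tasks) if indeg[t] == 0]
    longest = [1] * num_tasks          # longest path ending at each task
    while ready:
        u = ready.pop()
        for v in succ[u]:
            longest[v] = max(longest[v], longest[u] + 1)
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    return max(longest)

# Four independent tasks (0..3) feeding a pairwise reduction (4, 5, 6):
edges = [(0, 4), (1, 4), (2, 5), (3, 5), (4, 6), (5, 6)]
print(critical_path_length(edges, 7))  # -> 3
```

In this example the degree of concurrency is 4 (tasks 0 to 3 can run at once), but the critical path of length 3 means the graph can never finish in fewer than 3 steps, no matter how many processors are available.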
Example: Multiplying a Dense Matrix with a Vector
The computation of each element of the output vector y is independent of the other elements, so a dense matrix-vector product can be decomposed into n independent tasks.
Observations: while the tasks share data (namely, the vector b), they have no control dependencies, i.e., no task needs to wait for the (partial) completion of any other. All tasks are of the same size in terms of number of operations. Is this the maximum number of tasks we could decompose this problem into?
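A minimal sketch of this row-wise decomposition (serial code, but structured so each task is an independent unit; the matrix and vector below are illustrative):

```python
# Sketch: y = A*b decomposed into n independent row tasks.
# Each task computes one output element; all tasks share b (data
# dependency) but none waits for another (no control dependencies).

def row_task(A, b, i):
    """Task i: compute output element y[i] = dot(A[i], b)."""
    return sum(A[i][j] * b[j] for j in range(len(b)))

def matvec(A, b):
    n = len(A)
    # The n calls below are mutually independent, so they could be
    # dispatched to n processors concurrently.
    return [row_task(A, b, i) for i in range(n)]

A = [[1, 2], [3, 4]]
b = [10, 1]
print(matvec(A, b))  # -> [12, 34]
```

This is not the finest possible decomposition: each individual multiplication A[i][j] * b[j] could itself be a task, at the price of extra tasks to sum the partial products.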
Multiplying a dense matrix with a vector (2n CPUs available)

(Figure: Task 1, Task 2, ... mapped onto 2n CPUs.)
On what kind of platform will we have 2n processors?
Multiplying a dense matrix with a vector (2n CPUs available)

(Figure: Tasks 1 through 2n and Tasks 2n+1 through 3n mapped onto 2n CPUs.)
Granularity of Task Decompositions

The size of the tasks into which a problem is decomposed is called its granularity.
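To make the notion concrete, here is a sketch (my own illustration, not from the slides) of a coarser-grained version of the matrix-vector product: instead of one task per output element, one task handles a whole block of rows, so there are only p tasks, each doing more work.

```python
# Sketch: coarse-grained matrix-vector product with p block tasks
# instead of n element tasks (larger granularity, fewer tasks).

def block_task(A, b, rows):
    """One coarse task: compute y[i] for every row index i in `rows`."""
    return [sum(A[i][j] * b[j] for j in range(len(b))) for i in rows]

def matvec_blocked(A, b, p):
    n = len(A)
    blocks = [range(k, n, p) for k in range(p)]  # p round-robin row blocks
    y = [0] * n
    # The p block tasks are independent and could run on p processors.
    for rows in blocks:
        for i, val in zip(rows, block_task(A, b, rows)):
            y[i] = val
    return y

A = [[1, 0], [0, 2]]
print(matvec_blocked(A, [3, 4], p=2))  # -> [3, 8]
```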
Parallelization efficiency
Guidelines for good parallel algorithm design

Maximize concurrency.
Spread tasks evenly between processors to avoid idling and achieve load balancing.
Execute tasks on the critical path as soon as their dependencies are satisfied.
Minimize (or overlap) communication between processors.
Evaluating a parallel algorithm

Speedup: S = Tserial / Tparallel
Efficiency: E = S / p, where p is the number of processors
Cost: C = p * Tparallel
Scalability: speedup (efficiency) as a function of the number of processors, with fixed task size.
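These metrics are direct one-line formulas; a small sketch with hypothetical timing numbers (chosen only for illustration):

```python
# Sketch: the evaluation metrics above as helper functions.
# The runtimes below are made-up example values, not measurements.

def speedup(t_serial, t_parallel):
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, p):
    return speedup(t_serial, t_parallel) / p

def cost(t_parallel, p):
    return p * t_parallel

t_serial, t_parallel, p = 100.0, 25.0, 8
print(speedup(t_serial, t_parallel))        # -> 4.0
print(efficiency(t_serial, t_parallel, p))  # -> 0.5
print(cost(t_parallel, p))                  # -> 200.0
```

Note that here the cost (200.0) exceeds the serial time (100.0): the 8 processors are only 50% efficient, so half the processor-time is wasted.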
Simple example: summing n numbers on p CPUs
Summing n numbers on p CPUs

Parallel Time = Θ((n/p) log p)
Serial Time = Θ(n)
Speedup = Θ(p / log p)
Efficiency = Θ(1 / log p)
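A serial simulation of the algorithm behind these bounds (my own sketch): each of the p CPUs sums its n/p slice, then the p partial sums are combined pairwise in log2(p) steps, as in a tree reduction.

```python
# Sketch: summing n numbers on p "CPUs", simulated serially.
# Phase 1: each CPU sums its n/p slice.
# Phase 2: partial sums are combined pairwise in log2(p) steps.
import math

def parallel_sum(xs, p):
    n = len(xs)
    chunk = math.ceil(n / p)
    partials = [sum(xs[k * chunk:(k + 1) * chunk]) for k in range(p)]
    steps = 0
    while len(partials) > 1:  # pairwise combining: log2(p) steps
        partials = [partials[i] + (partials[i + 1] if i + 1 < len(partials) else 0)
                    for i in range(0, len(partials), 2)]
        steps += 1
    return partials[0], steps

total, steps = parallel_sum(list(range(16)), p=4)
print(total, steps)  # -> 120 2
```

With p = 4 the combining phase takes log2(4) = 2 steps; the local sums dominate when n is much larger than p.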
Parallel multiplication of 3 matrices

A, B, C: square matrices of size N x N.
We want to compute A x B x C in parallel.
How do we achieve the maximum performance?
(Figure: TEMP = A x B, then Result = TEMP x C.)
Try 12-CPU parallelization:
(Figure: C' = A x B, then Result = C' x C.)
Try 23-CPU parallelization:
Is this a good way?
(Figure: task dependency graph for Result(1,1) with 2 x 2 blocks:
T(1,1) = A(1,1)*B(1,1) + A(1,2)*B(2,1)
T(1,2) = A(1,1)*B(1,2) + A(1,2)*B(2,2)
Result(1,1) = T(1,1)*C(1,1) + T(1,2)*C(2,1).)
Dynamic task creation - Simple Master-Worker

Identify independent computations.
Create a queue of tasks.
Each process picks tasks from the queue and may insert new tasks into the queue.
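The scheme above can be sketched with Python threads and a shared `queue.Queue` (a minimal illustration; the task kinds and values are invented for the example, not taken from the slides):

```python
# Sketch: master-worker with dynamic task creation. Workers pull tasks
# from a shared queue; a task may push follow-up tasks back onto it.
import queue
import threading

tasks = queue.Queue()
results = []
lock = threading.Lock()

def worker():
    while True:
        item = tasks.get()
        if item is None:          # sentinel: shut this worker down
            tasks.task_done()
            break
        kind, value = item
        if kind == "square":
            with lock:            # results list is shared, so guard it
                results.append(value * value)
        elif kind == "split":     # dynamic creation: enqueue new tasks
            for v in range(value):
                tasks.put(("square", v))
        tasks.task_done()

workers = [threading.Thread(target=worker) for _ in range(3)]
for w in workers:
    w.start()
tasks.put(("split", 4))           # expands into squares of 0..3
tasks.join()                      # wait until the queue drains
for _ in workers:
    tasks.put(None)               # one sentinel per worker
for w in workers:
    w.join()
print(sorted(results))  # -> [0, 1, 4, 9]
```

`Queue.join()` works here because the "split" task enqueues its children before calling `task_done()`, so the unfinished-task count never drops to zero while work remains.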
Hypothetical implementation by multiple threads

Let's assign a separate thread to each multiply/sum operation.
How do we coordinate the threads?
Initialize threads to know their dependencies.
Build producer-consumer-like logic for every thread.
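One way to sketch this coordination (an illustration under my own assumptions; the operand values and node names m1, m2, s are invented) is to give each operation thread an event it sets when done, and have consumer threads wait on their producers' events:

```python
# Sketch: one thread per multiply/sum operation, coordinated with
# threading.Event. Each consumer knows its dependencies and blocks
# until the corresponding producers have published their results.
import threading

values = {}
done = {name: threading.Event() for name in ("m1", "m2", "s")}

def multiply(name, a, b):
    values[name] = a * b
    done[name].set()              # announce: consumers may proceed

def add(name, dep1, dep2):
    done[dep1].wait()             # block until both producers finish
    done[dep2].wait()
    values[name] = values[dep1] + values[dep2]
    done[name].set()

threads = [
    threading.Thread(target=add, args=("s", "m1", "m2")),  # consumer first
    threading.Thread(target=multiply, args=("m1", 2, 3)),
    threading.Thread(target=multiply, args=("m2", 4, 5)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(values["s"])  # -> 26
```

Starting the consumer before its producers is deliberate: the events make the schedule correct regardless of thread start order, which is exactly the coordination problem the slide raises.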