1
High-Performance Grid Computing and Research Networking
Presented by Yuming Zhang
Instructor: S. Masoud Sadjadi
http://www.cs.fiu.edu/~sadjadi/Teaching/
sadjadi At cs Dot fiu Dot edu
Classic Examples of Shared Memory Programs
2
Acknowledgements
The content of many of the slides in these lecture notes has been adapted from online resources prepared previously by the people listed below. Many thanks!
Henri Casanova, Principles of High Performance Computing, http://navet.ics.hawaii.edu/~casanova, [email protected]
3
Domain Decomposition
Now that we know how to create and manage threads, we need to decide which thread does what. This is really the art of parallel computing. Fortunately, in shared memory, it is often quite simple.
We'll look at three examples:
"Embarrassingly" parallel application: load-balancing issue
"Non-embarrassingly parallel" application: thread synchronization issue
Shark & Fish simulation: load-balancing AND thread synchronization issues
4
Embarrassingly Parallel
Embarrassingly parallel applications consist of a set of elementary computations. These computations can be done in any order; they are said to be "independent". Sometimes referred to as "pleasantly" parallel.
Trivial example: compute all values of a function of two variables over a 2-D domain.
function f(x,y) = <requires many flops>
domain = (]0,10], ]0,10])
domain resolution = 0.001
number of points = (10 / 0.001)^2 = 10^8
number of processors and of threads = 4
each thread performs 25x10^6 function evaluations
No need for critical sections; no shared output.
5
Mandelbrot Set
In many cases, the "cost" of computing f varies with its input. Example: Mandelbrot.
For each complex number c, define the series
Z_0 = 0
Z_{n+1} = Z_n^2 + c
If the series converges (i.e., if it hasn't diverged after many iterations), put a black dot at point c.
If one partitions the domain into 4 squares among 4 threads, some of the threads will have much more work to do than others.
6
Mandelbrot and Load Balancing
The problem with partitioning the domain into 4 identical tiles is that it leads to load imbalance, i.e., suboptimal use of the hardware resources.
Solution: do not partition the domain into as many tiles as threads; instead use many more tiles than threads.
Then have each thread operate as follows: compute a tile; when done, "request" another tile; repeat until there are no tiles left to compute.
This is called a "master-worker" execution (confusing terminology that will make more sense when we do distributed memory programming).
7
Mandelbrot implementation
Conceptually very simple, but how do we write code to do it?
Pthreads: use some shared (protected) counter that keeps track of the next tile; the "keeping track" can be easy or difficult depending on the shape of the tiles. Threads read and update the counter each time. When the counter goes over some predefined value, terminate.
OpenMP: could be done in the same way, but OpenMP provides tons of convenient ways to do parallel loops, including "dynamic" scheduling strategies, which do exactly what we need! Just write the code as a loop over the tiles, add the proper pragma, and you're done.
8
Dependent Computations
In many applications, things are not so simple: elementary computations may not be independent (otherwise parallel computing would be pretty easy).
A common example: consider a (1-D, 2-D, ...) domain that consists of "cells". Each cell holds some "state", for example: temperature, pressure, humidity, wind velocity, RGB color value.
The application consists of rule(s) that must be applied to update the cell states, possibly over-and-over in an iterative fashion: CFD, game of life, image processing, etc.
Such applications are often termed Stencil Applications. We have already talked about one example: Heat Transfer.
9
Dependent Computations
Really simple case: cell values are one floating point number. The program is written with two arrays, f_old and f_new, and one simple loop:
f_new[i] = f_old[i] + ...
In more "real" cases, the domain is 2-D (or worse), there are more terms, and the values on the right-hand side can be at time step m+1 as well.
Example from: http://ocw.mit.edu/NR/rdonlyres/Nuclear-Engineering/22-00JIntroduction-to-Modeling-and-SimulationSpring2002/55114EA2-9B81-4FD8-90D5-5F64F21D23D0/0/lecture_16.pdf
10
Wavefront Pattern
Example stencil shape on a 2-D domain: cell (i,j) depends on cells (i,j-1), (i-1,j), and (i-1,j-1).
Data elements are laid out as multidimensional grids representing a logical plane or space. The dependency between the elements, often formulated by dynamic programming, results in computations known as wavefront.
11
The Longest-Common-Subsequence Problem
LCS: given two sequences A = <a1, a2, ..., an> and B = <b1, b2, ..., bm>, find the longest sequence that is a subsequence of both A and B.
If A = <c,a,d,b,r,z> and B = <a,s,b,z>, the longest common subsequence of A and B is <a,b,z>.
LCS is a valuable tool for finding information regarding amino acid sequences in biological genes.
Determine F[n, m]: let F[i, j] be the length of the longest common subsequence of the first i elements of A and the first j elements of B.
12
LCS Wavefront
F[i,j] depends on F[i,j-1], F[i-1,j], and F[i-1,j-1].
The computation starts from F[0,0] and fills the memoization table diagonally.
13
One example: computing the LCS of amino acid sequences <H, E, A, G, A, W, G, H, E, E> and <P, A, W, H, E, A, E>. F[n, m] = 5 is the answer.
14
Wavefront computation
How can we parallelize a wavefront computation? We have seen that the computation consists in computing 2n-1 antidiagonals, in sequence. Computations within each antidiagonal are independent, and can be done in a multithreaded fashion.
Algorithm: for each antidiagonal, use multiple threads to compute its elements.
One may need to use a variable number of threads, because some diagonals are very small while some can be large.
Can be implemented with a single array.
15
Wavefront computation
What about cache efficiency? After all, reading only one element from an antidiagonal at a time is probably not good: they are not contiguous in memory!
Solution: blocking, just like matrix multiply.
[Figure, shown over several animation slides: the domain is divided into blocks among threads p0, p1, p2, p3; each block is labeled with the phase (1 through 7) in which it is computed, and blocks on the same antidiagonal are computed concurrently]
20
Workload Partitioning
First the matrix is divided into as many parts of adjacent columns as there are clusters. Afterwards, the part within each cluster is partitioned. The computation is then performed in the same way.
21
Performance Modeling
One thing we'll need to do often in HPC is build performance models.
Given simple assumptions regarding the underlying architecture (e.g., ignore cache effects), come up with an analytical formula for the parallel speed-up.
Let's try it on this simple application. Let N be the (square) matrix size, and let p be the number of threads/cores, which is fixed.
22
Performance Modeling
[Figure: the matrix divided into blocks, distributed among threads T0, T1, T2, T3]
What if we use p^2 blocks? We assume that p divides N (N > p).
Then the computation proceeds in 2p-1 phases; each phase lasts as long as the time to compute one block (because of concurrency), Tb.
Therefore:
Parallel time = (2p-1) Tb
Sequential time = p^2 Tb
Parallel speedup = p^2 / (2p-1)
Parallel efficiency = p / (2p-1)
Example: p=2, speedup = 4/3, efficiency = 66%; p=4, speedup = 16/7, efficiency = 57%; p=8, speedup = 64/17, efficiency = 53%. Asymptotically: efficiency = 50%.
23
Performance Modeling
What if we use (b x p)^2 blocks? (b some integer between 1 and N/p; we assume that p divides N, N > p)
But performance modeling becomes more complicated. The computation still proceeds in 2bp-1 phases, but a thread can have more than one block to compute during a phase!
During phase i, there are:
i blocks to compute, for i = 1, ..., bp
2bp-i blocks to compute, for i = bp+1, ..., 2bp-1
If there are x (> 0) blocks to compute in a phase, then the execution time for that phase is ceil(x/p) = floor((x-1)/p) + 1, assuming Tb = 1.
Therefore, the parallel execution time is
T_par = sum_{i=1}^{bp} ceil(i/p) + sum_{i=bp+1}^{2bp-1} ceil((2bp-i)/p)
24
Performance Modeling
[Plots illustrating the model; example: N = 1000, p = 4]
26
Performance Modeling
When b gets larger, speedup increases and tends to p. Since b <= N/p, the best speedup is Np / (N + p - 1). When N is large compared to p, the speedup is very close to p. Therefore, use a block size of 1, meaning no blocking! We're back to where we started, because our performance model ignores cache effects!
Trade-off: from a parallel efficiency perspective, small block size; from a cache efficiency perspective, big block size.
Possible rule of thumb: use the biggest block size that fits in the L1 cache (L2 cache?).
Lesson: full performance modeling is difficult. We could add the cache behavior, but think of a dual-core machine with shared L2 cache, etc. In practice: do performance modeling for asymptotic behaviors, and then do experiments to find out what works best.
27
Sharks and Fish
Simulation of a population of prey and predators. Each entity follows some behavior: prey move and breed; predators move, hunt, and breed.
Given initial populations and the nature of the entity behaviors (e.g., probability of breeding, probability of successful hunting), what do the populations look like after some time?
This is something computational ecologists do all the time to study ecosystems.
28
Sharks and Fish
There are several possibilities to implement such a simulation. A simple one is to do something that looks like "the game of life": a 2-D domain with NxN cells (each cell can be described by many environmental parameters); each cell in the domain can hold a shark or a fish; the simulation is iterative; there are several rules for movement, breeding, preying.
Why do it in parallel? Many entities; entity interactions may be complex.
How can one write this in parallel with threads and shared memory?
29
Space partitioning
One solution is to divide the 2-D domain between threads. Each thread deals with the entities in its domain (e.g., 4 threads, one per quadrant).
31
Move conflict?
Threads can make decisions that will lead to conflicts!
33
Dealing with conflicts
Concept of shadow cells: only entities in the red (boundary) regions may cause a conflict.
One possible implementation: each thread deals with its green (interior) region; then thread 1 deals with its red region, then thread 2, then thread 3, then thread 4; repeat.
This will still prevent some types of moves (no swapping of locations). The implementer must make choices.
34
Load Balancing
What if all the fish end up in the same region (because they move, because they breed)? Then one thread has much more work to do than the others.
Solution: dynamic repartitioning; modify the partitioning so that the load is balanced.
But perhaps one good idea would be to not do domain partitioning at all! How about doing entity partitioning? Better load balancing, but more difficult to deal with conflicts; one may use locks, but with high overhead.
35
Conclusion
Main lessons:
There are many classes of applications, with many domain partitioning schemes.
Performance modeling is fun but inherently limited.
It's all about trade-offs: overhead vs. load balancing, parallelism vs. cache usage, etc.
Remember, this is the easy side of parallel computing. Things will become much more complex in distributed memory programming.