1
High-Performance Grid Computing and Research Networking
Presented by Yuming Zhang
Instructor: S. Masoud Sadjadi
http://www.cs.fiu.edu/~sadjadi/Teaching/
sadjadi At cs Dot fiu Dot edu
Classic Examples of Shared Memory Programs
2
Acknowledgements
The content of many of the slides in these lecture notes has been adapted from online resources prepared previously by the people listed below. Many thanks!
Henri Casanova, Principles of High Performance Computing, http://navet.ics.hawaii.edu/~casanova, [email protected]
3
Domain Decomposition
Now that we know how to create and manage threads, we need to decide which thread does what. This is really the art of parallel computing. Fortunately, in shared memory, it is often quite simple.
We'll look at three examples:
"Embarrassingly" parallel application: load-balancing issue
"Non-embarrassingly parallel" application: thread synchronization issue
Shark & Fish simulation: load-balancing AND thread synchronization issues
4
Embarrassingly Parallel
Embarrassingly parallel applications consist of a set of elementary computations. These computations can be done in any order; they are said to be "independent". Sometimes referred to as "pleasantly" parallel.
Trivial example: compute all values of a function of two variables over a 2-D domain.
function f(x,y) = <requires many flops>
domain = (]0,10], ]0,10])
domain resolution = 0.001
number of points = (10 / 0.001)^2 = 10^8
number of processors and of threads = 4
each thread performs 25x10^6 function evaluations
No need for critical sections; no shared output.
5
Mandelbrot Set
In many cases, the "cost" of computing f varies with its input. Example: Mandelbrot.
For each complex number c, define the series
Z_0 = 0
Z_{n+1} = Z_n^2 + c
If the series converges (i.e., if it hasn't diverged after many iterations), put a black dot at point c.
If one partitions the domain into 4 squares among 4 threads, some of the threads will have much more work to do than others.
6
Mandelbrot and Load Balancing
The problem with partitioning the domain into 4 identical tiles is that it leads to load imbalance, i.e., suboptimal use of the hardware resources.
Solution: do not partition the domain into as many tiles as threads; instead use many more tiles than threads.
Then have each thread operate as follows: compute a tile; when done, "request" another tile; repeat until there are no tiles left to compute.
This is called a "master-worker" execution (confusing terminology that will make more sense when we do distributed memory programming).
7
Mandelbrot implementation
Conceptually very simple, but how do we write code to do it?
Pthreads: use some shared (protected) counter that keeps track of the next tile; the "keeping track" can be easy or difficult depending on the shape of the tiles. Threads read and update the counter each time. When the counter goes over some predefined value, terminate.
OpenMP: could be done in the same way, but OpenMP provides tons of convenient ways to do parallel loops, including "dynamic" scheduling strategies, which do exactly what we need! Just write the code as a loop over the tiles, add the proper pragma, and you're done.
8
Dependent Computations
In many applications, things are not so simple: elementary computations may not be independent (otherwise parallel computing would be pretty easy).
A common example: consider a (1-D, 2-D, ...) domain that consists of "cells". Each cell holds some "state", for example: temperature, pressure, humidity, wind velocity, RGB color value.
The application consists of rule(s) that must be applied to update the cell states, possibly over-and-over in an iterative fashion: CFD, game of life, image processing, etc.
Such applications are often termed Stencil Applications. We have already talked about one example: Heat Transfer.
9
Dependent Computations
Really simple case: cell values are one floating point number. The program is written with two arrays, f_old and f_new, and one simple loop:
f_new[i] = f_old[i] + ...
In more "real" cases, the domain is 2-D (or worse), there are more terms, and the values on the right-hand side can be at time step m+1 as well.
Example from: http://ocw.mit.edu/NR/rdonlyres/Nuclear-Engineering/22-00JIntroduction-to-Modeling-and-SimulationSpring2002/55114EA2-9B81-4FD8-90D5-5F64F21D23D0/0/lecture_16.pdf
10
Wavefront Pattern
Example stencil shape on a 2-D domain: cell (i,j) depends on cells (i,j-1), (i-1,j), and (i-1,j-1).
Data elements are laid out as multidimensional grids representing a logical plane or space. The dependency between the elements, often formulated by dynamic programming, results in computations known as wavefront.
11
The Longest-Common-Subsequence Problem
LCS: given two sequences A = <a1, a2, ..., an> and B = <b1, b2, ..., bm>, find the longest sequence that is a subsequence of both A and B.
If A = <c,a,d,b,r,z> and B = <a,s,b,z>, the longest common subsequence of A and B is <a,b,z>.
LCS is a valuable tool for finding information regarding amino acid sequences in biological genes.
Determine F[n, m]: let F[i, j] be the length of the longest common subsequence of the first i elements of A and the first j elements of B.
12
LCS Wavefront
F[i,j] depends on F[i,j-1], F[i-1,j], and F[i-1,j-1].
The computation starts from F[0,0] and fills the memoization table diagonally.
13
One example: computing the LCS of amino acid sequences <H, E, A, G, A, W, G, H, E, E> and <P, A, W, H, E, A, E>. F[n, m] = 5 is the answer.
14
Wavefront computation
How can we parallelize a wavefront computation? We have seen that the computation consists in computing 2n-1 antidiagonals, in sequence. Computations within each antidiagonal are independent, and can be done in a multithreaded fashion.
Algorithm: for each antidiagonal, use multiple threads to compute its elements.
One may need to use a variable number of threads, because some diagonals are very small while some can be large.
Can be implemented with a single array.
15
Wavefront computation
What about cache efficiency? After all, reading only one element from an antidiagonal at a time is probably not good: they are not contiguous in memory!
Solution: blocking, just like matrix multiply.
[Figure, shown over several animation slides: the domain is divided into blocks among threads p0, p1, p2, p3; each block is labeled with the phase (1 through 7) in which it is computed, and blocks on the same antidiagonal are computed concurrently]
20
Workload Partitioning
First the matrix is divided into as many parts of adjacent columns as there are clusters. Afterwards, the part within each cluster is partitioned. The computation is then performed in the same way.
21
Performance Modeling
One thing we'll need to do often in HPC is build performance models.
Given simple assumptions regarding the underlying architecture (e.g., ignore cache effects), come up with an analytical formula for the parallel speed-up.
Let's try it on this simple application. Let N be the (square) matrix size, and let p be the number of threads/cores, which is fixed.
22
Performance Modeling
[Figure: the matrix divided into blocks, distributed among threads T0, T1, T2, T3]
What if we use p^2 blocks? We assume that p divides N (N > p).
Then the computation proceeds in 2p-1 phases; each phase lasts as long as the time to compute one block (because of concurrency), Tb.
Therefore:
Parallel time = (2p-1) Tb
Sequential time = p^2 Tb
Parallel speedup = p^2 / (2p-1)
Parallel efficiency = p / (2p-1)
Example: p=2, speedup = 4/3, efficiency = 66%; p=4, speedup = 16/7, efficiency = 57%; p=8, speedup = 64/17, efficiency = 53%. Asymptotically: efficiency = 50%.
23
Performance Modeling
What if we use (b x p)^2 blocks? (b some integer between 1 and N/p; we assume that p divides N, N > p)
But performance modeling becomes more complicated. The computation still proceeds in 2bp-1 phases, but a thread can have more than one block to compute during a phase!
During phase i, there are:
i blocks to compute, for i = 1, ..., bp
2bp-i blocks to compute, for i = bp+1, ..., 2bp-1
If there are x (> 0) blocks to compute in a phase, then the execution time for that phase is ceil(x/p) = floor((x-1)/p) + 1, assuming Tb = 1.
Therefore, the parallel execution time is
T_par = sum_{i=1}^{bp} ceil(i/p) + sum_{i=bp+1}^{2bp-1} ceil((2bp-i)/p)
24
Performance Modeling
[Plots illustrating the model; example: N = 1000, p = 4]
26
Performance Modeling
When b gets larger, speedup increases and tends to p. Since b <= N/p, the best speedup is Np / (N + p - 1). When N is large compared to p, the speedup is very close to p. Therefore, use a block size of 1, meaning no blocking! We're back to where we started, because our performance model ignores cache effects!
Trade-off: from a parallel efficiency perspective, small block size; from a cache efficiency perspective, big block size.
Possible rule of thumb: use the biggest block size that fits in the L1 cache (L2 cache?).
Lesson: full performance modeling is difficult. We could add the cache behavior, but think of a dual-core machine with shared L2 cache, etc. In practice: do performance modeling for asymptotic behaviors, and then do experiments to find out what works best.
27
Sharks and Fish
Simulation of a population of prey and predators. Each entity follows some behavior: prey move and breed; predators move, hunt, and breed.
Given initial populations and the nature of the entity behaviors (e.g., probability of breeding, probability of successful hunting), what do the populations look like after some time?
This is something computational ecologists do all the time to study ecosystems.
28
Sharks and Fish
There are several possibilities to implement such a simulation. A simple one is to do something that looks like "the game of life": a 2-D domain with NxN cells (each cell can be described by many environmental parameters); each cell in the domain can hold a shark or a fish; the simulation is iterative; there are several rules for movement, breeding, preying.
Why do it in parallel? Many entities; entity interactions may be complex.
How can one write this in parallel with threads and shared memory?
29
Space partitioning
One solution is to divide the 2-D domain between threads. Each thread deals with the entities in its domain (e.g., 4 threads, one per quadrant).
31
Move conflict?
Threads can make decisions that will lead to conflicts!
33
Dealing with conflicts
Concept of shadow cells: only entities in the red (boundary) regions may cause a conflict.
One possible implementation: each thread deals with its green (interior) region; then thread 1 deals with its red region, then thread 2, then thread 3, then thread 4; repeat.
This will still prevent some types of moves (no swapping of locations). The implementer must make choices.
34
Load Balancing
What if all the fish end up in the same region (because they move, because they breed)? Then one thread has much more work to do than the others.
Solution: dynamic repartitioning; modify the partitioning so that the load is balanced.
But perhaps one good idea would be to not do domain partitioning at all! How about doing entity partitioning? Better load balancing, but more difficult to deal with conflicts; one may use locks, but with high overhead.
35
Conclusion
Main lessons:
There are many classes of applications, with many domain partitioning schemes.
Performance modeling is fun but inherently limited.
It's all about trade-offs: overhead vs. load balancing, parallelism vs. cache usage, etc.
Remember, this is the easy side of parallel computing. Things will become much more complex in distributed memory programming.