DISTRIBUTED AND HIGH-PERFORMANCE COMPUTING
CHAPTER 4: Programming and Performance Analysis of Parallel Computers
Dr. Nor Asilah Wati Abdul Hamid
Room 2.15, Ext: 6532
FSKTM, UPM


Programming Models for Parallel Computing

First look at an example of parallelizing a real-world task, taken from Solving Problems on Concurrent Processors, Fox et al.

Hadrian's Wall was built by the ancient Romans to keep the marauding Scottish barbarians out of Roman England. It was originally 120 km long and 2 meters high.

How would you build such a huge structure in the shortest possible time? Clearly the sequential approach (a single bricklayer building the entire wall) would be too slow.

Can have modest task parallelism by having different specialist workers concurrently:
- making the bricks
- delivering the bricks
- laying the bricks

But unless we can replicate these workers, this only gives us a threefold speedup of the process


Data Parallelism

To really speed up completion of the task, need many workers laying the bricks concurrently. In general there are two ways to do this:
- pipelining (vectorization)
- replication (parallelization)

Concurrent execution of a task requires assigning different sections of the problem (processes and data) to different processors.

We will concentrate on data parallelism, where the processors work concurrently on their own section of the whole data set, or domain.

Splitting up the domain between processors is known as domain decomposition. Each processor has its own sub-domain, or grain, to work on. Parallelism may be described as fine-grained (lots of very small domains) or coarse-grained (a smaller number of larger domains).

Deciding how the domain decomposition is done is a key issue in implementing efficient parallel processing. For the most efficient (least time) execution of the task, need to:
- minimize communication of data between processors
- distribute the workload equally among the processors (known as load balancing)


Vectorization (Pipelining)

One possible approach is to allocate a different bricklayer to each row of the wall, i.e. a horizontal decomposition of the problem domain.

This is a pipelined approach - each bricklayer has to wait until the row underneath them has been started, so there is some inherent inefficiency.

Once all rows have been started (the pipeline is full) all the bricklayers (processors) are working efficiently at the same time, until the end of the task, when there is some overhead (idle workers) while the upper levels are completed (the pipeline is flushed).


Parallelization (Replication)

Another approach is to do a vertical decomposition of the problem domain, so each bricklayer gets a vertical section of the wall to complete.

In this case, the workers must communicate and synchronize their actions at the edges where the sub-domains meet. In general this communication and synchronization will incur some overhead, so there is some inefficiency.

However each worker has an inner section of wall within their sub-domain that is completely independent of the others, which they can build just as efficiently as if there were no other workers.

As long as the time taken to build this inner section is much longer than the time taken up by the communication and synchronization overhead, then the parallelization will be efficient and give good speedup over using a single worker.


Parallel I/O

For large tasks with lots of data, need an efficient means to pass the appropriate data to each processor. In building a wall, need to keep each bricklayer supplied with bricks.

A simple approach has a single host processor connected to the outside network and handling all I/O. It passes data to the other processors through the internal communications network of the machine.

Host processor is a sequential bottleneck for I/O. Bandwidth is limited to a single network connection. Better approach is to do I/O in parallel, so each node (or group of nodes) has a direct I/O channel to a disk array.


Domain Decompositions

Some standard domain decompositions of a regular 2D grid (array) include:

BLOCK - contiguous chunks of rows or columns of data on each processor.

BLOCK-BLOCK - block decomposition in both dimensions.

CYCLIC - data is assigned to processors like cards dealt to players, so neighbouring points are on different processors.

This can be good for load balancing applications with varying workloads that have certain types of communication, e.g. very little, or a lot (global sums or all-to-all), or strided.

BLOCK-CYCLIC - a block decomposition in one dimension, cyclic in the other.

SCATTERED - points are scattered randomly across processors.

This can be good for load balancing applications with little (or lots of) communication. The human brain seems to work this way - neighboring sections may control widely separated parts of the body.
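A minimal sketch of these distributions (illustrative code only, not from the slides), showing which processor owns each index of a 1D array of n elements spread over p processors:

    # Sketch (illustrative only): owner of array index i under common decompositions.
    def block_owner(i, n, p):
        chunk = -(-n // p)            # ceiling division: block size per processor
        return i // chunk             # BLOCK: contiguous chunks

    def cyclic_owner(i, p):
        return i % p                  # CYCLIC: indices dealt out like cards

    def block_cyclic_owner(i, p, b):
        return (i // b) % p           # BLOCK-CYCLIC: blocks of size b dealt out cyclically

    n, p = 16, 4
    print([block_owner(i, n, p) for i in range(n)])         # [0,0,0,0, 1,1,1,1, 2,2,2,2, 3,3,3,3]
    print([cyclic_owner(i, p) for i in range(n)])           # [0,1,2,3, 0,1,2,3, ...]
    print([block_cyclic_owner(i, p, 2) for i in range(n)])  # [0,0,1,1, 2,2,3,3, 0,0,1,1, 2,2,3,3]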


Domain Decompositions Illustrated

The squares represent the 2D data array, the colours represent the processors where the data elements are stored.

[Figure: example decompositions of a 2D array - BLOCK, BLOCK-BLOCK, CYCLIC, and BLOCK-CYCLIC.]


Static Load Balancing

For maximum efficiency, domain decomposition should give equal work to each processor.

In building the wall, can just give each bricklayer an equal length segment.

But things can become much more complicated:
- What if some bricklayers are faster than others? (this is like an inhomogeneous cluster of different workstations)
- What if there are guard towers every few hundred meters, which require more work to construct? (in some applications, more work is required in certain parts of the domain)

If we know in advance
1. the relative speed of the processors, and
2. the relative amount of processing required for each part of the problem
then we can do a domain decomposition that takes this into account, so that different processors may have different sized domains, but the time to process them will be about the same. This is static load balancing, and can be done at compile-time.

For some applications, maintaining load balance while simultaneously minimising communication can be a very difficult optimisation problem.
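A minimal sketch of the idea (illustrative code only, not from the slides): if the relative processor speeds are known in advance, a weighted block decomposition can be computed before the run starts.

    # Sketch (illustrative only): static decomposition of n grid points into
    # contiguous blocks proportional to each processor's known relative speed.
    def weighted_block_sizes(n, speeds):
        total = sum(speeds)
        sizes = [int(n * s / total) for s in speeds]
        sizes[-1] += n - sum(sizes)   # give any rounding remainder to the last processor
        return sizes

    # e.g. 1000 wall segments, two equal bricklayers and one twice as fast:
    print(weighted_block_sizes(1000, [1.0, 1.0, 2.0]))   # [250, 250, 500]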


Irregular Domain Decomposition

In this figure the airflow over an aeroplane wing is modeled on an irregular triangulated mesh. The grid is finest in areas where there is the most change in the airflow (e.g. turbulent regions) and coarsest where the flow is more regular (laminar). The domain is distributed among the processors, indicated by different colours.


Dynamic Load Balancing

In some cases we do not know in advance one (or both) of:
- the effective performance of the processors - we may be sharing the processors with other applications, so the load and available CPU may vary
- the amount of work required for each part of the domain - many applications are adaptive or dynamic, and the workload is only known at runtime (e.g. a dynamic irregular mesh for CFD, or varying convergence rates for PDE solvers in different sections of a regular grid)

In this case we need to dynamically change the domain decomposition by periodically repartitioning the data between processors. This is dynamic load balancing, and it can involve substantial overheads in:
- figuring out how best to repartition the data - may need to use a fast method that gives a good (but not optimal) domain decomposition, to reduce computation
- moving the data between processors - could restrict this to local (neighbouring processor) moves to reduce communication

Usually repartition as infrequently as possible (e.g. every few iterations instead of every iteration). There is a tradeoff between performance improvement and repartitioning overhead.
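A minimal sketch of the idea (illustrative code only, not from the slides): every few iterations, estimate each processor's effective speed from the time it took on its current sub-domain, and recompute the block sizes in proportion.

    # Sketch (illustrative only): periodic repartitioning driven by measured times.
    REPARTITION_EVERY = 10   # repartition every 10 iterations, not every iteration

    def repartition(n, old_sizes, measured_times):
        # estimate each processor's effective speed as work done / time taken
        speeds = [work / t for work, t in zip(old_sizes, measured_times)]
        total = sum(speeds)
        new_sizes = [int(n * s / total) for s in speeds]
        new_sizes[-1] += n - sum(new_sizes)   # assign any rounding remainder to the last processor
        return new_sizes

    # e.g. processor 1 was slowed by another user's job during the last interval:
    print(repartition(1200, [400, 400, 400], [2.0, 4.0, 2.0]))   # [480, 240, 480]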


Causes of Inefficiency in Parallel Programs

Communication Overhead
In some cases communication can be overlapped with computation, i.e. needed data can be prefetched at the same time as useful computation is being done. But often processors must wait for data, causing inefficiency (like a cache miss on a sequential machine, but worse).

Load Imbalance Overhead
Any load imbalance will cause some processors to be idle some of the time. In some problems the domain will change during the computation, requiring expensive dynamic load balancing.

Algorithmic Overhead
The best algorithm to solve a problem in parallel is often slightly different (or in some cases very different) from the best sequential algorithm. Any excess cycles caused by the difference in algorithm (more operations or more iterations) give an algorithmic overhead. Note that to calculate speedup honestly, you should not use the time for the parallel algorithm on 1 processor, but the time for the best sequential algorithm.

Sequential Overhead
The program may have parts that are not parallelizable, so that each processor replicates the same calculation.


Parallel Speedup

If N workers or processors are used, expect to be able to finish the task N times faster.

Speedup = Time on 1 processor / Time on N processors

Speedup will (usually) be at most N on N processors. (Question: How can it be more?)

N.B. Usually better to plot speedup vs number of processors, rather than time taken (1/speedup) vs #procs, since it is much harder to judge deviation from a 1/x curve than from a straight line.

Any problem of fixed size will have a maximum speedup, beyond which adding more processors will not reduce (and will usually increase) the time taken.

[Figure: speedup vs. number of processors, showing the linear (perfect) speedup line and a typical actual speedup curve that falls away from it.]


Superlinear Speedup

Answer to the previous question:

It is possible to get speedups greater than N on N processors if the parallel algorithm is very efficient (no load imbalance or algorithmic overhead, and any communication is overlapped with computation), and if splitting the data among processors allows a greater proportion of the data to fit into cache memory on each processor, resulting in faster sequential performance of the program on each processor.

Same idea applies to programs that would be out-of-core on 1 processor (i.e. require costly paging to disk) but can have all data in core memory if they are spread over multiple processors.

Some parallel implementations of tree search applications (e.g. branch and bound optimisation algorithms) can also have superlinear speedup, by exploring multiple branches in parallel, allowing better pruning and thus finding the solution faster. However the same effect can be achieved by running multiple processes (or threads) on a single processor.


Parallel Efficiency

Efficiency measures the actual speedup relative to the maximum speedup, given as a fraction or a percentage. It is often quoted instead of speedup.

Efficiency = Speedup / N

Efficiency will (usually) be at most 1 (or 100%). It is still a function of the number of processors N, but not as much as speedup, since it is scaled by N. Parallel programs are rarely 100% efficient - there is usually some overhead f from the parallel implementation, so that the time taken for the parallel program is

Time on N procs = (Time on 1 proc / N) * (1 + f)

Speedup = N / (1 + f)

Efficiency = 1 / (1 + f)
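A minimal sketch of these formulas (hypothetical timings, Python):

    # Sketch (hypothetical timings): speedup and efficiency from measured times.
    def speedup(t1, tn):
        return t1 / tn

    def efficiency(t1, tn, n):
        return speedup(t1, tn) / n

    # With a fractional parallel overhead f, Time(N) = (Time(1) / N) * (1 + f):
    def efficiency_from_overhead(f):
        return 1.0 / (1.0 + f)

    print(speedup(100.0, 14.0))             # e.g. 100 s on 1 proc, 14 s on 8 procs: ~7.1x
    print(efficiency(100.0, 14.0, 8))       # ~0.89 (89%)
    print(efficiency_from_overhead(0.12))   # ~0.89, i.e. f = 0.12 gives the same efficiency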


Alternative Definitions of Speedup and Efficiency

The previous definitions of speedup and efficiency assume that times are measured for a fixed problem size.

This is artificial – in practice, we often use more processors to solve larger problems in the same time, not the same sized problem in less time.

Also, we cannot use this definition to assess scalability of the program or algorithm to larger numbers of processors. We cannot expect to obtain good speedups as the amount of data per processor is reduced to zero, no matter how good the parallel algorithm might be.

Alternative definitions of speedup and efficiency use:
- constant problem size (the standard definition)
- constant domain size on each processor
- constant time for solution (adjust the problem size accordingly)

When quoting speedups, be sure to say which of these you are measuring.
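A minimal sketch of the first two conventions (hypothetical timings, Python; the function names are mine, not from the slides):

    # Sketch (hypothetical timings): two ways of reporting efficiency.
    def fixed_size_efficiency(t1, tp, p):
        # constant total problem size: ideal time on p processors is t1 / p
        return t1 / (p * tp)

    def fixed_domain_efficiency(t1, tp):
        # constant domain size per processor: ideal time stays equal to t1
        return t1 / tp

    # e.g. 100 s on 1 proc, 30 s on 4 procs for the same problem:
    print(fixed_size_efficiency(100.0, 30.0, 4))    # ~0.83
    # e.g. 100 s on 1 proc, 110 s on 4 procs with 4x the data:
    print(fixed_domain_efficiency(100.0, 110.0))    # ~0.91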


Minimizing Communication Overhead

A crucial problem in parallel programming is to minimize the ratio of communication to computation, otherwise we cannot have efficient programs.

Want to:
- minimize the amount of communication
- overlap communication with computation where possible
- reduce latency costs by sending a few large messages, rather than a lot of small messages

At the hardware level, can reduce latency by using fast (but expensive) communications.

At the systems software level, can reduce latency by using lightweight message passing protocols, such as Active Messages.

But most of the real work needs to be done by the programmer and/or the compiler:
- keep data local
- calculate using local data while getting remote data
- buffer communications to send fewer messages (e.g. send all edge data at once rather than one point at a time)
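A minimal sketch of overlapping a boundary (halo) exchange with computation, assuming mpi4py and NumPy are available and exactly two processes are started (the variable names and sizes are illustrative, not from the slides):

    # Sketch (illustrative only): run with e.g. "mpirun -np 2 python overlap.py".
    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    other = 1 - rank                     # the single neighbouring process

    local = np.random.rand(1000)         # this process's sub-domain
    edge = local[-1:].copy()             # boundary data to send to the neighbour
    halo = np.empty(1)                   # buffer for the neighbour's boundary data

    # start non-blocking send/receive, then do useful work on the interior
    reqs = [comm.Isend(edge, dest=other, tag=0),
            comm.Irecv(halo, source=other, tag=0)]
    interior_sum = local[1:-1].sum()     # computation overlapped with communication

    MPI.Request.Waitall(reqs)            # wait until the halo data has arrived
    total = interior_sum + local[0] + local[-1] + halo[0]   # now safe to use the halo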


Benchmarking

The peak performance of a computer is the maximum number of operations it can perform per second, usually measured in flops (floating-point operations per second). This is usually:

Flops = number of pipes (flops/cycle) * clock speed (cycles/sec) * number of processors
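For example (hypothetical machine, the numbers are illustrative only):

    # Sketch (hypothetical machine): peak performance from the formula above.
    flops_per_cycle = 4            # e.g. 4 floating-point pipes per processor
    clock_hz = 2.5e9               # 2.5 GHz clock
    processors = 16
    peak = flops_per_cycle * clock_hz * processors
    print(peak / 1e9, "Gflops")    # 160.0 Gflops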

Many applications do not achieve anywhere near the peak performance, particularly on high-performance computers.

Many standard benchmarks have been developed to determine actual performance of a computer over a range of applications, e.g. SPECmarks, transactions per second.

Benchmarks aimed at parallel and vector HPC machines include:
- LINPACK matrix solver (from the SCALAPACK parallel linear algebra library), used to rank the Top 500 list
- NAS Benchmarks (kernels for some NASA fluid dynamics applications)
- PARKBENCH Benchmarks (PARallel Kernels and BENCHmarks)
- others at BenchWeb (www.netlib.org/benchweb/)
- MPI benchmarks - SKaMPI, MPIBench, IMB

Best benchmark is of course to run your applications (or the compute-intensive application kernel) on the machine.


Using Benchmarks

Care is needed in obtaining and analyzing results from benchmarks.

- Run benchmarks using different data sizes and numbers of processors.
- May need to do multiple runs and either average the results or take the best result.
- Analyze performance, speedup, efficiency, and scalability as functions of the number of processors and different data sizes.
- Should have a dedicated machine - someone else running a job on one node of a cluster could reduce performance by a factor of 2!
- I/O and initialization time may affect results - there may be a large difference between CPU time and total wall clock time.
- The percentage of time spent on I/O and initialization may be very different for a benchmark (short run time) vs a real application (long run time), so should they be excluded? Or is I/O an important part of the application?
- Need to be careful in comparing results on different machines - may need different optimization flags for compilers, a different data decomposition, or even to modify the code to perform better on a different architecture.
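A minimal sketch of a careful timing harness (illustrative only; the kernel below is just a stand-in for your own compute-intensive application kernel):

    # Sketch (illustrative only): time a kernel over several runs and report
    # both the best and the average wall-clock time.
    import time

    def benchmark(kernel, repeats=5):
        times = []
        for _ in range(repeats):
            start = time.perf_counter()            # wall-clock timer
            kernel()
            times.append(time.perf_counter() - start)
        return min(times), sum(times) / len(times)

    def kernel():
        sum(i * i for i in range(1_000_000))       # stand-in compute kernel

    best, mean = benchmark(kernel)
    print("best = %.4f s, mean = %.4f s" % (best, mean))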