Parallel and High Performance Computing
Burton Smith
Technical Fellow
Microsoft
Agenda: Introduction, Definitions, Architecture and Programming, Examples, Conclusions
INTRODUCTION
"Parallel and High Performance"?
- "Parallel computing is a form of computation in which many calculations are carried out simultaneously" (G.S. Almasi and A. Gottlieb, Highly Parallel Computing, Benjamin/Cummings, 1994)
- A High Performance (Super) Computer is:
  - One of the 500 fastest computers as measured by HPL, the High Performance Linpack benchmark
  - A computer that costs 200,000,000 rubles or more
  - Necessarily parallel, at least since the 1970s
Recent Developments
- For 20 years, parallel and high performance computing have been the same subject
- Parallel computing is now mainstream: it reaches well beyond HPC into client systems (desktops, laptops, mobile phones)
- HPC software once had to stand alone; now it can be based on parallel PC software
- The result: better tools and new possibilities
The Emergence of the Parallel Client
- Uniprocessor performance is leveling off:
  - Instruction-level parallelism nears a limit (ILP Wall)
  - Power is getting painfully high (Power Wall)
  - Caches show diminishing returns (Memory Wall)
- Logic density continues to grow (Moore's Law), so uniprocessors will collapse in area and cost, and cores per chip need to increase exponentially
- We must all learn to write parallel programs so new "killer apps" will enjoy more speed
The ILP Wall
- Instruction-level parallelism preserves the serial programming model while getting speed from "undercover" parallelism
- For example, see HPS†: out-of-order issue, in-order retirement, register renaming, branch prediction, speculation, …
- At best, we get a few instructions/clock

† Y.N. Patt et al., "Critical Issues Regarding HPS, a High Performance Microarchitecture," Proc. 18th Ann. ACM/IEEE Int'l Symp. on Microarchitecture, 1985, pp. 109-116.
The Power Wall
- In the old days, power was kept roughly constant:
  - Dynamic power, equal to CV²f, dominated
  - Every shrink of 0.7 in feature size halved transistor area
  - Capacitance C and voltage V also decreased by 0.7
  - Even with the clock frequency f increased by 1.4x, power per transistor was cut in half
- Now, shrinking no longer reduces V very much, so even at constant frequency, power density doubles
- Static (leakage) power is also getting worse
- Simpler, slower processors are more efficient, and to conserve power, we can turn some of them off
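As a quick check of that scaling arithmetic (my own back-of-envelope, not from the slides): with dynamic power P = C·V²·f, a 0.7 shrink gives (0.7C)·(0.7V)²·(1.4f) = 0.7 × 0.49 × 1.4 × C·V²·f ≈ 0.48 × C·V²·f, i.e. power per transistor roughly halves, matching the halved transistor area. Once V stops shrinking, the same step gives (0.7C)·V²·(1.4f) ≈ 0.98 × C·V²·f per transistor in half the area, so power density about doubles.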
The Memory Wall
- We can get bigger caches from more transistors; does this suffice, or is there a problem scaling up?
- To speed up 2X without changing bandwidth below the cache, the miss rate must be halved
- How much bigger does the cache have to be?†
  - For dense matrix multiply or dense LU, 4x bigger
  - For sorting or FFTs, the square of its former size
  - For sparse or dense matrix-vector multiply, impossible
- Deeper interconnects increase miss latency
- Latency tolerance needs memory access parallelism

† H.T. Kung, "Memory requirements for balanced computer architectures," 13th International Symposium on Computer Architecture, 1986, pp. 49-54.
Overcoming the Memory Wall
- Provide more memory bandwidth
  - Increase DRAM I/O bandwidth per gigabyte
  - Increase microprocessor off-chip bandwidth
- Use architecture to tolerate memory latency
  - More latency means more threads or longer vectors
  - No change in programming model is needed
- Use caches for bandwidth as well as latency
  - Let compilers control locality
  - Keep cache lines short
  - Avoid mis-speculation
The End of the von Neumann Model
"Instructions are executed one at a time…"
- We have relied on this idea for 60 years; now it (and things it brought) must change
- Serial programming is easier than parallel programming, at least for the moment, but serial programs are now slow programs
- We need parallel programming paradigms that will make all programmers successful
- The stakes for our field's vitality are high: computing must be reinvented
DEFINITIONS
Asymptotic Notation
- Quantities are often meaningful only within a constant factor (algorithm performance analyses, for example)
- f(n) = O(g(n)) means there exist constants c and n0 such that n ≥ n0 implies |f(n)| ≤ |c·g(n)|
- f(n) = Ω(g(n)) means there exist constants c and n0 such that n ≥ n0 implies |f(n)| ≥ |c·g(n)|
- f(n) = Θ(g(n)) means both f(n) = O(g(n)) and f(n) = Ω(g(n))
Speedup, Time, and Work
- The speedup of a computation is how much faster it runs in parallel compared to serially
- If one processor takes T1 and p of them take Tp, then the p-processor speedup is Sp = T1/Tp
- The work done is the number of operations performed, either serially or in parallel: W1 = O(T1) is the serial work, Wp the parallel work
- We say a parallel computation is work-optimal if Wp = O(W1) = O(T1)
- We say a parallel computation is time-optimal if Tp = O(W1/p) = O(T1/p)
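A small numeric illustration (the numbers are mine, not from the slides): if a computation takes T1 = 12 s on one processor and Tp = 4 s on p = 4 processors, the speedup is Sp = T1/Tp = 3. It is work-optimal if the four processors together still do only O(T1) operations, and time-optimal only if Tp = O(T1/p), i.e. roughly 3 s here.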
Latency, Bandwidth, & Concurrency
- In any system that moves items from input to output without creating or destroying them, latency × bandwidth = concurrency
- Queueing theory calls this result Little's law
- (Figure: an example flow with bandwidth = 2, latency = 3, and concurrency = 6)
ARCHITECTURE AND PROGRAMMING
Parallel Processor Architecture
- SIMD: each instruction operates concurrently on multiple data items
- MIMD: multiple instruction sequences execute concurrently
- Concurrency is expressible in space or time
  - Spatial: the hardware is replicated
  - Temporal: the hardware is pipelined
Type | Spatial concurrency                       | Temporal concurrency
SIMD | PE arrays and VLIW: ILLIAC-IV, Elbrus 3M  | Vector pipelined systems: TI ASC, Cray 1
MIMD | Multiprocessors: iPSC/1, PS-2000          | Multithreaded systems: HEP, MARS-M, Niagara
Trends in Parallel Processors
- Today's chips are spatial MIMD at the top level, to get enough performance even in PCs
- Temporal MIMD is also used
- SIMD is tending back toward spatial
- Intel's Larrabee combines all three
- Temporal concurrency is easily "adjusted": vector length or number of hardware contexts
- Temporal concurrency tolerates latency: memory latency in the SIMD case; for MIMD, branches and synchronization also
Parallel Memory Architecture
- A shared memory system is one in which any processor can address any memory location
  - Quality of access can be either uniform (UMA) or nonuniform (NUMA), in latency and/or bandwidth
- A distributed memory system is one in which processors can't address most of memory
  - The disjoint memory regions and their associated processors are usually called nodes
- A cluster is a distributed memory system with more than one processor per node
- Nearly all HPC systems are clusters
Parallel Programming Variations
- Data Parallelism and Task Parallelism
- Functional Style and Imperative Style
- Shared Memory and Message Passing
- …and more we won't have time to look at
- A parallel application may use all of them
Data Parallelism and Task Parallelism
- A computation is data parallel when similar independent sub-computations are done simultaneously on multiple data items
  - Applying the same function to every element of a data sequence, for example
- A computation is task parallel when dissimilar independent sub-computations are done simultaneously
  - Controlling the motions of a robot, for example
- It sounds like SIMD vs. MIMD, but isn't quite: some kinds of data parallelism need MIMD
- (A small code sketch contrasting the two follows)
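The sketch below is mine, not from the talk: a minimal C++/OpenMP contrast of the two styles, where the parallel for is data parallelism (the same operation on every element) and the sections are task parallelism (two dissimilar, independent jobs).

    #include <cmath>
    #include <cstdio>

    int main() {
        const int n = 1000;
        double a[1000], b[1000];

        // Data parallelism: identical work per element, iterations split among threads
        #pragma omp parallel for
        for (int i = 0; i < n; i++) {
            a[i] = std::sqrt((double)i);
        }

        // Task parallelism: two different sub-computations run at the same time
        #pragma omp parallel sections
        {
            #pragma omp section
            { for (int i = 0; i < n; i++) b[i] = a[i] * 2.0; }   // one task: scale the array
            #pragma omp section
            { std::printf("a[999] = %f\n", a[999]); }            // a dissimilar task: report a value
        }
        return 0;
    }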
Functional and Imperative Programs
- A program is said to be written in (pure) functional style if it has no mutable state: computing = naming and evaluating expressions
- Programs with mutable state are usually called imperative because the state changes must be done when and where specified:
    while (z < x) { x = y; y = z; z = f(x, y); } return y;
- Often, programs can be written either way:
    let w(x, y, z) = if (z < x) then w(y, z, f(x, y)) else y;
Shared Memory and Message Passing
- Shared memory programs access data in a shared address space
  - When to access the data is the big issue; subcomputations therefore must synchronize
- Message passing programs transmit data between subcomputations
  - The sender computes a value and then sends it; the receiver receives a value and then uses it
  - Synchronization can be built in to communication
- Message passing can be implemented very well on shared memory architectures
Barrier Synchronization
- A barrier synchronizes multiple parallel subcomputations by letting none proceed until all have arrived
- It is named after the barrier used to start horse races
- It guarantees everything before the barrier finishes before anything after it begins
- It is a central feature in several data-parallel languages such as OpenMP
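A minimal OpenMP illustration (my own sketch, not from the slides): no thread starts the second phase until every thread has finished the first.

    #include <cstdio>
    #include <omp.h>

    int main() {
        int done[8] = {0};
        #pragma omp parallel num_threads(8)
        {
            int id = omp_get_thread_num();
            done[id] = 1;                // phase 1: each thread records its own work
            #pragma omp barrier          // no thread passes until all have arrived
            int sum = 0;                 // phase 2: safe to read every other thread's result
            for (int i = 0; i < 8; i++) sum += done[i];
            std::printf("thread %d sees %d finished\n", id, sum);
        }
        return 0;
    }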
Mutual Exclusion
- This type of synchronization ensures only one subcomputation can do a thing at any time
  - If the thing is a code block, it is a critical section
- It classically uses a lock: a data structure with which subcomputations can stop and start
- Basic operations on a lock object L might be
  - Acquire(L): blocks until other subcomputations are finished with L, then acquires exclusive ownership
  - Release(L): yields L and unblocks some Acquire(L)
- A lot has been written on these subjects
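For concreteness, a small sketch of mine (not the talk's) of the same Acquire/Release pattern in standard C++: std::mutex plays the role of L, and lock()/unlock() correspond to Acquire and Release.

    #include <mutex>
    #include <thread>
    #include <vector>

    std::mutex L;          // the lock object
    long counter = 0;      // shared state protected by L

    void worker() {
        for (int i = 0; i < 100000; i++) {
            L.lock();      // Acquire(L): block until we own L exclusively
            ++counter;     // critical section: only one thread at a time runs this
            L.unlock();    // Release(L): unblock some waiting Acquire
        }
    }

    int main() {
        std::vector<std::thread> ts;
        for (int i = 0; i < 4; i++) ts.emplace_back(worker);
        for (auto& t : ts) t.join();
        return counter == 400000 ? 0 : 1;   // with the lock, the count is exact
    }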
Non-Blocking Synchronization
- The basic idea is to achieve mutual exclusion using memory read-modify-write operations
- Most commonly used is compare-and-swap: CAS(addr, old, new) reads memory at addr and, if it contains old, replaces it with new
- Arbitrary update operations at an addr require that {read old; compute new; CAS(addr, old, new);} be repeated until the CAS operation succeeds
- If there is significant updating contention at addr, the repeated computation of new may be wasteful
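That retry loop looks like this in standard C++ (my example, with an arbitrary "compute new" step): compare_exchange_weak is the CAS, and on failure it reloads old for us.

    #include <atomic>

    std::atomic<int> value{0};   // the shared word at "addr"

    // Atomically apply an arbitrary update (here: new = old * 2 + 1)
    // using the read / compute / CAS retry loop from the slide.
    void update(std::atomic<int>& addr) {
        int old = addr.load();
        int nw;
        do {
            nw = old * 2 + 1;                            // compute new from old
        } while (!addr.compare_exchange_weak(old, nw));  // CAS; on failure, old is refreshed and we retry
    }

    int main() { update(value); return value.load() == 1 ? 0 : 1; }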
Load Balancing
- Some processors may be busier than others
- To balance the workload, subcomputations can be scheduled on processors dynamically
  - A technique for parallel loops is self-scheduling: processors repetitively grab chunks of iterations
  - In guided self-scheduling, the chunk sizes shrink (see the sketch after this list)
- Analogous imbalances can occur in memory
  - Overloaded memory locations are called hot spots
  - Parallel algorithms and data structures must be designed to avoid them
- Imbalanced messaging is sometimes seen
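In OpenMP terms (a small sketch of mine, not from the slides), self-scheduling corresponds to schedule(dynamic) and guided self-scheduling to schedule(guided), where each thread grabs a shrinking chunk of the remaining iterations.

    // Iterations vary wildly in cost, so a static split would leave some threads idle;
    // guided self-scheduling hands out shrinking chunks of iterations instead.
    #pragma omp parallel for schedule(guided)
    for (int i = 0; i < n; i++) {
        work(i);   // work() and n are assumed to be defined elsewhere
    }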
EXAMPLES
A Data Parallel Example: Sorting

    void sort(int *src, int *dst, int size, int nvals) {
      int i, j, t1[nvals], t2[nvals];
      for (j = 0; j < nvals; j++) {
        t1[j] = 0;
      }
      for (i = 0; i < size; i++) {
        t1[src[i]]++;
      } // t1[] now contains a histogram of the values
      t2[0] = 0;
      for (j = 1; j < nvals; j++) {
        t2[j] = t2[j-1] + t1[j-1];
      } // t2[j] now contains the origin for value j
      for (i = 0; i < size; i++) {
        dst[t2[src[i]]++] = src[i];
      }
    }
When Is a Loop Parallelizable?
- The loop instances must safely interleave
  - A way to do this is to only read the data
  - Another way is to isolate data accesses
- Look at the first loop:
    for (j = 0; j < nvals; j++) {
      t1[j] = 0;
    }
- The accesses to t1[] are isolated from each other, so this loop can run in parallel "as is"
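As a sketch (mine, not from the slides), the only change needed to run that loop in parallel with OpenMP is a directive in front of it:

    #pragma omp parallel for
    for (j = 0; j < nvals; j++) {
      t1[j] = 0;   // each iteration touches a different t1[j], so there are no conflicts
    }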
Isolating Data Updates
- The second loop seems to have a problem:
    for (i = 0; i < size; i++) {
      t1[src[i]]++;
    }
- Two iterations may access the same t1[src[i]]; if both reads precede both increments, oops!
- A few ways to isolate the iteration conflicts:
  - Use an "isolated update" (lock prefix) instruction
  - Use an array of locks, perhaps as big as t1[]
  - Use non-blocking updates
  - Use a transaction
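In OpenMP (my sketch), the lock-prefix style of isolated update is what #pragma omp atomic asks the compiler to generate:

    #pragma omp parallel for
    for (i = 0; i < size; i++) {
      #pragma omp atomic
      t1[src[i]]++;   // the read-modify-write on t1[src[i]] is performed indivisibly
    }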
Dependent Loop Iterations
- The 3rd loop is an interesting challenge:
    for (j = 1; j < nvals; j++) {
      t2[j] = t2[j-1] + t1[j-1];
    }
- Each iteration depends on the previous one
- This loop is an example of a prefix computation
  - If • is an associative binary operation on a set S, the •-prefixes of the sequence x0, x1, x2, … of values from S are x0, x0•x1, x0•x1•x2, …
  - Prefix computations are often known as scans
- Scan can be done efficiently in parallel
Cyclic Reduction
- Each vertical line represents a loop iteration; the associated sequence element is to its right
- On step k of the scan, iteration j prefixes its own value with the value from iteration j - 2^k
- (Figure: successive steps of the scan over the sequence a b c d e f g)
    step 0:  a  b   c    d     e     f      g
    step 1:  a  ab  bc   cd    de    ef     fg
    step 2:  a  ab  abc  abcd  bcde  cdef   defg
    step 3:  a  ab  abc  abcd  abcde abcdef abcdefg
Applications of Scan
- Linear recurrences like the third loop
- Polynomial evaluation
- String comparison
- High-precision addition
- Finite automata
  - Each xi is the next-state function given the ith input symbol, and • is function composition
- APL compress
- When only the final value is needed, the computation is called a reduction instead
  - It's a little bit cheaper than a full scan
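Reductions are directly supported by OpenMP; a tiny sketch of mine, summing an array:

    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++) {
        sum += x[i];   // each thread keeps a private partial sum; OpenMP combines them at the end
    }
    // n and x[] are assumed to be declared elsewhere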
More Iterations n Than Processors p
Wp = 3n + O(p log p), Tp = 3n / p + O(log p)
OpenMP
- OpenMP is a widely implemented extension to C++ and Fortran for data† parallelism
- It adds directives to serial programs
- A few of the more important directives:
    #pragma omp parallel for <modifiers>
    <for loop>

    #pragma omp atomic
    <binary op=, ++ or -- statement>

    #pragma omp critical <name>
    <structured block>

    #pragma omp barrier

† And perhaps task parallelism soon
The Sorting Example in OpenMP
- Only the third "scan" loop is a problem
- We can at least do this loop "manually":

    int nt = omp_get_num_threads();
    int ta[nt], tb[nt];
    #pragma omp parallel for
    for (myt = 0; myt < nt; myt++) {
      // Set ta[myt] = local sum of nvals/nt elements of t1[]
      #pragma omp barrier
      for (k = 1; k <= myt; k *= 2) {
        tb[myt] = ta[myt];
        ta[myt] += tb[myt - k];
        #pragma omp barrier
      }
      fix = (myt > 0) ? ta[myt - 1] : 0;
      // Set nvals/nt elements of t2[] to fix + local scan of t1[]
    }
Parallel Patterns Library (PPL)
- PPL is a Microsoft C++ library built on top of the ConcRT user-mode scheduling runtime
- It supports mixed data- and task-parallelism:
    parallel_for, parallel_for_each, parallel_invoke
    agent, send, receive, choice, join, task_group
- Parallel loops use C++ lambda expressions:
    parallel_for(1, nvals, [&t1](int j) { t1[j] = 0; });
- Updates can be isolated using intrinsic functions:
    (void)_InterlockedIncrement(&t1[src[i]]);
- Microsoft and Intel plan to unify PPL and TBB
Dynamic Resource Management
- PPL programs are written for an arbitrary number of processors, which could be just one
- Load balancing is mostly done by work stealing
- There are two kinds of work to steal:
  - Work that is unblocked and waiting for a processor
  - Work that is not yet started and is potentially parallel
- Work of the latter kind will be done serially unless it is first stolen by another processor
  - This makes recursive divide and conquer easy
  - There is no concern about when to stop parallelism
A Quicksort Example

    void quicksort(vector<int>::iterator first, vector<int>::iterator last) {
      if (last - first < 2) { return; }
      int pivot = *first;
      auto mid1 = partition(first, last, [=](int e) { return e < pivot; });
      auto mid2 = partition(mid1, last, [=](int e) { return e == pivot; });
      parallel_invoke(
        [=] { quicksort(first, mid1); },
        [=] { quicksort(mid2, last); }
      );
    }
LINQ and PLINQ
- LINQ (Language Integrated Query) extends the .NET languages C#, Visual Basic, and F#
  - A LINQ query is really just a functional monad
  - It queries databases, XML, or any IEnumerable
- PLINQ is a parallel implementation of LINQ
  - Non-isolated functions must be avoided
  - Otherwise it is hard to tell the two apart
A LINQ Example

    var q = from n in names.AsParallel()   // adding .AsParallel() turns the LINQ query into PLINQ
            where n.Name == queryInfo.Name
               && n.State == queryInfo.State
               && n.Year >= yearStart
               && n.Year <= yearEnd
            orderby n.Year ascending
            select n;
Message Passing Interface (MPI)
- MPI is a widely used message passing library for distributed memory HPC systems
- Some of its basic functions:
    MPI_Init, MPI_Comm_rank, MPI_Comm_size, MPI_Send, MPI_Recv
- A few of its "collective communication" functions:
    MPI_Barrier, MPI_Gather, MPI_Allgather, MPI_Alltoall,
    MPI_Reduce, MPI_Allreduce, MPI_Scan, MPI_Exscan
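A minimal skeleton showing the basic functions in use (my own sketch, with an arbitrary message sent from rank 0 to rank 1):

    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // which node am I?
        MPI_Comm_size(MPI_COMM_WORLD, &size);   // how many nodes are there?

        int value = 42;
        if (rank == 0 && size > 1) {
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            std::printf("rank 1 received %d\n", value);
        }
        MPI_Finalize();
        return 0;
    }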
Sorting in MPI
- Roughly, it could work like this on n nodes:
  - Run the first two loops locally
  - Use MPI_Allreduce to build a global histogram (sketched after this list)
  - Run the third loop (redundantly) at every node
  - Allocate n value intervals to nodes (redundantly), balancing the data per node as well as possible
  - Run the fourth loop using the local histogram
  - Use MPI_Alltoall to redistribute the data
  - Merge the n sorted subarrays on each node
- Collective communication is expensive, but sorting needs it (see the Memory Wall slide)
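The histogram step might look like this (a sketch under my own naming, assuming nvals and the local t1[] from the earlier sorting code):

    int t1_global[nvals];   // global histogram, identical on every node afterwards
    MPI_Allreduce(t1, t1_global, nvals, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    // Every node now holds the element-wise sum of all local histograms,
    // so the (redundant) third loop can compute the same origins everywhere.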
Another Way to Sort in MPI
- The Samplesort algorithm is like Quicksort
- It works like this on n nodes:
  - Sort the local data on each node independently
  - Take s samples of the sorted data on each node
  - Use MPI_Allgather to send all nodes all samples
  - Compute n - 1 splitters (redundantly) on all nodes, balancing the data per node as well as possible
  - Use MPI_Alltoall to redistribute the data
  - Merge the n sorted subarrays on each node
CONCLUSIONS
Parallel Computing Has Arrived
- We must rethink how we write programs, and we are definitely doing that
- Other things will also need to change: architecture, operating systems, algorithms, theory, application software
- We are seeing the biggest revolution in computing since its very beginnings