04/18/23
Parallel Computer Architectures
Computer Science and Engineering
1
Parallel Computer Architectures
Duncan A. Buell
Rules for Parallel Computing
Parallel Computing History
Late 1960s: ILLIAC-4
1970: CDC STAR-100
1980s:
• Denelcor HEP
• Tera Computer Corp. MTA
• Alliant
• Sequent
• Stardent
• Kendall Square Research (KSR)
• Intel Hypercube
• NCube
• BBN Butterfly
• NASA MPP
• Thinking Machines CM-2
• MasPar
1990s and forward:
• Cray T3D, T3E
• Thinking Machines CM-5
• Tera Computer Corp. MTA
• SGI Challenge
• Sun Enterprise
• SGI Origin
• HP-Convex
• DEC 84xx
• Pittsburgh Terascale
• ASCI machines
• Beowulf clusters
• IBM SP-1, SP-2
• New DoE-inspired machines
Memory Latency is the Problem
Instructions execute in nanoseconds
Memory provides data in 100s of nanoseconds
The problem is keeping processors fed with data
Standard machines use levels of cache
How do we keep lots of processors fed?
Solutions(?) to the Latency Problem
Connect all the processors to all the memory
SMP: Sun Enterprise, SGI Challenge, Cray multiprocessors
Provide fast, constant time, memory fetch to anywhere from anywhere
Requires a fast, expensive, full crossbar switch
Solutions(?) to the Latency Problem (2)
Build a machine that is physically structured like the computations to be performed
• Vectors: Cray, CDC
• SIMD: MPP, CM-2, MasPar
• 2D/3D Grid: CRAY T3D, T3E
• Butterfly: BBN
• Meiko “computing surface”
Works well on problems on which it works well
Works badly on problems that don’t fit
Solutions(?) to the Latency Problem (3)
Build a machine with “generic” structure and software support for computations that may not fit well
• Butterfly: BBN
• Log network: CM-2, CM-5
Relies on magic
Magic has always been hard to do
Solutions(?) to the Latency Problem (4)
Build an SMP and then connect SMPs together in clusters
• SGI: Origin (NUMA, ccNUMA)
• DoE: ASCI Red, Blue Pacific, White, etc.
Performance requires distributable computations, because the memory access is slow off the local node
Solutions(?) to the Latency Problem (5)
Ignore performance and concentrate on cost
• Beowulf clusters
• Networks of workstations
If the machine is cheap, and works very well on some (distributable) computations, then maybe no one will notice that it’s not so great on other computations.
Vector Computers
Much of high end computing is for scientific and engineering applications
Many of these involve linear algebra
We happen to know how to do linear algebra
Many solutions can be expressed with lin alg
(Lin alg is both the hammer and the nail)
The basic operation is a dot product, i.e. a vector multiplication
Vector computers do blocks of arithmetic ops as one operation
Register-based (CRAY) or memory-memory (CDC)
Programming Vector Computers
Everything reduces to a compiler’s recognizing (or being told to recognize) a loop whose ops can be done in parallel.
for (i = 0; i < n; i++)   /* works just fine */
    a[i] = b[i] * c[i];

for (i = 1; i < n; i++)   /* fails, a[.] values not independent */
    a[i] = a[i-1] * b[i];
Programming involves contortions of code to make it into independent operations inside the loops.
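As a rough illustration of the kind of contortion meant here, consider a dot product. Written naively, every iteration updates the same accumulator, so the iterations are not independent. One common workaround, sketched below, keeps several partial sums so that the inner loop body is independent across lanes. The names dot_naive and dot_strips and the strip width LANES are illustrative assumptions, not from the slides, and because the additions are reordered the floating-point result can differ slightly in general.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 4  /* assumed strip width; a real vector length is machine dependent */

/* Naive form: each iteration depends on the previous value of sum. */
double dot_naive(const double *b, const double *c, size_t n)
{
    assert(n == 0 || (b != NULL && c != NULL));
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += b[i] * c[i];
    return sum;
}

/* Restructured form: LANES independent partial sums, combined at the end. */
double dot_strips(const double *b, const double *c, size_t n)
{
    assert(n == 0 || (b != NULL && c != NULL));
    double part[LANES] = {0.0};
    size_t i = 0;

    /* Each k updates a different part[k], so the inner operations are independent. */
    for (; i + LANES <= n; i += LANES)
        for (size_t k = 0; k < LANES; k++)
            part[k] += b[i + k] * c[i + k];

    /* Leftover elements when n is not a multiple of LANES. */
    double sum = 0.0;
    for (; i < n; i++)
        sum += b[i] * c[i];

    /* Final short reduction over the partial sums. */
    for (size_t k = 0; k < LANES; k++)
        sum += part[k];
    return sum;
}
```

Both functions compute the same mathematical quantity; only the second exposes a loop body that a vectorizing compiler could plausibly treat as a block of independent operations.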
Vector Computing History
1960s: Seymour R. Cray does CDC 6400
Cray leaves CDC, forms Cray Research, Inc., produces CRAY-1 (1976)
CDC Cyber 205 (late 1970s)
CDC spins off ETA, liquid nitrogen ETA-10 fails, ETA fails
CRAY X-MP (1983?), CRAY 2 runs Unix (1985)
Convex C-1 and a host of “Cray-ettes”, now HP-Convex
CRAY Y-MP (1988?), C90, T90, J series (1990s)
Steve Chen leaves CRI, forms SSC, fails spectacularly
Cray leaves CRI, forms Cray Computer Corp.
CCC CRAY 3 fails, CRAY 4 fails, CCC SSS fails
CRI sold to SGI, then sold to Tera Computer Corp.
1996: S.R. Cray killed in auto wreck by teenager
Parallel Computers
• The theoretic model of a PRAM
• Symmetric Multi Processors
• Distributed memory machines
• Machines with an inherent structure
• Non Uniform Memory Access machines
• Massively parallel machines
• Grid computing
Theory – The PRAM Model
PRAM (Parallel Random Access Machine):
• Control unit
• Global memory
• Unbounded set of procs
• Private mem for each processor
PRAM
Types of PRAM:
• EREW (Exclusive Read Exclusive Write)
• CREW (Concurrent Read Exclusive Write)
• CRCW (Concurrent Read Concurrent Write)
Flaws with PRAM:
• Logical flaw:
– Must deal with the concurrent write problem
• Practicality flaw:
– Can’t really assume unbounded number of processors
– Can’t really afford to build the interconnect switch
Nonetheless, it’s a good starting place
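To make the model concrete, here is a minimal sketch of a tree-style sum that fits the EREW rules, simulated sequentially in C. The function name pram_sum is an assumption for illustration. The inner loop stands in for processors that would all act in the same step on a real PRAM, and within one step no memory cell is touched by two of them.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Sequential simulation of a tree-style sum on an EREW PRAM.
 * In each step, "processor" i (a multiple of 2*stride) adds x[i + stride]
 * into x[i]. Within a step the cells used by different processors are
 * disjoint, so reads and writes are exclusive.
 * Returns the number of parallel steps; the total ends up in x[0].
 */
unsigned pram_sum(long *x, size_t n)
{
    assert(x != NULL && n > 0);
    unsigned steps = 0;
    for (size_t stride = 1; stride < n; stride *= 2) {
        /* On a PRAM, all iterations of this inner loop happen at once. */
        for (size_t i = 0; i + stride < n; i += 2 * stride)
            x[i] += x[i + stride];
        steps++;
    }
    return steps;
}
```

For 8 values the simulation takes 3 steps, and for 5 values it also takes 3, which matches the ceiling of log2(n). It assumes at least n/2 processors are available, which is exactly the "unbounded processors" assumption the slide flags as impractical.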
Standard Single Processor Machine
• One processor
• One memory block
• Bus to memory
• All addresses visible
[Figure: Processor, Memory]
(Michael) Flynn’s Taxonomy
SISD (Single Instruction, Single Data) – The ordinary computer
MIMD (Multiple Instruction, Multiple Data) – True, symmetric, parallel computing (Sun Enterprise)
SIMD (Single Instruction, Multiple Data) – Massively parallel army-of-ants approach – Processors execute the same sequence of instructions (or else NO-OP) in lockstep (TMC CM-2)
SCMD/SPMD (Single Code/Program Multiple Data) – Processors run the same program, but on their own local data (Beowulf clusters)
Symmetric Multi-Processor (SMP) (MIMD)
• Lots of processors (32? 64? 128? 1024?)
• Multiple “ordinary” processors
• Lots of global memory
• All addresses visible to all processors
• Closest thing to a PRAM
• This is the holy grail
[Figure: Processors, Memory]
SMP Characteristics
Middle level parallel execution
Processors spawn “threads” at or below the size of a function
Compiler magic to extract parallelism (if no pointers in the code, then at the function level one can determine independence of use of variables)
Compiler directives to force parallelism
Sun Enterprise, SGI Challenge, …
[Figure: Processors, Memory]
But SMPs Are Hard to Build
• N processors
• M memory blocks
• N*M connections
• This is hard and expensive
[Figure: processors (P) connected point to point to memory blocks (M)]
But SMPs Are Hard to Build
For large N and M, we do this as a switch, not point to point
But it’s still hard and expensive
Half the cost of a CRAY was the switch between processors and memory
Beyond 128 processors, almost impossible
[Figure: processors (P) connected to memory blocks (M) through a switch]
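A back-of-the-envelope sketch of why the full switch gets expensive: a crossbar needs one crosspoint per processor-memory pair, while a butterfly-style log network with n ports uses log2(n) stages of n/2 two-by-two switches. The function names below are illustrative assumptions, and the comparison counts hardware only; it says nothing about the extra latency and contention of the multistage network.

```c
#include <assert.h>

/* Crosspoints in a full crossbar joining n processors to m memory blocks. */
unsigned long crossbar_points(unsigned long n, unsigned long m)
{
    return n * m;
}

/*
 * 2x2 switches in a butterfly-style log network with n ports,
 * where n is a power of 2 and n >= 2: log2(n) stages of n/2 switches.
 */
unsigned long butterfly_switches(unsigned long n)
{
    assert(n >= 2 && (n & (n - 1)) == 0);
    unsigned long stages = 0;
    for (unsigned long k = n; k > 1; k >>= 1)
        stages++;                 /* counts log2(n) */
    return stages * (n / 2);
}
```

For 128 processors and 128 memory blocks the crossbar needs 16384 crosspoints, against 448 switches for the log network; the price of the cheaper network is that an access is no longer a single constant-time hop.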
Memory Banking Issues
Many processors requesting data
Processors generate addresses faster than memory can respond
Memory banking: use low bits of address to specify the physical bank so consecutive addresses go to physically different banks
But power-of-2 stride (as in an FFT) hits the same bank repeatedly
CDC deliberately used 17 memory banks to randomize accesses
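The stride effect can be seen with a small sketch. It assumes the bank index is simply the address modulo the number of banks, which for a power-of-2 bank count is the same as taking the low bits. The function name banks_touched and the 64-bank limit are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Count how many distinct banks are hit by `count` accesses at a fixed
 * stride, with bank = address mod nbanks. Fewer distinct banks means the
 * same banks are hit repeatedly and the accesses queue up behind each other.
 */
size_t banks_touched(size_t nbanks, size_t stride, size_t count)
{
    assert(nbanks > 0 && nbanks <= 64);
    int seen[64] = {0};
    size_t distinct = 0;
    for (size_t i = 0; i < count; i++) {
        size_t bank = (i * stride) % nbanks;
        if (!seen[bank]) {
            seen[bank] = 1;
            distinct++;
        }
    }
    return distinct;
}
```

With 16 banks, a stride of 16 returns to bank 0 every time, so banks_touched(16, 16, 64) is 1. With 17 banks the same stride visits all 17, because 17 is prime and shares no factor with a power-of-2 stride.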
FFT Butterfly Communication
Distributed Parallelism
• Beowulf cluster of Linux nodes (requires an identifiable “computer” to be a Beowulf?)
• SNOW (Scalable Network of Workstations)
• SETI@home, GIMPS, …
• Beowulfs programmed with MPI or PVM
• MPI uses explicit processor-to-processor message passing
• Sun (and others) have tools for networks
Distributed Parallel Computers
Usually we can’t get to the memory except through the processor, but we would like to have memory-to-memory connections.
[Figure: five processor-memory pairs (P M), each attached to a common Network]
Parallel Computers With Structure
• If it’s hard/expensive to build an SMP, is it useful to build the structure into the machine?
• Build in a communication pattern that you expect to see in the computations, but keep things simple enough to make them buildable
• Make sure that you have efficient algorithms for the common computational tasks
Parallel Computers With Structure
• Ring-connected machines (Alliant)
• 2-dimensional meshes (CRAY T3D, T3E)
• 3-D mesh with missing links (Tera MTA)
• Logarithmic tree interconnections
– Thinking Machines Connection Machine CM-2, CM-5
– MasPar MP-1, MP-2
• Bolt, Beranek, and Newman BBN Butterfly
2-dimensional Mesh with Wraparound
A vector multiply can be done very efficiently (shift column data up past row data), but what about a matrix transpose?
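One way to see why the transpose is awkward: in the shifting scheme each datum moves one hop to a neighbor per step, but in a transpose the element at (i, j) has to reach (j, i). The sketch below counts the hops that move needs on an n x n mesh with wraparound. It assumes one element per processor and shortest-path routing along rows and then columns, and it ignores link contention, which would only make things worse. The function names are illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* Shortest distance between positions a and b on a ring of n nodes. */
static size_t ring_dist(size_t a, size_t b, size_t n)
{
    size_t d = a > b ? a - b : b - a;
    return d < n - d ? d : n - d;
}

/* Hops to move element (i, j) to (j, i) on an n x n mesh with wraparound. */
size_t transpose_hops(size_t i, size_t j, size_t n)
{
    assert(n > 0 && i < n && j < n);
    /* Row index moves from i to j, column index from j to i. */
    return ring_dist(i, j, n) + ring_dist(j, i, n);
}

/* Largest hop count over all elements of the matrix. */
size_t transpose_max_hops(size_t n)
{
    size_t worst = 0;
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            size_t h = transpose_hops(i, j, n);
            if (h > worst)
                worst = h;
        }
    return worst;
}
```

On an 8 x 8 mesh the worst element needs 8 hops, and in general the worst case grows linearly with n, while every element off the diagonal is trying to move at the same time.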
Logarithmic Tree Communications
Parallel Computers With Structure
Machines with structure that were intended to be SMPs were generally not successful
Alliant, Sequent, BBN Butterfly, etc.
CM-5 claimed magical compilers, but efficiency only came by using the structure explicitly
T3D, T3E were the ONLY machines that allowed shared memory with clusters of nodes—and had it work
NUMA Clusters of SMPs
• 2-4 processors, 2-4 Gbytes memory on a node
• 4 (plus or minus) nodes per cabinet with a switch
• Cabinets interconnected with another switch
• Non Uniform Memory Access
– Fast access to node memory
– Slower access elsewhere in the cabinet
– Yet slower access off-cabinet
• Nearly all large machines are NUMA (DoE ASCI, SGI Origin, Pittsburgh Terascale, …)
Massively Parallel SIMD Computers
• NASA Massively Parallel Processor
– Built by Goodyear 1984 for image processing
– 16384 1-bit procs, 1024 bits/proc of mem
– Mesh connections
• Thinking Machines CM-2 (1986)
– 65536 1-bit procs, 8192 bits/proc
– Log network
– Compute cost = communication cost?
• MasPar MP-1, MP-2 (late 1980s)
– 8192 4-bit processors
– Log network
Massively Parallel SIMD Computers
• Plane of processors each sitting above an array of memory bits
• Usually a log network connecting the processors
• Usually also some local connections (e.g., 16 procs/node on CM-2)
[Figure: Control processor, Procs, Memory]
Massively Parallel SIMD Computers
• Control processor sends instructions clock by clock to the compute processors
• All compute processors execute the instruction (or NO-OP) on the same relative data location
• Obvious image processing model
• Allows variable data types (although TMC didn’t do this until told to)
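The lockstep model can be sketched as a single broadcast step: the control processor names one instruction, and every compute processor either applies it to its own copy of the data or sits the step out. The two-instruction set, the enable flags, and the name simd_step are assumptions for illustration; they are not the instruction set of any of the machines above.

```c
#include <assert.h>
#include <stddef.h>

enum simd_op { OP_ADD, OP_NEGATE };   /* hypothetical two-instruction set */

/*
 * One broadcast step. Every processing element p executes `op` on its own
 * data[p] (the same relative location in each local memory), unless its
 * enable flag is 0, in which case it performs a NO-OP.
 */
void simd_step(enum simd_op op, int *data, const int *operand,
               const unsigned char *enabled, size_t nprocs)
{
    assert(data != NULL && operand != NULL && enabled != NULL);
    for (size_t p = 0; p < nprocs; p++) {   /* conceptually simultaneous */
        if (!enabled[p])
            continue;                       /* NO-OP for this element */
        switch (op) {
        case OP_ADD:
            data[p] += operand[p];
            break;
        case OP_NEGATE:
            data[p] = -data[p];
            break;
        }
    }
}
```

A data-dependent branch is handled by running both sides in turn with complementary enable flags, which is why code with heavy branching leaves many processors idle on such machines.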
Massively Parallel SIMD Computers
Processor in Memory (PIM)
Take half the memory off a chip
Use the silicon for implementing SIMD processors
Extra address bit toggles mode
– If 0, use address as address
– If 1, use “address” as SIMD instruction
2048 processors per memory chip
Cray Computer Corp. SSS would have provided millions of processors
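The mode-bit idea can be sketched as a decode step on the word presented to the chip. Everything concrete here is an assumption made for the example: the 32-bit word, the choice of the top bit as the extra bit, and the names pim_decode and pim_request. The slides do not give the actual encoding.

```c
#include <assert.h>
#include <stdint.h>

#define PIM_MODE_BIT 0x80000000u   /* assumed position of the extra address bit */

enum pim_kind { PIM_MEMORY_ACCESS, PIM_SIMD_INSTRUCTION };

struct pim_request {
    enum pim_kind kind;
    uint32_t payload;   /* a memory address, or the SIMD instruction bits */
};

/* Split an incoming word into its mode and the remaining bits. */
struct pim_request pim_decode(uint32_t word)
{
    struct pim_request r;
    /* Mode bit 0: ordinary memory access. Mode bit 1: SIMD instruction. */
    r.kind = (word & PIM_MODE_BIT) ? PIM_SIMD_INSTRUCTION : PIM_MEMORY_ACCESS;
    r.payload = word & ~PIM_MODE_BIT;
    assert((r.payload & PIM_MODE_BIT) == 0);
    return r;
}
```

The appeal of the scheme is that the host needs no new interface: it drives the SIMD processors using the same address lines it already uses for ordinary loads and stores.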