Upload
mya
View
48
Download
0
Tags:
Embed Size (px)
DESCRIPTION
CMPE 421 Parallel Computer Architecture. Multi Processing. Multi Processing. Goal: connecting multiple computers to get higher performance Multiprocessors: To create powerful computers by connecting many existing smaller ones Scalability - PowerPoint PPT Presentation
Citation preview
1
CMPE 421 Parallel Computer ArchitectureMulti Processing
2
Multi Processing Goal: connecting multiple computers to get higher performance Multiprocessors: To create powerful computers by connecting many
existing smaller ones Scalability
- H/W and S/W are designed to be sold with variable number of processors Availability
- If one fails others should continue Power efficiency
Job-level (process-level) parallelism (Threads) High throughput for independent jobs
Parallel processing program Single program run on multiple processors
Multi-core microprocessors Chips with multiple processors (cores)
Clusters: A set of computers connected over a local area network that function as single
large multiprocessor
3
Hardware and Software Hardware
Serial: e.g., Pentium 4 Parallel: e.g., Core 2 Duo, quad-core, Xeon
Software Sequential: e.g., matrix multiplication Concurrent: e.g., operating system
Sequential/concurrent software can run on serial/parallel hardware
Challenge: making effective use of parallel hardware Example
Database File servers Computer-aided design packages Multiprocessing operating systems Google PC’s clusters
4
Parallel ProgrammingParallel software is the problemNeed to get significant performance
improvement Otherwise, just use a faster uniprocessor, since it’s
easier!Difficulties
Partitioning (loops) Coordination Communications overhead
5
Scaling ExampleWorkload: sum of 10 scalars, and 10 × 10 matrix
sum Speed up from 10 to 100 processors
Single processor: Time = (10 + 100) × tadd10 processors
Time = 10 × tadd + 100/10 × tadd = 20 × tadd Speedup = 110/20 = 5.5
100 processors Time = 10 × tadd + 100/100 × tadd = 11 × tadd Speedup = 110/11 = 10
Assumes load can be balanced across processors
6
Scaling Example (cont)What if matrix size is 100 × 100?Single processor: Time = (10 + 10000) × tadd
10 processors Time = 10 × tadd + 10000/10 × tadd = 1010 × tadd
Speedup = 10010/1010 = 9.9 100 processors
Time = 10 × tadd + 10000/100 × tadd = 110 × tadd
Speedup = 10010/110 = 91 Assuming load balanced
7
Strong vs Weak ScalingStrong scaling: problem size fixed
As in exampleWeak scaling: problem size proportional to
number of processors 10 processors, 10 × 10 matrix
- Time = 20 × tadd
100 processors, 32 × 32 matrix- Time = 10 × tadd + 1000/100 × tadd = 20 × tadd
Constant performance in this example
8
Problems to derive the design of multiprocessors and clusters
1. How do parallel processors share data?2. How do parallel processors coordinate?
» WHEN operating on shared data ….
3. How many processors?
9
How do parallel processors share data? Shared memory Procedures
Single address space that all procedures share Since the processors communicate through shared variables in
memory, Accessing any memory location is performed via loads and stores
Single Bus
10
How do parallel processors coordinate? Difficulty of coordination within processors is:
- One processor could start working on data before another is finished with it.This coordination is called synchronization
- When sharing is supported with a single address space, there must be a separate mechanism for synchronization.
Write-back caches used to keep bus traffic at a minimum Caches are used to reduce latency and to lower bus traffic Must provide hardware to ensure that caches and memory are
consistent (cache coherency) we will discuss later
One approach uses a lock: only one processor at a time can acquire the lock, and other processors interested in shared data must wait until the original processor unlocks the variable
11
Types of Memory Access UMAs (uniform memory access) – SMP (symmetric
multiprocessors) all accesses to main memory take the same amount of time no
matter which processor makes the request or which location is requested
NUMAs (nonuniform memory access) some main memory accesses are faster than others depending
on the processor making the request and which location is requested
can scale to larger sizes than UMAs so are potentially higher performance compared to UMA
12
Example: Sum Reduction Suppose we have a single-bus multiprocessor UMA of
100 processors. Write a parallel processing program to calculate sum of 100,000 numbers.
Each processor has ID: 0 ≤ Pn ≤ 99 Partition 1000 numbers per processor Initial summation on each processor sum[Pn] = 0; for (i = 1000*Pn; i < 1000*(Pn+1); i = i + 1) sum[Pn] = sum[Pn] + A[i];
vectors A and sum are shared variables, Pn is the processor’s number, i is a private variable
Now need to add these partial sums Reduction: divide and conquer Half the processors add pairs, then quarter, … Need to synchronize between reduction steps
13
Example: Sum Reduction
half = 100;repeat synch(); if (half%2 != 0 && Pn == 0) sum[0] = sum[0] + sum[half-1]; /* Conditional sum needed when half is odd; Processor0 gets missing element */ half = half/2; /* dividing line on who sums */ if (Pn < half) sum[Pn] = sum[Pn] + sum[Pn+half];until (half == 1);
Processors start by running a loop that sums their subset of vector A numbers (vectors A and sum are shared variables, Pn is the processor’s number, i is a private variable)
The processors then coordinate in adding together the partial sums (half is a private variable initialized to 100 (the number of processors))
15
An Example with 10 Processors
P0 P1 P2 P3 P4 P5 P6 P7 P8 P9
sum[P0]sum[P1]sum[P2] sum[P3]sum[P4]sum[P5]sum[P6] sum[P7]sum[P8] sum[P9]
P0
P0 P1 P2 P3 P4
half = 10
half = 5
P1 half = 2
P0 half = 1
16
Alternative Model (Message Passing) Distributed Memory Multiprocessors
Each Processor has private physical address space Hardware send/receives messages between processor Example: Clusters (desktop computers) Processors knows when a message is sent Processors knows when a message arrives
Sometimes confirmation is asked to receiving processors
• Network of independent computers• Each has private memory and OS• Connected using I/O system
• E.g., Ethernet/switch, Internet• Suitable for applications with independent tasks
• Web servers, databases, simulations,
17
Sum Reduction (Again) Suppose we have a network-connected multiprocessor
of 100 processors. Write a parallel processing program to calculate sum of 100,000 numbers.
Divided the 100,000 numbers into 100 subsets, each of 1000 numbers, and each subset is summed by an individual processor.
Then add the partial sums together with log2 (100) steps The do partial sums sum = 0;for (i = 0; i<1000; i = i + 1) sum = sum + AN[i];
Reduction Half the processors send, other half receive and add The quarter send, quarter receive and add, …
Different subsets of 100,000 numbers are copied to different individual memories
All processors have same copy of the program
18
Sum Reduction (Again)Communication and coordination between
processors through message passingGiven send() and receive() operations
limit = 100; half = 100;/* 100 processors */repeat half = (half+1)/2; /* send vs. receive dividing line */ if (Pn >= half && Pn < limit) send(Pn - half, sum); if (Pn < (limit/2)) sum = sum + receive(); limit = half; /* upper limit of senders */until (half == 1); /* exit with final sum */
Send/receive also provide synchronization Assumes send/receive take similar time to addition
19
Cache Coherence Shared memory multiprocessors have cache coherence problem
Multiple copies of the same memory data can exist in different caches simultaneously.
Update in the caches will cause an inconsistent view of memory Solutions:
Software oriented– Compilers identify the data items that may cause cache – inconsistent and instruct hardware not caching them
Hardware oriented Directory protocol
- The directory acts as a filter through which the processor must ask permission to load an entry from the primary memory to its cache. When an entry is changed the directory either updates or invalidates the other caches with that entry
Snoopy Protocol- is the process where the individual caches monitor address lines for accesses to memory
locations that they have cached. When a write operation is observed to a location that a cache has a copy of, the cache controller invalidates its own copy of the snooped memory location.
– Write-update – Write-invalidate
20
Cache Coherence Ex: P1 reads A from shared MEM (A=5)
P2 reads A from shared MEM (A=5)P1 modifies A as 10 (A=10 in cache 1, A=5 in cache2, mem) inconsistent
We can not apply a write-trough policy for every update Why? P1 should tell P2 and MEM that A=10, It will be to slow, and to costly in terms of MEM access and bus traffic
We need cache coherence protocols to solve this problem Multiple copies are not problem when reading they become a problem
when writing
21
Snooping Cache controllers monitor (snoop) the shared bus to determine
whether or not they have a copy of the shared block which is being updated (written)
Shared reads are no problem So, snoopy cache coherence protocols must find all caches that
share an object to be written When there is a “write” to a shared data block, all other copies of that
variable can be updated or invalidated. (How?, will be explained later)
Tags are duplicated to reduce the demands of snooping on the cache
Monitor whether or not they have a copy of the deisred block
In Snoopy caches there is broadcast media that listens to all invalidates and read request and performs appropriate coherence operations locally
22
Read Misses If the request is a read” which missed (value could not be
found in local cache), then all other caches check to see whether they have a copy of the requested block, if yes, then the supply the data to requesting processor.
23
Handling WritesEnsuring that all other processors sharing data are
informed of writes can be handled two ways:1. Write-update (write-broadcast) – writing processor
broadcasts new data over the bus, all copies are updated All writes go to the bus higher bus traffic Since new values appear in caches sooner, can reduce latency
2. Write-invalidate – (other “owners” are told their copies are no longer valid) writing processor issues invalidation signal on bus, cache snoops check to see if they have a copy of the data, if so they invalidate their cache block containing the word (this allows multiple readers but only one writer) Uses the bus only on the first write lower bus traffic, so better
use of bus bandwidth
24
An Example of a Cache Coherence (CC) Protocol Finite state transition diagram for a write-invalidation
protocol based on a write-back policy. Each cache block is in one of three states:
1. Shared (read only): This cache block is clean (not written) and may be shared.
2. Modified (read/write): This cache block is dirty (written) and may not be shared.
3. Invalid: This cache block does not have valid data.
25
A Write-Invalidate CC Protocol
Shared(clean)Invalid
Modified(dirty)
write-back caching protocol in black
read (miss)
write (h
it or m
iss)
read (hit or miss)
read (hit) or write (hit)
writ
e (m
iss)
send invalidate
receives invalidate(write by another processor
to this block)A
noth
er p
roce
ssor
has
a
read
mis
s or
a w
rite
mis
s fo
r th
is b
lock
(see
n on
bus
); w
rite
back
old
blo
ck
send
inva
lidat
e
signals from the processor coherence additions in redsignals from the bus coherence additions in blue
26
CC Protocol A read miss causes the cache to acquire the bus and
write back the victim block (if it was in the Modified (dirty) state). All the other caches monitor the read miss to see if this block is in their cache. If one has a copy and it is in the Modified state, then the block is written back and its state is changed to Invalid. The read miss is then satisfied by reading from the block memory, and the state of the block is set to Shared.
Read hits do not change the cache state.
27
CC Protocol A write miss to an Invalid block causes the cache to
acquire the bus, read the block, modify the portion of the block being written and changing the block’s state to Modified. A write miss to a Shared block causes the cache to acquire the bus, send an invalidate signal to invalidate any other existing copies in other caches, modify the portion of the block being written and change the block’s state to Modified.
A write hit to a Modified block causes no action. A write hit to a Shared block causes the cache to acquire the bus, send an invalidate signal to invalidate any other existing copies in other caches, modify the part of the block being written, and change the state to Modified.
28
Write-Invalidate CC Examples I = invalid (many), S = shared (many), M = modified (only one)
Proc 1
A S
Main Mem A
Proc 2
A I
1. read miss for A
2. read request for A
3. snoop sees read request for
A & lets MM supply A
4. gets A from MM & changes its state
to S
Proc 1
A S
Main Mem A
Proc 2
A I
1. write miss for A
2. writes A & changes its state
to M
Proc 1
A M
Main Mem A
Proc 2
A I
1. read miss for A3. snoop sees read request for A, writes-
back A to MM
2. read request for A
4. gets A from MM & changes its state
to M
3. P2 sends invalidate for A
4. change A state to I
5. P2 sends invalidate for A
6. change A state to I
Proc 1
A M
Main Mem A
Proc 2
A I
1. write miss for A
2. writes A & changes its state
to M
3. P2 sends invalidate for A
4. change A state to I
29
Grid Computing Separate computers interconnected by long-haul
networks E.g., Internet connections Work units farmed out, results sent back
Can make use of idle time on PCs E.g., SETI@home, World Community Grid
30
Multithreading
Performing multiple threads of execution in parallel
Replicate registers, PC, etc. Fast switching between threads
Fine-grain multithreading Switch threads after each cycle Interleave instruction execution If one thread stalls, others are executed
Coarse-grain multithreading Only switch on long stall (e.g., L2-cache miss) Simplifies hardware, but doesn’t hide short stalls
(eg, data hazards)
31
Simultaneous Multithreading In multiple-issue dynamically scheduled processor
Schedule instructions from multiple threads Instructions from independent threads execute when function
units are available Within threads, dependencies handled by scheduling and
register renaming Example: Intel Pentium-4 HT
Two threads: duplicated registers, shared function units and caches
32
Multithreading Example
33
Future of Multithreading Will it survive? In what form? Power considerations simplified microarchitectures
Simpler forms of multithreading Tolerating cache-miss latency
Thread switch may be most effective Multiple simple cores might share resources more
effectively
34
Instruction and Data Streams
An alternate classification
Data StreamsSingle Multiple
Instruction Streams
Single SISD:Intel Pentium 4
SIMD: SSE instructions of x86
Multiple MISD:No examples today
MIMD:Intel Xeon e5345
SPMD: Single Program Multiple Data A parallel program on a MIMD computer Conditional code for different processors