MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM (PART 1)
Chapter 5, Appendix F, Appendix I
CPE731 - Dr. Iyad Jafar
OUTLINE
Introduction (5.1)
Multiprocessor Architecture
Challenges in Parallel Processing
Centralized Shared-Memory Architectures (5.2)
Performance of SMPs (5.3)
INTRODUCTION
[Figure: technology improvements enabled new architectures and organizations (RISC); power and ILP limitations motivated the move to multiprocessors.]
INTRODUCTION
Why multiprocessors?
- Increased cost in silicon and energy of exploiting more ILP
- Increasing the performance of desktops is less important
- Advantage of replication rather than unique design
- Improved understanding of how to use multiprocessors effectively, especially in servers, where there is significant natural parallelism in large data sets, scientific codes, and independent requests
- Growing interest in high-end servers for cloud computing and SaaS
- Growth in data-intensive applications
INTRODUCTION
Multiprocessor: tightly coupled processors whose coordination and usage are typically controlled by a single operating system and that share memory through a shared address space
- 2 to 32 processors
- Single-chip system (multicore) or multiple multicore chips
Multiprocessors exploit thread-level parallelism
- Parallel programming: execute tightly coupled threads that collaborate on a single task
- Request-level parallelism: execute multiple independent processes
- Single program or multiple applications (multiprogramming)
Contrast with multicomputers!
INTRODUCTION
To maximize the advantage of a multiprocessor with n processors, we need n threads
Independent threads are created by the programmer or the operating system
TLP may exploit DLP: a thread may execute some iterations of a loop to exploit data-level parallelism
Grain size must be sufficiently large to compensate for the thread overhead!
MULTIPROCESSOR ARCHITECTURE
Symmetric Shared-Memory Multiprocessors (SMPs)
- Centralized shared-memory multiprocessors
- Small number of cores
- Share a single memory with uniform latency (UMA)
MULTIPROCESSOR ARCHITECTURE
Distributed Shared-Memory Multiprocessors (DSMs)
- Larger number of processors
- Memory distributed among processors
- Non-uniform memory access/latency (NUMA)
- Processors connected via direct (switched) and non-direct (multi-hop) interconnection networks
MULTIPROCESSOR ARCHITECTURE
The term shared memory in both architectures implies that threads communicate with each other through the same address space
- i.e., any processor can reference any memory location as long as it has access rights
In DSM, the distributed memory adds communication complexity and overhead
CHALLENGES
Limited Parallelism in Programs
Example. Suppose you want to achieve a speedup of 80 with 100 processors. What fraction of the original computation can be sequential?
How can this be addressed?
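The answer follows from Amdahl's law; a quick sketch of the algebra in code (my own illustration, not part of the slides):

```python
# Amdahl's law: speedup = 1 / ((1 - f_par) + f_par / n)
# Solve for the parallel fraction needed for a speedup of 80 on 100 processors.

def parallel_fraction_needed(target_speedup, n):
    # Rearranging Amdahl's law:
    # 1/speedup = (1 - f_par) + f_par/n  =>  f_par = (1 - 1/speedup) / (1 - 1/n)
    return (1 - 1 / target_speedup) / (1 - 1 / n)

f_par = parallel_fraction_needed(80, 100)
f_seq = 1 - f_par
print(f"parallel fraction needed:    {f_par:.4%}")   # 99.7475%
print(f"sequential fraction allowed: {f_seq:.4%}")   # 0.2525%
```

Only about 0.25% of the original computation can be sequential: getting near-linear speedup requires almost perfectly parallel code.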
CHALLENGES
Communication Overhead
Example. Suppose we have an application running on a 32-processor multiprocessor, which has a 200 ns time to handle a reference to remote memory. For this application, assume that all references except those involving communication hit in the local memory hierarchy, which is slightly optimistic. Processors are stalled on a remote request, and the processor clock rate is 3.3 GHz. If the base CPI (assuming that all references hit in the cache) is 0.5, how much faster is the multiprocessor if there is no communication versus if 0.2% of the instructions involve a remote communication reference?
CHALLENGES
Communication Overhead Example (solution).
Remote request cost = 200 ns × 3.3 GHz = 660 cycles
CPI_comm = CPI_ideal + remote request rate × remote request cost
         = 0.5 + 0.2% × 660 = 0.5 + 1.32 ≈ 1.8
Speedup = 1.82 / 0.5 ≈ 3.6
The multiprocessor with all local references is about 3.6 times faster
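The arithmetic can be checked directly from the stated parameters; note that 200 ns at 3.3 GHz is 660 stall cycles, so exact figures come out slightly above the commonly quoted rounded ones (a sketch of my own, not from the slides):

```python
# Effect of remote communication on CPI (parameters from the example above).
clock_hz = 3.3e9      # 3.3 GHz processor clock
remote_ns = 200       # time to service a remote memory reference
base_cpi = 0.5        # CPI when all references hit locally
remote_rate = 0.002   # 0.2% of instructions reference remote memory

# Stall cycles per remote reference = remote latency / cycle time.
remote_cost_cycles = remote_ns * 1e-9 * clock_hz     # 660 cycles
cpi_comm = base_cpi + remote_rate * remote_cost_cycles
speedup = cpi_comm / base_cpi   # how much faster the no-communication case is

print(f"remote cost: {remote_cost_cycles:.0f} cycles")   # 660
print(f"CPI with communication: {cpi_comm:.2f}")         # 1.82
print(f"all-local speedup: {speedup:.2f}x")              # 3.64
```

Even with 99.8% of references local, communication more than triples the effective CPI.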
How can this be addressed? In SW and in HW
SMP ARCHITECTURES
[Figure: structure of a centralized shared-memory multiprocessor.]
[Figure: Intel Nehalem (Nov 2008), an example multicore SMP.]
SMPs support caching of private and shared data
- Reduces latency, bandwidth demand, and contention
Caching private data is not a problem: just like a uniprocessor!
Caching shared data raises issues for memory system behavior
- Coherence: what values can be returned by a read
- Consistency: when a written value will be returned by a read
CACHE COHERENCE
[Figure: P1, P2, and P3 with private caches; memory holds X = 5; copies of X (= 5) are cached, then one processor writes X = 8 in its own cache, after which the other processors may still read the stale X = 5.]
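The incoherence in the figure can be reproduced with a toy two-cache model that has no coherence protocol at all (names and structure are my own sketch):

```python
# Toy model of the coherence problem: private write-back caches, no protocol.
memory = {"X": 5}

class Cache:
    def __init__(self, name):
        self.name = name
        self.lines = {}                 # private copies of memory words
    def read(self, addr):
        if addr not in self.lines:      # miss: fetch from memory
            self.lines[addr] = memory[addr]
        return self.lines[addr]
    def write(self, addr, value):
        self.lines[addr] = value        # write-back: memory not updated yet

p1, p3 = Cache("P1"), Cache("P3")
p1.read("X")                            # P1 caches X = 5
p3.read("X")                            # P3 caches X = 5
p3.write("X", 8)                        # P3 updates only its own copy

print(p3.read("X"))   # 8  (P3 sees its new value)
print(p1.read("X"))   # 5  (P1 still sees the stale value!)
print(memory["X"])    # 5  (memory is stale too; write-back has not happened)
```

A coherence protocol exists precisely to prevent the last two reads from returning stale data.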
CACHE COHERENCE
A memory system is coherent if
- Program order is preserved: a read by processor P to location X that follows a write by P to X, with no writes of X by another processor occurring between the write and the read by P, always returns the value written by P
- Coherent view of memory: a read by a processor to location X that follows a write by another processor to X returns the written value if the read and write are sufficiently separated in time and no other writes to X occur between the two accesses
- Write serialization: two writes to the same location by any two processors are seen in the same order by all processors. For example, if the values 1 and then 2 are written to a location, processors can never read the value of the location as 2 and then later read it as 1
BASIC SCHEMES FOR ENFORCING COHERENCE
A program running on multiple processors will have copies of the same data in several caches
In a coherent multiprocessor, caches provide migration and replication
- Migration: move data to a local cache and use it transparently; reduces latency and bandwidth demand
- Replication: copy data to individual caches for simultaneous reads; reduces latency and bus contention
Use a HW protocol to keep caches coherent instead of a SW approach
BASIC SCHEMES FOR ENFORCING COHERENCE
Directory-based Protocols
- The sharing status of a shared block is kept in one (or more) locations, i.e., a directory
- In SMP: a centralized directory
- In DSM: distributed directories
Snooping-based Protocols
- Every cache that has a copy of a shared block keeps track of the sharing status
- In SMP: caches are accessible via some broadcast medium, and each cache monitors (snoops) the medium to determine whether it has a copy of the requested block
- Can be used in a multichip multiprocessor on top of a directory protocol within each multicore
SNOOPING COHERENCE PROTOCOLS
Write-update protocol (broadcast)
- A write to a cached shared item updates all cached copies via the medium
- Less popular; consumes bandwidth!
Write-invalidate protocol
- A write to a shared cached item invalidates all other cached copies (exclusive access)
BASIC IMPLEMENTATION TECHNIQUES
A bus or broadcast medium
- Perform invalidates by acquiring the bus first, then broadcasting the address
- Other processors snoop and check their caches for the broadcast address
- Invalidations by different processors are serialized by bus arbitration
Locating shared items on a miss
- Simple in write-through! Write-back is more difficult!
- However, in write-back, caches can snoop for read requests as well and provide the data if they have it in dirty state
- Write buffers?
BASIC IMPLEMENTATION TECHNIQUES
Tracking state
- Use cache tags, valid and dirty bits to implement snooping
- One bit to track the sharing state of each block
- Exclusive/Modified state: the processor has a modified copy of the block; there is no need to send invalidates on successive writes by the same processor
- Shared: the block is present in more than one private cache
Finite-state controller in each core
- Responds to requests from the core and from the medium
- Changes the state of a cached block: invalid, modified, or shared
EXAMPLE PROTOCOL (INVALIDATE & WB)
[Figure: state-transition diagrams and request/action table for a write-invalidate, write-back snooping protocol.] Why write-back?
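The invalidate-and-write-back behavior can be sketched as a minimal simulator: one block, two caches, an atomic bus. This is a toy of my own for illustration, not the full protocol (real protocols track many blocks, handle races, and split transactions):

```python
# Minimal write-invalidate, write-back snooping protocol (MSI), single block.

class MSICache:
    def __init__(self, name, bus):
        self.name, self.state, self.data = name, "I", None
        self.bus = bus
        bus.caches.append(self)

    def read(self):
        if self.state == "I":                 # read miss: go to the bus
            self.data = self.bus.bus_read(self)
            self.state = "S"
        return self.data

    def write(self, value):
        if self.state != "M":                 # need exclusive access first
            if self.state == "I":
                self.data = self.bus.bus_read(self)
            self.bus.bus_invalidate(self)     # invalidate all other copies
            self.state = "M"
        self.data = value

    # --- snooping side: reactions to other caches' bus transactions ---
    def snoop_read(self):
        if self.state == "M":                 # write dirty data back, share
            self.bus.memory = self.data
            self.state = "S"

    def snoop_invalidate(self):
        self.state = "I"

class Bus:
    def __init__(self, memory):
        self.memory, self.caches = memory, []
    def bus_read(self, requester):
        for c in self.caches:
            if c is not requester:
                c.snoop_read()                # a dirty copy is flushed first
        return self.memory
    def bus_invalidate(self, requester):
        for c in self.caches:
            if c is not requester:
                c.snoop_invalidate()

bus = Bus(memory=5)                           # the block initially holds 5
p1, p2 = MSICache("P1", bus), MSICache("P2", bus)
p1.read(); p2.read()                          # both caches Shared, value 5
p2.write(8)                                   # P2 -> Modified, P1 -> Invalid
print(p1.state, p2.state)                     # I M
print(p1.read())                              # 8: P2 writes back and shares
print(p1.state, p2.state)                     # S S
```

Note how the dirty copy is supplied (and written back) when another cache snoops a read: this is exactly the write-back question posed on the protocol slide.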
EXTENSIONS TO MSI PROTOCOL
The previous protocol is called MSI
Many extensions exist: add states and/or transactions to improve performance
MESI protocol (Intel i7 uses MESIF)
- An Exclusive state is added to indicate that the cache line matches main memory and is the only cached copy
- When the state changes on a read miss by another processor, there is no need to write the block back to memory
MOESI protocol (AMD Opteron)
- MSI and MESI update memory whenever changing the state to Shared
- MOESI adds the Owned state to indicate that a block is owned by that cache and out-of-date in memory
- In MOESI, a block can be changed from Modified to Owned without writing it to memory; the owner should update the block in memory on a miss
LIMITATIONS
Centralized memory can become a bottleneck as the number of processors or their memory demands increase
A high-bandwidth connection to a shared L3 cache has allowed 4 to 8 cores, but it is not likely to scale!
- Multiple buses and interconnection networks such as crossbars or small point-to-point networks
- Banked memory or cache
LIMITATIONS
Snooping bandwidth can become a problem: each processor must examine every miss
Snooping may interfere with cache operation
- Duplicate the cache tags
- Keep a centralized directory in the outermost cache
- These do not eliminate the bottleneck at the bus
PERFORMANCE OF SMPS
Performance is determined by
- Traffic caused by cache misses of the processors
- Traffic due to communication
Both are affected by processor count, cache size, and block size
Coherence adds a fourth C to the 3C miss model
Types of coherence misses
- True sharing misses
- False sharing misses: there is a single valid bit per block, so writing one word invalidates the entire block
PERFORMANCE OF SMPS
Coherence Misses Example
Assume that words x1 and x2 are in the same cache block, which is in the shared state in the caches of both P1 and P2. Assuming the following sequence of events, identify each miss as a true sharing miss, a false sharing miss, or a hit:
1. P1 writes x1
2. P2 reads x2
3. P1 writes x1
4. P2 writes x2
5. P1 reads x2
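One way to make the true/false distinction mechanical is a toy single-block classifier replaying the standard five-event trace of this textbook example. The bookkeeping rules (a miss is true sharing when the accessed word itself was communicated between processors) and all names are my own simplification; it also assumes, per the example's premise, that each processor had earlier read x1:

```python
# Toy classifier for coherence misses on ONE cache block holding words x1, x2.

class CacheBlock:
    def __init__(self):
        self.state = "S"          # MSI state of the block in this cache
        self.touched = {"x1"}     # words used since the block became valid
        self.invalid_since = None # time this cache lost its copy

caches = {"P1": CacheBlock(), "P2": CacheBlock()}
last_write = {}                   # word -> (time, writer)

def access(t, proc, op, word):
    me = caches[proc]
    others = [c for p, c in caches.items() if p != proc]
    if op == "read":
        if me.state != "I":
            me.touched.add(word)
            return "hit"
        # True sharing iff this word was written by another processor after
        # we lost the block; otherwise we miss only because words share it.
        communicated = (word in last_write
                        and last_write[word][0] >= me.invalid_since)
        for o in others:
            if o.state == "M":
                o.state = "S"     # dirty copy downgraded by the bus read
        me.state, me.touched = "S", {word}
        return ("true" if communicated else "false") + " sharing miss"
    else:                         # write
        if me.state == "M":
            me.touched.add(word)
            return "hit"
        # True sharing iff some other valid copy actually used this word.
        communicated = any(word in o.touched for o in others
                           if o.state != "I")
        for o in others:
            o.state, o.invalid_since = "I", t
        if me.state == "I":
            me.touched = set()    # a fresh epoch begins on refill
        me.state = "M"
        me.touched.add(word)
        last_write[word] = (t, proc)
        return ("true" if communicated else "false") + " sharing miss"

trace = [("P1", "write", "x1"), ("P2", "read", "x2"), ("P1", "write", "x1"),
         ("P2", "write", "x2"), ("P1", "read", "x2")]
results = []
for t, (p, op, w) in enumerate(trace, 1):
    r = access(t, p, op, w)
    results.append(r)
    print(t, p, op, w, "->", r)
# Events 1 and 5 come out as true sharing misses; 2, 3, and 4 as false.
```

Only events 1 and 5 move a word one processor actually uses; the middle three misses exist solely because x1 and x2 happen to share a block.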
32
CP
E731 -D
r. Iyad Jafar
PERFORMANCE OF SMPS
A 1998 study
Processor: AlphaServer 4100 with four Alpha 21164 processors, each issuing up to 4 instructions per clock at 300 MHz
Workload:
- TPC-B: online transaction processing (OLTP)
- TPC-D: decision support system (DSS)
- AltaVista: web index search
PERFORMANCE OF SMPS
OLTP has the poorest performance, due to memory hierarchy problems
Consider evaluating OLTP while varying the L3 cache size, the block size, and the number of processors
PERFORMANCE OF SMPS
[Figure: OLTP execution time as L3 cache size grows.] Biggest improvement when moving from 1 to 2 MB L3?
PERFORMANCE OF SMPS
[Figure: memory-cycle breakdown vs. L3 cache size.] Instruction and capacity misses drop, but true sharing, false sharing, and compulsory misses are unaffected!
PERFORMANCE OF SMPS
[Figure: memory-cycle breakdown vs. processor count.] Increase of true sharing misses!
PERFORMANCE OF SMPS
[Figure: memory-cycle breakdown vs. block size.] Larger blocks reduce true sharing misses!