Computer Architecture ELEC3441
Lecture 15 – Multithreading & Multi-Core Processors
Dr. Hayden Kwok-Hay So
Department of Electrical and Electronic Engineering
[Figure: Growth in uniprocessor performance (relative to the VAX-11/780), 1978-2012. Performance grew at 25%/year until 1986, 52%/year from 1986 to 2003, and only 22%/year thereafter, across machines ranging from the VAX-11/780 (5 MHz) through the MIPS M2000, Alpha 21164/21264 workstations, Pentium III/4, AMD Athlon, IBM Power4, and multi-core Intel Core and Xeon processors (up to 6 cores, 3.3 GHz).]
End of an Era …
2nd sem. '16-17 ENGG3441 - HS 2
Limited by power, ILP, and memory speed
Ways to Achieve Parallelism
• Instruction-Level Parallelism (ILP)
  - Parallel operations come from instructions that execute in parallel
  - Dynamic: superscalar processors, out-of-order (OOO) execution
  - Static: VLIW
• Data-Level Parallelism (DLP)
  - Parallel operations come from concurrent operations on independent data
  - Vector machines, SIMD extensions
• Thread-Level Parallelism (TLP)
Multiprocessor Systems on a Chip
• Machines with more than one processor were popular among servers and supercomputers in the '80s and '90s
• Uniprocessor speed improvements came to a halt due to the power wall
• All major processor vendors have moved to multi-core designs
Connecting Cores
[Figure: board-level multi-processor vs. chip multi-processor; cores may be connected through an on-chip network, through shared memory, or by a direct network between CPUs.]
Direct Connections
• Usually in the form of a low-latency, high-throughput, point-to-point network between processors
  - Bypasses the I/O subsystems
• Allows low-latency communication between neighboring processors
  - Sometimes with dedicated machine instructions
• Multi-hop routing for more distant processors
  - Topology of the network plays an important role, e.g. ring, torus, mesh
• Often tied to a distributed memory system
• Often a proprietary design
• Commercial examples:
  - AMD: HyperTransport
  - Intel: QuickPath Interconnect
Network Topology
[Figure: example topologies — ring, mesh, torus.]
On-chip Network
• The study of constructing networks in a system-on-chip
  - A complete computer system on a chip
  - Including graphics, peripheral and memory controllers, accelerators
• MPSoC: multi-processor system on a chip
  - Multiple compute cores in the system
• Mostly proprietary
• Some examples of on-chip interconnects:
  - Advanced Microcontroller Bus Architecture (AMBA): on-chip interconnect developed by ARM
  - Wishbone: OpenCores standard
Shared Memory Cores
• Common topology for commercial multi-core processors
• Various combinations of shared and private cache/memory
[Figure: two organizations — (a) each CPU core has private L1 I$ and D$, with a shared L2$ in front of main memory (e.g. Intel Core, Core 2); (b) each core has private L1 and L2 caches, with a shared L3$ in front of main memory (e.g. Intel Nehalem, Sandy Bridge, Ivy Bridge).]
Symmetric Multiprocessors
• All memory is equally far away from all processors
• Any processor can do any I/O (e.g. set up a DMA transfer)
[Figure: processors and an I/O bridge share a common CPU-memory bus with memory and graphics output; I/O controllers for disks, networks, etc. hang off the I/O bus behind the bridge.]
Synchronization
The need for synchronization arises whenever there are concurrent processes in a system (even in a uniprocessor system). Two classes of synchronization:
• Producer-Consumer: a consumer process must wait until the producer process has produced data
• Mutual Exclusion: ensure that only one process uses a resource at a given time
[Figure: a producer feeding a consumer; processes P1 and P2 contending for a shared resource.]
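To make the mutual-exclusion case concrete, here is a minimal Python sketch (the helper name `bump` is ours, not from the slides): four threads increment a shared counter, and a lock guarantees that only one thread is inside the critical section at a time, so no increment is lost.

```python
import threading

counter = 0
lock = threading.Lock()

def bump(n):
    """Increment the shared counter n times, each inside a critical section."""
    global counter
    for _ in range(n):
        with lock:            # mutual exclusion: one updater at a time
            counter += 1

threads = [threading.Thread(target=bump, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# With the lock held around each increment, counter is exactly 4 * 1000.
```

Without the lock, the read-modify-write of `counter` by two threads can interleave and drop updates; the lock serializes the updates.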
A Producer-Consumer Example
The program is written assuming instructions are executed in order.

Producer posting item x:
    Load Rtail, (tail)
    Store (Rtail), x
    Rtail = Rtail + 1
    Store (tail), Rtail

Consumer:
      Load Rhead, (head)
spin: Load Rtail, (tail)
      if Rhead == Rtail goto spin
      Load R, (Rhead)
      Rhead = Rhead + 1
      Store (head), Rhead
      process(R)

Problems?
A Producer-Consumer Example (continued)

Producer posting item x:
    Load Rtail, (tail)
    Store (Rtail), x        (1)
    Rtail = Rtail + 1
    Store (tail), Rtail     (2)

Consumer:
      Load Rhead, (head)
spin: Load Rtail, (tail)    (3)
      if Rhead == Rtail goto spin
      Load R, (Rhead)       (4)
      Rhead = Rhead + 1
      Store (head), Rhead
      process(R)

Can the tail pointer get updated before the item x is stored?
The programmer assumes that if (3) happens after (2), then (4) happens after (1). Problem sequences are:
    2, 3, 4, 1
    4, 1, 2, 3
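In software, the spin loop and its ordering hazard disappear once a lock makes "store item, then bump tail" a single atomic step. A minimal Python sketch of the same head/tail queue (the class and method names are ours), using a condition variable in place of the spin:

```python
import threading

class SPSCQueue:
    """Producer-consumer queue guarded by a lock: the lock makes
    'store item, then bump tail' atomic, so the consumer can never
    observe the new tail before the item itself is visible."""
    def __init__(self, size=16):
        self.buf = [None] * size
        self.head = 0            # next slot to consume
        self.tail = 0            # next slot to fill
        self.lock = threading.Lock()
        self.nonempty = threading.Condition(self.lock)

    def post(self, x):           # producer
        with self.nonempty:
            self.buf[self.tail % len(self.buf)] = x   # (1) store item
            self.tail += 1                            # (2) bump tail
            self.nonempty.notify()

    def take(self):              # consumer
        with self.nonempty:
            while self.head == self.tail:             # (3) wait, not spin
                self.nonempty.wait()
            x = self.buf[self.head % len(self.buf)]   # (4) load item
            self.head += 1
            return x
```

Because steps (1) and (2) happen under the lock, and (3)/(4) can only run while the lock is free, the problem sequences 2,3,4,1 and 4,1,2,3 cannot occur.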
Sequential Consistency: A Memory Model
"A system is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in the order specified by the program."
    - Leslie Lamport

Sequential consistency = arbitrary order-preserving interleaving of memory references of sequential programs
[Figure: several processors P sharing a single memory M.]
Sequential Consistency
Two concurrent tasks T1, T2; shared variables X, Y (initially X = 0, Y = 10):

T1:                        T2:
Store (X), 1   # X <= 1    Load R1, (Y)
Store (Y), 11  # Y <= 11   Store (Y'), R1  # Y' <= Y
                           Load R2, (X)
                           Store (X'), R2  # X' <= X

What are the legitimate answers for X' and Y'?
(X', Y') ∈ {(1,11), (0,10), (1,10), (0,11)}?
If Y' is 11 then X' cannot be 0.
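The legal (X', Y') pairs under SC can be checked by brute force: enumerate every order-preserving interleaving of T1 and T2 and simulate it against a shared memory. A small Python sketch (the helper names are ours):

```python
from itertools import combinations

# T1 and T2 as lists of operations on (memory, registers), in program order.
T1 = [lambda m, r: m.update(X=1),            # Store (X), 1
      lambda m, r: m.update(Y=11)]           # Store (Y), 11
T2 = [lambda m, r: r.update(R1=m["Y"]),      # Load R1, (Y)
      lambda m, r: m.update(Yp=r["R1"]),     # Store (Y'), R1
      lambda m, r: r.update(R2=m["X"]),      # Load R2, (X)
      lambda m, r: m.update(Xp=r["R2"])]     # Store (X'), R2

def interleavings(a, b):
    """Every merge of a and b that preserves each list's internal order."""
    n = len(a) + len(b)
    for slots in combinations(range(n), len(a)):
        ai, bi = iter(a), iter(b)
        yield [next(ai) if i in slots else next(bi) for i in range(n)]

outcomes = set()
for seq in interleavings(T1, T2):
    mem = {"X": 0, "Y": 10, "Xp": None, "Yp": None}
    regs = {}
    for op in seq:
        op(mem, regs)
    outcomes.add((mem["Xp"], mem["Yp"]))
```

Only three outcomes appear: (0,10), (1,10), (1,11). The pair (0,11) never occurs, because seeing Y' = 11 means T1's Store of X already happened before T2's later Load of X, forcing X' = 1.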
Sequential Consistency
Sequential consistency imposes more memory ordering constraints than those imposed by uniprocessor program dependencies. What are the additional constraints in our example?

T1:                         T2:
Store (X), 1   (X = 1)      Load R1, (Y)
Store (Y), 11  (Y = 11)     Store (Y'), R1  (Y' = Y)
                            Load R2, (X)
                            Store (X'), R2  (X' = X)

(additional SC requirements: orderings between the two processors' accesses, beyond each thread's own dependencies)

Does (can) a system with caches or out-of-order execution capability provide a sequentially consistent view of the memory?
Issues in Implementing Sequential Consistency
Implementation of SC is complicated by two issues:

• Out-of-order execution capability (when may a pair of accesses be reordered without violating single-thread semantics?)
    Load(a); Load(b)     yes
    Load(a); Store(b)    yes if a ≠ b
    Store(a); Load(b)    yes if a ≠ b
    Store(a); Store(b)   yes if a ≠ b

• Caches
    Caches can prevent the effect of a store from being seen by other processors.

No common commercial architecture has a sequentially consistent memory model!
Memory Fences: Instructions to Serialize Memory Accesses
Processors with relaxed or weak memory models (i.e., that permit loads and stores to different addresses to be reordered) need to provide memory fence instructions to force the serialization of memory accesses.

Examples of processors with relaxed memory models:
• SPARC V8 (TSO, PSO): MEMBAR
• SPARC V9 (RMO): MEMBAR #LoadLoad, MEMBAR #LoadStore, MEMBAR #StoreLoad, MEMBAR #StoreStore
• PowerPC (WO): SYNC, EIEIO
• ARM: DMB (Data Memory Barrier)
• x86-64: MFENCE (global memory barrier)

Memory fences are expensive operations; however, one pays the cost of serialization only when it is required.
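The effect of a fence can be modeled in a few lines. Under a toy weak memory model (our own simplification, not any real ISA's semantics), any two accesses from one thread may become visible in either order, but no access may move across a fence. The sketch below shows that a fence between the producer's two stores restores program order:

```python
from itertools import permutations

def hw_orders(prog):
    """All orders a toy weak-memory machine may expose for one thread's
    accesses: any reordering is allowed, except that no access may move
    across a "FENCE" (assume at most one fence in prog)."""
    if "FENCE" not in prog:
        return list(permutations(prog))
    f = prog.index("FENCE")
    before, after = set(prog[:f]), set(prog[f + 1:])
    orders = []
    for perm in permutations(prog):
        fp = perm.index("FENCE")
        if (all(perm.index(op) < fp for op in before) and
                all(perm.index(op) > fp for op in after)):
            orders.append(perm)
    return orders

weak   = hw_orders(("ST item", "ST tail"))
fenced = hw_orders(("ST item", "FENCE", "ST tail"))
# Without the fence, "ST tail" may become visible before "ST item":
# exactly the reordering that breaks the producer-consumer example.
```

With the fence, the only observable order is the program order, at the cost of the serialization the fence enforces.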
Memory Coherence in SMPs
[Figure: CPU-1 and CPU-2 on a CPU-memory bus; cache-1, cache-2, and memory all hold A = 100.]
Suppose CPU-1 updates A to 200.
• write-back: memory and cache-2 have stale values
• write-through: cache-2 has a stale value
Do these stale values matter? What is the view of shared memory for programming?
Write-back Caches & SC

prog T1:       prog T2:
ST X, 1        LD Y, R1
ST Y, 11       ST Y', R1
               LD X, R2
               ST X', R2

Initially memory holds X = 0, Y = 10; X' and Y' are unwritten.

• T1 is executed: cache-1 holds X = 1, Y = 11 (both dirty); memory still holds X = 0, Y = 10
• cache-1 writes back Y: memory now holds X = 0, Y = 11
• T2 is executed: cache-2 reads the written-back Y = 11 but the stale X = 0, so it holds Y = 11, Y' = 11, X = 0, X' = 0
• cache-1 writes back X: memory now holds X = 1, Y = 11
• cache-2 writes back X' & Y': memory ends with X = 1, Y = 11, X' = 0, Y' = 11

The final result (X', Y') = (0, 11) is exactly the outcome sequential consistency forbids.
Write-through Caches & SC

prog T1:       prog T2:
ST X, 1        LD Y, R1
ST Y, 11       ST Y', R1
               LD X, R2
               ST X', R2

• Initially: cache-1 holds X = 0, Y = 10; cache-2 holds X = 0; memory holds X = 0, Y = 10
• T1 is executed: the stores write through, so cache-1 and memory hold X = 1, Y = 11, but cache-2 still holds the stale X = 0
• T2 is executed: cache-2 misses on Y and reads Y = 11 from memory, but hits on its stale X = 0, so Y' = 11 and X' = 0; memory ends with X = 1, Y = 11, X' = 0, Y' = 11

Write-through caches don't preserve sequential consistency either.
Maintaining Cache Coherence
Hardware support is required such that:
• only one processor at a time has write permission for a location
• no processor can load a stale copy of the location after a write
⇒ cache coherence protocols
Cache Coherence vs. Memory Consistency
• A cache coherence protocol ensures that all writes by one processor are eventually visible to other processors, for one memory address
  - i.e., updates are not lost
• A memory consistency model gives the rules on when a write by one processor can be observed by a read on another, across different addresses
  - Equivalently, what values can be seen by a load
• A cache coherence protocol is not enough to ensure sequential consistency
  - But if the system is sequentially consistent, then caches must be coherent
• The combination of the cache coherence protocol and the processor's memory reorder buffer is used to implement a given architecture's memory consistency model
Snoopy Cache (Goodman 1983)
• Idea: have the cache watch (or snoop upon) DMA transfers, and then "do the right thing"
• Snoopy cache tags are dual-ported
[Figure: the processor-side port (address, data, R/W) on the tags and state is used to drive the memory bus when the cache is bus master; a second snoopy read port (address, R/W) attaches the tags and state to the memory bus.]
Shared Memory Multiprocessor
Use the snoopy mechanism to keep all processors' view of memory coherent.
[Figure: processors M1, M2, M3, each behind a snoopy cache, share a memory bus with physical memory and a DMA engine connected to disks.]
Snoopy Cache Coherence Protocols
• Write miss: the address is invalidated in all other caches before the write is performed
• Read miss: if a dirty copy is found in some cache, a write-back is performed before the memory is read
Cache State Transition Diagram: The MSI Protocol
Each cache line has state bits (M: Modified, S: Shared, I: Invalid) alongside its address tag.

Transitions for the cache state in processor P1:
• I → M: write miss (P1 gets the line from memory)
• I → S: read miss (P1 gets the line from memory)
• S → M: P1 intends to write (other copies are invalidated)
• M → S: other processor reads (P1 writes back)
• M → I: other processor intends to write (P1 writes back)
• S → I: other processor intends to write
• M stays M: P1 reads or writes
• S stays S: read by any processor
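The MSI transitions can be captured in a few lines of Python. This is a toy model at cache-line granularity (the class and method names are ours) in which a bus broadcast performs the invalidations and write-backs:

```python
class MSIBus:
    """Toy snoopy MSI model: caches[p] maps a line address to 'M' or 'S';
    an absent entry means the line is Invalid in that cache."""
    def __init__(self, nprocs):
        self.caches = [dict() for _ in range(nprocs)]

    def state(self, p, line):
        return self.caches[p].get(line, "I")

    def read(self, p, line):
        if self.state(p, line) == "I":          # read miss
            for q, c in enumerate(self.caches):
                if q != p and c.get(line) == "M":
                    c[line] = "S"               # dirty copy written back, M -> S
            self.caches[p][line] = "S"
        # M or S: the read hits locally, no state change

    def write(self, p, line):
        if self.state(p, line) != "M":          # write miss or intent to write
            for q, c in enumerate(self.caches):
                if q != p:
                    c.pop(line, None)           # invalidate all other copies
            self.caches[p][line] = "M"
```

Tracing P1 write, P2 read, P2 write reproduces the diagram: P1's copy goes I → M → S → I while P2's goes I → S → M, and a line in M is never present in any other cache.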
Two-Processor Example (Reading and Writing the Same Cache Line)

Each processor follows the MSI diagram, and one processor's bus requests drive the other's transitions:
• P1's state machine: I → M on P1's write miss; I → S on P1's read miss; S → M on P1's intent to write; M → S when P2 reads (P1 writes back); M/S → I on P2's intent to write; M stays M while P1 reads or writes.
• P2's state machine is symmetric, with the roles of P1 and P2 swapped.

Example access sequence to trace through the diagrams:
    P1 reads, P1 writes, P2 reads, P2 writes,
    P1 writes, P2 writes,
    P1 reads, P1 writes
Observation
• If a line is in the M state then no other cache can have a copy of the line!
• Memory stays coherent; multiple differing copies cannot exist.
MESI: An Enhanced MSI Protocol
Increased performance for private data. Each cache line has an address tag and state bits:
    M: Modified Exclusive
    E: Exclusive but unmodified
    S: Shared
    I: Invalid

Transitions for the cache state in processor P1:
• I → M: write miss
• I → E: read miss, not shared
• I → S: read miss, shared
• E → M: P1 write (no bus transaction needed, the key gain over MSI)
• E → S: other processor reads
• S → M: P1 intends to write
• M → S: other processor reads (P1 writes back)
• M → I: other processor intends to write (P1 writes back)
• E → I, S → I: other processor intends to write
• M stays M on P1 read or write; E stays E on P1 read; S stays S on a read by any processor
Optimized Snoop with Level-2 Caches
[Figure: four CPUs, each with a private L1$ and L2$; a snooper sits on the bus below each L2$.]
• Processors often have two-level caches: small L1, large L2 (usually both on chip now)
• Inclusion property: entries in L1 must also be in L2
  - invalidation in L2 ⇒ invalidation in L1
• Snooping on L2 does not affect CPU-L1 bandwidth
What problem could occur?
Intervention
[Figure: CPU-1's cache holds A = 200 (modified); memory still holds the stale value A = 100; CPU-2's cache does not hold A.]
When a read miss for A occurs in cache-2, a read request for A is placed on the bus:
• Cache-1 needs to supply the data and change its state to shared
• The memory may respond to the request as well!
Does memory know it has stale data? Cache-1 needs to intervene through the memory controller to supply the correct data to cache-2.
False Sharing

    state | line addr | data0 | data1 | ... | dataN

A cache line contains more than one word. Cache coherence is done at the line level, not the word level. Suppose M1 writes word_i and M2 writes word_k, and both words have the same line address. What can happen?
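The line-granularity effect can be quantified with a toy model (the helper is ours, and the 64-byte line size is an assumed typical value): count how often a write forces the line out of the other processor's cache. Two processors writing different words of the same line ping-pong ownership on every access; padding the data so each writes its own line eliminates the traffic.

```python
LINE_SIZE = 64  # bytes; a typical value, assumed here

def invalidations(writes):
    """writes: sequence of (proc, byte_address). Coherence acts per line:
    a write makes the writer the line's exclusive owner, invalidating any
    other processor's copy. Returns the number of invalidations."""
    owner = {}                       # line address -> current exclusive owner
    inv = 0
    for p, addr in writes:
        line = addr // LINE_SIZE
        if line in owner and owner[line] != p:
            inv += 1                 # the other processor's copy is invalidated
        owner[line] = p
    return inv

# M1 writes word_i (offset 0), M2 writes word_k (offset 8): same line,
# so every alternating write invalidates the other cache's copy.
same_line = [(0, 0), (1, 8)] * 4
# Padding so each processor writes its own line: no coherence traffic.
padded    = [(0, 0), (1, 64)] * 4
```

Even though the two processors never touch the same word, the shared line behaves as if they did; this is why performance-critical per-thread data is often padded to cache-line boundaries.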
Out-of-Order Loads/Stores & CC
• Blocking caches: one request at a time + CC ⇒ SC
• Non-blocking caches: multiple requests (to different addresses) concurrently + CC ⇒ relaxed memory models
• CC ensures that all processors observe the same order of loads and stores to a given address
[Figure: a CPU with load/store buffers above its cache; a snooper connects the cache to the CPU/memory interface, exchanging shared/exclusive requests and replies (S-req, E-req, S-rep, E-rep), write-back and invalidate traffic (Wb-req, Inv-req, Inv-rep), and pushout (Wb-rep); lines carry I/S/E state.]
36
Acknowledgements n These slides contain material developed and
copyright by: • Arvind (MIT) • Krste Asanovic (MIT/UCB) • Joel Emer (Intel/MIT) • James Hoe (CMU) • John Kubiatowicz (UCB) • David Patterson (UCB) • John Lazzaro (UCB)
n MIT material derived from course 6.823 n UCB material derived from course CS152,
CS252
2nd sem. '16-17 ENGG3441 - HS