Computer Architecture ELEC3441
Lecture 15 – Multithreading & Multi-Core Processors
Dr. Hayden Kwok-Hay So
Department of Electrical and Electronic Engineering
[Figure: Growth in uniprocessor performance (relative to the VAX-11/780), 1978-2012. Performance grew at 25%/year until 1986, 52%/year from 1986 to 2003, and only 22%/year thereafter, across machines ranging from the VAX-11/780 (5 MHz) through the MIPS M2000, Alpha 21164/21264 workstations, Pentium III/4, AMD Athlon, IBM Power4, and multi-core Intel Core and Xeon processors (up to 6 cores, 3.3 GHz).]
End of an Era …
2nd sem. '16-17 ENGG3441 - HS 2
Limited by power, ILP, and memory speed
Ways to Achieve Parallelism
• Instruction-Level Parallelism (ILP)
  - Parallel operations come from instructions that execute in parallel
  - Dynamic: superscalar processors, out-of-order (OOO) execution
  - Static: VLIW
• Data-Level Parallelism (DLP)
  - Parallel operations come from concurrent operations on independent data
  - Vector machines, SIMD extensions
• Thread-Level Parallelism (TLP)
Multiprocessor Systems on a Chip
• Machines with more than one processor were popular among servers and supercomputers in the '80s and '90s
• Uniprocessor speed improvements came to a halt due to the power wall
• All major processor vendors have moved to multi-core designs
Connecting Cores
[Figure: board-level multi-processor vs. chip multi-processor; cores may be connected through an on-chip network, through shared memory, or by a direct network between CPUs.]
Direct Connections
• Usually in the form of a low-latency, high-throughput, point-to-point network between processors
  - Bypasses the I/O subsystems
• Allows low-latency communication between neighboring processors
  - Sometimes with dedicated machine instructions
• Multi-hop routing for more distant processors
  - Topology of the network plays an important role, e.g. ring, torus, mesh
• Often tied to a distributed memory system
• Often a proprietary design
• Commercial examples:
  - AMD: HyperTransport
  - Intel: QuickPath Interconnect
Network Topology
[Figure: example topologies — ring, mesh, torus.]
On-chip Network
• The study of constructing networks in a system-on-chip
  - A complete computer system on a chip
  - Including graphics, peripheral and memory controllers, accelerators
• MPSoC: multi-processor system on a chip
  - Multiple compute cores in the system
• Mostly proprietary
• Some examples of on-chip interconnects:
  - Advanced Microcontroller Bus Architecture (AMBA): on-chip interconnect developed by ARM
  - Wishbone: OpenCores standard
Shared Memory Cores
• Common topology for commercial multi-core processors
• Various combinations of shared and private cache/memory
[Figure: two organizations — (a) each CPU core has private L1 I$ and D$, with a shared L2$ in front of main memory (e.g. Intel Core, Core 2); (b) each core has private L1 and L2 caches, with a shared L3$ in front of main memory (e.g. Intel Nehalem, Sandy Bridge, Ivy Bridge).]
Symmetric Multiprocessors
• All memory is equally far away from all processors
• Any processor can do any I/O (e.g. set up a DMA transfer)
[Figure: processors and an I/O bridge share a common CPU-memory bus with memory and graphics output; I/O controllers for disks, networks, etc. hang off the I/O bus behind the bridge.]
Synchronization
The need for synchronization arises whenever there are concurrent processes in a system (even in a uniprocessor system). Two classes of synchronization:
• Producer-Consumer: a consumer process must wait until the producer process has produced data
• Mutual Exclusion: ensure that only one process uses a resource at a given time
[Figure: a producer feeding a consumer; processes P1 and P2 contending for a shared resource.]
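To make the mutual-exclusion case concrete, here is a minimal Python sketch (the helper name `bump` is ours, not from the slides): four threads increment a shared counter, and a lock guarantees that only one thread is inside the critical section at a time, so no increment is lost.

```python
import threading

counter = 0
lock = threading.Lock()

def bump(n):
    """Increment the shared counter n times, each inside a critical section."""
    global counter
    for _ in range(n):
        with lock:            # mutual exclusion: one updater at a time
            counter += 1

threads = [threading.Thread(target=bump, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# With the lock held around each increment, counter is exactly 4 * 1000.
```

Without the lock, the read-modify-write of `counter` by two threads can interleave and drop updates; the lock serializes the updates.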
A Producer-Consumer Example
The program is written assuming instructions are executed in order.

Producer posting item x:
    Load Rtail, (tail)
    Store (Rtail), x
    Rtail = Rtail + 1
    Store (tail), Rtail

Consumer:
      Load Rhead, (head)
spin: Load Rtail, (tail)
      if Rhead == Rtail goto spin
      Load R, (Rhead)
      Rhead = Rhead + 1
      Store (head), Rhead
      process(R)

Problems?
A Producer-Consumer Example (continued)

Producer posting item x:
    Load Rtail, (tail)
    Store (Rtail), x        (1)
    Rtail = Rtail + 1
    Store (tail), Rtail     (2)

Consumer:
      Load Rhead, (head)
spin: Load Rtail, (tail)    (3)
      if Rhead == Rtail goto spin
      Load R, (Rhead)       (4)
      Rhead = Rhead + 1
      Store (head), Rhead
      process(R)

Can the tail pointer get updated before the item x is stored?
The programmer assumes that if (3) happens after (2), then (4) happens after (1). Problem sequences are:
    2, 3, 4, 1
    4, 1, 2, 3
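In software, the spin loop and its ordering hazard disappear once a lock makes "store item, then bump tail" a single atomic step. A minimal Python sketch of the same head/tail queue (the class and method names are ours), using a condition variable in place of the spin:

```python
import threading

class SPSCQueue:
    """Producer-consumer queue guarded by a lock: the lock makes
    'store item, then bump tail' atomic, so the consumer can never
    observe the new tail before the item itself is visible."""
    def __init__(self, size=16):
        self.buf = [None] * size
        self.head = 0            # next slot to consume
        self.tail = 0            # next slot to fill
        self.lock = threading.Lock()
        self.nonempty = threading.Condition(self.lock)

    def post(self, x):           # producer
        with self.nonempty:
            self.buf[self.tail % len(self.buf)] = x   # (1) store item
            self.tail += 1                            # (2) bump tail
            self.nonempty.notify()

    def take(self):              # consumer
        with self.nonempty:
            while self.head == self.tail:             # (3) wait, not spin
                self.nonempty.wait()
            x = self.buf[self.head % len(self.buf)]   # (4) load item
            self.head += 1
            return x
```

Because steps (1) and (2) happen under the lock, and (3)/(4) can only run while the lock is free, the problem sequences 2,3,4,1 and 4,1,2,3 cannot occur.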
Sequential Consistency: A Memory Model
"A system is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in the order specified by the program."
    - Leslie Lamport

Sequential consistency = arbitrary order-preserving interleaving of memory references of sequential programs
[Figure: several processors P sharing a single memory M.]
Sequential Consistency
Two concurrent tasks T1, T2; shared variables X, Y (initially X = 0, Y = 10):

T1:                        T2:
Store (X), 1   # X <= 1    Load R1, (Y)
Store (Y), 11  # Y <= 11   Store (Y'), R1  # Y' <= Y
                           Load R2, (X)
                           Store (X'), R2  # X' <= X

What are the legitimate answers for X' and Y'?
(X', Y') ∈ {(1,11), (0,10), (1,10), (0,11)}?
If Y' is 11 then X' cannot be 0.
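The legal (X', Y') pairs under SC can be checked by brute force: enumerate every order-preserving interleaving of T1 and T2 and simulate it against a shared memory. A small Python sketch (the helper names are ours):

```python
from itertools import combinations

# T1 and T2 as lists of operations on (memory, registers), in program order.
T1 = [lambda m, r: m.update(X=1),            # Store (X), 1
      lambda m, r: m.update(Y=11)]           # Store (Y), 11
T2 = [lambda m, r: r.update(R1=m["Y"]),      # Load R1, (Y)
      lambda m, r: m.update(Yp=r["R1"]),     # Store (Y'), R1
      lambda m, r: r.update(R2=m["X"]),      # Load R2, (X)
      lambda m, r: m.update(Xp=r["R2"])]     # Store (X'), R2

def interleavings(a, b):
    """Every merge of a and b that preserves each list's internal order."""
    n = len(a) + len(b)
    for slots in combinations(range(n), len(a)):
        ai, bi = iter(a), iter(b)
        yield [next(ai) if i in slots else next(bi) for i in range(n)]

outcomes = set()
for seq in interleavings(T1, T2):
    mem = {"X": 0, "Y": 10, "Xp": None, "Yp": None}
    regs = {}
    for op in seq:
        op(mem, regs)
    outcomes.add((mem["Xp"], mem["Yp"]))
```

Only three outcomes appear: (0,10), (1,10), (1,11). The pair (0,11) never occurs, because seeing Y' = 11 means T1's Store of X already happened before T2's later Load of X, forcing X' = 1.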
Sequential Consistency
Sequential consistency imposes more memory ordering constraints than those imposed by uniprocessor program dependencies. What are the additional constraints in our example?

T1:                         T2:
Store (X), 1   (X = 1)      Load R1, (Y)
Store (Y), 11  (Y = 11)     Store (Y'), R1  (Y' = Y)
                            Load R2, (X)
                            Store (X'), R2  (X' = X)

(additional SC requirements: orderings between the two processors' accesses, beyond each thread's own dependencies)

Does (can) a system with caches or out-of-order execution capability provide a sequentially consistent view of the memory?
Issues in Implementing Sequential Consistency
Implementation of SC is complicated by two issues:

• Out-of-order execution capability (when may a pair of accesses be reordered without violating single-thread semantics?)
    Load(a); Load(b)     yes
    Load(a); Store(b)    yes if a ≠ b
    Store(a); Load(b)    yes if a ≠ b
    Store(a); Store(b)   yes if a ≠ b

• Caches
    Caches can prevent the effect of a store from being seen by other processors.

No common commercial architecture has a sequentially consistent memory model!
Memory Fences: Instructions to Serialize Memory Accesses
Processors with relaxed or weak memory models (i.e., that permit loads and stores to different addresses to be reordered) need to provide memory fence instructions to force the serialization of memory accesses.

Examples of processors with relaxed memory models:
• SPARC V8 (TSO, PSO): MEMBAR
• SPARC V9 (RMO): MEMBAR #LoadLoad, MEMBAR #LoadStore, MEMBAR #StoreLoad, MEMBAR #StoreStore
• PowerPC (WO): SYNC, EIEIO
• ARM: DMB (Data Memory Barrier)
• x86-64: MFENCE (global memory barrier)

Memory fences are expensive operations; however, one pays the cost of serialization only when it is required.
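The effect of a fence can be modeled in a few lines. Under a toy weak memory model (our own simplification, not any real ISA's semantics), any two accesses from one thread may become visible in either order, but no access may move across a fence. The sketch below shows that a fence between the producer's two stores restores program order:

```python
from itertools import permutations

def hw_orders(prog):
    """All orders a toy weak-memory machine may expose for one thread's
    accesses: any reordering is allowed, except that no access may move
    across a "FENCE" (assume at most one fence in prog)."""
    if "FENCE" not in prog:
        return list(permutations(prog))
    f = prog.index("FENCE")
    before, after = set(prog[:f]), set(prog[f + 1:])
    orders = []
    for perm in permutations(prog):
        fp = perm.index("FENCE")
        if (all(perm.index(op) < fp for op in before) and
                all(perm.index(op) > fp for op in after)):
            orders.append(perm)
    return orders

weak   = hw_orders(("ST item", "ST tail"))
fenced = hw_orders(("ST item", "FENCE", "ST tail"))
# Without the fence, "ST tail" may become visible before "ST item":
# exactly the reordering that breaks the producer-consumer example.
```

With the fence, the only observable order is the program order, at the cost of the serialization the fence enforces.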
Memory Coherence in SMPs
[Figure: CPU-1 and CPU-2 on a CPU-memory bus; cache-1, cache-2, and memory all hold A = 100.]
Suppose CPU-1 updates A to 200.
• write-back: memory and cache-2 have stale values
• write-through: cache-2 has a stale value
Do these stale values matter? What is the view of shared memory for programming?
Write-back Caches & SC

prog T1:       prog T2:
ST X, 1        LD Y, R1
ST Y, 11       ST Y', R1
               LD X, R2
               ST X', R2

Initially memory holds X = 0, Y = 10; X' and Y' are unwritten.

• T1 is executed: cache-1 holds X = 1, Y = 11 (both dirty); memory still holds X = 0, Y = 10
• cache-1 writes back Y: memory now holds X = 0, Y = 11
• T2 is executed: cache-2 reads the written-back Y = 11 but the stale X = 0, so it holds Y = 11, Y' = 11, X = 0, X' = 0
• cache-1 writes back X: memory now holds X = 1, Y = 11
• cache-2 writes back X' & Y': memory ends with X = 1, Y = 11, X' = 0, Y' = 11

The final result (X', Y') = (0, 11) is exactly the outcome sequential consistency forbids.
Write-through Caches & SC

prog T1:       prog T2:
ST X, 1        LD Y, R1
ST Y, 11       ST Y', R1
               LD X, R2
               ST X', R2

• Initially: cache-1 holds X = 0, Y = 10; cache-2 holds X = 0; memory holds X = 0, Y = 10
• T1 is executed: the stores write through, so cache-1 and memory hold X = 1, Y = 11, but cache-2 still holds the stale X = 0
• T2 is executed: cache-2 misses on Y and reads Y = 11 from memory, but hits on its stale X = 0, so Y' = 11 and X' = 0; memory ends with X = 1, Y = 11, X' = 0, Y' = 11

Write-through caches don't preserve sequential consistency either.
Maintaining Cache Coherence
Hardware support is required such that:
• only one processor at a time has write permission for a location
• no processor can load a stale copy of the location after a write
⇒ cache coherence protocols
Cache Coherence vs. Memory Consistency
• A cache coherence protocol ensures that all writes by one processor are eventually visible to other processors, for one memory address
  - i.e., updates are not lost
• A memory consistency model gives the rules on when a write by one processor can be observed by a read on another, across different addresses
  - Equivalently, what values can be seen by a load
• A cache coherence protocol is not enough to ensure sequential consistency
  - But if the system is sequentially consistent, then caches must be coherent
• The combination of the cache coherence protocol and the processor's memory reorder buffer is used to implement a given architecture's memory consistency model
Snoopy Cache (Goodman 1983)
• Idea: have the cache watch (or snoop upon) DMA transfers, and then "do the right thing"
• Snoopy cache tags are dual-ported
[Figure: the processor-side port (address, data, R/W) on the tags and state is used to drive the memory bus when the cache is bus master; a second snoopy read port (address, R/W) attaches the tags and state to the memory bus.]
Shared Memory Multiprocessor
Use the snoopy mechanism to keep all processors' view of memory coherent.
[Figure: processors M1, M2, M3, each behind a snoopy cache, share a memory bus with physical memory and a DMA engine connected to disks.]
Snoopy Cache Coherence Protocols
• Write miss: the address is invalidated in all other caches before the write is performed
• Read miss: if a dirty copy is found in some cache, a write-back is performed before the memory is read
Cache State Transition Diagram: The MSI Protocol
Each cache line has state bits (M: Modified, S: Shared, I: Invalid) alongside its address tag.

Transitions for the cache state in processor P1:
• I → M: write miss (P1 gets the line from memory)
• I → S: read miss (P1 gets the line from memory)
• S → M: P1 intends to write (other copies are invalidated)
• M → S: other processor reads (P1 writes back)
• M → I: other processor intends to write (P1 writes back)
• S → I: other processor intends to write
• M stays M: P1 reads or writes
• S stays S: read by any processor
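The MSI transitions can be captured in a few lines of Python. This is a toy model at cache-line granularity (the class and method names are ours) in which a bus broadcast performs the invalidations and write-backs:

```python
class MSIBus:
    """Toy snoopy MSI model: caches[p] maps a line address to 'M' or 'S';
    an absent entry means the line is Invalid in that cache."""
    def __init__(self, nprocs):
        self.caches = [dict() for _ in range(nprocs)]

    def state(self, p, line):
        return self.caches[p].get(line, "I")

    def read(self, p, line):
        if self.state(p, line) == "I":          # read miss
            for q, c in enumerate(self.caches):
                if q != p and c.get(line) == "M":
                    c[line] = "S"               # dirty copy written back, M -> S
            self.caches[p][line] = "S"
        # M or S: the read hits locally, no state change

    def write(self, p, line):
        if self.state(p, line) != "M":          # write miss or intent to write
            for q, c in enumerate(self.caches):
                if q != p:
                    c.pop(line, None)           # invalidate all other copies
            self.caches[p][line] = "M"
```

Tracing P1 write, P2 read, P2 write reproduces the diagram: P1's copy goes I → M → S → I while P2's goes I → S → M, and a line in M is never present in any other cache.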
Two-Processor Example (Reading and Writing the Same Cache Line)

Each processor follows the MSI diagram, and one processor's bus requests drive the other's transitions:
• P1's state machine: I → M on P1's write miss; I → S on P1's read miss; S → M on P1's intent to write; M → S when P2 reads (P1 writes back); M/S → I on P2's intent to write; M stays M while P1 reads or writes.
• P2's state machine is symmetric, with the roles of P1 and P2 swapped.

Example access sequence to trace through the diagrams:
    P1 reads, P1 writes, P2 reads, P2 writes,
    P1 writes, P2 writes,
    P1 reads, P1 writes
Observation
• If a line is in the M state then no other cache can have a copy of the line!
• Memory stays coherent; multiple differing copies cannot exist.
MESI: An Enhanced MSI Protocol
Increased performance for private data. Each cache line has an address tag and state bits:
    M: Modified Exclusive
    E: Exclusive but unmodified
    S: Shared
    I: Invalid

Transitions for the cache state in processor P1:
• I → M: write miss
• I → E: read miss, not shared
• I → S: read miss, shared
• E → M: P1 write (no bus transaction needed, the key gain over MSI)
• E → S: other processor reads
• S → M: P1 intends to write
• M → S: other processor reads (P1 writes back)
• M → I: other processor intends to write (P1 writes back)
• E → I, S → I: other processor intends to write
• M stays M on P1 read or write; E stays E on P1 read; S stays S on a read by any processor
Optimized Snoop with Level-2 Caches
[Figure: four CPUs, each with a private L1$ and L2$; a snooper sits on the bus below each L2$.]
• Processors often have two-level caches: small L1, large L2 (usually both on chip now)
• Inclusion property: entries in L1 must also be in L2
  - invalidation in L2 ⇒ invalidation in L1
• Snooping on L2 does not affect CPU-L1 bandwidth
What problem could occur?
Intervention
[Figure: CPU-1's cache holds A = 200 (modified); memory still holds the stale value A = 100; CPU-2's cache does not hold A.]
When a read miss for A occurs in cache-2, a read request for A is placed on the bus:
• Cache-1 needs to supply the data and change its state to shared
• The memory may respond to the request as well!
Does memory know it has stale data? Cache-1 needs to intervene through the memory controller to supply the correct data to cache-2.
False Sharing

    state | line addr | data0 | data1 | ... | dataN

A cache line contains more than one word. Cache coherence is done at the line level, not the word level. Suppose M1 writes word_i and M2 writes word_k, and both words have the same line address. What can happen?
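The line-granularity effect can be quantified with a toy model (the helper is ours, and the 64-byte line size is an assumed typical value): count how often a write forces the line out of the other processor's cache. Two processors writing different words of the same line ping-pong ownership on every access; padding the data so each writes its own line eliminates the traffic.

```python
LINE_SIZE = 64  # bytes; a typical value, assumed here

def invalidations(writes):
    """writes: sequence of (proc, byte_address). Coherence acts per line:
    a write makes the writer the line's exclusive owner, invalidating any
    other processor's copy. Returns the number of invalidations."""
    owner = {}                       # line address -> current exclusive owner
    inv = 0
    for p, addr in writes:
        line = addr // LINE_SIZE
        if line in owner and owner[line] != p:
            inv += 1                 # the other processor's copy is invalidated
        owner[line] = p
    return inv

# M1 writes word_i (offset 0), M2 writes word_k (offset 8): same line,
# so every alternating write invalidates the other cache's copy.
same_line = [(0, 0), (1, 8)] * 4
# Padding so each processor writes its own line: no coherence traffic.
padded    = [(0, 0), (1, 64)] * 4
```

Even though the two processors never touch the same word, the shared line behaves as if they did; this is why performance-critical per-thread data is often padded to cache-line boundaries.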
Out-of-Order Loads/Stores & CC
• Blocking caches: one request at a time + CC ⇒ SC
• Non-blocking caches: multiple requests (to different addresses) concurrently + CC ⇒ relaxed memory models
• CC ensures that all processors observe the same order of loads and stores to a given address
[Figure: a CPU with load/store buffers above its cache; a snooper connects the cache to the CPU/memory interface, exchanging shared/exclusive requests and replies (S-req, E-req, S-rep, E-rep), write-back and invalidate traffic (Wb-req, Inv-req, Inv-rep), and pushout (Wb-rep); lines carry I/S/E state.]
36
Acknowledgements n These slides contain material developed and
copyright by: • Arvind (MIT) • Krste Asanovic (MIT/UCB) • Joel Emer (Intel/MIT) • James Hoe (CMU) • John Kubiatowicz (UCB) • David Patterson (UCB) • John Lazzaro (UCB)
n MIT material derived from course 6.823 n UCB material derived from course CS152,
CS252
2nd sem. '16-17 ENGG3441 - HS