
© Wenisch 2007 -- Portions © Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar

EECS 470

Lecture 24

Chip Multiprocessors and Simultaneous Multithreading

Fall 2007

Prof. Thomas Wenisch

http://www.eecs.umich.edu/courses/eecs470

Slides developed in part by Profs. Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, and Vijaykumar of Carnegie Mellon University, Purdue University, University of Pennsylvania, and University of Wisconsin.

Lecture 24 Slide 1, EECS 470


Announcements

• HW 6 posted, due 12/7

• Project due 12/10; in-class presentations (~8 minutes + questions)


Base Snoopy Organization

[Figure: block diagram of a snoopy cache organization. A processor-side controller and a bus-side (snoop) controller share the cache data RAM, with separate copies of the tags and state for the processor and for the snoop, tag comparators, a write-back buffer, and addr/cmd/data/snoop-state connections to the system bus.]
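The bus-side path in the figure can be sketched in software: on every bus transaction, the snoop controller consults its duplicate copy of the tags and state and decides what, if anything, to put back on the bus. A minimal illustrative sketch, assuming only an MSI subset (the class and method names are hypothetical, not from any real design):

```python
# Hypothetical sketch of the bus-side snoop path: duplicate tags/state are
# consulted on every bus transaction, and the controller replies with an
# action (flush dirty data, invalidate, or nothing). MSI subset only.

class SnoopTags:
    def __init__(self):
        self.state = {}  # addr -> 'M' or 'S'; absent means Invalid

    def snoop(self, cmd, addr):
        st = self.state.get(addr)
        if st is None:
            return "no-action"           # most snoops hit nothing
        if cmd == "BusRd":
            if st == "M":                # another cache reads our dirty copy
                self.state[addr] = "S"
                return "flush"           # supply data, downgrade to Shared
            return "no-action"
        if cmd == "BusRdX":              # another cache wants ownership
            del self.state[addr]         # invalidate our copy
            return "flush" if st == "M" else "invalidate"
        return "no-action"

tags = SnoopTags()
tags.state[0x40] = "M"
print(tags.snoop("BusRd", 0x40))   # flush: dirty copy supplied, downgraded
print(tags.state[0x40])            # S
print(tags.snoop("BusRdX", 0x40))  # invalidate: exclusive request observed
```

The duplicate tags exist precisely so this lookup does not steal the processor-side tag port on every bus transaction.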


Non-Atomic State Transitions

Operations involve multiple actions:
Look up cache tags
Bus arbitration
Check for writeback
Even if the bus is atomic, the overall set of actions is not
Race conditions among multiple operations:
Suppose P1 and P2 attempt to write cached block A
Each decides to issue BusUpgr to allow S → M

Issues:
Handle requests for other blocks while waiting to acquire the bus
Must handle requests for this block A
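The S → M race above can be made concrete. In this illustrative sketch (hypothetical names, not from any real controller), the cache that loses bus arbitration observes the winner's BusUpgr while still waiting in its transient state, invalidates its copy, and must convert its pending upgrade into a full BusRdX, since it no longer holds valid data:

```python
# Two caches both hold block A in S and both try to upgrade to M.
# Whichever wins bus arbitration issues BusUpgr first; the loser, still
# waiting in the transient S->M state, sees that BusUpgr on the bus,
# invalidates its copy, and must reissue as BusRdX (it needs data now).

class Cache:
    def __init__(self, name):
        self.name = name
        self.state = "S"
        self.pending = "BusUpgr"   # transaction planned for our write

    def observe(self, cmd, issuer):
        if issuer is self:
            return                 # ignore our own transaction
        if cmd in ("BusUpgr", "BusRdX"):
            self.state = "I"       # our S copy is now stale
            if self.pending == "BusUpgr":
                self.pending = "BusRdX"   # need data, not just permission

p1, p2 = Cache("P1"), Cache("P2")

# P1 wins arbitration and issues its upgrade first; both caches observe it.
for c in (p1, p2):
    c.observe("BusUpgr", issuer=p1)
p1.state, p1.pending = "M", None

print(p2.state)    # I      -- P2 lost the race
print(p2.pending)  # BusRdX -- must fetch data, not just upgrade
```

This is exactly why the protocol needs transient states: the decision P2 made while in S is stale by the time it reaches the bus.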


Non-Atomicity → Transient States

Two types of states:
• Stable (e.g., MESI)
• Transient or intermediate (e.g., S → M, I → M, I → S,E while waiting for a bus grant)
Increases complexity

[Figure: MESI state diagram augmented with transient states. Transitions are labeled with processor actions (PrRd, PrWr) and the bus transactions they generate or observe (BusRd, BusRdX, BusUpgr, BusReq, BusGrant, Flush, Flush′).]


Scalability Problems of Snoopy Coherence

• Prohibitive bus bandwidth
Required bandwidth grows with # CPUs…
… but available bandwidth per bus is fixed
Adding busses makes serialization/ordering hard

• Prohibitive processor snooping bandwidth
All caches do a tag lookup when ANY processor accesses memory
Inclusion limits this to the L2, but still lots of lookups

• Upshot: bus-based coherence doesn't scale beyond 8–16 CPUs


Scalable Cache Coherence
• Scalable cache coherence: two-part solution

• Part I: bus bandwidth
Replace the non-scalable bandwidth substrate (bus)…
… with a scalable-bandwidth one (point-to-point network, e.g., mesh)

• Part II: processor snooping bandwidth
Interesting: most snoops result in no action
Replace the non-scalable broadcast protocol (spam everyone)…
… with a scalable directory protocol (only spam processors that care)


Directory Coherence Protocols
• Observe: physical address space statically partitioned
+ Can easily determine which memory module holds a given line
That memory module is sometimes called the "home"
– Can't easily determine which processors have the line in their caches
Bus-based protocol: broadcast events to all processors/caches
± Simple and fast, but non-scalable

• Directories: non-broadcast coherence protocol
Extend memory to track caching information
For each physical cache line whose home this is, track:
Owner: which processor has a dirty copy (i.e., M state)
Sharers: which processors have clean copies (i.e., S state)
Processor sends coherence event to home directory
Home directory only sends events to processors that care
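The owner/sharers bookkeeping can be sketched as a per-line directory entry. This is an illustrative sketch with hypothetical names, covering only the read and write cases from the slides; real protocols also handle evictions, races, and transient states:

```python
# Minimal directory entry sketch: the home directory tracks the owner
# (dirty M copy) and sharers (clean S copies) for each of its lines, and
# forwards coherence messages only to processors that care.

class DirEntry:
    def __init__(self):
        self.owner = None      # node holding a dirty (M) copy
        self.sharers = set()   # nodes holding clean (S) copies

    def read(self, node):
        msgs = []
        if self.owner is not None:             # dirty elsewhere: recall it
            msgs.append(("recall", self.owner))
            self.sharers.add(self.owner)       # old owner keeps a clean copy
            self.owner = None
        self.sharers.add(node)
        msgs.append(("data", node))
        return msgs

    def write(self, node):
        msgs = [("inv", s) for s in self.sharers if s != node]
        if self.owner not in (None, node):
            msgs.append(("recall", self.owner))
        self.sharers = set()
        self.owner = node
        msgs.append(("data+ownership", node))
        return msgs

d = DirEntry()
print(d.read(1))    # [('data', 1)]                         -> A: Shared, {1}
print(d.write(2))   # [('inv', 1), ('data+ownership', 2)]   -> A: Mod., 2
```

The two printed lines correspond directly to the Read Processing and Write Processing slides: a read leaves A Shared at node 1, and node 2's write invalidates that copy and takes ownership.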


Read Processing

[Figure: Node #1 takes a read miss on A and sends "Read A (miss)" to the directory; the directory records A: Shared, #1 and returns the data to Node #1.]


Write Processing

[Figure: following Node #1's read (A: Shared, #1), Node #2 writes A; the directory invalidates #1's copy and the entry becomes A: Mod., #2.]

Trade-offs:
• Longer accesses (3-hop: processor → directory → other processor)
• Lower bandwidth: no snoops necessary

Makes sense either for CMPs (lots of L1 miss traffic) or large-scale servers (shared-memory MP > 32 nodes)


Chip Performance Scalability: History & Expectations

[Figure: log-scale plot of chip performance (in MIPS) versus year, 1970–2010, rising from 0.01 toward 1,000,000: First Micro → RISC → First Superscalar → Out-of-Order → SMT → SMP on chip → Cores → NUMA on chip?]

Goal: 1 Tera inst/sec by 2010!


Piranha: Performance and Managed Complexity

Large-scale server based on CMP nodes

CMP architecture:
excellent platform for exploiting thread-level parallelism
inherently emphasizes replication over monolithic complexity

Design methodology reduces implementation complexity:
novel simulation methodology
use ASIC physical design process

Piranha: 2x performance advantage with a team size of approximately 20 people


Piranha Processing Node

Next few slides from Luiz Barroso's ISCA 2000 presentation of "Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing"

Built up incrementally across slides 13–20:
Alpha core: 1-issue, in-order, 500 MHz
L1 caches: I&D, 64 KB, 2-way
Intra-chip switch (ICS): 32 GB/sec, 1-cycle delay
L2 cache: shared, 1 MB, 8-way
Memory Controller (MC): RDRAM, 12.8 GB/sec (8 banks @ 1.6 GB/sec)
Protocol Engines (HE & RE): μprogrammed, 1K μinstructions, even/odd interleaving
System Interconnect: 4-port crossbar router, topology independent, 32 GB/sec total bandwidth (4 links @ 8 GB/s)

[Figure: eight CPUs, each with I$/D$, connected through the ICS to the shared L2 banks, the memory controllers, the home/remote protocol engines, and the on-chip router.]


Single-Chip Piranha Performance

[Figure: normalized execution time, broken into CPU, L2 hit, and L2 miss components, for OLTP and DSS on four configurations: P1 (500 MHz, 1-issue single Piranha core), INO (1 GHz, 1-issue in-order), OOO (1 GHz, 4-issue out-of-order), and P8 (500 MHz, 1-issue, 8-core Piranha).]

Piranha's performance margin: 3x for OLTP and 2.2x for DSS

Piranha has more outstanding misses → better utilizes the memory system


Performance And Utilization
• Performance (IPC) important

• Utilization (actual IPC / peak IPC) important too

• Even moderate superscalars (e.g., 4-way) not fully utilized
Average sustained IPC: 1.5–2 → <50% utilization
Mis-predicted branches
Cache misses, especially L2
Data dependences

• Multi-threading (MT)
Improve utilization by multiplexing multiple threads on a single CPU
One thread cannot fully utilize the CPU? Maybe 2, 4 (or 100) can


Latency vs. Throughput
• MT trades (single-thread) latency for throughput
– Sharing the processor degrades the latency of individual threads
+ But improves the aggregate latency of both threads
+ Improves utilization

• Example
Thread A: individual latency = 10s, latency with thread B = 15s
Thread B: individual latency = 20s, latency with thread A = 25s
Sequential latency (first A then B, or vice versa): 30s
Parallel latency (A and B simultaneously): 25s
– MT slows each thread by 5s
+ But improves total latency by 5s

• Different workloads have different parallelism
SpecFP has lots of ILP (can use an 8-wide machine)
Server workloads have TLP (can use multiple threads)
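The example's arithmetic can be checked directly: sequential latency is the sum of the stand-alone times, while parallel latency is bounded by whichever co-scheduled thread finishes last.

```python
# Worked numbers from the latency-vs-throughput example (seconds).
a_alone, b_alone = 10, 20
a_shared, b_shared = 15, 25   # each thread runs slower when co-scheduled

sequential = a_alone + b_alone          # run A to completion, then B
parallel = max(a_shared, b_shared)      # both done once the later one is

print(sequential)                       # 30
print(parallel)                         # 25 -- total latency improves by 5s
print(a_shared - a_alone, b_shared - b_alone)  # 5 5 -- each thread 5s slower
```

The per-thread slowdown (5s each) and the aggregate improvement (5s) are both visible in the same four numbers, which is the whole trade.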


Core Sharing
Time sharing:
Run one thread
On a long-latency operation (e.g., cache miss), switch
Also known as "switch-on-miss" multithreading
E.g., Niagara (UltraSPARC T1/T2)

Space sharing:
Across pipeline depth
Fetch and issue each cycle from a different thread
Both across pipeline width and depth
Fetch and issue each cycle from multiple threads
Policy to decide which to fetch gets complicated
Also known as "simultaneous" multithreading
E.g., Alpha 21464, IBM POWER5


Instruction Issue

[Figure: issue slots over time for a scalar pipeline.]

Reduced function unit utilization due to dependencies


Superscalar Issue

[Figure: issue slots over time for a superscalar pipeline.]

Superscalar leads to more performance, but lower utilization


Predicated Issue

[Figure: issue slots over time with predication.]

Adds to function unit utilization, but results are thrown away


Chip Multiprocessor

[Figure: issue slots over time across two cores.]

Limited utilization when only running one thread


Coarse-grain Multithreading

[Figure: issue slots over time, switching threads on long-latency events.]

Preserves single-thread performance, but can only hide long latencies (i.e., main memory accesses)


Coarse-Grain Multithreading (CGMT)
• Coarse-Grain Multi-Threading (CGMT)
+ Sacrifices very little single-thread performance (of one thread)
– Tolerates only long latencies (e.g., L2 misses)
Thread scheduling policy:
Designate a "preferred" thread (e.g., thread A)
Switch to thread B on a thread A L2 miss
Switch back to A when A's L2 miss returns
Pipeline partitioning:
None, flush on switch
– Can't tolerate latencies shorter than twice the pipeline depth
Need a short in-order pipeline for good performance
Example: IBM Northstar/Pulsar
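The switch-on-miss policy above can be sketched as a toy scheduler: run the preferred thread A until it takes an L2 miss, run B while the miss is outstanding, and switch back when the miss returns. Names, the miss latency, and the trace are all illustrative, and the sketch ignores the pipeline-flush cost on each switch:

```python
# "Switch-on-miss" (coarse-grain) scheduling with a preferred thread.
MISS_LATENCY = 4  # cycles until an L2 miss returns (toy value)

def cgmt(trace_a, trace_b):
    """trace_x[i] is True if thread x's instruction at step i misses in L2."""
    schedule, ia, ib = [], 0, 0
    miss_return = -1          # cycle when A's outstanding miss comes back
    for cycle in range(len(trace_a) + len(trace_b)):
        if cycle >= miss_return and ia < len(trace_a):
            schedule.append("A")              # preferred thread runs
            if trace_a[ia]:                   # A misses: switch to B
                miss_return = cycle + MISS_LATENCY
            ia += 1
        elif ib < len(trace_b):
            schedule.append("B")              # B covers A's miss latency
            ib += 1
        else:
            schedule.append("-")              # nothing runnable: stall
    return "".join(schedule)

# A misses on its 2nd instruction; B runs while the miss is outstanding.
print(cgmt([False, True, False, False], [False, False, False, False]))
# -> AABBBAAB
```

Note how only the long miss latency is hidden; any stall shorter than the switch cost would be cheaper to just absorb, which is the slide's "twice the pipeline depth" limit.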


CGMT

[Figure: CGMT pipeline with replicated register files and a thread scheduler feeding the I$ and branch predictor; D$ shared; thread switches triggered by an L2 miss.]


Fine-Grained Multithreading

[Figure: issue slots over time, rotating among threads every cycle.]

Saturated workload → lots of threads
Unsaturated workload → lots of stalls
Intra-thread dependencies still limit performance


Fine-Grain Multithreading (FGMT)
• Fine-Grain Multithreading (FGMT)
– Sacrifices significant single-thread performance
+ Tolerates all latencies (e.g., L2 misses, mispredicted branches, etc.)
Thread scheduling policy:
Switch threads every cycle (round-robin), L2 miss or no
Pipeline partitioning:
Dynamic, no flushing
Length of pipeline doesn't matter
– Need a lot of threads
Extreme example: Denelcor HEP
So many threads (100+), it didn't even need caches
Failed commercially
Not popular today
Many threads → many register files
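The every-cycle rotation is the simplest policy of all, which a two-line sketch makes plain. With N threads, a given thread issues only every Nth cycle, so any intra-thread latency shorter than N cycles is hidden for free; that is the HEP approach, and also why it needs so many threads:

```python
# Fine-grain sketch: rotate threads every cycle, round-robin, miss or no.
from itertools import cycle, islice

def fgmt(threads, cycles):
    """Return the issue schedule for round-robin, one thread per cycle."""
    return "".join(islice(cycle(threads), cycles))

print(fgmt("ABCD", 12))  # ABCDABCDABCD -- each thread issues every 4th cycle
```

Contrast with the CGMT sketch earlier: there is no miss detection and no switch cost, but also no cycle where a single thread gets the pipeline to itself.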


Fine-Grain Multithreading
• FGMT
(Many) more threads
Multiple threads in pipeline at once

[Figure: FGMT pipeline with one register file per thread and a thread scheduler in front of the I$ and branch predictor; D$ shared.]


Simultaneous Multithreading

[Figure: issue slots over time, filled each cycle from multiple threads.]

Maximum utilization of function units by independent operations


Simultaneous Multithreading (SMT)
• Can we multithread an out-of-order machine?
Don't want to give up performance benefits
Don't want to give up natural tolerance of D$ (L1) miss latency

• Simultaneous multithreading (SMT)
+ Tolerates all latencies (e.g., L2 misses, mispredicted branches)
± Sacrifices some single-thread performance
Thread scheduling policy:
Round-robin (just like FGMT)
Pipeline partitioning:
Dynamic, hmmm…
Example: Pentium 4 (hyper-threading): 5-way issue, 2 threads
Another example: Alpha 21464: 8-way issue, 4 threads (canceled)


Simultaneous Multithreading (SMT)
• SMT: replicate the map table, share the physical register file

[Figure: SMT pipeline with per-thread map tables and a thread scheduler; the physical register file, I$, D$, and branch predictor are shared.]


Issues for SMT
• Cache interference
General concern for all MT variants
Can the working sets of multiple threads fit in the caches?
Shared-memory SPMD threads help here:
+ Same insns → share I$
+ Shared data → less D$ contention
MT is good for "server" workloads
To keep miss rates low, SMT might need a larger L2 (which is OK)
Out-of-order tolerates L1 misses

• Large map table and physical register file
#mt-entries = #threads × #arch-regs
#phys-regs = (#threads × #arch-regs) + #in-flight insns
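Plugging hypothetical but representative numbers into the sizing formulas above (4 threads, 32 architectural registers, 128 in-flight instructions; these values are illustrative, not from the slides):

```python
# Sizing the SMT map table and physical register file per the formulas
# above, with illustrative parameters.
threads, arch_regs, in_flight = 4, 32, 128

map_table_entries = threads * arch_regs              # one mapping per
phys_regs = threads * arch_regs + in_flight          # arch reg per thread

print(map_table_entries)  # 128
print(phys_regs)          # 256
```

The committed state alone already costs threads × arch-regs physical registers; only the in-flight term is shared, which is why the register file grows so quickly with thread count.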


SMT Resource Partitioning
• How are the ROB/MOB and RS partitioned in SMT?
Depends on what you want to achieve

• Static partitioning
Divide ROB/MOB, RS into T static equal-sized partitions
+ Ensures that low-IPC threads don't starve high-IPC ones
– Low utilization: low-IPC threads stall and occupy ROB/MOB, RS slots

• Dynamic partitioning
Divide ROB/MOB, RS into dynamically resizing partitions
Let threads fight amongst themselves
+ High utilization
– Possible starvation
ICOUNT: fetch policy prefers the thread with the fewest in-flight insns


Power Implications of MT
• Is MT (of any kind) power efficient?
Static power? Yes
Dissipated regardless of utilization
Dynamic power? Less clear, but probably yes
Highly utilization dependent
Major factor is additional cache activity
Some debate here
Overall? Yes
Static power relatively increasing


SMT vs. CMP
• If you wanted to run multiple threads, would you build a…
Chip multiprocessor (CMP): multiple separate pipelines?
A multithreaded processor (SMT): a single larger pipeline?

• Both will get you throughput on multiple threads
CMP will be simpler, possibly faster clock
SMT will get you better performance (IPC) on a single thread
SMT is basically an ILP engine that converts TLP to ILP
CMP is mainly a TLP engine

• Again, do both
Sun's Niagara (UltraSPARC T1)
8 processors, each with 4 threads (coarse-grained threading)
1 GHz clock, in-order, short pipeline (6 stages or so)
Designed for power-efficient "throughput computing"


Niagara: 32 Threads on Chip

8 six-stage in-order pipelines

Each pipeline 4-way fine-grain multithreaded

Small L1 write-through caches

Shared L2

Screams for Web, OLTP

Shipping @ 1.2 GHz clock


Alpha 21464 Architecture Overview
[Slides from Shubu Mukherjee @ Intel]

8-wide out-of-order superscalar

Large on-chip L2 cache

Direct RAMBUS interface

On-chip router for system interconnect

Glueless, directory-based, ccNUMA for up to 512-way multiprocessing

4-way simultaneous multithreading (SMT)


Basic Out-of-order Pipeline

[Figure: pipeline stages Fetch → Decode/Map → Queue → Reg Read → Execute → Dcache/Store Buffer → Reg Write → Retire, with the PC, Icache, register map, register files, and Dcache drawn beneath the corresponding stages. The entire pipeline is thread-blind.]


SMT Pipeline

[Figure: the same stages as the basic out-of-order pipeline, but with the PC and register map replicated per thread; the Icache, Dcache, queues, and physical register file are shared.]


Changes for SMT

Basic pipeline – unchanged

Replicated resources:
Program counters
Register maps

Shared resources:
Register file (size increased)
Instruction queue
First- and second-level caches
Translation buffers
Branch predictor


Fetch Policy (Key)
Simple fetch (e.g., round-robin) policies don't work. Why?

• Instructions from slow threads hog the pipeline

• Once instructions are placed in the IQ, they cannot be removed until they execute

Fetch optimization:

• i-count

• Favor faster threads

• Count retirement rate and bias accordingly

• Must avoid starvation (tricky)
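The i-count idea reduces to a one-line selection rule: each cycle, fetch from the thread with the fewest instructions in flight. A fast-retiring thread keeps its count low and keeps winning fetch slots, while a stalled thread's count grows until it stops hogging the front end. An illustrative sketch with toy counts:

```python
# ICOUNT sketch: prefer the thread with the fewest in-flight instructions.
def icount_pick(in_flight):
    """in_flight: dict thread -> instructions currently in the pipeline."""
    return min(in_flight, key=in_flight.get)

# T0 is stalled (12 insns clogging the IQ); T1 retires quickly.
in_flight = {"T0": 12, "T1": 3, "T2": 7}
print(icount_pick(in_flight))  # T1 -- fewest in-flight insns wins fetch
```

Starvation avoidance (the "tricky" part on the slide) is not shown: a real policy also needs a floor so that a slow thread still fetches occasionally.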


Multiprogrammed workload

[Figure: relative throughput for 1, 2, 3, and 4 threads on SpecInt, SpecFP, and mixed Int/FP multiprogrammed workloads (y-axis 0–250%); throughput rises with thread count.]


Decomposed SPEC95 Applications

[Figure: relative throughput for 1, 2, 3, and 4 threads on Turb3d, Swm256, and Tomcatv (y-axis 0–250%).]


Multithreaded Applications

[Figure: relative throughput for 1, 2, and 4 threads on Barnes, Chess, Sort, and TP (y-axis 0–300%).]