
© Wenisch 2007 -- Portions © Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar

EECS 470

Lecture 24

Chip Multiprocessors and Simultaneous Multithreading

Fall 2007

Prof. Thomas Wenisch

http://www.eecs.umich.edu/courses/eecs470

Slides developed in part by Profs. Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, and Vijaykumar of Carnegie Mellon University, Purdue University, University of Pennsylvania, and University of Wisconsin.

Lecture 24 Slide 1, EECS 470


Announcements

• HW 6 posted, due 12/7

• Project due 12/10; in-class presentations (~8 minutes + questions)


Base Snoopy Organization

[Figure: block diagram of a snoopy cache organization. A processor-side controller and a bus-side (snoop) controller share the cache data RAM, with separate copies of the tags and state for the processor and for the snoop, tag comparators, a write-back buffer, and addr/cmd/data/snoop-state connections to the system bus.]
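The bus-side path in the figure can be sketched in software: on every bus transaction, the snoop controller consults its duplicate copy of the tags and state and decides what, if anything, to put back on the bus. A minimal illustrative sketch, assuming only an MSI subset (the class and method names are hypothetical, not from any real design):

```python
# Hypothetical sketch of the bus-side snoop path: duplicate tags/state are
# consulted on every bus transaction, and the controller replies with an
# action (flush dirty data, invalidate, or nothing). MSI subset only.

class SnoopTags:
    def __init__(self):
        self.state = {}  # addr -> 'M' or 'S'; absent means Invalid

    def snoop(self, cmd, addr):
        st = self.state.get(addr)
        if st is None:
            return "no-action"           # most snoops hit nothing
        if cmd == "BusRd":
            if st == "M":                # another cache reads our dirty copy
                self.state[addr] = "S"
                return "flush"           # supply data, downgrade to Shared
            return "no-action"
        if cmd == "BusRdX":              # another cache wants ownership
            del self.state[addr]         # invalidate our copy
            return "flush" if st == "M" else "invalidate"
        return "no-action"

tags = SnoopTags()
tags.state[0x40] = "M"
print(tags.snoop("BusRd", 0x40))   # flush: dirty copy supplied, downgraded
print(tags.state[0x40])            # S
print(tags.snoop("BusRdX", 0x40))  # invalidate: exclusive request observed
```

The duplicate tags exist precisely so this lookup does not steal the processor-side tag port on every bus transaction.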


Non-Atomic State Transitions

Operations involve multiple actions:
Look up cache tags
Bus arbitration
Check for writeback
Even if the bus is atomic, the overall set of actions is not
Race conditions among multiple operations:
Suppose P1 and P2 attempt to write cached block A
Each decides to issue BusUpgr to allow S → M

Issues:
Handle requests for other blocks while waiting to acquire the bus
Must handle requests for this block A
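The S → M race above can be made concrete. In this illustrative sketch (hypothetical names, not from any real controller), the cache that loses bus arbitration observes the winner's BusUpgr while still waiting in its transient state, invalidates its copy, and must convert its pending upgrade into a full BusRdX, since it no longer holds valid data:

```python
# Two caches both hold block A in S and both try to upgrade to M.
# Whichever wins bus arbitration issues BusUpgr first; the loser, still
# waiting in the transient S->M state, sees that BusUpgr on the bus,
# invalidates its copy, and must reissue as BusRdX (it needs data now).

class Cache:
    def __init__(self, name):
        self.name = name
        self.state = "S"
        self.pending = "BusUpgr"   # transaction planned for our write

    def observe(self, cmd, issuer):
        if issuer is self:
            return                 # ignore our own transaction
        if cmd in ("BusUpgr", "BusRdX"):
            self.state = "I"       # our S copy is now stale
            if self.pending == "BusUpgr":
                self.pending = "BusRdX"   # need data, not just permission

p1, p2 = Cache("P1"), Cache("P2")

# P1 wins arbitration and issues its upgrade first; both caches observe it.
for c in (p1, p2):
    c.observe("BusUpgr", issuer=p1)
p1.state, p1.pending = "M", None

print(p2.state)    # I      -- P2 lost the race
print(p2.pending)  # BusRdX -- must fetch data, not just upgrade
```

This is exactly why the protocol needs transient states: the decision P2 made while in S is stale by the time it reaches the bus.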


Non-Atomicity → Transient States

Two types of states:
• Stable (e.g., MESI)
• Transient or intermediate (e.g., S → M, I → M, I → S,E while waiting for a bus grant)
Increases complexity

[Figure: MESI state diagram augmented with transient states. Transitions are labeled with processor actions (PrRd, PrWr) and the bus transactions they generate or observe (BusRd, BusRdX, BusUpgr, BusReq, BusGrant, Flush, Flush′).]


Scalability Problems of Snoopy Coherence

• Prohibitive bus bandwidth
Required bandwidth grows with # CPUs…
… but available bandwidth per bus is fixed
Adding busses makes serialization/ordering hard

• Prohibitive processor snooping bandwidth
All caches do a tag lookup when ANY processor accesses memory
Inclusion limits this to the L2, but still lots of lookups

• Upshot: bus-based coherence doesn't scale beyond 8–16 CPUs


Scalable Cache Coherence
• Scalable cache coherence: two-part solution

• Part I: bus bandwidth
Replace the non-scalable bandwidth substrate (bus)…
… with a scalable-bandwidth one (point-to-point network, e.g., mesh)

• Part II: processor snooping bandwidth
Interesting: most snoops result in no action
Replace the non-scalable broadcast protocol (spam everyone)…
… with a scalable directory protocol (only spam processors that care)


Directory Coherence Protocols
• Observe: physical address space statically partitioned
+ Can easily determine which memory module holds a given line
That memory module is sometimes called the "home"
– Can't easily determine which processors have the line in their caches
Bus-based protocol: broadcast events to all processors/caches
± Simple and fast, but non-scalable

• Directories: non-broadcast coherence protocol
Extend memory to track caching information
For each physical cache line whose home this is, track:
Owner: which processor has a dirty copy (i.e., M state)
Sharers: which processors have clean copies (i.e., S state)
Processor sends coherence event to home directory
Home directory only sends events to processors that care
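The owner/sharers bookkeeping can be sketched as a per-line directory entry. This is an illustrative sketch with hypothetical names, covering only the read and write cases from the slides; real protocols also handle evictions, races, and transient states:

```python
# Minimal directory entry sketch: the home directory tracks the owner
# (dirty M copy) and sharers (clean S copies) for each of its lines, and
# forwards coherence messages only to processors that care.

class DirEntry:
    def __init__(self):
        self.owner = None      # node holding a dirty (M) copy
        self.sharers = set()   # nodes holding clean (S) copies

    def read(self, node):
        msgs = []
        if self.owner is not None:             # dirty elsewhere: recall it
            msgs.append(("recall", self.owner))
            self.sharers.add(self.owner)       # old owner keeps a clean copy
            self.owner = None
        self.sharers.add(node)
        msgs.append(("data", node))
        return msgs

    def write(self, node):
        msgs = [("inv", s) for s in self.sharers if s != node]
        if self.owner not in (None, node):
            msgs.append(("recall", self.owner))
        self.sharers = set()
        self.owner = node
        msgs.append(("data+ownership", node))
        return msgs

d = DirEntry()
print(d.read(1))    # [('data', 1)]                         -> A: Shared, {1}
print(d.write(2))   # [('inv', 1), ('data+ownership', 2)]   -> A: Mod., 2
```

The two printed lines correspond directly to the Read Processing and Write Processing slides: a read leaves A Shared at node 1, and node 2's write invalidates that copy and takes ownership.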


Read Processing

[Figure: Node #1 takes a read miss on A and sends "Read A (miss)" to the directory; the directory records A: Shared, #1 and returns the data to Node #1.]


Write Processing

[Figure: following Node #1's read (A: Shared, #1), Node #2 writes A; the directory invalidates #1's copy and the entry becomes A: Mod., #2.]

Trade-offs:
• Longer accesses (3-hop: processor → directory → other processor)
• Lower bandwidth: no snoops necessary

Makes sense either for CMPs (lots of L1 miss traffic) or large-scale servers (shared-memory MP > 32 nodes)


Chip Performance Scalability: History & Expectations

[Figure: log-scale plot of chip performance (in MIPS) versus year, 1970–2010, rising from 0.01 toward 1,000,000: First Micro → RISC → First Superscalar → Out-of-Order → SMT → SMP on chip → Cores → NUMA on chip?]

Goal: 1 Tera inst/sec by 2010!


Piranha: Performance and Managed Complexity

Large-scale server based on CMP nodes

CMP architecture:
excellent platform for exploiting thread-level parallelism
inherently emphasizes replication over monolithic complexity

Design methodology reduces implementation complexity:
novel simulation methodology
use ASIC physical design process

Piranha: 2x performance advantage with a team size of approximately 20 people


Piranha Processing Node

Next few slides from Luiz Barroso's ISCA 2000 presentation of "Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing"

Built up incrementally across slides 13–20:
Alpha core: 1-issue, in-order, 500 MHz
L1 caches: I&D, 64 KB, 2-way
Intra-chip switch (ICS): 32 GB/sec, 1-cycle delay
L2 cache: shared, 1 MB, 8-way
Memory Controller (MC): RDRAM, 12.8 GB/sec (8 banks @ 1.6 GB/sec)
Protocol Engines (HE & RE): μprogrammed, 1K μinstructions, even/odd interleaving
System Interconnect: 4-port crossbar router, topology independent, 32 GB/sec total bandwidth (4 links @ 8 GB/s)

[Figure: eight CPUs, each with I$/D$, connected through the ICS to the shared L2 banks, the memory controllers, the home/remote protocol engines, and the on-chip router.]


Single-Chip Piranha Performance

[Figure: normalized execution time, broken into CPU, L2 hit, and L2 miss components, for OLTP and DSS on four configurations: P1 (500 MHz, 1-issue single Piranha core), INO (1 GHz, 1-issue in-order), OOO (1 GHz, 4-issue out-of-order), and P8 (500 MHz, 1-issue, 8-core Piranha).]

Piranha's performance margin: 3x for OLTP and 2.2x for DSS

Piranha has more outstanding misses → better utilizes the memory system


Performance And Utilization
• Performance (IPC) important

• Utilization (actual IPC / peak IPC) important too

• Even moderate superscalars (e.g., 4-way) not fully utilized
Average sustained IPC: 1.5–2 → <50% utilization
Mis-predicted branches
Cache misses, especially L2
Data dependences

• Multi-threading (MT)
Improve utilization by multiplexing multiple threads on a single CPU
One thread cannot fully utilize the CPU? Maybe 2, 4 (or 100) can


Latency vs. Throughput
• MT trades (single-thread) latency for throughput
– Sharing the processor degrades the latency of individual threads
+ But improves the aggregate latency of both threads
+ Improves utilization

• Example
Thread A: individual latency = 10s, latency with thread B = 15s
Thread B: individual latency = 20s, latency with thread A = 25s
Sequential latency (first A then B, or vice versa): 30s
Parallel latency (A and B simultaneously): 25s
– MT slows each thread by 5s
+ But improves total latency by 5s

• Different workloads have different parallelism
SpecFP has lots of ILP (can use an 8-wide machine)
Server workloads have TLP (can use multiple threads)
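The example's arithmetic can be checked directly: sequential latency is the sum of the stand-alone times, while parallel latency is bounded by whichever co-scheduled thread finishes last.

```python
# Worked numbers from the latency-vs-throughput example (seconds).
a_alone, b_alone = 10, 20
a_shared, b_shared = 15, 25   # each thread runs slower when co-scheduled

sequential = a_alone + b_alone          # run A to completion, then B
parallel = max(a_shared, b_shared)      # both done once the later one is

print(sequential)                       # 30
print(parallel)                         # 25 -- total latency improves by 5s
print(a_shared - a_alone, b_shared - b_alone)  # 5 5 -- each thread 5s slower
```

The per-thread slowdown (5s each) and the aggregate improvement (5s) are both visible in the same four numbers, which is the whole trade.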


Core Sharing
Time sharing:
Run one thread
On a long-latency operation (e.g., cache miss), switch
Also known as "switch-on-miss" multithreading
E.g., Niagara (UltraSPARC T1/T2)

Space sharing:
Across pipeline depth
Fetch and issue each cycle from a different thread
Both across pipeline width and depth
Fetch and issue each cycle from multiple threads
Policy to decide which to fetch gets complicated
Also known as "simultaneous" multithreading
E.g., Alpha 21464, IBM POWER5


Instruction Issue

[Figure: issue slots over time for a scalar pipeline.]

Reduced function unit utilization due to dependencies


Superscalar Issue

[Figure: issue slots over time for a superscalar pipeline.]

Superscalar leads to more performance, but lower utilization


Predicated Issue

[Figure: issue slots over time with predication.]

Adds to function unit utilization, but results are thrown away


Chip Multiprocessor

[Figure: issue slots over time across two cores.]

Limited utilization when only running one thread


Coarse-grain Multithreading

[Figure: issue slots over time, switching threads on long-latency events.]

Preserves single-thread performance, but can only hide long latencies (i.e., main memory accesses)


Coarse-Grain Multithreading (CGMT)
• Coarse-Grain Multi-Threading (CGMT)
+ Sacrifices very little single-thread performance (of one thread)
– Tolerates only long latencies (e.g., L2 misses)
Thread scheduling policy:
Designate a "preferred" thread (e.g., thread A)
Switch to thread B on a thread A L2 miss
Switch back to A when A's L2 miss returns
Pipeline partitioning:
None, flush on switch
– Can't tolerate latencies shorter than twice the pipeline depth
Need a short in-order pipeline for good performance
Example: IBM Northstar/Pulsar
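The switch-on-miss policy above can be sketched as a toy scheduler: run the preferred thread A until it takes an L2 miss, run B while the miss is outstanding, and switch back when the miss returns. Names, the miss latency, and the trace are all illustrative, and the sketch ignores the pipeline-flush cost on each switch:

```python
# "Switch-on-miss" (coarse-grain) scheduling with a preferred thread.
MISS_LATENCY = 4  # cycles until an L2 miss returns (toy value)

def cgmt(trace_a, trace_b):
    """trace_x[i] is True if thread x's instruction at step i misses in L2."""
    schedule, ia, ib = [], 0, 0
    miss_return = -1          # cycle when A's outstanding miss comes back
    for cycle in range(len(trace_a) + len(trace_b)):
        if cycle >= miss_return and ia < len(trace_a):
            schedule.append("A")              # preferred thread runs
            if trace_a[ia]:                   # A misses: switch to B
                miss_return = cycle + MISS_LATENCY
            ia += 1
        elif ib < len(trace_b):
            schedule.append("B")              # B covers A's miss latency
            ib += 1
        else:
            schedule.append("-")              # nothing runnable: stall
    return "".join(schedule)

# A misses on its 2nd instruction; B runs while the miss is outstanding.
print(cgmt([False, True, False, False], [False, False, False, False]))
# -> AABBBAAB
```

Note how only the long miss latency is hidden; any stall shorter than the switch cost would be cheaper to just absorb, which is the slide's "twice the pipeline depth" limit.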


CGMT

[Figure: CGMT pipeline with replicated register files and a thread scheduler feeding the I$ and branch predictor; D$ shared; thread switches triggered by an L2 miss.]


Fine-Grained Multithreading

[Figure: issue slots over time, rotating among threads every cycle.]

Saturated workload → lots of threads
Unsaturated workload → lots of stalls
Intra-thread dependencies still limit performance


Fine-Grain Multithreading (FGMT)
• Fine-Grain Multithreading (FGMT)
– Sacrifices significant single-thread performance
+ Tolerates all latencies (e.g., L2 misses, mispredicted branches, etc.)
Thread scheduling policy:
Switch threads every cycle (round-robin), L2 miss or no
Pipeline partitioning:
Dynamic, no flushing
Length of pipeline doesn't matter
– Need a lot of threads
Extreme example: Denelcor HEP
So many threads (100+), it didn't even need caches
Failed commercially
Not popular today
Many threads → many register files
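The every-cycle rotation is the simplest policy of all, which a two-line sketch makes plain. With N threads, a given thread issues only every Nth cycle, so any intra-thread latency shorter than N cycles is hidden for free; that is the HEP approach, and also why it needs so many threads:

```python
# Fine-grain sketch: rotate threads every cycle, round-robin, miss or no.
from itertools import cycle, islice

def fgmt(threads, cycles):
    """Return the issue schedule for round-robin, one thread per cycle."""
    return "".join(islice(cycle(threads), cycles))

print(fgmt("ABCD", 12))  # ABCDABCDABCD -- each thread issues every 4th cycle
```

Contrast with the CGMT sketch earlier: there is no miss detection and no switch cost, but also no cycle where a single thread gets the pipeline to itself.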


Fine-Grain Multithreading
• FGMT
(Many) more threads
Multiple threads in pipeline at once

[Figure: FGMT pipeline with one register file per thread and a thread scheduler in front of the I$ and branch predictor; D$ shared.]


Simultaneous Multithreading

[Figure: issue slots over time, filled each cycle from multiple threads.]

Maximum utilization of function units by independent operations


Simultaneous Multithreading (SMT)
• Can we multithread an out-of-order machine?
Don't want to give up performance benefits
Don't want to give up natural tolerance of D$ (L1) miss latency

• Simultaneous multithreading (SMT)
+ Tolerates all latencies (e.g., L2 misses, mispredicted branches)
± Sacrifices some single-thread performance
Thread scheduling policy:
Round-robin (just like FGMT)
Pipeline partitioning:
Dynamic, hmmm…
Example: Pentium 4 (hyper-threading): 5-way issue, 2 threads
Another example: Alpha 21464: 8-way issue, 4 threads (canceled)


Simultaneous Multithreading (SMT)
• SMT: replicate the map table, share the physical register file

[Figure: SMT pipeline with per-thread map tables and a thread scheduler; the physical register file, I$, D$, and branch predictor are shared.]


Issues for SMT
• Cache interference
General concern for all MT variants
Can the working sets of multiple threads fit in the caches?
Shared-memory SPMD threads help here:
+ Same insns → share I$
+ Shared data → less D$ contention
MT is good for "server" workloads
To keep miss rates low, SMT might need a larger L2 (which is OK)
Out-of-order tolerates L1 misses

• Large map table and physical register file
#mt-entries = #threads × #arch-regs
#phys-regs = (#threads × #arch-regs) + #in-flight insns
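Plugging hypothetical but representative numbers into the sizing formulas above (4 threads, 32 architectural registers, 128 in-flight instructions; these values are illustrative, not from the slides):

```python
# Sizing the SMT map table and physical register file per the formulas
# above, with illustrative parameters.
threads, arch_regs, in_flight = 4, 32, 128

map_table_entries = threads * arch_regs              # one mapping per
phys_regs = threads * arch_regs + in_flight          # arch reg per thread

print(map_table_entries)  # 128
print(phys_regs)          # 256
```

The committed state alone already costs threads × arch-regs physical registers; only the in-flight term is shared, which is why the register file grows so quickly with thread count.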


SMT Resource Partitioning
• How are the ROB/MOB and RS partitioned in SMT?
Depends on what you want to achieve

• Static partitioning
Divide ROB/MOB, RS into T static equal-sized partitions
+ Ensures that low-IPC threads don't starve high-IPC ones
– Low utilization: low-IPC threads stall and occupy ROB/MOB, RS slots

• Dynamic partitioning
Divide ROB/MOB, RS into dynamically resizing partitions
Let threads fight amongst themselves
+ High utilization
– Possible starvation
ICOUNT: fetch policy prefers the thread with the fewest in-flight insns


Power Implications of MT
• Is MT (of any kind) power efficient?
Static power? Yes
Dissipated regardless of utilization
Dynamic power? Less clear, but probably yes
Highly utilization dependent
Major factor is additional cache activity
Some debate here
Overall? Yes
Static power relatively increasing


SMT vs. CMP
• If you wanted to run multiple threads, would you build a…
Chip multiprocessor (CMP): multiple separate pipelines?
A multithreaded processor (SMT): a single larger pipeline?

• Both will get you throughput on multiple threads
CMP will be simpler, possibly faster clock
SMT will get you better performance (IPC) on a single thread
SMT is basically an ILP engine that converts TLP to ILP
CMP is mainly a TLP engine

• Again, do both
Sun's Niagara (UltraSPARC T1)
8 processors, each with 4 threads (coarse-grained threading)
1 GHz clock, in-order, short pipeline (6 stages or so)
Designed for power-efficient "throughput computing"


Niagara: 32 Threads on Chip

8 six-stage in-order pipelines

Each pipeline 4-way fine-grain multithreaded

Small L1 write-through caches

Shared L2

Screams for Web, OLTP

Shipping @ 1.2 GHz clock


Alpha 21464 Architecture Overview
[Slides from Shubu Mukherjee @ Intel]

8-wide out-of-order superscalar

Large on-chip L2 cache

Direct RAMBUS interface

On-chip router for system interconnect

Glueless, directory-based, ccNUMA for up to 512-way multiprocessing

4-way simultaneous multithreading (SMT)


Basic Out-of-order Pipeline

[Figure: pipeline stages Fetch → Decode/Map → Queue → Reg Read → Execute → Dcache/Store Buffer → Reg Write → Retire, with the PC, Icache, register map, register files, and Dcache drawn beneath the corresponding stages. The entire pipeline is thread-blind.]


SMT Pipeline

[Figure: the same stages as the basic out-of-order pipeline, but with the PC and register map replicated per thread; the Icache, Dcache, queues, and physical register file are shared.]


Changes for SMT

Basic pipeline – unchanged

Replicated resources:
Program counters
Register maps

Shared resources:
Register file (size increased)
Instruction queue
First- and second-level caches
Translation buffers
Branch predictor


Fetch Policy (Key)
Simple fetch (e.g., round-robin) policies don't work. Why?

• Instructions from slow threads hog the pipeline

• Once instructions are placed in the IQ, they cannot be removed until they execute

Fetch optimization:

• i-count

• Favor faster threads

• Count retirement rate and bias accordingly

• Must avoid starvation (tricky)
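The i-count idea reduces to a one-line selection rule: each cycle, fetch from the thread with the fewest instructions in flight. A fast-retiring thread keeps its count low and keeps winning fetch slots, while a stalled thread's count grows until it stops hogging the front end. An illustrative sketch with toy counts:

```python
# ICOUNT sketch: prefer the thread with the fewest in-flight instructions.
def icount_pick(in_flight):
    """in_flight: dict thread -> instructions currently in the pipeline."""
    return min(in_flight, key=in_flight.get)

# T0 is stalled (12 insns clogging the IQ); T1 retires quickly.
in_flight = {"T0": 12, "T1": 3, "T2": 7}
print(icount_pick(in_flight))  # T1 -- fewest in-flight insns wins fetch
```

Starvation avoidance (the "tricky" part on the slide) is not shown: a real policy also needs a floor so that a slow thread still fetches occasionally.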


Multiprogrammed workload

[Figure: relative throughput for 1, 2, 3, and 4 threads on SpecInt, SpecFP, and mixed Int/FP multiprogrammed workloads (y-axis 0–250%); throughput rises with thread count.]


Decomposed SPEC95 Applications

[Figure: relative throughput for 1, 2, 3, and 4 threads on Turb3d, Swm256, and Tomcatv (y-axis 0–250%).]


Multithreaded Applications

[Figure: relative throughput for 1, 2, and 4 threads on Barnes, Chess, Sort, and TP (y-axis 0–300%).]