Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor

Dean M. Tullsen, Susan J. Eggers, Joel S. Emer, Henry M. Levy, Jack L. Lo, and Rebecca L. Stamm
Presented by Kim Ki Young @ DCSLab


Page 2: Introduction

Simultaneous Multithreading (SMT)
  • A technique that permits multiple independent threads to issue multiple instructions each cycle to a superscalar processor’s functional units
Two major impediments to processor utilization
  • long latencies
  • limited per-thread parallelism

Page 3

1. Demonstrate that the throughput gains of SMT are possible without extensive changes to a conventional, wide-issue superscalar processor
2. Show that SMT need not compromise single-thread performance
3. Use a detailed architecture model to analyze and relieve bottlenecks that did not exist in the more idealized model
4. Show how simultaneous multithreading creates an advantage previously unexploitable in other architectures


Page 5

A projection of current superscalar design trends 3-5 years into the future
Changes necessary to support simultaneous multithreading
  • Multiple program counters
  • A separate return stack for each thread
  • Per-thread instruction retirement, instruction queue flush, and trap mechanisms
  • A thread id with each branch target buffer entry
  • A larger register file


Page 8

Simulator
  • derived from MIPSI, a MIPS-based simulator
  • executes unmodified Alpha object code
Workload
  • SPEC92 benchmark suite: five floating point programs, two integer programs
  • TeX
Compiler
  • Multiflow trace scheduling compiler


Page 10

  • With only a single thread, throughput is less than 2% below that of a superscalar without SMT support
  • Peak throughput is 84% higher than the superscalar
Three problems
  • IQ size
  • Fetch throughput
  • Lack of parallelism

Page 11

Improve fetch throughput without increasing the fetch bandwidth
Naming scheme: alg.num1.num2
  • alg: fetch selection method
  • num1: number of threads that can fetch in one cycle
  • num2: maximum number of instructions fetched per thread in one cycle
Partitioning the fetch unit
  • RR.1.8
  • RR.2.4, RR.4.2: some hardware addition
  • RR.2.8: additional logic is required
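The alg.num1.num2 naming above can be unpacked mechanically. The following is a minimal sketch, assuming that "RR" stands for round-robin thread selection; the names `FetchScheme`, `parse_scheme`, and `round_robin_pick` are hypothetical helpers introduced only for illustration and do not come from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FetchScheme:
    alg: str                # fetch selection method, e.g. "RR" (assumed: round-robin)
    threads_per_cycle: int  # num1: threads that can fetch in one cycle
    insts_per_thread: int   # num2: max instructions fetched per thread per cycle


def parse_scheme(name: str) -> FetchScheme:
    """Split a label such as "RR.2.4" into its three fields."""
    alg, num1, num2 = name.split(".")
    return FetchScheme(alg, int(num1), int(num2))


def round_robin_pick(scheme: FetchScheme, num_threads: int, cycle: int) -> list[int]:
    """Thread ids allowed to fetch in `cycle` under plain round-robin rotation."""
    count = min(scheme.threads_per_cycle, num_threads)
    start = (cycle * count) % num_threads
    return [(start + i) % num_threads for i in range(count)]


if __name__ == "__main__":
    for label in ("RR.1.8", "RR.2.4", "RR.4.2", "RR.2.8"):
        s = parse_scheme(label)
        slots = s.threads_per_cycle * s.insts_per_thread
        print(label, "threads in cycle 1:", round_robin_pick(s, 8, 1), "slots:", slots)
```

For RR.1.8, RR.2.4, and RR.4.2 the product num1 × num2 is 8, while for RR.2.8 it is 16, which is presumably why that scheme needs additional selection logic.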


Page 13

Fetch Policies
  • BRCOUNT: favor threads that are least likely to be on a wrong path
  • MISSCOUNT: favor threads that have the fewest outstanding D-cache misses
  • ICOUNT: favor threads with the fewest instructions in decode
  • IQPOSN: favor threads with instructions farther from the head of the IQ
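The counter-based fetch policies above amount to ranking threads by a per-thread count and fetching from the lowest-ranked ones. This is a rough sketch under stated assumptions: `ThreadState`, `POLICY_KEYS`, and `pick_threads` are hypothetical names, BRCOUNT is approximated here by a count of unresolved branches, and IQPOSN is left out because it depends on queue positions rather than a simple counter.

```python
from dataclasses import dataclass


@dataclass
class ThreadState:
    tid: int
    unresolved_branches: int  # assumed proxy for "likely to be on a wrong path"
    outstanding_dmisses: int  # outstanding D-cache misses
    insts_in_decode: int      # instructions currently in decode


# Each policy maps a thread to a count; a lower count means higher fetch priority.
POLICY_KEYS = {
    "BRCOUNT": lambda t: t.unresolved_branches,
    "MISSCOUNT": lambda t: t.outstanding_dmisses,
    "ICOUNT": lambda t: t.insts_in_decode,
}


def pick_threads(threads: list[ThreadState], policy: str, n: int) -> list[int]:
    """Return the ids of the n highest-priority threads; ties go to the lower tid."""
    key = POLICY_KEYS[policy]
    ranked = sorted(threads, key=lambda t: (key(t), t.tid))
    return [t.tid for t in ranked[:n]]


if __name__ == "__main__":
    threads = [
        ThreadState(0, unresolved_branches=3, outstanding_dmisses=0, insts_in_decode=12),
        ThreadState(1, unresolved_branches=1, outstanding_dmisses=2, insts_in_decode=5),
        ThreadState(2, unresolved_branches=0, outstanding_dmisses=1, insts_in_decode=9),
    ]
    for policy in POLICY_KEYS:
        print(policy, pick_threads(threads, policy, 2))
```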


Page 16

Unblocking the Fetch Unit
  • BIGQ: increase the IQ’s size as long as we don’t increase the search space (double the size, search only the first 32 entries)
  • ITAG: do the I-cache tag lookup a cycle early
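The BIGQ idea can be illustrated with a queue that holds more entries than the issue logic is allowed to search. This is only a sketch: the class name `BigQ`, its methods, and the default capacity of 64 (inferred from "double the size, search the first 32 entries") are assumptions, not details given on the slide.

```python
from collections import deque


class BigQ:
    """Instruction queue with a searchable window smaller than its capacity."""

    def __init__(self, capacity: int = 64, search_window: int = 32):
        self.capacity = capacity            # assumed: twice the searched size
        self.search_window = search_window  # only these entries are searched
        self.entries = deque()

    def enqueue(self, inst) -> bool:
        """Add an instruction; False means the queue is full and fetch must wait."""
        if len(self.entries) >= self.capacity:
            return False
        self.entries.append(inst)
        return True

    def select(self, is_ready, width: int) -> list:
        """Issue up to `width` ready instructions from the front window only."""
        window = list(self.entries)[: self.search_window]
        chosen = [inst for inst in window if is_ready(inst)][:width]
        for inst in chosen:
            self.entries.remove(inst)
        return chosen


if __name__ == "__main__":
    q = BigQ(capacity=4, search_window=2)
    for i in range(4):
        q.enqueue(i)
    # Instruction 3 is ready but sits outside the 2-entry window, so only 1 issues.
    print(q.select(lambda i: i % 2 == 1, width=2))
```

The extra entries act as a buffer that keeps the fetch unit from blocking, while the selection logic still examines the same number of entries as before.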

Page 17

Two sources of issue slot waste
  • Wrong-path instructions: result from mispredicted branches
  • Optimistically issued instructions: result from a cache miss or bank conflict
Issue Algorithms
  • OPT_LAST
  • SPEC_LAST
  • BRANCH_FIRST
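The slide lists the issue algorithms by name only. One plausible reading, sketched below, is that each is a priority order over ready instructions: OPT_LAST issues optimistic instructions last, SPEC_LAST issues speculative instructions last, and BRANCH_FIRST issues branches first, with age as the tie-breaker. The `Inst` fields, the `ISSUE_KEYS` table, and the `OLDEST_FIRST` baseline are assumptions added for illustration.

```python
from dataclasses import dataclass


@dataclass
class Inst:
    age: int           # smaller means older
    speculative: bool  # follows an unresolved branch (may be on a wrong path)
    optimistic: bool   # issued assuming a pending load will hit in the cache
    is_branch: bool


# Sort keys: False sorts before True, so the flagged class is pushed to the back
# (or, for BRANCH_FIRST, branches are pulled to the front).
ISSUE_KEYS = {
    "OLDEST_FIRST": lambda i: (i.age,),  # assumed baseline, not listed on the slide
    "OPT_LAST": lambda i: (i.optimistic, i.age),
    "SPEC_LAST": lambda i: (i.speculative, i.age),
    "BRANCH_FIRST": lambda i: (not i.is_branch, i.age),
}


def issue_order(ready: list[Inst], policy: str) -> list[Inst]:
    """Return ready instructions in the order the given policy would issue them."""
    return sorted(ready, key=ISSUE_KEYS[policy])


if __name__ == "__main__":
    ready = [
        Inst(age=0, speculative=False, optimistic=True, is_branch=False),
        Inst(age=1, speculative=True, optimistic=False, is_branch=False),
        Inst(age=2, speculative=False, optimistic=False, is_branch=True),
    ]
    for policy in ISSUE_KEYS:
        print(policy, [i.age for i in issue_order(ready, policy)])
```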

Page 18

  • Issue Bandwidth: not a bottleneck
  • Instruction Queue Size: not a bottleneck; experiments with larger queues increased throughput by less than 1%
  • Fetch Bandwidth: prime candidate for bottleneck status; increasing the IQ and excess registers increased performance another 7%
  • Branch Prediction: performance is less sensitive to it in SMT

Page 19

  • Speculative Execution: not a bottleneck, but eliminating it would be an issue
  • Memory Throughput: infinite-bandwidth caches would increase throughput by only 3%
  • Register File Size: no sharp drop-off point
Fetch throughput is still a bottleneck

Page 20

  • Borrows heavily from conventional superscalar design, requiring little additional hardware support
  • Minimizes the impact on single-thread performance, running only 2% slower in that scenario
  • Achieves significant throughput improvements over the superscalar when many threads are running

Page 21

  • Intel Pentium 4, 2002: Hyper-Threading Technology (HTT), 30% speed improvement
  • MIPS MT
  • IBM POWER5, 2004: two-thread SMT engine
  • Sun UltraSPARC T1, 2005: CMT (SMT + CMP, chip-level multiprocessing)