Continuous Runahead: Transparent Hardware Acceleration for Memory Intensive Workloads
Milad Hashemi, Onur Mutlu, Yale N. Patt
UT Austin/Google, ETH Zürich, UT Austin
October 19th, 2016


Continuous Runahead Outline
• Overview of Runahead
• Runahead Limitations
• Continuous Runahead Dependence Chains
• Continuous Runahead Engine
• Continuous Runahead Evaluation
• Conclusions

Runahead Execution Overview
• Runahead dynamically expands the instruction window when the pipeline is stalled [Mutlu et al., 2003]
  • The core checkpoints architectural state
  • The result of the memory operation that caused the stall is marked as poisoned in the physical register file
  • The core continues to fetch and execute instructions
  • Operations are discarded instead of retired
  • The goal is to generate new independent cache misses
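
The steps above can be pictured with a toy model of poison propagation. This is a minimal sketch, not the mechanism from [Mutlu et al., 2003]: the tuple format for instructions, the `run_ahead` name, and the register names are all hypothetical, and the model ignores checkpointing, timing, and the caches themselves.

```python
from typing import List, Set, Tuple

# Hypothetical instruction format: (opcode, source registers, destination register).
Instr = Tuple[str, Tuple[str, ...], str]

def run_ahead(stalling_dest: str, window: List[Instr]) -> List[Instr]:
    """Return the loads whose addresses can still be computed during runahead.

    stalling_dest is the destination register of the load that caused the
    stall; its value is unknown, so it starts out poisoned.
    """
    poisoned: Set[str] = {stalling_dest}
    independent_loads: List[Instr] = []
    for op, srcs, dest in window:
        if any(src in poisoned for src in srcs):
            # A poisoned input makes the result unknown: propagate the poison.
            poisoned.add(dest)
            continue
        # All inputs are valid, so the result overwrites any earlier poison.
        # Simplification: every runahead load is treated as returning valid data.
        poisoned.discard(dest)
        if op == "LD":
            # The address does not depend on the stalling miss, so the
            # request can be sent to memory early.
            independent_loads.append((op, srcs, dest))
    # Nothing executed here retires; a real core restores its checkpoint
    # when the stalling load returns.
    return independent_loads

# Made-up instruction window following a load that missed into R1.
window = [
    ("ADD", ("R1", "R2"), "R3"),   # depends on the stalling load
    ("LD", ("R3",), "R4"),         # dependent load: address is poisoned
    ("ADD", ("R5", "R6"), "R7"),   # independent arithmetic
    ("LD", ("R7",), "R8"),         # independent load: can be issued early
]
independent = run_ahead("R1", window)
```

In this example only the load through `R7` is collected, because the load through `R3` inherits the poison from `R1`.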

Traditional Runahead Accuracy
[Figure: request accuracy (0% to 100%) for Runahead, GHB, Stream, and Markov+Stream]
Runahead is 95% accurate

Traditional Runahead Prefetch Coverage
[Figure: % of independent cache misses (0% to 100%)]
Runahead has only 13% prefetch coverage

Traditional Runahead Performance Gain
[Figure: % IPC improvement over no-prefetching baseline (0% to 350%); series: Runahead Performance Gain, Oracle Performance Gain]
Runahead has a 12% performance gain
Runahead Oracle has an 85% performance gain

Traditional Runahead Interval Length
[Figure: cycles per runahead interval (0 to 140) for 128 ROB, 256 ROB, 512 ROB, and 1024 ROB]
Runahead intervals are short
Low performance gain

Continuous Runahead Challenges
• Which instructions to use during Continuous Runahead?
  • Dynamically target the dependence chains that lead to critical cache misses
• What hardware to use for Continuous Runahead?
• How long should chains pre-execute for?

Dependence Chains
LD [R6] -> R8
ADD R9, R1 -> R6
ADD R4, R5 -> R9
LD [R3] -> R5
Cache Miss
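
One way to picture how such a chain is found is a backward dataflow walk from a load of interest. The sketch below is an illustration under assumptions: the tuple format, the `extract_chain` helper, and the unrelated `MUL` instruction are hypothetical, and `LD [R6] -> R8` is taken as the starting load purely for the example.

```python
from typing import List, Set, Tuple

# Hypothetical format: (opcode, source registers, destination register).
Instr = Tuple[str, Tuple[str, ...], str]

def extract_chain(program: List[Instr], load_index: int) -> List[Instr]:
    """Walk backward from program[load_index], keeping only the producers
    of registers that the load's address transitively depends on."""
    _, srcs, _ = program[load_index]
    needed: Set[str] = set(srcs)       # registers still awaiting a producer
    chain: List[Instr] = [program[load_index]]
    for instr in reversed(program[:load_index]):
        _, i_srcs, i_dest = instr
        if i_dest in needed:
            needed.discard(i_dest)     # found the most recent producer
            needed.update(i_srcs)      # its inputs must now be produced too
            chain.append(instr)
    chain.reverse()                    # return the chain in program order
    return chain

# The four operations from the slide in program order, plus one made-up
# unrelated instruction to show that the walk filters it out.
program = [
    ("LD", ("R3",), "R5"),
    ("MUL", ("R2", "R2"), "R7"),       # hypothetical, not on the slide
    ("ADD", ("R4", "R5"), "R9"),
    ("ADD", ("R9", "R1"), "R6"),
    ("LD", ("R6",), "R8"),
]
chain = extract_chain(program, load_index=4)
```

The walk keeps the two loads and the two adds and drops the `MUL`, since nothing in the address computation of `LD [R6]` reads `R7`.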

Dependence Chain Selection Policies
Experiment with 3 policies to determine the best policy to use for Continuous Runahead:
• PC-Based Policy
  • Use the dependence chain that has caused the most misses for the PC that is blocking retirement
• Maximum Misses Policy
  • Use a dependence chain from the PC that has generated the most misses for the application
• Stall Policy
  • Use a dependence chain from the PC that has caused the most full-window stalls for the application
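
The difference between the three policies comes down to which counter is consulted. The sketch below is only an illustration: the function names, the string chain identifiers, and all counter values are made up, and the slides do not describe how the counters are stored.

```python
from collections import Counter
from typing import Dict, Optional

def pc_based_policy(blocking_pc: int,
                    chain_misses: Dict[int, Counter]) -> Optional[str]:
    """Chain that has caused the most misses for the PC blocking retirement."""
    chains = chain_misses.get(blocking_pc)
    if not chains:
        return None
    return chains.most_common(1)[0][0]

def maximum_misses_policy(miss_count: Counter) -> Optional[int]:
    """PC that has generated the most misses for the application."""
    return miss_count.most_common(1)[0][0] if miss_count else None

def stall_policy(stall_count: Counter) -> Optional[int]:
    """PC that has caused the most full-window stalls for the application."""
    return stall_count.most_common(1)[0][0] if stall_count else None

# Made-up profile: the PC with the most misses is not the PC that
# stalls the pipeline most often.
miss_count = Counter({0x400: 120, 0x480: 45, 0x4C0: 300})
stall_count = Counter({0x400: 90, 0x480: 10, 0x4C0: 25})
chain_misses = {0x480: Counter({"chain_a": 30, "chain_b": 15})}

pc_choice = pc_based_policy(0x480, chain_misses)
miss_choice = maximum_misses_policy(miss_count)
stall_choice = stall_policy(stall_count)
```

With these numbers the Maximum Misses Policy picks `0x4C0` while the Stall Policy picks `0x400`, which is the kind of disagreement the policy comparison is meant to expose.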

Dependence Chain Selection Policies
[Figure: % IPC improvement (-20 to 100) for Runahead Buffer, PC-Policy, Maximum-Misses Policy, and Stall Policy]

Why does Stall Policy Work?
[Figure: number of PCs (0 to 1000) that account for 90% of Stalls, All Stalls, and All Misses]
19 PCs cover 90% of all stalls
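
A figure like "N PCs cover 90% of all stalls" is a cumulative-coverage count over per-PC stall totals. The sketch below shows that calculation on made-up numbers; the `pcs_to_cover` helper and the stall counts are hypothetical and are not data from the talk.

```python
from collections import Counter

def pcs_to_cover(stalls_per_pc: Counter, fraction: float) -> int:
    """Smallest number of PCs whose stalls add up to at least
    `fraction` of all recorded stalls."""
    target = fraction * sum(stalls_per_pc.values())
    covered = 0
    # Visit PCs from most to fewest stalls and stop once the target is met.
    for n, (_, count) in enumerate(stalls_per_pc.most_common(), start=1):
        covered += count
        if covered >= target:
            return n
    return len(stalls_per_pc)

# Made-up stall counts per PC (1000 stalls in total).
stalls = Counter({0x10: 700, 0x20: 150, 0x30: 80, 0x40: 40, 0x50: 30})
pcs_for_90_percent = pcs_to_cover(stalls, 0.90)
pcs_for_all = pcs_to_cover(stalls, 1.0)
```

Here three of the five PCs already account for 930 of the 1000 stalls, while covering every stall needs all five.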

Constrained Dependence Chain Storage
[Figure: normalized performance (0 to 1.2) when storing 1, 2, 4, 8, 16, or 32 chains]
Storing 1 chain provides 95% of the performance

Continuous Runahead Chain Generation
Maintain two structures:
• 32-entry cache of PCs to track the operations that cause the pipeline to frequently stall
• The last dependence chain for the PC that has caused the most full-window stalls
At every full-window stall:
• Increment the counter of the PC that caused the stall
• Generate a dependence chain for the PC that has caused the most stalls
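
The two structures and the per-stall update can be sketched as follows. This is a software illustration under assumptions, not the hardware design: the class names, the `generate_chain` callback, and the eviction rule for a full cache (drop the entry with the fewest stalls) are all hypothetical, since the slide does not specify them.

```python
from typing import Callable, Dict, Optional

class StallPCCache:
    """Toy model of a small cache of full-window stall counters, keyed by PC."""

    def __init__(self, entries: int = 32) -> None:
        self.entries = entries
        self.counters: Dict[int, int] = {}

    def record_stall(self, pc: int) -> int:
        """Count one stall for `pc` and return the PC with the most stalls."""
        if pc not in self.counters and len(self.counters) >= self.entries:
            # Assumption: evict the entry with the fewest recorded stalls.
            victim = min(self.counters, key=self.counters.get)
            del self.counters[victim]
        self.counters[pc] = self.counters.get(pc, 0) + 1
        return max(self.counters, key=self.counters.get)

class ChainGenerator:
    """Holds the PC cache plus the single most recently generated chain."""

    def __init__(self, generate_chain: Callable[[int], str]) -> None:
        self.cache = StallPCCache()
        self.last_chain: Optional[str] = None
        self.generate_chain = generate_chain   # hypothetical callback

    def on_full_window_stall(self, pc: int) -> Optional[str]:
        hottest_pc = self.cache.record_stall(pc)
        # Regenerate the stored chain for the PC with the most stalls.
        self.last_chain = self.generate_chain(hottest_pc)
        return self.last_chain

generator = ChainGenerator(lambda pc: f"chain@{pc:#x}")
for stalled_pc in (0x400, 0x480, 0x400):
    generator.on_full_window_stall(stalled_pc)
```

After the three stalls in the example, `generator.last_chain` names the chain for `0x400`, the PC with two recorded stalls.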

Runahead for Longer Intervals
[Diagram: Core 0, Core 1, Core 2, and Core 3, four LLC blocks, DRAM Channel 0, DRAM Channel 1, and the Continuous Runahead Engine (CRE)]

CRE Microarchitecture
• No Front-End
• No Register Renaming Hardware
• 32 Physical Registers
• 2-Wide
• No Floating Point or Vector Pipeline
• 4 kB Data Cache

Dependence Chain Generation
Dependence chain with core physical registers:
ADD P7 + 1 -> P1
SHIFT P1 -> P9
ADD P9 + P1 -> P3
SHIFT P3 -> P2
LD [P2] -> P8
Dependence chain with CRE physical registers:
ADD E5 + 1 -> E3
SHIFT E3 -> E4
ADD E4 + E3 -> E2
SHIFT E2 -> E1
LD [E1] -> E0
[Animated diagram, cycles 0 to 5: Register Remapping Table with columns EAX, EBX, ECX and rows Core Physical Register, CRE Physical Register, First CRE Physical Register. Search List entries shown: P2; P3; P9, P1; P1; P7. Register pairs shown: P8/E0, P2/E1, P3/E2, P1/E3, P9/E4, P7/E5. Final operation shown: MAP E3 -> E5]

Dependence Chain Generation
ADD E5 + 1 -> E3
SHIFT E3 -> E4
ADD E4 + E3 -> E2
SHIFT E2 -> E1
MEM_LD [E1] -> E0
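
The renaming from core physical registers (P) to CRE physical registers (E) can be sketched as a table filled in during a backward walk from the load. This is an illustration under assumptions: the tuple format and the `remap_chain` helper are hypothetical, and visiting source operands right to left is a choice made here only so that the numbering matches the slide's example.

```python
from typing import Dict, List, Tuple

# Hypothetical format: (opcode, source operands, destination register).
# Operands starting with "P" are core physical registers; anything else
# is treated as an immediate and left unchanged.
Instr = Tuple[str, Tuple[str, ...], str]

def remap_chain(chain: List[Instr]) -> Tuple[List[Instr], Dict[str, str]]:
    """Rename core physical registers (P*) to CRE physical registers (E*)."""
    table: Dict[str, str] = {}

    def cre_reg(core_reg: str) -> str:
        if core_reg not in table:
            table[core_reg] = f"E{len(table)}"   # next unused CRE register
        return table[core_reg]

    # Walk backward from the load at the end of the chain, as in a
    # backward dataflow walk, allocating registers on first sight.
    for _, srcs, dest in reversed(chain):
        cre_reg(dest)
        for src in reversed(srcs):               # assumed operand order
            if src.startswith("P"):
                cre_reg(src)

    remapped = [
        (op, tuple(table.get(s, s) for s in srcs), table[dest])
        for op, srcs, dest in chain
    ]
    return remapped, table

core_chain = [
    ("ADD", ("P7", "1"), "P1"),
    ("SHIFT", ("P1",), "P9"),
    ("ADD", ("P9", "P1"), "P3"),
    ("SHIFT", ("P3",), "P2"),
    ("LD", ("P2",), "P8"),
]
cre_chain, table = remap_chain(core_chain)
```

For this input the sketch yields the same E-register chain as the slide. It does not model the 32-register limit of the CRE or any bookkeeping needed to run the chain repeatedly.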

Interval Length
[Figure: Continuous Runahead request accuracy and GMean performance gain (0% to 100%) versus update interval of 1k, 5k, 10k, 25k, 50k, 100k, 250k, 500k, 1M, and 2M instructions retired]

System Configuration
• Single-Core/Quad-Core
  • 4-wide Issue
  • 256 Entry Reorder Buffer
  • 92 Entry Reservation Station
• Caches
  • 32 KB 8-Way Set Associative L1 I/D-Cache
  • 1 MB 8-Way Set Associative Shared Last Level Cache per Core
• Non-Uniform Memory Access Latency DDR3 System
  • 256-Entry Memory Queue
  • Batch Scheduling
• Prefetchers
  • Stream, Global History Buffer
  • Feedback Directed Prefetching: Dynamic Degree 1-32
• CRE Compute
  • 2-wide issue
  • 1 Continuous Runahead issue context with a 32-entry buffer and 32-entry physical register file
  • 4 kB Data Cache

Single-Core Performance
[Figure: % IPC improvement over no-prefetching baseline (0 to 120) for Runahead Buffer and Continuous Runahead]
21% single-core performance increase over prior state of the art

Single-Core Performance + Prefetching
[Figure: % IPC improvement over no-prefetching baseline (0 to 140) for Runahead Buffer, Continuous Runahead, Stream PF, GHB PF, Continuous Runahead + Stream, and Continuous Runahead + GHB]
Increases performance over and in conjunction with prefetching

Independent Miss Coverage
[Figure: % of independent cache misses prefetched (0% to 100%)]
70% prefetch coverage

Bandwidth Overhead
[Figure: normalized bandwidth (0 to 2.5) for Continuous Runahead, Stream PF, and GHB PF]
Low bandwidth overhead

Multi-Core Performance
[Figure: % weighted speedup improvement (0 to 60) on workloads H1 to H10 and GMean for Continuous Runahead, Stream PF, and GHB PF]
43% weighted speedup increase
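
The slides do not define weighted speedup. The sketch below assumes the commonly used definition, in which each application's IPC in the shared system is divided by its IPC when running alone and the ratios are summed. The function names and all IPC values are made up for illustration and are not results from the talk.

```python
from typing import Sequence

def weighted_speedup(ipc_shared: Sequence[float],
                     ipc_alone: Sequence[float]) -> float:
    """Sum over applications of IPC when sharing the system divided by
    IPC when running alone (assumed definition, see the text above)."""
    return sum(shared / alone for shared, alone in zip(ipc_shared, ipc_alone))

def improvement_percent(ws_new: float, ws_base: float) -> float:
    """Percentage improvement of one weighted speedup over another."""
    return 100.0 * (ws_new / ws_base - 1.0)

# Made-up IPC values for a four-application workload.
ipc_alone = [1.0, 2.0, 1.0, 2.0]
ipc_baseline = [0.5, 1.0, 0.5, 1.0]      # shared system, baseline
ipc_technique = [0.75, 1.5, 0.5, 1.0]    # shared system, with some technique

ws_base = weighted_speedup(ipc_baseline, ipc_alone)
ws_new = weighted_speedup(ipc_technique, ipc_alone)
gain = improvement_percent(ws_new, ws_base)
```

In this made-up case the weighted speedup rises from 2.0 to 2.5, a 25% improvement.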

Multi-Core Performance + Prefetching
[Figure: % weighted speedup improvement (0 to 90) on workloads H1 to H10 and GMean for Continuous Runahead, Stream PF, GHB PF, Continuous Runahead + Stream, and Continuous Runahead + GHB]
13% weighted speedup gain over GHB prefetching

Multi-Core Energy Evaluation
[Figure: energy normalized to no-prefetching baseline (0 to 1) on workloads H1 to H10 and Mean for Continuous Runahead, Stream PF, GHB PF, Continuous Runahead + Stream, and Continuous Runahead + GHB]
22% energy reduction

Conclusions
• Runahead prefetch coverage is limited by the duration of each runahead interval
• To remove this constraint, we introduce the notion of Continuous Runahead
• We can dynamically identify the most critical LLC misses to target with Continuous Runahead by tracking the operations that cause the pipeline to frequently stall
• We migrate these dependence chains to the CRE where they are executed continuously in a loop

Conclusions
• Continuous Runahead greatly increases prefetch coverage
• Increases single-core performance by 34.4%
• Increases multi-core performance by 43.3%
• Synergistic with various types of prefetching
