An Improved Analytical Superscalar Microprocessor Memory Model. Xi Chen and Tor Aamodt, Electrical and Computer Engineering, University of British Columbia. June 22nd, 2008 (MoBS-2008)


Page 1: An Improved Analytical Superscalar Microprocessor Memory Model

An Improved Analytical Superscalar Microprocessor Memory Model

Xi Chen and Tor Aamodt

Electrical and Computer Engineering, University of British Columbia

June 22nd, 2008 (MoBS-2008)

Page 2: An Improved Analytical Superscalar Microprocessor Memory Model

Outline

• Introduction
• Background
• Modeling Long Latency Memory Systems
    ◦ Pending Hits
    ◦ Accurate Hidden Miss Latency Estimation
    ◦ A Limited Number of Outstanding Cache Misses
    ◦ SWAM & SWAM-MLP
• Methodology
• Results
• Future Work and Conclusions

Page 3: An Improved Analytical Superscalar Microprocessor Memory Model

Introduction

• Processor design is a complicated task
• Cycle-accurate performance simulators are extensively used to explore the design space
    ◦ Creating and debugging a simulator takes time
    ◦ Running detailed simulations is slow

Page 4: An Improved Analytical Superscalar Microprocessor Memory Model

Introduction

• Simulating future large-scale CMPs?

[Figure: time to simulate 1 second of LCMP (32-core, 4 threads per core) workload execution: over 8 months. Source: Li Zhao et al., Exploring Large-Scale CMP Architectures Using ManySim, IEEE Micro, Issue 4, 2007.]

Page 5: An Improved Analytical Superscalar Microprocessor Memory Model

Introduction

• Analytical Modeling
    ◦ An alternative to cycle-accurate simulations

[Figure: program characteristics and microarchitectural parameters feed the analytical model, which produces a performance estimate.]

Page 6: An Improved Analytical Superscalar Microprocessor Memory Model

Introduction

• Analytical Modeling vs. Performance Simulations
    ◦ Pros
        ▪ Fast (orders of magnitude faster)
        ▪ Provides more insight for chip designers
    ◦ Cons
        ▪ Less accurate than performance simulations
        ▪ Covers only major microarchitectural parameters

Page 7: An Improved Analytical Superscalar Microprocessor Memory Model

Background

• First-order Model
    ◦ Tejas Karkhanis and James Smith. A First-order Superscalar Processor Model. ISCA’04.
    ◦ Stable performance is disrupted by different types of miss-events

[Figure: IPC over time; the steady-state IPC is interrupted by miss-event #1, miss-event #2, and miss-event #3.]

Page 8: An Improved Analytical Superscalar Microprocessor Memory Model

Background

• First-order Model
    ◦ The overall performance (CPI) is modeled as the sum of isolated CPI components due to each type of miss-event.

$$\mathrm{CPI}_{\mathrm{modeled}} = \mathrm{CPI}_{\mathrm{base}} + \mathrm{CPI}_{\mathrm{bmsp}} + \mathrm{CPI}_{\mathrm{D\$miss}} + \mathrm{CPI}_{\mathrm{I\$miss}}$$

        ▪ CPI_base: in absence of miss-events
        ▪ CPI_bmsp: branch mispredictions
        ▪ CPI_D$miss: data cache misses
        ▪ CPI_I$miss: instruction cache misses

    ◦ The baseline in our paper is our careful re-implementation of the first-order model based upon the available details described in the ISCA’04 paper and its follow-up work.
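The additive structure of the first-order model can be illustrated with a few lines of Python. This is a minimal sketch: the function name and the component values are hypothetical and chosen only for illustration; they are not measurements from the paper.

```python
def modeled_cpi(cpi_base, cpi_bmsp, cpi_dmiss, cpi_imiss):
    """Sum the isolated CPI components of the first-order model."""
    return cpi_base + cpi_bmsp + cpi_dmiss + cpi_imiss


# Hypothetical component values, for illustration only.
total = modeled_cpi(cpi_base=0.5, cpi_bmsp=0.25, cpi_dmiss=1.0, cpi_imiss=0.125)
print(total)  # 1.875
```

Because each component is estimated in isolation, improving one term (here the data cache miss term) leaves the others untouched.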

Page 9: An Improved Analytical Superscalar Microprocessor Memory Model

Pending Data Cache Hits

[Figure: CPI_D$miss for mcf versus memory access latency (20, 50, 100, 200, 500 cycles); series: actual, w/o PH.]

Page 10: An Improved Analytical Superscalar Microprocessor Memory Model

Contributions

• Pending Hits: error is reduced 43.5% -> 27.5%
• Accurate Hidden Miss Latency Estimation: 15.5% -> 10.3%
• Profile Window Selection: 29.2% -> 10.3%; 32.4% -> 9.2%

[Figure: dependence graph of instructions i1 to i8, and profile windows over the trace i1 to i16 for the baseline and for the contribution.]

Page 11: An Improved Analytical Superscalar Microprocessor Memory Model

Modeling CPI_D$miss: Baseline

• Assuming the instruction window size is eight

[Figure: instruction trace i1 to i19 with the dependence graph of the window i1 to i8.]

num_serialized_D$miss: 0 -> 1

Page 12: An Improved Analytical Superscalar Microprocessor Memory Model

Modeling CPI_D$miss: Baseline

• Assuming the instruction window size is eight

[Figure: instruction trace i1 to i19 with the dependence graph of the window i9 to i16.]

num_serialized_D$miss: 1 -> 3
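The window-by-window counting on the two slides above can be sketched as follows. This is a minimal sketch, assuming each window contributes the length of its longest chain of data-dependent misses; the `Inst` record, the function names, and the 16-instruction trace are hypothetical and do not reproduce the dependence graphs drawn on the slides.

```python
from dataclasses import dataclass, field


@dataclass
class Inst:
    is_miss: bool = False                     # long-latency data cache miss?
    deps: list = field(default_factory=list)  # indices of producer instructions


def serialized_misses_in_window(trace, start, end):
    """Longest chain of dependent misses among trace[start:end]."""
    depth = {}
    for i in range(start, end):
        # Producers outside the window are treated as already complete.
        longest = max((depth[d] for d in trace[i].deps if d in depth), default=0)
        depth[i] = longest + (1 if trace[i].is_miss else 0)
    return max(depth.values(), default=0)


def plain_profile(trace, rob_size):
    """Accumulate num_serialized_D$miss over fixed, back-to-back windows."""
    total = 0
    for start in range(0, len(trace), rob_size):
        end = min(start + rob_size, len(trace))
        total += serialized_misses_in_window(trace, start, end)
    return total


# Hypothetical trace: three independent misses in the first window,
# two dependent misses in the second.
trace = [Inst() for _ in range(16)]
for i in (0, 2, 4, 8):
    trace[i].is_miss = True
trace[10] = Inst(is_miss=True, deps=[8])
print(plain_profile(trace, rob_size=8))  # 3
```

With this trace the counter goes 0 -> 1 after the first window and 1 -> 3 after the second, the same progression as on the slides.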

Page 13: An Improved Analytical Superscalar Microprocessor Memory Model

Modeling CPI_D$miss: Baseline

• mem_lat: main memory latency
• total_num_instructions: total number of instructions committed
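The equation on this slide did not survive extraction. Given the two definitions above and the num_serialized_D$miss counter from the preceding slides, one form consistent with them is shown below. It is offered as an assumption, not as a quotation of the slide.

$$\mathrm{CPI}_{\mathrm{D\$miss}} = \frac{\mathit{num\_serialized\_D\$miss} \times \mathit{mem\_lat}}{\mathit{total\_num\_instructions}}$$

As a worked example with hypothetical numbers: 3 serialized misses, a 200-cycle memory latency, and 16 committed instructions give $3 \times 200 / 16 = 37.5$ cycles per instruction attributed to data cache misses.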

Page 14: An Improved Analytical Superscalar Microprocessor Memory Model

Pending Data Cache Hits

• A pending data cache hit results from a memory reference to a cache block for which a request has already been initiated by another instruction
• Pending data cache hits are common due to spatial locality in applications

Page 15: An Improved Analytical Superscalar Microprocessor Memory Model

Pending Data Cache Hits

• A (traditional) cache simulator cannot identify pending cache hits due to the lack of timing information

    ld  r2, (0)r1
    ld  r3, (4)r1
    add r4, r2, r5
    add r6, r3, r5
    ld  r7, (0)r6
    ……

[Figure: the instruction sequence is run through an L1 cache and an L2 cache; the L1 cache outcomes are labeled "miss" and "hit", and the L2 cache outcome is labeled "miss".]

Page 16: An Improved Analytical Superscalar Microprocessor Memory Model

Pending Data Cache Hits

    i1: ld  r2, (0)r1
    i2: ld  r3, (4)r1
    i3: add r4, r2, r5
    i4: add r6, r3, r5
    i5: ld  r7, (0)r6
    i6: or  r8, r4, r5
    i7: add r9, r5, r7
    i8: or  r10, r8, r9
    i9: ……

[Figure: dependence graph of i1 to i8.]

• Not considering pending hits: num_serialized_D$miss += 1 (wrong value)

Page 17: An Improved Analytical Superscalar Microprocessor Memory Model

Modeling Pending Hits

    i1: ld  r2, (0)r1      miss
    i2: ld  r3, (4)r1      hit (i1)
    i3: add r4, r2, r5
    i4: add r6, r3, r5
    i5: ld  r7, (0)r6      miss
    i6: or  r8, r4, r5
    i7: add r9, r5, r7
    i8: or  r10, r8, r9
    i9: ……

[Figure: dependence graph of i1 to i8, with one annotation reading "latency modeled by CPIbase".]

• Considering pending hits: num_serialized_D$miss += 2 (correct value)
• Error is reduced from 43.5% to 27.5% with best fixed cycle compensation for miss latency
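The difference between the two counts for this window can be reproduced with a short sketch. It assumes a pending hit is modeled by making the hit depend on the miss that requested its block, without adding a miss of its own. The tuple layout and the function name are hypothetical; the instruction sequence and its miss/hit labels follow the slide.

```python
def serialized_misses(insts, model_pending_hits):
    """Longest chain of dependent misses in one profile window.

    insts: (kind, deps, pending_on) tuples in program order, where kind is
    "miss", "pending_hit", or "other"; deps are producer indices; pending_on
    is the index of the miss that requested the block (None otherwise).
    """
    depth = []
    for kind, deps, pending_on in insts:
        producers = list(deps)
        if model_pending_hits and kind == "pending_hit":
            # The data arrives only when the earlier miss completes.
            producers.append(pending_on)
        longest = max((depth[p] for p in producers), default=0)
        depth.append(longest + (1 if kind == "miss" else 0))
    return max(depth, default=0)


window = [
    ("miss", [], None),       # i1: ld  r2, (0)r1
    ("pending_hit", [], 0),   # i2: ld  r3, (4)r1   (same block as i1)
    ("other", [0], None),     # i3: add r4, r2, r5
    ("other", [1], None),     # i4: add r6, r3, r5
    ("miss", [3], None),      # i5: ld  r7, (0)r6
    ("other", [2], None),     # i6: or  r8, r4, r5
    ("other", [4], None),     # i7: add r9, r5, r7
    ("other", [5, 6], None),  # i8: or  r10, r8, r9
]
print(serialized_misses(window, model_pending_hits=False))  # 1
print(serialized_misses(window, model_pending_hits=True))   # 2
```

Without the pending hit, i5 appears independent of i1 and the two misses are counted as overlapped; with it, i5 is serialized behind i1 through i2 and i4.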

Page 18: An Improved Analytical Superscalar Microprocessor Memory Model

Compensating Overestimate [K&S ISCA’04, Karkhanis PhD Thesis]

• Part of the latency of a miss may be hidden
    ◦ When a miss issues, it can be at (or near) the commit side of the ROB

[Figure: ROB drawn from fetch to commit, with the miss at the commit side.]

Page 19: An Improved Analytical Superscalar Microprocessor Memory Model

Compensating Overestimate [K&S ISCA’04, Karkhanis PhD Thesis]

• Part of the latency of a miss may be hidden
    ◦ When a miss issues, it can be at (or near) the middle of the ROB

[Figure: ROB drawn from fetch to commit, with the miss in the middle.]

Page 20: An Improved Analytical Superscalar Microprocessor Memory Model

Compensating Overestimate [K&S ISCA’04, Karkhanis PhD Thesis]

• Part of the latency of a miss may be hidden
    ◦ When a miss issues, it can be at (or near) the fetch side of the ROB

[Figure: ROB drawn from fetch to commit, with the miss at the fetch side.]

Page 21: An Improved Analytical Superscalar Microprocessor Memory Model

Compensating Overestimate

• Fixed-cycle compensation used by prior work is not accurate for all the benchmarks we study.
• We propose to compensate the overestimate using the average distance between consecutive misses.

Page 22: An Improved Analytical Superscalar Microprocessor Memory Model

Compensating Overestimate

[Figure: ROB drawn from fetch to commit, containing two load misses ("ld miss").]

• Over 70% of instructions can be used to hide miss latency in pointer chasing
• dist: average distance between two consecutive misses (the distance between two misses is saturated by the size of the ROB)
• Error is reduced from 15.5% (best fixed-cycle compensation) to 10.3%
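The `dist` statistic defined above can be computed from a list of miss positions as sketched below. The function name, the sample positions, and the choice to return `None` when fewer than two misses exist are assumptions. How `dist` is converted into hidden cycles is not recoverable from the extracted slide text, so that step is not shown.

```python
def average_miss_distance(miss_positions, rob_size):
    """Average distance, in instructions, between consecutive misses.

    miss_positions: increasing instruction indices of the misses.
    Each individual distance is saturated at rob_size.
    """
    gaps = [min(later - earlier, rob_size)
            for earlier, later in zip(miss_positions, miss_positions[1:])]
    if not gaps:
        return None  # assumption: undefined with fewer than two misses
    return sum(gaps) / len(gaps)


# Hypothetical miss positions; the 290-instruction gap saturates at 256.
print(average_miss_distance([0, 10, 300, 334], rob_size=256))  # 100.0
```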

Page 23: An Improved Analytical Superscalar Microprocessor Memory Model

Modeling a limited number of MSHRs

• The model thus far has assumed that the number of outstanding cache misses supported is unlimited.
• We propose a technique to model a limited number of outstanding cache misses supported.

Page 24: An Improved Analytical Superscalar Microprocessor Memory Model

Modeling a limited number of MSHRs

• We stop a profile step and update num_serialized_D$miss when the number of misses analyzed is equal to the number of Miss Status Holding Registers (MSHRs)

[Figure: instruction trace i1 to i16 divided into profile steps, with ROBsize = 8 and NMSHR = 4.]

• With plain profiling, error reduces from 32.4% to 23.9% for 8 MSHRs
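The stopping rule above can be sketched as a window-selection routine. This is a minimal sketch under two assumptions: a profile step also ends once it spans ROBsize instructions, and the next step begins at the instruction right after the previous one ended. The function name and the sample trace are hypothetical.

```python
def profile_windows(is_miss, rob_size, num_mshr):
    """Return half-open (start, end) profile steps over the trace.

    is_miss: one boolean per instruction, True for a data cache miss.
    A step ends when it covers rob_size instructions or when the number
    of misses analyzed reaches num_mshr, whichever happens first.
    """
    windows = []
    start, n = 0, len(is_miss)
    while start < n:
        end, misses = start, 0
        while end < n and end - start < rob_size:
            misses += 1 if is_miss[end] else 0
            end += 1
            if misses == num_mshr:
                break  # all MSHRs are in use: stop this profile step
        windows.append((start, end))
        start = end
    return windows


# Hypothetical 16-instruction trace with misses at positions 0, 1, 2, 3, 5, 9.
trace = [i in (0, 1, 2, 3, 5, 9) for i in range(16)]
print(profile_windows(trace, rob_size=8, num_mshr=4))
# [(0, 4), (4, 12), (12, 16)]
```

The first step is cut short after four misses, while the later steps run to the full ROB size because they never fill the MSHRs.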

Page 25: An Improved Analytical Superscalar Microprocessor Memory Model

Start-with-a-miss (SWAM) Profiling

• Each profile step starts with a cache miss

[Figure: profile windows over the instruction trace i1 to i16 under three schemes: plain profiling (baseline), sliding profiling, and SWAM profiling.]

• Sliding profiling does not improve the accuracy much but slows down the model significantly
• SWAM profiling: 29.2% (error of plain) -> 10.3% (error of SWAM)
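The contrast between plain and SWAM window selection can be sketched as below. This is a minimal sketch, assuming a SWAM window spans ROBsize instructions from its starting miss and the search for the next starting miss resumes where the window ended. Function names and the sample trace are hypothetical.

```python
def plain_windows(num_insts, rob_size):
    """Fixed, back-to-back windows regardless of where misses fall."""
    return [(s, min(s + rob_size, num_insts))
            for s in range(0, num_insts, rob_size)]


def swam_windows(is_miss, rob_size):
    """Windows that each begin at a cache miss (start-with-a-miss)."""
    windows = []
    i, n = 0, len(is_miss)
    while i < n:
        if not is_miss[i]:
            i += 1  # skip forward until the next miss
            continue
        end = min(i + rob_size, n)
        windows.append((i, end))
        i = end
    return windows


# Hypothetical 16-instruction trace with misses at positions 5, 6, and 14.
trace = [i in (5, 6, 14) for i in range(16)]
print(plain_windows(len(trace), rob_size=8))  # [(0, 8), (8, 16)]
print(swam_windows(trace, rob_size=8))        # [(5, 13), (14, 16)]
```

Starting at the miss keeps the instructions that follow it in the same window, which a fixed window boundary would otherwise split off.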

Page 26: An Improved Analytical Superscalar Microprocessor Memory Model

SWAM-MLP

[Figure: profile window i1 to i8 with its dependence graph (ROBsize = 8, NMSHR = 4).]

• SWAM: num_serialized_miss += 4 (wrong value)
• SWAM-MLP: num_serialized_miss += 2 (correct value)
• 12.8% (error of SWAM) to 9.2% (error of SWAM-MLP) for 8 MSHRs
• 23.2% (error of SWAM) to 9.9% (error of SWAM-MLP) for 4 MSHRs

Page 27: An Improved Analytical Superscalar Microprocessor Memory Model

Methodology

• We modified SimpleScalar and used a set of memory-intensive benchmarks from the SPEC 2000 and OLDEN suites whose cache misses per thousand instructions (MPKI) exceed 10 for our cache configurations

Page 28: An Improved Analytical Superscalar Microprocessor Memory Model

Results

• Unlimited number of outstanding cache misses supported

[Figure: error (%) per benchmark (app, art, eqk, luc, swm, mcf, em, hth, prm, lbm, mean) for Plain w/o PH, Plain w/ PH, and SWAM w/ PH; y-axis from -100 to 100. Error: 39.7% -> 29.2% -> 10.3%.]

Page 29: An Improved Analytical Superscalar Microprocessor Memory Model

Results

• Modeling a limited number of MSHRs
    ◦ SWAM-MLP further decreases the error of SWAM from 12.8% to 9.2% (8 MSHRs, e.g., Prescott) and from 23.2% to 9.9% (4 MSHRs)

[Figure: error (%) per benchmark (app, art, eqk, luc, swm, mcf, em, hth, prm, lbm, mean) for Plain w/o MSHR, Plain w/ MSHR, SWAM, and SWAM-MLP; y-axis from -100 to 100.]

• Our improvements do not slow down the first-order model

Page 30: An Improved Analytical Superscalar Microprocessor Memory Model

Current / Future Work

• Analytically modeling the performance impact of hardware data prefetching (by the pending hits caused by data prefetches)
• Analytically modeling the throughput of fine-grain multithreaded microprocessors (e.g., Sun’s Niagara)
• Extending the analytical model for CMPs with superscalar, out-of-order execution cores
• Analytically modeling the performance of SMT superscalar cores

Page 31: An Improved Analytical Superscalar Microprocessor Memory Model

Conclusions

• Modeling Long Latency Memory System
    ◦ Pending Data Cache Hit
    ◦ Accurate Hidden Miss Latency Estimation
    ◦ Modeling MSHRs
    ◦ SWAM & SWAM-MLP
• Overall our improvements reduce the error of our baseline from 39.7% to 10.3%. The error is less than 10% when modeling MSHRs. Our improvements do not slow down the first-order model.