Duplicating and Deconstructing Virtual Load/Store Queues

Vikas Garg, Sonal Agarwal

5th Annual Workshop on Duplicating, Deconstructing and Debunking, June 18, 2006



Motivation

• Large instruction window and load/store queue to achieve high performance
• Speculative execution of memory instructions
• Replay traps due to re-ordering of memory accesses
• Pipeline flushes to handle replay traps
    • Wasted pipeline operations (Power)
    • Excessive L1 accesses (Power and Locality)


Motivation

• Virtual Load/Store Queue (VLSQ) proposal [Jaleel, HPCA'05]
    • Use a large load/store queue for the front end
    • Throttle memory instructions at the issue stage
    • Reduces the re-ordering of memory instructions
    • Helps avoid replay traps
    • Saves power
    • No big performance drop

What if we simply reduce the LSQ size?

Does a VLSQ really work?


Outline

• Motivation
• VLSQ Introduction
• Simulation Setup
• VLSQ Results
• VLSQ vs. LSQ
• Conclusions


VLSQ Introduction

[Figure: a load/store queue with entries LD/ST 0 to LD/ST 15, the pointers LSQ Head, LSQ Tail, Virtual Head and Virtual Tail, regions labeled FRONT END and ISSUE, and a legend of entry states: ISSUED, NOT READY, BLOCKED, EMPTY.]
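The four entry states in the figure can be illustrated with a minimal sketch. It assumes the virtual window starts at the LSQ head and spans a fixed number of entries, and that every ready entry inside the window issues immediately (issue width is ignored). The function name `classify_slots` and its parameters are hypothetical and do not come from the slides or from [Jaleel, HPCA'05].

```python
from enum import Enum


class State(Enum):
    ISSUED = "ISSUED"
    NOT_READY = "NOT READY"
    BLOCKED = "BLOCKED"
    EMPTY = "EMPTY"


def classify_slots(ready, capacity, window=None):
    """Label each LSQ slot, counting from the LSQ head (oldest entry).

    ready    -- one bool per valid entry, oldest first (assumed readiness flag)
    capacity -- total number of LSQ slots
    window   -- virtual window size; None models an infinite VLSQ
    """
    if len(ready) > capacity:
        raise ValueError("more entries than LSQ slots")
    # Assumption: the virtual window covers the oldest `window` valid entries.
    limit = len(ready) if window is None else min(window, len(ready))
    states = []
    for slot in range(capacity):
        if slot >= len(ready):
            states.append(State.EMPTY)        # past the LSQ tail
        elif slot >= limit:
            states.append(State.BLOCKED)      # in the LSQ, outside the window
        elif ready[slot]:
            states.append(State.ISSUED)       # inside the window and ready
        else:
            states.append(State.NOT_READY)    # inside the window, still waiting
    return states


if __name__ == "__main__":
    entries = [True, False, True, True] + [True] * 8   # 12 valid entries
    for slot, state in enumerate(classify_slots(entries, capacity=16, window=4)):
        print(f"LD/ST {slot}: {state.value}")
```

With a 16-slot queue, 12 valid entries and a window of 4, only the four oldest entries can issue. The next eight are held back even if they are ready, which is the throttling effect the proposal relies on.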


VLSQ Pipeline Operation

[Figure: pipeline diagram with Fetch/Decode, Rename, Issue, Register File, Integer and Memory blocks, and the Load/Store Queue; two "Stall" markers.]


Simulation Setup

• Alpha 21264 simulator (sim-alpha)
    • I-Cache (64 KB, 1 cycle); D-Cache (64 KB, 3 cycles)
    • L2 Cache (2 MB, 15 cycles)
    • 1.3 GB/s DDR SDRAM (DRAMsim)
    • 1024-entry store-wait table
    • 2048-line 2-level bimodal branch predictor
    • Pipeline width: Fetch (8); Issue (8/4); Commit (11)
    • Functional units: Int (4), Int-Mul (4), FP (1), FP-Mul (1)
• Subset of SPEC 2000 benchmarks
    • FP: applu, art, mgrid, swim; INT: gcc, gzip, mcf, twolf
    • Warm-up: 2 billion instructions; Data: 500 million instructions
    • Reference input


Simulation Setup (Continued…)

Baseline Out-of-Order Configurations

ROB Size   Registers   Issue Queue   LSQ Size   VLSQ Size
80         80/72       20/15         32/32      Infinite
128        160/144     40/30         64/64      Infinite
256        320/288     80/60         128/128    Infinite
512        640/576     160/120       256/256    Infinite

• For VLSQ, use the baseline LSQ size and VLSQ sizes of Inf, 64, 32, 16, 8, 4, and 2
• For LSQ, use an infinite VLSQ and LSQ sizes of 64, 32, 16, 8, 4, and 2
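The two sweeps described above can be enumerated with a short sketch. The baseline rows are copied from the configuration table; the dictionary keys and function names are hypothetical. The slash-separated pairs are kept as strings because the slides do not say how each pair is split, and the reduced LSQ sizes are kept as single integers for the same reason.

```python
from itertools import product

# Baseline rows from the configuration table, keyed by ROB size.
BASELINES = {
    80:  {"registers": "80/72",   "issue_queue": "20/15",   "lsq": "32/32"},
    128: {"registers": "160/144", "issue_queue": "40/30",   "lsq": "64/64"},
    256: {"registers": "320/288", "issue_queue": "80/60",   "lsq": "128/128"},
    512: {"registers": "640/576", "issue_queue": "160/120", "lsq": "256/256"},
}
VLSQ_SIZES = ["Inf", 64, 32, 16, 8, 4, 2]
LSQ_SIZES = [64, 32, 16, 8, 4, 2]


def vlsq_runs():
    """VLSQ sweep: baseline LSQ size, varying virtual window size."""
    return [dict(BASELINES[rob], rob=rob, vlsq=size)
            for rob, size in product(BASELINES, VLSQ_SIZES)]


def lsq_runs():
    """LSQ sweep: infinite VLSQ, varying physical LSQ size."""
    return [dict(BASELINES[rob], rob=rob, lsq=size, vlsq="Inf")
            for rob, size in product(BASELINES, LSQ_SIZES)]


if __name__ == "__main__":
    print(len(vlsq_runs()), "VLSQ configurations")
    print(len(lsq_runs()), "LSQ configurations")
```

Each configuration would then be run once per benchmark in the SPEC 2000 subset.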


VLSQ - Performance

[Figure: CPI (y-axis, 0 to 2.5) versus ROB size (80, 128, 256, 512), one series per VLSQ size: Inf, 64, 32, 16, 8, 4, 2.]


VLSQ - Trap Overhead

[Figure: trap overhead as a percentage of execution cycles (y-axis, 0% to 50%) versus ROB size (80, 128, 256, 512), one series per VLSQ size: Inf, 64, 32, 16, 8, 4, 2.]


VLSQ - Map/Rename Stalls

[Figure: stall cycles per thousand instructions (y-axis, 0 to 2,500) for each ROB-VLSQ size combination (ROB 80, 128, 256, 512; VLSQ Inf, 64, 32, 16, 8, 4, 2), broken down into ROB, MEM, and ISSUE stalls.]


VLSQ Pipeline Operation (Continued…)

[Figure: pipeline diagram with Fetch/Decode, Rename, Issue, Register File, Integer and Memory blocks, and the Load/Store Queue; four "Stall" markers.]


VLSQ Summary

• Reduces speculation and replay traps
• Not a big performance drop
• Saves power
• Stall propagates backwards
    • Need a lot of memory-independent instructions
• On the critical path?

What if we simply reduce the LSQ size?

VLSQ works!


Small Load/Store Queue

[Figure: pipeline diagram with Fetch/Decode, Rename, Issue, Register File, Integer and Memory blocks, and the Load/Store Queue; two "Stall" markers.]


VLSQ vs. LSQ (Map/Rename Stalls)

[Figure, two panels. "VLSQ Stalls": stall cycles (y-axis, 0 to 2,500) for each ROB-VLSQ size combination (ROB 80, 128, 256, 512; VLSQ Inf, 64, 32, 16, 8, 4, 2). "LSQ Stalls": stall cycles (y-axis, 0 to 2,500) for each ROB-LSQ size combination (ROB 80, 128, 256, 512; LSQ Base, 64, 32, 16, 8, 4, 2). Both panels are broken down into ROB, MEM, and ISSUE stalls.]


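VLSQ vs. LSQ (Performance)

[Figure, two panels. "VLSQ Performance": CPI (y-axis, 0 to 2.5) versus ROB size (80, 128, 256, 512), one series per VLSQ size: Inf, 64, 32, 16, 8, 4, 2. "LSQ Performance": CPI (y-axis, 0 to 2.5) versus ROB size (80, 128, 256, 512), one series per LSQ size: Base, 64, 32, 16, 8, 4, 2.]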


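VLSQ vs. LSQ (Trap Overhead)

[Figure, two panels. "VLSQ Trap Overhead": percentage of execution cycles (y-axis, 0% to 50%) versus ROB size (80, 128, 256, 512), one series per VLSQ size: Inf, 64, 32, 16, 8, 4, 2. "LSQ Trap Overhead": percentage of execution cycles (y-axis, 0% to 50%) versus ROB size (80, 128, 256, 512), one series per LSQ size: Base, 64, 32, 16, 8, 4, 2.]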


VLSQ vs. LSQ (Summary)

                     Baseline   VLSQ   LSQ
CPI                  1.35       1.34   1.35
ROB Stall Cycles     1          0      0
MEM Stall Cycles     3          0      534
ISSUE Stall Cycles   233        363    1
Total Stall Cycles   236        364    536
Trap Overhead        45%        36%    26%
L1 Accesses          648        499    451
L1 Misses            96         94     91
Fetch Ops.           0%         12%    31%
Map Ops.             0%         12%    34%
Exec Ops.            0%         12%    18%

ROB Size: 512; VLSQ Size: 16; LSQ Size: 16
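The relative savings implied by the summary table can be worked out with a short sketch. The numbers are copied from the table; the dictionary layout and the helper names `reduction` and `component_sum` are hypothetical. In each column the listed total differs by one from the sum of the ROB, MEM, and ISSUE entries, which would be consistent with rounding of the individual entries, although the slides do not say so.

```python
# Values copied from the summary table (ROB 512, VLSQ 16, LSQ 16).
SUMMARY = {
    "Baseline": {"rob": 1, "mem": 3,   "issue": 233, "total": 236,
                 "l1_accesses": 648, "l1_misses": 96},
    "VLSQ":     {"rob": 0, "mem": 0,   "issue": 363, "total": 364,
                 "l1_accesses": 499, "l1_misses": 94},
    "LSQ":      {"rob": 0, "mem": 534, "issue": 1,   "total": 536,
                 "l1_accesses": 451, "l1_misses": 91},
}


def reduction(metric, config, reference="Baseline"):
    """Fractional reduction of `metric` for `config` relative to `reference`."""
    ref = SUMMARY[reference][metric]
    return (ref - SUMMARY[config][metric]) / ref


def component_sum(config):
    """Sum of the ROB, MEM, and ISSUE stall entries for one column."""
    row = SUMMARY[config]
    return row["rob"] + row["mem"] + row["issue"]


if __name__ == "__main__":
    for name in ("VLSQ", "LSQ"):
        print(f"{name}: L1 accesses reduced by "
              f"{reduction('l1_accesses', name):.1%}, "
              f"L1 misses by {reduction('l1_misses', name):.1%}")
    for name, row in SUMMARY.items():
        print(f"{name}: components sum to {component_sum(name)}, "
              f"table lists {row['total']}")
```

On these figures the smaller LSQ removes roughly 30% of L1 accesses against roughly 23% for the VLSQ, at the same CPI as the baseline.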


LSQ Summary

• Reduces speculation and replay traps
• Performance vs. power tradeoff better than that for VLSQ
• Simpler than VLSQ
    • Not on the critical path
    • Additional power saving from a smaller LSQ

Reducing LSQ size is better than using VLSQ!

VLSQ works BUT…


Dynamic Throttling

• Easy to do dynamic throttling using VLSQ
    • Just need to tweak the VLSQ window size
• Might be better to just vary the LSQ size
    • Maybe we can just shut down parts of the LSQ
• Better to throttle in the issue stage using
    • Just-in-time instruction delivery [Karkhanis, ISLPED'02]


Conclusions

• Speculative execution of memory instructions leads to wasted power due to replay traps
• VLSQ helps to reduce memory re-ordering and replay traps
• Reducing the LSQ size is more effective
• For power saving it is better to throttle earlier in the pipeline


Questions?
