June 18, 2006, 5th Annual Workshop on Duplicating, Deconstructing and Debunking
Duplicating and Deconstructing Virtual Load/Store Queues
Vikas Garg, Sonal Agarwal
Motivation
• Large instruction window and load/store queue to achieve high performance
• Speculative execution of memory instructions
• Replay traps due to re-ordering of memory accesses
• Pipeline flushes to handle replay traps
  • Wasted pipeline operations (Power)
  • Excessive L1 accesses (Power and Locality)
Motivation
• Virtual Load/Store Queue (VLSQ) proposal [Jaleel, HPCA'05]
  • Use a large load/store queue for the front end
  • Throttle memory instructions at the issue stage
  • Reduces the re-ordering of memory instructions
  • Helps in avoiding replay traps
  • Saves power
  • No big performance drop
• What if we simply reduce the LSQ size?
• Does a VLSQ really work?
Outline
• Motivation
• VLSQ Introduction
• Simulation Setup
• VLSQ Results
• VLSQ vs. LSQ
• Conclusions
VLSQ Introduction
[Figure: a 16-entry load/store queue (LD/ST 0 to LD/ST 15) with LSQ Head and LSQ Tail pointers and a smaller window bounded by Virtual Head and Virtual Tail. Labels: FRONT END, ISSUE. Entry states: ISSUED, NOT READY, BLOCKED, EMPTY.]
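The windowing idea in the figure can be sketched as a toy model in Python. This is not the sim-alpha implementation or the mechanism from the original proposal: the names `MemOp` and `VirtualLSQ`, the extra `READY` label, and the choice to place the virtual window over the oldest `vlsq_size` entries still in the queue are all assumptions made for illustration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class MemOp:
    seq: int               # program-order sequence number
    ready: bool = False    # operands/address available
    issued: bool = False   # already sent to the memory pipeline


class VirtualLSQ:
    """Toy LSQ: the front end sees lsq_size entries, issue sees vlsq_size."""

    def __init__(self, lsq_size: int, vlsq_size: int):
        self.lsq_size = lsq_size
        self.vlsq_size = vlsq_size
        self.entries = deque()   # leftmost element is the LSQ head (oldest)

    def allocate(self, op: MemOp) -> bool:
        """Front end inserts an op; False means the LSQ is full (stall)."""
        if len(self.entries) >= self.lsq_size:
            return False
        self.entries.append(op)
        return True

    def states(self) -> list:
        """Label each slot with a state from the figure, plus READY for an
        op inside the window that could issue this cycle."""
        labels = []
        for pos, op in enumerate(self.entries):
            if op.issued:
                labels.append("ISSUED")
            elif pos >= self.vlsq_size:
                labels.append("BLOCKED")      # outside the virtual window
            else:
                labels.append("READY" if op.ready else "NOT READY")
        labels += ["EMPTY"] * (self.lsq_size - len(self.entries))
        return labels

    def issue(self, width: int) -> list:
        """Issue up to `width` ready ops, looking only inside the window."""
        issued = []
        for pos, op in enumerate(self.entries):
            if pos >= self.vlsq_size or len(issued) == width:
                break
            if op.ready and not op.issued:
                op.issued = True
                issued.append(op.seq)
        return issued

    def commit(self):
        """Retire the oldest op if it has issued; this slides the window."""
        if self.entries and self.entries[0].issued:
            return self.entries.popleft().seq
        return None
```

With `lsq_size=8` and `vlsq_size=2`, five ready ops can all be allocated by the front end, but only the two oldest can issue until one of them commits. That is the throttling effect the slide describes, in miniature.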
VLSQ Pipeline Operation
[Figure: pipeline diagram with Fetch/Decode, Rename, Issue, Register File, Integer, Memory, and the Load/Store Queue; two points are marked "Stall".]
Simulation Setup
• Alpha 21264 simulator (sim-alpha)
  • I-Cache (64 KB, 1 cycle); D-Cache (64 KB, 3 cycles)
  • L2 Cache (2 MB, 15 cycles)
  • 1.3 GB/s DDR SDRAM (DRAMsim)
  • 1024-entry store-wait table
  • 2048-line 2-level bimodal branch predictor
  • Pipeline width: Fetch (8); Issue (8/4); Commit (11)
  • Functional units: Int (4), Int-Mul (4), FP (1), FP-Mul (1)
• Subset of SPEC 2000 benchmarks
  • FP: applu, art, mgrid, swim; INT: gcc, gzip, mcf, twolf
  • Warm-up: 2 billion instructions; data: 500 million instructions
  • Reference input
Simulation Setup (Continued…)
Baseline Out-of-Order Configurations

ROB Size   Registers   Issue Queue   LSQ Size   VLSQ Size
80         80/72       20/15         32/32      Infinite
128        160/144     40/30         64/64      Infinite
256        320/288     80/60         128/128    Infinite
512        640/576     160/120       256/256    Infinite

• For VLSQ experiments: baseline LSQ, with VLSQ sizes of Inf, 64, 32, 16, 8, 4, and 2
• For LSQ experiments: VLSQ of infinity, with LSQ sizes of 64, 32, 16, 8, 4, and 2
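The two sweeps can be enumerated with a short Python sketch. It only lists the experiment points and assumes nothing about how sim-alpha is configured; the function names and dictionary keys are made up for illustration, and the "Base" point in the LSQ sweep is taken from the chart labels rather than from the text above.

```python
from itertools import product

ROB_SIZES = [80, 128, 256, 512]
SWEPT_SIZES = [64, 32, 16, 8, 4, 2]
INF = float("inf")   # stands in for the "Infinite" VLSQ of the baseline


def vlsq_experiments():
    """Baseline LSQ for each ROB size; VLSQ swept over Inf, 64, ..., 2."""
    return [
        {"rob": rob, "lsq": "Base", "vlsq": vlsq}
        for rob, vlsq in product(ROB_SIZES, [INF] + SWEPT_SIZES)
    ]


def lsq_experiments():
    """Infinite VLSQ; LSQ swept over Base, 64, ..., 2 for each ROB size."""
    return [
        {"rob": rob, "lsq": lsq, "vlsq": INF}
        for rob, lsq in product(ROB_SIZES, ["Base"] + SWEPT_SIZES)
    ]


if __name__ == "__main__":
    print(len(vlsq_experiments()), "VLSQ points,",
          len(lsq_experiments()), "LSQ points")
```

Each sweep has 28 points (4 ROB sizes times 7 queue sizes). The baseline point with the base LSQ and an infinite VLSQ appears in both sweeps.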
VLSQ - Performance
[Figure: CPI (0 to 2.5) versus ROB size (80, 128, 256, 512), one series per VLSQ size: Inf, 64, 32, 16, 8, 4, 2.]
VLSQ - Trap Overhead
[Figure: trap overhead as a percentage of execution cycles (0% to 50%) versus ROB size (80, 128, 256, 512), one series per VLSQ size: Inf, 64, 32, 16, 8, 4, 2.]
VLSQ - Map/Rename Stalls
[Figure: stall cycles per thousand instructions (0 to 2,500) for each ROB-VLSQ size pair (ROB 80, 128, 256, 512; VLSQ Inf, 64, 32, 16, 8, 4, 2), broken down into ROB, MEM, and ISSUE stalls.]
VLSQ Pipeline Operation (Continued…)
[Figure: pipeline diagram with Fetch/Decode, Rename, Issue, Register File, Integer, Memory, and the Load/Store Queue; four points are marked "Stall".]
VLSQ Summary
• Reduces speculation and replay traps
• Not a big performance drop
• Saves power
• Stall propagates backwards
  • Need a lot of memory-independent instructions
• On the critical path?
• What if we simply reduce the LSQ size?
• VLSQ works!
Small Load/Store Queue
[Figure: pipeline diagram with Fetch/Decode, Rename, Issue, Register File, Integer, Memory, and the Load/Store Queue; two points are marked "Stall".]
VLSQ vs. LSQ (Map/Rename Stalls)
[Figure, two panels. "VLSQ Stalls": stall cycles (0 to 2,500) for each ROB-VLSQ size pair (ROB 80, 128, 256, 512; VLSQ Inf, 64, 32, 16, 8, 4, 2). "LSQ Stalls": stall cycles (0 to 2,500) for each ROB-LSQ size pair (ROB 80, 128, 256, 512; LSQ Base, 64, 32, 16, 8, 4, 2). Both panels break stalls down into ROB, MEM, and ISSUE.]
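VLSQ vs. LSQ (Performance)
[Figure, two panels. "VLSQ Performance": CPI (0 to 2.5) versus ROB size (80, 128, 256, 512), one series per VLSQ size: Inf, 64, 32, 16, 8, 4, 2. "LSQ Performance": CPI (0 to 2.5) versus ROB size (80, 128, 256, 512), one series per LSQ size: Base, 64, 32, 16, 8, 4, 2.]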
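VLSQ vs. LSQ (Trap Overhead)
[Figure, two panels. "VLSQ Trap Overhead": percentage of execution cycles (0% to 50%) versus ROB size (80, 128, 256, 512), one series per VLSQ size: Inf, 64, 32, 16, 8, 4, 2. "LSQ Trap Overhead": percentage of execution cycles (0% to 50%) versus ROB size (80, 128, 256, 512), one series per LSQ size: Base, 64, 32, 16, 8, 4, 2.]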
VLSQ vs. LSQ (Summary)

                      Baseline   VLSQ   LSQ
CPI                   1.35       1.34   1.35
ROB Stall Cycles      1          0      0
MEM Stall Cycles      3          0      534
ISSUE Stall Cycles    233        363    1
Total Stall Cycles    236        364    536
Trap Overhead         45%        36%    26%
L1 Accesses           648        499    451
L1 Misses             96         94     91
Fetch Ops             0%         12%    31%
Map Ops               0%         12%    34%
Exec Ops              0%         12%    18%

ROB Size: 512; VLSQ Size: 16; LSQ Size: 16
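A few derived figures make the table easier to compare. The Python sketch below copies the numbers from the summary table and computes the relative reduction in L1 accesses, plus the gap between each listed total and the sum of its three stall components. The dictionary layout and function names are illustrative. The slide does not explain why the listed totals differ from the component sums by one; rounding of the per-category averages is one possible explanation.

```python
# Values copied from the summary table (ROB 512, VLSQ 16, LSQ 16).
SUMMARY = {
    "Baseline": {"rob": 1, "mem": 3, "issue": 233, "total": 236,
                 "l1_accesses": 648, "l1_misses": 96},
    "VLSQ":     {"rob": 0, "mem": 0, "issue": 363, "total": 364,
                 "l1_accesses": 499, "l1_misses": 94},
    "LSQ":      {"rob": 0, "mem": 534, "issue": 1, "total": 536,
                 "l1_accesses": 451, "l1_misses": 91},
}


def l1_access_reduction(config: str) -> float:
    """Percentage drop in L1 accesses relative to the baseline."""
    base = SUMMARY["Baseline"]["l1_accesses"]
    return 100.0 * (base - SUMMARY[config]["l1_accesses"]) / base


def stall_total_gap(config: str) -> int:
    """Listed total minus the sum of the ROB, MEM, and ISSUE components."""
    row = SUMMARY[config]
    return row["total"] - (row["rob"] + row["mem"] + row["issue"])


if __name__ == "__main__":
    for name in SUMMARY:
        print(name,
              f"L1 access reduction: {l1_access_reduction(name):.1f}%",
              f"stall total gap: {stall_total_gap(name)}")
```

By these numbers the VLSQ configuration makes about 23% fewer L1 accesses than the baseline and the small LSQ about 30% fewer, while L1 misses stay nearly constant (96, 94, 91).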
LSQ Summary
• Reduces speculation and replay traps
• Performance vs. power tradeoff better than that for VLSQ
• Simpler than VLSQ
  • Not on the critical path
  • Additional power saving from a smaller LSQ
• Reducing LSQ size is better than using VLSQ!
• VLSQ works BUT…
Dynamic Throttling
• Easy to do dynamic throttling using VLSQ
  • Just need to tweak the VLSQ window size
• Might be better to just vary the LSQ size
  • Maybe we can just shut down parts of the LSQ
• Better to throttle in the issue stage using
  • Just-in-time instruction delivery [Karkhanis, ISLPED'02]
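As a purely hypothetical illustration of "tweaking the window size", the Python sketch below halves or doubles the window based on the measured trap overhead. The slides do not describe any control policy: the function name, the thresholds, and the halve/double rule are invented here, and only the 2 to 64 size range echoes the sweep used in the experiments.

```python
def next_window_size(current: int, trap_overhead: float,
                     low: float = 0.25, high: float = 0.40,
                     min_size: int = 2, max_size: int = 64) -> int:
    """Pick the window size for the next interval.

    trap_overhead is the fraction of execution cycles lost to replay traps
    in the last interval. The thresholds are arbitrary placeholders.
    """
    if trap_overhead > high:
        # Too many replay traps: shrink the window to reduce re-ordering.
        return max(min_size, current // 2)
    if trap_overhead < low:
        # Few traps: widen the window to recover memory-level parallelism.
        return min(max_size, current * 2)
    return current


if __name__ == "__main__":
    size = 16
    for overhead in (0.45, 0.45, 0.30, 0.10):
        size = next_window_size(size, overhead)
        print(size)
```

The same rule could drive the physical LSQ size instead, which is the alternative the slide suggests might be better.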
Conclusions
• Speculative execution of memory instructions leads to wasted power due to replay traps
• VLSQ helps to reduce memory re-ordering and replay traps
• Reducing the LSQ size is more effective
• For power saving it is better to throttle earlier in the pipeline
Duplicating and Deconstructing Virtual Load/Store Queues
Questions?