Scheduling Reusable Instructions for Power Reduction
J. S. Hu, N. Vijaykrishnan, S. Kim, M. Kandemir, and M. J. Irwin
Microsystems Design Lab, The Pennsylvania State University
MDL@PSU, 4/4/2004
Power: StrongARM SA-110
Power dissipation:
  ICache        27%
  IBox          18%
  EBox           8%
  IMMU           9%
  DCache        16%
  DMMU           8%
  Clock         10%
  Write Buffer   2%
  Bus Ctrl       2%
  PLL          < 1%
Die of DEC StrongARM
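The breakdown above motivates the rest of the talk; a quick tally (a sketch, rounding the PLL's "< 1%" to 0 so the shares sum cleanly) shows that the instruction cache and the instruction issue logic (IBox) alone account for almost half of the chip's power:

```python
# StrongARM SA-110 power shares from the slide, in percent of total.
# PLL (< 1% on the slide) is rounded to 0 here; this is an assumption
# made so the shares add up to 100.
breakdown = {
    "ICache": 27, "IBox": 18, "EBox": 8, "IMMU": 9, "DCache": 16,
    "DMMU": 8, "Clock": 10, "Write Buffer": 2, "Bus Ctrl": 2, "PLL": 0,
}

total = sum(breakdown.values())                      # shares sum to 100
front_end = breakdown["ICache"] + breakdown["IBox"]  # fetch + issue logic: 45
```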
Related Work
Stage-skip pipeline: a small decoded instruction buffer [1][2]
Loop caches: dynamic/preloaded/hybrid loop caches [3][4][5]
Filter cache: filter dcache [6], decode filter cache [7]
Related Work
[1] M. Hiraki et al. Stage-skip pipeline: A low power processor architecture using a decoded instruction buffer. In Proc. International Symposium on Low Power Electronics and Design, 1996.
[2] R. S. Bajwa et al. Instruction buffering to reduce power in processors for signal processing. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 5(4):417–424, December 1997.
[3] L. H. Lee, B. Moyer, and J. Arends. Instruction fetch energy reduction using loop caches for embedded applications with small tight loops. In Proc. International Symposium on Low Power Electronics and Design, 1999.
[4] T. Anderson and S. Agarwala. Effective hardware-based two-way loop cache for high performance low power processors. In IEEE Int’l Conf. on Computer Design, 2000.
[5] A. Gordon-Ross, S. Cotterell, and F. Vahid. Exploiting fixed programs in embedded systems: A loop cache example. IEEE Computer Architecture Letters, 2002.
[6] J. Kin et al. The filter cache: An energy efficient memory structure. In Proc. International Symposium on Microarchitecture, 1997.
[7] W. Tang, R. Gupta, and A. Nicolau. Power savings in embedded processors through decode filter cache. In Proc. Design and Test in Europe Conference, 2002.
Our Proposed Approach
Scheduling reusable loop instructions within the issue queue
No need for an additional instruction buffer
Utilizes the existing issue queue resources
Can gate the front-end of the pipeline
Automatically unrolls loops in the issue queue
No ISA modification
Embedded Processor based on MIPS Core
[Figure: (a) Datapath of the embedded processor based on a MIPS core: instruction cache, instruction decoder, register map, resource queues (issue queue, load/store queues), reorder buffer (ROB), register file, integer and FP function units, and data address calculation. (b) Pipeline stages: Fetch, Decode/Rename, Queue, Issue, Reg Read, Execute/Dcache Access, WriteBack, Commit.]
Schedule Reusable Instructions
[Figure: Example MIPS loop nest. The innermost loop (closed by `bne r2, r0, 0x4002a0`) is bufferable; the enclosing outer loop (closed by `bne r2, r0, 0x400278`) is non-bufferable.]
Target: array-intensive embedded applications
Utilize the existing issue queue
Reusable instructions: innermost loops
Self-streaming issue queue
Gate the front-end of the datapath
Loop Detection
[Figure: same example loop nest as above; the innermost loop is bufferable, the outer loop is not.]
Add check logic for conditional branch and direct jump instructions
Two checks: (a) backward branch/jump; (b) static distance <= issue queue size
The check is performed at the decode stage instead of the commit stage
The NBLT (non-bufferable loop table) records loops currently known to be non-bufferable
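The two checks above can be sketched as follows; this is a minimal model, not the hardware logic, and the names (`ISSUE_QUEUE_SIZE`, the `nblt` set of loop-head PCs) are illustrative assumptions:

```python
# Sketch of the loop-detection check done when a branch/jump is decoded.
# Assumes fixed 4-byte MIPS instructions and a 64-entry issue queue.

ISSUE_QUEUE_SIZE = 64      # entries available for buffering the loop body
INSTRUCTION_BYTES = 4      # fixed MIPS instruction width

def is_capturable_loop(branch_pc, target_pc, nblt):
    """Return True if a decoded branch/jump closes a loop small enough
    to be captured in the issue queue."""
    # Check (a): the branch/jump must go backward.
    if target_pc >= branch_pc:
        return False
    # Check (b): the static loop body must fit in the issue queue.
    body_insts = (branch_pc - target_pc) // INSTRUCTION_BYTES + 1
    if body_insts > ISSUE_QUEUE_SIZE:
        return False
    # Loops already known to be non-bufferable (tracked in the NBLT,
    # keyed here by loop-head PC) are skipped.
    return target_pc not in nblt
```

A forward branch, an over-large body, or a hit in the NBLT all reject the loop; only the remaining candidates enter buffering.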
Buffering Reusable Instructions
Extended issue queue microarchitecture
Reusable instructions are marked; their logical register numbers are stored in the LRL
Buffer a whole (integer) number of loop iterations
[Figure: Extended 64-entry issue queue (entries 0-63). Each entry adds a classification bit (CB, 1 bit), an issue state bit (ISB, 1 bit), and a logical register list (LRL, 15 bits: rs/rt/rd indices). A reuse pointer scans between the loop-head (R_loophead) and loop-tail (R_looptail) markers.]
Reusing Buffered Instructions
[Figure: same extended issue queue as on the previous slide.]
An instruction with its CB bit set is not removed from the queue after it issues
The ISB bit is set once the instruction has issued
Check the first m instructions; if the first n of them have both CB and ISB set, reuse those n instructions and advance the reuse pointer by n
Logical register numbers are sent for renaming
The reuse pointer is reset to R_loophead when it reaches R_looptail
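The reuse step above can be sketched as a pointer scan over queue entries; this is a behavioral model under our reading of the slide (in particular, the assumption that ISB is cleared on reuse so the instruction can issue again in the new iteration):

```python
# Behavioral sketch of the reuse-pointer scan; Entry, reuse_step, and
# the parameter m are illustrative names, not the hardware interface.

class Entry:
    """One issue-queue entry, reduced to the fields relevant to reuse."""
    def __init__(self, cb=False, isb=False):
        self.cb = cb      # classification bit: part of a buffered loop
        self.isb = isb    # issue state bit: has issued this iteration

def reuse_step(queue, ptr, head, tail, m=4):
    """Check the first m entries from ptr; reuse the leading n entries
    with both CB and ISB set, advancing the pointer by n and wrapping
    from R_looptail back to R_loophead."""
    n = 0
    for _ in range(m):
        entry = queue[ptr]
        if not (entry.cb and entry.isb):
            break                    # stop at the first non-reusable entry
        entry.isb = False            # assumed: cleared so it can re-issue
        n += 1
        ptr = head if ptr == tail else ptr + 1   # wrap at the loop tail
    return ptr, n
```

For a four-entry loop occupying slots 2-5 with all ISB bits set, one step reuses all four entries and wraps the pointer back to the loop head.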
The New Datapath
[Figure: (a) The new datapath. Loop-detection and reuse-control logic sits beside the issue queue; when a detected loop is being reused, gate signals stall the fetch and decode stages, and LRL register numbers are fed back to the register rename logic. (b) Pipeline stages as before: Fetch, Decode/Rename, Queue, Issue, Reg Read, Execute/Dcache Access, WriteBack, Commit.]
Experiment Setup

Parameters          | Configuration
Issue Queue         | 64 entries
Load/Store Queue    | 32 entries
ROB                 | 64 entries
Fetch Queue         | 4 entries
Fetch/Decode Width  | 4 inst. per cycle
Issue/Commit Width  | 4 inst. per cycle
Function Units      | 4 IALU, 1 IMULT, 4 FPALU, 1 FPMULT
Branch Predictor    | bimod, 2048 entries; RAS 8 entries; BTB 512-set, 4-way assoc.
L1 ICache           | 32KB, 2-way, 1 cycle
L1 DCache           | 32KB, 4-way, 1 cycle
L2 UCache           | 256KB, 4-way, 8 cycles
TLB                 | ITLB: 16-set 4-way, DTLB: 32-set 4-way; 4KB page size, 30-cycle penalty
Memory              | 80 cycles for first chunk, 8 cycles the rest

Bench   | Source
Adi     | Livermore
Aps     | Perfect Club
Btrix   | Spec92/NASA
Eflux   | Perfect Club
Tomcat  | Spec95
Tsf     | Perfect Club
Vpenta  | Spec92/NASA
Wss     | Perfect Club

Simulator | Name
Arch.     | SimpleScalar 3.0
Power     | Wattch
Rate of Gated Front End
[Figure: Pipeline front-end gated rate (in cycles), 0 to 1, per benchmark (adi, aps, btrix, eflux, tomcat, tsf, vpenta, wss, avg) for issue queue sizes IQ-32, IQ-64, IQ-128, IQ-256.]
On average, the pipeline front-end gated rate increases from 42% to 82% as the issue queue size increases.
Power Savings in Front-end
[Figure: Per-cycle power savings in the ICache, branch predictor, decoder, and issue queue, plus overhead, for IQ-32, IQ-64, IQ-128, IQ-256.]
On average, as the issue queue size increases, power is reduced by 35%-72% in the ICache, 19%-33% in the branch predictor, 46%-83% in the instruction decoder, and 12%-21% in the issue queue, with < 2% overhead.
Impact of Compiler Optimizations
[Figure: Overall per-cycle power reduction per benchmark (adi, aps, btrix, eflux, tomcat, tsf, vpenta, wss, avg), Original vs. Optimized code; two plots with y-axes up to 0.2 and 0.25.]
Optimized code increases power savings from 8% to 13% with an issue queue size of 64 entries.
Conclusions
Proposed a new issue queue architecture:
Detect capturable loop code
Buffer loop code in the issue queue
Schedule the reusable loop instructions buffered there
Significant power reduction in pipeline front-end components while gated
Compiler optimizations can further improve the power savings
Power: A Design Limiter?
A major constraint in embedded systems design
Superscalar architectures are increasingly used for performance
The front-end is a power-hungry component
We seek to optimize the front-end of the datapath in embedded processors
Issue Queue State Transition
No impact on exception handling
Unsuccessful buffering is revoked
A change of control flow in a loop being buffered causes buffering to be revoked
Branch prediction is disabled and switched to static prediction during the Code_Reuse state
A misprediction due to normally exiting a loop restores the issue queue to the Normal state
[Figure: Issue queue state diagram. Normal --(capturable loop detected)--> Code_Buffering --(loop buffering finished)--> Code_Reuse; Code_Buffering --(buffering revoked)--> Normal; Code_Reuse --(misprediction recovery)--> Normal.]
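The state transitions described above can be sketched as a small transition table; this is a minimal model under our reading of the slide's diagram, with event names chosen here for illustration:

```python
# Sketch of the issue-queue state machine. State names follow the slide;
# the event identifiers are illustrative labels for its transition arcs.
TRANSITIONS = {
    ("Normal", "capturable_loop_detected"): "Code_Buffering",
    ("Code_Buffering", "loop_buffering_finished"): "Code_Reuse",
    ("Code_Buffering", "buffering_revoked"): "Normal",   # e.g. control flow changed
    ("Code_Reuse", "misprediction_recovery"): "Normal",  # loop exited normally
}

def next_state(state, event):
    # Any (state, event) pair not listed leaves the queue where it is.
    return TRANSITIONS.get((state, event), state)
```

A successful capture walks Normal to Code_Buffering to Code_Reuse; a revoked buffering attempt or a misprediction on loop exit falls back to Normal.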
Overall Power Reduction
[Figure: Overall per-cycle power savings, 0 to 0.3, per benchmark (adi, aps, btrix, eflux, tomcat, tsf, vpenta, wss, avg) for IQ-32, IQ-64, IQ-128, IQ-256.]
Performance Loss
[Figure: Performance (IPC) degradation, up to 0.14, per benchmark (adi, aps, btrix, eflux, tomcat, tsf, vpenta, wss, avg) for IQ-32, IQ-64, IQ-128, IQ-256.]
Average performance loss ranges from 0.2% to 4%, due to buffering a whole number of loop iterations.