Scheduling Reusable Instructions for Power Reduction
J. S. Hu, N. Vijaykrishnan, S. Kim, M. Kandemir, and M. J. Irwin
Microsystems Design Lab, The Pennsylvania State University
MDL@PSU, 4/4/2004
Power: StrongARM SA-110
Power dissipation:
  ICache        27%
  IBox          18%
  EBox           8%
  IMMU           9%
  DCache        16%
  DMMU           8%
  Clock         10%
  Write Buffer   2%
  Bus Ctrl       2%
  PLL          < 1%
Die of DEC StrongARM
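The breakdown above motivates the rest of the talk; a quick tally (a sketch, rounding the PLL's "< 1%" to 0 so the shares sum cleanly) shows that the instruction cache and the instruction issue logic (IBox) alone account for almost half of the chip's power:

```python
# StrongARM SA-110 power shares from the slide, in percent of total.
# PLL (< 1% on the slide) is rounded to 0 here; this is an assumption
# made so the shares add up to 100.
breakdown = {
    "ICache": 27, "IBox": 18, "EBox": 8, "IMMU": 9, "DCache": 16,
    "DMMU": 8, "Clock": 10, "Write Buffer": 2, "Bus Ctrl": 2, "PLL": 0,
}

total = sum(breakdown.values())                      # shares sum to 100
front_end = breakdown["ICache"] + breakdown["IBox"]  # fetch + issue logic: 45
```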
Related Work
Stage-skip pipeline: a small decoded instruction buffer [1][2]
Loop caches: dynamic/preloaded/hybrid loop caches [3][4][5]
Filter cache: filter dcache [6], decode filter cache [7]
Related Work
[1] M. Hiraki et al. Stage-skip pipeline: A low power processor architecture using a decoded instruction buffer. In Proc. International Symposium on Low Power Electronics and Design, 1996.
[2] R. S. Bajwa et al. Instruction buffering to reduce power in processors for signal processing. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 5(4):417–424, December 1997.
[3] L. H. Lee, B. Moyer, and J. Arends. Instruction fetch energy reduction using loop caches for embedded applications with small tight loops. In Proc. International Symposium on Low Power Electronics and Design, 1999.
[4] T. Anderson and S. Agarwala. Effective hardware-based two-way loop cache for high performance low power processors. In IEEE Int’l Conf. on Computer Design, 2000.
[5] A. Gordon-Ross, S. Cotterell, and F. Vahid. Exploiting fixed programs in embedded systems: A loop cache example. IEEE Computer Architecture Letters, 2002.
[6] J. Kin et al. The filter cache: An energy efficient memory structure. In Proc. International Symposium on Microarchitecture, 1997.
[7] W. Tang, R. Gupta, and A. Nicolau. Power savings in embedded processors through decode filter cache. In Proc. Design and Test in Europe Conference, 2002.
Our Proposed Approach
Scheduling reusable loop instructions within the issue queue
No need for an additional instruction buffer
Utilizes the existing issue queue resources
Can gate the front-end of the pipeline
Automatically unrolls loops in the issue queue
No ISA modification
Embedded Processor based on MIPS Core
[Figure: (a) Datapath of the embedded processor based on a MIPS core: instruction cache, instruction decoder, register map, resource queues (issue queue, load/store queues), reorder buffer (ROB), register file, integer and FP function units, and data address calculation. (b) Pipeline stages: Fetch, Decode/Rename, Queue, Issue, Reg Read, Execute/Dcache Access, WriteBack, Commit.]
Schedule Reusable Instructions
[Figure: Example MIPS loop nest. The innermost loop (closed by `bne r2, r0, 0x4002a0`) is bufferable; the enclosing outer loop (closed by `bne r2, r0, 0x400278`) is non-bufferable.]
Target: array-intensive embedded applications
Utilize the existing issue queue
Reusable instructions: innermost loops
Self-streaming issue queue
Gate the front-end of the datapath
Loop Detection
[Figure: same example loop nest as above; the innermost loop is bufferable, the outer loop is not.]
Add check logic for conditional branch and direct jump instructions
Two checks: (a) backward branch/jump; (b) static distance <= issue queue size
The check is performed at the decode stage instead of the commit stage
The NBLT (non-bufferable loop table) records loops currently known to be non-bufferable
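The two checks above can be sketched as follows; this is a minimal model, not the hardware logic, and the names (`ISSUE_QUEUE_SIZE`, the `nblt` set of loop-head PCs) are illustrative assumptions:

```python
# Sketch of the loop-detection check done when a branch/jump is decoded.
# Assumes fixed 4-byte MIPS instructions and a 64-entry issue queue.

ISSUE_QUEUE_SIZE = 64      # entries available for buffering the loop body
INSTRUCTION_BYTES = 4      # fixed MIPS instruction width

def is_capturable_loop(branch_pc, target_pc, nblt):
    """Return True if a decoded branch/jump closes a loop small enough
    to be captured in the issue queue."""
    # Check (a): the branch/jump must go backward.
    if target_pc >= branch_pc:
        return False
    # Check (b): the static loop body must fit in the issue queue.
    body_insts = (branch_pc - target_pc) // INSTRUCTION_BYTES + 1
    if body_insts > ISSUE_QUEUE_SIZE:
        return False
    # Loops already known to be non-bufferable (tracked in the NBLT,
    # keyed here by loop-head PC) are skipped.
    return target_pc not in nblt
```

A forward branch, an over-large body, or a hit in the NBLT all reject the loop; only the remaining candidates enter buffering.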
Buffering Reusable Instructions
Extended issue queue microarchitecture
Reusable instructions are marked; their logical register numbers are stored in the LRL
Buffer a whole (integer) number of loop iterations
[Figure: Extended 64-entry issue queue (entries 0-63). Each entry adds a classification bit (CB, 1 bit), an issue state bit (ISB, 1 bit), and a logical register list (LRL, 15 bits: rs/rt/rd indices). A reuse pointer scans between the loop-head (R_loophead) and loop-tail (R_looptail) markers.]
Reusing Buffered Instructions
[Figure: same extended issue queue as on the previous slide.]
An instruction with its CB bit set is not removed from the queue after it issues
The ISB bit is set once the instruction has issued
Check the first m instructions; if the first n of them have both CB and ISB set, reuse those n instructions and advance the reuse pointer by n
Logical register numbers are sent for renaming
The reuse pointer is reset to R_loophead when it reaches R_looptail
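The reuse step above can be sketched as a pointer scan over queue entries; this is a behavioral model under our reading of the slide (in particular, the assumption that ISB is cleared on reuse so the instruction can issue again in the new iteration):

```python
# Behavioral sketch of the reuse-pointer scan; Entry, reuse_step, and
# the parameter m are illustrative names, not the hardware interface.

class Entry:
    """One issue-queue entry, reduced to the fields relevant to reuse."""
    def __init__(self, cb=False, isb=False):
        self.cb = cb      # classification bit: part of a buffered loop
        self.isb = isb    # issue state bit: has issued this iteration

def reuse_step(queue, ptr, head, tail, m=4):
    """Check the first m entries from ptr; reuse the leading n entries
    with both CB and ISB set, advancing the pointer by n and wrapping
    from R_looptail back to R_loophead."""
    n = 0
    for _ in range(m):
        entry = queue[ptr]
        if not (entry.cb and entry.isb):
            break                    # stop at the first non-reusable entry
        entry.isb = False            # assumed: cleared so it can re-issue
        n += 1
        ptr = head if ptr == tail else ptr + 1   # wrap at the loop tail
    return ptr, n
```

For a four-entry loop occupying slots 2-5 with all ISB bits set, one step reuses all four entries and wraps the pointer back to the loop head.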
The New Datapath
[Figure: (a) The new datapath. Loop-detection and reuse-control logic sits beside the issue queue; when a detected loop is being reused, gate signals stall the fetch and decode stages, and LRL register numbers are fed back to the register rename logic. (b) Pipeline stages as before: Fetch, Decode/Rename, Queue, Issue, Reg Read, Execute/Dcache Access, WriteBack, Commit.]
Experiment Setup

Parameters          | Configuration
Issue Queue         | 64 entries
Load/Store Queue    | 32 entries
ROB                 | 64 entries
Fetch Queue         | 4 entries
Fetch/Decode Width  | 4 inst. per cycle
Issue/Commit Width  | 4 inst. per cycle
Function Units      | 4 IALU, 1 IMULT, 4 FPALU, 1 FPMULT
Branch Predictor    | bimod, 2048 entries; RAS 8 entries; BTB 512-set, 4-way assoc.
L1 ICache           | 32KB, 2-way, 1 cycle
L1 DCache           | 32KB, 4-way, 1 cycle
L2 UCache           | 256KB, 4-way, 8 cycles
TLB                 | ITLB: 16-set 4-way, DTLB: 32-set 4-way; 4KB page size, 30-cycle penalty
Memory              | 80 cycles for first chunk, 8 cycles the rest

Bench   | Source
Adi     | Livermore
Aps     | Perfect Club
Btrix   | Spec92/NASA
Eflux   | Perfect Club
Tomcat  | Spec95
Tsf     | Perfect Club
Vpenta  | Spec92/NASA
Wss     | Perfect Club

Simulator | Name
Arch.     | SimpleScalar 3.0
Power     | Wattch
Rate of Gated Front End
[Figure: Pipeline front-end gated rate (in cycles), 0 to 1, per benchmark (adi, aps, btrix, eflux, tomcat, tsf, vpenta, wss, avg) for issue queue sizes IQ-32, IQ-64, IQ-128, IQ-256.]
On average, the pipeline front-end gated rate increases from 42% to 82% as the issue queue size increases.
Power Savings in Front-end
[Figure: Per-cycle power savings in the ICache, branch predictor, decoder, and issue queue, plus overhead, for IQ-32, IQ-64, IQ-128, IQ-256.]
On average, as the issue queue size increases, power is reduced by 35%-72% in the ICache, 19%-33% in the branch predictor, 46%-83% in the instruction decoder, and 12%-21% in the issue queue, with < 2% overhead.
Impact of Compiler Optimizations
[Figure: Overall per-cycle power reduction per benchmark (adi, aps, btrix, eflux, tomcat, tsf, vpenta, wss, avg), Original vs. Optimized code; two plots with y-axes up to 0.2 and 0.25.]
Optimized code increases power savings from 8% to 13% with an issue queue size of 64 entries.
Conclusions
Proposed a new issue queue architecture:
Detect capturable loop code
Buffer loop code in the issue queue
Schedule the reusable loop instructions buffered there
Significant power reduction in pipeline front-end components while gated
Compiler optimizations can further improve the power savings
Power: A Design Limiter?
A major constraint in embedded systems design
Superscalar architectures are increasingly used for performance
The front-end is a power-hungry component
We seek to optimize the front-end of the datapath in embedded processors
Issue Queue State Transition
No impact on exception handling
Unsuccessful buffering is revoked
A change of control flow in a loop being buffered causes buffering to be revoked
Branch prediction is disabled and switched to static prediction during the Code_Reuse state
A misprediction due to normally exiting a loop restores the issue queue to the Normal state
[Figure: Issue queue state diagram. Normal --(capturable loop detected)--> Code_Buffering --(loop buffering finished)--> Code_Reuse; Code_Buffering --(buffering revoked)--> Normal; Code_Reuse --(misprediction recovery)--> Normal.]
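The state transitions described above can be sketched as a small transition table; this is a minimal model under our reading of the slide's diagram, with event names chosen here for illustration:

```python
# Sketch of the issue-queue state machine. State names follow the slide;
# the event identifiers are illustrative labels for its transition arcs.
TRANSITIONS = {
    ("Normal", "capturable_loop_detected"): "Code_Buffering",
    ("Code_Buffering", "loop_buffering_finished"): "Code_Reuse",
    ("Code_Buffering", "buffering_revoked"): "Normal",   # e.g. control flow changed
    ("Code_Reuse", "misprediction_recovery"): "Normal",  # loop exited normally
}

def next_state(state, event):
    # Any (state, event) pair not listed leaves the queue where it is.
    return TRANSITIONS.get((state, event), state)
```

A successful capture walks Normal to Code_Buffering to Code_Reuse; a revoked buffering attempt or a misprediction on loop exit falls back to Normal.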
Overall Power Reduction
[Figure: Overall per-cycle power savings, 0 to 0.3, per benchmark (adi, aps, btrix, eflux, tomcat, tsf, vpenta, wss, avg) for IQ-32, IQ-64, IQ-128, IQ-256.]
Performance Loss
[Figure: Performance (IPC) degradation, up to 0.14, per benchmark (adi, aps, btrix, eflux, tomcat, tsf, vpenta, wss, avg) for IQ-32, IQ-64, IQ-128, IQ-256.]
Average performance loss ranges from 0.2% to 4%, due to buffering a whole number of loop iterations.