Speculative Software Management of Datapath-width for Energy Optimization
G. Pokam, O. Rochecouste, A. Seznec, and F. Bodin
IRISA, Campus de Beaulieu
35042 Rennes Cedex, France
Context
Embedded applications often operate on 8-/16-bit data
> 50% of program instructions in some cases
New opportunities for energy reduction …
clock-gating at finer granularity, i.e. operand level
Exploiting narrow-width operands
• Dynamic approach (Brooks, et al. HPCA-99)
1. cycle-by-cycle operand gating
2. complex hardware mechanisms required
• Compiler approach (Stephenson, et al. PLDI 2000)
1. based on static data flow analysis
2. must be overly conservative to preserve program correctness
Our approach
Don’t want to pay the cost of a hardware scheme to detect when to clock-gate
Don’t want to rely on static data flow analysis to discover bit-width ranges
Combine the dynamic approach and the compiler approach
Narrow-width execution mode is speculative:
exception management allows recovery to the correct mode
Take advantage of dynamic approach to expose dynamic narrow-width operands to the compiler (via profiling)
Use compiler approach to switch from normal to narrow-width mode and vice-versa (via a reconfiguration instruction)
Bit-width distribution analysis
• Cumulative distribution [Powerstone benchmarks]
[Figure: narrow-width operand occurrence, for one operand and two operands]
Bit-width distribution analysis
• Dynamic distribution of narrow-width operands at basic block level (adpcm)
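A per-basic-block distribution of this kind could be collected roughly as follows. This is a minimal sketch, not the authors' profiler: the trace format, the function names, and the signed two's-complement width test are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

def operand_width(value: int) -> int:
    """Smallest of 8, 16, 32 bits that holds `value` as a signed integer (assumed test)."""
    for bits in (8, 16):
        if -(1 << (bits - 1)) <= value < (1 << (bits - 1)):
            return bits
    return 32

def profile_blocks(trace):
    """trace: iterable of (basic_block_id, tuple_of_operand_values) (hypothetical format)."""
    histogram = defaultdict(Counter)
    for block, operands in trace:
        # An instruction is counted at the width of its widest operand.
        width = max(operand_width(v) for v in operands)
        histogram[block][width] += 1
    return histogram

def narrow_fraction(counter: Counter) -> float:
    """Share of instructions in a block whose operands all fit in 8 or 16 bits."""
    total = sum(counter.values())
    return (counter[8] + counter[16]) / total if total else 0.0

trace = [("bb0", (3, 100)), ("bb0", (-1, 200)), ("bb0", (70000, 1)), ("bb1", (5, 6))]
hist = profile_blocks(trace)
for block, counter in hist.items():
    print(block, dict(counter), round(narrow_fraction(counter), 2))
```

Blocks with a high narrow fraction would be the natural candidates for narrow-width regions.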
Outline
• Motivation
• Micro-architectural support
• Narrow-width regions formation
• Simulation platform
• Evaluation
• Conclusions
Register file model
• Prior work to reduce the energy consumption in register file
– limited port connectivity
– limited number of registers
• We address a new dimension:
– reduce register file activity by reducing register file width
• We propose the byte-slice register file approach
1. logically split into slices (8 bits, 8 bits, 16 bits; 32 bits in total)
2. low-power mode via drowsy technique (preserves register cell contents), Flautner et al. ISCA-29
[Figure: byte-slice register file with row decoder, per-register tag bits, and slice enable signal]
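The byte-slice idea can be illustrated with a small behavioural model. It is a sketch under stated assumptions: the class and method names are hypothetical, the tag is stored as a width (8/16/32) rather than as the two tag bits of the slide, and slice accesses are used only as a rough proxy for register file activity.

```python
SLICE_BITS = (8, 8, 16)                # slice widths, low to high, as on the slide
ENABLED_SLICES = {8: 1, 16: 2, 32: 3}  # slices driven in each execution mode (assumed mapping)

def true_width(value: int) -> int:
    """Narrowest mode (8/16/32) that holds a signed value (assumed tagging rule)."""
    for bits in (8, 16):
        if -(1 << (bits - 1)) <= value < (1 << (bits - 1)):
            return bits
    return 32

class ByteSliceRegisterFile:
    def __init__(self, num_regs: int = 64):
        self.values = [0] * num_regs
        self.tags = [8] * num_regs      # true data width of each register
        self.slice_accesses = 0         # proxy for register file activity

    def write(self, reg: int, value: int, mode: int) -> None:
        # Simplification: the full value is kept, only the enabled slices are counted.
        self.values[reg] = value
        self.tags[reg] = true_width(value)
        self.slice_accesses += ENABLED_SLICES[mode]

    def read(self, reg: int, mode: int):
        """Return (value, fits); fits is False when the tag exceeds the current mode."""
        self.slice_accesses += ENABLED_SLICES[mode]
        return self.values[reg], self.tags[reg] <= mode

rf = ByteSliceRegisterFile()
rf.write(1, 42, mode=8)
rf.write(2, 70000, mode=32)
print(rf.read(1, mode=8), rf.read(2, mode=16), rf.slice_accesses)
```

A read whose `fits` flag is False corresponds to the case where the tag bits reveal a wider value than the current mode assumes.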
Reconfigurable data-path
• data-path resizable to accommodate the bit-width execution mode (via clock-gating)
– pipeline latches
– ALU
• clock-gating at coarser granularity
[Figure: data-path in which the slice-enable signal selects 8/16/32 mode for write-back, bypass, ALU and LSU]
Exception management
• Data-path width misprediction may occur due to a dynamic event
• Simple recovery scheme
– the tag bits indicate the true data-width
– upon a misprediction:
• trigger an exception
• recover to the correct execution mode
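The recovery scheme can be sketched in software as follows. This is an illustration only: the exception class, the function names, the choice to also check the result width, and the policy of staying in the wider mode after recovery are assumptions, not details taken from the slides.

```python
class WidthMisprediction(Exception):
    """Raised when a value needs a wider mode than the speculated one."""
    def __init__(self, required_mode: int):
        super().__init__(f"operand needs {required_mode}-bit mode")
        self.required_mode = required_mode

def width_tag(value: int) -> int:
    """True data width (8/16/32) of a signed value, standing in for the tag bits."""
    for bits in (8, 16):
        if -(1 << (bits - 1)) <= value < (1 << (bits - 1)):
            return bits
    return 32

def execute_add(a: int, b: int, mode: int) -> int:
    # Assumption: both operands and the result must fit in the current mode.
    required = max(width_tag(a), width_tag(b), width_tag(a + b))
    if required > mode:
        raise WidthMisprediction(required)
    return a + b

def run_region(pairs, mode: int):
    """Run additions speculatively in `mode`; return (results, final_mode, exceptions)."""
    results, exceptions = [], 0
    for a, b in pairs:
        try:
            results.append(execute_add(a, b, mode))
        except WidthMisprediction as exc:
            exceptions += 1
            mode = exc.required_mode                 # recover to the correct mode
            results.append(execute_add(a, b, mode))  # re-execute the instruction
    return results, mode, exceptions

print(run_region([(1, 2), (100, 100), (40000, 1)], mode=8))
```

Each exception in the sketch stands for one misprediction, which is where the recovery penalty would be paid.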
Address instructions
• Special care must be taken with address instructions
– separate address calculation from memory access
• Use of dedicated registers for address computation
– accumulator registers with additional ISA support (see paper for details)
A two-step process
[Figure: Step 1: machine and input data sets produce an annotated .s file. Step 2: the annotated .s file goes through address transformation to produce a modified .s file.]
Profiling
• Bit-width characteristics of selected regions
[Figure: narrow-width operands (0% to 100%) against weight of regions in program; legend: 32 bits, other, LD/ST with 32 bits, 8/16 bits]
Address instructions transformation
• Problem: transform memory instructions into equivalent accumulator-based instructions
• A graph partitioning formulation:
– G, DDG of a BB
– (n, m) ∈ G iff there is a def-use relation between n and m
– Select (n, m) such that n has a 32-bit width operand and m is a LD/ST instr
– Replace m with accumulator-based instructions
– Minimize cut-size, i.e. the number of instructions to move data from regfile to accumulators
[Figure: DDG with add1, add2 and load; after transformation: add -> Rx, mov Rx -> ACC, LDACC Ry, add2]
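The selection and replacement steps can be sketched as below. The sketch only selects the qualifying def-use edges and counts the resulting moves; it does not attempt the cut-size minimization. The data layout, the function names, and the `_acc` mnemonic suffix are hypothetical.

```python
MEMORY_OPS = {"load", "store"}

def select_cut(instrs, edges):
    """Def-use edges (n, m) where n has a 32-bit operand and m is a LD/ST instruction."""
    return [(n, m) for n, m in edges
            if instrs[n]["width"] == 32 and instrs[m]["op"] in MEMORY_OPS]

def rewrite(instrs, edges):
    """Return (rewritten instruction list, number of regfile-to-accumulator moves)."""
    cut = select_cut(instrs, edges)
    producers = {n for n, _ in cut}
    consumers = {m for _, m in cut}
    out = []
    for name, info in instrs.items():   # dicts preserve program order
        op = info["op"]
        # Memory instructions fed by a 32-bit producer become accumulator-based.
        out.append(f"{op}_acc {name}" if name in consumers else f"{op} {name}")
        if name in producers:
            out.append(f"mov {name} -> ACC")   # move the address into the accumulator
    return out, len(producers)

instrs = {
    "add1": {"op": "add", "width": 32},   # address computation
    "load": {"op": "load", "width": 8},
    "add2": {"op": "add", "width": 8},
}
edges = [("add1", "load"), ("load", "add2")]
print(rewrite(instrs, edges))
```

In this example only the edge from add1 to load is selected, so one move is inserted.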
Instructions reordering
• Problem:
– reorder instructions in a basic block such that operations with 32-bit operands are moved around 8/16-bit operations
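One way to picture the reordering is a greedy list scheduler that keeps emitting instructions of the current width class for as long as dependences allow. This is a swapped-in illustration, not the algorithm of the paper; the names and the (name, width) instruction format are assumptions, and the dependence graph is assumed to be acyclic.

```python
def is_narrow(width: int) -> bool:
    return width < 32

def mode_switches(instrs) -> int:
    """Number of narrow/wide transitions in a sequence of (name, width) pairs."""
    classes = [is_narrow(w) for _, w in instrs]
    return sum(1 for x, y in zip(classes, classes[1:]) if x != y)

def reorder(instrs, deps):
    """instrs: list of (name, width) in program order; deps: set of (before, after) names."""
    remaining = list(instrs)
    done, order, current = set(), [], None
    while remaining:
        # An instruction is ready once all its predecessors have been emitted.
        ready = [(n, w) for n, w in remaining
                 if all(b in done for b, a in deps if a == n)]
        # Prefer staying in the current width class; otherwise take the earliest ready one.
        pick = next((i for i in ready if is_narrow(i[1]) == current), ready[0])
        remaining.remove(pick)
        order.append(pick)
        done.add(pick[0])
        current = is_narrow(pick[1])
    return order

block = [("a", 8), ("b", 32), ("c", 8), ("d", 32), ("e", 16)]
deps = {("a", "c"), ("b", "d")}
new_order = reorder(block, deps)
print([n for n, _ in new_order], mode_switches(block), mode_switches(new_order))
```

Fewer transitions between width classes would mean fewer reconfiguration instructions for the region.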
Simulation platform
• Lx processor platform
– in-order
– 4-issue width
– 64 32-bit GPR
– 8 1-bit CBR
– 6-stage pipeline
– 4 ALUs, 1 LSU
– 2 MULs
• Tools
– CACTI: register file access energy
– HotLeakage: leakage energy
Summary of results
• IPC degradation with varying misprediction penalty and varying bit-width convergence
[Figure: IPC degradation; labels p = 5, 25 and 60%, 80%]
Summary of results
• Dynamic energy reduction
Summary of results
• Register file static energy savings
Conclusions
• Contribution to power-aware compilation
– speculative management of processor data-path in software
– simple exception management scheme to repair a software misprediction
• Evaluation results
– 17% data-path dynamic energy savings
– 22% register file static energy savings
– performance impact varies with implementation cost of the recovery scheme
• Future work
– evaluation with larger granularity (e.g. trace)
• can reduce number of mispredictions
• can reduce number of reconfiguration instructions
Thanks !
Questions …