1
RELOCATE
Register File Local Access Pattern Redistribution Mechanism for Power and
Thermal Management in Out-of-Order Embedded Processor
Houman Homayoun, Aseem Gupta, Avesta Sasan, Alex Veidenbaum, Nikil Dutt, Fadi Kurdahi
University of California Irvine
2
Outline
• Motivation
• Background study
• Study of register file underutilization
• Study of register file default access patterns
• Access concentration and activity redistribution to relocate register file access patterns
• Results
3
Why Register File?
• RF is one of the hottest units in a processor
– A small, heavily multi-ported SRAM
– Accessed very frequently
• Example: IBM PowerPC 750FX
4
Why Temperature?
• Higher power densities (watts per mm²) lead to higher operating temperatures, which
(i) Increase the probability of timing violations
(ii) Reduce IC lifetime
(iii) Lower operating frequency
(iv) Increase leakage power
(v) Require expensive cooling mechanisms
(vi) Increase overall design effort and cost
5
Prior Work: Activity Migration
• Reduces temperature by migrating the activity to a replicated unit.
– requires a replicated unit
• large area overhead
– leads to a large performance degradation
[Figure: temperature versus time over alternating active and idle periods, comparing AM (activity migration) with AM+PG; marked temperature levels: T ambient, T init, T final, T crisis]
6
Conventional Register Renaming
[Figure: register renamer with a free list and an active list (head and tail pointers), showing register allocation and release]
Instruction #   Original code   Renamed code
1               RA <- ...       PR1 <- ...
2               ... <- RA       ... <- PR1
3               branch to _L    branch to _L
4               RA <- ...       PR4 <- ...
5               ...             ...
                ...             ...
6               _L:             _L:
7               ... <- RA       ... <- PR1
• Physical registers are allocated/released in a somewhat random order
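To illustrate why the allocation order drifts away from numeric order, the following minimal Python sketch models a renamer with a single FIFO free list. The class name, the register count, and the release order are illustrative assumptions, not the design of the modeled processor.

```python
from collections import deque


class ConventionalRenamer:
    """Toy renamer with one FIFO free list shared by all physical registers (assumed model)."""

    def __init__(self, num_physical):
        self.free_list = deque(range(num_physical))  # head = next register to allocate
        self.map_table = {}                          # architectural -> physical register

    def rename_dest(self, arch_reg):
        """Allocate a physical register for a destination; return (new, previous mapping)."""
        new_phys = self.free_list.popleft()
        prev_phys = self.map_table.get(arch_reg)
        self.map_table[arch_reg] = new_phys
        return new_phys, prev_phys

    def release(self, phys_reg):
        """Return a dead register to the tail of the free list."""
        self.free_list.append(phys_reg)


renamer = ConventionalRenamer(num_physical=4)
allocated = [renamer.rename_dest(f"R{i}")[0] for i in range(4)]  # takes 0, 1, 2, 3
for phys in (2, 0, 3, 1):        # registers die in a program-dependent order
    renamer.release(phys)
free_order = list(renamer.free_list)   # the next allocations follow this order
```

After a few allocate/release rounds the free list no longer reflects the physical layout, so consecutive allocations land anywhere in the register file.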
7
Analysis of Register File Operation
1. Register File Occupancy
[Figure: distribution of register file occupancy for (a) MiBench and (b) SPECint2K; y-axis 0% to 100%; legend: RF_occupancy < 16, 16 < RF_occupancy < 32, 32 < RF_occupancy < 48, 48 < RF_occupancy < 64]
8
Performance Degradation with a Smaller Register File
[Figure: % performance degradation with 48-entry, 32-entry, and 16-entry register files for (a) MiBench (y-axis 0% to 35%) and (b) SPECint2K (y-axis 0% to 60%)]
9
Analysis of Register File Operation
2. Register File Access Distribution
– The coefficient of variation (CV) shows the “deviation” from the average number of accesses for individual physical registers.
• $na_i$ is the number of accesses to physical register $i$ during a specific period (10K cycles); $\overline{na}$ is the average number of accesses per register
• $N$ is the total number of physical registers
$$CV_{access} = \frac{\sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(na_i - \overline{na}\right)^2}}{\overline{na}}$$
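The coefficient of variation can be computed from per-register access counters as in the sketch below. It assumes the usual reading of the formula (population standard deviation divided by the mean); the function name and the sample counts are made up for illustration.

```python
import math


def access_cv(access_counts):
    """Coefficient of variation of per-register access counts over one interval."""
    n = len(access_counts)
    mean = sum(access_counts) / n
    if mean == 0:
        return 0.0  # no accesses at all: treat as perfectly uniform
    variance = sum((count - mean) ** 2 for count in access_counts) / n
    return math.sqrt(variance) / mean


# Every register accessed equally often: no deviation from the average.
uniform = access_cv([100] * 64)
# Same total accesses, but concentrated in half of the registers.
skewed = access_cv([200] * 32 + [0] * 32)
```

A low CV means accesses are spread evenly over the register file; concentrating the same accesses in half of the registers raises it.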
10
Coefficient of Variation
[Figure: % coefficient of variation of register accesses for (a) MiBench (y-axis 0% to 12%) and (b) SPEC2K (y-axis 0% to 14%)]
11
Register File Operation
Underutilization that is uniformly distributed
While only a small number of registers are occupied at any given time, the accesses are spread uniformly over the entire physical register file during the course of execution.
12
RELOCATE: Access Redistribution within a Register File
• The goal is to “concentrate” accesses within a partition of the RF (a region)
– Some regions will be idle (for 10K cycles)
• Can power-gate them and allow them to cool down
[Figure: register activity for (a) baseline, (b) in-order, and (c) distant patterns]
13
An Architectural Mechanism to Support Access Redistribution
• Active partition: a register renamer partition currently used in register renaming
• Idle partition: a register renamer partition which does not participate in renaming
• Active region: a region of the register file corresponding to a register renamer partition (whether active or idle) which has live registers
• Idle region: a region of the register file corresponding to a register renamer partition (whether active or idle) which has no live registers
14
Activity Migration without Replication
• An access concentration mechanism allocates registers from only one partition
• This default active partition (DAP) may run out of free registers before the 10K-cycle “convergence period” is over
– another partition (chosen according to some algorithm) is then activated (referred to as an additional active partition, or AAP)
– To facilitate physical register concentration in the DAP, if two or more partitions are active and have free registers, allocation is performed in the same order in which the partitions were activated.
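The allocation policy above can be sketched in Python as follows. This is a minimal model under stated assumptions: equal-sized partitions, one free list per partition, and a hypothetical "lowest-numbered inactive partition" rule standing in for the unspecified activation algorithm.

```python
from collections import deque


class PartitionedFreeList:
    """Toy model: registers split into equal partitions, one free list each (assumed layout)."""

    def __init__(self, num_registers=64, num_partitions=4, default_partition=0):
        self.size = num_registers // num_partitions
        self.free = [deque(range(p * self.size, (p + 1) * self.size))
                     for p in range(num_partitions)]
        self.active_order = [default_partition]  # DAP first, then AAPs in activation order

    def _pick_additional_partition(self):
        """Hypothetical policy: lowest-numbered inactive partition with free registers."""
        for p, free in enumerate(self.free):
            if p not in self.active_order and free:
                return p
        return None

    def allocate(self):
        # Try active partitions in the order in which they were activated.
        for p in self.active_order:
            if self.free[p]:
                return self.free[p].popleft()
        p = self._pick_additional_partition()
        if p is None:
            return None  # no free physical register: renaming would stall
        self.active_order.append(p)
        return self.free[p].popleft()

    def release(self, reg):
        self.free[reg // self.size].append(reg)


rf = PartitionedFreeList(num_registers=8, num_partitions=4)  # 2 registers per partition
regs = [rf.allocate() for _ in range(3)]  # the DAP runs out after two allocations
rf.release(regs[0])                       # a DAP register becomes free again
next_reg = rf.allocate()                  # served from the DAP, not the AAP
```

Because allocation always retries the earliest-activated partition first, registers drift back into the DAP as soon as it has free entries.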
15
The Access Concentration Mechanism
• Partition activation order is 1-3-2-4
[Figure: register renamer split into four partitions (P1 to P4), each with its own free list and active list; annotations: free-list 1 full, free-list 3 full, free-list 2 full, free-list 4 full; active list 4 empty, active list 2 empty, active list 3 empty, active list 1 empty]
16
The redistribution mechanism
• The default active partition is changed once every N cycles to redistribute the activity within the register file (according to some algorithm)
– Once a new default partition (NDP) is selected, all active partitions (DAP+AAP) become idle.
• The idle partitions do not participate in register renaming, but their corresponding RF regions may have to be kept active (powered up)
– A physical register in an idle partition may be live
• An idle RF region is power gated when its active list becomes empty.
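The bookkeeping behind these rules can be sketched in Python as below. The class and method names are hypothetical, and the choice of the new default partition is left to the caller because the slides do not specify the selection algorithm.

```python
class RegionPowerState:
    """Toy bookkeeping for RF regions, one region per renamer partition (assumed layout)."""

    def __init__(self, num_partitions=4, default_partition=0):
        self.live = [set() for _ in range(num_partitions)]  # live registers per region
        self.active_partitions = [default_partition]        # DAP first, then any AAPs
        self.powered = [p == default_partition for p in range(num_partitions)]

    def on_allocate(self, partition, reg):
        if partition not in self.active_partitions:
            self.active_partitions.append(partition)  # partition activated as an AAP
        self.live[partition].add(reg)
        self.powered[partition] = True                # region must be awake to be written

    def on_release(self, partition, reg):
        self.live[partition].discard(reg)
        self._gate_if_idle(partition)

    def select_new_default(self, new_default):
        """Called once every N cycles: the DAP and all AAPs become idle, the NDP takes over."""
        retired = self.active_partitions
        self.active_partitions = [new_default]
        self.powered[new_default] = True
        for p in retired:
            self._gate_if_idle(p)

    def _gate_if_idle(self, partition):
        # Gate only regions whose partition is idle and which hold no live registers.
        if partition not in self.active_partitions and not self.live[partition]:
            self.powered[partition] = False


state = RegionPowerState(num_partitions=4)
state.on_allocate(0, reg=3)
state.select_new_default(2)            # partition 0 goes idle, but register 3 is still live
powered_while_live = list(state.powered)
state.on_release(0, reg=3)             # last live register released: region 0 is gated
powered_after_release = list(state.powered)
```

The example shows the distinction the slide draws: an idle partition stops renaming immediately, but its region stays powered until its last live register is released.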
17
The redistribution mechanism
[Figure: the four renamer partitions (P1 to P4), each with its own free list and active list; annotations: free-list 1 full, free-list 3 full, free-list 2 full, free-list 4 full; active list 4 empty, active list 2 empty, active list 3 empty, active list 1 empty]
18
Performance Impact?
• There is a two-cycle delay to wake up a power-gated physical register region
• Register renaming occurs in the front end of the microprocessor pipeline, whereas register access occurs in the back end.
– There is a delay of at least two pipeline stages between renaming and accessing a physical register file
– Can wake up the requested region in time
• A required register file region can therefore be woken up without incurring a performance penalty at the time of access
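The timing argument reduces to a small piece of arithmetic, sketched below. It assumes that wake-up is triggered at rename time and that each pipeline stage takes one cycle; the function and parameter names are illustrative.

```python
def wakeup_stall_cycles(wakeup_delay_cycles, rename_to_access_stages, cycles_per_stage=1):
    """Cycles an access would have to wait if wake-up starts when the register is renamed."""
    slack = rename_to_access_stages * cycles_per_stage
    return max(0, wakeup_delay_cycles - slack)


# Two-cycle wake-up, at least two stages between rename and register access.
hidden = wakeup_stall_cycles(wakeup_delay_cycles=2, rename_to_access_stages=2)
# With only one stage of slack, one cycle of the wake-up would be exposed.
exposed = wakeup_stall_cycles(wakeup_delay_cycles=2, rename_to_access_stages=1)
```

With the numbers from the slide, the wake-up latency is fully hidden behind the rename-to-access distance.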
19
Experimental setup
• MASE (SimpleScalar 4.0)
– Model MIPS-74K processor, 800 MHz
• MiBench and SPECint2K benchmarks compiled with Compaq compiler, -O4 flag
• Industrial memory compiler used
– 64-entry, 64-bit single-ended SRAM memory in TSMC 45nm technology
• HotSpot to estimate thermal profiles
20
Table 1. Processor Architecture
L1 I-cache               8KB, 4 way, 2 cycles
L1 D-cache               8KB, 4 way, 2 cycles
L2-cache                 128KB, 15 cycles
Fetch, dispatch          2 wide
Register file            64 entry
Memory                   50 cycles
Instruction fetch queue  2
Load/store queue         16 entry
Arithmetic units         2 integer
Complex unit             2 INT
Pipeline                 12 stages
Processor speed          800 MHz
Issue                    Out-of-order
Table 2. RF Design specification
Process                                              45nm CMOS, 9 metal layers
Register file layout area                            0.009 mm²
Operating modes                                      Active: R/W; Sleep: no data retention
Operating voltage                                    0.6V~1.1V
Read access cycle                                    200MHz to 1.1GHz
Access time, typical corner (0.9V, 45°C)             0.32ns
Active power (total), typical corner (0.9V, 45°C)    66mW @ 800MHz
Active leakage power, typical corner (0.9V, 45°C)    15mW
Sleep leakage power, typical corner (0.9V, 45°C)     2mW
Wakeup delay                                         0.42ns
Wakeup energy per register file row (64 bits)        0.42nJ
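As a rough worked example using the Table 2 figures, the sketch below estimates how long a region must stay gated before the leakage saved exceeds the energy of one wake-up. It rests on assumptions the slides do not state: leakage is spread evenly over the 64 rows, the quoted leakage values are for the whole register file, and the energy of entering sleep mode is ignored.

```python
ACTIVE_LEAKAGE_W = 15e-3           # Table 2: active leakage power (assumed whole RF)
SLEEP_LEAKAGE_W = 2e-3             # Table 2: sleep leakage power (assumed whole RF)
WAKEUP_ENERGY_PER_ROW_J = 0.42e-9  # Table 2: wakeup energy per 64-bit row
TOTAL_ROWS = 64                    # 64-entry register file
CLOCK_HZ = 800e6                   # processor speed


def break_even_cycles(rows_in_region):
    """Idle cycles after which gating a region saves more energy than one wake-up costs."""
    # Assumption: leakage scales linearly with the number of rows in the region.
    saved_power = (ACTIVE_LEAKAGE_W - SLEEP_LEAKAGE_W) * rows_in_region / TOTAL_ROWS
    wakeup_energy = WAKEUP_ENERGY_PER_ROW_J * rows_in_region
    return wakeup_energy / saved_power * CLOCK_HZ


cycles_4_partitions = break_even_cycles(rows_in_region=16)
cycles_8_partitions = break_even_cycles(rows_in_region=8)
```

Under these assumptions the break-even point is roughly 1,650 cycles regardless of region size, because both the wake-up energy and the saved leakage scale with the number of rows. That is well below the 10K-cycle period used elsewhere in the deck.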
21
Results
MiBench RF power reduction
[Figure (a): RF power reduction (%) for MiBench benchmarks with num_partition = 2, 4, and 8; y-axis 0% to 55%]
22
SPEC2K RF power reduction
[Figure (b): RF power reduction (%) for SPEC2K benchmarks with num_partition = 2, 4, and 8; y-axis 0% to 50%]
23
Analysis of Power Reduction
• Increasing the number of RF partitions provides more opportunity to capture and cluster unmapped registers into a partition
– Indicates that wakeup overhead is amortized for a larger number of partitions.
• Some exceptions
– the overall power overhead associated with waking up an idle region becomes larger as the number of partitions increases.
– frequent but ineffective power gating, and its overhead, as the number of partitions increases
24
Peak Temperature Reduction
Table 1. Peak temperature reduction for MiBench benchmarks
                 base               temperature reduction (°C)
benchmark        temperature (°C)   2P      4P      8P
basicMath        94.3               3.6     4.8     5.0
bc               95.4               3.8     4.4     5.2
crc              92.8               5.3     6.0     6.0
dijkstra         98.4               6.3     6.8     6.4
djpeg            96.3               2.8     3.5     2.4
fft              94.5               6.8     7.4     7.6
gs               89.8               6.5     7.4     9.7
gsm              92.3               5.8     6.7     6.9
lame             90.6               6.2     8.5     11.3
mad              93.3               3.8     4.3     2.2
patricia         79.2               11.0    12.4    13.2
qsort            88.3               10.1    11.6    11.9
search           93.8               8.7     9.3     9.1
sha              90.1               5.1     5.4     4.5
susan_corners    92.7               4.7     5.3     5.1
susan_edges      91.9               3.7     5.8     6.3
tiff2bw          98.5               4.5     5.9     4.1
average          92.5               5.6     6.8     6.9
Table 2. Peak temperature reduction for SPEC2K integer benchmarks
                 base               temperature reduction (°C)
benchmark        temperature (°C)   2P      4P      8P
bzip2            92.7               4.8     3.9     3.1
crafty           83.6               9.5     11.0    10.4
eon              77.3               10.6    12.4    12.5
galgel           89.4               6.9     7.2     5.8
gap              86.7               4.8     5.9     7.1
gcc              79.8               7.9     9.4     10.1
gzip             95.4               3.2     3.8     3.9
mcf              85.8               6.9     8.7     9.4
parser           97.8               4.3     5.8     4.8
perlbmk          85.8               10.6    12.3    12.6
twolf            86.2               8.8     10.2    10.5
vortex           81.7               11.3    12.5    12.9
vpr              94.6               4.9     5.2     4.4
average          87.4               7.2     8.3     8.2
25
Analysis of Temperature Reduction
• Increasing the number of partitions results in larger power density in each partition because RF access activity is concentrated in a smaller partition
– While capturing more idle partitions and power gating them may result in higher power reduction, the larger power density due to the smaller partition size can result in an overall higher temperature
26
Conclusions
• Showed register file underutilization
• Studied register file default access patterns
• Proposed access concentration and activity redistribution to relocate register file accesses
• Results show a noticeable power and temperature reduction in the RF
• The RELOCATE technique can be applied when units are underutilized, as opposed to activity migration, which requires replication