Upload
others
View
1
Download
0
Embed Size (px)
Citation preview
Activity-Sensitive Flip-Flop and Latch Selection for Reduced Energy
Seongmoo Heo, Ronny Krashinsky, Krste AsanoviüMIT - Laboratory for Computer Science
http://www.cag.lcs.mit.edu/scale
ARVLSIMarch 15, 2001
Flip-Flop and Latch(collectively timing elements)
• Critical Timing Elements (TEs) in modern synchronous VLSI systems
áSignificant impact on cycle time
áBig portion of energy consumption
3% 1%2%7%
8%
23%
10%
23%
23%EqualCheckBufferShifterAdderALURegFileMuxLatchFlipflop
[Heo, MS Thesis, ’00]
Energy breakdown of a MIPS 5 stage pipeline datapath for SPECint 95 programs
Flip-flop
Latch
Motivation
• Previous work tried to find the most energy-efficient and fastest TEs
áassuming a single TE design used uniformly throughout a circuit.áusing a very limited set of data patterns and un-gated clock signal.
• Two important observations
áThere is a wide variation in clock and data activity across different TEs.áMany TEs are not in the critical path, and thus have ample time slack.
Basic Idea
• Selection from a heterogeneous library of designs, each tuned to different operating regimes
• Operating regimes :
o Different input and clock signal activitieso Different speed requirements
Related Work
• The use of timing slack for reduced energy
o Examples : - Traditional transistor sizing - Cluster voltage scaling [Usami and Horowitz ’95]- Multiple threshold voltage or series transistor for reducing leakage current [McPherson et al. ’00,
Yamashita et al. ’00, Johnsonet al. ’99]
Our Contribution
• Detailed energy characterization of wide range of TEs as a function of signal activities.
• Detailed measurement of TE signal activities for a micro-processor running complete programs
• Exploit signal activity to reduce TE energy by using different TE structures.
Overview
• Flip-Flop and Latch Designs
• Test Bench and Simulation Setup
• Delay and Energy Characterization
• Energy Analysis with Test Waveforms
• Evaluation with Processor
• Conclusion
Latch Designs
Transistor sizes optimized for two extremes:Highest speed vs. Lowest power
Flip-Flop Designs
Transistor sizes optimized for two extremes:Highest speed vs. Lowest power
Test Bench
• Used fixed, realistic input driver
• Determined appropriate output load o As large as 200fF output load was used by previous work.o We used 7.2fF (4 min-inv cap) because 60% of output loads in
the VP microprocessor datapath are smaller than 14.4fF.o Further work on load-sensitive analysis at upcoming WVLSI
• Sized clock buffer to give equal rise/fall time
7.2fF
Simulation Setup
• Custom layout in 0.25m TSMC CMOS process with Magic layout program
• Layout extraction with SPACE 2D extractor
• Circuit simulation with Hspice under nominal condition of Vdd=2.5V and T=25°C
o Hspice .Measure command to measure delay and energy
0
200
400
600
800
1000
1200
dela
y (p
s)
PPCFFSSAFF
SAFFM
SAFFHLF
FHLS
FFSSAPLSSASPLCCPPCFF
PPCLAPTL
ASSALASSA2L
ACPNLA
lowest power
highest speed
Delay Characterization
• Flip-flop : Minimum D-Q delay [Stojanovic et al. ’99]
• Latch : D-Q delay
(b) Latches(a) Flip-flops
Energy Characterization
• Total energy = input energy + internal energy + clock energy – output energy
• Accurate energy characterization
o State-transition technique based on [Zyuban and Kogge ’99]
D
C
Q
C
Q
D
1
231 2 3
Energy Tables
(a) Flip-flops
(b) Latches
Energy Tables
(a) Flip-flops
Low-Power Flip-Flop
PPCFF 51.26.9
6.9
6.949.7
49.7
68.1
68.0
19.419.4
19.2
68.149.1
46.8
91.510146.3
46.0
47.689.2
89.0
95.5
95.4
48.4
011
↓001
111
↓101
110
↓100
010
↓000
001
↓011
101
↓111
100
↓110
000
↓010
111
↓011
101
↓001
110
↓010
100
↓000
011
↓111
010
↓111
001
↓100
000
↓100
(b) Latches
Test Waveforms
• Test 1 and 2 : high clock activity, no data and output activity
• Test 3 and 4 : high data activity, no clock and output activity
• Test 5, 6, and 7 : high clock, data, and output activity(Traditional)
• Test 8 : high clock and data activity, no output activity
Energy Analysis
1131
(a) Flip-flops
(b) Latches
0
50
100
150
200
250
fJ/c
ycle
Test 1 Test 3 Test 5
PPCLAPTLASSALASSA2LACPNLA
0
100
200
300
400
500
600
700
fJ/c
ycle
Test 1 Test 3 Test 5
PPCFFSSAFFSAFFMSAFFHLFFHLSFFSSAPLSSASPLCCPPCFF
Low-power flip-flops and latches
Processor Design and Simulation
• Evaluation on a microprocessor datapath
• Vanilla Pekoe Processoro A classic 32-bit MIPS RISC 5 stage pipeline with caches and system
coprocessor registers (R3000-compatible)o Aggressive clock gating to save energyo 22 multi-bit flip-flops and latches, totaling 675 individual bits
• Simulation with 5 programs of SPECint95 benchmarkso A fast cycle-accurate simulator [Krashinsky, Heo, Zhang, and Asanovic
’00] with the ability of counting TE state transitionso 1.71 billion instructions and 2.69 billion cycles
• Some constraintso Cannot track the exact timing of signalso Cannot model glitches
Flip-Flops and Latches in Processor
Flip-Flops and Latches in Processor
Flip-Flops and Latches in Processor
Energy Breakdown
0.05HLFF-lp0.1cp0_epc
0.18HLFF-lp0.3cp0_baddr
0.03HLFF-lp0.1cp0_comp
4.80SSAFF-lp42.6cp0_count
4.76SSAFF-lp24.6m_exe
2.57SAFF-lp8.0x_addr
1.06SAFF-lp2.6x_sd
2.55SSAFF-lp20.2m_epc
2.62SSAFF-lp20.3x_epc
2.74SSAFF-lp20.5d_epc
6.52SSAFF-lp31.2d_inst
3.57SSAFF-lp25.1f_recovpc
Lowest-EnergyHLFF-hs
Flip-flops
2.42SSALA-lp2.74w_result
0.27SSA2LA-lp0.30x_sdalign
3.65SSALA-lp3.88m_exe
0.97SSALA-lp1.26d_aluctrl
0.63PPCLA-lp0.65d_rtshmd
0.70PPCLA-lp0.75d_rsshmd
2.28SSALA-lp2.81d_rtalu
3.16SSALA-lp3.27d_rsalu
1.72SSALA-lp2.95f_pc
2.25SSALA-lp3.22p_pc
Lowest-EnergyPPCLA-hs
Latches
(unit: mJ)
(unit: mJ)
• 32-bit MIPS 5 stage pipeline datapath• SPECint95 benchmarks: perl(test, primes),
ijpeg(test), m88ksim(test), go(20,9), and lzw(medtest)
Processor Energy Results - Flip-Flop
0
0.02
0.04
0.06
0.08
0.1
0.12
0.14
0.16
0.18
0.2
0 500 1000
Flip- flo p De la y ( ps )
To
tal
Flip
-fl
op
En
erg
y (
J)
Unifo rmHLFF- S iz ingHLFF- ASS S AS P L- S iz ingS S AS P L- ASSSASPL-hs
HLFF-hs
SSAFF-lp
HLFF-lp
SSAFF-hs
SSASPL-lp
•Ref : Total datapath energy – Total TE energy = around 0.21J
HS: Highest-SpeedLP: Lowest-Power
(A single design used uniformly throughout a circuit)
Processor Energy Results - Flip-Flop
0
0.02
0.04
0.06
0.08
0.1
0.12
0.14
0.16
0.18
0.2
0 500 1000
Flip- flo p De la y ( ps )
To
tal
Flip
-fl
op
En
erg
y (
J)
Unifo rmHLFF- S iz ingHLFF- ASS S AS P L- S iz ingS S AS P L- AS
•34% energy saving with conventional transistor sizing
HLFF-hs
34% energy saving
Processor Energy Results - Flip-Flop
0
0.02
0.04
0.06
0.08
0.1
0.12
0.14
0.16
0.18
0.2
0 500 1000
Flip- flo p De la y ( ps )
To
tal
Flip
-fl
op
En
erg
y (
J)
Unifo rmHLFF- S iz ingHLFF- HS LES S AS P L- S iz ingS S AS P L- AS
•52% energy saving over just transistor sizing with the best performance (HLFF-hs)
HSLE: Activity-Sensitive selectionHLFF-hs
69% energy saving
52% energy saving
Processor Energy Results - Latch
0.015
0.02
0.025
0.03
0.035
0.04
0 100 200 300 400 500 600
Latc h De lay ( ps )
To
tal
La
tch
En
erg
y (J
)
UniformPPCLA- S iz ingPPCLA- HS LE2 1
•6.1% energy saving over just transistor sizing (1)•8.3% energy saving compared to homogeneous design with PPCLA-hs (2)•PPCLA is the fastest and also very energy-efficient.
SSA2LA-lpPPCLA-hs
Summary of Energy Results
• 63% TE energy savingcompared to a homogeneous design with HLFF-hs and PPCLA-hs
• 46% TE energy savingcompared to a design with conventional transistor sizing while keeping the best performance
Conclusion
áWe showed that activation patterns for various TEs in a circuit differ considerably.
áWe found that there is wide variation in the optimal TE designs for different regimes.
áWe provided complete energy and delay characterization.
áWe applied our technique to a real processor which we simulated 2.7 billion cycles of programs and showed over 63% TE energy reduction without losing any performance.
Difficulty of using a heterogeneous mix of TEs?
- Already designers have been doing verification for each local clock and added complexity is minimal.
- Timing verification for non-critical TEs is simple.