Low Power Functional Unit for use in Coarse Grained Reconfigurable Array
Nathaniel McVicar
Corey Olson
Jimmy Xu
Outline
  Functional Unit
    Shifter
    ALU
    MADD
  Design Flow (all modules)
    VCS
    Design Compiler
    PrimeTime
    Encounter & Cadence
    v2lvs
  UPF Tutorial
  Results
    Dynamic power consumption of modules
    Power down/up timing
    VDD scaling
FU Top Level
  Main Units
    ALU
    MADD
    Barrel shifter
  Supporting Modules
    Output muxes
    Clock gating registers
    Crossbar
IBM 65nm PDK
  Process: cmos10lpe
    Low power process
    Very low leakage in power analysis
  Standard cells: cp65npksdst_tt1p2v25c
Shifter Specs
  32-bit shifter with 5 shift bits
  Bi-directional shifting
  Logical and arithmetic shifting
  Purely combinational design
  1 GHz target frequency
    Want it as fast as possible
    Need to be power aware during synthesis
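A small behavioral reference model can make the shifter spec concrete. The sketch below is only an illustration in Python, not the authors' RTL: the function name `shift32` and the `left` / `arithmetic` flags are hypothetical, and the slides do not give the real control encoding.

```python
MASK32 = 0xFFFFFFFF

def shift32(x: int, amount: int, left: bool, arithmetic: bool) -> int:
    """Behavioral model of a 32-bit shifter with a 5-bit shift amount."""
    x &= MASK32
    amount &= 0x1F  # 5 shift bits give shift distances 0..31
    if left:
        # Left shifts fill with zeros whether logical or arithmetic.
        return (x << amount) & MASK32
    if arithmetic and (x & 0x80000000):
        # Arithmetic right shift replicates the sign bit X[31].
        fill = (MASK32 << (32 - amount)) & MASK32
        return (x >> amount) | fill
    # Logical right shift fills with zeros.
    return x >> amount
```

A model like this could serve as the expected-value generator when checking the synthesized netlist against random vectors.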
Shifter Design
[Figure: shifter datapath. Inputs: 31'b0, X[31:0], X[30:0], 31{X[31]}; control: LEFT / LOGICAL; shift-select stages S[4] down to S[0]; output Z.]
ALU Specs
  32-bit ALU
  Supports 15 instructions
  Combinational design
  1 GHz target frequency
    On critical path
    Want it as fast as possible
    Need to be power aware during synthesis
ALU Design Methodologies
  Muxed Output
    Simple functions with muxed output
    Gate off functions not in use
    More gates
    Higher leakage, lower switching
  Hardware Reuse
    Do everything with the adder
    Cannot gate the adder
    Fewer gates
    Lower leakage, higher switching
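The hardware-reuse idea can be illustrated with the standard two's-complement identity A - B = A + ~B + 1, which lets one adder serve several instructions. This is a minimal sketch under assumptions: the `flip_b` and `clear_a` controls loosely echo the flipB and clearA labels in the ALU Design 1 diagram, but their exact behavior in the real design is not given in the slides, and `alu_reuse` is a hypothetical name.

```python
MASK32 = 0xFFFFFFFF

def adder(a: int, b: int, carry_in: int = 0) -> int:
    """The single shared 32-bit adder."""
    return (a + b + carry_in) & MASK32

def alu_reuse(a: int, b: int, flip_b: bool = False, clear_a: bool = False) -> int:
    """Hardware-reuse style: precondition the operands, then always add."""
    a_in = 0 if clear_a else (a & MASK32)
    b_in = (~b & MASK32) if flip_b else (b & MASK32)
    # Inverting B and injecting a carry-in of 1 gives two's-complement subtraction.
    return adder(a_in, b_in, carry_in=1 if flip_b else 0)
```

With these two controls the same adder yields add (`alu_reuse(a, b)`), subtract (`flip_b=True`), and negate (`flip_b=True, clear_a=True`), which is why the adder itself can never be gated off in this style.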
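ALU Design 1
[Figure: ALU datapath built around an adder (+). Inputs A and B; input-conditioning controls flipA, flipB, clearA, clearB, setA; intermediate signals AB, P, G; output Z; Control with sel[1:0].]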
Power Results (Syn. Model values in parentheses)
  Switching:    630 uW   (3.55 mW)
  Interconnect: 1.14 mW  (3.94 mW)
  Leakage:      135 nW   (530 nW)
  Total:        1.77 mW  (7.5 mW)
ALU Design 2
[Figure: ALU datapath with an adder (+). Inputs A and B; further labels A, BO; enable signals (en) and a latch; output Z; Control with sel[1:0].]
Power Results (Syn. Model values in parentheses)
  Switching:    655 uW   (3.55 mW)
  Interconnect: 1.21 mW  (3.94 mW)
  Leakage:      160 nW   (530 nW)
  Total:        1.87 mW  (7.5 mW)
MADD Specs
  32-bit multiply-add unit
  2-cycle pipelined module
    Add input arrives on second cycle
  1 GHz target frequency
    Most power hungry module in design
    Need to be power aware during synthesis
    Ideally would run as fast as possible
    May need to trade speed for power (~700 MHz)
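The timing of the add input is easier to see in a cycle-level model. The following Python sketch is an assumption-heavy illustration, not the authors' design: `MaddModel` is a hypothetical name, the pipeline register is modeled as holding a finished product (real hardware would more likely hold intermediate sums), and the result is truncated to 32 bits because the slides do not state the output width.

```python
MASK32 = 0xFFFFFFFF

class MaddModel:
    """Cycle-level sketch of a 2-stage multiply-add: Z = A*B + C."""

    def __init__(self) -> None:
        self.product_reg = 0  # pipeline register between the two stages

    def clock(self, a: int, b: int, c: int) -> int:
        """Advance one clock cycle.

        a and b start a new operation in stage 1. c is the add input for the
        operation whose a and b were presented on the previous cycle.
        Returns the result of that previous operation.
        """
        z = (self.product_reg + c) & MASK32   # stage 2 uses the stored product
        self.product_reg = (a * b) & MASK32   # stage 1 captures the new product
        return z
```

Calling `clock(3, 4, 0)` and then `clock(5, 6, 7)` returns 3*4 + 7 on the second call, showing that C is paired with the multiplier operands from one cycle earlier.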
MADD Design
[Figure: MADD pipeline. Inputs A and B -> Heterogeneous Booth Enc -> PP Generation -> CSA Tree Stage 1 -> Registers (D/Q, CLK) -> CSA Tree Stage 2 (with input C) -> Final Adder -> Z.]
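The CSA tree stages in the diagram rely on carry-save addition, which reduces three operands to two without propagating carries, leaving a single carry-propagate step for the final adder. The sketch below shows only that generic 3:2 compression step in Python; the function name is hypothetical and it says nothing about how the authors arranged their tree.

```python
def carry_save_add(x: int, y: int, z: int) -> tuple[int, int]:
    """3:2 compression: reduce three operands to a sum word and a carry word."""
    partial_sum = x ^ y ^ z                      # per-bit sum, no carry propagation
    carry = ((x & y) | (x & z) | (y & z)) << 1   # per-bit majority, moved to its weight
    return partial_sum, carry
```

The invariant is that `partial_sum + carry` equals `x + y + z`, so partial products can be compressed level by level and added once at the end.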
VCS
  Testbenches written to verify functionality using VCS
  Random input vectors used for data
  Instructions/shift encodings tested sequentially
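The verification strategy (random data, encodings walked in order) can be sketched outside the simulator. The example below is Python rather than a Verilog testbench, and everything in it is an assumption for illustration: the opcode table is hypothetical because the slides do not list the ALU's 15 encodings, and `dut` stands in for whatever drives the real design.

```python
import random

MASK32 = 0xFFFFFFFF

# Hypothetical opcode table used as the golden reference.
REFERENCE_OPS = {
    0: lambda a, b: (a + b) & MASK32,
    1: lambda a, b: (a - b) & MASK32,
    2: lambda a, b: a & b,
    3: lambda a, b: a | b,
    4: lambda a, b: a ^ b,
}

def run_testbench(dut, vectors_per_op: int = 100, seed: int = 0) -> int:
    """Walk opcodes in order, drive random data, and count mismatches."""
    rng = random.Random(seed)  # fixed seed keeps failures reproducible
    mismatches = 0
    for opcode in sorted(REFERENCE_OPS):       # encodings tested sequentially
        for _ in range(vectors_per_op):
            a = rng.getrandbits(32)            # random data vectors
            b = rng.getrandbits(32)
            if dut(opcode, a, b) != REFERENCE_OPS[opcode](a, b):
                mismatches += 1
    return mismatches
```

Seeding the generator is a design choice worth keeping in any real testbench, since it lets a failing vector be replayed exactly.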
Design Compiler
  Compile to standard cell library
    cp65npksdst_tt1p2v25c from IBM's cmos10lpe
    Compile to others for corner analysis (ff, 1p0v, …)
  Control target frequency and synthesize for power
  Reports created
    Power: inaccurate, but use as a baseline
    Area: reports number of gates in design
    Timing: design can't always meet timing
DC Example
# standard cells that you synthesize to
set target_library <libname>.db
set link_library <libname>.db

# prepare and synthesize
analyze -f verilog <my_verilog_file>.v
elaborate <my_toplevel>
current_design <my_toplevel>
link
uniquify
compile_ultra -gate_clock
compile_ultra -incremental

# check for errors in the synthesized design (timing violations, cell warnings, …)
check_design
report_constraint -all_violators

# write the output file in verilog netlist format
write -f verilog -output <filename>.vh

# output the timing or power or cell report
redirect timing/power/cell.rep { report_timing/cell/power }
DC Example Output
Operating Conditions: TT1P2V25C   Library: cp65npksdst_tt1p2v25c
Wire Load Model Mode: enclosed

Design        Wire Load Model       Library
------------------------------------------------
Alu           B0.1X0.1              cp65npksdst_tt1p2v25c

Global Operating Voltage = 1.2
Power-specific unit information :
    Voltage Units = 1V
    Capacitance Units = 1.000000pf
    Time Units = 1ns
    Dynamic Power Units = 1mW    (derived from V,C,T units)
    Leakage Power Units = 1nW

  Cell Internal Power  = 433.2152 uW   (51%)
  Net Switching Power  = 409.2202 uW   (49%)
                         ---------
Total Dynamic Power    = 842.4354 uW  (100%)

Cell Leakage Power     = 129.3405 nW
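The percentages in this report follow directly from the two dynamic components, which gives a quick sanity check when comparing synthesis runs. The helper below is a hypothetical Python sketch (the function name and dictionary keys are made up for illustration); the input numbers are the ones in the report above.

```python
def dynamic_power_breakdown(internal_uw: float, switching_uw: float) -> dict:
    """Total dynamic power and each component's share, in uW and percent."""
    total = internal_uw + switching_uw
    return {
        "total_uw": total,
        "internal_pct": 100.0 * internal_uw / total,
        "switching_pct": 100.0 * switching_uw / total,
    }

# Values taken from the Alu report: cell internal and net switching power.
report = dynamic_power_breakdown(433.2152, 409.2202)
```

Leakage is reported separately in nW and is roughly three orders of magnitude below the dynamic total here, which matches the low-leakage process noted earlier.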
PrimeTime
  Power analysis
    Reports breakdown of power consumption
      Internal switching
      Intermediate nodes switching
      Leakage
    More detailed breakdown available
      Memory, clock network, register, combinational
  Timing check: redundant at this stage
  No functional verification
    Use simulator for functionality
      vcs, ncsim
PT Example
# setup
link_library <libname>.db
read_verilog <netlist>.vh
current_design <my_toplevel>
link

# for a design without an existing clock input
create_clock -name clock -period

# toggle_count is prob of switching, static is prob of being a 1
set_switching_activity -toggle_count 0.25 -static_probability 0.5 <INPUT>

# get the power analysis and write details to Alu.rpt
check_power
update_power
report_power > Alu.rpt
PT Example Output
Attributes
----------
    i  -  Including register clock pin internal power
    u  -  User defined power group

                Internal    Switching   Leakage     Total
Power Group     Power       Power       Power       Power     (     %)  Attrs
------------------------------------------------------------------------------
io_pad          0.0000      0.0000      0.0000      0.0000    (  0.00%)
memory          0.0000      0.0000      0.0000      0.0000    (  0.00%)
black_box       0.0000      0.0000      0.0000      0.0000    (  0.00%)
clock_network   0.0000      0.0000      0.0000      0.0000    (  0.00%)  i
register        0.0000      0.0000      0.0000      0.0000    (  0.00%)
combinational   9.606e-04   1.053e-03   1.295e-07   2.014e-03 (100.00%)
sequential      0.0000      0.0000      0.0000      0.0000    (  0.00%)

  Net Switching Power  = 1.053e-03   (52.30%)
  Cell Internal Power  = 9.606e-04   (47.70%)
  Cell Leakage Power   = 1.295e-07   ( 0.01%)
                         ---------
Total Power            = 2.014e-03  (100.00%)
Encounter
  Features
    Place and route
    Control the power and ground to all cells
    Extract parasitic capacitances
    Stream out GDS for use with Cadence
ALU Encounter Example
Encounter
  Failures
    Difficult to use
    Impossible to save netlist views
      Still need to use Cadence tools to generate SPICE netlist
    Unable to extract parasitics
      Could still do this with Cadence
Cadence
  Features
    Read in a Verilog netlist
    Stream in standard cell layouts and schematics
    Stream in GDS from Encounter
    Create SPICE netlist
ShiftLR Cadence Example
Cadence
  Failures
    Unable to properly stream in standard cell schematics
    Unable to create netlist from schematic
    Unable to run LVS or extract parasitics
  Solution
    v2lvs
v2lvs
  Generates a SPICE netlist from a synthesized Verilog netlist
  Include SPICE definitions of standard cells
  Run HSPICE simulations for power down/up sequence and VDD scaling
v2lvs Example
Verilog:
SEN_EO2_S_0P5 U2120 ( .A1(pprow4[11]), .A2(pprow5[9]), .X(n566) );
SEN_EO2_S_0P5 U2121 ( .A1(pprow4[13]), .A2(pprow5[11]), .X(n567) );
SEN_EO2_S_0P5 U2122 ( .A1(pprow2[13]), .A2(pprow7[3]), .X(n568) );
SEN_EO2_S_0P5 U2123 ( .A1(pprow2[15]), .A2(pprow7[5]), .X(n569) );

v2lvs:
v2lvs -i -v ../synthesis/ShiftLR.vh -s0 VSS -s1 VDD -s design_model.inc -o ShiftLR.sp -lsr cp65npksdst.lvs

HSPICE:
XU2120 n566 pprow4[11] pprow5[9] SEN_EO2_S_0P5
XU2121 n567 pprow4[13] pprow5[11] SEN_EO2_S_0P5
XU2122 n568 pprow2[13] pprow7[3] SEN_EO2_S_0P5
XU2123 n569 pprow2[15] pprow7[5] SEN_EO2_S_0P5
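The translation in this example is mechanical: each named-port Verilog instance becomes a SPICE subcircuit call with nets listed positionally. The Python sketch below imitates that mapping for illustration only and is not how v2lvs itself works. The `PORT_ORDER` table is an assumption inferred from the HSPICE lines above (output X first, then A1, A2); in a real flow the order comes from the cell's SPICE subcircuit definition.

```python
import re

# Assumed positional port order per cell, inferred from the slide's HSPICE output.
PORT_ORDER = {"SEN_EO2_S_0P5": ["X", "A1", "A2"]}

INSTANCE_RE = re.compile(r"(\w+)\s+(\w+)\s*\((.*)\)\s*;")
PIN_RE = re.compile(r"\.(\w+)\(\s*([^)\s]+)\s*\)")

def verilog_instance_to_spice(line: str) -> str:
    """Convert one named-port gate instance into a SPICE subcircuit call."""
    match = INSTANCE_RE.search(line)
    if match is None:
        raise ValueError(f"not a gate instance: {line!r}")
    cell, instance, pin_text = match.groups()
    pins = dict(PIN_RE.findall(pin_text))              # pin name -> net name
    nets = [pins[port] for port in PORT_ORDER[cell]]   # reorder to SPICE positions
    return f"X{instance} {' '.join(nets)} {cell}"
```

Getting the positional order wrong silently swaps nets, which is one reason to rely on a tool driven by the library's own subcircuit definitions rather than a hand-written script.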
HSPICE
Created simulation test-bench for power measurement using vector input
Adds potential VDD scaling and gating
Final Power Results
Synthesis Matters
  At 1 GHz, MADD power very dependent on synthesis options

               Internal   Switching   Leakage   Total
  Naïve        11.2 mW    7.16 mW     1.07 uW   18.3 mW
  Constrained  7.77 mW    4.56 mW     0.59 uW   12.3 mW
  Ultra        4.08 mW    1.88 mW     0.30 uW   5.96 mW
Synthesis Matters contd.
  The lower power synthesis options have trouble reducing clock and register power

               Clock    Register   Comb
  Naïve        9.95%    13.0%      77.05%
  Constrained  12.7%    14.8%      72.5%
  Ultra        27.4%    12.9%      58.5%
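Converting the clock shares back to absolute power makes the point about clock power more visible. The sketch below assumes the percentages are shares of the MADD totals reported for the same three synthesis runs (18.3 mW, 12.3 mW, 5.96 mW); the slides do not say so explicitly, and the names are hypothetical.

```python
# Totals from the MADD synthesis comparison, in mW.
TOTAL_MW = {"Naive": 18.3, "Constrained": 12.3, "Ultra": 5.96}
# Clock-network share of total power, in percent.
CLOCK_SHARE_PCT = {"Naive": 9.95, "Constrained": 12.7, "Ultra": 27.4}

def clock_power_mw(run: str) -> float:
    """Absolute clock power, assuming the share applies to the run's total."""
    return TOTAL_MW[run] * CLOCK_SHARE_PCT[run] / 100.0
```

Under that assumption the clock network draws about 1.8 mW, 1.6 mW, and 1.6 mW across the three runs, so it stays nearly flat while total power drops by roughly a factor of three.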
Power-up time results
  [Figures: results for four configurations]
    W=0.6um M=1
    W=0.6um M=12
    W=6um M=12
    W=6um M=120
Power-up time results contd.
  Iavg during power-down = 10.66 uA
  Pavg = 12.792 uW
  Power-up Delay = 9.4ps
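The reported average power is consistent with P = I × V at a 1.2 V supply. The supply value is an assumption taken from the tt1p2v library corner used elsewhere in the slides, not something stated on this slide, and the helper name is hypothetical.

```python
def average_power_uw(i_avg_ua: float, vdd_v: float) -> float:
    """P = I * V; microamps times volts gives microwatts."""
    return i_avg_ua * vdd_v

# Iavg from the slide, VDD assumed to be the nominal 1.2 V.
p_avg = average_power_uw(10.66, 1.2)
```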
Voltage Scaling - ALU
[Plot: Power (mW), axis 0 to 9, versus Delay (ps), axis 500 to 10500; 1 GHz point marked.]
Voltage Scaling - ShiftLR
[Plot: Power (mW), axis 0 to 0.6, versus Delay (ps), axis 100 to 100000; 1 GHz and 500 MHz points marked; supply voltages 1.2V, 1.0V, 0.8V, 0.6V.]
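For intuition about the shape of these curves, the usual first-order model says dynamic power scales with C·V²·f, so lowering VDD at a fixed clock rate cuts power quadratically, while the accompanying increase in delay may force a lower frequency. The sketch below applies only that textbook approximation; it is not fitted to the measured data, and the function name and 1.2 V reference are assumptions.

```python
def dynamic_power_ratio(vdd: float, vdd_ref: float = 1.2, freq_ratio: float = 1.0) -> float:
    """First-order CMOS dynamic power scaling: P is proportional to C * V^2 * f."""
    return (vdd / vdd_ref) ** 2 * freq_ratio
```

For example, dropping from 1.2 V to 0.6 V gives a ratio of about 0.25 at the same frequency, and about 0.125 if the clock is also halved from 1 GHz to 500 MHz. Leakage and short-circuit power are ignored in this model.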
Results
  Significantly reduced power for all modules
  Explored voltage scaling
  Implemented power-up / power-down sleep logic
Intangibles
Gained significant insight into the current state-of-the-art for low power FPGA and CGRA design, through reading
Gained practical knowledge working with the design tool chain of a commercial PDK
Questions?