ORION 2.0: A Fast and Accurate NoC Power and Area Model for Early-Stage Design Space Exploration
Andrew B. Kahng¶, Bin Li‡, Li-Shiuan Peh‡, Kambiz Samadi¶
¶ University of California, San Diego
‡ Princeton University
April 21, 2009
Outline
• Motivation
• ORION 2.0 Framework
• Dynamic Power Modeling
• Leakage Power Modeling
• Area Modeling
• Validation and Significance Assessment
• Conclusions
Motivation
• Many-core chips need NoCs to interconnect their cores
• Power efficiency of NoCs is important
  • Performance used to be the primary concern; now power efficiency is critical
  • 28% of total power in the Intel 80-core Teraflops chip is due to the interconnection network (routers + links)
• Rapid power estimation is needed to trade off alternative architectures
  • Rapid power-area tradeoffs at the architectural level
• Our goal: develop accurate models that are easily usable by system-level designers early in the design cycle
Related Work
• Real-chip power measurements (Isci et al. 03)
• RTL-level NoC power estimation (A. Banerjee et al. 07 and N. Banerjee et al. 04)
  • Simulation is slow
  • Requires detailed RTL modeling, so it is not suitable for early-stage NoC design space exploration
• Architectural-level power estimation
  • Interconnection network (Patel et al. 97): the model is not instantiated with architectural parameters, so it is not suitable for exploring tradeoffs in router microarchitecture
  • Uniprocessor power modeling (Wattch: Brooks et al. 00 and SimplePower: Ye et al. 00)
  • NoC power modeling (ORION 1.0: Wang et al. 02)
• ORION 1.0 has been widely used in early-stage design space exploration for NoC power-performance tradeoff analysis
ORION 1.0 Modeling Methodology
• Power models are derived for the major building blocks (FIFO, crossbar, and arbiter)
• For each component, a canonical structure is described in terms of architectural and technological parameters
• Detailed analysis is performed to determine parameterized capacitance equations
• Capacitance equations and switching activity estimates are combined to determine power consumption
• Power models are based on detailed estimates of gate and wire capacitance and switching activity
Limitations of ORION 1.0
Model parameters (the last column lists the ORION 1.0 inputs used for the power comparison that follows):
Parameter   Description          ORION 1.0   ORION 2.0   ORION 1.0 input
B           #buffers             yes         yes         16
F           flit-width           yes         yes         39
P           #ports               yes         yes         5
V           #virtual channels    yes         yes         2
X           #crossbar ports      yes         yes         5
tech        technology node      yes         yes         65nm
fclk        clock frequency      yes         yes         5.1GHz
Vdd         supply voltage       yes         yes         1.2V
Npipeline   #pipeline stages     no          yes         ---
AppD        application domain   no          yes         ---
D           chip dimension       no          yes         ---
Component power (mW), ORION 1.0 vs. Intel 80-core:
Component   ORION 1.0   Intel 80-core
Buffer      25.2        203.3
Crossbar    53.2        138.6
Arbiter     11.1        64.7
Link        --          212.5
Clock       --          304.9
Total       89.5        924
• Up to 8.1X difference for an individual component
• 10.3X difference in total power
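As a quick arithmetic check, the two ratios quoted on this slide follow directly from the component powers. The sketch below is only an illustration: the dictionary and variable names are hypothetical, and the values are the ORION 1.0 and Intel 80-core figures listed on the slide.

```python
# Component power in mW as listed on the slide
# (ORION 1.0 reports no link or clock power).
orion1_mw = {"Buffer": 25.2, "Crossbar": 53.2, "Arbiter": 11.1}
intel_mw = {"Buffer": 203.3, "Crossbar": 138.6, "Arbiter": 64.7,
            "Link": 212.5, "Clock": 304.9}

# Per-component ratio, only where ORION 1.0 reports a value.
component_ratio = {name: intel_mw[name] / orion1_mw[name] for name in orion1_mw}
worst_component = max(component_ratio, key=component_ratio.get)

# Ratio of total router power.
total_ratio = sum(intel_mw.values()) / sum(orion1_mw.values())

print(f"{worst_component}: {component_ratio[worst_component]:.1f}X")
print(f"Total: {total_ratio:.1f}X")
```

The largest per-component gap is in the buffer (about 8.1X), and the totals differ by about 10.3X.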
ORION 2.0: Accurate NoC Router Models
• Inputs to ORION 2.0
  • Circuit implementation and buffering scheme: SRAM and register FIFO; MUX-tree and matrix crossbar; different arbitration schemes; hybrid buffering scheme
  • Architectural parameters: # of ports; # of buffers; # of crossbar ports; # of VCs; voltage, frequency
  • Technology parameters: interconnect parameters; device parameters; scaling factors for future technologies; …
[Figure: five-port router (ports I, E, W, N, S) with per-port input buffers, crossbar, arbiter with request and grant signals, write control, and links]
• Built on top of ORION 1.0
• Uses our automatic/semi-automatic flows to obtain technology inputs
• Provides significant accuracy improvement compared with ORION 1.0
ORION 2.0 Improvements
                       ORION 1.0                 ORION 2.0
Power subcomponents    Crossbar                  Crossbar
                       Links (dynamic power)     Links: hybrid buffering, leakage power
                       Arbiter (dynamic power)   Arbiter: VC allocator model, leakage power
                       Buffer (SRAM-based)       Buffer: SRAM-based, flip-flop-based
                       ---                       Clock
Model infrastructure   ---                       Application-specific technology-level adjustment; updated capacitance and transistor sizes
Area                   Area (router)             More accurate router area model; link area model
Model Technology Inputs
• Inputs for power calculation
  • Leakage current values (obtained from Liberty (.lib) / SPICE)
  • Input capacitance for different repeater sizes (Liberty, Predictive Technology Models (PTM))
• Inputs for area calculation
  • Wire dimensions (Interconnect Technology Format (ITF) / LEF / ITRS)
  • Cell area is available from Liberty; for future technologies, ITRS A-factors or proposed area models can be used
• We also provide data for (1) high-performance (HP) and (2) low-power (LOP) device types for 90nm and 65nm
• Scaling factors for 45nm and 32nm technologies were obtained from ITRS 2007 / MASTAR 5.0
Dynamic Power Modeling
• Dynamic power is due to switching capacitance
• Clock power:
  $P_{clk} = \alpha \times C_{clk} \times V_{dd}^2 \times f$
  $C_{clk} = C_{\text{sram-fifo}} + C_{\text{pipeline-registers}} + C_{\text{register-fifo}} + C_{\text{clock-wiring}}$
• Physical links: due to charging and discharging of the capacitive load
  $P_d = \alpha \times C_{load} \times V_{dd}^2 \times f$, with $C_{load} = C_{ground} + C_{coupling} + C_{input}$
  where $\alpha$ is the activity factor
• Register-based FIFO: implemented as shift registers
• Virtual channel allocator: two models added
• Other components: we use ORION 1.0 models with updated transistor and technology parameters
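A minimal sketch of the link dynamic power formula, assuming the leading factor in the slide's equations is the switching activity factor. The function names and all numeric values are hypothetical and chosen only to exercise the formula; they are not ORION 2.0 data.

```python
def link_load_capacitance(c_ground, c_coupling, c_input):
    """C_load = C_ground + C_coupling + C_input (farads)."""
    return c_ground + c_coupling + c_input


def dynamic_power(activity, c_load, vdd, freq):
    """P = activity * C * Vdd^2 * f (watts for SI inputs)."""
    return activity * c_load * vdd ** 2 * freq


# Hypothetical example values (not taken from the slides).
c_load = link_load_capacitance(c_ground=80e-15, c_coupling=100e-15, c_input=20e-15)
p_link = dynamic_power(activity=0.15, c_load=c_load, vdd=1.2, freq=5.1e9)
print(f"Per-wire link dynamic power: {p_link * 1e6:.1f} uW")
```

The same `dynamic_power` helper applies to the clock network once $C_{clk}$ is known, since both expressions have the same form.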
Clock Power (1)
• Clock power depends heavily on the clock distribution topology; we assume an H-tree topology
  $C_{clk} = C_{\text{sram-fifo}} + C_{\text{pipeline-registers}} + C_{\text{register-fifo}} + C_{\text{clock-wiring}}$
• Memory structures: precharge circuitry
  • The capacitive load on the clock network is due to the precharge transistor $T_c$
  $C_{chg} = C_g(T_c) + C_d(T_c)$
  $C_{\text{sram-fifo}} = (P_r + P_w) \times F \times B \times C_{chg}$
  where $P_r$, $P_w$, $F$, and $B$ are #read ports, #write ports, flit-width, and #buffers, respectively
• Pipeline registers: due to the different stages in a router
  • Assume a D flip-flop (DFF) as the building block for pipeline registers
  $C_{\text{pipeline-registers}} = N_{pipeline} \times F \times C_{ff}$, where $C_{ff}$ is the DFF capacitance
• Register-based FIFO: due to the DFF capacitance used in the registers
  $C_{\text{register-fifo}} = F \times B \times C_{ff}$
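The three component terms of the clock load can be sketched as below, taking F as the flit-width and B as the number of buffers as in the parameter table. The buffer count (16) and flit-width (39) are the inputs listed earlier in the deck; the port counts, pipeline depth, per-cell capacitances, and wiring term are hypothetical placeholders.

```python
def c_sram_fifo(read_ports, write_ports, flit_width, buffers, c_chg):
    """Clock load of the SRAM FIFO precharge circuitry."""
    return (read_ports + write_ports) * flit_width * buffers * c_chg


def c_pipeline_registers(n_pipeline, flit_width, c_ff):
    """Clock load of the DFF-based pipeline registers."""
    return n_pipeline * flit_width * c_ff


def c_register_fifo(flit_width, buffers, c_ff):
    """Clock load of a DFF-based (register) FIFO."""
    return flit_width * buffers * c_ff


# Hypothetical per-cell capacitances in fF (not ORION 2.0 data).
C_CHG_FF = 1.0
C_FF_FF = 2.0

sram = c_sram_fifo(read_ports=1, write_ports=1, flit_width=39, buffers=16, c_chg=C_CHG_FF)
pipe = c_pipeline_registers(n_pipeline=4, flit_width=39, c_ff=C_FF_FF)
regs = c_register_fifo(flit_width=39, buffers=16, c_ff=C_FF_FF)
c_clock_wiring = 500.0  # hypothetical wiring term in fF

c_clk = sram + pipe + regs + c_clock_wiring
print(f"C_clk = {c_clk:.0f} fF")
```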
Clock Power (2)
• Wiring load: due to (1) wiring and (2) clock tree buffers
• Example: 5-level H-tree clock distribution
  [Equation: $C_{wire}$ = (sum of five H-tree segment-length terms, each proportional to $D$) $\times C_w$]
  where $D$ and $C_w$ are the chip dimension and per-unit-length wire capacitance, respectively
• The capacitive contribution of the clock buffers requires an estimate of the number of buffer stages, $k$:
  $k = \sqrt{\frac{0.4 \times R_{int} \times C_{int}}{0.7 \times R_d \times C_{gate}}}$
  where $R_{int}$, $C_{int}$, $R_d$, and $C_{gate}$ are the clock tree network wire resistance, wire capacitance, drive resistance, and input gate capacitance of a minimum-size inverter, respectively
  [Equations: $R_{int}$ and $C_{int}$ expressed in terms of $D$, $\rho$, $C_{area}$, and $C_{fringe}$]
  where $\rho$, $C_{area}$, and $C_{fringe}$ are the resistivity, unit area capacitance, and unit fringe capacitance, respectively
• $C_{\text{clock-wiring}} = k \times C_{gate} + C_{wire}$
• Clock leakage power is due to the clock buffers
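A small sketch of the buffer-stage estimate and the resulting clock wiring load. It assumes the expression for k is the usual square-root form, k = sqrt(0.4 R_int C_int / (0.7 R_d C_gate)), which is an interpretation of the slide's equation; every numeric value is a hypothetical placeholder.

```python
import math


def buffer_stages(r_int, c_int, r_d, c_gate):
    """Estimated number of clock buffer stages (assumed square-root form)."""
    return math.sqrt((0.4 * r_int * c_int) / (0.7 * r_d * c_gate))


def clock_wiring_capacitance(k, c_gate, c_wire):
    """C_clock-wiring = k * C_gate + C_wire."""
    return k * c_gate + c_wire


# Hypothetical values: 700 ohm and 4 pF of clock wire, 10 kohm driver, 1 fF gate.
k = buffer_stages(r_int=700.0, c_int=4e-12, r_d=10e3, c_gate=1e-15)
c_wiring = clock_wiring_capacitance(k, c_gate=1e-15, c_wire=4e-12)
print(f"k = {k:.2f} stages, C_clock-wiring = {c_wiring * 1e12:.3f} pF")
```

In practice k would be rounded to an integer number of stages; the sketch keeps the raw value so the formula stays visible.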
Repeater and Wire Power Models
• Repeaters (buffers) are used in links and in the clock tree network
• Leakage power has two main components: (1) sub-threshold leakage and (2) gate-tunneling current
  • Depending on design conditions, leakage power is computed at one of three temperatures: (1) 25°C, (2) 80°C, and (3) 110°C
  • Both components depend linearly on device size:
  $p_s = (p_s^n + p_s^p) / 2$
  $p_s^n = k_0^n + k_1^n \times w_n$
  $p_s^p = k_0^p + k_1^p \times w_p$
• Dynamic power can be calculated as:
  $p_d = a \times c_l \times v_{dd}^2 \times f$
  $c_l = c_i + c_g + c_c$
  • $p_d$, $a$, $c_l$, $v_{dd}$, and $f$ are dynamic power, activity factor, load capacitance, supply voltage, and frequency, respectively
  • The load capacitance is composed of the input capacitance of the next repeater ($c_i$) and the ground ($c_g$) and coupling ($c_c$) capacitances of the driven wire
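The linear leakage model for a repeater can be sketched as follows. The coefficient values are hypothetical stand-ins for numbers that would come from characterization at a given temperature; the function names are also illustrative.

```python
def device_leakage(k0, k1, width):
    """Linear leakage model for one device type: p = k0 + k1 * w."""
    return k0 + k1 * width


def repeater_leakage(k0_n, k1_n, w_n, k0_p, k1_p, w_p):
    """Average of the NMOS and PMOS leakage contributions: (p_n + p_p) / 2."""
    p_n = device_leakage(k0_n, k1_n, w_n)
    p_p = device_leakage(k0_p, k1_p, w_p)
    return (p_n + p_p) / 2


# Hypothetical coefficients (nW and nW/um) and device widths (um).
p_s = repeater_leakage(k0_n=1.0, k1_n=2.0, w_n=3.0,
                       k0_p=0.5, k1_p=1.5, w_p=6.0)
print(f"Repeater leakage: {p_s} nW")
```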
Interconnect Optimization: Buffering
• Conventional delay-optimal buffering leads to unrealistic buffer sizes and high dynamic / leakage power, so it is suboptimal
• Our approach: iterative optimization of a hybrid objective (power + delay)
  • Search for the optimal number and size of repeaters
  • Can be extended to other interconnect optimizations (e.g., wire sizing and driver sizing)
[Figure: Pareto-optimal frontier of the power-delay tradeoff of a 5mm interconnect in 90nm / 65nm]
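To make the hybrid objective concrete, here is a rough sketch that searches over repeater count and size. It is not the authors' optimizer: an exhaustive grid search stands in for the iterative optimization, the delay model is a generic Elmore-style estimate, and every technology constant and the weighting scheme are hypothetical.

```python
from itertools import product

# Hypothetical technology values (not from the slides).
R_WIRE = 200e3          # wire resistance per meter (ohm/m)
C_WIRE = 200e-12        # wire capacitance per meter (F/m)
R0, C0 = 10e3, 1e-15    # min-size inverter drive resistance (ohm), input cap (F)
VDD, FREQ, ACTIVITY = 1.2, 2e9, 0.15
LEAK_PER_SIZE = 50e-9   # leakage power per unit repeater size (W)


def delay(length, n, size):
    """Elmore-style delay of a wire split into n repeated segments."""
    seg = length / n
    r_seg, c_seg = R_WIRE * seg, C_WIRE * seg
    per_stage = (0.7 * (R0 / size) * (c_seg + size * C0)
                 + r_seg * (0.4 * c_seg + 0.7 * size * C0))
    return n * per_stage


def power(length, n, size):
    """Dynamic power of wire plus repeaters, plus repeater leakage."""
    c_total = C_WIRE * length + n * size * C0
    return ACTIVITY * c_total * VDD ** 2 * FREQ + n * size * LEAK_PER_SIZE


def best_buffering(length, weight, counts=range(1, 21), sizes=range(1, 201)):
    """Return (n, size, delay, power) minimizing a weighted, normalized sum."""
    candidates = [(n, s, delay(length, n, s), power(length, n, s))
                  for n, s in product(counts, sizes)]
    d_min = min(c[2] for c in candidates)
    p_min = min(c[3] for c in candidates)
    return min(candidates,
               key=lambda c: weight * c[2] / d_min + (1 - weight) * c[3] / p_min)


delay_optimal = best_buffering(5e-3, weight=1.0)   # delay only
hybrid = best_buffering(5e-3, weight=0.5)          # equal weight on power and delay
print("delay-optimal (n, size):", delay_optimal[:2])
print("hybrid        (n, size):", hybrid[:2])
```

Sweeping `weight` between 0 and 1 traces out points on a power-delay tradeoff curve for this toy model, which is the kind of frontier the slide's figure shows.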
Virtual Channel Allocator Model
• Provides three virtual channel (VC) allocation models
• Traditional two-stage VC allocator model
  • Most widely used
  • Power consumption increases rapidly as the number of VCs increases
  [Figure: two-stage VC allocator. With 5 ports and 4 VCs per port: stage 1 has 80 4:1 arbiters in total and stage 2 has 20 16:1 arbiters in total. With 5 ports and 2 VCs per port: stage 1 has 40 2:1 arbiters in total and stage 2 has 10 8:1 arbiters in total.]
• Added: one-stage VC allocator model
  • Lower power consumption
  • Lower matching probability
• Added: VC selection model
  • Proposed by Kumar et al., "A 4.6Tbits/s 3.6GHz Single-cycle NoC Router with a Novel Switch Allocator in 65nm CMOS", ICCD07
  • Low power and high performance
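The arbiter counts in the two-stage allocator figure fit a simple pattern in the number of ports P and VCs per port V. The sketch below encodes that pattern; the general formulas are inferred from the two configurations shown on the slide (5 ports with 4 VCs, and 5 ports with 2 VCs) and are an assumption, not a statement of the ORION 2.0 implementation.

```python
def two_stage_vc_allocator(ports, vcs):
    """Arbiter counts and sizes for a two-stage VC allocator.

    Inferred pattern: stage 1 uses P*V*(P-1) arbiters of size V:1,
    stage 2 uses P*V arbiters of size (P-1)*V:1.
    """
    other_ports = ports - 1
    return {
        "stage1_arbiters": ports * vcs * other_ports,
        "stage1_inputs": vcs,
        "stage2_arbiters": ports * vcs,
        "stage2_inputs": other_ports * vcs,
    }


for v in (2, 4):
    cfg = two_stage_vc_allocator(ports=5, vcs=v)
    print(f"5 ports, {v} VCs: "
          f"{cfg['stage1_arbiters']} x {cfg['stage1_inputs']}:1, "
          f"{cfg['stage2_arbiters']} x {cfg['stage2_inputs']}:1")
```

Both the number of stage-1 arbiters and the width of the stage-2 arbiters grow with V, which is consistent with the slide's remark that power rises quickly as the number of VCs increases.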
Leakage Power Modeling
• Leakage power: subthreshold and gate
  • From 65nm and beyond, gate leakage becomes significant
  $I_{leak}(Block) = \sum_i \sum_s Prob(i,s) \times \left( W_{sub}(i,s) \times I'_{sub}(i,s) + W_{gate}(i,s) \times I'_{gate}(i,s) \right)$
  • $I'_{sub}(i,s)$ and $I'_{gate}(i,s)$ are the subthreshold and gate leakage currents per unit transistor width for a specific technology
  • $W_{sub}(i,s)$ and $W_{gate}(i,s)$ are the effective widths of component $i$ at input state $s$ for subthreshold and gate leakage, respectively
• Key circuit components: INVx1, NAND2x1, NOR2x1, and DFF
• Leakage currents are computed at different transistor junction temperatures: (1) 110°C, (2) 80°C, and (3) 25°C
• Same methodology as in ORION 1.0
• Leakage current values are all obtained through SPICE simulation using foundry SPICE models
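The block leakage sum can be sketched as a loop over (component, input state) pairs. The record layout and the example numbers for a single inverter are hypothetical; real values would come from SPICE characterization as the slide describes.

```python
def block_leakage_current(states):
    """Sum Prob * (W_sub * I'_sub + W_gate * I'_gate) over all (component, state) pairs."""
    total = 0.0
    for s in states:
        total += s["prob"] * (s["w_sub"] * s["i_sub"] + s["w_gate"] * s["i_gate"])
    return total


# Hypothetical data for one inverter: widths in um, currents in nA per um.
inverter_states = [
    {"component": "INVx1", "state": "in=0", "prob": 0.5,
     "w_sub": 1.0, "i_sub": 10.0, "w_gate": 2.0, "i_gate": 1.0},
    {"component": "INVx1", "state": "in=1", "prob": 0.5,
     "w_sub": 2.0, "i_sub": 10.0, "w_gate": 1.0, "i_gate": 1.0},
]

i_leak = block_leakage_current(inverter_states)
print(f"I_leak = {i_leak} nA")
```

A larger block would simply contribute more records to the list, one per component and input state.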
Arbiter Leakage Power Model
• Three arbitration schemes: (1) matrix, (2) round-robin (RR), and (3) queuing
• Example: matrix arbiter with R requesters
  • One R×R matrix keeps the priorities
  $gnt_n = req_n \times \prod_{i<n} (\overline{req_i} + \overline{m_{in}}) \times \prod_{i>n} (\overline{req_i} + m_{ni})$
  where $m_{ij} = 1$ indicates that requester $i$ has priority over requester $j$
  • The grant logic can be implemented as a tree of NOR and INV gates, and the R×R matrix can be constructed using DFFs
  $I_{leak}(Arbiter_{matrix}) = I_{leak}(NOR2) \times (2(R-1)R) + I_{leak}(INV) \times R + I_{leak}(DFF) \times \frac{R(R-1)}{2}$
  $P_{leak}(Arbiter_{matrix}) = I_{leak}(Arbiter_{matrix}) \times V_{dd}$
  • NOR2, INV, and DFF represent a 2-input NOR gate, an inverter, and a D flip-flop, respectively
• Further details on the modeling methodology are in Chen et al. 2003
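A short sketch of the matrix arbiter leakage estimate, using the gate counts 2(R-1)R NOR2 gates, R inverters, and R(R-1)/2 DFFs. The per-cell leakage currents below are hypothetical placeholders, and the function names are illustrative.

```python
def matrix_arbiter_gate_counts(requesters):
    """Gate counts for a matrix arbiter with R requesters."""
    r = requesters
    return {"NOR2": 2 * (r - 1) * r, "INV": r, "DFF": r * (r - 1) // 2}


def matrix_arbiter_leakage(requesters, cell_leakage, vdd):
    """Return (leakage current, leakage power) from per-cell leakage currents."""
    counts = matrix_arbiter_gate_counts(requesters)
    current = sum(counts[cell] * cell_leakage[cell] for cell in counts)
    return current, current * vdd


# Hypothetical per-cell leakage currents in amperes.
cell_leakage = {"NOR2": 10e-9, "INV": 5e-9, "DFF": 40e-9}
i_arb, p_arb = matrix_arbiter_leakage(requesters=5, cell_leakage=cell_leakage, vdd=1.2)
print(matrix_arbiter_gate_counts(5))
print(f"I_leak = {i_arb * 1e9:.0f} nA, P_leak = {p_arb * 1e9:.0f} nW")
```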
Router Area Model
• As the number of cores increases, the area occupied by communication components becomes significant (19% of total tile area in the Intel 80-core Teraflops chip)
• Gate area model by Yoshida et al. (DAC'04)
• Link area model by Carloni et al. (ASPDAC'08)
• Example: matrix arbiter
  $Area_{arbiter} = (Area_{NOR2x1} \times 2(R-1)R) + (Area_{DFF} \times R(R-1)/2) + (Area_{INVx1} \times R)$
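To see how the matrix arbiter area scales with the number of requesters R, the formula can be evaluated for a few values. The cell areas below are hypothetical; actual values would come from a Liberty library or the gate area model cited on the slide.

```python
def matrix_arbiter_area(requesters, area_nor2, area_dff, area_inv):
    """Area = A_NOR2 * 2(R-1)R + A_DFF * R(R-1)/2 + A_INV * R."""
    r = requesters
    return (area_nor2 * 2 * (r - 1) * r
            + area_dff * r * (r - 1) / 2
            + area_inv * r)


# Hypothetical cell areas in um^2.
AREA_NOR2, AREA_DFF, AREA_INV = 1.5, 6.0, 1.0

areas = {r: matrix_arbiter_area(r, AREA_NOR2, AREA_DFF, AREA_INV) for r in (2, 5, 8)}
for r, area in areas.items():
    print(f"R = {r}: {area:.1f} um^2")
```

Because both the NOR2 and DFF terms are quadratic in R, the arbiter area grows roughly with the square of the number of requesters.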
Repeater and Wire Area Models
• For existing technologies, the area of a repeater can be calculated as:
  $a_r = \tau_0 + \tau_1 \times (w_n + w_p)$
  • $a_r$ denotes repeater area; $\tau_0$ and $\tau_1$ are coefficients obtained by linear regression; $w_n$ and $w_p$ are the widths of the NMOS and PMOS devices, respectively
• For future technologies, feature size (F), contacted pitch (CP), row height (RH), and cell width (CW) can be used to estimate the area:
  $NF = (w_p + w_n + 2 \times F) / RH$
  $CW = NF \times (F + CP) + CP$
  $a_r = RH \times CW$
• Wiring area can be calculated as:
  $a_w = (n \times (w_w + s_w) + s_w) \times L$
  • $a_w$ denotes wire area, $n$ is the bit width of the bus, and $w_w$, $s_w$, and $L$ are wire width, spacing, and wire length, respectively
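The three area expressions translate directly into code. The sketch below follows the formulas as written on the slide (NF is left unrounded); all dimensions and regression coefficients are hypothetical, with lengths in micrometers.

```python
def repeater_area_regression(tau0, tau1, w_n, w_p):
    """Existing technologies: a_r = tau0 + tau1 * (w_n + w_p)."""
    return tau0 + tau1 * (w_n + w_p)


def repeater_area_future(w_n, w_p, feature, contacted_pitch, row_height):
    """Future technologies: NF, CW, then a_r = RH * CW."""
    nf = (w_p + w_n + 2 * feature) / row_height
    cell_width = nf * (feature + contacted_pitch) + contacted_pitch
    return row_height * cell_width


def wire_area(bits, wire_width, spacing, length):
    """a_w = (n * (w_w + s_w) + s_w) * L."""
    return (bits * (wire_width + spacing) + spacing) * length


# Hypothetical values in um and um^2.
a_reg = repeater_area_regression(tau0=0.5, tau1=0.8, w_n=0.5, w_p=1.0)
a_fut = repeater_area_future(w_n=0.5, w_p=1.0, feature=0.05,
                             contacted_pitch=0.2, row_height=1.6)
a_wire = wire_area(bits=39, wire_width=0.14, spacing=0.14, length=2000.0)
print(f"regression: {a_reg:.2f}, future: {a_fut:.2f}, wire: {a_wire:.0f} (um^2)")
```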
ORION 2.0: Validations and Results
• Validation: two Intel NoC chips
  • (1) Intel 80-core Teraflops: high-performance many-core design
  • (2) Intel SCC: ultra low-power communication core
• ORION 2.0 offers significant accuracy improvement
                      Intel 80-core       Intel SCC
                      v1.0     v2.0       v1.0      v2.0
%diff (total power)   -85.3    -6.5       +202.4    +11.0
%diff (total area)    -80.9    -23.6      +31.9     +25.3
Component   %diff (ORION 2.0 vs. Intel 80-core)
Buffer      -14.8
Crossbar    16.9
Arbiter     -9.0
Clock       -20.9
Link        8.8
Router power breakdown (pie charts):
Component   ORION 1.0   ORION 2.0   Intel 80-core
FIFO        28%         21%         23%
Crossbar    60%         21%         16%
Arbiter     12%         7%          7%
Clock       0%          30%         36%
Link        0%          21%         18%
Impact on System-Level Design
• Testcases
  • VPROC: video processor with 42 cores and 128-bit datawidth
  • dVOPD: dual video object plane decoder with 26 cores and 128-bit datawidth
SoC     P (mW)          A (mm²)         # routers     max. # hops   max. # router ports
        v1.0    v2.0    v1.0    v2.0    v1.0   v2.0   v1.0   v2.0   v1.0   v2.0
VPROC   0.875   0.924   2.043   2.329   33     25     8      12     6      5
dVOPD   0.412   0.486   1.217   1.343   18     16     6      6      11     10
• System-level impact: communication-driven synthesis in COSI-OCC
  • Accurate ORION 2.0 models lead to better-performing NoCs
  • The relative power due to an additional port is not as high in ORION 2.0 as in ORION 1.0
[Figure: synthesized NoC topologies with routers labeled R1 and R2]
Conclusions
• Accurate models can drive effective NoC design space exploration
• ORION 1.0 is inaccurate for current and future technology nodes
• Proposed accurate power and area models for network routers (ORION 2.0)
• Presented a reproducible methodology for extracting inputs to our models
• Maintained the ORION 1.0 interface while significantly improving the accuracy of the models, so switching to ORION 2.0 is easy!
ORION 2.0 Release
ORION 2.0 Website: http://www.princeton.edu/~peh/orion.html
System-Level NoC Power Modeling Example
• Polaris Toolchain (V. Soteriou, N. Eisley, H. Wang, B. Li, L.S. Peh, TVLSI'07)
[Figure: three-step Polaris flow (Step 1, Step 2, Step 3) combining Trident (synthetic traffic generation), LUNA (high-level on-chip network analysis), and the ORION power and area models. Input: microarchitecture parameters. Outputs: power consumption, performance (latency), and CMOS area, which feed a design-space exploration tool that produces NoC design projections.]