ORION2.0: A Fast and Accurate NoC Power and Area Model for Early-Stage Design Space Exploration


Andrew B. Kahng¶

Bin Li‡

Li-Shiuan Peh‡

Kambiz Samadi¶

¶ University of California, San Diego
‡ Princeton University

April 21, 2009

Outline
• Motivation
• ORION2.0 Framework
• Dynamic Power Modeling
• Leakage Power Modeling
• Area Modeling
• Validation and Significance Assessment
• Conclusions

Motivation
• Many-core chips need NoCs to interconnect their cores
• Power efficiency of NoCs is important
  • Performance used to be the primary concern; now power efficiency is critical
  • 28% of total power in the Intel 80-core Teraflops chip is due to the interconnection network (routers + links)
• Need rapid power estimation to trade off alternative architectures
  • Rapid power-area tradeoffs at the architectural level
• Our goal: develop accurate models that are easily usable by system-level designers early in the design cycle

Related Work
• Real-chip power measurements (Isci et al. 03)
• RTL-level NoC power estimation (A. Banerjee et al. 07 and N. Banerjee et al. 04)
  • Simulation is slow
  • Requires detailed RTL modeling, so it is not suitable for early-stage NoC design space exploration
• Architectural-level power estimation
  • Interconnection network (Patel et al. 97): the model is not instantiated with architectural parameters, so it is not suitable for exploring tradeoffs in router microarchitecture
  • Uniprocessor power modeling (Wattch: Brooks et al. 00 and SimplePower: Ye et al. 00)
  • NoC power modeling (ORION 1.0: Wang et al. 02)
• ORION 1.0 has been widely used in early-stage design space exploration for NoC power-performance tradeoff analysis

ORION 1.0 Modeling Methodology
• Power models are derived for the major building blocks (FIFO, crossbar, and arbiter)
• For each component, a canonical structure is described in terms of architectural and technological parameters
• Detailed analysis is performed to determine parameterized capacitance equations
• Capacitance equations and switching activity estimation are combined to determine power consumption
• Power models are based on detailed estimates of gate and wire capacitance and switching activity

Limitations of ORION 1.0

Model parameters ("---" marks a parameter that ORION 1.0 does not take):

Parameter    Description            ORION 1.0    ORION 2.0
B            # buffers              B            B
F            flit-width             F            F
P            # ports                P            P
V            # virtual channels     V            V
X            # crossbar ports       X            X
tech         technology node        tech         tech
fclk         clock frequency        fclk         fclk
Vdd          supply voltage         Vdd          Vdd
Npipeline    # pipeline stages      ---          Npipeline
AppD         application domain     ---          AppD

The description column also lists chip dimension.

Parameter values used for the Intel 80-core comparison:
B = 16, F = 39, P = 5, V = 2, X = 5, tech = 65nm, fclk = 5.1GHz, Vdd = 1.2V

Component power (mW):

Component    ORION 1.0 (V1)    Intel 80-core
Buffer       25.2              203.3
Crossbar     53.2              138.6
Arbiter      11.1              64.7
Link         --                212.5
Clock        --                304.9
Total        89.5              924

• Up to 8.1X difference per component
• 10.3X difference in total power
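As a quick arithmetic check on the component power comparison, the short sketch below recomputes the totals and the per-component ratios. The split of the run-together digits into individual values (for example 25.2, 53.2, 11.1) is a reading of the slide, and the variable names are illustrative only.

```python
# Component power in mW, as read from the ORION 1.0 vs. Intel 80-core comparison.
# Link and clock are absent from ORION 1.0, so only three components can be compared.
orion1 = {"Buffer": 25.2, "Crossbar": 53.2, "Arbiter": 11.1}
intel = {
    "Buffer": 203.3,
    "Crossbar": 138.6,
    "Arbiter": 64.7,
    "Link": 212.5,
    "Clock": 304.9,
}

orion1_total = sum(orion1.values())
intel_total = sum(intel.values())

# Ratio of measured power to ORION 1.0 estimate for each shared component.
ratios = {name: intel[name] / orion1[name] for name in orion1}
worst_component = max(ratios, key=ratios.get)

print(f"ORION 1.0 total: {orion1_total:.1f} mW, Intel total: {intel_total:.1f} mW")
print(f"Largest per-component gap: {worst_component} at {ratios[worst_component]:.1f}X")
print(f"Total gap: {intel_total / orion1_total:.1f}X")
```

Under this reading, the buffer accounts for the "up to 8.1X" figure and the totals for the "10.3X" figure.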


ORION 2.0: Accurate NoC Router Models

Inputs to ORION 2.0:

Circuit implementation & buffering scheme
• SRAM and register FIFO
• MUX-tree and matrix crossbar
• Different arbitration schemes
• Hybrid buffering scheme

Architectural parameters
• # of ports; # of buffers
• # of xbar ports; # of VCs
• Voltage, frequency

Technology parameters
• Interconnect parameters
• Device parameters
• Scaling factors for future technologies
• …

[Figure: router block diagram. Input buffers (Buf I, E, W, N, S) feed a crossbar (inI/E/W/N/S to outI/E/W/N/S); an arbiter takes request signals (reqI/E/W/N/S) and issues grants (grantI/E/W/N/S); sources and links attach to the ports; write control and request signals are shown.]

• Built on top of ORION 1.0
• Uses our automatic/semi-automatic flows to obtain technology inputs
• Provides significant accuracy improvement compared with ORION 1.0

ORION 2.0 Improvements

Power subcomponents
  ORION 1.0
  • Crossbar
  • Links (dynamic power)
  • Arbiter (dynamic power)
  • Buffer (SRAM-based)
  ORION 2.0
  • Clock
  • Crossbar
  • Links: hybrid buffering, leakage power
  • Arbiter: VC allocator model, leakage power
  • Buffer: SRAM-based, flip-flop-based

Model infrastructure
  ORION 2.0
  • Application-specific technology-level adjustment
  • Updated capacitance and transistor sizes

Area
  ORION 1.0
  • Area (router)
  ORION 2.0
  • More accurate router area model
  • Link area model

Model Technology Inputs
• Inputs for power calculation
  • Leakage current values (obtained from Liberty (.lib) / SPICE)
  • Input capacitance for different repeater sizes (Liberty, Predictive Technology Models (PTM))
• Inputs for area calculation
  • Wire dimensions (Interconnect Technology Format (ITF) / LEF / ITRS)
  • Cell area is available from Liberty; for future technologies, ITRS A-factors or proposed area models can be used
• We also provide data for (1) high-performance (HP) and (2) low-power (LOP) device types for 90nm and 65nm
• Scaling factors for 45nm and 32nm technologies were obtained from ITRS 2007 / MASTAR5.0


Dynamic Power Modeling
• Dynamic power: switching capacitance
• Clock power:
  $P_{clk} = \alpha \times C_{clk} \times V_{dd}^2 \times f$
  $C_{clk} = C_{sram\text{-}fifo} + C_{pipeline\text{-}registers} + C_{register\text{-}fifo} + C_{clock\text{-}wiring}$
• Physical links: due to charging and discharging of the capacitive load
  $P_d = \alpha \times C_{load} \times V_{dd}^2 \times f$, with $C_{load} = C_{ground} + C_{coupling} + C_{input}$
  where $\alpha$ is the switching activity factor
• Register-based FIFO: implemented as shift registers
• Virtual channel allocator: added two models
• Other components: we use ORION 1.0 models with updated transistor and technology parameters
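The clock and link expressions share one form: a switching factor times capacitance times supply voltage squared times frequency. The sketch below evaluates that form for a link, assuming the leading factor is the switching activity (as in the repeater model later in the deck). The function names and all numeric values are illustrative placeholders, not ORION 2.0 data.

```python
def dynamic_power(activity, capacitance_f, vdd_v, freq_hz):
    """Dynamic power in watts: activity * C * Vdd^2 * f."""
    return activity * capacitance_f * vdd_v ** 2 * freq_hz


def link_load_capacitance(c_ground_f, c_coupling_f, c_input_f):
    """Link load: ground + coupling + input capacitance, in farads."""
    return c_ground_f + c_coupling_f + c_input_f


# Placeholder values chosen only to exercise the formulas.
c_load = link_load_capacitance(100e-15, 150e-15, 10e-15)  # 260 fF
p_link = dynamic_power(activity=0.5, capacitance_f=c_load, vdd_v=1.2, freq_hz=5.1e9)
print(f"Link dynamic power: {p_link * 1e3:.3f} mW")
```

Because the supply voltage enters squared, doubling Vdd quadruples the estimate, which is why voltage is among the architectural inputs.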

Clock Power (1)
• Clock power depends heavily on the clock distribution topology; we assume an H-tree topology
  $C_{clk} = C_{sram\text{-}fifo} + C_{pipeline\text{-}registers} + C_{register\text{-}fifo} + C_{clock\text{-}wiring}$
• Memory structures: precharge circuitry
  • The capacitive load on the clock network is due to the precharge transistor $T_c$
    $C_{chg} = C_g(T_c) + C_d(T_c)$
    $C_{sram\text{-}fifo} = (P_r + P_w) \times F \times B \times C_{chg}$
    where $P_r$, $P_w$, $B$, and $F$ are #read ports, #write ports, #buffers, and flit-width, respectively
• Pipeline registers: due to the different stages in a router
  • Assume a D-flip-flop (DFF) as the building block for pipeline registers
    $C_{pipeline\text{-}registers} = N_{pipeline} \times F \times C_{ff}$, where $C_{ff}$ is the DFF capacitance
• Register-based FIFO: due to the DFF capacitance used in registers
    $C_{register\text{-}fifo} = F \times B \times C_{ff}$
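The clock capacitance terms can be written out as a small calculator. This is a minimal sketch under stated assumptions: `RouterConfig` and `clock_capacitance` are hypothetical names, B is taken as #buffers and F as flit-width, and the port counts, pipeline depth, and capacitance values are placeholders rather than extracted technology data.

```python
from dataclasses import dataclass


@dataclass
class RouterConfig:
    read_ports: int       # Pr
    write_ports: int      # Pw
    buffers: int          # B
    flit_width: int       # F
    pipeline_stages: int  # Npipeline


def clock_capacitance(cfg, c_chg, c_ff, c_clock_wiring):
    """Return each clock load term (farads) plus their sum, as on the slide."""
    parts = {
        "sram_fifo": (cfg.read_ports + cfg.write_ports)
        * cfg.flit_width * cfg.buffers * c_chg,
        "pipeline_registers": cfg.pipeline_stages * cfg.flit_width * c_ff,
        "register_fifo": cfg.flit_width * cfg.buffers * c_ff,
        "clock_wiring": c_clock_wiring,
    }
    parts["total"] = sum(parts.values())
    return parts


# B = 16 and F = 39 follow the Intel 80-core parameters; the rest are placeholders.
cfg = RouterConfig(read_ports=1, write_ports=1, buffers=16,
                   flit_width=39, pipeline_stages=4)
for name, value in clock_capacitance(cfg, 1e-15, 2e-15, 500e-15).items():
    print(f"{name:>20}: {value * 1e15:.0f} fF")
```

The sum keeps both FIFO terms because the slide's expression for the clock capacitance does; a model of a single router would presumably zero out whichever FIFO style it does not use.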

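Clock Power (2)
• Wiring load: due to (1) wiring and (2) clock tree buffers
• Example: 5-level H-tree clock distribution
  [Equation: $C_{wire}$ is the total wire length of the five H-tree levels, each level a multiple of a fraction of $D$, multiplied by $C_w$; the individual coefficients are illegible in this copy]
  where $D$ and $C_w$ are the chip dimension and the per-unit-length wire capacitance, respectively
• The capacitive contribution of the clock buffers requires estimating the number of buffer stages, $k$:
  $k = \sqrt{\dfrac{0.4 \times R_{int} \times C_{int}}{0.7 \times R_d \times C_{gate}}}$
  where $R_{int}$, $C_{int}$, $R_d$, and $C_{gate}$ are the clock tree network wire resistance, wire capacitance, drive resistance, and input gate capacitance of a minimum-size inverter, respectively
  [Equations: $R_{int}$ and $C_{int}$ as functions of $\rho$, $D$, the wire width $w$, $C_{area}$, and $C_{fringe}$; illegible in this copy]
  where $\rho$, $C_{area}$, and $C_{fringe}$ are the resistivity, unit area capacitance, and unit fringe capacitance, respectively
• $C_{clock\text{-}wiring} = k \times C_{gate} + C_{wire}$
• Clock leakage power is due to the clock buffers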
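The buffer-stage estimate and the clock wiring load can be sketched as follows. The 0.4 and 0.7 coefficients come from the slide; placing the ratio under a square root is an assumption here (it is the classic repeater-count form), and the function names and numeric inputs are hypothetical.

```python
import math


def buffer_stages(r_int, c_int, r_d, c_gate):
    """Estimated number of clock buffer stages k.

    Assumes the square-root form sqrt(0.4*Rint*Cint / (0.7*Rd*Cgate)).
    """
    return math.sqrt((0.4 * r_int * c_int) / (0.7 * r_d * c_gate))


def clock_wiring_capacitance(k, c_gate, c_wire):
    """C_clock-wiring = k * C_gate + C_wire."""
    return k * c_gate + c_wire


# Placeholder values: 1 kOhm / 2 pF of tree wiring, 5 kOhm / 1 fF minimum inverter.
k = buffer_stages(r_int=1e3, c_int=2e-12, r_d=5e3, c_gate=1e-15)
c_clock_wiring = clock_wiring_capacitance(k, c_gate=1e-15, c_wire=2e-12)
print(f"k = {k:.1f} stages, C_clock-wiring = {c_clock_wiring * 1e12:.3f} pF")
```

Under the square-root assumption, quadrupling the wire resistance doubles the stage count, so the buffer contribution grows more slowly than the wiring itself.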

Repeater and Wire Power Models
• Repeaters (buffers) are used in links and in the clock tree network
• Leakage power has two main components: (1) sub-threshold leakage and (2) gate-tunneling current
  • Depending on design conditions, leakage power is computed at different temperatures: (1) 25°C, (2) 80°C, and (3) 110°C
  • Both components depend linearly on device size:
    $p_s = (p_s^n + p_s^p) / 2$
    $p_s^n = k_0^n + k_1^n \times w_n$
    $p_s^p = k_0^p + k_1^p \times w_p$
• Dynamic power can be calculated as:
    $p_d = a \times c_l \times v_{dd}^2 \times f$
    $c_l = c_i + c_g + c_c$
  • $p_d$, $a$, $c_l$, $v_{dd}$, and $f$ are dynamic power, activity factor, load capacitance, supply voltage, and frequency, respectively
  • The load capacitance is composed of the input capacitance of the next repeater ($c_i$) and the ground ($c_g$) and coupling ($c_c$) capacitances of the driven wire
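The repeater model reduces to two small functions: a linear-in-width leakage fit averaged over the NMOS and PMOS cases, and the usual activity times capacitance times voltage squared times frequency term. The sketch below follows the slide's symbols; the coefficient values are made up for illustration and are not fitted data.

```python
def repeater_leakage(k0_n, k1_n, w_n, k0_p, k1_p, w_p):
    """Average of the NMOS and PMOS leakage fits, each linear in device width."""
    ps_n = k0_n + k1_n * w_n
    ps_p = k0_p + k1_p * w_p
    return (ps_n + ps_p) / 2


def repeater_dynamic(a, c_i, c_g, c_c, vdd, f):
    """Dynamic power with load = next-repeater input + wire ground + wire coupling."""
    c_l = c_i + c_g + c_c
    return a * c_l * vdd ** 2 * f


# Hypothetical coefficients (W and W/um) and widths (um).
p_leak = repeater_leakage(k0_n=1e-9, k1_n=2e-9, w_n=4.0,
                          k0_p=2e-9, k1_p=1e-9, w_p=8.0)
p_dyn = repeater_dynamic(a=0.15, c_i=5e-15, c_g=40e-15, c_c=60e-15,
                         vdd=1.2, f=1e9)
print(f"leakage = {p_leak * 1e9:.2f} nW, dynamic = {p_dyn * 1e6:.2f} uW")
```

In practice the fit coefficients would differ per temperature corner, so one set would be kept for each of the three temperatures.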

Interconnect Optimization: Buffering
• Conventional delay-optimal buffering leads to unrealistic buffer sizes and high dynamic / leakage power, and is therefore suboptimal
• Our approach: iterative optimization of a hybrid objective (power + delay)
  • Search for the optimal number and size of repeaters
  • Can be extended to other interconnect optimizations (e.g., wire sizing and driver sizing)

[Figure: Pareto-optimal frontier of the power-delay tradeoff of a 5mm interconnect in 90nm / 65nm]
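To illustrate the idea of a hybrid power and delay objective, the sketch below searches over repeater count and size for a 5mm wire. It is not the authors' optimizer: it swaps the iterative procedure for a plain exhaustive grid search, uses a generic Elmore-style delay model, and every technology number in `TECH` is a placeholder. The weighted, normalized objective is also an assumption.

```python
# Placeholder technology values, not extracted from any real library.
TECH = {
    "length": 5e-3,  # wire length, m
    "r_w": 2e5,      # wire resistance, ohm/m
    "c_w": 2e-10,    # wire capacitance, F/m
    "r_d": 1e4,      # drive resistance of a unit-size repeater, ohm
    "c_in": 1e-15,   # input capacitance of a unit-size repeater, F
    "vdd": 1.2, "f": 1e9, "a": 0.15,
    "leak": 1e-7,    # leakage per unit repeater size, W
}


def delay(n, s, t=TECH):
    """Elmore-style delay of a wire split into n segments with size-s repeaters."""
    seg = t["length"] / n
    r_drv, c_rep = t["r_d"] / s, t["c_in"] * s
    stage = (0.7 * r_drv * (t["c_w"] * seg + c_rep)
             + t["r_w"] * seg * (0.4 * t["c_w"] * seg + 0.7 * c_rep))
    return n * stage


def power(n, s, t=TECH):
    """Dynamic power of wire plus repeaters, plus repeater leakage."""
    c_total = t["c_w"] * t["length"] + n * s * t["c_in"]
    return t["a"] * c_total * t["vdd"] ** 2 * t["f"] + n * s * t["leak"]


def optimize(weight, max_count=20, max_size=200):
    """Minimize weight*delay + (1-weight)*power, both normalized to the
    delay-optimal design. weight=1.0 reproduces delay-optimal buffering."""
    cands = [(n, s) for n in range(1, max_count + 1)
             for s in range(1, max_size + 1)]
    ref = min(cands, key=lambda c: delay(*c))
    d0, p0 = delay(*ref), power(*ref)
    return min(cands, key=lambda c: weight * delay(*c) / d0
               + (1 - weight) * power(*c) / p0)


for w in (1.0, 0.5):
    n, s = optimize(w)
    print(f"weight={w}: {n} repeaters of size {s}, "
          f"delay={delay(n, s) * 1e12:.0f} ps, power={power(n, s) * 1e3:.3f} mW")
```

Sweeping the weight between 0 and 1 traces out a power-delay tradeoff curve for this toy model; with equal weights the chosen design never uses more power than the delay-optimal one.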

Virtual Channel Allocator Model
• Provides three virtual channel (VC) allocation models
• Traditional two-stage VC allocator model
  • Most widely used
  • Power consumption increases rapidly as the number of VCs increases
• Added: one-stage VC allocator model
  • Lower power consumption
  • Lower matching probability
• Added: VC selection model
  • Proposed by Kumar et al., "A 4.6Tbits/s 3.6GHz Single-cycle NoC Router with a Novel Switch Allocator in 65nm CMOS", ICCD07
  • Low power and high performance

[Figure: two-stage VC allocator structure. With 5 ports and 4 VCs per port, stage 1 has 80 4:1 arbiters in total and stage 2 has 20 16:1 arbiters in total. With 5 ports and 2 VCs per port, stage 1 has 40 2:1 arbiters in total and stage 2 has 10 8:1 arbiters in total.]
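The arbiter counts in the two-stage allocator figure follow a simple pattern in the number of ports P and VCs per port V. The closed forms below are inferred from the two configurations shown (5 ports with 4 VCs, and 5 ports with 2 VCs) rather than stated in the slides, and the function name is hypothetical.

```python
def two_stage_vc_allocator(ports, vcs):
    """Arbiter counts and sizes for a two-stage VC allocator.

    Inferred pattern: every input VC has one V:1 arbiter per other port in
    stage 1, and every output VC has one ((P-1)*V):1 arbiter in stage 2.
    """
    return {
        "stage1": {"count": ports * vcs * (ports - 1), "inputs": vcs},
        "stage2": {"count": ports * vcs, "inputs": (ports - 1) * vcs},
    }


for v in (2, 4):
    alloc = two_stage_vc_allocator(ports=5, vcs=v)
    s1, s2 = alloc["stage1"], alloc["stage2"]
    print(f"5 ports, {v} VCs: {s1['count']} x {s1['inputs']}:1 arbiters, "
          f"then {s2['count']} x {s2['inputs']}:1 arbiters")
```

Doubling V from 2 to 4 doubles both the arbiter count and the width of every arbiter, which is consistent with the remark that power rises rapidly with the number of VCs.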


Leakage Power Modeling

$I_{leak}(Block) = \sum_i \sum_s Prob(i,s) \times \left( W_{sub}(i,s) \times I'_{sub}(i,s) + W_{gate}(i,s) \times I'_{gate}(i,s) \right)$

• Leakage power: subthreshold and gate
  • From 65nm and beyond, gate leakage becomes significant
  • $I'_{sub}(i,s)$ and $I'_{gate}(i,s)$ are the subthreshold and gate leakage currents per unit transistor width for a specific technology
  • $W_{sub}(i,s)$ and $W_{gate}(i,s)$ are the effective widths of component $i$ at input state $s$ for subthreshold and gate leakage, respectively
• Key circuit components: INVx1, NAND2x1, NOR2x1, and DFF
• Leakage currents are computed at different transistor junction temperatures: (1) 110°C, (2) 80°C, and (3) 25°C
• Same methodology as in ORION 1.0
• Leakage current values are all obtained through SPICE simulation using foundry SPICE models
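The block leakage expression is a probability-weighted sum over components and their input states. A minimal sketch is shown below, reading Prob(i,s) as the probability that component i is in input state s; the data layout, function name, and current values are assumptions made for illustration.

```python
def block_leakage_current(block):
    """Sum Prob * (Wsub * I'sub + Wgate * I'gate) over components and input states.

    `block` maps a component name to a list of per-state tuples:
    (prob, w_sub, i_sub_per_width, w_gate, i_gate_per_width).
    """
    total = 0.0
    for states in block.values():
        for prob, w_sub, i_sub, w_gate, i_gate in states:
            total += prob * (w_sub * i_sub + w_gate * i_gate)
    return total


# One hypothetical inverter with two equally likely input states.
# Widths in um, currents in A/um; all values are placeholders.
block = {
    "INVx1": [
        (0.5, 1.0, 10e-9, 2.0, 1e-9),  # input low
        (0.5, 2.0, 10e-9, 1.0, 1e-9),  # input high
    ],
}
print(f"I_leak = {block_leakage_current(block) * 1e9:.2f} nA")
```

Evaluating the model at another temperature would only change the per-width currents, since the widths and state probabilities stay fixed.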

Arbiter Leakage Power Model
• Three arbitration schemes: (1) matrix, (2) round-robin (RR), and (3) queuing
• Example: matrix arbiter with R requesters
  • One R×R matrix keeps the priorities
    $gnt_n = req_n \times \prod_{i<n} \left( \overline{req_i} + \overline{m_{in}} \right) \times \prod_{i>n} \left( \overline{req_i} + m_{ni} \right)$
    where $m_{ij} = 1$ means requester $i$ has priority over requester $j$
  • The grant logic can be implemented as a tree of NOR and INV gates, and the R×R matrix can be constructed using DFFs
    $I_{leak}(Arbiter_{matrix}) = I_{leak}(NOR2) \times 2(R-1)R + I_{leak}(INV) \times R + I_{leak}(DFF) \times \dfrac{R(R-1)}{2}$
    $P_{leak}(Arbiter_{matrix}) = I_{leak}(Arbiter_{matrix}) \times V_{dd}$
  • NOR2, INV, and DFF represent a 2-input NOR gate, an inverter, and a D flip-flop, respectively
• Further details on the modeling methodology are in Chen et al. 2003
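The matrix arbiter leakage can be evaluated directly from gate counts: 2(R-1)R NOR2 gates, R inverters, and R(R-1)/2 flip-flops, which match the counts used in the matrix arbiter area expression. The sketch below assumes per-cell leakage currents are already known; the function name and current values are placeholders.

```python
def matrix_arbiter_leakage(r, i_nor2, i_inv, i_dff, vdd):
    """Leakage current (A) and power (W) of an R-requester matrix arbiter."""
    n_nor2 = 2 * (r - 1) * r   # grant logic
    n_inv = r                  # one inverter per requester
    n_dff = r * (r - 1) // 2   # upper triangle of the priority matrix
    i_leak = n_nor2 * i_nor2 + n_inv * i_inv + n_dff * i_dff
    return i_leak, i_leak * vdd


# Placeholder per-cell leakage currents for a 5-requester arbiter.
i_leak, p_leak = matrix_arbiter_leakage(r=5, i_nor2=1e-9, i_inv=0.5e-9,
                                        i_dff=4e-9, vdd=1.2)
print(f"I_leak = {i_leak * 1e9:.1f} nA, P_leak = {p_leak * 1e9:.1f} nW")
```

The NOR2 and DFF counts both grow quadratically with R, so arbiter leakage scales roughly with the square of the number of requesters.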


Router Area Model
• As the number of cores increases, the area occupied by communication components becomes significant (19% of total tile area in the Intel 80-core Teraflops chip)
• Gate area model by Yoshida et al. (DAC'04)
• Link area model by Carloni et al. (ASPDAC'08)
• Example: matrix arbiter
  $Area_{arbiter} = Area_{NOR2x1} \times 2(R-1)R + Area_{DFF} \times \dfrac{R(R-1)}{2} + Area_{INVx1} \times R$

Repeater and Wire Area Models
• For existing technologies, the area of a repeater can be calculated as:
  $a_r = \tau_0 + \tau_1 \times (w_n + w_p)$
  • $a_r$ denotes repeater area; $\tau_0$ and $\tau_1$ are coefficients obtained by linear regression; $w_n$ and $w_p$ are the widths of the NMOS and PMOS devices, respectively
• For future technologies, feature size (F), contacted pitch (CP), row height (RH), and cell width (CW) can be used to estimate the area:
  $NF = (w_p + w_n + 2 \times F) / RH$
  $CW = NF \times (F + CP) + CP$
  $a_r = RH \times CW$
• Wiring area can be calculated as:
  $a_w = (n \times (w_w + s_w) + s_w) \times L$
  • $a_w$ denotes wire area, $n$ is the bit width of the bus, and $w_w$, $s_w$, and $L$ are the wire width, wire spacing, and wire length, respectively
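The three area expressions translate directly into code. The sketch below keeps the slide's formulas as written (including the non-integer NF); the function names are hypothetical and the dimensions, given in micrometers, are placeholders rather than values for any real technology.

```python
def repeater_area_existing(tau0, tau1, w_n, w_p):
    """Regression model for existing technologies: tau0 + tau1 * (wn + wp)."""
    return tau0 + tau1 * (w_n + w_p)


def repeater_area_future(w_n, w_p, feature, contacted_pitch, row_height):
    """Layout-based estimate for future technologies: RH * CW."""
    nf = (w_p + w_n + 2 * feature) / row_height
    cell_width = nf * (feature + contacted_pitch) + contacted_pitch
    return row_height * cell_width


def wire_area(n_bits, wire_width, wire_spacing, length):
    """Bus area: n wires, each with one spacing, plus one extra spacing."""
    return (n_bits * (wire_width + wire_spacing) + wire_spacing) * length


# Placeholder dimensions in um; areas come out in um^2.
print(repeater_area_existing(tau0=0.5, tau1=0.3, w_n=1.0, w_p=2.0))
print(repeater_area_future(w_n=1.0, w_p=2.0, feature=0.065,
                           contacted_pitch=0.2, row_height=1.3))
print(wire_area(n_bits=128, wire_width=0.1, wire_spacing=0.1, length=1000.0))
```

For wide buses the wire term dominates: a 128-bit, 1mm link at these placeholder pitches occupies far more area than its repeaters.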


ORION2.0: Validations and Results
• Validation: two Intel NoC chips
  (1) Intel 80-core Teraflops: high-performance many-core design
  (2) Intel SCC: ultra low-power communication core
• ORION2.0 offers significant accuracy improvement

                       Intel 80-core       Intel SCC
                       v1.0     v2.0       v1.0      v2.0
%diff (total power)    -85.3    -6.5       +202.4    +11.0
%diff (total area)     -80.9    -23.6      +31.9     +25.3

Component    %diff (ORION 2.0 vs. Intel 80-core)
Buffer       -14.8
Crossbar     16.9
Arbiter      -9.0
Clock        -20.9
Link         8.8

Power breakdown by component (pie charts):

Component    Intel 80-core    ORION 2.0    ORION 1.0
FIFO         23%              21%          28%
Crossbar     16%              21%          60%
Arbiter      7%               7%           12%
Clock        36%              30%          0%
Link         18%              21%          0%

Impact on System-Level Design
• Testcases
  • VPROC: video processor with 42 cores and 128-bit datawidth
  • dVOPD: dual video object plane decoder with 26 cores and 128-bit datawidth

SoC      P (mW)          A (mm²)         # routers     max. # hops    max. # router ports
         v1.0   v2.0     v1.0   v2.0     v1.0  v2.0    v1.0  v2.0     v1.0  v2.0
VPROC    0.875  0.924    2.043  2.329    33    25      8     12       6     5
dVOPD    0.412  0.486    1.217  1.343    18    16      6     6        11    10

• System-level impact: communication-driven synthesis in COSI-OCC
  • Accurate ORION 2.0 models lead to a better-performing NoC
  • The relative power due to an additional port is not as high in ORION 2.0 as in ORION 1.0

[Figure: synthesized NoC topologies built from routers labeled R1 and R2]

Conclusions
• Accurate models can drive effective NoC design space exploration
• ORION 1.0 is inaccurate for current and future technology nodes
• Proposed accurate power and area models for network routers (ORION 2.0)
• Presented a reproducible methodology for extracting inputs to our models
• Maintained the ORION 1.0 interface while significantly improving the accuracy of the models, so switching to ORION 2.0 is easy!

ORION 2.0 Release

ORION 2.0 Website: http://www.princeton.edu/~peh/orion.html


System-Level NoC Power Modeling Example: Polaris Toolchain
(V. Soteriou, N. Eisley, H. Wang, B. Li, L.S. Peh, TVLSI'07)

[Figure: Polaris toolchain flow, organized as Step 1, Step 2, and Step 3. Blocks shown: Trident (synthetic traffic generation); LUNA (high-level on-chip network analysis); ORION power and area models, driven by microarchitecture parameters; and a design-space exploration tool. Outputs shown: power consumption, performance (latency), CMOS area, and NoC design projections.]
