Outline
• Introduction
  – Chip Multiprocessors (CMP)
  – Previous Network-on-Chip (NoC) Topologies and their respective problems
  – Clos Network-on-Chip (CNOC)
• Research Issues
• Proposed Solutions
  – The scheduling algorithm of CNOC
  – The design of the Hierarchical Round-Robin Arbiter (HRRA)
  – The floorplanning methodology of CNOC
• Experimental Results
• Conclusion
Chip Multiprocessor (CMP)
• Billions of transistors on chip
  – CMP exploits parallelism to scale up processor performance.
• Logical view of a shared-memory CMP
  – Each Processing Element (PE) has its own L1 cache.
  – All PEs share a logical L2 cache.
• Performance Bottleneck
  – L2 cache bandwidth
  – Bus bandwidth
[Figure: two logical views of a CMP. Left: PEs with private L1 caches connected over a bus to a shared L2 cache. Right: PEs with private L1 caches connected by a Network-on-Chip to distributed L2 cache banks.]
Some CMPs from Industry
• Intel 80-core teraflops research chip
  – 65nm, year 2007
  – http://techresearch.intel.com/articles/Tera-Scale/1449.htm
• Sun ROCK, 16-core SPARC processor
  – 65nm, year 2008
  – G. Konstadinidis et al., "Implementation of a third generation 16-core, 32-thread CMT SPARC processor," IEEE ISSCC Dig. Tech. Papers, Feb. 2008, pp. 84–85.
Some CMPs from Industry
• IBM, Sony, and Toshiba Cell Processor
  – 65nm, year 2007
  – http://www.research.ibm.com/cell/
• Tilera TILEPro64 Processor
  – 90nm, year 2008
  – http://www.tilera.com/products/TILEPro64.php
NoC Topologies
• Low-radix
  – 2D Mesh
• Problems
  – Low bisection bandwidth
  – Low throughput under uniform traffic
  – High zero-load latency
  – Low power efficiency
[Figure: an 8x8 2D mesh, one router (R) per tile.]
NoC Topologies
• Concentrated Mesh and Concentrated MeshX2
  – High-radix
  – Improved zero-load latency
  – Throughput is not improved.
  – J. Balfour and W. J. Dally, "Design tradeoffs for tiled CMP on-chip networks," in ICS '06: Proc. of the 20th Annual International Conference on Supercomputing, 2006, pp. 187–198.
[Figure: Concentrated Mesh and Concentrated MeshX2; each router (R) serves a group of tiles, and CMeshX2 replicates the network.]
NoC Topologies
• Flattened Butterfly
  – High-radix
  – Improved zero-load latency
  – Improved throughput
  – Router radix is 10.
    • Area/power/frequency issues
  – J. Kim, J. Balfour, and W. J. Dally, "Flattened Butterfly Topology for On-Chip Networks," IEEE Computer Architecture Letters, vol. 6, no. 2, pp. 37–40, July–Dec. 2007.
[Figure: a flattened butterfly; each router (R) links directly to every other router in its row and column.]
NoC Topologies
• Fat-tree
  – High-radix
  – Improved zero-load latency
  – Improved throughput
  – Large number of routers
    • Area/power issues
  – D. Ludovici, F. Gilabert, S. Medardoni, C. Gomez, M. E. Gomez, P. Lopez, G. N. Gaydadjiev, and D. Bertozzi, "Assessing Fat-Tree Topologies for Regular Network-on-Chip Design under Nanoscale Technology Constraints," Proc. of Conf. on Design, Automation and Test in Europe, 2009.
NoC Topologies
• Clos Network-on-Chip (CNOC)
  – High-radix
  – Improved zero-load latency
  – Improved throughput
  – Improved power efficiency
[Figure: a three-stage CNOC. The output ports of the PEs feed input modules (IMs) a1–a8; each IM connects to all central modules (CMs) b1–b8, and each CM connects to all output modules (OMs) c1–c8, which drive the input ports of the PEs.]
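The full connectivity between stages is what gives the Clos its path diversity: any IM can reach any OM through any CM. A minimal sketch of the link structure drawn above, assuming a symmetric (8, 8, 8) configuration with radix-8 modules serving 64 PEs (the a/b/c module names follow the figure):

```python
def build_clos(n=8):
    """Return the inter-module link list of a symmetric three-stage Clos."""
    links = []
    # Stage 1 -> 2: every IM has one link to every CM.
    for i in range(n):
        for j in range(n):
            links.append((f"a{i+1}", f"b{j+1}"))
    # Stage 2 -> 3: every CM has one link to every OM.
    for j in range(n):
        for k in range(n):
            links.append((f"b{j+1}", f"c{k+1}"))
    return links

links = build_clos()
print(len(links))  # 128 links; any IM reaches any OM via any of the 8 CMs
```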
Research Issues and Contributions
• A CNOC composed of conventional VC routers can only achieve 65% throughput under uniform traffic.
  – We proposed a new scheduling algorithm for CNOC and improved the throughput under uniform traffic from 65% to 78%.
• The conventional router design cannot achieve 2 GHz because of the long critical path delay of the traditional round-robin arbiter.
  – We designed a hierarchical round-robin arbiter and verified its delay and power performance after place and route using SOC Encounter.
• The long interconnects of the CNOC might cause delay and power bottlenecks.
  – We proposed a floorplanning methodology for CNOC.
Outline
• Introduction
  – Chip Multiprocessors (CMP)
  – Previous Network-on-Chip (NoC) Topologies and their respective problems
  – Clos Network-on-Chip (CNOC)
• Research Issues
• Proposed Solutions
  – The scheduling algorithm of CNOC
  – The design of the Hierarchical Round-Robin Arbiter (HRRA)
  – The floorplanning methodology of CNOC
• Experimental Results
• Conclusion
Wormhole Routing and Virtual Channels
• Wormhole routing
  – Different packets cannot interleave in the same input queue.
• Virtual Channel (VC) flow control
  – Different packets can interleave on the same physical link.
[Figure: a packet as a head flit (H), four body flits (B), and a tail flit (T) traversing routers R1 and R2, shown without and with virtual channels.]
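To make the distinction concrete, here is a minimal behavioral sketch (a toy model of my own, not the authors' simulator) of a link multiplexer with two VCs: flits of two packets interleave on one physical link, which a single wormhole input queue would not allow.

```python
from collections import deque

# Two virtual channels on one physical link, each holding one packet's flits:
# H = head, B = body, T = tail.
vcs = [deque(("A", t) for t in "HBBT"),
       deque(("B", t) for t in "HBBT")]

link_trace = []
rr = 0                       # round-robin pointer over the VCs
while any(vcs):
    for i in range(len(vcs)):
        vc = vcs[(rr + i) % len(vcs)]
        if vc:               # each cycle the link carries one flit
            link_trace.append(vc.popleft())
            rr = (rr + i + 1) % len(vcs)
            break

print(link_trace)  # flits of packets A and B interleave on the shared link
```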
Problems of the Conventional Router Design
• Head-of-Line (HoL) blocking problem
  – Solved by a Virtual Output Queue (VOQ) buffer structure (see the sketch after the figure below)
[Figure: two upstream routers feeding downstream routers 1 and 2. With per-input VCs, a flit for downstream router 2 waits behind a flit for the stalled downstream router 1 in the same queue (head-of-line blocking); with VOQs, flits for different destinations occupy separate queues and the flit for router 2 proceeds. IA: input arbiter; OA: output arbiter.]
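A minimal sketch of the same point in code, under a simplified single-cycle model of my own: with one FIFO per input, the head flit's busy destination blocks everything behind it; with one queue per output, another output can be served instead.

```python
from collections import deque

# Conventional single FIFO per input port: head-of-line blocking.
fifo = deque([("flit_for_out1", 1), ("flit_for_out2", 2)])
busy_outputs = {1}                       # downstream router 1 stalls output 1
head_dest = fifo[0][1]
fifo_can_send = head_dest not in busy_outputs   # False: flit_for_out2 is stuck

# Virtual Output Queues (one queue per output): no head-of-line blocking.
voqs = {1: deque([("flit_for_out1", 1)]),
        2: deque([("flit_for_out2", 2)])}
sendable = [q[0] for out, q in voqs.items() if q and out not in busy_outputs]
print(fifo_can_send, sendable)
# False [('flit_for_out2', 2)] -> the flit for output 2 proceeds
```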
The Conventional Scheduling Algorithm
• The input-first scheduling algorithm
  – Dual Round-Robin Matching (DRRM)
  – The round-robin arbiters update their pointers in every clock cycle (one iteration is sketched after the figure below).
[Figure: a DRRM example on the VOQ-based routers: each input arbiter (IA) requests one output on behalf of a non-empty VOQ, each output arbiter (OA) grants one request, and the round-robin pointers advance every cycle.]
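A minimal Python sketch of one DRRM iteration, using simplified interfaces of my own (not the authors' RTL): each input's round-robin arbiter picks one non-empty VOQ to request, each output's round-robin arbiter grants one requesting input, and, as the slide states, all pointers advance every clock cycle.

```python
def rr_pick(candidates, pointer, n):
    """Round-robin: pick the first candidate at or after the pointer."""
    for i in range(n):
        c = (pointer + i) % n
        if c in candidates:
            return c
    return None

def drrm_iteration(voq_len, in_ptr, out_ptr):
    """voq_len[i][j] > 0 means input i holds flits destined for output j."""
    n = len(voq_len)
    # Request phase: each input arbiter selects one non-empty VOQ.
    requests = {}
    for i in range(n):
        j = rr_pick({j for j in range(n) if voq_len[i][j] > 0}, in_ptr[i], n)
        if j is not None:
            requests.setdefault(j, set()).add(i)
    # Grant phase: each output arbiter grants one requesting input.
    grants = {}
    for j, inputs in requests.items():
        grants[rr_pick(inputs, out_ptr[j], n)] = j
    # Per the slide, pointers advance every clock cycle.
    in_ptr[:] = [(p + 1) % n for p in in_ptr]
    out_ptr[:] = [(p + 1) % n for p in out_ptr]
    return grants  # {input index: granted output index}
```

Calling it with voq_len = [[1, 0], [1, 0]] matches one of the two contending inputs to output 0; the loser simply retries in the next cycle with an advanced pointer.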
Problems of the Conventional Router Design
• Tail-of-Line blocking problem
  – Addressed by the Packet-mode Dual Round-Robin Matching (PDRRM) algorithm
  – The IAs/OAs keep granting a packet until its tail flit is transferred (sketched after the figure below).
[Figure: a PDRRM example on the VOQ-based routers: flits of packets 1 and 2 wait in the VOQs, and once an arbiter grants a packet's head flit it holds the grant until the tail flit has been transferred, so whole packets are switched contiguously. IA: input arbiter; OA: output arbiter.]
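A minimal sketch of the grant-hold rule layered on the DRRM iteration above, again with simplified interfaces of my own (pointer updates omitted for brevity): arbiters holding a packet keep granting it until its tail flit passes, and only idle inputs and outputs take part in a fresh match.

```python
def rr_pick(candidates, pointer, n):
    for i in range(n):
        c = (pointer + i) % n
        if c in candidates:
            return c
    return None

def pdrrm_cycle(voqs, hold, in_ptr, out_ptr):
    """voqs[i][j]: list of flits 'H'/'B'/'T' from input i to output j.
    hold[i]: output currently held by input i, or None when idle."""
    n = len(voqs)
    granted, held_outputs = {}, set()
    # 1. Held grants persist until the packet's tail flit is transferred.
    for i in range(n):
        j = hold[i]
        if j is not None:
            flit = voqs[i][j].pop(0)
            granted[i] = (j, flit)
            held_outputs.add(j)
            if flit == "T":        # tail transferred: release the grant
                hold[i] = None
    # 2. Idle inputs and outputs run an ordinary DRRM match.
    requests = {}
    for i in range(n):
        if i not in granted and hold[i] is None:
            j = rr_pick({j for j in range(n)
                         if voqs[i][j] and j not in held_outputs},
                        in_ptr[i], n)
            if j is not None:
                requests.setdefault(j, set()).add(i)
    for j, inputs in requests.items():
        i = rr_pick(inputs, out_ptr[j], n)
        flit = voqs[i][j].pop(0)
        granted[i] = (j, flit)
        if flit != "T":            # hold the match until the tail goes through
            hold[i] = j
    return granted
```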
The Performance of PDRRM
• In-house cycle-accurate simulator
• 8 VCs/VOQs in each input port
• Each input port can accommodate 256 flits
• 8-flit packet size
• Throughput under uniform traffic improves from 65% to 81%.
[Plot: average latency (ns) versus injection rate (%) under uniform traffic for VC_PIM, VC_DRRM, VOQ_DRRM, VOQ_PDRRM_I, and VOQ_PDRRM_II, with an inset zooming in on the 0-30% region; annotated values: 77.58 and 8.5.]
Outline
• Introduction
  – Chip Multiprocessors (CMP)
  – Previous Network-on-Chip (NoC) Topologies and their respective problems
  – Clos Network-on-Chip (CNOC)
• Research Issues
• Proposed Solutions
  – The scheduling algorithm of CNOC
  – The design of the Hierarchical Round-Robin Arbiter (HRRA)
  – The floorplanning methodology of CNOC
• Experimental Results
• Conclusion
The Micro-architecture of the PDRRM Router
[Figure: block diagram of the PDRRM router. n VOQ input buffers feed an n x n crossbar switch. The PDRRM allocator contains n input HRRAs and n output HRRAs together with request-filter, VOQ-update, pointer-update, and credit-counter logic, exchanging request/grant and credit signals with the datapath. The pipeline runs in three 500 ps stages, labeled with buffer write (BW), next-route computation (NRC), input arbitration (IA), output arbitration plus VC allocation (OA+VA), buffer read (BR), and switch traversal (ST).]
Generic RRA and Hierarchical RRA
• An 8x8 generic RRA has a 447 ps critical path delay.
  – Too close to the 500 ps cycle budget of a 2 GHz design
  – Hard to scale up the operating frequency (the hierarchical scheme in the figure below is sketched in code after it)
  – P. Gupta and N. McKeown, "Designing and implementing a fast crossbar scheduler," IEEE Micro, vol. 19, no. 1, pp. 20–28, 1999.
[Figure: an 8-input HRRA. The 8 requests split into two groups of 4; each group feeds a local SRRA that produces a 4-bit local grant and a "granted" flag, with DRQ and Pass signals linking the local level to a global RRA that selects one group to form the 8-bit global grant.]
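A behavioral sketch of the two-level idea under a simplified model of my own (the actual design is synthesized VHDL): each local arbiter round-robins over 4 requests and a small global arbiter round-robins over the group winners, so no single round-robin stage spans all 8 inputs and the critical path shortens.

```python
def rr_arbiter(requests, ptr):
    """Plain round-robin arbiter: grant the first request at/after ptr."""
    n = len(requests)
    for i in range(n):
        k = (ptr + i) % n
        if requests[k]:
            return k
    return None

def hrra_8(requests, local_ptrs, global_ptr):
    """8-input hierarchical RRA: two 4-input locals plus a 2-input global."""
    groups = [requests[0:4], requests[4:8]]
    local_winners = [rr_arbiter(g, p) for g, p in zip(groups, local_ptrs)]
    # The global arbiter only sees which groups have a winner.
    group_req = [w is not None for w in local_winners]
    g = rr_arbiter(group_req, global_ptr)
    if g is None:
        return None
    return 4 * g + local_winners[g]   # index of the globally granted input

# Example: inputs 1 and 6 request; the grant alternates fairly between groups.
print(hrra_8([0, 1, 0, 0, 0, 0, 1, 0], local_ptrs=[0, 0], global_ptr=0))  # 1
print(hrra_8([0, 1, 0, 0, 0, 0, 1, 0], local_ptrs=[0, 0], global_ptr=1))  # 6
```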
The Performance of HRRA
• The VHDL code of both designs is synthesized and analyzed by the Cadence Encounter RTL Compiler using STMicroelectronics 65nm technology.
• SOC Encounter is used to place and route these designs.
• Power evaluation is performed with Cadence Encounter, assuming a 100% activity factor.
[Figure: floorplan and layout of a radix-8 VOQ PDRRM router (SRAMs not included), showing the SA+VA logic, credit counter, wire wrapper, crossbar, and memory controllers MemCtrl0–MemCtrl7.]
Outline
• Introduction
  – Chip Multiprocessors (CMP)
  – Previous Network-on-Chip (NoC) Topologies and their respective problems
  – Clos Network-on-Chip (CNOC)
• Research Issues
• Proposed Solutions
  – The scheduling algorithm of CNOC
  – The design of the Hierarchical Round-Robin Arbiter (HRRA)
  – The floorplanning methodology of CNOC
• Experimental Results
• Conclusion
Floorplanning
• Routing over resources or routing on dedicated channels
• Wiring density
  – Wiring density is the maximum number of tile-to-tile wires routable across a tile edge.
  – 40-60% of the length of a tile edge is routable for global interconnects (a worked example follows the figure below).
[Figure: two 2x2 tile groups, each with a router R, illustrating routing over resources versus routing on dedicated channels.]
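To make the density figure concrete, a small worked example with assumed numbers: only the 40-60% routable fraction comes from the slide; the 2 mm tile edge matches the link-energy quote later, and the 0.4 µm global-wire pitch is a hypothetical 65 nm value.

```python
# Wiring density = routable fraction of the tile edge / wire pitch.
# Tile edge and wire pitch below are illustrative assumptions.
tile_edge_um = 2000.0    # assumed 2 mm tile edge
wire_pitch_um = 0.4      # assumed global-wire pitch
for frac in (0.4, 0.6):
    wires = int(frac * tile_edge_um / wire_pitch_um)
    links_128b = wires // 128
    print(f"{frac:.0%} routable: {wires} wires, ~{links_128b} 128-bit links")
# 40% routable: 2000 wires, ~15 128-bit links
# 60% routable: 3000 wires, ~23 128-bit links
```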
Problem Definition
• NxN identical tiles T
• Routers R = {R_i, i = 1, 2, 3, ..., n}
• Links L = {L_i, i = 1, 2, 3, ..., m}
• M(L_i) is the Manhattan distance between the two nodes that L_i connects.
• E_k is the energy that a flit consumes when it travels through a link with Manhattan distance k.
• Objective function (evaluated in the sketch below):

  Φ(R, L) = Σ_{i=1}^{m} E_{M(L_i)}
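A minimal sketch of evaluating Φ for a candidate placement, assuming the linear link-energy model quoted in the simulation setup (2.16 pJ per 2 mm, i.e., per tile hop); the toy placement and node names are hypothetical.

```python
# Evaluate Phi(R, L) = sum over links L_i of E_{M(L_i)}.
# Assumed energy model: E_k = k * 2.16 pJ (one tile hop = 2 mm of wire).
PJ_PER_HOP = 2.16

def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])   # distance in tile hops

def phi(positions, links):
    """positions: node -> (row, col) tile; links: list of (node, node) pairs."""
    return sum(PJ_PER_HOP * manhattan(positions[a], positions[b])
               for a, b in links)                # pJ per flit, over all links

# Toy placement of one input, central, and output module (hypothetical).
pos = {"a1": (0, 0), "b1": (0, 3), "c1": (2, 3)}
print(phi(pos, [("a1", "b1"), ("b1", "c1")]))    # (3 + 2) hops * 2.16 = 10.8
```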
Floorplan Result
• Under the constraints:
  – The longest wire length is 7 hops.
  – 40% of the length of the tile edge is routable.
  – Any link can contain up to 128 bits.
  – All the wires can operate at 2 GHz.
• There can be multiple solutions.
[Figure: two alternative floorplan solutions on the 8x8 tile grid. Tiles are grouped into eight numbered clusters (1–8), and the input modules i1–i8, central modules c1–c8, and output modules o1–o8 are placed along the cluster boundaries, at different positions in each solution.]
Outline
• Introduction
  – Chip Multiprocessors (CMP)
  – Previous Network-on-Chip (NoC) Topologies and their respective problems
  – Clos Network-on-Chip (CNOC)
• Research Issues
• Proposed Solutions
  – The scheduling algorithm of CNOC
  – The design of the Hierarchical Round-Robin Arbiter (HRRA)
  – The floorplanning methodology of CNOC
• Experimental Results
• Conclusion
Simulation Setup
• For energy consumption on NoC links, we use Cadence Spectre to run simulations on wire models of various lengths, with parameters given by the Predictive Technology Model (PTM). The energy consumed by a flit is 2.16 pJ per 2 mm.

* A. Kahng, et al., "ORION 2.0: A Fast and Accurate NoC Power and Area Model for Early-Stage Design Space Exploration," in Proc. IEEE/ACM Conf. on Design, Automation and Test in Europe (DATE), April 2009.
50-ns Throughputs
• 50-ns throughput of the different topologies under different synthetic traffic patterns.
[Bar chart: 50-ns throughput (0–1) of 2D Mesh, CMeshX2, VC FT, VOQ FT, and CNOC under Uniform, Transpose, Shuffle, Tornado, Non-uniform, and Local traffic.]
Low-load Packet Latencies
• Average packet latencies at a 5% injection rate under different synthetic traffic patterns.
[Bar chart: average packet latency (ns, 0–20) of 2D Mesh, CMeshX2, VOQ FT, and CNOC under Uniform, Transpose, Shuffle, Tornado, Non-uniform, and Local traffic.]
Normalized Power Efficiencies
• Power efficiency = 1/E, where E = (time for each PE to finish sending 1000 packets at the injection rate that yields an average latency of 50 ns) x (total energy dissipated during the process). A worked sketch of the metric follows.
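A small illustration of the metric's form with invented numbers (not measured results); the chart below then shows the measured values on a normalized 0-1 scale.

```python
# Invented numbers, purely to illustrate the metric E = time * energy.
finish_time_s = 16e-6     # assumed time for every PE to send 1000 packets
total_energy_j = 1.2e-3   # assumed energy dissipated over that interval
E = finish_time_s * total_energy_j   # an energy-delay-product-style figure
power_efficiency = 1.0 / E           # higher is better
print(f"{power_efficiency:.3g} 1/(J*s)")  # ~5.21e+07, before normalization
```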
[Bar chart: normalized power efficiency (0–1) of 2D Mesh, CMeshX2, VC FT, VOQ FT, and CNOC under Uniform, Transpose, Shuffle, Tornado, Non-uniform, and Local traffic.]
Low-load Power Consumption
• Power consumption of the different topologies under a 5% traffic load.
[Stacked bar chart: power (W, 0–1.4) of 2D Mesh, CMeshX2, VOQ FT, and CNOC under Uniform, Transpose, Shuffle, Tornado, Non-uniform, and Local traffic, broken down into router static, router dynamic, and link power.]
Area Comparison
• The NoC area as a percentage of the 16 mm x 16 mm chip area
[Bar chart: NoC area fraction (0–0.25) of the chip for 2D Mesh, CMeshX2, Fat-tree, and CNOC, broken down into link, allocator, memory, and crossbar area.]
Benchmark Simulation Setup
• We use Graphite, an x86 multicore simulator, to generate the SPLASH-2 traffic traces.
• We then inject the traffic traces into our in-house cycle-accurate simulator to obtain the delay and power performance.
• In Graphite, it is assumed that there are 64 PEs, each with its own private L1/L2 (32kB/512kB) caches.
• The caches are kept coherent with the MSI protocol, and the packet sizes are either 8 bytes or 64 bytes (2 flits or 16 flits).
Benchmark Simulation Results
• Average packet latencies of the SPLASH-2 benchmarks
[Bar chart: average packet latency (ns, 0–40) of 2D Mesh, CMeshX2, VOQ FT, and CNOC on the SPLASH-2 benchmarks barnes, cholesky, fft, fmm, lu, ocean, radiosity, radix, raytrace, volrend, and water.]
Conclusion
• We proposed a new scheduling algorithm for CNOC, improving the throughput of the conventional scheduling algorithm under uniform traffic from 65% to 78%.
• We designed a hierarchical round-robin arbiter and verified its delay and power performance after place and route using SOC Encounter.
• We proposed a floorplanning methodology for CNOC.
• We compared CNOC with 2D Mesh, CMeshX2, and Fat-tree in terms of 50-ns throughput, low-load latency, power efficiency, low-load power consumption, and average packet latency under the SPLASH-2 benchmarks.
• This work has been published in:
  – IEEE/ACM NOCS 2010
  – IEEE TCAD, Dec. 2011 issue