
A QOS-ENABLED ON-DIE INTERCONNECT FABRIC FOR KILO-NODE CHIPS

Boris Grot, École Polytechnique Fédérale de Lausanne
Joel Hestness, University of Texas at Austin
Stephen W. Keckler, Nvidia
Onur Mutlu, Carnegie Mellon University

TO MEET RAPIDLY GROWING PERFORMANCE DEMANDS AND ENERGY CONSTRAINTS, FUTURE CHIPS WILL LIKELY FEATURE THOUSANDS OF ON-DIE RESOURCES. EXISTING NETWORK-ON-CHIP SOLUTIONS WEREN'T DESIGNED FOR SCALABILITY AND WILL BE UNABLE TO MEET FUTURE INTERCONNECT DEMANDS. A HYBRID NETWORK-ON-CHIP ARCHITECTURE CALLED KILO-NOC CO-OPTIMIZES TOPOLOGY, FLOW CONTROL, AND QUALITY OF SERVICE TO ACHIEVE SIGNIFICANT GAINS IN EFFICIENCY.

The semiconductor industry is rapidly moving toward rich, chip-level integration; in many application domains, chips with hundreds to thousands of processing elements are likely to appear in the near future. To address the communication needs of richly integrated chips, the industry has embraced structured, on-die communication fabrics called networks-on-chip (NoCs). However, existing NoC architectures have been designed for substrates with dozens of nodes, not hundreds or thousands; once scaled to tomorrow's kilonode configurations, significant performance, energy, and area overheads emerge in today's state-of-the-art NoCs.

We focus on NoC scalability from energy, area, performance, and quality of service (QoS) perspectives. Prior research indicates that richly connected topologies improve latency and energy efficiency in on-chip networks.1,2 While our analysis confirms those results, it also identifies buffer overheads as a critical scalability obstacle that emerges once richly connected NoCs are scaled to configurations with hundreds of nodes. Large buffer pools adversely affect NoC area and energy efficiency. The addition of QoS support further increases storage overhead, virtual channel (VC) requirements, and arbitration complexity.

Our solution holistically addresses key sources of inefficiency in NoCs of highly integrated chip multiprocessors (CMPs) through a hybrid NoC architecture (called Kilo-NoC) that offers low latency, a small footprint, good energy efficiency, and strong service guarantees.

Kilo-NoC overview

Existing QoS approaches necessitate hardware support at every router, incurring network-wide costs and complexity overheads. We propose a QoS architecture that overcomes this limitation of previous designs by requiring QoS hardware at just a subset of the routers.


Our approach consolidates shared resources, such as memory controllers, within a portion of the network and enforces service guarantees only within subnetworks that contain the shared resources. The enabling technology underlying the scheme is a richly connected topology that enables single-hop access to any QoS-protected subnetwork, effectively eliminating intermediate nodes as sources of interference. To our knowledge, our work is the first to consider the interaction between topology and QoS.

While topology-awareness offers a considerable reduction in QoS-related costs, it doesn't address the high buffer overheads of richly connected topologies. We eliminate much of this buffer expense by introducing a lightweight elastic buffer (EB) architecture that integrates storage directly into links. Again, our design leverages a feature of the topology to offer a single-network, deadlock-free EB architecture at a fraction of prior schemes' cost.

Together, these techniques work synergistically to enable a highly scalable kilo-node interconnect fabric. Our evaluation in the context of a thousand-terminal system reveals that the Kilo-NoC architecture is highly effective in relieving the scalability bottlenecks of today's NoCs. Compared to a state-of-the-art, QoS-enabled NoC, our proposed design reduces network area requirements by 45 percent and energy expenditure by 29 percent. The Kilo-NoC attains these benefits without sacrificing either performance or strength of service guarantees.

Background

Networks are commonly characterized along three dimensions: topology, routing, and flow control. Of these, topology is the single most important determinant of performance, energy efficiency, and cost (area).

To date, most NoCs that have been realized in silicon feature a ring or mesh topology. While such topologies are acceptable when interconnecting a modest number of nodes, their large average hop count in chips with hundreds of networked components represents a serious efficiency bottleneck. Each hop involves a router crossing, which often dominates the per-hop latency and energy cost due to the need to write and read the packet buffers, arbitrate for resources, and traverse a crossbar switch.

To bridge the scalability gap, researchers have proposed low-diameter NoC topologies that improve efficiency through rich internode connectivity. One such topology is the flattened butterfly, which fully interconnects the nodes in each dimension via dedicated point-to-point channels.1 However, the flattened butterfly's channel requirements and crossbar complexity grow quadratically with the network radix and represent a scalability obstacle. An alternative organization, shown in Figure 1, uses multidrop express channels (MECS) to achieve the same degree of connectivity as the flattened butterfly but with fewer channels.2 Each node in a MECS network has four output channels, one per cardinal direction. Lightweight drop interfaces integrated into the channel let packets exit the channel into one of the routers spanned by the link.
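As a rough illustration of the gap, the sketch below (a simplified counting model of ours, not the paper's evaluation infrastructure) compares worst-case router crossings in a 16 x 16 grid of routers under dimension-ordered routing for a mesh and for a MECS-style low-diameter network:

# Simplified hop-count model: a mesh crosses one router per hop in each dimension,
# while a MECS-style network reaches any router in at most one hop per dimension.

def mesh_router_crossings(src, dst):
    (sx, sy), (dx, dy) = src, dst
    return abs(sx - dx) + abs(sy - dy) + 1   # +1 counts the source router itself

def mecs_router_crossings(src, dst):
    (sx, sy), (dx, dy) = src, dst
    return 1 + (sx != dx) + (sy != dy)       # source, optional turn, destination

k = 16                                       # 16 x 16 grid of concentrated routers
corner_to_corner = ((0, 0), (k - 1, k - 1))
print(mesh_router_crossings(*corner_to_corner))   # 31 router crossings
print(mecs_router_crossings(*corner_to_corner))   # 3 router crossings

The three-crossing worst case corresponds to the source, turn, and destination routers; the mesh figure grows linearly with the grid dimension.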

Unfortunately, long-range links require large buffer pools to cover the flight time of the data and the credit return. Such large buffer configurations carry a significant area and energy cost in richly connected NoCs with hundreds of nodes. For instance, in a 16 × 16 network with a low-diameter topology (for example, MECS or flattened butterfly), each router needs 30 network ports with up to 35 buffer slots per packet class per port. The chip-wide buffer requirements of such a NoC, assuming two packet classes and 16-byte links, exceed 8 Mbytes, an extraordinary amount from an area and energy standpoint, even in future process technologies.

Adding QoS support further increases buffer demands.

Figure 1. A conceptual view of the Multidrop Express Channels (MECS) architecture. MECS uses a point-to-multipoint interconnect model to enable rich connectivity with relatively few channels.

Although recent work has demonstrated NoC QoS architectures with smaller footprints than traditional networking schemes,3,4 that work only considered mesh topologies. Our analysis shows that the benefits of these QoS architectures are greatly diminished in large-scale, low-diameter NoCs owing to the high VC requirements imposed by long link spans. For instance, in the hypothetical 16 × 16 network we mentioned earlier, a high-radix router outfitted with a state-of-the-art Preemptive Virtual Clock (PVC) NoC QoS architecture requires more than 700 VCs, compared to just 24 VCs in a PVC-enabled mesh router.

Improving the flow-control mechanism, which manages traffic flow by allocating resources to packets, can help mitigate high buffer overheads. One potential approach is bufferless flow control, recently examined by NoC researchers looking for ways to boost efficiency.5 Unfortunately, existing bufferless architectures are unable to provide service guarantees. Integrating storage elements directly into links, a technique termed elastic buffering, is another promising direction recent research has taken.6,7 While existing proposals offer some gains in efficiency, the serializing nature of EB links complicates deadlock avoidance and impedes the isolation of flows, which is necessary for QoS guarantees.

Kilo-NoC architecture

We describe the proposed Kilo-NoC architecture in the context of a 1,024-tile CMP implemented in 15-nm technology.

Baseline design

Figure 2a shows the baseline NoC organization, scaled down to 64 tiles for clarity. We employ concentration8 to reduce the number of network nodes from 1,024 to 256 by integrating four terminals at each router via a crossbar switch. A node refers to a network node, while a terminal is a discrete system resource (such as a core, cache tile, or memory controller) with a dedicated port at a network node. The nodes are interconnected via a richly connected MECS topology. We chose MECS due to its low diameter, scalable channel count, modest switch complexity, and the unique capabilities multidrop offers. PVC enforces QoS guarantees for the virtual machines (VMs) sharing a die.4

We arrange the 256 concentrated nodes in a 16 × 16 grid. Each MECS router integrates 30 network input ports (15 per dimension). With one cycle of wire latency between adjacent nodes, maximum channel delay from one edge of the chip to another is 15 cycles. The round-trip credit time is 35 cycles, once router pipeline delays are included. This round-trip latency establishes a lower bound for per-port buffer requirements in the absence of any location-dependent optimizations. To guarantee freedom from protocol deadlock, each port needs a dedicated VC per packet class. With two priority levels (request at low priority and reply at high priority), a pair of 35-deep VCs affords deadlock-free operation while covering the maximum round-trip credit delay. The total buffer requirements are 70 flits at each input port and 2,100 flits for the entire 30-port router. With 16-byte flits, total required storage is 32 Kbytes per router and 8.4 Mbytes network wide.
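These totals follow directly from the parameters just stated; the back-of-the-envelope script below (our own arithmetic check, with rounding conventions left loose) reproduces them:

# Buffer sizing for the baseline (non-QoS) MECS router, using only figures stated above.
round_trip_credit = 35      # cycles: 15-cycle edge-to-edge channel delay plus pipeline stages
packet_classes    = 2       # request (low priority) and reply (high priority)
ports_per_router  = 30      # MECS input ports, 15 per dimension
flit_bytes        = 16
routers           = 256     # 1,024 terminals with four-way concentration

flits_per_port   = round_trip_credit * packet_classes     # 70 flit slots per port
flits_per_router = flits_per_port * ports_per_router      # 2,100 flit slots per router
bytes_per_router = flits_per_router * flit_bytes          # 33,600 bytes, roughly 32 Kbytes
network_bytes    = bytes_per_router * routers             # on the order of the 8.4 Mbytes above

print(flits_per_port, flits_per_router, bytes_per_router, network_bytes)
# 70 2100 33600 8601600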

To guarantee QoS, packets from different nodes require separate VCs to prevent priority inversion within a VC buffer.

Figure 2. A 64-tile chip multiprocessor (CMP) with four-way concentration and MECS topology. Light nodes represent core and cache tiles; shaded nodes represent memory controllers; and Q indicates QoS hardware support. Dotted lines indicate virtual machine (VM) assignments in a topology-aware QoS architecture. Baseline QoS-enabled CMP (a); topology-aware QoS approach (b). Topology-awareness enables a reduction in the number of routers that require QoS hardware.

To accommodate a worst-case pattern consisting of single-flit packets from different nodes, an unoptimized router would require 35 VCs per port. Several optimizations, such as location-dependent buffer sizing, can be used to reduce the VC and buffer requirements at additional design expense and arbitration complexity. Here, we assume a 25 percent reduction in per-port VC requirements. Assuming a maximum packet size of four flits, a baseline QoS-enabled architecture requires 25 four-deep VCs per port, 750 VCs and 3,000 flit slots per router, and 12 Mbytes of storage network wide.
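Extending the same arithmetic with the per-flow VC requirement that QoS imposes reproduces the network-wide figure (again an informal check, not the paper's methodology):

# Storage for the QoS-enabled (PVC) baseline, using the per-port VC count stated above.
vcs_per_port     = 25       # 35 worst-case VCs, reduced by the assumed 25 percent
vc_depth         = 4        # flits, the maximum packet size
ports_per_router = 30
flit_bytes       = 16
routers          = 256

vcs_per_router   = vcs_per_port * ports_per_router          # 750 VCs per router
flits_per_router = vcs_per_router * vc_depth                # 3,000 flit slots per router
network_bytes    = flits_per_router * flit_bytes * routers  # roughly the 12 Mbytes quoted above

print(vcs_per_router, flits_per_router, network_bytes)
# 750 3000 12288000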

Topology-aware QoS architecture

Our first optimization target is the QoS mechanism, which imposes significant VC and buffer overheads. In contrast to existing network QoS architectures that demand dedicated QoS logic and storage at every router, we seek to limit the number of nodes requiring hardware QoS support. Our proposed scheme, called Topology-Aware QoS (TAQ), accomplishes this goal by isolating shared resources into dedicated regions of the network, called shared regions (SRs), with hardware QoS enforcement within each SR. The rest of the network is freed from the burden of hardware QoS support and enjoys reduced cost and complexity.

The TAQ architecture leverages the rich intradimension connectivity provided by MECS (or other low-diameter topologies) to ensure single-hop access to any shared region, which we achieve by organizing the SRs into columns spanning the entire width of the die. Single-hop connectivity guarantees interference-free transit into an SR. Once inside the SR, a packet is regulated by PVC (or another QoS mechanism) as it proceeds to its destination. To prevent unregulated contention for network bandwidth at concentrated nodes outside of the SR, we require the OS or hypervisor to coschedule only threads from the same VM onto a node. Figure 2b shows our proposed organization, including a sample assignment of nodes to VMs. Note that though the figure's SR column is on the edge of the die, TAQ doesn't require such placement.

Depending on how virtual machines on a die are placed, certain intra-VM and inter-VM transfers might have to flow through a router at a node mapped to an unrelated VM. Because such scenarios can result in inter-VM contention at routers lacking QoS support, we use simple routing rules that exploit the combination of a richly connected topology and QoS-enabled regions to avoid inter-VM interference. The rules can be summarized as follows; a small illustrative sketch appears after the list.

• Communication within a dimension is unrestricted, since a low-diameter topology provides interference-free, single-hop communication in a given row or column.

• Dimension changes are unrestricted if the turn node belongs to the same VM as the packet's source or destination. For example, all cache-to-cache traffic associated with VM #2 in Figure 2b stays within a single convex region and never needs to transit through a router in another VM.

• Packets requiring a dimension change at a router associated with a node of an unrelated VM must flow through one of the shared regions. Depending on the locations of the communicating nodes and the SRs, the resulting routes may be nonminimal. For instance, in Figure 2b, traffic from partition (a) of VM #1 that is transiting to partition (b) must take the longer path through the shared column to avoid turning at a router associated with VM #2.
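The sketch below shows one way these three rules could be encoded. The grid coordinates, VM map, and shared-region column positions are illustrative placeholders of ours, and picking the nearest SR column under rule 3 is an assumption rather than something the paper specifies:

# Illustrative encoding of the TAQ routing rules (toy model, not the paper's implementation).
SR_COLUMNS = {4, 12}            # hypothetical x-coordinates of shared-region columns

def taq_route(src, dst, vm_of):
    (sx, sy), (dx, dy) = src, dst
    if sx == dx or sy == dy:
        # Rule 1: same row or column, single-hop and interference-free under MECS.
        return "direct (single dimension)"
    turn = (dx, sy)             # natural dimension-order turn node (X then Y)
    if vm_of.get(turn) in (vm_of[src], vm_of[dst]):
        # Rule 2: turning at a router of the packet's own VM is allowed.
        return f"turn at {turn} (same VM)"
    # Rule 3: otherwise detour through a QoS-protected shared-region column.
    sr_x = min(SR_COLUMNS, key=lambda x: abs(x - sx))
    return f"turn inside SR column x={sr_x} (possibly nonminimal)"

# Toy example: two nodes of VM 1 whose natural turn node belongs to VM 2.
vm_of = {(2, 3): 1, (9, 8): 1, (9, 3): 2}
print(taq_route((2, 3), (9, 8), vm_of))   # turn inside SR column x=4 (possibly nonminimal)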

Our proposal preserves service guarantees for all VMs, regardless of the locations of communicating nodes. However, placing all of a VM's resources in a contiguous region can improve both performance and energy efficiency by reducing communication distance and minimizing accesses to the SRs.

Elastic-buffer flow control

Freed from the burden of enforcing QoS, routers outside of shared regions enjoy a significant reduction in VCs to just one VC per packet class. Yet, despite this reduction, a MECS kiloterminal network with two packet priority classes still requires a prohibitive 8 Mbytes of buffer capacity. In an effort to further reduce buffer overheads, we turn to elastic buffering.


Although conventional elastic-buffered networks are incompatible with QoS due to the serializing nature of EB flow control (which can cause priority inversion within a channel), our proposed TAQ architecture enables elastic buffering outside of the shared regions by eliminating interference among nodes from different VMs. A point-to-multipoint MECS topology also greatly reduces overall storage requirements, because all downstream destination nodes effectively share each buffer slot in a channel. In contrast, the point-to-point topologies considered in earlier EB studies offer limited storage savings, because buffer capacity is simply shifted from routers to links.

One drawback of existing EB architectures is the additional complexity associated with guaranteeing freedom from protocol deadlock. An earlier proposal ensured deadlock freedom through a conventional VC architecture, thus negating the benefits of elastic buffering.6 A more recent study advocates pure, elastic-buffered NoCs with no VCs.7 While a pure EB architecture is leaner than alternative designs, it requires a dedicated physical network for each packet class for deadlock avoidance, increasing NoC area and wire pressure.

Low-cost elastic buffering

We propose an EB organization that affords considerable area savings over earlier schemes. Our approach combines elastic-buffered links with minimal VC support, enabling a single-network architecture with hybrid EB/VC flow control. The key to minimizing VC costs comes through a novel flow-control mechanism called Just-in-Time VC binding (JIT-VC), which enables a packet in the channel to allocate a VC from an EB adjacent to the router. In doing so, our design essentially eliminates the buffer credit loop, whose length determines VC and buffer requirements. The resulting organization represents a scalable alternative to traditional VC architectures in which buffer requirements are proportional to the link delay, necessitating large buffer pools to cover long link spans.

Because packets regulated by JIT-VC flow control don't reserve downstream buffer space before entering the channel, they leave the network susceptible to protocol deadlock. We assure deadlock freedom by providing an escape path for blocked packets into intermediate routers along their direction of travel, exploiting the multidrop feature of MECS channels in concert with the JIT-VC mechanism. Under normal operation, a packet will allocate a VC once it reaches the EB at the target (turn or destination) node. However, should a high-priority (for example, reply) packet be blocked in the channel, it can escape into a JIT-allocated VC at another node. Once buffered at an escape router, a packet will switch to a new MECS channel by traversing the router pipeline like any other packet. To prevent circular deadlock, we don't let packets switch dimensions at an escape node. Figure 3 shows a high-level depiction of our approach.
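The toy model below captures the gist of that mechanism: a packet normally binds a VC just in time at its target node's elastic buffer, and a blocked high-priority packet may instead escape into a VC at another drop point on the channel. The class names and structure are our own illustrative assumptions, not the paper's hardware:

from collections import deque

class Router:
    """One shallow VC per priority class, as in Kilo-NoC routers outside shared regions."""
    def __init__(self, name, vc_depth=4):
        self.name = name
        self.vc_depth = vc_depth
        self.vcs = {"reply": deque(), "request": deque()}

    def jit_allocate(self, packet):
        # Just-in-time binding: claim a VC slot only when the packet reaches this node's EB.
        vc = self.vcs[packet["class"]]
        if len(vc) < self.vc_depth:
            vc.append(packet)
            return True
        return False

def advance(channel, drop_points):
    """Advance the head packet of one MECS channel (highly simplified)."""
    head = channel[0]
    if drop_points[head["target"]].jit_allocate(head):
        return channel.popleft()       # normal case: VC bound at the turn or destination node
    if head["class"] == "reply":       # blocked high-priority packet: escape at another
        for name, router in drop_points.items():    # drop point along the same direction
            if name != head["target"] and router.jit_allocate(head):
                return channel.popleft()   # it later re-enters a channel in the same dimension
    return None                        # a low-priority packet simply waits in the elastic buffer

drop_points = {"R5": Router("R5"), "R9": Router("R9")}
channel = deque([{"class": "reply", "target": "R9"}])
print(advance(channel, drop_points))   # packet drained from the channel into R9's VC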

Figure 3. An example of deadlock avoidance in a network with elastic buffers (EB) via the Just-in-Time VC (JIT-VC) mechanism. A high-priority packet in a MECS channel is obstructed by a low-priority packet (a). The high-priority packet acquires a buffer at a router associated with the EB (b). The high-priority packet switches to a new MECS channel and proceeds toward its destination (c). Our architecture avoids deadlock by providing escape paths for blocked packets.

Forward progress in the proposed EB-enabled network is guaranteed through simple microarchitectural mechanisms that segregate packets into dedicated VCs according to priority class at each router, allocate resources in strict priority order (as determined by packet class), and ensure that high-priority packets eventually escape into a router along their paths by bypassing or draining lower-priority packets using JIT-VC. For details of the scheme, along with a proof showing its freedom from deadlock, see our paper for the 38th International Symposium on Computer Architecture (ISCA 2011).9

A single-network EB scheme, as we describe here, enables a significant reduction in storage requirements for nodes outside of shared regions. Given a maximum packet size of four flits and two priority classes, a pair of four-deep VCs suffices at each router input port. Compared to a baseline PVC-enabled MECS router with 25 VCs per port, our approach reduces both VC and buffer requirements by a factor of 12.
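The factor-of-12 claim is simply the ratio of per-port VC storage between the two designs:

# Per-port VC storage: PVC-enabled baseline versus the hybrid EB/VC router (outside SRs).
baseline_vcs, hybrid_vcs = 25, 2        # VCs per input port
vc_depth = 4                            # flits, the maximum packet size
print(baseline_vcs / hybrid_vcs)                        # 12.5x fewer VCs
print(baseline_vcs * vc_depth, hybrid_vcs * vc_depth)   # 100 vs. 8 flit slots per port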

Evaluation highlights

We rigorously evaluated the set of proposed optimizations in terms of their effect on network efficiency, performance, and QoS. We used detailed technology models for area and energy and simulation-based studies for performance and QoS analysis. Here, we present a sampling of the results and refer interested readers to our ISCA 2011 paper for additional findings, insights, and details regarding our methodology.9

We model a 1,024-tile CMP in 15-nm technology with an on-chip voltage of 0.7 V and a die area of 256 mm², excluding peripheral circuitry. At the network level, four-way concentration reduces the number of routers to 256, of which 64 correspond to various shared resources, potentially including memory controllers, fixed-function accelerators, and I/O interfaces.

We evaluated the following NoC organizations:

• Cmesh+PVC: a concentrated mesh topology with PVC-based QoS support.

• MECS: baseline MECS network with no QoS support.

• MECS+PVC: QoS-enabled MECS network with PVC-based QoS support.

• MECS+TAQ: MECS network with the proposed topology-aware QoS architecture. PVC enforces QoS inside four shared regions, and no QoS support exists elsewhere.

• MECS+TAQ+EB: TAQ-enabled network augmented with a pure, elastic-buffer flow-control architecture. Deadlock freedom is ensured through separate request and reply networks. Elastic buffering is deployed only outside shared regions, with conventional buffering and PVC inside SRs.

• K-MECS: the proposed Kilo-NoC configuration, featuring TAQ and hybrid EB/VC flow control with JIT-VC allocation (outside SRs).

Area analysis

Figure 4 breaks down the total network area into four resource types: links, link-integrated EBs, regular routers, and SR routers (TAQ-enabled topologies only). For links, we account for the area of drivers and receivers and anticipate that wires are routed over logic in a dedicated layer.

TAQ proves to be an effective optimization for reducing network area. Compared to a baseline, QoS-enabled MECS network (MECS+PVC), TAQ enables a 16-percent area reduction (MECS+TAQ bar) due to diminished buffer requirements.

Figure 4. Area breakdown of various NoC configurations. K-MECS has the lowest area among organizations with similar or greater bisection bandwidth.

The pure, elastic-buffered NoC further reduces the area footprint by 27 percent (MECS+TAQ+EB), but at the cost of a 56-percent increase in wire requirements precipitated by the need for a second network. K-MECS offers an additional 10-percent area reduction and curtails wire pressure compared to a pure EB architecture by not requiring a second network to guarantee deadlock freedom. To put the optimizations in perspective, the conventionally buffered, QoS-enabled SR routers in K-MECS account for more than one half of the total router area but make up just a quarter of the network nodes.

The smallest network area is found in the Cmesh topology, due to its modest bisection bandwidth. The Cmesh NoC occupies 2.8 times less area than the K-MECS network, but offers 8 times less network bandwidth. Link area represents just 7 percent of the Cmesh network area, while accounting for 21 percent of the richly connected K-MECS network. Among designs with comparable bandwidth, K-MECS represents the most area-efficient configuration.

Energy analysis

Figure 5 shows network-level energy efficiency for three different access patterns: nearest neighbor (1 mesh hop), semilocal (5 mesh hops), and random (10 mesh hops). The nearest-neighbor pattern incurs one link and two router traversals in all topologies. In contrast, 5-hop and 10-hop patterns require three router accesses (which represents the worst case) in low-diameter MECS networks, while requiring six and 11 router crossings, respectively, in Cmesh. We assume that a quarter of all accesses in the multihop patterns are to shared resources, necessitating transfers to and from the shared regions in TAQ-enabled networks.

In general, EB-enabled networks have better energy efficiency than other organizations. K-MECS is the second most efficient design among the evaluated alternatives, reducing NoC energy by 16 to 63 percent on local traffic and by 20 to 40 percent on nonlocal patterns. A pure EB architecture (MECS+TAQ+EB) is 22 percent more efficient than K-MECS on local traffic and 6 to 9 percent better on nonlocal routes, due to a reduction in buffer and switch input power; however, these reductions come at greater area expense and lower throughput as compared to K-MECS.

Links are responsible for a significant fraction of overall energy expense, limiting the benefits of router energy optimizations. For instance, links account for 69 percent of energy expended on random traffic in K-MECS. PVC-enabled routers in the shared regions also diminish the energy efficiency of K-MECS and other TAQ-enabled topologies.

Results summary

Table 1 summarizes the area, power requirements, zero-load latency, and throughput (maximum sustained injection rate before the network saturates) of different topologies in a kilo-terminal network in a 15-nm technology. Power numbers are derived for a 2 GHz clock frequency and random (10-hop) traffic at average network loads of 1 and 10 percent. Latency and throughput values are also for random traffic, with 50 percent of the nodes communicating.

Our proposed topology-aware QoS optimization effectively reduces network area and power consumption without compromising performance.

Figure 5. NoC energy breakdown as a function of the communication distance. Topology awareness improves NoC energy efficiency by reducing QoS overheads and by enabling QoS-friendly elastic buffering.

Compared to a baseline MECS network with PVC support (MECS+PVC), TAQ reduces network area by 16 percent and power consumption by 10 percent (MECS+TAQ). Furthermore, TAQ enables elastic-buffered flow control outside of the shared regions, which further reduces area by 27 percent and power consumption by 25 percent, but degrades throughput by over 17 percent (MECS+TAQ+EB). The throughput reduction is caused by a severe shortage of network buffer capacity, aggravated by the shared nature of MECS links. Finally, K-MECS combines TAQ with the hybrid EB/VC flow-control architecture, which we also propose in this work. The resulting organization restores throughput and improves area efficiency at a small power penalty when compared to a pure elastic-buffered NoC.

Table 1. Area, power, and performance characteristics of various NoC architectures.

Configuration    Area (mm²)   Power @ 1% (W)   Power @ 10% (W)   Zero-load latency (cycles)   Throughput (%)
Cmesh+PVC        6.0          3.8              38.3              36                           9
MECS             23.5         2.9              29.2              20                           29
MECS+PVC         29.9         3.3              32.9              20                           29
MECS+TAQ         25.1         3.0              29.6              20                           29
MECS+TAQ+EB      18.2         2.2              22.2              20                           24
K-MECS           16.5         2.3              23.5              20                           29
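The percentages quoted in this summary (and the 45 percent and 29 percent figures in the overview) can be recomputed directly from Table 1; the short check below uses the area and 10-percent-load power columns:

# Recomputing the quoted reductions from Table 1 (area in mm^2, power at 10 percent load in W).
table = {
    "MECS+PVC":    {"area": 29.9, "power": 32.9, "throughput": 29},
    "MECS+TAQ":    {"area": 25.1, "power": 29.6, "throughput": 29},
    "MECS+TAQ+EB": {"area": 18.2, "power": 22.2, "throughput": 24},
    "K-MECS":      {"area": 16.5, "power": 23.5, "throughput": 29},
}

def pct_drop(before, after):
    return round(100 * (before - after) / before)

print(pct_drop(table["MECS+PVC"]["area"], table["MECS+TAQ"]["area"]))       # 16 (TAQ area)
print(pct_drop(table["MECS+PVC"]["power"], table["MECS+TAQ"]["power"]))     # 10 (TAQ power)
print(pct_drop(table["MECS+TAQ"]["area"], table["MECS+TAQ+EB"]["area"]))    # 27 (EB area)
print(pct_drop(table["MECS+TAQ"]["power"], table["MECS+TAQ+EB"]["power"]))  # 25 (EB power)
print(pct_drop(table["MECS+TAQ"]["throughput"], table["MECS+TAQ+EB"]["throughput"]))  # 17
print(pct_drop(table["MECS+PVC"]["area"], table["K-MECS"]["area"]))         # 45 (K-MECS area)
print(pct_drop(table["MECS+PVC"]["power"], table["K-MECS"]["power"]))       # 29 (K-MECS power)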

An important contribution of our work lies in our topology-aware approach to QoS, which represents a new direction for scalable network QoS architectures. Whereas all prior schemes have focused on minimizing per-router cost and complexity, our research suggests that router optimizations might be secondary to architectural mechanisms that reduce the need for QoS support in the first place.

Our work also points to a growing problem of NoC energy consumption in communication-intensive chips, as evidenced by the data in Table 1 (column "Power @ 10%"). As process scaling pushes the limits of on-die integration, future substrates will either restrict the extent of internode communication to save interconnect power or employ NoC architectures that push the envelope of energy efficiency. While the former approach is clearly undesirable as it limits the benefits of integration, the latter calls for significant innovation in the interconnect space. Specialization and tight integration of NoC components is one promising direction for improving the interconnect fabric's energy efficiency, and this work represents a step in that direction.

Acknowledgments

This research was supported by NSF CISE Infrastructure grant EIA-0303609 and NSF grant CCF-0811056.

References

1. J. Kim, J. Balfour, and W. Dally, "Flattened Butterfly Topology for On-Chip Networks," Proc. 40th Int'l Symp. Microarchitecture (Micro 40), ACM, 2007, pp. 172-182.

2. B. Grot, J. Hestness, S.W. Keckler, and O. Mutlu, "Express Cube Topologies for On-Chip Interconnects," Proc. 15th Int'l Symp. High-Performance Computer Architecture (HPCA 2009), IEEE CS, 2009, pp. 163-174.

3. J.W. Lee, M.C. Ng, and K. Asanovic, "Globally-Synchronized Frames for Guaranteed Quality-of-Service in On-Chip Networks," Proc. 35th Int'l Symp. Computer Architecture (ISCA 2008), IEEE CS, 2008, pp. 89-100.

4. B. Grot, S.W. Keckler, and O. Mutlu, "Preemptive Virtual Clock: A Flexible, Efficient, and Cost-Effective QoS Scheme for Networks-on-Chip," Proc. 42nd Int'l Symp. Microarchitecture (Micro 42), ACM, 2009, pp. 268-279.

5. T. Moscibroda and O. Mutlu, "A Case for Bufferless Routing in On-Chip Networks," Proc. 36th Int'l Symp. Computer Architecture (ISCA 2009), IEEE CS, 2009, pp. 196-207.


6. A.K. Kodi, A. Sarathy, and A. Louri, "iDEAL: Inter-Router Dual-Function Energy and Area-Efficient Links for Network-on-Chip (NoC) Architectures," Proc. 35th Int'l Symp. Computer Architecture (ISCA 2008), ACM, 2008, pp. 241-250.

7. G. Michelogiannakis, J. Balfour, and W. Dally, "Elastic-Buffer Flow Control for On-Chip Networks," Proc. 15th Int'l Symp. High-Performance Computer Architecture (HPCA 2009), IEEE CS, 2009, pp. 151-162.

8. J.D. Balfour and W.J. Dally, "Design Tradeoffs for Tiled CMP On-Chip Networks," Proc. 23rd Int'l Conf. Supercomputing (ICS 2006), ACM, 2006, pp. 187-198.

9. B. Grot, J. Hestness, S.W. Keckler, and O. Mutlu, "Kilo-NOC: A Heterogeneous Network-on-Chip Architecture for Scalability and Service Guarantees," Proc. 38th Int'l Symp. Computer Architecture (ISCA 2011), IEEE CS, 2011, pp. 268-279.

Boris Grot is a postdoctoral researcher at École Polytechnique Fédérale de Lausanne. His research focuses on processor architectures, memory systems, and interconnection networks for high-throughput, energy-aware computing. Grot has a PhD in computer science from the University of Texas at Austin.

Joel Hestness is a PhD student in computer science at the University of Texas at Austin. His research interests include future highly integrated chips, on-chip networks, and communication. Hestness has a BS in computer science and in mathematics from the University of Wisconsin-Madison.

Stephen W. Keckler is the senior director of architecture research at Nvidia and a professor in the Department of Computer Science at the University of Texas at Austin. His research interests include parallel computer architectures, memory systems, and interconnection networks. Keckler has a PhD in computer science from the Massachusetts Institute of Technology.

Onur Mutlu is an assistant professor in the Electrical and Computer Engineering Department at Carnegie Mellon University. His research interests include computer architecture, hardware/software cooperation, and memory and communication systems. Mutlu has a PhD in electrical and computer engineering from the University of Texas at Austin.

Direct questions and comments about this article to Boris Grot, EPFL IC ISIM PARSA, INJ 238 (Bâtiment INJ), Station 14, CH-1015, Lausanne, Switzerland; [email protected].
