
342 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 17, NO. 3, MARCH 2009

Custom Networks-on-Chip Architectures With Multicast Routing

Shan Yan, Student Member, IEEE, and Bill Lin, Senior Member, IEEE

Abstract—In this paper, we consider the problem of synthesizing custom networks-on-chip (NoC) architectures that are optimized for a given application. We consider both unicast and multicast traffic flows in the input specification. Multicast traffic flows are used in a variety of applications, and their direct support with only replication of packets at optimal bifurcation points rather than full end-to-end replication can significantly reduce network contention and resource requirements. Our problem formulation is based on the decomposition of the problem into the inter-related steps of finding good flow partitions, deriving a good physical network topology for each group in the partition, and providing an optimized network implementation for the derived topologies. Our solutions may be comprised of multiple custom networks, each interconnecting a subset of communicating modules. We propose several algorithms that can systematically examine different flow partitions, and we propose Rectilinear–Steiner-Tree (RST)-based algorithms for generating efficient network topologies. Our design flow integrates floorplanning, and our solutions consider deadlock-free routing. Experimental results on a variety of NoC benchmarks showed that our synthesis results can on average achieve a 4.82 times reduction in power consumption over different mesh implementations on unicast benchmarks and a 1.92 times reduction in power consumption on multicast benchmarks. Significant improvements in performance were also achieved, with an average of 2.92 times reduction in hop count on unicast benchmarks and 1.82 times reduction in hop count on multicast benchmarks. To further gauge the effectiveness of our heuristic algorithms, we also implemented an exact algorithm that enumerates all distinct set partitions. For the benchmarks where exact results could be obtained, our algorithms on average can achieve results within 3% of exact results, but with much shorter execution times.

Index Terms—Multicast routing, network-on-chip (NoC), synthesis, system-on-chip (SoC), topology.

I. INTRODUCTION

NETWORKS-ON-CHIP (NoC) architectures have been proposed as a scalable solution to the global communication challenges in nanoscale Systems-on-Chip (SoC) designs [1], [2]. The use of NoCs with standardized interfaces facilitates the reuse of previously designed and third-party-provided modules in new designs (e.g., processor cores). Besides design and verification benefits, NoCs have also been advocated to address increasingly daunting clocking, signal integrity, and wire delay challenges.

Manuscript received March 30, 2007; revised December 11, 2008. First published February 06, 2009; current version published February 19, 2009.

The authors are with the Department of Electrical and Computer Engineering, University of California, San Diego, La Jolla, CA 92093-0407 USA (e-mail: [email protected]; [email protected]).

Digital Object Identifier 10.1109/TVLSI.2008.2011240

NoC architectures can be designed as regular or custom network topologies. Regular topologies, such as mesh or folded-torus networks, have been successfully employed in a number of tile-based chip-multiprocessor projects, e.g., [19], [20], which are appropriate because of processor homogeneity and application traffic variability. On the other hand, for custom SoC applications, the design challenges are different in terms of varied module sizes, irregularly spread module locations, and different communication data rate requirements. Therefore, a custom network architecture optimized to the needs of the application is more appropriate. This synthesis problem is the focus of this paper.

The NoC synthesis problem is challenging for a number of reasons. First, for a large complex SoC design, an optimal solution will likely involve multiple networks since each module will likely communicate only with a small subset of modules. Therefore, a single network that spans all nodes is often unnecessary. Part of the problem is to partition the set of specified communication flows into subsets and derive a separate optimal physical topology for each subset. In general, flows may be grouped together even though they do not share common sources or destinations because they may be able to beneficially share common intermediate network resources. Second, besides deciding on the set partition, our synthesis problem must also decide on the physical network topology of each group in the set partition. Finally, depending on the optimization goals and the implementation backend, the appropriate cost function may be quite complex. In particular, we consider a power minimization problem that considers both leakage power and dynamic switching power. It is well known that leakage power is becoming increasingly dominant [12], [14]. Therefore, it is important to properly account for leakage power when adding routers and network links to the synthesized architecture. Other optimization goals may include minimizing hop counts along with power minimization.

In this paper, we consider the problem of synthesizing custom NoC architectures that support both unicast and multicast traffic flows. Multicast traffic flows are used in a variety of applications, and their direct support with only replication of packets at optimal bifurcation points rather than full end-to-end replication can significantly reduce network contention and resource requirements. Our problem formulation is based on the decomposition of the problem into the inter-related steps of finding good flow partitions, deriving a good physical network topology for each group in the partition, and providing an optimized network implementation for the derived topologies. Our solutions may be comprised of multiple custom networks, each interconnecting a subset of communicating modules. In particular, we propose four algorithms for systematically examining different set partitions of communication flows. The first two are heuristic algorithms called CLUSTER and DECOMPOSE, and the last two are perturbation-based probabilistic algorithms called perturbation-based flow partitioning (PERTURB) and reduced perturbation-based flow partitioning (R-PERTURB).

For each set partition considered, we use well-developed Rectilinear–Steiner-Tree (RST) algorithms [22]–[24] to generate a physical network topology for each group in the set partition. Though the RST problem is in itself NP-hard, well-developed fast RST algorithms are available that can be effectively used, as indicated by the run-times presented in Section IX. For each RST derived, the routes for the corresponding flows and the bandwidth requirements for the corresponding network links are determined. Our design flow integrates floorplanning and supports a decoupling of the evaluation cost function from the exploration process. The latter enables the flexible incorporation of different design objectives and constraints. Although we use Steiner trees to generate a physical network topology for each group in the set partition, the final NoC architecture synthesized is not necessarily limited to just trees, as RST implementations of different groups may be connected to each other to form non-tree structures. We also describe several ways to ensure deadlock-free routing.

For the rest of this paper, Section II outlines related work. Section III presents our design flow, which incorporates floorplanning. Section IV presents the problem description and our formulation. Sections V and VI describe the CLUSTER and DECOMPOSE algorithms, respectively. Section VII describes PERTURB and R-PERTURB. Section VIII addresses deadlock considerations. Finally, experimental results and the conclusion are presented in Sections IX and X, respectively.

II. RELATED WORK

The NoC design problem has received considerable attention in the literature. Towles and Dally [1] and Benini and De Micheli [2] motivated the NoC paradigm. Several existing NoC solutions have addressed the mapping problem to a regular mesh-based NoC architecture [3], [4]. Hu and Marculescu [3] proposed a branch-and-bound algorithm for the mapping of computation cores onto mesh-based NoC architectures. Murali et al. [4] described a fast algorithm for mesh-based NoC architectures that considers different routing functions, delay constraints, and bandwidth requirements.

On the problem of designing custom NoC architectures without assuming an existing network architecture, a number of techniques have been proposed [5]–[10]. Pinto et al. [7] presented techniques for the constraint-driven communication architecture synthesis of point-to-point links by using heuristic-based $k$-way merging. Their technique is limited to topologies with specific structures that have only two routers between each source and sink pair. Ogras et al. [5], [6] proposed graph decomposition and long link insertion techniques for application-specific NoC architectures. Srinivasan et al. [8], [9] presented NoC synthesis algorithms that consider system-level floorplanning, but their approach only considered solutions based on a slicing floorplan where router locations are restricted to corners of cores and links run around cores. Murali et al. [10] presented an innovative deadlock-free NoC synthesis flow with detailed backend integration that also considers the floorplanning process. The proposed approach is based on the min-cut partitioning of cores to routers.

Fig. 1. Design flow.

Fig. 2. Modern SoC designs combine hard and soft modules, packet-based communications and conventional wiring-based interconnections [32].

This paper presents a synthesis approach based on a set partitioning formulation that considers multicast traffic, which to the best of our knowledge has not been considered in previous custom NoC synthesis formulations. Our approach considers deadlock-free routing with multicast traffic as well as floorplanning in the design flow. Our approach represents a different way of formulating the custom NoC synthesis problem. Given that custom NoC synthesis is still a relatively new problem, we believe our work provides an interesting direction in this research area.

III. DESIGN FLOW

Our NoC synthesis design flow is depicted in Fig. 1. The major elements in the design flow are elaborated as follows.

1) Input Specification: The input specification to our design flow consists of a list of modules and their communications. As observed in recent trends, many modern SoC designs combine both hard and soft modules as well as both packet-based network communications and conventional wiring [32], as shown in Fig. 2. Modules can correspond to a variety of different types of intellectual property (IP) cores such as embedded microprocessors, large embedded memories, digital signal processors, graphics and multimedia processors, and security encryption engines, as well as custom hardware modules. These modules can come in a variety of sizes and can be either hard or soft macros, possibly as just black boxes with area and power estimates and constraints on aspect ratios.



To facilitate modularity and interoperability of IP cores, packet-based communication with standard network interfaces is rapidly gaining adoption. Custom NoC architectures, as addressed in this paper, are being advocated as a scalable solution to packet-based communication. For network-based communications, traffic flows with required data rates between modules are specified as part of the input specification. For our synthesis problem, we consider both unicast and multicast traffic flows. As discussed in [34] and [36], multicast traffic flows are used in a variety of applications, and their direct support with only replication of flits at optimal bifurcation points rather than full end-to-end replication can significantly reduce network contention and resource requirements.

In general, a mixture of network-based communications and conventional wiring may be utilized as appropriate, and not all inter-module communications are necessarily over the on-chip network. For example, an embedded microprocessor may have dedicated connections to its instruction and data cache modules. Our design flow and input specification allow for both interconnection models.

2) Floorplanning: The floorplanning problem has been extensively studied with many mature solutions (e.g., [25]–[27]). These floorplanners can readily handle both hard and soft modules as well as a variety of floorplanning constraints. In general, floorplanning solutions are not restricted to a slicing structure, and existing methods often allow for non-slicing floorplans to achieve more efficient solutions (see footnote 1). The only requirement is that the modules are non-overlapping.

Like the NoC design flow proposed by Murali et al. [10], we have also adopted the open source floorplanner Parquet [27]. However, in the design flow proposed in [10], floorplanning is performed after each NoC design has been produced to evaluate detailed interconnect delays. NoC designs with different numbers of routers are considered. On the other hand, in our design flow, an initial floorplanning step is performed before NoC synthesis to obtain a placement of modules. This is important because the floorplanning of modules is often influenced by non-network-based interconnections, and the floorplan locations of modules can have a significant influence on the NoC architecture.

In particular, the input to a floorplanner like Parquet is a module netlist with weighted hyperedges. However, prior to NoC synthesis, the insertion of routers and network links has not yet been decided. To enable the use of existing floorplanners in our design flow prior to NoC synthesis, we model traffic flows by using edge weights that are in proportion to the rates of traffic flows among the modules. These weighted edges are added alongside the weighted edges that represent wire connections among non-network-based modules. Since conventional floorplanners [25]–[27] typically support a hyperedge-based input graph model, multicast-like connectivity can be readily modeled.

With the module locations available from the initial floorplanning step, NoC synthesis can better account for wiring delays and power consumption during the exploration of NoC architectures, including accounting for repeaters [15] on links where needed. After NoC synthesis, actual routers and links in the synthesized NoC architecture can be fed back to the floorplanner to update the floorplan, and the refined floorplanning information can be used to obtain more accurate power and area estimates. NoC synthesis can also be re-invoked with the refined floorplan. As shown experimentally in Section IX, our NoC synthesis algorithms are fast, making it feasible to iterate NoC synthesis with floorplanning.

Footnote 1: A slicing floorplan is a floorplan that can be obtained by recursively cutting an enclosing rectangle by either a vertical line or a horizontal line, which restricts the possible floorplans. On the other hand, a floorplan that is not slicing is called a non-slicing floorplan [25], [26].

Fig. 3. Illustration of the NoC synthesis problem. (a) Example. (b) Floorplan. (c) CDG. (d) One architecture. (e) Alternative architecture.

We would like to stress that the focus of this work is on the NoC synthesis problem. We readily admit that extending floorplanning to better consider network-based communications is a complex problem and is a subject of separate research. Recent work has explored the development of system floorplanners that are geared towards the context of NoC synthesis [8]. Such NoC-centric floorplanners may be used in our design flow as well.

3) NoC Synthesis: Given floorplanning information, the NoC synthesis step then proceeds to synthesize an NoC architecture that is optimized for the given specification and floorplan. Consider Fig. 3(a), which depicts a small illustrative example. Fig. 3(a) only shows the portion of the input specification that corresponds to the network-attached modules and their traffic flows. The nodes represent modules, edges represent traffic flows, and edge labels represent the data rate requirements for the corresponding flows. Multicast traffic flows are represented with directed hyperedges, which are shown graphically in Fig. 3(a) as a bundle of directed edges in a shaded region. For example, the bundle in Fig. 3(a) is a multicast flow from one source to two destinations. This graph representation is called a communication demand graph and is discussed in more detail in Section IV.

An example floorplan is shown in Fig. 3(b). As noted earlier, modules in a design do not necessarily have to be attached to the on-chip network. Modules can also be connected by conventional wiring, as shown in the unlabeled rectangles in Fig. 3(b). The communication demand graph with the floorplan positions annotated is illustrated in Fig. 3(c), and Fig. 3(d) and (e) show two example network topologies.

4) NoC Power and Area Estimation: To evaluate the power and area of the synthesized NoC architecture, we use a state-of-the-art NoC power-performance simulator called Orion [13], [14] that can provide detailed power characteristics for different power components of a router for different input/output port configurations. It accurately considers leakage power as well as dynamic switching power, which is important since it is well known that leakage power is becoming increasingly dominant [12], [14]. Orion also provides area estimates based on a state-of-the-art router microarchitecture [16], [17].

To accurately evaluate wire configurations for the network links in the synthesized architecture, we use a state-of-the-art repeated on-chip interconnect model [15] to accurately determine link power consumption, including any repeaters needed. Since floorplanning is performed in advance of NoC synthesis, wirelengths can be considered in the link power estimation.

5) NoC Objective and Constraints: An important feature of our NoC synthesis algorithms, which will be detailed later in this paper, is that they allow the decoupling of the evaluation cost function from the exploration process. For example, we can use the power costs determined by Orion as the evaluation cost function where the leakage power grows nonlinearly with respect to the different input/output port configurations. Another design objective is the minimization of hop counts for data routing. Since the evaluation cost function is decoupled from the exploration process, our NoC synthesis algorithms can be used to optimize for different user-defined objectives, including constraints on power consumption or hop counts or some combinations of power and hop count objectives.
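The decoupling described above can be illustrated with a minimal Python sketch in which the exploration step only receives a cost callable. Everything named here (`make_cost`, `explore`, the `alpha`/`beta` weights, the candidate records and their numbers) is a hypothetical illustration, not the interface or data of the paper's tool or of Orion.

```python
def make_cost(power_of, hops_of, alpha=1.0, beta=0.0, max_hops=None):
    """Build an evaluation function from user-chosen objectives.

    power_of and hops_of are caller-supplied estimators (hypothetical here);
    alpha and beta weight the two objectives; max_hops is an optional
    hard constraint on the hop count.
    """
    def cost(arch):
        hops = hops_of(arch)
        if max_hops is not None and hops > max_hops:
            return float("inf")  # constraint violated: never selected
        return alpha * power_of(arch) + beta * hops
    return cost


def explore(candidates, cost):
    """The exploration step only ever sees `cost`, never its internals."""
    return min(candidates, key=cost)


# Hypothetical candidate architectures with made-up numbers.
candidates = [
    {"name": "shared", "power_mw": 40.0, "avg_hops": 3.0},
    {"name": "split", "power_mw": 55.0, "avg_hops": 1.5},
]
power = lambda a: a["power_mw"]
hops = lambda a: a["avg_hops"]

lowest_power = explore(candidates, make_cost(power, hops))
hop_limited = explore(candidates, make_cost(power, hops, max_hops=2.0))
```

Swapping the objective changes only the argument passed to `explore`, which is the property the text attributes to the synthesis algorithms.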

6) NoC Design Parameters: In addition to user-defined objectives and constraints, NoC design parameters such as the operating voltage, target clock frequency, and link widths are provided to the NoC synthesis step as well. Operating voltage and clock frequency parameters are usually dictated by the design, and link widths are often dictated by IP interface standards. However, if the design allows for different voltages or clock frequencies, or if the IP modules allow for different link widths, then NoC synthesis can be invoked to synthesize solutions for a range of design parameters specified by the user. As noted earlier, our NoC synthesis algorithms are fast, allowing for iterations with different design parameters.

7) Detailed Design: Finally, the synthesized NoC architecture with the rest of the design specification can be fed to a detailed RTL design flow where design tools like RTL optimization and detailed place and route are well established.

IV. PROBLEM DESCRIPTION AND FORMULATION

A. Problem Description

The input to our NoC synthesis problem is a communication demand graph (CDG), which is an annotated directed hypergraph $G = (V, E)$, where each node $v_i \in V$ corresponds to a module, and each directed hyperedge $e_k = (s_k, D_k) \in E$ represents a traffic flow from a source $s_k \in V$ to one or more destinations $D_k \subseteq V$. That is, $e_k$ is a hyperedge that connects the node $s_k$ to the group of nodes in $D_k$, with $|D_k| \ge 1$. A conventional edge with a single source and a single destination is a special case of a hyperedge. After floorplanning, the coordinates of the modules are known, and the position of each node $v_i$ is given by $(x_i, y_i)$. The data rate requirement for each communication flow $e_k$ is given by $\lambda(e_k)$. In general, traffic flows can be either unicast or multicast flows. Multicast flows are flows with $|D_k| > 1$. For example, Fig. 3(c) contains a multicast flow from one source to two destinations.
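As a minimal sketch of the input just described, a CDG can be held as a list of flows (one source, a set of destinations, a data rate) together with a floorplan position per module. The class and field names (`Flow`, `CDG`, `rate`) and the example modules and numbers are illustrative assumptions; they are not the flows of Fig. 3.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Flow:
    """One directed hyperedge: a source module, its destinations, a data rate."""
    source: str
    dests: frozenset
    rate: float  # required data rate; units are left to the caller

    @property
    def is_multicast(self) -> bool:
        # A flow is multicast when it has more than one destination.
        return len(self.dests) > 1


@dataclass
class CDG:
    """Communication demand graph annotated with floorplan positions."""
    positions: dict  # module name -> (x, y) after floorplanning
    flows: list = field(default_factory=list)

    def add_flow(self, source, dests, rate):
        dests = frozenset(dests)
        if not dests or source in dests:
            raise ValueError("a flow needs at least one destination other than its source")
        missing = ({source} | dests) - set(self.positions)
        if missing:
            raise ValueError(f"modules without a position: {sorted(missing)}")
        flow = Flow(source, dests, rate)
        self.flows.append(flow)
        return flow


# Hypothetical three-module example.
cdg = CDG(positions={"a": (0, 0), "b": (4, 0), "c": (4, 3)})
unicast = cdg.add_flow("a", ["b"], rate=100.0)
multicast = cdg.add_flow("a", ["b", "c"], rate=50.0)
```

A unicast flow is simply the special case with a single destination, so one record type covers both kinds of traffic.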

Based on the optimization goals and cost functions specified by the user, the output of our NoC architecture synthesis problem is an optimized custom network topology with predetermined routes for the specified traffic flows on the network such that the data rate requirements are satisfied. For example, Fig. 3(d) and (e) show two different topologies for the CDG shown in Fig. 3(c).

Fig. 3(d) shows a network topology where all flows share a common network. In this topology, the predetermined route for the multicast flow travels from its source through two intermediate nodes to first reach one destination, and then it bifurcates there to reach the other destination. Fig. 3(e) shows an alternative topology comprising two separate networks. In this topology, the multicast flow is simply transferred over the network link from its source to the first destination, and then it bifurcates there to reach the second destination. Observe that in both cases, the amount of network resources consumed by the routing of multicast traffic is less than what would be required if the traffic were sent to each destination as a separate unicast flow.

B. Problem Formulation

In general, the solution space of possible application-specific network architectures is quite large. Depending on the communication demand requirements of the specific application under consideration, the best network architecture may indeed be comprised of multiple networks. However, the decision on the number of networks and the partitioning of traffic flows among them is not a simple question of partitioning traffic flows into groups where the corresponding sets of network interfaces are disjoint. Consider again the example depicted in Fig. 3(c). Two of the traffic flows in this example have disjoint network interfaces. However, it may still be cost efficient to group them together on the same network, as depicted in Fig. 3(e), because both flows travel a common distance in the horizontal direction, and they are able to share the cost of the network link that spans the common horizontal distance, including the cost of repeaters, if any are needed.

To address the NoC synthesis problem, we have formulated the problem as a combination of flow partitioning, Steiner-tree-based topology construction, and implementation optimization. This is depicted in Fig. 4. Each element of the synthesis formulation is elaborated in the following.

1) Flow Partitioning: Flow partitioning is performed in the outer loop of our synthesis formulation to explore different partitionings of flows into separate subnetworks. In general, the number of distinct set partitions of $n$ flows, commonly known as the $n$th Bell number $B_n$, grows rapidly with $n$ [21]; e.g., $B_{10}$, $B_{11}$, and $B_{12}$ are equal to 115 975, 678 570, and 4 213 597, respectively. The goal of the heuristic algorithms CLUSTER and DECOMPOSE presented in Sections V and VI is to significantly reduce the number of flow partitions that we consider in a systematic manner. In Section VII, we also present two simulated annealing-based flow partitioning algorithms called PERTURB and R-PERTURB.

Fig. 4. Formulation of synthesis problem.
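To make the growth of the search space concrete, the following Python sketch computes Bell numbers with the standard Bell-triangle recurrence and reproduces the three values quoted above. The function name `bell_numbers` is an illustrative choice.

```python
def bell_numbers(n_max):
    """Return [B_0, ..., B_{n_max}] using the Bell triangle.

    Each new row starts with the last entry of the previous row, and every
    following entry is the sum of its left neighbour and the entry above
    that neighbour. The first entry of row n is the Bell number B_n.
    """
    bells = [1]
    row = [1]
    for _ in range(n_max):
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
        bells.append(row[0])
    return bells


bells = bell_numbers(12)
print(bells[10], bells[11], bells[12])
```

Even at a dozen flows, exhaustive enumeration already means evaluating millions of partitions, each requiring topology construction, which is the motivation for the heuristics that follow.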

2) Steiner Tree-Based Topology Construction: For each flow partition considered, physical network topologies must be decided for carrying the traffic flows. In current process technologies, layout rules for implementing wires dictate physical topologies where the network links run horizontally or vertically. Thus, the problem is similar to the Rectilinear Steiner Tree (RST) problem that has been extensively studied for the conventional VLSI routing problem. Given a set of nodes, the RST problem is to find a network with the shortest total edge length using horizontal and vertical edges such that all nodes are interconnected. The RST problem is well studied, with very fast implementations available [22], [23]. Fig. 3(d) and (e) show possible RSTs for two different set partitions: one that places all flows in a single group and one that splits them into two groups, respectively. We use an RST solver in the inner loop of flow partitioning to generate topologies for the set partitions considered. RSTs are only reevaluated for the groups in the set partition that changed at each step, and previously computed RSTs can be cached.
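The inner-loop role of the topology solver, including the caching of previously computed results, can be sketched as follows. Note the swap: this sketch does not implement an RST algorithm from [22]–[24]; it uses Prim's minimum spanning tree under Manhattan distance as a simple stand-in, whose length is an upper bound on the true RST length. The function names are assumptions made for illustration.

```python
from functools import lru_cache


def manhattan(p, q):
    """Rectilinear distance between two (x, y) points."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


@lru_cache(maxsize=None)
def rectilinear_mst(terminals):
    """Prim's algorithm over a frozenset of (x, y) terminals.

    Returns (total_length, edges). Stand-in for a real RST solver: no
    Steiner points are added, so the length is only an upper bound.
    Results are cached per terminal set, so unchanged groups are not
    recomputed when the surrounding partition changes.
    """
    points = sorted(terminals)
    if len(points) < 2:
        return 0, ()
    in_tree = {points[0]}
    edges = []
    total = 0
    while len(in_tree) < len(points):
        # Cheapest edge leaving the current tree.
        length, u, v = min(
            (manhattan(u, v), u, v)
            for u in in_tree
            for v in points
            if v not in in_tree
        )
        in_tree.add(v)
        edges.append((u, v))
        total += length
    return total, tuple(edges)


group = frozenset({(0, 0), (4, 0), (4, 3)})
length, edges = rectilinear_mst(group)
rectilinear_mst(group)  # second call is served from the cache
```

Keying the cache on the frozen set of terminals means two partitions that contain the same group share one solver call.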

Since most existing Steiner tree algorithms are based on a hypergraph model, they match closely to our topology construction problem for multicast traffic flows. In addition, emerging CMOS technologies are providing an increasing number of metal layers for implementing network links, facilitating “over-the-module” routing of network links. At 65 nm, as many as 11 copper metal layers have been demonstrated [33]. However, the use of some hard IP macros may prohibit routing over them. Fortunately, our Steiner-tree-based topology construction formulation allows for the use of available obstacle-avoiding RST algorithms [24] to accommodate this type of constraint.

After a physical topology is generated for each group in a set partition, the routes for the corresponding flows and the bandwidth requirements for the corresponding network links can be readily derived. The routes for the flows follow directly from the tree structure of the RST solution. For example, in Fig. 3(d), the route for a unicast flow travels from its source through two intermediate nodes to reach its destination. For the multicast flow, a flit is first sent from the source through two intermediate nodes to the first destination. The flit is then copied and forwarded to the second destination over the network link between the two destinations. Correspondingly, the cumulative bandwidth requirements for each network link can also be readily derived from the RST solution, as shown for example in Fig. 3(d) and (e).
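The derivation of routes and cumulative link bandwidths from a tree can be sketched in Python as below. The sketch assumes the topology is given as an undirected adjacency map and each flow as a `(source, destinations, rate)` triple; a multicast flow loads each directed link on the union of its source-to-destination paths exactly once, which models replication only at bifurcation points. The node names and rates are hypothetical.

```python
from collections import defaultdict, deque


def parents_from(tree, root):
    """Breadth-first search over an undirected tree; returns child -> parent."""
    parent = {root: None}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for nxt in tree[node]:
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return parent


def link_bandwidths(tree, flows):
    """Sum the data rates carried by each directed link of the tree."""
    load = defaultdict(float)
    for source, dests, rate in flows:
        parent = parents_from(tree, source)
        used = set()
        for dest in dests:
            node = dest
            while parent[node] is not None:  # walk back towards the source
                used.add((parent[node], node))
                node = parent[node]
        for link in used:  # each link is charged once per flow
            load[link] += rate
    return dict(load)


# Hypothetical path-shaped tree: s - r1 - r2 - d1 - d2.
tree = {
    "s": {"r1"},
    "r1": {"s", "r2"},
    "r2": {"r1", "d1"},
    "d1": {"r2", "d2"},
    "d2": {"d1"},
}
flows = [("s", {"d1"}, 100.0), ("s", {"d1", "d2"}, 50.0)]
loads = link_bandwidths(tree, flows)
```

Only links in the direction of traffic appear in the result, which matches the later observation that links need not be implemented in both directions.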

3) Implementation Optimization: Given the RST-based topologies generated, network implementations are next derived. In particular, a separate network implementation is initially generated for each RST. Routers are allocated at junctions where either flows from multiple links must be multiplexed to the same outgoing link or flows from the same link must be demultiplexed to multiple outgoing links, and network links are allocated to connect routers and network interfaces. To segment long links, repeaters are inserted as required [15]. Although the RST problem generates a graph with undirected edges, network links and router ports are only allocated in the direction of traffic flows. In general, it may be unnecessary to implement links in both directions or with symmetric bandwidths. Fig. 3(d) and (e) illustrate examples where the implementations of RSTs only have links in one direction in most cases. After an initial network implementation is generated for each RST, the separate networks are combined together to form a complete NoC architecture.

To improve upon the initial NoC architecture generated, a greedy router merging procedure is performed to further optimize the implementation and reduce cost. Since for each set partition, routers are allocated at Steiner points or terminal points of the RST generated, routers that connect with each other can be merged to eliminate router ports and thus possibly eliminate the corresponding costs. Routers that connect to the same ports can also be merged to reduce ports and costs. We propose a greedy router merging algorithm, which works iteratively by considering all possible mergings of two routers connected with each other in each iteration. For each candidate merging, the cost difference between the resulting topology after merging and the one before merging is calculated. The candidates are then sorted in increasing order of the cost difference. In the merging step, for each candidate merging from the sorted list, routers are merged if they have not been merged yet and the cost improves. After all routers have been considered in the current iteration, the router set is updated by replacing the merged routers with the newly generated ones. The newly merged routers are reconsidered in the next iteration to form potentially larger routers. The algorithm keeps merging routers until no further improvement can be made.
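The iterative merging procedure can be sketched as follows, under a deliberately simplified cost model: each router's cost depends only on its port count through a caller-supplied `port_cost` function, and link costs are ignored. The function name `greedy_merge`, the generated names `m0`, `m1`, ... (assumed not to clash with existing node names), and the quadratic cost used in the example are all assumptions for illustration.

```python
def greedy_merge(adj, routers, port_cost):
    """Greedily merge pairs of directly connected routers while cost drops.

    adj: node -> set of neighbours (symmetric); routers: set of router names.
    port_cost(k): cost of a router with k ports (hypothetical cost model).
    Returns (adjacency, routers) after merging.
    """
    adj = {n: set(nb) for n, nb in adj.items()}
    routers = set(routers)
    fresh = 0
    while True:
        # Collect every pair of connected routers with its cost difference.
        candidates = []
        for u in routers:
            for v in adj[u]:
                if v in routers and u < v:  # count each pair once
                    ports = (adj[u] | adj[v]) - {u, v}
                    delta = (port_cost(len(ports))
                             - port_cost(len(adj[u])) - port_cost(len(adj[v])))
                    candidates.append((delta, u, v))
        candidates.sort()  # increasing cost difference
        touched, merged_any = set(), False
        for delta, u, v in candidates:
            if delta >= 0 or u in touched or v in touched:
                continue
            ports = (adj[u] | adj[v]) - {u, v}
            for old in (u, v):  # detach the two old routers
                for nb in adj.pop(old):
                    if nb in adj:
                        adj[nb].discard(old)
                routers.discard(old)
            new = f"m{fresh}"
            fresh += 1
            adj[new] = set(ports)
            for nb in ports:
                adj[nb].add(new)
            routers.add(new)
            touched |= {u, v}
            merged_any = True
        if not merged_any:
            return adj, routers


# Hypothetical example: two routers joined by a link, three network interfaces.
adj = {
    "r1": {"ni_a", "ni_c", "r2"},
    "r2": {"r1", "ni_b"},
    "ni_a": {"r1"}, "ni_b": {"r2"}, "ni_c": {"r1"},
}
new_adj, new_routers = greedy_merge(adj, {"r1", "r2"}, lambda k: k * k)
```

With a port cost that grows nonlinearly, merging stops on its own once the merged router would become more expensive than the two it replaces.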

V. CLUSTER

In this section, we present an algorithm called CLUSTER that reduces the number of set partitions considered from $B_n$ to $O(n^3)$, where $n$ is the number of flows, which is a significantly smaller subset of set partitions. The details of the algorithm are shown in Algorithm 1. The CLUSTER algorithm takes a communication demand graph and an evaluation cost function as inputs and generates an optimized network architecture, with its implementation details, as output. As discussed in Section III, our synthesis formulation provides a decoupling of the evaluation cost function from the exploration process. In particular, the CLUSTER algorithm starts by implementing each edge in the communication demand graph separately. The solution for each single edge is a simple RST connecting two terminals. It sets these single edges as the initial set partition, denoted as $P_0$, as shown in lines 2–5.



Then, in lines 7–18, at each iteration, the algorithm systematically generates new candidate set partitions starting from the set partition chosen from the previous iteration. In particular, in the first iteration, the algorithm starts with the initial set partition $P_0$ with $n$ groups, each group containing exactly one edge. Then, in lines 8–13, the algorithm generates new candidate set partitions from $P_0$ by considering all pairwise mergings of groups in $P_0$. Each candidate merges two groups, $g_a$ and $g_b$. For each pairwise merging considered, an RST solver is called to generate a physical network topology for the merged set of flows, and the cost of this network is calculated using the specified evaluation function $f$. We do not need to solve an RST problem for the entire set of flows; only the subset of flows in the merged groups is considered. We then, in line 12, compute the total cost of the resulting set partition by adding the costs of implementing all the other groups using their own networks. In lines 15–16, we select the merging that achieves the best cost in this iteration and choose it as $P_1$. In general, at iteration $k$ we start from the set partition $P_{k-1}$ chosen in iteration $k-1$ to generate pairwise mergings of groups from $P_{k-1}$, and the best merging is selected as the new chosen set partition $P_k$. At each iteration, the number of groups that need to be considered is reduced by 1, but the groups become increasingly large. Finally, in the last iteration, we only need to consider the merging of two groups.

At each iteration in lines 7–18, we maintain the chosen set partition and the associated cost calculations for that iteration. Then, at the end of the algorithm, in lines 23–24, we choose the set partition with the minimum cost. Since at each iteration $i$ there can be at most $n(n-1)/2$ possible pairwise group mergings, where $n$ is the number of flows, and there are $n-1$ iterations, the number of set partitions considered in the CLUSTER algorithm is $O(n^3)$.
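The cubic bound can be checked with a short count. Writing $n$ for the number of flows and assuming that iteration $i$ starts with exactly $n-i+1$ groups (one group fewer after every merge), the number of pairwise mergings examined over all $n-1$ iterations is:

```latex
\sum_{i=1}^{n-1} \binom{n-i+1}{2}
  \;=\; \sum_{k=2}^{n} \binom{k}{2}
  \;=\; \binom{n+1}{3}
  \;=\; \frac{(n+1)\,n\,(n-1)}{6}
  \;=\; O(n^3).
```

The middle step substitutes $k = n-i+1$, and the closed form follows from the hockey-stick identity. For $n = 12$ flows this gives 286 candidate partitions.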

Algorithm 1 CLUSTER

Input: communication demand graph; specified evaluation function for implementation cost
Output: synthesized network architecture

 1: initialize …
 2: for all …
 3:     …
 4:     …
 5: end for
 6: …
 7: while … do
 8:     for all … do
 9:         …
10:         …
11:         …
12:         …
13:     end for
14:     …
15:     …
16:     …
17:     …
18: end while
19: for all … do
20:     …
21:     …
22: end for
23: …
24: …
25: return …
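The following Python sketch follows the prose description of CLUSTER: start with one group per flow, repeatedly apply the best pairwise merging, remember the partition chosen at each iteration, and return the cheapest one at the end. It is not a transcription of Algorithm 1. The `cost` callable stands in for "RST solver plus evaluation function", flows are assumed to be distinct hashable objects, and the toy cost model and flow tuples are made up for illustration.

```python
from itertools import combinations


def cluster(flows, cost):
    """Greedy agglomerative search over set partitions of `flows`.

    cost(group) -> cost of one network carrying the flows in `group`
    (a frozenset). Returns (best_partition, best_cost).
    """
    partition = [frozenset([f]) for f in flows]  # one group per flow
    total = sum(cost(g) for g in partition)
    best_partition, best_cost = list(partition), total
    while len(partition) > 1:
        candidates = []
        for a, b in combinations(partition, 2):  # all pairwise mergings
            merged = a | b
            # Only the merged group needs a new topology; others are reused.
            new_total = total - cost(a) - cost(b) + cost(merged)
            candidates.append((new_total, a, b, merged))
        new_total, a, b, merged = min(candidates, key=lambda c: c[0])
        partition = [g for g in partition if g not in (a, b)] + [merged]
        total = new_total
        if total < best_cost:  # keep the cheapest partition seen so far
            best_partition, best_cost = list(partition), total
    return best_partition, best_cost


def toy_cost(group):
    """Made-up cost: fixed overhead + vertical span penalty + one unit per flow."""
    rows = [row for _, row in group]
    return 10 + 4 * (max(rows) - min(rows)) + len(group)


# Hypothetical flows tagged with the row they run along.
f1, f2, f3 = ("f1", 0), ("f2", 0), ("f3", 5)
best_partition, best_cost = cluster([f1, f2, f3], toy_cost)
```

In this example the two flows on the same row end up sharing a network while the distant third flow keeps its own. In practice `cost` would be memoized so that a group evaluated in one iteration is not re-solved in the next.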

VI. DECOMPOSE

The DECOMPOSE algorithm described in this section reduces the number of set partitions considered from $B_n$ to $O(n^2)$, where $n$ is the number of flows. The details of the algorithm are shown in Algorithm 2. This algorithm works in the opposite direction to CLUSTER when generating candidate set partitions and the corresponding RST topologies. It starts by considering all communication demands as a single cluster. In each iteration, the algorithm considers different ways of breaking up an existing group in the set partition chosen from the previous iteration into two smaller ones. Then, the differential cost of splitting a group is evaluated by generating an RST for each subgroup and evaluating their costs using the specified evaluation function. To facilitate this decomposition process, two important graphs are used in DECOMPOSE: the Affinity Graph (AG) and its Minimum Spanning Tree (MST). The affinity graph $A$ is built by associating each flow in the communication demand graph with a vertex in the affinity graph. An edge is added between each pair of vertices in the affinity graph to form a complete graph. A weight is attached to each edge $(u_i, u_j)$; it is calculated from the costs $f(e_i)$, $f(e_j)$, and $f(e_i, e_j)$, where $e_i$ is the flow in the communication demand graph associated with vertex $u_i$ in the affinity graph. $f(e_i)$ is calculated by calling the evaluation function to evaluate the cost of implementing $e_i$ separately, and $f(e_i, e_j)$ is calculated by evaluating the cost of implementing a generated RST topology for $e_i$ and $e_j$ together. The weights of the edges in the affinity graph reflect the benefits of implementing the flows represented by vertices in the affinity graph together using shared resources. The affinity graph used here is based on the similarity graph model proposed in [7], except that RSTs are used to determine the affinities. In particular, in the affinity graph, the smaller the weight, the lower the resulting total cost of clustering the two flows connected by that edge. The motivation is to only cluster flows that are connected by small-weight edges so that the total implementation cost is minimized. Then the minimum spanning tree $T$ of $A$, which contains the minimum number of minimal-weight edges connecting all the vertices in $A$, is derived. The affinity graph and its MST for the example in Fig. 3 are shown in Fig. 5. The cost considered is the total power consumption based on the 70-nm technology power estimations shown in Table I.

Recall that the vertices in the spanning tree correspond to flows in the communication demand graph. Since the MST is initially a spanning tree, it interconnects all vertices, which is interpreted as having all flows in a single cluster. During the course of the DECOMPOSE algorithm, we selectively remove edges from the MST to create disjoint sets of vertices, which correspond to disjoint groups of flows, thus forming a particular set partition.
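The affinity graph and its minimum spanning tree can be sketched as below. This is only an illustration under stated assumptions: the edge weight is taken to be the joint cost minus the two separate costs (one plausible form consistent with "smaller weight means cheaper to cluster"; the exact weight formula is not reproduced here), `group_cost` is a hypothetical stand-in for the RST solver plus evaluation function, and Kruskal's algorithm is used for the MST.

```python
from itertools import combinations


def affinity_mst(flows, group_cost):
    """Build a complete affinity graph over `flows` and return its MST edges.

    Assumption: weight(a, b) = cost({a, b}) - cost({a}) - cost({b}), so a
    smaller weight means a larger benefit from sharing resources.
    """
    single = {f: group_cost(frozenset([f])) for f in flows}
    edges = sorted(
        (group_cost(frozenset([a, b])) - single[a] - single[b], a, b)
        for a, b in combinations(flows, 2)
    )
    parent = {f: f for f in flows}

    def find(x):
        # union-find root lookup with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    mst = []
    for weight, a, b in edges:      # Kruskal: lightest edge joining two components
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b
            mst.append((a, b, weight))
    return mst


def toy_cost(group):
    # Toy assumption: fixed overhead plus the spread of the flows on a line.
    return 3 + max(group) - min(group)


mst = affinity_mst([0, 1, 10, 11], toy_cost)
print(mst)
```

For $n$ flows the complete affinity graph has $n(n-1)/2$ edges, while the resulting tree always has $n-1$ edges.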

In each iteration shown in lines 5–9, the algorithm systematically generates new candidate set partitions starting from the set partition chosen from the previous iteration. Inside the while loop, new set partitions are generated by temporarily removing one edge from the MST. This is achieved by calling the edge-evaluation routine listed with Algorithm 2. With an edge removed, the corresponding group is split into two subgroups. We evaluate the cost of this splitting by solving an RST problem for each subgroup and calling the evaluation function to compute the new costs. In the first iteration of the algorithm, the spanning tree over $n$ flows has $n-1$ edges. Thus, $n-1$ new candidate set partitions will be generated. The set partition with the best cost will be chosen as the set partition for the current iteration. This set partition, and the correspondingly modified MST, will be used as the starting point for the next iteration. At iteration $i$, the MST will have $n-i$ remaining edges. Therefore, $n-i$ candidate set partitions will be generated and considered. The algorithm ends when all flows in the problem have been split into their own individual groups. Then, at the end of the algorithm, at line 10, we choose the set partition with the minimum cost among the set partitions chosen from all iterations. Since at each iteration there can be at most $n-1$ candidate set partitions, the number of set partitions considered in the DECOMPOSE algorithm is $O(n^2)$, which is again considerably smaller than the total number of possible set partitions.

Fig. 5. Affinity graph and MST for example shown in Fig. 3. (a) Affinity graph. (b) MST of affinity graph.

TABLE I. POWER CONSUMPTION OF NOC COMPONENTS [14], [15]. (A) POWER CONSUMPTION OF ROUTERS; (B) POWER CONSUMPTION OF LINKS

Algorithm 2 DECOMPOSE

Input: communication demand graph; specified evaluation function for implementation cost

Output: synthesized network architecture

1:
2:
3:
4:
5: while do
6:
7:   remove from
8:
9: end while
10:
11: return

1: for all
2:   temporarily remove from
3:
4:   for all do
5:
6:
7:
8:   end for
9:   add back to
10: end for
11:
12: return
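The top-down edge-removal loop of DECOMPOSE can be sketched as follows. This is a simplified illustration rather than the authors' code: `components`, `decompose`, and `toy_cost` are hypothetical names, `group_cost` again stands in for the RST solver plus evaluation function, and for brevity the sketch recomputes the cost of every group after each trial removal instead of evaluating only the two subgroups created by the split.

```python
def components(vertices, edges):
    """Disjoint vertex sets induced by `edges` (simple union-find)."""
    parent = {v: v for v in vertices}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)
    groups = {}
    for v in vertices:
        groups.setdefault(find(v), set()).add(v)
    return [frozenset(g) for g in groups.values()]


def decompose(flows, mst_edges, group_cost):
    """Remove one MST edge per iteration; return the cheapest partition seen."""

    def total(edges):
        return sum(group_cost(g) for g in components(flows, edges))

    edges = list(mst_edges)
    best_edges, best_cost = list(edges), total(edges)    # all flows in one group
    while edges:
        # temporarily remove each remaining edge and keep the cheapest split
        cost, removed = min(
            ((total([e for e in edges if e != r]), r) for r in edges),
            key=lambda t: t[0],
        )
        edges.remove(removed)
        if cost < best_cost:
            best_edges, best_cost = list(edges), cost
    return components(flows, best_edges), best_cost


def toy_cost(group):
    # Toy assumption: fixed overhead plus the spread of the flows on a line.
    return 3 + max(group) - min(group)


partition, cost = decompose([0, 1, 10, 11], [(0, 1), (10, 11), (1, 10)], toy_cost)
print(sorted(sorted(g) for g in partition), cost)
```

Each iteration tries at most $n-1$ removals and there are $n-1$ iterations, matching the quadratic bound on the number of candidates.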

VII. PERTURBATION-BASED FLOW PARTITIONING

In this section, we present two alternative flow partitioning algorithms called PERTURB and R-PERTURB. These algorithms are based on the use of simulated annealing [28], which has been successfully employed in numerous global optimization problems. Sections VII-B–VII-D describe PERTURB, and Section VII-E describes a more efficient version called R-PERTURB that considers a reduced state space.

A. Overview of Simulated Annealing (SA)

SA is a generic perturbation-based probabilistic global optimization algorithm [28]. Each step of the SA algorithm replaces the current solution by a random perturbation, chosen with a probability that depends on the change in cost and a global temperature parameter $T$. We use the transition probability from [28], which is defined as 1 if $\Delta C = C_{new} - C_{old}$ is negative, where $C_{new}$ and $C_{old}$ are the costs of the new state and the previous state, respectively (i.e., reduced-cost moves are always accepted). Otherwise, the acceptance probability is $e^{-\Delta C / T}$. The allowance for “increased cost” moves prevents SA from becoming stuck at local minima.
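The acceptance rule can be written in a few lines. The sketch below assumes the standard exponential (Metropolis-style) acceptance probability; the function name `accept` and the injectable `rng` argument are illustrative choices, not part of the paper.

```python
import math
import random


def accept(delta_cost, temperature, rng=random):
    """Decide whether to move to the new state.

    delta_cost  = cost(new state) - cost(previous state)
    temperature = current global temperature (must be positive)
    """
    if delta_cost < 0:
        return True                      # reduced-cost moves are always accepted
    # increased-cost moves are accepted with probability exp(-delta / T)
    return rng.random() < math.exp(-delta_cost / temperature)


class FixedRng:
    """Deterministic stand-in for the random source, used only for illustration."""

    def random(self):
        return 0.5


print(accept(-2.0, 1.0))                 # a downhill move
print(accept(1.0, 10.0, FixedRng()))     # uphill move at high temperature
print(accept(1.0, 1.0, FixedRng()))      # same uphill move at low temperature
```

As the temperature is lowered, the same cost increase becomes less likely to be accepted, so the search gradually turns into a greedy descent.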

In order to apply the SA method to a specific problem, three questions must be resolved: How should the solution space be represented? How should new candidates be generated? What cost function should be used to evaluate each candidate? Each of these questions is addressed in the following.

B. Representing the State Space

SA-based optimization requires a representation of the state space that can be searched to find optimal solutions. PERTURB takes a communication demand graph and an evaluation function as inputs and generates an optimized network architecture as output. Given a communication demand graph that specifies all the communication demand flows in the application, the goal is to explore the state space of different possible set partitions of the flows, implement each set partition as groups of RSTs, and evaluate the cost of each implementation. During the annealing process, the best solution seen is maintained. The solution space considered by PERTURB is the set of all possible set partitions of the flows, although only a subset of candidates is evaluated during the simulated annealing process.



C. Generating Candidate Solutions

PERTURB works by randomly moving to a neighbor state from the current state in each step and evaluating the benefit of the move. The candidate set partition is generated from the current set partition by the function GenerateNewSetPartition(), as shown in Algorithm 3. It takes the current set partition configuration and the communication demand graph as input. Two flows are randomly selected from the communication demand graph, and the sets that they belong to in the current partition are found. If the two flows are in different sets of the current set partition, the corresponding two sets are merged into one set in the new partition; otherwise, the common set is randomly split into two new sets by assigning one of the selected flows to each new set and randomly assigning the remaining flows of the original set to one or the other. The new set(s) generated, together with the remaining sets in the original partition, form the new set partition, which is returned. This method of generating new set partitions allows us to traverse the state space of possible set partitions.

D. Incremental Cost Evaluation

For each set partition of the flows generated in each step of PERTURB, an RST solver is called to generate a physical network topology for the configuration. The flows in each set in the set partition are considered as a cluster, and an RST instance is solved as the physical network topology of this cluster. Note that for each new set partition generated from the old one, at most two clusters can change, by either merging two clusters into one or splitting one cluster into two. All others remain unchanged. So we only incrementally generate and solve the RST instances for the clusters merged or the cluster split. The derived RSTs are then implemented and evaluated as discussed in Section IV-B-3.

Algorithm 3 GenerateNewSetPartition()

Input: input set partition; communication demand graph

Output: new generated set partition

1: randomly select 2 flows
2:
3: if then
4:
5:
6: else
7:
8:
9:   for all do
10:     if then
11:
12:     else
13:
14:     end if
15:   end for
16:
17:
18: end if
19: return
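The merge-or-split move described for PERTURB can be sketched as below. This is an illustrative reading of the prose, not the original listing: the Python function name, the list-of-sets representation, and the injectable `rng` are assumptions.

```python
import random


def generate_new_set_partition(partition, flows, rng=random):
    """Return a neighbor of `partition` (a list of disjoint sets of flows).

    Two flows are drawn at random. If they lie in different sets, those sets
    are merged; if they lie in the same set, that set is split in two.
    """
    f1, f2 = rng.sample(flows, 2)
    s1 = next(s for s in partition if f1 in s)
    s2 = next(s for s in partition if f2 in s)
    rest = [set(s) for s in partition if s is not s1 and s is not s2]
    if s1 is not s2:
        return rest + [set(s1) | set(s2)]            # merge the two sets
    # split: seed each new set with one selected flow, scatter the others
    a, b = {f1}, {f2}
    for f in s1:
        if f not in (f1, f2):
            (a if rng.random() < 0.5 else b).add(f)
    return rest + [a, b]


rng = random.Random(0)
flows = [0, 1, 2, 3, 4]
state = [{0, 1}, {2}, {3, 4}]
for _ in range(5):
    state = generate_new_set_partition(state, flows, rng)
    print(sorted(sorted(s) for s in state))
```

Every move changes the number of groups by exactly one, which is why at most two clusters need to be re-evaluated after each step.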

E. Reducing the State Space

The PERTURB algorithm presented above explores a state space that comprises all possible set partitions. However, the large state space that it considers can lead to long run times. In this section, we present another SA-based algorithm called Reduced-PERTURB (R-PERTURB) that reduces the size of the state space from all possible set partitions of the $n$ flows to the $2^{n-1}$ set partitions obtainable by removing edges from a spanning tree with $n-1$ edges. In practice, R-PERTURB can significantly reduce run times while still achieving good results as compared to PERTURB.

R-PERTURB works in a similar way to PERTURB except that it uses a different neighbor selection method to generate candidate solutions. In order to efficiently reduce the number of set partitions considered without excluding potentially good candidates, R-PERTURB again makes use of affinity graphs and minimum spanning trees, as used in the DECOMPOSE algorithm. Before the SA process starts, an affinity graph and its minimum spanning tree are first derived from the communication demand graph, as described in Section VI. Fig. 5 again shows the affinity graph and its minimum spanning tree for the example shown in Fig. 3.

For the initial minimum spanning tree generated, the edges are numbered and saved in an edge array. At each iteration, a new neighboring set partition is generated from the current set partition by using the function GenerateNewSetPartition() shown in Algorithm 4. It works by randomly selecting an edge of the initial minimum spanning tree from the saved edge array. If the selected edge is in the current (modified) MST, it is removed from it; otherwise, it is added to it. By adding or removing edges to/from the current MST, the set partition that it represents changes. The new set partition is obtained in the function GetDisjointSets() by finding all the disjoint sets of vertices in the current MST, which correspond to disjoint sets of flows.

Algorithm 4 GenerateNewSetPartition()

Input: modified MST of current set partition; original MST edges array

Output: new generated set partition

1: randomly select 1 edge from
2: if then
3:   remove from
4: else
5:   add to
6: end if
7:
8: return
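The edge-toggling move of R-PERTURB can be sketched as follows. The Python names are illustrative (only GenerateNewSetPartition() and GetDisjointSets() are named in the text), and representing the current tree as a set of edge tuples is an assumption.

```python
import random


def get_disjoint_sets(vertices, edges):
    """Disjoint vertex sets of the forest given by `edges` (union-find)."""
    parent = {v: v for v in vertices}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)
    groups = {}
    for v in vertices:
        groups.setdefault(find(v), set()).add(v)
    return list(groups.values())


def generate_new_set_partition_reduced(current_edges, original_edges, vertices, rng=random):
    """Toggle one randomly chosen edge of the original MST, in place.

    current_edges:  set of edges currently kept (the modified MST)
    original_edges: list of all edges of the initial MST
    """
    edge = rng.choice(original_edges)
    if edge in current_edges:
        current_edges.remove(edge)       # splits one group into two
    else:
        current_edges.add(edge)          # merges two groups into one
    return get_disjoint_sets(vertices, current_edges)


rng = random.Random(3)
vertices = [0, 1, 2, 3]
original = [(0, 1), (1, 2), (2, 3)]
current = set(original)
for _ in range(4):
    groups = generate_new_set_partition_reduced(current, original, vertices, rng)
    print(sorted(sorted(g) for g in groups))
```

Because each of the $n-1$ tree edges is either present or absent, the states reachable this way are exactly the $2^{n-1}$ edge subsets of the initial MST.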

For a communication demand graph containing $n$ flows, the number of vertices in the affinity graph is $n$, and the number of edges in the minimum spanning tree is $n-1$. By considering adding or removing edges in the spanning tree to generate new set partitions, R-PERTURB considers a reduced state space of $2^{n-1}$ possible set partitions, which is much smaller than the total number of possible set partitions. Similar to PERTURB, only a subset of candidates in the reduced state space is evaluated during the simulated annealing process.

VIII. DEADLOCK CONSIDERATIONS

Deadlock-free routing is an important consideration for the correct operation of custom NoC architectures. The problem is very well studied and analyzed in the literature. In [37], Dally and Seitz proposed a necessary and sufficient condition for deterministic deadlock-free routing using the concept of a channel dependency graph. For general multiprocessor systems that can be programmed to run different applications, or in the case when adaptive routing is used, the problem is complicated by the challenge that the flows and routing paths are not necessarily known in advance [35], [36], [38]. On the other hand, for our custom NoC synthesis problem, the traffic flows and their required data rates are specified in advance, and predetermined routes, for both unicast and multicast flows, are decided and fixed as part of the NoC synthesis process. For our deterministic routing problem, deadlock-free operation can be ensured in the following ways.

1) Statically Scheduled Routing: For our NoC solutions, the required data rates are specified and the routes are fixed. In this setting, data transfers can be statically scheduled along the predetermined paths with resource reservations to ensure deadlock-free routing. As advocated in [40], statically scheduled traffic is an effective option for providing guaranteed traffic. It has been shown in [40], [41] that router microarchitectures can be effectively extended so that statically scheduled traffic can commingle well with best-effort traffic.

2) Virtual Channels: As shown in [37], a necessary and sufficient condition for deadlock-free routing is the absence of cycles in a channel dependency graph. In our problem setting, the traffic flows are known in advance, and the synthesis procedure is responsible for deciding on a good network topology and finding predetermined routes for the specified traffic flows. Given that the traffic flows, routing paths, and network topology are all fixed by the synthesis procedure, we can construct a corresponding channel dependency graph where each node in the graph corresponds to a channel in the network, and a directed edge is added from $c_i$ to $c_j$ if there is a channel dependency from $c_i$ to $c_j$ (i.e., a flow is holding $c_i$ and waiting for $c_j$).

For the unicast case, the construction is straightforward: if channel $c_j$ follows immediately after channel $c_i$ in a routing path for a flow, then we add an edge from $c_i$ to $c_j$. The multicast case is more complicated, as deadlocks may be caused by the resource dependence between two multicast trees, even though the trees may not form a cycle topologically. To consider the multicast case as well, we use an extended channel dependency graph construction as follows. If a multicast flow enters a router through channel $c_i$ and bifurcates from $c_i$ to a group of channels $F(c_i)$, then we add an edge from $c_i$ to each channel in $F(c_i)$. We refer to $F(c_i)$ as the fan-out set of $c_i$. The intuition is that if a multicast flow has acquired channel $c_i$, then it can only proceed if it can acquire all the channels in the fan-out set of $c_i$. In addition, let $c_j$ and $c_k$ be two channels in the fan-out set of $c_i$. Then we also need to add an edge from $c_j$ to each channel in the fan-out set of $c_k$. The intuition is that if a multicast flow has acquired channels $c_j$ and $c_k$, it cannot proceed past $c_j$ unless it can proceed past $c_k$ as well. Similarly, we need to add an edge from $c_k$ to each channel in the fan-out set of $c_j$. This extended construction treats unicast flows as a special case.

Using the above extended channel dependency graph construction, resource dependencies between multicast trees show up as cycles in the extended channel dependency graph even if they do not form cycles topologically. The cycles in the extended channel dependency graph can be broken by splitting a channel in the cycle into two virtual channels (or by adding another virtual channel if the physical channel has already been split). The added virtual channels are implemented in the corresponding routers. To decide where to introduce virtual channels, we simply use a greedy heuristic of splitting the first channel encountered in each cycle. We readily admit that a more sophisticated optimization procedure could be envisioned for this virtual channel insertion problem. However, our simple greedy heuristic appears to suffice since we have found that virtual channels are rarely needed to resolve deadlocks in practice for custom networks. In all the benchmarks that we tested in Section IX, no deadlocks were found in the synthesized solutions. Therefore, we did not need to add any virtual channels.
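The extended channel dependency graph and its cycle check can be sketched as below. This is a small illustration under assumptions: each routed flow is represented as a dictionary mapping a channel to its fan-out set (a unicast path is the special case with one channel per fan-out set), channel names are arbitrary strings, and the function names are hypothetical. The example is constructed so that the two multicast trees contain no cycle in the plain dependency edges, yet the sibling edges of the extended construction create one.

```python
def extended_cdg(flows):
    """Build the extended channel dependency graph.

    flows: list of dicts, one per flow, mapping channel -> list of fan-out channels.
    Returns a dict mapping each channel to the set of channels it depends on.
    """
    graph = {}
    for tree in flows:
        # edges from each channel to every channel in its fan-out set
        for channel, fanout in tree.items():
            graph.setdefault(channel, set()).update(fanout)
            for f in fanout:
                graph.setdefault(f, set())
        # sibling edges: each channel in a fan-out set depends on the
        # fan-out sets of the other channels in the same fan-out set
        for fanout in tree.values():
            for a in fanout:
                for b in fanout:
                    if a != b:
                        graph[a].update(tree.get(b, ()))
    return graph


def has_cycle(graph):
    """Depth-first search with three colors; True if a directed cycle exists."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {v: WHITE for v in graph}

    def visit(v):
        color[v] = GRAY
        for w in graph[v]:
            if color[w] == GRAY or (color[w] == WHITE and visit(w)):
                return True
        color[v] = BLACK
        return False

    return any(color[v] == WHITE and visit(v) for v in graph)


# Two multicast trees sharing channels "b" and "c".
flow1 = {"x": ["a", "b"], "a": ["c"]}
flow2 = {"y": ["c", "d"], "d": ["b"]}
print(has_cycle(extended_cdg([flow1])))
print(has_cycle(extended_cdg([flow1, flow2])))
```

A detected cycle would then be broken by splitting one of its channels into virtual channels, for example the first channel encountered, as in the greedy heuristic described above.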

IX. RESULTS

A. Experimental Setup

We have implemented our four proposed algorithms, CLUSTER, DECOMPOSE, PERTURB, and R-PERTURB, in C++. In our implementation, we incorporated a fast public-domain Rectilinear Steiner Tree solver called GeoSteiner 4.0 [22], [23] to generate the physical network topologies in the inner loop of the four algorithms. The proposed router merging algorithm has been integrated into the four algorithms as well to improve the solutions generated. As discussed in the design flow outlined in Section III, we use Parquet [27] for the floorplanning step.

To evaluate our proposed synthesis algorithms, two groups of experiments were used. The first group of experiments was used to evaluate the performance of our algorithms on applications with only unicast flows. The second group of experiments was used to evaluate the performance of our algorithms on benchmarks with multicast traffic flows.

For the first group of experiments, three sets of benchmarks were used to evaluate the proposed algorithms. The first set of benchmarks consists of four different video processing applications obtained from [11], including a Video Object Plane Decoder (VOPD), an MPEG4 decoder, a Picture-In-Picture (PIP) application, and a Multi-Window Display (MWD) application. The next set of benchmarks was obtained from [3] and [8]. They correspond to different encoder/decoder combinations of an H.263 video codec, an MP3 audio codec, and a generic MultiMedia System (MMS). Finally, to generate larger benchmark instances, we generated synthetic benchmarks from the above video applications.

For the second group of experiments, in the absence of published benchmarks with multicast traffic, we generated a set of synthetic benchmarks. In particular, we used the NoC-centric bandwidth version of Rent’s rule proposed in [29] to generate these benchmarks. The details of this benchmark generation process are described in Section IX-C1.

All experimental results were obtained on a 1.5-GHz Intel P4 processor machine with 512 MB of memory running Linux.

B. Results for Unicast Applications

1) Method of Evaluation: In all our experiments, we aim to evaluate the performance of the four proposed algorithms CLUSTER, DECOMPOSE, PERTURB, and R-PERTURB on all benchmarks with the objective of minimizing the total power consumption of the synthesized NoC architectures. The total power consumption includes both the leakage power and the dynamic switching power of all network components. As discussed in Section III, we use a power-performance simulator called Orion [13], [14] to estimate the power consumption of the router configurations generated. We applied the design parameters of 1-GHz clock frequency, 4-flit buffers, and 128-bit flits. The leakage power and switching bit energy of some example router configurations with different numbers of ports in 70-nm technology are shown in Table I. For the link power parameters, we use a state-of-the-art on-chip interconnect model proposed in [15] with repeated buffers. The static power and switching bit energy parameters in 70-nm technology are obtained from [15] and listed in Table I.

TABLE II. NoC POWER COMPARISONS ON UNICAST APPLICATIONS

Fig. 6. Comparisons of all algorithms on unicast applications. (a) Power. (b) Execution time. (c) Hop count. (d) Area.

For evaluation, fair direct comparison with previously published NoC synthesis results is difficult, in part because of vast differences in the power parameters assumed.² To evaluate the effectiveness of our algorithms, we have designed two sets of experiments. In the first set of experiments, we generated a full mesh implementation for each benchmark for comparison. For the positioning of modules on the mesh implementation, we again used Parquet [27]. In a full mesh implementation, each module is connected to a router with five input/output ports. Packets are routed using XY routing over the mesh from source to destination. We also generated a variant of the basic mesh topology called optimized mesh (opt-mesh) by eliminating router ports and links that are not used by the traffic flows. These experiments are designed to show the benefits of application-specific NoC architectures. In the second set of experiments, we implemented an exact algorithm, referred to as EXACT, that exhaustively enumerates all distinct set partitions. These experiments are designed to show how close our synthesis algorithms are to exact enumeration results.

² We use the Orion simulator to estimate power consumption [13], [14]. The power estimates are consistent with another published power-optimized NoC implementation described in [18]. The power estimates are on the same order of magnitude for the same router configuration in the same technology.

2) Comparison of Results: The synthesis results of our four algorithms on all unicast benchmarks at 70 nm, with comparison to results using mesh and opt-mesh topologies, are shown in Table II. For each algorithm, the power results, the execution times, and the power improvements of that algorithm over mesh and opt-mesh topologies are reported. The power results of all algorithms relative to mesh implementations are graphically compared in Fig. 6(a). The results show that all algorithms can efficiently synthesize NoC architectures that minimize power consumption. All algorithms achieve a substantial reduction in power consumption over the standard mesh and opt-mesh topologies in all cases. The two heuristic algorithms achieve results comparable to those of the two perturbation-based algorithms.

In particular, CLUSTER can achieve on average a 6.92× reduction in power over the standard mesh topologies and a 2.68× reduction over the optimized mesh topologies, and DECOMPOSE can achieve on average a 6.83× and a 2.60× reduction in power over the standard mesh and optimized mesh topologies. Similarly, PERTURB can achieve on average a 7.16× and a 2.73× reduction in power over the standard mesh and optimized mesh topologies, and R-PERTURB can achieve on average a 6.99× and a 2.66× reduction in power over the standard mesh and optimized mesh topologies, respectively.

TABLE III. NoC HOP COUNT COMPARISONS ON UNICAST APPLICATIONS

TABLE IV. NoC ROUTER AREA COMPARISONS ON UNICAST APPLICATIONS

The execution times of all algorithms are graphically compared in Fig. 6(b). The results show that all algorithms work quite fast. The two heuristic algorithms, CLUSTER and DECOMPOSE, work much faster than PERTURB. As can be seen, CLUSTER can achieve better results than DECOMPOSE because it examines more set partition candidates in its solution space, but it requires longer run times. R-PERTURB also works faster than PERTURB because it searches a smaller state space. Nevertheless, it can achieve results similar to those of PERTURB, with execution times comparable to those of the two heuristic algorithms.

To evaluate the performance of the synthesized topologies, average hop count results for the benchmarks from the synthesized topology are reported in Table III and graphically compared in Fig. 6(c). Hop counts correspond to the number of intermediate routers that a packet needs to pass through from the source to the destination. The results show that the solutions obtained using our algorithms can achieve much lower hop counts, and thus lower latencies, than the corresponding mesh topologies. In particular, CLUSTER, DECOMPOSE, R-PERTURB, and PERTURB can achieve on average a 2.89×, 2.90×, 2.93×, and 2.95× reduction in hop count, respectively. In a number of benchmarks, some modules only have a single incoming flow or a single outgoing flow. For example, for the VOPD benchmark, 6 out of the 12 modules have at most one incoming flow as well as one outgoing flow, and 10 out of the 12 modules have either at most one outgoing flow or one incoming flow. For these benchmarks, the most efficient architectures are actually ones that provide direct network links between network interfaces for some of their traffic flows without going through intermediate routers.³ For these benchmarks, the average hop count may be less than one since not all flows necessarily pass through intermediate routers. Our flow partitioning problem formulation is able to arrive at these implementations by exploring set partitions in which some flows are grouped in their own partition.

TABLE V. NoC POWER COMPARISONS WITH EXACT SOLUTIONS ON UNICAST APPLICATIONS

To evaluate the area costs of the synthesized solutions, we also used Orion [13] to estimate the areas of the routers in the synthesized architectures, using the same 70-nm technology used for power estimation. The area cost of a solution corresponds to the sum of the router areas in the solution. The results are presented in Table IV and compared in Fig. 6(d). Total area costs of all routers in a custom NoC solution produced by our algorithms are only in the range of 0.10 to 0.86 mm², even for the largest benchmark, 4-in-1, with 44 modules (about 0.02 mm² amortized area cost per module in the 4-in-1 example, which is small in comparison to the expected size of modules). In comparison to the area costs of the opt-mesh solutions, the area costs of our algorithms are on average 3.12× lower.

³ State-of-the-art router microarchitectures, such as those proposed in [16]–[18], employ finite buffers and virtual channels. Flow control is used to prevent upstream routers or network interfaces from sending more data when either buffer space or a virtual channel is unavailable. Network interfaces that interoperate with these router microarchitectures must also correspondingly support the same flow control mechanism. This flow control mechanism can be used to control data transfers between network interfaces that are directly connected by a network link. Our synthesis algorithms can also be constrained to produce architectures where flows are required to pass through at least one router.

TABLE VI. NoC POWER COMPARISONS ON MULTICAST APPLICATIONS

Fig. 7. Comparisons of all algorithms on multicast applications. (a) Power. (b) Execution time. (c) Hop count. (d) Area.

In the next set of experiments, we compare our algorithms with an exact algorithm that enumerates all distinct set partitions. As the number of distinct set partitions grows very quickly with the number of flows, the CPU times for generating the exact solutions increase very quickly. We set a CPU timeout limit of 8 hours. The results are compared in Table V. Out of the 16 benchmarks tested, we were able to obtain results for exact enumeration for only 6 of the benchmarks. The largest unicast benchmark that exact enumeration could complete was a benchmark with 12 flows (namely G5). This is because the number of set partitions of $n$ flows grows as the Bell number $B_n$, and $B_{12} = 4{,}213{,}597$ and $B_{13} = 27{,}644{,}437$. The exact enumeration of G5 with 12 flows took about 4 hours. However, we estimate that it will take more than 27 hours to generate the exact solution for a benchmark with 13 flows (e.g., MPEG4), which is well beyond our 8-hour timeout limit.
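The growth of the number of set partitions can be checked with a short computation of Bell numbers. The sketch below uses the Bell triangle; the function name `bell_numbers` is an illustrative choice.

```python
def bell_numbers(n_max):
    """Return [B_0, ..., B_n_max] computed with the Bell triangle."""
    bells = [1]
    row = [1]
    for _ in range(n_max):
        new_row = [row[-1]]              # each row starts with the last entry of the previous row
        for x in row:
            new_row.append(new_row[-1] + x)
        row = new_row
        bells.append(row[0])
    return bells


b = bell_numbers(13)
print(b[12], b[13], round(b[13] / b[12], 2))
```

Going from 12 to 13 flows multiplies the number of set partitions by roughly 6.6, which is consistent with the jump from a few hours to more than a day of enumeration time.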

On the other hand, as shown in Table V, of the 6 benchmarks where exact enumeration was possible, our PERTURB algorithm could achieve the exact solution in all cases, and the longest execution time was only about 2 min. CLUSTER was able to achieve the same results as exact enumeration in 5 out of the 6 cases, and on average, the results are within just 1% of the exact results. Moreover, the CPU times for the 6 benchmarks were all under 1 s, whereas the EXACT algorithm took as much as 4.5 hours to achieve similar results. Likewise, DECOMPOSE and R-PERTURB were able to achieve the exact solution in 4 out of the 6 cases, and on average, the results are within 2% of the exact results, and these results were achieved in the range of 0.23 to 7 s.

C. Results for Multicast Applications

In this section, we present experimental results on benchmarks with multicast traffic to evaluate the performance of our algorithms on the synthesis of NoC communication architectures with multicast routing.

1) Benchmark Generation Using Rent’s Rule: To generate synthetic benchmarks with multicast traffic, we used the NoC-centric bandwidth version of Rent’s rule proposed by Greenfield et al. [29]. In particular, they showed that the traffic distribution models of NoC applications should follow a Rent’s rule distribution similar to that in conventional VLSI netlists. The bandwidth version of Rent’s rule was derived showing that the relationship between the external bandwidth $B$ across a boundary and the number of blocks $N$ within a boundary obeys $B = bN^p$, where $b$ is the average bandwidth for each block and $p$ is the Rent’s exponent. We used this NoC-centric Rent’s rule [29], [30] to generate large NoC benchmarks with mixed unicast/multicast flows and varying hop count and data rate distributions. We formed multicast traffic with varying group sizes for about 10% of the flows.

2) Comparison of Results: Using the above benchmark generation process, we generated 8 multicast benchmarks with the number of modules ranging from 4 to 36 and the number of flows ranging from 6 to 84 (with some as multicast flows). We applied our four algorithms to derive optimized NoC architectures for them with the goal of minimizing power consumption. We again compared our results with both mesh and opt-mesh implementations. For multicast routing over a mesh implementation, we applied the efficient multicast routing algorithm described in [39] to determine the routing of multicast traffic. We then again generated opt-mesh implementations by eliminating router ports and links that are not used by the traffic flows.

The power consumption results and the execution times of all algorithms on the multicast benchmarks are reported in Table VI. The power results of all algorithms relative to mesh implementations are compared in Fig. 7(a). The execution times of our four algorithms are compared in Fig. 7(b). For the multicast benchmarks, all of our algorithms achieve a substantial reduction in power over the standard mesh and opt-mesh implementations. PERTURB achieves the best results among our algorithms, with on average a 2.30× reduction in power over the standard mesh and a 1.64× reduction over the optimized mesh topologies. R-PERTURB is slightly worse than PERTURB, achieving on average a 2.27× and a 1.61× reduction in power over the standard mesh and optimized mesh topologies. However, its execution times are 8 to 45 times shorter than those of PERTURB. The two heuristic algorithms achieve results comparable to those of the two probabilistic algorithms. In particular, CLUSTER can achieve on average a 2.25× and a 1.60× reduction in power over the standard mesh and optimized mesh topologies, and DECOMPOSE can achieve on average a 2.15× and a 1.53× reduction in power over the standard mesh and optimized mesh topologies, respectively. Both heuristic algorithms work much faster than PERTURB. For example, DECOMPOSE can obtain all results in less than 1 min. For applications with large problem sizes, the heuristic algorithms and R-PERTURB can be used without sacrificing much performance.

TABLE VII. NoC HOP COUNT COMPARISONS ON MULTICAST APPLICATIONS

TABLE VIII. NoC ROUTER AREA COMPARISONS ON MULTICAST APPLICATIONS

TABLE IX. NoC POWER COMPARISONS WITH EXACT SOLUTIONS ON MULTICAST APPLICATIONS

Average hop count results are also reported in Table VII and Fig. 7(c). The results show that, for multicast applications, the synthesized topologies using our algorithms can also achieve lower hop counts than both mesh and opt-mesh topologies. In particular, on average, CLUSTER, DECOMPOSE, R-PERTURB, and PERTURB can achieve a 1.84×, 1.78×, 1.88×, and 1.78× reduction in hop count, respectively.

The area costs in terms of router areas for the multicast benchmarks are reported in Table VIII and Fig. 7(d). The area costs of our algorithms range from 0.18 mm² to 3.06 mm², the latter for the largest benchmark M8 with 84 modules (about 0.04 mm² amortized area cost per module). In comparison to the area costs of the opt-mesh solutions, the area costs of our algorithms are on average 1.77× lower.

Finally, we again compare our algorithms with the exact enumeration algorithm for multicast applications. The results are compared in Table IX. We again set a CPU timeout limit of 8 h. Out of the 8 multicast benchmarks tested, we were only able to obtain the results for exact enumeration for 3 of them. Of these 3 benchmarks, PERTURB could achieve the exact solution in all cases, and the longest execution time was only about 3.5 min. CLUSTER and R-PERTURB were able to achieve the exact solution in 2 out of the 3 cases, and on average, the results are within just 3% and 2% of the exact results, respectively. Likewise, DECOMPOSE was able to achieve the exact solution in 1 out of the 3 cases, and on average, the results are within 5% of the exact results.

X. CONCLUSION

In this paper, we proposed a formulation of the custom NoC synthesis problem based on the decomposition of the problem into the inter-related steps of finding a good set partition of traffic flows, deriving a good physical network topology for each group in the partition, and providing an optimized network implementation for the derived topologies. Our problem formulation takes into consideration both unicast and multicast traffic. We proposed four algorithms called CLUSTER, DECOMPOSE, PERTURB, and R-PERTURB for systematically examining different possible set partitionings of flows, and we proposed the use of RST algorithms for constructing good physical network topologies. Our solution framework enables the decoupling of the evaluation cost function from the exploration process, thereby enabling different user objectives and constraints to be considered. Although we use Steiner trees to generate a physical network topology for each group in the set partition, the final NoC architecture synthesized is not necessarily limited to just trees, as Steiner tree implementations of different groups may be connected to each other to form non-tree structures. We have described several ways to ensure deadlock-free routing of both unicast and multicast flows. Experimental results on a variety of benchmarks using a power consumption cost model show that our algorithms can produce effective solutions with fast execution times compared with both mesh implementations and EXACT solutions.

REFERENCES

[1] W. J. Dally and B. Towles, “Route packet, not wires: On-chip intercon-nection networks,” in Proc. DAC, 2001, pp. 684–689.

[2] L. Benini and G. De Micheli, “Networks on chips: A new SoC para-digm,” IEEE Computer, vol. 35, no. 1, pp. 70–78, Jan. 2002.

[3] J. Hu and R. Marculescu, “Energy-aware mapping for tile-based NoCarchitectures under performance constraints,” in Proc. ASP-DAC, 2003,pp. 233–239.

[4] S. Murali and G. De Micheli, “Bandwidth constrained mapping ofcores onto NoC architectures,” in Proc. DATE, 2004, p. 20896.

[5] U. Ogras and R. Marculescu, “Energy and performance driven NoCcommunication architecture synthesis using a decomposition ap-proach,” in Proc. DATE, 2005, pp. 352–357.

[6] U. Ogras and R. Marculescu, “Application specific network-on-chiparchitecture customization via long range link insertion,” in Proc.ICCAD, 2005, pp. 246–253.

[7] A. Pinto, L. P. Carloni, and A. L. Sangiovanni-Vincentelli, “Efficientsynthesis of networks on chip,” in Proc. ICCD, 2003, p. 146.

[8] K. Srinivasan, K. S. Chatha, and G. Konjevod, “Linear-programming-based techniques for synthesis of network-on-chip architectures,” IEEETrans. Very Large Scale Integr. (VLSI) Syst., vol. 14, no. 4, pp. 407–420,Apr. 2006.

[9] K. Srinivasan, K. S. Chatha, and G. Konjevod, “Application specificnetwork-on-chip design with guaranteed quality approximation algo-rithms,” in Proc. ASPDAC, 2007, pp. 184–190.

[10] S. Murali, P. Meloni, F. Angiolini, D. Atienza, S. Carta, L. Benini, G.De Micheli, and L. Raffo, “Designing application-specific networks onchips with floorplan information,” in Proc. ICCAD, 2006, pp. 355–362.



[11] D. Bertozzi, A. Jalabert, A. Jalabert, G. De Micheli, L. Benini, S.Murali, R. Tamhankar, and S. Stergiou, “NoC synthesis flow forcustomized domain specific multiprocessor systems-on-chip,” IEEETrans. Parallel and Distributed Systems, vol. 16, no. 2, pp. 113–129,Feb. 2005.

[12] A. S. Grove, “Changing vectors of Moore’s law,” presented at theKeynote presentation, International Electron Device Meeting, SanFrancisco, CA, Dec. 2002.

[13] H. Wang, X. Zhu, L.-S. Peh, and S. Malik, “Orion: A power-perfor-mance simulator for interconnection networks,” in Proc. MICRO 35,Nov. 2002, pp. 294–305.

[14] X. Chen and L.-S. Peh, “Leakage power modeling and optimization ininterconnection networks,” in Proc. ISPLED, 2003, pp. 90–95.

[15] L. Zhang, H. Chen, H. Chen, B. Yao, K. Hamilton, and C.-K. Cheng,“Repeated on-chip interconnect analysis and evaluation of delay,power, and bandwidth metrics under different design goals,” in Proc.ISQED, 2007, pp. 251–256.

[16] L.-S. P. and W. J. Dally, “A delay model and speculative architecturefor pipelined routers,” in Proc. 7th Int. Symp. High-Perform. Comput.Arch. (HPCA), 2001, p. 255.

[17] H. Wang, L.-S. Peh, and S. Malik, “Power-driven design of router mi-croarchitectures in on-chip networks,” in Proc. MICRO 36, 2003, p.105.

[18] R. Mullins, “Minimising dynamic power consumption in on-chip net-works,” in Proc. Int. Symp. Syst.-on-Chip, 2006, pp. 1–4.

[19] M. B. Taylor, J. Kim, J. Miller, D. Wentzlaff, F. Ghodrat, B. Green-wald, H. Hoffman, P. Johnson, J.-W. Lee, W. Lee, A. Ma, A. Saraf, M.Seneski, N. Shnidman, V. Strumpen, M. Frank, S. Amarasinghe, andA. Agarwal, “The RAW microprocessor: A computational fabric forsoftware circuits and general-purpose programs,” IEEE Micro, vol. 22,no. 2, pp. 25–35, Mar. 2002.

[20] K. Sankaralingam, R. Nagarajan, H. Liu, C. Kim, J. Huh, D. Burger, S.W. Keckler, and C. R. Moore, “Exploiting ILP, TLP, and DLP with thepolymorphous TRIPS architecture,” in Proc. ISCA, 2003, pp. 422–433.

[21] D. E. Knuth, The Art of Computer Programming. Pre-Fascicle 3B. ADraft of Sections 7.2.1.4-5: Generating All Partitions. Boston, MA:Addison-Wesley.

[22] D. M. Warme, P. Winter, and M. Zachariasen, “Exact algorithms forplane Steiner Tree problems: A computational study,” in Advances inSteiner Trees. Norwell, MA: Kluwer, 2000, pp. 81–116.

[23] Univ. Copenhagen, Copenhagen, Denmark, “University of Copen-hagen homepage,” 2001. [Online]. Available: http://www.diku.dk/geosteiner/

[24] C.-W. Lin, S.-Y. Chen, C.-F. Li, Y.-W. Chang, and C.-L. Yang, “Effi-cient obstacle-avoiding rectilinear Steiner tree construction,” in Proc.Int. Symp. Phys. Des., 2007, pp. 127–134.

[25] N. A. Sherwani, Algorithms for VLSI Physical Design Automation, 3rded. Norwell, MA: Kluwer, 1998.

[26] X. Hong, G. Huang, Y. Cai, J. Gu, S. Dong, C. K. Cheng, and J. Gu,“Corner block list: An effective and efficient topological representationof non-slicing floorplan,” in Proc. ICCAD, 2000, pp. 8–12.

[27] S. N. Adya and I. L. Markov, “Fixed-outline floorplanning: Enablinghierarchical design,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst.,vol. 11, no. 6, pp. 1120–1135, Dec. 2003.

[28] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi, “Optimization by sim-ulated annealing,” Science, vol. 220, no. 4598, pp. 671–680, 1983.

[29] D. Greenfield, A. Banerjee, J.-G. Lee, and S. Moore, “Implications ofrent’s rule for NoC design and its fault-tolerance,” in Proc. NOCS, May2007, pp. 283–294.

[30] D. Stroobandt, P. Verplaetse, and J. van Campenhout, “Generatingsynthetic benchmark circuits for evaluating CAD tools,” IEEETrans. Comput.-Aided Des. Integr. Circuits Syst., vol. 19, no. 9, pp.1011–1022, Sep. 2000.

[31] S. Yan and B. Lin, “Application-specific network-on-chip architecturesynthesis based on set partitions and steiner trees,” in Proc. ASPDAC,2008, pp. 277–282.

[32] E. Wein and J. Benkoski, “Hard macros will revolutionize SoC de-sign,” EE Times, Aug. 20, 2004 [Online]. Available: http://www.ee-times.com/showArticle.jhtml?articleID=26807055

[33] Xilinx, San Jose, CA, “UMC delivers leading-edge 65 nm FPGAs toXilinx,” Des. Reuse, Nov. 8, 2006 [Online]. Available: http://www.de-sign-reuse.com/news/14644/umc-edge-65nm-fpgas-xilinx.html

[34] P. Gratz, K. Sankaralingam, H. Hanson, P. Shivakumar, R. McDonald,S. W. Keckler, and D. Burger, “Implementation and evaluation of adynamically routed processor operand network,” in Proc. NOCS, May2007, pp. 7–17.

[35] J. Duato, S. Yalamanchili, and L. Ni, Interconnection Networks. NewYork: IEEE Computer Society, 1997.

[36] B. Towles and W. J. Dally, Principles and Practices of InterconnectionNetworks. San Francisco, CA: Morgan Kaufmann, 2003.

[37] W. J. Dally and C. L. Seitz, “Deadlock-free message routing in mul-tiprocessor interconnection networks,” IEEE Trans. Computers, vol.C-36, no. 5, pp. 547–553, May 1987.

[38] J. Duato, “A necessary and sufficient condition for deadlock-freeadaptive routing in wormhole networks,” IEEE Trans. Parallel Distrib.Syst., vol. 6, no. 10, pp. 1055–1067, Oct. 1995.

[39] M. P. Malumbres, J. Duato, and J. Torrellas, “An efficient implementa-tion of tree-based multicast routing fordistributed shared-memory,” inProc. IEEE Symp. Parallel Distrib. Process., 1996, p. 186.

[40] E. Rijpkema et al., “Trade-offs in the design of a router with both guar-anteed and best-effort services for networks on chip,” in Proc. DATE,2003, p. 1035.

[41] N. Enright-Jerger, M. Lipasti, and L.-S. Peh, “Circuit-switched coher-ence,” IEEE Comput. Arch. Lett., vol. 6, no. 1, pp. 193–202, Mar. 2007.

Shan Yan (S’03) received the B.S. and M.S. degrees in electronic engineering from Tsinghua University, Beijing, China, in 2000 and 2003, respectively. She is currently pursuing the Ph.D. degree in electrical and computer engineering from the University of California, San Diego.

Her current research interests focus on synthesis and approximation algorithms for low-power system-on-chip (SoC) and network-on-chip (NoC) design and also include the field of modelling, power, and performance optimization of NoC architectures, and other design issues in SoC and NoC designs.

Bill Lin (M’87–SM’09) received the B.S., M.S., and Ph.D. degrees in electrical engineering and computer sciences from the University of California, Berkeley.

He is currently with the faculty of Electrical and Computer Engineering, University of California, San Diego (UCSD), where he is actively involved with the Center for Wireless Communications (CWC), the Center for Networked Systems (CNS), and the California Institute for Telecommunications and Information Technology in industry-sponsored research efforts. Prior to joining the faculty at UCSD, he was the head of the System Control and Communications Group at IMEC, Belgium. IMEC is the largest independent microelectronics and information technology research center in Europe. It is funded by European funding agencies in joint projects with major European telecom and semiconductor companies. His research has led to over 100 journal and conference publications. He holds two U.S. patents.

Dr. Lin was a recipient of a number of publication awards, including the 1995 IEEE Transactions on VLSI Systems Best Paper Award, a Best Paper Award at the 1987 ACM/IEEE Design Automation Conference, Distinguished Paper citations at the 1989 IFIP VLSI Conference and the 1990 IEEE International Conference on Computer-Aided Design, a Best Paper Nomination at the 1994 ACM/IEEE Design Automation Conference, and a Best Paper Nomination at the 1998 Conference on Design Automation and Test in Europe.

