Compiler-Directed Channel Allocation for Saving Power in On-Chip Networks∗

Guangyu Chen, Feihui Li, Mahmut Kandemir
Department of Computer Science and Engineering

Pennsylvania State University
University Park, PA 16802, USA

{gchen,feli,kandemir}@cse.psu.edu

Abstract

Increasing complexity in the communication patterns of embedded applications parallelized over multiple processing units makes it difficult to continue using the traditional bus-based on-chip communication techniques. The main contribution of this paper is to demonstrate the importance of compiler technology in reducing power consumption of applications designed for emerging multi-processor, NoC (Network-on-Chip) based embedded systems. Specifically, we propose and evaluate a compiler-directed approach to NoC power management in the context of array-intensive applications, used frequently in embedded image/video processing. The unique characteristic of the compiler-based approach proposed in this paper is that it increases the idle periods of communication channels by reusing the same set of channels for as many communication messages as possible. The unused channels in this case take better advantage of the underlying power saving mechanism employed by the network architecture. However, this channel reuse optimization should be applied with care as it can hurt performance if two or more simultaneous communications are mapped onto the same set of channels. Therefore, the problem addressed in this paper is one of reducing the number of channels used to implement a set of communications without increasing the communication latency significantly. To test the effectiveness of our approach, we implemented it within an optimizing compiler and performed experiments using twelve application codes and a network simulation environment. Our experiments show that the proposed compiler-based approach is very successful in practice and works well under both hardware based and software based channel turn-off schemes.

Categories and Subject Descriptors D.3.m [Software]: Programming Languages—Miscellaneous

General Terms Experimentation, Languages

Keywords NoC, compiler, energy consumption

∗ This work is supported in part by NSF Career Award #0093082 and a grant from GSRC.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

POPL’06 January 11–13, 2006, Charleston, South Carolina, USA.
Copyright © 2006 ACM 1-59593-027-2/06/0001...$5.00.

1. Introduction

Proliferation of embedded and portable devices is changing the way people perform their daily tasks, communicate, and maintain information critical to them. These devices face very different challenges as compared to their high-end counterparts (e.g., workstations). For example, many embedded devices operate in harsh environments and use low supply voltage levels, which makes them more vulnerable to transient errors. Recent research [18, 17, 33] investigates this growing problem of transient errors and potential solutions. In addition, the memory capacity in embedded devices is much smaller than that in high-end devices, which severely limits the number and types of applications that one can execute in these devices. As a third example, many embedded devices are battery operated, and consequently, reducing their energy consumption is a critical issue.

In the meantime, the architectures employed by embedded devices are also transforming dramatically. Many embedded systems today are built as a System-on-Chip (SoC) type of design; that is, a single chip contains computing resources, memory components, interconnection fabric, and some I/O circuitry. Continuing scaling of manufacturing technology and ever-increasing transistor counts enable designers to place multiple processing units into the same chip. These architectures, commonly known as Chip Multiprocessor (CMP) and Multiprocessor-System-on-Chip (MPSoC), currently have a small number of processors, but the trends indicate that they will have a larger number of processors in the future [44]. The problem of how these processors should be connected to each other has received considerable attention in the last couple of years, and Network-on-Chip (NoC) appears to be a promising solution [9, 14]. In an NoC, multiple processors are connected to each other using a network topology (e.g., star, mesh, tree). NoC architectures are fast replacing conventional bus-based, point-to-point interconnects and have scalability, reliability, and implementation advantages over the conventional bus-based interconnects.

Considering the two trends discussed above, namely, different optimization challenges and increasing use of NoC based architectures, one can see that we need to evaluate the role of the compiler in this new context and investigate the possibility of new compiler optimizations for such systems. While in principle optimization and interprocessor communication strategies similar to those currently used for large off-chip networks [11, 6, 12, 7, 19] can be applied at the chip level as well, increasing on-chip power consumption demands a power-aware design and optimization process. Motivated by this observation, this paper proposes and evaluates a compiler-directed approach to NoC power management in the context of array-intensive applications, which are used frequently in embedded image/video processing [41].


Figure 1. A motivating example for channel reuse optimization. (a) Original paths for messages m1 and m2. (b) Paths for messages m1 and m2 after re-routing.

Recent research shows that NoC power consumption can be significantly reduced by using voltage/frequency scaling on communication channels [36] or shutting down the idle channels [40]. Such techniques, while very effective in reducing power consumption in certain cases, work best when communication channels have long idle periods, so that the performance/power overheads of switching between voltage levels and between channel turn-offs/turn-ons can be compensated. Specifically, long idle periods are preferable from the viewpoint of maximizing power savings through channel turn-off. We propose exposing the on-chip communication network to the compiler for energy optimization. The unique characteristic of the compiler-based approach proposed in this paper is that it increases the idle periods of channels by reusing the same set of channels for as many communication messages as possible. The unused channels in this case take better advantage of the underlying power-saving mechanism employed by the network architecture. However, this channel reuse optimization should be applied with care as it can hurt performance if two or more simultaneous communications are mapped onto the same set of channels. Therefore, the problem addressed in this paper is one of reducing the number of channels used to implement a set of communications without increasing the communication latency significantly.

To illustrate our approach, we consider the scenario shown in Figure 1. In this scenario, two messages, denoted as m1 and m2, are sent one after the other (i.e., m2 is sent after m1 completes its transmission) in a mesh network. Their original (default) paths based on well-known X-Y routing¹ are shown in Figure 1(a). The total number of communication channels used by the two messages is 6. On the other hand, if m2 can be re-routed as in Figure 1(b), we see that only 4 channels are in use. Since the case shown in Figure 1(b) has more idle channels than the one in Figure 1(a), one can expect higher energy savings with it when using a channel turn-off mechanism such as [40]. It is important to note that the channel reuse in this example does not cause any extra channel contention (and thus does not degrade the network performance), since the two uses of the same channel occur at different times.

We formulate our channel reuse/allocation problem using a graph structure called the connection interference graph and solve it using a heuristic approach. The solution determines a route for each message to improve channel sharing. To test the effectiveness of our approach, we implemented it within an optimizing compiler and performed experiments using twelve application codes and a network simulation environment. Our experiments show that the proposed compiler-based approach is very successful in practice and works well under both hardware based and software based channel turn-off schemes. Also, we found that the compiler approach does not bring significant performance overheads over those already incurred by the underlying mechanism that implements channel turn-off.

¹ X-Y routing is a deadlock-free routing scheme for mesh networks. When using X-Y routing, a packet is first routed along the X-axis until it arrives at the node in the same column as the destination node. After that, it is routed along the Y-axis until it arrives at the destination node [15].

One of the main contributions of this paper is to demonstrate the importance of compiler technology in optimizing applications targeting emerging multi-processor, NoC-based embedded systems.

The rest of this paper is structured as follows. Section 2 gives background knowledge on NoC architecture, wormhole switching technology, and power saving through channel turn-off. Section 3 discusses routing flexibility of messages. Section 4 gives our power-aware message routing algorithm. We evaluate the effectiveness of our approach in Section 5. Section 6 discusses the related work, and finally, Section 7 concludes the paper and briefly mentions future research directions.

2. Preliminaries

Before presenting the details of our approach that increases channel idleness, we review in this section four important concepts. These concepts are important since they are related to the compiler optimization discussed in this paper.

2.1 NoC Architecture

We focus on an M × N on-chip mesh network² as shown on the left part of Figure 2. Each node in the mesh consists of a router, a processor, and a memory component (see the right side of Figure 2). We use ηi to denote the ith node of the mesh. Functions col(i) and row(i) give the column and row of node ηi, respectively. Nodes ηi

and ηj are adjacent to each other if and only if

|col(i) − col(j)| + |row(i) − row(j)| = 1.

Each pair of adjacent nodes, ηi and ηj, are connected by a pair of directed channels, λi,j (from ηi to ηj) and λj,i (from ηj to ηi). We classify the channels in this mesh into four categories, N (north), E (east), S (south), and W (west), as shown in Figure 3. Formally, these four categories can be expressed as follows:

N = { λi,j | col(j) − col(i) = 0 ∧ row(j) − row(i) = −1 }
E = { λi,j | col(j) − col(i) = 1 ∧ row(j) − row(i) = 0 }
S = { λi,j | col(j) − col(i) = 0 ∧ row(j) − row(i) = 1 }
W = { λi,j | col(j) − col(i) = −1 ∧ row(j) − row(i) = 0 }

A parallel program in this work consists of a set of parallel processes running on different processors of the mesh. Processes running on different nodes communicate with each other through exchanging messages over logical connections (or connections, for short). Specifically, a process running on node ηi can send messages to another process running on node ηj through logical connection [[i, j]]. The set of channels used to transfer a message from its source node to its destination is determined by the routing scheme used by the implementation of the mesh. Routing schemes [30] can be classified into two categories: dynamic and static. Dynamic routing determines the communication channels used to transfer each message dynamically based on the network traffic state during the message transfer time. Static routing, on the other hand, determines the channels to be used to transfer each message statically based on the source and destination nodes of the message, irrespective of the dynamic network traffic state. Hu and Marculescu compared the advantages and drawbacks of the dynamic and static routing schemes, and concluded that static routing is more suitable for NoC based systems due to its low cost, ease of implementation, and power efficiency [21]. Consequently, in this paper, we employ static routing for all messages.

2 As will be discussed later, our approach can work with other types of on-chip networks as well.
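To make the notation above concrete, the following Python sketch (our own illustration, not part of the paper) models nodes with a row-major numbering and computes the channel set that the default X-Y routing would assign to a connection [[i, j]]; the mesh size and helper names are assumptions of the sketch.

# A minimal sketch of the node/channel notation and of default X-Y routing.
# Row-major node numbering and the 5 x 5 size are assumptions of this sketch.

N = 5                            # assumed number of columns (5 x 5 mesh)

def row(i):                      # row of node eta_i
    return i // N

def col(i):                      # column of node eta_i
    return i % N

def node(r, c):                  # node id from (row, column)
    return r * N + c

def xy_route(src, dst):
    """Channels, as (from_node, to_node) pairs, used by X-Y routing:
    move along the X axis (columns) first, then along the Y axis (rows)."""
    path, cur = [], src
    step = 1 if col(dst) > col(cur) else -1
    while col(cur) != col(dst):                  # horizontal phase
        nxt = node(row(cur), col(cur) + step)
        path.append((cur, nxt))
        cur = nxt
    step = 1 if row(dst) > row(cur) else -1
    while row(cur) != row(dst):                  # vertical phase
        nxt = node(row(cur) + step, col(cur))
        path.append((cur, nxt))
        cur = nxt
    return path

print(xy_route(0, 12))   # [(0, 1), (1, 2), (2, 7), (7, 12)] on the assumed 5 x 5 mesh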


Figure 2. Two-dimensional mesh and details of a node.

Figure 3. Four categories of channels: N (north), E (east), S (south), and W (west).

A static routing can be expressed using a routing function R that maps each connection to a path from the source node to the destination node. In the following discussion, we use R(i, j) to denote the set of channels in the path to which connection [[i, j]] is mapped. Note that there exist many static routing schemes. The compiler’s main task in our work is to select the ideal static scheme for each message to reduce energy consumption and prevent performance degradation.

2.2 Wormhole Switching

In NoC based systems, messages with various lengths are divided into equal-sized packets. The packets are routed towards their destination nodes through a series of intermediate nodes. At an intermediate node, a router (or switch) switches a packet coming from one of its input ports to an appropriate output port. Four types of switching techniques are usually used for this purpose [15]: circuit switching, packet switching, virtual cut-through switching, and wormhole switching. Wormhole switching has been widely used in new-generation multiple processor systems [30, 28]. Wormhole switching divides each packet into a number of flits for transmission. The size of a flit is determined by the system architecture; normally, the bits of the same flit are transmitted in parallel from the source router to the destination router. The header flit (or flits) contains the routing information for the packet. As the header advances along the specified route, the remaining flits follow in a pipelined fashion, as shown in Figure 4. Once a channel has been acquired by a packet, it is reserved for the packet. An occupied channel is released when the last (tail) flit has been transmitted on the channel. If the header flit encounters a channel that is already in use, it is blocked until the channel becomes available, and, when the header flit is blocked, all the flits following it are blocked as well and remain in the flit buffers along the established route.

Wormhole switching allows a packet to hold some channels while requesting others. Consequently, it can lead to deadlock in certain situations. Figure 5 shows an example of deadlock involving four routers and four packets. Each packet is holding a flit buffer of a channel while requesting the flit buffer being held by another packet. Several deadlock free routing algorithms are discussed in the literature [30]. In this paper, we avoid deadlock by adopting the following deadlock avoiding rules when establishing a connection (i.e., when determining the route for the packets):

Figure 4. Wormhole switching. A packet reserves the flit buffers of all the channels along the path from the source node to the destination node.

Figure 5. An example of deadlock involving four packets. Packet 1 is waiting for a flit buffer held by packet 2; packet 2 is waiting for a flit buffer held by packet 3; packet 3 is waiting for a flit buffer held by packet 4; and packet 4 is waiting for a flit buffer held by packet 1.

• (1) A connection must be mapped to a shortest path from its source node to its destination node.

• (2) Any connection [[i, j]] where col(j) > col(i) cannot use the vertical channels in the even columns, and any connection [[i, j]] where col(j) < col(i) cannot use the vertical channels in the odd columns.

These two rules avoid deadlock by preventing the packets going westward from contending for the channels (or flit buffers) with the packets going eastward, so that a cyclic wait (as shown in Figure 5), which is a necessary condition for a deadlock, cannot occur.
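As an illustration only (with an assumed row-major node numbering and 0-based column indices, which need not match the paper's own numbering), the sketch below checks whether a candidate path respects both deadlock avoiding rules.

# Sketch: test a candidate path against rules (1) and (2).  A path is a list of
# (from_node, to_node) channels; nodes are numbered row-major on an N-column mesh.

N = 5  # assumed number of columns

def row(i): return i // N
def col(i): return i % N

def obeys_rules(path, src, dst):
    # Rule (1): the path must have shortest (Manhattan) length.
    # (The sketch assumes the path is a contiguous walk from src to dst.)
    if len(path) != abs(col(dst) - col(src)) + abs(row(dst) - row(src)):
        return False
    # Rule (2): if col(dst) > col(src), vertical channels may sit only in odd
    # columns; if col(dst) < col(src), only in even columns.  (Parity is relative
    # to this sketch's 0-based column numbering.)
    for a, b in path:
        if col(a) == col(b):                       # a vertical channel
            if col(dst) > col(src) and col(a) % 2 == 0:
                return False
            if col(dst) < col(src) and col(a) % 2 == 1:
                return False
    return True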

2.3 Architectural Support for Compiler Directed Routing

In an NoC architecture, a message generated by an application process is first divided into equal-sized packets. A packet is the minimum unit for routing. The header of a packet contains the identity of the destination node. To support compiler-directed routing, each input channel of a router in our target NoC architecture is associated with an N-entry routing table (where N is the number of the nodes in the mesh). The table entries are indexed by the destination node id in the header of the packet. The ith entry of the routing table associated with the kth input channel of a router contains 2 bits, indicating to which of the four output channels the router forwards a packet coming from the kth input channel with the destination node id ηi. In this work, the values of the routing table entries are determined by the compiler based on the communication behavior of the application, and are loaded from the off-chip memory in the initialization stage of the application. During the execution of the application, the contents of the routing tables are not modified.
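Schematically, the table-driven forwarding described above can be pictured as follows; the node count, the port encoding, and the field names are assumptions of this sketch, not the actual router design.

# Sketch of per-input-channel routing tables: T[k][c][d] is the 2-bit entry for
# router k, input channel c, and destination node d.  Port names and sizes are
# assumed for illustration.

NUM_NODES = 25                          # assumed 5 x 5 mesh
OUT_PORTS = ("N", "E", "S", "W")        # assumed meaning of the 2-bit entry

# One table per input channel of every router; the compiler fills these at
# application-initialization time and they are never modified afterwards.
T = {k: {c: [0] * NUM_NODES for c in range(5)}   # c = 0: the local-processor port
     for k in range(NUM_NODES)}

def forward(router_id, in_channel, header):
    """Select the output port of a packet using only the destination id in its header."""
    entry = T[router_id][in_channel][header["dest"]]
    return OUT_PORTS[entry]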


2.4 Network Power Saving Through Channel Turnoff

Several new channel architectures have recently been proposed for global wires, replacing full-swing repeatered wires [20, 23]. Due to features such as differential and pulsed current-mode signaling, these channels have power profiles that are invariant to utilization. That is, these communication channels dissipate substantial power even when no data is being transmitted. In on-chip networks using such communication channels, turning channels on/off in response to varying utilization can harvest considerable energy savings. The strategies on when to turn off/on a channel have been investigated in several recent papers [39, 40].

A channel turn off/on strategy can be either reactive or proactive. A reactive strategy determines the power state of channels based on history information, and it is normally implemented in hardware. Specifically, a channel power control component monitors the usage of each communication channel. It turns off some channels to conserve energy when it observes that the network traffic is light, and turns on some channels when it detects that the current active channels are being overloaded. On the other hand, a proactive strategy requires high level information about the behavior of the application, and it is usually implemented in software. Specifically, a compiler can insert instructions at certain points of the program to turn off some channels when it detects that, beyond that point, these channels will not be used. Similarly, when the compiler detects that some channels that have been turned off will be used at a certain point in the program, it can insert instructions to pre-activate these channels before they are actually used to hide the latency due to turning on channels.

The success of these channel turn-off strategies, either reactive or proactive, however, depends on the following factors: (i) the number of channels that can be turned off, (ii) the length of the time period during which each channel remains in the power-off mode, and (iii) the performance and energy penalties associated with power mode changes. Our compiler-directed channel allocation can enhance the effectiveness of a channel turn-off scheme through channel reuse. Specifically, the compiler determines a route for each message (all the packets of that message follow the same route) in such a way that some channels used to transmit certain packets can be reused to transmit other packets as well. As a result, we can keep fewer channels in the power-on mode, which also means that more channels can remain in the power-off mode without being re-activated. In addition, reusing the active channels reduces the frequency of power mode changes, which is beneficial from the viewpoint of both performance and energy consumption.
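For intuition, a toy model of a reactive turn-off policy is sketched below: a channel is shut down after a fixed idle threshold and pays a wake-up penalty on its next use. The constants and the class interface are placeholders; this is not the actual hardware scheme of [39, 40].

# Toy sketch of a reactive (history-based) channel turn-off policy.
# Thresholds and delays are placeholders, not the values used in [39, 40].

class Channel:
    def __init__(self, shutdown_threshold=150, powerup_delay=100):
        self.shutdown_threshold = shutdown_threshold   # idle cycles before turn-off
        self.powerup_delay = powerup_delay             # cycles to re-activate
        self.on = True
        self.idle = 0
        self.wakeups = 0                               # re-activation penalties paid

    def tick_idle(self):
        """Called every cycle in which no flit crosses the channel."""
        self.idle += 1
        if self.on and self.idle >= self.shutdown_threshold:
            self.on = False                            # reactive shut-down

    def transmit(self):
        """Called when a flit needs the channel; returns the extra latency paid."""
        extra = 0
        if not self.on:
            extra = self.powerup_delay
            self.on = True
            self.wakeups += 1
        self.idle = 0
        return extra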

3. Routing Flexibility

As mentioned earlier, our goal is to assign a route to each packet such that the total number of channels used in communication is minimized without significantly hurting performance. We define the distance from ηi to ηj as the number of channels in the shortest path from ηi to ηj. Note that, given a source node ηi and a destination node ηj in a mesh, there may exist multiple shortest paths. We define the routing flexibility for connection [[i, j]] as the number of different shortest paths from ηi to ηj, under the deadlock avoiding rules listed earlier. In a mesh network, the distance from node ηi to node ηj can be computed as:

∆x + ∆y,

where ∆x = |col(j) − col(i)| and ∆y = |row(j) − row(i)|. Any shortest path from ηi to ηj contains ∆x horizontal (eastward or westward) channels, and ∆y vertical (northward or southward) channels.

We use F(i, j) to denote the routing flexibility for connection [[i, j]], which can be computed as follows.

                         col(i′) − col(i) < 0    col(i′) − col(i) = 0    col(i′) − col(i) > 0
row(i′) − row(i) < 0     Neven ∪ W               N                       Nodd ∪ E
row(i′) − row(i) = 0     W                       ∅                       E
row(i′) − row(i) > 0     Seven ∪ W               S                       Sodd ∪ E

Nodd = N ∩ { λi,j | col(i) is odd }      Neven = N ∩ { λi,j | col(i) is even }
Sodd = S ∩ { λi,j | col(i) is odd }      Seven = S ∩ { λi,j | col(i) is even }

Figure 6. Definition of Ω(i, i′), which is used in computing function Λ(i, i′), the set of channels that can potentially be used by connection [[i, i′]].

First, we can calculate the number of vertical columns whose channels can be used by connection [[i, j]] as follows:

T =  x2/2 − x1/2,                  if x2 > x1;
     (x1 + 1)/2 − (x2 + 1)/2,      if x2 < x1;
     1,                            if x2 = x1;

where x1 = col(i) and x2 = col(j). A shortest ηi to ηj path contains ∆y = |row(j) − row(i)| vertical channels, which are distributed over the columns that can be used by this connection; each distribution corresponds to a shortest path. Therefore, we have:

F(i, j) = C(T + ∆y − 1, T − 1) = (T + ∆y − 1)! / ((T − 1)! ∆y!).

For ease of discussion, let us define a function Λ(i, i′), which gives the set of communication channels in the mesh that can be used by a shortest path from node ηi to node ηi′, as follows:

Λ(i, i′) = { λx,y | col(x), col(y) ∈ [c1, c2] ∧ row(x), row(y) ∈ [r1, r2] ∧ λx,y ∈ Ω(i, i′) },

where c1 = min(col(i), col(i′)), c2 = max(col(i), col(i′)), r1 = min(row(i), row(i′)), r2 = max(row(i), row(i′)), and Ω(i, i′) is as defined in Figure 6.

Figure 7 shows the links in a 5 × 7 mesh that can be used by connection [[i, j]], i.e., by the messages routed from ηi to ηj. The thick arrows in Figure 7 are the channels in set Λ(i, j). In the top portion of this figure, we have col(j) < col(i); consequently, the vertical channels in the odd columns cannot be used by connection [[i, j]]. Since node ηi is in an even column, the first hop of connection [[i, j]] can go either horizontally or vertically. Similarly, the last hop of the connection can go either horizontally or vertically since ηj is also in an even column. On the bottom of this figure, however, both nodes ηi and ηj are in the odd columns; therefore, neither the first nor the last hop of the connection can go vertically. In this figure, we can observe that, due to the shortest path constraint, all the channels that might be used by connection [[i, j]] are within the shaded area. In the rest of this paper, we refer to this area as the routing area.
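Instead of the closed-form expressions, both F(i, j) and the channel content of the routing area can be obtained by brute-force enumeration of the legal shortest paths, as in the sketch below; the row-major, 0-based node numbering is an assumption of this illustration, and the enumeration stays cheap because on-chip paths are short.

# Sketch: enumerate all shortest eta_i -> eta_j paths that respect the deadlock
# avoiding rules, then derive F(i, j) (their count) and the Lambda(i, j)-like
# set of channels they can touch.  Node ids are row-major, columns 0-based.

N_COLS = 7   # assumed mesh width (e.g., the 5 x 7 mesh of Figure 7)

def row(i): return i // N_COLS
def col(i): return i % N_COLS

def legal_shortest_paths(src, dst):
    paths = []

    def extend(cur, path):
        if cur == dst:
            paths.append(list(path))
            return
        if col(cur) != col(dst):                     # horizontal hop toward dst
            nxt = cur + (1 if col(dst) > col(cur) else -1)
            path.append((cur, nxt)); extend(nxt, path); path.pop()
        if row(cur) != row(dst):                     # vertical hop, if this column is allowed
            allowed = (col(dst) == col(src)
                       or (col(dst) > col(src) and col(cur) % 2 == 1)   # rule (2), odd columns
                       or (col(dst) < col(src) and col(cur) % 2 == 0))  # rule (2), even columns
            if allowed:
                nxt = cur + (N_COLS if row(dst) > row(cur) else -N_COLS)
                path.append((cur, nxt)); extend(nxt, path); path.pop()

    extend(src, [])
    return paths

def flexibility(src, dst):                            # F(i, j)
    return len(legal_shortest_paths(src, dst))

def usable_channels(src, dst):                        # union of channels over legal paths
    return {ch for p in legal_shortest_paths(src, dst) for ch in p}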

4. Compiler-Directed Power-Aware Routing

The previous section shows that, in transmitting packets, one has certain flexibility. The question is whether we can take advantage of this flexibility in reducing energy consumption of NoC. Based on our discussion in Section 2.4, this can be achieved if we can assign routes to packets (by exploiting routing flexibility) such that idle periods of channels are lengthened. To increase the energy savings obtained through channel turn-off, one may want to reduce the number of channels used by an application at any given time in execution so that more channels can be turned off.


Figure 7. Routing area and the channels that might be used by connection [[i, j]], under the deadlock avoiding rules we have. The channels in Λ(i, j) are marked with thick arrows, and the routing area of connection [[i, j]] is shaded. Top: both col(i) and col(j) are even numbered columns. Bottom: both col(i) and col(j) are odd numbered columns.

The number of channels used by an application can be reduced by employing a routing scheme that lets multiple connections share some channels. This is called channel reuse in this paper.

A potential drawback of this approach is that channel sharing can increase packet conflicts. Two packets conflict with each other if their transmission times overlap and there exists a channel that transfers both packets. Let us assume that packet p1 is sent by node ηi at time t1, and received by node ηi′ at time t′1, and that packet p2 is sent by node ηj at time t2, and received by node ηj′ at time t′2. According to our definition, packets p1 and p2 conflict with each other if and only if

R(i, i′) ∩ R(j, j′) ≠ ∅  ∧  [t1, t′1] ∩ [t2, t′2] ≠ ∅,

where R(i, i′) and R(j, j′) are the sets of channels used by connections [[i, i′]] and [[j, j′]], respectively. Further, if there exist packets p1 and p2 that are transferred over connections [[i, i′]] and [[j, j′]], respectively, and their transmission times overlap with each other, we say that connections [[i, i′]] and [[j, j′]] interfere with each other.

In general, a conflict between two packets incurs both performance and energy penalties since these packets have to contend for the shared communication channels. Particularly, in a wormhole switching network (the one used in this work), a packet holds all the channels in the path from the source node to the destination node until all its flits arrive at the destination node. In this scenario, if two packets conflict with each other, one packet is blocked until the other completes its transmission. Conflicts can be avoided if we re-route the connections such that the connections that interfere with each other do not share any channel. That is, it is not sufficient just to reduce the number of channels used by a given set of packets. We also need to ensure that the packet conflicts are kept at a minimum.
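The conflict test translates directly into code; in the small sketch below the packet and routing representations are assumed for illustration.

# Sketch: two packets conflict iff the channel sets of their connections
# intersect and their transmission intervals overlap.  A packet is assumed to
# be (src, dst, t_send, t_recv); R maps a (src, dst) connection to its channel set.

def packets_conflict(p1, p2, R):
    i, i2, t1, t1e = p1
    j, j2, t2, t2e = p2
    share_channel = bool(R[(i, i2)] & R[(j, j2)])   # R(i,i') and R(j,j') intersect
    times_overlap = t1 <= t2e and t2 <= t1e         # [t1,t1'] and [t2,t2'] overlap
    return share_channel and times_overlap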

The rest of this section gives the details of our compiler approach. Figure 8 shows the three components of our approach. In the first step (Section 4.1), we build a connection interference graph that captures the interference behavior among messages.

Figure 8. The components of our approach.

In the second step (Section 4.2), we determine a static route for each message. Since our routing scheme is static, all the packets of the messages with source node ηi and destination node ηj follow the same route determined by our compiler for connection [[i, j]]; therefore, we use the terms “message routing” and “connection routing” interchangeably. In the third step (Section 4.3), we explain how to compute the entries of the routing tables (defined in Section 2.3) of each router.

4.1 Connection Interference Graph

Our approach to packet routing employs a structure called the connection interference graph. A connection interference graph of a program P can be represented as G(P) = (C(P), E(P)), where C(P) is the set of connections used by P, and E(P) is the set of edges in the graph. Each edge ([[i, i′]], [[j, j′]]) ∈ E(P) indicates that connections [[i, i′]] and [[j, j′]] interfere with each other.

Given a connection interference graph G(P) = (C(P), E(P)) for program P, we define the conflict factor for routing R as:

θ(R, G) = Σ_{([[i,i′]],[[j,j′]]) ∈ E(P)} q(R(i, i′) ∩ R(j, j′)),

where function q is defined as follows:

q(x) = 1 if x ≠ ∅, and q(x) = 0 if x = ∅.

The number of channels used by program P can be calculated as:

σ(R, G) = | ∪_{[[i,i′]] ∈ C(P)} R(i, i′) |.

Generally speaking, we can reduce the conflict factor by using more channels. However, using more channels usually means more energy consumption due to keeping active the channels that would be idle otherwise. Ideally, we want to minimize the conflict factor by using the minimum number of channels; that is, we want to find a routing function R∗ such that:

∀R ≠ R∗ : θ(R, G) > θ(R∗, G) ∨ (θ(R, G) = θ(R∗, G) ∧ σ(R, G) ≥ σ(R∗, G)).
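Both metrics are simple to evaluate for a given routing; the following sketch assumes connections are hashable keys, R maps each connection to a set of channels, and the interference edges are given as pairs.

# Sketch: evaluate the conflict factor theta(R, G) and the channel count sigma(R, G).

def conflict_factor(R, edges):
    # q(x) = 1 when the intersection is non-empty, 0 otherwise
    return sum(1 for (c1, c2) in edges if R[c1] & R[c2])

def channels_used(R, connections):
    used = set()
    for c in connections:
        used |= R[c]          # union of the channel sets of all connections
    return len(used)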

The connection interference graph captures the communication behavior of a given parallel program. It can be built either through static (compiler based) code analysis or profiling. In the profiling approach, we first build a weighted connection interference graph, which is different from the connection interference graph defined above in that each edge in the weighted connection interference graph has an attached counter. We instrument the application code and communication library to keep track of the time when each packet is sent. Let us assume two packets, p1 and p2, are sent at time t1 and t2, respectively. Given the data rate of a channel r and the size of a packet s, we can estimate the transmission time of a packet as s/r. Note that a wormhole switching network transmits packets in a pipelined fashion, and thus, the transmission time of a packet does not vary based on the distance from the source node to the destination node as long as the communication channels used by this packet are not blocked by other packets. Let us further assume that the source and destination nodes of p1 are ηi and ηi′, respectively, and that the source and destination nodes of p2 are ηj and ηj′, respectively. We can add an edge ([[i, i′]], [[j, j′]]) and initialize a counter associated with this edge to 1 if |t2 − t1| < s/r. If this edge already exists in the weighted interference graph we are building, we increase the value of the counter associated with this edge. We derive a connection interference graph from this weighted graph by eliminating the edges with counter values lower than a given threshold. An advantage of this profiling based approach is that it can capture the dynamic behavior of the application that cannot be fully determined through static analysis. However, profiling an application is usually time consuming, and the behavior of the optimized application may be tied to the input used in profiling. These two drawbacks make the profiling based approach less attractive.
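As a sketch of this profiling route (the log format, rate/size parameters, and threshold are assumptions of the illustration, not the paper's instrumentation), a packet-send log can be folded into a weighted graph and then thresholded:

# Sketch: build a weighted connection interference graph from a profiled log of
# packet sends, then keep only edges whose counters reach a threshold.
# Each log record is (src, dst, t_send); s/r approximates a packet's transfer time.

from collections import defaultdict
from itertools import combinations

def interference_graph_from_profile(log, packet_size, channel_rate, threshold=1):
    tx_time = packet_size / channel_rate
    weights = defaultdict(int)
    for (i, i2, t1), (j, j2, t2) in combinations(log, 2):
        if (i, i2) == (j, j2):
            continue                           # same connection: no edge needed
        if abs(t2 - t1) < tx_time:             # the two transmissions overlap in time
            weights[frozenset([(i, i2), (j, j2)])] += 1
    # the (unweighted) connection interference graph keeps the heavy edges only
    return {edge for edge, count in weights.items() if count >= threshold}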

For an array/loop-based embedded parallel program, the connection interference graph can also be built through static code analysis. Such an embedded program typically involves a set of processes running in parallel. These parallel processes communicate with each other through message passing. All the processes synchronize at certain synchronization points (such as waiting at a barrier or invoking a broadcast operation). Let us assume that a program creates n parallel processes running on nodes η1, η2, ..., ηn. Each process consists of a set of loop nests. We use Li,j to denote the jth loop nest in the process code running on node ηi. We say that loop nests Li,j and Li′,j′ are parallel with respect to each other if i ≠ i′ and at least one of the following conditions is satisfied:

• There exist a message-sending operation α in the body of Li,j and a message-receiving operation β in the body of Li′,j′ such that β receives the message sent by α.

• There exist a message-sending operation α in the body of Li′,j′ and a message-receiving operation β in the body of Li,j such that β receives the message sent by α.

• There exists a loop nest Li′′,j′′ such that Li,j and Li′′,j′′ are parallel with respect to each other, and that Li′,j′ and Li′′,j′′ are parallel with respect to each other.

• Both Li,j and Li′,j′ are executed immediately after the same synchronization point. Note that, since all the processes start their execution simultaneously, the start point of the program can be regarded as an implicit synchronization point.

The nodes of the connection interference graph are determined by searching all the message-sending operations in the application code. Specifically, for each message-sending operation in the code that is executed on node ηi and sends packets to node ηj, we add a node representing the connection [[i, j]] to the connection interference graph. We add an edge between nodes [[i, j]] and [[i′, j′]] if all of the following conditions are satisfied:

• There exist loop nests Li,k and Li′,k′ such that Li,k and Li′,k′ are parallel with respect to each other.

• There exists a message-sending operation in the body of Li,k that sends a message to the process running on node ηj.

• There exists a message-sending operation in the body of Li′,k′ that sends a message to the process running on node ηj′.

Figure 9 gives an example for building the connection interference graph of a given parallel code. Figure 9(a) shows a parallel code that consists of four processes running on nodes η0, η1, η2, and η3. Figure 9(b) summarizes the control flow and the message flow of this code. In this code, we observe that loops L0,1, L1,1, L2,1, and L3,1 are parallel with respect to each other, and loops L0,2, L1,2, L2,2, and L3,2 are parallel with respect to each other. By applying the rules described above, we can build the connection interference graph shown in Figure 9(c). Recall that each edge in the connection interference graph indicates a pair of connections that may interfere with each other. For example, one can observe in Figure 9(b) that a packet from node η0 to node η1 and a packet from node η3 to node η2 may be in transit at the same time. If these two packets require the same communication channel, one of them may be blocked by the other. Therefore, we have an edge ([[0, 1]], [[3, 2]]) in the graph shown in Figure 9(c). On the other hand, we do not have an edge such as ([[0, 1]], [[1, 2]]) since all the packets from node η0 to node η1 are transferred before the barrier while all the packets from node η1 to node η2 are transferred after the barrier, and thus connections [[0, 1]] and [[1, 2]] cannot interfere with each other.

Note that the approach explained here is just one possible way of building a connection interference graph. For a given program, other possible algorithms may generate different connection interference graphs, which capture the communication behavior of the parallel program with different accuracies. The routings computed based on these graphs would typically result in different energy efficiencies and incur different performance penalties. It should be noted that the accuracy of a connection interference graph in capturing communications may affect the energy consumption and performance of the program; however, it does not affect the correctness of the program. This is because a turned off channel is automatically activated upon a request that wants to use it. In our experimental evaluation, we use the connection interference graphs built through the static code analysis approach explained above.
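A deliberately simplified sketch of the static construction is given below: processes are summarized as per-phase send sets (phases being the code regions between synchronization points), and any two connections whose sends fall in the same phase are treated as interfering. This reproduces the graph of the Figure 9 example but omits the send/receive-matching and transitivity rules; the data layout is an assumption of the sketch.

# Simplified sketch: phases[p][k] is the set of destinations to which the process
# on node p sends during phase k (the phases are separated by barriers).

from itertools import combinations

phases = {
    0: [{1}, {3}],   # eta_0: L0,1 sends to eta_1; after the barrier, L0,2 sends to eta_3
    1: [{0}, {2}],
    2: [{3}, {1}],
    3: [{2}, {0}],
}

def build_interference_graph(phases):
    connections, edges = set(), set()
    num_phases = max(len(v) for v in phases.values())
    for k in range(num_phases):
        # connections whose message-sending operations execute in phase k
        active = [(p, d) for p, v in phases.items() if k < len(v) for d in v[k]]
        connections.update(active)
        # any two distinct connections active in the same phase may interfere
        for c1, c2 in combinations(active, 2):
            edges.add(frozenset([c1, c2]))
    return connections, edges

conns, edges = build_interference_graph(phases)
# e.g. frozenset({(0, 1), (3, 2)}) is in `edges`, matching the edge ([[0,1]], [[3,2]])
# of Figure 9(c), while no edge relates (0, 1) and (1, 2).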

4.2 Routing Algorithm

Finding the optimal routing R∗ for a given connection interference graph G is NP-hard. Consequently, instead of searching for an optimal solution, we study a greedy heuristic approach.

Figure 10 gives our routing algorithm. Our algorithm takes G = (C, E), the connection interference graph of program P, as input; it maps each connection to a set of channels that forms a shortest path from the source node to the destination node. The main body of our algorithm is a while-loop, which iterates over all the connections in C. In each iteration, it selects a connection with the lowest routing flexibility and maps this connection to a set of channels in the mesh. When mapping a connection, our algorithm selects the shortest path from the source node to the destination node such that this path incurs the minimum number of interferences with the connections that have already been mapped. The reason for our “lowest-routing-flexibility-first” policy can be explained as follows. As more connections are routed, the number of channels that can be used by the current connection under consideration without increasing the number of interferences is reduced. When routing a connection with high routing flexibility, we still have chances to avoid the potential interferences with the connections that have already been routed, even when the number of the already routed channels is large. Therefore, we route the connections with lower routing flexibilities first because they have fewer routing options to avoid the interferences with the connections that have already been routed.

Without increasing the number of interferences, our algorithm tries to maximize channel reuse so that we can reduce the number of channels needed by the program. As explained earlier, this in turn increases the idle periods for some channels in the mesh. For each channel λi,i′ in the mesh, our algorithm maintains a sharing counter S[i, i′], which keeps track of the number of connections that might use this channel. When routing a connection [[j, j′]], without increasing the number of inter-connection interferences, our algorithm seeks a path P from ηj to ηj′ with the maximum sharing number ψ(P).


Node η0:   L0,1: for(...) { ... send(η1, ...); receive(η1, ...); ... }   barrier;   L0,2: for(...) { ... send(η3, ...); receive(η3, ...); ... }
Node η1:   L1,1: for(...) { ... send(η0, ...); receive(η0, ...); ... }   barrier;   L1,2: for(...) { ... send(η2, ...); receive(η2, ...); ... }
Node η2:   L2,1: for(...) { ... send(η3, ...); receive(η3, ...); ... }   barrier;   L2,2: for(...) { ... send(η1, ...); receive(η1, ...); ... }
Node η3:   L3,1: for(...) { ... send(η2, ...); receive(η2, ...); ... }   barrier;   L3,2: for(...) { ... send(η0, ...); receive(η0, ...); ... }

(a) Parallel code running on nodes η0, η1, η2, and η3. “send” and “receive” are the message-sending and message-receiving operations. (b) Control flow and message flow. (c) Resulting connection interference graph.

Figure 9. Building a connection interference graph.

The sharing number ψ(P) of a path P can be computed as follows:

ψ(P) = Σ_{λi,i′ ∈ P} S[i, i′].

After mapping a connection [[j, j′]] to a path P, we reduce the value of S[i, i′] by one for each channel λi,i′ ∈ (Λ(j, j′) − P), since we now know that the channels in Λ(j, j′), except for those in P, cannot be used by connection [[j, j′]].

Since our algorithm iterates over all the shortest paths for each connection, its computational complexity is O(MN), where M is the number of the connections, and N is the maximum routing flexibility across all the connections. Therefore, we can expect our algorithm to be fast in practice.

Figure 11 shows an example application of our algorithm to routing five connections, namely, [[a, a′]], [[b, b′]], [[c, c′]], [[d, d′]], and [[e, e′]] in a 6 × 6 mesh. Figure 11(a) gives the connection interference graph (G) for these connections. Figures 11(b) through (f) show the procedure of initializing the sharing counter for each channel. The number beside each channel is the value of the sharing counter for this channel. We omit the sharing counters with a value of zero. At each step, we increase by one the values of the sharing counters of the channels that can be used by the connection being processed. The channels that can potentially be used by the connection that we are processing are marked using small circles.

The routing flexibilities of connections [[a, a′]], [[b, b′]], [[c, c′]], [[d, d′]], and [[e, e′]] are 4, 1, 10, 6, and 1, respectively. Therefore, the order in which the connections are to be routed is [[b, b′]], [[e, e′]], [[a, a′]], [[d, d′]], and [[c, c′]].

Input:
  G = (C, E) – connection interference graph of program P;
  L = N ∪ E ∪ S ∪ W – the set of channels in the mesh;
Output:
  R[i, i′] – channel set for each connection [[i, i′]] ∈ C;
Internal Variables:
  S[i, i′] – sharing counter for each channel λi,i′;
  W – the set of connections that have not been routed;

procedure routing() {
  for each channel λi,i′ ∈ L { S[i, i′] = 0; }
  for each [[i, i′]] ∈ C
    for each λj,j′ ∈ Λ(i, i′) { S[j, j′] = S[j, j′] + 1; }
  for each [[i, i′]] ∈ C { R[i, i′] = ∅; }
  W = C;
  while (W ≠ ∅) {
    select [[i, i′]] ∈ W with minimum routing flexibility;
    R[i, i′] = route_connection([[i, i′]]);
    for each λj,j′ ∈ (Λ(i, i′) − R[i, i′]) { S[j, j′] = S[j, j′] − 1; }
    W = W − {[[i, i′]]};
  }
}

function route_connection([[i, i′]]) {
  t′ = +∞; s′ = 0; P′ = ∅;
  for each shortest ηi to ηi′ path P that satisfies the deadlock avoiding rules {
    t = 0;
    for each [[j, j′]] ∈ (C − W) such that ([[i, i′]], [[j, j′]]) ∈ E
      if (R[j, j′] ∩ P ≠ ∅) { t = t + 1; }
    s = Σ_{λj,j′ ∈ P} S[j, j′];
    if (t < t′ ∨ (t = t′ ∧ s > s′)) { t′ = t; s′ = s; P′ = P; }
  }
  return P′;
}

Figure 10. Routing algorithm.
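A compact, self-contained rendering of the greedy loop of Figure 10 is sketched below. To stay short it takes the candidate shortest paths of each connection as input (each path given as a set of channels, e.g. produced by an enumeration like the one sketched in Section 3) rather than recomputing them; it illustrates the heuristic, not the compiler's actual implementation.

# Sketch of the greedy heuristic: candidates[c] lists the legal shortest paths
# (each a set of channels) of connection c, and edges holds the interfering
# connection pairs as frozensets.  Lowest routing flexibility is served first.

from collections import defaultdict

def greedy_route(candidates, edges):
    lam = {c: set().union(*paths) for c, paths in candidates.items()}  # ~ Lambda(c)

    share = defaultdict(int)        # S[channel]: connections that might still use it
    for c in candidates:
        for ch in lam[c]:
            share[ch] += 1

    R, waiting = {}, set(candidates)
    while waiting:
        c = min(waiting, key=lambda x: len(candidates[x]))   # lowest flexibility first
        best_key, best_path = None, None
        for path in candidates[c]:
            # t: interferences introduced with already-routed neighbors in the graph
            t = sum(1 for d in candidates
                    if d not in waiting and frozenset([c, d]) in edges and (R[d] & path))
            s = sum(share[ch] for ch in path)                # sharing number psi(path)
            key = (t, -s)                                    # fewer conflicts, more sharing
            if best_key is None or key < best_key:
                best_key, best_path = key, path
        R[c] = best_path
        for ch in lam[c] - best_path:                        # channels c will not use
            share[ch] -= 1
        waiting.remove(c)
    return R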

Our algorithm routes [[b, b′]] first. Routing this connection is trivial since there is only one shortest path from node b to node b′. Figure 11(g) shows the situation after routing connection [[b, b′]]. Since all the channels in Λ(b, b′) are used by the routing, no sharing counter is updated. Similarly, we route [[e, e′]] without the need to update any sharing counter (Figure 11(h)). Connection [[a, a′]] is the third connection to be routed. Our routing algorithm checks all the 4 alternative shortest a to a′ paths, and selects the one that does not interfere with the already routed connections [[b, b′]] and [[e, e′]] and yields maximum channel sharing. Figure 11(i) shows the situation after routing [[a, a′]]. Note that not all the channels in Λ(a, a′) are used by connection [[a, a′]]. Therefore, the sharing counters of the channels in Λ(a, a′) that are not actually used by connection [[a, a′]] are reduced by one after this connection is routed. The updated counters are marked using a lighter color in the figure. Connections [[d, d′]] (Figure 11(j)) and [[c, c′]] (Figure 11(k)) are routed similarly. Figure 11(l) shows all the channels used by our routing algorithm in Figure 10. For this routing (denoted R), we can compute the conflict factor as θ(R, G) = 0, and the number of channels used by the program as σ(R, G) = 19. As a comparison, Figure 12 shows the result of X-Y routing (RXY). This routing uses 22 channels (i.e., σ(RXY, G) = 22), and results in a conflict factor of 2 (i.e., θ(RXY, G) = 2). This simple example demonstrates that our routing algorithm can improve upon the X-Y routing in terms of the number of channels used and the number of conflicts.

4.3 Determining the Contents of Routing Tables

Having determined the routes for the messages, the next task is to fill the entries of the routing tables. Figure 13 gives our algorithm for determining the contents of the routing tables. Specifically, for each routed connection, our algorithm determines the values of the relevant routing table entries in the routers that are used by this connection. Note that, in each router used by a connection, there is only one routing table entry which is relevant to this connection.


Figure 11. Routing connections [[a, a′]], [[b, b′]], [[c, c′]], [[d, d′]], and [[e, e′]] (panels (a) through (l)). The number next to each channel is the value of the sharing counter for this channel, and the sharing counters containing zeros are omitted. The numbers in the light color are the values of the sharing counters that are updated after routing each connection.

Let us assume that the route for [[i, j]] is λi0,i1, λi1,i2, λi2,i3, ..., λin−1,in, where i0 = i and in = j. This connection uses the routers in mesh nodes ηi0, ηi1, ..., ηin. Let us further assume that channel λik−1,ik is connected to input port cin of the router in node ηik and channel λik,ik+1 is connected to output port cout of the router in node ηik, where 1 ≤ k < n. Consequently, we can set the jth entry of the routing table associated with input port cin of the router in node ηik to value cout. Therefore, when a packet whose destination node is ηj comes into the router in node ηik from input port cin, the crossbar of this router will forward this packet to output port cout, from which this packet is transferred to the next router on its path.
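The table-filling step amounts to walking each chosen route and recording, at every router on it, which output port leads to the next channel. The sketch below reuses the row-major numbering assumed earlier and uses placeholder direction names instead of the architecture's 2-bit N/E/S/W encoding.

# Sketch: fill T[(router, in_port)][dest] = out_port from a route given as a list
# of (from_node, to_node) channels.  Port 0 is assumed to be the local-processor
# port; "+col"/"-col"/"+row"/"-row" are placeholder names for the output directions.

from collections import defaultdict

N_COLS = 5   # assumed mesh width, row-major node ids

def port_between(a, b):
    """Direction of the port of router a that faces adjacent router b."""
    if b == a + 1:       return "+col"
    if b == a - 1:       return "-col"
    if b == a + N_COLS:  return "+row"
    if b == a - N_COLS:  return "-row"
    raise ValueError("nodes are not adjacent")

def fill_tables(T, route, dest):
    src, first = route[0]
    T[(src, 0)][dest] = port_between(src, first)        # first hop: local port 0
    for (a, b), (_, c) in zip(route, route[1:]):
        in_port = port_between(b, a)                    # port on which the packet arrives
        T[(b, in_port)][dest] = port_between(b, c)      # port toward the next router

T = defaultdict(dict)
fill_tables(T, [(0, 1), (1, 2), (2, 7), (7, 12)], dest=12)   # e.g. an X-Y route on a 5x5 mesh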

4.4 Extension to Other Types of On-Chip Networks

So far, we have focused on a mesh network. It should be noted that our compiler directed communication channel allocation can be extended to on-chip networks with other topologies as well. For a non-mesh network, however, the deadlock avoiding rules, function F(i, j) (the routing flexibility of connection [[i, j]]), and function Λ(i, j) (the set of channels that might be used by a shortest path from node ηi to node ηj) need to be redefined accordingly.

5. Experimental Evaluation

In this section, we present an experimental evaluation of our approach to NoC power reduction. Section 5.1 introduces our simulation platform, methodology, and benchmarks, and Section 5.2 discusses the experimental results.

5.1 Platform, Methodology, and Benchmarks

We implemented our approach within the Paradigm compiler [7], using a customized front-end. This compiler takes a sequential code, and produces an optimized message-passing parallel program with calls to the selected communication library and the Paradigm run-time system. It applies several optimizations for reducing the number and volume of interprocessor communication. Consequently, the resulting code is optimized as far as interprocessor communication is concerned. We applied our approach (detailed in Section 4) after well-known communication optimizations such as message vectorization, message coalescing, and message aggregation [19]. The additional increase in compilation time as a result of our optimizations was about 55% when averaged over all the applications tested.

To compute the network energy consumption, we use a modified version of the energy model proposed in [16]. The default values of the important experimental parameters used are listed in Table 1.


Figure 12. Result of X-Y routing.

Input: the route for each connection [[i, i′]] used by program P.
Output: the routing table for each input channel of each router:
  T[k, c, i] — the ith entry of the routing table associated with the cth input channel of the router in node ηk;
  c ∈ {0, 1, 2, 3, 4}; in particular, c = 0 corresponds to the input port from the local processor.

for each connection [[i, j]] used by program P {
  assume that the route for [[i, j]] is λi0,i1, λi1,i2, λi2,i3, ..., λin−1,in, where i0 = i and in = j;
  T[i, 0, j] = the output port of the router in node ηi to which λi0,i1 is connected;
  for k = 1 to n − 1 {
    cin = the input port of the router in node ηik to which λik−1,ik is connected;
    cout = the output port of the router in node ηik to which λik,ik+1 is connected;
    T[ik, cin, j] = cout;
  }
}

Figure 13. Sketch of the algorithm for determining the contents of routing tables.

Parameter                                  Value
Mesh Size                                  5 × 5
Active Channel Energy Consumption          10.2 pJ/bit
Idle Channel Energy Consumption            8.5 pJ/cycle
Channel Power-Up Delay                     100 µs
Hardware Channel Shut-Down Threshold       150 µs
Processor Frequency                        1 GHz
Packet Header Size                         3 Flits
Flit Size                                  39 Bits

Table 1. Default values of our simulation parameters.

Most of the values in this table are based on the references [16, 35, 24]. We embedded this model within a simulation environment. This environment takes as input a network description and the application executable and generates as output the network energy consumption and performance statistics.
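For intuition only, the back-of-the-envelope sketch below charges active energy per transferred bit and idle energy per powered-on cycle using the Table 1 constants; it is our own simplification, not the modified model of [16] that the experiments actually use.

# Back-of-the-envelope sketch (not the model of [16]): energy of one channel,
# using the Table 1 constants.  Powered-off cycles are treated as free, which is
# why lengthening off periods through channel reuse saves energy.

ACTIVE_PJ_PER_BIT = 10.2    # active channel energy
IDLE_PJ_PER_CYCLE = 8.5     # powered-on but idle channel energy

def channel_energy_pj(bits_transferred, idle_cycles_powered_on):
    return bits_transferred * ACTIVE_PJ_PER_BIT + idle_cycles_powered_on * IDLE_PJ_PER_CYCLE

# Example: a channel left on but idle for 1,000,000 cycles dissipates about 8.5 uJ,
# energy that a turn-off scheme can reclaim when the idle period is long enough to
# amortize the power-up delay.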

Table 2 gives the benchmark codes used in this study and their important characteristics. A common characteristic of these benchmarks is that they are array based embedded applications. The second and third columns give, respectively, a brief description and the source of each benchmark. The first five codes are from [25]. 3Step-log, Full-search, Hier, and Phods are four well-known motion estimation codes [46]. We also selected Epic from MediaBench [1] and Lame and FFT from MiBench [2] suites for our experiments. These three applications are the only array based embedded codes in MediaBench and MiBench. The fourth column of the table gives the number of C lines of each benchmark and the fifth column gives the total size of data used in executing each benchmark. The values within parentheses show the percentage of data communicated among processors during execution (as a fraction of the total data size). We see that, as far as communication volume is concerned, these applications exhibit a good mix. The next two columns give the execution time increase and energy savings when a hardware-based scheme is used for turning off unused communication channels. These results are with respect to an execution that does not employ any power management capabilities. This hardware scheme is a history based approach which waits for a certain period of time after the idleness is detected to turn off the channel. A turned off communication channel is reactivated by the next message that uses it. More details of this scheme as well as a thorough evaluation of its behavior can be found elsewhere [39]. The last two columns, on the other hand, give the performance degradation and energy saving results (again with respect to a scheme that does not employ any power optimization method) for a software based scheme in which the compiler inserts explicit channel turn-off and turn-on instructions in the application code. This scheme has two primary advantages over the hardware-based one. First, since the compiler can analyze the application code and determine the future channel usage pattern with certain accuracy, it can turn off a communication channel without waiting once it estimates that the idleness is large enough.

enough. Second, the compiler can pre-activate a turned off commu-nication channel (before the channel is actually needed), and thiscan eliminate performance impact of channel power management.In the next subsection, we evaluate our approach under both thesechannel turn-off schemes. Note that the default scheme (withoutany channel turn-off), the hardware based turn-off scheme, and thesoftware based turn-off scheme all use the default X-Y routing forall packets.
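The following C sketch illustrates the history-based hardware turn-off policy described above, using the threshold and power-up delay from Table 1 (at 1 GHz, 150 µs and 100 µs correspond to 150,000 and 100,000 cycles). The state machine is our simplification for illustration; it is not the exact mechanism of [39].

```c
#include <stdbool.h>
#include <stdio.h>

#define SHUTDOWN_THRESHOLD_CYCLES 150000   /* 150 us at 1 GHz (Table 1) */
#define POWER_UP_DELAY_CYCLES     100000   /* 100 us at 1 GHz (Table 1) */

struct channel_state {
    bool on;                 /* channel currently powered                        */
    long idle_cycles;        /* consecutive idle cycles while powered            */
    long wakeup_remaining;   /* cycles until a reactivated channel is usable     */
};

/* Called once per simulated cycle for each channel.  Returns true if the
 * channel can carry a flit this cycle.                                          */
static bool channel_tick(struct channel_state *ch, bool has_traffic)
{
    if (!ch->on) {
        if (has_traffic) {                 /* first message after turn-off       */
            ch->on = true;                 /* start powering the channel up      */
            ch->wakeup_remaining = POWER_UP_DELAY_CYCLES;
        }
        return false;
    }
    if (ch->wakeup_remaining > 0) {        /* still waking up: traffic stalls    */
        ch->wakeup_remaining--;
        return false;
    }
    if (has_traffic) {
        ch->idle_cycles = 0;
        return true;
    }
    if (++ch->idle_cycles >= SHUTDOWN_THRESHOLD_CYCLES) {
        ch->on = false;                    /* idle long enough: turn off         */
        ch->idle_cycles = 0;
    }
    return false;
}

int main(void)
{
    struct channel_state ch = { true, 0, 0 };
    long off_cycles = 0;
    /* 400,000 fully idle cycles: the channel is turned off once the
     * 150,000-cycle threshold expires, so the last 250,000 cycles are off. */
    for (long t = 0; t < 400000; t++) {
        if (!ch.on) off_cycles++;
        channel_tick(&ch, false);
    }
    printf("cycles spent turned off: %ld\n", off_cycles);
    return 0;
}
```

The software-based scheme essentially replaces the threshold wait and the wake-up stall with compiler-inserted turn-off and pre-activation instructions, which is why its performance overheads in Table 2 are much lower.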

5.2 Results

We start our discussion of the experimental results by presenting the percentage energy savings brought by our approach. The results are shown in Figure 14. Each bar in this figure is given with respect to the energy consumption of the corresponding scheme (hardware or software), as explained above. For example, the first bar indicates that our approach saves about 18.2% communication energy (over the hardware-based scheme) in benchmark Morph2 when it is used in conjunction with the hardware-based channel turn-off scheme. Similarly, the second bar states that our approach saves 16.3% energy over the software-based scheme for the same benchmark. We see that our channel reuse based scheme saves, on average, about 17.1% energy over the hardware-based scheme and about 15.6% energy over the software-based scheme. That is, it is very effective under both schemes. It needs to be emphasized that Table 2 already shows that the hardware and software based channel turn-off schemes save significant energy over the case when no power optimization is used. Figure 14 illustrates that our approach can bring further savings over these two schemes.

To better explain the energy savings brought by our approach, we show in Figure 15 the breakdown of idle channel periods into different lengths with and without channel reuse. Each point on the x-axis captures an idle period range. Each bar represents the average result when all twelve benchmark codes in our experimental suite are considered. Our observation is that, when channel reuse is used, we convert some idle periods in the region 11-100 into periods in the region 101-2000. In other words, some short idle periods are combined into longer ones. Note that turning off a channel is beneficial only when the idle period is long enough that the energy saving achieved by turning off the channel during this idle period can amortize the channel re-activation overheads. Combining short idle periods into long ones allows both the hardware-based and software-based channel turn-off schemes to turn off more channels and keep some channels in the turned-off mode for longer periods of time. As a result, more energy is saved, as observed in Figure 14.
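The amortization argument can be made concrete with a small helper. The idle-energy and threshold values below come from Table 1; the re-activation energy is a placeholder parameter, since the paper does not quote it directly.

```c
#include <stdio.h>

/* Net energy benefit (in pJ) of the hardware scheme turning a channel off
 * during an idle period of `idle_cycles` cycles.  The channel stays powered
 * for the 150,000-cycle detection threshold, so only the remaining cycles
 * save the 8.5 pJ/cycle idle energy; the wake-up cost is an assumed constant. */
static double turn_off_benefit_pj(long idle_cycles, double reactivation_pj)
{
    const long   threshold_cycles = 150000;  /* 150 us at 1 GHz (Table 1)     */
    const double idle_pj_per_cyc  = 8.5;     /* idle channel energy (Table 1) */

    if (idle_cycles <= threshold_cycles)
        return 0.0;                          /* channel is never turned off   */
    return (idle_cycles - threshold_cycles) * idle_pj_per_cyc - reactivation_pj;
}

int main(void)
{
    /* With an assumed 50,000 pJ re-activation cost, a 200,000-cycle idle
     * period nets (200000-150000)*8.5 - 50000 = 375,000 pJ, while a
     * 100,000-cycle idle period nets nothing.                                */
    printf("%.0f pJ\n", turn_off_benefit_pj(200000, 50000.0));
    printf("%.0f pJ\n", turn_off_benefit_pj(100000, 50000.0));
    return 0;
}
```

Under these numbers, only idle periods comfortably longer than the detection threshold yield a net saving, which is why merging many short idle periods into fewer long ones lets both turn-off schemes recover more energy.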

Our next set of results evaluates the performance impact of our approach. Recall that columns six and eight of Table 2 give the performance degradation caused by the hardware and software schemes over an execution without any power management. The results in Figure 16, on the other hand, give the percentage performance degradation caused by our channel reuse based strategy over the hardware and software schemes (i.e., the additional overheads).



Benchmark    Brief Description                               Source  Number of  Dataset Size    Hardware Based           Software Based
Name                                                                 C Lines    (KB)            Perf Ovhd  Energy Saved  Perf Ovhd  Energy Saved
Morph2       Morphological operations and edge enhancement   [25]    878        866.41 (27.3%)  4.41%      23.98%        0.51%      31.29%
Disc         Speech/music discriminator                      [25]    2,022      410.63 (36.4%)  5.21%      16.60%        0.29%      24.13%
Jpeg         Lossy compression for still images              [25]    771        434.97 (22.8%)  4.73%      18.31%        1.01%      26.77%
Viterbi      A graphical Viterbi decoder                     [25]    1,033      782.55 (46.7%)  2.08%      11.71%        0.63%      15.72%
Rasta        Speech recognition                              [25]    540        502.79 (39.3%)  7.01%      26.20%        0.27%      30.06%
3Step-log    Logarithmic search motion estimation            [46]    76         203.54 (31.2%)  3.96%      14.26%        0.41%      19.83%
Full-search  Full search motion estimation                   [46]    63         203.54 (33.0%)  3.15%      12.75%        0.55%      18.04%
Hier         Hierarchical motion estimation                  [46]    84         678.26 (28.6%)  2.92%      10.54%        0.21%      17.30%
Phods        Parallel hierarchical motion estimation         [46]    114        203.54 (29.2%)  4.48%      14.46%        0.30%      22.82%
Epic         Image data compression                          [1]     3,530      290.17 (44.7%)  5.98%      18.33%        0.21%      24.26%
Lame         MP3 encoder                                     [2]     18,612     60.86 (36.9%)   6.42%      9.66%         0.34%      13.03%
FFT          Fast Fourier transform                          [2]     469        524.35 (24.1%)  6.18%      17.71%        0.93%      27.86%

Table 2. Benchmarks used in our experiments and their important characteristics.

Figure 14. Energy savings by applying our scheme to an NoC-based system that already employs hardware-based or software-based energy saving schemes.

Figure 15. Breakdown of idle channel periods. Each point on the x-axis captures an idle period range. Each bar represents the average result across all the benchmark codes.

Figure 16. Performance degradations due to applying our scheme to an NoC-based system that already employs hardware-based or software-based energy saving schemes.

One can see from these results that the additional performance degradation brought by our approach over these two schemes is very small (0.9% over the hardware-based turn-off scheme and 1.0% over the software-based turn-off scheme).

Having presented the energy and performance behavior of our approach with the default values of the simulation parameters given in Table 1, we now discuss the results when the number of nodes in the mesh is changed. The remaining system parameters are kept at their default values to enable us to interpret the results observed. Since the performance degradation caused by our approach was about 1% for all the experiments we made, we do not present it in detail; instead, our focus is on energy savings. Recall that the default mesh size used in our experiments so far was 5 × 5. The graphs in Figure 17 plot the energy savings achieved by our approach under different mesh sizes. Our observation is that the benefits brought by the channel reuse strategy increase as we increase the mesh size. This is because a larger mesh provides a larger number of channels, which in turn allows the default routing employed by the hardware and software based turn-off schemes to use a larger number of channels. Our approach thus has better opportunities to reduce the number of channels used through channel reuse.

6. Related Work

The concept of exposing the communication architecture to the compiler has recently been exploited for performance optimizations. For example, the RAW architecture proposed by Waingold et al. [42] consists of an array of simple tiles, each of which contains instruction and data memories, an arithmetic logic unit, registers, configurable logic, and a programmable switch. This architecture relies on the compiler [26] to implement high-level abstractions like caching, global shared memory, and memory protection. Even dynamic events like cache misses are handled by the software. Another example is the polymorphous TRIPS architecture proposed by Sankaralingam et al. [34]. This architecture contains four out-of-order, 16-wide-issue Grid Processor cores connected by a network. A TRIPS machine can be configured for different granularities and types of parallelism by the software. The compiler work on RAW [8] and TRIPS [29] focuses mainly on exploiting parallelism in a given application. Our compiler analysis differs from these prior efforts in that we focus on reducing the energy consumption of the interconnection network.

Interconnection networks and routing algorithms have been popular research topics for decades. Agarwal [4] models the latency of direct networks. His model reveals that the performance of an interconnection network is highly sensitive to packet size, communication locality, and workloads. Adve and Vernon [3] develop detailed analytical performance models for k-ary n-cube networks with single-flit or infinite buffers, wormhole routing, and the non-adaptive deadlock-free routing scheme proposed by Dally and Seitz [13]. Boppana and Chalasani [10] present a framework to design deadlock-free wormhole algorithms for a variety of network topologies, including torus, mesh, de Bruijn, and a class of Cayley networks. Since these efforts target large scale networks, their main concern has been performance rather than energy consumption.

There have been several efforts in recent years on minimizing the energy consumption of NoC based systems and chip-to-chip networks. For example, Simunic and Boyd [37] propose a network-centric power management scheme for NoCs. Their experimental results show that their technique can predict future workloads more accurately than node-centric power management schemes. Soteriou and Peh [39, 40] explore the design space for communication channel turn-on/off based on a dynamic power management technique.



Their schemes turn off idle communication channels to reduce the leakage power consumption. Worm et al. [45] propose an adaptive low-power transmission scheme for on-chip networks. Their goal is to minimize the energy required for reliable communications, while satisfying a QoS constraint by dynamically varying the voltages of the channels. Kim et al. [24] design a channel shut-down scheme that minimizes the number of active channels while maintaining the connectivity of the network. They make use of an adaptive routing algorithm, and present a detailed comparison of the proposed scheme against voltage scaling. Shang et al. [35] propose applying dynamic voltage scaling to communication channels. They use a history-based policy to lower the voltage of the channels with low utilization. In a sense, our work is complementary to these efforts. Li et al. [27] and Soteriou et al. [38] propose compiler-directed techniques that turn off communication channels to reduce NoC energy consumption. Since, by determining suitable routes for the messages at compile time, our approach increases channel idle times, we can expect any channel shutdown or voltage scaling based hardware mechanism to be more effective when used in conjunction with our approach.

Shin and Kim [36] and Ascia et al. [5] use genetic algorithms to explore the design space for NoC systems. Hu and Marculescu [22] propose an algorithm that maps a given set of IP blocks onto a generic regular NoC and constructs a routing function that minimizes communication. The focus of these studies is to reduce energy consumption via task mapping. Our approach is different from theirs in that it focuses on reducing energy consumption through compiler-directed communication channel reuse.

Another group of related work is on power modeling for interconnection networks. Wang et al. [43] present an architectural-level power-performance simulator for interconnection networks. Using this simulator, their paper evaluates different network architectures and the impact of different communication patterns on energy consumption. Eisley and Peh [16] propose LUNA, a high-level power analysis framework for on-chip networks. Patel [31] focuses on the power-constrained design of interconnection networks and proposes power models for routers and channels. Raghunathan et al. [32] present a survey of energy efficient on-chip communication techniques that function across different levels: circuit-level, architecture-level, system-level, and network-level. In contrast to these studies, our goal is to explore the role of the compiler in reducing NoC energy consumption.

7. Concluding Remarks and Future Research Directions

Compilers in embedded computing face different optimization problems than their counterparts in high-performance computing. These problems include minimizing memory space requirements, reducing power/energy consumption, and maximizing reliability. The recent use of on-chip networks in embedded designs requires compiler writers to look at the problem of developing compiler support for such networks. Motivated by this observation, this paper proposes and evaluates a compiler-directed approach to NoC (network-on-chip) power management in the context of embedded multiprocessor systems. The idea behind the proposed approach is to minimize the number of communication channels used by a given set of messages without increasing communication latency significantly. Reducing the number of channels in turn allows the remaining channels to be turned off for saving power. Our approach casts this problem as a channel allocation problem and proposes a heuristic solution based on a graph-based representation called the connection interference graph. This solution tries to maximize the channel reuse between different messages.

Figure 17. Impact of mesh size on the energy savings brought by our scheme. Top: over the hardware based turn-off scheme. Bottom: over the software based turn-off scheme.

We implemented our approach and tested it in a simulation environment assuming two types of support for channel turn-off (a hardware-based one and a software-based one). Our experiments with twelve embedded application codes clearly show that the proposed approach is very successful in practice. On average, our scheme saves about 17.1% energy over the hardware-based scheme and about 15.6% energy over the software-based scheme.

We plan to extend this work in the following directions. First, we would like to study NoC-aware application code parallelization and the interactions between such a parallelization scheme and the channel reuse based power optimization scheme discussed in this paper. Second, we would like to evaluate the effectiveness of our approach under multi-programmed workloads. Third, we would like to investigate how our approach can be extended to cover scenarios where communication channels are equipped with voltage/frequency scaling capabilities (another mechanism for saving energy). We believe that exposing the network architecture to the compiler enables more energy optimizations, and is thus a promising research direction.

References

[1] MediaBench. http://cares.icsl.ucla.edu/MediaBench/.

[2] MiBench. http://www.eecs.umich.edu/mibench.

[3] V. S. Adve and M. K. Vernon. Performance analysis of mesh interconnection networks with deterministic routing. IEEE Trans. Parallel Distrib. Syst., 5(3):225–246, 1994.

[4] A. Agarwal. Limits on interconnection network performance. IEEE Trans. Parallel Distrib. Syst., 2(4):398–412, 1991.

[5] G. Ascia, V. Catania, and M. Palesi. Multi-objective mapping for mesh-based NoC architectures. In Proc. the International Conference on Hardware/Software Codesign and System Synthesis, Sept. 2004.



[6] E. Ayguadé and J. Torres. Partitioning the statement per iteration space using non-singular matrices. In Proc. 7th ACM International Conference on Supercomputing (ICS), pages 407–415, Tokyo, Japan, July 1993.

[7] P. Banerjee, J. A. Chandy, M. Gupta, E. W. H. IV, J. G. Holm, A. Lain, D. J. Palermo, S. Ramaswamy, and E. Su. The PARADIGM compiler for distributed-memory multicomputers. Computer, 28(10):37–47, 1995.

[8] R. K. Barua. Maps: A Compiler-Managed Memory System for Raw Machines. PhD thesis, MIT, Cambridge, MA, USA, 1999.

[9] L. Benini and G. D. Micheli. Networks on chips: a new SoC paradigm. IEEE Computer, 35(1):70–78, 2002.

[10] R. V. Boppana and S. Chalasani. A framework for designing deadlock-free wormhole routing algorithms. IEEE Trans. Parallel Distrib. Syst., 7(2):169–183, 1996.

[11] Z. Bozkus, A. Choudhary, G. Fox, T. Haupt, and S. Ranka. Fortran 90D/HPF compiler for distributed memory MIMD computers: design, implementation, and performance results. In Proc. ACM/IEEE Conference on Supercomputing, pages 351–360, New York, NY, USA, 1993. ACM Press.

[12] S. Chakrabarti, M. Gupta, and J.-D. Choi. Global communication analysis and optimization. In Proc. Conference on Programming Language Design and Implementation, pages 68–78, New York, NY, USA, 1996. ACM Press.

[13] W. J. Dally and C. L. Seitz. Deadlock-free message routing in multiprocessor interconnection networks. IEEE Trans. Comput., 36(5):547–553, 1987.

[14] W. J. Dally and B. Towles. Route packets, not wires: On-chip interconnection networks. In Proc. the 38th Conference on Design Automation, 2001.

[15] J. Duato, S. Yalamanchili, and L. Ni. Interconnection Networks. Morgan Kaufmann Publishers, 2002.

[16] N. Eisley and L.-S. Peh. High-level power analysis of on-chip networks. In Proc. the 7th International Conference on Compilers, Architectures and Synthesis for Embedded Systems, Sept. 2004.

[17] M. Gomaa, C. Scarbrough, T. N. Vijaykumar, and I. Pomeranz. Transient-fault recovery for chip multiprocessors. SIGARCH Comput. Archit. News, 31(2), 2003.

[18] P. Hazucha and C. Svensson. Impact of CMOS technology scaling on the atmospheric neutron soft error rate. IEEE Transactions on Nuclear Science, 47(6), 2000.

[19] S. Hiranandani, K. Kennedy, and C.-W. Tseng. Compiling Fortran D for MIMD distributed-memory machines. Communications of the ACM, 35(8):66–80, Aug. 1992.

[20] R. Ho, K. Mai, and M. Horowitz. Efficient on-chip global interconnects. In Proc. Symposium on VLSI Circuits, June 2003.

[21] J. Hu and R. Marculescu. Exploiting the routing flexibility for energy/performance aware mapping of regular NoC architectures. In Proc. the Design Automation and Test in Europe, Mar. 2003.

[22] J. Hu and R. Marculescu. Energy- and performance-aware mapping for regular NoC architectures. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 24(4):551–562, Apr. 2005.

[23] A. Jose, G. Patounakis, and K. Shepard. A 8gbps on-chip serial link. Technical Report TR10-03-01, Columbia University, Nov. 2003.

[24] E. J. Kim et al. Energy optimization techniques in cluster interconnects. In Proc. the International Symposium on Low Power Electronics and Design, Aug. 2003.

[25] I. Kolcu. Personal communication.

[26] W. Lee, D. Puppin, S. Swenson, and S. Amarasinghe. Convergent scheduling. In Proc. the 35th International Symposium on Microarchitecture, Nov. 2002.

[27] F. Li, G. Chen, M. Kandemir, and M. J. Irwin. Compiler-directed proactive power management for networks. In Proc. Conference on Compilers, Architectures and Synthesis of Embedded Systems, 2005.

[28] P. Mohapatra. Wormhole routing techniques for directly connected multicomputer systems. ACM Computing Surveys, 30(3):374–410, Sept. 1998.

[29] R. Nagarajan, D. Burger, K. S. McKinley, C. Lin, S. W. Keckler, and S. K. Kushwaha. Static placement, dynamic issue (SPDI) scheduling for EDGE architectures. In Proc. International Conference on Parallel Architectures and Compilation Techniques, Oct. 2004.

[30] L. M. Ni and P. K. McKinley. A survey of wormhole routing techniques in direct networks. Computer, 26(2):62–76, 1993.

[31] C. S. Patel. Power constrained design of multiprocessor interconnection networks. In Proc. the International Conference on Computer Design, Washington, DC, USA, 1997.

[32] V. Raghunathan, M. B. Srivastava, and R. K. Gupta. A survey of techniques for energy efficient on-chip communication. In Proc. the 40th Design Automation Conference, 2003.

[33] S. K. Reinhardt and S. S. Mukherjee. Transient fault detection via simultaneous multithreading. In ISCA, 2000.

[34] K. Sankaralingam et al. Exploiting ILP, TLP, and DLP with the polymorphous TRIPS architecture. In Proc. the 30th Annual International Symposium on Computer Architecture, 2003.

[35] L. Shang, L.-S. Peh, and N. K. Jha. Dynamic voltage scaling with links for power optimization of interconnection networks. In Proc. Symposium on High-Performance Computer Architecture, Feb. 2003.

[36] D. Shin and J. Kim. Power-aware communication optimization for networks-on-chips with voltage scalable links. In Proc. the International Conference on Hardware/Software Codesign and System Synthesis, Sept. 2004.

[37] T. Simunic and S. Boyd. Managing power consumption in networks on chip. In Proc. Conference on Design, Automation and Test in Europe, 2002.

[38] V. Soteriou, N. Eisley, and L.-S. Peh. Software-directed power-aware interconnection networks. In Proc. Conference on Compilers, Architecture and Synthesis for Embedded Systems, Sept. 2005.

[39] V. Soteriou and L.-S. Peh. Dynamic power management for power optimization of interconnection networks using on/off links. In Proc. Symposium on High Performance Interconnects, 2003.

[40] V. Soteriou and L.-S. Peh. Design space exploration of power-aware on/off interconnection networks. In Proc. International Conference on Computer Design, Oct. 2004.

[41] F. Vermeulen, F. Catthoor, D. Verkest, and H. De Man. Formalized three-layer system-level reuse model and methodology for embedded data-dominated applications. In Proc. the Conference on Design, Automation and Test in Europe, pages 92–98, New York, NY, USA, 2000. ACM Press.

[42] E. Waingold et al. Baring it all to software: RAW machines. Computer, 30(9):86–93, 1997.

[43] H.-S. Wang, X. Zhu, L.-S. Peh, and S. Malik. Orion: a power-performance simulator for interconnection networks. In Proc. the 35th International Symposium on Microarchitecture, Nov. 2002.

[44] W. Wolf. The future of multiprocessor systems-on-chips. In Proc. the 41st Conference on Design Automation, pages 681–685, New York, NY, USA, 2004. ACM Press.

[45] F. Worm, P. Ienne, P. Thiran, and G. D. Micheli. An adaptive low power transmission scheme for on-chip networks. In Proc. the International System Synthesis Symposium, Kyoto, Japan, 2002.

[46] N. D. Zervas, K. Masselos, and C. Goutis. Code transformations for embedded multimedia applications: impact on power and performance. In Proc. Power-Driven Microarchitecture Workshop, 1998.
