Power efficient solution for network on chip

Term Paper Submission ECE 562 – Fall 2013


ISBs: Bidirectional Buffer-less Router with Intelligent Space Buffers

Dhiraj Chaudhary and Ahmed Louri

Dept. of Electrical and Computer Engineering, University of Arizona, Tucson, AZ 85721

{dhirajchaudhary,louri}@ece.arizona.edu

ABSTRACT

Buffers in routers consume significant power and area. We propose a novel Network-on-Chip (NOC) architecture based on intelligent space buffers (ISBs) that is capable of mitigating both power and performance issues. Buffer-less router designs show a significant degradation of performance at high injection rates. We make a case for a new approach to the power-efficient design of NOCs that uses buffer-less routers with improved performance.

General Terms: Architecture, Algorithm, Design.

Keywords: routing, network on chip, control, buffers, channels.

1. INTRODUCTION

Today, high performance and low power are tight constraints for the Network on Chip (NOC). The NOC is reported to consume up to 30% of the power in the Intel 80-core Terascale chip [1] and about 40% in the MIT RAW chip [2]. A lot of work has been done, and is still in progress, to balance power and performance. As the number of cores increases, latency dominates, and power control mechanisms worsen the situation further. It is therefore essential to design a low-power NOC while keeping performance within certain limits. This paper discusses a new low-power design that can be thought of as a balanced implementation for future NOC designs.

Buffers are power hungry. Moscibroda and Mutlu [3] suggest that removing buffers can save up to 60% of the total power in the NOC. However, removing buffers has a potential negative impact on performance and bandwidth efficiency. Their buffer-less design, BLESS, works well at low injection rates, but at high injection rates it consumes a substantial percentage of chip power and its performance degrades. Latif et al. [4] discuss a straightforward approach: utilize idle buffers. Storing packets requires more power than transmitting them, so it is better to transmit packets [9]. Sharing buffers among ports or virtual channels can reduce the buffer count significantly. This design comes with additional computational complexity, which impacts area consumption and, in certain cases, possibly power. Kodi et al. [5] introduced adaptive dual-function links, which can be dynamically configured either as repeaters or, in case of congestion, as storage units. This approach can save about 40% of buffer power and is area efficient as well.

In this paper, we propose intelligent space buffers (ISBs), which can achieve high performance with buffer-less routers while keeping power consumption within certain limits. We deploy buffers in the space around the router. Congestion control is an inherent function of the control unit, which dynamically manages the number of buffers allocated to each channel according to traffic. Bidirectional links [6] are used so that the buffers are utilized more effectively.

2. RELATED WORK

2.1 BLESS: buffer-less routers

Buffers are responsible for 60% of the total power consumption in a network on chip (NOC) and consume about 64% of the static power [7] [8]. Many researchers have therefore tried to remove buffers from the router completely. The buffer-less router design BLESS by Moscibroda and Mutlu [3] demonstrates a 60% reduction in area, deadlock avoidance, a simplified router design and freedom from livelock. However, the results show that eliminating buffers causes a major degradation in performance. The concept works well at low injection rates, but at high injection rates a significant degradation in both power and performance has been observed [3].

In a conventional design, buffers are associated with each virtual channel (VC). In addition, there is a large amount of area-hungry control circuitry, including the VC allocator, the switch allocator and the route computation unit.

Figure 1. Traditional switch architecture with buffers

Figure 2. Buffer-less switch architecture

If we go for a buffer-less router, significant area can be saved. BLESS uses the hot-potato routing protocol, a deflection-based mechanism in which a router, after receiving a packet or flit, deflects it in any direction based on port availability. The flit-ranking mechanism illustrated in figure -- takes care of the livelock problem caused by deflection: the oldest packet gets the highest priority, which avoids livelock in the buffer-less network. Because the flits are always in motion, deadlock, which is one of the major problems in routers with buffers, cannot arise. Another advantage of BLESS is its very low router latency, owing to fewer routing computations. The major drawback is that buffer-less routing does not perform well at high injection rates: as the injection rate at a router increases, its performance degrades drastically. As illustrated in [3], at an injection rate of 0.08 the buffer-less router outperforms the router with buffers, whereas at an injection rate of 0.28 there is a drastic increase in link and router energy. This is because a packet that is deflected in the wrong direction takes longer to reach its destination. Pipeline latency is lower in BLESS than in a conventional router with buffers, because the virtual channel allocation and switch allocation stages are eliminated. The experimental results in [3] clearly indicate that the buffer-less design breaks down at an injection rate of 0.29, compared to 0.35 for the buffered design with 4 VCs and 4-flit buffers. All the experiments were carried out on an 8 × 8 network of routers using synthetic traces with four different traffic patterns: uniform random (UR), transpose (TR), mesh tornado (TOR) and bit complement (BC).
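The oldest-first ranking and deflection described above can be sketched for a single router as follows. This is a minimal illustration and not the BLESS implementation from [3]: the names `Flit`, `assign_ports` and `PORTS`, the age field, and the rule of deflecting to the first free port are all assumptions made for the example.

```python
from dataclasses import dataclass

# Four mesh output ports (hypothetical labels for this sketch).
PORTS = ("N", "E", "S", "W")


@dataclass
class Flit:
    flit_id: int
    age: int        # cycles since injection; a larger value means an older flit
    preferred: str  # productive output port toward the destination


def assign_ports(flits):
    """Assign every incoming flit an output port, oldest flit first.

    A flit whose preferred port is already taken is deflected to a free
    port, so no flit is buffered or dropped.
    """
    if len(flits) > len(PORTS):
        raise ValueError("more flits than output ports")
    free = list(PORTS)
    assignment = {}
    # Rank by age (oldest first); ties are broken by flit_id.
    for flit in sorted(flits, key=lambda f: (-f.age, f.flit_id)):
        if flit.preferred in free:
            port = flit.preferred
        else:
            port = free[0]  # deflection: take the first free port (assumed rule)
        free.remove(port)
        assignment[flit.flit_id] = port
    return assignment
```

With two flits that both want the east port, the older one wins and the younger one is deflected, which is the behavior that lengthens paths at high injection rates.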

The BLESS design works well for networks with light traffic. In NOCs it is applicable to the memory-core interface, because memory and core communicate at low injection rates. However, there are still many issues associated with buffer-less routers. The first is flit overhead: every flit must carry its own header. The second is the high latency experienced by individual flits on the way to the destination. Because flits arrive at different times, a large buffer may be required at the receiver to reassemble flits into packets. Because of these drawbacks, BLESS has not had much success in terms of practical implementation.

2.2 Shared buffers

Latif et al. [4] propose sharing the buffers associated with the virtual channels.



Figure 3. Architecture of input part of router for shared buffers NOC design (Courtesy of Latif, Khalid, Tiberiu

Seceleanu, and Hannu Tenhunen. "Power and area efficient design of network-on-chip router through utilization of idle

buffers." Engineering of Computer Based Systems (ECBS), 2010 17th IEEE International Conference and Workshops

on. IEEE, 2010.)

Figure 1 describes the conventional router architecture, in which each virtual channel has its own buffer space. Traffic on virtual channel 1 cannot use the buffers of another virtual channel even when they are free. In practice, the buffers are never 100% utilized. The idea is to make use of this unutilized channel buffer space. Figure 3 shows the shared-buffer architecture.

The main contribution of [4] lies in the input part of the router, where the channels share a common buffer space. Each packet is divided into flits, the first of which is the head flit, called the beginning of packet (BOP). When a BOP arrives at the buffer allocator unit, the allocator looks for a free buffer slot and allocates it. An allocated signal is then sent to the buffer write controller, which responds with a busy signal. After receiving the busy signal, the buffer allocator sends the "allocated to" signal, which sets the multiplexer pins of the input buffer. After allocation, a grant signal is sent to the port that is sending the flits; this signal acts as the virtual channel identifier. For every new flit, the port sends the NewFlit_Dx_x signal to the buffer write controller. If two requests compete for one buffer slot, arbitration is performed using the priority signal shown in the figure. Status_flag is the logical AND of all the busy signals and indicates that all buffer slots are full. After receiving this signal, the requesting neighboring port decides whether to redirect its flits in some other direction or to hold them until the congestion is resolved.
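The allocation behavior described above can be modeled in a few lines of software. This is only a behavioral sketch, not the hardware design of [4]: the class name `SharedBufferPool`, its methods, and the first-free-slot policy are assumptions made for illustration, and the handshake signals are collapsed into ordinary method calls.

```python
class SharedBufferPool:
    """Behavioral model of a buffer pool shared by all virtual channels."""

    def __init__(self, num_slots):
        if num_slots < 1:
            raise ValueError("the pool needs at least one slot")
        self.busy = [False] * num_slots   # one busy signal per slot
        self.owner = [None] * num_slots   # virtual channel holding each slot

    def status_flag(self):
        # Logical AND of all busy signals: True only when every slot is full.
        return all(self.busy)

    def allocate(self, vc_id):
        """On a BOP, give the first free slot to vc_id.

        Returns the slot index, or None when the pool is full.
        """
        for slot, is_busy in enumerate(self.busy):
            if not is_busy:
                self.busy[slot] = True
                self.owner[slot] = vc_id
                return slot
        return None

    def release(self, slot):
        """Free a slot once its packet has left the router."""
        self.busy[slot] = False
        self.owner[slot] = None
```

Because slots are not tied to a channel, one busy virtual channel can hold several slots while another holds none, which is the point of sharing.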

2.3 iDEAL: Inter-router Dual-function Energy and Area-efficient Links for NoC architectures

With continued improvement in router design, Kodi et al. [5] introduce a new architectural approach for NOCs that saves up to 40% of buffer power and 41% of router area. The basic idea is to use the repeaters in the links as buffers on demand. iDEAL replaces conventional buffers with three-state repeaters. When the control signal is low, a three-state repeater behaves like a conventional repeater; when the control signal is high, it acts as a buffer that holds the bit. Figure 1 illustrates the conventional router architecture, in which each virtual channel has 4 buffer slots of 128 bits each. Some of these buffers can be removed from the router and placed on the link, which saves router area as well as power. Figure 4 shows the router buffer size reduced from v4-r16-c0 to v4-r8-c8. A congestion control signal dynamically configures these adaptive link buffers (ALBs) to act as repeaters or buffers according to the traffic load. iDEAL improves power and area by more than 40% with a 1-2% degradation in performance [5].

Figure 4. Dual function links used in iDEAL NOC architecture (Courtesy of Kodi, Avinash Karanth, Ashwini Sarathy, and Ahmed Louri. "iDEAL: Inter-router dual-function energy and area-efficient links for network-on-chip (NoC) architectures." ACM SIGARCH Computer Architecture News. Vol. 36. No. 3. IEEE Computer Society, 2008)

2.4 BiNoC: A Bidirectional NoC Architecture with Dynamic Self-Reconfigurable Channel

Bidirectional NoCs allow each communication channel to be dynamically configured in either direction to enhance performance. This design shows a significant increase in performance at the cost of some area penalty [6]. The aim is to use the channel bandwidth more effectively. If the outgoing channel carries more traffic than the incoming channel, BiNoC can switch the direction of the incoming channel so that the load is shared between the two channels. BiNoC is suited to networks in which traffic density differs considerably between opposite directions.

3. DESIGN OF INTELLIGENT SPACE BUFFERS

3.1 NOC router Architecture

We use an n × n 2-D mesh architecture. Routers are buffer-less, and each router is connected to a processing element (PE) and to its four adjacent neighbors: north, east, south and west. Packets are divided into head, body and tail flits, as in conventional architectures. A deflection routing algorithm is used in this design.

3.2 Problem description

Buffer-less routers show a significant degradation in performance and power consumption at high injection rates, which defeats the purpose of going buffer-less [3].

Figure 5. (a) Dropped packet in case of congestion for BLESS router architecture. (b) Redirected packet in case of congestion for BLESS architecture.

In figure 5, suppose that B and C both send packets to the same output port of router A. Router A then has to drop one of the packets, because it has no buffers in which to store them and only one packet can take that output port at a time. Alternatively, if a deflection-based routing algorithm is employed, the losing packet is redirected to any output port that is free. A deflected packet takes a long time to reach its destination, which degrades the overall performance of the BLESS router design.

3.1 Intelligent space buffers (ISBs) implementation

In this section we detail the implementation of intelligent space buffers and the associated control unit.

Figure 6. Proposed intelligent space buffers.

Figure 6 shows the conventional buffers replaced by a stack of buffers placed outside the router. When the signal from the decision and control unit is low, the buffers are in power-down mode. In case of congestion, the buffers are activated and hold the data bits; they remain active until the congestion is alleviated. This implementation enables buffer-less routers to perform well at high injection rates. The control unit is the heart of the ISBs and is discussed in the next section.

3.2 Control Unit Implementation

The control unit keeps the buffers in power-down mode and switches them to active mode during congestion. A single control unit is responsible for activating all the space buffers shown in figure 6. As illustrated in figure 7, the control unit contains a counter that counts the number of flits or packets flowing through a particular link. Although only one link is shown for simplicity, in a practical implementation each control unit controls 2 links. A comparator compares the count obtained from the counter with a predetermined stored value P. If the count exceeds this threshold (P), the decision and control unit sends the activate signal to the respective buffers. In addition, the control unit sends the switching signals to sw1 and sw2, so that all the traffic from port A to port B now traverses the buffer unit. The overhead of the control unit is negligible compared with the power saving.

Figure 7. Proposed control unit implementation for ISBs

Figure 8. Proposed algorithm implemented at

control unit of ISB architecture

Figure 8 illustrates the detailed algorithm to be implemented at the control unit. The main open issue is how to determine the threshold value. Another issue is how much buffer space should be allocated to each channel in case of congestion. We have used 80% for the prototype, but this value still needs to be improved.
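The counter and comparator behavior of the control unit can be sketched in software as shown here. This is a simplified model under stated assumptions and not the algorithm of figure 8: the class name `LinkControlUnit`, the idea of counting flits over a fixed observation window, and the reset at the end of each window are hypothetical choices made for the example. Only the comparison against the stored value P comes from the text.

```python
class LinkControlUnit:
    """Counter plus comparator for one link (behavioral sketch)."""

    def __init__(self, threshold_p):
        self.threshold_p = threshold_p  # predetermined stored value P
        self.count = 0                  # flits seen in the current window
        self.buffers_active = False     # state of the activate signal

    def observe_flit(self):
        """Counter: called once for every flit that crosses the link."""
        self.count += 1

    def end_window(self):
        """Comparator: activate the space buffers only if count exceeds P.

        The counter is reset so the next window starts from zero
        (the windowing itself is an assumption of this sketch).
        """
        self.buffers_active = self.count > self.threshold_p
        self.count = 0
        return self.buffers_active
```

In this model a count equal to P leaves the buffers powered down, since the text activates them only when the count exceeds the threshold.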

3.3 Dynamic space buffers in Bi-Directional links

The proposed intelligent space buffer architecture can be further optimized by using bidirectional links [6]. Figure 9 illustrates the behavior of the links when traffic in one direction dominates the other. In figure 9(b), R1 (router 1) configures both channels and links as outputs when the traffic from R1 to R2 exceeds the traffic from R2 to R1. Figure 9(c) illustrates the opposite scenario, in which the traffic from R2 to R1 is greater.

The block diagram in figure 10 illustrates the bidirectional channel, or link, between routers A and B. Introducing bidirectional links can improve performance at high injection rates [6].

Figure 9.

(a) Conventional unidirectional link between

routers R1 and R2.

(b) Reconfigured links for congestion from R1 to

R2 router.

There is also scope for power reduction in our design from using bidirectional channels instead of unidirectional ones. The algorithm at the router interface works in a fashion similar to that described in [6].

Figure 10. Bidirectional links implemented in

ISBs

Suppose that a router needs 2 ns to process a packet, and that a packet is sent from router A to router B at 1 ns, followed by a second packet on the same port interface at 2 ns. Router B cannot process a new request before 3 ns, so it would drop the second packet. Instead, the incoming channel from router B to router A at the same port can be used if it is free. Control circuitry is needed to switch the direction of the port. If 2 or more packets request the same port at 2 ns, the algorithm illustrated in figure 8, which runs on the control circuitry of the space buffers, starts executing.



3.4 Power gated frame implementation

Figure 11. Proposed pipelined power gating scheme

Power gating suffers from wake-up latency, which impacts performance [10] [11]. We use sleep-mode transistors in ISBs to optimize performance: 10% of the total transistors are kept in sleep mode and 90% remain completely shut off. When the injection rate at any port is high, the control block redirects the traffic through the buffers. When 8% of the buffers are occupied, 30% of the remaining buffers are triggered into wake-up mode, which hides the wake-up latency. As shown in figure 11, when the traffic falls below the threshold, buffers can be sent back to power-down mode. We assume a 10% drop in buffer space when the load decreases below some threshold value. State 5 indicates that at most 90% of the buffers are utilized. Beyond this point, all packets destined for that port are discarded, which prevents the congestion from affecting other ports. The proposed gating scheme can therefore perform well at high injection rates too. Because it overcomes the wake-up latency, this scheme offers higher performance than conventional power gating. Because idle buffers are kept in power-down mode, which is a complete shut-down, static power dissipation is lower in the pipelined power gating scheme.
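One possible reading of the wake-up and power-down steps above is sketched here as a single control-step function. It is an interpretation, not the scheme of figure 11: the function name `next_ready`, the use of whole buffer counts, and the reading of "30% of remaining buffers" as 30% of the buffers that are still shut off are all assumptions. The percentages 10, 8, 30 and 90 are taken from the text.

```python
def next_ready(total, ready, occupied, load_high):
    """Return how many buffers are out of full shut-off after one step.

    total     -- number of space buffers on the port
    ready     -- buffers currently in sleep or active mode (not shut off)
    occupied  -- buffers currently holding flits
    load_high -- True while traffic is above the control-unit threshold
    """
    min_ready = total * 10 // 100   # 10% always kept in sleep mode
    max_ready = total * 90 // 100   # state 5: at most 90% of buffers in use
    if load_high:
        if occupied * 100 >= total * 8:        # 8% occupancy trigger
            shut_off = total - ready
            ready += shut_off * 30 // 100      # wake 30% of the remainder early
        return min(ready, max_ready)
    # Load below threshold: give back 10% of the buffer space per step,
    # but never go below the sleep-mode floor.
    return max(ready - total * 10 // 100, min_ready)
```

For a port with 100 buffers, the first trigger raises the ready set from 10 to 37, and each further step under load wakes 30% of what is still shut off, so buffers are already awake before traffic reaches them.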

The pipelined power gating scheme is easy to implement and promising in terms of both power and performance. The exact performance gain can only be calculated after simulation. We estimate a saving of more than 5 clock cycles, since a saving of 5 clock cycles is reported in [11] and pipelined power gating can improve on this further.

4. DESIGN COMPLEXITY

The proposed ISBs architecture is not an area-efficient design, because both the links and the buffers are controlled dynamically. The control circuitry may take up a large percentage of the area. Another issue is the predetermined threshold value used in the control unit; the proposed design needs to be rechecked under real-time traffic. A learning mechanism could be implemented to set the threshold, but the area constraint is the major issue that must be addressed for ISBs to succeed.

5. FUTURE WORK

While ISBs is an appealing design for its balance of power and performance, a large design space spans the gap between the traditional and the ISBs architectures. The first item of future work is an area-efficient design for the ISBs NOC architecture, which is not discussed in this paper. Another is the permutation and priority schemes to be implemented at the control block in case of congestion. Deadlock may also be a problem for ISBs because of the newly introduced buffers. Flow control is currently implemented with a counter, which can be improved to make ISBs better in both performance and power.

6. CONCLUSION

In this paper we propose a novel architecture to counter performance and power issues in NOCs. ISBs uses buffer-less routers and bidirectional links to achieve significant power savings. To counter the performance issue, we provide self-configured intelligent space buffers. Because of time constraints, the proposed architecture has not yet been evaluated in simulation. We hope that it will inspire new ideas for work on NOCs.

7. REFERENCES

[1] Y. Hoskote, S. Vangal, A. Singh, N. Borkar, and S. Borkar. “A 5-GHz mesh interconnect for a teraflops processor”. IEEE Micro, 27(5), 2007.

[2] Taylor, Michael Bedford, et al. "Evaluation of the Raw microprocessor: An exposed-wire-delay architecture for ILP and streams." ACM SIGARCH Computer Architecture News. Vol. 32. No. 2. IEEE Computer Society, 2004.

[3] Moscibroda, Thomas, and Onur Mutlu. "A case for bufferless routing in on-chip networks." ACM SIGARCH Computer Architecture News. Vol. 37. No. 3. ACM, 2009.

[4] Latif, Khalid, Tiberiu Seceleanu, and Hannu

Tenhunen. "Power and area efficient design of

network-on-chip router through utilization of

idle buffers." Engineering of Computer Based

Systems (ECBS), 2010 17th IEEE International

Conference and Workshops on. IEEE, 2010.

[5] Kodi, Avinash Karanth, Ashwini Sarathy, and

Ahmed Louri. "iDEAL: Inter-router dual-

function energy and area-efficient links for

network-on-chip (NoC) architectures." ACM

SIGARCH Computer Architecture News. Vol.

36. No. 3. IEEE Computer Society, 2008.

[6] Y.C. Lan, S.H. Lo, Y.C. Lin, Y.H. Hu, and S.J.

Chen, "BiNoC: A Bidirectional NoC

Architecture with Dynamic Self-

Reconfigurable Channel," in Proc. of the 3rd

ACM/IEEE International Symposium on

Networks-on-Chip, pp. 266-275, 2009.

[7] H. Wang, L. S. Peh, and S. Malik. “Power driven design of router microarchitectures in on-chip networks,” Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 105-116, 2003.

[8] Xuning Chen and Li-Shiuan Peh. “Leakage power modeling and optimization of interconnection networks”. Proceedings of International Symposium on Low Power Electronics and Design, pp. 90-95, 2003.

[9] T. T. Ye, L. Benini, G. De Micheli. “Analysis

of power consumption on switch fabrics in

network routers,” Proceedings of the 39th

Design Automation Conference (DAC), pp.

524-529, 2002.

[10] Z. Hu, A. Buyuktosunoglu, V. Srinivasan, V. Zyuban, H. Jacobson, and P. Bose, "Microarchitectural techniques for power gating of execution units," in International Symposium on Low Power Electronics and Design (ISLPED), CA, USA, pp. 32-37, 2004.

[11] H. Matsutani, M. Koibuchi, D. Wang, and H. Amano, "Run-time power gating of on-chip routers using look-ahead routing," in 13th Asia and South Pacific Design Automation Conference (ASP-DAC), Piscataway, NJ, USA, pp. 55-60, 2008.
