Term Paper Submission ECE 562 – Fall 2013
ISBs: Bidirectional Buffer-less Router with
Intelligent Space Buffers
Dhiraj Chaudhary and Ahmed Louri
Dept. of Electrical and Computer Engineering, University of Arizona, Tucson, AZ 85721
{dhirajchaudhary,louri}@ece.arizona.edu
ABSTRACT
Buffers in routers consume significant power and area.
Buffer-less router designs avoid this cost but show a
significant degradation of performance at high
injection rates. A novel intelligent space buffers (ISBs)
NOC architecture capable of mitigating both power and
performance issues is proposed. We make a case for a
new approach to power-efficient Network-on-Chip
design that uses buffer-less routers with improved
performance.
General Terms: Architecture, Algorithm, Design.
Keywords: routing, network on chip, control, buffers,
channels.
1. INTRODUCTION
High performance and low power are tight constraints
for today's Network on Chip (NOC) designs. The NOC
consumes up to 30% of chip power in the Intel 80-core
Terascale chip [1] and about 40% in the MIT RAW chip
[2]. A lot of work has been done, and is still in
progress, to balance power and performance. As the
number of cores increases, latency dominates, and
power control mechanisms worsen the situation
further. It is essential to design a low-power NOC that
keeps performance within acceptable limits. This paper
discusses a new low-power design that can be thought
of as a balanced implementation for future NOC
designs.
Buffers are power hungry. Moscibroda and Mutlu [3]
suggest that removing buffers can save up to 60% of
total power in NOC. But removing buffers has a
potential negative impact on performance and
bandwidth efficiency. Their buffer-less design (BLESS)
works well at low injection rates, but at high injection
rates it consumes a substantial percentage of chip
power and its performance degrades. Latif et al. [4]
discuss a straightforward approach: utilize idle
buffers. Storing a packet requires more power than
transmitting it, so it is better to transmit packets [9].
Sharing buffers among ports or virtual channels can
significantly reduce the buffer count. This design
comes with additional computational complexity,
which affects area and, in some cases, power. Kodi et
al. [5] introduced adaptive dual-function links, which
can be dynamically configured as repeaters or, under
congestion, as storage units. This approach saves about
40% of buffer power and is area efficient as well.
In this paper, we propose intelligent space buffers
(ISBs), which achieve high performance with
buffer-less routers while keeping power consumption
within limits. We deploy buffers in the space around
the router. Congestion control is an inherent function
of the control unit, which dynamically manages the
number of buffers allocated to each channel according
to traffic. Bidirectional links [6] are used to make more
effective use of the buffers.
2. RELATED WORK
2.1 BLESS: buffer-less routers
Buffers are responsible for 60% of total power
consumption in a network on chip (NOC) and consume
about 64% of static power [7] [8]. Many researchers
therefore try to eliminate buffers from the router
entirely. The buffer-less router design BLESS by
Moscibroda and Mutlu [3] demonstrates a 60%
reduction in area, deadlock avoidance, livelock
freedom, and a simplified router design. However, the
reported results show that eliminating buffers causes a
major degradation in performance. The concept works
well at low injection rates, but at high injection rates
significant degradation in both power and
performance has been observed [3].
In a conventional design, buffers are associated with
each virtual channel. In addition, large, area-hungry
control circuitry is present, including the VC
allocator, switch allocator, and route computation unit.
Figure 1. Traditional switch architecture with buffers
Figure 2. Buffer-less switch architecture
If we adopt a buffer-less router, significant area can
be saved. BLESS uses a hot-potato routing protocol, a
deflection-based mechanism in which a router, after
receiving a packet or flit, deflects it in any direction
based on port availability. The flit ranking mechanism
illustrated in figure -- takes care of the livelock
problem caused by deflection: the oldest packet gets
the highest priority, which avoids livelock in the
buffer-less network. Because flits are always in motion,
deadlock, one of the major problems in routers with
buffers, cannot arise. Another advantage of BLESS is
very low router latency, because fewer routing
computations are needed. The major drawback is that
buffer-less routing does not perform well at high
injection rates.
As the injection rate at a router increases, its
performance degrades drastically. As illustrated in [3],
at an injection rate of 0.08 the buffer-less router
outperforms the router with buffers. At an injection
rate of 0.28 there is a drastic increase in link and router
energy, because packets deflected in the wrong
direction take longer to reach their destination.
Pipeline latency is lower in BLESS than in a
conventional router with buffers, because the virtual
channel allocation and switch allocation stages are
eliminated.
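The oldest-first ranking and deflection described above can be sketched as follows. This is a minimal illustration, not the BLESS implementation from [3]: the Flit fields, the port names, and the rule of deflecting to the first free port are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Flit:
    flit_id: int
    age: int             # cycles since injection; larger means older (assumed field)
    preferred_port: str  # productive output port, e.g. "N", "E", "S" or "W"


def assign_ports(flits, ports=("N", "E", "S", "W")):
    """Give every flit an output port: oldest flit chooses first, losers are deflected."""
    if len(flits) > len(ports):
        raise ValueError("a buffer-less router cannot accept more flits than ports")
    free = list(ports)
    assignment = {}
    # Rank flits so that the oldest one is served first (ties broken by id).
    for flit in sorted(flits, key=lambda f: (-f.age, f.flit_id)):
        if flit.preferred_port in free:
            port = flit.preferred_port
        else:
            port = free[0]  # deflection: take any free port (assumed tie-break)
        free.remove(port)
        assignment[flit.flit_id] = port
    return assignment
```

Because every flit leaves on some port in every cycle, no flit is ever stored, and because the oldest flit always wins its preferred port, it keeps making progress toward its destination.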
Experimental results [3] clearly indicate that the
buffer-less design breaks down at an injection rate of
0.29, compared to 0.35 for a buffered router with 4 VCs
of 4 flits each. All experiments consider an 8 × 8 array
of routers and use synthetic traces with four traffic
patterns: uniform random (UR), transpose (TR), mesh
tornado (TOR), and bit complement (BC).
The BLESS design works well for networks with light
traffic. In NOCs it is applicable to the memory-core
interface, because memory and core communicate at
low injection rates. Still, buffer-less routers have
several issues. The first is flit overhead: every flit
must carry its own header. The second is high latency
for each flit to reach its destination. Because flits
arrive at different times, reassembling them into a
packet may require a large buffer at the receiver.
Because of these drawbacks, BLESS has not seen much
success in practical implementations.
2.2 Shared buffers
In this design, Latif et al. [4] propose sharing the
buffers associated with each virtual channel.
Figure 3. Architecture of input part of router for shared buffers NOC design (Courtesy of Latif, Khalid, Tiberiu
Seceleanu, and Hannu Tenhunen. "Power and area efficient design of network-on-chip router through utilization of idle
buffers." Engineering of Computer Based Systems (ECBS), 2010 17th IEEE International Conference and Workshops
on. IEEE, 2010.)
Figure 1 describes the conventional router
architecture, in which each virtual channel has its
own buffer space. Traffic on virtual channel 1 cannot
use the buffers of another virtual channel even when
they are free. In practice, the buffers are never 100%
utilized. The idea is to use this idle buffer space.
Figure 3 shows the shared buffer architecture.
The main contribution of [4] lies in the input
part, where the channels share a common buffer
space. Each packet is divided into flits, the first of
which is the head flit, called the beginning of packet
(BOP). When a BOP arrives at the buffer allocator
unit, the allocator looks for a free buffer slot and
allocates it. An allocated signal is then sent to the
buffer write controller, which responds with a busy
signal. After receiving the busy signal, the buffer
allocator sends the allocated-to signal, which sets
the multiplexer select pins of the input buffer. After
allocation, a grant signal is sent to the port sending
the flits. This signal acts as the virtual channel
identifier. For every new flit, the port sends the
NewFlit_Dx_x signal to the buffer write controller.
When two requests compete for one buffer slot,
arbitration is done by the priority signal shown in
the figure. Status_flag is the logical AND of all the
busy signals; when set, it indicates that all buffer
slots are full. After receiving this signal, the
requesting neighboring port decides whether to
redirect flits in another direction or hold them until
congestion is resolved.
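The allocation and Status_flag behaviour described above can be modelled in a few lines. This is a behavioural sketch under assumptions, not the hardware from [4]: the class name, the first-free-slot search order, and the port labels are hypothetical, and the handshake signals and priority arbitration are left out.

```python
class SharedBufferPool:
    """Buffer slots shared by all input channels of one router port group."""

    def __init__(self, num_slots):
        self.busy = [False] * num_slots   # one busy signal per slot
        self.owner = [None] * num_slots   # which port currently holds the slot

    def allocate(self, port):
        """Called when a BOP arrives: grant the first free slot, or None if full."""
        for slot, is_busy in enumerate(self.busy):
            if not is_busy:
                self.busy[slot] = True
                self.owner[slot] = port
                return slot
        return None

    def release(self, slot):
        """Free a slot once its packet has left the buffer."""
        self.busy[slot] = False
        self.owner[slot] = None

    @property
    def status_flag(self):
        # Logical AND of all busy signals: True means every slot is full.
        return all(self.busy)
```

A sending port that sees status_flag set would then either deflect its flits or hold them, as described in the text.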
2.3 iDEAL: Inter-router Dual-function
Energy and Area-efficient Links for NoC
architectures
Continuing the improvement of router design,
Kodi et al. [5] present a new class of NOC
architecture that saves up to 40% of buffer power
and 41% of router area. The basic idea is to use the
repeaters in the links to act dynamically as buffers.
iDEAL replaces some of the conventional buffers
with three-state repeaters. When the control signal
is low, a three-state repeater acts like a
conventional repeater. When the control signal is
high, it acts as a buffer that holds the bit.
Figure 1 illustrates the conventional router
architecture, in which each virtual channel has 4
buffer slots of 128 bits each. Some of these buffers
can be removed from the router and placed on the
link, which saves router area and power. Figure 4
shows the router buffer size reduced from
v4-r16-c0 to v4-r8-c8. A congestion control signal
dynamically configures these adaptive link buffers
(ALBs) to act as repeaters or buffers according to
traffic load. iDEAL improves power and area by
more than 40% with a 1-2% degradation in
performance [5].
Figure 4. Dual function links used in iDEAL NOC architecture (Courtesy of Kodi, Avinash Karanth, Ashwini
Sarathy, and Ahmed Louri. "iDEAL: Inter-router dual-function energy and area-efficient links for network-on-chip
(NoC) architectures." ACM SIGARCH Computer Architecture News. Vol. 36. No. 3. IEEE Computer Society, 2008)
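The configuration labels v4-r16-c0 and v4-r8-c8 can be unpacked with a short calculation. The sketch below assumes that a label vX-rY-cZ means X virtual channels, Y buffer slots inside the router, and Z buffer slots on the link, per input port; this reading is consistent with the 4 VCs of 4 slots each mentioned above, but it is an assumption, and the function names are hypothetical.

```python
import re


def parse_config(label):
    """Split a label such as 'v4-r16-c0' into its three counts (assumed meaning)."""
    match = re.fullmatch(r"v(\d+)-r(\d+)-c(\d+)", label)
    if match is None:
        raise ValueError(f"unrecognised configuration label: {label}")
    vcs, router_slots, link_slots = map(int, match.groups())
    return {"vcs": vcs, "router_slots": router_slots, "link_slots": link_slots}


def storage_bits(config, slot_bits=128):
    """Total storage per input port, counting router and link slots together."""
    return (config["router_slots"] + config["link_slots"]) * slot_bits


baseline = parse_config("v4-r16-c0")
ideal = parse_config("v4-r8-c8")
```

Under this reading, both configurations offer the same total storage per input port, while the number of slots inside the router is halved; the saving comes from moving half of the storage onto the link.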
2.4 BiNoC: A Bidirectional NoC Architecture
with Dynamic Self-Reconfigurable Channel
Bidirectional NoCs allow each communication
channel to be dynamically configured in either
direction to enhance performance. This design
shows a significant increase in performance with
some area penalty [6]. The aim is to use the
channel's bandwidth more effectively. In BiNoC,
if the outgoing channel carries more traffic than the
incoming channel, the direction of the incoming
channel can be switched. In this way the load is
shared between the two channels. BiNoC is useful
in networks where traffic density differs greatly
between opposite directions.
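The direction-switching idea can be sketched as a simple decision rule for the two channels between a pair of routers. This is an illustration, not the BiNoC protocol from [6]: the function name and the demand counts are hypothetical, and the extra condition that a channel is reversed only when the lighter direction has nothing pending is an assumption added here so that the lighter direction is not starved. The arbitration needed between the two routers is omitted.

```python
def configure_channels(out_demand, in_demand):
    """Return the directions of the two channels, seen from the local router.

    out_demand / in_demand: number of flits waiting in each direction
    (hypothetical inputs for this sketch).
    """
    if out_demand > in_demand and in_demand == 0:
        return ("out", "out")   # borrow the idle incoming channel
    if in_demand > out_demand and out_demand == 0:
        return ("in", "in")     # lend the idle outgoing channel
    return ("out", "in")        # default: one channel per direction
```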
3. DESIGN OF INTELLIGENT SPACE
BUFFERS
3.1 NOC router Architecture
We use an n × n 2-D mesh architecture. Routers
are buffer-less, and each is connected to a
processing element (PE). Each router is connected
to its four adjacent neighbors: north, east, south,
and west. Packets are divided into head, body, and
tail flits, as in conventional architectures. A
deflection routing algorithm is used in this design.
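The neighbor relation in the mesh can be written down directly. The sketch below assumes a coordinate convention that is not given in the text: router (0, 0) sits in the north-west corner, x grows eastward, and y grows southward. The function name is hypothetical.

```python
def mesh_neighbors(x, y, n):
    """Return the neighbors of router (x, y) in an n x n mesh, keyed by direction."""
    candidates = {
        "north": (x, y - 1),
        "east": (x + 1, y),
        "south": (x, y + 1),
        "west": (x - 1, y),
    }
    # Keep only the positions that fall inside the mesh.
    return {
        direction: (cx, cy)
        for direction, (cx, cy) in candidates.items()
        if 0 <= cx < n and 0 <= cy < n
    }
```

Note that only interior routers have all four neighbors; routers on an edge have three and corner routers have two, which limits the ports available for deflection there.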
3.2 Problem description
Buffer-less routers show a significant
degradation in performance and power
consumption at high injection rates, which defeats
the purpose of going buffer-less [6].
Figure 5. (a) Dropped packet in case of congestion
for BLESS router architecture.
(b) Redirected packet in case of congestion for
BLESS architecture.
In figure 5, suppose that routers B and C both
send packets to the same output port of router A.
Router A must then drop one of the packets,
because it has no buffers to store packets and only
one packet can use that output port at a time.
Alternatively, if a deflection-based routing
algorithm is employed, the losing packet is
redirected to any free output port. A deflected
packet takes longer to reach its destination, which
degrades the overall performance of the BLESS
router design.
3.3 Intelligent space buffers (ISBs)
implementation
In this section we detail the implementation of
intelligent space buffers and associated control unit.
Figure 6. Proposed intelligent space buffers.
Figure 6 illustrates the conventional buffers
replaced by a stack of buffers placed outside the
router. When the decision and control unit's signal
is low, the buffers are in power-down mode. Under
congestion, the buffers are activated and hold the
data bits. The buffers remain active until
congestion is alleviated. This implementation
enables buffer-less routers to perform well at high
injection rates. The control unit is the heart of ISBs
and is discussed in the next section.
3.4 Control Unit Implementation
The control unit switches the buffers between
power-down mode and active mode according to
congestion. A single control unit is responsible for
activating all the space buffers shown in figure 6.
As illustrated in figure 7, the control unit contains a
counter that counts the flits or packets flowing on a
particular link. For simplicity only one link is
shown, but in a practical implementation the
control unit controls 2 links. A comparator
compares the count from the counter with a
predetermined stored value "P". If the count
exceeds this threshold (P), the decision and control
unit sends the activate signal to the respective
buffers. The control unit also sends the switching
signals to sw1 and sw2, so that all traffic from port
A to B traverses the buffer unit. The overhead of
the control unit is negligible compared with the
power saving.
Figure 7. Proposed control unit implementation
for ISBs
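The counter and comparator behaviour described above can be sketched as a small model. The text does not say over what interval flits are counted, so the fixed sampling window used here is an assumption, as are the class and method names; the sw1 and sw2 switching signals are represented only by the single buffers_active flag.

```python
class ControlUnit:
    """Counts flits on one link and activates the space buffers above threshold P."""

    def __init__(self, threshold_p):
        self.threshold_p = threshold_p  # predetermined stored value P
        self.count = 0                  # flits seen in the current window
        self.buffers_active = False     # False models power-down mode

    def observe_flit(self):
        """Counter: called once for every flit that crosses the link."""
        self.count += 1

    def end_window(self):
        """Comparator: at the end of a sampling window (assumed), compare with P."""
        self.buffers_active = self.count > self.threshold_p
        self.count = 0                  # start a fresh window
        return self.buffers_active
```

With P = 3, for example, a window carrying four flits activates the buffers, and a following window carrying two flits returns them to power-down mode.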
Figure 8. Proposed algorithm implemented at
control unit of ISB architecture
Figure 8 illustrates the detailed algorithm
implemented at the control unit. The main open
issue is how to determine the threshold value.
Another is how much buffer space to allocate to
each channel under congestion. We have used 80%
for the prototype, but this still needs refinement.
3.5 Dynamic space buffers in Bi-Directional
links
The proposed intelligent space buffers
architecture can be further optimized with
bidirectional links [6]. Figure 9 illustrates the
behavior of the links when traffic in one direction
dominates the other. In figure 9(b), R1 (Router 1)
configures both channels and links as outputs when
traffic from R1 to R2 exceeds traffic from R2 to R1.
Figure 9(c) illustrates the opposite scenario, in
which traffic from R2 to R1 dominates.
The block diagram in figure 10 illustrates the
bidirectional channel or link between routers A and
B. Introducing bidirectional links can improve
performance [6] at high injection rates.
Figure 9.
(a) Conventional unidirectional link between
routers R1 and R2.
(b) Reconfigured links for congestion from R1 to
R2 router.
There is also scope for power reduction in our
design by using bidirectional channels instead of
unidirectional ones. The algorithm at the router
interface works in a fashion similar to that
described in [6].
Figure 10. Bidirectional links implemented in
ISBs
Suppose that a router cannot process a new
packet within 2 ns of the previous one, and that
router A sends a packet to router B at 1 ns followed
by another packet on the same port interface at
2 ns. Router B cannot process a new request before
3 ns, so it will drop the second packet. Instead, we
can use the incoming channel from router B to
router A at the same port if it is free. Control
circuitry is needed to switch the direction of the
port. If 2 or more packets request the same port at
2 ns, the algorithm illustrated in figure 8, running
on the control circuitry of the space buffers, starts
executing.
3.6 Power gated frame implementation
Figure 11. Proposed pipelined power gating scheme
Power gating suffers from wake-up latency,
which impacts performance [10] [11]. We use
sleep-mode transistors in ISBs for performance
optimization. 10% of the total transistors are in
sleep mode and 90% remain completely shut off.
When the injection rate at any port is high, the
control block redirects the traffic via the buffers.
When 8% of the buffers are occupied, 30% of the
remaining buffers are triggered to wake up, which
hides the wake-up latency. As shown in figure 11,
when traffic falls below the threshold we can start
sending buffers back to power-down mode. We
have assumed a 10% drop in buffer space when the
load decreases below some threshold value. State 5
indicates that at most 90% of the buffers are
utilized. Beyond this point, all packets destined for
that port are discarded, which prevents congestion
from spreading to other ports. The proposed gating
scheme can therefore perform well at high
injection rates too. Because it overcomes wake-up
latency, the scheme offers higher performance than
conventional power gating. Buffers in power-down
mode are completely shut down, so static power
dissipation is lower in the pipelined power gating
scheme.
The pipelined power gating scheme is easy to
implement and promising in terms of power and
performance. The exact performance gain can be
calculated after simulations. Our estimate shows a
saving of more than 5 clock cycles: a saving of 5
clock cycles is reported in [11], and pipelined
power gating can improve on this further.
4. DESIGN COMPLEXITY
The proposed ISBs architecture is not an
area-efficient design, because we dynamically
control links as well as buffers. The control
circuitry may take a large percentage of the area.
Another issue is the predetermined threshold value
used in the control unit. We need to validate the
proposed design under real-time traffic. A learning
mechanism could set the threshold, but the area
constraint is the major issue that must be addressed
for ISBs to succeed.
5. FUTURE WORK
While ISBs is an appealing design for its balance
of power and performance, a large design space
spans the gap between traditional and ISBs
architectures. The first item is an area-efficient
design for the ISBs NOC architecture, which is not
discussed in this paper. Another is the permutation
and priority schemes to be implemented at the
control block under congestion. Deadlock may
also be a problem for ISBs because of the newly
introduced buffers. Flow control is currently
implemented with a counter, which can be
improved to make ISBs more power efficient and
better performing.
6. CONCLUSION
In this paper we propose a novel architecture to
address performance and power issues in NOC.
ISBs uses buffer-less routers and bidirectional
links to achieve significant power savings. To
address performance, we provide self-configured
intelligent space buffers. The architecture has not
yet been evaluated in simulation because of time
constraints. We hope that the proposed architecture
will inspire new ideas for work on NOC.
7. REFERENCES
[1] Y. Hoskote, S. Vangal, A. Singh, N. Borkar,
and S. Borkar. "A 5-GHz mesh interconnect for
a teraflops processor". IEEE Micro, 27(5),
2007.
[2] Taylor, Michael Bedford, et al. "Evaluation of
the Raw microprocessor: An exposed-wire-delay
architecture for ILP and streams." ACM
SIGARCH Computer Architecture News. Vol.
32. No. 2. IEEE Computer Society, 2004.
[3] Moscibroda, Thomas, and Onur Mutlu. "A case
for bufferless routing in on-chip
networks." ACM SIGARCH Computer
Architecture News. Vol. 37. No. 3. ACM, 2009.
[4] Latif, Khalid, Tiberiu Seceleanu, and Hannu
Tenhunen. "Power and area efficient design of
network-on-chip router through utilization of
idle buffers." Engineering of Computer Based
Systems (ECBS), 2010 17th IEEE International
Conference and Workshops on. IEEE, 2010.
[5] Kodi, Avinash Karanth, Ashwini Sarathy, and
Ahmed Louri. "iDEAL: Inter-router dual-
function energy and area-efficient links for
network-on-chip (NoC) architectures." ACM
SIGARCH Computer Architecture News. Vol.
36. No. 3. IEEE Computer Society, 2008.
[6] Y.C. Lan, S.H. Lo, Y.C. Lin, Y.H. Hu, and S.J.
Chen, "BiNoC: A Bidirectional NoC
Architecture with Dynamic Self-
Reconfigurable Channel," in Proc. of the 3rd
ACM/IEEE International Symposium on
Networks-on-Chip, pp. 266-275, 2009.
[7] W. Hangsheng, L. S. Peh, and S. Malik. “Power
driven design of router microarchitectures in
on-chip networks,” Proceedings of the 36th
Annual IEEE/ACM International Symposium
on Microarchitecture (MICRO), pp. 105-116,
2003.
[8] Xuning Chen and Li-Shiuan Peh. "Leakage
power modeling and optimization of
interconnection networks". Proceedings of
International Symposium on Low Power
Electronics and Design, pp. 90-95, 2003.
[9] T. T. Ye, L. Benini, G. De Micheli. “Analysis
of power consumption on switch fabrics in
network routers,” Proceedings of the 39th
Design Automation Conference (DAC), pp.
524-529, 2002.
[10] Z. Hu, A. Buyuktosunoglu, V. Srinivasan, V.
Zyuban, H. Jacobson, and P. Bose,
"Microarchitectural techniques for power
gating of execution units," in International
Symposium on Low Power Electronics and
Design (ISLPED), CA, USA, pp. 32-37, 2004.
[11] H. Matsutani, M. Koibuchi, W. Daihan, and H.
Amano, "Run-time power gating of on-chip
routers using look-ahead routing," in 13th Asia
and South Pacific Design Automation
Conference (ASP-DAC), Piscataway, NJ, USA,
pp. 55-60, 2008.