NoC the Next Generation of Multi-Processor SoC
Network-on-Chip: The Next Generation of Multi-Processor System-on-Chip
EAIT, 2011
Presenters
Dr. Santanu Chattopadhyay, Associate Professor
Santanu Kundu, Research Scholar
Dept. of Electronics and Electrical Communication Engineering
Indian Institute of Technology, Kharagpur.
email: santanu, [email protected]
Feb, 2011
Lecture – 1
Introduction
Dept. of Electronics & Electrical Communication Engg., IIT Kharagpur
Introduction
After mass-market production of dual-core and quad-core processor chips, the trend towards multi-core processing is now well established.
In multi-core processing, multiple processors (e.g. CPU, DSP) along with multiple computer components (e.g. microcontroller, memory blocks, timers) are integrated onto a single silicon chip. This architecture is often called a Multi-Processor System-on-Chip (MPSoC).
[Figure: Architecture overview of a Multi-Processor System-on-Chip. End nodes (device, SW interface, HW interface) are attached by links to a communication medium.]
Introduction
Each on-chip component is referred to as an Intellectual Property (IP) block.
The communication medium used in modern multi-processor chips is bus based.
Up to tens of cores in a single chip, the performance of these bus-based chips is satisfactory. Beyond that, performance degrades with the number of cores attached.
The communication backbone used in modern SoCs is a shared bus.
[Figure: System-on-Chip (SoC).]
Limitation of Shared Global Bus
• Communication Bottleneck: A shared bus allows only one communication at a time, and even in a hierarchical bus, a single communication can block all buses of the hierarchy.
• Scalability: A bus-based SoC does not scale with the system size, and its bandwidth is shared by all the systems attached to it.
[Figure: nodes on a shared bus; X marks a blocked communication.]
Limitation of Shared Global Bus
• The intrinsic parasitic resistance and capacitance can be quite high for a long bus line.
• The global bus delay increases exponentially as process technology scales down.
• Every additional IP block adds to parasitic capacitance and causes increased propagation delay.
• In the deep sub-micron era, 80% or more of the delay of critical paths will be due to global interconnects.
[Figure: Relative evolution of wire and gate delays.]
Reference: International Technology Roadmap for Semiconductor (ITRS) Documents (2003), Available at: http://public.itrs.net/Files/2003ITRS/Home2003.htm.
Shared Global Bus to Segmented Bus
• The shared global bus is segmented by inserting repeaters (R).
• In a segmented bus, delay increases linearly as process technology scales down.
• No improvement in bandwidth, as it is still shared by all the cores attached to it.
• At the system level, it has a profound effect in changing the focus from computation to communication.
[Figure: Segmented Bus; Multi-Level Segmented Bus.]
Point-to-Point Dedicated Links
Advantage:
• Bandwidth is higher than the shared bus.
Drawback:
• Switch size increases with increase in number of cores.
• Number of links needed grows quadratically with the number of cores: N(N − 1)/2 links for N fully connected cores.
• More metal layers are required in placement and routing.
[Figure: eight cores (0 to 7) connected by dedicated point-to-point links.]
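The wiring cost of dedicated links can be made concrete with a small sketch. It assumes every pair of cores gets its own bidirectional link (full connectivity, as the eight-core figure suggests); the function name `full_mesh_links` is illustrative and not taken from the slides.

```python
def full_mesh_links(n_cores: int) -> int:
    """Bidirectional links needed when every pair of cores is wired directly.

    Assumption: full connectivity, i.e. one link per unordered pair of cores.
    """
    return n_cores * (n_cores - 1) // 2


if __name__ == "__main__":
    for n in (4, 8, 16, 64):
        # Each core also needs (n - 1) ports, which is why switch size grows too.
        print(f"{n:3d} cores -> {full_mesh_links(n):5d} links, {n - 1} ports per core")
```

Doubling the core count roughly quadruples the number of links, which is what drives the demand for extra metal layers.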
Centralized Crossbar Switch
Components:
• Crossbar switch and
• Point-to-point links.
Advantage:
• A crossbar switch enhances the scalability to some extent.
Drawback:
• However, connecting a large number of cores with a single switch is not very effective, as it is not ultimately scalable; thus, it is an intermediate solution.
[Figure: nodes connected through a central crossbar switch.]
Network-on-Chip: A Paradigm Shift
Only 3 components:
1. Network Interface (NI)
2. Switch (Router)
3. Point-to-Point Links
Off-Chip vs. On-Chip Networks
o The bandwidth of off-chip networks is typically much lower than that of on-chip networks.
o Off-chip networks are often affected by clock skew, whereas the clock skew problem is less significant for on-chip networks.
o Off-chip networks have higher latency than their on-chip counterparts.
o Area is not a strong constraint for off-chip networks, but for on-chip networks it is one of the major constraints.
Reference: Benini, L. and Micheli, G.D. (2002) ‘Network on chips: a new SOC paradigm’, IEEE Computer, Vol. 35, No. 1, pp.70–78.
Layers of Abstraction in Network-on-Chip
Session Layer - NoC Abstraction (Open Core Protocol Standardization)
Transport Layer - Network Interface
Network Layer - Router / Switch
Data Link Layer - Flow Control Protocol, Error Handling
Physical Layer - Physical Wire Connection
SoC to NoC: An Evolution
SoC
• Bandwidth is limited, shared
• Speed goes down as N grows
• Central arbitration
• No layers of abstraction
However:
• Fairly simple.
NoC
• Aggregate bandwidth grows
• Speed unaffected by N
• Distributed arbitration
• Separate abstraction layers
However:
• Complex architecture.
Design Goal of Network-on-Chip
• High throughput
• Low latency
• Scalable architecture
• Less energy consumption
• Smaller area requirements
• Reliability in communication
• Quality-of-Service support
Lecture – 2
Architecture Design and Performance Evaluation of Network-on-Chip
Design Issues in Network-on-Chip
• Topology Selection
• Switching Techniques
• Routing
• Flow Control Protocol & GALS Implementation
• Buffering
• Arbitration
Switching Techniques
• Circuit Switching
Request for circuit establishment (routing and arbitration are performed during this step). Buffers hold “request” tokens.
[Figure: source end node and destination end node.]
Circuit Switching
Request for circuit establishment
Acknowledgment and circuit establishment (as the token travels back to the source, connections are established). Buffers hold “ack” tokens.
Circuit Switching
Request for circuit establishment
Acknowledgment and circuit establishment
Message transport (neither routing nor arbitration is required)
Circuit Switching
Acknowledgment and circuit establishment
Packet transport
High contention, low utilization → low throughput
[Figure: a second request is blocked (X) while the circuit is held.]
Switching Techniques
• Store-and-forward Packet Switching
Packets are completely stored before any portion is forwarded.
Requirement: buffers must be sized to hold an entire packet.
Latency per router depends on the size of the packet.
Drawback:
1. Larger buffer
2. More latency
[Figure: data packets are stored in buffers and then forwarded, hop by hop, from source end node to destination end node.]
Switching Techniques
• Virtual Cut-Through Packet Switching
Latency per router is reduced by forwarding the header flit of a packet as soon as space for the entire packet is available in the next router.
Requirement: buffers must be sized to hold an entire packet.
Drawback: Larger buffer
Advantage: Lower latency
[Figure: with a busy link between source end node and destination end node, the packet is completely stored at the switch.]
Switching Techniques
• Wormhole Packet Switching
Requirement: packets can be larger than buffers.
Advantage: Less buffer space, lower latency.
Drawback: Lower throughput than Virtual Cut-Through.
[Figure: with a busy link between source end node and destination end node, the packet is stored along the switches.]
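The latency difference between store-and-forward and the cut-through style techniques (virtual cut-through and wormhole) can be illustrated with the textbook zero-load model. This is a sketch under stated assumptions, not a formula from the slides: one flit crosses one link per cycle, there is no contention, and `router_delay` (cycles spent in each router) is a hypothetical parameter.

```python
def store_and_forward_latency(hops: int, packet_flits: int, router_delay: int = 1) -> int:
    """Zero-load latency in cycles when every router buffers the whole packet.

    Each hop pays the router delay plus the time to receive all flits.
    """
    return hops * (router_delay + packet_flits)


def cut_through_latency(hops: int, packet_flits: int, router_delay: int = 1) -> int:
    """Zero-load latency in cycles for virtual cut-through or wormhole switching.

    Only the header pays the per-hop delay; the remaining flits stream behind it.
    """
    return hops * router_delay + packet_flits


if __name__ == "__main__":
    hops, flits = 4, 64  # 64-flit packets, as used in the NI slide
    print("store-and-forward:", store_and_forward_latency(hops, flits))
    print("cut-through/wormhole:", cut_through_latency(hops, flits))
```

Under load the two cut-through variants diverge: a blocked wormhole packet stays spread across several routers and holds their links, which is consistent with the lower throughput noted on the slide.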
Network Interface (NI) Module
• Protocol Conversion
• Clock Domain Shifting
Network Interface (NI) Module
Flitization: a packet of (64 × 32) bits is split into flits. Deflitization rebuilds the packet of (64 × 32) bits from the flits.
Flit format (every flit carries the GT/BE, eop and bop fields):
Header (32-bit): GT/BE, eop, bop, Src_add, Dest_add
Payload 1 (32-bit): GT/BE, eop, bop, DATA 1
Payload 2 (32-bit): GT/BE, eop, bop, DATA 2
…
Tail (32-bit): GT/BE, eop, bop, DATA n
• 1 Packet = 64 Flits
• 1 Flit = 32 bits
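Flitization and deflitization can be sketched as a pair of inverse functions. The slide gives only the field names, so several details here are assumptions made for illustration: `bop` and `eop` are taken to mark the beginning and end of a packet, the header packs 8-bit source and destination addresses into its data field, and the `Flit` class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Flit:
    bop: bool  # assumed meaning: first flit of a packet
    eop: bool  # assumed meaning: last flit of a packet
    gt: bool   # GT/BE field; True = GT, False = BE (encoding assumed)
    data: int  # header: packed (src, dest); otherwise one payload word


def flitize(src: int, dest: int, payload: List[int], gt: bool = False) -> List[Flit]:
    """Split a packet into a header flit followed by one flit per payload word."""
    if not payload:
        raise ValueError("payload must contain at least one word")
    header = Flit(bop=True, eop=False, gt=gt, data=(src << 8) | dest)  # 8-bit addresses assumed
    body = [Flit(bop=False, eop=False, gt=gt, data=word) for word in payload]
    body[-1].eop = True  # the last payload word plays the role of the tail flit
    return [header] + body


def deflitize(flits: List[Flit]) -> Tuple[int, int, List[int]]:
    """Rebuild (src, dest, payload) from a complete flit sequence."""
    if not (flits[0].bop and flits[-1].eop):
        raise ValueError("incomplete packet")
    head = flits[0].data
    return head >> 8, head & 0xFF, [f.data for f in flits[1:]]


if __name__ == "__main__":
    flits = flitize(src=3, dest=9, payload=list(range(63)))
    print(len(flits), "flits")  # 1 header + 63 payload words
    print(deflitize(flits)[:2])
```

With 63 payload words the sketch yields the 64 flits per packet quoted on the slide.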
Design Issues in Network-on-Chip
• Switching Techniques
• Topology Selection
• Routing
• Flow Control Protocol & GALS Implementation
• Buffering
• Arbitration
Topology Selection
• Diameter: Maximum shortest-path distance between two nodes in the network. Networks with small diameters are preferable.
• Average Distance: The average of the distances between all pairs of nodes of a graph. A topology with a smaller average distance is preferable.
• Bisection Width: Minimum number of wires that must be removed in order to bisect a network. A larger bisection width enables faster information exchange and is preferable.
• Node Degree: Number of channels connecting a node to its neighbors. The lower this number, the easier it is to build the network.
• Number of Links: A topology with a large number of links can support high bandwidth.
• Topology selection is application dependent.
[Figure: 2D Mesh with 16 cores.]
Reference: Interconnection Network Architectures (2001) pp.26–49, Available at: www.wellesley.edu/cs/courses/cs331/notes/notesnetworks.pdf
Existing Topologies in NoC
2D Mesh
All switches are connected to the four closest other switches and to the target resource block via two opposite uni-directional links, except those switches on the edge of the layout.
For an M×N Mesh,
Diameter: (M + N - 2)
Bisection Width: min (M, N)
No. of routers required: (M * N)
Node Degree: 3 (corner), 4 (edge), 5 (central).
CLICHÉ: Chip-Level Integration of Communicating Heterogeneous Elements
[Figure: 2D mesh of 16 cores.]
Reference: Kumar, S., Jantsch, A., Soininen, J. P., Forsell, M., Millberg, M., Oberg, J., Tiensyrja, K. and Hemani, A. (2002) ‘A network on chip architecture and design methodology’, Proc. of ISVLSI, pp.117–124.
Existing Topologies in NoC
2D Torus
Wires are wrapped around from the top component to the bottom and from the rightmost to the leftmost.
For an M×N Torus,
Diameter: M/2 + N/2
Bisection Width: 2 * min (M, N)
No. of routers required: (M * N)
Node Degree: 5
Disadvantage: The long end-around connections can yield excessive delays.
[Figure: 2D torus of 16 cores.]
Reference: Dally, W. J. and Towles, B. (2001) ‘Route packets, not wires: on-chip interconnection networks’, Proceedings of the 38th Design Automation Conference (DAC 2001), pp.684–689.
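The diameter formulas for the mesh and the torus can be cross-checked by brute force. The sketch below builds the router graph implicitly and runs a breadth-first search from every node; the function names are illustrative, and only router-to-router hops are counted (the link to the local core is ignored).

```python
from collections import deque
from itertools import product


def neighbours(x, y, m, n, wrap):
    """Yield the routers adjacent to (x, y) in an m x n mesh, or torus if wrap is True."""
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nx, ny = x + dx, y + dy
        if wrap:
            yield nx % m, ny % n          # end-around connections
        elif 0 <= nx < m and 0 <= ny < n:
            yield nx, ny                  # mesh: no link beyond the edge


def diameter(m, n, wrap=False):
    """Largest shortest-path hop count over all router pairs, found by BFS."""
    best = 0
    for src in product(range(m), range(n)):
        dist = {src: 0}
        queue = deque([src])
        while queue:
            cur = queue.popleft()
            for nxt in neighbours(cur[0], cur[1], m, n, wrap):
                if nxt not in dist:
                    dist[nxt] = dist[cur] + 1
                    queue.append(nxt)
        best = max(best, max(dist.values()))
    return best


if __name__ == "__main__":
    print("4x4 mesh :", diameter(4, 4), "expected", 4 + 4 - 2)
    print("4x4 torus:", diameter(4, 4, wrap=True), "expected", 4 // 2 + 4 // 2)
```

The same search can be pointed at other regular topologies by swapping the `neighbours` function.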
Existing Topologies in NoC
Solving the Delay Problem of Torus
Reducing the maximum physical link length
Existing Topologies in NoC
Folded Torus
[Figure: 2D torus of 16 cores; 2D folded torus of 16 cores.]
Reference: Dally, W.J. and Seitz, C.L. (1986) ‘The torus routing chip’, Journal of Distributed Computing, Vol. 1, No. 4, pp.187–196.
Existing Topologies in NoC
Octagon
For a network having N IP blocks,
Diameter: 2 * N/8.
For a system consisting of more than eight nodes, the network is extended to multidimensional space.
Drawback: Wiring complexity increases linearly with the number of nodes.
[Figure: 2D octagon of 8 cores.]
Reference: Karim, F., Nguyen, A. and Dey, S. (2002) ‘An interconnect architecture for networking systems on chips’, IEEE Micro, Vol. 22, No. 5, pp.36–45.
Existing Topologies in NoC
Binary Tree
A binary tree-based network with N IP cores (N a power of 2) has,
Diameter: log2 N
Bisection Width: 1
No. of Routers required: (N/2 − 1)
Node Degree: 5 (leaf), 3 (stem), 2 (root)
Drawback: Bisection width is very small.
Advantage: Smaller diameter.
[Figure: 2D binary tree of 16 cores.]
Reference: Jeang, Y. L., Huang, W. H. and Fang, W. F. (2004) ‘A binary tree architecture for application specific network on chip (ASNOC) design’, IEEE Asia-Pacific Conference on Circuits and Systems, pp.877–880.
Existing Topologies in NoC
Fat Tree
Every level has the same number of switches. The functional IP blocks reside at the leaves and the switches reside at the vertices.
For N IP blocks, the network has,
Diameter: log2 N/4
Bisection Width: N/2
No. of Routers required: (N. log2 N)/8
Node Degree: 8 (non-root node), 4 (root node).
Advantage: Large bisection width, smaller diameter
Drawback: High node degree
SPIN: Scalable, Programmable, Integrated Network
[Figure: 2D fat tree of 16 cores.]
Reference: Guerrier, P. and Greiner, A. (2000) ‘A generic architecture for on-chip packet-switched interconnections’, Proceedings of Design, Automation and Test in Europe (DATE 2000), pp.250–256.
Existing Topologies in NoC
Butterfly Fat Tree (BFT)
In the network, the IPs are placed at the leaves and the switches are placed at the vertices. For N IPs, the network has,
Diameter: log2 N/4
Bisection Width: √N
No. of Routers needed: (≈ N/2)
Node Degree: 6 (non-root), 4 (root)
Advantage
- Requires fewer switches
- Low diameter and large bisection width
Drawback
- High node degree.
[Figure: 2D BFT of 16 cores.]
Reference: Pande, P. P., Grecu, C., Ivanov, A. and Saleh, R. (2003), ‘High-throughput switch-based interconnect for future SoCs’, Proc. Int’l Workshop on System-on-Chip for Real Time Applications, pp.304–310.
Mesh-of-Tree Topology
- In an M × N MoT, M denotes the number of Row Trees and N denotes the number of Column Trees. Both M and N are powers of 2.
- Number of nodes = 3*M*N – (M + N).
- Small Diameter (2 log2 M + 2 log2 N).
- Large Bisection Width [min (M,N)].
Drawback
- Non-planar topology.
[Figure: 4 × 4 Mesh-of-Tree connecting 32 cores.]
Reference: Kundu, S. and Chattopadhyay, S. (2008), “Mesh-of-Tree Deterministic Routing for Network-on-Chip Architecture”, ACM Great Lakes Symposium on VLSI, pp. 343–346.
Design Issues in Network-on-Chip
• Switching Techniques
• Topology Selection
• Routing
• Flow Control Protocol & GALS Implementation
• Buffering
• Arbitration
Routing
Source Routing vs. Distributed Routing
Source routing
- The route is computed at the source, so the routing control unit in the switches is simplified.
- Headers containing the route tend to be larger, which increases overhead.
Distributed routing
- The next hop is computed at each switch by a finite-state machine or by a look-up table.
Deterministic Routing vs. Adaptive Routing
Deterministic routing
- Always follows a specified path.
- Easy to implement and supports in-order delivery.
Adaptive routing
- Chooses different paths based on congestion and faults; destroys in-order delivery.
- Uses historical channel load information, length of queues, and status of nodes and links.
Routing Challenges
• Livelock
- Arises from an unbounded number of allowed non-minimal hops.
- Solution: restrict the number of non-minimal hops allowed.
• Deadlock
- Arises from a set of packets being blocked waiting only for network resources (i.e., links, buffers) held by other packets in the set.
- Probability increases with increased traffic and decreased availability.
[Figure: Livelock in adaptive routing.]
Routing Dependent Deadlock
ci = channel i, si = source node i, di = destination node i, pi = packet i
[Figure: Routing of packets p1 to p5 over channels c0 to c12 in a 2D mesh, and the corresponding channel dependency graph.]
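A routing function is deadlock prone when its channel dependency graph contains a cycle, so a cycle check is a natural way to examine one. The sketch below uses a depth-first search with three node colours. The two example graphs are hypothetical and do not reproduce the edges of the slide's figure, which are not recoverable from the text.

```python
def has_cycle(graph):
    """Return True if the directed graph {channel: [dependent channels]} has a cycle."""
    WHITE, GREY, BLACK = 0, 1, 2          # unvisited, on the current DFS path, finished
    colour = {node: WHITE for node in graph}

    def visit(node):
        colour[node] = GREY
        for nxt in graph.get(node, ()):
            state = colour.get(nxt, WHITE)
            if state == GREY:             # back edge: a dependency loops onto the current path
                return True
            if state == WHITE and visit(nxt):
                return True
        colour[node] = BLACK
        return False

    return any(colour[node] == WHITE and visit(node) for node in list(graph))


if __name__ == "__main__":
    # Hypothetical dependencies: four packets each hold one channel and wait for the next.
    cyclic = {"c1": ["c2"], "c2": ["c3"], "c3": ["c4"], "c4": ["c1"]}
    # Hypothetical dependencies with an ordering on channels, so no loop can close.
    acyclic = {"c1": ["c2"], "c2": ["c3"], "c3": []}
    print("cyclic graph deadlock prone:", has_cycle(cyclic))
    print("acyclic graph deadlock prone:", has_cycle(acyclic))
```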
Routing Dependent Deadlock Avoidance
Deterministic Routing in 2D mesh using Dimension Ordered Routing
Establish an ordering on all resources based on network dimension. Example:
X-Y Routing: First, route horizontally until the X co-ordinate matches that of the destination; then route vertically until the Y co-ordinate matches.
X-Y Routing → No cycle in the Channel Dependency Graph
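Dimension-ordered routing is short enough to sketch directly. The convention used here is an assumption: nodes are addressed as (x, y), horizontal hops change x and vertical hops change y, and the X dimension is always exhausted first. The function name `xy_route` is illustrative.

```python
def xy_route(src, dst):
    """Return the list of (x, y) nodes visited from src to dst under X-Y routing."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:                    # phase 1: move along X only
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:                    # phase 2: move along Y only
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


if __name__ == "__main__":
    print(xy_route((0, 0), (2, 1)))
    print(xy_route((3, 2), (1, 0)))
```

Because a packet never turns from the Y dimension back into the X dimension, the turns needed to close a dependency cycle never occur.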
Routing Dependent Deadlock Avoidance
Deadlock Free Adaptive Routing in 2D Mesh: Turn Model
• West First
• North Last
• Negative First
[Figure: turns permitted under deterministic routing and under adaptive routing.]
Reference: Glass, C. J. and Ni, L. M. (1992), ‘Turn Model for Adaptive Routing’, Proceedings of International Symposium on Computer Architecture, pp. 278 – 287.
Routing Dependent Deadlock Avoidance
Deadlock Free Adaptive Routing in 2D Mesh: Odd-Even Turn Model
Rule 1. No packet is allowed to take an EN turn or an ES turn at any node located in an even column.
Rule 2. No packet is allowed to take an NW turn or an SW turn at any node located in an odd column.
Reference: Chiu, G. M. (2000), ‘The Odd-Even Turn Model for Adaptive Routing’, IEEE Transactions on Parallel and Distributed Systems, pp. 729 – 738.
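The two odd-even rules translate into a small predicate that an adaptive router could consult before choosing an output port. In this sketch a turn is written as two letters, the current travel direction followed by the new one (so "EN" means a packet heading east that turns north); the function and constant names are illustrative.

```python
FORBIDDEN_IN_EVEN_COLUMN = {"EN", "ES"}   # Rule 1
FORBIDDEN_IN_ODD_COLUMN = {"NW", "SW"}    # Rule 2


def turn_allowed(turn: str, column: int) -> bool:
    """Return True if the odd-even turn model permits `turn` at a node in `column`."""
    if column % 2 == 0:
        return turn not in FORBIDDEN_IN_EVEN_COLUMN
    return turn not in FORBIDDEN_IN_ODD_COLUMN


if __name__ == "__main__":
    for turn, column in (("EN", 0), ("EN", 1), ("NW", 3), ("NW", 2), ("WN", 0)):
        verdict = "allowed" if turn_allowed(turn, column) else "forbidden"
        print(f"{turn} turn in column {column}: {verdict}")
```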
Routing Dependent Deadlock Avoidance
Deterministic Routing in 2D Torus and Folded Torus by using Virtual Channels
Messages at a node numbered less than their destination node are routed on the high channels, and messages at a node numbered greater than their destination node are routed on the low channels.
[Figure: four-node ring n0, n1, n2, n3 with routes n0 → n2, n1 → n3, n2 → n0, n3 → n1.]
Reference: Dally, W. J. and Seitz, C. L., (1987) ‘Deadlock Free Message Routing in Multiprocessor Interconnection Networks’, IEEE Transactions on Computers, vol. C-36, no. 5, pp. 547 – 553.
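The high/low channel rule can be traced on the four-node ring from the figure. The sketch assumes a unidirectional ring in which node i forwards to node (i + 1) mod 4, which is the setting of the cited Dally and Seitz example; the function names are illustrative.

```python
def select_channel(current: int, destination: int) -> str:
    """Pick the virtual channel class for the next hop from `current`."""
    return "high" if current < destination else "low"


def ring_route(src: int, dst: int, n_nodes: int = 4):
    """List (node, channel) pairs for each hop on a unidirectional ring."""
    hops = []
    node = src
    while node != dst:
        hops.append((node, select_channel(node, dst)))
        node = (node + 1) % n_nodes       # assumed direction of the ring
    return hops


if __name__ == "__main__":
    for src, dst in ((0, 2), (1, 3), (2, 0), (3, 1)):
        print(f"n{src} -> n{dst}:", ring_route(src, dst))
```

A route that crosses the wrap-around link starts on the low channels and switches to the high channels afterwards, so the dependency chain cannot close into a loop.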
Deadlock Recovery
Allow deadlock to occur, but once a potential deadlock situation is detected, break at least one of the cyclic dependencies to recover gracefully. The common techniques are:
Regressive recovery (abort-and-retry): Remove packet(s) from a dependency cycle by killing (aborting) and later re-injecting (retrying) the packet(s) into the network after some delay.
Progressive recovery (preemptive): Remove packet(s) from a dependency cycle by rerouting the packet(s) onto a deadlock-free lane.
Design Issues in Network-on-Chip
• Switching Techniques
• Topology Selection
• Routing
• Flow Control Protocol & GALS Implementation
• Buffering
• Arbitration
Flow Control Protocol
Globally Asynchronous Locally Synchronous (GALS) Style of Communication
Reference: Kundu, S. and Chattopadhyay, S. (2007) ‘Interfacing Cores and Routers in Network-on-Chip Using GALS’, IEEE International Symposium on Integrated Circuits (ISIC 2007), pp.
Design Issues in Network-on-Chip
• Switching Techniques
• Topology Selection
• Routing
• Flow Control Protocol & GALS Implementation
• Buffering
• Arbitration
Counter based FIFO
• Binary Counter Based
- Drawback: There can be considerable ambiguity when a count is read during a count transition.
• Gray Code Counter Based
- Drawback: FIFO depth must be a power of 2, which wastes area when the required depth is not a power of 2.
Reference: Yi, Cheng, “Gray code sequences”, U. S. Patent 6703950, March 9, 2004.
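The ambiguity of a binary count read during a transition comes from several bits changing at once, which a Gray-coded count avoids. The sketch below uses the standard binary-to-Gray conversion and counts the bits that flip between consecutive values; the helper names are illustrative.

```python
def binary_to_gray(value: int) -> int:
    """Convert a binary count to its reflected Gray code."""
    return value ^ (value >> 1)


def bits_changed(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    depth = 16  # power of 2, so the wrap from 15 back to 0 is also a one-bit change
    for count in range(depth):
        nxt = (count + 1) % depth
        print(
            f"{count:2d}->{nxt:2d}  binary flips {bits_changed(count, nxt)}  "
            f"gray flips {bits_changed(binary_to_gray(count), binary_to_gray(nxt))}"
        )
```

The single-bit property across the wrap-around holds only when the sequence length is a power of 2, which is the source of the depth restriction noted on the slide.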
Gray Counter Based Dual Clock FIFO
Reference: Cummings, C. E. and Alfke, P. (2002) ‘Simulation and Synthesis Techniques for Asynchronous FIFO Design with Asynchronous Pointer Comparisons’, Synopsys Users Group Conference, vol. User Papers.
Functionality of Asynchronous Comparator
Full = ( (waddr == raddr) && (wr_dir != rd_dir) )
Empty = ( (waddr == raddr) && (wr_dir == rd_dir) )
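The Full and Empty conditions can be exercised with a small behavioural model. The slide does not define `wr_dir` and `rd_dir`; this sketch assumes each is a one-bit flag that toggles whenever the corresponding address wraps around, so equal addresses with differing flags mean the writer is one lap ahead. The class name and methods are hypothetical, and clock domains are not modelled.

```python
class FifoPointers:
    """Behavioural model of FIFO pointers with assumed wrap (direction) flags."""

    def __init__(self, depth: int):
        self.depth = depth
        self.waddr = self.raddr = 0
        self.wr_dir = self.rd_dir = 0     # assumed: toggles on each address wrap

    @property
    def full(self) -> bool:
        return self.waddr == self.raddr and self.wr_dir != self.rd_dir

    @property
    def empty(self) -> bool:
        return self.waddr == self.raddr and self.wr_dir == self.rd_dir

    def write(self) -> None:
        if self.full:
            raise OverflowError("write to full FIFO")
        self.waddr += 1
        if self.waddr == self.depth:      # wrap: restart address, flip the flag
            self.waddr = 0
            self.wr_dir ^= 1

    def read(self) -> None:
        if self.empty:
            raise IndexError("read from empty FIFO")
        self.raddr += 1
        if self.raddr == self.depth:
            self.raddr = 0
            self.rd_dir ^= 1


if __name__ == "__main__":
    fifo = FifoPointers(depth=4)
    for _ in range(4):
        fifo.write()
    print("after 4 writes: full =", fifo.full, "empty =", fifo.empty)
    for _ in range(4):
        fifo.read()
    print("after 4 reads : full =", fifo.full, "empty =", fifo.empty)
```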
Metastability
• The Full and Empty signals are controlled by both clocks, so metastable states can arise.
• 2-stage synchronizers are used to reduce the probability of metastability.
• The Full signal is synchronized with ‘wr-clk’ and the Empty signal is synchronized with ‘rd-clk’.
Full = ( (waddr == raddr) && (wr_dir != rd_dir) )
Empty = ( (waddr == raddr) && (wr_dir == rd_dir) )
Design Issues in Network-on-Chip
• Switching Techniques
• Topology Selection
• Routing
• Flow Control Protocol & GALS Implementation
• Buffering
• Arbitration
Arbitration
Router Architecture
Input Channel
• Input Buffer
• Routing Computation Unit
• Control Unit
Output Channel
• Output Buffer
• Arbiter
• Control Unit
Reference: Kundu, S. and Chattopadhyay, S. (2008) ‘Network-on-chip architecture design based on Mesh-of-Tree deterministic routing topology’, Int’l Journal of High Performance Systems Architecture, Vol. 1, No. 3, pp. 163-182.
Wormhole Router Architecture Data Path
[Figure: router data path from physical channel and link control through the input buffer (IB), the routing control unit (RC, which applies the routing algorithm to the header flit and yields the output port number), the arbitration unit (SA), the crossbar (ST, driven by crossbar control) and the output buffer (OB) to link control and physical channel; the critical path is marked.]
IB (Input Buffering), RC (Route Computation), SA (Switch Alloc), ST (Switch Trav), OB (Output Buffering)
Flit Traversal Through Wormhole Router
[Figure: the wormhole router data path, as on the previous slide.]
IB (Input Buffering), RC (Route Computation), SA (Switch Alloc), ST (Switch Trav), OB (Output Buffering)
Pipeline stages occupied by each flit, one column per clock cycle:
Packet Header:     IB RC SA ST OB
Packet Payload 1:     IB IB IB ST OB
Packet Payload 2:        IB IB IB ST OB
Packet Payload 3:           IB IB IB ST OB
Performance Evaluation
Performance Metrics
Throughput (unit: flits/cycle/IP):
\[ TP = \frac{(\text{Maximum Accepted Packets}) \times (\text{Packet length})}{(\text{Number of IP blocks}) \times (\text{Total time})} \]
Latency: The time (in clock cycles) that elapses between the injection of a message header into the network at the source node and the reception of the tail flit at the destination node. The average latency is
\[ L_{avg} = \frac{\sum_{i=1}^{P} L_i}{P} \]
where P = total number of messages and L_i = latency of message i.
Bandwidth: The maximum number of bits that can be sent successfully to the destination through the network per second. It is expressed in bps (bits/sec).
Cost Metrics
Energy dissipation: Energy consumed by routers and links at different workloads. Average energy/packet and average energy/clock cycle are measured.
Area requirements: The percentage of chip area occupied by the switches and links is taken into consideration.
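Both performance metrics reduce to a few arithmetic operations on simulator counters. The sketch below is a minimal illustration; the parameter names and the sample numbers are hypothetical and are not results from the slides.

```python
def throughput(accepted_packets: int, packet_length_flits: int,
               n_ip_blocks: int, total_cycles: int) -> float:
    """Throughput in flits/cycle/IP: accepted flits divided by (IP blocks x time)."""
    return (accepted_packets * packet_length_flits) / (n_ip_blocks * total_cycles)


def average_latency(latencies_cycles) -> float:
    """Mean of the per-message latencies L_i over all P messages."""
    return sum(latencies_cycles) / len(latencies_cycles)


if __name__ == "__main__":
    # Hypothetical run: 16 IP blocks, 64-flit packets, 10,000 simulated cycles.
    print("throughput:", throughput(1000, 64, 16, 10_000), "flits/cycle/IP")
    print("average latency:", average_latency([10, 20, 30]), "cycles")
```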
Simulator Design for Performance Evaluation
Types of Simulator
1. Cycle Accurate:
• Samples the state of the signals at every clock edge (positive or negative).
• Much faster than event-driven simulation.
2. Event Driven:
• Most accurate, as every active signal is calculated for every device during the clock cycle as it propagates.
• Each signal is simulated for its value and its time of occurrence.
• Excellent for timing analysis and for verifying race conditions.
• Computation intensive (depends on the number of activities) and hence very slow.
To calculate performance metrics like throughput and latency, the delay after each and every gate is not required. In that case, a Cycle Accurate Simulator is the best choice.
Existing NoC Simulators
Some existing NoC simulators and their drawbacks:
• NIRGAM (University of Southampton, UK): limited to Mesh topology; no power evaluation
• MPARM - Xpipes (University of Bologna, Italy): not freely available
• NS2 (Open Source): packet level transaction
Cycle Accurate Simulator for NoC Modeling
The simulator should operate at the granularity of the individual architectural components of the router.
SystemC is normally preferred.
Traffic generators are used for evaluating the performance of the NoC.
Router
Input Channel
• Input Buffer
• Routing Computation Unit
• Control Unit
Output Channel
• Output Buffer
• Arbiter
• Control Unit
Traffic Generation
• Poisson Distribution
• Self-Similar Traffic
• Application Specific Traffic
Network
1. Throughput
2. Latency
3. Bandwidth
Traffic Generator
65
Traffic GeneratorApplication Driven Traffic is the best suited for performance evaluation.
D t il bilit f th th ti t ffi d l l dDue to unavailability of the same, synthetic traffic source models are also used.
Nature of traffic is generally bursty in NoC.• A Poisson process
When observed on a fine time scale will appear burstyBurst length of a Poisson arrivalBurst length of a Poisson arrivalprocess tends to be smoothedby averaging over long enoughtime scale.P i f ilPoisson process fail to capturethe actual burstiness of NoCtraffic .
Short range DependenceShort range Dependence
Reference: Varatkar, G.V. and Marculescu, R. (2004) ‘On-chip traffic modeling and synthesis for MPEG-2 videoapplications’, IEEE Trans. on Very Large Scale Integration (VLSI) Systems, Vol. 12, No. 1, pp. 108-119.
Traffic Generator
• A Self-Similar (fractal) process
When aggregated over a wide range of time scales, it maintains its bursty characteristic. Self-similarity manifests itself in several equivalent fashions:
Slowly decaying variance
Long range dependence
Non-degenerate autocorrelations
Heavy Tailed
A Self-Similar process can be generated by superposing ON-OFF Pareto sources.
Reference: Park, K. and Willinger, W. (2000) ‘Self-Similar network traffic and performance evaluation’, A Wiley-Interscience Publication, John Wiley & Sons, Inc.
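The superposition of ON-OFF Pareto sources can be sketched as below. This is a minimal illustration under stated assumptions, not the generator used in the lecture: the function names are hypothetical, each source is assumed to emit one flit per cycle while ON, and the shape parameters (1.9 for ON periods, 1.25 for OFF periods) are example values only.

```python
import random


def on_off_pareto_source(num_cycles, alpha_on, alpha_off, rng):
    """Per-cycle activity (1 = ON, 0 = OFF) of a single source.

    ON and OFF period lengths are drawn from Pareto distributions.
    """
    activity = []
    is_on = rng.random() < 0.5  # random initial state
    while len(activity) < num_cycles:
        alpha = alpha_on if is_on else alpha_off
        length = int(rng.paretovariate(alpha))  # paretovariate() is always >= 1
        # Heavy tails can give very long periods, so clip to the remaining cycles.
        length = min(length, num_cycles - len(activity))
        activity.extend([1 if is_on else 0] * length)
        is_on = not is_on
    return activity


def aggregate_traffic(num_sources, num_cycles, alpha_on=1.9, alpha_off=1.25, seed=0):
    """Number of active sources in every cycle (superposition of all sources)."""
    rng = random.Random(seed)
    total = [0] * num_cycles
    for _ in range(num_sources):
        source = on_off_pareto_source(num_cycles, alpha_on, alpha_off, rng)
        for cycle, active in enumerate(source):
            total[cycle] += active
    return total


if __name__ == "__main__":
    load = aggregate_traffic(num_sources=16, num_cycles=2000)
    print("mean active sources per cycle:", sum(load) / len(load))
```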
Traffic Parameter
Offered Load: Number of packets injected in a particular time interval.
Locality Factor: Ratio of the traffic from a core destined to its local cluster to the total traffic injected by that core.
Locality Factor = 0 signifies uniformly distributed traffic.
[Figure: 4x4 mesh with a corner source S and destination clusters at distances d = 0 to d = 6]
For example, in a 4x4 mesh, the destinations lie at distances d = 1, 2, 3, 4, 5, and 6 from one corner source. If the locality factor = 0.5, then 50 percent of the traffic will go to the cluster having d = 1. The remaining 50 percent of the traffic will be distributed as follows:
o 15% will go to the cluster having d = 2
o 12.5% will go to the cluster having d = 3
o 10% will go to the cluster having d = 4
o 7.5% will go to the cluster having d = 5
o 5% will go to the cluster having d = 6
If there is more than one core in a cluster, the traffic will be randomly distributed among them.
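The percentages in the 4x4 mesh example can be reproduced with a small sketch. The slides give only the resulting numbers, so the weighting rule used here (the non-local traffic is split with weights that fall linearly with distance, weight = max distance + 2 − d) is an assumption that happens to match those numbers; the function name is hypothetical.

```python
def locality_distribution(locality_factor, max_distance):
    """Fraction of a core's traffic sent to each cluster distance d = 1..max_distance.

    Assumed rule: the cluster at d = 1 receives `locality_factor`; the rest is
    shared among d = 2..max_distance with linearly decreasing weights.
    """
    shares = {1: locality_factor}
    weights = {d: max_distance + 2 - d for d in range(2, max_distance + 1)}
    total_weight = sum(weights.values())
    for d, w in weights.items():
        shares[d] = (1.0 - locality_factor) * w / total_weight
    return shares


if __name__ == "__main__":
    for d, share in locality_distribution(0.5, 6).items():
        print(f"d = {d}: {100 * share:.1f}%")
```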
Performance Evaluation
Theoretically,
Throughput ∝ Number of Links / Average Distance
Latency ∝ Average Distance
Performance of any network depends on the following network parameters:
Topology
Locality factor of the traffic
Buffer Position and Buffer Depth
Switching Techniques
Number of cores attached
Here, a wormhole router architecture is used to form the network (Mesh, BFT) with the following parameters:
Number of cores attached = 32
Message Length = Packet Length = 64 flits
Each flit consists of 32 bits
Total simulation cycles = 2 lacs (200,000), with a settling time of 10,000 cycles
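The two topology quantities in the proportionality relations, number of links and average distance, can be computed directly for a mesh. The sketch below is illustrative only: the function name is hypothetical, bidirectional links are counted once, and distance is taken as the hop count under minimal routing, averaged over all distinct source-destination pairs. A 4x4 mesh is used because it is the example size that appears on the traffic parameter slide.

```python
from itertools import product


def mesh_metrics(rows, cols):
    """Return (number of links, average hop distance) of a rows x cols mesh."""
    nodes = list(product(range(rows), range(cols)))
    links = rows * (cols - 1) + cols * (rows - 1)
    total_hops = sum(
        abs(a[0] - b[0]) + abs(a[1] - b[1])
        for a in nodes
        for b in nodes
        if a != b
    )
    pairs = len(nodes) * (len(nodes) - 1)
    return links, total_hops / pairs


if __name__ == "__main__":
    links, avg_distance = mesh_metrics(4, 4)
    print("links:", links)
    print("average distance:", round(avg_distance, 3))
    print("links / average distance:", round(links / avg_distance, 3))
```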
Performance Evaluation
Throughput varies with topology and locality factor.
Throughput = Maximum accepted traffic in flits/cycle/IP
We kept buffer depth = 6 in both the input and output channels of the router in all cases.
Performance Evaluation
Latency decreases with increasing locality factor in the different topologies.
We kept buffer depth = 6 in both the input and output channels of the router in all cases.
Power Evaluation Flow
Router Power Evaluation
Reference: Synopsys PrimePower, Design Vision manual (Version Y-2006.06).
Operating condition: Process = 1, Voltage = 1 V, Temperature = 75 °C
Power Evaluation Flow
Link Length Estimation
Mesh
Estimated length of wires: 1.25 mm, 2.5 mm
Power Evaluation Flow
Link Length Estimation
Butterfly Fat Tree (BFT)
Estimated length of wires: 1.25 mm, 5.0 mm
Power Evaluation Flow
Interconnect Modeling
Copper wire (resistivity = 17 nΩ-m) of Metal Layer 4 (semi-global) has been taken.
To reduce the wiring area, we have chosen the minimum dimensions of Metal Layer 4. The dimensions are:
Width (W) = 0.2 µm
Spacing (S) = 0.2 µm
Pitch = W + S = 0.4 µm
Thickness (T) = 0.5 µm
H = 0.75 µm
Dielectric Constant = 2.9
[Figure: Cross-section of interconnects (Layer 3, Layer 4, Layer 5)]
Link Energy Evaluation
Parasitic components (R, C, L) of the three-wire model have been extracted with the field solver tool of HSPICE. The energy consumption of the middle wire for different transitions is also obtained from HSPICE.
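As a rough sanity check on the wire dimensions, the DC resistance of a link can be estimated from R = ρL / (W × T). This is a back-of-the-envelope sketch only and not the HSPICE field-solver extraction used in the lecture; it assumes a uniform rectangular copper cross-section and ignores barrier layers, temperature, and frequency effects. The function name is hypothetical; the link lengths are the estimated wire lengths quoted in these slides.

```python
def wire_resistance_ohms(resistivity_ohm_m, length_m, width_m, thickness_m):
    """DC resistance of a rectangular wire: R = rho * L / (W * T)."""
    return resistivity_ohm_m * length_m / (width_m * thickness_m)


if __name__ == "__main__":
    RHO = 17e-9        # copper resistivity, ohm-metre
    WIDTH = 0.2e-6     # metre
    THICKNESS = 0.5e-6 # metre
    for length_mm in (1.25, 2.5, 5.0):
        r = wire_resistance_ohms(RHO, length_mm * 1e-3, WIDTH, THICKNESS)
        print(f"{length_mm} mm link: about {r:.1f} ohm")
```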
Power Evaluation Flow
Three-wire modeling
Data rate: 32 × 200 Mbits/sec
Driver sizes are designed based on the length of the wire.
Load capacitance at the other end of the wire is 5 fF.
A Look-Up Table (LUT) is made for the middle-line energy consumption.
Energy Consumption in Mesh Topology
Network Energy = Router Energy + Link Energy
Simulation runs for 2 lacs (200,000) clock cycles with a clock period of 5 ns.
Internal power dominates.
We kept buffer depth = 6 in both the input and output channels of the router in all cases.
Comparison of Energy Consumption
We kept buffer depth = 6 in both the input and output channels of the router in all cases.
Energy – Performance Trade-Off
Throughput Variation with FIFO Depth & Position in Mesh
FIFO_Depth_4-4 => Input Channel FIFO Depth = 4, Output Channel FIFO Depth = 4
FIFO_Depth_4-6 => Input Channel FIFO Depth = 4, Output Channel FIFO Depth = 6
FIFO_Depth_6-6 => Input Channel FIFO Depth = 6, Output Channel FIFO Depth = 6
FIFO_Depth_4-0 => Input Channel FIFO Depth = 4, No FIFO at Output Channel
FIFO_Depth_6-0 => Input Channel FIFO Depth = 6, No FIFO at Output Channel
Energy – Performance Trade-Off
Latency Variation with FIFO Depth & Position in Mesh
Energy – Performance Trade-Off
Energy Variation with FIFO Depth & Position in Mesh
Simulation runs for 2 lacs (200,000) clock cycles with a clock period of 5 ns.
Energy – Performance Trade-Off
Trade-Off in Mesh at saturation (load = 160)
FIFO_Depth_6-0 shows the best Energy-Performance Trade-Off.
Network Energy Consumption in Mesh after FIFO Optimization
Simulation runs for 2 lacs (200,000) clock cycles with a clock period of 5 ns.
Internal power still dominates.
We kept FIFO depth = 6 in the input channel and no FIFO at the output channel.
Comparison of Energy Consumption after FIFO Optimization
We kept FIFO depth = 6 in the input channel and no FIFO at the output channel.
Internal Power
Netlist view of a D-type flip-flop with synchronous clear input in Synopsys Design Vision
• Internal power = short-circuit power + internal node switching power
• The output node of the clock buffer switches continuously with a free-running clock
To minimize internal power: stop the clock when the network is idle.
Internal Power Minimization
Netlist view of FIFO memory
Network Energy Consumption in Mesh after Clock Gating in FIFO
Simulation runs for 2 lacs (200,000) clock cycles with a clock period of 5 ns.
We kept FIFO depth = 6 in the input channel and no FIFO at the output channel.
Comparison of Energy Consumption after Clock Gating in FIFO
We kept FIFO depth = 6 in the input channel and no FIFO at the output channel.
Network Area Comparison
% SoC Area Overhead
BFT: 2.424
Mesh: 3.701
Total Core Area = (32 × 2.5 × 2.5) sq. mm = 200 sq. mm
Scalability Measurement
Scalability is a property which exhibits performance proportional to the number of cores employed.
As the size of a scalable system is increased, a corresponding increase in performance is obtained.
BW = (Throughput × Number of cores attached × Number of bits in a flit) / Clock period
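The bandwidth formula can be evaluated with a short sketch. The function name is hypothetical, and the throughput of 0.5 flits/cycle/IP is an assumed example value rather than a measured result; the other numbers (32 cores, 32-bit flits, 5 ns clock period) are the simulation parameters quoted in these slides.

```python
def network_bandwidth_bps(throughput, num_cores, bits_per_flit, clock_period_s):
    """BW = throughput (flits/cycle/IP) * cores * bits per flit / clock period."""
    return throughput * num_cores * bits_per_flit / clock_period_s


if __name__ == "__main__":
    bw = network_bandwidth_bps(
        throughput=0.5,      # assumed example value, flits/cycle/IP
        num_cores=32,
        bits_per_flit=32,
        clock_period_s=5e-9,
    )
    print(f"{bw / 1e9:.1f} Gbits/sec")
```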
Head-of-Line Blocking in Wormhole Router
[Figure: 2D mesh, no VCs, XY routing; blocked flits are marked X]
Introduction of Virtual Channels
• Multiple virtual channels are multiplexed on a single physical link to improve performance.
• Payload flits use the VC acquired by the header flit, while the tail flit releases the VC.
[Figure: Switch A and Switch B connected by a physical data link; VC 0 and VC 1 on each side share the link through a MUX and a DEMUX, with VC control at both switches and a VC scheduler]
Reference: Dally, W. J. (1992) ‘Virtual Channel Flow Control’, IEEE Trans. on Parallel and Distributed Systems, Vol. 3, No. 2, pp. 194–205.
Virtual Channels
[Figure: 2D mesh, 2 VCs (VC0, VC1), XY routing]
VC avoids HOL blocking.
Virtual Channels
[Figure: 2D mesh, 2 VCs (VC0, VC1), XY routing; no VCs available at the blocked routers, marked X]
VC mitigates HOL blocking but cannot eliminate it.
Virtual Channel Based Router Architecture
[Figure: Virtual channel based router with physical channels, link control, input buffers with MUX/DEMUX, crossbar, and routing control and arbitration unit]
Reference: N. Kavaldjiev, G. J. M. Smit, and P. G. Jansen, “A Virtual Channel Router for On-Chip Networks”, in Proc. of IEEE Int’l SOC Conference. IEEE Computer Society Press, pp. 289–293, 2004.
Determination of Number of Virtual Channels
- Up to 4 virtual channels, throughput increases; beyond that it saturates.
- Energy dissipation increases with the number of virtual channels.
- For the energy-performance trade-off, 4 virtual channels per physical channel are preferred.
Reference: Pande, P. P., Grecu, C., Jones, M., Ivanov, A. and Saleh, R. (2005) “Performance evaluation and design trade-offs for MP-SOC interconnect architectures”, IEEE Trans. on Computers, Vol. 54, No. 8, pp. 1025–1040.
Throughput Improvement in Mesh using Virtual Channel Architecture
No. of Virtual Channels = 4
Latency Improvement in Mesh using Virtual Channel Architecture
No. of Virtual Channels = 4
Energy Overhead in Mesh using Virtual Channel Architecture
No. of Virtual Channels = 4
Performance of Some Other Topologies
No. of Virtual Channels = 4
Reference: Pande, P. P., Grecu, C., Jones, M., Ivanov, A. and Saleh, R. (2005) “Performance evaluation and design trade-offs for MP-SOC interconnect architectures”, IEEE Trans. on Computers, Vol. 54, No. 8, pp.1025–1040.
Network Area Comparison with Virtual Channel Architecture
% SoC Area Overhead
Mesh, without VC: 3.701
Mesh, with VC: 6.145
BFT, without VC: 2.424
BFT, with VC: 3.507
Total Core Area = (32 × 2.5 × 2.5) sq. mm = 200 sq. mm
Quality of Service (QoS) Support
• Conceptually, two disjoint networks
– a network with throughput and latency guarantees (guaranteed throughput, GT)
– a network without those guarantees (best-effort, BE)
• Several types of commitment in the network
– combine guaranteed worst-case behavior
– with good average resource usage
Architectural Modification for Supporting QoS
Reference: Rijpkema, E., Goossens, K., Radulescu, A., Dielissen, J., Meerbergen, J. V., Wielage, P., and Waterlander, E. (2003) “Trade-offs in the Design of a Router with Both Guaranteed and Best-Effort Services for Networks on Chip”, IEE Proc. Computers and Digital Techniques, Vol. 150, No. 5, pp. 294-302.
Lecture – 3
Application Mapping
Task of Application Mapping
Mapping Problem Formulation – Core Graph
• Directed graph G = (V, E)
• Each vertex v_i ∈ V represents a core
• Each edge e_i,j ∈ E represents communication between v_i and v_j
• The weight of edge e_i,j, denoted comm_i,j, is the bandwidth requirement
Mapping Problem Formulation – NoC Topology Graph
• A directed graph P = (U, F)
• Each vertex u_i ∈ U is a router
• Each edge f_i,j ∈ F represents a direct communication between the vertices
• The weight of edge f_i,j, denoted bw_i,j, represents the available bandwidth across the edge
Map Function
• map: V → U
• Each edge k of the core graph represents a commodity d_k
• Each commodity has a value vl(d_k) representing the bandwidth requirement of the communication from v_i to v_j
• Bandwidth constraint:
An edge in the topology graph must have enough bandwidth to accommodate all commodities passing through it
• Minimize communication cost:
Σ_k vl(d_k) × dist(source(d_k), dest(d_k))
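The communication cost of a given mapping can be evaluated with a few lines of code. This is a sketch under assumptions: the topology is taken to be a 2D mesh, dist() is taken as the hop count (Manhattan distance) between the routers a commodity's source and destination cores are mapped to, and the core names, bandwidth values, and function names are made up for the example.

```python
def manhattan(a, b):
    """Hop count between two mesh routers given as (row, col)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def communication_cost(core_graph, mapping):
    """Sum over commodities of bandwidth * distance between the mapped routers.

    core_graph: {(source_core, destination_core): bandwidth}
    mapping:    {core: (row, col)}
    """
    return sum(
        bandwidth * manhattan(mapping[src], mapping[dst])
        for (src, dst), bandwidth in core_graph.items()
    )


if __name__ == "__main__":
    core_graph = {("a", "b"): 100, ("b", "c"): 50, ("a", "c"): 10}
    good = {"a": (0, 0), "b": (0, 1), "c": (1, 1)}
    poor = {"a": (0, 0), "b": (1, 1), "c": (0, 1)}
    print("cost of good mapping:", communication_cost(core_graph, good))
    print("cost of poor mapping:", communication_cost(core_graph, poor))
```

Heavily communicating cores placed on adjacent routers give the lower cost, which is what the mapping algorithms on the following slides try to achieve.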
Mapping Solution
Mapping Algorithms
• Mapping problem is intractable
• Several approaches are possible: ILP, heuristics (PMAP, GMAP, PBB, NMAP, BMAP, etc.), meta-search heuristics (GA, PSO, Simulated Annealing)
• Other variants of the problem combine:
– Task scheduling
– Power consumption
– Alternative routing paths, etc.
Mapping with Minimum-Path Routing (NMAP)
• Three phases – Initialize, Minimum path computation, Iterative improvement
• Initialize:
1. The core with maximum communication demand is placed onto the node with the maximum number of neighbors
2. Select the core that communicates most with the mapped cores
3. Place the selected core onto the node that minimizes communication cost with the mapped ones
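The three initialization steps can be sketched as a greedy loop. This is a simplified illustration, not the published NMAP implementation: it assumes a 2D mesh with at least as many routers as cores, measures cost as bandwidth × hop count, repeats steps 2 and 3 until every core is placed, and breaks ties by list order. All function names and the example core graph are hypothetical.

```python
from itertools import product


def hops(a, b):
    """Hop count between two mesh routers given as (row, col)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def neighbor_count(node, rows, cols):
    """Number of mesh neighbours of a router."""
    r, c = node
    steps = ((1, 0), (-1, 0), (0, 1), (0, -1))
    return sum(0 <= r + dr < rows and 0 <= c + dc < cols for dr, dc in steps)


def traffic(core_graph, a, b):
    """Total bandwidth exchanged between cores a and b in both directions."""
    return core_graph.get((a, b), 0) + core_graph.get((b, a), 0)


def initial_mapping(cores, core_graph, rows, cols):
    free = list(product(range(rows), range(cols)))
    unmapped = list(cores)
    # Step 1: busiest core goes to the router with the most neighbours.
    first = max(unmapped, key=lambda c: sum(traffic(core_graph, c, o) for o in cores if o != c))
    node = max(free, key=lambda n: neighbor_count(n, rows, cols))
    mapping = {first: node}
    unmapped.remove(first)
    free.remove(node)
    while unmapped:
        # Step 2: pick the core that communicates most with the mapped cores.
        nxt = max(unmapped, key=lambda c: sum(traffic(core_graph, c, m) for m in mapping))
        # Step 3: put it on the free router with the least cost to the mapped cores.
        best = min(free, key=lambda n: sum(traffic(core_graph, nxt, m) * hops(n, mapping[m]) for m in mapping))
        mapping[nxt] = best
        unmapped.remove(nxt)
        free.remove(best)
    return mapping


if __name__ == "__main__":
    graph = {("a", "b"): 100, ("a", "c"): 50, ("c", "d"): 20}
    print(initial_mapping(["a", "b", "c", "d"], graph, rows=3, cols=3))
```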
Mapping with Minimum-Path Routing
• Shortest Path:
– Minimum path routing
– Commodities are sorted in descending order of flows
– For each commodity, the shortest path is identified
– As soon as a commodity path is finalized, the cost of each edge on the path is increased by the value of the commodity
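The shortest-path phase can be sketched as follows. This is an illustration under assumptions rather than the NMAP code itself: the topology is a 2D mesh, routers are (row, col) tuples, only moves that bring a flit closer to its destination are allowed (so every path is minimal in hop count), and among those paths Dijkstra's algorithm picks the one with the least accumulated load. The function names and example commodities are hypothetical.

```python
import heapq


def route_commodity(load, src, dst):
    """Least-loaded minimal path from src to dst, given current link loads."""
    step_r = (dst[0] > src[0]) - (dst[0] < src[0])
    step_c = (dst[1] > src[1]) - (dst[1] < src[1])
    heap = [(0, src, [src])]
    visited = set()
    while heap:
        cost, node, path = heapq.heappop(heap)
        if node == dst:
            return path
        if node in visited:
            continue
        visited.add(node)
        moves = []
        if node[0] != dst[0]:
            moves.append((node[0] + step_r, node[1]))
        if node[1] != dst[1]:
            moves.append((node[0], node[1] + step_c))
        for nxt in moves:
            link_cost = load.get((node, nxt), 0)
            heapq.heappush(heap, (cost + link_cost, nxt, path + [nxt]))
    return None  # not reached on a mesh


def route_all(commodities):
    """Route commodities (src, dst, bandwidth) in descending order of bandwidth."""
    load = {}
    routes = {}
    for src, dst, bandwidth in sorted(commodities, key=lambda c: c[2], reverse=True):
        path = route_commodity(load, src, dst)
        # Finalized path: raise the cost of every link on it by the commodity value.
        for link in zip(path, path[1:]):
            load[link] = load.get(link, 0) + bandwidth
        routes[(src, dst)] = path
    return routes, load


if __name__ == "__main__":
    routes, load = route_all([((0, 0), (1, 2), 60), ((0, 0), (1, 1), 100)])
    for key, path in routes.items():
        print(key, "->", path)
```

In this example the second commodity leaves the source through a different link than the first, because the first link already carries 100 units of load.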
Mapping with Minimum-Path Routing
• Iterative Improvement:
– Iteratively swap vertices pair-wise to obtain a better mapping
– Traffic splitting:
• Multiple shortest paths may exist
• Formulate a multi-commodity flow problem to satisfy bandwidth requirements for solutions that have lesser communication costs but do not satisfy all the bandwidth requirements
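The pair-wise swapping step can be sketched as a simple local search. This is a simplified illustration and not the NMAP procedure in full: it swaps only the positions of mapped cores (not empty routers), uses bandwidth × hop count on a mesh as the cost, ignores bandwidth constraints and traffic splitting, and stops when no swap lowers the cost. The function names and the example are hypothetical.

```python
from itertools import combinations


def hops(a, b):
    """Hop count between two mesh routers given as (row, col)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def cost(core_graph, mapping):
    """Total of bandwidth * hop count over all commodities."""
    return sum(bw * hops(mapping[s], mapping[d]) for (s, d), bw in core_graph.items())


def improve_by_swapping(core_graph, mapping):
    """Swap core positions pair-wise while any swap reduces the cost."""
    mapping = dict(mapping)
    best = cost(core_graph, mapping)
    improved = True
    while improved:
        improved = False
        for a, b in combinations(list(mapping), 2):
            mapping[a], mapping[b] = mapping[b], mapping[a]
            new_cost = cost(core_graph, mapping)
            if new_cost < best:
                best = new_cost
                improved = True
            else:
                # Undo a swap that does not help.
                mapping[a], mapping[b] = mapping[b], mapping[a]
    return mapping, best


if __name__ == "__main__":
    graph = {("a", "b"): 100, ("b", "c"): 50, ("a", "c"): 10}
    start = {"a": (0, 0), "b": (1, 1), "c": (0, 1)}
    final, final_cost = improve_by_swapping(graph, start)
    print("initial cost:", cost(graph, start))
    print("final cost:", final_cost, final)
```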
Binomial Mapping Algorithm (BMAP)
• NMAP algorithm is O(N^4 log N)
• BMAP is a three-stage algorithm with complexity O(N^2 log N)
– Binomial Merging Iteration
– Topology Mapping
– Hardware Cost Optimization
BMAP: Binomial Merging Iteration
1. Calculate IP Ranking: The rank of IP core i is
ranking(i) = Σ_{j = 1 to N} (requirement(i, j) + requirement(j, i))
requirement(i, j) is the bandwidth requirement from i to j
2. Merge IP Set: Based on ranking, merge two IP sets at a time: log N time
3. Refreshing IP Set: Ranking is recalculated. The ranking of IP set k generated by merging IP set i and IP set j is
ranking(k) = ranking(i) + ranking(j) – requirement(i, j) – requirement(j, i)
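The two ranking formulas can be written out directly. This sketch follows the formulas exactly as stated on the slide and is not taken from the BMAP authors' code; the function names and the 3-core requirement matrix are made up for illustration, with requirement[i][j] holding the bandwidth requirement from core i to core j and zeros on the diagonal.

```python
def ranking(requirement, i):
    """ranking(i) = sum over j of requirement(i, j) + requirement(j, i)."""
    n = len(requirement)
    return sum(requirement[i][j] + requirement[j][i] for j in range(n))


def merged_ranking(requirement, i, j):
    """Ranking of the IP set formed by merging i and j, as given on the slide."""
    return (
        ranking(requirement, i)
        + ranking(requirement, j)
        - requirement[i][j]
        - requirement[j][i]
    )


if __name__ == "__main__":
    requirement = [
        [0, 10, 5],
        [20, 0, 0],
        [0, 15, 0],
    ]
    for core in range(3):
        print("ranking of core", core, "=", ranking(requirement, core))
    print("ranking after merging cores 0 and 1 =", merged_ranking(requirement, 0, 1))
```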
Merging: An Example
BMAP: Topology Mapping and Traffic Surface Creation
• After mapping, a traffic surface is generated
• It shows the traffic load of each router
• Minimal path routing is used
• Based on this surface, hardware can be optimized by selecting proper routers from the library
BMAP: Hardware Cost Optimization
1. Dummy Router Elimination:
– Dummy routers are added at the start to have 4^n routers
– BMAP puts these routers at the boundaries, hence they can be eliminated
2. Router Selection:
– Sharing a single buffer among low-bandwidth input channels
– Choice of router is made from the library
3. Unfolding:
– Add additional routers and links for larger bandwidth requirements
BMAP: Hardware Optimization - An Example
Network on Chip Synthesis: SUNMAP + xpipes
SUNMAP: Topology Mapping
• Optimizes for area, power, or delay within design constraints
• Uses heuristics to perform mapping onto topologies: mesh, torus, hypercube, clos, and butterfly
• Built-in floor-planner for area and power analysis
• Choice of different routing functions
SUNMAP: Topology Mapping
Heuristic approach with several phases:
• Initial mapping using a greedy algorithm (from communication graph)
– Compute optimal routing (using flow formulation)
1. Floorplan solution
2. Check area and bandwidth constraints
3. Compute mapping cost
• Iterative improvement loop (Tabu search)
• Allows manual and interactive topology creation
System configuration
• Specifies
– NIs (I/Os, clocks, buffers)
– switches (I/Os, buffers)
– links
– routes
// In this topology: 8 cores, 8 memories, 4x4 torus
// ----------------------------- IP cores
// name, switch number, clock divider, buffers, type
core(core_0, switch_0, 1, 6, initiator);
core(mem_8, switch_11, 1, 6, target:0x00);
[…]
// ----------------------------- switches
// name, input ports, output ports, buffers
switch(switch_0, 5, 5, 6);
switch(switch_1, 5, 5, 6);
[…]
// ----------------------------- links
// name, source, destination
link(link0, switch_0, switch_1);
link(link1, switch_1, switch_0);
[…]
// ----------------------------- routes
// source, destination, hops
route(core_0, pm_8, switches:0,1,5,6,7,11);
route(core_1, pm_9, switches:1,5,9,8);
route(core_2, pm_10, switches:2,6,5,9);
route(core_3, pm_11, switches:3,2,6,10);
[…]
xpipes Compiler: Platform Generation
• Input:
– System configuration: topology, routing tables, parameters (flit width, buffering, …)
– Component library
• Creates a class template for each type of network component based upon component configuration (I/O ports, buffer sizing)
• Hierarchical instantiation of the platform in SystemC
Network-on-Chip Synthesis Tool: xpipes
MPARM Architecture
Reference: Bertozzi, D. and Benini, L. (2004) “xpipes: A Network-on-Chip Architecture for Giga Scale Systems-on-Chips”, IEEE Circuits and Systems Magazine, pp. 18-31.
Lecture – 5
Conclusion and Future of Network-on-Chip
Network-on-Chip: At a Glance
Topics Covered
Need of Network-on-Chip
NoC Architecture Design
Performance Evaluation
Design Trade-Off
Application Mapping on NoC
CAD Tools for NoC
Some More Topics
Impact of Higher Communication Layers in NoC Performance
Test and Verification of NoC
Thermal Modeling of NoC
Metrics and Benchmarks for NoC
Signal Integrity and Reliability Issues
Floorplan-aware NoC architecture optimization
Fault Tolerant Architecture in NoC
Limitation of 2D Network-on-Chip
The conventional 2D integrated circuit (IC) has limited floor-planning choices, and consequently it limits the performance enhancements arising out of NoC architectures.
There is a need for more and more bandwidth, but not at the cost of increased power consumption.
Reference: Carloni, L. P., Pande P. P., Yuan X. (2009) “Networks-on-Chip in emerging interconnect paradigms: Advantages and Challenges”, ACM/IEEE Int’l Symp. on Networks-on-Chip, pp. 93-102.
NoC Research Groups in Foreign Universities
1. Prof. Luca Benini, University of Bologna, Italy.
2. Prof. Giovanni De Micheli, EPFL, Switzerland.
3. Prof. William J. Dally, Stanford University, USA.
4. Prof. Partha Pratim Pande, Washington State University, USA.
5. Prof. Radu Marculescu, Carnegie Mellon University, USA.
6. Prof. Bashir M. Al-Hashimi, University of Southampton, UK.
7. Prof. Chita R. Das, Pennsylvania State University, USA.
8. Prof. Niraj K. Jha, Princeton University, USA.
9. Prof. Sashi Kumar, Jonkoping University, Sweden.
10. Prof. Axel Jantsch, Royal Institute of Technology (KTH), Sweden.
11. Prof. Jari Nurmi, Tampere University of Technology, Finland.
12. Prof. Andre Ivanov, University of British Columbia, Canada.
13. Prof. Resve Saleh, University of British Columbia, Canada.
14. Prof. Israel Cidon, Technion-Israel Institute of Technology, Israel.
15. Dr. Davide Bertozzi, University of Bologna, Italy.
16. Dr. Srinivasan Murali, EPFL, Switzerland.
and many more …
NoC Research in Indian Universities
1. Prof. Santanu Chattopadhyay, Indian Institute of Technology, Kharagpur.
2. Prof. S. K. Nandy, Indian Institute of Science, Bangalore.
3. Prof. Bharadwaj Amruthur, Indian Institute of Science, Bangalore.
4. Prof. M. R. Bhujade, Indian Institute of Technology, Bombay.
Journals, Conference, and Workshop on NoC
Microprocessor and Microsystems Journal, Elsevier (MICPRO)
IEEE/ACM International Symposium on Networks-on-Chip
IEEE Int’l Workshop on Network on Chip Architectures (NoCArc)
NoC Research in Industries
Tilera Corporation
Arteris Inc.
Silistix Inc.
NXP Semiconductor
IBM Corporation (Cyclops-64/Blue Gene)
Aethereal
Network-on-Chip Books
Bibliography
For a detailed, updated reference list, the audience is directed to the following link:
http://www.cl.cam.ac.uk/~rdm34/onChipNetBib/onChipNetwork.pdf
Below we give some of our contributions in NoC research:
[1] S. Kundu and S. Chattopadhyay, “Interfacing Cores and Routers in Network-on-Chip Using GALS”, IEEE International Symposium on Integrated Circuits (ISIC), 2007.
[2] S. Kundu and S. Chattopadhyay, “Mesh-of-Tree Deterministic Routing for Network-on-Chip Architecture”, ACM Great Lakes Symposium on VLSI (GLSVLSI), 2008.
[3] S. Kundu, R. P. Dasari, K. Manna, and S. Chattopadhyay, “Mesh-of-Tree based scalable Network-on-Chip Architecture”, IEEE Region 10 Colloquium and International Conference on Industrial and Information Systems (ICIIS), 2008.
[4] S. Kundu and S. Chattopadhyay, “Mesh-of-Tree based Network-on-Chip Architecture Using Virtual Channel based Router”, IEEE VLSI Design and Test Conference (VDAT), 2008.
[5] S. Kundu and S. Chattopadhyay, “Network-on-chip architecture design based on mesh-of-tree deterministic routing topology”, International Journal for High Performance Systems Architecture, Vol. 1, No. 3, pp. 163–182, Inderscience Publisher, 2008.
[6] S. Kundu, R. P. Dasari, K. Manna, and S. Chattopadhyay, “Performance Evaluation of Mesh-of-Tree Based Network-on-Chip Using Wormhole Router with Poisson Distributed Traffic”, IEEE VLSI Design and Test Conference (VDAT), 2009.
[7] S. Kundu, K. Manna, S. Gupta, K. Kumar, R. Parikh, and S. Chattopadhyay, “A Comparative Performance Evaluation Of Network-on-Chip Architectures Under Self-Similar Traffic”, IEEE International Conference on Advances in Recent Technologies in Communication and Computing (ARTCom), 2009.
Microprocessor Research Laboratory
Thank You