8/12/2019 Thiet Ke He Thong Nhung
Thesis proposal: Throughput- and Latency-aware mapping of NoC
Aug 17, 2006
Vu-Duc Ngo
System VLSI Lab
System Integration Technology Institute
Contents
Related works
Part I: Latency aware mapping of NoC architectures
Part II: Throughput aware mapping of NoC architectures
Part III: Energy consumption of NoC architectures
Part IV: Experiment results
Case study of the H.264 video decoder
Architectures used: 2-D Mesh, Fat-Tree
Case study of the Video Object Plane Decoder
Architectures used: 2-D Mesh, Fat-Tree, custom topologies
Future works
Publication list
References
Appendix for detailed future works
Appendix I: G/G/1 Queuing: Theoretical Approach
Appendix II: Double Plane VC Wormhole Router
Appendix III: NoC Emulation
Related works
Energy-aware mapping schemes:
Proposed by the research groups of G. De Micheli (Stanford Univ.) and R. Marculescu (CMU)
Addressed the issue of minimizing the power consumption of NoC architectures
Did not address the currently hot issue of QoS, such as:
Throughput guarantee
Latency guarantee
Did not consider:
The drop of packets inside the network, which is in the nature of a packet-based switching network
The power consumption was simulated with the homogeneous bit-energy model
Proposed mapping scheme: Latency- and Throughput-aware mapping
Issue raising:
- The QoS issue, currently a hot topic in NoC design, was addressed by J. Nurmi at the SoC'05 conference.
- It was also strongly mentioned by A. Ivanov and G. De Micheli (IEEE Design and Test, Aug 2005) as an important design criterion for future NoCs.
- It will be the main theme of the 1st IEEE NoC symposium in 2007.
Our work:
- Find a mapping scheme that:
1. Minimizes the architecture's latency
2. Maximizes the architecture's throughput
3. Calculates the corresponding size and power consumption
Part I: Latency aware mapping of NoC
Latency Optimal Mapping: Introduction
Latency:
IPs and the NoC architecture are heterogeneous.
The question is: which switching core should each IP core be mounted onto in order to minimize the network latency?
Issues of mapping IPs onto NoC architectures, for each mapping scheme:
The routing table of the applied routing algorithm changes with the mapping of IPs onto the pre-selected NoC architecture.
The queuing latency changes according to the content of the routing table.
Latency Optimal Mapping: Introduction (Contd)
Solution:
Assume data transactions have a Poisson distribution (general distributions will be studied in future work)
Use the M/M/1 queuing model to analyze the latency
We utilize a spanning-tree search algorithm to:
Automatically map the desired IPs onto the NoC architecture
Guarantee that the network latency is minimized
Reduce the search complexity of the optimum allocation scheme
Latency Optimal Mapping: M/M/1 queuing model
Let the arrival rate of packets at a node be λ, and let the number of packets at the node be N.
The node latency T follows from Little's theorem, which relates the latency to the number of packets:
N = λ·T
[Figure: a single network node, modeled as an IP core whose packets arrive at a buffer in front of the switching core (the server).]
Latency Optimal Mapping: M/M/1 queuing model (Contd)
M/M/1 (Contd):
Since N = ρ/(1 − ρ), with utilization ρ = λ/μ, Little's theorem gives:
T = N/λ = 1/(μ − λ)    (time one packet spends in a node)
W = T − 1/μ = ρ/(μ − λ)    (time one packet spends in the buffer, where 1/μ is the mean processing time)
N_Q = λ·W = N − ρ    (number of packets in the buffer)
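The M/M/1 relations above (T = 1/(μ − λ), W = T − 1/μ, N = λ·T, N_Q = λ·W) can be sketched in a few lines. The function name and the rate values below are our own illustration, not from the slides.

```python
# Sketch of the M/M/1 node model: T = 1/(mu - lam), W = T - 1/mu,
# N = lam*T (Little's theorem), N_Q = lam*W.

def mm1_metrics(lam: float, mu: float) -> dict:
    """Latency/occupancy metrics of one M/M/1 network node.

    lam: packet arrival rate, mu: service rate (same time unit).
    Requires lam < mu for a stable queue.
    """
    if lam >= mu:
        raise ValueError("unstable queue: arrival rate must be below service rate")
    rho = lam / mu              # utilization
    T = 1.0 / (mu - lam)        # time a packet spends in the node
    W = T - 1.0 / mu            # time spent waiting in the buffer
    N = lam * T                 # packets in the node (Little's theorem)
    N_Q = lam * W               # packets in the buffer (equals N - rho)
    return {"rho": rho, "T": T, "W": W, "N": N, "N_Q": N_Q}

m = mm1_metrics(lam=5.0, mu=10.0)
```

With λ = 5 and μ = 10, this gives T = 0.2, W = 0.1, N = 1, and N_Q = 0.5, consistent with the formulas above.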
Latency Optimal Mapping: Queuing latency in complex network
Network topology:
For each i-th stream: N_i = λ_i·T_i
Since the streams are i.i.d. and have the Markov distribution property (merged Poisson streams stay Poisson), the queuing latency of the j-th node satisfies:
N^j = Σ_{i∈Γ_j} N_i = λ^j·T^j,  where λ^j = Σ_{i∈Γ_j} λ_i and Γ_j is the set of incoming streams toward node j
so, by Little's theorem:
T^j = 1 / (μ_j − Σ_{i∈Γ_j} λ_i)
Latency Optimal Mapping: Queuing latency in complex network (Contd)
Thus, the queuing latency of the k-th route R_k is:
T_Queue^{R_k} = Σ_j γ_kj / (μ_j − Σ_{i∈Γ_j} λ_i)
where γ_kj = 1 if the j-th node ∈ R_k, and 0 otherwise.
The network latency in terms of queuing latency is given by:
Σ_{k=1}^{m} Σ_j γ_kj / (μ_j − Σ_{i∈Γ_j} λ_i)
where m is the number of routes in the routing table.
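The network queuing latency Σ_k Σ_j γ_kj / (μ_j − Σ λ_i) can be evaluated directly from a routing table. Below is a minimal sketch assuming routes are given as lists of node ids with per-route injection rates; all names and numbers are invented for illustration.

```python
# Illustrative sketch: accumulate the arrival rate lambda^j at every node,
# then sum the per-node M/M/1 latencies 1/(mu_j - lambda^j) over all
# nodes of all routes (gamma_kj = 1 exactly for nodes on route k).

def node_arrival_rates(routes, flow_rates):
    """Accumulate the total arrival rate seen by every node."""
    lam = {}
    for route, rate in zip(routes, flow_rates):
        for node in route:
            lam[node] = lam.get(node, 0.0) + rate
    return lam

def network_queuing_latency(routes, flow_rates, mu):
    """Sum of per-node queuing latencies over all routes in the table."""
    lam = node_arrival_rates(routes, flow_rates)
    total = 0.0
    for route in routes:
        for node in route:
            total += 1.0 / (mu[node] - lam[node])
    return total

routes = [["u1", "u2"], ["u2", "u3"]]      # two routes sharing node u2
flow_rates = [2.0, 3.0]                    # injection rate per route
mu = {"u1": 10.0, "u2": 10.0, "u3": 10.0}  # service rate of each router
latency = network_queuing_latency(routes, flow_rates, mu)
```

Node u2 carries both flows (λ = 5), so its term 1/(10 − 5) is counted once per route that crosses it, exactly as the double sum prescribes.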
Wire latency
If we take into account the differences between the wires, then for a wire modeled as an RLC line driving a load C_load, the wire delay is:
T_Wire = ∫_0^l (L_0 / W(x)) ∫_0^x (C_0·W(y) + C_f) dy dx
Furthermore, we can also express the wire inductance and capacitance in terms of the wire width W(x):
L(W) = L_0 / W(x),   C(W) = C_0·W(x) + C_f
where:
L_0: wire inductance per square
C_0: wire capacitance per unit area
C_f: fringing capacitance per unit length
M. A. El-Moursy and E. G. Friedman, "Design Methodologies For On-Chip Inductive Interconnection", Chapter 4, Interconnect-Centric Design For Advanced SoC and NoC, Kluwer Academic Publishers, 2004.
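The wire-delay double integral T_Wire = ∫_0^l (L_0/W(x)) ∫_0^x (C_0·W(y) + C_f) dy dx can be evaluated numerically for any width profile W(x). The sketch below uses a simple midpoint sum; all parameter values are made up, and the uniform-width closed form (L_0/W_0)(C_0·W_0 + C_f)·l²/2 serves as a sanity check.

```python
# Numeric sketch of the RLC wire-delay double integral with a midpoint rule.

def wire_delay(W, l, L0, C0, Cf, steps=400):
    """Evaluate int_0^l (L0/W(x)) int_0^x (C0*W(y) + Cf) dy dx.

    W: width profile W(x); L0: inductance per square; C0: capacitance
    per unit area; Cf: fringing capacitance per unit length.
    """
    dx = l / steps
    total = 0.0
    inner = 0.0                       # running value of int_0^x (C0*W(y)+Cf) dy
    for i in range(steps):
        x = (i + 0.5) * dx
        inner += (C0 * W(x) + Cf) * dx
        total += (L0 / W(x)) * inner * dx
    return total

# Uniform wire of width W0: closed form is (L0/W0)*(C0*W0 + Cf)*l**2/2.
W0, l, L0, C0, Cf = 1.0, 2.0, 0.5, 0.3, 0.1
numeric = wire_delay(lambda x: W0, l, L0, C0, Cf)
closed = (L0 / W0) * (C0 * W0 + Cf) * l**2 / 2
```

For non-uniform (tapered) wires, only the `W` callable changes; the same routine applies.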
Wire latency (Contd)
The route latency in terms of wire latency is calculated by:
T_Wire^{R_i} = Σ_j γ_ij ∫_0^{l_j} (L_0 / W(x)) ∫_0^x (C_0·W(y) + C_f) dy dx
where T_Wire^{R_i} is the latency of the i-th route and γ_ij = 1 if the j-th node ∈ R_i, and 0 otherwise.
The network latency in terms of wire latency is presented by:
Σ_{i=1}^{m} Σ_j γ_ij ∫_0^{l_j} (L_0 / W(x)) ∫_0^x (C_0·W(y) + C_f) dy dx
where m is the number of routes in the routing table.
Network latency
Considering that the shortest-path routing algorithm is applied, a given application yields a certain routing table, and the average latency is:
T_Aver-Lat = (1/m) Σ_{k=1}^{m} Σ_j γ_kj [ 1/(μ_j − Σ_{i∈Γ_j} λ_i) + ∫_0^{l_j} (L_0 / W(x)) ∫_0^x (C_0·W(y) + C_f) dy dx ]
where γ_kj = 1 if the j-th node is on route k (0 otherwise) and m is the number of routes in the routing table.
Latency Optimal Mapping: Problem statement
Since the arrival rates λ_i can change according to the status of the network:
Routing (predetermined connections of IPs)
Congestion
Accumulation of arrival rates
However, the processing rates μ_i are unchanged, due to the predetermined design of the switching nodes.
Therefore, an optimum mapping should be found that minimizes the system latency for a given practical application (e.g. the H.264 video decoder or VOPD).
Latency Optimal Mapping: Graph definitions
Graph characterizations:
IIG (IPs Implementation Graph), for the IP cores: G(V, A), a directed graph in which each vertex v_i ∈ V is an IP core and A(v_i) = λ_i is its arrival rate.
SAG (Switching Architecture Graph), for the NoC architecture: G(U, P), a directed graph in which each vertex u_i ∈ U is a node of the NoC topology and P(u_i) = 1/μ_i is its mean processing time.
[Figure: an example IIG with vertices V1-V10 and an example 4x4 SAG with nodes U1-U16.]
Latency Optimal Mapping: Mathematical formula
Mapping with min-latency criteria.
Definition of mapping:
map: G(V, A) → G(U, P),  map(v_i) = u_j
s.t. ∀ v_i ∈ V, u_j ∈ U, and ∀ v_i ≠ v_j: map(v_i) ≠ map(v_j)
This means that each IP is mapped to exactly one node of the NoC topology and no node hosts more than one IP core.
Min-latency criteria, with the cost function given by the average latency:
Find a map: G(V, A) → G(U, P) that minimizes
Σ_{k=1}^{m} Σ_j γ_kj [ 1/(μ_j − Σ_{i∈Γ_j} λ_i) + ∫_0^{l_j} (L_0 / W(x)) ∫_0^x (C_0·W(y) + C_f) dy dx ]
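The injective-mapping constraint (each IP on exactly one node, no node hosting two IPs) makes the search space the set of one-to-one placements. The toy brute force below, our own illustration rather than the spanning-tree / Branch-and-Bound search used in the thesis, scores every placement with a simplified queuing cost and keeps the best one; all rates and names are invented.

```python
# Exhaustive min-latency mapping over injective IP -> node placements,
# with a simplified queuing cost (no wire term) for illustration.
from itertools import permutations

def queuing_cost(placement, flows, mu):
    """Sum over flow endpoints of 1/(mu_node - load_node)."""
    load = {}
    for (src, dst), rate in flows.items():
        for ip in (src, dst):
            node = placement[ip]
            load[node] = load.get(node, 0.0) + rate
    cost = 0.0
    for (src, dst), rate in flows.items():
        for ip in (src, dst):
            node = placement[ip]
            if mu[node] <= load[node]:
                return float("inf")        # unstable node: reject placement
            cost += 1.0 / (mu[node] - load[node])
    return cost

def best_mapping(ips, nodes, flows, mu):
    best, best_cost = None, float("inf")
    for perm in permutations(nodes, len(ips)):   # injective maps only
        placement = dict(zip(ips, perm))
        c = queuing_cost(placement, flows, mu)
        if c < best_cost:
            best, best_cost = placement, c
    return best, best_cost

ips = ["A", "B", "C"]
nodes = ["u1", "u2", "u3"]
flows = {("A", "B"): 4.0, ("B", "C"): 1.0}        # hypothetical traffic
mu = {"u1": 10.0, "u2": 6.0, "u3": 10.0}          # u2 is the slow router
mapping, cost = best_mapping(ips, nodes, flows, mu)
```

The search correctly steers the lightly loaded IP (`C`) onto the slow router `u2`. Exhaustive search is factorial in the number of nodes, which is exactly why the thesis prunes the space with a tree search instead.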
Latency Optimal Mapping: Mapping example
Mapping example:
Solution: using spanning-tree search.
NoC architecture graph
(SAG)
Example of On-Chip Multiprocessors Network (OCMN)
[Figure: Mesh architectures and Fat-Tree architectures.]
Simulation results: H.264 video decoder on 2-D Mesh
Latency-optimal mapping (minimum latency L = 325 μs), placement of the IPs on the 2-D Mesh:
DB, LENT, MC, VOM, MVMVD, Processor, IPRED, IS, REC, FR_MEM, ITIQ
Random mapping (latency L_Random = 416 μs), placement:
DB, MC, FR_MEM, ITIQ, LENT, VOM, Processor, REC, MVMVD, IPRED, IS
Architecture throughput and energy:
[Figure: aggregate throughput over time and energy consumption for the optimal-latency mapping vs. the random mapping.]
Part II: Throughput aware mapping of NoC architectures
Throughput aware mapping
Since the wormhole router treats data flows at the flit level:
The probability that m flits exist at the i-th port of the j-th router (using the M/M/1 queuing model) is:
p_m = (1 − ρ_j)·ρ_j^m,   ρ_j = (Σ_{l=1}^{n_i} λ_il) / μ_j
where λ_il, l = 1..n_i, are the arrival rates of the data flows toward the i-th port of the j-th router, and n_i is the number of individual data flows entering the i-th port.
Block probability, with buffer size k:
P_block,i^j = 1 − Σ_{m=0}^{k−1} p_m = ρ_j^k = ( Σ_{l=1}^{n_i} λ_il / μ_j )^k
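Under this model the blocking probability reduces to ρ^k: the utilization of the port raised to the buffer size. A minimal sketch (names are ours, not from the slides):

```python
# Sketch of the blocking probability above: with rho = (sum of arrival
# rates into the port) / mu, a k-flit buffer blocks with probability rho**k.

def block_probability(arrival_rates, mu, k):
    """P_block = rho**k for an M/M/1 port with buffer size k flits."""
    rho = sum(arrival_rates) / mu
    if not 0.0 <= rho < 1.0:
        raise ValueError("model assumes rho < 1")
    return rho ** k

# Two flows of 2.0 and 3.0 entering a port served at rate 10, buffer of 4:
p = block_probability(arrival_rates=[2.0, 3.0], mu=10.0, k=4)
```

With ρ = 0.5 and k = 4 this gives P_block = 0.0625, i.e. deeper buffers suppress blocking geometrically.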
Throughput aware mapping (Contd)
The throughput contributed by the i-th port is the offered load that is not blocked:
T_i^j = (1 − P_block,i^j)·Σ_{l=1}^{n_i} λ_il = (1 − ρ_j^k)·Σ_{l=1}^{n_i} λ_il
Therefore, the throughput contributed by the j-th router is:
T^j = Σ_i T_i^j = Σ_i (1 − ρ_j^k)·Σ_{l=1}^{n_i} λ_il
Network throughput, where N is the number of routers:
T_Net = Σ_{j=1}^{N} T^j
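The throughput sums above compose mechanically: each port delivers its offered load scaled by the acceptance probability (1 − ρ^k), ports sum to a router, routers sum to the network. A sketch under our own naming (per-router utilization computed from the total offered load):

```python
# Sketch of T_j = sum_i (1 - rho_j**k) * sum_l lambda_il and
# T_Net = sum_j T_j; all rates below are invented for illustration.

def router_throughput(port_rates, mu, k):
    """Throughput of one router given its per-port flow-rate lists."""
    offered = sum(sum(rates) for rates in port_rates)  # total load on router
    rho = offered / mu
    return sum((1.0 - rho ** k) * sum(rates) for rates in port_rates)

def network_throughput(routers):
    """T_Net over routers given as (port_rates, mu, k) tuples."""
    return sum(router_throughput(p, mu, k) for p, mu, k in routers)

# Two hypothetical routers, buffer size k = 4 flits:
r1 = ([[1.0, 2.0], [1.0]], 10.0, 4)   # ports carrying flows 1+2 and 1
r2 = ([[2.0]], 8.0, 4)
t_net = network_throughput([r1, r2])
```

Because ρ depends on where the flows land, T_Net is a function of the mapping, which is what the max-throughput criterion on the next slides optimizes.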
Throughput aware mapping (Contd)
Since T_Net = Σ_{j=1}^{N} Σ_i (1 − ρ_j^k)·Σ_l λ_il is a function of the allocation scheme of the IPs onto the architecture, maximizing the network throughput means that an optimal mapping of a given application onto the architecture must be worked out.
Throughput aware Mapping: Mathematical formula
Mapping with max-throughput criteria.
Definition of mapping:
map: G(V, A) → G(U, P),  map(v_i) = u_j
s.t. ∀ v_i ∈ V, u_j ∈ U, and ∀ v_i ≠ v_j: map(v_i) ≠ map(v_j)
This means that each IP is mapped to exactly one node of the NoC topology and no node hosts more than one IP core.
Max-throughput criteria, with the cost function given by the network throughput:
Find a map: G(V, A) → G(U, P) that maximizes
T_Net = Σ_{j=1}^{N} Σ_i (1 − ρ_j^k)·Σ_l λ_il
Part III: Energy consumption of NoC architectures
Energy and area: calculation and parameters
0.1 μm CMOS technology, Vdd = 1.2 V
Router configurations: 3x3, 4x4, 5x5, 7x7
Energy of a CMOS circuit: E = (1/2)·C·Vdd²
Since P = f_clk·E, then P = (1/2)·f_clk·Vdd²·C
Router energy model:
E = E_xbar + E_arb + E_bufrd + E_bufwrt
Router power model:
P = f_clk·(E_xbar + E_arb + E_bufrd + E_bufwrt) = (1/2)·f_clk·Vdd²·(C_xbar + C_arb + C_bufrd + C_bufwrt)
[Figure: a p-port router with input buffers Buf 1..p, an arbiter, and a crossbar switch connecting the p input ports to the p output ports.]
Energy and area: calculation and parameters (Contd)
For an NxN 2-D Mesh homogeneous architecture:
P_Net = f_clk·H_avg·(E_xbar + E_arb + E_bufrd + E_bufwrt) = (1/2)·f_clk·Vdd²·H_avg·(C_xbar + C_arb + C_bufrd + C_bufwrt)
Since the average hop count is H_avg = 2N/3, then:
P_Net = (1/2)·f_clk·Vdd²·(2N/3)·(C_xbar + C_arb + C_bufrd + C_bufwrt)
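The dynamic-power model P = (1/2)·f_clk·Vdd²·C_total, scaled by the mesh's average hop count H_avg = 2N/3, is a one-liner to evaluate. The capacitance values below are invented placeholders, not measured data.

```python
# Sketch of the CMOS dynamic-power model on the slide.

def router_power(f_clk, vdd, c_xbar, c_arb, c_bufrd, c_bufwrt):
    """One router: P = 1/2 * f * V^2 * (sum of switched capacitances)."""
    return 0.5 * f_clk * vdd ** 2 * (c_xbar + c_arb + c_bufrd + c_bufwrt)

def mesh_network_power(n, f_clk, vdd, c_xbar, c_arb, c_bufrd, c_bufwrt):
    """P_Net for an NxN mesh, scaling by the average hop count 2N/3."""
    h_avg = 2.0 * n / 3.0
    return h_avg * router_power(f_clk, vdd, c_xbar, c_arb, c_bufrd, c_bufwrt)

# Example: 1 GHz clock, Vdd = 1.2 V (as on the slide), 1 pF per component:
p_router = router_power(1e9, 1.2, 1e-12, 1e-12, 1e-12, 1e-12)
p_net = mesh_network_power(6, 1e9, 1.2, 1e-12, 1e-12, 1e-12, 1e-12)
```

For a 6x6 mesh the hop-count factor is 4, so the network power is four times the single-router figure.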
Power and area: calculation and parameters (Contd)
For a heterogeneous architecture, the power consumed by one flit traversing route R_i is:
P_flit^i = f_clk·Σ_j γ_ij·(E_xbar^j + E_arb^j + E_bufrd^j + E_bufwrt^j)
where γ_ij = 1 if the j-th router ∈ R_i, and 0 otherwise.
Hence:
P_Net = Σ_i P_flit^i = Σ_i f_clk·Σ_j γ_ij·(E_xbar^j + E_arb^j + E_bufrd^j + E_bufwrt^j)
Energy estimation
Bit-energy estimation, and energy estimation based on throughput:
The behavioral simulation of a mapping (e.g. the random-mapping example, traced as total throughput over time under SFQ, DRR, RED, and DropTail/FIFO queuing) produces the information bits per flow. The Orion router and interconnection power models, with a scaling factor for the given CMOS technology, turn the traced bits into a bit-energy consumption estimate, from which the system energy is estimated.
H. S. Hwang et al., "Orion: A Power Performance Simulator for Interconnection Networks," IEEE Micro, Nov. 2002.
Part IV: Experiment results
1. Throughput aware mapping: 2D Mesh,
Fat-Tree architectures for H.264 design
2. Throughput aware mapping: irregular
architectures for VOPD design
H.264 video decoder's data transaction table
Assume the data transactions follow a Poisson distribution
Throughput-aware mapping for: 2-D Mesh, Fat-Tree
Wormhole router architecture
p x p wormhole router:
p in/out ports
Single switching plane
One input buffer for each input port
Ex.: the 2-D Mesh uses a 5x5 router
[Figure: wormhole router with input buffers Buf 1..p, an arbiter, and a crossbar switch connecting the p input ports to the p output ports.]
Simulation parameters
For all topologies:
Routing scheme: Shortest Path
Queuing scheme: DropTail
Buffer size: 4 packets
Packet size: 64 bytes
Flit size: 128 bits
Throughput comparison
Throughput of five topologies
Topology comparison
[Figure: topology size and topology energy of the compared architectures.]
VOPD on 5 topologies
VOPD's data transaction table
Assume the data transactions follow a Poisson distribution
Throughput-aware mapping for: 2-D Mesh, Fat-Tree, and 3 custom topologies
First custom topology (3 Xbar )
First custom topology:
(a) NAM output of the first topology; (b) VOPD on the first topology
[Figure: a 5x5 wormhole router (1st), a 6x6 wormhole router, and a 5x5 wormhole router (2nd) connecting the IPs VMEM, UPSAM, VRecst, PAD, IQua, AC/DC, IDCT, Invers, ARM, Str_MEM, Run_Lenght, and VarLen.]
Second custom topology (4 Xbar )
(a) NAM output of the second topology; (b) H.264 decoder on the second topology
[Figure: 5x5 wormhole routers (1st, 2nd, 3rd) and a 3x3 wormhole router connecting the IPs MC, DB, DMA, FR_MEM, ITIQ, LENT, VOM, REC, MVMVD, PROC, IPRED, and IS.]
Third custom topology (5 Xbar )
(a) NAM output of the third topology; (b) VOPD on the third topology
[Figure: a 6x6 wormhole router, a 5x5 wormhole router, and three 3x3 wormhole routers (1st, 2nd, 3rd) connecting the IPs VMEM, UPSAM, VRecst, PAD, IQua, AC/DC, IDCT, Invers, ARM, Str_MEM, Run_Lenght, and VarLen.]
Simulation parameters
For all topologies:
Routing scheme: Shortest Path
Queuing scheme: DropTail
Buffer size: 4 packets
Packet size: 64 bytes
Flit size: 128 bits
Throughput comparison
Throughput of five topologies
Result Discussion: Term of Throughput
Best topology: Fat-Tree
Worst topology: 2-D Mesh (the lowest aggregate throughput, together with the high hardware overhead of unused switches)
The Fat-Tree offers almost the same throughput as the first custom topology, but with a big hardware overhead
Power and area of Router
Wire and energy dissipation
Wire dimension vs. capacitance, 0.10 μm technology: Ldrawn/Tech = 0.10 μm, capacitance (fF/μm): 335
Energy: E_wire = (1/2)·C_wire·Vdd²
Wire dimension vs. chip edge, 0.10 μm technology
R. Ho, et al., "The future of wires," Proceedings of the IEEE, pp. 490-504, April 2001.
Topology comparison (Contd)
Conclusion:
The 1st topology consumes the smallest power, and its wire energy is not significant, because of its simplicity in interconnections.
The Fat-Tree consumes the biggest power and has the biggest size; its wire energy dissipation is significant, due to its complex interconnections.
Custom Topologies comparison (Random map vs. optimal map)
[Figure: comparison in terms of throughput and in terms of energy consumption.]
Custom Topologies comparison (Random map vs. optimal map) (Contd)
Discussion:
Optimal maps offer not only better throughput but also less energy consumption.
No ARQ scheme was implemented. If ARQ were used, the throughput would be the same or even higher, but more energy would be consumed in retransmitting dropped packets (future work).
Conclusions
Heterogeneous NoC architectures are considered for design based on the latency criterion.
The latency of a heterogeneous NoC architecture, in terms of router and wire latency, is well formulated.
The Branch and Bound algorithm is adopted to automatically map the IPs onto the NoC architectures with an optimal latency metric.
Experiments on various sizes of Mesh and Fat-Tree architectures for the OCMN application and H.264 are carried out.
The latency of the optimal mappings is significantly reduced.
Conclusions (Contd)
Heterogeneous NoC architectures are considered for design based on the maximum-throughput criterion.
The throughput of a heterogeneous NoC architecture is formulated.
The Branch and Bound algorithm is adopted to automatically map the IPs onto the NoC architectures to obtain maximal throughput.
Experiments on various sizes of Mesh, Fat-Tree, and tree-based architectures for VOPD and H.264 are carried out.
The heterogeneous bit-power model was applied to obtain the exact energy consumption and area of the architectures.
Future works
Modeling the architecture with the general-distribution queuing model G/G/1 (Appendix I)
Realization of a multi-layer router for NoC design (Appendix II):
Performance comparison with the current router model
Power-consumption comparison with the current router model
Variation of the number of switching planes and of the number of virtual channels will be considered
NoC emulation (Appendix III)
Global optimization for the two criteria of latency and throughput
ARQ implementation with power measurement for the throughput-aware mapping scheme
Publication list
International Journals
1. Vu-Duc Ngo, Huy-Nam Nguyen, Hae-Wook Choi, "Analyzing the Performance of Mesh and Fat-Tree Topologies for Network on Chip Design", LNCS (Springer-Verlag), Vol. 3824/2005, pp. 300-310, Dec 2005.
2. Vu-Duc Ngo, Huy-Nam Nguyen, Hae-Wook Choi, "Designing On-Chip Network based on Optimal Latency Criteria", LNCS (Springer-Verlag), Vol. 3820/2005, pp. 287-298, Dec 2005.
3. Huy-Nam Nguyen, Vu-Duc Ngo, Hae-Wook Choi, "Realization of Video Object Plane Decoder on On-Chip-Network Architecture", LNCS (Springer-Verlag), Vol. 3820/2005, pp. 256-264, Dec 2005.
4. Vu-Duc Ngo, Hae-Wook Choi, Sin-Chong Park, "An Expurgated Union Bound for Space-Time Code Systems", LNCS (Springer-Verlag), Vol. 3124/2004, pp. 156-162, July 2004.
5. Vu-Duc Ngo, Huy-Nam Nguyen, Hae-Wook Choi, "The Optimum Network on Chip Architectures for Video Decoder Applications Design" (to be submitted to ETRI Journal).
6. Vu-Duc Ngo, Huy-Nam Nguyen, Hae-Wook Choi, "Throughput Aware Mapping for NoC Design" (to be submitted to IEE Electronics Letters).
References
1. L. Benini and G. De Micheli, "Networks on Chips: A new SoC paradigm", IEEE Computer, Jan. 2002.
2. A. Agarwal, "Limit on interconnection network performance", IEEE Transactions on Parallel and Distributed Systems, Vol. 2, Issue 4, Oct. 1991, pp. 398-412.
3. T. Ye, L. Benini and G. De Micheli, "Packetization and Routing Analysis of On-Chip MultiProcessor Networks", Journal of System Architecture, Vol. 50, February 2004, pp. 81-104.
4. M. A. El-Moursy and E. G. Friedman, "Design Methodologies For On-Chip Inductive Interconnection", Chapter 4, Interconnect-Centric Design For Advanced SoC and NoC, Kluwer Academic Publishers, 2004.
5. R. Ho, et al., "The future of wires," Proceedings of the IEEE, pp. 490-504, April 2001.
6. J. Hu, R. Marculescu, "Exploiting the Routing Flexibility for Energy/Performance Aware Mapping of Regular NoC Architectures", in Proc. Design, Automation and Test in Europe Conf., March 2003.
7. J. Hu, R. Marculescu, "Energy-Aware Communication and Task Scheduling for Network-on-Chip Architectures under Real-Time Constraints", in Proc. Design, Automation and Test in Europe Conf., Feb. 2004.
8. S. Murali and G. De Micheli, "Bandwidth-Constrained Mapping of Cores onto NoC Architectures", DATE, International Conference on Design and Test Europe, 2004, pp. 896-901.
9. T. Tao Ye, L. Benini, G. De Micheli, "Packetization and Routing for On-Chip Communication Networks," Journal of System Architecture, special issue on Networks-on-Chip.
10. T. H. Cormen, et al., "Introduction to Algorithms," Second Edition, The MIT Press, 2001.
11. D. Bertozzi, L. Benini and G. De Micheli, "Network on Chip Design for Gigascale Systems on Chips", in R. Zurawski, Editor, Industrial Technology Handbook, CRC Press, 2004, pp. 95.1-95.18.
References (Contd)
12. L. Benini and G. De Micheli, "Networks on Chip: A new paradigm for component-based MPSoC design," in A. Jerraya and W. Wolf, Editors, "Multiprocessor Systems on Chips", Morgan Kaufmann, 2004, pp. 49-80.
13. D. Bertsekas and R. Gallager, "Data Networks," Chapter 5, Second Edition, Prentice-Hall, Inc., 1992.
14. A. Jalabert, S. Murali, L. Benini, G. De Micheli, "xpipesCompiler: A Tool for instantiating application specific Networks on Chip", Proc. DATE 2004.
15. M. Dall'Osso, G. Biccari, L. Giovannini, D. Bertozzi, L. Benini, "Xpipes: a latency insensitive parameterized network-on-chip architecture for multiprocessor SoCs", 21st International Conference on Computer Design, Oct. 2003, pp. 536-539.
16. C. E. Leiserson, "Fat-Trees: Universal networks for hardware-efficient supercomputing," IEEE Transactions on Computers, C-34, pp. 892-901, Oct. 1985.
17. H. S. Hwang et al., "Orion: A Power Performance Simulator for Interconnection Networks," IEEE Micro, Nov. 2002.
18. W. J. Dally and B. Towles, "Route Packets, Not Wires: On-Chip Interconnection Networks," DAC, pp. 684-689, 2001.
19. N. Eisley and L.-S. Peh, "High-level power analysis for on-chip networks," in Proceedings of the 7th International Conference on Compilers, Architectures and Synthesis for Embedded Systems (CASES), September 2004.
20. J. Nurmi, "Network-on-Chip: A New Paradigm for System-on-Chip Design," Proceedings of the International Symposium on System-on-Chip, Nov. 2005.
21. L. Kleinrock, Queueing Systems, Volume 1: Theory, Wiley, New York, 1975.
Appendix I: Latency of G/G/1 queuing model
The latency of a general queuing model is:
T = W + 1/μ
where 1/μ is the mean processing time and W is the waiting time in the buffer.
For G/G/1 queuing (applied to generally distributed data transactions), from [21] the waiting time is:
W = λ·(σ²_X1 + σ²_X2) / (2·(1 − λ/μ))
[21] L. Kleinrock, Queueing Systems, Volume 1: Theory, Wiley, New York, 1975.
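The G/G/1 expression W = λ(σ²_X1 + σ²_X2)/(2(1 − λ/μ)) is the classical Kingman-style upper bound on the mean wait. As a sanity check, plugging in exponential variances (1/λ² for interarrival, 1/μ² for service) should dominate the exact M/M/1 result W = ρ/(μ − λ); the rate values below are our own illustration.

```python
# Sketch of the G/G/1 waiting-time bound from the slide:
# W = lam * (var_interarrival + var_service) / (2 * (1 - lam/mu)).

def gg1_waiting_time(lam, mu, var_arrival, var_service):
    """Upper bound on the mean waiting time of a G/G/1 queue."""
    rho = lam / mu
    if rho >= 1.0:
        raise ValueError("requires rho < 1")
    return lam * (var_arrival + var_service) / (2.0 * (1.0 - rho))

lam, mu = 5.0, 10.0
# Exponential interarrival and service variances (1/lam^2 and 1/mu^2):
w_bound = gg1_waiting_time(lam, mu, 1.0 / lam**2, 1.0 / mu**2)
w_exact = (lam / mu) / (mu - lam)   # exact M/M/1 waiting time
```

Here the bound gives 0.25 against the exact M/M/1 value 0.1, which is consistent with its role as an upper bound rather than an exact formula.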
Appendix I: Latency of G/G/1 queuing model (Contd)
Where:
σ²_X1 = Var(X1) = E[X1²] − E²[X1]
σ²_X2 = Var(X2) = E[X2²] − E²[X2]
λ: mean value of the arrival rate
Therefore, for a single node, the latency is given by:
T_i = W_i + 1/μ_i = λ_i·(σ²_X1,i + σ²_X2,i) / (2·(1 − λ_i/μ_i)) + 1/μ_i
For the case of a complex node, in which several independent data streams are associated, the node's latency is formulated by:
T_j = W_j + 1/μ_j = (Σ_i γ_ji·λ_i)·(σ²_X1,j + σ²_X2,j) / (2·(1 − Σ_i γ_ji·λ_i / μ_j)) + 1/μ_j
Appendix I: Latency of G/G/1 queuing model (Contd)
Because, for C independent data streams X_1, X_2, ..., X_C:
E[X_1 + X_2 + ... + X_C] = Σ_{j=1}^{C} E[X_j]
Var(X_1 + X_2 + ... + X_C) = Σ_{j=1}^{C} σ²_X,j
For a certain k-th route, the latency is:
T_{R_k} = Σ_i γ_ik·T_i = Σ_i γ_ik·[ (Σ_j γ_ji·λ_j)·(σ²_X1,i + σ²_X2,i) / (2·(1 − Σ_j γ_ji·λ_j / μ_i)) + 1/μ_i ]
where γ_ik = 1 if the i-th node is on the k-th route, and 0 otherwise.
Appendix I: Latency of G/G/1 queuing model (Contd)
If there are in total m routes, the network latency is:
T_Net = Σ_{k=1}^{m} T_{R_k} = Σ_{k=1}^{m} Σ_i γ_ik·[ (Σ_j γ_ji·λ_j)·(σ²_X1,i + σ²_X2,i) / (2·(1 − Σ_j γ_ji·λ_j / μ_i)) + 1/μ_i ] = f(map)
i.e. a function of the allocation scheme of the IPs onto the architecture.
The optimization issue turns out to be: finding the optimal mapping of IPs onto routers such that
T_Opt = min T_Net = min Σ_{k=1}^{m} T_{R_k}
Appendix I: Latency of G/G/1 queuing model (Contd)
The conditions that we assume in order to calculate the cost function, as well as to apply Branch and Bound, are:
The means of the processing times are constant
The mean and variance of the IPs' data rates are known
The cost function then simplifies (dropping the constant processing-time terms) to:
T_Opt = min Σ_{k=1}^{m} Σ_i γ_ik·(Σ_j γ_ji·λ_j)·σ²_X1,i / (2·(1 − Σ_j γ_ji·λ_j / μ_i))
Appendix II: Multi-layer Router
Conventional virtual-circuit router vs. multiple-switching-layer router:
[Figure: a conventional router (routing computation unit, VC allocator, switch allocator, input unit, crossbar switch, output unit) next to a multi-layer router in which the single crossbar is replaced by Crossbar Switch 1 and Crossbar Switch 2.]
Appendix II: Multi-layer Router (Contd)
Performance analysis, waiting time:
W = W_1 + W_2
where n is the number of virtual circuits, m is the number of switching planes, and W_1 and W_2 are closed-form functions of the arrival rate, the service rate, n, and m.
Appendix III: NoC Emulation
NoC Emulation: Behavior simulation framework
[Figure: behavioral-simulation framework. A topology selector (Mesh, Fat-Tree, Torus, Octagon) and a data-transaction table (Poisson or general distribution) feed the optimizer (Branch and Bound) under the latency and throughput metrics; the optimizer fixes the routing table (shortest path) used by the behavioral simulation; the traced data, together with the bit-energy model, drive the energy and area analysis and the performance analyzer (latency and throughput metrics).]
NoC Emulation: Board implementation framework
Emulation architecture with a stochastic data generator:
[Figure: on the emulation platform, network interfaces (NI), each containing a data generator, a data receiver, and an OCP interface, connect to the switch and routing table and are managed by a scheduler and a controller; a PowerPC reaches the platform through the OPB and IB buses via an OPB-to-IB bridge, with a MEM block and a host PC attached; the platform is built from C code and Verilog synthesis.]
Roles: switch the data's distribution and the data generators' mode; schedule all data generators; store the data of the data receivers for post-processing on the host.
NoC Emulation: Board implementation framework (Contd)
Emulation architecture with a real data generator:
[Figure: as in the stochastic setup, but each network interface (NI) with its OCP interface is paired with Tx.MEM and Rx.MEM memories; the switch and routing table, scheduler, controller, OPB-to-IB bridge, PowerPC, and host PC complete the platform, again built from C code and Verilog synthesis.]
NoC Emulation: Board implementation framework (Contd)
A given combination of Tx.MEM and Rx.MEM plays the role of a soft IP with its own data transactions.
The given combination of Tx.MEM and Rx.MEM is associated with a certain NI by the Optimizer and controlled by the Controller.
The NI acts as the packetizer (supports BE and GT).
The transmitted data is read out of Tx.MEM following the given data-transaction timing diagram, as scheduled by the Scheduler.
NoC Emulation: Emulation Board
Virtex-II Pro Based Processor Board
Virtex-II Pro XC2VP100:
Total slices: 44,096
Primitive design element: Double Plane VC Wormhole Router, 4,000 slices (9%)
Benchmark: e.g. the H.264 decoder, with 12 IPs and 16 routers
Design partition: 6 IPs and 8 routers per FPGA