Thiet Ke He Thong Nhung (Embedded Systems Design)


  • 8/12/2019 Thiet Ke He Thong Nhung

    1/75

    1

    Thesis proposal: Throughput and

    Latency aware mapping of NoC

    Aug 17, 2006

    Vu-Duc Ngo

    System VLSI Lab

    System Integration Technology Institute


    Contents

    Related works

    Part I: Latency aware mapping of NoC architectures

    Part II: Throughput aware mapping of NoC architectures

    Part III: Energy consumption of NoC architectures

    Part IV: Experiment results

    Study case of H.264 video decoder

    Used architectures: 2-D Mesh, Fat-Tree

    Study case of Video Object Plane Decoder

    Used architectures: 2-D Mesh, Fat-Tree, Custom topologies

    Future works

    Publication list

    References

    Appendix for detailed future works

Appendix I: G/G/1 Queuing: Theoretical Approach

    Appendix II: Double Plane VC Wormhole Router

    Appendix III: NoC Emulation


    Related works

    Energy aware mapping schemes:

Proposed by: the De Micheli (Stanford Univ.) and R. Marculescu (CMU) research groups

    Addressed the issue of minimizing the power consumption of NoC

    architectures

    Did not address the currently hot issue of QoS such as:

    Throughput guarantee

    Latency guarantee

    Did not consider:

Drop of packets inside the network, which is inherent to a packet-based switching network

    The power consumption was simulated with the homogeneous bit energy

    model


Proposed mapping scheme: Latency and Throughput aware mapping

    Issue raising:

- The QoS issue, currently a hot topic in NoC design, was addressed by J. Nurmi at the SoC05 conference.

    - It was also strongly mentioned by A. Ivanov and De Micheli (IEEE Design and Test, Aug 2005) as an important design criterion for future NoCs.

    - It will be the main theme of the 1st IEEE NoC symposium 2007.

    Our works:

- Find a mapping scheme that:

    1. Minimizes the architecture's latency

    2. Maximizes the architecture's throughput

    3. Calculates the corresponding size and power consumption


    Part I: Latency aware mapping of NoC


    Latency Optimal Mapping: Introduction

    Latency:

IPs and the NoC architecture are heterogeneous

    Question: Which switching core should each IP core be mounted onto in order to minimize the network latency?

    Issues of mapping IPs onto NoC architectures:

    For each mapping scheme:

    The routing table of the applied routing algorithm changes with the mapping of IPs onto the pre-selected NoC architecture.

    The queuing latency changes according to the content of the routing table.


    Latency Optimal Mapping: Introduction (Contd)

    Solution:

Assume data transactions have a Poisson distribution (general distributions will be studied in future work)

    Use the M/M/1 queuing model to analyze the latency

    We utilize a spanning-tree search algorithm to:

    Automatically map the desired IPs onto the NoC architecture

    Guarantee that the network latency is minimized

    Reduce the search complexity of the optimum allocation scheme


    Latency Optimal Mapping: M/M/1 queuing model

Let the arrival rate of packets at the node be λ, and let the number of packets at the node be N.

    The node latency T is related to N by Little's theorem:

    N = λT

    (Figure: a single network node — packets arrive at a buffer; the switching core acts as the server for the attached IP core.)


    Latency Optimal Mapping: M/M/1 queuing model (Contd)

M/M/1 (Contd):

    Since N = ρ / (1 − ρ) with ρ = λ/μ, Little's theorem gives the time one packet spends in one node:

    T = N/λ = 1 / (μ − λ)

    The time one packet spends in the buffer is W = T − 1/μ, where 1/μ is the mean processing time.

    The number of packets in the buffer is N_Q = λW.
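The M/M/1 relations above (Little's theorem, node latency, buffer waiting time) can be checked numerically. A minimal sketch; the rates are illustrative, not taken from the slides:

```python
def mm1_metrics(lam, mu):
    """Steady-state M/M/1 metrics for arrival rate lam and service rate mu.

    Requires lam < mu for stability.
    """
    assert lam < mu, "queue is unstable"
    rho = lam / mu            # utilization
    N = rho / (1 - rho)       # mean number of packets in the node
    T = 1 / (mu - lam)        # mean time in the node (Little: N = lam * T)
    W = T - 1 / mu            # mean waiting time in the buffer
    Nq = lam * W              # mean number of packets in the buffer
    return N, T, W, Nq

N, T, W, Nq = mm1_metrics(lam=2.0, mu=5.0)
print(N, T, W, Nq)
```

Little's theorem holds both for the whole node (N = λT) and for the buffer alone (N_Q = λW), which makes a handy sanity check.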


    Latency Optimal Mapping: Queuing latency in complex network

Network topology:

    For each i-th stream: N_i = λ_i T_i.

    Since the streams are i.i.d. and the Poisson (Markovian) property is preserved under superposition, the aggregate arrival rate at the j-th node is Σ_{i∈C_j} λ_i, where C_j is the set of incoming streams toward node j, and N_j = Σ_{i∈C_j} N_i.

    By Little's theorem, the queuing latency of the j-th node is:

    T_j = 1 / (μ_j − Σ_{i∈C_j} λ_i)


    Latency Optimal Mapping: Queuing latency in complex network (Contd)

Thus, the latency of the k-th route R_k is:

    T_Queue^{R_k} = Σ_j σ_{kj} T_j = Σ_j σ_{kj} / (μ_j − Σ_{i∈C_j} λ_i)

    Where:

    σ_{kj} = 1 if the j-th node ∈ R_k, 0 otherwise

    The network latency in terms of queuing latency is given by:

    Σ_{k=1}^{m} Σ_j σ_{kj} / (μ_j − Σ_{i∈C_j} λ_i)

    Where m is the number of routes in the routing table.
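The per-route and network sums above can be sketched directly, treating each node as an M/M/1 queue whose aggregate arrival rate is the sum over its incoming streams. The node names, rates, and routing table below are illustrative:

```python
def node_latency(mu_j, lambdas_j):
    """M/M/1 latency of one node: 1 / (mu_j - sum of incoming arrival rates)."""
    agg = sum(lambdas_j)
    assert agg < mu_j, "node overloaded"
    return 1.0 / (mu_j - agg)

def network_queuing_latency(routes, mu, incoming):
    """Sum of per-node latencies over every route in the routing table.

    routes:   list of routes, each a list of node ids (the sigma_kj indicator)
    mu:       dict node id -> service rate
    incoming: dict node id -> arrival rates of the streams entering it
    """
    return sum(node_latency(mu[j], incoming[j])
               for route in routes for j in route)

# Two routes over three nodes.
routes = [["a", "b"], ["b", "c"]]
mu = {"a": 4.0, "b": 6.0, "c": 5.0}
incoming = {"a": [1.0], "b": [1.0, 2.0], "c": [2.0]}
print(network_queuing_latency(routes, mu, incoming))
```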


Wire latency

    If we take into account the differences among the wires, then, for a wire modeled as an RLC line of length l and width profile W(x) driving a load C_load, the wire delay is:

    T_w = ∫_0^l (L_0 / W(x)) ∫_0^x (C_0 W(y) + C_f) dy dx

    Furthermore, the wire inductance and capacitance can be written in terms of the wire width:

    L(W) = L_0 / W(x),  C(W) = C_0 W(x) + C_f

    Where:

    L_0: wire inductance per square
    C_0: wire capacitance per unit area
    C_f: fringing capacitance per unit length

    M. A. El-Moursy and E. G. Friedman, "Design Methodologies For On-Chip Inductive Interconnection", Chapter 4, Interconnect-Centric Design For Advanced SoC and NoC, Kluwer Academic Publishers, 2004.
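The double integral for the wire delay can be evaluated numerically for any width profile W(x). A minimal midpoint-rule sketch; the constants are illustrative, not process data from the slides:

```python
def wire_delay(l, W, L0, C0, Cf, steps=2000):
    """T_w = integral_0^l (L0 / W(x)) * integral_0^x (C0*W(y) + Cf) dy dx,
    evaluated with the midpoint rule; W is a function of position."""
    h = l / steps
    inner = 0.0       # running value of integral_0^x (C0*W(y) + Cf) dy
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h
        inner += (C0 * W(x) + Cf) * h
        total += (L0 / W(x)) * inner * h
    return total

# Uniform-width wire: the closed form is (L0/W) * (C0*W + Cf) * l**2 / 2.
T = wire_delay(l=100.0, W=lambda x: 1.0, L0=0.5, C0=0.2, Cf=0.05)
print(T)
```

For the uniform case the closed form gives 625.0, so the numerical result should agree to within the discretization error.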


Wire latency (Contd)

    The latency of the i-th route R_i in terms of wire latency is:

    T_Wire^{R_i} = Σ_j σ_{ij} ∫_0^{l_j} (L_0 / W(x)) ∫_0^x (C_0 W(y) + C_f) dy dx

    Where σ_{ij} = 1 if the j-th node ∈ R_i, 0 otherwise.

    The network latency in terms of wire latency is:

    Σ_{i=1}^{m} Σ_j σ_{ij} ∫_0^{l_j} (L_0 / W(x)) ∫_0^x (C_0 W(y) + C_f) dy dx

    Where m is the number of routes in the routing table.


    Network latency

Considering that the shortest-path routing algorithm is applied:

    For a given application there is a certain routing table. The average latency is:

    T_{Aver-Lat} = (1/m) Σ_{k=1}^{m} Σ_j σ_{kj} [ 1 / (μ_j − Σ_{i∈C_j} λ_i) + ∫_0^{l_j} (L_0 / W(x)) ∫_0^x (C_0 W(y) + C_f) dy dx ]

    Where:

    σ_{kj} = 1 if the j-th node ∈ route k, 0 otherwise
    m: the number of routes in the routing table


    Latency Optimal Mapping: Problem statement

Since the arrival rates λ_i change according to the status of the network:

    Routing (predetermined connections of IPs)

    Congestion

    Accumulation of arrival rates

    However, the service rates μ_i are fixed by the predetermined design of the switching nodes.

    Therefore, an optimum mapping should be found that minimizes the system latency for:

    A concrete practical application, e.g., the H.264 video decoder or VOPD.


    Latency Optimal Mapping: Graph definitions

Graph characterizations:

    IIG (IPs Implementation Graph) G(V, A): a directed graph whose vertices v_i ∈ V are the IP cores, annotated with a_i ∈ A, the arrival rate λ_i of v_i.

    SAG (Switching Architecture Graph) G(U, P): a directed graph whose vertices u_i ∈ U are the nodes of the NoC topology, annotated with p_i ∈ P, the mean processing time 1/μ_i of u_i.

    (Figures: an IIG with vertices V1..V10 and a 4x4 SAG with nodes U1..U16.)


    Latency Optimal Mapping: Mathematical formula

Mapping with min-latency criteria:

    Definition of mapping:

    map: G(V, A) → G(U, P), map(v_i) = u_j

    s.t. ∀ v_i ∈ V: map(v_i) ∈ U, and ∀ v_i ≠ v_j ∈ V: map(v_i) ≠ map(v_j)

    This means that each IP is mapped to exactly one node of the NoC topology, and no node can host more than one IP core.

    The cost function is the average latency:

    T_{Aver-Lat} = (1/m) Σ_{k=1}^{m} Σ_j σ_{kj} [ 1 / (μ_j − Σ_{i∈C_j} λ_i) + ∫_0^{l_j} (L_0 / W(x)) ∫_0^x (C_0 W(y) + C_f) dy dx ]

    Min-latency criteria: find a map: G(V, A) → G(U, P) that minimizes T_{Aver-Lat}.
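The min-latency mapping problem is an injective assignment of IPs to switching nodes minimizing the cost function. A brute-force sketch over a toy instance (the thesis prunes the same search space with spanning-tree / Branch and Bound search; the cost function and sizes here are illustrative):

```python
from itertools import permutations

def best_mapping(ips, nodes, cost):
    """Assign each IP to exactly one node (injective map) minimizing cost(map).

    Exhaustive search over all injective assignments; Branch and Bound
    prunes this same space without changing the optimum.
    """
    best, best_cost = None, float("inf")
    for placed in permutations(nodes, len(ips)):
        m = dict(zip(ips, placed))
        c = cost(m)
        if c < best_cost:
            best, best_cost = m, c
    return best, best_cost

# Toy cost: latency grows with the index of the node an IP sits on,
# weighted by that IP's arrival rate (names are illustrative).
lam = {"DB": 3.0, "MC": 1.0, "ITIQ": 2.0}
cost = lambda m: sum(lam[ip] * (node + 1) for ip, node in m.items())
mapping, c = best_mapping(list(lam), nodes=range(4), cost=cost)
print(mapping, c)
```

As expected, the optimum places the heaviest stream (DB) on the cheapest node.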


    Latency Optimal Mapping: Mapping example

Mapping example:

    Solution: using spanning-tree search.

    (Figure: NoC architecture graph (SAG).)


    Example of On-Chip Multiprocessors Network (OCMN)

(Figures: Mesh architectures and Fat-Tree architectures.)


    Simulation results: H.264 video decoder on 2-D Mesh

Latency optimal mapping:

    Minimum latency: L_min = 325 μs

    Placement (by row): DB, LENT, MC / VOM, MVMVD, Processor, IPRED / IS, REC, FR_MEM, ITIQ

    Random mapping:

    Random latency: L_Random = 416 μs

    Placement (by row): DB, MC / FR_MEM, ITIQ / LENT, VOM, Processor, REC / MVMVD, IPRED, IS

    (Figures: architecture throughput and energy — aggregate throughput over time for the optimal-latency mapping vs. the random mapping, and energy consumption (J) of the optimal-latency mapping vs. the random mapping.)


    Part II: Throughput aware mapping of NoC

    architectures


    Throughput aware mapping

Since the wormhole router treats data flows at the flit level, each input port is modeled as an M/M/1 queue with a finite buffer of k flits.

    The probability that m flits are present at the i-th port of the j-th router is:

    p_m = (1 − ρ_i^j) (ρ_i^j)^m / (1 − (ρ_i^j)^{k+1}),  with ρ_i^j = (Σ_{l=1}^{n_i} λ_{il}^j) / μ_j

    Where:

    λ_{il}^j, l = 1..n_i: the arrival rates of the data flows entering the i-th port of the j-th router
    n_i: the number of individual data flows entering port i

    Block probability (the buffer of size k is full):

    P_{block,i}^j = p_k = (1 − ρ_i^j) (ρ_i^j)^k / (1 − (ρ_i^j)^{k+1})
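The blocking probability above is the standard finite-buffer M/M/1/k result; a minimal sketch with illustrative flow rates:

```python
def blocking_probability(lambdas, mu, k):
    """P_block for an input port fed by several flows, modeled as M/M/1/k.

    lambdas: arrival rates of the individual flows entering the port
    mu:      service rate of the router port
    k:       buffer size in flits
    """
    rho = sum(lambdas) / mu
    if rho == 1.0:                 # degenerate case: uniform distribution
        return 1.0 / (k + 1)
    # p_m = (1 - rho) * rho**m / (1 - rho**(k+1)); blocking when m == k
    return (1 - rho) * rho**k / (1 - rho**(k + 1))

print(blocking_probability([1.0, 2.0], mu=6.0, k=4))
```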


    Throughput aware mapping (Contd)

The throughput contributed by the i-th port is the accepted fraction of its offered load:

    T_i^j = (1 − P_{block,i}^j) Σ_{l=1}^{n_i} λ_{il}^j

    Therefore, the throughput contributed by the j-th router is:

    T^j = Σ_i (1 − P_{block,i}^j) Σ_{l=1}^{n_i} λ_{il}^j

    The network throughput, where N is the number of routers, is:

    T_Net = Σ_{j=1}^{N} T^j = Σ_{j=1}^{N} Σ_i (1 − P_{block,i}^j) Σ_{l=1}^{n_i} λ_{il}^j
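The port, router, and network throughput sums can be sketched directly from T_i^j = (1 − P_block,i^j) Σ_l λ_il^j; the router and port shapes below are illustrative:

```python
def port_throughput(lambdas, p_block):
    """Throughput delivered by one port: accepted fraction of offered load."""
    return (1.0 - p_block) * sum(lambdas)

def network_throughput(routers):
    """Sum port throughputs over all ports of all routers.

    routers: list of routers; each router is a list of (lambdas, p_block)
             tuples, one per input port.
    """
    return sum(port_throughput(lams, pb)
               for ports in routers for lams, pb in ports)

routers = [
    [([1.0, 2.0], 0.1), ([0.5], 0.0)],   # router 1: two input ports
    [([2.0], 0.25)],                     # router 2: one input port
]
print(network_throughput(routers))
```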


    Throughput aware mapping (Contd)

Since:

    T_Net = Σ_{j=1}^{N} Σ_i (1 − P_{block,i}^j) Σ_{l=1}^{n_i} λ_{il}^j

    is a function of the allocation scheme of IPs onto the architecture, an optimal mapping of a given application onto the architecture must be worked out to maximize the network throughput.


    Throughput aware Mapping: Mathematical formula

Mapping with max-throughput criteria:

    Definition of mapping:

    map: G(V, A) → G(U, P), map(v_i) = u_j

    s.t. ∀ v_i ∈ V: map(v_i) ∈ U, and ∀ v_i ≠ v_j ∈ V: map(v_i) ≠ map(v_j)

    This means that each IP is mapped to exactly one node of the NoC topology, and no node can host more than one IP core.

    The cost function is the network throughput:

    T_Net = Σ_{j=1}^{N} Σ_i (1 − P_{block,i}^j) Σ_{l=1}^{n_i} λ_{il}^j

    Max-throughput criteria: find a map: G(V, A) → G(U, P) that maximizes T_Net.


    Part III: Energy consumption of NoC architectures


    Energy and area: calculation and parameters

0.1 um CMOS technology, Vdd = 1.2 V

    Router configurations: 3x3, 4x4, 5x5, 7x7

    Energy of a CMOS circuit: E = (1/2) C V_dd^2

    Since P = f_clk E, then P = (1/2) f_clk V_dd^2 C.

    Router energy model: E = E_xbar + E_arb + E_bufrd + E_bufwrt

    Router power model:

    P = f_clk (E_xbar + E_arb + E_bufrd + E_bufwrt) = (1/2) f_clk V_dd^2 (C_xbar + C_arb + C_bufrd + C_bufwrt)

    (Figure: a p x p router — p input buffers feeding a crossbar switch with an arbiter and p output ports.)


    Energy and area: calculation and parameters (Contd)

For an N x N 2-D Mesh homogeneous architecture:

    P_Net = f_clk H_avg (E_xbar + E_arb + E_bufrd + E_bufwrt) = (1/2) f_clk V_dd^2 H_avg (C_xbar + C_arb + C_bufrd + C_bufwrt)

    Since the average hop count is H_avg = 2N/3, then:

    P_Net = (N/3) f_clk V_dd^2 (C_xbar + C_arb + C_bufrd + C_bufwrt)
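The homogeneous-mesh power model reduces to a single expression once H_avg = 2N/3 is substituted. A sketch with illustrative capacitance and clock values:

```python
def mesh_network_power(N, f_clk, Vdd, C_xbar, C_arb, C_bufrd, C_bufwrt):
    """P_Net for an N x N 2-D mesh with average hop count H_avg = 2N/3:

    P_Net = 0.5 * f_clk * Vdd^2 * (2N/3) * (C_xbar + C_arb + C_bufrd + C_bufwrt)
          = (N/3) * f_clk * Vdd^2 * (C_xbar + C_arb + C_bufrd + C_bufwrt)
    """
    C_total = C_xbar + C_arb + C_bufrd + C_bufwrt
    return (N / 3.0) * f_clk * Vdd**2 * C_total

# 4x4 mesh, 1.2 V, 500 MHz clock; capacitances in farads (illustrative).
p_net = mesh_network_power(N=4, f_clk=500e6, Vdd=1.2,
                           C_xbar=1e-12, C_arb=0.2e-12,
                           C_bufrd=0.5e-12, C_bufwrt=0.5e-12)
print(p_net)
```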


    Power and area: calculation and parameters (Contd)

For a heterogeneous architecture, the power consumed by one flit traversing the route R_i is:

    P_flit^i = Σ_j f_clk (E_xbar^j + E_arb^j + E_bufrd^j + E_bufwrt^j) σ_ij

    Where σ_ij = 1 if the j-th router ∈ R_i, 0 otherwise.

    Hence:

    P_Net = Σ_i P_flit^i = Σ_i Σ_j f_clk (E_xbar^j + E_arb^j + E_bufrd^j + E_bufwrt^j) σ_ij


    Energy estimation

Bit energy estimation flow: an information bit is fed to the Orion power model (router power model plus interconnection power model) together with a scaling factor for the given CMOS technology, yielding the bit energy consumption, which is then used for the system energy estimation.

    Energy estimation based on throughput:

    (Figure: total throughput (Mbps) over time for the SFQ, DRR, RED, and DropTail (FIFO) queuing schemes, random mapping example.)

    H. S. Hwang et al., "Orion: A Power Performance Simulator for Interconnection Networks," IEEE Micro, Nov. 2002.


    Part IV: Experiment results

    1. Throughput aware mapping: 2D Mesh,

    Fat-Tree architectures for H.264 design

    2. Throughput aware mapping: irregular

    architectures for VOPD design


H.264 video decoder's data transaction table

    Assume data transactions follow a Poisson distribution.

    Throughput aware mapping for: 2-D Mesh, Fat-Tree


    Wormhole router architecture

p x p wormhole router:

    p in/out ports

    Single switching plane

    One input buffer for each input

    Ex.: a 2-D Mesh uses a 5x5 router

    (Figure: p input buffers feeding a crossbar switch with an arbiter and p output ports.)


    Simulation parameters

For all topologies:

    Routing scheme: Shortest Path
    Queuing scheme: DropTail
    Buffer size: 4 packets
    Packet size: 64 bytes
    Flit size: 128 bits


    Throughput comparison

    Throughput of five topologies


    Topology comparison

(Figures: topology size and topology energy.)


    VOPD on 5 topologies


VOPD's data transaction table

    Assume data transactions follow a Poisson distribution.

    Throughput aware mapping for: 2-D Mesh, Fat-Tree, and 3 custom topologies


First custom topology (3 Xbar)

    First custom topology:

    (a) NAM output of the first topology; (b) VOPD on the first topology

    (Topology: a 6x6 wormhole router connecting two 5x5 wormhole routers (1st, 2nd); IPs: VMEM, UPSAM, VRecst, PAD, IQua, AC/DC, IDCT, Invers, ARM, Str_MEM, Run_Length, VarLen.)


Second custom topology (4 Xbar)

    (a) NAM output of the second topology; (b) the H.264 decoder on the second topology

    (Topology: three 5x5 wormhole routers (1st, 2nd, 3rd) and one 3x3 wormhole router; IPs: MC, DB, DMA, FR_MEM, ITIQ, LENT, VOM, REC, MVMVD, PROC, IPRED, IS.)


Third custom topology (5 Xbar)

    (a) NAM output of the third topology; (b) VOPD on the third topology

    (Topology: a 6x6 wormhole router, a 5x5 wormhole router, and three 3x3 wormhole routers (1st, 2nd, 3rd); IPs: VMEM, UPSAM, VRecst, PAD, IQua, AC/DC, IDCT, Invers, ARM, Str_MEM, Run_Length, VarLen.)


    Simulation parameters

For all topologies:

    Routing scheme: Shortest Path
    Queuing scheme: DropTail
    Buffer size: 4 packets
    Packet size: 64 bytes
    Flit size: 128 bits


    Throughput comparison

    Throughput of five topologies


Result Discussion: Terms of Throughput

    Best topology: Fat-Tree

    Worst topology: 2D Mesh — lowest aggregate throughput with a high hardware overhead from unused switches

    The Fat-Tree offers nearly the same throughput as the first custom topology, but at a large hardware overhead.


    Power and area of Router


    Wire and energy dissipation

Wire dimension vs. capacitance (0.10 um technology):

    Ldrawn/Tech: 0.10 um
    Capacitance (fF/um): 335

    Wire dimension vs. chip edge (0.10 um technology).

    Energy: E_wire = (1/2) C_wire V_dd^2

    R. Ho, et al, "The future of wires," Proceedings of the IEEE, pp. 490-504, April 2001.


Topology comparison (Contd)

    Conclusion:

    The 1st custom topology consumes the least power; its wire energy is insignificant because of its simple interconnections.

    The Fat-Tree consumes the most power and has the largest size; its wire energy dissipation is significant due to its complex interconnections.

    Custom Topologies comparison (Random map vs. optimal map)


(Figures: comparison in terms of throughput and in terms of energy consumption.)

    Custom Topologies comparison (Random map vs. optimal map) (Contd)


Discussion:

    Optimal maps offer not only better throughput but also lower energy consumption.

    No ARQ scheme was implemented.

    If ARQ were used, the throughput would stay the same or even increase, but more energy would be consumed retransmitting dropped packets (future work).

    Conclusions


Heterogeneous NoC architectures are considered for design based on the latency criteria.

    The latency of a heterogeneous NoC architecture, in terms of router and wire latency, is fully formulated.

    A Branch and Bound algorithm is adopted to automatically map the IPs onto the NoC architectures under the optimal-latency metric.

    Experiments on various sizes of Mesh and Fat-Tree architectures for the OCMN application and H.264 are carried out.

    The latency of the optimal mappings is significantly reduced.

    Conclusions (Contd)


Heterogeneous NoC architectures are considered for design based on the maximum-throughput criteria.

    The throughput of a heterogeneous NoC architecture is formulated.

    A Branch and Bound algorithm is adopted to automatically map the IPs onto the NoC architectures to obtain maximal throughput.

    Experiments on various sizes of Mesh, Fat-Tree, and tree-based architectures for VOPD and H.264 are carried out.

    The heterogeneous bit power model is applied to accurately obtain the energy consumption and area of the architectures.

    Future works


Modeling the architecture with the general-distribution queuing model G/G/1 (Appendix I)

    Realization of a multi-layer router for NoC design (Appendix II):

    Performance comparison with the current router model

    Power consumption comparison with the current router model

    Variation of the number of switching planes and the number of virtual channels will be considered.

    NoC emulation (Appendix III)

    Global optimization over the two criteria of latency and throughput

    ARQ implementation with power measurement for the throughput aware mapping scheme

    Publication list


    International Journals

1. Vu-Duc Ngo, Huy-Nam Nguyen, Hae-Wook Choi, "Analyzing the Performance of Mesh and Fat-Tree topologies for Network on Chip design", LNCS (Springer-Verlag), Vol. 3824/2005, pp. 300-310, Dec 2005.

    2. Vu-Duc Ngo, Huy-Nam Nguyen, Hae-Wook Choi, "Designing On-Chip Network based on optimal latency criteria", LNCS (Springer-Verlag), Vol. 3820/2005, pp. 287-298, Dec 2005.

    3. Huy-Nam Nguyen, Vu-Duc Ngo, Hae-Wook Choi, "Realization of Video Object Plane Decoder on On-Chip-Network Architecture", LNCS (Springer-Verlag), Vol. 3820/2005, pp. 256-264, Dec 2005.

    4. Vu-Duc Ngo, Hae-Wook Choi and Sin-Chong Park, "An Expurgated Union Bound for Space-Time Code Systems", LNCS (Springer-Verlag), Vol. 3124/2004, pp. 156-162, July 2004.

    5. Vu-Duc Ngo, Huy-Nam Nguyen, Hae-Wook Choi, "The Optimum Network on Chip Architectures for Video Decoder Applications Design" (to be submitted to ETRI Journal).

    6. Vu-Duc Ngo, Huy-Nam Nguyen, Hae-Wook Choi, "Throughput Aware Mapping for NoC design" (to be submitted to IEE Electronics Letters).


    References


1. L. Benini and G. De Micheli, "Networks On Chips: A new SoC paradigm", IEEE Computer, Jan. 2002.

    2. A. Agarwal, "Limit on interconnection network performance", IEEE Transactions on Parallel and Distributed Systems, Volume 2, Issue 4, Oct. 1991, pp. 398-412.

    3. T. Ye, L. Benini and G. De Micheli, "Packetization and Routing Analysis of On-Chip MultiProcessor Networks", Journal of System Architecture, Vol. 50, February 2004, pp. 81-104.

    4. M. A. El-Moursy and E. G. Friedman, "Design Methodologies For On-Chip Inductive Interconnection", Chapter 4, Interconnect-Centric Design For Advanced SoC and NoC, Kluwer Academic Publishers, 2004.

    5. R. Ho, et al, "The future of wires," Proceedings of the IEEE, pp. 490-504, April 2001.

    6. J. Hu, R. Marculescu, "Exploiting the Routing Flexibility for Energy/Performance Aware Mapping of Regular NoC Architectures", in Proc. Design, Automation and Test in Europe Conf., March 2003.

    7. J. Hu, R. Marculescu, "Energy-Aware Communication and Task Scheduling for Network-on-Chip Architectures under Real-Time Constraints", in Proc. Design, Automation and Test in Europe Conf., Feb. 2004.

    8. S. Murali and G. De Micheli, "Bandwidth-Constrained Mapping of Cores onto NoC Architectures", DATE, International Conference on Design and Test Europe, 2004, pp. 896-901.

    9. T. Tao Ye, L. Benini, G. De Micheli, "Packetization and Routing for On-Chip Communication Networks," Journal of System Architecture, special issue on Networks-on-Chip.

    10. T. H. Cormen, et al, "Introduction to algorithms," Second Edition, The MIT Press, 2001.

    11. D. Bertozzi, L. Benini and G. De Micheli, "Network on Chip Design for Gigascale Systems on Chips", in R. Zurawski, Editor, Industrial Technology Handbook, CRC Press, 2004, pp. 95.1-95.18.

    References (Contd)


12. L. Benini and G. De Micheli, "Networks on Chip: A new Paradigm for component based MPSoC Design," in A. Jerraya and W. Wolf, Editors, "Multiprocessor Systems on Chips", Morgan Kaufmann, 2004, pp. 49-80.

    13. D. Bertsekas and R. Gallager, "Data Networks," Chapter 5, Second Edition, Prentice-Hall, Inc., 1992.

    14. A. Jalabert, S. Murali, L. Benini, G. De Micheli, "xpipesCompiler: A Tool for instantiating application specific Networks on Chip", Proc. DATE 2004.

    15. M. Dall'Osso, G. Biccari, L. Giovannini, D. Bertozzi, L. Benini, "Xpipes: a latency insensitive parameterized network-on-chip architecture for multiprocessor SoCs", 21st International Conference on Computer Design, Oct. 2003, pp. 536-539.

    16. C. E. Leiserson, "Fat Trees: Universal networks for hardware efficient supercomputing," IEEE Transactions on Computers, C-34, pp. 892-901, Oct. 1985.

    17. H. S. Hwang et al., "Orion: A Power Performance Simulator for Interconnection Networks," IEEE Micro, Nov. 2002.

    18. W. J. Dally and B. Towles, "Route Packets, Not Wires: On-Chip Interconnection Networks," DAC, pp. 684-689, 2001.

    19. N. Eisley and L.-S. Peh, "High-level power analysis for on-chip networks," in Proceedings of the 7th International Conference on Compilers, Architectures and Synthesis for Embedded Systems (CASES), September 2004.

    20. J. Nurmi, "Network-on-Chip: A New Paradigm for System-on-Chip Design," Proceedings of the International Symposium on System-on-Chip, Nov. 2005.

    21. L. Kleinrock, Queuing Systems, Volume 1: Theory, Wiley, New York, 1975.

    Appendix I: Latency of G/G/1 queuing model


The latency of a general queuing model is:

    T = W + 1/μ

    Where:

    1/μ: mean processing time
    W: waiting time in the buffer

    For G/G/1 queuing (applied to generally distributed data transactions), from [21] the waiting time is:

    W = λ (σ_{X1}^2 + σ_{X2}^2) / (2 (1 − λ/μ))

    [21] L. Kleinrock, Queuing Systems, Volume 1: Theory, Wiley, New York, 1975.
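The G/G/1 waiting-time expression can be sketched as a small function; the variances and rates below are illustrative:

```python
def gg1_latency(lam, mu, var_interarrival, var_service):
    """G/G/1 node latency: T = W + 1/mu with the waiting-time estimate
    W = lam * (sigma_X1^2 + sigma_X2^2) / (2 * (1 - lam/mu))."""
    rho = lam / mu
    assert rho < 1, "unstable queue"
    W = lam * (var_interarrival + var_service) / (2.0 * (1.0 - rho))
    return W + 1.0 / mu

# Exponential interarrival (var = 1/lam^2 = 0.25) and service (var = 1/mu^2
# = 0.04); this estimate overstates the exact M/M/1 latency 1/(mu - lam),
# as expected of an upper bound on the waiting time.
T_node = gg1_latency(lam=2.0, mu=5.0, var_interarrival=0.25, var_service=0.04)
print(T_node)
```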

    Appendix I : Latency of G/G/1 queuing model (Contd)


Where:

    λ: mean value of the arrival rate
    σ_{X1}^2 = Var(X1) = E[X1^2] − (E[X1])^2 (interarrival time X1)
    σ_{X2}^2 = Var(X2) = E[X2^2] − (E[X2])^2 (service time X2)

    Therefore, for a single node the latency is given by:

    T_i = W_i + 1/μ_i = λ_i (σ_{i,X1}^2 + σ_{i,X2}^2) / (2 (1 − λ_i/μ_i)) + 1/μ_i

    For a complex node, into which several independent data streams flow, the aggregate arrival rate is Σ_{j∈C_i} λ_ji over the set C_i of incoming streams, and the node's latency is formulated as:

    T_i = (Σ_{j∈C_i} λ_ji) (σ_{i,X1}^2 + σ_{i,X2}^2) / (2 (1 − (Σ_{j∈C_i} λ_ji)/μ_i)) + 1/μ_i

    Appendix I : Latency of G/G/1 queuing model (Contd)


Because the incoming data streams X_{i1}, X_{i2}, ..., X_{iC_i} at node i are independent, the mean and variance of the aggregate stream are the sums of the individual means and variances, and the aggregate arrival rate is λ_i = Σ_{j=1}^{C_i} λ_ji.

    For a certain k-th route, the latency is:

    T_{R_k} = Σ_i σ_ik T_i = Σ_i σ_ik [ (Σ_{j∈C_i} λ_ji) (σ_{i,X1}^2 + σ_{i,X2}^2) / (2 (1 − (Σ_{j∈C_i} λ_ji)/μ_i)) + 1/μ_i ]

    Where:

    σ_ik = 1 if the i-th node ∈ the k-th route, 0 otherwise

    Appendix I : Latency of G/G/1 queuing model (Contd)


If there are in total m routes, the network latency is:

    T_Net = Σ_{k=1}^{m} T_{R_k} = Σ_{k=1}^{m} Σ_i σ_ik [ (Σ_{j∈C_i} λ_ji) (σ_{i,X1}^2 + σ_{i,X2}^2) / (2 (1 − (Σ_{j∈C_i} λ_ji)/μ_i)) + 1/μ_i ]

    This is a function of the allocation scheme of IPs onto the architecture. The optimization problem then turns out to be: find the mapping of IPs onto routers such that

    T_Opt = min Σ_{k=1}^{m} T_{R_k}

    Appendix I : Latency of G/G/1 queuing model (Contd)


The conditions we assume in order to calculate the cost function and to apply Branch and Bound are:

    The mean processing times are constant

    The mean and variance of the IPs' data rates are known

    With constant (deterministic) processing times the service-time variance σ_{i,X2}^2 vanishes, so the cost function simplifies to:

    T_Opt = min Σ_{k=1}^{m} Σ_i σ_ik [ (Σ_{j∈C_i} λ_ji) σ_{i,X1}^2 / (2 (1 − (Σ_{j∈C_i} λ_ji)/μ_i)) + 1/μ_i ]

    Appendix II: Multi-layer Router


Conventional virtual-channel router: routing computation unit, VC allocator, switch allocator, input units, a single crossbar switch, and output units.

    Multiple switching layer router: the same control path (routing computation unit, VC allocator, switch allocator), with the input and output units connected through two or more crossbar switches (Crossbar Switch 1, Crossbar Switch 2, ...).

    Appendix II: Multi-layer Router (Contd)


Performance analysis:

    Waiting time: W = W_1 + W_2

    Where W_1 and W_2 are closed-form functions of λ, μ_max, n, and m, and:

    n: number of virtual circuits
    m: number of switching planes


    Appendix III: NoC Emulation

    NoC Emulation: Behavior simulation framework


(Framework: a Topology Selector (Mesh, Fat-Tree, Torus, Octagon) and an Optimizer (Branch and Bound) driven by latency and throughput metrics; a Data Transaction Table (Poisson or general distribution) and a Routing Table (Shortest Path) feed the Behavioral Simulation; the traced data and a Bit Energy Model feed the Energy and Area Analysis; a Performance Analyzer reports the latency and throughput metrics.)


    NoC Emulation: Board implementation framework


    Emulation Architecture with Stochastic Data Generator

(Architecture: a Host PC connects to the emulation platform. Each Network Interface (NI) contains a Data Generator, a Data Receiver, and an OCP Interface, and attaches to the Switch & Routing Table. A Scheduler and a Controller connect to all NIs. A PowerPC with an OPB-to-IB Bridge and a MEM block completes the platform; C code runs on the PowerPC, and the hardware is synthesized from Verilog.)

    Controller: switches the data distribution and the data generators' mode.

    Scheduler: schedules all Data Generators.

    MEM: stores the Data Receiver's data for post-processing on the Host.


    NoC Emulation: Board implementation framework (Contd)


    Emulation Architecture with Real Data Generator

(Architecture: as before, a Host PC connects to the emulation platform with a Switch & Routing Table, a Scheduler, a Controller, and a PowerPC with an OPB-to-IB Bridge; C code runs on the PowerPC and the hardware is synthesized from Verilog. Each Network Interface (NI) now contains a Tx.MEM and an Rx.MEM behind its OCP Interface in place of the stochastic data generator.)


A given combination of Tx.MEM and Rx.MEM plays the role of a soft IP with its own data transactions.

    The combination of Tx.MEM and Rx.MEM is associated with a certain NI by the Optimizer and controlled by the Controller.

    The NI acts as the packetizer (supports BE and GT traffic).

    The transmitted data is read out of Tx.MEM following the given data-transaction timing diagram, scheduled by the Scheduler.

    NoC Emulation: Emulation Board


Virtex-II Pro Based Processor Board

    Virtex-II Pro XC2VP100:

    Total slices: 44,096

    Primitive design element — Double Plane VC Wormhole Router: 4,000 slices (9%)

    Benchmark: e.g., the H.264 decoder

    12 IPs, 16 routers

    Design partition: 6 IPs and 8 routers per FPGA