Thiet Ke He Thong Nhung (Embedded Systems Design)


  • 8/12/2019 Thiet Ke He Thong Nhung

    1/75

    1

    Thesis proposal: Throughput and

    Latency aware mapping of NoC

    Aug 17, 2006

    Vu-Duc Ngo

    System VLSI Lab

    System Integration Technology Institute


    Contents

    Related works

    Part I: Latency aware mapping of NoC architectures

    Part II: Throughput aware mapping of NoC architectures

    Part III: Energy consumption of NoC architectures

    Part IV: Experiment results

    Study case of H.264 video decoder

    Used architectures: 2-D Mesh, Fat-Tree

    Study case of Video Object Plane Decoder

    Used architectures: 2-D Mesh, Fat-Tree, Custom topologies

    Future works

    Publication list

    References

    Appendix for detailed future works

Appendix I: G/G/1 Queuing: Theoretical Approach

    Appendix II: Double Plane VC Wormhole Router

    Appendix III: NoC Emulation


    Related works

    Energy aware mapping schemes:

Proposed by: the De Micheli (Stanford Univ.) and R. Marculescu (CMU) research groups

    Addressed the issue of minimizing the power consumption of NoC

    architectures

    Did not address the currently hot issue of QoS such as:

    Throughput guarantee

    Latency guarantee

    Did not consider:

Drop of packets inside the network, which is inherent to a packet-based switching network

    The power consumption was simulated with the homogeneous bit energy

    model


Proposed mapping scheme: Latency and Throughput aware mapping

    Issue raising:

- The QoS issue, currently a hot topic in NoC design, was addressed by J. Nurmi at the SoC05 conference.

    - It was also strongly mentioned by A. Ivanov and De Micheli (IEEE Design and Test, Aug 2005) as an important design criterion for future NoCs.

    - It will be the main theme of the 1st IEEE NoC symposium 2007.

    Our works:

- Find a mapping scheme that:

    1. Minimizes the architecture's latency

    2. Maximizes the architecture's throughput

    3. Calculates the corresponding size and power consumption


    Part I: Latency aware mapping of NoC


    Latency Optimal Mapping: Introduction

    Latency:

IPs and the NoC architecture are heterogeneous

    Question: Which switching core should each IP core be mounted onto in order to minimize the network latency?

    Issues of mapping IPs onto NoC architectures:

    For each mapping scheme:

    The routing table of the applied routing algorithm changes with the mapping of IPs onto the pre-selected NoC architecture.

    The queuing latency changes according to the content of the routing table.


    Latency Optimal Mapping: Introduction (Contd)

    Solution:

Assume data transactions have a Poisson distribution (general distributions will be studied in future work)

    Use the M/M/1 queuing model to analyze the latency

    We utilize a spanning-tree search algorithm to:

    Automatically map the desired IPs onto the NoC architecture

    Guarantee that the network latency is minimized

    Reduce the search complexity of the optimum allocation scheme


    Latency Optimal Mapping: M/M/1 queuing model

Let the arrival rate of packets at the node be λ, and let the number of packets at the node be N.

    The node latency T is related to N by Little's theorem:

    N = λT

    (Figure: a single network node — packets arrive at a buffer; the switching core acts as the server for the attached IP core.)


    Latency Optimal Mapping: M/M/1 queuing model (Contd)

M/M/1 (Contd):

    Since N = ρ / (1 − ρ) with ρ = λ/μ, Little's theorem gives the time one packet spends in one node:

    T = N/λ = 1 / (μ − λ)

    The time one packet spends in the buffer is W = T − 1/μ, where 1/μ is the mean processing time.

    The number of packets in the buffer is N_Q = λW.
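The M/M/1 relations above (Little's theorem, node latency, buffer waiting time) can be checked numerically. A minimal sketch; the rates are illustrative, not taken from the slides:

```python
def mm1_metrics(lam, mu):
    """Steady-state M/M/1 metrics for arrival rate lam and service rate mu.

    Requires lam < mu for stability.
    """
    assert lam < mu, "queue is unstable"
    rho = lam / mu            # utilization
    N = rho / (1 - rho)       # mean number of packets in the node
    T = 1 / (mu - lam)        # mean time in the node (Little: N = lam * T)
    W = T - 1 / mu            # mean waiting time in the buffer
    Nq = lam * W              # mean number of packets in the buffer
    return N, T, W, Nq

N, T, W, Nq = mm1_metrics(lam=2.0, mu=5.0)
print(N, T, W, Nq)
```

Little's theorem holds both for the whole node (N = λT) and for the buffer alone (N_Q = λW), which makes a handy sanity check.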


    Latency Optimal Mapping: Queuing latency in complex network

Network topology:

    For each i-th stream: N_i = λ_i T_i.

    Since the streams are i.i.d. and the Poisson (Markovian) property is preserved under superposition, the aggregate arrival rate at the j-th node is Σ_{i∈C_j} λ_i, where C_j is the set of incoming streams toward node j, and N_j = Σ_{i∈C_j} N_i.

    By Little's theorem, the queuing latency of the j-th node is:

    T_j = 1 / (μ_j − Σ_{i∈C_j} λ_i)


    Latency Optimal Mapping: Queuing latency in complex network (Contd)

Thus, the latency of the k-th route R_k is:

    T_Queue^{R_k} = Σ_j σ_{kj} T_j = Σ_j σ_{kj} / (μ_j − Σ_{i∈C_j} λ_i)

    Where:

    σ_{kj} = 1 if the j-th node ∈ R_k, 0 otherwise

    The network latency in terms of queuing latency is given by:

    Σ_{k=1}^{m} Σ_j σ_{kj} / (μ_j − Σ_{i∈C_j} λ_i)

    Where m is the number of routes in the routing table.
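The per-route and network sums above can be sketched directly, treating each node as an M/M/1 queue whose aggregate arrival rate is the sum over its incoming streams. The node names, rates, and routing table below are illustrative:

```python
def node_latency(mu_j, lambdas_j):
    """M/M/1 latency of one node: 1 / (mu_j - sum of incoming arrival rates)."""
    agg = sum(lambdas_j)
    assert agg < mu_j, "node overloaded"
    return 1.0 / (mu_j - agg)

def network_queuing_latency(routes, mu, incoming):
    """Sum of per-node latencies over every route in the routing table.

    routes:   list of routes, each a list of node ids (the sigma_kj indicator)
    mu:       dict node id -> service rate
    incoming: dict node id -> arrival rates of the streams entering it
    """
    return sum(node_latency(mu[j], incoming[j])
               for route in routes for j in route)

# Two routes over three nodes.
routes = [["a", "b"], ["b", "c"]]
mu = {"a": 4.0, "b": 6.0, "c": 5.0}
incoming = {"a": [1.0], "b": [1.0, 2.0], "c": [2.0]}
print(network_queuing_latency(routes, mu, incoming))
```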


Wire latency

    If we take into account the differences among the wires, then, for a wire modeled as an RLC line of length l and width profile W(x) driving a load C_load, the wire delay is:

    T_w = ∫_0^l (L_0 / W(x)) ∫_0^x (C_0 W(y) + C_f) dy dx

    Furthermore, the wire inductance and capacitance can be written in terms of the wire width:

    L(W) = L_0 / W(x),  C(W) = C_0 W(x) + C_f

    Where:

    L_0: wire inductance per square
    C_0: wire capacitance per unit area
    C_f: fringing capacitance per unit length

    M. A. El-Moursy and E. G. Friedman, "Design Methodologies For On-Chip Inductive Interconnection", Chapter 4, Interconnect-Centric Design For Advanced SoC and NoC, Kluwer Academic Publishers, 2004.
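The double integral for the wire delay can be evaluated numerically for any width profile W(x). A minimal midpoint-rule sketch; the constants are illustrative, not process data from the slides:

```python
def wire_delay(l, W, L0, C0, Cf, steps=2000):
    """T_w = integral_0^l (L0 / W(x)) * integral_0^x (C0*W(y) + Cf) dy dx,
    evaluated with the midpoint rule; W is a function of position."""
    h = l / steps
    inner = 0.0       # running value of integral_0^x (C0*W(y) + Cf) dy
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h
        inner += (C0 * W(x) + Cf) * h
        total += (L0 / W(x)) * inner * h
    return total

# Uniform-width wire: the closed form is (L0/W) * (C0*W + Cf) * l**2 / 2.
T = wire_delay(l=100.0, W=lambda x: 1.0, L0=0.5, C0=0.2, Cf=0.05)
print(T)
```

For the uniform case the closed form gives 625.0, so the numerical result should agree to within the discretization error.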


Wire latency (Contd)

    The latency of the i-th route R_i in terms of wire latency is:

    T_Wire^{R_i} = Σ_j σ_{ij} ∫_0^{l_j} (L_0 / W(x)) ∫_0^x (C_0 W(y) + C_f) dy dx

    Where σ_{ij} = 1 if the j-th node ∈ R_i, 0 otherwise.

    The network latency in terms of wire latency is:

    Σ_{i=1}^{m} Σ_j σ_{ij} ∫_0^{l_j} (L_0 / W(x)) ∫_0^x (C_0 W(y) + C_f) dy dx

    Where m is the number of routes in the routing table.


    Network latency

Considering that the shortest-path routing algorithm is applied:

    For a given application there is a certain routing table. The average latency is:

    T_{Aver-Lat} = (1/m) Σ_{k=1}^{m} Σ_j σ_{kj} [ 1 / (μ_j − Σ_{i∈C_j} λ_i) + ∫_0^{l_j} (L_0 / W(x)) ∫_0^x (C_0 W(y) + C_f) dy dx ]

    Where:

    σ_{kj} = 1 if the j-th node ∈ route k, 0 otherwise
    m: the number of routes in the routing table


    Latency Optimal Mapping: Problem statement

Since the arrival rates λ_i change according to the status of the network:

    Routing (predetermined connections of IPs)

    Congestion

    Accumulation of arrival rates

    However, the service rates μ_i are fixed by the predetermined design of the switching nodes.

    Therefore, an optimum mapping should be found that minimizes the system latency for:

    A concrete practical application, e.g., the H.264 video decoder or VOPD.


    Latency Optimal Mapping: Graph definitions

Graph characterizations:

    IIG (IPs Implementation Graph) G(V, A): a directed graph whose vertices v_i ∈ V are the IP cores, annotated with a_i ∈ A, the arrival rate λ_i of v_i.

    SAG (Switching Architecture Graph) G(U, P): a directed graph whose vertices u_i ∈ U are the nodes of the NoC topology, annotated with p_i ∈ P, the mean processing time 1/μ_i of u_i.

    (Figures: an IIG with vertices V1..V10 and a 4x4 SAG with nodes U1..U16.)


    Latency Optimal Mapping: Mathematical formula

Mapping with min-latency criteria:

    Definition of mapping:

    map: G(V, A) → G(U, P), map(v_i) = u_j

    s.t. ∀ v_i ∈ V: map(v_i) ∈ U, and ∀ v_i ≠ v_j ∈ V: map(v_i) ≠ map(v_j)

    This means that each IP is mapped to exactly one node of the NoC topology, and no node can host more than one IP core.

    The cost function is the average latency:

    T_{Aver-Lat} = (1/m) Σ_{k=1}^{m} Σ_j σ_{kj} [ 1 / (μ_j − Σ_{i∈C_j} λ_i) + ∫_0^{l_j} (L_0 / W(x)) ∫_0^x (C_0 W(y) + C_f) dy dx ]

    Min-latency criteria: find a map: G(V, A) → G(U, P) that minimizes T_{Aver-Lat}.
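The min-latency mapping problem is an injective assignment of IPs to switching nodes minimizing the cost function. A brute-force sketch over a toy instance (the thesis prunes the same search space with spanning-tree / Branch and Bound search; the cost function and sizes here are illustrative):

```python
from itertools import permutations

def best_mapping(ips, nodes, cost):
    """Assign each IP to exactly one node (injective map) minimizing cost(map).

    Exhaustive search over all injective assignments; Branch and Bound
    prunes this same space without changing the optimum.
    """
    best, best_cost = None, float("inf")
    for placed in permutations(nodes, len(ips)):
        m = dict(zip(ips, placed))
        c = cost(m)
        if c < best_cost:
            best, best_cost = m, c
    return best, best_cost

# Toy cost: latency grows with the index of the node an IP sits on,
# weighted by that IP's arrival rate (names are illustrative).
lam = {"DB": 3.0, "MC": 1.0, "ITIQ": 2.0}
cost = lambda m: sum(lam[ip] * (node + 1) for ip, node in m.items())
mapping, c = best_mapping(list(lam), nodes=range(4), cost=cost)
print(mapping, c)
```

As expected, the optimum places the heaviest stream (DB) on the cheapest node.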


    Latency Optimal Mapping: Mapping example

Mapping example:

    Solution: using spanning-tree search.

    (Figure: NoC architecture graph (SAG).)


    Example of On-Chip Multiprocessors Network (OCMN)

(Figures: Mesh architectures and Fat-Tree architectures.)


    Simulation results: H.264 video decoder on 2-D Mesh

Latency optimal mapping:

    Minimum latency: L_min = 325 μs

    Placement (by row): DB, LENT, MC / VOM, MVMVD, Processor, IPRED / IS, REC, FR_MEM, ITIQ

    Random mapping:

    Random latency: L_Random = 416 μs

    Placement (by row): DB, MC / FR_MEM, ITIQ / LENT, VOM, Processor, REC / MVMVD, IPRED, IS

    (Figures: architecture throughput and energy — aggregate throughput over time for the optimal-latency mapping vs. the random mapping, and energy consumption (J) of the optimal-latency mapping vs. the random mapping.)


    Part II: Throughput aware mapping of NoC

    architectures


    Throughput aware mapping

Since the wormhole router treats data flows at the flit level, each input port is modeled as an M/M/1 queue with a finite buffer of k flits.

    The probability that m flits are present at the i-th port of the j-th router is:

    p_m = (1 − ρ_i^j) (ρ_i^j)^m / (1 − (ρ_i^j)^{k+1}),  with ρ_i^j = (Σ_{l=1}^{n_i} λ_{il}^j) / μ_j

    Where:

    λ_{il}^j, l = 1..n_i: the arrival rates of the data flows entering the i-th port of the j-th router
    n_i: the number of individual data flows entering port i

    Block probability (the buffer of size k is full):

    P_{block,i}^j = p_k = (1 − ρ_i^j) (ρ_i^j)^k / (1 − (ρ_i^j)^{k+1})
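The blocking probability above is the standard finite-buffer M/M/1/k result; a minimal sketch with illustrative flow rates:

```python
def blocking_probability(lambdas, mu, k):
    """P_block for an input port fed by several flows, modeled as M/M/1/k.

    lambdas: arrival rates of the individual flows entering the port
    mu:      service rate of the router port
    k:       buffer size in flits
    """
    rho = sum(lambdas) / mu
    if rho == 1.0:                 # degenerate case: uniform distribution
        return 1.0 / (k + 1)
    # p_m = (1 - rho) * rho**m / (1 - rho**(k+1)); blocking when m == k
    return (1 - rho) * rho**k / (1 - rho**(k + 1))

print(blocking_probability([1.0, 2.0], mu=6.0, k=4))
```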


    Throughput aware mapping (Contd)

The throughput contributed by the i-th port is the accepted fraction of its offered load:

    T_i^j = (1 − P_{block,i}^j) Σ_{l=1}^{n_i} λ_{il}^j

    Therefore, the throughput contributed by the j-th router is:

    T^j = Σ_i (1 − P_{block,i}^j) Σ_{l=1}^{n_i} λ_{il}^j

    The network throughput, where N is the number of routers, is:

    T_Net = Σ_{j=1}^{N} T^j = Σ_{j=1}^{N} Σ_i (1 − P_{block,i}^j) Σ_{l=1}^{n_i} λ_{il}^j
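The port, router, and network throughput sums can be sketched directly from T_i^j = (1 − P_block,i^j) Σ_l λ_il^j; the router and port shapes below are illustrative:

```python
def port_throughput(lambdas, p_block):
    """Throughput delivered by one port: accepted fraction of offered load."""
    return (1.0 - p_block) * sum(lambdas)

def network_throughput(routers):
    """Sum port throughputs over all ports of all routers.

    routers: list of routers; each router is a list of (lambdas, p_block)
             tuples, one per input port.
    """
    return sum(port_throughput(lams, pb)
               for ports in routers for lams, pb in ports)

routers = [
    [([1.0, 2.0], 0.1), ([0.5], 0.0)],   # router 1: two input ports
    [([2.0], 0.25)],                     # router 2: one input port
]
print(network_throughput(routers))
```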


    Throughput aware mapping (Contd)

Since:

    T_Net = Σ_{j=1}^{N} Σ_i (1 − P_{block,i}^j) Σ_{l=1}^{n_i} λ_{il}^j

    is a function of the allocation scheme of IPs onto the architecture, an optimal mapping of a given application onto the architecture must be worked out to maximize the network throughput.


    Throughput aware Mapping: Mathematical formula

Mapping with max-throughput criteria:

    Definition of mapping:

    map: G(V, A) → G(U, P), map(v_i) = u_j

    s.t. ∀ v_i ∈ V: map(v_i) ∈ U, and ∀ v_i ≠ v_j ∈ V: map(v_i) ≠ map(v_j)

    This means that each IP is mapped to exactly one node of the NoC topology, and no node can host more than one IP core.

    The cost function is the network throughput:

    T_Net = Σ_{j=1}^{N} Σ_i (1 − P_{block,i}^j) Σ_{l=1}^{n_i} λ_{il}^j

    Max-throughput criteria: find a map: G(V, A) → G(U, P) that maximizes T_Net.


    Part III: Energy consumption of NoC architectures


    Energy and area: calculation and parameters

0.1 um CMOS technology, Vdd = 1.2 V

    Router configurations: 3x3, 4x4, 5x5, 7x7

    Energy of a CMOS circuit: E = (1/2) C V_dd^2

    Since P = f_clk E, then P = (1/2) f_clk V_dd^2 C.

    Router energy model: E = E_xbar + E_arb + E_bufrd + E_bufwrt

    Router power model:

    P = f_clk (E_xbar + E_arb + E_bufrd + E_bufwrt) = (1/2) f_clk V_dd^2 (C_xbar + C_arb + C_bufrd + C_bufwrt)

    (Figure: a p x p router — p input buffers feeding a crossbar switch with an arbiter and p output ports.)


    Energy and area: calculation and parameters (Contd)

For an N x N 2-D Mesh homogeneous architecture:

    P_Net = f_clk H_avg (E_xbar + E_arb + E_bufrd + E_bufwrt) = (1/2) f_clk V_dd^2 H_avg (C_xbar + C_arb + C_bufrd + C_bufwrt)

    Since the average hop count is H_avg = 2N/3, then:

    P_Net = (N/3) f_clk V_dd^2 (C_xbar + C_arb + C_bufrd + C_bufwrt)
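The homogeneous-mesh power model reduces to a single expression once H_avg = 2N/3 is substituted. A sketch with illustrative capacitance and clock values:

```python
def mesh_network_power(N, f_clk, Vdd, C_xbar, C_arb, C_bufrd, C_bufwrt):
    """P_Net for an N x N 2-D mesh with average hop count H_avg = 2N/3:

    P_Net = 0.5 * f_clk * Vdd^2 * (2N/3) * (C_xbar + C_arb + C_bufrd + C_bufwrt)
          = (N/3) * f_clk * Vdd^2 * (C_xbar + C_arb + C_bufrd + C_bufwrt)
    """
    C_total = C_xbar + C_arb + C_bufrd + C_bufwrt
    return (N / 3.0) * f_clk * Vdd**2 * C_total

# 4x4 mesh, 1.2 V, 500 MHz clock; capacitances in farads (illustrative).
p_net = mesh_network_power(N=4, f_clk=500e6, Vdd=1.2,
                           C_xbar=1e-12, C_arb=0.2e-12,
                           C_bufrd=0.5e-12, C_bufwrt=0.5e-12)
print(p_net)
```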


    Power and area: calculation and parameters (Contd)

For a heterogeneous architecture, the power consumed by one flit traversing the route R_i is:

    P_flit^i = Σ_j f_clk (E_xbar^j + E_arb^j + E_bufrd^j + E_bufwrt^j) σ_ij

    Where σ_ij = 1 if the j-th router ∈ R_i, 0 otherwise.

    Hence:

    P_Net = Σ_i P_flit^i = Σ_i Σ_j f_clk (E_xbar^j + E_arb^j + E_bufrd^j + E_bufwrt^j) σ_ij


    Energy estimation

Bit energy estimation flow: an information bit is fed to the Orion power model (router power model plus interconnection power model) together with a scaling factor for the given CMOS technology, yielding the bit energy consumption, which is then used for the system energy estimation.

    Energy estimation based on throughput:

    (Figure: total throughput (Mbps) over time for the SFQ, DRR, RED, and DropTail (FIFO) queuing schemes, random mapping example.)

    H. S. Hwang et al., "Orion: A Power Performance Simulator for Interconnection Networks," IEEE Micro, Nov. 2002.


    Part IV: Experiment results

    1. Throughput aware mapping: 2D Mesh,

    Fat-Tree architectures for H.264 design

    2. Throughput aware mapping: irregular

    architectures for VOPD design


H.264 video decoder's data transaction table

    Assume data transactions follow a Poisson distribution.

    Throughput aware mapping for: 2-D Mesh, Fat-Tree


    Wormhole router architecture

p x p wormhole router:

    p in/out ports

    Single switching plane

    One input buffer for each input

    Ex.: a 2-D Mesh uses a 5x5 router

    (Figure: p input buffers feeding a crossbar switch with an arbiter and p output ports.)


    Simulation parameters

For all topologies:

    Routing scheme: Shortest Path
    Queuing scheme: DropTail
    Buffer size: 4 packets
    Packet size: 64 bytes
    Flit size: 128 bits


    Throughput comparison

    Throughput of five topologies


    Topology comparison

(Figures: topology size and topology energy.)


    VOPD on 5 topologies


VOPD's data transaction table

    Assume data transactions follow a Poisson distribution.

    Throughput aware mapping for: 2-D Mesh, Fat-Tree, and 3 custom topologies


First custom topology (3 Xbar)

    First custom topology:

    (a) NAM output of the first topology; (b) VOPD on the first topology

    (Topology: a 6x6 wormhole router connecting two 5x5 wormhole routers (1st, 2nd); IPs: VMEM, UPSAM, VRecst, PAD, IQua, AC/DC, IDCT, Invers, ARM, Str_MEM, Run_Length, VarLen.)


Second custom topology (4 Xbar)

    (a) NAM output of the second topology; (b) the H.264 decoder on the second topology

    (Topology: three 5x5 wormhole routers (1st, 2nd, 3rd) and one 3x3 wormhole router; IPs: MC, DB, DMA, FR_MEM, ITIQ, LENT, VOM, REC, MVMVD, PROC, IPRED, IS.)


Third custom topology (5 Xbar)

    (a) NAM output of the third topology; (b) VOPD on the third topology

    (Topology: a 6x6 wormhole router, a 5x5 wormhole router, and three 3x3 wormhole routers (1st, 2nd, 3rd); IPs: VMEM, UPSAM, VRecst, PAD, IQua, AC/DC, IDCT, Invers, ARM, Str_MEM, Run_Length, VarLen.)


    Simulation parameters

For all topologies:

    Routing scheme: Shortest Path
    Queuing scheme: DropTail
    Buffer size: 4 packets
    Packet size: 64 bytes
    Flit size: 128 bits


    Throughput comparison

    Throughput of five topologies


Result Discussion: Terms of Throughput

    Best topology: Fat-Tree

    Worst topology: 2D Mesh — lowest aggregate throughput with a high hardware overhead from unused switches

    The Fat-Tree offers nearly the same throughput as the first custom topology, but at a large hardware overhead.


    Power and area of Router


    Wire and energy dissipation

Wire dimension vs. capacitance (0.10 um technology):

    Ldrawn/Tech: 0.10 um
    Capacitance (fF/um): 335

    Wire dimension vs. chip edge (0.10 um technology).

    Energy: E_wire = (1/2) C_wire V_dd^2

    R. Ho, et al, "The future of wires," Proceedings of the IEEE, pp. 490-504, April 2001.


Topology comparison (Contd)

    Conclusion:

    The 1st custom topology consumes the least power; its wire energy is insignificant because of its simple interconnections.

    The Fat-Tree consumes the most power and has the largest size; its wire energy dissipation is significant due to its complex interconnections.

    Custom Topologies comparison (Random map vs. optimal map)


(Figures: comparison in terms of throughput and in terms of energy consumption.)

    Custom Topologies comparison (Random map vs. optimal map) (Contd)


Discussion:

    Optimal maps offer not only better throughput but also lower energy consumption.

    No ARQ scheme was implemented.

    If ARQ were used, the throughput would stay the same or even increase, but more energy would be consumed retransmitting dropped packets (future work).

    Conclusions


Heterogeneous NoC architectures are considered for design based on the latency criteria.

    The latency of a heterogeneous NoC architecture, in terms of router and wire latency, is fully formulated.

    A Branch and Bound algorithm is adopted to automatically map the IPs onto the NoC architectures under the optimal-latency metric.

    Experiments on various sizes of Mesh and Fat-Tree architectures for the OCMN application and H.264 are carried out.

    The latency of the optimal mappings is significantly reduced.

    Conclusions (Contd)


Heterogeneous NoC architectures are considered for design based on the maximum-throughput criteria.

    The throughput of a heterogeneous NoC architecture is formulated.

    A Branch and Bound algorithm is adopted to automatically map the IPs onto the NoC architectures to obtain maximal throughput.

    Experiments on various sizes of Mesh, Fat-Tree, and tree-based architectures for VOPD and H.264 are carried out.

    The heterogeneous bit power model is applied to accurately obtain the energy consumption and area of the architectures.

    Future works


Modeling the architecture with the general-distribution queuing model G/G/1 (Appendix I)

    Realization of a multi-layer router for NoC design (Appendix II):

    Performance comparison with the current router model

    Power consumption comparison with the current router model

    Variation of the number of switching planes and the number of virtual channels will be considered.

    NoC emulation (Appendix III)

    Global optimization over the two criteria of latency and throughput

    ARQ implementation with power measurement for the throughput aware mapping scheme

    Publication list


    International Journals

1. Vu-Duc Ngo, Huy-Nam Nguyen, Hae-Wook Choi, "Analyzing the Performance of Mesh and Fat-Tree topologies for Network on Chip design", LNCS (Springer-Verlag), Vol. 3824/2005, pp. 300-310, Dec 2005.

    2. Vu-Duc Ngo, Huy-Nam Nguyen, Hae-Wook Choi, "Designing On-Chip Network based on optimal latency criteria", LNCS (Springer-Verlag), Vol. 3820/2005, pp. 287-298, Dec 2005.

    3. Huy-Nam Nguyen, Vu-Duc Ngo, Hae-Wook Choi, "Realization of Video Object Plane Decoder on On-Chip-Network Architecture", LNCS (Springer-Verlag), Vol. 3820/2005, pp. 256-264, Dec 2005.

    4. Vu-Duc Ngo, Hae-Wook Choi and Sin-Chong Park, "An Expurgated Union Bound for Space-Time Code Systems", LNCS (Springer-Verlag), Vol. 3124/2004, pp. 156-162, July 2004.

    5. Vu-Duc Ngo, Huy-Nam Nguyen, Hae-Wook Choi, "The Optimum Network on Chip Architectures for Video Decoder Applications Design" (to be submitted to ETRI Journal).

    6. Vu-Duc Ngo, Huy-Nam Nguyen, Hae-Wook Choi, "Throughput Aware Mapping for NoC design" (to be submitted to IEE Electronics Letters).


    References


1. L. Benini and G. De Micheli, "Networks On Chips: A new SoC paradigm", IEEE Computer, Jan. 2002.

    2. A. Agarwal, "Limit on interconnection network performance", IEEE Transactions on Parallel and Distributed Systems, Volume 2, Issue 4, Oct. 1991, pp. 398-412.

    3. T. Ye, L. Benini and G. De Micheli, "Packetization and Routing Analysis of On-Chip MultiProcessor Networks", Journal of System Architecture, Vol. 50, February 2004, pp. 81-104.

    4. M. A. El-Moursy and E. G. Friedman, "Design Methodologies For On-Chip Inductive Interconnection", Chapter 4, Interconnect-Centric Design For Advanced SoC and NoC, Kluwer Academic Publishers, 2004.

    5. R. Ho, et al, "The future of wires," Proceedings of the IEEE, pp. 490-504, April 2001.

    6. J. Hu, R. Marculescu, "Exploiting the Routing Flexibility for Energy/Performance Aware Mapping of Regular NoC Architectures", in Proc. Design, Automation and Test in Europe Conf., March 2003.

    7. J. Hu, R. Marculescu, "Energy-Aware Communication and Task Scheduling for Network-on-Chip Architectures under Real-Time Constraints", in Proc. Design, Automation and Test in Europe Conf., Feb. 2004.

    8. S. Murali and G. De Micheli, "Bandwidth-Constrained Mapping of Cores onto NoC Architectures", DATE, International Conference on Design and Test Europe, 2004, pp. 896-901.

    9. T. Tao Ye, L. Benini, G. De Micheli, "Packetization and Routing for On-Chip Communication Networks," Journal of System Architecture, special issue on Networks-on-Chip.

    10. T. H. Cormen, et al, "Introduction to algorithms," Second Edition, The MIT Press, 2001.

    11. D. Bertozzi, L. Benini and G. De Micheli, "Network on Chip Design for Gigascale Systems on Chips", in R. Zurawski, Editor, Industrial Technology Handbook, CRC Press, 2004, pp. 95.1-95.18.

    References (Contd)


12. L. Benini and G. De Micheli, "Networks on Chip: A new Paradigm for component based MPSoC Design," in A. Jerraya and W. Wolf, Editors, "Multiprocessor Systems on Chips", Morgan Kaufmann, 2004, pp. 49-80.

    13. D. Bertsekas and R. Gallager, "Data Networks," Chapter 5, Second Edition, Prentice-Hall, Inc., 1992.

    14. A. Jalabert, S. Murali, L. Benini, G. De Micheli, "xpipesCompiler: A Tool for instantiating application specific Networks on Chip", Proc. DATE 2004.

    15. M. Dall'Osso, G. Biccari, L. Giovannini, D. Bertozzi, L. Benini, "Xpipes: a latency insensitive parameterized network-on-chip architecture for multiprocessor SoCs", 21st International Conference on Computer Design, Oct. 2003, pp. 536-539.

    16. C. E. Leiserson, "Fat Trees: Universal networks for hardware efficient supercomputing," IEEE Transactions on Computers, C-34, pp. 892-901, Oct. 1985.

    17. H. S. Hwang et al., "Orion: A Power Performance Simulator for Interconnection Networks," IEEE Micro, Nov. 2002.

    18. W. J. Dally and B. Towles, "Route Packets, Not Wires: On-Chip Interconnection Networks," DAC, pp. 684-689, 2001.

    19. N. Eisley and L.-S. Peh, "High-level power analysis for on-chip networks," in Proceedings of the 7th International Conference on Compilers, Architectures and Synthesis for Embedded Systems (CASES), September 2004.

    20. J. Nurmi, "Network-on-Chip: A New Paradigm for System-on-Chip Design," Proceedings of the International Symposium on System-on-Chip, Nov. 2005.

    21. L. Kleinrock, Queuing Systems, Volume 1: Theory, Wiley, New York, 1975.

    Appendix I: Latency of G/G/1 queuing model


The latency of a general queuing model is:

    T = W + 1/μ

    Where:

    1/μ: mean processing time
    W: waiting time in the buffer

    For G/G/1 queuing (applied to generally distributed data transactions), from [21] the waiting time is:

    W = λ (σ_{X1}^2 + σ_{X2}^2) / (2 (1 − λ/μ))

    [21] L. Kleinrock, Queuing Systems, Volume 1: Theory, Wiley, New York, 1975.
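The G/G/1 waiting-time expression can be sketched as a small function; the variances and rates below are illustrative:

```python
def gg1_latency(lam, mu, var_interarrival, var_service):
    """G/G/1 node latency: T = W + 1/mu with the waiting-time estimate
    W = lam * (sigma_X1^2 + sigma_X2^2) / (2 * (1 - lam/mu))."""
    rho = lam / mu
    assert rho < 1, "unstable queue"
    W = lam * (var_interarrival + var_service) / (2.0 * (1.0 - rho))
    return W + 1.0 / mu

# Exponential interarrival (var = 1/lam^2 = 0.25) and service (var = 1/mu^2
# = 0.04); this estimate overstates the exact M/M/1 latency 1/(mu - lam),
# as expected of an upper bound on the waiting time.
T_node = gg1_latency(lam=2.0, mu=5.0, var_interarrival=0.25, var_service=0.04)
print(T_node)
```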

    Appendix I : Latency of G/G/1 queuing model (Contd)


Where:

    λ: mean value of the arrival rate
    σ_{X1}^2 = Var(X1) = E[X1^2] − (E[X1])^2 (interarrival time X1)
    σ_{X2}^2 = Var(X2) = E[X2^2] − (E[X2])^2 (service time X2)

    Therefore, for a single node the latency is given by:

    T_i = W_i + 1/μ_i = λ_i (σ_{i,X1}^2 + σ_{i,X2}^2) / (2 (1 − λ_i/μ_i)) + 1/μ_i

    For a complex node, into which several independent data streams flow, the aggregate arrival rate is Σ_{j∈C_i} λ_ji over the set C_i of incoming streams, and the node's latency is formulated as:

    T_i = (Σ_{j∈C_i} λ_ji) (σ_{i,X1}^2 + σ_{i,X2}^2) / (2 (1 − (Σ_{j∈C_i} λ_ji)/μ_i)) + 1/μ_i

    Appendix I : Latency of G/G/1 queuing model (Contd)


Because the incoming data streams X_{i1}, X_{i2}, ..., X_{iC_i} at node i are independent, the mean and variance of the aggregate stream are the sums of the individual means and variances, and the aggregate arrival rate is λ_i = Σ_{j=1}^{C_i} λ_ji.

    For a certain k-th route, the latency is:

    T_{R_k} = Σ_i σ_ik T_i = Σ_i σ_ik [ (Σ_{j∈C_i} λ_ji) (σ_{i,X1}^2 + σ_{i,X2}^2) / (2 (1 − (Σ_{j∈C_i} λ_ji)/μ_i)) + 1/μ_i ]

    Where:

    σ_ik = 1 if the i-th node ∈ the k-th route, 0 otherwise

    Appendix I : Latency of G/G/1 queuing model (Contd)


If there are in total m routes, the network latency is:

    T_Net = Σ_{k=1}^{m} T_{R_k} = Σ_{k=1}^{m} Σ_i σ_ik [ (Σ_{j∈C_i} λ_ji) (σ_{i,X1}^2 + σ_{i,X2}^2) / (2 (1 − (Σ_{j∈C_i} λ_ji)/μ_i)) + 1/μ_i ]

    This is a function of the allocation scheme of IPs onto the architecture. The optimization problem then turns out to be: find the mapping of IPs onto routers such that

    T_Opt = min Σ_{k=1}^{m} T_{R_k}

    Appendix I : Latency of G/G/1 queuing model (Contd)


The conditions we assume in order to calculate the cost function and to apply Branch and Bound are:

    The mean processing times are constant

    The mean and variance of the IPs' data rates are known

    With constant (deterministic) processing times the service-time variance σ_{i,X2}^2 vanishes, so the cost function simplifies to:

    T_Opt = min Σ_{k=1}^{m} Σ_i σ_ik [ (Σ_{j∈C_i} λ_ji) σ_{i,X1}^2 / (2 (1 − (Σ_{j∈C_i} λ_ji)/μ_i)) + 1/μ_i ]

    Appendix II: Multi-layer Router


Conventional virtual-channel router: routing computation unit, VC allocator, switch allocator, input units, a single crossbar switch, and output units.

    Multiple switching layer router: the same control path (routing computation unit, VC allocator, switch allocator), with the input and output units connected through two or more crossbar switches (Crossbar Switch 1, Crossbar Switch 2, ...).

    Appendix II: Multi-layer Router (Contd)


Performance analysis:

    Waiting time: W = W_1 + W_2

    Where W_1 and W_2 are closed-form functions of λ, μ_max, n, and m, and:

    n: number of virtual circuits
    m: number of switching planes


    Appendix III: NoC Emulation

    NoC Emulation: Behavior simulation framework


(Framework: a Topology Selector (Mesh, Fat-Tree, Torus, Octagon) and an Optimizer (Branch and Bound) driven by latency and throughput metrics; a Data Transaction Table (Poisson or general distribution) and a Routing Table (Shortest Path) feed the Behavioral Simulation; the traced data and a Bit Energy Model feed the Energy and Area Analysis; a Performance Analyzer reports the latency and throughput metrics.)


    NoC Emulation: Board implementation framework


    Emulation Architecture with Stochastic Data Generator

(Architecture: a Host PC connects to the emulation platform. Each Network Interface (NI) contains a Data Generator, a Data Receiver, and an OCP Interface, and attaches to the Switch & Routing Table. A Scheduler and a Controller connect to all NIs. A PowerPC with an OPB-to-IB Bridge and a MEM block completes the platform; C code runs on the PowerPC, and the hardware is synthesized from Verilog.)

    Controller: switches the data distribution and the data generators' mode.

    Scheduler: schedules all Data Generators.

    MEM: stores the Data Receiver's data for post-processing on the Host.


    NoC Emulation: Board implementation framework (Contd)


    Emulation Architecture with Real Data Generator

(Architecture: as before, a Host PC connects to the emulation platform with a Switch & Routing Table, a Scheduler, a Controller, and a PowerPC with an OPB-to-IB Bridge; C code runs on the PowerPC and the hardware is synthesized from Verilog. Each Network Interface (NI) now contains a Tx.MEM and an Rx.MEM behind its OCP Interface in place of the stochastic data generator.)


A given combination of Tx.MEM and Rx.MEM plays the role of a soft IP with its own data transactions.

    The combination of Tx.MEM and Rx.MEM is associated with a certain NI by the Optimizer and controlled by the Controller.

    The NI acts as the packetizer (supports BE and GT traffic).

    The transmitted data is read out of Tx.MEM following the given data-transaction timing diagram, scheduled by the Scheduler.

    NoC Emulation: Emulation Board


Virtex-II Pro Based Processor Board

    Virtex-II Pro XC2VP100:

    Total slices: 44,096

    Primitive design element — Double Plane VC Wormhole Router: 4,000 slices (9%)

    Benchmark: e.g., the H.264 decoder

    12 IPs, 16 routers

    Design partition: 6 IPs and 8 routers per FPGA