NoC: MPSoC Communication Fabric
Interconnection Networks (ELE 580)
Shougata Ghosh, 20th April 2006
Outline
MPSoC
Network-on-Chip
Synthesis of irregular NoC
OCP
SystemC
Cases: IBM CoreConnect, Sonics Silicon Backplane, CrossBow IPs
What are MPSoCs?
MPSoC – Multiprocessor System-on-Chip
Most SoCs today use multiple processing cores
MPSoCs are characterised by heterogeneous multiprocessors: CPUs, IPs (Intellectual Property blocks), DSP cores, memory, communication handlers (USB, UART, etc.)
Where are MPSoCs used?
Cell phones
Network processors (used by telecom and networking equipment to handle high data rates)
Digital television and set-top boxes
High-definition television
Video games (PlayStation Emotion Engine)
Challenges
All MPSoC designs must balance the following requirements: speed, power, area, application performance, and time to market
Why Reinvent the wheel?
Why not use a uniprocessor (3.4 GHz!!)? PDAs are usually uniprocessor
A uniprocessor cannot keep up with real-time processing requirements; it is too slow for real-time data
Real-time processing requires "real" concurrency
Uniprocessors provide only "apparent" concurrency through multitasking (the OS)
Multiprocessors can provide the concurrency required to handle real-time events
Hence the need for multiple processors
Why not CMPs?
+ CMPs are cheaper (reuse)
+ Easier to program
- Unpredictable delays (e.g. snoopy cache)
- Need buffering to handle the unpredictability
Area concerns
Configured CMPs would have unused resources
Special-purpose PEs don't need to support unwanted processes:
Faster, area efficient, power efficient
Can exploit known memory access patterns: smaller caches (area savings)
MPSoC Architecture
Components
Hardware: multiple processors, non-programmable IPs, memory
Communication interface: interfaces the heterogeneous components to the communication network
Communication network: hierarchical (busses) or NoC
Design Flow
System-level synthesis: top-down approach; a synthesis algorithm produces the SoC architecture + SW model from system-level specs
Platform-based design: starts with a functional system spec + a predesigned platform; functions are mapped and scheduled onto HW/SW
Component-based design: bottom-up approach
Platform-Based Design
Start with the functional spec: task graphs
Task graph: nodes are the tasks to complete; edges capture communication and dependence between tasks
Execution times are annotated on the nodes; data communicated is annotated on the edges
Map tasks onto predesigned HW
Use an extended task graph for SW and communication
Mapping onto HW; a Gantt chart is used for scheduling task execution and timing analysis
Extended task graph: adds communication nodes (reads and writes)
ILP and heuristic algorithms schedule tasks and communication onto HW and SW
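To make the representation concrete, the task graph above could be held in a small C++ structure; this is only an illustrative sketch, and the type and field names are assumptions rather than anything from the original slides.

    // Illustrative sketch only: task graph with execution time on nodes and
    // data volume on edges (names and units are assumptions).
    #include <string>
    #include <vector>

    struct Task {
        std::string name;
        double exec_time_us;        // execution time annotated on the node
    };

    struct Dependence {
        int src_task, dst_task;     // indices into TaskGraph::tasks
        int bytes_communicated;     // data annotated on the edge
    };

    struct TaskGraph {
        std::vector<Task> tasks;        // nodes: tasks to complete
        std::vector<Dependence> edges;  // edges: communication and dependence
    };

A scheduler (ILP-based or heuristic, as on the slide) would then assign each Task to a HW or SW resource and each Dependence to a communication resource, producing the Gantt chart used for timing analysis.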
Component-Based Design
Conceptual MPSoC platform: SW, processor, IP, communication fabric
Parallel development using APIs
Quicker time to market
Design Flow Schematic
Communication Fabric
Has mostly been bus based: IBM CoreConnect, Sonics Silicon Backplane, etc.
Busses are not scalable!! They typically support around 5 processors, rarely more than 10
The number of cores keeps increasing, pushing designs towards NoC
NoC NoC NoC-ing on Heaven's Door!!
Typical Network-on-Chip (Regular)
Regular NoC: an array of tiles
Each tile has an input port (inject into the network) and an output port (receive from the network)
Input port => 256-bit data, 38-bit control
The network handles both static and dynamic traffic
Static: flow of data from a camera to an MPEG encoder
Dynamic: memory requests from a PE (or CPU)
Dedicated VCs are used for static traffic; dynamic traffic goes through arbitration
Control Bits
Control bit fields:
Type (2 bits): Head, Body, Tail, Idle
Size (4 bits): data size from 0 (1 bit) to 8 (256 bits)
VC Mask (8 bits): mask to determine the VC (out of 8); can be used to prioritise
Route (16 bits): source routing
Ready (8 bits): signal from the network indicating it is ready to accept the next flit (why 8?)
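For illustration, the 38 control bits listed above (2 + 4 + 8 + 16 + 8) can be sketched as a packed C++ struct; the layout and field names are assumptions, not taken from the actual design.

    // Sketch of the 38-bit flit control word described above.
    // Field widths follow the slide; layout and names are assumptions.
    #include <cstdint>

    struct FlitControl {
        uint64_t type    : 2;   // Head, Body, Tail, Idle
        uint64_t size    : 4;   // data size: 0 (1 bit) up to 8 (256 bits)
        uint64_t vc_mask : 8;   // which of the 8 VCs may be used (can prioritise)
        uint64_t route   : 16;  // source-routing field
        uint64_t ready   : 8;   // ready signal(s) back from the network
    };                          // 38 bits of payload in total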
Flow Control
Virtual-channel flow control
Router with an input controller and an output controller
The input controller has a buffer and state for each VC
The input controller strips the routing info from the head flit
The flit then arbitrates for an output VC
Each output VC has a buffer for a single flit, used to store a flit waiting for an input buffer in the next hop
Input and Output Controllers
NoC Issues
Basic differences between NoC and inter-chip or inter-board networks:
Wires and pins are ABUNDANT on chip; buffer space is limited
On-chip "pins" per tile could number around 24,000, compared to roughly 1,000 for inter-chip designs
Designers can trade wiring resources for network performance!
Channels: on-chip => ~300 bits wide; inter-chip => 8-16 bits
Topology
The previous design used a folded torus
A folded torus has twice the wire demand and twice the bisection bandwidth of a mesh
It converts plentiful wires into bandwidth (performance)
Not hard to implement on chip; however, it can be more power hungry
Flow Control Decision
Area is scarce in on-chip designs; buffers use up a LOT of area
Flow control schemes with fewer buffers are favourable; however, this must be balanced against performance
Packet-dropping flow control requires the least buffering, but at the expense of performance
Misrouting is an option when there is enough path diversity
High Performance Circuits
Wiring is regular and known at design time, so it can be accurately modeled (R, L, C). This enables:
Low-swing signalling – 100 mV compared to 1 V: HUGE power saving
Overdrive produces roughly 3x the signal velocity of full-swing drivers
Overdrive also increases repeater spacing: again, significant power savings
Heterogeneous NoC
Regular topologies facilitate modular design and are easily scaled up by replication
However, for heterogeneous systems, regular topologies lead to overdesign!!
Heterogeneous NoCs can optimise local bottlenecks
Solution? A complete application-specific NoC synthesis flow with a customised topology and customised NoC building blocks
xPipes Lite
An application-specific NoC library: creates application-specific NoCs
Uses a library of NIs, switches and links
Parameterised library modules optimised for frequency and low latency
Packet-switched communication, source routing, wormhole flow control
Topologies: torus, mesh, B-tree, butterfly
NoC Architecture Block Diagram
xPipes Lite
Uses OCP to communicate with cores
OCP advantages:
Industry-wide standard communication protocol between cores and the NoC
Allows parallel development of cores and NoC, smoother development of modules, faster time to market
xPipes Lite – Network Interface
Bridges the OCP interface and the NoC switching fabric
Functions:
Synchronisation between OCP and xPipes timing
Packeting of OCP transactions into flits
Route calculation
Flit buffering to improve performance
NI
Uses 2 registers to interface with OCP:
Header register to store the address (sent once)
Payload register to store the data (sent multiple times for burst transfers)
Flits are generated from the registers:
Header flit from the header register
Body/payload flits from the payload register
Routing info goes in the header flit; the route is determined from a LUT using the destination address
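A rough sketch of the packeting step just described, in C++; the flit layout, the LUT type and the function signature are assumptions made for illustration.

    // Sketch: build one header flit (with route from the LUT) followed by
    // payload flits, as the NI does from its header and payload registers.
    #include <cstdint>
    #include <map>
    #include <vector>

    enum class FlitType { Head, Body, Tail };
    struct Flit { FlitType type; uint16_t route; uint64_t payload; };

    std::vector<Flit> packetize(uint32_t dest_addr,
                                const std::vector<uint64_t>& burst_data,
                                const std::map<uint32_t, uint16_t>& route_lut) {
        std::vector<Flit> flits;
        // Header flit: from the header register; route looked up by destination.
        flits.push_back({FlitType::Head, route_lut.at(dest_addr), dest_addr});
        // Body flits: one per beat held in the payload register during a burst.
        for (uint64_t beat : burst_data)
            flits.push_back({FlitType::Body, 0, beat});
        if (!burst_data.empty())
            flits.back().type = FlitType::Tail;  // last payload flit closes the packet
        return flits;
    }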
Network Interface
Bidirectional NI
Output stage identical to that of the xPipes switches
Input stage uses dual-flit buffers
Uses the same flow control as the switches
Switch Architecture
The xPipes switch is the basic building block of the switching fabric
2-cycle latency
Output-queued router
Fixed and round-robin priority arbitration on the input lines
Flow control: ACK/NACK with Go-Back-N semantics
CRC for error detection
Switch
The allocator module performs the arbitration for the head flit
The path is then held until the tail flit
The routing info requests the output port
The switch is parameterisable in: number of inputs/outputs, arbitration policy, and output buffer sizes
Switch flow control
An input flit is dropped if:
The requested output port is held by a previous packet
The output buffer is full
It lost the arbitration
A NACK is sent back; all subsequent flits of that packet are dropped until the header flit reappears (Go-Back-N flow control)
The switch updates the routing info for the next switch
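The drop decision above reduces to a simple predicate; a sketch (names are assumptions):

    // Sketch: an input flit is dropped (and a NACK returned) if any of the
    // three conditions from the slide holds.
    bool should_drop_input_flit(bool output_port_held_by_other_packet,
                                bool output_buffer_full,
                                bool lost_arbitration) {
        return output_port_held_by_other_packet || output_buffer_full || lost_arbitration;
    }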
xPipes Lite - Links
The links are pipelined to overcome the interconnect delay problem
xPipes Lite uses shallow pipelines for all modules (NI, switch): low latency, lower buffer requirements, area savings, higher frequency
xPipes Lite Design Flow
Heterogeneous Network
The network was heterogeneous in: switch buffering, input and output ports, arbitration policy, and links
The topology, however, was still regular
Go-Back-N??
Flow and error control "borrowed" from sliding-window flow control
Reject all subsequent flits/packets after a drop
In sliding-window flow control, NACKs are sent with a frame number (N)
The sender has to go back to frame N and resend all the frames from there
Go-Back-N Example
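The example figure is not reproduced here; as a stand-in, a sender-side sketch of the Go-Back-N behaviour described above (the Channel interface is an assumption):

    // Sketch: sender side of Go-Back-N. On a NACK carrying frame number N,
    // transmission restarts from frame N. The Channel interface is assumed.
    #include <cstddef>
    #include <vector>

    struct Frame { int seq; /* payload omitted */ };

    template <typename Channel>
    void send_go_back_n(const std::vector<Frame>& frames, Channel& ch) {
        std::size_t next = 0;                 // next frame to transmit
        while (next < frames.size()) {
            ch.transmit(frames[next]);
            int nacked_seq;
            if (ch.poll_nack(nacked_seq))
                next = static_cast<std::size_t>(nacked_seq);  // go back to frame N
            else
                ++next;                       // no NACK: move on to the next frame
        }
    }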
NoC Synthesis - Netchip
Netchip – a tool to synthesise application-specific NoCs
Uses two tools:
SUNMAP – to generate/select the topology
xPipes Lite – to generate the NIs, switches and links
Netchip
Three phases to generate the NoC:
Topology mapping – SUNMAP (inputs: core graph, area/power libraries, floorplan, topology library)
Topology selection – SUNMAP
NoC generation – xPipes Lite
It is possible to skip phases 1 and 2 and provide a custom topology!
Netchip Design Flow
Core Graph, Topology Graph
Core graph: a directed graph G(V, E)
Each vertex v_i represents a SoC core
Each directed edge e_ij represents communication from vertex v_i to v_j
The weight of edge e_ij represents the bandwidth of the communication from v_i to v_j
NoC topology graph: a directed graph P(U, F)
Each vertex u_i represents a node in the topology
Each directed edge f_ij represents communication from node u_i to u_j
The weight of edge f_ij (denoted bw_ij) represents the bandwidth available across edge f_ij
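A compact way to hold these two graphs in code is sketched below; the type and field names are illustrative assumptions.

    // Sketch: core graph G(V, E) and topology graph P(U, F), both directed
    // graphs with a bandwidth weight on each edge.
    #include <vector>

    struct WeightedEdge {
        int src, dst;        // vertex indices (cores, or topology nodes)
        double bandwidth;    // required BW (core graph) or available BW (topology graph)
    };

    struct DirectedGraph {
        int num_vertices = 0;
        std::vector<WeightedEdge> edges;
    };

The mapping phase then assigns each core-graph vertex to a topology-graph vertex so that every core edge's required bandwidth fits within the bandwidth available along its mapped path.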
Mapping
Uses a minimum-path mapping algorithm to map the cores onto the nodes
This is done for every topology in the topology library
Selection
Torus, mesh: 4 x 3 nodes, 5x5 switches
Butterfly: 4-ary 2-fly, 4x4 switches
What about irregular topologies?
They can be generated using a mixed-integer linear programming formulation
"Linear Programming based Techniques for Synthesis of Network-on-Chip Architectures", K. Srinivasan, K. Chatha and G. Konjevod, ASU
SystemC
A system description language: both a C++ class library and a design methodology
Provides hierarchical design that can address:
High-level abstraction
Low-level logic design
Simulation of software algorithms
SystemC: Language Features
Modules
Basic building blocks of a design hierarchy. A SystemC model usually consists of several modules which communicate via ports
Ports
Communication from inside a module to the outside (usually to other modules)
Processes
The main computation elements. They are concurrent
Channels
The communication elements of SystemC. Can be simple wires or complex communication mechanisms like FIFOs or busses
Elementary channels: signal, buffer, fifo, mutex, semaphore, etc.
SystemC: Language Features
Interfaces
Ports use interfaces to communicate with channels
Events
Allow synchronisation between processes
Data types
C++ data types: bool, int, etc.
Extended standard types: sc_int<>, sc_uint<>, etc.
Logic types:
sc_bit: 2-valued single bit
sc_logic: 4-valued single bit
sc_bv<>: vector of sc_bit
sc_lv<>: vector of sc_logic
1-bit Full Adder
SystemC example: 1-bit Adder
SystemC example cont’d.
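The adder listing from the original slides is not reproduced in this text; the following is a minimal SystemC sketch of a 1-bit full adder of the same kind (illustrative, not the slide's exact code).

    // Minimal SystemC 1-bit full adder (illustrative sketch).
    #include <systemc.h>

    SC_MODULE(full_adder) {
        sc_in<bool>  a, b, cin;      // ports: operand and carry-in inputs
        sc_out<bool> sum, cout;      // ports: sum and carry-out outputs

        void compute() {             // process: purely combinational logic
            bool s = a.read() ^ b.read() ^ cin.read();
            bool c = (a.read() & b.read()) | (cin.read() & (a.read() ^ b.read()));
            sum.write(s);
            cout.write(c);
        }

        SC_CTOR(full_adder) {
            SC_METHOD(compute);            // register the process
            sensitive << a << b << cin;    // re-evaluate on any input change
        }
    };

In a testbench, the ports would be bound to sc_signal<bool> channels in sc_main and the simulation driven with sc_start().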
Open Core Protocol
What is the OCP channel?
OCP hardware view
OCP SystemC channel models: generic, OCP TL1, OCP TL2, layer adapter
Wrapped bus and OCP instances
OCP - Overview
An open standard for connecting different blocks on a SoC
A point-to-point connection, NOT a bus spec (which connects many blocks to many blocks)
Flexible and configurable to work with a wide range of IPs
Hierarchy of Elements
OCP Layering and Terminology
Signals: wires and fields
Phases: request, data handshake, response
Transfers: a read or a write
Transactions: a complete burst of one or more transfers
OCP at the Hardware Level
A collection of signals between two cores
Paths for:
Request signals
Data signals
Response signals: the slave responding with data (servicing reads)
Handshake paths
Sideband signals: interrupts, flags, error signalling
Hardware Timing
Allows responses/phases to be pipelined, meaning a number of requests can be sent before the responses are received
Figure: RdRq1, RdRq2, Resp1, Resp2
Only restriction: responses must follow request order
So, an incoming response might not match the last request!!
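Because responses must arrive in request order, a simple FIFO of outstanding requests is enough to pair each response with the request it answers; a sketch (types assumed):

    // Sketch: pipelined requests with in-order responses (RdRq1, RdRq2,
    // Resp1, Resp2). The oldest outstanding request is always the one answered.
    #include <queue>

    struct ReadRequest { unsigned address; };
    struct Response    { unsigned data; };

    class InOrderTracker {
    public:
        void request_sent(const ReadRequest& rq) { outstanding_.push(rq); }

        ReadRequest response_received(const Response& /*resp*/) {
            ReadRequest matched = outstanding_.front();  // oldest request
            outstanding_.pop();
            return matched;
        }
    private:
        std::queue<ReadRequest> outstanding_;
    };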
OCP Signals
Signalling order: Busy signals, Request, Data, Response
SystemC Model for OCP
Different OCP channels are layered upon a generic transaction-level channel
Generic model: for flexible interfaces
OCP TL1: low-level, cycle-accurate model (timing)
OCP TL2: transaction level (data and throughput)
Layer adapters
The different channel models allow the designer to find the model that best matches the core's model
Basic SystemC OCP Model
The master calls a channel function (sendRequest)
The channel takes the request and triggers the RequestStartEvent; the slave is sensitive to this event
This wakes up a SystemC process in the slave, which calls the channel's getRequest function
The slave may call the channel's acceptRequest function at a later time to accept the request
The channel then triggers the RequestEndEvent, which the master is sensitive to
The master can now begin a new request
Responses are handled separately – hence pipelined!
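A stripped-down illustration of that request phase in SystemC follows; the channel class, its members and the zero-time notifications are simplifying assumptions, not the actual OCP TL channel implementation.

    // Simplified sketch of the request handshake described above
    // (assumed types and signatures, not the real OCP channel code).
    #include <systemc.h>

    struct Request { unsigned addr; unsigned data; bool is_write; };

    struct simple_ocp_channel {
        sc_event RequestStartEvent;   // fired when the master sends a request
        sc_event RequestEndEvent;     // fired when the slave accepts it
        Request  pending{};

        void sendRequest(const Request& r) {            // called by the master
            pending = r;
            RequestStartEvent.notify(SC_ZERO_TIME);
        }
        Request getRequest() const { return pending; }  // called by the slave
        void acceptRequest() {                           // slave accepts the request
            RequestEndEvent.notify(SC_ZERO_TIME);
        }
    };

    // Master thread:  ch.sendRequest(req); wait(ch.RequestEndEvent);  // may issue the next request
    // Slave thread:   wait(ch.RequestStartEvent); Request r = ch.getRequest();
    //                 /* ... possibly some cycles later ... */ ch.acceptRequest();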
Transaction Level 1
Cycle accurate
Follows the phase ordering of the OCP transfer cycle
Clock driven
Uses all OCP parameters
Request/update semantics
All OCP signals supported
Transaction Level 2
Models OCP-specific data flow through the channel
Faster simulation / greater throughput
Commands to send an entire burst of data at a time
Request and response only – no data handshake
Timing is approximate
Layer Adapters
OCP always connects two cores
What if the two cores are modeled at different transaction levels? Use the layer adapter!!!
The layer adapter can also convert to RTL!
IBM CoreConnect
CoreConnect Bus Architecture
An open 32-, 64-, 128-bit core on-chip bus standard
Communication fabric for IBM Blue Logic and other non-IBM devices
Provides high bandwidth with a hierarchical bus structure:
Processor Local Bus (PLB)
On-Chip Peripheral Bus (OPB)
Device Control Register bus (DCR)
Performance Features
CoreConnect Components
PLB, OPB, DCR, PLB arbiter, OPB arbiter, PLB-to-OPB bridge, OPB-to-PLB bridge
PLB
Processor Local Bus
Fully synchronous; supports up to 8 masters
32-, 64-, and 128-bit architecture versions; extendable to 256 bits
Separate read/write data buses allow overlapped transfers and higher data rates
High-bandwidth capabilities: variable and fixed-length burst transfers, pipelining, split transactions, DMA transfers, no on-chip tri-states required, overlapped arbitration, programmable priority fairness
Processor Local Bus (cont'd.)
Masters: processor cores; cores requiring high bandwidth and low latency; DMA controller (as a PLB master); OPB-to-PLB bridge
Slaves: External Bus Interface Unit (EBIU); PLB-to-OPB bridge
Address and Data cycles
OPB
On-Chip Peripheral Bus
Fully synchronous
32-bit address bus, 32-bit data bus
Supports single-cycle data transfers between master and slaves
Supports multiple masters, determined by the arbitration implementation
Supports 32- and 64-bit masters; 8-, 16-, 32- and 64-bit slaves
No tri-state drivers required
Provides a link between the processor core and other, slower peripherals
PLB/OPB are roughly analogous to the northbridge/southbridge chips on a PC
DCR – Device Control Register Bus
Transfers data between the CPU's general-purpose registers and the DCR slaves' registers
Takes load off the PLB and OPB
10-bit address bus and 32-bit data bus
2-cycle minimum read or write transfers, extendable
Handshake supports clocked asynchronous transfers; slaves may be clocked either faster or slower than the master
Single DCR bus master
Distributed multiplexer architecture
A simple but flexible interface including an address bus, input and output data buses, DCR read and write signals, and acknowledge signals
Sonics – Silicon Backplane III
Yet another acronym!!! Sonics – Systems-ON-ICS
Backplane III uses a MicroNetwork: more similar to a data network than to busses!
OCP interface
Parameterised datapath widths and pipeline depths
Internal protocols based upon hardware threads that can be time-interleaved at a fine granularity
Time Division Multiplexing (TDMA) along a slotted time wheel
Pipelined
Memory-mapped address space: IP cores are accessed only via read/write commands
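A toy sketch of TDMA arbitration along a slotted time wheel; the slot table and the fallback policy for unclaimed slots are assumptions made for illustration.

    // Sketch: each entry of the time wheel names the initiator thread that
    // owns that slot (-1 = unreserved). The wheel is walked once per cycle.
    #include <vector>

    int pick_initiator(const std::vector<int>& slot_table,
                       unsigned long long cycle,
                       int fallback_initiator) {
        int owner = slot_table[cycle % slot_table.size()];
        // Unreserved slots are granted here by a simple assumed fallback;
        // the real arbitration policy is not described on the slide.
        return (owner >= 0) ? owner : fallback_initiator;
    }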
BackPlane III MicroNetwork
Spidergon!!