NoC: MPSoC Communication Fabric
Interconnection Networks (ELE 580)
Shougata Ghosh, 20th April 2006
Outline
MPSoC
Network-on-Chip
Synthesis of irregular NoC
OCP
SystemC
Cases: IBM CoreConnect, Sonics Silicon Backplane, CrossBow IPs
What are MPSoCs?
MPSoC – Multiprocessor System-on-Chip
Most SoCs today use multiple processing cores
MPSoCs are characterised by heterogeneous multiprocessors: CPUs, IPs (Intellectual Property blocks), DSP cores, memory, communication handlers (USB, UART, etc.)
Where are MPSoCs used?
Cell phones
Network processors (used by telecom and networking equipment to handle high data rates)
Digital television and set-top boxes
High-definition television
Video games (PlayStation Emotion Engine)
Challenges
All MPSoC designs must balance the following requirements: speed, power, area, application performance, and time to market
Why Reinvent the wheel?
Why not use a uniprocessor (3.4 GHz!!)? PDAs are usually uniprocessor
A uniprocessor cannot keep up with real-time processing requirements; it is too slow for real-time data
Real-time processing requires "real" concurrency
Uniprocessors provide only "apparent" concurrency through multitasking (the OS)
Multiprocessors can provide the concurrency required to handle real-time events
Hence the need for multiple processors
Why not CMPs?
+ CMPs are cheaper (reuse)
+ Easier to program
- Unpredictable delays (e.g. snoopy cache)
- Need buffering to handle the unpredictability
Area concerns
Configured CMPs would have unused resources
Special-purpose PEs don't need to support unwanted processes:
Faster, area efficient, power efficient
Can exploit known memory access patterns: smaller caches (area savings)
MPSoC Architecture
Components
Hardware: multiple processors, non-programmable IPs, memory
Communication interface: interfaces the heterogeneous components to the communication network
Communication network: hierarchical (busses) or NoC
Design Flow
System-level synthesis: top-down approach; a synthesis algorithm produces the SoC architecture + SW model from system-level specs
Platform-based design: starts with a functional system spec + a predesigned platform; functions are mapped and scheduled onto HW/SW
Component-based design: bottom-up approach
Platform-Based Design
Start with the functional spec: task graphs
Task graph: nodes are the tasks to complete; edges capture communication and dependence between tasks
Execution times are annotated on the nodes; data communicated is annotated on the edges
Map tasks onto predesigned HW
Use an extended task graph for SW and communication
Mapping onto HW; a Gantt chart is used for scheduling task execution and timing analysis
Extended task graph: adds communication nodes (reads and writes)
ILP and heuristic algorithms schedule tasks and communication onto HW and SW
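To make the representation concrete, the task graph above could be held in a small C++ structure; this is only an illustrative sketch, and the type and field names are assumptions rather than anything from the original slides.

    // Illustrative sketch only: task graph with execution time on nodes and
    // data volume on edges (names and units are assumptions).
    #include <string>
    #include <vector>

    struct Task {
        std::string name;
        double exec_time_us;        // execution time annotated on the node
    };

    struct Dependence {
        int src_task, dst_task;     // indices into TaskGraph::tasks
        int bytes_communicated;     // data annotated on the edge
    };

    struct TaskGraph {
        std::vector<Task> tasks;        // nodes: tasks to complete
        std::vector<Dependence> edges;  // edges: communication and dependence
    };

A scheduler (ILP-based or heuristic, as on the slide) would then assign each Task to a HW or SW resource and each Dependence to a communication resource, producing the Gantt chart used for timing analysis.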
Component-Based Design
Conceptual MPSoC platform: SW, processor, IP, communication fabric
Parallel development using APIs
Quicker time to market
Design Flow Schematic
Communication Fabric
Has mostly been bus based: IBM CoreConnect, Sonics Silicon Backplane, etc.
Busses are not scalable!! They typically support around 5 processors, rarely more than 10
The number of cores keeps increasing, pushing designs towards NoC
NoC NoC NoC-ing on Heaven's Door!!
Typical Network-on-Chip (Regular)
Regular NoC: an array of tiles
Each tile has an input port (inject into the network) and an output port (receive from the network)
Input port => 256-bit data, 38-bit control
The network handles both static and dynamic traffic
Static: flow of data from a camera to an MPEG encoder
Dynamic: memory requests from a PE (or CPU)
Dedicated VCs are used for static traffic; dynamic traffic goes through arbitration
Control Bits
Control bit fields:
Type (2 bits): Head, Body, Tail, Idle
Size (4 bits): data size from 0 (1 bit) to 8 (256 bits)
VC Mask (8 bits): mask to determine the VC (out of 8); can be used to prioritise
Route (16 bits): source routing
Ready (8 bits): signal from the network indicating it is ready to accept the next flit (why 8?)
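For illustration, the 38 control bits listed above (2 + 4 + 8 + 16 + 8) can be sketched as a packed C++ struct; the layout and field names are assumptions, not taken from the actual design.

    // Sketch of the 38-bit flit control word described above.
    // Field widths follow the slide; layout and names are assumptions.
    #include <cstdint>

    struct FlitControl {
        uint64_t type    : 2;   // Head, Body, Tail, Idle
        uint64_t size    : 4;   // data size: 0 (1 bit) up to 8 (256 bits)
        uint64_t vc_mask : 8;   // which of the 8 VCs may be used (can prioritise)
        uint64_t route   : 16;  // source-routing field
        uint64_t ready   : 8;   // ready signal(s) back from the network
    };                          // 38 bits of payload in total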
Flow Control
Virtual-channel flow control
Router with an input controller and an output controller
The input controller has a buffer and state for each VC
The input controller strips the routing info from the head flit
The flit then arbitrates for an output VC
Each output VC has a buffer for a single flit, used to store a flit waiting for an input buffer in the next hop
Input and Output Controllers
NoC Issues
Basic differences between NoC and inter-chip or inter-board networks:
Wires and pins are ABUNDANT on chip; buffer space is limited
On-chip "pins" per tile could number around 24,000, compared to roughly 1,000 for inter-chip designs
Designers can trade wiring resources for network performance!
Channels: on-chip => ~300 bits wide; inter-chip => 8-16 bits
Topology
The previous design used a folded torus
A folded torus has twice the wire demand and twice the bisection bandwidth of a mesh
It converts plentiful wires into bandwidth (performance)
Not hard to implement on chip; however, it can be more power hungry
Flow Control Decision
Area is scarce in on-chip designs; buffers use up a LOT of area
Flow control schemes with fewer buffers are favourable; however, this must be balanced against performance
Packet-dropping flow control requires the least buffering, but at the expense of performance
Misrouting is an option when there is enough path diversity
High Performance Circuits
Wiring is regular and known at design time, so it can be accurately modeled (R, L, C). This enables:
Low-swing signalling – 100 mV compared to 1 V: HUGE power saving
Overdrive produces roughly 3x the signal velocity of full-swing drivers
Overdrive also increases repeater spacing: again, significant power savings
Heterogeneous NoC
Regular topologies facilitate modular design and are easily scaled up by replication
However, for heterogeneous systems, regular topologies lead to overdesign!!
Heterogeneous NoCs can optimise local bottlenecks
Solution? A complete application-specific NoC synthesis flow with a customised topology and customised NoC building blocks
xPipes Lite
An application-specific NoC library: creates application-specific NoCs
Uses a library of NIs, switches and links
Parameterised library modules optimised for frequency and low latency
Packet-switched communication, source routing, wormhole flow control
Topologies: torus, mesh, B-tree, butterfly
NoC Architecture Block Diagram
xPipes Lite
Uses OCP to communicate with cores
OCP advantages:
Industry-wide standard communication protocol between cores and the NoC
Allows parallel development of cores and NoC, smoother development of modules, faster time to market
xPipes Lite – Network Interface
Bridges the OCP interface and the NoC switching fabric
Functions:
Synchronisation between OCP and xPipes timing
Packeting of OCP transactions into flits
Route calculation
Flit buffering to improve performance
NI
Uses 2 registers to interface with OCP:
Header register to store the address (sent once)
Payload register to store the data (sent multiple times for burst transfers)
Flits are generated from the registers:
Header flit from the header register
Body/payload flits from the payload register
Routing info goes in the header flit; the route is determined from a LUT using the destination address
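A rough sketch of the packeting step just described, in C++; the flit layout, the LUT type and the function signature are assumptions made for illustration.

    // Sketch: build one header flit (with route from the LUT) followed by
    // payload flits, as the NI does from its header and payload registers.
    #include <cstdint>
    #include <map>
    #include <vector>

    enum class FlitType { Head, Body, Tail };
    struct Flit { FlitType type; uint16_t route; uint64_t payload; };

    std::vector<Flit> packetize(uint32_t dest_addr,
                                const std::vector<uint64_t>& burst_data,
                                const std::map<uint32_t, uint16_t>& route_lut) {
        std::vector<Flit> flits;
        // Header flit: from the header register; route looked up by destination.
        flits.push_back({FlitType::Head, route_lut.at(dest_addr), dest_addr});
        // Body flits: one per beat held in the payload register during a burst.
        for (uint64_t beat : burst_data)
            flits.push_back({FlitType::Body, 0, beat});
        if (!burst_data.empty())
            flits.back().type = FlitType::Tail;  // last payload flit closes the packet
        return flits;
    }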
Network Interface
Bidirectional NI
Output stage identical to that of the xPipes switches
Input stage uses dual-flit buffers
Uses the same flow control as the switches
Switch Architecture
The xPipes switch is the basic building block of the switching fabric
2-cycle latency
Output-queued router
Fixed and round-robin priority arbitration on the input lines
Flow control: ACK/NACK with Go-Back-N semantics
CRC for error detection
Switch
The allocator module performs the arbitration for the head flit
The path is then held until the tail flit
The routing info requests the output port
The switch is parameterisable in: number of inputs/outputs, arbitration policy, and output buffer sizes
Switch flow control
An input flit is dropped if:
The requested output port is held by a previous packet
The output buffer is full
It lost the arbitration
A NACK is sent back; all subsequent flits of that packet are dropped until the header flit reappears (Go-Back-N flow control)
The switch updates the routing info for the next switch
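The drop decision above reduces to a simple predicate; a sketch (names are assumptions):

    // Sketch: an input flit is dropped (and a NACK returned) if any of the
    // three conditions from the slide holds.
    bool should_drop_input_flit(bool output_port_held_by_other_packet,
                                bool output_buffer_full,
                                bool lost_arbitration) {
        return output_port_held_by_other_packet || output_buffer_full || lost_arbitration;
    }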
xPipes Lite - Links
The links are pipelined to overcome the interconnect delay problem
xPipes Lite uses shallow pipelines for all modules (NI, switch): low latency, lower buffer requirements, area savings, higher frequency
xPipes Lite Design Flow
Heterogeneous Network
The network was heterogeneous in: switch buffering, input and output ports, arbitration policy, and links
The topology, however, was still regular
Go-Back-N??
Flow and error control "borrowed" from sliding-window flow control
Reject all subsequent flits/packets after a drop
In sliding-window flow control, NACKs are sent with a frame number (N)
The sender has to go back to frame N and resend all the frames from there
Go-Back-N Example
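The example figure is not reproduced here; as a stand-in, a sender-side sketch of the Go-Back-N behaviour described above (the Channel interface is an assumption):

    // Sketch: sender side of Go-Back-N. On a NACK carrying frame number N,
    // transmission restarts from frame N. The Channel interface is assumed.
    #include <cstddef>
    #include <vector>

    struct Frame { int seq; /* payload omitted */ };

    template <typename Channel>
    void send_go_back_n(const std::vector<Frame>& frames, Channel& ch) {
        std::size_t next = 0;                 // next frame to transmit
        while (next < frames.size()) {
            ch.transmit(frames[next]);
            int nacked_seq;
            if (ch.poll_nack(nacked_seq))
                next = static_cast<std::size_t>(nacked_seq);  // go back to frame N
            else
                ++next;                       // no NACK: move on to the next frame
        }
    }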
NoC Synthesis - Netchip
Netchip – a tool to synthesise application-specific NoCs
Uses two tools:
SUNMAP – to generate/select the topology
xPipes Lite – to generate the NIs, switches and links
Netchip
Three phases to generate the NoC:
Topology mapping – SUNMAP (inputs: core graph, area/power libraries, floorplan, topology library)
Topology selection – SUNMAP
NoC generation – xPipes Lite
It is possible to skip phases 1 and 2 and provide a custom topology!
Netchip Design Flow
Core Graph, Topology Graph
Core graph: a directed graph G(V, E)
Each vertex v_i represents a SoC core
Each directed edge e_ij represents communication from vertex v_i to v_j
The weight of edge e_ij represents the bandwidth of the communication from v_i to v_j
NoC topology graph: a directed graph P(U, F)
Each vertex u_i represents a node in the topology
Each directed edge f_ij represents communication from node u_i to u_j
The weight of edge f_ij (denoted bw_ij) represents the bandwidth available across edge f_ij
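A compact way to hold these two graphs in code is sketched below; the type and field names are illustrative assumptions.

    // Sketch: core graph G(V, E) and topology graph P(U, F), both directed
    // graphs with a bandwidth weight on each edge.
    #include <vector>

    struct WeightedEdge {
        int src, dst;        // vertex indices (cores, or topology nodes)
        double bandwidth;    // required BW (core graph) or available BW (topology graph)
    };

    struct DirectedGraph {
        int num_vertices = 0;
        std::vector<WeightedEdge> edges;
    };

The mapping phase then assigns each core-graph vertex to a topology-graph vertex so that every core edge's required bandwidth fits within the bandwidth available along its mapped path.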
Mapping
Uses a minimum-path mapping algorithm to map the cores onto the nodes
This is done for every topology in the topology library
Selection
Torus, mesh: 4 x 3 nodes, 5x5 switches
Butterfly: 4-ary 2-fly, 4x4 switches
What about irregular topologies?
They can be generated using a mixed-integer linear programming formulation
"Linear Programming based Techniques for Synthesis of Network-on-Chip Architectures", K. Srinivasan, K. Chatha and G. Konjevod, ASU
SystemC
A system description language: both a C++ class library and a design methodology
Provides hierarchical design that can address:
High-level abstraction
Low-level logic design
Simulation of software algorithms
SystemC: Language Features
Modules
Basic building blocks of a design hierarchy. A SystemC model usually consists of several modules which communicate via ports
Ports
Communication from inside a module to the outside (usually to other modules)
Processes
The main computation elements. They are concurrent
Channels
The communication elements of SystemC. Can be simple wires or complex communication mechanisms like FIFOs or busses
Elementary channels: signal, buffer, fifo, mutex, semaphore, etc.
SystemC: Language Features
Interfaces
Ports use interfaces to communicate with channels
Events
Allow synchronisation between processes
Data types
C++ data types: bool, int, etc.
Extended standard types: sc_int<>, sc_uint<>, etc.
Logic types:
sc_bit: 2-valued single bit
sc_logic: 4-valued single bit
sc_bv<>: vector of sc_bit
sc_lv<>: vector of sc_logic
1-bit Full Adder
SystemC example: 1-bit Adder
SystemC example cont’d.
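The adder listing from the original slides is not reproduced in this text; the following is a minimal SystemC sketch of a 1-bit full adder of the same kind (illustrative, not the slide's exact code).

    // Minimal SystemC 1-bit full adder (illustrative sketch).
    #include <systemc.h>

    SC_MODULE(full_adder) {
        sc_in<bool>  a, b, cin;      // ports: operand and carry-in inputs
        sc_out<bool> sum, cout;      // ports: sum and carry-out outputs

        void compute() {             // process: purely combinational logic
            bool s = a.read() ^ b.read() ^ cin.read();
            bool c = (a.read() & b.read()) | (cin.read() & (a.read() ^ b.read()));
            sum.write(s);
            cout.write(c);
        }

        SC_CTOR(full_adder) {
            SC_METHOD(compute);            // register the process
            sensitive << a << b << cin;    // re-evaluate on any input change
        }
    };

In a testbench, the ports would be bound to sc_signal<bool> channels in sc_main and the simulation driven with sc_start().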
Open Core Protocol
What is the OCP channel?
OCP hardware view
OCP SystemC channel models: generic, OCP TL1, OCP TL2, layer adapter
Wrapped bus and OCP instances
OCP - Overview
An open standard for connecting different blocks on a SoC
A point-to-point connection, NOT a bus spec (which connects many blocks to many blocks)
Flexible and configurable to work with a wide range of IPs
Hierarchy of Elements
OCP Layering and Terminology
Signals: wires and fields
Phases: request, data handshake, response
Transfers: a read or a write
Transactions: a complete burst of one or more transfers
OCP at the Hardware Level
A collection of signals between two cores
Paths for:
Request signals
Data signals
Response signals: the slave responding with data (servicing reads)
Handshake paths
Sideband signals: interrupts, flags, error signalling
Hardware Timing
Allows responses/phases to be pipelined, meaning a number of requests can be sent before the responses are received
Figure: RdRq1, RdRq2, Resp1, Resp2
Only restriction: responses must follow request order
So, an incoming response might not match the last request!!
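Because responses must arrive in request order, a simple FIFO of outstanding requests is enough to pair each response with the request it answers; a sketch (types assumed):

    // Sketch: pipelined requests with in-order responses (RdRq1, RdRq2,
    // Resp1, Resp2). The oldest outstanding request is always the one answered.
    #include <queue>

    struct ReadRequest { unsigned address; };
    struct Response    { unsigned data; };

    class InOrderTracker {
    public:
        void request_sent(const ReadRequest& rq) { outstanding_.push(rq); }

        ReadRequest response_received(const Response& /*resp*/) {
            ReadRequest matched = outstanding_.front();  // oldest request
            outstanding_.pop();
            return matched;
        }
    private:
        std::queue<ReadRequest> outstanding_;
    };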
OCP Signals
Signalling order: Busy signals, Request, Data, Response
SystemC Model for OCP
Different OCP channels are layered upon a generic transaction-level channel
Generic model: for flexible interfaces
OCP TL1: low-level, cycle-accurate model (timing)
OCP TL2: transaction level (data and throughput)
Layer adapters
The different channel models allow the designer to find the model that best matches the core's model
Basic SystemC OCP Model
The master calls a channel function (sendRequest)
The channel takes the request and triggers the RequestStartEvent; the slave is sensitive to this event
This wakes up a SystemC process in the slave, which calls the channel's getRequest function
The slave may call the channel's acceptRequest function at a later time to accept the request
The channel then triggers the RequestEndEvent, which the master is sensitive to
The master can now begin a new request
Responses are handled separately – hence pipelined!
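A stripped-down illustration of that request phase in SystemC follows; the channel class, its members and the zero-time notifications are simplifying assumptions, not the actual OCP TL channel implementation.

    // Simplified sketch of the request handshake described above
    // (assumed types and signatures, not the real OCP channel code).
    #include <systemc.h>

    struct Request { unsigned addr; unsigned data; bool is_write; };

    struct simple_ocp_channel {
        sc_event RequestStartEvent;   // fired when the master sends a request
        sc_event RequestEndEvent;     // fired when the slave accepts it
        Request  pending{};

        void sendRequest(const Request& r) {            // called by the master
            pending = r;
            RequestStartEvent.notify(SC_ZERO_TIME);
        }
        Request getRequest() const { return pending; }  // called by the slave
        void acceptRequest() {                           // slave accepts the request
            RequestEndEvent.notify(SC_ZERO_TIME);
        }
    };

    // Master thread:  ch.sendRequest(req); wait(ch.RequestEndEvent);  // may issue the next request
    // Slave thread:   wait(ch.RequestStartEvent); Request r = ch.getRequest();
    //                 /* ... possibly some cycles later ... */ ch.acceptRequest();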
Transaction Level 1
Cycle accurate
Follows the phase ordering of the OCP transfer cycle
Clock driven
Uses all OCP parameters
Request/update semantics
All OCP signals supported
Transaction Level 2
Models OCP-specific data flow through the channel
Faster simulation / greater throughput
Commands to send an entire burst of data at a time
Request and response only – no data handshake
Timing is approximate
Layer Adapters
OCP always connects two cores
What if the two cores are modeled at different transaction levels? Use the layer adapter!!!
The layer adapter can also convert to RTL!
IBM CoreConnect
CoreConnect Bus Architecture
An open 32-, 64-, 128-bit core on-chip bus standard
Communication fabric for IBM Blue Logic and other non-IBM devices
Provides high bandwidth with a hierarchical bus structure:
Processor Local Bus (PLB)
On-Chip Peripheral Bus (OPB)
Device Control Register bus (DCR)
Performance Features
CoreConnect Components
PLB, OPB, DCR, PLB arbiter, OPB arbiter, PLB-to-OPB bridge, OPB-to-PLB bridge
PLB
Processor Local Bus
Fully synchronous; supports up to 8 masters
32-, 64-, and 128-bit architecture versions; extendable to 256 bits
Separate read/write data buses allow overlapped transfers and higher data rates
High-bandwidth capabilities: variable and fixed-length burst transfers, pipelining, split transactions, DMA transfers, no on-chip tri-states required, overlapped arbitration, programmable priority fairness
Processor Local Bus (cont'd.)
Masters: processor cores; cores requiring high bandwidth and low latency; DMA controller (as a PLB master); OPB-to-PLB bridge
Slaves: External Bus Interface Unit (EBIU); PLB-to-OPB bridge
Address and Data cycles
OPB
On-Chip Peripheral Bus
Fully synchronous
32-bit address bus, 32-bit data bus
Supports single-cycle data transfers between master and slaves
Supports multiple masters, determined by the arbitration implementation
Supports 32- and 64-bit masters; 8-, 16-, 32- and 64-bit slaves
No tri-state drivers required
Provides a link between the processor core and other, slower peripherals
PLB/OPB are roughly analogous to the northbridge/southbridge chips on a PC
DCR – Device Control Register Bus
Transfers data between the CPU's general-purpose registers and the DCR slaves' registers
Takes load off the PLB and OPB
10-bit address bus and 32-bit data bus
2-cycle minimum read or write transfers, extendable
Handshake supports clocked asynchronous transfers; slaves may be clocked either faster or slower than the master
Single DCR bus master
Distributed multiplexer architecture
A simple but flexible interface including an address bus, input and output data buses, DCR read and write signals, and acknowledge signals
Sonics – Silicon Backplane III
Yet another acronym!!! Sonics – Systems-ON-ICS
Backplane III uses a MicroNetwork: more similar to a data network than to busses!
OCP interface
Parameterised datapath widths and pipeline depths
Internal protocols based upon hardware threads that can be time-interleaved at a fine granularity
Time Division Multiplexing (TDMA) along a slotted time wheel
Pipelined
Memory-mapped address space: IP cores are accessed only via read/write commands
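A toy sketch of TDMA arbitration along a slotted time wheel; the slot table and the fallback policy for unclaimed slots are assumptions made for illustration.

    // Sketch: each entry of the time wheel names the initiator thread that
    // owns that slot (-1 = unreserved). The wheel is walked once per cycle.
    #include <vector>

    int pick_initiator(const std::vector<int>& slot_table,
                       unsigned long long cycle,
                       int fallback_initiator) {
        int owner = slot_table[cycle % slot_table.size()];
        // Unreserved slots are granted here by a simple assumed fallback;
        // the real arbitration policy is not described on the slide.
        return (owner >= 0) ? owner : fallback_initiator;
    }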
BackPlane III MicroNetwork
Spidergon!!