The System Interconnect
Chips tend to have more than one "core": control processors, accelerators, memories, I/O.
How do we get them to talk to each other? This is called the "system interconnect".
Shared bus topology, aimed at simple, cost-effective integration of components.
[Figure: a shared bus connecting Masters 0-2 and Slaves 0-2]
Typical example: ARM Ltd.'s AMBA AHB.
- Arbitration among multiple masters
- A single outstanding transaction is allowed
- If wait states are needed, everybody waits
Traditional Answer: with Buses
So, Are We All Set?
Well... not really. Let's consider two trends:
- System/architectural: systems are becoming highly parallel
- Physical: wires are becoming slower (especially in relative terms)
System/Architectural Level
Parallel systems... OK, but how much?
- CPUs: currently four cores (not so many...)
- PlayStation/Cell: currently nine engines (still OK)
- GPUs: currently 100+ shaders (hey!)
- Your next cellphone: 100+ cores (!!!)
And the trend is: doubling every 18 months.
Multicore Testimonials 1
Intel IXP2800, with 16 micro-engines and one Intel XScale core.

"We believe that Intel's Chip Level Multiprocessing (CMP) architectures represent the future of microprocessors because they deliver massive performance scaling while effectively managing power and heat."
White paper "Platform 2015: Intel Processor and Platform Evolution for the Next Decade"
Multicore Testimonials 2
"The next 25 years of "The next 25 years of digital signal processing digital signal processing technology will literally technology will literally integrate hundreds of integrate hundreds of processors on a single processors on a single
chip to conceive chip to conceive applications beyond our applications beyond our
imagination.”imagination.”
Mike Hames, senior VP, Mike Hames, senior VP, Texas Instruments Texas Instruments
"The next 25 years of "The next 25 years of digital signal processing digital signal processing technology will literally technology will literally integrate hundreds of integrate hundreds of processors on a single processors on a single
chip to conceive chip to conceive applications beyond our applications beyond our
imagination.”imagination.”
Mike Hames, senior VP, Mike Hames, senior VP, Texas Instruments Texas Instruments
““Focus here is on Focus here is on parallelism and what's parallelism and what's
referred to as multi-core referred to as multi-core technology.”technology.”
Phil Hester, CTO, AMD Phil Hester, CTO, AMD
““Focus here is on Focus here is on parallelism and what's parallelism and what's
referred to as multi-core referred to as multi-core technology.”technology.”
Phil Hester, CTO, AMD Phil Hester, CTO, AMD
Intel: 80-core chip Intel: 80-core chip shown at ISSCC 2007shown at ISSCC 2007Intel: 80-core chip Intel: 80-core chip
shown at ISSCC 2007shown at ISSCC 2007
Rapport: Kilocore (1024 Rapport: Kilocore (1024 cores), for gaming & cores), for gaming &
mediamedia
Expected mid 2007Expected mid 2007
Rapport: Kilocore (1024 Rapport: Kilocore (1024 cores), for gaming & cores), for gaming &
mediamedia
Expected mid 2007Expected mid 2007
What Does This Mean for the Interconnect?
A new set of requirements!
- High performance: many cores will want to communicate, fast
- High parallelism (bandwidth): many cores will want to communicate, simultaneously
- High heterogeneity/flexibility: cores will operate at different frequencies and data widths, maybe with different protocols
If you assume a shared bus, the wires have to go all around the chip (i.e. they are very long): propagation delay, spaghetti wiring.
[Figure: floorplan of a roughly 2 cm x 1 cm die with long bus wires snaking across it]
And If We Consider a Floorplan...
What Does This Mean for the Interconnect?
A new set of requirements!
- Short wiring: point-to-point and local is best
- Simple, structured wiring: bundles of many wires are impractical to route
System Interconnect Evolution
[Figure: evolution from the traditional shared bus, via topology evolution to hierarchical buses, and via protocol evolution to advanced bus protocols]
These help with the issues, but do not fully solve them; more scalable solutions are needed.
[Figure: multilayer AHB example: an AHB crossbar component connects Masters 0-3, on AHB Layers 0 and 1, to Slaves 0-6]
An Answer: Networks-on-Chip (NoCs)
[Figure: a NoC in which CPU, DSP, DRAM, DMA, accelerator and MPEG cores each attach to a network interface (NI), and the NIs are connected by a fabric of switches]
Packet-based communication:
- NIs convert transactions by cores into packets (as sketched below)
- Switches route the packets across the system
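As an illustration of the NI's packetization step, here is a minimal sketch; the flit structure, field layout, and shift amounts are assumptions for the example, not any real NoC's packet format.

```python
# Minimal sketch of NI packetization: one write transaction becomes a
# stream of flits. The Flit class and field layout are illustrative
# assumptions, not a real NoC's packet format.

from dataclasses import dataclass

@dataclass
class Flit:
    kind: str   # "head", "body", or "tail"
    data: int

def packetize(dest_id: int, addr: int, payload_words: list) -> list:
    """Turn one write transaction into flits (assumes >= 1 payload word)."""
    flits = [Flit("head", (dest_id << 32) | addr)]   # routing info + address
    flits += [Flit("body", w) for w in payload_words[:-1]]
    flits.append(Flit("tail", payload_words[-1]))    # tail closes the packet
    return flits

print(packetize(dest_id=3, addr=0x1000, payload_words=[0xAA, 0xBB]))
```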
First Assessment of NoCs
- High performance, high parallelism (bandwidth)? Yes: just add links and switches as you add cores
- High heterogeneity/flexibility? Yes: just design appropriate NIs, then plug in
- Short wiring? Yes: point-to-point; just place switches as close as needed
- Simple, structured wiring? Yes: links are point-to-point, and their width can be tuned
Problem Solved?
Maybe, but... buses excel in simplicity, low power and low area
When designing a NoC, remember that tradeoffs will be required to keep those under control
Not all designs will require a NoC, only the “complex” ones
How to Make NoCs Tick
A NoC is a small network, with many of the same architectural degrees of freedom.
Some problems are less stringent:
- Static number of nodes
- (Roughly) known traffic patterns and requirements
Some problems are much tougher:
- MANY fewer resources to solve problems
- Latencies of nanoseconds, not milliseconds
But... what characterizes a network?
Key NoC Properties
- Topology
- Routing policy (where)
- Switching policy (how)
- Flow control policy (when)
- Synchronous, asynchronous, or mesochronous timing
- ...and many others!
A huge design space.
NoC Topologies
Must comply with demands of performance (bandwidth & latency), area, power, and routability.
Can be split into:
- direct: every switch has a node attached to it
- indirect: nodes attach only to a specific subset of the switches
NoC Topology Examples: Hypercubes
Compositional design. Example: hypercube topologies.
- Arrange N = 2^n nodes in an n-dimensional cube
- At most n hops from source to destination
- High bisection bandwidth: good for traffic (but can you use it?), bad for cost [O(n^2)]
- Exploits locality
- Node size grows as n [input/output cardinality] and as n^2 [internal crossbar]
- Adaptive routing may be possible
[Figure: hypercubes of dimension 0-D through 4-D]
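To make the "at most n hops" claim concrete, here is a minimal sketch of classic dimension-order (e-cube) routing on a hypercube; the hops are exactly the bits in which the source and destination labels differ.

```python
# Minimal sketch of dimension-order (e-cube) routing on an n-dimensional
# hypercube: nodes are labeled 0..2^n - 1 and neighbors differ in one bit,
# so the route fixes the differing address bits one dimension at a time.

def ecube_route(src: int, dst: int) -> list:
    """Return the nodes visited from src to dst (one bit fixed per hop)."""
    path, node, dim = [src], src, 0
    diff = src ^ dst                  # bits that still differ
    while diff:
        if diff & 1:                  # this dimension differs: take the hop
            node ^= (1 << dim)
            path.append(node)
        diff >>= 1
        dim += 1
    return path

# In a 4-D hypercube (16 nodes), the hop count is the number of differing
# bits, hence at most n = 4.
print(ecube_route(0b0000, 0b1011))    # [0, 1, 3, 11]: 3 hops
```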
NoC Topology Examples: Multistage Topologies
Need to fix the hypercube's resource requirements. Idea: unroll the hypercube's vertices.
- Switch sizes are now bounded, but locality is lost and there are more hops
- Can be blocking; non-blocking variants require even more stages
Alternative reduction from the hypercube: restrict to a < log2(N)-dimensional structure, e.g. mesh (2-cube), 3D mesh (3-cube).
- Matches the structure of the physical world and allows for locality
- Bounds the degree at each node
- Even more bottleneck potential
[Figure: a 2D mesh]
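Meshes are typically paired with dimension-order (XY) routing, which is deterministic and deadlock-free on a mesh; here is a minimal sketch (an illustration, not any particular router's implementation).

```python
# Minimal sketch of XY (dimension-order) routing on a 2D mesh: route fully
# in X first, then in Y. Deterministic and deadlock-free on a mesh with no
# wrap-around links. Port names E/W/N/S are an illustrative convention.

def xy_route(src, dst):
    """Return the list of output-port directions from src to dst."""
    (sx, sy), (dx, dy) = src, dst
    hops = []
    step = 1 if dx > sx else -1
    while sx != dx:                   # X dimension first
        hops.append("E" if step == 1 else "W")
        sx += step
    step = 1 if dy > sy else -1
    while sy != dy:                   # then the Y dimension
        hops.append("N" if step == 1 else "S")
        sy += step
    return hops

print(xy_route((0, 0), (2, 1)))       # ['E', 'E', 'N']
```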
NoC Topology Examples: k-ary n-cubes (Mesh Topologies)
Need to improve mesh performance. Idea: wrap the n-cube's ends around (a 2-cube becomes a cylinder, a 3-cube a donut).
- Halves the worst-case hop count
- Can be laid out reasonably efficiently, at maybe 2x the cost in channel width
NoC Topology Examples: Torus
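The halved worst-case hop count follows from being able to travel around each ring in either direction; a one-dimension sketch (illustrative only):

```python
# Minimal sketch of why wrap-around halves the worst-case hop count: on a
# k-node ring (one dimension of a k-ary torus) we may go either way around,
# so the distance is min(forward, backward) <= floor(k/2), versus up to
# k - 1 hops along a mesh row.

def ring_distance(src: int, dst: int, k: int) -> int:
    forward = (dst - src) % k
    return min(forward, k - forward)

# On an 8-node ring: mesh-row worst case is 7 hops, torus worst case is 4.
print(max(ring_distance(0, d, 8) for d in range(8)))   # 4
```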
NoC Topology Examples: Fat-Tree Topologies
Fatter links (actually: more of them) as you go to the root, so bisection bandwidth scales
[Figure: a fat tree]
NoC Routing Policies
Static: e.g. source routing or coordinate-based; simpler to implement and validate (sketched below).
Adaptive: e.g. congestion-based; potentially faster, but much more expensive, and it allows out-of-order packet delivery (possibly a bad idea for NoCs).
A huge issue: deadlocks.
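As an illustration of why static source routing is simple to implement, here is a minimal sketch; the route table and packet layout are invented for the example.

```python
# Minimal sketch of static source routing: the sender looks up the whole
# path (a list of output ports) and prepends it to the packet, so switches
# need no routing logic at all. The table below is invented for the example.

ROUTE_LUT = {
    ("cpu", "dram"): [2, 0, 1],   # output port to take at hop 1, 2, 3
    ("cpu", "dsp"):  [2, 3],
}

def send(src: str, dst: str, payload: bytes) -> dict:
    """Build a source-routed packet carrying its own precomputed path."""
    return {"route": list(ROUTE_LUT[(src, dst)]), "payload": payload}

def switch_forward(packet: dict) -> int:
    """A switch's 'routing decision': pop the next precomputed port."""
    return packet["route"].pop(0)

pkt = send("cpu", "dram", b"\x01\x02")
print(switch_forward(pkt), switch_forward(pkt), switch_forward(pkt))  # 2 0 1
```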
Deadlocks
A showstopper problem. Options:
- avoid by mapping: no route loops
- avoid by architecture: e.g. virtual channels
- provide deadlock recovery
Critical for adaptive routing; livelocks are also possible.
[Figure: three nodes in a cycle. A would like to talk to C, B to A, C to B. Everybody is stuck!!]
NoC Switching Policies
Packet switching: maximizes global network usage dynamically.
- store-and-forward: minimum logic, but higher latency; needs more buffers
- wormhole: minimum buffering, but deadlock-prone; induces congestion
(A latency comparison of the two is sketched below.)
Circuit switching: optimizes specific transactions.
- no contention, no jitter
- but: requires handshaking, may fail completely, setup overhead
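To make the store-and-forward vs. wormhole latency difference concrete, a back-of-the-envelope model under simplifying assumptions (unloaded network, one flit per cycle per link):

```python
# Back-of-the-envelope latency model (unloaded network, one flit per cycle
# per link, H hops, packet of L flits). A simplification, not a simulator.

def store_and_forward_latency(hops: int, flits: int) -> int:
    # Each switch must receive the whole packet before forwarding it.
    return hops * flits

def wormhole_latency(hops: int, flits: int) -> int:
    # The header advances hop by hop; body flits pipeline right behind it.
    return hops + (flits - 1)

H, L = 5, 16
print(store_and_forward_latency(H, L))   # 80 cycles
print(wormhole_latency(H, L))            # 20 cycles
```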
Virtual Channels
Performance improvement using virtual channels
[Figure: two 5-node pipelines. In the top one, a blocked packet stalls packet B short of B's destination; with virtual channels (bottom), B can proceed to its destination past the block.]
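A toy model of the mechanism in the figure: with two virtual channels sharing one physical link, flits of a blocked packet wait in their own buffer while the other channel keeps the link busy. The names and buffer contents here are illustrative assumptions.

```python
# Toy model of virtual channels: one physical link, two VC buffers. Flits
# of a blocked packet wait in their own VC while the arbiter keeps the link
# busy with the other VC.

from collections import deque

vc = {0: deque(["A0", "A1", "A2"]),   # packet A on VC0
      1: deque(["B0", "B1"])}         # packet B on VC1
blocked = {0: True, 1: False}         # VC0's downstream buffer is full

def link_cycle():
    """Forward one flit per cycle from any VC that is not blocked."""
    for ch in (0, 1):
        if vc[ch] and not blocked[ch]:
            return vc[ch].popleft()
    return None                       # nothing eligible: link idles

for _ in range(3):
    print(link_cycle())               # B0, B1, None: B gets past blocked A
```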
NoC Flow Control Policies
We need it because...
- the sender may inject bursty traffic
- receiver buffers may fill up
- sender and receiver may operate at different frequencies
- arbitrations may be lost
How?
- TDMA: pre-defined time slots
- Speculative: send first, then wait for confirmation (acknowledge, ACK)
- Conservative: wait for a token, then send (credit-based; see the sketch below)
Remember... links may be pipelined.
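A minimal sketch of the credit-based ("conservative") policy; the buffer depth and function names are illustrative choices.

```python
# Minimal sketch of credit-based ("conservative") flow control: the sender
# holds one credit per free slot in the receiver's buffer and sends only
# while it has credits. The depth of 2 is an arbitrary illustrative choice.

from collections import deque

BUFFER_DEPTH = 2
credits = BUFFER_DEPTH            # sender-side credit counter
rx_buffer = deque()               # receiver-side buffer

def try_send(flit) -> bool:
    global credits
    if credits == 0:
        return False              # no guaranteed space: sender must wait
    credits -= 1
    rx_buffer.append(flit)        # flit traverses the link
    return True

def receiver_drain():
    global credits
    if rx_buffer:
        rx_buffer.popleft()       # flit consumed downstream...
        credits += 1              # ...and a credit flows back to the sender

print(try_send("f0"), try_send("f1"), try_send("f2"))   # True True False
receiver_drain()
print(try_send("f2"))                                   # True
```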
Example: ACK/NACK Flow Control
[Figure: flits are transmitted and kept buffered at the sender while ACK/NACK indications propagate back; an ACK allows memory deallocation, while a NACK triggers retransmission (go-back-N)]
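A toy sketch of the ACK/NACK scheme with go-back-N; real links are pipelined with several flits in flight, which this simplified model ignores.

```python
# Toy sketch of ACK/NACK with go-back-N: the sender keeps unacknowledged
# flits buffered; a NACK for flit n makes it rewind and resend from n.

sent_log = []

def run_sender(flits, nack_at=None):
    i, nacked = 0, False
    while i < len(flits):
        sent_log.append(flits[i])          # transmit (and keep buffered)
        if i == nack_at and not nacked:
            nacked = True                  # NACK arrives for flit i:
            continue                       # go back and resend from i
        i += 1                             # ACK: flit i may be deallocated

run_sender(["f0", "f1", "f2", "f3"], nack_at=2)
print(sent_log)                            # ['f0', 'f1', 'f2', 'f2', 'f3']
```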
NoC Timing: Synchronous
- Flip-flops everywhere, plus a clock tree
- Much more streamlined design
- The clock tree burns 40% of the power budget, plus the flip-flops themselves
- Not easy to integrate cores running at different speeds
- Increasingly difficult to constrain skew and process variance
- Worst-case design
NoC Timing: Asynchronous
Potentially allows data to arrive at any time and solves process variance issues; the two-way handshake removes the "bet" of synchronous logic.
- Pros: average-case behaviour, lower power consumption, maximum flexibility in IP integration, more secure for encryption logic, less EMI
- Cons: much larger area, can be much slower (if really robust), much tougher to design
Intermediate implementations exist.
NoC Timing: Mesochronous
Attempts to optimize the latency of long paths.
- Everybody uses the same clock; senders embed their clock within packets
- Data is sent over long links and arrives out of sync with the receiver clock
- The embedded clock is used to sample incoming packets; a dual-clocked FIFO restores synchronization
- Tough to design, and somewhat defeats the NoC principles
[Figure: the sender drives Data and a Strobe (its clock, CK) across the link; at the receiver, a dual-clocked FIFO bridges to the receiver's CK domain]
xpipes is a library of NoC components: Network Interface (NI), Switch, Link.
- Configurable parameters such as flit width, amount of buffering, flow control and arbitration policies...
xpipes is designed to be scalable to future technology nodes, architecturally and physically:
- Leverages a cell synthesis flow, no hard macros
- Pipelined links to tackle wire propagation delay
A complete CAD flow is provided to move from the application task graph level to the chip floorplan.
The xpipes NoC
The xpipes NoC: the Network Interface
[Figure: an initiator NI and a target NI bridge OCP cores to the NoC topology; each NI packets/unpackets between the OCP clock domain and the xpipes clock domain, with a routing LUT on both the request and response channels]
- Performs packeting/unpacketing
- OCP 2.0 protocol to connect to IP cores
- Source routing via routing Look-Up Tables (LUTs); a sketch follows
- Dual-clock operation
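A minimal sketch of what source routing via a LUT in an initiator NI can look like; the address decode, table contents, and header layout are invented for illustration and do not reproduce xpipes' actual encoding.

```python
# Minimal sketch of source routing via a LUT in an initiator NI: the OCP
# address picks a precomputed path that is written into the packet header.

ROUTING_LUT = {
    0x00000000: [1, 0],        # address-range base -> port per switch
    0x10000000: [1, 2, 0],
}

def ni_build_header(ocp_addr: int) -> dict:
    """Map an OCP address to its destination's path (crude decode assumed)."""
    base = ocp_addr & 0xF0000000
    return {"path": list(ROUTING_LUT[base]), "addr": ocp_addr}

hdr = ni_build_header(0x10000040)
print(hdr["path"])   # [1, 2, 0]: ports to take at successive switches
```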
Basic OCP Concepts
- Point-to-point, unidirectional, synchronous: easy physical implementation
- Master/slave, request/response: well-defined, simple roles
- Extensions: added functionality to support cores with more complex interface requirements
- Configurability: match a core's requirements exactly; tailor the design to required features only
Reference: [SonicsInc]
Basic OCP Protocol
[Figure: an OCP Master and an OCP Slave connected by the signals MCmd [3], MAddr [32], MData [32], SCmdAccept, SResp [2], SData [32], MRespAccept]
Read transaction: a request phase (master drives MCmd, MAddr; slave answers SCmdAccept) followed by a response phase (slave drives SResp, SData; master answers MRespAccept).
Write transaction: a request phase only (master drives MCmd, MAddr, MData; slave answers SCmdAccept).
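As a toy illustration of the two phases, a minimal Python model; only the signal names (and the DVA response code) come from OCP, while the single-step sequencing and dict-as-memory slave are simplifications.

```python
# Toy model of the two OCP phases above; not a cycle-accurate model.

def ocp_read(slave_mem: dict, addr: int) -> int:
    mcmd, maddr = "RD", addr          # request phase: master drives command
    scmdaccept = True                 # slave accepts the request
    sresp, sdata = "DVA", slave_mem.get(maddr, 0)   # response phase
    mrespaccept = True                # master accepts the response
    return sdata

def ocp_write(slave_mem: dict, addr: int, data: int) -> None:
    mcmd, maddr, mdata = "WR", addr, data   # request phase only (posted)
    scmdaccept = True
    slave_mem[maddr] = mdata

mem = {}
ocp_write(mem, 0x100, 42)
print(ocp_read(mem, 0x100))           # 42
```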
OCP Extensions
- Simple extensions: byte enables, bursts, flow control/data handshake
- Complex extensions: threads and connections
- Sideband signals: interrupts, etc.
- Testing signals
The xpipes NoC: the Switch
[Figure: switch datapath with a crossbar, an allocator/arbiter, and routing & flow control logic]
- Input and/or output buffering
- Wormhole switching
- Supports multiple flow control policies
The xpipes NoC: the ACK/NACK Link
- Repeaters are pure registers
- Buffering and retransmission logic sit in the sender
[Figure: sender S and receiver R connected by repeater stages, each stage carrying FLIT, REQ and ACK signals]
The xpipes NoC: the STALL/GO Link
- Repeaters are two-entry FIFOs
- No retransmission is allowed
[Figure: sender S and receiver R connected by repeater stages, each stage carrying FLIT, REQ and STALL signals]
Speaking of Quality of Service...
Very challenging!

Signal processing: hard real time, very regular load, high quality, worst-case design; typically on DSPs
Media processing:  hard real time, irregular load, high quality, average-case design; SoC/media processors
Multi-media:       soft real time, irregular load, limited quality, average-case design; PC/desktop
Multimedia Application Demands
- Increasing functionality and heterogeneity
- Higher semantic content/entropy
- More dynamism
[Figure: instantaneous load, running average, structural load and worst-case load over time for a VBR MPEG DVD stream] [Gossens03]
Negotiating NoC Resources
[Figure: the same VBR MPEG DVD stream load profile, annotated with (re)negotiation points and the steady states between them] [Gossens03]
A QoS Approach
Essential to recover global predictability and improve performance. Applications require it! It fits well with the protocol stack concept.
What is QoS?
- The requester poses a service request (negotiation)
- The provider either commits to or rejects the request
- Renegotiate when requirements change
- After negotiation: steady states that are predictable
Guaranteed versus best-effort service. Types of commitment:
- correctness, e.g. uncorrupted data
- completion, e.g. no packet loss
- bounds, e.g. maximum latency
Best-effort services have better average resource utilisation, at the cost of unpredictable/unbounded worst-case behaviour. The combination of best-effort & guaranteed services is useful!
QoS + Best Effort
QoS in the Æthereal NoC
Conceptually, two disjoint networks:
- a network with throughput + latency guarantees (GT)
- a network without those guarantees (best-effort, BE)
Several types of commitment in the network combine guaranteed worst-case behaviour with good average resource usage.
[Figure: a guaranteed router and a best-effort router combined through priority/arbitration; the best-effort network carries the programming traffic]
Æthereal Router Architecture
Best-effort router:
- wormhole routing, input queueing, source routing
Guaranteed-throughput router:
- contention-free routing: synchronous, using slot tables (time-division multiplexed circuits)
- store-and-forward routing
- headerless packets: the routing information is present in the slot tables
A lot of hardware overhead!!!
Æthereal: Contention-Free Routing
Latency guarantees are easy in circuit switching. With packet switching, we need to "emulate" them: schedule packet injection into the network such that packets never contend for the same link at the same time.
- In space: disjoint paths
- In time: time-division multiplexing
Best-effort packets are used to set up connections: distributed, concurrent, pipelined, consistent. The slot assignment can be computed at build time, at run time, or by a combination; connection opening may be rejected. (A slot-table sketch follows.)
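A minimal sketch of slot-table admission in the spirit of Æthereal: a connection that uses link i of its path in slot (start + i) mod S is admitted only if all those (link, slot) pairs are free. The table size, link names, and paths are made up for the example.

```python
# Minimal sketch of contention-free (slot-table) admission: a connection
# uses link path[i] in slot (start + i) mod SLOTS and is admitted only if
# every required (link, slot) pair is still free.

SLOTS = 4
slot_table = {}   # (link, slot) -> connection id

def try_open(conn_id: str, path: list, start_slot: int) -> bool:
    needed = [(link, (start_slot + i) % SLOTS) for i, link in enumerate(path)]
    if any(key in slot_table for key in needed):
        return False                   # contention: opening is rejected
    for key in needed:
        slot_table[key] = conn_id      # reserve the slots
    return True

print(try_open("GT1", ["L0", "L1"], start_slot=0))   # True
print(try_open("GT2", ["L1", "L2"], start_slot=1))   # False: L1 busy in slot 1
print(try_open("GT2", ["L1", "L2"], start_slot=2))   # True
```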