Design and Analysis of Networks-on-Chip
in Heterogeneous Multicore Systems
Young Jin Yoon
Contents
• Motivation and Applications
• System Drivers
• On-Chip Communication and Networks-on-Chip
• Modeling and Tools
Motivation:
Moore’s Law and Performance of CPU
• Moore’s law
[Figure: from ITRS 2009]
1. Double the transistors every 18 months!
2. Do we double the performance?
– Limited by ILP diminishing returns
– Power problem with Out-of-Order (OoO) execution!
– ILP → TLP → Multi-Core Architecture
• Increasing the number of cores!
[Figure: performance growth per year by era: 25% / year, 52% / year, ?? % / year; labels: Bit-Level Parallelism, Instruction-Level Parallelism, TLP, Multicore. Sources: ITRS 2009; Computer Architecture: A Quantitative Approach]
Motivation:
System-on-Chip with Mobile Phones
• Performance vs. flexibility: 3.5G Mobile Phones
• 100 Giga-Operations-Per-Second (GOPS) within 1W
– 1 core running at 100GHz?
– 1000 cores running at 100MHz?
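The arithmetic behind the two options above can be sketched as follows. This is a minimal illustration, not taken from the slides: the function names are hypothetical, and it only checks how many operations per cycle each core must sustain and what energy budget per operation the 1W limit implies.

```python
def ops_per_cycle_per_core(target_gops: float, cores: int, freq_hz: float) -> float:
    """Operations each core must complete per clock cycle to reach the target."""
    return (target_gops * 1e9) / (cores * freq_hz)


def energy_budget_per_op_pj(power_w: float, target_gops: float) -> float:
    """Energy available per operation, in picojoules, under a power cap."""
    joules_per_op = power_w / (target_gops * 1e9)
    return joules_per_op * 1e12


# Both design points from the slide need one operation per cycle per core.
single_fast_core = ops_per_cycle_per_core(100, cores=1, freq_hz=100e9)
many_slow_cores = ops_per_cycle_per_core(100, cores=1000, freq_hz=100e6)

# The 1 W cap leaves roughly 10 pJ for every operation.
budget_pj = energy_budget_per_op_pj(power_w=1.0, target_gops=100)
```

Both configurations deliver the same nominal throughput, so the choice between them rests on which one can meet the energy budget per operation, which is the question the slide raises.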
1.[2]. Multi-Core for Mobile Phones
Motivation:
System-on-Chip with Consumer Devices
1.[3]. Heterogeneous Multi-Core Platform for Consumer Multimedia Applications
[Figure: consumer multimedia functions mapped onto heterogeneous cores. Functions: analog audio decoder, digital audio decoder, audio post-processing, analog video decoder, digital RAW video decoder, digital compressed video decoder (DCD with new format, DCD with established format), picture quality enhancement, content browsing and control. Cores: host CPU, VLIW processor cores, embedded control CPU, fixed-point DSP, function-specific HW cores]
Motivation:
System-on-Chip with Consumer Devices
• Legacy
• Re-usability
• Performance
• Flexibility
• Support of industry standards
1.[3]. Heterogeneous Multi-Core Platform for Consumer Multimedia Applications
Motivation:
Networks-on-Chip (NoC)
• How do we connect all cores?
– Bus vs. Point-to-Point vs. Crossbar and Mesh
• Difference between NoC and other Networks
– Less non-determinism
– Local, High-performance networks
– Energy-constraints
– Design-time Specialization
1.[6]. Networks on Chips: A New SoC Paradigm
Motivation:
Design and Analysis of NoC
[Figure: NoC layer stack. Software: Application, System. Arch. & Control: Transport, Network, Data Link. Physical: Wiring]
[Figure: NoC design flow in three phases: Application Modeling and Optimization; NoC Architecture Analysis and Optimization; NoC Design Validation and Synthesis. Flow steps: Application, Design Goals & Constraints, Code Partitioning, Communication Paradigm, Communication Infrastructure, Application Communication Analysis, Analysis & Optimization, Mapping & Scheduling, Simulation, Prototyping, NoC Testing, NoC Verification, Component Instantiation, Communication Component Library, Physical Synthesis & Tapeout]
1.[6]. Networks on Chips: A New SoC Paradigm
1.[7]. Outstanding Research Problems in NoC Design: System, Microarchitecture, and Circuit Perspective
Applications:
PARSEC vs. SPLASH-2
• PARSEC benchmarks
– Multithreaded
– Emerging Workload
– Diverse
– State-of-the-art Techniques
– Support Research
• Similarity research
– Principal Component Analysis(PCA) based on 3 groups
• Inst. Mix: 4 characteristics
• Working Sets: 8 characteristics
• Sharing: 32 characteristics
1.[4]. PARSEC vs. SPLASH-2: A Quantitative Comparison of Two Multithreaded Benchmark Suites on Chip-Multiprocessors
A Communication Characterization of SPLASH-2 and PARSEC
Applications:
Mobile Architecture
• Benchmarks for Embedded computing
– EEMBC, MiBench…
• Mobile Architecture
– Strict power constraints
• Dynamic power management
– Users determine the power consumption
1.[5]. Into the Wild: Studying Real User Activity Patterns to Guide Power Optimizations for Mobile Architectures
Contents
• Motivation and Applications
• System Drivers
• On-Chip Communication and Networks-on-Chip
• Modeling and Tools
Operating System
• How to Manage Heterogeneous Multicores?
– Cores & Systems are diverse.
– The interconnect matters.
– Messages cost less than shared Memory.
2.[1]. The Multikernel: A New OS Architecture for Scalable Multicore Systems
Core Parallelism, Power, and Temperature
• Performance and Power
– Same total parallelism (4P-8W vs. 8P-4W)
– Same power but better throughput on 8P than 4P
– Energy-Delay Product (EDP) and Energy-Delay^2 Product (ED2P)
2.[2]. Design Space Exploration for Multicore Architectures: A Power/Performance/Thermal View
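The two metrics named on this slide can be written out as a small sketch. The definitions used here (EDP = energy × delay, ED2P = energy × delay²) are the standard ones; the function names and any sample numbers a reader plugs in are illustrative assumptions, not results from the cited paper.

```python
def edp(energy_j: float, delay_s: float) -> float:
    """Energy-Delay Product: weighs energy and delay equally."""
    return energy_j * delay_s


def ed2p(energy_j: float, delay_s: float) -> float:
    """Energy-Delay^2 Product: weighs delay more heavily than energy."""
    return energy_j * delay_s ** 2


def energy_from_power(power_w: float, delay_s: float) -> float:
    """Energy consumed when running at a constant power for the given time."""
    return power_w * delay_s
```

When two configurations draw the same power but one finishes sooner, the faster one wins on both metrics, since its energy and its delay are both lower; ED2P simply widens the gap.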
Core Parallelism, Power, and Temperature
• Temperature Spatial Distribution
– Paired vs. Lined up vs. Centered
[Figure annotation: 1~2 °C lower than the others, due to the large L2 cache]
2.[2]. Design Space Exploration for Multicore Architectures: A Power/Performance/Thermal View
Memory Hierarchy:
On-Chip Memory
• Cache vs. Scratch-pad
– Both scale equally well up to 16 cores.
– Streaming applications
• Scratch-pad memory > Transparent Cache
– Caches will suffer in large-scale CMPs.
• Scratch-pad may be able to address the problem.
3.[3]. Memory Systems: Cache, DRAM, Disk
3.[6]. Comparing Memory Systems for Chip Multiprocessors
Addressing \ Mgmt.   Implicit                   Explicit
Transparent          Transparent cache          Software-managed cache
Non-Transparent      Self-managed scratch-pad   Scratch-pad memory
Memory Hierarchy:
Cache Coherence Protocol
3.[3]. Memory Systems: Cache, DRAM, Disk
Token Coherence: Decoupling Performance and Correctness
Property         Snoop-based   Directory-based   Token-based
Ordering Point   NoC           Directory         Caches w/ retransmission
Indirect?        N             Y                 N
Broadcast?       Y             N                 Y
Performance?     Fast          Slow              Moderate
Unordered NoC?   N             Y                 Y
[Figure: caches 0 … n attached to the NoC for each of the three protocols; the directory-based variant adds a directory (Dir)]
Intelligent NoCs for Cache Coherence:
INSO and INCF
• Two main problems with snoop-based coherence in unordered NoCs:
1. Incorrect
– In-Network Snoop Ordering (INSO): route messages as ordered
2. Broadcast messages
– In-Network Coherence Filtering (INCF): filter unnecessary broadcasts
[Figure: INSO example with the sets {0,2,4} and {1,3,5}; INCF filter table with Addr and Dest columns]
In-Network Snoop Ordering
2.[4]. In-Network Coherence Filtering: Snoopy Coherence without Broadcasts
Memory Controller
• On-Chip Memory Controller
– Where to place them?
• Performance
– Row ≈ Column < Diagonal X ≈ Diamond
– The gap can be narrowed by choosing the routing algorithm wisely
• Class-based Deterministic Routing (CDR)
2.[5]. Achieving Predictable Performance through Better Memory Controller Placement in Many-Core CMPs
[Figure: memory controller placements: Row, Column, Diagonal X, Diamond]
Off-Chip Network & Memory
• Bandwidth wall
– Due to pin-limitations, power constraints and package costs
– Memory scales only 10% per year
• Bandwidth Conservation Techniques
2.[7]. Scaling the Bandwidth Wall: Challenges in and Avenues for CMP Scaling
Network-on-Chip:
Terminology
• Topology
– Indirect vs. Direct
• Routing
– Deterministic vs. Adaptive
• Flow Control
– Arbitration
– Circuit-Switched
– Packet-Switched
• Worm-Hole and Virtual-Channel
– Hop-to-hop Flow-Control
3.[1]. Principles and Practices of Interconnection Networks
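As one concrete instance of the deterministic routing mentioned above, the sketch below implements dimension-order (XY) routing on a 2D mesh. The choice of XY routing, the row-major node numbering, and the function name are assumptions made for illustration; the slide itself does not name a specific algorithm.

```python
def xy_route(src: int, dst: int, width: int) -> list[int]:
    """Return the node sequence from src to dst on a mesh that is `width` wide.

    Nodes are numbered row-major (node = y * width + x). The packet first
    travels along X until the column matches, then along Y. The path depends
    only on the source and destination, which is what makes it deterministic.
    """
    x, y = src % width, src // width
    dst_x, dst_y = dst % width, dst // width
    path = [src]
    while x != dst_x:                      # X dimension first
        x += 1 if dst_x > x else -1
        path.append(y * width + x)
    while y != dst_y:                      # then Y dimension
        y += 1 if dst_y > y else -1
        path.append(y * width + x)
    return path
```

For example, on a 4-wide mesh `xy_route(0, 15, 4)` walks the top row to node 3 and then moves down the last column. An adaptive algorithm would instead pick among several permitted paths based on network state such as congestion.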
Network-on-Chip:
Router Microarchitecture
[Figure: router microarchitecture: routing logic/table, VC allocators, switch allocators, crossbar; pipeline stages BW, RC, VA, SA, ST, LT]
• Topology
• Routing
• Flow Control
– Arbitration
– Worm-Hole
– Hop-to-Hop
– Virtual Channel
• Router Pipelines
3.[1]. Principles and Practices of Interconnection Networks
• 4 clock cycles (c.c.) are spent for each link traversal
Router Microarchitecture:
Reducing Pipelines
• Speculative Routing
• Lookahead Routing
• Speculation + Lookahead Routing
[Figure: head-flit and body/tail-flit timing for four router pipelines: Baseline, Speculative, Lookahead, and Speculation + Lookahead; stages BW, RC (or NRC), VA, SA, ST, LT]
3.[1]. Principles and Practices of Interconnection Networks
Performance and Cost Metrics
• Performance Metrics
• Cost Metrics
– Average or peak energy/power consumption
– Network area overhead and total area
– Average or peak temperature
3.[1]. Principles and Practices of Interconnection Networks
1.[7]. Outstanding Research Problems in NoC Design
          Delivery Speed      Channel Usage
Ideal     Zero-load Latency   Bisection Bandwidth
Average   Average Latency     Average Throughput
Worst     Maximum Latency     Peak Throughput
Topology:
Flattened Butterfly
• Flattened Butterfly vs. Mesh
3.[2]. Flattened Butterfly Topology for On-Chip Network
[Figure: 3-ary 3-fly network (3-stage Bfly) and its Flattened Butterfly, nodes 0 to 8; FBfly layout vs. Mesh layout]
T0 = Th + Ts + Tw
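The zero-load latency expression T0 = Th + Ts + Tw can be evaluated with a short sketch. The breakdown used here is an assumption based on the usual textbook reading (Th as hop count times per-hop router delay, Ts as serialization latency, Tw as wire time of flight); the parameter names are hypothetical and all values are in clock cycles.

```python
def zero_load_latency(hops: int, router_cycles: int, packet_bits: int,
                      channel_bits_per_cycle: int, wire_cycles: int) -> int:
    """Evaluate T0 = Th + Ts + Tw in clock cycles.

    Th: header latency, assumed to be hops * per-hop router delay.
    Ts: serialization latency, packet length over channel width, rounded up.
    Tw: total wire time of flight along the path.
    """
    t_h = hops * router_cycles
    t_s = -(-packet_bits // channel_bits_per_cycle)   # ceiling division
    t_w = wire_cycles
    return t_h + t_s + t_w
```

Under this reading, a topology that cuts the hop count lowers Th even if the total wire distance, and therefore Tw, stays the same, which is the trade the flattened butterfly makes against the mesh.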
Microarchitecture:
Enhance Arbitration
• SPAROFLO
– Speculative Priority Assignment (SPA)
– Recreate Old (RO)
– Flow (FLO)
3.[3]. A 4.6Tbits/s 3.6GHz Single-Cycle NoC Router with a Novel Switch Allocator in 65nm CMOS
[Figure: switch allocation requests and grants at Clock n and Clock (n+1); allocator blocks: V:1 local arbiters, P:1 global arbiter, SPA priority encoder, conflict detect, sequential retry queue (size(Q) != 0?), top loser, grants from other global arbiters, final grant; inputs: request vector, port priority]
Bufferless Network
• Buffers in NoC
– Energy, area, complexity
• Can we design network without buffers?
– Deflective routing vs. Packet/Flit dropping
• BLESS
– Deflective bufferless Network
– FLIT-BLESS vs. WORM-BLESS
• Problems
– Injection problem
– Livelock
– Throughput and Latency
3.[5]. A Case for Bufferless Routing in On-Chip Networks
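Deflective routing at a single bufferless router can be sketched as below. The oldest-first ranking follows the spirit of FLIT-BLESS as commonly described, but the data layout, the function name, and the lowest-numbered-port tie-break for deflections are assumptions for illustration, not the paper's exact mechanism.

```python
def assign_output_ports(flits: list[tuple[int, list[int]]],
                        num_ports: int) -> dict[int, int]:
    """Assign every incoming flit to a distinct output port in one cycle.

    flits: one (age, productive_ports) entry per incoming flit, where
           productive_ports lists the outputs that move it closer to its
           destination.
    Returns a mapping from flit index to the assigned output port.
    """
    if len(flits) > num_ports:
        raise ValueError("a bufferless router needs one output per incoming flit")
    free = set(range(num_ports))
    assignment = {}
    # Oldest flits choose first, so the oldest flit is never deflected.
    for idx in sorted(range(len(flits)), key=lambda i: -flits[i][0]):
        _, productive = flits[idx]
        port = next((p for p in productive if p in free), None)
        if port is None:
            port = min(free)        # deflection: take any free port instead
        free.remove(port)
        assignment[idx] = port
    return assignment
```

Because no flit waits in a buffer, every flit leaves on some port each cycle. Giving priority to the oldest flit is one way to bound how long a flit can keep being deflected, which relates to the livelock problem listed above.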
Quality of Service (QoS)
• Quality of Service
– Local Fairness ≠ Global Fairness
– Some packets are more important than others.
• Round-Robin vs. Age-based vs. deadline-based
3.[6]. Globally-Synchronized Frames for Guaranteed Quality-of-Service in On-Chip Networks
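Two of the arbitration policies named above can be contrasted in a short sketch. The function names and the list-based request encoding are assumptions for illustration.

```python
from typing import Optional


def round_robin(requests: list[bool], last_grant: int) -> Optional[int]:
    """Grant the first requester after the previously granted input."""
    n = len(requests)
    for offset in range(1, n + 1):
        candidate = (last_grant + offset) % n
        if requests[candidate]:
            return candidate
    return None


def age_based(ages: list[Optional[int]]) -> Optional[int]:
    """Grant the input whose packet has been in the network longest.

    ages[i] is the age of the packet waiting at input i, or None if the
    input has no request. Ties go to the lowest-numbered input.
    """
    candidates = [(age, -i) for i, age in enumerate(ages) if age is not None]
    if not candidates:
        return None
    _, negated_index = max(candidates)
    return -negated_index
```

Round-robin is fair only among the inputs of one router, while age-based arbitration uses information carried with the packet, so it approximates fairness across the whole network. This is one way to read the "Local Fairness ≠ Global Fairness" bullet.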
QoS: Globally-Synchronized Frames (GSF)
• Deadline-Based Arbitration is impractical
– Infinite-sized sorting queues
– Large overhead for sending and storing the deadline
• Source-managed QoS (e.g. GSF)
– Frame-based approach
• Sorting across frames not within a frame
[Figure: three arbitration schemes: deadline-based with infinite searchable buffer (earliest-deadline selector); frame-based with per-frame buffers and infinite frame window; frame-based with circular frame buffers and finite frame window (head pointer)]
3.[6]. Globally-Synchronized Frames for Guaranteed Quality-of-Service in On-Chip Networks
QoS: Other approaches
• Router-Based QoS
– Preemptive Virtual Clock (PVC)
: Router-based dynamic bandwidth allocation
• Application-Aware QoS
– What performance do we really care about?
• Network vs. application
– Stall-Time-Criticality (STC)
Preemptive Virtual Clock: A Flexible, Efficient and Cost-effective QoS scheme for Networks-on-Chip
Application-Aware Prioritization Mechanisms for On-Chip Networks
[Figure: compute and stall timelines for packets A, B, and C under two different packet orderings]
Polymorphic On-Chip Networks
• No single network fits all workloads.
3.[4]. Polymorphic On-Chip Networks
[Figure: average throughput (bits/cycle) vs. average packet latency (cycles) under random permutation traffic for Meshes, Butterflies, Fat Trees, Flattened Butterflies, and Rings, with the Pareto-optimal points marked]
Polymorphic On-Chip Networks
• Let’s provide network resources
– Users can statically configure the NoC before running applications
– e.g. Unidirectional Ring
[Figure: configurable network resources arranged as a unidirectional ring over nodes 0 to 3]
3.[4]. Polymorphic On-Chip Networks
Hop-to-hop: wires and interconnects
• Network-on-Chip: Floorplans
• Interconnect Requirement from ITRS
Cu Interconnect (ITRS)            2011  2012  2013  2014  2015  2016
Gate Length (nm)                    16    14    13    11    10     9
Intermediate: RC Delay (ps)       1291  1455  1842  2406  2670  3341
Intermediate: Line length (um)      16    15    12     9     8     7
Global: RC Delay (ps)              487   557   705   921  1004  1297
Global: Line length (um)            26    23    19    15    13    11
FITs /m /cm^2                        2   1.6   1.6   1.4   1.3   1.1
4.[3]. COSI: A Framework for the Design of Interconnection Networks
1.[1]. International Technology Roadmap for Semiconductors (ITRS): 2009 edition
Latency Insensitive Design
• Time-to-market constraints
• Intellectual-Property design modules (IP Cores)
• Interconnect latency
– Hard to estimate in early design stage
– Conservative estimation: suboptimal design
• Latency Insensitive Design (LID)
4.[2]. Coping with Latency in SOC Design
[Figure: five pearls (IP cores), each wrapped in a shell (Shell 1 to Shell 5), connected through relay stations (RS); channels carry data with void symbols forward and backpressure backward]
Hop-to-Hop Flow Control
• Channels between two routers
– The longer the wire, the slower the messages are delivered.
• Put some intelligence on the channel!
– Link pipelining with distributed buffers
3.[1]. Principles and Practices of Interconnection Networks
3.[7]. Distributed Flit-Buffer Flow Control for Networks-on-Chip
ON/OFF Credit Ack/Nack
- - -
2+5K 2+3K 1+3K
2+2K 2+2K 1+2K
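Of the three schemes compared above, credit-based flow control is the simplest to sketch from the sender's side. The class and method names are hypothetical; the sketch assumes one credit per downstream buffer slot and ignores the round-trip delay of credits.

```python
class CreditLink:
    """Sender-side credit counter for one channel between two routers."""

    def __init__(self, buffer_slots: int) -> None:
        if buffer_slots <= 0:
            raise ValueError("the receiver needs at least one buffer slot")
        self.capacity = buffer_slots
        self.credits = buffer_slots      # one credit per free downstream slot

    def can_send(self) -> bool:
        return self.credits > 0

    def send_flit(self) -> None:
        """Consume a credit; the flit will occupy one downstream slot."""
        if not self.can_send():
            raise RuntimeError("no credits: the sender must stall")
        self.credits -= 1

    def receive_credit(self) -> None:
        """The receiver freed a slot and returned a credit upstream."""
        if self.credits >= self.capacity:
            raise RuntimeError("more credits returned than slots exist")
        self.credits += 1
```

A sender with two credits can push two flits back to back and must then stall until a credit returns. On a long, pipelined link the credit round trip grows, which is why buffer requirements in such schemes depend on the link length.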
[Figure: link pipelining options between routers: Flip-Flops, Relay-Stations, and Inverters or Latches, each stage with control logic, forward data, and backpressure (BP) signals]
Globally Asynchronous,
Locally Synchronous (GALS) Circuit
• The problems of Clock Distribution
– Design Complexity, Noise, and Power
• Local clock w/ asynchronous communication
Property            Pausible Clocking             FIFO-based    Boundary Synchronization
Area Overhead       Low                           Med to High   Low
Latency             Low                           High          Med
Throughput          Depends on clock pause rate   High          Med
Power Consumption   Low                           High          Med
3.[9]. Globally Asynchronous, Locally Synchronous Circuits: Overview and Outlook
[Figure: three ways to connect local synchronous modules 1 and 2: pausible clock generators at the output and input ports; an asynchronous FIFO; boundary synchronization with registers (REG)]
Robust Interfaces for Mixed-Timing Systems
• Partition FIFOs into reusable components
– Reusable Put and Get Cell sub-modules required
3.[10]. Robust Interfaces for Mixed-Timing Systems
[Figure: mixed-timing FIFO built from an array of cells with Put Ctrl, Get Ctrl, Full Detector, and Empty Detector. Put interface: req_put, data_put, CLK_put, full, ack_put. Get interface: req_get, data_get, CLK_get, valid_get, empty, ack_get. Variants: Sync-Sync FIFO, Async-Sync FIFO, Async-Async FIFO, Sync-Async FIFO]
Robust Interfaces for Mixed-Timing Systems
Sync-Sync FIFO
• Partition FIFOs into reusable components
– Reusable Put and Get Cell sub-modules required
[Figure: Sync-Sync FIFO cell with a register (REG) and an SR latch. Put side: req_put, data_put, en_put, CLK_put, full_i, tok_in_put, tok_out_put. Get side: valid_get, data_get, en_get, CLK_get, empty_i, tok_in_get, tok_out_get]
3.[10]. Robust Interfaces for Mixed-Timing Systems
Robust Interfaces for Mixed-Timing Systems
Async-Async FIFO
• Partition FIFOs into reusable components
– Reusable Put and Get Cell sub-modules required
– Only Data Validity Controller sub-module needs to be modified
• Implement Relay Stations with Mixed Timing FIFO
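[Figure: Async-Async FIFO cell with a register (REG) and C-elements; internal signals wr, wa, rr, ra. Put side: req_put, data_put, ack_put, tok_in_put, tok_out_put. Get side: req_get, data_get, ack_get, tok_in_get, tok_out_get]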
3.[10]. Robust Interfaces for Mixed-Timing Systems
Dynamic Voltage-Frequency Scaling (DVFS)
• Power Consumption of NoC
– Up to 28% of total power is spent on the NoC
– Router frequency: critical design parameter
• Network power vs. network latency
• Dynamic power management for routers
– Clock Scaling and Time Stealing
A Case for Dynamic Frequency Tuning in On-Chip Networks
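The reason voltage and frequency are scaled together can be seen from the usual first-order CMOS dynamic power approximation, P ≈ C·V²·f. The sketch below is only that textbook approximation, not a model from the cited paper; the function name and the sample capacitance are assumptions, and leakage power is ignored.

```python
def dynamic_power(c_eff_farads: float, vdd_volts: float, freq_hz: float) -> float:
    """First-order dynamic power: switched capacitance * Vdd^2 * frequency."""
    return c_eff_farads * vdd_volts ** 2 * freq_hz


# Hypothetical router: 1 nF effective switched capacitance at 1 V and 1 GHz.
nominal = dynamic_power(1e-9, 1.0, 1e9)
# Halving frequency alone halves power; halving voltage as well gives 1/8.
freq_only = dynamic_power(1e-9, 1.0, 0.5e9)
freq_and_voltage = dynamic_power(1e-9, 0.5, 0.5e9)
```

Lowering the router frequency saves power but raises per-hop latency, which is the network power vs. network latency trade-off listed on the slide.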
Asynchronous NoC
• Mesh-of-Trees(MoT) variants
– No Switch(i.e. crossbar) is required
– Can be implemented with
• Simple routers (for fan-out)
• Simple arbiters (for fan-in)
[Figure: Mesh-of-Trees network with a row forest and a column forest (row-column shifter) over nodes 0 to 3; fan-out router with two latch controllers and toggles (Req0/Ack0, Req1/Ack1, Data0, Data1, Data_In); fan-in arbiter with mutex, latches L1 to L7, Mux_Select, Req_Out, Ack_In, Data_Out, split into flow control unit, datapath, and latch controller]
3.[11]. A Low-Overhead Asynchronous Interconnection Network for GALS Chip Multiprocessors
Reliable Hop-to-Hop transmission
• On-chip interconnect errors
• Using High Voltage
– Reduce error rate
– Limited in delay and area, and consumes more energy
• Use low voltage with error correction code
– Type-II HARQ with low-swing channel
3.[8]. On Hamming Product Codes With Type-II Hybrid ARQ for On-Chip Interconnects
Adaptive Error Control For Nanometer Scale Network-on-Chip Links
[Figure: 3-bit code space on axes x, y, z showing Hamming distance, e.g. HD(011,100) = 3; sender and receiver exchanging an n × k block]
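The Hamming distance shown in the figure, HD(011,100) = 3, and what a minimum distance implies for error control can be checked with a short sketch. The function names are hypothetical; the detection and correction bounds are the standard ones for block codes.

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of bit positions in which two equal-length codewords differ."""
    if len(a) != len(b):
        raise ValueError("codewords must have the same length")
    return sum(bit_a != bit_b for bit_a, bit_b in zip(a, b))


def detectable_errors(min_distance: int) -> int:
    """A code with minimum distance d detects up to d - 1 bit errors."""
    return min_distance - 1


def correctable_errors(min_distance: int) -> int:
    """A code with minimum distance d corrects up to floor((d - 1) / 2) errors."""
    return (min_distance - 1) // 2


distance = hamming_distance("011", "100")   # 3, as in the figure
```

A larger minimum distance lets a link tolerate more bit flips, which is what makes a low-swing, low-voltage channel usable despite its higher raw error rate.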
Photonic NoCs
• The benefits of Photonic communication
– Bandwidth
– Power Dissipation
• Hybrid Photonic vs. electronic NoCs
– Same execution time: 7.6W vs 244W
– Same power dissipation: 960Gbps vs 100Gbps
3.[12]. Photonic NoCs: System-Level Design Exploration
Contents
• Motivation and Applications
• System Drivers
• On-Chip Communication and Networks-on-Chip
• Modeling and Tools
Modeling and Tools
[Figure: research challenges around a many-core system: 1. Synthesis; 2. Custom IP blocks; 3. Validation; 4. Models of CMOS devices and interconnects; 5. End-user feedback; 6. Dynamic, reconfigurable network tools; 7. Application instrumentation. Other labels: many-core system constraints, hardware]
4.[1]. Research Challenges for On-Chip Interconnection Networks
COSI: NoC Design Automation
• Can we automate NoC design?
• Communication Synthesis Infrastructure (COSI)
– Network specification
– Library of building blocks
– Quantified performance and cost models
– Optimization Algorithms
4.[3]. COSI: A Framework for the Design of Interconnection Networks
[Figure: COSI flow. Models, Rules & Platforms: Orion, Ho’s models. Algorithms: K-merging, shortest path, … Library: topology, links, routers. Input and output graphs over nodes 0 to 5, with an edge annotated (10,100)]
ORION: NoC Power and Area Model
4.[4]. ORION 2.0: A Fast and Accurate NoC Power and Area Model for Early-Stage Design Space Exploration
• Power: the most critical design constraint.
– Power of NoC will also be substantial
– How to estimate NoC power in the early-design stage?
FAST: Architectural Simulation
• Good simulators
– speed, accuracy, completeness, transparency
– inexpensiveness, up-to-date, and easy-to-use, …
• The functional model of FAST
– Keep generating instruction stream
– Roll back when mis-speculations occur
4.[5]. FPGA-Accelerated Simulation Technologies (FAST): Fast, Full-System, Cycle-Accurate Simulators
[Figure: (a) Event-driven architectural simulator: the functional model and timing model exchange Inst. and Next Inst. (b) FAST: the functional model sends an instruction trace through a trace buffer to the timing model on an FPGA, which returns roll back / commit]
Conclusion
[Figure: NoC layer stack and three-phase design flow, as on the slide "Design and Analysis of NoC"]
1.[6]. Networks on Chips: A New SoC Paradigm
1.[7]. Outstanding Research Problems in NoC Design: System, Microarchitecture, and Circuit Perspective
Questions?
Backup slides
An example of MUTEX Circuit
ReCycle: Pipeline Adaptation
to Tolerate Process Variation
Simulation:
Open-loop vs. Closed-loop simulation
• Open-loop
– NI with infinite queue
• Isolate the effect of the network design from the injection
– e.g. synthetic traffic patterns
• Closed-loop
– Closer to the actual system
– NI with finite queue
– e.g. full-system simulations
Principles and Practices of Interconnection Networks
Simulation:
Synthetic Traffic model
• Synthetic Traffic model
– Based on statistical analysis of the traffic
– Traffic Patterns
• Random
• Bit permutations
– Bit complement, Bit reverse, Bit rotation, Shuffle, Transpose
• Digit permutations
– Tornado, Neighbor
– Constant injection rate over time
• Actual traffic: bursty!
Principles and Practices of Interconnection Networks
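The permutation patterns listed above map each source address to one destination. The sketch below follows the definitions as they are commonly given for these patterns (for a b-bit address, or a radix-k digit for tornado and neighbor); the function names are hypothetical, and the exact conventions should be checked against the cited textbook before relying on them.

```python
def bit_complement(src: int, bits: int) -> int:
    """Destination bit i is the inverse of source bit i."""
    return ~src & ((1 << bits) - 1)


def bit_reverse(src: int, bits: int) -> int:
    """Destination bit i is source bit (bits - 1 - i)."""
    return int(format(src, f"0{bits}b")[::-1], 2)


def bit_rotation(src: int, bits: int) -> int:
    """Destination bit i is source bit (i + 1) mod bits: rotate right by one."""
    return (src >> 1) | ((src & 1) << (bits - 1))


def shuffle(src: int, bits: int) -> int:
    """Destination bit i is source bit (i - 1) mod bits: rotate left by one."""
    return ((src << 1) | (src >> (bits - 1))) & ((1 << bits) - 1)


def transpose(src: int, bits: int) -> int:
    """Swap the upper and lower halves of the address (bits must be even)."""
    half = bits // 2
    return ((src << half) | (src >> half)) & ((1 << bits) - 1)


def tornado(digit: int, k: int) -> int:
    """Send almost halfway around a k-node ring: (x + ceil(k/2) - 1) mod k."""
    return (digit + (k + 1) // 2 - 1) % k


def neighbor(digit: int, k: int) -> int:
    """Send to the next node in the dimension: (x + 1) mod k."""
    return (digit + 1) % k
```

Each pattern stresses a topology differently: transpose and bit complement load the bisection heavily, while neighbor traffic stays local. All of them keep a constant injection rate, unlike the bursty traffic of real applications noted above.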
Simulation:
Summary
• Trade-off between accuracy and simulation time
– Synthetic traffic model
• Fast simulation time, less accurate
– Event-driven simulation
• Slow simulation time, more accurate
– RTL-level simulation
• The slowest, but even more accurate
Applications
• PARSEC vs. SPLASH-2
– Diversity
– State-of-the-art Algorithms
– Input dataset
• Comparison
– Instruction Mix, Working Sets, and Sharing
– Communication
A Communication Characterization of SPLASH-2 and PARSEC
PARSEC vs. SPLASH-2: A Quantitative Comparison of Two Multithreaded Benchmark Suites on Chip-Multiprocessors
Applications:
PARSEC vs. SPLASH-2
• PARSEC vs. SPLASH-2
– Diversity
– State-of-the-art Algorithms
– Input dataset
• Similarity research
– Principal Component Analysis (PCA)
– 44 parameters
• Including Inst. Mix, Working Sets, and Sharing
A Communication Characterization of SPLASH-2 and PARSEC
PARSEC vs. SPLASH-2: A Quantitative Comparison of Two Multithreaded Benchmark Suites on Chip-Multiprocessors
Applications:
PARSEC vs. SPLASH-2 (cont’)
• Communication Comparison
– Spatial Behavior: Less Distinct
– Temporal Behavior: More Bursty
– Producer–Consumer: Multi-to-Multi
PARSEC vs. SPLASH-2
Operating System
• Real-Time Operating System
– How to deliver the real-time requirement
• Operating System coexistence
– Multiprocessor with heterogeneous cores
– Some simpler cores may require an RTOS
– Some complex cores can run a general-purpose OS
– How to manage those issues?
• Using a hypervisor?
The Multikernel: A New OS Architecture for Scalable Multicore Systems
A Unified Operating System for Clouds and Manycore: fos
Process Scheduling Challenges in the Era of Multi-Core Processors
Reliable Hop-to-Hop transmission
• On-chip interconnect errors
• Using High Voltage
– Reduce error rate
– Limited in delay and area, and consumes more energy
• Use low voltage with error correction code
– Increases the error rate, but corrects errors when they happen
– Type-II HARQ with low-swing channel
On Hamming Product Codes With Type-II Hybrid ARQ for On-Chip Interconnects
Adaptive Error Control for Nanometer Scale Network-on-Chip Links
End-to-End flow control
• Message-dependent Deadlock
– Deadlock avoidance
• Virtual Network
• Credit-Based (CB)
– Deadlock recovery
• Regressive
• Deflective
• Progressive
• CTC: Connect-Then-Credit
– 3-way handshake to exchange credits
• P_REQ, P_ACK, and data
Principles and Practices of Interconnection Networks
CTC: An End-To-End Flow Control Protocol for SoC Architectures
Intelligent NoCs for Cache Coherence:
INSO and INCF
• Two main problems with Snoop-based Coherence in unordered NoCs:
1. Incorrect
– In-Network Snoop Ordering (INSO)
2. Broadcast messages
– In-Network Coherence Filtering (INCF)
In-Network Snoop Ordering
In-Network Coherence Filtering: Snoopy Coherence without Broadcasts
Traditional Network-on-Chip
[Figure: 4×4 mesh (nodes 0 to 15) and router microarchitecture: routing logic/table, VC allocator, switch allocator, crossbar; pipeline stages BW, RC, VA, SA, ST, LT]
Principles and Practices of Interconnection Networks
Bufferless Network
• Buffers in NoC
– Energy, area, complexity
• Can we design a network without buffers?
– Deflective routing vs. Packet or Flit dropping
A Case for Bufferless Routing in On-Chip Networks
SCARAB: A Single Cycle Adaptive Routing and Bufferless Network
Circuit-switched NoC
• Benefit of circuit-switched NoC
A 2.9Tb/s 8W 64-Core Circuit-Switched Network-on-Chip in 45nm CMOS
Winning the Pinning in NoC
Bufferless Network
• Buffers in NoC
– Energy, area, complexity
• Can we design a network without buffers?
– Deflection routing vs. packet/flit dropping
• BLESS
– Deflection-based bufferless network
– FLIT-BLESS vs. WORM-BLESS
• Problems
– Injection problem
– Livelock
– Throughput and latency
A Case for Bufferless Routing in On-Chip Networks
SCARAB: A Single Cycle Adaptive Routing and Bufferless Network
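To make the deflection idea concrete, here is a minimal sketch of one bufferless router's per-cycle port assignment, assuming oldest-first arbitration in the spirit of FLIT-BLESS. The function name, the flit tuple layout, and the port labels are illustrative assumptions and are not taken from the cited papers.

```python
def assign_ports(flits, ports):
    """Assign every incoming flit to a distinct output port in one cycle.

    flits: list of (flit_id, age, preferred_ports); a larger age means older.
    ports: output port names; there must be at least as many ports as flits.
    Returns {flit_id: (port, deflected)}.
    """
    if len(flits) > len(ports):
        raise ValueError("a bufferless router needs a free output for every flit")
    free = list(ports)
    result = {}
    # The oldest flit chooses first, so it gets a productive port if one is free.
    for flit_id, age, preferred in sorted(flits, key=lambda f: -f[1]):
        productive = [p for p in preferred if p in free]
        # With no productive port left, the flit is deflected to any free port.
        port = productive[0] if productive else free[0]
        free.remove(port)
        result[flit_id] = (port, not productive)
    return result


# Hypothetical cycle: flits "a" and "b" both want East; "b" is older and wins.
routes = assign_ports(
    [("a", 5, ["E"]), ("b", 9, ["E"]), ("c", 1, ["N", "E"])],
    ["N", "E", "S", "W"],
)
print(routes)
```

Every flit leaves on some port each cycle, so no buffer is needed, but a deflected flit takes a longer path. That is the source of the livelock and latency concerns listed above.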
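Router Microarchitecture:
Reducing Pipelines
• Spends 4 clock cycles (c.c.) per link traversal
Baseline Router Pipeline
Head Flit: BW/RC, VA, SA, ST, LT
Body & Tail Flit: BW, -, SA, ST, LT
(BW: buffer write, RC: route computation, VA: virtual-channel allocation, SA: switch allocation, ST: switch traversal, LT: link traversal)
Principles and Practices of Interconnection Networks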
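As a rough illustration of why pipeline depth matters, the sketch below estimates the zero-load latency of a packet. It assumes 4 router cycles plus 1 link cycle per hop, with body and tail flits following the head one per cycle. The function name, the 4-hop path, the 5-flit packet, and the 2-cycle router used for comparison are illustrative assumptions, not figures from the slides.

```python
def zero_load_latency(hops, router_cycles, link_cycles=1, flits=1):
    """Cycles for a whole packet to arrive when there is no contention."""
    head = hops * (router_cycles + link_cycles)  # head flit crosses every hop
    return head + (flits - 1)  # remaining flits trail the head, one per cycle


baseline = zero_load_latency(hops=4, router_cycles=4, flits=5)
shorter = zero_load_latency(hops=4, router_cycles=2, flits=5)
print(baseline, shorter)  # 24 16
```

In this simplified model the per-hop router delay is multiplied by the hop count, so removing pipeline stages shortens latency on every hop.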
Router Microarchitecture:
Reducing Pipelines
• Speculative Routing
[Figure: Baseline Router Pipeline compared with Speculative Router Pipeline, for head flit and body & tail flit; stages shown: BW, RC, VA, SA, ST, LT]
Principles and Practices of Interconnection Networks
Router Microarchitecture:
Reducing Pipelines
• Lookahead Routing
[Figure: Baseline Router Pipeline compared with Lookahead Router Pipeline, for head flit and body & tail flit; stages shown: BW, NRC, VA, SA, ST, LT]
Principles and Practices of Interconnection Networks
Router Microarchitecture:
Reducing Pipelines
• Speculation + Lookahead Routing
[Figure: Baseline Router Pipeline compared with Speculation + Lookahead Router Pipeline, for head flit and body & tail flit; stages shown: BW, NRC, VA, SA, ST, LT]
Principles and Practices of Interconnection Networks
Microarchitecture:
Enhance Arbitration
• Traditional Allocator Implementation
– Input-first, Output-first, Wavefront
• SPAROFLO
– Speculative Priority Assignment (SPA)
– Recreate Old (RO)
– Flow (FLO)
Allocator Implementations for Network-on-Chips
A 4.6Tbits/s 3.6GHz Single-Cycle NoC Router with a Novel Switch Allocator in 65nm CMOS
[Figure: separable allocators built from two stages of arbiters (IxV:1, V:1, and OxV:1), with request signals req11 ... reqIv and grant signals gnt11 ... gntIv]
Microarchitecture:
Enhance Arbitration
• Traditional Allocator Implementation
– Input-first, Output-first, Wavefront, LOA, PIM, …
[Figure: separable allocator with o:1 and i:1 arbiter stages, request signals req11 ... reqio and grant signals gnt11 ... gntio]
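The input-first separable allocator named above can be sketched in two arbitration stages. This is a minimal behavioral sketch that assumes fixed-priority arbiters (lowest index wins). Real designs typically use round-robin or matrix arbiters, and the function name and request-matrix encoding are illustrative assumptions.

```python
def input_first_allocate(requests):
    """requests[i][o] is True when input i wants output o.

    Returns a list of (input, output) grants in which no input and no
    output appears more than once.
    """
    num_outputs = len(requests[0]) if requests else 0
    # Stage 1: each input arbiter picks one of its requested outputs.
    chosen = {}
    for i, row in enumerate(requests):
        wanted = [o for o, req in enumerate(row) if req]
        if wanted:
            chosen[i] = wanted[0]  # fixed priority: lowest output index
    # Stage 2: each output arbiter picks one of the inputs that chose it.
    grants = []
    for o in range(num_outputs):
        contenders = [i for i, c in chosen.items() if c == o]
        if contenders:
            grants.append((min(contenders), o))  # fixed priority: lowest input
    return grants


reqs = [
    [True, True, False],   # input 0 wants outputs 0 and 1
    [True, False, False],  # input 1 wants output 0
    [False, False, True],  # input 2 wants output 2
]
print(input_first_allocate(reqs))  # [(0, 0), (2, 2)]
```

In this example input 1 goes unserved even though a full matching exists (0 to 1, 1 to 0, 2 to 2). A separable allocator trades matching quality for speed and simplicity in this way.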
Microarchitecture:
Enhance Arbitration
• Switch and Virtual Channel Allocators
Reliable Hop-to-Hop transmission
• On-chip interconnect errors
• Use high voltage
– Reduces error rate
– Limited by delay and area; consumes more energy
• Use low voltage with error correction code
– Type-II HARQ with low-swing channel
On Hamming Product Codes With Type-II Hybrid ARQ for On-Chip Interconnects
Adaptive error control for nanometer scale network-on-chip links
[Figure: code words 011 and 100 on the x-y-z cube; HD(011,100) = 3]
[Figure: sender and receiver exchanging an n x k block]
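The Hamming distance (HD) in the figure counts the bit positions in which two code words differ. The short sketch below reproduces HD(011,100) = 3 and applies the standard relation between a code's minimum distance and its error-handling capability. The function names are illustrative assumptions.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length bit strings differ."""
    if len(a) != len(b):
        raise ValueError("code words must have the same length")
    return sum(x != y for x, y in zip(a, b))


def detectable_errors(min_distance):
    """A code with minimum distance d detects up to d - 1 bit errors."""
    return min_distance - 1


def correctable_errors(min_distance):
    """A code with minimum distance d corrects up to (d - 1) // 2 bit errors."""
    return (min_distance - 1) // 2


d = hamming_distance("011", "100")
print(d, detectable_errors(d), correctable_errors(d))  # 3 2 1
```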
Globally Asynchronous,
Locally Synchronous (GALS) Circuit
• The problems of Clock Distribution
– Design Complexity, Noise, and Power
• Local clock w/ asynchronous communication
Property | Pausible Clocking | FIFO-based | Boundary Synchronization
Area Overhead | Low | Med to High | Low
Latency | Low | High | Med
Throughput | Depends on clock pause rate | High | Med
Power Consumption | Low | High | Med
Globally Asynchronous, Locally Synchronous Circuits: Overview and Outlook
[Figure: Local Sync. 1 and Local Sync. 2 connected in three ways: pausible clocks with output and input ports; an async FIFO; boundary registers (REG) with CL and DL blocks]
Robust Interfaces for Mixed-Timing Systems
• Two distinct problems
– Different local timing: GALS
– Long delays in interconnections: LID (latency-insensitive design)
• Can we use mixed FIFOs as relay stations?
– LID + GALS
• Reusable mixed-timing FIFOs
– And relay stations based on the FIFOs
[Figure: FIFO between clock domains clk1 and clk2, with TAIL and HEAD pointers]
Robust Interfaces for Mixed-Timing Systems with Application to Latency-Insensitive Protocols
Robust Interfaces for Mixed-Timing Systems
[Figure: mixed-timing FIFO built from a row of cells, with Put Ctrl, Get Ctrl, Full Detector, and Empty Detector. Put interface: req_put, data_put, CLK_put, full, ack_put. Get interface: req_get, data_get, CLK_get, valid_get, empty, ack_get]
Sync-Sync FIFO
Async-Sync FIFO
Async-Async FIFO
Sync-Async FIFO
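The put/get interface in the figure can be mimicked with a small behavioral model. This is only a software sketch. It uses a single shared occupancy counter, whereas a real mixed-timing FIFO derives full and empty from per-cell state and synchronizes them across the two clock domains. The class name and method signatures are illustrative assumptions, with full, empty, and the valid flag loosely mirroring the signal names on the slide.

```python
class FifoModel:
    """Behavioral circular FIFO with put at the tail and get at the head."""

    def __init__(self, cells):
        self.cells = [None] * cells
        self.head = 0   # next cell to read (get side)
        self.tail = 0   # next cell to write (put side)
        self.count = 0  # occupancy; stands in for the full/empty detectors

    @property
    def full(self):
        return self.count == len(self.cells)

    @property
    def empty(self):
        return self.count == 0

    def put(self, data):
        """Return True if the item was accepted, False if the FIFO is full."""
        if self.full:
            return False
        self.cells[self.tail] = data
        self.tail = (self.tail + 1) % len(self.cells)
        self.count += 1
        return True

    def get(self):
        """Return (valid, data); valid is False when the FIFO is empty."""
        if self.empty:
            return (False, None)
        data = self.cells[self.head]
        self.head = (self.head + 1) % len(self.cells)
        self.count -= 1
        return (True, data)


fifo = FifoModel(cells=4)
accepted = [fifo.put(x) for x in "abcde"]  # the fifth put finds the FIFO full
drained = [fifo.get() for _ in range(5)]   # the fifth get finds it empty
print(accepted)
print(drained)
```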
Application & System Drivers Summary
• Multicores & Heterogeneous Systems
– Increasing numbers of IP cores
• Emerging applications
– PARSEC / User-Interactive Apps
• Role of operating system
• Power vs. Performance
• Cache vs. Scratch-pad
– Shared-memory vs. Message-passing
– Cache Coherence Protocols
• On-Chip Memory Controller
• Off-chip Network & Memory