
Design and Analysis of Networks-on-Chip

in Heterogeneous Multicore Systems

Young Jin Yoon

<[email protected]>

Contents

• Motivation and Applications

• System Drivers

• On-Chip Communication and Networks-on-Chip

• Modeling and Tools

Motivation:

Moore’s Law and Performance of CPU

• Moore’s law

– Draw Figure from ITRS 2009

1. Transistor count doubles every 18 months!

2. Do we double the performance?

1. Limited by the diminishing returns of ILP

2. Power problem with Out-of-Order (OoO) execution!

3. ILP → TLP → Multi-Core Architecture

• Increasing the number of cores!

[Figure: performance growth rates (25%/year, 52%/year, ??%/year) across the eras of bit-level parallelism, instruction-level parallelism, and TLP/multicore. Sources: ITRS 2009; Computer Architecture: A Quantitative Approach]

Motivation:

System-on-Chip with Mobile Phones

• Performance vs. flexibility: 3.5G Mobile Phones

• 100 Giga-Operations-Per-Second (GOPS) within 1 W

– 1 core running at 100 GHz?

– 1000 cores running at 100MHz?

1.[2]. Multi-Core for Mobile Phones
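
The two options on this slide reach the same 100 GOPS target. A minimal sketch of the arithmetic, assuming one operation per core per cycle (an assumption for illustration; the function name is hypothetical):

```python
def required_clock_hz(target_ops_per_s: float, cores: int,
                      ops_per_cycle: float = 1.0) -> float:
    """Clock frequency each core needs so that all cores together
    reach the target throughput (assumes perfect parallel scaling)."""
    return target_ops_per_s / (cores * ops_per_cycle)


TARGET = 100e9  # 100 GOPS, the figure quoted on the slide

one_core = required_clock_hz(TARGET, cores=1)         # 100 GHz
thousand_cores = required_clock_hz(TARGET, cores=1000)  # 100 MHz
print(f"1 core:     {one_core / 1e9:.0f} GHz")
print(f"1000 cores: {thousand_cores / 1e6:.0f} MHz")
```

The sketch ignores communication overhead between cores, which is exactly the cost the rest of the talk is concerned with.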

Motivation:

System-on-Chip with Consumer Devices

1.[3]. Heterogeneous Multi-Core Platform for Consumer Multimedia Applications

[Figure: consumer multimedia functions (analog and digital audio decoders, audio post-processing, analog video decoder, digital RAW and compressed video decoders, picture quality enhancement, content browsing and control) mapped onto a heterogeneous platform (host CPU, VLIW processor cores, embedded control CPU, fixed-point DSP, function-specific HW cores), shown for DCD with new format and DCD with established format]

Motivation:


• Legacy

• Re-usability

• Performance

• Flexibility

• Support of industry standards



Motivation:

Networks-on-Chip (NoC)

• How do we connect all cores?

– Bus vs. Point-to-Point vs. Crossbar and Mesh

• Difference between NoC and other Networks

– Less non-determinism

– Local, High-performance networks

– Energy-constraints

– Design-time Specialization

[Figure: example interconnects (bus, point-to-point, crossbar, and mesh) with numbered nodes]

1.[6]. Networks on Chips: A New SoC Paradigm

Motivation:

Design and Analysis of NoC

[Figure: NoC design flow over the abstraction layers Wiring, Data Link, Network, Transport, System, and Application (spanning Physical, Arch. & Control, and Software), grouped into three phases: Application Modeling and Optimization, NoC Architecture Analysis and Optimization, and NoC Design Validation and Synthesis. Flow steps: Application, Design Goals & Constraints, Code Partitioning, Communication Infrastructure, Communication Paradigm, Application Communication Analysis, Analysis & Optimization, Mapping & Scheduling, Simulation, Prototyping, NoC Testing, NoC Verification, Component Instantiation, Communication Component Library, Physical Synthesis & Tapeout]

1.[7]. Outstanding Research Problems in NoC Design: System, Microarchitecture, and Circuit Perspective

Applications:

PARSEC vs. SPLASH-2

• PARSEC benchmarks

– Multithreaded

– Emerging Workload

– Diverse

– State-of-the-art Techniques

– Support Research

• Similarity research

– Principal Component Analysis(PCA) based on 3 groups

• Inst. Mix: 4 characteristics

• Working Sets: 8 characteristics

• Sharing: 32 characteristics

1.[4]. PARSEC vs. SPLASH-2: A Quantitative Comparison of Two Multithreaded Benchmark Suites on Chip-Multiprocessors

A Communication Characterization of SPLASH-2 and PARSEC

Applications:

Mobile Architecture

• Benchmarks for Embedded computing

– EEMBC, MiBench…

• Mobile Architecture

– Strict power constraints

• Dynamic power management

– Users determine the power consumption

1.[5]. Into the Wild: Studying Real User Activity Patterns to Guide Power Optimizations for Mobile Architectures

Contents


• System Drivers



Operating System

• How to Manage Heterogeneous Multicores?

– Cores & Systems are diverse.

– The interconnect matters.

– Messages cost less than shared Memory.

2.[1]. The Multikernel: A New OS Architecture for Scalable Multicore Systems

Core Parallelism, Power, and Temperature

• Performance and Power

– Same total parallelism (4P-8W vs. 8P-4W)

– Same power but better throughput on 8P than 4P

– Energy-Delay Product (EDP) and Energy-Delay^2 Product (ED2P)

2.[2]. Design Space Exploration for Multicore Architectures: A Power/Performance/Thermal View
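
The two metrics named above weight delay differently. A minimal sketch, using made-up energy and delay numbers for two configurations labeled after the slide (the values are purely illustrative, not results from the cited paper):

```python
def edp(energy_j: float, delay_s: float) -> float:
    """Energy-Delay Product: energy * delay."""
    return energy_j * delay_s


def ed2p(energy_j: float, delay_s: float) -> float:
    """Energy-Delay^2 Product: energy * delay^2 (penalizes delay more)."""
    return energy_j * delay_s ** 2


# Hypothetical (energy in J, delay in s) pairs; NOT measured data.
configs = {"4P-8W": (10.0, 2.0), "8P-4W": (10.0, 1.6)}

for name, (energy, delay) in configs.items():
    print(f"{name}: EDP={edp(energy, delay):.2f} J*s, "
          f"ED2P={ed2p(energy, delay):.2f} J*s^2")
```

Lower is better for both metrics; ED2P favors the faster design more strongly when energy is equal.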

• Temperature Spatial Distribution

– Paired vs. Lined up vs. Centered

– 1~2 °C lower than the others, due to the large L2 cache

Memory Hierarchy:

On-Chip Memory

• Cache vs. Scratch-pad

– Both scale equally well up to 16 cores.

– Streaming applications

• Scratch-pad memory > Transparent Cache

– Caches will suffer in large-scale CMPs.

• Scratch-pad may be able to address the problem.

3.[3]. Memory Systems: Cache, DRAM, Disk

3.[6]. Comparing Memory Systems for Chip Multiprocessors

Mgmt. \ Addressing   Implicit                   Explicit
Transparent          Transparent cache          Software-managed cache
Non-Transparent      Self-managed scratch-pad   Scratch-pad memory

Memory Hierarchy:

Cache Coherence Protocol

3.[3]. Memory Systems: Cache, DRAM, Disk

Token Coherence: Decoupling Performance and Correctness

                 Snoop-based   Directory-based   Token-based
Ordering Point   NoC           Directory         Caches w/ retransmission
Indirect?        N             Y                 N
Broadcast?       Y             N                 Y
Performance?     Fast          Slow              Moderate
Unordered NoC?   N             Y                 Y

[Figure: caches 0 … n attached to the NoC under each of the three protocols; the directory-based variant adds a directory (Dir) next to each cache]

Intelligent NoCs for Cache Coherence:

INSO and INCF

• Snoop-based Coherence in unordered NoCs:

In-Network Snoop Ordering

2.[4]. In-Network Coherence Filtering: Snoopy Coherence without Broadcasts

1. Incorrect: In-Network Snoop Ordering (INSO)

– Route messages as ordered

2. Broadcast messages: In-Network Coherence Filtering (INCF)

– Filter unnecessary broadcasts


Memory Controller

• On-Chip Memory Controller

– Where to place them?

• Performance

– Row ≈ Column < Diagonal X ≈ Diamond

– The gap can be narrowed by choosing the routing algorithm wisely

• Class-based Deterministic Routing (CDR)

2.[5]. Achieving Predictable Performance through Better Memory Controller Placement in Many-Core CMPs

Row Column Diagonal X Diamond

Off-Chip Network & Memory

• Bandwidth wall

– Due to pin-limitations, power constraints and package costs

– Memory scales only 10% per year

• Bandwidth Conservation Techniques

2.[7]. Scaling the Bandwidth Wall: Challenges in and Avenues for CMP Scaling

NoC

Network-on-Chip:

Terminology

• Topology

– Indirect vs. Direct

• Routing

– Deterministic vs. Adaptive

• Flow Control

– Arbitration

– Circuit-Switched

– Packet-Switched

• Worm-Hole and Virtual-Channel

– Hop-to-hop Flow-Control

3.[1]. Principles and Practices of Interconnection Networks
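
As a concrete instance of deterministic routing from the terminology above, here is a minimal sketch of dimension-order (XY) routing on a 2D mesh. XY routing is chosen only as a familiar example; the slide does not name a specific algorithm, and the coordinate convention is an assumption:

```python
def xy_route(src: tuple[int, int], dst: tuple[int, int]) -> list[tuple[int, int]]:
    """Dimension-order routing: travel fully along X, then along Y.

    Returns the list of (x, y) routers visited, including src and dst.
    The path depends only on src and dst, so it is deterministic.
    """
    x, y = src
    path = [(x, y)]
    while x != dst[0]:            # first correct the X coordinate
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:            # then correct the Y coordinate
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


print(xy_route((0, 0), (2, 1)))
```

An adaptive router would instead choose among several productive directions at each hop, for example based on local congestion.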


Network-on-Chip:

Router Microarchitecture

[Figure: router block diagram with routing logic/table, VC allocators, switch allocators, and crossbar; pipeline stages BW/RC, VA, SA, ST, LT]

• Topology

• Routing

• Flow Control

– Arbitration

– Worm-Hole

– Hop-to-Hop

– Virtual Channel

• Router Pipelines


• A flit spends 4 clock cycles (c.c.) in the router for each link traversal

Router Microarchitecture:

Reducing Pipelines

• Speculative Routing

• Lookahead Routing

• Speculation + Lookahead Routing

[Figure: pipeline diagrams for the head flit and the body & tail flits in four routers: Baseline Router Pipeline, Speculative Router Pipeline, Lookahead Router Pipeline, and Speculation + Lookahead Router Pipeline (stages BW, RC or NRC, VA, SA, ST, LT)]


Performance and Cost Metrics

• Performance Metrics

• Cost Metrics

– Average or peak energy/power consumption

– Network area overhead and total area

– Average or peak temperature


1.[7]. Outstanding Research Problems in NoC Design

          Delivery Speed      Channel Usage
Ideal     Zero-load Latency   Bi-section Bandwidth
Average   Average Latency     Average Throughput
Worst     Maximum Latency     Peak Throughput

Topology:

Flattened Butterfly

• Flattened Butterfly vs. Mesh

3.[2]. Flattened Butterfly Topology for On-Chip Network

[Figure: 3-ary 3-fly network (3-stage butterfly) and the corresponding flattened butterfly; FBfly layout vs. mesh layout for nodes 0 to 8]

T0 = Th + Ts + Tw
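
The zero-load latency expression T0 = Th + Ts + Tw on this slide can be sketched numerically. The breakdown below (Th = hops × per-router delay, Ts = packet length / channel width, Tw = wire distance / signal velocity) follows the usual textbook decomposition and is an assumption about what the slide's symbols mean; all parameter values are made up:

```python
def zero_load_latency(hops: int, router_delay: float,
                      packet_bits: float, channel_bits_per_cycle: float,
                      distance: float, velocity: float) -> float:
    """T0 = Th + Ts + Tw, all in cycles (assumed interpretation)."""
    t_h = hops * router_delay                    # head latency through routers
    t_s = packet_bits / channel_bits_per_cycle   # serialization latency
    t_w = distance / velocity                    # time of flight on the wires
    return t_h + t_s + t_w


# Hypothetical comparison: same wire distance, fewer hops on the flattened butterfly.
mesh = zero_load_latency(hops=4, router_delay=2, packet_bits=128,
                         channel_bits_per_cycle=32, distance=8, velocity=2)
fbfly = zero_load_latency(hops=2, router_delay=2, packet_bits=128,
                          channel_bits_per_cycle=32, distance=8, velocity=2)
print(mesh, fbfly)
```

With equal serialization and wire terms, the topology with fewer hops wins purely through the Th term.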

Microarchitecture:

Enhance Arbitration

• SPAROFLO

– Speculative Priority Assignment (SPA)

– Recreate Old (RO)

– Flow (FLO)

3.[3]. A 4.6Tbits/s 3.6GHz Single-Cycle NoC Router with a Novel Switch Allocator in 65nm CMOS

[Figure: SPAROFLO switch allocator (V:1 local arbiters, SPA priority encoder, conflict detect, P:1 global arbiter, sequential retry queue, grants from other global arbiters, final grant) and an example of input/output port matching at clock n and clock n+1]

Bufferless Network

• Buffers in NoC

– Energy, area, complexity

• Can we design a network without buffers?

– Deflection routing vs. packet/flit dropping

3.[5]. A Case for Bufferless Routing in On-Chip Networks

• BLESS

– Deflective bufferless network

– FLIT-BLESS vs. WORM-BLESS

• Problems

– Injection problem

– Livelock

– Throughput and Latency
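
The deflection idea above can be sketched for a single mesh router. This is a simplified illustration, not the BLESS implementation: it assumes oldest-first arbitration (an age-based priority, which is one way to avoid livelock), at most as many incoming flits as output ports, and no flit already at its destination. All names are hypothetical:

```python
def productive_ports(pos, dst):
    """Output ports that move a flit one hop closer to its destination."""
    ports = []
    if dst[0] > pos[0]:
        ports.append("E")
    if dst[0] < pos[0]:
        ports.append("W")
    if dst[1] > pos[1]:
        ports.append("N")
    if dst[1] < pos[1]:
        ports.append("S")
    return ports


def assign_ports(pos, flits, ports=("E", "W", "N", "S")):
    """flits: list of (age, dst). Returns {flit_index: (port, deflected)}.

    Every flit leaves in the same cycle (no buffering): the oldest flit
    picks first; a flit whose productive ports are taken is deflected
    to any remaining free port.
    """
    free = list(ports)
    result = {}
    oldest_first = sorted(range(len(flits)), key=lambda i: -flits[i][0])
    for i in oldest_first:
        _, dst = flits[i]
        wanted = [p for p in productive_ports(pos, dst) if p in free]
        port = wanted[0] if wanted else free[0]   # deflect if nothing productive is free
        free.remove(port)
        result[i] = (port, not wanted)
    return result


# Two flits compete for port E at router (1, 1); the younger one is deflected.
print(assign_ports((1, 1), [(5, (3, 1)), (9, (2, 1)), (1, (1, 3))]))
```

A deflected flit takes a longer path, which is why throughput and latency degrade as load rises.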


Quality of Service (QoS)

• Quality of Service

– Local Fairness ≠ Global Fairness

– Some packets are more important than others.

• Round-Robin vs. Age-based vs. deadline-based

3.[6]. Globally-Synchronized Frames for Guaranteed Quality-of-Service in On-Chip Networks

QoS: Globally-Synchronized Frames (GSF)

• Deadline-Based Arbitration is impractical

– Infinite-sized sorting queues

– Large overhead for sending and storing the deadline

• Source-managed QoS (e.g. GSF)

– Frame-based approach

• Sorting across frames not within a frame

[Figure: three arbitration schemes side by side: deadline-based with an infinite searchable buffer and an earliest-deadline selector; frame-based with per-frame buffers and an infinite frame window; frame-based with circular frame buffers, a head pointer, and a finite frame window]

3.[6]. Globally-Synchronized Frames for Guaranteed Quality-of-Service in On-Chip Networks

QoS: Other approaches

• Router-Based QoS

– Preemptive Virtual Clock (PVC)

: Router-based dynamic bandwidth allocation

• Application-Aware QoS

– What performance do we really care about?

• Network vs. application

– Stall-Time-Criticality (STC)

Preemptive Virtual Clock: A Flexible, Efficient and Cost-effective QoS scheme for Networks-on-Chip

Application-Aware Prioritization Mechanisms for On-Chip Networks

[Figure: compute and stall timelines for packets A, B, and C under two different delivery orders]

Polymorphic On-Chip Networks

• There is no network to fit all workloads.

3.[4]. Polymorphic On-Chip Networks

[Figure: average throughput (bits/cycle) vs. average packet latency (cycles) under random permutation traffic for meshes, butterflies, fat trees, flattened butterflies, and rings, with the Pareto-optimal points marked]
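
The Pareto-optimal set marked in the plot can be computed mechanically from (latency, throughput) points. A minimal sketch with made-up design points (the numbers are illustrative only, not data from the cited paper):

```python
def pareto_front(designs):
    """designs: list of (name, latency, throughput).

    A design is dominated if another design has latency <= and
    throughput >= its own, with at least one strict inequality.
    Returns the names of the non-dominated designs, in input order.
    """
    front = []
    for name, lat, thr in designs:
        dominated = any(
            (l2 <= lat and t2 >= thr) and (l2 < lat or t2 > thr)
            for _, l2, t2 in designs
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical (latency in cycles, throughput in bits/cycle) points.
points = [
    ("mesh", 40, 300),
    ("ring", 90, 100),
    ("fbfly", 30, 500),
    ("fat_tree", 35, 600),
    ("butterfly", 25, 200),
]
print(pareto_front(points))
```

No single point dominates all others here, which mirrors the slide's claim that no one network fits every workload.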


• Let’s provide Network resources

– Users can statically configure NoC before running applications


e.g. Unidirectional Ring


Hop-to-hop: wires and interconnects

• Network-on-Chip: Floorplans

4.[3]. COSI: A Framework for the Design of Interconnection Networks

1.[1]. International Technology Roadmap for Semiconductors (ITRS): 2009 edition

Cu Interconnect (ITRS)        2011   2012   2013   2014   2015   2016
Gate Length (nm)              16     14     13     11     10     9
Intermediate RC Delay (ps)    1291   1455   1842   2406   2670   3341
  Line length (um)            16     15     12     9      8      7
Global RC Delay (ps)          487    557    705    921    1004   1297
  Line length (um)            26     23     19     15     13     11
FITs /m /cm^2                 2      1.6    1.6    1.4    1.3    1.1

• Interconnect Requirement from ITRS

• Time-to-market constraints

• Intellectual-Property design modules (IP Cores)

• Interconnect latency

– Hard to estimate in early design stage

– Conservative estimation: suboptimal design

• Latency Insensitive Design(LID)

Latency Insensitive Design

4.[2]. Coping with Latency in SOC Design

[Figure: five pearls (IP cores), each wrapped in a shell and connected through relay stations (RS); channels carry data (with void symbols) forward and backpressure backward]

Hop-to-Hop Flow Control

• Channels between two routers

– The longer the wire, the slower messages are delivered.

• Put some intelligence on the channel!

– Link pipelining with distributed buffers


3.[7]. Distributed Flit-Buffer Flow Control for Networks-on-Chip

ON/OFF   Credit   Ack/Nack
-        -        -
2+5K     2+3K     1+3K
2+2K     2+2K     1+2K

[Figure: pipelined links between control-logic blocks carrying data forward and backpressure (BP) backward, built from flip-flops, relay stations, or inverters/latches]

Globally Asynchronous,

Locally Synchronous (GALS) Circuit

• The problems of Clock Distribution

– Design Complexity, Noise, and Power

• Local clock w/ asynchronous communication

Property            Pausible Clocking              FIFO-based    Boundary Synchronization
Area Overhead       Low                            Med to High   Low
Latency             Low                            High          Med
Throughput          Depends on clock pause rate    High          Med
Power Consumption   Low                            High          Med

3.[9]. Globally Asynchronous, Locally Synchronous Circuits: Overview and Outlook

[Figure: two locally synchronous modules (Local Sync. 1 and 2) connected in three ways: pausible clock generators at the output and input ports; an asynchronous FIFO; boundary synchronization registers (REG)]

Robust Interfaces for Mixed-Timing Systems

• Partition FIFOs into reusable components

– Reusable Put and Get Cell sub-modules required

3.[10]. Robust Interfaces for Mixed-Timing Systems

[Figure: mixed-timing FIFO built from a ring of cells with put and get controllers and full/empty detectors (put side: req_put, data_put, CLK_put, full, ack_put; get side: req_get, data_get, CLK_get, valid_get, empty, ack_get). Variants: Sync-Sync, Async-Sync, Async-Async, and Sync-Async FIFO. Cell-level views of the Sync-Sync and Async-Async FIFOs pass tokens between cells (tok_in_put, tok_out_put, tok_in_get, tok_out_get)]



– Only Data Validity Controller sub-module needs to be modified

• Implement Relay Stations with Mixed Timing FIFO



Dynamic Voltage-Frequency Scaling (DVFS)

• Power Consumption of NoC

– Up to 28% of total power is spent in the NoC

– Router frequency: critical design parameter

• Network power vs. network latency

• Dynamic power management for routers

– Clock Scaling and Time Stealing

A Case for Dynamic Frequency Tuning in On-Chip Networks


Asynchronous NoC

• Mesh-of-Trees(MoT) variants

– No Switch(i.e. crossbar) is required

– Can be implemented with

• Simple routers (for fan-out)

• Simple arbiters (for fan-in)

[Figure: Mesh-of-Trees for nodes 0 to 3 with row forest and column forest; asynchronous fan-out primitive (latch controllers, toggles, Req/Ack handshakes on Data0/Data1) and fan-in primitive (mutex, flow control unit, datapath, latch controller, Mux_Select)]

3.[11]. A Low-Overhead Asynchronous Interconnection Network for GALS Chip Multiprocessors

Reliable Hop-to-Hop transmission

• On-chip interconnect errors

• Using High Voltage

– Reduces error rate

– Limited by delay and area, and consumes more energy

• Use low voltage with error correction code

– Type-II HARQ with low-swing channel

3.[8]. On Hamming Product Codes With Type-II Hybrid ARQ for On-Chip Interconnects

Adaptive Error Control For Nanometer Scale Network-on-Chip Links

[Figure: code words 011 and 100 at opposite corners of the (x, y, z) cube, HD(011,100) = 3; sender and receiver exchanging an n × k block]
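
The Hamming distance HD(011,100) = 3 quoted in the figure can be checked directly. A minimal sketch (function names are hypothetical); the relation between minimum distance and error capability is the standard coding-theory result, not something specific to the cited papers:

```python
from itertools import combinations


def hamming_distance(a: str, b: str) -> int:
    """Number of bit positions in which two equal-length words differ."""
    if len(a) != len(b):
        raise ValueError("words must have equal length")
    return sum(x != y for x, y in zip(a, b))


def min_distance(codewords: list[str]) -> int:
    """Minimum pairwise Hamming distance of a code."""
    return min(hamming_distance(a, b) for a, b in combinations(codewords, 2))


d = min_distance(["011", "100"])
print(hamming_distance("011", "100"))                       # the slide's example
print(f"detects up to {d - 1} errors, corrects up to {(d - 1) // 2}")
```

A code with minimum distance d detects up to d - 1 bit errors and corrects up to (d - 1) // 2, which is the budget an error-correcting low-swing link has to work with.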

Photonic NoCs

• The benefits of Photonic communication

– Bandwidth

– Power Dissipation

• Hybrid Photonic vs. electronic NoCs

– Same execution time: 7.6W vs 244W

– Same power dissipation: 960Gbps vs 100Gbps

3.[12]. Photonic NoCs: System-Level Design Exploration

Contents


• System Drivers



Modeling and Tools

4.[1]. Research Challenges for On-Chip Interconnection Networks

1. Synthesis

2. Custom IP blocks

3. Validation

4. Models of CMOS devices and interconnects

5. End-user feedback

6. Dynamic, reconfigurable network tools

7. Application instrumentation

[Figure labels: Many-core system constraints; Hardware]

COSI: NoC Design Automation

• Can we automate NoC design?

• Communication Synthesis Infrastructure (COSI)

– Network specification

– Library of building blocks

– Quantified performance and cost models

– Optimization Algorithms

4.[3]. COSI: A Framework for the Design of Interconnection Networks

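[Figure: COSI flow. Inputs: models, rules & platforms (Orion, Ho's models), algorithms (K-merging, shortest path, …), and a library (topology, links, routers). A communication graph over nodes 0 to 5, with an edge labeled (10,100), is transformed into a synthesized network]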

ORION: NoC Power and Area Model

4.[4]. ORION 2.0: A Fast and Accurate NoC Power and Area Model for Early-Stage Design Space Exploration

• Power: the most critical design constraint.

– Power of NoC will also be substantial

– How to estimate NoC power in the early-design stage?

FAST: Architectural Simulation

• Good simulators

– speed, accuracy, completeness, transparency

– inexpensiveness, up-to-date, and easy-to-use, …

• The functional model of FAST

– Keep generating instruction stream

– Roll back when mis-speculations occur

4.[5]. FPGA-Accelerated Simulation Technologies (FAST): Fast, Full-System, Cycle-Accurate Simulators

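[Figure: (a) Event-driven architectural simulator: the functional model and the timing model exchange one instruction at a time (Inst. / Next Inst.). (b) FAST: the functional model streams an instruction trace through a trace buffer to the timing model on the FPGA, which answers with roll back / commit]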

Conclusion

[Figure: the NoC design flow from the motivation section, repeated: abstraction layers Wiring, Data Link, Network, Transport, System, and Application; phases Application Modeling and Optimization, NoC Architecture Analysis and Optimization, and NoC Design Validation and Synthesis]

1.[7]. Outstanding Research Problems in NoC Design: System, Microarchitecture, and Circuit Perspective

Questions?

Backup slides

An example of MUTEX Circuit

ReCycle: Pipeline Adaptation

to Tolerate Process Variation

Simulation: Open-loop vs. Closed-loop simulation

• Open-loop

– NI with infinite queue

• Isolates the effect of the network design from the injection

– e.g. synthetic traffic patterns

• Closed-loop

– Closer to the actual system

– NI with finite queue

– e.g. full-system simulations

Principles and Practices of Interconnection Networks

Simulation: Synthetic Traffic model

• Synthetic Traffic model

– Based on statistical analysis of the traffic

– Traffic Patterns

• Random

• Bit permutations

– Bit complement, Bit reverse, Bit rotation, Shuffle, Transpose

• Digit permutations

– Tornado, Neighbor

– Constant injection rate over time

• Actual traffic: bursty!
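
The bit and digit permutations listed above map each source address to a fixed destination. A minimal sketch, assuming the common textbook definitions for a network with b-bit node addresses (bit permutations) or radix k per dimension (digit permutations); function names are hypothetical:

```python
def bit_complement(s: int, b: int) -> int:
    """d_i = NOT s_i"""
    return ~s & ((1 << b) - 1)


def bit_reverse(s: int, b: int) -> int:
    """d_i = s_(b-i-1)"""
    return int(format(s, f"0{b}b")[::-1], 2)


def bit_rotation(s: int, b: int) -> int:
    """d_i = s_((i+1) mod b): rotate the address right by one bit."""
    return (s >> 1) | ((s & 1) << (b - 1))


def shuffle(s: int, b: int) -> int:
    """d_i = s_((i-1) mod b): rotate the address left by one bit."""
    return ((s << 1) & ((1 << b) - 1)) | (s >> (b - 1))


def transpose(s: int, b: int) -> int:
    """d_i = s_((i+b/2) mod b): swap the two address halves (b even)."""
    h = b // 2
    return ((s >> h) | (s << h)) & ((1 << b) - 1)


def tornado(x: int, k: int) -> int:
    """d_x = (s_x + ceil(k/2) - 1) mod k, per dimension."""
    return (x + (k + 1) // 2 - 1) % k


def neighbor(x: int, k: int) -> int:
    """d_x = (s_x + 1) mod k, per dimension."""
    return (x + 1) % k


for src in range(4):
    print(src, "->", bit_complement(src, 2), transpose(src, 2), tornado(src, 4))
```

Because every source always targets the same destination at a constant rate, these patterns stress specific links but miss the burstiness of real traffic noted on the slide.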


Simulation: Summary

• Trade-off between accuracy and simulation time

– Synthetic traffic model

• Fast simulation time, less accurate

– Event-driven simulation

• Slow simulation time, more accurate

– RTL-level simulation

• The slowest, but even more accurate

Applications: PARSEC vs. SPLASH-2

• PARSEC vs. SPLASH-2

– Diversity

– State-of-the-art Algorithms

– Input dataset

• Comparison

– Instruction Mix, Working Sets, and Sharing

– Communication

• Similarity research

– Principal Component Analysis (PCA)

– 44 parameters

• Including Inst. Mix, Working Sets, and Sharing

A Communication Characterization of SPLASH-2 and PARSEC

PARSEC vs. SPLASH-2: A Quantitative Comparison of Two Multithreaded Benchmark Suites on Chip-Multiprocessors

Applications:

PARSEC vs. SPLASH-2 (cont’)

• Communication Comparison

– Spatial Behavior: Less Distinct

– Temporal Behavior: More Bursty

– Producer–Consumer: Multi-to-Multi

PARSEC vs. SPLASH-2

Operating System

• Real-Time Operating System

– How to deliver the real-time requirement

• Operating System coexistence

– Multiprocessor with heterogeneous cores

– Some simpler cores may require an RTOS

– Some complex cores can run a general-purpose OS

– How to manage those issues?

• Using a hypervisor?

The Multikernel: A New OS Architecture for Scalable Multicore Systems

A Unified Operating System for Clouds and Manycore: fos

Process Scheduling Challenges in the Era of Multi-Core Processors

Reliable Hop-to-Hop transmission

• On-chip interconnect errors

• Using High Voltage

– Reduces error rate

– Limited by delay and area, and consumes more energy

• Use low voltage with error correction code

– Increases the error rate, but errors are corrected when they happen

– Type-II HARQ with low-swing channel

On Hamming Product Codes With Type-II Hybrid ARQ for On-Chip Interconnects

Adaptive Error Control for Nanometer Scale Network-on-Chip Links

End-to-End flow control

• Message-dependent Deadlock

– Deadlock avoidance

• Virtual Network

• Credit-Based (CB)

– Deadlock recovery

• Regressive

• Deflective

• Progressive

• CTC: Connect-Then-Credit

– 3-way handshake to exchange credits

• P_REQ, P_ACK, and data

Principles and Practices of Interconnection Networks

CTC: An End-To-End Flow Control Protocol for SoC Architectures



INSO and INCF

• Two main problems with Snoop-based Coherence in unordered NoCs:

1. Incorrect

2. Broadcast messages: In-Network Coherence Filtering (INCF)

In-Network Coherence Filtering: Snoopy Coherence without Broadcasts

Traditional Network-on-Chip

[Figure: 4×4 mesh with nodes 0 to 15 and the router microarchitecture (routing logic/table, VC allocator, switch allocator, crossbar) with pipeline stages BW/RC, VA, SA, ST, LT]

Bufferless Network

• Buffers in NoC

• Can we design a network without buffers?

– Deflection routing vs. packet or flit dropping

A Case for Bufferless Routing in On-Chip Networks

SCARAB: A Single Cycle Adaptive Routing and Bufferless Network


Circuit-switched NoC

• Benefit of circuit-switched NoC

A 2.9Tb/s 8W 64-Core Circuit-Switched Network-on-Chip in 45nm CMOS

Winning the Pinning in NoC






Bufferless Network

• Buffers in NoC

– Energy, area, complexity

• Can we design a network without buffers?

– Deflection routing vs. packet/flit dropping

• BLESS

– Deflective bufferless network

• Problems

– Injection problem

– Livelock




Microarchitecture:

Enhance Arbitration

• Switch and Virtual Channel Allocators

• Traditional Allocator Implementation

– Input-first, Output-first, Wavefront, LOA, PIM, …

• SPAROFLO

– Flow (FLO)

Allocator Implementations for Network-on-Chips

A 4.6Tbits/s 3.6GHz Single-Cycle NoC Router with a Novel Switch Allocator in 65nm CMOS

[Figure: separable allocators built from two arbiter stages (V:1 with I×V:1 or O×V:1 arbiters; i:1 with o:1 arbiters), with request signals req11 … reqIv and grant signals gnt11 … gntIv, plus an example matching of inputs 0 to 3 to outputs 0 to 2]



• Two distinct problems

– Different local timing: GALS

– Long delays in interconnections: LID

• Can we use mixed FIFOs as relay stations?

– LID + GALS

• Reusable mixed-timing FIFOs

– And relay stations based on the FIFOs

[Figure labels: clk1, clk2, TAIL, HEAD]

Robust Interfaces for Mixed-Timing Systems with Application to Latency-Insensitive Protocols



Application & System Drivers Summary

• Multicores & Heterogeneous Systems

– Increasing numbers of IP cores

• Emerging applications

– PARSEC / User-Interactive Apps

• Role of operating system

• Power vs. Performance

• Cache vs. Scratch-pad

– Shared-memory vs. Message-passing

– Cache Coherence Protocols

• On-Chip Memory Controller

• Off-chip Network & Memory
