Interconnections in Multi-core Architectures: Understanding Mechanisms, Overheads and Scaling

Rakesh Kumar (UCSD), Victor Zyuban (IBM), Dean Tullsen (UCSD)


Page 2:

A Naïve Methodology for Multi-core Design

[Figure: eight cores (P0-P7), each with its own L2 bank (L2_0-L2_7), designed independently.]

Multi-core-oblivious multi-core design!
A clean, easy way to design.

Page 3:

Goal of this Research

Holistic design of multi-core architectures: the naïve methodology is inefficient.

We have demonstrated this inefficiency for cores and proposed alternatives:
Single-ISA Heterogeneous Multi-core Architectures for Power [MICRO 2003]
Single-ISA Heterogeneous Multi-core Architectures for Performance [ISCA 2004]
Conjoined-core Chip Multiprocessing [MICRO 2004]

What about interconnects? How much can interconnects impact processor architecture? Do they need to be co-designed with caches and cores?

Page 4:

Contributions

We model the implementation of several interconnection mechanisms and topologies: we quantify various overheads, highlight various tradeoffs, and study how the overheads scale.

We show that several common architectural beliefs do not hold when interconnection overheads are properly accounted for

We show that one cannot design a good interconnect in isolation from the CPU cores and memory design

We propose a novel interconnection architecture which exploits behaviors identified by this research

Page 5:

Talk Outline

Interconnection Models
Shared Bus Fabric (SBF), point-to-point links, crossbar

Modeling area, power and latency

Evaluation Methodology

SBF and Crossbar results

Novel architecture

Page 6:

Shared Bus Fabric (SBF)

On-chip equivalent of the system bus for snoop-based shared-memory multiprocessors.

We assume a MESI-like snoopy write-invalidate protocol with write-back L2s

SBF needs to support several coherence transactions (request, snoop, response, data transfer, invalidates etc.)

Also needs to arbitrate access to the corresponding busses
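
To make the transaction list concrete, here is a minimal sketch (not the paper's implementation; the peer-state encoding is assumed for illustration) of the bus transactions one L2 read miss generates under a MESI-like snoopy write-invalidate protocol with write-back L2s.

```python
# Illustrative only: the SBF transactions triggered by one L2 read miss
# under a MESI-like snoopy write-invalidate protocol with write-back L2s.

def read_miss_transactions(peer_state):
    """peer_state: 'M', 'E', 'S' or 'I' -- the line's state in the other
    L2 caches (a hypothetical encoding used only for this sketch)."""
    txns = ["request"]       # requester places the address on the address bus
    txns.append("snoop")     # every other L2 snoops the address
    txns.append("response")  # combined snoop response returns to the requester
    if peer_state == "M":
        # a peer holds a dirty copy: it sources the data on the data bus
        txns.append("data transfer (peer L2 -> requester)")
    else:
        # no dirty copy on chip: the memory controller supplies the data
        txns.append("data transfer (memory controller -> requester)")
    return txns

print(read_miss_transactions("M"))
print(read_miss_transactions("I"))
```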

Page 7:

Shared Bus Fabric (SBF)

[Figure: SBF block diagram. Cores (incl. I$/D$) and L2 banks attach to pipelined, unidirectional buses (AB/SB, RB, DB) through queues, with address and data arbiters (A-arb, D-arb) and book-keeping logic.]

Components:
Arbiters
Queues
Buses (pipelined, unidirectional)
Control wires (mux controls, flow control, request/grant signals)

Details about latencies, overheads, etc. are in the paper.

Page 8:

Point-to-point Link (P2PL)

If there are multiple SBFs in the system, a point-to-point link connects two SBFs.

Needs queues and arbiters similar to an SBF.

Multiple SBFs might be required in the system:
to increase bandwidth, to decrease signal latencies, and to ease floorplanning.

Page 9:

Crossbar Interconnection System

If two or more cores share an L2, as many present CMPs do, a crossbar provides a high-bandwidth connection.

Page 10:

Crossbar Interconnection System

[Figure: crossbar between cores and L2 banks. Address buses (AB, one per core) carry loads, stores, prefetches and TLB misses; data-out buses (DoutB, one per core) carry data writebacks; data-in buses (DinB, one per bank) carry data reloads and invalidate addresses.]
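
As a rough illustration of why these per-core and per-bank buses add up, here is a back-of-the-envelope sketch; all bus widths, the wire pitch and the routing span are assumed numbers for illustration, not the paper's model.

```python
# Rough sketch with assumed parameters: crossbar wiring tracks grow with
# cores * (address + data-out widths) plus banks * data-in width.

def crossbar_wiring_area_mm2(cores=8, banks=8, addr_bits=64, data_bits=256,
                             pitch_um=1.0, span_mm=12.0):
    """pitch_um: effective wire pitch (the 2X plane in the wiring table later
    in the talk); span_mm: assumed routing length across the core/cache region."""
    tracks = cores * (addr_bits + data_bits) + banks * data_bits
    width_mm = tracks * pitch_um / 1000.0   # total width of the wiring channel
    return width_mm * span_mm               # rectangular routing area

print(f"{crossbar_wiring_area_mm2():.1f} mm^2")   # ~55 mm^2 with these assumptions
```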

Page 11:

Talk Outline

Interconnection Models
Shared Bus Fabric (SBF), point-to-point links, crossbar

Modeling area, power and latency

Evaluation Methodology

SBF and Crossbar results

Novel architecture

Page 12:

Wiring Area Overhead

[Figure: a wire routed with repeaters and latches over a memory array.]

Page 13:

Wiring Area Overhead (65nm)

Metal Plane | Effective Pitch (um) | Repeater Spacing (mm) | Repeater Width (um) | Latch Spacing (mm) | Latch Height (um)
1X          | 0.5                  | 0.4                   | 0.4                 | 1.5                | 120
2X          | 1.0                  | 0.8                   | 0.8                 | 3.0                | 60
4X          | 2.0                  | 1.6                   | 1.6                 | 5.0                | 30
8X          | 4.0                  | 3.2                   | 3.2                 | 8.0                | 15

Overheads change based on the metal plane(s) the interconnects are mapped to
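
A minimal sketch of how these numbers could be combined into a wiring-area estimate for one pipelined bus; the bus width, length and the assumption that repeater and latch footprints simply add to the routing area are mine, not the paper's model.

```python
# Sketch only: wiring area for a pipelined bus, using the 65 nm table above.
PLANES = {  # plane: (pitch_um, rep_spacing_mm, rep_width_um, latch_spacing_mm, latch_height_um)
    "1X": (0.5, 0.4, 0.4, 1.5, 120),
    "2X": (1.0, 0.8, 0.8, 3.0,  60),
    "4X": (2.0, 1.6, 1.6, 5.0,  30),
    "8X": (4.0, 3.2, 3.2, 8.0,  15),
}

def bus_area_mm2(bits, length_mm, plane="4X"):
    pitch, rep_sp, rep_w, latch_sp, latch_h = PLANES[plane]
    wire  = bits * length_mm * pitch / 1000.0                        # tracks * length
    reps  = bits * int(length_mm / rep_sp) * (rep_w / 1000.0) ** 2   # assumed square repeaters
    latch = bits * int(length_mm / latch_sp) * (latch_h / 1000.0) * (pitch / 1000.0)
    return wire + reps + latch

print(f"{bus_area_mm2(bits=640, length_mm=20, plane='4X'):.1f} mm^2")
```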

Page 14:

Wiring Power Overhead

Dynamic dissipation in wires, repeaters and latches:
Wire capacitance 0.02 pF/mm, frequency 2.5 GHz, 1.1 V
Repeater capacitance: 30% of wire capacitance
Dynamic power per latch: 0.05 mW

Leakage in repeaters and latches: channel and gate leakage values are in the paper.
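
A minimal sketch of the dynamic-power arithmetic using the constants above; the bus width, length, activity factor and latch spacing are assumptions for illustration only.

```python
# Sketch only: dynamic wiring power from the slide's constants
# (0.02 pF/mm wire cap, repeater cap = 30% of wire cap, 0.05 mW per latch).

def wire_dynamic_power_W(bits, length_mm, af=0.2, freq_ghz=2.5, vdd=1.1,
                         c_wire_pf_per_mm=0.02, latch_spacing_mm=5.0,
                         p_latch_mW=0.05):
    c_wire = bits * length_mm * c_wire_pf_per_mm * 1e-12   # total wire capacitance (F)
    c_total = c_wire * 1.3                                 # plus repeater capacitance (30%)
    p_switch = af * c_total * vdd ** 2 * freq_ghz * 1e9    # AF * C * V^2 * f
    n_latches = bits * int(length_mm / latch_spacing_mm)
    return p_switch + n_latches * p_latch_mW * 1e-3

print(f"{wire_dynamic_power_W(bits=640, length_mm=20):.2f} W")   # ~0.33 W
```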

Page 15:

Wiring Latency Overhead

Latency of signals traveling through latches.

Latency also comes from control traveling between a central arbiter and the interfaces corresponding to the request/data queues.

Hence, latency depends on the location of the particular core, cache, or arbiter.
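
One way to picture this: on a pipelined bus a signal pays one cycle per latched segment, plus the control round trip to the arbiter, so latency grows with distance. A minimal sketch with assumed spacing and arbitration cost:

```python
import math

# Sketch only: cycles for a signal on a pipelined, latched bus.
def bus_latency_cycles(distance_mm, latch_spacing_mm=5.0, arb_round_trip=4):
    hops = math.ceil(distance_mm / latch_spacing_mm)   # one cycle per latch-to-latch segment
    return hops + arb_round_trip                       # plus request/grant to the central arbiter

print(bus_latency_cycles(6.0))    # a core close to the arbiter
print(bus_latency_cycles(22.0))   # a core at the far end of the chip
```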

Page 16:

Interconnect-Related Logic Overhead

Arbiters, muxes and queues constitute the interconnect-related logic.

Area and power overhead is primarily due to the queues, assumed to be implemented using latches.

Performance overhead comes from wait time in the queues and from arbitration latencies: arbitration overhead increases with the number of connected units, and latching is required between different stages of arbitration.
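
A minimal sketch of the two scaling effects just described, with assumed queue sizes and arbiter fan-in.

```python
# Sketch only: latch-based queue cost and multi-stage arbitration latency.

def queue_latches(entries, width_bits):
    return entries * width_bits          # one latch per stored bit

def arbitration_cycles(num_units, fan_in_per_stage=4):
    stages, capacity = 1, fan_in_per_stage
    while capacity < num_units:          # add a latched stage until every unit is covered
        stages += 1
        capacity *= fan_in_per_stage
    return stages                        # one cycle per arbitration stage (assumed)

print(queue_latches(entries=16, width_bits=128))            # 2048 latches for one queue
print(arbitration_cycles(4), arbitration_cycles(16))        # 1 stage vs. 2 stages
```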

Page 17:

Talk Outline

Interconnection Models
Shared Bus Fabric (SBF), point-to-point links, crossbar

Modeling area, power and latency

Evaluation Methodology

SBF and Crossbar results

Novel architecture

Page 18:

Modeling Multi-core Architectures

Stripped-down versions of POWER4-like cores: 10 mm^2, 10 W each.

Evaluated 4-, 8- and 16-core multiprocessors occupying roughly 400 mm^2.

A CMP consists of cores, L2 banks, memory controllers, DMA controllers and non-cacheable units (NCUs).

Weak consistency model, MESI-like coherence

All studies done for 65nm

Page 19:

Floorplans for 4-, 8- and 16-core processors (assuming private caches)

[Figure: floorplan diagrams. Legend: core, L2 data, L2 tag, P2P link. Blocks shown: SBF, IOX, memory controllers (MC) and non-cacheable units (NCU).]

Note that there are two SBFs for the 16-core processor.

Page 20:

Performance Modeling

Used a combination of detailed functional simulation and queuing simulation.

Functional simulator
Input: SMP traces (TPC-C, TPC-W, TPC-H, NotesBench, etc.)
Output: coherence statistics for the modeled memory/interconnection system

Queuing simulator
Input: coherence statistics, interconnection latencies, and the CPI of the modeled core assuming an infinite L2
Output: system CPI
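
A minimal sketch of how the queuing step might fold these inputs together; the event types, rates and latencies below are hypothetical, and in the real simulator the latencies would include queuing delays rather than being fixed constants.

```python
# Sketch only: combine infinite-L2 CPI with per-instruction coherence events
# and their interconnect round-trip latencies to get a system CPI.

def system_cpi(cpi_infinite_l2, events_per_instr, latency_cycles):
    stall = sum(rate * latency_cycles[evt] for evt, rate in events_per_instr.items())
    return cpi_infinite_l2 + stall

cpi = system_cpi(
    cpi_infinite_l2=1.2,
    events_per_instr={"l2_miss": 0.005, "cache_to_cache": 0.002},   # hypothetical rates
    latency_cycles={"l2_miss": 400, "cache_to_cache": 120},         # hypothetical latencies
)
print(f"{cpi:.2f}")   # 1.2 + 0.005*400 + 0.002*120 = 3.44
```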

Page 21:

Results…finally!

Page 22:

SBF: Wiring Area Overhead

[Chart: area overhead (mm^2) vs. number of cores (4, 8, 16), broken into architected busses, control wires, and total overhead.]

Area overhead can be significant: 7-13% of die area.
That is sufficient to place 3-5 extra cores or 4-6 MB of extra cache!
Co-design is needed: more cores, more cache, or more interconnect bandwidth?
We observed a scenario where decreasing bandwidth improved performance.
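
A quick check of the "extra cores" arithmetic, using the methodology slide's roughly 400 mm^2 die and 10 mm^2 cores:

```python
die_mm2, core_mm2 = 400.0, 10.0
for frac in (0.07, 0.13):
    overhead = frac * die_mm2
    print(f"{frac:.0%} of die = {overhead:.0f} mm^2 = ~{overhead / core_mm2:.1f} cores")
# 7% -> 28 mm^2 (~2.8 cores); 13% -> 52 mm^2 (~5.2 cores)
```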

Page 23:

SBF: Wiring Area Overhead

Control overhead is 37-63% of the total overhead.
This constrains how much area can be reduced by using narrower busses.

[Same chart as the previous slide.]

Page 24:

SBF: Wiring Area Overhead

This argues against lightweight cores: they do not amortize the incremental cost to the interconnect.

[Same chart as the previous slide.]

Page 25:

SBF: Power Overhead

Power overhead can be significant for a large number of cores.

[Chart: power (W) vs. number of cores (4, 8, 16). Components: leakage in logic latches; dynamic power in logic latches (w/o gating); leakage in wiring latches; leakage in repeaters; dynamic power in wiring latches (w/o gating); dynamic power due to repeater capacitance (AF = 0.2); dynamic power due to wire capacitance (AF = 0.2).]

Page 26:

SBF: Power Overhead

[Same chart as the previous slide.]

Power due to the queues is more than that due to the wires!
A good interconnect architecture should have efficient queuing and flow control.

Page 27:

SBF: Performance

[Chart: CPI vs. number of cores (4, 8, 16), with and without interconnection overhead.]

Interconnect overhead can be significant: 10-26%!
The interconnect accounts for over half the latency to the L2 cache.

Page 28:

Shared Caches and Crossbar

Results for the 8-core processor with 2-way, 4-way and full sharing of the L2 cache.

Results are shown for two cases:
When the crossbar sits between the cores and the L2: easy interfacing, but all wiring tracks contribute to area overhead.
When the crossbar is routed over the L2: interfacing is difficult, and area overhead comes only from reduced cache density.

Page 29:

Crossbar: Area Overhead

[Chart: area overhead (mm^2) vs. metal plane (1X, 2X, 4X) for two-way sharing, four-way sharing and all-sharing, plus a case assuming the crossbar can be routed over the L2 (4-way sharing).]

The crossbar takes 11-46% of the die for a 2X implementation!
That is sufficient to put in 4 more cores, even for 4-way sharing!

Page 30:

Crossbar: Area Overhead (contd)

What is the point of cache sharing? Cores get the effect of having more cache space.

But we have to reduce the size of the shared cache to accommodate the crossbar. Is a larger cache through sharing an illusion, or can we really get larger caches by making them private and reclaiming the area used by the crossbar?

In other words, does sharing have any benefit in such scenarios?

Page 31:

Crossbar: Performance Overhead

[Chart: CPI for all_private, two_shared, four_shared and all_shared configurations, for a 2X crossbar between cores and L2, a 2X crossbar routed over the L2, and a 4X crossbar routed over the L2.]

Page 32:

Accompanying Grain of Salt

A simplified interconnection model is assumed.

Systems with memory scouts etc. may have different memory system requirements.

Non-Uniform Cache Architectures (NUCA) might improve performance.

Etc. etc.

However, the results do show that a shared cache is significantly less desirable for future technologies.

Page 33:

What have we learned so far? (in terms of bottlenecks)

Page 34:

Interconnection bottlenecks (and possible solutions)

Long wires result in long latencies: see if wires can be shortened.

Centralized arbitration: see if arbitration can be distributed.

Overheads get worse with the number of modules connected to a bus: see if the number of modules connected to a bus can be decreased.

Page 35:

A Hierarchical Interconnect

[Figure: 8-core floorplan with the cores, NCUs and L2 data banks attached to a single SBF, alongside the IOX and memory controllers.]

Page 36:

A Hierarchical Interconnect

A local and a remote SBF (smaller average-case latency, longer worst-case latency).

[Figure: the 8-core floorplan with cores and L2 banks split across a local and a remote SBF, connected by a P2P link (P2PL); the IOX and memory controllers attach to one of the SBFs.]
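
A minimal sketch (with assumed cycle counts) of why this splits the latency distribution: requests satisfied on the local SBF take a short path, while requests that must cross the P2P link to the other SBF pay a longer one.

```python
# Sketch only: average transaction latency on the hierarchical fabric.
def avg_latency_cycles(p_local, local=20, p2p=8, remote=20):
    return p_local * local + (1 - p_local) * (local + p2p + remote)

print(f"{avg_latency_cycles(0.8):.1f}")   # mostly-local traffic: 25.6 cycles
print(f"{avg_latency_cycles(0.3):.1f}")   # mostly-remote traffic: 39.6 cycles
```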

Page 37:

A Hierarchical Interconnect (contd)

Threads need to be mapped intelligently, to increase the hit rate in the caches connected to the local SBF (see the sketch below).

In some cases, even random mapping results in better performance, e.g. for the 8-core processor shown.

More research needs to be done on hierarchical interconnects.

More details are in the paper.
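
One illustrative mapping heuristic (an assumption of mine, not the paper's scheme): greedily co-locate the thread pairs that share the most data on cores attached to the same local SBF, so their cache-to-cache traffic stays off the remote SBF.

```python
# Sketch only: greedy thread-to-cluster mapping driven by pairwise sharing counts.

def map_threads(sharing, cores_per_cluster=2):
    """sharing: {(t1, t2): shared_accesses}; returns {thread: cluster_index}."""
    cluster_of, clusters = {}, [[]]
    for t1, t2 in sorted(sharing, key=sharing.get, reverse=True):
        for t in (t1, t2):
            if t not in cluster_of:
                if len(clusters[-1]) == cores_per_cluster:
                    clusters.append([])          # current cluster full: open a new one
                clusters[-1].append(t)
                cluster_of[t] = len(clusters) - 1
    return cluster_of

# Heavily-sharing pairs end up on the same local SBF cluster.
print(map_threads({("a", "b"): 90, ("c", "d"): 70, ("e", "f"): 50, ("a", "c"): 10}))
```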

Page 38:

Conclusions

Design choices for interconnects have a significant effect on the rest of the chip; interconnects should be co-designed with cores and caches.

Interconnection power and performance overheads can be almost as much logic-dominated as wire-dominated. Don't think about wires only: arbitration, queuing and flow control are important.

Some common architectural beliefs (e.g. shared L2 caches) may not hold when interconnection overheads are accounted for. We should do careful interconnect modeling in our CMP research proposals.

A hierarchical bus structure can negate some of the interconnection performance cost.