
Page 1: Tera-Scale CMP

Intel's Tera-scale computing project: 100 cores, >100 threads; a datacenter-on-a-chip

Sun's Niagara 2: 8 cores, 64 threads

Key design issues: architecture challenges and tradeoffs, packaging and off-chip memory bandwidth, software and the runtime environment

CDA6159 fa07 Peir

Page 2: Many-Core CMPs – High-Level View

[Figure: many cores, each with private L1 I/D and L2 caches]

What are the key architecture issues in many-core CMPs?

On-die interconnect
Cache organization & cache coherence
I/O and memory architecture

Page 3: The General Block Diagram

FFU: Fixed Function Unit; Mem C: Memory Controller; PCI-E C: PCI-based Controller; R: Router; ShdU: Shader Unit; Sys I/F: System Interface; TexU: Texture Unit

Page 4: On-Die Interconnect

2D embedding of a 64-core 3D-mesh network: the longest topological distance grows from 9 hops to 18!

Page 5: On-Die Interconnect

Must satisfy bandwidth and latency requirements within the power/area budget
Ring or 2D mesh/torus are good candidate topologies: wiring density, router complexity, design complexity
Multiple source/destination pairs can be switched together; avoiding stopping and buffering packets saves power and helps throughput
Crossbars and general routers are power hungry
Fault-tolerant interconnect: provide spare modules, allow fault-tolerant routing
Partition for performance isolation
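
As a rough comparison of the candidate topologies above, the sketch below computes diameter (worst-case hop count) and bisection width for a ring, a 2D mesh, and a 2D torus. The formulas are the standard textbook ones; the node count of 64 is just an example, not a figure from the slides.

```python
import math

def ring_metrics(n):
    """Bidirectional ring of n nodes."""
    diameter = n // 2          # farthest node is halfway around the ring
    bisection = 2              # cutting the ring in half severs 2 links
    return diameter, bisection

def mesh_metrics(n):
    """Square k x k 2D mesh (n must be a perfect square)."""
    k = int(math.isqrt(n))
    diameter = 2 * (k - 1)     # corner-to-corner Manhattan distance
    bisection = k              # k links cross the vertical midline
    return diameter, bisection

def torus_metrics(n):
    """Square k x k 2D torus with wraparound links."""
    k = int(math.isqrt(n))
    diameter = 2 * (k // 2)    # wraparound halves each dimension's distance
    bisection = 2 * k          # wraparound doubles the links across the cut
    return diameter, bisection

if __name__ == "__main__":
    n = 64  # example core count
    for name, fn in [("ring", ring_metrics),
                     ("2D mesh", mesh_metrics),
                     ("2D torus", torus_metrics)]:
        d, b = fn(n)
        print(f"{name:8s}: diameter = {d:2d} hops, bisection = {b:2d} links")
```

The ring wins on wiring density and router simplicity but its diameter and bisection scale poorly; the mesh and torus trade extra wiring and router ports for much better scaling, which is the tradeoff the slide lists.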

Page 6: Performance Isolation in a 2D Mesh

Performance isolation in a 2D mesh with partitioning, e.g., 3 rectangular partitions
Intra-partition communication is confined within the partition; traffic generated in one partition does not affect the others
Virtualization of network interfaces: expose the interconnect as an abstraction to applications, allowing programmers to fine-tune an application's inter-processor communication
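
The isolation claim rests on a simple property: with dimension-ordered (XY) routing, a packet traveling between two nodes of a rectangular partition never leaves that rectangle. The sketch below checks the property by enumerating the XY path; the routing function and partition shape are illustrative assumptions, not details given in the slides.

```python
def xy_route(src, dst):
    """Hops of dimension-ordered (XY) routing on a 2D mesh: X first, then Y."""
    (x, y), (dx, dy) = src, dst
    path = [(x, y)]
    while x != dx:
        x += 1 if dx > x else -1
        path.append((x, y))
    while y != dy:
        y += 1 if dy > y else -1
        path.append((x, y))
    return path

def stays_inside(rect, src, dst):
    """rect = (x0, y0, x1, y1), inclusive corners of a rectangular partition."""
    x0, y0, x1, y1 = rect
    return all(x0 <= x <= x1 and y0 <= y <= y1 for x, y in xy_route(src, dst))

if __name__ == "__main__":
    # Hypothetical 3x4 rectangular partition inside a larger mesh.
    part = (2, 1, 4, 4)
    nodes = [(x, y) for x in range(2, 5) for y in range(1, 5)]
    ok = all(stays_inside(part, s, d) for s in nodes for d in nodes)
    print("All intra-partition XY routes stay inside the partition:", ok)
```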

Page 7: Basic VC Router for On-Die Interconnect

Page 8: 2-D Mesh Router Layout

Page 9: Router Pipeline

Micro-architecture optimizations of the router pipeline: use Express Virtual Channels (EVCs) for intermediate hops to eliminate the VA and SA stages and bypass the buffers; static EVC vs. dynamic EVC

Intelligent interconnect topologies
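
To make the EVC idea concrete, here is a minimal latency sketch: it assumes a baseline 4-stage router pipeline (buffer write/route compute, VA, SA, switch + link traversal) and assumes an express hop skips the VA and SA stages. The stage counts and the split between normal and express hops are illustrative assumptions, not figures from the slides.

```python
def packet_latency(hops, express_hops=0,
                   baseline_stages=4, bypass_stages=2):
    """Cycles for the head flit to cross `hops` routers when `express_hops`
    of them are traversed on an express virtual channel (VA/SA skipped)."""
    assert 0 <= express_hops <= hops
    normal = hops - express_hops
    return normal * baseline_stages + express_hops * bypass_stages

if __name__ == "__main__":
    hops = 8  # e.g., a long path across a 2D mesh
    for express in (0, 4, 6):
        print(f"{hops} hops, {express} express: "
              f"{packet_latency(hops, express)} cycles")
```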

Page 10: Many-Core CMPs

[Figure: cores, each with L1 I/D and L2 caches]

What about on-die cache organization with so many cores?

Shared vs. private
Cache capacity vs. accessibility
Data replication vs. block migration
Cache partitioning

Page 11: CMP Cache Organization

Page 12: Capacity vs. Accessibility – A Tradeoff

Capacity favors a shared cache
  No data replication, no cache coherence among the L2 banks
  Longer access time, contention issues
  Flexible sharing of cache capacity
  Fair sharing among cores – cache partitioning

Accessibility favors private caches
  Fast local access with data replication, but capacity may suffer
  Coherence must be maintained among the private caches
  Equal partition, inflexible

Many works try to take advantage of both
  Capacity sharing on private caches – cooperative caching
  Utility-based cache partitioning on a shared cache

Page 13: Analytical Data Replication Model

Reuse distance histogram f(x): the number of accesses with reuse distance x, for a cache of size S.

Total hits = area beneath the curve: $\int_0^{S} f(x)\,dx$

Devote capacity R of the cache to replicas, so the effective capacity decreases to S − R:

Cache miss increase: $\int_{S-R}^{S} f(x)\,dx$

Cache hits now: $\int_0^{S-R} f(x)\,dx$

Local hit increase: a fraction R/S of the remaining hits go to replicas, and a fraction L of those replica hits are local, i.e. $L \cdot \frac{R}{S} \int_0^{S-R} f(x)\,dx$

With P the miss penalty (cycles) and G the local-hit gain (cycles), the net memory access cycle increase is

$P \int_{S-R}^{S} f(x)\,dx \;-\; G \cdot L \cdot \frac{R}{S} \int_0^{S-R} f(x)\,dx$
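
To make the model concrete, here is a minimal numeric sketch of the formula above for an arbitrary reuse-distance histogram f(x). The exponential f used in the demo is a placeholder, not the OLTP fit on the next slide, and the integration helper is plain trapezoidal quadrature.

```python
import math

def integrate(f, a, b, n=10_000):
    """Trapezoidal rule; good enough for a smooth reuse-distance fit."""
    if b <= a:
        return 0.0
    h = (b - a) / n
    return h * (0.5 * (f(a) + f(b)) + sum(f(a + i * h) for i in range(1, n)))

def net_cycle_increase(f, S, R, P, G, L):
    """Net memory access cycle increase from devoting capacity R (out of S)
    to replicas:  P * (extra misses)  -  G * (extra local hits)."""
    extra_misses = integrate(f, S - R, S)        # hits lost to the reduced capacity
    remaining_hits = integrate(f, 0, S - R)      # hits that still fit
    extra_local_hits = L * (R / S) * remaining_hits
    return P * extra_misses - G * extra_local_hits

if __name__ == "__main__":
    # Placeholder histogram (not the OLTP fit): exponential decay, x in MB.
    f = lambda x: 1.0e6 * math.exp(-1.5 * x)
    S = 4.0                                      # cache size in MB
    for frac in (0.0, 0.25, 0.5, 0.75):
        R = frac * S
        inc = net_cycle_increase(f, S, R, P=400, G=15, L=0.5)
        print(f"R/S = {frac:4.2f}: net cycle increase = {inc:+.3e}")
    # Negative values mean replication lowers the average access cost.
```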

Page 14: Get Histogram f(x) for OLTP

[Plot: reuse distance (MB) on the x-axis vs. reuse histogram f(x) (x 10^6) on the y-axis]

Step 1: Stack simulation – collect the discrete reuse distance histogram
Step 2: Matlab curve fitting – find a math expression:
$f(x) = A\,e^{-Bx}$, where $A = 6.084 \times 10^{6}$ and $B = 2.658 \times 10^{-3}$
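
The deck uses Matlab for the fitting step; a Python equivalent is sketched below with scipy.optimize.curve_fit. The "measured" histogram here is synthetic data generated from an assumed exponential, standing in for the real stack-simulation output.

```python
import numpy as np
from scipy.optimize import curve_fit

def model(x, A, B):
    """Exponential reuse-distance fit f(x) = A * exp(-B * x)."""
    return A * np.exp(-B * x)

# Stand-in for the stack-simulation output: reuse distance (in KB) vs. count.
rng = np.random.default_rng(0)
x = np.linspace(0, 8192, 200)                    # 0 .. 8 MB, in KB
true_A, true_B = 6.0e6, 2.7e-3                   # assumed "ground truth"
y = model(x, true_A, true_B) * rng.normal(1.0, 0.05, x.size)   # 5% noise

(A_fit, B_fit), _ = curve_fit(model, x, y, p0=(1e6, 1e-3))
print(f"fitted A = {A_fit:.3e}, B = {B_fit:.3e}")
```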

Page 15: Data Replication Effects

[Three plots: average access time increase (cycles) vs. fraction of replication R/S, from 0 to 7/8, for S = 2 MB, 4 MB, and 8 MB, using the fitted f(x) with G = 15, P = 400, L = 0.5]

Model: net increase $= P \int_{S-R}^{S} f(x)\,dx \;-\; G \cdot L \cdot \frac{R}{S} \int_0^{S-R} f(x)\,dx$

Best replication fraction: 0% for S = 2 MB, 40% for S = 4 MB, 65% for S = 8 MB.

Data replication impacts vary with different cache sizes.
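
A short sketch that reproduces the shape of this study: sweep the replication fraction R/S for each cache size using the fitted f(x) and the parameters above (P = 400, G = 15, L = 0.5). The slides do not state the units of x in the fit; assuming x is in KB, the sweep lands on best fractions of roughly 0%, 40%, and 65% for 2, 4, and 8 MB, consistent with the numbers quoted above.

```python
import math

A, B = 6.084e6, 2.658e-3   # fitted f(x) = A * exp(-B * x), x assumed to be in KB
P, G, L = 400, 15, 0.5     # miss penalty, local-hit gain, local fraction of replica hits

def hits(a, b):
    """Closed-form integral of A*exp(-B*x) over [a, b]."""
    return (A / B) * (math.exp(-B * a) - math.exp(-B * b))

def net_increase(S, R):
    """Net memory access cycle increase for cache size S with R used for replicas."""
    return P * hits(S - R, S) - G * L * (R / S) * hits(0, S - R)

if __name__ == "__main__":
    for S_mb in (2, 4, 8):
        S = S_mb * 1024                              # cache size in KB
        fracs = [i / 100 for i in range(0, 90, 5)]   # R/S from 0 to 0.85
        best = min(fracs, key=lambda fr: net_increase(S, fr * S))
        print(f"S = {S_mb} MB: best replication fraction ~ {best:.0%}")
```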

Page 16: Many-Core CMPs

[Figure: cores, each with L1 I/D and L2 caches]

What about cache coherence with so many cores and caches?

Snooping bus: broadcast requests
Directory-based: maintain memory-block information
Review Culler's book

Page 17: Simplicity – Shared L2, Write-Through L1

Existing designs
  IBM POWER4 & POWER5
  Sun Niagara & Niagara 2
  Small number of cores, multiple L2 banks, crossbar

Still need L1 coherence!
  Inclusive L2: the L2 directory records the L1 sharers in POWER4 & 5
  Non-inclusive L2: shadow L1 directory in Niagara

L2 (shared) coherence among multiple CMPs: a private L2 is assumed

Page 18: Other Considerations

Broadcast
  Snooping bus: loading, speed, space, power, scalability, etc.
  Ring: slow traversal, ordering, scalability

Memory-based directory
  Huge directory space
  Directory cache: extra penalty

Shadow L2 directory: a copy of all local L2 directories
  Aggregated associativity = cores x ways/core; 64 x 16 = 1024 ways
  High power

Page 19: Directory-Based Approach

The directory maintains the state and location of every cached block

The directory is checked when data cannot be accessed locally, e.g., on a cache miss or a write to a shared block

The directory may route the request to a remote cache to fetch the requested block
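
As an illustration of these three points, below is a minimal sketch of a full-map directory: one entry per block holding a state and a set of sharers (presence information), with read misses and writes routed through the directory. The states and forwarding policy are a simplified MSI-style assumption, not the specific protocol of any machine in the slides.

```python
from dataclasses import dataclass, field

@dataclass
class DirEntry:
    state: str = "I"                             # I (uncached), S (shared), M (modified)
    sharers: set = field(default_factory=set)    # presence information, kept as core IDs

class Directory:
    """Toy full-map directory: tracks the state and location of cached blocks."""
    def __init__(self):
        self.entries = {}                        # block address -> DirEntry

    def read_miss(self, block, core):
        e = self.entries.setdefault(block, DirEntry())
        if e.state == "M":                       # dirty in one cache: forward the request
            owner = next(iter(e.sharers))
            action = f"forward read to owner core {owner}, downgrade M -> S"
        elif e.state == "S":
            action = "supply data from memory or a sharer"
        else:
            action = "supply data from memory"
        e.state = "S"
        e.sharers.add(core)
        return action

    def write(self, block, core):
        """Write miss or write to a shared block: invalidate all other sharers."""
        e = self.entries.setdefault(block, DirEntry())
        victims = sorted(e.sharers - {core})
        e.state, e.sharers = "M", {core}
        return f"invalidate cores {victims}" if victims else "no invalidations needed"

if __name__ == "__main__":
    d = Directory()
    print(d.read_miss(0x40, core=1))   # supply data from memory
    print(d.read_miss(0x40, core=2))   # supply data from memory or a sharer
    print(d.write(0x40, core=2))       # invalidate cores [1]  (write to shared data)
    print(d.read_miss(0x40, core=3))   # forward read to owner core 2, downgrade M -> S
```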

Page 20: Sparse Directory Approach

Holds state for all cached blocks
Low-cost set-associative design, no backup
Key issues:
  Centralized vs. distributed
  Extra invalidations due to conflicts
  Presence bits vs. duplicated blocks

Page 21: Conflict Issues in the Coherence Directory

The coherence directory must be a superset of all cached blocks

Uneven distribution of cached blocks across directory sets causes invalidations

Potential solutions: high set associativity (costly); directory + victim directory; randomization and skewed associativity (sketched below); a bigger directory (costly); others?
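
Of these options, skewed associativity is the easiest to show in code. The sketch below indexes each directory way with a different hash of the block address, so blocks that collide in one way tend not to collide in the others; the hash choice, sizes, and the "pressure" metric are illustrative assumptions.

```python
WAYS, SETS = 8, 1024

def skewed_sets(block):
    """Per-way set index: each way uses a different hash of the block address,
    so a group of blocks that conflicts in one way is spread out in the others."""
    return [hash((way, block)) % SETS for way in range(WAYS)]

def conventional_sets(block):
    """Conventional set-associative indexing: every way uses the same index."""
    return [block % SETS] * WAYS

def max_set_pressure(blocks, index_fn):
    """Worst-case number of blocks competing for any single (way, set) slot."""
    load = {}
    for b in blocks:
        for way, s in enumerate(index_fn(b)):
            load[(way, s)] = load.get((way, s), 0) + 1
    return max(load.values())

if __name__ == "__main__":
    # Pathological pattern: block addresses that share the same low-order index bits.
    blocks = [i * SETS for i in range(64)]
    print("conventional worst-case pressure:", max_set_pressure(blocks, conventional_sets))
    print("skewed       worst-case pressure:", max_set_pressure(blocks, skewed_sets))
```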

Page 22: Impact of Invalidation due to Directory Conflict

[Bar chart: valid blocks (%), 50-100%, for OLTP, Apache, SPECjbb, SPEC-2000, and SPEC-2006, comparing directory configurations Set-8w, Set-16w, Set-32w, Set-64w, and Set-2x-8w; annotated points: 72%, 75%, 93%, 96%]

• 8-core CMP, 1 MB 8-way private L2 per core (8 MB total)
• Set-associative directory; number of directory entries = total number of cache blocks
• Each cached block occupies a directory entry

Page 23: Presence-Bit Issues in the Directory

Presence bits (or not?)
  Extra space, useless for multiprogrammed workloads
  The coherence directory must cover all cached blocks (consider the no-sharing case)

Potential solutions (see the sketch below):
  Coarse-granularity presence bits
  Sparse presence vectors – record core IDs
  Allow duplicated block addresses, each with a few core IDs, and enable multiple hits on a directory search
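
To compare the space these options need, the sketch below computes the sharer-tracking bits per directory entry for a full bit vector, coarse-grained presence bits (one bit per group of cores), and a limited-pointer scheme that stores a few explicit core IDs. The group size and pointer count are illustrative assumptions.

```python
import math

def full_vector_bits(cores):
    """One presence bit per core."""
    return cores

def coarse_vector_bits(cores, group):
    """One presence bit per group of `group` cores (coarse granularity)."""
    return math.ceil(cores / group)

def limited_pointer_bits(cores, pointers):
    """Store up to `pointers` explicit core IDs instead of a bit vector."""
    return pointers * math.ceil(math.log2(cores))

if __name__ == "__main__":
    for cores in (64, 128, 256):
        print(f"{cores:3d} cores: full = {full_vector_bits(cores):3d} b, "
              f"coarse (groups of 4) = {coarse_vector_bits(cores, 4):3d} b, "
              f"4 pointers = {limited_pointer_bits(cores, 4):3d} b")
```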

Page 24: Valid Blocks

[Bar chart: valid blocks (%), 50-100%, for OLTP, Apache, SPECjbb, SPEC-2000, and SPEC-2006, comparing Set-8w, Set-8w-64v, Skew-8w, Set-10w-1/4, Set-8w-p, and Set-full directories]

Presence bits: not needed for multiprogrammed workloads, needed for multithreaded workloads.
Skewing and the 10w-1/4 configuration help; 64v makes no difference.

Page 25: Challenge in Memory Bandwidth

Off-chip memory bandwidth must increase to sustain chip-level IPC
  Need power-efficient, high-speed off-die I/O
  Need power-efficient, high-bandwidth DRAM access

Potential solutions: embedded DRAM; integrated DRAM or GDDR inside the processor package; 3D stacking of multiple DRAM/processor dies
  Many technology issues to overcome

Page 26: Memory Bandwidth Fundamentals

BW = number of bits x bit rate
  A typical DDR2 bus is 16 bytes (128 bits) wide and operates at 800 Mb/s per pin; the memory bandwidth of that bus is 16 bytes x 800 MT/s = 12.8 GB/s

Latency and capacity
  Fast but small-capacity on-chip SRAM (caches)
  Slow but large-capacity off-chip DRAM
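
The peak-bandwidth arithmetic is simple enough to capture in a few lines: multiply bus width by the per-pin data rate. The DDR2-800 example reproduces the 12.8 GB/s figure from the slide; the wider/faster configuration is a hypothetical, just to show how the two factors scale.

```python
def peak_bandwidth_gbs(bus_width_bits, data_rate_mtps):
    """Peak bandwidth in GB/s: (bus width in bits / 8) bytes per transfer,
    times millions of transfers per second, divided by 1000."""
    return (bus_width_bits / 8) * data_rate_mtps / 1000.0

if __name__ == "__main__":
    # DDR2-800 example from the slide: 128-bit (16-byte) bus at 800 MT/s.
    print(f"DDR2-800, 128-bit bus: {peak_bandwidth_gbs(128, 800):.1f} GB/s")
    # Hypothetical wider and faster bus, to see the scaling.
    print(f"256-bit bus at 1600 MT/s: {peak_bandwidth_gbs(256, 1600):.1f} GB/s")
```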

Page 27: Memory Bus vs. System Bus Bandwidth

Scaling bus capability has usually meant increasing the bus width and the bus speed at the same time

Page 28: Integrated CPU with Memory Controller

Eliminates the off-chip controller delay; fast, but harder to adopt new DRAM technologies

The entire burden of pin count and interconnect speed needed to sustain growing memory bandwidth requirements now falls on the CPU package alone

Page 29: Challenge in Memory Bandwidth and Pin Count

Page 30: Challenge in Memory Bandwidth

Historical trend of memory bandwidth demand: the current generation needs 10-20 GB/s; the next generation will need >100 GB/s and could reach 1 TB/s

Page 31: New Packaging

Page 32: New Packaging

Page 33: New Packaging