
CloudCache: Expanding and Shrinking Private Caches

Sangyeun Cho
Computer Science Department, University of Pittsburgh

Credits

Parts of the work presented in this talk are from results obtained in collaboration with students and faculty at the University of Pittsburgh:
• Mohammad Hammoud
• Lei Jin
• Hyunjin Lee
• Kiyeon Lee
• Prof. Bruce Childers
• Prof. Rami Melhem

Partial support has come from:
• NSF grant CCF-0702236
• NSF grant CCF-0952273
• A. Richard Newton Graduate Scholarship, ACM DAC 2008


Recent multicore design trends

Modular designs based on relatively simple cores
• Easier to validate (a single core)
• Easier to scale (the same validated design replicated multiple times)
• For these reasons, future “many-core” processors will look like this
• Examples: Tilera TILE*, Intel 48-core chip [Howard et al., ISSCC ’10]

Design focus shifts from cores to “uncores”
• Network on-chip (NoC): bus, crossbar, ring, 2D/3D mesh, …
• Memory hierarchy: caches, coherence mechanisms, memory controllers, …
• System-on-chip components like network MAC, cipher, …

Note also:
• In this design, aggregate cache capacity increases as we add more tiles
• Some tiles may be active while others are inactive


L2 cache design issues

L2 caches occupy a major portion (probably the largest) of the chip area

Capacity vs. latency
• Off-chip access is (still) expensive
• On-chip latency can grow; with switched on-chip networks, we get NUCA (Non-Uniform Cache Architecture)
• Private cache vs. shared cache, or a hybrid scheme?

Quality of service
• Interference (capacity/bandwidth)
• Explicit/implicit capacity partitioning

Manageability, fault and yield
• Not all caches have to be active
• Hard (irreversible) faults + process variation + soft faults

Private cache vs. shared cache

Private cache:
• Short hit latency (always local)
• High on-chip miss rate
• Long miss resolution time (i.e., are there replicas?)

Shared cache:
• Low on-chip miss rate
• Straightforward data location
• Long average hit latency
• Uncontrolled interference

Historical perspective

Private designs came first
• Multiprocessors based on microprocessors

Shared cache clustering [Nayfeh and Olukotun, ISCA ‘94]

Multicores with a private design are more straightforward
• Earlier AMD chips and Sun Microsystems’ UltraSPARC processors

Intel, IBM, and Sun pushed for shared caches (e.g., Xeon products, POWER-4/5/6, Niagara*)
• A variety of workloads perform relatively well

More recently, private caches regain popularity
• We have more transistors
• Interference is easily controlled
• They match well with on-/off-chip (shared) L3 $$ [Oh et al., ISVLSI ‘09]

Shared $$ issue 1: NUCA

A compromise design
• Large monolithic shared cache: conceptually simple but very slow
• Smaller cache slices w/ disparate latencies [Kim et al., ASPLOS ‘02]
• Interleave data to all available slices (e.g., IBM POWER-*); see the sketch below

Problems
• Blind data distribution with no provisions for latency hiding
• As a result, simple NUCA performance is not scalable

(Figure: “your program” runs on one tile while “your data” is interleaved across distant slices.)
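To make the blind interleaving concrete, here is a minimal sketch (my illustration, not the slides' exact scheme; the block size, slice count, and names are assumptions): a block's home slice is derived purely from its address, regardless of which core uses the data.

```python
# Minimal sketch of static NUCA interleaving (assumed parameters: 64B blocks,
# 16 slices). A block's home slice depends only on its address, not on which
# core actually uses the data, hence the "blind" distribution.

BLOCK_BYTES = 64
NUM_SLICES = 16

def home_slice(phys_addr: int) -> int:
    """Return the index of the L2 slice that statically holds this block."""
    block_number = phys_addr // BLOCK_BYTES
    return block_number % NUM_SLICES

# Consecutive blocks land on consecutive slices, wherever the requester sits.
for addr in range(0, 4 * BLOCK_BYTES, BLOCK_BYTES):
    print(hex(addr), "-> slice", home_slice(addr))
```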


Improving program-data distance

Key idea: replicate or migrate

Victim replication [Zhang and Asanović, ISCA ’05] (see the sketch below)
• Victims from the private L1 cache are “replicated” to the local L2 bank
• Essentially, local L2 banks incorporate large victim caching space
• A natural hybrid scheme that takes advantage of shared & private $$

Adaptive Selective Replication [Beckmann et al., MICRO ‘06]
• Uncontrolled replication can be detrimental (the shared cache degenerates to a private cache)
• Based on benefit and cost, control the replication level
• The hardware support is quite complex

Adaptive Controlled Migration [Hammoud, Cho, and Melhem, HiPEAC ‘09]
• Migrate (rather than replicate) a block to minimize the average access latency to this block, based on Manhattan distance
• Caveat: book-keeping overheads
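A minimal sketch of the victim-replication idea (my simplification of [Zhang and Asanović, ISCA ’05], with invented structure names): on an L1 eviction the block is kept in the local L2 bank as a replica, so a later reuse hits locally instead of traveling to the block's home slice.

```python
# Sketch of victim replication (illustrative): the local L2 bank doubles as a
# large victim cache for blocks evicted from the private L1.

class Tile:
    def __init__(self):
        self.l1 = {}        # addr -> data (private L1)
        self.local_l2 = {}  # addr -> data (local L2 bank, also holds replicas)

def evict_from_l1(tile: Tile, addr: int) -> None:
    """On an L1 eviction, replicate the victim into the local L2 bank."""
    data = tile.l1.pop(addr)
    tile.local_l2[addr] = data          # replica; the home slice keeps its copy

def read(tile: Tile, addr: int, home_slice_read):
    """Check the L1, then the local replica space, then the home slice."""
    if addr in tile.l1:
        return tile.l1[addr]
    if addr in tile.local_l2:           # hit on a replica: no network round trip
        return tile.local_l2[addr]
    return home_slice_read(addr)        # remote access to the block's home slice
```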

3-way communication

Once you migrate a cache block away from its home, how do you locate it? [Hammoud, Cho, and Melhem, HiPEAC ‘09]

(Figure: a requester asks the home tile of B (1), the home tile forwards the request to B’s new host (2), and the new host returns the data (3); any other requester goes through the same 3-way exchange.)

Cache block location strategies

Always-go-home
• “Go to the home tile (directory) to find the tracking information”
• Lots of 3-way cache-to-cache transfers

Broadcasting
• “Don’t remember anything”
• Expensive

Caching of tracking information at potential requesters
• “Remember (locally) where previously accessed blocks are”
• Quickly locate the target block locally (after the first miss)
• Needs maintenance (coherence) of the distributed information
• We proposed and studied one such technique [Hammoud, Cho, and Melhem, HiPEAC ‘09]

Caching of tracking information

(Figure: the home tile of B keeps the primary tracking information in its directory, while a requester tile for B holds replicated tracking information, a tag plus one pointer, in its tracking information table (TrT).)

Primary tracking information at the home tile
• Shows where the block has been migrated to
• Keeps track of who has copied the tracking information (bit vector)

Tracking information table (TrT)
• A copy of the tracking information (only one pointer)
• Updated as necessary (initiated by the directory of the home tile)

A small data-structure sketch follows.
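A minimal data-structure sketch of the two tables above (field and function names are my own; the real entries are small hardware tables): the home directory keeps the primary tracking entry, that is, the migrated block's current host plus a bit vector of tiles that copied the pointer, while each requester caches a one-pointer TrT entry.

```python
# Sketch of the tracking structures for locating migrated blocks (illustrative).

from dataclasses import dataclass, field

@dataclass
class PrimaryTrackingEntry:      # lives at the block's home tile (directory)
    current_host: int            # tile the block has migrated to
    copiers: set = field(default_factory=set)   # tiles holding a TrT copy

@dataclass
class TrTEntry:                  # lives at a requester tile
    tag: int                     # block tag
    host_ptr: int                # single pointer to the current host

def locate(requester_trt: dict, home_dir: dict, block: int, requester: int) -> int:
    """Return the tile to ask for `block`, updating the tracking state."""
    if block in requester_trt:                     # fast path: local TrT hit
        return requester_trt[block].host_ptr
    entry = home_dir[block]                        # otherwise, ask the home tile
    entry.copiers.add(requester)                   # home remembers who copied it
    requester_trt[block] = TrTEntry(block, entry.current_host)
    return entry.current_host
```

When the block migrates again, the home directory can use the copier set to update or invalidate the replicated TrT entries, which is the maintenance cost noted above.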

A flexible data mapping approach

Key idea: What if we distribute memory pages to cache banks instead of memory blocks? [Cho and Jin, MICRO ‘06]

(Figure: block-grain interleaving spreads individual memory blocks across banks; page-grain interleaving spreads whole memory pages.)

Observation

(Figure: the OS page allocator decides which cache banks the memory pages of Program 1 and Program 2 land in.)

• Software maps data to different $$
• Key: OS page allocation policies
• Flexible (see the sketch below)
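A minimal sketch of bank-aware page allocation in this spirit (my illustration; the free-list organization and names are assumptions): with page-grain interleaving, a page's cache bank is a function of its physical frame number, so the OS picks the bank simply by picking the frame.

```python
# Sketch of OS-level page-to-bank mapping (illustrative).

from collections import defaultdict

NUM_BANKS = 16                  # one L2 bank (slice) per tile (assumed)

def bank_of(frame_number: int) -> int:
    """Page color: the L2 bank this physical frame maps to."""
    return frame_number % NUM_BANKS

# Free physical frames kept in per-color (per-bank) free lists.
free_frames = defaultdict(list)
for frame in range(1024):
    free_frames[bank_of(frame)].append(frame)

def allocate_page(preferred_bank: int) -> int:
    """Allocate a frame mapping to the preferred bank (e.g., the requester's
    local bank), falling back to other banks if that color is exhausted."""
    order = [preferred_bank] + [b for b in range(NUM_BANKS) if b != preferred_bank]
    for bank in order:
        if free_frames[bank]:
            return free_frames[bank].pop()
    raise MemoryError("out of physical frames")

# A program running on tile 5 gets its pages placed in bank 5 when possible.
frame = allocate_page(preferred_bank=5)
assert bank_of(frame) == 5
```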

Shared $$ issue 2: interference

Co-scheduled programs compete for cache capacity freely (“capitalism”)

Well-behaving programs get damaged if another program vies for more cache capacity w/ little reuse (e.g., streaming)

Performance loss due to interference is hard to predict

(Figure: measured interference on an 8-core Niagara.)


Explicit/implicit partitioning

Strict partitioning based on system directives
• E.g., program A gets 768KB and program B 256KB
• The system must allow a user to specify partition sizes [Iyer, ICS ‘04][Guo et al., MICRO ‘07]

Architectural/system mechanisms
• Way partitioning (e.g., 5 vs. 3 ways) [Suh et al., ICS ‘01][Kim et al., PACT ‘04]
• Bin/bank partitioning: maps pages to cache bins (no hardware support needed) [Liu et al., HPCA ‘04][Cho, Jin and Lee, RTCSA ‘07][Lin et al., HPCA ‘08]

Implicit partitioning based on utility (“utilitarian”)
• Way partitioning based on marginal gains [Suh et al., ICS ‘01][Qureshi et al., MICRO ‘06]; see the sketch below
• Bank level [Lee, Cho and Childers, HPCA ’10]
• Bank + way level [Lee, Cho and Childers, HPCA ’11]
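A minimal sketch of marginal-gain (utility-based) way partitioning in the spirit of [Suh et al., ICS ‘01] and [Qureshi et al., MICRO ‘06] (the inputs and names are my assumptions): each program supplies a hit curve over the number of ways, typically from monitor/shadow tags, and ways are granted greedily to whichever program gains the most hits from one more way.

```python
# Sketch of greedy utility-based way partitioning (illustrative).
# hits[p][w] = hits program p would see with w ways (w = 0..ASSOC), as
# measured by per-program monitor (shadow) tags.

ASSOC = 8     # ways to divide among co-scheduled programs (assumed)

def partition_ways(hits, assoc=ASSOC):
    n = len(hits)
    alloc = [0] * n
    for _ in range(assoc):
        # Marginal utility of granting one more way to each program.
        gains = [hits[p][alloc[p] + 1] - hits[p][alloc[p]] for p in range(n)]
        winner = max(range(n), key=lambda p: gains[p])
        alloc[winner] += 1
    return alloc

# Program A keeps benefiting up to ~5 ways; program B saturates after 3 ways.
hits_a = [0, 50, 90, 120, 140, 150, 152, 153, 154]
hits_b = [0, 60, 80, 90, 92, 93, 93, 93, 93]
print(partition_ways([hits_a, hits_b]))    # -> [5, 3]
```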

Explicit partitioning vs. implicit partitioning

(Figures: how cache capacity is split between programs under explicit partitioning and under implicit partitioning.)


Private $$ issue: capacity borrowing

The main drawback of private caches is their fixed (small) capacity

Assume that we have an underlying mechanism for enforcing the correctness of cache access semantics; the key idea is to borrow (or steal) capacity from others
• How much capacity do I need and opt to borrow?
• Who may be a potential capacity donor?
• Do we allow performance degradation of a capacity donor?

Cooperative Caching (CC) [Chang and Sohi, ISCA ‘06]
• Stealing coordinated by a centralized directory (which is not scalable)

Dynamic Spill Receive (DSR) [Qureshi, HPCA ‘09] (see the sketch below)
• Each node is a spiller or a receiver
• On eviction from a spiller, randomly choose a receiver to copy the data into; then, for later data accesses, use broadcasting

StimulusCache [Lee, Cho, and Childers, HPCA ‘10]
• It’s likely we’ll have “excess cache” in the future
• Intelligently share the capacity provided by excess caches based on utility
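A minimal sketch of the DSR behavior described above (my simplification; the real scheme learns each node's spiller/receiver role with set dueling): a spiller pushes an evicted block to one randomly chosen receiver, and a later local miss broadcasts to find such spilled copies.

```python
# Sketch of Dynamic Spill-Receive behavior (illustrative).

import random

class Node:
    def __init__(self, node_id: int, is_spiller: bool):
        self.node_id = node_id
        self.is_spiller = is_spiller   # capacity-hungry (spiller) or not (receiver)
        self.cache = {}                # addr -> data (grossly simplified)

def on_evict(nodes, spiller, addr, data):
    """A spiller copies its evicted block into one randomly chosen receiver."""
    receivers = [n for n in nodes if not n.is_spiller]
    if spiller.is_spiller and receivers:
        random.choice(receivers).cache[addr] = data

def on_local_miss(nodes, requester, addr):
    """After a local miss, broadcast to the other nodes to find a spilled copy."""
    for node in nodes:
        if node is not requester and addr in node.cache:
            return node.cache[addr]    # remote hit on spilled data
    return None                        # otherwise go to memory
```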

Core disabling is uneconomical

Many unused (excess) L2 caches exist; the problem is exacerbated with many cores

(Figures: probability of having a given number of sound cores/L2 caches for an 8-core and a 32-core chip, with curves for the L2 cache alone, the processing logic alone, and the whole core (L2 + proc. logic).)

Dynamic sharing policy

• Flow-in#N: number of data blocks flowing into EC #N
• Hit#N: number of data blocks that hit at EC #N
• Hit/Flow-in high → more ECs; Hit/Flow-in low → fewer ECs (see the sketch below)

(Figure: the private L2 caches of Cores 0-3 spill into a chain of excess caches, EC #1 and then EC #2, before main memory; per-EC Flow-in and Hit counters drive the decision.)
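A minimal sketch of the hit/flow-in decision above (the thresholds and the grow/shrink rule are my assumptions, not the exact StimulusCache policy): each core keeps per-EC flow-in and hit counters for its chain; if the last EC in the chain still captures a good fraction of the blocks that reach it, the chain is kept or extended, otherwise it shrinks.

```python
# Sketch of a StimulusCache-style dynamic sharing decision (illustrative
# thresholds). hits[i] and flow_in[i] are the counters of the i-th excess
# cache (EC) in this core's chain for the last period.

HIGH, LOW = 0.10, 0.02     # assumed hit/flow-in thresholds

def next_ec_count(current_ecs, hits, flow_in, max_ecs):
    """Decide how many ECs this core should use in the next period."""
    if current_ecs == 0:
        return 1 if max_ecs > 0 else 0              # probe with one EC
    last = current_ecs - 1
    ratio = hits[last] / max(1, flow_in[last])      # utility of the last EC
    if ratio > HIGH and current_ecs < max_ecs:
        return current_ecs + 1                      # last EC is useful: grow
    if ratio < LOW:
        return current_ecs - 1                      # last EC is useless: shrink
    return current_ecs

# A core whose second EC still captures many of the blocks that reach it.
print(next_ec_count(2, hits=[500, 120], flow_in=[2000, 900], max_ecs=2))  # -> 2
```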


Dynamic sharing policy: allocation examples

• Core 0: needs at least 1 EC and has no harmful effect on EC #2 → allocate 2 ECs
• Core 1: needs at least 2 ECs → allocate 2 ECs
• Core 2: needs at least 1 EC but harms EC #2 → allocate 1 EC
• Core 3: sees no benefit from ECs → allocate 0 ECs

The resulting allocation (2, 2, 1, 0 ECs) maximizes capacity utilization while minimizing capacity interference.

Summary: priv. vs. shared caching

For a small- to medium-scale multicore processor, shared caching appears good (for its simplicity)
• A variety of workloads run well
• Care must be taken about interference
• Because of NUCA effects, this approach is not scalable!

For larger-scale multicore processors, private cache schemes appear more promising
• Free performance isolation
• Less NUCA effect
• Easier fault isolation
• Capacity stealing becomes a necessity

Summary: hypothetical caching

(Figure: performance relative to shared caching (w/o profiling) on a 256KB L2 cache slice for SPEC programs (gzip, vpr, gcc, mcf, crafty, parser, eon, gap, vortex, bzip2, twolf, wupwise, swim, mgrid, mesa, art, equake, ammp; g-mean), comparing private w/ profiling, shared w/ profiling, and static 2D coloring.)

Tackle both miss rate and locality through judicious data mapping! [Jin and Cho, ICPP ‘08]

Summary: hypothetical caching (cont.)

(Figure: number of pages mapped to each 256KB L2 cache bank across the X-Y tile grid for GCC, with the program’s location marked.)

Summary: hypothetical caching (cont.)

(Figure: per-benchmark alpha values between private caching (0) and shared caching (1), in steps of 0.125, on 256KB L2 cache banks, for ammp, art, bzip2, crafty, eon, equake, gap, gcc, gzip, mcf, mesa, mgrid, parser, swim, twolf, vortex, vpr, and wupwise.)

In the remainder of this talk

CloudCache
• Private cache based scheme for large-scale multicore processors
• Capacity borrowing at way/bank level & distance awareness

A 30,000-foot snapshot

(Figure: working cores on a tiled chip, each wrapped in its numbered cache cloud; ways near the working core give a high hit rate and fast access, distant ways a low hit rate and slow access.)

• Each cloud is an exclusive private cache for a working core
• Clouds are built with nearby cache ways organized in a chain
• Clouds are built in a two-step process: dynamic global partitioning and cache cloud formation

Step 1: Dynamic global partitioning

The global capacity allocator (GCA) periodically runs a partitioning algorithm using “hit count” information
• Each working core sends the GCA hit counts from its “allocated capacity” and from additional “monitoring capacity” (of 32 ways)

Our partitioning considers both utility [Qureshi & Patt, MICRO ‘06] and QoS

This step computes how much capacity to give to each core

(Figure: cores send hit counts from allocated and monitoring capacity over the network to the GCA, whose counter buffer and allocation engine return cache allocation information.)

Step 2: Cache cloud formation

1. Local L2 cache first
2. Threads w/ larger capacity demand first
3. Closer L2 banks first
4. Allocate as much capacity as possible

Allocation is done only for sound L2 banks, and the process repeats until every core’s capacity goal is met (see the sketch below)

(Figure: worked example tracking each core’s capacity goal and the remaining capacity to allocate, e.g., 2.75 and 1.25 ways, as the rules are applied and repeated.)
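A minimal sketch of the four formation rules above (the bank capacity, distance function, and data structures are my assumptions): cores are served in order of capacity demand; each claims its local bank first and then the closest sound banks until its goal from Step 1 is met.

```python
# Sketch of cache cloud formation (illustrative). Capacity is counted in ways;
# each sound L2 bank contributes up to ASSOC ways to passing clouds.

ASSOC = 8

def form_clouds(goals, distance, sound_banks):
    """goals: core -> ways to allocate (from Step 1);
    distance(core, bank) -> hop count; returns core -> [(bank, ways), ...]."""
    free = {bank: ASSOC for bank in sound_banks}      # ways left in each bank
    clouds = {core: [] for core in goals}
    # Rule 2: threads with larger capacity demand go first.
    for core in sorted(goals, key=goals.get, reverse=True):
        need = goals[core]
        # Rules 1 and 3: the local bank (distance 0) first, then closer banks.
        for bank in sorted(sound_banks, key=lambda b: distance(core, b)):
            if need <= 0:
                break
            take = min(need, free[bank])              # Rule 4: take what you can
            if take > 0:
                clouds[core].append((bank, take))
                free[bank] -= take
                need -= take
    return clouds

# Toy example: bank id == tile id and a 1-D distance.
dist = lambda core, bank: abs(core - bank)
print(form_clouds({0: 10, 3: 6}, dist, sound_banks={0, 1, 2, 3}))
```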

Cache cloud chaining example

(Figure: the cloud of working core 4 is a chain of banks ordered from MRU to LRU over cores 4, 1, 5, 7, 3, 6, 8, 2, with token counts 8, 6, 6, 4, 4, 3, 2, 2 and a total cloud capacity of 35; the banks lie at 0-hop, 1-hop, or 2-hop distance from the working core, and working core 2 builds its own chain in the same way.)

Architectural support

(Figure: each tile couples its processor, L1, directory, router, and L2 with a cloud table, monitor tags, and hit counters; the global capacity allocator drives capacity allocation and cache cloud chaining. Each cloud table entry holds a home core ID, a token count, and a next core ID.)

• The cloud table resembles StimulusCache’s NECP
• The only additional hardware is per-way hit counters and monitor tags
• We do not need per-block core-ID tracking information

A sketch of a cloud table entry follows.
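A minimal sketch of a cloud table entry with the three fields named above; the chain-walking helper, field types, and the example values (taken from the chaining example earlier) are my own illustration.

```python
# Sketch of the per-tile cloud table (illustrative). An entry describes one
# cloud passing through this tile: whose cloud it is, how many ways (tokens)
# this tile lends to it, and which tile comes next in the chain.

from dataclasses import dataclass
from typing import Optional

@dataclass
class CloudTableEntry:
    home_core_id: int             # working core that owns this cloud
    token_count: int              # ways this tile contributes to the cloud
    next_core_id: Optional[int]   # next tile in the chain (None = end of cloud)

def walk_cloud(tables, home):
    """Follow a cloud's chain starting at its home tile.
    tables: tile id -> {home core id -> CloudTableEntry}"""
    chain, tile = [], home
    while tile is not None:
        entry = tables[tile][home]
        chain.append((tile, entry.token_count))
        tile = entry.next_core_id
    return chain

# First three links of core 4's cloud from the chaining example (8, 6, 6 ways).
tables = {
    4: {4: CloudTableEntry(4, 8, 1)},
    1: {4: CloudTableEntry(4, 6, 5)},
    5: {4: CloudTableEntry(4, 6, None)},
}
print(walk_cloud(tables, home=4))    # -> [(4, 8), (1, 6), (5, 6)]
```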

Capacity allocation example

(Figure: seven working cores receive capacity allocations from the global capacity allocator, e.g., 10.5, 8, 12.625, 0.75, 9.625, 14.5, and 8 ways; after repartitioning, which happens every ‘T’ cycles, the allocations become 13.125, 8.5, 7.875, 7, 11, 11.5, and 6.)

Is CloudCache fast?

A remote L2 access involves 3-way communication:
(1) directory lookup, (2) request forwarding, (3) data forwarding

Distance-aware cache cloud formation tackles only (3)!

Limited target broadcast (LTB)

Make the common case “super fast” and the rare case “not so fast”
• Private data: limited target broadcast, with no wait for directory lookup
• Shared data: directory-based coherence
• Private data accesses ≫ shared data accesses

LTB example

(Figure: timelines for a remote access w/o and w/ broadcasting; without broadcasting, the requester waits for the directory request and response before the data arrives, whereas with broadcasting a local L2 miss immediately triggers a limited broadcast and the data returns on a broadcast hit.)

When is LTB used?

Shared data: access ordering is managed by the directory
• LTBP is NOT used

Private data: access ordering is not needed
• Fast access with broadcast first, then notify the directory

Race condition?
• Possible when an access request for private data arrives from a non-owner core before the directory is updated

LTB protocol (LTBP)

States: R (Ready), W (Wait), B (Busy); BL: broadcast lock, BU: broadcast unlock

Input → Action:
1. Request from other cores → broadcast lock request
2. Ack from the owner → process the non-owner request
3. Nack from the owner
4. Request from the owner

Directory ↔ L2 cache, Input → Action:
1. Broadcast lock request → Ack
2. Invalidation

(Figure: state diagram with transitions 1-4 among R, W, and B, plus BL/BU messages exchanged between the directory and the owner’s L2 cache; a sketch follows.)
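A minimal sketch of the directory-side handling implied by the table above (the state names R/W/B and inputs 1-4 come from the slide; the transition targets and the `done` event are my interpretation): a request from a non-owner core first locks the owner through a broadcast-lock, waits for the owner's ack or nack, and only then is the non-owner request processed.

```python
# Sketch of the LTBP directory-side state machine (illustrative; transition
# targets are an interpretation of the slide's R/W/B diagram, not a spec).

READY, WAIT, BUSY = "R", "W", "B"

def ltbp_step(state, event):
    """Return (next_state, action) for one directory-side event."""
    if state == READY and event == "request_from_other_core":   # input 1
        return WAIT, "send_broadcast_lock_to_owner"              # BL
    if state == WAIT and event == "ack_from_owner":              # input 2
        return BUSY, "process_non_owner_request"
    if state == WAIT and event == "nack_from_owner":             # input 3
        return READY, None          # owner raced ahead; requester retries
    if state == READY and event == "request_from_owner":         # input 4
        return READY, "owner_broadcasts_then_notifies_directory"
    if state == BUSY and event == "done":
        return READY, "send_broadcast_unlock"                    # BU
    return state, None              # anything else is ignored in this sketch

# A non-owner request arriving while the block is held as private data:
state, action = ltbp_step(READY, "request_from_other_core")
print(state, action)                # W send_broadcast_lock_to_owner
state, action = ltbp_step(state, "ack_from_owner")
print(state, action)                # B process_non_owner_request
```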

Quality of Service (QoS) support

• QoS: maximum allowed performance degradation
• Sampling factor: 32 in this work
• Private cache capacity: 8 (ways) in this work
• Currently allocated capacity

Base cycle: cycles (time) with the private cache only, estimated from the hit counts sampled under the currently allocated capacity

Estimated cycle(j): cycles with cache capacity j, estimated from the base cycle and the sampled hit counts

Capacity selection: choose the minimum j such that Estimated cycle(j) ≤ (1 + QoS) × Base cycle
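A minimal sketch of the capacity check above (the per-capacity cycle estimates are taken as given here; in CloudCache they come from the sampled hit counts): the allocator never hands a core less capacity than the smallest j whose estimated cycles stay within (1 + QoS) of the private-cache baseline.

```python
# Sketch of QoS-constrained capacity selection (illustrative).
# estimated_cycles[j] = estimated execution cycles if the core ran with
# capacity j (ways); base_cycles = estimated cycles with the private cache only.

def min_capacity_for_qos(estimated_cycles, base_cycles, qos):
    """Smallest capacity j with Estimated(j) <= (1 + qos) * Base."""
    bound = (1.0 + qos) * base_cycles
    feasible = [j for j, cyc in estimated_cycles.items() if cyc <= bound]
    return min(feasible) if feasible else max(estimated_cycles)  # else: give all ways

# Private baseline of 1.0M cycles with 2% allowed degradation (98% QoS):
# this core must keep at least 6 ways in the example below.
est = {4: 1.08e6, 5: 1.04e6, 6: 1.015e6, 7: 1.005e6, 8: 1.00e6}
print(min_capacity_for_qos(est, base_cycles=1.00e6, qos=0.02))   # -> 6
```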

DSR vs. ECC vs. CloudCache

Dynamic Spill Receive [Qureshi, HPCA ’09]
• A node is either a spiller or a receiver depending on its memory usage
• Spiller nodes randomly “spill” evicted data to a receiver node

Elastic Cooperative Caching [Herrero et al., ISCA ‘10]
• Each node has a private area and a shared area
• Nodes with high memory demand can spill evicted data into the shared area of a randomly selected node

Experimental setup

TPTS [Lee et al., SPE ‘10; Cho et al., ICPP ‘08]
• 64-core CMP with an 8×8 2D mesh, 4 cycles/hop
• Core: Intel ATOM-like two-issue in-order pipeline
• Directory-based MESI protocol
• Four independent DRAM controllers, four ports/controller
• DRAM with Samsung DDR3-1600 timing

Workloads
• SPEC2006 (10B cycles), classified high/medium/low based on MPKI for varying cache capacity
• PARSEC (simlarge input set), 16 threads/application

Impact of global partitioning

(Figures: global partitioning behavior for workload mixes of high-, medium-, and low-MPKI programs.)

L2 cache access latency

(Figure: distribution of L2 access counts vs. access latency, 0-300 cycles, for 401.bzip2 under Private, Shared, DSR, ECC, and CloudCache with and without LTB.)

• Shared: widely spread latency
• Private: fast local access + off-chip access
• DSR & ECC: fast local access + widely spread latency
• CloudCache w/o LTB: fast local access + narrowly spread latency
• CloudCache w/ LTB: fast local access + fast remote access

16 threads; 32 and 64 threads

(Figures: performance results for 16-, 32-, and 64-thread workloads.)

PARSEC

(Figure: speedup over private caching for workload combinations Comb1-Comb5 under Shared, DSR, ECC, and CloudCache.)

(Figure: per-application speedup over private caching within Comb2 (swaption, blacksch., canneal, bodytrack) and Comb5 (swaption, facesim, canneal, ferret) under Shared, DSR, ECC, and CloudCache.)

Effect of QoS enforcement

(Figures: speedup over private caching across workloads under No QoS, 98% QoS, and 95% QoS, and the resulting per-program performance relative to private caching for the different programs.)

CloudCache summary

Future processors will carry many more cores and cache resources, and future workloads will be heterogeneous; low-level caches must become more scalable and flexible

We proposed CloudCache w/ three techniques:
• Global “private” capacity allocation to eliminate interference and to minimize on-chip misses
• Distance-aware data placement to overcome NUCA latency
• Limited target broadcast to overcome directory lookup latency

The proposed techniques synergistically improve performance; still, we find global capacity allocation the most effective

QoS is naturally supported in CloudCache

Our multicore cache publications

• Cho and Jin, “Managing Distributed, Shared L2 Caches through OS-Level Page Allocation,” MICRO 2006. Best paper nominee.
• Cho, Jin and Lee, “Achieving Predictable Performance with On-Chip Shared L2 Caches for Manycore-Based Real-Time Systems,” RTCSA 2007. Invited paper.
• Hammoud, Cho and Melhem, “ACM: An Efficient Approach for Managing Shared Caches in Chip Multiprocessors,” HiPEAC 2008.
• Hammoud, Cho and Melhem, “Dynamic Cache Clustering for Chip Multiprocessors,” ICS 2008.
• Jin and Cho, “Taming Single-Thread Program Performance on Many Distributed On-Chip L2 Caches,” ICPP 2008.
• Oh, Lee, Lee, and Cho, “An Analytical Model to Study Optimal Area Breakdown between Cores and Caches in a Chip Multiprocessor,” ISVLSI 2009.
• Jin and Cho, “SOS: A Software-Oriented Distributed Shared Cache Management Approach for Chip Multiprocessors,” PACT 2009.
• Lee, Cho and Childers, “StimulusCache: Boosting Performance of Chip Multiprocessors with Excess Cache,” HPCA 2010.
• Hammoud, Cho and Melhem, “Cache Equalizer: A Placement Mechanism for Chip Multiprocessor Distributed Shared Caches,” HiPEAC 2011.
• Lee, Cho and Childers, “CloudCache: Expanding and Shrinking Private Caches,” HPCA 2011.