ECE8833 Polymorphous and Many-Core Computer Architecture
Prof. Hsien-Hsin S. Lee, School of Electrical and Computer Engineering
Lecture 5: Non-Uniform Cache Architecture for CMP


Page 1: ECE8833 Polymorphous and Many-Core Computer Architecture

ECE8833 Polymorphous and Many-Core Computer Architecture

Prof. Hsien-Hsin S. Lee, School of Electrical and Computer Engineering

Lecture 5 Non-Uniform Cache Architecture for CMP

Page 2: ECE8833 Polymorphous and Many-Core Computer Architecture

CMP Memory Hierarchy
• Continuing device scaling leads to
  – Deeper memory hierarchy (L2, L3, etc.)
  – Growing cache capacity
    • 6MB in AMD's Phenom quad-core
    • 8MB in Intel Core i7
    • 24MB L3 in Itanium 2
• Global wire delay
  – Routing dominates access time
• Design for the worst case
  – Compromise for the slowest access
  – Penalizes all memory accesses
  – Undesirable

Page 3: ECE8833 Polymorphous and Many-Core Computer Architecture

Evolution of Cache Access Time
• Facts
  – Large shared on-die L2
  – Wire delay dominating on-die cache access
• Projected L2 access time across technology nodes
  – 1MB, 3 cycles (180nm, 1999)
  – 4MB, 11 cycles (90nm, 2004)
  – 16MB, 24 cycles (50nm, 2010)

Page 4: ECE8833 Polymorphous and Many-Core Computer Architecture

Multi-Banked L2 Cache (2MB @ 130nm)
• Bank size = 128KB
• Access latency = 11 cycles
  – Bank access time = 3 cycles
  – Interconnect delay = 8 cycles

Page 5: ECE8833 Polymorphous and Many-Core Computer Architecture

Multi-Banked L2 Cache (16MB @ 50nm)
• Bank size = 64KB
• Access latency = 47 cycles
  – Bank access time = 3 cycles
  – Interconnect delay = 44 cycles

Page 6: ECE8833 Polymorphous and Many-Core Computer Architecture

NUCA: Non-Uniform Cache Architecture [Kim et al., ASPLOS-X 2002]
• Partition a large cache into banks
• Non-uniform latencies for different banks
• Design space exploration
  – Mapping
    • How many banks? (i.e., what is the granularity?)
    • How are lines mapped to each bank?
  – Search
    • Strategy for searching the set of possible locations for a line
  – Movement
    • Should a line always stay in the same bank?
    • How does a line migrate across banks over its lifetime?

Page 7: ECE8833 Polymorphous and Many-Core Computer Architecture

Cache Hierarchy Taxonomy (16MB @ 50nm) [Kim et al., ASPLOS-X 2002]
• Contentionless latencies come from CACTI; average access times come from simulation modeling bank and channel conflicts
  – UCA: 1 bank, 41-cycle contentionless latency, 255-cycle average access time
  – ML-UCA: 1 bank per level, 10/41-cycle contentionless latency (L2/L3), 11/41-cycle average access time
  – S-NUCA-1: 32 banks, 17 to 41 cycles contentionless, 34-cycle average access time
  – S-NUCA-2: 32 banks, 9 to 32 cycles contentionless, 24-cycle average access time
  – D-NUCA: 256 banks, 4 to 47 cycles contentionless, 18-cycle average access time

Page 8: ECE8833 Polymorphous and Many-Core Computer Architecture

Static NUCA-1 Using Private Channels
• Upside
  – Increase the number of banks to avoid one bulky monolithic access
  – Parallelize accesses to different banks
• Overhead
  – Decoders
  – Wire-dominated: the same set of private wires is required for every bank
• Each bank has its own distinct access latency
• Data location is statically pre-determined by the address; low-order bits select the bank index (see the sketch below)
• Average access latency = 34.2 cycles
• Wire overhead = 20.9% (an issue)
• [Figure: one bank with its sub-banks, predecoder, wordline drivers/decoders, and sense amplifiers; the tag array connects to per-bank private address and data buses]
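As a rough illustration of static bank selection, the sketch below picks a bank from the low-order address bits just above the block offset. The line size and bank count are assumptions made for illustration, not parameters fixed by the slide.

```c
#include <stdint.h>

/* Illustrative S-NUCA-style static bank mapping (assumed parameters). */
#define LINE_SIZE_BYTES 64u    /* assumed cache line size */
#define NUM_BANKS       32u    /* e.g., a 32-bank S-NUCA configuration */

/* Low-order bits just above the block offset select the bank, so
 * consecutive cache lines interleave across banks and every address
 * maps to exactly one statically determined bank. */
static inline uint32_t snuca_bank_index(uint64_t paddr)
{
    return (uint32_t)((paddr / LINE_SIZE_BYTES) % NUM_BANKS);
}
```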

Page 9: ECE8833 Polymorphous and Many-Core Computer Architecture

Static NUCA-2 Using Switched Channels
• Relieves the wire congestion of Static NUCA-1 by using a 2D switched network
• Wormhole-routed flow control
• Each switch buffers 128-bit packets
• Average access latency = 24.2 cycles
  – On average, 0.8 cycle of bank contention + 0.7 cycle of link contention in the network
• Wire overhead = 5.9%
• [Figure: banks connected by a 2D switched network to the tag array, predecoders, and wordline drivers/decoders; contentionless bank latencies range from 9 to 32 cycles]

Page 10: ECE8833 Polymorphous and Many-Core Computer Architecture

Dynamic NUCA
• Data can dynamically migrate
• Promote frequently used cache lines closer to the CPU
• Data management
  – Mapping
    • How many banks? (i.e., what is the granularity?)
    • How are lines mapped to each bank?
  – Search
    • Strategy for searching the set of possible locations for a line
  – Movement
    • Should a line always stay in the same bank?
    • How does a line migrate across banks over its lifetime?
• D-NUCA (16MB @ 50nm): 256 banks, 4 to 47 cycles contentionless, 18-cycle average access time

Page 11: ECE8833 Polymorphous and Many-Core Computer Architecture

Dynamic NUCA: Simple Mapping
• Each column of banks forms one bank set; a line may reside in any bank (way) of its set (see the sketch below)
• All 4 ways of a bank set need to be searched
• Non-uniform access times for different bank sets
• Farther bank sets incur longer accesses
• [Figure: memory controller at the cache edge; 8 bank sets, each a column of 4 banks (way 0 through way 3)]
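To make the simple-mapping picture concrete, the sketch below maps an address to a bank set (one column of banks) and enumerates the candidate banks that must be searched. The 8-set, 4-way geometry follows the figure; the line size and bank numbering are assumptions.

```c
#include <stdint.h>

#define LINE_SIZE_BYTES 64u   /* assumed cache line size */
#define NUM_BANK_SETS    8u   /* columns of banks, per the figure */
#define WAYS_PER_SET     4u   /* banks per column ("ways") */

/* Low-order line-address bits select the bank set (column). */
static inline uint32_t bank_set_of(uint64_t paddr)
{
    return (uint32_t)((paddr / LINE_SIZE_BYTES) % NUM_BANK_SETS);
}

/* A line may live in any bank of its column, so all WAYS_PER_SET banks
 * of the selected set are candidates and must be searched on a lookup. */
static void candidate_banks(uint64_t paddr, uint32_t banks[WAYS_PER_SET])
{
    uint32_t set = bank_set_of(paddr);
    for (uint32_t w = 0; w < WAYS_PER_SET; w++)
        banks[w] = set + w * NUM_BANK_SETS;   /* bank numbering is illustrative */
}
```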

Page 12: ECE8833 Polymorphous and Many-Core Computer Architecture

Dynamic NUCA: Fair Mapping
• Fair mapping (proposed, but not evaluated in the paper)
• Average access time is equalized across all bank sets
• Complex routing, likely more contention
• [Figure: memory controller at the cache edge; 8 bank sets, each mixing near and far banks so that average distances are balanced]

Page 13: ECE8833 Polymorphous and Many-Core Computer Architecture

Dynamic NUCA: Shared Mapping
• Shared mapping
• The closest banks are shared among multiple bank sets
• The sharing bank sets get slightly higher associativity, which offsets the increased average access latency due to distance
• [Figure: memory controller at the cache edge; 8 bank sets, with the banks nearest the controller shared by several bank sets]

Page 14: ECE8833 Polymorphous and Many-Core Computer Architecture

Locating a NUCA Line
• Incremental search
  – Probe banks from the closest to the farthest
• (Limited, partitioned) multicast search
  – Search all (or a partition of) the candidate banks in parallel
  – Return time depends on the routing distance
• Smart search (see the sketch below)
  – Use partial tag comparison [Kessler '89] (used in the P6)
  – Keep the partial tag array in the cache controller
  – Similar modern technique: Bloom filters
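A minimal sketch of the smart-search idea, assuming a per-bank array of partial tags kept at the controller: only banks whose stored partial tag matches the request are probed. The names, tag width, and fixed 4-way bank set are illustrative assumptions.

```c
#include <stdint.h>
#include <stdbool.h>

#define WAYS_PER_BANKSET 4       /* banks that may hold a given line (assumed) */
#define PARTIAL_TAG_BITS 6       /* width of the stored partial tag (assumed)  */

/* Partial tags for one bank set, kept at the cache controller. */
typedef struct {
    uint8_t partial_tag[WAYS_PER_BANKSET];
    bool    valid[WAYS_PER_BANKSET];
} bankset_ptags_t;

/* Return a bitmask of banks worth probing: a partial-tag mismatch
 * guarantees a miss in that bank, while a match must still be confirmed
 * by the full tag in the bank itself (false positives are possible). */
static uint32_t smart_search_candidates(const bankset_ptags_t *pt, uint64_t tag)
{
    uint8_t  req  = (uint8_t)(tag & ((1u << PARTIAL_TAG_BITS) - 1));
    uint32_t mask = 0;
    for (int w = 0; w < WAYS_PER_BANKSET; w++)
        if (pt->valid[w] && pt->partial_tag[w] == req)
            mask |= 1u << w;
    return mask;              /* 0 means the line is definitely not cached here */
}
```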

Page 15: ECE8833 Polymorphous and Many-Core Computer Architecture

D-NUCA: Dynamic Movement of Cache Lines (Placement Upon a Hit)
• LRU ordering
  – A conventional implementation only adjusts LRU bits
  – NUCA requires physical movement to get the latency benefit (up to n copy operations)
• Generational promotion (see the sketch below)
  – On a hit, only swap the line with the line in the neighboring bank one step closer to the controller
  – A line receives a larger latency reward when it is hit repeatedly
• [Figure: old vs. new state of a bank column after a hit, with the hit line moving one bank closer to the controller]
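A sketch of generational promotion under the assumption that a bank set is a simple array ordered from the closest to the farthest bank; on a hit the line swaps with whatever occupies the next-closer bank.

```c
#define WAYS_PER_BANKSET 4   /* banks per bank set, closest first (assumed) */

typedef struct { unsigned long tag; int valid; } line_t;

/* bankset[0] is the bank closest to the controller, bankset[WAYS-1] the
 * farthest. On a hit at position 'hit_way', swap one step toward the
 * controller; repeated hits gradually migrate the line all the way in. */
static void generational_promotion(line_t bankset[WAYS_PER_BANKSET], int hit_way)
{
    if (hit_way <= 0)
        return;                        /* already in the closest bank */
    line_t tmp = bankset[hit_way - 1];
    bankset[hit_way - 1] = bankset[hit_way];
    bankset[hit_way] = tmp;
}
```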

Page 16: ECE8833 Polymorphous and Many-Core Computer Architecture

D-NUCA: Dynamic Movement of Cache Lines (Upon a Miss)
• Incoming line insertion (see the sketch below)
  – Into the most distant bank (assist-cache concept), or
  – Into the MRU (closest) position
• Victim eviction
  – Zero copy: the displaced victim is simply evicted
  – One copy: the displaced victim is copied into another bank rather than evicted
• [Figure: the incoming line entering either the most distant bank (assist cache concept) or the MRU bank, and the victim being evicted (zero copy) or moved to another bank (one copy)]
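The sketch below shows one plausible arrangement of these choices, not the paper's exact policy: the incoming line is inserted at either the closest or the farthest bank, and the displaced line is either evicted (zero copy) or demoted one bank farther (one copy). The array-of-banks model and parameters are assumptions carried over from the earlier sketches.

```c
#define WAYS_PER_BANKSET 4   /* banks per bank set, closest first (assumed) */

typedef struct { unsigned long tag; int valid; } line_t;

enum victim_policy { ZERO_COPY, ONE_COPY };

/* Insert the incoming line; 'insert_close' selects MRU (closest bank) vs.
 * most-distant-bank insertion. For the line displaced from the chosen
 * bank, ZERO_COPY simply evicts it, while ONE_COPY moves it one bank
 * farther from the controller and evicts whatever was there instead. */
static line_t insert_on_miss(line_t bankset[WAYS_PER_BANKSET], line_t incoming,
                             int insert_close, enum victim_policy pol)
{
    int pos = insert_close ? 0 : WAYS_PER_BANKSET - 1;
    line_t displaced = bankset[pos];
    bankset[pos] = incoming;

    if (pol == ONE_COPY && pos + 1 < WAYS_PER_BANKSET) {
        line_t victim = bankset[pos + 1];
        bankset[pos + 1] = displaced;    /* demote the displaced line */
        return victim;                   /* this line leaves the cache */
    }
    return displaced;                    /* ZERO_COPY: evict the displaced line */
}
```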

Page 17: ECE8833 Polymorphous and Many-Core Computer Architecture

Sharing a NUCA Cache in a CMP
• Sharing degree (SD) of N: the number of processor cores that share one cache
• Low SD
  – Smaller private partitions
  – Good hit latency, poor hit rate
  – More discrete L2 caches
    • Expensive L2 coherence
    • E.g., needs a centralized L2 tag directory for L2 coherence
• High SD
  – Good hit rate, bad hit latency
  – More efficient inter-core communication
  – More expensive L1 coherence

Page 18: ECE8833 Polymorphous and Many-Core Computer Architecture

16-Core CMP Substrate and SD [Huh et al., ICS '05]
• Low SD (e.g., 1) needs either snooping or a central L2 tag directory for coherence
• High SD (e.g., 16) also needs a directory to indicate which L1s hold a copy (as used in the Piranha CMP)

Page 19: ECE8833 Polymorphous and Many-Core Computer Architecture

Trade-offs of Cache Sharing Among Cores
• Upside
  – Keeps a single copy of data
  – Uses area more efficiently
  – Faster inter-core communication
    • No coherence fabric needed at this level
• Downside
  – Larger structure, slower access
  – Longer wire delay
  – More congestion on the shared interconnect

Page 20: ECE8833 Polymorphous and Many-Core Computer Architecture

Flexible Cache Mapping [Huh et al., ICS '05]
• Static mapping
  – L2 access latency is fixed at line-placement time
• Dynamic mapping
  – D-NUCA idea: a line can migrate across multiple banks
  – The line moves closer to the core that accesses it most frequently
  – Lookup could be expensive, so all partial tags are searched first

Page 21: ECE8833 Polymorphous and Many-Core Computer Architecture

Flexible Cache Sharing
• Multiple sharing degrees for different classes of blocks
• Classify lines by a per-line sharing degree
  – Private data: assign a smaller SD
  – Shared data: assign a larger SD
• The study found a 6 to 7% improvement vs. the best uniform SD
  – SD = 1 or 2 for private data
  – SD = 16 for shared data

Page 22: ECE8833 Polymorphous and Many-Core Computer Architecture

Enhancing Cache/Memory Performance
• Cache partitioning (illustrated by the sketch after this list)
  – Explicitly manage cache allocation among processes
    • Each process gets a different benefit from additional cache space
    • Similar to main memory partitioning [Stone '92] in the good old days
• Memory-aware scheduling
  – Choose a set of simultaneously running processes to minimize cache contention
  – Symbiotic scheduling for SMT by the OS
    • Sample and collect information (performance counters) about possible schedules
    • Predict the best schedule (e.g., based on resource contention)
    • Complexity is high for many processes
  – Admission control for gang scheduling
    • Based on the footprint of a job (total memory usage)
(Slide adapted from Ed Suh's HPCA '02 presentation)
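As one illustration of explicit cache partitioning (not a scheme described on the slide), the sketch below restricts each process to a quota of ways in a set and picks replacement victims only from ways the process already owns. All structures and parameters are assumptions.

```c
#include <stdint.h>

#define ASSOC 16                      /* set associativity (assumed) */

typedef struct {
    uint64_t tag;
    int      valid;
    int      owner;                   /* process id that allocated this way */
    uint64_t last_used;               /* timestamp for LRU within the partition */
} way_t;

/* Choose a victim way for process 'pid' in one set without exceeding its
 * way quota: prefer an invalid way, then the LRU way already owned by pid.
 * Returns -1 if pid holds fewer ways than its quota and no way is free, in
 * which case the caller may instead take a way from an over-quota process. */
static int pick_victim(way_t set[ASSOC], int pid, int quota)
{
    int owned = 0, lru = -1;
    for (int w = 0; w < ASSOC; w++) {
        if (!set[w].valid)
            return w;                             /* free way: use it */
        if (set[w].owner == pid) {
            owned++;
            if (lru < 0 || set[w].last_used < set[lru].last_used)
                lru = w;
        }
    }
    return (owned >= quota) ? lru : -1;           /* stay within the quota */
}
```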

Page 23: ECE8833 Polymorphous and Many-Core Computer Architecture

Victim Replication

Page 24: ECE8833 Polymorphous and Many-Core Computer Architecture

Today's Chip Multiprocessors (Shared L2)
• Layout: "dance hall"
  – Per processing node: core + L1 cache
  – Shared L2 cache
• Small L1 cache
  – Fast access
• Large L2 cache
  – Good hit rate
  – Slower access latency
• [Figure: core + L1$ nodes connected through an intra-chip switch to a shared L2 cache]
(Slide adapted from the presentation by Zhang and Asanovic, ISCA '05)

Page 25: ECE8833 Polymorphous and Many-Core Computer Architecture

Today's Chip Multiprocessors (Shared L2)
• Layout: "dance hall"
  – Per processing node: core + L1 cache
  – Shared L2 cache
• Alternative: a large L2 cache divided into slices
  – Minimizes latency and power
  – i.e., NUCA
• Challenge
  – Minimize the average access latency
  – Goal: average memory latency == best-case latency
• [Figure: core + L1$ nodes connected through an intra-chip switch to an L2 cache organized as many L2 slices]
(Slide adapted from the presentation by Zhang and Asanovic, ISCA '05)

Page 26: ECE8833 Polymorphous and Many-Core Computer Architecture

Dynamic NUCA Issues
• Does not work well with CMPs
• The "unique" copy of the data cannot be close to all of its sharers
• Behavior
  – Over time, shared data migrates to a location "equidistant" from all sharers
[Beckmann & Wood, MICRO-36]
• [Figure: dance-hall CMP with core + L1$ nodes on both sides of the shared L2; migrating shared lines settle in the middle]
(Slide adapted from the presentation by Zhang and Asanovic, ISCA '05)

Page 27: ECE8833 Polymorphous and Many-Core Computer Architecture

Tiled CMP with a Directory-Based Protocol
• Tiled CMPs for scalability
  – Minimal redesign effort
  – Use a directory-based protocol for scalability
• Manage the L2s to minimize the effective access latency
  – Keep data close to the requestors
  – Keep data on-chip
• Two baseline L2 cache designs
  – Each tile has its own private L2
  – All tiles share a single distributed L2
• [Figure: 16-tile CMP; each tile contains a core, L1$, switch, and an L2$ slice (tag + data)]
(Slide adapted from the presentation by Zhang and Asanovic, ISCA '05)

Page 28: ECE8833 Polymorphous and Many-Core Computer Architecture

"Private L2" Design Keeps Hit Latency Low
• The local L2 slice is used as a private L2 cache for the tile
  – Shared data is "duplicated" in the L2 of each sharer
  – Coherence must be kept among all sharers at the L2 level
  – Similar to DSM
• On an L2 miss:
  – Data is not on-chip, or
  – Data is available in the private L2 cache of another tile
• [Figure: two tiles (sharer i and sharer j), each with a core, L1$, switch, directory + L2 tags, and private L2 data]
(Slide adapted from the presentation by Zhang and Asanovic, ISCA '05)

Page 29: ECE8833 Polymorphous and Many-Core Computer Architecture

"Private L2" Design Keeps Hit Latency Low
• The local L2 slice is used as a private L2 cache for the tile
  – Shared data is "duplicated" in the L2 of each sharer
  – Coherence must be kept among all sharers at the L2 level
  – Similar to DSM
• On an L2 miss:
  – Data is not on-chip (off-chip access), or
  – Data is available in the private L2 cache of another tile (cache-to-cache reply forwarding)
• [Figure: requestor, owner/sharer, and home node tiles; the home node is statically determined by the address]

Page 30: ECE8833 Polymorphous and Many-Core Computer Architecture

"Shared L2" Design Gives Maximum Capacity
• All L2 slices on-chip form a distributed shared L2 backing up all L1s
  – "No duplication": data is kept in a unique L2 location
  – Coherence must be kept among all sharers at the L1 level
• On an L2 miss:
  – Data is not in the L2 (off-chip access), or
  – Coherence miss (cache-to-cache reply forwarding)
• [Figure: requestor, owner/sharer, and home node tiles; the home node is statically determined by the address (see the sketch below)]
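"Statically determined by the address" can be as simple as interleaving line addresses across the tiles. The sketch below assumes a 16-tile CMP and 64-byte lines purely for illustration.

```c
#include <stdint.h>

#define LINE_SIZE_BYTES 64u   /* assumed cache line size */
#define NUM_TILES       16u   /* tiles in the CMP (per the 16-core substrate) */

/* Interleave cache lines across tiles: the home tile that holds the
 * directory entry and the unique shared-L2 copy of a line is a fixed
 * function of the physical address. */
static inline uint32_t home_tile(uint64_t paddr)
{
    return (uint32_t)((paddr / LINE_SIZE_BYTES) % NUM_TILES);
}
```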

Page 31: ECE8833 Polymorphous and Many-Core Computer Architecture

Private vs. Shared L2 CMP
• Shared L2
  – Long, non-uniform L2 hit latency
  – No duplication maximizes L2 capacity
• Private L2
  – Uniform, lower latency if found in the local L2
  – Duplication reduces L2 capacity

Page 32: ECE8833 Polymorphous and Many-Core Computer Architecture

Private vs. Shared L2 CMP
• Shared L2
  – Long, non-uniform L2 hit latency
  – No duplication maximizes L2 capacity
• Private L2
  – Uniform, lower latency if found in the local L2
  – Duplication reduces L2 capacity
• Victim Replication: provides low hit latency while keeping the working set on-chip

Page 33: ECE8833 Polymorphous and Many-Core Computer Architecture

Normal L1 Eviction in a Shared L2 CMP
• When an L1 cache line is evicted
  – Write it back to the home L2 if dirty
  – Update the home directory
• [Figure: sharer i, sharer j, and the home node; the evicted L1 line goes to the home node's L2 slice and directory]

Page 34: ECE8833 Polymorphous and Many-Core Computer Architecture

Victim Replication
• Replicas
  – L1 victims are also stored in the local L2 slice
• Reused later for faster access latency (see the eviction-path sketch below)
• [Figure: sharer i keeps a replica of its evicted L1 line in its own L2 slice in addition to the copy at the home node]
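A sketch of the L1 eviction path with victim replication, combining the previous slide's normal write-back with the replica insertion described here. The helper functions are hypothetical stand-ins (empty stubs) for the tile's protocol engine, not the paper's interfaces.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the tile's protocol actions (empty stubs). */
static void writeback_to_home(uint64_t addr)        { (void)addr; }
static void update_home_directory(uint64_t addr)    { (void)addr; }
static bool try_insert_local_replica(uint64_t addr) { (void)addr; return true; }

/* On an L1 eviction: perform the normal shared-L2 actions (write back if
 * dirty, update the home directory), and additionally try to keep a copy
 * of the victim as a replica in the local L2 slice for later reuse. */
static void on_l1_eviction(uint64_t addr, bool dirty)
{
    if (dirty)
        writeback_to_home(addr);         /* normal shared-L2 write-back */
    update_home_directory(addr);

    /* Victim replication: the insertion policy on the later slides
     * decides whether a qualifying slot exists in the local slice. */
    (void)try_insert_local_replica(addr);
}
```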

Page 35: ECE8833 Polymorphous and Many-Core Computer Architecture

Hitting the Victim Replica
• On an L1 miss, first look up the local L2 slice (see the sketch below)
• A replica hit invalidates the replica and returns the line to the L1
• A miss follows the normal transaction to fetch the line from the home node
• [Figure: sharer i finds the requested line as a replica in its local L2 slice (replica hit)]
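A rough sketch of the lookup path on an L1 miss under victim replication. The local slice is modeled as a tiny direct-mapped replica store, and the home-node request is an empty stub; all of this is illustrative, not the paper's hardware.

```c
#include <stdbool.h>
#include <stdint.h>

#define SLICE_SETS 256          /* sets in the local L2 slice (assumed) */

typedef struct { uint64_t tag; bool valid; bool is_replica; } slice_line_t;

static slice_line_t local_slice[SLICE_SETS];   /* direct-mapped for simplicity */

/* Stand-in for the normal shared-L2 request to the home node. */
static void fetch_from_home_node(uint64_t addr) { (void)addr; /* ... */ }

/* On an L1 miss, check the local slice for a replica first; a replica hit
 * is invalidated as the line moves back into the L1, otherwise the request
 * goes to the statically determined home node as usual. (A line homed in
 * this very slice would also be found locally; omitted for brevity.) */
static void on_l1_miss(uint64_t addr)
{
    uint64_t line = addr >> 6;                 /* 64B lines assumed */
    slice_line_t *s = &local_slice[line % SLICE_SETS];

    if (s->valid && s->is_replica && s->tag == line) {
        s->valid = false;                      /* replica hit: consume the replica */
        return;
    }
    fetch_from_home_node(addr);                /* normal shared-L2 transaction */
}
```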

Page 36: ECE8833 Polymorphous and Many-Core Computer Architecture

Replication Policy
• A replica is only inserted into the local L2 slice when one of the following is found (in priority order; built up on the next slides)
• [Figure: sharer i, sharer j, and the home node]

Page 37: ECE8833 Polymorphous and Many-Core Computer Architecture

Replication Policy: Where to Insert?
• A replica is only inserted when one of the following is found (in priority order)
  – An invalid line
• [Figure: sharer i, sharer j, and the home node]

Page 38: ECE8833 Polymorphous and Many-Core Computer Architecture

Replication Policy: Where to Insert?
• A replica is only inserted when one of the following is found (in priority order)
  – An invalid line
  – A global line with no sharers (i.e., a line in its home slice that no L1 is sharing)
• [Figure: sharer i, sharer j, and the home node]

Page 39: ECE8833 Polymorphous and Many-Core Computer Architecture

Replication Policy: Where to Insert?
• A replica is only inserted when one of the following is found (in priority order)
  – An invalid line
  – A global line with no sharers
  – An existing replica
• [Figure: sharer i, sharer j, and the home node]

Page 40: ECE8833 Polymorphous and Many-Core Computer Architecture

Replication Policy: Where to Insert?
• A replica is only inserted when one of the following is found (in priority order)
  – An invalid line
  – A global line with no sharers
  – An existing replica
• The L1 victim is never replicated when
  – Making room would require evicting a global line with remote sharers
• [Figure: sharer i, sharer j, and the home node]

Page 41: ECE8833 Polymorphous and Many-Core Computer Architecture

Replication Policy: Where to Insert?
• A replica is only inserted when one of the following is found (in priority order; see the sketch below)
  – An invalid line
  – A global line with no sharers
  – An existing replica
• The L1 victim is never replicated when
  – Making room would require evicting a global line with remote sharers, or
  – The victim's home tile is the local tile
• [Figure: sharer i, sharer j, and the home node]
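A compact sketch of the insertion-priority decision, assuming a simple per-line state record in the local L2 slice; the types, associativity, and set scan are illustrative, not the hardware's actual structures.

```c
#include <stdbool.h>

#define SLICE_WAYS 8                     /* associativity of the local slice (assumed) */

typedef struct {
    bool valid;
    bool is_replica;                     /* locally created replica of an L1 victim */
    int  sharer_count;                   /* for global lines homed in this slice */
} l2_line_state_t;

/* Pick a slot in the local set for a new replica, honoring the priority:
 * invalid line > global line with no sharers > existing replica.
 * Returns -1 when no slot qualifies (e.g., every candidate is a global
 * line with remote sharers), in which case the victim is not replicated. */
static int pick_replica_slot(const l2_line_state_t set[SLICE_WAYS],
                             bool victim_home_is_local)
{
    if (victim_home_is_local)
        return -1;                       /* already local: never replicate */

    for (int pass = 0; pass < 3; pass++) {
        for (int w = 0; w < SLICE_WAYS; w++) {
            const l2_line_state_t *l = &set[w];
            if (pass == 0 && !l->valid)                         return w;
            if (pass == 1 && l->valid && !l->is_replica
                          && l->sharer_count == 0)              return w;
            if (pass == 2 && l->valid && l->is_replica)         return w;
        }
    }
    return -1;                           /* would displace a shared global line */
}
```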

Page 42: ECE8833 Polymorphous and Many-Core Computer Architecture

VR Combines Global Lines and Replicas
• Private L2 design: the local slice acts as a private L2
• Shared L2 design: the local slice holds only its share of the global L2
• Victim Replication: the local slice holds global lines plus a "private L2" portion filled with L1 victims
• Victim replication dynamically creates a large, local, private victim cache for the local L1 cache
• [Figure: a tile's L2 slice under the private design, the shared design, and victim replication]
(Slide adapted from the presentation by Zhang and Asanovic, ISCA '05)

Page 43: ECE8833 Polymorphous and Many-Core Computer Architecture

When the Working Set Does Not Fit in the Local L2
• The capacity advantage of the shared design yields many fewer off-chip misses
• The latency advantage of the private design is offset by costly off-chip accesses
• Victim replication does even better than the shared design by creating replicas that reduce access latency
• [Figure: average data access latency and access breakdown (hits in L1, hits in local L2, hits in non-local L2, off-chip misses) for the L2P, L2S, and L2VR designs]

Page 44: ECE8833 Polymorphous and Many-Core Computer Architecture

Average Latencies of Different CMPs
• [Figure: average data access latency (cycles) for L2P, L2S, and L2VR across multi-programmed workloads MP0-MP5 and across single-threaded applications (bzip, crafty, eon, gap, gcc, gzip, mcf, parser, perlbmk, twolf, vortex, vpr)]
• Single-threaded applications: L2VR is best in 11 of the 12 cases
• Multi-programmed workloads: L2P is always the best