ECE8833 Polymorphous and Many-Core Computer Architecture
Prof. Hsien-Hsin S. Lee
School of Electrical and Computer Engineering
Lecture 5 Non-Uniform Cache Architecture for CMP
CMP Memory Hierarchy
• Continuing device scaling leads to
  – Deeper memory hierarchy (L2, L3, etc.)
  – Growing cache capacity
    • 6MB in AMD’s Phenom quad-core
    • 8MB in Intel Core i7
    • 24MB L3 in Itanium 2
• Global wire delay
  – Routing dominates access time
• Designing for the worst case
  – Compromises for the slowest access
  – Penalizes every memory access
  – Undesirable
Evolution of Cache Access Time
• Facts
  – Large shared on-die L2
  – Wire delay dominating on-die cache access
• Trend across technology generations:
  – 180nm (1999): 1MB, 3 cycles
  – 90nm (2004): 4MB, 11 cycles
  – 50nm (2010): 16MB, 24 cycles
Multi-banked L2 cache
• 2MB @ 130nm, bank size = 128KB
• Total access = 11 cycles
  – Bank access time = 3 cycles
  – Interconnect delay = 8 cycles
Multi-banked L2 cache
• 16MB @ 50nm, bank size = 64KB
• Total access = 47 cycles
  – Bank access time = 3 cycles
  – Interconnect delay = 44 cycles
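The two slides above follow a simple additive model: the total L2 access time is the fixed bank access time plus the interconnect (routing) delay needed to reach the bank. A minimal Python sketch of that arithmetic, using only the numbers quoted on the slides:

```python
# A sketch of the additive latency model used on the slides above:
# total L2 access time = bank access time + interconnect (routing) delay.

def l2_access_cycles(bank_access, interconnect_delay):
    """Total L2 access latency in cycles."""
    return bank_access + interconnect_delay

print(l2_access_cycles(3, 8))    # 2MB @ 130nm  -> 11 cycles
print(l2_access_cycles(3, 44))   # 16MB @ 50nm  -> 47 cycles
```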
NUCA: Non-Uniform Cache Architecture [Kim et al., ASPLOS-X, 2002]
• Partition a large cache into banks
• Non-uniform latencies for different banks
• Design space exploration
  – Mapping
    • How many banks? (i.e., what is the granularity?)
    • How are lines mapped to each bank?
  – Search
    • Strategy for searching the set of possible locations for a line
  – Movement
    • Should a line always stay in the same bank?
    • How does a line migrate across banks over its lifetime?
Cache Hierarchy Taxonomy (16MB @ 50nm)

Design           Banks          Contentionless latency (CACTI)   Avg. access time (simulated, with bank & channel contention)
UCA              1              41 cycles                        255 cycles
ML-UCA (L2/L3)   1 per level    10 / 41 cycles                   11 / 41 cycles
S-NUCA-1         32             17-41 cycles                     34 cycles
S-NUCA-2         32             9-32 cycles                      24 cycles
D-NUCA           256            4-47 cycles                      18 cycles

[Kim et al., ASPLOS-X 2002]
Static NUCA-1 Using Private Channels
• Upside
  – Increase the number of banks to avoid one bulky, slow access
  – Parallelize accesses to different banks
• Overhead
  – Decoders
  – Wire-dominated: a separate set of private wires is required for every bank
• Each bank has its own distinct access latency
• A line’s location is statically pre-determined by its address (low-order bits select the bank)
• Average access latency = 34.2 cycles
• Wire overhead = 20.9% (an issue)
[Figure: S-NUCA-1 organization. Banks (with sub-banks, predecoders, sense amplifiers, wordline drivers and decoders) connect to the tag array through private address and data buses; low-order address bits select the bank.]
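A minimal sketch of the static placement rule mentioned above: low-order address bits select the bank, so a line’s bank (and therefore its latency) is fixed by its address. The 64-byte block size is an assumption for illustration; the 32-bank count matches the taxonomy slide.

```python
# Sketch of S-NUCA-1 style static placement: the bank holding a line is
# fixed by low-order address bits, so its access latency is known up front.
# BLOCK_BYTES is an illustrative assumption; NUM_BANKS matches the slides.

BLOCK_BYTES = 64
NUM_BANKS = 32

def bank_index(paddr):
    """Select a bank using the low-order bits just above the block offset."""
    return (paddr // BLOCK_BYTES) % NUM_BANKS

print(bank_index(0x12345678))   # the same address always maps to the same bank
```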
Static NUCA-2 Using Switched Channels
• Relieves the wire congestion of Static NUCA-1 by using a 2D switched network
• Wormhole-routed flow control
• Each switch buffers 128-bit packets
• Average access latency = 24.2 cycles
  – On average, 0.8 cycles of “bank” contention + 0.7 cycles of “link” contention in the network
• Wire overhead = 5.9%
[Figure: S-NUCA-2 organization. Banks (with predecoders, wordline drivers and decoders) connect to the tag array through a grid of switches and a shared data bus; contentionless bank latencies range from 9 to 32 cycles.]
Dynamic NUCA
• Data can migrate dynamically
• Promote frequently used cache lines closer to the CPU
• Data management
  – Mapping
    • How many banks? (i.e., what is the granularity?)
    • How are lines mapped to each bank?
  – Search
    • Strategy for searching the set of possible locations for a line
  – Movement
    • Should a line always stay in the same bank?
    • How does a line migrate across banks over its lifetime?
• D-NUCA at 16MB/50nm: 256 banks, 18-cycle average access (bank latencies of 4-47 cycles)
Dynamic NUCA
• Simple Mapping
  – Each column of banks forms one “bank set”; the figure shows 8 bank sets, each with 4 ways (way 0 closest to the controller)
  – All 4 ways of a bank set need to be searched
  – Non-uniform access times across bank sets
  – Farther bank sets take longer to access
[Figure: memory controller at one edge; 8 columns of banks, each column holding ways 0-3 of its bank set]
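A hedged sketch of the simple-mapping lookup just described: index bits pick one of 8 bank-set columns, and all 4 banks (ways) of that column are probed, closest first. The data structures are illustrative, not the paper’s implementation.

```python
# Sketch of D-NUCA "simple mapping": part of the set index selects one of
# 8 bank-set columns; the line may live in any of the column's 4 banks
# (ways), so all of them are probed, closest bank first. Illustrative only.

NUM_BANK_SETS = 8
WAYS_PER_COLUMN = 4

# one tag store per bank: dict mapping set_index -> tag
cache = [[{} for _ in range(WAYS_PER_COLUMN)] for _ in range(NUM_BANK_SETS)]

def lookup(set_index, tag):
    column = set_index % NUM_BANK_SETS
    for way in range(WAYS_PER_COLUMN):            # way 0 = closest to the controller
        if cache[column][way].get(set_index) == tag:
            return column, way                    # hit; latency grows with `way`
    return None                                   # missed in all 4 banks of the bank set
```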
Dynamic NUCA
• Fair Mapping (proposed, but not evaluated in the paper)
  – Banks are assigned to bank sets so that the average access time is equal across all bank sets
  – Requires more complex routing and likely causes more contention
[Figure: memory controller and 8 bank sets, with each set’s banks spread across near and far locations]
Dynamic NUCA
• Shared Mapping
  – The closest banks are shared among multiple bank sets
  – The sharing gives some bank sets slightly higher associativity, which offsets the higher average access latency of their more distant banks
[Figure: memory controller and 8 bank sets; the nearest banks are shared by several bank sets]
Locating a NUCA Line
• Incremental search
  – Probe banks from the closest to the farthest
• (Limited, partitioned) multicast search
  – Search all (or a partition of) the candidate banks in parallel
  – Responses return at times that depend on routing distance
• Smart search
  – Use partial tag comparison [Kessler ’89] (used in the P6)
  – Keep the partial tag array in the cache controller
  – Similar in spirit to modern techniques such as Bloom filters
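A small sketch of the “smart search” idea: the controller keeps a few bits of each line’s tag and probes only the banks whose partial tag matches, so false positives are possible but false negatives are not. The 6-bit partial tag width is an assumption for illustration.

```python
# Sketch of "smart search" with partial tags: the controller stores a few
# low-order tag bits per bank and probes only the banks whose partial tag
# matches (false positives possible, no false negatives). The 6-bit width
# is an illustrative assumption.

PARTIAL_BITS = 6
PARTIAL_MASK = (1 << PARTIAL_BITS) - 1

def candidate_banks(partial_tags, set_index, tag):
    """partial_tags[bank] is a dict: set_index -> stored partial tag."""
    want = tag & PARTIAL_MASK
    return [bank for bank, ptags in enumerate(partial_tags)
            if ptags.get(set_index) == want]
```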
D-NUCA: Dynamic Movement of Cache Lines
Cache line placement upon a hit
• LRU ordering
  – A conventional implementation only adjusts LRU bits
  – NUCA requires physically moving lines to realize any latency benefit (n copy operations)
• Generational promotion
  – On a hit, the line is swapped only with the line in the neighboring bank closest to the controller
  – Lines that hit repeatedly receive progressively more “latency reward”
[Figure: old state vs. new state of a bank column after a hit, showing the hit line moving one bank closer to the controller]
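A hedged sketch of generational promotion: on a hit in bank `hit_way`, the line swaps places with whatever occupies the neighboring bank one step closer to the controller, so repeatedly hit lines creep toward bank 0. The data layout is illustrative.

```python
# Sketch of generational promotion: swap the hit line with the line in the
# neighboring bank one step closer to the controller (None marks an empty
# slot). Repeated hits gradually move a line toward bank 0.

def promote(column_banks, set_index, hit_way):
    """column_banks: list of per-bank dicts (set_index -> tag), index 0 closest."""
    if hit_way == 0:
        return 0                                  # already in the closest bank
    near, far = column_banks[hit_way - 1], column_banks[hit_way]
    near[set_index], far[set_index] = far[set_index], near.get(set_index)
    return hit_way - 1                            # the line's new bank
```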
D-NUCA: Dynamic Movement of Cache Lines
Upon a miss
• Incoming-line insertion
  – Into a distant bank, or
  – Into the MRU (closest) position
• Victim eviction
  – Zero copy: the victim is evicted without any further data movement
  – One copy: the victim is moved once, e.g., to a more distant bank
[Figure: insertion/eviction options. Insert the new line into the most distant bank (an assist-cache-like concept) or into the MRU bank; evict the victim with zero copies, or relocate it with one copy to some distant bank.]
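A minimal sketch of one insertion/eviction combination from the list above: the incoming line is installed in the most distant bank of its column, and the displaced victim is either dropped (zero copy) or handed to a caller-supplied relocation callback (one copy). The callback is hypothetical, purely for illustration.

```python
# Sketch of one insertion/eviction combination: install the incoming line
# in the most distant bank of its column; drop the displaced victim
# ("zero copy") or pass it to a hypothetical relocation callback ("one copy").

WAYS_PER_COLUMN = 4   # redeclared so this sketch stands alone

def install_on_miss(column_banks, set_index, new_tag, one_copy_relocate=None):
    farthest = column_banks[WAYS_PER_COLUMN - 1]
    victim = farthest.get(set_index)
    farthest[set_index] = new_tag                 # new line enters far from the controller
    if victim is not None and one_copy_relocate is not None:
        one_copy_relocate(victim)                 # "one copy": move the victim once
    # otherwise "zero copy": the victim is simply dropped
```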
Sharing a NUCA Cache in a CMP
• Sharing degree (SD) of N: N processor cores share one cache
• Low SD
  – Smaller, private partitions
  – Good hit latency, poor hit rate
  – More discrete L2 caches
    • Expensive L2 coherence
    • E.g., needs a centralized L2 tag directory for L2 coherence
• High SD
  – Good hit rate, but worse hit latency
  – More efficient inter-core communication
  – More expensive L1 coherence
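A tiny sketch of what a sharing degree means operationally, assuming cores are grouped by contiguous core id (an illustrative assumption): with SD cores per partition, a 16-core CMP has 16/SD shared L2 partitions.

```python
# Sketch of sharing degree (SD): SD cores share one L2 partition, so a
# 16-core CMP has 16/SD partitions. Grouping by contiguous core id is an
# illustrative assumption.

NUM_CORES = 16

def l2_partition(core_id, sharing_degree):
    """Which shared-L2 partition serves this core."""
    return core_id // sharing_degree

print(l2_partition(5, 1))    # SD=1: private L2 per core  -> partition 5
print(l2_partition(5, 4))    # SD=4: four partitions      -> partition 1
print(l2_partition(5, 16))   # SD=16: one fully shared L2 -> partition 0
```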
16-Core CMP Substrate and SD
• Low SD (e.g., 1) needs either snooping or a central L2 tag directory for coherence
• High SD (e.g., 16) also needs a directory to indicate which cores’ L1s hold a copy (as in the Piranha CMP)
[Huh et al. ICS’05]
Trade-offs of Cache Sharing Among Cores
• Upside
  – Keeps a single copy of data
  – Uses area more efficiently
  – Faster inter-core communication (no coherence fabric needed between caches)
• Downside
  – Larger structure, slower access
  – Longer wire delay
  – More congestion on the shared interconnect
Flexible Cache Mapping
• Static mapping
  – L2 access latency is fixed at the time a line is placed
• Dynamic mapping
  – D-NUCA idea: a line can migrate across multiple banks
  – A line moves closer to the core that accesses it most frequently
  – Lookup can be expensive; search all partial tags first
[Huh et al., ICS’05]
Flexible Cache Sharing
• Support multiple sharing degrees for different classes of blocks (per-line sharing degree)
• Classify lines as
  – Private (assign a smaller SD)
  – Shared (assign a larger SD)
• The study found a 6-7% improvement over the best uniform SD
  – SD = 1 or 2 for private data
  – SD = 16 for shared data
Enhancing Cache/Memory Performance
• Cache partitioning
  – Explicitly manage cache allocation among processes
    • Each process benefits differently from additional cache space
    • Similar to main memory partitioning [Stone ’92] in the good old days
• Memory-aware scheduling
  – Choose a set of simultaneously running processes that minimizes cache contention
  – Symbiotic scheduling for SMT by the OS
    • Sample and collect information (performance counters) about possible schedules
    • Predict the best schedule (e.g., based on resource contention)
    • Complexity is high for many processes
  – Admission control for gang scheduling
    • Based on the footprint of a job (total memory usage)
Slide adapted from Ed Suh’s HPCA’02 presentation
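Cache partitioning can be realized in several ways; as one illustration (not necessarily the mechanism of [Stone ’92]), here is a hedged sketch of way partitioning, where each process may only replace lines within the ways assigned to it.

```python
# Sketch of way partitioning, one possible mechanism for explicit cache
# allocation (illustrative; not necessarily the scheme of [Stone '92]):
# a process may only replace lines within the ways assigned to it.

WAYS = 8
way_quota = {"procA": set(range(0, 5)),   # procA may allocate in ways 0-4
             "procB": set(range(5, 8))}   # procB may allocate in ways 5-7

def choose_victim_way(lru_order, process):
    """lru_order: ways of the set listed from least- to most-recently used."""
    allowed = way_quota[process]
    for way in lru_order:
        if way in allowed:
            return way                    # evict the LRU line among owned ways
    raise ValueError("process owns no ways in this set")
```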
Victim Replication
Today’s Chip Multiprocessors (Shared L2)
• Layout: “Dance-Hall”
  – Per processing node: core + L1 cache
  – Shared L2 cache
• Small L1 caches
  – Fast access
• Large L2 cache
  – Good hit rate
  – Slower access latency
[Figure: cores with private L1 caches on both sides of an intra-chip switch, all backed by one shared L2 cache]
Slide adapted from presentation by Zhang and Asanovic, ISCA’05
Today’s Chip Multiprocessors (Shared L2)
• Layout: “Dance-Hall”
  – Per processing node: core + L1 cache
  – Shared L2 cache
• Alternative: a large L2 cache divided into slices to minimize latency and power
  – i.e., NUCA
• Challenge
  – Minimize the average access latency
  – Ideally, average memory latency approaches the best-case (closest-slice) latency
[Figure: cores with private L1 caches attached to intra-chip switches; the shared L2 is divided into many L2 slices]
Slide adapted from presentation by Zhang and Asanovic, ISCA’05
Dynamic NUCA Issues
• Does not work well with CMPs
• The “unique” copy of a line cannot be close to all of its sharers
• Behavior
  – Over time, shared data migrates to a location “equidistant” from all sharers
[Beckmann & Wood, MICRO-36]
[Figure: dance-hall CMP layout with cores and private L1 caches around intra-chip switches]
Slide adapted from presentation by Zhang and Asanovic, ISCA’05
Tiled CMP with a Directory-Based Protocol
• Tiled CMPs for scalability
  – Minimal redesign effort
  – Use a directory-based protocol for scalability
• Manage the L2s to minimize the effective access latency
  – Keep data close to the requestors
  – Keep data on-chip
• Two baseline L2 cache designs
  – Each tile has its own private L2
  – All tiles share a single distributed L2
[Figure: a 4x4 grid of tiles; each tile contains a core with its L1 cache, a switch, and an L2 slice (data + tags)]
Slide adapted from presentation by Zhang and Asanovic, ISCA’05
“Private L2” Design Keeps Hit Latency Low
• The local L2 slice is used as a private L2 cache for its tile
  – Shared data is “duplicated” in the L2 of each sharer
  – Coherence must be maintained among all sharers at the L2 level
  – Similar to DSM
• On an L2 miss:
  – Data is not on-chip (off-chip access), or
  – Data is available in the private L2 cache of another tile (cache-to-cache reply forwarding)
• The home node is statically determined by the address
[Figure: requestor, owner/sharer, and home-node tiles, each with a core, L1, private L2 data, directory, and L2 tags; the home directory forwards the request to the owner or off-chip]
Slide adapted from presentation by Zhang and Asanovic, ISCA’05
“Shared L2” Design Gives Maximum Capacity
• All on-chip L2 slices form one distributed shared L2 backing up all L1s
  – “No duplication”: data is kept in a unique L2 location
  – Coherence must be maintained among all sharers at the L1 level
• On an L2 miss:
  – Data is not in the L2 (off-chip access), or
  – Coherence miss (cache-to-cache reply forwarding)
• The home node is statically determined by the address
[Figure: requestor, owner/sharer, and home-node tiles, each with a core, L1, shared L2 slice, directory, and L2 tags]
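A minimal sketch of the “home node statically determined by address” rule, assuming the home tile is picked by address bits just above the cache-line offset (the field position and 16-tile count are illustrative assumptions):

```python
# Sketch of static home-node selection in the shared-L2 design: address
# bits just above the cache-line offset pick the home tile. The field
# position and tile count are illustrative assumptions.

NUM_TILES = 16
BLOCK_BYTES = 64

def home_tile(paddr):
    return (paddr // BLOCK_BYTES) % NUM_TILES

print(home_tile(0x4000))   # every request for this line goes to the same home slice
```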
Private vs. Shared L2 CMP
• Shared L2
  – Long, non-uniform L2 hit latency
  – No duplication maximizes L2 capacity
• Private L2
  – Uniform, lower latency when data is found in the local L2
  – Duplication reduces effective L2 capacity
• Victim Replication: provides low hit latency while keeping the working set on-chip
Normal L1 Eviction for a Shared-L2 CMP
• When an L1 cache line is evicted
  – Write the line back to the home L2 slice if dirty
  – Update the home directory
[Figure: sharer tiles i and j and the home-node tile; the evicted L1 line is written back to the home node’s L2 slice]
Victim Replication
• Replicas
  – L1 victims are stored in the local L2 slice
  – Reused later for faster access
[Figure: sharer tiles i and j and the home-node tile; the evicted L1 line is kept as a replica in the evicting tile’s own L2 slice]
Hitting the Victim Replica
• On an L1 miss, look up the local L2 slice first
• A miss there follows the normal transaction to fetch the line from the home node
• A replica hit invalidates the replica
[Figure: sharer tiles i and j and the home-node tile; the request hits a replica in the requestor’s local L2 slice]
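A hedged sketch of this lookup path: the L1 miss first probes the local slice’s replicas; a replica hit consumes (invalidates) the replica, otherwise the request follows the normal shared-L2 transaction to the home node. `fetch_from_home` is a hypothetical stand-in for that transaction.

```python
# Sketch of the L1-miss path with victim replication: probe the local L2
# slice for a replica first; a replica hit consumes (invalidates) it,
# otherwise fall back to the normal transaction to the home node.
# `fetch_from_home` is a hypothetical stand-in for that transaction.

def l1_miss(replicas, addr, fetch_from_home):
    """replicas: dict (addr -> data) of L1 victims kept in the local slice."""
    if addr in replicas:
        return replicas.pop(addr)       # replica hit: use it and invalidate it
    return fetch_from_home(addr)        # miss locally: go to the home node
```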
Replication Policy: Where to Insert?
• A replica is inserted into the local L2 slice only when one of the following is found (in priority order):
  – An invalid line
  – A global line with no sharers (a line in its home slice that currently has no sharer)
  – An existing replica
• A line is never replicated when
  – Making room would require evicting a global line with remote sharers
  – The victim’s home tile is the local tile
[Figure: sharer tiles i and j and the home-node tile, illustrating each candidate slot in the local L2 slice]
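A hedged sketch of the insertion policy summarized above, with illustrative entry fields: an L1 victim becomes a replica only if its home tile is remote, and only by displacing an invalid entry, a home line with no sharers, or an existing replica, in that priority order.

```python
# Sketch of the replica-insertion priority above, with illustrative entry
# fields. An L1 victim is replicated in the local slice only if its home
# tile is remote, and only by displacing an invalid entry, a home ("global")
# line with no sharers, or an existing replica, in that order.

def try_insert_replica(local_set, victim_addr, victim_home, my_tile):
    if victim_home == my_tile:
        return False                              # home is local: never replicate
    def first(pred):
        return next((e for e in local_set if pred(e)), None)
    target = (first(lambda e: not e["valid"]) or
              first(lambda e: e["is_home_line"] and e["sharers"] == 0) or
              first(lambda e: e["is_replica"]))
    if target is None:
        return False                              # would displace a home line with remote sharers: refuse
    target.update(valid=True, is_replica=True, is_home_line=False,
                  sharers=0, addr=victim_addr)
    return True
```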
VR Combines Global Lines and Replicas
• Private L2 design: the local slice is a private L2 (filled with L1 victims)
• Shared L2 design: the local slice holds only shared (global) lines
• Victim Replication: the local slice holds global lines plus a region filled with local L1 victims (replicas)
• Victim Replication dynamically creates a large, local victim cache for the local L1 cache
Slide adapted from presentation by Zhang and Asanovic, ISCA’05
When the Working Set Does Not Fit in the Local L2
[Figure: average data access latency and access breakdown (hits in L1, hits in local L2, hits in non-local L2, off-chip misses) for the private (L2P), shared (L2S), and victim replication (L2VR) designs]
• The capacity advantage of the shared design yields many fewer off-chip misses
• The latency advantage of the private design is offset by costly off-chip accesses
• Victim replication does even better than the shared design by creating replicas that reduce access latency
Average Latencies of Different CMPs
[Figure: average data access latency (cycles) for L2P, L2S, and L2VR; one chart for multi-programmed workloads MP0-MP5 and one for single-threaded applications (bzip, crafty, eon, gap, gcc, gzip, mcf, parser, perlbmk, twolf, vortex, vpr)]
• Single-threaded applications: L2VR excels in 11 out of 12 cases
• Multi-programmed workloads: L2P is always the best