View
7
Download
0
Category
Preview:
Citation preview
A Coherent Hybrid SRAM and STT-RAM L1 Cache Architecture for Shared Memory Multicores
Jianxing Wang, Yenni Tim Zhong-Liang Ong and Weng-Fai Wong
School of Computing National University of Singapore
Zhenyu Sun and Hai (Helen) LiSwanson School of Engineering
University of Pittsburgh
1
Outline
• STT-RAM Basics• Cell structure and advantages• Challenges and motivations
• Hybrid L1 Cache Architecture• Naïve solution• The MESI protocol • Block transfer mechanisms
• Evaluation• Performance and energy• STT-RAM endurance
• Conclusion
2
STT-RAM Basics
• Magnetic Tunnel Junction (MTJ)• Two ferromagnetic layers separated by a barrier
3
STT-RAM Basics (cont.)
• Advantages• Non-volatile, near zero leakage energy• As fast as SRAM (read)• As dense as DRAM• Multi-level cell capability (stacking MTJs)• CMOS-compatible• Universal memory
4
Motivations of Hybrid Cache
• Expensive write operation of STT-RAM• High latency (10ns+)• High energy• Compensated by relaxed non-volatility [Smullen et al. 11]
• Refresh
• Endurance• Intense writes in L1
• bodytrack: L1(s) / L2 = ~29!• Additional synchronous operations under multi-core
environment 5
Proposed Hybrid Cache Hierarchy
6
SRAM
STT-R
AM
Core #0
L1 D-Cache Controller #0
L2 Cache Controller
L2 Cache (LLC)
MESI MESI
L1 D-Cache Controller #1
L1 D-Cache Controller #2
L1 D-Cache Controller #3
MESI MESI
SR
AM
STT-R
AM
Core #1
SRAM
STT-R
AM
Core #2
SRAM
STT-R
AM
Core #3
Cache Block Management
• Naïve solution• Based on temporal locality
• Simple but not good enough• > 3% IPC degradation
7
Cache Miss
STT-RAM
Write-miss
Read-miss
SRAM
The MESI Coherent Protocol
• Developed by University of Illinois• Illinois MESI
• For each cache block• M (modified) state – data dirty, exclusive copy• E (exclusive) state – data clean, exclusive copy• S (shared) state – data clean, multiple copies• I (invalid) state
• Common event bus• Local (processor) read/write• Remote (snoop / bus) read/write 8
Cache Block Management (cont.)
9
• Immediate transfer policy (IT)• Place dirty data (M state) block in SRAM• Place clean data (E/S state) block in STT-RAM• Transfer cache block when coherent state changes• DO NOT need extra information (built-in by MESI)
Immediate Transfer Policy (IT)
10
I
S I
Remote Read
Local Write
E
M
SRAM
STT-RAM
Write Miss
Read Miss
Cache Block Management (cont.)
11
• Delayed transfer policy (DT)• IT could be too aggressive
• Coherent state “ping-pong” between M and S
• Relax state restriction• Consider request history in prediction• Extra information required
Delayed Transfer Policy (DT)
12
M
E I
E
M
I S
S2nd Local Write (consecutive)
2nd Remote Read
1st Remote Read
1st Local Write
SRAM
STT-RAM
Evaluation
• PARSEC on MARSSx86 [Patel et al.11]• IPC (Instruction Per Cycle)
• NVSim [Dong et al. 12]• Latency, area and energy numbers (32nm)
• Configuration• Quadcore machine with two-level cache hierarchy • Relaxed STT-RAM’s non-volatility with a 26.5µs
retention period [Sun et al. 11]• Various cache size combinations within the
baseline area budget (64KB SRAM) 13
Normalized IPC (IT policy)
14
95
96
97
98
99
100
101
102(%)
4KB SRAM +64KB STT-RAM 8KB SRAM +64KB STT-RAM16KB SRAM +64KB STT-RAM 4KB SRAM +128KB STT-RAM
106.6
93 92
102.9102.1
Normalized Energy (IT policy)
15
50
55
60
65
70
75
80
85
90(%)
4KB SRAM + 64KB STT-RAM 8KB SRAM + 64KB STT-RAM16KB SRAM + 64KB STT-RAM 4KB SRAM + 128KB STT-RAM
Comparison of Transfer Policies
16
56
59
62
65
68
71
74
94
95
96
97
98
99
100
Pure 64KBSTT-RAM
4KB SRAM +64KB STT-RAM
8KB SRAM +64KB STT-RAM
16KB SRAM +64KB STT-RAM
4KB SRAM +128KB STT-RAM
Energy (%)IPC (%) Naïve IT DT Energy
×
× × ×
×
×
× ×
×
× ××× ×
Impact of Retention Time
• ∆• ∆: Thermal barrier of MTJ, affected by planar
area, thickness and temperature• Range from few microseconds to 10+ years
• Lower bound (DRAM-style refresh)• #
• Example: ~4µs(64-byte block, 64K size, 3-/9-
cycle read/write latency under 3GHz clock)
17
Impact of Retention Time (cont.)
18
100
105
110
115
120
125
90
92
94
96
98
100
102
4µs 8µs 16µs 32µs 64µs inf
Energy (%)IPC (%)
Retention Time
IPC Energy
STT-RAM Endurance
• Lifespan programming cycles• SRAM and DRAM: 10^16• STT-RAM prediction [Tabrizi 07] : 10^15• STT-RAM reported [Diao et al. 07] : 10^13• SLC NAND flash: 10^5
• Writes in L1 cache• High intensity• Non-even distributed
• bodytrack: ~35% writes on one cache partition• facesim: ~50% writes on the same cache partition, ~15%
on the same block!19
STT-RAM Endurance (cont.)
20
• facesimPerfect
distributedWorst
PartitionWorst Block
Baseline SRAM 1,300+ years 300+ years < 360 hrs
BaselineSTT-RAM 1.3 years 0.3 years < 22 mins
Hybrid Naïve 3.5 years 1.0 year 0.9 hrHybrid IT 41.2 years 6.9 years 51.6 hrsHybrid DT 32.9 years 7.0 years 54.3 hrs
150x lifespan increases for the worst block!
Conclusion
• Deploy STT-RAM as L1 cache• Expensive write (latency, energy and endurance)
• Architecture solution: hybrid cache• “big.LITTLE” model
• MESI-based Hybrid L1 Cache Architecture• Small SRAM partition + large STT-RAM partition • Using built-in information from coherent protocol• Performance maintained with less energy, and
extended lifespan21
Q&A 22
Recommended