A Coherent Hybrid SRAM and STT-RAM L1 Cache Architecture for Shared Memory Multicoreswongwf/papers/ASPDAC14-ppt.pdf · 2014. 6. 2. · STT-RAM L1 Cache Architecture for Shared Memory

A Coherent Hybrid SRAM and STT-RAM L1 Cache Architecture for Shared Memory Multicores

Jianxing Wang, Yenni Tim Zhong-Liang Ong and Weng-Fai Wong

School of Computing National University of Singapore

Zhenyu Sun and Hai (Helen) LiSwanson School of Engineering

University of Pittsburgh

1

Outline

• STT-RAM Basics• Cell structure and advantages• Challenges and motivations

• Hybrid L1 Cache Architecture• Naïve solution• The MESI protocol • Block transfer mechanisms

• Evaluation• Performance and energy• STT-RAM endurance

• Conclusion

2

STT-RAM Basics

• Magnetic Tunnel Junction (MTJ)• Two ferromagnetic layers separated by a barrier

3

STT-RAM Basics (cont.)

• Advantages• Non-volatile, near zero leakage energy• As fast as SRAM (read)• As dense as DRAM• Multi-level cell capability (stacking MTJs)• CMOS-compatible• Universal memory

4

Motivations of Hybrid Cache

• Expensive write operation of STT-RAM• High latency (10ns+)• High energy• Compensated by relaxed non-volatility [Smullen et al. 11]

• Refresh

• Endurance• Intense writes in L1

• bodytrack: L1(s) / L2 = ~29!• Additional synchronous operations under multi-core

environment 5

Proposed Hybrid Cache Hierarchy

6

SRAM

STT-R

AM

Core #0

L1 D-Cache Controller #0

L2 Cache Controller

L2 Cache (LLC)

MESI MESI




MESI MESI

SR

AM

STT-R

AM

Core #1

SRAM

STT-R

AM

Core #2

SRAM

STT-R

AM

Core #3

Cache Block Management

• Naïve solution• Based on temporal locality

• Simple but not good enough• > 3% IPC degradation

7

Cache Miss

STT-RAM

Write-miss

Read-miss

SRAM

The MESI Coherent Protocol

• Developed by University of Illinois• Illinois MESI

• For each cache block• M (modified) state – data dirty, exclusive copy• E (exclusive) state – data clean, exclusive copy• S (shared) state – data clean, multiple copies• I (invalid) state

• Common event bus• Local (processor) read/write• Remote (snoop / bus) read/write 8

Cache Block Management (cont.)

9

• Immediate transfer policy (IT)• Place dirty data (M state) block in SRAM• Place clean data (E/S state) block in STT-RAM• Transfer cache block when coherent state changes• DO NOT need extra information (built-in by MESI)

Immediate Transfer Policy (IT)

10

I

S I

Remote Read

Local Write

E

M

SRAM

STT-RAM

Write Miss

Read Miss

Cache Block Management (cont.)

11

• Delayed transfer policy (DT)• IT could be too aggressive

• Coherent state “ping-pong” between M and S

• Relax state restriction• Consider request history in prediction• Extra information required

Delayed Transfer Policy (DT)

12

M

E I

E

M

I S

S2nd Local Write (consecutive)

2nd Remote Read

1st Remote Read

1st Local Write

SRAM

STT-RAM

Evaluation

• PARSEC on MARSSx86 [Patel et al.11]• IPC (Instruction Per Cycle)

• NVSim [Dong et al. 12]• Latency, area and energy numbers (32nm)

• Configuration• Quadcore machine with two-level cache hierarchy • Relaxed STT-RAM’s non-volatility with a 26.5µs

retention period [Sun et al. 11]• Various cache size combinations within the

baseline area budget (64KB SRAM) 13

Normalized IPC (IT policy)

14

95

96

97

98

99

100

101

102(%)

4KB SRAM +64KB STT-RAM 8KB SRAM +64KB STT-RAM16KB SRAM +64KB STT-RAM 4KB SRAM +128KB STT-RAM

106.6

93 92

102.9102.1

Normalized Energy (IT policy)

15

50

55

60

65

70

75

80

85

90(%)

4KB SRAM + 64KB STT-RAM 8KB SRAM + 64KB STT-RAM16KB SRAM + 64KB STT-RAM 4KB SRAM + 128KB STT-RAM

Comparison of Transfer Policies

16

56

59

62

65

68

71

74

94

95

96

97

98

99

100

Pure 64KBSTT-RAM

4KB SRAM +64KB STT-RAM




Energy (%)IPC (%) Naïve IT DT Energy

×

× × ×

×

×

× ×

×

× ××× ×

Impact of Retention Time

• ∆• ∆: Thermal barrier of MTJ, affected by planar

area, thickness and temperature• Range from few microseconds to 10+ years

• Lower bound (DRAM-style refresh)• #

• Example: ~4µs(64-byte block, 64K size, 3-/9-

cycle read/write latency under 3GHz clock)

17

Impact of Retention Time (cont.)

18

100

105

110

115

120

125

90

92

94

96

98

100

102

4µs 8µs 16µs 32µs 64µs inf

Energy (%)IPC (%)

Retention Time

IPC Energy

STT-RAM Endurance

• Lifespan programming cycles• SRAM and DRAM: 10^16• STT-RAM prediction [Tabrizi 07] : 10^15• STT-RAM reported [Diao et al. 07] : 10^13• SLC NAND flash: 10^5

• Writes in L1 cache• High intensity• Non-even distributed

• bodytrack: ~35% writes on one cache partition• facesim: ~50% writes on the same cache partition, ~15%

on the same block!19

STT-RAM Endurance (cont.)

20

• facesimPerfect

distributedWorst

PartitionWorst Block

Baseline SRAM 1,300+ years 300+ years < 360 hrs

BaselineSTT-RAM 1.3 years 0.3 years < 22 mins

Hybrid Naïve 3.5 years 1.0 year 0.9 hrHybrid IT 41.2 years 6.9 years 51.6 hrsHybrid DT 32.9 years 7.0 years 54.3 hrs

150x lifespan increases for the worst block!

Conclusion

• Deploy STT-RAM as L1 cache• Expensive write (latency, energy and endurance)

• Architecture solution: hybrid cache• “big.LITTLE” model

• MESI-based Hybrid L1 Cache Architecture• Small SRAM partition + large STT-RAM partition • Using built-in information from coherent protocol• Performance maintained with less energy, and

extended lifespan21

Q&A 22

Documents

A Coherent Hybrid SRAM and STT-RAM L1 Cache Architecture for Shared Memory Multicoreswongwf/papers/ASPDAC14-ppt.pdf · 2014. 6. 2. · STT-RAM L1 Cache Architecture for Shared Memory