A Coherent Hybrid SRAM and STT-RAM L1 Cache Architecture for Shared Memory Multicores

Jianxing Wang, Yenni Tim, Zhong-Liang Ong and Weng-Fai Wong, School of Computing, National University of Singapore
Zhenyu Sun and Hai (Helen) Li, Swanson School of Engineering, University of Pittsburgh

wongwf/papers/ASPDAC14-ppt.pdf · 2014. 6. 2.


  • A Coherent Hybrid SRAM and STT-RAM L1 Cache Architecture for Shared Memory Multicores

    Jianxing Wang, Yenni Tim, Zhong-Liang Ong and Weng-Fai Wong

    School of Computing, National University of Singapore

    Zhenyu Sun and Hai (Helen) Li

    Swanson School of Engineering, University of Pittsburgh

  • Outline

    • STT-RAM Basics
      • Cell structure and advantages
      • Challenges and motivations

    • Hybrid L1 Cache Architecture
      • Naïve solution
      • The MESI protocol
      • Block transfer mechanisms

    • Evaluation
      • Performance and energy
      • STT-RAM endurance

    • Conclusion

  • STT-RAM Basics

    • Magnetic Tunnel Junction (MTJ)
      • Two ferromagnetic layers separated by a barrier

  • STT-RAM Basics (cont.)

    • Advantages
      • Non-volatile, near-zero leakage energy
      • As fast as SRAM (read)
      • As dense as DRAM
      • Multi-level cell capability (stacking MTJs)
      • CMOS-compatible
      • Universal memory

  • Motivations of Hybrid Cache

    • Expensive write operation of STT-RAM
      • High latency (10ns+)
      • High energy
      • Compensated by relaxed non-volatility [Smullen et al. 11]
        • Requires refresh

    • Endurance
      • Intense writes in L1
        • bodytrack: writes to L1(s) / writes to L2 = ~29!
      • Additional synchronous operations in a multi-core environment

  • Proposed Hybrid Cache Hierarchy

    [Figure: four cores (Core #0 to Core #3), each with its own L1 D-cache controller (#0 to #3) managing an SRAM partition and an STT-RAM partition. The L1 D-cache controllers are kept coherent via MESI and connect to a shared L2 cache controller and L2 cache (LLC).]

  • Cache Block Management

    • Naïve solution
      • Based on temporal locality

    • Simple but not good enough
      • > 3% IPC degradation

    [Figure: on a cache miss, a write miss allocates the block in SRAM and a read miss allocates it in STT-RAM.]

  • The MESI Coherence Protocol

    • Developed by University of Illinois
      • Illinois MESI

    • For each cache block
      • M (modified) state – data dirty, exclusive copy
      • E (exclusive) state – data clean, exclusive copy
      • S (shared) state – data clean, multiple copies
      • I (invalid) state

    • Common event bus
      • Local (processor) read/write
      • Remote (snoop / bus) read/write
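
The four states and four events listed above can be sketched as a small state machine. This is a minimal illustration of the textbook MESI transitions for a single cache block, not the controller evaluated in the talk; the names `State`, `Event`, `next_state` and the `other_sharers` flag are hypothetical.

```python
from enum import Enum


class State(Enum):
    M = "modified"   # dirty, exclusive copy
    E = "exclusive"  # clean, exclusive copy
    S = "shared"     # clean, possibly multiple copies
    I = "invalid"


class Event(Enum):
    LOCAL_READ = "local read"
    LOCAL_WRITE = "local write"
    REMOTE_READ = "remote read"
    REMOTE_WRITE = "remote write"


def next_state(state: State, event: Event, other_sharers: bool = False) -> State:
    """Return the next MESI state of one cache block (textbook transitions).

    other_sharers is an assumed input: whether another cache holds the
    block at the time of a local read miss.
    """
    if event is Event.LOCAL_WRITE:
        return State.M  # the writer ends up with the only, dirty copy
    if event is Event.REMOTE_WRITE:
        return State.I  # another core takes ownership; our copy is invalidated
    if event is Event.LOCAL_READ:
        if state is State.I:  # read miss: E if we are alone, S otherwise
            return State.S if other_sharers else State.E
        return state  # a read hit leaves the state unchanged
    # REMOTE_READ: any valid copy is downgraded to shared
    return State.I if state is State.I else State.S
```

For example, a block written locally and then read by another core goes I → M → S, which is the kind of state change the transfer policies on the following slides react to.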

  • Cache Block Management (cont.)

    • Immediate transfer policy (IT)
      • Place dirty data (M state) blocks in SRAM
      • Place clean data (E/S state) blocks in STT-RAM
      • Transfer a cache block when its coherence state changes
      • No extra information needed (already tracked by MESI)

  • Immediate Transfer Policy (IT)

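    [Figure: MESI state diagram for the IT policy. The M state resides in SRAM; the E and S states reside in STT-RAM. A write miss allocates the block in SRAM and a read miss allocates it in STT-RAM. A local write moves a block from STT-RAM to SRAM; a remote read moves a block from SRAM to STT-RAM.]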
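
The IT placement rule can be written as a direct mapping from MESI state to partition. The sketch below only restates the rule from the slides (M in SRAM, E/S in STT-RAM, transfer on a state change that crosses partitions); the function names `it_partition` and `it_needs_transfer` and the string encoding of states are assumptions made for illustration.

```python
from typing import Optional

SRAM, STT_RAM = "SRAM", "STT-RAM"


def it_partition(mesi_state: str) -> Optional[str]:
    """Partition that holds a block in the given MESI state under IT."""
    if mesi_state == "M":
        return SRAM        # dirty blocks live in SRAM
    if mesi_state in ("E", "S"):
        return STT_RAM     # clean blocks live in STT-RAM
    if mesi_state == "I":
        return None        # invalid blocks occupy no partition
    raise ValueError(f"unknown MESI state: {mesi_state!r}")


def it_needs_transfer(old_state: str, new_state: str) -> bool:
    """True when a state change moves a resident block across partitions."""
    old, new = it_partition(old_state), it_partition(new_state)
    return old is not None and new is not None and old != new
```

Under this sketch a local write to an E block (`E` → `M`) and a remote read of an M block (`M` → `S`) both trigger a transfer, while `E` → `S` does not, since both states map to STT-RAM.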

  • Cache Block Management (cont.)

    • Delayed transfer policy (DT)
      • IT could be too aggressive
        • Coherence state “ping-pong” between M and S
      • Relax the state restriction
      • Consider request history in prediction
      • Extra information required

  • Delayed Transfer Policy (DT)

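    [Figure: MESI state diagram for the DT policy, with M, E, S and I states shown for both the SRAM and STT-RAM partitions. Transitions are labeled 1st Local Write, 2nd Local Write (consecutive), 1st Remote Read and 2nd Remote Read: the first such request changes state within the current partition, and the second one transfers the block to the other partition.]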
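
A minimal sketch of the delayed-transfer idea follows. It assumes one small history counter per block and a threshold of two matching requests, as the "1st"/"2nd" labels suggest. The reset rule (an opposing request clears the counter, other requests are ignored) is an assumption made here, and `Block`, `streak` and `dt_on_event` are hypothetical names, not the authors' hardware design.

```python
from dataclasses import dataclass

SRAM, STT_RAM = "SRAM", "STT-RAM"


@dataclass
class Block:
    partition: str
    streak: int = 0  # hypothetical per-block request-history counter


def dt_on_event(block: Block, event: str) -> bool:
    """Apply one request to the block; return True if it was transferred."""
    # Request type that argues for leaving the current partition, and the
    # opposing type that argues for staying (assumed reset rule).
    if block.partition == STT_RAM:
        trigger, opposing = "local_write", "remote_read"
    else:
        trigger, opposing = "remote_read", "local_write"

    if event == opposing:
        block.streak = 0      # history broken: start counting again
        return False
    if event != trigger:
        return False          # other requests do not affect the history
    block.streak += 1
    if block.streak < 2:
        return False          # 1st matching request: stay put
    # 2nd matching request: move the block to the other partition
    block.partition = SRAM if block.partition == STT_RAM else STT_RAM
    block.streak = 0
    return True
```

With an alternating sequence of local writes and remote reads, the M/S ping-pong case, this sketch never moves the block, whereas a transfer on every state change would move it each time.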

  • Evaluation

    • PARSEC on MARSSx86 [Patel et al. 11]
      • IPC (Instructions Per Cycle)

    • NVSim [Dong et al. 12]
      • Latency, area and energy numbers (32nm)

    • Configuration
      • Quad-core machine with two-level cache hierarchy
      • Relaxed STT-RAM non-volatility with a 26.5µs retention period [Sun et al. 11]
      • Various cache size combinations within the baseline area budget (64KB SRAM)

  • Normalized IPC (IT policy)

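    [Figure: bar chart of normalized IPC (%) under the IT policy for four configurations: 4KB SRAM + 64KB STT-RAM, 8KB SRAM + 64KB STT-RAM, 16KB SRAM + 64KB STT-RAM, 4KB SRAM + 128KB STT-RAM. The y-axis spans 95% to 102%; off-scale bars are labeled 106.6, 93, 92, 102.9 and 102.1.]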

  • Normalized Energy (IT policy)

    [Figure: bar chart of normalized energy (%) under the IT policy for four configurations: 4KB SRAM + 64KB STT-RAM, 8KB SRAM + 64KB STT-RAM, 16KB SRAM + 64KB STT-RAM, 4KB SRAM + 128KB STT-RAM. The y-axis spans 50% to 90%.]

  • Comparison of Transfer Policies

    [Figure: IPC (%) and energy (%) of the Naïve, IT and DT policies for five configurations: Pure 64KB STT-RAM, 4KB SRAM + 64KB STT-RAM, 8KB SRAM + 64KB STT-RAM, 16KB SRAM + 64KB STT-RAM, 4KB SRAM + 128KB STT-RAM. The IPC axis spans 94% to 100%; the energy axis spans 56% to 74%. Legend: Naïve, IT, DT, Energy.]

  • Impact of Retention Time

    • Retention time is determined by ∆
      • ∆: thermal barrier of the MTJ, affected by planar area, thickness and temperature
      • Ranges from a few microseconds to 10+ years

    • Lower bound (DRAM-style refresh)
      • #blocks × (read latency + write latency)
      • Example: ~4µs (64-byte block, 64KB cache, 3-/9-cycle read/write latency under a 3GHz clock)
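
The ~4µs example can be reproduced with a short calculation. It assumes the lower bound is the time needed to refresh every block once, with each refresh costing one read plus one write, which matches the parameters quoted on the slide. The function name `refresh_lower_bound_seconds` is hypothetical.

```python
def refresh_lower_bound_seconds(cache_bytes: int, block_bytes: int,
                                read_cycles: int, write_cycles: int,
                                clock_hz: float) -> float:
    """Time to read and rewrite every block once (assumed refresh model)."""
    blocks = cache_bytes // block_bytes
    return blocks * (read_cycles + write_cycles) / clock_hz


# Slide parameters: 64KB cache, 64-byte blocks, 3-/9-cycle read/write, 3GHz.
bound = refresh_lower_bound_seconds(64 * 1024, 64, 3, 9, 3e9)
print(f"{bound * 1e6:.3f} µs")  # 1024 blocks × 12 cycles / 3GHz = 4.096 µs
```

A retention period shorter than this bound would leave no time for anything but refresh, which is why it serves as the floor for the retention sweep.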

  • Impact of Retention Time (cont.)

    [Figure: IPC (%) and energy (%) versus retention time, for retention times of 4µs, 8µs, 16µs, 32µs, 64µs and inf. The IPC axis spans 90% to 102%; the energy axis spans 100% to 125%.]

  • STT-RAM Endurance

    • Lifespan in programming cycles
      • SRAM and DRAM: 10^16
      • STT-RAM prediction [Tabrizi 07]: 10^15
      • STT-RAM reported [Diao et al. 07]: 10^13
      • SLC NAND flash: 10^5

    • Writes in L1 cache
      • High intensity
      • Unevenly distributed
        • bodytrack: ~35% of writes on one cache partition
        • facesim: ~50% of writes on the same cache partition, ~15% on the same block!

  • STT-RAM Endurance (cont.)

    • facesim

    Configuration      Perfectly distributed   Worst partition   Worst block
    Baseline SRAM      1,300+ years            300+ years        < 360 hrs
    Baseline STT-RAM   1.3 years               0.3 years         < 22 mins
    Hybrid Naïve       3.5 years               1.0 year          0.9 hr
    Hybrid IT          41.2 years              6.9 years         51.6 hrs
    Hybrid DT          32.9 years              7.0 years         54.3 hrs

    150x lifespan increase for the worst block!
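
The baseline rows appear consistent with lifespan scaling linearly with cell endurance under the same write traffic, using 10^16 cycles for SRAM and the reported 10^13 cycles for STT-RAM. The sketch below checks that reading and the quoted worst-block gain. The linear-scaling model and the name `scaled_lifespan` are assumptions; all inputs are table values.

```python
def scaled_lifespan(reference_lifespan: float, reference_endurance: float,
                    endurance: float) -> float:
    """Scale a lifespan linearly with endurance (same write traffic assumed)."""
    return reference_lifespan * endurance / reference_endurance


# Baseline SRAM (10^16 cycles) rescaled to reported STT-RAM (10^13 cycles).
stt_perfect_years = scaled_lifespan(1300, 1e16, 1e13)             # about 1.3 years
stt_worst_block_minutes = scaled_lifespan(360 * 60, 1e16, 1e13)   # about 21.6 minutes

# Worst-block gain of the hybrid designs over baseline STT-RAM (< 22 mins).
it_gain = 51.6 * 60 / 22   # about 141x
dt_gain = 54.3 * 60 / 22   # about 148x, in line with the quoted 150x
```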

  • Conclusion

    • Deploy STT-RAM as L1 cache
      • Expensive write (latency, energy and endurance)

    • Architecture solution: hybrid cache
      • “big.LITTLE” model

    • MESI-based Hybrid L1 Cache Architecture
      • Small SRAM partition + large STT-RAM partition
      • Uses built-in information from the coherence protocol
      • Performance maintained with less energy and extended lifespan

  • Q&A