SLAM: High performance and energy efficient shared hybrid last level cache architecture in multicore systems

Presented by: Swapnil Bhosale

Advisor: Dr. Sudeep Pasricha

Committee members: Dr. Sourajeet Roy, Dr. Wim Bohm

Overview

• Introduction
• Related work
• Analysis of prior works
• Motivation
• Proposed SLAM framework
• Experimental setup
• Results
• Conclusion and future work

Introduction

• Most modern computing systems are multicore with a multi-level cache memory hierarchy
• The last level cache (LLC) is generally shared by the cores, below their private caches

Figure: Freescale Semiconductor's multi-core processor (B4860). Source: http://www.eedailynews.com/2012/02/freescale-claims-highest-performance.html

Figure source: http://www.cse.wustl.edu/~jain/cse567-11/ftp/multcore/index.html

Introduction

Figure: cache associativity illustration. Source: http://csillustrated.berkeley.edu/PDFs/handouts/cache-3-associativity-handout.pdf

Introduction

• The processor-memory performance gap continues to increase
• Traditional SRAM-based caches cannot cope with the increasing gap
• Need for an alternative memory that can
  o Provide high capacity
  o Consume less energy
  o Be placed closer to the processor

Figure source: https://mzh.io/%E5%A6%82%E4%BD%95%E8%AE%A9Go%E7%A8%8B%E5%BA%8F%E6%9B%B4%E5%BF%AB

Introduction

• Researchers have proposed Spin Transfer Torque Random Access Memory (STTRAM)
• Attributes
  o High density
  o Low static power consumption
  o Non-volatility
  o Future scalability
  o High endurance
  o Small read latency

Potential replacement for SRAM in the cache memory hierarchy

Figure source: https://www.mram-info.com/stt-mram

Introduction

• The basic storage element is the MTJ (Magnetic Tunnel Junction)
• Data is stored as the relative magnetic orientation of two ferromagnetic layers

Figure sources:
https://www.mram-info.com/stt-mram
https://www.embedded.com/design/real-time-and-performance/4026000/The-future-of-scalable-STT-RAM-as-a-universal-embedded-memory

Introduction

Parameter                                 SRAM        STTRAM
Cell structure                            (figure)    (figure)
Leakage power (for 1MB, 45nm tech node)   14.63 mW    2.32 mW    (5-6x lower than SRAM)
Area (for 1MB, 45nm tech node)            3.77 mm²    0.95 mm²   (4-5x denser than SRAM)
Write latency                             3.18 ns     12.01 ns   (approx. 4x that of SRAM)
Write energy                              0.08 nJ     0.64 nJ    (approx. 8x that of SRAM)

Source: J. Ahn, S. Yoo and K. Choi, "DASCA: Dead Write Prediction Assisted STT-RAM Cache Architecture," 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA), Orlando, FL, 2014, pp. 25-36.

Need for techniques to overcome the drawbacks of STTRAM
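The ratios quoted next to the SRAM and STTRAM figures can be checked directly from the table values. The short Python sketch below does only that arithmetic; the dictionary keys and variable names are illustrative and are not part of the original slides.

```python
# Values for a 1MB array at the 45nm node, copied from the comparison table.
SRAM = {"leakage_mW": 14.63, "area_mm2": 3.77,
        "write_latency_ns": 3.18, "write_energy_nJ": 0.08}
STTRAM = {"leakage_mW": 2.32, "area_mm2": 0.95,
          "write_latency_ns": 12.01, "write_energy_nJ": 0.64}


def ratio(metric, numerator, denominator):
    """Ratio of one technology's value to the other's for a given metric."""
    return numerator[metric] / denominator[metric]


# Where STTRAM is better, divide SRAM by STTRAM.
leakage_saving = ratio("leakage_mW", SRAM, STTRAM)
density_gain = ratio("area_mm2", SRAM, STTRAM)
# Where STTRAM is worse, divide STTRAM by SRAM.
write_latency_penalty = ratio("write_latency_ns", STTRAM, SRAM)
write_energy_penalty = ratio("write_energy_nJ", STTRAM, SRAM)

for name, value in [("leakage saving", leakage_saving),
                    ("density gain", density_gain),
                    ("write latency penalty", write_latency_penalty),
                    ("write energy penalty", write_energy_penalty)]:
    print(f"{name}: {value:.2f}x")
```

The write penalties (roughly 4x latency and 8x energy) are the drawbacks that the rest of the presentation tries to hide.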


Related work

• Some works focus on reducing write energy by tuning MTJ device properties
  o "Relaxing non-volatility for fast and energy-efficient STT-RAM caches", [C. Smullen, et al, IEEE HPCA, 2011]
  o "Delivering on the promise of universal memory for spin-transfer torque RAM (STT-RAM)", [C. Smullen, et al, IEEE ISLPED, 2011]

Figure: relationship between MTJ thickness and MTJ write time, and between MTJ thickness and MTJ retention time

Tough compromise between MTJ write time and MTJ retention time

Related work

• Other works focus on reducing write energy at the cell level
• Basic idea is to update only the bits whose values differ from the stored ones
  o "Energy reduction for STT-RAM using early write termination", [P. Zhou, et al, IEEE ICCAD, 2009]
  o "Coding last level STT-RAM cache for high endurance and low power", [S. Yazdanshenas, et al, IEEE Computer Architecture Letters, 2013]

These works do not consider the non-uniformity of writes across the cache

Figure: a comparator compares the incoming bit pattern with the cache DATA field. Bit patterns shown: 1 0 1 0 1 1 1 1 and 1 0 1 0 1 1 0 0
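The cell-level idea of writing only the bits that change can be sketched in a few lines of Python. This is a minimal illustration, not the mechanism of either cited paper: the function names are hypothetical, and which of the two bit patterns is the stored value and which is the incoming one is an assumption made for the example.

```python
def bits_to_update(stored, incoming, width=8):
    """Return the bit positions (0 = least significant) whose value changes."""
    diff = (stored ^ incoming) & ((1 << width) - 1)  # XOR acts as the comparator
    return [pos for pos in range(width) if (diff >> pos) & 1]


def differential_write(stored, incoming, width=8):
    """Write only the differing cells; return the new value and the cell-write count."""
    positions = bits_to_update(stored, incoming, width)
    new_value = stored
    for pos in positions:
        new_value ^= 1 << pos  # flip just the cells that differ
    return new_value, len(positions)


# Assumed roles: the first pattern is stored, the second is incoming.
stored = 0b10101111
incoming = 0b10101100
new_value, cells_written = differential_write(stored, incoming)
print(f"cells written: {cells_written} of 8")
```

With these two patterns only the two low-order cells are rewritten instead of all eight, which is where the write-energy saving comes from.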

Related work

• Some works focus on reducing write energy using a hybrid last level cache (LLC) architecture
• Basic idea is to migrate write-intensive cache lines to the SRAM region
  o "Exploiting non-uniformity of write accesses for designing a high-endurance hybrid last level cache", [P. Safayenikoo, et al, IEEE CCECE, 2017]
  o "High-endurance and performance-efficient design of hybrid cache architectures through adaptive line replacement", [A. Jadidi, et al, IEEE ISLPED, 2011]

Provides better energy savings by taking advantage of both SRAM and STTRAM

8-way set in hybrid cache
way-0   SRAM
way-1   SRAM
way-2   STTRAM
way-3   STTRAM
way-4   STTRAM
way-5   STTRAM
way-6   STTRAM
way-7   STTRAM


Analysis of prior work: PTHCM

What does PTHCM do?

• PTHCM (Prediction table hybrid cache management) uses a hybrid last level cache (LLC) comprising SRAM and STTRAM
• Uses a prediction table to predict write-intensive cache lines
• Migrates write-intensive cache lines from the STTRAM region to the SRAM region

Figure: a set in the hybrid LLC (TAG and DATA fields; 4-way SRAM, 12-way STTRAM), the prediction table and main memory

Citation: Baixing Quan, Tiefei Zhang, Tianzhou Chen and Jianzhong Wu, "Prediction table based management policy for STT-RAM and SRAM hybrid cache," 2012 7th International Conference on Computing and Convergence Technology (ICCCT), Seoul, 2012, pp. 1092-1097

Analysis of prior work: PTHCM

How does it work?

• Counters keep the access history of each cache line
  o AC: actual access count of a cache line (read/write)
  o WC: actual write count of a cache line
  o ALC: prediction of access count of a cache line
  o WLC: prediction of write count of a cache line
• Migration will happen on
  o Miss
  o Write hit
• Prediction table is populated on
  o Eviction

Figure: each line in a set of the hybrid LLC (4-way SRAM, 12-way STTRAM) holds TAG, WC, AC, WLC, ALC and DATA; each prediction table entry holds TAG, WC and AC

Analysis of prior work: PTHCM

Miss at line 'x'

• The prediction table is checked for the entry of line 'x' (x.TAG, x.WC, x.AC)
• If entry not found, WLC and ALC are initialized to user-set thresholds
• The incoming line starts with WC=0 and AC=0
• Insert in SRAM to avoid a write operation to STTRAM
• Line with minimum WLC is replaced
• LRU line evicted

Figure: a set in the hybrid LLC (4-way SRAM, 12-way STTRAM; fields TAG, WC, AC, WLC, ALC, DATA) with lines 'y' and 'z', the prediction table (TAG, WC, AC) and main memory

Analysis of prior work: PTHCM

Write hit

• Counters of the line are updated: WC++, AC++, WLC--, ALC--
• If WC > threshold, the line is swapped into the SRAM region
• Line with minimum WLC is replaced

Figure: a set in the hybrid LLC (4-way SRAM, 12-way STTRAM; fields TAG, WC, AC, WLC, ALC, DATA) with lines 'x' and 'y' being swapped, the prediction table (TAG, WC, AC) and main memory
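The write-hit step of PTHCM can be sketched as follows. This is a minimal Python illustration of the behaviour the slides describe, not the implementation from the cited paper: the `Line` class, the function name, the saturation of the 3-bit counters and the clamping of the prediction counters at zero are all assumptions made for the example.

```python
from dataclasses import dataclass

COUNTER_MAX = 7  # assumption: 3-bit counters saturate at 7


@dataclass
class Line:
    tag: int
    wc: int = 0   # actual write count
    ac: int = 0   # actual access count
    wlc: int = 0  # predicted write count
    alc: int = 0  # predicted access count


def on_write_hit(sram, sttram, tag, wc_threshold):
    """Update counters on a write hit; swap the line into SRAM if WC > threshold.

    `sram` and `sttram` are lists of Line objects for one set.
    Returns True when a swap between the two regions took place.
    """
    region, idx = None, None
    for name, lines in (("sram", sram), ("sttram", sttram)):
        for k, candidate in enumerate(lines):
            if candidate.tag == tag:
                region, idx = name, k
    if region is None:
        raise KeyError("write hit on a tag that is not in the set")
    line = sram[idx] if region == "sram" else sttram[idx]

    line.wc = min(line.wc + 1, COUNTER_MAX)
    line.ac = min(line.ac + 1, COUNTER_MAX)
    line.wlc = max(line.wlc - 1, 0)
    line.alc = max(line.alc - 1, 0)

    if region == "sttram" and line.wc > wc_threshold:
        # The SRAM line with the minimum WLC gives up its place.
        victim_idx = min(range(len(sram)), key=lambda k: sram[k].wlc)
        sram[victim_idx], sttram[idx] = line, sram[victim_idx]
        return True
    return False


# One 16-way set: 4 SRAM ways and 12 STTRAM ways.
sram = [Line(1, wlc=3), Line(2, wlc=1), Line(3, wlc=2), Line(4, wlc=5)]
sttram = [Line(0x10, wc=2, ac=4, wlc=3, alc=5)] + [Line(0x20 + k) for k in range(11)]
migrated = on_write_hit(sram, sttram, 0x10, wc_threshold=2)
```

In this example the written line crosses the threshold and trades places with the SRAM line whose WLC is lowest, so later writes to it no longer pay the STTRAM write cost.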

Analysis of prior work: PTHCM

Eviction

• On eviction of line 'y', its counts (y.TAG, y.WC, y.AC) are recorded in the prediction table
• If entry not found, make a new entry in an empty slot
• If no empty slot, delete the entry with minimum AC

Figure: a set in the hybrid LLC (4-way SRAM, 12-way STTRAM; fields TAG, WC, AC, WLC, ALC, DATA) with lines 'x' and 'y', the prediction table (TAG, WC, AC) and main memory
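The prediction table handling on eviction and on a later miss can be sketched as below. This is a hedged Python illustration: the table is modelled as a plain dictionary, the function names are hypothetical, and seeding WLC and ALC from the recorded WC and AC when an entry is found is an assumption (the slides only state what happens when the entry is missing).

```python
def record_eviction(table, capacity, tag, wc, ac):
    """Record an evicted line's actual counts; `table` maps tag -> (wc, ac)."""
    if tag not in table and len(table) >= capacity:
        # No empty slot: delete the entry with the minimum AC.
        victim = min(table, key=lambda t: table[t][1])
        del table[victim]
    table[tag] = (wc, ac)


def predict_on_miss(table, tag, wlc_threshold, alc_threshold):
    """Return the (WLC, ALC) pair used to initialise a line filled after a miss."""
    if tag in table:
        wc, ac = table[tag]
        return wc, ac  # assumption: recorded counts seed the predictions
    # Entry not found: fall back to the user-set thresholds.
    return wlc_threshold, alc_threshold


table = {}
record_eviction(table, 2, 0xA, wc=3, ac=5)
record_eviction(table, 2, 0xB, wc=1, ac=2)
record_eviction(table, 2, 0xC, wc=4, ac=6)  # table full: 0xB has the minimum AC

known = predict_on_miss(table, 0xA, wlc_threshold=2, alc_threshold=4)
unknown = predict_on_miss(table, 0xB, wlc_threshold=2, alc_threshold=4)
```

A real table would be a fixed-size hardware structure indexed by tag bits; the dictionary only keeps the replacement rule easy to read.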

Analysis of prior work: PTHCM

• Hardware overhead
  o 3 bits to represent each of WC, AC, WLC and ALC
  o 12 extra bits added to every cache line in the LLC
  o 65536 cache lines in a 4MB hybrid LLC with 64B block size
  o 12*65536 bits ~ 98kB additional space in LLC
  o Considering 14 bits to represent TAG
  o Each entry in the prediction table is 20 bits in size
  o 65536 entries in the prediction table
  o Size of the prediction table is 20*65536 bits ~ 163kB
  o Size of swap/migration buffer ~ 68B
  o Total hardware overhead = 262kB ~ 6.39% of LLC

Notable hardware overhead

Figure: 4MB hybrid LLC with 12 bits of extra fields (WC, AC, WLC, ALC) per cache line, a 163kB prediction table (TAG, WC, AC) and a 68B swap/migration buffer

Background: Cache coherence

• Uniformity of shared resource data
• Achieved by writing back modified data to shared memory when
  o Evicted by owner
  o Requested by peer processor

Figure (left, coherent view of memory): the L1 caches of P1 and P2 hold x=5 and shared memory holds x=5

Figure (right, non-coherent view of memory): P1's L1 holds x=6, P2's L1 holds x=INV and shared memory still holds x=5; a write-back is triggered by an eviction or by a peer Rd->x / Wr->x request

Background: Cache coherence

Figure (after the write-back, coherent view of memory): P1's L1 holds x=6, P2's L1 holds x=INV and shared memory holds x=6
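The two write-back triggers (eviction by the owner, request by a peer processor) can be reproduced with a toy model. The Python sketch below is a deliberately simplified write-invalidate, write-back scheme with hypothetical class and method names; it is not a MESI implementation and has no directory.

```python
INVALID = "INV"


class CoherentSystem:
    """Toy write-invalidate, write-back model with one shared memory."""

    def __init__(self, cores, memory):
        self.l1 = {core: {} for core in cores}        # core -> {address: value}
        self.dirty = {core: set() for core in cores}  # addresses modified per core
        self.shared = dict(memory)

    def _write_back(self, core, addr):
        self.shared[addr] = self.l1[core][addr]
        self.dirty[core].discard(addr)

    def read(self, core, addr):
        if self.l1[core].get(addr, INVALID) == INVALID:
            for peer in self.l1:
                if addr in self.dirty[peer]:
                    self._write_back(peer, addr)  # requested by peer processor
            self.l1[core][addr] = self.shared[addr]
        return self.l1[core][addr]

    def write(self, core, addr, value):
        for peer in self.l1:
            if peer != core and addr in self.l1[peer]:
                if addr in self.dirty[peer]:
                    self._write_back(peer, addr)
                self.l1[peer][addr] = INVALID     # invalidate other copies
        self.l1[core][addr] = value
        self.dirty[core].add(addr)

    def evict(self, core, addr):
        if addr in self.dirty[core]:
            self._write_back(core, addr)          # evicted by owner
        self.l1[core].pop(addr, None)


system = CoherentSystem(["P1", "P2", "P3"], {"x": 5})
system.read("P1", "x")
system.read("P2", "x")
system.write("P1", "x", 6)
memory_before_write_back = system.shared["x"]  # shared memory is stale here
p2_copy_after_write = system.l1["P2"]["x"]
value_read_by_p2 = system.read("P2", "x")      # forces P1 to write back
memory_after_write_back = system.shared["x"]
```

The scenario mirrors the figures: after P1 writes x=6, shared memory still holds 5 until P2's read forces the write-back.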

Analysis of prior work: RWEEHC

What does RWEEHC do?

• RWEEHC (Restricting writes for energy-efficient hybrid cache) uses a hybrid last level cache (LLC) comprising SRAM and STTRAM
• Exploits cache coherency to predict write-intensive cache lines
• Migrates write-intensive cache lines from the STTRAM region to the SRAM region

Figure: a set in the hybrid LLC (TAG and DATA fields; 4-way SRAM, 12-way STTRAM) and main memory

Citation: S. Agarwal and H. K. Kapoor, "Restricting writes for energy-efficient hybrid cache in multi-core architectures," 2016 IFIP/IEEE International Conference on Very Large Scale Integration (VLSI-SoC), Tallinn, 2016, pp. 1-6

Analysis of prior work: RWEEHC

How does it work?

• Adds extra states (STT_STATE) to predict write-intensive cache blocks
• STT_STATES
  o P: Dataless entry into STTRAM region
  o ST-D: Possible candidate for migration to SRAM
  o SR-C: Block migrated to SRAM region
• Migration is done on
  o Writeback to a block in ST-D state in STTRAM region

Figure: each line in a set of the hybrid LLC (4-way SRAM, 12-way STTRAM) holds TAG, STT_STATE and DATA; main memory is shown below the LLC

Analysis of prior work: RWEEHC

Miss at line 'x'

Figure: L1 (TAG, DATA), a set in the hybrid LLC (TAG, STT_STATE, DATA; 4-way SRAM, 12-way STTRAM) and main memory holding block 'x'

Analysis of prior work: RWEEHC

Miss at line 'x': dataless entry

• Block 'x' is placed in the L1 of Core 0: TAG 10100110, DATA 101111001010
• A dataless entry for 'x' is made in the STTRAM region of the LLC: TAG 10100110, STT_STATE P

Analysis of prior work: RWEEHC

Writeback caused by a read from another core

• The L1 of Core 0 holds block 'x' with modified data: TAG 10100110, DATA 101111011111
• Core 1 issues Rd 'x', which triggers a writeback of 'x' to the LLC
• The LLC entry is still TAG 10100110, STT_STATE P

Transition to ST-D state on writeback in P state

Analysis of prior work: RWEEHC

Writeback caused by a dirty eviction

• Core 0 evicts dirty block 'x' (TAG 10100110, DATA 101111011111) from its L1, which triggers a writeback of 'x' to the LLC
• The LLC entry is still TAG 10100110, STT_STATE P

Transition to ST-D state on writeback in P state

Analysis of prior work: RWEEHC

Possible candidate for migration

• After the writeback, the LLC entry in the STTRAM region is TAG 10100110, STT_STATE ST-D, DATA 101111011111

Analysis of prior work: RWEEHC

Writeback to 'x' in ST-D state

• The L1 caches of Core 0 and Core 1 hold block 'x' with new data: TAG 10100110, DATA 101111011100
• A writeback to 'x' arrives while the LLC entry is TAG 10100110, STT_STATE ST-D, DATA 101111011111
• The writeback is paused and the block is migrated to the SRAM region

Analysis of prior work: RWEEHC

Block migrated to SRAM region

• After migration, the LLC entry for block 'x' is TAG 10100110, STT_STATE SR-C, DATA 101111011111

Analysis of prior work: RWEEHC

Resume the writeback

• The paused writeback resumes and updates the migrated block: TAG 10100110, STT_STATE SR-C, DATA 101111011100

SR-C is a stable state
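The STT_STATE transitions walked through on the preceding slides can be summarised as a small state machine. The Python sketch below is an illustration under stated assumptions: the class and function names are hypothetical, the entry tracks its region as a string, and the write counter exists only to show that the second writeback no longer lands in STTRAM.

```python
from enum import Enum


class SttState(Enum):
    P = "P"        # dataless entry in the STTRAM region
    ST_D = "ST-D"  # possible candidate for migration to SRAM
    SR_C = "SR-C"  # block migrated to the SRAM region


class LlcEntry:
    def __init__(self, tag):
        self.tag = tag
        self.state = SttState.P
        self.region = "STTRAM"
        self.data = None          # dataless on allocation
        self.sttram_writes = 0    # bookkeeping for the example only


def on_writeback(entry, data):
    """Apply one writeback from an L1 cache to an LLC entry."""
    if entry.state is SttState.P:
        # First writeback: data is stored in STTRAM, block becomes a candidate.
        entry.data = data
        entry.sttram_writes += 1
        entry.state = SttState.ST_D
    elif entry.state is SttState.ST_D:
        # Second writeback: pause, migrate to SRAM, then resume the writeback.
        entry.region = "SRAM"
        entry.state = SttState.SR_C
        entry.data = data
    else:
        # SR-C is stable: later writebacks update the SRAM copy.
        entry.data = data


entry = LlcEntry(0b10100110)
on_writeback(entry, 0b101111011111)   # P -> ST-D
state_after_first = entry.state
on_writeback(entry, 0b101111011100)   # ST-D -> SR-C, now in SRAM
```

With the bit patterns from the slides, the entry ends in SR-C inside the SRAM region holding 101111011100, after a single STTRAM write.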

Analysis of prior work: RWEEHC

• Hardware overhead
  o 2 bits to represent STT_STATE
  o 65536 cache lines in a 4MB hybrid LLC with 64B block size
  o 66B for swap/migration buffer
  o 2*65536 + 528 bits ~ 16kB additional space in LLC
  o Total hardware overhead = 16kB ~ 0.39% of LLC

Negligible hardware overhead

Figure: 4MB hybrid LLC (TAG, STT_STATE, DATA) with 16kB of space for STT_STATE and a 66B swap/migration buffer

Analysis of prior work: RWEEHC

• Performance overhead
  o Dataless entries cause a high number of writebacks to the LLC
  o The writeback buffer gets full more often
  o Hence the system stalls more often

Figure: on a miss at line 'x', the block comes from main memory and the shared memory holds no data for 'x'; P1's L1 holds x=6 and P2's L1 holds x=INV. An eviction or a peer Rd->x / Wr->x request triggers a write-back of 'x' (clean/dirty)

Analysis of prior work: RWEEHC

Figure: after the write-back of 'x' (clean/dirty), shared memory holds x=6

Performance affected due to stalling


Motivation

What does SLAM do?

• Uses a hybrid last level cache (LLC) comprising SRAM and STTRAM
• Uses the existing cache block state to track evictions of dirty blocks from L1
• Avoids writebacks to the STTRAM region of the LLC caused by evictions of dirty blocks from L1

Motivation

System configuration

CPU
  o x86, 2.66GHz, 4 cores, out-of-order execution
L1 cache
  o 32kB SRAM split I/D caches
  o 8-way, 64B block size
  o 4-cycle read and write latency
  o LRU replacement policy
  o write-invalidate, write-back
  o directory-based MESI
L2 cache / LLC
  o 4MB 16-way inclusive hybrid (1MB SRAM + 3MB STTRAM)
  o 4-way SRAM and 12-way STTRAM, 64B block size
  o 8-cycle SRAM read and write latency
  o 8-cycle STTRAM read latency
  o 32-cycle STTRAM write latency
  o LRU replacement policy
  o write-back cache
Simulator used
  o SNIPER v6.1 (multi-core, parallel, trace-driven, high-speed and accurate x86 simulator)
Benchmarks used
  o PARSEC-2.1 and SPLASH-2
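The cache geometry implied by this configuration can be derived with a few divisions, which also confirms that a 4-way/12-way split of a 16-way 4MB cache gives exactly 1MB of SRAM and 3MB of STTRAM. The variable names in this Python sketch are illustrative only.

```python
KB = 1024
MB = 1024 * KB

# LLC parameters from the system configuration.
LLC_BYTES = 4 * MB
BLOCK_BYTES = 64
LLC_WAYS = 16
SRAM_WAYS = 4
STTRAM_WAYS = 12

llc_lines = LLC_BYTES // BLOCK_BYTES               # total cache lines in the LLC
llc_sets = llc_lines // LLC_WAYS                   # sets of 16 ways each
sram_bytes = llc_sets * SRAM_WAYS * BLOCK_BYTES    # capacity of the SRAM ways
sttram_bytes = llc_sets * STTRAM_WAYS * BLOCK_BYTES

# L1 data cache parameters from the same table.
L1_BYTES = 32 * KB
L1_WAYS = 8
l1_sets = L1_BYTES // (BLOCK_BYTES * L1_WAYS)

print(f"LLC: {llc_lines} lines, {llc_sets} sets, "
      f"{sram_bytes // MB}MB SRAM + {sttram_bytes // MB}MB STTRAM")
print(f"L1: {l1_sets} sets of {L1_WAYS} ways")
```

The 65536-line figure is the same one used in the hardware overhead estimates for PTHCM and RWEEHC.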

Motivation

Figure: sources of writes to LLC (categories: coherency, core, prefetch)

Coherency writes constitute 60% of all the writes

Motivation

Figure: coherency writes, split into writebacks due to dirty eviction and writebacks due to request from another core

Writebacks due to eviction of dirty blocks constitute 88% of all coherency writes
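Combining the two percentages gives the share of all LLC writes that come from dirty evictions, which is the traffic SLAM targets. The Python sketch below is plain arithmetic and assumes both figures are averages taken over the same set of writes; the variable names are illustrative.

```python
# Shares reported on the motivation slides.
coherency_share_of_writes = 0.60          # coherency writes / all LLC writes
dirty_eviction_share_of_coherency = 0.88  # dirty-eviction writebacks / coherency writes

# Fraction of all LLC writes caused by dirty evictions from L1.
dirty_eviction_share_of_writes = (
    coherency_share_of_writes * dirty_eviction_share_of_coherency
)

print(f"dirty-eviction writebacks: {dirty_eviction_share_of_writes:.1%} of all LLC writes")
```

Under that assumption, slightly more than half of all LLC writes are writebacks of dirty blocks evicted from L1.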

Motivation

Can we avoid coherency writes to LLC?

• Writeback due to request from peer processor
  o Copy is requested by another core (priority writeback)
• Writeback due to dirty eviction
  o Copy is NOT requested by another core (NOT a priority writeback)


SLAM framework

Figure: cores C0 to C3, each with a private L1, above a 16-way set in the hybrid L2/LLC (TAG, DATA; 4-way SRAM region and 12-way STTRAM region)

A set in L1 cache (8-way)

Line index   State   LRU bits count
0            M       5
1            M       3
2            M       4
3            M       2
4            M       7
5            E       6
6            M       1
7            S       0

Line 4 is LRU

• Check if writeback in STTRAM region
• Search for clean block
• Drop clean block silently
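The three steps shown for the SLAM framework can be sketched as a victim-selection routine for one L1 set. This Python sketch is an interpretation, not the author's implementation: the function name is hypothetical, the search order (from least to most recently used) is an assumption, clean is taken to mean MESI states E or S, and falling back to the dirty LRU line when no clean block exists is also an assumption.

```python
CLEAN_STATES = {"E", "S"}


def choose_l1_victim(l1_set, writeback_targets_sttram):
    """Pick the L1 line to evict.

    l1_set: list of (state, lru_count) per way; a higher count means less recently used.
    writeback_targets_sttram: callable(index) -> True if that line's LLC copy
        sits in the STTRAM region.
    Returns (victim_index, needs_writeback).
    """
    by_age = sorted(range(len(l1_set)), key=lambda i: l1_set[i][1], reverse=True)
    lru = by_age[0]
    lru_is_dirty = l1_set[lru][0] == "M"

    # Normal eviction unless a dirty block would be written back into STTRAM.
    if not lru_is_dirty or not writeback_targets_sttram(lru):
        return lru, lru_is_dirty

    # Search the remaining ways for a clean block and drop it silently instead.
    for idx in by_age[1:]:
        if l1_set[idx][0] in CLEAN_STATES:
            return idx, False

    # No clean block found: evict the dirty LRU line after all.
    return lru, True


# The 8-way L1 set from the slide: (state, LRU count) for line index 0..7.
l1_set = [("M", 5), ("M", 3), ("M", 4), ("M", 2),
          ("M", 7), ("E", 6), ("M", 1), ("S", 0)]

victim_sttram = choose_l1_victim(l1_set, lambda idx: True)
victim_sram = choose_l1_victim(l1_set, lambda idx: False)
```

For the set on the slide, line 4 is the dirty LRU line; when its writeback would land in STTRAM, the sketch drops clean line 5 instead and no LLC write is issued.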

SLAM

• Hardware overhead
  o 32-bit buffer for each L1 to hold the address of the actual LRU dirty block selected for eviction from L1
  o Two 2-bit registers for each L1, each representing one cache block state from {M,E,S,I}
  o Total hardware overhead = 4*32 + 4*2*2 = 144 bits = 18B
  o Negligible compared to the 4MB LLC

Negligible hardware overhead

Figure: 4MB hybrid LLC (TAG, DATA) with, for each of the 4 L1 caches (x4), one 32-bit address buffer and two 2-bit registers

SLAM

• Performance overhead
  o Extra access to the 4MB LLC costs 8 cycles
  o 1 cycle to load cache block states from L1 into the buffer for comparison
  o 1 cycle for performing the comparison
  o Clean block is searched iteratively in the entire 8-way set of L1
  o Best case: clean block is found in the first iteration
    Number of cycles = 2 + 8 = 10 cycles
  o Worst case: clean block is found in the last iteration
    Number of cycles = 2*7 + 8 = 22 cycles
  o Each writeback to the STTRAM region needs 32 cycles (write latency of STTRAM)
  o Hence performance of the overall system is maintained

             Extra LLC access cycles   Extra execution cycles   Extra total cycles
Best case    8                         2                        10
Worst case   8                         14                       22
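The best-case and worst-case figures follow from one formula: two cycles per search iteration plus one extra LLC access. The Python sketch below reproduces the numbers; the constant and function names are illustrative.

```python
LLC_ACCESS_CYCLES = 8      # extra access to the 4MB LLC
LOAD_STATE_CYCLES = 1      # load cache block states from L1 into the buffer
COMPARE_CYCLES = 1         # perform the comparison
STTRAM_WRITE_CYCLES = 32   # cost of the writeback that is being avoided
L1_WAYS = 8


def extra_cycles(iterations):
    """Extra cycles when the clean block is found after `iterations` searches."""
    if not 1 <= iterations <= L1_WAYS - 1:
        raise ValueError("the search covers the 7 ways other than the LRU line")
    return iterations * (LOAD_STATE_CYCLES + COMPARE_CYCLES) + LLC_ACCESS_CYCLES


best_case = extra_cycles(1)
worst_case = extra_cycles(L1_WAYS - 1)

for k in range(1, L1_WAYS):
    print(f"{k} iteration(s): {extra_cycles(k)} extra cycles "
          f"vs {STTRAM_WRITE_CYCLES} for an STTRAM write")
```

Even the worst case stays below the 32-cycle STTRAM write latency, which is the basis for the claim that overall performance is maintained.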

Hardware overhead comparison

PTHCM (4MB hybrid LLC)
  o 12 bits for extra fields (WC, AC, WLC, ALC) per cache line
  o 163kB prediction table (TAG, WC, AC)
  o 68B swap/migration buffer

RWEEHC (4MB hybrid LLC)
  o 16kB space for STT_STATE
  o 66B swap/migration buffer

SLAM (4MB hybrid LLC)
  o 32-bit address buffer (x4)
  o Two 2-bit registers (x4)


| LLC architecture | Capacity | Area |
|---|---|---|
| SRAM LLC | 2MB | 44.7305 mm² |
| Hybrid LLC | 4MB (1MB SRAM + 3MB STTRAM) | 44.7305 mm² |
| STTRAM LLC | 8MB | 44.7305 mm² |

SLAM

47


SRAM-STTRAM partition in hybrid LLC

• Total energy is least for the 4-12 partition (4-way SRAM and 12-way STTRAM)

• The 4-12 combination is the best fit for the selected LLC on-chip area

• Results shown for only 4 workloads for brevity; conclusions are the same across other workloads
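As a quick consistency check, the 4-12 way split of the 16-way, 4MB hybrid LLC corresponds to the 1MB SRAM + 3MB STTRAM capacities quoted for the hybrid LLC. The sketch below assumes every way holds the same amount of data; the function name is illustrative.

```python
from typing import Tuple


def partition_capacity_mb(total_mb: float, total_ways: int, sram_ways: int) -> Tuple[float, float]:
    """Split LLC capacity between SRAM and STTRAM ways, assuming equal-sized ways."""
    per_way_mb = total_mb / total_ways
    sttram_ways = total_ways - sram_ways
    return sram_ways * per_way_mb, sttram_ways * per_way_mb


sram_mb, sttram_mb = partition_capacity_mb(total_mb=4, total_ways=16, sram_ways=4)
print(sram_mb, sttram_mb)  # 1.0 3.0
```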

SLAM

48


Introduction

Related work

Analysis of prior works

Motivation

Proposed SLAM framework

• Experimental setup

• Results

• Conclusion and future work

Overview

49


Benchmark selection

• PARSEC-2.1 and SPLASH-2

• Parallel and multi-threaded

• Diverse application domain

• Large usage and exchange of shared data

| Workload | Application domain | % coherency writes |
|---|---|---|
| swaptions | Financial analysis | 68% |
| freqmine | Data mining | 68% |
| fluidanimate | Animation | 30% |
| raytrace | Graphics | 32% |
| cholesky | Sparse matrix factorization kernel | 66% |
| barnes | N-body problem (3D) | 65% |
| fmm | N-body problem (2D) | 39% |
| lu.cont | Dense matrix factorization kernel | 89% |
| fft | Blocked matrix transpose kernel | 36% |
| ocean.cont | Large-scale ocean movements | 94% |
| radix | Integer radix sort kernel | 75% |

Experimental setup

50


Power and energy parameters

• Extracted from CACTI and NVSim for STTRAM

• Scaled for 45nm technology from various previous works

• Used to evaluate total LLC energy consumption

| Parameter | 2MB SRAM LLC | 8MB STTRAM LLC | 4MB Hybrid LLC (SRAM/STTRAM) |
|---|---|---|---|
| Read energy (nJ/access) | 0.3072 | 0.1484 | 0.3072/0.1484 |
| Write energy (nJ/access) | 0.3072 | 2.78 | 0.3072/2.78 |
| Static power (mW) | 3825 | 1040 | 2302.5 |
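The sketch below shows one way these parameters could be combined into a total LLC energy figure. The slides do not give the energy formula, so the model (dynamic energy from per-access costs plus static power times runtime) is an assumption, and all function names are illustrative. The hybrid static power of 2302.5 mW is consistent with scaling the SRAM and STTRAM values linearly with capacity (1MB of the 2MB SRAM figure plus 3MB of the 8MB STTRAM figure), which is also treated as an assumption here.

```python
# Per-access energies (nJ) and static power (mW) from the slide.
SRAM = {"read_nj": 0.3072, "write_nj": 0.3072, "static_mw": 3825.0, "size_mb": 2.0}
STTRAM = {"read_nj": 0.1484, "write_nj": 2.78, "static_mw": 1040.0, "size_mb": 8.0}


def hybrid_static_mw(sram_mb: float, sttram_mb: float) -> float:
    """Static power of a hybrid LLC, assuming linear scaling with capacity."""
    return (SRAM["static_mw"] * sram_mb / SRAM["size_mb"]
            + STTRAM["static_mw"] * sttram_mb / STTRAM["size_mb"])


def hybrid_llc_energy_mj(sram_reads: int, sram_writes: int,
                         sttram_reads: int, sttram_writes: int,
                         runtime_s: float) -> float:
    """Total energy in mJ of the 1MB SRAM + 3MB STTRAM hybrid LLC (assumed model)."""
    dynamic_nj = (sram_reads * SRAM["read_nj"] + sram_writes * SRAM["write_nj"]
                  + sttram_reads * STTRAM["read_nj"] + sttram_writes * STTRAM["write_nj"])
    static_mj = hybrid_static_mw(1.0, 3.0) * runtime_s  # mW * s = mJ
    return dynamic_nj * 1e-6 + static_mj                # 1 nJ = 1e-6 mJ


print(hybrid_static_mw(1.0, 3.0))  # 2302.5
```

Under this model each STTRAM write costs roughly nine times an SRAM write (2.78 nJ against 0.3072 nJ), so the dynamic energy of the hybrid LLC is dominated by how many writes reach the STTRAM region.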

Experimental setup

51


Simulator setup

• Simulator used: SNIPER v6.1 (multi-core, parallel, trace-driven, high-speed and accurate x86 simulator)

• Used two metrics for evaluation: total LLC energy and overall system performance

• Applications were run to completion on all four cores while exploiting cache coherence with detailed models of cores, caches and interconnection networks

• The number of LLC accesses was collected over the entire application runtime to evaluate

o Minimized writes to STTRAM

o Decreased total energy consumption of LLC

• Performance is measured in terms of IPC (Instructions Per Cycle)

• Comparison of SLAM's energy and performance with

o PTHCM [B. Quan, et al., IEEE ICCCT, 2012]

o RWEEHC [S. Agarwal, et al., IEEE VLSI-SoC, 2016]

Experimental setup

52


Introduction

Related work

Analysis of prior works

Motivation

Proposed SLAM framework

Experimental setup

• Results

• Conclusion and future work

Overview

53


Energy evaluation for SLAM

• The use of hybrid LLC architecture saved energy compared to SRAM-only and STTRAM-only LLC architectures

• Negligible use of external hardware led to significant energy savings compared to PTHCM and RWEEHC

| Comparison architecture | Average LLC energy savings |
|---|---|
| SRAM | 18.94% |
| STTRAM | 32.31% |
| PTHCM | 38.79% |
| RWEEHC | 8.97% |

Results

54


Performance evaluation for SLAM

• SLAM outperforms SRAM-only and STTRAM-only LLC architectures by avoiding writeback operations, thus avoiding saturation of the writeback buffer

• SLAM outperforms PTHCM and RWEEHC by eliminating migration/swapping between SRAM and STTRAM regions

| Comparison architecture | Average IPC improvement |
|---|---|
| SRAM | 4.631% |
| STTRAM | 0.607% |
| PTHCM | 6.863% |
| RWEEHC | 0.407% |

Results

55


Introduction

Related work

Analysis of prior works

Motivation

Proposed SLAM framework

Experimental setup

Results

• Conclusion and future work

Overview

56


Conclusion

• Designed a framework that

o Tracks writeback operations to LLC

o Avoids writeback operations to the STTRAM region of LLC due to dirty evictions from L1

• Performed a comprehensive energy and performance comparison under the same area constraint with

o Baseline SRAM based LLC architecture

o Baseline STTRAM based LLC architecture

o PTHCM based hybrid LLC architecture

o RWEEHC based hybrid LLC architecture

• Compared to SRAM, STTRAM, PTHCM and RWEEHC

o Achieved 18.94%, 32.31%, 38.79% and 8.97% total LLC energy savings respectively

o Achieved 4.631%, 0.607%, 6.863% and 0.407% improvement in performance respectively

57


Future work

There are several potential extensions to our work, for example, consideration of

• Three and higher level cache hierarchy where writeback operations to LLC may vary with levels of cache

• Exclusive LLC as it is populated only through writebacks due to eviction from L1

• Write-through LLC wherein writebacks due to conflict miss at L1 are part of non-idle CPU

• Lower nanometer technologies wherein writes to STTRAM are unstable because of smaller MTJ thickness

58


Thank you

59