
Presented by: Swapnil Bhosale

Advisor: Dr. Sudeep Pasricha

Committee members: Dr. Sourajeet Roy, Dr. Wim Bohm

SLAM: High performance and energy efficient shared hybrid last level cache architecture in multicore systems

1

• Introduction

• Related work

• Analysis of prior works

• Motivation

• Proposed SLAM framework

• Experimental setup

• Results

• Conclusion and future work

Overview

2

• Most of the modern computing systems are multicore with multi-level cache memory

• Last level cache (LLC) is generally shared between private caches

Introduction

Source: http://www.eedailynews.com/2012/02/freescale-claims-highest-performance.html

Freescale semiconductor’s multi-core processor (B4860)

Source: http://www.cse.wustl.edu/~jain/cse567-11/ftp/multcore/index.html

3

Introduction

Source: http://csillustrated.berkeley.edu/PDFs/handouts/cache-3-associativity-handout.pdf

4

Introduction

Source: http://csillustrated.berkeley.edu/PDFs/handouts/cache-3-associativity-handout.pdf

5

• Processor-memory performance gap continues to increase
• Traditional SRAM-based caches cannot cope with the increasing gap
• Need for alternate memory that can provide
o High capacity
o Less energy consumption
o Proximity to the processor

Introduction

Source: https://mzh.io/%E5%A6%82%E4%BD%95%E8%AE%A9Go%E7%A8%8B%E5%BA%8F%E6%9B%B4%E5%BF%AB

6

• Researchers proposed Spin Transfer Torque Random Access Memory (STTRAM)

• Attributes

o High density

o Low static power consumption

o Non-volatility

o Future scalability

o High endurance

o Small read latency

Introduction

Potential replacement for SRAM in cache memory hierarchy

Source: https://www.mram-info.com/stt-mram

7

• Basic storage element is MTJ (Magnetic Tunnel Junction)

• Data is stored as relative magnetic orientation of two ferromagnetic layers

Source: https://www.mram-info.com/stt-mram

Source: https://www.embedded.com/design/real-time-and-performance/4026000/The-future-of-scalable-STT-RAM-as-a-universal-embedded-memory

Introduction

8

Parameter | SRAM | STTRAM
Cell structure | (figure) | (figure)
Leakage power (for 1MB, 45nm tech node) | 14.63mW | 2.32mW (5-6x less than SRAM)
Area (for 1MB, 45nm tech node) | 3.77sqmm | 0.95sqmm (4-5x denser than SRAM)
Write latency | 3.18ns | 12.01ns (approx. 4x of SRAM)
Write energy | 0.08nJ | 0.64nJ (approx. 8x of SRAM)

Source: J. Ahn, S. Yoo and K. Choi, "DASCA: Dead Write Prediction Assisted STT-RAM Cache Architecture," 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA), Orlando, FL, 2014, pp. 25-36.
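The ratios quoted in parentheses can be recomputed from the table values. The sketch below is only a sanity check; the dictionary layout and the function name are illustrative choices, not part of the cited work.

```python
# Values for a 1MB array at the 45nm node, as quoted in the comparison above.
# The dictionary layout and key names are illustrative assumptions.
PARAMS = {
    "leakage_power_mW": {"SRAM": 14.63, "STTRAM": 2.32},
    "area_sqmm": {"SRAM": 3.77, "STTRAM": 0.95},
    "write_latency_ns": {"SRAM": 3.18, "STTRAM": 12.01},
    "write_energy_nJ": {"SRAM": 0.08, "STTRAM": 0.64},
}


def ratio(metric, numerator, denominator):
    """Return how many times larger the numerator technology is for a metric."""
    values = PARAMS[metric]
    return values[numerator] / values[denominator]


if __name__ == "__main__":
    # Metrics where STTRAM is better: SRAM value over STTRAM value.
    print(f"leakage SRAM/STTRAM: {ratio('leakage_power_mW', 'SRAM', 'STTRAM'):.2f}x")
    print(f"area SRAM/STTRAM: {ratio('area_sqmm', 'SRAM', 'STTRAM'):.2f}x")
    # Metrics where STTRAM is worse: STTRAM value over SRAM value.
    print(f"write latency STTRAM/SRAM: {ratio('write_latency_ns', 'STTRAM', 'SRAM'):.2f}x")
    print(f"write energy STTRAM/SRAM: {ratio('write_energy_nJ', 'STTRAM', 'SRAM'):.2f}x")
```

The write-side ratios are what motivate the rest of the talk: STTRAM saves leakage and area but pays for it on every write.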

Introduction

Need for techniques to overcome the drawbacks of STTRAM

9


10

• Some works focus on reducing write energy by tuning MTJ device properties
o “Relaxing non-volatility for fast and energy-efficient STT-RAM caches”, [C. Smullen, et al, IEEE HPCA, 2011]
o “Delivering on the promise of universal memory for spin-transfer torque RAM (STT-RAM)”, [C. Smullen, et al, IEEE ISLPED, 2011]

Related work

MTJ thickness ∝ MTJ write time

MTJ thickness ∝ MTJ retention time

Tough compromise between MTJ write time and MTJ retention time

11

• Other works focus on reducing write energy at cell level
• Basic idea is to update only the bits whose values differ from the stored ones
o “Energy reduction for STT-RAM using early write termination”, [P. Zhou, et al, IEEE ICCAD, 2009]
o “Coding last level STT-RAM cache for high endurance and low power”, [S. Yazdanshenas, et al, IEEE Computer Architecture Letters, 2013]

These works do not consider non-uniformity of writes across the cache
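The cell-level idea of writing only the differing bits can be sketched as below. This is a simplified software illustration, not the circuit-level mechanism of the cited papers; the function names and the 8-bit example patterns are assumptions made for this sketch.

```python
def bits_to_flip(stored, incoming):
    """Return the positions where the incoming pattern differs from the stored one."""
    if len(stored) != len(incoming):
        raise ValueError("patterns must have the same width")
    return [i for i, (old, new) in enumerate(zip(stored, incoming)) if old != new]


def differential_write(stored, incoming):
    """Update only the differing cells; return (new contents, number of cell writes)."""
    changed = bits_to_flip(stored, incoming)
    updated = list(stored)
    for i in changed:
        updated[i] = incoming[i]
    return updated, len(changed)


if __name__ == "__main__":
    stored = [1, 0, 1, 0, 1, 1, 1, 1]    # illustrative cache DATA field
    incoming = [1, 0, 1, 0, 1, 1, 0, 0]  # illustrative incoming bit pattern
    new_data, cell_writes = differential_write(stored, incoming)
    print(new_data, cell_writes)
```

In this example only 2 of the 8 cells are written, which is where the energy saving comes from.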

Related work

[Figure: a comparator checks the incoming bit pattern against the cache DATA field]

12

• There are some works that focus on reducing write energy using hybrid last level cache (LLC) architecture
• Basic idea is to migrate write intensive cache lines to SRAM region
o “Exploiting non-uniformity of write accesses for designing a high-endurance hybrid last level cache”, [P. Safayenikoo, et al, IEEE CCECE, 2017]
o “High-endurance and performance-efficient design of hybrid cache architectures through adaptive line replacement”, [A. Jadidi, et al, IEEE ISLPED, 2011]

Provides better energy savings by taking advantage of both SRAM and STTRAM

8-way set in hybrid cache: way-0 and way-1 SRAM; way-2 to way-7 STTRAM

Related work

13


14

Analysis of prior work PTHCM

TAG DATA

4-way

SRAM

12-way

STTRAM

A set in hybrid LLC

Main memory

• PTHCM (Prediction table hybrid cache management) uses a hybrid last level cache (LLC) comprising SRAM and STTRAM

Prediction table

• Use prediction table to predict write intensive cache lines

• Migrate write intensive cache lines in STTRAM region to SRAM region

Citation: Baixing Quan, Tiefei Zhang, Tianzhou Chen and Jianzhong Wu, "Prediction table based management policy for STT-RAM and SRAM hybrid cache," 2012 7th International Conference on Computing and Convergence Technology (ICCCT), Seoul, 2012, pp. 1092-1097

What does

PTHCM do?

15


Main memory

TAG WC AC WLC ALC

Prediction table

• Counters to keep access history of each cache line
o AC – actual access count of a cache line (read/write)
o WC – actual write count of a cache line
o ALC – prediction of access count of a cache line
o WLC – prediction of write count of a cache line
• Migration will happen on
o Miss
o Write hit
• Prediction table is populated on
o Eviction

TAG WC AC

Citation: Baixing Quan, Tiefei Zhang, Tianzhou Chen and Jianzhong Wu, "Prediction table based

management policy for STT-RAM and SRAM hybrid cache," 2012 7th International Conference on

Computing and Convergence Technology (ICCCT), Seoul, 2012, pp. 1092-1097

DATA

4-way

SRAM

12-way

STTRAM

A set in hybrid LLC

How does it work?

16


M

A

I

N

M

E

M

O

R

Y

x.TAG x.WC x.AC

TAG WC=0 AC=0 WLC ALC DATA

Line with minimum

WLC is replaced

Miss at line ‘x’

LRU line evicted




Prediction table

TAG WC AC

x.TAG x.WC x.AC

Insert in SRAM to avoid write operation to

STTRAM

If entry not found, WLC and ALC are initialized to user set thresholds

DATA

Line ‘y’

Line ‘z’

4-way

SRAM

12-way

STTRAM

A set in hybrid LLC


17


Main memory

TAG WC AC WLC ALC

Write hit

TAG WC++ AC++ WLC - - ALC - -

WC >

threshold

Line with minimum

WLC is replaced




swap

Prediction table

TAG WC AC




DATA

Line ‘x’

Line ‘y’

4-way

SRAM

12-way

STTRAM

A set in hybrid LLC

18


y.TAG y.WC y.AC

TAG WC AC WLC ALC

Main memory

Eviction




Prediction table

TAG WC AC

y.TAG y.WC y.AC

If entry not found, make new entry in empty slot

If no empty slot, delete entry with minimum AC
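The prediction-table bookkeeping described on these slides can be sketched as follows. This is a minimal software model, not the PTHCM hardware: the class and method names are hypothetical, overwriting an existing entry on eviction is an assumption, and ties on minimum AC are broken by insertion order only because that is what Python's `min` does.

```python
class PredictionTable:
    """Toy model of a TAG/WC/AC prediction table with a fixed number of slots."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = {}  # tag -> (WC, AC)

    def record_eviction(self, tag, wc, ac):
        """On eviction from the LLC, store the line's actual write and access counts."""
        if tag not in self.entries and len(self.entries) >= self.capacity:
            # No empty slot: delete the entry with minimum AC.
            victim = min(self.entries, key=lambda t: self.entries[t][1])
            del self.entries[victim]
        self.entries[tag] = (wc, ac)  # assumption: an existing entry is overwritten

    def lookup(self, tag, default_wlc, default_alc):
        """On a miss, return (WLC, ALC): stored history if present, else user-set thresholds."""
        return self.entries.get(tag, (default_wlc, default_alc))


if __name__ == "__main__":
    table = PredictionTable(capacity=2)
    table.record_eviction("a", wc=1, ac=5)
    table.record_eviction("b", wc=2, ac=3)
    table.record_eviction("c", wc=4, ac=9)  # table full: "b" has minimum AC and is removed
    print(table.lookup("c", default_wlc=7, default_alc=7))
    print(table.lookup("b", default_wlc=7, default_alc=7))
```

Seeding WLC and ALC from the stored WC and AC on a later miss is how the slides' figure reads; treat that mapping as an interpretation.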




DATA

Line ‘x’

Line ‘y’

4-way

SRAM

12-way

STTRAM

A set in hybrid LLC

19

• Hardware overhead

o 3 bits to represent each of WC, AC, WLC and ALC

o 12 bits extra added to every cache line in LLC

o 65536 cache lines in 4MB hybrid LLC and 64B blocksize

o 12*65536 bits ~ 98kB additional space in LLC

o Considering 14 bits to represent TAG

o Each entry in prediction table is 20 bits in size

o 65536 entries in prediction table

o Size of prediction table is 20*65536 bits ~ 163kB

o Size of swap/migration buffer ~ 68B

o Total hardware overhead = 262kB ~ 6.39% of LLC
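The overhead arithmetic above can be reproduced with a short script. The constant and function names are illustrative. Note that with a 4MB LLC taken as 4*1024*1024 bytes the ratio works out to roughly 6.25%; the 6.39% on the slide appears to correspond to 262/4096, so the exact percentage depends on the kB convention used.

```python
LLC_BYTES = 4 * 1024 * 1024  # 4MB hybrid LLC
BLOCK_BYTES = 64             # 64B blocksize
COUNTER_BITS = 3             # each of WC, AC, WLC and ALC
TAG_BITS = 14                # TAG width assumed on the slide
SWAP_BUFFER_BYTES = 68       # swap/migration buffer


def pthcm_overhead():
    """Return the PTHCM storage overhead components in bytes."""
    lines = LLC_BYTES // BLOCK_BYTES                 # number of cache lines
    line_fields = 4 * COUNTER_BITS * lines // 8      # 12 extra bits per cache line
    entry_bits = TAG_BITS + 2 * COUNTER_BITS         # TAG + WC + AC per table entry
    table = entry_bits * lines // 8                  # one table entry per cache line
    total = line_fields + table + SWAP_BUFFER_BYTES
    return {"lines": lines, "line_fields": line_fields, "table": table, "total": total}


if __name__ == "__main__":
    result = pthcm_overhead()
    print(result)
    print(f"share of LLC: {100 * result['total'] / LLC_BYTES:.2f}%")
```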

WC AC WLC ALC

TAG DATA

Prediction table

TAG WC AC


163kB prediction table

Notable hardware overhead

20

TAG DATA

68B swap/migration buffer

4MB hybrid LLC

12 bits of extra fields

per cache line

P1 P3

L1:

x=5L1:

Shared mem:

x=5

P2

L1:

x=5

Background: Cache coherence

P1 P3

L1:

x=6L1:

Shared mem:

x=5

P2

L1:

x=INV

eviction

Write-back

Rd->xWr->x

• Uniformity of shared resource data
• Achieved by writing back modified data to shared memory when
o Evicted by owner
o Requested by peer processor

Coherent view of memory Non-coherent view of memory

21

P1 P3

L1:

x=5L1:

Shared mem:

x=5

P2

L1:

x=5

Background: Cache coherence

P1 P3

L1:

x=6L1:

Shared mem:

x=6

P2

L1:

x=INV


Coherent view of memory Coherent view of memory

22

Main memory

Analysis of prior work RWEEHC

Citation: S. Agarwal and H. K. Kapoor, "Restricting writes for energy-efficient hybrid cache in multi-core architectures," 2016 IFIP/IEEE International Conference on Very Large Scale Integration (VLSI-SoC), Tallinn, 2016, pp. 1-6

TAG DATA

4-way

SRAM

12-way

STTRAM

A set in hybrid LLC

• RWEEHC (Restricting writes to energy efficient hybrid cache) uses a hybrid last level cache (LLC) comprising SRAM and STTRAM

• Exploit cache coherency to predict write intensive cache lines

• Migrate write intensive cache lines in STTRAM region to SRAM region

What does

RWEEHC do?

23

Analysis of prior work RWEEHC

• Adds extra states (STT_STATE) to predict write intensive cache block
• STT_STATES
o P: Dataless entry into STTRAM region
o ST-D: Possible candidate for migration to SRAM
o SR-C: Block migrated to SRAM region
• Migration is done on
o Writeback to a block in ST-D state in STTRAM region
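The STT_STATE transitions listed above can be written as a small state machine. This is a sketch of the behaviour as presented in these slides (P to ST-D on the first writeback, ST-D to SR-C with migration on the next, SR-C stable); the enum and function names are hypothetical, and other coherence events are not modelled.

```python
from enum import Enum


class SttState(Enum):
    P = "P"        # dataless entry in the STTRAM region
    ST_D = "ST-D"  # possible candidate for migration to SRAM
    SR_C = "SR-C"  # block migrated to the SRAM region


def on_writeback(state):
    """Return (next_state, migrate_to_sram) when a writeback arrives for a block."""
    if state is SttState.P:
        return SttState.ST_D, False   # first writeback fills the dataless entry
    if state is SttState.ST_D:
        return SttState.SR_C, True    # second writeback triggers migration
    return SttState.SR_C, False       # SR-C is stable: block already in SRAM


if __name__ == "__main__":
    state = SttState.P
    for _ in range(3):
        state, migrate = on_writeback(state)
        print(state.value, migrate)
```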

Citation: S. Agarwal and H. K. Kapoor, "Restricting writes for energy-efficient hybrid cache in multi-

core architectures," 2016 IFIP/IEEE International Conference on Very Large Scale Integration (VLSI-

SoC), Tallinn, 2016, pp. 1-6

Main memory

TAG STT_STATE DATA

4-way

SRAM

12-way

STTRAM

A set in hybrid LLC

How does it work?

24

Block ‘x’

TAG DATA

L1










TAG STT_STATE DATA

4-way

SRAM

12-way

STTRAM

A set in hybrid LLC

Main memory

25

Block ‘x’

TAG DATA

10100110 101111001010

Core0 - L1


Dataless

entry

x









TAG STT_STATE DATA

10100110 P

4-way

SRAM

12-way

STTRAM

A set in hybrid LLC

x

Main memory

26

Block ‘x’

TAG DATA

10100110 101111011111

Core 0 - L1

x

TAG DATA

Core 1 - L1


Rd ‘x’

Writeback




Transition to ST-D state on writeback in P state





TAG STT_STATE DATA

10100110 P

4-way

SRAM

12-way

STTRAM

A set in hybrid LLC

x

Main memory

27

Block ‘x’

TAG DATA

10100110 101111011111

Core 0 - L1

x

TAG DATA

Core 1 - L1

Dirty

eviction

Analysis of prior work RWEEHC

Transition to ST-D state on writeback in P state

Writeback




SRAM region







TAG STT_STATE DATA

10100110 P

4-way

SRAM

12-way

STTRAM

A set in hybrid LLC

x

Main memory

28

Block ‘x’

TAG DATA

10100110 101111011111

Core 0 - L1

x

TAG DATA

Core 1 - L1





SRAM region



Possible candidate for migration




TAG STT_STATE DATA

10100110 ST-D 101111011111

4-way

SRAM

12-way

STTRAM

A set in hybrid LLC

x

Main memory

29

Block ‘x’

TAG DATA

10100110 101111011100

Core 0 - L1

x

TAG DATA

10100110 101111011100

Core 1 - L1

PAUSE

Migrate to

SRAM

region


x





Writeback

to ‘x’




TAG STT_STATE DATA

10100110 ST-D 101111011111

4-way

SRAM

12-way

STTRAM

A set in hybrid LLC

x

Main memory

30

Block ‘x’


TAG DATA

10100110 101111011111

Core 0 - L1

x

TAG DATA

10100110 101111011111

Core 1 - L1

x




SRAM region







TAG STT_STATE DATA

10100110 SR-C 101111011111

4-way

SRAM

12-way

STTRAM

A set in hybrid LLC

x

Main memory

31


TAG DATA

10100110 101111011111

Core 0 - L1

x

TAG DATA

10100110 101111011111

Core 1 - L1

x

Resume

the

Writeback





SR-C is stable state





TAG STT_STATE DATA

10100110 SR-C 101111011100

4-way

SRAM

12-way

STTRAM

A set in hybrid LLC

x

Block ‘x’

Main memory

32

Analysis of prior work RWEEHC

TAG STT_STATE DATA

• Hardware overhead
o 2 bits to represent STT_STATE
o 65536 cache lines in 4MB hybrid LLC and 64B blocksize
o 66B for swap/migration buffer
o 2*65536 + 528 bits ~ 16kB additional space in LLC
o Total hardware overhead = 16kB ~ 0.39% of LLC

Negligible hardware overhead

33

TAG STT_STATE DATA

66B swap/migration buffer

4MB hybrid LLC

16kB space for STT_STATE


• Performance overhead
o Dataless entries cause frequent writebacks to LLC
o Writeback buffer gets full more often
o Hence system stalls more often

P1 P3

L1:

x=6 L1:

Shared mem:

P2

L1:

x=INV

Eviction

Write-back ‘x’

(clean/dirty)

Rd->xWr->x

Main memory

On miss at line ‘x’

34




Performance affected due to stalling

P1 P3

L1:

x=6 L1:

Shared mem:

x=6

P2

L1:

x=INV

Eviction

Write-back ‘x’

(clean/dirty)

Rd->xWr->x

Main memory

On miss at line ‘x’

35


36

• Use hybrid last level cache (LLC) comprised of SRAM and STTRAM

• Use existing cache block state to track eviction of dirty block from L1

• Avoid writebacks to STTRAM region of LLC due to eviction of dirty block from L1

Motivation

What does

SLAM do?

37

System configuration

Component | Configuration
CPU | x86, 2.66GHz, 4 cores, out-of-order execution
L1 cache | 32kB SRAM split I/D caches; 8-way, 64B blocksize; 4-cycle read and write latency; LRU replacement policy; write-invalidate, write-back; directory-based MESI
L2 cache/LLC | 4MB 16-way inclusive hybrid (1MB SRAM + 3MB STTRAM); 4-way SRAM and 12-way STTRAM, 64B blocksize; 8-cycle SRAM read and write latency; 8-cycle STTRAM read latency; 32-cycle STTRAM write latency; LRU replacement policy; write-back cache
Simulator used | SNIPER v6.1 (multi-core, parallel, trace-driven, high-speed and accurate x86 simulator)
Benchmarks used | PARSEC-2.1 and SPLASH-2

Motivation

38

Sources of writes to LLC

Coherency writes constitute 60% of all the writes

Motivation

39

coherency core prefetch

Motivation

Writebacks due to eviction of dirty block constitute 88% of all coherency writes

40

Writebacks due to dirty eviction

Writebacks due to request from another

core

Coherency writes

Can we avoid coherency

writes to LLC?

Copy is requested by

another core

(priority writeback)

Copy is NOT requested

by another core

(NOT a priority writeback)

Writeback due to dirty eviction

Motivation

Writeback due to request from peer processor

41


42

C0

L1 L1

43

C3

A set in L1 cache (8-way)

Line index | State | LRU bits count
0 | M | 5
1 | M | 3
2 | M | 4
3 | M | 2
4 | M | 7
5 | E | 6
6 | M | 1
7 | S | 0

4-way SRAM region

12-way STTRAM region

TAG DATA

SLAM framework

Line 4 is LRU

Check if writeback in STTRAM region

Search for clean block

Drop clean block silently
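The steps above can be sketched as a victim-selection routine. Several details here are assumptions rather than statements from the slides: a larger LRU count is taken to mean less recently used (consistent with line 4 being LRU in the figure), the clean block chosen is the least recently used one in state E or S, and when no clean block exists the routine falls back to evicting the dirty LRU line. All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class L1Line:
    index: int
    state: str      # one of "M", "E", "S", "I"
    lru_count: int  # assumption: larger value means less recently used


def choose_victim(l1_set, writeback_hits_sttram):
    """Return (victim line index, writeback_needed) for one L1 set."""
    lru = max(l1_set, key=lambda line: line.lru_count)
    if lru.state != "M":
        return lru.index, False  # clean LRU line is dropped without a writeback
    if not writeback_hits_sttram:
        return lru.index, True   # writeback lands in the SRAM region: proceed
    # Dirty LRU line would be written to the STTRAM region: search for a clean block.
    clean = [line for line in l1_set if line.state in ("E", "S")]
    if clean:
        alternative = max(clean, key=lambda line: line.lru_count)
        return alternative.index, False  # drop the clean block silently
    return lru.index, True  # assumption: no clean block, so the writeback goes ahead


if __name__ == "__main__":
    states = ["M", "M", "M", "M", "M", "E", "M", "S"]
    counts = [5, 3, 4, 2, 7, 6, 1, 0]
    example = [L1Line(i, s, c) for i, (s, c) in enumerate(zip(states, counts))]
    print(choose_victim(example, writeback_hits_sttram=True))
    print(choose_victim(example, writeback_hits_sttram=False))
```

With the set shown in the figure, the dirty LRU line 4 is spared and the clean line 5 is dropped when the writeback would land in STTRAM.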

16-way set in hybrid L2/LLC

SLAM

SLAM

• Hardware overhead
o 32-bit buffer for each L1 to hold address of actual LRU dirty block selected for eviction from L1
o Two 2-bit registers for each L1 to represent one cache block state from {M,E,S,I}
o Total hardware overhead = 4*32 + 4*2*2 = 144 bits = 18B
o Negligible compared to 4MB LLC
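The 18B figure follows directly from the per-L1 structures listed above. A short check, with illustrative names and the 4-core configuration used in this work:

```python
NUM_L1_CACHES = 4        # one L1 per core in the 4-core configuration
ADDRESS_BUFFER_BITS = 32 # holds the address of the LRU dirty block
STATE_REGISTERS = 2      # two registers per L1
STATE_REGISTER_BITS = 2  # 2 bits encode one of {M, E, S, I}


def slam_overhead_bits():
    """Return the total extra storage SLAM adds, in bits."""
    per_l1 = ADDRESS_BUFFER_BITS + STATE_REGISTERS * STATE_REGISTER_BITS
    return NUM_L1_CACHES * per_l1


if __name__ == "__main__":
    bits = slam_overhead_bits()
    print(bits, "bits =", bits // 8, "bytes")
```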

Negligible hardware overhead

44

32-bit address buffer

TAG DATA

4MB hybrid LLC

2-bit register2-bit register

x4

SLAM

45


o Extra access to 4MB LLC costs 8 cycles
o 1 cycle to load cache block states from L1 into buffer for comparison
o 1 cycle for performing comparison
o Clean block is searched iteratively in entire 8-way set of L1

Case | Extra LLC access cycles | Extra execution cycles | Extra total cycles
Best case | 8 | 2 | 10
Worst case | 8 | 14 | 22

o Best case: clean block is found in first iteration; number of cycles = 2 + 8 = 10 cycles
o Worst case: clean block is found in last iteration; number of cycles = 2*7 + 8 = 22 cycles
o Each writeback to STTRAM region needs 32 cycles (write latency of STTRAM)
o Hence performance of overall system is maintained
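The cycle counts above can be checked with a small calculation. The function name is illustrative, and reading the worst case as 7 iterations (the other lines of the 8-way set besides the LRU line) is an interpretation of the slide's 2*7 term.

```python
LLC_ACCESS_CYCLES = 8      # extra access to the 4MB LLC
CYCLES_PER_ITERATION = 2   # 1 cycle to load block states + 1 cycle to compare
STTRAM_WRITE_CYCLES = 32   # cost of one writeback to the STTRAM region


def extra_cycles(iterations):
    """Extra cycles SLAM spends when the clean block is found after `iterations` steps."""
    return CYCLES_PER_ITERATION * iterations + LLC_ACCESS_CYCLES


if __name__ == "__main__":
    for iterations in (1, 7):
        cost = extra_cycles(iterations)
        saved = STTRAM_WRITE_CYCLES - cost
        print(f"{iterations} iteration(s): {cost} extra cycles, {saved} fewer than an STTRAM write")
```

Even the worst case stays below the 32-cycle STTRAM write it replaces, which is the basis for the claim that overall performance is maintained.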

WC AC WLC ALC

TAG STT_STATE DATA

PTHCM RWEEHC SLAM

TAG DATA

Hardware overhead comparison

46

TAG DATA

68B swap/migration buffer 66B swap/migration buffer

TAG WC AC

TAG STT_STATE DATA

163kB prediction table

12 bits for extra fields

per cache line

4MB hybrid LLC 4MB hybrid LLC

16kB space for STT_STATE

32-bit address buffer

2-bit register

x4

x4

TAG DATA

4MB hybrid LLC

2-bit register

LLC architecture | Capacity | Area
SRAM LLC | 2MB | 44.7305 mm²
Hybrid LLC | 4MB (1MB SRAM + 3MB STTRAM) | 44.7305 mm²
STTRAM LLC | 8MB | 44.7305 mm²

SLAM

47

SRAM-STTRAM partition in hybrid LLC

• Total energy is least for the 4-12 partition (4-way SRAM and 12-way STTRAM)
• 4-12 combination is the best fit for the selected LLC on-chip area
• Results shown for only 4 workloads for brevity; conclusions are the same across other workloads

SLAM

48


49

Benchmark selection

• PARSEC-2.1 and SPLASH-2

• Parallel and multi-threaded

• Diverse application domain

• Large usage and exchange of shared data

Workload | Application domain | % coherency writes
swaptions | Financial analysis | 68%
freqmine | Data mining | 68%
fluidanimate | Animation | 30%
raytrace | Graphics | 32%
cholesky | Sparse matrix factorization kernel | 66%
barnes | N-body problem (3D) | 65%
fmm | N-body problem (2D) | 39%
lu.cont | Dense matrix factorization kernel | 89%
fft | Blocked matrix transpose kernel | 36%
ocean.cont | Large-scale ocean movements | 94%
radix | Integer radix sort kernel | 75%

Experimental setup

50

Power and energy parameters

• Extracted from CACTI and NVSim for STTRAM
• Scaled for 45nm technology from various previous works
• Used to evaluate total LLC energy consumption

Parameter | 2MB SRAM LLC | 8MB STTRAM LLC | 4MB Hybrid LLC (SRAM/STTRAM)
Read energy (nJ/access) | 0.3072 | 0.1484 | 0.3072/0.1484
Write energy (nJ/access) | 0.3072 | 2.78 | 0.3072/2.78
Static power (mW) | 3825 | 1040 | 2302.5
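The hybrid static power value is numerically consistent with scaling each baseline linearly by capacity (1MB of SRAM plus 3MB of STTRAM). The slides do not state that this is how the value was derived, so the sketch below is an observation under that assumption; the function name is illustrative.

```python
def scaled_static_power(reference_mw, reference_mb, target_mb):
    """Scale static power linearly with capacity (an assumption, not a stated method)."""
    return reference_mw * target_mb / reference_mb


if __name__ == "__main__":
    sram_part = scaled_static_power(3825, reference_mb=2, target_mb=1)    # 1MB SRAM region
    sttram_part = scaled_static_power(1040, reference_mb=8, target_mb=3)  # 3MB STTRAM region
    print(sram_part, sttram_part, sram_part + sttram_part)
```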

Experimental setup

51

Experimental setup

52

• Applications were run to completion on all four cores while exploiting cache coherence with detailed

models of cores, caches and interconnection networks

• Number of LLC accesses were collected for entire application runtime to evaluate-

o Minimized writes to STTRAM

o Decreased total energy consumption of LLC

• Performance is measured in terms of IPC (Instructions Per Cycle)

• Comparison of SLAM’s energy and performance with

o PTHCM [B. Quan, et al, IEEE ICCCT, 2012]

o RWEEHC [S. Agarwal, et al, IEEE VLSI-SoC, 2016]

Simulator setup

• Simulator used – SNIPER v6.1 (multi-core, parallel, trace-driven, high-speed and accurate x86 simulator)

• Used two metrics for evaluation- Total LLC energy and overall system performance


53

Energy evaluation for SLAM

• The use of hybrid LLC architecture saved energy compared to SRAM-only and STTRAM-only LLC architectures

• Negligible use of external hardware led to significant energy savings compared to PTHCM and RWEEHC

Comparison architecture | Average LLC energy savings
SRAM | 18.94%
STTRAM | 32.31%
PTHCM | 38.79%
RWEEHC | 8.97%

Results

54

Performance evaluation for SLAM

• SLAM outperforms SRAM-only and STTRAM-only LLC architectures by avoiding writeback operations, thus avoiding saturating the writeback buffer

• SLAM outperforms PTHCM and RWEEHC by eliminating migration/swapping between SRAM and STTRAM regions

Comparison architecture | Average IPC improvement
SRAM | 4.631%
STTRAM | 0.607%
PTHCM | 6.863%
RWEEHC | 0.407%

Results

55


56

Conclusion

• Designed a framework that
o Tracks writeback operations to LLC
o Avoids writeback operations to STTRAM region of LLC due to dirty eviction from L1
• Performed a comprehensive energy and performance comparison under the same area constraint with

o Baseline SRAM based LLC architecture

o Baseline STTRAM based LLC architecture

o PTHCM based hybrid LLC architecture

o RWEEHC based hybrid LLC architecture

• Compared to SRAM, STTRAM, PTHCM and RWEEHC

o Achieved 18.94%, 32.31%, 38.79% and 8.97% total LLC energy savings respectively

o Achieved 4.631%, 0.607%, 6.863% and 0.407% improvement in performance respectively

57

Future work

There are several potential extensions to our work, for example, consideration of

• Three and higher level cache hierarchy where writeback operations to LLC may vary with levels of cache
• Exclusive LLC as it is populated only through writebacks due to eviction from L1
• Write-through LLC wherein writebacks due to conflict miss at L1 are part of non-idle CPU
• Lower nanometer technologies wherein writes to STTRAM are unstable because of smaller MTJ thickness

58

Thank you

59
