Amoeba-Cache Adaptive Blocks for Eliminating Waste in the Memory Hierarchy

Snehasish Kumar, Arrvindh Shriraman, Eric Matthews, Lesley Shannon

Hongzhou Zhao, Sandhya Dwarkadas

Amoeba Cache : Adaptive blocks for Eliminating Waste in the Memory Hierarchy

Fixed granularity cache organisation

Tag Array Data Array

Cache data utilization

Legend: Tags, Data, Untouched Data

Tag Array Data Array

Utilization = Fraction of words touched in cache block at the time of eviction
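The definition above can be sketched in C. This is only an illustration: it assumes 8-byte words (so a 64B block holds 8 words) and a hypothetical per-block bitmap with one "touched" bit per word; the function names are not from the slides.

```c
#include <stdint.h>

/* Count the set bits in an 8-bit touched-word bitmap (one bit per word). */
static int count_touched_words(uint8_t touched)
{
    int count = 0;
    while (touched) {
        count += touched & 1u;
        touched >>= 1;
    }
    return count;
}

/* Utilization of one block at eviction time:
   words touched / words in the block. */
static double block_utilization(uint8_t touched, int words_per_block)
{
    return (double)count_touched_words(touched) / (double)words_per_block;
}
```

For example, a block evicted with 4 of its 8 words touched has a utilization of 0.5.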

Cache utilization

[Figure: cache utilization (y-axis, 0% to 100%) per workload for a 64K L1, 4 ways, 64B/block. Workloads: apache, canneal, eclipse, firefox, h2, jbb, lbm, mcf, tpcc, x264.]

Block Distribution (64K, 64B/block)

Share of blocks by # words touched:

Workload   1-2    3-4    5-6    7-8
Apache     55%    13%    6%     26%
Eclipse    18%    5%     4%     73%
Firefox    40%    26%    9%     25%
Canneal    75%    14%    6%     5%

Block Distribution: Canneal, 64K vs. 1M (64B/block)

Share of blocks by # words touched:

Cache   1-2    3-4    5-6    7-8
64K     75%    14%    6%     5%
1M      58%    20%    12%    10%

Factors affecting cache utilization

• Application-specific behaviour: inefficient data structure access patterns
• Interaction with cache geometry: way conflicts reduce block lifetime and cause poor utilization

Application Specific Behaviour

struct TIE {
    long long X, Y, Z;
    long long V, H;
    long long data[3];
} Imperial[1024];

Data array layout: X Y Z V H Data[3]

Access in a loop:

for (int i = 0; i < 1024; i++) {
    Imperial[i].X = …;
    Imperial[i].Y = …;
    Imperial[i].Z = …;
    Imperial[i].V = …;
}
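A rough sketch of why this loop wastes cache space. It assumes `long long` is 8 bytes, blocks are 64B, and each `struct TIE` starts on a block boundary. The helpers `word_in_block` and `words_touched_by_loop` are illustrative names, not part of the original slides.

```c
#include <stddef.h>

struct TIE {
    long long X, Y, Z;
    long long V, H;
    long long data[3];
};

enum { BLOCK_BYTES = 64, WORD_BYTES = 8 };  /* assumed block and word sizes */

/* Word index (0..7) of a field inside its 64-byte block, assuming the
   struct starts on a block boundary. */
static unsigned word_in_block(size_t field_offset)
{
    return (unsigned)((field_offset % BLOCK_BYTES) / WORD_BYTES);
}

/* Bitmap of the words the loop writes: X, Y, Z and V (bit i = word i). */
static unsigned words_touched_by_loop(void)
{
    unsigned mask = 0;
    mask |= 1u << word_in_block(offsetof(struct TIE, X));
    mask |= 1u << word_in_block(offsetof(struct TIE, Y));
    mask |= 1u << word_in_block(offsetof(struct TIE, Z));
    mask |= 1u << word_in_block(offsetof(struct TIE, V));
    return mask;
}
```

Under these assumptions each element fills exactly one 64B block and the loop touches 4 of its 8 words, so half of every fetched block goes unused.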

Cache Geometry

Data Array – 4 ways

Problem: lots of data map to the same set

[Figure: blocks 1 to 5 mapping to one set of the 4-way data array]
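The conflict can be illustrated with a small sketch. It assumes the 64K, 4-way, 64B/block configuration quoted earlier and plain modulo set indexing; the slides do not state the index function, so `set_index` here is an assumption.

```c
#include <stdint.h>

enum {
    CACHE_BYTES = 64 * 1024,
    WAYS        = 4,
    BLOCK_BYTES = 64,
    NUM_SETS    = CACHE_BYTES / (WAYS * BLOCK_BYTES)  /* 256 sets */
};

/* Assumed index function: block number modulo the number of sets. */
static unsigned set_index(uint64_t addr)
{
    return (unsigned)((addr / BLOCK_BYTES) % NUM_SETS);
}

/* Count how many of the n addresses fall in the same set as addrs[0]. */
static int blocks_in_same_set(const uint64_t *addrs, int n)
{
    int count = 0;
    for (int i = 0; i < n; i++)
        if (set_index(addrs[i]) == set_index(addrs[0]))
            count++;
    return count;
}
```

With these parameters, addresses that are 16 KB apart (NUM_SETS x BLOCK_BYTES) share a set, so five such blocks compete for four ways and one is evicted early.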

Implications

1. Shrinks effective cache space
2. Increases miss rate
3. Wastes on-chip bandwidth
4. Increases on-chip cache energy consumption

Target Metrics

Amoeba Cache targets three metrics:

• Miss Rate
• Space Utilisation
• Bandwidth

Variable Granularity Blocks

Tag Array Data Array

How to support a variable # of blocks per set?

How to support variable granularity for each block?

Our Approach : Amoeba Cache

Unified SRAM Array

Amoeba Cache

• Insert
• Lookup
• Partial Miss
• Overheads

SRAM Array

Each block stored in the SRAM array is a tag followed by a data block:

Tag (1 word): Region Tag, Start, End
Data Block: 1+ words

Bitmaps: Valid? and Tag? bits, all initially 0
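One way to picture this organisation in C. This is a sketch under assumptions: 8 words per set (as in the later examples), one Valid? and one Tag? bit per word, and unpacked tag fields for readability. Real hardware would pack the tag into a single word. All type and function names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

enum { WORDS_PER_SET = 8 };  /* matches the "8 words/set" examples */

/* Tag that precedes each data block (unpacked here for clarity). */
struct amoeba_tag {
    uint64_t region_tag;
    uint8_t  start;  /* first word of the region held by the block */
    uint8_t  end;    /* one past the last word held by the block */
};

/* One set: a unified array of words plus the two bitmaps. */
struct amoeba_set {
    uint64_t words[WORDS_PER_SET];  /* tags and data share this storage */
    uint8_t  valid;                 /* Valid? bitmap, one bit per word */
    uint8_t  is_tag;                /* Tag? bitmap, one bit per word */
};

/* An empty set has every Valid? and Tag? bit cleared. */
static void amoeba_set_init(struct amoeba_set *set)
{
    memset(set, 0, sizeof *set);
}

/* Words a block occupies in the set: 1 tag word + its data words. */
static int block_words(const struct amoeba_tag *tag)
{
    return 1 + (tag->end - tag->start);
}
```

A block holding words 0 to 3 of its region therefore occupies 4+1 words in the set.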

Tag - Regions

[Figure: memory divided into regions of RMAX bytes. A 64-bit address is split into Region Tag, Set Index, and Byte fields; the tag stores Region Tag plus Start / End, labelled "3" and "Top 3" in the figure.]
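A minimal sketch of splitting an address into these fields. The concrete numbers are assumptions, not taken from the slides: RMAX = 64 bytes, 8-byte words (so a word index needs 3 bits), 256 sets, and a field order of region tag, set index, then byte offset.

```c
#include <stdint.h>

enum { RMAX_BYTES = 64, WORD_BYTES = 8, NUM_SETS = 256 };  /* assumed */

struct addr_fields {
    uint64_t region_tag;  /* identifies the region within its set */
    unsigned set_index;   /* which set the region maps to */
    unsigned word;        /* word within the region, 0..7 */
};

static struct addr_fields split_address(uint64_t addr)
{
    struct addr_fields f;
    uint64_t region = addr / RMAX_BYTES;  /* region number in memory */

    f.word       = (unsigned)((addr % RMAX_BYTES) / WORD_BYTES);
    f.set_index  = (unsigned)(region % NUM_SETS);
    f.region_tag = region / NUM_SETS;
    return f;
}
```

With these sizes the word index is the top 3 bits of the 6-bit byte offset, which is one plausible reading of the "Top 3" label.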

Example

struct TIE {
    long long X, Y, Z;
    long long V, H;
    long long data[3];
} Imperial;

Imperial.X = … ;

Miss
Invoke Spatial Granularity Predictor (PC/Region based)
Fetch: Tag X Y Z V

Amoeba Cache – Insert (8 words/set)

Valid?: 00000000
Tag?: 00000000
SRAM Array / Set: empty

Miss: insert 4+1 words (Tag X Y Z V)

Step 1: substring() search for 00000 in the Valid? bitmap. Pos: 0

Amoeba Cache – Insert (8 words/set), continued

Step 2: update the bitmaps. Valid?: 00000000 becomes 11111000; Tag?: 00000000 becomes 10000000

Step 3: refill the SRAM array / set with Tag X Y Z V
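The insert steps can be sketched in C as follows. This is a software stand-in for what the slides show as a hardware substring search. It assumes an 8-bit bitmap per set with the leftmost bit standing for word 0, so the values read like the slides. The function names are hypothetical.

```c
#include <stdint.h>

enum { WORDS_PER_SET = 8 };

/* Bit for word position pos; position 0 is the leftmost bit so that the
   bitmaps read like the slides (e.g. 11111000). */
static uint8_t word_bit(int pos)
{
    return (uint8_t)(0x80u >> pos);
}

/* Step 1: find the first run of n free (0) words in the Valid? bitmap.
   Returns the starting position, or -1 if no such run exists. */
static int find_free_run(uint8_t valid, int n)
{
    for (int pos = 0; pos + n <= WORDS_PER_SET; pos++) {
        int free_words = 0;
        while (free_words < n && !(valid & word_bit(pos + free_words)))
            free_words++;
        if (free_words == n)
            return pos;
    }
    return -1;
}

/* Step 2: mark a block of n words (1 tag + n-1 data) starting at pos. */
static void mark_block(uint8_t *valid, uint8_t *is_tag, int pos, int n)
{
    for (int i = 0; i < n; i++)
        *valid |= word_bit(pos + i);
    *is_tag |= word_bit(pos);  /* only the first word of the block is a tag */
}
```

Inserting 4+1 words into an empty set finds position 0 and leaves Valid? = 11111000 and Tag? = 10000000, matching the example.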

Example

struct TIE {
    long long X, Y, Z;
    long long V, H;
    long long data[3];
} Imperial;

Imperial.Y = … ;

Lookup data from the cache: of the struct's words (X Y Z V H Data[3]), the cached block Tag X Y Z V holds Y.

Amoeba Cache – Lookup (8 words/set)

Address fields: Region Tag, Set Index, Word (W)

SRAM Array / Set: Tag X Y Z V, with Tag? bitmap 10000000

Step 1: use the Tag? bitmap to select the tag words in the set (the 2x1 stages in the figure).

Step 2: check Addr ∈ Tag, i.e. Region Tag == Region, Start ≤ W, and End > W. A match signals Hit?

Step 3: the word selector moves the requested word of Tag X Y Z V to the output buffer.

[Figure label: Critical Path]
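The hit condition can be written out as a small C sketch. Hardware would check all tags in parallel; the loop below is only a sequential stand-in. The set layout (separate `tags` and `data` arrays indexed by word position, leftmost bitmap bit = word 0) is a simplification chosen for this example, not the slides' layout.

```c
#include <stdint.h>
#include <stdbool.h>

enum { WORDS_PER_SET = 8 };

struct amoeba_tag {
    uint64_t region_tag;
    uint8_t  start;  /* first word held by the block */
    uint8_t  end;    /* one past the last word held by the block */
};

struct amoeba_set {
    struct amoeba_tag tags[WORDS_PER_SET];  /* meaningful where Tag? = 1 */
    uint64_t          data[WORDS_PER_SET];  /* data words by set position */
    uint8_t           is_tag;               /* Tag? bitmap */
};

/* Hit if some tag has the same region and Start <= W < End.
   On a hit, *out receives the requested word. */
static bool amoeba_lookup(const struct amoeba_set *set, uint64_t region_tag,
                          unsigned w, uint64_t *out)
{
    for (int pos = 0; pos < WORDS_PER_SET; pos++) {
        if (!(set->is_tag & (0x80u >> pos)))
            continue;  /* this position does not hold a tag */
        const struct amoeba_tag *t = &set->tags[pos];
        if (t->region_tag == region_tag && t->start <= w && t->end > w) {
            /* Data words follow the tag, in region order. */
            *out = set->data[pos + 1 + (w - t->start)];
            return true;
        }
    }
    return false;
}
```

In the running example, a lookup for Y (W = 1) against the block Tag X Y Z V hits, while a lookup for H (W = 4) misses because End > W fails.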

Partial Miss: Identify Sub-Blocks (Step 1 of 2)

Step 1: compute New ∩ Tags to find the cached sub-blocks that overlap the new block.

Step 2: evict the overlap into the MSHR and fetch the new block.

Blocks shown: Tag X Y Z V; Tag X Y; Tag V H

Partial Miss: Insert New Block (Step 2 of 2)

Step 3: allocate 6 words (Tag X Y Z V H).

Step 4: issue the miss.

Step 5: patch the missing ?'s. The MSHR holds X Y ? V H and Z fills the gap.

Occurs ≈ 5 in 1000 accesses
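The overlap and patch steps can be sketched with word bitmaps. This is an illustration under assumptions: word ranges are half-open [start, end), bit i stands for word i of the region, and the function names are hypothetical.

```c
#include <stdint.h>

/* Word range [start, end) within one region. */
struct word_range {
    uint8_t start;
    uint8_t end;
};

/* Bitmap of the words covered by a range; bit i stands for word i. */
static uint8_t range_mask(struct word_range r)
{
    uint8_t mask = 0;
    for (int w = r.start; w < r.end; w++)
        mask |= (uint8_t)(1u << w);
    return mask;
}

/* Words of the new block that are already cached (New ∩ Tags). */
static uint8_t overlap_words(struct word_range new_block,
                             const struct word_range *cached, int n)
{
    uint8_t have = 0;
    for (int i = 0; i < n; i++)
        have |= range_mask(cached[i]);
    return (uint8_t)(range_mask(new_block) & have);
}

/* Words of the new block that still have to be fetched and patched in. */
static uint8_t missing_words(struct word_range new_block,
                             const struct word_range *cached, int n)
{
    return (uint8_t)(range_mask(new_block) &
                     ~overlap_words(new_block, cached, n));
}
```

With X..H as words 0..4, cached sub-blocks X Y and V H, and a new block X Y Z V H, only Z is missing.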

Hardware Overheads

[Figure: SRAM array with metadata (Valid? and Tag? bitmaps), showing the critical path and the extra Amoeba critical path. Labels: 1 KB, Latency +4%.]

Evaluation

• Parameters for latency and energy
• Workloads

Latency Parameters (cycles)

[Figure: memory hierarchy CPU, 64K L1, 1M LLC, with latency labels 1, 3, 20, and 300 cycles. Fixed Granularity vs. Amoeba Cache: 1.04, latency +4%.]

On-Chip Energy Parameters (pJ)

[Figure: energy for the 64K L1 and 1M LLC, Fixed Granularity vs. Amoeba Cache. Values shown: 101 and 230; 105 and 238; ≈ 7 / word.]

Workloads

• 22 diverse workloads from
  • PARSEC
  • SPEC-CPU 2000 & 2006
  • DaCapo (Java benchmarks)
  • Apache, Firefox and PostgreSQL

Results

% Improvement in L1 Miss-Rate

[Figure: bar chart, y-axis 0% to 40%. Workloads: mcf, canneal, lbm, h2, jbb, apache, x264, firefox, tpcc, eclipse.]

Reduces L1 and L2 miss rate by 18%

% Improvement in L1 Miss-Bandwidth

[Figure: bar chart, y-axis -25% to 75%.]

Reduces on-chip bandwidth by 46%
Reduces off-chip bandwidth by 38%

% Improvement in memory energy

[Figure: bar chart, y-axis 0% to 40%. Workloads: mcf, canneal, lbm, h2, jbb, apache, x264, firefox, tpcc, eclipse.]

Reduces energy by 11%

% Improvement in execution time

[Figure: bar chart, y-axis 0% to 20%, with one bar labelled 21%. Workloads: mcf, canneal, lbm, h2, jbb, apache, x264, firefox, tpcc, eclipse.]

Improves performance by 10%

Results Summary: Amoeba-Cache

• Reduce cache pollution for applications with low cache utilization

• Improve performance for moderate cache utilization

• Maintain performance for high cache utilization workloads

• Save energy for streaming applications by keeping out unused words

Additional Results

Lookup as an extra cache pipeline stage vs. throttling the CPU
• For extra pipeline stage, 8 of 22 applications show improvement

Spatial Granularity Predictor
• Indexing: 18 of 22, address region better
• Training: evictions and first touch
• Table size: 256 (PC) and 1024 (Region)

Additional Results

Multicore Shared Cache
• Reduces miss rate (avg 18%) and LLC miss bandwidth (16%-39%)

Comparison against other designs
• Fixed Granularity 2X
• Sector Cache variants
• Multi-$

Amoeba Cache

What? Enable variable granularity data caching

Why? Eliminate waste

How? Unify tag and data into a single SRAM array; afforded by recent technology trends

Where? Definitely at the L2, possibly at the L1

Frequently Asked Questions

1. Multiple threads?

2. Compare against other designs

3. Spatial Pattern Predictor

4. Replacement Policy

Multicore Shared Cache

Mix                             Miss T1   Miss T2   Miss T3   Miss T4   BW (All)
jbb x2, tpc-c x2                12.38%    12.38%    22.29%    22.37%    39.07%
Firefox x2, x264 x2             3.82%     3.61%     –2.44%    0.43%     15.71%
cactus, fluid., omnet., sopl.   1.01%     1.86%     22.38%    0.59%     18.62%
canneal, astar, ferret, milc    4.85%     2.75%     19.39%    –4.07%    17.77%

Comparison

[Table: Amoeba Cache, Multi-$, and Sector variants compared on impact on miss rate, impact on bandwidth, low tag overhead, tradeoff of data and tag space, and dynamic block resizing. Cell values as extracted: Yes, Yes, ~, ~, No, Yes, No, No, No, No.]

Comparison – Moderate Group – 64K

[Figure: scatter plot of Miss Rate Ratio against Bandwidth Ratio; axis ticks 1.0 to 1.6 and 0.4 to 1.0. Points: Sector (x: 2.9), Sector-Pre, Fixed-2X, Amoeba, Multi$-25, Multi$-50.]

Spatial Pattern Predictor

Predictor History Table

Index         Pattern
PC / Region   01011111
PC / Region   00011101

Step 1: on a miss (PC : Read Addr), index the table to get a pattern: 0 0 0 1 1 1 0 1

Step 2: use the pattern together with the critical word.

Policy-Miss vs Policy-Bandwidth

What to do when there is no entry?
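The slides do not spell out how a pattern becomes a block to fetch. One plausible reading, sketched below as an assumption, is to take the contiguous run of predicted words around the critical word. The leftmost pattern bit is taken to be word 0, and `predict_range` is a hypothetical name.

```c
#include <stdint.h>

/* Word range [start, end) to fetch within a region. */
struct word_range {
    int start;
    int end;
};

/* Grow a range outwards from the critical word while the neighbouring
   words are set in the predicted pattern. The critical word itself is
   always fetched, even if its pattern bit is 0. */
static struct word_range predict_range(uint8_t pattern, int critical_word)
{
    struct word_range r = { critical_word, critical_word + 1 };

    while (r.start > 0 && (pattern & (0x80u >> (r.start - 1))))
        r.start--;
    while (r.end < 8 && (pattern & (0x80u >> r.end)))
        r.end++;
    return r;
}
```

For the pattern 00011101 and critical word 4, this yields words 3 to 5. A critical word outside any predicted run is fetched on its own.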

Predictor Training

Index         Pattern
PC / Region   01011111
PC / Region   00011101

Add / update an entry when a block is evicted from the data array.
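A minimal sketch of training on eviction. The details are assumptions: the table is direct-mapped with 1024 entries for simplicity (the summary slide mentions 8 ways x 128 sets), the index is whatever PC or region value the caller supplies, and an update overwrites the stored pattern with the words touched by the evicted block.

```c
#include <stdint.h>
#include <stdbool.h>

enum { TABLE_ENTRIES = 1024 };  /* assumed direct-mapped for simplicity */

struct history_entry {
    bool     valid;
    uint64_t index;    /* PC or region that owns this entry */
    uint8_t  pattern;  /* words touched by the last evicted block */
};

struct history_table {
    struct history_entry entries[TABLE_ENTRIES];
};

/* On eviction: add the entry, or overwrite whatever was in its slot. */
static void train_on_evict(struct history_table *t, uint64_t index,
                           uint8_t touched)
{
    struct history_entry *e = &t->entries[index % TABLE_ENTRIES];
    e->valid   = true;
    e->index   = index;
    e->pattern = touched;
}

/* On a miss: return true and the stored pattern if an entry exists. */
static bool lookup_pattern(const struct history_table *t, uint64_t index,
                           uint8_t *pattern)
{
    const struct history_entry *e = &t->entries[index % TABLE_ENTRIES];
    if (!e->valid || e->index != index)
        return false;
    *pattern = e->pattern;
    return true;
}
```

A lookup that returns false corresponds to the "no entry" case, where a fallback policy has to decide how much to fetch.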

Predictor – L1 Miss Rate (1 of 2)

[Figure: bar chart of MPKI (y-axis, 0 to 10) for canneal, eclipse, firefox, h2, tpc-c, x264. Series: Aligned, Finite, Infinite, Finite+FT, History.]

Predictor – L1 Miss Rate (2 of 2)

[Figure: bar chart of MPKI (y-axis, 0 to 140) for apache, lbm, mcf, jbb. Series: Aligned, Finite, Infinite, Finite+FT, History.]

Predictor – L1 Miss Bandwidth (1 of 2)

[Figure: bar chart of Bandwidth Rate (y-axis, 0 to 1800) for canneal, eclipse, firefox, h2, tpc-c, x264. Series: Aligned, Finite, Infinite, Finite+FT, History.]

Predictor – L1 Miss Bandwidth (2 of 2)

[Figure: bar chart of Bandwidth Rate (y-axis, 0 to 10000) for apache, lbm, mcf, jbb. Series: Aligned, Finite, Infinite, Finite+FT, History.]

Predictor – Summary

For the majority of applications, the Region Predictor is the better choice, with:
• 1024 entry table
• Table with 8 ways x 128 sets

PC Predictor is good for 5 applications: apache, art, mcf, lbm and omnetpp

Pseudo LRU Replacement

• Logically partition the set into N ways
• Pick a block at random from the way
• Unset the T? (Tag) and V? (Valid) bits

[Figure: a set split into Way 0 and Way 1]
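The eviction step above can be sketched as follows. Several details are assumptions: 8 words per set, the leftmost bitmap bit standing for word 0, a block belonging to the way that contains its tag, and a hypothetical `block_words` array that records each block's length. How the way itself is chosen is left to the caller, and `pick` stands for a caller-supplied random number.

```c
#include <stdint.h>

enum { WORDS_PER_SET = 8 };

struct amoeba_set {
    uint8_t valid;                       /* V? bitmap */
    uint8_t is_tag;                      /* T? bitmap */
    uint8_t block_words[WORDS_PER_SET];  /* block length (tag + data) at
                                            each tag position */
};

/* Evict one randomly picked block whose tag lies in the given logical way.
   Returns the position of the evicted tag, or -1 if the way has no tag. */
static int evict_from_way(struct amoeba_set *set, int way, int num_ways,
                          unsigned pick)
{
    int words_per_way = WORDS_PER_SET / num_ways;
    int first = way * words_per_way;
    int candidates[WORDS_PER_SET];
    int count = 0;

    /* Collect the tag positions that fall inside this way. */
    for (int pos = first; pos < first + words_per_way; pos++)
        if (set->is_tag & (0x80u >> pos))
            candidates[count++] = pos;
    if (count == 0)
        return -1;

    /* Unset the V? bits of the whole block and the T? bit of its tag. */
    int victim = candidates[pick % (unsigned)count];
    for (int i = 0; i < set->block_words[victim]; i++)
        set->valid &= (uint8_t)~(0x80u >> (victim + i));
    set->is_tag &= (uint8_t)~(0x80u >> victim);
    return victim;
}
```

Freeing the words only requires clearing bitmap bits; the data words themselves can stay in the array until they are overwritten.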

Access Distribution for L1

[Figure: word distribution for 64K L1. Stacked bars of Words Accessed (%), 0 to 100, split into 1-2, 3-4, 5-6, and 7-8 words per workload.]

Value printed with each bar, in workload order:

apache 45, art 20, astar 39, cactus 79, can 30, eclipse 80, fac 77, ferret 82, firefox 49, fluid. 62, freq. 55, h2 38, jbb 40, lbm 32, mcf 29, milc 81, omnet. 33, soplex 21, tpc-c 53, trade. 73, twolf 29, x264 46, mean 50

Amoeba block size distribution for L1

[Figure: block distribution for 64K L1. Stacked bars of % of Amoeba Blocks, 0 to 100, split into 1-2, 3-4, 5-6, and 7-8 words per workload.]

Value printed with each bar, in workload order:

apache 92, art 80, astar 98, cactus 100, can 67, eclipse 98, fac 88, ferret 99, firefox 78, fluid. 100, freq. 94, h2 82, jbb 89, lbm 89, mcf 93, milc 100, omnet. 83, soplex 91, tpc-c 91, trade. 97, twolf 70, x264 91, mean 90

L1 FSM

Miss-Rate (64K L1)

[Figure: bar chart, y-axis 0 to 80, Fixed vs. Amoeba. Workloads: mcf, canneal, lbm, h2, jbb, apache, x264, firefox, tpcc, eclipse.]

Miss Bandwidth Rate (64K L1)

[Figure: bar chart, y-axis 0 to 10000, Fixed vs. Amoeba. Workloads: mcf, canneal, lbm, h2, jbb, apache, x264, firefox, tpcc, eclipse.]

Energy Rate (L1 + LLC) – (nJ/KI)

[Figure: bar chart, y-axis 0 to 100, Fixed vs. Amoeba. Workloads: mcf, canneal, lbm, h2, jbb, apache, x264, firefox, tpcc, eclipse.]

Reduction in execution time

[Figure: bar chart, y-axis 0 to 16000, Fixed vs. Amoeba.]