1
Memento: Coordinated In-Memory Caching
for Data-Intensive Clusters
Ganesh Ananthanarayanan, Ali Ghodsi, Andrew Wang, Dhruba Borthakur, Srikanth Kandula, Scott Shenker, Ion Stoica
2
Data Intensive Computation
Data analytic clusters are pervasive
◦Jobs run multiple tasks in parallel
◦Jobs operate on petabytes of input
Distributed file systems (DFS) store data distributed and replicated
◦Data reads are either disk-local or remote across the network
3
Access to disk is slow
Memory is orders of magnitude faster
How do we leverage memory storage for datacenter jobs?
4
Can we store all data in memory?
Machines have tens of gigabytes of memory
But, huge discrepancy between storage and memory capacities
◦ Facebook cluster has ~200x more data on disk than memory
Use Memory as Cache
5
Will the data fit in cache?
Heavy-tailed: 10% of total input covers >80% of all jobs
96% of the smallest jobs can fit in the memory cache
6
Elephants and mice
Mix of a few “large” jobs and very many “small” jobs
Large jobs:
◦Batch operations
◦Production jobs
Small jobs:
◦Interactive queries (e.g., Hive, SCOPE)
◦Experimental analytics
7
Challenge: Small Parallel Jobs
Job finishes when its last task finishes
◦Need to cache all-or-nothing
8
In summary…
Only option for memory-locality is caching
96% of jobs can have their data in memory, if we cache it right
9
Outline
FATE: Cache Replacement
Memento: System Architecture
Evaluation
10
We care about jobs finishing faster…
Job j that completed in tn time normally takes tm time with memory caching
◦%Reduction_j = (tn − tm) / tn × 100
Metric: Average % Reduction in Completion Time
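The metric can be computed as in this short sketch; the two job times below are made-up example values (one job runs 10x faster with caching, one is unchanged):

```python
# Sketch of the evaluation metric: average % reduction in job
# completion time. Job times are hypothetical example values.

def pct_reduction(t_normal, t_cached):
    """% reduction in one job's completion time."""
    return (t_normal - t_cached) / t_normal * 100

def avg_pct_reduction(jobs):
    """Average % reduction over (t_normal, t_cached) pairs."""
    return sum(pct_reduction(tn, tm) for tn, tm in jobs) / len(jobs)

# One job 10x faster with caching (90%), one unchanged (0%).
print(avg_pct_reduction([(100.0, 10.0), (50.0, 50.0)]))  # 45.0
```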
11
Traditional Cache Replacement
Traditional cache replacement policies (e.g., LRU, LFU) optimize for hit-ratio
◦ Belady’s MIN: Evict blocks that are to be accessed “farthest in future”
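As an illustration, MIN's eviction rule can be sketched in a few lines; the cache capacity and access sequence here are made-up examples, not the slides' figures:

```python
# Illustration of Belady's MIN: on a miss with a full cache, evict the
# resident block whose next access lies farthest in the future (a
# block never accessed again counts as farthest of all).

def belady_min(capacity, accesses):
    """Simulate MIN over an access sequence; return the number of hits."""
    cache, hits = [], 0
    for i, block in enumerate(accesses):
        if block in cache:
            hits += 1
            continue
        if len(cache) < capacity:
            cache.append(block)
            continue
        future = accesses[i + 1:]
        # Victim: resident block reused farthest ahead (never = farthest).
        victim = max(cache,
                     key=lambda b: future.index(b) if b in future else len(future))
        cache[cache.index(victim)] = block
    return hits

print(belady_min(2, ["A", "B", "A", "C", "B", "A"]))  # 2
```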
12
Belady’s MIN Example
Cache holds 4 data blocks; accesses arrive over time: E, F, B, D, C, A, …
Cache contents under MIN: A B C D → E B C D → E B F D
50% cache hit

13
MIN: How much do jobs benefit?
Memory-local tasks are 10x (or 90%) faster; 4 computation slots
Job J1 reads blocks A, B; Job J2 reads blocks C, D
Final cache (E B F D) holds one block of each job: B for J1, D for J2
◦J1 Reduction: 0% (its task on uncached A still runs at disk speed)
◦J2 Reduction: 0% (its task on uncached C still runs at disk speed)
Average: (0 + 0)/2 = 0%
14
“Whole-job” inputs
Same cache of 4 blocks, same accesses over time: E, F, B, D, C, A, …
Cache contents: A B C D → A B E D → A B E F
◦J1’s whole input (A, B) is retained; J2’s input (C, D) is evicted
50% cache hit, same hit-ratio as MIN

15
“Whole-job” inputs: How much do jobs benefit?
Memory-local tasks are 10x (or 90%) faster; 4 computation slots
Job J1 reads blocks A, B (both cached); Job J2 reads C, D (neither cached)
◦J1 Reduction: 90%
◦J2 Reduction: 0%
Average: (90 + 0)/2 = 45%
With MIN: Average (0 + 0)/2 = 0%
Cache hit-ratio is not the best-suited metric
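The job-level comparison from the example can be reproduced with a small sketch. The blocks, jobs, and 10x speedup follow the slides; the helper names are my own:

```python
# Both caches hold 2 of the 4 input blocks (the same 50% hit-ratio),
# but a parallel job finishes with its last task, so it only benefits
# if ALL of its input blocks are cached (all-or-nothing).

SPEEDUP = 90.0  # % reduction when a job's entire input is memory-local

JOBS = {"J1": ["A", "B"], "J2": ["C", "D"]}

def job_reduction(blocks, cached):
    """% reduction in one job's completion time, all-or-nothing."""
    return SPEEDUP if set(blocks) <= cached else 0.0

def avg_reduction(cached):
    return sum(job_reduction(b, cached) for b in JOBS.values()) / len(JOBS)

print(avg_reduction({"B", "D"}))  # MIN's cache (one block per job): 0.0
print(avg_reduction({"A", "B"}))  # whole-job cache (all of J1): 45.0
```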
16
FATE Cache Replacement
Maximize “whole-job” inputs in cache
Need global coordination
◦Parallel tasks distributed over different machines
Property:
◦Small jobs get preference
◦Large jobs benefit with remaining cache space
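A hedged sketch of a whole-job-style eviction order, not the exact FATE algorithm: prefer evicting blocks of jobs whose cached input is already incomplete (their all-or-nothing benefit is lost anyway), and sacrifice larger jobs before smaller ones:

```python
# Sketch only: one plausible eviction ordering that preserves complete
# small-job inputs. Function and variable names are my own.

def eviction_order(jobs, cached):
    """jobs: {job_id: [block_ids]}; cached: set of cached block ids.
    Return cached blocks in the order they should be evicted."""
    incomplete, complete = [], []
    for blocks in jobs.values():
        in_cache = [b for b in blocks if b in cached]
        if not in_cache:
            continue  # nothing of this job is cached
        bucket = incomplete if len(in_cache) < len(blocks) else complete
        bucket.append((len(blocks), in_cache))
    order = []
    # Incomplete jobs go first; within a bucket, larger jobs go first.
    for _, in_cache in sorted(incomplete, reverse=True) + sorted(complete, reverse=True):
        order.extend(in_cache)
    return order

# J1 (small) is fully cached; J2 (larger) only partially: evict J2's
# blocks first, preserving J1's whole-job input.
print(eviction_order({"J1": ["A", "B"], "J2": ["C", "D", "E"]},
                     {"A", "B", "C", "D"}))  # ['C', 'D', 'A', 'B']
```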
17
Waves in the job
Single Wave (small jobs): All-or-nothing
Multiple Waves (large jobs): Linear benefits
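The wave behavior can be illustrated with a toy model; the durations, the assumption that the scheduler packs cached tasks into the same waves, and the function names are mine, not the paper's:

```python
from math import ceil

# Toy model: a job's tasks run in waves of `slots` parallel tasks, and
# a wave finishes when its slowest task does. Memory-local tasks take
# FAST time, disk-local tasks SLOW. Assumes cached tasks are packed
# into the same waves (a simplifying assumption of this sketch).
FAST, SLOW = 1, 10

def job_time(n_tasks, n_cached, slots):
    waves = ceil(n_tasks / slots)
    slow_waves = ceil((n_tasks - n_cached) / slots)  # waves with a disk task
    return slow_waves * SLOW + (waves - slow_waves) * FAST

# Single-wave job (4 tasks, 4 slots): all-or-nothing.
print(job_time(4, 2, 4), job_time(4, 4, 4))  # 10 1
# Multi-wave job (40 tasks): benefit grows linearly with caching.
print(job_time(40, 0, 4), job_time(40, 20, 4), job_time(40, 40, 4))  # 100 55 10
```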
18
Waves in the job
(figure: single-wave vs. multiple-wave task timelines)
19
Outline
FATE: Cache Replacement
Memento: System Architecture
Evaluation
20
Global coordination of local caches
Global cache view:
Block Id | Client Id | File Name
    …    |     …     |     …
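The global cache view might look like the following sketch. The fields mirror the slide's table (Block Id, Client Id, File Name); the method names and example path are hypothetical, not Memento's actual API:

```python
# Sketch of the coordinator's metadata: a map from block id to where
# it is cached, so tasks can be scheduled memory-locally. Clients
# report cache insertions and evictions (metadata-only communication).

class Coordinator:
    def __init__(self):
        self.view = {}  # block_id -> (client_id, file_name)

    def register(self, block_id, client_id, file_name):
        """A client reports that it cached a block."""
        self.view[block_id] = (client_id, file_name)

    def evicted(self, block_id):
        """A client reports that it evicted a block."""
        self.view.pop(block_id, None)

    def locate(self, block_id):
        """Return (client_id, file_name) if cached anywhere, else None."""
        return self.view.get(block_id)

coord = Coordinator()
coord.register("blk-1", "client-7", "/data/part-0")
print(coord.locate("blk-1"))  # ('client-7', '/data/part-0')
```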
21
Memento: Salient Features
External Service
Local cache reads
Metadata communication
22
Outline
FATE: Cache Replacement
Memento: System Architecture
Evaluation
23
Evaluation
HDFS in conjunction with Memento
Microsoft and Facebook traces replayed
◦Replay jobs with same inter-arrival time
Deployment on EC2 cluster of 100 machines
◦20GB memory for Memento
Jobs binned by their size
24
Job Distribution, by bins
25
Jobs are 77% faster on average
Small jobs see 85% reduction in completion time
26
Cache hit-ratio matters less
Average job faster by 77% with FATE vs. 49% with MIN
27
Memento scales sufficiently
Coordinator handles 10,000 simultaneous client communications
Client can handle eight simultaneous local map tasks
Sufficient for current datacenter loads
28
Ongoing / Future work >>
29
Simpler Implementation [1]
Ride the OS cache
◦Estimate where block is cached
Change job manager to track block accesses
◦No FATE, use default (LRU?)
Initial results show 2.3x improvement in cache hit-rate
30
Alternate Metrics [2]
We optimize for “average % reduction in completion time” of jobs
Average:
◦Weighted to include job priorities?
Other metrics:
◦Reduction of load on disk subsystem?
◦Utilization?
31
Solid State Devices [3]
SSDs, a new layer in the storage hierarchy
Hierarchical Caching
◦Include SSDs between disk and memory
What’s the best cache replacement policy?
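One possible shape of such a hierarchy, as a hedged sketch: memory misses fall through to SSD, SSD hits are promoted, and memory evictions are demoted to SSD instead of dropped. LRU per tier is purely an assumption here; choosing the right per-tier policy is exactly the open question the slide raises:

```python
from collections import OrderedDict

# Sketch of a two-tier (memory over SSD) cache. Class and method
# names are hypothetical; per-tier LRU is an assumption.

class TieredCache:
    def __init__(self, mem_cap, ssd_cap):
        self.mem = OrderedDict()   # block -> data, LRU order
        self.ssd = OrderedDict()
        self.mem_cap, self.ssd_cap = mem_cap, ssd_cap

    def _put(self, tier, cap, block, data, demote_to=None):
        tier[block] = data
        tier.move_to_end(block)
        if len(tier) > cap:
            victim, vdata = tier.popitem(last=False)  # evict LRU entry
            if demote_to is not None:
                self._put(demote_to, self.ssd_cap, victim, vdata)

    def get(self, block, load_from_disk):
        if block in self.mem:
            self.mem.move_to_end(block)
            return self.mem[block]
        # Memory miss: try SSD, else go to disk; promote into memory
        # and demote any memory victim to SSD.
        data = self.ssd.pop(block) if block in self.ssd else load_from_disk(block)
        self._put(self.mem, self.mem_cap, block, data, demote_to=self.ssd)
        return data

cache = TieredCache(mem_cap=1, ssd_cap=2)
cache.get("a", lambda b: b.upper())  # disk -> memory
cache.get("b", lambda b: b.upper())  # "a" demoted to SSD
print(cache.get("a", lambda b: b.upper()))  # served via SSD: A
```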
32
Summary
Memory-caching can be surprisingly effective
◦…despite disk and memory capacity discrepancy
Memento: Coordinated cache management
◦FATE Replacement Policy (“whole-jobs”)
Encouraging results for datacenter workloads