Cake: Enabling High-level SLOs on Shared Storage Systems
Andrew Wang, Shivaram Venkataraman, Sara Alspaugh, Randy Katz, Ion Stoica
University of California, Berkeley
SOCC 2012
Contents
Introduction
Problem and Challenge
Solutions
System Design
Implementation
Evaluation
Conclusion
Future Work
Introduction
Rich web applications: a single slow storage request can dominate the overall response time
High-percentile latency SLOs: bound the latency observed at the 95th or 99th percentile
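A high-percentile SLO is evaluated over a window of observed request latencies. A minimal sketch of that computation, using the nearest-rank method (the function name and method choice are illustrative, not taken from the paper):

```python
def percentile_latency(samples, pct):
    """Nearest-rank percentile: the smallest observed latency such that
    at least `pct` percent of samples are at or below it."""
    ordered = sorted(samples)
    rank = -(-len(ordered) * pct // 100)  # ceil(n * pct / 100)
    return ordered[max(int(rank), 1) - 1]

# Ten request latencies in milliseconds; two slow outliers dominate the tail.
latencies_ms = [12, 15, 9, 110, 14, 13, 16, 11, 10, 240]
print(percentile_latency(latencies_ms, 50))  # median: 13
print(percentile_latency(latencies_ms, 99))  # tail latency: 240
```

Note how the median stays small while the 99th percentile is set entirely by the outliers: this is why averages hide exactly the latency that high-percentile SLOs are meant to control.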
Introduction
Datacenter applications are either latency-sensitive or throughput-oriented
Both access distributed storage systems
Today, applications typically do not share storage systems
Each carries service-level objectives (SLOs) on throughput or latency
Introduction
SLOs reflect application performance expectations
Amazon, Google, and Microsoft have identified SLO violations, such as high latency, as a major cause of user dissatisfaction
For example:
A web client might require a 99th-percentile latency SLO of 100ms
A batch job might require a throughput SLO of 100 scan requests per second
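The two example SLOs could be represented as simple declarative objects that a scheduler consumes. This is a hypothetical sketch of such a representation, not Cake's actual interface:

```python
from dataclasses import dataclass

@dataclass
class LatencySLO:
    percentile: float  # which tail to bound, e.g. 99
    target_ms: float   # latency bound in milliseconds

@dataclass
class ThroughputSLO:
    target_rps: float  # required requests per second

# The two examples above, expressed as SLO objects:
front_end = LatencySLO(percentile=99, target_ms=100)
batch_job = ThroughputSLO(target_rps=100)

def is_met(slo, measured):
    """Check a measurement against either kind of SLO: tail latency must
    stay at or below the target, throughput at or above it."""
    if isinstance(slo, LatencySLO):
        return measured <= slo.target_ms
    return measured >= slo.target_rps
```

The point of a high-level representation like this is that the application states only the objective; translating it into low-level allocations is left to the system.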
Problem and Challenge
Physically separating storage systems:
Each system must be provisioned for its individual peak load
Segregation of data leads to degraded user experience
Operational complexity: additional maintenance staff, more software bugs and configuration errors
Problem and Challenge
Focusing solely on controlling disk-level resources:
High-level storage SLOs require consideration of resources beyond the disk
Disconnect between high-level SLOs and low-level performance parameters like MB/s
Translation is tedious and manual, burdening programmers and system operators
Solutions
Cake: a coordinated, multi-resource scheduler for shared distributed storage environments, with the goal of achieving both high throughput and bounded latency.
System Design
Architecture
System Design
First-level schedulers at each resource:
Provide mechanisms for differentiated scheduling
Split large requests into smaller chunks
Limit the number of outstanding device requests
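The two first-level mechanisms can be sketched as follows; the chunk size, the in-flight limit, and all names here are illustrative assumptions, not values from the paper:

```python
import threading

def split_request(offset, length, chunk_size=64 * 1024):
    """Split one large read into (offset, length) chunks so that a large
    batch request cannot monopolize the device between scheduling
    decisions."""
    chunks = []
    while length > 0:
        n = min(length, chunk_size)
        chunks.append((offset, n))
        offset += n
        length -= n
    return chunks

# Bound the number of requests outstanding at the device: fewer in-flight
# requests means lower queue time for a newly arrived latency-sensitive request.
outstanding = threading.BoundedSemaphore(4)  # illustrative limit

def issue(chunk, do_io):
    with outstanding:  # blocks while 4 requests are already in flight
        return do_io(chunk)
```

Chunking trades a little throughput for preemptibility: the scheduler regains control between chunks, so a latency-sensitive request never waits behind an entire large scan.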
System Design
Cake's second-level scheduler acts as a feedback loop:
Continually adjusts resource allocations at each of the first-level schedulers
Maximizes SLO compliance of the system while attempting to increase utilization
First-level Resource Scheduling
Differentiated scheduling (figure panels a and b)
First-level Resource Scheduling
Split large requests
Control the number of outstanding requests (figure panels c and d)
Second-level Scheduling
Multi-resource Request Lifecycle
Request processing in a storage system involves far more than just accessing disk, necessitating a coordinated, multi-resource approach to scheduling
Second-level Scheduling
Multi-resource Request Lifecycle
Second-level Scheduling
High-level SLO Enforcement
Cake's second-level scheduler:
Satisfies the latency requirements of latency-sensitive front-end clients
Maximizes the throughput of throughput-oriented batch clients
Two phases of second-level scheduling decisions:
Disk allocations are set in the SLO compliance-based phase
Non-disk resources are set in the queue occupancy-based phase
Second-level Scheduling
The initial SLO compliance-based phase decides on disk allocations based on client performance
The queue occupancy-based phase balances allocations in the rest of the system to keep the disk utilized and improve overall performance
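One iteration of the two-phase decision might look like the sketch below; the step sizes, the 20% headroom threshold, and the occupancy comparison are illustrative assumptions, not the paper's actual policy:

```python
def slo_compliance_phase(disk_share, measured_p99_ms, slo_ms, step=0.05):
    """Phase 1: adjust the latency-sensitive client's disk allocation based
    on how its measured tail latency compares with its SLO."""
    if measured_p99_ms > slo_ms:        # missing the SLO: take more disk
        return min(1.0, disk_share + step)
    if measured_p99_ms < 0.8 * slo_ms:  # comfortable headroom: give some back
        return max(0.0, disk_share - step)
    return disk_share

def queue_occupancy_phase(upper_share, disk_occupancy, upper_occupancy, step=0.05):
    """Phase 2: balance the non-disk (upper-layer) allocation so that
    requests queue at the disk, keeping it utilized, rather than upstream."""
    if upper_occupancy > disk_occupancy:  # backlog is stuck above the disk
        return min(1.0, upper_share + step)
    return upper_share
```

Each control interval, phase 1 trades disk capacity between the two clients to track the latency SLO, and phase 2 re-tunes the upper layers so the disk allocation chosen in phase 1 can actually be consumed.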
Implementation
Chunking Large Requests
Implementation
Number of Outstanding Requests
Implementation
Cake Second-level Scheduler: SLO Compliance-based Scheduling
Implementation
Cake Second-level Scheduler: Queue Occupancy-based Scheduling
Evaluation
Proportional Shares and Reservations
When the front-end client is sending low throughput, reservations are an effective way of reducing queue time at HDFS
Evaluation
Proportional Shares and Reservations
When the front-end is sending high throughput, proportional share is an effective mechanism for reducing latency
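Proportional share between the two client queues can be sketched with a simple "furthest behind its weighted share" rule; the 3:1 weights and all names here are illustrative:

```python
def pick_queue(queues):
    """Serve the backlogged queue that is furthest behind its weighted
    fair share, i.e. has the lowest served/weight ratio."""
    eligible = {name: q for name, q in queues.items() if q['backlog'] > 0}
    return min(eligible, key=lambda n: eligible[n]['served'] / eligible[n]['weight'])

queues = {
    'front-end': {'weight': 3, 'served': 0, 'backlog': 10},
    'batch':     {'weight': 1, 'served': 0, 'backlog': 10},
}

order = []
for _ in range(8):
    name = pick_queue(queues)
    queues[name]['served'] += 1
    queues[name]['backlog'] -= 1
    order.append(name)
# Over 8 services the front-end receives 6 and batch 2, matching the 3:1 weights.
```

Unlike a reservation, a proportional share gives no absolute guarantee under light load, but when both queues are backlogged it continuously biases service toward the front-end, which is why it works well at high throughput.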
Evaluation
Single vs Multi-resource Scheduling
Without separate queues and differentiated scheduling, CPU contention arises within HBase when running many concurrent threads
Evaluation
Single vs. Multi-resource Scheduling
Thread-per-request processing displays greatly increased latency with chunked request sizes
Evaluation
Additional experiments:
Convergence Time
Diurnal Workload
Spike Workload
Latency vs. Throughput Trade-off
Quantifying the Benefits of Consolidation
Conclusion
Cake coordinates resource allocation across multiple software layers
Allows application programmers to specify high-level SLOs directly to the storage system
Allows consolidation of latency-sensitive and throughput-oriented workloads
Conclusion
Allows users to flexibly move within the storage latency vs. throughput trade-off by choosing different high-level SLOs
Using Cake has concrete economic and business advantages
Future Work
SLO admission control
Influence of DRAM and SSDs
Composable application-level SLOs
Automatic parameter tuning
Generalization to multiple SLOs
Thank You