Cake: Enabling High-level SLOs on Shared Storage Systems
Andrew Wang, Shivaram Venkataraman, Sara Alspaugh, Randy Katz, Ion Stoica
University of California, Berkeley
SOCC 2012
Contents
Introduction
Problem and Challenge
Solutions
System Design
Implementation
Evaluation
Conclusion
Future Work
Introduction
Rich web applications: a single slow storage request can dominate the overall response time
High-percentile latency SLOs: bound the latency observed at the 95th or 99th percentile
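A high-percentile SLO is evaluated over a window of observed request latencies. A minimal sketch of that computation, using the nearest-rank method (the function name and method choice are illustrative, not taken from the paper):

```python
def percentile_latency(samples, pct):
    """Nearest-rank percentile: the smallest observed latency such that
    at least `pct` percent of samples are at or below it."""
    ordered = sorted(samples)
    rank = -(-len(ordered) * pct // 100)  # ceil(n * pct / 100)
    return ordered[max(int(rank), 1) - 1]

# Ten request latencies in milliseconds; two slow outliers dominate the tail.
latencies_ms = [12, 15, 9, 110, 14, 13, 16, 11, 10, 240]
print(percentile_latency(latencies_ms, 50))  # median: 13
print(percentile_latency(latencies_ms, 99))  # tail latency: 240
```

Note how the median stays small while the 99th percentile is set entirely by the outliers: this is why averages hide exactly the latency that high-percentile SLOs are meant to control.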
Introduction
Datacenter applications are either latency-sensitive or throughput-oriented
Both access distributed storage systems
Today, applications typically do not share storage systems
Each carries service-level objectives (SLOs) on throughput or latency
Introduction
SLOs reflect application performance expectations
Amazon, Google, and Microsoft have identified SLO violations, such as high latency, as a major cause of user dissatisfaction
For example:
A web client might require a 99th-percentile latency SLO of 100ms
A batch job might require a throughput SLO of 100 scan requests per second
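The two example SLOs could be represented as simple declarative objects that a scheduler consumes. This is a hypothetical sketch of such a representation, not Cake's actual interface:

```python
from dataclasses import dataclass

@dataclass
class LatencySLO:
    percentile: float  # which tail to bound, e.g. 99
    target_ms: float   # latency bound in milliseconds

@dataclass
class ThroughputSLO:
    target_rps: float  # required requests per second

# The two examples above, expressed as SLO objects:
front_end = LatencySLO(percentile=99, target_ms=100)
batch_job = ThroughputSLO(target_rps=100)

def is_met(slo, measured):
    """Check a measurement against either kind of SLO: tail latency must
    stay at or below the target, throughput at or above it."""
    if isinstance(slo, LatencySLO):
        return measured <= slo.target_ms
    return measured >= slo.target_rps
```

The point of a high-level representation like this is that the application states only the objective; translating it into low-level allocations is left to the system.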
Problem and Challenge
Physically separating storage systems:
Each system must be provisioned for its individual peak load
Segregation of data leads to degraded user experience
Operational complexity: additional maintenance staff, more software bugs and configuration errors
Problem and Challenge
Focusing solely on controlling disk-level resources:
High-level storage SLOs require consideration of resources beyond the disk
Disconnect between high-level SLOs and low-level performance parameters like MB/s
Translation is tedious and manual, burdening programmers and system operators
Solutions
Cake: a coordinated, multi-resource scheduler for shared distributed storage environments, with the goal of achieving both high throughput and bounded latency.
System Design
Architecture
System Design
First-level schedulers at each resource:
Provide mechanisms for differentiated scheduling
Split large requests into smaller chunks
Limit the number of outstanding device requests
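The two first-level mechanisms can be sketched as follows; the chunk size, the in-flight limit, and all names here are illustrative assumptions, not values from the paper:

```python
import threading

def split_request(offset, length, chunk_size=64 * 1024):
    """Split one large read into (offset, length) chunks so that a large
    batch request cannot monopolize the device between scheduling
    decisions."""
    chunks = []
    while length > 0:
        n = min(length, chunk_size)
        chunks.append((offset, n))
        offset += n
        length -= n
    return chunks

# Bound the number of requests outstanding at the device: fewer in-flight
# requests means lower queue time for a newly arrived latency-sensitive request.
outstanding = threading.BoundedSemaphore(4)  # illustrative limit

def issue(chunk, do_io):
    with outstanding:  # blocks while 4 requests are already in flight
        return do_io(chunk)
```

Chunking trades a little throughput for preemptibility: the scheduler regains control between chunks, so a latency-sensitive request never waits behind an entire large scan.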
System Design
Cake's second-level scheduler acts as a feedback loop:
Continually adjusts resource allocations at each of the first-level schedulers
Maximizes SLO compliance of the system while attempting to increase utilization
First-level Resource Scheduling
Differentiated scheduling (figure panels a and b)
First-level Resource Scheduling
Split large requests
Control the number of outstanding requests (figure panels c and d)
Second-level Scheduling
Multi-resource Request Lifecycle
Request processing in a storage system involves far more than just accessing disk, necessitating a coordinated, multi-resource approach to scheduling
Second-level Scheduling
Multi-resource Request Lifecycle
Second-level Scheduling
High-level SLO Enforcement
Cake's second-level scheduler:
Satisfies the latency requirements of latency-sensitive front-end clients
Maximizes the throughput of throughput-oriented batch clients
Two phases of second-level scheduling decisions:
Disk allocations are set in the SLO compliance-based phase
Non-disk resources are set in the queue occupancy-based phase
Second-level Scheduling
The initial SLO compliance-based phase decides on disk allocations based on client performance
The queue occupancy-based phase balances allocations in the rest of the system to keep the disk utilized and improve overall performance
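One iteration of the two-phase decision might look like the sketch below; the step sizes, the 20% headroom threshold, and the occupancy comparison are illustrative assumptions, not the paper's actual policy:

```python
def slo_compliance_phase(disk_share, measured_p99_ms, slo_ms, step=0.05):
    """Phase 1: adjust the latency-sensitive client's disk allocation based
    on how its measured tail latency compares with its SLO."""
    if measured_p99_ms > slo_ms:        # missing the SLO: take more disk
        return min(1.0, disk_share + step)
    if measured_p99_ms < 0.8 * slo_ms:  # comfortable headroom: give some back
        return max(0.0, disk_share - step)
    return disk_share

def queue_occupancy_phase(upper_share, disk_occupancy, upper_occupancy, step=0.05):
    """Phase 2: balance the non-disk (upper-layer) allocation so that
    requests queue at the disk, keeping it utilized, rather than upstream."""
    if upper_occupancy > disk_occupancy:  # backlog is stuck above the disk
        return min(1.0, upper_share + step)
    return upper_share
```

Each control interval, phase 1 trades disk capacity between the two clients to track the latency SLO, and phase 2 re-tunes the upper layers so the disk allocation chosen in phase 1 can actually be consumed.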
Implementation
Chunking Large Requests
Implementation
Number of Outstanding Requests
Implementation
Cake Second-level Scheduler: SLO Compliance-based Scheduling
Implementation
Cake Second-level Scheduler: Queue Occupancy-based Scheduling
Evaluation
Proportional Shares and Reservations
When the front-end client is sending low throughput, reservations are an effective way of reducing queue time at HDFS
Evaluation
Proportional Shares and Reservations
When the front-end is sending high throughput, proportional share is an effective mechanism for reducing latency
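Proportional share between the two client queues can be sketched with a simple "furthest behind its weighted share" rule; the 3:1 weights and all names here are illustrative:

```python
def pick_queue(queues):
    """Serve the backlogged queue that is furthest behind its weighted
    fair share, i.e. has the lowest served/weight ratio."""
    eligible = {name: q for name, q in queues.items() if q['backlog'] > 0}
    return min(eligible, key=lambda n: eligible[n]['served'] / eligible[n]['weight'])

queues = {
    'front-end': {'weight': 3, 'served': 0, 'backlog': 10},
    'batch':     {'weight': 1, 'served': 0, 'backlog': 10},
}

order = []
for _ in range(8):
    name = pick_queue(queues)
    queues[name]['served'] += 1
    queues[name]['backlog'] -= 1
    order.append(name)
# Over 8 services the front-end receives 6 and batch 2, matching the 3:1 weights.
```

Unlike a reservation, a proportional share gives no absolute guarantee under light load, but when both queues are backlogged it continuously biases service toward the front-end, which is why it works well at high throughput.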
Evaluation
Single vs Multi-resource Scheduling
Without separate queues and differentiated scheduling, CPU contention arises within HBase when running many concurrent threads
Evaluation
Single vs. Multi-resource Scheduling
Thread-per-request processing displays greatly increased latency with chunked request sizes
Evaluation
Additional experiments:
Convergence Time
Diurnal Workload
Spike Workload
Latency vs. Throughput Trade-off
Quantifying the Benefits of Consolidation
Conclusion
Cake coordinates resource allocation across multiple software layers
Allows application programmers to specify high-level SLOs directly to the storage system
Allows consolidation of latency-sensitive and throughput-oriented workloads
Conclusion
Allows users to flexibly move within the storage latency vs. throughput trade-off by choosing different high-level SLOs
Using Cake has concrete economic and business advantages
Future Work
SLO admission control
Influence of DRAM and SSDs
Composable application-level SLOs
Automatic parameter tuning
Generalization to multiple SLOs
Thank You