© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Chunky Gupta, Software Engineer @Yelp
David Morrison, Software Engineer @Yelp
December 1, 2016
Lessons Learned from a Year of Using Spot Fleet
CMP205
What to Expect from the Session
How Yelp is saving money by using Amazon EC2 Spot Fleet!
Outline
Seagull: Yelp’s Distributed System for Concurrent Task Execution
FleetMiser: Scaling Yelp’s Spot Fleet for Fun and Profit
Looking to the Future for Seagull and FleetMiser
Yelp’s Mission
Connecting people with great local businesses
Terminology
On Demand
Reserved
Spot Instances
Spot Market
• us-west-2a (c3.8xlarge)
• us-west-2b (c3.8xlarge)
• us-west-2c (c3.8xlarge)
Resource Unit ≈ 1 vCPU
Spot Instance
• c3.8xlarge
• m4.10xlarge
• …
Cluster
Bundle/Executor
Seagull: Yelp’s Distributed System for Concurrent Task Execution
What kinds of tasks are we talking about?
Unit, integration and acceptance tests (runs ~25 million tests/day)
Photo classification (runs classifier on tens of millions of photos in less than a day)
Other applications to come
Seagull is built on top of Apache Mesos
Scheduler 1 Scheduler 2 Scheduler n
Slave 1 Slave 2 Slave 3 Slave m
Where has Yelp’s Seagull Cluster lived?
May 2015 ($$$$)
July 2015 ($$$)
Dec 2015 ($$)
Feb 2016 ($)
[Figure: cluster makeup at each date, drawn from On-Demand (OD), Spot (SI), and Reserved (RI) Instances]
Seagull’s infrastructure costs reduced by 85% in the last year
[Figure: Seagull infrastructure cost over time, May 2015-April 2016]
55% reduction in costs after initial transition to Spot Instances
Additional 60% savings after transition to Spot + Auto Scaling
Why Spot Instances?
• On-Demand Instances
• Reserved Instances
Are Spot Instances actually cheaper?
• If used intelligently, they can save you a lot of money
• Be careful! Naive usage may end up costing more than On-Demand!
How does Spot pricing actually work?
Available Spot Instances
User A: Bid $10
User B: Bid $5
User C: Bid $1
Spot Bid Price: $1, then $5
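The bidding example above can be sketched as a toy auction. This is a simplified model, not AWS's actual pricing algorithm: it assumes each user wants one instance, the highest bids win the available capacity, and the Spot price equals the lowest winning bid. The function name `clear_spot_market` and the capacity values are illustrative assumptions.

```python
def clear_spot_market(bids, capacity):
    """Toy single-instance auction (an assumed model, not the real AWS mechanism).

    bids: mapping of user name -> bid in dollars
    capacity: number of Spot Instances currently available
    Returns (spot_price, winners, outbid).
    """
    # Rank users from highest to lowest bid.
    ranked = sorted(bids.items(), key=lambda item: item[1], reverse=True)
    winners = ranked[:capacity]
    outbid = ranked[capacity:]
    # Assumption: the price is set by the lowest bid that still wins capacity.
    spot_price = winners[-1][1] if winners else None
    return spot_price, [user for user, _ in winners], [user for user, _ in outbid]


bids = {"User A": 10, "User B": 5, "User C": 1}

# Enough capacity for everyone: the lowest bid sets the price.
plenty = clear_spot_market(bids, capacity=3)
# Capacity shrinks to two instances: User C is outbid and the price rises.
scarce = clear_spot_market(bids, capacity=2)

print(plenty)  # (1, ['User A', 'User B', 'User C'], [])
print(scarce)  # (5, ['User A', 'User B'], ['User C'])
```

Under this model the price jump from $1 to $5 on the slide corresponds to one instance leaving the pool, which is also the moment User C loses an instance.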
Maintaining cluster stability in bidding wars
On-Demand Price
Step 1: Application level (Seagull) Fault Tolerance
[Figure: execution time of scheduled tasks]
Instances lost due to outbid events
Lost tasks rescheduled
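A minimal sketch of the rescheduling idea, assuming a round-robin assignment and a known set of instances that get outbid during the first pass. None of this is Seagull's actual scheduler; `run_with_rescheduling` and the instance names are hypothetical.

```python
from collections import deque


def run_with_rescheduling(tasks, instances, lost_instances):
    """Assign tasks round-robin; re-queue tasks whose instance was outbid.

    Returns (completed, reschedules), where completed maps each task to the
    instance that finished it.
    """
    pending = deque(tasks)
    alive = list(instances)
    completed = {}
    reschedules = 0
    first_pass = True
    while pending:
        if not alive:
            raise RuntimeError("no instances left to run the remaining tasks")
        batch = list(pending)
        pending.clear()
        for index, task in enumerate(batch):
            instance = alive[index % len(alive)]
            if first_pass and instance in lost_instances:
                # The instance disappeared mid-run: put the task back in the queue.
                pending.append(task)
                reschedules += 1
            else:
                completed[task] = instance
        # After the first pass, only surviving instances accept work.
        alive = [i for i in alive if i not in lost_instances]
        first_pass = False
    return completed, reschedules


tasks = [f"test-{n}" for n in range(6)]
completed, reschedules = run_with_rescheduling(
    tasks, instances=["i-a", "i-b", "i-c"], lost_instances={"i-b"}
)
print(reschedules)                # 2
print(sorted(completed) == tasks) # True
```

Every task still completes; the cost of losing an instance is extra execution time for the rescheduled tasks, which matches the shape of the slide's chart.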
Step 2: Cluster-level Fault Tolerance
Amazon EC2 Spot Fleet
Spot Fleet: 9 Instances, 3 Markets (us-west-2a, us-west-2b, us-west-2c)
[Figure: the fleet’s three markets, each annotated with $ symbols]
What if the bid price fluctuates?
Spot Fleet: 9 Instances, 3 Markets (us-west-2a, us-west-2b, us-west-2c)
[Figure: the $ annotations on each market change from slide to slide; a final chart marks the On-Demand Price]
Challenges:
• Availability
• Reliability
How do you deal with churn?
Option 1: Move back to On-Demand and wait for fluctuation to stop
[Figure: Seagull infrastructure cost over time, June 2016-Sept 2016]
Seagull costs spiked by 250% when transitioning back to On-Demand Instances for a few days
How do you deal with churn?
Option 2: Diversify! Add more Spot markets to reduce impact
[Figure: Number of units in cluster, grouped by Spot market]
Getting outbid in three markets doesn’t impact the cluster
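The arithmetic behind diversification can be sketched as follows. The 3-market layout mirrors the "9 Instances, 3 Markets" fleet from the earlier slide; the 9-market layout and its market names are hypothetical, chosen only to show how the same capacity behaves when spread more thinly.

```python
from fractions import Fraction


def surviving_fraction(units_by_market, outbid_markets):
    """Fraction of cluster capacity left after the given markets are outbid."""
    total = sum(units_by_market.values())
    if total == 0:
        raise ValueError("cluster has no capacity")
    lost = sum(units for market, units in units_by_market.items()
               if market in outbid_markets)
    return Fraction(total - lost, total)


# 9 instances concentrated in 3 markets (as on the Spot Fleet slide).
narrow = {"us-west-2a": 3, "us-west-2b": 3, "us-west-2c": 3}
# The same 9 instances spread over 9 hypothetical markets.
wide = {f"market-{n}": 1 for n in range(9)}

print(surviving_fraction(narrow, {"us-west-2a"}))  # 2/3
print(surviving_fraction(wide, {"market-0"}))      # 8/9
```

Losing one market costs a third of the narrow cluster but only a ninth of the wide one, which is why an outbid event in a well-diversified fleet barely registers.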
Diversification isn’t always easy
Is your application compatible with other instance sizes and types (e.g., EBS instances, GPU instances)?
How does your application perform on different instance types?
[Figure: execution time of scheduled tasks, color-coded by instance id]
How to use Spot Fleet most intelligently
Be simple and don’t bid too high
Diversify your Spot markets
FleetMiser: Scaling Yelp’s Spot Fleet for Fun and Profit
Why do we need scaling at all?
Number of Seagull runs
Peak demand is between ~9am and ~7pm
FleetMiser: Yelp’s in-house scaling engine
What does scaling look like?
Number of units in cluster
Developers in Europe
Peak capacity is between ~12pm and ~7pm
FleetMiser: Yelp’s in-house scaling engine
FleetMiser uses a plugin-based architecture for scaling signals
autoscale_signals:
  ClusterOverutilizedSignal:
    priority: 2
    query_period: 10
    scale_up_threshold: 0.65
    units_to_add: 100
  ...
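One way a signal plugin driven by the configuration above might look is sketched here. FleetMiser's real plugin interface is not shown in the slides, so the `evaluate` method, the `desired_delta` helper, and the rule that a lower `priority` number is consulted first are all assumptions; only the field names and values come from the config.

```python
from dataclasses import dataclass


@dataclass
class ClusterOverutilizedSignal:
    # Field names and defaults mirror the config on the slide.
    priority: int = 2
    query_period: int = 10  # time unit is not stated on the slide
    scale_up_threshold: float = 0.65
    units_to_add: int = 100

    def evaluate(self, utilization: float) -> int:
        """Return units to add; 0 means this signal has no opinion."""
        if utilization > self.scale_up_threshold:
            return self.units_to_add
        return 0


def desired_delta(signals, utilization):
    """Ask signals in priority order; the first non-zero answer wins.

    Assumption: a smaller priority number means the signal is asked earlier.
    """
    for signal in sorted(signals, key=lambda s: s.priority):
        delta = signal.evaluate(utilization)
        if delta != 0:
            return delta
    return 0


signals = [ClusterOverutilizedSignal()]
print(desired_delta(signals, utilization=0.80))  # 100
print(desired_delta(signals, utilization=0.50))  # 0
```

Keeping each signal as a small class with one decision method is what makes the architecture plugin-friendly: a new signal only has to implement the same method and appear in the config.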
Using metrics to control scaling
Cluster underutilized: scale down
Developers submitted batch jobs: maintain capacity/scale up
Cluster overutilized: scale up
(not shown) Historical usage indicates demand: scale up
Number of units in cluster
FleetMiser: Yelp’s in-house scaling engine
Scaling up uses the AWS diversification strategy
FleetMiser uses sophisticated scale-down logic to ensure cluster diversity is maintained
Scaling Down: How to terminate instances
Scale-down evenly distributed across all Spot markets
Number of units in cluster, grouped by Spot market
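A scale-down that keeps markets balanced could be planned roughly like this. The slides do not describe FleetMiser's actual logic, so the greedy rule used here (always remove the next unit from the currently largest market) is an assumption, as are the function name and the example numbers.

```python
def plan_scale_down(units_by_market, units_to_remove):
    """Greedy plan: take each unit from whichever market is currently largest.

    Returns a mapping of market -> number of units to remove from it.
    """
    remaining = dict(units_by_market)
    removed = {market: 0 for market in remaining}
    for _ in range(units_to_remove):
        if not remaining:
            break  # no markets at all
        largest = max(remaining, key=remaining.get)
        if remaining[largest] == 0:
            break  # nothing left to remove anywhere
        remaining[largest] -= 1
        removed[largest] += 1
    return removed


cluster = {"us-west-2a": 40, "us-west-2b": 30, "us-west-2c": 30}
plan = plan_scale_down(cluster, units_to_remove=25)
print(plan)  # {'us-west-2a': 15, 'us-west-2b': 5, 'us-west-2c': 5}
```

After this plan each market holds 25 units, so no single outbid event can take out more than a third of the cluster. A real implementation would also have to work in whole instances rather than single units.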
Comparison to AWS Auto Scaling for Spot Fleets
https://aws.amazon.com/blogs/aws/new-auto-scaling-for-ec2-spot-fleets/
Spot Fleet scaling
• Driven by CloudWatch metrics
• Policies can scale by constant, percentage, step function
• No custom scale-down logic
• An easy way to get your cluster autoscaling
FleetMiser scaling
• Custom signal plugins
• Scaling by arbitrary amounts (based on signal input)
• Specify instances to terminate
• Allows for more complicated functionality
Looking to the Future for Seagull and FleetMiser
Goal: Diversify our Spot Markets even further
53 bundles!
Goal: More advanced scaling logic for FleetMiser
Combine and control multiple Spot Fleets and Auto Scaling Groups at once
Goal: Better bundling of tasks for Seagull
task_requirements:
  TaskA:
    RAM: 100MB
    CPU: 3
    dependencies:
      - ServiceA
      - ServiceB
  TaskB:
    RAM: 10MB
    CPU: 1
    dependencies:
      - ServiceC
  ...
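To illustrate how per-task requirements like those above could feed a bundling decision, here is a first-fit sketch. The slides state the goal but not an algorithm, so first-fit packing, the `bundle_tasks` function, and the bundle capacities are all assumptions; the sketch also ignores the `dependencies` field.

```python
def bundle_tasks(task_requirements, bundle_cpu, bundle_ram_mb):
    """First-fit packing of tasks into bundles, largest CPU request first."""
    bundles = []  # each bundle: {"tasks": [...], "cpu": used, "ram_mb": used}
    ordered = sorted(task_requirements.items(),
                     key=lambda item: item[1]["cpu"], reverse=True)
    for name, req in ordered:
        if req["cpu"] > bundle_cpu or req["ram_mb"] > bundle_ram_mb:
            raise ValueError(f"{name} does not fit in an empty bundle")
        for bundle in bundles:
            fits_cpu = bundle["cpu"] + req["cpu"] <= bundle_cpu
            fits_ram = bundle["ram_mb"] + req["ram_mb"] <= bundle_ram_mb
            if fits_cpu and fits_ram:
                break
        else:
            # No existing bundle has room: open a new one.
            bundle = {"tasks": [], "cpu": 0, "ram_mb": 0}
            bundles.append(bundle)
        bundle["tasks"].append(name)
        bundle["cpu"] += req["cpu"]
        bundle["ram_mb"] += req["ram_mb"]
    return bundles


# Requirements taken from the slide; bundle capacities are hypothetical.
requirements = {
    "TaskA": {"cpu": 3, "ram_mb": 100},
    "TaskB": {"cpu": 1, "ram_mb": 10},
}
roomy = bundle_tasks(requirements, bundle_cpu=4, bundle_ram_mb=512)
tight = bundle_tasks(requirements, bundle_cpu=3, bundle_ram_mb=512)
print([b["tasks"] for b in roomy])  # [['TaskA', 'TaskB']]
print([b["tasks"] for b in tight])  # [['TaskA'], ['TaskB']]
```

With a 4-CPU bundle both tasks share one executor; with a 3-CPU bundle TaskB needs a bundle of its own, which is the kind of waste better bundling would aim to reduce.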
Yelp’s simple mantra for saving money on your compute costs
Use EC2 Spot Fleet with a fault-tolerant application
Use scaling to reduce off-hours capacity
@YelpEngineering
fb.com/YelpEngineers
engineeringblog.yelp.com
github.com/yelp
Thank you!
Remember to complete your evaluations!