AWS re:Invent 2016: Lessons Learned from a Year of Using Spot Fleet (CMP205)

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Chunky Gupta, Software Engineer @Yelp

David Morrison, Software Engineer @Yelp

December 1, 2016

What to Expect from the Session

How Yelp is saving money by using Amazon EC2 Spot Fleet!

Outline

Seagull: Yelp’s Distributed System for Concurrent Task Execution

FleetMiser: Scaling Yelp’s Spot Fleet for Fun and Profit

Looking to the Future for Seagull and FleetMiser

Yelp’s Mission

Connecting people with great local businesses

Terminology

• Purchasing options: On Demand, Reserved, Spot Instances

• Spot Instance types: c3.8xlarge, m4.10xlarge, …

• Spot Market: an instance type in one Availability Zone, e.g. us-west-2a (c3.8xlarge), us-west-2b (c3.8xlarge), us-west-2c (c3.8xlarge)

• Resource Unit ≈ 1 vCPU

• Cluster

• Bundle/Executor

Seagull: Yelp’s Distributed System for Concurrent Task Execution

What kinds of tasks are we talking about?

Unit, integration and acceptance tests (runs ~25 million tests/day)

Photo classification (runs classifier on tens of millions of photos in less than a day)

Other applications to come

Seagull is built on top of Apache Mesos

[Diagram: Mesos cluster with Scheduler 1, Scheduler 2, …, Scheduler n and Slave 1, Slave 2, Slave 3, …, Slave m]

Where has Yelp’s Seagull Cluster lived?

May 2015 ($$$$) → July 2015 ($$$) → Dec 2015 ($$) → Feb 2016 ($)

[Diagram: cluster composition at each date, built from On Demand (OD), Reserved (RI) and Spot Instances (SI)]

Seagull’s infrastructure costs reduced by 85% in the last year

[Chart: Seagull Infrastructure Cost over Timeline (May 2015-April 2016)]

• 55% reduction in costs after initial transition to Spot Instances

• Additional 60% savings after transition to Spot + Auto Scaling
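The two percentages above compound rather than add. A quick check, assuming (the slide does not say so) that the additional 60% applies to the already reduced bill; `combined_savings` is a name made up for this sketch:

```python
def combined_savings(*stage_savings):
    """Total fractional savings when each stage cuts the remaining cost."""
    remaining = 1.0
    for saving in stage_savings:
        remaining *= 1.0 - saving  # each stage applies to what is left
    return 1.0 - remaining


# Assumption: 55% first, then 60% off the remaining 45% of the bill.
total = combined_savings(0.55, 0.60)
print(f"{total:.0%}")  # 82%
```

Under that assumption the total comes to roughly 82%, in the same range as the 85% headline figure; the slide numbers are presumably rounded.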

Why Spot Instances?

• On-Demand Instances

• Reserved Instances

Are Spot Instances actually cheaper?

• If used intelligently, they can save you a lot of money

• Be careful! Naive usage may end up costing more than on-demand!

How does Spot pricing actually work?

Available Spot Instances

User A, Bid: $10

User B, Bid: $5

User C, Bid: $1

Spot Bid Price: $1

How does Spot pricing actually work?

Available Spot Instances

User A, Bid: $10

User B, Bid: $5

User C, Bid: $1

Spot Bid Price: $1 → $5
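One way to read the two slides above is as a uniform-price auction: the available instances go to the highest bidders, and the Spot price is the lowest winning bid. The sketch below is a simplified model under that assumption, not a description of the actual AWS mechanism; the instance counts (3, then 2) and the function names are hypothetical, chosen so the output matches the $1 and $5 prices on the slides.

```python
def spot_price(bids, available):
    """Lowest winning bid when `available` instances go to the top bidders.

    Simplified model: one instance per bid, all winners pay the same price.
    """
    if available <= 0 or not bids:
        return None
    winners = sorted(bids.values(), reverse=True)[:available]
    return winners[-1]


def outbid(bids, available):
    """Users whose bid is below the current Spot price."""
    price = spot_price(bids, available)
    if price is None:
        return sorted(bids)
    return sorted(user for user, bid in bids.items() if bid < price)


bids = {"User A": 10, "User B": 5, "User C": 1}
print(spot_price(bids, available=3))  # 1
print(spot_price(bids, available=2))  # 5
print(outbid(bids, available=2))      # ['User C']
```

In this model, losing a single unit of capacity is enough to move the price from $1 to $5 and evict the lowest bidder.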

Maintaining cluster stability in bidding wars

On-Demand Price

Step 1: Application level (Seagull) Fault Tolerance

[Chart: Execution Time per Scheduled Task]

• Instances lost due to outbid events

• Lost tasks rescheduled
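The idea of application-level fault tolerance can be sketched as a two-pass placement: tasks that were on an outbid instance are detected and placed again on surviving instances. This is an illustrative sketch only, not Seagull's scheduler; the function name, the round-robin placement and the `lost_instances` input are all assumptions.

```python
def run_with_rescheduling(tasks, instances, lost_instances):
    """Place tasks round-robin, then re-place tasks from lost instances.

    Returns (final placement, list of tasks that had to be rescheduled).
    """
    healthy = [i for i in instances if i not in lost_instances]
    if not healthy:
        raise RuntimeError("no healthy instances left to reschedule onto")

    # First pass: spread tasks over every instance in the cluster.
    placement = {
        task: instances[n % len(instances)] for n, task in enumerate(tasks)
    }
    # Tasks on a reclaimed (outbid) instance never finish.
    lost_tasks = [t for t, inst in placement.items() if inst in lost_instances]
    # Second pass: reschedule only the lost tasks onto surviving instances.
    for n, task in enumerate(lost_tasks):
        placement[task] = healthy[n % len(healthy)]
    return placement, lost_tasks


tasks = [f"t{n}" for n in range(6)]
placement, rescheduled = run_with_rescheduling(
    tasks, ["i-a", "i-b", "i-c"], lost_instances={"i-b"}
)
print(rescheduled)  # ['t1', 't4']
```

Only the lost tasks run twice, so an outbid event costs extra execution time for those tasks rather than failing the whole run.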

Step 2: Cluster-level Fault Tolerance

Amazon EC2 Spot Fleet

Spot Fleet: 9 Instances, 3 Markets (us-west-2a, us-west-2b, us-west-2c)

[Diagram: the fleet's instances spread across the three Spot markets]

What if the bid price fluctuates?

[Diagram: Spot prices in us-west-2a fluctuating relative to the On-Demand Price]

Challenges:

• Availability

• Reliability

How do you deal with churn?

Option 1: Move back to On-Demand and wait for fluctuation to stop

[Chart: Seagull Infrastructure Cost over Timeline (June 2016-Sept 2016)]

Seagull costs spiked by 250% when transitioning back to On-Demand Instances for a few days

How do you deal with churn?

Option 2: Diversify! Add more Spot markets to reduce impact

[Chart: Number of units in cluster, grouped by Spot market]

Getting outbid in three markets doesn’t impact the cluster

Diversification isn’t always easy

Is your application compatible with other instance sizes and types (e.g., EBS instances, GPU instances)?

Diversification isn’t always easy

How does your application perform on different instance types?

[Chart: Execution Time per Scheduled Task (color-coded by instance id)]

How to use Spot Fleet most intelligently

Be simple and don’t bid too high

Diversify your Spot markets
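A minimal sketch of what a diversified, modestly priced Spot Fleet request could look like. The dictionary keys are the `SpotFleetRequestConfig` fields accepted by boto3's `request_spot_fleet`; everything else is an assumption for illustration: the helper name, the placeholder AMI ID and role ARN, the `0.05` per-unit price, and the choice to cap the bid at an on-demand-equivalent price. Weights follow the deck's Resource Unit ≈ 1 vCPU convention.

```python
# vCPU counts per instance type, used as Spot Fleet weights
# (Resource Unit ≈ 1 vCPU).
VCPUS = {"c3.8xlarge": 32, "m4.10xlarge": 40}


def build_fleet_config(markets, target_units, max_price_per_unit,
                       image_id, fleet_role_arn):
    """Build a SpotFleetRequestConfig spread across several Spot markets.

    `markets` is a list of (instance_type, availability_zone) pairs.
    """
    launch_specs = [
        {
            "ImageId": image_id,
            "InstanceType": instance_type,
            "Placement": {"AvailabilityZone": zone},
            "WeightedCapacity": float(VCPUS[instance_type]),
        }
        for instance_type, zone in markets
    ]
    return {
        "IamFleetRole": fleet_role_arn,
        "AllocationStrategy": "diversified",  # spread across all markets
        "TargetCapacity": target_units,
        # Bid per unit hour; capping it keeps a price spike from
        # costing more than on-demand (assumed policy, not Yelp's).
        "SpotPrice": f"{max_price_per_unit:.4f}",
        "LaunchSpecifications": launch_specs,
    }


config = build_fleet_config(
    markets=[("c3.8xlarge", "us-west-2a"),
             ("c3.8xlarge", "us-west-2b"),
             ("m4.10xlarge", "us-west-2c")],
    target_units=288,
    max_price_per_unit=0.05,               # placeholder price
    image_id="ami-00000000",               # placeholder AMI
    fleet_role_arn="arn:aws:iam::123456789012:role/fleet-role",  # placeholder
)
# To submit it (requires AWS credentials):
# boto3.client("ec2").request_spot_fleet(SpotFleetRequestConfig=config)
```

Building the request as plain data keeps the market list easy to extend, which is the practical side of diversifying.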

FleetMiser: Scaling Yelp’s Spot Fleet for Fun and Profit

Why do we need scaling at all?

[Chart: Number of Seagull runs]

Peak demand is between ~9am and ~7pm

FleetMiser: Yelp’s in-house scaling engine

What does scaling look like?

[Chart: Number of units in cluster]

Developers in Europe

Peak capacity is between ~12pm and ~7pm


FleetMiser uses a plugin-based architecture for scaling signals

autoscale_signals:
    ClusterOverutilizedSignal:
        priority: 2
        query_period: 10
        scale_up_threshold: 0.65
        units_to_add: 100
    ...
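A plugin-based signal architecture driven by a config like the one above could be sketched as follows. This is not FleetMiser's code: the registry, the class layout, the `utilization` metric name, and the rules that a lower `priority` number wins and that the first firing signal decides are all assumptions made for illustration. Only the signal name and its four settings come from the slide.

```python
SIGNAL_REGISTRY = {}


def register(cls):
    """Class decorator: make a signal plugin discoverable by name."""
    SIGNAL_REGISTRY[cls.__name__] = cls
    return cls


class Signal:
    def __init__(self, priority, query_period, **params):
        self.priority = priority
        self.query_period = query_period
        self.params = params

    def units_delta(self, metrics):
        """Units to add (positive) or remove (negative); 0 means no opinion."""
        raise NotImplementedError


@register
class ClusterOverutilizedSignal(Signal):
    def units_delta(self, metrics):
        if metrics["utilization"] > self.params["scale_up_threshold"]:
            return self.params["units_to_add"]
        return 0


def load_signals(config):
    """Instantiate configured plugins, ordered by priority (assumed: low first)."""
    signals = [
        SIGNAL_REGISTRY[name](**options)
        for name, options in config["autoscale_signals"].items()
    ]
    return sorted(signals, key=lambda s: s.priority)


def decide(signals, metrics):
    """Return the delta from the first signal that fires, else 0."""
    for signal in signals:
        delta = signal.units_delta(metrics)
        if delta:
            return delta
    return 0


# The slide's config, written as the dict a YAML loader would produce.
config = {"autoscale_signals": {"ClusterOverutilizedSignal": {
    "priority": 2, "query_period": 10,
    "scale_up_threshold": 0.65, "units_to_add": 100}}}
signals = load_signals(config)
print(decide(signals, {"utilization": 0.70}))  # 100
print(decide(signals, {"utilization": 0.50}))  # 0
```

Adding a new signal then means writing one class and one config entry, without touching the scaling loop.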

Using metrics to control scaling

[Chart: Number of units in cluster]

• Cluster underutilized: scale down

• Developers submitted batch jobs: maintain capacity/scale up

• Cluster overutilized: scale up

• (not shown) Historical usage indicates demand: scale up


Scaling up uses the AWS diversification strategy

FleetMiser uses sophisticated scale-down logic to ensure cluster diversity is maintained

Scaling Down: How to terminate instances

Scale-down evenly distributed across all Spot markets

[Chart: Number of units in cluster, grouped by Spot market]
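One possible reading of "evenly distributed" is to keep the markets balanced by always removing from whichever market currently holds the most units. The sketch below implements that greedy rule; it is an illustration, not FleetMiser's actual scale-down logic, and it works in units rather than in whole instances for simplicity.

```python
def plan_scale_down(units_by_market, units_to_remove):
    """Greedy plan: take each unit from the currently largest market.

    Returns {market: units to remove}. Ties are broken by market name.
    """
    remaining = dict(units_by_market)
    removed = {market: 0 for market in remaining}
    for _ in range(units_to_remove):
        market = max(remaining, key=lambda m: (remaining[m], m))
        if remaining[market] == 0:
            break  # cluster is already empty
        remaining[market] -= 1
        removed[market] += 1
    return removed


cluster = {
    "us-west-2a/c3.8xlarge": 40,
    "us-west-2b/c3.8xlarge": 30,
    "us-west-2c/c3.8xlarge": 30,
}
plan = plan_scale_down(cluster, units_to_remove=25)
print(plan)
```

In this example the oversized market gives up 15 units and the other two give up 5 each, leaving 25 units in every market.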

Comparison to AWS Auto Scaling for Spot Fleets

https://aws.amazon.com/blogs/aws/new-auto-scaling-for-ec2-spot-fleets/

Spot Fleet scaling

• Driven by CloudWatch metrics

• Policies can scale by constant, percentage, step function

• No custom scale-down logic

• An easy way to get your cluster autoscaling

FleetMiser scaling

• Custom signal plugins

• Scaling by arbitrary amounts (based on signal input)

• Specify instances to terminate

• Allows for more complicated functionality

Looking to the Future for Seagull and FleetMiser

Goal: Diversify our Spot Markets even further


53 bundles!

Goal: More advanced scaling logic for FleetMiser

Combine and control multiple Spot Fleets and Auto Scaling Groups at once

Goal: Better bundling of tasks for Seagull

task_requirements:
    TaskA:
        RAM: 100MB
        CPU: 3
        dependencies:
            - ServiceA
            - ServiceB
    TaskB:
        RAM: 10MB
        CPU: 1
        dependencies:
            - ServiceC
    ...
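With per-task requirements like those above, bundling becomes a packing problem. The sketch below uses first-fit decreasing on CPU as a stand-in technique; the slides do not say how Seagull will bundle tasks. The function name, the bundle limits, and the `RAM_MB` integer field (in place of the slide's `100MB` strings) are assumptions.

```python
def bundle_tasks(task_requirements, max_cpu, max_ram_mb):
    """First-fit-decreasing packing of tasks into bundles by CPU and RAM.

    Each bundle also records the union of its tasks' dependencies.
    """
    bundles = []
    # Largest CPU demand first; task name breaks ties deterministically.
    ordered = sorted(task_requirements.items(),
                     key=lambda item: (-item[1]["CPU"], item[0]))
    for name, req in ordered:
        if req["CPU"] > max_cpu or req["RAM_MB"] > max_ram_mb:
            raise ValueError(f"{name} does not fit in an empty bundle")
        for bundle in bundles:
            if (bundle["cpu"] + req["CPU"] <= max_cpu
                    and bundle["ram_mb"] + req["RAM_MB"] <= max_ram_mb):
                break  # first bundle with enough room
        else:
            bundle = {"tasks": [], "cpu": 0, "ram_mb": 0,
                      "dependencies": set()}
            bundles.append(bundle)
        bundle["tasks"].append(name)
        bundle["cpu"] += req["CPU"]
        bundle["ram_mb"] += req["RAM_MB"]
        bundle["dependencies"].update(req["dependencies"])
    return bundles


task_requirements = {
    "TaskA": {"RAM_MB": 100, "CPU": 3,
              "dependencies": ["ServiceA", "ServiceB"]},
    "TaskB": {"RAM_MB": 10, "CPU": 1, "dependencies": ["ServiceC"]},
}
bundles = bundle_tasks(task_requirements, max_cpu=4, max_ram_mb=1024)
print(len(bundles), bundles[0]["tasks"])  # 1 ['TaskA', 'TaskB']
```

Tracking the dependency union per bundle shows which services each bundle would need, which is the information a smarter bundler could use to group tasks that share services.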

Yelp’s simple mantra for saving money on your compute costs

• Use EC2 Spot Fleet with a fault-tolerant application

• Use scaling to reduce off-hours capacity

@YelpEngineering

fb.com/YelpEngineers

engineeringblog.yelp.com

github.com/yelp

Thank you!

Remember to complete your evaluations!
