AWS re:Invent 2016: Lessons Learned from a Year of Using Spot Fleet (CMP205)

Page 1: AWS re:Invent 2016: Lessons Learned from a Year of Using Spot Fleet (CMP205)

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Chunky Gupta, Software Engineer @Yelp

David Morrison, Software Engineer @Yelp

December 1, 2016

Lessons Learned from a Year of Using Spot Fleet

CMP205

Page 2: AWS re:Invent 2016: Lessons Learned from a Year of Using Spot Fleet (CMP205)

What to Expect from the Session

How Yelp is saving money by using Amazon EC2 Spot Fleet!

Page 3: AWS re:Invent 2016: Lessons Learned from a Year of Using Spot Fleet (CMP205)

Outline

Seagull: Yelp’s Distributed System for Concurrent Task Execution

FleetMiser: Scaling Yelp’s Spot Fleet for Fun and Profit

Looking to the Future for Seagull and FleetMiser

Page 4: AWS re:Invent 2016: Lessons Learned from a Year of Using Spot Fleet (CMP205)

Yelp’s Mission

Connecting people with great local businesses

Page 5: AWS re:Invent 2016: Lessons Learned from a Year of Using Spot Fleet (CMP205)

Terminology

On Demand

Reserved

Spot Instances

Spot Market

• us-west-2a (c3.8xlarge)

• us-west-2b (c3.8xlarge)

• us-west-2c (c3.8xlarge)

Spot Instance

• c3.8xlarge

• m4.10xlarge

• …

Resource Unit ≈ 1 vCPU

Cluster

Bundle/Executor

Page 6: AWS re:Invent 2016: Lessons Learned from a Year of Using Spot Fleet (CMP205)

Seagull: Yelp’s Distributed System for Concurrent Task Execution

Page 7: AWS re:Invent 2016: Lessons Learned from a Year of Using Spot Fleet (CMP205)

What kinds of tasks are we talking about?

Unit, integration and acceptance tests (Runs ~25 million tests/day)

Photo classification (Runs classifier on tens of millions of photos in less than a day)

Other applications to come

Page 8: AWS re:Invent 2016: Lessons Learned from a Year of Using Spot Fleet (CMP205)

Seagull is built on top of Apache Mesos

Scheduler 1 Scheduler 2 Scheduler n

Slave 1 Slave 2 Slave 3 Slave m

Page 10: AWS re:Invent 2016: Lessons Learned from a Year of Using Spot Fleet (CMP205)

Where has Yelp’s Seagull Cluster lived?

May 2015 ($$$$)

July 2015 ($$$)

Dec 2015 ($$)

Feb 2016 ($)

[Diagram: the mix of OD, RI, and SI capacity (On Demand, Reserved, Spot Instances) in the cluster at each date]

Page 11: AWS re:Invent 2016: Lessons Learned from a Year of Using Spot Fleet (CMP205)

Seagull’s infrastructure costs reduced by 85% in the last year

[Chart: Seagull Infrastructure Cost over Timeline (May 2015-April 2016)]

55% reduction in costs after initial transition to Spot Instances

Additional 60% savings after transition to Spot + Auto Scaling
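
As a rough check of how the two savings steps on this slide combine, here is a minimal sketch. It assumes (the slide does not say so) that the additional 60% applies to the cost that remained after the first 55% cut. The function name is illustrative only.

```python
def combined_reduction(*step_reductions):
    """Return the overall fractional reduction from successive reductions.

    Each step is assumed to apply to whatever cost is left after the
    previous step (an assumption, not something stated on the slide).
    """
    remaining = 1.0
    for reduction in step_reductions:
        remaining *= (1.0 - reduction)
    return 1.0 - remaining


total = combined_reduction(0.55, 0.60)
print(f"Overall reduction: {total:.0%}")
```

Under that assumption the two steps compound to roughly 82%, which is in the same range as the 85% headline figure; the percentages on the slide are presumably rounded.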

Page 12: AWS re:Invent 2016: Lessons Learned from a Year of Using Spot Fleet (CMP205)

Why Spot Instances?

• On-Demand Instances

• Reserved Instances

Page 13: AWS re:Invent 2016: Lessons Learned from a Year of Using Spot Fleet (CMP205)

Are Spot Instances actually cheaper?

• If used intelligently, they can save you a lot of money

• Be careful! Naive usage may end up costing more than on-demand!

Page 14: AWS re:Invent 2016: Lessons Learned from a Year of Using Spot Fleet (CMP205)

How does Spot pricing actually work?

Available Spot Instances

User A (Bid: $10)

User B (Bid: $5)

User C (Bid: $1)

Spot Bid Price $1

Page 15: AWS re:Invent 2016: Lessons Learned from a Year of Using Spot Fleet (CMP205)

How does Spot pricing actually work?

Available Spot Instances

User A (Bid: $10)

User B (Bid: $5)

User C (Bid: $1)

Spot Bid Price $1, then Spot Bid Price $5
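
The two pricing slides can be read as a simple uniform-price auction, sketched below. This is a simplified model and not AWS's actual pricing algorithm: it assumes each user wants one instance, that the lowest winning bid sets the price for everyone, and that the price moves from $1 to $5 because the number of available instances shrinks from three to two. The function name and the `available` parameter are illustrative.

```python
def spot_price(bids, available):
    """Return (winners, price) for a simplified uniform-price auction.

    bids: dict of user -> bid in dollars (one instance per user).
    available: number of Spot Instances on offer.
    """
    ranked = sorted(bids.items(), key=lambda kv: kv[1], reverse=True)
    winners = ranked[:available]
    if not winners:
        return [], None
    # The lowest winning bid sets the price that every winner pays.
    price = winners[-1][1]
    return [user for user, _ in winners], price


bids = {"User A": 10, "User B": 5, "User C": 1}
print(spot_price(bids, available=3))  # all three users win
print(spot_price(bids, available=2))  # User C is outbid
```

In this model User A never pays their $10 bid; a high bid only determines who keeps their instance when capacity gets scarce.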

Page 16: AWS re:Invent 2016: Lessons Learned from a Year of Using Spot Fleet (CMP205)

Maintaining cluster stability in bidding wars

On-Demand Price

Page 17: AWS re:Invent 2016: Lessons Learned from a Year of Using Spot Fleet (CMP205)

Step 1: Application level (Seagull) Fault Tolerance

[Chart: Execution Time vs. Scheduled Tasks]

Instances lost due to outbid events

Page 18: AWS re:Invent 2016: Lessons Learned from a Year of Using Spot Fleet (CMP205)

Step 1: Application level (Seagull) Fault Tolerance

[Chart: Execution Time vs. Scheduled Tasks]

Lost tasks rescheduled
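
A minimal sketch of the rescheduling idea follows. It is not Seagull's implementation: the function, the round-robin placement, and the instance ids are all hypothetical, and a real scheduler would react to lost instances as they happen rather than in a second pass.

```python
from collections import deque


def run_with_rescheduling(tasks, instances, lost_instances):
    """Assign tasks round-robin, then requeue tasks from lost instances.

    Returns a dict mapping each task to the instance that finally ran it.
    """
    healthy = [i for i in instances if i not in lost_instances]
    if not healthy:
        raise RuntimeError("no healthy instances left to reschedule onto")

    # First pass: place every task without knowing which instances will be lost.
    assignment = {t: instances[n % len(instances)] for n, t in enumerate(tasks)}

    # Second pass: anything placed on a lost instance goes back on the queue
    # and is handed to a surviving instance.
    queue = deque(t for t, inst in assignment.items() if inst in lost_instances)
    n = 0
    while queue:
        task = queue.popleft()
        assignment[task] = healthy[n % len(healthy)]
        n += 1
    return assignment


result = run_with_rescheduling(
    tasks=[f"test-{n}" for n in range(6)],
    instances=["i-a", "i-b", "i-c"],
    lost_instances={"i-b"},
)
print(result)
```

The point of the sketch is that an outbid event costs extra execution time for the affected tasks, but no task is dropped.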

Page 20: AWS re:Invent 2016: Lessons Learned from a Year of Using Spot Fleet (CMP205)

Step 2: Cluster-level Fault Tolerance

Amazon EC2 Spot Fleet

Page 21: AWS re:Invent 2016: Lessons Learned from a Year of Using Spot Fleet (CMP205)

Step 2: Cluster-level Fault Tolerance

Amazon EC2 Spot Fleet

Spot Fleet: 9 Instances, 3 Markets

[Diagram: instances spread across us-west-2a, us-west-2b, and us-west-2c, with $ symbols next to each market]

Page 22: AWS re:Invent 2016: Lessons Learned from a Year of Using Spot Fleet (CMP205)

What if the bid price fluctuates?

Spot Fleet: 9 Instances, 3 Markets

[Diagram, shown in three steps: the number of $ symbols next to us-west-2a, us-west-2b, and us-west-2c changes from step to step]

Page 25: AWS re:Invent 2016: Lessons Learned from a Year of Using Spot Fleet (CMP205)

What if the bid price fluctuates?

[Chart label: On-Demand Price]

Challenges:

• Availability

• Reliability

Page 26: AWS re:Invent 2016: Lessons Learned from a Year of Using Spot Fleet (CMP205)

How do you deal with churn?

Option 1: Move back to On-Demand and wait for fluctuation to stop

[Chart: Seagull Infrastructure Cost over Timeline (June 2016-Sept 2016)]

Seagull costs spiked by 250% when transitioning back to On-Demand Instances for a few days

Page 27: AWS re:Invent 2016: Lessons Learned from a Year of Using Spot Fleet (CMP205)

How do you deal with churn?

Option 2: Diversify! Add more Spot markets to reduce impact

[Chart: Number of units in cluster, grouped by Spot market]

Getting outbid in three markets doesn’t impact the cluster

Page 28: AWS re:Invent 2016: Lessons Learned from a Year of Using Spot Fleet (CMP205)

Diversification isn’t always easy

Is your application compatible with other instance sizes and types (e.g., EBS instances, GPU instances)?

Page 29: AWS re:Invent 2016: Lessons Learned from a Year of Using Spot Fleet (CMP205)

Diversification isn’t always easy

How does your application perform on different instance types?

[Chart: Execution Time vs. Scheduled Tasks (color-coded by instance id)]

Page 30: AWS re:Invent 2016: Lessons Learned from a Year of Using Spot Fleet (CMP205)

How to use Spot Fleet most intelligently

Be simple and don’t bid too high

Diversify your Spot markets
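
To make "diversify your Spot markets" concrete, the sketch below builds a Spot Fleet request configuration with one launch specification per (Availability Zone, instance type) pair. The dictionary keys follow the `SpotFleetRequestConfig` structure accepted by boto3's `request_spot_fleet`, but nothing here calls AWS. The AMI id, role ARN, bid, and target capacity are placeholders, and weighting by vCPU count is an assumption based on the deck's "Resource Unit ≈ 1 vCPU"; this is not Yelp's actual configuration.

```python
# vCPU counts for the two instance types named in the deck.
VCPUS = {"c3.8xlarge": 32, "m4.10xlarge": 40}


def build_fleet_config(zones, target_units, price_per_unit, ami_id, fleet_role_arn):
    """Build a SpotFleetRequestConfig dict with one launch spec per Spot market."""
    launch_specs = [
        {
            "ImageId": ami_id,
            "InstanceType": instance_type,
            "Placement": {"AvailabilityZone": zone},
            # Weight each instance by its vCPU count (resource unit ~ 1 vCPU).
            "WeightedCapacity": float(vcpus),
        }
        for zone in zones
        for instance_type, vcpus in VCPUS.items()
    ]
    return {
        "IamFleetRole": fleet_role_arn,
        # Spread capacity across all listed markets instead of the cheapest one.
        "AllocationStrategy": "diversified",
        "TargetCapacity": target_units,
        "SpotPrice": str(price_per_unit),  # bid per unit, kept modest
        "LaunchSpecifications": launch_specs,
    }


config = build_fleet_config(
    zones=["us-west-2a", "us-west-2b", "us-west-2c"],
    target_units=960,                 # placeholder capacity
    price_per_unit=0.02,              # placeholder bid
    ami_id="ami-00000000",            # placeholder
    fleet_role_arn="arn:aws:iam::123456789012:role/example-fleet-role",  # placeholder
)
print(len(config["LaunchSpecifications"]), "Spot markets")
```

With three zones and two instance types the fleet can draw on six markets, so losing one of them affects only a fraction of the cluster.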

Page 31: AWS re:Invent 2016: Lessons Learned from a Year of Using Spot Fleet (CMP205)

FleetMiser: Scaling Yelp’s Spot Fleet for Fun and Profit

Page 32: AWS re:Invent 2016: Lessons Learned from a Year of Using Spot Fleet (CMP205)

Why do we need scaling at all?

Number of Seagull runs

Peak demand is between ~9am and ~7pm

Page 33: AWS re:Invent 2016: Lessons Learned from a Year of Using Spot Fleet (CMP205)

FleetMiser: Yelp’s in-house scaling engine

Page 34: AWS re:Invent 2016: Lessons Learned from a Year of Using Spot Fleet (CMP205)

What does scaling look like?

Number of units in cluster

Developers in Europe

Peak capacity is between ~12pm and ~7pm

Page 35: AWS re:Invent 2016: Lessons Learned from a Year of Using Spot Fleet (CMP205)

FleetMiser: Yelp’s in-house scaling engine

Page 36: AWS re:Invent 2016: Lessons Learned from a Year of Using Spot Fleet (CMP205)

FleetMiser uses a plugin-based architecture for scaling signals

autoscale_signals:
  ClusterOverutilizedSignal:
    priority: 2
    query_period: 10
    scale_up_threshold: 0.65
    units_to_add: 100
  ...
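
One way a signal plugin driven by this configuration could look is sketched below. FleetMiser's real plugin interface is not shown in the slides, so the class layout and the `delta` method are hypothetical, and `query_period` is assumed to mean "number of most recent utilization samples to average".

```python
class ClusterOverutilizedSignal:
    """Hypothetical signal plugin: ask for more units when utilization is high."""

    def __init__(self, priority, query_period, scale_up_threshold, units_to_add):
        self.priority = priority
        self.query_period = query_period  # assumed: how many recent samples to use
        self.scale_up_threshold = scale_up_threshold
        self.units_to_add = units_to_add

    def delta(self, utilization_samples):
        """Return the number of units to add (0 means no change)."""
        recent = utilization_samples[-self.query_period:]
        if not recent:
            return 0
        average = sum(recent) / len(recent)
        return self.units_to_add if average > self.scale_up_threshold else 0


# Values copied from the configuration on the slide.
config = {"priority": 2, "query_period": 10,
          "scale_up_threshold": 0.65, "units_to_add": 100}
signal = ClusterOverutilizedSignal(**config)
print(signal.delta([0.7] * 10))  # above the threshold
print(signal.delta([0.3] * 10))  # below the threshold
```

Keeping the thresholds in configuration means a new signal can be added or tuned without touching the scaling engine itself.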

Page 37: AWS re:Invent 2016: Lessons Learned from a Year of Using Spot Fleet (CMP205)

Using metrics to control scaling

Cluster underutilized: scale down

Developers submitted batch jobs: maintain capacity/scale up

Cluster overutilized: scale up

(not shown) Historical usage indicates demand: scale up

Number of units in cluster
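
When several signals fire at once, the engine has to pick one action. The sketch below shows one possible rule. It assumes that a lower `priority` number wins, which the slides do not state, and the signal names are made up for the example.

```python
def choose_action(signal_outputs):
    """Pick one scaling action from several signals.

    signal_outputs: list of (priority, name, delta) tuples, where delta is
    the change in units (positive = scale up, negative = scale down).
    Assumption: a lower priority number wins; ties keep the first listed.
    """
    active = [s for s in signal_outputs if s[2] != 0]
    if not active:
        return ("none", 0)
    _, name, delta = min(active, key=lambda s: s[0])
    return (name, delta)


outputs = [
    (1, "cluster_overutilized", 0),      # not firing right now
    (2, "historical_demand", 100),       # history says demand is coming
    (3, "cluster_underutilized", -50),   # cluster is quiet at the moment
]
print(choose_action(outputs))
```

Here the historical-usage signal outranks the underutilization signal, so the cluster scales up ahead of expected demand instead of shrinking.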

Page 38: AWS re:Invent 2016: Lessons Learned from a Year of Using Spot Fleet (CMP205)

FleetMiser: Yelp’s in-house scaling engine

Page 39: AWS re:Invent 2016: Lessons Learned from a Year of Using Spot Fleet (CMP205)

Scaling up uses the AWS diversification strategy
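
A scale-up can be as small as raising the fleet's target capacity and letting Spot Fleet place the new capacity according to the allocation strategy of the request. The sketch below uses the parameter names of boto3's `modify_spot_fleet_request`, but runs against a stand-in client so that it needs no AWS access. The fleet id and capacities are placeholders, and whether FleetMiser scales up exactly this way is an assumption.

```python
class FakeEC2Client:
    """Stand-in for a boto3 EC2 client so the sketch runs without AWS access."""

    def __init__(self):
        self.calls = []

    def modify_spot_fleet_request(self, **kwargs):
        self.calls.append(kwargs)
        return {"Return": True}


def scale_up(client, fleet_id, current_capacity, units_to_add):
    """Raise the fleet's target capacity and return the new target."""
    new_target = current_capacity + units_to_add
    client.modify_spot_fleet_request(
        SpotFleetRequestId=fleet_id,
        TargetCapacity=new_target,
    )
    return new_target


client = FakeEC2Client()
new_target = scale_up(client, "sfr-example", current_capacity=1000, units_to_add=100)
print(new_target, client.calls)
```

With a real client, the same call would be made on `boto3.client("ec2")`; the placement of the added units is then left to the fleet's diversified allocation strategy.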

Page 40: AWS re:Invent 2016: Lessons Learned from a Year of Using Spot Fleet (CMP205)

FleetMiser uses sophisticated scale-down logic to

ensure cluster diversity is maintained

Page 41: AWS re:Invent 2016: Lessons Learned from a Year of Using Spot Fleet (CMP205)

Scaling Down: How to terminate instances

Scale-down evenly distributed across all Spot markets

Number of units in cluster, grouped by Spot market
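
A minimal sketch of an even scale-down is shown below. It is one interpretation of "evenly distributed", not FleetMiser's actual logic: units are removed one at a time from whichever market currently holds the most, so the remaining capacity ends up as balanced as possible. It ignores that real instances come in multi-unit sizes, and the market names are illustrative.

```python
def plan_scale_down(units_by_market, units_to_remove):
    """Return how many units to remove from each market.

    Removes one unit at a time from the market that currently holds the
    most units, so the remaining capacity stays as evenly spread as possible.
    """
    remaining = dict(units_by_market)
    removed = {market: 0 for market in remaining}
    for _ in range(min(units_to_remove, sum(remaining.values()))):
        largest = max(remaining, key=remaining.get)
        remaining[largest] -= 1
        removed[largest] += 1
    return removed


plan = plan_scale_down(
    {
        "us-west-2a/c3.8xlarge": 40,
        "us-west-2b/c3.8xlarge": 30,
        "us-west-2c/c3.8xlarge": 30,
    },
    units_to_remove=25,
)
print(plan)
```

In this example the over-represented market gives up the most units, leaving 25 units in each of the three markets.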

Page 42: AWS re:Invent 2016: Lessons Learned from a Year of Using Spot Fleet (CMP205)

Comparison to AWS Auto Scaling for Spot Fleets

https://aws.amazon.com/blogs/aws/new-auto-scaling-for-ec2-spot-fleets/

Spot Fleet scaling

• Driven by CloudWatch metrics

• Policies can scale by constant, percentage, step function

• No custom scale-down logic

• An easy way to get your cluster autoscaling

FleetMiser scaling

• Custom signal plugins

• Scaling by arbitrary amounts (based on signal input)

• Specify instances to terminate

• Allows for more complicated functionality

Page 43: AWS re:Invent 2016: Lessons Learned from a Year of Using Spot Fleet (CMP205)

Looking to the Future for Seagull and FleetMiser

Page 44: AWS re:Invent 2016: Lessons Learned from a Year of Using Spot Fleet (CMP205)

Goal: Diversify our Spot Markets even further

Page 45: AWS re:Invent 2016: Lessons Learned from a Year of Using Spot Fleet (CMP205)

Goal: Diversify our Spot Markets even further

53 bundles!

Page 47: AWS re:Invent 2016: Lessons Learned from a Year of Using Spot Fleet (CMP205)

Goal: Diversify our Spot Markets even further

Page 48: AWS re:Invent 2016: Lessons Learned from a Year of Using Spot Fleet (CMP205)

Goal: More advanced scaling logic for FleetMiser

Combine and control multiple Spot Fleets and Auto Scaling Groups at once

Page 49: AWS re:Invent 2016: Lessons Learned from a Year of Using Spot Fleet (CMP205)

Goal: More advanced scaling logic for FleetMiser

$$$$

$$$

Page 50: AWS re:Invent 2016: Lessons Learned from a Year of Using Spot Fleet (CMP205)

Goal: Better bundling of tasks for Seagull

task_requirements:
  TaskA:
    RAM: 100MB
    CPU: 3
    dependencies:
      - ServiceA
      - ServiceB
  TaskB:
    RAM: 10MB
    CPU: 1
    dependencies:
      - ServiceC
  ...
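
Per-task requirements like these could feed a packing step that groups tasks into bundles. The sketch below uses a first-fit-decreasing heuristic on CPU only; the slides do not say which algorithm Seagull would use, so the function, the `cpu_per_bundle` budget, and the bundle layout are all hypothetical.

```python
def bundle_tasks(task_requirements, cpu_per_bundle):
    """Greedily pack tasks into bundles without exceeding a CPU budget.

    Tasks are placed largest-first into the first bundle with room
    (first-fit decreasing). Each bundle tracks the union of dependencies.
    """
    bundles = []
    ordered = sorted(task_requirements.items(),
                     key=lambda kv: kv[1]["CPU"], reverse=True)
    for name, req in ordered:
        if req["CPU"] > cpu_per_bundle:
            raise ValueError(f"{name} needs more CPU than one bundle provides")
        for bundle in bundles:
            if bundle["cpu"] + req["CPU"] <= cpu_per_bundle:
                break  # found an existing bundle with room
        else:
            bundle = {"tasks": [], "cpu": 0, "dependencies": set()}
            bundles.append(bundle)
        bundle["tasks"].append(name)
        bundle["cpu"] += req["CPU"]
        bundle["dependencies"].update(req["dependencies"])
    return bundles


# Values copied from the slide; the 4-CPU budget is a made-up example.
task_requirements = {
    "TaskA": {"RAM": "100MB", "CPU": 3, "dependencies": ["ServiceA", "ServiceB"]},
    "TaskB": {"RAM": "10MB", "CPU": 1, "dependencies": ["ServiceC"]},
}
bundles = bundle_tasks(task_requirements, cpu_per_bundle=4)
print(bundles)
```

Tracking the union of dependencies per bundle shows which services an executor would need to start for the tasks it receives.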

Page 51: AWS re:Invent 2016: Lessons Learned from a Year of Using Spot Fleet (CMP205)

Yelp’s simple mantra for saving money on your compute costs

Use EC2 Spot Fleet with a fault-tolerant application

Page 52: AWS re:Invent 2016: Lessons Learned from a Year of Using Spot Fleet (CMP205)

Yelp’s simple mantra for saving money on your compute costs

Use scaling to reduce off-hours capacity

Page 53: AWS re:Invent 2016: Lessons Learned from a Year of Using Spot Fleet (CMP205)

@YelpEngineering

fb.com/YelpEngineers

engineeringblog.yelp.com

github.com/yelp

Page 54: AWS re:Invent 2016: Lessons Learned from a Year of Using Spot Fleet (CMP205)

Thank you!

Page 55: AWS re:Invent 2016: Lessons Learned from a Year of Using Spot Fleet (CMP205)

Remember to complete your evaluations!