Reducing Cost & Maximizing Efficiency: Tightening the Belt on AWS (CPN211) | AWS re:Invent 2013

© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.

CPN211 - Reducing Cost and Maximizing Efficiency: Tightening the Belt on AWS

Tom Johnston - Business Development Manager, Amazon Web Services

Sean Simpson - Director of Operations, Stitcher, Inc.

Kingsley Wood - Business Development Manager, Amazon Web Services

Ashay Padwal - CTO, Vserv.mobi

November 15, 2013


Introductions and Outline

• Tom Johnston (AWS)

Reducing Cost and Spending Smart

• Sean Simpson (Stitcher)

Moving to AWS – A Story

• Kingsley Wood (AWS)

Maximizing Efficiency and Cost Optimization

• Ashay Padwal (vServ.mobi)

A Spot Case Study

Reducing Cost and Spending Smart

Tom Johnston – Business Development Manager, AWS

Fundamentals

• Explicit Objectives

• Match Instances with Workloads

• Match Scale & Use with Demand

• Match Purchasing with Utilization

• Governance Matters

Objectives

AWS provides you the ability to match your architecture to your objectives

Instance types

Start

Choose an instance that best meets your basic requirements

Match memory & virtual cores

Tune

Change instance size up or down based upon monitoring

Use CloudWatch & Trusted Advisor to assess

[Figure: an Instance reports to Amazon CloudWatch, which can raise an Alarm. Free Memory, Free CPU, Free HDD, and other Custom Metrics are PUT at 1-min intervals and retained for 2 weeks.]

Know your usage
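As a rough illustration of turning monitoring data into a tuning decision, the sketch below takes CPU utilization samples (for example, two weeks of 1-minute readings exported from CloudWatch) and suggests whether to size up, size down, or stay put. The function name and the 20%/80% thresholds are assumptions made for this example, not AWS guidance.

```python
from statistics import quantiles

def recommend_resize(cpu_percent_samples, low=20.0, high=80.0):
    """Suggest 'down', 'up', or 'keep' from CPU utilization samples (0-100).

    Uses the 95th percentile so a single spike does not drive the decision.
    The low/high thresholds are illustrative assumptions.
    """
    if len(cpu_percent_samples) < 2:
        raise ValueError("need at least two samples")
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
    p95 = quantiles(cpu_percent_samples, n=20)[-1]
    if p95 < low:
        return "down"
    if p95 > high:
        return "up"
    return "keep"
```

The same check could be run against memory or disk metrics; which metric matters most depends on the workload.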

[Figure: instance families plotted by Memory (GB) against Processing Ability, ranging from More Memory to More Processing: M1, M3, High-CPU, High-Mem, Cluster Compute, C3, High Storage, High I/O, High-Mem Cluster Compute.]

Roll-Out

Run multiple instances in multiple Availability Zones

Choose your metric, then optimize for that metric

Cost per unit of work per instance (size)

Workload A: optimal on 4 x m1.xlarge

Workload B: optimal on 10 x m1.medium

Workload C: optimal on 2 x m3.2xlarge

100 concurrent jobs on 10 x m1.large @ $0.26 / hr = $0.026 / job

vs

300 concurrent jobs on 10 x m3.xlarge @ $0.58 / hr = $0.019 / job

Think workload density

Don’t just focus on instance hourly rate
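The per-job figures above follow from dividing the fleet's hourly cost by the number of jobs it runs concurrently. A minimal sketch of that arithmetic, using only the numbers on the slide (the function name is made up for this example):

```python
def cost_per_job(instances, hourly_rate, concurrent_jobs):
    """Hourly fleet cost divided by the jobs the fleet runs concurrently."""
    return instances * hourly_rate / concurrent_jobs

m1_large = cost_per_job(10, 0.26, 100)
m3_xlarge = cost_per_job(10, 0.58, 300)

print(f"m1.large:  ${m1_large:.3f} per job")
print(f"m3.xlarge: ${m3_xlarge:.3f} per job")
```

The m3.xlarge fleet costs more than twice as much per hour, yet each job is cheaper because three times as many jobs fit on it.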

[Figure: Server Load by Hour of day (0-24), shown as a sequence of builds: the Capacity of 1 Server, the Traditional capacity required to cover the peak all day, and servers added for 8 hours at a time to follow the load instead.]

1/3rd Saving

[Figure: Instance Count (0-6) by Day of Month (0-30) for a workload with monthly predictable peak processing, comparing fixed capacity with Elastic Capacity.]

75% Savings

Elastic Capacity
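The savings in the daily and monthly charts come from paying for the capacity each period needs rather than holding peak capacity throughout. The sketch below shows the calculation. The two demand profiles are hypothetical, chosen only because they reproduce the 1/3 and 75% headline figures. They are not read from the charts.

```python
def elastic_saving(required_servers):
    """Fraction of server-time saved by following demand instead of holding peak.

    required_servers: servers needed in each equal-length period
    (hours of a day, or days of a month).
    """
    fixed = max(required_servers) * len(required_servers)  # peak capacity all the time
    elastic = sum(required_servers)                        # only what each period needs
    return 1 - elastic / fixed

# Hypothetical daily profile: 1 server overnight, 3 at peak, 2 in the evening.
daily = [1] * 8 + [3] * 8 + [2] * 8
# Hypothetical monthly profile: 1 server most days, 6 during a 3-day processing run.
monthly = [1] * 27 + [6] * 3

print(f"Daily saving:   {elastic_saving(daily):.0%}")
print(f"Monthly saving: {elastic_saving(monthly):.0%}")
```

The spikier the demand, the larger the gap between peak provisioning and elastic capacity.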

On-demand instances

Unix/Linux instances start at $0.02/hour

Pay as you go for compute power

Low cost and flexibility

Pay only for what you use, no up-front commitments or long-term contracts

Use Cases:

Applications with short term, spiky, or unpredictable workloads

Application development or testing

Reserved instances

1- or 3-year terms

Pay low up-front fee, receive significant hourly discount

Low Cost / Predictability

Helps ensure compute capacity is available when needed

Use Cases:

Applications with steady state or predictable usage

Applications that require reserved capacity, including disaster recovery

Heavy utilization RI

> 80% utilization; lower costs up to 58%

Use Cases: Databases, Large Scale HPC, Always-on infrastructure, Baseline

Medium utilization RI

41-79% utilization; lower costs up to 49%

Use Cases: Web applications, many heavy processing tasks, running much of the time

Light utilization RI

Up to 34% Savings

Use Cases: Disaster Recovery, Weekly / Monthly reporting, Elastic Map Reduce

Best RI for Utilization

[Figure: cost ($0 to $18,000) compared across Heavy, Medium, and Light utilization RIs and On-Demand.]

Optimizing costs with RIs

[Figure: stacked chart (0-14 over periods 1-24) split among On Demand, Light Utilization RI, and Medium Utilization RI.]
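One way to reason about which purchasing option fits a given utilization is to compare annual cost per instance under each option. The sketch below does that with hypothetical prices, which are illustrative only and not AWS rates. It also assumes a Heavy utilization RI is billed its hourly fee for every hour of the term whether or not the instance runs, while the other options bill only for hours used.

```python
HOURS_PER_YEAR = 8760

# Hypothetical 1-year prices for one instance:
# (up-front fee, hourly rate, billed for every hour of the term?)
OPTIONS = {
    "on-demand": (0.0, 0.10, False),
    "light": (60.0, 0.06, False),
    "medium": (140.0, 0.04, False),
    "heavy": (170.0, 0.03, True),  # assumed: hourly fee applies even when stopped
}

def annual_cost(option, utilization):
    """Yearly cost of one instance running `utilization` (0.0-1.0) of the time."""
    upfront, hourly, always_billed = OPTIONS[option]
    billed_hours = HOURS_PER_YEAR * (1.0 if always_billed else utilization)
    return upfront + hourly * billed_hours

def best_option(utilization):
    """Cheapest purchasing option for the given utilization."""
    return min(OPTIONS, key=lambda name: annual_cost(name, utilization))

for u in (0.10, 0.30, 0.60, 0.95):
    print(f"{u:.0%} utilization -> {best_option(u)}")
```

With real prices substituted in, the crossover points show where each RI type starts to pay off for a particular instance.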





Spot instances

Bid on unused EC2 capacity

Spot Price based on supply/demand, determined automatically

Cost / Large Scale, dynamic workload handling

Use Cases:

Applications with flexible start and end times

Applications only feasible at very low compute prices

Governance Matters

• Who can create and launch instances?

• Who checks that only needed instances are running?

• Have specific policies

• Use AWS tools such as IAM to help enforce them
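As one possible example of enforcing such a policy with IAM, the sketch below builds a policy document that denies launching any instance type outside an allow-list. The allow-list contents are hypothetical, and the policy is an illustration rather than one shown in the talk. It relies on the `ec2:InstanceType` condition key for the `ec2:RunInstances` action.

```python
import json

# Hypothetical allow-list; choose the instance types your own policy permits.
ALLOWED_INSTANCE_TYPES = ["m1.medium", "m1.large"]

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # Deny launches of any instance type outside the allow-list.
            "Effect": "Deny",
            "Action": "ec2:RunInstances",
            "Resource": "arn:aws:ec2:*:*:instance/*",
            "Condition": {
                "StringNotEquals": {"ec2:InstanceType": ALLOWED_INSTANCE_TYPES}
            },
        }
    ],
}

policy_json = json.dumps(policy, indent=2)
print(policy_json)
```

A Deny statement like this only restricts; users still need a separate Allow for the launches that are permitted.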

Checklist

• Identify your goals

• Understand your workload & match to instances

• Scale up and down with demand

• Align purchasing methods & utilization

• Have governance appropriate to your goals

• Change in goals & workload will drive change in use of AWS

Moving to AWS – A Story

Sean Simpson

Director of Operations - Stitcher, Inc.

What is Stitcher?

• Stitcher is to news and talk radio what Pandora is to music

• Stitcher is a content aggregator

• Stitcher is an on-demand service

• Stitcher is deployed on mobile, CE, and automotive platforms

Stitcher by the Numbers

• 12 million downloads

• 20,000+ shows

• Over 1 million hours of listening weekly

• Over 100 TB outbound data monthly

With Growth Comes Pain

• DRBD database locked us into hardware

• Sublease of colocation facility restricted our access to our servers

• Server leases and purchases constrained our architecture

• Growth inhibited by human, server, and vendor resources

What options did we consider?

• Move to another colocation facility

• Move to a cloud provider

• Move to a hybrid colocation/cloud provider

Why we chose Amazon Web Services

• Familiarity

– Already using Amazon Simple Storage Service for our RSS feeds

– Already experimenting with Amazon Elastic Compute Cloud

– Recently implemented Amazon Simple Queue Service

• Flexibility / Scalability

– Ability to adjust resources quickly in our production environment

– Ability to create any number of environments

– Ability to design servers as we wanted with respect to operating systems, systems software, etc.

• Cost

– Cost matches usage

– Bandwidth savings when using Amazon CloudFront as our CDN

– Many resources to assist in optimization

– Put simply, we got our solution for the lowest quote

• Documentation & Customer Service

– Knowledgeable solutions architects

– “Right-level” documentation

– Quick response to our needs

Architecting Change

• Ask yourself: What are we trying to achieve?

• Know yourself, know your systems

• Consider industry best practices (but don’t blindly follow them)

• Read the documentation

Use Puppet or Chef

• Configuration management tools are both enabling and liberating

• Build, destroy, and build again

• Write once, build many

• Nuances between node types are managed with clearly written rules

• Naming conventions are your friend

Our Architecture

Looks nice, but what does it do?

• High Availability

• Scalability

• Security

• Performance

• Cost effectiveness

The Results – Database connections/sec

Before: 225

After: 450

The Results – GetStationPlaylist()

Before: 0.75

After: 0.1

The Results – Maximum throughput

Before: 5000

After: 20000

The Results – Downtime

Before: 1200

After: 15

Cost Optimization Results

• Twice the results for the same money

How we save money

• Reserved instances

• Appropriate instance types

• CloudFront CDN

• Rapid reorganization using the API

• Monitor utilization

• Load test

• Housecleaning

On Deck Cost Savings

• Spot instances for processing tasks

• Auto Scaling

• In-app optimizations

• Instance type tuning

Parting Advice

• Architect for 10X

• Take the time to get it right the first time (or at least, close enough)

• Plan on continuous evolution of systems

Maximizing Efficiency and Cost Optimization

Kingsley Wood – Business Development Manager, AWS

Considerations

• Offloading – reduce footprint

• Utilization – your biggest lever

• Managed Services – leverage RDS, SQS, SES

• Consolidated Billing – pooling resources

• Flexible Evolution – continually revisit

• Spot Instances – think big, new possibilities

OFFLOAD all static content

• reduce your compute demand and costs

• improve end-user experience

• increase reliability and durability

ENTIRE SITE via CloudFront

• minimize client-server chatter (keep it at the edge)

• reduce server-database traffic (cache the common calls)

• speed up mobile app response (persistent connections)

Real World Example

Standard Setup

• 4 x Medium Instances: $485

• AWS Data Transfer 1 TB: $194

• Total = $679

Optimized

• 1 x Medium Instance: $121

• CloudFront Data 1 TB: $168

• CloudFront Requests: $1.89

• Total = $291

57% Lower Cost + 6X Faster
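The totals and the 57% figure can be checked directly from the line items on the slide. A short sketch of that arithmetic (the slide does not state the billing period, so none is assumed here):

```python
standard = {
    "4 x Medium Instances": 485.00,
    "AWS Data Transfer 1 TB": 194.00,
}
optimized = {
    "1 x Medium Instance": 121.00,
    "CloudFront Data 1 TB": 168.00,
    "CloudFront Requests": 1.89,
}

standard_total = sum(standard.values())
optimized_total = sum(optimized.values())
saving = 1 - optimized_total / standard_total

print(f"Standard:  ${standard_total:.2f}")
print(f"Optimized: ${optimized_total:.2f}")
print(f"Saving:    {saving:.0%}")
```

Most of the saving comes from running one instance instead of four; the data transfer line changes comparatively little.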

Offloading Tips

• Leverage S3, CloudFront, Route 53

• Eliminate repeated calls (edge and data cache)

• Static website hosting on S3

– No web server at all!

• Minimize your EC2 and database footprint

– stand up Read Replicas for variable loads

Utilization and Auto-Scaling: Granularity

more small instances vs. fewer large instances

29 Large @ $0.32/hr = $9.28

59 Small @ $0.08/hr = $4.72
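The slide does not say how the two instance counts were derived. One plausible reading is that they are instance-hours over a period in which capacity follows demand, where smaller instances let capacity track demand in finer steps. The sketch below illustrates that reading. The demand profile is hypothetical, chosen to reproduce the slide's counts, and it assumes one Large does the work of four Smalls, in line with the 4x price ratio.

```python
import math

SMALL_RATE, LARGE_RATE = 0.08, 0.32  # $/hr, from the slide
SMALL_PER_LARGE = 4                  # assumption: one Large equals four Smalls

def instance_hours(demand, units_per_instance):
    """Instance-hours needed when each hour's demand is rounded up to whole instances."""
    return sum(math.ceil(d / units_per_instance) for d in demand)

# Hypothetical 24-hour demand, in Small-instance units of work per hour.
demand = [1] * 4 + [2] * 15 + [5] * 5

large_hours = instance_hours(demand, SMALL_PER_LARGE)
small_hours = instance_hours(demand, 1)

print(f"{large_hours} Large @ ${LARGE_RATE}/hr = ${large_hours * LARGE_RATE:.2f}")
print(f"{small_hours} Small @ ${SMALL_RATE}/hr = ${small_hours * SMALL_RATE:.2f}")
```

Under these assumptions the Large fleet pays for capacity it rarely uses, because every hour is rounded up to a multiple of four units.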

Utilization – Trigger Actions by Event

Leverage CloudWatch to collect and measure metrics

Buuuk for Singapore Press Holdings (SPH)

The Straits Times Mobile App

REAL-TIME reaction response

• notification of pending News Flash (with audible alarm)

• on-demand ramp up of capacity (6 mins)

• subscriber alert push delivered

• mass response traffic handled (followed by ramp down)

Architecture

Amazon Web Services provides services and infrastructure to build reliable, fault-tolerant, and highly available systems in the cloud.

These qualities have been designed into our services both by handling such aspects without any special action by you and by providing features that must be used explicitly and correctly.

Managed Services Reduce:

Managed Services

• Elastic Load Balancing

• Amazon Relational Database Service (RDS)

• Amazon Simple Queue Service (SQS)

• Amazon Simple Email Service (SES)

• Amazon Elastic MapReduce

• Amazon ElastiCache

• Amazon Simple Notification Service (SNS)

[Figure: DNS to Elastic Load Balancing ($0.028 per hour) to Web Servers across Availability Zones, VS DNS to an EC2 instance + software LB ($0.08 per hour, small instance) to Web Servers in an Availability Zone.]

[Figure: Producer to SQS queue to Consumers, at $0.50 per 1,000,000 Requests ($0.0000005 per Request), VS Producer to an EC2 instance + software queue to Consumers, at $0.08 per hour (small instance).]
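Using only the two prices on the slide, the request rate at which SQS charges equal the hourly price of a small instance can be worked out as below. This is a raw price comparison; it does not count the effort of running, patching, and making a self-managed queue redundant.

```python
SQS_PRICE_PER_REQUEST = 0.50 / 1_000_000  # $0.0000005, from the slide
SMALL_INSTANCE_PER_HOUR = 0.08            # from the slide

def sqs_hourly_cost(requests_per_hour):
    """SQS request charges for one hour at the given request rate."""
    return requests_per_hour * SQS_PRICE_PER_REQUEST

break_even = SMALL_INSTANCE_PER_HOUR / SQS_PRICE_PER_REQUEST
print(f"Break-even: {break_even:,.0f} requests per hour")
```

Below that rate, the managed queue is cheaper on price alone than a single small instance running a software queue.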

Consolidated Billing

RI Purchases to grow a Resource Pool

[Figure: stacked chart (0-35 over periods 1-12) for accounts A, B, C, D, and E, which together form the Reserved Instance Pool.]

Tiered Pricing

Flexibility: Take advantage!

Architecture vs. Gardening

• STOP/START

• size changes

• new instance types

• vary capacity

• rearrange, etc.

What are Spot Instances?

• Value Pricing – Up to 92% discount

• Elastic – Capacity not otherwise available

• Minimum Commitment – Commit to 1 hour

• Tradeoff – Potential for interruption

Key Points about Spot

• Spare capacity – supply and demand

• Be prepared for no availability at times

• Be willing to accept and deal with interruption

• Far greater potential scale, starting at 5X default instance limits

• Massive possible capacity = new ideas…

Consider 2 Time-to-Value Scenarios

1) Value of results quickly diminishes

e.g., Engineering simulations e.g., Analytics before an M&A deal

2) Value of result stable until deadline

Spot Applications

Ideal Applications

Batch Processing

Time-Delayable

Fault-Tolerant or Restartable

Compute-Intensive

Horizontally Scalable

Stateless Worker Nodes

Region and AZ Independent

Uses Deployment Automation

Less Ideal Applications

Interactive

Strict/Tight SLA for Completion

Expensive to Handle Terminations

Data-Intensive

In-Memory Scaling

Long-Running Worker Nodes

Requires a Single AZ

Manually Launched and Managed

Spot Advice and Tips

• Don’t build your reliability ENTIRELY on spot

vServ.mobi – exceptional and smart architecture

• With time flexibility, different approaches:

delayed results, lower cost

spend less, quicker answers

• Ask different questions:

with enormous capacity, what is now possible?

Look at the World Differently

• Order of magnitude more capacity

• New experiments enabled = innovation!

• Lucky Oyster – recommendation exchange

• Prototyping a new search technology idea (using Common Crawl)

• 3.4 billion web pages > 1 TB of data > Index of 400 million entities

• “The cost? About $100... in about 14 hours”

A Spot Case Study

Ashay Padwal

CoFounder & CTO – vServ.mobi

GLOBAL INNOVATION FOCUSED

Award Winning Mobile Ad Exchange

across Emerging Markets

31 Bn Ad Requests / Month

Over 200 Mn Unique Users / Month

10% SOUTH AMERICA

7% NORTH AMERICA

11% EUROPE

33% INDIA

14% MIDDLE EAST & AFRICA

11% REST OF ASIA

14% SE ASIA

Infrastructure: Requirements & Challenges

1. Requirement: Self Serve for Publisher On-boarding & Exit

Challenge: No Capacity Planning; Extreme Scalability

2. Requirement: Start Up

Challenge: No Capex, no Lock-in

3. Requirement: Least Latency & High Availability

Challenge: Suite of services – Compute, Load Balancing, DNS, CDN, Storage, Multiple DCs per location

4. Requirement: Global Setup management with small team

Challenge: Availability across Regions with extensive APIs

Infrastructure: Solution

1. AWS

2. AWS

3. EC2 & ELB – Multi-AZ; Route53, CloudFront, S3

4. US East, US West, Europe, South America, Asia

For Middle East, we host in Turkey

For Africa, we host in South Africa

Deployment Overview

Ad Delivery Setup

Now What? Reduce Cost without impacting Performance

• AWS is pretty cost-effective. But we were greedy!

• Saving more meant more money for other areas in our business.

• We walked in the opposite direction... and it worked!

• We use spot instances in production extensively.

• Sounds risky? - Yes, but if you architect your system correctly, you should be safe.

What we did

1. Selected the right Instance Type

- use CloudWatch for CPU & memory usage

- Load Test

2. Designed our servers to be self-sufficient and perishable

- Business logic & DB on same server

- Transaction Logs written to EBS

- Auto Setup on Server

- Data Collection module

3. We built a custom Scaling solution

- Add/Remove instances by checking present traffic & predicting traffic in the immediate future

- Based on trending of spot prices either try launching spot or fall back to on-demand instances

- Remove servers if in use between 45-55min

- Track spot prices to shift to on-demand
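The scaling rules described above could be sketched roughly as follows. This is an illustration of the idea, not vServ's implementation: the function names, the price-trend rule, and the 0.8 headroom factor are all assumptions. The 45-55 minute window presumably reflects that EC2 billed by the full instance-hour at the time, so a server removed late in its hour has already been paid for.

```python
def launch_market(recent_spot_prices, on_demand_price, headroom=0.8):
    """Return 'spot' or 'on-demand' for the next launch.

    Falls back to on-demand when the spot price is rising and already above
    headroom x the on-demand price. Rule and default are illustrative assumptions.
    """
    latest = recent_spot_prices[-1]
    rising = len(recent_spot_prices) > 1 and latest > recent_spot_prices[0]
    if rising and latest > headroom * on_demand_price:
        return "on-demand"
    return "spot"

def removable(minutes_since_launch):
    """True when the server is 45-55 minutes into its current billing hour."""
    return 45 <= minutes_since_launch % 60 <= 55

def servers_to_remove(fleet_minutes, needed):
    """Indexes of surplus servers to terminate, limited to those in the 45-55 window."""
    surplus = max(len(fleet_minutes) - needed, 0)
    candidates = [i for i, m in enumerate(fleet_minutes) if removable(m)]
    return candidates[:surplus]

# Four servers running for 50, 10, 112, and 47 minutes; only two are needed.
print(servers_to_remove([50, 10, 112, 47], needed=2))
```

Servers outside the window are left running until a later pass, so scale-down lags demand slightly in exchange for using every paid hour.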

What AWS did

1. Reduced pricing for EC2 (On Demand & Reserved) and S3

2. Cheap Archival System - Glacier

3. Pre warming of Load Balancer (ELB)

4. AMI movement across regions

5. ELB with equal distribution of traffic across instances spread in any Availability Zone

THANK YOU!

Ashay Padwal CTO & Co-Founder [email protected]

Closing – Key Takeaways

• Re-evaluate, revisit and re:Invent – evolve along with AWS

• Leverage – Managed Services, CloudWatch

• Stay up to date – RI modifications, Trusted Advisor

• AWS Blog: aws.typepad.com

Please give us your feedback on this presentation

As a thank you, we will select prize winners daily for completed surveys!

CPN211