
AWS re:Invent 2016: Case Study: How Startups Like Smartsheet and Quantcast Accelerate Innovation and Growth with Amazon S3 (STG309)

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Kevin Stinson, Senior Software Engineer, Big Data Services, Quantcast Corporation

D.J. Hanson, Director of Infrastructure, Smartsheet

November 30, 2016

Case Study: How Startups like Smartsheet and Quantcast Accelerate Innovation and Growth with Amazon S3

STG309

What to Expect from the Session

• Quick overview of Quantcast’s MapReduce System

• Changes made to move to AWS and Amazon S3

• Problems we encountered on the way and their resolutions

A little bit about Quantcast

• Uses real-time data about consumer behavior to significantly improve the relevancy of digital advertising

• Over 100 billion bids and 40 PB of data processed per day

• 180 engineers globally across San Francisco, Seattle, Singapore, and London

• We’re hiring – [email protected]

MapReduce at Quantcast

QFS – Quantcast’s distributed file system

• Open sourced - https://github.com/quantcast/qfs

• Written in C++

• Compatible with Hadoop 0.23 and higher, Hive, Spark, Storm, etc.

• Supports replication, erasure coding, tiered storage, and rack awareness

QFS - continued

• Many of our internal tools assume data is on QFS

• Quantcast has more than 17 PB of data stored in QFS

Basic QFS setup

[Diagram: Metaserver, QFS Client, and two Chunkservers, each with RAM, SSD, and Disk tiers]

Quantflow – Quantcast’s MapReduce system

• Over 40 PB processed daily

• Heavily relies on QFS

• Uses QFS instance tiered with RAM disks and SSDs for intermediate data

• Bundled with control/monitoring systems like Zookeeper and Ganglia

Moving Quantflow to AWS

Adding Amazon S3 support to QFS

• Uses S3 bucket as a block device

• Replication and erasure coding are not supported on the S3 tier because S3 is already reliable

• Makes S3 appear as just another tier in QFS

• I/O performance comparable to other S3-based file systems such as EMRFS

• Supports fast renames and deletes

• Usable with standard Hadoop and Hadoop-friendly tools

QFS setup with S3 bucket

[Diagram: Metaserver, QFS Client, two Chunkservers, each with RAM, SSD, and Disk tiers, plus an S3 Bucket]

Changes to Quantflow

• Repackage Quantflow for easier installation on a fresh Amazon EC2 cluster

• Some important services run on dedicated instances, but all MapReduce workers can run on spot instances

Data flow

• Ends of data pipeline are generally QFS on S3

• Intermediate data is on QFS using tiered RAM disks, SSDs, or Amazon EBS volumes using replication or erasure coding

• Direct access to QFS data in data center possible but limited by bandwidth and cost control concerns

Copying data to S3 QFS

• Copied 8 PB of data center data to S3 as backup and as input for AWS Quantflow jobs

• Done as copy from one QFS instance to another

• Process took weeks to complete

• Major bottleneck was 20 Gb/sec link between data center and Amazon

• Still copy 120-150 TB/day
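
The figures above can be sanity-checked with a little arithmetic. The sketch below is only a back-of-the-envelope estimate: it assumes decimal units (1 PB = 1000 TB) and a link running at its full 20 Gb/sec line rate with no protocol overhead, and the function names are illustrative.

```python
# Back-of-the-envelope transfer estimate.
# Assumptions: decimal units, full line rate, no protocol overhead.

def link_tb_per_day(gbit_per_sec: float) -> float:
    """Terabytes per day a link can move at full line rate."""
    bytes_per_sec = gbit_per_sec * 1e9 / 8
    return bytes_per_sec * 86_400 / 1e12


def days_to_copy(petabytes: float, gbit_per_sec: float) -> float:
    """Days needed to copy the given volume at full line rate."""
    return petabytes * 1000 / link_tb_per_day(gbit_per_sec)


capacity = link_tb_per_day(20)
print(f"20 Gb/sec link: {capacity:.0f} TB/day")
print(f"8 PB at line rate: {days_to_copy(8, 20):.0f} days")
for daily_tb in (120, 150):
    print(f"{daily_tb} TB/day uses {daily_tb / capacity:.0%} of the link")
```

Under these assumptions the link tops out near 216 TB/day, so 8 PB needs roughly 37 days at best, which is consistent with "weeks". The ongoing 120-150 TB/day would occupy more than half of the link.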

Issues and Resolutions

Low S3 performance

• Initial tests of Quantflow in AWS ran slower than expected

• S3 performance hit apparent cap at 20-30 GB/sec

• Adding more EC2 instances to Quantflow cluster did not improve performance

• Tests accessing S3 directly had same problem

Finding the cause

• It took us 2 months to find the root cause, even with help from AWS engineers

• A tcpdump showed 8% of traffic was from DNS queries

• A parallel DNS query benchmark showed that our internal DNS server achieved only 200 QPS vs. 10,000 QPS using Amazon DNS

• All DNS queries went to a DNS server on a t2.micro instance – this was a legacy from our data center setup

• S3 uses short DNS TTLs for load balancing
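
A parallel DNS micro-benchmark of the kind described above might look like the following. This is a minimal sketch, not the tool Quantcast used: the function names, worker count, and query count are assumptions, and it measures whatever resolver the host is configured with (including any local cache in front of it).

```python
import socket
import time
from concurrent.futures import ThreadPoolExecutor


def resolve(hostname: str) -> str:
    """Resolve a hostname through the system resolver; return one address."""
    return socket.getaddrinfo(hostname, 443)[0][4][0]


def measure_qps(hostname, total_queries=1000, workers=64, resolver=resolve):
    """Issue total_queries lookups from a thread pool; return queries/sec.

    The resolver argument is injectable so the harness can be exercised
    without touching the network.
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(resolver, [hostname] * total_queries))
    elapsed = time.perf_counter() - start
    return len(results) / elapsed


# Example (performs real lookups; the endpoint name is only an illustration):
# print(f"{measure_qps('s3.amazonaws.com'):.0f} QPS")
```

Running it once against the internal DNS server and once against Amazon DNS gives the kind of comparison reported on the slide.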

Fixing the problem

• We configured dnscache on worker nodes to forward DNS queries for S3 endpoints to Amazon DNS

• We achieved 75 GB/sec with 3,200 concurrent processes on 200 c3.8xlarge instances with dnscache; 100 GB/sec is easily achievable by using c4.8xlarge and adding a few more instances

• Using Amazon VPC DNS should also work
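
To put the aggregate figure in per-instance terms, here is a rough calculation. It assumes the 75 GB/sec was spread evenly across instances and processes, and that per-instance throughput stays the same when scaling out; the slide itself suggests c4.8xlarge would do better, so the instance count below is a pessimistic estimate.

```python
import math


def per_instance_gb_per_sec(total_gb_per_sec: float, instances: int) -> float:
    """Average throughput per instance, assuming an even spread."""
    return total_gb_per_sec / instances


def instances_needed(target_gb_per_sec: float, per_instance: float) -> int:
    """Instances required to hit a target at a fixed per-instance rate."""
    return math.ceil(target_gb_per_sec / per_instance)


rate = per_instance_gb_per_sec(75, 200)
print(f"{rate * 8:.1f} Gb/sec per instance")              # 3.0
print(f"{75 * 1000 / 3200:.1f} MB/sec per process")       # 23.4
print(f"{instances_needed(100, rate)} instances for 100 GB/sec")  # 267
```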

Improvement from DNS caching change

[Bar chart: Comparison of S3 Read Performance on 200 x c3.8xlarge, 16 workers/instance, 64 MB x 16 objects, Boto2 APIs. Throughput (GB/sec) at 3,200 concurrent processes: 74 with DNSCache vs. 22 with a single DNS forwarder.]

Checklist to enable 100 GB/sec with S3

• Use multipart upload with large-enough object size

• Use well-distributed object keys

• Have enough DNS capacity to achieve 10,000 QPS

• Enable partitioning of bucket, which needs time and data

• Pay attention to instance types and their bandwidth
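
One common way to get well-distributed object keys is to put a short hash-derived prefix in front of the natural name. The sketch below is an illustration of that idea only; the MD5-based scheme, the two-level layout, and the function name are assumptions, not something the talk specifies.

```python
import hashlib


def distributed_key(name: str, levels: int = 2) -> str:
    """Prefix a key with hex shards taken from a hash of its name.

    The hash spreads otherwise sequential names (dates, counters)
    across many different key prefixes.
    """
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    shards = [digest[i * 2:i * 2 + 2] for i in range(levels)]
    return "/".join(shards + [name])


# Hypothetical object names, for illustration only.
for name in ("part-00000", "part-00001", "part-00002"):
    print(distributed_key(name))
```

Because the prefix is derived from the name, a reader can recompute the full key without a lookup table.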

Tools that helped

• dig, tcpdump, boto with logging

• AWS CLI, S3 bucket logging

• Parallel execution tools like GXP cluster shell

• Try micro-benchmarking before checking the whole stack

Quantflow spot fleet issues

• Getting a large spot fleet of more capable instances can be difficult, take a long time, or cost more than expected

• With availability and pricing changes, we may want a mixture of several different types of spot instances and be able to drop or lose instances

• Because intermediate data is stored locally, losing instances can cause job failures

A workaround

• Request multiple smaller spot fleets

• Tell QFS that each fleet is its own virtual rack

• QFS will try to spread out the data across racks

• Using N-way replication, up to N-1 fleets can be lost

• Using QFS’s standard 6+3 erasure coding, up to 3 fleets can be lost and less space than 4-way replication is used
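
The trade-off in the last two bullets can be worked out directly. This is a simplified model: it assumes every replica, or every one of the nine chunks in a 6+3 stripe, lands in a different fleet (virtual rack), and the function names are illustrative.

```python
def replication(copies: int) -> dict:
    """Storage overhead and tolerated fleet losses for N-way replication."""
    return {"overhead": float(copies), "tolerated_losses": copies - 1}


def erasure_coding(data_chunks: int, parity_chunks: int) -> dict:
    """Storage overhead and tolerated fleet losses for a data+parity stripe."""
    overhead = (data_chunks + parity_chunks) / data_chunks
    return {"overhead": overhead, "tolerated_losses": parity_chunks}


print("4-way replication:", replication(4))
print("6+3 erasure coding:", erasure_coding(6, 3))
```

Both schemes survive the loss of three fleets under this model, but 6+3 stores 1.5 bytes per byte of data where 4-way replication stores 4.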

Parting thoughts

• An easily overlooked item of your setup can have a large impact on performance

• As we started using AWS services on a larger scale, we hit a number of account limitations such as instance limits, total provisioned SSD limits, etc.

• If your performance levels off, ask your friendly AWS liaison if an account limitation is the issue

The Smartsheet use case

In the before times…

During the Waywhen.

Game changers disrupt prior assumptions

$ aws s3 ls icanhazbukkit/1f/ee/1feed5-1337-d00d-2ba5e
                           PRE 1feed5-1337-d00d-2ba5e/
2016-12-25 13:29:10    1048576 1feed5-1337-d00d-2ba5e
$ aws s3 ls icanhazbukkit/1f/ee/1feed5-1337-d00d-2ba5e/
                           PRE mobile/
                           PRE thumbs/
$ aws s3 ls icanhazbukkit/1f/ee/1feed5-1337-d00d-2ba5e/mobile/
2016-11-22 10:17:21          0
2016-11-22 10:17:34     165342 400.jpg
$ aws s3 ls icanhazbukkit/1f/ee/1feed5-1337-d00d-2ba5e/thumbs/
2016-11-22 10:15:23          0
2016-11-22 10:17:13        455 20.png
2016-11-22 10:17:13     169722 400.png
2016-11-22 10:17:12     494804 700.png

A dirty trick – Objects aren’t paths
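
The listings above follow from S3 keys living in a flat namespace, where "directories" are just a prefix-and-delimiter view. The sketch below imitates that view over an in-memory list; it is a simplified model rather than the S3 API, and the key list is a hypothetical subset mirroring the slide.

```python
def s3_ls(keys, prefix, delimiter="/"):
    """Mimic a delimiter-based listing over a flat set of keys.

    Returns (common_prefixes, objects): keys with a delimiter after the
    prefix are rolled up into a common prefix, the rest are objects.
    """
    prefixes, objects = set(), []
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter in rest:
            prefixes.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
        else:
            objects.append(key)
    return sorted(prefixes), objects


BASE = "1f/ee/1feed5-1337-d00d-2ba5e"
KEYS = [
    BASE,                         # a 1 MB object
    BASE + "/mobile/",            # zero-byte placeholder object
    BASE + "/mobile/400.jpg",
    BASE + "/thumbs/",            # zero-byte placeholder object
    BASE + "/thumbs/20.png",
]

print(s3_ls(KEYS, BASE))          # one common prefix AND one object
print(s3_ls(KEYS, BASE + "/"))    # two common prefixes, no objects
```

The same name can therefore be an object and a prefix at once, which a real filesystem path cannot.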

$ ls -la
drwxrwxr-x. 2 djhanson djhanson 4096 Dec 25 10:38 bar
-rw-rw-r--. 1 djhanson djhanson    0 Dec 25 10:37 foo.bar
-rw-rw-r--. 1 djhanson djhanson    0 Dec 25 10:37 foo.baz
-rw-rw-r--. 1 djhanson djhanson    0 Dec 25 10:37 foo.qux
$ mv foo.bar bar    # Works: the directory exists
$ mv foo.baz baz    # Oops, not what we wanted!
$ mv foo.qux qux/   # Fails appropriately
mv: cannot move `foo.qux' to `qux/': Not a directory
$ find .
.
./bar
./bar/foo.bar       # Desired state
./baz               # This is not what we wanted
./foo.qux           # Proper error condition

The power of the trailing slash
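
The same guard can be made explicit in code. The helper below is a hypothetical sketch (the name `move_into_dir` is not from the talk): it refuses to run unless the destination is an existing directory, which is the behavior the trailing slash buys you with `mv`.

```python
import os
import shutil


def move_into_dir(src: str, dst_dir: str) -> str:
    """Move src into dst_dir, failing if dst_dir is not an existing directory.

    Without this check a move to a missing destination silently becomes
    a rename, as with `mv foo.baz baz` above.
    """
    if not os.path.isdir(dst_dir):
        raise NotADirectoryError(f"{dst_dir!r} is not an existing directory")
    target = os.path.join(dst_dir, os.path.basename(src))
    shutil.move(src, target)
    return target
```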

$ aws s3 cp s3://icanhazbukkit/1f/ee/1feed5-1337-d00d-2ba5e/ ./picture
$ md5sum ./picture
d41d8cd98f00b204e9800998ecf8427e
$ aws s3 cp s3://icanhazbukkit/1f/ee/1feed5-1337-d00d-2ba5e ./picture
$ md5sum ./picture
1cdb80e2693da95e7fa647895d6277c8
$ aws s3 ls icanhazbukkit/1f/ee/1feed5-1337-d00d-2ba5e
                           PRE 1feed5-1337-d00d-2ba5e/
2016-12-25 13:29:10    1048576 1feed5-1337-d00d-2ba5e
$ aws s3 ls icanhazbukkit/1f/ee/1feed5-1337-d00d-2ba5e/
                           PRE mobile/
                           PRE thumbs/
2016-12-25 13:29:44         18 meta.json

Take care when operating against paths
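
The first checksum in the transcript is a recognizable value: it is the MD5 of zero bytes, which suggests the copy with the trailing slash fetched an empty placeholder object rather than the picture. A small check along these lines can flag that case; the helper name is illustrative.

```python
import hashlib

# MD5 digest of empty input.
EMPTY_MD5 = hashlib.md5(b"").hexdigest()


def looks_like_empty_download(md5_hex: str) -> bool:
    """True if the checksum is that of a zero-byte file."""
    return md5_hex.lower() == EMPTY_MD5


print(EMPTY_MD5)
print(looks_like_empty_download("d41d8cd98f00b204e9800998ecf8427e"))
print(looks_like_empty_download("1cdb80e2693da95e7fa647895d6277c8"))
```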

X-AMZ-META-FTW

X-AMZ-META-BILLTO: ALICE

X-AMZ-META-CREATOR: BOB

X-AMZ-META-STYLE: CLASSIFIED

X-AMZ-META-RELATIONSHIP: COMPLICATED

The power of the trailing slash
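
User-defined object metadata travels as HTTP headers with an `x-amz-meta-` prefix, as in the examples above. The sketch below only builds that header mapping from a plain dictionary; the helper name is hypothetical, and SDKs such as boto3 typically take the unprefixed names (for example through a `Metadata` argument on `put_object`) and add the prefix themselves.

```python
def to_amz_meta_headers(metadata: dict) -> dict:
    """Turn user metadata into x-amz-meta-* request headers.

    Header names are lowercased here for consistency.
    """
    return {f"x-amz-meta-{key.lower()}": str(value)
            for key, value in metadata.items()}


headers = to_amz_meta_headers({
    "BillTo": "ALICE",
    "Creator": "BOB",
    "Style": "CLASSIFIED",
    "Relationship": "COMPLICATED",
})
for name, value in headers.items():
    print(f"{name}: {value}")
```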

A caveat about consistency

Related Sessions

• For more about Quantcast’s experiences with other AWS services, check out DAT310 - Building Real-Time Campaign Analytics Using AWS Services

• For more info on S3, check out STG303 - Deep Dive on Amazon S3

Thank you!

Remember to complete your evaluations!