© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Ryan Nienhuis, Sr. Technical Product Manager, Amazon Kinesis
Ram Kumar Rengaswamy, co-founder and CTO, Beeswax
November 29, 2016
BDM403
Beeswax: Building a Real-Time Streaming Data Platform on AWS
What to Expect from the Session
• Introduction to Amazon Kinesis as a platform for real-time streaming data on AWS
• Key considerations for building an end-to-end streaming platform using Amazon Kinesis Streams
• Introduction to the Beeswax real-time bidding platform built on AWS using Amazon Kinesis, Amazon Redshift, Amazon S3, and AWS Data Pipeline
• Deep dive into best practices for streaming data using these services
What is streaming data?
An unbounded sequence of events that is continuously captured and processed with low latency.
Amazon Kinesis: Streaming Data Made Easy
Services make it easy to capture, deliver, and process streams on AWS
Amazon Kinesis Streams
Amazon Kinesis Analytics
Amazon Kinesis Firehose
Amazon Kinesis Streams
• Easy administration
• Build real-time applications with framework of choice
• Low cost
Amazon Kinesis Firehose
• Zero administration
• Direct-to-data store integration
• Seamless elasticity
Amazon Kinesis Analytics
• Apply SQL on streams
• Build real-time, stream processing applications
• Easy scalability
Key Concepts for Amazon Kinesis Streams
Amazon Kinesis Streams Key Concepts
[Architecture diagram: data sources send records through the AWS endpoint into an Amazon Kinesis stream made up of shards (Shard 1 … Shard N) replicated across Availability Zones; consumer applications such as App.1 (aggregate & de-duplicate), App.2 (metric extraction), App.3 (sliding-window analysis), and App.4 (machine learning) read the stream and feed downstream systems including Amazon S3, Amazon Redshift, AWS Lambda, and Amazon Kinesis Analytics]
An Amazon Kinesis stream
• Streams are made of shards
• Each shard is a unit of parallelism and throughput
• Serves as a durable temporal buffer with data stored 1 - 7 days
• Scale by splitting and merging shards (see the sketch below)
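As an illustration of resharding, here is a minimal boto3 sketch that splits a shard at the midpoint of its hash-key range and shows how two adjacent shards could later be merged. The stream name is illustrative, and a production workflow would pick shards based on their measured traffic.

```python
import boto3

kinesis = boto3.client("kinesis")

# Describe the stream to find the shard to split (stream name is illustrative).
stream = kinesis.describe_stream(StreamName="example-stream")
shard = stream["StreamDescription"]["Shards"][0]

# Split the shard at the midpoint of its hash-key range to double
# throughput for that key range.
lo = int(shard["HashKeyRange"]["StartingHashKey"])
hi = int(shard["HashKeyRange"]["EndingHashKey"])
kinesis.split_shard(
    StreamName="example-stream",
    ShardToSplit=shard["ShardId"],
    NewStartingHashKey=str((lo + hi) // 2),
)

# Two adjacent shards can later be merged back to reduce cost:
# kinesis.merge_shards(
#     StreamName="example-stream",
#     ShardToMerge="shardId-000000000001",
#     AdjacentShardToMerge="shardId-000000000002",
# )
```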
Putting Data into an Amazon Kinesis stream
• Data producers call PutRecord(s) to send data to an Amazon Kinesis stream
• The partition key determines which shard the data is stored in
• Each shard supports 1 MB in / 2 MB out per second
• Each record gets a unique sequence number
• Options for writing: AWS SDKs, Amazon Kinesis Producer Library (KPL), Amazon Kinesis agent, Fluentd, Flume, and more… (see the sketch below)
[Diagram: many data producers writing in parallel to a Kinesis stream partitioned into Shard 1 … Shard n]
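A minimal sketch of writing with the AWS SDK for Python (boto3): events are batched into a single PutRecords call (up to 500 records per call) with a random partition key so records spread evenly across shards. The stream name and event payloads are illustrative; real producers would also back off and cap their retries.

```python
import json
import uuid

import boto3

kinesis = boto3.client("kinesis")

def put_events(events, stream_name="example-stream"):
    """Batch up to 500 events per PutRecords call; a random partition key
    spreads records evenly across shards."""
    records = [
        {"Data": json.dumps(e).encode("utf-8"), "PartitionKey": uuid.uuid4().hex}
        for e in events
    ]
    response = kinesis.put_records(StreamName=stream_name, Records=records)
    # PutRecords is not all-or-nothing: retry only the records that failed.
    if response["FailedRecordCount"]:
        failed = [records[i] for i, r in enumerate(response["Records"])
                  if "ErrorCode" in r]
        kinesis.put_records(StreamName=stream_name, Records=failed)

put_events([{"event": "click", "user": "u-123"}])
```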
Key considerations for data producers
• Connectivity – Lost connectivity and latency fluctuations
• Durability – Capture most or all records in the event of failure
• Efficiency – The producer's primary job is often not collection
• Distributed – Record ordering and retry strategies
Most customers choose to do some buffering and use a random partition key; there are many strategies for failover
Getting Data from an Amazon Kinesis stream
• Consumer applications read each shard continuously using GetRecords and determine where to start using GetShardIterator
• The read model is per shard
• Increasing the number of shards increases scalability but reduces processing locality
• Options: Amazon Kinesis Client Library (KCL) on Amazon EC2, Amazon Kinesis Analytics, AWS Lambda, Spark Streaming (Amazon EMR), Storm on EC2, and more… (see the sketch below)
[Diagram: a Kinesis stream of Shard 1 … Shard n read by multiple consumer applications]
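A minimal boto3 sketch of the raw read path for a single shard: obtain a shard iterator, then call GetRecords in a loop. The stream name is illustrative; production consumers typically use the KCL or another framework from the list above rather than polling shards by hand.

```python
import time

import boto3

kinesis = boto3.client("kinesis")
stream_name = "example-stream"

# Start from the oldest record still in the shard (TRIM_HORIZON);
# LATEST or AT_SEQUENCE_NUMBER are other common starting points.
shard_id = kinesis.describe_stream(StreamName=stream_name)[
    "StreamDescription"]["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName=stream_name,
    ShardId=shard_id,
    ShardIteratorType="TRIM_HORIZON",
)["ShardIterator"]

while iterator:
    result = kinesis.get_records(ShardIterator=iterator, Limit=1000)
    for record in result["Records"]:
        print(record["SequenceNumber"], record["Data"])
    iterator = result.get("NextShardIterator")
    time.sleep(0.2)  # stay under the 5 GetRecords calls/sec/shard limit
```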
Amazon Kinesis Client Library (KCL)
• Open source and available for Java, Ruby, Python, and Node.js development
• Deploy on your EC2 instances; scales easily with Elastic Beanstalk
• Two important components:
1. Record processor – Processing unit that processes data from a shard in Amazon Kinesis Streams
2. Worker – Processing unit that maps to each application instance
• Key features include load balancing, shard mapping, checkpointing, and CloudWatch monitoring (see the sketch below)
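As a rough illustration, here is a skeleton of a Python record processor for the KCL's MultiLangDaemon (the amazon_kclpy package); exact method signatures vary between KCL versions, so treat this as a sketch rather than a drop-in implementation.

```python
import base64

from amazon_kclpy import kcl

class RecordProcessor(kcl.RecordProcessorBase):
    """Processes records from one shard; the KCL worker handles load
    balancing, shard mapping, and lease management around it."""

    def initialize(self, shard_id):
        self.shard_id = shard_id

    def process_records(self, records, checkpointer):
        for record in records:
            payload = base64.b64decode(record.get("data"))
            # ... application-specific processing goes here ...
        # Checkpoint so a restarted worker resumes after these records.
        checkpointer.checkpoint()

    def shutdown(self, checkpointer, reason):
        if reason == "TERMINATE":
            checkpointer.checkpoint()

if __name__ == "__main__":
    kcl.KCLProcess(RecordProcessor()).run()
```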
Key considerations for data consumer apps
• Scale – Have ready mechanisms for increasing parallelism and adding compute
• Availability – Always be reading the latest data and monitor stream position
• Accuracy – Implement at-least-once processing logic, and exactly-once at the destination (if you need it)
• Speed – Scale-test your logic to ensure linear scalability
• Replay – Have a retry strategy
Key considerations for the end-to-end solution
• Use cases – Start with a simple one, then progress to more advanced ones
• Data variety – Must support different data formats and schemas; centralized or decentralized management
• Integrations – Determine guarantees and where to apply back pressure
• Fanning out or in – Determine whether to use multiple consumers, multiple streams, or both
Beeswax: Powering the next generation of real-time bidding
Who are we?
A startup based in NYC, founded by ex-Googlers
We are hiring! https://www.beeswax.com/careers
We do RTB (real-time bidding)
[Diagram: the RTB auction flow, end to end in < 200 ms]
Step 1: The publisher sends the ad request & user ID to the ad exchange
Step 2: The exchange broadcasts the bid request to bidders, including the Beeswax bidder (scale: O(M) QPS; 99th-percentile latency: 20 ms), which targets campaigns and user profiles, optimizes for ROI, and customizes bids
Step 3: Bidders submit a bid & ad markup to the auction
Step 4: The winning ad is shown to the user
Building a bidder is very hard
Need scale to deliver campaigns
• To reach the desired audience, the bidder needs to process at least 1M QPS
• Deployment has to be in multiple regions to guarantee reach
Performance
• The timeout from ad exchanges is 100 ms, including the RTT over the internet
• The 99th-percentile tail latency for processing a bid request is 20 ms
Complex ecosystem
• Manage integrations with ad exchanges, third-party data providers, and vendors
• Requires a lot of domain expertise to optimize the bidder for maximum performance
A difficult trade-off
• Build your own bidder: a risky investment of time and money with no guarantee of success
• Use a DSP: limited to no customization; platform lock-in
Our First Product: The Bidder-as-a-Service™
A full-stack solution deployed for each customer in a sandbox
[Diagram: a fully managed ad tech platform on AWS, combining a pre-built ecosystem and supply relationships with services you control]
Components: cookies, mobile IDs, and 3rd-party data; bidding and targeting engine; campaign management UI/API; reporting UI/API; custom bidding algos; log-level streaming; RESTful APIs; direct connections to customer-hosted services
Outline of the talk
• System architecture
• Why we chose Amazon Kinesis
• Challenge 1: Collecting very high volume streams
• Challenge 2: Stream data transformation and fan out
• Challenge 3: Joining streams and aggregation
Beeswax System Architecture
[Architecture diagram: a bid data producer and an impression & click data producer write to an event stream; a streaming message hub consumes the event stream and delivers data to customer streams, HTTP POST endpoints, S3 buckets, Amazon Redshift, and a customer API]
Why we chose Amazon Kinesis
Infrastructure requirements motivated by RTB use cases:
• Ingestion at very large scale (> 1M QPS)
• Low-latency delivery
• Reliable store of data
• Sequenced retrieval of events
Options available for consideration: 1. Amazon Kinesis  2. Apache Kafka on EC2
Reasons to choose Amazon Kinesis:
• Fully managed by AWS; a really important factor for small engineering teams
• Supports the scale necessary for RTB
• The pricing model provided opportunities to optimize cost
Problem 1: Collecting high volume streams
Listening Bidders
• Filter very high QPS bid stream using Boolean targeting expressions
• Sample filtered stream and deliver
Challenges
• Collection at very high scale (QPS > 1M)
• Minimize infrastructure cost
• Minimize delivery latency for stream output ( < 10s)
[Diagram: bids at O(M) QPS flow through filtering and sampling to produce a filtered bid stream]
Solution 1: Optimized Data Producers
Cost vs. reliability trade-off
• Uploads are priced in PUT payload units of 25 KB
• Buffer incoming records and pack them into a single PUT payload
• Possible data loss if the application crashes before the buffer is flushed
• Be creative! We use ELB logs to replay requests to our collector
Consider overall system cost
• Compression can reduce data payload size but increases data producer CPU usage
• Evaluate the compression vs. cost trade-off. For example, we chose Snappy over gzip
Solution 1: Optimized Data Producers
Throughput vs. latency
• Buffering increases throughput because more data is uploaded per API call
• It also increases average latency; not a concern for very high QPS collectors
• Flush buffers periodically even if they are not full, to cap latency
Choose uniformly distributed partition keys (see the sketch below)
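A sketch of the buffering approach under the assumptions above: small events are packed into roughly one 25 KB PUT payload unit, compressed with Snappy, and flushed on size or age with a uniformly distributed (random) partition key. The stream name, thresholds, and newline-delimited JSON framing are illustrative, not Beeswax's actual producer.

```python
import json
import time
import uuid

import boto3
import snappy  # python-snappy; cheaper on producer CPU than gzip

kinesis = boto3.client("kinesis")

class BufferingProducer:
    """Packs many small events into one ~25 KB PUT payload unit and flushes
    on size or age, trading a little latency (and durability, if the process
    dies with a non-empty buffer) for much lower cost."""

    def __init__(self, stream_name, max_bytes=25_000, max_age_s=1.0):
        self.stream_name = stream_name
        self.max_bytes = max_bytes
        self.max_age_s = max_age_s
        self.buffer, self.buffered_bytes = [], 0
        self.last_flush = time.time()

    def put(self, event):
        encoded = json.dumps(event).encode("utf-8")
        self.buffer.append(encoded)
        self.buffered_bytes += len(encoded)
        if (self.buffered_bytes >= self.max_bytes
                or time.time() - self.last_flush >= self.max_age_s):
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        payload = snappy.compress(b"\n".join(self.buffer))
        kinesis.put_record(
            StreamName=self.stream_name,
            Data=payload,
            PartitionKey=uuid.uuid4().hex,  # uniformly distributed key
        )
        self.buffer, self.buffered_bytes = [], 0
        self.last_flush = time.time()
```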
Problem 2: Data transformation and fan out
API-driven, transparent, and flexible platform
• Provide very detailed log-level data to all our customers
• Support multiple delivery destinations and data formats
Challenges
• A config-driven system determines the format, schema, and destination of each record
• Maximize resource utilization by scaling elastically with stream volume
• Monitoring and operating the service
[Diagram: the event stream flows through a transform and fan-out stage]
Solution 2: API-driven Streaming Message Hub
• KCL application deployed to an Auto Scaling group
• CloudWatch alarms on CPU utilization elastically resize the fleet
• Adapters perform schema and data format transformations
• Emitters buffer data in memory and flush periodically to the destination
• The stream is checkpointed after records are flushed by emitters (see the sketch below)
[Diagram: each Kinesis record passes through adapters (BidAdapters, WinAdapters, ClickAdapters, …) and is routed to emitters (S3Emitter, HTTPEmitter, KinesisEmitter, …)]
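A schematic sketch of the adapter/emitter pattern described above. The class names, config shape, and buffering thresholds are illustrative, not Beeswax's actual code; the real message hub is a KCL application, and the key property to preserve is that the stream is checkpointed only after the emitters have flushed.

```python
import json
import uuid

import boto3

class Adapter:
    """Transforms a raw event into a customer-specific schema and format."""

    def __init__(self, config):
        self.fields = config["fields"]            # illustrative config shape
        self.fmt = config.get("format", "json")

    def transform(self, event):
        row = {f: event.get(f) for f in self.fields}
        if self.fmt == "json":
            return json.dumps(row)
        return ",".join(str(row[f]) for f in self.fields)  # e.g. CSV

class S3Emitter:
    """Buffers transformed rows in memory and flushes them to S3 in batches."""

    def __init__(self, bucket, prefix, max_buffered=10_000):
        self.s3 = boto3.client("s3")
        self.bucket, self.prefix = bucket, prefix
        self.max_buffered = max_buffered
        self.rows = []

    def emit(self, row):
        self.rows.append(row)
        if len(self.rows) >= self.max_buffered:
            self.flush()

    def flush(self):
        if not self.rows:
            return
        body = "\n".join(self.rows).encode("utf-8")
        self.s3.put_object(Bucket=self.bucket,
                           Key=f"{self.prefix}/{uuid.uuid4()}.log",
                           Body=body)
        self.rows = []
        # Only after every emitter has flushed should the KCL checkpoint advance.
```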
Streaming message hub design tradeoffs
Single reader vs multiple readers
• Separate reader for every format & destination instead of a single reader
• Having separate readers improves fault tolerance
• However, the CPU cost of parsing records is minimized with a single reader
EC2 vs. Lambda
• AWS Lambda could be used instead of a self-managed Auto Scaling group
• Spot Instances deeply cut down the costs of the self-managed solution
• The rich set of Amazon Kinesis stream metrics simplified monitoring and management of the service
Streaming message hub design tradeoffs
Amazon Kinesis Streams vs. Amazon Kinesis Firehose
• Firehose does not support record-level fan out or arbitrary data transformations
• With those enhancements, it would be preferred over self-managed Auto Scaling on EC2
Operating streaming message hub
Scale: ~300 shards, 250 MB/sec
Use the CloudWatch metrics published by Amazon Kinesis Streams (see the sketch below)
Amazon Kinesis capacity alert
• Alert upon approaching 80% of capacity
• Manually reshard Amazon Kinesis using KinesisScalingUtils (or the new scaling API)
Reader falling behind alert
• Alert if the average iterator age is greater than 20 sec
• Ensure the reader application is up, examine its custom metrics, and triage
Management overhead – we have roughly 2 “incidents” per month
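A hedged boto3 sketch of the "reader falling behind" alert: a CloudWatch alarm on the stream's GetRecords.IteratorAgeMilliseconds metric. The stream name, SNS topic ARN, and exact evaluation settings are illustrative; the 80% capacity alert can be built the same way on IncomingBytes or IncomingRecords compared against the shard count.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Page the on-call if the consumer's iterator age (how far the reader lags
# behind the tip of the stream) averages above 20 seconds.
cloudwatch.put_metric_alarm(
    AlarmName="message-hub-reader-falling-behind",
    Namespace="AWS/Kinesis",
    MetricName="GetRecords.IteratorAgeMilliseconds",
    Dimensions=[{"Name": "StreamName", "Value": "example-stream"}],
    Statistic="Average",
    Period=60,
    EvaluationPeriods=1,
    Threshold=20_000,          # 20 seconds, in milliseconds
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:oncall-topic"],
)
```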
Problem 3: Joining and aggregation
High level value added services
• Joined data directly feeds into model-building pipelines for clicks, etc.
• A reporting API, powered by an ETL pipeline, provides aggregated metrics
Challenges
• Supporting exactly-once semantics, i.e., eliminating all duplicates
• Minimize end-to-end latency from capture to joining & aggregation
• Be robust to delays between arrival times of correlated events
[Diagram: bids, impressions, and clicks/conversions flow into a joining and aggregation stage]
Solution 3: Stream joins using Amazon Redshift
• The message hub emits separate log files into S3 for each event type
• AWS Data Pipeline periodically loads the log files into Amazon Redshift
• Amazon Redshift tables of different event types are joined via a primary key (see the sketch below)
• Fast path: joined events within 15 min, but can miss delayed events
• Slow path: fully joined events after 24 hours
[Diagram: the streaming message hub writes to S3 buckets; AWS Data Pipeline loads the data into Amazon Redshift]
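For illustration, a sketch of the kind of join the pipeline could run inside Amazon Redshift, using psycopg2 and hypothetical table and column names (bids, impressions, joined_events, event_id); the actual Beeswax schema and de-duplication logic are not shown in the slides.

```python
import psycopg2

# Connection parameters are illustrative.
conn = psycopg2.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="events", user="etl", password="...",
)

JOIN_SQL = """
-- Join bid and impression events on their shared event id; rows are
-- de-duplicated first so the output has exactly-once semantics.
INSERT INTO joined_events
SELECT b.event_id, b.bid_time, b.bid_price, i.impression_time
FROM (SELECT DISTINCT * FROM bids) b
JOIN (SELECT DISTINCT * FROM impressions) i
  ON b.event_id = i.event_id
WHERE b.bid_time >= DATEADD(minute, -15, GETDATE());
"""

with conn, conn.cursor() as cur:
    cur.execute(JOIN_SQL)
```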
Stream join design trade-offs
Joins are not truly streaming in the current design
• The 15-minute batch size is dictated by the lowest scheduling interval for AWS Data Pipeline
• Lambda could be used instead of AWS Data Pipeline to lower the scheduling interval
• Data loaded into Amazon Redshift cannot easily be fed back into Amazon Kinesis streams
• However, the design scales well, is fully managed by AWS, and supports many of our use cases
What are the alternatives?
• Spark Streaming on EMR
• Amazon Kinesis Analytics
Early thoughts on comparing the alternatives
• Amazon Kinesis Analytics is fully managed; Spark Streaming is not
• Amazon Kinesis Analytics has usage-based pricing; Spark requires careful capacity planning
• We still need to evaluate Amazon Kinesis Analytics on scale and on support for arbitrary data formats
Summary
Building real-time bidding (RTB) applications is very challenging
Beeswax provides a managed platform to build RTB apps on AWS
Beeswax uses Amazon Kinesis as infrastructure for streaming data
Beeswax platform solves key streaming data challenges
• Supports event collection at very large scale
• API-driven platform for data transformation and fan out
• Supports joining of streams and aggregation of metrics
Trade-offs are unique to each application; Beeswax is optimized for RTB
Thank you!
Remember to complete your evaluations!
Reference
We have many AWS Big Data Blog posts which cover more examples. Full list here. Some good ones:
1. Amazon Kinesis Streams
1. Implement Efficient and Reliable Producers with the Amazon Kinesis Producer Library
2. Presto and Amazon Kinesis
3. Querying Amazon Kinesis Streams Directly with SQL and Spark Streaming
4. Optimize Spark-Streaming to Efficiently Process Amazon Kinesis Streams
2. Amazon Kinesis Firehose
1. Persist Streaming Data to Amazon S3 using Amazon Kinesis Firehose and AWS Lambda
2. Building a Near Real-Time Discovery Platform with AWS
3. Amazon Kinesis Analytics
1. Writing SQL on Streaming Data With Amazon Kinesis Analytics Part 1 | Part 2
2. Real-time Clickstream Anomaly Detection with Amazon Kinesis Analytics
• Technical documentation
• Amazon Kinesis Agent
• Amazon Kinesis Streams and Spark Streaming
• Amazon Kinesis Producer Library Best Practice
• Amazon Kinesis Firehose and AWS Lambda
• Building Near Real-Time Discovery Platform with Amazon Kinesis
• Public case studies
• Glu mobile – Real-Time Analytics
• Hearst Publishing – Clickstream Analytics
• How Sonos Leverages Amazon Kinesis
• Nordstrom Online Stylist
Reference
Detailed system architecture
[Diagram annotations:]
• Event stream – partition key = F(EventId)
• Config store
• Event producer – reliable, record-level retries
• Bid producer – high throughput, stream compression, batches records with a flush timeout
• Streaming message hub – KCL application, autoscales, at-least-once processing, record format transforms, routing to custom sinks, stream window analytics
• Customer log stream – partition key = EventId
• Customer HTTP POST – Protobuf/JSON payload
• S3 storage – CSV data, customer bucket
• Amazon Redshift – join by EventId, exactly once, fast path 30 min
• Data Pipeline
Streaming data in a real-time bidding application
[Diagram: data sources (bids: O(1M) TPS, wins: O(10K) TPS, clicks: O(1K) TPS) flow through filtering and sampling, joining and aggregation, and analytics and reporting stages to consumers in various formats]