© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Ryan Nienhuis, Sr. Technical Product Manager, Amazon Kinesis
Ram Kumar Rengaswamy, co-founder and CTO, Beeswax
November 29, 2016
BDM403
Beeswax: Building a Real-Time Streaming Data Platform on AWS
What to Expect from the Session
• Introduction to Amazon Kinesis as a platform for real-time streaming data on AWS
• Key considerations for building an end-to-end streaming platform using Amazon Kinesis Streams
• Introduction to the Beeswax real-time bidding platform built on AWS using Amazon Kinesis, Amazon Redshift, Amazon S3, and AWS Data Pipeline
• Deep dive into best practices for streaming data using these services
What is streaming data?
An unbounded sequence of events that is continuously captured and processed with low latency.
Amazon Kinesis: Streaming Data Made Easy
Services make it easy to capture, deliver, and process streams on AWS
Amazon Kinesis Streams
Amazon Kinesis Analytics
Amazon Kinesis Firehose
Amazon Kinesis Streams
• Easy administration
• Build real-time applications with framework of choice
• Low cost
Amazon Kinesis Firehose
• Zero administration
• Direct-to-data store integration
• Seamless elasticity
Amazon Kinesis Analytics
• Apply SQL on streams
• Build real-time, stream processing applications
• Easy scalability
Key Concepts for Amazon Kinesis Streams
Amazon Kinesis Streams Key Concepts
[Architecture diagram: data sources send records through the AWS endpoint into an Amazon Kinesis stream made up of shards (Shard 1 … Shard N) replicated across Availability Zones; consumer applications such as App.1 (aggregate & de-duplicate), App.2 (metric extraction), App.3 (sliding-window analysis), and App.4 (machine learning) read the stream and feed downstream systems including Amazon S3, Amazon Redshift, AWS Lambda, and Amazon Kinesis Analytics]
An Amazon Kinesis stream
• Streams are made of shards
• Each shard is a unit of parallelism and throughput
• Serves as a durable temporal buffer with data stored 1 - 7 days
• Scale by splitting and merging shards (see the sketch below)
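As an illustration of resharding, here is a minimal boto3 sketch that splits a shard at the midpoint of its hash-key range and shows how two adjacent shards could later be merged. The stream name is illustrative, and a production workflow would pick shards based on their measured traffic.

```python
import boto3

kinesis = boto3.client("kinesis")

# Describe the stream to find the shard to split (stream name is illustrative).
stream = kinesis.describe_stream(StreamName="example-stream")
shard = stream["StreamDescription"]["Shards"][0]

# Split the shard at the midpoint of its hash-key range to double
# throughput for that key range.
lo = int(shard["HashKeyRange"]["StartingHashKey"])
hi = int(shard["HashKeyRange"]["EndingHashKey"])
kinesis.split_shard(
    StreamName="example-stream",
    ShardToSplit=shard["ShardId"],
    NewStartingHashKey=str((lo + hi) // 2),
)

# Two adjacent shards can later be merged back to reduce cost:
# kinesis.merge_shards(
#     StreamName="example-stream",
#     ShardToMerge="shardId-000000000001",
#     AdjacentShardToMerge="shardId-000000000002",
# )
```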
Putting Data into an Amazon Kinesis stream
• Data producers call PutRecord(s) to send data to an Amazon Kinesis stream
• The partition key determines which shard the data is stored in
• Each shard supports 1 MB in / 2 MB out per second
• Each record gets a unique sequence number
• Options for writing: AWS SDKs, Amazon Kinesis Producer Library (KPL), Amazon Kinesis agent, Fluentd, Flume, and more… (see the sketch below)
[Diagram: many data producers writing in parallel to a Kinesis stream partitioned into Shard 1 … Shard n]
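A minimal sketch of writing with the AWS SDK for Python (boto3): events are batched into a single PutRecords call (up to 500 records per call) with a random partition key so records spread evenly across shards. The stream name and event payloads are illustrative; real producers would also back off and cap their retries.

```python
import json
import uuid

import boto3

kinesis = boto3.client("kinesis")

def put_events(events, stream_name="example-stream"):
    """Batch up to 500 events per PutRecords call; a random partition key
    spreads records evenly across shards."""
    records = [
        {"Data": json.dumps(e).encode("utf-8"), "PartitionKey": uuid.uuid4().hex}
        for e in events
    ]
    response = kinesis.put_records(StreamName=stream_name, Records=records)
    # PutRecords is not all-or-nothing: retry only the records that failed.
    if response["FailedRecordCount"]:
        failed = [records[i] for i, r in enumerate(response["Records"])
                  if "ErrorCode" in r]
        kinesis.put_records(StreamName=stream_name, Records=failed)

put_events([{"event": "click", "user": "u-123"}])
```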
Key considerations for data producers
• Connectivity – Lost connectivity and latency fluctuations
• Durability – Capture most or all records in the event of failure
• Efficiency – The producer's primary job is often not collection
• Distributed – Record ordering and retry strategies
Most customers choose to do some buffering and use a random partition key; there are many strategies for failover
Getting Data from an Amazon Kinesis stream
• Consumer applications read each shard continuously using GetRecords and determine where to start using GetShardIterator
• The read model is per shard
• Increasing the number of shards increases scalability but reduces processing locality
• Options: Amazon Kinesis Client Library (KCL) on Amazon EC2, Amazon Kinesis Analytics, AWS Lambda, Spark Streaming (Amazon EMR), Storm on EC2, and more… (see the sketch below)
[Diagram: a Kinesis stream of Shard 1 … Shard n read by multiple consumer applications]
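A minimal boto3 sketch of the raw read path for a single shard: obtain a shard iterator, then call GetRecords in a loop. The stream name is illustrative; production consumers typically use the KCL or another framework from the list above rather than polling shards by hand.

```python
import time

import boto3

kinesis = boto3.client("kinesis")
stream_name = "example-stream"

# Start from the oldest record still in the shard (TRIM_HORIZON);
# LATEST or AT_SEQUENCE_NUMBER are other common starting points.
shard_id = kinesis.describe_stream(StreamName=stream_name)[
    "StreamDescription"]["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName=stream_name,
    ShardId=shard_id,
    ShardIteratorType="TRIM_HORIZON",
)["ShardIterator"]

while iterator:
    result = kinesis.get_records(ShardIterator=iterator, Limit=1000)
    for record in result["Records"]:
        print(record["SequenceNumber"], record["Data"])
    iterator = result.get("NextShardIterator")
    time.sleep(0.2)  # stay under the 5 GetRecords calls/sec/shard limit
```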
Amazon Kinesis Client Library (KCL)
• Open source and available for Java, Ruby, Python, and Node.js development
• Deploy on your EC2 instances; scales easily with Elastic Beanstalk
• Two important components:
1. Record processor – Processing unit that processes data from a shard in Amazon Kinesis Streams
2. Worker – Processing unit that maps to each application instance
• Key features include load balancing, shard mapping, checkpointing, and CloudWatch monitoring (see the sketch below)
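As a rough illustration, here is a skeleton of a Python record processor for the KCL's MultiLangDaemon (the amazon_kclpy package); exact method signatures vary between KCL versions, so treat this as a sketch rather than a drop-in implementation.

```python
import base64

from amazon_kclpy import kcl

class RecordProcessor(kcl.RecordProcessorBase):
    """Processes records from one shard; the KCL worker handles load
    balancing, shard mapping, and lease management around it."""

    def initialize(self, shard_id):
        self.shard_id = shard_id

    def process_records(self, records, checkpointer):
        for record in records:
            payload = base64.b64decode(record.get("data"))
            # ... application-specific processing goes here ...
        # Checkpoint so a restarted worker resumes after these records.
        checkpointer.checkpoint()

    def shutdown(self, checkpointer, reason):
        if reason == "TERMINATE":
            checkpointer.checkpoint()

if __name__ == "__main__":
    kcl.KCLProcess(RecordProcessor()).run()
```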
Key considerations for data consumer apps
• Scale – Have ready mechanisms for increasing parallelism and adding compute
• Availability – Always be reading the latest data and monitor stream position
• Accuracy – Implement at-least-once processing logic, and exactly-once at the destination (if you need it)
• Speed – Scale-test your logic to ensure linear scalability
• Replay – Have a retry strategy
Key considerations for the end-to-end solution
• Use cases – Start with a simple one, then progress to more advanced ones
• Data variety – Must support different data formats and schemas; centralized or decentralized management
• Integrations – Determine guarantees and where to apply back pressure
• Fanning out or in – Determine whether to use multiple consumers, multiple streams, or both
Beeswax: Powering the next generation of real-time bidding
Who are we?
A startup based in NYC, founded by ex-Googlers
We are hiring! https://www.beeswax.com/careers
We do RTB (real-time bidding)
[Diagram: the RTB auction flow, end to end in < 200 ms]
Step 1: The publisher sends the ad request & user ID to the ad exchange
Step 2: The exchange broadcasts the bid request to bidders, including the Beeswax bidder (scale: O(M) QPS; 99th-percentile latency: 20 ms), which targets campaigns and user profiles, optimizes for ROI, and customizes bids
Step 3: Bidders submit a bid & ad markup to the auction
Step 4: The winning ad is shown to the user
Building a bidder is very hard
Need scale to deliver campaigns
• To reach the desired audience, the bidder needs to process at least 1M QPS
• Deployment has to be in multiple regions to guarantee reach
Performance
• The timeout from ad exchanges is 100 ms, including the RTT over the internet
• The 99th-percentile tail latency for processing a bid request is 20 ms
Complex ecosystem
• Manage integrations with ad exchanges, third-party data providers, and vendors
• Requires a lot of domain expertise to optimize the bidder for maximum performance
A difficult trade-off
• Build your own bidder: a risky investment of time and money with no guarantee of success
• Use a DSP: limited to no customization; platform lock-in
Our First Product: The Bidder-as-a-Service™
A full-stack solution deployed for each customer in a sandbox
[Diagram: a fully managed ad tech platform on AWS, combining a pre-built ecosystem and supply relationships with services you control]
Components: cookies, mobile IDs, and 3rd-party data; bidding and targeting engine; campaign management UI/API; reporting UI/API; custom bidding algos; log-level streaming; RESTful APIs; direct connections to customer-hosted services
Outline of the talk
• System architecture
• Why we chose Amazon Kinesis
• Challenge 1: Collecting very high volume streams
• Challenge 2: Stream data transformation and fan out
• Challenge 3: Joining streams and aggregation
Beeswax System Architecture
[Architecture diagram: a bid data producer and an impression & click data producer write to an event stream; a streaming message hub consumes the event stream and delivers data to customer streams, HTTP POST endpoints, S3 buckets, Amazon Redshift, and a customer API]
Why we chose Amazon Kinesis
Infrastructure requirements motivated by RTB use cases:
• Ingestion at very large scale (> 1M QPS)
• Low-latency delivery
• Reliable store of data
• Sequenced retrieval of events
Options available for consideration: 1. Amazon Kinesis  2. Apache Kafka on EC2
Reasons to choose Amazon Kinesis:
• Fully managed by AWS; a really important factor for small engineering teams
• Supports the scale necessary for RTB
• The pricing model provided opportunities to optimize cost
Problem 1: Collecting high volume streams
Listening Bidders
• Filter very high QPS bid stream using Boolean targeting expressions
• Sample filtered stream and deliver
Challenges
• Collection at very high scale (QPS > 1M)
• Minimize infrastructure cost
• Minimize delivery latency for stream output ( < 10s)
[Diagram: bids at O(M) QPS flow through filtering and sampling to produce a filtered bid stream]
Solution 1: Optimized Data Producers
Cost vs. reliability trade-off
• Uploads are priced in PUT payload units of 25 KB
• Buffer incoming records and pack them into a single PUT payload
• Possible data loss if the application crashes before the buffer is flushed
• Be creative! We use ELB logs to replay requests to our collector
Consider overall system cost
• Compression can reduce data payload size but increases data producer CPU usage
• Evaluate the compression vs. cost trade-off. For example, we chose Snappy over gzip
Solution 1: Optimized Data Producers
Throughput vs. latency
• Buffering increases throughput because more data is uploaded per API call
• It also increases average latency; not a concern for very high QPS collectors
• Flush buffers periodically even if they are not full, to cap latency
Choose uniformly distributed partition keys (see the sketch below)
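A sketch of the buffering approach under the assumptions above: small events are packed into roughly one 25 KB PUT payload unit, compressed with Snappy, and flushed on size or age with a uniformly distributed (random) partition key. The stream name, thresholds, and newline-delimited JSON framing are illustrative, not Beeswax's actual producer.

```python
import json
import time
import uuid

import boto3
import snappy  # python-snappy; cheaper on producer CPU than gzip

kinesis = boto3.client("kinesis")

class BufferingProducer:
    """Packs many small events into one ~25 KB PUT payload unit and flushes
    on size or age, trading a little latency (and durability, if the process
    dies with a non-empty buffer) for much lower cost."""

    def __init__(self, stream_name, max_bytes=25_000, max_age_s=1.0):
        self.stream_name = stream_name
        self.max_bytes = max_bytes
        self.max_age_s = max_age_s
        self.buffer, self.buffered_bytes = [], 0
        self.last_flush = time.time()

    def put(self, event):
        encoded = json.dumps(event).encode("utf-8")
        self.buffer.append(encoded)
        self.buffered_bytes += len(encoded)
        if (self.buffered_bytes >= self.max_bytes
                or time.time() - self.last_flush >= self.max_age_s):
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        payload = snappy.compress(b"\n".join(self.buffer))
        kinesis.put_record(
            StreamName=self.stream_name,
            Data=payload,
            PartitionKey=uuid.uuid4().hex,  # uniformly distributed key
        )
        self.buffer, self.buffered_bytes = [], 0
        self.last_flush = time.time()
```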
Problem 2: Data transformation and fan out
API-driven, transparent, and flexible platform
• Provide very detailed log-level data to all our customers
• Support multiple delivery destinations and data formats
Challenges
• A config-driven system determines the format, schema, and destination of each record
• Maximize resource utilization by scaling elastically with stream volume
• Monitoring and operating the service
[Diagram: the event stream flows through a transform and fan-out stage]
Solution 2: API-driven Streaming Message Hub
• KCL application deployed to an Auto Scaling group
• CloudWatch alarms on CPU utilization elastically resize the fleet
• Adapters perform schema and data format transformations
• Emitters buffer data in memory and flush periodically to the destination
• The stream is checkpointed after records are flushed by emitters (see the sketch below)
[Diagram: each Kinesis record passes through adapters (BidAdapters, WinAdapters, ClickAdapters, …) and is routed to emitters (S3Emitter, HTTPEmitter, KinesisEmitter, …)]
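A schematic sketch of the adapter/emitter pattern described above. The class names, config shape, and buffering thresholds are illustrative, not Beeswax's actual code; the real message hub is a KCL application, and the key property to preserve is that the stream is checkpointed only after the emitters have flushed.

```python
import json
import uuid

import boto3

class Adapter:
    """Transforms a raw event into a customer-specific schema and format."""

    def __init__(self, config):
        self.fields = config["fields"]            # illustrative config shape
        self.fmt = config.get("format", "json")

    def transform(self, event):
        row = {f: event.get(f) for f in self.fields}
        if self.fmt == "json":
            return json.dumps(row)
        return ",".join(str(row[f]) for f in self.fields)  # e.g. CSV

class S3Emitter:
    """Buffers transformed rows in memory and flushes them to S3 in batches."""

    def __init__(self, bucket, prefix, max_buffered=10_000):
        self.s3 = boto3.client("s3")
        self.bucket, self.prefix = bucket, prefix
        self.max_buffered = max_buffered
        self.rows = []

    def emit(self, row):
        self.rows.append(row)
        if len(self.rows) >= self.max_buffered:
            self.flush()

    def flush(self):
        if not self.rows:
            return
        body = "\n".join(self.rows).encode("utf-8")
        self.s3.put_object(Bucket=self.bucket,
                           Key=f"{self.prefix}/{uuid.uuid4()}.log",
                           Body=body)
        self.rows = []
        # Only after every emitter has flushed should the KCL checkpoint advance.
```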
Streaming message hub design tradeoffs
Single reader vs multiple readers
• Separate reader for every format & destination instead of a single reader
• Having separate readers improves fault tolerance
• However, the CPU cost of parsing records is minimized with a single reader
EC2 vs. Lambda
• AWS Lambda could be used instead of a self-managed Auto Scaling group
• Spot Instances deeply cut down the costs of the self-managed solution
• The rich set of Amazon Kinesis stream metrics simplified monitoring and management of the service
Streaming message hub design tradeoffs
Amazon Kinesis Streams vs. Amazon Kinesis Firehose
• Firehose does not support record-level fan out or arbitrary data transformations
• With those enhancements, it would be preferred over self-managed Auto Scaling on EC2
Operating streaming message hub
Scale: ~300 shards, 250 MB/sec
Use the CloudWatch metrics published by Amazon Kinesis Streams (see the sketch below)
Amazon Kinesis capacity alert
• Alert upon approaching 80% of capacity
• Manually reshard Amazon Kinesis using KinesisScalingUtils (or the new scaling API)
Reader falling behind alert
• Alert if the average iterator age is greater than 20 sec
• Ensure the reader application is up, examine its custom metrics, and triage
Management overhead – we have roughly 2 “incidents” per month
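A hedged boto3 sketch of the "reader falling behind" alert: a CloudWatch alarm on the stream's GetRecords.IteratorAgeMilliseconds metric. The stream name, SNS topic ARN, and exact evaluation settings are illustrative; the 80% capacity alert can be built the same way on IncomingBytes or IncomingRecords compared against the shard count.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Page the on-call if the consumer's iterator age (how far the reader lags
# behind the tip of the stream) averages above 20 seconds.
cloudwatch.put_metric_alarm(
    AlarmName="message-hub-reader-falling-behind",
    Namespace="AWS/Kinesis",
    MetricName="GetRecords.IteratorAgeMilliseconds",
    Dimensions=[{"Name": "StreamName", "Value": "example-stream"}],
    Statistic="Average",
    Period=60,
    EvaluationPeriods=1,
    Threshold=20_000,          # 20 seconds, in milliseconds
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:oncall-topic"],
)
```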
Problem 3: Joining and aggregation
High level value added services
• Joined data directly feeds into model-building pipelines for clicks, etc.
• A reporting API, powered by an ETL pipeline, provides aggregated metrics
Challenges
• Supporting exactly-once semantics, i.e., eliminating all duplicates
• Minimize end-to-end latency from capture to joining & aggregation
• Be robust to delays between arrival times of correlated events
[Diagram: bids, impressions, and clicks/conversions flow into a joining and aggregation stage]
Solution 3: Stream joins using Amazon Redshift
• The message hub emits separate log files into S3 for each event type
• AWS Data Pipeline periodically loads the log files into Amazon Redshift
• Amazon Redshift tables of different event types are joined via a primary key (see the sketch below)
• Fast path: joined events within 15 min, but can miss delayed events
• Slow path: fully joined events after 24 hours
[Diagram: the streaming message hub writes to S3 buckets; AWS Data Pipeline loads the data into Amazon Redshift]
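For illustration, a sketch of the kind of join the pipeline could run inside Amazon Redshift, using psycopg2 and hypothetical table and column names (bids, impressions, joined_events, event_id); the actual Beeswax schema and de-duplication logic are not shown in the slides.

```python
import psycopg2

# Connection parameters are illustrative.
conn = psycopg2.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="events", user="etl", password="...",
)

JOIN_SQL = """
-- Join bid and impression events on their shared event id; rows are
-- de-duplicated first so the output has exactly-once semantics.
INSERT INTO joined_events
SELECT b.event_id, b.bid_time, b.bid_price, i.impression_time
FROM (SELECT DISTINCT * FROM bids) b
JOIN (SELECT DISTINCT * FROM impressions) i
  ON b.event_id = i.event_id
WHERE b.bid_time >= DATEADD(minute, -15, GETDATE());
"""

with conn, conn.cursor() as cur:
    cur.execute(JOIN_SQL)
```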
Stream join design trade-offs
Joins are not truly streaming in the current design
• The 15-minute batch size is dictated by the lowest scheduling interval for AWS Data Pipeline
• Lambda could be used instead of AWS Data Pipeline to lower the scheduling interval
• Data loaded into Amazon Redshift cannot easily be fed back into Amazon Kinesis streams
• However, the design scales well, is fully managed by AWS, and supports many of our use cases
What are the alternatives?
• Spark Streaming on EMR
• Amazon Kinesis Analytics
Early thoughts on comparing the alternatives
• Amazon Kinesis Analytics is fully managed; Spark Streaming is not
• Amazon Kinesis Analytics has usage-based pricing; Spark requires careful capacity planning
• We still need to evaluate Amazon Kinesis Analytics on scale and on support for arbitrary data formats
Summary
Building real-time bidding (RTB) applications is very challenging
Beeswax provides a managed platform to build RTB apps on AWS
Beeswax uses Amazon Kinesis as infrastructure for streaming data
Beeswax platform solves key streaming data challenges
• Supports event collection at very large scale
• API-driven platform for data transformation and fan out
• Supports joining of streams and aggregation of metrics
Trade-offs are unique to each application; Beeswax is optimized for RTB
Thank you!
Remember to complete your evaluations!
Reference
We have many AWS Big Data Blog posts which cover more examples. Full list here. Some good ones:
1. Amazon Kinesis Streams
1. Implement Efficient and Reliable Producers with the Amazon Kinesis Producer Library
2. Presto and Amazon Kinesis
3. Querying Amazon Kinesis Streams Directly with SQL and Spark Streaming
4. Optimize Spark-Streaming to Efficiently Process Amazon Kinesis Streams
2. Amazon Kinesis Firehose
1. Persist Streaming Data to Amazon S3 using Amazon Kinesis Firehose and AWS Lambda
2. Building a Near Real-Time Discovery Platform with AWS
3. Amazon Kinesis Analytics
1. Writing SQL on Streaming Data With Amazon Kinesis Analytics Part 1 | Part 2
2. Real-time Clickstream Anomaly Detection with Amazon Kinesis Analytics
• Technical documentation
• Amazon Kinesis Agent
• Amazon Kinesis Streams and Spark Streaming
• Amazon Kinesis Producer Library Best Practice
• Amazon Kinesis Firehose and AWS Lambda
• Building Near Real-Time Discovery Platform with Amazon Kinesis
• Public case studies
• Glu mobile – Real-Time Analytics
• Hearst Publishing – Clickstream Analytics
• How Sonos Leverages Amazon Kinesis
• Nordstrom Online Stylist
Reference
Detailed system architecture
[Diagram annotations:]
• Event stream – partition key = F(EventId)
• Config store
• Event producer – reliable, record-level retries
• Bid producer – high throughput, stream compression, batches records with a flush timeout
• Streaming message hub – KCL application, autoscales, at-least-once processing, record format transforms, routing to custom sinks, stream window analytics
• Customer log stream – partition key = EventId
• Customer HTTP POST – Protobuf/JSON payload
• S3 storage – CSV data, customer bucket
• Amazon Redshift – join by EventId, exactly once, fast path 30 min
• Data Pipeline
Streaming data in a real-time bidding application
[Diagram: data sources (bids: O(1M) TPS, wins: O(10K) TPS, clicks: O(1K) TPS) flow through filtering and sampling, joining and aggregation, and analytics and reporting stages to consumers in various formats]