Amazon Kinesis: Managed Service for Streaming Data Ingestion & Processing
Peter-Mark Verwoerd
Solutions Architect
o Why continuous, real-time processing?
  § Early stages of the data lifecycle demanding more
  § Developing the 'right tool for the right job'
o What can you do with streaming data today?
  § Customer scenarios
  § Current approaches
o What is Amazon Kinesis?
  § Putting data into Kinesis
  § Getting data from Kinesis Streams: building applications with the KCL
o Connecting Amazon Kinesis to other systems
  § Moving data into S3, DynamoDB, Redshift
  § Leveraging existing EMR, Storm infrastructure
The Motivation for Continuous, Real-Time Processing
Unconstrained Data Growth: Data lifecycle demands frequent, continuous understanding
• IT/Application server logs: IT infrastructure logs, metering, audit/change logs
• Web sites/Mobile apps/Ads: clickstream, user engagement
• Social media, user content: 450MM+ tweets/day
• Sensor data: weather, smart grids, wearables
{ "payerId": "Joe", "productCode": "AmazonS3", "clientProductCode": "AmazonS3", "usageType": "Bandwidth", "operation": "PUT", "value": "22490", "timestamp": "1216674828" }
Metering Record
127.0.0.1 user-identifier frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
Common Log Entry
<165>1 2003-10-11T22:14:15.003Z mymachine.example.com evntslog - ID47 [exampleSDID@32473 iut="3" eventSource="Application" eventID="1011"][examplePriority@32473 class="high"]
Syslog Entry
“SeattlePublicWater/Kinesis/123/Realtime” – 412309129140
MQTT Record
<R,AMZN ,T,G,R1>
NASDAQ OMX Record
Big data is made of small records
Typical Data Flow: Many different technologies, at different stages of evolution
[Pipeline: Client/Sensor → Aggregator → Continuous Processing → Storage → Analytics + Reporting]
Big Data
• Hourly server logs: how your systems were misbehaving an hour ago
• Weekly / Monthly bill: what you spent this past billing cycle
• Daily customer-preferences report from your web-site’s click stream: tells you what deal or ad to try next time
• Daily fraud reports: tells you if there was fraud yesterday
Real-time Big Data
• CloudWatch metrics: what just went wrong now
• Real-time spending alerts/caps: guaranteeing you can’t overspend
• Real-time analysis: tells you what to offer the current customer now
• Real-time detection: blocks fraudulent use now
Big Data: Best Served Fresh
Customer Scenarios across Industry Segments
Scenarios (1 → 2 → 3): Accelerated Ingest-Transform-Load → Continual Metrics/KPI Extraction → Responsive Data Analysis
Data types: IT infrastructure and application logs, social media, financial market data, web clickstreams, sensors, geo/location data
• Software/Technology: IT server and app log ingestion → IT operational metrics dashboards → device/sensor operational intelligence
• Digital Ad Tech/Marketing: advertising data aggregation → advertising metrics like coverage, yield, conversion → analytics on user engagement with ads, optimized bid/buy engines
• Financial Services: market/financial transaction order data collection → financial market data metrics, fraud monitoring and Value-at-Risk assessment → auditing of market order data
• Consumer Online/E-Commerce: online customer engagement data aggregation → consumer engagement metrics like page views, CTR → customer clickstream analytics, recommendation engines
Customers using Amazon Kinesis: Streaming big data processing in action

Mobile/Social Gaming:
• Need: deliver continuous, real-time game insight data from hundreds of game servers
• Before: custom-built solutions that were operationally complex to manage and not scalable
• Pain: delay with critical business data delivery; developer burden in building a reliable, scalable platform for real-time data ingestion/processing; slow-down of real-time customer insights
• Goal: accelerate time to market of elastic, real-time applications, while minimizing operational overhead

Digital Advertising Tech.:
• Need: generate real-time metrics and KPIs on online ad performance for advertisers/publishers
• Before: store-and-forward fleet of log servers, and a Hadoop-based processing pipeline
• Pain: lost data in the store/forward layer; operational burden in managing a reliable, scalable platform for real-time data ingestion/processing; batch-driven "real-time" customer insights
• Goal: generate the freshest analytics on advertiser performance to optimize marketing spend, and increase responsiveness to clients
'Typical' Technology Solution Set
o Exploring continual data ingestion & processing
o Streaming data ingestion
  • Kafka
  • Flume
  • Kestrel / Scribe
  • RabbitMQ / AMQP
o Streaming data processing
  • Storm
o Do-it-yourself (AWS-based) solution
  • EC2: logging/pass-through servers
  • EBS: holds log/other data snapshots
  • SQS: queue data store
  • S3: persistence store
  • EMR: workflow to ingest data from S3 and process it
Solution Architecture Considerations
o Flexibility: select the most appropriate software, and configure the underlying infrastructure
o Control: software and hardware can be tuned to meet specific business and scenario needs
o Ongoing operational complexity: deploy and manage an end-to-end system
o Infrastructure planning and maintenance: managing a reliable, scalable infrastructure
o Developer/IT staff expense: developer, DevOps, and IT staff time and energy expended
o Software maintenance: technology and professional services support
Data Streams Ingestion, Continuous Processing: Right Toolset for the Right Job
Real-time ingest:
• Highly scalable
• Durable
• Elastic
• Replay-able reads
Continuous processing FX:
• Load-balancing incoming streams
• Fault-tolerance, checkpoint/replay
• Elastic
Enables data movement into stores/processing engines
Managed service
Low end-to-end latency
Continuous, real-time workloads
Amazon Kinesis – An Overview
Introducing Amazon Kinesis: Service for Real-Time Big Data Ingestion that Enables Continuous Processing
[Architecture diagram: data sources send records to the AWS endpoint; a Kinesis stream (Shards 1..N, spread across Availability Zones) feeds App.1 (Aggregate & De-Duplicate), App.2 (Metric Extraction), App.3 (Sliding Window Analysis), and App.4 (Machine Learning), which write to S3, DynamoDB, and Redshift.]
Kinesis Stream: Managed ability to capture and store data
• Streams are made of shards
• Each shard ingests data at up to 1 MB/sec, and up to 1,000 TPS
• Each shard emits up to 2 MB/sec
• All data is stored for 24 hours
• Scale Kinesis streams by adding or removing shards
• Replay data inside the 24-hour window
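The per-shard limits above imply a simple sizing rule: provision enough shards to cover whichever of write bandwidth, write record rate, or read bandwidth is the binding constraint. A minimal sketch of that arithmetic (the per-shard figures come from the bullets above; the function name is illustrative):

```python
import math

# Per-shard limits quoted above.
INGEST_MB_PER_SEC = 1.0
INGEST_RECORDS_PER_SEC = 1000
EGRESS_MB_PER_SEC = 2.0

def shards_needed(write_mb_s, write_records_s, read_mb_s):
    """Smallest shard count satisfying all three per-shard limits."""
    return max(
        math.ceil(write_mb_s / INGEST_MB_PER_SEC),
        math.ceil(write_records_s / INGEST_RECORDS_PER_SEC),
        math.ceil(read_mb_s / EGRESS_MB_PER_SEC),
    )

# e.g. 5 MB/s in, 4,500 records/s, two reading apps at 5 MB/s each
print(shards_needed(5, 4500, 10))  # -> 5
```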
Putting Data into Kinesis: Simple PUT interface to store data in Kinesis
• Producers use a PUT call to store data in a stream: PutRecord {Data, PartitionKey, StreamName}
• A partition key is supplied by the producer and used to distribute the PUTs across shards
• Kinesis MD5-hashes the supplied partition key over the hash key range of a shard
• A unique sequence number is returned to the producer upon a successful PUT call
[Diagram: many producers PUT records into the Kinesis stream, which distributes them across Shards 1..n.]
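The partition-key routing can be pictured with a small sketch. This assumes, for illustration, that the 128-bit MD5 hash key space is split evenly across shards; real shards carry explicit StartingHashKey/EndingHashKey ranges, but the idea is the same:

```python
import hashlib

def shard_for_key(partition_key, num_shards):
    """Map a partition key to a shard index by MD5-hashing it over a
    128-bit key space split evenly across shards (an illustrative
    simplification of the per-shard hash key ranges)."""
    hash_value = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    range_size = 2 ** 128 // num_shards
    return min(hash_value // range_size, num_shards - 1)

# Records carrying the same partition key always land on the same shard.
assert shard_for_key("device-42", 4) == shard_for_key("device-42", 4)
```

Because routing depends only on the key, a producer can keep related records (say, one device's readings) together on a shard by reusing the key.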
Creating and Sizing a Kinesis Stream
Getting Started with Kinesis – Writing to a Stream

POST / HTTP/1.1
Host: kinesis.<region>.<domain>
x-amz-Date: <Date>
Authorization: AWS4-HMAC-SHA256 Credential=<Credential>, SignedHeaders=content-type;date;host;user-agent;x-amz-date;x-amz-target;x-amzn-requestid, Signature=<Signature>
User-Agent: <UserAgentString>
Content-Type: application/x-amz-json-1.1
Content-Length: <PayloadSizeBytes>
Connection: Keep-Alive
X-Amz-Target: Kinesis_20131202.PutRecord

{
  "StreamName": "exampleStreamName",
  "Data": "XzxkYXRhPl8x",
  "PartitionKey": "partitionKey"
}
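Note that the Data field in the request body travels base64-encoded. A small sketch of building that JSON body (stream name and key are the example values from the request above):

```python
import base64
import json

def put_record_body(stream_name, data_bytes, partition_key):
    """Build the JSON body for the Kinesis_20131202.PutRecord target;
    record payloads are base64-encoded on the wire."""
    return json.dumps({
        "StreamName": stream_name,
        "Data": base64.b64encode(data_bytes).decode("ascii"),
        "PartitionKey": partition_key,
    })

body = put_record_body("exampleStreamName", b"_<data>_1", "partitionKey")
# "XzxkYXRhPl8x" in the example request is the base64 form of b"_<data>_1".
```

In practice an AWS SDK signs and sends this for you; the sketch only shows what ends up in the payload.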
Building Stream Processing Apps with Kinesis
Attributes of continuous stream processing apps:
• Continuous processing implies "long-running" applications
• Be fault tolerant, to handle failures in hardware or software
• Be distributed, to handle multiple shards
• Scale up and down as the number of shards increases or decreases
Ways to read:
• Read from a Kinesis stream directly using the Get* APIs
• Build applications with the Kinesis Client Library
• Leverage the Kinesis Spout for Storm
Reading from Kinesis Streams
[Diagram: KCL Workers 1..n, distributed across EC2 instances, read from Shards 1..n of the Kinesis stream.]
Building Kinesis Processing Apps: Kinesis Client Library
Client library for fault-tolerant, at-least-once, continuous processing
o Java client library, source available on GitHub
o Build & deploy your app with the KCL on your EC2 instance(s)
o The KCL is an intermediary between your application & the stream
  § Automatically starts a Kinesis Worker for each shard
  § Simplifies reading by abstracting individual shards
  § Increases/decreases Workers as the number of shards changes
  § Checkpoints to keep track of a Worker's location in the stream, and restarts Workers if they fail
o Integrates with Auto Scaling groups to redistribute workers to new instances
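The checkpoint-and-restart behavior is the heart of the KCL's fault tolerance. The KCL itself is Java and keeps checkpoints in DynamoDB; the following is only a conceptual Python sketch of the mechanism, with an in-memory dict standing in for the checkpoint table:

```python
# In-memory stand-ins for the KCL's checkpoint table (DynamoDB in the
# real library) and for the work a record processor would do.
checkpoints = {}
processed = []

def process(record):
    processed.append(record)

def run_worker(shard_id, records):
    """Process a shard's records past the last checkpoint.
    A restarted worker resumes from the stored position, which is
    what gives at-least-once (not exactly-once) delivery."""
    start = checkpoints.get(shard_id, -1) + 1
    for seq in range(start, len(records)):
        process(records[seq])
        checkpoints[shard_id] = seq  # checkpoint after each record

run_worker("shard-1", ["a", "b"])       # first worker stops after "b"
run_worker("shard-1", ["a", "b", "c"])  # replacement resumes at "c"
print(processed)  # -> ['a', 'b', 'c']
```

If a worker dies between processing a record and checkpointing it, the replacement reprocesses that record, which is why downstream apps (like App.1 above) may need to de-duplicate.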
More options to read from Kinesis Streams: Leveraging Get APIs, existing Storm topologies
o Use the Get APIs for raw reads of Kinesis data streams
  • GetRecords {Limit, ShardIterator}
  • GetShardIterator {ShardId, ShardIteratorType, StartingSequenceNumber, StreamName}
o Integrate Kinesis Streams with Storm topologies
  • Bootstraps, via Zookeeper, to map shards to Spout tasks
  • Fetches data from the Kinesis stream
  • Emits tuples and checkpoints (in Zookeeper)
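The two Get APIs compose into a simple read loop: obtain an iterator for a shard, then call GetRecords repeatedly, each call handing back a batch and the position for the next call. A sketch of that pattern, with an in-memory list standing in for a shard (the real APIs return opaque iterator strings, not integers):

```python
def get_shard_iterator(shard, iterator_type, starting_sequence_number=None):
    """Simulate GetShardIterator: TRIM_HORIZON starts at the oldest
    stored record; AFTER_SEQUENCE_NUMBER resumes past a known record."""
    if iterator_type == "TRIM_HORIZON":
        return 0
    if iterator_type == "AFTER_SEQUENCE_NUMBER":
        return starting_sequence_number + 1
    raise ValueError(iterator_type)

def get_records(shard, iterator, limit):
    """Simulate GetRecords: a batch plus the iterator for the next call."""
    batch = shard[iterator:iterator + limit]
    return batch, iterator + len(batch)

shard = ["rec-%d" % i for i in range(5)]
it = get_shard_iterator(shard, "TRIM_HORIZON")
out = []
while True:
    batch, it = get_records(shard, it, limit=2)
    if not batch:
        break
    out.extend(batch)
print(out)  # -> ['rec-0', 'rec-1', 'rec-2', 'rec-3', 'rec-4']
```

This is exactly the loop the KCL and the Storm spout run on your behalf, per shard.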
Sending & Reading Data from Kinesis Streams
Sending:
• HTTP POST
• AWS SDK
• Log4J
• Flume
• Fluentd
Reading:
• Get* APIs
• Kinesis Client Library + Connector Library
• Apache Storm
• Amazon Elastic MapReduce
Kinesis Connectors: Enabling Data Movement
Amazon Kinesis Connector Library: Customizable, Open-Source Apps to Connect Kinesis with S3, Redshift, DynamoDB
ITransformer
• Defines the transformation of records from the Amazon Kinesis stream in order to suit the user-defined data model
IFilter
• Excludes irrelevant records from the processing
IBuffer
• Buffers the set of records to be processed by specifying a size limit (number of records) & total byte count
IEmitter
• Makes client calls to other AWS services and persists the records stored in the buffer
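The four interfaces compose into a transform → filter → buffer → emit pipeline. The library itself is Java; the following Python sketch only mirrors the flow (the class and parameter names are illustrative, and the byte-count limit is omitted for brevity):

```python
class Pipeline:
    """Transform -> filter -> buffer -> emit, flushing when the buffer
    reaches its record-count limit (mirrors ITransformer/IFilter/
    IBuffer/IEmitter from the connector library)."""

    def __init__(self, transform, keep, emit, buffer_limit):
        self.transform, self.keep, self.emit = transform, keep, emit
        self.buffer, self.buffer_limit = [], buffer_limit

    def process(self, record):
        item = self.transform(record)          # ITransformer
        if not self.keep(item):                # IFilter
            return
        self.buffer.append(item)               # IBuffer
        if len(self.buffer) >= self.buffer_limit:
            self.emit(self.buffer)             # IEmitter
            self.buffer = []

emitted = []
p = Pipeline(transform=str.upper,
             keep=lambda s: s != "SKIP",
             emit=lambda batch: emitted.append(list(batch)),
             buffer_limit=2)
for rec in ["a", "skip", "b", "c", "d"]:
    p.process(rec)
print(emitted)  # -> [['A', 'B'], ['C', 'D']]
```

In the real connectors, the emit step is a batched write to S3, DynamoDB, or Redshift; buffering exists so those writes happen in efficient batches rather than per record.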
[Diagram: the Kinesis stream feeding connectors that emit to S3, DynamoDB, and Redshift.]
[Diagram: Client/Sensor → Aggregator/Sequencer (Amazon Kinesis, the streaming data repository) → continuous processor for dashboards and batch processing (Amazon EMR) → Storage (S3, DynamoDB, Redshift) → Analytics and Reporting]
Log Processing with Hadoop
Input: My Website → Kinesis Log4J Appender pushes to Kinesis
Processing: EMR (Hive, Pig, Cascading, MapReduce) pulls from Kinesis
Hadoop ecosystem implementation:
• Hadoop InputFormat
• Hive Storage Handler
• Pig Load Function
• Cascading Scheme and Tap
Writing to Kinesis using Log4J:

private static final Logger LOGGER = Logger.getLogger(KinesisAppender.class);
LOGGER.error("Cannot find resource XYX… go do something about it!");

Processing Kinesis data from Elastic MapReduce using Hive:

select instancetime, InstanceID, Message from MyKinesisLogsStream where LogLevel = 'error'
12/13/13:07:51, InstanceID123, Cannot find resource XYX… go do something about it!
Amazon Kinesis Storm Spout: Integrate Kinesis Streams with Storm topologies
• Bootstrap
  – Queries the list of shards from Kinesis
  – Initializes state in Zookeeper
  – Shards are mapped to spout tasks
• Fetching data from Kinesis
  – Fetches a batch of records via GetRecords API calls
  – Buffers the records in memory
  – A spout task will fetch data from all shards it is responsible for
• Emitting tuples and checkpointing
  – Converts each Kinesis data record into a tuple
  – Emits the tuple for processing
  – Checkpoints periodically (checkpoints stored in Zookeeper)
[Diagram: Shards 1..n mapped across Spout Tasks A, B, C.]
Customers using Amazon Kinesis: Amazon Kinesis in action
• Mobile/Social Gaming: Amazon Kinesis accelerates time to market of elastic, real-time applications, while minimizing operational overhead
• Digital Advertising Tech.: Amazon Kinesis ingests data and enables continuous, fresh analytics on advertiser performance to optimize marketing spend, and increase responsiveness to clients
Gaming Analytics with Amazon Kinesis
[Diagram components: Real-time Clickstream Processing App; Aggregate Statistics; Clickstream Archive; In-Game Engagement Trends Dashboard]
Digital Ad Tech Metering with Amazon Kinesis
[Diagram components: Continuous Ad Metrics Extraction; Incremental Ad Statistics Computation; Metering Record Archive; Ad Analytics Dashboard]
Simple Metering & Billing with Amazon Kinesis
[Diagram components: Incremental Bill Computation; Metering Record Archive; Billing Auditors; Billing Management Service]
Kinesis Pricing: Simple, Pay-as-you-go, & No Up-Front Costs

Pricing Dimension                      Value
Hourly shard rate                      $0.015
Per 1,000,000 PUT transactions         $0.028

• Customers specify throughput requirements in shards, which they control
• Each shard delivers 1 MB/s on ingest, and 2 MB/s on egress
• Inbound data transfer is free
• EC2 instance charges apply for Kinesis processing applications
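The two pricing dimensions combine into a quick back-of-the-envelope monthly estimate. A sketch using the rates quoted above (the 730 hours/month figure is an assumed average month, and EC2 costs for the processing apps are excluded):

```python
SHARD_HOUR_RATE = 0.015   # $ per shard-hour, from the table above
PUT_RATE = 0.028          # $ per 1,000,000 PUT transactions
HOURS_PER_MONTH = 730     # assumed average month

def monthly_cost(shards, puts_per_second):
    """Rough monthly Kinesis bill: shard-hours plus PUT transactions."""
    shard_cost = shards * SHARD_HOUR_RATE * HOURS_PER_MONTH
    put_cost = puts_per_second * 3600 * HOURS_PER_MONTH / 1e6 * PUT_RATE
    return shard_cost + put_cost

# e.g. a 2-shard stream taking 1,000 PUTs/sec
print(round(monthly_cost(2, 1000), 2))  # -> 95.48
```

Note that at sustained high record rates the PUT charge dominates the shard charge, which is why batching small records into fewer PUTs can matter.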
Amazon Kinesis: Key Benefits
Easy administration: Managed service for real-time streaming data collection, processing, and analysis. Simply create a new stream, set the desired level of capacity, and let the service handle the rest.
Real-time performance: Perform continual processing on streaming big data. Processing latencies fall to a few seconds, compared with the minutes or hours associated with batch processing.
High throughput, elastic: Seamlessly scale to match your data throughput rate and volume. You can easily scale up to gigabytes per second. The service will scale up or down based on your operational or business needs.
S3, Redshift, & DynamoDB integration: Reliably collect, process, and transform all of your data in real time & deliver it to the AWS data stores of your choice, with connectors for S3, Redshift, and DynamoDB.
Build real-time applications: Client libraries enable developers to design and operate real-time streaming data processing applications.
Low cost: Cost-efficient for workloads of any scale. You can get started by provisioning a small stream, and pay low hourly rates only for what you use.
AWS Big Data Services
Amazon EMR · Amazon S3 · Amazon DynamoDB · Amazon Glacier · Amazon Redshift · AWS Data Pipeline · Amazon Kinesis
Try out Amazon Kinesis
• Try out Amazon Kinesis – http://aws.amazon.com/kinesis/
• Thumb through the Developer Guide – http://aws.amazon.com/documentation/kinesis/
• Visit, and Post on Kinesis Forum – https://forums.aws.amazon.com/forum.jspa?forumID=169#
• Continue the conversation on Twitter – @petermark
• Contact me – [email protected]
• Contact Doug Lott – [email protected]
Questions?
Thank You