© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Roger Barga, GM Amazon Kinesis
April 2016
Streaming Data: The Opportunity &
How to Work With It
Interest in and demand for
stream data processing is rapidly
increasing*…
* Understatement of the year…
Most data is produced continuously
Why?
{
"payerId": "Joe",
"productCode": "AmazonS3",
"clientProductCode": "AmazonS3",
"usageType": "Bandwidth",
"operation": "PUT",
"value": "22490",
"timestamp": "1216674828"
}
Metering Record
127.0.0.1 user-identifier frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif
HTTP/1.0" 200 2326
Common Log Entry
<165>1 2003-10-11T22:14:15.003Z
mymachine.example.com evntslog - ID47
[exampleSDID@32473 iut="3" eventSource="Application"
eventID="1011"][examplePriority@32473 class="high"]
Syslog Entry
“SeattlePublicWater/Kinesis/123/Realtime” –
412309129140
MQTT Record
<R,AMZN,T,G,R1>
NASDAQ OMX Record
Time is money • Perishable Insights (Forrester)
Why?
• Hourly server logs: how your systems were misbehaving an hour ago
• Weekly / monthly bill: what you spent this
past billing cycle
• Daily fraud reports: tells you if there was
fraud yesterday
• CloudWatch metrics: what just went wrong now
• Real-time spending alerts/caps: guaranteeing you
can’t overspend
• Real-time detection: blocks fraudulent use now
Time is money • Perishable Insights (Forrester)
• A more efficient implementation
• Most ‘Big Data’ deployments process
continuously generated data (batched)
Why?
Amazon Kinesis: Streaming Data Done the AWS Way
Makes it easy to capture, deliver, and process real-time data streams
Pay as you go, no up-front costs
Elastically scalable
Right services for your specific use cases
Real-time latencies
Easy to provision, deploy, and manage
Amazon Kinesis: Streaming data made easy
Services make it easy to capture, deliver, and process streams on AWS
Kinesis Streams
For Technical Developers
Kinesis Firehose
For all developers,
data scientists
Kinesis Analytics
For all developers, data engineers,
data scientists
Build your own custom
applications that process
or analyze streaming data
Easily load massive
volumes of streaming
data into S3 and Redshift
Easily analyze data streams
using standard SQL queries
Amazon Kinesis Streams
Build your own data streaming applications
Easy administration: Simply create a new stream, and set the desired level of
capacity with shards. Scale to match your data throughput rate and volume.
Build real-time applications: Perform continual processing on streaming big data
using Kinesis Client Library (KCL), Apache Spark/Storm, AWS Lambda, and more.
Low cost: Cost-efficient for workloads of any scale.
Kinesis Streaming Data Ingestion
• Streams are made of Shards
• Each Shard ingests data up to 1 MB/sec, and up to 1,000 TPS
• Producers use a PUT call to store data in a Stream: PutRecord {Data, PartitionKey, StreamName}
• Each Shard emits up to 2 MB/sec
• All data is stored for 24 hours
• Scale Kinesis streams by adding or removing Shards
• Replay data from the stream (log)
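As a sketch of how a PartitionKey selects a Shard: Kinesis hashes the key with MD5 into a 128-bit integer space that is divided among the shards' hash-key ranges. The even split below is an assumption that holds for a freshly created stream; after splits and merges the ranges can differ.

```python
import hashlib

def shard_for_key(partition_key: str, num_shards: int) -> int:
    """Return the index of the shard that would receive this record."""
    # Kinesis computes MD5(partition_key) as a 128-bit integer and routes
    # the record to the shard whose hash-key range contains that value.
    hashed = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    # Assumption: a newly created stream splits 2^128 evenly across shards.
    range_size = (2 ** 128) // num_shards
    return min(hashed // range_size, num_shards - 1)
```

Records that share a partition key always map to the same shard, which is what gives you per-key ordering across the stream.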
Kinesis Producer Library
Highly configurable library to write to Kinesis
• Collects records and uses PutRecords for high throughput writes
• Integrates seamlessly with the Amazon Kinesis Client Library (KCL) to
de-aggregate batched records
• Submits Amazon CloudWatch metrics on your behalf to provide
visibility into producer performance
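The collect-then-PutRecords idea can be pictured with a toy batcher (a hypothetical class for illustration; the real KPL also handles retries, rate limiting, record aggregation into single payloads, and the CloudWatch metrics mentioned above):

```python
class RecordBatcher:
    """Toy sketch of KPL-style batching: buffer records, flush in batches."""

    def __init__(self, max_batch_size=500, send_fn=None):
        # PutRecords accepts up to 500 records per request.
        self.max_batch_size = max_batch_size
        self.buffer = []
        self.batches_sent = []          # stands in for actual API calls
        self.send_fn = send_fn or self.batches_sent.append

    def put(self, partition_key, data):
        self.buffer.append({"PartitionKey": partition_key, "Data": data})
        if len(self.buffer) >= self.max_batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.send_fn(list(self.buffer))  # one PutRecords call per batch
            self.buffer.clear()
```

Batching like this is what turns thousands of small producer writes into a few high-throughput API calls.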
Extended Retention in Kinesis
Default 24 hours, but configurable up to 7 days
• 2 new APIs:
• IncreaseStreamRetentionPeriod(String streamName, int retentionPeriodHours)
• DecreaseStreamRetentionPeriod(String streamName, int retentionPeriodHours)
• Use it in one of two modes:
• As always-on extended retention
• Raise the retention period in response to an operational event or planned
maintenance of a record-processing application
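A small helper can decide which of the two APIs applies (a sketch: the hour bounds mirror the 24-hour default and 7-day maximum above, and the returned name corresponds to the API you would then invoke with an AWS client):

```python
def retention_change(current_hours: int, target_hours: int):
    """Return which retention API to call, or None if no change is needed."""
    if not 24 <= target_hours <= 168:   # 24 hours up to 7 days
        raise ValueError("retention must be between 24 and 168 hours")
    if target_hours > current_hours:
        return "IncreaseStreamRetentionPeriod"
    if target_hours < current_hours:
        return "DecreaseStreamRetentionPeriod"
    return None
```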
Amazon Web Services (three Availability Zones)
[Architecture diagram] Millions of sources producing 100s of terabytes per hour → front end (authentication, authorization) → durable, highly consistent storage replicates data across three data centers (Availability Zones) → ordered stream of events supports multiple readers: real-time dashboards and alarms; machine learning algorithms or sliding-window analytics; aggregate and archive to S3; aggregate analysis in Hadoop or a data warehouse. Inexpensive: $0.028 per million puts.

Real-Time Streaming Data Ingestion
[Same diagram, with custom-built streaming applications (KCL) as consumers] Inexpensive: $0.014 per 1,000,000 PUT Payload Units. Latencies shown: 25–40 ms and 100–150 ms.
Building Applications: Kinesis Client Library
For fault-tolerant, scalable stream processing applications
• Open-source Java client library; also available for Python, Ruby, Node.js, and .NET. Available on GitHub
• Connects to the stream and enumerates the shards
• Instantiates a record processor for every shard it manages
• Pulls data records from the stream
• Pushes the records to the corresponding record processor
• Checkpoints processed records
• Balances shard-worker associations when the worker instance count changes
• Balances shard-worker associations when shards are split or merged
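The shard balancing in the last two bullets can be pictured as a round-robin assignment that is recomputed whenever workers join or leave, or shards split or merge (a simplification: the real KCL coordinates this through leases in a DynamoDB table rather than a central function):

```python
def assign_shards(shards, workers):
    """Evenly distribute shard IDs across workers, round-robin."""
    assignment = {w: [] for w in workers}
    for i, shard in enumerate(sorted(shards)):
        # Sorting makes the assignment deterministic across recomputations.
        assignment[workers[i % len(workers)]].append(shard)
    return assignment
```

Rerunning the function with a new worker or shard list yields the rebalanced assignment; each worker then starts or stops record processors to match.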
Real-Time Streaming Data with Kinesis Streams
• 5 billion events/wk from connected devices | IoT
• 17 PB of game data per season | Entertainment
• 100 billion ad impressions/day, 30 ms response time | Ad Tech
• 100 GB/day click streams across 250+ sites | Enterprise
• 50 billion ad impressions/day, sub-50 ms responses | Ad Tech
• 17 million events/day | Technology
• 1 billion transactions per day | Bitcoin
• 1 TB+/day game data analyzed in real time | Gaming
Amazon Kinesis: Streaming data made easy
Services make it easy to capture, deliver, and process streams on AWS
Kinesis Streams
For Technical Developers
Kinesis Firehose
For all developers,
data scientists
Kinesis Analytics
For all developers, data engineers,
data scientists
Build your own custom
applications that process
or analyze streaming data
Easily load massive
volumes of streaming
data into S3 and Redshift
Easily analyze data streams
using standard SQL queries
Amazon Kinesis Firehose
Zero Admin: Capture and deliver streaming data into S3, Redshift, and
more without writing an application or managing infrastructure
Direct-to-data-store integration: Batch, compress, and encrypt
streaming data for delivery into S3 and other destinations in as little as
60 seconds; set up in minutes
Seamless elasticity: Seamlessly scales to match data throughput
Capture and submit
streaming data to Firehose
Firehose loads streaming data
continuously into S3 and Redshift
Analyze streaming data using your favorite
BI tools
Amazon Kinesis Firehose: 3 Simple Concepts
1. Delivery Stream: The underlying entity of Firehose. Use Firehose by
creating a delivery stream to a specified destination and send data to it.
• You do not have to create a stream or provision shards.
• You do not have to specify partition keys.
2. Records: The data producer sends data blobs as large as 1,000 KB to a
delivery stream. Each such data blob is called a record.
3. Data Producers: Producers send records to a delivery stream. For
example, a web server sending log data to a delivery stream is a data
producer.
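The 1,000 KB record limit in concept 2 can be enforced on the producer side with a small guard (a sketch; `MAX_RECORD_BYTES` and `validate_record` are illustrative names, not Firehose API names):

```python
MAX_RECORD_BYTES = 1000 * 1024  # a data blob (record) can be up to 1,000 KB

def validate_record(blob: bytes) -> bytes:
    """Raise if a data blob exceeds the per-record limit, else pass it through."""
    if len(blob) > MAX_RECORD_BYTES:
        raise ValueError(
            f"record is {len(blob)} bytes; the limit is {MAX_RECORD_BYTES}"
        )
    return blob
```

Checking before sending is cheaper than discovering the rejection in the delivery response.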
Amazon Kinesis Firehose Console Experience: Unified console experience for Firehose and Streams
Amazon Kinesis Firehose Console Experience (S3): Create fully managed resources for delivery without building an app
Amazon Kinesis Firehose Console Experience: Configure data delivery options simply using the console
Amazon Kinesis Agent
Software agent makes submitting data to Firehose easy
• Monitors files and sends new data records to a delivery stream
• Handles file rotation, checkpointing, and retry upon failures
• Delivers all data in a reliable, timely, and simple manner
• Emits Amazon CloudWatch metrics to help you better monitor and
troubleshoot the streaming process
• Supported on Amazon Linux AMI with version 2015.09 or later, or Red
Hat Enterprise Linux version 7 or later. Install on Linux-based server
environments such as web servers, front ends, log servers, and more
• Also enabled for Kinesis Streams
Amazon Kinesis Firehose Pricing
Simple, pay-as-you-go, and no up-front costs
Dimension Value
Per 1 GB of data ingested $0.035
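At the rate in the table, estimating a monthly bill is simple arithmetic (a sketch; actual billing rounds each record up in small increments, which this ignores):

```python
PRICE_PER_GB = 0.035  # per GB of data ingested, from the table above

def monthly_cost(gb_per_day: float, days: int = 30) -> float:
    """Estimated Firehose ingestion cost for a steady daily volume."""
    return gb_per_day * days * PRICE_PER_GB
```

For example, a steady 100 GB/day comes to roughly $105 over a 30-day month.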
Hearst's wonderful world of streaming data
• Digital media
• Publication brands
• Mobile apps
• Re-targeting platforms
Hearst Streaming Data Architecture
[Architecture diagram] Users of Hearst properties → Node.js proxy app → clickstream data into Kinesis Streams → S3 storage → ETL on EMR → Amazon Redshift → data science application → models and aggregated data → Kinesis Firehose → API-ready data → Buzzing API. Latency vs. throughput across the stages: milliseconds at 100 GB/day, 5 seconds at 1 GB/day, 30 seconds at 5 GB/day, 100 seconds at 1 GB/day.

Amazon Kinesis Firehose
Fully Managed Service for Delivering Data Streams into AWS Destinations
[Diagram] Data sources → AWS endpoint → Kinesis Firehose [batch, compress, encrypt] → S3 and Redshift
• No partition keys
• No provisioning
• End-to-end elastic
Amazon Kinesis Streams is a service for workloads that require
custom processing, per incoming record, with sub-1 second
processing latency and a choice of stream processing frameworks.
Amazon Kinesis Firehose is a service for workloads that require
zero administration, ability to use existing analytics tools based
on S3 or Redshift, and a data latency of 60 seconds or higher.
Real-Time Analytics
Our customers asked for a way
to produce results fast
Static Queries, but need very fast
output (alerts, real-time control)
Ability to transform streaming data
and deliver enhanced data to other
destinations (services & storage)
Dynamic and interactive queries
(data exploration)
Value of insights on some data degrades quickly with time, like
stock markets, traffic, device telemetry, patient monitoring, etc.
Amazon Kinesis: Streaming data made easy
Services make it easy to capture, deliver, and process streams on AWS
Kinesis Streams
For Technical Developers
Kinesis Firehose
For all developers,
data scientists
Kinesis Analytics
For all developers, data engineers,
data scientists
Build your own custom
applications that process
or analyze streaming data
Easily load massive
volumes of streaming
data into S3 and Redshift
Easily analyze data streams
using standard SQL queries
Amazon Kinesis Analytics
Analyze data streams continuously with standard SQL
Apply SQL on streams: Easily connect to data streams and apply existing
SQL skills.
Build real-time applications: Perform continual processing on streaming
big data with sub-second processing latencies
Scale elastically: Scales to match data throughput without any
operator intervention.
(Announcement only!)
Connect to Kinesis streams and Firehose delivery streams
Run standard SQL queries against data streams
Kinesis Analytics can send processed data to analytics tools so you can
create alerts and respond in real time
Scenarios for Streaming Data Analytics
Windowed analytics such as running totals, weighted moving
averages using sliding and tumbling windows
Streaming ETL to parse, validate, and enrich streaming data, and
then aggregate and filter before sending to persistent stores like
S3, Redshift, ElasticSearch, and more…
Real-time alerts on data streams using simple thresholds or
advanced machine learning algorithms like anomaly detection
Real-time visualizations of calculated key performance
indicators in near real time
…and more sophisticated use cases such as temporal analysis, detecting event
sequences, complex event processing, advanced analytics.
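The windowed-analytics scenario above can be sketched with a tumbling-window count in plain Python, standing in for the SQL window clauses Kinesis Analytics provides (the function and its parameters are illustrative):

```python
from collections import Counter

def tumbling_counts(event_timestamps, window_seconds):
    """Count events per fixed, non-overlapping (tumbling) time window."""
    counts = Counter()
    for ts in event_timestamps:
        # Each event falls in exactly one window, keyed by its start time.
        window_start = (ts // window_seconds) * window_seconds
        counts[window_start] += 1
    return dict(counts)
```

A sliding window differs only in that each event contributes to every window that overlaps it, rather than exactly one.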
Amazon Kinesis Analytics
Service Overview
Use ANSI SQL to build real-time streaming applications
Easily write SQL code to process streaming data
Author applications that perform simple aggregations to sophisticated
multi-step pipelines with advanced analytical functions
Connect to Streaming Input Data
Select Kinesis streams or Firehose delivery streams as input data for your SQL code
Continuously deliver SQL results
Configure one to many outputs for your processed data for real-time
alerts, visualizations, and batch analytics
Connect to Streaming Input Data Source
Single streaming input, Kinesis Stream or Firehose
Single reference input up to 1 GB from S3
Input formats include JSON, CSV, variable column (TSV,
pipe-delimited); complex, nested JSON structures
Each input has a schema
• Schema is inferred but you can edit
• Deep nesting (2+ levels) supported by specifying row paths of
desired fields
• Multiple record types are supported through canonical schema
and/or by manipulating structures in SQL using string functions
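Specifying a row path for a deeply nested field, as in the second bullet, amounts to walking a dotted path into the parsed record (a hypothetical helper for illustration; the console expresses paths in its own syntax):

```python
import json

def field_at_path(record_json: str, row_path: str):
    """Extract a nested field by dotted path, e.g. 'device.sensor.id'."""
    node = json.loads(record_json)
    for key in row_path.split("."):
        node = node[key]   # descend one level per path segment
    return node
```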
Easily write SQL to process streaming data
Build streaming application with SQL statements
Extensions to SQL standard to work seamlessly with
streaming data (STREAM, Windows, ROWTIME)
Robust SQL support including large number of functions
including:
• Simple mathematical operators (AVG, STDEV, etc.)
• String manipulations (SUBSTRING, POSITION)
• Advanced analytics (random sampling, anomaly detection)
Support for at-least-once processing semantics
Real-Time Analytics Patterns
Simple counting (e.g. failure count)
Counting with Windows ( e.g. failure count every hour)
Preprocessing: filter, transform (e.g. data cleanup)
Alerts, thresholds (e.g. alarm on high temperature)
Data correlation, detect missing events, detect erroneous data (e.g. detecting failed sensors)
Joining event streams (e.g. detect a hit on a soccer ball)
Merge with data in a database; update data conditionally
Real-Time Analytics Patterns (cont'd)
Detect Event Sequence Patterns (e.g. small transaction followed by large transaction)
Tracking - follow some related entity’s state in space, time etc. (e.g. location of airline baggage, vehicle tracking, …)
Detect trends: rise, turn, fall, outliers, complex patterns like a triple bottom, etc. (e.g. algorithmic trading, SLA monitoring, load balancing)
And more….
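The small-transaction-followed-by-large-transaction pattern above can be sketched as per-account state tracking (thresholds and names are illustrative):

```python
def suspicious_pairs(transactions, small=1.0, large=500.0):
    """Flag accounts where a small transaction is immediately followed by a large one."""
    last_was_small = set()   # accounts whose most recent transaction was 'small'
    flagged = []
    for account, amount in transactions:
        if amount >= large and account in last_was_small:
            flagged.append(account)
        # Update per-account state for the next event.
        if amount <= small:
            last_was_small.add(account)
        else:
            last_was_small.discard(account)
    return flagged
```

A streaming SQL engine expresses the same idea declaratively with pattern-matching over ordered rows, but the state machine underneath is this simple.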
Continuously deliver SQL results
Up to five outputs, including S3, Redshift through
Firehose, Kinesis Streams, and more...
• Firehose allows Kinesis Analytics to separate processing and
delivery of enriched streaming data
• Delivery speed will be heavily dependent upon your SQL queries
(i.e. simple ETL versus 10 minute aggregations)
Other destination(s) will be included at GA.
Output formats include JSON, CSV, variable column
(TSV, pipe-delimited)
Questions