© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Roger Barga, GM Amazon Kinesis
April 2016
Streaming Data: The Opportunity &
How to Work With It
Interest in and demand for
stream data processing is rapidly
increasing*…
* Understatement of the year…
Most data is produced continuously
Why?
{
"payerId": "Joe",
"productCode": "AmazonS3",
"clientProductCode": "AmazonS3",
"usageType": "Bandwidth",
"operation": "PUT",
"value": "22490",
"timestamp": "1216674828"
}
Metering Record
127.0.0.1 user-identifier frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif
HTTP/1.0" 200 2326
Common Log Entry
<165>1 2003-10-11T22:14:15.003Z
mymachine.example.com evntslog - ID47
[exampleSDID@32473 iut="3" eventSource="Application"
eventID="1011"][examplePriority@32473 class="high"]
Syslog Entry
“SeattlePublicWater/Kinesis/123/Realtime” –
412309129140
MQTT Record
<R,AMZN,T,G,R1>
NASDAQ OMX Record
Time is money • Perishable Insights (Forrester)
Why?
• Hourly server logs: how your systems were misbehaving an hour ago
• Weekly / monthly bill: what you spent this
past billing cycle
• Daily fraud reports: tells you if there was
fraud yesterday
• CloudWatch metrics: what just went wrong now
• Real-time spending alerts/caps: guaranteeing you
can’t overspend
• Real-time detection: blocks fraudulent use now
Time is money • Perishable Insights (Forrester)
• A more efficient implementation
• Most ‘Big Data’ deployments process
continuously generated data (batched)
Why?
Amazon Kinesis: Streaming Data Done the AWS Way
Makes it easy to capture, deliver, and process real-time data streams
Pay as you go, no up-front costs
Elastically scalable
Right services for your specific use cases
Real-time latencies
Easy to provision, deploy, and manage
Amazon Kinesis: Streaming data made easy
Services make it easy to capture, deliver, and process streams on AWS
Kinesis Streams
For Technical Developers
Kinesis Firehose
For all developers,
data scientists
Kinesis Analytics
For all developers, data engineers,
data scientists
Build your own custom
applications that process
or analyze streaming data
Easily load massive
volumes of streaming
data into S3 and Redshift
Easily analyze data streams
using standard SQL queries
Amazon Kinesis Streams
Build your own data streaming applications
Easy administration: Simply create a new stream, and set the desired level of
capacity with shards. Scale to match your data throughput rate and volume.
Build real-time applications: Perform continual processing on streaming big data
using Kinesis Client Library (KCL), Apache Spark/Storm, AWS Lambda, and more.
Low cost: Cost-efficient for workloads of any scale.
Kinesis Streaming Data Ingestion
• Streams are made of Shards
• Each Shard ingests data up to 1 MB/sec, and up to 1,000 TPS
• Producers use a PUT call to store data in a Stream: PutRecord {Data, PartitionKey, StreamName}
• Each Shard emits up to 2 MB/sec
• All data is stored for 24 hours
• Scale Kinesis streams by adding or removing Shards
• Replay data from the stream (log)
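As a sketch of how a PartitionKey selects a Shard: Kinesis hashes the key with MD5 into a 128-bit integer space that is divided among the shards' hash-key ranges. The even split below is an assumption that holds for a freshly created stream; after splits and merges the ranges can differ.

```python
import hashlib

def shard_for_key(partition_key: str, num_shards: int) -> int:
    """Return the index of the shard that would receive this record."""
    # Kinesis computes MD5(partition_key) as a 128-bit integer and routes
    # the record to the shard whose hash-key range contains that value.
    hashed = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    # Assumption: a newly created stream splits 2^128 evenly across shards.
    range_size = (2 ** 128) // num_shards
    return min(hashed // range_size, num_shards - 1)
```

Records that share a partition key always map to the same shard, which is what gives you per-key ordering across the stream.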
Kinesis Producer Library
Highly configurable library to write to Kinesis
• Collects records and uses PutRecords for high throughput writes
• Integrates seamlessly with the Amazon Kinesis Client Library (KCL) to
de-aggregate batched records
• Submits Amazon CloudWatch metrics on your behalf to provide
visibility into producer performance
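The collect-then-PutRecords idea can be pictured with a toy batcher (a hypothetical class for illustration; the real KPL also handles retries, rate limiting, record aggregation into single payloads, and the CloudWatch metrics mentioned above):

```python
class RecordBatcher:
    """Toy sketch of KPL-style batching: buffer records, flush in batches."""

    def __init__(self, max_batch_size=500, send_fn=None):
        # PutRecords accepts up to 500 records per request.
        self.max_batch_size = max_batch_size
        self.buffer = []
        self.batches_sent = []          # stands in for actual API calls
        self.send_fn = send_fn or self.batches_sent.append

    def put(self, partition_key, data):
        self.buffer.append({"PartitionKey": partition_key, "Data": data})
        if len(self.buffer) >= self.max_batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.send_fn(list(self.buffer))  # one PutRecords call per batch
            self.buffer.clear()
```

Batching like this is what turns thousands of small producer writes into a few high-throughput API calls.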
Extended Retention in Kinesis
Default 24 hours, but configurable up to 7 days
• 2 new APIs:
• IncreaseStreamRetentionPeriod(String streamName, int retentionPeriodHours)
• DecreaseStreamRetentionPeriod(String streamName, int retentionPeriodHours)
• Use it in one of two modes:
• As always-on extended retention
• Raise the retention period in response to an operational event or planned
maintenance of a record-processing application
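A small helper can decide which of the two APIs applies (a sketch: the hour bounds mirror the 24-hour default and 7-day maximum above, and the returned name corresponds to the API you would then invoke with an AWS client):

```python
def retention_change(current_hours: int, target_hours: int):
    """Return which retention API to call, or None if no change is needed."""
    if not 24 <= target_hours <= 168:   # 24 hours up to 7 days
        raise ValueError("retention must be between 24 and 168 hours")
    if target_hours > current_hours:
        return "IncreaseStreamRetentionPeriod"
    if target_hours < current_hours:
        return "DecreaseStreamRetentionPeriod"
    return None
```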
Amazon Web Services (three Availability Zones)
[Architecture diagram] Millions of sources producing 100s of terabytes per hour → front end (authentication, authorization) → durable, highly consistent storage replicates data across three data centers (Availability Zones) → ordered stream of events supports multiple readers: real-time dashboards and alarms; machine learning algorithms or sliding-window analytics; aggregate and archive to S3; aggregate analysis in Hadoop or a data warehouse. Inexpensive: $0.028 per million puts.

Real-Time Streaming Data Ingestion
[Same diagram, with custom-built streaming applications (KCL) as consumers] Inexpensive: $0.014 per 1,000,000 PUT Payload Units. Latencies shown: 25–40 ms and 100–150 ms.
Building Applications: Kinesis Client Library
For fault-tolerant, scalable stream processing applications
• Open-source Java client library; also available for Python, Ruby, Node.js, and .NET. Available on GitHub
• Connects to the stream and enumerates the shards
• Instantiates a record processor for every shard it manages
• Pulls data records from the stream
• Pushes the records to the corresponding record processor
• Checkpoints processed records
• Balances shard-worker associations when the worker instance count changes
• Balances shard-worker associations when shards are split or merged
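The shard balancing in the last two bullets can be pictured as a round-robin assignment that is recomputed whenever workers join or leave, or shards split or merge (a simplification: the real KCL coordinates this through leases in a DynamoDB table rather than a central function):

```python
def assign_shards(shards, workers):
    """Evenly distribute shard IDs across workers, round-robin."""
    assignment = {w: [] for w in workers}
    for i, shard in enumerate(sorted(shards)):
        # Sorting makes the assignment deterministic across recomputations.
        assignment[workers[i % len(workers)]].append(shard)
    return assignment
```

Rerunning the function with a new worker or shard list yields the rebalanced assignment; each worker then starts or stops record processors to match.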
Real-Time Streaming Data with Kinesis Streams
• 5 billion events/wk from connected devices | IoT
• 17 PB of game data per season | Entertainment
• 100 billion ad impressions/day, 30 ms response time | Ad Tech
• 100 GB/day click streams across 250+ sites | Enterprise
• 50 billion ad impressions/day, sub-50 ms responses | Ad Tech
• 17 million events/day | Technology
• 1 billion transactions per day | Bitcoin
• 1 TB+/day game data analyzed in real time | Gaming
Amazon Kinesis: Streaming data made easy
Services make it easy to capture, deliver, and process streams on AWS
Kinesis Streams
For Technical Developers
Kinesis Firehose
For all developers,
data scientists
Kinesis Analytics
For all developers, data engineers,
data scientists
Build your own custom
applications that process
or analyze streaming data
Easily load massive
volumes of streaming
data into S3 and Redshift
Easily analyze data streams
using standard SQL queries
Amazon Kinesis Firehose
Zero Admin: Capture and deliver streaming data into S3, Redshift, and
more without writing an application or managing infrastructure
Direct-to-data-store integration: Batch, compress, and encrypt
streaming data for delivery into S3 and other destinations in as little as
60 seconds; set up in minutes
Seamless elasticity: Seamlessly scales to match data throughput
Capture and submit
streaming data to Firehose
Firehose loads streaming data
continuously into S3 and Redshift
Analyze streaming data using your favorite
BI tools
Amazon Kinesis Firehose: 3 Simple Concepts
1. Delivery Stream: The underlying entity of Firehose. Use Firehose by
creating a delivery stream to a specified destination and send data to it.
• You do not have to create a stream or provision shards.
• You do not have to specify partition keys.
2. Records: The data producer sends data blobs as large as 1,000 KB to a
delivery stream. Each such data blob is called a record.
3. Data Producers: Producers send records to a delivery stream. For
example, a web server sending log data to a delivery stream is a data
producer.
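The 1,000 KB record limit in concept 2 can be enforced on the producer side with a small guard (a sketch; `MAX_RECORD_BYTES` and `validate_record` are illustrative names, not Firehose API names):

```python
MAX_RECORD_BYTES = 1000 * 1024  # a data blob (record) can be up to 1,000 KB

def validate_record(blob: bytes) -> bytes:
    """Raise if a data blob exceeds the per-record limit, else pass it through."""
    if len(blob) > MAX_RECORD_BYTES:
        raise ValueError(
            f"record is {len(blob)} bytes; the limit is {MAX_RECORD_BYTES}"
        )
    return blob
```

Checking before sending is cheaper than discovering the rejection in the delivery response.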
Amazon Kinesis Firehose Console Experience: Unified console experience for Firehose and Streams
Amazon Kinesis Firehose Console Experience (S3): Create fully managed resources for delivery without building an app
Amazon Kinesis Firehose Console Experience: Configure data delivery options simply using the console
Amazon Kinesis Agent
Software agent makes submitting data to Firehose easy
• Monitors files and sends new data records to a delivery stream
• Handles file rotation, checkpointing, and retry upon failures
• Delivers all data in a reliable, timely, and simple manner
• Emits Amazon CloudWatch metrics to help you better monitor and
troubleshoot the streaming process
• Supported on Amazon Linux AMI with version 2015.09 or later, or Red
Hat Enterprise Linux version 7 or later. Install on Linux-based server
environments such as web servers, front ends, log servers, and more
• Also enabled for Kinesis Streams
Amazon Kinesis Firehose Pricing
Simple, pay-as-you-go, and no up-front costs
Dimension Value
Per 1 GB of data ingested $0.035
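At the rate in the table, estimating a monthly bill is simple arithmetic (a sketch; actual billing rounds each record up in small increments, which this ignores):

```python
PRICE_PER_GB = 0.035  # per GB of data ingested, from the table above

def monthly_cost(gb_per_day: float, days: int = 30) -> float:
    """Estimated Firehose ingestion cost for a steady daily volume."""
    return gb_per_day * days * PRICE_PER_GB
```

For example, a steady 100 GB/day comes to roughly $105 over a 30-day month.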
Hearst's wonderful world of streaming data
• Digital media
• Publication brands
• Mobile apps
• Re-targeting platforms
Hearst Streaming Data Architecture
[Architecture diagram] Users of Hearst properties → Node.js proxy app → clickstream data into Kinesis Streams → S3 storage → ETL on EMR → Amazon Redshift → data science application → models and aggregated data → Kinesis Firehose → API-ready data → Buzzing API. Latency vs. throughput across the stages: milliseconds at 100 GB/day, 5 seconds at 1 GB/day, 30 seconds at 5 GB/day, 100 seconds at 1 GB/day.

Amazon Kinesis Firehose
Fully Managed Service for Delivering Data Streams into AWS Destinations
[Diagram] Data sources → AWS endpoint → Kinesis Firehose [batch, compress, encrypt] → S3 and Redshift
• No partition keys
• No provisioning
• End-to-end elastic
Amazon Kinesis Streams is a service for workloads that require
custom processing, per incoming record, with sub-1 second
processing latency and a choice of stream processing frameworks.
Amazon Kinesis Firehose is a service for workloads that require
zero administration, ability to use existing analytics tools based
on S3 or Redshift, and a data latency of 60 seconds or higher.
Real-Time Analytics
Our customers asked for a way
to produce results fast
Static Queries, but need very fast
output (alerts, real-time control)
Ability to transform streaming data
and deliver enhanced data to other
destinations (services & storage)
Dynamic and interactive queries
(data exploration)
Value of insights on some data degrades quickly with time, like
stock markets, traffic, device telemetry, patient monitoring, etc.
Amazon Kinesis: Streaming data made easy
Services make it easy to capture, deliver, and process streams on AWS
Kinesis Streams
For Technical Developers
Kinesis Firehose
For all developers,
data scientists
Kinesis Analytics
For all developers, data engineers,
data scientists
Build your own custom
applications that process
or analyze streaming data
Easily load massive
volumes of streaming
data into S3 and Redshift
Easily analyze data streams
using standard SQL queries
Amazon Kinesis Analytics
Analyze data streams continuously with standard SQL
Apply SQL on streams: Easily connect to data streams and apply existing
SQL skills.
Build real-time applications: Perform continual processing on streaming
big data with sub-second processing latencies
Scale elastically: Scales to match data throughput without any
operator intervention.
(Announcement only!)
Connect to Kinesis streams and Firehose delivery streams
Run standard SQL queries against data streams
Kinesis Analytics can send processed data to analytics tools so you can
create alerts and respond in real time
Scenarios for Streaming Data Analytics
Windowed analytics such as running totals, weighted moving
averages using sliding and tumbling windows
Streaming ETL to parse, validate, and enrich streaming data, and
then aggregate and filter before sending to persistent stores like
S3, Redshift, ElasticSearch, and more…
Real-time alerts on data streams using simple thresholds or
advanced machine learning algorithms like anomaly detection
Real-time visualizations of calculated key performance
indicators in near real time
…and more sophisticated use cases such as temporal analysis, detecting event
sequences, complex event processing, advanced analytics.
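The windowed-analytics scenario above can be sketched with a tumbling-window count in plain Python, standing in for the SQL window clauses Kinesis Analytics provides (the function and its parameters are illustrative):

```python
from collections import Counter

def tumbling_counts(event_timestamps, window_seconds):
    """Count events per fixed, non-overlapping (tumbling) time window."""
    counts = Counter()
    for ts in event_timestamps:
        # Each event falls in exactly one window, keyed by its start time.
        window_start = (ts // window_seconds) * window_seconds
        counts[window_start] += 1
    return dict(counts)
```

A sliding window differs only in that each event contributes to every window that overlaps it, rather than exactly one.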
Amazon Kinesis Analytics
Service Overview
Use ANSI SQL to build real-time streaming applications
Easily write SQL code to process streaming data
Author applications that perform simple aggregations to sophisticated
multi-step pipelines with advanced analytical functions
Connect to Streaming Input Data
Select Kinesis streams or Firehose delivery streams as input data for your SQL code
Continuously deliver SQL results
Configure one to many outputs for your processed data for real-time
alerts, visualizations, and batch analytics
Connect to Streaming Input Data Source
Single streaming input, Kinesis Stream or Firehose
Single reference input up to 1 GB from S3
Input formats include JSON, CSV, variable column (TSV,
pipe-delimited); complex, nested JSON structures
Each input has a schema
• Schema is inferred but you can edit
• Deep nesting (2+ levels) supported by specifying row paths of
desired fields
• Multiple record types are supported through canonical schema
and/or by manipulating structures in SQL using string functions
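Specifying a row path for a deeply nested field, as in the second bullet, amounts to walking a dotted path into the parsed record (a hypothetical helper for illustration; the console expresses paths in its own syntax):

```python
import json

def field_at_path(record_json: str, row_path: str):
    """Extract a nested field by dotted path, e.g. 'device.sensor.id'."""
    node = json.loads(record_json)
    for key in row_path.split("."):
        node = node[key]   # descend one level per path segment
    return node
```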
Easily write SQL to process streaming data
Build streaming application with SQL statements
Extensions to SQL standard to work seamlessly with
streaming data (STREAM, Windows, ROWTIME)
Robust SQL support including large number of functions
including:
• Simple mathematical operators (AVG, STDEV, etc.)
• String manipulations (SUBSTRING, POSITION)
• Advanced analytics (random sampling, anomaly detection)
Support for at-least-once processing semantics
Real-Time Analytics Patterns
Simple counting (e.g. failure count)
Counting with Windows ( e.g. failure count every hour)
Preprocessing: filter, transform (e.g. data cleanup)
Alerts, thresholds (e.g. alarm on high temperature)
Data correlation, detect missing events, detect erroneous data (e.g. detecting failed sensors)
Joining event streams (e.g. detect a hit on a soccer ball)
Merge with data in a database; update data conditionally
Real-Time Analytics Patterns (cont'd)
Detect Event Sequence Patterns (e.g. small transaction followed by large transaction)
Tracking - follow some related entity’s state in space, time etc. (e.g. location of airline baggage, vehicle tracking, …)
Detect trends: rise, turn, fall, outliers, complex patterns like a triple bottom, etc. (e.g. algorithmic trading, SLA monitoring, load balancing)
And more….
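The small-transaction-followed-by-large-transaction pattern above can be sketched as per-account state tracking (thresholds and names are illustrative):

```python
def suspicious_pairs(transactions, small=1.0, large=500.0):
    """Flag accounts where a small transaction is immediately followed by a large one."""
    last_was_small = set()   # accounts whose most recent transaction was 'small'
    flagged = []
    for account, amount in transactions:
        if amount >= large and account in last_was_small:
            flagged.append(account)
        # Update per-account state for the next event.
        if amount <= small:
            last_was_small.add(account)
        else:
            last_was_small.discard(account)
    return flagged
```

A streaming SQL engine expresses the same idea declaratively with pattern-matching over ordered rows, but the state machine underneath is this simple.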
Continuously deliver SQL results
Up to five outputs, including S3, Redshift through
Firehose, Kinesis Streams, and more...
• Firehose allows Kinesis Analytics to separate processing and
delivery of enriched streaming data
• Delivery speed will be heavily dependent upon your SQL queries
(i.e. simple ETL versus 10 minute aggregations)
Other destination(s) will be included at GA.
Output formats include JSON, CSV, variable column
(TSV, pipe-delimited)
Questions