Source: files.meetup.com/5297812/dc_meetup_kinesis.pdf

Peter-Mark Verwoerd

Amazon Kinesis: Managed Service for Streaming Data Ingestion & Processing

Solutions Architect

o  Why continuous, real-time processing?
   §  Early stages of the data lifecycle demanding more
   §  Developing the 'right tool for the right job'

o  What can you do with streaming data today?
   §  Customer scenarios
   §  Current approaches

o  What is Amazon Kinesis?
   §  Putting data into Kinesis
   §  Getting data from Kinesis Streams: building applications with the KCL

o  Connecting Amazon Kinesis to other systems
   §  Moving data into S3, DynamoDB, Redshift
   §  Leveraging existing EMR, Storm infrastructure

Amazon Kinesis: Managed Service for Streaming Data Ingestion & Processing

The Motivation for Continuous, Real-Time Processing

Unconstrained Data Growth: Data lifecycle demands frequent, continuous understanding

•  IT / Application server logs: IT infrastructure logs, metering, audit / change logs

•  Web sites / Mobile apps / Ads: clickstream, user engagement

•  Social media, user content: 450MM+ tweets/day

•  Sensor data: weather, smart grids, wearables

{
  "payerId": "Joe",
  "productCode": "AmazonS3",
  "clientProductCode": "AmazonS3",
  "usageType": "Bandwidth",
  "operation": "PUT",
  "value": "22490",
  "timestamp": "1216674828"
}

Metering  Record  

127.0.0.1 user-identifier frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326

Common  Log  Entry  

<165>1 2003-10-11T22:14:15.003Z mymachine.example.com evntslog - ID47 [exampleSDID@32473 iut="3" eventSource="Application" eventID="1011"][examplePriority@32473 class="high"]

Syslog  Entry  

"SeattlePublicWater/Kinesis/123/Realtime" – 412309129140

MQTT  Record  

   <R,AMZN        ,T,G,R1>  

NASDAQ  OMX  Record  

Big data comes from small records

Typical Data Flow: Many different technologies, at different stages of evolution

[Diagram: Client/Sensor → Aggregator → Continuous Processing → Storage → Analytics + Reporting]

Big  Data  

•  Hourly server logs: how your systems were misbehaving an hour ago

•  Weekly / Monthly bill: what you spent this past billing cycle

•  Daily customer-preferences report from your web-site’s click stream: tells you what deal or ad to try next time

•  Daily fraud reports: tells you if there was fraud yesterday

Real-time Big Data

•  CloudWatch metrics: what just went wrong now

•  Real-time spending alerts/caps: guaranteeing you can’t overspend

•  Real-time analysis: tells you what to offer the current customer now

•  Real-time detection: blocks fraudulent use now

Big Data: Best Served Fresh

Customer View

Customer Scenarios across Industry Segments

Scenarios: (1) Accelerated Ingest-Transform-Load, (2) Continual Metrics / KPI Extraction, (3) Responsive Data Analysis

Data Types: IT infrastructure, application logs, social media, financial market data, web clickstreams, sensors, geo/location data

Software / Technology: (1) IT server / app log ingestion, (2) IT operational metrics dashboards, (3) Devices / sensor operational intelligence

Digital Ad Tech / Marketing: (1) Advertising data aggregation, (2) Advertising metrics like coverage, yield, conversion, (3) Analytics on user engagement with ads; optimized bid / buy engines

Financial Services: (1) Market / financial transaction order data collection, (2) Financial market data metrics, (3) Fraud monitoring, Value-at-Risk assessment, auditing of market order data

Consumer Online / E-Commerce: (1) Online customer engagement data aggregation, (2) Consumer engagement metrics like page views, CTR, (3) Customer clickstream analytics, recommendation engines

Customers using Amazon Kinesis: Streaming big data processing in action

Mobile / Social Gaming
•  Scenario: deliver continuous, real-time game insight data from 100's of game servers
•  Prior approach: custom-built solutions, operationally complex to manage and not scalable
•  Challenges: delays in delivering critical business data; developer burden in building a reliable, scalable platform for real-time data ingestion and processing; slowed real-time customer insights
•  Goal: accelerate time to market of elastic, real-time applications, while minimizing operational overhead

Digital Advertising Tech.
•  Scenario: generate real-time metrics and KPIs for online ad performance for advertisers / publishers
•  Prior approach: store-and-forward fleet of log servers, and a Hadoop-based processing pipeline
•  Challenges: lost data in the store/forward layer; operational burden in managing a reliable, scalable platform for real-time data ingestion and processing; batch-driven "real-time" customer insights
•  Goal: generate the freshest analytics on advertiser performance to optimize marketing spend, and increase responsiveness to clients

'Typical' Technology Solution Set

o  Streaming Data Ingestion
   •  Kafka
   •  Flume
   •  Kestrel / Scribe
   •  RabbitMQ / AMQP

o  Streaming Data Processing
   •  Storm

o  Do-It-Yourself (AWS) based solution
   •  EC2: logging / pass-through servers
   •  EBS: holds log / other data snapshots
   •  SQS: queue data store
   •  S3: persistence store
   •  EMR: workflow to ingest data from S3 and process it

o  Exploring continual data ingestion & processing

Solution Architecture Considerations

Flexibility: select the most appropriate software, and configure the underlying infrastructure

Control: software and hardware can be tuned to meet specific business and scenario needs

Ongoing operational complexity: deploy and manage an end-to-end system

Infrastructure planning and maintenance: managing a reliable, scalable infrastructure

Developer / IT staff expense: developers, DevOps and IT staff time and energy expended

Software maintenance: tech and professional services support

Data Streams Ingestion, Continuous Processing: Right Toolset for the Right Job

Real-time Ingest
•  Highly scalable
•  Durable
•  Elastic
•  Replay-able reads

Continuous Processing FX
•  Load-balancing incoming streams
•  Fault-tolerance, checkpoint / replay
•  Elastic

Enable data movement into stores / processing engines

Managed service

Low end-to-end latency

Continuous, real-time workloads

Amazon Kinesis – An Overview

[Diagram: data sources send records through the AWS endpoint into a Kinesis stream (Shard 1, Shard 2 … Shard N), replicated across three Availability Zones; App.1 (Aggregate & De-Duplicate), App.2 (Metric Extraction), App.3 (Sliding Window Analysis), and App.4 (Machine Learning) consume the stream and feed S3, DynamoDB, and Redshift]

Introducing Amazon Kinesis: Service for Real-Time Big Data Ingestion that Enables Continuous Processing

Kinesis Stream: Managed ability to capture and store data

•  Streams are made of Shards

•  Each Shard ingests data at up to 1 MB/sec, and up to 1,000 TPS

•  Each Shard emits up to 2 MB/sec

•  All data is stored for 24 hours

•  Scale Kinesis streams by adding or removing Shards

•  Replay data inside the 24-hour window
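The per-shard limits above (1 MB/s and 1,000 records/s in, 2 MB/s out) determine how many shards a stream needs. As a rough sizing sketch (the function name and example rates are illustrative, not from the deck):

```python
import math

# Per-shard limits quoted above.
INGEST_MB_S = 1.0
INGEST_RECORDS_S = 1000
EGRESS_MB_S = 2.0

def shards_needed(ingest_mb_s, records_s, egress_mb_s):
    """Smallest shard count satisfying all three per-shard limits."""
    return max(
        math.ceil(ingest_mb_s / INGEST_MB_S),
        math.ceil(records_s / INGEST_RECORDS_S),
        math.ceil(egress_mb_s / EGRESS_MB_S),
    )

# e.g. 5 MB/s in, 2,000 records/s, 12 MB/s out (several consuming apps)
print(shards_needed(5, 2000, 12))  # 6 — egress is the binding constraint
```

Whichever dimension dominates (bytes in, records in, or bytes out) sets the shard count, which is why read-heavy streams with multiple consumers often need more shards than ingest alone would suggest.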

Putting Data into Kinesis: Simple PUT interface to store data in Kinesis

•  Producers use a PUT call to store data in a Stream

•  PutRecord {Data, PartitionKey, StreamName}

•  A Partition Key, supplied by the producer, is used to distribute PUTs across Shards

•  Kinesis MD5-hashes the supplied partition key over the hash key range of a Shard

•  A unique Sequence # is returned to the Producer upon a successful PUT call
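The MD5 hashing above can be sketched as follows. This is a simplified model: it assumes the stream's shards split the 128-bit hash-key space evenly (real shard ranges come from describing the stream), and the function name is hypothetical.

```python
import hashlib

def shard_for_key(partition_key, num_shards):
    """Map a partition key to a shard by MD5-hashing it into the
    128-bit hash-key space, split evenly across num_shards."""
    h = int.from_bytes(hashlib.md5(partition_key.encode()).digest(), "big")
    span = 2**128 // num_shards
    return min(h // span, num_shards - 1)

print(shard_for_key("user-42", 4))
```

The key property is determinism: the same partition key always lands on the same shard, so records for one entity stay ordered within a shard.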

[Diagram: many Producers PUT records into Kinesis, distributed across Shard 1, Shard 2, Shard 3, Shard 4 … Shard n]

Creating and Sizing a Kinesis Stream

Getting Started with Kinesis – Writing to a Stream

POST / HTTP/1.1
Host: kinesis.<region>.<domain>
x-amz-Date: <Date>
Authorization: AWS4-HMAC-SHA256 Credential=<Credential>, SignedHeaders=content-type;date;host;user-agent;x-amz-date;x-amz-target;x-amzn-requestid, Signature=<Signature>
User-Agent: <UserAgentString>
Content-Type: application/x-amz-json-1.1
Content-Length: <PayloadSizeBytes>
Connection: Keep-Alive
X-Amz-Target: Kinesis_20131202.PutRecord

{
  "StreamName": "exampleStreamName",
  "Data": "XzxkYXRhPl8x",
  "PartitionKey": "partitionKey"
}
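Note that the Data field in the request body is base64-encoded. A small sketch of building that JSON body (the helper name is illustrative; the payload values are the ones from the request above):

```python
import base64
import json

def put_record_payload(stream_name, data_bytes, partition_key):
    """Build the JSON body of a Kinesis PutRecord call.
    Data must be base64-encoded raw bytes."""
    return json.dumps({
        "StreamName": stream_name,
        "Data": base64.b64encode(data_bytes).decode("ascii"),
        "PartitionKey": partition_key,
    })

# Reproduces the body shown above: "XzxkYXRhPl8x" is base64 for b"_<data>_1"
print(put_record_payload("exampleStreamName", b"_<data>_1", "partitionKey"))
```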

Attributes of Continuous, Stream Processing Apps

Building stream processing apps with Kinesis:
•  Read from the Kinesis Stream directly using the Get* APIs
•  Build applications with the Kinesis Client Library
•  Leverage the Kinesis Spout for Storm

Continuous processing implies "long-running" applications that:
•  Are fault tolerant, to handle failures in hardware or software
•  Are distributed, to handle multiple shards
•  Scale up and down as the number of shards increases or decreases

Reading from Kinesis Streams

[Diagram: Shard 1, Shard 2 … Shard n are read by KCL Workers 1 … n, distributed across multiple EC2 instances]

Building Kinesis Processing Apps: Kinesis Client Library
Client library for fault-tolerant, at-least-once, continuous processing

o  Java client library, source available on GitHub

o  Build & deploy your app with the KCL on your EC2 instance(s)

o  The KCL is an intermediary between your application & the stream
   §  Automatically starts a Kinesis Worker for each shard
   §  Simplifies reading by abstracting individual shards
   §  Increases / decreases Workers as the # of shards changes
   §  Checkpoints to keep track of a Worker's location in the stream; restarts Workers if they fail

o  Integrates with Auto Scaling groups to redistribute workers to new instances
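The KCL itself is a Java library, but its checkpoint-and-restart behavior can be sketched language-neutrally. The sketch below is hypothetical (the KCL actually keeps checkpoints in a DynamoDB table keyed by shard); it only illustrates why a restarted worker resumes where the failed one left off instead of reprocessing the whole shard:

```python
class Checkpointer:
    """In-memory stand-in for the KCL's checkpoint table."""
    def __init__(self):
        self._seq = {}  # shard_id -> last processed sequence number

    def checkpoint(self, shard_id, seq):
        self._seq[shard_id] = seq

    def last(self, shard_id):
        return self._seq.get(shard_id, -1)

def run_worker(shard_id, records, checkpointer, fail_after=None):
    """Process (seq, data) pairs past the last checkpoint; optionally
    'crash' after fail_after records to show restart semantics."""
    done = []
    pending = [r for r in records if r[0] > checkpointer.last(shard_id)]
    for n, (seq, data) in enumerate(pending):
        if fail_after is not None and n == fail_after:
            return done  # simulated worker crash
        done.append(data)
        checkpointer.checkpoint(shard_id, seq)
    return done

cp = Checkpointer()
records = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
first = run_worker("shard-1", records, cp, fail_after=2)  # processes a, b
resumed = run_worker("shard-1", records, cp)              # resumes at c
print(first, resumed)  # ['a', 'b'] ['c', 'd']
```

Because the checkpoint records the last *completed* sequence number, a crash between processing and checkpointing can replay a record, which is the at-least-once guarantee named above.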

More options to read from Kinesis Streams: Leveraging Get APIs, existing Storm topologies

o  Use the Get APIs for raw reads of Kinesis data streams
   •  GetRecords {Limit, ShardIterator}
   •  GetShardIterator {ShardId, ShardIteratorType, StartingSequenceNumber, StreamName}

o  Integrate Kinesis Streams with Storm topologies
   •  Bootstraps, via Zookeeper, to map Shards to Spout tasks
   •  Fetches data from the Kinesis stream
   •  Emits tuples and checkpoints (in Zookeeper)
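The Get APIs page through a shard by chaining iterators: GetShardIterator yields a starting position, and each GetRecords response carries a NextShardIterator for the following call. A sketch of that loop against an in-memory stand-in (real calls need AWS credentials, so the FakeShard class here is purely illustrative):

```python
class FakeShard:
    """In-memory stand-in for one shard, paged like GetRecords:
    the 'iterator' is simply an offset into the record list."""
    def __init__(self, records):
        self._records = records

    def get_shard_iterator(self, shard_iterator_type="TRIM_HORIZON"):
        return 0  # TRIM_HORIZON: start at the oldest available record

    def get_records(self, iterator, limit):
        batch = self._records[iterator:iterator + limit]
        return {"Records": batch, "NextShardIterator": iterator + len(batch)}

def drain(shard, limit=2):
    """Read a shard to the end by chaining NextShardIterator values."""
    out, it = [], shard.get_shard_iterator()
    while True:
        resp = shard.get_records(it, limit)
        if not resp["Records"]:
            break
        out.extend(resp["Records"])
        it = resp["NextShardIterator"]
    return out

print(drain(FakeShard(["r1", "r2", "r3", "r4", "r5"])))
```

A real consumer would also persist the last sequence number it handled, since the iterator itself is short-lived.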

Sending & Reading Data from Kinesis Streams

Sending:
•  HTTP POST
•  AWS SDK
•  Log4J
•  Flume
•  Fluentd

Reading:
•  Get* APIs
•  Kinesis Client Library + Connector Library
•  Apache Storm
•  Amazon Elastic MapReduce

Kinesis Connectors Enabling Data Movement

Amazon Kinesis Connector Library: Customizable, open-source apps to connect Kinesis with S3, Redshift, DynamoDB

ITransformer
•  Defines the transformation of records from the Amazon Kinesis stream in order to suit the user-defined data model

IFilter
•  Excludes irrelevant records from the processing

IBuffer
•  Buffers the set of records to be processed, by specifying a size limit (# of records) & total byte count

IEmitter
•  Makes client calls to other AWS services and persists the records stored in the buffer
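The four interfaces compose into a transform → filter → buffer → emit pipeline. A minimal sketch of that composition, assuming a record-count buffer limit only (the connector library is Java and also supports a byte-count limit; the class below is a hypothetical stand-in whose emitter just collects batches instead of calling S3/Redshift/DynamoDB):

```python
class Connector:
    """Minimal transform -> filter -> buffer -> emit pipeline in the
    spirit of the four interfaces above."""
    def __init__(self, transform, keep, max_records):
        self.transform = transform      # ITransformer
        self.keep = keep                # IFilter
        self.max_records = max_records  # IBuffer size limit
        self.buffer, self.emitted = [], []

    def process(self, record):
        rec = self.transform(record)
        if not self.keep(rec):
            return                         # filtered out
        self.buffer.append(rec)
        if len(self.buffer) >= self.max_records:
            self.emitted.append(self.buffer)  # IEmitter stand-in
            self.buffer = []

c = Connector(transform=str.upper, keep=lambda r: r != "SKIP", max_records=2)
for r in ["a", "skip", "b", "c", "d"]:
    c.process(r)
print(c.emitted, c.buffer)  # [['A', 'B'], ['C', 'D']] []
```

Buffering before emitting is what turns a per-record stream into the batched writes that stores like Redshift and S3 handle efficiently.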

[Diagram: Client/Sensor → Aggregator/Sequencer → Amazon Kinesis (streaming data repository) → continuous processor for dashboards, plus Amazon EMR for batch processing → storage, analytics and reporting in S3, DynamoDB, and Redshift]

Log Processing with Hadoop

[Diagram: My Website pushes log events to Kinesis via the Kinesis Log4J Appender; EMR pulls from the stream for processing with Hive, Pig, Cascading, or MapReduce]

Hadoop ecosystem implementation

•  Hadoop Input format

•  Hive Storage Handler

•  Pig Load Function

•  Cascading Scheme and Tap

Writing to Kinesis using Log4J

private static final Logger LOGGER = Logger.getLogger(KinesisAppender.class);

LOGGER.error("Cannot find resource XYX… go do something about it!");

Processing Kinesis data from Elastic MapReduce using Hive

select instancetime, InstanceID, Message from MyKinesisLogsStream where LogLevel = 'error'

12/13/13:07:51, InstanceID123, Cannot find resource XYX… go do something about it!

Amazon Kinesis Storm Spout: Integrate Kinesis Streams with Storm topologies

•  Bootstrap
   –  Queries the list of shards from Kinesis
   –  Initializes state in Zookeeper
   –  Shards are mapped to spout tasks

•  Fetching data from Kinesis
   –  Fetches a batch of records via GetRecords API calls
   –  Buffers the records in memory
   –  A spout task will fetch data from all shards it is responsible for

•  Emitting tuples and checkpointing
   –  Converts each Kinesis data record into a tuple
   –  Emits the tuple for processing
   –  Checkpoints periodically (checkpoints stored in Zookeeper)

[Diagram: Shards 1 … n mapped across Spout Tasks A, B, and C]

Customers using Amazon Kinesis: Amazon Kinesis in action

Mobile / Social Gaming
•  Scenario: deliver continuous, real-time game insight data from 100's of game servers
•  Prior approach: custom-built solutions, operationally complex to manage and not scalable
•  Challenges: delays in delivering critical business data; developer burden in building a reliable, scalable platform for real-time data ingestion and processing; slowed real-time customer insights
•  Result: Amazon Kinesis accelerates time to market of elastic, real-time applications, while minimizing operational overhead

Digital Advertising Tech.
•  Scenario: generate real-time metrics and KPIs for online ad performance for advertisers / publishers
•  Prior approach: store-and-forward fleet of log servers, and a Hadoop-based processing pipeline
•  Challenges: lost data in the store/forward layer; operational burden in managing a reliable, scalable platform for real-time data ingestion and processing; batch-driven "real-time" customer insights
•  Result: Amazon Kinesis ingests data and enables continuous, fresh analytics on advertiser performance to optimize marketing spend, and increase responsiveness to clients

Gaming Analytics with Amazon Kinesis

[Diagram: Real-time Clickstream Processing App → Aggregate Statistics, Clickstream Archive, In-Game Engagement Trends Dashboard]

Digital Ad. Tech Metering with Amazon Kinesis

[Diagram: Continuous Ad Metrics Extraction → Incremental Ad Statistics Computation, Metering Record Archive, Ad Analytics Dashboard]

Simple Metering & Billing with Amazon Kinesis

[Diagram: Incremental Bill Computation, Metering Record Archive, Billing Management Service, Billing Auditors]

Kinesis Pricing: Simple, pay-as-you-go, & no up-front costs

Pricing Dimension                  | Value
Hourly Shard Rate                  | $0.015
Per 1,000,000 PUT transactions     | $0.028

•  Customers specify throughput requirements in shards, which they control

•  Each Shard delivers 1 MB/s on ingest, and 2 MB/s on egress

•  Inbound data transfer is free

•  EC2 instance charges apply for Kinesis processing applications
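Using the two pricing dimensions in the table above, a rough monthly bill can be estimated as follows (the function and example workload are illustrative; as noted, inbound transfer is free and EC2 for consumers is billed separately):

```python
def monthly_cost(shards, puts_per_second, hours=24 * 30,
                 shard_hour=0.015, per_million_puts=0.028):
    """Rough monthly bill from the two pricing dimensions above."""
    shard_cost = shards * hours * shard_hour
    put_cost = puts_per_second * 3600 * hours / 1_000_000 * per_million_puts
    return round(shard_cost + put_cost, 2)

# e.g. 2 shards ingesting 1,000 records/sec for a 30-day month:
# shard cost 2 * 720 * $0.015 = $21.60, PUT cost ~ $72.58
print(monthly_cost(2, 1000))
```

At sustained high record rates the PUT charge dominates the shard-hour charge, which is why batching small records into fewer, larger PUTs was a common cost optimization.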

Easy Administration: Managed service for real-time streaming data collection, processing and analysis. Simply create a new stream, set the desired level of capacity, and let the service handle the rest.

Real-time Performance: Perform continual processing on streaming big data. Processing latencies fall to a few seconds, compared with the minutes or hours associated with batch processing.

High Throughput, Elastic: Seamlessly scale to match your data throughput rate and volume. You can easily scale up to gigabytes per second. The service will scale up or down based on your operational or business needs.

S3, Redshift, & DynamoDB Integration: Reliably collect, process, and transform all of your data in real time & deliver it to the AWS data stores of your choice, with Connectors for S3, Redshift, and DynamoDB.

Build Real-time Applications: Client libraries enable developers to design and operate real-time streaming data processing applications.

Low Cost: Cost-efficient for workloads of any scale. You can get started by provisioning a small stream, and pay low hourly rates only for what you use.

Amazon Kinesis: Key Benefits

AWS Big Data Services

Amazon EMR · Amazon S3 · Amazon DynamoDB · Amazon Glacier · Amazon Redshift · AWS Data Pipeline · Amazon Kinesis

Try out Amazon Kinesis

•  Try out Amazon Kinesis –  http://aws.amazon.com/kinesis/

•  Thumb through the Developer Guide –  http://aws.amazon.com/documentation/kinesis/

•  Visit, and Post on Kinesis Forum –  https://forums.aws.amazon.com/forum.jspa?forumID=169#

•  Continue the conversation on Twitter –  @petermark

•  Contact me –  verwoerd@amazon.com

•  Contact Doug Lott –  douglott@amazon.com

Questions?

Thank You
