Source: files.meetup.com/5297812/dc_meetup_kinesis.pdf

Peter-Mark Verwoerd

Amazon Kinesis: Managed Service for Streaming Data Ingestion & Processing

Solutions Architect

o  Why continuous, real-time processing?
   §  Early stages of the data lifecycle demanding more
   §  Developing the 'right tool for the right job'

o  What can you do with streaming data today?
   §  Customer scenarios
   §  Current approaches

o  What is Amazon Kinesis?
   §  Putting data into Kinesis
   §  Getting data from Kinesis Streams: building applications with the KCL

o  Connecting Amazon Kinesis to other systems
   §  Moving data into S3, DynamoDB, Redshift
   §  Leveraging existing EMR, Storm infrastructure

Amazon Kinesis: Managed Service for Streaming Data Ingestion & Processing

The Motivation for Continuous, Real-Time Processing

Unconstrained Data Growth: Data lifecycle demands frequent, continuous understanding

•  IT / Application server logs: IT infrastructure logs, metering, audit / change logs

•  Web sites / Mobile apps / Ads: clickstream, user engagement

•  Social media, user content: 450MM+ tweets/day

•  Sensor data: weather, smart grids, wearables

{
  "payerId": "Joe",
  "productCode": "AmazonS3",
  "clientProductCode": "AmazonS3",
  "usageType": "Bandwidth",
  "operation": "PUT",
  "value": "22490",
  "timestamp": "1216674828"
}

Metering  Record  

127.0.0.1 user-identifier frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326

Common  Log  Entry  

<165>1 2003-10-11T22:14:15.003Z mymachine.example.com evntslog - ID47 [exampleSDID@32473 iut="3" eventSource="Application" eventID="1011"][examplePriority@32473 class="high"]

Syslog  Entry  

"SeattlePublicWater/Kinesis/123/Realtime" – 412309129140

MQTT  Record  

   <R,AMZN        ,T,G,R1>  

NASDAQ  OMX  Record  

Big data comes from small records

Typical Data Flow: Many different technologies, at different stages of evolution

[Diagram: Client/Sensor → Aggregator → Continuous Processing → Storage → Analytics + Reporting]

Big  Data  

•  Hourly server logs: how your systems were misbehaving an hour ago

•  Weekly / Monthly bill: what you spent this past billing cycle

•  Daily customer-preferences report from your web-site’s click stream: tells you what deal or ad to try next time

•  Daily fraud reports: tells you if there was fraud yesterday

Real-time Big Data

•  CloudWatch metrics: what just went wrong now

•  Real-time spending alerts/caps: guaranteeing you can’t overspend

•  Real-time analysis: tells you what to offer the current customer now

•  Real-time detection: blocks fraudulent use now

Big Data: Best Served Fresh

Customer View

Customer Scenarios across Industry Segments

Scenarios: (1) Accelerated Ingest-Transform-Load, (2) Continual Metrics / KPI Extraction, (3) Responsive Data Analysis

Data Types: IT infrastructure, application logs, social media, financial market data, web clickstreams, sensors, geo/location data

Software / Technology: (1) IT server / app log ingestion, (2) IT operational metrics dashboards, (3) Devices / sensor operational intelligence

Digital Ad Tech / Marketing: (1) Advertising data aggregation, (2) Advertising metrics like coverage, yield, conversion, (3) Analytics on user engagement with ads; optimized bid / buy engines

Financial Services: (1) Market / financial transaction order data collection, (2) Financial market data metrics, (3) Fraud monitoring, Value-at-Risk assessment, auditing of market order data

Consumer Online / E-Commerce: (1) Online customer engagement data aggregation, (2) Consumer engagement metrics like page views, CTR, (3) Customer clickstream analytics, recommendation engines

Customers using Amazon Kinesis: Streaming big data processing in action

Mobile / Social Gaming
•  Scenario: deliver continuous, real-time game insight data from 100's of game servers
•  Prior approach: custom-built solutions, operationally complex to manage and not scalable
•  Challenges: delays in delivering critical business data; developer burden in building a reliable, scalable platform for real-time data ingestion and processing; slowed real-time customer insights
•  Goal: accelerate time to market of elastic, real-time applications, while minimizing operational overhead

Digital Advertising Tech.
•  Scenario: generate real-time metrics and KPIs for online ad performance for advertisers / publishers
•  Prior approach: store-and-forward fleet of log servers, and a Hadoop-based processing pipeline
•  Challenges: lost data in the store/forward layer; operational burden in managing a reliable, scalable platform for real-time data ingestion and processing; batch-driven "real-time" customer insights
•  Goal: generate the freshest analytics on advertiser performance to optimize marketing spend, and increase responsiveness to clients

'Typical' Technology Solution Set

o  Streaming Data Ingestion
   •  Kafka
   •  Flume
   •  Kestrel / Scribe
   •  RabbitMQ / AMQP

o  Streaming Data Processing
   •  Storm

o  Do-It-Yourself (AWS) based solution
   •  EC2: logging / pass-through servers
   •  EBS: holds log / other data snapshots
   •  SQS: queue data store
   •  S3: persistence store
   •  EMR: workflow to ingest data from S3 and process it

o  Exploring continual data ingestion & processing

Solution Architecture Considerations

Flexibility: select the most appropriate software, and configure the underlying infrastructure

Control: software and hardware can be tuned to meet specific business and scenario needs

Ongoing operational complexity: deploy and manage an end-to-end system

Infrastructure planning and maintenance: managing a reliable, scalable infrastructure

Developer / IT staff expense: developers, DevOps and IT staff time and energy expended

Software maintenance: tech and professional services support

Data Streams Ingestion, Continuous Processing: Right Toolset for the Right Job

Real-time Ingest
•  Highly scalable
•  Durable
•  Elastic
•  Replay-able reads

Continuous Processing FX
•  Load-balancing incoming streams
•  Fault-tolerance, checkpoint / replay
•  Elastic

Enable data movement into stores / processing engines

Managed service

Low end-to-end latency

Continuous, real-time workloads

Amazon Kinesis – An Overview

[Diagram: data sources send records through the AWS endpoint into a Kinesis stream (Shard 1, Shard 2 … Shard N), replicated across three Availability Zones; App.1 (Aggregate & De-Duplicate), App.2 (Metric Extraction), App.3 (Sliding Window Analysis), and App.4 (Machine Learning) consume the stream and feed S3, DynamoDB, and Redshift]

Introducing Amazon Kinesis: Service for Real-Time Big Data Ingestion that Enables Continuous Processing

Kinesis Stream: Managed ability to capture and store data

•  Streams are made of Shards

•  Each Shard ingests data at up to 1 MB/sec, and up to 1,000 TPS

•  Each Shard emits up to 2 MB/sec

•  All data is stored for 24 hours

•  Scale Kinesis streams by adding or removing Shards

•  Replay data inside the 24-hour window
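The per-shard limits above (1 MB/s and 1,000 records/s in, 2 MB/s out) determine how many shards a stream needs. As a rough sizing sketch (the function name and example rates are illustrative, not from the deck):

```python
import math

# Per-shard limits quoted above.
INGEST_MB_S = 1.0
INGEST_RECORDS_S = 1000
EGRESS_MB_S = 2.0

def shards_needed(ingest_mb_s, records_s, egress_mb_s):
    """Smallest shard count satisfying all three per-shard limits."""
    return max(
        math.ceil(ingest_mb_s / INGEST_MB_S),
        math.ceil(records_s / INGEST_RECORDS_S),
        math.ceil(egress_mb_s / EGRESS_MB_S),
    )

# e.g. 5 MB/s in, 2,000 records/s, 12 MB/s out (several consuming apps)
print(shards_needed(5, 2000, 12))  # 6 — egress is the binding constraint
```

Whichever dimension dominates (bytes in, records in, or bytes out) sets the shard count, which is why read-heavy streams with multiple consumers often need more shards than ingest alone would suggest.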

Putting Data into Kinesis: Simple PUT interface to store data in Kinesis

•  Producers use a PUT call to store data in a Stream

•  PutRecord {Data, PartitionKey, StreamName}

•  A Partition Key, supplied by the producer, is used to distribute PUTs across Shards

•  Kinesis MD5-hashes the supplied partition key over the hash key range of a Shard

•  A unique Sequence # is returned to the Producer upon a successful PUT call
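The MD5 hashing above can be sketched as follows. This is a simplified model: it assumes the stream's shards split the 128-bit hash-key space evenly (real shard ranges come from describing the stream), and the function name is hypothetical.

```python
import hashlib

def shard_for_key(partition_key, num_shards):
    """Map a partition key to a shard by MD5-hashing it into the
    128-bit hash-key space, split evenly across num_shards."""
    h = int.from_bytes(hashlib.md5(partition_key.encode()).digest(), "big")
    span = 2**128 // num_shards
    return min(h // span, num_shards - 1)

print(shard_for_key("user-42", 4))
```

The key property is determinism: the same partition key always lands on the same shard, so records for one entity stay ordered within a shard.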

[Diagram: many Producers PUT records into Kinesis, distributed across Shard 1, Shard 2, Shard 3, Shard 4 … Shard n]

Creating and Sizing a Kinesis Stream

Getting Started with Kinesis – Writing to a Stream

POST / HTTP/1.1
Host: kinesis.<region>.<domain>
x-amz-Date: <Date>
Authorization: AWS4-HMAC-SHA256 Credential=<Credential>, SignedHeaders=content-type;date;host;user-agent;x-amz-date;x-amz-target;x-amzn-requestid, Signature=<Signature>
User-Agent: <UserAgentString>
Content-Type: application/x-amz-json-1.1
Content-Length: <PayloadSizeBytes>
Connection: Keep-Alive
X-Amz-Target: Kinesis_20131202.PutRecord

{
  "StreamName": "exampleStreamName",
  "Data": "XzxkYXRhPl8x",
  "PartitionKey": "partitionKey"
}
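Note that the Data field in the request body is base64-encoded. A small sketch of building that JSON body (the helper name is illustrative; the payload values are the ones from the request above):

```python
import base64
import json

def put_record_payload(stream_name, data_bytes, partition_key):
    """Build the JSON body of a Kinesis PutRecord call.
    Data must be base64-encoded raw bytes."""
    return json.dumps({
        "StreamName": stream_name,
        "Data": base64.b64encode(data_bytes).decode("ascii"),
        "PartitionKey": partition_key,
    })

# Reproduces the body shown above: "XzxkYXRhPl8x" is base64 for b"_<data>_1"
print(put_record_payload("exampleStreamName", b"_<data>_1", "partitionKey"))
```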

Attributes of Continuous, Stream Processing Apps

Building stream processing apps with Kinesis:
•  Read from the Kinesis Stream directly using the Get* APIs
•  Build applications with the Kinesis Client Library
•  Leverage the Kinesis Spout for Storm

Continuous processing implies "long-running" applications that:
•  Are fault tolerant, to handle failures in hardware or software
•  Are distributed, to handle multiple shards
•  Scale up and down as the number of shards increases or decreases

Reading from Kinesis Streams

[Diagram: Shard 1, Shard 2 … Shard n are read by KCL Workers 1 … n, distributed across multiple EC2 instances]

Building Kinesis Processing Apps: Kinesis Client Library
Client library for fault-tolerant, at-least-once, continuous processing

o  Java client library, source available on GitHub

o  Build & deploy your app with the KCL on your EC2 instance(s)

o  The KCL is an intermediary between your application & the stream
   §  Automatically starts a Kinesis Worker for each shard
   §  Simplifies reading by abstracting individual shards
   §  Increases / decreases Workers as the # of shards changes
   §  Checkpoints to keep track of a Worker's location in the stream; restarts Workers if they fail

o  Integrates with Auto Scaling groups to redistribute workers to new instances
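The KCL itself is a Java library, but its checkpoint-and-restart behavior can be sketched language-neutrally. The sketch below is hypothetical (the KCL actually keeps checkpoints in a DynamoDB table keyed by shard); it only illustrates why a restarted worker resumes where the failed one left off instead of reprocessing the whole shard:

```python
class Checkpointer:
    """In-memory stand-in for the KCL's checkpoint table."""
    def __init__(self):
        self._seq = {}  # shard_id -> last processed sequence number

    def checkpoint(self, shard_id, seq):
        self._seq[shard_id] = seq

    def last(self, shard_id):
        return self._seq.get(shard_id, -1)

def run_worker(shard_id, records, checkpointer, fail_after=None):
    """Process (seq, data) pairs past the last checkpoint; optionally
    'crash' after fail_after records to show restart semantics."""
    done = []
    pending = [r for r in records if r[0] > checkpointer.last(shard_id)]
    for n, (seq, data) in enumerate(pending):
        if fail_after is not None and n == fail_after:
            return done  # simulated worker crash
        done.append(data)
        checkpointer.checkpoint(shard_id, seq)
    return done

cp = Checkpointer()
records = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
first = run_worker("shard-1", records, cp, fail_after=2)  # processes a, b
resumed = run_worker("shard-1", records, cp)              # resumes at c
print(first, resumed)  # ['a', 'b'] ['c', 'd']
```

Because the checkpoint records the last *completed* sequence number, a crash between processing and checkpointing can replay a record, which is the at-least-once guarantee named above.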

More options to read from Kinesis Streams: Leveraging Get APIs, existing Storm topologies

o  Use the Get APIs for raw reads of Kinesis data streams
   •  GetRecords {Limit, ShardIterator}
   •  GetShardIterator {ShardId, ShardIteratorType, StartingSequenceNumber, StreamName}

o  Integrate Kinesis Streams with Storm topologies
   •  Bootstraps, via Zookeeper, to map Shards to Spout tasks
   •  Fetches data from the Kinesis stream
   •  Emits tuples and checkpoints (in Zookeeper)
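The Get APIs page through a shard by chaining iterators: GetShardIterator yields a starting position, and each GetRecords response carries a NextShardIterator for the following call. A sketch of that loop against an in-memory stand-in (real calls need AWS credentials, so the FakeShard class here is purely illustrative):

```python
class FakeShard:
    """In-memory stand-in for one shard, paged like GetRecords:
    the 'iterator' is simply an offset into the record list."""
    def __init__(self, records):
        self._records = records

    def get_shard_iterator(self, shard_iterator_type="TRIM_HORIZON"):
        return 0  # TRIM_HORIZON: start at the oldest available record

    def get_records(self, iterator, limit):
        batch = self._records[iterator:iterator + limit]
        return {"Records": batch, "NextShardIterator": iterator + len(batch)}

def drain(shard, limit=2):
    """Read a shard to the end by chaining NextShardIterator values."""
    out, it = [], shard.get_shard_iterator()
    while True:
        resp = shard.get_records(it, limit)
        if not resp["Records"]:
            break
        out.extend(resp["Records"])
        it = resp["NextShardIterator"]
    return out

print(drain(FakeShard(["r1", "r2", "r3", "r4", "r5"])))
```

A real consumer would also persist the last sequence number it handled, since the iterator itself is short-lived.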

Sending & Reading Data from Kinesis Streams

Sending:
•  HTTP POST
•  AWS SDK
•  Log4J
•  Flume
•  Fluentd

Reading:
•  Get* APIs
•  Kinesis Client Library + Connector Library
•  Apache Storm
•  Amazon Elastic MapReduce

Kinesis Connectors Enabling Data Movement

Amazon Kinesis Connector Library: Customizable, open-source apps to connect Kinesis with S3, Redshift, DynamoDB

ITransformer
•  Defines the transformation of records from the Amazon Kinesis stream in order to suit the user-defined data model

IFilter
•  Excludes irrelevant records from the processing

IBuffer
•  Buffers the set of records to be processed, by specifying a size limit (# of records) & total byte count

IEmitter
•  Makes client calls to other AWS services and persists the records stored in the buffer
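The four interfaces compose into a transform → filter → buffer → emit pipeline. A minimal sketch of that composition, assuming a record-count buffer limit only (the connector library is Java and also supports a byte-count limit; the class below is a hypothetical stand-in whose emitter just collects batches instead of calling S3/Redshift/DynamoDB):

```python
class Connector:
    """Minimal transform -> filter -> buffer -> emit pipeline in the
    spirit of the four interfaces above."""
    def __init__(self, transform, keep, max_records):
        self.transform = transform      # ITransformer
        self.keep = keep                # IFilter
        self.max_records = max_records  # IBuffer size limit
        self.buffer, self.emitted = [], []

    def process(self, record):
        rec = self.transform(record)
        if not self.keep(rec):
            return                         # filtered out
        self.buffer.append(rec)
        if len(self.buffer) >= self.max_records:
            self.emitted.append(self.buffer)  # IEmitter stand-in
            self.buffer = []

c = Connector(transform=str.upper, keep=lambda r: r != "SKIP", max_records=2)
for r in ["a", "skip", "b", "c", "d"]:
    c.process(r)
print(c.emitted, c.buffer)  # [['A', 'B'], ['C', 'D']] []
```

Buffering before emitting is what turns a per-record stream into the batched writes that stores like Redshift and S3 handle efficiently.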

[Diagram: Client/Sensor → Aggregator/Sequencer → Amazon Kinesis (streaming data repository) → continuous processor for dashboards, plus Amazon EMR for batch processing → storage, analytics and reporting in S3, DynamoDB, and Redshift]

Log Processing with Hadoop

[Diagram: My Website pushes log events to Kinesis via the Kinesis Log4J Appender; EMR pulls from the stream for processing with Hive, Pig, Cascading, or MapReduce]

Hadoop ecosystem implementation

•  Hadoop Input format

•  Hive Storage Handler

•  Pig Load Function

•  Cascading Scheme and Tap

Writing to Kinesis using Log4J

private static final Logger LOGGER = Logger.getLogger(KinesisAppender.class);

LOGGER.error("Cannot find resource XYX… go do something about it!");

Processing Kinesis data from Elastic MapReduce using Hive

select instancetime, InstanceID, Message from MyKinesisLogsStream where LogLevel = 'error'

12/13/13:07:51, InstanceID123, Cannot find resource XYX… go do something about it!

Amazon Kinesis Storm Spout: Integrate Kinesis Streams with Storm topologies

•  Bootstrap
   –  Queries the list of shards from Kinesis
   –  Initializes state in Zookeeper
   –  Shards are mapped to spout tasks

•  Fetching data from Kinesis
   –  Fetches a batch of records via GetRecords API calls
   –  Buffers the records in memory
   –  A spout task will fetch data from all shards it is responsible for

•  Emitting tuples and checkpointing
   –  Converts each Kinesis data record into a tuple
   –  Emits the tuple for processing
   –  Checkpoints periodically (checkpoints stored in Zookeeper)

[Diagram: Shards 1 … n mapped across Spout Tasks A, B, and C]

Customers using Amazon Kinesis: Amazon Kinesis in action

Mobile / Social Gaming
•  Scenario: deliver continuous, real-time game insight data from 100's of game servers
•  Prior approach: custom-built solutions, operationally complex to manage and not scalable
•  Challenges: delays in delivering critical business data; developer burden in building a reliable, scalable platform for real-time data ingestion and processing; slowed real-time customer insights
•  Result: Amazon Kinesis accelerates time to market of elastic, real-time applications, while minimizing operational overhead

Digital Advertising Tech.
•  Scenario: generate real-time metrics and KPIs for online ad performance for advertisers / publishers
•  Prior approach: store-and-forward fleet of log servers, and a Hadoop-based processing pipeline
•  Challenges: lost data in the store/forward layer; operational burden in managing a reliable, scalable platform for real-time data ingestion and processing; batch-driven "real-time" customer insights
•  Result: Amazon Kinesis ingests data and enables continuous, fresh analytics on advertiser performance to optimize marketing spend, and increase responsiveness to clients

Gaming Analytics with Amazon Kinesis

[Diagram: Real-time Clickstream Processing App → Aggregate Statistics, Clickstream Archive, In-Game Engagement Trends Dashboard]

Digital Ad. Tech Metering with Amazon Kinesis

[Diagram: Continuous Ad Metrics Extraction → Incremental Ad Statistics Computation, Metering Record Archive, Ad Analytics Dashboard]

Simple Metering & Billing with Amazon Kinesis

[Diagram: Incremental Bill Computation, Metering Record Archive, Billing Management Service, Billing Auditors]

Kinesis Pricing: Simple, pay-as-you-go, & no up-front costs

Pricing Dimension                  | Value
Hourly Shard Rate                  | $0.015
Per 1,000,000 PUT transactions     | $0.028

•  Customers specify throughput requirements in shards, which they control

•  Each Shard delivers 1 MB/s on ingest, and 2 MB/s on egress

•  Inbound data transfer is free

•  EC2 instance charges apply for Kinesis processing applications
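Using the two pricing dimensions in the table above, a rough monthly bill can be estimated as follows (the function and example workload are illustrative; as noted, inbound transfer is free and EC2 for consumers is billed separately):

```python
def monthly_cost(shards, puts_per_second, hours=24 * 30,
                 shard_hour=0.015, per_million_puts=0.028):
    """Rough monthly bill from the two pricing dimensions above."""
    shard_cost = shards * hours * shard_hour
    put_cost = puts_per_second * 3600 * hours / 1_000_000 * per_million_puts
    return round(shard_cost + put_cost, 2)

# e.g. 2 shards ingesting 1,000 records/sec for a 30-day month:
# shard cost 2 * 720 * $0.015 = $21.60, PUT cost ~ $72.58
print(monthly_cost(2, 1000))
```

At sustained high record rates the PUT charge dominates the shard-hour charge, which is why batching small records into fewer, larger PUTs was a common cost optimization.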

Easy Administration: Managed service for real-time streaming data collection, processing and analysis. Simply create a new stream, set the desired level of capacity, and let the service handle the rest.

Real-time Performance: Perform continual processing on streaming big data. Processing latencies fall to a few seconds, compared with the minutes or hours associated with batch processing.

High Throughput, Elastic: Seamlessly scale to match your data throughput rate and volume. You can easily scale up to gigabytes per second. The service will scale up or down based on your operational or business needs.

S3, Redshift, & DynamoDB Integration: Reliably collect, process, and transform all of your data in real time & deliver it to the AWS data stores of your choice, with Connectors for S3, Redshift, and DynamoDB.

Build Real-time Applications: Client libraries enable developers to design and operate real-time streaming data processing applications.

Low Cost: Cost-efficient for workloads of any scale. You can get started by provisioning a small stream, and pay low hourly rates only for what you use.

Amazon Kinesis: Key Benefits

AWS Big Data Services

Amazon EMR · Amazon S3 · Amazon DynamoDB · Amazon Glacier · Amazon Redshift · AWS Data Pipeline · Amazon Kinesis

Try out Amazon Kinesis

•  Try out Amazon Kinesis –  http://aws.amazon.com/kinesis/

•  Thumb through the Developer Guide –  http://aws.amazon.com/documentation/kinesis/

•  Visit, and Post on Kinesis Forum –  https://forums.aws.amazon.com/forum.jspa?forumID=169#

•  Continue the conversation on Twitter –  @petermark

•  Contact me –  verwoerd@amazon.com

•  Contact Doug Lott –  douglott@amazon.com

Questions?

Thank You
