Towards a Real-time Processing Pipeline: Running Apache Flink on AWS
Dr. Steffen Hausmann, Solutions Architect
Michael Hanisch, Manager Solutions Architecture
November 18th, 2016

Stream Processing Challenges

• Event time and out-of-order events
• Consistency, fault tolerance, and high availability
• Rich forms of window queries
• Low latency and high throughput


Analyzing NYC Taxi Rides in Real Time


Event Processing Architecture

• “Replayable log”: Amazon Kinesis
• Processing: Apache Flink
• Visualization: Amazon Elasticsearch


Apache Flink

“Apache Flink® is an open source platform for distributed stream and batch data processing.”

https://flink.apache.org/
http://data-artisans.com/why-apache-flink/


Amazon Elastic MapReduce (EMR)

• Easily provision & manage clusters for your big data needs
• Hadoop, Spark, Presto, HBase, Tez, Hive, Pig, …
• Apache Flink support added in EMR 5.1
• Dynamically scalable, persistent or transient clusters
• Provides access control, firewalls, encryption


Amazon Kinesis

• Managed Service for Real Time Big Data Processing
• Create Streams to Produce & Consume Data
• Elastically Add and Remove Shards for Throughput
• Secured via AWS IAM
• Durable storage of data streams


Amazon Kinesis

[Diagram. Labels: Data Sources (several); AWS Endpoint; Shard 1, Shard 2, …, Shard N; three Availability Zones; App. 1 [Aggregate & De-Duplicate]; App. 2 [Metric Extraction]; App. 3 [Sliding Window Analysis]; App. 4 [Machine Learning]; S3; DynamoDB; Redshift]


Amazon Kinesis

• Central bus for all event data
• Decoupling of multiple producers and consumers
• Keeps a ‘replayable log’ of your events
• Many options to consume events with Apache Flink (new), Spark Streaming, Presto, Hive, Pig, Storm (or custom KCL apps), …


Amazon Elasticsearch Service

• Provisions and maintains an Elasticsearch cluster
• Complete ELK stack, including Kibana
• Scalable
• Secured via AWS IAM


Architecture

[Diagram. Labels: Amazon Kinesis; Amazon EMR; Amazon Elasticsearch Service; EC2 instance (bastion host)]


Demo


Lessons Learned


Building the Flink Kinesis Connector

• The Flink Kinesis connector artifact is not available from Maven Central
• Build the connector with Maven 3.0.5
    mvn clean install -Pinclude-kinesis -DskipTests -Dhadoop-two.version=2.7.2
• For future projects, add the dependency to your local Maven repository
    mvn install:install-file -Dfile=flink-connector-kinesis_2.10-1.1.3.jar


Approximate Event Time

• Each Amazon Kinesis record includes an ApproximateArrivalTimestamp
• The timestamp is set when an Amazon Kinesis stream successfully receives and stores a record
• By default, Flink uses this timestamp as the event time when reading from a Kinesis stream

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);


Event Time and Watermarks

• With event time, the time of an event is determined by the producer
• Flink measures progress in event time by means of watermarks
• Watermarks must be ingested into each individual Kinesis shard

    DataStream<Event> kinesis = env
        .addSource(new FlinkKinesisConsumer<>(...))
        .assignTimestampsAndWatermarks(new PunctuatedAssigner());
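Why every shard needs its own watermarks can be illustrated with a small plain-Java sketch. It is not Flink code and not the connector's implementation. `ShardWatermarks` and the shard IDs are hypothetical names, and the sketch assumes that overall event-time progress is the minimum of the watermarks seen per shard, which is how Flink combines watermarks from parallel inputs.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper (not part of Flink): tracks one watermark per shard. */
final class ShardWatermarks {
    private final Map<String, Long> perShard = new HashMap<>();

    ShardWatermarks(String... shardIds) {
        for (String id : shardIds) {
            perShard.put(id, Long.MIN_VALUE); // no watermark seen yet
        }
    }

    /** Records a watermark for one shard; a shard's watermark never moves backwards. */
    void onWatermark(String shardId, long timestamp) {
        perShard.merge(shardId, timestamp, Math::max);
    }

    /** Overall event-time progress: the minimum over all shards. */
    long current() {
        long min = Long.MAX_VALUE;
        for (long watermark : perShard.values()) {
            min = Math.min(min, watermark);
        }
        return min;
    }
}
```

Under this assumption, a shard that never receives a watermark keeps `current()` at `Long.MIN_VALUE`, so event-time windows downstream would not fire no matter how far the other shards have advanced.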


Data Encryption with Amazon EMR and Flink

Security configuration supports encryption
• for data stored within the file system
• Hadoop Distributed File System (HDFS) block-transfer and RPC
• S3 data (SSE-S3, SSE-KMS, CSE-KMS, CSE-Custom)
• Local disk (except boot volumes)
• In-transit data (no Flink support yet)

    env.readTextFile("s3://...");
    env.setStateBackend(new FsStateBackend("hdfs://..."));


Connecting to the Flink Dashboard

• Use dynamic port forwarding to the Master node
    ssh -D 8157 hadoop@...
• Use FoxyProxy to route URLs matching these patterns through the proxy on localhost
    *ec2*.amazonaws.com*
    *.compute.internal*
• Navigate to the YARN Resource Manager and select the Tracking UI
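To see which URLs the two patterns above would send through the tunnel, here is a minimal plain-Java sketch. It assumes that `*` matches any run of characters and that a pattern must match the whole URL; FoxyProxy's actual matching may differ in detail. `ProxyPatterns` and the host names in the usage note are hypothetical.

```java
import java.util.regex.Pattern;

/** Hypothetical helper: checks URLs against the wildcard patterns from the slide. */
final class ProxyPatterns {
    static final String[] PATTERNS = { "*ec2*.amazonaws.com*", "*.compute.internal*" };

    /** Converts a wildcard pattern into a regex; only '*' is treated as special. */
    static Pattern toRegex(String wildcard) {
        String[] literals = wildcard.split("\\*", -1);
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < literals.length; i++) {
            if (i > 0) {
                regex.append(".*"); // each '*' becomes "any run of characters"
            }
            regex.append(Pattern.quote(literals[i]));
        }
        return Pattern.compile(regex.toString());
    }

    /** Returns true if the URL matches at least one pattern. */
    static boolean usesProxy(String url) {
        for (String pattern : PATTERNS) {
            if (toRegex(pattern).matcher(url).matches()) {
                return true;
            }
        }
        return false;
    }
}
```

For example, `ProxyPatterns.usesProxy("http://ip-10-0-0-12.eu-central-1.compute.internal:20888/proxy/")` is true, while `ProxyPatterns.usesProxy("https://flink.apache.org/")` is false, so ordinary browsing bypasses the tunnel.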


Starting Flink and Submitting Jobs

Use EMR steps to interact with Flink through the AWS API


Extending Flink Functionality

• The Flink Elasticsearch sink supports only the TCP transport
• A custom Elasticsearch sink with HTTP support requires only a few dozen lines of code using
    Jest (io.searchbox)
    aws-signing-request-interceptor (vc.inreach.aws)
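The payload side of such an HTTP sink can be sketched in plain Java. This is not the sink from the talk: it only builds the newline-delimited JSON body that the Elasticsearch `_bulk` endpoint accepts, and leaves the HTTP call and AWS request signing to the libraries named above. `BulkRequestBody` and the index and type names in the usage note are hypothetical. The sketch assumes an Elasticsearch version that still uses mapping types, and that documents arrive as already serialized JSON strings.

```java
import java.util.List;

/** Hypothetical helper: builds the body of an Elasticsearch bulk index request. */
final class BulkRequestBody {
    /**
     * Emits one action line and one source line per document.
     * Assumes index and type contain no characters that need JSON escaping.
     */
    static String build(String index, String type, List<String> jsonDocuments) {
        StringBuilder body = new StringBuilder();
        for (String document : jsonDocuments) {
            body.append("{\"index\":{\"_index\":\"").append(index)
                .append("\",\"_type\":\"").append(type).append("\"}}\n");
            body.append(document).append('\n'); // the body must end with a newline
        }
        return body.toString();
    }
}
```

A sink could buffer events, call `BulkRequestBody.build("taxi-rides", "ride", buffered)` when the buffer is full, and send the result in a single signed HTTP request instead of one request per event.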


Questions?

[email protected]@amazon.de