Towards a Real-time Processing Pipeline: Running Apache Flink on AWS
Dr. Steffen Hausmann, Solutions Architect
Michael Hanisch, Manager Solutions Architecture
November 18th, 2016
Stream Processing Challenges
• Event time and out-of-order events
• Consistency, fault tolerance, and high availability
• Rich forms of window queries
• Low latency and high throughput
Analyzing NYC Taxi Rides in Real Time
Event Processing Architecture
“Replayable Log” (Amazon Kinesis) → Processing (Apache Flink) → Visualization (Amazon Elasticsearch)
Apache Flink
“Apache Flink® is an open source platform for distributed stream and batch data processing.”
https://flink.apache.org/
http://data-artisans.com/why-apache-flink/
Apache Flink
Amazon Elastic MapReduce (EMR)
• Easily provision & manage clusters for your big data needs
• Hadoop, Spark, Presto, HBase, Tez, Hive, Pig, …
• Apache Flink support added in EMR 5.1
• Dynamically scalable, persistent or transient clusters
• Provides access control, firewalls, encryption
Amazon Kinesis
• Managed Service for Real Time Big Data Processing
• Create Streams to Produce & Consume Data
• Elastically Add and Remove Shards for Throughput
• Secured via AWS IAM
• Durable storage of data streams
[Diagram: multiple data sources write to an Amazon Kinesis stream (Shard 1, Shard 2, … Shard N, replicated across three Availability Zones); consumer applications read from the stream: App.1 (Aggregate & De-Duplicate), App.2 (Metric Extraction) feeding S3, DynamoDB, and Redshift, App.3 (Sliding Window Analysis), and App.4 (Machine Learning) behind an AWS endpoint]
Amazon Kinesis
• Central bus for all event data
• Decoupling of multiple producers and consumers
• Keeps a ‘replayable log’ of your events
• Many options to consume events with Apache Flink (new), Spark Streaming, Presto, Hive, Pig, Storm (or custom KCL apps), …
Amazon Elasticsearch Service
• Provisions and maintains an Elasticsearch cluster
• Complete ELK stack, including Kibana
• Scalable
• Secured via AWS IAM
Architecture
Amazon Kinesis → Amazon EMR → Amazon Elasticsearch Service
EC2 instance (bastion host)
Demo
Lessons Learned
Building the Flink Kinesis Connector
• The Flink Kinesis connector artifact is not available from Maven Central
• Build the connector with Maven 3.0.5
  mvn clean install -Pinclude-kinesis -DskipTests -Dhadoop-two.version=2.7.2
• For future projects, add the dependency to your local Maven repository
  mvn install:install-file -Dfile=flink-connector-kinesis_2.10-1.1.3.jar
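The install:install-file invocation above also needs the artifact coordinates. A minimal sketch, assuming the connector was built as version 1.1.3 for Scala 2.10 (the groupId and artifactId below are inferred from the jar name, not stated in the talk):

```shell
# Install the locally built connector jar into the local Maven repository,
# so later projects can declare it as a regular dependency.
mvn install:install-file \
  -Dfile=flink-connector-kinesis_2.10-1.1.3.jar \
  -DgroupId=org.apache.flink \
  -DartifactId=flink-connector-kinesis_2.10 \
  -Dversion=1.1.3 \
  -Dpackaging=jar
```

After this, the project's pom.xml can reference the connector like any other dependency.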
Approximate Event Time
• Each Amazon Kinesis record includes an ApproximateArrivalTimestamp
• The timestamp is set when an Amazon Kinesis stream successfully receives and stores a record
• By default, Flink uses this timestamp as the event time when reading from a Kinesis stream
StreamExecutionEnvironment env =
    StreamExecutionEnvironment.getExecutionEnvironment();
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
Event Time and Watermarks
• With event time, the timestamp of an event is determined by its producer
• Flink measures progress in event time by means of Watermarks
• Watermarks must be generated for each individual Kinesis shard
DataStream<Event> kinesis = env
    .addSource(new FlinkKinesisConsumer<>(...))
    .assignTimestampsAndWatermarks(new PunctuatedAssigner());
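The PunctuatedAssigner above is user code; Flink only defines the assigner interface. As a dependency-free illustration, the bookkeeping such an assigner performs (track the highest event time seen and emit it, minus an out-of-orderness bound, as the watermark) can be sketched in plain Java. The class name and the five-second bound are assumptions, not from the talk:

```java
// Sketch of the per-record watermark logic a punctuated assigner applies.
// No Flink dependency: this demonstrates only the bookkeeping, not the API.
public class PunctuatedWatermarkSketch {
    private static final long MAX_OUT_OF_ORDERNESS_MS = 5_000; // assumed bound
    private long maxTimestampSeen = Long.MIN_VALUE;

    // Called once per record: remember the largest event time observed so far.
    public long extractTimestamp(long eventTimestampMs) {
        maxTimestampSeen = Math.max(maxTimestampSeen, eventTimestampMs);
        return eventTimestampMs;
    }

    // The watermark trails the largest event time by the out-of-orderness
    // bound, so slightly late events are still assigned to their windows.
    public long currentWatermark() {
        return maxTimestampSeen - MAX_OUT_OF_ORDERNESS_MS;
    }
}
```

Note that an out-of-order record with a smaller timestamp never moves the watermark backwards.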
Data Encryption with Amazon EMR and Flink
Security configuration supports encryption
• For data stored within the file system
  • Hadoop Distributed File System (HDFS) block-transfer and RPC
  • S3 data (SSE-S3, SSE-KMS, CSE-KMS, CSE-Custom)
  • Local disk (except boot volumes)
• For in-transit data (no Flink support yet)
env.readTextFile("s3://...")
env.setStateBackend(new FsStateBackend("hdfs://..."))
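An EMR security configuration covering the at-rest options above might be created as follows. This is a sketch: the configuration name, the KMS key ARN, and the exact JSON fields are assumptions to be checked against the EMR documentation.

```shell
# Create a reusable security configuration enabling at-rest encryption
# (SSE-S3 for S3 data, KMS-backed encryption for local disks).
aws emr create-security-configuration \
  --name flink-demo-encryption \
  --security-configuration '{
    "EncryptionConfiguration": {
      "EnableAtRestEncryption": true,
      "EnableInTransitEncryption": false,
      "AtRestEncryptionConfiguration": {
        "S3EncryptionConfiguration": { "EncryptionMode": "SSE-S3" },
        "LocalDiskEncryptionConfiguration": {
          "EncryptionKeyProviderType": "AwsKms",
          "AwsKmsKey": "arn:aws:kms:<region>:<account>:key/<key-id>"
        }
      }
    }
  }'
```

The configuration is then referenced by name when the cluster is launched.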
Connecting to the Flink Dashboard
• Use dynamic port forwarding to the Master node
  ssh -D 8157 hadoop@...
• Use FoxyProxy to redirect URLs to localhost
  • *ec2*.amazonaws.com*
  • *.compute.internal*
• Navigate to the YARN Resource Manager and select the Tracking UI
Starting Flink and Submitting Jobs
Use steps to interact with Flink through the AWS API
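For example, a Flink job can be submitted as an EMR step through command-runner.jar. The cluster ID, job jar path, and parallelism below are placeholders, not values from the talk:

```shell
# Submit a Flink job to a running EMR cluster as a step; command-runner.jar
# executes the "flink run" command on the master node against YARN.
aws emr add-steps \
  --cluster-id j-XXXXXXXXXXXXX \
  --steps 'Type=CUSTOM_JAR,Name=FlinkJob,Jar=command-runner.jar,Args=[flink,run,-m,yarn-cluster,-yn,2,/home/hadoop/flink-job.jar]'
```

Steps can also be attached at cluster-creation time, which allows fully scripted, transient Flink clusters.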
Extending Flink Functionality
• Flink's Elasticsearch sink only supports TCP transport
• A custom Elasticsearch sink with HTTP support requires only a few dozen lines of code using
  • Jest (io.searchbox)
  • aws-signing-request-interceptor (vc.inreach.aws)
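As a dependency-free illustration of what such an HTTP sink sends, the sketch below builds the two-line payload of the Elasticsearch bulk API that a Jest-based sink would POST to /_bulk for an index action. The class and method names are hypothetical; a real sink would wrap this in a Flink RichSinkFunction and sign the requests with the interceptor.

```java
// Builds the body of an Elasticsearch bulk-API "index" action: one metadata
// line naming the target index and type, then the document itself, each
// terminated by a newline as the bulk endpoint requires.
public class BulkPayloadSketch {
    public static String indexAction(String index, String type, String documentJson) {
        return "{\"index\":{\"_index\":\"" + index + "\",\"_type\":\"" + type + "\"}}\n"
                + documentJson + "\n";
    }
}
```

Concatenating several such actions into one request is what makes bulk indexing efficient compared to per-document calls.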
Questions?
[email protected]@amazon.de