Jamie Grier@[email protected]
Apache Flink™: Counting elements in streams
Introduction
Data streaming is becoming increasingly popular*
*Biggest understatement of 2016
Streaming technology is enabling the obvious: continuous processing on data
that is continuously produced
Streaming is the next programming paradigm for data applications, and you
need to start thinking in terms of streams
Counting
Continuous counting
A seemingly simple application, but generally an unsolved problem
E.g., count visitors, impressions, interactions, clicks, etc.
Aggregations and OLAP cube operations are generalizations of counting
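To make the "counting is the base case of aggregation" point concrete, here is a minimal plain-Java sketch (not part of the talk's Flink code): a grouped count is just the aggregate SUM(1) per key, the same shape as any other grouped aggregation.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class CountAsAggregation {
    // Counting is the aggregation SUM(1) grouped by a key:
    // each event contributes 1, and the result is the per-key sum.
    static Map<String, Long> countByKey(List<String> events) {
        return events.stream()
                .collect(Collectors.groupingBy(e -> e, Collectors.summingLong(e -> 1L)));
    }

    public static void main(String[] args) {
        // Three events, two distinct keys.
        Map<String, Long> counts = countByKey(List.of("click", "click", "view"));
        System.out.println(counts);
    }
}
```

Swapping `summingLong(e -> 1L)` for any other combiner (sum of a value field, min, max) gives the more general aggregations and OLAP-style rollups the slide refers to.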
Counting in batch architecture
Continuous ingestion
Periodic (e.g., hourly) files
Periodic batch jobs
Problems with batch architecture
High latency
Too many moving parts
Implicit treatment of time
Out of order event handling
Implicit batch boundaries
Counting in λ architecture
"Batch layer": what we had before
"Stream layer": approximate early results
Problems with batch and λ
Way too many moving parts (and code duplication)
Implicit treatment of time
Out of order event handling
Implicit batch boundaries
Counting in streaming architecture
Message queue ensures stream durability and replay
Stream processor ensures consistent counting
Counting in Flink DataStream API
Number of visitors in the last hour by country:

DataStream<LogEvent> stream = env
    .addSource(new FlinkKafkaConsumer(...))  // create stream from Kafka
    .keyBy("country")                        // group by country
    .timeWindow(Time.minutes(60))            // window of size 1 hour
    .apply(new CountPerWindowFunction());    // do operations per window
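What the Flink job above computes can be sketched in plain Java (this is an illustration of the windowing logic, not the Flink runtime; the `LogEvent` record with a country and a timestamp is a hypothetical stand-in): each event is assigned to a 1-hour tumbling window by its timestamp, and events are counted per (country, window) pair.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HourlyCountPerCountry {
    static final long WINDOW_MS = 60 * 60 * 1000L; // 1-hour tumbling windows

    // Hypothetical event type: country of the visitor plus event timestamp.
    record LogEvent(String country, long timestampMs) {}

    // Count events per (country, windowStart) pair.
    static Map<String, Long> countPerCountryAndWindow(List<LogEvent> events) {
        Map<String, Long> counts = new HashMap<>();
        for (LogEvent e : events) {
            // Tumbling window assignment: round the timestamp down to a window start.
            long windowStart = e.timestampMs() - (e.timestampMs() % WINDOW_MS);
            counts.merge(e.country() + "@" + windowStart, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<LogEvent> events = List.of(
                new LogEvent("DE", 0L),
                new LogEvent("DE", 1_000L),
                new LogEvent("US", 3_600_000L)); // falls into the next window
        System.out.println(countPerCountryAndWindow(events));
    }
}
```

In Flink, `keyBy` + `timeWindow` does this assignment for you, in parallel and continuously; the sketch only shows the grouping semantics.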
Counting hierarchy of needs
Continuous counting
... with low latency,
... efficiently on high volume streams,
... fault tolerant (exactly once),
... accurate and
repeatable,
... queryable (Flink 1.1+)
Based on Maslow's hierarchy of needs
Rest of this talk
Continuous counting
... with low latency,
... efficiently on high volume streams,
... fault tolerant (exactly once),
... accurate and
repeatable,
... queryable
Latency
Yahoo! Streaming Benchmark
Storm, Spark Streaming, and Flink benchmark by the Storm team at Yahoo!
Focus on measuring end-to-end latency at low throughputs
First benchmark that was modeled after a real application
Read more: https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
Benchmark task: counting!
Count ad impressions grouped by campaign
Compute aggregates over last 10 seconds
Make aggregates available for queries (Redis)
Results (lower is better)
Flink and Storm at sub-second latencies
Spark has a latency-throughput tradeoff
170k events/sec
Efficiency, and scalability
Handling high-volume streams
Scalability: how many events/sec can a system scale to, with infinite resources?
Scalability comes at a cost (systems add overhead to be scalable)
Efficiency: how many events/sec can a system scale to, with limited resources?
Extending the Yahoo! benchmark
The Yahoo! benchmark is a great starting point for understanding engine behavior.
However, the benchmark stops at low write throughput, and its programs are not fault tolerant.
We extended the benchmark (1) to high volumes and (2) to use Flink's built-in state.
http://data-artisans.com/extending-the-yahoo-streaming-benchmark/
Results (higher is better)
Also: Flink jobs are correct under failures (exactly once), Storm jobs are not
500k events/sec
3 million events/sec
15 million events/sec
Fault tolerance and repeatability
Stateful streaming applications
Most interesting stream applications are stateful
How to ensure that state is correct after failures?
Fault tolerance definitions
At least once
• May over-count
• Under a different definition, may in some cases even under-count
Exactly once
• Counts are the same after a failure
End-to-end exactly once
• Counts appear the same in an external sink (database, file system) after a failure
Fault tolerance in Flink
Flink guarantees exactly once
End-to-end exactly once supported with specific sources and sinks • E.g., Kafka ➔ Flink ➔ HDFS
Internally, Flink periodically takes consistent snapshots of the state without ever stopping the computation
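The difference between at-least-once and exactly-once can be sketched with a toy model (plain Java, not Flink's actual snapshot protocol; the function names and the checkpoint interval are illustrative): a checkpoint stores the count together with the source offset, so recovery replays only events that were not yet reflected in the state. Replaying from the checkpointed offset without restoring the state over-counts.

```java
import java.util.List;

public class CheckpointedCounter {
    // Exactly-once (toy model): state and source offset are snapshotted
    // together; on failure both are restored, so each event counts once.
    static long runWithRecovery(List<String> events, int checkpointEvery, int failAt) {
        long count = 0, checkpointedCount = 0;
        int i = 0, checkpointedOffset = 0;
        boolean failed = false;
        while (i < events.size()) {
            if (!failed && i == failAt) {
                count = checkpointedCount;  // restore state...
                i = checkpointedOffset;     // ...and the matching source position
                failed = true;
                continue;
            }
            count++;
            i++;
            if (i % checkpointEvery == 0) {
                checkpointedCount = count;
                checkpointedOffset = i;
            }
        }
        return count;
    }

    // At-least-once (toy model): the source rewinds to the checkpointed
    // offset, but the running count is NOT restored, so replayed events
    // are counted twice.
    static long runAtLeastOnce(List<String> events, int checkpointEvery, int failAt) {
        long count = 0;
        int i = 0, checkpointedOffset = 0;
        boolean failed = false;
        while (i < events.size()) {
            if (!failed && i == failAt) {
                i = checkpointedOffset;     // rewind the source only
                failed = true;
                continue;
            }
            count++;
            i++;
            if (i % checkpointEvery == 0) checkpointedOffset = i;
        }
        return count;
    }

    public static void main(String[] args) {
        List<String> events = java.util.Collections.nCopies(10, "event");
        System.out.println("exactly once:  " + runWithRecovery(events, 4, 6));
        System.out.println("at least once: " + runAtLeastOnce(events, 4, 6));
    }
}
```

With 10 events, a checkpoint every 4, and a failure at event 6, the exactly-once run ends at 10 while the at-least-once run over-counts the two replayed events.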
Savepoints
Maintaining stateful applications in production is challenging
Flink savepoints: externally triggered, durable checkpoints
Easy code upgrades (Flink or app), maintenance, migration, and debugging, what-if simulations, A/B tests
Explicit handling of time
Notions of time
1977: Episode IV, A New Hope
1980: Episode V, The Empire Strikes Back
1983: Episode VI, Return of the Jedi
1999: Episode I, The Phantom Menace
2002: Episode II, Attack of the Clones
2005: Episode III, Revenge of the Sith
2015: Episode VII, The Force Awakens
Ordering by episode number (when the story happens) is called event time
Ordering by release year (when you process the films) is called processing time
Out of order streams
Event time windows
Arrival time windows
Instant event-at-a-time
First burst of events
Second burst of events
Why event time
Most stream processors are limited to processing/arrival time; Flink can operate on event time as well
Benefits of event time
• Accurate results for out-of-order data
• Sessions and unaligned windows
• Time travel (backstreaming)
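Why event time gives accurate results for out-of-order data can be shown with a small plain-Java sketch (illustrative only; the `Event` record and window size are assumptions): when windows are assigned by the event's own timestamp, the arrival order of events makes no difference to the counts.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class EventTimeWindows {
    static final long WINDOW_MS = 10_000L; // 10-second tumbling windows

    // Hypothetical event carrying only its event-time timestamp.
    record Event(long eventTimeMs) {}

    // Assign each event to a window by its OWN timestamp (event time),
    // not by when it arrived; count events per window start.
    static Map<Long, Long> countByEventTimeWindow(List<Event> arrived) {
        Map<Long, Long> counts = new TreeMap<>();
        for (Event e : arrived) {
            long windowStart = e.eventTimeMs() - (e.eventTimeMs() % WINDOW_MS);
            counts.merge(windowStart, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Event> inOrder = List.of(new Event(1_000), new Event(9_000), new Event(12_000));
        List<Event> outOfOrder = List.of(new Event(12_000), new Event(1_000), new Event(9_000));
        // Same windows and counts regardless of arrival order:
        System.out.println(countByEventTimeWindow(inOrder));
        System.out.println(countByEventTimeWindow(outOfOrder));
    }
}
```

A processing-time system would instead bucket events by arrival, so the out-of-order run would split the same events across different windows. (The sketch omits watermarks, which Flink uses to decide when an event-time window is complete.)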
What's coming up in Flink
Evolution of streaming in Flink
Flink 0.9 (Jun 2015): DataStream API in beta, exactly-once guarantees via checkpointing
Flink 0.10 (Nov 2015): event time support, windowing mechanism based on the Dataflow/Beam model, graduated DataStream API, high availability, state interface, new/updated connectors (Kafka, NiFi, Elasticsearch, ...), improved monitoring
Flink 1.0 (Mar 2016): DataStream API stability, out-of-core state, savepoints, CEP library, improved monitoring, Kafka 0.9 support
Upcoming features
SQL: ongoing work in collaboration with Apache Calcite
Dynamic scaling: adapt resources to stream volume, historical stream processing
Queryable state: ability to query the state inside the stream processor
Mesos support
More sources and sinks (e.g., Kinesis, Cassandra)
Queryable state
Using the stream processor as a database
Closing
Summary
Stream processing gaining momentum; the right paradigm for continuous data applications
Even seemingly simple applications can be complex at scale and in production – choice of framework crucial
Flink: unique combination of capabilities, performance, and robustness
Flink in the wild
30 billion events daily 2 billion events in 10 1Gb machines
Picked Flink for "Saiki", a data integration & distribution platform
Join the community!
Follow: @ApacheFlink, @dataArtisans
Read: flink.apache.org/blog, data-artisans.com/blog
Subscribe: (news | user | dev) @ flink.apache.org
2 meetups next week in Bay Area!
April 5, San Francisco & April 6, San Jose
What's new in Flink 1.0 & recent performance benchmarks with Flink
http://www.meetup.com/Bay-Area-Apache-Flink-Meetup/