Re-introducing the Stream Processor: A Universal Tool for Continuous Data Analysis. Paris Carbone, Committer @ Apache Flink, PhD Candidate @ KTH

Reintroducing the Stream Processor: A universal tool for continuous data analysis


Page 1: Reintroducing the Stream Processor: A universal tool for continuous data analysis

Re-introducing the Stream Processor

A Universal Tool for Continuous Data Analysis

Paris Carbone

Committer @ Apache Flink

PhD Candidate @ KTH

Page 2: Reintroducing the Stream Processor: A universal tool for continuous data analysis

Data Stream Processors

A data stream processor can set up any data pipeline for you

http://edge.alluremedia.com.au/m/l/2014/10/CoolingPipes.jpg

Page 3: Reintroducing the Stream Processor: A universal tool for continuous data analysis

Data Stream Processors

Is this really a step forward in data processing?

A growing open-source ecosystem, e.g., Kafka Streams, Flink, Beam, Apex

General idea of the tech:

• Processes pipeline computation in a cluster
• Computation is continuous and parallel (like data)
• Event-processing logic <-> Application state
• It’s production-ready and aims to simplify analytics
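The "event-processing logic <-> application state" idea can be sketched in a few lines of plain Python. This is only an illustration, not the API of any of the systems named above: `PARALLELISM`, `partition`, `CountPerKey` and `run` are hypothetical names, and real stream processors distribute the instances across a cluster rather than a single process.

```python
from collections import defaultdict
from zlib import crc32

PARALLELISM = 2  # assumed number of parallel operator instances


def partition(key):
    """Route a key to one operator instance (deterministic hash partitioning)."""
    return crc32(key.encode("utf-8")) % PARALLELISM


class CountPerKey:
    """One parallel instance: event-processing logic plus its local state."""

    def __init__(self):
        self.state = defaultdict(int)  # application state, keyed by event key

    def process(self, key):
        self.state[key] += 1           # the logic reads and updates the state
        return key, self.state[key]    # and emits an updated result per event


def run(events):
    """Feed events one at a time to the instance that owns their key."""
    instances = [CountPerKey() for _ in range(PARALLELISM)]
    for key in events:
        yield instances[partition(key)].process(key)


results = list(run(["a", "b", "a", "c", "a"]))
```

Because every key is always routed to the same instance, each instance can keep its share of the state locally, which is what lets the computation be parallel "like data".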

Page 4: Reintroducing the Stream Processor: A universal tool for continuous data analysis

4 Aspects of Data Processing

[Figure: landscape of data processing. Sources and systems: event logs; production database; data warehouses + historical data; large-scale processing systems. Uses: complex event proc., fast approximate streaming, ETL; rules; application state + failover; “microservices”; complex analytics; interactive queries; data science reports. Roles: dev, user, analyst, data engineer.]


Page 6: Reintroducing the Stream Processor: A universal tool for continuous data analysis

4 Aspects of Data Processing

[Figure: same diagram as on page 4, now annotated with a "stream processor" element and the aspect label "1. Speed".]

Page 7: Reintroducing the Stream Processor: A universal tool for continuous data analysis

1. Speed
Low-Latency Data Processing

Traditionally the sole reason stream processing was used

CEP semantics etc. are nowadays provided as additional libraries for stream processors

How do stream processors achieve low latency?

• No intermediate scheduling (you let it run)
• No physical blocking (pre-compute on the go)
• Copy-on-write for state and output

But is this only relevant for live data?
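The "no physical blocking (pre-compute on the go)" point can be illustrated with chained Python generators. This is a minimal single-process sketch, not how any particular stream processor is implemented; `source`, `parse` and `running_sum` are hypothetical stage names, and copy-on-write is not shown.

```python
def source(records, log):
    """Emit records one at a time, noting when each one is read."""
    for record in records:
        log.append(f"read {record}")
        yield record


def parse(stream):
    """Stateless stage: convert each record as soon as it arrives."""
    for record in stream:
        yield int(record)


def running_sum(stream, log):
    """Stateful stage: update the result per record instead of per batch."""
    total = 0
    for value in stream:
        total += value
        log.append(f"emit {total}")
        yield total


log = []
pipeline = running_sum(parse(source(["1", "2", "3"], log)), log)
outputs = list(pipeline)
```

The `log` list interleaves reads and emits: a result for the first record is produced before the second record is even read, so no stage waits for a complete batch from the stage before it.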

Page 8: Reintroducing the Stream Processor: A universal tool for continuous data analysis

4 Aspects of Data Processing

[Figure: same diagram as on page 4, now annotated with a "stream processor" element and the aspect labels "1. Speed" and "2. History".]

Page 9: Reintroducing the Stream Processor: A universal tool for continuous data analysis

2. History
Offline Data Processing

Stream processing is also possible, and a better fit, for analysis over bulk historical data

What can stream processors do for historical data?

• Ability to define custom state to build up models
• Large-scale support is a given (inherits cluster computing benefits)
• Separation of notions of time and out-of-order processing, e.g., session windows, event-time windows

But isn’t it hard to deal with failures in streaming?
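To make event-time session windows concrete, here is a small sketch over a bounded, out-of-order input. It is an assumption-laden illustration: `session_windows` and its `gap` parameter are hypothetical names, and sorting the whole input only works because the data is bounded. It is not the windowing API of Flink or any other system.

```python
from collections import defaultdict


def session_windows(events, gap):
    """Group (key, event_time) records into per-key sessions.

    A session closes when the next event of the same key is more than
    `gap` time units after the previous one. Timestamps are event times
    carried by the records, so arrival order does not matter.
    """
    by_key = defaultdict(list)
    for key, ts in events:
        by_key[key].append(ts)

    sessions = {}
    for key, stamps in by_key.items():
        stamps.sort()  # order by event time, not by arrival order
        windows = [[stamps[0], stamps[0], 1]]  # [start, end, count]
        for ts in stamps[1:]:
            last = windows[-1]
            if ts - last[1] <= gap:
                last[1] = ts       # extend the current session
                last[2] += 1
            else:
                windows.append([ts, ts, 1])  # gap exceeded: new session
        sessions[key] = [tuple(w) for w in windows]
    return sessions


# Records arrive out of order with respect to their event time.
events = [("alice", 12), ("bob", 3), ("alice", 1),
          ("alice", 3), ("bob", 40), ("alice", 14)]
result = session_windows(events, gap=5)
```

Since the windows are defined on the timestamps inside the records, replaying a year of historical logs yields the same sessions as processing the events live.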

Page 10: Reintroducing the Stream Processor: A universal tool for continuous data analysis

4 Aspects of Data Processing

[Figure: same diagram as on page 4, now annotated with a "stream processor" element and the aspect labels "1. Speed", "2. History" and "3. Durability".]

Page 11: Reintroducing the Stream Processor: A universal tool for continuous data analysis

3. Durability
Exactly-Once Data Processing

Traditionally, streaming ~ lossy, approximate processing. This is no longer true. Forget the ‘lambda architecture’.

• Input records are durably stored and indexed in logs (e.g., Kafka)
• Systems handle state snapshotting & transactions with external stores transparently
• Idempotent and transactional writes to external stores

e.g., on Flink each stream computation either completes or repeats

[Figure: a stream computation divided into part 1, part 2, part 3, part 4]

Page 12: Reintroducing the Stream Processor: A universal tool for continuous data analysis

3. Durability
Exactly-Once Data Processing

[Figure: a stream processor with its input streams and application states; on failure, both are rolled back together.]
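The rollback idea can be sketched as follows: state is snapshotted together with the input offset, and after a failure both are restored and the input is replayed from that offset. This is a simplified single-process illustration under stated assumptions; `Processor`, `snapshot_every` and `fail_at` are hypothetical names, a Python list stands in for a durable log such as a Kafka topic, and the output side (idempotent or transactional writes) is not covered.

```python
import copy


class Processor:
    """Word counter with periodic snapshots of (input offset, state)."""

    def __init__(self, log, snapshot_every=2):
        self.log = log                    # durable, replayable input log
        self.snapshot_every = snapshot_every
        self.offset = 0                   # next log position to read
        self.state = {}                   # application state: counts per word
        self.snapshot = (0, {})           # last consistent (offset, state)

    def take_snapshot(self):
        self.snapshot = (self.offset, copy.deepcopy(self.state))

    def rollback(self):
        # Restore offset and state together, so replayed records
        # are not counted twice.
        offset, state = self.snapshot
        self.offset, self.state = offset, copy.deepcopy(state)

    def run(self, fail_at=None):
        while self.offset < len(self.log):
            if fail_at is not None and self.offset == fail_at:
                fail_at = None            # simulate a single failure
                self.rollback()
                continue
            word = self.log[self.offset]
            self.state[word] = self.state.get(word, 0) + 1
            self.offset += 1
            if self.offset % self.snapshot_every == 0:
                self.take_snapshot()
        return self.state


log = ["a", "b", "a", "c", "a"]
clean = Processor(log).run()
recovered = Processor(log).run(fail_at=3)
```

Both runs end with the same counts: the record processed after the last snapshot is replayed, but because the state was rolled back too, its effect is applied exactly once.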

Page 13: Reintroducing the Stream Processor: A universal tool for continuous data analysis

4 Aspects of Data Processing

[Figure: same diagram as on page 4, now annotated with a "stream processor" element and the aspect labels "1. Speed", "2. History", "3. Durability" and "4. Interactivity".]

Page 14: Reintroducing the Stream Processor: A universal tool for continuous data analysis

4. Interactivity
Querying Data Processing State

Stream Processor ~ Inverse DBMS

Application state holds fresh knowledge we want to query:

• In some systems (e.g., Kafka Streams) we can use the changelog
• In other systems (e.g., Flink) we can query the state externally
• …or stream the queries themselves to a custom query processor built on top of them*

[Figure: Alice asks "Bob?" and receives "Bob=…"]

*https://techblog.king.com/rbea-scalable-real-time-analytics-king/
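The first two options can be contrasted in a short sketch: a point lookup against the operator's live state, and a read-only view rebuilt by replaying the changelog. All names here (`CountingOperator`, `query`, `materialize`) are hypothetical and do not correspond to the Kafka Streams or Flink APIs.

```python
class CountingOperator:
    """Counts events per user and records every state change."""

    def __init__(self):
        self.state = {}       # live application state
        self.changelog = []   # append-only list of (key, new_value)

    def process(self, user):
        self.state[user] = self.state.get(user, 0) + 1
        self.changelog.append((user, self.state[user]))

    def query(self, key):
        """External point lookup against the live state."""
        return self.state.get(key)


def materialize(changelog):
    """Rebuild a queryable view by replaying the changelog."""
    view = {}
    for key, value in changelog:
        view[key] = value     # later entries overwrite earlier ones
    return view


op = CountingOperator()
for user in ["bob", "alice", "bob"]:
    op.process(user)

direct = op.query("bob")           # query the state externally
view = materialize(op.changelog)   # or consume the changelog elsewhere
```

The changelog route decouples readers from the running job, at the cost of the view lagging behind the live state by however long replication takes.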

Page 15: Reintroducing the Stream Processor: A universal tool for continuous data analysis

4 Aspects of Data Processing

[Figure: a stream processor surrounded by the four aspects]

1. Speed
• no physical blocking/staging
• no rescheduling
• efficient pipelining
• copy-on-write data structures

2. History
• different notions of time
• flexible stateful processing
• high throughput

3. Durability
• durable input logging is a standard
• automated state management
• exactly-once processing
• output commit & idempotency

4. Interactivity
• external access to state/changelogs
• ability to ‘stream queries’ over state

Page 16: Reintroducing the Stream Processor: A universal tool for continuous data analysis

@SenorCarbone

Try out Stream Processing

https://flink.apache.org/

https://kafka.apache.org/

https://beam.apache.org/