Big Data: Challenges and Some Solutionsbig-data-berlin.dima.tu-berlin.de/fileadmin/news/... · ML. algorithms. algorithms. data too uncertain. V. eracity. Data Mining MATLAB, R, Python

1 © Volker Markl© 2013 Berlin Big Data Center • All Rights Reserved1 © Volker Markl

Big Data: Challenges and Some Solutions Stratosphere, Apache Flink, and Beyond

Volker Marklhttp://www.user.tu-berlin.de/marklv/

http://www.dima.tu-berlin.dehttp://www.dfki.de/web/forschung/iam

http://bbdc.berlin

[email protected]

http://www.user.tu-berlin.de/marklv/

http://www.dima.tu-berlin.de/

http://www.dfki.de/web/forschung/iam

http://bbdc.berlin/

2 © Volker Markl2

2 © Volker Markl

Thanks to my team members and students

• Dr. Stephan Ewen• Sebastian Schelter• Dr. Kostas Tzoumas• Dr. Asterios Katsifodimos• Fabian Hüske• Alexander Alexandrov• Max Heimeland many more members of the Stratosphere Project, theBerlin Big Data Center, and the Apache Flink community

3 © Volker Markl3 © 2013 Berlin Big Data Center • All Rights Reserved

3 © Volker Markl

ML

ML

algorithms

algorithms

data too uncertain Veracity Data Mining MATLAB, R, PythonPredictive/Prescriptive MATLAB, R, Python

Data & Analysis: Increasingly Complex!

data volume too large Volumedata rate too fast Velocitydata too heterogeneous Variability

Data

Reporting aggregation, selectionAd-Hoc Queries SQL, XQueryETL/ELT MapReduce

Analysis

DMDM

scalability

scalability


4 © Volker Markl

Application

DataScience

Control Flow

Iterative Algorithms

Error Estimation

Active Sampling

Sketches

Curse of Dimensionality

Decoupling

Convergence

Monte Carlo

Mathematical ProgrammingLinear Algebra

Stochastic Gradient Descent

Regression

Statistics

Hashing

Parallelization

Query Optimization

Fault Tolerance

Relational Algebra / SQL

Scalability

Data Analysis LanguageCompiler

Memory Management

Memory Hierarchy

Data Flow

Hardware Adaptation

Indexing

Resource Management

NF2 /XQuery

Data Warehouse/OLAP

“Data Scientist” – “Jack of All Trades!”Domain Expertise (e.g., Industry 4.0, Medicine, Physics, Engineering, Energy, Logistics)

Real-TimeNew Technology to the Rescue!

5 © Volker Markl5

5 © Volker Markl

http://mattturck.com/wp-content/uploads/2016/03/Big-Data-Landscape-2016-v18-FINAL.png

A Zoo of Technologies!

6 © Volker Markl6

6 © Volker Markl

Big Data Analytics Requires Systems Programming

R/Matlab:4 million users

Hadoop:200,000

users

Data Analysis Statistics

AlgebraOptimization

Machine LearningNLP

Signal ProcessingImage Analysis

Audio-,Video AnalysisInformation IntegrationInformation Extraction

Data Value ChainData Analysis Process

Predictive Analytics

IndexingParallelizationCommunicationMemory ManagementQuery OptimizationEfficient AlgorithmsResource ManagementFault ToleranceNumerical Stability

Big Data is now where database systems were in the70s (prior to relational algebra, query optimizationand a SQL-standard)!

People with Big DataAnalytics Skills

Declarative languages to the rescue!


7 © Volker Markl

Deep Analysis of „Big Data“ is Key!

Small Data Big Data (3V)

Dee

p An

alyt

ics

Sim

ple

Anal

ysis

Many new companies and products are emerging to enable deep big data analysis;strong European contenders include Apache Flink, Parstream, and Exasol.„New companies“ are the (b)leading users of these technologies, e.g., in theinformation economy (e.g., Zalando, Amazon, Researchgate, Soundcloud, Spotify).„Traditional Big companies“ are following and still determining strategies (Industrie4.0, Logistics, Telco, etc.). Most SMEs are not ready yet to capitalize on Big Data.


8 © Volker Markl

ML DMFeature EngineeringRepresentationAlgorithms (SVM, GPs, etc.)

Think ML-algorithms in a scalable way

Process iterative algorithms

in a scalable way

Technology X

Machine Learning + Data Management = X

Declarative LanguagesAutomatic AdaptionScalable processing

Control flowIterative Algorithms

Error EstimationActive Sampling

Sketches

Curse of DimensionalityIsolation Convergence

Monte Carlo

Mathematical ProgrammingLinear Algebra

Regression

StatisticHashing

Parallelization

Query Optimization

Fault Tolerance

Relational Algebra/SQL

Scalability

Data Analysis Language

CompilerMemory Management

Memory Hierarchy

Dataflow

Hardware adaption

Indexing

Resource Management

NF2 /XQuery Data Warehouse/OLAP

declarative

Goal: Data Analysis without System Programming!

9 © Volker Markl9

9 © Volker Markl

Agenda• On Stratosphere and Flink

– Conception, Initial Contributions– Streaming Data Analysis– The Flink Community

• On Roman Generals and Big Data Analytics• On Emma and Mosaics

10 © Volker Markl10 © Volker Markl

Apache Flink

June 2, 2008

Aug 26, 2014

11 © Volker Markl

• Relational Algebra• Declarativity• Query Optimization• Robust Out-of-core

• Scalability• User-defined

Functions • Complex Data Types• Schema on Read

• Iterations• Advanced Dataflows• General APIs• Native Streaming

11

Draws onDatabase Technology

Draws onMapReduce Technology

Adds

Stratosphere: General Purpose Programming + Database Execution

A. Alexandrov, D. Battré, S. Ewen, M. Heimel, F. Hueske, O. Kao, V. Markl, E. Nijkamp, D. Warneke: Massively Parallel Data Analysis with PACTs on Nephele. PVLDB 3(2): 1625-1628 (2010)D. Battré, S. Ewen, F. Hueske, O. Kao, V. Markl, D. Warneke: Nephele/PACTs: a programming model and execution framework for web-scaleanalytical processing. SoCC 2010: 119-130A. Alexandrov, R.Bergmann, S. Ewen, et al: The Stratosphere platform for big data analytics. VLDB J. 23(6): 939-964 (2014)

12 © Volker Markl12

12 © Volker Markl

Apache Flink® is an open-source stream processing framework for distributed, high-performing, always-available, and accurate data streaming applications.

What is Apache Flink?

http://flink.apache.org

• Key Features:• Bounded and unbounded data• Event time semantics• Stateful and fault-tolerant• Running on thousands of nodes with

very good throughput and latency• Exactly-once semantics for stateful

computations.• Flexible windowing based on time,

count, or sessions in addition to data-driven windows

• DataSet and DataStream programming abstractions are the foundation for user programs and higher layers


13 © Volker Markl

Technology inside Flinkcase class Path (from: Long, to:Long)val tc = edges.iterate(10) {

paths: DataSet[Path] =>val next = paths

.join(edges)

.where("to")

.equalTo("from") {(path, edge) =>

Path(path.from, edge.to)}.union(paths).distinct()

next}

Cost-based optimizer

Type extraction stack

Task scheduling

Recoverymetadata

Pre-flight (Client)

MasterWorkers

DataSource

orders.tbl

Filter

Map DataSource

lineitem.tbl

JoinHybrid Hash

buildHT

probe

hash-part [0] hash-part [0]

GroupRed

sort

forward

Program

DataflowGraph

Memory manager

Out-of-core algos

Batch & Streaming

State & Checkpoints

deployoperators

trackintermediate

results

P. Carbone, A. Katsifodimos, S. Ewen, V. Markl, S. Haridi, K. Tzoumas:Apache Flink™: Stream and Batch Processing in a Single Engine. IEEE Data Eng. Bull. 38(4): 28-38 (2015)


14 © Volker Markl

Effect of optimization

14

Run on a sampleon the laptop

Run a month laterafter the data evolved

Hash vs. SortPartition vs. BroadcastCachingReusing partition/sortExecution

Plan A

ExecutionPlan B

Run on large fileson the cluster

ExecutionPlan C


15 © Volker Markl

Why optimization ?

Do you want to hand-tune that?

F. Hueske, M. Peters, A. Krettek, M. Ringwald, K. Tzoumas, V. Markl, J.C. Freytag:Peeking into the optimization of data flow programs with MapReduce-style UDFs. ICDE 2013: 1292-1295


16 © Volker Markl

ITERATIONS IN DATA FLOWS MACHINE LEARNING ALGORITHMS

S. Ewen, S. Schelter, K. Tzoumas, D. Warneke, V. Markl: Iterative Parallel Data Processing with Stratosphere: an Inside Look. SIGMOD 2013S. Ewen, K. Tzoumas, M. Kaufmann, V. Markl:Spinning Fast Iterative Data Flows. PVLDB 5(11): 1268-1279 (2012)


17 © Volker Markl

Iterate by looping

• for/while loop in client submits one job per iteration step• Data reuse by caching in memory and/or disk

Step Step Step Step Step

Client


18 © Volker Markl

Iterate in the Dataflow


19 © Volker Markl

Large-Scale Machine Learning

Factorizing a matrix with28 billion ratings forrecommendations

(Scale of Netflixor Spotify)

More at: http://data-artisans.com/computing-recommendations-with-flink.html


20 © Volker Markl

Optimizing iterative programs

Caching Loop-invariant DataPushing work„out of the loop“

Maintain state as index


21 © Volker Markl

STATE IN ITERATIONS GRAPHS AND MACHINE LEARNING


22 © Volker Markl

Iterate natively with deltas

partial solution

deltasetX

otherdatasets

Y initialsolution

iterationresult

workset A B workset

Merge deltas

Replace

initialworkset


23 © Volker Markl

Effect of delta iterations…#

of e

lem

ents

upd

ated

iteration


24 © Volker Markl

… very fast graph analysis

… and mix and matchETL-style and graph analysisin one program

Performance competitivewith dedicated graph

analysis systems

More at: http://data-artisans.com/data-analysis-with-flink.html


25 © Volker Markl

“STREAMING DATA” ANALYSIS


26 © Volker Markl

Bounded and unbounded data• Bounded data: a dataset with a natural beginning and

end– E.g., current position of all trucks in a fleet, content of a data

warehouse

• Unbounded data: a dataset without a natural beginning and end– E.g., customers of our products, tweets about our product

• Note: few data is bounded by nature; most bounded data is a view over unbounded data

26

By courtesy of Kostas Tzoumas


27 © Volker Markl

Stream and batch processing• Stream processing: continuous processing that

continuously produces results– E.g., a Java program that connects to a socket and parses the

socket contents– Apache Flink, Apache Storm

• Batch processing: processing that takes finite time to complete and produces results only in the end– E.g., sorting a file– Apache Hadoop MapReduce, Apache Spark

27



28 © Volker Markl

How do they fit together

28

• Batch processing over bounded data– Natural

• Stream processing over bounded data– Treats bounded data as subset of stream

• Batch processing over unbounded data– Needs a pre-processing stream processing phase that splits

stream into chunks

• Stream processing over unbounded data– Natural



29 © Volker Markl

A different view• What changes faster? Your code or your data?

• ddata/dt >> dcode/dt is a data streaming problem

• dcode/dt >> ddata/dt is a data exploration problem (and likely to become a data streaming problem later)

29

Credits Joe Hellerstein


30 © Volker Markl

Defining windows in Flink

• Trigger policy– When to trigger the computation on current window

• Eviction policy– When data points should leave the window– Defines window width/size

• E.g., count-based policy– evict when #elements > n– start a new window every n-th element

• Built-in: Count, Time, Delta policies


31 © Volker Markl

Checkpointing / Recovery• Flink acknowledges batches of records

– Less overhead in failure-free case– Currently tied to fault tolerant data sources (e.g., Kafka)

• Flink operators can keep state– State is checkpointed– Checkpointing and record acks go together

• Exactly one semantics for state


32 © Volker Markl

Checkpointing / Recovery

32

Chandy-Lamport Algorithm for consistent asynchronous distributed snapshots

Pushes checkpoint barriersthrough the data flow

Operator checkpointstarting

Checkpoint done

Data Stream

barrier

Before barrier =part of the snapshot

After barrier =Not in snapshot

Checkpoint done

checkpoint in progress

(backup till next snapshot)


Some Benchmark ResultsInitially performed by Yahoo! Engineering, Dec 16, 2015,

[..]Storm 0.10.0, 0.11.0-SNAPSHOT and Flink 0.10.1 show sub- second latencies at relatively high throughputs[..]. Spark

streaming 1.5.1 supports high throughputs, but at a relatively higher

latency.

http://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-athttps://data-artisans.com/extending-the-yahoo-streaming-benchmark/


34 © Volker Markl

THE FLINK COMMUNITY

J. Soto and V. Markl, "A Historical Account of Apache Flink," [Online]. Available: http://www.dima.tu-berlin.de/fileadmin/fg131/ Informationsmaterial/Apache_Flink_Origins_for_Public_Release.pdf


35 © Volker Markl

Flink Community as of Today

479

10,659

50

678

959314

243

1,393365376

2,179

541

452

196

18,884 Members313 Contributors41 Meetups30 Cities15 Countries 0

50

100

150

200

250

300

350

Feb 09 Jun 10 Nov 11 Mrz 13 Jul 14 Dez 15 Apr 17

Contributors


36

30 Flink applications in production for more than one year. 10 billion events (2TB) processed daily

Complex jobs of > 30 operators running 24/7, processing 30 billion events daily, maintaining state of 100s of GB with exactly-once guarantees

Largest job has > 20 operators, runs on > 5000 vCores in 1000-node cluster, processes millions of events per second


Some Highly Engaged Users


37 © Volker Markl

Other Companies in the Flink Community

https://flink.apache.org/poweredby.html


38 © Volker Markl

Some Software Projects Using Flink


39 © Volker Markl

Flink in the ecosystem

POSIX Java/ScalaCollections

POSIX



40 © Volker Markl

2008 2009

2012

2010

20112012

2010

2014

2015 2015 2015

201620162017

> 30 Companies using Flink

Timeline of Flink


41 © Volker Markl

Evolution of Big Data Platforms


42 © Volker Markl

Fault tolerancePessimistic Recovery:• Write intermediate state to stable storage• Restart from such a checkpoint in case of a failureProblematic:• High overhead, checkpoint must

be replicated to other machines• Overhead always incurred, even if no

failures happen!

How can we avoid this overhead in failure-free cases


43 © Volker Markl

Optimistic recovery• Many data mining algorithms are fixpoint algorithms• Optimistic Recovery: jump to a different state in case of a failure,

still converge to solution

• No checkpoints No overhead in absense of failures!• algorithm-specific compensation function must restore state

S. Schelter, S. Ewen, K. Tzoumas, V. Markl: "All roads lead to Rome": optimistic recovery for distributed iterative data processing. CIKM 2013S. Dudoladov, C. Xu, S. Schelter, A. Katsifodimos, S. Ewen, K. Tzoumas, V. Markl: Optimistic Recovery for Iterative Dataflows. SIGMOD 2015


Declarative Data Processing and Mosaics


45 © Volker Markl

A Billion $$$ Mantra...

Declarative Data Processing

SQL Relations RDBMS

A simple, high-level language for querying data (Chamberlin ’74).

An effective, formal foundation based on relational algebra and calculus (Codd ’71).

An efficient, low-level execution environment tailored towards the data (Selinger ’79).


46 © Volker Markl

With 40+ years of success...


SQL Relations RDBMS


47 © Volker Markl

Is Being Revised


SQL Relations RDBMS

DistributedCollections

Parallel DataflowEngines

Second-OrderFunctions


48 © Volker Markl

Fine grained parallelism through deep language Embedding (EMMA)

Emma Source (backed by Scala AST)

DataBag StreamMatrix

Emma Core (backed by Scala AST)

Target Language + Runtime

(iii) Lowering

Programming Abstractions

(i) Lifting

Common Concrete Syntax(realized as Scala eDSL)

Common Abstract Syntax

e.g. Spark, Flink, MM Engine

A. Alexandrov, A. Salzmann, G. Krastev, A. Katsifodimos, V. Markl: Emma in Action: Declarative Dataflows for Scalable Data Analysis. SIGMOD 2016A. Alexandrov, A. Katsifodimos, G. Krastev, V. Markl: Implicit Parallelism through Deep Language Embedding. SIGMOD Record 45(1): 51-58 (2016)


49 © Volker Markl

MosaicsFrontend:

Backends:


50 © Volker Markl

Research1. Unifying Modelling Across

Theories2. Cross Theory Optimization

3. Optimizing Across Engines4. Predicting and Learning

Program Runtimes

5. Optimizing Across Hardware6. Generating Hardware-Targeted

Code


51 © Volker Markl

Law

Society

EconomyTechnology

Application

Business ModelsBenchmarkingOpen Source & Open DataDeployment ModelsInformation PricingInformation Marketplaces

Data ManagementData ProcessingStatistics/MLLinguistics/Text&SpeechHCI/VisualizationNovel Computer ArchitecturesSecurity

The Five Dimensions of Big DataOwnershipCopyright/IPRLiabilityInsolvencyPrivacy

User BehaviourSocial ProcessesCollaborationPolitical ProcessesPublic Good

Economy & BusinessPublic SectorHealthcareTransportationHumanitiesSciences

SystemsFrameworks

SkillsBest-Practices

Tools

52 © Volker Markl

Acknowledgements

I would like to thank the members of the Stratosphere project, theBerlin Big Data Center, and my national and internationalcollaborators. In particular, I would like to thank my former andcurrent students and postdocs at DIMA/TU Berlin and DFKI.Without them it would not have been possible to realize the visionsinto concrete software system artifacts. Last, but not least, I wouldlike to thank all of the contributors and users in the Apache Flinkcommunity, without them the Stratosphere project and thecorrelated research of the Berlin Big Data Center would not haveachieved the worldwide impact that we have experienced over thelast few years

Documents

Big Data: Challenges and Some Solutionsbig-data-berlin.dima.tu-berlin.de/fileadmin/news/... · ML. algorithms. algorithms. data too uncertain. V. eracity. Data Mining MATLAB, R, Python