Upload
others
View
10
Download
0
Embed Size (px)
Citation preview
1 © Volker Markl© 2013 Berlin Big Data Center • All Rights Reserved1 © Volker Markl
Big Data: Challenges and Some Solutions Stratosphere, Apache Flink, and Beyond
Volker Marklhttp://www.user.tu-berlin.de/marklv/
http://www.dima.tu-berlin.dehttp://www.dfki.de/web/forschung/iam
http://bbdc.berlin
2 © Volker Markl2
2 © Volker Markl
Thanks to my team members and students
• Dr. Stephan Ewen• Sebastian Schelter• Dr. Kostas Tzoumas• Dr. Asterios Katsifodimos• Fabian Hüske• Alexander Alexandrov• Max Heimeland many more members of the Stratosphere Project, theBerlin Big Data Center, and the Apache Flink community
3 © Volker Markl3 © 2013 Berlin Big Data Center • All Rights Reserved
3 © Volker Markl
ML
ML
algorithms
algorithms
data too uncertain Veracity Data Mining MATLAB, R, PythonPredictive/Prescriptive MATLAB, R, Python
Data & Analysis: Increasingly Complex!
data volume too large Volumedata rate too fast Velocitydata too heterogeneous Variability
Data
Reporting aggregation, selectionAd-Hoc Queries SQL, XQueryETL/ELT MapReduce
Analysis
DMDM
scalability
scalability
4 © Volker Markl4 © 2013 Berlin Big Data Center • All Rights Reserved
4 © Volker Markl
Application
DataScience
Control Flow
Iterative Algorithms
Error Estimation
Active Sampling
Sketches
Curse of Dimensionality
Decoupling
Convergence
Monte Carlo
Mathematical ProgrammingLinear Algebra
Stochastic Gradient Descent
Regression
Statistics
Hashing
Parallelization
Query Optimization
Fault Tolerance
Relational Algebra / SQL
Scalability
Data Analysis LanguageCompiler
Memory Management
Memory Hierarchy
Data Flow
Hardware Adaptation
Indexing
Resource Management
NF2 /XQuery
Data Warehouse/OLAP
“Data Scientist” – “Jack of All Trades!”Domain Expertise (e.g., Industry 4.0, Medicine, Physics, Engineering, Energy, Logistics)
Real-TimeNew Technology to the Rescue!
5 © Volker Markl5
5 © Volker Markl
http://mattturck.com/wp-content/uploads/2016/03/Big-Data-Landscape-2016-v18-FINAL.png
A Zoo of Technologies!
6 © Volker Markl6
6 © Volker Markl
Big Data Analytics Requires Systems Programming
R/Matlab:4 million users
Hadoop:200,000
users
Data Analysis Statistics
AlgebraOptimization
Machine LearningNLP
Signal ProcessingImage Analysis
Audio-,Video AnalysisInformation IntegrationInformation Extraction
Data Value ChainData Analysis Process
Predictive Analytics
IndexingParallelizationCommunicationMemory ManagementQuery OptimizationEfficient AlgorithmsResource ManagementFault ToleranceNumerical Stability
Big Data is now where database systems were in the70s (prior to relational algebra, query optimizationand a SQL-standard)!
People with Big DataAnalytics Skills
Declarative languages to the rescue!
7 © Volker Markl7 © 2013 Berlin Big Data Center • All Rights Reserved
7 © Volker Markl
Deep Analysis of „Big Data“ is Key!
Small Data Big Data (3V)
Dee
p An
alyt
ics
Sim
ple
Anal
ysis
Many new companies and products are emerging to enable deep big data analysis;strong European contenders include Apache Flink, Parstream, and Exasol.„New companies“ are the (b)leading users of these technologies, e.g., in theinformation economy (e.g., Zalando, Amazon, Researchgate, Soundcloud, Spotify).„Traditional Big companies“ are following and still determining strategies (Industrie4.0, Logistics, Telco, etc.). Most SMEs are not ready yet to capitalize on Big Data.
8 © Volker Markl8 © 2013 Berlin Big Data Center • All Rights Reserved
8 © Volker Markl
ML DMFeature EngineeringRepresentationAlgorithms (SVM, GPs, etc.)
Think ML-algorithms in a scalable way
Process iterative algorithms
in a scalable way
Technology X
Machine Learning + Data Management = X
Declarative LanguagesAutomatic AdaptionScalable processing
Control flowIterative Algorithms
Error EstimationActive Sampling
Sketches
Curse of DimensionalityIsolation Convergence
Monte Carlo
Mathematical ProgrammingLinear Algebra
Regression
StatisticHashing
Parallelization
Query Optimization
Fault Tolerance
Relational Algebra/SQL
Scalability
Data Analysis Language
CompilerMemory Management
Memory Hierarchy
Dataflow
Hardware adaption
Indexing
Resource Management
NF2 /XQuery Data Warehouse/OLAP
declarative
Goal: Data Analysis without System Programming!
9 © Volker Markl9
9 © Volker Markl
Agenda• On Stratosphere and Flink
– Conception, Initial Contributions– Streaming Data Analysis– The Flink Community
• On Roman Generals and Big Data Analytics• On Emma and Mosaics
11 © Volker Markl
• Relational Algebra• Declarativity• Query Optimization• Robust Out-of-core
• Scalability• User-defined
Functions • Complex Data Types• Schema on Read
• Iterations• Advanced Dataflows• General APIs• Native Streaming
11
Draws onDatabase Technology
Draws onMapReduce Technology
Adds
Stratosphere: General Purpose Programming + Database Execution
A. Alexandrov, D. Battré, S. Ewen, M. Heimel, F. Hueske, O. Kao, V. Markl, E. Nijkamp, D. Warneke: Massively Parallel Data Analysis with PACTs on Nephele. PVLDB 3(2): 1625-1628 (2010)D. Battré, S. Ewen, F. Hueske, O. Kao, V. Markl, D. Warneke: Nephele/PACTs: a programming model and execution framework for web-scaleanalytical processing. SoCC 2010: 119-130A. Alexandrov, R.Bergmann, S. Ewen, et al: The Stratosphere platform for big data analytics. VLDB J. 23(6): 939-964 (2014)
12 © Volker Markl12
12 © Volker Markl
Apache Flink® is an open-source stream processing framework for distributed, high-performing, always-available, and accurate data streaming applications.
What is Apache Flink?
http://flink.apache.org
• Key Features:• Bounded and unbounded data• Event time semantics• Stateful and fault-tolerant• Running on thousands of nodes with
very good throughput and latency• Exactly-once semantics for stateful
computations.• Flexible windowing based on time,
count, or sessions in addition to data-driven windows
• DataSet and DataStream programming abstractions are the foundation for user programs and higher layers
13 © Volker Markl13 © 2013 Berlin Big Data Center • All Rights Reserved
13 © Volker Markl
Technology inside Flinkcase class Path (from: Long, to:Long)val tc = edges.iterate(10) {
paths: DataSet[Path] =>val next = paths
.join(edges)
.where("to")
.equalTo("from") {(path, edge) =>
Path(path.from, edge.to)}.union(paths).distinct()
next}
Cost-based optimizer
Type extraction stack
Task scheduling
Recoverymetadata
Pre-flight (Client)
MasterWorkers
DataSource
orders.tbl
Filter
Map DataSource
lineitem.tbl
JoinHybrid Hash
buildHT
probe
hash-part [0] hash-part [0]
GroupRed
sort
forward
Program
DataflowGraph
Memory manager
Out-of-core algos
Batch & Streaming
State & Checkpoints
deployoperators
trackintermediate
results
P. Carbone, A. Katsifodimos, S. Ewen, V. Markl, S. Haridi, K. Tzoumas:Apache Flink™: Stream and Batch Processing in a Single Engine. IEEE Data Eng. Bull. 38(4): 28-38 (2015)
14 © Volker Markl14 © 2013 Berlin Big Data Center • All Rights Reserved
14 © Volker Markl
Effect of optimization
14
Run on a sampleon the laptop
Run a month laterafter the data evolved
Hash vs. SortPartition vs. BroadcastCachingReusing partition/sortExecution
Plan A
ExecutionPlan B
Run on large fileson the cluster
ExecutionPlan C
15 © Volker Markl15 © 2013 Berlin Big Data Center • All Rights Reserved
15 © Volker Markl
Why optimization ?
Do you want to hand-tune that?
F. Hueske, M. Peters, A. Krettek, M. Ringwald, K. Tzoumas, V. Markl, J.C. Freytag:Peeking into the optimization of data flow programs with MapReduce-style UDFs. ICDE 2013: 1292-1295
16 © Volker Markl16 © 2013 Berlin Big Data Center • All Rights Reserved
16 © Volker Markl
ITERATIONS IN DATA FLOWS MACHINE LEARNING ALGORITHMS
S. Ewen, S. Schelter, K. Tzoumas, D. Warneke, V. Markl: Iterative Parallel Data Processing with Stratosphere: an Inside Look. SIGMOD 2013S. Ewen, K. Tzoumas, M. Kaufmann, V. Markl:Spinning Fast Iterative Data Flows. PVLDB 5(11): 1268-1279 (2012)
17 © Volker Markl17
17 © Volker Markl
Iterate by looping
• for/while loop in client submits one job per iteration step• Data reuse by caching in memory and/or disk
Step Step Step Step Step
Client
19 © Volker Markl19 © 2013 Berlin Big Data Center • All Rights Reserved
19 © Volker Markl
Large-Scale Machine Learning
Factorizing a matrix with28 billion ratings forrecommendations
(Scale of Netflixor Spotify)
More at: http://data-artisans.com/computing-recommendations-with-flink.html
20 © Volker Markl20
20 © Volker Markl
Optimizing iterative programs
Caching Loop-invariant DataPushing work„out of the loop“
Maintain state as index
21 © Volker Markl21 © 2013 Berlin Big Data Center • All Rights Reserved
21 © Volker Markl
STATE IN ITERATIONS GRAPHS AND MACHINE LEARNING
22 © Volker Markl22
22 © Volker Markl
Iterate natively with deltas
partial solution
deltasetX
otherdatasets
Y initialsolution
iterationresult
workset A B workset
Merge deltas
Replace
initialworkset
24 © Volker Markl24 © 2013 Berlin Big Data Center • All Rights Reserved
24 © Volker Markl
… very fast graph analysis
… and mix and matchETL-style and graph analysisin one program
Performance competitivewith dedicated graph
analysis systems
More at: http://data-artisans.com/data-analysis-with-flink.html
25 © Volker Markl25 © 2013 Berlin Big Data Center • All Rights Reserved
25 © Volker Markl
“STREAMING DATA” ANALYSIS
26 © Volker Markl26
26 © Volker Markl
Bounded and unbounded data• Bounded data: a dataset with a natural beginning and
end– E.g., current position of all trucks in a fleet, content of a data
warehouse
• Unbounded data: a dataset without a natural beginning and end– E.g., customers of our products, tweets about our product
• Note: few data is bounded by nature; most bounded data is a view over unbounded data
26
By courtesy of Kostas Tzoumas
27 © Volker Markl27
27 © Volker Markl
Stream and batch processing• Stream processing: continuous processing that
continuously produces results– E.g., a Java program that connects to a socket and parses the
socket contents– Apache Flink, Apache Storm
• Batch processing: processing that takes finite time to complete and produces results only in the end– E.g., sorting a file– Apache Hadoop MapReduce, Apache Spark
27
By courtesy of Kostas Tzoumas
28 © Volker Markl28
28 © Volker Markl
How do they fit together
28
• Batch processing over bounded data– Natural
• Stream processing over bounded data– Treats bounded data as subset of stream
• Batch processing over unbounded data– Needs a pre-processing stream processing phase that splits
stream into chunks
• Stream processing over unbounded data– Natural
By courtesy of Kostas Tzoumas
29 © Volker Markl29
29 © Volker Markl
A different view• What changes faster? Your code or your data?
• ddata/dt >> dcode/dt is a data streaming problem
• dcode/dt >> ddata/dt is a data exploration problem (and likely to become a data streaming problem later)
29
Credits Joe Hellerstein
30 © Volker Markl30
30 © Volker Markl
Defining windows in Flink
• Trigger policy– When to trigger the computation on current window
• Eviction policy– When data points should leave the window– Defines window width/size
• E.g., count-based policy– evict when #elements > n– start a new window every n-th element
• Built-in: Count, Time, Delta policies
31 © Volker Markl31
31 © Volker Markl
Checkpointing / Recovery• Flink acknowledges batches of records
– Less overhead in failure-free case– Currently tied to fault tolerant data sources (e.g., Kafka)
• Flink operators can keep state– State is checkpointed– Checkpointing and record acks go together
• Exactly one semantics for state
32 © Volker Markl32 © 2013 Berlin Big Data Center • All Rights Reserved
32 © Volker Markl
Checkpointing / Recovery
32
Chandy-Lamport Algorithm for consistent asynchronous distributed snapshots
Pushes checkpoint barriersthrough the data flow
Operator checkpointstarting
Checkpoint done
Data Stream
barrier
Before barrier =part of the snapshot
After barrier =Not in snapshot
Checkpoint done
checkpoint in progress
(backup till next snapshot)
33 © Volker Markl33 © Volker Markl
Some Benchmark ResultsInitially performed by Yahoo! Engineering, Dec 16, 2015,
[..]Storm 0.10.0, 0.11.0-SNAPSHOT and Flink 0.10.1 show sub- second latencies at relatively high throughputs[..]. Spark
streaming 1.5.1 supports high throughputs, but at a relatively higher
latency.
http://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-athttps://data-artisans.com/extending-the-yahoo-streaming-benchmark/
34 © Volker Markl34 © 2013 Berlin Big Data Center • All Rights Reserved
34 © Volker Markl
THE FLINK COMMUNITY
J. Soto and V. Markl, "A Historical Account of Apache Flink," [Online]. Available: http://www.dima.tu-berlin.de/fileadmin/fg131/ Informationsmaterial/Apache_Flink_Origins_for_Public_Release.pdf
35 © Volker Markl35
35 © Volker Markl
Flink Community as of Today
479
10,659
50
678
959314
243
1,393365376
2,179
541
452
196
18,884 Members313 Contributors41 Meetups30 Cities15 Countries 0
50
100
150
200
250
300
350
Feb 09 Jun 10 Nov 11 Mrz 13 Jul 14 Dez 15 Apr 17
Contributors
36 © Volker Markl36 © Volker Markl
36
30 Flink applications in production for more than one year. 10 billion events (2TB) processed daily
Complex jobs of > 30 operators running 24/7, processing 30 billion events daily, maintaining state of 100s of GB with exactly-once guarantees
Largest job has > 20 operators, runs on > 5000 vCores in 1000-node cluster, processes millions of events per second
By courtesy of Kostas Tzoumas
Some Highly Engaged Users
37 © Volker Markl37
37 © Volker Markl
Other Companies in the Flink Community
https://flink.apache.org/poweredby.html
39 © Volker Markl39
39 © Volker Markl
Flink in the ecosystem
POSIX Java/ScalaCollections
POSIX
By courtesy of Kostas Tzoumas
40 © Volker Markl40
40 © Volker Markl
2008 2009
2012
2010
20112012
2010
2014
2015 2015 2015
201620162017
> 30 Companies using Flink
Timeline of Flink
41 © Volker Markl41 © 2013 Berlin Big Data Center • All Rights Reserved
41 © Volker Markl
Evolution of Big Data Platforms
42 © Volker Markl42
42 © Volker Markl
Fault tolerancePessimistic Recovery:• Write intermediate state to stable storage• Restart from such a checkpoint in case of a failureProblematic:• High overhead, checkpoint must
be replicated to other machines• Overhead always incurred, even if no
failures happen!
How can we avoid this overhead in failure-free cases
43 © Volker Markl43
43 © Volker Markl
Optimistic recovery• Many data mining algorithms are fixpoint algorithms• Optimistic Recovery: jump to a different state in case of a failure,
still converge to solution
• No checkpoints No overhead in absense of failures!• algorithm-specific compensation function must restore state
S. Schelter, S. Ewen, K. Tzoumas, V. Markl: "All roads lead to Rome": optimistic recovery for distributed iterative data processing. CIKM 2013S. Dudoladov, C. Xu, S. Schelter, A. Katsifodimos, S. Ewen, K. Tzoumas, V. Markl: Optimistic Recovery for Iterative Dataflows. SIGMOD 2015
45 © Volker Markl45
45 © Volker Markl
A Billion $$$ Mantra...
Declarative Data Processing
SQL Relations RDBMS
A simple, high-level language for querying data (Chamberlin ’74).
An effective, formal foundation based on relational algebra and calculus (Codd ’71).
An efficient, low-level execution environment tailored towards the data (Selinger ’79).
46 © Volker Markl46
46 © Volker Markl
With 40+ years of success...
Declarative Data Processing
SQL Relations RDBMS
47 © Volker Markl47
47 © Volker Markl
Is Being Revised
Declarative Data Processing
SQL Relations RDBMS
DistributedCollections
Parallel DataflowEngines
Second-OrderFunctions
48 © Volker Markl48
48 © Volker Markl
Fine grained parallelism through deep language Embedding (EMMA)
Emma Source (backed by Scala AST)
DataBag StreamMatrix
Emma Core (backed by Scala AST)
Target Language + Runtime
(iii) Lowering
Programming Abstractions
(i) Lifting
Common Concrete Syntax(realized as Scala eDSL)
Common Abstract Syntax
e.g. Spark, Flink, MM Engine
A. Alexandrov, A. Salzmann, G. Krastev, A. Katsifodimos, V. Markl: Emma in Action: Declarative Dataflows for Scalable Data Analysis. SIGMOD 2016A. Alexandrov, A. Katsifodimos, G. Krastev, V. Markl: Implicit Parallelism through Deep Language Embedding. SIGMOD Record 45(1): 51-58 (2016)
50 © Volker Markl50
50 © Volker Markl
Research1. Unifying Modelling Across
Theories2. Cross Theory Optimization
3. Optimizing Across Engines4. Predicting and Learning
Program Runtimes
5. Optimizing Across Hardware6. Generating Hardware-Targeted
Code
51 © Volker Markl51
51 © Volker Markl
Law
Society
EconomyTechnology
Application
Business ModelsBenchmarkingOpen Source & Open DataDeployment ModelsInformation PricingInformation Marketplaces
Data ManagementData ProcessingStatistics/MLLinguistics/Text&SpeechHCI/VisualizationNovel Computer ArchitecturesSecurity
The Five Dimensions of Big DataOwnershipCopyright/IPRLiabilityInsolvencyPrivacy
User BehaviourSocial ProcessesCollaborationPolitical ProcessesPublic Good
Economy & BusinessPublic SectorHealthcareTransportationHumanitiesSciences
SystemsFrameworks
SkillsBest-Practices
Tools
52 © Volker Markl
Acknowledgements
I would like to thank the members of the Stratosphere project, theBerlin Big Data Center, and my national and internationalcollaborators. In particular, I would like to thank my former andcurrent students and postdocs at DIMA/TU Berlin and DFKI.Without them it would not have been possible to realize the visionsinto concrete software system artifacts. Last, but not least, I wouldlike to thank all of the contributors and users in the Apache Flinkcommunity, without them the Stratosphere project and thecorrelated research of the Berlin Big Data Center would not haveachieved the worldwide impact that we have experienced over thelast few years