Stratosphere is the next-generation big data processing engine. These slides introduce the most important features of Stratosphere by comparing it with Apache Hadoop. For more information, visit stratosphere.eu. Based on university research, it is now a completely open-source, community-driven development with a focus on stability and usability.
Stratosphere: System Overview
Robert Metzger
Twitter: @rmetzger_
Big Data Beers Meetup, Nov. 19th, 2013
Stratosphere
… is a distributed data processing engine
… automatically handles parallelization
… brings database technology to the world of big data
● Extends MapReduce with more operators
● Support for advanced data flow graphs
● Compiler/Optimizer, Java/Scala Interface, YARN
Overview
● Known from Hadoop: map, reduce
● New in Stratosphere: cross, join, cogroup
[Figure: example data flows built from map (M), reduce (R) and join (J) operators]
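To make the new operators concrete, here is a minimal sketch of what cross, join and cogroup compute, written against plain Java collections rather than the Stratosphere API. The class name `OperatorSketch`, the `{key, value}` record layout and the string output format are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java illustration of cross, join and cogroup (not the Stratosphere API). */
final class OperatorSketch {

    /** cross: every left element is paired with every right element. */
    static List<String> cross(List<String> left, List<String> right) {
        List<String> out = new ArrayList<>();
        for (String l : left) {
            for (String r : right) {
                out.add(l + "|" + r);
            }
        }
        return out;
    }

    /** join: only pairs with equal keys are combined. Records are {key, value}. */
    static List<String> join(List<String[]> left, List<String[]> right) {
        List<String> out = new ArrayList<>();
        for (String[] l : left) {
            for (String[] r : right) {
                if (l[0].equals(r[0])) {
                    out.add(l[0] + ":" + l[1] + "," + r[1]);
                }
            }
        }
        return out;
    }

    /** cogroup: per key, collect all left values (slot 0) and all right values (slot 1). */
    static Map<String, List<List<String>>> coGroup(List<String[]> left, List<String[]> right) {
        Map<String, List<List<String>>> groups = new TreeMap<>();
        for (String[] l : left) {
            groups.computeIfAbsent(l[0], k -> emptyPair()).get(0).add(l[1]);
        }
        for (String[] r : right) {
            groups.computeIfAbsent(r[0], k -> emptyPair()).get(1).add(r[1]);
        }
        return groups;
    }

    private static List<List<String>> emptyPair() {
        List<List<String>> pair = new ArrayList<>();
        pair.add(new ArrayList<>());
        pair.add(new ArrayList<>());
        return pair;
    }

    public static void main(String[] args) {
        List<String[]> customers = List.of(new String[] {"1", "Ann"}, new String[] {"2", "Bob"});
        List<String[]> orders = List.of(new String[] {"1", "book"}, new String[] {"1", "pen"},
                new String[] {"3", "cup"});
        System.out.println(cross(List.of("a", "b"), List.of("x", "y")));
        System.out.println(join(customers, orders));
        System.out.println(coGroup(customers, orders));
    }
}
```

Unlike join, cogroup also reports keys that appear on only one side, with an empty list for the other side.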
Stratosphere System Stack
● APIs: Java API, Scala API, Meteor, ...
● Stratosphere Optimizer
● Stratosphere Runtime
● Cluster Manager: YARN, Direct, EC2
● Storage: Local Files, HDFS, S3, ...
(Shown alongside for comparison: Hive, Hadoop MR)
Stratosphere in a Cluster
● Operators are executed over the whole cluster
● Side by side with Hadoop
● Scales by adding more nodes
● Support for YARN is in development
● We have a LocalExecutor
[Figure: cluster layout. The master node runs the JobManager (job submission, resource management, compiler, web interface). Each of the 4 worker nodes runs a Stratosphere TaskManager next to a Hadoop DataNode.]
1. Data Flows
2. Optimizer
3. Iterations
4. Scala Interface
Data Flows: Execution Models
● Apache Hadoop MR is limited to one data flow: M → R
● One of many possible data flows in Stratosphere: [Figure: data flow combining map (M), join (J) and reduce (R) operators]
Complex Data Flows in Hadoop
[Figure: a data flow of Filtering (M), Grouping (R), Joining (J) and Grouping (R), next to the chain of MapReduce jobs needed to run it on Hadoop]
Data Flows: Lessons Learned
1. Most tasks do not fit the MapReduce model
2. Very expensive
 ○ Always go to disk and HDFS
3. Tedious to implement
 ○ Custom data types and file formats between jobs
That’s why higher-level abstractions for MR exist.
Advanced Data Flows in Stratosphere
● Data flow graphs are supported natively
● Stratosphere only writes to disk if necessary; otherwise data stays in memory
Skeleton of a Stratosphere Program
● Input: text file, JDBC source, CSV, etc.
● Transformations
 ○ map, reduce, join, iterate, etc.
● Output: to file, etc.
● Data Types
 ○ PactRecord: tuples with n fields
 ○ Custom data types for vectors, images, audio (we only expect serialization and comparison)
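The requirement that a custom data type only needs serialization and comparison can be sketched in plain Java as follows. `PointValue`, its `write`/`read` methods and the `roundTrip` helper are hypothetical names chosen for this illustration; they are not the Stratosphere interfaces, which may differ.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical custom data type: a 2-D point that can be serialized and compared. */
final class PointValue implements Comparable<PointValue> {
    double x;
    double y;

    PointValue() { }

    PointValue(double x, double y) {
        this.x = x;
        this.y = y;
    }

    /** Serialization: write the fields in a fixed order. */
    void write(DataOutput out) throws IOException {
        out.writeDouble(x);
        out.writeDouble(y);
    }

    /** Deserialization: read the fields back in the same order. */
    void read(DataInput in) throws IOException {
        x = in.readDouble();
        y = in.readDouble();
    }

    /** Comparison: order by x first, then by y, so the type can be sorted and grouped. */
    @Override
    public int compareTo(PointValue other) {
        int byX = Double.compare(x, other.x);
        return byX != 0 ? byX : Double.compare(y, other.y);
    }

    /** Serializes a point to bytes and reads it back into a fresh instance. */
    static PointValue roundTrip(PointValue p) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            p.write(new DataOutputStream(bytes));
            PointValue copy = new PointValue();
            copy.read(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        PointValue copy = roundTrip(new PointValue(1.5, -2.0));
        System.out.println(copy.x + ", " + copy.y);
    }
}
```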
Data Flows: Code Example
// Data sources
FileDataSource customers = new FileDataSource(TextInputFormat.class, customersPath);
FileDataSource orders = new FileDataSource(TextInputFormat.class, ordersPath);

// Filter mapper
MapContract ordersFiltered = MapContract.builder(FilterOrders.class)
    .input(orders).build();

// Group customers; keyField defines the group key
ReduceContract groupedCustomers = ReduceContract.builder(GroupCustomers.class)
    .input(customers)
    .keyField(PactInteger.class, 0).build();

// Join both inputs on field 0
MatchContract joined = MatchContract.builder(JoinOnCustomerid.class, PactInteger.class, 0, 0)
    .input1(ordersFiltered)
    .input2(groupedCustomers).build();

// Group the join result
ReduceContract orderBy = ReduceContract.builder(MaxSum.class)
    .input(joined)
    .keyField(PactInteger.class, 0).build();

// Data sink
FileDataSink sink = new FileDataSink(RecordOutputFormat.class, outputPath, orderBy);
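What the four operators of this job compute can be traced on in-memory data with a plain-Java sketch. It does not use the Stratosphere API. The record layouts (`{customerId, date, amount}` for orders, `{customerId, name}` for customers) and the logic chosen for the grouping steps are assumptions, since the slides do not show `GroupCustomers`, `JoinOnCustomerid` or `MaxSum`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java walk-through of a filter, group, join, group data flow (assumed semantics). */
final class DataFlowSketch {

    /** Returns the largest order amount per customer id for orders placed on the given date. */
    static Map<String, Integer> run(List<String[]> orders, List<String[]> customers, String date) {
        // M: keep only orders placed on the given date
        List<String[]> ordersFiltered = new ArrayList<>();
        for (String[] o : orders) {
            if (o[1].equals(date)) {
                ordersFiltered.add(o);
            }
        }

        // R: group customers by id (field 0), keeping one name per id
        Map<String, String> groupedCustomers = new TreeMap<>();
        for (String[] c : customers) {
            groupedCustomers.putIfAbsent(c[0], c[1]);
        }

        // J: join filtered orders with grouped customers on the customer id
        List<String[]> joined = new ArrayList<>();
        for (String[] o : ordersFiltered) {
            String name = groupedCustomers.get(o[0]);
            if (name != null) {
                joined.add(new String[] {o[0], name, o[2]});
            }
        }

        // R: group by customer id and keep the largest order amount
        Map<String, Integer> maxPerCustomer = new TreeMap<>();
        for (String[] j : joined) {
            maxPerCustomer.merge(j[0], Integer.parseInt(j[2]), Math::max);
        }
        return maxPerCustomer;
    }

    public static void main(String[] args) {
        List<String[]> orders = List.of(
                new String[] {"1", "11.20.2013", "30"},
                new String[] {"1", "11.20.2013", "50"},
                new String[] {"2", "11.19.2013", "99"});
        List<String[]> customers = List.of(new String[] {"1", "Ann"}, new String[] {"2", "Bob"});
        System.out.println(run(orders, customers, "11.20.2013"));
    }
}
```

In the real system each step is a parallel operator and the intermediate results are streamed between operators. Here they are ordinary lists and maps.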
Map Stub and PactRecord by Example
public class FilterOrders extends MapStub {
    @Override
    public void map(PactRecord order, Collector<PactRecord> out)
            throws Exception {
        PactString date = order.getField(Orders.DATE_IDX, PactString.class);
        if (date.getValue().equals("11.20.2013")) {
            out.collect(order);
        }
    }
}
MapContract ordersFiltered = MapContract.builder(FilterOrders.class)
.input(orders).build();
Joins in Hadoop
● Two strategies: Map (Broadcast) Join and Reduce (Repartition) Join
● Which strategy to choose?
● How to configure it?
Lessons Learned:
● Joins do not naturally fit MapReduce
● Very time-consuming to implement
● Hand optimization necessary
Source: Sebastian Schelter, TU Berlin
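The difference between the two Hadoop join strategies can be simulated on a single machine. The following is a minimal sketch in plain Java, not Hadoop code: `HadoopJoinSketch`, the `{key, value}` record layout and the round-robin input splits are assumptions for illustration. In the broadcast variant every simulated mapper holds a full copy of the small input. In the repartition variant both inputs are split by key hash, so matching keys meet at the same simulated reducer.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Single-machine simulation of broadcast and repartition joins. */
final class HadoopJoinSketch {

    /** Local hash join: build a hash table on one input, probe it with the other. */
    private static List<String> hashJoin(List<String[]> probe, List<String[]> build) {
        Map<String, List<String>> table = new HashMap<>();
        for (String[] b : build) {
            table.computeIfAbsent(b[0], k -> new ArrayList<>()).add(b[1]);
        }
        List<String> out = new ArrayList<>();
        for (String[] p : probe) {
            for (String match : table.getOrDefault(p[0], new ArrayList<>())) {
                out.add(p[0] + ":" + p[1] + "," + match);
            }
        }
        return out;
    }

    /** Map-side join: each mapper reads one split of 'large' and a full copy of 'small'. */
    static List<String> broadcastJoin(List<String[]> large, List<String[]> small, int mappers) {
        List<String> out = new ArrayList<>();
        for (int m = 0; m < mappers; m++) {
            List<String[]> split = new ArrayList<>();
            for (int i = m; i < large.size(); i += mappers) {
                split.add(large.get(i));
            }
            out.addAll(hashJoin(split, small));
        }
        return out;
    }

    /** Reduce-side join: both inputs are shuffled by key hash, then joined per reducer. */
    static List<String> repartitionJoin(List<String[]> left, List<String[]> right, int reducers) {
        List<String> out = new ArrayList<>();
        for (int r = 0; r < reducers; r++) {
            out.addAll(hashJoin(partition(left, r, reducers), partition(right, r, reducers)));
        }
        return out;
    }

    private static List<String[]> partition(List<String[]> records, int index, int count) {
        List<String[]> part = new ArrayList<>();
        for (String[] rec : records) {
            if (Math.floorMod(rec[0].hashCode(), count) == index) {
                part.add(rec);
            }
        }
        return part;
    }

    public static void main(String[] args) {
        List<String[]> left = List.of(new String[] {"1", "a"}, new String[] {"2", "b"});
        List<String[]> right = List.of(new String[] {"1", "x"}, new String[] {"3", "y"});
        System.out.println(broadcastJoin(left, right, 2));
        System.out.println(repartitionJoin(left, right, 3));
    }
}
```

Both variants return the same pairs, possibly in a different order. They differ in how much data has to be moved, which is why the choice depends on the input sizes.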
Joins with Stratosphere
● Natively implemented in the system
● Optimizer decides join strategy:
 ○ Sort-merge join
 ○ Hybrid hash join
 ○ Data shipping strategy
● Hybrid hash join starts in-memory and gracefully degrades to disk
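As an illustration of the first strategy, here is a minimal in-memory sort-merge join in plain Java. It is a textbook sketch, not Stratosphere's implementation; the class name `SortMergeJoinSketch` and the `{key, payload}` record layout are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Textbook sort-merge join on {key, payload} records. */
final class SortMergeJoinSketch {

    /** Returns {key, leftPayload, rightPayload} for every pair of records with equal keys. */
    static List<int[]> join(int[][] left, int[][] right) {
        // Sort phase: order copies of both inputs by key (the sort is stable)
        int[][] l = left.clone();
        int[][] r = right.clone();
        Arrays.sort(l, (a, b) -> Integer.compare(a[0], b[0]));
        Arrays.sort(r, (a, b) -> Integer.compare(a[0], b[0]));

        // Merge phase: advance two cursors through the sorted inputs
        List<int[]> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < l.length && j < r.length) {
            if (l[i][0] < r[j][0]) {
                i++;
            } else if (l[i][0] > r[j][0]) {
                j++;
            } else {
                int key = l[i][0];
                int runStart = j;
                // Pair every left record of this key with the whole right-hand run
                while (i < l.length && l[i][0] == key) {
                    for (j = runStart; j < r.length && r[j][0] == key; j++) {
                        out.add(new int[] {key, l[i][1], r[j][1]});
                    }
                    i++;
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[][] left = {{2, 20}, {1, 10}, {2, 21}};
        int[][] right = {{2, 200}, {3, 300}, {2, 201}};
        for (int[] row : join(left, right)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

The output comes out ordered by key, so a later operator that groups on the same key needs no additional sort.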
Optimizer Magic
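Recap example job:
● We require a grouped input for the reducer (sorting or hashing)
● Optimizer chooses sort-merge join → the join output is already sorted on the key, so no extra sorting is needed for the reduce
[Figure: the example job. Filtering (M) and Grouping (R) feed Joining (J), followed by Grouping (R)]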
Stratosphere Optimizer
● Cost-based optimizer
 ○ Enumerate different execution plans
 ○ Choose the cheapest one
● Optimizer collects statistics
 ○ Size of input and output
● Operators (Map, Reduce, Join) tell how they modify fields
● In-memory chaining of operators
● Memory distribution
⇒ Focus on your application logic rather than parallel execution.
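The "enumerate plans, choose the cheapest" idea can be shown with a toy example. The sketch below picks a data shipping strategy for a join from input sizes alone. The cost model (bytes sent over the network) and all names are assumptions for illustration; the actual Stratosphere optimizer uses its own, richer cost model.

```java
/** Toy cost-based choice between data shipping strategies for a join. */
final class PlanChooserSketch {

    enum Strategy { BROADCAST_LEFT, BROADCAST_RIGHT, REPARTITION }

    /** Assumed cost: bytes shipped over the network. */
    static long cost(Strategy strategy, long leftBytes, long rightBytes, int nodes) {
        switch (strategy) {
            case BROADCAST_LEFT:
                return leftBytes * nodes;   // every node receives all of the left input
            case BROADCAST_RIGHT:
                return rightBytes * nodes;  // every node receives all of the right input
            default:
                return leftBytes + rightBytes; // each record is shipped once
        }
    }

    /** Enumerates all candidate strategies and returns the one with the lowest cost. */
    static Strategy cheapest(long leftBytes, long rightBytes, int nodes) {
        Strategy best = null;
        long bestCost = Long.MAX_VALUE;
        for (Strategy candidate : Strategy.values()) {
            long c = cost(candidate, leftBytes, rightBytes, nodes);
            if (c < bestCost) {
                best = candidate;
                bestCost = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(cheapest(1_000_000_000L, 1_000L, 10));
        System.out.println(cheapest(1_000L, 1_000L, 10));
    }
}
```

With one very small input, broadcasting it is cheapest under this model. With inputs of similar size, repartitioning both wins.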
Algorithms that need iterations
● K-Means
● Gradient descent
● PageRank
● Logistic regression
● Path algorithms on graphs
● Graph communities / dense sub-components
● Inference (belief propagation)
Why Iterations?
● Many algorithms loop over the data
 ○ Machine learning: iteratively refine the model
 ○ Graph processing: propagate information hop by hop
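Example: Connected Components
Component ID per vertex (vertices 1 to 4 form one component, vertices 5 to 7 another):
Vertex          1 2 3 4 5 6 7
Initial input   1 2 3 4 5 6 7
1st iteration   1 1 2 2 5 5 5
2nd iteration   1 1 1 1 5 5 5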
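The connected components example can be reproduced with a few lines of plain Java. In each iteration every vertex adopts the smallest component ID among itself and its neighbors, until nothing changes. The edge list used in `main` (1-2, 2-3, 2-4, 5-6, 5-7) is an assumption: the slide does not list the edges, but this graph yields the same component IDs per iteration as the example.

```java
import java.util.Arrays;

/** Iterative connected components by propagating the minimum vertex ID. */
final class ConnectedComponentsSketch {

    /** One iteration: each vertex takes the minimum label of itself and its neighbors. */
    static int[] step(int[] labels, int[][] edges) {
        int[] next = labels.clone();
        for (int[] e : edges) {
            next[e[0]] = Math.min(next[e[0]], labels[e[1]]);
            next[e[1]] = Math.min(next[e[1]], labels[e[0]]);
        }
        return next;
    }

    /** Initial labels: every vertex 1..n is its own component (index 0 is unused). */
    static int[] initialLabels(int vertexCount) {
        int[] labels = new int[vertexCount + 1];
        for (int v = 1; v <= vertexCount; v++) {
            labels[v] = v;
        }
        return labels;
    }

    /** Repeats the step until the labels no longer change. */
    static int[] run(int vertexCount, int[][] edges) {
        int[] labels = initialLabels(vertexCount);
        while (true) {
            int[] next = step(labels, edges);
            if (Arrays.equals(next, labels)) {
                return labels;
            }
            labels = next;
        }
    }

    public static void main(String[] args) {
        int[][] edges = {{1, 2}, {2, 3}, {2, 4}, {5, 6}, {5, 7}}; // assumed edges
        int[] labels = initialLabels(7);
        for (int iteration = 1; iteration <= 2; iteration++) {
            labels = step(labels, edges);
            System.out.println("iteration " + iteration + ": " + Arrays.toString(labels));
        }
    }
}
```

The loop in `run` is the part that a system with native iterations executes internally, instead of leaving it to an external driver program.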
Iterations in Hadoop
● Loop is outside the system
 ○ Hard to program
 ○ Very poor performance
[Figure: an external driver spawns the 1st, 2nd, ..., n-th iteration, each as a separate MapReduce job (M R)]
Usually each iteration is more than a single map and reduce!
Iterations in Stratosphere
● Loop is inside the system
 ○ Easy to program
 ○ Huge performance gains
[Figure: a data flow whose operators (M, C, R) are enclosed in an Iterate construct]
● Functional, object-oriented programming language
● ScaLa = Scalable Language
● Very productive (few LOC)
● Feels like a scripting language
● No more UDFs
● Easy to integrate
● Runs in the JVM, is compatible with regular Java classes
● Basis for developing embedded domain-specific languages (DSLs)
Let the code speak
Do more, write less!
Scala:
class Person(val firstName: String, val lastName: String)
Java:
public class Person {
    private final String firstName;
    private final String lastName;
    public Person(String firstName, String lastName) {
        this.firstName = firstName;
        this.lastName = lastName;
    }
    public String getFirstName() {
        return firstName;
    }
    public String getLastName() {
        return lastName;
    }
}
val input = TextFile(textInput)
val words = input.flatMap { line => line.split(" ") }
val counts = words
.groupBy { word => word }
.count()
val output = counts.write(wordsOutput, CsvOutputFormat())
val plan = new ScalaPlan(Seq(output))
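For readers who want to check what this word count computes, here is the same logic on an in-memory list in plain Java. This is only a sketch of the semantics: it uses JDK collections, not the Stratosphere Scala or Java API, and `WordCountSketch` is a name chosen for this illustration. Skipping empty tokens is an added assumption.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Word count over in-memory lines: split into words, group by word, count. */
final class WordCountSketch {

    static Map<String, Long> count(List<String> lines) {
        Map<String, Long> counts = new TreeMap<>();
        for (String line : lines) {
            // flatMap step: one line yields many words
            for (String word : line.split(" ")) {
                if (!word.isEmpty()) {
                    // groupBy + count step: one counter per distinct word
                    counts.merge(word, 1L, Long::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count(List.of("to be or not to be", "be quick")));
    }
}
```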
Example in Scala
FileDataSource customers = new FileDataSource(TextInputFormat.class, customersPath);
FileDataSource orders = new FileDataSource(TextInputFormat.class, ordersPath);
MapContract ordersFiltered = MapContract.builder(FilterOrders.class).input(orders).build();
ReduceContract groupedCustomers = ReduceContract.builder(GroupCustomers.class)
.input(customers)
.keyField(PactInteger.class, 0)
.build();
MatchContract joined = MatchContract.builder(JoinOnCustomerid.class,PactInteger.class, 0,0)
.input1(ordersFiltered).input2(groupedCustomers).build();
ReduceContract orderBy = ReduceContract.builder(MaxSum.class)
.input(joined)
.keyField(PactInteger.class, 0)
.build();
FileDataSink sink = new FileDataSink(RecordOutputFormat.class, outputPath, orderBy, "output: word counts");
val customers = DataSource(customersPath, CsvInputFormat[Customer])
val orders = DataSource(ordersPath, CsvInputFormat[Order])
val ordersFiltered = orders filter { order => order.date.equals("11.20.2013") }
val groupedCustomers = customers groupBy { cust => cust.zip } reduceGroup { grp =>
  (grp.buffered.head.zip, grp.maxBy { _.total })
}
val joined = ordersFiltered.join(groupedCustomers)
  .where { ord => ord.c_id }
  .isEqualTo { cust => cust._1 }
  .map { (orders, cust) => cust }
val max = joined groupBy { cust => cust.category_id } reduceGroup { _.maxBy { _.sum } }
val output = max.write(outputPath, DelimitedOutputFormat(formatOutput.tupled))
val plan = new ScalaPlan(Seq(output), "BDB Example")
Stratosphere: Database-inspired Big Data Analytics
Summary: Feature Matrix
              | Map Reduce         | Stratosphere
Operators     | Map, Reduce        | Map, Reduce (multiple sort keys), Cross, Join, CoGroup, Union, Iterate, Iterate Delta
Composition   | Only MapReduce     | Arbitrary data flows
Data Exchange | Batch through disk | Pipelined, in-memory (automatic spilling to disk)
Stratosphere is the next-generation open source Big Data Analytics Platform.
Quickstart: http://stratosphere.eu/quickstart
Website: http://stratosphere.eu
GitHub: https://github.com/stratosphere
Mailing List: https://groups.google.com/d/forum/stratosphere-dev
Twitter: @stratosphere_eu
Get In Touch