New Directions for Spark in 2015 Matei Zaharia February 20, 2015
What is Apache Spark?
Fast and general engine for big data processing, with libraries for SQL, streaming, and advanced analytics. The most active open source project in big data.
About Databricks
– Founded by the creators of Spark in 2013
– Largest organization contributing to Spark (3/4 of the code in 2014)
– End-to-end hosted service: Databricks Cloud
2014: an Amazing Year for Spark
– Total contributors: 150 => 500
– Lines of code: 190K => 370K
– 500 active production deployments
Contributors per Month to Spark
[Chart: monthly contributors, 2011–2015, growing from near zero to roughly 100]
Most active project at Apache
On-Disk Sort Record: Time to sort 100 TB
(Source: Daytona GraySort benchmark, sortbenchmark.org)

2013 Record (Hadoop): 2100 machines, 72 minutes
2014 Record (Spark): 207 machines, 23 minutes
Distributors and Applications
[Logo slide: Spark distributors and applications]
New Directions in 2015
– Data Science: high-level interfaces similar to single-machine tools
– Platform Interfaces: plug in data sources and algorithms
DataFrames
– Similar API to data frames in R and Pandas
– Automatically optimized via Spark SQL
– Coming in Spark 1.3

df = jsonFile("tweets.json")
df[df["user"] == "matei"] \
    .groupBy("date") \
    .sum("retweets")
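The slide notes that the API is similar to data frames in R and pandas. For comparison, here is the same filter/group/sum shape written against pandas (a sketch, not Spark code; the tweet data below is made up for illustration):

```python
import pandas as pd

# Hypothetical stand-in for the contents of tweets.json
df = pd.DataFrame({
    "user": ["matei", "matei", "alice"],
    "date": ["2015-02-19", "2015-02-19", "2015-02-20"],
    "retweets": [3, 5, 7],
})

# Filter to one user, group by date, sum retweets --
# the same logical plan as the Spark DataFrame snippet
result = df[df["user"] == "matei"].groupby("date")["retweets"].sum()
```

In Spark, the same chain is not executed eagerly line by line: Spark SQL's optimizer sees the whole expression and can reorder and fuse the operations.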
[Chart: running time of the same computation written in Python, in Scala, and with the DataFrame API]
R Interface (SparkR)
– Arrives in Spark 1.4 (June)
– Exposes DataFrames, RDDs, and ML library in R

df = jsonFile("tweets.json")
summarize(
  group_by(
    df[df$user == "matei", ],
    "date"),
  sum("retweets"))
Machine Learning Pipelines
– High-level API inspired by scikit-learn
– Featurization, evaluation, model tuning

tokenizer = Tokenizer()
tf = HashingTF(numFeatures=1000)
lr = LogisticRegression()
pipe = Pipeline(stages=[tokenizer, tf, lr])
model = pipe.fit(df)

[Diagram: DataFrame → tokenizer → TF → LR → model]
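The idea behind the pipeline API can be sketched in plain Python: each stage transforms its input and the pipeline chains the stages. This is an illustrative toy, not the real Spark ML classes (which also handle fitting, parameters, and DataFrames):

```python
# Toy sketch of the featurization stages of an ML pipeline.
# Class names mirror the slide, but these are simplified stand-ins.

class Tokenizer:
    def transform(self, rows):
        # Split each input string into lowercase tokens
        return [r.lower().split() for r in rows]

class HashingTF:
    def __init__(self, numFeatures=1000):
        self.numFeatures = numFeatures
    def transform(self, docs):
        # Hash each token into a fixed-size term-frequency vector
        vecs = []
        for tokens in docs:
            v = [0] * self.numFeatures
            for t in tokens:
                v[hash(t) % self.numFeatures] += 1
            vecs.append(v)
        return vecs

class Pipeline:
    def __init__(self, stages):
        self.stages = stages
    def transform(self, data):
        # Feed the output of each stage into the next
        for stage in self.stages:
            data = stage.transform(data)
        return data

pipe = Pipeline([Tokenizer(), HashingTF(numFeatures=16)])
features = pipe.transform(["Spark is fast", "Spark is general"])
```

The payoff of this design is that the whole chain is a single object: it can be fit, tuned, and swapped as a unit instead of wiring stages together by hand.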
External Data Sources
– Platform API to plug smart data sources into Spark
– Returns DataFrames usable in Spark apps or SQL
– Pushes logic into sources

[Diagram: Spark connected to data sources such as JSON]
Example query over two sources:

SELECT * FROM mysql_users u JOIN hive_logs h
  WHERE u.lang = 'en'

Query pushed into the MySQL source:

SELECT * FROM users WHERE lang = 'en'
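The "pushes logic into sources" point can be sketched as follows: instead of pulling the whole table into the engine and filtering there, the engine hands the predicate to the source and only matching rows cross the boundary. The names below are illustrative, not the actual Spark data sources API:

```python
# Minimal sketch of predicate pushdown into a data source.

class UsersSource:
    def __init__(self, rows):
        self.rows = rows

    def scan(self, predicate=None):
        # The source applies the filter itself, so rows that fail
        # the predicate never leave the source system.
        if predicate is None:
            return list(self.rows)
        return [r for r in self.rows if predicate(r)]

source = UsersSource([
    {"name": "matei", "lang": "en"},
    {"name": "ana", "lang": "es"},
])

# Engine pushes "lang = 'en'" down instead of filtering afterwards
english_users = source.scan(lambda r: r["lang"] == "en")
```

With a real database source, the pushed-down predicate becomes a WHERE clause in the query sent to the database, as in the MySQL example above.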
Goal: one engine for all data sources, workloads and environments
To Learn More
– Two free massive open online courses on Spark: databricks.com/moocs
– Try Databricks Cloud: databricks.com