New Directions for Spark in 2015 Matei Zaharia February 20, 2015


Page 1: New directions for Apache Spark in 2015

New Directions for Spark in 2015 Matei Zaharia February 20, 2015

Page 2: New directions for Apache Spark in 2015

What is Apache Spark?

Fast and general engine for big data processing, with libraries for SQL, streaming, and advanced analytics

Most active open source project in big data

Page 3: New directions for Apache Spark in 2015

About Databricks

Founded by the creators of Spark in 2013

Largest organization contributing to Spark: 3/4 of the code in 2014

End-to-end hosted service: Databricks Cloud

Page 4: New directions for Apache Spark in 2015

2014: an Amazing Year for Spark

Total contributors: 150 => 500

Lines of code: 190K => 370K

500 active production deployments


Page 5: New directions for Apache Spark in 2015

Contributors per Month to Spark

[Bar chart: contributors per month, 2011-2015, growing from near zero to roughly 100]

Page 6: New directions for Apache Spark in 2015

Contributors per Month to Spark

[Bar chart: same contributors-per-month data, 2011-2015]

Most active project at Apache


Page 7: New directions for Apache Spark in 2015

On-Disk Sort Record: Time to Sort 100 TB

2013 Record (Hadoop): 2100 machines, 72 minutes
2014 Record (Spark): 207 machines, 23 minutes

Source: Daytona GraySort benchmark, sortbenchmark.org

Page 8: New directions for Apache Spark in 2015

Distributors Applications


Page 9: New directions for Apache Spark in 2015


New Directions in 2015

Data Science: high-level interfaces similar to single-machine tools

Platform Interfaces: plug in data sources and algorithms

Page 10: New directions for Apache Spark in 2015


DataFrames

Similar API to data frames in R and Pandas

Automatically optimized via Spark SQL

Coming in Spark 1.3

df = jsonFile("tweets.json")

df[df["user"] == "matei"]
  .groupBy("date")
  .sum("retweets")

[Bar chart: running time of the Python, Scala, and DataFrame versions of this query]

Page 11: New directions for Apache Spark in 2015


R Interface (SparkR)

Arrives in Spark 1.4 (June)

Exposes DataFrames, RDDs, and ML library in R

df = jsonFile("tweets.json")
summarize(
  group_by(
    df[df$user == "matei",],
    "date"),
  sum("retweets"))

Page 12: New directions for Apache Spark in 2015


Machine Learning Pipelines

High-level API inspired by scikit-learn

Featurization, evaluation, and model tuning

tokenizer = Tokenizer()
tf = HashingTF(numFeatures=1000)
lr = LogisticRegression()
pipe = Pipeline([tokenizer, tf, lr])
model = pipe.fit(df)

[Diagram: DataFrame → tokenizer → TF → LR → model]
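The pipeline idea is that each stage consumes the output of the previous one, and a single fit() call runs the whole chain. A minimal pure-Python sketch using hypothetical classes that mirror the slide's names (the real Spark ML fit() returns a PipelineModel, and the LogisticRegression estimator stage is omitted here):

```python
# Sketch of pipeline chaining: Tokenizer splits text, HashingTF hashes
# tokens into fixed-size count vectors, and Pipeline threads the data
# through each stage in order. Not the real Spark ML API.
class Tokenizer:
    def transform(self, texts):
        return [t.split() for t in texts]

class HashingTF:
    def __init__(self, numFeatures=1000):
        self.numFeatures = numFeatures

    def transform(self, token_lists):
        vecs = []
        for tokens in token_lists:
            v = [0] * self.numFeatures
            for tok in tokens:
                v[hash(tok) % self.numFeatures] += 1  # hashed term counts
            vecs.append(v)
        return vecs

class Pipeline:
    def __init__(self, stages):
        self.stages = stages

    def fit(self, data):
        # Run each stage on the previous stage's output.
        for stage in self.stages:
            data = stage.transform(data)
        return data  # real Spark ML returns a fitted PipelineModel instead

pipe = Pipeline([Tokenizer(), HashingTF(numFeatures=16)])
features = pipe.fit(["spark is fast", "spark is general"])
print(len(features), len(features[0]))  # 2 documents, 16 features each
```

The payoff of the abstraction: swapping a featurizer or tuning numFeatures changes one stage, not the surrounding plumbing.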

Page 13: New directions for Apache Spark in 2015


External Data Sources

Platform API to plug smart data sources into Spark

Returns DataFrames usable in Spark apps or SQL

Pushes logic into sources

[Diagram: Spark connected to external sources, including a JSON source]

Page 14: New directions for Apache Spark in 2015


External Data Sources

Platform API to plug smart data sources into Spark

Returns DataFrames usable in Spark apps or SQL

Pushes logic into sources

SELECT * FROM mysql_users u JOIN
  hive_logs h
WHERE u.lang = "en"

[Diagram: Spark pushes the filter down to the MySQL source]

Query sent to the source:
SELECT * FROM users WHERE lang = "en"
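"Pushes logic into sources" means the engine hands predicates to the data source, which evaluates them during its own scan, so only matching rows ever cross into Spark. A minimal pure-Python sketch of that pushdown contract (hypothetical `UsersSource` class, not the actual Data Sources API):

```python
# Sketch of filter pushdown: the engine passes the predicate to the
# source's scan() instead of filtering after the fact, the way Spark
# would send "SELECT * FROM users WHERE lang = 'en'" to MySQL.
class UsersSource:
    def __init__(self, rows):
        self.rows = rows

    def scan(self, pushed_filter=None):
        # The source applies the pushed-down predicate itself,
        # so unmatched rows are never shipped to the engine.
        if pushed_filter is None:
            return list(self.rows)
        return [r for r in self.rows if pushed_filter(r)]

source = UsersSource([
    {"name": "matei", "lang": "en"},
    {"name": "ion", "lang": "ro"},
])
rows = source.scan(pushed_filter=lambda r: r["lang"] == "en")
print(rows)  # only English-language users reach the engine
```

With a smart source like MySQL, the pushed predicate becomes part of the source's own query plan, cutting both I/O and network transfer.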

Page 15: New directions for Apache Spark in 2015


Goal: one engine for all data sources, workloads and environments

Page 16: New directions for Apache Spark in 2015

To Learn More

Two free massive online courses on Spark:

databricks.com/moocs


Try Databricks Cloud:

databricks.com