New Directions for Spark in 2015

Matei Zaharia

March 18, 2015

New Directions for Spark in 2015 - Spark Summit East

2014: an Amazing Year for Spark

Total contributors: 150 => 500

Lines of code: 190K => 370K

500+ active production deployments

[Figure: Contributors per Month to Spark, 2011 to 2015 (y-axis: 0 to 140)]

Most active project in big data

On-Disk Sort Record: Time to sort 100TB

2013 Record: Hadoop, 2100 machines, 72 minutes

2014 Record: Spark, 207 machines, 23 minutes

Source: Daytona GraySort benchmark, sortbenchmark.org
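
As a rough back-of-the-envelope comparison of the two records, the short Python snippet below computes the ratios implied by the figures on this slide. The machine-minutes product is an illustrative measure introduced here, not something the benchmark itself reports.

```python
# Figures taken from the slide; machine-minutes is an illustrative measure only.
hadoop_machines, hadoop_minutes = 2100, 72
spark_machines, spark_minutes = 207, 23

speedup = hadoop_minutes / spark_minutes          # wall-clock ratio
machine_ratio = hadoop_machines / spark_machines  # cluster-size ratio

hadoop_machine_minutes = hadoop_machines * hadoop_minutes
spark_machine_minutes = spark_machines * spark_minutes
resource_ratio = hadoop_machine_minutes / spark_machine_minutes

print(f"Time: {speedup:.1f}x faster")
print(f"Machines: {machine_ratio:.1f}x fewer")
print(f"Machine-minutes: {resource_ratio:.1f}x fewer")
```

In other words, the 2014 run finished about three times faster on roughly a tenth of the machines.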

Major Additions in 2014

Spark SQL, Java 8 syntax, Python streaming, …

GraphX, random forests, streaming MLlib

New Directions in 2015

Data Science: high-level interfaces similar to single-machine tools

Platform Interfaces: plug in data sources and algorithms

DataFrames

Similar API to data frames in R and Pandas

Automatically optimized via Spark SQL

Out in Spark 1.3

# jsonFile is shorthand on the slide; in Spark 1.3 it is a method on SQLContext
df = jsonFile("tweets.json")

(df[df["user"] == "matei"]
    .groupBy("date")
    .sum("retweets"))

[Figure: Running time for Python, Scala, and DataFrame (y-axis: 0 to 10)]
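
To make concrete what the query above computes, here is a plain-Python sketch of the same filter, group, and sum steps over a small in-memory list. It does not use Spark; the sample rows and the helper name `sum_retweets_by_date` are made up for illustration.

```python
from collections import defaultdict

# Hypothetical sample rows standing in for the contents of tweets.json.
tweets = [
    {"user": "matei", "date": "2015-03-17", "retweets": 3},
    {"user": "matei", "date": "2015-03-17", "retweets": 5},
    {"user": "matei", "date": "2015-03-18", "retweets": 2},
    {"user": "someone_else", "date": "2015-03-18", "retweets": 9},
]

def sum_retweets_by_date(rows, user):
    """Filter rows by user, group by date, and sum retweets per group."""
    totals = defaultdict(int)
    for row in rows:
        if row["user"] == user:                     # df[df["user"] == ...]
            totals[row["date"]] += row["retweets"]  # .groupBy("date").sum("retweets")
    return dict(totals)

result = sum_retweets_by_date(tweets, "matei")
print(result)
```

The DataFrame version states the same logic declaratively, which is what allows Spark SQL to optimize it automatically.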

Machine Learning Pipelines

High-level API inspired by scikit-learn

Featurization, evaluation, parameter search

tokenizer = Tokenizer()
tf = HashingTF(numFeatures=1000)
lr = LogisticRegression()
pipe = Pipeline([tokenizer, tf, lr])
model = pipe.fit(df)

[Diagram: a DataFrame passes through the tokenizer, TF, and LR stages to produce a model]
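
The pipeline pattern can be sketched in plain Python as a chain of stages in which each stage's output feeds the next. This is not MLlib's implementation: every class below (`SketchTokenizer`, `SketchHashingTF`, `SketchMajorityClassifier`, `SketchPipeline`, and the model classes) is a hypothetical stand-in, rows are plain dicts rather than DataFrame rows, and a toy majority-label learner takes the place of logistic regression.

```python
import zlib
from collections import Counter

class SketchTokenizer:
    """Transformer: splits the 'text' field into lowercase words."""
    def transform(self, rows):
        return [dict(r, words=r["text"].lower().split()) for r in rows]

class SketchHashingTF:
    """Transformer: maps words to a fixed-length term-frequency vector."""
    def __init__(self, num_features=1000):
        self.num_features = num_features

    def transform(self, rows):
        out = []
        for r in rows:
            vec = [0] * self.num_features
            for w in r["words"]:
                # crc32 is stable across runs, unlike built-in hash() on strings.
                vec[zlib.crc32(w.encode("utf-8")) % self.num_features] += 1
            out.append(dict(r, features=vec))
        return out

class SketchMajorityClassifier:
    """Estimator: toy stand-in for a real learner such as logistic regression."""
    def fit(self, rows):
        majority = Counter(r["label"] for r in rows).most_common(1)[0][0]
        return SketchMajorityModel(majority)

class SketchMajorityModel:
    """Model (a transformer) returned by fitting the toy estimator."""
    def __init__(self, label):
        self.label = label

    def transform(self, rows):
        return [dict(r, prediction=self.label) for r in rows]

class SketchPipeline:
    """Runs stages in order: fits estimators, applies transformers."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                stage = stage.fit(rows)   # estimator -> model
            rows = stage.transform(rows)  # output feeds the next stage
            fitted.append(stage)
        return SketchPipelineModel(fitted)

class SketchPipelineModel:
    """Fitted pipeline: every stage is now a transformer."""
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows

train = [
    {"text": "Spark is fast", "label": 1},
    {"text": "Spark scales well", "label": 1},
    {"text": "slow job", "label": 0},
]
pipe = SketchPipeline([SketchTokenizer(), SketchHashingTF(num_features=16),
                       SketchMajorityClassifier()])
model = pipe.fit(train)
predictions = model.transform([{"text": "Spark rocks"}])
print(predictions[0]["prediction"])
```

The point of the design is that fitting returns a single object that replays featurization and prediction together, so new data goes through exactly the same steps as the training data.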

R Interface (SparkR)

Targeting Spark 1.4 (June)

Exposes DataFrames, RDDs, and ML library in R

df = jsonFile("tweets.json")

summarize(
  group_by(
    df[df$user == "matei",],
    "date"),
  sum("retweets"))

External Data Sources

Platform API to plug smart data sources into Spark

Returns DataFrames usable in Spark apps or SQL

Pushes logic into sources

Query submitted to Spark:

SELECT * FROM mysql_users u JOIN hive_logs h
WHERE u.lang = "en"

Query pushed into the source:

SELECT * FROM users WHERE lang = "en"
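
The idea of pushing logic into a source can be sketched in plain Python. The snippet below is not Spark's data sources API: `SketchSource`, its `scan` method, and the `rows_read` counter are hypothetical names, and an in-memory list stands in for an external database.

```python
class SketchSource:
    """Hypothetical data source backed by an in-memory list of rows."""
    def __init__(self, rows):
        self._rows = rows
        self.rows_read = 0  # counts rows handed back to the engine

    def scan(self, filters=None):
        """Return rows; if filters are given, apply them inside the source."""
        filters = filters or {}
        result = []
        for row in self._rows:
            if all(row.get(col) == val for col, val in filters.items()):
                result.append(row)
        self.rows_read += len(result)
        return result

users = [
    {"name": "ann", "lang": "en"},
    {"name": "bob", "lang": "fr"},
    {"name": "cho", "lang": "en"},
    {"name": "dee", "lang": "de"},
]

# Without pushdown: the engine pulls every row, then filters locally.
source_a = SketchSource(users)
no_pushdown = [r for r in source_a.scan() if r["lang"] == "en"]

# With pushdown: the filter travels to the source, so fewer rows move.
source_b = SketchSource(users)
with_pushdown = source_b.scan(filters={"lang": "en"})

print(source_a.rows_read, source_b.rows_read)
```

Both paths return the same rows; the difference is how much data leaves the source before the filter is applied.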

Spark Packages

Community index of third-party packages

bin/spark-shell --packages databricks/spark-csv:0.2

spark-packages.org

[Diagram, built up over four slides: the Spark stack of Spark Core with Spark Streaming, Spark SQL, MLlib, and GraphX; then DataFrames and ML Pipelines are added; then Data Sources, including JSON]

Goal: unified engine across data sources, workloads and environments

Enjoy Spark Summit East!