New directions for Apache Spark in 2015


New Directions for Spark in 2015
Matei Zaharia
February 20, 2015

What is Apache Spark?

Fast and general engine for big data processing, with libraries for SQL, streaming, and advanced analytics

Most active open source project in big data


About Databricks

Founded by the creators of Spark in 2013

Largest organization contributing to Spark
– 3/4 of the code in 2014

End-to-end hosted service, Databricks Cloud


2014: an Amazing Year for Spark

Total contributors: 150 => 500

Lines of code: 190K => 370K

500 active production deployments


Contributors per Month to Spark

[Chart: monthly contributors to Spark, 2011–2015]

Most active project at Apache


On-Disk Sort Record: Time to Sort 100 TB

2013 Record (Hadoop): 2,100 machines, 72 minutes
2014 Record (Spark): 207 machines, 23 minutes

Source: Daytona GraySort benchmark, sortbenchmark.org

Distributors and Applications

New Directions in 2015

Data Science: high-level interfaces similar to single-machine tools

Platform Interfaces: plug in data sources and algorithms

DataFrames

Similar API to data frames in R and Pandas

Automatically optimized via Spark SQL

Coming in Spark 1.3

df = jsonFile("tweets.json")
df[df["user"] == "matei"]
  .groupBy("date")
  .sum("retweets")
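For reference, a minimal runnable version of the snippet above might look like this in PySpark 1.3, assuming tweets.json sits in the working directory with user, date, and retweets fields (jsonFile is a method on SQLContext rather than a bare function):

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="dataframe-demo")
sqlContext = SQLContext(sc)

# Load the JSON file as a DataFrame; the schema is inferred automatically
df = sqlContext.jsonFile("tweets.json")

# Filter to one user's tweets, then sum retweets per day; the whole
# expression is optimized by Spark SQL before it runs
result = df[df["user"] == "matei"].groupBy("date").sum("retweets")
result.show()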

[Chart: running time of the query in Python, Scala, and with the DataFrame API]

R Interface (SparkR)

Arrives in Spark 1.4 (June)

Exposes DataFrames, RDDs, and ML library in R

df = jsonFile("tweets.json")
summarize(
  group_by(
    df[df$user == "matei", ],
    "date"),
  sum("retweets"))


Machine Learning Pipelines

High-level API inspired by scikit-learn

Featurization, evaluation, model tuning

tokenizer = Tokenizer()
tf = HashingTF(numFeatures=1000)
lr = LogisticRegression()
pipe = Pipeline(stages=[tokenizer, tf, lr])
model = pipe.fit(df)
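As a rough sketch of how these stages fit together with the full parameter list (the column names and maxIter value below are illustrative assumptions, and df is assumed to contain text and label columns):

from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

# Each stage reads one column of the DataFrame and appends a new one
tokenizer = Tokenizer(inputCol="text", outputCol="words")
tf = HashingTF(numFeatures=1000, inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10)

# Chain featurization and the classifier into a single estimator
pipe = Pipeline(stages=[tokenizer, tf, lr])
model = pipe.fit(df)  # df assumed to have "text" and "label" columns

# The fitted PipelineModel adds a "prediction" column to new data
predictions = model.transform(df)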

[Diagram: DataFrame → tokenizer → TF → LR → fitted model]

External Data Sources

Platform API to plug smart data sources into Spark

Returns DataFrames usable in Spark apps or SQL

Pushes logic into sources

[Diagram: Spark connected to pluggable data sources such as JSON]
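As a hedged sketch of what this looks like from the application side, using the DataFrameReader-style API from Spark 1.4 (the connection URL, table, file, and column names below are placeholders, and the MySQL JDBC driver is assumed to be on the classpath):

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="data-sources-demo")
sqlContext = SQLContext(sc)

# Built-in JSON source: returns a DataFrame with an inferred schema
logs = sqlContext.read.format("json").load("logs.json")

# An external JDBC source (here a MySQL table) plugs in through the same API
users = sqlContext.read.format("jdbc").options(
    url="jdbc:mysql://db.example.com/app",
    dbtable="users").load()

# Both DataFrames can now be combined in a single Spark program
logs.join(users, logs["user_id"] == users["id"]).show()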


Example: a cross-source query joining a MySQL table with Hive logs. Spark SQL pushes the filter on lang down to MySQL, so only matching rows are shipped to Spark:

SELECT * FROM mysql_users u JOIN hive_logs h
WHERE u.lang = "en"

[Diagram: Spark sends SELECT * FROM users WHERE lang = "en" to the MySQL source]
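Continuing the sketch above, the same cross-source join can be issued from a Spark program by registering the two DataFrames as temporary tables; the join key (u.id = h.user_id) is an assumption added here for illustration:

# Register the DataFrames from the previous sketch as SQL tables
users.registerTempTable("mysql_users")
logs.registerTempTable("hive_logs")

# Spark SQL plans the join, and the data sources API lets the filter on
# lang be pushed down to MySQL instead of scanning the whole users table
en_activity = sqlContext.sql("""
    SELECT * FROM mysql_users u JOIN hive_logs h ON u.id = h.user_id
    WHERE u.lang = 'en'
""")
en_activity.show()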

Goal: one engine for all data sources, workloads and environments

To Learn More

Two free massive open online courses (MOOCs) on Spark:

databricks.com/moocs


Try Databricks Cloud:

databricks.com
