New directions for Apache Spark in 2015


New Directions for Spark in 2015
Matei Zaharia
February 20, 2015

What is Apache Spark?

Fast and general engine for big data processing, with libraries for SQL, streaming, and advanced analytics

Most active open source project in big data


About Databricks

Founded by the creators of Spark in 2013

Largest organization contributing to Spark
– 3/4 of the code in 2014

End-to-end hosted service, Databricks Cloud


2014: an Amazing Year for Spark

Total contributors: 150 => 500

Lines of code: 190K => 370K

500 active production deployments


Contributors per Month to Spark

[Chart: monthly contributors to Spark, 2011–2015]

Most active project at Apache


On-Disk Sort Record: Time to Sort 100 TB

2013 Record (Hadoop): 2,100 machines, 72 minutes
2014 Record (Spark): 207 machines, 23 minutes

Source: Daytona GraySort benchmark, sortbenchmark.org

Distributors and Applications

New Directions in 2015

Data Science: high-level interfaces similar to single-machine tools

Platform Interfaces: plug in data sources and algorithms

DataFrames

Similar API to data frames in R and Pandas

Automatically optimized via Spark SQL

Coming in Spark 1.3

df = jsonFile("tweets.json")
df[df["user"] == "matei"]
  .groupBy("date")
  .sum("retweets")
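For reference, a minimal runnable version of the snippet above might look like this in PySpark 1.3, assuming tweets.json sits in the working directory with user, date, and retweets fields (jsonFile is a method on SQLContext rather than a bare function):

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="dataframe-demo")
sqlContext = SQLContext(sc)

# Load the JSON file as a DataFrame; the schema is inferred automatically
df = sqlContext.jsonFile("tweets.json")

# Filter to one user's tweets, then sum retweets per day; the whole
# expression is optimized by Spark SQL before it runs
result = df[df["user"] == "matei"].groupBy("date").sum("retweets")
result.show()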

[Chart: running time of the query in Python, Scala, and with the DataFrame API]

R Interface (SparkR)

Arrives in Spark 1.4 (June)

Exposes DataFrames, RDDs, and ML library in R

df = jsonFile("tweets.json")
summarize(
  group_by(
    df[df$user == "matei", ],
    "date"),
  sum("retweets"))


Machine Learning Pipelines

High-level API inspired by scikit-learn

Featurization, evaluation, model tuning

tokenizer = Tokenizer()
tf = HashingTF(numFeatures=1000)
lr = LogisticRegression()
pipe = Pipeline(stages=[tokenizer, tf, lr])
model = pipe.fit(df)
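As a rough sketch of how these stages fit together with the full parameter list (the column names and maxIter value below are illustrative assumptions, and df is assumed to contain text and label columns):

from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

# Each stage reads one column of the DataFrame and appends a new one
tokenizer = Tokenizer(inputCol="text", outputCol="words")
tf = HashingTF(numFeatures=1000, inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10)

# Chain featurization and the classifier into a single estimator
pipe = Pipeline(stages=[tokenizer, tf, lr])
model = pipe.fit(df)  # df assumed to have "text" and "label" columns

# The fitted PipelineModel adds a "prediction" column to new data
predictions = model.transform(df)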

[Diagram: DataFrame → tokenizer → TF → LR → fitted model]

External Data Sources

Platform API to plug smart data sources into Spark

Returns DataFrames usable in Spark apps or SQL

Pushes logic into sources

[Diagram: Spark connected to pluggable data sources such as JSON]
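As a hedged sketch of what this looks like from the application side, using the DataFrameReader-style API from Spark 1.4 (the connection URL, table, file, and column names below are placeholders, and the MySQL JDBC driver is assumed to be on the classpath):

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="data-sources-demo")
sqlContext = SQLContext(sc)

# Built-in JSON source: returns a DataFrame with an inferred schema
logs = sqlContext.read.format("json").load("logs.json")

# An external JDBC source (here a MySQL table) plugs in through the same API
users = sqlContext.read.format("jdbc").options(
    url="jdbc:mysql://db.example.com/app",
    dbtable="users").load()

# Both DataFrames can now be combined in a single Spark program
logs.join(users, logs["user_id"] == users["id"]).show()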


Example: a cross-source query joining a MySQL table with Hive logs. Spark SQL pushes the filter on lang down to MySQL, so only matching rows are shipped to Spark:

SELECT * FROM mysql_users u JOIN hive_logs h
WHERE u.lang = "en"

[Diagram: Spark sends SELECT * FROM users WHERE lang = "en" to the MySQL source]
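Continuing the sketch above, the same cross-source join can be issued from a Spark program by registering the two DataFrames as temporary tables; the join key (u.id = h.user_id) is an assumption added here for illustration:

# Register the DataFrames from the previous sketch as SQL tables
users.registerTempTable("mysql_users")
logs.registerTempTable("hive_logs")

# Spark SQL plans the join, and the data sources API lets the filter on
# lang be pushed down to MySQL instead of scanning the whole users table
en_activity = sqlContext.sql("""
    SELECT * FROM mysql_users u JOIN hive_logs h ON u.id = h.user_id
    WHERE u.lang = 'en'
""")
en_activity.show()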

Goal: one engine for all data sources, workloads and environments

To Learn More

Two free massive open online courses (MOOCs) on Spark:

databricks.com/moocs


Try Databricks Cloud:

databricks.com
