Spark Community Update - Spark Summit San Francisco 2015

Spark Community Update

Matei Zaharia & Patrick Wendell June 15th, 2015

A Great Year for Spark

Most active open source project in data processing

New language: R

Many new features & community projects

Community Growth

                      June 2014    June 2015
total contributors    255          730
contributors/month    75           135
lines of code         175,000      400,000

Mostly in libraries
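As a quick check on the table above, the year-over-year growth factors can be computed directly from the slide's figures. This is only a small sketch: the dictionary and variable names are illustrative, and the numbers are copied from the slide as-is.

```python
# Figures copied from the Community Growth slide: (June 2014, June 2015).
metrics = {
    "total contributors": (255, 730),
    "contributors/month": (75, 135),
    "lines of code": (175_000, 400_000),
}

# Growth factor = June 2015 value divided by June 2014 value.
growth = {name: after / before for name, (before, after) in metrics.items()}

for name, factor in growth.items():
    print(f"{name}: {factor:.2f}x")
```

That works out to roughly 2.9x for total contributors, 1.8x for monthly contributors, and 2.3x for lines of code.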

Users

1000+ companies

…

Distributors + Apps

50+ companies

…

Large-Scale Usage

Largest cluster: 8000 nodes

Largest single job: 1 petabyte

Top streaming intake: 1 TB/hour

2014 on-disk sort record

Open Source Ecosystem

Applications

Environments Data Sources

Current Spark Components

Spark Core

Spark Streaming

Spark SQL MLlib GraphX

Unified engine across diverse workloads & environments

Major Directions in 2015

Data Science: similar interfaces to single-node tools

Platform APIs: growing the ecosystem

Data Science

DataFrames (Spark 1.3): popular API for data transformation

Machine Learning Pipelines (Spark 1.4): inspired by scikit-learn

R Language (Spark 1.4)

Platform APIs

{JSON}

Data Sources

•  Uniform interface to diverse sources (DataFrames + SQL)

Spark Packages

•  Community site with 70+ libraries

•  spark-packages.org

…

Spark

Your app

Ongoing Engine Improvements

Project Tungsten

•  Code generation, binary processing, off-heap memory

DAG visualization & debugging tools

Spark’s 1.4 Release

R Language Support

Aaron Davidson – Bug fixes in Core, Shuffle, and YARN
Aaron Josephs – New features in Core
Adam Budde – Bug fixes in SQL
Ai He – Improvements in MLlib
Andrew Or – Bug fixes in Core
Andrew Or – Improvements in Core and YARN; bug fixes in Core, Web UI, Streaming, tests, and SQL; improvement in Streaming, Web UI, Core, and SQL
Andrey Zagrebin – Improvement in SQL
Antonio Piccolboni – New features in SparkR
Arsenii Krasikov – Bug fixes in Core
Ashutosh Raina – New features in SparkR
Ashwin Shankar – Bug fixes in YARN
Augustin Borsu – New features in MLlib
Ben Fradet – Documentation in Core and Streaming
Benedikt Linse – Documentation in Core
Bill Chambers – Documentation in Core
Brennon York – Improvements in Project Infra, Core, GraphX, and tests; bug fixes in Core
Bryan Cutler – Bug fixes in Core
Burak Yavuz – Test in spark submit; improvements in Core and Streaming; new features in MLlib and PySpark; bug fixes in Core, tests, and spark submit; improvement in SQL, MLlib, and PySpark
Calvin Jia – Improvements and documentation in Core
Chen Song – Bug fixes and improvement in SQL
Cheng Chang – New features in EC2
Cheng Hao – Improvements, new features, bug fixes, and improvement in SQL
Cheng Lian – Bug fixes in SQL
Cheng Lian – Improvements in Core and SQL; documentation in Core and SQL; bug fixes in Core and SQL; improvement in SQL
Cheolsoo Park – Wish in YARN; improvements in Core and spark submit; bug fixes in Core
Chris Freeman – New features in SparkR
Chet Mancini – Improvements in Core and SQL
Chris Heller – New features in Mesos
Christophe Preaud – Documentation in Core and YARN
Cody Koeninger – Bug fixes in Streaming; improvement in Core
DB Tsai – Improvements, new features, and bug fixes in MLlib
DEBORAH SIEGEL – Documentation in Core
Dan McClary – New features in GraphX
Dan Putler – New features in SparkR
Daoyuan Wang – Improvements in tests and SQL; new features in SQL; bug fixes in SQL; improvement in MLlib and SQL
David McGuire – Bug fixes in Streaming
Davies Liu – Improvements in SQL and PySpark; new features in Core and SparkR; bug fixes in Streaming, tests, PySpark, SparkR, and SQL; improvement in Core and SQL
Davies Liu – New features in SparkR
Dean Chen – Improvements in Core; new features in YARN; bug fixes in Core and YARN
Debasish Das – New features in MLlib
Deborah Siegel – Improvements in Core
Doing Done – Improvements in SQL; bug fixes in Core and SQL
Dong Xu – Bug fixes in SQL
Doug Balog – Bug fixes in spark submit, YARN, and SQL
Edward T – New features in SparkR
Elisey Zanko – Bug fixes in MLlib and PySpark
Emre Sevinc – Improvements in Streaming
Eric Chiang – Documentation in Core
Erik Van Oosten – Bug fixes in Core
Evan Jones – Bug fixes in Core
Evan Yu – Bug fixes in Core
Evert Lammerts – New features in SparkR
Favio Vazquez – Build fixes in Core; documentation in Core and MLlib
Felix Cheung – SparkR Documentation
Florian Verhein – Improvements and new features in EC2
Gaurav Nanda – Documentation in Core
Glenn Weidner – Documentation in MLlib and PySpark
Guancheng (G.C.) Chen – Improvements in Core
Guancheng Chen – Improvements in Core
Guo Wei – Bug fixes and window function feature in SQL
GuoQiang Li – New features in Core; bug fixes in Core and YARN
Haiyang Sea – Improvements in SQL
Hangchen Yu – Documentation in GraphX
Hao Lin – Improvements and new features in SparkR
Hari Shreedharan – Test in Streaming and tests; new features in YARN; bug fixes in Web UI
Harihar Nahak – New features in SparkR
Holden Karau – Improvements in Core, MLlib, and PySpark; bug fixes in PySpark
Hossein Falaki – SparkR Documentation
Hong Shen – Bug fixes in Core and YARN
Hrishikesh Subramonian – Improvements in MLlib and PySpark
Hung Lin – Bug fixes in scheduler
Ilya Ganelin – Improvements in Core; new features in Core; bug fixes in Core and Shuffle; improvement in Core
Imran Rashid – Improvements in Web UI; bug fixes in Core and Web UI
Isaias Barroso – Bug fixes in Core
Iulian Dragos – Bug fixes in Core and SQL; improvement in Core, Shuffle, and Mesos
Jacek Lewandowski – Bug fixes in Core
Jacky Li – Improvements in SQL
Jaonary Rabarisoa – Improvements in MLlib
Jayson Sunshine – Documentation in Core
Jean Lyn – Bug fixes in SQL
Jeff Harrison – Improvements in SparkR
Jeremy A. Lucas – Improvements in Streaming
Jeremy Freeman – Bug fixes in Streaming and MLlib
Jim Carroll – Bug fixes in MLlib
Jin Adachi – Bug fixes in SQL
Jongyoul Lee – Improvements in Core and Mesos; bug fixes in Core
Joseph K. Bradley – Improvements in MLlib; documentation in PySpark, Core, SQL, MLlib, and Streaming; new features in MLlib; bug fixes in Java API, Core, MLlib, and PySpark; improvement in MLlib and PySpark
Josh Rosen – Improvements in Core and SQL; new features in Core, Shuffle, and SQL; bug fixes in Core, tests, Shuffle, Streaming, scheduler, SQL, and Java API; improvement in Core and Shuffle
Judy Nash – Bug fixes in Windows and spark submit
Judy Nash – Improvements in Core
Juliet Hougland – Improvements in MLlib
June He – Bug fixes in Core and tests
Kai Sasaki – Documentation in Core and MLlib; improvements in MLlib and PySpark; bug fixes in MLlib and PySpark; improvement in MLlib and PySpark
Kalle Jepsen – Improvements in PySpark and SQL; bug fixes in PySpark; improvement in PySpark
Kamil Smuga – Bug fixes in Core and PySpark
Kay Ousterhout – Improvements in Core, Web UI, and Shuffle; bug fixes in Project Infra, Core, Web UI, and tests
Kevin (Sangwoo) Kim – Bug fixes in Core
Kirill A. Korinskiy – New features in MLlib
Kousuke Saruta – Improvements in Streaming, Web UI, and tests; bug fixes in Web UI, scheduler, tests, and YARN; improvement in Web UI
LCY Vincent – Documentation in Core
Leah McGuire – Improvements and new features in MLlib
Lev Khomich – Improvements in Core
Liang-Chi Hsieh – Improvements in MLlib and SQL; improvement in MLlib; new features in SQL; bug fixes in Core, Shuffle, PySpark, MLlib, SQL, and spark submit; documentation in Core and MLlib
Liangliang Gu – Improvements in Core and Web UI; bug fixes in Web UI
Lianhui Wang – Improvements in GraphX; bug fixes in PySpark
Liu Chang – Improvements in EC2
Lomig Megard – Documentation in Core
Madhukara Phatak – Documentation in SQL
Manoj Kumar – Improvements in MLlib; new features in SQL, MLlib, and PySpark; bug fixes in Streaming, MLlib, and SQL; improvement in MLlib and PySpark
Marcelo Vanzin – Improvements in Core; bug fixes in Core, tests, Shuffle, YARN, Streaming, and spark submit; improvement in Core
Mark Bittmann – Bug fixes in MLlib
Marko Bonaci – Documentation in Core
Masaru Dobashi – Documentation in Core
Masayoshi TSUZUKI – Bug fixes in Windows and Core
Matei Zaharia – Improvement in Web UI
Matt Aasted – Bug fixes in EC2
Matt Massie – New features in SparkR
Matt Wise – Documentation in Core
Matthew Cheah – Improvements and new features in Core
Matthew Goodman – Bug fixes in EC2 and PySpark
Max Seiden – Bug fixes in SQL
Meethu Mathew – Bug fixes in MLlib and PySpark
Michael Armbrust – Documentation in Core; new features in SQL; improvements in SQL; bug fixes in SQL; improvement in Core and SQL
Michael Griffiths – Bug fixes in Windows and Core
Michael Malak – Bug fixes in GraphX
Michael Nazario – Bug fixes in tests and PySpark
Michelangelo D’Agostino – Bug fixes in EC2
Michelle Casbon – Improvements in Project Infra
Miguel Peralvo – Improvements in EC2
Mike Dusenberry – Improvements in Core and MLlib; documentation in Core; bug fixes in Core and MLlib
Milan Straka – Bug fixes in Core and PySpark
Misha Chernetsov – Improvements in Core and SQL
Mridul Muralidharan – Improvements in Core and Shuffle
Nan Zhu – Improvements in Core and tests; bug fixes in Core and SQL
Nathan Howell – Improvements and new features in SQL
Nathan Kronenfeld – Bug fixes in Core
Nathan McCarthy – Bug fixes in Core
Nicholas Chammas – Improvements in Core and EC2; bug fixes in EC2
Nishkam Ravi – Improvements in Core; documentation in Core; bug fixes in Core and YARN
Nobuyuki Kuromatsu – Bug fixes in MLlib
Octavian Geagla – Improvements in MLlib; documentation in Java API, Core, and MLlib
Oleg Sidorkin – Bug fixes in SQL
Oleksii Kostyliev – Bug fixes in Core
Olivier Girardot – Improvements in Java API and SQL; bug fixes in Core; improvement in PySpark and SQL
Omede Firouz – Improvements in MLlib; new features in MLlib and PySpark
Oscar Olmedo – New features in SparkR
Pankaj Arora – Bug fixes in Core
Patrick Wendell – Test in spark submit; improvements in Core and Shuffle; bug fixes in tests and SQL
Pei-Lun Lee – Improvements and bug fixes in SQL
Peter Parente – Improvements in Core
Peter Rudenko – Documentation in Core
Pierre Borckmans – Documentation in Core and EC2
Prabeesh K – Improvements in Streaming
Pradeep Chanumolu – Improvements in Core
Prashant Sharma – Improvements and bug fixes in Core
Punya Biswal – Improvements in SQL; bug fixes in Core
Punyashloka Biswal – Build fixes in Core
Qian Huang – New features and improvement in SparkR
Qiping Li – Bug fixes in Core
Rajendra Gokhale (rvgcentos) – Improvements in Core
Rakesh Chalasani – Improvement in SQL
Ram Sriharsha – Improvements in Core, MLlib, and PySpark; new features in MLlib; documentation in Core and MLlib
Rekha Joshi – Improvements in SparkR
Rene Treffer – Improvements in SQL
Rex Xiong – Improvements in Core
Reynold Xin – Improvements in Project Infra, Core, tests, PySpark, and SQL; documentation in Core; bug fixes in Core and MLlib; improvement in Project Infra, Core, GraphX, and SQL
Reza Zadeh – Improvements in MLlib
Ryan Hafen – New features in SparkR
Ryan Williams – Improvements in Core
Saisai Shao – Test in Streaming and tests; improvements in Core, PySpark, YARN, and Streaming; new features in Web UI; bug fixes in Web UI and YARN; improvement in Streaming
Saleem Ansari – Documentation in Core and MLlib
Sandy Ryza – Improvements in Core, Shuffle, and MLlib; documentation in Core and MLlib; bug fixes in Core and YARN; improvement in MLlib
Santiago M. Mola – Improvements in SQL; bug fixes in SQL; documentation in Core
Sasaki Toru – Improvements in Core and GraphX
Sean Owen – Documentation in Core; improvements in Core, tests, MLlib, Streaming, SQL, and Web UI; bug fixes in Project Infra, Core, tests, Windows, SQL, GraphX, and Web UI; improvement in Core
Sephiroth Lin – Improvements in SparkR, Core, scheduler, YARN, and PySpark; bug fixes in SQL
Shekhar Bansal – Improvements in YARN; bug fixes in Web UI
Sheng Li – Bug fixes in SQL
Shiti Saxena – Improvement in SQL
Shivaram Venkataraman – Improvements in SparkR and EC2; new features in Core and SparkR; bug fixes in SparkR; improvement in SparkR
Shixiong Zhu – Test in Streaming, tests, and Core; improvement in Streaming, Web UI, and Core; improvements in Streaming, Web UI, and Core; bug fixes in Core, tests, MLlib, YARN, Streaming, scheduler, and Web UI; documentation in Core and Streaming
Shuai Zheng – Bug fixes in SQL
Shuo Xiang – New features in Core; bug fixes in MLlib
Stephen Boesch – Bug fixes in MLlib
Stephen Haberman – Bug fixes in Core
Steve Loughran – Improvements in Core, Web UI, and SQL; bug fixes in Core and YARN
Steven She – Bug fixes in Core
Su Yan – Bug fixes in Core
Sun Rui – Improvements in SparkR; new features in SparkR and SQL; bug fixes in SparkR; improvement in SparkR
Taka Shinagawa – Documentation in Core
Takeshi YAMAMURO – Improvements in GraphX and SQL
Tathagata Das – Test in Streaming and tests; improvements in Streaming and Core; new features in Streaming and SQL; bug fixes in Project Infra, Streaming, and Core
Ted Yu – Improvements in Core; bug fixes in Core and PySpark
Theodore Vasiloudis – Improvements in Core; bug fixes in Core and EC2
Thomas Graves – Bug fixes in Core
Tijo Thomas – Improvements in Core; bug fixes in Core and SQL
Tim Ellison – Bug fixes in Core
Timothy Chen – Improvements in spark submit and Mesos; bug fixes in spark submit and Mesos
Tingjun Xu – Improvements in Streaming
Todd Gao – SparkR
Venkata Ramana Gollamudi – Improvements and bug fixes in SQL
Vidmantas Zemleris – Improvements in SQL
Vincenzo Selvaggio – Documentation and new features in MLlib
Vinod K C – Improvements in Shuffle and scheduler; bug fixes in Core and SQL
Vinod KC – Bug fixes in Core and SQL
Volodymyr Lyubinets – Improvements and bug fixes in SQL
Vyacheslav Baranov – Bug fixes in SQL
Wang Fei – Improvements, new features, and bug fixes in SQL
Wang Tao – Improvements in Core, YARN, and SQL; new features in spark submit; bug fixes in Core, spark submit, and SQL
Wenchen Fan – Improvements in Core; documentation in Core; bug fixes in SQL; improvement in SQL
Wesley Miao – Bug fixes in Streaming
Xiangrui Meng – New features in SQL, MLlib, and PySpark; umbrella in MLlib; documentation in PySpark, Core, SQL, MLlib, and Streaming; improvement in Core, SQL, MLlib, and PySpark; build fixes in GraphX and MLlib; improvements in Core, SQL, MLlib, and PySpark; bug fixes in Java API, Web UI, SQL, MLlib, and PySpark
Xu Kun – New features in Core
Xusen Yin – Documentation in Core and MLlib; improvement in MLlib
Yadong Qi – Improvements and bug fixes in SQL
Yanbo Liang – Improvements in Core, MLlib, and PySpark; new features in MLlib and PySpark; bug fixes in MLlib and SQL; improvement in MLlib and PySpark
Yash Datta – Improvements and bug fixes in SQL
Ye Xianjin – Bug fixes in Core
Yi Lu – New features in SparkR
Yi Tian – New features in Web UI and SQL; bug fixes in SQL
Yin Huai – Improvements in tests and SQL; new features in SQL; bug fixes in Core and SQL; improvement in Core and SQL
Yong Tang – Bug fixes in Core
Yu ISHIKAWA – Improvements in MLlib
Yuhao Yang – Improvements in Core and MLlib; new features in MLlib; documentation in Core and MLlib
Yuri Saito – Bug fixes in SQL
Zhan Zhang – Improvements in Core; new features in Core and SQL
Zhang, Liye – Documentation in Core; bug fixes in Core and Web UI
Zhichao Li – Bug fixes in Streaming, Web UI, and Core
Zhichao Zhang – Improvements in SQL; bug fixes in Streaming; documentation in Core
Zhongshuai Pei – Improvements and bug fixes in SQL
Zoltan Zvara – Bug fixes in Core and YARN
Zonghenga Yang – New features in SparkR

R API based on Spark’s DataFrames

An R Runtime for Big Data

Spark’s scale: thousands of machines and cores

Spark’s performance: runtime optimizer, code generation, memory management

Access to Spark’s I/O Packages

# DataFrame from JSON
> people <- read.df("people.json", "json")
# ... from MySQL
> people <- read.df("jdbc:mysql://sql01", "jdbc")
# ... from Hive
> people <- read.table("orders")

CSV ElasticSearch Avro Cassandra

Parquet MongoDB SequoiaDB HBase
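The idea behind a single read call that works across sources can be sketched as a registry that maps a format name to a loader. The sketch below is plain Python, not Spark's data source API: `READERS`, `register`, and this `read_df` are hypothetical names chosen for illustration, and the file contents are made-up sample data.

```python
import csv
import json
import os
import tempfile

# Hypothetical registry: format name -> function that loads rows as dicts.
# This is an illustration only, not how Spark implements data sources.
READERS = {}


def register(source):
    """Decorator that records a loader function under a format name."""
    def wrap(fn):
        READERS[source] = fn
        return fn
    return wrap


@register("json")
def _read_json_lines(path):
    # Expects one JSON object per line.
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]


@register("csv")
def _read_csv(path):
    # Values come back as strings; no type inference is attempted.
    with open(path, newline="") as f:
        return list(csv.DictReader(f))


def read_df(path, source):
    """Single entry point: the caller names the format, not the parser."""
    if source not in READERS:
        raise ValueError(f"no reader registered for source {source!r}")
    return READERS[source](path)


# Usage: write two small sample files, then load both through the same call.
tmp = tempfile.mkdtemp()
json_path = os.path.join(tmp, "people.json")
csv_path = os.path.join(tmp, "people.csv")
with open(json_path, "w") as f:
    f.write('{"name": "Ann", "age": 30}\n{"name": "Bob", "age": 25}\n')
with open(csv_path, "w", newline="") as f:
    f.write("name,age\nAnn,30\nBob,25\n")

people_json = read_df(json_path, "json")
people_csv = read_df(csv_path, "csv")
print(people_json)
print(people_csv)
```

New formats plug in by registering another loader, which is the property that lets third-party packages extend the set of sources without changing the calling code.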

ML Pipelines

Data Frame

tokenizer hashingTF lr

Pipeline Model

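from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer

# create pipeline
tok = Tokenizer(inputCol="text", outputCol="words")
tf = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.01)
pipeline = Pipeline(stages=[tok, tf, lr])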

Data Frame

# train pipeline (sqlCtx is an existing SQLContext)
df = sqlCtx.table("training")
model = pipeline.fit(df)

# make predictions
df = sqlCtx.read.json("/path/to/test")
model.transform(df).select("id", "text", "prediction")
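The fit/transform pattern behind the pipeline above can be illustrated without a cluster. The following is a minimal plain-Python sketch, not Spark's implementation: the classes are local stand-ins that only borrow the stage names from the slide, `MajorityLabel` is a toy estimator substituted for logistic regression, and the training rows are made-up sample data.

```python
import zlib

# NOTE: plain-Python stand-ins for illustration, not the pyspark.ml classes.


class Tokenizer:
    """Transformer: splits the 'text' field into lowercase words."""

    def transform(self, rows):
        return [{**r, "words": r["text"].lower().split()} for r in rows]


class HashingTF:
    """Transformer: maps words to a fixed-size term-frequency vector."""

    def __init__(self, num_features=16):
        self.num_features = num_features

    def transform(self, rows):
        out = []
        for r in rows:
            vec = [0] * self.num_features
            for w in r["words"]:
                # crc32 is a stable hash, so bucket indices are reproducible.
                vec[zlib.crc32(w.encode("utf-8")) % self.num_features] += 1
            out.append({**r, "features": vec})
        return out


class MajorityLabel:
    """Estimator (toy stand-in for a real learner): learns the most common label."""

    def fit(self, rows):
        labels = [r["label"] for r in rows]
        return MajorityLabelModel(max(set(labels), key=labels.count))


class MajorityLabelModel:
    """Model produced by MajorityLabel: predicts the learned label for every row."""

    def __init__(self, label):
        self.label = label

    def transform(self, rows):
        return [{**r, "prediction": self.label} for r in rows]


class Pipeline:
    """Runs stages in order; estimators are fitted and replaced by their models."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                stage = stage.fit(rows)
            rows = stage.transform(rows)
            fitted.append(stage)
        return PipelineModel(fitted)


class PipelineModel:
    """Fitted pipeline: applies every fitted stage in sequence."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


training = [
    {"id": 0, "text": "spark is fast", "label": 1.0},
    {"id": 1, "text": "hadoop mapreduce", "label": 0.0},
    {"id": 2, "text": "spark streaming", "label": 1.0},
]
model = Pipeline(stages=[Tokenizer(), HashingTF(), MajorityLabel()]).fit(training)
predictions = model.transform([{"id": 3, "text": "Spark SQL"}])
print([(p["id"], p["text"], p["prediction"]) for p in predictions])
```

The point of the design is that fitting returns a single model object that replays the same feature steps at prediction time, so training and scoring cannot drift apart.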

ML Pipelines

Stable API with hooks for third party pipeline components

Feature transformers: VectorAssembler, String/VectorIndexer, OneHotEncoder, PolynomialExpansion, …

New algorithms: GLM with elastic-net, tree classifiers, tree regressors, OneVsRest, …

Performance and Project Tungsten

Managed memory for aggregations

Memory-efficient shuffles

[Chart: customer pipeline latency, Spark 1.3 vs. 1.4]

What’s Coming in Spark 1.5+?

Project Tungsten: Code generation, improved sort + aggregation

Spark Streaming: Flow control, optimized state management

ML: Single machine solvers, scalability to many features

SparkR: Integration with Spark’s machine learning APIs

Join Us Today at Office Hours!

Area 1:00-1:45 Spark Core, YARN

Spark Streaming

1:45-2:30 Spark SQL

3:00-3:40 Spark Ops

3:40-4:15 Spark SQL

4:30-5:15 Spark Core, PySpark

Spark MLlib

5:15-6:00 Spark MLlib

Databricks booth (A1)

More tomorrow…

Thanks!
