A look ahead at Spark 2.0
Reynold Xin @rxin, 2016-03-30, Strata Conference

Page 1: A look ahead at spark 2.0

A look ahead at Spark 2.0

Reynold Xin @rxin, 2016-03-30, Strata Conference

Page 2: A look ahead at spark 2.0

About Databricks

Founded by creators of Spark in 2013

Cloud enterprise data platform
• Managed Spark clusters
• Interactive data science
• Production pipelines
• Data governance, security, …

Page 3: A look ahead at spark 2.0

Today’s Talk

Looking back at the last 12 months

Looking forward to Spark 2.0
• Project Tungsten, Phase 2
• Structured Streaming
• Unifying DataFrame & Dataset

Best resource for learning Spark

Page 4: A look ahead at spark 2.0

A slide from 2013 …

Page 5: A look ahead at spark 2.0

Programmability

WordCount in 50+ lines of Java MapReduce

WordCount in 3 lines of Spark

Page 6: A look ahead at spark 2.0

What is Spark?

Unified engine across data workloads and platforms

SQL, Streaming, ML, Graph, Batch, …

Page 7: A look ahead at spark 2.0

2015: A Great Year for Spark

Most active open source project in (big) data

• 1000+ code contributors

New language: R

Widespread industry support & adoption

Page 8: A look ahead at spark 2.0

“Spark is the Taylor Swift of big data software.”

- Derrick Harris, Fortune

Page 9: A look ahead at spark 2.0
Page 10: A look ahead at spark 2.0
Page 11: A look ahead at spark 2.0

Top Applications

• Business Intelligence: 68%
• Data Warehousing: 52%
• Recommendation: 44%
• Log Processing: 40%
• User-Facing Services: 36%
• Fraud Detection / Security: 29%

Page 12: A look ahead at spark 2.0

Diverse Runtime Environments

How respondents are running Spark: 51% on a public cloud

Most common Spark deployment environments (cluster managers):
• Standalone mode: 48%
• YARN: 40%
• Mesos: 11%

Page 13: A look ahead at spark 2.0

Spark 2.0

Next major release, coming in May

Builds on everything we learned in the past 2 years

Page 14: A look ahead at spark 2.0

Versioning in Spark

1.6.0
• Major version (may change APIs): 1
• Minor version (adds APIs / features): 6
• Patch version (only bug fixes): 0

In reality, we hate breaking APIs! We will not do so except for dependency conflicts (e.g. Guava) and experimental APIs.

Page 15: A look ahead at spark 2.0

Major Features in 2.0

• Tungsten Phase 2: speedups of 5-10x
• Structured Streaming: real-time engine on SQL/DataFrames
• Unifying Datasets and DataFrames

Page 16: A look ahead at spark 2.0

Datasets & DataFrames

API foundation for the future

Page 17: A look ahead at spark 2.0

Datasets and DataFrames

In 2015, we added DataFrames & Datasets as structured data APIs
• DataFrames are collections of rows with a schema
• Datasets add static types, e.g. Dataset[Person]
• Both run on Tungsten

Spark 2.0 will merge these APIs: DataFrame = Dataset[Row]

Page 18: A look ahead at spark 2.0

Example:

case class User(name: String, id: Int)
case class Message(user: User, text: String)

import sqlContext.implicits._   // encoders for .as[...] and typed lambdas

val dataframe = sqlContext.read.json("log.json")   // DataFrame, i.e. Dataset[Row]
val messages = dataframe.as[Message]               // Dataset[Message]

val users = messages
  .filter(m => m.text.contains("Spark"))
  .map(m => m.user)                                // Dataset[User]

pipeline.train(users)   // MLlib takes either DataFrames or Datasets

Page 19: A look ahead at spark 2.0

Benefits

Simpler to understand
• Only kept Dataset separate to keep binary compatibility in 1.x

Libraries can take data of both forms

With Streaming, the same API will also work on streams
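As a minimal illustration of "data of both forms" (a sketch, not a Spark API: dedupe is a hypothetical library method), a method written once against Dataset[T] accepts both typed Datasets and DataFrames, since DataFrame = Dataset[Row]:

import org.apache.spark.sql.Dataset

// Hypothetical library method: one signature serves both forms.
def dedupe[T](data: Dataset[T]): Dataset[T] = data.distinct()

dedupe(dataframe)   // accepts a DataFrame, i.e. Dataset[Row]
dedupe(messages)    // accepts a Dataset[Message]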

Page 20: A look ahead at spark 2.0

Long-Term

RDD will remain the low-level API in Spark

Datasets & DataFrames give richer semantics and optimizations
• New libraries will increasingly use these as interchange format
• Examples: Structured Streaming, MLlib, GraphFrames
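Dropping down to the low-level API stays a one-liner; assuming the users Dataset from the earlier example, .rdd exposes the underlying RDD:

// Dataset[User] -> RDD[User], for code that still needs the RDD API
val userRdd = users.rdd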

Page 21: A look ahead at spark 2.0

Structured Streaming

How do we simplify streaming?

Page 22: A look ahead at spark 2.0

Integration Example

A streaming engine reads a stream of pageview events:

(home.html, 10:08)
(product.html, 10:09)
(home.html, 10:10)
. . .

…and maintains per-minute visit counts in MySQL:

Page     Minute   Visits
home     10:09    21
pricing  10:10    30
...      ...      ...

What can go wrong?
• Late events
• Partial outputs to MySQL
• State recovery on failure
• Distributed reads/writes
• ...

Page 23: A look ahead at spark 2.0

Complex Programming Models

Data: late arrival, varying distribution over time, …

Processing: business logic change & new ops (windows, sessions)

Output: how do we define output over time & correctness?

Page 24: A look ahead at spark 2.0

The simplest way to perform streaming analytics is not having to reason about streaming.

Page 25: A look ahead at spark 2.0

Spark 1.3: Static DataFrames
Spark 2.0: Infinite DataFrames

Single API!

Page 26: A look ahead at spark 2.0

Structured Streaming

High-level streaming API built on Spark SQL engine
• Runs the same queries on DataFrames
• Event time, windowing, sessions, sources & sinks

Unifies streaming, interactive and batch queries
• Aggregate data in a stream, then serve using JDBC
• Change queries at runtime
• Build and apply ML models
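A minimal sketch of what this unification looks like, assuming the readStream/writeStream entry points that shipped with Spark 2.0 (the path and schema are illustrative):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder.appName("pageviews").getOrCreate()

// Illustrative schema for the pageview log
val logSchema = StructType(Seq(
  StructField("page", StringType),
  StructField("ts", TimestampType)))

// An infinite DataFrame: new files arriving in the directory become new rows
val pageviews = spark.readStream.schema(logSchema).json("/logs/")

// The exact same query you would write over a static DataFrame
val counts = pageviews.groupBy("page").count()

// Continuously maintain the aggregate in a sink
val query = counts.writeStream
  .outputMode("complete")
  .format("console")
  .start()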

See Michael/TD’s talks tomorrow for a deep dive!

Page 27: A look ahead at spark 2.0

Tungsten Phase 2

Can we speed up Spark by 10X?

Page 28: A look ahead at spark 2.0

Demo

Run a join on a large table with 1 billion records and a small table with 1000 records

In Spark 1.6, this took 60+ seconds.

In Spark 2.0, it took ~3 seconds.
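The demo code itself is not shown in the deck; a minimal sketch of an equivalent query (sizes as stated in the demo, tables built with range for illustration):

// A billion-row table joined against a thousand-row table on "id"
val big   = sqlContext.range(1000L * 1000 * 1000)   // 1,000,000,000 records
val small = sqlContext.range(1000)                  // 1,000 records

big.join(small, "id").count()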

Page 29: A look ahead at spark 2.0

Query plan: Scan → Filter → Project → Aggregate

select count(*) from store_sales
where ss_item_sk = 1000

Page 30: A look ahead at spark 2.0

Volcano Iterator Model

Standard for 30 years: almost all databases do it

Each operator is an “iterator” that consumes records from its input operator

class Filter {
  def next(): Boolean = {
    var found = false
    while (!found && child.next()) {
      found = predicate(child.fetch())
    }
    return found
  }

  def fetch(): InternalRow = {
    child.fetch()
  }
  …
}

Page 31: A look ahead at spark 2.0

What if we hire a college freshman to implement this query in Java in 10 mins?

select count(*) from store_sales
where ss_item_sk = 1000

var count = 0
for (ss_item_sk in store_sales) {
  if (ss_item_sk == 1000) {
    count += 1
  }
}

Page 32: A look ahead at spark 2.0

Volcano model: 30+ years of database research

vs

College freshman: hand-written code in 10 mins

Page 33: A look ahead at spark 2.0

Volcano: 13.95 million rows/sec

College freshman: 125 million rows/sec (high throughput)

Note: end-to-end, single thread, single column, data originated in Parquet on disk

Page 34: A look ahead at spark 2.0

How does a student beat 30 years of research?

Volcano

1. Many virtual function calls

2. Data in memory (or cache)

3. No loop unrolling, SIMD, pipelining

hand-written code

1. No virtual function calls

2. Data in CPU registers

3. Compiler loop unrolling, SIMD, pipelining

Take advantage of all the information that is known after query compilation

Page 35: A look ahead at spark 2.0

Query plan: Scan → Filter → Project → Aggregate, compiled into a single loop:

long count = 0;
for (ss_item_sk in store_sales) {
  if (ss_item_sk == 1000) {
    count += 1;
  }
}

Tungsten Phase 2: Spark as a “Compiler”

Functionality of a general-purpose execution engine; performance as if the system were hand-built just to run your query
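To see whole-stage code generation at work, a small sketch (assuming a Spark 2.0 SparkSession named spark; debugCodegen is a developer API):

// '*' in the physical plan marks operators fused into one generated function
val q = spark.range(1000L * 1000 * 1000)
  .filter("id = 1000")
  .groupBy()
  .count()

q.explain()

// Print the generated Java source
import org.apache.spark.sql.execution.debug._
q.debugCodegen()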

Page 36: A look ahead at spark 2.0

Databricks Community Edition

Best place to try & learn Spark.

Page 37: A look ahead at spark 2.0
Page 38: A look ahead at spark 2.0

Today’s Talk

Spark has been growing explosively

Spark 2.0 doubles down on what made Spark attractive:
• elegant APIs
• cutting-edge performance

Learn Spark on Databricks Community Edition
• join the beta waitlist at https://databricks.com/

Page 39: A look ahead at spark 2.0

Thank you.
@rxin