A look ahead at Spark 2.0
Reynold Xin @rxin, 2016-03-30, Strata Conference

Page 1: A look ahead at spark 2.0

A look ahead at Spark 2.0

Reynold Xin @rxin, 2016-03-30, Strata Conference

Page 2: A look ahead at spark 2.0

About Databricks

Founded by creators of Spark in 2013

Cloud enterprise data platform
• Managed Spark clusters
• Interactive data science
• Production pipelines
• Data governance, security, …

Page 3: A look ahead at spark 2.0

Today’s Talk

Looking back at the last 12 months

Looking forward to Spark 2.0
• Project Tungsten, Phase 2
• Structured Streaming
• Unifying DataFrame & Dataset

Best resource for learning Spark

Page 4: A look ahead at spark 2.0

A slide from 2013 …

Page 5: A look ahead at spark 2.0

Programmability

WordCount in 50+ lines of Java MapReduce

WordCount in 3 lines of Spark

Page 6: A look ahead at spark 2.0

What is Spark?

Unified engine across data workloads and platforms

SQL, Streaming, ML, Graph, Batch, …

Page 7: A look ahead at spark 2.0

2015: A Great Year for Spark

Most active open source project in (big) data

• 1000+ code contributors

New language: R

Widespread industry support & adoption

Page 8: A look ahead at spark 2.0

“Spark is the Taylor Swift of big data software.”

- Derrick Harris, Fortune

Page 9: A look ahead at spark 2.0
Page 10: A look ahead at spark 2.0
Page 11: A look ahead at spark 2.0

Top Applications

• Business Intelligence: 68%
• Data Warehousing: 52%
• Recommendation: 44%
• Log Processing: 40%
• User-Facing Services: 36%
• Fraud Detection / Security: 29%

Page 12: A look ahead at spark 2.0

Diverse Runtime Environments

How respondents are running Spark: 51% on a public cloud

Most common Spark deployment environments (cluster managers):
• Standalone mode: 48%
• YARN: 40%
• Mesos: 11%

Page 13: A look ahead at spark 2.0

Spark 2.0

Next major release, coming in May

Builds on everything we learned in the past 2 years

Page 14: A look ahead at spark 2.0

Versioning in Spark

1.6.0
• Major version (may change APIs): 1
• Minor version (adds APIs / features): 6
• Patch version (only bug fixes): 0

In reality, we hate breaking APIs! We will not do so except for dependency conflicts (e.g. Guava) and experimental APIs.

Page 15: A look ahead at spark 2.0

Major Features in 2.0

• Tungsten Phase 2: speedups of 5-10x
• Structured Streaming: real-time engine on SQL/DataFrames
• Unifying Datasets and DataFrames

Page 16: A look ahead at spark 2.0

Datasets & DataFrames

API foundation for the future

Page 17: A look ahead at spark 2.0

Datasets and DataFrames

In 2015, we added DataFrames & Datasets as structured data APIs
• DataFrames are collections of rows with a schema
• Datasets add static types, e.g. Dataset[Person]
• Both run on Tungsten

Spark 2.0 will merge these APIs: DataFrame = Dataset[Row]

Page 18: A look ahead at spark 2.0

Example:

case class User(name: String, id: Int)
case class Message(user: User, text: String)

import sqlContext.implicits._   // encoders for .as[...] and typed lambdas

val dataframe = sqlContext.read.json("log.json")   // DataFrame, i.e. Dataset[Row]
val messages = dataframe.as[Message]               // Dataset[Message]

val users = messages
  .filter(m => m.text.contains("Spark"))
  .map(m => m.user)                                // Dataset[User]

pipeline.train(users)   // MLlib takes either DataFrames or Datasets

Page 19: A look ahead at spark 2.0

Benefits

Simpler to understand
• Only kept Dataset separate to keep binary compatibility in 1.x

Libraries can take data of both forms

With Streaming, the same API will also work on streams
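As a minimal illustration of "data of both forms" (a sketch, not a Spark API: dedupe is a hypothetical library method), a method written once against Dataset[T] accepts both typed Datasets and DataFrames, since DataFrame = Dataset[Row]:

import org.apache.spark.sql.Dataset

// Hypothetical library method: one signature serves both forms.
def dedupe[T](data: Dataset[T]): Dataset[T] = data.distinct()

dedupe(dataframe)   // accepts a DataFrame, i.e. Dataset[Row]
dedupe(messages)    // accepts a Dataset[Message]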

Page 20: A look ahead at spark 2.0

Long-Term

RDD will remain the low-level API in Spark

Datasets & DataFrames give richer semantics and optimizations
• New libraries will increasingly use these as interchange format
• Examples: Structured Streaming, MLlib, GraphFrames
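Dropping down to the low-level API stays a one-liner; assuming the users Dataset from the earlier example, .rdd exposes the underlying RDD:

// Dataset[User] -> RDD[User], for code that still needs the RDD API
val userRdd = users.rdd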

Page 21: A look ahead at spark 2.0

Structured Streaming

How do we simplify streaming?

Page 22: A look ahead at spark 2.0

Integration Example

A streaming engine reads a stream of pageview events:

(home.html, 10:08)
(product.html, 10:09)
(home.html, 10:10)
. . .

…and maintains per-minute visit counts in MySQL:

Page     Minute   Visits
home     10:09    21
pricing  10:10    30
...      ...      ...

What can go wrong?
• Late events
• Partial outputs to MySQL
• State recovery on failure
• Distributed reads/writes
• ...

Page 23: A look ahead at spark 2.0

Complex Programming Models

Data: late arrival, varying distribution over time, …

Processing: business logic change & new ops (windows, sessions)

Output: how do we define output over time & correctness?

Page 24: A look ahead at spark 2.0

The simplest way to perform streaming analytics is not having to reason about streaming.

Page 25: A look ahead at spark 2.0

Spark 1.3: Static DataFrames
Spark 2.0: Infinite DataFrames

Single API!

Page 26: A look ahead at spark 2.0

Structured Streaming

High-level streaming API built on Spark SQL engine
• Runs the same queries on DataFrames
• Event time, windowing, sessions, sources & sinks

Unifies streaming, interactive and batch queries
• Aggregate data in a stream, then serve using JDBC
• Change queries at runtime
• Build and apply ML models
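A minimal sketch of what this unification looks like, assuming the readStream/writeStream entry points that shipped with Spark 2.0 (the path and schema are illustrative):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder.appName("pageviews").getOrCreate()

// Illustrative schema for the pageview log
val logSchema = StructType(Seq(
  StructField("page", StringType),
  StructField("ts", TimestampType)))

// An infinite DataFrame: new files arriving in the directory become new rows
val pageviews = spark.readStream.schema(logSchema).json("/logs/")

// The exact same query you would write over a static DataFrame
val counts = pageviews.groupBy("page").count()

// Continuously maintain the aggregate in a sink
val query = counts.writeStream
  .outputMode("complete")
  .format("console")
  .start()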

See Michael/TD’s talks tomorrow for a deep dive!

Page 27: A look ahead at spark 2.0

Tungsten Phase 2

Can we speed up Spark by 10X?

Page 28: A look ahead at spark 2.0

Demo

Run a join on a large table with 1 billion records and a small table with 1000 records

In Spark 1.6, this took 60+ seconds.

In Spark 2.0, it took ~3 seconds.
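The demo code itself is not shown in the deck; a minimal sketch of an equivalent query (sizes as stated in the demo, tables built with range for illustration):

// A billion-row table joined against a thousand-row table on "id"
val big   = sqlContext.range(1000L * 1000 * 1000)   // 1,000,000,000 records
val small = sqlContext.range(1000)                  // 1,000 records

big.join(small, "id").count()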

Page 29: A look ahead at spark 2.0

Query plan: Scan → Filter → Project → Aggregate

select count(*) from store_sales
where ss_item_sk = 1000

Page 30: A look ahead at spark 2.0

Volcano Iterator Model

Standard for 30 years: almost all databases do it

Each operator is an “iterator” that consumes records from its input operator

class Filter {
  def next(): Boolean = {
    var found = false
    while (!found && child.next()) {
      found = predicate(child.fetch())
    }
    return found
  }

  def fetch(): InternalRow = {
    child.fetch()
  }
  …
}

Page 31: A look ahead at spark 2.0

What if we hire a college freshman to implement this query in Java in 10 mins?

select count(*) from store_sales
where ss_item_sk = 1000

var count = 0
for (ss_item_sk in store_sales) {
  if (ss_item_sk == 1000) {
    count += 1
  }
}

Page 32: A look ahead at spark 2.0

Volcano model: 30+ years of database research

vs

College freshman: hand-written code in 10 mins

Page 33: A look ahead at spark 2.0

Volcano: 13.95 million rows/sec

College freshman: 125 million rows/sec (high throughput)

Note: end-to-end, single thread, single column, data originated in Parquet on disk

Page 34: A look ahead at spark 2.0

How does a student beat 30 years of research?

Volcano

1. Many virtual function calls

2. Data in memory (or cache)

3. No loop unrolling, SIMD, pipelining

hand-written code

1. No virtual function calls

2. Data in CPU registers

3. Compiler loop unrolling, SIMD, pipelining

Take advantage of all the information that is known after query compilation

Page 35: A look ahead at spark 2.0

Query plan: Scan → Filter → Project → Aggregate, compiled into a single loop:

long count = 0;
for (ss_item_sk in store_sales) {
  if (ss_item_sk == 1000) {
    count += 1;
  }
}

Tungsten Phase 2: Spark as a “Compiler”

Functionality of a general-purpose execution engine; performance as if the system were hand-built just to run your query
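To see whole-stage code generation at work, a small sketch (assuming a Spark 2.0 SparkSession named spark; debugCodegen is a developer API):

// '*' in the physical plan marks operators fused into one generated function
val q = spark.range(1000L * 1000 * 1000)
  .filter("id = 1000")
  .groupBy()
  .count()

q.explain()

// Print the generated Java source
import org.apache.spark.sql.execution.debug._
q.debugCodegen()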

Page 36: A look ahead at spark 2.0

Databricks Community Edition

Best place to try & learn Spark.

Page 37: A look ahead at spark 2.0
Page 38: A look ahead at spark 2.0

Today’s Talk

Spark has been growing explosively

Spark 2.0 doubles down on what made Spark attractive:
• elegant APIs
• cutting-edge performance

Learn Spark on Databricks Community Edition
• join the beta waitlist at https://databricks.com/

Page 39: A look ahead at spark 2.0

Thank you.
@rxin