A look ahead at Spark’s development

Reynold Xin @rxinSpark Summit EU, AmsterdamOct 29th, 2015

SQL Streaming MLlib

Spark Core (RDD)

GraphX

Spark stack diagram

Frontend(user facing APIs)

Backend(execution)

Spark stack diagram(a different take)

Frontend(RDD, DataFrame, ML pipelines, …)

Backend(scheduler, shuffle, operators, …)

Spark stack diagram(a different take)

Last 12 months of Spark evolution

Frontend

DataFramesData sourcesRMachine learning pipelines…

Backend

Project TungstenSort-based shuffleNetty-based network…

Last 12 months of Spark evolution

Frontend

DataFramesData sourcesRMachine learning pipelines…

Backend

Project TungstenSort-based shuffleNetty-based network…

DataFrame:A Frontend Perspective

Spark DataFrame

> head(filter(df, df$waiting < 50)) # an example in R## eruptions waiting##1 1.750 47##2 1.750 47##3 1.867 48

Scalable data frame for Java, Python, R, Scala

Similar APIs as single-node tools (Pandas, dplyr), i.e. easy to learn

Spark RDD Execution

Java/Scalafrontend

JVMbackend

Pythonfrontend

Pythonbackend

opaque closures(user-defined functions)

Spark DataFrame Execution

DataFramefrontend

Logical Plan

Physical execution

Catalystoptimizer

Intermediate representation for computation

Spark DataFrame Execution

PythonDF

Logical Plan

Physicalexecution

Catalystoptimizer

Java/ScalaDF

Intermediate representation for computation

Simple wrappers to create logical plan

Benefit of Logical Plan: Simpler Frontend

Python : ~2000 line of code (built over a weekend)

R : ~1000 line of code

i.e. much easier to add new language bindings (Julia, Clojure, …)

Performance

0 2 4 6 8 10

Java/Scala

Python

Runtime for an example aggregation workload

Benefit of Logical Plan:Performance Parity Across Languages

0 2 4 6 8 10

Java/Scala

Python

Java/Scala

Python

Runtime for an example aggregation workload (secs)

DataFrame

Tungsten:A Backend Perspective

Hardware Trends

Storage

Network

Hardware Trends

Storage 50+MB/s(HDD)

Network 1Gbps

CPU ~3GHz

Hardware Trends

2010 2015

500+MB/s(SSD)

Network 1Gbps 10Gbps

CPU ~3GHz ~3GHz

Hardware Trends

2010 2015

500+MB/s(SSD) 10X

Network 1Gbps 10Gbps 10X

CPU ~3GHz ~3GHz L

Project Tungsten

Substantially speed up execution by optimizing CPU efficiency, via:

(1) Runtime code generation(2) Exploiting cache locality(3) Off-heap memory management

From DataFrame to Tungsten

PythonDF

Logical Plan

Java/ScalaDF

TungstenExecution

Initial phase in Spark 1.5

More work coming in 2016

3 Things to Look Forward To

Dataset API in Spark 1.6

Typed interface over DataFrames / Tungsten

case class Person(name: String, age: Int)

val dataframe = read.json(“people.json”)val ds: Dataset[Person] = dataframe.as[Person]

ds.filter(p => p.name.startsWith(“M”)).groupBy(“name”).avg(“age”)

Dataset

“Encoder” to specify type informationso Spark can translate it into DataFrameand generate optimized memory layouts

Checkout SPARK-9999

Dataset[T]

DataFrame

encoder

Streaming DataFrames

Easier-to-use APIs (batch, streaming, and interactive)

And optimizations:- Tungsten backends- native support for out-of-order data- data sources and sinks

val stream = read.kafka("...")stream.window(5 mins, 10 secs)

.agg(sum("sales"))

.write.jdbc("mysql://...")

3D XPoint

- DRAM latency- SSD capacity- Byte addressible

Python Java/Scala RSQL …

DataFrameLogical Plan

LLVMJVM SIMD 3D XPoint

Unified API, One Engine, Automatically Optimized

Tungstenbackend

languagefrontend

Tungsten Execution

PythonSQL R Streaming

DataFrame (& Dataset)

AdvancedAnalytics

Office Hours Today @ Databricks booth

Topic Area

10:30 – 11:30 Spark general (Reynold)

13:00 – 14:00 R and data science (Hossein)

13:30 – 14:30 machine learning (Joseph)

14:00 – 15:00 Spark, YARN, etc (Andrew)

A look ahead at Spark’s development - GitHub...

Documents

Discretized Streams: A Fault-Tolerant Model for Scalable ... · jobs, Spark Streaming interoperates seamlessly with Spark’s batch and interactive processing features. This is a

Ahead Connectivity Digital connectivity that works for you. · Ahead Services Ahead Connectivity Ahead Connectivity Benefits A benchmark of data security with strict policies for

Fantasy Spark’s Toys€¦ · READ Sorting Fruit Fantasy by Nora Carson illustrated by Kristin Varner b C Spark’s Toys Program: CR14 Component: LR PDF Vendor: SRM Grade: 1 FC_BC_CR14_LR_G1_U5W1_L20_BEY_119639.indd

Apache Spark and Scala - GitHub Pagesrxin.github.io/talks/2017-10-22_Scala.pdfApache Spark Started in UC Berkeley ~ 2010 Most popular and de facto standard framework in big data One

Concepts and Technologies for Distributed Systems and Big ...stg-tud.github.io/ctbd/2016/CTBD_16_spark_streaming.pdf · •Integrates with Spark’s batch and interactive processing

Move Ahead Go Back 1 · 2019. 9. 19. · Ahead 2 Move Ahead 1 Move Ahead 1 Go Back 2 Move Ahead 2 Move Ahead 1 . Title: ReconciliationGameBoard Created Date: 5/18/2008 10:29:48 AM

ANNUAL GENERAL MEETINGcontent.stockpr.com/isdsf/media/0258a83b0f7d369735f98f532fed6… · Aug 2012 – Installation of Smart Antennas into Spark’s Street Mall, Ottawa, Ontario •

APACHE SPARK - pages.databricks.compages.databricks.com/rs/094-YMS-629/images/2016_Spark_Survey.pdf · APACHE SPARK’S GROWTH CONTINUES 13 The Apache Spark Community is Growing 14

firm AHEAD

Dataflow Engines: MapReduce Sparkrossbach/cs378h/lectures/23-MapReduce-Spark.pdfList aspects of Spark’s design that help/hinder multi-core parallelism relative to MapReduce. If the

Ahead of the game in ’86 – still ahead now

The Future of Real-Time in Spark - GitHub Pagesrxin.github.io/talks/2016-02-18_spark_summit_streaming.pdf · 2017. 12. 21. · High-level streaming API built on Spark SQL engine •

Young Professionals Australia: Get ahead and stay ahead

RELIANCE - vercopak.com.tw Reliance Brochure.pdf · For fully automated online sample preparation and injection, the Reliance ™ is ready to be integrated in Spark’s newly developed

P&O: Staying ahead means thinking ahead

AHEAD COSMOS and COMPASS Studies. The AHEAD Study

METAMORPHOSIS FOR BETTER LIFE IN NICHOLAS SPARK’S A …eprints.ums.ac.id/55161/13/Naskah Publikasi OKUPLOAD BENER.pdf · METAMORPHOSIS FOR BETTER LIFE IN NICHOLAS SPARK’S A WALK

Distributed Computing with Spark - GitHub Pageslintool.github.io/SparkTutorial/slides/day1_intro.pdf · of common algorithms Spark’s generality + support" for multiple languages

Apache Spark in 2015 and Beyond - GitHub Pagesrxin.github.io/talks/2015-04-16_Spark_2015_ApacheCon.pdf · 4/16/2015 · Apache Spark in 2015 and Beyond Reynold Xin (@rxin) ApacheCon,

State of Spark, and where it is going - GitHub Pagesrxin.github.io/talks/2015-12-03_strata_singapore_keynote.pdf · No! Development is faster than ever. Expect Spark 2.0 in 2016