
SparkR: Enabling Interactive Data Science at Scale

Shivaram Venkataraman, Zongheng Yang

Talk Outline

Motivation
Overview of Spark & SparkR API
Live Demo: Digit Classification
Design & Implementation
Questions & Answers

Spark: Fast! Scalable, Flexible

R: Statistical! Packages, Interactive

RDD

Parallel Collection

Transformations: map, filter, groupBy, …
Actions: count, collect, saveAsTextFile, …
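The same split applies in SparkR: transformations build new RDDs lazily, while actions trigger computation. A minimal sketch, assuming a SparkContext sc created with sparkR.init():

# Transformations are lazy; actions run the job.
rdd <- parallelize(sc, 1:100)             # parallel collection from a local vector
squares <- lapply(rdd, function(x) x^2)   # transformation: returns a new RDD, no work yet
count(squares)                            # action: runs the job, returns 100
head(collect(squares))                    # action: fetches results into the local R session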

R + RDD = R2D2

R + RDD = RRDD

lapply, lapplyPartition
groupByKey, reduceByKey, sampleRDD, collect, cache
broadcast, includePackage
textFile, parallelize
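A hedged sketch putting a few of these together (function names are from this slide; exact signatures in the alpha release may differ):

# Caching, broadcast variables, and worker-side packages.
rdd <- textFile(sc, "hdfs://my_text_file")
cache(rdd)                                        # keep the RDD in memory across actions
mat <- broadcast(sc, matrix(rnorm(100), 10, 10))  # read-only value shipped to workers
includePackage(sc, Matrix)                        # load an R package on every worker
sums <- lapply(rdd, function(line) {
  sum(value(mat))                                 # closures read broadcasts via value()
})
collect(sums)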

Q: How can I use a loop to [...insert task here...] ?

A: Don’t. Use one of the apply functions.

From: http://nsaunders.wordpress.com/2010/08/20/a-brief-introduction-to-apply-in-r/
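In plain R the advice looks like this; SparkR then gives the same apply pattern a distributed meaning:

# Loop version:
squares <- numeric(10)
for (i in 1:10) squares[i] <- i^2
# apply version: same result, and the shape of computation SparkR parallelizes
squares <- sapply(1:10, function(i) i^2)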

Getting Closer to Idiomatic R

Example: Word Count

lines <- textFile(sc, "hdfs://my_text_file")
words <- flatMap(lines,
                 function(line) {
                   strsplit(line, " ")[[1]]
                 })
wordCount <- lapply(words,
                    function(word) {
                      list(word, 1L)
                    })
counts <- reduceByKey(wordCount, "+", 2L)
output <- collect(counts)
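The collected output is an ordinary R list of (word, count) pairs, so it can be inspected locally; a hypothetical follow-up:

# Sort the pairs by count, in plain R on the driver.
top <- output[order(sapply(output, function(p) p[[2]]), decreasing = TRUE)]
head(top, 5)   # five most frequent words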

Live Demo

MNIST

Least squares: minimize ‖Ax − b‖
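The demo solves this objective on SparkR; as a local reference point, the same problem in base R (A and b here are hypothetical stand-ins for the MNIST features and labels):

# Least squares locally: find x minimizing ||Ax - b||_2.
A <- matrix(rnorm(100 * 10), 100, 10)  # hypothetical 100 x 10 feature matrix
b <- rnorm(100)                        # hypothetical labels
x <- solve(t(A) %*% A, t(A) %*% b)     # normal equations: (A'A) x = A'b
# equivalently: x <- qr.solve(A, b)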

How does this work?

Dataflow

[Diagram, shown in builds: on the Local side, the R shell's RSparkContext talks over JNI to a JavaSparkContext; on each Worker, a Spark Executor launches an R process via exec]

From http://obeautifulcode.com/R/How-R-Searches-And-Finds-Stuff/

Dataflow: Performance?

[Diagram: as above, but each transformation forks its own R process on every Spark Executor]

…Pipeline the transformations!

words <- flatMap(lines, …)
wordCount <- lapply(words, …)

[Diagram: with pipelining, both transformations run inside a single exec'd R process per executor]

Alpha developer release

One line install!

install_github("amplab-extras/SparkR-pkg", subdir="pkg")
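install_github comes from the devtools package; a fuller sequence, with the local-mode initialization as an assumption about the reader's setup:

install.packages("devtools")
library(devtools)
install_github("amplab-extras/SparkR-pkg", subdir = "pkg")
library(SparkR)
sc <- sparkR.init(master = "local")   # start a local SparkContext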

SparkR Implementation

Lightweight:
292 lines of Scala code
1,694 lines of R code
549 lines of test code in R

…Spark is easy to extend!

In the Roadmap

Calling MLlib directly within SparkR
Data frame support
Better integration with R packages
Performance: daemon R processes
EC2 setup scripts
All Spark examples, MNIST demo
Hadoop2, Maven build

On GitHub

SparkR

Combines scalability & utility
RDD :: distributed lists
Closures & serialization
Re-use R packages

Thanks!

https://github.com/amplab-extras/SparkR-pkg

Shivaram Venkataraman [email protected]
Zongheng Yang [email protected]

Spark User mailing list: [email protected]

Pipelined RDD

[Diagram: without pipelining, each transformation launches its own R process per Spark Executor; a pipelined RDD runs the chained transformations in one R process]

[Diagram: the SparkR stack]
Storage: HDFS / HBase / Cassandra / …
Processing Engine: Spark + SparkR
Cluster Manager: Mesos / YARN / …

Example: Logistic Regression

pointsRDD <- textFile(sc, "hdfs://myfile")
weights <- runif(n = D, min = -1, max = 1)

# Logistic gradient
gradient <- function(partition) {
  Y <- partition[, 1]    # labels in the first column
  X <- partition[, -1]   # features in the remaining columns
  t(X) %*% ((1 / (1 + exp(-Y * (X %*% weights))) - 1) * Y)
}

# Iterate
weights <- weights - reduce(
  lapplyPartition(pointsRDD, gradient), "+")
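The "# Iterate" step presumably runs inside a loop until convergence; a sketch (the iteration count is an assumption, not the talk's exact code):

iterations <- 10   # hypothetical iteration count
for (i in 1:iterations) {
  weights <- weights - reduce(
    lapplyPartition(pointsRDD, gradient), "+")
}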

How does it work?

[Diagram: the R shell creates a Spark Context through rJava; each Spark Executor launches an RScript process]

Data: RDD[Array[Byte]]

Functions: Array[Byte]
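A sketch of this byte-level representation: base R can turn both data and closures into raw vectors, which is the form that arrives on the JVM side as Array[Byte]:

fn <- function(x) x + 1
fnBytes <- serialize(fn, connection = NULL)         # closure -> raw bytes
dataBytes <- serialize(list(1, 2, 3), connection = NULL)
identical(unserialize(fnBytes)(41), 42)             # round-trips: TRUE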