Upload
linette-snow
View
224
Download
1
Embed Size (px)
Citation preview
Talk Outline
MotivationOverview of Spark & SparkR APILive Demo: Digit Classification
Design & ImplementationQuestions & Answers
R + RDD = RRDD
lapplylapplyPartition
groupByKeyreduceByKeysampleRDDcollectcache
…
broadcastincludePackage
textFileparallelize
Q: How can I use a loop to [...insert task here...] ?
A: Don’t. Use one of the apply functions.
From: http://nsaunders.wordpress.com/2010/08/20/a-brief-introduction-to-apply-in-r/
Getting Closer to Idiomatic R
Example: Word Count
lines <- textFile(sc, "hdfs://my_text_file")
words <- flatMap(lines, function(line) { strsplit(line, " ")[[1]] })wordCount <- lapply(words,
function(word) { list(word, 1L)
})
Example: Word Count
lines <- textFile(sc, "hdfs://my_text_file")
words <- flatMap(lines, function(line) { strsplit(line, " ")[[1]] })wordCount <- lapply(words,
function(word) { list(word, 1L)
})counts <- reduceByKey(wordCount, "+", 2L)output <- collect(counts)
Dataflow
Local Worker
WorkerRSpark Conte
xt
JavaSpark Conte
xt
JNI
SparkExecuto
r execR
SparkExecuto
r exec R
Dataflow: Performance?
Local Worker
WorkerRSpark Conte
xt
JavaSpark Conte
xt
JNI
SparkExecuto
r execR
SparkExecuto
r execR
…Pipeline the transformations!
words <- flatMap(lines, …)wordCount <- lapply(words, …)
SparkExecuto
r execR
SparkExecuto
rR
exec
SparkR Implementation
Lightweight292 lines of Scala code1694 lines of R code549 lines of test code in R
…Spark is easy to extend!
In the Roadmap
Calling MLLib directly within SparkR
Data Frame support
Better integration with R packages
Performance: daemon R processes
SparkR
Combine scalability & utility
RDD :: distributed lists
Closures & Serializations
Re-use R packages
Thanks!
https://github.com/amplab-extras/SparkR-pkg
Shivaram Venkataraman
Spark User mailing list [email protected]
Zongheng Yang [email protected]
Example: Logistic Regression
pointsRDD <- textFile(sc, "hdfs://myfile")weights <- runif(n=D, min = -1, max = 1)
# Logistic gradientgradient <- function(partition) { X <- partition[,1]; Y <- partition[,-1] t(X) %*% (1/(1 + exp(-Y * (X %*% weights))) - 1) * Y}
Example: Logistic Regression
pointsRDD <- textFile(sc, "hdfs://myfile")weights <- runif(n=D, min = -1, max = 1)
# Logistic gradientgradient <- function(partition) { X <- partition[,1]; Y <- partition[,-1] t(X) %*% (1/(1 + exp(-Y * (X %*% weights))) - 1) * Y}
# Iterateweights <- weights - reduce(
lapplyPartition(pointsRDD, gradient), "+")