© Hortonworks Inc. 2011–2016. All Rights Reserved
Integrate SparkR with existing R packages to accelerate data science workflows
Feb 2017
Yanbo Liang, Software Engineer @ Hortonworks, Apache Spark committer
Outline
à Introduction to R and SparkR.
à Typical data science workflow.
à SparkR + R for typical data science problems.
– Big data, small learning.
– Partition aggregate.
– Large scale machine learning.
à Future directions.
R for data scientist
à Pros
– Open source.
– Rich ecosystem of packages.
– Powerful visualization infrastructure.
– Data frames make data manipulation convenient.
– Taught by many schools to statistics and computer science students.
à Cons
– Single threaded.
– Everything has to fit in the memory of a single machine.
SparkR = Spark + R
à An R frontend for Apache Spark, a widely deployed cluster computing engine.
à Wrappers over DataFrames and DataFrame-based APIs (MLlib).
– A complete DataFrame API that behaves just like R's data.frame.
– ML APIs mimic the methods implemented in R or R packages, rather than the Scala/Python APIs.
à The data frame concept is the cornerstone of both Spark and R.
à Convenient interoperability between R and Spark DataFrames.
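A minimal sketch of this interoperability, assuming a running SparkR session: a local data.frame is promoted to a distributed SparkDataFrame, transformed on the cluster, and brought back with collect.

```r
library(SparkR)
sparkR.session()

# Promote R's built-in faithful data.frame to a distributed SparkDataFrame.
df <- createDataFrame(faithful)

# Run a distributed transformation, then pull the result back
# to the driver as an ordinary local R data.frame.
local <- collect(filter(df, df$waiting < 50))
class(local)  # "data.frame"
```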
Why SparkR + R
à There are thousands of community packages on CRAN.
– It is impossible for SparkR to match all existing features.
à Not every dataset is large.
– Many people work with small or medium datasets.
SparkR + R for typical data science application
à Big data, small learning
à Partition aggregate
à Large scale machine learning
Big data, small learning
[Diagram: Tables 1–5 are joined, then reduced with select/where/aggregate/sample in SparkR; the result is collected into R for model/analytics.]
Data wrangle with SparkR
Operation/Transformation | Function
Join different data sources or tables | join
Pick observations by their value | filter/where
Reorder the rows | arrange
Pick variables by their names | select
Create new variables with functions of existing variables | mutate/withColumn
Collapse many values down to a single summary | summary/describe
Aggregation | groupBy
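A short sketch of how several of these verbs compose, assuming the airlines SparkDataFrame loaded on the next slide (UniqueCarrier is a column of the 2008 airline dataset):

```r
# Keep long-haul flights, sort by arrival delay, and derive a total-delay column.
df <- filter(airlines, airlines$Distance > 1000)
df <- arrange(df, desc(df$ArrDelay))
df <- mutate(df, TotalDelay = df$ArrDelay + df$DepDelay)

# Aggregate: mean arrival delay per carrier.
delays <- agg(groupBy(df, "UniqueCarrier"), AvgArrDelay = avg(df$ArrDelay))
head(delays)
```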
Data wrangle
airlines <- read.df(path="/data/2008.csv", source="csv", header="true", inferSchema="true")
planes <- read.df(path="/data/plane-data.csv", source="csv", header="true", inferSchema="true")
joined <- join(airlines, planes, airlines$TailNum == planes$tailnum)
df1 <- select(joined, "aircraft_type", "Distance", "ArrDelay", "DepDelay")
df2 <- dropna(df1)
Sampling Algorithms
à Bernoulli sampling (without replacement)
– df3 <- sample(df2, FALSE, 0.1)
à Poisson sampling (with replacement)
– df3 <- sample(df2, TRUE, 0.1)
à Stratified sampling
– df3 <- sampleBy(df2, "aircraft_type", list("Fixed Wing Multi-Engine" = 0.1, "Fixed Wing Single-Engine" = 0.2, "Rotorcraft" = 0.3), 0)
Big data, small learning
[Diagram: the same pipeline with the boundary made explicit: join/select/where/aggregate/sample run on a SparkDataFrame; collect hands the result to a local R data.frame for model/analytics.]
Partition aggregate
à User Defined Functions (UDFs).
– dapply
– gapply
à Parallel execution of functions.
– spark.lapply
dapply
> schema <- structType(structField("aircraft_type", "string"),
    structField("Distance", "integer"),
    structField("ArrDelay", "integer"),
    structField("DepDelay", "integer"),
    structField("DepDelayS", "integer"))
> df4 <- dapply(df3, function(x) { x <- cbind(x, x$DepDelay * 60L) }, schema)
> head(df4)
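When the result is small enough to fit on the driver, SparkR's dapplyCollect variant skips the schema and returns a local data.frame directly; a minimal sketch under the same column assumptions:

```r
# dapplyCollect applies the function to each partition and collects
# the combined result as a local R data.frame -- no schema required.
local <- dapplyCollect(df3, function(x) {
  cbind(x, DepDelayS = x$DepDelay * 60L)
})
head(local)
```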
gapply
> schema <- structType(structField("Distance", "integer"), structField("MaxActualDelay", "integer"))
> df5 <- gapply(df3, "Distance", function(key, x) { y <- data.frame(key, max(x$ArrDelay - x$DepDelay)) }, schema)
> head(df5)
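gapply likewise has a gapplyCollect variant that returns a local data.frame without requiring a schema; a sketch:

```r
# Group by Distance, compute the maximum actual delay per group,
# and collect the result to the driver as a local data.frame.
local <- gapplyCollect(df3, "Distance", function(key, x) {
  data.frame(key, max(x$ArrDelay - x$DepDelay))
})
head(local)
```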
spark.lapply
à An ideal way to distribute existing R functions and packages.
spark.lapply
for (lambda in c(0.5, 1.5)) {
  for (alpha in c(0.1, 0.5, 1.0)) {
    model <- glmnet(A, b, lambda = lambda, alpha = alpha)
    c <- predict(model, A)
    c(coef(model), auc(c, b))
  }
}
spark.lapply
values <- list(c(0.5, 0.1), c(0.5, 0.5), c(0.5, 1.0), c(1.5, 0.1), c(1.5, 0.5), c(1.5, 1.0))
train <- function(value) {
  lambda <- value[1]
  alpha <- value[2]
  model <- glmnet(A, b, lambda = lambda, alpha = alpha)
  c <- predict(model, A)
  c(coef(model), auc(c, b))
}
models <- spark.lapply(values, train)
spark.lapply
[Diagram: the driver holds the grid lambda = c(0.5, 1.5), alpha = c(0.1, 0.5, 1.0) and distributes work across six executors.]
spark.lapply
[Diagram: each executor trains one (lambda, alpha) combination: (0.5, 0.1), (0.5, 0.5), (0.5, 1.0), (1.5, 0.1), (1.5, 0.5), (1.5, 1.0).]
Virtual environment
[Diagram: the glmnet package is shipped from the driver to, and installed on, every executor.]
Virtual environment
download.packages("glmnet", packagesDir, repos = "https://cran.r-project.org")
filename <- list.files(packagesDir, "^glmnet")
packagesPath <- file.path(packagesDir, filename)
spark.addFile(packagesPath)
Virtual environment

path <- spark.getSparkFiles(filename)
values <- list(c(0.5, 0.1), c(0.5, 0.5), c(0.5, 1.0), c(1.5, 0.1), c(1.5, 0.5), c(1.5, 1.0))
train <- function(value) {
  install.packages(path, repos = NULL, type = "source")
  library(glmnet)
  lambda <- value[1]
  alpha <- value[2]
  model <- glmnet(A, b, lambda = lambda, alpha = alpha)
  c <- predict(model, A)
  c(coef(model), auc(c, b))
}
models <- spark.lapply(values, train)
Large scale machine learning
> model <- glm(ArrDelay ~ DepDelay + Distance + aircraft_type, family = "gaussian", data = df3)
> summary(model)
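Models fitted this way can also be persisted and reloaded across sessions with SparkR's write.ml/read.ml; a minimal sketch (the path is illustrative):

```r
# Save the fitted model, then reload it later, e.g. in another session.
write.ml(model, "/tmp/arrdelay-glm")
model2 <- read.ml("/tmp/arrdelay-glm")

# Score data with the reloaded model; predict returns a SparkDataFrame
# with a "prediction" column appended.
preds <- predict(model2, df3)
head(select(preds, "ArrDelay", "prediction"))
```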
Future directions
à Improve collect/createDataFrame performance in SparkR (SPARK-18924).
à More scalable machine learning algorithms from MLlib.
à Better R formula support.
à Improve UDF performance.
Reference
à SparkR: Scaling R Programs with Spark (SIGMOD 2016)
à http://www.slideshare.net/databricks/recent-developments-in-sparkr-for-advanced-analytics
à https://databricks.com/blog/2015/10/05/generalized-linear-models-in-sparkr-and-r-formula-support-in-mllib.html
à https://databricks.com/blog/2016/12/28/10-things-i-wish-i-knew-before-using-apache-sparkr.html
à http://www.kdnuggets.com/2015/06/top-20-r-machine-learning-packages.html
à R for Data Science (http://r4ds.had.co.nz/)