SparkR: The Past, Present and Future

Shivaram Venkataraman

shivaram.org/talks/sparkr-summit-2015.pdf · 2020-01-20

Big Data & R

DataFrames Visualization

Libraries Data +

Big Data & R

1. Big Data, Small Learning
2. Partition Aggregate
3. Large Scale Machine Learning

1. Big Data, Small Learning

Data → Cleaning, Filtering, Aggregation → Collect → Subset → DataFrames, Visualization, Libraries

2(a). Partition Aggregate

Data Collect

Subset

Best Model Params

Parameter Tuning
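
The parameter tuning pattern on this slide can be sketched in plain base R, without Spark. This is a minimal sketch under stated assumptions: the synthetic data, the candidate grid `degrees`, and the helper `validation_error` are hypothetical names introduced for illustration, and the local `lapply` only stands in for whatever parallel primitive a cluster would provide. Each candidate is scored independently, the scores are collected, and the best parameter is kept.

```r
# Base R only; nothing here is SparkR API.
set.seed(42)
n <- 200
x <- runif(n, -2, 2)
y <- 1 + 2 * x - 3 * x^2 + rnorm(n, sd = 0.1)  # synthetic quadratic data (assumption)
dat <- data.frame(x = x, y = y)

# Hold out part of the data for validation.
train_idx <- sample(n, 150)
train <- dat[train_idx, ]
valid <- dat[-train_idx, ]

degrees <- 1:4  # hypothetical tuning grid

# Fit one model for a candidate degree and return its validation error.
validation_error <- function(degree) {
  fit <- lm(y ~ poly(x, degree, raw = TRUE), data = train)
  pred <- predict(fit, newdata = valid)
  mean((valid$y - pred)^2)
}

# "Partition": every candidate is evaluated independently.
# "Aggregate": collect the scores and keep the best parameter.
errors <- unlist(lapply(degrees, validation_error))
best_degree <- degrees[which.min(errors)]
print(best_degree)
```

Because the candidates never share state, the `lapply` step is the part that could be farmed out to workers; only the small vector of scores has to come back.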

2(b). Partition Aggregate

Data Combine Models

Model Averaging
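
Model averaging can be illustrated the same way in base R. The sketch below is an assumption-laden illustration, not the talk's implementation: the data are synthetic, the four-way split and the names `partitions`, `local_coefs`, and `avg_coefs` are hypothetical, and a plain linear model is used only because its coefficients are easy to average.

```r
# Base R only; nothing here is SparkR API.
set.seed(1)
n <- 400
x <- runif(n)
y <- 1 + 2 * x + rnorm(n, sd = 0.1)  # synthetic linear data (assumption)
dat <- data.frame(x = x, y = y)

# Split the rows into 4 partitions (round-robin assignment).
partition_id <- rep(1:4, length.out = n)
partitions <- split(dat, partition_id)

# Fit one model per partition and keep only its coefficients.
local_coefs <- lapply(partitions, function(part) {
  coef(lm(y ~ x, data = part))
})

# Combine the models by averaging the coefficient vectors.
avg_coefs <- Reduce(`+`, local_coefs) / length(local_coefs)
print(avg_coefs)
```

Only the coefficient vectors are combined, so the per-partition fits can run independently on whatever subset of the data each worker holds.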

3. Large Scale Machine Learning

Data → Featurize → Learning → Model

Big Data & R

1. Big Data, Small Learning
2. Partition Aggregate
3. Large Scale Machine Learning

SparkR: Unified approach

Outline

Project History
Current Release
SparkR Future

Spark: Speed, Scalable, Flexible

R: Statistics, Visualization, DataFrames

RDD

Parallel Collection

Transformations: map, filter, groupBy, …

Actions: count, collect, saveAsTextFile, …

R + RDD = RRDD

lapply, lapplyPartition, groupByKey, collect, cache, …

broadcast, includePackage

textFile

Example: Word Count

    library(SparkR)
    lines <- textFile(sc, "hdfs://my_text_file")
    words <- flatMap(lines,
                     function(line) {
                       strsplit(line, " ")[[1]]
                     })
    wordCount <- lapply(words,
                        function(word) {
                          list(word, 1L)
                        })
    counts <- reduceByKey(wordCount, "+", 2L)
    output <- collect(counts)

Initial Prototype

Standalone R package
Install from GitHub

Open Source Development

1. Architecture

2. Usability

Architecture

Local: R Spark Context ↔ R-JVM bridge ↔ Java Spark Context
Worker: Spark Executor, exec R
Worker: Spark Executor, exec R

R-JVM Bridge

Layer to call JVM methods directly from R
Automatic argument serialization

    result <- callJStatic(
      "sparkr.RRDD",
      "someMethod",
      arg1,
      arg2)

R-JVM Bridge

Use sockets for communication
Supported across platforms, languages

R ↔ Netty Server (JVM)

Usability

Need for Data Inputs
Read in CSV, JSON, JDBC etc.
High-level API for data manipulation

SparkR DataFrames

    people <- read.df(
      "people.json",
      "json")
    # Aggregate over the DataFrame that was just read
    avgAge <- select(
      people,
      avg(people$age))
    head(avgAge)

DataSources API
Support for schema
dplyr-like syntax

SparkR DataFrames

Scala Optimizations
Released in Spark 1.4!

[Bar chart: Time (s), axis 0 to 3, for SparkR DataFrame, Scala DataFrame, Python DataFrame]

Demo: github.com/cafreeman/SparkR_DataFrame_Demo

SparkR Future

Big Data & R

1. Big Data, Small Learning
2. Partition Aggregate
3. Large Scale Machine Learning

Big Data, Small Learning

SparkR DataFrames: Read input, aggregation
Collect results, apply machine learning

Upcoming features:
Support for R transformations
More column functions (e.g. math, strings)

Partition Aggregate

Upcoming feature:
Simple, parallel API for SparkR
Ex: Parameter tuning, Model Averaging
Integrated with DataFrames
Use existing R packages

Large Scale Machine Learning

Integration with MLlib
Support for GLM, KMeans etc.

    model <- glm(a ~ b + c,
                 data = df)

Large Scale Machine Learning

Key Features
DataFrame inputs
R-like formulas
Model statistics

    model <- glm(a ~ b + c,
                 data = df)
    summary(model)
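
For comparison, the call shape on this slide mirrors base R's own `glm`. The sketch below runs the same formula on a small local data.frame; the synthetic columns `a`, `b`, and `c` and their coefficients are assumptions made for illustration. It uses only base R's stats package and says nothing about how the distributed version is implemented.

```r
# Base R only; a local stand-in for the DataFrame-backed call on the slide.
set.seed(7)
df <- data.frame(b = rnorm(100), c = rnorm(100))
df$a <- 0.5 + 1.5 * df$b - 2 * df$c + rnorm(100, sd = 0.2)  # synthetic response

# R-like formula; glm defaults to the gaussian family.
model <- glm(a ~ b + c, data = df)

# Model statistics: coefficient estimates, standard errors, deviance.
model_stats <- summary(model)
print(round(coef(model), 2))
```

The formula and the `summary` call are identical to the slide's; only the type of `df` differs, which is the point of an R-like interface.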

Extensibility

Existing data sources
R package support on spark-packages.org
Example packages

    ./bin/sparkR --packages spark-csv

Developer Community

>20 contributors including AMPLab, Databricks, Alteryx, Intel
R and Scala contributions welcome!

SparkR

Big data processing from R
DataFrames in Spark 1.4
Future: Large Scale ML & more