Accelerating and Scaling R Analytics Using Spark R, In-memory Columnar Databases, and Hadoop

Dan Gouveia, Ironside
Chi Shu, Ironside
Rich Tarro, IBM

© 2016 IBM Corporation

Source: files.meetup.com/9505222/Scaling with R Meetup v0.3.pdf

Agenda

R Analytics

– Overview

– Challenges

Scaling R

dashDB Overview

Demonstration: R with dashDB In-Database Analytics

Spark Overview

– Spark DataFrames

– SparkR

Demonstration: SparkR, using Jupyter Notebook

Demonstration: SparkR with dashDB, using RStudio

What is R?

Interpreted programming language for statistical computing and graphics

Freely available under the GNU General Public License

Widely used among statisticians and data miners for data and statistical analysis

The capabilities of R are extended through user-created packages
– R includes a core set of packages
– More than 7,801 additional packages (as of January 2016) available at the Comprehensive R Archive Network (CRAN)

R's popularity has increased substantially in recent years

Data Science Tool Adoption

R Challenges

R is single threaded

R requires data to be loaded into memory

– objects must all fit in memory

R can only process data on a single machine

Previous approaches for scaling R for Big Data

RHIPE implementation

include an R API for writing MapReduce from R

RHadoop implementation

Introducing 2 Ways to scale R Analytics

dashDB: A fully-managed cloud data warehouse, purpose-built for analytics

Spark: An open source cluster-computing framework

IBM dashDB – Analytics Warehouse as a Service

For apps that need:

• Elastic scalability

• High availability

• Data model flexibility

• Data mobility

• Text search

• Geospatial

Available as:

• Fully managed DBaaS

• On-premises private cloud

• Hybrid architecture

Components: BLU Acceleration, Netezza In-Database Analytics, Cloudant NoSQL Integration

In-database analytics capabilities for best performance atop a fully-managed warehouse

dashDB MPP

• Fully-managed data warehouse on cloud

• BLU Acceleration columnar technology + Netezza in-database analytics

• BLU in-memory processing, data skipping, actionable compression, parallel vector processing, “Load & Go” administration

• Netezza predictive analytic algorithms

• Fully integrated RStudio & R language

• Oracle compatibility

• Massively Parallel Processing (MPP)

• On disk data encryption and secure connectivity

MPP for IBM dashDB

Massively Parallel Processing
– Coordination of multiple CPU cores and servers, working together to solve complex tasks & queries
– Add more servers for additional processing power!

Traditional Approach: Parallelization of Cores
• For smaller data sets < 12TB
• Generally less expensive
• Slower performance
• Query takes 1 hour

MPP Approach: Parallelization of Cores and Servers
• For larger data sets > 4 TB
• Larger monthly budget
• Very high performance
• Query takes 15 min

Query is segmented into smaller tasks

Four (4) servers work together on separate tasks of the original query
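
To make the segmentation idea concrete, here is a minimal sketch in Python (standard library only). It is not how dashDB is implemented: threads stand in for the four servers, the round-robin split stands in for the database's own data distribution, and the names `partial_sum_count` and `mpp_average` are hypothetical. It only illustrates the pattern of computing partial results per slice and combining them on a coordinator.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sum_count(rows):
    """Work done by one 'server': aggregate only its own slice of the data."""
    return sum(rows), len(rows)


def mpp_average(values, n_servers=4):
    """Split the data into n_servers slices, aggregate each slice in
    parallel, then combine the partial results on a coordinator."""
    # Round-robin split (an assumption made for this sketch).
    slices = [values[i::n_servers] for i in range(n_servers)]
    # Threads are a stand-in for separate servers.
    with ThreadPoolExecutor(max_workers=n_servers) as pool:
        partials = list(pool.map(partial_sum_count, slices))
    # Coordinator step: merge the partial sums and counts.
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


values = list(range(1, 101))
print(mpp_average(values))  # 50.5
```

The average is computed from partial sums and counts rather than from partial averages, because per-slice averages cannot be combined correctly when slices differ in size.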

IBM Netezza Advanced Analytics Built In!

k-Means Clustering

Linear Regression

Decision Tree

Geospatial

Anatomy of dashDB’s Analytic Warehouse

Execute entire custom(er) analytic programs inside of the database!

Deploy custom(er) code and execute jobs via special SQL function interfaces

[Diagram labels: Analytic Applications; SQLs; Database (BLU); Analytic Code & Algorithms; Canned Algorithms; Language Framework (UDX & AE); Analytic Data; Data]

BENEFIT: Bring custom-designed analytic functions and programs directly to the data!

dashDB – Integrated Analytics Environment with Open-Source R

IBM DBR Package

Allows you to perform the following…

Create in-database Data Frames

Query a database, creating an R data frame

Sample, merge, create contingency table, etc…

Modeling

– K-Means

– Association

– Linear Regression

– Decision Tree

IBM DBR Package - Example

Running the LM function against ~40K database records, using 12GB memory…

Demonstration: R with dashDB In-Database Analytics

What this demonstration will cover:

Walk through IBMDBR package concepts
– Brief discussion of the dataset
– Brief discussion of the tasks that will be performed
• Making a long dataset wide
• Modeling
• Scoring
– Connect RStudio to dashDB
– Create ida data frame
– Subset ida data frame (column-wise)
– Merge ida data frame
– Subset ida data frame (row-wise)
– Modeling
– Scoring
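
The "long to wide" step named above can be sketched as follows. This is an illustration in plain Python, not the demonstration's actual code: the demo dataset is not shown in the slides, so the records, the column names (`id`, `metric`, `value`) and the helper `long_to_wide` are all hypothetical.

```python
from collections import defaultdict


def long_to_wide(records, key, name, value):
    """Pivot long-format records (one row per key/name pair) into
    wide-format records (one row per key, one column per name)."""
    wide = defaultdict(dict)
    for rec in records:
        wide[rec[key]][rec[name]] = rec[value]
    # Emit one dict per key, ordered by key for predictable output.
    return [{key: k, **cols} for k, cols in sorted(wide.items())]


# Hypothetical long-format data: one measurement per row.
long_rows = [
    {"id": 1, "metric": "height", "value": 180},
    {"id": 1, "metric": "weight", "value": 75},
    {"id": 2, "metric": "height", "value": 165},
    {"id": 2, "metric": "weight", "value": 60},
]

wide_rows = long_to_wide(long_rows, "id", "metric", "value")
print(wide_rows)
# [{'id': 1, 'height': 180, 'weight': 75}, {'id': 2, 'height': 165, 'weight': 60}]
```

Modeling functions generally expect one row per observation with one column per variable, which is why a reshape like this usually comes before the modeling and scoring steps.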

Spark – Let’s get on the same page

Apache Spark is an open source parallel processing framework that enables users to run large-scale data analytics applications across clustered computers.

It can process data from:
– Hadoop Distributed File System (HDFS)
– NoSQL dbs
– Relational data stores (e.g. Apache Hive)

It can process data in-memory or on-disk.

Spark Background

Started as a research project in 2009, open sourced in 2010
– General purpose cluster computing system
– Generalizes MapReduce
– Batch oriented processing
– Main concept: Resilient Distributed Datasets (RDDs)

Apache incubator project in June 2013
– Apache top level project Feb 27, 2014

Current version 1.6.1
– Requires Scala 2.10.x, Maven
– Languages supported: Java, Scala, Python, R (Java 7+, Python 2.6+, R 3.1+)
– May need additional libraries for Python, e.g. numpy

Key reasons for interest in Spark

Performance
– In-memory architecture greatly reduces disk I/O
– Anywhere from 20-100x faster for common tasks

Productivity
– Concise and expressive syntax, especially compared to prior approaches
– Single programming model across a range of use cases and steps in data lifecycle
– Integrated with common programming languages: Java, Python, Scala, R
– New tools continually reduce skill barrier for access (e.g. SQL for analysts)

Leverages existing investments
– Works well within existing Hadoop ecosystem

Improves with age
– Large and growing community of contributors continuously improves the full analytics stack and extends capabilities

Spark Programming Languages

Language   2014      2015
Scala **   84%       71%
Java       38%       31%
Python     38%       58%
R          unknown   18%

Survey done by Databricks, summer 2015

** Spark written in Scala

Spark Application Architecture

A Spark application is initiated from a driver program

Spark execution modes:

– Standalone with the built-in cluster manager

– Use Mesos as the cluster manager

– Use YARN as the cluster manager

– Standalone cluster on Amazon EC2

Spark DataFrames

Distributed collection of data organized in named columns
– Conceptually equivalent to a relational table or an R/Python data frame

Supported formats and sources
– Can be created from an SQLContext
– From sources such as: JSON, Hive, JDBC, Parquet, etc.

Benefits:
– Easier manipulation interface (similar to SQL)
– Higher abstraction for possible optimization

Spark SQL

Provides relational queries expressed in SQL, HiveQL and Scala

Seamlessly mix SQL queries with Spark programs

DataFrames provide a single interface for efficiently working with structured data including Apache Hive, Parquet and JSON files

Leverages Hive frontend and metastore
– Compatibility with Hive data, queries, and UDFs
– HiveQL limitations may apply
– Not ANSI SQL compliant

Standard connectivity through JDBC/ODBC

SparkR

SparkR is an R package that provides a light-weight front-end to use Apache Spark from R

Exposes the Spark API and allows users to interactively run jobs

Provides a distributed data frame implementation that supports operations like selection, filtering, aggregation etc. (similar to R data frames, dplyr) but on large datasets
– Conceptually equivalent to a table in a relational database or a data frame in R, but with richer optimizations under the hood
– DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing local R data frames

SparkR also supports distributed machine learning using MLlib

Running SQL Queries from SparkR

A SparkR DataFrame can also be registered as a temporary table in Spark SQL

Registering a DataFrame as a table allows you to run SQL queries over its data

The sql function enables applications to run SQL queries programmatically and returns the result as a DataFrame.
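
The register-then-query pattern described above can be illustrated without a Spark cluster. The sketch below uses Python's built-in `sqlite3` module as a stand-in for Spark SQL, so it is an analogy only and not SparkR code: the table name `flowers`, its columns, the sample rows and the helper `run_sql` are all hypothetical.

```python
import sqlite3

# Hypothetical in-memory data playing the role of a DataFrame.
rows = [("setosa", 5.1), ("setosa", 4.9), ("virginica", 6.3)]

# "Register" the data under a table name so SQL can refer to it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flowers (species TEXT, sepal_length REAL)")
conn.executemany("INSERT INTO flowers VALUES (?, ?)", rows)


def run_sql(query):
    """Run a SQL query programmatically and return the result rows."""
    return conn.execute(query).fetchall()


result = run_sql(
    "SELECT species, COUNT(*) FROM flowers GROUP BY species ORDER BY species"
)
print(result)  # [('setosa', 2), ('virginica', 1)]
```

The difference in SparkR is that the query result is itself a distributed DataFrame, so it can feed further DataFrame operations rather than being pulled back as local rows.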

RStudio

RStudio is a free and open-source integrated development environment (IDE) for R

Web-Based Notebooks

Notebooks: “interactive computational environment, in which you can combine code execution, rich text, mathematics, plots and rich media”

Zeppelin
– Apache incubator project
– Supports multiple interpreters
• Scala (with Apache Spark), Python (with Apache Spark), SparkSQL, Hive, Markdown and Shell

Jupyter
– Based on IPython
– Supports multiple interpreters
• Python, Scala, R

Demonstration: SparkR, using a Jupyter Notebook

What this demonstration will cover:

Explore the concept of a notebook

Walk through high-level SparkR concepts

– Data access (file)

– Aggregated Calculations

– Transformations

– Enriching data (adding weather)

– Visualizing data (ggplot.SparkR)

Demonstration: SparkR with dashDB, using RStudio

What this demonstration will cover:

Walk through high-level concepts

– Connecting R to Spark

– Data access (dashDB)

– Aggregated Calculations

– Enriching data (adding weather via API)

– Visualizing data (ggplot)

Wrap-up

Summary

Key Concepts Covered:

1. R is a popular and powerful data science tool

2. R has limitations
– Single threaded, in-memory solution
– Not scalable on its own

3. IBM’s dashDB provides a means of scaling R
– Can use R to develop in-database analytics applications
– Leverage the powerful MPP capabilities

4. Spark (SparkR) provides another means of scaling R
– Can use R to develop within the Spark framework
– Take advantage of clustered computing power
– Parallels between R data frame and Spark DataFrame

Next Big Data Developers Meetup

Building a Recommendation Engine with Spark MLlib

Tuesday, June 6, 2016 @ 6 PM

IBM Client Center
1 Rogers St.
Cambridge, MA

Backup
