
© 2016 IBM Corporation

Accelerating and Scaling R Analytics Using Spark R, In-memory Columnar Databases, and Hadoop

Dan Gouveia, Ironside

Chi Shu, Ironside

Rich Tarro, IBM


Agenda

R Analytics

– Overview

– Challenges

Scaling R

dashDB Overview

Demonstration: R with dashDB In-Database Analytics

Spark Overview

– Spark DataFrames

– SparkR

Demonstration: SparkR, using Jupyter Notebook

Demonstration: SparkR with dashDB, using RStudio


What is R?

Interpreted programming language for statistical computing and graphics

Freely available under the GNU General Public License

Widely used among statisticians and data miners for data and statistical analysis

The capabilities of R are extended through user-created packages

– R includes a core set of packages

– More than 7,801 additional packages (as of January 2016) available at the Comprehensive R Archive Network (CRAN)

R's popularity has increased substantially in recent years


Data Science Tool Adoption


R Challenges

R is single threaded

R requires data to be loaded into memory

– objects must all fit in memory

R can only process data on a single machine


Previous approaches for scaling R for Big Data

RHIPE implementation

include an R API for writing MapReduce from R

RHadoop implementation


Introducing 2 Ways to Scale R Analytics

dashDB: A fully-managed cloud data warehouse, purpose-built for analytics

Spark: An open source cluster-computing framework


IBM dashDB – Analytics Warehouse as a Service

For apps that need:

• Elastic scalability

• High availability

• Data model flexibility

• Data mobility

• Text search

• Geospatial

Available as:

• Fully managed DBaaS

• On-premises private cloud

• Hybrid architecture

BLU Acceleration

Netezza In-Database Analytics

Cloudant NoSQL Integration

In-database analytics capabilities for best performance atop a fully-managed warehouse

dashDB MPP

Fully-managed data warehouse on cloud

BLU Acceleration columnar technology + Netezza in-database analytics

BLU in-memory processing, data skipping, actionable compression, parallel vector processing, “Load & Go” administration

Netezza predictive analytic algorithms

Fully integrated RStudio & R language

Oracle compatibility

Massively Parallel Processing (MPP)

On-disk data encryption and secure connectivity


MPP for IBM dashDB

Massively Parallel Processing

– Coordination of multiple CPU cores and servers, working together to solve complex tasks & queries

– Add more servers for additional processing power!

Traditional Approach: Parallelization of Cores (query takes 1 hour)

• For smaller data sets < 12 TB

• Generally less expensive

• Slower performance

MPP Approach: Parallelization of Cores and Servers (query takes 15 min)

• For larger data sets > 4 TB

• Larger monthly budget

• Very high performance

• Query is segmented into smaller tasks

• Four (4) servers work together on separate tasks of the original query
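
The segment-and-merge idea behind MPP can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how dashDB is implemented: the function names (`partial_aggregate`, `mpp_average`), the round-robin partitioning, and the sample data are all hypothetical, and Python threads stand in for separate servers without giving real CPU parallelism.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_aggregate(rows):
    """Work done by one 'server': sum and count over its own slice of the data."""
    return sum(rows), len(rows)


def mpp_average(values, servers=4):
    """Segment one 'average' query into per-server tasks, then merge the partial results."""
    # Hypothetical round-robin partitioning: row i goes to server i % servers.
    partitions = [values[i::servers] for i in range(servers)]
    with ThreadPoolExecutor(max_workers=servers) as pool:
        partials = list(pool.map(partial_aggregate, partitions))
    # Merge step: combine the per-server sums and counts into the final answer.
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


sales = list(range(1, 1001))  # made-up data standing in for a large table
print(mpp_average(sales))     # same answer as sum(sales) / len(sales)
```

The point of the sketch is that each worker only needs its own partition, and only small partial results (a sum and a count) travel back to be merged.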


IBM Netezza Advanced Analytics Built In!

k-Means Clustering

Linear Regression

Decision Tree

Geospatial


Anatomy of dashDB’s Analytic Warehouse

Execute entire custom(er) analytic programs inside of the database!

[Diagram: analytic applications send SQL to the database (BLU), which holds the analytic data alongside analytic code and algorithms: canned algorithms and a language framework (UDX & AE).]

Deploy custom(er) code and execute jobs via special SQL function interfaces

BENEFIT: Bring custom-designed analytic functions and programs directly to the data!


dashDB – Integrated Analytics Environment with Open-Source R


IBM DBR Package

Allows you to perform the following…

Create in-database Data Frames

Query a database, creating an R data frame

Sample, merge, create contingency table, etc…

Modeling

– K-Means

– Association

– Linear Regression

– Decision Tree


IBM DBR Package - Example

Running the LM function against ~40K database records, using 12GB memory…


What this demonstration will cover:

Walk through ibmdbR package concepts

– Brief discussion of the dataset

– Brief discussion of the tasks that will be performed

• Making a long dataset wide

• Modeling

• Scoring

– Connect RStudio to dashDB

– Create ida data frame

– Subset ida data frame (column-wise)

– Merge ida data frame

– Subset ida data frame (row-wise)

– Modeling

– Scoring
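
The "long to wide" reshaping step listed above can be illustrated with a small, self-contained Python sketch. It is not the demonstration's code and does not use the dashDB R package; the function name `long_to_wide`, the column names, and the sample rows are hypothetical and chosen only to show the shape of the transformation.

```python
from collections import defaultdict


def long_to_wide(rows, key, name_col, value_col):
    """Pivot long-format rows (one row per key/measure pair) into one wide row per key."""
    wide = defaultdict(dict)
    for row in rows:
        wide[row[key]][row[name_col]] = row[value_col]
    # Emit one dict per key: the key column first, then one column per measure.
    return [{key: k, **measures} for k, measures in sorted(wide.items())]


# Made-up long-format data: one row per (customer, measure) pair.
long_rows = [
    {"customer": "A", "measure": "visits", "value": 3},
    {"customer": "A", "measure": "spend", "value": 120.0},
    {"customer": "B", "measure": "visits", "value": 1},
    {"customer": "B", "measure": "spend", "value": 15.5},
]

wide_rows = long_to_wide(long_rows, "customer", "measure", "value")
for row in wide_rows:
    print(row)
```

A wide layout like this, with one row per entity and one column per measure, is the form most modeling functions expect as input.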


Spark – Let’s get on the same page

Apache Spark is an open source parallel processing framework that enables users to run large-scale data analytics applications across clustered computers.

It can process data from:

– Hadoop Distributed File System (HDFS)

– NoSQL dbs

– Relational data stores (e.g. Apache Hive)

It can process data in-memory or on-disk.


Spark Background

Started as a research project in 2009, open sourced in 2010

– General purpose cluster computing system

– Generalizes MapReduce

– Batch oriented processing

– Main concept: Resilient Distributed Datasets (RDDs)

Apache incubator project in June 2013

– Apache top level project Feb 27, 2014

Current version 1.6.1

– Requires Scala 2.10.x, Maven

– Languages supported: Java, Scala, Python, R (Java 7+, Python 2.6+, R 3.1+)

– May need additional libraries for Python (e.g. numpy)


Key reasons for interest in Spark

Performance

– In-memory architecture greatly reduces disk I/O

– Anywhere from 20-100x faster for common tasks

Productivity

– Concise and expressive syntax, especially compared to prior approaches

– Single programming model across a range of use cases and steps in data lifecycle

– Integrated with common programming languages – Java, Python, Scala, R

– New tools continually reduce skill barrier for access (e.g. SQL for analysts)

Leverages existing investments

– Works well within existing Hadoop ecosystem

Improves with age

– Large and growing community of contributors continuously improve full analytics stack and extend capabilities


Spark Programming Languages

Language   2014      2015
Scala **   84%       71%
Java       38%       31%
Python     38%       58%
R          unknown   18%

Source: survey done by Databricks, summer 2015

** Spark is written in Scala


Spark Application Architecture

A Spark application is initiated from a driver program

Spark execution modes:

– Standalone with the built-in cluster manager

– Use Mesos as the cluster manager

– Use YARN as the cluster manager

– Standalone cluster on Amazon EC2


Spark DataFrames

Distributed collection of data organized in named columns

– Conceptually equivalent to a relational table, R/Python data frames

Supported format and sources

– Can be created from an SQLContext

– From sources such as: JSON, Hive, JDBC, parquet, etc.

Benefits:

– Easier manipulation interface (similar to SQL)

– Higher abstraction for possible optimization


Spark SQL

Provides for relational queries expressed in SQL, HiveQL and Scala

Seamlessly mix SQL queries with Spark programs

DataFrames provide a single interface for efficiently working with structured data including Apache Hive, Parquet and JSON files

Leverages Hive frontend and metastore

– Compatibility with Hive data, queries, and UDFs

– HiveQL limitations may apply

– Not ANSI SQL compliant

Standard connectivity through JDBC/ODBC


SparkR

SparkR is an R package that provides a light-weight front-end to use Apache Spark from R

Exposes the Spark API and allows users to interactively run jobs

Provides a distributed data frame implementation that supports operations like selection, filtering, aggregation etc. (similar to R data frames, dplyr) but on large datasets

– Conceptually equivalent to a table in a relational database or a data frame in R, but with richer optimizations under the hood

– DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing local R data frames

SparkR also supports distributed machine learning using MLlib
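
To make "selection, filtering, aggregation" concrete, here is a minimal single-machine sketch in plain Python. It is not SparkR and uses no Spark API: the helper names (`select`, `where`, `mean_by`) and the sample flight rows are hypothetical, and a list of dictionaries stands in for a distributed data frame.

```python
from collections import defaultdict

flights = [  # made-up rows standing in for a distributed data frame
    {"carrier": "AA", "origin": "BOS", "delay": 12},
    {"carrier": "AA", "origin": "JFK", "delay": 30},
    {"carrier": "DL", "origin": "BOS", "delay": 5},
    {"carrier": "DL", "origin": "BOS", "delay": 45},
]


def select(rows, *cols):
    """Selection: keep only the named columns."""
    return [{c: r[c] for c in cols} for r in rows]


def where(rows, predicate):
    """Filtering: keep only the rows for which the predicate is true."""
    return [r for r in rows if predicate(r)]


def mean_by(rows, group_col, value_col):
    """Aggregation: mean of value_col within each group_col value."""
    sums = defaultdict(lambda: [0, 0])  # group -> [running total, row count]
    for r in rows:
        sums[r[group_col]][0] += r[value_col]
        sums[r[group_col]][1] += 1
    return {g: total / n for g, (total, n) in sums.items()}


from_boston = where(flights, lambda r: r["origin"] == "BOS")
slim = select(from_boston, "carrier", "delay")
avg_delay = mean_by(slim, "carrier", "delay")
print(avg_delay)
```

A distributed data frame offers the same three kinds of operation, but applies them to partitions of the data spread across a cluster rather than to a list held in one process.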


Running SQL Queries from SparkR

A SparkR DataFrame can also be registered as a temporary table in Spark SQL

Registering a DataFrame as a table allows you to run SQL queries over its data

The sql function enables applications to run SQL queries programmatically and returns the result as a DataFrame.
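
The pattern described here (expose tabular data under a table name, then query it with SQL from the program) can be sketched with Python's built-in sqlite3 module. This is a stand-in chosen so the example runs anywhere; it is not Spark SQL or SparkR, and the table name, columns, and rows are made up for illustration.

```python
import sqlite3

rows = [("AA", 12), ("AA", 30), ("DL", 5), ("DL", 45)]  # made-up data

# Step 1: make the data available under a table name
# (the analogue of registering a data frame as a temporary table).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flights (carrier TEXT, delay INTEGER)")
conn.executemany("INSERT INTO flights VALUES (?, ?)", rows)

# Step 2: run a SQL query programmatically and get tabular results back.
result = conn.execute(
    "SELECT carrier, AVG(delay) AS avg_delay FROM flights "
    "GROUP BY carrier ORDER BY carrier"
).fetchall()
print(result)
conn.close()
```

In SparkR the query result comes back as another DataFrame, so SQL steps and data frame operations can be mixed in the same program.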


RStudio

RStudio is a free and open-source integrated development environment (IDE) for R


Web-Based Notebooks

Notebooks:

“interactive computational environment, in which you can combine code execution, rich text, mathematics, plots and rich media”

Zeppelin

– Apache incubator project

– Supports multiple interpreters

• Scala (with Apache Spark), Python (with Apache Spark), SparkSQL, Hive, Markdown and Shell

Jupyter

– Based on IPython

– Supports multiple interpreters

• Python, Scala, R


Demonstration: SparkR, using a Jupyter Notebook


Explore the concept of a notebook

Walk through high-level SparkR concepts

– Data access (file)

– Aggregated Calculations

– Transformations

– Enriching data (adding weather)

– Visualizing data (ggplot.SparkR)


Demonstration: SparkR with dashDB, using RStudio

Walk through high-level concepts

– Connecting R to Spark

– Data access (dashDB)

– Aggregated Calculations

– Enriching data (adding weather via API)

– Visualizing data (ggplot)


Wrap-up


Summary

Key Concepts Covered:

1. R is a popular and powerful data science tool

2. R has limitations

– Single threaded, in-memory solution

– Not scalable on its own

3. IBM’s dashDB provides a means of scaling R

– Can use R to develop in-database analytics applications

– Leverage the powerful MPP capabilities

4. Spark (SparkR) provides another means of scaling R

– Can use R to develop within the Spark framework

– Take advantage of clustered computing power

– Parallels between R data frame and Spark DataFrame


Next Big Data Developers Meetup

Building a Recommendation Engine with Spark MLlib

Tuesday, June 6, 2016 @ 6 PM

IBM Client Center, 1 Rogers St., Cambridge, MA




