BUILT FOR THE SPEED OF BUSINESS

Pivotal OSS meetup - MADlib and PivotalR


DESCRIPTION

With the explosion of big data, the need for fast and inexpensive analytics solutions has become a key basis of competition in many industries. Extracting the value of big data with analytics can be complex and requires advanced skills. At Pivotal, we are building open-source solutions (MADlib, PivotalR, PyMadlib) to simplify this process for the user while maintaining the efficiency necessary for big data analysis. This talk provides information about MADlib, an open-source library of SQL-based algorithms for machine learning, data mining and statistics that run at large scale within a database engine, with no need for data import/export to other tools. It gives an overview of the library’s architecture and compares various statistical methods with those available in Apache Mahout. We also introduce PivotalR, an R-based wrapper for MADlib that gives data scientists and programmers access to the power of MADlib along with the ease of use of R.

Citation preview

Page 1: Pivotal OSS meetup - MADlib and PivotalR

1 Pivotal Confidential–Internal Use Only

BUILT FOR THE SPEED OF BUSINESS

Page 2: Pivotal OSS meetup - MADlib and PivotalR

Big Data Analytics
MADlib and PivotalR: Scalable Machine Learning for Massively Parallel Databases

Rahul Iyer, Senior Software Developer, Predictive Analytics
March 4, 2014

Pivotal OSS Meetups

Page 3: Pivotal OSS meetup - MADlib and PivotalR


Agenda for the talk

•  Introduce MADlib, a distributed machine learning library for SQL users

•  How is scalability achieved by distributing the computation?

•  Performance metrics + comparisons with Mahout

•  A new R interface to access all of MADlib’s features

•  How does it get big-data results with small-data efforts?

•  Demo to showcase PivotalR

Page 4: Pivotal OSS meetup - MADlib and PivotalR


What is Big data?

•  Volumes of data … •  In various formats … •  From multiple sources …

and Analytics?

•  Generate insights … •  for informed decision-making

Page 5: Pivotal OSS meetup - MADlib and PivotalR

Data → Information → Insights: Traditional analytics pipeline

[Diagram: Time-to-Insights across the stages Data Prep, DB Extract, and DB Import, with the intermediate files sample.csv, spec.docx, and scores.csv]

Page 6: Pivotal OSS meetup - MADlib and PivotalR

Data → Information → Insights: The MAD approach

[Diagram: enterprise data held in several RDBMS instances; Time-to-Insights covers Data Prep, Model, and Score]

•  Reduced data movement
•  Billions of rows in minutes

Page 7: Pivotal OSS meetup - MADlib and PivotalR


What is MADlib?

The MADlib project was initiated in 2011 by Greenplum architects and Joe Hellerstein of the University of California, Berkeley.

•  MAD stands for:

•  lib stands for SQL library of:
   •  advanced (mathematical, statistical, machine learning)
   •  parallel & scalable in-database functions

Page 8: Pivotal OSS meetup - MADlib and PivotalR


UrbanDictionary.com: mad (adj.): an adjective used to enhance a noun.

1- dude, you got skills. 2- dude, you got mad skills.

Page 9: Pivotal OSS meetup - MADlib and PivotalR


Which platforms does it run on?

HDFS

HAWQ Impala

GPDB PostgreSQL

(Partly ported)

Page 10: Pivotal OSS meetup - MADlib and PivotalR

MPP (Massively Parallel Processing)

Shared-Nothing Database Architecture

•  Master Servers: query planning & dispatch
•  Network Interconnect
•  Segment Servers: query processing & data storage
•  External Sources: loading, streaming, etc.
•  SQL, MapReduce

Page 11: Pivotal OSS meetup - MADlib and PivotalR

Supervised Learning
•  Generalized Linear models
   •  Linear Regression
   •  Logistic Regression
   •  Multinomial logit …
•  Decision Trees and Random Forest
•  Naive Bayes Classification
•  Support Vector Machines
•  Cox-Prop Hazards
and more …

Analytics Pipeline

Predictive Modeling

Data Exploration
•  Summary function
•  Sketch estimators
•  Percentiles
•  Correlation matrix

Data Prep
•  Aggregation
•  Normalizing
•  Pivoting
•  Filtering

Text analytics
•  CRF
•  LDA

Unsupervised Learning
•  Association Rules
•  k-Means Clustering
•  Low-rank Matrix Factorization
•  PCA
•  SVD Matrix Factorization

Data mining

Sampling methods
•  Cross Validation

Scoring
•  Linear Regression
•  Logistic Regression
•  Naïve Bayes …

Statistical metrics
•  Descriptive statistics
•  Goodness of fit
•  Inferential statistics
•  ROC

Model fitness

Support modules
•  Array operations
•  Sparse Vectors
•  Probability functions

Page 12: Pivotal OSS meetup - MADlib and PivotalR


Example usage

Train a model

Predict for new data

Page 13: Pivotal OSS meetup - MADlib and PivotalR

How do we implement scalability? Example: Linear Regression

•  Finding linear dependencies between variables

y ≈ c0 + c1 · x1 + c2 · x2 ?

   y   |  x1  | x2
-------+------+-----
 10.14 |    0 | 0.3
 11.93 | 0.69 | 0.6
 13.57 |  1.1 | 0.9
 14.17 | 1.39 | 1.2
 15.25 | 1.61 | 1.5
 16.15 | 1.79 | 1.8

The x1 and x2 columns form the design matrix X; the y column is the vector of dependent variables y.

[Plot: regressor (y) against predictor (x1)]

Page 14: Pivotal OSS meetup - MADlib and PivotalR


Challenges in computing OLS solution

Page 15: Pivotal OSS meetup - MADlib and PivotalR

Challenges in computing OLS solution

$$X^T X = \begin{pmatrix} a & c \\ b & d \end{pmatrix} \begin{pmatrix} a & b \\ c & d \end{pmatrix}$$

Row $(a \;\; b)$ of $X$ is stored on Segment 1 and row $(c \;\; d)$ on Segment 2.

Page 16: Pivotal OSS meetup - MADlib and PivotalR

Challenges in computing OLS solution

$$X^T X = \begin{pmatrix} a & c \\ b & d \end{pmatrix} \begin{pmatrix} a & b \\ c & d \end{pmatrix} = \begin{pmatrix} a^2 + c^2 & \cdot \\ \cdot & \cdot \end{pmatrix}$$

Data across nodes are multiplied!

Page 17: Pivotal OSS meetup - MADlib and PivotalR

Challenges in computing OLS solution

$$X^T X = \begin{pmatrix} a & c \\ b & d \end{pmatrix} \begin{pmatrix} a & b \\ c & d \end{pmatrix} = \begin{pmatrix} a^2 + c^2 & ab + cd \\ \cdot & \cdot \end{pmatrix}$$

Data across nodes are multiplied!

Page 18: Pivotal OSS meetup - MADlib and PivotalR

Challenges in computing OLS solution

$$X^T X = \begin{pmatrix} a & c \\ b & d \end{pmatrix} \begin{pmatrix} a & b \\ c & d \end{pmatrix} = \begin{pmatrix} a^2 + c^2 & ab + cd \\ ba + dc & b^2 + d^2 \end{pmatrix}$$

Looks like the result can be decomposed

Page 19: Pivotal OSS meetup - MADlib and PivotalR

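Challenges in computing OLS solution

Let’s change perspective

$$X^T X = \begin{pmatrix} a^2 + c^2 & ab + cd \\ ba + dc & b^2 + d^2 \end{pmatrix} = \begin{pmatrix} a \\ b \end{pmatrix} \begin{pmatrix} a & b \end{pmatrix} + \begin{pmatrix} c \\ d \end{pmatrix} \begin{pmatrix} c & d \end{pmatrix}$$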
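The change of perspective can be checked numerically. The following is a minimal Python sketch, not MADlib code: the helper names and the values chosen for a, b, c, d are illustrative assumptions. It forms one outer product per row, as each segment could do with only its own data, and compares the sum with the directly computed product.

```python
def outer(u, v):
    """Outer product u v^T of two vectors, returned as a list of rows."""
    return [[ui * vj for vj in v] for ui in u]


def add(m1, m2):
    """Element-wise sum of two equally sized matrices."""
    return [[p + q for p, q in zip(r1, r2)] for r1, r2 in zip(m1, m2)]


def gram(rows):
    """X^T X computed directly from all rows of X."""
    n_cols = len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) for j in range(n_cols)]
            for i in range(n_cols)]


# Illustrative values for (a, b) and (c, d); each row lives on its own segment.
segment_1_row = [1.0, 2.0]
segment_2_row = [3.0, 4.0]

# Each segment needs only its own row to form its contribution.
partial_1 = outer(segment_1_row, segment_1_row)
partial_2 = outer(segment_2_row, segment_2_row)

decomposed = add(partial_1, partial_2)
direct = gram([segment_1_row, segment_2_row])
print(decomposed)  # [[10.0, 14.0], [14.0, 20.0]]
print(decomposed == direct)  # True
```

Because the sum of per-row outer products is associative, the partial results can be combined in any order, which is what allows the work to be spread over segments.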

Page 20: Pivotal OSS meetup - MADlib and PivotalR

Linear Regression: Streaming Algorithm

How to compute with a single table scan?

$$\hat{\beta} = (X^T X)^{-1} X^T y$$

Terms to compute: $X^T X$ and $X^T y$

Page 21: Pivotal OSS meetup - MADlib and PivotalR

Linear Regression: Parallel Computation

$$X^T y = X_1^T y_1 + X_2^T y_2$$

$X_1^T y_1$ is computed on Segment 1, $X_2^T y_2$ on Segment 2, and the sum $X^T y$ on the Master.

Page 22: Pivotal OSS meetup - MADlib and PivotalR

Basic Building Block: User-Defined Aggregates

Input rows:

       x        | y
----------------+---
 (1,0,3,…,5)    | 3
 (-2,4,5,…,2)   | 2
 …              | …

Aggregation phase 1 on each node (map):

1.  Initialize: $(A, b) = (0, 0)$
2.  Transition for all rows: $(A, b) = (A, b) + (x \cdot x^T, \; x \cdot y)$
3.  Send $(A, b)$

Aggregation phase 2 on master node (reduce):

1.  Merge: $(A, b) = (A, b) + (A, b)$
2.  Finalize: $\hat{\beta} = \mathrm{solve}(A, b) = A^{-1} \cdot b$
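The four aggregate functions can be sketched in plain Python. This is an illustration under assumptions rather than MADlib's implementation: the function names, the use of Gaussian elimination in the finalize step, and the split of the six example rows into two "segments" are all hypothetical choices made for the example.

```python
def initialize(k):
    """State (A, b) = (0, 0) for k coefficients."""
    return [[0.0] * k for _ in range(k)], [0.0] * k


def transition(state, x, y):
    """(A, b) = (A, b) + (x x^T, x y) for a single row."""
    A, b = state
    for i in range(len(x)):
        for j in range(len(x)):
            A[i][j] += x[i] * x[j]
        b[i] += x[i] * y
    return A, b


def merge(s1, s2):
    """Add two partial states received from different nodes."""
    (A1, b1), (A2, b2) = s1, s2
    A = [[p + q for p, q in zip(r1, r2)] for r1, r2 in zip(A1, A2)]
    return A, [p + q for p, q in zip(b1, b2)]


def finalize(state):
    """Solve A beta = b by Gaussian elimination with partial pivoting."""
    A, b = state
    k = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, k):
            factor = M[r][col] / M[col][col]
            for c in range(col, k + 1):
                M[r][c] -= factor * M[col][c]
    beta = [0.0] * k
    for i in reversed(range(k)):
        tail = sum(M[i][j] * beta[j] for j in range(i + 1, k))
        beta[i] = (M[i][k] - tail) / M[i][i]
    return beta


def aggregate(rows):
    """Phase 1 on one node: a single scan over that node's rows."""
    state = initialize(3)
    for y, x1, x2 in rows:
        state = transition(state, [1.0, x1, x2], y)  # leading 1.0 = intercept
    return state


# (y, x1, x2) rows from the linear regression example, split over two segments.
segment_1 = [(10.14, 0.0, 0.3), (11.93, 0.69, 0.6), (13.57, 1.1, 0.9)]
segment_2 = [(14.17, 1.39, 1.2), (15.25, 1.61, 1.5), (16.15, 1.79, 1.8)]

beta_parallel = finalize(merge(aggregate(segment_1), aggregate(segment_2)))
beta_single = finalize(aggregate(segment_1 + segment_2))
print(beta_parallel)
```

The state sent between nodes has a fixed size (a k-by-k matrix and a k-vector), independent of the number of rows scanned, so only the small summaries travel across the interconnect.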

Page 23: Pivotal OSS meetup - MADlib and PivotalR

Problem solved? … Not Yet

•  Many ML solutions are iterative without analytical formulations

[Flowchart: Initialize problem → Perform optimization step → Has converged? If false, perform another optimization step; if true, return results]

Page 24: Pivotal OSS meetup - MADlib and PivotalR


Use a convex optimization framework

 # segments | # variables | # rows (million) | v0.3 (s) | v0.2.1beta (s) | v0.1alpha (s)
------------+-------------+------------------+----------+----------------+---------------
          6 |          10 |               10 |    4.447 |          9.501 |         1.337
          6 |          20 |               10 |    4.688 |          11.60 |         1.874
          6 |          40 |               10 |    6.843 |          17.96 |         3.828
          6 |          80 |               10 |    13.28 |          52.94 |         12.98
          6 |         160 |               10 |    35.66 |          181.4 |         51.20
          6 |         320 |               10 |    186.2 |          683.8 |         333.4
         12 |          10 |               10 |    2.115 |          4.756 |        0.9600
         12 |          20 |               10 |    2.432 |          5.760 |         1.212
         12 |          40 |               10 |    3.420 |          9.010 |         2.046
         12 |          80 |               10 |    6.797 |          26.48 |         6.469
         12 |         160 |               10 |    17.71 |          90.95 |         25.67
         12 |         320 |               10 |    92.41 |          341.5 |         166.6
         18 |          10 |               10 |    1.418 |          3.206 |        0.6197
         18 |          20 |               10 |    1.648 |          3.805 |         1.003
         18 |          40 |               10 |    2.335 |          5.994 |         1.183
         18 |          80 |               10 |    4.461 |          17.73 |         4.314
         18 |         160 |               10 |    11.90 |          60.58 |         17.14
         18 |         320 |               10 |    61.66 |          227.7 |         111.4
         24 |          10 |               10 |    1.197 |          2.383 |        0.3904
         24 |          20 |               10 |    1.276 |          2.869 |        0.4769
         24 |          40 |               10 |    1.698 |          4.475 |         1.151
         24 |          80 |               10 |    3.363 |          13.35 |         3.263
         24 |         160 |               10 |    8.840 |          45.48 |         13.10
         24 |         320 |               10 |    46.18 |          171.7 |         84.59

Figure 4: Linear-regression execution times

… search. In our prototype implementation in MADlib, we picked up one such simple greedy algorithm, called stochastic (or sometimes, “incremental”) gradient descent (SGD) [33, 6], that goes back to the 1960s. SGD is an approximation of gradient methods that is useful when the convex function we are considering, f(x), has the form:

$$f(x) = \sum_{i=1}^{N} f_i(x)$$

If each of the $f_i$ is convex, then so is $f$ [8, pg. 38]. Notice that all problems in Table 2 are of this form: intuitively each of these models is finding some model (i.e., a vector w) that is scored on many different training examples. SGD leverages the above form to construct a rough estimate of the gradient of $f$ using the gradient of a single term: for example, the estimate if we select $i$ is the gradient of $f_i$ (that we denote $G_i(x)$). The resulting algorithm is then described as:

$$x \leftarrow x - \alpha N \cdot G_i(x) \qquad (1)$$

This approximation is guaranteed to converge to an optimal solution [26].

Using the MADlib framework. In our setting, each tuple in the input table for an analysis task encodes a single $f_i$. We use the micro-programming interfaces of Sections 3.2 and 3.3 to perform the mapping from the tuples to the vector representation that is used in Eq. 1. Then, we observe Eq. 1 is simply an expression over each tuple (to compute $G_i(x)$) which is then averaged together. Instead of averaging a single number, we average a vector of numbers. Here, we use the macro-programming provided by MADlib to handle all data access, spills to disk, parallelized scans, etc. Finally, the macro programming layer helps us test for convergence (which is implemented with either a Python combination or C driver). Using this approach, we were able to add in implementations of all the models in Table 2 in a matter of days.

In an upcoming paper we report initial experiments showing that our SGD based approach achieves higher performance than prior data mining tools for some datasets [13].

Figure 6: The Archetypical Convex Function $f(x) = x^2$.

 Application          | Objective
----------------------+-----------------------------------------------------------------
 Least Squares        | $\sum_{(u,y) \in \Omega} (x^T u - y)^2$
 Lasso [38]           | $\sum_{(u,y) \in \Omega} (x^T u - y)^2 + \mu \lVert x \rVert_1$
 Logistic Regression  | $\sum_{(u,y) \in \Omega} \log(1 + \exp(-y x^T u))$
 Classification (SVM) | $\sum_{(u,y) \in \Omega} (1 - y x^T u)_+$
 Recommendation       | $\sum_{(i,j) \in \Omega} (L_i^T R_j - M_{ij})^2 + \mu \lVert L, R \rVert_F^2$
 Labeling (CRF) [40]  | $\sum_k \left[ \sum_j x_j F_j(y_k, z_k) - \log Z(z_k) \right]$

Table 2: Models currently implemented in MADlib using the SGD-based approach.
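To make Eq. 1 concrete, here is a minimal Python sketch for the Least Squares objective of Table 2, where each tuple $(u, y)$ contributes $f_i(x) = (x^T u - y)^2$ and $G_i(x) = 2 (x^T u - y) u$. The function name, the toy data, the step size, and the choice to cycle through tuples in order (the "incremental" variant) instead of sampling them at random are assumptions made for illustration; MADlib's actual implementation is not shown here.

```python
def sgd_least_squares(data, alpha, epochs):
    """Incremental gradient descent for f(x) = sum_i (x^T u_i - y_i)^2."""
    n = len(data)
    x = [0.0] * len(data[0][0])
    for _ in range(epochs):
        for u, y in data:  # one tuple encodes one term f_i
            residual = sum(xj * uj for xj, uj in zip(x, u)) - y
            grad_i = [2.0 * residual * uj for uj in u]  # G_i(x)
            # Eq. 1: x <- x - alpha * N * G_i(x)
            x = [xj - alpha * n * gj for xj, gj in zip(x, grad_i)]
    return x


# Toy, noise-free data generated from y = 1 + 2 t; u = (1, t) includes an intercept.
data = [([1.0, float(t)], 1.0 + 2.0 * t) for t in range(4)]

model = sgd_least_squares(data, alpha=0.01, epochs=2000)
print([round(v, 4) for v in model])
```

The step size was picked so that $2 \alpha N \lVert u \rVert^2 < 2$ for every tuple in this toy data set, which keeps each update stable; other data would need a different value.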

-  Each step has an analytical formulation that can be performed in parallel

1. Lack of portable multi-pass iterations

•  WITH RECURSIVE not reliable basis for portability
•  User-defined driver functions in Python
   –  Outer loops not performance-critical
•  Compromise: Different user interface

CREATE TEMP TABLE temp
INSERT INTO temp SELECT step(...) FROM ...
SELECT converged(...) FROM temp, ...
SELECT result(...) FROM temp

[Flow: while converged(...) is false, the step is run again; once it is true, the result is selected]
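The driver pattern above (step, converged, result) can be sketched in Python with a one-dimensional k-means as the iterative algorithm. All names, the toy points, and the starting centroids are hypothetical; in the pattern on the slide the step would run as SQL inside the database and the state would live in the temp table, whereas here a plain list plays that role.

```python
def step(points, centroids):
    """One pass over the data: assign each point to its nearest centroid, then average."""
    sums = [0.0] * len(centroids)
    counts = [0] * len(centroids)
    for p in points:
        nearest = min(range(len(centroids)), key=lambda k: abs(p - centroids[k]))
        sums[nearest] += p
        counts[nearest] += 1
    # Keep the old centroid if no point was assigned to it.
    return [s / c if c else old for s, c, old in zip(sums, counts, centroids)]


def converged(previous, current, tolerance=1e-9):
    """True when no centroid moved by more than the tolerance."""
    return all(abs(a - b) <= tolerance for a, b in zip(previous, current))


def driver(points, initial_centroids, max_iterations=100):
    """Outer loop: cheap control logic wrapped around the expensive step."""
    state = list(initial_centroids)  # stands in for the temp table
    for iteration in range(1, max_iterations + 1):
        new_state = step(points, state)
        done = converged(state, new_state)
        state = new_state
        if done:
            return state, iteration
    return state, max_iterations


points = [1.0, 1.5, 2.0, 8.0, 8.5, 9.0]
centroids, iterations = driver(points, initial_centroids=[0.0, 10.0])
print(centroids, iterations)  # [1.5, 8.5] 2
```

Only the loop control runs in the driver; each call to `step` is a full scan over the data, which is why the outer loop is not the performance-critical part.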

Page 25: Pivotal OSS meetup - MADlib and PivotalR

Architecture

[Diagram: layered architecture]

•  User Interface
•  High-level Abstraction Layer (iteration controller, ...)
•  Functions for Inner Loops (implements convex optimization)
•  Low-level Abstraction Layer (matrix operations, C++ to DB type bridge, …)
•  RDBMS Built-in Functions
•  RDBMS Query Processing (Greenplum, PostgreSQL, Hadoop with SQL)

Implementation languages labeled on the diagram: Python; SQL, generated per specification; C++

3. Lack of language support for linear algebra

•  C++ Abstraction Layer uses Eigen
•  (Dense) Vectors and matrices: DOUBLE PRECISION[]
•  Example:

AnyType
solve::run(AnyType& args) {
    MappedMatrix A = args[0].getAs<MappedMatrix>();
    MappedColumnVector b = args[1].getAs<MappedColumnVector>();

    MutableMappedColumnVector x = allocateArray<double>(A.cols());
    x = A.colPivHouseholderQr().solve(b);
    return x;
}

Performance:

•  No unnecessary copying
•  No internal type conversion

The MADlib Vision

•  Academic and industry contributions
•  Think of “CRAN for databases”
   –  Repository of open-source ML algorithms
   –  This time with data parallelism in mind
•  Open-Source Framework

Eigen, BSD License

Page 26: Pivotal OSS meetup - MADlib and PivotalR

Performance trends

•  Disk I/O is not always the bottleneck
•  Performance tuning is essential
•  Overhead for single query very low (fraction of a second)
•  Greenplum achieves nearly perfect speedup

[Chart: OLS on 10 million rows (in seconds), y-axis 0 to 40, plotted against # segments (6, 12, 18, 24) with one series per # variables (20, 40, 80, 160)]

•  Overhead for a single row is very low (fraction of a second)

•  Able to achieve close to linear speedup
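The speedup claim can be checked against the v0.3 timings reported in Figure 4. The short Python sketch below is only arithmetic on those published numbers; the dictionary layout and the `speedup` helper are illustrative choices, and the 6-segment run is taken as the baseline.

```python
# v0.3 execution times in seconds from Figure 4 (10 million rows),
# keyed by number of variables, then by number of segments.
timings = {
    10: {6: 4.447, 12: 2.115, 18: 1.418, 24: 1.197},
    320: {6: 186.2, 12: 92.41, 18: 61.66, 24: 46.18},
}


def speedup(times, baseline_segments=6):
    """Speedup of each configuration relative to the baseline segment count."""
    base = times[baseline_segments]
    return {seg: base / t for seg, t in sorted(times.items())}


for n_vars, times in timings.items():
    for seg, s in speedup(times).items():
        ideal = seg / 6
        print(f"{n_vars:>3} variables, {seg:>2} segments: "
              f"speedup {s:.2f} (ideal {ideal:.2f})")
```

For 320 variables, going from 6 to 24 segments gives a speedup of roughly 4.0 against an ideal of 4.0; for 10 variables the figure is closer to 3.7, consistent with fixed per-query overhead weighing more on short runs.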

Page 27: Pivotal OSS meetup - MADlib and PivotalR

Performance Comparison with Apache Mahout

•  Analytics WorkBench (http://www.gopivotal.com/big-data/analytics-workbench)
   –  1000-node cluster located in Las Vegas
   –  Over 24,000 processors, 48 TB of memory, and 24 PB of raw disk storage
   –  8000+ Map Task Capacity, 5000+ Reduce Task Capacity
   –  Infrastructure: Pivotal HD 1.1
•  Mahout v0.7
•  Test matrix*
   –  Data size
      ▪  KDD Cup 2009 Orange marketing churn data (16.5 GB)
      ▪  Enron data (1.9 GB)
      ▪  Census data 2000 (1.7 GB)
   –  Algorithms: Logistic Regression and K-means
   –  Algorithm parameters (e.g. convergence threshold, # iterations)

Courtesy Grace Gee (Engineer, SOAR Program, Pivotal)

* Reporting a subset of results from whitepaper.

Page 28: Pivotal OSS meetup - MADlib and PivotalR

Logistic Regression

[Chart: “MADlib & Mahout Logistic Regression Scalability Across Number of Attributes”. Y-axis: Time in Minutes (0 to 700); x-axis: log(Number of Rows), 1,000,000 to 1E+09. Series: Census data, 48 attributes [Mahout]; Census data, 48 attributes [MADlib]]

Courtesy Grace Gee (Engineer, SOAR Program, Pivotal)

Page 29: Pivotal OSS meetup - MADlib and PivotalR

Logistic Regression

[Chart: y-axis: Time in Minutes (0 to 9); x-axis: log(Number of Rows), 1,000,000 to 1E+09]

Courtesy Grace Gee (Engineer, SOAR Program, Pivotal)

Page 30: Pivotal OSS meetup - MADlib and PivotalR

K-Means

[Chart: “MADlib & Mahout K-means Scalability Across Number of Rows”. Y-axis: Time in Min (0 to 350); x-axis: log(Number of Rows), 1,000,000 to 1E+09. Series: Census data, 48 attributes [Mahout]; Census data, 48 attributes [MADlib]]

Courtesy Grace Gee (Engineer, SOAR Program, Pivotal)

Page 31: Pivotal OSS meetup - MADlib and PivotalR

Random Forest

[Chart: y-axis: Time in Min (0 to 1600); x-axis: log(Number of Rows), 1,000,000 to 1E+09. Series: Census data, 46 attributes [Mahout]; Census data, 46 attributes [MADlib]]

Courtesy Grace Gee (Engineer, SOAR Program, Pivotal)

Page 32: Pivotal OSS meetup - MADlib and PivotalR

Part 1 Summary

MADlib is an easy-to-use library that provides a SQL interface to fast, scalable machine learning algorithms …

Page 33: Pivotal OSS meetup - MADlib and PivotalR


But not all Data Scientists speak SQL … Accessing Scalability through R

Page 34: Pivotal OSS meetup - MADlib and PivotalR


Why R?

O’Reilly: 2013 Data Science Salary Survey

From the report: “The preponderance of R and Python usage is more surprising … two most commonly used individual tools, even above Excel. R and Python are likely popular because they are easily accessible and effective open source tools.”

Page 35: Pivotal OSS meetup - MADlib and PivotalR


Execution in Database

•  All data stays in DB: R objects merely point to DB objects

•  All model estimation and heavy lifting done in DB by MADlib

•  R → SQL translation done in the R client

• Only strings of SQL and model output transferred across DBI

PivotalR Design Overview

[Diagram: PivotalR (R → SQL, no data here) sends the SQL to execute through RPostgreSQL to the database with MADlib (data lives here); computation results and model output are returned]

•  Call MADlib’s in-DB machine learning functions directly from R
•  Syntax is analogous to native R function
•  Data doesn’t need to leave the database
•  All heavy lifting, including model estimation & computation, is done in the database

http://gopivotal.github.io/PivotalR/

Courtesy Woo Jung and Hai Qian
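The idea that client-side objects merely point to database objects, and that only SQL strings and small results cross the connection, can be illustrated with a short Python sketch. This is not PivotalR's implementation: the classes `DbTable` and `DbColumn`, the Python `lookat` helper, the in-memory SQLite database, and the sample rows are all assumptions made for the illustration.

```python
import sqlite3


class DbColumn:
    """Hypothetical lazy expression: stores SQL text, never the rows themselves."""

    def __init__(self, conn, table, expr):
        self.conn = conn
        self.table = table
        self.expr = expr

    @property
    def sql(self):
        return f"select {self.expr} from {self.table}"

    def mean(self):
        # Still no query is run; only the SQL expression changes.
        return DbColumn(self.conn, self.table, f"avg({self.expr})")


class DbTable:
    """Hypothetical reference to a table that stays in the database."""

    def __init__(self, conn, table):
        self.conn = conn
        self.table = table

    def __getitem__(self, column):
        return DbColumn(self.conn, self.table, column)


def lookat(obj, limit=None):
    """Only here is SQL sent to the database and a small result fetched."""
    sql = obj.sql if limit is None else f"{obj.sql} limit {int(limit)}"
    return obj.conn.execute(sql).fetchall()


# In-memory SQLite stands in for the database; table and values are made up.
conn = sqlite3.connect(":memory:")
conn.execute("create table dt_abalone (id integer, rings integer)")
conn.executemany("insert into dt_abalone values (?, ?)",
                 [(1, 9), (2, 10), (3, 11)])

x = DbTable(conn, "dt_abalone")
avg_rings = x["rings"].mean()
print(avg_rings.sql)      # select avg(rings) from dt_abalone
print(lookat(avg_rings))  # [(10.0,)]
```

Building the query and executing it are separate steps, so expressions can be composed on the client without moving any rows until a result is explicitly requested.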

Page 36: Pivotal OSS meetup - MADlib and PivotalR

Some of the current features

A wrapper of MADlib

•  Linear regression
•  Logistic regression
•  Elastic Net
•  ARIMA
•  Table summary
•  Categorical variable: as.factor()
•  $ [ [[ $<- [<- [[<-
•  is.na
•  + - * / %% %/% ^
•  & | !
•  == != > < >= <=
•  merge
•  by
•  db.data.frame
•  as.db.data.frame
•  preview
•  sort
•  c mean sum sd var min max length colMeans colSums
•  db.connect db.disconnect db.list db.objects db.existsObject delete
•  dim
•  names
•  content

And more ... (SQL wrapper)

•  predict

Page 37: Pivotal OSS meetup - MADlib and PivotalR

Demonstration

# Load the library
library(PivotalR)

# Connect to the database "madlib" on port 14526
db.connect(port = 14526, dbname = "madlib")

# List all the tables in the active connection
db.objects()

# Create an R object that references a table in the database
x <- db.data.frame("madlibtestdata.dt_abalone")

# Report the number of rows and columns in the table
dim(x)

# Column names within the table
names(x)

# Database query object representing "select rings from madlibtestdata.dt_abalone"
x$rings

# Pull 10 rows of data from the table back into the R environment
lookat(x, 10)

# Query object representing "select avg(rings) from madlibtestdata.dt_abalone"
mean(x$rings)

# Execute the query and report back the result
lookat(mean(x$rings))

# Run a linear regression within the database and return a model object
fit <- madlib.lm(rings ~ . - id | sex, data = x)

# Create a query object representing scoring the model in the database
predict(fit, x)

# Query object calculating the mean square error of the model
mean((x$rings - predict(fit, x))^2)

# Add a calculated factor column to the database query object
x$sex <- as.factor(x$sex)

# Calculate a logistic regression model
m0 <- madlib.glm(resp ~ age, family = "binomial", data = dbbank)

# Perform stepwise feature selection
mstep <- step(m0, scope = list(lower = ~age,
                               upper = ~age + factor(marital) + factor(education) +
                                       factor(housing) + factor(loan) + factor(job)))

Page 38: Pivotal OSS meetup - MADlib and PivotalR

We’re looking for contributors

•  Browse our help pages
   –  Start page: madlib.net
   –  GitHub pages
      •  github.com/madlib/madlib (SQL)
      •  github.com/gopivotal/pivotalr (R)
      •  github.com/gopivotal/pymadlib (Python)
   –  Use our product and report issues:
      •  jira.madlib.net (Issue tracker)
      •  [email protected] (User forum)

•  Can use PostgreSQL or Greenplum Database Community Edition for installations on multiple platforms

Page 39: Pivotal OSS meetup - MADlib and PivotalR

Credits

Leaders and contributors:

Gavin Sherry, Caleb Welton, Joseph Hellerstein, Christopher Ré, Zhe Wang, Florian Schoppmann, Hai Qian, Shengwen Yang, Aaron Feng, and many others …


Page 40: Pivotal OSS meetup - MADlib and PivotalR


Thank you for your attention

Important links:

Product email: [email protected]

Product site: madlib.net

Speaker email: [email protected]