BUILT FOR THE SPEED OF BUSINESS

Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Science (from Pivotal Open Source Hub Meetup)

DESCRIPTION

Slides from the Pivotal Open Source Hub Meetup "Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Science!"

As data science becomes a key differentiator in all industries, from large corporations to startups, sharing ideas and methods in the community helps teams get to results quickly. The data science team at Pivotal leverages and contributes to this community of publicly available and open source technologies as part of their practice. We will share the resources we use by highlighting specific toolkits for building models (e.g. MADlib, R) and visualization (e.g. Gephi and Circos), along with their benefits and limitations, using examples from Pivotal's data science engagements. At the end of this session we hope to have answered the questions: Where can I get started with data science? Which toolkit is most appropriate for building a model with my dataset? How can I visualize my results to have the greatest impact?

Bio: Sarah Aerni is a member of the Pivotal Data Science team with a focus on healthcare and life science. She has a background in the field of Bioinformatics, developing tools to help biomedical researchers understand their data. She holds a B.S. in Biology with a specialization in Bioinformatics and a minor in French Literature from UCSD, and an M.S. and Ph.D. in Biomedical Informatics from Stanford University. During her time as a researcher she focused on the interface between machine learning and biology, building computational models enabling research for a broad range of fields in biomedicine. She also co-founded a start-up providing informatics services to researchers and small companies. At Pivotal she works with customers in life science and healthcare, building models to derive insight and business value from their data.

Page 1: Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Science (from Pivotal Open Source Hub Meetup)

Page 2: Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Science (from Pivotal Open Source Hub Meetup)

© Copyright 2014 Pivotal. All rights reserved.

Data Science as a Commodity: How to use MADlib, R, and other Publicly Available and Open Source Tools for Data Science

Pivotal OSS Meetups

Sarah Aerni, Pivotal Senior Data Scientist

@itweetsarah

January 28, 2014

Page 3: Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Science (from Pivotal Open Source Hub Meetup)

What we will cover in today’s Meetup

•  What is data science, big data, buzzword, buzzword?
•  What are some examples of data science in action?
•  What do I do at Pivotal?
•  Who are our data scientists?
•  Why is open source software important for data science?
•  What tools does our team use? For NLP? For optimization? For regression?
•  What do I do with loads of data?
•  How can I create good models?
•  What types of open source tools can I use to build models?
•  How can I build a quick app?
•  What can I do to get started analyzing text data?
•  Which tools exist to create visualizations of my data that I can understand?

Page 4: Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Science (from Pivotal Open Source Hub Meetup)

What we will not cover #notdatascience

Page 5: Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Science (from Pivotal Open Source Hub Meetup)

Instead: Practical Data Science Tools #useful

http://blog.gopivotal.com/p-o-v/the-eightfold-path-of-data-science – Kaushik Das

Page 6: Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Science (from Pivotal Open Source Hub Meetup)

Instead: Practical Data Science Tools #useful

“At companies where there is no framework for operationalization of the models, PowerPoint is where models go to die!” – Hulya Farinas

http://venturebeat.com/2013/12/03/how-to-revolutionize-healthcare-get-data-scientists-and-app-developers-together/

Page 7: Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Science (from Pivotal Open Source Hub Meetup)

Instead: Practical Data Science Tools #useful

“The use of statistical and machine learning techniques on big multi-structured data – in a distributed computing environment – to identify correlations and causal relationships, classify and predict events, identify patterns and anomalies, and infer probabilities, interest, and sentiment.” – Annika Jimenez

http://blog.gopivotal.com/news-2/annika-jimenez-on-disruptive-data-science-at-the-strata-conference

Page 8: Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Science (from Pivotal Open Source Hub Meetup)

DATA IS THE NEW CENTER OF GRAVITY

Data > Application

“BIG DATA IS THE NEW NORMAL”

“‘BIG DATA’ BECOMES ‘DATA’ ONCE AGAIN”

Page 9: Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Science (from Pivotal Open Source Hub Meetup)

What Can “Small Data” Scientists Bring on Their “Big Data” Journey?

http://factspy.net/the-difference-between-geeks-vs-nerds/

Page 10: Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Science (from Pivotal Open Source Hub Meetup)

What Can “Small Data” Scientists Bring on Their “Big Data” Journey?

Flat files

Distributed computing

HDFS

In-memory model building

Cloud computing

MapReduce

Command-line tools

Databases

Command-line tools

Small Data

Big Data

Many tools and approaches are being adapted to big data technologies

Page 13: Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Science (from Pivotal Open Source Hub Meetup)

Basic DS Tools: From Command-line to GUI

Ian Huston, Alex Kagoshima, Ronert Obst

•  Quick-and-dirty tricks using command-line tools
   –  Fast feedback (interactive)
   –  Fast to process
   –  Easy to write, hard to read
   –  Background processing (screen)

•  Large volumes of data → automatically parallel environments (e.g. GPDB) may be faster

•  Python and R
   –  RStudio
   –  IPython (IPython Notebook)

Page 14: Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Science (from Pivotal Open Source Hub Meetup)

Favorite Python and R packages and resources

Ian Huston, Alex Kagoshima, Ronert Obst

•  Python
   –  NumPy
   –  SciPy
   –  scikit-learn (machine learning package)
   –  statsmodels
   –  pandas
   –  PyMC
   –  IPython (IPython Notebook)
   –  matplotlib

Page 15: Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Science (from Pivotal Open Source Hub Meetup)

Favorite Python and R packages, resources, and more

Ian Huston, Alex Kagoshima, Ronert Obst

•  R
   –  ggplot
   –  reshape
   –  plyr
   –  Shiny
   –  Good support for time series analyses
   –  RStudio (weave)
   –  foreach, parallel
   –  taskviews
   –  parboost

Page 16: Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Science (from Pivotal Open Source Hub Meetup)

A New Platform for a New Era

What do I do at Pivotal?

Cloud Fabric “The new OS”

Data Fabric “The new Database”

App Fabric “The new Middleware”

“The new Hardware”

DATA-DRIVEN APPLICATION DEVELOPMENT

Page 17: Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Science (from Pivotal Open Source Hub Meetup)

Pivotal Big Data Technology: HAWQ

Think of it as multiple PostgreSQL servers

Segments/Workers

Master

Rows are distributed across segments by a particular field (or randomly)

Download database version at http://www.gopivotal.com/products/pivotal-greenplum-database

Page 18: Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Science (from Pivotal Open Source Hub Meetup)

Performance Through Parallelism

•  Automatic parallelization
   –  Load and query like any database
   –  Tables automatically distributed across nodes

•  Analytics-oriented query optimization

•  Scalable MPP architecture
   –  All nodes can scan and process in parallel
   –  Linear scalability by adding nodes

Download database version at http://www.gopivotal.com/products/pivotal-greenplum-database

Page 19: Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Science (from Pivotal Open Source Hub Meetup)

Data Science Tools for Big Data

COMMERCIAL

OPEN SOURCE (OR FREE)

PL/R, PL/Python, PL/Java

Page 23: Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Science (from Pivotal Open Source Hub Meetup)

Making sense of your “big data”

•  Large volumes of data may be difficult to understand
   –  ~100 tables
   –  Tens of thousands of columns

•  How do you build models that use all the data? Score all the data?

•  Where do you focus your effort?
   –  Getting a rapid grasp of relevant fields is important
   –  Scanning lots of data is slow; creating models with huge numbers of features is possible, but it is generally better to understand your data
   –  Identify columns with little or no variation or only null values

•  Functions for these checks exist in MADlib
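As a rough illustration of the column checks described on this slide, here is a minimal Python sketch that flags columns holding only nulls or a single distinct value. It is not MADlib's implementation; the function names (`profile_columns`, `uninformative_columns`) and the sample rows are hypothetical and exist only for this example.

```python
from collections import Counter


def profile_columns(rows):
    """Summarize each column of a list of dict rows in a single pass."""
    stats = {}
    for row in rows:
        for name, value in row.items():
            col = stats.setdefault(name, {"nulls": 0, "values": Counter()})
            if value is None:
                col["nulls"] += 1
            else:
                col["values"][value] += 1
    total = len(rows)
    return {
        name: {
            "null_fraction": col["nulls"] / total if total else 0.0,
            "distinct": len(col["values"]),
        }
        for name, col in stats.items()
    }


def uninformative_columns(rows):
    """Return columns that are all null or hold at most one distinct value."""
    summary = profile_columns(rows)
    return sorted(
        name
        for name, s in summary.items()
        if s["null_fraction"] == 1.0 or s["distinct"] <= 1
    )


# Hypothetical sample data: "unit" never varies and "notes" is always null.
rows = [
    {"site": "A", "depth": 10.5, "unit": "m", "notes": None},
    {"site": "B", "depth": 12.0, "unit": "m", "notes": None},
    {"site": "A", "depth": None, "unit": "m", "notes": None},
]
flagged = uninformative_columns(rows)
print(flagged)
```

On a real system the same summary would be computed inside the database so that the data never has to be exported; the sketch only shows the kind of per-column statistics that make such columns easy to spot.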

Page 24: Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Science (from Pivotal Open Source Hub Meetup)

MADlib In-Database Functions

Predictive Modeling Library

Linear Systems
•  Sparse and Dense Solvers

Matrix Factorization
•  Singular Value Decomposition (SVD)
•  Low-Rank

Generalized Linear Models
•  Linear Regression
•  Logistic Regression
•  Multinomial Logistic Regression
•  Cox Proportional Hazards Regression
•  Elastic Net Regularization
•  Sandwich Estimators (Huber-White, clustered, marginal effects)

Machine Learning Algorithms
•  Principal Component Analysis (PCA)
•  Association Rules (Affinity Analysis, Market Basket)
•  Topic Modeling (Parallel LDA)
•  Decision Trees
•  Ensemble Learners (Random Forests)
•  Support Vector Machines
•  Conditional Random Field (CRF)
•  Clustering (K-means)
•  Cross Validation

Descriptive Statistics

Sketch-based Estimators
•  CountMin (Cormode-Muthukrishnan)
•  FM (Flajolet-Martin)
•  MFV (Most Frequent Values)

Correlation

Summary

Support Modules

Array Operations

Sparse Vectors

Random Sampling

Probability Functions

Page 25: Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Science (from Pivotal Open Source Hub Meetup)

MADlib in Action: Regression on Billions of Rows

Rashmi Raghu

[Image: Drilling into the San Andreas Fault at Parkfield, California. Credit: Stephen H. Hickman, USGS]

•  Input Data
   –  10s of millions of rows from data collected at multiple drill testing sites
   –  Sensor data for drills during operation, including rate of penetration, depth of penetration, weight on drill bit and more

•  Data Massaging and Review
   –  Rapid summarization of many columns of data to identify outliers and missing data and remove them from analysis
   –  Used window functions to construct a moving average (smoothing) of all the features and the dependent variable

•  Model
   –  Linear regression on the complete dataset
   –  K-means clustering to determine similarities of sites
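The moving-average smoothing mentioned above was done with SQL window functions in the database. As a minimal sketch of the same idea outside the database, the Python function below computes a trailing moving average in a single pass. The function name, the window size, and the sample readings are assumptions made for illustration, not values from the engagement.

```python
from collections import deque


def moving_average(values, window):
    """Trailing moving average: mean of the current value and up to window - 1 preceding values."""
    if window < 1:
        raise ValueError("window must be at least 1")
    recent = deque()
    running_sum = 0.0
    smoothed = []
    for value in values:
        recent.append(value)
        running_sum += value
        if len(recent) > window:
            # Drop the oldest value once the window is full.
            running_sum -= recent.popleft()
        smoothed.append(running_sum / len(recent))
    return smoothed


# Hypothetical sensor readings with one spike at index 3.
rate_of_penetration = [10.0, 12.0, 11.0, 30.0, 12.0, 13.0]
smoothed = moving_average(rate_of_penetration, window=3)
print(smoothed)
```

The spike is spread over the following window positions instead of dominating a single row, which is the smoothing effect the slide refers to.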

Page 26: Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Science (from Pivotal Open Source Hub Meetup)

Linear Regression: Streaming Algorithm

•  Finding linear dependencies between variables

•  How to compute with a single scan?

Page 27: Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Science (from Pivotal Open Source Hub Meetup)

Linear Regression: Parallel Computation

$$X^T y = \sum_i x_i^T y_i$$

Page 28: Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Science (from Pivotal Open Source Hub Meetup)

Linear Regression: Parallel Computation

Segment 1 computes $X_1^T y_1$ and Segment 2 computes $X_2^T y_2$; the Master adds the partial results:

$$X^T y = X_1^T y_1 + X_2^T y_2$$
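The decomposition on these slides can be sketched in a few lines of Python. This assumes the usual least-squares solution of the normal equations, (X^T X) beta = X^T y, and applies the same row-wise sum to X^T X as the slides show for X^T y. Each "segment" makes one scan over its own rows, and the "master" only adds the small partial results. It is an illustration of the idea, not MADlib's code; the function names and the toy data are hypothetical.

```python
def partial_sums(rows):
    """One scan over a segment's rows: accumulate X^T X and X^T y."""
    d = len(rows[0][0])
    xtx = [[0.0] * d for _ in range(d)]
    xty = [0.0] * d
    for x, y in rows:
        for i in range(d):
            xty[i] += x[i] * y
            for j in range(d):
                xtx[i][j] += x[i] * x[j]
    return xtx, xty


def merge(a, b):
    """Master step: add the partial sums from two segments."""
    (xtx_a, xty_a), (xtx_b, xty_b) = a, b
    xtx = [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(xtx_a, xtx_b)]
    xty = [p + q for p, q in zip(xty_a, xty_b)]
    return xtx, xty


def solve(xtx, xty):
    """Solve (X^T X) beta = X^T y by Gauss-Jordan elimination with partial pivoting."""
    d = len(xty)
    aug = [row[:] + [rhs] for row, rhs in zip(xtx, xty)]
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(d):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][d] / aug[i][i] for i in range(d)]


# Toy data following y = 1 + 2x, split across two hypothetical segments.
# Each row is (features with a leading 1 for the intercept, target).
segment_1 = [([1.0, 0.0], 1.0), ([1.0, 1.0], 3.0)]
segment_2 = [([1.0, 2.0], 5.0), ([1.0, 3.0], 7.0)]

combined = merge(partial_sums(segment_1), partial_sums(segment_2))
beta = solve(*combined)
print(beta)
```

Only the d-by-d matrix and the length-d vector travel between segments and master, so the amount of data exchanged does not grow with the number of rows.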

Page 30: Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Science (from Pivotal Open Source Hub Meetup)

Performing a linear regression on 10 million rows in seconds

Hellerstein, Joseph M., et al. "The MADlib analytics library: or MAD skills, the SQL." Proceedings of the VLDB Endowment 5.12 (2012): 1700-1711.

Page 31: Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Science (from Pivotal Open Source Hub Meetup)

Calling MADlib Functions: Fast Training, Scoring

•  MADlib allows users to easily create models without moving data out of the system
   –  Model generation
   –  Model validation
   –  Scoring (evaluation of) new data

•  All the data can be used in one model

•  Built-in functionality to create multiple smaller models (e.g. classification grouped by feature)

•  Open source lets you tweak and extend methods, or build your own

SELECT madlib.linregr_train(
    'houses',
    'houses_linregr',
    'price',
    'ARRAY[1, tax, bath, size]');

Arguments, in order:
   –  madlib.linregr_train: MADlib model function
   –  'houses': table containing training data
   –  'houses_linregr': table in which to save results
   –  'price': column containing dependent variable
   –  'ARRAY[1, tax, bath, size]': features included in the model

Page 32: Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Science (from Pivotal Open Source Hub Meetup)

Calling MADlib Functions: Fast Training, Scoring

Passing a grouping column as a fifth argument creates multiple output models, one for each value of bedroom:

SELECT madlib.linregr_train(
    'houses',
    'houses_linregr',
    'price',
    'ARRAY[1, tax, bath, size]',
    'bedroom');

Arguments, in order:
   –  madlib.linregr_train: MADlib model function
   –  'houses': table containing training data
   –  'houses_linregr': table in which to save results
   –  'price': column containing dependent variable
   –  'ARRAY[1, tax, bath, size]': features included in the model
   –  'bedroom': create multiple output models (one for each value of bedroom)

Page 33: Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Science (from Pivotal Open Source Hub Meetup)

Calling MADlib Functions: Fast Training, Scoring

Train the model, then score rows using the coefficients stored in the model table:

SELECT madlib.linregr_train(
    'houses',
    'houses_linregr',
    'price',
    'ARRAY[1, tax, bath, size]');

SELECT houses.*,
       madlib.linregr_predict(ARRAY[1, tax, bath, size], m.coef) AS predict
FROM houses, houses_linregr m;

   –  madlib.linregr_predict: MADlib model scoring function
   –  houses: table with data to be scored
   –  houses_linregr: table containing model

Page 34: Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Science (from Pivotal Open Source Hub Meetup)

PivotalR: Bringing MADlib and HAWQ to a familiar R interface

Woo Jung

•  Challenge: Want to harness the familiarity of R’s interface and the performance & scalability benefits of in-DB analytics

•  Simple solution: Translate R code into SQL

PivotalR:

d <- db.data.frame("houses")
houses_linregr <- madlib.lm(price ~ tax
                                  + bath
                                  + size
                            , data=d)

SQL Code:

SELECT madlib.linregr_train(
    'houses',
    'houses_linregr',
    'price',
    'ARRAY[1, tax, bath, size]');

http://gopivotal.github.io/PivotalR/
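To make the "translate R code into SQL" idea concrete, here is a toy Python sketch that turns a simple formula string into the `madlib.linregr_train` call shown on this slide. It is not how PivotalR is implemented; the function name `formula_to_linregr_sql` is hypothetical, only additive formulas of plain column names are handled, and no SQL escaping is attempted.

```python
def formula_to_linregr_sql(formula, source_table, out_table):
    """Turn 'y ~ a + b' into a madlib.linregr_train call (toy translator)."""
    response, _, terms = formula.partition("~")
    features = [t.strip() for t in terms.split("+") if t.strip()]
    # The leading 1 is the intercept column, as in the slide's ARRAY[1, ...].
    independent = "ARRAY[" + ", ".join(["1"] + features) + "]"
    args = [source_table, out_table, response.strip(), independent]
    # Illustration only: real code must validate or escape identifiers.
    quoted = ", ".join("'" + a + "'" for a in args)
    return "SELECT madlib.linregr_train(" + quoted + ");"


sql = formula_to_linregr_sql("price ~ tax + bath + size", "houses", "houses_linregr")
print(sql)
```

The point of the design is that only the generated SQL text and the small result set cross the connection; the table itself stays in the database.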

Page 35: Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Science (from Pivotal Open Source Hub Meetup)

PivotalR: Bringing MADlib and HAWQ to a familiar R interface

Woo Jung

PivotalR:

# Build a regression model with a different
# intercept term for each state
# (state=1 as baseline).
# Note that PivotalR supports automated
# indicator coding a la as.factor()
d <- db.data.frame("houses")
houses_linregr <- madlib.lm(price ~ as.factor(state)
                                  + tax
                                  + bath
                                  + size
                            , data=d)

http://gopivotal.github.io/PivotalR/

Page 36: Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Science (from Pivotal Open Source Hub Meetup)

PivotalR Design Overview

[Diagram: PivotalR translates R → SQL on the client (no data here) and sends the SQL to execute through RPostgreSQL to the database with MADlib (data lives here); computation results are returned to R.]

•  Call MADlib’s in-DB machine learning functions directly from R
•  Syntax is analogous to native R functions
•  Data doesn’t need to leave the database
•  All heavy lifting, including model estimation & computation, is done in the database

Woo Jung

http://gopivotal.github.io/PivotalR/

Page 37: Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Science (from Pivotal Open Source Hub Meetup)

PivotalR: Current Features

MADlib Functionality
•  Linear Regression
•  Logistic Regression
•  Elastic Net
•  ARIMA
•  Marginal Effects
•  Cross Validation
•  Bagging
•  summary on model objects
•  Automated Indicator Variable Coding (as.factor)
•  predict

•  $ [ [[ $<- [<- [[<-
•  is.na
•  + - * / %% %/% ^
•  & | !
•  == != > < >= <=
•  merge
•  by
•  db.data.frame
•  as.db.data.frame
•  preview
•  sort
•  c mean sum sd var min max length colMeans colSums
•  db.connect db.disconnect db.list db.objects db.existsObject delete
•  dim names
•  content

And more ... (SQL wrapper)

http://gopivotal.github.io/PivotalR/

Page 38: Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Science (from Pivotal Open Source Hub Meetup)

Woo Jung

http://gopivotal.github.io/PivotalR/

Page 39: Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Science (from Pivotal Open Source Hub Meetup)

Woo Jung

http://www.rstudio.com/shiny/ http://gopivotal.github.io/PivotalR/

Page 40: Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Science (from Pivotal Open Source Hub Meetup)

Shiny Showcase: Example Web Apps in R

•  Users can choose input parameters with sliders, drop-downs, and text fields.

•  HTML/JavaScript knowledge not required.

http://www.rstudio.com/shiny/

Page 43: Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Science (from Pivotal Open Source Hub Meetup)

http://d3js.org/

D3 Data-Driven Documents

Page 45: Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Science (from Pivotal Open Source Hub Meetup)

PyMADlib

•  Python wrapper for MADlib

http://nbviewer.ipython.org/gist/vatsan/5275846

Page 47: Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Science (from Pivotal Open Source Hub Meetup)

Procedural Languages in Big Data Science

•  HAWQ & PL/X can take advantage of “data parallel” tasks by performing analyses in parallel (embarrassingly parallel tasks)

•  Little or no effort is required to break up the problem into a number of parallel tasks, and there exists no dependency (or communication) between those parallel tasks

•  Examples of “data parallel” problems:
   –  Counting words in documents
   –  Genome-Wide Association Study
   –  Studying network anomalies

[Diagram: Master Servers and Segment Servers linked by the Network Interconnect. Using SQL & R, each segment processes its own documents independently: Doc1 → Stem1 → Count1, Doc2 → Stem2 → Count2, ..., DocM → StemM → CountM.]

http://gopivotal.github.io/gp-r/
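The word-counting example can be sketched in Python to show what "no dependency between tasks" means in practice. A thread pool stands in for the database segments here, which is an assumption for illustration only: Python threads do not give real CPU parallelism for this workload. The sample documents are made up for the example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_words(document):
    """Independent task: count lower-cased, whitespace-separated words in one document."""
    return Counter(document.lower().split())


# Hypothetical documents; in the slide's setting each would live on a segment.
documents = [
    "big data is the new normal",
    "data science on big data",
    "the new database",
]

# Each document is processed on its own; no task needs another task's result.
with ThreadPoolExecutor(max_workers=3) as pool:
    per_document = list(pool.map(count_words, documents))

# Optional final step: combine the independent results.
total = sum(per_document, Counter())
print(total.most_common(3))
```

Because the per-document function never reads shared state, the same function could be run by any number of workers without coordination, which is the property that PL/X functions exploit.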

Page 48: Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Science (from Pivotal Open Source Hub Meetup)

Structure of input table for PL/R function

Vivek Ramamurthy

•  Topology: Hubs connected to multiple terminal points

•  Using historical readings, solve a linear program to establish baseline behavior, for example number of shipments

•  Detecting anomalies within sub-networks on future observations

Column              Description
Network ID          ID of the network. 300K in total.
Topology            Array of integers defining the topology tree.
Network Readings    Array of readings from network terminal points over (say) a week.

[Diagram: example topology tree with nodes 0, A, B, C, D and terminal readings]

Page 49: Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Science (from Pivotal Open Source Hub Meetup)

Performance Analysis

Vivek Ramamurthy

[Figure: Execution time v/s number of networks. x-axis: Number of networks (in thousands), 0 to 300; y-axis: Time (seconds), 0 to 700]

Number of networks    Time/network (ms)    Total time (seconds)
500                   6.604                3.30
1,000                 3.637                3.64
5,000                 2.822                14.11
10,000                2.356                23.56
50,000                2.160                108.02
100,000               2.142                214.20
150,000               2.162                324.29
200,000               2.142                428.48
250,000               2.138                534.69
300,000               2.132                639.85

Page 52: Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Science (from Pivotal Open Source Hub Meetup)

Performance Analysis

Vivek Ramamurthy

R package used                  optim        quadprog    Rsymphony    Rglpk
Single network in R (time)      ~60 s        6.3 s       0.145 s      0.181 s
300K networks in PL/R (time)    ~84 hrs      5.87 hrs    10.7 min     14.6 min
Time per network in PL/R        1005.2 ms    70.44 ms    2.13 ms      2.92 ms

COIN-OR: Computational Infrastructure for Operations Research
   –  Libraries for linear and non-linear programming, integer programming
   –  SYMPHONY: Callable library in COIN-OR for solving mixed integer linear programs
   –  http://www.coin-or.org/

GLPK: GNU Linear Programming Kit
   –  Used for large-scale LPs, MIPs and related problems
   –  http://www.gnu.org/software/glpk/
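The totals in the performance table follow directly from the per-network times. As a quick worked check, the Python snippet below multiplies each reported time per network by 300,000 networks and converts to hours. The helper name `total_hours` is hypothetical; the input figures are the ones reported on the slide.

```python
def total_hours(ms_per_network, networks):
    """Total time in hours, assuming the cost per network stays constant."""
    return ms_per_network * networks / 1000.0 / 3600.0


NETWORKS = 300_000
# Time per network in PL/R (milliseconds), as reported on the slide.
per_network_ms = {"optim": 1005.2, "quadprog": 70.44, "Rsymphony": 2.13, "Rglpk": 2.92}

totals = {name: total_hours(ms, NETWORKS) for name, ms in per_network_ms.items()}
for name, hours in totals.items():
    print(f"{name}: {hours:.2f} h ({hours * 60:.1f} min)")
```

The results agree with the slide's "300K networks in PL/R" row to within rounding, which shows how strongly the choice of solver package dominates the end-to-end run time.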

Page 53: Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Science (from Pivotal Open Source Hub Meetup)

Natural language processing

Data sources
•  Text sources: documents, books, emails
•  Speech: phone logs, conversations

Common tasks/tools in NLP (NLP processing pipeline)
•  Sentence detection
•  Tokenization
•  Morphological stemming
•  Stop word removal
•  Word-sense disambiguation
•  Part-of-Speech tagging
•  Syntactic parsing
•  Semantic role labeling
•  Entity recognition
•  Reference resolution
•  Event processing
•  …

Applications
•  Word clouds
•  Topic modeling
•  Sentiment analysis
•  Machine translation
•  Document classification
•  Document summarization
•  Language generation
•  Search
•  Question answering
•  Information Extraction

Niels Kasch

Page 57: Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Science (from Pivotal Open Source Hub Meetup)

Open source tools for common NLP tasks

WORD CLOUDS
   Relevant NLP tools:
   –  Stemming/lemmatization
   –  Stop word removal
   –  Tokenization
   Open source software:
   •  GPText
   •  Apache UIMA
   •  OpenNLP (Java)
   •  NLTK (Python)
   •  WordNet
   •  Pytagcloud

TOPIC MODELING / TEXT CLASSIFICATION
   Relevant NLP tools:
   –  Language detection
   –  Stemming/lemmatization
   –  Stop word removal
   –  Tokenization
   Open source software:
   •  MADlib (PLDA)
   •  gensim (LSA & LDA package for Python)
   •  https://code.google.com/p/language-detection/

INFORMATION EXTRACTION
   Relevant NLP tools:
   –  Syntactic parsing
   –  Sentence detection
   –  Language detection
   –  Tokenization
   –  Entity extraction
   –  Relationship extraction
   Open source software:
   •  GPText and MADlib
   •  OpenNLP
   •  NLTK
   •  Stanford CoreNLP (incl. POS tagger, NER, parser, etc.)

Niels Kasch
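The word-cloud steps listed above (tokenization, stop word removal, stemming) can be sketched with the Python standard library alone. This is a deliberately naive illustration: the stop word list is a tiny made-up sample, and `naive_stem` is a crude suffix stripper standing in for a real stemmer from a toolkit such as NLTK.

```python
import re
from collections import Counter

# Tiny illustrative stop word list; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "the", "is", "of", "to", "in", "for", "are"}


def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def naive_stem(token):
    """Crude suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def term_frequencies(text):
    """Tokenize, drop stop words, stem, and count: the input to a word cloud."""
    tokens = [t for t in tokenize(text) if t not in STOP_WORDS]
    return Counter(naive_stem(t) for t in tokens)


freqs = term_frequencies("Models and modeling: the model is built for scoring models.")
print(freqs.most_common())
```

Stemming merges "models", "modeling" and "model" into one term, so the resulting counts reflect concepts instead of surface forms; a word cloud would then size each term by its count.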

Page 59: Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Science (from Pivotal Open Source Hub Meetup)

Topic Analysis – MADlib pLDA

[Diagram: Social Media data flows through Natural Language Processing with GPText (Tokenizer; Stemming, frequency filtering; Filter relevant content; Align Data) to prepare the dataset for Topic Modeling; the MADlib Topic Model then produces Topic composition, Topic Clouds, and a Topic Graph.]

Srivatsan Ramanujam

Page 60: Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Science (from Pivotal Open Source Hub Meetup)

Is there more? What’s next?

blog.gopivotal.com/tag/data-science-tech

blog.gopivotal.com/tag/data-science
