These slides give an overview of the technology and the tools used by Data Scientists at Pivotal Data Labs. This includes Procedural Languages like PL/Python, PL/R, PL/Java, PL/Perl and the parallel, in-database machine learning library MADlib. The slides also highlight the power and flexibility of the Pivotal platform from embracing open source libraries in Python, R or Java to using new computing paradigms such as Spark on Pivotal HD.
© Copyright 2013 Pivotal. All rights reserved.
Pivotal Data Labs – Technology and Tools in our Data Scientist’s Arsenal
Srivatsan Ramanujam Senior Data Scientist Pivotal Data Labs 15 Oct 2014
Agenda
- Pivotal: Technology and Tools Introduction
  – Greenplum MPP Database and Pivotal Hadoop with HAWQ
- Data Parallelism – PL/Python, PL/R, PL/Java, PL/C
- Complete Parallelism – MADlib
- Python and R Wrappers – PyMADlib and PivotalR
- Open Source Integration – Spark and PySpark examples
- Live Demos – Pivotal Data Science Tools in Action
  – Topic and Sentiment Analysis
  – Content Based Image Search
Technology and Tools
MPP Architectural Overview
Think of it as multiple PostgreSQL servers: a master coordinates a set of segments (workers), and rows are distributed across the segments by a particular field (or randomly).
Implicit Parallelism – Procedural Languages
Data Parallelism – Embarrassingly Parallel Tasks

- Little or no effort is required to break the problem up into a number of parallel tasks, and there is no dependency (or communication) between those tasks.
- Example: the map() function in Python 2:

>>> x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
>>> map(lambda e: e*e, x)
[1, 4, 9, 16, 25, 36, 49, 64, 81, 100]

www.slideshare.net/SrivatsanRamanujam/python-powered-data-science-at-pivotal-pydata-2013
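As a minimal sketch (not from the slides), the same squaring map can be distributed across worker processes with Python's standard library, analogous to how PL/X functions run on every Greenplum segment in parallel:

```python
# Embarrassingly parallel squaring with a process pool: each worker
# applies the function to its share of the data, with no communication
# between workers.
from multiprocessing import Pool

def square(e):
    return e * e

if __name__ == "__main__":
    x = list(range(1, 11))
    with Pool(4) as pool:
        print(pool.map(square, x))  # [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]
```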
PL/X: X in {pgsql, R, Python, Java, Perl, C, etc.}

- The interpreter/VM of the language 'X' is installed on each node of the Greenplum Database cluster.
- Data Parallelism: PL/X piggybacks on Greenplum/HAWQ's MPP architecture.
- Allows users to write Greenplum/PostgreSQL functions in the R, Python, Java, Perl, pgsql or C languages.

[Architecture diagram: SQL arrives at the master host (with a standby master) and is distributed over the interconnect to the segment hosts, each running multiple segments.]
User Defined Functions – PL/Python Example

- Procedural languages need to be installed on each database in which they are used.
- The syntax is a normal Python function with the definition line replaced by a SQL wrapper; alternatively, a SQL user-defined function with Python inside.

CREATE FUNCTION pymax (a integer, b integer)
RETURNS integer
AS $$
    # normal Python inside the SQL wrapper
    if a > b:
        return a
    return b
$$ LANGUAGE plpythonu;
Returning Results

- Postgres primitive types (int, bigint, text, float8, double precision, date, NULL, etc.)
- Composite types can be returned by creating a composite type in the database:

CREATE TYPE named_value AS (
    name text,
    value integer
);

- Then you can return a list, tuple or dict (not a set) matching the structure of the type:

CREATE FUNCTION make_pair (name text, value integer)
RETURNS named_value
AS $$
    return [ name, value ]
    # or alternatively, as a tuple: return ( name, value )
    # or as a dict: return { "name": name, "value": value }
    # or as an object with attributes .name and .value
$$ LANGUAGE plpythonu;

- For functions which return multiple rows, prefix the return type with "setof".

http://www.slideshare.net/PyData/massively-parallel-process-with-prodedural-python-ian-huston
Returning More Results

You can return multiple results by wrapping them in a sequence (tuple, list or set), an iterator or a generator:

Sequence:

CREATE FUNCTION make_pair (name text)
RETURNS SETOF named_value
AS $$
    return ([ name, 1 ], [ name, 2 ], [ name, 3 ])
$$ LANGUAGE plpythonu;

Generator:

CREATE FUNCTION make_pair (name text)
RETURNS SETOF named_value
AS $$
    for i in range(3):
        yield (name, i)
$$ LANGUAGE plpythonu;
Accessing Packages

- On Greenplum DB, packages must be installed on the individual segment nodes before they can be imported.
  – You can use the "parallel ssh" tool gpssh to conda/pip install on all nodes.
  – Greenplum DB currently ships with Python 2.6 (!)
- Then just import as usual inside the function:

CREATE FUNCTION make_pair (name text)
RETURNS SETOF named_value
AS $$
    import numpy as np
    return ((name, i) for i in np.arange(3))
$$ LANGUAGE plpythonu;
UCI Auto MPG Dataset – A Toy Problem

- Sample task: aerodynamics aside (attributable to body style), what is the effect of engine parameters (bore, stroke, compression_ratio, horsepower, peak_rpm) on the highway mpg of cars?
- Solution: build a linear regression model for each body style (hatchback, sedan) using the features bore, stroke, compression_ratio, horsepower and peak_rpm, with highway_mpg as the target label.
- This is a data-parallel task which can be executed in parallel simply by piggybacking on the MPP architecture: one segment can build a model for hatchbacks while another builds one for sedans.

http://archive.ics.uci.edu/ml/datasets/Auto+MPG
Ridge Regression with scikit-learn in PL/Python

[Code slide: a SQL wrapper encloses the normal Python model-fitting code; the example combines a user-defined function with a user-defined type and a user-defined aggregate.]
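The code on this slide is a screenshot and is not reproduced in the transcript. As a hedged sketch of what the Python body of such a PL/Python UDF might look like (the function name, argument layout and alpha value are illustrative assumptions, not the slide's actual code):

```python
# Sketch of a ridge-regression UDF body: fit on arrays of feature
# vectors and labels, return the coefficients and intercept. In
# PL/Python the result would be returned as a user-defined type.
import numpy as np
from sklearn.linear_model import Ridge

def ridge_coefficients(features, labels, alpha=1.0):
    """features: list of feature vectors; labels: list of targets."""
    X = np.array(features, dtype=float)
    y = np.array(labels, dtype=float)
    model = Ridge(alpha=alpha).fit(X, y)
    return list(model.coef_), float(model.intercept_)

coef, intercept = ridge_coefficients(
    [[1, 2], [2, 3], [3, 5], [4, 4]], [3, 5, 8, 9])
print(len(coef))  # 2
```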
PL/Python + scikit-learn: Model Coefficients

[Result slide: choose the features, build the feature vector, and invoke the UDF; one model is built per body style, and the output records the physical machine on the cluster in which each regression model was built.]
Parallelized R in Pivotal via PL/R: An Example

- With placeholders in SQL, write functions in the native R language.
- Accessible, powerful modeling framework.

http://pivotalsoftware.github.io/gp-r/
Parallelized R in Pivotal via PL/R: An Example (continued)

- Execute the PL/R function.
- A plain and simple table is returned.

http://pivotalsoftware.github.io/gp-r/
Parallelized R in Pivotal via PL/R: Parallel Bagged Decision Trees

- Each tree makes a prediction.
- Aggregate the predictions to obtain the final prediction.

http://pivotalsoftware.github.io/gp-r/
Complete Parallelism
Complete Parallelism – Beyond Data Parallel Tasks

- Data-parallel computation via the PL/X languages only allows us to run 'n' independent models in parallel.
- This works great when we are building one model for each value of a group-by column, but we need parallelized algorithms to build a single model on all the available data.
- For this we use MADlib, an open-source library of parallel, in-database machine learning algorithms.
MADlib: Scalable, in-database Machine Learning
http://madlib.net
MADlib In-Database Functions

Predictive Modeling Library
- Linear Systems
  – Sparse and Dense Solvers
- Matrix Factorization
  – Singular Value Decomposition (SVD)
  – Low-Rank
- Generalized Linear Models
  – Linear Regression
  – Logistic Regression
  – Multinomial Logistic Regression
  – Cox Proportional Hazards Regression
  – Elastic Net Regularization
  – Sandwich Estimators (Huber-White, clustered, marginal effects)
- Machine Learning Algorithms
  – Principal Component Analysis (PCA)
  – Association Rules (Affinity Analysis, Market Basket)
  – Topic Modeling (Parallel LDA)
  – Decision Trees
  – Ensemble Learners (Random Forests)
  – Support Vector Machines
  – Conditional Random Fields (CRF)
  – Clustering (K-means)
  – Cross Validation

Descriptive Statistics
- Sketch-based Estimators
  – CountMin (Cormode-Muthukrishnan)
  – FM (Flajolet-Martin)
  – MFV (Most Frequent Values)
- Correlation
- Summary

Support Modules
- Array Operations
- Sparse Vectors
- Random Sampling
- Probability Functions
Linear Regression: Streaming Algorithm

- Finding linear dependencies between variables.
- How can the regression be computed with a single scan of the data?
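The idea behind the single scan, sketched in plain Python (this is the textbook sufficient-statistics technique, not MADlib's actual implementation): accumulate X^T X and X^T y row by row, then solve the normal equations.

```python
# One-pass (streaming) ordinary least squares: accumulate the
# sufficient statistics X^T X and X^T y in a single scan, then solve.
import numpy as np

def streaming_ols(rows):
    """rows: iterable of (feature_vector, target) pairs, seen once."""
    XtX, Xty = None, None
    for x, y in rows:
        x = np.asarray(x, dtype=float)
        if XtX is None:
            XtX = np.zeros((x.size, x.size))
            Xty = np.zeros(x.size)
        XtX += np.outer(x, x)   # accumulate X^T X
        Xty += x * y            # accumulate X^T y
    return np.linalg.solve(XtX, Xty)

# y = 2*a + 3*b exactly, so the one-pass solver should recover [2, 3].
data = [([1, 1], 5), ([2, 1], 7), ([1, 3], 11), ([4, 2], 14)]
coef = streaming_ols(data)
print(np.round(coef, 6))  # [2. 3.]
```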
Linear Regression: Parallel Computation

The cross-product needed for the regression decomposes across segments:

    X^T y = Σ_i X_i^T y_i

Each segment i computes the partial product X_i^T y_i over its local slice of the data (Segment 1 computes X_1^T y_1, Segment 2 computes X_2^T y_2, and so on), and the master sums the partial results to obtain X^T y.
Performing a linear regression on 10 million rows in seconds
Hellerstein, Joseph M., et al. "The MADlib analytics library: or MAD skills, the SQL." Proceedings of the VLDB Endowment 5.12 (2012): 1700-1711.
Calling MADlib Functions: Fast Training, Scoring

- MADlib allows users to easily create models without moving data out of the system:
  – Model generation
  – Model validation
  – Scoring (evaluation) of new data
- All the data can be used in one model.
- Built-in functionality to create multiple smaller models (e.g. classification grouped by feature).
- Open source lets you tweak and extend methods, or build your own.

SELECT madlib.linregr_train(
    'houses',                     -- table containing training data
    'houses_linregr',             -- table in which to save results
    'price',                      -- column containing the dependent variable
    'ARRAY[1, tax, bath, size]'   -- features included in the model
);
https://www.youtube.com/watch?v=Gur4FS9gpAg
Calling MADlib Functions: Fast Training, Scoring (grouped models)

The same training call accepts an optional grouping column, creating multiple output models (one for each value of bedroom):

SELECT madlib.linregr_train(
    'houses',                      -- table containing training data
    'houses_linregr',              -- table in which to save results
    'price',                       -- column containing the dependent variable
    'ARRAY[1, tax, bath, size]',   -- features included in the model
    'bedroom'                      -- one model per value of bedroom
);
https://www.youtube.com/watch?v=Gur4FS9gpAg
Calling MADlib Functions: Fast Training, Scoring (prediction)

To score new data, join the table to be scored against the table containing the model and call the scoring function:

SELECT houses.*,
       madlib.linregr_predict(
           ARRAY[1, tax, bath, size],   -- features of the row to score
           m.coef                       -- coefficients from the model table
       ) AS predict
FROM houses, houses_linregr m;
Python and R wrappers to MADlib
PivotalR: Bringing MADlib and HAWQ to a Familiar R Interface

- Challenge: harness the familiarity of R's interface together with the performance and scalability benefits of in-database analytics.
- Simple solution: translate R code into SQL.

PivotalR:

d <- db.data.frame("houses")
houses_linregr <- madlib.lm(price ~ tax
                            + bath
                            + size,
                            data = d)

Generated SQL:

SELECT madlib.linregr_train(
    'houses',
    'houses_linregr',
    'price',
    'ARRAY[1, tax, bath, size]');
https://github.com/pivotalsoftware/PivotalR
PivotalR: Bringing MADlib and HAWQ to a Familiar R Interface (factors)

# Build a regression model with a different intercept term for each
# state (state=1 as baseline). Note that PivotalR supports automated
# indicator coding a la as.factor().

d <- db.data.frame("houses")
houses_linregr <- madlib.lm(price ~ as.factor(state)
                            + tax
                            + bath
                            + size,
                            data = d)
https://github.com/pivotalsoftware/PivotalR
PivotalR Design Overview

1. PivotalR translates R into SQL.
2. The SQL to execute is shipped via RPostgreSQL to the database/Hadoop cluster with MADlib, where the data lives (no data is held on the R side).
3. Only the computation results come back to R.

- Call MADlib's in-database machine learning functions directly from R.
- Syntax is analogous to the native R functions.
- Data doesn't need to leave the database.
- All heavy lifting, including model estimation and computation, is done in the database.
https://github.com/pivotalsoftware/PivotalR
PyMADlib: Power of MADlib + Flexibility of Python

- Current PyMADlib algorithms:
  – Linear Regression
  – Logistic Regression
  – K-Means
  – LDA
- Extras: support for categorical variables, pivoting.

http://pivotalsoftware.github.io/pymadlib/
Visualization
[Slide: survey of visualization tools, commercial and open source.]
Hack one when needed – pandas_via_psql

[Diagram: a SQL client connected to the database.]

http://vatsan.github.io/pandas_via_psql/
Integration with Open Source – (Py)Spark Example
Apache Spark Project – Quick Overview

- Apache project, originated in the AMPLab at UC Berkeley.
- Supported on Pivotal Hadoop 2.0!

http://spark-summit.org/wp-content/uploads/2013/10/Zaharia-spark-summit-2013-matei.pdf
MapReduce vs. Spark
http://spark-summit.org/wp-content/uploads/2013/10/Zaharia-spark-summit-2013-matei.pdf
Data Parallelism in PySpark – A Simple Example

- Next we'll take the UCI automobile dataset example from PL/Python and demonstrate how to run it in PySpark.
Scikit-Learn on PySpark – UCI Auto Dataset Example

- This is in essence similar to the earlier PL/Python example, except that the data is stored on HDFS (Pivotal HD) and Spark is the compute platform in place of HAWQ/Greenplum.
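The PySpark code itself is a screenshot and is not reproduced in this transcript. The shape of the computation (group rows by body style, then fit one model per group, as an RDD groupByKey/mapValues pipeline would) can be mimicked in plain Python; the tiny inline dataset and column layout below are invented for illustration:

```python
# Plain-Python mimic of the per-group PySpark pattern:
# rdd.map(parse).groupByKey().mapValues(fit) -- here with a dict.
from collections import defaultdict
import numpy as np

def fit_ols(rows):
    """Least-squares fit over one group's (features, target) rows."""
    X = np.array([r[0] for r in rows], dtype=float)
    y = np.array([r[1] for r in rows], dtype=float)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

# (body_style, (features, highway_mpg)) pairs, invented for illustration.
pairs = [
    ("hatchback", ([1.0, 3.0, 102.0], 30.0)),
    ("hatchback", ([1.0, 3.2, 110.0], 28.0)),
    ("hatchback", ([1.0, 2.9,  95.0], 33.0)),
    ("sedan",     ([1.0, 3.3, 120.0], 25.0)),
    ("sedan",     ([1.0, 3.1, 105.0], 27.0)),
    ("sedan",     ([1.0, 3.5, 140.0], 22.0)),
]

grouped = defaultdict(list)          # the "groupByKey" step
for key, value in pairs:
    grouped[key].append(value)

models = {k: fit_ols(v) for k, v in grouped.items()}  # the "mapValues" step
print(sorted(models))  # ['hatchback', 'sedan']
```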
Large Scale Topic and Sentiment Analysis of Tweets
Social Media Demo
Pivotal GNIP Decahose Pipeline

- Source: the Twitter Decahose (~55 million tweets/day), ingested from HTTP and sunk to HDFS.
- Parallel parsing of the JSON into external tables via PXF, driven by nightly cron jobs.
- Topic analysis through MADlib pLDA.
- Unsupervised sentiment analysis (PL/Python).
- Visualization with D3.js.
http://www.slideshare.net/SrivatsanRamanujam/a-pipeline-for-distributed-topic-and-sentiment-analysis-of-tweets-on-pivotal-greenplum-database
Data Science + Agile = Quick Wins

- The team: 1 data scientist, 2 agile developers, 1 designer (part-time), 1 project manager (part-time).
- Duration: 3 weeks!
Live Demo – Topic and Sentiment Analysis
47 Pivotal Confidential–Internal Use Only
Content Based Image Search
CBIR Live Demo
Content Based Information Retrieval - Task
http://blog.pivotal.io/pivotal/features/content-based-image-retrieval-using-pivotal-hd-or-pivotal-greenplum-database
CBIR - Components
http://blog.pivotal.io/pivotal/features/content-based-image-retrieval-using-pivotal-hd-or-pivotal-greenplum-database
Live Demo – Content Based Image Search
http://blog.pivotal.io/pivotal/features/content-based-image-retrieval-using-pivotal-hd-or-pivotal-greenplum-database
Appendix
Acknowledgements

Ian Huston, Woo Jung, Sarah Aerni, Gautam Muralidhar, Regunathan Radhakrishnan, Ronert Obst, Hai Qian, the MADlib Engineering Team, Sumedh Mungee, Girish Lingappa