28
© Copyright 2018 Pivotal Software, Inc. All rights Reserved. Version 1.0 Hubert Zhang [email protected] Oct 2018 Greenplum & Data Science

Greenplum & Data Science - 数字化三部曲€¦ · Greenplum Architecture Data Science In Greenplum PL/Container MADlib Q+A . What is Data Science Data science is an interdisciplinary

  • Upload
    others

  • View
    34

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Greenplum & Data Science - 数字化三部曲€¦ · Greenplum Architecture Data Science In Greenplum PL/Container MADlib Q+A . What is Data Science Data science is an interdisciplinary

© Copyright 2018 Pivotal Software, Inc. All rights Reserved. Version 1.0

Hubert Zhang [email protected] Oct 2018

Greenplum & Data Science

Page 2: Greenplum & Data Science - 数字化三部曲€¦ · Greenplum Architecture Data Science In Greenplum PL/Container MADlib Q+A . What is Data Science Data science is an interdisciplinary

Cover w/ Image

Agenda

■  What Is Data Science

■  Greenplum Architecture

■  Data Science In Greenplum

■  PL/Container

■  MADlib

■  Q+A

Page 3: Greenplum & Data Science - 数字化三部曲€¦ · Greenplum Architecture Data Science In Greenplum PL/Container MADlib Q+A . What is Data Science Data science is an interdisciplinary

What is Data Science

●  Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from data in various forms, both structured and unstructured ----Wikipedia

●  Data scientists are a new breed of analytical data expert who have the technical skills to solve complex problems – and the curiosity to explore what problems need to be solve ----SAS

Page 4: Greenplum & Data Science - 数字化三部曲€¦ · Greenplum Architecture Data Science In Greenplum PL/Container MADlib Q+A . What is Data Science Data science is an interdisciplinary

Greenplum Architecture

Standby Master

Master Host

SQL

Interconnect

Segment Host

Node1

Segment Host

Node2

Segment Host

Node3

Segment Host

NodeN

Local Storage

Other RDBMSes Spark GemFire

Cloud Object

Storage HDFS Kafka ETL

Spring Cloud

Data Flow

GPCC Visualize

Cook

Prepare

Shop

Greenplum Big Data Platform

Page 5: Greenplum & Data Science - 数字化三部曲€¦ · Greenplum Architecture Data Science In Greenplum PL/Container MADlib Q+A . What is Data Science Data science is an interdisciplinary

Greenplum Support on Data Science

How to prepare and cook data in greenplum

•  Procedure Language: PL/Python, PL/R, PL/Container

•  GPText

•  Apache MADlib

Page 6: Greenplum & Data Science - 数字化三部曲€¦ · Greenplum Architecture Data Science In Greenplum PL/Container MADlib Q+A . What is Data Science Data science is an interdisciplinary

Procedural Language: PL/Python

User defined function written in Python ●  CREATE TABLE sales (id int, year int, qtr int, day int, region text) DISTRIBUTED BY (id) ;

INSERT INTO sales VALUES (1, 2014, 1,1, 'usa'), (2, 2002, 2,2, 'europe'), (3, 2014, 3,3, 'asia'), (4, 2014, 4,4, 'usa'), (5, 2014, 1,5, 'europe'), (6, 2014, 2,6, 'asia'), (7, 2002, 3,7, 'usa') ;

●  CREATE OR REPLACE FUNCTION pydbaccess(a integer)

RETURNS text AS $$ rv = plpy.execute("SELECT * FROM sales ORDER BY id", 5) region = rv[a]["region"] return region $$ language plpythonu;

●  SELECT pydbaccess(2) ;

Page 7: Greenplum & Data Science - 数字化三部曲€¦ · Greenplum Architecture Data Science In Greenplum PL/Container MADlib Q+A . What is Data Science Data science is an interdisciplinary

Procedural Language: PL/R

User defined function written in R ●  CREATE OR REPLACE FUNCTION r_norm(n integer, mean float8,

std_dev float8) RETURNS float8[ ] AS $$ x<-rnorm(n,mean,std_dev) return(x) $$ LANGUAGE 'plr';

●  CREATE TABLE test_norm_var

AS SELECT id, r_norm(10,0,1) as x FROM (SELECT generate_series(1,30:: bigint) AS ID) foo DISTRIBUTED BY (id);

●  SELECT mypytest(2) ;

Page 8: Greenplum & Data Science - 数字化三部曲€¦ · Greenplum Architecture Data Science In Greenplum PL/Container MADlib Q+A . What is Data Science Data science is an interdisciplinary

Procedural Language: Pain Point How to install data science library of Python and R

●  Only DBA is able to install the third party library

●  Data science bundle pre-installed with 100+ libraries.

Untrusted language for PL/Python and PL/R ●  Only DBA could create function in untrusted language

●  Malicious UDF code may harm DB system.

Page 9: Greenplum & Data Science - 数字化三部曲€¦ · Greenplum Architecture Data Science In Greenplum PL/Container MADlib Q+A . What is Data Science Data science is an interdisciplinary

Data Science Bundle

Page 10: Greenplum & Data Science - 数字化三部曲€¦ · Greenplum Architecture Data Science In Greenplum PL/Container MADlib Q+A . What is Data Science Data science is an interdisciplinary

Only Superuser can Create UDF in Untrusted Languages

System(“rm -rf /data”)

Page 11: Greenplum & Data Science - 数字化三部曲€¦ · Greenplum Architecture Data Science In Greenplum PL/Container MADlib Q+A . What is Data Science Data science is an interdisciplinary

The Problem: Triangle Dependency Data Scientist

DBA

UDF

Review & Create

Run UDF

Greenplum

Package UDF Package

1. Greenplum

2. Operation System

3. Python / R

4. TensorFlow

Page 12: Greenplum & Data Science - 数字化三部曲€¦ · Greenplum Architecture Data Science In Greenplum PL/Container MADlib Q+A . What is Data Science Data science is an interdisciplinary

Resolve The Problem: untrusted -> untrusted Data Scientist

DBA UDF

Review & Create

Run UDF

Create UDF

Page 13: Greenplum & Data Science - 数字化三部曲€¦ · Greenplum Architecture Data Science In Greenplum PL/Container MADlib Q+A . What is Data Science Data science is an interdisciplinary

How to Make untrusted to untrusted?

PL/Container

PL/Container

Page 14: Greenplum & Data Science - 数字化三部曲€¦ · Greenplum Architecture Data Science In Greenplum PL/Container MADlib Q+A . What is Data Science Data science is an interdisciplinary

What is PL/Container?

PL/Container is a customizable, secure runtime for Greenplum Database Procedural Languages. ●  Greenplum Database Extension

●  Stateless

●  Based on Docker Container

●  Customizable

●  Secure

●  Isolated

PL/Container

PL/Container

Page 15: Greenplum & Data Science - 数字化三部曲€¦ · Greenplum Architecture Data Science In Greenplum PL/Container MADlib Q+A . What is Data Science Data science is an interdisciplinary

Build Custom Docker Image

Minimum Requirement:

●  Python or R environment

●  Add location of libpython.so and libR.so to LD_LIBRARY_PATH Customize Your image:

●  Install specific packages

FROM continuumio/anaconda ENV LD_LIBRARY_PATH "/opt/conda/lib:$LD_LIBRARY_PATH”

FROM continuumio/anaconda3 RUN conda install -c conda-forge -y tensorflow ENV LD_LIBRARY_PATH "/opt/conda/lib:/usr/local/lib:$LD_LIBRARY_PATH"

Page 16: Greenplum & Data Science - 数字化三部曲€¦ · Greenplum Architecture Data Science In Greenplum PL/Container MADlib Q+A . What is Data Science Data science is an interdisciplinary

Run PL/Container

Running a simple PL/Container UDF to calculate log10

postgres=# CREATE EXTENSION plcontainer;

postgres=# CREATE OR REPLACE FUNCTION pylog10(Input double precision) RETURNS double

precision AS $$

# container: plc_python_shared

import math

return math.log10(input)

$$ LANGUAGE plcontainer;

postgres=# SELECT pylog10(100); -------------- 2

Page 17: Greenplum & Data Science - 数字化三部曲€¦ · Greenplum Architecture Data Science In Greenplum PL/Container MADlib Q+A . What is Data Science Data science is an interdisciplinary

GPText

•  Enables organizations to process mass quantities of raw text data for large-scale

text analytics

•  Indexing, Search, and Analytics Functions are exposed as SQL UDFs

•  Integrates with Apache Solr enterprise search

•  Computing distributed in segments, can be run in parallel

Page 18: Greenplum & Data Science - 数字化三部曲€¦ · Greenplum Architecture Data Science In Greenplum PL/Container MADlib Q+A . What is Data Science Data science is an interdisciplinary

MADlib

Apache MADlib: Scalable In-Database Machine Learning in SQL

Open source, top level Apache project

For PostgreSQL and Greenplum Database

Powerful machine learning, graph, statistics and analytics for data scientists

•  Open source https://github.com/apache/madlib •  Downloads and docs http://madlib.apache.org/ •  Wiki https://cwiki.apache.org/confluence/display/MADLIB/

Page 19: Greenplum & Data Science - 数字化三部曲€¦ · Greenplum Architecture Data Science In Greenplum PL/Container MADlib Q+A . What is Data Science Data science is an interdisciplinary

MADlib

History ●  Established in 2011

●  EMC/Greenplum Pivotal

●  Joe Hellerstein from Univ. of California, Berkeley.

Page 20: Greenplum & Data Science - 数字化三部曲€¦ · Greenplum Architecture Data Science In Greenplum PL/Container MADlib Q+A . What is Data Science Data science is an interdisciplinary

MADlib

Features of MADlib

●  Rich algorithm libraries: more than 50+ algorithms.

●  Easy to use: SQL based interface.

●  Open source: Apache top project.

●  Distributed: integrated with Greenplum, support MPP architecture.

●  In-database analysis: No additional data movement.

Page 21: Greenplum & Data Science - 数字化三部曲€¦ · Greenplum Architecture Data Science In Greenplum PL/Container MADlib Q+A . What is Data Science Data science is an interdisciplinary

MADlib

Architecture of MADlib

Page 22: Greenplum & Data Science - 数字化三部曲€¦ · Greenplum Architecture Data Science In Greenplum PL/Container MADlib Q+A . What is Data Science Data science is an interdisciplinary

MADlib

MADlib’s Rich Algorithm Libraries ●  Supervised Learning: Linear Regression, Decision Tree, SVM, CRF etc.

●  Unsupervised Learning: K-Means, LDA, PCA, Association Rule etc.

●  Graph: PageRank, HIT, All Pairs Shortest Path, Weakly Connected Components etc.

●  Time Series: ARIMA

●  Model Selection: Cross Validation, Prediction Metrics

Page 23: Greenplum & Data Science - 数字化三部曲€¦ · Greenplum Architecture Data Science In Greenplum PL/Container MADlib Q+A . What is Data Science Data science is an interdisciplinary

MADlib

Linear Regression in MADlib ●  Suppose data follow the linear relationship

●  Cost function: sum of squared residuals(SSR) is used to measure the overall model fit

#Dataset

CREATETABLEhouses(idINT,taxINT,bedroomINT,bathFLOAT,priceINT,sizeINT,lotINT);

#Training

SELECTmadlib.linregr_train('houses','houses_linregr','price','ARRAY[tax,bath,size]');

#Pridicting

SELECThouses.*,madlib.linregr_predict(m.coef,ARRAY[tax,bath,size])aspredict,price-madlib.linregr_predict(m.coef,ARRAY[tax,bath,size])asresidual

FROMhouses,houses_linregrmORDERBYid;

Page 24: Greenplum & Data Science - 数字化三部曲€¦ · Greenplum Architecture Data Science In Greenplum PL/Container MADlib Q+A . What is Data Science Data science is an interdisciplinary

MADlib

PageRank ●  Google’s classical rank algorithm used to rank the importance of the web page.

●  PageRank is based on random walk model which can include the following steps: ○  Initialization step: set initial probability distribution

○  Iteration step: recalculate the score for each web page.

○  Converge step

Image from https://en.wikipedia.org/wiki/PageRank

Page 25: Greenplum & Data Science - 数字化三部曲€¦ · Greenplum Architecture Data Science In Greenplum PL/Container MADlib Q+A . What is Data Science Data science is an interdisciplinary

MADlib

PageRank in MADlib

#Data

CREATETABLEvertex(idINTEGER);CREATETABLEedge(srcINTEGER,destINTEGER,user_idINTEGER);

#Training

SELECTmadlib.pagerank('vertex',--Vertextable'id',--Vertixidcolumn'edge',--Edgetable'src=src,dest=dest',--Edgeargs'pagerank_out');--Outputtable

#ResultSELECT*FROMpagerank_outORDERBYpagerankDESC;

1

3

4 2

3 5

1

1 2

Vertex or node

Edge

Edge Weight

Page 26: Greenplum & Data Science - 数字化三部曲€¦ · Greenplum Architecture Data Science In Greenplum PL/Container MADlib Q+A . What is Data Science Data science is an interdisciplinary

MADlib

Github Analysis ●  Github is one of the biggest code repo for different kinds of cool projects, especially open

source.

●  GH Archive store all the Github history data: 113GB compressed data in 2017

Page 27: Greenplum & Data Science - 数字化三部曲€¦ · Greenplum Architecture Data Science In Greenplum PL/Container MADlib Q+A . What is Data Science Data science is an interdisciplinary

MADlib

Github Analysis ●  Intern project: find the important repo in Github

during 2015-current

●  How to measure importance?

○  Personal PageRank

●  How to represent the node in PageRank?

○  People & Repo

●  How to represent the edge in PageRank?

○  People star a repo (single direction)

○  People contribute to a repo (bidirectional)

Page 28: Greenplum & Data Science - 数字化三部曲€¦ · Greenplum Architecture Data Science In Greenplum PL/Container MADlib Q+A . What is Data Science Data science is an interdisciplinary

Transforming How The World Builds Software

© Copyright 2018 Pivotal Software, Inc. All rights Reserved.