65
Greenplum Database 4.0, Greenplum Chorus, and Advanced Analytics Presentation, June 2010 1 7/26/2010 Confidential Luke Lonergan CTO and Co-Founder Greenplum

Greenplum Database 4.0, Greenplum Chorus, and … Database 4.0, Greenplum Chorus, and Advanced Analytics Presentation, June 2010 1 7/26/2010 Confidential Luke Lonergan CTO and Co-Founder

Embed Size (px)

Citation preview

Page 1: Greenplum Database 4.0, Greenplum Chorus, and … Database 4.0, Greenplum Chorus, and Advanced Analytics Presentation, June 2010 1 7/26/2010 Confidential Luke Lonergan CTO and Co-Founder

Greenplum Database 4.0, Greenplum Chorus, and Advanced Analytics

Presentation, June 2010

17/26/2010 Confidential

Luke LonerganCTO and Co-FounderGreenplum

Page 2: Greenplum Database 4.0, Greenplum Chorus, and … Database 4.0, Greenplum Chorus, and Advanced Analytics Presentation, June 2010 1 7/26/2010 Confidential Luke Lonergan CTO and Co-Founder

27/26/2010 Confidential

Greenplum: Demonstrating Market Momentum and Leadership

• 2009 was a breakout year for GP• Surpassed more than +100 global

enterprise customers• 100% year over year growth• Projecting the same for 2010• Open systems model is winning:

significant leverage with Dell, EMC, Cisco and others

• Acquiring net new customers at a pace faster than Netezza and Teradata

• Growth is enabling us to innovate beyond our competitors, not only within just the core database but in new product initiatives

Page 3: Greenplum Database 4.0, Greenplum Chorus, and … Database 4.0, Greenplum Chorus, and Advanced Analytics Presentation, June 2010 1 7/26/2010 Confidential Luke Lonergan CTO and Co-Founder

37/26/2010 Confidential

Announcing Greenplum Database 4.0: Critical Mass Innovation

3

• 4.0 represents industry leading innovations in:– Workload Management– Fault-Tolerance– Advance Analytics

• Culmination of more than +7 years of research and development

• First emerging SW vendor to achieve critical mass and maturity across all necessary aspects of enterprise class DBMS platforms:

– Complex query optimization– Data loading– Workload Management– Fault-Tolerance– Embedded languages/analytics– 3rd Party ISV certification– Administration and Monitoring

• Genuine floor-sweep replacement option for Teradata, Oracle, DB2, and SQL Server

Page 4: Greenplum Database 4.0, Greenplum Chorus, and … Database 4.0, Greenplum Chorus, and Advanced Analytics Presentation, June 2010 1 7/26/2010 Confidential Luke Lonergan CTO and Co-Founder

47/26/2010 Confidential

The Need for Change:State of Play – Data in a Typical Enterprise

• Data is everywhere –corporate EDW, 100s of data marts, ‘shadow’ databases and spreadsheets

• The goal of centralizing all data in a single EDW has proven untenable

EDW~10% of data

Data Marts and ‘Personal Databases’

~90% of data

Page 5: Greenplum Database 4.0, Greenplum Chorus, and … Database 4.0, Greenplum Chorus, and Advanced Analytics Presentation, June 2010 1 7/26/2010 Confidential Luke Lonergan CTO and Co-Founder

57/26/2010 Confidential

Introducing Greenplum Chorus: The World’s First Enterprise Data Cloud Platform

• New software product• World’s first Enterprise Data Cloud

Platform (EDC), enabling:– Self-service provisioning– Data services– Data collaboration

• Customers deploy Chorus along with GP Database to create a net new and self-service analytic infrastructure

• Chorus can significantly accelerate the time and ease with which companies extract value and insight from their data

Page 6: Greenplum Database 4.0, Greenplum Chorus, and … Database 4.0, Greenplum Chorus, and Advanced Analytics Presentation, June 2010 1 7/26/2010 Confidential Luke Lonergan CTO and Co-Founder

67/26/2010 Confidential

Greenplum Chorus: Core Design Philosophies

• Secure– Provide comprehensive and granular access

control over whom is authorized to view and subscribe to data within Chorus

• Collaborative– Facilitate the publishing, discovery, and

sharing of data and insight using a social computing model that appears familiar and easy-to-use

• Data-centric– Focus on the necessary tooling to manage the

flow and provenance of data sets as they are created/shared within a company

• MAD Skills in Action– Build a platform capable of supporting the

magnetic, agile, and deep principles of MAD Skills

Page 7: Greenplum Database 4.0, Greenplum Chorus, and … Database 4.0, Greenplum Chorus, and Advanced Analytics Presentation, June 2010 1 7/26/2010 Confidential Luke Lonergan CTO and Co-Founder

77/26/2010 Confidential

Greenplum Chorus: Core Technologies

• Greenplum Chorus has unique technical requirements that demand a new kind of core infrastructure

• Cloud platforms have focused on scalable processing – not scalable and dynamic flow of data

• Chorus needs both:• DATA REQUIREMENTS:

Coordinate complex dataflow and data lifecycle across 10s or 100s of distinct databases

• CLOUD REQUIREMENTS:Low TCO provisioning and control of distributed processing and storage

Page 8: Greenplum Database 4.0, Greenplum Chorus, and … Database 4.0, Greenplum Chorus, and Advanced Analytics Presentation, June 2010 1 7/26/2010 Confidential Luke Lonergan CTO and Co-Founder

87/26/2010 Confidential

Greenplum Data Hypervisor™

• Greenplum Data Hypervisor™ is the execution and operational framework for all ‘outside the database’activities

– Coordinate complex cross-database data movement

– Manage all Chorus state and in-flight activities– Orchestrate database instance provisioning,

expansion, and other operational activities – Respond to events and failures with compensating

actions and escalation– Execute arbitrary programs and process flow in a

strongly fault-tolerance manner

• Greenplum Data Hypervisor™ is built for the cloud– Underlying consensus/replication model is similar

to Google’s core – Handles and recovers from failures mid-operation

– even within complex multi-step flows– Scales to 10,000s of nodes across geographic

boundaries and WAN links– Runs unnoticed within every Chorus and GPDB

server, and elsewhere as neededConsensus Protocol

State Replication

DistributedState Management

App & ProcessFlow Runtime

Schedule + DependencyManagement

Developer API

Page 9: Greenplum Database 4.0, Greenplum Chorus, and … Database 4.0, Greenplum Chorus, and Advanced Analytics Presentation, June 2010 1 7/26/2010 Confidential Luke Lonergan CTO and Co-Founder

97/26/2010 Confidential

Greenplum Chorus: Customer Example, Telecom

100 TB EDW

1 Petabyte EDC

GO Database + EDC Chorus

Customer Challenge: – 100TB Teradata EDW focused on operational

reporting and financial consolidation– EDW is single source of truth, under heavy

governance and control– Unable to support all of the critical initiatives

around data surrounding the business– Customer loyalty and churn the #1 business

initiative from the CEO on down

Greenplum Database + Chorus:– Extracted data from EDW and others source

systems to quickly assemble new analytic mart– Generated a social graph from call detail

records and subscriber data– Within 2 weeks uncovered behavior where

“connected” subscribers where 7X more likely to churn than average user

– Now deploying 1PB production EDC with GP to power their analytic initiatives

1 Petabyte EDC

Page 10: Greenplum Database 4.0, Greenplum Chorus, and … Database 4.0, Greenplum Chorus, and Advanced Analytics Presentation, June 2010 1 7/26/2010 Confidential Luke Lonergan CTO and Co-Founder

107/26/2010 Confidential

Greenplum Database + Chorus: Platform for Data and Analytic Solutions

Offer/CampaignManagement

Offer/CampaignManagement

Value-Added

Services

Value-Added

Services

Greenplum Chorus: Self-Service Provisioning, Data Virtualization, Collaboration

Data Mart Consolidation

Data Mart Consolidation

OperationalDataStore

OperationalDataStore

AnalyticsLab

AnalyticsLab

Greenplum Database: Massively Scalable, Reliable, and Flexible Data Platform

Page 11: Greenplum Database 4.0, Greenplum Chorus, and … Database 4.0, Greenplum Chorus, and Advanced Analytics Presentation, June 2010 1 7/26/2010 Confidential Luke Lonergan CTO and Co-Founder

117/26/2010 Confidential 117/26/2010 Confidential

Advanced Analytics:An Overview of MAD Skills

Page 12: Greenplum Database 4.0, Greenplum Chorus, and … Database 4.0, Greenplum Chorus, and Advanced Analytics Presentation, June 2010 1 7/26/2010 Confidential Luke Lonergan CTO and Co-Founder

127/26/2010 Confidential

Magnetic

• Attract data…• Parallel data loads• External tables• Web tables

• Attract practitioners…• Parallel query execution• OLAP, window functions• Built-in analytics

• MapReduce• Data Cart• ETL/ELT

• Analytics libraries• Sandboxes• Collaboration

Page 13: Greenplum Database 4.0, Greenplum Chorus, and … Database 4.0, Greenplum Chorus, and Advanced Analytics Presentation, June 2010 1 7/26/2010 Confidential Luke Lonergan CTO and Co-Founder

137/26/2010 Confidential

Agile

analyze and model in the

cloudpush results back into the

cloud

get data into the cloud

Page 14: Greenplum Database 4.0, Greenplum Chorus, and … Database 4.0, Greenplum Chorus, and Advanced Analytics Presentation, June 2010 1 7/26/2010 Confidential Luke Lonergan CTO and Co-Founder

147/26/2010 Confidential

Sample tables Transformed Data Models and Results

CREATE TABLE temp1 ASSELECT customerID, max(

DELETE FROM temp1 WHERE num <

model <- function(x1, x2)

Export a sample, plus hold‐out, for model design 

Design models in R and test for 

fitness on out‐of‐sample data

Implement transformations as parallelized SQL statements

Iterate on feature selection and model form to improve fit

EDC PLATFORM (Staging)

Implement models as scalable Greenplum 

Analytics functions

EDC PLATFORM (Sandbox)

Staging tables

Staging tables

Generate useful features by a sequence of aggregations

Ingest raw data into staging 

tables

Go back and ingest more data 

as required

Implement robust mapping scripts, and use more comprehensive data for training and testing

CREATE VIEW ols ASSELECT pseudo_inverse(FROM (SELECT sum(trans

Agile

Page 15: Greenplum Database 4.0, Greenplum Chorus, and … Database 4.0, Greenplum Chorus, and Advanced Analytics Presentation, June 2010 1 7/26/2010 Confidential Luke Lonergan CTO and Co-Founder

157/26/2010 Confidential

Deep

What will happen?What will happen? How can we do better?

How can we do better?

What happened where and when?What happened

where and when?How and why did

it happen?How and why did

it happen?

Page 16: Greenplum Database 4.0, Greenplum Chorus, and … Database 4.0, Greenplum Chorus, and Advanced Analytics Presentation, June 2010 1 7/26/2010 Confidential Luke Lonergan CTO and Co-Founder

167/26/2010 Confidential

• Window functions• Cume_Dist• Lag, Lead• Rank• etc…

• Libraries• Mann-Whitney U Test• Chi-Square Test• PL/R• NLTK• etc…

• Built-in analytics• Matrix operations• Multiple linear regression• Naïve Bayes• etc…

• Methods• Log likelihood• Conjugate gradient• Re-sampling• etc…

Deep

Page 17: Greenplum Database 4.0, Greenplum Chorus, and … Database 4.0, Greenplum Chorus, and Advanced Analytics Presentation, June 2010 1 7/26/2010 Confidential Luke Lonergan CTO and Co-Founder

177/26/2010 Confidential 177/26/2010 Confidential

Vector and Matrix Types and Operators

Page 18: Greenplum Database 4.0, Greenplum Chorus, and … Database 4.0, Greenplum Chorus, and Advanced Analytics Presentation, June 2010 1 7/26/2010 Confidential Luke Lonergan CTO and Co-Founder

187/26/2010 Confidential

• Types• Sparse and Dense Vectors and Matrices

• All numerical formats, byte through double and complex double

• Run length code compression of duplicate values• Operations

• Native scalar operations (natural arithmetic)• Dot product, vector triple product, vector / matrix algebra• Set operations (contains, AND/OR, etc)

• Library interfaces• Solvers from LAPACK and Sparspak

Vectors and Matrices

Page 19: Greenplum Database 4.0, Greenplum Chorus, and … Database 4.0, Greenplum Chorus, and Advanced Analytics Presentation, June 2010 1 7/26/2010 Confidential Luke Lonergan CTO and Co-Founder

197/26/2010 Confidential

Enter a sparse vector:greenplumdb=# select '{1,4,1,3,1,7}:{1.1,0,2.2,0,3.3,0}'::svec;

Cast a sparse vector to a dense vector:greenplumdb=# select '{1,4,1,3,1,7}:{1.1,0,2.2,0,3.3,0}'::svec::float8[];

float8 -------------------------------------------{1.1,0,0,0,0,2.2,0,0,0,3.3,0,0,0,0,0,0,0}

Scalar multiply two vectors:greenplumdb=# select '{1,10,20}:{1,2,3}'*'{10,20,1}:{1,2,3}'::svec;

?column? --------------------------{1,9,1,19,1}:{1,2,4,6,9}

Example: Basic Vector Arithmetic

Page 20: Greenplum Database 4.0, Greenplum Chorus, and … Database 4.0, Greenplum Chorus, and Advanced Analytics Presentation, June 2010 1 7/26/2010 Confidential Luke Lonergan CTO and Co-Founder

207/26/2010 Confidential

Weight calculation:

CREATE TABLE latest_coefficient

( coefs svec NOT NULL );

INSERT INTO weight (rownumber, logistic, weight)

(SELECT rownumber,logistic,logistic*(1-logistic)

FROM(SELECT a.rownumber rownumber,

1 / (1 + exp(

-vec_sum(c.coefs * ARRAY[1,a.gre,a.topnotch,a.gpa]))) logistic

FROM admission a, latest_coefficient c) foo);

Example: Iterative Logistic Regression

Page 21: Greenplum Database 4.0, Greenplum Chorus, and … Database 4.0, Greenplum Chorus, and Advanced Analytics Presentation, June 2010 1 7/26/2010 Confidential Luke Lonergan CTO and Co-Founder

217/26/2010 Confidential

Residual calculation using l2norm():

SELECT iteration, abs(residual)FROM (

SELECT iteration, l2norm-last_value(l2norm)

OVER (w1) residualFROM (SELECT iteration, l2norm(coefs) FROM coefficient

ORDER BY 1) bar

WINDOW w1 as (ORDER BY iterationROWS BETWEEN 1 PRECEDING AND 1 PRECEDING

) offset 1

) foo;

Example: Plotting Residual of Iterative Method

Page 22: Greenplum Database 4.0, Greenplum Chorus, and … Database 4.0, Greenplum Chorus, and Advanced Analytics Presentation, June 2010 1 7/26/2010 Confidential Luke Lonergan CTO and Co-Founder

227/26/2010 Confidential

Residual plot using gnuplot:

\o | gnuplot -e "set logscale y; plot '-' with lines title'l2norm of residual’”

SELECT <data>

\o

Example: Plotting Residual of Iterative Method

Page 23: Greenplum Database 4.0, Greenplum Chorus, and … Database 4.0, Greenplum Chorus, and Advanced Analytics Presentation, June 2010 1 7/26/2010 Confidential Luke Lonergan CTO and Co-Founder

237/26/2010 Confidential 237/26/2010 Confidential

ExampleWhat’s the right price for my products?

Page 24: Greenplum Database 4.0, Greenplum Chorus, and … Database 4.0, Greenplum Chorus, and Advanced Analytics Presentation, June 2010 1 7/26/2010 Confidential Luke Lonergan CTO and Co-Founder

247/26/2010 Confidential

What’s the right price for my products?

Page 25: Greenplum Database 4.0, Greenplum Chorus, and … Database 4.0, Greenplum Chorus, and Advanced Analytics Presentation, June 2010 1 7/26/2010 Confidential Luke Lonergan CTO and Co-Founder

257/26/2010 Confidential

What’s the right price for my products?

Get the raw data…DROP TABLE IF EXISTS misc.price_promo;

CREATE TABLE misc.price_promo

(

dt date

,base_price numeric

,display_price numeric

,feature_price numeric

,feature_display_price numeric

,tpr numeric

,volume numeric

) DISTRIBUTED BY(dt);

\copy misc.price_promo from data.csv with delimiter ','

Date Base Price

DisplayPrice

Feature Price

Feature/Display Price

TPR Volume

2009-02-24 7.33 6.67 7.33 7.33 7.20 20484.522009-03-10 7.47 5.94 5.72 7.00 5.72 34313.942009-03-24 7.75 6.74 5.74 7.26 5.82 25477.332009-04-07 7.40 7.19 7.40 7.40 7.23 18772.572009-04-21 7.75 7.36 6.74 7.75 6.22 20743.682009-05-05 7.43 6.56 6.98 7.43 5.70 28244.822009-05-19 7.70 6.57 7.70 7.70 6.23 20234.742009-06-02 6.87 6.67 6.87 6.87 6.64 23262.602009-06-16 7.36 7.00 7.36 7.36 7.44 19290.872009-06-30 6.92 6.72 6.92 6.92 6.73 23617.612009-07-14 7.49 7.32 7.49 7.49 7.58 18017.582009-07-28 7.69 7.44 5.69 7.69 5.70 29193.442009-08-11 7.19 6.24 7.19 7.19 6.72 23863.132009-08-25 7.72 6.74 7.72 7.72 5.72 25138.34

Page 26: Greenplum Database 4.0, Greenplum Chorus, and … Database 4.0, Greenplum Chorus, and Advanced Analytics Presentation, June 2010 1 7/26/2010 Confidential Luke Lonergan CTO and Co-Founder

267/26/2010 Confidential

What’s the right price for my products?

Build the model…CREATE TABLE misc.price_promo_coefs ASSELECTcoefs[1] AS intercept_beta ,coefs[2] AS base_price_beta,coefs[3] AS display_price_beta ,coefs[4] AS feature_display_price_beta,coefs[5] AS tpr_beta ,r2

FROM (SELECT

mregr_coef(volume, array[1::int, base_price_per_unit, display_price, feature_display_price, temporary_price_reduction]) AS coefs

,mregr_r2(volume, array[1::int, base_price_per_unit, display_price, feature_display_price, temporary_price_reduction]) AS r2

FROMmisc.price_promo

) AS aDISTRIBUTED RANDOMLY;

intercept_beta base_price_beta display_price_beta feature_display_price_beta tpr_beta r2

72804.48332 5049.03841 -1388.842417 -6203.882026 -4801.114351 0.883172235

Page 27: Greenplum Database 4.0, Greenplum Chorus, and … Database 4.0, Greenplum Chorus, and Advanced Analytics Presentation, June 2010 1 7/26/2010 Confidential Luke Lonergan CTO and Co-Founder

277/26/2010 Confidential

What’s the right price for my products?

Execute the model…CREATE OR REPLACE VIEW misc.v_price_promo_fitted ASSELECTvolume,volume_fitted,100 * abs(volume - volume_fitted)::numeric / volume AS ape

FROM (SELECTp.volume,c.intercept_beta+ p.base_price * c.base_price_beta

+ p.display_price * c.display_price_beta+ p.feature_display_price * c.feature_display_price_beta

+ p.tpr * c.tpr_betaAS volume_fitted

FROMmisc.price_promo_coefs c,misc.price_promo p

) AS a

volume volume_fitted ape

20484.52 20507.88 0.1140

34313.94 31381.52 8.5458

25477.33 29591.06 16.1466

18772.57 19560.80 4.1988

20743.68 23769.63 14.5873

28244.82 27746.83 1.7630

20234.74 24876.55 22.9398

23262.60 23727.72 1.9994

19290.87 18862.64 2.2198

23617.61 23168.44 1.9018

18017.58 17595.93 2.3402

29193.44 26224.39 10.1702

23863.13 23571.29 1.2229

25138.34 27065.91 7.6678

24307.88 23945.45 1.4909

… … …

Page 28: Greenplum Database 4.0, Greenplum Chorus, and … Database 4.0, Greenplum Chorus, and Advanced Analytics Presentation, June 2010 1 7/26/2010 Confidential Luke Lonergan CTO and Co-Founder

287/26/2010 Confidential

What’s the right price for my products?

R2 mape

0.883 5.965

Page 29: Greenplum Database 4.0, Greenplum Chorus, and … Database 4.0, Greenplum Chorus, and Advanced Analytics Presentation, June 2010 1 7/26/2010 Confidential Luke Lonergan CTO and Co-Founder

297/26/2010 Confidential

Page 30: Greenplum Database 4.0, Greenplum Chorus, and … Database 4.0, Greenplum Chorus, and Advanced Analytics Presentation, June 2010 1 7/26/2010 Confidential Luke Lonergan CTO and Co-Founder

307/26/2010 Confidential 307/26/2010 Confidential

ExampleWhat are our customers saying about us?

Page 31: Greenplum Database 4.0, Greenplum Chorus, and … Database 4.0, Greenplum Chorus, and Advanced Analytics Presentation, June 2010 1 7/26/2010 Confidential Luke Lonergan CTO and Co-Founder

317/26/2010 Confidential

What are our customers saying about us?

• How do you discern trends and categories within thousands of on-line conversations?

– Search for relevant blogs– Construct a ‘fingerprint’ for each

document based on word frequencies

– Use this to define what it means for documents to be similar, or ‘close’

– Identify ‘clusters’ of documents

Page 32: Greenplum Database 4.0, Greenplum Chorus, and … Database 4.0, Greenplum Chorus, and Advanced Analytics Presentation, June 2010 1 7/26/2010 Confidential Luke Lonergan CTO and Co-Founder

327/26/2010 Confidential

What are our customers saying about us?

Method

• Construct document histograms

• Transform histograms into document “fingerprints”

• Use k-means clustering to classify documents

Page 33: Greenplum Database 4.0, Greenplum Chorus, and … Database 4.0, Greenplum Chorus, and Advanced Analytics Presentation, June 2010 1 7/26/2010 Confidential Luke Lonergan CTO and Co-Founder

337/26/2010 Confidential

What are our customers saying about us?

Constructing document histograms

• Parsing html files and extracting relevant data.

• Using natural language processing to perform text tokenization and stemming.

• Cleansing the data of inconsistencies.• Transforming unstructured data into

structured data.

Page 34: Greenplum Database 4.0, Greenplum Chorus, and … Database 4.0, Greenplum Chorus, and Advanced Analytics Presentation, June 2010 1 7/26/2010 Confidential Luke Lonergan CTO and Co-Founder

347/26/2010 Confidential

What are our customers saying about us?

Constructing a document “fingerprint”- Measure the term frequency of words

within a document vs. the frequency that those words occur in all documents.

- This is a TFxIDF weight: term frequency-inverse document frequency.

- Easily calculated based on formulas over the document histograms.

- The result is a vector in n-space.

Page 35: Greenplum Database 4.0, Greenplum Chorus, and … Database 4.0, Greenplum Chorus, and Advanced Analytics Presentation, June 2010 1 7/26/2010 Confidential Luke Lonergan CTO and Co-Founder

357/26/2010 Confidential

What are our customers saying about us?

K-means clustering- An iterative algorithm for finding items

that are similar within an n-dimensional space.

- Two fundamental steps:1. Measuring distance to a centroid (the

center of a cluster)2. Moving the center of a cluster to

minimize the sum distance to all the members of the cluster.

Page 36: Greenplum Database 4.0, Greenplum Chorus, and … Database 4.0, Greenplum Chorus, and Advanced Analytics Presentation, June 2010 1 7/26/2010 Confidential Luke Lonergan CTO and Co-Founder

367/26/2010 Confidential

What are our customers saying about us?

Page 37: Greenplum Database 4.0, Greenplum Chorus, and … Database 4.0, Greenplum Chorus, and Advanced Analytics Presentation, June 2010 1 7/26/2010 Confidential Luke Lonergan CTO and Co-Founder

377/26/2010 Confidential

What are our customers saying about us?

Page 38: Greenplum Database 4.0, Greenplum Chorus, and … Database 4.0, Greenplum Chorus, and Advanced Analytics Presentation, June 2010 1 7/26/2010 Confidential Luke Lonergan CTO and Co-Founder

387/26/2010 Confidential

What are our customers saying about us?

Page 39: Greenplum Database 4.0, Greenplum Chorus, and … Database 4.0, Greenplum Chorus, and Advanced Analytics Presentation, June 2010 1 7/26/2010 Confidential Luke Lonergan CTO and Co-Founder

397/26/2010 Confidential

What are our customers saying about us?

Page 40: Greenplum Database 4.0, Greenplum Chorus, and … Database 4.0, Greenplum Chorus, and Advanced Analytics Presentation, June 2010 1 7/26/2010 Confidential Luke Lonergan CTO and Co-Founder

407/26/2010 Confidential

What are our customers saying about us?

Page 41: Greenplum Database 4.0, Greenplum Chorus, and … Database 4.0, Greenplum Chorus, and Advanced Analytics Presentation, June 2010 1 7/26/2010 Confidential Luke Lonergan CTO and Co-Founder

417/26/2010 Confidential

What are our customers saying about us?

Page 42: Greenplum Database 4.0, Greenplum Chorus, and … Database 4.0, Greenplum Chorus, and Advanced Analytics Presentation, June 2010 1 7/26/2010 Confidential Luke Lonergan CTO and Co-Founder

427/26/2010 Confidential

What are our customers saying about us?

Page 43: Greenplum Database 4.0, Greenplum Chorus, and … Database 4.0, Greenplum Chorus, and Advanced Analytics Presentation, June 2010 1 7/26/2010 Confidential Luke Lonergan CTO and Co-Founder

437/26/2010 Confidential

What are our customers saying about us?

Page 44: Greenplum Database 4.0, Greenplum Chorus, and … Database 4.0, Greenplum Chorus, and Advanced Analytics Presentation, June 2010 1 7/26/2010 Confidential Luke Lonergan CTO and Co-Founder

447/26/2010 Confidential

What are our customers saying about us?

Page 45: Greenplum Database 4.0, Greenplum Chorus, and … Database 4.0, Greenplum Chorus, and Advanced Analytics Presentation, June 2010 1 7/26/2010 Confidential Luke Lonergan CTO and Co-Founder

457/26/2010 Confidential

What are our customers saying about us?

Page 46: Greenplum Database 4.0, Greenplum Chorus, and … Database 4.0, Greenplum Chorus, and Advanced Analytics Presentation, June 2010 1 7/26/2010 Confidential Luke Lonergan CTO and Co-Founder

467/26/2010 Confidential

What are our customers saying about us?

Page 47: Greenplum Database 4.0, Greenplum Chorus, and … Database 4.0, Greenplum Chorus, and Advanced Analytics Presentation, June 2010 1 7/26/2010 Confidential Luke Lonergan CTO and Co-Founder

477/26/2010 Confidential

What are our customers saying about us?

Page 48: Greenplum Database 4.0, Greenplum Chorus, and … Database 4.0, Greenplum Chorus, and Advanced Analytics Presentation, June 2010 1 7/26/2010 Confidential Luke Lonergan CTO and Co-Founder

487/26/2010 Confidential

What are our customers saying about us?

Page 49: Greenplum Database 4.0, Greenplum Chorus, and … Database 4.0, Greenplum Chorus, and Advanced Analytics Presentation, June 2010 1 7/26/2010 Confidential Luke Lonergan CTO and Co-Founder

497/26/2010 Confidential

What are our customers saying about us?

• innovation• leader• cost

• competitor• technology• speed

• bug• installation• download

Page 50: Greenplum Database 4.0, Greenplum Chorus, and … Database 4.0, Greenplum Chorus, and Advanced Analytics Presentation, June 2010 1 7/26/2010 Confidential Luke Lonergan CTO and Co-Founder

507/26/2010 Confidential

Accessing the data

• Build the directory list into a set of files that we will access:

id | path | body ------+---------------------------------------+------------------------------------2482 | /Users/demo/blogsplog/model/2482.html | <!DOCTYPE html PUBLIC "...

1 | /Users/demo/blogsplog/model/1.html | <!DOCTYPE html PUBLIC "...10 | /Users/demo/blogsplog/model/1000.html | <!DOCTYPE html PUBLIC "...

2484 | /Users/demo/blogsplog/model/2484.html | <!DOCTYPE html PUBLIC "......

id | path | body ------+---------------------------------------+------------------------------------2482 | /Users/demo/blogsplog/model/2482.html | <!DOCTYPE html PUBLIC "...

1 | /Users/demo/blogsplog/model/1.html | <!DOCTYPE html PUBLIC "...10 | /Users/demo/blogsplog/model/1000.html | <!DOCTYPE html PUBLIC "...

2484 | /Users/demo/blogsplog/model/2484.html | <!DOCTYPE html PUBLIC "......

• For each record in the list "open()" the file and read it in its entirety

-INPUT:NAME: filelistFILE:

- maple:/Users/demo/blogsplog/filelist1- maple:/Users/demo/blogsplog/filelist2

COLUMNS:- path text

-MAP:NAME: read_dataPARAMETERS: [path text]RETURNS: [id int, path text, body text]LANGUAGE: pythonFUNCTION: |

(_, fname) = path.rsplit('/', 1)(id, _) = fname.split('.')body = f.open(path).read()

Page 51: Greenplum Database 4.0, Greenplum Chorus, and … Database 4.0, Greenplum Chorus, and Advanced Analytics Presentation, June 2010 1 7/26/2010 Confidential Luke Lonergan CTO and Co-Founder

517/26/2010 Confidential

Parse the documents into word lists

Convert HTML documents into parsed, tokenized, stemmed, term lists with stop-word removal:

-MAP:NAME: extract_termsPARAMETERS: [id integer, body text]RETURNS: [id int, title text, doc _text]FUNCTION: |

if 'parser' not in SD:import ...class MyHTMLParser(HTMLParser):

...SD['parser'] = MyHTMLParser()

parser = SD['parser']parser.reset()parser.feed(body)

yield (id, parser.title, '{"' + '","'.join(parser.doc) + '"}')

Page 52: Greenplum Database 4.0, Greenplum Chorus, and … Database 4.0, Greenplum Chorus, and Advanced Analytics Presentation, June 2010 1 7/26/2010 Confidential Luke Lonergan CTO and Co-Founder

527/26/2010 Confidential

Parse the documents into word lists

Use the HTMLParser library to parse the html documents and extract titles and body contents:

if 'parser' not in SD:from HTMLParser import HTMLparser...class MyHTMLParser(HTMLParser):

def __init(self):HTMLParser.__init__(self)...

def handle_data(self, data):data = data.strip()if self.inhead:

if self.tag == 'title':self.title = data

if self.inbody:...

parser = SD['parser']parser.reset()...

Page 53: Greenplum Database 4.0, Greenplum Chorus, and … Database 4.0, Greenplum Chorus, and Advanced Analytics Presentation, June 2010 1 7/26/2010 Confidential Luke Lonergan CTO and Co-Founder

537/26/2010 Confidential

Parse the documents into word lists

Use nltk to tokenize, stem, and remove common terms:

if 'parser' not in SD:from nltk import WordTokenizer, PorterStemmer, corpus...class MyHTMLParser(HTMLParser):def __init(self):

... self.tokenizer = WordTokenizer()self.stemmer = PorterStemmer()self.stopwords = dict(map(lambda x: (x, True), corpus.stopwords.words()))

def handle_data(self, data):...

if self.inbody:tokens = self.tokenizer.tokenize(data)stems = map(self.stemmer.stem, tokens)for x in stems:if len(x) < 4: continuex = x.lower()if x in self.stopwords: continueself.doc.append(x)

...parser = SD['parser'] parser.reset()...

Page 54: Greenplum Database 4.0, Greenplum Chorus, and … Database 4.0, Greenplum Chorus, and Advanced Analytics Presentation, June 2010 1 7/26/2010 Confidential Luke Lonergan CTO and Co-Founder

547/26/2010 Confidential

Parse the documents into word lists

Use nltk to tokenize, stem, and remove common terms:

if 'parser' not in SD:from nltk import WordTokenizer, PorterStemmer, corpus...class MyHTMLParser(HTMLParser):def __init(self):

... self.tokenizer = WordTokenizer()self.stemmer = PorterStemmer()self.stopwords = dict(map(lambda x: (x, True), corpus.stopwords.words()))

def handle_data(self, data):...

if self.inbody:tokens = self.tokenizer.tokenize(data)stems = map(self.stemmer.stem, tokens)for x in stems:if len(x) < 4: continuex = x.lower()if x in self.stopwords: continueself.doc.append(x)

...parser = SD['parser'] parser.reset()...

shell$ gpmapreduce -f blog-terms.ymlmapreduce_75643_run_1DONE

sql# SELECT id, title, doc FROM blog_terms LIMIT 5;

id | title | doc ------+------------------+-----------------------------------------------------------------2482 | noodlepie | {noodlepi,from,gutter,grub,gourmet,tabl,noodlepi,blog,scoff,...

1 | Bhootakannadi | {bhootakannadi,2005,unifi,feed,gener,comment,final,integr,...10 | Tea Set | {novelti,dish,goldilock,bear,bowl,lide,contain,august,...

...

shell$ gpmapreduce -f blog-terms.ymlmapreduce_75643_run_1DONE

sql# SELECT id, title, doc FROM blog_terms LIMIT 5;

id | title | doc ------+------------------+-----------------------------------------------------------------2482 | noodlepie | {noodlepi,from,gutter,grub,gourmet,tabl,noodlepi,blog,scoff,...

1 | Bhootakannadi | {bhootakannadi,2005,unifi,feed,gener,comment,final,integr,...10 | Tea Set | {novelti,dish,goldilock,bear,bowl,lide,contain,august,...

...

Page 55: Greenplum Database 4.0, Greenplum Chorus, and … Database 4.0, Greenplum Chorus, and Advanced Analytics Presentation, June 2010 1 7/26/2010 Confidential Luke Lonergan CTO and Co-Founder

557/26/2010 Confidential

Create histograms of word frequencies

Extract a term-dictionary of terms that show up in at least ten blogs

sql# SELECT term, sum(c) AS freq, count(*) AS num_blogsFROM (

SELECT id, term, count(*) AS c FROM (

SELECT id, unnest(doc) AS term FROM blog_terms

) term_unnestGROUP BY id, term

) doc_termsWHERE term IS NOT NULLGROUP BY termHAVING count(*) > 10;

term | freq | num_blogs ----------+------+-----------sturdi | 19 | 13canon | 97 | 40group | 48 | 17skin | 510 | 152linger | 19 | 17blunt | 20 | 17

sql# SELECT term, sum(c) AS freq, count(*) AS num_blogsFROM (

SELECT id, term, count(*) AS c FROM (

SELECT id, unnest(doc) AS term FROM blog_terms

) term_unnestGROUP BY id, term

) doc_termsWHERE term IS NOT NULLGROUP BY termHAVING count(*) > 10;

term | freq | num_blogs ----------+------+-----------sturdi | 19 | 13canon | 97 | 40group | 48 | 17skin | 510 | 152linger | 19 | 17blunt | 20 | 17

Page 56: Greenplum Database 4.0, Greenplum Chorus, and … Database 4.0, Greenplum Chorus, and Advanced Analytics Presentation, June 2010 1 7/26/2010 Confidential Luke Lonergan CTO and Co-Founder

567/26/2010 Confidential

Create histograms of word frequencies

Use the term frequencies to construct the term dictionary…

sql# SELECT array(SELECT term FROM blog_term_freq) dictionary;

dictionary ---------------------------------------------------------------------{sturdi,canon,group,skin,linger,blunt,detect,giver,annoy,telephon,...

sql# SELECT array(SELECT term FROM blog_term_freq) dictionary;

dictionary ---------------------------------------------------------------------{sturdi,canon,group,skin,linger,blunt,detect,giver,annoy,telephon,...

…then use the term dictionary to construct feature vectors for every document, mapping document terms to the features in the dictionary:

sql# SELECT id, gp_extract_feature_histogram(dictionary, doc)FROM blog_terms, blog_features;

id | term_count-----+----------------------------------------------------------------2482 | {3,1,37,1,18,1,29,1,45,1,...}:{0,2,0,4,0,1,0,1,0,1,...}

1 | {41,1,34,1,22,1,125,1,387,...}:{0,9,0,1,0,1,0,1,0,3,0,2,...}10 | {3,1,4,1,30,1,18,1,13,1,4,...}:{0,2,0,6,0,12,0,3,0,1,0,1,...}

...

sql# SELECT id, gp_extract_feature_histogram(dictionary, doc)FROM blog_terms, blog_features;

id | term_count-----+----------------------------------------------------------------2482 | {3,1,37,1,18,1,29,1,45,1,...}:{0,2,0,4,0,1,0,1,0,1,...}

1 | {41,1,34,1,22,1,125,1,387,...}:{0,9,0,1,0,1,0,1,0,3,0,2,...}10 | {3,1,4,1,30,1,18,1,13,1,4,...}:{0,2,0,6,0,12,0,3,0,1,0,1,...}

...

Page 57: Greenplum Database 4.0, Greenplum Chorus, and … Database 4.0, Greenplum Chorus, and Advanced Analytics Presentation, June 2010 1 7/26/2010 Confidential Luke Lonergan CTO and Co-Founder

577/26/2010 Confidential

Create histograms of word frequencies

Note the use of sparse vectors to store the term counts as a histogram:

id | term_count-----+----------------------------------------------------------------------...10 | {3,1,4,1,30,1,18,1,13,1,4,...}:{0,2,0,6,0,12,0,3,0,1,0,1,...}

...

id | term_count-----+----------------------------------------------------------------------...10 | {3,1,4,1,30,1,18,1,13,1,4,...}:{0,2,0,6,0,12,0,3,0,1,0,1,...}

...

id | term_count-----+----------------------------------------------------------------------...10 | {0 ,0 ,0 ,2 ,0 ,0 ,0 ,0 ,6 ,0,0,0,...}

...dictionary

----------------------------------------------------------------------------{sturdi,canon,group,skin,linger,blunt,detect,giver,annoy,telephon,...

id | term_count-----+----------------------------------------------------------------------...10 | {0 ,0 ,0 ,2 ,0 ,0 ,0 ,0 ,6 ,0,0,0,...}

...dictionary

----------------------------------------------------------------------------{sturdi,canon,group,skin,linger,blunt,detect,giver,annoy,telephon,...

Page 58: Greenplum Database 4.0, Greenplum Chorus, and … Database 4.0, Greenplum Chorus, and Advanced Analytics Presentation, June 2010 1 7/26/2010 Confidential Luke Lonergan CTO and Co-Founder

587/26/2010 Confidential

Transform the blog terms into statistically useful measures

Use the feature vectors to construct TFxIDF vectors (term frequency inverse document frequency). These are a measure of the importance of terms.

sql# SELECT id, (term_count*logidf) tfxidfFROM blog_histogram, (

SELECT log(count(*)/count_vec(term_count)) logidf FROM blog_histogram

) blog_logidf;

id | tfxidf-----+-------------------------------------------------------------------2482 | {3,1,37,1,18,1,29,1,45,1,...}:{0,8.25206814635817,0,0.34311110...}

1 | {41,1,34,1,22,1,125,1,387,...}:{0,0.771999985977529,0,1.999427...}10 | {3,1,4,1,30,1,18,1,13,1,4,...}:{0,2.95439664949608,0,3.2006935...}

...

sql# SELECT id, (term_count*logidf) tfxidfFROM blog_histogram, (

SELECT log(count(*)/count_vec(term_count)) logidf FROM blog_histogram

) blog_logidf;

id | tfxidf-----+-------------------------------------------------------------------2482 | {3,1,37,1,18,1,29,1,45,1,...}:{0,8.25206814635817,0,0.34311110...}

1 | {41,1,34,1,22,1,125,1,387,...}:{0,0.771999985977529,0,1.999427...}10 | {3,1,4,1,30,1,18,1,13,1,4,...}:{0,2.95439664949608,0,3.2006935...}

...

Page 59: Greenplum Database 4.0, Greenplum Chorus, and … Database 4.0, Greenplum Chorus, and Advanced Analytics Presentation, June 2010 1 7/26/2010 Confidential Luke Lonergan CTO and Co-Founder

597/26/2010 Confidential

Create document clusters around iteratively defined centroids

sql# SELECT id, tfxidf, cid, ACOS((tfxidf %*% centroid) /

(svec_l2norm(tfxidf) * svec_l2norm(centroid))) AS distance

FROM blog_tfxidf, blog_centroids;

id | tfxidf | cid | distance-----+-------------------------------------------------------------------+-----+------------2482 | {3,1,37,1,18,1,29,1,45,1,...}:{0,8.25206814635817,0,0.3431111...} | 1 | 1.536729772482 | {3,1,37,1,18,1,29,1,45,1,...}:{0,8.25206814635817,0,0.3431111...} | 2 | 1.557203542482 | {3,1,37,1,18,1,29,1,45,1,...}:{0,8.25206814635817,0,0.3431111...} | 3 | 1.55040145

sql# SELECT id, tfxidf, cid, ACOS((tfxidf %*% centroid) /

(svec_l2norm(tfxidf) * svec_l2norm(centroid))) AS distance

FROM blog_tfxidf, blog_centroids;

id | tfxidf | cid | distance-----+-------------------------------------------------------------------+-----+------------2482 | {3,1,37,1,18,1,29,1,45,1,...}:{0,8.25206814635817,0,0.3431111...} | 1 | 1.536729772482 | {3,1,37,1,18,1,29,1,45,1,...}:{0,8.25206814635817,0,0.3431111...} | 2 | 1.557203542482 | {3,1,37,1,18,1,29,1,45,1,...}:{0,8.25206814635817,0,0.3431111...} | 3 | 1.55040145

Now that we have TFxIDFs we have something that is a statistically significant metric, which enables all sorts of real analytics.

The current example is k-means clustering which requires two operations.

First, we compute a distance metric between the documents and a random selection of centroids, for instance cosine similarity:

Page 60: Greenplum Database 4.0, Greenplum Chorus, and … Database 4.0, Greenplum Chorus, and Advanced Analytics Presentation, June 2010 1 7/26/2010 Confidential Luke Lonergan CTO and Co-Founder

607/26/2010 Confidential

Create document clusters around iteratively defined centroids

sql# SELECT cid, sum(tfxidf)/count(*) AS centroidFROM (

SELECT id, tfxidf, cid, row_number() OVER (PARTITION BY id ORDER BY distance, cid) rank

FROM blog_distance) blog_rankWHERE rank = 1GROUP BY cid;

cid | centroid -----+------------------------------------------------------------------------

3 | {1,1,1,1,1,1,1,1,1,...}:{0.157556041103536,0.0635233900749665,0.050...}2 | {1,1,1,1,1,1,3,1,1,...}:{0.0671131209568817,0.332220028552986,0,0.0...}1 | {1,1,1,1,1,1,1,1,1,...}:{0.103874521481016,0.158213686890834,0.0540...}

sql# SELECT cid, sum(tfxidf)/count(*) AS centroidFROM (

SELECT id, tfxidf, cid, row_number() OVER (PARTITION BY id ORDER BY distance, cid) rank

FROM blog_distance) blog_rankWHERE rank = 1GROUP BY cid;

cid | centroid -----+------------------------------------------------------------------------

3 | {1,1,1,1,1,1,1,1,1,...}:{0.157556041103536,0.0635233900749665,0.050...}2 | {1,1,1,1,1,1,3,1,1,...}:{0.0671131209568817,0.332220028552986,0,0.0...}1 | {1,1,1,1,1,1,1,1,1,...}:{0.103874521481016,0.158213686890834,0.0540...}

Next, use an averaging metric to re-center the mean of a cluster:

Repeat the previous two operations until the centroids converge, and you have k-means clustering.

Page 61: Greenplum Database 4.0, Greenplum Chorus, and … Database 4.0, Greenplum Chorus, and Advanced Analytics Presentation, June 2010 1 7/26/2010 Confidential Luke Lonergan CTO and Co-Founder

617/26/2010 Confidential

Summary

• Accessing the data (MapReduce)

• Parse the documents into word lists (MapReduce)

• Create histograms of word frequencies (SQL)

• Transform the blog terms into statistically useful measures (SQL)

• Create document clusters around iteratively defined centroids (SQL window functions)

id | path | body ------+---------------------------------------+------------------------------------2482 | /Users/demo/blogsplog/model/2482.html | <!DOCTYPE html PUBLIC ”...

id | path | body ------+---------------------------------------+------------------------------------2482 | /Users/demo/blogsplog/model/2482.html | <!DOCTYPE html PUBLIC ”...

id | title | doc ------+------------------+-----------------------------------------------------------------2482 | noodlepie | {noodlepi,from,gutter,grub,gourmet,tabl,noodlepi,blog,scoff,...

id | title | doc ------+------------------+-----------------------------------------------------------------2482 | noodlepie | {noodlepi,from,gutter,grub,gourmet,tabl,noodlepi,blog,scoff,...

id | term_count-----+----------------------------------------------------------------2482 | {3,1,37,1,18,1,29,1,45,1,...}:{0,2,0,4,0,1,0,1,0,1,...}

id | term_count-----+----------------------------------------------------------------2482 | {3,1,37,1,18,1,29,1,45,1,...}:{0,2,0,4,0,1,0,1,0,1,...}

id | tfxidf-----+-------------------------------------------------------------------2482 | {3,1,37,1,18,1,29,1,45,1,...}:{0,8.25206814635817,0,0.34311110...}

id | tfxidf-----+-------------------------------------------------------------------2482 | {3,1,37,1,18,1,29,1,45,1,...}:{0,8.25206814635817,0,0.34311110...}

id | tfxidf | cid | distance-----+-------------------------------------------------------------------+-----+------------2482 | {3,1,37,1,18,1,29,1,45,1,...}:{0,8.25206814635817,0,0.3431111...} | 1 | 1.53672977

id | tfxidf | cid | distance-----+-------------------------------------------------------------------+-----+------------2482 | {3,1,37,1,18,1,29,1,45,1,...}:{0,8.25206814635817,0,0.3431111...} | 1 | 1.53672977

Page 62: Greenplum Database 4.0, Greenplum Chorus, and … Database 4.0, Greenplum Chorus, and Advanced Analytics Presentation, June 2010 1 7/26/2010 Confidential Luke Lonergan CTO and Co-Founder

627/26/2010 Confidential

Page 63: Greenplum Database 4.0, Greenplum Chorus, and … Database 4.0, Greenplum Chorus, and Advanced Analytics Presentation, June 2010 1 7/26/2010 Confidential Luke Lonergan CTO and Co-Founder

637/26/2010 Confidential 637/26/2010 Confidential

MAD Skills in practice

Page 64: Greenplum Database 4.0, Greenplum Chorus, and … Database 4.0, Greenplum Chorus, and Advanced Analytics Presentation, June 2010 1 7/26/2010 Confidential Luke Lonergan CTO and Co-Founder

647/26/2010 Confidential

–Extracted data from EDW and other source systems into new analytic sandbox

–Generated a social graph from call detail records and subscriber data

–Within 2 weeks uncovered behavior where “connected”subscribers were seven times more likely to churn than average user

MAD Skills in practice

Page 65: Greenplum Database 4.0, Greenplum Chorus, and … Database 4.0, Greenplum Chorus, and Advanced Analytics Presentation, June 2010 1 7/26/2010 Confidential Luke Lonergan CTO and Co-Founder

657/26/2010 Confidential

–Wanted to identify unusual on-line campaigns as part of traffic quality monitoring

–Collected data on clicks and impressions for campaigns

–Built cumulative distribution functions for clicks and volume

–Computed pairwise distances–Identified outliers – e.g. too

much revenue based on number of clicks – using simple SQL over very large datasets

MAD Skills in practice