How Data Scientists can Leverage MicroStrategy and HP Vertica … › Strategy › media › downloads › ... · 2015-07-29 · How Data Scientists can Leverage MicroStrategy and

How Data Scientists can Leverage MicroStrategy and HP Vertica to Create an Analytics Ecosystem “Killer App”

Two great technologies that go great together!

Outline

Introduction: Conclusive Analytics

HP Vertica: Why We Chose Vertica

HP Vertica: Technology Overview

User-Defined Functions

“Killer App” Ecosystem Demo


Who is Conclusive Analytics?

We are experts in Managed Analytics as a Service, enabling our customers to grow revenue and increase profitability.

We derive insights that intelligently power our customers’ sales activities and marketing programs.

Our experts integrate data and deploy the latest in decision science, predictive analytics, and visualization techniques, enabling better, faster decisions to achieve improved and sustainable business results.


Conclusive Analytics – Quick Stats

Managed Database and Analytics as a Service

Founded in 1996

Small company – 50+ people

Charlotte Business Journal’s 2014 Best Places to Work

We’re hiring!


HP Vertica Analytics Platform: Why We Chose Vertica

Traditional Process Flow (we needed dramatic improvement, not just speed)

[Diagram: traditional process flow. Stage labels: O.L.T.P., RAW, Staging, E.T.L., MDM, D.W., D.M., B.I.]

Diagram annotations:

o Source, System-of-record

o ACID

o Add to history

o Best-Record

o Star Schema (“dim_”, “fact_”)

o Pre-aggregated, Highly Managed …for speed …for traditional B.I. (e.g. MicroStrategy)

o Pull out for SAS, SPSS or R

o Undo: Export, Compress, Encrypt, FTP, Decrypt, Un-compress, Import

SQL Server 2008

o SQL Server 2008 R2 Standard

o Quad Core AMD Opteron Processor 8378, 2.40 GHz

o 12 GB RAM

o 1 TB disk

o 475 million rows

o 2-3X raw storage

o Typical analytics queries: 30 minutes

o Highly “managed” & indexed

o The more (un-indexed) filters, the slower (table scans)

Vertica

o Different use-case

o No “management”

o No “indexes”

o Load and query at same time

o Typical: 8-20 second queries

o Redundant, fault-tolerant, scalable

o Approx. 50% of raw storage

o 10X, 60X, 100X, 1000X speed comparison

o Write-once, Read-many

o The more filters, the faster

Internal Test Results

HP Vertica Analytics Platform: Technology Overview

HP Vertica Quick Facts

Relational database

o ODBC/JDBC support, standard SQL interface

o Supports existing workflows & investments – ETL tools, BI tools, etc.

Columnar store (vs. traditional row-based)

Distributed, shared-nothing architecture

Massively parallel in-database analytics


Column Orientation

Vertica organizes data for each column

o Each column is stored separately on disk

o Only reads the columns needed to answer the query

Significant reduction of disk I/O
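The I/O saving from column orientation can be sketched in a few lines of Python. This is a toy in-memory model, not Vertica internals: the three sample rows and the helper names `row_store_sum` and `column_store_sum` are assumptions made for illustration. It counts how many values each layout touches to answer a query that needs only one column.

```python
# Toy illustration only: count the values touched to answer
# SELECT SUM(sales) under a row layout and a column layout.
rows = [
    {"state": "AZ", "name": "Amy", "sales": 60},
    {"state": "AZ", "name": "Bill", "sales": 20},
    {"state": "NC", "name": "Dan", "sales": 20},
]

# Column layout: one list per column, each stored separately.
columns = {key: [row[key] for row in rows] for key in rows[0]}


def row_store_sum(table, column):
    """Row layout: every field of every row is read to reach one column."""
    values_read = sum(len(row) for row in table)
    return sum(row[column] for row in table), values_read


def column_store_sum(cols, column):
    """Column layout: only the requested column is read."""
    data = cols[column]
    return sum(data), len(data)


row_total, row_reads = row_store_sum(rows, "sales")
col_total, col_reads = column_store_sum(columns, "sales")
```

Both layouts return the same total, but the column layout reads one value per row instead of one value per field, and the gap widens as the table gains columns.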


Advanced Compression

Slower disk I/O is replaced with fast CPU cycles and aggressive encoding and compression

Sorting and cardinality help determine encoding

Operates on encoded data

Data is decoded as late as possible

Implements late materialization
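A minimal Python sketch of run-length encoding shows why sorting and low cardinality matter, and what "operating on encoded data" can mean. The helper names `rle_encode` and `count_matching` are hypothetical, and this is not how Vertica implements its encodings.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def count_matching(encoded, target):
    """Count rows equal to target directly on the runs, without decoding."""
    return sum(length for value, length in encoded if value == target)


# A sorted, low-cardinality column collapses into very few runs.
states = sorted(["AZ", "NC", "AZ", "AZ", "NC", "AZ", "NC"])
encoded_states = rle_encode(states)
az_rows = count_matching(encoded_states, "AZ")
```

Seven values shrink to two runs once the column is sorted, and the count is answered from the run lengths alone. The same column left unsorted would produce many short runs and compress poorly.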


High Availability

RAID-like functionality within database

o If a node fails, a copy is available on one of the surviving nodes

o No need for manual log-based recovery

Always-on Queries and Loads

o System continues to load and query when nodes are down

o Automatically recovers missing data by querying other nodes


Cluster / K-Safety


What are projections?

Superprojection == Table

Projections == View (materialized)

o Collection of table columns

o Stores data in a format to optimize query execution

Encoding/compression

o Examples: Run-length encoding (RLE), Delta values

Sort order

Segmentation

o How the data are distributed across the nodes
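Segmentation can be pictured as routing each row to a node based on a hash of a key column. The Python sketch below assumes a three-node cluster and a made-up `sales` table. The CRC32 hash and the names `node_for` and `segment` are illustrative choices, not Vertica's actual hash function or API.

```python
import zlib

NODES = ["Node 1", "Node 2", "Node 3"]


def node_for(key, nodes=NODES):
    """Pick a node from a stable hash of the segmentation key."""
    digest = zlib.crc32(str(key).encode("utf-8"))
    return nodes[digest % len(nodes)]


def segment(rows, key_column):
    """Group rows by the node that owns each row's key."""
    placement = {node: [] for node in NODES}
    for row in rows:
        placement[node_for(row[key_column])].append(row)
    return placement


# Hypothetical fact rows, used only to exercise the routing.
sales = [{"sale_id": i, "amount": 10 * i} for i in range(1, 10)]
placement = segment(sales, "sale_id")
```

Each row lands on exactly one node, and the same key always maps to the same node, so every node can scan its own share of the table in parallel.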


Projections


Segmentation

[Diagram: a star schema with a store sales fact table and store, product, and date dimension tables. In the segmented layout, Node 1, Node 2, and Node 3 each hold a store sales fact portion together with the store, date, and product dimensions.]

Vertica and User-Defined Functions: What makes Vertica not just a database, but a platform

User-Defined Functions (UDF)

Can be written in C++, Java, or R

Types:

o Scalar

o Transform

o Analytic

o Aggregate

o Load


User-Defined Functions (UDF)

Two types of UDxs are supported in R:

User-Defined Scalar Functions (UDSFs)

o Take a single row of data and produce a single output value

o Built-in examples: upper(), greatest(), sqrt(), etc.

User-Defined Transform Functions (UDTFs)

o Take any number of rows of data and produce any number of rows and columns of output


Function Volatility

A promise to the optimizer about the behavior of the function

3 types:

VOLATILE

o May return different results on successive calls with the same arguments

o Examples: random(), currval(), timeofday()

STABLE

o Will return same results with same arguments for all rows within a single statement

IMMUTABLE

o Will return same results with same arguments forever
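One way to see why the volatility promise matters: a result can be reused safely only when the function is guaranteed to return the same value for the same arguments. The Python sketch below is an analogy under that assumption. The `cached_call` helper and the two sample functions are hypothetical, and the source does not describe what the Vertica optimizer actually does with the promise.

```python
import random


def cached_call(func, cache, *args):
    """Reuse a stored result instead of calling func again with the same arguments."""
    if args not in cache:
        cache[args] = func(*args)
    return cache[args]


def square(x):
    """Behaves like an IMMUTABLE function: same input, same output, always."""
    return x * x


def noisy_square(x):
    """Behaves like a VOLATILE function: the output changes between calls."""
    return x * x + random.random()


cache = {}
first = cached_call(square, cache, 4)
second = cached_call(square, cache, 4)  # served from the cache, still correct
```

Caching `square` is harmless, while caching `noisy_square` would silently freeze a value that is supposed to change, which is why a wrong volatility declaration can produce wrong results.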


Null Input Behavior

Another optimization strategy – how your function will handle NULL input

2 values (with 2 synonyms):

CALLED_ON_NULL_INPUT (or DEFAULT STRICTNESS)

o The function must be called, even if one or more input values are NULL

RETURN_NULL_ON_NULL_INPUT (or STRICT)

o The function always returns a NULL value if any of its inputs are NULL
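The two null-input behaviors can be mimicked in Python, with `None` standing in for NULL. This is an analogy, not Vertica code: the `strict` decorator and both sample functions are hypothetical, and returning 0.0 for a NULL input is an arbitrary policy chosen for the example.

```python
import functools


def strict(func):
    """Emulate RETURN_NULL_ON_NULL_INPUT: skip the call if any input is None."""
    @functools.wraps(func)
    def wrapper(*args):
        if any(arg is None for arg in args):
            return None
        return func(*args)
    return wrapper


@strict
def mul_strict(x, y):
    """Never sees None, so the body needs no null handling."""
    return x * y


def mul_called_on_null(x, y):
    """Emulate CALLED_ON_NULL_INPUT: the body runs and must handle None itself."""
    if x is None or y is None:
        return 0.0  # hypothetical policy; a real function decides its own
    return x * y
```

With the strict variant the function body is never executed for NULL inputs, which is what lets the engine skip the call entirely.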


Creating R Functions

R Factory Function

o Defines how R and Vertica interact when your R function is called

Main R Function(s)

o Associated with each factory function; returns the data to Vertica after processing it in R

parametertypecallback function

o Defines parameter names and types

o Only required if your function is used with Vertica parameters

outtypecallback function (optional)

o Defines type(s) and precision of the values returned to Vertica

metadata definition (optional)


Example R Scalar Function – Multiply 2 Numbers

R code

##########
# Example: Multiplication
# Filename: mul.R
##########

###
# @brief multiplies col1 and col2 of the input data frame.
###
mul <- function(x) {
  pr <- x[,1] * x[,2]
  pr
}

mulFactory <- function() {
  list(name=mul, udxtype=c("scalar"),
       intype=c("float","float"), outtype=c("float"),
       outtypecallback=mulReturnType,
       volatility=c("immutable"))
}

mulReturnType <- function(x) {
  ret = data.frame(datatype = rep(NA,1), length = rep(NA,1),
                   scale = rep(NA,1), name = rep(NA,1))
  ret[1,1] = "float"
  ret[1,4] = "Multiplied"
  ret
}

SQL code

=> CREATE LIBRARY mulLib AS '/home/dbadmin/mul.r'
-> LANGUAGE 'R';
=> CREATE FUNCTION mul AS LANGUAGE 'R'
-> NAME 'mulFactory' LIBRARY mulLib;
=> SELECT x, y FROM twocols;
 x | y
---+---
 2 | 2
 4 | 4
 6 | 6
 8 | 8
=> SELECT mul(x,y) FROM twocols;
 Multiplied
------------
 4
 16
 36
 64


Analytic Functions – SQL Syntax

Example 1:

SELECT state, name, sales,
       median(sales) OVER ( PARTITION BY state ) AS msales
FROM table
ORDER BY state, name;

AZ & NC each get a partition

o AZ = [15, 20, 40, 60] (median=30)

o NC = [10, 20, 50] (median=20)

TABLE DATA:

state  name  sales
-----  ----  -----
AZ     Amy   60
AZ     Bill  20
AZ     Cara  15
NC     Dan   20
NC     Eve   50
AZ     Fay   40
NC     Gene  10

QUERY OUTPUT:

state  name  sales  msales
-----  ----  -----  ------
AZ     Amy   60     30
AZ     Bill  20     30
AZ     Cara  15     30
AZ     Fay   40     30
NC     Dan   20     20
NC     Eve   50     20
NC     Gene  10     20
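The partition logic of Example 1 can be reproduced in plain Python to check the medians by hand. This sketch uses the table data from the slide. It mirrors the semantics of the query and does not describe how the database executes it.

```python
from collections import defaultdict
from statistics import median

table = [
    ("AZ", "Amy", 60), ("AZ", "Bill", 20), ("AZ", "Cara", 15),
    ("NC", "Dan", 20), ("NC", "Eve", 50), ("AZ", "Fay", 40),
    ("NC", "Gene", 10),
]

# PARTITION BY state: collect the sales values of each partition.
partitions = defaultdict(list)
for state, _name, sales in table:
    partitions[state].append(sales)

medians = {state: median(values) for state, values in partitions.items()}

# Every input row is kept and the partition median is attached to it,
# then the result is ordered by state and name as in the query.
output = sorted(
    (state, name, sales, medians[state]) for state, name, sales in table
)
```

Unlike a GROUP BY aggregate, the analytic function leaves all seven rows in the output and repeats the partition's median on each of them.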


Example R Transform Function – kmeans

R code – Main Function

myKmeans <- function(x, y) {
  # Error check and get the number of
  # clusters to be formed.
  #
  # The y argument contains the parameters from the
  # USING PARAMETERS clause in your SQL function call.
  if(!is.null(y[['k']]))
    k=as.numeric(y[['k']])
  else
    stop("Expected parameter k")

  # Get the number of columns in the input data frame
  cols = ncol(x)

  # Run the kmeans algorithm on all columns except the last
  cl <- kmeans(x[,1:(cols-1)], k)

  # Get the cluster information from the result of above
  Result <- cl$cluster

  # Return result to vertica
  Result <- data.frame(VCol=Result)
  Result
}

Parametertypes, Factory Functions

# Callback function to return parameter types
myKmeansCallback <- function() {
  params <- data.frame(datatype=rep(NA,1), length=rep(NA,1),
                       scale=rep(NA,1), name=rep(NA,1))
  params[1,1] = "int"
  params[1,4] = "k"
  params
}

# Function that tells Vertica the name of the actual
# R function, the parameter types and the return types
myKmeansFactory <- function() {
  list(name=myKmeans, udxtype=c("transform"),
       intype=c("float","float"), outtype=c("int"),
       parametertypecallback=myKmeansCallback,
       volatility=c("stable"),
       strict=c("called_on_null_input"))
}


Example R Transform Function – kmeans

SQL Code

=> CREATE LIBRARY myKmeansLib AS '/home/dbadmin/myKmeans.r'
-> LANGUAGE 'R';
=> CREATE TRANSFORM FUNCTION myKmeans AS LANGUAGE 'R'
-> NAME 'myKmeansFactory' LIBRARY myKmeansLib;


“forecast” Package (from CRAN)

Forecasting functions for time series and linear models

Methods and tools for displaying and analyzing univariate time series forecasts, including exponential smoothing via state space models and automatic ARIMA modelling.

http://cran.r-project.org/web/packages/forecast/index.html

Example R Scalar Function

Elapsed Forecasting Time (1,000 Customers)

R Processing Methodology                 Average Elapsed Time (seconds)
---------------------------------------  ------------------------------
“Traditional” RODBC / Remote Function    48
Vertica UDF – Unoptimized Projections    28
Vertica UDF – Optimized Projections      10

Demo: Let’s see the “Killer App” in action!

Questions?

How Data Scientists can Leverage MicroStrategy and HP Vertica to Create an Analytics Ecosystem “Killer App”

Come visit us at the HP booth for a trial!
