How Data Scientists can Leverage MicroStrategy and HP
Vertica to Create an Analytics Ecosystem “Killer App”
Two great technologies that go great together!
Outline
Introduction: Conclusive Analytics
HP Vertica: Why We Chose Vertica
HP Vertica: Technology Overview
User-Defined Functions
“Killer App” Ecosystem Demo
Who is Conclusive Analytics?
We are experts in Managed Analytics as a
Service, enabling our customers to grow
revenue and increase profitability.
We derive insights that intelligently power our
customers’ sales activities and marketing
programs.
Our experts integrate data and deploy the latest in
decision science, predictive analytics, and
visualization techniques enabling better, faster
decisions to achieve improved and sustainable
business results.
Conclusive Analytics – Quick Stats
Managed Database and Analytics as a Service
Founded in 1996
Small company – 50+ people
Charlotte Business Journal’s 2014 Best Places to Work
We’re hiring!
HP Vertica Analytics Platform: Why We Chose Vertica
Traditional Process Flow (we needed dramatic improvement, not just speed)
[Diagram: process flow through the O.L.T.P., RAW, Staging, E.T.L., MDM, D.W., D.M., and B.I. stages]
Diagram labels: Source, System-of-record; ACID; Best-Record; Add to history; Pre-aggregated; Star Schema (“dim_”, “fact_”); Highly Managed (…for speed, …for traditional B.I.); e.g. MicroStrategy; Undo
Pull out for SAS, SPSS or R: Export, Compress, Encrypt, FTP, Decrypt, Un-compress, Import
SQL Server 2008
SQL Server 2008 R2 Standard
Quad Core AMD Opteron Processor 8378, 2.40 GHz
12 GB RAM
1 TB disk
475 million rows
2-3X raw storage
Typical analytics queries: 30 minutes
Highly “managed” & indexed
The more (un-indexed) filters, the slower (table scans)
Vertica: Different use-case
No “management”
No “indexes”
Load and query at same time
Typical: 8-20 second queries
Redundant, fault-tolerant, scalable
Approx. 50% of raw storage
10X, 60X, 100X, 1000X speed comparison
Write-once, Read-many
The more filters, the faster
Internal Test Results
HP Vertica Analytics Platform: Technology Overview
HP Vertica Quick Facts
Relational database
o ODBC/JDBC support, standard SQL interface
o Supports existing workflows & investments – ETL tools, BI tools, etc.
Columnar store (vs. traditional row-based)
Distributed, shared-nothing architecture
Massively parallel in-database analytics
Column Orientation
Vertica organizes data for each column
o Each column is stored separately on disk
o Only reads the columns needed to answer the query
Significant reduction of disk I/O
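The I/O saving can be illustrated with a toy model. This is a minimal sketch with hypothetical data and function names, not Vertica's storage engine: it counts how many values each layout has to touch to total a single column.

```python
# Toy comparison of row-oriented and column-oriented reads.
# The table contents and function names are made up for illustration.
rows = [
    {"state": "AZ", "name": "Amy", "sales": 60},
    {"state": "AZ", "name": "Bill", "sales": 20},
    {"state": "NC", "name": "Dan", "sales": 20},
]

# Column layout: one list per column, each stored separately.
columns = {key: [row[key] for row in rows] for key in rows[0]}


def total_sales_row_store(table):
    """Sum sales from a row layout; every value in each row is read."""
    values_read = 0
    total = 0
    for row in table:
        values_read += len(row)  # the whole row is fetched
        total += row["sales"]
    return total, values_read


def total_sales_column_store(cols):
    """Sum sales from a column layout; only the sales column is read."""
    sales = cols["sales"]
    return sum(sales), len(sales)


row_result = total_sales_row_store(rows)
column_result = total_sales_column_store(columns)
print(row_result, column_result)
```

Both layouts return the same total, but the column layout touches one third of the values here because the query needs one of three columns.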
Advanced Compression
Slower disk I/O is replaced with fast
CPU cycles and aggressive encoding
and compression
Sorting and cardinality help
determine encoding
Operates on encoded data
Data is decoded as late as possible
Implements late materialization
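A rough sketch of two of these ideas, using run-length encoding (RLE) as the example: sorting a low-cardinality column shortens the encoding, and a count can be answered from the encoded runs without decoding them. The helper names are hypothetical and this is not how Vertica implements its encodings internally.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def count_equal(encoded, target):
    """Count matching rows directly on the encoded runs, without decoding."""
    return sum(length for value, length in encoded if value == target)


unsorted_states = ["NC", "AZ", "AZ", "NC", "AZ", "NC", "AZ"]
sorted_states = sorted(unsorted_states)

unsorted_runs = rle_encode(unsorted_states)
sorted_runs = rle_encode(sorted_states)

print(len(unsorted_runs), len(sorted_runs))
print(count_equal(sorted_runs, "AZ"))
```

The sorted column collapses to two runs, while the unsorted column needs six, which is why sort order and cardinality influence the choice of encoding.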
High Availability
RAID-like functionality within database
o If a node fails, a copy is available on one of the surviving nodes
o No need for manual log-based recovery
Always-on Queries and Loads
o System continues to load and query when nodes are down
o Automatically recovers missing data by querying other nodes
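The redundancy idea can be sketched as follows. This assumes a simple ring placement in which each data segment is also copied to the next k nodes; the placement rule and function names are assumptions for illustration, not a description of Vertica's actual layout.

```python
def place_segments(num_nodes, k=1):
    """Map each node to the set of segment ids it stores.

    Assumed rule: segment i lives on node i and on the next k nodes.
    """
    placement = {node: set() for node in range(num_nodes)}
    for segment in range(num_nodes):
        for offset in range(k + 1):
            placement[(segment + offset) % num_nodes].add(segment)
    return placement


def readable_segments(placement, failed_nodes):
    """Return the segments still available on the surviving nodes."""
    alive = set()
    for node, segments in placement.items():
        if node not in failed_nodes:
            alive |= segments
    return alive


placement = place_segments(num_nodes=3, k=1)
print(readable_segments(placement, failed_nodes={1}))
print(readable_segments(placement, failed_nodes={0, 1}))
```

With one extra copy per segment, losing any single node leaves every segment readable; losing two adjacent nodes in this toy layout does not.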
Cluster / K-Safety
What are projections?
Superprojection == Table
Projections == View (materialized)
o Collection of table columns
o Stores data in a format to optimize query execution
Encoding/compression
o Examples: Run-length encoding (RLE), Delta values
Sort order
Segmentation
o How the data are distributed across the nodes
Projections
Segmentation
[Diagram: logical schema with a store sales fact table and store, product, and date dimension tables. Physical layout on Node 1, Node 2, and Node 3: each node shows store sales fact data plus the store, date, and product dimensions.]
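One way to picture segmentation is hashing fact rows across nodes while keeping a full copy of the small dimension data on every node. The sketch below assumes that reading of the diagram; the CRC32 hash, the sample rows, and the function names are all hypothetical choices for illustration.

```python
import zlib

NODES = ["Node 1", "Node 2", "Node 3"]


def node_for(key):
    """Pick a node from a stable hash of the segmentation key."""
    return NODES[zlib.crc32(str(key).encode("utf-8")) % len(NODES)]


def distribute(fact_rows, dimension_rows, segment_key):
    """Segment fact rows by key and replicate dimension rows to every node."""
    layout = {
        node: {"fact": [], "dimension": list(dimension_rows)} for node in NODES
    }
    for row in fact_rows:
        layout[node_for(row[segment_key])]["fact"].append(row)
    return layout


# Made-up sample data.
store_dimension = [{"store_id": 1}, {"store_id": 2}]
store_sales_fact = [
    {"sale_id": sale_id, "store_id": 1 + sale_id % 2, "amount": 10 * sale_id}
    for sale_id in range(12)
]

layout = distribute(store_sales_fact, store_dimension, segment_key="sale_id")
for node, data in layout.items():
    print(node, len(data["fact"]), len(data["dimension"]))
```

Because each node holds its own slice of the fact data and a local copy of the dimensions, a join between them can run on every node without moving rows between nodes.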
Vertica and User-Defined Functions: What makes Vertica not just a database, but a platform
User-Defined Functions (UDF)
Can be written in C++, Java, or R
Types:
o Scalar
o Transform
o Analytic
o Aggregate
o Load
User-Defined Functions (UDF)
Two types of UDx are supported in R:
User-Defined Scalar Functions (UDSFs)
o Take a single row of data and produce a single output value
o Built-in examples: upper(), greatest(), sqrt(), etc.
User-Defined Transform Functions (UDTFs)
o Take a set of input rows and produce zero or more rows (with one or more columns) of output
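The difference in calling convention can be sketched in a few lines. The helper names below are hypothetical and only mimic the shape of the two function types: a scalar function is applied once per row, while a transform function receives a whole set of rows and may return a different number of rows.

```python
def apply_scalar(func, rows):
    """Scalar style: one call per row, one output value per row."""
    return [func(*row) for row in rows]


def apply_transform(func, rows):
    """Transform style: one call per set of rows, any number of output rows."""
    return list(func(rows))


def tokenize(rows):
    """Example transform: emit one output row per word in each input row."""
    for (text,) in rows:
        for word in text.split():
            yield (word,)


rows = [("hello world",), ("vertica",)]

scalar_output = apply_scalar(lambda text: text.upper(), rows)
transform_output = apply_transform(tokenize, rows)
print(scalar_output)
print(transform_output)
```

Here two input rows produce two scalar results but three transform rows.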
Function Volatility
A promise to the optimizer about the behavior of the function
3 types:
VOLATILE
o May return different results on successive calls with the same arguments
o Examples include random(), currval(), timeofday()
STABLE
o Will return same results with same arguments for all rows within a single statement
IMMUTABLE
o Will return same results with same arguments forever
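Why the promise matters can be shown with a small sketch: if a function is declared non-volatile, a caller is free to reuse a result for repeated arguments instead of calling the function again. The evaluator below is a hypothetical illustration of that idea, not a description of how Vertica's optimizer works.

```python
def evaluate(func, rows, volatility):
    """Apply func to each row, reusing results when the promise allows it.

    Returns the results and the number of real function calls made.
    """
    cache = {}
    calls = 0
    results = []
    for args in rows:
        cacheable = volatility != "volatile"
        if cacheable and args in cache:
            results.append(cache[args])
            continue
        value = func(*args)
        calls += 1
        if cacheable:
            cache[args] = value
        results.append(value)
    return results, calls


def mul(x, y):
    return x * y


rows = [(2, 2), (4, 4), (2, 2)]
immutable_run = evaluate(mul, rows, "immutable")
volatile_run = evaluate(mul, rows, "volatile")
print(immutable_run, volatile_run)
```

Declared immutable, the repeated row costs no extra call; declared volatile, every row must be evaluated.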
Null Input Behavior
Another optimization strategy – how your function will handle NULL input
2 values (with 2 synonyms):
CALLED_ON_NULL_INPUT (or DEFAULT STRICTNESS)
o The function must be called, even if one or more input values are NULL
RETURN_NULL_ON_NULL_INPUT (or STRICT)
o The function always returns a NULL value if any of its inputs are NULL
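The two behaviors can be sketched with a hypothetical dispatcher, using Python's None to stand in for SQL NULL. With the strict setting the function body is skipped entirely when any input is NULL; otherwise the function is called and must handle NULL itself.

```python
def call_udf(func, args, strict):
    """Call func unless strict handling applies and some input is None."""
    if strict and any(arg is None for arg in args):
        return None  # RETURN_NULL_ON_NULL_INPUT: skip the call
    return func(*args)  # CALLED_ON_NULL_INPUT: the function decides


def mul_or_zero(x, y):
    """Example function that chooses its own NULL handling."""
    if x is None or y is None:
        return 0.0
    return x * y


print(call_udf(mul_or_zero, (2.0, None), strict=True))
print(call_udf(mul_or_zero, (2.0, None), strict=False))
print(call_udf(mul_or_zero, (2.0, 3.0), strict=True))
```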
Creating R Functions
R Factory Function
o Defines how R and Vertica interact when your R function is called
Main R Function(s)
o The function named by each factory function; it processes the data in R and
returns the results to Vertica
parametertypecallback function
o Defines parameter names and types
o Only required if your function is used with Vertica parameters
outtypecallback function (optional)
o Defines type(s) and precision of the values returned to Vertica
metadata definition (optional)
Example R Scalar Function – Multiply 2 Numbers
R code
##########
# Example: Multiplication
# Filename: mul.R
##########

###
# @brief multiplies col1 and col2 of the input data frame.
###
mul <- function(x) {
  pr <- x[,1] * x[,2]
  pr
}

mulFactory <- function() {
  list(name=mul, udxtype=c("scalar"),
       intype=c("float","float"),
       outtype=c("float"),
       outtypecallback=mulReturnType,
       volatility=c("immutable"))
}

mulReturnType <- function(x) {
  ret = data.frame(datatype = rep(NA,1), length = rep(NA,1),
                   scale = rep(NA,1), name = rep(NA,1))
  ret[1,1] = "float"
  ret[1,4] = "Multiplied"
  ret
}
SQL code
=> CREATE LIBRARY mulLib AS '/home/dbadmin/mul.r'
-> LANGUAGE 'R';
=> CREATE FUNCTION mul AS LANGUAGE 'R'
-> NAME 'mulFactory' LIBRARY mulLib;

=> SELECT x, y FROM twocols;
 x | y
---+---
 2 | 2
 4 | 4
 6 | 6
 8 | 8

=> SELECT mul(x,y) FROM twocols;
 Multiplied
------------
          4
         16
         36
         64
Analytic Functions – SQL Syntax
Example 1:
SELECT state, name, sales,
       median(sales) OVER (PARTITION BY state) AS msales
FROM table
ORDER BY state, name;
AZ & NC each get a partition
o AZ = [15, 20, 40, 60] (median=30)
o NC = [10, 20, 50] (median=20)
TABLE DATA:
state  name  sales
-----  ----  -----
AZ     Amy   60
AZ     Bill  20
AZ     Cara  15
NC     Dan   20
NC     Eve   50
AZ     Fay   40
NC     Gene  10

QUERY OUTPUT:
state  name  sales  msales
-----  ----  -----  ------
AZ     Amy   60     30
AZ     Bill  20     30
AZ     Cara  15     30
AZ     Fay   40     30
NC     Dan   20     20
NC     Eve   50     20
NC     Gene  10     20
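The partitioning logic of the query can be mirrored in a short sketch that uses the same seven rows. The function name is hypothetical; the point is that the median is computed once per state and then attached to every row in that state, so the row count does not shrink as it would with GROUP BY.

```python
from statistics import median

# The rows from the TABLE DATA listing: (state, name, sales).
table = [
    ("AZ", "Amy", 60),
    ("AZ", "Bill", 20),
    ("AZ", "Cara", 15),
    ("NC", "Dan", 20),
    ("NC", "Eve", 50),
    ("AZ", "Fay", 40),
    ("NC", "Gene", 10),
]


def median_over_partition(rows):
    """Attach the per-state median of sales to every row, ordered by state, name."""
    partitions = {}
    for state, _name, sales in rows:
        partitions.setdefault(state, []).append(sales)
    medians = {state: median(values) for state, values in partitions.items()}
    return [
        (state, name, sales, medians[state]) for state, name, sales in sorted(rows)
    ]


result = median_over_partition(table)
for row in result:
    print(row)
```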
Example R Transform Function – kmeans
R code – Main Function
myKmeans <- function(x, y) {
  # Error check and get the number of clusters to be formed.
  #
  # The y argument contains the parameters from the
  # USING PARAMETERS clause in your SQL function call.
  if(!is.null(y[['k']]))
    k = as.numeric(y[['k']])
  else
    stop("Expected parameter k")

  # Get the number of columns in the input data frame
  cols = ncol(x)

  # Run the kmeans algorithm on every column except the last.
  # The parentheses matter: 1:cols-1 parses as (1:cols)-1.
  cl <- kmeans(x[, 1:(cols-1)], k)

  # Get the cluster information from the result of above
  Result <- cl$cluster

  # Return result to vertica
  Result <- data.frame(VCol=Result)
  Result
}
Parametertypes, Factory Functions
# Callback function to return parameter types
myKmeansCallback <- function() {
  params <- data.frame(datatype=rep(NA,1), length=rep(NA,1),
                       scale=rep(NA,1), name=rep(NA,1))
  params[1,1] = "int"
  params[1,4] = "k"
  params
}

# Function that tells Vertica the name of the actual
# R function, the parameter types and the return types
myKmeansFactory <- function() {
  list(name=myKmeans, udxtype=c("transform"),
       intype=c("float","float"),
       outtype=c("int"),
       parametertypecallback=myKmeansCallback,
       volatility=c("stable"),
       strict=c("called_on_null_input"))
}
Example R Transform Function – kmeans
SQL Code
=> CREATE LIBRARY myKmeansLib AS '/home/dbadmin/myKmeans.r'
-> LANGUAGE 'R';
=> CREATE TRANSFORM FUNCTION myKmeans AS LANGUAGE 'R'
-> NAME 'myKmeansFactory' LIBRARY myKmeansLib;
“forecast” Package (from CRAN)
Forecasting functions for time series and linear models
Methods and tools for displaying and analyzing univariate time series forecasts
including exponential smoothing via state space models and automatic ARIMA
modelling.
http://cran.r-project.org/web/packages/forecast/index.html
Example R Scalar Function
Elapsed Forecasting Time (1,000 Customers)
R Processing Methodology                  Average Elapsed Time (seconds)
“Traditional” RODBC / Remote Function     48
Vertica UDF – Unoptimized Projections     28
Vertica UDF – Optimized Projections       10
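For readers who prefer ratios, the relative speedups implied by the three averages can be worked out directly. This is simple arithmetic on the figures above; the dictionary keys are shortened labels, not official names.

```python
# Average elapsed times in seconds, taken from the table.
timings = {
    "RODBC / remote function": 48,
    "Vertica UDF, unoptimized projections": 28,
    "Vertica UDF, optimized projections": 10,
}

baseline = timings["RODBC / remote function"]

# Speedup of each method relative to the RODBC baseline.
speedups = {name: round(baseline / seconds, 2) for name, seconds in timings.items()}

for name, factor in speedups.items():
    print(f"{name}: {factor}x")
```

Moving the R code into a Vertica UDF is roughly 1.7 times faster than the RODBC approach, and optimizing the projections brings the total to 4.8 times faster.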
Demo: Let’s see the “Killer App” in action!
Questions?
Come visit us at the HP booth for a trial!