Pivotal Big Data Suite: A Technical Overview

TECHNICAL OVERVIEW:

Pivotal Big Data Suite

Les Klein, Field CTO Data

Pivotal

@LesKlein #PivotalForum #Istanbul #BigData #Analytics

Forward Looking Statements

This presentation contains “forward-looking statements” as defined under the Federal Securities Laws. Actual results could differ materially from those projected in the forward-looking statements as a result of certain risk factors, including but not limited to: (i) adverse changes in general economic or market conditions; (ii) delays or reductions in information technology spending; (iii) the relative and varying rates of product price and component cost declines and the volume and mixture of product and services revenues; (iv) competitive factors, including but not limited to pricing pressures and new product introductions; (v) component and product quality and availability; (vi) fluctuations in VMware, Inc.’s operating results and risks associated with trading of VMware stock; (vii) the transition to new products, the uncertainty of customer acceptance of new product offerings and rapid technological and market change; (viii) risks associated with managing the growth of our business, including risks associated with acquisitions and investments and the challenges and costs of integration, restructuring and achieving anticipated synergies; (ix) the ability to attract and retain highly qualified employees; (x) insufficient, excess or obsolete inventory; (xi) fluctuating currency exchange rates; (xii) threats and other disruptions to our secure data centers and networks; (xiii) our ability to protect our proprietary technology; (xiv) war or acts of terrorism; and (xv) other one-time events and other important factors disclosed previously and from time to time in the filings of EMC Corporation, the parent company of Pivotal, with the U.S. Securities and Exchange Commission. EMC and Pivotal disclaim any obligation to update any such forward-looking statements after the date of this release.

© 2016 Pivotal Software, Inc. All rights reserved.


• Complete platform
• Hadoop Native SQL
• Deployment options
• Based on open source
• Flexible licensing
• Advanced data services

PIVOTAL GREENPLUM DATABASE: Data warehouse database based on open source Greenplum Database

PIVOTAL HDB: Open source analytical database for Apache Hadoop based on Apache HAWQ

PIVOTAL GEMFIRE: Open source application and transaction data grid based on Apache Geode


Open source data management portfolio

Great software companies leverage Big Data to fundamentally change the consumer experience and pioneer entirely new business models


• Financial Services: $4BN
• Hospitality: $26BN
• Transportation: $50BN
• Entertainment: $54BN
• Automotive: $30BN
• Industrial Products: $3.2BN

CLOUD NATIVE SOFTWARE IS CHANGING INDUSTRIES

Data is Fueling Software


• Hundreds of thousands of “trip” events each day
• 400+ billion viewing-related events per day
• Five billion training data points for Price Tip feature

Disruptors Use a LOT of Data


“We’ve found that when a host selects a price that’s within 5% of their tip, they’re nearly 4 times more likely to get booked”

“The importance of accuracy and efficiency […], will continue to rise as we expand and improve products like uberPOOL and beyond.”

“Over 75% of what people watch come from our recommendations”

Data manifests as features in an app


• (Data) Microservices: Loosely coupled services architecture, bounded by context
• Cloud-Native Platforms: Enabling continuous delivery & automated operations
• Open Source Database Innovation: Extreme scale & performance advantages, built for the cloud
• Machine Learning: Use of predictive analytics to build smart apps

How are they accomplishing this?


These companies…

Release new features in minutes, multiple times a day

Support a micro-services architecture

Consume a wide range of data sources and protocols

Store and Analyze all their data

Update algorithms and predictive models daily

Continuously ask lots of questions of their data

Modify data pipelines and add processing steps daily


…but most enterprises are not quite there yet

• Applications scalability limited by databases
• Real-time data insights limited by disconnected OLTP and OLAP systems
• Data services are not ready for cloud platforms

[Diagrams: App 1, App 2 and App 3 bottlenecked on a single transactional database; a transactional database (TRANSACTIONS) feeding an analytic database (ANALYTICS) through ETL / ELT batches with delay Δt; continuous delivery]


Modern Cloud-Native Data Architecture

[Diagram components: Cloud Infrastructure; Cloud-Native Platform (Platform Runtime, Microservices Framework); Programming + Operating Model; Microservices and Polyglot Persistence (IMDG, K/V Store, Relational DB); Big Data & Machine Learning (Hadoop, DW, Spark); Stream + Batch Processing]


New pressures are breaking fragile systems



Apps scalability limited by scalability of databases

DB scalability limitations are aggravated by additional devices, clients and apps

Scale-out applications vs Scale-up databases

[Diagram: existing applications (App 1, App 2, App 3), new devices and clients, and new cloud native scalable data apps all converge on one transactional database, which becomes the bottleneck]


GemFire: Cloud-scale high performance transactional data

• Horizontally scalable
• Ultra fast, low-latency in-memory transactions
• Fully configurable data consistency
• Reliable eventing and notification model
• Highly Available, auto-healing
• Inter-cluster WAN replication

[Diagram: multiple instances of custom apps (App 1, App 2) connect to Pivotal GemFire for transactional access through the native API or REST / HTTP; GemFire pushes updates back to the apps]
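The eventing and push-update model listed above can be illustrated with a toy example. This is only a sketch in plain Python and does not use the GemFire API: the `ToyRegion` class, its `subscribe` and `put` methods, and the key names are all hypothetical, chosen to show the idea of an in-memory key/value store that notifies interested apps when a value changes.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List

# A listener receives (key, old_value, new_value) whenever a key changes.
Listener = Callable[[str, Any, Any], None]


class ToyRegion:
    """Hypothetical in-memory key/value region that pushes updates to subscribers."""

    def __init__(self) -> None:
        self._data: Dict[str, Any] = {}
        self._listeners: Dict[str, List[Listener]] = defaultdict(list)

    def subscribe(self, key: str, listener: Listener) -> None:
        """Register interest in one key."""
        self._listeners[key].append(listener)

    def put(self, key: str, value: Any) -> None:
        """Store the value in memory, then notify every subscriber of that key."""
        old = self._data.get(key)
        self._data[key] = value
        for listener in self._listeners[key]:
            listener(key, old, value)

    def get(self, key: str, default: Any = None) -> Any:
        return self._data.get(key, default)


# Usage: an app subscribes to one order and receives pushed updates for it.
events = []
region = ToyRegion()
region.subscribe("order:42", lambda k, old, new: events.append((k, old, new)))
region.put("order:42", "NEW")
region.put("order:42", "SHIPPED")
region.put("order:7", "NEW")  # nobody subscribed to this key, so no event
```

The app never polls: it reacts to the two events delivered for the key it registered interest in, which is the contrast with request-response access that the slide draws.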


Batch-mode latency prevents real-time analysis


Real-time data analytics is limited by data integration batches

Overnight ETL / ELT jobs expose data that is already outdated

• Analytical processes don’t have access to the latest data
• ETL/ELT processes are expensive and hard to maintain
• Batch processing windows limit data scalability

[Diagram: App 1, App 2 and App 3 write to a transactional database (TRANSACTIONS); ETL / ELT batches move the data, with delay Δt, to an MPP database (ANALYTICS). Data Temperature axis: Hot to Cold]


Operationalized data insights need an event-driven architecture

Combination of SQL Analytics and NoSQL event-driven transactions is needed

• Data Insights must be immediately pushed to applications
• Apps should be able to react in real-time to analytical findings

[Diagram: App 1, App 2 and App 3 use a transactional database through APIs / NoSQL (TRANSACTIONS); an MPP database provides Machine Learning and Advanced Analytics through ANSI SQL (ANALYTICS); Data Insights flow back to the transactional side. Data Temperature axis: Warm, Hot]

GemFire and GPDB - Big Data meets Fast Data

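[Diagram: multiple instances of custom apps (App 1, App 2) use Pivotal GemFire for transactional access through the native API or REST / HTTP, and receive push updates. Transactional data is written behind from GemFire to Pivotal Greenplum through a parallel, configurable data load. Greenplum serves analytical access (ANSI SQL) for data science, analytics & ML, and analytical data is loaded back to the GemFire cache]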
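The write-behind flow in this diagram can be sketched in a few lines. The example below is an illustration under stated assumptions, not the GemFire or Greenplum mechanism: `WriteBehindCache`, the `trades` table and the batch size are hypothetical, and Python's built-in `sqlite3` stands in for the analytical database. It shows writes being acknowledged from memory while batches are flushed to the backing store later.

```python
import sqlite3
from collections import deque


class WriteBehindCache:
    """Hypothetical cache: writes land in memory first and reach the database in batches."""

    def __init__(self, conn: sqlite3.Connection, batch_size: int = 2) -> None:
        self.conn = conn
        self.batch_size = batch_size
        self.cache = {}          # in-memory copy served to applications
        self.pending = deque()   # writes not yet flushed to the database
        conn.execute(
            "CREATE TABLE IF NOT EXISTS trades (id TEXT PRIMARY KEY, amount REAL)"
        )

    def put(self, key: str, amount: float) -> None:
        self.cache[key] = amount              # fast path: memory only
        self.pending.append((key, amount))    # remember the write for later
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self) -> int:
        """Write all pending entries to the database in one batch."""
        batch = list(self.pending)
        self.pending.clear()
        self.conn.executemany(
            "INSERT OR REPLACE INTO trades (id, amount) VALUES (?, ?)", batch
        )
        self.conn.commit()
        return len(batch)


conn = sqlite3.connect(":memory:")  # stand-in for the analytical database
cache = WriteBehindCache(conn, batch_size=2)

cache.put("t1", 100.0)
rows_before = conn.execute("SELECT COUNT(*) FROM trades").fetchone()[0]

cache.put("t2", 250.0)  # second write fills the batch and triggers a flush
cache.put("t3", 75.0)   # stays pending until the next flush
rows_after = conn.execute("SELECT COUNT(*) FROM trades").fetchone()[0]
total = conn.execute("SELECT SUM(amount) FROM trades").fetchone()[0]
```

The trade-off is visible in the counts: the application sees `t3` in the cache immediately, while the analytical side sees it only after the next flush.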




Cloud-native apps are better suited to NoSQL

Enabling fast and scalable event-driven data services

• Unidirectional, request-response SQL: monolithic apps needed complex, schema-based SQL databases
• Bidirectional, event-driven APIs: microservices need much simpler schemas, but much better scalability


GemFire for Pivotal Cloud Foundry

Lightning fast in-memory persistence for cloud native apps

• One-click provisioning
• Pre-packaged configuration
• Embedded monitoring by Pulse
• Auto application binding
• Multi-cloud support
• Reliable data replication between PCF sites

Cloud-ready, infrastructure agnostic

[Diagram: Pivotal GemFire as a click-to-deploy service on Pivotal Cloud Foundry]

Next-generation databases must keep up with cloud native apps

Can your database do all of this? GemFire IMDG DOES.

• Horizontal Scalability
• Automatic fail-over
• Reliable eventing model
• Multi-site High Availability
• Seamless integration to analytical databases


Pivotal Greenplum

World’s First Open Source Massively Parallel Data Warehouse

Greenplum Database Mission & Strategy

• Relational database system for big data and data warehousing
• Mission critical & system of record product with supporting tools and ecosystem
• Fully open source with a global community of developers and users
• Large industrial focused system
• PostgreSQL based
• Multi-platform technology: On-premise, Cloud, Enterprise Appliance
• It’s a Software product


Highlighted Greenplum Successes

• Government: Tax & benefits fraud detection; Economic statistics research
• Financial Services: Wealth management data science and product development for Commercial Banking; Risk and trade repositories reporting; 401K providers analytics on investment choices
• Pharmaceutical: Vaccine potency prediction based on manufacturing sensors
• IoT: Predictive maintenance for auto manufacturer, industrial equipment and government agencies
• Semiconductor: Fab sensor analytics and reporting
• Cyber Security & Surveillance: Internal email and communication surveillance and reporting; Corporate network anomalous behavior and intrusion detections
• Oil & Gas: Drilling equipment predictive maintenance
• Communications: Mobile telephone company enterprise data warehouse; Network performance and availability analytics
• Retail: Customer purchases analytics
• Transportation: Airlines loyalty program analytics


Greenplum Overview

Greenplum DB layers: SYSTEM ACCESS, DATA PROCESSING, DATA STORAGE

• CLIENT ACCESS: PSQL, ODBC, JDBC
• BULK LOAD/UNLOAD: GPLoad, GPFdist, External Tables, GPHDFS
• ADMIN TOOLS: GP Perfmon, GP Support
• 3rd PARTY TOOLS: Compatible with Industry Standard BI & ETL Tools
• SQL STANDARD COMPLIANCE
• MASSIVELY PARALLEL PROCESSING (MPP)
• BIG DATA QUERY OPTIMIZER
• IN-DATABASE PROGRAMMING LANGUAGES: PL/pgSQL, PL/Python, PL/R, PL/Perl, PL/Java, PL/C
• IN-DATABASE ANALYTICS & EXTENSIONS: MADlib, PostGIS, PGCrypto
• POLYMORPHIC STORAGE: HEAP, Append Only, Columnar, External, Compression
• MULTI-VERSION CONCURRENCY CONTROL (MVCC)
• FULLY ACID COMPLIANT TRANSACTIONAL DATABASE
• INDEXES: B-Tree, Bitmap, GiST


PostgreSQL Heritage

Greenplum Open Source Launch

• Widely used
• Open Source
• PostgreSQL License
• Enterprise class open source relational engine


MPP Shared Nothing Architecture

Flexible framework for processing large datasets

• Master Host and Standby Master Host: the Master coordinates work with the Segment Hosts
• Segment Host with one or more Segment Instances: Segment Instances process queries in parallel
• Segment Hosts have their own CPU, disk and memory (shared nothing)
• High speed interconnect for continuous pipelining of data processing

[Diagram: SQL arrives at the Master Host, which connects over the Interconnect to Segment Hosts node1, node2, node3 … nodeN, each running several Segment Instances of Greenplum DB]
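The division of labour between master and segments can be sketched as a scatter-gather aggregation. This is a simplified illustration in plain Python, not Greenplum code: the segment count, the CRC32 hash, the thread pool and all function names are assumptions made for the example. Each "segment" owns its rows and computes a partial result; the "master" only dispatches work and combines the partials.

```python
import zlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

NUM_SEGMENTS = 4  # assumed number of segment instances


def segment_for(key: str) -> int:
    """Deterministically map a distribution key to a segment."""
    return zlib.crc32(key.encode("utf-8")) % NUM_SEGMENTS


def distribute(rows):
    """Spread (region, amount) rows across segments by hashing the region."""
    segments = [[] for _ in range(NUM_SEGMENTS)]
    for region, amount in rows:
        segments[segment_for(region)].append((region, amount))
    return segments


def segment_scan(rows):
    """Work done locally on one segment: a partial SUM(amount) GROUP BY region."""
    partial = defaultdict(float)
    for region, amount in rows:
        partial[region] += amount
    return dict(partial)


def master_query(segments):
    """The master dispatches the scan to all segments and merges their partial results."""
    with ThreadPoolExecutor(max_workers=NUM_SEGMENTS) as pool:
        partials = list(pool.map(segment_scan, segments))
    result = defaultdict(float)
    for partial in partials:
        for region, subtotal in partial.items():
            result[region] += subtotal
    return dict(result)


rows = [("EMEA", 10.0), ("APJ", 5.0), ("EMEA", 2.5), ("AMER", 7.0), ("APJ", 1.0)]
segments = distribute(rows)
totals = master_query(segments)
```

Because no segment reads another segment's rows, adding segments adds scan capacity without adding contention, which is the point of the shared-nothing design.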


Loading: Massively-Parallel Ingest

Extreme speed and immediate usability from files, ETL, Hadoop & S3

Fast Parallel Load & Unload
• No Master Node bottleneck
• 10+ TB/Hour per Rack
• Linear scalability

Low Latency
• Data immediately available
• No intermediate stores
• No data “reorganization”

Load/Unload To & From:
• File Systems
• Any ETL Product
• Hadoop & Amazon S3

[Diagram: external sources (ETL, file systems) load and stream data over the network interconnect directly to the Greenplum DB segment servers (query processing & data storage); master servers handle query planning & dispatch]


Polymorphic Storage™

User Definable Storage Layout

• Columnar storage compresses better
• Optimized for retrieving a subset of the columns when querying
• Compression can be set differently per column: gzip (1-9), quicklz, delta, RLE
• Row oriented faster when returning all columns
• HEAP for many updates and deletes
• Use indexes for drill through queries

External HDFS or S3
• Place less accessed partitions on external storage and seamlessly query all data
• All major Hadoop distributions
• Amazon S3 storage
• Others in development

[Diagram: TABLE ‘SALES’ partitioned by time into monthly partitions (Jun through Dec) plus Year - 1 and Year - 2, with partitions stored row-oriented, column-oriented or externally]
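Why a column layout tends to compress better can be shown with run-length encoding (RLE), one of the options named above. The sketch below is plain Python with made-up sample rows and does not reflect Greenplum's on-disk format: it pivots rows into columns and encodes each column as (value, run length) pairs.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original sequence."""
    return [value for value, count in pairs for _ in range(count)]


# Hypothetical sales rows: (month, country, quantity)
rows = [
    ("2016-01", "TR", 10),
    ("2016-01", "TR", 12),
    ("2016-01", "DE", 7),
    ("2016-02", "DE", 9),
    ("2016-02", "DE", 4),
]

# Column-oriented layout: all values of one column are stored together.
names = ("month", "country", "qty")
columns = {name: [row[i] for row in rows] for i, name in enumerate(names)}

encoded_month = rle_encode(columns["month"])
encoded_country = rle_encode(columns["country"])
```

In the row layout the repeated months are interleaved with other fields, so there are no runs to collapse; in the column layout five month values shrink to two pairs. A query that needs only `qty` also reads one list instead of every row.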


Partitions and External Partitions

• Hash Distribution to evenly spread data across all segment instances
• Range Partition within a segment instance to minimize scan work
• Partitioned Tables Support for External Tables as a Partition
  – Readable external table
  – Host file system, NFS mount, HDFS or Amazon S3

[Diagram: a parent table in Greenplum DB range-partitioned by month (Jan 2013 ... Jan 2014, Feb 2014 ... Dec 2014), including an external partition]
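How range partitioning minimizes scan work can be sketched as partition pruning. This is a toy model in plain Python, not Greenplum's planner: the `RangePartitionedTable` class, its monthly partition key and the sample data are hypothetical. A query with a date filter only touches the partitions whose month range overlaps the filter.

```python
from datetime import date


class RangePartitionedTable:
    """Hypothetical table split into one partition per (year, month)."""

    def __init__(self) -> None:
        self.partitions = {}  # (year, month) -> list of (day, amount) rows

    def insert(self, day: date, amount: float) -> None:
        self.partitions.setdefault((day.year, day.month), []).append((day, amount))

    def sum_between(self, start: date, end: date):
        """Return (total, scanned_partitions) for rows with start <= day <= end."""
        lo, hi = (start.year, start.month), (end.year, end.month)
        scanned = []
        total = 0.0
        for key in sorted(self.partitions):
            if lo <= key <= hi:  # partitions outside the range are never read
                scanned.append(key)
                total += sum(a for d, a in self.partitions[key] if start <= d <= end)
        return total, scanned


table = RangePartitionedTable()
table.insert(date(2014, 1, 15), 5.0)
table.insert(date(2014, 2, 2), 3.0)
table.insert(date(2014, 2, 20), 4.0)
table.insert(date(2014, 12, 1), 8.0)

total, scanned = table.sum_between(date(2014, 2, 1), date(2014, 2, 28))
```

Only the February partition is scanned here, even though the table holds three partitions. In the slide's model this pruning happens within each segment, on top of the hash distribution across segments.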


Hybrid Queries: Pivotal External Tables

Roadmap

S3 External Tables
• Readable Ext-Table MVP
• Readable Gzip Files
• Writable Ext-Table
• Investigation: Enhanced Security/Roles
• Investigation: Additional File Formats

GemFire External Tables
• Hi Speed Ingestion
• Hi Concurrency Query Cache

GPHDFS


Greenplum Database Features for Data Scientists*

• Window functions: Perform calculations across a set of table rows that are somehow related to the current row
• Analytics extensions: In-database machine learning at scale using MADlib
• Procedural language extensions: Extended functionality using non-SQL programming languages and packages (e.g. Python and R)
• Client Access: ODBC and JDBC access to support connections to 3rd party tools

* Only a subset of Greenplum Database features
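The window function idea from the first bullet can be made concrete with a small example. The sketch below emulates, in plain Python, what the standard SQL expression `SUM(amount) OVER (PARTITION BY store ORDER BY day)` computes; the `running_total` helper and the sample rows are hypothetical and are only meant to show the semantics, not how the database evaluates it.

```python
from collections import defaultdict


def running_total(rows, partition_by, order_by, value):
    """Emulate SUM(value) OVER (PARTITION BY partition_by ORDER BY order_by).

    Every input row is kept (unlike GROUP BY) and gains a 'running_total'
    computed from the rows of its own partition up to and including itself.
    """
    ordered = sorted(rows, key=lambda r: (r[partition_by], r[order_by]))
    totals = defaultdict(float)
    out = []
    for row in ordered:
        totals[row[partition_by]] += row[value]
        out.append({**row, "running_total": totals[row[partition_by]]})
    return out


sales = [
    {"store": "A", "day": 2, "amount": 5.0},
    {"store": "A", "day": 1, "amount": 10.0},
    {"store": "B", "day": 1, "amount": 7.0},
    {"store": "A", "day": 3, "amount": 1.0},
]

result = running_total(sales, "store", "day", "amount")
```

The key difference from an aggregate is that all four rows survive: each one carries a value derived from the related rows in its partition.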


Procedural Languages

• User Defined Types

• User Defined Functions

• User Defined Aggregates

• Import of libraries from open source


Scalable, In-Database Machine Learning

• Open source: https://github.com/apache/incubator-madlib
• Downloads and docs: http://madlib.incubator.apache.org/
• Wiki: https://cwiki.apache.org/confluence/display/MADLIB/


Functions (April 2016)

Linear Systems
• Sparse and Dense Solvers
• Linear Algebra

Matrix Factorization
• Singular Value Decomposition (SVD)
• Low Rank

Generalized Linear Models
• Linear Regression
• Logistic Regression
• Multinomial Logistic Regression
• Ordinal Regression
• Cox Proportional Hazards Regression
• Elastic Net Regularization
• Robust Variance (Huber-White), Clustered Variance, Marginal Effects

Other Machine Learning Algorithms
• Principal Component Analysis (PCA)
• Association Rules (Apriori)
• Topic Modeling (Parallel LDA)
• Decision Trees
• Random Forest
• Support Vector Machines (SVM)
• Conditional Random Field (CRF)
• Clustering (K-means)
• Cross Validation
• Naïve Bayes

Descriptive Statistics
• Sketch-Based Estimators: CountMin (Cormode-Muth.), FM (Flajolet-Martin), MFV (Most Frequent Values)
• Correlation and Covariance
• Summary

Utility Modules
• Array and Matrix Operations
• Sparse Vectors
• Random Sampling
• Probability Functions
• Data Preparation
• PMML Export
• Conjugate Gradient
• Stemming

Inferential Statistics
• Hypothesis Tests

Time Series
• ARIMA

Path Functions
• Operations on Pattern Matches


GPDB Geospatial

Ability to store geospatial data and query with joins and operators

Current Key Features:
• Points, Lines, Polygons, Perimeter, Area, Intersection, Contains, Distance, Long/Lat, Spatial Indexes & Bounding Boxes
• Round earth calculations
• Raster Image Processing
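As an illustration of what a round earth distance calculation involves, the sketch below implements the well-known haversine formula in plain Python. It assumes a perfectly spherical Earth with a mean radius of 6371.0088 km; it is not the code path used by the database's geospatial extension, which may use a more precise spheroid model.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # assumption: mean radius of a spherical Earth


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    # Clamp to 1.0 to guard against rounding slightly above 1 for antipodal points.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


# A quarter of the way around the equator.
quarter = haversine_km(0.0, 0.0, 0.0, 90.0)
same_point = haversine_km(41.0, 29.0, 41.0, 29.0)
```

A flat-plane distance on longitude and latitude would overstate east-west distances away from the equator, which is why a spherical or spheroidal model matters for Long/Lat data.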


Pivotal HDB

Hadoop Native SQL Database

Making the Hadoop Data Lake More Consumable

Enabling data science and machine learning at scale

1) Important people and tools are cut-off because of SQL completeness or performance.
2) Data scientists still have to resort to sampling if they can't run analytics in-database at scale
3) There are multiple data sets and formats within Hadoop

[Diagram: business analysts and data scientists use SQL apps against the data lake (Hive, HBase, etc.)]


As the lingua franca of analytics, SQL can't be ignored. Neither can performance.

Hadoop data lakes sit underutilized

Lack of interactive, ANSI SQL capabilities inhibits adoption and value

• Producing complex queries, large joins, interactive queries
• Existing investments in visualization and BI tools
• Large population of users with SQL skills

[Diagram: business analysts and data scientists with SQL apps, separated from the data lake]


HDB: The Hadoop Native SQL Database

High performance, interactive SQL queries on Hadoop

● Highly efficient MPP (massively parallel processing)
● Low-latency
● Petabyte scalability
● ACID transaction support
● SQL-92, 99, 2003 compatibility
● Advanced cost-based optimizer

[Diagram: business analysts and data scientists use SQL apps directly on the data lake]


Integrate SQL and data science tools into an interactive, operationalized environment

Predictive analytics not scaling with Python or R

Using traditional, single-node Python or R for analytics means using subsets because of the lack of parallelization

<...>

Implications
• Time-consuming data movement
• Working with small sample sizes requires extra testing cycles against larger data sets
• Slow feature generation limits algorithm development

[Diagram: the data lake is reduced to SAMPLE 1, SAMPLE 2 … SAMPLE n for analysis]


In-database analytics speeds predictive modeling

Apache™ MADlib® (incubating) is an open-source library for scalable in-database analytics

Scale-out mathematical, statistical and machine learning methods for structured and unstructured data

• SQL-based
• Analyze without sampling
• Open source
• Runs on HDB, Greenplum, and Postgres
• Complements support for procedural languages: PL/R, PL/Python, PL/Java

Train a model...
Predict for new data...
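One way to see why training can run where the data lives, without sampling, is that some models only need small summary statistics from each data partition. The sketch below is plain Python and is not MADlib's implementation: the `Stats` class, the `train` and `predict` functions and the sample segments are hypothetical. It fits a one-variable linear regression by computing sums on each segment and merging them centrally.

```python
from dataclasses import dataclass


@dataclass
class Stats:
    """Sufficient statistics for simple linear regression y = intercept + slope * x."""
    n: int = 0
    sx: float = 0.0
    sy: float = 0.0
    sxx: float = 0.0
    sxy: float = 0.0

    def add(self, x: float, y: float) -> None:
        self.n += 1
        self.sx += x
        self.sy += y
        self.sxx += x * x
        self.sxy += x * y

    def merge(self, other: "Stats") -> "Stats":
        return Stats(
            self.n + other.n,
            self.sx + other.sx,
            self.sy + other.sy,
            self.sxx + other.sxx,
            self.sxy + other.sxy,
        )


def train(segments):
    """Each segment summarises its own rows; only the summaries are combined."""
    total = Stats()
    for rows in segments:
        local = Stats()
        for x, y in rows:
            local.add(x, y)
        total = total.merge(local)
    slope = (total.n * total.sxy - total.sx * total.sy) / (
        total.n * total.sxx - total.sx ** 2
    )
    intercept = (total.sy - slope * total.sx) / total.n
    return slope, intercept


def predict(model, x: float) -> float:
    slope, intercept = model
    return intercept + slope * x


# Hypothetical data spread over three segments; the points lie on y = 2x + 1.
segments = [[(1.0, 3.0), (2.0, 5.0)], [(3.0, 7.0)], [(4.0, 9.0), (5.0, 11.0)]]
model = train(segments)
prediction = predict(model, 6.0)
```

Every row contributes to the model, yet only five numbers per segment travel over the network, so the full data set can be used instead of a sample.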


Overcome complexity

Simplifying the data lake with data federation

HDB’s Pivotal eXtension Framework (PXF) and HCatalog integration

• Enables connectivity between Pivotal HDB and other stores (Hive, HBase, HDFS files).
• Provides an extensible framework to add support for custom services
• Low latency on large data sets
• Considers cost model of federated sources

[Diagram labels: Schema Read; HDFS data lake; HCatalog; CSV, TXT, Avro, Custom Extensions]




HDB as part of an architecture: Next Likely Purchase

Providing information in context with the right architecture and the right algorithms

1. Ingest, transform, and land data into HDFS
2. Score streaming data and serve to application

[Diagram: the customer app sends a PURCHASE and receives the NEXT OFFER; the internal app shows a real-time view of transactions and offers; TRANSACTIONS land in HDFS staging and HDB tables, which feed model creation & training (PMML), data science & ad hoc queries, and reports]


Pivotal HDB Advantages

• Advanced Analytics Performance: Exceptional MPP performance, low latency, petabyte scalability, ACID reliability, fault tolerance
• Most Complete Language Compliance: Higher degree of SQL compatibility, SQL-92, 99, 2003, OLAP, leverage existing SQL skills
• Advanced Query Optimizer: Maximize performance and do advanced queries with confidence
• Elastic Architecture for Scalability: Scale-up/down or scale-in/out, expand/shrink clusters on the fly
• Integrated w/MADlib Machine Learning: Advanced MPP analytics, data science at scale, directly on Hadoop data


“Companies need to learn how to catch people or things in the act of doing something and affect the outcome”

PAUL MARITZ
Executive Chairman, Pivotal

Real-time and Personalised Information in Context is what Wins!
