Pivotal Big Data Suite: A Technical Overview

TECHNICAL OVERVIEW:

Pivotal Big Data Suite

Les Klein, Field CTO Data

Pivotal

@LesKlein #PivotalForum #Istanbul #BigData #Analytics

Page 3: Pivotal Big Data Suite: A Technical Overview

Forward Looking Statements

This presentation contains “forward-looking statements” as defined under the Federal Securities Laws. Actual results could differ materially from those projected in the forward-looking statements as a result of certain risk factors, including but not limited to: (i) adverse changes in general economic or market conditions; (ii) delays or reductions in information technology spending; (iii) the relative and varying rates of product price and component cost declines and the volume and mixture of product and services revenues; (iv) competitive factors, including but not limited to pricing pressures and new product introductions; (v) component and product quality and availability; (vi) fluctuations in VMware, Inc.’s operating results and risks associated with trading of VMware stock; (vii) the transition to new products, the uncertainty of customer acceptance of new product offerings and rapid technological and market change; (viii) risks associated with managing the growth of our business, including risks associated with acquisitions and investments and the challenges and costs of integration, restructuring and achieving anticipated synergies; (ix) the ability to attract and retain highly qualified employees; (x) insufficient, excess or obsolete inventory; (xi) fluctuating currency exchange rates; (xii) threats and other disruptions to our secure data centers and networks; (xiii) our ability to protect our proprietary technology; (xiv) war or acts of terrorism; and (xv) other one-time events and other important factors disclosed previously and from time to time in the filings of EMC Corporation, the parent company of Pivotal, with the U.S. Securities and Exchange Commission. EMC and Pivotal disclaim any obligation to update any such forward-looking statements after the date of this release.

Page 4: Pivotal Big Data Suite: A Technical Overview

© 2016 Pivotal Software, Inc. All rights reserved.

Pivotal Big Data Suite

Open source data management portfolio

• Complete platform
• Hadoop Native SQL
• Deployment options
• Based on open source
• Flexible licensing
• Advanced data services

PIVOTAL GREENPLUM DATABASE: Data warehouse database based on open source Greenplum Database

PIVOTAL HDB: Open source analytical database for Apache Hadoop based on Apache HAWQ

PIVOTAL GEMFIRE: Open source application and transaction data grid based on Apache Geode

Page 5: Pivotal Big Data Suite: A Technical Overview

Great software companies leverage Big Data to fundamentally change the consumer experience and pioneer entirely new business models

Page 6: Pivotal Big Data Suite: A Technical Overview

CLOUD NATIVE SOFTWARE IS CHANGING INDUSTRIES

Data is Fueling Software

• Financial Services: $4BN
• Hospitality: $26BN
• Transportation: $50BN
• Entertainment: $54BN
• Automotive: $30BN
• Industrial Products: $3.2BN

Page 7: Pivotal Big Data Suite: A Technical Overview

© Copyright 2015 Pivotal. All rights reserved.

Disruptors Use a LOT of Data

• Hundreds of thousands of “trip” events each day
• 400+ billion viewing-related events per day
• Five billion training data points for Price Tip feature

Page 8: Pivotal Big Data Suite: A Technical Overview

Data manifests as features in an app

“We’ve found that when a host selects a price that’s within 5% of their tip, they’re nearly 4 times more likely to get booked”

“The importance of accuracy and efficiency […], will continue to rise as we expand and improve products like uberPOOL and beyond.”

“Over 75% of what people watch come from our recommendations”

Page 9: Pivotal Big Data Suite: A Technical Overview

How are they accomplishing this?

• (Data) Microservices: Loosely coupled services architecture, bounded by context
• Cloud-Native Platforms: Enabling continuous delivery & automated operations
• Open Source Database Innovation: Extreme scale & performance advantages, built for the cloud
• Machine Learning: Use of predictive analytics to build smart apps

Page 10: Pivotal Big Data Suite: A Technical Overview

These companies…

Release new features in minutes, multiple times a day

Support a micro-services architecture

Consume a wide range of data sources and protocols

Store and Analyze all their data

Update algorithms and predictive models daily

Continuously ask lots of questions of their data

Modify data pipelines and add processing steps daily

Page 11: Pivotal Big Data Suite: A Technical Overview

…but most enterprises are not quite there yet

• Applications scalability limited by databases (diagram: App 1, App 2 and App 3 bottlenecked on one transactional database)
• Real-time data insights limited by disconnected OLTP and OLAP systems (diagram: ETL / ELT batches add a delay Δt between the transactional database and the analytic database)
• Data services are not ready for cloud platforms (diagram: continuous delivery)

Page 12: Pivotal Big Data Suite: A Technical Overview

Modern Cloud-Native Data Architecture

• Stream + Batch Processing
• Programming + Operating Model
• Microservices and Polyglot Persistence: IMDG, K/V Store, Relational DB
• Big Data & Machine Learning: Hadoop, DW, Spark
• Cloud-Native Platform: Microservices Framework, Platform Runtime
• Cloud Infrastructure

Page 13: Pivotal Big Data Suite: A Technical Overview

New pressures are breaking fragile systems

• Applications scalability limited by databases
• Real-time data insights limited by disconnected OLTP and OLAP systems
• Data services are not ready for cloud platforms

Page 14: Pivotal Big Data Suite: A Technical Overview

Apps scalability limited by scalability of databases

DB scalability limitations are aggravated by additional devices, clients and apps

Scale-out applications vs Scale-up databases

Diagram: existing applications (App 1, App 2, App 3), new devices and clients, and new cloud native scalable data apps all reach the same transactional database, which becomes the bottleneck.

Page 15: Pivotal Big Data Suite: A Technical Overview

GemFire: Cloud-scale high performance transactional data

• Horizontally scalable

• Ultra fast, low-latency in-memory transactions

• Fully configurable data consistency

• Reliable eventing and notification model

• Highly Available, auto-healing

• Inter-cluster WAN replication
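
The eventing and push-update idea in the list above can be illustrated with a minimal in-process sketch. This is not GemFire's API: `Region`, `subscribe` and `put` are hypothetical names chosen for the example, and everything runs inside one Python process rather than across a cluster.

```python
from typing import Any, Callable, Dict, List

# Hypothetical listener signature: (key, old_value, new_value) -> None
Listener = Callable[[str, Any, Any], None]


class Region:
    """Toy in-memory key/value region that pushes updates to subscribers."""

    def __init__(self) -> None:
        self._data: Dict[str, Any] = {}
        self._listeners: Dict[str, List[Listener]] = {}

    def subscribe(self, key: str, listener: Listener) -> None:
        # Register interest in a single key.
        self._listeners.setdefault(key, []).append(listener)

    def put(self, key: str, value: Any) -> None:
        old = self._data.get(key)
        self._data[key] = value
        # Push the change to every subscriber instead of waiting to be polled.
        for listener in self._listeners.get(key, []):
            listener(key, old, value)

    def get(self, key: str) -> Any:
        return self._data.get(key)


events = []
region = Region()
region.subscribe("price:AAPL", lambda k, old, new: events.append((k, old, new)))
region.put("price:AAPL", 101.5)
region.put("price:AAPL", 102.0)
region.put("price:MSFT", 55.0)  # no subscriber, so nothing is pushed
```

The application never polls: it registers interest once and is called back on every change, which is the property the "Push Updates" label on the slide refers to.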

Diagram: custom apps (several instances of App 1 and App 2) use Pivotal GemFire for transactional access through the native API or REST / HTTP, and receive push updates.

Page 16: Pivotal Big Data Suite: A Technical Overview

Batch-mode latency prevents real-time analysis

• Applications scalability limited by databases
• Real-time data insights limited by disconnected OLTP and OLAP systems
• Data services are not ready for cloud platforms

Page 17: Pivotal Big Data Suite: A Technical Overview

Real-time data analytics is limited by data integration batches

Overnight ETL / ELT jobs expose data that is already outdated

• Analytical processes don’t have access to the latest data
• ETL/ELT processes are expensive and hard to maintain
• Batch process windows limit data scalability

Diagram: App 1, App 2 and App 3 write to the transactional database (TRANSACTIONS, hot data); ETL / ELT batches move the data, with a delay Δt, to the MPP database (ANALYTICS, cold data).

Page 18: Pivotal Big Data Suite: A Technical Overview

Operationalized data insights need an event-driven architecture

Combination of SQL Analytics and NoSQL event-driven transactions is needed

• Data insights must be immediately pushed to applications
• Apps should be able to react in real-time to analytical findings

Diagram: App 1, App 2 and App 3 use the transactional database through APIs / NoSQL (TRANSACTIONS); the MPP database offers ANSI SQL, machine learning and advanced analytics (ANALYTICS) and feeds data insights back to the transactional side.

Page 19: Pivotal Big Data Suite: A Technical Overview

GemFire and GPDB - Big Data meets Fast Data

Diagram (data temperature: hot to warm):

• Custom apps (several instances of App 1 and App 2) use Pivotal GemFire for transactional access (native API, REST / HTTP) and receive push updates
• Data science, analytics & ML use Pivotal Greenplum for analytical access (ANSI SQL)
• Transactional data flows from GemFire to Greenplum (write behind, parallel configurable data load)
• Analytical data flows from Greenplum back to the GemFire cache
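
The "write behind" path between the in-memory grid and the analytical database can be sketched as follows. This is an illustration under stated assumptions, not the GemFire or Greenplum connector: `WriteBehindCache`, `flush` and the plain list standing in for the warehouse are all hypothetical.

```python
from collections import deque
from typing import Any, Deque, Dict, List, Tuple


class WriteBehindCache:
    """Toy cache: writes land in memory first and reach the store later, in batches."""

    def __init__(self, store: List[Tuple[str, Any]], batch_size: int = 3) -> None:
        self.store = store              # stand-in for the analytical database
        self.batch_size = batch_size
        self.cache: Dict[str, Any] = {}
        self.pending: Deque[Tuple[str, Any]] = deque()

    def put(self, key: str, value: Any) -> None:
        self.cache[key] = value         # fast path: in-memory write
        self.pending.append((key, value))
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self) -> int:
        """Drain queued writes to the store as one batch; return the batch size."""
        batch = list(self.pending)
        self.pending.clear()
        self.store.extend(batch)        # a real system would bulk-load here
        return len(batch)


warehouse: List[Tuple[str, Any]] = []
cache = WriteBehindCache(warehouse, batch_size=3)
for i in range(4):
    cache.put(f"order:{i}", {"amount": 10 * i})
```

After the loop, three orders have been flushed to `warehouse` as one batch and the fourth is still queued. The application sees in-memory latency while the analytical side receives larger, cheaper bulk loads.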

Page 21: Pivotal Big Data Suite: A Technical Overview

Cloud Native apps are better suited to NoSQL

Enabling fast and scalable event-driven data services

• Unidirectional, request-response SQL: monolithic apps needed complex schema-based SQL databases
• Bidirectional, event-driven APIs: microservices need much simpler schemas, but much better scalability

Page 22: Pivotal Big Data Suite: A Technical Overview

GemFire for Pivotal Cloud Foundry

Lightning fast in-memory persistence for cloud native apps

• One-click provisioning
• Pre-packaged configuration
• Embedded monitoring by Pulse
• Auto application binding
• Multi-cloud support
• Reliable data replication between PCF sites

Diagram: click to deploy Pivotal GemFire on Pivotal Cloud Foundry.

Page 23: Pivotal Big Data Suite: A Technical Overview

Next-generation databases must keep up with cloud native apps

Can your database do all of this? GemFire IMDG DOES.

• Horizontal Scalability
• Automatic fail-over
• Reliable eventing model
• Multi-site High Availability
• Seamless integration to analytical databases
• Cloud-ready, infrastructure agnostic

Page 24: Pivotal Big Data Suite: A Technical Overview

Pivotal Greenplum

World’s First Open Source Massively Parallel Data Warehouse

Page 25: Pivotal Big Data Suite: A Technical Overview

Greenplum Database Mission & Strategy

• Relational database system for big data and data warehousing
• Mission critical & system of record product with supporting tools and ecosystem
• Fully open source with a global community of developers and users
• Large industrial focused system
• PostgreSQL based
• Multi-platform technology: On-premise, Cloud, Enterprise Appliance
• It’s a Software product

Page 26: Pivotal Big Data Suite: A Technical Overview

Highlighted Greenplum Successes

• Government: Tax & benefits fraud detection; Economic statistics research
• Financial Services: Wealth management data science and product development for Commercial Banking; Risk and trade repositories reporting; 401K providers analytics on investment choices
• Pharmaceutical: Vaccine potency prediction based on manufacturing sensors
• IoT: Predictive maintenance for auto manufacturer, industrial equipment and government agencies
• Semiconductor: Fab sensor analytics and reporting
• Cyber Security & Surveillance: Internal email and communication surveillance and reporting; Corporate network anomalous behavior and intrusion detections
• Oil & Gas: Drilling equipment predictive maintenance
• Communications: Mobile telephone company enterprise data warehouse; Network performance and availability analytics
• Retail: Customer purchases analytics
• Transportation: Airlines loyalty program analytics

Page 27: Pivotal Big Data Suite: A Technical Overview

Greenplum Overview

Greenplum DB

SYSTEM ACCESS

• CLIENT ACCESS: PSQL, ODBC, JDBC
• BULK LOAD/UNLOAD: GPLoad, GPFdist, External Tables, GPHDFS
• ADMIN TOOLS: GP Perfmon, GP Support
• 3rd PARTY TOOLS: Compatible with Industry Standard BI & ETL Tools

DATA PROCESSING

• SQL STANDARD COMPLIANCE
• MASSIVELY PARALLEL PROCESSING (MPP)
• IN-DATABASE PROGRAMMING LANGUAGES: PL/pgSQL, PL/Python, PL/R, PL/Perl, PL/Java, PL/C
• IN-DATABASE ANALYTICS & EXTENSIONS: MADlib, PostGIS, PGCrypto
• BIG DATA QUERY OPTIMIZER

DATA STORAGE

• POLYMORPHIC STORAGE: HEAP, Append Only, Columnar, External, Compression
• MULTI-VERSION CONCURRENCY CONTROL (MVCC)
• FULLY ACID COMPLIANT TRANSACTIONAL DATABASE
• INDEXES: B-Tree, Bitmap, GiST

Page 28: Pivotal Big Data Suite: A Technical Overview

PostgreSQL Heritage

Greenplum Open Source Launch

• Widely used

• Open Source

• PostgreSQL License

• Enterprise class open source relational engine

Page 29: Pivotal Big Data Suite: A Technical Overview

MPP Shared Nothing Architecture

Flexible framework for processing large datasets

• Master Host and Standby Master Host: the master coordinates work with the Segment Hosts
• Segment Host with one or more Segment Instances: Segment Instances process queries in parallel
• Segment Hosts have their own CPU, disk and memory (shared nothing)
• High speed interconnect for continuous pipelining of data processing

Diagram (Greenplum DB): SQL arrives at the Master Host; the interconnect links it to the Segment Hosts (node1, node2, node3, ... nodeN), each running several Segment Instances.
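
The division of labour between master and segments can be sketched as a scatter/gather aggregation. This is a toy model under stated assumptions, not Greenplum code: `segment_scan` and `master_query` are hypothetical names, the "segments" are plain Python lists, and a thread pool stands in for separate hosts.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple

Row = Tuple[str, float]  # (region, sale amount)


def segment_scan(rows: List[Row]) -> Dict[str, float]:
    """Work done independently on one segment: partial SUM(amount) GROUP BY region."""
    partial: Dict[str, float] = {}
    for region, amount in rows:
        partial[region] = partial.get(region, 0.0) + amount
    return partial


def master_query(segments: List[List[Row]]) -> Dict[str, float]:
    """Master role: dispatch the same plan to every segment, then merge the partials."""
    with ThreadPoolExecutor(max_workers=len(segments)) as pool:
        partials = list(pool.map(segment_scan, segments))
    total: Dict[str, float] = {}
    for partial in partials:
        for region, subtotal in partial.items():
            total[region] = total.get(region, 0.0) + subtotal
    return total


# Each inner list is the slice of the table owned by one segment (shared nothing).
segments = [
    [("EMEA", 10.0), ("APJ", 5.0)],
    [("EMEA", 2.5), ("AMER", 7.0)],
    [("APJ", 1.0)],
]
result = master_query(segments)
```

Each segment touches only its own rows, and the master merges small partial results rather than raw data, which is why adding segments can add throughput.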

Page 30: Pivotal Big Data Suite: A Technical Overview

Loading: Massively-Parallel Ingest

Extreme speed and immediate usability from files, ETL, Hadoop & S3

Fast Parallel Load & Unload

• No Master Node bottleneck
• 10+ TB/Hour per Rack
• Linear scalability

Low Latency

• Data immediately available
• No intermediate stores
• No data “reorganization”

Load/Unload To & From:

• File Systems
• Any ETL Product
• Hadoop & Amazon S3

Diagram (Greenplum DB): external sources (ETL, file systems) load and stream data over the network interconnect directly to the segment servers (query processing & data storage); the master servers handle query planning & dispatch.

Page 31: Pivotal Big Data Suite: A Technical Overview

Polymorphic Storage™

User Definable Storage Layout

TABLE ‘SALES’ (partitions: Jun, Jul, Aug, Sep, Oct, Nov, Dec, Year - 1, Year - 2)

Row-oriented

• Row oriented faster when returning all columns
• HEAP for many updates and deletes
• Use indexes for drill through queries

Column-oriented

• Columnar storage compresses better
• Optimized for retrieving a subset of the columns when querying
• Compression can be set differently per column: gzip (1-9), quicklz, delta, RLE

External HDFS or S3

• Keep less accessed partitions on external storage and seamlessly query all data
• All major Hadoop distributions
• Amazon S3 storage
• Others in development
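
Why column-oriented storage tends to compress better can be seen with run-length encoding (RLE), one of the compression options named on this slide. The sketch below is a pure-Python illustration with made-up sample rows; it does not reflect Greenplum's on-disk format.

```python
from itertools import groupby
from typing import List, Tuple

# Hypothetical sample rows: (sale_date, country, quantity)
rows = [
    ("2016-06-01", "TR", 10),
    ("2016-06-01", "TR", 12),
    ("2016-06-01", "TR", 9),
    ("2016-06-02", "DE", 7),
    ("2016-06-02", "DE", 7),
]


def run_length_encode(values: List) -> List[Tuple[object, int]]:
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


# Column-oriented layout: one list per column, so equal values sit side by side.
columns = [list(col) for col in zip(*rows)]
encoded = [run_length_encode(col) for col in columns]

# Row-oriented layout: values from different columns are interleaved.
interleaved = [value for row in rows for value in row]
row_runs = run_length_encode(interleaved)
```

Here the three columns encode into 8 runs in total, while the interleaved row layout yields 15 runs for the same 15 values, so RLE gains nothing on it. Low-cardinality columns such as dates or country codes benefit the most.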

Page 32: Pivotal Big Data Suite: A Technical Overview

Partitions and External Partitions

Diagram (Greenplum DB): a parent table with monthly partitions (Jan 2013, ..., Jan 2014, Feb 2014, ..., Dec 2014); labels: RET, External.

• Hash Distribution to evenly spread data across all segment instances
• Range Partition within a segment instance to minimize scan work
• Partitioned tables support external tables as a partition
– Readable external table
– Host file system, NFS mount, HDFS or Amazon S3
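
How hash distribution and range partitioning combine can be sketched in a few lines. Everything here is illustrative: `NUM_SEGMENTS`, `segment_for`, `partition_for` and the CRC32 hash are assumptions for the example and not Greenplum's actual hashing or catalog.

```python
import zlib
from collections import defaultdict
from datetime import date
from typing import Dict, List, Tuple

NUM_SEGMENTS = 4  # assumed cluster size for the example

Row = Tuple[int, date, float]  # (customer_id, sale_date, amount)


def segment_for(customer_id: int) -> int:
    """Hash distribution: a stable hash of the distribution key picks the segment."""
    return zlib.crc32(str(customer_id).encode()) % NUM_SEGMENTS


def partition_for(sale_date: date) -> str:
    """Range partitioning: one partition per calendar month."""
    return f"{sale_date.year}-{sale_date.month:02d}"


# segment -> partition -> rows
storage: Dict[int, Dict[str, List[Row]]] = defaultdict(lambda: defaultdict(list))


def insert(row: Row) -> None:
    storage[segment_for(row[0])][partition_for(row[1])].append(row)


def total_for_month(month: str) -> float:
    """Each segment scans only the matching partition (partition pruning)."""
    return sum(
        amount
        for partitions in storage.values()
        for _, _, amount in partitions.get(month, [])
    )


for cid in range(100):
    insert((cid, date(2014, 1 + cid % 12, 15), 1.0))
```

A call such as `total_for_month("2014-02")` reads one partition per segment and skips the other eleven months, which is the "minimize scan work" effect the slide describes.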

Page 33: Pivotal Big Data Suite: A Technical Overview

Hybrid Queries: Pivotal External Tables

Roadmap

S3 External Tables

• Readable Ext-Table MVP
• Readable Gzip Files
• Writable Ext-Table
• Investigation: Enhanced Security/Roles
• Investigation: Additional File Formats

GemFire External Tables

• High Speed Ingestion
• High Concurrency Query Cache

GPHDFS

Page 34: Pivotal Big Data Suite: A Technical Overview

Greenplum Database Features for Data Scientists

• Window functions: Perform calculations across a set of table rows that are somehow related to the current row
• Analytics extensions: In-database machine learning at scale using MADlib
• Procedural language extensions: Extended functionality using non-SQL programming languages and packages (e.g. Python and R)
• Client Access: ODBC and JDBC access to support connections to 3rd party tools

* Only a subset of Greenplum Database features
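
What a window function computes can be shown with a small pure-Python emulation of `SUM(sales) OVER (PARTITION BY region ORDER BY month)`. The table, column names and `running_total` helper are hypothetical; in the database the same result would come from a single SQL statement.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (region, month, sales) rows, as they might come from a sales table
rows: List[Tuple[str, int, float]] = [
    ("EMEA", 1, 100.0),
    ("EMEA", 2, 150.0),
    ("APJ", 1, 80.0),
    ("EMEA", 3, 50.0),
    ("APJ", 2, 20.0),
]


def running_total(
    data: List[Tuple[str, int, float]]
) -> List[Tuple[str, int, float, float]]:
    """Emulates SUM(sales) OVER (PARTITION BY region ORDER BY month)."""
    totals: Dict[str, float] = defaultdict(float)
    out = []
    # Sorting by (region, month) visits each partition in window order.
    for region, month, sales in sorted(data):
        totals[region] += sales
        out.append((region, month, sales, totals[region]))
    return out


result = running_total(rows)
```

Unlike a GROUP BY aggregate, every input row survives in the output and simply gains an extra column computed from its related rows.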

Page 35: Pivotal Big Data Suite: A Technical Overview

Procedural Languages

• User Defined Types

• User Defined Functions

• User Defined Aggregates

• Import of libraries from open source

Page 36: Pivotal Big Data Suite: A Technical Overview

Scalable, In-Database Machine Learning

• Open source https://github.com/apache/incubator-madlib

• Downloads and docs http://madlib.incubator.apache.org/

• Wiki https://cwiki.apache.org/confluence/display/MADLIB/

Page 37: Pivotal Big Data Suite: A Technical Overview

Functions

Linear Systems

• Sparse and Dense Solvers

• Linear Algebra

Matrix Factorization

• Singular Value Decomposition (SVD)

• Low Rank

Generalized Linear Models

• Linear Regression

• Logistic Regression

• Multinomial Logistic Regression

• Ordinal Regression

• Cox Proportional Hazards Regression

• Elastic Net Regularization

• Robust Variance (Huber-White),

Clustered Variance, Marginal Effects

Other Machine Learning Algorithms

• Principal Component Analysis (PCA)

• Association Rules (Apriori)

• Topic Modeling (Parallel LDA)

• Decision Trees

• Random Forest

• Support Vector Machines (SVM)

• Conditional Random Field (CRF)

• Clustering (K-means)

• Cross Validation

• Naïve Bayes

Descriptive Statistics

Sketch-Based Estimators

• CountMin (Cormode-Muthukrishnan)

• FM (Flajolet-Martin)

• MFV (Most Frequent Values)

Correlation and Covariance

Summary

Utility Modules

Array and Matrix Operations

Sparse Vectors

Random Sampling

Probability Functions

Data Preparation

PMML Export

Conjugate Gradient

Stemming

Inferential Statistics

Hypothesis Tests

Time Series

• ARIMA

April 2016

Path Functions

• Operations on Pattern Matches
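
As an illustration of the sketch-based estimators in the list above, here is a minimal Count-Min sketch in pure Python. It is a teaching sketch under stated assumptions, not MADlib's implementation: the class name, the default `width` and `depth`, and the salted SHA-256 hashing are choices made for this example.

```python
import hashlib
from typing import List


class CountMinSketch:
    """Approximate frequency counter: never under-counts, may over-count."""

    def __init__(self, width: int = 64, depth: int = 4) -> None:
        self.width = width
        self.depth = depth
        self.table: List[List[int]] = [[0] * width for _ in range(depth)]

    def _index(self, item: str, row: int) -> int:
        # One hash function per row, derived by salting with the row number.
        digest = hashlib.sha256(f"{row}:{item}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def add(self, item: str, count: int = 1) -> None:
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item: str) -> int:
        # Collisions only inflate counters, so the minimum is the best estimate.
        return min(
            self.table[row][self._index(item, row)] for row in range(self.depth)
        )

    def merge(self, other: "CountMinSketch") -> None:
        """Sketches built on different segments combine by element-wise addition."""
        for row in range(self.depth):
            for col in range(self.width):
                self.table[row][col] += other.table[row][col]


seg_a, seg_b = CountMinSketch(), CountMinSketch()
for word in ["spark", "hadoop", "spark"]:
    seg_a.add(word)
for word in ["spark", "greenplum"]:
    seg_b.add(word)
seg_a.merge(seg_b)
```

The `merge` step is what makes this family of estimators attractive in a parallel database: each segment can summarize its own rows in fixed memory, and the summaries add up.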

Page 38: Pivotal Big Data Suite: A Technical Overview

GPDB Geospatial

Current Key Features:

• Points, Lines, Polygons, Perimeter, Area, Intersection, Contains, Distance, Long/Lat, Spatial Indexes & Bounding Boxes
• Round earth calculations
• Ability to store geospatial data and query with joins and operators
• Raster Image Processing

Page 39: Pivotal Big Data Suite: A Technical Overview

Pivotal HDB

Hadoop Native SQL Database

Page 40: Pivotal Big Data Suite: A Technical Overview

Page 41: Pivotal Big Data Suite: A Technical Overview

Making the Hadoop Data Lake More Consumable

Enabling data science and machine learning at scale

1) Important people and tools are cut off because of SQL completeness or performance.
2) Data scientists still have to resort to sampling if they can't run analytics in-database at scale
3) There are multiple data sets and formats within Hadoop

Diagram: business analysts (SQL, apps) and data scientists work against the data lake (Hive, HBase, etc.).

Page 42: Pivotal Big Data Suite: A Technical Overview

Making the Hadoop Data Lake More Consumable

As the lingua franca of analytics, SQL can't be ignored. Neither can performance.

Page 43: Pivotal Big Data Suite: A Technical Overview

Hadoop data lakes sit underutilized

Lack of interactive, ANSI SQL capabilities inhibits adoption and value

• Producing complex queries, large joins, interactive queries
• Existing investments in visualization and BI tools
• Large population of users with SQL skills

Page 44: Pivotal Big Data Suite: A Technical Overview

HDB: The Hadoop Native SQL Database

High performance, interactive SQL queries on Hadoop

● Highly efficient MPP (massively parallel processing)
● Low-latency
● Petabyte scalability
● ACID transaction support
● SQL-92, 99, 2003 compatibility
● Advanced cost-based optimizer

Page 45: Pivotal Big Data Suite: A Technical Overview

Making the Hadoop Data Lake More Consumable

Integrate SQL and data science tools into an interactive, operationalized environment

Page 46: Pivotal Big Data Suite: A Technical Overview

Predictive analytics not scaling with Python or R

Using traditional, single-node Python or R for analytics means using subsets because of the lack of parallelization

<...>

Implications

• Time-consuming data movement
• Working with small sample sizes requires extra testing cycles against larger data sets
• Slow feature generation limits algorithm development

Diagram: SAMPLE 1, SAMPLE 2, ... SAMPLE n are drawn from the data lake.

Page 47: Pivotal Big Data Suite: A Technical Overview

In-database analytics speeds predictive modeling

Apache™ MADlib® (incubating) is an open-source library for scalable in-database analytics

Scale-out mathematical, statistical and machine learning methods for structured and unstructured data

• SQL-based
• Analyze without sampling
• Open source
• Runs on HDB, Greenplum, and Postgres
• Complements support for procedural languages: PL/R, PL/Python, PL/Java

Train a model...

Predict for new data...
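
One way to see how a model can be trained without sampling is to note that some models need only small, mergeable summaries of the data. The sketch below fits a one-variable linear regression from per-partition sufficient statistics. It is a pure-Python illustration, not MADlib's API: `Stats`, `accumulate`, `train` and `predict` are hypothetical names, and the data is synthetic.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class Stats:
    """Sufficient statistics for simple linear regression y = a + b*x."""
    n: int = 0
    sx: float = 0.0
    sy: float = 0.0
    sxx: float = 0.0
    sxy: float = 0.0

    def merge(self, other: "Stats") -> "Stats":
        return Stats(self.n + other.n, self.sx + other.sx, self.sy + other.sy,
                     self.sxx + other.sxx, self.sxy + other.sxy)


def accumulate(rows: Iterable[Tuple[float, float]]) -> Stats:
    """Runs next to the data: one pass over a partition, no sampling."""
    s = Stats()
    for x, y in rows:
        s = s.merge(Stats(1, x, y, x * x, x * y))
    return s


def train(partitions: List[List[Tuple[float, float]]]) -> Tuple[float, float]:
    """Merge the per-partition summaries, then solve for intercept and slope."""
    total = Stats()
    for part in partitions:
        total = total.merge(accumulate(part))
    slope = (total.n * total.sxy - total.sx * total.sy) / (
        total.n * total.sxx - total.sx ** 2
    )
    intercept = (total.sy - slope * total.sx) / total.n
    return intercept, slope


def predict(model: Tuple[float, float], x: float) -> float:
    intercept, slope = model
    return intercept + slope * x


# Synthetic data generated from y = 1 + 2x, split across three partitions.
partitions = [[(0.0, 1.0), (1.0, 3.0)], [(2.0, 5.0)], [(3.0, 7.0), (4.0, 9.0)]]
model = train(partitions)
```

Only five numbers per partition travel to the coordinator, regardless of how many rows each partition holds, so the full data set contributes to the fit.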

Page 48: Pivotal Big Data Suite: A Technical Overview

Making the Hadoop Data Lake More Consumable

Overcome complexity

Page 49: Pivotal Big Data Suite: A Technical Overview

Simplifying the data lake with data federation

HDB’s Pivotal eXtension Framework (PXF) and HCatalog integration

• Enables connectivity between Pivotal HDB and other stores (Hive, HBase, HDFS files).
• Provides an extensible framework to add support for custom services
• Low latency on large data sets
• Considers cost model of federated sources

Diagram labels: Schema Read, HDFS DATA LAKE, HCatalog, CSV, TXT, Avro, Custom Extensions.

Page 50: Pivotal Big Data Suite: A Technical Overview

HDB as part of an architecture: Next Likely Purchase

Providing information in context with the right architecture and the right algorithms

Diagram labels: CUSTOMER APP (PURCHASE, NEXT OFFER); INTERNAL APP (REAL-TIME VIEW OF TRANSACTIONS AND OFFERS); REPORTS.

Page 51: Pivotal Big Data Suite: A Technical Overview

HDB as part of an architecture: Next Likely Purchase

Providing information in context with the right architecture and the right algorithms

1. Ingest, transform, and land data into HDFS
2. Score streaming data and serve to application

Diagram labels: CUSTOMER APP (PURCHASE, NEXT OFFER); INTERNAL APP (REAL-TIME VIEW OF TRANSACTIONS AND OFFERS); TRANSACTIONS; HDFS Staging; HDB Tables; Model Creation & Training; PMML; DATA SCIENCE & AD HOC QUERIES; REPORTS.

Page 52: Pivotal Big Data Suite: A Technical Overview

Pivotal HDB Advantages

• Advanced Analytics Performance: Exceptional MPP performance, low latency, petabyte scalability, ACID reliability, fault tolerance
• Most Complete Language Compliance: Higher degree of SQL compatibility, SQL-92, 99, 2003, OLAP, leverage existing SQL skills
• Advanced Query Optimizer: Maximize performance and do advanced queries with confidence
• Elastic Architecture for Scalability: Scale-up/down or scale-in/out, expand/shrink clusters on the fly
• Integrated w/MADlib Machine Learning: Advanced MPP analytics, data science at scale, directly on Hadoop data

Page 53: Pivotal Big Data Suite: A Technical Overview

“Companies need to learn how to catch people or things in the act of doing something and affect the outcome”

PAUL MARITZ

Executive Chairman, Pivotal

Real-time and Personalised Information in Context is what Wins!
