TECHNICAL OVERVIEW:
Pivotal Big Data Suite
Les Klein, Field CTO Data
Pivotal
@LesKlein #PivotalForum #Istanbul #BigData #Analytics
Forward Looking Statements
This presentation contains “forward-looking statements” as defined under the Federal Securities Laws. Actual results could differ materially
from those projected in the forward-looking statements as a result of certain risk factors, including but not limited to: (i) adverse changes in
general economic or market conditions; (ii) delays or reductions in information technology spending; (iii) the relative and varying rates of
product price and component cost declines and the volume and mixture of product and services revenues; (iv) competitive factors,
including but not limited to pricing pressures and new product introductions; (v) component and product quality and availability; (vi)
fluctuations in VMware, Inc.’s operating results and risks associated with trading of VMware stock; (vii) the transition to new products, the
uncertainty of customer acceptance of new product offerings and rapid technological and market change; (viii) risks associated with
managing the growth of our business, including risks associated with acquisitions and investments and the challenges and costs of
integration, restructuring and achieving anticipated synergies; (ix) the ability to attract and retain highly qualified employees; (x) insufficient,
excess or obsolete inventory; (xi) fluctuating currency exchange rates; (xii) threats and other disruptions to our secure data centers and
networks; (xiii) our ability to protect our proprietary technology; (xiv) war or acts of terrorism; and (xv) other one-time events and other
important factors disclosed previously and from time to time in the filings of EMC Corporation, the parent company of Pivotal, with the U.S.
Securities and Exchange Commission. EMC and Pivotal disclaim any obligation to update any such forward-looking statements after the
date of this release.
© 2016 Pivotal Software, Inc. All rights reserved.
Pivotal Big Data Suite
• Complete platform
• Hadoop Native SQL
• Deployment options
• Based on open source
• Flexible licensing
• Advanced data services
Pivotal Big Data Suite: Open source data management portfolio
• PIVOTAL GREENPLUM DATABASE: Data warehouse database based on open source Greenplum Database
• PIVOTAL HDB: Open source analytical database for Apache Hadoop based on Apache HAWQ
• PIVOTAL GEMFIRE: Open source application and transaction data grid based on Apache Geode
Great software companies leverage Big Data to fundamentally change the consumer experience and pioneer entirely new business models
CLOUD NATIVE SOFTWARE IS CHANGING INDUSTRIES
Data is Fueling Software
• Financial Services: $4BN
• Hospitality: $26BN
• Transportation: $50BN
• Entertainment: $54BN
• Automotive: $30BN
• Industrial Products: $3.2BN
© Copyright 2015 Pivotal. All rights reserved.
Disruptors Use a LOT of Data
• Hundreds of thousands of “trip” events each day
• 400+ billion viewing-related events per day
• Five billion training data points for Price Tip feature
Data manifests as features in an app
• “We’ve found that when a host selects a price that’s within 5% of their tip, they’re nearly 4 times more likely to get booked”
• “The importance of accuracy and efficiency […], will continue to rise as we expand and improve products like uberPOOL and beyond.”
• “Over 75% of what people watch come from our recommendations”
How are they accomplishing this?
• (Data) Microservices: Loosely coupled services architecture, bounded by context
• Cloud-Native Platforms: Enabling continuous delivery & automated operations
• Open Source Database Innovation: Extreme scale & performance advantages, built for the cloud
• Machine Learning: Use of predictive analytics to build smart apps
These companies…
Release new features in minutes, multiple times a day
Support a micro-services architecture
Consume a wide range of data sources and protocols
Store and Analyze all their data
Update algorithms and predictive models daily
Continuously ask lots of questions of their data
Modify data pipelines and add processing steps daily
…but most enterprises are not quite there yet
• Application scalability is limited by databases (App 1, App 2 and App 3 share one transactional database, which becomes the bottleneck)
• Real-time data insights are limited by disconnected OLTP and OLAP systems (TRANSACTIONS: transactional database; ETL / ELT batches with delay Δt; ANALYTICS: analytic database)
• Data services are not ready for cloud platforms (continuous delivery)
Modern Cloud-Native Data Architecture
[Diagram labels: Stream + Batch Processing; Programming + Operating Model; Cloud-Native Platform; Microservices Framework; Platform Runtime; Hadoop; DW; Spark; Microservices and Polyglot Persistence; IMDG; K/V Store; Relational DB; Big Data & Machine Learning; Cloud Infrastructure]
New pressures are breaking fragile systems
App scalability limited by scalability of databases
DB scalability limitations are aggravated by additional devices, clients and apps
Scale-out applications vs Scale-up databases
[Diagram: existing applications (App 1, App 2, App 3), new devices and clients, and new cloud native scalable data apps all depend on one transactional database, which is the bottleneck]
GemFire: Cloud-scale high performance transactional data
• Horizontally scalable
• Ultra fast, low-latency in-memory transactions
• Fully configurable data consistency
• Reliable eventing and notification model
• Highly available, auto-healing
• Inter-cluster WAN replication
[Diagram: custom apps (App 1, App 2) reach Pivotal GemFire through the transactional native API or REST / HTTP, and GemFire pushes updates to the apps]
Batch-mode latency prevents real-time analysis
Real-time data analytics is limited by data integration batches
Overnight ETL / ELT jobs expose data that is already outdated
• Analytical processes don’t have access to the latest data
• ETL/ELT processes are expensive and hard to maintain
• Batch process windows limit data scalability
[Diagram: App 1, App 2 and App 3 write to a transactional database (TRANSACTIONS, hot data); ETL / ELT batches with delay Δt load an MPP database (ANALYTICS, cold data); the axis is data temperature]
Operationalized data insights need an event-driven architecture
Combination of SQL Analytics and NoSQL event-driven transactions is needed
• Data insights must be immediately pushed to applications
• Apps should be able to react in real time to analytical findings
[Diagram: App 1, App 2 and App 3 use a transactional database through APIs / NoSQL (TRANSACTIONS); an MPP database offers machine learning and advanced analytics through ANSI SQL (ANALYTICS); data insights connect the two sides]
GemFire and GPDB - Big Data meets Fast Data
[Diagram, with a data temperature axis from hot to warm: custom apps (App 1, App 2) use Pivotal GemFire through the transactional native API or REST / HTTP and receive push updates; data science, analytics & ML use Pivotal Greenplum through analytical ANSI SQL; transactional data moves from GemFire to Greenplum by write-behind and parallel configurable data load; analytical data moves from Greenplum to the GemFire cache]
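The write-behind flow named on this slide can be illustrated with a minimal Python sketch. It does not use the GemFire or Greenplum APIs: `WriteBehindCache`, `sink`, and `batch_size` are hypothetical names, and a plain list stands in for the analytic database. The point is only that the application sees the in-memory write immediately, while the slower store receives batched rows later.

```python
from collections import deque


class WriteBehindCache:
    """In-memory key/value store that queues writes for a slower backing store."""

    def __init__(self, sink, batch_size=3):
        self.data = {}            # hot, in-memory copy served to the apps
        self.pending = deque()    # writes not yet shipped to the analytic store
        self.sink = sink          # callable that receives a list of (key, value) rows
        self.batch_size = batch_size

    def put(self, key, value):
        # The caller returns as soon as memory is updated; the sink is written later.
        self.data[key] = value
        self.pending.append((key, value))
        if len(self.pending) >= self.batch_size:
            self.flush()

    def get(self, key):
        return self.data.get(key)

    def flush(self):
        # Ship everything queued so far as one batch.
        if self.pending:
            batch = list(self.pending)
            self.pending.clear()
            self.sink(batch)


warehouse = []  # stand-in for the analytic database (assumption for this sketch)
cache = WriteBehindCache(sink=warehouse.extend, batch_size=3)
for i in range(4):
    cache.put(f"trip-{i}", {"fare": 10 + i})
cache.flush()  # drain the remainder, e.g. on shutdown or on a timer
```

A production data grid would also handle retries, ordering, and failure of the backing store, none of which this sketch attempts.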
Cloud Native apps are better suited to NoSQL
Enabling fast and scalable event-driven data services
• Unidirectional, request-response SQL: monolithic apps needed complex schema-based SQL databases
• Bidirectional, event-driven APIs: micro-services need much simpler schemas, but much better scalability
GemFire for Pivotal Cloud Foundry
Lightning fast in-memory persistence for cloud native apps
• One-click provisioning
• Pre-packaged configuration
• Embedded monitoring by Pulse
• Auto application binding
• Multi-cloud support
• Reliable data replication between PCF sites
[Diagram: Pivotal GemFire, click to deploy on Pivotal Cloud Foundry]
Next-generation databases must keep up with cloud native apps
Can your database do all of this? GemFire IMDG DOES.
• Cloud-ready, infrastructure agnostic
• Horizontal scalability
• Automatic fail-over
• Reliable eventing model
• Multi-site high availability
• Seamless integration with analytical databases
Pivotal Greenplum
World’s First Open Source Massively Parallel Data Warehouse
Greenplum Database Mission & Strategy
• Relational database system for big data and data warehousing
• Mission critical & system of record product with supporting tools and ecosystem
• Fully open source with a global community of developers and users
• Large industrial focused system
• PostgreSQL based
• Multi-platform technology: on-premise, cloud, enterprise appliance
• It’s a software product
Highlighted Greenplum Successes
• Government: Tax & benefits fraud detection; Economic statistics research
• Financial Services: Wealth management data science and product development for Commercial Banking; Risk and trade repositories reporting; 401K providers analytics on investment choices
• Pharmaceutical: Vaccine potency prediction based on manufacturing sensors
• IoT: Predictive maintenance for auto manufacturer, industrial equipment and government agencies
• Semiconductor: Fab sensor analytics and reporting
• Cyber Security & Surveillance: Internal email and communication surveillance and reporting; Corporate network anomalous behavior and intrusion detection
• Oil & Gas: Drilling equipment predictive maintenance
• Communications: Mobile telephone company enterprise data warehouse; Network performance and availability analytics
• Retail: Customer purchases analytics
• Transportation: Airlines loyalty program analytics
Greenplum Overview
Greenplum DB
SYSTEM ACCESS
• Client access: PSQL, ODBC, JDBC
• Bulk load/unload: GPLoad, GPFdist, External Tables, GPHDFS
• Admin tools: GP Perfmon, GP Support
• 3rd party tools: compatible with industry standard BI & ETL tools
DATA PROCESSING
• SQL standard compliance
• Massively parallel processing (MPP)
• Big data query optimizer
• In-database programming languages: PL/pgSQL, PL/Python, PL/R, PL/Perl, PL/Java, PL/C
• In-database analytics & extensions: MADlib, PostGIS, PGCrypto
DATA STORAGE
• Polymorphic storage: heap, append only, columnar, external, compression
• Multi-version concurrency control (MVCC)
• Fully ACID compliant transactional database
• Indexes: B-Tree, Bitmap, GiST
PostgreSQL Heritage
• Widely used
• Open Source
• PostgreSQL License
• Enterprise class open source relational engine
[Diagram label: Greenplum Open Source Launch]
MPP Shared Nothing Architecture
Flexible framework for processing large datasets
• Master Host and Standby Master Host: the master coordinates work with the segment hosts
• Segment Host with one or more Segment Instances: segment instances process queries in parallel
• Segment hosts have their own CPU, disk and memory (shared nothing)
• High speed interconnect for continuous pipelining of data processing
[Diagram: SQL arrives at the master host of Greenplum DB; the interconnect links it to segment hosts node1, node2, node3, …, nodeN, each drawn with four segment instances]
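The master/segment split on this slide can be illustrated with a small scatter-gather sketch in Python. This is not Greenplum code: a thread pool stands in for segment instances, and `segment_partial` and `master_average` are hypothetical names. Each "segment" computes a partial aggregate over only its own rows, and the "master" merges the partials.

```python
from concurrent.futures import ThreadPoolExecutor


def segment_partial(rows):
    # Work done locally on one segment: a partial aggregate over its own rows.
    return sum(amount for _, amount in rows), len(rows)


def master_average(segments):
    # The master dispatches the same step to every segment, then merges the partials.
    with ThreadPoolExecutor(max_workers=len(segments)) as pool:
        partials = list(pool.map(segment_partial, segments))
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


# Three segments, each owning a disjoint slice of the table (shared nothing).
segments = [
    [("a", 10.0), ("b", 20.0)],
    [("c", 30.0)],
    [("d", 40.0), ("e", 50.0), ("f", 60.0)],
]
average = master_average(segments)
```

Only the small `(sum, count)` pairs travel back to the master, which is why an average can be computed without moving the underlying rows.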
Loading: Massively-Parallel Ingest
Extreme speed and immediate usability from files, ETL, Hadoop & S3
Fast Parallel Load & Unload
• No master node bottleneck
• 10+ TB/hour per rack
• Linear scalability
Low Latency
• Data immediately available
• No intermediate stores
• No data “reorganization”
Load/Unload To & From:
• File systems
• Any ETL product
• Hadoop & Amazon S3
[Diagram: external sources (ETL, file systems; loading, streaming, etc.) connect over the network interconnect to Greenplum DB, which has master servers for query planning & dispatch and segment servers for query processing & data storage]
Polymorphic Storage™
User Definable Storage Layout
TABLE ‘SALES’ (partitions shown: Jun, Jul, Aug, Sep, Oct, Nov, Dec, Year - 1, Year - 2)
Row-oriented
• Row-oriented storage is faster when returning all columns
• HEAP for many updates and deletes
• Use indexes for drill-through queries
Column-oriented
• Columnar storage compresses better
• Optimized for retrieving a subset of the columns when querying
• Compression can be set differently per column: gzip (1-9), quicklz, delta, RLE
External HDFS or S3
• Keep less accessed partitions on external storage and seamlessly query all data
• All major Hadoop distributions
• Amazon S3 storage
• Others in development
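Why columnar storage "compresses better" can be seen in a minimal Python sketch of run-length encoding (RLE), one of the per-column options listed above. The sample rows and the helper names `rle_encode` and `rle_decode` are illustrative assumptions, not Greenplum's storage format.

```python
def rle_encode(values):
    """Run-length encode a list into [value, run_length] pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1      # extend the current run
        else:
            runs.append([v, 1])   # start a new run
    return runs


def rle_decode(runs):
    """Expand [value, run_length] pairs back into the original list."""
    return [v for v, n in runs for _ in range(n)]


rows = [
    ("2016-06", "EMEA", 100),
    ("2016-06", "EMEA", 250),
    ("2016-06", "APAC", 75),
    ("2016-07", "APAC", 300),
]
# Pivot row-oriented tuples into one list per column.
month, region, amount = (list(col) for col in zip(*rows))
encoded_month = rle_encode(month)
encoded_region = rle_encode(region)
```

Stored by column, repeated values sit next to each other and collapse into short runs; stored by row, the same values are interleaved with other fields and no such runs appear.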
Partitions and External Partitions
• Hash Distribution to evenly spread data across all segment instances
• Range Partition within a segment instance to minimize scan work
• Partitioned Tables Support for External Tables as a Partition
– Readable external table
– Host file system, NFS mount, HDFS or Amazon S3
[Diagram: Greenplum DB parent table with partitions labeled Jan 2013, Jan 2014, Feb 2014, ..., Dec 2014, one of them an external partition (RET)]
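The combination of hash distribution and range partitioning described above can be sketched in a few lines of Python. Everything here is an assumption for illustration: the cluster size, the column names, and `zlib.crc32` as the hash (it is a stand-in, not the hash function Greenplum uses).

```python
import zlib
from collections import defaultdict

NUM_SEGMENTS = 4  # hypothetical cluster size


def segment_for(distribution_key):
    # A stable hash of the distribution key picks the segment.
    return zlib.crc32(str(distribution_key).encode()) % NUM_SEGMENTS


def partition_for(sale_date):
    # Range partition by month: "2014-02-17" -> "2014-02".
    return sale_date[:7]


# storage[segment][partition] -> list of rows
storage = defaultdict(lambda: defaultdict(list))


def insert(row):
    storage[segment_for(row["customer_id"])][partition_for(row["sale_date"])].append(row)


def scan_month(month):
    # Every segment looks only at the matching partition; other partitions are skipped.
    return [row for seg in storage.values() for row in seg.get(month, [])]


for i in range(1000):
    insert({"customer_id": i, "sale_date": f"2014-{(i % 12) + 1:02d}-15", "amount": i})

feb = scan_month("2014-02")
rows_per_segment = {seg: sum(len(p) for p in parts.values()) for seg, parts in storage.items()}
```

Distribution decides which segment owns a row, so work spreads across the cluster; partitioning decides which slice inside each segment a query has to read, so a one-month query touches one partition per segment.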
Hybrid Queries: Pivotal External Tables
Roadmap
S3 External Tables
• Readable Ext-Table MVP
• Readable Gzip Files
• Writable Ext-Table
• Investigation: Enhanced Security/Roles
• Investigation: Additional File Formats
GemFire External Tables
• High Speed Ingestion
• High Concurrency Query Cache
GPHDFS
Greenplum Database Features for Data Scientists
• Window functions: Perform calculations across a set of table rows that are related to the current row
• Analytics extensions: In-database machine learning at scale using MADlib
• Procedural language extensions: Extended functionality using non-SQL programming languages and packages (e.g. Python and R)
• Client Access: ODBC and JDBC access to support connections to 3rd party tools
* Only a subset of Greenplum Database features
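What a window function computes can be shown with a short Python sketch. It mimics the behavior of a running total such as `SUM(amount) OVER (PARTITION BY region ORDER BY day)`; the function name `running_total` and the sample data are hypothetical, and in the database this would be written in SQL rather than Python.

```python
from itertools import groupby
from operator import itemgetter


def running_total(rows, partition_key, order_key, value_key):
    """Attach a per-partition cumulative sum to every row."""
    ordered = sorted(rows, key=itemgetter(partition_key, order_key))
    out = []
    for _, group in groupby(ordered, key=itemgetter(partition_key)):
        total = 0
        for row in group:
            total += row[value_key]
            # Each input row is kept; the window result is added as a new column.
            out.append({**row, "running_total": total})
    return out


sales = [
    {"region": "EMEA", "day": 2, "amount": 30},
    {"region": "APAC", "day": 1, "amount": 5},
    {"region": "EMEA", "day": 1, "amount": 10},
    {"region": "APAC", "day": 2, "amount": 7},
]
result = running_total(sales, "region", "day", "amount")
```

Unlike a GROUP BY aggregate, the output has one row per input row: the calculation looks across related rows but does not collapse them.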
Procedural Languages
• User Defined Types
• User Defined Functions
• User Defined Aggregates
• Import of libraries from open source
Scalable, In-Database Machine Learning
• Open source https://github.com/apache/incubator-madlib
• Downloads and docs http://madlib.incubator.apache.org/
• Wiki https://cwiki.apache.org/confluence/display/MADLIB/
Functions (April 2016)
Linear Systems
• Sparse and Dense Solvers
• Linear Algebra
Matrix Factorization
• Singular Value Decomposition (SVD)
• Low Rank
Generalized Linear Models
• Linear Regression
• Logistic Regression
• Multinomial Logistic Regression
• Ordinal Regression
• Cox Proportional Hazards Regression
• Elastic Net Regularization
• Robust Variance (Huber-White), Clustered Variance, Marginal Effects
Other Machine Learning Algorithms
• Principal Component Analysis (PCA)
• Association Rules (Apriori)
• Topic Modeling (Parallel LDA)
• Decision Trees
• Random Forest
• Support Vector Machines (SVM)
• Conditional Random Field (CRF)
• Clustering (K-means)
• Cross Validation
• Naïve Bayes
Descriptive Statistics
• Sketch-Based Estimators: CountMin (Cormode-Muthukrishnan), FM (Flajolet-Martin), MFV (Most Frequent Values)
• Correlation and Covariance
• Summary
Utility Modules
• Array and Matrix Operations
• Sparse Vectors
• Random Sampling
• Probability Functions
• Data Preparation
• PMML Export
• Conjugate Gradient
• Stemming
Inferential Statistics
• Hypothesis Tests
Time Series
• ARIMA
Path Functions
• Operations on Pattern Matches
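Of the functions listed, the CountMin sketch is compact enough to illustrate directly. The following is a minimal Python sketch of the general technique, not MADlib's implementation: the class name, the default `width` and `depth`, and the use of salted SHA-256 as the hash family are all assumptions made for this example.

```python
import hashlib


class CountMinSketch:
    """Approximate frequency counter using a fixed-size table of counters."""

    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One hash per row, derived by salting the item with the row number.
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions can only inflate a counter, so the minimum is the tightest bound.
        return min(self.table[row][self._index(item, row)] for row in range(self.depth))


sketch = CountMinSketch()
stream = ["login"] * 500 + ["purchase"] * 40 + ["refund"] * 3
for event in stream:
    sketch.add(event)
```

The estimate never undercounts; it may overcount when items collide in every row. Memory stays fixed at `width * depth` counters regardless of how many distinct items appear, which is what makes the approach attractive for very large tables.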
GPDB Geospatial
Ability to store geospatial data and query with joins and operators
Current Key Features:
• Points, Lines, Polygons, Perimeter, Area, Intersection, Contains, Distance, Long/Lat
• Spatial Indexes & Bounding Boxes
• Round earth calculations
• Raster Image Processing
Pivotal HDB
Hadoop Native SQL Database
Making the Hadoop Data Lake More Consumable
Enabling data science and machine learning at scale
1) Important people and tools are cut off because of SQL completeness or performance.
2) Data scientists still have to resort to sampling if they can't run analytics in-database at scale
3) There are multiple data sets and formats within Hadoop
[Diagram: business analysts and data scientists reach the data lake (Hive, HBase, etc.) through a SQL app]
Making the Hadoop Data Lake More Consumable
As the lingua franca of analytics, SQL can't be ignored. Neither can performance.
Hadoop data lakes sit underutilized
Lack of interactive, ANSI SQL capabilities inhibits adoption and value
• Producing complex queries, large joins, interactive queries
• Existing investments in visualization and BI tools
• Large population of users with SQL skills
HDB: The Hadoop Native SQL Database
High performance, interactive SQL queries on Hadoop
● Highly efficient MPP (massively parallel processing)
● Low-latency
● Petabyte scalability
● ACID transaction support
● SQL-92, 99, 2003 compatibility
● Advanced cost-based optimizer
Making the Hadoop Data Lake More Consumable
Integrate SQL and data science tools into an interactive, operationalized environment
Predictive analytics not scaling with Python or R
Using traditional, single-node Python or R for analytics means using subsets because of the lack of parallelization
<...>
Implications
• Time-consuming data movement
• Working with small sample sizes requires extra testing cycles against larger data sets
• Slow feature generation limits algorithm development
[Diagram: SAMPLE 1, SAMPLE 2, …, SAMPLE n drawn from the data lake]
In-database analytics speeds predictive modeling
Apache™ MADlib® (incubating) is an open-source library for scalable in-database analytics
Scale-out mathematical, statistical and machine learning methods for structured and unstructured data
• SQL-based
• Analyze without sampling
• Open source
• Runs on HDB, Greenplum, and Postgres
• Complements support for procedural languages: PL/R, PL/Python, PL/Java
Train a model...
Predict for new data...
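The train-then-predict pattern named on this slide can be shown with a minimal Python sketch of ordinary least squares on one feature. It is not MADlib code and runs on a single node; the function names `train` and `predict` and the tiny data set are assumptions chosen so the arithmetic is easy to follow.

```python
def train(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return {"slope": slope, "intercept": mean_y - slope * mean_x}


def predict(model, x):
    """Score a new observation with the fitted coefficients."""
    return model["intercept"] + model["slope"] * x


# Training step: learn coefficients from historical rows.
model = train([1, 2, 3, 4], [3, 5, 7, 9])
# Prediction step: apply the stored model to new data.
forecast = predict(model, 10)
```

The fit depends only on sums over the rows, and sums can be computed per segment and merged. That property is what lets an in-database library train on the full data set in parallel instead of on a sample.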
Making the Hadoop Data Lake More Consumable
Overcome complexity
Simplifying the data lake with data federation
HDB’s Pivotal eXtension Framework (PXF) and HCatalog integration
• Enables connectivity between Pivotal HDB and other stores (Hive, HBase, HDFS files).
• Provides an extensible framework to add support for custom services
• Low latency on large data sets
• Considers cost model of federated sources
[Diagram labels: Schema Read; HCatalog; HDFS data lake; CSV, TXT, Avro, custom extensions]
HDB as part of an architecture: Next Likely Purchase
Providing information in context with the right architecture and the right algorithms
1. Ingest, transform, and land data into HDFS
2. Score streaming data and serve to application
[Diagram labels: customer app (purchase, next offer); internal app (real-time view of transactions and offers); transactions; HDFS staging; HDB tables; model creation & training; PMML; data science & ad hoc queries; reports]
Pivotal HDB Advantages
• Advanced Analytics Performance: Exceptional MPP performance, low latency, petabyte scalability, ACID reliability, fault tolerance
• Most Complete Language Compliance: Higher degree of SQL compatibility, SQL-92, 99, 2003, OLAP, leverage existing SQL skills
• Advanced Query Optimizer: Maximize performance and do advanced queries with confidence
• Elastic Architecture for Scalability: Scale-up/down or scale-in/out, expand/shrink clusters on the fly
• Integrated w/MADlib Machine Learning: Advanced MPP analytics, data science at scale, directly on Hadoop data
“Companies need to learn how to catch people or things in the act of doing something and affect the outcome”
PAUL MARITZ
Executive Chairman, Pivotal
Real-time and Personalised Information in Context is what Wins!