A NEW PLATFORM FOR A NEW ERA
Pivotal and Big Data
Les Klein, Director Field Engineering EMEA [email protected]
@LesKlein Pivotal.io
Agenda
• Pivotal & Myself
• What Pivotal does
• Why is this important?
• A closer look at HAWQ
Pivotal At-a-Glance
• New independent venture: spun out of and jointly owned by EMC & VMware
• Top talent: ~1,900 employees
• Proven leadership: Paul Maritz, CEO
• Global customer validation: 1,000+ Tier-1 enterprise customers
• Strategic backing: $105M investment by GE
• Bold vision: a new platform for a new era, focused on the intersection of Big Data, PaaS, and agile software development
Agenda
• Pivotal & Myself
• What Pivotal does
• Why is this important?
• A closer look at HAWQ
The Shift to the 3rd-Generation Platform

1st – MAINFRAME
Automation of financial accounts
Mainframes, ISAM

2nd – CLIENT-SERVER & WEB
Automation of most paper processes: ERP, CRM, email, …
Relational databases, minis & PCs

3rd – CLOUD
New experiences, new business models, new needs ("IoT"), pioneered by the new consumer Internet giants – requires new application and data fabrics
Cloud-enabled resources, new data fabrics

$1+ trillion of IT spend over the coming 10 years
What Have Pioneers Been Able To Do?
It’s More Than Just Hadoop – Pivotal’s Full Approach to Big Data
How Pivotal Accelerates Value Creation
70% of data generated by customers
80% of data being stored
3% being prepared for analysis
0.5% being analyzed
<0.5% being operationalized
[Chart: First Movers, Smart Enterprises and Average Enterprises compared by value multiple and market size: ~20X $2.9B, ~30X $40B, ~7X $290B, ~20X $120B]
SOLVE THE BIG DATA UTILITY GAP
The Journey to Technology Leadership
COLLECT – Business Data Lake: store everything
ANALYZE – Big Data Analytics: generate insights
DEVELOP – Data-Driven Applications: operationalize
INNOVATE – Agile Enterprise: iterate rapidly

Supported by PDL Data Architecture, PDL Data Science, Big Data Predictive Analytics, Pivotal Labs Agile Development, Pivotal CF Services and Enterprise PaaS.
Data Driven: Harder than it Sounds
The same pipeline – ingest, distill, process, interface, operationalize – spans analytical and transactional workloads and has to run at three speeds:

Real time: predictive call routing, fraud prediction, dynamic pricing, re-marketing, stream analytics
Near real time: analytic model designs, transaction analysis, trend analysis
Batch: ETL, archive, trending, monthly and weekly jobs
Data Driven: Impossible in Silos
Silos: Finance, Manufacturing, Marketing, IT
Data growth of over 60% floods these silos
Business Data Lake Architecture
Unified sources: real-time ingestion, micro-batch ingestion, batch ingestion
Processing tier: HDFS storage (unstructured and structured data), in-memory, MPP database
Flexible actions: real-time insights, interactive insights, batch insights
Supporting services: system monitoring, system management, workflow management
Business Data Lake Architecture – Pivotal products
Unified sources: real-time ingestion, micro-batch ingestion, batch ingestion
Processing tier: PHD (HDFS storage for unstructured and structured data), GemFire XD/GemFire (in-memory), HAWQ/GPDB (MPP database)
Flexible actions: real-time insights, interactive insights, batch insights
Supporting services: system monitoring, system management, workflow management
Pivotal BDS Subscription Model
Priced per Pivotal component; each SKU has its own unit of measure and price:
• Greenplum DB – Database
• Pivotal HD – Hadoop Distributed File System
• HAWQ – Parallel Query Engine
• GemFire XD – In-Memory Data Grid for Hadoop
• SQLFire – In-Memory Data Grid with SQL Layer
• GemFire – In-Memory Data Grid
Pivotal HD Value
• Cost-based query optimizer
• ANSI SQL compliant
• Linear, incremental scalability on COTS hardware
• Deep analytic OLAP queries
• Petabyte data storage and management
• Low-latency updates and transactions
• Partitioned events in situ with data
• Active-active deployment across WAN
OLAP and OLTP, with SQL on HDFS
How Does this Work in Practice?

Store everything
• Obsessively collect data
• Keep it forever
• Put the data in one place

Analyze anything
• Cleanse, organize, and manage your data lake
• Make the right tools available
• Use the resources wisely to compute, analyze, and understand data

Build what you need
• Use insights to iteratively improve your product
Agenda
• Pivotal & Myself
• What Pivotal does
• Why is this important?
• A closer look at HAWQ
[Chart: value of data ($) versus time, from microseconds to years – real-time, predictive, and data warehouse / BI zones]
MPP database as a technology innovation to process data more effectively and more cost-efficiently
Data Insights are Fueling the Future
The more data you have, the easier it is to solve a problem, pursue an opportunity, and build smarter software.
[Chart: accuracy versus data volume for complex algorithms]
Oil & Gas: $90bn in CapEx savings
Power: $66bn in fuel savings
Healthcare: $63bn in efficiency gains
Aviation: $30bn in fuel savings
Rail: $27bn in operations savings
Agenda
• Pivotal & Myself
• What Pivotal does
• Why is this important?
• A closer look at HAWQ
HAWQ – Hadoop With Queries
• Fastest SQL query engine on Hadoop
• 100% SQL compliant – e.g. the TPC-DS benchmark (SIGMOD report)
• Proven with 10 years of technology innovation
• Dynamic Pipelining technology delivers 100X performance improvement, with mature SQL query optimization and powerful analytics
• Scatter/gather data loading, polymorphic storage, third-party tools certification and language support
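As a rough illustration of what "SQL on Hadoop" means in practice (the table and column names below are hypothetical, not from this deck), standard ANSI SQL – including window functions – runs unchanged against HAWQ tables whose data lives in HDFS:

    -- Hypothetical table: orders, stored in HDFS and managed by HAWQ.
    SELECT customer_id,
           order_date,
           SUM(amount) OVER (PARTITION BY customer_id
                             ORDER BY order_date) AS running_total
    FROM   orders
    WHERE  order_date >= DATE '2014-01-01'
    ORDER  BY customer_id, order_date;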
HAWQ Simply
Greenplum database re-platformed on Hadoop/HDFS, exposed through ODBC/JDBC drivers and an ANSI SQL 2003/2011 engine:
• Robust, cost-based query optimizer
• Storage options: row/columnar storage, built-in compression, polymorphic storage (see the sketch below), parallel loading/unloading, and extendable HDFS native formats (txt, Avro, Seq, HBase) with Hive and MapReduce integration
• Complex data management: distributions, partitioning, sub-partitioning
• Multi-user platform: concurrency, resource queues, role-based security, data encryption, accessibility
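A minimal sketch of polymorphic storage, assuming hypothetical table and partition names: within one partitioned table, recent partitions stay row-oriented for fast loads while older partitions are column-oriented and compressed, using the Greenplum-style DDL that HAWQ inherits:

    -- Hypothetical example: hot data row-oriented, cold data columnar + compressed.
    CREATE TABLE orders (
        order_id   bigint,
        order_date date,
        amount     numeric(12,2)
    )
    WITH (APPENDONLY=true, ORIENTATION=row)
    DISTRIBUTED BY (order_id)
    PARTITION BY RANGE (order_date)
    (
        PARTITION archive START (DATE '2013-01-01') END (DATE '2014-01-01')
            WITH (APPENDONLY=true, ORIENTATION=column, COMPRESSTYPE=zlib),
        PARTITION current START (DATE '2014-01-01') END (DATE '2015-01-01')
    );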
HAWQ Benefits…
• Out-of-the-box SQL for Hadoop – SQL adoption versus learning MapReduce programming
• PXF external tables providing SQL access to Hadoop – HDFS, HBase, Hive or any data type
• Broad data access, integration and portability
• Performance and scalability:
  – Parallel everything
  – Dynamic Pipelining
  – High-speed interconnect
  – Optimized HDFS access with libhdfs3
  – Co-located joins and data locality
  – Partition elimination
  – Higher cluster utilization
  – Concurrency control via resource queues (see the sketch below)
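A hedged sketch of how concurrency control is typically expressed in the Greenplum/HAWQ lineage – resource queues that cap concurrent statements per role (queue and role names are made up here):

    -- Hypothetical queue limiting ad hoc users to 5 concurrent statements.
    CREATE RESOURCE QUEUE adhoc_queue WITH (ACTIVE_STATEMENTS=5);

    -- Assign an existing role to the queue so its queries are throttled.
    ALTER ROLE analyst RESOURCE QUEUE adhoc_queue;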
HAWQ and Hadoop Native File Formats
Read/write through PXF (Pivotal eXtension Framework):
• HDFS flat files: CSV, delimited, …
• Hive
• HBase (with predicate push-down)
• Avro, RCFile, SequenceFile
• Open, extendable API – additional connectors available on GitHub: Accumulo, JSON, …
Pivotal eXtension Framework (PXF)
• An advanced version of GPDB external tables
• Enables combining HAWQ data and Hadoop data in a single query
• Supports connectors for HDFS, HBase and Hive (and GemFire XD)
• Provides an extensible framework API to enable custom connectors
• Available on GitHub: JSON, Accumulo, …
• HAWQ MapReduce RecordReader
The Pivotal HD extension framework sits between HAWQ and HDFS, HBase and Hive; a hedged external-table example follows.
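A sketch of a PXF external table, assuming a hypothetical HDFS path, NameNode host/port and the HdfsTextSimple profile (the exact LOCATION syntax varies by release):

    -- Hypothetical: expose CSV files in HDFS to HAWQ SQL via PXF.
    CREATE EXTERNAL TABLE ext_clickstream (
        user_id  bigint,
        url      text,
        click_ts timestamp
    )
    LOCATION ('pxf://namenode:50070/data/clickstream/*.csv?Profile=HdfsTextSimple')
    FORMAT 'TEXT' (DELIMITER ',');

    -- Combine PXF (Hadoop) data with a native HAWQ table in a single query.
    SELECT c.user_id, count(*) AS clicks
    FROM   ext_clickstream c
    JOIN   customers cu ON cu.user_id = c.user_id
    GROUP  BY c.user_id;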
Deep Scalable Analytics
Provides data-parallel implementations of mathematical, statistical and machine-learning methods for structured and unstructured data.
MADlib – In-DB Deep Analytics
• MADlib is the open-source analytical library developed by Pivotal in conjunction with researchers from UC Berkeley, University of Wisconsin-Madison, University of Florida and Johns Hopkins University, and integrated into HAWQ.
• It enables you to run deep analytics such as linear regression, k-means clustering, naïve Bayes classification and many other well-known methods directly inside the HAWQ/Hadoop cluster, as in the sketch below.
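MADlib's in-database functions are invoked as plain SQL; a hedged sketch with hypothetical table and column names, using the linear-regression trainer madlib.linregr_train:

    -- Train a linear regression model entirely inside the cluster:
    -- price modeled against tax, bathrooms and size (hypothetical houses table).
    SELECT madlib.linregr_train(
        'houses',                      -- source table
        'houses_model',                -- output table for the fitted model
        'price',                       -- dependent variable
        'ARRAY[1, tax, bath, size]'    -- independent variables (1 = intercept)
    );

    -- Inspect the fitted coefficients and R-squared.
    SELECT coef, r2 FROM houses_model;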
Execution in Database
• All data stays in the database: R objects merely point to database objects
• All model estimation and heavy lifting is done in the database by MADlib
• The R → SQL translation is done in the R client (PivotalR)
• Only strings of SQL and model output are transferred across ODBC/DBI
Thank You
Supporting Slides
Pivotal HD Architecture
Pivotal HD Enterprise bundles:
• Apache components: HDFS, MapReduce, YARN (resource management & workflow), HBase, Pig, Hive, Mahout, Sqoop, Flume, Zookeeper, Oozie
• Pivotal additions: Command Center (configure, deploy, monitor, manage), Spring XD, Spring, Virtualization Extensions, GraphLab, Open MPI
• HAWQ – Advanced Database Services: ANSI SQL + analytics, Xtension Framework, Catalog Services, Query Optimizer, Dynamic Pipelining, MADlib algorithms
• GemFire XD – Real-Time Database Services: ANSI SQL + in-memory, distributed in-memory store, query and transactions, ingestion processing, Hadoop driver (parallel with compaction)
Committed to Open Source
• Pivotal has signed the Apache CCLA (July 17, 2013)
• Contributing to Apache Hadoop (a Pig patch, Hadoop Virtualization Extensions)
• Integrating with other open source projects
Pivotal is a major contributor to multiple open source projects.
Greenplum Database and HAWQ
HAWQ Evolved From…
• Greenplum database re-platformed on Hadoop/HDFS
• HAWQ provides all major features found in the Greenplum database:
  – SQL completeness: 2003 extensions
  – JDBC compliant
  – Robust query optimizer
  – Row- or column-oriented table storage
  – Parallel loading and unloading
  – Distributions
  – Multi-level partitioning (see the DDL sketch below)
  – High-speed data redistribution
  – Views
  – External tables
  – Compression
  – Resource management
  – Security
  – Authentication
  – Management and monitoring
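As an illustrative sketch (hypothetical table), distributions and partitioning are declared directly in the DDL, just as in Greenplum:

    -- Hash-distribute on user_id across segments; range-partition by month.
    CREATE TABLE events (
        event_id   bigint,
        user_id    bigint,
        event_date date,
        payload    text
    )
    DISTRIBUTED BY (user_id)
    PARTITION BY RANGE (event_date)
    (
        START (DATE '2014-01-01') INCLUSIVE
        END   (DATE '2015-01-01') EXCLUSIVE
        EVERY (INTERVAL '1 month')
    );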
HAWQ
• GPDB on HDFS
• Not shared-nothing: it has a distributed file system (HDFS), so nodes can access shards of data on other nodes
• Built for large I/O – append-only, write-once, read-many
• Segments are stateless – HA is one of the main drivers towards HDFS
(Diagram: one HDFS NameNode coordinating several HDFS DataNodes.)
HAWQ Features
• HAWQ provides all major features found in the Greenplum database that can be supported on Hadoop/HDFS, including:
  – Row- or column-oriented table storage
  – Distributions
  – Partitioning
  – Views
  – External tables
• Using some features without understanding their implications on HDFS may result in problems
Architectural Differences from GPDB
• Stateless segment hosts
  – Segments do not know what is visible or aborted in their physical data
  – Segments do not know what columns are in a table
• The HA model deviates from a shared-nothing environment
  – If a segment is down, simply read from the replica in HDFS
  – No lengthy failover process
• The HDFS design doesn't lend itself to local transaction management
  – Frequent, small bursts of I/O on HDFS perform poorly
Architectural Implications of Using HDFS
• To re-platform GPDB on HDFS, segment workers had to be simplified (made dumber)
  – GPDB segment workers had their own copies of metadata, transaction management and local storage
• Heap storage in GPDB requires the database to modify tuples on disk
  – HDFS is append-only, therefore heap storage cannot work on DataNodes
  – Catalog tables require 100% heap storage, so segment servers cannot keep a local copy of the catalog
GPDB and HAWQ Differences at a Glance
Considering the architectural differences and implications of HDFS:
• No UPDATE and DELETE – TRUNCATE is supported (see the rewrite-pattern sketch below)
• No catalog on segment servers
• No local transaction management at the segment level
• No indexes
• No GPText
• Local storage exists on segments but is used only for temporary purposes
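Because UPDATE and DELETE are unavailable, changes are usually applied by rewriting data. A hedged sketch of one common pattern (hypothetical tables), using only operations HAWQ does support – CTAS, DROP/TRUNCATE and RENAME:

    -- "Update" rows by rebuilding the table with the new values applied.
    CREATE TABLE customers_new AS
    SELECT customer_id,
           CASE WHEN customer_id = 42 THEN 'inactive' ELSE status END AS status,
           email
    FROM   customers;

    -- Swap the rebuilt table into place.
    DROP TABLE customers;
    ALTER TABLE customers_new RENAME TO customers;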
HAWQ or Greenplum Database?

Feature | GPDB | HAWQ
Real-time random reads/writes | ✓ |
Large I/O, write once, read many | | ✓
Petabytes of data | | ✓
Hadoop/HDFS platform | | ✓
Updates | ✓ |
Deletes | ✓ |
Indexes | ✓ |
Row- or column-oriented table storage | ✓ | ✓
User-defined data distributions | ✓ | ✓
User-defined partitioning | ✓ | ✓
Resource management | ✓ | ✓
User-defined functions (UDFs) | ✓ | ✓
External tables | ✓ | ✓
GPText | ✓ |
MADlib algorithms | ✓ | ✓
HAWQ Storage Options
Define the Storage Model
HAWQ storage options:
• Row-oriented format
• Column-oriented format
• Parquet format
• Append-only
• Compression
Specify Using the WITH Clause
• The WITH clause can be used to set storage options for HAWQ tables
• You can also set storage parameters on a particular partition or subpartition by declaring the WITH clause in the partition specification

CREATE TABLE table_name ( … ) [ WITH ( storage_parameter=value [, ... ] ) ]

where storage_parameter is one of:

APPENDONLY={TRUE}
ORIENTATION={COLUMN|ROW|PARQUET}
COMPRESSTYPE={ZLIB|QUICKLZ|RLE_TYPE|SNAPPY|GZIP|NONE}
COMPRESSLEVEL={0-9}
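A minimal sketch (hypothetical table) combining these parameters into an append-only, column-oriented, compressed table:

    CREATE TABLE page_views (
        view_id   bigint,
        view_date date,
        url       text
    )
    WITH (APPENDONLY=true,
          ORIENTATION=column,
          COMPRESSTYPE=quicklz,
          COMPRESSLEVEL=1)
    DISTRIBUTED BY (view_id);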
Storage Considerations
• Use row-based tables when:
  – Incremental loads/inserts are performed
  – Selects against the table are wide
• Use column-based tables when:
  – Data is written once and read many times (not optimized for write operations)
  – Selects are narrow
Columnar Storage
For a table with columns A, B, C and rows 1–3:
• Row-oriented storage lays values out as A1 B1 C1 | A2 B2 C2 | A3 B3 C3
• Column-oriented storage lays values out as A1 A2 A3 | B1 B2 B3 | C1 C2 C3
• Reduces I/O – scans only the columns needed
• Reduces space – columnar data compresses better, with efficient type-specific encodings from storing values of the same primitive type together
Column-Oriented Considerations
• Consider the use cases for column-oriented storage:
  – More efficient I/O and storage
  – Not optimized for write operations
• Increases performance and reduces storage cost thanks to decreased I/O and good compression ratios
• Do not use columnar orientation on very wide tables
• If the partition granularity requirement is low, use row-based table orientation
Introducing Parquet
• Columnar storage open source file format for Hadoop
• Began as a joint effort between Twitter and Cloudera
• Stores nested data structures in a flat columnar format using a technique based on Google’s Dremel ColumnIO format
  – Dremel is an ad hoc query system for analysis of read-only nested data
• Allows for better compression because the data is more homogeneous
• Reduces I/O because you can scan only a subset of the columns while reading the data
Parquet Model
• Minimalistic model
• Represents nesting using groups of fields
• Represents repetition using repeated fields
• No complex data types – mapped to a combination of repeated fields and groups
• Schema data structure:
  – The root of the schema is a group of fields called a message
  – Each field has three attributes: repetition, type and name
  – Repetition can be required, optional or repeated
  – The type of a field is either a group or a primitive type
Why HAWQ Parquet?
• Partially solves the HDFS file-count limitation
• Performance is better for large, I/O-bound datasets compared with append-only tables, due to column-level compression
• All HAWQ complex and rich data types are supported
• External systems can directly read HAWQ data on HDFS thanks to the open file format
HAWQ Parquet Feature Overview
• Parquet table read and write support
• Partitioned table support
• Compression support
• Complex and rich data type support
• MapReduce InputFormat for Parquet tables
HAWQ Parquet Design
• Does not change anything in the open source Parquet format
  – Both the data file and the metadata file are in the open source Parquet format
• The data file is organized into row groups → column chunks → column pages
• Appends to a file and adds a new footer at the end of each load/insert
• Accumulative inserts can be slow, due to the metadata overhead
• The design point for Parquet is LARGE writes
DDL and DML Support
• Most DDL and DML operations are supported for Parquet tables – usage is similar to append-only tables (see the sketch below)
• Supports table-level compression
• Supports partitioning
• Supports all HAWQ data types except arrays and UDTs
• Supports ALTER TABLE for partition operations
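A hedged sketch of the Parquet variant (hypothetical tables; the option spelling follows the WITH-clause parameters shown earlier):

    -- Parquet-backed HAWQ table with Snappy compression; queried like any other table.
    CREATE TABLE sales_parquet (
        sale_id  bigint,
        sale_ts  timestamp,
        amount   numeric(12,2)
    )
    WITH (APPENDONLY=true,
          ORIENTATION=parquet,
          COMPRESSTYPE=snappy)
    DISTRIBUTED BY (sale_id);

    -- Loads follow the large-write design point: bulk INSERT … SELECT rather than trickle inserts.
    INSERT INTO sales_parquet
    SELECT sale_id, sale_ts, amount FROM sales_staging;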
Business Data Lake Use Cases
• Native Hadoop: ETL/ELT offload; batch/archive; external data merged with enterprise data; hot-warm-cold data; EDW cost savings
• SQL on Hadoop: unstructured data analysis; machine-generated data, sensors, logs, social media; Grid IQ – GE electric grid prediction
• MPP analytics: ad hoc queries, machine learning, predictive analytics, advanced data science; credit card fraud detection, recommendation engines
• In-memory: real-time, low-latency, high-concurrency OLTP; real-time processing of bookings at China Railways, Southwest Airlines
Business Data Lake: all data, ETL/ELT, local, social, mobile – batch, interactive and real-time
A NEW PLATFORM FOR A NEW ERA