A NEW PLATFORM FOR A NEW ERA
Pivotal and Big Data
Les Klein, Director Field Engineering EMEA [email protected]
@LesKlein Pivotal.io
Agenda
• Pivotal & Myself
• What Pivotal does
• Why is this important?
• A closer look at HAWQ
Pivotal At-a-Glance
• New independent venture: spun out of and jointly owned by EMC & VMware
• Top talent: ~1,900 employees
• Proven leadership: Paul Maritz, CEO
• Global customer validation: 1,000+ Tier-1 enterprise customers
• Strategic backing: $105M investment by GE
• Bold vision: a new platform for a new era, focused on the intersection of Big Data, PaaS, and agile software development
Agenda
• Pivotal & Myself
• What Pivotal does
• Why is this important?
• A closer look at HAWQ
The Shift to the 3rd-Generation Platform

1st – MAINFRAME
Automation of financial accounts
Mainframes, ISAM

2nd – CLIENT-SERVER & WEB
Automation of most paper processes: ERP, CRM, email, …
Relational databases, minis & PCs

3rd – CLOUD
New experiences, new business models, new needs ("IoT"), pioneered by the new consumer Internet giants – requires new application and data fabrics
Cloud-enabled resources, new data fabrics

$1+ trillion of IT spend over the coming 10 years
What Have Pioneers Been Able To Do?
It’s More Than Just Hadoop – Pivotal’s Full Approach to Big Data
How Pivotal Accelerates Value Creation
70% of data generated by customers
80% of data being stored
3% being prepared for analysis
0.5% being analyzed
<0.5% being operationalized
[Chart: First Movers, Smart Enterprises and Average Enterprises compared by value multiple and market size: ~20X $2.9B, ~30X $40B, ~7X $290B, ~20X $120B]
SOLVE THE BIG DATA UTILITY GAP
The Journey to Technology Leadership
COLLECT – Business Data Lake: store everything
ANALYZE – Big Data Analytics: generate insights
DEVELOP – Data-Driven Applications: operationalize
INNOVATE – Agile Enterprise: iterate rapidly

Supported by PDL Data Architecture, PDL Data Science, Big Data Predictive Analytics, Pivotal Labs Agile Development, Pivotal CF Services and Enterprise PaaS.
Data Driven: Harder than it Sounds
The same pipeline – ingest, distill, process, interface, operationalize – spans analytical and transactional workloads and has to run at three speeds:

Real time: predictive call routing, fraud prediction, dynamic pricing, re-marketing, stream analytics
Near real time: analytic model designs, transaction analysis, trend analysis
Batch: ETL, archive, trending, monthly and weekly jobs
Data Driven: Impossible in Silos
Silos: Finance, Manufacturing, Marketing, IT
Data growth of over 60% floods these silos
Business Data Lake Architecture
Unified sources: real-time ingestion, micro-batch ingestion, batch ingestion
Processing tier: HDFS storage (unstructured and structured data), in-memory, MPP database
Flexible actions: real-time insights, interactive insights, batch insights
Supporting services: system monitoring, system management, workflow management
Business Data Lake Architecture – Pivotal products
Unified sources: real-time ingestion, micro-batch ingestion, batch ingestion
Processing tier: PHD (HDFS storage for unstructured and structured data), GemFire XD/GemFire (in-memory), HAWQ/GPDB (MPP database)
Flexible actions: real-time insights, interactive insights, batch insights
Supporting services: system monitoring, system management, workflow management
Pivotal BDS Subscription Model
Priced per Pivotal component; each SKU has its own unit of measure and price:
• Greenplum DB – Database
• Pivotal HD – Hadoop Distributed File System
• HAWQ – Parallel Query Engine
• GemFire XD – In-Memory Data Grid for Hadoop
• SQLFire – In-Memory Data Grid with SQL Layer
• GemFire – In-Memory Data Grid
Pivotal HD Value
• Cost-based query optimizer
• ANSI SQL compliant
• Linear, incremental scalability on COTS hardware
• Deep analytic OLAP queries
• Petabyte data storage and management
• Low-latency updates and transactions
• Partitioned events in situ with data
• Active-active deployment across WAN
OLAP and OLTP, with SQL on HDFS
How Does this Work in Practice?

Store everything
• Obsessively collect data
• Keep it forever
• Put the data in one place

Analyze anything
• Cleanse, organize, and manage your data lake
• Make the right tools available
• Use the resources wisely to compute, analyze, and understand data

Build what you need
• Use insights to iteratively improve your product
Agenda
• Pivotal & Myself
• What Pivotal does
• Why is this important?
• A closer look at HAWQ
[Chart: value of data ($) versus time, from microseconds to years – real-time, predictive, and data warehouse / BI zones]
MPP database as a technology innovation to process data more effectively and more cost-efficiently
Data Insights are Fueling the Future
The more data you have, the easier it is to solve a problem, pursue an opportunity, and build smarter software.
[Chart: accuracy versus data volume for complex algorithms]
Oil & Gas: $90bn in CapEx savings
Power: $66bn in fuel savings
Healthcare: $63bn in efficiency gains
Aviation: $30bn in fuel savings
Rail: $27bn in operations savings
Agenda
• Pivotal & Myself
• What Pivotal does
• Why is this important?
• A closer look at HAWQ
HAWQ – Hadoop With Queries
• Fastest SQL query engine on Hadoop
• 100% SQL compliant – e.g. the TPC-DS benchmark (SIGMOD report)
• Proven with 10 years of technology innovation
• Dynamic Pipelining technology delivers 100X performance improvement, with mature SQL query optimization and powerful analytics
• Scatter/gather data loading, polymorphic storage, third-party tools certification and language support
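As a rough illustration of what "SQL on Hadoop" means in practice (the table and column names below are hypothetical, not from this deck), standard ANSI SQL – including window functions – runs unchanged against HAWQ tables whose data lives in HDFS:

    -- Hypothetical table: orders, stored in HDFS and managed by HAWQ.
    SELECT customer_id,
           order_date,
           SUM(amount) OVER (PARTITION BY customer_id
                             ORDER BY order_date) AS running_total
    FROM   orders
    WHERE  order_date >= DATE '2014-01-01'
    ORDER  BY customer_id, order_date;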
HAWQ Simply
Greenplum database re-platformed on Hadoop/HDFS, exposed through ODBC/JDBC drivers and an ANSI SQL 2003/2011 engine:
• Robust, cost-based query optimizer
• Storage options: row/columnar storage, built-in compression, polymorphic storage (see the sketch below), parallel loading/unloading, and extendable HDFS native formats (txt, Avro, Seq, HBase) with Hive and MapReduce integration
• Complex data management: distributions, partitioning, sub-partitioning
• Multi-user platform: concurrency, resource queues, role-based security, data encryption, accessibility
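A minimal sketch of polymorphic storage, assuming hypothetical table and partition names: within one partitioned table, recent partitions stay row-oriented for fast loads while older partitions are column-oriented and compressed, using the Greenplum-style DDL that HAWQ inherits:

    -- Hypothetical example: hot data row-oriented, cold data columnar + compressed.
    CREATE TABLE orders (
        order_id   bigint,
        order_date date,
        amount     numeric(12,2)
    )
    WITH (APPENDONLY=true, ORIENTATION=row)
    DISTRIBUTED BY (order_id)
    PARTITION BY RANGE (order_date)
    (
        PARTITION archive START (DATE '2013-01-01') END (DATE '2014-01-01')
            WITH (APPENDONLY=true, ORIENTATION=column, COMPRESSTYPE=zlib),
        PARTITION current START (DATE '2014-01-01') END (DATE '2015-01-01')
    );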
HAWQ Benefits…
• Out-of-the-box SQL for Hadoop – SQL adoption versus learning MapReduce programming
• PXF external tables providing SQL access to Hadoop – HDFS, HBase, Hive or any data type
• Broad data access, integration and portability
• Performance and scalability:
  – Parallel everything
  – Dynamic Pipelining
  – High-speed interconnect
  – Optimized HDFS access with libhdfs3
  – Co-located joins and data locality
  – Partition elimination
  – Higher cluster utilization
  – Concurrency control via resource queues (see the sketch below)
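A hedged sketch of how concurrency control is typically expressed in the Greenplum/HAWQ lineage – resource queues that cap concurrent statements per role (queue and role names are made up here):

    -- Hypothetical queue limiting ad hoc users to 5 concurrent statements.
    CREATE RESOURCE QUEUE adhoc_queue WITH (ACTIVE_STATEMENTS=5);

    -- Assign an existing role to the queue so its queries are throttled.
    ALTER ROLE analyst RESOURCE QUEUE adhoc_queue;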
HAWQ and Hadoop Native File Formats
Read/write through PXF (Pivotal eXtension Framework):
• HDFS flat files: CSV, delimited, …
• Hive
• HBase (with predicate push-down)
• Avro, RCFile, SequenceFile
• Open, extendable API – additional connectors available on GitHub: Accumulo, JSON, …
Pivotal eXtension Framework (PXF)
• An advanced version of GPDB external tables
• Enables combining HAWQ data and Hadoop data in a single query
• Supports connectors for HDFS, HBase and Hive (and GemFire XD)
• Provides an extensible framework API to enable custom connectors
• Available on GitHub: JSON, Accumulo, …
• HAWQ MapReduce RecordReader
The Pivotal HD extension framework sits between HAWQ and HDFS, HBase and Hive; a hedged external-table example follows.
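A sketch of a PXF external table, assuming a hypothetical HDFS path, NameNode host/port and the HdfsTextSimple profile (the exact LOCATION syntax varies by release):

    -- Hypothetical: expose CSV files in HDFS to HAWQ SQL via PXF.
    CREATE EXTERNAL TABLE ext_clickstream (
        user_id  bigint,
        url      text,
        click_ts timestamp
    )
    LOCATION ('pxf://namenode:50070/data/clickstream/*.csv?Profile=HdfsTextSimple')
    FORMAT 'TEXT' (DELIMITER ',');

    -- Combine PXF (Hadoop) data with a native HAWQ table in a single query.
    SELECT c.user_id, count(*) AS clicks
    FROM   ext_clickstream c
    JOIN   customers cu ON cu.user_id = c.user_id
    GROUP  BY c.user_id;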
Deep Scalable Analytics
Provides data-parallel implementations of mathematical, statistical and machine-learning methods for structured and unstructured data.
MADlib – In-DB Deep Analytics
• MADlib is the open-source analytical library developed by Pivotal in conjunction with researchers from UC Berkeley, University of Wisconsin-Madison, University of Florida and Johns Hopkins University, and integrated into HAWQ.
• It enables you to run deep analytics such as linear regression, k-means clustering, naïve Bayes classification and many other well-known methods directly inside the HAWQ/Hadoop cluster, as in the sketch below.
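MADlib's in-database functions are invoked as plain SQL; a hedged sketch with hypothetical table and column names, using the linear-regression trainer madlib.linregr_train:

    -- Train a linear regression model entirely inside the cluster:
    -- price modeled against tax, bathrooms and size (hypothetical houses table).
    SELECT madlib.linregr_train(
        'houses',                      -- source table
        'houses_model',                -- output table for the fitted model
        'price',                       -- dependent variable
        'ARRAY[1, tax, bath, size]'    -- independent variables (1 = intercept)
    );

    -- Inspect the fitted coefficients and R-squared.
    SELECT coef, r2 FROM houses_model;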
Execution in Database
• All data stays in the database: R objects merely point to database objects
• All model estimation and heavy lifting is done in the database by MADlib
• The R → SQL translation is done in the R client (PivotalR)
• Only strings of SQL and model output are transferred across ODBC/DBI
Thank You
Supporting Slides
Pivotal HD Architecture
Pivotal HD Enterprise bundles:
• Apache components: HDFS, MapReduce, YARN (resource management & workflow), HBase, Pig, Hive, Mahout, Sqoop, Flume, Zookeeper, Oozie
• Pivotal additions: Command Center (configure, deploy, monitor, manage), Spring XD, Spring, Virtualization Extensions, GraphLab, Open MPI
• HAWQ – Advanced Database Services: ANSI SQL + analytics, Xtension Framework, Catalog Services, Query Optimizer, Dynamic Pipelining, MADlib algorithms
• GemFire XD – Real-Time Database Services: ANSI SQL + in-memory, distributed in-memory store, query and transactions, ingestion processing, Hadoop driver (parallel with compaction)
Committed to Open Source
• Pivotal has signed the Apache CCLA (July 17, 2013)
• Contributing to Apache Hadoop (a Pig patch, Hadoop Virtualization Extensions)
• Integrating with other open source projects
Pivotal is a major contributor to multiple open source projects.
Greenplum Database and HAWQ
HAWQ Evolved From…
• Greenplum database re-platformed on Hadoop/HDFS
• HAWQ provides all major features found in the Greenplum database:
  – SQL completeness: 2003 extensions
  – JDBC compliant
  – Robust query optimizer
  – Row- or column-oriented table storage
  – Parallel loading and unloading
  – Distributions
  – Multi-level partitioning (see the DDL sketch below)
  – High-speed data redistribution
  – Views
  – External tables
  – Compression
  – Resource management
  – Security
  – Authentication
  – Management and monitoring
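As an illustrative sketch (hypothetical table), distributions and partitioning are declared directly in the DDL, just as in Greenplum:

    -- Hash-distribute on user_id across segments; range-partition by month.
    CREATE TABLE events (
        event_id   bigint,
        user_id    bigint,
        event_date date,
        payload    text
    )
    DISTRIBUTED BY (user_id)
    PARTITION BY RANGE (event_date)
    (
        START (DATE '2014-01-01') INCLUSIVE
        END   (DATE '2015-01-01') EXCLUSIVE
        EVERY (INTERVAL '1 month')
    );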
HAWQ
• GPDB on HDFS
• Not shared-nothing: it has a distributed file system (HDFS), so nodes can access shards of data on other nodes
• Built for large I/O – append-only, write-once, read-many
• Segments are stateless – HA is one of the main drivers towards HDFS
(Diagram: one HDFS NameNode coordinating several HDFS DataNodes.)
HAWQ Features
• HAWQ provides all major features found in the Greenplum database that can be supported on Hadoop/HDFS, including:
  – Row- or column-oriented table storage
  – Distributions
  – Partitioning
  – Views
  – External tables
• Using some features without understanding their implications on HDFS may result in problems
Architectural Differences from GPDB
• Stateless segment hosts
  – Segments do not know what is visible or aborted in their physical data
  – Segments do not know what columns are in a table
• The HA model deviates from a shared-nothing environment
  – If a segment is down, simply read from the replica in HDFS
  – No lengthy failover process
• The HDFS design doesn't lend itself to local transaction management
  – Frequent, small bursts of I/O on HDFS perform poorly
Architectural Implications of Using HDFS
• To re-platform GPDB on HDFS, segment workers had to be simplified (made dumber)
  – GPDB segment workers had their own copies of metadata, transaction management and local storage
• Heap storage in GPDB requires the database to modify tuples on disk
  – HDFS is append-only, therefore heap storage cannot work on DataNodes
  – Catalog tables require 100% heap storage, so segment servers cannot keep a local copy of the catalog
GPDB and HAWQ Differences at a Glance
Considering the architectural differences and implications of HDFS:
• No UPDATE and DELETE – TRUNCATE is supported (see the rewrite-pattern sketch below)
• No catalog on segment servers
• No local transaction management at the segment level
• No indexes
• No GPText
• Local storage exists on segments but is used only for temporary purposes
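Because UPDATE and DELETE are unavailable, changes are usually applied by rewriting data. A hedged sketch of one common pattern (hypothetical tables), using only operations HAWQ does support – CTAS, DROP/TRUNCATE and RENAME:

    -- "Update" rows by rebuilding the table with the new values applied.
    CREATE TABLE customers_new AS
    SELECT customer_id,
           CASE WHEN customer_id = 42 THEN 'inactive' ELSE status END AS status,
           email
    FROM   customers;

    -- Swap the rebuilt table into place.
    DROP TABLE customers;
    ALTER TABLE customers_new RENAME TO customers;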
HAWQ or Greenplum Database?

Feature | GPDB | HAWQ
Real-time random reads/writes | ✓ |
Large I/O, write once, read many | | ✓
Petabytes of data | | ✓
Hadoop/HDFS platform | | ✓
Updates | ✓ |
Deletes | ✓ |
Indexes | ✓ |
Row- or column-oriented table storage | ✓ | ✓
User-defined data distributions | ✓ | ✓
User-defined partitioning | ✓ | ✓
Resource management | ✓ | ✓
User-defined functions (UDFs) | ✓ | ✓
External tables | ✓ | ✓
GPText | ✓ |
MADlib algorithms | ✓ | ✓
HAWQ Storage Options
Define the Storage Model
HAWQ storage options:
• Row-oriented format
• Column-oriented format
• Parquet format
• Append-only
• Compression
Specify Using the WITH Clause
• The WITH clause can be used to set storage options for HAWQ tables
• You can also set storage parameters on a particular partition or subpartition by declaring the WITH clause in the partition specification

CREATE TABLE table_name ( … ) [ WITH ( storage_parameter=value [, ... ] ) ]

where storage_parameter is one of:

APPENDONLY={TRUE}
ORIENTATION={COLUMN|ROW|PARQUET}
COMPRESSTYPE={ZLIB|QUICKLZ|RLE_TYPE|SNAPPY|GZIP|NONE}
COMPRESSLEVEL={0-9}
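A minimal sketch (hypothetical table) combining these parameters into an append-only, column-oriented, compressed table:

    CREATE TABLE page_views (
        view_id   bigint,
        view_date date,
        url       text
    )
    WITH (APPENDONLY=true,
          ORIENTATION=column,
          COMPRESSTYPE=quicklz,
          COMPRESSLEVEL=1)
    DISTRIBUTED BY (view_id);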
Storage Considerations
• Use row-based tables when:
  – Incremental loads/inserts are performed
  – Selects against the table are wide
• Use column-based tables when:
  – Data is written once and read many times (not optimized for write operations)
  – Selects are narrow
Columnar Storage
For a table with columns A, B, C and rows 1–3:
• Row-oriented storage lays values out as A1 B1 C1 | A2 B2 C2 | A3 B3 C3
• Column-oriented storage lays values out as A1 A2 A3 | B1 B2 B3 | C1 C2 C3
• Reduces I/O – scans only the columns needed
• Reduces space – columnar data compresses better, with efficient type-specific encodings from storing values of the same primitive type together
Column-Oriented Considerations
• Consider the use cases for column-oriented storage:
  – More efficient I/O and storage
  – Not optimized for write operations
• Increases performance and reduces storage cost thanks to decreased I/O and good compression ratios
• Do not use columnar orientation on very wide tables
• If the partition granularity requirement is low, use row-based table orientation
Introducing Parquet
• Columnar storage open source file format for Hadoop
• Began as a joint effort between Twitter and Cloudera
• Stores nested data structures in a flat columnar format using a technique based on Google’s Dremel ColumnIO format
  – Dremel is an ad hoc query system for analysis of read-only nested data
• Allows for better compression because the data is more homogeneous
• Reduces I/O because you can scan only a subset of the columns while reading the data
Parquet Model
• Minimalistic model
• Represents nesting using groups of fields
• Represents repetition using repeated fields
• No complex data types – mapped to a combination of repeated fields and groups
• Schema data structure:
  – The root of the schema is a group of fields called a message
  – Each field has three attributes: repetition, type and name
  – Repetition can be required, optional or repeated
  – The type of a field is either a group or a primitive type
Why HAWQ Parquet?
• Partially solves the HDFS file-count limitation
• Performance is better for large, I/O-bound datasets compared with append-only tables, due to column-level compression
• All HAWQ complex and rich data types are supported
• External systems can directly read HAWQ data on HDFS thanks to the open file format
HAWQ Parquet Feature Overview
• Parquet table read and write support
• Partitioned table support
• Compression support
• Complex and rich data type support
• MapReduce InputFormat for Parquet tables
HAWQ Parquet Design
• Does not change anything in the open source Parquet format
  – Both the data file and the metadata file are in the open source Parquet format
• The data file is organized into row groups → column chunks → column pages
• Appends to a file and adds a new footer at the end of each load/insert
• Accumulative inserts can be slow, due to the metadata overhead
• The design point for Parquet is LARGE writes
DDL and DML Support
• Most DDL and DML operations are supported for Parquet tables – usage is similar to append-only tables (see the sketch below)
• Supports table-level compression
• Supports partitioning
• Supports all HAWQ data types except arrays and UDTs
• Supports ALTER TABLE for partition operations
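A hedged sketch of the Parquet variant (hypothetical tables; the option spelling follows the WITH-clause parameters shown earlier):

    -- Parquet-backed HAWQ table with Snappy compression; queried like any other table.
    CREATE TABLE sales_parquet (
        sale_id  bigint,
        sale_ts  timestamp,
        amount   numeric(12,2)
    )
    WITH (APPENDONLY=true,
          ORIENTATION=parquet,
          COMPRESSTYPE=snappy)
    DISTRIBUTED BY (sale_id);

    -- Loads follow the large-write design point: bulk INSERT … SELECT rather than trickle inserts.
    INSERT INTO sales_parquet
    SELECT sale_id, sale_ts, amount FROM sales_staging;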
Business Data Lake Use Cases
• Native Hadoop: ETL/ELT offload; batch/archive; external data merged with enterprise data; hot-warm-cold data; EDW cost savings
• SQL on Hadoop: unstructured data analysis; machine-generated data, sensors, logs, social media; Grid IQ – GE electric grid prediction
• MPP analytics: ad hoc queries, machine learning, predictive analytics, advanced data science; credit card fraud detection, recommendation engines
• In-memory: real-time, low-latency, high-concurrency OLTP; real-time processing of bookings at China Railways, Southwest Airlines
Business Data Lake: all data, ETL/ELT, local, social, mobile – batch, interactive and real-time
A NEW PLATFORM FOR A NEW ERA