A NEW PLATFORM FOR A NEW ERA


Pivotal Confidential–Internal Use Only

Pivotal and Big Data

Les Klein, Director Field Engineering EMEA

@LesKlein Pivotal.io


Agenda

• Pivotal & Myself
• What Pivotal does
• Why is this important?
• A closer look at HAWQ



Pivotal At-a-Glance

• New Independent Venture: Spun out & jointly owned by EMC & VMware
• Top Talent: ~1,900 employees
• Proven Leadership: Paul Maritz, CEO
• Global Customer Validation: 1,000+ Tier-1 Enterprise Customers
• Strategic Backing: $105M investment by GE
• Bold Vision: New platform for a new era, focused on the intersection of Big Data, PaaS, and Agile Software Development



The Shift to the 3rd Generation Platform

• 1st, MAINFRAME: Automation of financial accounts. Mainframes; ISAM.
• 2nd, CLIENT-SERVER & WEB: Automation of most paper processes: ERP, CRM, Email, … Relational Databases; Minis & PCs.
• 3rd, CLOUD: New Experiences, New Biz Models, New Needs (“IoT”) pioneered by new Consumer Internet giants; requires new Application & Data Fabrics. Cloud-Enabled Resources; New Data-fabrics.

$1+ Trillion of IT Spend over coming 10 years


What Have Pioneers Been Able To Do?


It’s More Than Just Hadoop: Pivotal’s Full Approach to Big Data


How Pivotal Accelerates Value Creation

SOLVE THE BIG DATA UTILITY GAP

• 70% of data generated by customers
• 80% of data being stored
• 3% being prepared for analysis
• 0.5% being analyzed
• <0.5% being operationalized

[Chart: First Movers, Smart Enterprises, Average Enterprises; value labels ~20X $2.9B, ~30X $40B, ~7X $290B, ~20X $120B]


The Journey to Technology Leadership

• COLLECT: Business Data Lake (Store everything)
• ANALYZE: Big Data Analytics (Generate Insights)
• DEVELOP: Data-Driven Applications (Operationalize)
• INNOVATE: Agile Enterprise (Iterate Rapidly)

[Diagram labels: PDL Data Science, Pivotal Labs Agile, Pivotal CF Services, PDL Data Architecture, Agile Development, Big Data Predictive Analytics, Enterprise PaaS]


Data Driven: Harder than it Sounds

[Diagram, repeated for each of the three tiers below: Operationalize, Ingest, Distill, Interface, Process; Analytical, Transactional]

• Real Time: Predictive Call Routing, Fraud Prediction, Dynamic Pricing, Re-Marketing, Stream Analytics
• Near Real Time: Analytic Model Designs, Transaction Analysis, Trend Analysis
• Batch: ETL, Archive, Trending, Monthly and Weekly Jobs


Data Driven: Impossible in Silos

Finance Manufacturing Marketing IT

Data Growth Over 60% Floods These Silos


Business Data Lake Architecture

[Diagram labels]
• System monitoring, System management, Workflow Management
• Processing Tier
• HDFS storage: Unstructured and structured data
• In-memory
• MPP database
• Unified Sources, Flexible Actions
• Ingestion: Real-time, Micro batch, Batch
• Insights: Real-time, Interactive, Batch


Business Data Lake Architecture

[Same diagram labels as the previous slide, with Pivotal products added]
• PHD: HDFS storage, unstructured and structured data
• GemfireXD/Gemfire
• HAWQ/GPDB


Pivotal BDS Subscription Model…

Pivotal Component    Application Type
Greenplum DB         Database
Pivotal HD           Hadoop Distributed File System
HAWQ                 Parallel Query Engine
GemFire XD           In-Memory Data Grid for Hadoop
SQLFire              In-Memory Data Grid with SQL Layer
GemFire              In-Memory Data Grid

[Other table columns: SKU (1 to 6), Pricing Metric: Unit of Measure, Price]


Pivotal HD Value

• Cost-based Query Optimizer
• ANSI SQL Compliant
• Linear, incremental scalability on COTS hardware
• Deep Analytic OLAP Queries
• Petabyte Data Storage & Management
• Low latency updates and transactions
• Partitioned Events in situ w/ data
• Active-active deployment across WAN

[Diagram labels: OLAP, OLTP, SQL, HDFS]


How Does this Work in Practice?

Build what you need
• Use insights to iteratively improve your product

Analyze Anything
• Cleanse, organize, and manage your data lake
• Make the right tools available
• Use the resources wisely to compute, analyze, and understand data

Store Everything
• Obsessively collect data
• Keep it forever
• Put the data in one place



[Chart: Value of Data ($) versus Time (µs, ms, s, hour, day, month, year, years+); regions: Real-Time, Predictive, Data Warehouse / BI]

MPP Database as a technology innovation to process data more effectively and at optimized cost


Data Insights are Fueling the Future

The more data you have, the easier it is to solve a problem, pursue an opportunity, and build smarter software.

[Chart: ACCURACY versus DATA; label: Complex Algorithms]


• Oil & Gas: $90bn in CapEx Savings
• Power: $66bn in Fuel Savings
• Healthcare: $63bn in Efficiency Gains
• Aviation: $30bn in Fuel Savings
• Rail: $27bn in Operations Savings



HAWQ – Hadoop With Queries

• Fastest SQL query engine on Hadoop
• 100% SQL Compliant
  – E.g. TPC-DS benchmark (SIGMOD report)
• Proven with 10 years of technology innovation
• Dynamic Pipelining technology delivers 100X performance improvement with mature SQL query optimization and powerful analytics
• Scatter/Gather data loading, polymorphic storage, third-party tools certification and language support


HAWQ Simply

Greenplum database re-platformed on Hadoop/HDFS

[Diagram labels]
• Accessibility; ODBC/JDBC Driver L3,4
• SQL Engine: ANSI SQL 2003/2011 Support; Robust Query Optimizer; Cost-Based Query Optimization
• Storage Options; Row/Columnar Storage; Built-in Compression; Polymorphic Storage
• Complex Data Management; Distributions; Partitioning; Sub-Partitioning; Parallel Loading/Unloading
• Multi-User Platform; Users; Concurrency; Resource Queues; Role-Based Security; Data Encryption
• CPU, Mem, Disk
• Extendable…; HDFS Native Formats; txt, Avro, Seq, HBase, Hive; MapReduce Integration


HAWQ Benefits…

• Out of the box SQL for Hadoop
  – SQL adoption versus learning MapReduce programming
• PXF External Tables providing SQL access to Hadoop
  – HDFS, HBase, Hive or any data types
• Broad data access, integration and portability
• Performance and Scalability
  – Parallel Everything
  – Dynamic Pipelining
  – High Speed Interconnect
  – Optimized HDFS access with libhdfs3
  – Co-Located Joins & Data Locality
  – Partition Elimination
  – Higher Cluster Utilization
  – Concurrency Control
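Two of the performance items above, co-located joins and partition elimination, depend on how tables are declared. The following is a minimal sketch, assuming HAWQ accepts the Greenplum-style DISTRIBUTED BY and PARTITION BY clauses (the deck states that distributions and multi-level partitioning carry over from Greenplum). All table and column names are hypothetical.

```sql
-- Hypothetical tables: both are hash-distributed on customer_id, so rows
-- with the same key land on the same segment and the join below does not
-- need to move data between segments (co-located join).
CREATE TABLE customers (
    customer_id bigint,
    name        text
)
DISTRIBUTED BY (customer_id);

CREATE TABLE orders (
    order_id    bigint,
    customer_id bigint,
    order_date  date,
    total       numeric(12,2)
)
DISTRIBUTED BY (customer_id)
PARTITION BY RANGE (order_date)
(
    START (date '2014-01-01') END (date '2015-01-01')
    EVERY (INTERVAL '1 month')
);

-- The filter on the partition key lets the planner skip every monthly
-- partition except June 2014 (partition elimination).
EXPLAIN
SELECT c.name, sum(o.total) AS june_total
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id
WHERE o.order_date >= date '2014-06-01'
  AND o.order_date <  date '2014-07-01'
GROUP BY c.name;
```

The EXPLAIN output should indicate which partitions are scanned and whether a redistribution step appears; the exact plan text depends on the HAWQ version.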


HAWQ and Hadoop Native File Formats

Read/Write

PXF {Pivotal eXtension Framework}
• HDFS Flat Files, CSV, Delimited, …
• Hive
• HBase {predicate push-down}
• Avro, RCFile, SeqFile
• Open extendable API
• Available on Github: Accumulo, JSON, …


Pivotal eXtension Framework (PXF)

• An advanced version of GPDB external tables
• Enables combining HAWQ data and Hadoop data in a single query
• Supports connectors for HDFS, HBase and Hive (and GFXD)
• Provides an extensible framework API to enable custom connectors
• Available on Github: JSON, Accumulo, …
• HAWQ MapReduce RecordReader

[Diagram: PIVOTAL-HD EXTENSION FRAMEWORK over HDFS, HBase, Hive]
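As a rough illustration of combining HAWQ data and Hadoop data in a single query, the sketch below defines a PXF external table over delimited files in HDFS and joins it to a regular HAWQ table. The host, port, path and all table names are placeholders, and the profile name is an assumption: the LOCATION options differ between PXF releases, so check the documentation for the installed version.

```sql
-- Placeholder host, port and path. 'HdfsTextSimple' is assumed to be the
-- PXF profile for plain delimited text files in this release.
CREATE EXTERNAL TABLE ext_clicks (
    customer_id bigint,
    url         text,
    clicked_at  timestamp
)
LOCATION ('pxf://namenode.example.com:50070/data/clicks?PROFILE=HdfsTextSimple')
FORMAT 'TEXT' (DELIMITER ',');

-- customers is a hypothetical native HAWQ table. The join mixes data
-- stored by HAWQ with files read from HDFS through PXF.
SELECT c.name, count(*) AS clicks
FROM customers c
JOIN ext_clicks e ON e.customer_id = c.customer_id
GROUP BY c.name
ORDER BY clicks DESC
LIMIT 10;
```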


Deep Scalable Analytics

Provides data-parallel implementations of mathematical, statistical and machine-learning methods for structured and unstructured data.


MADlib – In-DB deep analytics

• MADlib is the open-source analytical library developed by Pivotal in conjunction with researchers from UC Berkeley, University of Wisconsin-Madison, University of Florida and Johns Hopkins University, and integrated into HAWQ.
• Enables you to run deep analytics such as linear regression, k-means clustering, naïve Bayes classification and many other well-known methods directly inside the HAWQ/Hadoop cluster.
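To make the in-database claim concrete, here is a small sketch of a linear regression run through MADlib's SQL interface. It assumes a MADlib 1.x installation in a schema named madlib; the houses table and its columns are hypothetical, and the data load is omitted.

```sql
-- Hypothetical training data: one row per house.
CREATE TABLE houses (
    id    int,
    tax   double precision,
    bath  double precision,
    size  double precision,
    price double precision
)
DISTRIBUTED BY (id);

-- Load rows into houses before training (omitted here).

-- Fit price against an intercept plus tax, bath and size, inside the cluster.
-- Arguments: source table, output table, dependent variable,
-- independent variables (as an array expression).
SELECT madlib.linregr_train(
    'houses',
    'houses_linregr',
    'price',
    'ARRAY[1, tax, bath, size]'
);

-- The fitted model is stored as a table; coef holds the coefficients.
SELECT coef, r2 FROM houses_linregr;
```

Only the function call and the small model table cross the client connection; the training data stays where it is stored.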


Execution in Database

• All data stays in DB: R objects merely point to DB objects
• All model estimation and heavy lifting done in DB by MADlib
• R → SQL translation done in the R client
• Only strings of SQL and model output transferred across ODBC/DBI

[Diagram: PivotalR translates R → SQL; SQL to execute MADlib and model output travel over ODBC/DBI]


Thank You


Supporting Slides


Pivotal HD Architecture

[Diagram labels; the legend distinguishes Apache and Pivotal components]
• Pivotal HD Enterprise
• HDFS; HBase; Pig, Hive, Mahout; MapReduce; Sqoop; Flume
• Resource Management & Workflow; Yarn; Zookeeper; Oozie
• Command Center: Configure, Deploy, Monitor, Manage
• Spring XD; Spring; Virtual Extensions; Graphlab, Open MPI
• HAWQ – Advanced Database Services: Xtension Framework, Catalog Services, Query Optimizer, Dynamic Pipelining, ANSI SQL + Analytics, MADlib Algorithms
• GemFire XD – Real-Time Database Services: Distributed In-memory Store, Query Transactions, Ingestion Processing, Hadoop Driver – Parallel with Compaction, ANSI SQL + In-Memory


Committed to Open Source

• Pivotal has signed Apache CCLA (July 17, 2013)
• Contributing to Apache Hadoop (Pig patch, Hadoop Virtualization Extensions)
• Integrating with other Open Source projects

Pivotal is a major contributor to multiple open source projects


Greenplum Database and HAWQ


HAWQ Evolved From…

• Greenplum database re-platformed on Hadoop/HDFS
• HAWQ provides all major features found in Greenplum database
  – SQL Completeness: 2003 Extensions
  – JDBC Compliant
  – Robust Query Optimizer
  – Row or Column-Oriented Table Storage
  – Parallel Loading and Unloading
  – Distributions
  – Multi-level Partitioning
  – High speed data redistribution
  – Views
  – External Tables
  – Compression
  – Resource Management
  – Security
  – Authentication
  – Management and Monitoring


HAWQ

• GPDB on HDFS
• Not shared-nothing: it has a distributed file system (HDFS)
  – Nodes can access shards of data on other nodes
• Built for large I/O, append-only, write-once, read-many
• Segments are stateless
  – HA is one of the main drivers towards HDFS

[Diagram: one HDFS NameNode and three HDFS DataNodes]


HAWQ Features

• HAWQ provides all major features found in Greenplum database that can be supported in Hadoop/HDFS, including
  – Row or Column-oriented table storage
  – Distributions
  – Partitioning
  – Views
  – External tables
• Using some features without understanding their implications in HDFS may result in problems


Architectural Differences from GPDB

• Stateless Segment Hosts
  – Segments do not know what is visible or aborted in their physical data
  – Segments do not know what columns are in a table
• HA model deviates from shared nothing environment
  – If a segment is down, simply read from the replica in HDFS
  – No lengthy failover process
• HDFS design doesn’t lend itself to local transaction management
  – Frequent, small bursts of I/O on HDFS perform poorly


Architectural Implications of Using HDFS

• To re-platform GPDB on HDFS, segment workers had to be simplified (or made dumber)
  – GPDB segment workers had their own copies of metadata, transaction management and local storage
• Heap storage in GPDB requires the database to make modifications to tuples on disk
  – HDFS is append-only, therefore heap storage cannot work on DataNodes
  – Catalog tables require 100% heap storage, so segment servers cannot have a local copy of the catalog


GPDB and HAWQ Differences at a Glance

Considering the architectural differences and implications of HDFS…

• No Update and Delete
  – Truncate is supported
• No catalog on segment servers
• No local transaction management at the segment level
• No indexes
• No GPText
• Local storage exists on segments but is used for temporary purposes
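Because UPDATE and DELETE are unavailable, a change to existing rows has to be expressed as a rewrite. The sketch below shows one possible workaround, not a pattern taken from the deck: build a corrected copy, TRUNCATE the original, and reload it. The orders table, its columns and the status values are hypothetical.

```sql
-- Goal: drop cancelled orders and fix a misspelled status value
-- without UPDATE or DELETE.

-- Step 1: write the corrected rows to a scratch table.
CREATE TABLE orders_rewritten AS
SELECT order_id,
       customer_id,
       CASE WHEN status = 'shiped' THEN 'shipped' ELSE status END AS status,
       total
FROM orders
WHERE status <> 'cancelled'   -- note: rows with a NULL status are also dropped
DISTRIBUTED BY (order_id);

-- Step 2: empty the original (TRUNCATE is supported) and reload it.
TRUNCATE TABLE orders;
INSERT INTO orders (order_id, customer_id, status, total)
SELECT order_id, customer_id, status, total
FROM orders_rewritten;

-- Step 3: remove the scratch copy.
DROP TABLE orders_rewritten;
```

On a partitioned table the same idea can be limited to the affected partitions, which keeps the rewrite proportional to the data that actually changed.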


HAWQ or Greenplum Database?

Feature                                   GPDB   HAWQ
Real time random read/writes              Yes    -
Large I/O write once, read many           -      Yes
Petabytes of data                         -      Yes
Hadoop/HDFS platform                      -      Yes
Updates                                   Yes    -
Deletes                                   Yes    -
Indexes                                   Yes    -
Row or columnar oriented table storage    Yes    Yes
User Defined Data Distributions           Yes    Yes
User Defined Partitioning                 Yes    Yes
Resource Management                       Yes    Yes
User Defined Functions (UDFs)             Yes    Yes
External Tables                           Yes    Yes
GPText                                    Yes    -
MADLib Algorithms                         Yes    Yes


HAWQ Storage Options


Define the Storage Model

HAWQ STORAGE OPTIONS
• Row oriented format
• Column oriented format
• Parquet format
• Append only
• Compression


Specify Using the WITH Clause

• The WITH clause can be used to set storage options for HAWQ tables
• You can also set storage parameters on a particular partition or subpartition by declaring the WITH clause in the partition specification

CREATE TABLE table_name ( … ) [ WITH ( storage_parameter=value [, ... ] ) ]

• Where storage_parameter is:

APPENDONLY={TRUE}
ORIENTATION={COLUMN|ROW|PARQUET}
COMPRESSTYPE={ZLIB|QUICKLZ|RLE_TYPE|SNAPPY|GZIP|NONE}
COMPRESSLEVEL={0-9}
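The synopsis above can be read more easily with concrete statements. This is a sketch only: the table and column names are hypothetical, the DISTRIBUTED BY and PARTITION BY clauses assume Greenplum-style syntax, and which COMPRESSTYPE values are accepted for a given ORIENTATION may vary by HAWQ version.

```sql
-- Table-level storage options: append-only Parquet with Snappy compression.
CREATE TABLE sales_parquet (
    sale_id   bigint,
    sale_date date,
    amount    numeric(12,2)
)
WITH (APPENDONLY=TRUE, ORIENTATION=PARQUET, COMPRESSTYPE=SNAPPY)
DISTRIBUTED BY (sale_id);

-- Partition-level override: the table default is row-oriented, while the
-- older partition is stored column-oriented with zlib compression.
CREATE TABLE events (
    event_id   bigint,
    event_date date,
    payload    text
)
WITH (APPENDONLY=TRUE, ORIENTATION=ROW)
DISTRIBUTED BY (event_id)
PARTITION BY RANGE (event_date)
(
    PARTITION y2013 START (date '2013-01-01') END (date '2014-01-01')
        WITH (APPENDONLY=TRUE, ORIENTATION=COLUMN,
              COMPRESSTYPE=ZLIB, COMPRESSLEVEL=5),
    PARTITION y2014 START (date '2014-01-01') END (date '2015-01-01')
);
```

Mixing storage models across partitions of one table is what the deck calls polymorphic storage.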


Storage Considerations

• Use Row based tables when
  – Incremental loads/inserts are performed
  – Selects against table are wide
• Use Column based tables when
  – Write once, read many
    ▪ Not optimized for write operations
  – Selects are narrow
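The guidance above might translate into table definitions along these lines. The tables, columns and queries are hypothetical and only illustrate what "wide" and "narrow" selects look like; they are not taken from the deck.

```sql
-- Row-oriented: suits frequent incremental inserts and queries that
-- read most columns of each row.
CREATE TABLE web_sessions (
    session_id   bigint,
    user_id      bigint,
    started_at   timestamp,
    landing_page text,
    referrer     text,
    user_agent   text
)
WITH (APPENDONLY=TRUE, ORIENTATION=ROW)
DISTRIBUTED BY (session_id);

-- Wide select: every column is returned.
SELECT *
FROM web_sessions
WHERE started_at >= timestamp '2014-06-01 00:00:00';

-- Column-oriented: loaded once, then scanned by queries that touch
-- only a few columns.
CREATE TABLE page_views (
    view_id   bigint,
    user_id   bigint,
    viewed_at timestamp,
    url       text,
    load_ms   int,
    country   text
)
WITH (APPENDONLY=TRUE, ORIENTATION=COLUMN, COMPRESSTYPE=ZLIB)
DISTRIBUTED BY (view_id);

-- Narrow select: only country and load_ms have to be read from storage.
SELECT country, avg(load_ms) AS avg_load_ms
FROM page_views
GROUP BY country;
```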


Columnar Storage

Logical table:

A    B    C
A1   B1   C1
A2   B2   C2
A3   B3   C3

Row-Oriented Storage:    A1 B1 C1 A2 B2 C2 A3 B3 C3
Column-Oriented Storage: A1 A2 A3 B1 B2 B3 C1 C2 C3

• Reduces I/O
  – Scans only columns needed
• Reduces space
  – Columnar compresses better
  – Efficient type specific encodings by storing together values of the same primitive type


Column-Oriented Considerations

• Consider use cases for column oriented storage
  – More efficient I/O and storage
  – Not optimized for write operations
• Increases performance and reduces storage cost due to decreased I/O and good compression ratios
• Do not use columnar orientation on very wide tables
• If partition granularity requirement is low, use row based table orientation


Introducing Parquet

• Columnar storage open source file format for Hadoop
• Began as a joint effort between Twitter and Cloudera
• Stores nested data structures in a flat columnar format using a technique based on Google’s Dremel ColumnIO format
  – Dremel is an ad-hoc query system for analysis of read-only nested data
• Allows for better compression because data is more homogeneous
• Reduces I/O because you can scan only a subset of the columns while reading the data


Parquet Model

• Minimalistic model
• Represents nesting using groups of fields
• Represents repetition using repeated fields
• No complex data types
  – Mapped to a combination of repeated fields and groups
• Schema data structure
  – Root of the schema is a group of fields called a message
  – Each field has three attributes: repetition, type and name
  – Repetition can be required, optional or repeated
  – The type of a field is either a group or a primitive type


Why HAWQ Parquet?

• Partially solves the HDFS file number limitation
• Performance is better for large I/O bound datasets compared with Append Only tables
  – Due to column level compression
• All HAWQ complex and rich data types are supported
• External systems can directly read HAWQ data on HDFS due to open file format


HAWQ Parquet Feature Overview

• Parquet table read and write support
• Partitioned table support
• Compression support
• Complex and rich data type support
• MapReduce InputFormat for Parquet tables


HAWQ Parquet Design

• Do not change anything in open source Parquet format
  – Both the data file and metadata file are in the format of open source Parquet
• The data file is organized in dimensions of rowgroup -> column chunk -> column page
• Append to a file and add a new footer at the end of the load/insert
• Accumulative inserts can be slow
  – Due to the overhead of metadata
• Design point for Parquet is for LARGE writes
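In practice the "large writes" design point suggests loading Parquet tables in bulk rather than row by row. A sketch under that assumption follows; both tables are hypothetical, and the staging table is assumed to be loaded by some other process.

```sql
-- Hypothetical staging table (row-oriented) and Parquet target table.
CREATE TABLE staging_sales (
    sale_id   bigint,
    sale_date date,
    amount    numeric(12,2)
)
WITH (APPENDONLY=TRUE, ORIENTATION=ROW)
DISTRIBUTED BY (sale_id);

CREATE TABLE sales_history (
    sale_id   bigint,
    sale_date date,
    amount    numeric(12,2)
)
WITH (APPENDONLY=TRUE, ORIENTATION=PARQUET, COMPRESSTYPE=SNAPPY)
DISTRIBUTED BY (sale_id);

-- Preferred: one large INSERT ... SELECT, so the footer and other
-- metadata are written once per load.
INSERT INTO sales_history
SELECT sale_id, sale_date, amount
FROM staging_sales;

-- Discouraged for Parquet tables: many tiny inserts, each paying the
-- metadata overhead again.
INSERT INTO sales_history VALUES (1001, date '2014-06-01', 19.99);
INSERT INTO sales_history VALUES (1002, date '2014-06-01', 5.00);
```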


DDL and DML Support

• Most DDL and DML operations are supported for Parquet tables
  – Usage is similar to Append Only tables
• Supports Table level compression
• Supports partitioning
• Supports all HAWQ data types except arrays and UDT
• Supports ALTER TABLE for partition operations
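A partition operation on a Parquet table might look like the sketch below. It assumes Greenplum-style partition syntax carries over to HAWQ Parquet tables, and the table, column and partition names are hypothetical.

```sql
-- Partitioned Parquet table with table-level gzip compression.
CREATE TABLE sales_by_month (
    sale_id   bigint,
    sale_date date,
    amount    numeric(12,2)
)
WITH (APPENDONLY=TRUE, ORIENTATION=PARQUET, COMPRESSTYPE=GZIP)
DISTRIBUTED BY (sale_id)
PARTITION BY RANGE (sale_date)
(
    PARTITION jan2014 START (date '2014-01-01') END (date '2014-02-01'),
    PARTITION feb2014 START (date '2014-02-01') END (date '2014-03-01')
);

-- Add the next month, then retire the oldest one.
ALTER TABLE sales_by_month
    ADD PARTITION mar2014 START (date '2014-03-01') END (date '2014-04-01');

ALTER TABLE sales_by_month DROP PARTITION jan2014;
```

Dropping a partition is also a practical substitute for DELETE when data ages out by date.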


Business Data Lake Use Cases

• Native Hadoop: ETL/ELT Offload; Batch/Archive; External data merges with enterprise data; Hot-Warm-Cold Data, EDW Cost Savings
• SQL on Hadoop: Unstructured Data Analysis; Machine Generated Data, Sensors, Log, Social Media; Grid IQ – GE Electric Grid Prediction
• MPP Analytics: Ad Hoc; Machine Learning; Predictive Analytics; Advanced Data Science; Credit Card Fraud Detection, Recommendation Engines
• In Memory: Real-Time, Low Latency, High Concurrency, OLTP; Real-time processing of bookings at China Railways, Southwest Airlines

Business Data Lake: All Data, ETL/ELT, Local, Social, Mobile

Batch, Interactive, Real-Time
