Session presented at the Big Data Spain 2012 Conference, 16th Nov 2012, ETSI Telecomunicación, UPM, Madrid. www.bigdataspain.org More info: http://www.bigdataspain.org/es-2012/conference/coordinating-many-tools-of-big-data/alan-gates
Coordinating the Many Tools of Big Data
Page 1
Alan F. Gates
@alanfgates
Big Data Spain 2012
http://www.bigdataspain.org/
© Hortonworks 2012
Big Data = Terabytes, Petabytes, …
Page 2
Image Credit: Gizmodo
But It Is Also Complex Algorithms
Page 3
• An example from a talk by Jimmy Lin at Hadoop Summit 2012 on calculations Twitter is doing via UDFs (user defined functions) in Pig. This equation uses stochastic gradient descent to do machine learning across their data:

w(t+1) = w(t) − γ(t) ∇ℓ(f(x; w(t)), y)

where ℓ is the loss function, f the model, and γ(t) the learning rate at step t.
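The update rule can be sketched numerically. A minimal sketch, assuming a linear model f(x; w) = w·x and squared loss (the slide's equation leaves both abstract):

```python
# One stochastic gradient descent step, w(t+1) = w(t) - gamma(t) * grad,
# specialized to a linear model f(x; w) = w . x with squared loss
# l(p, y) = (p - y)^2 / 2, whose gradient w.r.t. w is (p - y) * x.
# The model and loss here are assumptions for illustration.

def sgd_step(w, x, y, gamma):
    """Return the updated weight vector after seeing one example (x, y)."""
    pred = sum(wi * xi for wi, xi in zip(w, x))   # f(x; w(t))
    err = pred - y                                # dl/dpred for squared loss
    return [wi - gamma * err * xi for wi, xi in zip(w, x)]

# Repeated steps on one example drive the prediction toward the label.
w = [0.0, 0.0]
for _ in range(100):
    w = sgd_step(w, [1.0, 2.0], 3.0, gamma=0.1)
```

In a Pig UDF this per-example update would run inside the cluster, close to the data, which is the point of the slide.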
Pre-Cloud: One Tool per Machine
Page 4
• Databases presented SQL or SQL-like paradigms for operating on data
• Other tools came in separate packages (e.g. R) or on separate platforms (e.g. SAS)
[Diagram: separate silos for Data Warehouse, Statistical Analysis, Cube/MOLAP, OLTP, and Data Mart]
Cloud: Many Tools One Platform
Page 5
• Users no longer want to be concerned with what platform their data is in – just apply the tool to it
• SQL no longer the only or primary data access tool
[Diagram: Data Warehouse, Statistical Analysis, Data Mart, Cube/MOLAP, and OLTP sharing one platform]
Upside - Pick the Right Tool for the Job
Page 6
Downside – Tools Don’t Play Well Together
Page 7
• Hard for users to share data between tools
  – Different storage formats
  – Different data models
  – Different user defined function interfaces
Downside – Wasted Developer Time
Page 8
• Wastes developer time, since each tool supplies redundant functionality
[Diagram: Pig and Hive stacks side by side, each with its own Parser, Optimizer, Physical Planner, and Executor; Hive additionally has a Metadata layer]
Downside – Wasted Developer Time
Page 9
[Same diagram as the previous slide, with the duplicated Parser, Optimizer, Physical Planner, and Executor layers highlighted as overlap]
Conclusion: We Need Services
Page 10
• We need to find a way to share services where we can
• Gives users the same experience across tools
• Allows developers to share effort when it makes sense
Hadoop = Distributed Data Operating System
Page 11
Service               | Hadoop Component              | Single Node Analogue
----------------------|-------------------------------|--------------------------------------------
Table management      | HCatalog                      | RDBMS
User access control   | Hadoop                        | /etc/passwd, file system permissions, etc.
Resource management   | YARN                          | Process management
Notification          | HCatalog                      | Signals, semaphores, mutexes
REST/Connectors       | HCatalog, Hive, HBase, Oozie  | Network layer
Batch data processing | Data Virtual Machine          | JVM

(Legend on the slide: exists / pieces exist in this component / to be built)
HCatalog – Table Management
Page 13
• Opens up Hive's tables to other tools inside and outside Hadoop
• Presents tools with a table paradigm that abstracts away storage details
• Provides a shared data model
• Provides a shared code path for data and metadata access
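The table paradigm can be illustrated with a toy catalog (this is not the real HCatalog API; the table name, fields, and format below are hypothetical):

```python
# Toy illustration of a shared metadata store: every tool resolves
# "where is the data and how do I read it?" through one code path,
# instead of encoding schema, location, and format in each application.

CATALOG = {  # hypothetical metadata; real HCatalog backs this with the Hive metastore
    "clicks": {
        "schema": [("user", "string"), ("url", "string"), ("ts", "int")],
        "location": "/warehouse/clicks",
        "format": "rcfile",
    }
}

def describe(table):
    """What any tool (Pig, Hive, MapReduce) would learn from the catalog."""
    meta = CATALOG[table]
    cols = ", ".join(f"{n}:{t}" for n, t in meta["schema"])
    return f"{table}({cols}) at {meta['location']} stored as {meta['format']}"
```

Because every tool asks the same catalog, changing a table's storage format is invisible to the scripts that read it.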
Data Access Without HCatalog
Page 14
[Diagram: Hive reads HDFS through its InputFormat/OutputFormat and SerDe and reaches the metastore through its metastore client; MapReduce uses its own InputFormat/OutputFormat and Pig its own Load/Store functions, with no path to the metastore]
Data & Metadata Access With HCatalog
Page 15
[Diagram: Hive is unchanged; MapReduce now reads through HCatInputFormat/HCatOutputFormat and Pig through HCatLoader/HCatStorer, all sharing the metastore and HDFS; external systems reach the metadata over REST]
Without HCatalog
Page 16

Feature       | MapReduce       | Pig                                            | Hive
--------------|-----------------|------------------------------------------------|------------------------------------------
Record format | Key-value pairs | Tuple                                          | Record
Data model    | User defined    | int, float, string, bytes, maps, tuples, bags  | int, float, string, maps, structs, lists
Schema        | Encoded in app  | Declared in script or read by loader           | Read from metadata
Data location | Encoded in app  | Declared in script                             | Read from metadata
Data format   | Encoded in app  | Declared in script                             | Read from metadata
With HCatalog
Page 17

Feature       | MapReduce + HCatalog                      | Pig + HCatalog                                 | Hive
--------------|-------------------------------------------|------------------------------------------------|------------------------------------------
Record format | Record                                    | Tuple                                          | Record
Data model    | int, float, string, maps, structs, lists  | int, float, string, bytes, maps, tuples, bags  | int, float, string, maps, structs, lists
Schema        | Read from metadata                        | Read from metadata                             | Read from metadata
Data location | Read from metadata                        | Read from metadata                             | Read from metadata
Data format   | Read from metadata                        | Read from metadata                             | Read from metadata
YARN – Resource Manager
Page 18
• Hadoop 1.0: HDFS plus MapReduce
• Hadoop 2.0: HDFS plus the YARN Resource Manager, an interface for developers to write parallel applications on top of the Hadoop cluster
• The Resource Manager provides:
  – applications a way to request resources in the cluster
  – allocation and scheduling of machine resources to the applications
• MapReduce is now an application provided inside YARN
• Other systems have been ported to YARN, such as Spark (a cluster computing system that focuses on in-memory operations) and Storm (streaming computation)
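The request/grant contract can be sketched with a toy scheduler. The class and method names below are illustrative, not the real YARN API, and the memory-only model is a simplification:

```python
# Toy model of YARN's contract: applications ask the resource manager for
# containers; it grants them only while the cluster has capacity, and
# reclaims capacity when a container is released.

class ResourceManager:
    def __init__(self, total_mem_mb):
        self.free = total_mem_mb

    def request_container(self, mem_mb):
        """Grant a container if memory is available, else refuse."""
        if mem_mb <= self.free:
            self.free -= mem_mb
            return True
        return False

    def release_container(self, mem_mb):
        self.free += mem_mb

rm = ResourceManager(total_mem_mb=4096)
granted = [rm.request_container(1024) for _ in range(5)]  # fifth request must wait
```

The real scheduler also weighs locality, queues, and fairness; the point here is only that MapReduce becomes one client of this interface among many.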
Architectural Comparison
Page 19
[Side-by-side architecture diagrams of Hadoop 1.0 and Hadoop 2.0]
Data Virtual Machine – Shared Batch Processing
Page 20
• Recall our previous diagram of Pig and Hive
[Diagram repeated: Pig and Hive stacks, each with its own Parser, Optimizer, Physical Planner, and Executor, plus Hive's Metadata layer; the duplicated layers marked as overlap]
A VM That Provides
Page 21
• Standard operators (equivalent of Java byte codes):
  – Project
  – Select
  – Join
  – Aggregate
  – Sort
  – …
• An optimizer that could
  – Choose the appropriate implementation of an operator based on physical data characteristics
  – Dynamically re-optimize the plan based on information gathered while executing the plan
• Shared execution layer
  – Can provide its own YARN application master and improve on the MapReduce paradigm for batch processing
• Shared User Defined Function (UDF) framework
  – User code works across systems
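The standard-operator idea can be sketched as composable relational primitives that any front end could compile to. This is a toy over in-memory lists of dicts, not a proposal for the VM's actual interfaces:

```python
# Minimal relational operators: project, select, and a hash join.
# Any front end (Pig, Hive, ...) could in principle lower its plans
# to primitives of this shape.

def project(rows, cols):
    return [{c: r[c] for c in cols} for r in rows]

def select(rows, pred):
    return [r for r in rows if pred(r)]

def join(left, right, key):
    index = {}
    for r in right:                    # build side, as in a hash join
        index.setdefault(r[key], []).append(r)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]

users = [{"id": 1, "name": "ana"}, {"id": 2, "name": "bo"}]
clicks = [{"id": 1, "url": "/a"}, {"id": 1, "url": "/b"}]
result = project(select(join(users, clicks, "id"),
                        lambda r: r["name"] == "ana"),
                 ["name", "url"])
```

An optimizer working over such operators could, for example, swap this hash join for a different join implementation without the front end knowing.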
Taking Advantage of YARN – MR*
Page 22
[Diagram: two chained MapReduce jobs; the first job's reduce output is written to HDFS and re-read by the second job's map tasks]
Taking Advantage of YARN – MR*
Page 23
[Same diagram, with a callout on the second job's map tasks: "Why do I need these maps?"]
Taking Advantage of YARN – MR*
Page 24
[Diagram: the two jobs fused, so the first job's reduce tasks feed the second job's reduce tasks directly]
• Removed an entire HDFS write/read cycle
• Still want to checkpoint sometimes
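The fusion can be illustrated with a toy in-process "MapReduce" (all names here are stand-ins, not Hadoop APIs): the output of the first reduce is piped straight into the next reduce's shuffle, skipping the HDFS round trip and the identity map stage in between.

```python
# Toy MR* pipeline: word count (reduce 1) feeding a group-by-count
# (reduce 2) without materializing the intermediate result.

from collections import defaultdict

def shuffle(pairs):
    """Group (key, value) pairs by key, as the shuffle phase does."""
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

def word_count(lines):
    mapped = [(w, 1) for line in lines for w in line.split()]
    counted = {k: sum(vs) for k, vs in shuffle(mapped).items()}       # reduce 1
    # MR* step: pipe reduce 1's output straight into reduce 2's shuffle,
    # where classic MapReduce would write to HDFS and re-read via maps.
    by_count = {k: sorted(vs) for k, vs in
                shuffle((c, w) for w, c in counted.items()).items()}  # reduce 2
    return counted, by_count

counts, grouped = word_count(["a b a", "b a"])
```

The checkpoint caveat from the slide survives in the real system: skipping the write is only safe if the fused pipeline can be re-run on failure.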
Taking Advantage of YARN – In Memory Data Transfer
Page 25
[Diagram: map tasks shuffling their output to reduce tasks]
Taking Advantage of YARN – In Memory Data Transfer
Page 26
[Same diagram, annotating the map-to-reduce shuffle files: "These are writes to disk"]
• Switching the shuffle to in-memory instead of on-disk transfer
  – Better performance
  – Data must still be spilled to disk for retry-ability and to handle memory overflow
  – Will benefit from stronger guarantees of simultaneous execution
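The spill behavior described above can be sketched as a buffer with a memory budget. Sizes, names, and the pickle-based spill format are illustrative, not how Hadoop implements its shuffle:

```python
# Keep shuffle pairs in memory up to a budget; spill to disk on overflow
# so the data survives memory pressure and can be replayed on retry.

import os
import pickle
import tempfile

class ShuffleBuffer:
    def __init__(self, max_in_memory):
        self.max = max_in_memory
        self.memory = []
        self.spills = []          # paths of spill files on disk

    def add(self, pair):
        self.memory.append(pair)
        if len(self.memory) >= self.max:
            self._spill()

    def _spill(self):
        fd, path = tempfile.mkstemp()
        with os.fdopen(fd, "wb") as f:
            pickle.dump(self.memory, f)
        self.spills.append(path)
        self.memory = []

    def drain(self):
        """Yield every buffered pair, merging disk spills with memory."""
        for path in self.spills:
            with open(path, "rb") as f:
                yield from pickle.load(f)
            os.remove(path)
        yield from self.memory

buf = ShuffleBuffer(max_in_memory=2)
for pair in [("a", 1), ("b", 2), ("c", 3)]:
    buf.add(pair)
drained = sorted(buf.drain())
```

In the fast path nothing spills and the transfer stays in memory; the disk path exists only for overflow and retries, which is the trade-off the slide names.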
On the Fly Optimization
Page 27
• Traditionally databases do all optimization up front based on statistics
  – But often there are no statistics in Hadoop
  – Languages like Pig Latin allow very long series of operations that make up-front estimates unrealistic
• Observation: as the system operates on the data it can gather basic statistics and change the subsequent operators based on this information
[Diagram: two MR jobs feeding a hash join]
On the Fly Optimization
Page 28
[Same diagram, with a run-time observation on the first job: "Output fits in memory"]
On the Fly Optimization
Page 29
[Diagram: the plan rewritten on the fly; the hash join becomes a map-side join, with the small output loaded into the distributed cache]
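The run-time decision sketched on these slides boils down to a size check after the first job finishes. The threshold and function names below are illustrative:

```python
# Pick the join strategy from a statistic gathered while executing the plan:
# if one side's output fits in the memory budget, broadcast it to every map
# task (map-side join) instead of shuffling both sides (hash join).

def plan_join(small_side_bytes, memory_budget_bytes):
    """Decide the join strategy from observed output size."""
    if small_side_bytes <= memory_budget_bytes:
        return "map-side join"    # load the small side into the distributed cache
    return "shuffle hash join"

# A 64 MB output against a 256 MB budget triggers the rewrite.
choice = plan_join(small_side_bytes=64 << 20, memory_budget_bytes=256 << 20)
```

The design choice mirrors the slide's argument: with no up-front statistics, the observed size of an intermediate result is the most reliable input the optimizer will ever get.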
Thank You Big Data Spain
Page 30