Session presented at the Big Data Spain 2012 Conference, 16th Nov 2012, ETSI Telecomunicación, UPM, Madrid. www.bigdataspain.org More info: http://www.bigdataspain.org/es-2012/conference/coordinating-many-tools-of-big-data/alan-gates
Coordinating the Many Tools of Big Data
Page 1
Alan F. Gates
@alanfgates
Big Data Spain 2012
http://www.bigdataspain.org/
© Hortonworks 2012
Big Data = Terabytes, Petabytes, …
Page 2
Image Credit: Gizmodo
But It Is Also Complex Algorithms
Page 3
• An example from a talk by Jimmy Lin at Hadoop Summit 2012 on calculations Twitter is doing via UDFs (user defined functions) in Pig. This equation uses stochastic gradient descent to do machine learning across their data:

w(t+1) = w(t) − γ(t) ∇ℓ(f(x; w(t)), y)

where ℓ is the loss function, f the model, and γ(t) the learning rate at step t.
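The update rule can be sketched numerically. A minimal sketch, assuming a linear model f(x; w) = w·x and squared loss (the slide's equation leaves both abstract):

```python
# One stochastic gradient descent step, w(t+1) = w(t) - gamma(t) * grad,
# specialized to a linear model f(x; w) = w . x with squared loss
# l(p, y) = (p - y)^2 / 2, whose gradient w.r.t. w is (p - y) * x.
# The model and loss here are assumptions for illustration.

def sgd_step(w, x, y, gamma):
    """Return the updated weight vector after seeing one example (x, y)."""
    pred = sum(wi * xi for wi, xi in zip(w, x))   # f(x; w(t))
    err = pred - y                                # dl/dpred for squared loss
    return [wi - gamma * err * xi for wi, xi in zip(w, x)]

# Repeated steps on one example drive the prediction toward the label.
w = [0.0, 0.0]
for _ in range(100):
    w = sgd_step(w, [1.0, 2.0], 3.0, gamma=0.1)
```

In a Pig UDF this per-example update would run inside the cluster, close to the data, which is the point of the slide.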
Pre-Cloud: One Tool per Machine
Page 4
• Databases presented SQL or SQL-like paradigms for operating on data
• Other tools came in separate packages (e.g. R) or on separate platforms (e.g. SAS)
[Diagram: separate silos for Data Warehouse, Statistical Analysis, Cube/MOLAP, OLTP, and Data Mart]
Cloud: Many Tools One Platform
Page 5
• Users no longer want to be concerned with what platform their data is in – just apply the tool to it
• SQL no longer the only or primary data access tool
[Diagram: Data Warehouse, Statistical Analysis, Data Mart, Cube/MOLAP, and OLTP sharing one platform]
Upside - Pick the Right Tool for the Job
Page 6
Downside – Tools Don’t Play Well Together
Page 7
• Hard for users to share data between tools
  – Different storage formats
  – Different data models
  – Different user defined function interfaces
Downside – Wasted Developer Time
Page 8
• Wastes developer time, since each tool supplies redundant functionality
[Diagram: Pig and Hive stacks side by side, each with its own Parser, Optimizer, Physical Planner, and Executor; Hive additionally has a Metadata layer]
Downside – Wasted Developer Time
Page 9
[Same diagram as the previous slide, with the duplicated Parser, Optimizer, Physical Planner, and Executor layers highlighted as overlap]
Conclusion: We Need Services
Page 10
• We need to find a way to share services where we can
• Gives users the same experience across tools
• Allows developers to share effort when it makes sense
Hadoop = Distributed Data Operating System
Page 11
Service               | Hadoop Component              | Single Node Analogue
----------------------|-------------------------------|--------------------------------------------
Table management      | HCatalog                      | RDBMS
User access control   | Hadoop                        | /etc/passwd, file system permissions, etc.
Resource management   | YARN                          | Process management
Notification          | HCatalog                      | Signals, semaphores, mutexes
REST/Connectors       | HCatalog, Hive, HBase, Oozie  | Network layer
Batch data processing | Data Virtual Machine          | JVM

(Legend on the slide: exists / pieces exist in this component / to be built)
HCatalog – Table Management
Page 13
• Opens up Hive's tables to other tools inside and outside Hadoop
• Presents tools with a table paradigm that abstracts away storage details
• Provides a shared data model
• Provides a shared code path for data and metadata access
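The table paradigm can be illustrated with a toy catalog (this is not the real HCatalog API; the table name, fields, and format below are hypothetical):

```python
# Toy illustration of a shared metadata store: every tool resolves
# "where is the data and how do I read it?" through one code path,
# instead of encoding schema, location, and format in each application.

CATALOG = {  # hypothetical metadata; real HCatalog backs this with the Hive metastore
    "clicks": {
        "schema": [("user", "string"), ("url", "string"), ("ts", "int")],
        "location": "/warehouse/clicks",
        "format": "rcfile",
    }
}

def describe(table):
    """What any tool (Pig, Hive, MapReduce) would learn from the catalog."""
    meta = CATALOG[table]
    cols = ", ".join(f"{n}:{t}" for n, t in meta["schema"])
    return f"{table}({cols}) at {meta['location']} stored as {meta['format']}"
```

Because every tool asks the same catalog, changing a table's storage format is invisible to the scripts that read it.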
Data Access Without HCatalog
Page 14
[Diagram: Hive reads HDFS through its InputFormat/OutputFormat and SerDe and reaches the metastore through its metastore client; MapReduce uses its own InputFormat/OutputFormat and Pig its own Load/Store functions, with no path to the metastore]
Data & Metadata Access With HCatalog
Page 15
[Diagram: Hive is unchanged; MapReduce now reads through HCatInputFormat/HCatOutputFormat and Pig through HCatLoader/HCatStorer, all sharing the metastore and HDFS; external systems reach the metadata over REST]
Without HCatalog
Page 16

Feature       | MapReduce       | Pig                                            | Hive
--------------|-----------------|------------------------------------------------|------------------------------------------
Record format | Key-value pairs | Tuple                                          | Record
Data model    | User defined    | int, float, string, bytes, maps, tuples, bags  | int, float, string, maps, structs, lists
Schema        | Encoded in app  | Declared in script or read by loader           | Read from metadata
Data location | Encoded in app  | Declared in script                             | Read from metadata
Data format   | Encoded in app  | Declared in script                             | Read from metadata
With HCatalog
Page 17

Feature       | MapReduce + HCatalog                      | Pig + HCatalog                                 | Hive
--------------|-------------------------------------------|------------------------------------------------|------------------------------------------
Record format | Record                                    | Tuple                                          | Record
Data model    | int, float, string, maps, structs, lists  | int, float, string, bytes, maps, tuples, bags  | int, float, string, maps, structs, lists
Schema        | Read from metadata                        | Read from metadata                             | Read from metadata
Data location | Read from metadata                        | Read from metadata                             | Read from metadata
Data format   | Read from metadata                        | Read from metadata                             | Read from metadata
YARN – Resource Manager
Page 18
• Hadoop 1.0: HDFS plus MapReduce
• Hadoop 2.0: HDFS plus the YARN Resource Manager, an interface for developers to write parallel applications on top of the Hadoop cluster
• The Resource Manager provides:
  – applications a way to request resources in the cluster
  – allocation and scheduling of machine resources to the applications
• MapReduce is now an application provided inside YARN
• Other systems have been ported to YARN, such as Spark (a cluster computing system that focuses on in-memory operations) and Storm (streaming computation)
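The request/grant contract can be sketched with a toy scheduler. The class and method names below are illustrative, not the real YARN API, and the memory-only model is a simplification:

```python
# Toy model of YARN's contract: applications ask the resource manager for
# containers; it grants them only while the cluster has capacity, and
# reclaims capacity when a container is released.

class ResourceManager:
    def __init__(self, total_mem_mb):
        self.free = total_mem_mb

    def request_container(self, mem_mb):
        """Grant a container if memory is available, else refuse."""
        if mem_mb <= self.free:
            self.free -= mem_mb
            return True
        return False

    def release_container(self, mem_mb):
        self.free += mem_mb

rm = ResourceManager(total_mem_mb=4096)
granted = [rm.request_container(1024) for _ in range(5)]  # fifth request must wait
```

The real scheduler also weighs locality, queues, and fairness; the point here is only that MapReduce becomes one client of this interface among many.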
Architectural Comparison
Page 19
[Side-by-side architecture diagrams of Hadoop 1.0 and Hadoop 2.0]
Data Virtual Machine – Shared Batch Processing
Page 20
• Recall our previous diagram of Pig and Hive
[Diagram repeated: Pig and Hive stacks, each with its own Parser, Optimizer, Physical Planner, and Executor, plus Hive's Metadata layer; the duplicated layers marked as overlap]
A VM That Provides
Page 21
• Standard operators (equivalent of Java byte codes):
  – Project
  – Select
  – Join
  – Aggregate
  – Sort
  – …
• An optimizer that could
  – Choose the appropriate implementation of an operator based on physical data characteristics
  – Dynamically re-optimize the plan based on information gathered while executing the plan
• Shared execution layer
  – Can provide its own YARN application master and improve on the MapReduce paradigm for batch processing
• Shared User Defined Function (UDF) framework
  – User code works across systems
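The standard-operator idea can be sketched as composable relational primitives that any front end could compile to. This is a toy over in-memory lists of dicts, not a proposal for the VM's actual interfaces:

```python
# Minimal relational operators: project, select, and a hash join.
# Any front end (Pig, Hive, ...) could in principle lower its plans
# to primitives of this shape.

def project(rows, cols):
    return [{c: r[c] for c in cols} for r in rows]

def select(rows, pred):
    return [r for r in rows if pred(r)]

def join(left, right, key):
    index = {}
    for r in right:                    # build side, as in a hash join
        index.setdefault(r[key], []).append(r)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]

users = [{"id": 1, "name": "ana"}, {"id": 2, "name": "bo"}]
clicks = [{"id": 1, "url": "/a"}, {"id": 1, "url": "/b"}]
result = project(select(join(users, clicks, "id"),
                        lambda r: r["name"] == "ana"),
                 ["name", "url"])
```

An optimizer working over such operators could, for example, swap this hash join for a different join implementation without the front end knowing.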
Taking Advantage of YARN – MR*
Page 22
[Diagram: two chained MapReduce jobs; the first job's reduce output is written to HDFS and re-read by the second job's map tasks]
Taking Advantage of YARN – MR*
Page 23
[Same diagram, with a callout on the second job's map tasks: "Why do I need these maps?"]
Taking Advantage of YARN – MR*
Page 24
[Diagram: the two jobs fused, so the first job's reduce tasks feed the second job's reduce tasks directly]
• Removed an entire HDFS write/read cycle
• Still want to checkpoint sometimes
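The fusion can be illustrated with a toy in-process "MapReduce" (all names here are stand-ins, not Hadoop APIs): the output of the first reduce is piped straight into the next reduce's shuffle, skipping the HDFS round trip and the identity map stage in between.

```python
# Toy MR* pipeline: word count (reduce 1) feeding a group-by-count
# (reduce 2) without materializing the intermediate result.

from collections import defaultdict

def shuffle(pairs):
    """Group (key, value) pairs by key, as the shuffle phase does."""
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

def word_count(lines):
    mapped = [(w, 1) for line in lines for w in line.split()]
    counted = {k: sum(vs) for k, vs in shuffle(mapped).items()}       # reduce 1
    # MR* step: pipe reduce 1's output straight into reduce 2's shuffle,
    # where classic MapReduce would write to HDFS and re-read via maps.
    by_count = {k: sorted(vs) for k, vs in
                shuffle((c, w) for w, c in counted.items()).items()}  # reduce 2
    return counted, by_count

counts, grouped = word_count(["a b a", "b a"])
```

The checkpoint caveat from the slide survives in the real system: skipping the write is only safe if the fused pipeline can be re-run on failure.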
Taking Advantage of YARN – In Memory Data Transfer
Page 25
[Diagram: map tasks shuffling their output to reduce tasks]
Taking Advantage of YARN – In Memory Data Transfer
Page 26
[Same diagram, annotating the map-to-reduce shuffle files: "These are writes to disk"]
• Switching the shuffle to in-memory instead of on-disk transfer
  – Better performance
  – Data must still be spilled to disk for retry-ability and to handle memory overflow
  – Will benefit from stronger guarantees of simultaneous execution
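The spill behavior described above can be sketched as a buffer with a memory budget. Sizes, names, and the pickle-based spill format are illustrative, not how Hadoop implements its shuffle:

```python
# Keep shuffle pairs in memory up to a budget; spill to disk on overflow
# so the data survives memory pressure and can be replayed on retry.

import os
import pickle
import tempfile

class ShuffleBuffer:
    def __init__(self, max_in_memory):
        self.max = max_in_memory
        self.memory = []
        self.spills = []          # paths of spill files on disk

    def add(self, pair):
        self.memory.append(pair)
        if len(self.memory) >= self.max:
            self._spill()

    def _spill(self):
        fd, path = tempfile.mkstemp()
        with os.fdopen(fd, "wb") as f:
            pickle.dump(self.memory, f)
        self.spills.append(path)
        self.memory = []

    def drain(self):
        """Yield every buffered pair, merging disk spills with memory."""
        for path in self.spills:
            with open(path, "rb") as f:
                yield from pickle.load(f)
            os.remove(path)
        yield from self.memory

buf = ShuffleBuffer(max_in_memory=2)
for pair in [("a", 1), ("b", 2), ("c", 3)]:
    buf.add(pair)
drained = sorted(buf.drain())
```

In the fast path nothing spills and the transfer stays in memory; the disk path exists only for overflow and retries, which is the trade-off the slide names.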
On the Fly Optimization
Page 27
• Traditionally databases do all optimization up front based on statistics
  – But often there are no statistics in Hadoop
  – Languages like Pig Latin allow very long series of operations that make up-front estimates unrealistic
• Observation: as the system operates on the data it can gather basic statistics and change the subsequent operators based on this information
[Diagram: two MR jobs feeding a hash join]
On the Fly Optimization
Page 28
[Same diagram, with a run-time observation on the first job: "Output fits in memory"]
On the Fly Optimization
Page 29
[Diagram: the plan rewritten on the fly; the hash join becomes a map-side join, with the small output loaded into the distributed cache]
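The run-time decision sketched on these slides boils down to a size check after the first job finishes. The threshold and function names below are illustrative:

```python
# Pick the join strategy from a statistic gathered while executing the plan:
# if one side's output fits in the memory budget, broadcast it to every map
# task (map-side join) instead of shuffling both sides (hash join).

def plan_join(small_side_bytes, memory_budget_bytes):
    """Decide the join strategy from observed output size."""
    if small_side_bytes <= memory_budget_bytes:
        return "map-side join"    # load the small side into the distributed cache
    return "shuffle hash join"

# A 64 MB output against a 256 MB budget triggers the rewrite.
choice = plan_join(small_side_bytes=64 << 20, memory_budget_bytes=256 << 20)
```

The design choice mirrors the slide's argument: with no up-front statistics, the observed size of an intermediate result is the most reliable input the optimizer will ever get.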
Thank You Big Data Spain
Page 30