Coordinating the Many Tools of Big Data, Alan F. Gates (@alanfgates), Strata 2013

Slides from Strata talk "Coordinating the Many Tools of Big Data"

Page 1: Strata feb2013

Coordinating the Many Tools of Big Data

Alan F. Gates

@alanfgates

Strata 2013

Page 2: Strata feb2013

Big Data = Terabytes, Petabytes, …

Page 2© Hortonworks 2013

Image Credit: Gizmodo

Page 3: Strata feb2013

But It Is Also Complex Algorithms

• An example from a talk by Jimmy Lin at Hadoop Summit 2012 on calculations Twitter is doing via UDFs in Pig. This equation uses stochastic gradient descent to do machine learning with their data:

$$w^{(t+1)} = w^{(t)} - \gamma^{(t)} \nabla \ell\big(f(x; w^{(t)}), y\big)$$
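The update rule above can be sketched in a few lines of Python. This is a minimal illustration, not the Pig UDF code from the cited talk: the linear model, the squared loss, the decaying step-size schedule, and all function names are assumptions chosen only to keep the example short.

```python
import random

def predict(x, w):
    """Linear model f(x; w) = w . x (an assumed model, chosen for brevity)."""
    return sum(wi * xi for wi, xi in zip(w, x))

def loss_gradient(x, y, w):
    """Gradient with respect to w of the assumed squared loss 0.5 * (f(x; w) - y)**2."""
    error = predict(x, w) - y
    return [error * xi for xi in x]

def step_size(t, gamma0=0.05, decay=0.01):
    """Hypothetical decaying learning-rate schedule gamma(t)."""
    return gamma0 / (1.0 + decay * t)

def sgd(samples, w, epochs=200, seed=0):
    """Apply w <- w - gamma(t) * gradient, one randomly ordered sample at a time."""
    rng = random.Random(seed)
    samples = list(samples)  # copy so the caller's list is not reordered
    t = 0
    for _ in range(epochs):
        rng.shuffle(samples)
        for x, y in samples:
            gradient = loss_gradient(x, y, w)
            gamma = step_size(t)
            w = [wi - gamma * gi for wi, gi in zip(w, gradient)]
            t += 1
    return w

# Toy data generated from y = 2 * x, so the learned weight should approach 2.
data = [([1.0], 2.0), ([2.0], 4.0), ([3.0], 6.0)]
w = sgd(data, [0.0])
```

Each pass touches one record at a time, which is what makes this style of update a natural fit for record-at-a-time UDFs.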

Page 4: Strata feb2013

And New Tools

• Apache Hadoop brings with it a large selection of tools and paradigms
– Apache HBase, Apache Cassandra – Distributed, high-volume reads and writes of individual data records
– Apache Hive – SQL
– Apache Pig, Cascading – Data flow programming for ETL, data modeling, and exploration
– Apache Giraph – Graph processing
– MapReduce – Batch processing
– Storm, S4 – Stream processing
– Plus lots of commercial offerings

Page 5: Strata feb2013

Pre-Cloud: One Tool per Machine

• Databases presented SQL or SQL-like paradigms for operating on data
• Other tools came in separate packages (e.g. R) or on separate platforms (e.g. SAS).

[Diagram: separate machines for Data Warehouse, Statistical Analysis, Cube/MOLAP, OLTP, and Data Mart]

Page 6: Strata feb2013

Cloud: Many Tools One Platform

• Users no longer want to be concerned with what platform their data is in – just apply the tool to it

• SQL no longer the only or primary data access tool

[Diagram: Data Warehouse, Statistical Analysis, Data Mart, Cube/MOLAP, and OLTP workloads sharing one platform]

Page 7: Strata feb2013

Upside - Pick the Right Tool for the Job

Page 8: Strata feb2013

Downside – Tools Don’t Play Well Together

• Hard for users to share data between tools
– Different storage formats
– Different data models
– Different user-defined function interfaces

Page 9: Strata feb2013

Downside – Wasted Developer Time

• Wastes developer time since each tool supplies redundant functionality

[Diagram: Pig and Hive side by side, each with its own Parser, Optimizer, Physical Planner, and Executor, plus a Metadata component]

Page 10: Strata feb2013

Downside – Wasted Developer Time

[Diagram: the same Pig and Hive stacks (Parser, Optimizer, Physical Planner, Executor, Metadata), now with the shared functionality marked "Overlap"]

Page 11: Strata feb2013

Conclusion: We Need Services

• We need to find a way to share services where we can
• Gives users the same experience across tools
• Allows developers to share effort when it makes sense

Page 12: Strata feb2013

Hadoop = Distributed Data Operating System

Service                      Hadoop Component
Table Management             Hive
Access To Metadata           HCatalog
User authentication          Knox
Resource management          YARN
Notification                 HCatalog
REST/Connectors              WebHCat, WebHDFS, Hive, HBase, Oozie
Relational data processing   Tez

Legend (color-coded on the slide): Exists; Pieces exist in this component; New Project

Page 14: Strata feb2013

HCatalog – Table Management

• Opens up Hive’s tables to other tools inside and outside Hadoop

• Presents tools with a table paradigm that abstracts away storage details

• Provides a shared data model

• Provides a shared code path for data and metadata access

Page 15: Strata feb2013

HCatalog – Table Management

[Diagram: Hive and the Metastore]

Page 16: Strata feb2013

HCatalog – Table Management

[Diagram: Hive, Pig (via HCatLoader), and MapReduce (via HCatInputFormat) all reach the Metastore]

Page 17: Strata feb2013

HCatalog – Table Management

[Diagram: Hive, Pig (via HCatLoader), MapReduce (via HCatInputFormat), and External Systems (via WebHCat over REST) all reach the Metastore]
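As a rough illustration of how an external system might reach table metadata over REST, the sketch below builds a WebHCat request URL for listing the tables in a database. The host name, user name, and helper function are hypothetical, and the port (50111) and the `templeton/v1/ddl/database/<db>/table` path are assumptions based on the WebHCat reference documentation rather than anything stated in the slides. The code only constructs the URL; it does not contact a server.

```python
from urllib.parse import quote, urlencode

def webhcat_list_tables_url(host, database, user, port=50111):
    """Build the URL for WebHCat's "list tables in a database" call.

    `host`, `user`, and the default `port` are illustrative assumptions.
    """
    query = urlencode({"user.name": user})
    db = quote(database, safe="")
    return f"http://{host}:{port}/templeton/v1/ddl/database/{db}/table?{query}"

# Hypothetical host and user, for illustration only.
url = webhcat_list_tables_url("webhcat.example.com", "default", "alice")
print(url)
```

An external tool would issue an HTTP GET against this URL and receive a JSON description of the tables, without needing a Hive or Pig client installed.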

Page 18: Strata feb2013

Tez – Moving Beyond MapReduce

• Low-level data-processing execution engine
• Use it as the base for MapReduce, Hive, Pig, Cascading, etc.
• Enables pipelining of jobs
• Removes task and job launch times
• Hive and Pig jobs no longer need to move to the end of the queue between steps in the pipeline
• Does not write intermediate output to HDFS
– Much lighter disk and network usage
• Built on YARN

Page 19: Strata feb2013

Pig/Hive-MR versus Pig/Hive-Tez

SELECT a.state, COUNT(*), AVG(c.price)
FROM a
JOIN b ON (a.id = b.id)
JOIN c ON (a.itemId = c.itemId)
GROUP BY a.state

[Diagram: Pig/Hive on MapReduce runs this query as three jobs (Job 1, Job 2, Job 3) separated by two I/O synchronization barriers]

Page 20: Strata feb2013

Pig/Hive-MR versus Pig/Hive-Tez

[Diagram: the same query run two ways. Pig/Hive on MapReduce: three jobs (Job 1, Job 2, Job 3) separated by two I/O synchronization barriers. Pig/Hive on Tez: a single job.]
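The difference between the two execution styles can be mimicked in plain Python. This is only a toy analogy under stated assumptions: the in-memory rows standing in for tables a, b, and c are invented, and neither function uses MapReduce or Tez. In the three-job version each stage's full output is materialized before the next stage starts, which plays the role of the I/O synchronization barrier; in the single-job version rows stream through both joins into the aggregation.

```python
from collections import defaultdict

# Hypothetical toy rows standing in for tables a, b, and c.
A = [{"id": 1, "itemId": 10, "state": "CA"},
     {"id": 2, "itemId": 11, "state": "CA"},
     {"id": 3, "itemId": 10, "state": "NY"}]
B = [{"id": 1}, {"id": 2}, {"id": 3}]
C = [{"itemId": 10, "price": 4.0}, {"itemId": 11, "price": 6.0}]

def join(left, right, key):
    """Inner hash join: lazily yield merged rows whose `key` values match."""
    index = defaultdict(list)
    for r in right:
        index[r[key]].append(r)
    for l in left:
        for r in index.get(l[key], []):
            yield {**r, **l}

def aggregate(rows):
    """GROUP BY state, returning {state: (COUNT(*), AVG(price))}."""
    acc = defaultdict(lambda: [0, 0.0])
    for row in rows:
        acc[row["state"]][0] += 1
        acc[row["state"]][1] += row["price"]
    return {state: (n, total / n) for state, (n, total) in acc.items()}

def run_as_three_jobs():
    # list() forces each stage to finish and store its output before the next begins.
    job1 = list(join(A, B, "id"))
    job2 = list(join(job1, C, "itemId"))
    return aggregate(job2)

def run_as_single_job():
    # Generators chain the stages, so no intermediate result is stored in full.
    return aggregate(join(join(A, B, "id"), C, "itemId"))

result = run_as_single_job()
```

Both functions return the same answer; what changes is how much intermediate data has to be written out and read back between steps.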

Page 21: Strata feb2013

FastQuery: Beyond Batch with YARN

Tez Generalizes Map-Reduce

Simplified execution plans process data more efficiently

Always-On Tez Service

Low latency processing for all Hadoop data processing

Page 22: Strata feb2013

Knox – Single Sign On

Page 23: Strata feb2013

Today’s Access Options

• Direct Access
– Access services via REST (WebHDFS, WebHCat)
– Need knowledge of and access to whole cluster
– Security handled by each component in the cluster
– Kerberos details exposed to users

• Gateway / Portal Nodes
– Dedicated nodes behind firewall
– Users SSH to the node to access Hadoop services

[Diagram: direct access, User to Hadoop Cluster over {REST}; gateway access, User to GW Node over SSH, then to Hadoop Cluster]

Page 24: Strata feb2013

Knox Design Goals

• Operators can firewall cluster without end user access to “gateway node”

• Users see one cluster end-point that aggregates capabilities for data access, metadata and job control

• Provide perimeter security to make Hadoop security setup easier

• Enable integration with enterprise and cloud identity management environments
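To make the "one cluster end-point" goal concrete, the sketch below contrasts a direct WebHDFS URL with a gateway-style URL. The direct form follows the WebHDFS REST API (`/webhdfs/v1/<path>?op=...`). The gateway form, including port 8443 and the `/gateway/<cluster>/` prefix, is an assumption modeled on later Apache Knox documentation and is not specified in the slides. Host names, cluster name, and both helper functions are hypothetical.

```python
from urllib.parse import quote, urlencode

def direct_webhdfs_url(namenode, path, op="LISTSTATUS", port=50070):
    """Direct access: the client must know and reach the NameNode itself."""
    query = urlencode({"op": op})
    return f"http://{namenode}:{port}/webhdfs/v1{quote(path)}?{query}"

def gateway_webhdfs_url(gateway, cluster, path, op="LISTSTATUS", port=8443):
    """Gateway access: the client only sees one endpoint (assumed URL layout)."""
    query = urlencode({"op": op})
    cluster_name = quote(cluster, safe="")
    return f"https://{gateway}:{port}/gateway/{cluster_name}/webhdfs/v1{quote(path)}?{query}"

# Hypothetical hosts and cluster name, for illustration only.
direct = direct_webhdfs_url("nn.example.com", "/tmp/data")
via_gateway = gateway_webhdfs_url("knox.example.com", "mycluster", "/tmp/data")
```

With the gateway form, the internal host names and ports stay behind the firewall, and the same endpoint prefix could front other services such as WebHCat.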

Page 25: Strata feb2013

Perimeter Verification & Authentication

[Diagram: Client to Knox Gateway over {REST}; the gateway fronts the Hadoop Cluster (WebHCat, WebHDFS, Hive, HCat, JT, NN, DNs) and performs Authentication and Verification, backed by a User Store (KDC, AD, LDAP) and an ID Provider (KDC, AD, LDAP)]

• Verification
– Verify identity token
– SAML, propagation of identity

• Authentication
– Establish identity at Gateway to authenticate with LDAP + AD

Page 26: Strata feb2013

© Hortonworks 2012

Thank You