
© Hortonworks Inc. 2011 – 2014. All Rights Reserved

Hadoop 2.x Core: YARN, Tez, and Spark


YARN


Hadoop Machine Types

• Master nodes run Hadoop master processes to manage and coordinate cluster services and tasks.
• Slave nodes run Hadoop slave processes and provide cluster resources to perform data processing.
• Client machines have client-side software used to access a cluster to process data.

[Diagram: core switch, top-of-rack switches, master nodes, slave nodes, and client machines.]


How Hadoop Processes Data

• Hadoop has historically processed data using MapReduce.
• MapReduce has been the basis for Hadoop’s data processing scalability.
– MapReduce processes the data on each slave node in parallel and then aggregates the results.
• The secret to performance and scalability is to move the processing to the data rather than move the data to the processing.
• Doing so significantly reduces network I/O traffic.
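
The "process locally, then aggregate" idea can be sketched in plain Python. This is only an illustration, not Hadoop code: the node names, the sample lines, and the helper functions `map_on_node` and `reduce_counts` are all hypothetical.

```python
from collections import Counter
from functools import reduce

# Hypothetical cluster: each slave node holds its own slice of the data.
node_data = {
    "node1": ["yarn tez spark", "spark spark"],
    "node2": ["tez yarn"],
    "node3": ["spark hdfs yarn"],
}

def map_on_node(lines):
    """Map phase: count words using only the lines stored on one node."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

def reduce_counts(partials):
    """Reduce phase: merge the small per-node summaries into one result."""
    return reduce(lambda a, b: a + b, partials, Counter())

# Each node processes its local data; only the summaries would cross the network.
partials = [map_on_node(lines) for lines in node_data.values()]
word_counts = reduce_counts(partials)
print(dict(word_counts))
```

In this toy model the raw lines never leave their node. Only the per-node counters are combined, which is the reason the network traffic stays small.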


Hadoop Version 2.x

• Hadoop 2.x has two core components.
– HDFS provides distributed, scalable, and highly available data storage.
– YARN provides distributed, scalable, and highly available processing.

[Diagram: Hadoop 2.x. Data management: HDFS (Hadoop Distributed File System) and YARN, the data operating system, across nodes 1 to N. Data access: Batch (MapReduce), Script (Pig), SQL (Hive, HCatalog), NoSQL (HBase), Stream (Storm), Search (Solr), and Others (in-memory analytics, ISV engines). Tez appears twice in the diagram.]


HDFS is a Distributed File System

• HDFS automatically:
– splits large files into blocks
– spreads blocks across the cluster
– tracks block locations
– replicates blocks (not shown)
• Distributed applications like MapReduce get block information to access and analyze data.

[Diagram: a large data file is split into blocks A, B, and C. The blocks (dataA, dataB, dataC) are stored on the slave nodes (DataNodes), the master node (NameNode) holds the block locations, and MR tasks are shown on the slave nodes.]
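
The split, spread, track, and replicate behavior can be sketched as a toy model. The block size, replication factor, and DataNode names below are made-up values, and the round-robin placement is an assumption chosen for simplicity; it is not HDFS's actual placement policy.

```python
def split_into_blocks(data: bytes, block_size: int):
    """Split a file's bytes into fixed-size blocks (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(num_blocks, datanodes, replication):
    """Map each block id to `replication` distinct DataNodes.

    Round-robin placement is an illustrative assumption only.
    The returned dict plays the role of the block locations
    that the master node tracks.
    """
    if replication > len(datanodes):
        raise ValueError("replication cannot exceed the number of DataNodes")
    locations = {}
    for block_id in range(num_blocks):
        locations[block_id] = [
            datanodes[(block_id + r) % len(datanodes)]
            for r in range(replication)
        ]
    return locations

data = b"x" * 10                                  # a tiny stand-in for a large file
blocks = split_into_blocks(data, block_size=4)    # block sizes: 4, 4, 2
block_locations = place_blocks(len(blocks), ["dn1", "dn2", "dn3"], replication=2)
print(block_locations)
```

An application that wants to process block 2 would look it up in `block_locations` and run its task on one of the listed nodes.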


Hadoop Data Operating System

• Apache Hadoop YARN is the data operating system for Hadoop 2.
• YARN is:
– Responsible for scheduling tasks and managing CPU and memory resources
– Designed to enable multiple distributed applications to utilize cluster resources in a shared, secure, and multi-tenant manner


[Diagram: the Hadoop 2.x stack of data access engines (Batch, Script, SQL, NoSQL, Stream, Search, Others) across nodes 1 to N.]


A Little History

• In Hadoop version 1.x, MapReduce was more than just a data processing application.
– MapReduce was also the Hadoop cluster’s scheduler and resource manager.
• In Hadoop 2.x, YARN replaced MapReduce for scheduling and resource management.

[Diagram: Hadoop 1.x next to Hadoop 2.x. In Hadoop 1.x, MapReduce handles scheduling and resource management, with Batch (MapReduce), Script (Pig), SQL (Hive, HCatalog), and NoSQL (HBase) as data access engines. Hadoop 2.x adds Search (Solr), Stream (Storm), Others (ISV engines), and Tez.]


Why the Move to YARN?

• YARN is a generic scheduler and resource manager to support applications other than just MapReduce.
• MapReduce is not suitable for every type of data processing workload.
– The problem is that MapReduce is by nature batch processing. Batch is not suitable for:
• Processing streaming data
• Performing real-time analytics
• Record fetching
• High-speed iterative processing


Hadoop Before YARN

• Before YARN, separate clusters were often deployed. These clusters:
– Ensured different workloads received sufficient resources
– Wasted time and money on additional deployment and management tasks
– Created data silos that forced additional data transfers

[Diagram: two clusters, clusterA and clusterB, one for interactive processing and one for batch processing, with data ingest, results, and a transfer between the clusters.]


Hadoop After YARN

• YARN transformed Hadoop into a generic, distributed operating system.
– HDFS is a distributed file system.
– YARN is a distributed scheduler.
– The combination gives a single Hadoop cluster multi-tenant capability to run distributed applications of many types.

[Diagram: batch, real-time, streaming, and iterative applications on top of YARN distributed processing and HDFS distributed storage.]


Tez


Tez, an Alternative to MapReduce

• Tez is an alternative to the traditional MapReduce framework.
– It meets the demands for fast response times and extreme throughput at petabyte scale.

[Diagram: the Hadoop 1.x stack (Batch/MapReduce, Pig, Hive/HCatalog, HBase) next to the Hadoop 2.x stack, where Tez appears alongside MapReduce.]


Inefficiencies in MapReduce

• To understand how Tez accelerates query processing, it is helpful to understand some inefficiencies in MapReduce.
– These inefficiencies make MapReduce suitable only for batch processing.
• Causes of MapReduce inefficiencies are:
– HDFS and local storage use
– Requirement of map phase before reduce phase
– Hadoop containers (A container is an abstraction used to represent a discrete amount of slave node CPU and memory resources. Resources in one container are logically isolated from other container resources. Applications run inside containers.)
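
The container abstraction described above can be modeled as a fixed slice of one node's CPU and memory. The sketch below is a toy: the `Node` and `Container` classes, the vcore and memory figures, and the first-fit placement rule are all assumptions for illustration and do not reflect how YARN's schedulers actually decide placement.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Container:
    """A discrete amount of one slave node's CPU and memory."""
    app: str
    vcores: int
    memory_mb: int

@dataclass
class Node:
    name: str
    free_vcores: int
    free_memory_mb: int
    containers: list = field(default_factory=list)

def allocate(nodes, app, vcores, memory_mb):
    """First-fit (an illustrative rule): use the first node with enough room."""
    for node in nodes:
        if node.free_vcores >= vcores and node.free_memory_mb >= memory_mb:
            # Reserve the resources so no other container can claim them.
            node.free_vcores -= vcores
            node.free_memory_mb -= memory_mb
            node.containers.append(Container(app, vcores, memory_mb))
            return node.name
    return None  # no node can host the request right now

nodes = [Node("slave1", 4, 8192), Node("slave2", 4, 8192)]
placements = [
    allocate(nodes, "mapreduce-job", 2, 4096),
    allocate(nodes, "tez-query", 2, 4096),
    allocate(nodes, "spark-app", 2, 4096),
    allocate(nodes, "big-request", 8, 4096),  # larger than any single node
]
print(placements)
```

Because each allocation subtracts from the node's free resources, applications of different types can share a node without claiming the same CPU or memory.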


MapReduce and HDFS

• MapReduce uses HDFS storage to store temporary data between MapReduce jobs.
• Local storage is used to store temporary data between map and reduce phases.
– Storage I/O adds significant overhead to the overall job.

[Diagram: a chain of MapReduce jobs made of map (M) and reduce (R) tasks, with temporary data written to and read from HDFS between jobs.]


Tez and HDFS

[Diagram: the same pipeline of map (M) and reduce (R) tasks shown as Map and Reduce over MapReduce and as Map and Reduce over Tez. The Tez version has fewer intermediate HDFS steps.]
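
A rough way to see the difference between the two diagrams is to count HDFS writes for a chain of jobs. The functions below are a back-of-the-envelope model, not a measurement: the function names are hypothetical, and the single-DAG case assumes, as a simplification, that only the final result is written to HDFS.

```python
def hdfs_writes_as_separate_jobs(num_jobs):
    """Each MapReduce job in a chain writes its output to HDFS,
    so there is one HDFS write per job."""
    return num_jobs

def hdfs_writes_as_single_dag(num_jobs):
    """Simplifying assumption: the same stages run as one DAG that
    writes to HDFS only once, at the end."""
    return 1 if num_jobs > 0 else 0

def temporary_writes(total_writes):
    """Every write except the final result is temporary data."""
    return max(total_writes - 1, 0)

num_jobs = 4  # hypothetical pipeline length
summary = {
    "mapreduce_temporary_writes": temporary_writes(hdfs_writes_as_separate_jobs(num_jobs)),
    "dag_temporary_writes": temporary_writes(hdfs_writes_as_single_dag(num_jobs)),
}
print(summary)
```

Under these assumptions the temporary HDFS writes grow with the length of the MapReduce chain, while the single DAG avoids them.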


Tez is Simple

• Tez is a completely client-side implementation.

– Tez is a set of client-side libraries.

– There is no server to deploy or manage.

• Tez is not meant for end-users.

– Developers use the Tez API to create better end-user applications.

– Tez applications:

• Support batch and interactive data processing applications

• Integrate with YARN

• Perform well in a mixed application workload cluster


Spark


Apache Spark

• Apache Spark is an open source, general-purpose processing engine used to build and run fast and sophisticated applications.
– It features a simple set of APIs to write applications in Scala, Java, or Python.
• The processing engine and applications run on Hadoop 2.
– It leverages Hadoop’s horizontal scale-out capabilities.
• It is YARN-ready.
– You can process a single copy of data in multiple ways using the same cluster.


Spark RDD – Scalability and Performance

• To leverage Hadoop’s horizontal scalability:
– Spark processes data in a Resilient Distributed Dataset (RDD).
• It is a fault-tolerant collection of data elements.
– An RDD is stored in memory or on disk.
– Each RDD is distributed across Hadoop slave nodes.
• Enables parallel processing across the cluster

[Diagram: an on-disk RDD is labeled 10x MapReduce performance; an in-memory RDD, held in RAM across the nodes, is labeled 100x MapReduce performance.]
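
The partitioning and fault tolerance of an RDD can be illustrated with a toy class. `ToyRDD` is a hypothetical stand-in written for this sketch and is not the Spark API. It relies on one fact the slide does not spell out: Spark records how each dataset was derived, so a lost partition can be recomputed from its parent.

```python
class ToyRDD:
    """A toy partitioned dataset; NOT the Spark API."""

    def __init__(self, partitions, parent=None, func=None):
        self.partitions = partitions  # one list per slave node; None means lost
        self.parent = parent          # the dataset this one was derived from
        self.func = func              # the function that derived it

    @classmethod
    def from_list(cls, items, num_partitions):
        """Distribute items across partitions, one per (hypothetical) node."""
        return cls([items[i::num_partitions] for i in range(num_partitions)])

    def map(self, func):
        """Apply func to every partition independently (parallel in a real cluster)."""
        new_parts = [[func(x) for x in part] for part in self.partitions]
        return ToyRDD(new_parts, parent=self, func=func)

    def lose_partition(self, index):
        """Simulate a node failure."""
        self.partitions[index] = None

    def recover_partition(self, index):
        """Rebuild a lost partition from the parent and the recorded function."""
        if self.parent is None:
            raise RuntimeError("source data has nothing to recompute from")
        self.partitions[index] = [self.func(x) for x in self.parent.partitions[index]]

    def collect(self):
        return [x for part in self.partitions for x in part]

numbers = ToyRDD.from_list(list(range(6)), num_partitions=3)
squares = numbers.map(lambda x: x * x)
squares.lose_partition(1)     # pretend one node failed
squares.recover_partition(1)  # recompute only the missing piece
recovered = squares.partitions[1]
print(recovered, sorted(squares.collect()))
```

Only the lost partition is recomputed; the other partitions are left untouched.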


Spark High-Level Tools

• The Spark Engine supports four high-level tools to build applications.
– Spark SQL
– Spark Streaming
– MLlib
– GraphX

[Diagram: Spark SQL (SQL), Spark Streaming (streaming), MLlib (machine learning), and GraphX (graph computation) on top of the Apache Spark Engine.]


Spark SQL

– Use Spark SQL for interactive or batch queries on streaming or historical data.
• Perform queries in Scala, Java, and Python programs using integrated APIs.
• It queries structured data as a SchemaRDD.
– A SchemaRDD is an RDD of row objects that has an associated schema.
– SchemaRDDs are registered as tables and used in FROM clauses in SQL statements.
– SchemaRDDs can be used in relational queries, as well as in standard RDD functions.
– Spark SQL reuses an existing Apache Hive frontend and metastore.
• This makes it compatible with existing Hive data, queries, and UDFs.
– Spark SQL includes a server mode with standard ODBC and JDBC connectors.


Decisions, decisions, decisions…


Data Processing Options – Spark vs. Tez

• Three Common Options

– Hive on Tez

– Hive on Spark

– Spark SQL


Hive on Tez vs. Hive on Spark

• Hive on Tez outperforms Hive on Spark
– Hive tends to be bound by CPU rather than I/O, especially with the introduction of columnar file formats
– Spark spends time translating from RDDs to Hive’s native “Row Containers”
• Ends up consuming more CPU, disk, and network I/O
– Tez is a framework for building special-purpose engines, whereas Spark is a general-purpose engine
• Hive on Tez is optimized for typical Hive operations


Hive on Tez vs. Spark SQL

• Depends on size of dataset

– Less than 200 GB, Spark SQL wins

– 200 GB and greater, Hive on Tez wins

• The larger the dataset, the greater the discrepancy in performance

• http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final


Tez vs. Spark


BUT…

• Spark, like all other Hadoop projects, is evolving. Performance metrics are likely to change.
– …as will those for Tez applications, etc.
• Your mileage will vary, and performance variance today may not be the same as performance variance tomorrow.
– Beware of the word “always”


Thank you!

