
Source: files.meetup.com/1706946/Hadoop-YARN-Tez-Spark.pdf


© Hortonworks Inc. 2011 – 2014. All Rights Reserved

Hadoop 2.x Core: YARN, Tez, and Spark


YARN


Hadoop Machine Types

[Figure: cluster network diagram with a core switch and top-of-rack switches]

• Master nodes run Hadoop master processes to manage and coordinate cluster services and tasks.

• Slave nodes run Hadoop slave processes and provide cluster resources to perform data processing.

• Client machines have client-side software used to access a cluster to process data.


How Hadoop Processes Data

• Hadoop has historically processed data using MapReduce.

• MapReduce has been the basis for Hadoop’s data processing scalability.

– MapReduce processes the data on each slave node in parallel and then aggregates the results.

• The secret to performance and scalability is to move the processing to the data rather than move the data to the processing.

• Doing so significantly reduces network I/O traffic.
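As a rough illustration of processing data where it lives and then aggregating, here is a minimal word-count sketch in plain Python. It is not Hadoop code: the node names and text in `BLOCKS_BY_NODE` are made up, and threads stand in for slave nodes. Only the small per-node counts travel to the reduce step, not the raw lines.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each entry stands for the lines stored on one slave node.
BLOCKS_BY_NODE = {
    "node1": ["yarn tez spark", "spark spark"],
    "node2": ["tez yarn"],
    "node3": ["spark yarn yarn"],
}


def map_on_node(lines):
    """Map phase: count words in the lines that are local to one node."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def reduce_counts(partials):
    """Reduce phase: merge the small per-node results into one total."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def word_count(blocks_by_node):
    # Threads stand in for slave nodes working in parallel on their local data.
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(map_on_node, blocks_by_node.values()))
    return reduce_counts(partials)


if __name__ == "__main__":
    print(dict(word_count(BLOCKS_BY_NODE)))
```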


Hadoop Version 2.x

• Hadoop 2.x has two core components.

– HDFS provides distributed, scalable, and highly available data storage.

– YARN provides distributed, scalable, and highly available processing.

[Figure: Hadoop 2.x stack. Data access engines (Batch: MapReduce; Script: Pig; SQL: Hive, HCatalog; NoSQL: HBase; Stream: Storm; Search: Solr; Others: In-Memory Analytics, ISV engines; Tez) sit on YARN : Data Operating System, above data management in HDFS (Hadoop Distributed File System) across nodes 1 to N.]


HDFS is a Distributed File System

[Figure: a large data file is split into blocks A, B, and C, which are stored as dataA, dataB, and dataC on the slave nodes (DataNodes); the master node (NameNode) holds the block locations; MapReduce (MR) tasks run on the slave nodes.]

HDFS automatically:

– splits large files into blocks

– spreads blocks across the cluster

– tracks block locations

– replicates blocks (not shown)

Distributed applications like MapReduce get block information to access and analyze data.
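The four automatic HDFS behaviors can be sketched in a few lines of plain Python. This is a toy model, not the HDFS implementation: the function names are hypothetical, sizes are in arbitrary units, and replicas are placed round-robin as a simplifying assumption (real HDFS placement also considers racks).

```python
def split_into_blocks(file_size, block_size):
    """Return the sizes of the blocks that a file of file_size is cut into."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_blocks(num_blocks, datanodes, replication):
    """Assign each block to `replication` distinct DataNodes, round-robin.

    The returned dict plays the role of the block-location map kept by the
    NameNode: block index -> list of DataNode names holding a replica.
    """
    if replication > len(datanodes):
        raise ValueError("need at least as many DataNodes as replicas")
    return {
        block: [datanodes[(block + r) % len(datanodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


if __name__ == "__main__":
    blocks = split_into_blocks(file_size=300, block_size=128)
    locations = place_blocks(len(blocks), ["dn1", "dn2", "dn3", "dn4"], replication=3)
    print(blocks)
    print(locations)
```

An application asking "where is block 1?" would look it up in the returned map and schedule its work on one of those nodes.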


Hadoop Data Operating System

• Apache Hadoop YARN is the data operating system for Hadoop 2.

• YARN is:

– Responsible for scheduling tasks and managing CPU and memory resources

– Designed to enable multiple distributed applications to utilize cluster resources in a shared, secure, and multi-tenant manner

[Figure: the YARN : Data Operating System stack for Hadoop 2.x, with the same layers as on the Hadoop Version 2.x slide.]
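To make "scheduling tasks and managing CPU and memory" concrete, here is a minimal sketch of several applications sharing one pool of node resources. It is a toy first-fit allocator, not the YARN scheduler or its API: the `Node` and `Scheduler` classes, the application names, and the resource numbers are all assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    free_vcores: int
    free_mem_mb: int


@dataclass
class Scheduler:
    nodes: list
    granted: list = field(default_factory=list)

    def allocate(self, app, vcores, mem_mb):
        """Grant a container on the first node with enough free CPU and memory."""
        for node in self.nodes:
            if node.free_vcores >= vcores and node.free_mem_mb >= mem_mb:
                node.free_vcores -= vcores
                node.free_mem_mb -= mem_mb
                container = (app, node.name, vcores, mem_mb)
                self.granted.append(container)
                return container
        return None  # no capacity left: the request would have to wait


if __name__ == "__main__":
    scheduler = Scheduler([Node("slave1", 4, 8192), Node("slave2", 2, 4096)])
    for app in ("mapreduce-job", "spark-app", "storm-topology", "tez-query"):
        print(app, "->", scheduler.allocate(app, vcores=2, mem_mb=4096))
```

Different kinds of applications draw from the same nodes, which is the shared, multi-tenant idea in miniature; the last request finds no free capacity and gets `None`.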


A Little History

• In Hadoop version 1.x, MapReduce was more than just a data processing application.

– MapReduce was also the Hadoop cluster’s scheduler and resource manager.

• In Hadoop 2.x, YARN replaced MapReduce for scheduling and resource management.

[Figure: two stacks side by side. Hadoop 1.x: MapReduce: Scheduling and Resource Management, with Batch (MapReduce), Script (Pig), SQL (Hive, HCatalog), and NoSQL (HBase) over HDFS. Hadoop 2.x: YARN : Data Operating System, with those engines plus Search (Solr), Stream (Storm), Others (In-Memory Analytics, ISV engines), and Tez over HDFS on nodes 1 to N.]


Why the Move to YARN?

• YARN is a generic scheduler and resource manager to support applications other than just MapReduce.

• MapReduce is not suitable for every type of data processing workload.

– The problem is that MapReduce is by nature batch processing. Batch is not suitable for:

• Processing streaming data

• Performing real-time analytics

• Record fetching

• High-speed iterative processing


Hadoop Before YARN

• Many times separate clusters were deployed that:

– Ensured different workloads received sufficient resources

– Wasted time and money on additional deployment and management tasks

– Created data silos that forced additional data transfers

[Figure: two separate clusters, clusterA and clusterB, one for batch processing and one for interactive processing, with data ingest, results, and a transfer between the clusters.]


Hadoop After YARN

• YARN transformed Hadoop into a generic, distributed operating system.

– HDFS is a distributed file system.

– YARN is a distributed scheduler.

– The combination gives a single Hadoop cluster multi-tenant capability to run distributed applications of many types.

[Figure: batch, real-time, streaming, and iterative applications on top of YARN distributed processing and HDFS distributed storage.]


Tez


Tez, an Alternative to MapReduce

• Tez is an alternative to the traditional MapReduce framework.

– It meets the demands for fast response times and extreme throughput at petabyte scale.

[Figure: the Hadoop 1.x and Hadoop 2.x stacks side by side, as on the A Little History slide; Tez appears in the Hadoop 2.x stack.]


Inefficiencies in MapReduce

• To understand how Tez accelerates query processing, it is helpful to understand some inefficiencies in MapReduce.

– These inefficiencies make MapReduce suitable only for batch processing.

• Causes of MapReduce inefficiencies are:

– HDFS and local storage use

– Requirement of map phase before reduce phase

– Hadoop containers (A container is an abstraction used to represent a discrete amount of slave node CPU and memory resources. Resources in one container are logically isolated from other container resources. Applications run inside containers.)


MapReduce and HDFS

• MapReduce uses HDFS storage to store temporary data between MapReduce jobs.

• Local storage is used to store temporary data between map and reduce phases.

– Storage I/O adds significant overhead to the overall job.

[Figure: a chain of MapReduce jobs made of map (M) and reduce (R) tasks, with temporary data written to HDFS between jobs.]


Tez and HDFS

[Figure: the same Map and Reduce workflows drawn twice. Over MapReduce, HDFS appears between the jobs' map (M) and reduce (R) tasks. Over Tez, the M and R tasks are chained with fewer HDFS steps in between.]
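A back-of-the-envelope way to read the comparison is to count storage steps. The sketch below is a deliberately simplified model, not a measurement or a Tez API: it assumes a linear pipeline, assumes every chained MapReduce job reads from and writes to HDFS, and assumes a single DAG job touches HDFS only at the start and the end. The function name, engine labels, and stage names are hypothetical.

```python
def plan_io(stages, engine):
    """Return the ordered HDFS steps for a linear pipeline of stages.

    engine="mapreduce": every stage is a separate job that reads its input
    from HDFS and writes its output back to HDFS.
    engine="dag": the stages form one job, so only the first read and the
    last write touch HDFS.
    """
    if engine not in ("mapreduce", "dag"):
        raise ValueError(f"unknown engine: {engine}")
    steps = []
    for i, stage in enumerate(stages):
        first, last = i == 0, i == len(stages) - 1
        if engine == "mapreduce" or first:
            steps.append(f"read HDFS -> {stage}")
        if engine == "mapreduce" or last:
            steps.append(f"{stage} -> write HDFS")
    return steps


if __name__ == "__main__":
    pipeline = ["join", "aggregate", "sort"]
    for engine in ("mapreduce", "dag"):
        steps = plan_io(pipeline, engine)
        print(engine, len(steps), steps)
```

Under these assumptions a three-stage pipeline needs six HDFS steps as chained jobs and two as one DAG, which is the storage I/O overhead the figure points at.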


Tez is Simple

• Tez is a completely client-side implementation.

– Tez is a set of client-side libraries.

– There is no server to deploy or manage.

• Tez is not meant for end-users.

– Developers use the Tez API to create better end-user applications.

– Tez applications:

• Support batch and interactive data processing applications

• Integrate with YARN

• Perform well in a mixed application workload cluster


Spark


Apache Spark

• Apache Spark is an open source, general-purpose processing engine used to build and run fast and sophisticated applications.

– It features a simple set of APIs to write applications in Scala, Java, or Python.

• The processing engine and applications run on Hadoop 2.

– It leverages Hadoop’s horizontal scale-out capabilities.

• It is YARN-ready.

– You can process a single copy of data in multiple ways using the same cluster.


Spark RDD – Scalability and Performance

• To leverage Hadoop’s horizontal scalability:

– Spark processes data in a Resilient Distributed Dataset (RDD).

• It is a fault-tolerant collection of data elements.

– An RDD is stored in memory or on disk.

– Each RDD is distributed across Hadoop slave nodes.

• Enables parallel processing across the cluster

[Figure: an on-disk RDD is labeled 10x MapReduce performance; an in-memory RDD held in the RAM of the slave nodes is labeled 100x MapReduce performance.]
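The RDD idea (partitioned, kept in memory, and rebuildable after a failure) can be sketched without Spark. `ToyRDD` below is a hypothetical stand-in written for this deck, not the Spark API: each inner list plays the role of a partition on one slave node, `map` only records the transformation, and a lost partition is recomputed from the recorded steps.

```python
class ToyRDD:
    """A toy stand-in for an RDD: partitioned data plus the steps to rebuild it."""

    def __init__(self, source_partitions, transforms=()):
        self._source = source_partitions      # list of lists, one per node
        self._transforms = tuple(transforms)  # lineage: functions applied in order
        self._cache = {}                      # partition index -> computed data

    def map(self, fn):
        # Transformations are recorded, not executed, and give a new ToyRDD.
        return ToyRDD(self._source, self._transforms + (fn,))

    def compute_partition(self, index):
        # Compute a partition on first use and keep the result in memory.
        if index not in self._cache:
            data = self._source[index]
            for fn in self._transforms:
                data = [fn(x) for x in data]
            self._cache[index] = data
        return self._cache[index]

    def lose_partition(self, index):
        # Simulate a node failure that drops an in-memory partition.
        self._cache.pop(index, None)

    def collect(self):
        out = []
        for i in range(len(self._source)):
            out.extend(self.compute_partition(i))
        return out


if __name__ == "__main__":
    rdd = ToyRDD([[1, 2], [3, 4], [5]]).map(lambda x: x * 10)
    print(rdd.collect())
    rdd.lose_partition(1)            # "node 2" fails
    print(rdd.compute_partition(1))  # rebuilt from the source and the lineage
```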


Spark High-Level Tools

• The Spark Engine supports four high-level tools to build applications.

– Spark SQL

– Spark Streaming

– MLlib

– GraphX

[Figure: Spark SQL (SQL), Spark Streaming (streaming), MLlib (machine learning), and GraphX (graph computation) on top of the Apache Spark Engine.]


Spark SQL

– Use Spark SQL for interactive or batch queries on streaming or historical data.

• Perform queries in Scala, Java, and Python programs using integrated APIs.

• It queries structured data as a SchemaRDD.

– A SchemaRDD is an RDD of row objects that has an associated schema.

– SchemaRDDs are registered as tables and used in FROM clauses in SQL statements.

– SchemaRDDs can be used in relational queries, as well as in standard RDD functions.

– Spark SQL reuses an existing Apache Hive frontend and metastore.

• This makes it compatible with existing Hive data, queries, and UDFs.

– Spark SQL includes a server mode with standard ODBC and JDBC connectors.
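The "rows plus schema, registered as a table, used in a FROM clause" pattern can be tried without a cluster. The sketch below swaps in Python's built-in `sqlite3` module as a stand-in for Spark SQL, so it shows the pattern only, not Spark's API or its distributed execution. The `register_table` helper, the `people` table, and its rows are made up for the example.

```python
import sqlite3


def register_table(conn, name, schema, rows):
    """Register rows that share a schema under a table name."""
    columns = ", ".join(f"{col} {typ}" for col, typ in schema)
    conn.execute(f"CREATE TABLE {name} ({columns})")
    placeholders = ", ".join("?" for _ in schema)
    conn.executemany(f"INSERT INTO {name} VALUES ({placeholders})", rows)


# Rows with an associated schema (hypothetical data).
schema = [("name", "TEXT"), ("age", "INTEGER")]
rows = [("Ann", 34), ("Bob", 17), ("Cy", 52)]

conn = sqlite3.connect(":memory:")
register_table(conn, "people", schema, rows)

# Once registered, the table name can appear in a FROM clause.
adults = conn.execute(
    "SELECT name FROM people WHERE age >= 18 ORDER BY name"
).fetchall()

# The same rows remain usable with ordinary, non-relational functions.
total_age = sum(age for _, age in rows)

if __name__ == "__main__":
    print(adults, total_age)
```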


Decisions, decisions, decisions…


Data Processing Options – Spark vs. Tez

• Three Common Options

– Hive on Tez

– Hive on Spark

– Spark SQL


Hive on Tez vs. Hive on Spark

• Hive on Tez outperforms Hive on Spark

– Hive tends to be bound by CPU rather than I/O, especially with the introduction of columnar file formats

– Spark spends time translating from RDDs to Hive’s native “Row Containers”

• Ends up consuming more CPU, disk, and network I/O

– Tez is a framework for building special-purpose engines, whereas Spark is a general-purpose engine

• Hive on Tez is optimized for typical Hive operations


Hive on Tez vs. Spark SQL

• Depends on size of dataset

– Less than 200 GB, Spark SQL wins

– 200 GB and greater, Hive on Tez wins

• The larger the dataset, the greater the discrepancy in performance

• http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final


Tez vs. Spark


BUT…

• Spark, like all other Hadoop projects, is evolving. Performance metrics are likely to change

– …as will those for Tez applications, etc.

• Your mileage will vary, and performance variance today may not be the same as performance variance tomorrow

– Beware of the word “always”


Thank you!
