22
© Hortonworks Inc. 2013 Apache Hadoop Next Generation Compute Platform - YARN Page 1 Bikas Saha @bikassaha

YARN - Next Generation Compute Platform fo Hadoop

Embed Size (px)

DESCRIPTION

Latest information on Apache Hadoop YARN from Big Data Camp LA by Bikas Saha

Citation preview

Page 1: YARN - Next Generation Compute Platform fo Hadoop

© Hortonworks Inc. 2013

Apache HadoopNext Generation Compute Platform - YARN

Page 1

Bikas Saha@bikassaha

Page 2: YARN - Next Generation Compute Platform fo Hadoop

© Hortonworks Inc. 2013 - Confidential

1st Generation Hadoop: Batch Focus

HADOOP 1.0Built for Web-Scale Batch Apps

Single App

BATCH

HDFS

Single App

INTERACTIVE

Single App

BATCH

HDFS

All other usage patterns MUST leverage same infrastructure

Forces Creation of Silos to Manage Mixed Workloads

Single App

BATCH

HDFS

Single App

ONLINE

Page 2

Page 3: YARN - Next Generation Compute Platform fo Hadoop

© Hortonworks Inc. 2013 - Confidential

Hadoop 1 Architecture

JobTracker

Manage Cluster Resources & Job Scheduling

TaskTracker

Per-node agent

Manage Tasks

Page 3

Page 4: YARN - Next Generation Compute Platform fo Hadoop

© Hortonworks Inc. 2013 - Confidential

Hadoop 1 Limitations

Scalability

Max Cluster size ~5,000 nodes

Max concurrent tasks ~40,000

Coarse Synchronization in JobTracker

Availability

Failure Kills Queued & Running Jobs

Hard partition of resources into map and reduce slots

Non-optimal Resource Utilization

Lacks Support for Alternate Paradigms and Services

Iterative applications in MapReduce are 10x slower

Page 4

Page 5: YARN - Next Generation Compute Platform fo Hadoop

© Hortonworks Inc. 2013 - ConfidentialPage 5

Hadoop 2 - YARN Architecture

ResourceManager (RM)

Manages and allocates cluster resources

Central agent

NodeManager (NM)

Manage Tasks, Enforce Allocations

Per-Node Agent

Page 6: YARN - Next Generation Compute Platform fo Hadoop

© Hortonworks Inc. 2013 - Confidential

Data Processing Engines Run Natively IN HadoopBATCH

MapReduceINTERACTIVE

TezSTREAMINGStorm, S4, …

GRAPHGiraph

MICROSOFTREEF

SASLASR, HPA

ONLINEHBase OTHERS

Apache YARN

HDFS2: Redundant, Reliable Storage

YARN: Cluster Resource Management

Page 6

FlexibleEnables other purpose-built data processing models beyond MapReduce (batch), such as interactive and streaming

EfficientDouble processing IN Hadoop on the same hardware while providing predictable performance & quality of service

SharedProvides a stable, reliable, secure foundation and shared operational services across multiple workloads

The Data Operating System for Hadoop 2.0

Page 7: YARN - Next Generation Compute Platform fo Hadoop

© Hortonworks Inc. 2013 - Confidential

55 Key Benefits of YARN

1. New Applications & Services

2. Improved cluster utilization

3. Scale

4. Experimental Agility

5. Shared Services

Page 7

Page 8: YARN - Next Generation Compute Platform fo Hadoop

© Hortonworks Inc. 2013 - Confidential

YARN: Efficiency with Shared Services

Page 8

Yahoo! leverages YARN

40,000+ nodes running YARN across over 365PB of data

~400,000 jobs per day for about 10 million hours of compute

time

Estimated a 60% – 150% improvement on node usage per

day using YARN

Eliminated Colo (~10K nodes) due to increased utilization

Page 9: YARN - Next Generation Compute Platform fo Hadoop

© Hortonworks Inc. 2013 - Confidential

Key Improvements in YARN

Framework supporting multiple applications– Separate generic resource brokering from application logic– Define protocols/libraries and provide a framework for custom

application development– Share same Hadoop Cluster across applications

Application Agility and Innovation– Use Protocol Buffers for RPC gives wire compatibility– Map Reduce becomes an application in user space unlocking

safe innovation– Multiple versions of an app can co-exist leading to

experimentation– Easier upgrade of framework and applications

Page 9

Page 10: YARN - Next Generation Compute Platform fo Hadoop

© Hortonworks Inc. 2013 - Confidential

Key Improvements in YARN

Scalability– Removed complex app logic from RM, scale further– State machine, message passing based loosely coupled design

Cluster Utilization– Generic resource container model replaces fixed Map/Reduce

slots. Container allocations based on locality, memory (CPU coming soon)

– Sharing cluster among multiple applications

Reliability and Availability– Simpler RM state makes it easier to save and restart (work in

progress)– Application checkpoint can allow an app to be restarted.

MapReduce application master saves state in HDFS.

Page 10

Page 11: YARN - Next Generation Compute Platform fo Hadoop

© Hortonworks Inc. 2013 - Confidential

YARN as Cluster Operating System

Page 11

NodeManager NodeManager NodeManager NodeManager

map 1.1

vertex1.2.2

NodeManager NodeManager NodeManager NodeManager

NodeManager NodeManager NodeManager NodeManager

map1.2

reduce1.1

Batch

vertex1.1.1

vertex1.1.2

vertex1.2.1

Interactive SQL

ResourceManager

Scheduler

Real-Time

nimbus0

nimbus1

nimbus2

Page 12: YARN - Next Generation Compute Platform fo Hadoop

© Hortonworks Inc. 2013 - Confidential

Multi-Tenancy with Capacity Scheduler

• Queues• Economics as queue-capacity

–Hierarchical Queues

• SLAs–Preemption

• Resource Isolation–Linux: cgroups–MS Windows: Job Control–Roadmap: Virtualization (Xen, KVM)

• Administration–Queue ACLs–Run-time re-configuration for queues–Charge-backs

Page 12

ResourceManager

Scheduler

root

Adhoc10%

DW70%

Mrkting20%

Dev10%

Reserved20%

Prod70%

Prod80%

Dev20%

P070%

P130%

Capacity Scheduler

Hierarchical Queues

Page 13: YARN - Next Generation Compute Platform fo Hadoop

© Hortonworks Inc. 2013 - Confidential

YARN Eco-system

Page 13

Applications Powered by YARN

Apache Giraph – Graph Processing

Apache Hama - BSP

Apache Hadoop MapReduce – Batch

Apache Tez – Batch/Interactive

Apache S4 – Stream Processing

Apache Samza – Stream Processing

Apache Storm – Stream Processing

Apache Spark – Iterative/Interactive applications

Elastic Search – Scalable Search

Cloudera Llama – Impala on YARN

DataTorrent – Data Analysis

HOYA – HBase on YARN

RedPoint - Data Management

Frameworks Powered By YARN

Weave by Continuity

REEF by Microsoft

Spring support for Hadoop 2

There's an app for that...

YARN App Marketplace!

Page 14: YARN - Next Generation Compute Platform fo Hadoop

© Hortonworks Inc. 2013 - Confidential

YARN APIs & Client Libraries

Application Client Protocol: Client to RM interaction– Library: YarnClient– Application Lifecycle control– Access Cluster Information

Application Master Protocol: AM – RM interaction– Library: AMRMClient / AMRMClientAsync– Resource negotiation– Heartbeat to the RM

Container Management Protocol: AM to NM interaction– Library: NMClient/NMClientAsync– Launching allocated containers– Stop Running containers

Use external frameworks like Weave/REEF/Spring

Page 14

Page 15: YARN - Next Generation Compute Platform fo Hadoop

© Hortonworks Inc. 2013 - Confidential

YARN Application Flow

Page 15

Application Client

ResourceManager

Application Master

NodeManager

YarnClient

AppSpecific API

Application ClientProtocol

AMRMClient

NMClient

Application MasterProtocol

ContainerManagement

Protocol

AppContainer

Page 16: YARN - Next Generation Compute Platform fo Hadoop

© Hortonworks Inc. 2013 - Confidential

YARN Best Practices

Page 16

Use provided Client libraries

Resource Negotiation–You may ask but you may not get what you want - immediately.– Locality requests may not always be met. –Resources like memory/CPU are guaranteed.

Failure handling–Remember, anything can fail ( or YARN can pre-empt your

containers)–AM failures handled by YARN but container failures handled by the

application.

Checkpointing–Check-point AM state for AM recovery.– If tasks are long running, check-point task state.

Page 17: YARN - Next Generation Compute Platform fo Hadoop

© Hortonworks Inc. 2013 - Confidential

YARN Best Practices

Page 17

Cluster Dependencies–Try to make zero assumptions on the cluster.–Your application bundle should deploy everything required using

YARN’s local resources.

Client-only installs if possible–Simplifies cluster deployment, and multi-version support

Securing your Application–YARN does not secure communications between the AM and its

containers.

Page 18: YARN - Next Generation Compute Platform fo Hadoop

© Hortonworks Inc. 2013 - Confidential

Testing/Debugging your Application

Page 18

MiniYARNClusterRegression tests

Unmanaged AMSupport to run the AM outside of a YARN cluster for manual testing

LogsLog aggregation support to push all logs into HDFS

Accessible via CLI, UI

Page 19: YARN - Next Generation Compute Platform fo Hadoop

© Hortonworks Inc. 2013 - Confidential

YARN Future Work

Page 19

ResourceManager High Availability and Work-preserving restart– Work-in-Progress

Scheduler Enhancements– SLA Driven Scheduling, Low latency allocations– Multiple resource types – disk/network/GPUs/affinity

Rolling upgrades

Long running services– Better support to running services like HBase– Discovery of services, upgrades without downtime

More utilities/libraries for Application Developers– Failover/Checkpointing

• http://hadoop.apache.org

Page 20: YARN - Next Generation Compute Platform fo Hadoop

© Hortonworks Inc. 2013 - Confidential

Key Take-Aways

YARN - Distributed Application Framework to build/run Multiple

Applications (original being MapReduce)

YARN is completely Backwards Compatible for existing

MapReduce apps

YARN Allows Different Applications to Share the Same Cluster

YARN Enables Fine Grained Resource Management via Generic

Resource Containers

YARN Provides Better Control over Application Upgrades via Wire

Compatibility

Page 20

Page 21: YARN - Next Generation Compute Platform fo Hadoop

© Hortonworks Inc. 2013 - Confidential

Data Processing Engines Run Natively IN HadoopBATCH

MapReduceINTERACTIVE

TezSTREAMINGStorm, S4, …

GRAPHGiraph

MICROSOFTREEF

SASLASR, HPA

ONLINEHBase OTHERS

Apache YARN

HDFS2: Redundant, Reliable Storage

YARN: Cluster Resource Management

Page 21

FlexibleEnables other purpose-built data processing models beyond MapReduce (batch), such as interactive and streaming

EfficientDouble processing IN Hadoop on the same hardware while providing predictable performance & quality of service

SharedProvides a stable, reliable, secure foundation and shared operational services across multiple workloads

The Data Operating System for Hadoop 2.0

Page 22: YARN - Next Generation Compute Platform fo Hadoop

© Hortonworks Inc. 2013 - Confidential

Thank you!

Page 22

http://hortonworks.com/products/hortonworks-sandbox/

Download Sandbox: Experience Apache Hadoop

Both 2.0 and 1.x Versions Available!

http://hortonworks.com/products/hortonworks-sandbox/

Additional Questions?