Hadoop YARN in the Cloud Junping Du Staff Engineer, VMware China Hadoop Summit, 2013


Page 1: Hadoop  YARN in the Cloud

Hadoop YARN in the Cloud

Junping Du, Staff Engineer, VMware

China Hadoop Summit, 2013

Page 2: Hadoop  YARN in the Cloud

Agenda

• Hadoop YARN – Hub for Big Data Applications

• YARN and Cloud Computing

• HVE (Hadoop Virtualization Extension) work on YARN

Page 3: Hadoop  YARN in the Cloud

Hadoop MapReduce v1 (Classic)

• JobTracker
  – Manages cluster resources and job scheduling
• TaskTracker
  – Per-node agent
  – Manages tasks

Page 4: Hadoop  YARN in the Cloud

MapReduce v1 Limitations

• Scalability
  – JobTracker manages both cluster resources and job scheduling
• SPOF (Single Point Of Failure)
  – A JobTracker failure causes all queued and running jobs to fail
  – Restart is very tricky due to complex state
• Hard partition of resources into map and reduce slots
  – Low resource utilization
• Lacks support for alternate paradigms
• Lacks wire-compatible protocols
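The "hard partition" limitation can be illustrated with a small calculation. The sketch below is a simplification, not Hadoop code: the slot counts are made up, all tasks are assumed to be the same size, and `slot_utilization` and `pooled_utilization` are hypothetical helpers. It compares a node whose capacity is split into fixed map and reduce slots with a node whose capacity is one shared pool.

```python
# Hypothetical slot counts, chosen only to illustrate the effect.
MAP_SLOTS = 8
REDUCE_SLOTS = 4
TOTAL = MAP_SLOTS + REDUCE_SLOTS


def slot_utilization(pending_maps, pending_reduces):
    """Hard partition: map tasks may only use map slots, reduce tasks only reduce slots."""
    busy = min(pending_maps, MAP_SLOTS) + min(pending_reduces, REDUCE_SLOTS)
    return busy / TOTAL


def pooled_utilization(pending_maps, pending_reduces):
    """Shared pool: any task may use any free unit of capacity."""
    busy = min(pending_maps + pending_reduces, TOTAL)
    return busy / TOTAL


# Map-heavy phase: 20 map tasks are waiting and no reduce task is ready yet.
print(f"hard partition: {slot_utilization(20, 0):.0%}")   # 67%
print(f"shared pool:    {pooled_utilization(20, 0):.0%}") # 100%
```

In the map-heavy phase the four reduce slots stay idle under the hard partition, even though map tasks are queued.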

Page 5: Hadoop  YARN in the Cloud

YARN Architecture

• Splits up the two major functions of JobTracker
  – Resource Manager (RM): cluster resource management
  – Application Master (AM): task scheduling and monitoring
• NodeManager (NM): a new per-node slave
  – Launches the applications’ containers
  – Monitors their resource usage (CPU, memory) and reports to the Resource Manager
• YARN maintains compatibility with existing MapReduce applications and supports other applications

Page 6: Hadoop  YARN in the Cloud

YARN – Hub for Big Data Applications

[Diagram: YARN, alongside HDFS, as the hub for MapReduce, Tez, Storm, Spark, HBase, Impala, OpenMPI, and Distributed Shell]

• App-specific AM
• HOYA (HBase On YArn)
  – Long-running services (YARN-896)
• LLAMA (Low Latency Application MAster)
  – Gang Scheduler (YARN-624)

Page 7: Hadoop  YARN in the Cloud

YARN and Cloud

• Two different perspectives:
  – YARN-centric perspective
    • YARN is the key platform for apps
    • YARN is independent of infrastructure; running on top of Cloud shows YARN’s generality
  – Cloud-centric perspective
    • YARN is an umbrella for one kind of application
    • Supporting YARN shows Cloud’s generality

Page 8: Hadoop  YARN in the Cloud

YARN and Cloud: YARN-centric Perspective

[Diagram, three layers from top to bottom. Big Data Apps: MapReduce, Tez, Storm, Spark, HBase, Impala, Open MPI, Distributed Shell. YARN. Infrastructure: bare-metal machines, or Cloud Infrastructure (VMware, OpenStack)]

Page 9: Hadoop  YARN in the Cloud

YARN and Cloud: Cloud-centric Perspective

[Diagram: Cloud Infrastructure (VMware, OpenStack, etc.) at the bottom hosts Legacy Apps, YARN Apps (YARN running MapReduce, Tez, Storm, Spark, HBase, Impala, Open MPI, D.S), Non-YARN Big Data Apps, and others]

Page 10: Hadoop  YARN in the Cloud

YARN vs. Cloud

• Similarity
  – Both aim to share resources across applications
  – Both provide global resource management
• YARN vs. Cloud
  – YARN manages resources at the OS layer; Cloud manages resources at the hypervisor layer (not directly comparable, but the hypervisor is more powerful than the OS)
  – Apps managed by YARN need a specific AppMaster; apps managed by Cloud run exactly as they would on physical machines (Cloud)
  – YARN tracks application-specific metrics and progress; Cloud only tracks the underlying resources (YARN)

Page 11: Hadoop  YARN in the Cloud

YARN + Cloud

• Why YARN + Cloud?
  – Leverage virtualization for strong isolation, fine-grained resource sharing, and other benefits
  – Uniform infrastructure to simplify enterprise IT
• What does it look like?
  – Run the YARN NM inside VMs managed by the Cloud infrastructure
  – Build a communication channel between the YARN RM and the Cloud Resource Manager for coordination
• How do we do it?
  – The first item above is easy and goes smoothly
  – The second can be achieved in two ways
    • YARN becomes aware of, and can manipulate, Cloud resource changes
    • YARN provides a generic resource notification mechanism that the Cloud Manager can use when resources change

Page 12: Hadoop  YARN in the Cloud

Elastic YARN Node in the Cloud

• A VM’s resource boundary can be elastic
  – CPU is easy: time slicing (with constraints)
  – Memory is harder: page sharing and memory ballooning
  – In case of contention, enforce limits and proportional sharing
  – “Stealing” resources behind the apps’ backs could cause bad performance (paging)
  – App-aware resource management could address these issues
• Hadoop YARN resource model
  – Dynamic across the cluster: nodes can be added or removed
  – But static per node
• In this case, shall we enable resource elasticity on the VM?
  – If yes, performance is low when resource contention happens
  – If no, utilization is as low as on physical boxes, because free resources cannot be leveraged by other busy VMs
• We need a better answer.

Page 13: Hadoop  YARN in the Cloud

HVE provides the answer!

• Hadoop Virtualization Extensions
  – A project to enhance Hadoop running on virtualization
• Goal: make Hadoop Cloud-ready
  – Provide virtualization awareness to Hadoop, i.e. virtual topology, virtual resources, etc.
  – Deliver generic utilities that can be leveraged by virtualized platforms
• Independent of virtualization platform and cloud infrastructure
• 100% contribution to the Apache Hadoop community

Page 14: Hadoop  YARN in the Cloud

HVE

• Philosophy
  – Make infrastructure-related components abstract
  – Deliver different implementations that can be configured properly
• E.g. BlockPlacementPolicy

[Diagram: BlockPlacementPolicy (abstract), with two implementations: BlockPlacementPolicyDefault and BlockPlacementPolicy for Virtualization]
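The "abstract component, configurable implementations" idea can be sketched as follows. This is a Python illustration of the pattern only, not Hadoop's Java code: the `Node` fields, the `choose_targets` method, the `load_policy` helper, and the `block.placement.policy` key are all hypothetical, and both placement rules are deliberately simplified stand-ins. The virtualization-aware variant assumes the goal is to avoid putting two replicas on VMs that share one physical host.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    host: str  # physical host the VM runs on (hypothetical field)


class BlockPlacementPolicy(ABC):
    """Abstract, infrastructure-related component."""

    @abstractmethod
    def choose_targets(self, nodes, replicas):
        """Return up to `replicas` nodes that should hold a copy of a block."""


class BlockPlacementPolicyDefault(BlockPlacementPolicy):
    def choose_targets(self, nodes, replicas):
        # Simplified stand-in: take the first candidates, ignoring physical hosts.
        return list(nodes)[:replicas]


class BlockPlacementPolicyForVirtualization(BlockPlacementPolicy):
    def choose_targets(self, nodes, replicas):
        # Assumed rule: never place two replicas on VMs of the same physical host.
        chosen, used_hosts = [], set()
        for node in nodes:
            if len(chosen) == replicas:
                break
            if node.host not in used_hosts:
                chosen.append(node)
                used_hosts.add(node.host)
        return chosen


POLICIES = {
    "default": BlockPlacementPolicyDefault,
    "virtualization": BlockPlacementPolicyForVirtualization,
}


def load_policy(config):
    """Pick the implementation named in the configuration (hypothetical key)."""
    return POLICIES[config.get("block.placement.policy", "default")]()


nodes = [Node("vm1", "host-a"), Node("vm2", "host-a"), Node("vm3", "host-b")]
default_targets = load_policy({}).choose_targets(nodes, 2)
virtual_targets = load_policy({"block.placement.policy": "virtualization"}).choose_targets(nodes, 2)
print([n.name for n in default_targets], [n.name for n in virtual_targets])
```

With the default stand-in both replicas land on VMs of `host-a`, so a single host failure loses both; the virtualization-aware variant spreads them across `host-a` and `host-b`. Callers depend only on the abstract class, so switching behavior is a configuration change.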

Page 15: Hadoop  YARN in the Cloud

Elastic YARN Node in the Cloud

[Diagram: a Virtualization Host runs a Virtual YARN Node (VMDK, Datanode, NodeManager with Containers) next to Other Workload. Annotations: "Add/Remove Resources?", "Grow/Shrink by tens of GB in memory?", "Grow/Shrink resource of a VM"]

Page 16: Hadoop  YARN in the Cloud

Implementation – YARN-291 (umbrella)

• YARN-311
  – Core scheduler changes
• YARN-313
  – CLI
• YARN-312
  – AdminProtocol changes
• REST API, JMX, etc.

[Diagram labels: Cloud Resource Manager; Admin CLI; AdminService; Resource Manager; Resource Tracker Service; Scheduler; RMContext; RMNode; SchedulerNode; Cluster Resource; Node Manager; Heartbeat; UpdateNodeResource()]

yarn rmadmin -updateNodeResource <NodeId> <Resource>
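The effect of such an update can be modelled in a few lines. The sketch below is a toy model, not YARN's implementation: `Resource`, `ToyScheduler`, its methods, and the node IDs are hypothetical names chosen for illustration. It shows only the core idea that a node's capacity can change while the node stays registered, and that the cluster total follows.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    memory_mb: int
    vcores: int


class ToyScheduler:
    """Tracks per-node capacity and derives the cluster total from it."""

    def __init__(self):
        self._nodes = {}  # node_id -> Resource

    def add_node(self, node_id, capacity):
        self._nodes[node_id] = capacity

    def update_node_resource(self, node_id, capacity):
        # Change the capacity of a node that is already registered,
        # without removing and re-adding it.
        if node_id not in self._nodes:
            raise KeyError(f"unknown node: {node_id}")
        self._nodes[node_id] = capacity

    def cluster_resource(self):
        return Resource(
            memory_mb=sum(r.memory_mb for r in self._nodes.values()),
            vcores=sum(r.vcores for r in self._nodes.values()),
        )


scheduler = ToyScheduler()
scheduler.add_node("vm-1:8041", Resource(memory_mb=8192, vcores=4))
scheduler.add_node("vm-2:8041", Resource(memory_mb=8192, vcores=4))
before = scheduler.cluster_resource()

# The cloud layer grows vm-1, then tells the scheduler about the new size.
scheduler.update_node_resource("vm-1:8041", Resource(memory_mb=16384, vcores=8))
after = scheduler.cluster_resource()
print(before, after)
```

A real scheduler would also have to handle the shrink case, where containers already running on the node exceed the new capacity; this model does not cover that.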

Page 17: Hadoop  YARN in the Cloud

Reference

• YARN MapReduce 2.0
  – https://issues.apache.org/jira/browse/MAPREDUCE-279
• HVE topology extension
  – https://issues.apache.org/jira/browse/HADOOP-8468
• HVE topology extension for YARN
  – https://issues.apache.org/jira/browse/YARN-18
• HVE elastic resource configuration
  – https://issues.apache.org/jira/browse/YARN-291
• Gang Scheduling
  – https://issues.apache.org/jira/browse/YARN-624
• Long-lived services in YARN
  – https://issues.apache.org/jira/browse/YARN-896

Page 18: Hadoop  YARN in the Cloud

Thanks!

Junping Du