
Operationalizing YARN based Hadoop Clusters in the Cloud


Page 1: Operationalizing YARN based Hadoop Clusters in the Cloud

Operationalizing YARN Based Hadoop Clusters in the Cloud

Abhishek Modi, Lead Developer, YARN and Hadoop Team, Qubole

Page 2: Operationalizing YARN based Hadoop Clusters in the Cloud

Hadoop at Qubole

● Over 300 petabytes of data processed per month.

● More than 100 customers with more than 1000 active users.

● Over 1 million Hadoop jobs completed per month.

● More than 8,000 Hadoop clusters brought up per month.

Page 3: Operationalizing YARN based Hadoop Clusters in the Cloud

Qubole Architecture

[Architecture diagram: users reach the Qubole SaaS tier through the Qubole UI or the Qubole REST API; the SaaS tier manages multiple Hadoop clusters (production and new), all reading from and writing to cloud storage.]

Page 4: Operationalizing YARN based Hadoop Clusters in the Cloud

Ephemeral Hadoop Clusters

Bring up cluster → perform jobs (scaling up and down as the workload demands) → terminate cluster.

Page 5: Operationalizing YARN based Hadoop Clusters in the Cloud

Challenges: Ephemeral Hadoop Clusters

• Use cloud storage for job input and output.

• Auto-scale the cluster to match the workload.

• Store job history and logs in a persistent location.

• Adapt YARN/HDFS to account for ephemeral cloud nodes.

Page 6: Operationalizing YARN based Hadoop Clusters in the Cloud

YARN Auto-scaling

Page 7: Operationalizing YARN based Hadoop Clusters in the Cloud

Up-scaling for MR jobs

[Sequence diagram: the user submits a job to the Resource Manager, which launches the MR AppMaster in a NodeManager on Node 1. The AppMaster sends container requests; the Resource Manager allocates resources on Node 2, where containers C1 and C2 run and report task progress. When demand exceeds capacity, the Resource Manager sends an up-scale request to the Cluster Manager, which adds Node 3; its NodeManager then hosts containers C3 and C4.]
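The AppMaster-to-RM leg of this flow uses YARN's standard AMRMClient API. Below is a minimal sketch of that protocol; the ClusterManager interface at the end is hypothetical and only marks where the up-scale request fits in Qubole's setup.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
    import org.apache.hadoop.yarn.api.records.Priority;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.AMRMClient;
    import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

    public class UpscalingAppMaster {
      interface ClusterManager { void requestUpscale(int nodes); }  // hypothetical

      public static void run(ClusterManager clusterManager, int numTasks)
          throws Exception {
        AMRMClient<ContainerRequest> amrm = AMRMClient.createAMRMClient();
        amrm.init(new Configuration());
        amrm.start();
        amrm.registerApplicationMaster("", 0, "");

        // Ask the RM for one container per task (2 GB, 1 vcore each).
        Resource capability = Resource.newInstance(2048, 1);
        for (int i = 0; i < numTasks; i++) {
          amrm.addContainerRequest(
              new ContainerRequest(capability, null, null, Priority.newInstance(0)));
        }

        // Heartbeat: collect whatever the RM could allocate. In Qubole's setup
        // the RM itself notices unmet demand and asks the Cluster Manager for
        // more nodes; the explicit call below is only illustrative.
        AllocateResponse response = amrm.allocate(0.1f);
        int granted = response.getAllocatedContainers().size();
        if (granted < numTasks) {
          clusterManager.requestUpscale(numTasks - granted);
        }
      }
    }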

Page 8: Operationalizing YARN based Hadoop Clusters in the Cloud

Generic Up-scaling

[Diagram: the MR, Spark, and Tez AppMasters all send their resource requests to the Resource Manager, which issues a single up-scale request to the Cluster Manager to add nodes.]
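A hypothetical sketch of the centralized decision: because demand from all frameworks' AppMasters is aggregated at the RM, no per-framework autoscaling code is needed. All names below are illustrative, not actual YARN APIs.

    public class GenericUpscaler {
      interface PendingDemand { long pendingMemoryMB(); }      // per-AM demand
      interface ClusterManager { void addNodes(int count); }   // cloud API

      public static void evaluate(Iterable<PendingDemand> apps,
                                  long freeMemoryMB, long nodeMemoryMB,
                                  ClusterManager cm) {
        long pendingMB = 0;
        for (PendingDemand app : apps) {
          pendingMB += app.pendingMemoryMB();
        }
        if (pendingMB > freeMemoryMB) {
          // Round up to whole nodes; real logic would also cap at a max size.
          int nodes = (int) Math.ceil((pendingMB - freeMemoryMB) / (double) nodeMemoryMB);
          cm.addNodes(nodes);
        }
      }
    }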

Page 9: Operationalizing YARN based Hadoop Clusters in the Cloud

Down-scaling

[Sequence diagram: NodeManagers on Nodes 1-3 send status updates to the Resource Manager while Jobs 1-3 run in containers C1-C4. As jobs complete, the Resource Manager evaluates that the cluster is underutilized and can be down-scaled. It selects the node whose estimated task completion time is lowest, performs a graceful shutdown of that node's NodeManager, and asks the Cluster Manager to decommission and remove the node.]
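A hypothetical sketch of the selection step: among removable nodes, pick the one whose running tasks are estimated to finish soonest, then drain it gracefully before removal. Names are illustrative.

    import java.util.List;

    public class DownscaleSelector {
      interface NodeInfo {
        String host();
        long estimatedTaskCompletionMillis();  // from task progress reports
      }
      interface ClusterManager { void removeNode(String host); }

      public static void downscale(List<NodeInfo> candidates, ClusterManager cm) {
        NodeInfo best = null;
        for (NodeInfo n : candidates) {
          if (best == null
              || n.estimatedTaskCompletionMillis() < best.estimatedTaskCompletionMillis()) {
            best = n;
          }
        }
        if (best != null) {
          // Graceful shutdown: stop scheduling new containers on the node and
          // wait for running ones to finish before removing it.
          cm.removeNode(best.host());
        }
      }
    }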

Page 10: Operationalizing YARN based Hadoop Clusters in the Cloud

Re-commissioning

[Sequence diagram: a NodeManager is in graceful shutdown when the user submits a new job. Instead of provisioning a fresh node for the up-scale request, the Resource Manager re-commissions the node that was being decommissioned and allocates the new containers (C3, C4) there.]
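A hypothetical sketch of the re-commissioning decision: an up-scale request first reuses nodes that are still draining, and only then asks the Cluster Manager for fresh nodes. Names are illustrative.

    import java.util.Iterator;
    import java.util.Set;

    public class Recommissioner {
      interface ClusterManager { void addNodes(int count); }

      public static void upscale(int nodesNeeded, Set<String> drainingNodes,
                                 ClusterManager cm) {
        Iterator<String> it = drainingNodes.iterator();
        while (nodesNeeded > 0 && it.hasNext()) {
          it.next();
          it.remove();   // re-commission: node accepts containers again
          nodesNeeded--;
        }
        if (nodesNeeded > 0) {
          cm.addNodes(nodesNeeded);  // only boot what re-commissioning can't cover
        }
      }
    }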

Page 11: Operationalizing YARN based Hadoop Clusters in the Cloud

Further Optimizations in Down-scaling

• Containers hold the output of Map tasks.

• A node cannot be terminated until its Map output is consumed.

• Upload Map output to cloud storage (see the sketch below).

• Reducers access Map output directly from the cloud.
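A minimal sketch of the upload step, using Hadoop's generic FileSystem API; the bucket and paths are illustrative.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class MapOutputUploader {
      public static void upload(String attemptId) throws Exception {
        Configuration conf = new Configuration();
        // Illustrative paths: publish the node-local map output to cloud
        // storage so the node can terminate early; reducers read the
        // remote copy instead of fetching from the departing node.
        Path local = new Path("/tmp/mapred/local/" + attemptId + "/file.out");
        Path remote = new Path("s3n://example-bucket/shuffle/" + attemptId + "/file.out");
        FileSystem fs = remote.getFileSystem(conf);
        fs.copyFromLocalFile(local, remote);  // map output now survives the node
      }
    }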

Page 12: Operationalizing YARN based Hadoop Clusters in the Cloud

HDFS-Based Up-scaling

• DFS usage and the incoming data rate are monitored periodically.

• Upscale if free DFS space drops below an absolute threshold.

• Upscale if free DFS space is projected to drop below that threshold in the next few minutes (see the sketch below).
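A minimal sketch of the projection rule; the thresholds, sampling window, and upscale hook are illustrative.

    public class HdfsUpscaler {
      interface ClusterManager { void addDatanodes(int count); }

      public static void check(long freeDfsBytes, double ingestBytesPerSec,
                               long thresholdBytes, long windowSec,
                               ClusterManager cm) {
        // Where free DFS space is projected to be after `windowSec` seconds,
        // assuming the recently observed ingest rate holds.
        long projectedFree = freeDfsBytes - (long) (ingestBytesPerSec * windowSec);
        if (freeDfsBytes < thresholdBytes || projectedFree < thresholdBytes) {
          cm.addDatanodes(1);  // real logic would size this from the deficit
        }
      }
    }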

Page 13: Operationalizing YARN based Hadoop Clusters in the Cloud

Cost Benefits of Auto-scaling

Page 14: Operationalizing YARN based Hadoop Clusters in the Cloud

Volatile Nodes

• AWS and Google Cloud provide volatile nodes, termed “Spot Nodes” or “Pre-emptible Nodes”.

• Available at a very low price compared to stable nodes.

• Can be lost at any point of time without any prior notification.

• Hadoop’s failure resilience makes these nodes good candidates for Hadoop clusters.

• Approx. 77% of all Qubole clusters make use of volatile nodes.

Page 15: Operationalizing YARN based Hadoop Clusters in the Cloud

Volatile Nodes at Qubole

• When starting a cluster, the percentage of volatile nodes can be specified.

• A maximum ‘bid’ price for volatile nodes is also specified.

• Qubole Placement Policy (sketched below):

– Ensures at least one replica of each HDFS block is present on a stable node.

– No Application Master is scheduled on volatile nodes.
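A simplified sketch of the first rule; this is not the actual HDFS BlockPlacementPolicy interface, only the shape of the decision.

    import java.util.ArrayList;
    import java.util.List;

    public class StableFirstPlacement {
      interface Node { boolean isVolatile(); }

      public static List<Node> chooseTargets(List<Node> live, int replication) {
        List<Node> targets = new ArrayList<>();
        for (Node n : live) {                 // replica 1: stable nodes only
          if (!n.isVolatile()) { targets.add(n); break; }
        }
        for (Node n : live) {                 // remaining replicas: any node
          if (targets.size() == replication) break;
          if (!targets.contains(n)) targets.add(n);
        }
        return targets;  // a real policy also considers racks, load, and space
      }
    }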

Page 16: Operationalizing YARN based Hadoop Clusters in the Cloud

Rebalancing – Volatile Nodes

• While up-scaling, the RM tries to maintain the volatile node percentage.

• If volatile nodes are not available, it falls back to stable nodes.

• It periodically tries to rebalance back to the configured volatile node percentage (see the sketch below).
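A hypothetical sketch of the fallback-and-rebalance logic; the Cloud calls stand in for the provider's spot-market API.

    public class VolatileRebalancer {
      interface Cloud {
        int requestVolatile(int count, double maxBid);  // returns nodes granted
        int requestStable(int count);
        void replaceStableWithVolatile(int count);
      }

      public static void upscale(int nodesNeeded, double volatileFraction,
                                 double maxBid, Cloud cloud) {
        int wantVolatile = (int) Math.round(nodesNeeded * volatileFraction);
        int granted = cloud.requestVolatile(wantVolatile, maxBid);
        // Fall back to stable nodes for whatever the spot market didn't grant.
        cloud.requestStable(nodesNeeded - granted);
      }

      // Called periodically: if spot capacity is back, swap excess stable
      // nodes out to restore the configured volatile percentage.
      public static void rebalance(int stableExcess, double maxBid, Cloud cloud) {
        int granted = cloud.requestVolatile(stableExcess, maxBid);
        cloud.replaceStableWithVolatile(granted);
      }
    }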

Page 17: Operationalizing YARN based Hadoop Clusters in the Cloud

Job History

• Show job history for terminated clusters.

• Multi-tenant Job History Server.

• Clusters generally run in isolated networks, so a proxy is needed.

• Job history files need to be stored in cloud storage.

Page 18: Operationalizing YARN based Hadoop Clusters in the Cloud

Job History – Running Cluster

[Sequence diagram: the user clicks a UI link; the Qubole UI authenticates the request, and the Cluster Proxy finds the Hadoop cluster corresponding to the request, forwards the request to it, and proxifies the links in the returned HTML and JS before sending the page back to the user.]
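A hypothetical sketch of the "proxifies link" step: URLs pointing at the cluster-internal RM web port are rewritten to route back through the proxy. The proxy URL format is illustrative.

    import java.util.regex.Pattern;

    public class LinkProxifier {
      public static String proxify(String body, String clusterHost, String clusterId) {
        // Cluster-internal RM web UI (default port 8088) is unreachable from
        // the user's browser, so rewrite it to the externally visible proxy.
        String internal = "http://" + Pattern.quote(clusterHost) + ":8088";
        String external = "https://api.qubole.com/cluster-proxy/" + clusterId;
        return body.replaceAll(internal, external);
      }
    }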

Page 19: Operationalizing YARN based Hadoop Clusters in the Cloud

Job History – Terminated Cluster

[Sequence diagram: the user clicks a UI link; the Qubole UI authenticates the request, and the proxy finds that the cluster is down. The Job History Server fetches the .jhist file from cloud storage, renders the job history, and the proxy proxifies the links before returning the page.]
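A minimal sketch of serving history for a terminated cluster: the .jhist file is fetched from cloud storage and parsed with Hadoop's JobHistoryParser; the URI is illustrative.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.jobhistory.JobHistoryParser;

    public class TerminatedClusterHistory {
      public static void render(String jhistUri) throws Exception {
        Configuration conf = new Configuration();
        // e.g. s3n://example-bucket/jobhistory/job_..._0001.jhist (illustrative)
        Path jhist = new Path(jhistUri);
        FileSystem fs = jhist.getFileSystem(conf);
        JobHistoryParser parser = new JobHistoryParser(fs, jhist);
        JobHistoryParser.JobInfo info = parser.parse();
        System.out.println(info.getJobname() + " finished at " + info.getFinishTime());
      }
    }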

Page 20: Operationalizing YARN based Hadoop Clusters in the Cloud

Cloud Read/Write Optimizations

• Write output directly to the cloud without staging it at a temporary location.

• Optimized getting file status for a large number of files with a common prefix.

• Added streaming upload support in NativeS3FileSystem (see the sketch below).

• Added bulk delete and move support in NativeS3FileSystem.
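A sketch of streaming upload: parts are uploaded as in-memory buffers fill instead of after the whole file is staged locally. Illustrated here with the AWS SDK's multipart API for clarity; the actual change was made inside NativeS3FileSystem.

    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;
    import com.amazonaws.services.s3.model.*;
    import java.io.ByteArrayInputStream;
    import java.util.ArrayList;
    import java.util.List;

    public class StreamingUpload {
      public static void upload(String bucket, String key, byte[][] parts) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        String uploadId = s3.initiateMultipartUpload(
            new InitiateMultipartUploadRequest(bucket, key)).getUploadId();
        List<PartETag> etags = new ArrayList<>();
        int partNumber = 1;
        for (byte[] buf : parts) {     // each buffer >= 5 MB except the last
          UploadPartResult r = s3.uploadPart(new UploadPartRequest()
              .withBucketName(bucket).withKey(key)
              .withUploadId(uploadId)
              .withPartNumber(partNumber++)
              .withInputStream(new ByteArrayInputStream(buf))
              .withPartSize(buf.length));
          etags.add(r.getPartETag());
        }
        s3.completeMultipartUpload(
            new CompleteMultipartUploadRequest(bucket, key, uploadId, etags));
      }
    }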

Page 21: Operationalizing YARN based Hadoop Clusters in the Cloud

Open Source Issues

• Issues with a newer version of JetS3t (0.9.4):

– Seek performance degraded around 10x.

– Empty files.

• Deadlock when the number of threads reading from S3 exceeds JetS3t’s maximum number of connections (HADOOP-12739).

• Too many queues cause a deadlock in the cluster (YARN-3633).

• Support for SOCKS proxy was missing with HA.

Page 22: Operationalizing YARN based Hadoop Clusters in the Cloud

Thank You