
Hadoop 2.0, MRv2 and YARN - Module 9



Page 1: Hadoop 2.0, MRv2 and YARN - Module 9

Hadoop 2.0, MRv2 and YARN

Page 2:

YARN (Yet Another Resource Negotiator)

• YARN is Hadoop's cluster resource management system.

• YARN was introduced in Hadoop 2 to improve the MapReduce implementation, but it is general enough to support other distributed computing paradigms as well.

• YARN provides APIs for requesting and working with cluster resources.

Page 3:

Anatomy of a YARN Application Run

Page 4:

Anatomy of a YARN Application Run

• YARN provides its core services via two types of long-running daemon:

• a Resource Manager (one per cluster), which manages the use of resources across the cluster, and

• Node Managers, running on all the nodes in the cluster, which launch and monitor containers.

1. A client contacts the resource manager and asks it to run an application master process.

2. The resource manager then finds a node manager that can launch the application master in a container.

3. The application master could simply run a computation in the container it is running in and return the result to the client.

4. Or it could request more containers from the resource manager and use them to run a distributed computation.
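The four steps above can be sketched as a toy model. This is plain Python for illustration only; the class and method names mirror the YARN roles but are not the real Hadoop API:

```python
# Toy model of a YARN application run (steps 1-4 above).
# ResourceManager/NodeManager here are illustrative stand-ins,
# not the actual Hadoop classes.

class NodeManager:
    def __init__(self, node_id):
        self.node_id = node_id
        self.containers = []          # tasks launched on this node

    def launch_container(self, task):
        self.containers.append(task)
        return f"{task} on {self.node_id}"

class ResourceManager:
    def __init__(self, node_managers):
        self.node_managers = node_managers

    def allocate(self, task):
        # Pick the least-loaded node manager to host the new container.
        nm = min(self.node_managers, key=lambda n: len(n.containers))
        return nm.launch_container(task)

# 1. A client contacts the resource manager to run an application master.
rm = ResourceManager([NodeManager("node1"), NodeManager("node2")])
# 2. The RM finds a node manager and launches the AM in a container.
am = rm.allocate("app-master")
# 3./4. The AM may compute in its own container, or ask for more containers.
workers = [rm.allocate(f"worker-{i}") for i in range(3)]
print(am)       # "app-master on node1"
print(workers)
```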

Page 5:

Comparison of MapReduce 1 and YARN components

• MapReduce 1 hits scalability bottlenecks in the region of 4,000 nodes and 40,000 tasks

• YARN is designed to scale up to 10,000 nodes and 100,000 tasks

Page 6:

HDFS Federation

• Hadoop 2.0 introduces a scaling mechanism for the NameNode referred to as HDFS Federation. As opposed to the single NameNode used in Hadoop 1.x, the new Hadoop infrastructure provides for multiple NameNodes that run independently of each other, providing:

• Scalability: NameNodes can now scale horizontally, allowing you to improve the performance of NameNode tasks by distributing reads and writes across a set of NameNodes.

• Namespaces: the ability to define multiple namespaces allows you to organize and separate your big data.
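As a sketch of how this looks in practice, federation is configured in hdfs-site.xml by declaring the nameservices and giving each its own NameNode address. The nameservice names (ns1, ns2) and hostnames below are placeholders:

```xml
<!-- hdfs-site.xml: two independent NameNodes, each serving its own namespace -->
<property>
  <name>dfs.nameservices</name>
  <value>ns1,ns2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.ns1</name>
  <value>nn1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.ns2</name>
  <value>nn2.example.com:8020</value>
</property>
```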

Page 7:
Page 8:

HDFS High Availability

• Prior to Hadoop 2.0, the NameNode was a single point of failure in an HDFS cluster. Each cluster had a single NameNode, and if that machine or process became unavailable, the cluster as a whole would be unavailable until the NameNode was either restarted or brought up on a separate machine.

• The HDFS High Availability (HA) feature addresses this issue by providing the option of running two redundant NameNodes in the same cluster in an Active/Passive configuration with a hot standby. This allows a fast failover to a new NameNode in the case that a machine crashes, or a graceful administrator-initiated failover for the purpose of planned maintenance.
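An Active/Passive pair is configured as one logical nameservice backed by two NameNodes, again in hdfs-site.xml. The nameservice name (mycluster) and hostnames are placeholders, and a full HA setup needs additional properties (shared edits storage, failover proxy provider, fencing) beyond this minimal sketch:

```xml
<!-- hdfs-site.xml: one logical nameservice backed by two redundant NameNodes -->
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>machine1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn2</name>
  <value>machine2.example.com:8020</value>
</property>
```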

Page 9:
Page 10:

Differences between the Components of Hadoop 1.0 and Hadoop 2.0

Page 11:

Migration from Hadoop 1.0 to Hadoop 2.0

• One edge that YARN gives Hadoop users is backward compatibility: an existing MapReduce job can run on Hadoop 2.0 without any modifications, which makes it easy for companies to migrate from Hadoop 1.0 to Hadoop 2.0.

Page 12:

Scheduling in YARN

• Three schedulers are available in YARN: the FIFO, Capacity, and Fair Schedulers.

• FIFO (First In, First Out): places applications in a queue and runs them in the order of submission. Requests for the first application in the queue are allocated first; once its requests have been satisfied, the next application in the queue is served, and so on.
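The scheduler the Resource Manager uses is selected in yarn-site.xml; as a sketch, the FIFO Scheduler would be chosen like this (the Capacity and Fair Schedulers have analogous class names under the same package):

```xml
<!-- yarn-site.xml: which scheduler the Resource Manager uses (FIFO shown here) -->
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler</value>
</property>
```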

Page 13:

• Capacity: a separate dedicated queue allows a small job to start as soon as it is submitted, although this comes at the cost of overall cluster utilization, since the queue's capacity is reserved for jobs in that queue. This means that a large job finishes later than it would under the FIFO Scheduler.
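A minimal capacity-scheduler.xml sketch of such a setup might define a main queue plus a dedicated queue for small jobs; the queue names (prod, dev) and the 70/30 split are illustrative choices, not required values:

```xml
<!-- capacity-scheduler.xml: a dedicated queue for small jobs alongside the main queue -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>prod,dev</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.prod.capacity</name>
  <value>70</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.dev.capacity</name>
  <value>30</value>
</property>
```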

Page 14:

• Fair Scheduler: there is no need to reserve a set amount of capacity, since the scheduler dynamically balances resources between all running jobs. Just after the first (large) job starts, it is the only job running, so it gets all the resources in the cluster. When the second (small) job starts, it is allocated half of the cluster's resources, so that each job is using its fair share.
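The rebalancing described above can be illustrated with a toy calculation (plain Python; the 100-unit cluster and equal-share rule are simplifying assumptions, ignoring queue weights and minimum shares in the real Fair Scheduler):

```python
# Toy illustration of fair sharing: each running job gets an equal
# slice of an assumed 100-unit cluster.

def fair_shares(running_jobs, cluster_capacity=100):
    """Return each job's share when resources are split equally."""
    if not running_jobs:
        return {}
    share = cluster_capacity / len(running_jobs)
    return {job: share for job in running_jobs}

# The large job alone gets the whole cluster...
print(fair_shares(["large-job"]))               # {'large-job': 100.0}
# ...and when the small job arrives, each gets half.
print(fair_shares(["large-job", "small-job"]))  # 50.0 each
```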