
Hadoop 2.0, MRv2 and YARN - Module 9



Page 1: Hadoop 2.0, MRv2 and YARN - Module 9

Hadoop 2.0, MRv2 and YARN

Page 2:

YARN (Yet Another Resource Negotiator)

• YARN is Hadoop's cluster resource management system.

• YARN was introduced in Hadoop 2 to improve the MapReduce implementation, but it is general enough to support other distributed computing paradigms as well.

• YARN provides APIs for requesting and working with cluster resources.

Page 3:

Anatomy of a YARN Application Run

Page 4:

Anatomy of a YARN Application Run

• YARN provides its core services via two types of long-running daemon:

• a Resource Manager (one per cluster), which manages the use of resources across the cluster, and

• Node Managers, running on all the nodes in the cluster, which launch and monitor containers.

1. A client contacts the resource manager and asks it to run an application master process.

2. The resource manager then finds a node manager that can launch the application master in a container.

3. The application master could simply run a computation in the container it is running in and return the result to the client.

4. Or it could request more containers from the resource manager and use them to run a distributed computation.
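The four steps above can be sketched as a toy model. This is plain Python for illustration only; the class and method names mirror the YARN roles but are not the real Hadoop API:

```python
# Toy model of a YARN application run (steps 1-4 above).
# ResourceManager/NodeManager here are illustrative stand-ins,
# not the actual Hadoop classes.

class NodeManager:
    def __init__(self, node_id):
        self.node_id = node_id
        self.containers = []          # tasks launched on this node

    def launch_container(self, task):
        self.containers.append(task)
        return f"{task} on {self.node_id}"

class ResourceManager:
    def __init__(self, node_managers):
        self.node_managers = node_managers

    def allocate(self, task):
        # Pick the least-loaded node manager to host the new container.
        nm = min(self.node_managers, key=lambda n: len(n.containers))
        return nm.launch_container(task)

# 1. A client contacts the resource manager to run an application master.
rm = ResourceManager([NodeManager("node1"), NodeManager("node2")])
# 2. The RM finds a node manager and launches the AM in a container.
am = rm.allocate("app-master")
# 3./4. The AM may compute in its own container, or ask for more containers.
workers = [rm.allocate(f"worker-{i}") for i in range(3)]
print(am)       # "app-master on node1"
print(workers)
```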

Page 5:

Comparison of MapReduce 1 and YARN components

• MapReduce 1 hits scalability bottlenecks in the region of 4,000 nodes and 40,000 tasks

• YARN is designed to scale up to 10,000 nodes and 100,000 tasks

Page 6:

HDFS Federation

• Hadoop 2.0 introduces a scaling mechanism for the NameNode referred to as HDFS Federation. As opposed to the single NameNode used in Hadoop 1.x, the new Hadoop infrastructure provides for multiple NameNodes that run independently of each other, providing:

• Scalability: NameNodes can now scale horizontally, allowing you to improve the performance of NameNode tasks by distributing reads and writes across a set of NameNodes.

• Namespaces: the ability to define multiple namespaces allows you to organize and separate your big data.
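As a sketch of how this looks in practice, federation is configured in hdfs-site.xml by declaring the nameservices and giving each its own NameNode address. The nameservice names (ns1, ns2) and hostnames below are placeholders:

```xml
<!-- hdfs-site.xml: two independent NameNodes, each serving its own namespace -->
<property>
  <name>dfs.nameservices</name>
  <value>ns1,ns2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.ns1</name>
  <value>nn1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.ns2</name>
  <value>nn2.example.com:8020</value>
</property>
```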

Page 7:
Page 8:

HDFS High Availability

• Prior to Hadoop 2.0, the NameNode was a single point of failure in an HDFS cluster. Each cluster had a single NameNode, and if that machine or process became unavailable, the cluster as a whole would be unavailable until the NameNode was either restarted or brought up on a separate machine.

• The HDFS High Availability (HA) feature addresses this issue by providing the option of running two redundant NameNodes in the same cluster in an Active/Passive configuration with a hot standby. This allows a fast failover to a new NameNode in the case that a machine crashes, or a graceful administrator-initiated failover for the purpose of planned maintenance.
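An Active/Passive pair is configured as one logical nameservice backed by two NameNodes, again in hdfs-site.xml. The nameservice name (mycluster) and hostnames are placeholders, and a full HA setup needs additional properties (shared edits storage, failover proxy provider, fencing) beyond this minimal sketch:

```xml
<!-- hdfs-site.xml: one logical nameservice backed by two redundant NameNodes -->
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>machine1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn2</name>
  <value>machine2.example.com:8020</value>
</property>
```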

Page 9:
Page 10:

Differences between the Components of Hadoop 1.0 and Hadoop 2.0

Page 11:

Migration from Hadoop 1.0 to Hadoop 2.0

• One edge that YARN gives Hadoop users is backward compatibility: an existing MapReduce job can run on Hadoop 2.0 without any modifications, which makes it easy for companies to migrate from Hadoop 1.0 to Hadoop 2.0.

Page 12:

Scheduling in YARN

• Three schedulers are available in YARN: the FIFO, Capacity, and Fair Schedulers.

• FIFO (First In, First Out): places applications in a queue and runs them in the order of submission. Requests for the first application in the queue are allocated first; once its requests have been satisfied, the next application in the queue is served, and so on.
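The scheduler the Resource Manager uses is selected in yarn-site.xml; as a sketch, the FIFO Scheduler would be chosen like this (the Capacity and Fair Schedulers have analogous class names under the same package):

```xml
<!-- yarn-site.xml: which scheduler the Resource Manager uses (FIFO shown here) -->
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler</value>
</property>
```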

Page 13:

• Capacity: a separate dedicated queue allows a small job to start as soon as it is submitted, although this comes at the cost of overall cluster utilization, since the queue's capacity is reserved for jobs in that queue. This means that a large job finishes later than it would under the FIFO Scheduler.
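A minimal capacity-scheduler.xml sketch of such a setup might define a main queue plus a dedicated queue for small jobs; the queue names (prod, dev) and the 70/30 split are illustrative choices, not required values:

```xml
<!-- capacity-scheduler.xml: a dedicated queue for small jobs alongside the main queue -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>prod,dev</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.prod.capacity</name>
  <value>70</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.dev.capacity</name>
  <value>30</value>
</property>
```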

Page 14:

• Fair Scheduler: there is no need to reserve a set amount of capacity, since the scheduler dynamically balances resources between all running jobs. Just after the first (large) job starts, it is the only job running, so it gets all the resources in the cluster. When the second (small) job starts, it is allocated half of the cluster's resources, so that each job is using its fair share.
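The rebalancing described above can be illustrated with a toy calculation (plain Python; the 100-unit cluster and equal-share rule are simplifying assumptions, ignoring queue weights and minimum shares in the real Fair Scheduler):

```python
# Toy illustration of fair sharing: each running job gets an equal
# slice of an assumed 100-unit cluster.

def fair_shares(running_jobs, cluster_capacity=100):
    """Return each job's share when resources are split equally."""
    if not running_jobs:
        return {}
    share = cluster_capacity / len(running_jobs)
    return {job: share for job in running_jobs}

# The large job alone gets the whole cluster...
print(fair_shares(["large-job"]))               # {'large-job': 100.0}
# ...and when the small job arrives, each gets half.
print(fair_shares(["large-job", "small-job"]))  # 50.0 each
```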