
Hadoop 1.0 vs Hadoop 2.0


  • www.edureka.in/hadoopSlide 1

  • www.edureka.in/hadoopSlide 2

    Topics for Today

    Hadoop Architecture

    Problems with Hadoop 1.0

    Solution: Hadoop 2.0 and YARN

    Hadoop 2.0 New Features

    HDFS High Availability

    HDFS Federation

    YARN and MRv2

    YARN and Hadoop ecosystem

    YARN Components

    JobTracker and Job Submission

    MR Application Execution in YARN

  • www.edureka.in/hadoopSlide 3

    Hadoop is a system for large-scale data processing.

    It has two main components:

    HDFS: Hadoop Distributed File System (Storage)

    Distributed across nodes

    Natively redundant

    Self-healing, high-bandwidth clustered storage

    NameNode tracks block locations

    MapReduce (Processing)

    Splits a task across processors near the data and assembles the results

    JobTracker manages the TaskTrackers

    Hadoop Core Components
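    Note: a minimal, hedged sketch of the HDFS side of this split using the HDFS Java client API. It asks the NameNode which DataNodes hold each block of a file and prints them; the path /data/sample.txt is only an illustrative placeholder, not something from the slides.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLocationsDemo {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);       // client handle backed by the NameNode
        Path file = new Path("/data/sample.txt");   // hypothetical file already stored in HDFS

        FileStatus status = fs.getFileStatus(file);
        // The NameNode tracks which DataNodes hold each (redundant) block of the file.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
          System.out.println("offset " + block.getOffset()
              + " length " + block.getLength()
              + " hosts " + String.join(",", block.getHosts()));
        }
        fs.close();
      }
    }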

  • www.edureka.in/hadoopSlide 4

    [Diagram: the MapReduce engine layered over the HDFS cluster. An Admin Node hosts the NameNode and the JobTracker; each slave node runs a DataNode (storage) paired with a TaskTracker (processing).]

    Hadoop Core Components (Contd.)

  • www.edureka.in/hadoopSlide 5

    [Diagram: Hadoop 1.0 in summary. The Client talks to HDFS and MapReduce. Masters: the NameNode (with a Secondary NameNode) for HDFS and the JobTracker for MapReduce. Slaves: each node runs a DataNode holding data blocks and a TaskTracker running Map and Reduce tasks.]

    Hadoop 1.0 In Summary

  • www.edureka.in/hadoopSlide 6

    Hadoop 1.0 - Challenges

    Problem: NameNode - no horizontal scalability. A single NameNode and a single namespace, limited by the NameNode's RAM.

    Problem: NameNode - no High Availability (HA). The NameNode is a single point of failure; recovery requires manual intervention using the Secondary NameNode.

    Problem: JobTracker overburdened. It spends a significant portion of its time and effort managing the life cycle of applications.

    Problem: MRv1 supports only Map and Reduce tasks. The huge volume of data stored in HDFS remains under-utilized and cannot be used for other workloads such as graph processing.

  • www.edureka.in/hadoopSlide 7

    NameNode HA and Scale

    NameNode - No Horizontal Scale

    NameNode - No High Availability

    [Diagram: a single NameNode holds the namespace (NS) and block management for all DataNodes. The client gets block locations from the NameNode and then reads data directly from the DataNodes.]

  • www.edureka.in/hadoopSlide 8

    Secondary NameNode:

    Not a hot standby for the NameNode

    Connects to the NameNode every hour*

    Housekeeping and backup of NameNode metadata

    Saved metadata can be used to rebuild a failed NameNode

    [Diagram: the NameNode, a single point of failure, hands its metadata to the Secondary NameNode every hour: "You give me metadata every hour, I will make it secure."]

    NameNode: Single Point of Failure
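    Note: the asterisk on "every hour" refers to the default checkpoint interval. A small sketch of the Hadoop 1.x knobs that control how often the Secondary NameNode checkpoints; these are normally set in core-site.xml rather than in code, and the defaults shown are the shipped values.

    import org.apache.hadoop.conf.Configuration;

    public class CheckpointSettings {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Defaults: checkpoint every 3600 seconds, or sooner once the edit log
        // reaches fs.checkpoint.size bytes.
        System.out.println("fs.checkpoint.period = " + conf.get("fs.checkpoint.period", "3600"));
        System.out.println("fs.checkpoint.size   = " + conf.get("fs.checkpoint.size", "67108864"));
      }
    }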

  • www.edureka.in/hadoopSlide 9

    Job Tracker Overburdened

    CPU: spends a very significant portion of its time and effort managing the life cycle of applications.

    Network: a single listener thread communicates with thousands of Map and Reduce tasks.

    [Diagram: one JobTracker serving many TaskTrackers.]

  • www.edureka.in/hadoopSlide 10

    MRv1 Unpredictability in Large Clusters

    Cascading Failures: as the cluster size grows towards 4,000 nodes, DataNode failures cause a serious deterioration of overall cluster performance, because attempts to re-replicate data overload the live nodes and flood the network.

    Multi-tenancy: as clusters increase in size, you may want to employ them for a variety of processing models. MRv1 dedicates its nodes to MapReduce, and they cannot be re-purposed for other applications and workloads in an organization. With the growing popularity and adoption of cloud computing among enterprises, this becomes increasingly important.

  • www.edureka.in/hadoop

    Terabytes and Petabytes of data in HDFS can be used only for MapReduce processing

    Unutilized Data in HDFS

  • www.edureka.in/hadoopSlide 12

    Hadoop 2.0 New Features

    Federation: Hadoop 1.0 has one NameNode and one namespace; Hadoop 2.0 supports multiple NameNodes and namespaces.

    High Availability: not present in Hadoop 1.0; the NameNode is highly available in Hadoop 2.0.

    YARN (processing control and multi-tenancy): Hadoop 1.0 has the JobTracker and TaskTrackers; Hadoop 2.0 has the ResourceManager, NodeManager, Application Master and Capacity Scheduler.

    Other important Hadoop 2.0 features:

    HDFS snapshots

    NFSv3 access to data in HDFS

    Support for running Hadoop on MS Windows

    Binary compatibility for MapReduce applications built on Hadoop 1.0

    Substantial integration testing with the rest of the projects (such as Pig, Hive) in the Hadoop ecosystem
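    Note: a short, hedged sketch of the HDFS snapshots feature listed above, via the FileSystem API. It assumes an administrator has already made /data snapshottable (hdfs dfsadmin -allowSnapshot /data); the path and snapshot name are illustrative only.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SnapshotDemo {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path dir = new Path("/data");                               // hypothetical snapshottable directory
        Path snapshot = fs.createSnapshot(dir, "before-cleanup");   // read-only image of the namespace
        System.out.println("Created snapshot at " + snapshot);      // e.g. /data/.snapshot/before-cleanup
        // Files stay readable under the snapshot even if they are later deleted from /data.
        fs.deleteSnapshot(dir, "before-cleanup");                   // remove the example snapshot again
        fs.close();
      }
    }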

  • www.edureka.in/hadoopSlide 13

    YARN and Hadoop Ecosystem

    [Diagram: Hadoop 1.0 stack: Apache Oozie (workflow), Pig Latin (data analysis), Hive (DW system) and HBase sit on the MapReduce framework, which runs directly on HDFS. Hadoop 2.0 stack: the same ecosystem components, plus other YARN frameworks (MPI, Giraph), sit on YARN (cluster resource management), which runs on HDFS.]

    YARN adds a more general interface to run non-MapReduce jobs (such as Graph Processing) within the Hadoop framework

  • www.edureka.in/hadoopSlide 14

    Hadoop 2.0 Cluster Architecture - Federation

    [Diagram: Hadoop 1.0: a single NameNode provides one namespace (NS) and block management over the DataNodes' storage. Hadoop 2.0 Federation: multiple NameNodes (NN-1 ... NN-k ... NN-n), each with its own namespace (NS1 ... NSk ... NSn) and block pool (Pool 1 ... Pool k ... Pool n), share common storage provided by the DataNodes (Datanode 1 ... Datanode m).]

    http://hadoop.apache.org/docs/stable2/hadoop-project-dist/hadoop-hdfs/Federation.html
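    Note: a hedged sketch of what federation looks like to a client, written as programmatic configuration; in practice these properties live in hdfs-site.xml and core-site.xml. Nameservice IDs, hostnames and mount points are illustrative (the /marketing and /finance mounts anticipate the quiz that follows).

    import org.apache.hadoop.conf.Configuration;

    public class FederationClientConfig {
      public static Configuration build() {
        Configuration conf = new Configuration();
        // Two independent namespaces, each served by its own NameNode.
        conf.set("dfs.nameservices", "ns-marketing,ns-finance");
        conf.set("dfs.namenode.rpc-address.ns-marketing", "nn1.example.com:8020");
        conf.set("dfs.namenode.rpc-address.ns-finance", "nn2.example.com:8020");

        // A client-side ViewFS mount table stitches the namespaces into one view.
        conf.set("fs.defaultFS", "viewfs://clusterX");
        conf.set("fs.viewfs.mounttable.clusterX.link./marketing", "hdfs://nn1.example.com:8020/marketing");
        conf.set("fs.viewfs.mounttable.clusterX.link./finance", "hdfs://nn2.example.com:8020/finance");
        return conf;
      }
    }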

  • www.edureka.in/hadoopSlide 15

    Hello There!! My name is Annie. I love quizzes and puzzles and I am here to make you guys think and answer my questions.

    How does HDFS Federation help HDFS scale horizontally?

    - It reduces the load on any single NameNode by using multiple, independent NameNodes to manage individual parts of the filesystem namespace.

    - It provides cross-data-center (non-local) support for HDFS, allowing a cluster administrator to split the block storage outside the local cluster.

    Annie's Question

  • www.edureka.in/hadoopSlide 16

    A. In order to scale the name service horizontally, HDFS Federation uses multiple independent NameNodes. The NameNodes are federated, that is, they are independent and don't require coordination with each other.

    Annie's Answer

  • www.edureka.in/hadoopSlide 17

    Hello There!! My name is Annie. I love quizzes and puzzles and I am here to make you guys think and answer my questions.

    You have configured two NameNodes to manage /marketing and /finance respectively. What will happen if you try to put a file in the /accounting directory?

    Annie's Question

  • www.edureka.in/hadoopSlide 18

    The put will fail. Neither namespace manages that directory, and you will get an IOException with a "No such file or directory" error.

    Annie's Answer

  • www.edureka.in/hadoopSlide 19

    [Diagram: Hadoop 2.0 cluster with NameNode High Availability (HDFS) and Next Generation MapReduce (YARN). On the management nodes, an Active NameNode and a Standby NameNode share edit logs on NFS storage and the ResourceManager manages YARN; each slave node runs a DataNode plus a NodeManager hosting containers and Application Masters, serving the Client.]

    All namespace edits are logged to shared NFS storage; there is a single writer, enforced by fencing.

    The Standby NameNode reads the edit logs and applies them to its own namespace.

    Hadoop 2.0 Cluster Architecture - HA

    HDFS HIGH AVAILABILITY

    http://hadoop.apache.org/docs/stable2/hadoop-yarn/hadoop-yarn-site/HDFSHighAvailabilityWithNFS.html
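    Note: a hedged sketch of the NFS-based HA setup in the diagram, written as programmatic configuration; in a real cluster these properties go in hdfs-site.xml and core-site.xml. The nameservice ID, NameNode IDs, hostnames and NFS mount point are illustrative.

    import org.apache.hadoop.conf.Configuration;

    public class HdfsHaNfsConfig {
      public static Configuration build() {
        Configuration conf = new Configuration();
        conf.set("dfs.nameservices", "mycluster");
        conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");   // Active + Standby
        conf.set("dfs.namenode.rpc-address.mycluster.nn1", "nn1.example.com:8020");
        conf.set("dfs.namenode.rpc-address.mycluster.nn2", "nn2.example.com:8020");
        // Shared edit log on an NFS filer; fencing enforces the single-writer rule.
        conf.set("dfs.namenode.shared.edits.dir", "file:///mnt/filer/hdfs/ha-edits");
        conf.set("dfs.ha.fencing.methods", "sshfence");
        // Clients fail over between nn1 and nn2 through this proxy provider.
        conf.set("dfs.client.failover.proxy.provider.mycluster",
            "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
        conf.set("fs.defaultFS", "hdfs://mycluster");
        return conf;
      }
    }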

  • www.edureka.in/hadoop

    NameNode Recovery vs. Failover

    NameNode recovery in Hadoop 1.0: the NameNode's metadata (FsImage and edit logs) is copied to the Secondary NameNode, and after a failure the NameNode must be recovered manually from the Secondary NameNode's saved FsImage.

    High Availability in Hadoop 2.0: an Active NameNode is paired with a Standby NameNode, and failover to the Standby NameNode is automatic.

  • www.edureka.in/hadoopSlide 21

    Hello There!! My name is Annie. I love quizzes and puzzles and I am here to make you guys think and answer my questions.

    HDFS HA was developed to overcome which of the following disadvantages in Hadoop 1.0?

    - Single point of failure of the NameNode

    - Only one version can be run in classic MapReduce

    - Too much burden on the JobTracker

    Annie's Question

  • www.edureka.in/hadoopSlide 22

    Single point of failure of the NameNode.

    Annie's Answer

  • www.edureka.in/hadoopSlide 23

    BATCH (MapReduce)

    INTERACTIVE (Tez)

    ONLINE (HBase)

    STREAMING (Storm, S4, ...)

    GRAPH (Giraph)

    IN-MEMORY (Spark)

    HPC MPI (OpenMPI)

    OTHER (Search) (Weave, ...)

    YARN Moving beyond MapReduce

    http://hadoop.apache.org/docs/stable2/hadoop-yarn/hadoop-yarn-site/YARN.html

  • www.edureka.in/hadoopSlide 24

    Multi-tenancy - Capacity Scheduler

    Organizes jobs into queues

    Queue shares as a percentage of the cluster

    FIFO scheduling within each queue

    Data-locality-aware scheduling

    Hierarchical Queues: to manage resources within an organization.

    Capacity Guarantees: a fraction of the total available capacity is allocated to each queue.

    Security: safeguards applications from other users.

    Elasticity: resources are available to queues in a predictable and elastic manner.

    Multi-tenancy: a set of limits prevents over-utilization of resources by a single application.

    Operability: queues can be reconfigured at runtime.

    Resource-based scheduling: if needed, applications can request more resources than the default.
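    Note: a hedged sketch of a two-queue Capacity Scheduler layout illustrating the queue shares and elasticity above. Queue names and percentages are invented for the example; in practice these properties live in capacity-scheduler.xml (and the scheduler class in yarn-site.xml).

    import org.apache.hadoop.conf.Configuration;

    public class CapacitySchedulerSketch {
      public static Configuration build() {
        Configuration conf = new Configuration();
        conf.set("yarn.resourcemanager.scheduler.class",
            "org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler");
        // Two top-level queues sharing the cluster 70/30.
        conf.set("yarn.scheduler.capacity.root.queues", "prod,adhoc");
        conf.set("yarn.scheduler.capacity.root.prod.capacity", "70");
        conf.set("yarn.scheduler.capacity.root.adhoc.capacity", "30");
        // Elasticity with a cap: adhoc may borrow idle capacity up to 50% of the cluster.
        conf.set("yarn.scheduler.capacity.root.adhoc.maximum-capacity", "50");
        return conf;
      }
    }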

  • www.edureka.in/hadoop

    Executing MapReduce Application on YARN

    MapReduce Application Execution

  • www.edureka.in/hadoopSlide 26

    [Diagram: classic (Hadoop 1.0) job submission between the user, the client, DFS and the JobTracker.]

    1. The user copies the input files into DFS.

    2. The user submits the job to the client.

    3. The client gets the input files' information from DFS.

    4. The client creates the splits.

    5. The client uploads the job information (Job.xml, Job.jar) to DFS.

    6. The client submits the job to the JobTracker.

    JobTracker: Job Execution
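    Note: a minimal sketch of the client side of this flow using the classic (org.apache.hadoop.mapred) API. JobClient.runJob() gathers the input information, computes the splits, uploads Job.xml/Job.jar to DFS and submits the job to the JobTracker. The /input and /output paths and the identity mapper/reducer are illustrative placeholders.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.IdentityMapper;
    import org.apache.hadoop.mapred.lib.IdentityReducer;

    public class ClassicJobSubmit {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(ClassicJobSubmit.class);
        conf.setJobName("classic-passthrough");
        conf.setMapperClass(IdentityMapper.class);     // pass records straight through
        conf.setReducerClass(IdentityReducer.class);
        conf.setOutputKeyClass(LongWritable.class);    // TextInputFormat produces <LongWritable, Text>
        conf.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(conf, new Path("/input"));    // step 1: input already in DFS
        FileOutputFormat.setOutputPath(conf, new Path("/output"));
        JobClient.runJob(conf);   // steps 3-6: get input info, create splits, upload job, submit
      }
    }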

  • www.edureka.in/hadoopSlide 27

    [Diagram: the client, DFS and the JobTracker's job queue.]

    6. The client submits the job (Job.xml, Job.jar) to the JobTracker, which places it in the job queue.

    7. The JobTracker initializes the job.

    8. The JobTracker reads the job files from DFS.

    9. The JobTracker creates the map and reduce tasks: as many maps as there are input splits.

    JobTracker (Contd.)

  • www.edureka.in/hadoopSlide 28

    [Diagram: the JobTracker's job queue (tasks for hosts H1, H3, H4, H5) and TaskTrackers on hosts H1-H4.]

    10. The TaskTrackers send heartbeats to the JobTracker.

    11. The JobTracker picks tasks (data-local if possible).

    12. The JobTracker assigns the tasks to the TaskTrackers.

    JobTracker (Contd.)

  • www.edureka.in/hadoopSlide 29

    YARN = Yet Another Resource Negotiator

    [Diagram: clients submit jobs to the ResourceManager; each NodeManager hosts containers and Application Masters; the arrows carry MapReduce status, job submissions, node status and resource requests. A JobHistoryServer records completed jobs.]

    Resource Manager: cluster-level resource manager; long-lived, runs on high-quality hardware.

    Node Manager: one per DataNode; monitors the resources on that DataNode.

    Application Master: one per application; short-lived; manages the application's tasks and scheduling.

    YARN Beyond MapReduce
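    Note: a hedged sketch of the first hop a YARN client makes, the "Get New Application" call to the ResourceManager, using the YARN client API (org.apache.hadoop.yarn.client.api.YarnClient). The ResourceManager address is read from yarn-site.xml on the classpath.

    import org.apache.hadoop.yarn.api.records.ApplicationId;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.client.api.YarnClientApplication;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class NewApplicationDemo {
      public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        // Ask the ResourceManager for a new application id and the largest container it will grant.
        YarnClientApplication app = yarnClient.createApplication();
        ApplicationId appId = app.getNewApplicationResponse().getApplicationId();
        System.out.println("Granted application id: " + appId);
        System.out.println("Max container memory: "
            + app.getNewApplicationResponse().getMaximumResourceCapability().getMemory() + " MB");

        yarnClient.stop();
      }
    }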

  • www.edureka.in/hadoopSlide 30

    YARN MR Application Execution Flow

    [Diagram: steps 1-4. The Client JVM runs the application and its Job object: 1. Run App; 2. Get New Application (from the ResourceManager); 3. Copy Job Resources (to HDFS); 4. Submit Application (to the ResourceManager). NodeManagers on the DataNodes will later host the MR AppMaster and the YarnChild task JVMs.]

  • www.edureka.in/hadoopSlide 31

    YARN MR Application Execution Flow

    [Diagram: steps 1-9. After submission (steps 1-4): 5. the ResourceManager starts the MR AppMaster container; 6. the NodeManager creates that container; 7. the MR AppMaster gets the input splits from HDFS; 8. it requests resources from the ResourceManager; 9. it asks a NodeManager to start a task container.]

  • www.edureka.in/hadoopSlide 32

    YARN MR Application Execution Flow

    [Diagram: steps 1-12. After steps 1-9: 10. the NodeManager creates the task container; 11. the YarnChild acquires the job resources from HDFS; 12. the YarnChild executes the Map or Reduce task in its task JVM.]

  • www.edureka.in/hadoopSlide 33

    YARN MR Application Execution Flow

    [Diagram: status reporting. The running tasks update their status to the MR AppMaster, and the client's Job object polls for status.]

  • www.edureka.in/hadoop

    Hadoop 2.0 : YARN Workflow

    [Diagram: the ResourceManager (Scheduler plus Applications Manager (AsM)) coordinates many NodeManagers. MR AppMaster 1 runs containers 1.1 and 1.2, and MR AppMaster 2 runs containers 2.1, 2.2 and 2.3, spread across the nodes.]

  • www.edureka.in/hadoopSlide 35

    YARN MR Application Execution Flow

    MapReduce application execution covers submission, initialization, task assignment, task memory, status updates and failure recovery.

    Client: submits the MapReduce application.

    Resource Manager: manages resource utilization across the Hadoop cluster.

    Node Manager: runs on each DataNode; creates the execution containers; monitors container usage.

    MapReduce Application Master: coordinates and manages the MapReduce tasks; negotiates with the Resource Manager to schedule tasks; the tasks are started by the NodeManager(s).

    HDFS: shares resources and task artefacts among the YARN components.

    Application vs. Job Execution
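    Note: a hedged sketch tying these roles together from the client's point of view: a normal MapReduce job, but with mapreduce.framework.name set to yarn the submission goes to the Resource Manager and an MR Application Master, not a JobTracker, drives the tasks. The addresses and paths are illustrative; the mapper/reducer are left as the framework defaults to keep the sketch short.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class YarnMrSubmit {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("mapreduce.framework.name", "yarn");                    // submit to YARN, not classic MR
        conf.set("yarn.resourcemanager.address", "rm.example.com:8032"); // hypothetical RM address
        conf.set("fs.defaultFS", "hdfs://nn.example.com:8020");          // hypothetical NameNode

        Job job = Job.getInstance(conf, "yarn-passthrough");
        job.setJarByClass(YarnMrSubmit.class);
        job.setOutputKeyClass(LongWritable.class);   // TextInputFormat + identity map/reduce defaults
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path("/input"));
        FileOutputFormat.setOutputPath(job, new Path("/output"));

        // waitForCompletion() copies the job resources, submits the application to the
        // ResourceManager, and then polls the MR Application Master for status.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }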

  • www.edureka.in/hadoop

    Hadoop 2.0 In Summary

    [Diagram: Hadoop 2.0 in summary. The Client talks to HDFS (distributed data storage) and YARN (distributed data processing). Masters: an Active NameNode and a Standby NameNode with shared edit logs (plus a Secondary NameNode) provide NameNode High Availability, and the Resource Manager (Scheduler plus Applications Manager (AsM)) provides Next Generation MapReduce. Slaves: each node runs a DataNode and a Node Manager hosting containers and App Masters.]

    The App Master manages the application; it can request containers and run the application's code in them.

  • www.edureka.in/hadoopSlide 37

    Hello There!! My name is Annie. I love quizzes and puzzles and I am here to make you guys think and answer my questions.

    YARN/MRv2 was developed to overcome which of the following disadvantages in MRv1?

    - Single point of failure of the NameNode

    - Only one version can be run in classic MapReduce

    - Too much burden on the JobTracker

    Annie's Question

  • www.edureka.in/hadoopSlide 38

    Too much burden on the JobTracker.

    Annie's Answer

  • www.edureka.in/hadoopSlide 39

    Hello There!! My name is Annie. I love quizzes and puzzles and I am here to make you guys think and answer my questions.

    In YARN (MRv2), the functionality of the JobTracker has been replaced by which of the following YARN features?

    - Job scheduling

    - Task monitoring

    - Resource management

    - Node management

    Annie's Question

  • www.edureka.in/hadoopSlide 40

    Task monitoring and resource management. The fundamental idea of MRv2 is to split the two major functionalities of the JobTracker, i.e. resource management and job scheduling/monitoring, into separate daemons: a global ResourceManager (RM) for resource management and a per-application ApplicationMaster (AM) for job scheduling and task monitoring.

    Annie's Answer

  • www.edureka.in/hadoopSlide 41

    Hello There!! My name is Annie. I love quizzes and puzzles and I am here to make you guys think and answer my questions.

    In YARN, which of the following daemons takes care of the containers and of the resource utilization by the applications?

    - Node Manager

    - Job Tracker

    - Task Tracker

    - Application Master

    - Resource Manager

    Annie's Question

  • www.edureka.in/hadoopSlide 42

    ApplicationMaster.

    Annie's Answer

  • www.edureka.in/hadoopSlide 43

    Hello There!! My name is Annie. I love quizzes and puzzles and I am here to make you guys think and answer my questions.

    Can we run MRv1 jobs on a YARN/MRv2-enabled Hadoop cluster?

    - Yes

    - No

    Annie's Question

  • www.edureka.in/hadoopSlide 44

    Yes. You need to recompile the jobs against MRv2 after enabling YARN/MRv2 to run them successfully on an MRv2 Hadoop cluster.

    Annie's Answer

  • www.edureka.in/hadoopSlide 45

    Hello There!! My name is Annie. I love quizzes and puzzles and I am here to make you guys think and answer my questions.

    Which of the following YARN/MRv2 daemons is responsible for launching the tasks?

    - Task Tracker

    - Resource Manager

    - ApplicationMaster

    - ApplicationMasterServer

    Annie's Question

  • www.edureka.in/hadoopSlide 46

    ApplicationMaster.

    Annie's Answer

  • Thank You! See You in Class Next Week.

  • www.edureka.in/hadoop

    NN High Availability: Quorum Journal Manager

    [Diagram: an Active NameNode and a Standby NameNode share edits through a quorum of JournalNodes; the DataNodes hold the data blocks.]

    The Active NameNode writes namespace modifications to the edit logs; only the Active NameNode is the writer.

    The Standby NameNode reads the edit logs and applies them to its own namespace.

    DataNodes are configured with the location of both NameNodes, and send block location information and heartbeats to both.

    http://hadoop.apache.org/docs/stable2/hadoop-yarn/hadoop-yarn-site/HDFSHighAvailabilityWithQJM.html
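    Note: a hedged sketch of the Quorum Journal Manager variant shown above, including automatic failover through ZooKeeper (ZKFC), written as programmatic configuration; in a real cluster these settings live in hdfs-site.xml and core-site.xml. Hostnames and the journal ID are illustrative.

    import org.apache.hadoop.conf.Configuration;

    public class HdfsHaQjmConfig {
      public static Configuration build() {
        Configuration conf = new Configuration();
        conf.set("dfs.nameservices", "mycluster");
        conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
        conf.set("dfs.namenode.rpc-address.mycluster.nn1", "nn1.example.com:8020");
        conf.set("dfs.namenode.rpc-address.mycluster.nn2", "nn2.example.com:8020");
        // The Active NameNode writes edits to a quorum of JournalNodes; the Standby reads them back.
        conf.set("dfs.namenode.shared.edits.dir",
            "qjournal://jn1.example.com:8485;jn2.example.com:8485;jn3.example.com:8485/mycluster");
        conf.set("dfs.client.failover.proxy.provider.mycluster",
            "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
        // Automatic failover: ZKFailoverControllers coordinate through a ZooKeeper quorum.
        conf.set("dfs.ha.automatic-failover.enabled", "true");
        conf.set("ha.zookeeper.quorum", "zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181");
        conf.set("fs.defaultFS", "hdfs://mycluster");
        return conf;
      }
    }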