Introduction to Hadoop Administration

View Hadoop Administration course details at www.edureka.co/hadoop-admin

Top 5 Hadoop admin tasks

Objectives of this Session

At the end of this module, you will be able to:

• Understand cluster planning
• Understand Hadoop fully distributed cluster setup
• Add further nodes to a running cluster
• Upgrade an existing Hadoop cluster
• Understand NameNode High Availability

Why Hadoop Administration

Hadoop Administrator

With the rise of Hadoop adoption and usage across various industries, the role of the Hadoop Administrator has become very important and is in demand.

Hadoop Administration Responsibilities

Hadoop Admin Responsibilities

• HDFS Support & Maintenance
• Monitor Hadoop Cluster
• Providing Security
• Integrating Different Frameworks
• Hadoop Infrastructure Maintenance

Top 5 Hadoop Admin Tasks

• Task-1: Cluster Planning
• Task-2: Hadoop Cluster Setup
• Task-3: Hadoop Version Upgrade
• Task-4: Adding or Removing Nodes to Cluster
• Task-5: Providing High Availability to Cluster

Task-1: Cluster Planning

Hadoop Cluster: A Typical Use Case

• Active NameNode: RAM 64 GB; hard disk 1 TB; processor Xeon with 8 cores; Ethernet 3 x 10 GB/s; OS 64-bit CentOS; redundant power supply
• Secondary NameNode: RAM 32 GB; hard disk 1 TB; processor Xeon with 4 cores; Ethernet 3 x 10 GB/s; OS 64-bit CentOS; redundant power supply
• Standby NameNode (optional): RAM 64 GB; hard disk 1 TB; processor Xeon with 8 cores; Ethernet 3 x 10 GB/s; OS 64-bit CentOS; redundant power supply
• DataNode (each): RAM 16 GB; hard disk 6 x 2 TB; processor Xeon with 2 cores; Ethernet 3 x 10 GB/s; OS 64-bit CentOS

Cluster Growth Based On Storage Capacity

Planning cluster growth around storage capacity is often a good method to use.

• Data grows by approximately 5 TB per week
• HDFS is set up to replicate each block three times
• Thus, 15 TB of extra storage space is required per week
• Assuming machines with 5 x 3 TB hard drives, this equates to one new machine required each week
• Assume overheads to be 30%
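
The arithmetic on this slide can be sketched in a few lines of Python. This is only an illustration: the function names are made up here, and treating the 30% overhead as a simple multiplier on the replicated data is an assumption, since the slide does not say how the overhead is applied.

```python
import math

def weekly_raw_storage_tb(ingest_tb, replication=3, overhead=0.0):
    """Raw disk needed per week: ingest x replication, inflated by overhead.

    Applying the overhead as a multiplier is an assumption of this sketch.
    """
    return ingest_tb * replication * (1.0 + overhead)

def nodes_per_week(raw_tb, disks_per_node=5, disk_tb=3):
    """Fractional number of new machines needed to hold raw_tb of data."""
    return raw_tb / (disks_per_node * disk_tb)

# Figures from the slide: 5 TB/week, 3 replicas, machines with 5 x 3 TB drives.
base_tb = weekly_raw_storage_tb(5)
padded_tb = weekly_raw_storage_tb(5, overhead=0.30)

print("raw TB per week, no overhead:", base_tb)
print("raw TB per week, 30% overhead:", round(padded_tb, 2))
print("machines per week, no overhead:", nodes_per_week(base_tb))
print("machines per week, 30% overhead:", round(nodes_per_week(padded_tb), 2))
print("machines per year, 30% overhead:", math.ceil(52 * nodes_per_week(padded_tb)))
```

Without the overhead the result matches the slide's one machine per week; with the overhead included, roughly 1.3 machines per week would be needed under this sketch's assumption.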

Slave Nodes: Recommended Configuration

Higher-performance vs lower-performance components: save the money, buy more nodes!

General Configuration (the 'base' configuration for a slave node; depends on requirements)

» 4 x 1 TB or 2 TB hard drives, in a JBOD (Just a Bunch Of Disks) configuration
» Do not use RAID!
» 2 x quad-core CPUs
» 24-32 GB RAM
» Gigabit Ethernet

Special Configuration

» Multiples of (1 hard drive + 2 cores + 6-8 GB RAM) generally work well for many types of applications

“A cluster with more nodes performs better than one with fewer, slightly faster nodes”

Slave Nodes: More Details (RAM)

• Generally each Map or Reduce task will take 1 GB to 2 GB of RAM
• Slave nodes should not be using virtual memory
• Rule of thumb: total number of tasks = 1.5 x number of processor cores
• Ensure enough RAM is present to run all tasks, plus the DataNode and TaskTracker daemons, plus the operating system
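
As a rough illustration of the rule of thumb, the sketch below estimates the RAM a slave node would need. The per-task figure comes from the slide (1 GB to 2 GB); the 2 GB reserved for the DataNode and TaskTracker daemons and the 2 GB reserved for the operating system are assumptions chosen for the example, not recommendations from the deck.

```python
import math

def max_tasks(cores, tasks_per_core=1.5):
    """Rule of thumb from the slide: total tasks = 1.5 x processor cores."""
    return math.floor(cores * tasks_per_core)

def required_ram_gb(cores, gb_per_task=2.0, daemons_gb=2.0, os_gb=2.0):
    """RAM for all tasks plus daemons plus OS.

    daemons_gb and os_gb are assumed reservations for this sketch only.
    """
    return max_tasks(cores) * gb_per_task + daemons_gb + os_gb

# A node with 2 x quad-core CPUs has 8 cores.
cores = 8
print("concurrent tasks:", max_tasks(cores))
print("RAM needed at 2 GB per task:", required_ram_gb(cores), "GB")
print("RAM needed at 1 GB per task:", required_ram_gb(cores, gb_per_task=1.0), "GB")
```

For an 8-core node the upper estimate lands at 28 GB, inside the 24-32 GB range quoted for the general slave-node configuration.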

Master Node Hardware Recommendations

A master node requires:

• Carrier-class hardware (not commodity hardware)
• Dual power supplies
• Dual Ethernet cards (bonded to provide failover)
• RAIDed hard drives
• At least 32 GB of RAM

Task-2: Hadoop Cluster Setup

Hadoop Cluster Modes

Hadoop can run in any of the following three modes:

• Standalone (or Local) Mode: no daemons, everything runs in a single JVM; suitable for running MapReduce programs during development; has no DFS
• Pseudo-Distributed Mode: Hadoop daemons run on the local machine
• Fully-Distributed Mode: Hadoop daemons run on a cluster of machines

Hadoop 2.x Configuration Files (Apache Hadoop)

• Core: core-site.xml
• HDFS: hdfs-site.xml
• YARN: yarn-site.xml
• MapReduce: mapred-site.xml

Configuration Files

Configuration Filenames        Description
hadoop-env.sh, yarn-env.sh     Settings for the Hadoop daemons' process environment.
core-site.xml                  Configuration settings for Hadoop Core, such as I/O settings that are common to both HDFS and YARN.
hdfs-site.xml                  Configuration settings for the HDFS daemons: the NameNode and the DataNodes.
yarn-site.xml                  Configuration settings for the ResourceManager and NodeManager.
mapred-site.xml                Configuration settings for MapReduce applications.
slaves                         A list of machines (one per line) that each run a DataNode and a NodeManager.
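
The *-site.xml files listed above share one layout: a `<configuration>` root holding `<property>` entries, each with a `<name>` and a `<value>`. The sketch below generates a minimal core-site.xml to show that layout. `fs.defaultFS` is the Hadoop 2.x property naming the default file system; the host name and port are placeholders invented for this example, and the helper name `build_site_xml` is hypothetical.

```python
import xml.etree.ElementTree as ET

def build_site_xml(properties):
    """Render a dict of Hadoop properties as a *-site.xml document string."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")

# Placeholder NameNode address; replace with the real host and port.
core_site = build_site_xml({
    "fs.defaultFS": "hdfs://namenode.example.com:8020",
})
print(core_site)
```

The same helper could be reused for hdfs-site.xml, yarn-site.xml, or mapred-site.xml by passing different property names.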

Hadoop Daemons

NameNode daemon
» Runs on the master node of the Hadoop Distributed File System (HDFS)
» Directs DataNodes to perform their low-level I/O tasks

DataNode daemon
» Runs on each slave machine in the HDFS
» Does the low-level I/O work

ResourceManager
» Runs on the master node of the data processing system (MapReduce)
» Global resource scheduler

NodeManager
» Runs on each slave node of the data processing system
» Platform for the data processing tasks

JobHistoryServer
» Responsible for servicing all job-history-related requests from clients

Hadoop 1.x and Hadoop 2.x Ecosystem

[Figure: side-by-side component stacks for Hadoop 1.x and Hadoop 2.x, both handling structured and unstructured/semi-structured data]

• Hadoop 1.x: Pig Latin (data analysis), Hive (DW system), HBase, and Apache Oozie (workflow) on top of the MapReduce framework and HDFS (Hadoop Distributed File System)
• Hadoop 2.x: Pig Latin (data analysis), Hive (DW system), HBase, the MapReduce framework, other YARN frameworks (MPI, Giraph), and Apache Oozie (workflow) on top of YARN (cluster resource management) and HDFS (Hadoop Distributed File System)

Demo On Hadoop Cluster Set Up

Task-3: Hadoop Version Upgrade

Hadoop Version Upgrade

1) Stop the MapReduce cluster and all client applications running on the DFS cluster
2) Take a backup of the file system namespace
3) Install the new version of the Hadoop software
4) Update all configuration files in the new Hadoop installation
5) Start the NameNode with the upgrade command
6) Compare the new HDFS file system with the previous version's file system namespace
7) Finalize the upgrade

Hadoop Version Upgrade

1) Run reports
• FSCK
• LSR
• DFSADMIN

2) Take backup
• Configuration
• Applications
• Data and metadata

3) Install new version of Hadoop

4) Upgrade
• hadoop-daemon.sh start namenode -upgrade

5) Run new reports
• FSCK
• LSR
• DFSADMIN
Compare old and new reports, and test the new cluster.

6) Finalize upgrade
• hadoop dfsadmin -finalizeUpgrade
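
The "compare old and new reports" step can be partly automated. The sketch below assumes the file listings taken before and after the upgrade have been saved as plain text with one path per line; the sample paths are invented for illustration, and real LSR output carries extra columns (permissions, owner, size, timestamp) that would need to be stripped first.

```python
def load_paths(listing_text):
    """Return the set of non-empty, stripped lines from a saved listing."""
    return {line.strip() for line in listing_text.splitlines() if line.strip()}

def compare_listings(before_text, after_text):
    """Return (paths missing after the upgrade, paths new after the upgrade)."""
    before = load_paths(before_text)
    after = load_paths(after_text)
    return sorted(before - after), sorted(after - before)

# Made-up sample listings, standing in for the saved reports.
before_listing = """
/data/events/part-00000
/data/events/part-00001
/user/admin/report.csv
"""
after_listing = """
/data/events/part-00000
/user/admin/report.csv
"""

missing, added = compare_listings(before_listing, after_listing)
print("missing after upgrade:", missing)
print("new after upgrade:", added)
```

Any path reported as missing is a reason to hold off on finalizing, since finalizing removes the ability to roll back to the previous version.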

Task-4: Adding or Removing Nodes from Cluster

Commissioning and Decommissioning of DataNode

[Figure: a Master Node surrounded by DataNodes, with DataNodes being added (commissioning) and removed (decommissioning)]

Add (Commission) DataNodes

1) Update the network addresses in the 'include' files: dfs.include and mapred.include
2) Update the NameNode: hadoop dfsadmin -refreshNodes
3) Update the JobTracker: hadoop mradmin -refreshNodes
4) Update the 'slaves' file
5) Start the DataNode and TaskTracker:
   hadoop-daemon.sh start tasktracker
   hadoop-daemon.sh start datanode
6) Cross-check the Web UI to ensure the successful addition
7) Run the Balancer to move the HDFS blocks to DataNodes
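
The first step, updating the include files, amounts to appending a host name if it is not already listed. The sketch below shows one way to do that idempotently. The helper name `add_host`, the host name, and the temporary file standing in for dfs.include are all hypothetical; on a real cluster the include file location is whatever the cluster configuration points to.

```python
import os
import tempfile
from pathlib import Path

def add_host(include_file, hostname):
    """Append hostname to an include file unless it is already listed.

    Returns True if the file was changed, False if the host was present.
    """
    path = Path(include_file)
    hosts = path.read_text().split() if path.exists() else []
    if hostname in hosts:
        return False
    hosts.append(hostname)
    path.write_text("\n".join(hosts) + "\n")
    return True

# Demonstration against a temporary file standing in for dfs.include.
demo_dir = tempfile.mkdtemp()
demo_include = os.path.join(demo_dir, "dfs.include")
Path(demo_include).write_text("datanode1.example.com\n")

print(add_host(demo_include, "datanode2.example.com"))  # file is updated
print(add_host(demo_include, "datanode2.example.com"))  # already present
print(Path(demo_include).read_text())
```

After the include files are updated, the refresh commands in steps 2 and 3 make the NameNode and JobTracker re-read them.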

Demo On Commissioning Data Node

Task-5: Providing High Availability to Cluster

High Availability (HA)

Prior to Hadoop 2.0.0, the NameNode was a single point of failure (SPOF) in an HDFS cluster. Each cluster had a single NameNode, and if that machine or process became unavailable, the cluster as a whole would be unavailable until the NameNode was either restarted or brought up on a separate machine.

High Availability can be achieved in two different ways:

• HA using the Quorum Journal Manager (QJM) to share edit logs between the Active and Standby NameNodes
• HA using NFS for shared storage instead of the QJM
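
With the QJM option, the shared edits location is given to HDFS as a `qjournal://` URI listing the JournalNode hosts, set in the `dfs.namenode.shared.edits.dir` property of hdfs-site.xml. The sketch below builds such a URI and shows how many JournalNode failures a quorum of a given size can tolerate, since edits must reach a majority of JournalNodes. The host names and the nameservice ID `mycluster` are placeholders, and the helper names are hypothetical; 8485 is the default JournalNode RPC port.

```python
def qjournal_uri(journal_hosts, nameservice, port=8485):
    """Build the qjournal:// URI used for dfs.namenode.shared.edits.dir."""
    if len(journal_hosts) < 3:
        raise ValueError("QJM needs at least 3 JournalNodes to form a useful quorum")
    authority = ";".join(f"{host}:{port}" for host in journal_hosts)
    return f"qjournal://{authority}/{nameservice}"

def tolerated_failures(journal_node_count):
    """Edits must reach a majority, so (N - 1) // 2 nodes may fail."""
    return (journal_node_count - 1) // 2

# Placeholder JournalNode hosts and nameservice ID.
hosts = ["jn1.example.com", "jn2.example.com", "jn3.example.com"]
print(qjournal_uri(hosts, "mycluster"))
print("failures tolerated with 3 JournalNodes:", tolerated_failures(3))
print("failures tolerated with 4 JournalNodes:", tolerated_failures(4))
print("failures tolerated with 5 JournalNodes:", tolerated_failures(5))
```

Four JournalNodes tolerate no more failures than three, which is why an odd number of JournalNodes is the usual choice.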

HA Architecture

[Figure: an Active NameNode and a Standby NameNode share edits through the Journal Nodes (the Active node writes, the Standby node reads); the slave nodes send block reports and heartbeats to the NameNodes; a Failover Controller beside each NameNode monitors its status and health and manages HA state through the ZooKeeper service]

Demo On NameNode High Availability

Hadoop admin Job Trends

Questions

Twitter @edurekaIN, Facebook /edurekaIN, use #askEdureka for Questions
