Top 5 Hadoop Admin Tasks
For more details please contact us:
US: 1800 275 9730 (toll free) | INDIA: +91 88808 62004
Email Us: [email protected]
For queries, post on Twitter @edurekaIN or Facebook /edurekaIN using #askEdureka
View the Hadoop Administration course at www.edureka.co/hadoop-admin
Objectives
At the end of this module, you will be able to:
» Understand cluster planning
» Understand a fully distributed Hadoop cluster setup with two nodes
» Add further nodes to a running cluster
» Upgrade an existing Hadoop cluster from Hadoop 1 to Hadoop 2
» Understand Active NameNode failure and how the passive NameNode takes over
Hadoop Cluster: Thinking About The Problem

Single Machine
» Great for testing and developing
» Not a practical implementation for large amounts of data

Small Cluster
» Initially four to six nodes
» As the volume of data grows, more nodes can easily be added

Large Cluster
Ways of deciding when the cluster needs to grow:
» Increasing amount of computation power needed
» Increasing amount of data which needs to be stored
» Increasing amount of memory needed to process tasks
Hadoop Cluster: A Typical Use Case

NameNode
RAM: 64 GB, Hard disk: 1 TB, Processor: Xeon with 8 cores, Ethernet: 3 x 10 GB/s, OS: 64-bit CentOS, Power: redundant power supply

Secondary NameNode
RAM: 32 GB, Hard disk: 1 TB, Processor: Xeon with 4 cores, Ethernet: 3 x 10 GB/s, OS: 64-bit CentOS, Power: redundant power supply

DataNode (each)
RAM: 16 GB, Hard disk: 6 x 2 TB, Processor: Xeon with 2 cores, Ethernet: 3 x 10 GB/s, OS: 64-bit CentOS
Cluster Growth Based On Storage Capacity

Sizing cluster growth on storage capacity is often a good method to use.

» Data grows by approximately 5 TB per week
» HDFS is set up to replicate each block three times
» Thus, 15 TB of extra storage space is required per week
» Assume overheads (temporary and intermediate data) of about 30%
» Assuming machines with 5 x 3 TB hard drives, this equates to roughly one new machine required each week
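The back-of-the-envelope sizing above can be sketched in a few lines. The figures (5 TB/week growth, 3x replication, 30% overhead, 5 x 3 TB drives per machine) come straight from this slide:

```python
# Weekly capacity-planning sketch using the figures from the slide.
weekly_data_tb = 5          # raw data growth per week
replication = 3             # HDFS replication factor
overhead = 0.30             # temp/intermediate data overhead
drives_per_node = 5
drive_size_tb = 3

raw_needed = weekly_data_tb * replication          # 15 TB of replicated data
total_needed = raw_needed * (1 + overhead)         # 19.5 TB with overhead
node_capacity = drives_per_node * drive_size_tb    # 15 TB per machine

machines_per_week = total_needed / node_capacity
print(f"Extra storage per week: {total_needed:.1f} TB")
print(f"New machines per week:  {machines_per_week:.1f}")
```

With overhead included, the answer is closer to 1.3 machines per week than exactly one, which is why the slide says "approximately".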
Slave Nodes: Recommended Configuration

Higher-performance vs. lower-performance components: save the money, buy more nodes!
"A cluster with more nodes performs better than one with fewer, slightly faster nodes."

General Configuration
A general 'base' configuration for a slave node (depends on requirements):
» 4 x 1 TB or 2 TB hard drives, in a JBOD (Just a Bunch of Disks) configuration
» Do not use RAID!
» 2 x quad-core CPUs
» 24-32 GB RAM
» Gigabit Ethernet

Special Configuration
Multiples of (1 hard drive + 2 cores + 6-8 GB RAM) generally work well for many types of applications
Slave Nodes: More Details (RAM)

» Generally, each Map or Reduce task will take 1 GB to 2 GB of RAM
» Slave nodes should not be using virtual memory (swapping)
» Ensure enough RAM is present to run all tasks, plus the DataNode and TaskTracker daemons, plus the operating system

RULE OF THUMB! Total number of tasks = 1.5 x number of processor cores
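The rule of thumb above can be turned into a quick sizing check. The per-task RAM and the daemon/OS overheads below are illustrative assumptions within the guidelines on this slide, not fixed values:

```python
# Sketch: estimate the RAM a slave node needs from the 1.5x-cores rule.
cores = 8                    # e.g. 2 x quad-core CPUs, as recommended earlier
ram_per_task_gb = 2          # upper end of the 1-2 GB per task guideline
daemon_overhead_gb = 2       # DataNode + TaskTracker daemons (assumed)
os_overhead_gb = 2           # operating system (assumed)

task_slots = int(1.5 * cores)    # rule of thumb: 12 concurrent tasks
ram_needed = task_slots * ram_per_task_gb + daemon_overhead_gb + os_overhead_gb
print(f"Task slots: {task_slots}, RAM needed: {ram_needed} GB")
```

For an 8-core node this lands at 28 GB, which is consistent with the 24-32 GB RAM range in the recommended slave configuration.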
Master Node Hardware Recommendations

The master node requires carrier-class hardware (not commodity hardware):
» Dual power supplies
» Dual Ethernet cards (bonded to provide failover)
» RAID-ed hard drives
» At least 32 GB of RAM
Fully Distributed Mode Cluster

Hadoop requires certain ports on each node to be accessible via the network. However, the default iptables firewall rules prohibit access to these ports. To run Hadoop applications, you must make sure these ports are open.

To check the status of iptables, use this command with root privileges:

/etc/init.d/iptables status

You can simply turn iptables off, or at least open these ports:

9000, 9001, 50010, 50020, 50030, 50060, 50070, 50075, 50090
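As a sketch, the rules for the ports above could be generated with a small loop. This prints the iptables commands rather than applying them, since applying firewall rules needs root; review the output, then run it (or pipe it to sh) as root on each node:

```shell
#!/bin/sh
# Print an iptables ACCEPT rule for each port Hadoop needs open.
# Pipe the output to sh (as root) to actually apply the rules.
PORTS="9000 9001 50010 50020 50030 50060 50070 50075 50090"

for port in $PORTS; do
    echo "iptables -I INPUT -p tcp --dport $port -j ACCEPT"
done

# Alternatively, disable iptables entirely (not recommended in production):
echo "/etc/init.d/iptables stop"
```

Opening only the listed ports is safer than turning the firewall off, especially on nodes that are reachable from outside the cluster.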
Hadoop Cluster Modes

Hadoop can run in any of the following three modes:

Standalone (or Local) Mode
» No daemons; everything runs in a single JVM
» Suitable for running MapReduce programs during development
» Has no DFS

Pseudo-Distributed Mode
» Hadoop daemons run on the local machine

Fully-Distributed Mode
» Hadoop daemons run on a cluster of machines
Hadoop Cluster: Create Dedicated User and Group

» Hadoop requires that all nodes in the cluster have exactly the same directory structure in which Hadoop is installed
» It is beneficial to create a dedicated user (e.g. "hadoop") and install Hadoop in its home folder
» You must have root privileges on each node to carry out the following steps
» To change to "root", type "su -" in the terminal and enter the password for "root"

Create the group "hadoopuser":
groupadd hadoopuser

Create the user "hadoop":
useradd -g hadoopuser -s /bin/bash -d /home/hadoop hadoop
where -g specifies that user "hadoop" belongs to group "hadoopuser", -s specifies the shell to use, and -d specifies the home folder for user "hadoop".

Set a password for user "hadoop":
passwd hadoop
Type in the password for user "hadoop" twice, then type "su - hadoop" to switch to user "hadoop".
Configuration Files

Configuration Filename(s) — Description
hadoop-env.sh, yarn-env.sh — Settings for the Hadoop daemons' process environment
core-site.xml — Configuration settings for Hadoop Core, such as I/O settings that are common to both HDFS and YARN
hdfs-site.xml — Configuration settings for the HDFS daemons: the NameNode and the DataNodes
yarn-site.xml — Configuration settings for the ResourceManager and NodeManager
mapred-site.xml — Configuration settings for MapReduce applications
slaves — A list of machines (one per line) that each run a DataNode and NodeManager
Configuration Files (Contd.)

The core functionality and usage of these configuration files are the same in Hadoop 2.0 and 1.0, but many new properties have been added and many have been deprecated. For example, 'fs.default.name' has been deprecated and replaced with 'fs.defaultFS' for YARN in core-site.xml, and 'dfs.nameservices' has been added to enable NameNode High Availability in hdfs-site.xml.

Deprecated Property Name — New Property Name
dfs.data.dir — dfs.datanode.data.dir
dfs.http.address — dfs.namenode.http-address
fs.default.name — fs.defaultFS

In the Hadoop 2.2.0 release, you can use either the old or the new properties: the old property names are deprecated, but still work!

Full list: http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/DeprecatedProperties.html
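To illustrate the new property name, a minimal core-site.xml for Hadoop 2 might look like this (the host and port are placeholder values, not from the slides):

```xml
<?xml version="1.0"?>
<configuration>
  <!-- New-style property; replaces the deprecated fs.default.name -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode-host:9000</value>
  </property>
</configuration>
```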
Add (Commission) DataNodes

1. Update the network addresses in the 'include' files: dfs.include and mapred.include
2. Update the NameNode: hadoop dfsadmin -refreshNodes
3. Update the JobTracker: hadoop mradmin -refreshNodes
4. Update the 'slaves' file
5. Start the DataNode and TaskTracker daemons:
   hadoop-daemon.sh start datanode
   hadoop-daemon.sh start tasktracker
6. Cross-check the Web UI to ensure the successful addition
7. Run the balancer to move HDFS blocks to the new DataNodes
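The commissioning steps can be strung together as a script. This is a sketch for a Hadoop 1 cluster: the NEW_NODE hostname is a placeholder, and RUN=echo makes it a dry run so the commands can be reviewed before execution:

```shell
#!/bin/sh
# Dry-run sketch of commissioning a new DataNode (Hadoop 1).
# Remove the RUN=echo line to actually execute the commands.
RUN=echo
NEW_NODE="datanode5.example.com"   # placeholder hostname

# 1 & 4. Add the new node to the include files and the slaves file
$RUN "echo $NEW_NODE >> dfs.include"
$RUN "echo $NEW_NODE >> mapred.include"
$RUN "echo $NEW_NODE >> slaves"

# 2 & 3. Tell the NameNode and JobTracker to re-read the include files
$RUN hadoop dfsadmin -refreshNodes
$RUN hadoop mradmin -refreshNodes

# 5. On the new node, start the daemons
$RUN hadoop-daemon.sh start datanode
$RUN hadoop-daemon.sh start tasktracker

# 7. Rebalance HDFS blocks onto the new node
$RUN start-balancer.sh
```

Step 6 (checking the Web UI) remains a manual verification between starting the daemons and running the balancer.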
Hadoop Upgrade from 1 to 2

1. Run reports: FSCK, LSR, DFSADMIN
2. Take a backup: configurations, applications, data and meta-data
3. Install the new version of Hadoop
4. Upgrade
5. Run the new reports: FSCK, LSR, DFSADMIN
6. Compare the old and new reports
7. Test the new cluster
8. Finalize the upgrade
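The pre-upgrade reports in step 1 could be captured like this (a sketch: the report directory name is arbitrary, and RUN=echo makes it a dry run that writes the command text instead of real report output):

```shell
#!/bin/sh
# Capture pre-upgrade reports so they can be diffed after the upgrade.
# Set RUN= (empty) to actually run the hadoop commands.
RUN=echo
REPORT_DIR="pre-upgrade-reports"    # arbitrary directory name
mkdir -p "$REPORT_DIR"

# FSCK: file-system health report
$RUN hadoop fsck / > "$REPORT_DIR/fsck.log"
# LSR: recursive listing of every file in HDFS
$RUN hadoop fs -lsr / > "$REPORT_DIR/lsr.log"
# DFSADMIN: cluster capacity and DataNode report
$RUN hadoop dfsadmin -report > "$REPORT_DIR/dfsadmin.log"
```

Running the same three commands after the upgrade (step 5) and diffing the log files is what step 6, "compare the old and new reports", amounts to in practice.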
How it Works?
» LIVE Online Class
» Class Recording in LMS
» 24/7 Post Class Support
» Module Wise Quiz
» Project Work
» Verifiable Certificate
Questions
Course Topics

» Module 1: Hadoop Cluster Administration
» Module 2: Hadoop Architecture and Cluster Setup
» Module 3: Hadoop Cluster: Planning and Managing
» Module 4: Backup, Recovery and Maintenance
» Module 5: Hadoop 2.0 and High Availability
» Module 6: Advanced Topics: QJM, HDFS Federation and Security
» Module 7: Oozie, Hcatalog/Hive and HBase Administration
» Module 8: Project: Hadoop Implementation