Top 5 Hadoop Admin Tasks
For more details please contact us:
US: 1800 275 9730 (toll free) | INDIA: +91 88808 62004
Email Us: [email protected]
For queries, post on Twitter @edurekaIN or Facebook /edurekaIN using #askEdureka
View the Hadoop Administration course at www.edureka.co/hadoop-admin
Objectives
At the end of this module, you will be able to:
» Understand cluster planning
» Understand a fully distributed Hadoop cluster setup with two nodes
» Add further nodes to a running cluster
» Upgrade an existing Hadoop cluster from Hadoop 1 to Hadoop 2
» Understand Active NameNode failure and how the passive NameNode takes over
Hadoop Cluster: Thinking About The Problem

Single Machine
» Great for testing and developing
» Not a practical implementation for large amounts of data

Small Cluster
» Initially four to six nodes
» As the volume of data grows, more nodes can easily be added

Large Cluster
Ways of deciding when the cluster needs to grow:
» Increasing amount of computation power needed
» Increasing amount of data which needs to be stored
» Increasing amount of memory needed to process tasks
Hadoop Cluster: A Typical Use Case

NameNode
RAM: 64 GB, Hard disk: 1 TB, Processor: Xeon with 8 cores, Ethernet: 3 x 10 GB/s, OS: 64-bit CentOS, Power: redundant power supply

Secondary NameNode
RAM: 32 GB, Hard disk: 1 TB, Processor: Xeon with 4 cores, Ethernet: 3 x 10 GB/s, OS: 64-bit CentOS, Power: redundant power supply

DataNode (each)
RAM: 16 GB, Hard disk: 6 x 2 TB, Processor: Xeon with 2 cores, Ethernet: 3 x 10 GB/s, OS: 64-bit CentOS
Cluster Growth Based On Storage Capacity

Sizing cluster growth on storage capacity is often a good method to use.

» Data grows by approximately 5 TB per week
» HDFS is set up to replicate each block three times
» Thus, 15 TB of extra storage space is required per week
» Assume overheads (temporary and intermediate data) of about 30%
» Assuming machines with 5 x 3 TB hard drives, this equates to roughly one new machine required each week
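The back-of-the-envelope sizing above can be sketched in a few lines. The figures (5 TB/week growth, 3x replication, 30% overhead, 5 x 3 TB drives per machine) come straight from this slide:

```python
# Weekly capacity-planning sketch using the figures from the slide.
weekly_data_tb = 5          # raw data growth per week
replication = 3             # HDFS replication factor
overhead = 0.30             # temp/intermediate data overhead
drives_per_node = 5
drive_size_tb = 3

raw_needed = weekly_data_tb * replication          # 15 TB of replicated data
total_needed = raw_needed * (1 + overhead)         # 19.5 TB with overhead
node_capacity = drives_per_node * drive_size_tb    # 15 TB per machine

machines_per_week = total_needed / node_capacity
print(f"Extra storage per week: {total_needed:.1f} TB")
print(f"New machines per week:  {machines_per_week:.1f}")
```

With overhead included, the answer is closer to 1.3 machines per week than exactly one, which is why the slide says "approximately".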
Slave Nodes: Recommended Configuration

Higher-performance vs. lower-performance components: save the money, buy more nodes!
"A cluster with more nodes performs better than one with fewer, slightly faster nodes."

General Configuration
A general 'base' configuration for a slave node (depends on requirements):
» 4 x 1 TB or 2 TB hard drives, in a JBOD (Just a Bunch of Disks) configuration
» Do not use RAID!
» 2 x quad-core CPUs
» 24-32 GB RAM
» Gigabit Ethernet

Special Configuration
Multiples of (1 hard drive + 2 cores + 6-8 GB RAM) generally work well for many types of applications
Slave Nodes: More Details (RAM)

» Generally, each Map or Reduce task will take 1 GB to 2 GB of RAM
» Slave nodes should not be using virtual memory (swapping)
» Ensure enough RAM is present to run all tasks, plus the DataNode and TaskTracker daemons, plus the operating system

RULE OF THUMB! Total number of tasks = 1.5 x number of processor cores
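The rule of thumb above can be turned into a quick sizing check. The per-task RAM and the daemon/OS overheads below are illustrative assumptions within the guidelines on this slide, not fixed values:

```python
# Sketch: estimate the RAM a slave node needs from the 1.5x-cores rule.
cores = 8                    # e.g. 2 x quad-core CPUs, as recommended earlier
ram_per_task_gb = 2          # upper end of the 1-2 GB per task guideline
daemon_overhead_gb = 2       # DataNode + TaskTracker daemons (assumed)
os_overhead_gb = 2           # operating system (assumed)

task_slots = int(1.5 * cores)    # rule of thumb: 12 concurrent tasks
ram_needed = task_slots * ram_per_task_gb + daemon_overhead_gb + os_overhead_gb
print(f"Task slots: {task_slots}, RAM needed: {ram_needed} GB")
```

For an 8-core node this lands at 28 GB, which is consistent with the 24-32 GB RAM range in the recommended slave configuration.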
Master Node Hardware Recommendations

The master node requires carrier-class hardware (not commodity hardware):
» Dual power supplies
» Dual Ethernet cards (bonded to provide failover)
» RAID-ed hard drives
» At least 32 GB of RAM
Fully Distributed Mode Cluster

Hadoop requires certain ports on each node to be accessible via the network. However, the default iptables firewall rules prohibit access to these ports. To run Hadoop applications, you must make sure these ports are open.

To check the status of iptables, use this command with root privileges:

/etc/init.d/iptables status

You can simply turn iptables off, or at least open these ports:

9000, 9001, 50010, 50020, 50030, 50060, 50070, 50075, 50090
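As a sketch, the rules for the ports above could be generated with a small loop. This prints the iptables commands rather than applying them, since applying firewall rules needs root; review the output, then run it (or pipe it to sh) as root on each node:

```shell
#!/bin/sh
# Print an iptables ACCEPT rule for each port Hadoop needs open.
# Pipe the output to sh (as root) to actually apply the rules.
PORTS="9000 9001 50010 50020 50030 50060 50070 50075 50090"

for port in $PORTS; do
    echo "iptables -I INPUT -p tcp --dport $port -j ACCEPT"
done

# Alternatively, disable iptables entirely (not recommended in production):
echo "/etc/init.d/iptables stop"
```

Opening only the listed ports is safer than turning the firewall off, especially on nodes that are reachable from outside the cluster.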
Hadoop Cluster Modes

Hadoop can run in any of the following three modes:

Standalone (or Local) Mode
» No daemons; everything runs in a single JVM
» Suitable for running MapReduce programs during development
» Has no DFS

Pseudo-Distributed Mode
» Hadoop daemons run on the local machine

Fully-Distributed Mode
» Hadoop daemons run on a cluster of machines
Hadoop Cluster: Create Dedicated User and Group

» Hadoop requires that all nodes in the cluster have exactly the same directory structure in which Hadoop is installed
» It is beneficial to create a dedicated user (e.g. "hadoop") and install Hadoop in its home folder
» You must have root privileges on each node to carry out the following steps
» To change to "root", type "su -" in the terminal and enter the password for "root"

Create the group "hadoopuser":
groupadd hadoopuser

Create the user "hadoop":
useradd -g hadoopuser -s /bin/bash -d /home/hadoop hadoop
where -g specifies that user "hadoop" belongs to group "hadoopuser", -s specifies the shell to use, and -d specifies the home folder for user "hadoop".

Set a password for user "hadoop":
passwd hadoop
Type in the password for user "hadoop" twice, then type "su - hadoop" to switch to user "hadoop".
Configuration Files

Configuration Filename(s) — Description
hadoop-env.sh, yarn-env.sh — Settings for the Hadoop daemons' process environment
core-site.xml — Configuration settings for Hadoop Core, such as I/O settings that are common to both HDFS and YARN
hdfs-site.xml — Configuration settings for the HDFS daemons: the NameNode and the DataNodes
yarn-site.xml — Configuration settings for the ResourceManager and NodeManager
mapred-site.xml — Configuration settings for MapReduce applications
slaves — A list of machines (one per line) that each run a DataNode and NodeManager
Configuration Files (Contd.)

The core functionality and usage of these configuration files are the same in Hadoop 2.0 and 1.0, but many new properties have been added and many have been deprecated. For example, 'fs.default.name' has been deprecated and replaced with 'fs.defaultFS' for YARN in core-site.xml, and 'dfs.nameservices' has been added to enable NameNode High Availability in hdfs-site.xml.

Deprecated Property Name — New Property Name
dfs.data.dir — dfs.datanode.data.dir
dfs.http.address — dfs.namenode.http-address
fs.default.name — fs.defaultFS

In the Hadoop 2.2.0 release, you can use either the old or the new properties: the old property names are deprecated, but still work!

Full list: http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/DeprecatedProperties.html
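To illustrate the new property name, a minimal core-site.xml for Hadoop 2 might look like this (the host and port are placeholder values, not from the slides):

```xml
<?xml version="1.0"?>
<configuration>
  <!-- New-style property; replaces the deprecated fs.default.name -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode-host:9000</value>
  </property>
</configuration>
```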
Add (Commission) DataNodes

1. Update the network addresses in the 'include' files: dfs.include and mapred.include
2. Update the NameNode: hadoop dfsadmin -refreshNodes
3. Update the JobTracker: hadoop mradmin -refreshNodes
4. Update the 'slaves' file
5. Start the DataNode and TaskTracker daemons:
   hadoop-daemon.sh start datanode
   hadoop-daemon.sh start tasktracker
6. Cross-check the Web UI to ensure the successful addition
7. Run the balancer to move HDFS blocks to the new DataNodes
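The commissioning steps can be strung together as a script. This is a sketch for a Hadoop 1 cluster: the NEW_NODE hostname is a placeholder, and RUN=echo makes it a dry run so the commands can be reviewed before execution:

```shell
#!/bin/sh
# Dry-run sketch of commissioning a new DataNode (Hadoop 1).
# Remove the RUN=echo line to actually execute the commands.
RUN=echo
NEW_NODE="datanode5.example.com"   # placeholder hostname

# 1 & 4. Add the new node to the include files and the slaves file
$RUN "echo $NEW_NODE >> dfs.include"
$RUN "echo $NEW_NODE >> mapred.include"
$RUN "echo $NEW_NODE >> slaves"

# 2 & 3. Tell the NameNode and JobTracker to re-read the include files
$RUN hadoop dfsadmin -refreshNodes
$RUN hadoop mradmin -refreshNodes

# 5. On the new node, start the daemons
$RUN hadoop-daemon.sh start datanode
$RUN hadoop-daemon.sh start tasktracker

# 7. Rebalance HDFS blocks onto the new node
$RUN start-balancer.sh
```

Step 6 (checking the Web UI) remains a manual verification between starting the daemons and running the balancer.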
Hadoop Upgrade from 1 to 2

1. Run reports: FSCK, LSR, DFSADMIN
2. Take a backup: configurations, applications, data and meta-data
3. Install the new version of Hadoop
4. Upgrade
5. Run the new reports: FSCK, LSR, DFSADMIN
6. Compare the old and new reports
7. Test the new cluster
8. Finalize the upgrade
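The pre-upgrade reports in step 1 could be captured like this (a sketch: the report directory name is arbitrary, and RUN=echo makes it a dry run that writes the command text instead of real report output):

```shell
#!/bin/sh
# Capture pre-upgrade reports so they can be diffed after the upgrade.
# Set RUN= (empty) to actually run the hadoop commands.
RUN=echo
REPORT_DIR="pre-upgrade-reports"    # arbitrary directory name
mkdir -p "$REPORT_DIR"

# FSCK: file-system health report
$RUN hadoop fsck / > "$REPORT_DIR/fsck.log"
# LSR: recursive listing of every file in HDFS
$RUN hadoop fs -lsr / > "$REPORT_DIR/lsr.log"
# DFSADMIN: cluster capacity and DataNode report
$RUN hadoop dfsadmin -report > "$REPORT_DIR/dfsadmin.log"
```

Running the same three commands after the upgrade (step 5) and diffing the log files is what step 6, "compare the old and new reports", amounts to in practice.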
How it Works?
» LIVE Online Class
» Class Recording in LMS
» 24/7 Post Class Support
» Module Wise Quiz
» Project Work
» Verifiable Certificate
Questions
Course Topics

» Module 1: Hadoop Cluster Administration
» Module 2: Hadoop Architecture and Cluster Setup
» Module 3: Hadoop Cluster: Planning and Managing
» Module 4: Backup, Recovery and Maintenance
» Module 5: Hadoop 2.0 and High Availability
» Module 6: Advanced Topics: QJM, HDFS Federation and Security
» Module 7: Oozie, Hcatalog/Hive and HBase Administration
» Module 8: Project: Hadoop Implementation