Hadoop Installation


Installing & Running Hadoop on Ubuntu Linux (Single-Node Cluster)

    Prerequisites

    This guide assumes a successful installation of:

    1. Ubuntu Linux
    2. Java 1.6+

    Download Hadoop

    Download Hadoop and extract the contents of the Hadoop package to a location of your choice.

    Adding Hadoop User:

    $ sudo addgroup hadoop
    $ sudo adduser --ingroup hadoop hduser

    This will add the user hduser and the group hadoop to the local machine.
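To confirm the account was created as expected, you can optionally inspect it (the numeric uid/gid shown will depend on your system):

```shell
# Print the uid, gid and group memberships of the new hduser account;
# the hadoop group should appear among its groups
id hduser
```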

    Configuring SSH

    Hadoop requires SSH access to manage its nodes, so we need to configure SSH access to localhost for the hduser user we just created.

    First, generate the keys:

    $ su hduser
    $ ssh-keygen -t rsa -P ""

    The second command creates an RSA key pair with an empty password, so that you don't have to enter a passphrase every time Hadoop interacts with its nodes.

    Output might be like:

    hduser@ubuntu:/home$ ssh-keygen -t rsa -P ""
    Generating public/private rsa key pair.
    Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
    Created directory '/home/hduser/.ssh'.
    Your identification has been saved in /home/hduser/.ssh/id_rsa.
    Your public key has been saved in /home/hduser/.ssh/id_rsa.pub.

    GD 20/06/2013

    The key fingerprint is:
    5d:1f:ac:e2:33:de:ae:ab:86:2d:e0:f6:d7:2b:63:04 hduser@ubuntu
    The key's randomart image is:
    +--[ RSA 2048]----+
    |                 |
    |        .        |
    |       . o       |
    |  E . . o .      |
    |    .S o . .     |
    |  . .. .         |
    | . . + .+        |
    |  o o B..+       |
    | . ..=.+==o      |
    +-----------------+

    Second, enable SSH access to the local machine with the newly created key:

    hduser@ubuntu:/home$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
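One optional but common tightening step: with sshd's default StrictModes setting, key-based login can silently fail if the key directory or files are group- or world-writable, so it can be worth fixing the permissions explicitly:

```shell
# Restrict permissions on the SSH directory and the authorized_keys file;
# sshd ignores keys stored in overly permissive locations when StrictModes is on
chmod 700 $HOME/.ssh
chmod 600 $HOME/.ssh/authorized_keys
```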

    Finally, test the SSH setup by connecting to your local machine as the hduser user:

    hduser@ubuntu:/home$ ssh localhost

    Output might be like:

    The authenticity of host 'localhost (127.0.0.1)' can't be established.
    ECDSA key fingerprint is 52:5f:fd:6f:95:06:48:f7:fe:2e:c8:69:07:7e:93:f2.
    Are you sure you want to continue connecting (yes/no)? yes
    Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
    Welcome to Ubuntu 12.04.2 LTS (GNU/Linux 3.5.0-34-generic i686)

    * Documentation: https://help.ubuntu.com/

    0 packages can be updated. 0 updates are security updates.

    The programs included with the Ubuntu system are free software; the exact distribution terms for each program are described in the individual files in /usr/share/doc/*/copyright.


    Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by applicable law.

    Disabling IPv6

    Option 1: Open /etc/sysctl.conf in an editor and add the following lines to the end of the file:

    # disable ipv6
    net.ipv6.conf.all.disable_ipv6 = 1
    net.ipv6.conf.default.disable_ipv6 = 1
    net.ipv6.conf.lo.disable_ipv6 = 1

    Reboot the machine for the changes to take effect. To check whether IPv6 is enabled on the machine, run:

    $ cat /proc/sys/net/ipv6/conf/all/disable_ipv6

    A return value of 0 means IPv6 is enabled; a value of 1 means it is disabled.

    Option 2: You can also disable IPv6 only for Hadoop. Add the following line to conf/hadoop-env.sh:

    export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true

    Change the Ownership

    Change the owner of the Hadoop files to the hduser user and the hadoop group. Adjust the following commands to the location and version of your Hadoop download.

    $ cd
    $ sudo tar xzf hadoop-x.x.x.tar.gz
    $ sudo mv hadoop-x.x.x hadoop
    $ sudo chown -R hduser:hadoop hadoop

    Update $HOME/.bashrc

    Add the following lines to the end of the $HOME/.bashrc file of user hduser. If you use a shell other than bash, update its corresponding configuration file instead.


    # Set Hadoop-related environment variables
    export HADOOP_HOME=/home/gd/hadoop
    export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386

    # Convenient aliases for Hadoop filesystem commands
    unalias fs &> /dev/null
    alias fs="hadoop fs"
    unalias hls &> /dev/null
    alias hls="fs -ls"

    # View the first 1000 lines of an LZO-compressed file in HDFS
    # (requires the 'lzop' command to be installed)
    lzohead() {
        hadoop fs -cat $1 | lzop -dc | head -1000 | less
    }

    # Add the Hadoop bin/ directory to PATH
    export PATH=$PATH:$HADOOP_HOME/bin
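The new settings only apply to shells started after the edit; to pick them up in the current session, something like the following can be used (the version check only succeeds once the PATH addition above is in effect):

```shell
# Reload the updated shell configuration in the current session
source $HOME/.bashrc

# Sanity check: the hadoop binary should now be found on the PATH
hadoop version
```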

    Configuration

    More information is available on the Hadoop Wiki.

    hadoop-env.sh

    The only required environment variable we have to configure for Hadoop in this tutorial is JAVA_HOME. Open hadoop/conf/hadoop-env.sh in the editor of your choice and set the JAVA_HOME environment variable to your JDK/JRE installation directory.

    Create the directory and set the required ownership and permissions:

    $ sudo mkdir -p /app/hadoop/tmp
    $ sudo chown hduser:hadoop /app/hadoop/tmp
    # ...and if you want to tighten up security, chmod from 755 to 750...
    $ sudo chmod 750 /app/hadoop/tmp

    If you forget to set the required ownership and permissions, you will see a java.io.IOException when you try to format the NameNode.
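Before moving on, a quick way to confirm the ownership and mode of the temp directory (the exact listing depends on your system):

```shell
# Show owner, group and permissions of the Hadoop temp directory;
# it should be owned by hduser:hadoop with mode drwxr-x--- (750)
ls -ld /app/hadoop/tmp
```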

    In file conf/core-site.xml:

    Add the following snippets between the <configuration> ... </configuration> tags in the respective configuration XML file.

    <property>
      <name>hadoop.tmp.dir</name>
      <value>/app/hadoop/tmp</value>
      <description>A base for other temporary directories.</description>
    </property>

    <property>
      <name>fs.default.name</name>
      <value>hdfs://localhost:54310</value>
      <description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.</description>
    </property>

    In file conf/mapred-site.xml:

    <property>
      <name>mapred.job.tracker</name>
      <value>localhost:54311</value>
      <description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
    </property>

    In file conf/hdfs-site.xml:

    <property>
      <name>dfs.replication</name>
      <value>1</value>
      <description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time.</description>
    </property>

    Formatting the HDFS filesystem via the NameNode

    You need to do this the first time you set up a Hadoop cluster. Do not format a running Hadoop filesystem, as this will erase all data currently in the cluster (in HDFS)!

    Run the command

    hduser@ubuntu:~$ /usr/local/hadoop/bin/hadoop namenode -format

    The output will look like this:

    hduser@ubuntu:/home/gd$ hadoop/bin/hadoop namenode -format
    Warning: $HADOOP_HOME is deprecated.

    13/06/20 18:37:20 INFO namenode.NameNode: STARTUP_MSG:
    /************************************************************


    STARTUP_MSG: Starting NameNode
    STARTUP_MSG:   host = ubuntu/127.0.1.1
    STARTUP_MSG:   args = [-format]
    STARTUP_MSG:   version = 1.0.4
    STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0 -r 1393290; compiled by 'hortonfo' on Wed Oct 3 05:13:58 UTC 2012
    ************************************************************/
    13/06/20 18:37:21 INFO util.GSet: VM type = 32-bit
    13/06/20 18:37:21 INFO util.GSet: 2% max memory = 19.33375 MB
    13/06/20 18:37:21 INFO util.GSet: capacity = 2^22 = 4194304 entries
    13/06/20 18:37:21 INFO util.GSet: recommended=4194304, actual=4194304
    13/06/20 18:37:24 INFO namenode.FSNamesystem: fsOwner=hduser
    13/06/20 18:37:24 INFO namenode.FSNamesystem: supergroup=supergroup
    13/06/20 18:37:24 INFO namenode.FSNamesystem: isPermissionEnabled=true
    13/06/20 18:37:24 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100
    13/06/20 18:37:24 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
    13/06/20 18:37:24 INFO namenode.NameNode: Caching file names occuring more than 10 times
    13/06/20 18:37:24 INFO common.Storage: Image file of size 112 saved in 0 seconds.
    13/06/20 18:37:24 INFO common.Storage: Storage directory /tmp/hadoop-hduser/dfs/name has been successfully formatted.
    13/06/20 18:37:24 INFO namenode.NameNode: SHUTDOWN_MSG:
    /************************************************************
    SHUTDOWN_MSG: Shutting down NameNode at ubuntu/127.0.1.1
    ************************************************************/

    Starting the single-node cluster

    hduser@ubuntu:/home/gd$ hadoop/bin/start-all.sh

    This will start a NameNode, a DataNode, a JobTracker and a TaskTracker on your machine.

    The output will look like this:

    hduser@ubuntu:/home/gd$ hadoop/bin/start-all.sh
    Warning: $HADOOP_HOME is deprecated.

    starting namenode, logging to /home/gd/hadoop/libexec/../logs/hadoop-hduser-namenode-ubuntu.out
    localhost: starting datanode, logging to /home/gd/hadoop/libexec/../logs/hadoop-hduser-datanode-ubuntu.out
    localhost: starting secondarynamenode, logging to /home/gd/hadoop/libexec/../logs/hadoop-hduser-secondarynamenode-ubuntu.out
    starting jobtracker, logging to /home/gd/hadoop/libexec/../logs/hadoop-hduser-jobtracker-ubuntu.out
    localhost: starting tasktracker, logging to /home/gd/hadoop/libexec/../logs/hadoop-hduser-tasktracker-ubuntu.out
    hduser@ubuntu:/home/gd$ cd hadoop


    Testing

    Tools for checking whether the expected Hadoop processes are running:

    Option 1: jps

    hduser@ubuntu:/usr/local/hadoop$ jps

    The output might be like:

    2287 TaskTracker
    2149 JobTracker
    1938 DataNode
    2085 SecondaryNameNode
    2349 Jps
    1788 NameNode

    Option 2: netstat

    hduser@ubuntu:~$ sudo netstat -plten | grep java

    Output might be like:

    tcp 0 0 0.0.0.0:50070   0.0.0.0:* LISTEN 1001 9236 2471/java
    tcp 0 0 0.0.0.0:50010   0.0.0.0:* LISTEN 1001 9998 2628/java
    tcp 0 0 0.0.0.0:48159   0.0.0.0:* LISTEN 1001 8496 2628/java
    tcp 0 0 0.0.0.0:53121   0.0.0.0:* LISTEN 1001 9228 2857/java
    tcp 0 0 127.0.0.1:54310 0.0.0.0:* LISTEN 1001 8143 2471/java
    tcp 0 0 127.0.0.1:54311 0.0.0.0:* LISTEN 1001 9230 2857/java
    tcp 0 0 0.0.0.0:59305   0.0.0.0:* LISTEN 1001 8141 2471/java
    tcp 0 0 0.0.0.0:50060   0.0.0.0:* LISTEN 1001 9857 3005/java
    tcp 0 0 0.0.0.0:49900   0.0.0.0:* LISTEN 1001 9037 2785/java
    tcp 0 0 0.0.0.0:50030   0.0.0.0:* LISTEN 1001 9773 2857/java
    hduser@ubuntu:~$
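The listening ports above correspond to the stock Hadoop 1.x web interfaces; assuming those defaults are unchanged in your setup, a third way to check the daemons is to probe the web UIs with curl or a browser:

```shell
# NameNode web UI (HDFS health, capacity, live nodes)
curl -s -o /dev/null http://localhost:50070/ && echo "NameNode UI up"
# JobTracker web UI (running and completed MapReduce jobs)
curl -s -o /dev/null http://localhost:50030/ && echo "JobTracker UI up"
# TaskTracker web UI (tasks on this node)
curl -s -o /dev/null http://localhost:50060/ && echo "TaskTracker UI up"
```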

    Stopping the single-node cluster

    Run the command:

    hduser@ubuntu:~$ /usr/local/hadoop/bin/stop-all.sh

    Running a MapReduce job

    Use the WordCount example job, which reads text files and counts how often words occur.
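As a sketch of what running WordCount looks like on Hadoop 1.x (the local input path, the examples jar name, and the HDFS paths below are illustrative and depend on your Hadoop version and layout):

```shell
# Copy some local text files into HDFS (the /tmp/texts path is a placeholder)
hadoop fs -mkdir /user/hduser/input
hadoop fs -copyFromLocal /tmp/texts/*.txt /user/hduser/input

# Run the WordCount example bundled with Hadoop (jar name varies by version)
hadoop jar $HADOOP_HOME/hadoop-examples-*.jar wordcount /user/hduser/input /user/hduser/output

# Inspect the first lines of the result
hadoop fs -cat /user/hduser/output/part-r-00000 | head
```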
