Source: cis.csuohio.edu/~sschung/cis612/Lab4_1_CIS612_HDFS...

Run MapReduce sample program “wordcount” on Hadoop File system

1) Set up an RSA public/private key pair to be able to ssh into each node. On each node (both master and slave), run the following commands to generate the node's RSA key pair.
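The exact commands appeared in a screenshot in the original lab; a typical key-generation command for this step (the empty passphrase and default filename are assumptions) is:

```shell
# Generate an RSA key pair with no passphrase, so nodes can ssh
# to each other without a password prompt
ssh-keygen -t rsa -P "" -f $HOME/.ssh/id_rsa
```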

2) Then, authorize the public key by appending it to $HOME/.ssh/authorized_keys using the following command.
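The command itself was shown as a screenshot; a typical form (the chmod step is an assumption, added because sshd usually refuses group/world-writable key files) is:

```shell
# Append the node's public key to its list of authorized keys
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
chmod 600 $HOME/.ssh/authorized_keys

# Verify that passwordless login now works
ssh localhost
```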

3) Hadoop Installation

1. Downloaded Hadoop from the website: http://www.gtlib.gatech.edu/pub/apache/hadoop/common/stable1/
2. Downloaded the stable version hadoop-1.2.1.tar.gz, expanded the package, and moved it to my home directory: /Users/SonalDeshmukh
3. From the command line, added Hadoop to the local directory /usr/local

4) Hadoop version 1.2.1 is installed and saved in /usr/local, as seen below:
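The unpack-and-move steps above can be sketched as the following commands (run from the directory holding the downloaded archive; sudo is an assumption, since /usr/local is usually root-owned):

```shell
# Expand the downloaded Hadoop release
tar -xzf hadoop-1.2.1.tar.gz

# Move the expanded directory into /usr/local
sudo mv hadoop-1.2.1 /usr/local/
```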

5) Hadoop Configuration

There are six configuration files that need to be customized; they are located in /usr/local/hadoop-1.2.1/conf

1. hadoop-env.sh

This file sets environment variables that are used in the scripts to run Hadoop.
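The original showed this file's contents as a screenshot; a minimal sketch of the relevant lines (the JAVA_HOME path is an example and depends on the machine) is:

```shell
# conf/hadoop-env.sh -- environment variables used by the Hadoop scripts
# The path below is an example; point it at the JDK on your machine.
export JAVA_HOME=/Library/Java/Home

# Optional: heap size in MB for the Hadoop daemons
export HADOOP_HEAPSIZE=1000
```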

2. core-site.xml

This file sets the configuration for Hadoop Core, such as the I/O settings of the nodes. It needs to be configured on every node in the cluster.
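The file's contents appeared as a screenshot; a minimal sketch for Hadoop 1.x (the hostname and port are examples) is:

```xml
<!-- conf/core-site.xml: URI of the HDFS namenode -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:9000</value>
  </property>
</configuration>
```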

3. hdfs-site.xml

This file controls the configuration of the Hadoop Distributed File System processes: the namenode, the secondary namenode, and the datanodes. The dfs.replication property controls the number of replicas created when a file is written to HDFS. To utilize the computing power of every node, this value should equal the number of nodes available in the cluster.
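A minimal sketch of the replication setting described above (the value 2 assumes a two-node master/slave cluster, as in this lab):

```xml
<!-- conf/hdfs-site.xml: number of replicas per HDFS block -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>
```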

4. mapred-site.xml

This file controls the configuration of the MapReduce processes: the job tracker and the task trackers. The mapred.job.tracker property must point to the master node, with the correct port, since only the master node runs the job tracker in a Hadoop cluster.
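A minimal sketch of this setting (the hostname and port are examples; the port just has to match across all nodes):

```xml
<!-- conf/mapred-site.xml: address of the job tracker on the master node -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>master:9001</value>
  </property>
</configuration>
```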

6) Test Running Hadoop

Before starting the Hadoop cluster, initialize HDFS first by using the following command.
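The command was shown as a screenshot; for Hadoop 1.x it is typically:

```shell
# Format the HDFS namenode before first use
# (run once from the Hadoop install directory; this erases any existing HDFS data)
bin/hadoop namenode -format
```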

7) Then, to start the Hadoop cluster, execute the following command.
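The command appeared as a screenshot; in Hadoop 1.x the standard startup script is:

```shell
# From the Hadoop install directory: start the HDFS and MapReduce daemons
# on the master and all slave nodes listed in conf/slaves
bin/start-all.sh
```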

8) Verified the Hadoop daemons are running using the jps command:
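On a healthy master node, jps should list the Hadoop daemons; a typical listing (process IDs omitted, and the exact set depends on which daemons run on the node) looks like:

```shell
jps
# Typical daemons reported on the master:
#   NameNode
#   SecondaryNameNode
#   DataNode
#   JobTracker
#   TaskTracker
#   Jps
```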

9) Listed the DFS options using $ bin/hadoop fs

10) Copied the input file into the distributed file system (Shakespeare is a directory) using:

$ bin/hadoop fs -put 5000-8.txt Shakespeare

11) Ran word count with Hadoop using $ bin/hadoop jar hadoop-examples-*.jar wordcount Shakespeare output4:

12) Viewed the output using $ bin/hadoop fs -lsr output4:

13) Viewed the output files on the distributed file system using $ bin/hadoop fs -cat output4/part-r-00000 | sort -k 2 -n -r | less:

14) To stop our Hadoop cluster, the following command is used:
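The command appeared as a screenshot; in Hadoop 1.x it is the counterpart of the startup script:

```shell
# Stop all Hadoop daemons started by start-all.sh, on the master and all slaves
bin/stop-all.sh
```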
