HADOOP MULTIPLE NODES CLUSTER SETUP AND
EXECUTION OF MAP REDUCE PROGRAMS
STUDENT NAME: Harinath Selvaraj
STUDENT NUMBER: C00235324
DEPARTMENT: Department of Computing and Networking
COURSE NAME: Master's in Data Science
COURSE CODE: CW_KCDAT_M
SUPERVISOR: Micheal
DATE OF SUBMISSION: 28 APRIL 2019
WORD COUNT: 1157
1. INTRODUCTION
Apache Hadoop is a framework written in Java that provides features similar to the Google File System (GFS) and the MapReduce computing paradigm. It is primarily used to run programs on large clusters to enable parallel processing. It is generally deployed on low-cost hardware because of its fault-tolerant nature and because it is designed to handle large data sets (Apache, n.d.-b). This report demonstrates how to set up a multi-node Hadoop cluster with 3 nodes (1 master and 2 slaves), run MapReduce jobs written in Java and Python, and validate their outputs.
2. SETUP INSTRUCTIONS
The Hadoop cluster was initially set up on a single node and then extended to support multiple nodes, i.e. 1 master and 2 slave nodes.
Instructions to set up both the single-node and multi-node clusters were obtained from the links below:
[1] Single Node Hadoop Cluster -
https://xuri.me/2015/03/09/setup-hadoop-on-ubuntu-single-node-cluster.html
[2] Multiple Node Hadoop Cluster -
https://xuri.me/2016/03/22/setup-hadoop-on-ubuntu-multi-node-cluster.html
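The essential multi-node step in [2] is making the three machines reachable by hostname and listing the slave nodes on the master. A minimal sketch of the relevant files is shown below, assuming the hostnames master, slave1 and slave2; the IP addresses are placeholders, and in Hadoop 3.x the slave list lives in the workers file:
# /etc/hosts (on every node; example addresses)
192.168.1.10 master
192.168.1.11 slave1
192.168.1.12 slave2
# $HADOOP_HOME/etc/hadoop/workers (on the master node)
slave1
slave2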
The existing Ubuntu virtual machines set up for previous lab exercises were deleted and a fresh copy of the Ubuntu machine was created from the snapshot.
In order to check whether the nodes were functioning properly after completion of the multi-node cluster, Ubuntu Desktop was installed using the commands below:
sudo apt-get update
sudo apt-get install ubuntu-desktop
The setup instructions for installing Hadoop on a single node were followed as given in link [1]. The Hadoop setup file was downloaded from the Apache Hadoop website (Apache, n.d.-a) via the link below:
http://ftp.heanet.ie/mirrors/www.apache.org/dist/hadoop/common/hadoop-3.1.2/hadoop-3.1.2.tar.gz
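A minimal sketch of the download-and-extract step, assuming the archive is unpacked under /usr/local and a symlink /usr/local/hadoop is used as $HADOOP_HOME:
wget http://ftp.heanet.ie/mirrors/www.apache.org/dist/hadoop/common/hadoop-3.1.2/hadoop-3.1.2.tar.gz
sudo tar -xzf hadoop-3.1.2.tar.gz -C /usr/local
sudo ln -s /usr/local/hadoop-3.1.2 /usr/local/hadoop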
After the single-node setup succeeded, the machine was cloned and the remaining instructions to set up a multi-node cluster were followed from link [2]. The image below shows the presence of 3 virtual machines – 1 master node and 2 slave nodes.
Figure 1. 3 Node setup for Hadoop Cluster
Figures 2, 3 and 4 show that the Master, Slave 1 and Slave 2 nodes are up and running.
Figure 2. Hadoop Running on Master Node
Figure 3. Hadoop Running on Slave 1 Node
Figure 4. Hadoop Running on Slave 2 Node
The link http://master:8088 was accessed from the Master VM desktop in order to check whether the nodes were running.
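The same check can also be performed from the command line; both commands below ship with the standard Hadoop distribution and report the live DataNodes and NodeManagers respectively:
hdfs dfsadmin -report
yarn node -list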
Figure 5. Master, Slave1 and Slave2 nodes are visible in the master VM
3. RUNNING A HADOOP MAP REDUCE JOB
MapReduce is the heart of Apache Hadoop. It is a programming paradigm that enables massive scalability across hundreds or thousands of servers in a Hadoop cluster. The MapReduce concept is fairly simple to understand for those who are familiar with clustered scale-out data processing solutions (IBM, n.d.). As a concrete example, for a word count the map phase turns the line "to be or not to be" into the pairs (to, 1), (be, 1), (or, 1), (not, 1), (to, 1), (be, 1); the framework sorts and groups these pairs by key, and the reduce phase sums each group to give (be, 2), (not, 1), (or, 1), (to, 2).
3.1 PERFORMING WORD COUNT USING HADOOP
An input directory is created in HDFS to hold the files required for processing:
hadoop fs -mkdir /input
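The word-count job needs at least one text file in this directory. A sketch of uploading one, where sample.txt is a hypothetical local file:
hdfs dfs -put sample.txt /input/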
The command below performs a word count on all the files present in /input, using the example program in the JAR file $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.2.jar:
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples*.jar
wordcount /input /output
The job output is shown in the screenshot below.
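The results can also be inspected directly in HDFS; a sketch, noting that the exact part-file name may differ:
hadoop fs -ls /output
hadoop fs -cat /output/part-r-00000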
NOTE: The output directory has to be deleted using the command below before running the next Hadoop job:
hadoop fs -rm -R /output
ISSUE – The Java exception below was thrown when a new Hadoop job was started.
RESOLUTION – To fix the issue, the input folder was deleted with the command below,
hadoop fs -rm -R /input
and then created again with:
hadoop fs -mkdir /input
3.2 RUNNING A PYTHON MAP REDUCE JOB USING HADOOP
Since Hadoop is written in Java, programs in other languages such as Python cannot be executed directly as MapReduce jobs. The Hadoop Streaming API is used to run Python code as a MapReduce job: it passes data to the map and reduce programs via STDIN (standard input) and reads their results back via STDOUT (standard output). The programs therefore use sys.stdin to read their input and print to sys.stdout to emit their output.
3.2.1 PYTHON CODE
The Python code below is equivalent to the Java word-count example.
Mapper.py
#!/usr/bin/env python
"""mapper.py"""
import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        print('%s\t%s' % (word, 1))
Reducer.py
#!/usr/bin/env python
"""reducer.py"""
import sys

current_word = None
current_count = 0
word = None

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)
    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        continue
    # this IF-switch only works because Hadoop sorts map output
    # by key (here: word) before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # write result to STDOUT
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# do not forget to output the last word if needed!
if current_word == word:
    print('%s\t%s' % (current_word, current_count))
Execute permissions were granted to the mapper.py and reducer.py files so that they can be run by the Hadoop job:
chmod +x /home/hduser/mapper.py
chmod +x /home/hduser/reducer.py
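Before submitting the job to the cluster, the scripts can be tested locally with a shell pipeline that imitates Hadoop's sort-and-shuffle phase (a sketch; the sample text is arbitrary):
echo "foo foo quux labs foo bar quux" | python /home/hduser/mapper.py | sort -k1,1 | python /home/hduser/reducer.py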
3.2.2 INPUT FILES REQUIRED FOR PROCESSING
The flat files were obtained from the links below:
http://www.gutenberg.org/cache/epub/20417/pg20417.txt (750 KB)
http://www.gutenberg.org/files/5000/5000-8.txt (1.4 MB)
http://www.gutenberg.org/files/4300/4300-0.txt (1.5 MB)
The files were copied to HDFS using the commands below:
hdfs dfs -copyFromLocal -p 5000-8.txt /input/
hdfs dfs -copyFromLocal -p 4300-0.txt /input/
hdfs dfs -copyFromLocal -p pg20417.txt /input/
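The upload can be verified by listing the input directory and checking the file sizes (a sketch):
hdfs dfs -ls /input
hdfs dfs -du -h /input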
3.2.3 RUNNING THE MAP REDUCE JOB
ISSUE:
The Hadoop version I had installed did not include the Hadoop streaming library required to run the Python job. This was confirmed by running a search on the Hadoop file path; the search results shown below are only documentation files, not the actual library I was looking for.
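A command along the following lines performs that search; note that in some Hadoop 3.x layouts the jar actually ships under share/hadoop/tools/lib rather than share/hadoop/mapreduce:
find $HADOOP_HOME/share/hadoop -name 'hadoop-streaming*.jar'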
Therefore, the Hadoop streaming JAR was downloaded from the link below and copied to the $HADOOP_HOME/share/hadoop/mapreduce/ path:
http://www.java2s.com/Code/Jar/h/Downloadhadoopstreamingjar.htm
The command below was executed to run the Hadoop job:
hadoop \
jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-streaming.jar \
-mapper "python /home/hduser1/mapper.py" \
-reducer "python /home/hduser1/reducer.py" \
-input "/input/*" \
-output "/output"
Run Screenshot
Output Screenshot
4. CONCLUSION
The following activities were successfully performed:
1) Install a single-node Hadoop cluster
2) Install a multi-node Hadoop cluster
3) Run a MapReduce program in Java to find the word count across all the files inside a directory
4) Run a MapReduce program in Python to find the word count across all the files inside a directory
Running the MapReduce programs helped me understand how tasks are executed in parallel on the slave virtual machines. This exercise gave me the confidence to set up a Hadoop cluster and run MapReduce programs in Python seamlessly.
REFERENCES
Apache. (n.d.-a). Apache Hadoop. Retrieved March 27, 2019, from http://hadoop.apache.org/releases.html
Apache. (n.d.-b). Welcome to Apache Hadoop! Hadoop.Apache.Org. Retrieved from http://hadoop.apache.org
IBM. (n.d.). What is MapReduce? | IBM Analytics. Retrieved March 27, 2019, from https://www.ibm.com/analytics/hadoop/mapreduce