

Alex Chengelis

2632220

CIS-612

LAB 4_1

1. Create a new virtual machine with Ubuntu. I am using VMware Player to do this. Fill in the information it asks for and let the installer run.


2. Download the appropriate Java and Hadoop files.

I am using Hadoop 2.7.3 since it is the latest stable release. You can either use the website to download it or use curl. (Note: closer.cgi is the Apache mirror-selection page; if curl fetches an HTML page instead of the archive, substitute a direct mirror URL.)

curl -O http://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz

For Java, go to this page: http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html

and download the Linux tar.gz.


Place both the Hadoop and Java archives in the Downloads folder.

3. Configure the SSH server.

sudo apt-get update

sudo apt-get install openssh-server


4. Configure the password-less ssh login.

cd

ssh-keygen -t rsa -P ""

cat ./.ssh/id_rsa.pub >> ./.ssh/authorized_keys

chmod 600 ~/.ssh/authorized_keys

## then restart the SSH service

sudo service ssh restart


5. Standalone Mode setup (you start with this and add more and more functionality). Start by extracting the downloaded files.

cd Downloads

tar xzvf hadoop-2.7.3.tar.gz

After running the tar command the terminal will quickly fill up with the names of the extracted files.

Verify that Hadoop has been extracted.


6. Create soft links (lowercase hadoop, so the link matches the ~/hadoop paths used in later steps)

cd

ln -s ./Downloads/hadoop-2.7.3/ ./hadoop

Create a similar link for the extracted JDK so that the JAVA_HOME set in step 8 resolves, e.g. ln -s ./Downloads/jdk1.8.0_<version>/ ./jdk (use your extracted JDK folder name).

7. Configure .bashrc

cd

vi ./.bashrc

export HADOOP_HOME=/home/alex/hadoop

export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

8. Configure Hadoop’s hadoop-env.sh file

cd

vi ./hadoop/etc/hadoop/hadoop-env.sh

export JAVA_HOME=/home/alex/jdk

9. Run a Hadoop job on a standalone cluster. First exit and restart the terminal. Then type the hadoop command.


If the hadoop command prints its usage text, that is a sign that our installation is good so far.

Run a Hadoop job

Create a testhadoop directory

Create input directory inside testhadoop

Create some input files (the .xml files)

Run MapReduce example job

View the output directory using cat command

cd

mkdir testhadoop

cd testhadoop

mkdir input

cp ~/hadoop/etc/hadoop/*.xml input

hadoop jar ~/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar grep input output 'dfs[a-z.]+'

cat output/*

You’ll see some output in the terminal
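What the example grep job does can be sketched in plain Python. This is a simplified, single-process analogue for illustration, not Hadoop's actual implementation: the map phase emits every substring matching dfs[a-z.]+, and the reduce phase sums the counts per match.

```python
import re
from collections import Counter

# Simplified stand-in for the "grep" job in hadoop-mapreduce-examples.
PATTERN = re.compile(r"dfs[a-z.]+")

def grep_job(documents):
    counts = Counter()
    for text in documents:                   # one "document" per input file
        for match in PATTERN.findall(text):  # map: emit each matching string
            counts[match] += 1               # reduce: sum occurrences per key
    # the example job reports matches ordered by frequency, highest first
    return sorted(counts.items(), key=lambda kv: -kv[1])

sample = ["<name>dfs.replication</name>", "dfs.replication dfs.permissions"]
print(grep_job(sample))  # [('dfs.replication', 2), ('dfs.permissions', 1)]
```

Running it over the copied .xml files would produce the same kind of (match, count) pairs you see in output/.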


Finally check the output

This is working.

10. Now to transform this into Pseudo-Distributed Mode, without a YARN setup to start.

a. Configure core-site.xml and hdfs-site.xml

cd

vi ./hadoop/etc/hadoop/core-site.xml

## adding these lines to the file ##

<configuration>

<property>

<name>fs.defaultFS</name>

<value>hdfs://10.1.37.12:9000</value>


</property>

</configuration>

vi ./hadoop/etc/hadoop/hdfs-site.xml

## adding these lines to the file ##

<configuration>

<property>

<name>dfs.replication</name>

<value>1</value>

</property>

</configuration>

Replace the IP in fs.defaultFS with your own machine's address (the ifconfig command will show it).

11. Format the namenode

hdfs namenode -format


12. Start/Stop Hadoop cluster

$ start-dfs.sh

13. Create a user on the HDFS system

$ hdfs dfs -mkdir /user

$ hdfs dfs -mkdir /user/alex

Put some input files into HDFS:

$ hdfs dfs -put ~/hadoop/etc/hadoop input


14. Run a Hadoop job now

$ hadoop jar ~/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar grep input output 'dfs[a-z.]+'

Check the output

$ hdfs dfs -cat output/*


15. Since everything is working so far, we are going to extend our Pseudo-Distributed Mode with a YARN setup.

a. Configure mapred-site.xml and yarn-site.xml

$ cd

$ nano ./hadoop/etc/hadoop/mapred-site.xml

Add the following lines

<configuration>

<property>

<name>mapreduce.framework.name</name>

<value>yarn</value>

</property>

</configuration>

$ nano ./hadoop/etc/hadoop/yarn-site.xml

Add the following lines

<configuration>

<property>

<name>yarn.nodemanager.aux-services</name>

<value>mapreduce_shuffle</value>

</property>

</configuration>
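The mapreduce_shuffle auxiliary service is what moves map outputs to the reducers, grouping all values for a key together. A rough single-process Python analogue of that shuffle step (illustrative only; the names here are mine, not Hadoop's):

```python
from collections import defaultdict

def shuffle(map_outputs):
    """Group (key, value) pairs emitted by mappers so each reducer
    sees one key together with all of its values."""
    grouped = defaultdict(list)
    for key, value in map_outputs:
        grouped[key].append(value)
    # Hadoop delivers keys to each reducer in sorted order
    return sorted(grouped.items())

pairs = [("huck", 1), ("finn", 1), ("huck", 1)]
print(shuffle(pairs))   # [('finn', [1]), ('huck', [1, 1])]
```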


16. Start YARN cluster

$ start-yarn.sh

Go to http://localhost:8088 to make sure it is working

17. Let’s test.

$ cd

$ cd testhadoop

$ rm -rf output/

$ hadoop jar ~/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar grep input output 'dfs[a-z.]+'


$ hdfs dfs -cat output/*

The output will look the same as the previous run.

18. Time to run the word count.

a. Let’s get a file from the Gutenberg project: http://www.gutenberg.org/files/76/76-0.txt (it’s a copy of Huckleberry Finn).

b. Use wget to get it.

$ wget http://www.gutenberg.org/files/76/76-0.txt

c. Create a directory for our wordcount, and the input directory

$mkdir wordcount && cd wordcount

$mkdir input

d. Move our test file into the input directory.

e. Navigate back to the wordcount directory

$ cd wordcount


f. Remove the output directory currently in the system (-rmr is deprecated; use -rm -r)

$ hdfs dfs -rm -r /user/alex/output

g. Now remove and copy over our current input directory.

$ hdfs dfs -rm -r /user/alex/input

$ hdfs dfs -put input /user/alex/input

$ hdfs dfs -ls /user/alex/input (just to check to make sure it is there)

h. Finally it is time to run the wordcount program.

$ hadoop jar ~/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount input output

i. Check the output

$ hdfs dfs -cat output/*


j. Copy over the output to the “local” machine.

$ hdfs dfs -get /user/alex/output/ .

$ ls (to verify)

$ ls output (to verify)

k. Open it up in your favorite editor. Have fun looking through the results.
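The logic of the wordcount example job can be sketched in a few lines of Python (a single-process simplification: map = split each line into words, reduce = sum the counts per word):

```python
from collections import Counter

def wordcount(lines):
    """Single-process sketch of the wordcount example job."""
    counts = Counter()
    for line in lines:               # map: one record per line of input
        counts.update(line.split())  # split on whitespace, count each word
    return dict(counts)              # reduce: total per word

print(wordcount(["the river the raft"]))  # {'the': 2, 'river': 1, 'raft': 1}
```

The real job does the same thing, only with the map and reduce phases distributed across the cluster and the shuffle in between.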


This guide was adapted from:

https://medium.com/@luck/installing-hadoop-2-7-2-on-ubuntu-16-04-3a34837ad2db