Implementing Hadoop on a Single Cluster - SALIL NAVGIRE


Page 1: Implementing Hadoop on a single cluster

Implementing Hadoop on a Single Cluster

- SALIL NAVGIRE

Page 2

Basic Setup
1. Install Ubuntu

2. Install Java and Python, and update the system packages

3. Add a dedicated group ‘hadoop’ and a user ‘hduser’ in that group (for security and easier backups)

4. Configure SSH
a) Install OpenSSH server

b) Configure it by editing the ssh_config file, saving a backup of the original first

c) Generate an SSH key pair for hduser

d) Enable ssh access to your local machine with the newly created RSA key

e) hduser@Ubuntu:~$ ssh localhost

5. Disable IPv6 by editing the sysctl.conf file
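Step 5 above amounts to adding a few kernel keys to /etc/sysctl.conf; a sketch of the standard lines (apply with a reboot or `sudo sysctl -p`):

```
# /etc/sysctl.conf -- disable IPv6 system-wide
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
```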

Page 3

Installing Hadoop
1. Download Hadoop from one of the Apache Download Mirrors and extract it:
• salil@ubuntu:/usr/local$ sudo tar xzf hadoop-2.0.6-alpha-src.tar.gz

2. Change the owner of the extracted files to hduser in the hadoop group

• $ sudo chown -R hduser:hadoop hadoop (changes ownership recursively)

3. Update $HOME/.bashrc with the Hadoop-related environment variables

Page 4

Configuration
1. Edit environment variables in conf/hadoop-env.sh

2. Change settings in conf/*-site.xml

3. Create Hadoop’s working directory and set the required ownership and permissions:

• $ sudo mkdir -p /app/hadoop/tmp

• $ sudo chown hduser:hadoop /app/hadoop/tmp

• $ sudo chmod 750 /app/hadoop/tmp

4. Add configuration snippets between the <configuration> ... </configuration> tags in core-site.xml, mapred-site.xml and hdfs-site.xml
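As a sketch, minimal single-node values for these files might look like the following (the port numbers and tmp path follow common single-node tutorials and match the directory created above; adjust to your setup):

```xml
<!-- core-site.xml: example snippet -->
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/app/hadoop/tmp</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:54310</value>
  </property>
</configuration>

<!-- mapred-site.xml: set mapred.job.tracker to localhost:54311 -->
<!-- hdfs-site.xml: set dfs.replication to 1 on a single node -->
```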

Page 5

Starting your single-node cluster
• First format the NameNode:
• hduser@ubuntu:~$ /usr/local/hadoop/bin/hadoop namenode -format

• Then start your single-node cluster (e.g. bin/start-all.sh)

Page 6

• Running a MapReduce job
• Download data and copy it from the local file system to HDFS:
• hduser@ubuntu:~$ hadoop dfs -copyFromLocal /home/hduser/project.txt /user/new

• hduser@ubuntu:~$ hadoop dfs -copyFromLocal /home/hduser/hadoop/project.txt /user/lol

Page 7

• hduser@ubuntu:~$ hadoop dfs -ls /user/lol
Found 2 items
drwxr-xr-x   - hduser supergroup          0 2013-10-10 06:30 /user/lol/output
-rw-r--r--   1 hduser supergroup     969039 2013-10-05 20:20 /user/lol/project.txt

• hduser@ubuntu:~$ hadoop jar /home/hduser/hadoop/hadoop-examples-1.0.3.jar wordcount /user/lol/project.txt /user/lol/output/

• Hadoop web interfaces:
• http://localhost:50070/ – web UI of the NameNode daemon

• http://localhost:50030/ – web UI of the JobTracker daemon

• http://localhost:50060/ – web UI of the TaskTracker daemon

Page 8

• The NameNode web interface gives us a cluster summary: total/remaining capacity, and live and dead nodes.

• Additionally, we can browse HDFS to view the contents of files and logs

Page 9

• The JobTracker web interface provides general job statistics for the Hadoop cluster: running/completed/failed jobs and a job history log file

• The TaskTracker interface provides info about running and non-running tasks

Page 10

Writing MapReduce programs
• The Hadoop framework is written in Java, which can be daunting for non-CS programmers

• Jobs can be written in Python and compiled to a .jar file with Jython to run on a Hadoop cluster

• But Jython has an incomplete standard library, since some Python features are not available in Jython

• An alternative is to use Hadoop Streaming

• Hadoop Streaming is a utility that ships with the Hadoop distribution; it can run any executable script as the mapper and reducer

Page 11

• Write mapper.py and reducer.py in python

• Download and copy data to HDFS

• Run it the same way as the previous Java implementation

• There are other third-party Python MapReduce solutions, similar to Streaming/Jython, that can be used directly as libraries in Python
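The mapper.py/reducer.py pair mentioned above can be sketched as follows. In a real job, mapper.py reads lines from sys.stdin and prints tab-separated "word 1" pairs, and reducer.py reads the sorted pairs back from stdin; here the logic lives in two plain functions so it can be tried without a cluster (the sample input is illustrative):

```python
from itertools import groupby

def map_lines(lines):
    # Mapper: emit a (word, 1) pair for every whitespace-separated token.
    for line in lines:
        for word in line.split():
            yield word, 1

def reduce_pairs(pairs):
    # Reducer: sum counts per word. Hadoop delivers pairs sorted by key
    # after the shuffle, which is what groupby relies on here.
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    sample = ["hello hadoop", "hello streaming"]
    for word, total in reduce_pairs(sorted(map_lines(sample))):
        print("%s\t%d" % (word, total))  # hadoop 1, hello 2, streaming 1
```

Such scripts are typically launched through the hadoop-streaming jar with the -mapper, -reducer, -input and -output options, shipping the scripts to the cluster with -file.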

Page 12

Python implementation strategies

• Streaming
• mrjob

• dumbo

• Hadoopy

• Non-Hadoop:
• disco

• Prefer Hadoop Streaming if possible: it is easy and has the lowest overhead

• Prefer mrjob when you need higher abstraction and integration with AWS
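One reason Streaming has such low overhead is that it reuses the classic Unix pipe contract. As a cluster-free illustration, word count can be dry-run with standard tools alone (map with tr, shuffle with sort, reduce with uniq; the input text is made up):

```shell
# map: one word per line; shuffle: sort; reduce: count runs of equal words
printf 'the cat\nthe dog\n' | tr -s ' ' '\n' | sort | uniq -c | sort -rn
# the most frequent word sorts to the top
```

A real streaming job replaces tr and uniq with the mapper.py and reducer.py scripts while Hadoop supplies the distributed sort.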

Page 13

Future Work…
• Python implementation in Hadoop

• Running Hadoop on a multi-node cluster

• Pig and its implementation on Linux

• Apache Mahout, Hive, Solr