
The Enterprise Open Source Billing System

Hadoop/HBase Installation

Note: This installation is ONLY for developer machines. It assumes that the Hadoop/HBase installation is on the same machine where JB will be running; this guide will not work for a multi-node cluster. (These instructions are for Ubuntu Linux.)

INSTALL JAVA

This step can be skipped if you are working on a reference machine where a JB installation already exists.

You should already have the Java 1.6 JDK installer in the /root folder of your VM. Follow these installation instructions:

chmod 755 jdk-6u25-linux-i586.bin
./jdk-6u25-linux-i586.bin

sudo mv jdk1.6.0_25/ /opt
cd /opt
sudo ln -s jdk1.6.0_25 jdk1.6

Create a new profile script in /etc/profile.d/ to set JAVA_HOME to the location of the unpacked JDK, then change the symlink in /etc/alternatives to point at the new java binary so that it’s available to all applications.

sudo vim /etc/profile.d/java.sh

java.sh:

#!/bin/bash
JAVA_HOME=/opt/jdk1.6
PATH=$PATH:$JAVA_HOME/bin
export JAVA_HOME PATH

Run the profile script and update the software alternatives symlinks.

sudo chmod 755 /etc/profile.d/java.sh
source /etc/profile.d/java.sh


sudo update-alternatives --install "/usr/bin/java" "java" "/opt/jdk1.6/bin/java" 1
sudo update-alternatives --set java /opt/jdk1.6/bin/java
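To confirm that the new JDK is the active one, a quick check (the exact build string may differ, but it should report 1.6.0_25 and point at the alternatives symlink):

java -version
which java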

INSTALL HADOOP AND HBASE

CREATE HADOOP GROUP/USER

sudo groupadd hadoop
sudo useradd hadoop -m -s /bin/bash -g hadoop
sudo passwd hadoop

When prompted, set the password to 'hadoop'.

INSTALL HADOOP/HBASE BINARIES

Be very careful about the Hadoop and HBase versions: specific versions of Hadoop work only with specific versions of HBase.

Download the Hadoop and HBase binaries into the /root folder.

as root user:

cd ~
wget http://archive.apache.org/dist/hadoop/core/hadoop-1.1.2/hadoop-1.1.2-bin.tar.gz
tar zxvf hadoop-1.1.2-bin.tar.gz
wget http://archive.apache.org/dist/hbase/hbase-0.94.8/hbase-0.94.8.tar.gz
tar zxvf hbase-0.94.8.tar.gz

Move the Hadoop and HBase directories into the /opt folder and create symbolic links for them:

mv hbase-0.94.8/ /opt
mv hadoop-1.1.2/ /opt
cd /opt
ln -s hbase-0.94.8 hbase
ln -s hadoop-1.1.2 hadoop

Make hadoop (group and user) the owner of the newly created folders in the /opt directory:

chown hadoop:hadoop -R hbase-0.94.8/
chown hadoop:hadoop -R hadoop-1.1.2/


HADOOP PROFILE SCRIPT

as root user:

vim /etc/profile.d/hadoop.sh

hadoop.sh:

#!/bin/bash
HADOOP_HOME=/opt/hadoop
PATH=$PATH:$HADOOP_HOME/bin
export HADOOP_HOME PATH

Make the hadoop profile script readable and executable for all users:

chmod 755 /etc/profile.d/hadoop.sh

HBASE PROFILE SCRIPT

vim /etc/profile.d/hbase.sh

hbase.sh:

#!/bin/bash
HBASE_HOME=/opt/hbase
PATH=$PATH:$HBASE_HOME/bin
export HBASE_HOME PATH

Make the hbase profile script readable and executable for all users:

chmod 755 /etc/profile.d/hbase.sh

HADOOP USER ENV CONFIG

as hadoop user:

cd ~
vim .bashrc


at the bottom of the file add the following lines:

source /etc/profile.d/java.sh
source /etc/profile.d/hadoop.sh
source /etc/profile.d/hbase.sh
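To verify the environment, open a new shell as the hadoop user (or re-source .bashrc) and check that the variables and binaries resolve; this is just a quick sanity check:

source ~/.bashrc
echo $JAVA_HOME $HADOOP_HOME $HBASE_HOME
which hadoop hbase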

SSH KEYS (HADOOP USER)

Generate SSH keys (for the hadoop user) to be able to SSH into the machine without a password.

as hadoop user:

cd ~
ssh-keygen -t rsa
cd .ssh
cat id_rsa.pub > authorized_keys
chmod 600 authorized_keys

Use an empty passphrase.

The local machine also requires a running sshd server (openssh-server), which may need to be installed if it is not already present.
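To verify the key setup, SSH to the local machine as the hadoop user; it should log you in without asking for a password (you may need to accept the host key on the first connection):

ssh localhost
exit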

HADOOP CONFIGURATION

Create a folder for the Hadoop data. In this example it is placed under /opt, but this is not mandatory; the Hadoop data folder can be anywhere in the system with sufficient space. Avoid /tmp, since Linux distributions clean it automatically.

as root user:

mkdir /opt/hadoop-data
chown hadoop:hadoop /opt/hadoop-data
chmod 755 /opt/hadoop-data

Note that Hadoop will refuse to start if its data directory is not owned by the hadoop user or does not have exactly 755 permissions.
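A quick way to verify the ownership and permissions (the expected output shown in the comment is indicative):

ls -ld /opt/hadoop-data
# expected: drwxr-xr-x ... hadoop hadoop ... /opt/hadoop-data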

As 'hadoop' user, in $HADOOP_HOME/conf/hadoop-env.sh, uncomment and define the JAVA_HOME variable:

...
export JAVA_HOME=/opt/jdk1.6
...

As 'hadoop' user, in $HADOOP_HOME/conf/core-site.xml:


<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

As 'hadoop' user, in $HADOOP_HOME/conf/hdfs-site.xml:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>

  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>

  <!--
  <property>
    <name>dfs.data.dir</name>
    <value>/opt/hadoop-data</value>
    <final>true</final>
  </property>
  -->

  <property>
    <name>hadoop.tmp.dir</name>
    <value>/opt/hadoop-data</value>
  </property>
</configuration>

As 'hadoop' user, in $HADOOP_HOME/conf/mapred-site.xml:

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>

In this file you should also configure the maximum number of mappers and reducers this node will run. The default is only two of each, which is almost single-threaded. If this node will be used for any real processing, raise the limit from 2 to around 200 (see the example entries below).
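In Hadoop 1.x these limits are controlled by the per-TaskTracker task maximums; a sketch of the entries to add inside the <configuration> element of mapred-site.xml (200 is only an example value):

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>200</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>200</value>
</property>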

Once you decide the real number of mappers/reducers that will run on this node, adjust the maximum number of PostgreSQL connections accordingly. Allow 2 connections per reducer.

To modify the maximum number of connections, edit postgresql.conf, change the 'max_connections' property, and then restart PostgreSQL.
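For example, a minimal sketch (400 assumes 200 reducers at 2 connections each; the exact restart command depends on the distribution and PostgreSQL version):

# in postgresql.conf
max_connections = 400

# then restart PostgreSQL, for example:
sudo service postgresql restart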


As 'hadoop' user, in $HADOOP_HOME/conf/log4j.properties, change the root logging level threshold to DEBUG,

...
hadoop.root.logger=DEBUG,console
...

HBASE CONFIGURATION

As 'hadoop' user, in $HBASE_HOME/conf/hbase-env.sh, uncomment and define the JAVA_HOME variable and uncomment the HBASE_MANAGES_ZK variable definition:

...
export JAVA_HOME=/opt/jdk1.6
...
export HBASE_MANAGES_ZK=true
...

in $HBASE_HOME/conf/hbase-site.xml,

<configuration>
  <!-- ZooKeeper is also needed for this -->
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://localhost:9000/hbase</value>
  </property>

  <property>
    <name>hbase.master</name>
    <value>localhost:60000</value>
    <description>The host and port that the HBase master runs at.</description>
  </property>

  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>

  <!--
  <property>
    <name>dfs.support.append</name>
    <value>true</value>
  </property>
  -->

  <property>
    <name>zookeeper.znode.parent</name>
    <value>/hbase</value>
    <description>Root ZNode for HBase in ZooKeeper. All of HBase's ZooKeeper
    files that are configured with a relative path will go under this node.
    By default, all of HBase's ZooKeeper file paths are configured with a
    relative path, so they will all go under this directory unless changed.</description>
  </property>

  <property>
    <name>zookeeper.znode.rootserver</name>
    <value>root-region-server</value>
    <description>Path to ZNode holding root region location. This is written by
    the master and read by clients and region servers. If a relative path is
    given, the parent folder will be ${zookeeper.znode.parent}. By default,
    this means the root location is stored at /hbase/root-region-server.</description>
  </property>

  <!-- ZooKeeper config -->
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>localhost</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.clientPort</name>
    <value>2181</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/opt/hadoop-data/zookeeper</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.tickTime</name>
    <value>2000</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.initLimit</name>
    <value>10</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.syncLimit</name>
    <value>5</value>
  </property>
</configuration>

INITIALIZE HDFS

Initialize HDFS by running the command:

$HADOOP_HOME/bin/hadoop namenode -format


output should look similar to this:

/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = HP610/127.0.1.1
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 1.1.1
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.1 -r 1411108; compiled by 'hortonfo' on Mon Nov 19 10:48:11 UTC 2012
************************************************************/
13/01/02 19:07:47 INFO util.GSet: VM type = 64-bit
13/01/02 19:07:47 INFO util.GSet: 2% max memory = 17.77875 MB
13/01/02 19:07:47 INFO util.GSet: capacity = 2^21 = 2097152 entries
13/01/02 19:07:47 INFO util.GSet: recommended=2097152, actual=2097152
13/01/02 19:07:48 INFO namenode.FSNamesystem: fsOwner=hadoop
13/01/02 19:07:48 INFO namenode.FSNamesystem: supergroup=supergroup
13/01/02 19:07:48 INFO namenode.FSNamesystem: isPermissionEnabled=true
13/01/02 19:07:48 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100
13/01/02 19:07:48 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
13/01/02 19:07:48 INFO namenode.NameNode: Caching file names occuring more than 10 times
13/01/02 19:07:48 INFO common.Storage: Image file of size 112 saved in 0 seconds.
13/01/02 19:07:49 INFO namenode.FSEditLog: closing edit log: position=4, editlog=/tmp/hadoop-hadoop/dfs/name/current/edits
13/01/02 19:07:49 INFO namenode.FSEditLog: close success: truncate to 4, editlog=/tmp/hadoop-hadoop/dfs/name/current/edits
13/01/02 19:07:49 INFO common.Storage: Storage directory /tmp/hadoop-hadoop/dfs/name has been successfully formatted.
13/01/02 19:07:49 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at HP610/127.0.1.1
************************************************************/

STARTING/STOPPING HADOOP and HBASE

Always start Hadoop before HBase. HBase is configured to work on top of HDFS, which starts along with Hadoop. Both Hadoop and HBase come with scripts that start and stop them; a combined wrapper is sketched after the STOPPING HBASE section below.

STARTING HADOOP

$HADOOP_HOME/bin/start-all.sh

or alternatively start each process manually:

hadoop-daemon.sh start jobtracker
hadoop-daemon.sh start tasktracker
hadoop-daemon.sh start namenode
hadoop-daemon.sh start datanode


To check that all the processes are started, execute the 'jps' command.

hadoop@debian:~$ jps
3318 SecondaryNameNode
3506 TaskTracker
3193 DataNode
3090 NameNode
3397 JobTracker
3619 Jps

In the output, the Hadoop processes are JobTracker, NameNode, DataNode, TaskTracker and SecondaryNameNode.

Two web interfaces will be started that give monitoring options for Hadoop and HDFS.

http://X.Y.Z.Q:50030/jobtracker.jsp - cluster status and job monitoring

http://X.Y.Z.Q:50070/dfshealth.jsp - hdfs monitoring

STOPPING HADOOP

$HADOOP_HOME/bin/stop-all.sh

or alternatively stop each process manually:

hadoop-daemon.sh stop jobtracker
hadoop-daemon.sh stop tasktracker
hadoop-daemon.sh stop namenode
hadoop-daemon.sh stop datanode

STARTING HBASE

$HBASE_HOME/bin/start-hbase.sh

Check that all HBase processes are started by using the 'jps' command:

hadoop@U10:~$ jps
17890 HMaster
17112 JobTracker
17811 HQuorumPeer
16811 DataNode
17312 TaskTracker
16608 NameNode
17018 SecondaryNameNode
18139 HRegionServer
18256 Jps


In the output, HBase processes are HMaster, HQuorumPeer and HRegionServer.

STOPPING HBASE

$HBASE_HOME/bin/stop-hbase.sh
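Since the order matters (start Hadoop before HBase, stop HBase before Hadoop), a small wrapper script can help. This is only a sketch (the script name is arbitrary) and assumes the HADOOP_HOME/HBASE_HOME profile scripts above are in place; run it as the hadoop user:

#!/bin/bash
# hstack.sh - start/stop Hadoop and HBase in the correct order
case "$1" in
  start)
    $HADOOP_HOME/bin/start-all.sh
    $HBASE_HOME/bin/start-hbase.sh
    ;;
  stop)
    $HBASE_HOME/bin/stop-hbase.sh
    $HADOOP_HOME/bin/stop-all.sh
    ;;
  *)
    echo "usage: $0 {start|stop}"
    exit 1
    ;;
esac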

After starting both Hadoop and HBase make sure they are working.

LOGS

Check the logs of both HBase and Hadoop and make sure there are no critical exceptions.

HBase logs path: /opt/hbase/logs
Hadoop logs path: /opt/hadoop/logs
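One quick way to scan both sets of logs for problems (just a convenience command, not part of the official setup):

grep -riE 'error|exception|fatal' /opt/hadoop/logs /opt/hbase/logs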

WARNING: A known problem with some Linux distributions is a predefined /etc/hosts entry that starts with 127.0.1.1. When HBase starts, the first thing it does is insert nodes into ZooKeeper (ZK) containing the location of the region servers. When clients want to talk to HBase, they first ask ZK for the location of the region servers. The problem is that when HBase starts it does DNS resolution against /etc/hosts for the region server's location and updates the ZK node with that information instead of the configured data. If the 127.0.1.1 entry is left in /etc/hosts, HBase may use it and publish it in ZK; clients querying ZK will then receive this address and will not be able to connect to the region server. This problem is much more visible when Hadoop/HBase is accessed from a different machine than the one where Hadoop/HBase is installed. The usual solution is to remove this line from /etc/hosts and restart everything. For more details read "Why does HBase care about /etc/hosts?"

/etc/hosts

The first entry in /etc/hosts should be similar to the one below:

127.0.0.1 localhost domain_name_or_machine_name
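For illustration, with a hypothetical machine name 'jbdev', the top of /etc/hosts would look like this (note that there is no 127.0.1.1 line):

127.0.0.1   localhost jbdev
# 127.0.1.1   jbdev   <- remove or comment out this entry if it exists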

HDFS Sanity Test:

hadoop dfs -ls /

It should list the root folder contents of HDFS.
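A slightly deeper check is to write a file into HDFS and read it back (the file and folder names here are arbitrary):

echo "hdfs sanity check" > /tmp/sanity.txt
hadoop dfs -mkdir /sanity-test
hadoop dfs -put /tmp/sanity.txt /sanity-test/
hadoop dfs -cat /sanity-test/sanity.txt
hadoop dfs -rmr /sanity-test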

HBASE Sanity Test:

Start the HBase shell with:

hbase shell


Try to list all the tables, with:

list

Note: if the process hangs and does not list the tables (or say that there are no tables), then most likely HDFS is not available and something is wrong.
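If listing works, a slightly deeper check is to create a throw-away table, write and read a cell, and drop the table again, still inside the HBase shell (the table and column family names are arbitrary):

create 'sanity_test', 'cf'
put 'sanity_test', 'row1', 'cf:c1', 'value1'
scan 'sanity_test'
disable 'sanity_test'
drop 'sanity_test'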

As part of sanity testing it is good practice to check the logs for startup errors. The logs can be found at:

$HADOOP_HOME/logs

TROUBLESHOOTING

PID FILES

Files containing process PIDs are stored in /tmp. If you kill an HBase process you will have to delete its PID file in order to restart it from the command line.
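The PID files typically follow the patterns hadoop-<user>-<daemon>.pid and hbase-<user>-<daemon>.pid (exact names may vary); for the hadoop user they can be listed and cleaned up with:

ls /tmp/*.pid
rm /tmp/hbase-hadoop-*.pid
rm /tmp/hadoop-hadoop-*.pid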