Gluster Hadoop Compatible Storage


Gluster Filesystem 3.3 Beta

Hadoop Compatible Storage

Release: August 2011



Copyright

Copyright 2011 Gluster, Inc.

This is a preliminary document and may be changed substantially prior to final commercial release of the software described herein.




Table of Contents

1. About this Guide
    1.1. Disclaimer
    1.2. Audience
    1.3. Prerequisite
    1.4. Terms
    1.5. Typographical Conventions
    1.6. Feedback

2. Introducing Hadoop Compatible Storage of GlusterFS
    2.1. Architecture Overview
    2.2. Advantages

3. Preparing to Install Hadoop Compatible Storage
    3.1. Pre-requisites
    3.2. Dependencies

4. Installing and Configuring Hadoop Compatible Storage

5. Starting and Stopping the Hadoop MapReduce Daemon on GlusterFS
    5.1. Starting and Stopping MapReduce Daemon

6. Troubleshooting Hadoop Compatible Storage
    6.1. Time Sync
    6.2. Socket Creation Errors

7. Creating GlusterFS Volumes
    7.1. Creating Distributed Striped Replicated Volumes
    7.2. Creating Striped Replicated Volumes

8. Managing Your Gluster Filesystem




1. About this Guide

This guide describes the Gluster Hadoop Compatible Storage feature and its installation and management.

1.1. Disclaimer

Gluster, Inc. has designated English as the official language for all of its product documentation and other documentation, as well as all our customer communications. All documentation prepared or delivered by Gluster will be written, interpreted and applied in English, and English is the official and controlling language for all our documents, agreements, instruments, notices, disclosures and communications, in any form, electronic or otherwise (collectively, the Gluster Documents).

Any customer, vendor, partner or other party who requires a translation of any of the Gluster Documents is responsible for preparing or obtaining such translation, including associated costs. However, regardless of any such translation, the English language version of any of the Gluster Documents prepared or delivered by Gluster shall control for any interpretation, enforcement, application or resolution.

1.2. Audience

This guide is intended for Apache Hadoop users interested in using GlusterFS as the filesystem for Hadoop.

1.3. Prerequisite

This document assumes that you are familiar with the Linux operating system, file system concepts, GlusterFS concepts, Apache Hadoop, and the MapReduce framework.

1.4. Terms

Term        Description
master      Manages scheduling of jobs, assigns tasks to slaves, monitors tasks, and re-executes the failed tasks.
slave       Program which submits a job to the master.
job         A set of map and/or reduce tasks, coordinated by the master. When the master receives a job, it assigns a unique name to the job and assigns the tasks to workers until they are all completed.
map         The first phase of a job, in which tasks are usually scheduled on the same node where their input data is hosted, so that local computation can be performed. Generally there is one map task per input.
reduce      The phase of a job in which each individual task usually has access to all values for a given key produced by the map phase.
mapreduce   A paradigm and associated framework for distributed computing, which decouples application code from the core challenges of fault tolerance and data locality.
task        A unit of work, provided to a worker.
worker      A worker is responsible for carrying out a task. A job specifies the executable that is the worker. Workers are scheduled to run on the nodes close to the data they are supposed to be processing.

1.5. Typographical Conventions

The following table lists the formatting conventions that are used in this guide to make it easier for you to recognize and use specific types of information.

Convention        Description                                                                                                           Example
Courier Text      Commands formatted as courier indicate shell commands.                                                                gluster volume start volname
Italicized Text   Within a command, italicized text represents variables, which must be substituted with specific values.               gluster volume start volname
Square Brackets   Within a command, optional parameters are shown in square brackets.                                                   gluster volume start volname [force]
Curly Brackets    Within a command, alternative parameters are grouped within curly brackets and separated by the vertical OR bar.      gluster volume { start | stop | delete } volname

1.6. Feedback

Gluster welcomes your comments and suggestions on the quality and usefulness of its documentation. If you find any errors or have any other suggestions, write to us at [email protected] and provide the chapter, section, and page number, if available.

Gluster offers a range of resources related to Gluster software:

Discuss technical problems and solutions on the Discussion Forum (http://community.gluster.org)

Get hands-on step-by-step tutorials (http://www.gluster.com/community/documentation/index.php/Main_Page)


2. Introducing Hadoop Compatible Storage of GlusterFS

GlusterFS 3.3 beta 2 includes compatibility for Apache Hadoop. It uses the standard file system APIs available in Hadoop to provide a new storage option for Hadoop deployments. Existing MapReduce based applications can use GlusterFS seamlessly. This new functionality opens up data within Hadoop deployments to any file-based or object-based application.

A MapReduce framework typically divides the input data-set into independent chunks, which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the jobs are stored in a file system. The framework takes care of scheduling tasks, monitoring them, and re-executing the failed tasks.
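The map, sort, and reduce flow described above can be illustrated on a single machine with a word count. This is only a sketch of the paradigm using standard shell tools, not Hadoop or GlusterFS code; the function names are hypothetical and chosen for this example.

```shell
#!/usr/bin/env bash
# Illustrative only: a single-machine analogue of the map -> sort -> reduce flow.

# map: emit one "word<TAB>1" record per word read from standard input.
map_words() {
  tr -s '[:space:]' '\n' | awk 'NF { print $1 "\t1" }'
}

# reduce: sum the counts of adjacent records that share a key.
# Relies on the input already being sorted by key.
reduce_counts() {
  awk -F '\t' '
    NR > 1 && $1 != key { print key "\t" sum; sum = 0 }
    { key = $1; sum += $2 }
    END { if (NR > 0) print key "\t" sum }
  '
}

# The sort between the two phases groups identical keys together,
# which is the role the framework plays between map and reduce.
wordcount() {
  map_words | LC_ALL=C sort | reduce_counts
}
```

For example, `printf 'b a b\na\n' | wordcount` prints one tab-separated line per distinct word with its count.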

2.1. Architecture Overview

The following diagram illustrates Hadoop integration with Gluster:

2.2. Advantages

The following are the advantages of Hadoop Compatible Storage with GlusterFS:

Provides simultaneous file-based and object-based access within Hadoop.

Eliminates the centralized metadata server.

Provides compatibility with MapReduce applications; no rewrite is required.

Provides a fault tolerant filesystem.




3. Preparing to Install Hadoop Compatible Storage

This section provides information on the pre-requisites and the list of dependencies that will be installed during installation of Hadoop Compatible Storage.

3.1. Pre-requisites

The following are the pre-requisites to install and configure GlusterFS with Hadoop Compatible Storage:

Hadoop 0.20.2 is installed, configured, and running on all the machines in the cluster.

Java Runtime Environment

Maven (mandatory only if you are building the plugin from the source)

JDK (mandatory only if you are building the plugin from the source)

Source code is available at https://github.com/gluster/hadoop-glusterfs.

3.2. Dependencies

The following package will be installed when you install Hadoop Compatible Storage on Gluster:

getfattr


4. Installing and Configuring Hadoop Compatible Storage

This section describes how to install and configure Hadoop Compatible Storage in your storage environment and verify that it is functioning correctly.

1. Download the GlusterFS RPM files on all servers in your cluster. You can download the software at http://download.gluster.com/pub/gluster/glusterfs/qa-releases/3.3-beta-2/.

2. For each RPM file, get the md5sum (using the following command) and compare it against the md5sum file available at http://download.gluster.com/pub/gluster/glusterfs/qa-releases/3.3-beta-2/CentOS/.

$ md5sum RPM_file.rpm

3. Install GlusterFS on all servers using the following commands:

# rpm -Uvh core_RPM_file
# rpm -Uvh fuse_RPM_file
# rpm -Uvh geo-replication_RPM_file

For example:

# rpm -Uvh glusterfs-core-3.3beta2-1.x86_64.rpm
# rpm -Uvh glusterfs-fuse-3.3beta2-1.x86_64.rpm
# rpm -Uvh glusterfs-geo-replication-3.3beta2-1.x86_64.rpm

4. Verify that the 3.3beta2 version of GlusterFS is installed, using the following command:

# glusterfs --version

For more information on installing GlusterFS, refer to GlusterFS Installation at http://www.gluster.com/community/documentation/index.php/Gluster_3.2_Filesystem_Installation_Guide

5. Download the glusterfs-hadoop-0.20.2-0.1.x86_64.rpm on all servers in your cluster. You can download the software at http://download.gluster.com/pub/gluster/glusterfs/qa-releases/3.3-beta-2/.

6. To install Hadoop Compatible Storage on all servers in your cluster, run the following command:

# rpm -ivh --nodeps glusterfs-hadoop-0.20.2-0.1.x86_64.rpm

The following files will be extracted:

/usr/local/lib/glusterfs--.jar -

/usr/local/lib/conf/core-site.xml

7. (Optional) To install Hadoop Compatible Storage in a different location, run the following command:

# rpm -ivh --nodeps --prefix /usr/local/glusterfs/hadoop glusterfs-hadoop-0.20.2-0.1.x86_64.rpm


8. Edit the conf/core-site.xml file. The following is the sample conf/core-site.xml file:

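<configuration>
  <property>
    <name>fs.glusterfs.impl</name>
    <value>org.apache.hadoop.fs.glusterfs.GlusterFileSystem</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>glusterfs://fedora1:9000</value>
  </property>
  <property>
    <name>fs.glusterfs.volname</name>
    <value>hadoopvol</value>
  </property>
  <property>
    <name>fs.glusterfs.mount</name>
    <value>/mnt/glusterfs</value>
  </property>
  <property>
    <name>fs.glusterfs.server</name>
    <value>fedora2</value>
  </property>
  <property>
    <name>quick.slave.io</name>
    <value>Off</value>
  </property>
</configuration>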

The following are the configurable fields:

Property Name          Default Value              Description
fs.default.name        glusterfs://fedora1:9000   Any hostname in the cluster as the server and any port number.
fs.glusterfs.volname   hadoopvol                  GlusterFS volume to mount.
fs.glusterfs.mount     /mnt/glusterfs             The directory used to FUSE mount the volume.
fs.glusterfs.server    fedora2                    Any hostname or IP address on the cluster except the client/master.
quick.slave.io         Off                        Performance tunable option. If this option is set to On, the plugin will try to perform I/O directly from the disk filesystem (like ext3 or ext4) the file resides on. Hence read performance will improve and jobs would run faster.

Note: This option is not tested widely.

9. Create a soft link in Hadoop's library and configuration directory for the downloaded files (in Step 7) using the following commands:

# ln -s

For example,

# ln -s /usr/local/lib/glusterfs-0.20.2-0.1.jar $HADOOP_HOME/lib/glusterfs-0.20.2-0.1.jar

# ln -s /usr/local/lib/conf/core-site.xml $HADOOP_HOME/conf/core-site.xml

10. (Optional) You can run the following command on Hadoop master to build the plugin and deploy it along with core-site.xml file, instead of repeating the above steps:

# build-deploy-jar.py -d $HADOOP_HOME -c
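Step 2 above compares each checksum by hand. When several RPM files are involved, the comparison can be scripted. The following is a minimal sketch, assuming the published checksum file uses the standard md5sum output format (one "hash  filename" pair per line) and sits in the same directory as the downloaded RPM files. The function name and the checksum file name used in the usage note are hypothetical.

```shell
#!/usr/bin/env bash
# verify_rpms CHECKSUM_FILE
# Checks every file listed in CHECKSUM_FILE against its recorded MD5 sum.
# Assumes lines of the form "<md5>  <filename>", as written by md5sum itself.
# Returns non-zero if any file is missing or does not match.
verify_rpms() {
  local sums=$1
  # Run from the directory holding the checksum file so that the relative
  # file names inside it resolve correctly.
  ( cd "$(dirname "$sums")" && md5sum -c "$(basename "$sums")" )
}
```

For example, `verify_rpms /tmp/rpms/MD5SUMS && echo "all packages verified"` proceeds only when every listed file matches.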




5. Starting and Stopping the Hadoop MapReduce Daemon on GlusterFS

The MapReduce daemon serves to run MapReduce jobs on Gluster.

Note: You must start the Hadoop MapReduce daemon on all servers.

5.1. Starting and Stopping MapReduce Daemon

To start the MapReduce daemon manually, enter the following command:

# $HADOOP_HOME/bin/start-mapred.sh

To stop the MapReduce daemon manually, enter the following command:

# $HADOOP_HOME/bin/stop-mapred.sh




6. Troubleshooting Hadoop Compatible Storage

This section describes the most common troubleshooting issues related to Hadoop Compatible Storage.

6.1. Time Sync

Running a MapReduce job may throw exceptions if the clocks on the hosts in the cluster are out of sync.

Solution: Synchronize the time on all hosts using the ntpd program.
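Before suspecting other causes, it can help to confirm whether the clocks actually differ. The following is a rough sketch that compares a remote host's clock with the local one over ssh. It assumes passwordless ssh access to the host; the function names and the default threshold of 2 seconds are assumptions made for this example, and ssh latency adds a small error to the measurement.

```shell
#!/usr/bin/env bash
# skew_seconds A B
# Prints the absolute difference between two epoch timestamps (in seconds).
skew_seconds() {
  local d=$(( $1 - $2 ))
  echo "${d#-}"   # strip a leading minus sign, if any
}

# check_host HOST [MAX_SKEW]
# Compares HOST's clock with the local clock. MAX_SKEW defaults to 2 seconds,
# an arbitrary value chosen for this sketch.
# Returns 0 if within the limit, 1 if the skew is too large, 2 if ssh fails.
check_host() {
  local host=$1 max=${2:-2} remote skew
  remote=$(ssh "$host" date +%s) || return 2
  skew=$(skew_seconds "$remote" "$(date +%s)")
  if [ "$skew" -gt "$max" ]; then
    echo "$host: clock differs by ${skew}s; check ntpd on this host"
    return 1
  fi
}
```

For example, `for h in server1 server2; do check_host "$h"; done` reports only the hosts whose clocks have drifted beyond the limit.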

6.2. Socket Creation Errors

On CentOS 5.x, the gluster CLI commands may not work when the ipv6 kernel module is disabled, and socket creation may fail (an error message is logged in the log file).

Solution: In the /etc/modprobe.conf file, change the line options ipv6 disable=1 to options ipv6 disable=0 and reboot the machines.
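The edit described in the solution can also be made non-interactively. The following is a minimal sketch using sed. The function name is hypothetical, and the sketch assumes the line appears in the file exactly in the form "options ipv6 disable=1" (possibly with extra whitespace). A backup of the original file is kept with a .bak suffix; the reboot is still required afterwards.

```shell
#!/usr/bin/env bash
# enable_ipv6 [FILE]
# Rewrites "options ipv6 disable=1" to "options ipv6 disable=0" in FILE
# (default: /etc/modprobe.conf), keeping the original as FILE.bak.
# Lines that do not match are left unchanged.
enable_ipv6() {
  local conf=${1:-/etc/modprobe.conf}
  sed -i.bak \
    's/^\(options[[:space:]][[:space:]]*ipv6[[:space:]][[:space:]]*\)disable=1[[:space:]]*$/\1disable=0/' \
    "$conf"
}
```

Running `enable_ipv6` as root edits the default file; passing a path such as a copy of the file lets you inspect the result first.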




7. Creating GlusterFS Volumes

From GlusterFS 3.3 beta 2 onwards, you can create volumes of the following types in your storage environment:

Distributed Striped Replicated: Distributes and stripes data across replicated bricks in the volume. For more information, see Creating Distributed Striped Replicated Volumes.

Striped Replicated: Stripes and replicates data across bricks in the volume. For more information, see Creating Striped Replicated Volumes.

7.1. Creating Distributed Striped Replicated Volumes

Distributed striped replicated volumes distribute and stripe data across replicated bricks in the cluster. For best results, use distributed striped replicated volumes in high-concurrency environments where storage must scale, very large files are accessed, and performance is critical.

To configure a distributed striped replicated volume:

1. Create a trusted storage pool consisting of the storage servers that will comprise the volume. For information on creating a trusted storage pool, see http://www.gluster.com/community/documentation/index.php/Gluster_3.2:_Adding_Servers_to_Trusted_Storage_Pool.

2. Create the volume using the following command:

Note: The number of bricks should be a multiple of the stripe count and the replica count for a distributed striped replicated volume.

# gluster volume create NEW-VOLNAME [stripe COUNT] [replica COUNT] [transport tcp | rdma | tcp,rdma] NEW-BRICK...

To create a distributed striped replicated volume across eight storage servers:

# gluster volume create test-volume stripe 2 replica 2 transport tcp server1:/exp1 server2:/exp2 server3:/exp3 server4:/exp4 server5:/exp5 server6:/exp6 server7:/exp7 server8:/exp8
Creation of test-volume has been successful
Please start the volume to access data.

3. (Optional) Set additional options if required, such as auth.allow or auth.reject.

For example:

# gluster volume set test-volume auth.allow 10.*

Note: Make sure you start your volumes before you try to mount them or else client operations after the mount will hang. For information on starting volumes, see http://www.gluster.com/community/documentation/index.php/Gluster_3.2:_Starting_Volumes.
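The brick-count note in step 2 can be checked before running the create command. The following is a small sketch that reads the note as requiring the brick count to be a multiple of stripe count times replica count, which is consistent with the eight-brick example (stripe 2, replica 2). The function name is hypothetical and the check is not part of the gluster CLI.

```shell
#!/usr/bin/env bash
# distribute_count BRICKS STRIPE REPLICA
# Prints how many stripe*replica groups the bricks form. Fails with a message
# on standard error if BRICKS is not a multiple of STRIPE * REPLICA.
distribute_count() {
  local bricks=$1 stripe=$2 replica=$3
  local group=$(( stripe * replica ))
  if [ "$group" -le 0 ] || [ $(( bricks % group )) -ne 0 ]; then
    echo "brick count $bricks is not a multiple of $group" >&2
    return 1
  fi
  echo $(( bricks / group ))
}
```

For the example above, `distribute_count 8 2 2` reports two groups of four bricks, while a call such as `distribute_count 6 2 2` fails because six is not a multiple of four.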


7.2. Creating Striped Replicated Volumes

Striped replicated volumes stripe data across replicated bricks in the cluster. For best results, use striped replicated volumes in high-concurrency environments where very large files are accessed and performance is critical.

To configure a striped replicated volume

1. Create a trusted storage pool consisting of the storage servers that will comprise the volume. For information on creating a trusted storage pool, see http://www.gluster.com/community/documentation/index.php/Gluster_3.2:_Adding_Servers_to_Trusted_Storage_Pool.

2. Create the volume using the following command:

Note: The number of bricks should be a multiple of the replica count and the stripe count for a striped replicated volume.

# gluster volume create NEW-VOLNAME [stripe COUNT] [replica COUNT] [transport tcp | rdma | tcp,rdma] NEW-BRICK...

To create a striped replicated volume across four storage servers:

# gluster volume create test-volume stripe 2 replica 2 transport tcp server1:/exp1 server2:/exp2 server3:/exp3 server4:/exp4
Creation of test-volume has been successful
Please start the volume to access data.

To create a striped replicated volume across six storage servers:

# gluster volume create test-volume stripe 3 replica 2 transport tcp server1:/exp1 server2:/exp2 server3:/exp3 server4:/exp4 server5:/exp5 server6:/exp6
Creation of test-volume has been successful
Please start the volume to access data.

3. (Optional) Set additional options if required, such as auth.allow or auth.reject.

For example:

# gluster volume set test-volume auth.allow 10.*

Note: Make sure you start your volumes before you try to mount them or else client operations after the mount will hang. For information on starting volumes, see http://www.gluster.com/community/documentation/index.php/Gluster_3.2:_Starting_Volumes.
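With many servers, typing the brick list by hand is error-prone. The following sketch assembles the create command from a list of server names, using one brick per server named server:/expN in the style of the examples above. It only prints the command and does not run it. The function name is hypothetical, and the brick path pattern and tcp transport are assumptions carried over from the examples.

```shell
#!/usr/bin/env bash
# volume_create_cmd VOLNAME STRIPE REPLICA SERVER...
# Prints (does not run) a gluster volume create command with one brick per
# server, numbered /exp1, /exp2, ... in the order the servers are given.
volume_create_cmd() {
  local vol=$1 stripe=$2 replica=$3
  shift 3
  local bricks="" i=1 server
  for server in "$@"; do
    bricks="$bricks $server:/exp$i"
    i=$(( i + 1 ))
  done
  echo "gluster volume create $vol stripe $stripe replica $replica transport tcp$bricks"
}
```

For example, `volume_create_cmd test-volume 2 2 server1 server2 server3 server4` prints the same command as the four-server example, which you can review before running it.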

8. Managing Your Gluster Filesystem

The GlusterFS Administration Guide is available at: http://www.gluster.com/community/documentation/index.php/Gluster_3.2_Filesystem_Administration_Guide.