Gluster Hadoop Compatible Storage

  • 8/2/2019 Gluster Hadoop Compatible Storage

    Gluster Filesystem 3.3 Beta

    Hadoop Compatible Storage

    Release: August 2011

    Copyright

    Copyright 2011 Gluster, Inc.

    This is a preliminary document and may be changed substantially prior to final commercial release of the software described herein.

    Table of Contents

    1. About this Guide
        1.1. Disclaimer
        1.2. Audience
        1.3. Prerequisite
        1.4. Terms
        1.5. Typographical Conventions
        1.6. Feedback

    2. Introducing Hadoop Compatible Storage of GlusterFS
        2.1. Architecture Overview
        2.2. Advantages

    3. Preparing to Install Hadoop Compatible Storage
        3.1. Pre-requisites
        3.2. Dependencies

    4. Installing and Configuring Hadoop Compatible Storage

    5. Starting and Stopping the Hadoop MapReduce Daemon on GlusterFS
        5.1. Starting and Stopping MapReduce Daemon

    6. Troubleshooting Hadoop Compatible Storage
        6.1. Time Sync
        6.2. Socket Creation Errors

    7. Creating GlusterFS Volumes
        7.1. Creating Distributed Striped Replicated Volumes
        7.2. Creating Striped Replicated Volumes

    8. Managing Your Gluster Filesystem

    1. About this Guide

    This guide describes the Gluster Hadoop Compatible Storage feature and its installation and management.

    1.1. Disclaimer

    Gluster, Inc. has designated English as the official language for all of its product documentation and other documentation, as well as all our customer communications. All documentation prepared or delivered by Gluster will be written, interpreted and applied in English, and English is the official and controlling language for all our documents, agreements, instruments, notices, disclosures and communications, in any form, electronic or otherwise (collectively, the Gluster Documents).

    Any customer, vendor, partner or other party who requires a translation of any of the Gluster Documents is responsible for preparing or obtaining such translation, including associated costs. However, regardless of any such translation, the English language version of any of the Gluster Documents prepared or delivered by Gluster shall control for any interpretation, enforcement, application or resolution.

    1.2. Audience

    This guide is intended for Apache Hadoop users interested in using GlusterFS as the filesystem for Hadoop.

    1.3. Prerequisite

    This document assumes that you are familiar with the Linux operating system, file system concepts, GlusterFS concepts, Apache Hadoop, and the MapReduce framework.

    1.4. Terms

    Term        Description

    master      Manages scheduling of jobs, assigns tasks to slaves, monitors tasks, and re-executes failed tasks.

    slave       Node that runs the tasks assigned to it by the master.

    job         A set of map and/or reduce tasks, coordinated by the master. When the master receives a job, it assigns a unique name to the job and assigns its tasks to workers until they are all completed.

    map         The first phase of a job, in which tasks are usually scheduled on the same node where their input data is hosted, so that computation can be performed locally. Generally there is one map task per input.

    reduce      The phase of a job in which each individual task usually has access to all values for a given key produced by the map phase.

    mapreduce   A paradigm and associated framework for distributed computing, which decouples application code from the core challenges of fault tolerance and data locality.

    task        A unit of work, provided to a worker.

    worker      Responsible for carrying out a task. A job specifies the executable that is the worker. Workers are scheduled to run on the nodes close to the data they are supposed to be processing.

    1.5. Typographical Conventions

    The following table lists the formatting conventions that are used in this guide to make it easier for you to recognize and use specific types of information.

    Convention:  Courier Text
    Description: Commands formatted as courier indicate shell commands.
    Example:     gluster volume start volname

    Convention:  Italicized Text
    Description: Within a command, italicized text represents variables, which must be substituted with specific values.
    Example:     gluster volume start volname

    Convention:  Square Brackets
    Description: Within a command, optional parameters are shown in square brackets.
    Example:     gluster volume start volname [force]

    Convention:  Curly Brackets
    Description: Within a command, alternative parameters are grouped within curly brackets and separated by the vertical OR bar.
    Example:     gluster volume { start | stop | delete } volname

    1.6. Feedback

    Gluster welcomes your comments and suggestions on the quality and usefulness of its documentation. If you find any errors or have any other suggestions, write to us at [email protected] and provide the chapter, section, and page number, if available.

    Gluster offers a range of resources related to Gluster software:

    Discuss technical problems and solutions on the Discussion Forum (http://community.gluster.org)

    Get hands-on step-by-step tutorials (http://www.gluster.com/community/documentation/index.php/Main_Page)

    2. Introducing Hadoop Compatible Storage of GlusterFS

    GlusterFS 3.3 beta 2 is compatible with Apache Hadoop: it uses the standard file system APIs available in Hadoop to provide a new storage option for Hadoop deployments. Existing MapReduce-based applications can use GlusterFS seamlessly. This new functionality opens up data within Hadoop deployments to any file-based or object-based application.

    A MapReduce framework typically splits the input data set into independent chunks, which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which then become the input to the reduce tasks. Typically, both the input and the output of the jobs are stored in a file system. The framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks.
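
The map, sort, and reduce flow described above can be sketched with an ordinary shell pipeline. This is only a single-machine analogy, not Hadoop code; the function names map_words, reduce_counts, and word_count are hypothetical and chosen for this illustration.

```shell
#!/bin/sh
# Map phase: emit one "word<TAB>1" record for every word on standard input.
map_words() {
    tr -s '[:space:]' '\n' | awk 'NF { print $0 "\t" 1 }'
}

# Reduce phase: the input is sorted by key, so equal keys are adjacent.
# Sum the values of each run of identical keys and print one total per key.
reduce_counts() {
    awk -F '\t' '
        NR > 1 && $1 != key { print key "\t" sum; sum = 0 }
        { key = $1; sum += $2 }
        END { if (NR > 0) print key "\t" sum }
    '
}

# The sort between the two phases stands in for the framework's sort step,
# which groups the map output by key before it reaches the reduce tasks.
word_count() {
    map_words | LC_ALL=C sort | reduce_counts
}
```

For example, piping the text "a b a" into word_count prints one line per distinct word with its count. In a real deployment the framework runs many map and reduce tasks in parallel and reads and writes the data through the configured file system.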

    2.1. Architecture Overview

    The following diagram illustrates Hadoop integration with Gluster:

    2.2. Advantages

    The following are the advantages of Hadoop Compatible Storage with GlusterFS:

    Provides simultaneous file-based and object-based access within Hadoop.

    Eliminates the centralized metadata server.

    Provides compatibility with existing MapReduce applications; no rewrite is required.

    Provides a fault tolerant filesystem.

    3. Preparing to Install Hadoop Compatible Storage

    This section provides information on pre-requisites and lists the dependencies that will be installed during installation of Hadoop Compatible Storage.

    3.1. Pre-requisites

    The following are the pre-requisites to install and configure GlusterFS with Hadoop Compatible Storage:

    Hadoop 0.20.2 is installed, configured, and is running on all the machines in the cluster.

    Java Runtime Environment

    Maven (mandatory only if you are building the plugin from the source)

    JDK (mandatory only if you are building the plugin from the source)

    Source code is available at https://github.com/gluster/hadoop-glusterfs.

    3.2. Dependencies

    The following package will be installed when you install Hadoop Compatible Storage on Gluster:

    getfattr

    4. Installing and Configuring Hadoop Compatible Storage

    This section describes how to install and configure Hadoop Compatible Storage in your storage environment and verify that it is functioning correctly.

    1. Download the GlusterFS RPM files on all servers in your cluster. You can download the software at http://download.gluster.com/pub/gluster/glusterfs/qa-releases/3.3-beta-2/.

    2. For each RPM file, get the md5sum (using the following command) and compare it against the md5sum file available at http://download.gluster.com/pub/gluster/glusterfs/qa-releases/3.3-beta-2/CentOS/.

    $ md5sum RPM_file.rpm

    3. Install GlusterFS on all servers using the following commands:

    # rpm -Uvh core_RPM_file
    # rpm -Uvh fuse_RPM_file
    # rpm -Uvh geo-replication_RPM_file

    For example:

    # rpm -Uvh glusterfs-core-3.3beta2-1.x86_64.rpm
    # rpm -Uvh glusterfs-fuse-3.3beta2-1.x86_64.rpm
    # rpm -Uvh glusterfs-geo-replication-3.3beta2-1.x86_64.rpm

    4. Verify that the 3.3beta2 version of GlusterFS is installed, using the following command:

    # glusterfs --version

    For more information on installing GlusterFS, refer to GlusterFS Installation at http://www.gluster.com/community/documentation/index.php/Gluster_3.2_Filesystem_Installation_Guide

    5. Download the glusterfs-hadoop-0.20.2-0.1.x86_64.rpm on all servers in your cluster. You can download the software at http://download.gluster.com/pub/gluster/glusterfs/qa-releases/3.3-beta-2/.

    6. To install Hadoop Compatible Storage on all servers in your cluster, run the following command:

    # rpm -ivh --nodeps glusterfs-hadoop-0.20.2-0.1.x86_64.rpm

    The following files will be extracted:

    /usr/local/lib/glusterfs--.jar -

    /usr/local/lib/conf/core-site.xml

    7. (Optional) To install Hadoop Compatible Storage in a different location, run the following command:

    # rpm -ivh --nodeps --prefix /usr/local/glusterfs/hadoop glusterfs-hadoop-0.20.2-0.1.x86_64.rpm

    8. Edit the conf/core-site.xml file. The following is the sample conf/core-site.xml file:

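    <configuration>
      <property>
        <name>fs.glusterfs.impl</name>
        <value>org.apache.hadoop.fs.glusterfs.GlusterFileSystem</value>
      </property>
      <property>
        <name>fs.default.name</name>
        <value>glusterfs://fedora1:9000</value>
      </property>
      <property>
        <name>fs.glusterfs.volname</name>
        <value>hadoopvol</value>
      </property>
      <property>
        <name>fs.glusterfs.mount</name>
        <value>/mnt/glusterfs</value>
      </property>
      <property>
        <name>fs.glusterfs.server</name>
        <value>fedora2</value>
      </property>
      <property>
        <name>quick.slave.io</name>
        <value>Off</value>
      </property>
    </configuration>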

    The following are the configurable fields:

    Property Name          Default Value              Description

    fs.default.name        glusterfs://fedora1:9000   Any hostname in the cluster as the server and any port number.

    fs.glusterfs.volname   hadoopvol                  GlusterFS volume to mount.

    fs.glusterfs.mount     /mnt/glusterfs             The directory used to FUSE mount the volume.

    fs.glusterfs.server    fedora2                    Any hostname or IP address on the cluster except the client/master.

    quick.slave.io         Off                        Performance tunable option. If this option is set to On, the plugin will try to perform I/O directly from the disk filesystem (like ext3 or ext4) the file resides on. Hence read performance will improve and jobs would run faster.

    Note: The quick.slave.io option is not tested widely.

    9. Create a soft link in Hadoop's library and configuration directory for the downloaded files (in Step 7) using the following commands:

    # ln -s

    For example,

    # ln -s /usr/local/lib/glusterfs-0.20.2-0.1.jar $HADOOP_HOME/lib/glusterfs-0.20.2-0.1.jar

    # ln -s /usr/local/lib/conf/core-site.xml $HADOOP_HOME/conf/core-site.xml

    10. (Optional) You can run the following command on the Hadoop master to build the plugin and deploy it along with the core-site.xml file, instead of repeating the above steps:

    # build-deploy-jar.py -d $HADOOP_HOME -c
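
After the steps above, a quick check of the deployed configuration can catch a missing property before the daemon is started. The following is a minimal sketch and not part of the plugin: the function name missing_gluster_props is hypothetical, and it assumes that every property is written in the usual Hadoop form <name>PROPERTY</name> on a single line. The quick.slave.io option is left out of the check on the assumption that it is optional.

```shell
#!/bin/sh
# Print the name of every plugin property that has no <name> entry in the
# given core-site.xml, and return non-zero if at least one is missing.
# Assumption: names appear exactly as <name>PROPERTY</name>, without
# whitespace inside the tags.
missing_gluster_props() {
    conf_file=$1
    status=0
    for prop in fs.glusterfs.impl fs.default.name fs.glusterfs.volname \
                fs.glusterfs.mount fs.glusterfs.server; do
        # -F treats the pattern as a fixed string, so the dots are literal.
        if ! grep -qF "<name>${prop}</name>" "$conf_file"; then
            echo "$prop"
            status=1
        fi
    done
    return "$status"
}
```

It could be called as missing_gluster_props $HADOOP_HOME/conf/core-site.xml on each server. The check only confirms that the names are present; it does not validate the values.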

    5. Starting and Stopping the Hadoop MapReduce Daemon on GlusterFS

    The MapReduce daemon serves to run MapReduce jobs on Gluster.

    Note: You must start the Hadoop MapReduce daemon on all servers.

    5.1. Starting and Stopping MapReduce Daemon

    To start the MapReduce daemon manually, enter the following command:

    # $HADOOP_HOME/bin/start-mapred.sh

    To stop the MapReduce daemon manually, enter the following command:

    # $HADOOP_HOME/bin/stop-mapred.sh

    6. Troubleshooting Hadoop Compatible Storage

    This section describes the most common troubleshooting issues related to Hadoop Compatible Storage.

    6.1. Time Sync

    Running a MapReduce job may throw exceptions if the clocks on the hosts in the cluster are out of sync.

    Solution: Sync the time on all hosts using the ntpd program.
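
Before submitting a job, the clocks can be compared with a small helper. The following is a sketch only: the function name clock_skew_ok and the idea of a maximum tolerated skew are assumptions made for this illustration, since this guide does not state how much drift is acceptable.

```shell
#!/bin/sh
# Succeed if two epoch timestamps (in seconds) differ by at most max_skew
# seconds. The threshold is chosen by the caller; it is hypothetical and
# not a value documented for GlusterFS or Hadoop.
clock_skew_ok() {
    local_ts=$1
    remote_ts=$2
    max_skew=$3
    skew=$((local_ts - remote_ts))
    # Take the absolute value so the order of the two arguments does not matter.
    if [ "$skew" -lt 0 ]; then
        skew=$((0 - skew))
    fi
    [ "$skew" -le "$max_skew" ]
}
```

The two timestamps could be collected with date +%s on the local host and with the same command run over ssh on a remote host, where date +%s is available. The remote value includes the connection latency, so this is a rough check and not a replacement for running ntpd.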

    6.2. Socket Creation Errors

    On CentOS 5.x, the gluster CLI commands may not work when the ipv6 module is disabled, because socket creation may fail (an error message will be logged in the log file).

    Solution: In the /etc/modprobe.conf file, change the line options ipv6 disable=1 to options ipv6 disable=0 and reboot the machines.
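
The edit described above can be scripted when many machines need the change. This is a minimal sketch, assuming the line appears in the file as shown in this section; the function name enable_ipv6_module and the .bak backup suffix are hypothetical choices for this illustration.

```shell
#!/bin/sh
# Rewrite "options ipv6 disable=1" to "options ipv6 disable=0" in the given
# file. The original content is kept next to it with a .bak suffix.
enable_ipv6_module() {
    conf_file=$1
    # Keep the original file as a backup, then write the edited copy in place.
    cp -p "$conf_file" "$conf_file.bak" || return 1
    sed 's/^options[[:space:]][[:space:]]*ipv6[[:space:]][[:space:]]*disable=1[[:space:]]*$/options ipv6 disable=0/' \
        "$conf_file.bak" > "$conf_file"
}
```

It would be run as root, for example enable_ipv6_module /etc/modprobe.conf, followed by the reboot mentioned in the solution. Lines that do not match are copied unchanged.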

    7. Creating GlusterFS Volumes

    From GlusterFS 3.3 beta 2 onwards, you can create volumes of the following types in your storage environment:

    Distributed Striped Replicated: Distributes and stripes data across replicated bricks in the volume. For more information, see Creating Distributed Striped Replicated Volumes.

    Striped Replicated: Stripes and replicates data across bricks in the volume. For more information, see Creating Striped Replicated Volumes.

    7.1. Creating Distributed Striped Replicated Volumes

    Distributed striped replicated volumes distribute and stripe data across replicated bricks in the cluster. For best results, use distributed striped replicated volumes where the requirement is to scale storage, and in high concurrency environments that access very large files and where performance is critical.

    To configure a distributed striped replicated volume

    1. Create a trusted storage pool consisting of the storage servers that will comprise the volume. For information on creating a trusted storage pool, see http://www.gluster.com/community/documentation/index.php/Gluster_3.2:_Adding_Servers_to_Trusted_Storage_Pool.

    2. Create the volume using the following command:

    Note: The number of bricks should be a multiple of the stripe count times the replica count for a distributed striped replicated volume.

    # gluster volume create NEW-VOLNAME [stripe COUNT] [replica COUNT] [transport tcp | rdma | tcp,rdma] NEW-BRICK...

    To create a distributed striped replicated volume across eight storage servers:

    # gluster volume create test-volume stripe 2 replica 2 transport tcp server1:/exp1 server2:/exp2 server3:/exp3 server4:/exp4 server5:/exp5 server6:/exp6 server7:/exp7 server8:/exp8
    Creation of test-volume has been successful
    Please start the volume to access data.

    3. (Optional) Set additional options if required, such as auth.allow or auth.reject.

    For example:

    # gluster volume set test-volume auth.allow 10.*

    Note: Make sure you start your volumes before you try to mount them or else client operations after the mount will hang. For information on starting volumes, see http://www.gluster.com/community/documentation/index.php/Gluster_3.2:_Starting_Volumes.
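
The brick-count rule from the procedure above can be checked before the volume is created. The following is a sketch under the assumption that the rule means a multiple of stripe count times replica count, which matches the eight-server example (8 bricks, stripe 2, replica 2). The function name distribute_count is hypothetical.

```shell
#!/bin/sh
# Print how many stripe-by-replica groups a brick list divides into, or
# fail if the bricks do not divide evenly into such groups.
# Assumption: a valid brick count is a multiple of stripe * replica.
distribute_count() {
    stripe=$1
    replica=$2
    bricks=$3
    group=$((stripe * replica))
    if [ "$group" -le 0 ] || [ $((bricks % group)) -ne 0 ]; then
        echo "brick count $bricks is not a multiple of $group" >&2
        return 1
    fi
    echo $((bricks / group))
}
```

For the example in this section, distribute_count 2 2 8 reports two groups of four bricks, so data is distributed across two striped replicated sets. A result of 1 means that no distribution takes place.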

    7.2. Creating Striped Replicated Volumes

    Striped replicated volumes stripe data across replicated bricks in the cluster. For best results, use striped replicated volumes in high concurrency environments that access very large files and where performance is critical.

    To configure a striped replicated volume

    1. Create a trusted storage pool consisting of the storage servers that will comprise the volume. For information on creating a trusted storage pool, see http://www.gluster.com/community/documentation/index.php/Gluster_3.2:_Adding_Servers_to_Trusted_Storage_Pool.

    2. Create the volume using the following command:

    Note: The number of bricks should be a multiple of the replica count times the stripe count for a striped replicated volume.

    # gluster volume create NEW-VOLNAME [stripe COUNT] [replica COUNT] [transport tcp | rdma | tcp,rdma] NEW-BRICK...

    To create a striped replicated volume across four storage servers:

    # gluster volume create test-volume stripe 2 replica 2 transport tcp server1:/exp1 server2:/exp2 server3:/exp3 server4:/exp4
    Creation of test-volume has been successful
    Please start the volume to access data.

    To create a striped replicated volume across six storage servers:

    # gluster volume create test-volume stripe 3 replica 2 transport tcp server1:/exp1 server2:/exp2 server3:/exp3 server4:/exp4 server5:/exp5 server6:/exp6
    Creation of test-volume has been successful
    Please start the volume to access data.

    3. (Optional) Set additional options if required, such as auth.allow or auth.reject.

    For example:

    # gluster volume set test-volume auth.allow 10.*

    Note: Make sure you start your volumes before you try to mount them or else client operations after the mount will hang. For information on starting volumes, see http://www.gluster.com/community/documentation/index.php/Gluster_3.2:_Starting_Volumes.
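
When the brick list is generated by a script, the create command from the procedure above can be assembled and reviewed before it is run. This is a sketch only: the function name striped_replicated_cmd is hypothetical, it always uses the tcp transport, and it assumes that a striped replicated volume without distribution has exactly stripe count times replica count bricks, as in the four-server and six-server examples.

```shell
#!/bin/sh
# Print (do not run) a "gluster volume create" command line for a striped
# replicated volume.
# Usage: striped_replicated_cmd VOLNAME STRIPE REPLICA BRICK...
# Assumption: exactly STRIPE * REPLICA bricks are expected.
striped_replicated_cmd() {
    volname=$1
    stripe=$2
    replica=$3
    shift 3
    if [ "$#" -ne $((stripe * replica)) ]; then
        echo "expected $((stripe * replica)) bricks, got $#" >&2
        return 1
    fi
    # "$*" joins the remaining arguments (the bricks) with single spaces.
    echo "gluster volume create $volname stripe $stripe replica $replica transport tcp $*"
}
```

Printing the command instead of executing it keeps the helper safe to try on any machine; the output can be pasted into a root shell on a server in the trusted storage pool once it looks right.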

    8. Managing Your Gluster Filesystem

    The GlusterFS Administration Guide is available at: http://www.gluster.com/community/documentation/index.php/Gluster_3.2_Filesystem_Administration_Guide.
