Big Data Storage Options for Hadoop

  • 8/10/2019 Big Data Storage Options for Hadoop

    Big Data Storage Options for Hadoop

    Sam Fineberg/HP Storage Division

    Big Data Storage Options for Hadoop

    © 2013 Storage Networking Industry Association. All Rights Reserved.

    SNIA Legal Notice

    The material contained in this tutorial is copyrighted by the SNIA unless otherwise noted.

    Member companies and individual members may use this material in presentations and literature under the following conditions:
      Any slide or slides used must be reproduced in their entirety without modification
      The SNIA must be acknowledged as the source of any material used in the body of any document containing material from these presentations.

    This presentation is a project of the SNIA Education Committee.

    Neither the author nor the presenter is an attorney and nothing in this presentation is intended to be, or should be construed as legal advice or an opinion of counsel. If you need legal advice or a legal opinion please contact your attorney.

    The information presented herein represents the author's personal opinion and current understanding of the relevant issues involved. The author, the presenter, and the SNIA do not assume any responsibility or liability for damages arising out of any reliance on or use of this information.

    NO WARRANTIES, EXPRESS OR IMPLIED. USE AT YOUR OWN RISK.


    Abstract

    Big Data Storage Options for Hadoop

    The Hadoop system was developed to enable the transformation and analysis of vast amounts of structured and unstructured information. It does this by implementing an algorithm called MapReduce across compute clusters that may consist of hundreds or even thousands of nodes. In this presentation, Hadoop will be looked at from a storage perspective. The tutorial will describe the key aspects of Hadoop storage, the built-in Hadoop file system (HDFS), and some other options for Hadoop storage that exist in the commercial and open source communities.


    Overview

    Introduction
      What is Hadoop
      What is MapReduce
      How does Hadoop use storage
      Distributed filesystem concepts
    Storage options
      Native Hadoop HDFS
        On direct attached storage
        On networked (SAN) storage
      Alternative distributed filesystems
      Cloud object storage
      Emerging options


    What is Hadoop?

    A scalable fault-tolerant distributed system for data storage and processing
    Core Hadoop has two main components
      MapReduce: fault-tolerant distributed processing
        Programming model for processing sets of data
        Mapping inputs to outputs and reducing the output of multiple Mappers to one (or a few) answer(s)
      Hadoop Distributed File System (HDFS): high-bandwidth clustered storage
        Distributed file system optimized for large files
    Operates on unstructured and structured data
    A large and active ecosystem
    Written in Java
    Open source under the friendly Apache License
    http://hadoop.apache.org


    What is MapReduce?

    A method for distributing a task across multiple nodes

    Each node processes data stored on that node
    Consists of two developer-created phases
      1. Map
      2. Reduce
    In between Map and Reduce is the Shuffle and Sort


    MapReduce

    [Figure: MapReduce diagram. Source: Google, from Google Code University, http://code.google.com/edu/parallel/mapreduce-tutorial.html]


    MapReduce Operation

    [Figure: example MapReduce job answering the question "What was the max temperature for the last century?"]


    Key MapReduce Terminology and Concepts

    A user runs a client program (typically a Java application) on a client computer
    The client program submits a job to Hadoop
    The job is sent to the JobTracker process on the Master Node
    Each Slave Node runs a process called the TaskTracker
    The JobTracker instructs TaskTrackers to run and monitor tasks
    A task attempt is an instance of a task running on a slave node
    There will be at least as many task attempts as there are tasks which need to be performed
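
    The relationship between tasks and task attempts can be illustrated with a toy simulation. This is only a sketch, not Hadoop code: the function name `run_job`, the random failure model, and the `failure_rate` parameter are assumptions made for illustration. It models one reason the attempt count can exceed the task count, namely that a failed attempt is followed by a new attempt of the same task.

```python
import random


def run_job(num_tasks, failure_rate=0.3, seed=0):
    """Toy model: every task is retried until one of its attempts succeeds.

    Returns a list of (task_id, attempt_number, succeeded) tuples,
    one entry per task attempt. Hypothetical helper, not a Hadoop API.
    """
    if not 0.0 <= failure_rate < 1.0:
        raise ValueError("failure_rate must be in [0, 1)")
    rng = random.Random(seed)
    attempts = []
    for task_id in range(num_tasks):
        attempt_number = 0
        while True:
            attempt_number += 1
            succeeded = rng.random() >= failure_rate
            attempts.append((task_id, attempt_number, succeeded))
            if succeeded:
                break  # the task is done; move on to the next task
    return attempts


attempts = run_job(num_tasks=10)
successes = sum(1 for _, _, ok in attempts if ok)
print(f"{successes} tasks completed using {len(attempts)} task attempts")
```

    With a failure rate of zero the two counts are equal; any failure adds attempts but never adds tasks.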


    MapReduce in Hadoop

    [Figure: MapReduce in Hadoop. Labels: Task Tracker, Input (HDFS), Mapper, Reducer, Output (HDFS), Worker = Tasks. Source: Google, from Google Code University, http://code.google.com/edu/parallel/mapreduce-tutorial.html]


    MapReduce: Basic Concepts

    Each Mapper processes a single input split from HDFS
    Hadoop passes one record at a time to the developer's Map code
    Each record has a key and a value
    Intermediate data is written by the Mapper to local disk (not HDFS) on each of the individual cluster nodes
      Intermediate data is not reliable or globally accessible
    During the shuffle and sort phase, all values associated with the same intermediate key are transferred to the same Reducer
    Reducer is passed each key and a list of all its values
    Output from Reducers is written to HDFS
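
    The flow above can be sketched in a few lines of plain Python, using the maximum-temperature question from the earlier slide. This is a single-process illustration, not Hadoop code: the `year,temperature` record format, the sample values, and all function names are assumptions made for the example.

```python
from collections import defaultdict


def map_record(line):
    """Map: parse one 'year,temperature' record and emit (key, value)."""
    year, temp = line.split(",")
    yield year, int(temp)


def shuffle_and_sort(pairs):
    """Group all values that share an intermediate key, sorted by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_max(key, values):
    """Reduce: receives one key and the list of all its values."""
    return key, max(values)


def run(splits):
    intermediate = []
    for split in splits:      # one Mapper per input split
        for line in split:    # one record at a time
            intermediate.extend(map_record(line))
    return [reduce_max(k, vs) for k, vs in shuffle_and_sort(intermediate)]


# Two hypothetical input splits with made-up readings.
splits = [["1950,0", "1950,22", "1949,111"], ["1950,-11", "1949,78"]]
print(run(splits))  # [('1949', 111), ('1950', 22)]
```

    In a real cluster the Mappers run on different nodes and the grouped values travel over the network during the shuffle; here everything happens in one list.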


    What is a Distributed File System?

    A distributed file system is a file system that allows access to files from multiple hosts across a network
      A network filesystem (NFS/CIFS) is a type of distributed file system more tuned for file sharing than distributed computation
      Distributed computing applications, like Hadoop, utilize a tightly coupled distributed file system
    Tightly coupled distributed filesystems
      Provide a single global namespace across all nodes
      Support multiple initiators, multiple disk nodes, multiple access to files: file parallelism
      Examples include HDFS, GlusterFS, pNFS, as well as many commercial and research systems


    Hadoop Distributed File System - HDFS

    Architecture
      Java application, not deeply integrated with the server OS
      Layered on top of a standard FS (e.g., ext, xfs)
      Must use Hadoop or a special library to access HDFS files
      Shared-nothing, all nodes have direct attached disks
      Write-once filesystem: must copy a file to modify it
    HDFS basics
      Data is organized into files & directories
      Files are divided into 64-128 MB blocks, distributed across nodes
      Block placement is handled by the NameNode
        Placement coordinated with the job tracker: writes always co-located, reads co-located with computation whenever possible
      Blocks replicated to handle failure, replica blocks can be used by compute tasks
      Checksums used to ensure data integrity
      Replication: one and only strategy for error handling, recovery and fault tolerance
        Self-healing
        Makes multiple copies (typically 3)
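
    The block and replication figures above translate into simple arithmetic. The sketch below assumes a 128 MB block size and 3 copies, both taken from the ranges quoted on the slide; the function name `hdfs_footprint` and the 1000 MB example file are made up for illustration.

```python
MB = 1024 * 1024


def hdfs_footprint(file_size_bytes, block_size_bytes=128 * MB, replication=3):
    """Return (blocks, block replicas, raw bytes stored) for one file.

    Hypothetical helper: the last block only occupies as much space
    as the data it holds, so raw bytes are size times replication.
    """
    blocks = (file_size_bytes + block_size_bytes - 1) // block_size_bytes
    replicas = blocks * replication
    raw_bytes = file_size_bytes * replication
    return blocks, replicas, raw_bytes


blocks, replicas, raw = hdfs_footprint(1000 * MB)
print(blocks, replicas, raw // MB)  # 8 24 3000
```

    A 1000 MB file therefore becomes 8 blocks and 24 block replicas spread across the cluster, and consumes roughly 3000 MB of raw disk.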


    HDFS on local DAS

    A Hadoop cluster consisting of many nodes, each of which has local direct attached storage (DAS)
      Disks are running a standard file system (e.g., ext, xfs)
      HDFS blocks are stored as files in a special directory
      Disks attached directly, for example, with SAS or SATA
      No storage is shared, disks only attach to a single node
    The most common use case for Hadoop
      Original design point for Hadoop/HDFS
      Can work with cheap unreliable hardware
      Some very large systems utilize this model


    HDFS on local DAS

    [Figure: compute nodes are part of HDFS, with data spread across nodes; nodes communicate using the HDFS protocol]


    HDFS File Write Operation

    [Figure: HDFS file write with 3-way replication. Image source: Hadoop: The Definitive Guide, Tom White, O'Reilly]


    HDFS File Read Operation

    [Figure: HDFS file read. Image source: Hadoop: The Definitive Guide, Tom White, O'Reilly]


    HDFS on local DAS - Pros and Cons

    Pros
      Writes are highly parallel
        Large files are broken into many parts, distributed across the cluster
        Three copies of any file block, one written local, two remote
        Not a simple round-robin scheme, tuned for Hadoop jobs
      Job tracker attempts to make reads local
        If possible, tasks scheduled in same node as the needed file segment
        Duplicate file segments are also readable, can be used for tasks too
    Cons
      Not a replacement for general purpose storage
        Not a kernel-based POSIX filesystem
        Incompatible with standard applications and utilities (but future versions of Hadoop are adding support for more application models)
      High replication cost compared with RAID/shared disk
      The NameNode keeps track of data location
        Single point of failure (SPOF): location data is critical and must be protected
        Scalability bottleneck (everything has to be in memory)
        Improvements to NameNode are in the works
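
    Two of the points above, "one written local, two remote" and "job tracker attempts to make reads local", can be sketched as follows. This is a deliberately simplified illustration, not the HDFS implementation: the function names and node names are hypothetical, and rack topology, which real placement also takes into account, is left out.

```python
def place_replicas(writer, nodes, replication=3):
    """First copy on the writing node, remaining copies on other nodes."""
    remote = [n for n in nodes if n != writer]
    if len(remote) < replication - 1:
        raise ValueError("not enough nodes for the requested replication")
    return [writer] + remote[: replication - 1]


def pick_node(block_replicas, free_nodes):
    """Prefer a free node that already holds a copy of the block.

    Returns (node, is_local_read). Any replica is acceptable, which is
    why duplicate segments help scheduling as well as fault tolerance.
    """
    if not free_nodes:
        raise ValueError("no free nodes available")
    for node in block_replicas:
        if node in free_nodes:
            return node, True
    return sorted(free_nodes)[0], False  # fall back to a remote read


nodes = ["n1", "n2", "n3", "n4"]
replicas = place_replicas("n2", nodes)
print(replicas)                           # ['n2', 'n1', 'n3']
print(pick_node(replicas, {"n3", "n4"}))  # ('n3', True)
print(pick_node(replicas, {"n4"}))        # ('n4', False)
```

    With three copies, a task has three candidate nodes for a local read instead of one, which is part of what the extra disk cost buys.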


    Other HDFS storage options

    HDFS on Storage Area Network (SAN) attached storage
      A lot like DAS
        Disks are logical volumes in storage array(s), accessed across a SAN
        HDFS doesn't know the difference
        Still appears like a locally attached disk
      SAN attached arrays aren't the same as DAS
        Array has its own cache, redundancy, replication, etc.
        Any node on the SAN can access any array volume
        So a new node can be assigned to a failed node's data


    HDFS with SAN Storage

    [Figure: Hadoop cluster compute nodes connected to storage arrays over an iSCSI or FC SAN]


    HDFS File Write Operation

    [Figure: HDFS file write with SAN storage. The array can provide redundancy (array replication), so there is no need to replicate data across data nodes]


    HDFS File Read Operation

    [Figure: HDFS file read with SAN storage. Array redundancy means only a single source for data]


    SAN for Hadoop Storage

    Instead of storing data on direct attached local disks, data is in one or more arrays attached to data nodes through a SAN
      Looks like local storage to data nodes
      Hadoop still utilizes HDFS
    Pros
      All the normal advantages of arrays
        RAID, centralized caching, thin provisioning, other advanced array features
        Centralized management, easy redistribution of storage
      Retains advantages of HDFS (as long as array is not over-utilized)
      Easy failover when compute node dies, can eliminate or reduce 3-way replication
    Cons
      Cost? It depends
      Unless multiple arrays are used, scale is limited
        And with multiple arrays, management and cost advantages are reduced
      Still have HDFS complexity and manageability issues
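
    The capacity side of the "can eliminate or reduce 3-way replication" point can be worked out with a short calculation. The sketch below compares raw disk needed per unit of usable data for full replication and for a parity-protected group; the 8 data + 2 parity layout and the 100 TB data set are assumed example values, not figures from the tutorial, and the function names are hypothetical.

```python
def raw_per_usable_replication(copies):
    """Raw capacity consumed per unit of data when keeping full copies."""
    return float(copies)


def raw_per_usable_parity(data_disks, parity_disks):
    """Raw capacity per unit of data in a parity group (e.g., RAID 6)."""
    return (data_disks + parity_disks) / data_disks


usable_tb = 100  # hypothetical data set size
for label, factor in [
    ("3-way replication", raw_per_usable_replication(3)),
    ("8+2 parity group", raw_per_usable_parity(8, 2)),
]:
    print(f"{label}: {factor:.2f}x, {usable_tb * factor:.0f} TB raw")
# 3-way replication: 3.00x, 300 TB raw
# 8+2 parity group: 1.25x, 125 TB raw
```

    This covers disk capacity only. Array hardware, SAN infrastructure, and the loss of extra readable copies are not captured, which is why the slide answers the cost question with "it depends".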


    Other distributed filesystems

    Kernel-based tightly coupled distributed file system
      Kernel-based, i.e., no special access libraries, looks like a normal local file system
      These filesystems have existed for years in high performance computing, scale-out NAS servers, and other scale-out computing environments
        Many commercial and research examples
      Unlike HDFS, not originally designed for Hadoop
    Location awareness is part of the file system, no NameNode
      Works better if functionality is exposed to Hadoop
    Compute nodes may or may not have local storage
      Compute nodes are part of the storage cluster, but may be diskless, i.e., equal access to files and global namespace
      Can tie the filesystem's location awareness into task tracker to reduce remote storage access
    Remote storage is accessed using a filesystem-specific inter-node protocol
      Single network hop due to filesystem's location awareness


    Tightly coupled DFS for Hadoop

    General purpose shared file system
      Implemented in the kernel, single namespace, compatible with most applications (no special library or language)
    Data is distributed across local storage node disks
    Architecturally like HDFS
      Can utilize same disk options as HDFS
        Including shared nothing DAS or SAN storage
      Some can also support shared SAN storage where raw volumes can be accessed by multiple nodes
        Failover model where only one node actively uses a volume, another can take over after failure
        Multiple initiator model where multiple nodes actively use a volume
    Shared nothing option has similar cost/performance to HDFS on DAS


    Distributed FS - local disks

    [Figure: compute nodes are part of the DFS, with data spread across nodes; nodes communicate using the distributed FS inter-node protocol]


    Distributed FS - remote disks

    [Figure: compute nodes are distributed FS clients; scale-out nodes are distributed FS servers; they communicate using the distributed FS inter-node protocol]


    Remote DFS Write Operation

    [Figure: remote DFS write. Note that these diagrams are intended to be generic, and leave out much of the detail of any specific DFS]


    Local DFS Write Operation

    [Figure: local DFS write. Note that these diagrams are intended to be generic, and leave out much of the detail of any specific DFS]


    Remote DFS Read Operation

    [Figure: remote DFS read. Note that these diagrams are intended to be generic, and leave out much of the detail of any specific DFS]


    Local DFS Read Operation

    [Figure: local DFS read. Note that these diagrams are intended to be generic, and leave out much of the detail of any specific DFS]


    Tightly coupled DFS for Hadoop

    Pros
      Shared data access, any node can access any data like it is local
      POSIX compatible, works for non-Hadoop apps just like a local file system
      Centralized management and administration
      No NameNode, may have a better block mapping mechanism
      Compute in-place, same copy can be served via NFS/CIFS
      Many of the performance benefits
    Cons
      HDFS is highly optimized for Hadoop, unlikely to get same optimization for a general purpose DFS
        Large file striping is not regular, based on compute distribution
        Copies are simultaneously readable
      Strict POSIX compliance leads to unnecessary serialization
        Hadoop assumes multiple-access to files, however, accesses are on block boundaries and don't overlap
        Need to relax POSIX compliance for large files, or just stick with many smaller files
      Some DFSs have scaling limitations that are worse than HDFS, not designed for thousands of nodes
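
    The claim that Hadoop's concurrent accesses "are on block boundaries and don't overlap" can be checked with a small sketch. The helper names and the tiny file and block sizes are illustrative assumptions; the point is only that block-aligned byte ranges are pairwise disjoint, so tasks working on different blocks never touch the same bytes.

```python
def block_ranges(file_size, block_size):
    """Half-open byte ranges [start, end) aligned to block boundaries."""
    return [
        (start, min(start + block_size, file_size))
        for start in range(0, file_size, block_size)
    ]


def overlaps(a, b):
    """True if two half-open ranges share at least one byte."""
    return a[0] < b[1] and b[0] < a[1]


ranges = block_ranges(file_size=1000, block_size=256)
print(ranges)  # [(0, 256), (256, 512), (512, 768), (768, 1000)]

# No pair of block-aligned ranges overlaps.
conflicts = [
    (a, b)
    for i, a in enumerate(ranges)
    for b in ranges[i + 1:]
    if overlaps(a, b)
]
print(conflicts)  # []
```

    A file system that serializes every access to a shared file for strict POSIX semantics gives up this parallelism even though no two tasks actually conflict.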


    Cloud Object Storage for Hadoop

    Uses a REST API like CDMI, S3, or Swift
      HTTP based protocol, data is remote
      Objects are write once, read many, streaming access
      Objects have some stored metadata
    Data is stored in cloud object storage
      Could be local or across internet
      Cheap, high volume
      Systems utilize triple redundancy or erasure coding, for reliability
    Often uses Hadoop S3 connector
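
    The object semantics listed above (write once, read many, streaming access, stored metadata) can be modeled with a small in-memory class. This is a toy sketch: `WormObjectStore` and its methods are hypothetical names that loosely mirror the HTTP verbs PUT, HEAD, and GET, and they are not the CDMI, S3, or Swift API.

```python
class WormObjectStore:
    """In-memory stand-in for a write-once, read-many object store."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data, metadata=None):
        """Store a whole object; existing objects cannot be modified."""
        if key in self._objects:
            raise KeyError(f"object {key!r} already exists")
        self._objects[key] = (bytes(data), dict(metadata or {}))

    def head(self, key):
        """Return a copy of the object's stored metadata."""
        return dict(self._objects[key][1])

    def get(self, key, chunk_size=4):
        """Stream the object's bytes in order, one chunk at a time."""
        data = self._objects[key][0]
        for start in range(0, len(data), chunk_size):
            yield data[start:start + chunk_size]


store = WormObjectStore()
store.put("logs/day1", b"hello world", {"content-type": "text/plain"})
print(store.head("logs/day1"))           # {'content-type': 'text/plain'}
print(b"".join(store.get("logs/day1")))  # b'hello world'
```

    There is no operation for updating part of an object in place, which matches the way Hadoop writes whole files and then reads them sequentially.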


    Hadoop on Object Storage

    [Figure: Hadoop cluster accessing cloud object storage over REST/HTTP]


    Hadoop Write on Object Storage


    Hadoop Read on Object Storage


    Object Storage for Hadoop

    Pros
      Low cost, high volume, reliable storage
      Good location for infrequently used WORM data
      Public cloud options
      Scalable storage
      Data can easily be shared between Hadoop and other applications
    Cons
      All data is remote, which limits performance
        No data/compute colocation
      Limited capabilities, though a good match for Hadoop
      High disk cost if triple redundancy is used
    Good choice for large infrequently accessed WORM items that may need to be accessed by non-Hadoop jobs as well


    Emerging Options

    New options are emerging in the storage research community
      Caching from enterprise storage
      Mirror to enterprise storage, NAS/NFS
      SSD
    Improvements to HDFS
      HA options
      Access to non-Hadoop jobs
    Bottom line
      The limitations of HDFS are known
      Work is ongoing to improve Hadoop storage options


    Summary

    Hadoop provides a scalable fault-tolerant environment for analyzing unstructured and structured information
    The default way to store data for Hadoop is HDFS on local direct attached disks
    Alternatives to this architecture include SAN array storage, tightly-coupled general purpose DFS, and cloud object storage
      They can provide some significant advantages
      However, they aren't without their downsides; it's hard to beat a filesystem designed specifically for Hadoop
    Which one is best for you?
      Depends on what is most important: cost, manageability, compatibility with existing infrastructure, performance, scale, …


    Attribution & Feedback

    Please send any questions or comments regarding this SNIA Tutorial to [email protected]

    The SNIA Education Committee thanks the following individuals for their contributions to this Tutorial.

    Authorship History
      Sam Fineberg, August 2012
    Updates:
      Sam Fineberg, February 2013
      Sam Fineberg, March 2013
    Additional Contributors
      Rob Peglar
      Joseph White
      Chris Santilli