Hadoop Distributed File System
Lei Xu

Page 1: Hadoop  Distributed File System

HADOOP DISTRIBUTED FILE SYSTEM

Lei Xu

Page 2: Hadoop  Distributed File System

Brief Introduction

  • Hadoop
      • An Apache project for data-intensive applications
      • Typical application: MapReduce (OSDI’04), a distributed programming model for massive-data computation
          • Crawl and index web pages (Y!)
          • Analyze popular topics and trends (Twitter)
      • Led by Yahoo!/Facebook/Cloudera

Page 3: Hadoop  Distributed File System

Brief Introduction (cont’d)

  • Hadoop Distributed File System (HDFS)
      • A scalable distributed file system that serves Hadoop MapReduce applications
      • Borrows its essential ideas from the Google File System
          • Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google File System. 19th ACM Symposium on Operating Systems Principles (SOSP’03)
      • Shares the same design assumptions

Page 4: Hadoop  Distributed File System

Google File System

  • A scalable distributed file system designed for:
      • Data-intensive applications (mainly MapReduce)
          • Web page indexing
          • It later spread to other applications, e.g. Gmail, Bigtable, App Engine
      • Fault tolerance
          • Low-cost hardware
      • High throughput

Page 5: Hadoop  Distributed File System

Google File System (cont’d)

  • Departures from the assumptions of other file systems
      • Runs on commodity hardware
          • Component failures are common
      • Files are huge
          • Basic block size is 64~128 MB
          • 1~64 KB in traditional file systems (Ext3, NTFS, etc.)
      • Massive-data/data-intensive processing
          • Large streaming reads and small random reads
          • Large, sequential writes
          • No (or barely any) random writes

Page 6: Hadoop  Distributed File System

Hadoop DFS Assumptions

  • In addition to the assumptions of the Google File System, HDFS assumes:
      • Simple coherency model
          • Write-once-read-many
          • Once a file has been created, written, and closed, it cannot be changed
      • Moving computation is cheaper than moving data
          • “Semi-location-aware” computation
          • Tries its best to assign computations close to the related data
      • Portability across heterogeneous hardware and software platforms
          • Written in Java, with multi-platform support
              • The Google File System was written in C++ and runs on Linux
          • Stores data on top of existing file systems (NTFS/Ext4/Btrfs…)

Page 7: Hadoop  Distributed File System

HDFS Architecture

  • Master/slave architecture
  • NameNode
      • Metadata server
          • File locations (file name -> DataNodes)
          • File attributes (atime/ctime/mtime, size, number of replicas, etc.)
  • DataNode
      • Manages the storage attached to the node it runs on
  • Client
      • Producers and consumers of data

Page 8: Hadoop  Distributed File System

HDFS Architecture (cont’d)

Page 9: Hadoop  Distributed File System

NameNode

  • Metadata server
  • Only one NameNode per cluster
      • Single point of failure
      • Potential performance bottleneck
  • Manages the file system namespace
      • Traditional hierarchical namespace
      • Keeps all file metadata in memory for fast access
          • The memory size of the NameNode determines how many files can be supported
  • Executes file system namespace operations: open/close/rename/create/unlink…
  • Returns the locations of data blocks
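
The link between NameNode memory and the number of supported files can be made concrete with a back-of-the-envelope estimate. The sketch below assumes roughly 150 bytes of heap per namespace object (one per file plus one per block); that figure is a commonly quoted rule of thumb and does not come from the slides, and the function name is hypothetical.

```python
# Rough NameNode heap estimate.
# BYTES_PER_OBJECT is an assumed rule of thumb, not a figure from the slides.
BYTES_PER_OBJECT = 150
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the block size used in the slides


def namenode_heap_bytes(num_files: int, avg_file_size: int) -> int:
    """Estimate the heap needed for one file object plus its block objects."""
    blocks_per_file = max(1, -(-avg_file_size // BLOCK_SIZE))  # ceiling division
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT
```

Under these assumptions, many small files cost as much NameNode memory as the same number of full 64 MB files, which is why the file count rather than the data volume is the limiting factor.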

Page 10: Hadoop  Distributed File System

NameNode (cont’d)

  • Maintains system-wide activities
      • E.g. creating new replicas of file data, garbage collection, load balancing, etc.
  • Periodically communicates with the DataNodes to collect their status
      • Is the DataNode alive?
      • Is the DataNode overloaded?

Page 11: Hadoop  Distributed File System

DataNode

  • Storage server
      • Stores fixed-size data blocks on local file systems (ext4/zfs/btrfs)
      • Serves read/write operations from the clients
      • Creates, deletes, and replicates data blocks upon instruction from the NameNode
      • Block size = 64 MB

Page 12: Hadoop  Distributed File System

Client

  • Application-level implementation
      • Does not provide a POSIX API
      • Hadoop has a FUSE interface
          • FUSE: Filesystem in Userspace
          • Has limited functionality (e.g., no random write support)
  • Queries the NameNode for file locations and metadata
  • Contacts the corresponding DataNodes for file I/O
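
The two-step read path (metadata from the NameNode, data from the DataNodes) can be sketched with in-memory stand-ins. The classes and method names below are hypothetical and are not the Hadoop client API; the sketch also always reads from the first listed replica.

```python
class NameNode:
    """Hypothetical stand-in: holds only metadata, never file data."""

    def __init__(self):
        self.block_map = {}  # path -> list of (block_id, [DataNode names])

    def get_block_locations(self, path):
        return self.block_map[path]


class DataNode:
    """Hypothetical stand-in: stores block bytes keyed by block id."""

    def __init__(self):
        self.blocks = {}

    def read_block(self, block_id):
        return self.blocks[block_id]


def read_file(path, namenode, datanodes):
    """One metadata lookup, then block reads straight from the DataNodes."""
    data = b""
    for block_id, holders in namenode.get_block_locations(path):
        data += datanodes[holders[0]].read_block(block_id)
    return data
```

The point of the split is that file contents never flow through the NameNode, so it only has to answer small metadata requests.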

Page 13: Hadoop  Distributed File System

Data Replication

  • Files are stored as a sequence of blocks
      • The blocks (typically 64 MB) are replicated for fault tolerance
      • The replication factor is configurable per file
          • It can be specified at creation time and changed later
  • The NameNode decides how to replicate blocks. It periodically receives:
      • Heartbeat, which implies the DataNode is alive
      • Blockreport, which contains a list of all blocks on a DataNode
  • When a DataNode is down, the NameNode re-replicates all blocks on that DataNode to other active DataNodes to restore the required number of replicas
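
How the NameNode could work out which blocks need new replicas after a DataNode fails can be sketched from the Blockreports alone. This is a minimal illustration with hypothetical names and plain dictionaries, not the actual NameNode logic.

```python
def under_replicated(block_reports, replication, live_nodes):
    """Return {block_id: missing replica count} from per-DataNode Blockreports.

    block_reports: {datanode: set of block ids}
    replication:   {block_id: target replication factor}
    live_nodes:    set of DataNodes with a recent Heartbeat
    """
    counts = {block: 0 for block in replication}
    for node, blocks in block_reports.items():
        if node not in live_nodes:
            continue  # replicas on dead nodes do not count
        for block in blocks:
            if block in counts:
                counts[block] += 1
    return {b: replication[b] - c for b, c in counts.items() if c < replication[b]}
```

Each entry in the result is work for the NameNode: it would instruct a DataNode that still holds the block to copy it to another active DataNode.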

Page 14: Hadoop  Distributed File System

Data Replication (cont’d)

Page 15: Hadoop  Distributed File System

Data Replication (cont’d)

  • Rack awareness
      • A Hadoop instance runs on a cluster of computers spread across many racks:
          • Nodes in the same rack are connected by one switch
          • Communication between two nodes in different racks goes through switches
              • Slower than between nodes in the same rack
          • A whole rack may fail due to network or power issues
      • Improves data reliability, availability, and network bandwidth utilization

Page 16: Hadoop  Distributed File System

Data Replication (cont’d)

  • Rack awareness (cont’d)
      • In the common case, the replication factor is three
          • Two replicas are placed on two different nodes in the same rack
          • The third replica is placed on a node in a remote rack
      • Improves write performance
          • 2/3 of the writes stay in the same rack, which is faster
          • Without compromising data reliability
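
The three-replica placement described above can be sketched as a small function. It is deterministic for illustration and follows only the rule stated on the slide; the function name and the topology dictionary are hypothetical, and a real policy would also weigh load and free space.

```python
def place_replicas(local_rack, topology):
    """Pick three DataNodes: two in the local rack, one in a remote rack.

    topology: {rack name: [node names]}
    """
    local_nodes = topology[local_rack]
    if len(local_nodes) < 2:
        raise ValueError("local rack needs at least two nodes")
    remote_racks = [r for r in sorted(topology) if r != local_rack and topology[r]]
    if not remote_racks:
        raise ValueError("no remote rack available")
    return [local_nodes[0], local_nodes[1], topology[remote_racks[0]][0]]
```

With this layout, losing a whole rack still leaves at least one replica, while only one of the three copies has to cross the slower inter-rack link.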

Page 17: Hadoop  Distributed File System

Replica Selection

  • For READ operations:
      • Minimize bandwidth consumption and latency
      • Prefer the nearer node:
          • If there is a replica on the same node, it is preferred
          • The cluster may span multiple data centers; replicas in the same data center are preferred
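
The "prefer the nearer node" rule can be expressed as a distance function over locations. The four distance levels below (same node, same rack, same data center, elsewhere) and the tuple layout are assumptions made for this sketch.

```python
def distance(reader, replica):
    """Locations are (data_center, rack, node) tuples; smaller means closer."""
    if reader == replica:
        return 0  # replica on the reader's own node
    if reader[:2] == replica[:2]:
        return 1  # same rack
    if reader[0] == replica[0]:
        return 2  # same data center, different rack
    return 3      # different data center


def pick_replica(reader, replicas):
    """Choose the replica closest to the reader."""
    return min(replicas, key=lambda r: distance(reader, r))
```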

Page 18: Hadoop  Distributed File System

Filesystem Metadata

  • HDFS stores all file metadata on the NameNode
      • An EditLog
          • Records every change that occurs to the filesystem metadata
          • Used for failure recovery
          • Similar to the journal in journaling file systems (Ext3/NTFS)
      • An FSImage
          • Stores the mapping of blocks to files and the file attributes
      • The EditLog and FSImage are stored locally on the NameNode

Page 19: Hadoop  Distributed File System

Filesystem Metadata (cont’d)

  • A DataNode has no knowledge of HDFS files
      • It only stores data blocks as regular files on its local file system
          • With a checksum for data integrity
      • It periodically sends the NameNode a Blockreport that lists all blocks stored on this DataNode
          • Only the DataNode knows whether a given block replica is available

Page 20: Hadoop  Distributed File System

Filesystem Metadata (cont’d)

  • When the NameNode starts up, it:
      • Loads the FSImage and EditLog from the local file system
      • Updates the FSImage with the latest EditLog entries
      • Creates a new FSImage as the latest checkpoint and stores it permanently on the local file system
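
The startup sequence amounts to replaying a journal over the last checkpoint. The sketch below models the FSImage as a dictionary and the EditLog as a list of tuples; the record format and operation names are hypothetical and do not reflect the on-disk formats HDFS uses.

```python
def apply_edit(namespace, edit):
    """Apply one logged operation to the namespace (path -> list of block ids)."""
    op, path, *args = edit
    if op == "create":
        namespace[path] = []
    elif op == "add_block":
        namespace[path].append(args[0])
    elif op == "rename":
        namespace[args[0]] = namespace.pop(path)
    elif op == "delete":
        del namespace[path]


def start_namenode(fsimage, edit_log):
    """Replay the EditLog over the last FSImage.

    Returns the new FSImage (the checkpoint) and an empty EditLog.
    """
    namespace = {path: list(blocks) for path, blocks in fsimage.items()}
    for edit in edit_log:
        apply_edit(namespace, edit)
    return namespace, []
```

Because the checkpoint already contains every logged change, the EditLog can start empty again, which keeps the next startup short.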

Page 21: Hadoop  Distributed File System

Communication Protocol

  • A Hadoop-specific RPC on top of TCP/IP
  • The NameNode is simply a server that only responds to requests issued by DataNodes or clients
      • ClientProtocol.java – client protocol
      • DatanodeProtocol.java – DataNode protocol

Page 22: Hadoop  Distributed File System

Robustness

  • Primary objective of HDFS: remain reliable despite component failures
      • In a typical large cluster (>1K nodes), component failures are common
  • Three common types of failures:
      • NameNode failures
      • DataNode failures
      • Network failures

Page 23: Hadoop  Distributed File System

Robustness (cont’d)

  • Heartbeats
      • Each DataNode periodically sends heartbeats to the NameNode
          • System status and block reports
      • The NameNode marks DataNodes without recent heartbeats as dead
          • Does not forward I/O to them
          • Marks all data blocks on these DataNodes as unavailable
          • Re-replicates these blocks if necessary (according to the replication factor)
      • Can detect both network failures and DataNode crashes
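
Heartbeat-based failure detection reduces to tracking the last time each DataNode was heard from. In this minimal sketch the class name and the 30-second timeout are assumptions chosen for illustration, and timestamps are passed in explicitly so the logic is easy to test.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat per DataNode; the timeout is an assumed value."""

    def __init__(self, timeout_s=30.0):
        self.timeout_s = timeout_s
        self.last_seen = {}  # DataNode name -> time of last heartbeat

    def heartbeat(self, node, now):
        self.last_seen[node] = now

    def dead_nodes(self, now):
        """DataNodes whose last heartbeat is older than the timeout."""
        return sorted(n for n, t in self.last_seen.items() if now - t > self.timeout_s)
```

The NameNode cannot tell a crashed DataNode from one cut off by the network; both simply stop sending heartbeats, so both are handled the same way.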

Page 24: Hadoop  Distributed File System

Robustness (cont’d)

  • Re-balancing
      • Automatically moves data from one DataNode to another
          • If the free space falls below a threshold
  • Data integrity
      • A block of data may be corrupted
          • Disk faults, network faults, buggy software
      • The client computes checksums for each block and stores them in a separate hidden file in the HDFS namespace
      • The data is verified against these checksums when it is read
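
The checksum scheme can be illustrated with CRC32 from Python's standard library. The 512-byte chunk size and the function names are assumptions for this sketch; the slides only say that checksums are computed per block and stored separately.

```python
import zlib

CHUNK = 512  # bytes covered by each checksum; an assumed value for illustration


def checksums(block: bytes):
    """CRC32 of every CHUNK-sized slice of a block."""
    return [zlib.crc32(block[i:i + CHUNK]) for i in range(0, len(block), CHUNK)]


def verify(block: bytes, expected):
    """True if the block still matches the checksums stored at write time."""
    return checksums(block) == expected
```

If verification fails, the reader can fetch the same block from another replica instead of returning corrupted data.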

Page 25: Hadoop  Distributed File System

Robustness (cont’d)

  • Metadata failures
      • FSImage and EditLog are the central data structures
          • Once they are corrupted, HDFS cannot build the namespace or access data
      • The NameNode can be configured to maintain multiple copies of the FSImage and EditLog
          • E.g. one FSImage/EditLog on the local machine and another on a mounted remote NFS server
          • This reduces update performance
      • Once the NameNode is down, the cluster must be restarted manually

Page 26: Hadoop  Distributed File System

Data Organization

  • Data blocks
      • HDFS is designed to support very large files and streaming I/O
      • A file is chopped up into 64 MB blocks
          • Reduces the number of connection establishments and speeds up TCP transmissions
      • If possible, each block of a file resides on a different DataNode
          • For future parallel I/O and computations (MapReduce)
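
Chopping a file into fixed-size blocks is simple offset arithmetic. The helper below is a hypothetical illustration that returns the byte range of each block; only the last block may be shorter than the block size.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, as in the slides


def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE):
    """Return (offset, length) for each block of a file of file_size bytes."""
    return [(off, min(block_size, file_size - off))
            for off in range(0, file_size, block_size)]
```

A 200 MB file, for example, yields three full 64 MB blocks and one 8 MB block, and each of the four can be placed on a different DataNode.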

Page 27: Hadoop  Distributed File System

Data Organization (cont’d)

  • Staging
      • When writing a new file:
          • The client first caches the file data in a temporary local file until the cached data reaches the HDFS block size
          • The client then contacts the NameNode, which assigns a DataNode
          • The client flushes the cached data to the chosen DataNode
      • This fully utilizes the bandwidth

Page 28: Hadoop  Distributed File System

Data Organization (cont’d)

  • Replication pipeline
      • A client obtains a list of DataNodes to flush one block to
      • The client first flushes the data to the first DataNode
      • The first DataNode receives the data in small portions (4 KB), writes each portion to local storage, and immediately transfers it to the next DataNode in the list
      • The second DataNode acts like the first one
      • The total transfer time for one block (64 MB) is:
          • T(64 MB) + T(4 KB) * 2 with the pipeline
          • 3 * T(64 MB) without the pipeline
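
The two transfer-time formulas can be evaluated numerically. The sketch assumes T(x) is simply size divided by link bandwidth, with latency and disk time ignored; that model and the function names are assumptions, not part of the slides.

```python
def transfer_time(size_bytes, bandwidth_bps):
    """T(size): seconds to push size_bytes over one link, ignoring latency."""
    return size_bytes * 8 / bandwidth_bps


def pipelined(block, packet, replicas, bandwidth_bps):
    """One full block transfer plus one packet delay per extra replica."""
    return (transfer_time(block, bandwidth_bps)
            + (replicas - 1) * transfer_time(packet, bandwidth_bps))


def sequential(block, replicas, bandwidth_bps):
    """The block is sent in full once per replica."""
    return replicas * transfer_time(block, bandwidth_bps)
```

On an assumed 1 Gbit/s link, a 64 MB block with three replicas takes about 0.54 s pipelined versus about 1.61 s sequentially, so the extra replicas are nearly free.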

Page 29: Hadoop  Distributed File System

Replication Pipeline

  • The client asks the NameNode where to put data
  • The client pushes data to the DataNodes linearly to fully utilize the network bandwidth
  • The secondary replicas reply to the primary; the primary then replies to the client on success

* This figure is from “The Google File System” paper

Page 30: Hadoop  Distributed File System

See also

  • HBase – a Bigtable implementation on Hadoop
      • Key-value storage
  • Pig – a high-level language for running data analysis on Hadoop
  • ZooKeeper
      • “ZooKeeper: Wait-free Coordination for Internet-scale Systems”, ATC’10, Best Paper
  • CloudStore (KFS, previously Kosmosfs)
      • A C++ implementation of the Google File System
      • Parallels the Hadoop project

Page 31: Hadoop  Distributed File System

Google vs. Y!/Facebook/Amazon…

Page 32: Hadoop  Distributed File System

Known Issues and Research Interests

  • The NameNode is a single point of failure
      • It also limits the total number of files supported in HDFS
          • RAM limitation
      • Google has changed its one-master architecture to a multi-master cluster
          • However, the details have not been revealed

Page 33: Hadoop  Distributed File System

Known Issues and Research Interests (cont’d)

  • Uses replication to provide data reliability
      • Same problems as RAID-1?
      • Apply RAID technologies to HDFS?
          • “DiskReduce: RAID for Data-Intensive Scalable Computing”, PDSW’09

Page 34: Hadoop  Distributed File System

Known Issues and Research Interests (cont’d)

  • Energy efficiency
      • DataNodes must stay alive for data availability
      • However, there may be no MapReduce computations running on them
          • Waste of energy

Page 35: Hadoop  Distributed File System

Conclusion

  • The Hadoop Distributed File System is designed to serve MapReduce computations
      • Provides highly reliable storage
      • Supports massive amounts of data
      • Optimizes data placement policies based on the topology of data centers
  • Large companies build their core businesses on top of these infrastructures
      • Google: GFS/MapReduce/Bigtable
      • Yahoo!/Facebook/Amazon/Twitter/NY Times: Hadoop/HBase/Pig

Page 36: Hadoop  Distributed File System

Reference

HDFS Architecture Guide: http://hadoop.apache.org/hdfs/docs/current/hdfs_design.html

Page 37: Hadoop  Distributed File System

Thank you!

Questions?