1
HADOOP DISTRIBUTED FILE SYSTEM
Lei Xu
2
Brief Introduction
- Hadoop: an Apache project for data-intensive applications
- Typical application: MapReduce (OSDI’04), a distributed framework for massive-data computation
  - Crawl and index web pages (Yahoo!)
  - Analyze popular topics and trends (Twitter)
- Led by Yahoo!/Facebook/Cloudera
3
Brief Introduction (cont’d)
- Hadoop Distributed File System (HDFS): a scalable distributed file system that serves Hadoop MapReduce applications
- Borrows its essential ideas from the Google File System
  - Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google File System. 19th ACM Symposium on Operating Systems Principles (SOSP’03)
- Shares the same design assumptions
4
Google File System
- A scalable distributed file system designed for:
  - Data-intensive applications (mainly MapReduce)
  - Web-page indexing
- Has since spread to other applications, e.g. Gmail, Bigtable, App Engine
- Fault-tolerant on low-cost hardware
- High throughput
5
Google File System (cont’d)
- Departs from the assumptions of other file systems:
- Runs on top of commodity hardware
  - Component failures are common
- Files are huge
  - Basic block size of 64~128 MB, versus 1~64 KB in traditional file systems (Ext3, NTFS, etc.)
- Massive-data / data-intensive processing
  - Large streaming reads and small random reads
  - Large, sequential writes; no (or rare) random writes
6
Hadoop DFS Assumptions
Beyond the assumptions of the Google File System, HDFS assumes:
- Simple coherency model: write-once-read-many
  - Once a file has been created, written, and closed, it cannot be changed
- Moving computation is cheaper than moving data
  - "Semi-location-aware" computation: HDFS tries its best to schedule computation close to the related data
- Portability across heterogeneous hardware and software platforms
  - HDFS is written in Java, with multi-platform support; the Google File System was written in C++ and runs on Linux
  - Stores data on top of existing file systems (NTFS/Ext4/Btrfs…)
7
HDFS Architecture
Master/slave architecture:
- NameNode: metadata server
  - File locations (file name -> DataNodes)
  - File attributes (atime/ctime/mtime, size, number of replicas, etc.)
- DataNode
  - Manages the storage attached to the node it runs on
- Client
  - Producers and consumers of data
8
HDFS Architecture (cont’d)
9
NameNode
Metadata server
- Only one NameNode per cluster
  - Single point of failure
  - Potential performance bottleneck
- Manages the file system namespace
  - Traditional hierarchical namespace
  - Keeps all file metadata in memory for fast access; the NameNode's memory size determines how many files can be supported
- Executes file system namespace operations: open/close/rename/create/unlink… (see the example below)
- Returns the locations of data blocks
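A minimal sketch of exercising these namespace operations through the Hadoop FileSystem client API; the NameNode address and paths are illustrative, not taken from the slides. Each call here becomes a metadata request served entirely by the NameNode:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NamespaceOps {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Illustrative NameNode address; normally read from the cluster's config files.
    conf.set("fs.default.name", "hdfs://namenode.example.com:9000");
    FileSystem fs = FileSystem.get(conf);

    // Each of these is a namespace operation handled by the NameNode alone.
    fs.mkdirs(new Path("/user/demo"));                          // create
    fs.rename(new Path("/user/demo"), new Path("/user/demo2")); // rename
    fs.delete(new Path("/user/demo2"), true);                   // unlink (recursive)
    fs.close();
  }
}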
10
NameNode (cont’d)
- Maintains system-wide activities, e.g. creating new replicas of file data, garbage collection, load balancing, etc.
- Periodically communicates with the DataNodes to collect their status
  - Is the DataNode alive? Is the DataNode overloaded?
11
DataNode
Storage server
- Stores fixed-size data blocks on a local file system (ext4/zfs/btrfs)
- Serves read/write operations from the clients
- Creates, deletes, and replicates data blocks upon instruction from the NameNode
- Block size = 64 MB
12
Client
- Application-level implementation; does not provide a POSIX API
  - Hadoop has a FUSE interface (FUSE: Filesystem in Userspace) with limited functionality (e.g. no random-write support)
- Queries the NameNode for file locations and metadata
- Contacts the corresponding DataNodes for file I/O (a read example follows below)
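A minimal read sketch using the Hadoop FileSystem API; the file path is made up and the cluster address is assumed to come from the default configuration. Opening the file asks the NameNode for block locations; the returned stream then pulls the bytes from the DataNodes:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    // open(): the client asks the NameNode which DataNodes hold each block.
    FSDataInputStream in = fs.open(new Path("/data/pages/part-00000"));

    // read(): the actual bytes are streamed from the DataNodes, not the NameNode.
    byte[] buffer = new byte[4096];
    int n;
    while ((n = in.read(buffer)) > 0) {
      System.out.write(buffer, 0, n);
    }
    in.close();
    fs.close();
  }
}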
13
Data Replication
- Files are stored as a sequence of blocks
- The blocks (typically 64 MB) are replicated for fault tolerance
- The replication factor is configurable per file (see the sketch below)
  - It can be specified at creation time and changed later
- The NameNode decides how to replicate blocks; it periodically receives:
  - Heartbeats, which imply that a DataNode is alive
  - Blockreports, which list all blocks stored on a DataNode
- When a DataNode goes down, the NameNode re-replicates all blocks that were on it to other active DataNodes to restore the replication factor
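A small sketch of per-file replication control through the client API (the file name is illustrative): the factor can be passed when the file is created and changed later with setReplication():

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path p = new Path("/logs/clicks.dat");

    // Specify the replication factor (here 3) at creation time.
    FSDataOutputStream out = fs.create(p, (short) 3);
    out.writeBytes("example record\n");
    out.close();

    // Change the replication factor later; the NameNode schedules the
    // extra copies (or removals) in the background.
    fs.setReplication(p, (short) 2);
    fs.close();
  }
}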
14
Data Replication (cont’d)
15
Data Replication (cont’d)
Rack awareness
- A Hadoop instance runs on a cluster of computers spread across many racks:
  - Nodes in the same rack are connected by one switch
  - Communication between nodes in different racks goes through additional switches and is slower than within a rack
  - A whole rack may fail due to network or power issues
- Rack-aware placement improves data reliability, availability, and network bandwidth utilization
16
Data Replication (cont’d)
Rack awareness (cont’d)
- In the common case, the replication factor is three (the selection is sketched below):
  - Two replicas are placed on two different nodes in the same rack
  - The third replica is placed on a node in a remote rack
- This improves write performance: two of the three writes stay within one rack, which is faster, without compromising data reliability
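The placement rule above can be written as a small selection function. This is an illustrative stand-in (invented rack and node names, at least two nodes per rack and two racks assumed), not HDFS's actual block placement code:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Random;

public class PlacementSketch {
  private static final Random rnd = new Random();

  /**
   * Pick three target nodes for a new block: two different nodes in the
   * writer's rack, plus one node in some other (remote) rack.
   * Assumes every rack has at least two nodes and there are at least two racks.
   */
  static List<String> chooseTargets(Map<String, List<String>> nodesByRack,
                                    String localRack) {
    List<String> targets = new ArrayList<>();
    List<String> local = nodesByRack.get(localRack);

    // Two replicas on two different nodes in the local rack.
    int first = rnd.nextInt(local.size());
    int second = (first + 1 + rnd.nextInt(local.size() - 1)) % local.size();
    targets.add(local.get(first));
    targets.add(local.get(second));

    // Third replica on a node in a remote rack, to survive a whole-rack failure.
    List<String> otherRacks = new ArrayList<>(nodesByRack.keySet());
    otherRacks.remove(localRack);
    List<String> remote = nodesByRack.get(otherRacks.get(rnd.nextInt(otherRacks.size())));
    targets.add(remote.get(rnd.nextInt(remote.size())));
    return targets;
  }
}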
17
Replica Selection
For READ operations, minimize bandwidth consumption and latency by preferring the nearest replica:
- If there is a replica on the same node as the reader, it is preferred
- If the cluster spans multiple data centers, replicas in the local data center are preferred
18
Filesystem Metadata
HDFS stores all file metadata on the NameNode:
- An EditLog
  - Records every change that occurs to filesystem metadata, for failure recovery
  - Same idea as journaling file systems (Ext3/NTFS)
- An FSImage
  - Stores the mapping of blocks to files and the file attributes
- Both the EditLog and the FSImage are stored locally on the NameNode
19
Filesystem Metadata (cont’d)
- A DataNode has no knowledge of HDFS files
  - It only stores data blocks as regular files on its local file system, with a checksum for data integrity
- It periodically sends the NameNode a Blockreport listing all blocks stored on that DataNode
  - Only the DataNode knows whether a given block replica is actually available
20
Filesystem Metadata (cont’d)
When the NameNode starts up, it:
- Loads the FSImage and EditLog from the local file system
- Applies the latest EditLog entries to the FSImage
- Writes the updated FSImage as the latest checkpoint and stores it permanently on the local file system
21
Communication Protocol
- A Hadoop-specific RPC protocol on top of TCP/IP
- The NameNode is simply a server that only responds to requests issued by DataNodes or clients
  - ClientProtocol.java – client protocol
  - DatanodeProtocol.java – DataNode protocol
22
Robustness
- Primary objective of HDFS: remain reliable in the presence of component failures
- In a typical large cluster (>1K nodes), component failures are common
- Three common types of failures:
  - NameNode failures
  - DataNode failures
  - Network failures
23
Robustness (cont’d)
Heartbeats
- Each DataNode periodically sends heartbeats to the NameNode
  - System status and block reports
- The NameNode marks DataNodes without recent heartbeats as dead (a toy sketch follows below):
  - No more I/O is forwarded to them
  - All data blocks on those DataNodes are marked unavailable
  - Those blocks are re-replicated if necessary (according to the replication factor)
- This detects both network failures and DataNode crashes
24
Robustness (cont’d)
Re-balancing
- Automatically moves data from one DataNode to another if its free space falls below a threshold
Data integrity
- A block of data may be corrupted due to disk faults, network faults, or buggy software
- The client computes checksums for each block and stores them in a separate hidden file in the HDFS namespace (illustrated below)
- Data is verified against its checksum before it is read
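A simplified illustration of per-chunk checksumming, not HDFS's actual checksum code: HDFS computes CRC-32 checksums over fixed-size chunks of each block (the 512-byte chunk size below is an assumption about the default) and compares them on read:

import java.util.zip.CRC32;

public class ChecksumSketch {
  // Chunk size is illustrative of the HDFS default, not read from configuration.
  private static final int CHUNK_SIZE = 512;

  /** Compute one CRC-32 checksum per chunk of the block data. */
  static long[] checksums(byte[] blockData) {
    int chunks = (blockData.length + CHUNK_SIZE - 1) / CHUNK_SIZE;
    long[] sums = new long[chunks];
    for (int i = 0; i < chunks; i++) {
      int off = i * CHUNK_SIZE;
      int len = Math.min(CHUNK_SIZE, blockData.length - off);
      CRC32 crc = new CRC32();
      crc.update(blockData, off, len);
      sums[i] = crc.getValue();
    }
    return sums;
  }

  /** On read, recompute and compare: a mismatch means this replica is corrupt
   *  and the client should fetch the block from another DataNode. */
  static boolean verify(byte[] blockData, long[] expected) {
    return java.util.Arrays.equals(checksums(blockData), expected);
  }
}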
25
Robustness (cont’d)
Metadata failures
- The FSImage and EditLog are the central data structures
  - Once they are corrupted, HDFS cannot rebuild the namespace or access its data
- The NameNode can be configured to keep multiple copies of the FSImage and EditLog (a configuration sketch follows below)
  - E.g. one copy on the local machine and another on a mounted remote NFS server
  - This reduces metadata update performance
- If the NameNode goes down, the cluster must be restarted manually
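A sketch of that configuration: in Hadoop versions of this era the relevant property is dfs.name.dir (later renamed dfs.namenode.name.dir), which takes a comma-separated list of directories; the NFS mount path below is made up, and in practice the value would normally live in hdfs-site.xml rather than be set in code:

import org.apache.hadoop.conf.Configuration;

public class NameNodeDirsSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();

    // Comma-separated list: the NameNode writes the FSImage and EditLog to
    // every listed directory, so one copy survives a local disk failure.
    conf.set("dfs.name.dir", "/local/disk/namenode,/mnt/remote-nfs/namenode");

    System.out.println(conf.get("dfs.name.dir"));
  }
}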
26
Data Organization
Data blocks
- HDFS is designed to support very large files and streaming I/O
- A file is chopped up into 64 MB blocks
  - Reduces the number of connection establishments and lets TCP transfers reach full speed
- If possible, each block of a file resides on a different DataNode
  - Enables parallel I/O and computation (MapReduce)
27
Data Organization (cont’d)
Staging
- When writing a new file, a client first caches the file data in a temporary local file until it grows beyond one HDFS block size
- The client then contacts the NameNode, which assigns a DataNode
- The client flushes the cached data to the chosen DataNode (a write example follows below)
- This keeps the network bandwidth fully utilized
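A minimal write sketch with illustrative paths and sizes. The long form of FileSystem.create() lets the client pick the buffer size, replication factor, and block size per file; data written to the returned stream is staged on the client and flushed to the DataNodes block by block:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    FSDataOutputStream out = fs.create(
        new Path("/data/crawl/part-00042"),
        true,              // overwrite if the file already exists
        64 * 1024,         // client-side buffer size in bytes
        (short) 3,         // replication factor
        64L * 1024 * 1024  // block size: 64 MB
    );

    for (int i = 0; i < 1000; i++) {
      out.writeBytes("record " + i + "\n"); // buffered until a block is ready
    }
    out.close(); // close() flushes the remaining data and commits the file
    fs.close();
  }
}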
28
Data Organization (cont’d)
Replication pipeline
- A client obtains a list of DataNodes to which one block should be flushed
- The client first flushes the data to the first DataNode
- The first DataNode receives the data in small portions (4 KB), writes each portion to local storage, and immediately forwards it to the next DataNode in the list
- The second DataNode acts like the first one
- The total transfer time for one 64 MB block is therefore roughly (a worked example follows):
  - T(64MB) + 2 * T(4KB) with the pipeline, versus 3 * T(64MB) without it
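As a rough worked example (the 100 MB/s link speed is an assumption, not taken from the slides): T(64MB) ≈ 0.64 s and T(4KB) ≈ 0.04 ms, so the pipelined transfer of a triple-replicated block finishes in about 0.64 s, while three sequential 64 MB transfers would take about 1.92 s.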
29
Replication Pipeline
- The client asks the NameNode where to put the data
- The client pushes the data to the DataNodes linearly to fully utilize network bandwidth
- The secondary replicas reply to the primary, then the primary replies to the client to signal success
  (* This figure was taken from "The Google File System" paper)
30
See also
- HBase – a BigTable implementation on Hadoop; key-value storage
- Pig – a high-level language for running data analysis on Hadoop
- ZooKeeper – "ZooKeeper: Wait-free Coordination for Internet-scale Systems", ATC'10, Best Paper
- CloudStore (KFS, previously Kosmosfs) – a C++ implementation of the Google File System design that parallels the Hadoop project
31
Google vs. Yahoo!/Facebook/Amazon…
32
Known Issues and Research Interests
- The NameNode is a single point of failure
  - Its RAM also limits the total number of files HDFS can support
- Google has moved from the one-master architecture to a multi-master cluster; however, the details are unpublished
33
Known Issues and Research Interests (cont’d)
- Replication is used to provide data reliability
  - Same trade-offs as RAID-1?
  - Apply RAID techniques to HDFS?
    - "DiskReduce: RAID for Data-Intensive Scalable Computing", PDSW'09
34
Known Issues and Research Interests (cont’d)
Energy efficiency
- DataNodes are kept alive for data availability even when no MapReduce computation is running on them
  - A waste of energy
35
Conclusion
- The Hadoop Distributed File System is designed to serve MapReduce computations:
  - Provides highly reliable storage
  - Supports massive amounts of data
  - Optimizes data placement based on the topology of the data center
- Large companies build their core businesses on top of these infrastructures
  - Google: GFS/MapReduce/BigTable
  - Yahoo!/Facebook/Amazon/Twitter/NY Times: Hadoop/HBase/Pig
36
Reference
HDFS Architecture Guide: http://hadoop.apache.org/hdfs/docs/current/hdfs_design.html
Thank you!
37
Questions?