5
Hadoop File System www.KellyTechno.com Page 1 Basic Features: HDFS ó Highly fault-tolerant ó High throughput ó Suitable for applications with large data sets ó Streaming access to file system data ó Can be built out of commodity hardware Fault tolerance: ó Failure is the norm rather than exception ó A HDFS instance may consist of thousands of server machines, each storing part of the file system’s data. ó Since we have huge number of components and that each component has non-trivial probability of failure means that there is always some component that is non-functional. ó Detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS. Data Characteristics: ó Streaming data access ó Applications need streaming access to data ó Batch processing rather than interactive user access. ó Large data sets and files: gigabytes to terabytes size ó High aggregate data bandwidth ó Scale to hundreds of nodes in a cluster

Hadoop training in hyderabad | hadoop file system

Embed Size (px)

Citation preview

Page 1: Hadoop training in hyderabad | hadoop file system

Hadoop File System

www.KellyTechno.com Page 1

Basic Features: HDFS

ó Highly fault-tolerant

ó High throughput

ó Suitable for applications with large data sets

ó Streaming access to file system data

ó Can be built out of commodity hardware

Fault tolerance:

ó Failure is the norm rather than exception

ó A HDFS instance may consist of thousands of server machines, each storing part of the file system’s data.

ó Since we have huge number of components and that each component has non-trivial probability of failure means that there is always some component that is non-functional.

ó Detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.

Data Characteristics:

ó Streaming data access

ó Applications need streaming access to data

ó Batch processing rather than interactive user access.

ó Large data sets and files: gigabytes to terabytes size

ó High aggregate data bandwidth

ó Scale to hundreds of nodes in a cluster

Page 2: Hadoop training in hyderabad | hadoop file system

Hadoop File System

www.KellyTechno.com Page 2

ó Tens of millions of files in a single instance

ó Write-once-read-many: a file once created, written and closed need not be changed – this assumption simplifies coherency

ó A map-reduce application or web-crawler application fits perfectly with this model.

MapReduce:

Namenode and Datanodes:

ó Master/slave architecture

ó HDFS cluster consists of a single Namenode, a master server that manages the file system namespace and regulates access to files by clients.

ó There are a number of DataNodes usually one per node in a cluster.

ó The DataNodes manage storage attached to the nodes that they run on.

ó HDFS exposes a file system namespace and allows user data to be stored in files.

Page 3: Hadoop training in hyderabad | hadoop file system

Hadoop File System

www.KellyTechno.com Page 3

ó A file is split into one or more blocks and set of blocks are stored in DataNodes.

DataNodes: serves read, write requests, performs block creation, deletion, and replication upon instruction from Namenode.

HDFS Architecture:

File system Namespace:

ó Hierarchical file system with directories and files

ó Create, remove, move, rename etc.

ó Namenode maintains the file system

ó Any meta information changes to the file system recorded by the Namenode.

ó An application can specify the number of replicas of the file needed: replication factor of the file. This information is stored in the Namenode.

Page 4: Hadoop training in hyderabad | hadoop file system

Hadoop File System

www.KellyTechno.com Page 4

Data Replication:

ó HDFS is designed to store very large files across machines in a large cluster.

ó Each file is a sequence of blocks.

ó All blocks in the file except the last are of the same size.

ó Blocks are replicated for fault tolerance.

ó Block size and replicas are configurable per file.

ó The Namenode receives a Heartbeat and a BlockReport from each DataNode in the cluster.

ó BlockReport contains all the blocks on a Datanode.

Replica Placement:

ó The placement of the replicas is critical to HDFS reliability and performance.

ó Optimizing replica placement distinguishes HDFS from other distributed file systems.

ó Rack-aware replica placement: ° Goal: improve reliability, availability and network bandwidth

utilization

° Research topic

ó Many racks, communication between racks are through switches.

ó Network bandwidth between machines on the same rack is greater than those in different racks.

ó Namenode determines the rack id for each DataNode.

Page 5: Hadoop training in hyderabad | hadoop file system

Hadoop File System

www.KellyTechno.com Page 5

ó Replicas are typically placed on unique racks

° Simple but non-optimal

° Writes are expensive

° Replication factor is 3

° Another research topic?

ó Replicas are placed: one on a node in a local rack, one on a different node in the local rack and one on a node in a different rack.

ó 1/3 of the replica on a node, 2/3 on a rack and 1/3 distributed evenly across remaining racks.

Replica Selection:

ó Replica selection for READ operation: HDFS tries to minimize the bandwidth consumption and latency.

ó If there is a replica on the Reader node then that is preferred.

ó HDFS cluster may span multiple data centers: replica in the local data center is preferred over the remote one.