hadoop_hdfs

What is HDFS?

Hadoop Distributed File System.

One of the two main components of Hadoop:

MapReduce.

HDFS.

What is Hadoop?

Distributed computing system.

Hadoop was developed to solve problems where you have huge amounts of data.

Hadoop is for situations where you want to run analytics that are deep and computationally extensive.

The underlying technology was invented by Google. Google's innovations were incorporated into Nutch, an open source project.

From there Hadoop came into existence, and Yahoo played a big role in developing Hadoop for enterprise applications.

Why Hadoop needs HDFS?

Hadoop was developed to be:

Scalable: adding and removing nodes without changing data formats.

Cost effective: use of commodity servers.

Flexible: schema-less.

Fault tolerant: built-in redundancy and failover.

HDFS matches all the requirements listed above.

Difference between HDFS and other file systems.

The biggest difference is that HDFS is a virtual file system.

The Hadoop file system runs on top of the existing file system.

Why HDFS?

HDFS is a distributed file system, while NTFS and FAT are not.

HDFS stores data reliably: it has built-in redundancy and failover, which NTFS and FAT do not.

NTFS and FAT support small block sizes (4-8 KB), while HDFS supports much larger block sizes, 64 MB by default.

NTFS and FAT are optimized for random-access reads, while HDFS is optimized for sequential reads.

HDFS does no local caching, as file sizes are huge; a typical file in HDFS can be 1 TB or even more.

Blocks?

Advantages of blocks:

Fixed in size: easy to calculate how many fit on a disk.

A file can be bigger than any single disk in the network.

Only the needed space is used: a file smaller than a block does not take up a full block's worth of disk space.

Blocks fit well with replication to provide fault tolerance and availability.

Behind the scenes, one HDFS block is backed by multiple operating system blocks.

[Diagram: one 128 MB HDFS block backed by multiple smaller operating system blocks.]
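As a small client-side illustration of blocks (not from the slides; the file path is an example), the FileSystem Java API reports a file's block size and the DataNodes holding each block's replicas:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/hduser/File.txt");   // example path

        FileStatus status = fs.getFileStatus(file);
        System.out.println("block size: " + status.getBlockSize() + " bytes");

        // One BlockLocation per HDFS block, listing the DataNodes that hold replicas.
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("offset " + block.getOffset() + " -> "
                    + String.join(", ", block.getHosts()));
        }
        fs.close();
    }
}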

Components of HDFS

NameNode

Secondary NameNode

DataNode

NameNode

Hadoop works on a Master-Slave architecture.

NameNode is Master Node. There can only be one NameNode in a Hadoop cluster.

What does the NameNode do?

Stores all the filesystem metadata for the cluster.

Oversees the health of Data Nodes and coordinates access to data.

The NameNode only knows what blocks make up a file and where those blocks are located in the cluster.

Keeps track of the cluster's storage capacity.

Makes sure each block of data meets the minimum defined replication policy (the dfs.replication property in hdfs-site.xml); see the sketch after this list.

The NameNode is a single point of failure, so don't use inexpensive commodity servers for it.

It requires a large amount of RAM, as it keeps all metadata in memory. See HDFS federation: http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/Federation.html#HDFS_Federation
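As a rough sketch (not from the slides) of how replication surfaces in code: the dfs.replication value can be read from the cluster configuration, and the replication factor of an individual file can be inspected or changed through the HDFS Java API. The file path is an example:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up hdfs-site.xml from the classpath
        System.out.println("dfs.replication = " + conf.get("dfs.replication", "3"));

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/hduser/File.txt");   // example path
        FileStatus status = fs.getFileStatus(file);
        System.out.println("current replication = " + status.getReplication());

        // Ask the NameNode to raise or lower the replication of this one file.
        fs.setReplication(file, (short) 3);
        fs.close();
    }
}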

Secondary NameNode

Does it provide a high-availability backup for the NameNode? No.

The Secondary NameNode is a housekeeper for the NameNode.

To maintain interactive speed, the filesystem metadata is stored in the NameNode's RAM.

The NameNode does not write an updated snapshot of this metadata to disk on every change.

Instead of storing the current snapshot of the filesystem every time, modifications are continually appended to a log file called the EditLog.

Restarting the NameNode involves replaying the EditLog to reconstruct the final system state.

Secondary NameNode

The SecondaryNameNode periodically compacts the EditLog into a checkpoint; the EditLog is then cleared.

A restart of the NameNode then involves loading the most recent checkpoint and a shorter EditLog containing only events since the checkpoint.

Compaction ensures that restarts do not incur unnecessary downtime.

The duties of the SecondaryNameNode end there

The Secondary NameNode connects to the NameNode (every hour by default) and grabs a copy of the NameNode's in-memory metadata, combines this information into a fresh set of files, and delivers them back to the NameNode, while keeping a copy for itself.

Configuration (core-site.xml):

fs.checkpoint.period, set to 1 hour by default.

fs.checkpoint.size, set to 64MB by default.
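As a small illustration (not something the slides show), both keys and their defaults can be read back through Hadoop's Configuration API, assuming core-site.xml is on the classpath:

import org.apache.hadoop.conf.Configuration;

public class CheckpointSettings {
    public static void main(String[] args) {
        Configuration conf = new Configuration();   // loads core-site.xml from the classpath

        // 3600 seconds (1 hour) and 64 MB are the defaults quoted above.
        long periodSecs = conf.getLong("fs.checkpoint.period", 3600);
        long sizeBytes  = conf.getLong("fs.checkpoint.size", 64L * 1024 * 1024);

        System.out.println("fs.checkpoint.period = " + periodSecs + " s");
        System.out.println("fs.checkpoint.size   = " + sizeBytes + " bytes");
    }
}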

DataNode

DataNodes are the storage servers. These are nodes where the actual data resides.

These are the slaves of the Hadoop Master-Slave architecture.

DataNodes store the blocks of data, but without the NameNode they cannot make sense of these blocks.

Data Nodes send heartbeats to the Name Node every 3 seconds via a TCP handshake.

Every tenth heartbeat is a Block Report, where the Data Node tells the Name Node about all the blocks it has.

The block reports allow the NameNode to build its metadata and ensure that the minimum required replicas of each block exist on different nodes, in different racks.

Rack Awareness

[Diagram: DataNodes DN1-DN15 spread across three racks, each rack behind its own switch, with the NameNode holding the metadata. NameNode metadata: File.txt = Block A: DN 1, 6, 7; Block B: DN 7, 14, 15. Rack awareness: Rack 1: DN 1-5; Rack 2: DN 6-10; Rack 3: DN 11-15.]

Rack Awareness

Rack Awareness is a concept whereby the NameNode is aware of which DataNode resides within which rack.

For larger Hadoop installations with multiple racks, it is important to ensure that replicas of data exist on multiple racks. This way, the loss of a switch does not render portions of the data unavailable due to all replicas being underneath it.

This rack mapping has to be configured manually, using a script written in bash or Python (or any other executable program); a sketch follows the topology data below.

To set the rack mapping script, specify the key topology.script.file.name in conf/hdfs-site.xml. This provides a command to run to return a rack id; it must be an executable script or program.

Topology data:
192.168.8.50  /dc1/rack1
192.168.8.70  /dc1/rack2
192.168.8.90  /dc1/rack2
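The slides suggest a bash or Python script; since any executable program is accepted, the same mapping is sketched below in Java, hard-coding the topology data above. It assumes Hadoop passes one or more IP addresses or hostnames as arguments and expects one rack id per argument on stdout; topology.script.file.name would then point at a small wrapper that runs this class:

import java.util.HashMap;
import java.util.Map;

// Hypothetical rack-mapping program: prints one rack id per argument.
public class RackTopology {
    private static final Map<String, String> RACKS = new HashMap<>();
    static {
        // Mirrors the topology data shown above.
        RACKS.put("192.168.8.50", "/dc1/rack1");
        RACKS.put("192.168.8.70", "/dc1/rack2");
        RACKS.put("192.168.8.90", "/dc1/rack2");
    }

    public static void main(String[] args) {
        StringBuilder out = new StringBuilder();
        for (String host : args) {
            // Unknown hosts fall back to a default rack.
            out.append(RACKS.getOrDefault(host, "/default-rack")).append(' ');
        }
        System.out.println(out.toString().trim());
    }
}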

HDFS Web Interface

HDFS exposes a web server which is capable of performing basic status monitoring and file browsing operations.

By default this is exposed on port 50070 on the NameNode. http://namenode:50070/

Contains overview information about the health, capacity, and usage of the cluster (similar to the information returned by bin/hadoop dfsadmin -report).

The address and port where the web interface listens can be changed by setting dfs.http.address in conf/hdfs-site.xml.

It must be of the form address:port. To accept requests on all addresses, use 0.0.0.0.

From this interface, you can browse HDFS itself with a basic file-browser interface.

Each DataNode exposes its file browser interface on port 50075. You can override this by setting the dfs.datanode.http.address configuration key to a setting other than 0.0.0.0:50075.

Log files generated by the Hadoop daemons can be accessed through this interface, which is useful for distributed debugging and troubleshooting.
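As a quick illustration (assuming the default ports above and a host literally named namenode), the status pages can be fetched like any other HTTP resource, which is handy for scripted health checks:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class NameNodeStatusProbe {
    public static void main(String[] args) throws Exception {
        // Default NameNode web UI address; dfs.http.address can change this.
        URL url = new URL("http://namenode:50070/");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setConnectTimeout(5000);

        System.out.println("HTTP status: " + conn.getResponseCode());
        try (BufferedReader reader =
                new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            // Print only the first few lines of the status page.
            for (int i = 0; i < 5; i++) {
                String line = reader.readLine();
                if (line == null) break;
                System.out.println(line);
            }
        }
        conn.disconnect();
    }
}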

Command Line Interface

hadoop fs -cat URI [URI ...] : Copies source paths to stdout.

hadoop fs -chgrp [-R] GROUP URI [URI ...] : Changes the group association of files.

hadoop fs -chmod [-R] MODE URI [URI ...] : Changes the permissions of files.

hadoop fs -chown [-R] [OWNER][:[GROUP]] URI [URI ...] : Changes the owner of files.

hadoop fs -put <localsrc> ... <dst> : Copies a single src, or multiple srcs, from the local file system to the destination filesystem.

hadoop fs -copyFromLocal <localsrc> URI : Similar to the put command, except that the source is restricted to a local file reference.

hadoop fs -copyToLocal [-ignorecrc] [-crc] URI <localdst> : Similar to the get command, except that the destination is restricted to a local file reference.

hadoop fs -cp URI [URI ...] <dest> : Copies files from source to destination.

Command Line Interface

hadoop fs -get [-ignorecrc] [-crc] <src> <localdst> : Copies files to the local file system.

hadoop fs -ls <args> : For a directory, returns the list of its direct children, as in Unix.

hadoop fs -lsr <args> : Recursive version of ls. Similar to Unix ls -R.

hadoop fs -mkdir <paths> : Creates directories. The behavior is much like Unix mkdir.

hadoop fs -moveFromLocal <localsrc> <dst> : Displays a "not implemented" message.

hadoop fs -mv URI [URI ...] <dest> : Moves files from source to destination.

hadoop fs -rm URI [URI ...] : Deletes files specified as args.

hadoop fs -rmr URI [URI ...] : Recursive version of delete.
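The same operations are also available programmatically. The sketch below is an illustration (not from the slides) that mirrors -mkdir, -put, -ls and -rm through the HDFS Java FileSystem API; the paths and file names are examples, and default configuration files are assumed to be on the classpath:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FsShellByApi {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // hadoop fs -mkdir /user/hduser/demo
        fs.mkdirs(new Path("/user/hduser/demo"));

        // hadoop fs -put local.txt /user/hduser/demo/
        fs.copyFromLocalFile(new Path("local.txt"),
                new Path("/user/hduser/demo/local.txt"));

        // hadoop fs -ls /user/hduser/demo
        for (FileStatus st : fs.listStatus(new Path("/user/hduser/demo"))) {
            System.out.println(st.getPath() + "  " + st.getLen() + " bytes");
        }

        // hadoop fs -rm /user/hduser/demo/local.txt (note: the API bypasses /trash)
        fs.delete(new Path("/user/hduser/demo/local.txt"), false);
        fs.close();
    }
}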

Write Operation.

[Diagram: the Hadoop client asks the NameNode where to write Result.txt. Using its metadata and rack awareness (Rack 1: DN 1-5; Rack 2: DN 6-10; Rack 3: DN 11-15), the NameNode answers with target DataNodes for each block: Block C => DN 3, 11, 12; Block D => DN 9, 4, 5.]

Write Operation.

[Diagram: the client sends Block C to the first DataNode in the list (DN 3). DN 3 pipelines the block to DN 11, and DN 11 to DN 12, with readiness acknowledgements ("Ready: 11, 12", "Ready: 12") flowing back before the data is streamed.]

Write Operation.

[Diagram: once Block C is stored on DN 3, 11 and 12, the DataNodes report "Block Received" to the NameNode, which records Result.txt = Block C: DN 3, 11, 12; Block D: DN 9, 4, 5 in its metadata, and the client is told "Success".]
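For reference, a minimal client-side sketch of the write path shown above (not part of the slides; the path and contents are examples). Calling create() is what triggers this interaction: the NameNode hands back target DataNodes and the output stream pipelines each block to them:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path out = new Path("/user/hduser/Result.txt");   // example path from the diagrams

        // create() asks the NameNode for target DataNodes; the returned stream
        // then pipelines each block (C, D, ...) to those DataNodes as we write.
        try (FSDataOutputStream stream = fs.create(out, true /* overwrite */)) {
            stream.writeBytes("result data\n");
        }
        fs.close();
    }
}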

Read Operation.

[Diagram: the Hadoop client asks the NameNode for File.txt. The NameNode returns the blocks and their locations: Block A => DN 1, 6, 7; Block B => DN 7, 14, 15.]

Read Operation.

[Diagram: the client then reads Block A and Block B directly from DataNodes in the returned lists.]

The NameNode intelligently orders the list of DataNodes for each block, taking into account the network traffic load on each DataNode containing the block.
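A matching read-side sketch (again only an illustration, with an example path): open() fetches the block locations from the NameNode, and the stream pulls each block from a replica in that ordered list:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path in = new Path("/user/hduser/File.txt");   // example path from the diagrams

        // open() returns the block locations (A => 1, 6, 7; B => 7, 14, 15 in the
        // diagrams); the stream reads each block from a DataNode in the ordered list.
        try (FSDataInputStream stream = fs.open(in);
             BufferedReader reader = new BufferedReader(new InputStreamReader(stream))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}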

Deletion of data from HDFS.

When a file is deleted by a user or an application, it is not immediately removed from HDFS.

HDFS first renames it to a file in the /trash directory.

The file can be restored quickly as long as it remains in /trash.

A file remains in /trash for a configurable amount of time.

The deletion of a file causes the blocks associated with the file to be freed.

There could be an appreciable time delay between the time a file is deleted by a user and the time of the corresponding increase in free space in HDFS.

A user can Undelete a file after deleting it as long as it remains in the /trash directory.

The /trash directory contains only the latest copy of the file that was deleted.

When the replication factor of a file is reduced, the NameNode selects excess replicas that can be deleted.

The next Heartbeat transfers this information to the DataNode. The DataNode then removes the corresponding blocks and the corresponding free space appears in the cluster.

The /trash directory is just like any other directory with one special feature: HDFS applies specified policies to automatically delete files from this directory. The current default policy is to delete files from /trash that are more than 6 hours old. In the future, this policy will be configurable through a well defined interface.
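A hedged sketch of the two deletion paths above: the Trash helper (available in newer Hadoop releases) mirrors what the shell's -rm does by moving the file into the trash directory, while a plain delete() frees the blocks directly. The path is an example:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.Trash;

public class DeleteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path victim = new Path("/user/hduser/old-data.txt");   // example path

        // Move the file into the trash directory so it can still be restored,
        // mirroring what "hadoop fs -rm" does when trash is enabled.
        boolean trashed = Trash.moveToAppropriateTrash(fs, victim, conf);

        if (!trashed) {
            // Fall back to a direct delete: the NameNode frees the blocks and
            // tells the DataNodes to drop them on a later heartbeat.
            fs.delete(victim, false);
        }
        fs.close();
    }
}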

Persistent Data Structures
Name Node

In conf/hdfs-site.xml, the dfs.name.dir property sets the NameNode's storage directory (here /home/hduser/tmp/dfs/name).

Under /home/hduser/tmp/dfs/name:

current

image

in_use.lock

previous.checkpoint

current and previous.checkpoint have the same directory structure; both contain files with the same names.

/home/hduser/tmp/dfs/ also contains a namesecondary directory, which has the same directory structure as name, except for the previous.checkpoint directory.

Directory structure for /home/hduser/tmp/dfs/name/current:

edits

fsimage

fstime

VERSION

Persistent Data Structures
Name Node

The VERSION file is a Java properties file that contains information about the version of HDFS that is running:

#Thu Sep 19 18:29:16 IST 2013

namespaceID=1443825132

cTime=0

storageType=NAME_NODE

layoutVersion=-19

namespaceID : Is a unique identifier for the filesystem.

created when the filesystem is first formatted.

Namenode uses it to identify new datanodes, since they will not know the namespaceID until they have registered with the namenode.

cTime: marks the creation time of the namenode's storage.

Newly formatted storage will always have the value zero.

It is updated to a timestamp whenever the filesystem is upgraded.
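Because VERSION is an ordinary Java properties file, it can be read back with a few lines of code; a small sketch, assuming the dfs.name.dir location used in these slides:

import java.io.FileInputStream;
import java.util.Properties;

public class ShowVersion {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Path follows the dfs.name.dir example used earlier in the slides.
        try (FileInputStream in =
                new FileInputStream("/home/hduser/tmp/dfs/name/current/VERSION")) {
            props.load(in);
        }
        System.out.println("namespaceID   = " + props.getProperty("namespaceID"));
        System.out.println("cTime         = " + props.getProperty("cTime"));
        System.out.println("storageType   = " + props.getProperty("storageType"));
        System.out.println("layoutVersion = " + props.getProperty("layoutVersion"));
    }
}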

Persistent Data Structures
Name Node

storageType : indicates that this storage directory contains data structures for a namenode.

layoutVersion: a negative integer that defines the version of HDFS's persistent data structures.

This version number has no relation to the release number of the Hadoop distribution.

Whenever the layout changes, the version number is decremented.

Persistent Data Structures
Name Node

fstime: the fstime file records the time that the last checkpoint was taken.

fsimage: a persistent checkpoint of the filesystem metadata. However, it is not updated for every filesystem write operation, since writing out the fsimage file, which can grow to gigabytes in size, would be very slow.

edits : When a filesystem client performs a write operation (such as creating or moving a file), it is first recorded in the edit log. The namenode also has an in-memory representation of the filesystem metadata, which it updates after the edit log has been modified.

The edits file would grow without bound. Though this state of affairs would have no impact on the system while the namenode is running, if the namenode were restarted, it would take a long time to apply each of the operations in its edit log.

Persistent Data Structures
Name Node

The solution is to run the secondary namenode, whose purpose is to produce checkpoints of the primary's in-memory filesystem metadata. The checkpointing process proceeds as follows:

The secondary asks the primary to roll its edits file, so new edits go to a new file.

The secondary retrieves fsimage and edits from the primary (using HTTP GET).

The secondary loads fsimage into memory, applies each operation from edits, then creates a new consolidated fsimage file.

The secondary sends the new fsimage back to the primary (using HTTP POST).

The primary replaces the old fsimage with the new one from the secondary, and the old edits file with the new one it started in step 1. It also updates the fstime file to record the time that the checkpoint was taken.

At the end of the process, the primary has an up-to-date fsimage file and a shorter edits file

Persistent Data Structures
Data Node

Under /home/hduser/tmp/dfs/data/current:
dncp_block_verification.log.curr
VERSION
blk_-4815445268453121824
blk_-4815445268453121824_1014.meta
blk_-128854014309496448
blk_-128854014309496448_1013.meta
blk_4868937369333549191
blk_4868937369333549191_1009.meta

VERSION:
#Fri Sep 20 18:50:55 IST 2013
namespaceID=1443825132
storageID=DS-1242753232-127.0.0.1-50010-1379336648923
cTime=0
storageType=DATA_NODE
layoutVersion=-19

Persistent Data Structures
Data Node

The namespaceID, cTime, and layoutVersion are all the same as the values in the namenode. The storageID is unique to the datanode and is used by the namenode to uniquely identify the datanode. The storageType identifies this directory as a datanode storage directory.

The other files in the datanode's current storage directory are the files with the blk_ prefix. There are two types:

the HDFS blocks themselves

metadata for a block

A block file just consists of the raw bytes of a portion of the file being stored; the metadata file is made up of a header with version and type information, followed by a series of checksums for sections of the block.

When the number of blocks in a directory grows to a certain size, the datanode creates a new subdirectory in which to place new blocks and their accompanying metadata. It creates a new subdirectory every time the number of blocks in a directory reaches 64 (the dfs.datanode.numblocks property). This ensures that there is a manageable number of files per directory, which avoids the problems that most operating systems encounter when there are a large number of files.

Additional HDFS Tasks

REBALANCING BLOCKS

New nodes can be added to a cluster in a straightforward manner. On the new node, the same Hadoop version and configuration (conf/hadoop-site.xml) as on the rest of the cluster should be installed. The new node should also be added to the slaves file on the master server. Starting the DataNode daemon on the machine will cause it to contact the NameNode and join the cluster.

But the new DataNode will have no data on board initially, and new files will be stored on the new DataNode in addition to the existing ones; for optimum usage, storage should be evenly balanced across all nodes. This can be achieved with the automatic balancer tool included with Hadoop. The Balancer class will intelligently balance blocks across the nodes to achieve an even distribution of blocks within a given threshold, expressed as a percentage. Smaller percentages make nodes more evenly balanced, but may require more time to achieve this state. Perfect balancing (0%) is unlikely to actually be achieved.

The balancer script can be run by starting bin/start-balancer.sh in the Hadoop directory. The script can be given a balancing threshold percentage with the -threshold parameter; e.g., bin/start-balancer.sh -threshold 5. The balancer will automatically terminate when it achieves its goal, when an error occurs, or when it cannot find more candidate blocks to move to achieve a better balance. The balancer can always be terminated safely by the administrator by running bin/stop-balancer.sh.

The amount of network bandwidth the balancer may use is kept low, with a default setting of 1 MB/s. This setting can be changed with the dfs.balance.bandwidthPerSec parameter in the hdfs-site.xml file.

Additional HDFS Tasks

DECOMMISSIONING NODES

Nodes can also be removed from a cluster while it is running, without data loss. But if nodes are simply shut down "hard," data loss may occur, as they may hold the sole copy of one or more file blocks. Nodes must be retired on a schedule that allows HDFS to ensure that no blocks are entirely replicated within the to-be-retired set of DataNodes. HDFS provides a decommissioning feature which ensures that this process is performed safely. To use it, follow the steps below:

Cluster configuration. Add a key named dfs.hosts.exclude to your conf/hadoop-site.xml file. The value associated with this key provides the full path to a file on the NameNode's local file system which contains a list of machines which are not permitted to connect to HDFS.

Determine hosts to decommission. Each machine to be decommissioned should be added to the file identified by dfs.hosts.exclude, one per line. This will prevent them from connecting to the NameNode.

Force configuration reload. Run the command bin/hadoop dfsadmin -refreshNodes. This will force the NameNode to reread its configuration, including the newly-updated excludes file.

It will decommission the nodes over a period of time, allowing time for each node's blocks to be replicated onto machines which are scheduled to remain active.

Shut down nodes. After the decommissioning process has completed, the decommissioned hardware can be safely shut down for maintenance, etc. The bin/hadoop dfsadmin -report command will describe which nodes are connected to the cluster.

Edit excludes file again. Once the machines have been decommissioned, they can be removed from the excludes file. Running bin/hadoop dfsadmin -refreshNodes again will read the excludes file back into the NameNode, allowing the DataNodes to rejoin the cluster after maintenance has been completed, or additional capacity is needed in the cluster again, etc.

Conclusion.