Hadoop Distributed File System (HDFS)

Applied Big Data and Visualization (CS6502)

P. Healy (University of Limerick)
CS1-08, Computer Science Bldg.
tel: 202727
[email protected]

Spring 2019–2020



Hadoop Distributed File System (HDFS)

Outline

1 Hadoop Distributed File System (HDFS)
    Architecture


Announcements

  • Lab problems
  • Lab assessment


Hadoop Distributed File System (HDFS) Architecture


Summary

Apache HDFS:

  • block-structured file system
  • each file is divided into blocks of a pre-determined size, 128 MB by default in Hadoop 2.x (cf. the Linux block size of 4096 B)
  • blocks are stored across a cluster of one or several machines
  • follows a Master/Slave architecture, where a cluster comprises
      • a single NameNode (Master node)
      • all the other nodes, which are DataNodes (Slave nodes)
  • HDFS can be deployed on most machines that support Java
  • it is usual (though not mandatory) to run only one DataNode per machine
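To make the block arithmetic in the summary concrete, the sketch below splits a file into fixed-size blocks and compares how many blocks a 1 GiB file needs at 128 MB and at 4096 B. It is only an illustration: `split_into_blocks` is a hypothetical helper rather than a Hadoop API, the file sizes are invented, and 128 MB is taken to mean 128 × 2^20 bytes.

```python
# Illustrative sketch only: block arithmetic for a block-structured file system.
# split_into_blocks is a hypothetical helper, not part of any Hadoop API.

HDFS_BLOCK = 128 * 1024 * 1024   # 128 MB, taken here as 128 * 2**20 bytes
LINUX_BLOCK = 4096               # 4096 B, the Linux block size quoted above


def split_into_blocks(file_size, block_size):
    """Return the sizes (in bytes) of the blocks a file of file_size occupies."""
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size > 0")
    full, rest = divmod(file_size, block_size)
    sizes = [block_size] * full
    if rest:
        sizes.append(rest)       # the last block holds only the remaining bytes
    return sizes


if __name__ == "__main__":
    one_gib = 1024 ** 3
    print(len(split_into_blocks(one_gib, HDFS_BLOCK)))    # 8
    print(len(split_into_blocks(one_gib, LINUX_BLOCK)))   # 262144
    # A 300 MB file needs two full blocks plus one 44 MB block.
    print(split_into_blocks(300 * 1024 * 1024, HDFS_BLOCK))
```

The second count shows why a small block size would be costly here: the same file would need tens of thousands of times more block entries to be tracked.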


Block Diagram

NameNode: master node; manages the blocks present on the DataNodes

  • records the metadata of all the files stored in the cluster, e.g. the location of the blocks stored, the size of the files, permissions, hierarchy, etc. Two files of relevance:
      • FsImage: contains the complete state of the file system namespace since the start of the NameNode
      • EditLog: contains all the recent modifications made to the file system with respect to the most recent FsImage

  • records each change that takes place to the file system metadata. For example, if a file is deleted in HDFS, the NameNode will immediately record this in the EditLog

  • regularly receives a Heartbeat and a block report from all the DataNodes in the cluster to ensure that the DataNodes are live

  • in case of a DataNode failure, chooses new DataNodes for new replicas, balances disk usage and manages the communication traffic to the DataNodes

  • keeps a record of all the blocks in HDFS and of the nodes on which these blocks are located
  • responsible for maintaining the replication factor of all the blocks
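The heartbeat bookkeeping described for the NameNode can be sketched as follows. This is a minimal illustration, not the NameNode's actual implementation: the `HeartbeatMonitor` class, the node names and the 30 s timeout are all assumptions made for the example, and time is passed in explicitly so that the result is reproducible.

```python
# Illustrative sketch only: how a master node could track DataNode liveness
# from heartbeats. HeartbeatMonitor and its timeout are hypothetical.


class HeartbeatMonitor:
    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = {}          # DataNode id -> time of last heartbeat (s)

    def heartbeat(self, node_id, now):
        """Record that node_id sent a heartbeat at time now (in seconds)."""
        self.last_seen[node_id] = now

    def live_nodes(self, now):
        """Nodes whose last heartbeat is no older than the timeout."""
        return {n for n, t in self.last_seen.items() if now - t <= self.timeout_s}

    def dead_nodes(self, now):
        """Nodes that have been silent for longer than the timeout."""
        return set(self.last_seen) - self.live_nodes(now)


if __name__ == "__main__":
    monitor = HeartbeatMonitor(timeout_s=30)
    monitor.heartbeat("dn1", now=0)
    monitor.heartbeat("dn2", now=0)
    monitor.heartbeat("dn1", now=27)            # dn2 stops reporting after t = 0
    print(sorted(monitor.live_nodes(now=40)))   # ['dn1']
    print(sorted(monitor.dead_nodes(now=40)))   # ['dn2']
```

Once a node is classed as dead, the blocks it held become candidates for re-replication on other DataNodes.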


DataNode: slave node; runs on commodity hardware, that is, an inexpensive system which is not of high quality or high availability (in contrast to the NameNode)

  • there can be several slave nodes per physical machine
  • controlled by a process (daemon) running on the slave machine

  • block server that performs low-level reads / writes to store data; the data is stored on the DataNode's local file system, e.g. ext3 or ext4

  • sends heartbeats to the NameNode at a specified frequency (1/3 Hz = every 3 s)


Other Components

Secondary NameNode: works concurrently with the primary NameNode

  • not a backup NameNode

  • constantly reads the file system state and metadata from the RAM of the NameNode and writes them to the hard disk or the file system

  • responsible for combining the EditLogs with the FsImage from the NameNode

  • downloads the EditLogs from the NameNode at regular intervals and applies them to the FsImage. The new FsImage is copied back to the NameNode, which uses it the next time the NameNode is started (whether due to failure or not)

  • performs regular checkpoints in HDFS; it is therefore also called the CheckpointNode
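The checkpoint step, in which the EditLogs are applied to the FsImage to produce a new FsImage, can be sketched as below. The representation is an assumption made purely for illustration: the "FsImage" is a dict from path to file size, the "EditLog" is a list of tuples, and the `checkpoint` function and the three operation names are hypothetical. None of this reflects the real HDFS on-disk formats.

```python
# Illustrative sketch only: merging an edit log into a namespace snapshot.
# The dict-based "FsImage" and tuple-based "EditLog" are simplifications
# invented for this example; they are not the real HDFS formats.


def checkpoint(fsimage, editlog):
    """Return a new snapshot: the old snapshot with every logged edit applied."""
    new_image = dict(fsimage)            # leave the old snapshot untouched
    for op, path, *args in editlog:
        if op == "create":
            new_image[path] = args[0]    # args[0] is the file size in bytes
        elif op == "delete":
            new_image.pop(path, None)
        elif op == "rename":
            new_image[args[0]] = new_image.pop(path)   # args[0] is the new path
        else:
            raise ValueError(f"unknown operation: {op}")
    return new_image


if __name__ == "__main__":
    fsimage = {"/data/a.csv": 100, "/data/b.csv": 200}
    editlog = [
        ("create", "/data/c.csv", 300),
        ("delete", "/data/a.csv"),
        ("rename", "/data/b.csv", "/archive/b.csv"),
    ]
    new_fsimage = checkpoint(fsimage, editlog)
    print(new_fsimage)   # {'/data/c.csv': 300, '/archive/b.csv': 200}
```

Starting from the merged snapshot means fewer logged edits have to be replayed at the next start-up, which is the point of checkpointing regularly.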


Block Replication

Hadoop stores huge files containing data in blocks

  • block size is 128 MB; why so big?

  • blocks are replicated throughout the file system to provide fault tolerance; the default replication factor is 3

  • replicas are always stored on different DataNodes...
  • ... and are also rack-aware

  • the NameNode periodically collects block reports from the DataNodes to maintain the replication factor
  • so, whenever a block is over-replicated or under-replicated, the NameNode deletes or adds replicas as needed
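The reconciliation of block reports against the replication factor can be sketched as below. This is a simplified illustration under stated assumptions: `plan_replication` is a hypothetical helper, the node and block names are invented, and a block that no DataNode reports at all is not detected, since the sketch has no separate record of which blocks should exist.

```python
# Illustrative sketch only: comparing block reports against a target
# replication factor. plan_replication is a hypothetical helper.

from collections import defaultdict


def plan_replication(block_reports, replication_factor=3):
    """Map each block id to the number of replicas to add (>0) or delete (<0).

    block_reports maps a DataNode id to the set of block ids it reports.
    Blocks that already have the target number of replicas are omitted.
    """
    holders = defaultdict(set)           # block id -> DataNodes holding a replica
    for node, blocks in block_reports.items():
        for block in blocks:
            holders[block].add(node)
    return {
        block: replication_factor - len(nodes)
        for block, nodes in holders.items()
        if len(nodes) != replication_factor
    }


if __name__ == "__main__":
    reports = {
        "dn1": {"blk_1", "blk_2", "blk_3"},
        "dn2": {"blk_1", "blk_2"},
        "dn3": {"blk_1", "blk_3"},
        "dn4": {"blk_1", "blk_3"},
    }
    plan = plan_replication(reports)
    print(sorted(plan.items()))   # [('blk_1', -1), ('blk_2', 1)]
```

Here `blk_1` has four replicas and is over-replicated, `blk_2` has two and is under-replicated, and `blk_3` already has three and needs no action.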
