HDFS Architecture; HDFS and CAP
Applied Big Data and Visualization
P. Healy
CS1-08, Computer Science Bldg.
[email protected]
Spring 2019–2020
P. Healy (University of Limerick) CS6502 Spring 2019–2020 1 / 13
Outline
1 HDFS Architecture
  Replica Placement / Selection
  HDFS Block-Writing
2 HDFS and CAP
  Implications for CAP
Typical Hadoop Cluster
Large HDFS instances run on a cluster of computers that is commonly spread across many racks. Communication between two nodes in different racks has to go through switches.
Usually, network bandwidth between machines in the same rack is greater than network bandwidth between machines in different racks.
Placement of Replica Blocks
Placement of replicas is critical to HDFS reliability and performance. HDFS places emphasis on optimizing replica placement.
A rack-aware replica placement policy can improve data reliability, availability, and network bandwidth utilization; it needs a lot of tuning and experience to get right, however.
A simple but non-optimal policy is to place replicas on unique racks:
✓ prevents losing data when an entire rack fails and allows use of bandwidth from multiple racks when reading data
✓ evenly distributes replicas in the cluster, which makes it easy to balance load on component failure
✗ increases the cost of writes because a write needs to transfer blocks to multiple racks
Usually (when RF = 3), HDFS's placement policy is to put one replica on one node in the local rack, another on a different node in the local rack, and the last on a different node in a different rack.
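The RF = 3 policy above can be sketched in Python. This is a toy model, not HDFS's actual implementation: the node names, rack names, and the nodes_by_rack mapping are all hypothetical.

```python
import random

def place_replicas(writer_node, nodes_by_rack):
    """Toy sketch of the default RF = 3 placement described above:
    first replica on the writer's node, second on a different node in
    the same rack, third on a node in a different rack.
    nodes_by_rack maps rack id -> list of node ids (hypothetical shape)."""
    local_rack = next(r for r, ns in nodes_by_rack.items() if writer_node in ns)
    # Second replica: another node in the local rack.
    local_peers = [n for n in nodes_by_rack[local_rack] if n != writer_node]
    second = random.choice(local_peers)
    # Third replica: any node in a randomly chosen remote rack.
    remote_racks = [r for r in nodes_by_rack if r != local_rack]
    third = random.choice(nodes_by_rack[random.choice(remote_racks)])
    return [writer_node, second, third]

cluster = {"rack1": ["dn1", "dn2", "dn3"], "rack2": ["dn4", "dn5"]}
print(place_replicas("dn1", cluster))
```

A real NameNode also weighs node load and free space when choosing among candidates; the random choices here stand in for that.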
✓ cuts the inter-rack write traffic, which generally improves write performance
✓ the chance of rack failure is far less than that of node failure
✓ does not impact data reliability and availability guarantees
✓ aggregate network bandwidth used when reading data is reduced, since a block is placed in only two unique racks rather than three
✗ With this policy, the replicas of a file are not evenly distributed across the racks
✗ One third of replicas are on one node, two thirds of replicas are on one rack, and the other third are evenly distributed across the remaining racks
Overall, this policy improves write performance without compromising data reliability or read performance.
If RF > 3, the placement of the 4th and subsequent replicas is determined randomly while keeping the number of replicas per rack below the upper limit, which is (#replicas − 1) / #racks + 2.
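The per-rack upper limit quoted above is a one-liner; a small sketch makes the arithmetic concrete. Assuming the division is integer division (my reading of the formula, not stated on the slide):

```python
def max_replicas_per_rack(replicas, racks):
    # Upper limit quoted above: (#replicas - 1) / #racks + 2,
    # taken here as integer (floor) division.
    return (replicas - 1) // racks + 2

# With RF = 5 spread over 3 racks, no rack should hold more than 3 replicas.
print(max_replicas_per_rack(5, 3))  # → 3
```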
Replica Selection for Reading
Goal: minimize global bandwidth consumption (packet contention) and read latency
HDFS tries to satisfy a read request from the replica that is closest to the reader
If there exists a replica on the same rack as the reader node, then that replica is preferred to satisfy the read request
If the HDFS cluster spans multiple data centres, then a replica that is resident in the local data centre is preferred over any remote replica
To ensure data integrity, each block's checksum is verified on retrieval
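The "closest replica" preference above can be sketched with a toy distance function. The (data centre, rack, host) tuples and the 0–3 distance scale are illustrative assumptions for this sketch, not HDFS's internal representation:

```python
def read_distance(reader, replica):
    """Hypothetical network distance: 0 = same node, 1 = same rack,
    2 = same data centre, 3 = remote data centre.
    Nodes are (datacentre, rack, host) tuples in this sketch."""
    if replica == reader:
        return 0
    if replica[:2] == reader[:2]:   # same data centre and rack
        return 1
    if replica[0] == reader[0]:     # same data centre, different rack
        return 2
    return 3

def pick_replica(reader, replicas):
    # Prefer the replica closest to the reader, as described above.
    return min(replicas, key=lambda r: read_distance(reader, r))

reader = ("dc1", "rack1", "dn1")
replicas = [("dc1", "rack1", "dn2"), ("dc1", "rack2", "dn5"), ("dc2", "rack9", "dn7")]
print(pick_replica(reader, replicas))  # → ('dc1', 'rack1', 'dn2')
```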
Write Architecture
To write a file consisting of two blocks, A and B, into HDFS:
The HDFS client connects to the NameNode with a write request for the two blocks, A and B
The NameNode grants the client write permission and provides the IP addresses of the destination DataNodes; destinations are based on availability, replication factor and rack awareness
Suppose the NameNode provides the following lists of IP addresses to the client (RF = 3):
Block A: list A = {IP of DataNode 1, IP of DataNode 4, IP of DataNode 6}
Block B: list B = {IP of DataNode 3, IP of DataNode 7, IP of DataNode 9}
Each block is copied in pipeline fashion
Replication Pipelining
Set-up of pipeline
Before writing blocks, the client confirms whether the DataNodes in each list of IPs are ready to receive the data or not
So, the client creates a pipeline for each of the blocks by connecting the individual DataNodes in the respective list for that block
Considering Block A: list A = {IP of DataNode 1, IP of DataNode 4, IP of DataNode 6}
Data streaming and replication
"Pass it on": Block A is passed to the first nominated DataNode, 1, and is stored there
DataNode 1 then passes the block along to DataNode 4, the second nominated DataNode
DataNode 4, lastly, passes it along to DataNode 6
Shutdown of pipeline (acknowledgement stage)
The reverse of the previous operation
DataNode 6 acknowledges to DataNode 4 that it wrote the block safely
In turn, DataNode 4 acks its predecessor
The operation is complete when the NameNode receives the final ack
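The forward "pass it on" stage and the reverse acknowledgement stage can be simulated in a few lines. The event-log strings and list structure below are illustrative, not HDFS's wire protocol:

```python
def pipeline_write(block, datanodes):
    """Toy simulation of the replication pipeline described above:
    the block is passed node-to-node down the list, then acks flow
    back in reverse. Returns a log of events (hypothetical format)."""
    events, stored = [], {}
    # Forward pass: each DataNode stores the block and forwards it.
    for i, dn in enumerate(datanodes):
        stored[dn] = block
        events.append(f"{dn} stored block")
        if i + 1 < len(datanodes):
            events.append(f"{dn} -> {datanodes[i + 1]}")
    # Reverse pass: acks travel back up the pipeline, last node first.
    for dn, pred in zip(reversed(datanodes), reversed(datanodes[:-1])):
        events.append(f"{dn} acks {pred}")
    events.append("NameNode receives final ack")
    return events

for event in pipeline_write("Block A", ["dn1", "dn4", "dn6"]):
    print(event)
```

Note how the client only talks to the first DataNode; the pipeline itself fans the block out, which is what keeps inter-rack write traffic low.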
Implications for CAP
What is CAP?
A distributed file system requires a network, which may fail, resulting in a partition of the network
The CAP theorem states that it is impossible for a distributed data store to simultaneously provide more than two of the following three guarantees:
1 Consistency: equivalent to having a single up-to-date copy of the data
2 high Availability of that data (for updates), though reads may not return the most current copy
3 tolerance to network Partitions
When a network partition failure happens, should we:
Cancel the operation, thus ensuring consistency but decreasing availability
Proceed with the operation, thus risking inconsistency but providing availability
Some people claim this concern is overblown
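The cancel-versus-proceed choice above can be made concrete with a toy write handler. Everything here (the function, the flags, the return values) is hypothetical, purely to illustrate the CP/AP trade-off:

```python
class PartitionError(Exception):
    pass

def handle_write(value, partitioned, prefer_consistency=True):
    """Toy illustration of the choice posed above: on a network
    partition, a CP system cancels the write, while an AP system
    accepts it and risks divergent replicas."""
    if not partitioned:
        return ("committed", value)
    if prefer_consistency:
        # CP: refuse the write rather than risk inconsistency.
        raise PartitionError("write cancelled: cannot reach all replicas")
    # AP: accept locally; replicas may diverge until the partition heals.
    return ("accepted locally", value)
```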
Hadoop and CAP
HDFS has a unique central decision point, the NameNode
Thus it can only fall on the CP side, since taking down the NameNode takes down the entire HDFS system (no Availability)
Hadoop does not try to hide this (from its home page):
"The NameNode is a Single Point of Failure for the HDFS Cluster. HDFS is not currently a High Availability system. When the NameNode goes down, the file system goes offline. There is an optional SecondaryNameNode that can be hosted on a separate machine. It only creates checkpoints of the namespace by merging the edits file into the fsimage file and does not provide any real redundancy."
Hadoop and CAP (contd.)
Since the decision of where to place data, and where it can be read from, is always handled by the NameNode, which maintains a consistent view in memory, HDFS is always consistent (C)
It is also partition-tolerant in that it can handle losing DataNodes, subject to the replication factor and data topology strategies