Transcript
Page 1: The  Hadoop  Distributed File System

PaoMin Wu University at Buffalo

The Hadoop Distributed File System

Page 2: The  Hadoop  Distributed File System

ARCHITECTURE

1. NameNode: stores the metadata of the system; keeps the entire namespace in RAM

2. DataNode: stores application data as block replicas

3. HDFS Client: user applications access the file system using the HDFS client
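
The split between the three roles above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the HDFS implementation: the class names, the 4-byte BLOCK_SIZE, the replication count of 3, and the round-robin placement are all illustrative choices made for this sketch.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 4     # bytes per block; tiny on purpose (illustrative value)
REPLICATION = 3    # replicas per block (illustrative value)


@dataclass
class DataNode:
    """Holds block replicas, i.e. the application data itself."""
    name: str
    blocks: dict = field(default_factory=dict)      # block id -> bytes


@dataclass
class NameNode:
    """Holds only metadata, kept entirely in memory."""
    datanodes: list
    namespace: dict = field(default_factory=dict)   # path -> [block ids]
    locations: dict = field(default_factory=dict)   # block id -> [DataNode]
    next_block_id: int = 0

    def allocate_block(self, path):
        block_id = self.next_block_id
        self.next_block_id += 1
        # Round-robin placement: a simplification assumed for this sketch.
        start = block_id % len(self.datanodes)
        targets = [self.datanodes[(start + i) % len(self.datanodes)]
                   for i in range(REPLICATION)]
        self.namespace.setdefault(path, []).append(block_id)
        self.locations[block_id] = targets
        return block_id, targets


class Client:
    """Asks the NameNode for metadata, then talks to DataNodes for data."""

    def __init__(self, namenode):
        self.namenode = namenode

    def write(self, path, data):
        for offset in range(0, len(data), BLOCK_SIZE):
            block_id, targets = self.namenode.allocate_block(path)
            for datanode in targets:
                datanode.blocks[block_id] = data[offset:offset + BLOCK_SIZE]

    def read(self, path):
        chunks = []
        for block_id in self.namenode.namespace[path]:
            replica = self.namenode.locations[block_id][0]  # any replica works
            chunks.append(replica.blocks[block_id])
        return b"".join(chunks)
```

Note that file contents never pass through the NameNode in this sketch: it only hands out block ids and replica locations, and the client moves the bytes to and from the DataNodes.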

Page 3: The  Hadoop  Distributed File System

HDFS Client Process

Page 4: The  Hadoop  Distributed File System

ARCHITECTURE

4. Image and Journal: namespace image = file system metadata; persistent record of the image = checkpoint

5. CheckpointNode (NameNode): protects file system metadata

6. BackupNode (NameNode): capable of creating periodic checkpoints
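
How the image, journal, and checkpoint fit together can be sketched as follows. The slide does not define the journal; this sketch assumes it is a log of namespace changes made since the last checkpoint, which is how the HDFS paper describes it. The Namespace class, its method names, and the "create"/"delete" operations are hypothetical and exist only for illustration.

```python
import copy


class Namespace:
    """In-memory image plus a journal of changes since the last checkpoint."""

    def __init__(self):
        self.image = {}        # path -> metadata, the in-memory image
        self.checkpoint = {}   # stand-in for the persistent record on disk
        self.journal = []      # stand-in for the on-disk change log

    def apply(self, op, path, meta=None):
        # Log the change first, then update the in-memory image.
        self.journal.append((op, path, meta))
        self._mutate(self.image, op, path, meta)

    @staticmethod
    def _mutate(image, op, path, meta):
        if op == "create":
            image[path] = meta
        elif op == "delete":
            image.pop(path, None)

    def create_checkpoint(self):
        # Merge the old checkpoint with the journal, then start an empty journal.
        merged = copy.deepcopy(self.checkpoint)
        for op, path, meta in self.journal:
            self._mutate(merged, op, path, meta)
        self.checkpoint = merged
        self.journal = []

    def recover(self):
        # After a restart: load the checkpoint, then replay the journal.
        image = copy.deepcopy(self.checkpoint)
        for op, path, meta in self.journal:
            self._mutate(image, op, path, meta)
        self.image = image
        return image
```

Periodic checkpoints keep the journal short, so recovery has fewer changes to replay.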

Page 5: The  Hadoop  Distributed File System

FILE I/O OPERATIONS AND REPLICA MANAGEMENT

Page 6: The  Hadoop  Distributed File System

FILE I/O OPERATIONS AND REPLICA MANAGEMENT

Page 7: The  Hadoop  Distributed File System

Sort Benchmark

Page 8: The  Hadoop  Distributed File System

Future Work

Problem: NameNode contains all important information

Solution: Allow multiple namespaces (and NameNodes) to share the physical storage within a cluster
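
A rough sketch of the proposed solution: several independent NameNodes, each owning one namespace, all placing data in the same physical pool. The prefix-based mount table, the FederatedCluster class, and the one-block-per-file shortcut are assumptions made for this illustration; the slide does not say how paths are routed to namespaces.

```python
class NameNode:
    """One independent namespace; knows nothing about the other NameNodes."""

    def __init__(self, name):
        self.name = name
        self.namespace = {}        # path -> block key in the shared pool


class FederatedCluster:
    def __init__(self, mounts):
        self.mounts = mounts       # hypothetical mount table: prefix -> NameNode
        self.shared_storage = {}   # one physical pool used by every namespace

    def resolve(self, path):
        # The longest matching prefix decides which NameNode serves the path.
        matches = [prefix for prefix in self.mounts if path.startswith(prefix)]
        if not matches:
            raise KeyError(f"no namespace serves {path}")
        return self.mounts[max(matches, key=len)]

    def write(self, path, data):
        namenode = self.resolve(path)
        block_key = (namenode.name, path)   # one block per file, for brevity
        self.shared_storage[block_key] = data
        namenode.namespace[path] = block_key

    def read(self, path):
        namenode = self.resolve(path)
        return self.shared_storage[namenode.namespace[path]]
```

Each NameNode now holds only part of the metadata, while the storage underneath stays common to the whole cluster.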

Page 9: The  Hadoop  Distributed File System

PaoMin Wu University at Buffalo

MapReduce: Simplified Data Processing on Large Clusters

Page 10: The  Hadoop  Distributed File System

Introduction

• key/value pairs

• execution across a set of machines

• handling machine failures

• managing the required inter-machine communication

• runs on a large cluster

• powerful interface

• automatic parallelization

• distribution of large-scale computations

Page 11: The  Hadoop  Distributed File System

Programming Model

Map, written by the user, takes an input pair and produces a set of intermediate key/value pairs.

The Reduce function, also written by the user, accepts an intermediate key and a set of values for that key.

The intermediate values are supplied to the user's reduce function via an iterator.
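
The model described above can be illustrated with word counting, run in a single process. This is a minimal sketch, not the distributed runtime: map_fn, reduce_fn, and run_mapreduce are names chosen for this example, and the in-memory grouping step stands in for the work the real system does across machines.

```python
from collections import defaultdict


def map_fn(doc_name, text):
    """Takes an input pair and emits intermediate key/value pairs."""
    for word in text.split():
        yield word, 1


def reduce_fn(word, counts):
    """Takes an intermediate key and an iterator over its values."""
    return word, sum(counts)


def run_mapreduce(inputs, map_fn, reduce_fn):
    # Group intermediate values by key; the real system does this across machines.
    groups = defaultdict(list)
    for key, value in inputs:
        for inter_key, inter_value in map_fn(key, value):
            groups[inter_key].append(inter_value)
    # Hand each key's values to the reduce function through an iterator.
    return dict(reduce_fn(k, iter(vs)) for k, vs in sorted(groups.items()))
```

Calling run_mapreduce([("d1", "a b a"), ("d2", "b c")], map_fn, reduce_fn) counts each word across both documents. The user writes only the two small functions; grouping and scheduling belong to the framework.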

Page 12: The  Hadoop  Distributed File System

Example:

Page 13: The  Hadoop  Distributed File System

Execution Overview:

Page 14: The  Hadoop  Distributed File System

Backup Tasks:

Page 15: The  Hadoop  Distributed File System

Conclusions

1. Restricting the programming model is beneficial

2. Network bandwidth is a scarce resource

3. Redundant execution can help

Page 16: The  Hadoop  Distributed File System

References:

The Hadoop Distributed File System. Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler. Yahoo!, Sunnyvale, California, USA. {Shv, Hairong, SRadia, Chansler}@Yahoo-Inc.com

MapReduce: Simplified Data Processing on Large Clusters. Jeffrey Dean and Sanjay Ghemawat. Google, Inc.

