© 2009 IBM Corporation
Hadoop High Availability through Metadata Replication
Feng Wang, Jie Qiu, Jie Yang,
Bo Dong, Xinhui Li, Ying Li
IBM Research - China
Outline
Introduction
Single Point of Failure (SPOF) in Hadoop
Existing HA work in Hadoop
Proposed solution
– Metadata
– Initialization
– Replication
– Failover
Experiments
– Failover time
– Replication cost
Conclusion and Future work
Introduction
As a computing and storage platform, Hadoop's availability is the
foundation of the availability of the applications built on it
– Hadoop provides some mechanisms to enhance the availability of
applications, but it does not provide high availability for itself
Hadoop HA challenges
–SPOF identification
–Low overhead
–Flexible configuration
SPOF in Hadoop
The Hadoop Distributed File System (HDFS) and the MapReduce framework in Hadoop
both use a master-slave architecture
– SPOF
• The NameNode in HDFS and the JobTracker in MapReduce
[Figures: MapReduce architecture and HDFS architecture (pictures from the Internet)]
Existing HA Support
Backup Node + Linux HA (Yahoo!)
DRBD + Linux HA (Cloudera)
BookKeeper + multiple NameNodes
Proposed Solution
A metadata-replication-based solution to enable Hadoop HA
–Metadata: the most important management information
• Initial metadata
• Runtime metadata
–Major phases: initialization, replication and failover
Two types of topology for Hadoop HA
–Active-Standby (AS)
• One active node for cluster management
• One standby node for failover
–Primary-Slaves (PS)
• One primary node for cluster management
• Several slave nodes for failover, which also serve a portion of read requests
Proposed Solution - Metadata
Initial metadata: replicated in initialization phase
–Version file
–File system image (fsimage)
Runtime metadata: replicated in replication phase
–Edit log
• 11 types of edit-log records, which capture modifications of the namespace
–Lease state
• Writer, file path and update time
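The lease state above can be sketched as a small record type. This is a hypothetical illustration, not Hadoop's actual `Lease` class; the field names and the soft-limit value are assumptions.

```python
from dataclasses import dataclass

# Hypothetical sketch of the lease state replicated as runtime metadata:
# each lease tracks the writer (client), the file path, and the last
# update time, so a standby node can reconstruct open-file state.
@dataclass
class LeaseState:
    writer: str         # client holding the write lease
    file_path: str      # file being written
    update_time: float  # last renewal timestamp (seconds)

    def is_expired(self, now: float, soft_limit: float = 60.0) -> bool:
        """A lease not renewed within the soft limit may be reassigned."""
        return now - self.update_time > soft_limit

lease = LeaseState("client-1", "/logs/app.log", update_time=1000.0)
print(lease.is_expired(now=1030.0))   # renewed 30 s ago -> False
print(lease.is_expired(now=1100.0))   # 100 s without renewal -> True
```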
Proposed Solution - Initialization
Node registration
Initial metadata synchronization
– Version file checking
– Fsimage file checking
– Initial metadata synchronization
Slave node registration process
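The file-checking steps above can be sketched as a checksum comparison. This is an illustrative assumption about how the check could work, not the paper's actual implementation; file names and contents are stand-ins.

```python
import hashlib

# Hypothetical sketch of the initial-metadata check: after a slave node
# registers, checksums of its VERSION and fsimage files are compared
# against the primary's, and any mismatched file is re-synchronized
# before runtime replication starts.
def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def files_to_sync(primary: dict, slave: dict) -> list:
    """Return the metadata files whose content differs between the nodes."""
    return [name for name in ("VERSION", "fsimage")
            if checksum(primary[name]) != checksum(slave.get(name, b""))]

primary = {"VERSION": b"layoutVersion=-18", "fsimage": b"<namespace image>"}
slave   = {"VERSION": b"layoutVersion=-18", "fsimage": b"<stale image>"}
print(files_to_sync(primary, slave))  # only fsimage needs synchronizing
```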
Proposed Solution - Replication
Architecture
Synchronization mode
– Configurable
Architecture of replication
Synchronization modes
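The configurable synchronization mode can be sketched as follows: synchronous replication waits for the standby's acknowledgement per record, while asynchronous replication buffers records and drains them later. The class and method names are illustrative assumptions, not Hadoop code.

```python
import queue

# Hypothetical sketch of two configurable synchronization modes for
# edit-log replication: "sync" applies the record on the standby before
# returning, "async" enqueues it and returns immediately.
class Replicator:
    def __init__(self, mode: str):
        self.mode = mode
        self.standby_log = []          # stand-in for the standby's edit log
        self.pending = queue.Queue()   # buffer used in async mode

    def replicate(self, record: str) -> None:
        if self.mode == "sync":
            self.standby_log.append(record)   # wait for the standby ack
        else:
            self.pending.put(record)          # return without waiting

    def flush(self) -> None:
        """Drain buffered records to the standby (async mode)."""
        while not self.pending.empty():
            self.standby_log.append(self.pending.get())

r = Replicator("async")
r.replicate("OP_ADD /a")
print(len(r.standby_log))  # record is still buffered, not on the standby
r.flush()
print(r.standby_log)
```

The trade-off is the usual one: synchronous mode gives stronger durability on failover, asynchronous mode keeps the network off the critical path of each metadata operation.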
Proposed Solution - Failover (1)
Leader election
– Node ordering
• Given n slave nodes, assign each slave node r a unique id i_r between 0 and n-1;
node r then picks the smallest sequence number (SN) s, larger than any SN it has
seen, such that s mod n = i_r
– Election process
[Diagram: election process among six nodes (0-5) — messages shown: SN of Node4,
agreement to Node4, SN of Node3, agreement to Node3, disagreement to Node3]
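The sequence-number rule above can be sketched directly. The helper name and the example values (n = 6, highest SN seen = 7) are illustrative assumptions; the rule itself is the one stated on the slide.

```python
# Sketch of the node-ordering rule: given n slave nodes, node r (with
# unique id i_r in 0..n-1) picks the smallest sequence number s larger
# than any SN it has seen such that s mod n == i_r. The node proposing
# the smallest outstanding SN wins the election.
def next_sn(node_id: int, n: int, max_seen: int) -> int:
    s = max_seen + 1
    while s % n != node_id:
        s += 1
    return s

n = 6
# Suppose all nodes have seen SN 7; each proposes its next SN:
proposals = {r: next_sn(r, n, max_seen=7) for r in range(n)}
print(proposals)  # {0: 12, 1: 13, 2: 8, 3: 9, 4: 10, 5: 11}
winner = min(proposals, key=proposals.get)
print(winner)     # node 2 proposes the smallest SN and is elected
```

Because each node's SNs occupy a distinct residue class mod n, no two nodes can ever propose the same number, so ties are impossible by construction.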
Proposed Solution - Failover (2)
Server reconfiguration
– Network reconfiguration
• Invoke Linux shell commands to modify the IP address and hostname of the server
– Gratuitous ARP
• Trigger other nodes to update ARP table
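The reconfiguration steps above might look like the following. The specific commands and flags are assumptions for illustration (the slide only says "Linux shell commands"), not the paper's actual scripts; in particular, `arping -U` is one common way to send a gratuitous ARP.

```python
# Hypothetical sketch of failover network reconfiguration: the elected
# standby adopts the failed NameNode's IP address and hostname, then
# sends a gratuitous ARP so peers refresh their ARP tables. Command
# names, flags, and addresses are illustrative assumptions.
def reconfiguration_commands(iface: str, ip: str, hostname: str) -> list:
    return [
        f"ip addr add {ip}/24 dev {iface}",  # adopt the primary's IP
        f"hostname {hostname}",              # adopt the primary's hostname
        f"arping -U -I {iface} {ip}",        # gratuitous ARP announcement
    ]

for cmd in reconfiguration_commands("eth0", "192.168.1.10", "namenode1"):
    print(cmd)
```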
Lease management
[Diagram: leases LeaseC1 and LeaseC2, held by Client1 and Client2, are
transferred from the old NameNode to the new NameNode during failover]
Experiments
Environment
– Hadoop cluster: 1 active NameNode + 1 standby NameNode + 3 DataNodes
– NameNode: Intel Pentium 4 CPU 3.2 GHz, 1.5 GB DDR 400 MHz memory, 1 TB disk,
SLES 10.2 (kernel 2.6.12)
– DataNode: Intel Pentium 4 CPU 3.2 GHz, 1 GB DDR 400 MHz memory, 1 TB disk,
SLES 10.2 (kernel 2.6.12)
– Network: 1.0 Gbps Ethernet
– Hadoop version: 0.20.0
Experiments - Failover time
A metric to evaluate the performance of the failover process
– Block mapping construction time + network transfer time
– Failover time varies from nearly 1 s to more than 7 s depending on the workload
– Block mapping construction time is always much smaller than network transfer time
[Chart: failover time cost (ms) vs. number of blocks (5,000 / 10,000 / 50,000 /
100,000); series: block mapping construction, network transfer, total failover;
y-axis from 0 to 8,000 ms]
Experiments - Replication cost
A metric to evaluate the performance penalty of the high-availability solution
– Synchronization mode (1)
– The metadata processing time of the replication process is nearly two times
longer than that of the normal process
– Network communication is the bottleneck, and a suitable synchronization mode
helps to mitigate it
Metadata processing performance comparison between replication process and normal process

Number of blocks | Process type | Mem. write, NameNode (ms) | Disk write, NameNode (ms) | Mem. write, standby (ms) | Disk write, standby (ms) | Network comm. (ms) | Metadata processing (ms)
5,000   | Replication | 0.043 | 1.581 | 0.351 | 1.247 | 3.649 | 6.871
5,000   | Normal      | 0.084 | 2.042 | -     | -     | -     | 2.126
10,000  | Replication | 0.129 | 1.636 | 0.431 | 1.330 | 3.426 | 6.952
10,000  | Normal      | 0.105 | 2.577 | -     | -     | -     | 2.682
50,000  | Replication | 0.151 | 2.032 | 0.505 | 1.257 | 3.651 | 7.596
50,000  | Normal      | 0.127 | 2.265 | -     | -     | -     | 2.392
100,000 | Replication | 0.165 | 1.899 | 0.529 | 1.400 | 3.499 | 7.492
100,000 | Normal      | 0.177 | 2.185 | -     | -     | -     | 2.362
Conclusions
Proposed a Hadoop HA solution
– Enables Hadoop high availability
– Based on metadata replication
• Removes the SPOF of the NameNode and the JobTracker
– Experiments illustrate that the solution is effective
Future work
– Research more effective adaptive algorithms to adjust the configuration of the
replication process (such as synchronization mode and metadata buffer size)
– Test the solution in Hadoop clusters with larger numbers of DataNodes
– Introduce additional HA technologies into Hadoop beyond the replication-based one
Thank you!
Q&A