© 2009 IBM Corporation
Hadoop High Availability through Metadata Replication
Feng Wang, Jie Qiu, Jie Yang,
Bo Dong, Xinhui Li, Ying Li
IBM Research - China
Outline
Introduction
Single Point of Failure (SPOF) in Hadoop
Existing HA work in Hadoop
Proposed solution
– Metadata
– Initialization
– Replication
– Failover
Experiments
– Failover time
– Replication cost
Conclusion and Future work
Introduction
As a computing and storage platform, Hadoop's availability is the
foundation of the availability of the applications built on it
– Hadoop provides some mechanisms to enhance the availability of
applications, but it does not provide high availability for itself
Hadoop HA challenges
–SPOF identification
–Low overhead
–Flexible configuration
SPOF in Hadoop
The Hadoop Distributed File System (HDFS) and the MapReduce framework in Hadoop
both use a master-slave architecture
– SPOF
• The NameNode in HDFS and the JobTracker in MapReduce
[Figures: MapReduce architecture and HDFS architecture (pictures from the Internet)]
Existing HA Support
Backup Node + Linux HA (Yahoo!)
DRBD + Linux HA (Cloudera)
BookKeeper + multiple NameNodes
Proposed Solution
A metadata-replication-based solution to enable Hadoop HA
–Metadata: the most important management information
• Initial metadata
• Runtime metadata
–Major phases: initialization, replication and failover
Two types of topology for Hadoop HA
–Active-Standby (AS)
• One active node for cluster management
• One standby node for failover
–Primary-Slaves (PS)
• One primary node for cluster management
• Several slave nodes for failover, which also serve a portion of read requests
Proposed Solution - Metadata
Initial metadata: replicated in initialization phase
–Version file
–File system image (fsimage)
Runtime metadata: replicated in replication phase
–Edit log
• 11 types of edit-log records, which capture modifications of the namespace
–Lease state
• Writer, file path and update time
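The lease state above can be sketched as a small record type. This is a hypothetical illustration, not Hadoop's actual `Lease` class; the field names and the soft-limit value are assumptions.

```python
from dataclasses import dataclass

# Hypothetical sketch of the lease state replicated as runtime metadata:
# each lease tracks the writer (client), the file path, and the last
# update time, so a standby node can reconstruct open-file state.
@dataclass
class LeaseState:
    writer: str         # client holding the write lease
    file_path: str      # file being written
    update_time: float  # last renewal timestamp (seconds)

    def is_expired(self, now: float, soft_limit: float = 60.0) -> bool:
        """A lease not renewed within the soft limit may be reassigned."""
        return now - self.update_time > soft_limit

lease = LeaseState("client-1", "/logs/app.log", update_time=1000.0)
print(lease.is_expired(now=1030.0))   # renewed 30 s ago -> False
print(lease.is_expired(now=1100.0))   # 100 s without renewal -> True
```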
Proposed Solution - Initialization
Node registration
Initial metadata synchronization
– Version file checking
– Fsimage file checking
– Initial metadata synchronization
Slave node registration process
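The file-checking steps above can be sketched as a checksum comparison. This is an illustrative assumption about how the check could work, not the paper's actual implementation; file names and contents are stand-ins.

```python
import hashlib

# Hypothetical sketch of the initial-metadata check: after a slave node
# registers, checksums of its VERSION and fsimage files are compared
# against the primary's, and any mismatched file is re-synchronized
# before runtime replication starts.
def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def files_to_sync(primary: dict, slave: dict) -> list:
    """Return the metadata files whose content differs between the nodes."""
    return [name for name in ("VERSION", "fsimage")
            if checksum(primary[name]) != checksum(slave.get(name, b""))]

primary = {"VERSION": b"layoutVersion=-18", "fsimage": b"<namespace image>"}
slave   = {"VERSION": b"layoutVersion=-18", "fsimage": b"<stale image>"}
print(files_to_sync(primary, slave))  # only fsimage needs synchronizing
```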
Proposed Solution - Replication
Architecture
Synchronization mode
– Configurable
Architecture of replication
Synchronization modes
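The configurable synchronization mode can be sketched as follows: synchronous replication waits for the standby's acknowledgement per record, while asynchronous replication buffers records and drains them later. The class and method names are illustrative assumptions, not Hadoop code.

```python
import queue

# Hypothetical sketch of two configurable synchronization modes for
# edit-log replication: "sync" applies the record on the standby before
# returning, "async" enqueues it and returns immediately.
class Replicator:
    def __init__(self, mode: str):
        self.mode = mode
        self.standby_log = []          # stand-in for the standby's edit log
        self.pending = queue.Queue()   # buffer used in async mode

    def replicate(self, record: str) -> None:
        if self.mode == "sync":
            self.standby_log.append(record)   # wait for the standby ack
        else:
            self.pending.put(record)          # return without waiting

    def flush(self) -> None:
        """Drain buffered records to the standby (async mode)."""
        while not self.pending.empty():
            self.standby_log.append(self.pending.get())

r = Replicator("async")
r.replicate("OP_ADD /a")
print(len(r.standby_log))  # record is still buffered, not on the standby
r.flush()
print(r.standby_log)
```

The trade-off is the usual one: synchronous mode gives stronger durability on failover, asynchronous mode keeps the network off the critical path of each metadata operation.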
Proposed Solution - Failover (1)
Leader election
– Node ordering
• Given n slave nodes, assign each slave node r a unique id i_r between 0 and n-1;
node r then picks the smallest sequence number (SN) s, larger than any SN it has
seen, such that s mod n = i_r
– Election process
[Diagram: election process among six nodes (0-5) — messages shown: SN of Node4,
agreement to Node4, SN of Node3, agreement to Node3, disagreement to Node3]
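The sequence-number rule above can be sketched directly. The helper name and the example values (n = 6, highest SN seen = 7) are illustrative assumptions; the rule itself is the one stated on the slide.

```python
# Sketch of the node-ordering rule: given n slave nodes, node r (with
# unique id i_r in 0..n-1) picks the smallest sequence number s larger
# than any SN it has seen such that s mod n == i_r. The node proposing
# the smallest outstanding SN wins the election.
def next_sn(node_id: int, n: int, max_seen: int) -> int:
    s = max_seen + 1
    while s % n != node_id:
        s += 1
    return s

n = 6
# Suppose all nodes have seen SN 7; each proposes its next SN:
proposals = {r: next_sn(r, n, max_seen=7) for r in range(n)}
print(proposals)  # {0: 12, 1: 13, 2: 8, 3: 9, 4: 10, 5: 11}
winner = min(proposals, key=proposals.get)
print(winner)     # node 2 proposes the smallest SN and is elected
```

Because each node's SNs occupy a distinct residue class mod n, no two nodes can ever propose the same number, so ties are impossible by construction.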
Proposed Solution - Failover (2)
Server reconfiguration
– Network reconfiguration
• Invoke Linux shell commands to modify the IP address and hostname of the server
– Gratuitous ARP
• Trigger other nodes to update ARP table
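The reconfiguration steps above might look like the following. The specific commands and flags are assumptions for illustration (the slide only says "Linux shell commands"), not the paper's actual scripts; in particular, `arping -U` is one common way to send a gratuitous ARP.

```python
# Hypothetical sketch of failover network reconfiguration: the elected
# standby adopts the failed NameNode's IP address and hostname, then
# sends a gratuitous ARP so peers refresh their ARP tables. Command
# names, flags, and addresses are illustrative assumptions.
def reconfiguration_commands(iface: str, ip: str, hostname: str) -> list:
    return [
        f"ip addr add {ip}/24 dev {iface}",  # adopt the primary's IP
        f"hostname {hostname}",              # adopt the primary's hostname
        f"arping -U -I {iface} {ip}",        # gratuitous ARP announcement
    ]

for cmd in reconfiguration_commands("eth0", "192.168.1.10", "namenode1"):
    print(cmd)
```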
Lease management
[Diagram: leases LeaseC1 and LeaseC2, held by Client1 and Client2, are
transferred from the old NameNode to the new NameNode during failover]
Experiments
Environment
– Hadoop cluster: 1 active NameNode + 1 standby NameNode + 3 DataNodes
– NameNode: Intel Pentium 4 CPU 3.2 GHz, 1.5 GB DDR 400 MHz memory, 1 TB disk,
SLES 10.2 (kernel 2.6.12)
– DataNode: Intel Pentium 4 CPU 3.2 GHz, 1 GB DDR 400 MHz memory, 1 TB disk,
SLES 10.2 (kernel 2.6.12)
– Network: 1.0 Gbps Ethernet
– Hadoop version: 0.20.0
Experiments - Failover time
A metric to evaluate the performance of the failover process
– Block mapping construction time + network transfer time
– Failover time varies from nearly 1 s to more than 7 s depending on the workload
– Block mapping construction time is always much smaller than network transfer time
[Chart: failover time cost (ms) vs. number of blocks (5,000 / 10,000 / 50,000 /
100,000); series: block mapping construction, network transfer, total failover;
y-axis from 0 to 8,000 ms]
Experiments - Replication cost
A metric to evaluate the performance penalty of the high-availability solution
– Synchronization mode (1)
– The metadata processing time of the replication process is nearly two times
longer than that of the normal process
– Network communication is the bottleneck, and a suitable synchronization mode
helps to mitigate it
Metadata processing performance comparison between replication process and normal process

Number of blocks | Process type | Mem. write, NameNode (ms) | Disk write, NameNode (ms) | Mem. write, standby (ms) | Disk write, standby (ms) | Network comm. (ms) | Metadata processing (ms)
5,000   | Replication | 0.043 | 1.581 | 0.351 | 1.247 | 3.649 | 6.871
5,000   | Normal      | 0.084 | 2.042 | -     | -     | -     | 2.126
10,000  | Replication | 0.129 | 1.636 | 0.431 | 1.330 | 3.426 | 6.952
10,000  | Normal      | 0.105 | 2.577 | -     | -     | -     | 2.682
50,000  | Replication | 0.151 | 2.032 | 0.505 | 1.257 | 3.651 | 7.596
50,000  | Normal      | 0.127 | 2.265 | -     | -     | -     | 2.392
100,000 | Replication | 0.165 | 1.899 | 0.529 | 1.400 | 3.499 | 7.492
100,000 | Normal      | 0.177 | 2.185 | -     | -     | -     | 2.362
Conclusions
Proposed a Hadoop HA solution
– Enables Hadoop high availability
– Based on metadata replication
• Removes the SPOF of the NameNode and the JobTracker
– Experiments illustrate that the solution is effective
Future work
– Research more effective adaptive algorithms to adjust the configuration of the
replication process (such as synchronization mode and metadata buffer size)
– Test the solution in Hadoop clusters with larger numbers of DataNodes
– Introduce additional HA technologies into Hadoop beyond the replication-based one
Thank you!
Q&A