Self-Adaptive, Energy-Conserving variant of Hadoop Distributed File System Kumar Sharshembiev

Self-Adaptive, Energy-Conserving variant of Hadoop Distributed File System

Kumar Sharshembiev

1. Current energy issues with HDFS and large server farms

2. Past approaches and solutions for energy conservation and cost reduction

3. GreenHDFS unique design and solution

4. Conclusions and references

HDFS was designed as a scalable file system that runs on a large number of commodity servers: currently ~155,500 at Yahoo

A large number of servers generates heat and consumes energy in very large quantities

Over the lifetime of a server, the operating energy cost is comparable to the initial acquisition cost, and ownership costs such as power and cooling keep growing

A great deal of effort and research has gone into energy conservation for extremely large-scale server farms

One commonly used approach is “scale-down”: transitioning servers into a low-power state

Example: Many datacenters transfer workloads and their state to fewer servers during low-activity hours

Problem? This approach works only when servers are stateless, i.e. they get all of their data from NAS/SAN

“Scale-down” approaches work only with NAS/SAN because all of the data is stored on dedicated storage devices, which makes it possible to migrate the workload to fewer servers

Hadoop distributes its files among many servers, so any of the thousands of nodes can be participating at any moment

Self-adaptive – depends only on HDFS and file access patterns

Applies Data-Classification techniques

Does energy-aware placement of data

Trades cost, performance, and power by separating cluster into logical zones

The team did a detailed analysis of files in a production Yahoo! Hadoop cluster:

Files are heterogeneous in access and lifespan patterns: some are rarely accessed, some get deleted shortly, some stay a while

60% of data is “cold” or dormant, meaning it sits unaccessed but needs to be retained as historical data

95-98% of files had a very short “hotness” lifespan of less than 3 days, meaning they were actively used only during their first 3 days

90% of files in the top-level directory were dormant or “cold” for more than 18 days

The majority of the data had a news-server-like access pattern, where most of the computation happens soon after the data is created

GreenHDFS organizes servers into logical Hot and Cold Zones using three policies: File Migration Policy (FMP), Server Power Conserver Policy (SCP), and File Reversal Policy (FRP)

[Figure labels: FMP; Performance, Cost and Power]

FMP monitors the dormancy of the files and runs in the Hot Zone

This gives the Hot Zone higher storage efficiency because less frequently accessed files are moved to the Cold Zone

It also yields significant energy savings
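A minimal sketch of the FMP idea, assuming dormancy is measured as the number of days since a file's last access and that a single threshold decides migration. The names `FileRecord`, `select_for_cold_zone`, `run_fmp`, and the 18-day threshold are hypothetical illustrations, not GreenHDFS's actual implementation or settings.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class FileRecord:
    """Hypothetical per-file metadata; not an actual HDFS structure."""
    path: str
    last_access: datetime
    zone: str = "hot"


def select_for_cold_zone(files, now, threshold_days):
    """Return Hot Zone files whose dormancy exceeds the threshold."""
    threshold = timedelta(days=threshold_days)
    return [f for f in files
            if f.zone == "hot" and now - f.last_access > threshold]


def run_fmp(files, now, threshold_days):
    """Mark each sufficiently dormant Hot Zone file as cold; return moved paths."""
    moved = select_for_cold_zone(files, now, threshold_days)
    for f in moved:
        f.zone = "cold"
    return [f.path for f in moved]


# Example run with an assumed 18-day dormancy threshold.
now = datetime(2011, 3, 1)
files = [
    FileRecord("/data/a", now - timedelta(days=2)),   # recently read
    FileRecord("/data/b", now - timedelta(days=30)),  # dormant
]
moved_paths = run_fmp(files, now, threshold_days=18)
print(moved_paths)
```

In this sketch only the dormant file changes zone; the recently read file stays in the Hot Zone.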

[Figure: the Hot Zone (heavy computations) and the Cold Zone (idle servers). FMP moves files to the Cold Zone when coldness > threshold; files return to the Hot Zone when hotness > threshold]

SCP runs in the Cold Zone and determines which servers can go to standby/sleep mode

SCP uses hardware techniques to transition the CPU, disks, and DRAM into a low-power state

SCP wakes the server up only if:

◦ Data on that server is accessed

◦ New data needs to be placed on that server
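The wake-up rule above can be sketched as a small state machine. This is only an illustration: the `ColdServer` class, the event names `"read"` and `"place"`, and the idea of sleeping after a fixed idle period are assumptions made for the example, not details taken from GreenHDFS.

```python
from dataclasses import dataclass


@dataclass
class ColdServer:
    """Hypothetical model of a Cold Zone server; not a GreenHDFS class."""
    name: str
    asleep: bool = False
    idle_minutes: int = 0


def maybe_sleep(server, idle_threshold_minutes):
    """Put the server to sleep once it has been idle long enough (assumed rule)."""
    if not server.asleep and server.idle_minutes >= idle_threshold_minutes:
        server.asleep = True
    return server.asleep


def handle_event(server, event):
    """Wake the server only for a data access ("read") or a new placement ("place")."""
    if event in ("read", "place"):
        server.asleep = False
        server.idle_minutes = 0
    return server.asleep


server = ColdServer("cold-01", idle_minutes=90)
maybe_sleep(server, idle_threshold_minutes=60)  # server goes to sleep
handle_event(server, "heartbeat")               # unrelated event: stays asleep
handle_event(server, "read")                    # data access: wakes up
```

Any event other than the two wake-up conditions leaves a sleeping server asleep, which is what allows long idle periods in the Cold Zone.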

FRP runs in the Cold Zone and ensures that QoS, bandwidth, and response time are managed well if files become “popular”

If the number of accesses to a file exceeds the threshold, the file's metadata is changed and the file is “moved” to the Hot Zone
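A minimal sketch of the reversal rule, assuming a simple per-file access counter. The `ReversalTracker` class and the threshold value of 2 are hypothetical; the slides do not say how GreenHDFS counts accesses or which threshold it uses.

```python
from collections import Counter


class ReversalTracker:
    """Hypothetical access counter for Cold Zone files."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.accesses = Counter()
        self.zone = {}  # path -> "hot" or "cold"

    def add_cold_file(self, path):
        """Register a file that currently lives in the Cold Zone."""
        self.zone[path] = "cold"
        self.accesses[path] = 0

    def record_access(self, path):
        """Count one access; flip the file to hot once the count exceeds the threshold."""
        if self.zone.get(path) != "cold":
            return self.zone.get(path)
        self.accesses[path] += 1
        if self.accesses[path] > self.threshold:
            self.zone[path] = "hot"  # metadata change only in this sketch
        return self.zone[path]


tracker = ReversalTracker(threshold=2)
tracker.add_cold_file("/logs/2010/report")
zones = [tracker.record_access("/logs/2010/report") for _ in range(3)]
print(zones)
```

With a threshold of 2, the file stays cold for the first two accesses and is marked hot on the third.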

The threshold values of FMP, SCP, and FRP should all be chosen to maximize energy efficiency

A file goes through several stages in its lifetime:

◦ File creation: just created

◦ Hot period: frequently used

◦ Dormant period: not accessed

◦ Deletion

GreenHDFS introduced various lifespan metrics and analyzed lifespan distributions to determine optimal threshold values for its policies

◦ FileLifeSpanCFR: file creation to first read

◦ FileLifeSpanCLR: file creation to last read

◦ FileLifeSpanLRD: last read to deletion

◦ FileLifeSpanFLR: first read to last read

◦ FileLifeTime: file creation to deletion
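The five metrics above are simple differences between four timestamps in a file's life. The sketch below computes them in days; the function name `lifespan_metrics` and the example dates are made up for illustration.

```python
from datetime import datetime


def lifespan_metrics(created, first_read, last_read, deleted):
    """Return the five lifespan metrics in days (as floats)."""
    def days(start, end):
        return (end - start).total_seconds() / 86400.0

    return {
        "FileLifeSpanCFR": days(created, first_read),    # creation to first read
        "FileLifeSpanCLR": days(created, last_read),     # creation to last read
        "FileLifeSpanLRD": days(last_read, deleted),     # last read to deletion
        "FileLifeSpanFLR": days(first_read, last_read),  # first read to last read
        "FileLifeTime": days(created, deleted),          # creation to deletion
    }


# Hypothetical file: read for a few days, then dormant until deletion.
metrics = lifespan_metrics(
    created=datetime(2011, 1, 1),
    first_read=datetime(2011, 1, 2),
    last_read=datetime(2011, 1, 4),
    deleted=datetime(2011, 1, 31),
)
for name, value in metrics.items():
    print(name, value)
```

Two identities follow from the definitions: CFR + FLR = CLR, and CLR + LRD = FileLifeTime. A long FileLifeSpanLRD relative to FileLifeSpanCLR is the signature of a file that spends most of its life dormant.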

The majority of files have a short hotness lifespan

80% of files in d have a dormancy period of more than 20 days

Simulation to test energy conservation

24% reduction in energy consumption: about $2.1 million for 38,000 servers, or about $8.5 million saved on 155K servers today
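The larger figure appears to come from scaling the simulated savings linearly with server count. A short check of that arithmetic, assuming savings grow in proportion to the number of servers (a simplification; the function name `scale_savings` is hypothetical):

```python
def scale_savings(savings_usd, servers_then, servers_now):
    """Linearly scale a savings figure by server count (simplifying assumption)."""
    return savings_usd * servers_now / servers_then


# $2.1 million on 38,000 servers, scaled to 155,000 servers.
estimate = scale_savings(2.1e6, 38_000, 155_000)
print(round(estimate))
```

The result falls between $8.5 million and $8.6 million, in line with the figure quoted on the slide.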

More servers and space available = better performance

GreenHDFS is a policy-driven, self-adaptive variant of HDFS

It relies on data-classification-driven data placement, which yields significant periods of idleness on a subset of servers

It divides the cluster into 2 logical zones: Hot and Cold

It applies a set of policies to classify files as Hot or Cold

Energy consumption was reduced by 24%, saving $2.1 million for 38,000 servers at that time. Today the savings could be more than $8.5 million

Storage efficiency also increased since dormant files get moved to the Cold Zone

More space and better utilization of the Hot Zone lead to better performance for HDFS/MapReduce

http://www.cs.odu.edu/~mukka/cs775s11/Presentations/papers/kaushik.pdf

http://images.google.com/

http://cloudera.com/

http://hadoop.apache.org/
