
Evaluation and Analysis of GreenHDFS: A Self-Adaptive, Energy-Conserving Variant of the Hadoop Distributed File System

Rini T. Kaushik, University of Illinois, Urbana-Champaign
[email protected]

Milind Bhandarkar, Yahoo! Inc.
[email protected]

Klara Nahrstedt, University of Illinois, Urbana-Champaign
[email protected]

Abstract

We present a detailed evaluation and sensitivity analysis of an energy-conserving, highly scalable variant of the Hadoop Distributed File System (HDFS) called GreenHDFS. GreenHDFS logically divides the servers in a Hadoop cluster into Hot and Cold zones and relies on insightful, data-classification-driven, energy-conserving data placement to realize guaranteed, substantially long periods (several days) of idleness in a significant subset of servers in the Cold zone. Detailed lifespan analysis of the files in a large-scale production Hadoop cluster at Yahoo! points to the viability of GreenHDFS. Simulation results with real-world Yahoo! HDFS traces show that GreenHDFS can achieve a 24% energy cost reduction by doing power management in only one top-level tenant directory in the cluster, and that it meets all the scale-down mandates in spite of the unique scale-down challenges present in a Hadoop cluster. If the GreenHDFS technique were applied to all the Hadoop clusters at Yahoo! (amounting to 38,000 servers), $2.1 million could be saved in energy costs per annum. Sensitivity analysis shows that energy-conservation is minimally sensitive to the thresholds in GreenHDFS. Lifespan analysis points out that one-size-fits-all energy-management policies won't suffice in a multi-tenant Hadoop cluster.

1 Introduction

Cloud computing is gaining rapid popularity. Data-intensive computing needs range from advertising optimization, user-interest prediction, mail anti-spam, and data analytics to deriving search rankings. An increasing number of companies and academic institutions have started to rely on Hadoop [1], an open-source version of Google's Map-reduce framework, for their data-intensive computing needs [13]. Hadoop's data-intensive computing framework is built on large-scale, highly resilient object-based cluster storage managed by the Hadoop Distributed File System (HDFS) [24].

With the increase in the sheer volume of data that needs to be processed, the storage and server demands of computing workloads are rising rapidly. Yahoo!'s compute infrastructure already hosts 170 petabytes of data and deploys over 38,000 servers [15]. Over the lifetime of IT equipment, the operating energy cost is comparable to the initial equipment acquisition cost [11] and constitutes a significant part of the total cost of ownership of a datacenter [6]. Hence, energy-conservation of extremely large-scale, commodity server farms has become a priority.

Scale-down (i.e., transitioning servers to an inactive, low-power sleep/standby state) is an attractive technique to conserve energy, as it allows energy proportionality even with non-energy-proportional components such as disks [17] and significantly reduces power consumption (idle power draw of 132.46W vs. sleep power draw of 13.16W in a typical server, as shown in Table 1). However, scale-down cannot be done naively, as discussed in Section 3.2.

One technique is to scale down servers by manufacturing idleness, i.e., by migrating workloads and their corresponding state to fewer machines during periods of low activity [5, 9, 10, 25, 30, 34, 36]. This can be relatively easy to accomplish when servers are state-less (i.e., serving data that resides on a shared NAS or SAN storage system). However, servers in a Hadoop cluster are not state-less.

HDFS distributes data chunks and replicas across servers for resiliency, performance, load-balancing, and data-locality reasons. With data distributed across all nodes, any node may be participating in the reading, writing, or computation of a data-block at any time. Such data placement makes it hard to generate significant periods of idleness in the Hadoop clusters and renders the use of inactive power modes infeasible [26].

Recent research on scale-down in GFS- and HDFS-managed clusters [3, 27] proposes maintaining a primary replica of the data on a small covering subset of nodes that are guaranteed to be on. However, these solutions suffer from degraded write-performance, as they rely on a write-offloading technique [31] to avoid server wakeups at the time of writes. Write-performance is an important consideration in Hadoop, and even more so in a production Hadoop cluster, as discussed in Section 3.1.

We took a different approach and proposed GreenHDFS, an energy-conserving, self-adaptive, hybrid, logical multi-zoned variant of HDFS, in our paper [23]. Instead of an energy-efficient placement of computations or using a small covering set for primary replicas as done in earlier research, GreenHDFS focuses on data-classification techniques to extract energy savings by doing energy-aware placement of data.

GreenHDFS trades off cost, performance, and power by separating the cluster into logical zones of servers. Each cluster zone has a different temperature characteristic, where temperature is measured by the power consumption and the performance requirements of the zone. GreenHDFS relies on the inherent heterogeneity in the access patterns of the data stored in HDFS to differentiate the data and to come up with an energy-conserving data layout and data placement onto the zones. Since computations exhibit high data locality in the Hadoop framework, the computations then flow naturally to the data in the right temperature zones.

The contribution of this paper lies in showing that the energy-aware, data-differentiation-based data placement in GreenHDFS is able to meet all the effective scale-down mandates (i.e., it generates significant idleness, results in few power state transitions, and doesn't degrade write performance) despite the significant challenges a Hadoop cluster poses to scale-down. We do a detailed evaluation and sensitivity analysis of the policy thresholds used in GreenHDFS with a trace-driven simulator and real-world HDFS traces from a production Hadoop cluster at Yahoo!. While some aspects of GreenHDFS are sensitive to the policy thresholds, we found that energy-conservation is minimally sensitive to them.

The remainder of the paper is structured as follows. In Section 2, we list some of the key observations from our analysis of the production Hadoop cluster at Yahoo!. In Section 3, we provide background on HDFS and discuss scale-down mandates. In Section 4, we give an overview of the energy management policies of GreenHDFS. In Section 5, we present an analysis of the Yahoo! cluster. In Section 6, we include experimental results demonstrating the effectiveness and robustness of our design and algorithms in a simulation environment. In Section 7, we discuss related work and conclude.

2 Key observations

We did a detailed analysis of the evolution and lifespan of the files in a production Yahoo! Hadoop cluster using one-month-long HDFS traces and namespace metadata checkpoints. We analyzed each top-level directory in the production multi-tenant Yahoo! Hadoop cluster separately, as each top-level directory in the namespace exhibited different access patterns and lifespan distributions. The key observations from the analysis are:

∙ There is significant heterogeneity in the access patterns and the lifespan distributions across the various top-level directories in the production Hadoop cluster, and one-size-fits-all energy-management policies don't suffice across all directories.

∙ A significant amount of data, amounting to 60% of used capacity, is cold (i.e., lying dormant in the system without getting accessed) in the production Hadoop cluster. A majority of this cold data needs to exist for regulatory and historical trend analysis purposes.

∙ We found that 95-98% of the files in the majority of the top-level directories had a very short hotness lifespan of less than 3 days. Only one directory had files with a longer hotness lifespan, and even in that directory 80% of the files were hot for less than 8 days.

∙ We found that 90% of the files, amounting to 80.1% of the total used capacity in the most storage-heavy top-level directory, were dormant and hence cold for more than 18 days. Dormancy periods were much shorter in the rest of the directories, where only 20% of the files were dormant beyond 1 day.

∙ Access to the majority of the data in the production Hadoop cluster follows a news-server-like pattern, whereby most of the computations on the data happen soon after the data's creation.

3 Background

Map-reduce is a programming model designed to simplify data processing [13]. Google, Yahoo!, Facebook, Twitter, etc. use Map-reduce to process massive amounts of data on large-scale commodity clusters. Hadoop is an open-source, cluster-based Map-reduce implementation written in Java [1]. It is logically separated into two subsystems: a highly resilient and scalable Hadoop Distributed File System (HDFS), and a Map-reduce task execution framework. HDFS runs on clusters of commodity hardware and is an object-based distributed file system. The namespace and the metadata (modification and access times, permissions, and quotas) are stored on a dedicated server called the NameNode and are decoupled from the actual data, which is stored on servers called the DataNodes. Each file in HDFS is replicated for resiliency and split into blocks of typically 128MB, and the individual blocks and replicas are placed on the DataNodes for fine-grained load-balancing.
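To make this placement model concrete, the following minimal sketch splits a file into 128MB blocks and assigns each block's replicas to distinct DataNodes. It is an illustration only, not HDFS's actual placement code: the round-robin replica choice and the node numbering are assumptions (real HDFS placement is rack-aware).

import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: chunk a file into 128MB blocks and pick 3 distinct
// DataNodes per block (round-robin; not HDFS's rack-aware policy).
final class BlockPlacementSketch {
    static final long BLOCK_SIZE = 128L * 1024 * 1024;
    static final int REPLICATION = 3;

    static List<List<Integer>> place(long fileBytes, int numDataNodes) {
        List<List<Integer>> blockToNodes = new ArrayList<>();
        long numBlocks = (fileBytes + BLOCK_SIZE - 1) / BLOCK_SIZE;
        for (long b = 0; b < numBlocks; b++) {
            List<Integer> replicas = new ArrayList<>();
            for (int r = 0; r < REPLICATION; r++) {
                replicas.add((int) ((b + r) % numDataNodes));  // distinct nodes for this block
            }
            blockToNodes.add(replicas);
        }
        return blockToNodes;
    }
}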

3.1 Importance of Write-Performance in a Production Hadoop Cluster

The reduce phase of a Map-reduce task writes intermediate computation results back to the Hadoop cluster and relies on high write performance for the overall performance of the Map-reduce task. Furthermore, we observed that the majority of the data in a production Hadoop cluster has a news-server-like access pattern: the predominant share of computations happens on newly created data, mandating good read and write performance for newly created data.

3.2 Scale-down Mandates

Scale-down, in which server components such as the CPU, disks, and DRAM are transitioned to an inactive, low-power mode, is a popular energy-conservation technique. However, scale-down cannot be applied naively. Energy is expended and a transition time penalty is incurred when the components are transitioned back to an active power mode; for example, the transition time of components such as the disks can be as high as 10 seconds. Hence, an effective scale-down technique mandates the following (a simple break-even sketch follows the list):

∙ Sufficient idleness to ensure that the energy savings are higher than the energy spent in the transition.

∙ A small number of power state transitions, as some components (e.g., disks) tolerate only a limited number of start/stop cycles and too-frequent transitions may adversely impact the lifetime of the disks.

∙ No performance degradation. Steps need to be taken to amortize the performance penalty of power state transitions and to ensure that load concentration on the remaining active servers doesn't adversely impact the overall performance of the system.
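To make the first mandate concrete, the minimal sketch below estimates the idle period beyond which sleeping pays off, using the idle and sleep power draws of the Hot server configuration from Table 1. The assumption that the server draws active-level power for the full 10-second disk spin-up is ours, for illustration only.

// Break-even sketch: how long must a server sleep for the wake-up cost to be
// recovered? (Power figures from Table 1; the transition model is an assumption.)
public final class ScaleDownBreakEven {
    public static void main(String[] args) {
        double idleWatts = 132.46;        // idle draw of a Hot-zone server (Table 1)
        double sleepWatts = 13.16;        // sleep draw of the same server (Table 1)
        double transitionSeconds = 10.0;  // disk spin-up dominates (Section 3.2)
        double transitionWatts = 445.34;  // assumed active-level draw while waking up

        double transitionJoules = transitionWatts * transitionSeconds;
        double breakEvenSeconds = transitionJoules / (idleWatts - sleepWatts);
        System.out.printf("Idleness must exceed roughly %.0f seconds to save energy%n",
                breakEvenSeconds);
    }
}

Under these assumptions the break-even point is well under a minute, which is why GreenHDFS targets idleness of several days rather than minutes.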

4 GreenHDFS Design

GreenHDFS is a variant of the Hadoop Distributed File System (HDFS) that logically organizes the servers in the datacenter into multiple dynamically provisioned Hot and Cold zones. Each zone has a distinct performance, cost, and power characteristic, and each zone is managed by the power and data placement policies most conducive to the class of data residing in that zone. Differentiating the zones in terms of power is crucial to attaining our energy-conservation goal.

The Hot zone consists of files that are currently being accessed and of newly created files. This zone has strict SLA (Service Level Agreement) requirements and hence performance is of the greatest importance: we trade off energy savings in the interest of very high performance in this zone. In this paper, GreenHDFS employs data chunking, placement, and replication policies in the Hot zone similar to the policies in baseline HDFS or GFS.

The Cold zone consists of files with low to rare accesses. Files are moved by the File Migration Policy from the Hot zone to the Cold zone as their temperature decreases beyond a certain threshold. Performance and SLA requirements are not as critical for this zone, and GreenHDFS employs aggressive energy-management schemes and policies here to transition servers to a low-power inactive state. Hence, GreenHDFS trades off performance for high energy savings in the Cold zone.

For optimal energy savings, it is important to increase the idle times of the servers and to limit the wakeups of servers that have transitioned to the power-saving mode. Keeping this rationale in mind, and recognizing the low performance needs and the infrequency of data accesses in the Cold zone, this zone does not chunk the data. This ensures that upon a future access only the server containing the data is woken up.

By default, the servers in the Cold zone are in a sleeping mode. A server is woken up when either new data needs to be placed on it or data already residing on the server is accessed. GreenHDFS tries to avoid powering on a server in the Cold zone and maximizes the use of the existing powered-on servers in its server allocation decisions, in the interest of maximizing the energy savings: one server is woken up and filled completely to its capacity before the next server, taken from an ordered list of servers in the Cold zone, is transitioned to an active power state (sketched below).
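A minimal sketch of this allocation rule follows; the ColdServer type and its bookkeeping methods are assumptions for illustration, since the paper does not describe GreenHDFS's internal data structures at this level.

import java.util.List;

// Sketch: choose a Cold-zone placement target, waking a new server only when
// every already-active server is full (ColdServer is an assumed interface).
interface ColdServer {
    boolean isActive();
    long freeBytes();
    void wakeUp();          // e.g., via Wake-on-LAN (Section 4.1.2)
}

final class ColdZoneAllocator {
    ColdServer placeFile(List<ColdServer> orderedColdServers, long fileBytes) {
        for (ColdServer s : orderedColdServers) {
            if (s.isActive() && s.freeBytes() >= fileBytes) {
                return s;   // reuse a powered-on server first
            }
        }
        for (ColdServer s : orderedColdServers) {
            if (!s.isActive()) {
                s.wakeUp(); // fill this server completely before waking another
                return s;
            }
        }
        throw new IllegalStateException("Cold zone has no capacity left");
    }
}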

The goal of GreenHDFS is to maximize the allocation of servers to the Hot zone, to minimize the performance impact of zoning, and to minimize the number of servers allocated to the Cold zone. We introduced a hybrid, storage-heavy cluster model in [23], whereby the servers in the Cold zone are storage-heavy, with 12 1TB disks per server.

We argue that zoning in GreenHDFS will not affect the Hot zone's performance adversely and that the computational workload can be consolidated on the servers in the Hot zone without pushing CPU utilization above the provisioning guidelines. A study of 5000 Google compute servers showed that most of the time is spent within the 10%-50% CPU utilization range [4]; hence, significant opportunities exist for workload consolidation. Moreover, the compute capacity of the Cold zone can always be harnessed under peak-load scenarios.

4.1 Energy-management Policies

Files are moved from the Hot zone to the Cold zone as their temperature changes over time, as shown in Figure 1. In this paper, we use the dormancy of a file, defined as the elapsed time since the last access to the file, as the measure of the file's temperature. The higher the dormancy, the lower the temperature and hence the colder the file; conversely, the lower the dormancy, the hotter the file. GreenHDFS uses the existing mechanism in baseline HDFS to record and update the last access time of a file upon every file read.

4.1.1 File Migration Policy

The File Migration Policy runs in the Hot zone, monitors the dormancy of the files as shown in Algorithm 1, and moves dormant, i.e., cold, files to the Cold zone. The advantages of this policy are two-fold: 1) it leads to higher space-efficiency, as moving rarely accessed files out frees up space in the Hot zone for files with higher SLA requirements, and 2) it allows significant energy-conservation. Data-locality is an important consideration in the Map-reduce framework and computations are co-located with data; thus, computations naturally happen on the data residing in the Hot zone. This results in significant idleness in all the components of the servers in the Cold zone (i.e., CPU, DRAM, and disks), allowing effective scale-down of these servers.

Figure 1. State diagram of a file's zone allocation based on the migration policies: a file moves from the Hot zone to the Cold zone when its coldness exceeds ThresholdFMP, and back from the Cold zone to the Hot zone when its hotness exceeds ThresholdFRP.

Algorithm 1 Description of the File Migration Policy, which classifies and migrates cold data from the Hot zone to the Cold zone
{For every file i in the Hot zone}
for i = 1 to n do
  dormancy_i ⇐ current_time − last_access_time_i
  if dormancy_i ≥ ThresholdFMP then
    {Cold Zone} ⇐ {Cold Zone} ∪ {f_i}
    {Hot Zone} ⇐ {Hot Zone} ∖ {f_i}
    // filesystem metadata structures are changed to Cold Zone
  end if
end for
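A minimal runnable rendering of Algorithm 1 follows; the HotFile and Migrator types are assumptions standing in for GreenHDFS's NameNode-side metadata and migration machinery, which the paper does not expose.

import java.time.Duration;
import java.time.Instant;
import java.util.List;

// Sketch of Algorithm 1: migrate files whose dormancy exceeds ThresholdFMP.
final class FileMigrationPolicy {
    private final Duration thresholdFMP;
    FileMigrationPolicy(Duration thresholdFMP) { this.thresholdFMP = thresholdFMP; }

    void run(List<HotFile> hotZoneFiles, Migrator migrator, Instant now) {
        for (HotFile f : hotZoneFiles) {
            Duration dormancy = Duration.between(f.lastAccessTime(), now);
            if (dormancy.compareTo(thresholdFMP) >= 0) {
                migrator.moveToColdZone(f);   // also updates file-system metadata
            }
        }
    }
}

interface HotFile { Instant lastAccessTime(); }
interface Migrator { void moveToColdZone(HotFile f); }

With the default parameters of Table 2, this policy would run once a day (IntervalFMP) with a ThresholdFMP of 5 to 20 days.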

4.1.2 Server Power Conserver Policy

The Server Power Conserver Policy runs in the Cold zone and determines the servers that can be transitioned into a power-saving standby/sleep mode, as shown in Algorithm 2. The current trend in internet-scale data warehouses and Hadoop clusters is to use commodity servers with 4-6 directly attached disks instead of expensive RAID controllers. In such systems, disks constitute just 10% of the entire power usage, as illustrated in a study performed at Google [21], while the CPU and DRAM constitute 63% of the total power usage. Hence, power management of any one component is not sufficient; we leverage energy cost savings at the entire server granularity (CPU, disks, and DRAM) in the Cold zone.

GreenHDFS uses hardware techniques similar to [28] to transition the processors, disks, and DRAM into a low-power state. In the Cold zone, GreenHDFS uses the disk Sleep mode (see footnote 1); the CPU's ACPI S3 sleep state, which consumes minimal power and requires only 30us to transition from sleep back to active execution; and DRAM's self-refresh operating mode, in which transitions into and out of self-refresh complete in less than a microsecond.

The servers are transitioned back to an active power mode under three conditions: 1) data residing on the server is accessed, 2) additional data needs to be placed on the server, or 3) the block scanner needs to run on the server to ensure the integrity of the data residing on the Cold zone servers. GreenHDFS relies on Wake-on-LAN support in the NICs and sends a magic packet to transition a server back to an active power state.
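The paper does not detail how the magic packet is constructed; the sketch below shows the standard Wake-on-LAN payload (six 0xFF bytes followed by the target MAC address repeated 16 times) sent over UDP broadcast. The broadcast address and port 9 are conventional choices, not values taken from the paper.

import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;

// Sketch: send a standard Wake-on-LAN magic packet to wake a sleeping Cold-zone server.
final class WakeOnLan {
    static void wake(byte[] mac, InetAddress broadcast) throws Exception {
        byte[] payload = new byte[6 + 16 * mac.length];
        for (int i = 0; i < 6; i++) payload[i] = (byte) 0xFF;       // synchronization stream
        for (int i = 0; i < 16; i++)                                // MAC repeated 16 times
            System.arraycopy(mac, 0, payload, 6 + i * mac.length, mac.length);
        try (DatagramSocket socket = new DatagramSocket()) {
            socket.setBroadcast(true);
            socket.send(new DatagramPacket(payload, payload.length, broadcast, 9));
        }
    }
}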

Figure 2. Triggering events leading to power state transitions in the Cold zone: a server transitions from inactive to active on wake-up events (file access, bit-rot integrity checker, file placement, file deletion), and from active to inactive when the Server Power Conserver Policy observes coldness > ThresholdSCP.

Algorithm 2 Server Power Conserver Policy
{For every server i in the Cold zone}
for i = 1 to n do
  coldness_i ⇐ current_time − max_{0≤j≤m} last_access_time_j
  if coldness_i ≥ ThresholdSCP then
    S_i ⇐ INACTIVE_STATE
  end if
end for
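A minimal runnable rendering of Algorithm 2; the ColdServerState interface is an assumption, and a server's coldness is computed as the elapsed time since the most recent access to any file it hosts.

import java.time.Duration;
import java.time.Instant;
import java.util.List;

// Sketch of Algorithm 2: sleep a Cold-zone server once none of its files has
// been accessed for ThresholdSCP.
final class ServerPowerConserverPolicy {
    private final Duration thresholdSCP;
    ServerPowerConserverPolicy(Duration thresholdSCP) { this.thresholdSCP = thresholdSCP; }

    void run(List<ColdServerState> coldServers, Instant now) {
        for (ColdServerState s : coldServers) {
            Instant newestAccess = s.fileLastAccessTimes().stream()
                    .max(Instant::compareTo)
                    .orElse(Instant.EPOCH);          // no files: treat as maximally cold
            Duration coldness = Duration.between(newestAccess, now);
            if (s.isActive() && coldness.compareTo(thresholdSCP) >= 0) {
                s.transitionToSleep();               // CPU S3, disk Sleep mode, DRAM self-refresh
            }
        }
    }
}

interface ColdServerState {
    boolean isActive();
    List<Instant> fileLastAccessTimes();
    void transitionToSleep();
}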

1 In the Sleep mode the drive buffer is disabled, the heads are parked, and the spindle is at rest.


4.1.3 File Reversal Policy

The File Reversal Policy runs in the Cold zone and ensures that the QoS, bandwidth, and response time of files that become popular again after a period of dormancy are not impacted. If the number of accesses to a file residing in the Cold zone rises above the threshold ThresholdFRP, the file is moved back to the Hot zone, as shown in Algorithm 3. The file is then chunked and placed onto the servers in the Hot zone in congruence with the Hot zone's policies.

Algorithm 3 Description of the File Reversal Policy, which monitors the temperature of the cold files in the Cold zone and moves files back to the Hot zone if their temperature changes
{For every file i in the Cold zone}
for i = 1 to n do
  if num_accesses_i ≥ ThresholdFRP then
    {Hot Zone} ⇐ {Hot Zone} ∪ {f_i}
    {Cold Zone} ⇐ {Cold Zone} ∖ {f_i}
    // filesystem metadata are changed to Hot Zone
  end if
end for
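A minimal runnable rendering of Algorithm 3; ColdFile and Reverser are assumed types, and the access counter is presumed to be maintained by the Cold zone's read path.

import java.util.List;

// Sketch of Algorithm 3: reverse files back to the Hot zone once their access
// count reaches ThresholdFRP.
final class FileReversalPolicy {
    private final int thresholdFRP;
    FileReversalPolicy(int thresholdFRP) { this.thresholdFRP = thresholdFRP; }

    void run(List<ColdFile> coldZoneFiles, Reverser reverser) {
        for (ColdFile f : coldZoneFiles) {
            if (f.accessCount() >= thresholdFRP) {
                reverser.moveToHotZone(f);   // re-chunked and placed per Hot-zone policies
            }
        }
    }
}

interface ColdFile { int accessCount(); }
interface Reverser { void moveToHotZone(ColdFile f); }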

4.1.4 Policy Thresholds Discussion

A good data migration scheme should result in maximal energy savings, minimal data oscillations between GreenHDFS zones, and minimal performance degradation. Minimizing the accesses to Cold zone files yields maximal energy savings and minimal performance impact. For this, the policy thresholds should be chosen in a way that minimizes the number of accesses to the files residing in the Cold zone while maximizing the movement of dormant data to the Cold zone. Results from our detailed sensitivity analysis of the thresholds used in GreenHDFS are covered in Section 6.3.5.

ThresholdFMP: A low (i.e., aggressive) value of ThresholdFMP results in an ultra-greedy selection of files as potential candidates for migration to the Cold zone. While an aggressive ThresholdFMP has several advantages, such as higher space-savings in the Hot zone, there are disadvantages as well. If files have intermittent periods of dormancy, they may incorrectly get labeled as cold and be moved to the Cold zone. There is a high probability that such files will get accessed in the near future; such accesses may suffer performance degradation, as they may be subject to the power transition penalty, and may trigger data oscillations because of file reversals back to the Hot zone.

A higher value of ThresholdFMP results in higher accuracy in determining the really cold files. Hence, the number of reversals, server wakeups, and the associated performance degradation decrease as the threshold is increased. On the other hand, a higher value of ThresholdFMP signifies that files will be chosen as candidates for migration only after they have been dormant in the system for a longer period of time. This would be overkill for files with a very short LifespanCLR (hotness lifespan), as such files would unnecessarily lie dormant in the system, occupying precious Hot zone capacity for a longer period of time.

ThresholdSCP: A high ThresholdSCP increases the number of days the servers in the Cold zone remain in an active power state and hence lowers the energy savings. On the other hand, it results in a reduction in power state transitions, which improves the performance of accesses to the Cold zone. Thus, a trade-off needs to be made between energy-conservation and data access performance in selecting the value of ThresholdSCP.

ThresholdFRP: A relatively high value of ThresholdFRP ensures that files are accurately classified as hot-again files before they are moved back to the Hot zone from the Cold zone. This reduces data oscillations in the system and reduces unnecessary file reversals.

5 Analysis of a production Hadoop cluster at Yahoo!

We analyzed one month of HDFS logs (see footnote 2) and namespace checkpoints of a multi-tenant cluster at Yahoo!. The cluster had 2600 servers, hosted 34 million files in the namespace, and its data set size was 6 petabytes. There were 425 million entries in the HDFS logs, and each namespace checkpoint contained 30-40 million files. The cluster namespace was divided into six main top-level directories, each addressing different workloads and access patterns. We considered only the 4 main directories and refer to them as d, p, u, and m in our analysis instead of referring to them by their real names. The total number of unique files seen in the HDFS logs over the one-month duration was 70 million (d: 1.8 million, p: 30 million, u: 23 million, and m: 2 million).

The logs and the metadata checkpoints were huge in size, and we used a large-scale research Hadoop cluster at Yahoo! extensively for our analysis. We wrote the analysis scripts in Pig. We considered several cases in our analysis, as shown below:

∙ Files created before the analysis period and not read or deleted subsequently at all. We classify these files as long-living cold files.

∙ Files created before the analysis period and read during the analysis period.


∙ Files created before the analysis period and both read and deleted during the analysis period.

∙ Files created during the analysis period and neither read nor deleted during the analysis period.

∙ Files created during the analysis period, not read during the analysis period, but deleted.

∙ Files created during the analysis period and both read and deleted during the analysis period.

To accurately account for file lifespan and lifetime, we handled the following cases: (a) filename reuse: we appended a timestamp to each file create to accurately track the audit log entries following the file create entry in the audit log; (b) file renames: we used a unique id per file to accurately track its lifetime across create, rename, and delete; (c) renames and deletes at a higher level in the path hierarchy had to be translated to leaf-level renames and deletes for our analysis; and (d) HDFS logs do not contain file size information, and hence we joined the dataset found in the HDFS logs with the namespace checkpoint to obtain file sizes.

2 The inode data and the list of blocks belonging to each file comprise the metadata of the name system, called the image. The persistent record of the image is called a checkpoint. HDFS has the ability to log all file system access requests, which is required for auditing purposes in enterprises. The audit logging is implemented using log4j and, once enabled, logs every HDFS event in the NameNode's log [37]. We used the above-mentioned checkpoints and HDFS logs for our analysis.

5.1 File Lifespan Analysis of the Yahoo! Hadoop Cluster

A file goes through several stages in its lifetime: 1) file creation, 2) a hot period during which the file is frequently accessed, 3) a dormant period during which the file is not accessed, and 4) deletion. We introduced and considered various lifespan metrics in our analysis to characterize a file's evolution (listed below; a small sketch of how they can be derived per file follows the list). A study of the various lifespan distributions helps in deciding the energy-management policy thresholds that need to be in place in GreenHDFS.

∙ FileLifeSpanCFR is defined as the file lifespan between file creation and the first read access. This metric is used to find the clustering of the read accesses around file creation.

∙ FileLifeSpanCLR is defined as the file lifespan between creation and the last read access. This metric is used to determine the hotness profile of the files.

∙ FileLifeSpanLRD is defined as the file lifespan between the last read access and file deletion. This metric helps in determining the coldness profile of the files, as this is the period for which files are dormant in the system.

∙ FileLifeSpanFLR is defined as the file lifespan between the first read access and the last read access. This metric helps in determining another dimension of the hotness profile of the files.

∙ FileLifetime is defined as the lifetime of the file between its creation and its deletion.
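A minimal sketch of how these metrics can be computed for one file, assuming its create, first-read, last-read, and delete timestamps have already been extracted from the audit log and checkpoint; this is an illustration, not the Pig scripts used in the paper.

import java.time.Duration;
import java.time.Instant;

// Sketch: lifespan metrics for a single file, given its key timestamps.
record FileLifespans(Instant create, Instant firstRead, Instant lastRead, Instant delete) {
    Duration lifeSpanCFR() { return Duration.between(create, firstRead); }   // creation -> first read
    Duration lifeSpanCLR() { return Duration.between(create, lastRead); }    // hotness lifespan
    Duration lifeSpanLRD() { return Duration.between(lastRead, delete); }    // dormancy before deletion
    Duration lifeSpanFLR() { return Duration.between(firstRead, lastRead); } // first read -> last read
    Duration lifetime()    { return Duration.between(create, delete); }      // creation -> deletion
}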

5.1.1 FileLifeSpanCFR

The FileLifeSpanCFR distribution throws light on the clustering of file reads around file creation. As shown in Figure 3, 99% of the files have a FileLifeSpanCFR of less than 2 days.

5.1.2 FileLifeSpanCLR

Figure 4 shows the distribution of FileLifeSpanCLR in the cluster. In directory d, 80% of the files are hot for less than 8 days and 90% of the files, amounting to 94.62% of storage, are hot for less than 24 days. The FileLifeSpanCLR of 95% of the files, amounting to 96.51% of storage, in directory p is less than 3 days, and the FileLifeSpanCLR of 100% of the files in directory m and 98% of the files in directory a is as small as 2 days. In directory u, 98% of the files have a FileLifeSpanCLR of less than 1 day. Thus, the majority of the files in the cluster have a short hotness lifespan.

5.1.3 FileLifeSpanLRD

FileLifeSpanLRD indicates the time for which a file stays in a dormant state in the system. The longer the dormancy period, the higher the coldness of the file and hence the higher the suitability of the file for migration to the Cold zone. Figure 5 shows the distribution of FileLifeSpanLRD in the cluster. In directory d, 90% of the files are dormant beyond 1 day and 80% of the files, amounting to 80.1% of storage, remain in a dormant state past 20 days. In directory p, only 25% of the files are dormant beyond 1 day and only 20% of the files remain dormant in the system beyond 10 days. In directory m, only 0.02% of the files are dormant for more than 1 day, and in directory u, 20% of the files are dormant beyond 10 days. The FileLifeSpanLRD needs to be considered to find the true migration suitability of a file. For example, given the extremely short dormancy period of the files in directory m, there is no point in exercising the File Migration Policy on directory m. For directories p and u, a ThresholdFMP of less than 5 days would result in unnecessary movement of files to the Cold zone, as these files are due for deletion in any case. On the other hand, given the short FileLifeSpanCLR in these directories, a high value of ThresholdFMP won't do justice to space-efficiency in the Hot zone, as discussed in Section 4.1.4.

5.1.4 File Lifetime Analysis

Knowledge of the FileLifetime further assists in the selection of migration candidates and needs to be accounted for in addition to the FileLifeSpanLRD and FileLifeSpanCLR metrics covered earlier.


Figure 3. FileLifeSpanCFR distribution (percentage of total file count and of total used capacity vs. FileLifeSpanCFR in days, per directory). 99% of the files in directory d and 98% of the files in directory p were accessed for the first time less than 2 days after creation.

Figure 4. FileLifeSpanCLR distribution in the four main top-level directories of the Yahoo! production cluster. FileLifeSpanCLR characterizes the lifespan for which files are hot. In directory d, 80% of the files were hot for less than 8 days and 90% of the files, amounting to 94.62% of storage, were hot for less than 24 days. The hotness lifespan of 95% of the files, amounting to 96.51% of storage, in directory p is less than 3 days; 100% of the files in directory m have a similarly short hotness lifespan; and in directory u, 98% of the files are hot for less than 1 day.

Figure 5. FileLifeSpanLRD distribution of the top-level directories in the Yahoo! production cluster. FileLifeSpanLRD characterizes the coldness in the cluster and indicates the time a file stays in a dormant state in the system. 80% of the files, amounting to 80.1% of storage, in directory d have a dormancy period of more than 20 days. 20% of the files, amounting to 28.6% of storage, in directory p are dormant beyond 10 days. 0.02% of the files in directory m are dormant beyond 1 day.

Figure 6. FileLifetime distribution. 67% of the files in directory p are deleted within one day of their creation, and only 23% live beyond 20 days. On the other hand, in directory d 80% of the files have a FileLifetime of more than 30 days.


Figure 7. File size and file count percentage of long-living cold files, defined as files that were created prior to the start of the one-month observation period and were not accessed during the period of observation at all. In directory d, 13% of the total file count in the cluster, amounting to 33% of the total used capacity, is cold. In directory p, 37% of the total file count in the cluster, amounting to 16% of the total used capacity, is cold. Overall, 63.16% of the total file count and 56.23% of the total used capacity is cold in the system.

Figure 8. Dormant-period analysis: file count distributions and histograms (file count and used storage capacity vs. dormancy in days) from one namespace checkpoint. Dormancy of a file is defined as the elapsed time between the last access time recorded in the checkpoint and the day of observation. 34% of the files in directory p and 58% of the files in directory d were not accessed in the last 40 days.


As shown in Figure 6, only 23% of the files in directory p live beyond 20 days. On the other hand, 80% of the files in directory d live for more than 30 days while having a hotness lifespan of less than 8 days. Thus, directory d is a very good candidate for invoking the File Migration Policy.

5.2 Coldness Characterization of the Files

In this section, we show the file count and the storage capacity used by the long-living cold files. The long-living cold files are defined as the files that were created prior to the start of the observation period and were not accessed during the one-month period of observation at all. As shown in Figure 7, 63.16% of the files, amounting to 56.23% of the total used capacity, are cold in the system. Such long-living cold files present a significant opportunity to conserve energy in GreenHDFS.

5.3 Dormancy Characterization of the Files

The HDFS trace analysis gives information only about the files that were accessed during the one-month duration. To get a better picture, we analyzed the namespace checkpoints for historical data on file temperatures and periods of dormancy. The namespace checkpoints contain the last access time of the files, and we used this information to calculate the dormancy of the files. The Dormancy metric is defined as the elapsed time between the last recorded access time of a file and the day of observation. Figure 8 contains the frequency histograms and distributions of dormancy. 34% of the files, amounting to 37% of storage, in directory p present in the namespace checkpoint were not accessed in the last 40 days; 58% of the files, amounting to 53% of storage, in directory d were not accessed in the last 40 days. The extent of dormancy exhibited in the system again shows the viability of the GreenHDFS solution (see footnote 3).

3 The number of files present in the namespace checkpoints was less than half the number of files seen in the one-month trace.

6 Evaluation

In this section, we first present our experimental platform and methodology, followed by a description of the workloads used, and then we give our experimental results. Our goal is to answer seven high-level sets of questions:

∙ How much energy is GreenHDFS able to conserve compared to a baseline HDFS with no energy management?

∙ What is the penalty of the energy management on average response time?


∙ What is the sensitivity of the various policy thresholds used in GreenHDFS on the energy savings results?

∙ How many power state transitions does a server in the Cold zone go through on average?

∙ What is the number of accesses to the files in the Cold zone, the number of days servers are powered on, and the number of migrations and reversals observed in the system?

∙ How many migrations happen daily?

∙ How many power state transitions occur during the simulation run?

The following evaluation sections answer these questions, beginning with a description of our methodology and the trace workloads we use as inputs to the experiments.

6.1 Evaluation methodology

We evaluated GreenHDFS using a trace-driven simulator. The simulator was driven by real-world HDFS traces generated by a production Hadoop cluster at Yahoo!. The cluster had 2600 servers, hosted 34 million files in the namespace, and its data set size was 6 petabytes.

We focused our analysis on directory d, as this directory constituted 60% of the used storage capacity in the cluster (4PB out of the 6PB total used capacity). Focusing only on directory d cut down our simulation time significantly and reduced our analysis time (see footnote 4). We used 60% of the total cluster nodes in our analysis to make the results realistic for a directory-d-only analysis. The total number of unique files seen in the HDFS traces for directory d over the one-month duration was 0.9 million. In our experiments, we compare GreenHDFS to the baseline case (HDFS without energy management). The baseline results give us the upper bound for energy consumption and the lower bound for average response time.

Simulation Platform: We used a trace-driven simulator for GreenHDFS to perform our experiments. The simulator includes models for the power levels, power state transition times, and access times of the disk, processor, and DRAM. The GreenHDFS simulator was implemented in Java with MySQL distribution 5.1.41 and executed using Java 2 SDK, version 1.6.0-17 (see footnote 5). Table 1 lists the various power draws, latencies, transition times, etc. used in the simulator. The simulator was run on 10 nodes in a development cluster at Yahoo!.

4 An important consideration given the massive scale of the traces.

5 Both performance and energy statistics were calculated based on information extracted from the datasheets of a Seagate Barracuda ES.2 1TB SATA hard drive and a quad-core Intel Xeon X5400 processor.


Table 1. Power and power-on penalties used in the simulator

Component                                      | Active Power (W) | Idle Power (W) | Sleep Power (W) | Power-up time
CPU (quad-core Intel Xeon X5400 [22])          | 80-150           | 12.0-20.0      | 3.4             | 30 us
DRAM DIMM [29]                                 | 3.5-5            | 1.8-2.5        | 0.2             | 1 us
NIC [35]                                       | 0.7              | 0.3            | 0.3             | NA
SATA HDD (Seagate Barracuda ES.2 1TB [16])     | 11.16            | 9.29           | 0.99            | 10 sec
PSU [2]                                        | 50-60            | 25-35          | 0.5             | 300 us
Hot server (2 CPU, 8 DRAM DIMM, 4 1TB HDD)     | 445.34           | 132.46         | 13.16           | -
Cold server (2 CPU, 8 DRAM DIMM, 12 1TB HDD)   | 534.62           | 206.78         | 21.08           | -

6.2 Simulator Parameters

The default simulation parameters used in this paper are shown in Table 2.

Table 2. Simulator parameters

Parameter          | Value
NumServer          | 1560
NumZones           | 2
IntervalFMP        | 1 day
ThresholdFMP       | 5, 10, 15, 20 days
IntervalSPC        | 1 day
ThresholdSPC       | 2, 4, 6, 8 days
IntervalFRP        | 1 day
ThresholdFRP       | 1, 5, 10 accesses
NumServersPerZone  | Hot: 1170, Cold: 390

6.3 Simulation results

6.3.1 Energy-Conservation

In this section, we show the energy savings made possible by GreenHDFS, compared to the baseline, in one month, simply by doing power management in one of the main tenant directories of the Hadoop cluster. The cost of electricity was assumed to be $0.063/kWh. Figure 9 (left) shows a 24% reduction in the energy consumption of a 1560-server datacenter with 80% capacity utilization. Extrapolating, $2.1 million can be saved in energy costs if the GreenHDFS technique is applied to all the Hadoop clusters at Yahoo! (upwards of 38,000 servers); a back-of-the-envelope check of this extrapolation is sketched below. Energy savings from powered-off servers will be further compounded in the cooling system of a real datacenter: for every Watt of power consumed by the compute infrastructure, a modern data center expends another one-half to one Watt to power the cooling infrastructure [32]. The energy-saving results underscore the importance of supporting access time recording in Hadoop compute clusters.
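The sketch below checks the order of magnitude of the $2.1 million figure from the reported 24% savings and the $0.063/kWh electricity price; the 400W average per-server draw is our assumption for illustration (between the idle and active draws in Table 1), not a number from the paper.

// Back-of-the-envelope check of the annual-savings extrapolation (illustrative only).
public final class EnergySavingsEstimate {
    public static void main(String[] args) {
        int servers = 38000;             // all Hadoop servers at Yahoo!
        double avgWattsPerServer = 400;  // assumed average draw (between idle and active)
        double pricePerKWh = 0.063;      // electricity price used in the paper
        double savingsFraction = 0.24;   // 24% reduction reported for GreenHDFS

        double annualKWh = servers * (avgWattsPerServer / 1000.0) * 24 * 365;
        double annualCost = annualKWh * pricePerKWh;
        System.out.printf("Baseline: $%.1fM/yr, GreenHDFS savings: $%.1fM/yr%n",
                annualCost / 1e6, annualCost * savingsFraction / 1e6);
    }
}

Under these assumptions the baseline energy bill is roughly $8.4M per year, 24% of which is about $2.0M, consistent with the $2.1 million cited above.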

6.3.2 Storage-Efficiency

In this section, we show the increased storage efficiency of the Hot zone compared to the baseline. Figure 10 shows that in the baseline case the average capacity utilization of the 1560 servers is higher than that of GreenHDFS, which has just 1170 of the 1560 servers provisioned to the Hot zone. GreenHDFS has a much higher amount of free space available in the Hot zone, which tremendously increases the potential for better data placement techniques there. The more aggressive the policy threshold, the more space is available in the Hot zone for truly hot data, as more data is migrated out to the Cold zone.

6.3.3 File Migrations and Reversals

Figure 10 (right) shows the number and total size of the files migrated to the Cold zone daily with a ThresholdFMP value of 10 days. Every day, on average, 6.38TB worth of data and 28.9 thousand files are migrated to the Cold zone. Since we have assumed storage-heavy servers in the Cold zone, where each server has 12 1TB disks, and assuming 80MB/sec of disk bandwidth, 6.38TB of data can be absorbed in less than 2 hours by one server (a short worked estimate follows). The migration policy can be run during off-peak hours to minimize any performance impact.
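As a rough check of that absorption-time claim, under the simplifying assumption that the 12 disks of a Cold-zone server can each be written in parallel at the full 80MB/s:

6.38 TB / (12 disks x 80 MB/s) ≈ 6.38e12 B / 9.6e8 B/s ≈ 6,600 s ≈ 1.8 hours.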

6.3.4 Impact of Power Management on Response Time

We examined the impact of server power management on the response time of a file that was moved to the Cold zone following a period of dormancy and was later accessed again for some reason. Files residing in the Cold zone may suffer performance degradation in two ways: 1) if the file resides on a server that is not currently powered on, the access incurs a server wake-up time penalty, and 2) transfer time degrades because there is no striping in the lower zone. The file is moved back to the Hot zone and chunked again by the File Reversal Policy. Figure 11 shows the impact on the average response time: 97.8% of the total read requests are not impacted by the power management, and an impact is seen only by 2.1% of the reads. With a less aggressive ThresholdFMP (15 or 20 days), the impact on response time would reduce much further.

6.3.5 Sensitivity Analysis

We tried different values of the thresholds for the File Migration Policy and the Server Power Conserver Policy to understand the sensitivity of these thresholds on storage-efficiency, energy-conservation, and the number of power state transitions. The impact of the various thresholds is discussed in Section 4.1.4.


Figure 9. (Left) Energy savings with GreenHDFS and (middle) days servers in the Cold zone were ON, compared to the baseline. Energy cost savings are minimally sensitive to the policy threshold values: GreenHDFS achieves 24% savings in energy costs in one month simply by doing power management in one of the main tenant directories of the Hadoop cluster. (Right) Number of migrations and reversals in GreenHDFS for different values of the ThresholdFMP threshold.

Figure 10. Capacity growth and utilization in the Hot and Cold zones compared to the baseline, and daily migrations. GreenHDFS substantially increases the free space in the Hot zone by migrating cold data to the Cold zone. In the left and middle charts, we only consider the new data that was introduced in the data directory and the old data that was accessed during the one-month period. The right chart shows the number and total size of the files migrated daily to the Cold zone with a ThresholdFMP value of 10 days.

ThresholdFMP: We found that the energy costs are minimally sensitive to the ThresholdFMP value. As shown in Figure 9 (left), the energy cost savings varied minimally when ThresholdFMP was changed to 5, 10, 15, and 20 days.

The performance impact and the number of file reversals are minimally sensitive to the ThresholdFMP value as well. This behavior can be explained by the observation that the majority of the data in the production Hadoop cluster at Yahoo! has a news-server-like access pattern, which implies that once data is deemed cold, there is a low probability of it getting accessed again.

Figure 9 (right) shows the total number of migrations of the files deemed cold by the File Migration Policy and the reversals of the moved files in case they were later accessed by a client during the one-month simulation run. There were more instances of file reversals (40,170, i.e., 4% of the overall file count) with the most aggressive ThresholdFMP of 5 days. With a less aggressive ThresholdFMP of 15 days, the number of reversals in the system went down to 6,548 (i.e., 0.7% of the file count). These experiments were done with a ThresholdFRP value of 1; the number of file reversals is substantially reduced by increasing the ThresholdFRP value, and with a ThresholdFRP value of 10, zero reversals happen in the system.

The storage-efficiency is sensitive to the value of the ThresholdFMP threshold, as shown in Figure 10 (left). An increase in the ThresholdFMP value results in less efficient capacity utilization of the Hot zone. A higher ThresholdFMP signifies that files will be chosen as candidates for migration only after they have been dormant in the system for a longer period of time. This would be overkill for files with a very short FileLifespanCLR, as they will unnecessarily lie dormant in the system, occupying precious Hot zone capacity for a longer period of time.

ThresholdSCP: As Figure 12 (right) illustrates, increasing the ThresholdSCP value minimally increases the number of days the servers in the Cold zone remain ON and hence minimally lowers the energy savings. On the other hand, increasing the ThresholdSCP value results in a reduction in power state transitions, which improves the performance of the accesses to the Cold zone. Thus, a trade-off needs to be made between energy-conservation and data access performance.


Figure 11. Performance analysis: impact on response time due to power management with a ThresholdFMP of 10 days (distribution and histogram of read response times). 97.8% of the total read requests are not impacted by the power management; an impact is seen only by 2.1% of the reads. With a less aggressive ThresholdFMP (15 or 20 days), the impact on response time would reduce much more.

Figure 12. Sensitivity analysis: sensitivity of the number of servers used in the Cold zone, the number of power state transitions, and the capacity per zone to the File Migration Policy's age threshold and the Server Power Conserver Policy's access threshold.


Summary of the Sensitivity Analysis: From the above evaluation, it is clear that a trade-off needs to be made in choosing the right thresholds in GreenHDFS based on an enterprise's needs. If Hot zone space is at a premium, a more aggressive ThresholdFMP needs to be used; this can be done without impacting the energy-conservation that can be derived in GreenHDFS.

6.3.6 Number of Server Power Transitions

Figure 13 (left) shows the number of power transitions incurred by the servers in the Cold zone. Frequently starting and stopping disks is suspected to affect disk longevity, and the number of start/stop cycles a disk can tolerate during its service lifetime is still limited; making the power transitions infrequently reduces the risk of running into this limit. The maximum number of power state transitions incurred by a server in the one-month simulation run is just 11, and only 1 server out of the 390 servers provisioned in the Cold zone exhibited this behavior. Most disks are designed for a maximum service lifetime of 5 years and can tolerate 500,000 start/stop cycles. Given the very small number of transitions incurred by a server in the Cold zone in a year, GreenHDFS runs no risk of exceeding the start/stop cycles during the service lifetime of the disks.

7 Related Work

Management of the energy, peak power, and temperature of data centers and warehouses is becoming the target of an increasing number of research studies. However, to the best of our knowledge, none of the existing systems exploits data-classification-driven data placement to derive energy-efficiency, nor do they have a file-system-managed, multi-zoned, hybrid data center layout. Most of the prior work focuses on workload placement to manage the thermal distribution within a data center. [30, 34] considered the placement of computational workload for energy-efficiency. Chase et al. [8] do energy-conscious provisioning, which configures switches to concentrate request load on a minimal active set of servers for the current aggregate load level.



[Figure 13 plot: number of power state transitions (0–12) per server in the Cold Zone.]

Figure 13. Cold Zone Behavior: Number of Times Servers Transitioned Power State with 𝑇ℎ𝑟𝑒𝑠ℎ𝑜𝑙𝑑𝐹𝑀𝑃 of 10 Days. We only show those servers in the Cold Zone that either received newly cold data or had data accesses targeted to them in the one-month simulation run.


Le et al. [25] focus on a multi-datacenter internet service. They exploit the inherent heterogeneity among the datacenters in electricity pricing, time-zone differences, and collocation with renewable energy sources to reduce energy consumption without impacting the SLA requirements of the applications. Bash et al. [5] allocate heavy, long-running computational workloads onto servers that are in more thermally-efficient places. Chun et al. [12] propose a hybrid datacenter comprising low-power Atom processors and high-power, high-performance Xeon processors. However, they do not specify any zoning in the system and focus more on task migration rather than data migration. Narayanan et al. [31] use a technique that offloads the write workload of one volume to other storage elsewhere in the data center. Meisner et al. [28] reduce power costs by transitioning servers to a “PowerNap” state whenever there is a period of low utilization.

In addition, there is research on hardware-level techniques such as dynamic voltage scaling as a mechanism to reduce peak power consumption in datacenters [7, 14], and Raghavendra et al. [33] coordinate hardware-level power capping with virtual machine dispatching mechanisms. Managing temperature is the subject of the systems proposed in [20].

Recent research on increasing energy-efficiency in GFS- and HDFS-managed clusters [3, 27] proposes maintaining a primary replica of the data on a small covering subset of nodes that are guaranteed to be on and that represent the lowest power setting. The remaining replicas are stored in a larger set of secondary nodes, and performance is scaled up by increasing the number of secondary nodes. However, these solutions suffer from degraded write performance and increased DFS code complexity. They also do not do any data differentiation and treat all the data in the system alike.

Existing highly scalable file systems such as the Google File System [19] and HDFS [37] do not do energy management. Recently, an energy-efficient log-structured file system was proposed by Ganesh et al. [18]. However, it aims to concentrate load on one disk at a time and, hence, this design will impact availability and performance.

8 Conclusion and Future Work

We presented a detailed evaluation and sensitivity analysis of GreenHDFS, a policy-driven, self-adaptive variant of the Hadoop Distributed File System. GreenHDFS relies on data-classification-driven data placement to realize guaranteed, substantially long periods of idleness in a significant subset of servers in the datacenter. Detailed experimental results with real-world traces from a production Yahoo! Hadoop cluster show that GreenHDFS is capable of achieving 24% savings in the energy costs of a Hadoop cluster by doing power management in only one of the main tenant top-level directories in the cluster. These savings will be further compounded by savings in cooling costs. Detailed lifespan analysis of the files in a large-scale production Hadoop cluster at Yahoo! points at the viability of GreenHDFS. Evaluation results show that GreenHDFS is able to meet all the scale-down mandates (i.e., it generates significant idleness in the cluster, results in very few power state transitions, and doesn’t degrade write performance) in spite of the unique scale-down challenges present in a Hadoop cluster.

9 Acknowledgement

This work was supported by NSF grant CNS 05-51665 and an internship at Yahoo!. The views and conclusions contained in this paper are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of NSF or the U.S. government.

References

[1] http://hadoop.apache.org/.



[2] Introduction to power supplies. National Semiconductor, 2002.

[3] H. Amur, J. Cipar, V. Gupta, G. R. Ganger, M. A. Kozuch, and K. Schwan. Robust and flexible power-proportional storage. In SoCC ’10: Proceedings of the 1st ACM Symposium on Cloud Computing, pages 217–228, New York, NY, USA, 2010. ACM.

[4] L. A. Barroso and U. Holzle. The case for energy-proportional computing. Computer, 40(12), 2007.

[5] C. Bash and G. Forman. Cool job allocation: Measuring the power savings of placing jobs at cooling-efficient locations in the data center. In ATC ’07: 2007 USENIX Annual Technical Conference, pages 1–6, Berkeley, CA, USA, 2007. USENIX Association.

[6] C. Belady. In the data center, power and cooling costs more than the IT equipment it supports. Electronics Cooling, February 2010.

[7] D. Brooks and M. Martonosi. Dynamic thermal management for high-performance microprocessors. In HPCA, pages 171–, 2001.

[8] J. S. Chase and R. P. Doyle. Balance of power: Energy management for server clusters. In Proceedings of the 8th Workshop on Hot Topics in Operating Systems (HotOS), 2001.

[9] G. Chen, W. He, J. Liu, S. Nath, L. Rigas, L. Xiao, and F. Zhao. Energy-aware server provisioning and load dispatching for connection-intensive internet services. In NSDI ’08: Proceedings of the 5th USENIX Symposium on Networked Systems Design and Implementation, Berkeley, CA, USA, 2008. USENIX Association.

[10] Y. Chen, A. Das, W. Qin, A. Sivasubramaniam, Q. Wang, and N. Gautam. Managing server energy and operational costs in hosting centers. SIGMETRICS Perform. Eval. Rev., 33(1), 2005.

[11] Y. Chen, A. Ganapathi, A. Fox, R. H. Katz, and D. A. Patterson. Statistical workloads for energy efficient MapReduce. Technical report, UC Berkeley, 2010.

[12] B.-G. Chun, G. Iannaccone, G. Iannaccone, R. Katz, G. Lee, and L. Niccolini. An energy case for hybrid datacenters. In HotPower, 2009.

[13] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI ’04: Proceedings of the 6th Conference on Symposium on Operating Systems Design and Implementation. USENIX Association, 2004.

[14] M. E. Femal and V. W. Freeh. Boosting data center performance through non-uniform power allocation. In ICAC ’05: Proceedings of the Second International Conference on Automatic Computing, Washington, DC, USA, 2005. IEEE Computer Society.

[15] E. Baldeschwieler, Yahoo! Inc. http://developer.yahoo.com/events/hadoopsummit2010.

[16] Seagate Barracuda ES.2. http://www.seagate.com/staticfiles/support/disc/manuals/nl35 series &bc es series/barracuda es.2 series/100468393e.pdf, 2008.

[17] X. Fan, W.-D. Weber, and L. A. Barroso. Power provisioning for a warehouse-sized computer. In ISCA ’07: Proceedings of the 34th Annual International Symposium on Computer Architecture, pages 13–23, New York, NY, USA, 2007. ACM.

[18] L. Ganesh, H. Weatherspoon, M. Balakrishnan, and K. Birman. Optimizing power consumption in large scale storage systems. In HotOS ’07: Proceedings of the 11th USENIX Workshop on Hot Topics in Operating Systems, Berkeley, CA, USA, 2007. USENIX Association.

[19] S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google file system. SIGOPS Oper. Syst. Rev., 37(5):29–43, 2003.

[20] T. Heath, A. P. Centeno, P. George, L. Ramos, Y. Jaluria, and R. Bianchini. Mercury and Freon: Temperature emulation and management for server systems. In ASPLOS, pages 106–116, 2006.

[21] U. Hoelzle and L. A. Barroso. The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines. Morgan and Claypool Publishers, May 2009.

[22] Intel. Quad-Core Intel Xeon Processor 5400 Series. 2008.

[23] R. T. Kaushik and M. Bhandarkar. GreenHDFS: Towards an energy-conserving, storage-efficient, hybrid Hadoop compute cluster. In HotPower, 2010.

[24] K. Shvachko, H. Kuang, S. Radia, and R. Chansler. The Hadoop Distributed File System. In Symposium on Massive Storage Systems and Technologies (MSST), 2010.

[25] K. Le, R. Bianchini, M. Martonosi, and T. Nguyen. Cost- and energy-aware load distribution across data centers. In HotPower, 2009.

[26] J. Leverich and C. Kozyrakis. On the energy (in)efficiency of Hadoop clusters. In HotPower, 2009.

[27] J. Leverich and C. Kozyrakis. On the energy (in)efficiency of Hadoop clusters. SIGOPS Oper. Syst. Rev., 44(1):61–65, 2010.

[28] D. Meisner, B. T. Gold, and T. F. Wenisch. PowerNap: Eliminating server idle power. In ASPLOS ’09: Proceedings of the 14th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 205–216, New York, NY, USA, 2009. ACM.

[29] Micron. DDR2 SDRAM SODIMM. 2004.

[30] J. Moore, J. Chase, P. Ranganathan, and R. Sharma. Making scheduling “cool”: Temperature-aware workload placement in data centers. In ATEC ’05: Proceedings of the Annual Conference on USENIX Annual Technical Conference, pages 5–5, Berkeley, CA, USA, 2005. USENIX Association.

[31] D. Narayanan, A. Donnelly, and A. Rowstron. Write off-loading: Practical power management for enterprise storage. Trans. Storage, 4(3):1–23, 2008.

[32] C. Patel, E. Bash, R. Sharma, and M. Beitelmal. Smart cooling of data centers. In Proceedings of the Pacific Rim/ASME International Electronics Packaging Technical Conference and Exhibition (IPACK03), 2003.

[33] R. Raghavendra, P. Ranganathan, V. Talwar, Z. Wang, and X. Zhu. No “power” struggles: Coordinated multi-level power management for the data center. In ASPLOS XIII, pages 48–59, New York, NY, USA, 2008. ACM.

[34] R. K. Sharma, C. E. Bash, C. D. Patel, R. J. Friedrich, and J. S. Chase. Balance of power: Dynamic thermal management for internet data centers. IEEE Internet Computing, 9:42–49, 2005.

[35] SMSC. LAN9420/LAN9420i single-chip Ethernet controller with HP Auto-MDIX support and PCI interface. 2008.

[36] N. Tolia, Z. Wang, M. Marwah, C. Bash, P. Ranganathan, and X. Zhu. Delivering energy proportionality with non energy-proportional systems - optimizing the ensemble. In HotPower, 2008.

[37] T. White. Hadoop: The Definitive Guide. O’Reilly Media, May 2009.
